# 2.1 Loading a Sample Dataset

**Problem**
You want to load a preexisting sample dataset.

**Solution**
scikit-learn comes with a number of popular datasets for you to use:

In [None]:
# Load scikit-learn's datasets
from sklearn import datasets

# Load digits dataset
digits = datasets.load_digits()

# Create features matrix
features = digits.data

In [None]:
# Create target vector
target = digits.target

In [None]:
# View first observation
features[0]

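Some of the sample datasets that scikit-learn provides are:

**load_boston**
Contains 506 observations on Boston housing prices. It is a good dataset for
exploring regression algorithms. Note that load_boston was removed in
scikit-learn 1.2, so it is only available in earlier versions.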

**load_iris**
Contains 150 observations on the measurements of Iris flowers. It is a good dataset
for exploring classification algorithms.

**load_digits**
Contains 1,797 observations from images of handwritten digits. It is a good dataset
for teaching image classification.
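
As a quick check of the dataset sizes described above, the following minimal sketch loads the Iris and digits datasets and prints the shape of each features matrix. The variable names (`iris_shape`, `digits_shape`) are illustrative and not part of the original recipe.

```python
# Load scikit-learn's datasets
from sklearn import datasets

# Load two of the bundled sample datasets
iris = datasets.load_iris()
digits = datasets.load_digits()

# Each loader returns an object with .data (features) and .target (labels)
iris_shape = iris.data.shape
digits_shape = digits.data.shape

# Rows are observations, columns are features
print('Iris:', iris_shape)
print('Digits:', digits_shape)
```

The first number in each shape is the observation count, so it should match the counts listed for each dataset.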

# 2.2 Creating a Simulated Dataset

**Problem**
You need to generate a dataset of simulated data.

**Solution**

When we want a dataset designed to be used with linear regression, make_regression
is a good choice:

In [None]:
# Load library
from sklearn.datasets import make_regression

# Generate features matrix, target vector, and the true coefficients
features, target, coefficients = make_regression(n_samples = 100,
                                                 n_features = 3,
                                                 n_informative = 3,
                                                 n_targets = 1,
                                                 noise = 0.0,
                                                 coef = True,
                                                 random_state = 1)

# View feature matrix and target vector
print('Feature Matrix\n', features[:3])
print('Target Vector\n', target[:3])
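
Because `noise` is set to 0.0 and the `bias` parameter keeps its default of 0.0, the target should be an exact linear combination of the features and the returned coefficients. The sketch below checks that under those assumptions; the `reg_`-prefixed names are illustrative, chosen so they do not overwrite the variables in the recipe.

```python
# Load libraries
import numpy as np
from sklearn.datasets import make_regression

# Generate the same noise-free dataset as in the recipe
reg_features, reg_target, reg_coefficients = make_regression(
    n_samples=100,
    n_features=3,
    n_informative=3,
    n_targets=1,
    noise=0.0,
    coef=True,
    random_state=1,
)

# With no noise and zero bias, target = features @ coefficients
reconstructed = reg_features @ reg_coefficients
is_exact = np.allclose(reconstructed, reg_target)

print('True coefficients:', reg_coefficients)
print('Target is an exact linear combination:', is_exact)
```

Raising `noise` above zero adds Gaussian noise to the target, so the same check would then be expected to fail.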

If we are interested in creating a simulated dataset for classification, we can use
make_classification:

In [None]:
# Load library
from sklearn.datasets import make_classification

# Generate features matrix and target vector
features, target = make_classification(n_samples = 100,
                                       n_features = 3,
                                       n_informative = 3,
                                       n_redundant = 0,
                                       n_classes = 2,
                                       weights = [.25, .75],
                                       random_state = 1)

# View feature matrix and target vector
print('Feature Matrix\n', features[:3])
print('Target Vector\n', target[:3])
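
To see the effect of `weights`, one option is to count how many observations fall into each class. This is a minimal sketch with illustrative `clf_`-prefixed names. Note that `make_classification` randomly reassigns a small fraction of labels by default (its `flip_y` parameter), so the counts may differ slightly from an exact 25/75 split.

```python
# Load libraries
import numpy as np
from sklearn.datasets import make_classification

# Generate the same imbalanced dataset as in the recipe
clf_features, clf_target = make_classification(
    n_samples=100,
    n_features=3,
    n_informative=3,
    n_redundant=0,
    n_classes=2,
    weights=[.25, .75],
    random_state=1,
)

# Count observations per class (index 0 is class 0, index 1 is class 1)
class_counts = np.bincount(clf_target)
print('Observations per class:', class_counts)
```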

Finally, if we want a dataset designed to work well with clustering techniques,
scikit-learn offers make_blobs:

In [None]:
# Load library
from sklearn.datasets import make_blobs

# Generate feature matrix and target vector
features, target = make_blobs(n_samples = 100,
                              n_features = 2,
                              centers = 3,
                              cluster_std = 0.5,
                              shuffle = True,
                              random_state = 1)

# View feature matrix and target vector
print('Feature Matrix\n', features[:3])
print('Target Vector\n', target[:3])

In [None]:
# Load library
import matplotlib.pyplot as plt

# View scatterplot
plt.scatter(features[:,0], features[:,1], c=target)
plt.show()
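
Besides plotting, the clusters can be summarized numerically. The sketch below regenerates the same blobs and prints the size, mean, and standard deviation of each cluster. The `blob_`-prefixed names are illustrative. With `cluster_std = 0.5`, the per-cluster standard deviations should come out close to 0.5.

```python
# Load libraries
import numpy as np
from sklearn.datasets import make_blobs

# Generate the same three clusters as in the recipe
blob_features, blob_target = make_blobs(
    n_samples=100,
    n_features=2,
    centers=3,
    cluster_std=0.5,
    shuffle=True,
    random_state=1,
)

# The target vector holds the cluster membership of each observation
cluster_labels = np.unique(blob_target)

# Summarize each cluster: size, empirical center, and spread
for label in cluster_labels:
    members = blob_features[blob_target == label]
    print('Cluster', label,
          '| size:', members.shape[0],
          '| mean:', members.mean(axis=0),
          '| std:', members.std(axis=0))
```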

# 2.3 Loading a CSV File

**Problem**
You need to import a comma-separated values (CSV) file.

**Solution**
Use the pandas library’s read_csv to load a local or hosted CSV file:

In [3]:
# Load library
import pandas as pd

# Create URL
url = 'https://raw.githubusercontent.com/haniramadhan-kkp/mlcourse/main/titanic.csv'

# Load dataset
dataframe = pd.read_csv(url)

# View first two rows
dataframe.head(2)

| Unnamed: 0 | Name | PClass | Age | Sex | Survived | SexCode |
|---|---|---|---|---|---|---|
| 0 | Allen, Miss Elisabeth Walton | 1st | 29.0 | female | 1 | 1 |
| 1 | Allison, Miss Helen Loraine | 1st | 2.0 | female | 0 | 1 |
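
The `Unnamed: 0` column in the output above is what pandas produces when the first column of a CSV file has an empty header, which usually means a row index was saved along with the data. The sketch below reproduces this with an in-memory copy of the two rows shown (so it runs without network access) and uses read_csv's `index_col` parameter to treat that column as the index. The variable names and the assumption that the hosted file has an empty first header are illustrative.

```python
# Load libraries
import io
import pandas as pd

# In-memory CSV with an empty first header, mirroring the two rows shown above
csv_text = (
    ',Name,PClass,Age,Sex,Survived,SexCode\n'
    '0,"Allen, Miss Elisabeth Walton",1st,29.0,female,1,1\n'
    '1,"Allison, Miss Helen Loraine",1st,2.0,female,0,1\n'
)

# Default behavior: the unnamed first column becomes 'Unnamed: 0'
dataframe_default = pd.read_csv(io.StringIO(csv_text))

# Use the first column as the row index instead
dataframe_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(dataframe_default.columns.tolist())
print(dataframe_indexed.columns.tolist())
```

The same `index_col=0` argument can be passed when reading from a URL or a local path.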
