# Data Programming in Python | BAIS:6040
# Module 9. Machine Learning with Scikit-Learn

Written by Kang-Pyo Lee 

Topics to be covered:
- Supervised learning - classification and regression (+ exercises)
- Unsupervised learning - clustering (+ exercises)

In [None]:
# ! pip install --user --upgrade scikit-learn

## Data Loading and Preparation

### Loading Data into a Pandas Dataframe

In [None]:
from seaborn import load_dataset

df = load_dataset("titanic")
df

### Selecting Columns of Interest

In [None]:
df = df[["survived", "pclass", "sex", "age", "sibsp", "parch", "fare"]]
df

We want to filter out unnecessary or duplicate columns. 

### Handling Missing Data

Most machine learning libraries will not accept null values as input. Every null value in a data set must be removed or replaced with a valid value. 

In [None]:
df.info()

pandas.DataFrame.info: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html

The `age` column has 714 non-null values, which means the other 177 values are null. 

In [None]:
df[df.isnull().any(axis=1)]

In [None]:
df = df.dropna()
df

We want to drop all rows with any missing values. Be aware that we have lost those 177 rows using this approach. 
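Dropping rows is one option; replacing missing values is the other one mentioned above. The sketch below compares the two on a small made-up dataframe (`toy`, `dropped`, and `filled` are illustrative names, and filling with the median is just one possible choice, not the approach used in the rest of this notebook).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing age, standing in for the Titanic dataframe
toy = pd.DataFrame({"age": [22.0, np.nan, 35.0, 58.0],
                    "fare": [7.25, 8.05, 53.10, 26.55]})

dropped = toy.dropna()                               # removes the row with the missing age
filled = toy.fillna({"age": toy["age"].median()})    # keeps the row and fills age with the median

print(len(dropped), len(filled))
print(filled)
```

Filling keeps all rows at the cost of introducing values that were never observed, so which option is better depends on the data and the task.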

In [None]:
df.info()

## Supervised Learning - Binary Classification

### Setting the Goal

Using the Titanic data set, we aim to build a classification model that is able to predict whether an imaginary passenger with a certain class, sex, age, company, and fare would have survived the Titanic accident or not. This is a binary classification problem. 

For example, suppose a 25-year-old man purchased a third-class ticket for £7 and was on board by himself. Would he more likely have died or survived?

In [None]:
df.survived.value_counts()

For binary classification, we often refer to a *positive* class (here, class 1) and a *negative* class (here, class 0). The positive class is the one we are looking for, which is usually the minority class. 

### Defining the Features and the Target

In [None]:
features = ["pclass", "sex", "age", "sibsp", "parch", "fare"]
target = "survived"

According to the goal description above, we predict `survived` using `pclass`, `sex`, `age`, `sibsp`, `parch`, and `fare`. 

In [None]:
X = df[features].copy()    # copy so that later column edits on X do not trigger SettingWithCopyWarning
y = df[target]

For a supervised learning task, you need a features set `X` and a target set `y`.

In [None]:
X

`X` is a Pandas dataframe.

In [None]:
y

`y` is a Pandas series.

In `scikit-learn`, following conventions from mathematics, data is usually denoted with an uppercase `X`, while labels are denoted by a lowercase `y`. We use an uppercase `X` because the data is a 2-dimensional array (a matrix) and a lowercase `y` because the target is a 1-dimensional array (a vector). 

### Converting Categorical Columns into Numerical Columns

As most machine learning packages will only accept numbers as input, every categorical column in a dataset must be replaced with a numerical column. 

In [None]:
X

In [None]:
X.info()

In [None]:
X.sex.value_counts()

In [None]:
X.sex = X.sex.apply(lambda x: 1 if x == "male" else 0)

pandas.Series.apply: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.apply.html

We want to convert *male* to 1 and *female* to 0. 
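An alternative way to write the same conversion is an explicit mapping with `Series.map`. The sketch below uses a made-up column (`toy` and `sex_code` are illustrative names); with a mapping, any unexpected value becomes NaN instead of silently turning into 0, which can make data problems easier to spot.

```python
import pandas as pd

# Hypothetical toy column standing in for X.sex
toy = pd.DataFrame({"sex": ["male", "female", "female", "male"]})

# Explicit mapping: male -> 1, female -> 0
toy["sex_code"] = toy["sex"].map({"male": 1, "female": 0})
print(toy)
```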

In [None]:
X

In machine learning, the individual items are called *samples* and their properties are called *features*. In this case, we have 714 samples and 6 features. 

### Splitting Data into Training and Test data

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.25, random_state=0)

sklearn.model_selection.train_test_split: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

To assess model performance, we need to randomly split the features set `X` and the target set `y` into two training sets, `X_train` and `y_train`, and two test sets, `X_test` and `y_test`. Here, `X_train` and `y_train` will be used for training a model, while `X_test` and `y_test` will be used for testing the model. 

Setting the `test_size` parameter to 0.25 means splitting the data into 25% of test data and 75% of training data. 

The `random_state` parameter controls the shuffling applied to the data before applying the split. You can pass an int for reproducible (deterministic) output across multiple function calls. 
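To see the effect of `test_size` and `random_state` in isolation, here is a minimal sketch on made-up arrays (`X_toy`, `y_toy`, `split_a`, and `split_b` are illustrative names). Calling the function twice with the same `random_state` should give the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_toy = np.arange(40).reshape(20, 2)      # 20 hypothetical samples with 2 features each
y_toy = np.arange(20) % 2                 # alternating 0/1 labels

split_a = train_test_split(X_toy, y_toy, test_size=0.25, random_state=0)
split_b = train_test_split(X_toy, y_toy, test_size=0.25, random_state=0)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)             # 15 training rows and 5 test rows

# Same random_state, same shuffling: the two splits are identical
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)
```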

In [None]:
X_train

In [None]:
y_train

In [None]:
X_test

In [None]:
y_test

In [None]:
X_train.shape, y_train.shape, X_test.shape, y_test.shape

### Choosing a Classification Algorithm to Use

Let's start with the k-Nearest Neighbors (k-NNs) algorithm as our first classification algorithm to try. 

### Initializing an Estimator by Setting Hyper-Parameters

In [None]:
from sklearn.neighbors import KNeighborsClassifier

knc = KNeighborsClassifier(n_neighbors=1)
knc

class sklearn.neighbors.KNeighborsClassifier(`n_neighbors`=5, \*, `weights`='uniform', `algorithm`='auto', `leaf_size`=30, `p`=2, `metric`='minkowski', `metric_params`=None, `n_jobs`=None, \*\*kwargs)

sklearn.neighbors.KNeighborsClassifier: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

The number of neighbors `n_neighbors` is set to 1, which means it will consider only the closest neighbor. 

The initialization returns an object called an *estimator*. The `knc` object will be used as the estimator for our k-NNs model.
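As a hedged illustration of what a single neighbor means, the sketch below fits a 1-nearest-neighbor classifier on made-up one-dimensional data (`X_toy`, `y_toy`, and `knn1` are illustrative names, not part of the Titanic workflow). Each query point simply takes the label of its closest training point.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: three points near 0 (class 0) and three points near 10 (class 1)
X_toy = np.array([[0.0], [1.0], [2.0], [9.0], [10.0], [11.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

knn1 = KNeighborsClassifier(n_neighbors=1).fit(X_toy, y_toy)

new_predictions = knn1.predict([[3.0], [8.0]])   # 3.0 is closest to 2.0, 8.0 is closest to 9.0
train_accuracy = knn1.score(X_toy, y_toy)        # each training point is its own nearest neighbor

print(new_predictions, train_accuracy)
```

With distinct training points, a 1-nearest-neighbor model reproduces its training labels exactly, so a perfect training accuracy by itself says little about how well the model generalizes.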

### Fitting the Model to the Training Data

In [None]:
knc.fit(X_train, y_train)

Fitting, or training, is done. 

### Evaluating the Performance of the Model

Accuracy is the number of correct predictions (TP + TN) divided by the number of all samples. 
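The definition can be checked by hand. The sketch below uses made-up labels (`y_true` and `y_pred` are hypothetical, not Titanic results) and compares the manual formula with `accuracy_score` from `sklearn.metrics`.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = [1, 0, 0, 1, 0, 1, 0, 0]   # hypothetical true labels
y_pred = [1, 0, 1, 1, 0, 0, 0, 0]   # hypothetical predicted labels

# For binary labels 0/1, ravel() returns the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
manual_accuracy = (tp + tn) / (tp + tn + fp + fn)

print(tn, fp, fn, tp)
print(manual_accuracy, accuracy_score(y_true, y_pred))
```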

In [None]:
knc.score(X_train, y_train)                   # Get the training set accuracy of the model 

In [None]:
knc.score(X_test, y_test)                     # Get the test set accuracy of the model 

### Making Predictions on New Data (Deploying Model)

Once we have chosen a model to deploy, we can use it to make predictions on new, unseen data for which we might not know the correct labels. Suppose we have three imaginary passengers `person1`, `person2`, and `person3`. 

In [None]:
person1 = {"pclass": 3,    # a man at age 25 and of the third class who was on board alone and paid £7. 
           "sex": 1,
           "age": 25,
           "sibsp": 0,
           "parch": 0,
           "fare": 7}

person2 = {"pclass": 1,     # a little girl at age 8 and of the first class who was on board with her parents and paid £40. 
           "sex": 0,
           "age": 8,
           "sibsp": 1,
           "parch": 2,
           "fare": 40}

person3 = {"pclass": 2,     # a woman at age 20 and of the second class who was on board alone and paid £15.
           "sex": 0,
           "age": 20,
           "sibsp": 0,
           "parch": 0,
           "fare": 15}

In [None]:
X_new = []

for person in [person1, person2, person3]:
    new_person = [person["pclass"], person["sex"], person["age"], person["sibsp"], person["parch"], person["fare"]]
    X_new.append(new_person)
    
X_new

`X_new` contains new data items.

In [None]:
import pandas as pd

X_new = pd.DataFrame(data=X_new, columns=features)
X_new

It would be more readable if we transform the raw `X_new` list into a Pandas dataframe with column labels. 
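A possible shortcut, sketched below, is to build the dataframe directly from the dictionaries and then select the columns in the order of `features` (`people` and `X_new_alt` are illustrative names). Selecting by `features` keeps the column order aligned with the training data.

```python
import pandas as pd

features = ["pclass", "sex", "age", "sibsp", "parch", "fare"]

# The same three imaginary passengers as above
people = [
    {"pclass": 3, "sex": 1, "age": 25, "sibsp": 0, "parch": 0, "fare": 7},
    {"pclass": 1, "sex": 0, "age": 8, "sibsp": 1, "parch": 2, "fare": 40},
    {"pclass": 2, "sex": 0, "age": 20, "sibsp": 0, "parch": 0, "fare": 15},
]

# One row per dictionary; [features] fixes the column order
X_new_alt = pd.DataFrame(people)[features]
print(X_new_alt)
```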

In [None]:
knc.predict(X_new)

The k-NNs model predicts that persons 1 and 3 would have died, whereas person 2 would have survived.

In [None]:
summary = dict()

summary["k-NNs"] = round(knc.score(X_test, y_test), 3)
summary

We want to save the performance score of each algorithm in a dictionary, so that we can compare all the scores at the end. 

<hr>

### Trying Different Parameter Values

As for the number of closest neighbors to consider, we have tried 1. Now let's try 3, hoping that looking at three neighbors works better than looking at just one. 

In [None]:
knc = KNeighborsClassifier(n_neighbors=3)
knc

You need to re-initialize the estimator whenever you change any hyper-parameter. 

In [None]:
knc.fit(X_train, y_train)

In [None]:
knc.score(X_train, y_train), knc.score(X_test, y_test)

Increasing `n_neighbors` does not seem to help with the performance, so we stay with the previous model. 
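Instead of changing the value by hand each time, several settings can be compared in a loop. The sketch below does this on synthetic stand-in data from `make_classification` (an assumption made so that the block runs on its own); in this notebook you would use `X_train`, `X_test`, `y_train`, and `y_test` instead. The names `X_syn`, `y_syn`, and `scores` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption), not the Titanic data
X_syn, y_syn = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.25, random_state=0)

scores = {}
for k in [1, 3, 5, 7, 9]:
    model = KNeighborsClassifier(n_neighbors=k)   # a fresh estimator for every setting
    model.fit(X_tr, y_tr)
    scores[k] = (round(model.score(X_tr, y_tr), 3), round(model.score(X_te, y_te), 3))

for k, (train_acc, test_acc) in scores.items():
    print(k, train_acc, test_acc)
```

Comparing the training and test accuracy side by side helps to see whether a setting merely memorizes the training data.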

In [None]:
knc.predict(X_new)

### Trying Different Classification Algorithms

Let's try the Logistic Regression algorithm this time. 

In [None]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
lr

class sklearn.linear_model.LogisticRegression(`penalty`='l2', \*, `dual`=False, `tol`=0.0001, `C`=1.0, `fit_intercept`=True, `intercept_scaling`=1, `class_weight`=None, `random_state`=None, `solver`='lbfgs', `max_iter`=100, `multi_class`='auto', `verbose`=0, `warm_start`=False, `n_jobs`=None, `l1_ratio`=None)

sklearn.linear_model.LogisticRegression: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

If you don't set any parameters, the default value will be taken for each parameter.

In [None]:
lr.fit(X_train, y_train)

In [None]:
lr.score(X_train, y_train), lr.score(X_test, y_test)

It seems like the Logistic Regression model works much better than the k-NNs model. 

In [None]:
lr.predict(X_new)

The Logistic Regression model predicts that person 3 would have survived, unlike the prediction from the above k-NNs model. Note that different algorithms and models could make different predictions. 

In [None]:
summary["Logistic Regression"] = round(lr.score(X_test, y_test), 3)
summary

## Modeling with Different Classification Algorithms

### Decision Trees

In [None]:
from sklearn.tree import DecisionTreeClassifier

dtc = DecisionTreeClassifier(random_state=0)
dtc

class sklearn.tree.DecisionTreeClassifier(\*, `criterion`='gini', `splitter`='best', `max_depth`=None, `min_samples_split`=2, `min_samples_leaf`=1, `min_weight_fraction_leaf`=0.0, `max_features`=None, `random_state`=None, `max_leaf_nodes`=None, `min_impurity_decrease`=0.0, `min_impurity_split`=None, `class_weight`=None, `ccp_alpha`=0.0)

sklearn.tree.DecisionTreeClassifier: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

The `random_state` parameter guarantees a deterministic outcome given the same data sets. 

In [None]:
dtc.fit(X_train, y_train)

In [None]:
dtc.score(X_train, y_train), dtc.score(X_test, y_test)

In [None]:
dtc.predict(X_new)

In [None]:
summary["Decision Trees"] = round(dtc.score(X_test, y_test), 3)
summary

### Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(random_state=0)
rfc

class sklearn.ensemble.RandomForestClassifier(`n_estimators`=100, \*, `criterion`='gini', `max_depth`=None, `min_samples_split`=2, `min_samples_leaf`=1, `min_weight_fraction_leaf`=0.0, `max_features`='auto', `max_leaf_nodes`=None, `min_impurity_decrease`=0.0, `min_impurity_split`=None, `bootstrap`=True, `oob_score`=False, `n_jobs`=None, `random_state`=None, `verbose`=0, `warm_start`=False, `class_weight`=None, `ccp_alpha`=0.0, `max_samples`=None)

sklearn.ensemble.RandomForestClassifier: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

In [None]:
rfc.fit(X_train, y_train)

In [None]:
rfc.score(X_train, y_train), rfc.score(X_test, y_test)

In [None]:
rfc.predict(X_new)

In [None]:
summary["Random Forest"] = round(rfc.score(X_test, y_test), 3)
summary

### Linear Support Vector Machines (SVMs)

In [None]:
from sklearn.svm import LinearSVC

lsvc = LinearSVC(random_state=0)
lsvc

class sklearn.svm.LinearSVC(`penalty`='l2', `loss`='squared_hinge', \*, `dual`=True, `tol`=0.0001, `C`=1.0, `multi_class`='ovr', `fit_intercept`=True, `intercept_scaling`=1, `class_weight`=None, `verbose`=0, `random_state`=None, `max_iter`=1000)

sklearn.svm.LinearSVC: https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html

In [None]:
lsvc.fit(X_train, y_train)

In [None]:
lsvc.score(X_train, y_train), lsvc.score(X_test, y_test)

In [None]:
lsvc.predict(X_new)

In [None]:
summary["Linear SVMs"] = round(lsvc.score(X_test, y_test), 3)
summary

### Kernelized Support Vector Machines (SVMs)

In [None]:
from sklearn.svm import SVC

svc = SVC(C=1.0, kernel="rbf", gamma="scale", random_state=0)
svc

class sklearn.svm.SVC(\*, `C`=1.0, `kernel`='rbf', `degree`=3, `gamma`='scale', `coef0`=0.0, `shrinking`=True, `probability`=False, `tol`=0.001, `cache_size`=200, `class_weight`=None, `verbose`=False, `max_iter`=-1, `decision_function_shape`='ovr', `break_ties`=False, `random_state`=None)

sklearn.svm.SVC: https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

- `C`: Regularization parameter. The strength of the regularization is inversely proportional to C. Must be strictly positive.
- `kernel`: The '*rbf*' refers to Radial Basis Function, also known as the Gaussian kernel.
- `gamma`: Controls the width of the Gaussian kernel. It is set to '*scale*' (= 1 / (# of features × variance of `X`)) by default, as in the signature above; '*auto*' uses 1 / # of features instead. 
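According to the scikit-learn documentation, `gamma="scale"` resolves to 1 / (n_features * X.var()) and `gamma="auto"` to 1 / n_features. The sketch below computes both values on synthetic stand-in data (an assumption; `X_syn`, `gamma_scale`, and `gamma_auto` are illustrative names) and checks that passing the computed value explicitly behaves like `gamma="scale"`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in data (assumption), not the Titanic data
X_syn, y_syn = make_classification(n_samples=100, n_features=6, random_state=0)

n_features = X_syn.shape[1]
gamma_scale = 1.0 / (n_features * X_syn.var())   # value used when gamma="scale"
gamma_auto = 1.0 / n_features                    # value used when gamma="auto"

svc_scale = SVC(kernel="rbf", gamma="scale").fit(X_syn, y_syn)
svc_manual = SVC(kernel="rbf", gamma=gamma_scale).fit(X_syn, y_syn)

# Both models should produce the same decision values on the same data
same_model = np.allclose(svc_scale.decision_function(X_syn),
                         svc_manual.decision_function(X_syn))
print(gamma_scale, gamma_auto, same_model)
```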

In [None]:
svc.fit(X_train, y_train)

In [None]:
svc.score(X_train, y_train), svc.score(X_test, y_test)

In [None]:
svc.predict(X_new)

In [None]:
summary["Kernelized SVMs"] = round(svc.score(X_test, y_test), 3)
summary

### Neural Networks

In [None]:
from sklearn.neural_network import MLPClassifier

mlpc = MLPClassifier(hidden_layer_sizes=(10,), random_state=0)
mlpc

class sklearn.neural_network.MLPClassifier(`hidden_layer_sizes`=(100,), `activation`='relu', \*, `solver`='adam', `alpha`=0.0001, `batch_size`='auto', `learning_rate`='constant', `learning_rate_init`=0.001, `power_t`=0.5, `max_iter`=200, `shuffle`=True, `random_state`=None, `tol`=0.0001, `verbose`=False, `warm_start`=False, `momentum`=0.9, `nesterovs_momentum`=True, `early_stopping`=False, `validation_fraction`=0.1, `beta_1`=0.9, `beta_2`=0.999, `epsilon`=1e-08, `n_iter_no_change`=10, `max_fun`=15000)

sklearn.neural_network.MLPClassifier: https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html

Setting `hidden_layer_sizes` to (10,) means there is only one hidden layer with 10 hidden units. If you set it to (10, 10), that means you want two hidden layers, each with 10 hidden units. 
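One way to see the effect of `hidden_layer_sizes` is to inspect the shapes of the fitted weight matrices in `coefs_`. The sketch below uses synthetic stand-in data with 6 features (an assumption) and a short training run, since only the shapes matter here; `one_layer` and `two_layers` are illustrative names.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data (assumption), not the Titanic data
X_syn, y_syn = make_classification(n_samples=100, n_features=6, random_state=0)

with warnings.catch_warnings():
    warnings.simplefilter("ignore")   # a short run may not converge; we only look at the shapes
    one_layer = MLPClassifier(hidden_layer_sizes=(10,), max_iter=50, random_state=0).fit(X_syn, y_syn)
    two_layers = MLPClassifier(hidden_layer_sizes=(10, 10), max_iter=50, random_state=0).fit(X_syn, y_syn)

shapes_one = [w.shape for w in one_layer.coefs_]    # input -> hidden, hidden -> output
shapes_two = [w.shape for w in two_layers.coefs_]   # input -> hidden, hidden -> hidden, hidden -> output

print(shapes_one)
print(shapes_two)
```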

In [None]:
mlpc.fit(X_train, y_train)

In [None]:
mlpc.score(X_train, y_train), mlpc.score(X_test, y_test)

In [None]:
mlpc.predict(X_new)

In [None]:
summary["Neural Networks"] = round(mlpc.score(X_test, y_test), 3)
summary

## Choosing the Best Model

In [None]:
summary

You can simply choose the model with the best performance, i.e., the highest accuracy score. 
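Picking the highest score can also be done programmatically. The sketch below uses a dictionary with made-up placeholder scores (`example_summary` is hypothetical and does not contain results from this notebook); the same two lines work on the real `summary` dictionary.

```python
# Hypothetical placeholder scores for illustration only, not actual results
example_summary = {"Model A": 0.70, "Model B": 0.82, "Model C": 0.79}

best_name = max(example_summary, key=example_summary.get)   # key with the highest score
ranked = sorted(example_summary.items(), key=lambda item: item[1], reverse=True)

print(best_name, example_summary[best_name])
print(ranked)
```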

## Room for Improvement

- You may want to try many more different parameter settings for each classification algorithm to find the optimal setting that yields the best performance. 
- You may want to try other classification algorithms to find the algorithm that yields the best performance. 
- You may want to consider classification metrics other than accuracy, such as precision, recall, F1-score, the confusion matrix, average precision (AP), and area under the curve (AUC). 
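As a starting point for the last item, the sketch below computes precision, recall, and F1-score with `sklearn.metrics` on made-up labels (`y_true` and `y_pred` are hypothetical, not Titanic results).

```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 0, 0, 1, 0, 1, 0, 0]   # hypothetical true labels
y_pred = [1, 0, 1, 1, 0, 0, 0, 0]   # hypothetical predicted labels

precision = precision_score(y_true, y_pred)   # TP / (TP + FP)
recall = recall_score(y_true, y_pred)         # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                 # harmonic mean of precision and recall

print(precision, recall, f1)
```

Unlike accuracy, these metrics focus on the positive class, which can matter when the two classes are imbalanced.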

## Exercises for Regression

<hr>

## Unsupervised Learning - Clustering

Let's continue to use the Titanic dataframe `df` for clustering. 

In [None]:
from seaborn import load_dataset

df = load_dataset("titanic")
df = df[["survived", "pclass", "sex", "age", "sibsp", "parch", "fare"]]
df = df.dropna()
df

### Setting the Goal

Using the Titanic data set, we aim to build a clustering model that is able to partition the data set into groups, or clusters, of similar passengers. This is a clustering problem.  

### Defining the Features

In [None]:
features = ["survived", "pclass", "sex", "age", "sibsp", "parch", "fare"]

Note that there is no target to predict in unsupervised learning. 

In [None]:
X = df[features].copy()    # copy so that later column edits on X do not trigger SettingWithCopyWarning
X

Because we don't have a target to predict, we don't have to define `y` in unsupervised learning. All we need is `X`. We don't have to split the data into training and test data either. 

### Converting Categorical Columns into Numerical Columns

In [None]:
X.sex = X.sex.apply(lambda x: 1 if x == "male" else 0)
X

### Choosing a Clustering Algorithm to Use

Let's use the K-Means Clustering algorithm. 

### Initializing a Model Object with Hyper-Parameters

In [None]:
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=5, random_state=0)
kmeans

sklearn.cluster.KMeans: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

The most important parameter for k-Means Clustering is `n_clusters`, which determines the number of clusters, or k, you want to find. 

### Fitting the Data

In [None]:
kmeans.fit(X)

In [None]:
kmeans.predict(X)

Each data point in `X` is assigned a cluster label, which is a number between 0 and k-1. 
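The sketch below illustrates the label range on made-up one-dimensional data with three obvious groups (`X_toy`, `km`, and `labels` are illustrative names). Which group receives which number is arbitrary; only the grouping itself is meaningful.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: three tight groups on a line
X_toy = np.array([[0.0], [0.5], [1.0], [10.0], [10.5], [11.0], [20.0], [20.5], [21.0]])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_toy)
labels = km.predict(X_toy)

print(labels)              # one label per data point
print(np.unique(labels))   # only the values 0, 1, and 2 appear
```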

In [None]:
df["label"] = kmeans.predict(X)
df

It would be more useful if we add a new column `label` to the dataframe, so we can know which data point, or passenger, is assigned to which cluster. 

In [None]:
df.label.value_counts()                          # Count the number of values for each label 

Note that k-Means Clustering neither names the clusters nor gives any additional information about the clusters. It just yields cluster labels in numbers. It is up to you to identify what each cluster represents.

### Evaluating the Performance of the Model

Note that there is no ground truth in unsupervised learning that can be used for evaluation. The focus of evaluation, therefore, should be on identifying the characteristics of each cluster.

In [None]:
cluster_1st, cluster_2nd, cluster_3rd, cluster_4th, cluster_5th = df.label.value_counts().index
cluster_1st, cluster_2nd, cluster_3rd, cluster_4th, cluster_5th

In [None]:
df[df.label == cluster_1st].sample(n=10, random_state=0)    # Select a random sample with 10 rows

You can see that the passengers in the largest cluster seem to be those who mostly died and were in lower classes, not so old, and on board alone.  

In [None]:
df[df.label == cluster_2nd].sample(n=10, random_state=0)

You can see that the passengers in the second largest cluster seem to be those who mostly survived and were in the first class.  

In [None]:
df[df.label == cluster_3rd].sample(n=10, random_state=0)

You can also see that the passengers in the third largest cluster seem to be those who mostly survived and were in the first class but paid more than the passengers in the previous cluster. We can suspect that these two clusters could have been one cluster but were further divided into two because the model had to find 5 clusters anyway.  

In [None]:
df[df.label == cluster_4th]

You can also see that the passengers in this cluster seem to have paid even more than those in the previous two clusters. 

In [None]:
df[df.label == cluster_5th]

In this cluster, you can see the three passengers who paid most. 

In [None]:
df.groupby("label").mean(numeric_only=True)    # average only the numeric columns (df.sex still holds strings)

We can check the average for each cluster and column. For example, the survival rate of cluster 2 is only 33.5%, whereas the survival rate of cluster 0 is 63.7%, which is consistent with the above findings. 

### Trying Different k

In [None]:
kmeans = KMeans(n_clusters=6, random_state=0)
kmeans.fit(X)
df["label"] = kmeans.predict(X)
df.label.value_counts()

In [None]:
cluster_1st, cluster_2nd, cluster_3rd, cluster_4th, cluster_5th, cluster_6th = df.label.value_counts().index
cluster_1st, cluster_2nd, cluster_3rd, cluster_4th, cluster_5th, cluster_6th

In [None]:
df[df.label == cluster_1st].sample(n=10, random_state=0)

This largest cluster does not look much different from the largest cluster from the previous 5-means clustering. 

In [None]:
df[df.label == cluster_2nd].sample(n=10, random_state=0)

You can see that increasing k from 5 to 6 helps split the previous largest cluster into two clusters. 

In [None]:
df[df.label == cluster_3rd].sample(n=10, random_state=0)

In [None]:
df[df.label == cluster_4th].sample(n=10, random_state=0)

In [None]:
df[df.label == cluster_5th]

In [None]:
df[df.label == cluster_6th]

Note that k-means clustering does not allow you to control which cluster to split or which clusters to merge. In other words, increasing or decreasing k might not always work as expected. 
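Since the effect of changing k is hard to predict, one common aid (not used in this notebook) is to compare the `inertia_` attribute of fitted `KMeans` models across several values of k. The sketch below does this on synthetic stand-in data from `make_blobs` (an assumption; `X_syn` and `inertias` are illustrative names).

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data with 4 groups (assumption), not the Titanic data
X_syn, _ = make_blobs(n_samples=300, centers=4, random_state=0)

inertias = {}
for k in range(1, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_syn)
    inertias[k] = model.inertia_    # sum of squared distances to the closest cluster center

for k, value in inertias.items():
    print(k, round(value, 1))
```

A lower inertia means tighter clusters, but inertia tends to keep decreasing as k grows, so it is only a rough guide and does not replace inspecting the clusters as done above.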

## Exercises for Clustering