# Assignment 5: Machine Learning

In this assignment, we will get you going on building simple machine learning models. We will use a popular Python machine learning library called [`scikit-learn`](https://scikit-learn.org/stable/index.html) and work on the `Titanic` dataset, a popular introductory dataset for binary classification that includes information on the passengers of the Titanic. In particular, we will try to build a model to estimate whether a passenger survived or not, based on the rest of the available information.

Before we begin, make sure you have chosen the right virtual environment (`iap-data-venv`) to run this notebook in. How to do this depends on your IDE.

## Dataset

You can find the Titanic dataset in `datasets/ml/titanic.csv`. Let's load it and take a look at its columns and first few rows:

In [None]:
import pandas as pd

data = pd.read_csv('../datasets/ml/titanic.csv')
data.head()

In [None]:
data.shape

As you can see, the dataset contains the following fields for each of 891 passengers:

- `PassengerId`: A unique, monotonically increasing ID number for each passenger.
- `Survived`: Whether or not the passenger survived.
- `Pclass`: The class of the passenger (1 = 1st; 2 = 2nd; 3 = 3rd).
- `Name`, `Sex`, `Age`: Demographic information about the passenger.
- `SibSp`: Number of siblings/spouses aboard.
- `Parch`: Number of parents/children aboard.
- `Fare`: Fare paid by the passenger.
- `Cabin`: Cabin number.
- `Embarked`: Port of embarkation (C = Cherbourg; Q = Queenstown; S = Southampton).

Let's preprocess the dataset to drop passengers with a missing age, one-hot encode the categorical variables, and create a training and testing split:

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Drop passengers with missing age
data.dropna(subset=['Age'], inplace=True)  # a list also works on older pandas versions
data.reset_index(drop=True, inplace=True)

# Splitting data
X = data[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']]
y = data['Survived']

# Append columns with one-hot encoding for Sex and Embarked
X = pd.concat([X, pd.get_dummies(X['Sex'], prefix='Sex')], axis=1)
X = pd.concat([X, pd.get_dummies(X['Embarked'], prefix='Embarked')], axis=1)
X.drop(columns=['Sex', 'Embarked'], inplace=True)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [None]:
X_train.head()
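
To make the one-hot encoding step above more concrete, here is a minimal sketch on a tiny made-up frame. The `toy` data and the `toy_*` names are assumptions for illustration only and are not part of the Titanic dataset.

```python
import pandas as pd

# Hypothetical toy frame; the values are made up for illustration.
toy = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0],
    "Sex": ["male", "female", "female"],
})

# One indicator column per category, named "<prefix>_<category>".
sex_dummies = pd.get_dummies(toy["Sex"], prefix="Sex")

# Append the indicator columns and drop the original text column.
toy_encoded = pd.concat([toy, sex_dummies], axis=1).drop(columns=["Sex"])
print(toy_encoded)
```

Each row ends up with exactly one active indicator among the `Sex_*` columns, which is what lets the classifiers below treat the category as numeric input.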

Although we have now avoided having categorical columns, notice that the scale of our columns is different: `Pclass` only ranges from 1 to 3, while `Age` can take much larger values. To avoid improperly weighting these features just because their scales are different, it is common to apply a further preprocessing step where we standardize them. Here we will use `preprocessing.StandardScaler()` so that every column has zero mean and unit variance on the training data. Note that the scaler is fit on the training split only and then applied to both splits, so no information from the test data leaks into preprocessing:

In [None]:
from sklearn import preprocessing

# Scale numerical columns
scaler = preprocessing.StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

We are now ready to experiment with different ML classifiers available in `scikit-learn`.
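
As a quick sanity check of what standardization does, the following sketch applies `StandardScaler` to a small made-up array. The numbers and the `toy_*` names are assumptions chosen for illustration; they are kept separate from the notebook's own `scaler` so nothing above is overwritten.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values: column 0 is on a small scale, column 1 on a large one.
toy_train = np.array([[1.0, 20.0], [2.0, 40.0], [3.0, 60.0]])
toy_test = np.array([[2.0, 80.0]])

# The mean and standard deviation are estimated from the training rows only.
toy_scaler = StandardScaler().fit(toy_train)
toy_train_scaled = toy_scaler.transform(toy_train)
toy_test_scaled = toy_scaler.transform(toy_test)

print(toy_train_scaled.mean(axis=0))  # approximately zero for each column
print(toy_train_scaled.std(axis=0))   # approximately one for each column
print(toy_test_scaled)                # test rows reuse the training statistics
```

The test rows are transformed with the training mean and standard deviation, so they are not expected to have zero mean themselves.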

## Approach 1: Logistic Regression

We will start with [Logistic Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html). Logistic Regression is a linear model for classification. It is very similar to Linear Regression, but instead of predicting a continuous value, it predicts the probability of an instance belonging to a class. We provide the code below:

In [None]:
from sklearn.linear_model import LogisticRegression

# Initialize and train model
logistic_model = LogisticRegression()
logistic_model.fit(X_train_scaled, y_train)

# Predictions and evaluation
train_predictions = logistic_model.predict(X_train_scaled)
train_accuracy = accuracy_score(y_train, train_predictions)
print(f"Accuracy on training data: {train_accuracy*100:.2f}%")

test_predictions = logistic_model.predict(X_test_scaled)
test_accuracy = accuracy_score(y_test, test_predictions)
print(f"Accuracy on test data: {test_accuracy*100:.2f}%")
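
Since logistic regression predicts class probabilities, it can help to look at them directly. The sketch below uses a made-up one-dimensional dataset (the `toy_*` names are hypothetical) and compares `predict_proba` with `predict`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up data: negative inputs belong to class 0, positive inputs to class 1.
toy_X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
toy_y = np.array([0, 0, 1, 1])

toy_logistic = LogisticRegression().fit(toy_X, toy_y)

# One row per sample, one column per class; each row sums to 1.
toy_proba = toy_logistic.predict_proba(toy_X)
# predict() returns the class with the highest predicted probability.
toy_labels = toy_logistic.predict(toy_X)

print(toy_proba)
print(toy_labels)
```

The same `predict_proba` call should also work on `logistic_model` with `X_test_scaled`, if you want to inspect how confident the Titanic model is about individual passengers.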

## Approach 2: Decision Tree Classifier

We will next try a [decision tree classifier](https://scikit-learn.org/stable/modules/tree.html#classification). Decision trees are a type of model that can be used for both classification and regression. They are easy to interpret and visualize. They are also the basis for more complex models such as random forests and gradient boosted trees.
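
To illustrate the interpretability claim, here is a small sketch that prints the rules learned by a tree using `export_text`. The four data points and the feature names `first` and `second` are made up for illustration.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up data in which the label simply equals the first feature.
toy_X = [[0, 0], [0, 1], [1, 0], [1, 1]]
toy_y = [0, 0, 1, 1]

toy_tree = DecisionTreeClassifier(random_state=0).fit(toy_X, toy_y)

# export_text renders the learned splits as nested if/else style rules.
rules = export_text(toy_tree, feature_names=["first", "second"])
print(rules)
```

On this toy data a single split on `first` is enough to separate the classes, so the printed tree is only one level deep.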


Take a look at the documentation and fill in the blanks below. You will notice that `scikit-learn` provides the same interface regardless of model, so the code will look very similar to what we wrote above.

In [None]:
from sklearn.tree import DecisionTreeClassifier

# Initialize and train model
tree_model = ... # TODO: Create a model
# TODO: Fit the model to the data

# Predictions and evaluation
train_predictions = ... # TODO: Derive predictions for training data
train_accuracy = accuracy_score(y_train, train_predictions)
print(f"Accuracy on training data: {train_accuracy*100:.2f}%")

test_predictions = ... # TODO: Derive predictions for test data
test_accuracy = accuracy_score(y_test, test_predictions)
print(f"Accuracy on test data: {test_accuracy*100:.2f}%")


## Approach 3: Random Forest

Next, we will experiment with a [random forest classifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html), an ensemble method that combines multiple decision trees to create a more powerful model. 
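
The sketch below shows the ensemble structure on synthetic data: after fitting, a random forest exposes its individual trees through the `estimators_` attribute. The dataset comes from `make_classification`, and the choice of 25 trees and the `toy_*` names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data, unrelated to the Titanic dataset.
toy_X, toy_y = make_classification(n_samples=200, n_features=5, random_state=0)

toy_forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(toy_X, toy_y)

# The fitted forest keeps one decision tree per estimator.
print(len(toy_forest.estimators_))
print(type(toy_forest.estimators_[0]).__name__)
```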

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Initialize and train model
forest_model = ... # TODO: Create a model
# TODO: Fit the model to the data

# Predictions and evaluation
train_predictions = ... # TODO: Derive predictions for training data
train_accuracy = accuracy_score(y_train, train_predictions)
print(f"Accuracy on training data: {train_accuracy*100:.2f}%")

test_predictions = ... # TODO: Derive predictions for test data
test_accuracy = accuracy_score(y_test, test_predictions)
print(f"Accuracy on test data: {test_accuracy*100:.2f}%")


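Once both cells are filled in and run, you should see that the random forest achieves about the same accuracy as the single decision tree on the training data but manages to generalize better to the test data. This is because the random forest combines many trees and is therefore less prone to overfitting than a single decision tree.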

## Approach 4: Support Vector Machine (SVM)

Next, we will also try a [support vector machine (SVM)](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html). SVMs are a powerful class of models that can be used for both classification and regression. They work by finding a hyperplane that separates the data into two classes. The hyperplane is chosen such that the distance between the hyperplane and the nearest data point from each class is maximized. This distance is called the margin. The data points that are closest to the hyperplane are called support vectors. 
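
To see support vectors in practice, the following sketch fits an SVM on four made-up, linearly separable points. It assumes a linear kernel (`kernel="linear"`), which is a choice made for this illustration; `SVC` uses an RBF kernel by default.

```python
import numpy as np
from sklearn.svm import SVC

# Made-up points on a diagonal: two per class, with a gap between the classes.
toy_X = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0], [4.0, 4.0]])
toy_y = np.array([0, 0, 1, 1])

toy_svm = SVC(kernel="linear").fit(toy_X, toy_y)

# support_ holds the training indices of the support vectors,
# support_vectors_ holds the corresponding points.
print(toy_svm.support_)
print(toy_svm.support_vectors_)
print(toy_svm.predict([[0.5, 0.5], [3.5, 3.5]]))
```

The points `(1, 1)` and `(3, 3)` are the closest pair from opposite classes, so they should appear among the support vectors that define the margin.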

In [None]:
from sklearn.svm import SVC

# Initialize and train model
svm_model = ... # TODO: Create a model
# TODO: Fit the model to the data

# Predictions and evaluation
train_predictions = ... # TODO: Derive predictions for training data
train_accuracy = accuracy_score(y_train, train_predictions)
print(f"Accuracy on training data: {train_accuracy*100:.2f}%")

test_predictions = ... # TODO: Derive predictions for test data
test_accuracy = accuracy_score(y_test, test_predictions)
print(f"Accuracy on test data: {test_accuracy*100:.2f}%")

## Approach 5: k-Nearest Neighbors

Finally, we will try [k-Nearest Neighbors (kNN) classification](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html). This algorithm is a bit different from the previous ones, as it does not learn a model. Instead, it memorizes the training data and uses it to classify new data points based on the k nearest points in the training data. The kNN algorithm is an example of an instance-based learning algorithm.
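
A small sketch of this idea on made-up one-dimensional data: `kneighbors` returns the stored training points closest to a query, and `predict` takes a majority vote over their labels. The value `n_neighbors=3` and the `toy_*` names are assumptions for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Made-up data: small values belong to class 0, large values to class 1.
toy_X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
toy_y = np.array([0, 0, 0, 1, 1, 1])

# "Fitting" mostly stores the training data for later lookups.
toy_knn = KNeighborsClassifier(n_neighbors=3).fit(toy_X, toy_y)

query = np.array([[1.4]])
distances, indices = toy_knn.kneighbors(query)

print(indices)             # positions of the 3 nearest training points
print(toy_y[indices[0]])   # labels of those neighbors
print(toy_knn.predict(query))  # majority vote over the neighbor labels
```

Because distances drive the prediction, kNN is one of the models for which the standardization step earlier in the notebook matters most.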

In [None]:
from sklearn.neighbors import KNeighborsClassifier

# Initialize and train model
knn_model = ... # TODO: Create a model
# TODO: Fit the model to the data

# Predictions and evaluation
train_predictions = ... # TODO: Derive predictions for training data
train_accuracy = accuracy_score(y_train, train_predictions)
print(f"Accuracy on training data: {train_accuracy*100:.2f}%")

test_predictions = ... # TODO: Derive predictions for test data
test_accuracy = accuracy_score(y_test, test_predictions)
print(f"Accuracy on test data: {test_accuracy*100:.2f}%")
