# Tutorial (Non Kishu Group)

This notebook predicts whether a passenger survived the sinking of the Titanic.

**Goal**: This notebook helps you get familiar with the format of the tasks. You only need to **read the markdown instructions; there is no need to read the code**.

Please make sure you <span style="color: red">run every cell</span> in the task. 

You should <span style="color: red">**never**</span> unfold the next part before the previous part is done.

Once you see "CHOOSE ONE", copy, paste, and execute <span style="color: red"> any one of the choices </span> in the cell below.

You are free to run other cells.

# Task 1:  Data Analysis

## Part 0: Load Libraries and Read Data

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from joblib import dump, load

plt.style.use("fivethirtyeight")
import warnings

warnings.filterwarnings("ignore")
from sklearn import metrics  # Accuracy measure
from sklearn import svm  # Support vector Machine
from sklearn.ensemble import RandomForestClassifier  # Random Forest
from sklearn.ensemble import BaggingClassifier  # Bagging ensemble

# importing all the required ML packages
from sklearn.linear_model import LogisticRegression  # Logistic regression
from sklearn.metrics import confusion_matrix  # For confusion matrix
from sklearn.model_selection import train_test_split  # Training and testing data split
from sklearn.naive_bayes import GaussianNB  # Naive bayes
from sklearn.neighbors import KNeighborsClassifier  # KNN
from sklearn.tree import DecisionTreeClassifier  # Decision Tree

%matplotlib inline

In [None]:
data = pd.read_csv("./titanic.csv")
print(data.head())

The `Survived` column is the target you need to predict from the other features.

## Part 1: Data Preprocessing

### Impute the missing values

In [None]:
# Impute missing age feature.
# Extract each passenger's salutation (e.g. "Mr", "Miss") from the name.
data["Initial"] = data["Name"].str.extract(r"([A-Za-z]+)\.", expand=False)
# Map rare salutations onto the common ones.
data["Initial"] = data["Initial"].replace(
    ["Mlle", "Mme", "Ms", "Dr", "Major", "Lady", "Countess", "Jonkheer", "Col", "Rev", "Capt", "Sir", "Don"],
    ["Miss", "Miss", "Miss", "Mr", "Mr", "Mrs", "Mrs", "Other", "Other", "Other", "Mr", "Mr", "Mr"],
)
# Assign the NaN ages the ceiling of the mean age for each salutation
data.loc[(data.Age.isnull()) & (data.Initial == "Mr"), "Age"] = 33
data.loc[(data.Age.isnull()) & (data.Initial == "Mrs"), "Age"] = 36
data.loc[(data.Age.isnull()) & (data.Initial == "Master"), "Age"] = 5
data.loc[(data.Age.isnull()) & (data.Initial == "Miss"), "Age"] = 22
data.loc[(data.Age.isnull()) & (data.Initial == "Other"), "Age"] = 46
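
The cell above fills each missing age according to the passenger's salutation. A minimal sketch of the same idea on a toy frame is shown below. The names, ages, and the `toy` variable are made up for illustration, and the fill values are computed from a per-title mean rather than the constants used in the cell.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical passengers, not rows from titanic.csv).
toy = pd.DataFrame(
    {
        "Name": [
            "Smith, Mr. John",
            "Brown, Mr. Paul",
            "Jones, Miss. Anna",
            "Clark, Miss. Ruth",
        ],
        "Age": [30.0, np.nan, 21.0, np.nan],
    }
)

# Capture the word directly before the first period, e.g. "Mr" or "Miss".
toy["Initial"] = toy["Name"].str.extract(r"([A-Za-z]+)\.", expand=False)

# Fill each missing age with the rounded-up mean age of the same title.
title_mean = np.ceil(toy.groupby("Initial")["Age"].transform("mean"))
toy["Age"] = toy["Age"].fillna(title_mean)

print(toy)
```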

In [None]:
# Impute embark feature.
data["Embarked"] = data["Embarked"].fillna("S")

# Age and fare feature band (convert continuous values into categorical values)
data["Age_band"] = 0
data.loc[data["Age"] <= 16, "Age_band"] = 0
data.loc[(data["Age"] > 16) & (data["Age"] <= 32), "Age_band"] = 1
data.loc[(data["Age"] > 32) & (data["Age"] <= 48), "Age_band"] = 2
data.loc[(data["Age"] > 48) & (data["Age"] <= 64), "Age_band"] = 3
data.loc[data["Age"] > 64, "Age_band"] = 4
data.head(2)

data["Fare_cat"] = 0
data.loc[data["Fare"] <= 7.91, "Fare_cat"] = 0
data.loc[(data["Fare"] > 7.91) & (data["Fare"] <= 14.454), "Fare_cat"] = 1
data.loc[(data["Fare"] > 14.454) & (data["Fare"] <= 31), "Fare_cat"] = 2
data.loc[(data["Fare"] > 31) & (data["Fare"] <= 513), "Fare_cat"] = 3

# Converting String Values into Numeric
data["Sex"] = data["Sex"].replace(["male", "female"], [0, 1])
data["Embarked"] = data["Embarked"].replace(["S", "C", "Q"], [0, 1, 2])
data["Initial"] = data["Initial"].replace(
    ["Mr", "Mrs", "Miss", "Master", "Other"], [0, 1, 2, 3, 4]
)
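
The age banding in the cell above assigns each band with a separate `data.loc` statement. As a hedged alternative, `pd.cut` can express the same right-closed bins in one call. The sketch below uses a made-up `ages` series rather than the notebook's `data`.

```python
import numpy as np
import pandas as pd

# Made-up ages chosen to sit on and around the band boundaries.
ages = pd.Series([5, 16, 17, 32, 33, 48, 64, 80])

# Right-closed bins: (-inf, 16], (16, 32], (32, 48], (48, 64], (64, inf).
age_bins = [-np.inf, 16, 32, 48, 64, np.inf]

# labels=False returns the integer bin index (0..4) for each value.
age_band = pd.cut(ages, bins=age_bins, labels=False)

print(age_band.tolist())
```

The same call with the fare boundaries (7.91, 14.454, 31) would reproduce `Fare_cat`.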

In [None]:
# Making sure there is no missing value
data.isnull().sum()

## Part 2: Let's try one feature selection strategy

In [None]:
# Drop unneeded features.
data.drop(['Name', 'Ticket', 'Fare', 'PassengerId', 'Initial','Age'], axis=1, inplace=True)

## Part 3: Now let's train the model!

### Train test set split

In [None]:
# Split the data into train and test.
train, test = train_test_split(
    data, test_size=0.3, random_state=0, stratify=data["Survived"]
)
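
The `stratify` argument in the split above keeps the share of survivors similar in both parts. The sketch below illustrates this on a made-up frame (`toy`, with a hypothetical feature column `x`), not on the Titanic data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame (made-up data): 40% of rows have Survived == 1.
toy = pd.DataFrame({"Survived": [1] * 40 + [0] * 60, "x": np.arange(100)})

toy_train, toy_test = train_test_split(
    toy, test_size=0.3, random_state=0, stratify=toy["Survived"]
)

# With stratify, both parts keep roughly the original 40% survival rate.
train_rate = toy_train["Survived"].mean()
test_rate = toy_test["Survived"].mean()
print(toy_train.shape, toy_test.shape, train_rate, test_rate)
```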

In [None]:
# Making sure splitting is successful
print(train.shape, test.shape)

### Divide the data into features (X) and target (Y)

In [None]:
train_X = train[train.columns[1:]]
train_Y = train[train.columns[:1]]
test_X = test[test.columns[1:]]
test_Y = test[test.columns[:1]]
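
The slicing above assumes that the target is the first column of `train` and `test`. A hedged alternative is to select the target by name, which does not depend on column order. The `toy` frame below is made up for illustration.

```python
import pandas as pd

# Made-up frame where the target is deliberately not the first column.
toy = pd.DataFrame(
    {"Pclass": [3, 1, 2], "Survived": [0, 1, 1], "Sex": [0, 1, 1]}
)

# Select the target by name so column order does not matter.
toy_X = toy.drop(columns=["Survived"])
toy_y = toy["Survived"]  # a 1-D Series rather than a one-column DataFrame

print(toy_X.columns.tolist(), toy_y.tolist())
```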

In [None]:
# Checking that the dataset is complete
print(train_X.head())
print(train_Y.head())

### Select the model to use (CHOOSE ONE)

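Try either of the two models below. Note that in scikit-learn 1.2 and later the `base_estimator` argument of `BaggingClassifier` is named `estimator`, and `base_estimator` was removed in version 1.4.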

1. **Choice 1**: Bagging KNN
```python
model = BaggingClassifier(base_estimator=KNeighborsClassifier(n_neighbors=3), random_state=0, n_estimators=700)
model.fit(train_X, train_Y)
```
2. **Choice 2**: rbf-SVM
```python
model = svm.SVC(kernel='rbf', C=1, gamma=0.1)
model.fit(train_X,train_Y)
```

## Part 4: Model Evaluation

In [None]:
prediction = model.predict(test_X)
accu = metrics.accuracy_score(test_Y, prediction)
accu
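
The `accu` value is the fraction of test rows whose predicted label matches the true label. The sketch below checks this on made-up labels (`y_true` and `y_pred` are illustrative, not outputs of the model above).

```python
import numpy as np
from sklearn import metrics

# Made-up labels and predictions for illustration.
y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1])

# accuracy_score takes (y_true, y_pred) and returns the fraction that match.
toy_accu = metrics.accuracy_score(y_true, y_pred)

# The same number computed by hand: 3 of the 5 predictions are correct.
manual_accu = (y_true == y_pred).mean()

print(toy_accu, manual_accu)
```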

## Part 5: Alternative Feature Engineering

Suppose you would like to explore a new approach. 

**Previously:** We tried these features in Part 2.

```python
data.drop(['Name', 'Ticket', 'Fare', 'PassengerId', 'Initial', 'Age'], axis=1, inplace=True)
```

**Now:** Let's try a different set of features (keep `Age`).

```python
data.drop(['Name', 'Ticket', 'Fare', 'PassengerId', 'Initial'], axis=1, inplace=True)
```

**Task**: Replace the previous feature engineering method with the new one, then retrain and re-evaluate the model you chose earlier.

<span style="color: red">**Note**</span>: The previous method already dropped the `Age` feature, so you cannot run the new feature selection method directly without first recovering the data in the kernel.
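
To see why the data must be recovered first, the sketch below uses a made-up two-column frame to show that an in-place `drop` changes the frame itself, and that dropping an already removed column raises an error.

```python
import pandas as pd

# Made-up frame standing in for the notebook's `data`.
toy = pd.DataFrame({"Age": [22.0, 38.0], "Age_band": [1, 2]})

# An in-place drop mutates the frame; the column is gone afterwards.
toy.drop(["Age"], axis=1, inplace=True)
age_still_present = "Age" in toy.columns

# Dropping a column that no longer exists raises KeyError.
try:
    toy.drop(["Age"], axis=1, inplace=True)
    second_drop_failed = False
except KeyError:
    second_drop_failed = True

print(age_still_present, second_drop_failed)
```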

___________________________

## Part 6

### Report result

**Task:** Report which model has the higher `accu` score. Answer `old model` or `new model`.

Answer in the <span style="color:red">text file</span> opened for you.

## Part 7

### Dump model

Let's export both models anyway for later investigation.

**Task**: Export the models you trained with the first and second feature engineering methods.

The exported files should be named `"tutorial_old_way.joblib"` and `"tutorial_new_way.joblib"` respectively.

To dump a file, you can use the following code:
```python
dump(model, "tutorial_old_way.joblib")
```
and
```python
dump(model, "tutorial_new_way.joblib")
```

You should see the `tutorial_old_way.joblib` and `tutorial_new_way.joblib` files in the left panel, which means the models were exported successfully.
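
An exported file can also be checked by loading it back with `load`. The sketch below is a minimal round trip under stated assumptions: it trains a tiny made-up model and writes to a temporary directory under the hypothetical name `toy_model.joblib`, so it does not touch the tutorial's files.

```python
import os
import tempfile

from joblib import dump, load
from sklearn.neighbors import KNeighborsClassifier

# Tiny made-up training set; the notebook would dump its own `model` instead.
toy_model = KNeighborsClassifier(n_neighbors=1).fit(
    [[0], [1], [2], [3]], [0, 0, 1, 1]
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.joblib")  # hypothetical file name
    dump(toy_model, path)
    file_exists = os.path.exists(path)
    restored = load(path)

# The restored model should predict exactly like the original.
same_predictions = list(restored.predict([[0], [3]])) == list(
    toy_model.predict([[0], [3]])
)
print(file_exists, same_predictions)
```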

# Task 2: Identifying and Recovering

## Part 1: Identifying

**Task**: Identify which **cell** reduces the number of columns in `data` from 14 to 8.

**Hint**: You may use `data.shape` to inspect the shape of `data`, or use any other method to finish the task.
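
As a small illustration of the hint, the sketch below reads `.shape` before and after an operation on a made-up frame. The `toy` frame and its columns are hypothetical.

```python
import pandas as pd

# Made-up frame with 2 rows and 3 columns.
toy = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

shape_before = toy.shape  # (rows, columns)
toy.drop(["b", "c"], axis=1, inplace=True)
shape_after = toy.shape

print(shape_before, "->", shape_after)
```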

______________

## Part 2
### Recover from System Failure

**Let's simulate system failure by restarting the kernel ("Kernel" > "Restart Kernel...").**

**Task**: After the restart, restore the variables that existed before the restart, then **execute the next cell to make sure the restoration is successful**.

In [None]:
# Inspect accu to see if the recovery is successful
print(accu)