# Tutorial (Control Group)

This notebook predicts passenger survival on the Titanic.

**Goal**: This notebook familiarizes you with the format of the tasks.

Please make sure you <span style="color: red">run every cell</span> in the task. 

When you see "CHOOSE ONE", copy <span style="color: red">any one of the choices</span> into the cell below it and run that cell.

You are free to run other cells.

# Task 1: Data Analysis

## Part 0: Load Libraries and Read Data

In [None]:
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from joblib import dump, load

# Machine-learning packages
from sklearn import metrics  # accuracy measure
from sklearn import svm  # support vector machine
from sklearn.ensemble import BaggingClassifier  # bagging (used by Choice 1 in Part 3)
from sklearn.ensemble import RandomForestClassifier  # random forest
from sklearn.linear_model import LogisticRegression  # logistic regression
from sklearn.metrics import confusion_matrix  # confusion matrix
from sklearn.model_selection import train_test_split  # train/test split
from sklearn.naive_bayes import GaussianNB  # naive Bayes
from sklearn.neighbors import KNeighborsClassifier  # KNN
from sklearn.tree import DecisionTreeClassifier  # decision tree

plt.style.use("fivethirtyeight")
warnings.filterwarnings("ignore")

%matplotlib inline

In [None]:
data = pd.read_csv("./titanic.csv")

## Part 1: Data Preprocessing

In [None]:
# Impute the missing Age values.
# Extract the salutation (the word before the first period in Name).
data["Initial"] = data["Name"].str.extract(r"([A-Za-z]+)\.", expand=False)
# Merge rare salutations into the five main groups.
data["Initial"] = data["Initial"].replace(
    ["Mlle", "Mme", "Ms", "Dr", "Major", "Lady", "Countess", "Jonkheer", "Col", "Rev", "Capt", "Sir", "Don"],
    ["Miss", "Miss", "Miss", "Mr", "Mr", "Mrs", "Mrs", "Other", "Other", "Other", "Mr", "Mr", "Mr"],
)
# Fill missing ages with the ceiling of the mean age for each salutation.
data.loc[(data.Age.isnull()) & (data.Initial == "Mr"), "Age"] = 33
data.loc[(data.Age.isnull()) & (data.Initial == "Mrs"), "Age"] = 36
data.loc[(data.Age.isnull()) & (data.Initial == "Master"), "Age"] = 5
data.loc[(data.Age.isnull()) & (data.Initial == "Miss"), "Age"] = 22
data.loc[(data.Age.isnull()) & (data.Initial == "Other"), "Age"] = 46
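
The constants above are hard-coded. As a rough illustration of where such values can come from, the sketch below computes the ceiling of the mean age per salutation and uses it to fill the gaps. The `toy` frame and its values are made up for this example and are not taken from `titanic.csv`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic data (illustrative values only).
toy = pd.DataFrame(
    {
        "Name": [
            "Braund, Mr. Owen Harris",
            "Cumings, Mrs. John Bradley",
            "Heikkinen, Miss. Laina",
            "Allen, Mr. William Henry",
            "Moran, Mr. James",
            "Palsson, Master. Gosta Leonard",
        ],
        "Age": [22.0, 38.0, 26.0, 35.0, np.nan, 2.0],
    }
)

# Extract the salutation that precedes the first period in each name.
toy["Initial"] = toy["Name"].str.extract(r"([A-Za-z]+)\.", expand=False)

# Mean age per salutation, rounded up (missing ages are skipped by mean()).
ceil_mean_age = np.ceil(toy.groupby("Initial")["Age"].mean())

# Fill each missing age with the value for that passenger's salutation.
toy["Age"] = toy["Age"].fillna(toy["Initial"].map(ceil_mean_age))
print(toy[["Initial", "Age"]])
```

Computing the values from the data keeps the imputation in step with the dataset, at the cost of making the notebook's constants less visible.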

In [None]:
# Impute the Embarked feature
data["Embarked"] = data["Embarked"].fillna("S")

# Age and fare bands (convert continuous values into categorical values)
data["Age_band"] = 0
data.loc[data["Age"] <= 16, "Age_band"] = 0
data.loc[(data["Age"] > 16) & (data["Age"] <= 32), "Age_band"] = 1
data.loc[(data["Age"] > 32) & (data["Age"] <= 48), "Age_band"] = 2
data.loc[(data["Age"] > 48) & (data["Age"] <= 64), "Age_band"] = 3
data.loc[data["Age"] > 64, "Age_band"] = 4
data.head(2)

data["Fare_cat"] = 0
data.loc[data["Fare"] <= 7.91, "Fare_cat"] = 0
data.loc[(data["Fare"] > 7.91) & (data["Fare"] <= 14.454), "Fare_cat"] = 1
data.loc[(data["Fare"] > 14.454) & (data["Fare"] <= 31), "Fare_cat"] = 2
data.loc[(data["Fare"] > 31) & (data["Fare"] <= 513), "Fare_cat"] = 3

# Convert string values into numeric codes
data["Sex"] = data["Sex"].replace(["male", "female"], [0, 1])
data["Embarked"] = data["Embarked"].replace(["S", "C", "Q"], [0, 1, 2])
data["Initial"] = data["Initial"].replace(
    ["Mr", "Mrs", "Miss", "Master", "Other"], [0, 1, 2, 3, 4]
)
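
The age banding in the cell above assigns each band with a separate `.loc` statement. A more compact way to express the same right-closed intervals is `pd.cut`; the sketch below is an alternative formulation, not the notebook's own code, and the `ages` series is made up for illustration.

```python
import pandas as pd

# Illustrative ages, including values that sit exactly on a band edge.
ages = pd.Series([4, 16, 17, 32, 40, 64, 80])

# Right-closed bins: (-inf, 16], (16, 32], (32, 48], (48, 64], (64, inf)
age_band = pd.cut(
    ages,
    bins=[-float("inf"), 16, 32, 48, 64, float("inf")],
    labels=False,  # return integer band indices instead of interval labels
)
print(age_band.tolist())
```

Because `pd.cut` closes each bin on the right by default, an age of exactly 16 falls in band 0 and an age of exactly 32 in band 1, matching the `<=` comparisons used above.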

## Part 2: Let's try one feature selection strategy

In [None]:
# Drop unneeded features
data.drop(
    ["Name", "Ticket", "Fare", "Cabin", "PassengerId", "Initial", "Age"],
    axis=1,
    inplace=True,
)

## Part 3: Now let's train the model!

In [None]:
# Divide the data into train and test sets
train, test = train_test_split(
    data, test_size=0.3, random_state=0, stratify=data["Survived"]
)
train_X = train[train.columns[1:]]
train_Y = train[train.columns[:1]]
test_X = test[test.columns[1:]]
test_Y = test[test.columns[:1]]
X = data[data.columns[1:]]
Y = data["Survived"]
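
The `stratify` argument asks `train_test_split` to keep the class ratio of the target roughly equal in both splits. The sketch below checks this on a made-up frame with a 30% positive class; the `toy` data and the variable names `train_rate` and `test_rate` are assumptions for this example.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame: 30 survivors and 70 non-survivors (illustrative only).
toy = pd.DataFrame(
    {"Survived": [1] * 30 + [0] * 70, "Pclass": [1, 2, 3, 3, 2] * 20}
)

toy_train, toy_test = train_test_split(
    toy, test_size=0.3, random_state=0, stratify=toy["Survived"]
)

# With stratification, both rates should stay close to the overall 0.3.
train_rate = toy_train["Survived"].mean()
test_rate = toy_test["Survived"].mean()
print(len(toy_train), len(toy_test), train_rate, test_rate)
```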

### Select the model to use (CHOOSE ONE)

Pick either of the two models below.

1. **Choice 1**: Bagging KNN
```python
# The base estimator is passed positionally because its keyword name differs across scikit-learn versions.
model=BaggingClassifier(KNeighborsClassifier(n_neighbors=3),random_state=0,n_estimators=700)
model.fit(train_X, train_Y)
```
2. **Choice 2**: rbf-SVM
```python
model=svm.SVC(kernel='rbf',C=1,gamma=0.1)
model.fit(train_X,train_Y)
```

In [None]:
prediction = model.predict(test_X)
accu = metrics.accuracy_score(test_Y, prediction)  # (y_true, y_pred)
print("Accuracy for the model is ", accu)
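
Accuracy alone does not show which kind of mistake the model makes. As a small sketch, the example below pairs `accuracy_score` with `confusion_matrix` on made-up labels; `y_true` and `y_pred` are hypothetical stand-ins for `test_Y` and `prediction`.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels standing in for test_Y and prediction.
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

acc = accuracy_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
print(acc)
print(cm)
```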

## Part 4: Alternative Feature Engineering

Suppose you would like to explore a new approach. 

**Previously:** In Part 2, we dropped these columns, including `Age`:

```python
data.drop(['Name','Ticket','Fare','Cabin','PassengerId','Initial','Age'],axis=1,inplace=True)
```

**Now:** Let's try a different set of features (keep `Age`)

```python
data.drop(['Name','Ticket','Fare','Cabin','PassengerId','Initial'],axis=1,inplace=True)
```

**Task**: Please try the new method and retrain the model of your previous choice.

**Hint**: The previous method already dropped `Age` from `data`, so running the new feature selection directly fails because the column no longer exists. You first need to restore `data` to its state before Part 2.
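
The difficulty comes from `inplace=True`, which mutates `data` itself. As a sketch of one way to avoid it (not the workflow this tutorial asks you to follow), calling `drop` without `inplace=True` returns a new frame and leaves the original untouched. The `toy` frame and the names `without_age` and `with_age` are made up for this example.

```python
import pandas as pd

# Toy frame standing in for the preprocessed data (illustrative only).
toy = pd.DataFrame(
    {"Name": ["A", "B"], "Age": [22.0, 38.0], "Age_band": [1, 2], "Survived": [0, 1]}
)

# Without inplace=True, drop() returns a new frame; `toy` keeps every column.
without_age = toy.drop(["Name", "Age"], axis=1)
with_age = toy.drop(["Name"], axis=1)

print(list(toy.columns))
print(list(without_age.columns))
print(list(with_age.columns))
```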

___________________________

Unfortunately, keeping `Age` does not improve the accuracy significantly.

**Task**: Try to retrieve the previous model, and then export it using the following code:
```python
dump(model, 'tutorial.joblib')
```
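
To check that an exported model can be read back, `joblib.load` restores the object written by `dump`. The sketch below does a round trip with a tiny classifier in a temporary directory; the file name `example.joblib` and the toy training data are assumptions for this example.

```python
import os
import tempfile

from joblib import dump, load
from sklearn.neighbors import KNeighborsClassifier

# Tiny classifier trained on made-up one-dimensional data.
clf = KNeighborsClassifier(n_neighbors=1).fit([[0], [1], [2], [3]], [0, 0, 1, 1])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.joblib")
    dump(clf, path)        # write the fitted model to disk
    restored = load(path)  # read it back as a new object

# The restored model should give the same predictions as the original.
print(restored.predict([[0], [3]]))
```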

# Task 2: Identifying

Suppose you notice that `data` should have 15 columns but now has only 8.

**Task**: Find the cell that changed the number of columns in `data` from 15 to 8. Tell the staff your answer.

**Task**: Find the columns that were dropped when the number of columns in `data` changed from 15 to 8. Tell the staff your answer.
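
One way to see which columns a cell removes is to record the column list before and after running it and compare the two. The sketch below shows the idea on a made-up frame; `toy`, `cols_before`, `cols_after`, and `dropped` are hypothetical names, and the `drop` call stands in for the suspect cell.

```python
import pandas as pd

# Toy frame standing in for `data` (illustrative only).
toy = pd.DataFrame(
    {"Survived": [0, 1], "Name": ["A", "B"], "Age": [22.0, 38.0], "Sex": [0, 1]}
)

cols_before = list(toy.columns)                  # record before the suspect cell
toy.drop(["Name", "Age"], axis=1, inplace=True)  # stand-in for the suspect cell
cols_after = list(toy.columns)                   # record after it

# Columns present before but missing afterwards, in their original order.
dropped = [c for c in cols_before if c not in cols_after]
print(len(cols_before), len(cols_after), dropped)
```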

### Try Kishu: Recover

Suppose the kernel has been restarted. To simulate this, restart it manually via "Kernel" > "Restart Kernel...".

**Task**: After restarting the kernel, recover your work, then run the cell below to confirm that the recovery succeeded.

In [None]:
# Print accu to check whether the recovery succeeded
print(accu)