Goal:
- train a classifier to predict whether a passenger survived the Titanic disaster
  
Tasks:
- load the titanic.csv file into a DataFrame
- drop the irrelevant columns (`PassengerId`, `Name`, `Ticket`, `Cabin`) 
  - hint: `df.drop()` or `df.pop()`
- drop the rows with nan values (`df.shape` should return `(712, 8)` at this point)
  - hint: `df.dropna()`
- extract `y` target values (what you want to predict) from the DataFrame
  - hint: `y = df.pop().to_list()`
  - hint: `df.shape` should return `(712, 7)` at this point
- Problem: if you take a look at `df.head()`, you will see that the `Sex` and `Embarked` columns are not numeric; they are strings. To build a classifier, the features must be numbers, so these columns have to be converted. This is called encoding. Two common approaches are `Label Encoding` and `One-Hot Encoding (OHE)`. If the values have a natural order, `Label Encoding` is appropriate. Since `Sex` and `Embarked` have no natural order, use `OHE`. Google `OHE`!
  - `df_encoded = pd.get_dummies(df, columns=['Sex', 'Embarked'])` --> in recent pandas versions this yields boolean columns; convert them to integers
  - `df_encoded['Sex_female'] = df_encoded['Sex_female'].astype(int)` --> repeat this for the other encoded columns as well
  - the original `Sex` and `Embarked` columns need no separate drop: `pd.get_dummies(..., columns=...)` already replaces them with the encoded columns
  - `X = df_encoded.to_numpy()`
- Create train/test split with `train_test_split()`
  - hint: `from sklearn.model_selection import train_test_split`
- Create and train (fit) a decision tree classifier and a random forest classifier with the training data you created in the previous step
- predict and measure accuracy on the test set (`from sklearn.metrics import accuracy_score`)
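
As a small illustration of the encoding step described in the tasks above, the sketch below one-hot encodes a toy DataFrame. The rows are made up for the example and are not taken from `titanic.csv`; passing `dtype=int` is one possible way to get 0/1 integers directly instead of casting afterwards.

```python
import pandas as pd

# Toy data; these rows are invented for illustration only.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
    "Pclass": [3, 1, 2],
})

# One-hot encoding: one 0/1 column per category value.
# The original Sex and Embarked columns are replaced by the new ones.
toy_ohe = pd.get_dummies(toy, columns=["Sex", "Embarked"], dtype=int)

print(toy_ohe.columns.to_list())
print(toy_ohe)
```

`Pclass` is left untouched here because it is already numeric and its values have a natural order.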


In [None]:
# Good luck :)
#TODO Does the cabin matter?

In [29]:
import pandas as pd
import numpy as np

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

In [71]:
df = pd.read_csv("titanic.csv")
df.drop(["PassengerId", "Name", "Ticket", "Cabin"], axis = 1, inplace = True)
df.dropna(inplace = True)

In [72]:
# One-hot encode Sex and Embarked, then cast the new dummy columns to int
df = pd.get_dummies(df, columns = ["Sex", "Embarked"])

df["Sex_female"] = df["Sex_female"].astype(int)
df["Sex_male"] = df["Sex_male"].astype(int)
df["Embarked_C"] = df["Embarked_C"].astype(int)
df["Embarked_Q"] = df["Embarked_Q"].astype(int)
df["Embarked_S"] = df["Embarked_S"].astype(int)
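
The per-column casts above can probably be written more compactly. The following is a minimal sketch on a made-up two-row frame (not the Titanic data) that casts only the dummy columns in one step, selecting them by their `Sex_` prefix; the prefix-based selection is an assumption that fits this toy frame.

```python
import pandas as pd

# Toy frame, invented for illustration.
toy = pd.DataFrame({"Age": [22.0, 38.5], "Sex": ["male", "female"]})
toy = pd.get_dummies(toy, columns=["Sex"])

# Cast only the dummy columns, leaving float features such as Age untouched.
dummy_cols = [c for c in toy.columns if c.startswith("Sex_")]
toy[dummy_cols] = toy[dummy_cols].astype(int)

print(toy)
```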


In [104]:
# y --> survived or not list
y = df.pop("Survived").to_list()

In [33]:
#create test and train datasets
x_train, x_test, y_train, y_test = train_test_split(df, y, train_size = 0.5, random_state = 11)

# Alternative manual split (every other row), kept for reference:
# x_train = df.values[::2]
# x_test = df.values[1::2]
# y_train = y[::2]
# y_test = y[1::2]
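
To make the effect of `train_size` concrete, here is a small sketch on synthetic data with the same number of rows and feature columns as the encoded frame. The random features and the class counts are arbitrary stand-ins, and `stratify` is an optional extra not used in this notebook; it keeps the class ratio similar in both halves.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 712 rows, 10 features, binary target (counts are arbitrary).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(712, 10))
y_demo = np.array([0] * 424 + [1] * 288)

# train_size=0.5 puts half of the rows in each part;
# stratify=y_demo keeps the share of each class roughly equal in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.5, random_state=11, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```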

In [39]:
# Decision Tree 50%

# Train the classifier
dt_clf = DecisionTreeClassifier()
dt_clf = dt_clf.fit(x_train, y_train)

# Test the classifier
dt_y_pred = dt_clf.predict(x_test)
dt_accuracy = accuracy_score(y_test, dt_y_pred)
dt_accuracy  # display the value as the cell output

0.7162921348314607


In [40]:
# Random Forest 50%

# Train the classifier
rt_clf = RandomForestClassifier()
rt_clf = rt_clf.fit(x_train, y_train)

# Test the classifier
rt_y_pred = rt_clf.predict(x_test)
rt_accuracy = accuracy_score(y_test, rt_y_pred)
rt_accuracy  # display the value as the cell output

0.7443820224719101
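
Both classifiers above are created without a `random_state`, so rerunning the cells may give slightly different accuracies. The sketch below, on synthetic data from `make_classification` (an assumption, not the Titanic data), shows that fixing the seed makes two independently trained forests agree.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem used only for this demonstration.
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.5, random_state=11
)

# Same seed -> same trees -> same predictions on the test part.
pred_a = RandomForestClassifier(random_state=42).fit(X_tr, y_tr).predict(X_te)
pred_b = RandomForestClassifier(random_state=42).fit(X_tr, y_tr).predict(X_te)

print((pred_a == pred_b).all())
```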


In [53]:
print(f"Classifier accuracy with a 50-50 split of the data\n    Decision tree     : {round(dt_accuracy, 2)}\n    Random forest     : {round(rt_accuracy, 2)}")

Classifier accuracy with a 50-50 split of the data
    Decision tree     : 0.72
    Random forest     : 0.74


## Try it with 70% train and 30% test

In [55]:
#create test and train datasets
x70_train, x30_test, y70_train, y30_test = train_test_split(df, y, train_size = 0.7, random_state = 11)
print(x70_train.shape, x30_test.shape)

(498, 10) (214, 10)


In [56]:
# Decision Tree 70%

# Train the classifier
dt70_clf = DecisionTreeClassifier()
dt70_clf = dt70_clf.fit(x70_train, y70_train)

# Test the classifier
dt70_y_pred = dt70_clf.predict(x30_test)
dt70_accuracy = accuracy_score(y30_test, dt70_y_pred)
dt70_accuracy  # display the value as the cell output

0.7242990654205608


In [57]:
# Random Forest 70%

# Train the classifier
rt70_clf = RandomForestClassifier()
rt70_clf = rt70_clf.fit(x70_train, y70_train)

# Test the classifier
rt70_y_pred = rt70_clf.predict(x30_test)
rt70_accuracy = accuracy_score(y30_test, rt70_y_pred)
rt70_accuracy  # display the value as the cell output

0.7990654205607477


In [58]:
print(f"Classifier accuracy with a 70-30 split of the data\n    Decision tree     : {round(dt70_accuracy, 2)}\n    Random forest     : {round(rt70_accuracy, 2)}")

Classifier accuracy with a 70-30 split of the data
    Decision tree     : 0.72
    Random forest     : 0.8


## Conclusion
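- In this run the random forest scored higher with the 70-30 split (0.80) than with the 50-50 split (0.74); the decision tree stayed at 0.72.
- The random forest was more accurate than the decision tree on both splits.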
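
These conclusions rest on a single random split per setting, so the differences could partly be noise. One common way to get a steadier comparison is k-fold cross-validation. The sketch below uses synthetic data from `make_classification` as a stand-in (an assumption, not the Titanic data) and scikit-learn's `cross_val_score` with 5 folds.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with the same number of rows and features as the encoded frame.
X_demo, y_demo = make_classification(n_samples=712, n_features=10, random_state=0)

models = {
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "Random forest": RandomForestClassifier(random_state=0),
}

# Train and score each model on 5 different train/test partitions.
cv_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_scores[name] = scores
    print(f"{name}: {scores.mean():.2f} +/- {scores.std():.2f}")
```

Reporting the mean together with the spread across folds makes it easier to judge whether one model is really ahead.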

## Does the cabin matter?

The `_c` suffix marks variables that include the `Cabin` column.

In [113]:
df_c = pd.read_csv("titanic.csv")
df_c.drop(["PassengerId", "Name", "Ticket"], axis = 1, inplace = True)
# Treat a missing cabin as its own category so these rows are not removed by dropna()
df_c["Cabin"] = df_c["Cabin"].replace(to_replace = np.nan, value = "on_board")
df_c.dropna(inplace = True)

In [114]:
# One-hot encode the categorical columns, then cast every column to int
# (note: this also truncates the fractional part of Age and Fare)
df_c = pd.get_dummies(df_c, columns = ["Sex", "Embarked", "Cabin"])
df_c_column_names = df_c.columns.to_list()
for column in df_c_column_names:
    df_c[column] = df_c[column].astype(int)
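
One-hot encoding the raw `Cabin` values creates one column per distinct cabin, and many of those columns are likely to describe a single passenger. A possible alternative, not used in this notebook, is to keep only the first character of the cabin code. The sketch below assumes that this character identifies the deck and uses made-up cabin values; the `"U"` label for missing cabins is a hypothetical choice.

```python
import numpy as np
import pandas as pd

# Toy cabin values, invented for illustration.
toy = pd.DataFrame({"Cabin": ["C85", np.nan, "E46", "C123"]})

# Keep the leading letter as a coarse "deck" feature; missing cabins become "U".
toy["Deck"] = toy["Cabin"].str[0].fillna("U")

# Encode the deck instead of the full cabin code: far fewer columns.
toy = pd.get_dummies(toy.drop(columns="Cabin"), columns=["Deck"], dtype=int)

print(toy.columns.to_list())
print(toy)
```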

In [115]:
# y --> survived or not list
y_c = df_c.pop("Survived").to_list()

In [116]:
#create test and train datasets
x70c_train, x30c_test, y70c_train, y30c_test = train_test_split(df_c, y_c, train_size = 0.7, random_state = 11)

In [117]:
# Random Forest 70% with cabin

# Train the classifier
rt70c_clf = RandomForestClassifier()
rt70c_clf = rt70c_clf.fit(x70c_train, y70c_train)

# Test the classifier
rt70c_y_pred = rt70c_clf.predict(x30c_test)
rt70c_accuracy = accuracy_score(y30c_test, rt70c_y_pred)

In [119]:
print(f"Random forest accuracy with the Cabin column included: {round(rt70c_accuracy, 2)}")

Random forest accuracy with the Cabin column included: 0.79
