# Lambda School, Intro to Data Science, Day 8 — Classification!

## Assignment

Run this cell to load the Titanic data:

In [411]:
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split

# Generate CSV file, just in case.
Data = sns.load_dataset("titanic")
Data.to_csv("Titanic.csv")

# Cleaning: "numerizing" categorical data in "sex" and "class".
Data.drop(labels = "who", axis = 1, inplace = True)     # "Who" column deletion.
Data.rename({"sex" : "male"}, axis = 1, inplace = True) # "Sex" column rename.
sex_ranks = {"male": 1, "female": 0}
Data["male"] = Data["male"].map(sex_ranks)
css_ranks = {"First": 1, "Second": 2, "Third" : 3}
Data["class"] = Data["class"].map(css_ranks)

# Cleaning: dealing with NaNs in "age" and "fare".
avg_age = [ Data["age"][Data["male"] == 0].mean(),  # Female average age.
            Data["age"][Data["male"] == 1].mean() ] # Male average age.
# The first fillna replaces every NaN age with the female average,
# so the second call finds nothing left to fill.
Data["age"] = Data["age"].fillna(avg_age[0])
Data["age"] = Data["age"].fillna(avg_age[1])
Data["fare"] = Data["fare"].fillna(Data["fare"].mean())

# Create training and testing dataframes.
train, test = train_test_split(Data.drop(columns = "alive"), random_state = 0)
target = "survived"
print("%d columns, %d rows for training, %d rows for testing."
      % (train.shape[1], train.shape[0], test.shape[0]))
Data.tail(10)

13 columns, 668 rows for training, 223 rows for testing.


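| | survived | pclass | male | age | sibsp | parch | fare | embarked | class | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 881 | 0 | 3 | 1 | 33.0 | 0 | 0 | 7.8958 | S | 3 | True | | Southampton | no | True |
| 882 | 0 | 3 | 0 | 22.0 | 0 | 0 | 10.5167 | S | 3 | False | | Southampton | no | True |
| 883 | 0 | 2 | 1 | 28.0 | 0 | 0 | 10.5 | S | 2 | True | | Southampton | no | True |
| 884 | 0 | 3 | 1 | 25.0 | 0 | 0 | 7.05 | S | 3 | True | | Southampton | no | True |
| 885 | 0 | 3 | 0 | 39.0 | 0 | 5 | 29.125 | Q | 3 | False | | Queenstown | no | False |
| 886 | 0 | 2 | 1 | 27.0 | 0 | 0 | 13.0 | S | 2 | True | | Southampton | no | True |
| 887 | 1 | 1 | 0 | 19.0 | 0 | 0 | 30.0 | S | 1 | False | B | Southampton | yes | True |
| 888 | 0 | 3 | 0 | 27.915709 | 1 | 2 | 23.45 | S | 3 | False | | Southampton | no | False |
| 889 | 1 | 1 | 1 | 26.0 | 0 | 0 | 30.0 | C | 1 | True | C | Cherbourg | yes | True |
| 890 | 0 | 3 | 1 | 32.0 | 0 | 0 | 7.75 | Q | 3 | True | | Queenstown | no | True |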
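
Because the two `fillna` calls in the cleaning cell run one after the other, the first call already fills every missing age (row 888, a female passenger, shows the female mean of 27.915709). If the intent is to give each passenger the mean age of their own sex, one possible approach is `groupby(...).transform("mean")`. The sketch below is only an illustration: it uses a small made-up frame named `toy` rather than the Titanic data, and `age_filled` is a hypothetical column name.

```python
import numpy as np
import pandas as pd

# Made-up toy data (not from the Titanic set): 1 = male, 0 = female.
toy = pd.DataFrame({
    "male": [1, 1, 1, 0, 0, 0],
    "age":  [20.0, 40.0, np.nan, 10.0, 30.0, np.nan],
})

# Mean age within each sex, broadcast back to every row of that group.
group_mean = toy.groupby("male")["age"].transform("mean")

# Fill each missing age with the mean of that row's own group.
toy["age_filled"] = toy["age"].fillna(group_mean)
print(toy)
```

Applying the same idea to the real data would change the imputed ages for male passengers, so the accuracy figures reported in this notebook would not necessarily carry over.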


Then, train a [Logistic Regression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.predict_proba), [Decision Tree](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html), or [Random Forest](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) model. Use any features and parameters you want. 

Try to get better than 78.0% accuracy on the test set! (This is not required, but encouraged.)
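
As a rough sketch of how the three model families could be compared side by side, the example below fits each one on the same split and reports test accuracy. It runs on synthetic data from `make_classification` so that it needs no download; the dataset, the hyperparameters, and the `candidates` and `scores` names are all assumptions made for illustration, not values from the lecture.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 400 rows, 4 numeric features.
X, y = make_classification(n_samples = 400, n_features = 4, n_informative = 3,
                           n_redundant = 1, random_state = 0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)

# Hypothetical settings, chosen only to keep the sketch small.
candidates = {
    "logistic regression": LogisticRegression(max_iter = 1000),
    "decision tree": DecisionTreeClassifier(max_depth = 4, random_state = 0),
    "random forest": RandomForestClassifier(n_estimators = 100, max_depth = 4,
                                            random_state = 0),
}

# Fit every candidate on the same training rows and score on the same test rows.
scores = {}
for name, clf in candidates.items():
    clf.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, clf.predict(X_test))
    print("%-20s test accuracy: %0.2f%%" % (name, 100.0 * scores[name]))
```

On the Titanic frames, `X_train` and `y_train` would be replaced by `train[features]` and `train[target]` for whichever feature list you choose.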

Do refer to the lecture notebook — but try not to copy-paste.

> You must type each of these exercises in, manually. If you copy and paste, you might as well not even do them. The point of these exercises is to train your hands, your brain, and your mind in how to read, write, and see code. If you copy-paste, you are cheating yourself out of the effectiveness of the lessons. —*[Learn Python the Hard Way](https://learnpythonthehardway.org/book/intro.html)*

After this, you may want to try [Kaggle's Titanic challenge](https://www.kaggle.com/c/titanic)!

In [412]:
# Data preview.
sf = 100*train.survived.value_counts(normalize = True)[1]
print("Percentage of survivors: %0.2f %%" % sf)

Percentage of survivors: 38.62 %
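
The survivor share gives a useful reference point: if 38.62% of the training rows survived, a model that always predicts "did not survive" would already be right on 100 - 38.62 = 61.38% of them. The sketch below shows one way to compute such a majority-class baseline with scikit-learn's `DummyClassifier`. The labels are made up (5 survivors out of 13), so the printed figure is for the toy data only.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Made-up labels (assumption): 5 survivors (1) and 8 non-survivors (0).
y = np.array([1] * 5 + [0] * 8)
X = np.zeros((len(y), 1))  # This baseline ignores the features entirely.

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy = "most_frequent")
baseline.fit(X, y)

baseline_score = accuracy_score(y, baseline.predict(X))
print("Baseline accuracy: %0.2f%%" % (100.0 * baseline_score))
```

Any trained model is worth keeping only if it beats this kind of baseline on the test set.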


In [413]:
# Create model.
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

features = ["male", "age", "fare", "class"]
target = "survived"
model = DecisionTreeClassifier(max_depth = 4)
model.fit(train[features], train[target])

# Train accuracy
y_true = train[target]
y_pred = model.predict(train[features])
train_score = accuracy_score(y_true, y_pred)
print(type(float(train_score)))  # Sanity check: the score converts to a plain float.


# Test accuracy
y_true = test[target]
y_pred = model.predict(test[features])
test_score = accuracy_score(y_true, y_pred)
print("Train accuracy: %0.2f%%   ;  Test accuracy: %0.2f%%"
      % (100.0*train_score, 100.0*test_score))
print('Test accuracy is slightly above the 78% target!')

<class 'float'>
Train accuracy: 84.58%   ;  Test accuracy: 81.61%
Test accuracy is slightly above the 78% target!
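
The gap between train and test accuracy is one hint of how much a tree has fitted noise. A quick way to explore this is to sweep `max_depth` and print both scores for each setting. The sketch below does that on synthetic data; the dataset, the depth values, and the `results` list are assumptions for illustration and say nothing about the Titanic results above.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption) with some label noise via flip_y.
X, y = make_classification(n_samples = 400, n_features = 4, n_informative = 3,
                           n_redundant = 1, flip_y = 0.1, random_state = 0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)

# Fit one tree per depth; None lets the tree grow until its leaves are pure.
results = []
for depth in [1, 2, 4, 8, None]:
    tree = DecisionTreeClassifier(max_depth = depth, random_state = 0)
    tree.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, tree.predict(X_train))
    test_acc = accuracy_score(y_test, tree.predict(X_test))
    results.append((depth, train_acc, test_acc))
    print("max_depth = %-4s  train: %0.2f%%  test: %0.2f%%"
          % (depth, 100.0 * train_acc, 100.0 * test_acc))
```

A depth where train accuracy keeps rising while test accuracy stalls or drops is usually a sign that the extra splits are not generalizing.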


In [414]:
from sklearn.metrics import confusion_matrix
pd.DataFrame(confusion_matrix(y_true, y_pred),
             index   = ["Actual 0",    "Actual 1"   ],
             columns = ["Predicted 0", "Predicted 1"])

| | Predicted 0 | Predicted 1 |
|---|---|---|
| Actual 0 | 126 | 13 |
| Actual 1 | 28 | 56 |
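
The four counts in the confusion matrix are enough to recover the test accuracy and to see where the errors fall. The sketch below copies the counts from the matrix above and derives accuracy, precision, and recall for the "survived" class by hand; the variable names are chosen here for illustration.

```python
import numpy as np

# Counts copied from the confusion matrix (rows: actual, columns: predicted).
cm = np.array([[126, 13],
               [ 28, 56]])
tn, fp, fn, tp = cm.ravel()

accuracy  = (tp + tn) / cm.sum()  # 182 / 223
precision = tp / (tp + fp)        # 56 / 69
recall    = tp / (tp + fn)        # 56 / 84

print("Accuracy:  %0.2f%%" % (100.0 * accuracy))   # 81.61%
print("Precision: %0.2f%%" % (100.0 * precision))  # 81.16%
print("Recall:    %0.2f%%" % (100.0 * recall))     # 66.67%
```

The accuracy matches the 81.61% test score, while the lower recall shows that most of the mistakes are survivors predicted as non-survivors (28 of 84).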
