<a href="https://colab.research.google.com/github/sapinspys/lambda-ds-precourse/blob/master/LSDS_Intro_Assignment_8_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lambda School, Intro to Data Science, Day 8 — Classification!

## Assignment

Run this cell to load the Titanic data:

In [0]:
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
train, test = train_test_split(sns.load_dataset('titanic').drop(columns=['alive']), random_state=0)
target = 'survived'

Then, train a [Logistic Regression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.predict_proba), [Decision Tree](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html), or [Random Forest](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) model. Use any features and parameters you want. 

Try to get better than 78.0% accuracy on the test set! (This is not required, but encouraged.)

Do refer to the lecture notebook — but try not to copy-paste.

> You must type each of these exercises in, manually. If you copy and paste, you might as well not even do them. The point of these exercises is to train your hands, your brain, and your mind in how to read, write, and see code. If you copy-paste, you are cheating yourself out of the effectiveness of the lessons. —*[Learn Python the Hard Way](https://learnpythonthehardway.org/book/intro.html)*

After this, you may want to try [Kaggle's Titanic challenge](https://www.kaggle.com/c/titanic)!

In [48]:
train.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alone
105,0,3,male,28.0,0,0,7.8958,S,Third,man,True,,Southampton,True
68,1,3,female,17.0,4,2,7.925,S,Third,woman,False,,Southampton,False
253,0,3,male,30.0,1,0,16.1,S,Third,man,True,,Southampton,False
320,0,3,male,22.0,0,0,7.25,S,Third,man,True,,Southampton,True
706,1,2,female,45.0,0,0,13.5,S,Second,woman,False,,Southampton,True


In [49]:
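# numeric_only=True skips text columns such as sex and embarked (pandas 2.x raises an error without it)
corr = train.corr(numeric_only=True)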
corr.style.background_gradient(cmap='coolwarm')

Unnamed: 0,survived,pclass,age,sibsp,parch,fare,adult_male,alone
survived,1.0,-0.326264,-0.0831464,-0.050014,0.0815843,0.228372,-0.568379,-0.195185
pclass,-0.326264,1.0,-0.358802,0.0985966,0.0173881,-0.532116,0.119928,0.130358
age,-0.0831464,-0.358802,1.0,-0.305921,-0.186961,0.103862,0.282477,0.20218
sibsp,-0.050014,0.0985966,-0.305921,1.0,0.425168,0.136919,-0.241871,-0.569505
parch,0.0815843,0.0173881,-0.186961,0.425168,1.0,0.206725,-0.338335,-0.590335
fare,0.228372,-0.532116,0.103862,0.136919,0.206725,1.0,-0.167024,-0.250177
adult_male,-0.568379,0.119928,0.282477,-0.241871,-0.338335,-0.167024,1.0,0.394433
alone,-0.195185,0.130358,0.20218,-0.569505,-0.590335,-0.250177,0.394433,1.0
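
A quick way to read the `survived` row is to rank the other columns by the absolute value of their correlation with the target; in the table above that puts `adult_male` (-0.57) and `pclass` (-0.33) first. The sketch below shows the idea on a small made-up frame, so `toy` and its values are assumptions rather than the Titanic data:

```python
import pandas as pd

# Toy data standing in for the Titanic frame (values are made up).
toy = pd.DataFrame({
    "survived":   [0, 1, 0, 0, 1, 1, 0, 1],
    "adult_male": [True, False, True, True, False, False, True, True],
    "fare":       [7.9, 7.9, 16.1, 7.3, 13.5, 80.0, 8.1, 26.0],
})

# Cast booleans to 0/1, then take each column's Pearson correlation with the target.
corr_with_target = toy.astype(float).corr()["survived"].drop("survived")

# Rank by strength, ignoring the sign of the correlation.
ranked = corr_with_target.abs().sort_values(ascending=False)
print(ranked)
```

A strong correlation only suggests a feature is worth trying; it does not guarantee a better model.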


## Logistic Regression: Attempt 1

In [50]:
from sklearn.linear_model import LogisticRegression

features = ["adult_male", "alone"]

model = LogisticRegression()
model.fit(train[features], train[target])



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [51]:
from sklearn.metrics import accuracy_score

y_true = train[target]
y_pred = model.predict(train[features])
print('Train accuracy:', accuracy_score(y_true, y_pred))

y_true = test[target]
y_pred = model.predict(test[features])
print('Test accuracy:', accuracy_score(y_true, y_pred))

Train accuracy: 0.7949101796407185
Test accuracy: 0.7713004484304933
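
For reference, `accuracy_score` reports the fraction of predictions that match the true labels. A minimal pure-Python sketch of that calculation on made-up labels (the `accuracy` helper is illustrative, not scikit-learn's implementation):

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Made-up labels, for illustration only.
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]
print(accuracy(y_true, y_pred))  # 4 of 5 match
```

With 223 rows in the test split, a test accuracy of 0.7713 corresponds to 172 correct predictions.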


## Decision Tree: Attempt 2

Our first attempt reached 77.1% test accuracy, just under the 78% goal. Let's see if a decision tree can do better with the same features.


In [52]:
from sklearn.tree import DecisionTreeClassifier

features = ["adult_male", "alone"]

model = DecisionTreeClassifier(max_depth=2)
model.fit(train[features], train[target])

y_true = train[target]
y_pred = model.predict(train[features])
print('Train accuracy:', accuracy_score(y_true, y_pred))

y_true = test[target]
y_pred = model.predict(test[features])
print('Test accuracy:', accuracy_score(y_true, y_pred))

Train accuracy: 0.7949101796407185
Test accuracy: 0.7713004484304933
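
One likely explanation for the tree matching the logistic regression exactly: with two boolean features there are only four distinct inputs, so a model can do no more than assign one label to each of the four groups. The sketch below uses made-up rows (not the Titanic data) to build the best possible lookup table by majority vote; no deterministic classifier on these two features can beat its training accuracy.

```python
from collections import Counter, defaultdict

# Made-up rows: (adult_male, alone, survived). Not the Titanic data.
rows = [
    (True, True, 0), (True, True, 0), (True, True, 1),
    (True, False, 0), (True, False, 0),
    (False, True, 1), (False, True, 1), (False, True, 0),
    (False, False, 1), (False, False, 1),
]

# Group labels by feature combination; at most 2 * 2 = 4 groups exist.
labels_by_cell = defaultdict(list)
for adult_male, alone, survived in rows:
    labels_by_cell[(adult_male, alone)].append(survived)

# The best any model can do is predict each group's most common label.
lookup = {cell: Counter(labels).most_common(1)[0][0]
          for cell, labels in labels_by_cell.items()}

correct = sum(lookup[(m, a)] == s for m, a, s in rows)
print(lookup)
print("Best achievable train accuracy:", correct / len(rows))
```

If both models end up with the same four-cell labeling, their train and test scores will be identical, whatever the algorithm.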


## Multi-Model: Attempt 3

Same results. Let's try a different set of features.

In [53]:
# Encoding sex
train['female'] = train.sex == 'female'
test['female'] = test.sex == 'female'

train[['sex', 'female']].head()

Unnamed: 0,sex,female
105,male,False
68,female,True
253,male,False
320,male,False
706,female,True


In [54]:
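# Encoding class as an integer rank (same information as the pclass column)
print(f"{train['class'].value_counts()}\n")

class_ranks = {"First": 1, "Second": 2, "Third": 3}
# astype(int) keeps the result a plain integer column rather than a category
train["class"] = train["class"].map(class_ranks).astype(int)
test["class"] = test["class"].map(class_ranks).astype(int)

train.head()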

Third     367
First     163
Second    138
Name: class, dtype: int64



Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alone,female
105,0,3,male,28.0,0,0,7.8958,S,3,man,True,,Southampton,True,False
68,1,3,female,17.0,4,2,7.925,S,3,woman,False,,Southampton,False,True
253,0,3,male,30.0,1,0,16.1,S,3,man,True,,Southampton,False,False
320,0,3,male,22.0,0,0,7.25,S,3,man,True,,Southampton,True,False
706,1,2,female,45.0,0,0,13.5,S,2,woman,False,,Southampton,True,True


In [55]:
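# Imputing age: fill missing ages with each split's own mean.
# Assigning the result back avoids the chained inplace fillna that newer pandas warns about.
train['age'] = train['age'].fillna(train['age'].mean())
test['age'] = test['age'].fillna(test['age'].mean())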

print(train.age.isnull().sum())
print(test.age.isnull().sum())

0
0
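
Filling the test split with its own mean, as above, uses information from the test rows. A common alternative is to compute the fill value on the training rows only and reuse it for both splits. A minimal sketch on made-up ages (`toy_train`, `toy_test`, and `train_age_mean` are illustrative names); switching the notebook to this approach could change the reported accuracies slightly:

```python
import pandas as pd

# Made-up ages; None becomes NaN and marks a missing value.
toy_train = pd.DataFrame({"age": [20.0, None, 40.0]})
toy_test = pd.DataFrame({"age": [None, 50.0]})

# Learn the fill value from the training rows only (NaN is skipped by mean)...
train_age_mean = toy_train["age"].mean()

# ...then apply that same value to both splits.
toy_train["age"] = toy_train["age"].fillna(train_age_mean)
toy_test["age"] = toy_test["age"].fillna(train_age_mean)

print(toy_train["age"].tolist())
print(toy_test["age"].tolist())
```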


In [56]:
# Logistic Regression

features = ["class", "female", "age"]

model = LogisticRegression()
model.fit(train[features], train[target])

y_true = train[target]
y_pred = model.predict(train[features])
print('Train accuracy:', accuracy_score(y_true, y_pred))

y_true = test[target]
y_pred = model.predict(test[features])
print('Test accuracy:', accuracy_score(y_true, y_pred))

Train accuracy: 0.7964071856287425
Test accuracy: 0.7757847533632287




In [57]:
# Decision Tree

model = DecisionTreeClassifier(max_depth=2)
model.fit(train[features], train[target])

y_true = train[target]
y_pred = model.predict(train[features])
print('Train accuracy:', accuracy_score(y_true, y_pred))

y_true = test[target]
y_pred = model.predict(test[features])
print('Test accuracy:', accuracy_score(y_true, y_pred))

Train accuracy: 0.7949101796407185
Test accuracy: 0.7757847533632287
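
The train/test evaluation lines are repeated for every model. One way to cut the repetition is a small helper. The sketch below is only an assumption about how that could look: `report_accuracy` and the `AlwaysZero` stand-in model are hypothetical names, and toy frames take the place of the Titanic splits.

```python
import numpy as np
import pandas as pd

def report_accuracy(model, frames, features, target):
    """Print and return the accuracy of `model` on each named DataFrame."""
    scores = {}
    for name, df in frames.items():
        y_pred = np.asarray(model.predict(df[features]))
        scores[name] = float((y_pred == df[target].to_numpy()).mean())
        print(f"{name} accuracy: {scores[name]}")
    return scores

class AlwaysZero:
    """Hypothetical stand-in for a fitted model: predicts 0 for every row."""
    def predict(self, X):
        return [0] * len(X)

# Made-up frames, for illustration only.
toy_train = pd.DataFrame({"female": [True, False, False, True], "survived": [1, 0, 0, 1]})
toy_test = pd.DataFrame({"female": [False, False, True], "survived": [0, 0, 1]})

scores = report_accuracy(AlwaysZero(), {"Train": toy_train, "Test": toy_test},
                         features=["female"], target="survived")
```

In the notebook this could be called as `report_accuracy(model, {'Train': train, 'Test': test}, features, target)` after fitting each model.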


## Random Forest: Attempt 4

Both models are stuck at the same 77.6% test accuracy. Will a random forest be what takes us above 78%?

In [59]:
from sklearn.ensemble import RandomForestClassifier

features = ["class", "female", "age"]

# No random_state is set, so the scores below can change from run to run
model = RandomForestClassifier(n_estimators=10, max_depth=2)
model.fit(train[features], train[target])

y_true = train[target]
y_pred = model.predict(train[features])
print('Train accuracy:', accuracy_score(y_true, y_pred))

y_true = test[target]
y_pred = model.predict(test[features])
print('Test accuracy:', accuracy_score(y_true, y_pred))

Train accuracy: 0.7994011976047904
Test accuracy: 0.820627802690583


### Wooo! Our random forest model reaches 82% test accuracy using the class, female, and age features.
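
Because the forest above is built without `random_state`, rerunning the cell can give a different accuracy, so the 82% figure is best read as one draw rather than a fixed result. A rough way to see how much the score moves is to refit with several seeds. The sketch below does this on synthetic data from `make_classification`, which is an assumption standing in for the Titanic features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the Titanic features (an assumption).
X, y = make_classification(n_samples=400, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Refit the same small forest with several seeds and compare test accuracy.
scores = []
for seed in range(5):
    forest = RandomForestClassifier(n_estimators=10, max_depth=2,
                                    random_state=seed)
    forest.fit(X_train, y_train)
    scores.append(forest.score(X_test, y_test))

print("Test accuracy per seed:", [round(s, 3) for s in scores])
print("Spread:", round(max(scores) - min(scores), 3))
```

Passing a fixed `random_state` to `RandomForestClassifier` in the notebook would make the reported accuracy repeatable.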