# Exercises

## Using the Titanic data, in your classification-exercises repository, create a notebook, model.ipynb, where you will do the following:

- 1. What is your baseline prediction? What is your baseline accuracy? Remember: your baseline prediction for a classification problem is predicting the most prevalent class in the training dataset (the mode). When you make those predictions, what is your accuracy? This is your baseline accuracy.

- 2. Fit the decision tree classifier to your training sample and transform (i.e. make predictions on the training sample)

- 3. Evaluate your in-sample results using the model score, confusion matrix, and classification report.

- 4. Compute: Accuracy, true positive rate, false positive rate, true negative rate, false negative rate, precision, recall, f1-score, and support.

- 5. Run through steps 2-4 using a different max_depth value.

- 6. Which model performs better on your in-sample data?

- 7. Which model performs best on your out-of-sample data, the validate set?

In [1]:
# ignore warnings
import warnings
warnings.filterwarnings("ignore")

import numpy as np

from pydataset import data

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_graphviz
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import pandas as pd
import prepare
import acquire

In [2]:
# df = pd.read_csv('titanic_df.csv')

In [3]:
# df = acquire.get_titanic_data()

In [4]:
df = data('titanic')

In [5]:
df.head()

Unnamed: 0,class,age,sex,survived
1,1st class,adults,man,yes
2,1st class,adults,man,yes
3,1st class,adults,man,yes
4,1st class,adults,man,yes
5,1st class,adults,man,yes


def clean_data(df):
    '''
    This function will drop any duplicate observations, 
    drop ['deck', 'embarked', 'class', 'age'], fill missing embark_town with 'Southampton'
    and create dummy vars from sex and embark_town. 
    '''
    df = df.drop_duplicates()
    df = df.drop(columns=['deck', 'embarked', 'class', 'age'])
    df['embark_town'] = df.embark_town.fillna(value='Southampton')
    dummy_df = pd.get_dummies(df[['sex', 'embark_town']], drop_first=True)
    df = pd.concat([df, dummy_df], axis=1)
    return df

def split_data(df):
    '''
    Take in a DataFrame and return train, validate, and test DataFrames,
    stratified on survived.
    '''
    train_validate, test = train_test_split(df, test_size=.2, random_state=123, stratify=df.survived)
    train, validate = train_test_split(train_validate, 
                                       test_size=.3, 
                                       random_state=123, 
                                       stratify=train_validate.survived)
    return train, validate, test

In [8]:
df.head()

Unnamed: 0,class,age,sex,survived
1,1st class,adults,man,yes
2,1st class,adults,man,yes
3,1st class,adults,man,yes
4,1st class,adults,man,yes
5,1st class,adults,man,yes


In [101]:
df.tail()

Unnamed: 0.1,Unnamed: 0,passenger_id,survived,pclass,sex,age,sibsp,parch,fare,embarked,...,alone,sex_male,embark_town_Queenstown,embark_town_Southampton,sex_male.1,embark_town_Queenstown.1,embark_town_Southampton.1,sex_male.2,embark_town_Queenstown.2,embark_town_Southampton.2
886,886,886,0,2,male,27.0,0,0,13.0,S,...,1,1,0,1,1,0,1,1,0,1
887,887,887,1,1,female,19.0,0,0,30.0,S,...,1,0,0,1,0,0,1,0,0,1
888,888,888,0,3,female,,1,2,23.45,S,...,0,0,0,1,0,0,1,0,0,1
889,889,889,1,1,male,26.0,0,0,30.0,C,...,1,1,0,0,1,0,0,1,0,0
890,890,890,0,3,male,32.0,0,0,7.75,Q,...,1,1,1,0,1,1,0,1,1,0


df = clean_data(df)

def prep_titanic_data(df):
    '''
    This function takes in a df and will drop any duplicate observations, 
    drop ['deck', 'embarked', 'class', 'age'], fill missing embark_town with 'Southampton'
    create dummy vars from sex and embark_town, and perform a train, validate, test split. 
    Returns train, validate, and test DataFrames
    '''
    df = clean_data(df)
    train, validate, test = split_data(df)
    return train, validate, test

train, validate, test = prep_titanic_data(df)
train.info()

In [90]:
df.survived.nunique()

2

In [91]:
df.nunique()

Unnamed: 0                 891
passenger_id               891
survived                     2
pclass                       3
sex                          2
age                         88
sibsp                        7
parch                        7
fare                       248
embarked                     3
class                        3
deck                         7
embark_town                  3
alone                        2
sex_male                     2
embark_town_Queenstown       2
embark_town_Southampton      2
sex_male                     2
embark_town_Queenstown       2
embark_town_Southampton      2
dtype: int64

# What is your baseline prediction? What is your baseline accuracy? 
### Remember: your baseline prediction for a classification problem is predicting the most prevalent class in the training dataset (the mode). When you make those predictions, what is your accuracy? This is your baseline accuracy.

Baseline prediction: survived = no (the most common class)

In [92]:
# convert column names to lowercase, replace '.' in column names with '_'
# df.columns = [col.lower().replace('.', '_') for col in df]

In [20]:
df.head()

Unnamed: 0,class,age,sex,survived
1,1st class,adults,man,yes
2,1st class,adults,man,yes
3,1st class,adults,man,yes
4,1st class,adults,man,yes
5,1st class,adults,man,yes


In [21]:
df.describe()

Unnamed: 0,class,age,sex,survived
count,1316,1316,1316,1316
unique,3,2,2,2
top,3rd class,adults,man,no
freq,706,1207,869,817


In [33]:
#df.replace([('1st class', 1), ('2nd class', 2), ('3rd class', 3), ('man', 0), ('woman', 1), ('yes', 1), ('no', 0)])
df = df.replace(['1st class', '2nd class', '3rd class', 'man', 'women', 'yes', 'no', 'adults', 'child'], [1, 2, 3, 0, 1, 1, 0, 1, 0])

In [68]:
df.head()

Unnamed: 0,class,age,sex,survived
1,1,1,0,1
2,1,1,0,1
3,1,1,0,1
4,1,1,0,1
5,1,1,0,1


In [34]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1316 entries, 1 to 1316
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   class     1316 non-null   int64
 1   age       1316 non-null   int64
 2   sex       1316 non-null   int64
 3   survived  1316 non-null   int64
dtypes: int64(4)
memory usage: 51.4 KB


In [53]:
from sklearn.model_selection import train_test_split

def train_validate_test_split(df, target, seed=123):
    '''
    This function takes in a dataframe, the name of the target variable
    (for stratification purposes), and an integer for setting a seed
    and splits the data into train, validate and test. 
    Test is 20% of the original dataset, validate is .30*.80= 24% of the 
    original dataset, and train is .70*.80= 56% of the original dataset. 
    The function returns, in this order, train, validate and test dataframes. 
    '''
    train_validate, test = train_test_split(df, test_size=0.2, 
                                            random_state=seed, 
                                            stratify=df[target])
    train, validate = train_test_split(train_validate, test_size=0.3, 
                                       random_state=seed,
                                       stratify=train_validate[target])
    return train, validate, test
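
The proportions in the docstring above can be checked with plain arithmetic. The sketch below is not part of the original notebook: `split_sizes` is a hypothetical helper, and it assumes scikit-learn rounds the held-out share up to a whole row, which is consistent with the 736 training rows and 316 validate rows reported in the classification reports of this notebook.

```python
import math


def split_sizes(n_rows, test_size=0.2, validate_size=0.3):
    """Row counts for a two-stage split (hypothetical helper).

    Assumes each held-out share is rounded up to a whole row.
    """
    n_test = math.ceil(n_rows * test_size)
    n_train_validate = n_rows - n_test
    n_validate = math.ceil(n_train_validate * validate_size)
    n_train = n_train_validate - n_validate
    return n_train, n_validate, n_test


# 1316 is the row count reported by df.info() for the recoded titanic data.
n_train, n_validate, n_test = split_sizes(1316)

for name, n in [("train", n_train), ("validate", n_validate), ("test", n_test)]:
    print(f"{name:>8}: {n} rows ({n / 1316:.1%} of the data)")
```

The shares land close to the 56% / 24% / 20% stated in the docstring; the small differences come from rounding to whole rows.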

In [54]:
# split into train, validate, test
train, validate, test = train_validate_test_split(df, target='survived', seed=123)

# create X & y version of train, where y is a series with just the target variable and X are all the features. 

X_train = train.drop(columns=['survived'])
y_train = train.survived

X_validate = validate.drop(columns=['survived'])
y_validate = validate.survived

X_test = test.drop(columns=['survived'])
y_test = test.survived

In [55]:
clf = DecisionTreeClassifier(max_depth=3, random_state=123)

In [56]:
clf = clf.fit(X_train, y_train)

In [57]:
clf

DecisionTreeClassifier(max_depth=3, random_state=123)

In [58]:
import graphviz

# class_names must be strings; clf.classes_ holds the integers 0 and 1 after the
# recoding above, which raised
# "TypeError: can only concatenate str (not "numpy.int64") to str"
dot_data = export_graphviz(clf,
                           feature_names=X_train.columns.tolist(),
                           class_names=[str(c) for c in clf.classes_],
                           rounded=True, filled=True, out_file=None)
graph = graphviz.Source(dot_data)

graph.render('titanic_decision_tree', view=True)

In [59]:
y_pred = clf.predict(X_train)
y_pred[0:5]

array([0, 0, 0, 0, 0])

In [60]:
y_pred_proba = clf.predict_proba(X_train)
y_pred_proba[0:5]

array([[0.57281553, 0.42718447],
       [0.85207101, 0.14792899],
       [0.85207101, 0.14792899],
       [0.85207101, 0.14792899],
       [0.625     , 0.375     ]])
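
The two outputs above are linked: for a decision tree, `predict` returns the class whose estimated probability is highest. The sketch below re-enters the five probability rows printed above by hand (it does not call the fitted `clf`) and takes the arg-max of each row; `classes` is an assumed stand-in for `clf.classes_`.

```python
import numpy as np

# First five rows of y_pred_proba, copied from the output above.
proba = np.array([
    [0.57281553, 0.42718447],
    [0.85207101, 0.14792899],
    [0.85207101, 0.14792899],
    [0.85207101, 0.14792899],
    [0.625, 0.375],
])

# Assumed stand-in for clf.classes_: column 0 is class 0, column 1 is class 1.
classes = np.array([0, 1])

# Pick the class with the largest probability in each row.
pred = classes[proba.argmax(axis=1)]
print(pred)  # [0 0 0 0 0]
```

This matches `y_pred[0:5]` above: every one of the first five passengers is predicted as class 0 (did not survive).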

In [61]:
print('Accuracy of Decision Tree classifier on training set: {:.2f}'
      .format(clf.score(X_train, y_train)))

Accuracy of Decision Tree classifier on training set: 0.78


In [62]:
confusion_matrix(y_train, y_pred)

array([[443,  14],
       [145, 134]])

In [63]:
y_train.value_counts()

0    457
1    279
Name: survived, dtype: int64
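
These class counts are enough to answer exercise 1. The sketch below rebuilds a target series from the counts printed above instead of using the notebook's `y_train`; the name `y_train_rebuilt` is hypothetical. It predicts the mode for every row and measures how often that is right.

```python
import pandas as pd

# Rebuild the training target from the counts above: 457 zeros, 279 ones.
y_train_rebuilt = pd.Series([0] * 457 + [1] * 279, name="survived")

# Baseline prediction: the most frequent class in the training data.
baseline_prediction = y_train_rebuilt.mode()[0]

# Baseline accuracy: share of rows where the constant prediction is correct.
baseline_accuracy = (y_train_rebuilt == baseline_prediction).mean()

print(f"Baseline prediction: {baseline_prediction}")
print(f"Baseline accuracy: {baseline_accuracy:.2f}")
```

With the real data, the same two lines work on `y_train` directly. A model is only useful if it beats this number, so the training accuracy of 0.78 reported above should be read against it.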

In [64]:
labels = sorted(y_train.unique())

pd.DataFrame(confusion_matrix(y_train, y_pred), index=labels, columns=labels)

Unnamed: 0,0,1
0,443,14
1,145,134
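
The metrics asked for in exercise 4 can all be derived from the four cells of this matrix. The sketch below copies the training confusion matrix printed above and treats class 1 (survived) as the positive class, which is an assumption about how the exercise wants the rates defined.

```python
import numpy as np

# Training confusion matrix from above: rows are actual, columns are predicted.
cm = np.array([[443, 14],
               [145, 134]])

# With labels ordered [0, 1], ravel() yields tn, fp, fn, tp.
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
true_positive_rate = tp / (tp + fn)    # also called recall or sensitivity
false_positive_rate = fp / (fp + tn)
true_negative_rate = tn / (tn + fp)    # also called specificity
false_negative_rate = fn / (fn + tp)
precision = tp / (tp + fp)
f1_score = 2 * precision * true_positive_rate / (precision + true_positive_rate)
support_positive = tp + fn             # actual survivors
support_negative = tn + fp             # actual non-survivors

metrics = {
    "accuracy": accuracy,
    "true positive rate (recall)": true_positive_rate,
    "false positive rate": false_positive_rate,
    "true negative rate": true_negative_rate,
    "false negative rate": false_negative_rate,
    "precision": precision,
    "f1-score": f1_score,
}
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
print(f"support: {support_positive} positive, {support_negative} negative")
```

In the notebook itself, `confusion_matrix(y_train, y_pred).ravel()` would supply the same four counts without copying them by hand.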


In [65]:
print(classification_report(y_train, y_pred))

              precision    recall  f1-score   support

           0       0.75      0.97      0.85       457
           1       0.91      0.48      0.63       279

    accuracy                           0.78       736
   macro avg       0.83      0.72      0.74       736
weighted avg       0.81      0.78      0.76       736



In [66]:
print('Accuracy of Decision Tree classifier on validate set: {:.2f}'
     .format(clf.score(X_validate, y_validate)))

Accuracy of Decision Tree classifier on validate set: 0.81


In [67]:
# Produce y_predictions that come from the X_validate
y_pred = clf.predict(X_validate)

# Compare actual y values (from validate) to predicted y_values from the model run on X_validate
print(classification_report(y_validate, y_pred))

              precision    recall  f1-score   support

           0       0.77      0.98      0.87       196
           1       0.95      0.53      0.68       120

    accuracy                           0.81       316
   macro avg       0.86      0.75      0.77       316
weighted avg       0.84      0.81      0.79       316
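
Exercises 5 through 7 ask for the same fit and evaluation at other `max_depth` values, compared in-sample and out-of-sample. The sketch below shows one way to organize that comparison. It is self-contained, so it runs on hypothetical random stand-in data shaped like the recoded titanic frame; in the notebook, the stand-in split would be replaced by `X_train`, `y_train`, `X_validate`, and `y_validate`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in data (NOT the titanic data): three integer-coded features.
rng = np.random.default_rng(123)
n = 1000
X = pd.DataFrame({
    "class": rng.integers(1, 4, n),
    "age": rng.integers(0, 2, n),
    "sex": rng.integers(0, 2, n),
})
# Made-up target that depends on sex and class, with some noise.
y = ((X["sex"] == 1) | ((X["class"] == 1) & (rng.random(n) < 0.5))).astype(int)

X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.3, random_state=123, stratify=y
)

# Fit one tree per depth and record in-sample and out-of-sample accuracy.
rows = []
for depth in range(1, 6):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=123)
    tree.fit(X_tr, y_tr)
    train_acc = tree.score(X_tr, y_tr)
    val_acc = tree.score(X_val, y_val)
    rows.append({
        "max_depth": depth,
        "train_accuracy": train_acc,
        "validate_accuracy": val_acc,
        "difference": train_acc - val_acc,
    })

results = pd.DataFrame(rows)
print(results.round(3))
```

Reading the table: the depth with the highest `train_accuracy` answers exercise 6, the depth with the highest `validate_accuracy` answers exercise 7, and a large `difference` between the two suggests the tree is overfitting the training sample.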



In [69]:
# class_names must follow ascending class order: 0 = 'No', 1 = 'Yes'
dot_data = export_graphviz(clf, feature_names=X_train.columns.tolist(),
                           class_names=('No', 'Yes'), rounded=True,
                           filled=True, out_file=None)