# Logistic Regression Exercises

## In these exercises, we'll continue working with the Titanic dataset and building logistic regression models. Throughout this exercise, be sure you are training, evaluating, and comparing models on the train and validate datasets. The test dataset should only be used for your final model. For all of the models you create, choose a threshold that optimizes for accuracy. Create a new notebook, logistic_regression, and use it to answer the following questions:

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
import acquire as ac
import prepare as prep

seed = 100

### 1. Create a model that includes only age, fare, and pclass. Does this model perform better than your baseline?

In [2]:
t = prep.titanic()

train, val, test = prep.train_val_test(t,'survived')

x_train, y_train = prep.split_x_y(train,'survived')
x_val, y_val = prep.split_x_y(val,'survived')

In [3]:
prep.baseline(train, 'survived', 1)

Baseline accuracy is: 61.64%.
Baseline recall is: 0.0%.
Baseline precision is: 0.0%.
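
The `prep.baseline` helper is not shown in this notebook. Its output (about 62% accuracy, 0% recall and precision) is consistent with a baseline that always predicts the most common class, 0, while treating 1 as the positive class. Under that assumption, a minimal sketch of such a baseline could look like the following; `baseline_metrics` is a hypothetical name and the labels are toy values, not the Titanic data.

```python
from collections import Counter


def baseline_metrics(y_true, positive=1):
    """Score a baseline that always predicts the most common class in y_true."""
    y_true = list(y_true)
    majority = Counter(y_true).most_common(1)[0][0]
    y_pred = [majority] * len(y_true)

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for actual, pred in pairs if actual == positive and pred == positive)
    fp = sum(1 for actual, pred in pairs if actual != positive and pred == positive)
    fn = sum(1 for actual, pred in pairs if actual == positive and pred != positive)
    correct = sum(1 for actual, pred in pairs if actual == pred)

    accuracy = correct / len(y_true)
    # With no positive predictions, precision is undefined; report 0.0 by convention.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {
        "prediction": majority,
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
    }


# Toy labels (assumption, for illustration only): 5 non-survivors, 3 survivors.
toy_labels = [0, 0, 0, 0, 0, 1, 1, 1]
result = baseline_metrics(toy_labels)
print(result)
```

Because the baseline never predicts the positive class, it has no true positives, which is why recall and precision are both zero even though accuracy is well above 50%.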



In [4]:
train.head()

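| | survived | age | sibsp | parch | fare | alone | sex_male | class_First | class_Second | class_Third | embark_town_Cherbourg | embark_town_Queenstown | embark_town_Southampton |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 410 | 0 | 28.0 | 0 | 0 | 7.8958 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |
| 824 | 0 | 2.0 | 4 | 1 | 39.6875 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |
| 11 | 1 | 58.0 | 0 | 0 | 26.55 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 851 | 0 | 74.0 | 0 | 0 | 7.775 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |
| 219 | 0 | 30.0 | 0 | 0 | 10.5 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 |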


In [5]:
train.columns

Index(['survived', 'age', 'sibsp', 'parch', 'fare', 'alone', 'sex_male',
       'class_First', 'class_Second', 'class_Third', 'embark_town_Cherbourg',
       'embark_town_Queenstown', 'embark_town_Southampton'],
      dtype='object')

In [6]:
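# Exercise 1 features: age, fare, and passenger class (one-hot encoded), plus the target
new = ['survived', 'age', 'fare', 'class_First', 'class_Second', 'class_Third']

# Select the columns directly instead of copying them one at a time
new_df = t[new].copy()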

train, val, test = prep.train_val_test(new_df,'survived')
x_train, y_train = prep.split_x_y(train,'survived')
x_val, y_val = prep.split_x_y(val,'survived')

logreg = LogisticRegression(random_state = seed, max_iter = 400)

logreg.fit(x_train, y_train)

logreg.score(x_train, y_train), logreg.score(x_val, y_val)

(0.7014446227929374, 0.6791044776119403)

In [7]:
# The model beats the 61.64% baseline, but only modestly: about 8.5 percentage points on train (70.1%) and about 6.3 on validate (67.9%).

### 2. Include sex in your model as well. Note that you'll need to encode or create a dummy variable of this feature before including it in a model.
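
The cells below use a `sex_male` column that already exists in the prepared data, so the encoding step itself is not visible here. As a rough sketch of how such a column might be produced, assuming the raw `sex` column holds the strings `male` and `female`, `pd.get_dummies` can create it. The rows are toy values, not the real dataset.

```python
import pandas as pd

# Toy rows (assumption, for illustration only).
raw = pd.DataFrame({"age": [22.0, 38.0, 26.0], "sex": ["male", "female", "female"]})

# drop_first=True drops the alphabetically first level ('female'),
# leaving a single 0/1 column named 'sex_male'.
encoded = pd.get_dummies(raw, columns=["sex"], drop_first=True, dtype=int)
print(encoded)
```

Dropping one level avoids carrying two perfectly redundant columns (`sex_female` is always `1 - sex_male`).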

In [8]:
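# Exercise 2 features: the exercise 1 columns plus the sex_male dummy variable
new = ['survived', 'age', 'fare', 'class_First', 'class_Second', 'class_Third', 'sex_male']

# Select the columns directly instead of copying them one at a time
new_df = t[new].copy()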

train, val, test = prep.train_val_test(new_df,'survived')
x_train, y_train = prep.split_x_y(train,'survived')
x_val, y_val = prep.split_x_y(val,'survived')

logreg = LogisticRegression(random_state = seed, max_iter = 100)

logreg.fit(x_train, y_train)

logreg.score(x_train, y_train), logreg.score(x_val, y_val)

(0.797752808988764, 0.7686567164179104)
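
The instructions ask for a threshold that optimizes accuracy, while `logreg.score` evaluates the default 0.5 cutoff. One possible way to search for a better cutoff is sketched below. `best_threshold` is a hypothetical helper, and the probabilities and labels are made-up toy values; with the notebook's model they could come from `logreg.predict_proba(x_train)[:, 1]` and `y_train`.

```python
def best_threshold(probabilities, y_true, candidates=None):
    """Return (threshold, accuracy) maximizing accuracy; ties go to the lowest threshold."""
    if candidates is None:
        candidates = [i / 100 for i in range(1, 100)]  # 0.01, 0.02, ..., 0.99
    best = (None, -1.0)
    for threshold in candidates:
        # Predict the positive class when the probability reaches the threshold.
        preds = [1 if p >= threshold else 0 for p in probabilities]
        accuracy = sum(int(a == b) for a, b in zip(preds, y_true)) / len(y_true)
        if accuracy > best[1]:
            best = (threshold, accuracy)
    return best


# Toy probabilities and labels (assumption, for illustration only).
probs = [0.10, 0.35, 0.40, 0.45, 0.60, 0.80]
labels = [0, 0, 1, 1, 1, 1]
threshold, accuracy = best_threshold(probs, labels)
print(threshold, accuracy)
```

On these toy values a cutoff of 0.5 misclassifies the two positives at 0.40 and 0.45, while a cutoff just above 0.35 separates the classes perfectly. The search should be run on train or validate, never on test.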

### 3. Try out other combinations of features and models.

In [9]:
# Same feature set as exercise 2, so the scores below match it
new = ['survived', 'age', 'fare', 'class_First', 'class_Second', 'class_Third', 'sex_male']

# Select the columns directly instead of copying them one at a time
new_df = t[new].copy()

train, val, test = prep.train_val_test(new_df,'survived')
x_train, y_train = prep.split_x_y(train,'survived')
x_val, y_val = prep.split_x_y(val,'survived')

logreg = LogisticRegression(random_state = seed, max_iter = 100)

logreg.fit(x_train, y_train)

logreg.score(x_train, y_train), logreg.score(x_val, y_val)

(0.797752808988764, 0.7686567164179104)

### 4. Use your best 3 models to predict and evaluate on your validate sample.
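
No cell answers this question, and only two distinct feature sets were trained above. As a starting point, the sketch below ranks those two by validate accuracy and reports the train-validate gap. The scores are copied from the earlier outputs; the model labels are informal names chosen for this example.

```python
# (train accuracy, validate accuracy) copied from the earlier cell outputs.
scores = {
    "age + fare + class": (0.7014446227929374, 0.6791044776119403),
    "age + fare + class + sex": (0.797752808988764, 0.7686567164179104),
}

# Rank by validate accuracy, highest first.
ranked = sorted(scores.items(), key=lambda item: item[1][1], reverse=True)
for name, (train_acc, val_acc) in ranked:
    gap = train_acc - val_acc  # a large gap can hint at overfitting
    print(f"{name:<26} train={train_acc:.3f} validate={val_acc:.3f} gap={gap:.3f}")

best_name = ranked[0][0]
```

A third candidate would need to be trained before this comparison satisfies the "best 3" requirement.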

### 5. Choose your best model based on validation performance, and evaluate it on the test dataset. How do the performance metrics compare to validate? To train?