# Fitting a Logistic Regression Model - Lab

## Introduction
You were previously given a broad overview of logistic regression, including two separate packages for creating logistic regression models. In this lab, you'll practice fitting logistic regression models with statsmodels.



## Objectives

You will be able to:
* Implement logistic regression with statsmodels
* Interpret the statistical results associated with regression model parameters


## Review

The statsmodels example we covered had four essential parts:
* Importing the data
* Defining X and y
* Fitting the model
* Analyzing model results

The code corresponding to these four steps was:

```python
import pandas as pd
import statsmodels.api as sm

# Step 1: Importing the data
salaries = pd.read_csv("salaries_final.csv", index_col=0)

# Step 2: Defining X and y
x_feats = ["Race", "Sex", "Age"]
X = pd.get_dummies(salaries[x_feats], drop_first=True, dtype=float)
y = pd.get_dummies(salaries["Target"], dtype=float)

# Step 3: Fitting the model
X = sm.add_constant(X)
logit_model = sm.Logit(y.iloc[:, 1], X)
result = logit_model.fit()

# Step 4: Analyzing model results
result.summary()
```

Most of this should be fairly familiar to you: importing data with pandas, initializing a regression object, and calling the fit method of that object. However, step 2 warrants a slightly more in-depth explanation.

Recall that we fit the salary data using `Race`, `Sex`, and `Age`. Since `Race` and `Sex` are categorical, we converted them to dummy variables using the `get_dummies()` function. By default, `get_dummies()` only converts columns with `object` or `category` data types to dummy variables, so it is safe to pass `Age`. Note that we also passed two additional arguments, `drop_first=True` and `dtype=float`. The `drop_first=True` argument removes the first level of each categorical variable, and the `dtype=float` argument stores all of the dummy variables as floats. The data must be float in order to obtain accurate statistical results from statsmodels. Finally, note that `y` is a pandas DataFrame with two columns, because `Target` was originally a categorical variable; this is why only its second column is passed to the model. With that, it's time to try and define a logistic regression model on your own!
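To make the effect of `drop_first=True` and `dtype=float` concrete, here is a minimal sketch on a toy DataFrame. The `toy` data and the variable names are made up for illustration and are not part of the salaries dataset.

```python
import pandas as pd

# Hypothetical toy data: one categorical column and one numeric column
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, 26.0],
})

# Only the categorical column is expanded; Age passes through unchanged.
# drop_first=True drops the first level ("female"), so a single
# Sex_male column remains, stored as float because of dtype=float.
toy_X = pd.get_dummies(toy, drop_first=True, dtype=float)
print(toy_X)
```

The dropped level (`female` here) acts as the reference category: the coefficient on `Sex_male` is then interpreted relative to it.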

## Your Turn - Step 1: Import the Data

Import the data stored in the file **titanic.csv**.

In [87]:
import pandas as pd

In [88]:
df = pd.read_csv('titanic.csv')
df.head()


|  | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


## Step 2: Define X and y

For your first foray into logistic regression, you are going to attempt to build a model that classifies whether or not an individual survived the Titanic shipwreck (yes, it's a bit morbid). Follow the programming patterns described above to define X and y.

In [89]:
# Your code here
y = df['Survived'].astype(float)
# X = df.drop(['PassengerId', 'Survived', 'Name', 'Ticket', 'Cabin'], axis=1)
x_feats = ['Pclass', 'Sex', 'Age', 'Embarked', 'Fare', 'SibSp', 'Parch']
X = df[x_feats]

In [90]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 7 columns):
Pclass      891 non-null int64
Sex         891 non-null object
Age         714 non-null float64
Embarked    889 non-null object
Fare        891 non-null float64
SibSp       891 non-null int64
Parch       891 non-null int64
dtypes: float64(2), int64(3), object(2)
memory usage: 48.8+ KB


In [91]:
X['Pclass'] = X['Pclass'].astype(str)
X['SibSp'] = X['SibSp'].astype(str)
X['Parch'] = X['Parch'].astype(str)
X.info()
X = pd.get_dummies(X,
                   prefix= ['Pclass', 'Sex', 'Embarked', 'SibSp', 'Parch'],
                   dtype=float,
                   drop_first=True)
X.head()

# Note: How to get all y values that correspond to the X values
# X = X.dropna()
# y = y[y.index.isin(X.index)]

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 7 columns):
Pclass      891 non-null object
Sex         891 non-null object
Age         714 non-null float64
Embarked    889 non-null object
Fare        891 non-null float64
SibSp       891 non-null object
Parch       891 non-null object
dtypes: float64(2), object(5)
memory usage: 48.8+ KB


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


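|  | Age | Fare | Pclass_2 | Pclass_3 | Sex_male | Embarked_Q | Embarked_S | SibSp_1 | SibSp_2 | SibSp_3 | SibSp_4 | SibSp_5 | SibSp_8 | Parch_1 | Parch_2 | Parch_3 | Parch_4 | Parch_5 | Parch_6 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 22.0 | 7.25 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 38.0 | 71.2833 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 26.0 | 7.925 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 35.0 | 53.1 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 35.0 | 8.05 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |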
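The "A value is trying to be set on a copy of a slice" message in the output above is pandas' `SettingWithCopyWarning`. It appears because `X` was created by selecting columns from `df` and was then modified. One common way to avoid it is to take an explicit copy before modifying, as sketched below on a made-up frame (`df_demo` and `X_demo` are illustrative names, not objects from this lab).

```python
import pandas as pd

# Hypothetical stand-in for the Titanic frame
df_demo = pd.DataFrame({
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "female"],
    "Fare": [7.25, 71.2833, 7.925],
})

# .copy() makes X_demo an independent DataFrame, so assigning to its
# columns is unambiguous and leaves df_demo untouched.
X_demo = df_demo[["Pclass", "Sex"]].copy()
X_demo["Pclass"] = X_demo["Pclass"].astype(str)

print(X_demo["Pclass"].tolist())
print(df_demo["Pclass"].tolist())
```

This is the same pattern used later in the Level-up section, where the feature frame is built with `.copy()`.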


In [92]:
import statsmodels.api as sm
X = sm.add_constant(X)
X.head()

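|  | const | Age | Fare | Pclass_2 | Pclass_3 | Sex_male | Embarked_Q | Embarked_S | SibSp_1 | SibSp_2 | SibSp_3 | SibSp_4 | SibSp_5 | SibSp_8 | Parch_1 | Parch_2 | Parch_3 | Parch_4 | Parch_5 | Parch_6 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 22.0 | 7.25 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 1.0 | 38.0 | 71.2833 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 1.0 | 26.0 | 7.925 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 1.0 | 35.0 | 53.1 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 1.0 | 35.0 | 8.05 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |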


## Step 3: Fit the model

Now with everything in place, initialize a regression object and fit your model!

### Warning: If you receive an error of the form "LinAlgError: Singular matrix"

This error means statsmodels was unable to fit the model because a matrix it needed to invert was not full rank. In layman's terms, there was a lot of redundant, superfluous data: at least one feature can be written as a combination of the others. Try removing some features from the model and running it again.
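As a rough illustration of what "not full rank" means, the sketch below builds two small design matrices from made-up data and compares their rank to their number of columns. Keeping every dummy level together with a constant column is one typical way to create the redundancy described above; the names `redundant` and `reduced` are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a single categorical feature
toy = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})

# Every dummy level plus a constant: Sex_female + Sex_male equals const,
# so the three columns are linearly dependent.
redundant = pd.get_dummies(toy, dtype=float)
redundant.insert(0, "const", 1.0)

# Dropping the first level removes the dependency.
reduced = pd.get_dummies(toy, drop_first=True, dtype=float)
reduced.insert(0, "const", 1.0)

rank_redundant = np.linalg.matrix_rank(redundant.to_numpy())
rank_reduced = np.linalg.matrix_rank(reduced.to_numpy())

print("redundant: columns =", redundant.shape[1], "rank =", rank_redundant)
print("reduced:   columns =", reduced.shape[1], "rank =", rank_reduced)
```

A matrix is full rank when its rank equals its number of columns, which holds for `reduced` but not for `redundant`.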

In [93]:
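# Assign the result back instead of calling fillna(inplace=True) on a column,
# which newer pandas versions warn about and may not apply to X at all
X['Age'] = X['Age'].fillna(X['Age'].median())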

In [94]:
X.isna().sum()

const         0
Age           0
Fare          0
Pclass_2      0
Pclass_3      0
Sex_male      0
Embarked_Q    0
Embarked_S    0
SibSp_1       0
SibSp_2       0
SibSp_3       0
SibSp_4       0
SibSp_5       0
SibSp_8       0
Parch_1       0
Parch_2       0
Parch_3       0
Parch_4       0
Parch_5       0
Parch_6       0
dtype: int64

In [95]:
logit_model = sm.Logit(y, X)
result = logit_model.fit()


         Current function value: 0.429996
         Iterations: 35




In [96]:
parameters = result.params

## Step 4: Analyzing results

Generate the summary table for your model. Then, comment on the p-values associated with the various features you chose.

In [97]:
result.summary()

|  |  |  |  |
|---|---|---|---|
| Dep. Variable: | Survived | No. Observations: | 891 |
| Model: | Logit | Df Residuals: | 871 |
| Method: | MLE | Df Model: | 19 |
| Date: | Tue, 15 Oct 2019 | Pseudo R-squ.: | 0.3543 |
| Time: | 00:58:25 | Log-Likelihood: | -383.13 |
| converged: | False | LL-Null: | -593.33 |
|  |  | LLR p-value: | 2.479e-77 |

|  | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 3.6695 | 0.493 | 7.440 | 0.000 | 2.703 | 4.636 |
| Age | -0.0359 | 0.008 | -4.308 | 0.000 | -0.052 | -0.020 |
| Fare | 0.0023 | 0.002 | 0.922 | 0.356 | -0.003 | 0.007 |
| Pclass_2 | -0.9773 | 0.302 | -3.232 | 0.001 | -1.570 | -0.385 |
| Pclass_3 | -2.0355 | 0.302 | -6.740 | 0.000 | -2.627 | -1.444 |
| Sex_male | -2.6765 | 0.203 | -13.199 | 0.000 | -3.074 | -2.279 |
| Embarked_Q | 0.0629 | 0.386 | 0.163 | 0.870 | -0.693 | 0.819 |
| Embarked_S | -0.3179 | 0.244 | -1.305 | 0.192 | -0.795 | 0.159 |
| SibSp_1 | 0.0989 | 0.224 | 0.442 | 0.659 | -0.340 | 0.538 |
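One way to comment on the p-values is to compare each one against a significance level. On a fitted statsmodels result these values are available as `result.pvalues`; to keep the sketch below self-contained, it instead copies the rounded p-values from the rows shown in the summary table above. The 0.05 threshold is an assumed convention, not something prescribed by the lab.

```python
import pandas as pd

# p-values copied from the rows of the summary table above (rounded)
pvalues = pd.Series({
    "const": 0.000,
    "Age": 0.000,
    "Fare": 0.356,
    "Pclass_2": 0.001,
    "Pclass_3": 0.000,
    "Sex_male": 0.000,
    "Embarked_Q": 0.870,
    "Embarked_S": 0.192,
    "SibSp_1": 0.659,
})

alpha = 0.05  # assumed significance level

# Split the terms by whether their p-value falls below the threshold
significant = pvalues[pvalues < alpha].index.tolist()
not_significant = pvalues[pvalues >= alpha].index.tolist()

print("Significant at 0.05:", significant)
print("Not significant:", not_significant)
```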


In [None]:
# It seems that some predictor variables are more significant than others. These include
# Age, Pclass (categorical variable: Pclass_2 & Pclass_3), and Sex.
# Other variables like Parch, SibSp, Embarked (port), and Fare don't seem to be significant.
# Note that the summary reports converged: False, so these estimates should be read with caution.

## Your analysis here

## Level-up

Create a new model, this time only using those features you determined were influential based on your analysis in step 4.

In [66]:
# Your code here
y = df['Survived'].astype(float)
x_feats = ['Pclass', 'Sex', 'Age']
X = df[x_feats].copy()

In [67]:
X.isna().sum()

Pclass      0
Sex         0
Age       177
dtype: int64

In [70]:
median_age = X.Age.median()

In [73]:
# Assign the result back rather than using fillna(inplace=True) on a column
X['Age'] = X['Age'].fillna(median_age)
X['Pclass'] = X['Pclass'].astype(str)
X = pd.get_dummies(X, prefix=['Pclass', 'Sex'], dtype=float, drop_first=True)

In [75]:
logit_model = sm.Logit(y, X)
result = logit_model.fit()
result.summary()

Optimization terminated successfully.
         Current function value: 0.518649
         Iterations 6


|  |  |  |  |
|---|---|---|---|
| Dep. Variable: | Survived | No. Observations: | 891 |
| Model: | Logit | Df Residuals: | 887 |
| Method: | MLE | Df Model: | 3 |
| Date: | Tue, 15 Oct 2019 | Pseudo R-squ.: | 0.2211 |
| Time: | 00:54:08 | Log-Likelihood: | -462.12 |
| converged: | True | LL-Null: | -593.33 |
|  |  | LLR p-value: | 1.346e-56 |

|  | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Age | 0.0265 | 0.004 | 6.430 | 0.000 | 0.018 | 0.035 |
| Pclass_2 | 0.3070 | 0.200 | 1.535 | 0.125 | -0.085 | 0.699 |
| Pclass_3 | -0.5761 | 0.151 | -3.827 | 0.000 | -0.871 | -0.281 |
| Sex_male | -2.0628 | 0.168 | -12.286 | 0.000 | -2.392 | -1.734 |


In [78]:
import numpy as np
np.exp(result.params)

Age         1.026875
Pclass_2    1.359332
Pclass_3    0.562106
Sex_male    0.127100
dtype: float64
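Exponentiating a logistic regression coefficient turns it into an odds ratio, which is what `np.exp(result.params)` computes above. The sketch below repeats that calculation with the rounded coefficients from the summary table and also expresses each ratio as a percentage change in the odds. The variable names are illustrative, and small differences from the output above come from rounding.

```python
import math

# Coefficients copied from the summary table above (rounded to four decimals)
coefs = {
    "Age": 0.0265,
    "Pclass_2": 0.3070,
    "Pclass_3": -0.5761,
    "Sex_male": -2.0628,
}

# Odds ratio = exp(coefficient)
odds_ratios = {name: math.exp(beta) for name, beta in coefs.items()}

# Percentage change in the odds for a one-unit increase in the feature
pct_change = {name: (ratio - 1) * 100 for name, ratio in odds_ratios.items()}

for name in coefs:
    print(f"{name}: odds ratio {odds_ratios[name]:.3f}, "
          f"{pct_change[name]:+.1f}% change in odds")
```

For example, the odds ratio of about 0.127 for `Sex_male` means that, in this model and holding the other features fixed, the odds of survival for male passengers are roughly 87% lower than for female passengers.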

## Summary 

Well done! In this lab, you practiced using statsmodels to build a logistic regression model. You then interpreted the results, building on your previous statistics knowledge in much the same way as you did for linear regression. Continue on to take a look at building logistic regression models in scikit-learn!