<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Logistic Regression Practice
**Possums**

<img src="../images/pos2.jpg" style="height: 250px">

*The common brushtail possum (Trichosurus vulpecula, from the Greek for "furry tailed" and the Latin for "little fox", previously in the genus Phalangista) is a nocturnal, semi-arboreal marsupial of the family Phalangeridae, native to Australia, and the second-largest of the possums.* -[Wikipedia](https://en.wikipedia.org/wiki/Common_brushtail_possum)

In [23]:
# Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

### Get the data

Read in the `possum.csv` data (located in the `data` folder).

In [24]:
possums = pd.read_csv('../data/possum.csv')

In [25]:
possums.head()

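|   | site | pop | sex | age | head_l | skull_w | total_l | tail_l |
|---|------|-----|-----|-----|--------|---------|---------|--------|
| 0 | 1 | Vic | m | 8.0 | 94.1 | 60.4 | 89.0 | 36.0 |
| 1 | 1 | Vic | f | 6.0 | 92.5 | 57.6 | 91.5 | 36.5 |
| 2 | 1 | Vic | f | 6.0 | 94.0 | 60.0 | 95.5 | 39.0 |
| 3 | 1 | Vic | f | 6.0 | 93.2 | 57.1 | 92.0 | 38.0 |
| 4 | 1 | Vic | f | 2.0 | 91.5 | 56.3 | 85.5 | 36.0 |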


In [27]:
possums['pop'].unique()

array(['Vic', 'other'], dtype=object)

### Preprocessing

> Check for & deal with any missing values.  
Convert categorical columns to numeric.  
Do any other preprocessing you feel is necessary.

In [28]:
# Count missing values in each column
possums.isnull().sum()

site       0
pop        0
sex        0
age        2
head_l     0
skull_w    0
total_l    0
tail_l     0
dtype: int64

In [29]:
# Drop the rows that contain missing values
possums.dropna(inplace = True)

In [30]:
# Convert sex m/f to 0/1
possums['sex_f'] = possums['sex'].map({'m': 0, 'f': 1})
possums.drop(columns = 'sex', inplace = True)
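One thing worth checking after a dictionary-based `map` is whether every label was covered. The sketch below uses a small toy Series rather than the possum data, and the `'x'` label is a hypothetical value that does not appear in `possum.csv`; it only illustrates that unmapped labels turn into `NaN` instead of raising an error.

```python
import pandas as pd

# Toy data; the 'x' label is hypothetical and not part of possum.csv
sex = pd.Series(["m", "f", "f", "x"])

encoded = sex.map({"m": 0, "f": 1})

# Labels missing from the mapping become NaN rather than raising an error
n_unmapped = int(encoded.isna().sum())
print(n_unmapped)
```

Running the same `isna().sum()` check on `possums['sex_f']` would confirm that the mapping covered every row.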

In [31]:
# Check out the pop column
possums['pop'].value_counts()

other    58
Vic      44
Name: pop, dtype: int64

In [32]:
# Convert pop column to 0/1 (other = 0, Vic = 1)
possums['pop'] = possums['pop'].map({'other': 0, 'Vic': 1})

### Modeling

> Build a logistic regression model to predict `pop`, the region of origin.  
Examine the performance of the model.

In [33]:
# Set up X and y
X = possums.drop(columns = 'pop')
y = possums['pop']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)
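`train_test_split` is called here without `test_size`, so scikit-learn falls back to its default of holding out 25% of the rows. As a rough check, and assuming the 102 rows left after dropping missing values (58 `other` + 44 `Vic`), the split sizes can be worked out by hand. The rounding rule in the sketch (round the test set up, give the rest to training) is my understanding of scikit-learn's behaviour, not something stated in this notebook.

```python
import math

n_rows = 58 + 44          # rows remaining after dropna(), from value_counts() above
test_fraction = 0.25      # default when test_size and train_size are both None

# Assumption: the test set size is rounded up and the training set gets the rest
n_test = math.ceil(test_fraction * n_rows)
n_train = n_rows - n_test

print(n_train, n_test)
```

The 26 test rows agree with the 26 predictions returned by `logreg.predict(X_test)` later in the notebook.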

In [34]:
# Instantiate model
logreg = LogisticRegression(solver = 'newton-cg') # switched solver to avoid a convergence warning

In [35]:
# Fit the model
logreg.fit(X_train, y_train)

LogisticRegression(solver='newton-cg')

In [36]:
# training accuracy
logreg.score(X_train, y_train)

1.0

In [37]:
# testing accuracy
logreg.score(X_test, y_test)

0.9615384615384616
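An accuracy score is easier to judge next to a baseline. A minimal sketch, using only the class counts printed earlier and the test score above: it computes the accuracy of always predicting the majority class (`other`) on the full data set, and converts the test accuracy into a count of misclassified rows. The test-set size of 26 is taken from the length of the prediction array shown later in the notebook.

```python
counts = {"other": 58, "Vic": 44}   # from possums['pop'].value_counts()
n_total = sum(counts.values())

# Accuracy of always predicting the most common class
baseline_accuracy = max(counts.values()) / n_total

test_accuracy = 0.9615384615384616   # logreg.score(X_test, y_test)
n_test = 26                          # number of rows in X_test
n_errors = round((1 - test_accuracy) * n_test)

print(f"baseline accuracy: {baseline_accuracy:.3f}")
print(f"misclassified test rows: {n_errors}")
```

The baseline comes out near 0.57, so the model is doing far better than guessing the majority class, and the test score corresponds to a single misclassified possum.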

### Interpretation & Predictions

> Interpret at least one coefficient from your model.  
> Generate predicted probabilities for your testing set.  
> Generate predictions for your testing set.

In [38]:
X.columns

Index(['site', 'age', 'head_l', 'skull_w', 'total_l', 'tail_l', 'sex_f'], dtype='object')

In [39]:
logreg.coef_

array([[-2.0801011 ,  0.08575391, -0.15523514, -0.15910371,  0.15724717,
        -0.66021807, -0.12290276]])

In [40]:
# Check out coefficients
pd.Series(logreg.coef_[0], index = X.columns)

site      -2.080101
age        0.085754
head_l    -0.155235
skull_w   -0.159104
total_l    0.157247
tail_l    -0.660218
sex_f     -0.122903
dtype: float64

In [41]:
# Interpret coefficient for age:
np.exp(0.085753)

1.0895371793832975

> A 1-year increase in a possum's age multiplies the odds that it lives in the Vic region by about 1.09 (roughly a 9% increase in the odds), holding all else constant.
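Exponentiating a logistic regression coefficient gives an odds ratio, and the same step applies to every feature, not just `age`. The sketch below copies the coefficients printed above into a plain array (so it runs without the fitted model) and exponentiates them all at once.

```python
import numpy as np

# Coefficients copied from the fitted model's output above
features = ["site", "age", "head_l", "skull_w", "total_l", "tail_l", "sex_f"]
coefs = np.array([-2.080101, 0.085754, -0.155235, -0.159104,
                  0.157247, -0.660218, -0.122903])

# exp(coefficient) = multiplicative change in the odds of pop == 1 (Vic)
# for a one-unit increase in that feature, all else held constant
odds_ratios = np.exp(coefs)

for name, ratio in zip(features, odds_ratios):
    print(f"{name:>8}: {ratio:.3f}")
```

Ratios below 1 (for example `tail_l`) mean the odds of Vic fall as the feature increases; ratios above 1 (`age`, `total_l`) mean the odds rise. With the fitted model in memory, `np.exp(logreg.coef_[0])` would give the same values without copying numbers by hand.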

In [42]:
# Predicted probabilities for test set
logreg.predict_proba(X_test)

array([[6.22004433e-03, 9.93779956e-01],
       [9.92282220e-01, 7.71778036e-03],
       [9.94511506e-01, 5.48849407e-03],
       [6.60151326e-01, 3.39848674e-01],
       [1.19971254e-02, 9.88002875e-01],
       [1.16962536e-01, 8.83037464e-01],
       [9.99901174e-01, 9.88263067e-05],
       [6.31925545e-01, 3.68074455e-01],
       [4.63657506e-02, 9.53634249e-01],
       [1.30018225e-02, 9.86998177e-01],
       [3.70466462e-03, 9.96295335e-01],
       [3.81666685e-02, 9.61833331e-01],
       [9.99856155e-01, 1.43845032e-04],
       [9.99905267e-01, 9.47329427e-05],
       [9.95112429e-01, 4.88757063e-03],
       [1.48071980e-02, 9.85192802e-01],
       [9.96638551e-01, 3.36144886e-03],
       [1.04744002e-01, 8.95255998e-01],
       [1.67040781e-02, 9.83295922e-01],
       [5.37321983e-03, 9.94626780e-01],
       [9.99818724e-01, 1.81275992e-04],
       [9.81947374e-01, 1.80526262e-02],
       [8.19353595e-03, 9.91806464e-01],
       [9.97524076e-01, 2.47592401e-03],
       [4.560022

In [43]:
# Predictions for test set
logreg.predict(X_test)

array([1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0,
       1, 0, 1, 0], dtype=int64)
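For a binary model, `predict` returns the class with the larger predicted probability, which amounts to a 0.5 cut-off on the second column of `predict_proba`. A minimal sketch, using the first eight probability rows printed above (rounded) rather than the fitted model:

```python
import numpy as np

# First eight rows of logreg.predict_proba(X_test), rounded for readability
# Column 0 = P(pop == 0, 'other'), column 1 = P(pop == 1, 'Vic')
proba = np.array([
    [0.00622, 0.99378],
    [0.99228, 0.00772],
    [0.99451, 0.00549],
    [0.66015, 0.33985],
    [0.01200, 0.98800],
    [0.11696, 0.88304],
    [0.99990, 0.00010],
    [0.63193, 0.36807],
])

# Pick the more probable class for each row
predicted = proba.argmax(axis=1)

# Equivalent for two classes: threshold P(Vic) at 0.5
thresholded = (proba[:, 1] >= 0.5).astype(int)

print(predicted)
```

The result matches the first eight entries of `logreg.predict(X_test)`. Rows such as the fourth (about 0.34 for Vic) are the least confident ones and would be the first to flip if a different cut-off were chosen.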