## Guided Practice: Logit Function and Odds

In [None]:
import pandas as pd
import numpy as np
from math import log, exp, floor, ceil

In [None]:
def logit_func(odds):
    # in : odds (float, must be > 0)
    # out: logit = log(odds)
    return np.log(odds)

def sigmoid_func(logit):
    # in : logit (float)
    # out: probability = 1 / (1 + exp(-logit))
    return 1 / (1 + np.exp(-logit))

odds_set = [5.0 / 1.0, 20.0 / 1.0, 1.1 / 1.0, 1.8 / 1.0, 1.6 / 1.0]
odds_set

In [None]:
for odds in odds_set:
    print(sigmoid_func(logit_func(odds)))
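
Composing the two functions should simply recover the probability implied by the odds, since `1 / (1 + exp(-log(odds)))` simplifies to `odds / (1 + odds)`. The sketch below checks that identity using only the standard library; `odds_to_probability` is a hypothetical helper name introduced for illustration and is not part of the notebook.

```python
from math import exp, log

def odds_to_probability(odds):
    # Hypothetical helper: direct conversion, p = odds / (1 + odds)
    return odds / (1.0 + odds)

odds_set = [5.0 / 1.0, 20.0 / 1.0, 1.1 / 1.0, 1.8 / 1.0, 1.6 / 1.0]

for odds in odds_set:
    # Same computation as sigmoid_func(logit_func(odds)), written with math
    via_sigmoid = 1.0 / (1.0 + exp(-log(odds)))
    direct = odds_to_probability(odds)
    print(odds, round(via_sigmoid, 4), round(direct, 4))
```

Both columns should agree up to floating-point rounding, which is a quick sanity check that the logit and sigmoid are inverses of each other.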

In [None]:
# Statsmodels logistic regression is sm.Logit
import statsmodels.api as sm

In [None]:
# Read in the data
df = pd.read_csv("../../../../data/collegeadmissions.csv")

In [None]:
df.head()

In [None]:
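# dtype=float keeps the dummy columns numeric; recent pandas versions default
# to bool, which statsmodels may reject when mixed with float columns
df = df.join(pd.get_dummies(df["rank"], dtype=float))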

In [None]:
df.head()

In [None]:
X = df[["gre", "gpa", 1, 2, 3,]]
X = sm.add_constant(X)
y = df["admit"]

lm = sm.Logit(y, X)
result = lm.fit()

result.summary()

In [None]:
print(df.admit.mean())

In [None]:
# You can easily convert these into odds using numpy.exp()
print(result.params)
print(np.exp(result.params))

The exponentiated coefficients make it clearer that the odds of admission decrease as a school's rank approaches 4.
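
One way to read an exponentiated coefficient is as a multiplier on the odds for a one-unit increase in that feature, holding the others fixed. The sketch below uses made-up coefficient values and hypothetical feature names (they are not the fitted values from `result.params`) to show the conversion to a percent change in odds.

```python
from math import exp

# Hypothetical coefficients for illustration only; NOT the fitted values above
example_params = {"feature_a": 0.8, "feature_b": -0.3, "feature_c": 0.0}

def describe_odds_ratio(coef):
    # exp(coef) is the factor by which the odds are multiplied per unit increase
    odds_ratio = exp(coef)
    # Express the same quantity as a percent change in the odds
    percent_change = (odds_ratio - 1.0) * 100.0
    return odds_ratio, percent_change

for name, coef in example_params.items():
    odds_ratio, percent_change = describe_odds_ratio(coef)
    print(name, round(odds_ratio, 3), round(percent_change, 1))
```

A positive coefficient gives an odds ratio above 1, a negative coefficient gives one below 1, and a coefficient of 0 leaves the odds unchanged.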

The accuracy of the model with all features (with one rank dummy left out as the baseline) is ~70%.

In [None]:
from sklearn.metrics import accuracy_score

predicted = result.predict(X)
threshold = 0.5
predicted_classes = (predicted > threshold).astype(int)
accuracy_score(y, predicted_classes)

In [None]:
import pandas as pd
from sklearn.linear_model import LogisticRegression

In [None]:
lm = LogisticRegression()

In [None]:
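# rename(columns=str) casts the labels to str; recent scikit-learn versions reject mixed str/int column names
lm.fit(df[["gre", "gpa", 1, 2, 3]].rename(columns=str), df["admit"])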

In [None]:
print(lm.coef_)
print(lm.intercept_)
print(df.admit.mean())

The code below walks through ROC curves, which summarize the confusion matrix (true and false positive rates) across thresholds. It will be useful for working through the Titanic problem.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

The first chart below is the ROC curve computed over many thresholds. Near the bottom left, a false positive rate (x-axis) of ~0 comes with a true positive rate (y-axis) of ~0; near the top right, both rates approach 1.

The second chart, which does not vary the threshold, shows the single (FPR, TPR) point for the 0.5 cutoff, joined to (0, 0) and (1, 1).

The first chart is more useful when you compare models and decide where to place the decision threshold for the data. The second simplifies the first in case the idea of thresholds is confusing.
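
To make the role of the threshold concrete, here is a minimal sketch that computes one (FPR, TPR) point by hand. The labels and scores are made up for illustration, and `roc_point` is a hypothetical helper, not a scikit-learn function.

```python
def roc_point(y_true, scores, threshold):
    # Classify as positive when the score exceeds the threshold
    predicted = [1 if s > threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(y_true, predicted) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, predicted) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, predicted) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(y_true, predicted) if y == 0 and p == 0)
    fpr = fp / (fp + tn)  # false positive rate (x-axis)
    tpr = tp / (tp + fn)  # true positive rate (y-axis)
    return fpr, tpr

# Made-up labels and predicted probabilities, for illustration only
y_true = [0, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.6, 0.9]

for threshold in (1.0, 0.5, 0.0):
    print(threshold, roc_point(y_true, scores, threshold))
```

A threshold of 1.0 predicts no positives and lands at (0, 0); a threshold of 0.0 predicts all positives and lands at (1, 1); intermediate thresholds trace out the curve between them.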

In [None]:
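# Compute the curve once and unpack it instead of calling roc_curve twice
fpr, tpr, thresholds = roc_curve(df["admit"], predicted)
plt.figure(figsize=(4, 4))
plt.plot(fpr, tpr)
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.show()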

In [None]:
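# Hard 0/1 predictions give a single point between (0, 0) and (1, 1)
fpr, tpr, thresholds = roc_curve(df["admit"], predicted_classes)
plt.figure(figsize=(4, 4))
plt.plot(fpr, tpr)
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.show()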

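Finally, you can use the `roc_auc_score` function to calculate the area under these curves (AUC). The cell below scores the thresholded classes; pass `predicted` instead to score the full probability-based curve.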

In [None]:
roc_auc_score(df["admit"], predicted_classes)
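
AUC also has a ranking interpretation: it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it that way on made-up data; `pairwise_auc` is a hypothetical helper written for illustration and is not how scikit-learn computes the score internally.

```python
def pairwise_auc(y_true, scores):
    # Split the scores by true class
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))

# Made-up labels and predicted probabilities, for illustration only
y_true = [0, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.6, 0.9]

print(pairwise_auc(y_true, scores))
```

Because the measure depends only on how examples are ranked, collapsing probabilities to 0/1 classes discards ranking information and typically changes the AUC.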

### Note: sklearn also has logistic regression:
```
from sklearn.linear_model import LogisticRegression
lm = LogisticRegression()
lm.fit(X, y)
```

### Titanic Problem

**Goals**
1. Spend a few minutes determining which data would be most important to use in the prediction problem. You may need to create new features based on the data available. Consider using a feature selection aid in `scikit-learn`. In the worst case, identify one or two strong features that would be useful to include in the model.
2. Spend 1-2 minutes considering which **metric** makes the most sense to optimize. Accuracy? FPR or TPR? AUC? Given the business problem (understanding survival rate aboard the Titanic), why should you use this metric?
3. Build a tuned Logistic model. Be prepared to explain your design (including regularization), metric and feature set in predicting survival using the tools necessary (such as a fit chart).

In [None]:
# <Code Here>

In [1]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression


df = pd.read_csv("../../../../data/titanic.csv")

In [2]:
df = df.join(pd.get_dummies(df["Pclass"]))
df.head()


| Unnamed: 0 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | 1 | 2 | 3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | | S | 0.0 | 0.0 | 1.0 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 1.0 | 0.0 | 0.0 |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S | 0.0 | 0.0 | 1.0 |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S | 1.0 | 0.0 | 0.0 |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | | S | 0.0 | 0.0 | 1.0 |


In [None]:
#x = df[["Sex", "Age", 1, 2, 3,]]
#y = df[["Survived"]]



In [3]:
lm= LogisticRegression()
lm.fit(df[["Sex", "Age", 1, 2, 3,]], df["Survived"])

ValueError: could not convert string to float: male
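
The error arises because `Sex` holds strings ("male"/"female") and the model needs numeric inputs. A minimal sketch of one possible fix follows, using a tiny made-up frame rather than the real Titanic data. The column names `is_male` and `Age_filled`, and the choice to fill missing ages with the median, are assumptions made for illustration; other encodings and imputation strategies are equally valid.

```python
import pandas as pd

# Tiny made-up frame standing in for the Titanic columns used above
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, None, 35.0],
    "Pclass": [3, 1, 3, 2],
    "Survived": [0, 1, 1, 0],
})

# Encode the string column as 0/1 (assumed name: is_male)
toy["is_male"] = (toy["Sex"] == "male").astype(int)

# Missing values would also make the fit fail, so fill them (median is one option)
toy["Age_filled"] = toy["Age"].fillna(toy["Age"].median())

# String-named dummy columns avoid mixing int and str column labels
class_dummies = pd.get_dummies(toy["Pclass"], prefix="Pclass", dtype=int)

features = pd.concat([toy[["is_male", "Age_filled"]], class_dummies], axis=1)
print(features)
```

A frame prepared this way is fully numeric and could then be passed to `LogisticRegression().fit(features, toy["Survived"])`; you may also want to leave out one class dummy as the baseline, as was done with rank in the admissions model.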