From the PowerPoint slides, we have seen that the logistic equation is

$$ \hat{y} = \frac{1}{1 + e^{-(mx+b)}} $$

Let $z = mx + b$, so that $\hat{y} = \frac{1}{1 + e^{-z}}$.
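As a quick illustration of the logistic equation above, here is a minimal sketch in NumPy. The values of `m`, `b`, and `x` are made up for illustration only and do not come from the Titanic data.

```python
import numpy as np

def sigmoid(z):
    """Map any real number (or array) into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical slope, intercept, and inputs, chosen only for illustration
m, b = 2.0, -1.0
x = np.array([-2.0, 0.5, 3.0])

z = m * x + b        # linear part: [-5, 0, 5]
y_hat = sigmoid(z)   # predicted probabilities, one per input
print(y_hat)
```

A value of $z = 0$ gives a probability of exactly 0.5, large negative $z$ pushes the output toward 0, and large positive $z$ pushes it toward 1.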

The loss function for logistic regression is

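$ L = \sum_i \left( -y_i \log(\hat{y}_i) - (1 - y_i) \log(1 - \hat{y}_i) \right)$

where $y_i$ is the true label (0 or 1) and $\hat{y}_i$ is the predicted probability for sample $i$.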
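To see how this loss behaves, the sketch below evaluates it for two sets of made-up predictions. The function name `log_loss_sum` and the clipping constant `eps` are assumptions added here for numerical safety; they are not part of the original formula.

```python
import numpy as np

def log_loss_sum(y, y_hat, eps=1e-12):
    """Summed logistic loss: sum(-y*log(y_hat) - (1-y)*log(1-y_hat))."""
    # Clip predictions away from 0 and 1 so that log() never receives 0
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return np.sum(-y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat))

# Hypothetical labels and predictions, for illustration only
y = np.array([1, 0, 1])
y_hat_good = np.array([0.9, 0.1, 0.8])   # confident and mostly right
y_hat_bad = np.array([0.2, 0.7, 0.3])    # confident and mostly wrong

print(log_loss_sum(y, y_hat_good))
print(log_loss_sum(y, y_hat_bad))
```

Predictions that put high probability on the correct class give a small loss, while confident wrong predictions are penalized heavily.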

The update equations for $m$ and $b$, with $\epsilon$ as the learning rate, are:

$m = m - \epsilon \frac{\partial L}{\partial m} $

$b = b - \epsilon \frac{\partial L}{\partial b} $
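The update rules can be sketched as a short gradient descent loop. This is a minimal illustration, not the method scikit-learn uses internally. It relies on the standard gradients of the summed loss for a sigmoid output, which are not derived in this notebook: $\frac{\partial L}{\partial m} = \sum_i (\hat{y}_i - y_i) x_i$ and $\frac{\partial L}{\partial b} = \sum_i (\hat{y}_i - y_i)$. The toy data, learning rate, and iteration count are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(m, b, x, y):
    """Summed logistic loss for slope m and intercept b."""
    y_hat = sigmoid(m * x + b)
    return np.sum(-y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat))

# Hypothetical one-feature data (deliberately not perfectly separable)
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = np.array([0, 0, 1, 0, 1, 1])

m, b = 0.0, 0.0      # initial parameters
epsilon = 0.1        # learning rate (assumed value)
initial_loss = loss(m, b, x, y)

for _ in range(200):
    y_hat = sigmoid(m * x + b)
    grad_m = np.sum((y_hat - y) * x)   # dL/dm
    grad_b = np.sum(y_hat - y)         # dL/db
    m = m - epsilon * grad_m           # update rule for m
    b = b - epsilon * grad_b           # update rule for b

final_loss = loss(m, b, x, y)
print(m, b, initial_loss, final_loss)
```

The loss after training should be lower than the starting loss, and the learned slope should be positive because larger `x` values tend to have label 1 in this toy data.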

In [None]:
import pandas as pd
import matplotlib.pyplot as plt  # pyplot, not the top-level matplotlib package
import seaborn as sbn
%matplotlib inline

In [None]:
df = pd.read_csv("Titanic.csv")

In [None]:
print(df.columns)

In [None]:
print(df.shape)

In [None]:
df["family_size"] = df["SibSp"] + df["Parch"] + 1

In [None]:
print(df["Parch"].head())
print("+++++++++++++")
print(df["SibSp"].head())

In [None]:
print(df["family_size"].head())

In [None]:
print(df["Embarked"].unique())

In [None]:
print(df.isnull().sum())

In [None]:
# Fill missing ages with the median age (assignment avoids chained inplace issues)
df["Age"] = df["Age"].fillna(df["Age"].median())
df.isnull().sum()

In [None]:
df["Embarked"].describe()

In [None]:
# Fill missing embarkation ports with "S"
df["Embarked"] = df["Embarked"].fillna("S")
df.isnull().sum()

In [None]:
embarked ={"S":0, "C":1, "Q":2}

In [None]:
# Encode the port letters as integers using the dictionary above
df["Embarked"] = df["Embarked"].map(embarked)

In [None]:
gender ={"female":1, "male":0}

In [None]:
# Encode gender as integers using the dictionary above
df["Sex"] = df["Sex"].map(gender)

In [None]:
dfx = df[["Age", "Sex", "family_size", "Embarked"]].copy(deep=True)
dfy = df[["Survived"]].copy(deep=True)

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import numpy as np
scaler = StandardScaler()
x = scaler.fit_transform(dfx)

In [None]:
# Split the scaled features (x), not the raw dfx, so the scaling is actually used
x_train, x_test, y_train, y_test = train_test_split(x, dfy, test_size=0.2, random_state=42)

In [None]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(penalty='l2', C=1)

In [None]:
print(y_train.shape)
y_train = np.array(y_train).flatten()
print(y_train.shape)

In [None]:
model.fit(x_train, y_train)

In [None]:
ypred = model.predict(x_test)

#### Confusion Matrix

A confusion matrix is a table that summarizes the performance of a classifier. Each row represents the instances of an actual class, and each column represents the instances of a predicted class. Some references transpose this layout, with rows as predicted classes and columns as actual classes.

<img src="confusion_matrix.png" width="300" height="200">

#### A false positive is known as a Type I error, and a false negative is known as a Type II error.
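To make the four cells concrete, the sketch below counts them by hand for a small set of made-up labels and predictions. The arrays are hypothetical. The layout `[[TN, FP], [FN, TP]]` follows the rows-are-actual, columns-are-predicted convention described above, which is also the layout `sklearn.metrics.confusion_matrix` returns for labels 0 and 1.

```python
import numpy as np

# Hypothetical true labels and predictions, for illustration only
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0, 1, 0])

tp = np.sum((y_true == 1) & (y_pred == 1))  # true positives
tn = np.sum((y_true == 0) & (y_pred == 0))  # true negatives
fp = np.sum((y_true == 0) & (y_pred == 1))  # false positives (Type I errors)
fn = np.sum((y_true == 1) & (y_pred == 0))  # false negatives (Type II errors)

# Rows are actual classes (0, 1); columns are predicted classes (0, 1)
cm = np.array([[tn, fp],
               [fn, tp]])
print(cm)
```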

Let's consider another matrix and compute some metrics.

<img src="confusion_matrix2.png" width="500" height="400">

Important metrics

Recall = Sensitivity = True Positive rate = $\frac{TP}{TP+FN}$ 

Precision = Positive Predictive Value = $\frac{TP}{TP+FP} $  

Accuracy = $ \frac{TP+TN}{TP+TN+FP+FN} $

Specificity = True Negative rate = $ \frac{TN}{FP+TN} $ 

False Positive Rate = $\frac{FP}{FP+TN} $  = 1 - TNR = 1 - Specificity  

The F1 score is the harmonic mean of recall and precision:
F1 = $\frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$
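The formulas above can be checked with a few lines of arithmetic. The counts below are hypothetical and are not taken from the confusion matrix image; they only show how each metric is computed from TP, FP, FN, and TN.

```python
# Hypothetical confusion-matrix counts, for illustration only
tp, fp, fn, tn = 40, 10, 20, 30

recall = tp / (tp + fn)                      # sensitivity, true positive rate
precision = tp / (tp + fp)                   # positive predictive value
accuracy = (tp + tn) / (tp + tn + fp + fn)
specificity = tn / (fp + tn)                 # true negative rate
fpr = fp / (fp + tn)                         # false positive rate = 1 - specificity
f1 = 2 * precision * recall / (precision + recall)

print(f"recall={recall:.3f} precision={precision:.3f} accuracy={accuracy:.3f}")
print(f"specificity={specificity:.3f} fpr={fpr:.3f} f1={f1:.3f}")
```

With these counts, precision (0.8) is higher than recall (about 0.667), and the F1 score (about 0.727) falls between the two, closer to the smaller value, as expected for a harmonic mean.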



The Receiver Operating Characteristic (ROC) curve is obtained by plotting the True Positive Rate (TPR) against the False Positive Rate (FPR) at different classification thresholds.
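The idea of sweeping a threshold can be sketched by hand. The labels, scores, and thresholds below are made up for illustration, and the helper `tpr_fpr` is a hypothetical name rather than a library function. Each threshold yields one (FPR, TPR) point on the ROC curve.

```python
import numpy as np

# Hypothetical labels and predicted probabilities, for illustration only
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

def tpr_fpr(y_true, scores, threshold):
    """Return (TPR, FPR) when predicting 1 for scores >= threshold."""
    y_pred = (scores >= threshold).astype(int)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    return float(tp / (tp + fn)), float(fp / (fp + tn))

# Lowering the threshold moves the point from (0, 0) toward (1, 1)
points = [tpr_fpr(y_true, scores, t) for t in [0.9, 0.5, 0.38, 0.2, 0.0]]
for tpr, fpr in points:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

A very high threshold predicts no positives, giving the point (0, 0); a threshold of 0 predicts everything as positive, giving (1, 1).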

<img src="ROC_curves.svg" width="400" height="300">


The Area Under the Receiver Operating Characteristic Curve (AUC-ROC) measures how well the classifier separates the classes. An area close to 1 means the classes are well separated, an area of 0.5 is no better than random guessing, and an area below 0.5 is worse than random guessing.

Images courtesy of Wiki
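One way to build intuition for the AUC described above is its ranking interpretation: it equals the fraction of (positive, negative) pairs in which the positive sample receives the higher score, with ties counted as half. The sketch below computes it this way on made-up scores; it is an illustration, not how `roc_auc_score` is implemented.

```python
import numpy as np

# Hypothetical labels and predicted probabilities, for illustration only
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

pos = scores[y_true == 1]   # scores of actual positives
neg = scores[y_true == 0]   # scores of actual negatives

# Compare every positive score with every negative score
diff = pos[:, None] - neg[None, :]
auc = (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size
print(auc)  # 0.75
```

Here three of the four positive-negative pairs are ranked correctly, so the AUC is 0.75.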

In [None]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, ypred)

In [None]:
from sklearn.metrics import accuracy_score
print("Accuracy is: %0.2f" %(accuracy_score(y_test, ypred)))

In [None]:
from sklearn.metrics import roc_auc_score
from sklearn.metrics import classification_report

In [None]:
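# AUC should be computed from predicted probabilities, not hard 0/1 predictions
logit_roc_auc = roc_auc_score(y_test, model.predict_proba(x_test)[:, 1])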
print("Logistic AUC = %0.2f" %logit_roc_auc)
print(classification_report(y_test, ypred))

In [None]:
from sklearn.metrics import roc_curve
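# Probability of the positive class (Survived = 1) for each test sample
y_prob = model.predict_proba(x_test)[:, 1]
print(y_prob[0:5])
# FPR and TPR at each threshold, used to draw the ROC curve
fpr, tpr, threshold = roc_curve(y_test, y_prob)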

In [None]:
# plotting ROC curve
import matplotlib.pyplot as plt
plt.figure()
plt.plot(fpr, tpr, label='ROC curve (area = %0.2f)' %logit_roc_auc)
plt.plot([0,1], [0,1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic example')
plt.legend(loc="lower right")
plt.show()

In [None]:
"""
In-class activity: Include Fare feature and fit a logistic regression. 
Calculate precision, recall and F1-score. Then plot ROC curve.
"""