# Lab - Logistic Regression

## Logistic Regression from scratch

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

sns.set_style("whitegrid")


Dataset **Titanic**

![Titanic at Cobh Harbour, 1912](https://upload.wikimedia.org/wikipedia/commons/thumb/d/db/Titanic-Cobh-Harbour-1912.JPG/330px-Titanic-Cobh-Harbour-1912.JPG)

The dataset we are working with is a list of passengers on the famous ship Titanic. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing more than 1,500 passengers and crew. This tragedy shocked the international community and led to better safety regulations for ships.

**Data dictionary**
 
| Variable | Definition | Key |
|:--:|:--:|:--:|
| survival | Survival | 0 = No, 1 = Yes |
| pclass | Ticket class, a proxy for socio-economic status (SES) | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Gender | |
| Age | Age in years | |
| sibsp | Number of siblings (brother, sister) or spouses (husband, wife) aboard the Titanic | |
| parch | Number of parents or children aboard the Titanic. Some children travelled only with a nanny, so parch = 0 for them | |
| ticket | Ticket number | |
| fare | Passenger fare | |
| cabin | Cabin number | |
| embarked | Port of Embarkation | C=Cherbourg, Q=Queenstown, S=Southampton |

In [2]:
titanic = pd.read_csv('https://raw.githubusercontent.com/dhminh1024/practice_datasets/master/titanic.csv')

# Data manipulation
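# Fill missing ages only; a frame-wide fillna would also write the mean age into Cabin and Embarked
titanic['Age'] = titanic['Age'].fillna(titanic['Age'].mean())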
titanic.replace({'Sex':{'male':0, 'female':1}}, inplace=True)
titanic['FamilySize'] = titanic['SibSp'] + titanic['Parch'] + 1
titanic.drop(columns=['PassengerId', 'Name', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'], inplace=True)
titanic.head()

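| | Survived | Pclass | Sex | Age | FamilySize |
|:--:|:--:|:--:|:--:|:--:|:--:|
| 0 | 0 | 3 | 0 | 22.0 | 2 |
| 1 | 1 | 1 | 1 | 38.0 | 2 |
| 2 | 1 | 3 | 1 | 26.0 | 1 |
| 3 | 1 | 1 | 1 | 35.0 | 2 |
| 4 | 0 | 3 | 0 | 35.0 | 1 |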
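
The preprocessing steps above can be tried on a small hand-made frame before touching the real file. The sketch below uses made-up rows (not taken from the Titanic dataset) and the hypothetical names `toy` and `cleaned`. Here the fill is restricted to the `Age` column, so other columns with missing values keep their `NaN`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows, not taken from the Titanic file
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female'],
    'Age': [20.0, np.nan, 40.0],
    'SibSp': [1, 0, 2],
    'Parch': [0, 0, 1],
    'Cabin': [np.nan, 'C85', np.nan],
})

cleaned = toy.copy()

# mean() skips NaN, so the fill value here is (20 + 40) / 2 = 30
cleaned['Age'] = cleaned['Age'].fillna(cleaned['Age'].mean())

# Encode Sex as 0/1 and build FamilySize = siblings/spouses + parents/children + the passenger
cleaned['Sex'] = cleaned['Sex'].map({'male': 0, 'female': 1})
cleaned['FamilySize'] = cleaned['SibSp'] + cleaned['Parch'] + 1

print(cleaned[['Sex', 'Age', 'FamilySize']])
print('Missing Cabin values left untouched:', int(cleaned['Cabin'].isna().sum()))
```

Filling column by column keeps the imputation explicit: each column can get a value that makes sense for it, or be dropped, as `Cabin` is in this lab.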


In [3]:
from sklearn.model_selection import train_test_split

X = titanic[['Pclass', 'Sex', 'Age', 'FamilySize']].values
y = titanic[['Survived']].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=102)

print('Training set:', X_train.shape, y_train.shape)
print('Test set:', X_test.shape, y_test.shape)

Training set: (712, 4) (712, 1)
Test set: (179, 4) (179, 1)


### Scikit-learn Logistic Regression

In [4]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
from sklearn.metrics import log_loss

# Create a Logistic Regression model and fit it on the training set
lg = LogisticRegression()
lg.fit(X_train, y_train.ravel())  # ravel(): scikit-learn expects a 1-D target array
predictions = lg.predict(X_test)

# Show metrics
print("Accuracy score: %f" % accuracy_score(y_test, predictions))
print("Confusion Matrix:")
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))
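# Caveat: log_loss expects probabilities (e.g. lg.predict_proba(X_test)) and already averages over samples,
# so the value printed here (hard labels, divided again by len(y_test)) is not a true log loss.
print('Log loss:', log_loss(y_test, predictions)/len(y_test))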

# Show parameters
print('w = ', lg.coef_)
print('b = ', lg.intercept_)

Accuracy score: 0.793296
Confusion Matrix:
[[97 17]
 [20 45]]
              precision    recall  f1-score   support

           0       0.83      0.85      0.84       114
           1       0.73      0.69      0.71        65

    accuracy                           0.79       179
   macro avg       0.78      0.77      0.77       179
weighted avg       0.79      0.79      0.79       179

Log loss: 0.039884782615024775
w =  [[-1.18387774  2.56284417 -0.04074789 -0.21591208]]
b =  [2.84100084]
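
Log loss is defined on predicted probabilities rather than hard 0/1 labels, and the mean over samples is already part of its definition. The sketch below computes mean binary cross-entropy with NumPy on made-up labels and probabilities and compares it with the value obtained from thresholded predictions. The helper name `mean_bce` and the clipping constant `eps` are assumptions for illustration, not part of the lab.

```python
import numpy as np

def mean_bce(y_true, p, eps=1e-15):
    """Mean binary cross-entropy; p is clipped away from 0 and 1 to avoid log(0)."""
    y_true = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

y_true = np.array([1, 0, 1, 0])
proba = np.array([0.9, 0.2, 0.4, 0.1])   # made-up P(y = 1)
hard = (proba >= 0.5).astype(int)        # thresholded predictions: [1, 0, 0, 0]

loss_from_proba = mean_bce(y_true, proba)  # graded penalty for each sample
loss_from_hard = mean_bce(y_true, hard)    # one wrong label is penalised as if p were eps

print('From probabilities:', loss_from_proba)
print('From hard labels:  ', loss_from_hard)
```

With hard labels every mistake costs roughly `-log(eps)`, so the result only counts errors and says nothing about how confident the model was. With the fitted model above, the probability-based value would come from passing `lg.predict_proba(X_test)[:, 1]` to `log_loss`, with no extra division by `len(y_test)`.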


### Handmade Logistic Regression

**Forward Propagation:**
$$Z = Xw + b$$
$$\hat{y} = \sigma(Z) = \sigma(Xw + b), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}$$
$$J(w, b) = -\frac{1}{m}\sum_{i=1}^m{ \Big( y^{(i)} \log( \hat{y}^{(i)}) + (1-y^{(i)}) \log(1 - \hat{y}^{(i)}) \Big)} \tag{5}$$

**Backward Propagation:**

$$ \frac{\partial J}{\partial w} = \frac{1}{m}X^T(\hat{y}-y)\tag{6}$$
$$ \frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^m (\hat{y}^{(i)}-y^{(i)})\tag{7}$$
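
One way to gain confidence in equations (5) to (7) is to code them directly and compare the analytic gradients with finite differences. The sketch below does this on small synthetic data. The function name `loss_and_grads`, the random data, and the step size `h` are assumptions for illustration, and this is only one possible arrangement of the computation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_and_grads(w, b, X, y):
    """Return J, dJ/dw, dJ/db for column-vector w (n, 1) and y (m, 1)."""
    m = X.shape[0]
    y_hat = sigmoid(X @ w + b)                                   # forward pass
    J = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))  # equation (5)
    dw = X.T @ (y_hat - y) / m                                   # equation (6)
    db = np.sum(y_hat - y) / m                                   # equation (7)
    return J, dw, db

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))            # synthetic features
y = rng.integers(0, 2, size=(6, 1))    # synthetic 0/1 labels
w = rng.normal(size=(3, 1)) * 0.1
b = 0.05

J, dw, db = loss_and_grads(w, b, X, y)

# Central finite differences as an independent check of dw and db
h = 1e-6
num_dw = np.zeros_like(w)
for j in range(w.shape[0]):
    step = np.zeros_like(w)
    step[j] = h
    J_plus = loss_and_grads(w + step, b, X, y)[0]
    J_minus = loss_and_grads(w - step, b, X, y)[0]
    num_dw[j] = (J_plus - J_minus) / (2 * h)
num_db = (loss_and_grads(w, b + h, X, y)[0] - loss_and_grads(w, b - h, X, y)[0]) / (2 * h)

print('max |dw - numeric|:', np.max(np.abs(dw - num_dw)))
print('|db - numeric|:    ', abs(db - num_db))
```

Both differences should be tiny; a large gap usually points to a sign error or a missing `1/m` factor.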

In [5]:
# Initialize params
def initialize_params(X):
    '''Initialize w, b with zeros and return'''
    # Your code here
    pass

In [6]:
# Implement sigmoid
def sigmoid(Z):
    # Your code here
    pass
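
A direct `1 / (1 + np.exp(-Z))` works for moderate inputs, but `np.exp` overflows for large negative `Z`. The sketch below shows one common workaround that evaluates the two algebraically equivalent forms of the sigmoid on the side where each is safe. The name `stable_sigmoid` is hypothetical, and the function assumes an array input with at least one dimension.

```python
import numpy as np

def stable_sigmoid(z):
    """Sigmoid that never exponentiates a large positive number."""
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    # For z >= 0, exp(-z) is at most 1, so 1 / (1 + exp(-z)) is safe
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    # For z < 0, use the equivalent form exp(z) / (1 + exp(z)), where exp(z) < 1
    ez = np.exp(z[~pos])
    out[~pos] = ez / (1.0 + ez)
    return out

values = stable_sigmoid(np.array([-1000.0, -2.0, 0.0, 2.0, 1000.0]))
print(values)
```

The outputs are symmetric around 0.5, since sigmoid(-z) = 1 - sigmoid(z).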

In [7]:
# Forward propagation
def forward(w, b, X):
    '''Return y_hat'''
    # Your code here
    pass

In [8]:
# Binary cross entropy loss
def binary_cross_entropy(y, y_hat):
    '''Calculate loss function J and return'''
    # Your code here
    pass

In [9]:
# Backward propagation
def backward(X, y, y_hat, w, b):
    '''Calculate dw, db and return'''
    # Your code here
    pass

# Update parameters
def update_params(w, b, dw, db, learning_rate):
    '''Update w, b and return'''
    # Your code here
    pass

In [10]:
# Training process
def train(X, y, iterations, learning_rate):
    '''Train w, b and return'''
    # Your code here
    pass
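
The training loop repeats the forward pass, the gradient computation, and the update `w ← w − learning_rate · dw`, `b ← b − learning_rate · db`. The sketch below is a minimal, self-contained version on synthetic data, not a reference solution: the name `train_sketch`, the default hyperparameters, and the generated data are all assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_sketch(X, y, iterations=500, learning_rate=0.1):
    """Batch gradient descent for logistic regression; returns w, b and the loss history."""
    m, n = X.shape
    w = np.zeros((n, 1))
    b = 0.0
    losses = []
    for _ in range(iterations):
        y_hat = sigmoid(X @ w + b)                        # forward pass
        loss = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
        losses.append(float(loss))
        dw = X.T @ (y_hat - y) / m                        # gradient w.r.t. w
        db = float(np.sum(y_hat - y) / m)                 # gradient w.r.t. b
        w = w - learning_rate * dw                        # gradient descent step
        b = b - learning_rate * db
    return w, b, losses

# Synthetic data: labels drawn from a known logistic model
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
true_logit = 2.0 * X[:, [0]] - 1.0 * X[:, [1]] + 0.5
y = (rng.random((200, 1)) < sigmoid(true_logit)).astype(float)

w, b, losses = train_sketch(X, y)
print('first loss:', losses[0], 'last loss:', losses[-1])
print('w =', w.ravel(), 'b =', b)
```

With zero initial parameters every prediction is 0.5, so the first loss is log 2. The synthetic features here are on a similar scale; on the Titanic features, where `Age` is much larger than the other columns, a smaller learning rate or feature scaling may be needed for the loop to converge.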

In [11]:
# Predict
def predict(w, b, X):
    '''Return predicted y of X'''
    # Your code here
    pass

**Evaluation**

In [12]:
# Train the model and predict X_test
# Your code here

In [13]:
# Evaluation
# Your code here
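
For the evaluation step, accuracy and the confusion matrix can also be computed by hand, which makes it easier to check a handmade model against the scikit-learn report. The sketch below uses made-up labels and probabilities and a 0.5 threshold, all of which are assumptions for illustration.

```python
import numpy as np

y_true = np.array([0, 0, 1, 1, 1, 0])
proba = np.array([0.2, 0.6, 0.7, 0.4, 0.9, 0.1])   # made-up P(y = 1)
y_pred = (proba >= 0.5).astype(int)                # threshold at 0.5

# Accuracy: fraction of predictions that match the labels
accuracy = float(np.mean(y_pred == y_true))

# Confusion matrix with rows = true class, columns = predicted class
cm = np.zeros((2, 2), dtype=int)
for t, p in zip(y_true, y_pred):
    cm[t, p] += 1

print('Accuracy:', accuracy)
print('Confusion matrix:')
print(cm)
```

The row/column layout matches the one used by `confusion_matrix` in scikit-learn, so the two matrices can be compared entry by entry.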

In [14]:
# Output of sklearn.LogisticRegression
# Accuracy score: 0.793296
# Confusion Matrix:
# [[97 17]
#  [20 45]]
#               precision    recall  f1-score   support

#            0       0.83      0.85      0.84       114
#            1       0.73      0.69      0.71        65

#     accuracy                           0.79       179
#    macro avg       0.78      0.77      0.77       179
# weighted avg       0.79      0.79      0.79       179

# Log loss: 0.039884782615024775
# w =  [[-1.18387774  2.56284417 -0.04074789 -0.21591208]]
# b =  [2.84100084]

**Well done!**