# Logistic Regression with Python

For this lecture we will be working with the [Titanic Data Set from Kaggle](https://www.kaggle.com/c/titanic). This is a very famous data set and is very often a student's first step in machine learning!

We'll be trying to predict a binary class: survived or deceased.
Let's begin our understanding of implementing Logistic Regression in Python for classification.

We'll use a "semi-cleaned" version of the Titanic data set. If you use the data set hosted directly on Kaggle, you may need to do some additional cleaning not shown in this lecture notebook.

## Import Libraries
Let's import some libraries to get started!

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore") # Ignore warning

pd.set_option('display.float_format', '{:.2f}'.format) # Show floats with 2 decimals instead of scientific notation like "1.5e2"


## Load data

Let's start by reading in the titanic_train.csv file into a pandas dataframe.

In [None]:
df = pd.read_csv('data/titanic_train.csv')  # forward slash works on Windows, macOS and Linux

In [None]:
df.head()

## Cleaning data

### Missing data:

In [None]:
df.isna().sum()

#### Age field:

In [None]:
plt.figure(figsize=(5, 4))
sns.boxplot(x='Pclass', y='Age', data=df, palette='winter')
plt.show()

We can see that passengers in the higher (wealthier) classes tend to be older, which makes sense. We'll impute each missing Age with the median age of the passenger's Pclass.

In [None]:
# Get the median age of each Pclass:
median1 = df[df['Pclass'] == 1]['Age'].median()
median2 = df[df['Pclass'] == 2]['Age'].median()
median3 = df[df['Pclass'] == 3]['Age'].median()

median1, median2, median3


In [None]:
# Impute age to each class:
df.loc[(df['Pclass'] == 1) & (df['Age'].isna()), 'Age'] = median1
df.loc[(df['Pclass'] == 2) & (df['Age'].isna()), 'Age'] = median2
df.loc[(df['Pclass'] == 3) & (df['Age'].isna()), 'Age'] = median3
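
The three per-class assignments above can also be written as a single group-wise fill. The snippet below is a minimal sketch on a made-up toy frame (the `toy` name and its values are assumptions for illustration, not the Titanic data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Titanic data (values are made up for illustration).
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 3, 3, 3],
    "Age": [40.0, np.nan, 36.0, 30.0, np.nan, 20.0, 24.0, np.nan],
})

# Median age within each class, broadcast back to every row of that class.
class_median = toy.groupby("Pclass")["Age"].transform("median")

# Fill only the missing ages; observed ages are left untouched.
toy["Age"] = toy["Age"].fillna(class_median)
print(toy)
```

This form scales to any number of classes without one line per class.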


In [None]:
df.isna().sum()

#### Cabin field

Great! Let's go ahead and drop the Cabin column and the rows where Embarked is NaN.

In [None]:
df.drop(columns='Cabin', inplace=True)

#### Embarked field

I decided to drop the rows with a missing Embarked value because there are only two of them and the other fields can't be used to predict the missing values.

In [None]:
# Drop remaining NA values:
df.dropna(inplace=True)
df.reset_index(drop=True, inplace = True)


In [None]:
df.isna().sum()

## Preprocessing data

### Transforming categorical data

In [None]:
df.head()

👉 Field 'Sex': assign male: 1; female: 0

In [None]:
# Map the labels to integers so the column gets a numeric dtype:
df['Sex'] = df['Sex'].map({'male': 1, 'female': 0})


In [None]:
df.head()

👉 Field 'Embarked': use `pd.get_dummies` (or `OneHotEncoder`)

In [None]:
from sklearn.preprocessing import OneHotEncoder

In [None]:
# Using pd.get_dummies()
embark = pd.get_dummies(df['Embarked'])
embark.head()

In [None]:
# Use OneHotEncoder to encode the 'Embarked' column:
oh_encoder = OneHotEncoder(sparse_output=False)
embarked_enc = oh_encoder.fit_transform(df[['Embarked']])
embarked_enc[:5]

In [None]:
# show onehot categories:
oh_encoder.categories_[0]

👉 Combine with the transformed data:


In [None]:
df_onehot = pd.DataFrame(embarked_enc, columns= oh_encoder.categories_[0])
df_onehot.head()

In [None]:
df2 = pd.concat([df, df_onehot], axis=1)
df2.head()

👉 Drop unnecessary columns: `PassengerId`, `Name`, `Ticket`, `Embarked`

In [None]:
# Drop unnecessary columns:
df2.drop(columns=['PassengerId','Embarked','Name','Ticket'], inplace=True)

In [None]:
df2.head()

In [None]:
df2.describe()

😉 Great! Our data is ready for our model!

# Building a Logistic Regression model



<img src="https://www.saedsayad.com/images/LogReg_1.png" width="600">


+ Similar to linear regression, `logistic regression` is used to estimate the relationship between a `dependent variable` and `one or more independent variables`, but it predicts a `categorical variable` rather than a continuous one. A categorical variable can be true or false, yes or no, 1 or 0, et cetera. The output also differs from linear regression: the model `produces a probability` by passing a linear combination of the features through the S-shaped sigmoid curve, and the logit (log-odds) function maps that S-curve back to a straight line.

+ By default, scikit-learn's `LogisticRegression.predict` uses 0.5 as the probability threshold to separate the 2 classes.
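
To make the two bullets above concrete, here is a minimal sketch of the sigmoid, its inverse (the logit), and a 0.5 threshold. The helper names `sigmoid`, `logit` and `classify` are illustrative assumptions, not scikit-learn functions:

```python
import math

def sigmoid(z):
    """Map a linear score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Inverse of the sigmoid: the log-odds of probability p."""
    return math.log(p / (1.0 - p))

def classify(z, threshold=0.5):
    """Label 1 when the predicted probability exceeds the threshold."""
    return 1 if sigmoid(z) > threshold else 0

for z in (-2.0, 0.0, 2.0):
    p = sigmoid(z)
    print(f"z={z:+.1f}  p={p:.3f}  logit(p)={logit(p):+.3f}  label={classify(z)}")
```

Applying `logit` to the probability recovers the original linear score, which is what "transforms the S-curve into a straight line" means.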

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df2.drop(columns='Survived'), 
                                                    df2['Survived'], 
                                                    test_size=0.2, 
                                                    random_state=101)

X_train.shape, y_train.shape, X_test.shape, y_test.shape

## Training and Predicting

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
X_train

In [None]:
logreg1 = LogisticRegression()
logreg1.fit(X_train,y_train)

In [None]:
# make prediction and return result as label:
y_train_pred = logreg1.predict(X_train)
y_test_pred = logreg1.predict(X_test)

# Make prediction and return result as probability:
y_train_pred_prop = logreg1.predict_proba(X_train)
y_test_pred_prop = logreg1.predict_proba(X_test)



In [None]:
# Let's take a look in our results:
print('Result of ".predict(X_train)":', y_train_pred[:5], sep = '\n')
print('=='*30)
print('Result of ".predict_proba(X_train)":', y_train_pred_prop[:5], sep = '\n')
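
The relationship between `.predict` and `.predict_proba` can be checked directly. The sketch below uses synthetic data from `make_classification` so that it runs on its own; the names `X_demo`, `y_demo` and `demo_model` are assumptions for illustration and are separate from the Titanic model above:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary data, used only so the snippet is self-contained.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
demo_model = LogisticRegression().fit(X_demo, y_demo)

proba = demo_model.predict_proba(X_demo)  # shape (n_samples, 2); columns follow demo_model.classes_
p_positive = proba[:, 1]                  # probability of class 1

# Thresholding the class-1 probability at 0.5 should match .predict()
labels_default = (p_positive > 0.5).astype(int)
print((labels_default == demo_model.predict(X_demo)).all())

# A lower threshold can only flag the same or more samples as positive.
labels_lenient = (p_positive > 0.3).astype(int)
print(labels_default.sum(), labels_lenient.sum())
```

Choosing a threshold other than 0.5 is a way to trade precision against recall without retraining the model.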


## Model Evaluation Metrics:

In [None]:
# Import evaluation metrics:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

### 👉 Accuracy metrics
The ratio between the number of correctly predicted points and the total number of points in the data set. 

It's simple! Right :))
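
As a quick illustration of that ratio, accuracy can be computed by hand. The labels below are made up, and the `accuracy` helper is a hypothetical stand-in for `accuracy_score`:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

y_true_demo = [1, 0, 0, 1, 0, 1, 0, 0]
y_pred_demo = [1, 0, 1, 1, 0, 0, 0, 0]
print(accuracy(y_true_demo, y_pred_demo))  # 6 of 8 correct -> 0.75
```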

In [None]:
# Accuracy on trainset:
accuracy_score(y_train, y_train_pred)

In [None]:
# Accuracy on testset:
accuracy_score(y_test, y_test_pred)

### 👉 Confusion Matrix

In [None]:
# Calculate the non-normalized confusion matrix on the test set:
confusion_matrix(y_test, y_test_pred)

In [None]:
# Normalized confusion matrix on the test set (each row sums to 1)
confusion_matrix(y_test, y_test_pred, normalize='true')


In [None]:
plt.figure(figsize = (11, 4))
plt.subplot(121)
conf_matrix = confusion_matrix(y_test, y_test_pred)
df_cm = pd.DataFrame(conf_matrix, columns=np.unique(['Negative (0)', 'Positive (1)']), 
                     index = np.unique(['Negative (0)', 'Positive (1)']))
df_cm.index.name = 'Actual'
df_cm.columns.name = 'Predicted'
sns.heatmap(df_cm, cmap="Blues", annot=True,annot_kws={"size": 16})

plt.subplot(122)
conf_matrix_norm = confusion_matrix(y_test, y_test_pred, normalize='true')
df_cm = pd.DataFrame(conf_matrix_norm, columns=np.unique(['Negative (0)', 'Positive (1)']), 
                     index = np.unique(['Negative (0)', 'Positive (1)']))
df_cm.index.name = 'Actual'
df_cm.columns.name = 'Predicted'
sns.heatmap(df_cm, cmap="Blues", annot=True,annot_kws={"size": 14})
plt.suptitle('CONFUSION MATRIX')
plt.show()

In [None]:
print(classification_report(y_test,y_test_pred))

👉 **Type 1 and Type 2 errors**

![alt text](https://www.statisticssolutions.com/wp-content/uploads/2017/12/rachnovblog.jpg)

source: https://www.statisticssolutions.com/to-err-is-human-what-are-type-i-and-ii-errors/

+ **Type 1 error** (**False Positive**): occurs when a hypothesis that **is actually false** is **accepted as true**.

    *+ Example:* You build a model to predict whether a patient has covid (here, positive means healthy and negative means having covid). 
    
    If the model predicts that the patient is healthy but they actually have covid, that is a Type 1 error.

+ **Type 2 error** (**False Negative**): occurs when a hypothesis that **is actually true** is **rejected as false**.

    *+ Example:* You build a model to predict whether a patient has covid (here, positive means healthy and negative means having covid). 
    
    If the model predicts that the patient has covid but they actually don't, that is a Type 2 error.

Usually we need to consider reducing both types of error so that our model performs as well as possible.
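
The two error types correspond to two cells of the confusion matrix. The following is a minimal sketch that counts them by hand on made-up labels; `confusion_counts` is a hypothetical helper, not a scikit-learn function:

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, FN, TN for a binary problem."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive and t != positive:
            fp += 1  # Type 1 error: predicted positive, actually negative
        elif p != positive and t == positive:
            fn += 1  # Type 2 error: predicted negative, actually positive
        else:
            tn += 1
    return tp, fp, fn, tn

y_true_demo = [1, 0, 0, 1, 0, 1, 0, 0]
y_pred_demo = [1, 0, 1, 1, 0, 0, 0, 0]
tp, fp, fn, tn = confusion_counts(y_true_demo, y_pred_demo)
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
```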


### Terminologies

**Recall, sensitivity, hit rate, or true positive rate (TPR)**: the ratio of true positive points to all points that are actually positive (TP + FN). In other words, it is the share of the actual Positive (1) samples in the data that the model correctly predicts as Positive (1).

To increase recall we need to reduce FN, that is, reduce Type 2 errors.


$$
Recall = \frac{TP}{P} = \frac{TP}{TP + FN}
$$

**Precision or positive predictive value (PPV)**: the ratio of true positive points to all points that the model classifies as positive (TP + FP). In other words, it is the share of the model's Positive (1) predictions that are actually Positive (1).

To increase precision we need to reduce FP, that is, reduce Type 1 errors.

$$
Precision = \frac{TP}{TP + FP}
$$


**F1 score**: the **harmonic mean** of **precision** and **recall**

$$
F_1 = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}
$$
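
The three formulas above can be sketched as one small function. The counts passed in are made-up numbers for illustration, and `precision_recall_f1` is a hypothetical helper:

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Made-up counts: 30 true positives, 10 false positives, 20 false negatives.
precision, recall, f1 = precision_recall_f1(tp=30, fp=10, fn=20)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.3f}")
```

Because F1 is a harmonic mean, it is pulled toward the smaller of the two values, so a model cannot score well on F1 by maximizing only one of them.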


We can check `precision`, `recall`, `f1-score` using `classification report`!

In [None]:
print(classification_report(y_test,y_test_pred))

🤔 Note: If you are curious about `macro avg` and `weighted avg`, here is the answer:


+ For the **weighted average**, each class's metric is weighted by that class's share of the samples (its support):

In [None]:
# Example: weighted avg. of recall (precision works the same way):
# (Percentage_of_positive)*Positive_Recall + (Percentage_of_Negative)*Negative_Recall

(71/178)*0.66 + (107/178)*0.92

+ The **macro average** is just the unweighted **mean** of the per-class metrics:

In [None]:
# Example: macro avg. of recall (precision works the same way):
# (Positive_Recall + Negative_Recall)/2

(0.66 + 0.92)/2
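
The two hand calculations above generalize to any number of classes. A minimal sketch, reusing the recall values and supports from those cells (the helper names are assumptions for illustration):

```python
def macro_average(scores):
    """Unweighted mean of the per-class scores."""
    return sum(scores) / len(scores)

def weighted_average(scores, supports):
    """Mean of the per-class scores, weighted by class support."""
    total = sum(supports)
    return sum(s * n for s, n in zip(scores, supports)) / total

recalls = [0.92, 0.66]  # negative class, positive class (values from the cells above)
supports = [107, 71]    # number of test samples in each class

print(round(macro_average(recalls), 2))
print(round(weighted_average(recalls, supports), 2))
```

The two averages diverge most when the classes are imbalanced and the model performs very differently on them.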

## 🤔 Let's try training the model without the `Parch` column and with the SMOTE oversampling technique

👉 Drop the Parch column:

In [None]:
df2.head()

In [None]:
df2.describe()

In [None]:
df3 = df2.drop(columns = 'Parch')
df3.head()

👉 Use the SMOTE oversampling technique

<img src="https://www.researchgate.net/publication/347937180/figure/fig3/AS:973429209563136@1609095017080/Illustration-of-the-SMOTE-oversampling-approach.ppm" width="600">
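
The idea in the illustration is that each synthetic point lies on the line segment between a minority-class sample and one of its nearest minority neighbours. The snippet below is a simplified pure-Python sketch of that interpolation step only; it is not the `imblearn` implementation, and `smote_like_samples` and the toy points are assumptions for illustration:

```python
import math
import random

def smote_like_samples(minority, n_new, k=3, seed=96):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority points
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

minority_demo = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (3.0, 3.0)]
new_points = smote_like_samples(minority_demo, n_new=5)
for point in new_points:
    print(point)
```

Since synthetic points are built from existing samples, a common practice is to resample only the training split, so that no information from the test samples ends up in the training data.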

In [None]:
X = df3.drop(columns='Survived')
y = df3['Survived']

In [None]:
X.shape, y.shape

In [None]:
y.value_counts()

In [None]:
# Install the imbalanced-learn library (imported as `imblearn`)
! pip install imbalanced-learn

In [None]:
from imblearn.over_sampling import SMOTE

# X and y are the features and labels of the data
smote = SMOTE(k_neighbors = 3, random_state=96)
X_resampled, y_resampled = smote.fit_resample(X, y)

In [None]:
X_resampled.shape, y_resampled.shape

In [None]:
# Check value count of label:
y_resampled.value_counts()

In [None]:
X_train1, X_test1, y_train1, y_test1 = train_test_split(X_resampled, 
                                                    y_resampled, 
                                                    test_size=0.2, 
                                                    random_state=101,
                                                    stratify = y_resampled)

X_train1.shape, y_train1.shape, X_test1.shape, y_test1.shape

In [None]:
# Load model:
logreg2 = LogisticRegression()
# Train model:
logreg2.fit(X_train1, y_train1)

In [None]:
# Prediction on the test set:
y_test_pred1 = logreg2.predict(X_test1)


In [None]:
plt.figure(figsize = (10, 10))
# Plot before drop parch col:
plt.subplot(221)
conf_matrix = confusion_matrix(y_test, y_test_pred)
df_cm = pd.DataFrame(conf_matrix, columns=np.unique(['Negative (0)', 'Positive (1)']), 
                     index = np.unique(['Negative (0)', 'Positive (1)']))
df_cm.index.name = 'Actual'
df_cm.columns.name = 'Predicted'
sns.heatmap(df_cm, cmap="Blues", annot=True,annot_kws={"size": 16}, cbar = False)
plt.title('CONFUSION MATRIX (before drop PARCH)')

plt.subplot(222)
conf_matrix_norm = confusion_matrix(y_test, y_test_pred, normalize='true')
df_cm = pd.DataFrame(conf_matrix_norm, columns=np.unique(['Negative (0)', 'Positive (1)']), 
                     index = np.unique(['Negative (0)', 'Positive (1)']))
df_cm.index.name = 'Actual'
df_cm.columns.name = 'Predicted'
sns.heatmap(df_cm, cmap="Blues", annot=True,annot_kws={"size": 14}, cbar = False)

# Plot after drop parch cols:
plt.subplot(223)
conf_matrix = confusion_matrix(y_test1, y_test_pred1)
df_cm = pd.DataFrame(conf_matrix, columns=np.unique(['Negative (0)', 'Positive (1)']), 
                     index = np.unique(['Negative (0)', 'Positive (1)']))
df_cm.index.name = 'Actual'
df_cm.columns.name = 'Predicted'
sns.heatmap(df_cm, cmap="Greens", annot=True,annot_kws={"size": 16}, cbar = False)
plt.title('CONFUSION MATRIX (after drop PARCH)')

plt.subplot(224)
conf_matrix_norm = confusion_matrix(y_test1, y_test_pred1, normalize='true')
df_cm = pd.DataFrame(conf_matrix_norm, columns=np.unique(['Negative (0)', 'Positive (1)']), 
                     index = np.unique(['Negative (0)', 'Positive (1)']))
df_cm.index.name = 'Actual'
df_cm.columns.name = 'Predicted'
sns.heatmap(df_cm, cmap="Greens", annot=True,annot_kws={"size": 14}, cbar = False)

plt.subplots_adjust(hspace=0.4, wspace=0.4)
plt.show()

In [None]:
print('Classification after drop "Parch" and Over sampling')
print(classification_report(y_test1,y_test_pred1))


In [None]:
print('Classification before drop "Parch" and Over sampling')
print(classification_report(y_test,y_test_pred))


## Let's predict


A passenger has the following data:
+ Pclass: 2
+ Sex: 1
+ Age: 25
+ SibSp: 0
+ Parch: 1
+ Fare: 70
+ Embarked: Q

Will this passenger survive or not?
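
Before predicting, the passenger has to be encoded the same way as the training data, including the one-hot columns for Embarked. The sketch below shows one possible encoding; `FEATURE_COLUMNS` and `encode_passenger` are hypothetical names, and the column order is an assumption that should be checked against `list(X_train.columns)`:

```python
# Assumed feature order of the first model (verify with list(X_train.columns)).
FEATURE_COLUMNS = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "C", "Q", "S"]

def encode_passenger(pclass, sex, age, sibsp, parch, fare, embarked):
    """Return the passenger's features as a list ordered like FEATURE_COLUMNS."""
    row = {"Pclass": pclass, "Sex": sex, "Age": age,
           "SibSp": sibsp, "Parch": parch, "Fare": fare}
    # One-hot encode the port of embarkation, mirroring the training data.
    for port in ("C", "Q", "S"):
        row[port] = 1.0 if embarked == port else 0.0
    return [row[col] for col in FEATURE_COLUMNS]

features = encode_passenger(pclass=2, sex=1, age=25, sibsp=0, parch=1, fare=70, embarked="Q")
print(dict(zip(FEATURE_COLUMNS, features)))
```

With the notebook's objects in scope, the row could then be wrapped as `pd.DataFrame([features], columns=FEATURE_COLUMNS)` and passed to `logreg1.predict` or `logreg1.predict_proba`.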

In [None]:
# Your code here: