**D1AED: Análise Estatística para Ciência de Dados** <br/>
IFSP Campinas

Prof: Samuel Martins <br/><br/>

<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
sns.set_style("whitegrid")

params = {'legend.fontsize': 'x-large',
          'figure.figsize': (15, 5),
          'axes.labelsize': 'x-large',
          'axes.titlesize': 'x-large',
          'xtick.labelsize': 'x-large',
          'ytick.labelsize': 'x-large'}
plt.rcParams.update(params)

# <font color='#0C509E' style='font-size: 40px;'>Logistic Regression - V1</font>

## 1. Exploring the Data

This is a dummy dataset that shows whether customers purchased a given product.<br/>

**Dataset:** https://www.kaggle.com/rakeshrau/social-network-ads

### 1.1. Importing the Dataset

In [None]:
df = pd.read_csv('./datasets/Social_Network_Ads.csv')

In [None]:
df.head()

### 1.2. Basic Information about the Dataset

In [None]:
print(f'This dataset has {df.shape[0]} observations/samples/rows and {df.shape[1]} attributes/features/columns')

In [None]:
df.info()

<br/>**"Purchased"** is a _dependent variable_ and the others are _independent variables_.

**Two classes: Binary** Classification Problem.

There is a **class imbalance**: _class 0_ represents 64.25% of the samples while _class 1_ has 35.75%. <br/>
This can _hinder_ the model training. <br/>
However, we will ignore this for now.
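The proportions quoted above can be checked with `value_counts(normalize=True)`. The sketch below uses a stand-in Series with 257 zeros and 143 ones (an assumption chosen only to reproduce the quoted percentages), not the actual `df['Purchased']` column.

```python
import pandas as pd

# Stand-in for df['Purchased'] (assumption): 257 negatives and 143 positives
purchased = pd.Series([0] * 257 + [1] * 143, name='Purchased')

# Relative frequency of each class
class_proportions = purchased.value_counts(normalize=True)
print(class_proportions)
```

On the real data, the same call on `df['Purchased']` should give the figures above.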

### 1.3. Correlation matrix

In [None]:
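# numeric_only=True skips non-numeric columns, which recent pandas versions no longer drop silently
df.corr(numeric_only=True)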

## 2. Training a Logistic Regression model with a _single_ feature

Let's start with a simpler version of our **binary classifier**, training a logistic regressor on _just_ **a single independent variable** (_feature_).

### 2.1. Extracting the independent and dependent variables

In [None]:
df.head()

#### Feature matrix (X)

In [None]:
X = df[['Age']]  # single feature, kept as a DataFrame (2D)
X

#### Ground-truth Classes (y)

In [None]:
y = df['Purchased']  # dependent variable
y

### 2.2. Splitting the dataset into Training Set and Test Set
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

**PS:** Use _stratified_ split.

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# stratify=y keeps the class proportions similar in the training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
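What a _stratified_ split does can be seen on toy data. This is a minimal sketch with made-up arrays (`X_toy` and `y_toy` are assumptions, not the course dataset). It passes the labels to the `stratify` parameter of `train_test_split` and compares the share of positives before and after the split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (assumption): 400 samples, 143 of them in the positive class
X_toy = np.arange(400).reshape(-1, 1)
y_toy = np.array([0] * 257 + [1] * 143)

# stratify=y_toy keeps the class proportions (nearly) equal in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(f'positives in full data:    {y_toy.mean():.4f}')
print(f'positives in training set: {y_tr.mean():.4f}')
print(f'positives in test set:     {y_te.mean():.4f}')
```

Without `stratify`, the split is purely random and the proportions in the two subsets can drift apart, especially on small datasets.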

##### **Checking training and test set sizes**

In [None]:
X.shape, y.shape

In [None]:
X_train.shape, y_train.shape

In [None]:
X_test.shape, y_test.shape

In [None]:
X_train.head()

In [None]:
X_test.head()

##### **Checking the class proportions**
Recall that, considering the whole data, _class 0_ has 64.25% of the samples while _class 1_ has 35.75%

In [None]:
# training set
y_train.value_counts(normalize=True)

In [None]:
# test set
y_test.value_counts(normalize=True)


### 2.3 Visualizing the training set

### 2.4. Training a Logistic Regression Model
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
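A minimal sketch of the training step, assuming scikit-learn's default `LogisticRegression` settings. The data here is synthetic (`X_train_toy` and `y_train_toy` are hypothetical stand-ins for the real training set), so the learned values will not match the ones reported in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the training set (assumption, not the real data):
# older customers are more likely to have purchased
rng = np.random.default_rng(42)
age = rng.integers(18, 61, size=300)
prob = 1 / (1 + np.exp(-0.2 * (age - 42)))
X_train_toy = pd.DataFrame({'Age': age})
y_train_toy = pd.Series((rng.random(300) < prob).astype(int), name='Purchased')

# Fit the logistic regression with scikit-learn's default settings
classifier = LogisticRegression()
classifier.fit(X_train_toy, y_train_toy)

theta_0 = classifier.intercept_[0]   # bias term
theta_1 = classifier.coef_[0, 0]     # weight of the single feature
print(f'theta_0 = {theta_0:.2f}, theta_1 = {theta_1:.2f}')
```

After fitting, `intercept_` holds $\theta_0$ and `coef_` holds the feature weights, here only $\theta_1$.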

#### Model's Parameters

Therefore, the parameters learned for our **logistic regression model** from the training set were:

$\theta^T = [\theta_0, \theta_1] = [-7.32, 0.17]$

<span style="font-size: 20pt">
$
h_\theta(x) = \frac{1}{1 + e^{-\theta^{T}*x}} = \frac{1}{1 + e^{-(-7.32 + 0.17 * x_1)}}
$
</span>
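To get a feel for the model, the sketch below evaluates the sigmoid at a few ages using the rounded parameters quoted above. The helper name `h` and the chosen ages are illustrative assumptions.

```python
import math

# Rounded parameters quoted in the text
theta_0, theta_1 = -7.32, 0.17

def h(age):
    """Estimated probability of the positive class for a given age."""
    z = theta_0 + theta_1 * age
    return 1 / (1 + math.exp(-z))

for age in (30, 43, 60):
    print(f'h({age}) = {h(age):.3f}')
```

With these rounded values, the estimated probability is low at age 30, close to 0.5 around age 43, and high at age 60.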

### 2.5. Visualizing the trained Model

#### **Trained Model (Sigmoid)**

In [None]:
x_sig = X_train['Age']

# prob. of the samples being classified as the positive class
y_sig = classifier.predict_proba(X_train)[:, 1]

#### **Decision boundary**

Using a _single feature_, the **decision boundary** is a _vertical line_:

<span style='font-size: 20pt'>
$\theta_0 + \theta_1 * x_1 = 0$

$x_1 = \frac{-\theta_0}{\theta_1}$
</span>
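The boundary equation follows from applying a probability threshold of 0.5 to the model. The threshold is an assumption here, although it matches what scikit-learn's `predict` does for binary logistic regression:

$$
h_\theta(x) = 0.5 \iff \frac{1}{1 + e^{-(\theta_0 + \theta_1 x_1)}} = \frac{1}{2} \iff e^{-(\theta_0 + \theta_1 x_1)} = 1 \iff \theta_0 + \theta_1 x_1 = 0
$$

When $\theta_1 > 0$, samples with $x_1 > -\theta_0 / \theta_1$ get a probability above 0.5 and are assigned to the positive class.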

In [None]:
theta_0 = classifier.intercept_[0]
theta_1 = classifier.coef_[0, 0]

In [None]:
decision_boundary = -theta_0 / theta_1
decision_boundary

The resulting _decision boundary_ is:

$x_1 \approx 42.82$

#### **Putting it all together**

In [None]:
train_indices = X_train.index
train_indices

In [None]:
df_train = df.loc[train_indices]  # X_train.index holds labels, so select by label
df_train

In [None]:
plt.figure(figsize=(16,8))

ax = sns.scatterplot(data=df_train, x='Age', y='Purchased', style='Purchased', hue='Purchased', s=100)
sns.lineplot(x=x_sig, y=y_sig, color='red', ax=ax)
plt.hlines(0.5, df_train['Age'].min(), decision_boundary, colors='gray', linestyles='--')
plt.axvline(decision_boundary, color='green')

x_ticks = np.append(ax.get_xticks(), decision_boundary)

ax.set_xticks(x_ticks)
plt.yticks(np.arange(0, 1.001, 0.1))
plt.xlim(df_train['Age'].min(), df_train['Age'].max())
plt.ylim(-0.01, 1.01)
plt.title('Logistic Regression - TRAIN')
# plt.legend(['learned model (sigmoid)', 'learned linear model', 'decision boundary'])

## 3. Classification / Prediction
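A minimal sketch of the prediction step, assuming a fitted scikit-learn `LogisticRegression`. The data is synthetic and `X_new` is a hypothetical set of unseen customers. It shows the two usual outputs, hard labels from `predict` and probabilities from `predict_proba`, and how they relate.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption): purchase probability grows with age
rng = np.random.default_rng(0)
age = rng.integers(18, 61, size=300)
labels = (rng.random(300) < 1 / (1 + np.exp(-0.2 * (age - 42)))).astype(int)

classifier = LogisticRegression().fit(pd.DataFrame({'Age': age}), labels)

# Hypothetical unseen customers
X_new = pd.DataFrame({'Age': [20, 35, 50, 60]})

y_pred = classifier.predict(X_new)               # hard class labels
y_proba = classifier.predict_proba(X_new)[:, 1]  # P(class 1 | age)

# predict() is equivalent to thresholding the probability at 0.5
y_from_threshold = (y_proba > 0.5).astype(int)
print(y_pred, y_proba.round(3), y_from_threshold)
```

In this notebook, the same calls would be made on `X_test` to obtain the predictions evaluated in the next section.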

## 4. Evaluation Metrics
For simplicity, we will use only a few metrics. A complete overview of evaluation metrics for classification will be covered later.

### 4.1. Confusion Matrix

<img src='imgs/confusion_matrix.png' width=250px/>
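One way to obtain the four counts is scikit-learn's `confusion_matrix`. The sketch below uses small hand-written label lists (an assumption, standing in for `y_test` and the model's predictions).

```python
from sklearn.metrics import confusion_matrix

# Toy labels (assumption), standing in for y_test and the model's predictions
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# For binary labels {0, 1}, rows are true classes and columns are predictions:
# [[tn, fp],
#  [fn, tp]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f'TN={tn}, FP={fp}, FN={fn}, TP={tp}')
```

For these toy labels the counts are TN=5, FP=1, FN=1, TP=3.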

In [None]:
conf_matrix_df = pd.DataFrame({
    'Pred Label – Negative': [tn, fn],
    'Pred Label – Positive': [fp, tp]
}, index=['True Label – Negative', 'True Label – Positive'])

conf_matrix_df

### 4.2. Precision

What is the **proportion** of **_true_ positives** among _all_ instances _classified_ (_correctly_ and _incorrectly_) as **_positives_**? <br/>
**How precise** is the **_positive_** classification?

_“From all patients classified as **cancer**, how many (**proportion**) actually had **cancer**?”_

<span style="font-size: 16pt">
$
precision = \frac{TP}{TP + FP}
$
</span>
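The formula translates directly into code. The counts below are hypothetical, chosen only to illustrate the arithmetic.

```python
# Counts from a hypothetical confusion matrix (assumption)
tp, fp = 30, 10

# Share of positive predictions that are actually positive
precision = tp / (tp + fp)
print(f'Precision: {precision}')  # Precision: 0.75
```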

In [None]:
print(f'Precision: {precision}')

#### Alternatively
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html <br/>
By default, the label of the positive class is 1
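A short sketch of the scikit-learn call on toy labels (an assumption, not the notebook's predictions). It also shows the `pos_label` parameter, which selects the class treated as positive.

```python
from sklearn.metrics import precision_score

# Toy labels (assumption)
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# Default: class 1 is the positive class
precision_sk = precision_score(y_true, y_pred)

# Same metric, treating class 0 as the positive class
precision_for_class_0 = precision_score(y_true, y_pred, pos_label=0)

print(f'Precision (class 1): {precision_sk:.3f}')
print(f'Precision (class 0): {precision_for_class_0:.3f}')
```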

In [None]:
print(f'Precision: {precision_sk}')

### 4.3. Recall / Sensitivity / True Positive Rate
What is the **proportion** of ***positives*** that were _correctly classified_ as ***positives***?

_What is the **proportion** of patients with **cancer** that were **correctly identified**? <br/>
**How sensitive** is the classifier to **correctly identify** patients with **cancer**?_

<span style="font-size: 16pt">
$
recall = \frac{TP}{FN+TP}
$
</span>
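As with precision, recall can be computed directly from the counts. The numbers below are hypothetical.

```python
# Counts from a hypothetical confusion matrix (assumption)
tp, fn = 30, 20

# Share of actual positives that the classifier found
recall = tp / (fn + tp)
print(f'Recall: {recall}')
```

Here 30 of the 50 actual positives are found, so the recall is 0.6.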

In [None]:
print(f'Recall: {recall}')

#### Alternatively
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html <br/>
By default, the label of the positive class is 1

In [None]:
print(f'Recall: {recall_sk}')

### 4.4. Accuracy
What was the (overall) **classification hit rate**?

<span style="font-size: 16pt">
$
accuracy = \frac{TP + TN}{TN + FN + FP + TP}
$
</span>

In [None]:
print(f'Accuracy: {accuracy}')

#### Alternatively
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html

In [None]:
print(f'Accuracy: {accuracy_sk}')

#### Problems:
Issue with **biased dataset (class imbalance)**: Easy to **cheat** the score:
- Consider that _only_ 1% of the samples are ***positives***
- Guess everything as ***negative***
- Achieve 99% accuracy

**Solutions:** Balanced Accuracy or F1-Score
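The scenario in the list above can be reproduced in a few lines of plain Python. This is a sketch under stated assumptions: balanced accuracy is taken as the mean of the per-class recalls, and F1 is set to 0 when its denominator is zero.

```python
# Toy scenario from the list above: 1% positives, everything predicted negative
y_true = [1] * 10 + [0] * 990
y_pred = [0] * 1000

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

accuracy = (tp + tn) / len(y_true)

# Balanced accuracy: mean of the recall obtained on each class
recall_pos = tp / (tp + fn)
recall_neg = tn / (tn + fp)
balanced_accuracy = (recall_pos + recall_neg) / 2

# F1 written as 2TP / (2TP + FP + FN); set to 0 if the denominator is 0
denominator = 2 * tp + fp + fn
f1 = 2 * tp / denominator if denominator else 0.0

print(f'Accuracy: {accuracy:.2f}')
print(f'Balanced accuracy: {balanced_accuracy:.2f}')
print(f'F1 Score: {f1:.2f}')
```

The accuracy is 0.99, yet the balanced accuracy is only 0.5 and the F1-Score is 0, which exposes the useless classifier.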

### 4.5. F1-Score
How **good** (_precision_) and **complete** (_sensitive_) are the _predictions_?

It is the _harmonic mean_ of the _precision_ and _recall_. <br/>
The **highest possible value** is **1.0**, indicating _perfect precision_ and _recall_; <br/>
The **lowest possible value** is 0, if either the _precision_ or the _recall_ is **zero**.

<span style="font-size: 16pt">
$
F1 = 2 * \frac{precision \space * \space recall}{precision \space + \space recall}
$
</span>
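The harmonic mean penalizes a large gap between precision and recall far more than the arithmetic mean does. The sketch below illustrates this with hypothetical values. The helper `f1_score_from` is an illustrative name, not a library function.

```python
def f1_score_from(precision, recall):
    """Harmonic mean of precision and recall (0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Balanced case: precision and recall are equal
balanced = f1_score_from(0.75, 0.75)

# Lopsided case: perfect precision but very low recall
lopsided = f1_score_from(1.0, 0.1)
arithmetic_mean = (1.0 + 0.1) / 2

print(f'F1 (0.75, 0.75): {balanced:.3f}')
print(f'F1 (1.0, 0.1):   {lopsided:.3f}')
print(f'Arithmetic mean: {arithmetic_mean:.3f}')
```

In the lopsided case the F1-Score is about 0.18, while the arithmetic mean would report 0.55.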

In [None]:
print(f'F1 Score: {f1}')

#### Alternatively
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html

In [None]:
print(f'F1 Score: {f1_sk}')

### 4.6. Classification Report
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html
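A minimal sketch of the report on toy labels (an assumption). The `target_names` values are hypothetical display names for classes 0 and 1.

```python
from sklearn.metrics import classification_report

# Toy labels (assumption), standing in for y_test and the model's predictions
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# Precision, recall, F1-score and support for each class, plus averages
report = classification_report(
    y_true, y_pred, target_names=['Not purchased', 'Purchased']
)
print(report)
```

The report gathers the per-class versions of the metrics from the previous sections, so it is a convenient single summary.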

## 5. Visualizing the Predictions

In [None]:
test_indices = X_test.index
test_indices

In [None]:
df_test = df.loc[test_indices]  # X_test.index holds labels, so select by label
df_test

In [None]:
plt.figure(figsize=(16,8))

ax = sns.scatterplot(data=df_test, x='Age', y='Purchased', style='Purchased', hue='Purchased', s=100)
sns.lineplot(x=x_sig, y=y_sig, color='red', ax=ax)
plt.hlines(0.5, X_test['Age'].min(), decision_boundary, colors='gray', linestyles='--')
plt.axvline(decision_boundary, color='green')

x_ticks = np.append(ax.get_xticks(), decision_boundary)

ax.set_xticks(x_ticks)
plt.yticks(np.arange(0, 1.001, 0.1))
plt.xlim(df_test['Age'].min(), df_test['Age'].max())
plt.ylim(-0.01, 1.01)
plt.title('Logistic Regression - TEST')