**D1AED: Análise Estatística para Ciência de Dados** <br/>
IFSP Campinas

Prof: Samuel Martins <br/><br/>

<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
sns.set_style("whitegrid")

params = {'legend.fontsize': 'x-large',
          'figure.figsize': (15, 5),
          'axes.labelsize': 'x-large',
          'axes.titlesize': 'x-large',
          'xtick.labelsize': 'x-large',
          'ytick.labelsize': 'x-large'}
plt.rcParams.update(params)

# <font color='#0C509E' style='font-size: 40px;'>Logistic Regression - v2</font>

## 1. Importing the data

This is a dummy dataset that records whether each customer purchased a given product or not.<br/>

**Dataset:** https://www.kaggle.com/rakeshrau/social-network-ads

In [None]:
df = pd.read_csv('./datasets/Social_Network_Ads.csv')

In [None]:
df.head()

## 2. Extracting the independent and dependent variables
For visualization purposes, we will consider only **two independent variables** (_Age_ and _EstimatedSalary_) for training the logistic regressor.
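
A minimal sketch of this step, assuming the CSV exposes the columns `Age`, `EstimatedSalary` and `Purchased` (the names used later in this notebook). The small `df` built here is a made-up stand-in so that the snippet runs on its own; in the notebook, `df` is the DataFrame loaded in Section 1.

```python
import pandas as pd

# Hypothetical stand-in for the DataFrame loaded in Section 1 (values are made up).
df = pd.DataFrame({
    'Age': [19, 35, 26, 47],
    'EstimatedSalary': [19000, 20000, 43000, 25000],
    'Purchased': [0, 0, 0, 1],
})

# Independent variables (features): kept as a DataFrame so the row index is preserved.
X = df[['Age', 'EstimatedSalary']]

# Dependent variable (target): 1 = purchased, 0 = did not purchase.
y = df['Purchased']
```

Keeping `X` as a DataFrame matters later on: the row labels of the split are what allow the original (non-normalized) values to be recovered for plotting.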

## 3. Splitting the dataset into Training Set and Test Set
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

**Note:** Use a _stratified_ split, so that the training and test sets keep the same class proportions as `y`.

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# stratify=y keeps the proportion of each class the same in both sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

In [None]:
X_train

In [None]:
y_train

In [None]:
X_test

##### **Checking training and test set sizes**

In [None]:
X.shape, y.shape

In [None]:
X_train.shape, y_train.shape

In [None]:
X_test.shape, y_test.shape

##### Saving the samples' indices for visualization purposes

In [None]:
train_indices = X_train.index
test_indices = X_test.index

## 4. Normalizing the data

**StandardScaler**: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html

In [None]:
# learns the normalization parameters (fit) and transforms/normalizes the training set in one step


In [None]:
# equivalent to: np.sqrt(var_)


In [None]:
# transforms/normalizes the test set using the "normalizer" learned from the training set
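
The three steps described in the comments of this section could be sketched as follows. This is only an illustration under assumptions: the name `scaler` and the toy arrays are hypothetical, and in the notebook `X_train` and `X_test` come from the split in Section 3.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy stand-ins for the training and test features (Age, EstimatedSalary).
X_train = np.array([[19, 19000], [35, 20000], [26, 43000], [47, 25000]], dtype=float)
X_test = np.array([[30, 30000], [50, 60000]], dtype=float)

scaler = StandardScaler()

# Learn the mean and standard deviation from the training set only, then transform it.
X_train = scaler.fit_transform(X_train)

# Per-feature standard deviation learned during fit; equivalent to np.sqrt(scaler.var_).
print(scaler.scale_)

# Reuse the training-set statistics on the test set (no refitting).
X_test = scaler.transform(X_test)
```

Note that `fit_transform` returns a NumPy array, which is why the later cells index the data as `X_train[:, 0]` and why the original row labels were saved before this step.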


## 5. Training a Logistic Regression Model
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
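
A minimal sketch of the training step, assuming default hyperparameters. The variable name `classifier` matches the one used in the visualization cells of this notebook; the synthetic data is a stand-in for the normalized training set so that the snippet runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic two-feature data standing in for the normalized training set.
X_train, y_train = make_classification(n_samples=200, n_features=2, n_informative=2,
                                        n_redundant=0, random_state=42)

# Default settings; scikit-learn applies L2 regularization by default.
classifier = LogisticRegression()
classifier.fit(X_train, y_train)

# intercept_ holds theta_0; coef_ holds [theta_1, theta_2] in a (1, 2) array.
print(classifier.intercept_, classifier.coef_)
```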

#### Learned Logistic Regression Model

Therefore, the parameters learned for our **logistic regression model**, from the training set used, were:

$\theta^T = [\theta_0, \theta_1, \theta_2] = [-0.995, 1.96, 1.13]$

<span style="font-size: 20pt">
$
h_\theta(x) = \frac{1}{1 + e^{-\theta^{T}*x}} = \frac{1}{1 + e^{-(-0.995 + 1.96 * x_1 + 1.13 * x_2)}}
$
</span>
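
To make the formula concrete, the sketch below evaluates $h_\theta(x)$ by hand with the parameters reported above. The function name `h` and the two example customers are illustrative choices, not part of the original notebook.

```python
import numpy as np

# Parameters reported above: [theta_0, theta_1, theta_2].
theta = np.array([-0.995, 1.96, 1.13])

def h(x1, x2):
    """Estimated probability of purchase for normalized age x1 and normalized salary x2."""
    z = theta[0] + theta[1] * x1 + theta[2] * x2
    return 1.0 / (1.0 + np.exp(-z))

# A customer with average age and average salary (both 0 after normalization).
p_average = h(0.0, 0.0)

# A customer one standard deviation above average in both features.
p_above = h(1.0, 1.0)

print(round(p_average, 2), round(p_above, 2))
```

With these parameters the first customer gets a probability of about 0.27 and the second about 0.89, so only the second would be classified as a buyer at the 0.5 threshold.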

## 6. Visualizing the Model

## 2D

#### **Decision boundary**

Considering two independent variables, the _decision boundary_ is a straight line in the feature plane:

<span style='font-size: 20pt'>
$\theta_0 + \theta_1 * x_1 + \theta_2 * x_2 = 0$

$\theta_1 * x_1 + \theta_2 * x_2 = -\theta_0$
    
$x_2 = - (\theta_0 + \theta_1 * x_1) / \theta_2$
</span>
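
As a quick numerical check of the last equation, the sketch below takes a few values of $x_1$, computes the matching $x_2$ on the boundary, and confirms that the model outputs a probability of 0.5 there. It assumes the parameter values reported in Section 5; the sample values of `x1` are arbitrary.

```python
import numpy as np

# Parameter values reported in Section 5.
theta_0, theta_1, theta_2 = -0.995, 1.96, 1.13

# Arbitrary normalized ages, and the matching salaries on the decision boundary.
x1 = np.array([-2.0, 0.0, 2.0])
x2 = -(theta_0 + theta_1 * x1) / theta_2

# On the boundary the linear term is zero, so the sigmoid returns 0.5.
z = theta_0 + theta_1 * x1 + theta_2 * x2
prob = 1.0 / (1.0 + np.exp(-z))
print(prob)
```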

In [None]:
theta_0 = classifier.intercept_[0]
theta_1 = classifier.coef_[0, 0]
theta_2 = classifier.coef_[0, 1]

In [None]:
x1_decision_line = np.array([X_train[:,0].min(), X_train[:,0].max()])

In [None]:
x2_decision_line = -(theta_0 + (theta_1 * x1_decision_line)) / theta_2

#### **Putting it all together**

In [None]:
prob_train = classifier.predict_proba(X_train)[:,1].round(2)

In [None]:
df_train = df.loc[train_indices, ['Age', 'EstimatedSalary', 'Purchased']].copy()
df_train

In [None]:
df_train['Normalized Age'] = X_train[:,0]
df_train['Normalized Salary'] = X_train[:,1]
df_train['Estimated Prob'] = classifier.predict_proba(X_train)[:,1].round(2)

# map the 0/1 labels to readable names (assignment avoids chained in-place replacement on a column)
df_train['Purchased'] = df_train['Purchased'].replace({0: 'No', 1: 'Yes'})

df_train.head()

In [None]:
# interactive plot by Plotly

import plotly.express as px
import plotly.graph_objects as go

fig = px.scatter(df_train, x='Normalized Age', y='Normalized Salary', color='Purchased', hover_data=['Age', 'EstimatedSalary', 'Estimated Prob'], color_discrete_sequence=px.colors.qualitative.T10)
fig.add_trace(go.Scatter(x=x1_decision_line, y=x2_decision_line, mode='lines', name='Decision Boundary'))

fig.update_layout(title='Training set and Decision Boundary',
                  xaxis_title='Normalized Age', yaxis_title='Normalized Salary', width=700, height=600)
fig.update_xaxes(range=[X_train.min() - 0.5, X_train.max() + 0.5])
fig.update_yaxes(range=[X_train.min() - 0.5, X_train.max() + 0.5])

fig.show()

In [None]:
# the same plot, now static, with seaborn

plt.figure(figsize=(8,8))

sns.scatterplot(data=df_train, x='Normalized Age', y='Normalized Salary', hue='Purchased')
sns.lineplot(x=x1_decision_line, y=x2_decision_line, color='lightseagreen')

lim = df_train[['Normalized Age', 'Normalized Salary']].min().min() - 0.5, df_train[['Normalized Age', 'Normalized Salary']].max().max() + 0.5
plt.xlim(lim)
plt.ylim(lim)

## 3D

#### **Learned model (sigmoid)**

In [None]:
x_sig = np.arange(X_train[:,0].min() - 0.25, X_train[:,0].max() + 0.25, step=0.1)
y_sig = np.arange(X_train[:,1].min() - 0.25, X_train[:,1].max() + 0.25, step=0.1)
xv, yv = np.meshgrid(x_sig, y_sig)  # all combinations of the x's and y's
z_sig = classifier.predict_proba(np.array([xv.ravel(), yv.ravel()]).T)[:,1].reshape(xv.shape)

#### **Plane at 50% Probability**

In [None]:
x1_plane = [X_train[:,0].min(), X_train[:,0].max()]
x2_plane = [X_train[:,1].min(), X_train[:,1].max()]
z_plane = [[0.5, 0.5], [0.5, 0.5]]

#### **Decision Boundary**

Considering two independent variables, the _decision boundary_ in this 3D view is a vertical plane:

<span style='font-size: 20pt'>
$\theta_0 + \theta_1 * x_1 + \theta_2 * x_2 = 0$

$\theta_1 * x_1 + \theta_2 * x_2 = -\theta_0$
    
$x_2 = - (\theta_0 + \theta_1 * x_1) / \theta_2$
</span>

In [None]:
theta_0 = classifier.intercept_[0]
theta_1 = classifier.coef_[0, 0]
theta_2 = classifier.coef_[0, 1]

In [None]:
x1_decision = np.arange(X_train[:,0].min(), X_train[:,0].max(), step=0.1)
z_decision = np.linspace(0, 1.0, x1_decision.size)

In [None]:
X1_decision, Z_decision = np.meshgrid(x1_decision, z_decision)
X2_decision = -(theta_0 + (theta_1 * X1_decision)) / theta_2

#### **Putting it all together**

In [None]:
# https://stackoverflow.com/a/53116010

import plotly.graph_objects as go

fig = go.Figure(data=[
                    go.Surface(x=X1_decision, y=X2_decision, z=Z_decision, colorscale=[[0, 'paleturquoise'], [1.0, 'paleturquoise']]),
                    go.Surface(x=x1_plane, y=x2_plane, z=z_plane, colorscale=[[0, 'gray'], [1.0, 'gray']]),
                    go.Scatter3d(x=X_train[:,0], y=X_train[:,1], z=y_train, mode='markers',
                                marker=dict(size=6, color=y_train, colorscale=[[0, '#4C78A8'], [1, '#F58518']], opacity=0.8)),
                    go.Surface(x=x_sig, y=y_sig, z=z_sig),
            ])

fig.update_layout(title='Training set and Decision Boundary',
                  scene=dict(xaxis_title='Normalized Age', yaxis_title='Normalized Salary', zaxis_title='Estimated Prob'), width=1200, height=1000)
# 3D axes live under `scene`; update_xaxes/update_yaxes only affect 2D cartesian axes
fig.update_layout(scene_xaxis_range=[X_train.min() - 0.5, X_train.max() + 0.5],
                  scene_yaxis_range=[X_train.min() - 0.5, X_train.max() + 0.5])

fig.show()

## 7. Classification / Prediction
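
A minimal sketch of the prediction step. It is self-contained, so it rebuilds a small pipeline on synthetic data; in the notebook, `classifier` and the normalized `X_test` from the previous sections would be used directly. The names `y_pred` and `prob_test` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic two-feature data standing in for Age and EstimatedSalary.
X, y = make_classification(n_samples=400, n_features=2, n_informative=2,
                           n_redundant=0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Normalize with statistics learned from the training set only.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

classifier = LogisticRegression().fit(X_train, y_train)

# Hard class labels (0 or 1), which correspond to a 0.5 probability threshold.
y_pred = classifier.predict(X_test)

# Estimated probability of the positive class for each test sample.
prob_test = classifier.predict_proba(X_test)[:, 1]
```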

## 8. Metrics
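
One possible way to evaluate the classifier is sketched below, using accuracy, the confusion matrix, and the per-class report from `sklearn.metrics`. The choice of metrics is an assumption (the original section does not list them), and the synthetic data stands in for the real test set.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (normalization omitted here for brevity).
X, y = make_classification(n_samples=400, n_features=2, n_informative=2,
                           n_redundant=0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

classifier = LogisticRegression().fit(X_train, y_train)
y_pred = classifier.predict(X_test)

# Fraction of test samples classified correctly.
acc = accuracy_score(y_test, y_pred)

# Rows are the true classes, columns are the predicted classes.
cm = confusion_matrix(y_test, y_pred)

print(f'Accuracy: {acc:.2f}')
print(cm)
print(classification_report(y_test, y_pred))
```

The diagonal of the confusion matrix counts the correct predictions, so accuracy equals its trace divided by the number of test samples.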

## 9. Visualizing the Predictions

In [None]:
plt.figure(figsize=(8,8))

sns.scatterplot(x=X_test[:,0], y=X_test[:,1], hue=y_test)
sns.lineplot(x=x1_decision_line, y=x2_decision_line, color='lightseagreen')

lim = X_test.min() - 0.5, X_test.max() + 0.5
plt.xlabel('Normalized Age')
plt.ylabel('Normalized Salary')
plt.xlim(lim)
plt.ylim(lim)