# Training Session 2 - Machine Learning Development in Python
---
*João Brazuna*

## Introducing the Datasets
---
In this training session, we will mainly use two real, freely accessible datasets:

### [**Titanic**](https://www.kaggle.com/competitions/titanic/data)
<img src="images/Titanic.jpg" width="200">

The Titanic data contain real information about the passengers, together with an indicator of whether each passenger survived, which is the value to be predicted.

|  ID  | Variable    | Description                                       | Values                                                          |
| ---: | :----       | :----                                             | :---                                                            |
|    0 | PassengerId | Passenger ID                                      | 1, 2, 3,..., 891                                                |
|    1 | Survived    | Passenger Survival Indicator                      | 1 if the passenger survived, 0 otherwise                        |
|    2 | Pclass      | Passenger's Economic Class                        | 1 for 1st class, 2 for 2nd class and 3 for 3rd class passengers |
|    3 | Name        | Passenger's Name                                  | Name of the passenger, including the title (Mr., Miss., Mrs.,...) |
|    4 | Sex         | Passenger's Gender                                | "male" or "female"                                              |
|    5 | Age         | Passenger's Age in Years                          | 0.42, 0.67, 0.75, 0.83, 0.92, 1, 2,...                          |
|    6 | SibSp       | Number of Siblings and Spouses aboard the Titanic | 0, 1, 2,...                                                     |
|    7 | Parch       | Number of Parents and Children aboard the Titanic | 0, 1, 2,...                                                     |
|    8 | Ticket      | Ticket Number                                     | "A/5 21171", "PC 17599", "STON/O2. 3101282",...                 |
|    9 | Fare        | Ticket Price                                      | 0, 4.0125, 5, 6.2375,...                                        |
|   10 | Cabin       | Cabin Occupied by the Passenger                   | "C85", "C123", "E46",...                                        |
|   11 | Embarked    | Port where the Passenger Embarked                 | "C" for Cherbourg, "Q" for Queenstown or "S" for Southampton    |

## Import the Required Packages
---

In [35]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.linear_model import LinearRegression, LogisticRegression

import statsmodels.api as sm

## Load the Data
---

In [36]:
titanic_df = pd.read_csv('data/Titanic.csv')
titanic_df

|     | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
| ---: | ---: | ---: | ---: | :--- | :--- | ---: | ---: | ---: | :--- | ---: | :--- | :--- |
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
| 887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
| 888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
| 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
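
Several columns in the preview above have missing entries (for example `Cabin`, and `Age` in row 888). A minimal sketch of counting them per column follows. It uses `sample_df`, a hypothetical hand-built stand-in for `titanic_df` that holds four of the rows shown and only three columns.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for titanic_df: four of the rows displayed above,
# restricted to three columns.
sample_df = pd.DataFrame({
    'PassengerId': [1, 2, 889, 890],
    'Age': [22.0, 38.0, np.nan, 26.0],
    'Cabin': [np.nan, 'C85', np.nan, 'C148'],
})

# Number and share of missing values per column.
missing_count = sample_df.isna().sum()
missing_share = sample_df.isna().mean()
print(pd.DataFrame({'missing': missing_count, 'share': missing_share}))
```

On the full data, the same two calls applied to `titanic_df` would show which columns are affected before any of them is dropped.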


In [37]:
class ConfusionMatrix():
    def __init__(self, y_real, y_pred):
        # y_real must contain both levels, 1 and 0, and nothing else.
        if (len(np.intersect1d(y_real, [1, 0])) != 2) or len(np.unique(y_real)) != 2:
            raise ValueError("y_real must contain exactly the two levels 1 and 0.")
        cm = confusion_matrix(y_real, y_pred, labels=[1, 0])
        cm_df = pd.DataFrame(cm, index=['Real Positive', 'Real Negative'], columns=['Predicted Positive', 'Predicted Negative'])
        self._confusion_matrix = cm_df
    
    @property
    def confusion_matrix(self):
        return self._confusion_matrix
    
    @property
    def tp(self):
        return self.confusion_matrix.loc['Real Positive', 'Predicted Positive']
    
    @property
    def tn(self):
        return self.confusion_matrix.loc['Real Negative', 'Predicted Negative']
    
    @property
    def fp(self):
        return self.confusion_matrix.loc['Real Negative', 'Predicted Positive']
    
    @property
    def fn(self):
        return self.confusion_matrix.loc['Real Positive', 'Predicted Negative']
    
    @property
    def positive(self):
        return self.confusion_matrix.loc['Real Positive', :].sum()
    
    @property
    def negative(self):
        return self.confusion_matrix.loc['Real Negative', :].sum()
    
    @property
    def predicted_positive(self):
        return self.confusion_matrix.loc[:, 'Predicted Positive'].sum()
    
    @property
    def predicted_negative(self):
        return self.confusion_matrix.loc[:, 'Predicted Negative'].sum()
    
    ## Performance Measures ##
    @property
    def accuracy(self):
        return (self.tp + self.tn) / (self.positive + self.negative)
    
    # Sensitivity (exposed here under the name `sensibility`)
    @property
    def sensibility(self):
        return self.tp / self.positive
    
    @property
    def recall(self):
        return self.sensibility
    
    @property
    def tpr(self):
        return self.sensibility
    
    # Specificity
    @property
    def specificity(self):
        return self.tn / self.negative
    
    @property
    def selectivity(self):
        return self.specificity
    
    @property
    def tnr(self):
        return self.specificity
    
    # False Positive Rate
    @property
    def fpr(self):
        return 1 - self.specificity
    
    # False Negative Rate
    @property
    def fnr(self):
        return 1 - self.sensibility
    
    # PPV / Precision
    @property
    def precision(self):
        return self.tp / self.predicted_positive
    
    @property
    def ppv(self):
        return self.precision
    
    # NPV
    @property
    def npv(self):
        return self.tn / self.predicted_negative
    
    # False Discovery Rate
    @property
    def fdr(self):
        return 1 - self.ppv
    
    # False Omission Rate
    @property
    def fomr(self):
        return 1 - self.npv
    
    # Performance Measures
    @property
    def performance_measures(self):
        pm_dct = {'Accuracy': self.accuracy,
                  'Precision / PPV': self.precision,
                  'Recall / Sensibility': self.sensibility,
                  'NPV': self.npv,
                  'Specificity': self.specificity}
        return pd.Series(pm_dct).to_frame('Performance Measures').T
    
    
def target_by_feature(x, target, data, fun='mean', bins=None,
                      xlabel=None, ylabel=None, title=None, targetlabels=None, 
                      xrotation=None, show=True):
    df = data.copy()
    df['Number of Observations'] = 1
    
    if isinstance(target, str):
        targets = [target]
    elif isinstance(target, list) and all([isinstance(y, str) for y in target]):
        targets = target
    else:
        raise ValueError(f"target argument can only be a string or a list of strings")
    
    if targetlabels is None:
        targetlabels = targets
    elif (isinstance(targetlabels, str) and len(targets) == 1):
        targetlabels = [targetlabels]
    elif ((isinstance(targetlabels, str) and len(targets) != 1)) or (isinstance(targetlabels, list) and (len(targetlabels) != len(targets))):
        raise ValueError("targetlabels argument must be a string if target is a string or a singleton list, or a list with the same number of elements as target.")

    if bins is not None:
        if isinstance(bins, list):
            if len(bins) != 3:
                raise ValueError("When a list, bins must have 3 elements: [start, stop, step]")
            bins = [df[x].min() - bins[2], *np.arange(bins[0], bins[1], bins[2]).tolist(), df[x].max()]     
            n_bins = len(bins)
        elif isinstance(bins, int):
            n_bins = bins
            
        unique_vals = df[x].unique()
        if len(unique_vals) > n_bins:
            df[x] = pd.cut(df[x], bins=bins)#.apply(lambda x: x.right)
    
    agg_funs = {'Number of Observations': 'sum'}
    for target in targets:
        agg_funs[target] = fun
    
    grp_df = df.groupby(x).agg(agg_funs).sort_index()
    grp_df['Number of Observations (%)'] = grp_df['Number of Observations'] / grp_df['Number of Observations'].sum() * 100
    grp_df.reset_index(inplace=True)
    
    ax = sns.barplot(x=x, y='Number of Observations (%)', data=grp_df, color='grey', alpha=0.4)
    ax.yaxis.set_major_formatter(PercentFormatter(decimals=0))
    ylims = ax.get_ylim()
    y_adj = (ylims[1] - ylims[0]) / 100
    
    for ii, p in enumerate(ax.patches):
        txt = str(int(grp_df.iloc[ii]['Number of Observations']))
        txt_x = p.get_x() + p.get_width() / 2
        txt_y = p.get_height() + y_adj
        ax.text(txt_x, txt_y, txt, ha='center')
    
    ax2 = ax.twinx()
    
    for target in targets:
        sns.lineplot(x=ax.get_xticks(), y=target, data=grp_df, ax=ax2)
    
    if xlabel is None:
        xlabel = x.title()
    ax.set_xlabel(xlabel)
    
    prefix = 'Average ' if fun in ['mean', np.mean] else 'Sum ' if fun in ['sum', np.sum] else ''
    if ylabel is None:
        ylabel = prefix + 'Target Value'
    ax2.set_ylabel(ylabel)
    
    if title is None:
        title = ylabel + ' by ' + xlabel
    ax.set_title(title)
    
    plt.legend(labels=targetlabels, bbox_to_anchor=(1.7,1))
    
    if xrotation is not None:
        ax.tick_params(axis='x', rotation=xrotation)
    if show:
        plt.show()
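
The aggregation at the core of `target_by_feature` (count the observations and aggregate the target within each level of the feature) can be sketched without any plotting. The frame `toy_df` and the result column names below are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy data: a binary target and one categorical feature.
toy_df = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male', 'male'],
    'Survived': [0, 1, 1, 0, 1],
})

# Count observations and average the target within each level of the feature.
grp = toy_df.groupby('Sex').agg(n_obs=('Survived', 'size'),
                                target_mean=('Survived', 'mean'))

# Share of observations per level, in percent (the bar heights in the plot).
grp['n_obs_pct'] = grp['n_obs'] / grp['n_obs'].sum() * 100
print(grp)
```

In the function, the bars show the percentage of observations per level and the line on the secondary axis shows the aggregated target.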

## Logistic Regression - Predicting the Probability of Survival

In [38]:
numeric_vars = titanic_df.dtypes[titanic_df.dtypes != 'object'].index.to_list()

In [44]:
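# Drop PassengerId and Pclass, Name and Ticket (not relevant here) and Cabin (too many missing values);
# dropna(axis=1) then drops every remaining column with missing values (Age and Embarked).
X = titanic_df.drop(['PassengerId', 'Pclass','Name','Cabin','Ticket'], axis=1).dropna(axis=1)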
X['Sex'] = (X['Sex'] == 'male').astype(int)
X


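|     | Survived | Sex | SibSp | Parch | Fare |
| ---: | ---: | ---: | ---: | ---: | ---: |
| 0 | 0 | 1 | 1 | 0 | 7.2500 |
| 1 | 1 | 0 | 1 | 0 | 71.2833 |
| 2 | 1 | 0 | 0 | 0 | 7.9250 |
| 3 | 1 | 0 | 1 | 0 | 53.1000 |
| 4 | 0 | 1 | 0 | 0 | 8.0500 |
| ... | ... | ... | ... | ... | ... |
| 886 | 0 | 1 | 0 | 0 | 13.0000 |
| 887 | 1 | 0 | 0 | 0 | 30.0000 |
| 888 | 0 | 0 | 1 | 2 | 23.4500 |
| 889 | 1 | 1 | 0 | 0 | 30.0000 |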
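
`dropna(axis=1)` removes whole columns rather than rows, which is why `Age` no longer appears in `X`. The sketch below shows that behaviour on a hypothetical toy frame, next to one possible alternative (median imputation) that is an assumption for illustration and not what the notebook does.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: 'Age' has one missing value, 'Fare' has none.
toy_df = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0],
    'Fare': [7.25, 71.2833, 7.925],
})

# axis=1 drops every column that contains at least one missing value.
dropped_cols = toy_df.dropna(axis=1)

# Alternative (an assumption, not the notebook's choice): keep 'Age'
# and fill the gap with the column median.
imputed = toy_df.fillna({'Age': toy_df['Age'].median()})

print(list(dropped_cols.columns))
print(imputed['Age'].tolist())
```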


In [45]:
y = pd.get_dummies(titanic_df['Pclass']).iloc[:,1] # indicator for Pclass == 2 (the dummy columns are 1, 2 and 3)
y

0      0
1      0
2      0
3      0
4      0
      ..
886    1
887    0
888    0
889    0
890    0
Name: 2, Length: 891, dtype: uint8

In [58]:
y2 = pd.get_dummies(titanic_df['Pclass']).iloc[:,2] # indicator for Pclass == 3 (the dummy columns are 1, 2 and 3)
y2

0      1
1      0
2      1
3      0
4      1
      ..
886    0
887    0
888    1
889    0
890    1
Name: 3, Length: 891, dtype: uint8
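
The two targets above are picked from the dummy columns by position. A small sketch on a hypothetical toy column (`pclass` is illustrative, not part of the notebook) shows what `pd.get_dummies` returns and that a direct comparison yields the same indicator.

```python
import pandas as pd

# Hypothetical toy column with the three passenger classes.
pclass = pd.Series([3, 1, 3, 2, 1], name='Pclass')

# One indicator column per class; dtype=int requests 0/1 integers.
dummies = pd.get_dummies(pclass, dtype=int)

# Positional column 1 is the indicator for class 2 ...
y_second = dummies.iloc[:, 1]
# ... which matches a direct comparison against the class label.
y_direct = (pclass == 2).astype(int)

print(dummies.columns.tolist())
print(y_second.tolist())
```

The comparison form does not depend on the order of the dummy columns, which may make the intent easier to read.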

In [47]:
lm = sm.Logit(y, sm.add_constant(X))
res = lm.fit()
res.summary()

Optimization terminated successfully.
         Current function value: inf
         Iterations 7


  return 1/(1+np.exp(-X))
  return np.sum(np.log(self.cdf(q*np.dot(X,params))))


|  |  |  |  |
| :--- | :--- | :--- | ---: |
| Dep. Variable: | 2 | No. Observations: | 891.0 |
| Model: | Logit | Df Residuals: | 885.0 |
| Method: | MLE | Df Model: | 5.0 |
| Date: | Thu, 20 Apr 2023 | Pseudo R-squ.: | inf |
| Time: | 21:01:59 | Log-Likelihood: | -inf |
| converged: | True | LL-Null: | 0.0 |
| Covariance Type: | nonrobust | LLR p-value: | 1.0 |

|  | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
| :--- | ---: | ---: | ---: | ---: | ---: | ---: |
| const | -1.0927 | 0.229 | -4.777 | 0.000 | -1.541 | -0.644 |
| Survived | 0.6388 | 0.206 | 3.095 | 0.002 | 0.234 | 1.043 |
| Sex | -0.1753 | 0.211 | -0.830 | 0.406 | -0.589 | 0.239 |
| SibSp | -0.0837 | 0.104 | -0.805 | 0.421 | -0.288 | 0.120 |
| Parch | 0.1219 | 0.116 | 1.054 | 0.292 | -0.105 | 0.349 |
| Fare | -0.0161 | 0.004 | -4.117 | 0.000 | -0.024 | -0.008 |
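
The summary above reports an infinite log-likelihood even though the coefficients are finite. One possible explanation, offered here as an assumption rather than a confirmed diagnosis, is the `uint8` dtype of `y`. The warning fragments show an expression of the form `q*np.dot(X,params)`; if `q` is derived from the target as `2*y - 1` in unsigned 8-bit arithmetic, the value for `y == 0` wraps around to 255 instead of -1. The sketch below only demonstrates that wraparound with NumPy.

```python
import numpy as np

# A 0/1 target stored as unsigned 8-bit integers, like the uint8 output above.
y_uint8 = np.array([0, 1, 0], dtype=np.uint8)

# Unsigned arithmetic cannot represent -1, so 2*0 - 1 wraps around to 255.
q_wrapped = 2 * y_uint8 - 1

# After casting to a floating type, the result is the intended -1 / +1.
q_expected = 2 * y_uint8.astype(float) - 1

print(q_wrapped.tolist())
print(q_expected.tolist())
```

If this is indeed the cause, casting the target before fitting (for example `y.astype(float)`) would be one way to check.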


In [48]:
lm2 = LogisticRegression(random_state=123)
lm2.fit(X, y)
lm2.score(X, y)

0.7934904601571269
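
`score` returns the mean accuracy of the hard class predictions. The estimated probabilities themselves come from `predict_proba`. A minimal sketch on hypothetical synthetic data (`X_toy`, `y_toy` and `clf` are illustrative names, not part of the notebook):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical synthetic data: one feature, binary target.
rng = np.random.default_rng(123)
X_toy = rng.normal(size=(200, 1))
y_toy = (X_toy[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

clf = LogisticRegression(random_state=123).fit(X_toy, y_toy)

# Column order of predict_proba follows clf.classes_, here [0, 1].
proba = clf.predict_proba(X_toy)
p_positive = proba[:, 1]

# predict() picks the most probable class, which for two classes
# amounts to thresholding P(y = 1) at 0.5.
labels = clf.predict(X_toy)
print(clf.classes_, p_positive[:3], labels[:3])
```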

In [49]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(712, 5) (179, 5) (712,) (179,)


In [50]:
lm2.fit(X_train, y_train)
lm2.score(X_train, y_train)

0.7907303370786517

In [51]:
lm2.score(X_test, y_test)

0.8044692737430168
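
The split used here is purely random, so the share of positives can differ between the training and test parts. One option, shown as a sketch on hypothetical toy data rather than as the notebook's method, is the `stratify` argument of `train_test_split`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 20% positives, 80% negatives.
y_toy = np.array([1] * 20 + [0] * 80)
X_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy keeps the class proportions equal in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=123, stratify=y_toy)

print(y_tr.mean(), y_te.mean())
```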

In [52]:
confusion_matrix(y_test, lm2.predict(X_test), labels=[1, 0])

array([[  0,  35],
       [  0, 144]], dtype=int64)
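
With `labels=[1, 0]`, the positive class comes first in both rows and columns, so the first row `[0, 35]` above means that all 35 real positives were predicted as negative. The layout can be checked on hypothetical toy labels:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical toy labels.
y_true = np.array([1, 1, 1, 0, 0, 0, 0])
y_hat = np.array([1, 0, 0, 0, 0, 0, 1])

# Rows are real classes, columns are predicted classes, both in the
# order given by `labels`, so with labels=[1, 0] the layout is
# [[TP, FN],
#  [FP, TN]].
cm_toy = confusion_matrix(y_true, y_hat, labels=[1, 0])
tp, fn, fp, tn = cm_toy.ravel()
print(cm_toy)
```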

In [53]:
(35 + 90) / (35 + 90 + 24 + 30)

0.6983240223463687

In [54]:
y_pred = lm2.predict(X_test)

In [55]:
cm = ConfusionMatrix(y_test, lm2.predict(X_test))
cm.confusion_matrix.T

|  | Real Positive | Real Negative |
| :--- | ---: | ---: |
| Predicted Positive | 0 | 0 |
| Predicted Negative | 35 | 144 |
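
The matrix shows that the model never predicts the positive class, so its accuracy equals the share of negatives in the test set (144 out of 179). A majority-class baseline reproduces that figure. The sketch below rebuilds a hypothetical target with the same class counts; `DummyClassifier` is used only as a reference point and is not part of the notebook.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical test target with the same class counts as above:
# 35 positives and 144 negatives.
y_toy = np.array([1] * 35 + [0] * 144)
X_toy = np.zeros((len(y_toy), 1))  # features are ignored by this baseline

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy='most_frequent').fit(X_toy, y_toy)
baseline_acc = baseline.score(X_toy, y_toy)
print(round(baseline_acc, 6))
```

Comparing a model against such a baseline helps to judge whether an accuracy of about 0.80 reflects any learning at all.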


In [56]:
cm.performance_measures

  return self.tp / self.predicted_positive


|  | Accuracy | Precision / PPV | Recall / Sensibility | NPV | Specificity |
| :--- | ---: | ---: | ---: | ---: | ---: |
| Performance Measures | 0.804469 | NaN | 0.0 | 0.804469 | 1.0 |
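
The undefined `Precision / PPV` entry above comes from dividing `tp` by `predicted_positive` when both are zero. scikit-learn's `precision_score` exposes a `zero_division` argument for this case. A minimal sketch on hypothetical labels:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels where the model never predicts the positive class.
y_true = np.array([1, 1, 0, 0, 0])
y_hat = np.zeros(5, dtype=int)

# TP + FP = 0, so precision is 0/0; zero_division sets the value
# returned in that case (and suppresses the warning).
prec = precision_score(y_true, y_hat, zero_division=0)
rec = recall_score(y_true, y_hat)
print(prec, rec)
```

Whether 0 or an undefined value is the better convention depends on how the metric is used downstream; the `ConfusionMatrix` class above leaves it undefined.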
