In [None]:
# initial setup
try:
    # settings colab:
    import google.colab
        
except ModuleNotFoundError:    
    # settings local:
    %run "../../../common/0_notebooks_base_setup.py"

---

<img src='../../../common/logo_DH.png' align='left' width=35%/>


# Naive Bayes

## Imports

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder 
from sklearn.preprocessing import RobustScaler
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

## Problem

We will build a classifier that tries to predict whether a person earns more than US$ 50,000 per year.

We prepared the data in the checkpoint practice. The datasets produced by that practice are the input to this one. If you have not done it yet, start there and then come back to this notebook.


## Dataset

https://archive.ics.uci.edu/ml/datasets/Adult

The data come from a 1994 census.

The fields are:

age: continuous.

workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.

fnlwgt: continuous.

education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.

education-num: continuous.

marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.

occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.

relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.

race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.

sex: Female, Male.

capital-gain: continuous.

capital-loss: continuous.

hours-per-week: continuous.

native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.


In [None]:
data_train_location = '../Data/adult_train.csv'
data_test_location = '../Data/adult_test.csv'

data_train = pd.read_csv(data_train_location, sep='\t', low_memory=False)
data_test = pd.read_csv(data_test_location, sep='\t', low_memory=False)

data_train.head(3)

In [None]:
data_test.head(3)

## Exercise 1 - Features and Target

Knowing that 'income' is the name of the target column, let's build the feature matrix and the target vector for the train and test sets.

In [None]:
X_train = data_train.drop('income', axis = 1)
X_test = data_test.drop('income', axis = 1)

Y_train = data_train.income
Y_test = data_test.income

print(X_train.shape)
print(X_test.shape)
print(Y_train.shape)
print(Y_test.shape)

## Exercise 2 - Training

Let's instantiate and train a Gaussian naive Bayes model.

https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html
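
As a rough illustration of what a Gaussian naive Bayes model estimates during training, the sketch below implements the idea in plain Python: a prior per class, a mean and variance per feature and class, and prediction by the highest log posterior. It is not scikit-learn's implementation. The function names (`fit_gaussian_nb`, `log_posterior`, `predict_labels`), the `eps` constant and the toy data are assumptions made up for this example.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y, eps=1e-9):
    """Estimate the class prior and per-feature mean/variance for each class."""
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)

    model = {}
    for label, rows in rows_by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # Maximum-likelihood variance; eps (hypothetical constant) avoids dividing by zero
        variances = [
            sum((v - m) ** 2 for v in col) / n + eps
            for col, m in zip(columns, means)
        ]
        model[label] = {"prior": n / len(X), "means": means, "variances": variances}
    return model


def log_posterior(model, row, label):
    """Unnormalized log posterior: log prior plus the sum of Gaussian log densities."""
    params = model[label]
    total = math.log(params["prior"])
    for v, m, var in zip(row, params["means"], params["variances"]):
        total += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
    return total


def predict_labels(model, X):
    """Pick, for each row, the class with the highest log posterior."""
    return [max(model, key=lambda label: log_posterior(model, row, label)) for row in X]


# Toy data (made up): one feature, two well-separated classes
X_toy = [[1.0], [1.2], [0.8], [5.0], [5.3], [4.7]]
y_toy = ["<=50K", "<=50K", "<=50K", ">50K", ">50K", ">50K"]

toy_model = fit_gaussian_nb(X_toy, y_toy)
toy_pred = predict_labels(toy_model, [[1.1], [4.9]])
print(toy_pred)
```

The "naive" part is the sum inside `log_posterior`: each feature contributes independently given the class, so no covariance between features is modeled.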

In [None]:
gnb = GaussianNB()

gnb.fit(X_train, Y_train)

## Exercise 3 - Predict

Let's use the model trained in Exercise 2 to predict the labels of the test data.

In [None]:
Y_pred = gnb.predict(X_test)

Y_pred

## Exercise 4 - Performance

For the test data, let's compute accuracy:

Accuracy = (TP + TN)/(TP + TN + FP + FN)

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html
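
The accuracy formula above can be checked by hand on a small example. The following sketch counts TP, TN, FP and FN from two label lists and applies the formula. The helper names (`count_outcomes`, `accuracy_from_counts`) and the toy labels are hypothetical and only serve this illustration; in the notebook itself `accuracy_score` does this work.

```python
def count_outcomes(y_true, y_pred, positive):
    """Count (tp, tn, fp, fn), treating `positive` as the positive class."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1
            else:
                fp += 1
        else:
            if actual == positive:
                fn += 1
            else:
                tn += 1
    return tp, tn, fp, fn


def accuracy_from_counts(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)


# Toy labels (made up for the example)
y_true_toy = [">50K", "<=50K", "<=50K", ">50K", "<=50K"]
y_pred_toy = [">50K", "<=50K", ">50K", "<=50K", "<=50K"]

toy_counts = count_outcomes(y_true_toy, y_pred_toy, positive=">50K")
toy_accuracy = accuracy_from_counts(*toy_counts)
print(toy_counts, toy_accuracy)
```

Three of the five toy predictions match the actual label, so the accuracy is 3/5.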

In [None]:
# accuracy
accuracy_score(Y_test, Y_pred)

Let's compare the performance we obtained with the performance of the null model.

The null model is the one that predicts the majority-class label for every instance.
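
A minimal sketch of this baseline, assuming the labels are available as plain Python lists: it takes the most frequent label in the training set and measures how often that label is correct on the test set. The function name `null_accuracy_from_labels` and the toy data are assumptions made up for this example.

```python
from collections import Counter


def null_accuracy_from_labels(y_train, y_test):
    """Return the majority label in y_train and its accuracy when predicted for all of y_test."""
    majority_label, _ = Counter(y_train).most_common(1)[0]
    hits = sum(1 for label in y_test if label == majority_label)
    return majority_label, hits / len(y_test)


# Toy labels (made up): 70% of train and 75% of test belong to '<=50K'
toy_train = ["<=50K"] * 7 + [">50K"] * 3
toy_test = ["<=50K"] * 3 + [">50K"] * 1

toy_majority, toy_null_accuracy = null_accuracy_from_labels(toy_train, toy_test)
print(toy_majority, toy_null_accuracy)
```

Computing the baseline from the labels, rather than typing the counts by hand, keeps it correct if the train/test split changes.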

In [None]:
# null accuracy

# compare the model's performance with what we get if we always predict the majority class

Y_train.value_counts()

The majority class is <=50K.

Let's compute null_accuracy as if we had predicted <=50K for every record in test:

In [None]:
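# The null model predicts '<=50K' (the negative class) for every test record,
# so it makes no positive predictions: tp = fp = 0.
tp = 0
tn = 6829  # test records whose actual label is '<=50K'
fp = 0
fn = 2220  # test records whose actual label is '>50K'
null_accuracy = (tp + tn) / (tp + tn + fp + fn)
null_accuracy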

Null accuracy matches the proportion of the majority class, because the null classification model always predicts the mode.

In [None]:
Y_test.value_counts(normalize=True).max()

Let's compute the confusion matrix on the test data and plot it as a heatmap.

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html
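
To make the layout of the matrix explicit, here is a small plain-Python sketch that builds it with rows for actual labels and columns for predicted labels, which is the convention `confusion_matrix` follows. The function name `confusion_counts` and the toy labels are hypothetical; the order given in `labels` decides which class sits at index 0.

```python
def confusion_counts(y_true, y_pred, labels):
    """Build a matrix where entry [i][j] counts records of actual class i predicted as class j."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for actual, predicted in zip(y_true, y_pred):
        matrix[index[actual]][index[predicted]] += 1
    return matrix


# Toy labels (made up for the example)
y_true_toy = [">50K", "<=50K", "<=50K", ">50K", "<=50K"]
y_pred_toy = [">50K", "<=50K", ">50K", "<=50K", "<=50K"]

# Negative class first, positive class second
toy_matrix = confusion_counts(y_true_toy, y_pred_toy, labels=["<=50K", ">50K"])

toy_tn, toy_fp = toy_matrix[0]
toy_fn, toy_tp = toy_matrix[1]
print(toy_matrix)
```

With the negative class listed first, `[0][0]` holds the true negatives, `[0][1]` the false positives, `[1][0]` the false negatives and `[1][1]` the true positives.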

In [None]:
# confusion matrix

conf_mat = confusion_matrix(Y_test, Y_pred)

print('Confusion matrix\n\n', conf_mat)

print('\nTrue Positives(TP) = ', conf_mat[1,1])

print('\nTrue Negatives(TN) = ', conf_mat[0,0])

print('\nFalse Positives(FP) = ', conf_mat[0,1])

print('\nFalse Negatives(FN) = ', conf_mat[1,0])


In [None]:
conf_mat_df = pd.DataFrame(data=conf_mat, 
                           index=['Actual Negative: 0', 'Actual Positive: 1'], 
                           columns=['Predict Negative: 0', 'Predict Positive: 1'])

sns.heatmap(conf_mat_df, annot=True, fmt='d', cmap='YlGnBu');