# Classification Models: Logistic Regression

### Importing libraries and functions

Importing libraries and functions:

In [0]:
import pandas as pd
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score, recall_score, accuracy_score, precision_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder

### Data exploration and preprocessing

Loading the data. The goal is to predict whether a person earns more than 50K per year.

Source: [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Adult)

In [0]:
df = pd.read_csv('https://raw.githubusercontent.com/intelligentagents/aprendizagem-supervisionada/master/data/adult.csv')

Previewing the dataset:

In [36]:
# Preview the first five rows of the dataset
df.head(5)

|  | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | gains |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


Summarizing the dataset's columns and data types:

In [37]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
age                32561 non-null int64
 workclass         32561 non-null object
 fnlwgt            32561 non-null int64
 education         32561 non-null object
 education-num     32561 non-null int64
 marital-status    32561 non-null object
 occupation        32561 non-null object
 relationship      32561 non-null object
 race              32561 non-null object
 sex               32561 non-null object
 capital-gain      32561 non-null int64
 capital-loss      32561 non-null int64
 hours-per-week    32561 non-null int64
 native-country    32561 non-null object
 gains             32561 non-null object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB
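
In the `df.info()` listing above, every column name except `age` starts with a space, which suggests the CSV uses `", "` as its separator. The sketch below shows one way to normalize that at load time with `skipinitialspace=True`. The three-row sample is hypothetical and only mimics the layout; the `"?"` entry is an assumption based on how the UCI Adult data marks unknown values, and it has not been checked against this particular file.

```python
import io
import pandas as pd

# Hypothetical sample mimicking a ", "-separated file (not the real data)
raw = (
    "age, workclass, gains\n"
    "39, State-gov, <=50K\n"
    "50, ?, <=50K\n"
    "28, Private, >50K\n"
)

# Default parsing keeps the space that follows each comma
plain = pd.read_csv(io.StringIO(raw))
print(list(plain.columns))  # ['age', ' workclass', ' gains']

# skipinitialspace=True drops whitespace right after the delimiter
clean = pd.read_csv(io.StringIO(raw), skipinitialspace=True)
print(list(clean.columns))  # ['age', 'workclass', 'gains']

# Count "?" placeholders per column (assumed marker for unknown values)
unknown_counts = (clean == "?").sum()
print(unknown_counts)
```

Since the notebook selects columns by dtype and by position, the leading spaces do not affect the steps that follow; they only matter when columns are accessed by name.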


Converting the categorical columns to numeric codes:

In [0]:
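# Select the categorical (object-typed) columns
columns = df.select_dtypes(include=['object']).columns

# Create the LabelEncoder
le = LabelEncoder()

# The encoder is refit on every column, so `le` only keeps the last column's mapping
for column in columns:
    df[column] = le.fit_transform(df[column])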
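
Because a single `LabelEncoder` is refit on each column, the integer-to-label mapping of earlier columns is lost once the loop finishes. If the original labels need to be recovered later, one option is to keep a separate fitted encoder per column. This is a minimal sketch on a hypothetical toy frame; the names `toy` and `encoders` are illustrative and do not appear in the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frame standing in for `df`
toy = pd.DataFrame({
    "workclass": ["Private", "State-gov", "Private"],
    "sex": ["Male", "Female", "Male"],
    "age": [38, 39, 53],
})

encoders = {}  # one fitted encoder per categorical column
for column in ["workclass", "sex"]:
    encoder = LabelEncoder()
    toy[column] = encoder.fit_transform(toy[column])
    encoders[column] = encoder

# classes_ is sorted, and each label's code is its position in that array
print(encoders["workclass"].classes_)

# Map the integer codes back to the original labels
print(encoders["sex"].inverse_transform(toy["sex"]))
```

Note that the integer codes imply an ordering (for example, `Female` < `Male`) that a linear model will treat as numeric, which may or may not be desirable for a given column.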

Previewing the transformed dataset:

In [23]:
df.head(5)

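|  | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | gains |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | 7 | 77516 | 9 | 13 | 4 | 1 | 1 | 4 | 1 | 2174 | 0 | 40 | 39 | 0 |
| 1 | 50 | 6 | 83311 | 9 | 13 | 2 | 4 | 0 | 4 | 1 | 0 | 0 | 13 | 39 | 0 |
| 2 | 38 | 4 | 215646 | 11 | 9 | 0 | 6 | 1 | 4 | 1 | 0 | 0 | 40 | 39 | 0 |
| 3 | 53 | 4 | 234721 | 1 | 7 | 2 | 6 | 0 | 2 | 1 | 0 | 0 | 40 | 39 | 0 |
| 4 | 28 | 4 | 338409 | 9 | 13 | 2 | 10 | 5 | 2 | 0 | 0 | 0 | 40 | 5 | 0 |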


Defining the independent variables (`X`) and the dependent variable (`y`):

In [0]:
X = df.iloc[:, :14].values
y = df.iloc[:, -1].values
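
The slice `:14` works here because `df.info()` reports 15 columns with the target last, but positional slicing silently breaks if a column is added or reordered. A name-based selection is one alternative. The sketch below uses a hypothetical three-column frame, and the variable names are illustrative.

```python
import pandas as pd

# Hypothetical toy frame; the target column is the last one, as in `df`
toy = pd.DataFrame({"age": [39, 50, 38], "sex": [1, 1, 0], "gains": [0, 0, 1]})
target = "gains"

# Name-based selection: every column except the target
X_by_name = toy.drop(columns=[target]).values
y_by_name = toy[target].values

# Position-based selection, equivalent while the target stays last
X_by_pos = toy.iloc[:, :-1].values

print(X_by_name.shape, y_by_name.shape)
```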

Creating the training and test subsets:

In [0]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
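
The row sums of the confusion matrix reported later in this notebook (4918 versus 1595) indicate that roughly three quarters of the test rows belong to class 0. With imbalanced classes, passing `stratify=y` is one way to keep the class proportions identical in both subsets. The sketch below uses hypothetical toy arrays, not the Adult data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy data: 80 negatives, 20 positives
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 80 + [1] * 20)

# stratify=y_toy preserves the 80/20 class ratio in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

# Share of positives in the training and test subsets
print(y_tr.mean(), y_te.mean())
```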

### Model training and validation

Instantiating the model and training it on the training set:

In [42]:
classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=0, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)
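
The features are on very different scales (`fnlwgt` runs into the hundreds of thousands, while `age` stays in the tens), and the solvers behind `LogisticRegression` generally converge more reliably on standardized inputs. A possible variation, sketched here on synthetic data, chains a `StandardScaler` with the classifier in a pipeline. The data and the name `scaled_classifier` are assumptions for illustration; no claim is made about how this would change the results reported in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Adult features
X_syn, y_syn = make_classification(n_samples=500, n_features=6, random_state=0)
X_syn[:, 0] *= 100_000  # mimic a large-valued column such as fnlwgt

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=0
)

# The scaler is fit on the training split only, then applied to the test split
scaled_classifier = make_pipeline(
    StandardScaler(), LogisticRegression(random_state=0)
)
scaled_classifier.fit(X_tr, y_tr)

print(scaled_classifier.score(X_te, y_te))  # accuracy on the held-out split
```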

Predicting the test-set labels with the trained model:

In [43]:
y_pred = classifier.predict(X_test)

y_pred

array([0, 0, 0, ..., 1, 0, 0])
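
`predict` returns hard labels, which for a binary logistic regression corresponds to a 0.5 cutoff on the predicted probability of class 1. The sketch below, on synthetic data, shows how `predict_proba` exposes those probabilities and how a lower cutoff flags more rows as positive. The 0.3 threshold and all variable names are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data
X_syn, y_syn = make_classification(n_samples=300, n_features=5, random_state=0)
model = LogisticRegression(random_state=0).fit(X_syn, y_syn)

# Column 1 holds the estimated probability of the positive class
proba_positive = model.predict_proba(X_syn)[:, 1]

default_pred = (proba_positive >= 0.5).astype(int)  # usual cutoff
lenient_pred = (proba_positive >= 0.3).astype(int)  # hypothetical lower cutoff

# A lower cutoff can only keep or increase the number of predicted positives
print(default_pred.sum(), lenient_pred.sum())
```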

Computing and displaying the confusion matrix for the test set:

In [44]:
confusion_matrix(y_test, y_pred)

array([[4778,  140],
       [1182,  413]])
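
In scikit-learn's convention, rows are the true classes and columns are the predicted classes, so the matrix above reads as 4778 true negatives, 140 false positives, 1182 false negatives, and 413 true positives. The short sketch below derives accuracy and the positive-class precision and recall directly from those four counts; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix reported above (rows: actual, columns: predicted)
cm = np.array([[4778, 140],
               [1182, 413]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tn + tp) / cm.sum()
precision_pos = tp / (tp + fp)  # of the rows predicted >50K, how many were right
recall_pos = tp / (tp + fn)     # of the actual >50K rows, how many were found

print(f"accuracy={accuracy:.4f} precision={precision_pos:.4f} recall={recall_pos:.4f}")
# prints: accuracy=0.7970 precision=0.7468 recall=0.2589
```

The positive-class recall of about 0.26 means the model identifies only around a quarter of the people who actually earn more than 50K, even though overall accuracy is close to 0.80.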

Creating a DataFrame to store the final results:

In [0]:
df_results = pd.DataFrame(columns=['classifier', 'accuracy', 'precision', 'recall', 'f1'], index=None)

Storing the metrics in the DataFrame and displaying the final results:

In [46]:
df_results.loc[len(df_results), :] = [
    'Regressão Logística', accuracy_score(y_test, y_pred),
    precision_score(y_test, y_pred, average='macro'), recall_score(y_test, y_pred, average='macro'),
    f1_score(y_test, y_pred, average='macro')]

df_results

|  | classifier | accuracy | precision | recall | f1 |
|---|---|---|---|---|---|
| 0 | Regressão Logística | 0.797021 | 0.774257 | 0.615234 | 0.631507 |
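
With `average='macro'`, each metric is computed per class and the per-class values are then averaged with equal weight, regardless of class size. As a cross-check, the sketch below reproduces the macro precision, recall, and F1 in the table from the confusion matrix alone; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix from the test set (rows: actual, columns: predicted)
cm = np.array([[4778, 140],
               [1182, 413]])

correct = np.diag(cm)                        # correctly classified rows per class
precision_per_class = correct / cm.sum(axis=0)  # column sums = predicted counts
recall_per_class = correct / cm.sum(axis=1)     # row sums = actual counts
f1_per_class = (
    2 * precision_per_class * recall_per_class
    / (precision_per_class + recall_per_class)
)

# Macro average: unweighted mean over the two classes
macro_precision = precision_per_class.mean()
macro_recall = recall_per_class.mean()
macro_f1 = f1_per_class.mean()

print(round(macro_precision, 6), round(macro_recall, 6), round(macro_f1, 6))
```

The macro recall of about 0.62 sits well below the accuracy because it gives the minority class, where recall is low, the same weight as the majority class.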
