# 3. Models and Analyses

In this stage, the following classification models are implemented:
    * AdaBoost
    * RandomForest
    * SVM with RBF kernel
    * Logistic Regression with L2 regularization

In [1]:
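# imports
import pandas as pd
import numpy as np
import scipy.sparse  # load_npz lives in the scipy.sparse submodule
import matplotlib.pyplot as plt  # pyplot provides figure(), plot(), savefig()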
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score
from sklearn.metrics import auc

%matplotlib inline

#### Loading the dataset
Model evaluation has to rely on the training set alone. Because this is a Kaggle challenge, the test set is provided without the class label, so validation metrics cannot be computed from it.
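
When only the labelled training set can be scored, a single hold-out split is one option and k-fold cross-validation is another that reuses the same data. The sketch below is illustrative only: it uses a small synthetic sparse matrix in place of `processed_train.npz`, and the fold count, seed, and choice of model are assumptions rather than settings from this notebook.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the real features/target (assumption for illustration)
rng = np.random.RandomState(0)
X_demo = sp.random(200, 30, density=0.2, format='csr', random_state=rng)
y_demo = rng.randint(0, 2, size=200)

# 5 stratified folds keep the class balance of each fold close to the full set
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X_demo, y_demo,
                         cv=cv, scoring='f1')

print('F1 per fold:', scores)
print('Mean F1: {:.3f}'.format(scores.mean()))
```

Since the labels here are random, the scores themselves carry no meaning; the point is the evaluation pattern.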

In [19]:
# load the training dataset
ids = pd.read_csv('dataset/train_ids.csv')
target = pd.read_csv('dataset/train_target.csv')
features = scipy.sparse.load_npz('dataset/processed_train.npz')

In [23]:
# split into train and test
X_train, X_test, y_train, y_test, ids_train, ids_test = train_test_split(features, target, ids, test_size=0.2, shuffle=True)
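
`train_test_split` accepts any number of arrays with the same first dimension and applies the same row permutation to all of them, which is what keeps features, labels, and ids aligned in the cell above. The sketch below shows this on synthetic data; the `random_state` and `stratify` arguments are additions for illustration (repeatability and class balance), not settings used in this notebook.

```python
import numpy as np
import pandas as pd
import scipy.sparse as sp
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for features, target and ids (assumption for illustration)
X_demo = sp.random(100, 10, density=0.3, format='csr', random_state=0)
y_demo = pd.DataFrame({'target': np.tile([0, 1], 50)})
ids_demo = pd.DataFrame({'id': np.arange(100)})

X_tr, X_te, y_tr, y_te, ids_tr, ids_te = train_test_split(
    X_demo, y_demo, ids_demo,
    test_size=0.2,
    shuffle=True,
    random_state=42,    # fixed seed so the split is repeatable
    stratify=y_demo,    # keep the class ratio equal in both parts
)

# Labels and ids keep matching row indices after the split
print(X_tr.shape, X_te.shape)
print((y_te.index == ids_te.index).all())
```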

#### Generic functions for training and evaluating the models

In [30]:
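def treinamento(X_train, y_train, model_name):
    """Fit the classifier selected by model_name and return it."""
    if model_name=='adaboost':
        model = AdaBoostClassifier()
    elif model_name=='randomforest':
        model = RandomForestClassifier()
    elif model_name=='svm':
        # probability=True is required for predict_proba; RBF is the default kernel
        model = SVC(kernel='rbf', probability=True)
    elif model_name=='logreg':
        # LogisticRegression applies an L2 penalty by default
        model = LogisticRegression()
    else:
        raise ValueError('Unknown model_name: ' + model_name)

    model.fit(X_train,y_train)

    return model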

In [78]:
def predict(X_test, model):
    pred = model.predict(X_test)
    proba = model.predict_proba(X_test)
    return pred, proba

In [82]:
# roc_auc_score and roc_curve are not among the imports at the top of the notebook
from sklearn.metrics import roc_auc_score, roc_curve

def validation(X_test, y_test, proba, pred, model, model_name):
    """Print accuracy, classification report, F-score and ROC AUC for the test set."""
    print('Accuracy of the '+model_name+' on test set: {:.3f}'.format(model.score(X_test, y_test)))

    # label the positive class when its predicted probability exceeds 0.5
    print(classification_report(y_test, proba[:,1]>0.5))
    print('F-score: '+str(f1_score(y_test,pred)))
    # ROC AUC is computed from scores; auc() expects curve points, not labels
    print('AUC: ' + str(roc_auc_score(np.ravel(y_test), proba[:,1])))
    return None

def plot_roc_curve(y_score, y_true):
    """Plot the ROC curve, save it to roc_curve.png and return (fpr, tpr, thr)."""
    fpr,tpr,thr = roc_curve(y_true=np.ravel(y_true),y_score=y_score)
    roc_auc = auc(fpr, tpr)  # integrate the curve points
    plt.figure()
    lw = 2
    plt.plot(fpr, tpr, color='darkorange',
             lw=lw, label='ROC curve (area = %0.2f)' % roc_auc)
    plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic example')
    plt.legend(loc="best")
    plt.savefig('roc_curve.png')
    return fpr,tpr,thr
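
`sklearn.metrics.auc(x, y)` integrates a curve given as points and requires `x` to be monotonic, so it cannot take labels and predictions directly. ROC AUC is computed from continuous scores, either with `roc_curve` followed by `auc` or with `roc_auc_score`. A minimal sketch on hand-made values (the arrays are illustrative, not results from this notebook):

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])   # positive-class probabilities

# Route 1: build the ROC curve, then integrate it
fpr, tpr, thresholds = roc_curve(y_true, y_score)
area_from_curve = auc(fpr, tpr)

# Route 2: the one-call shortcut
area_direct = roc_auc_score(y_true, y_score)

print(area_from_curve, area_direct)  # both routes agree
```

Passing hard 0/1 predictions instead of probabilities would still run with `roc_auc_score`, but it collapses the curve to a single operating point, so probabilities (or `decision_function` scores) are the usual input.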

### AdaBoost

In [34]:
model_name = 'adaboost'
model = treinamento(X_train, y_train.values.ravel(), model_name)

In [49]:
# store labels in pred so the predict() helper is not overwritten
pred, proba = predict(X_test, model)

In [66]:
validation(X_test, y_test, proba, pred, model, model_name)

Accuracy of the adaboost on test set: 0.961
              precision    recall  f1-score   support

           0       0.96      0.96      0.96      2044
           1       0.96      0.96      0.96      2024

   micro avg       0.96      0.96      0.96      4068
   macro avg       0.96      0.96      0.96      4068
weighted avg       0.96      0.96      0.96      4068

F-score: 0.9605328071040947


In [68]:
plot_roc_curve(proba[:, 1], y_test)

### Random Forest

In [69]:
model_name = 'randomforest'
model = treinamento(X_train, y_train.values.ravel(), model_name)



In [72]:
# store labels in pred so the predict() helper is not overwritten
pred, proba = predict(X_test, model)

In [73]:
validation(X_test, y_test, proba, pred, model, model_name)

Accuracy of the randomforest on test set: 0.850
              precision    recall  f1-score   support

           0       0.81      0.92      0.86      2044
           1       0.91      0.78      0.84      2024

   micro avg       0.85      0.85      0.85      4068
   macro avg       0.86      0.85      0.85      4068
weighted avg       0.86      0.85      0.85      4068

F-score: 0.8375566817818085


In [74]:
plot_roc_curve(proba[:, 1], y_test)

### SVM

In [75]:
model_name = 'svm'
model = treinamento(X_train, y_train.values.ravel(), model_name)



KeyboardInterrupt: 
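
Kernel SVM fit time scales at least quadratically with the number of samples, and enabling `probability=True` (needed for `predict_proba`) adds an internal 5-fold cross-validation on top, which may be why the fit above had to be interrupted. One way to get a first estimate is to fit on a random subsample. The sketch below uses synthetic data; the subsample size and seed are arbitrary assumptions, not values from this notebook.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.svm import SVC

# Synthetic stand-in for the training data (assumption for illustration)
rng = np.random.RandomState(0)
X_demo = sp.random(2000, 50, density=0.1, format='csr', random_state=rng)
y_demo = rng.randint(0, 2, size=2000)

# Draw a random subsample of rows without replacement
n_sub = 300
rows = rng.choice(X_demo.shape[0], size=n_sub, replace=False)

svm_demo = SVC(kernel='rbf', probability=True, random_state=0)
svm_demo.fit(X_demo[rows], y_demo[rows])

# Class probabilities for a few rows, one column per class
proba_demo = svm_demo.predict_proba(X_demo[:5])
print(proba_demo.shape)
```

A model fitted on a subsample is only a rough proxy for one trained on the full set, so its scores are not directly comparable with those of the other classifiers.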

In [None]:
# store labels in pred so the predict() helper is not overwritten
pred, proba = predict(X_test, model)

In [None]:
validation(X_test, y_test, proba, pred, model, model_name)

In [None]:
plot_roc_curve(proba[:, 1], y_test)

### Logistic Regression

In [76]:
model_name = 'logreg'
model = treinamento(X_train, y_train.values.ravel(), model_name)



In [79]:
# store labels in pred so the predict() helper is not overwritten
pred, proba = predict(X_test, model)

In [83]:
validation(X_test, y_test, proba, pred, model, model_name)

Accuracy of the logreg on test set: 0.955
              precision    recall  f1-score   support

           0       0.96      0.95      0.96      2044
           1       0.95      0.96      0.96      2024

   micro avg       0.96      0.96      0.96      4068
   macro avg       0.96      0.96      0.96      4068
weighted avg       0.96      0.96      0.96      4068

F-score: 0.9552605703048181


In [81]:
plot_roc_curve(proba[:, 1], y_test)