## Comparison of Classification Techniques

### Classification Pipeline

The dataset contains information about Indian firms, collected by auditors, with the goal of building a model that classifies suspicious firms. The attributes are audit-related metrics such as scores and risk values.

More information about the dataset: [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Audit+Data#)

In [0]:
from sklearn.metrics import f1_score, recall_score, accuracy_score, precision_score
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

Loading the data:

In [0]:
df = pd.read_csv('https://raw.githubusercontent.com/intelligentagents/aprendizagem-supervisionada/master/data/audit_risk.csv')

Viewing and describing the dataset:

In [3]:
# Exploring the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 776 entries, 0 to 775
Data columns (total 18 columns):
Sector_score     776 non-null float64
LOCATION_ID      776 non-null object
PARA_A           776 non-null float64
SCORE_A          776 non-null int64
PARA_B           776 non-null float64
SCORE_B          776 non-null int64
TOTAL            776 non-null float64
numbers          776 non-null float64
Marks            776 non-null int64
Money_Value      775 non-null float64
MONEY_Marks      776 non-null int64
District         776 non-null int64
Loss             776 non-null int64
LOSS_SCORE       776 non-null int64
History          776 non-null int64
History_score    776 non-null int64
Score            776 non-null float64
Risk             776 non-null int64
dtypes: float64(7), int64(10), object(1)
memory usage: 109.2+ KB


In [4]:
df.head(5)

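| | Sector_score | LOCATION_ID | PARA_A | SCORE_A | PARA_B | SCORE_B | TOTAL | numbers | Marks | Money_Value | MONEY_Marks | District | Loss | LOSS_SCORE | History | History_score | Score | Risk |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3.89 | 23 | 4.18 | 6 | 2.5 | 2 | 6.68 | 5.0 | 2 | 3.38 | 2 | 2 | 0 | 2 | 0 | 2 | 2.4 | 1 |
| 1 | 3.89 | 6 | 0.0 | 2 | 4.83 | 2 | 4.83 | 5.0 | 2 | 0.94 | 2 | 2 | 0 | 2 | 0 | 2 | 2.0 | 0 |
| 2 | 3.89 | 6 | 0.51 | 2 | 0.23 | 2 | 0.74 | 5.0 | 2 | 0.0 | 2 | 2 | 0 | 2 | 0 | 2 | 2.0 | 0 |
| 3 | 3.89 | 6 | 0.0 | 2 | 10.8 | 6 | 10.8 | 6.0 | 6 | 11.75 | 6 | 2 | 0 | 2 | 0 | 2 | 4.4 | 1 |
| 4 | 3.89 | 6 | 0.0 | 2 | 0.08 | 2 | 0.08 | 5.0 | 2 | 0.0 | 2 | 2 | 0 | 2 | 0 | 2 | 2.0 | 0 |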


Summary statistics of the numeric columns:

In [5]:
df.describe()

| | Sector_score | PARA_A | SCORE_A | PARA_B | SCORE_B | TOTAL | numbers | Marks | Money_Value | MONEY_Marks | District | Loss | LOSS_SCORE | History | History_score | Score | Risk |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 776.0 | 776.0 | 776.0 | 776.0 | 776.0 | 776.0 | 776.0 | 776.0 | 775.0 | 776.0 | 776.0 | 776.0 | 776.0 | 776.0 | 776.0 | 776.0 | 776.0 |
| mean | 20.184536 | 2.450194 | 3.512887 | 10.799988 | 3.131443 | 13.218481 | 5.067655 | 2.237113 | 14.137631 | 2.909794 | 2.505155 | 0.029639 | 2.061856 | 0.104381 | 2.167526 | 2.702577 | 0.626289 |
| std | 24.319017 | 5.67887 | 1.740549 | 50.083624 | 1.698042 | 51.312829 | 0.264449 | 0.803517 | 66.606519 | 1.597452 | 1.228678 | 0.18428 | 0.37508 | 0.531031 | 0.679869 | 0.858923 | 0.4841 |
| min | 1.85 | 0.0 | 2.0 | 0.0 | 2.0 | 0.0 | 5.0 | 2.0 | 0.0 | 2.0 | 2.0 | 0.0 | 2.0 | 0.0 | 2.0 | 2.0 | 0.0 |
| 25% | 2.37 | 0.21 | 2.0 | 0.0 | 2.0 | 0.5375 | 5.0 | 2.0 | 0.0 | 2.0 | 2.0 | 0.0 | 2.0 | 0.0 | 2.0 | 2.0 | 0.0 |
| 50% | 3.89 | 0.875 | 2.0 | 0.405 | 2.0 | 1.37 | 5.0 | 2.0 | 0.09 | 2.0 | 2.0 | 0.0 | 2.0 | 0.0 | 2.0 | 2.4 | 1.0 |
| 75% | 55.57 | 2.48 | 6.0 | 4.16 | 4.0 | 7.7075 | 5.0 | 2.0 | 5.595 | 4.0 | 2.0 | 0.0 | 2.0 | 0.0 | 2.0 | 3.25 | 1.0 |
| max | 59.85 | 85.0 | 6.0 | 1264.63 | 6.0 | 1268.91 | 9.0 | 6.0 | 935.03 | 6.0 | 6.0 | 2.0 | 6.0 | 9.0 | 6.0 | 5.2 | 1.0 |


Dropping the location column:

In [0]:
df = df.drop('LOCATION_ID', axis=1)

Checking whether there are missing values:

In [7]:
df[df.isnull().values.any(axis=1)]

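| | Sector_score | PARA_A | SCORE_A | PARA_B | SCORE_B | TOTAL | numbers | Marks | Money_Value | MONEY_Marks | District | Loss | LOSS_SCORE | History | History_score | Score | Risk |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 642 | 55.57 | 0.23 | 2 | 0.0 | 2 | 0.23 | 5.0 | 2 | NaN | 2 | 2 | 0 | 2 | 0 | 2 | 2.0 | 0 |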
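A per-column count of missing values complements the row filter above, because it shows at a glance which attribute is affected. The sketch below uses a small hypothetical frame (`toy`, with made-up values) rather than the audit data itself:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the audit data (values are made up)
toy = pd.DataFrame({
    "PARA_A": [0.23, 4.18, 0.51],
    "Money_Value": [np.nan, 3.38, 0.0],
    "Risk": [0, 1, 0],
})

# Number of missing values in each column
null_counts = toy.isnull().sum()

# Rows that contain at least one missing value
rows_with_nulls = toy[toy.isnull().any(axis=1)]

print(null_counts)
print(rows_with_nulls)
```

On the real frame the same two expressions would be applied to `df`.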


Filling the missing values with the median of each column:

In [0]:
df = df.fillna(df.median())
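`df.median()` returns one median per column, and `fillna` then fills each gap with the median of its own column. A minimal sketch on a hypothetical two-column frame (`toy`, made-up values) illustrates the behavior:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a single missing value in column "a"
toy = pd.DataFrame({
    "a": [1.0, np.nan, 5.0],
    "b": [10.0, 20.0, 60.0],
})

# One median per column; NaN entries are skipped when computing it
medians = toy.median()

# Each missing cell receives the median of its own column
filled = toy.fillna(medians)

print(medians)
print(filled)
```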

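Defining the independent variables (features) and the dependent variable (target). After `LOCATION_ID` is dropped, the frame has 17 columns with `Risk` last, so the features are every column except the last one:

In [0]:
# Every column except the last is a feature; a slice of :17 would also include Risk
X = df.iloc[:, :-1].values
# The last column, Risk, is the target
y = df.iloc[:, -1].values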
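Selecting the target by name instead of by position keeps the split between features and label correct even if columns are added or removed later. A minimal sketch, using a hypothetical three-column frame (`toy`) in place of the audit data:

```python
import pandas as pd

# Hypothetical toy frame; only the column names mirror the audit data
toy = pd.DataFrame({
    "PARA_A": [4.18, 0.0],
    "Score": [2.4, 2.0],
    "Risk": [1, 0],
})

target = "Risk"

# Features: every column except the target
X_toy = toy.drop(columns=target).values

# Target: the Risk column only
y_toy = toy[target].values

print(X_toy.shape, y_toy)
```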

Splitting the dataset into training and test sets:

In [0]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
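The summary statistics give a mean of about 0.63 for `Risk`, so the two classes are not evenly balanced. One option, shown here only as a sketch on synthetic arrays (`X_toy` and `y_toy` are made up), is to pass `stratify` so that both subsets keep the same class proportions:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 50 samples, 20 of class 0 and 30 of class 1
X_toy = np.arange(100).reshape(50, 2)
y_toy = np.array([0] * 20 + [1] * 30)

# stratify=y_toy preserves the 40/60 class ratio in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(np.bincount(y_tr), np.bincount(y_te))
```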


Creating the dictionary that holds all the classifiers:

In [0]:
estimators = {'Decision Tree': DecisionTreeClassifier(criterion = 'entropy', random_state = 0),
              'KNN': KNeighborsClassifier(n_neighbors = 5, metric = 'euclidean'),
              'Random Forest': RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0),
              'SVC': SVC(kernel = 'rbf', random_state = 0)}
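KNN with a Euclidean metric and an RBF-kernel SVC both rely on distances between samples, so attributes with large ranges can dominate (in the summary statistics `PARA_B` reaches 1264.63, while `Marks` stays between 2 and 6). A minimal sketch, assuming standardization suits this data, wraps those two models in a pipeline with `StandardScaler`. The synthetic data and the `scaled_estimators` name are illustrative and not part of the original notebook:

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the audit features
X_syn, y_syn = make_classification(n_samples=200, n_features=5, random_state=0)
X_syn[:, 0] *= 1000.0  # exaggerate the scale of one feature

# Each pipeline standardizes the features before fitting the classifier
scaled_estimators = {
    "KNN (scaled)": make_pipeline(
        StandardScaler(),
        KNeighborsClassifier(n_neighbors=5, metric="euclidean"),
    ),
    "SVC (scaled)": make_pipeline(
        StandardScaler(),
        SVC(kernel="rbf", random_state=0),
    ),
}

train_scores = {}
for name, model in scaled_estimators.items():
    model.fit(X_syn, y_syn)
    # Accuracy on the data used for fitting, only to show the pipeline runs
    train_scores[name] = model.score(X_syn, y_syn)

print(train_scores)
```

Tree-based models split on one feature at a time and are generally insensitive to feature scale, so they are left out of this sketch.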

Creating the DataFrame that will store the final results of the classifiers:

In [0]:
df_results = pd.DataFrame(columns=['classifier', 'accuracy', 'precision', 'recall', 'f1'], index=None)

Iterating over the dictionary to train and evaluate each model:

In [15]:
for name, estim in estimators.items():

    # print("Training estimator {0}: ".format(name))

    # Train the classifier on the training set
    estim.fit(X_train, y_train)

    # Predict the test-set labels with the fitted model
    y_pred = estim.predict(X_test)

    # Store the metrics of each classifier in the results DataFrame
    df_results.loc[len(df_results), :] = [name,
                                          accuracy_score(y_test, y_pred),
                                          precision_score(y_test, y_pred, average='macro'),
                                          recall_score(y_test, y_pred, average='macro'),
                                          f1_score(y_test, y_pred, average='macro')]
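Because the loop appends to `df_results` in place, running the cell a second time adds every classifier again. One way to avoid that, sketched here with a hypothetical helper named `evaluate` and synthetic data from `make_classification`, is to collect the rows in a list and build a fresh DataFrame on each call:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier


def evaluate(estimators, X_train, X_test, y_train, y_test):
    """Fit each estimator and return one row of macro-averaged metrics per model."""
    rows = []
    for name, estim in estimators.items():
        estim.fit(X_train, y_train)
        y_pred = estim.predict(X_test)
        rows.append({
            "classifier": name,
            "accuracy": accuracy_score(y_test, y_pred),
            "precision": precision_score(y_test, y_pred, average="macro"),
            "recall": recall_score(y_test, y_pred, average="macro"),
            "f1": f1_score(y_test, y_pred, average="macro"),
        })
    # A new DataFrame is built on every call, so re-running never duplicates rows
    return pd.DataFrame(rows)


# Synthetic stand-in for the audit data
X_syn, y_syn = make_classification(n_samples=200, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=42)

models = {
    "Decision Tree": DecisionTreeClassifier(criterion="entropy", random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=5, metric="euclidean"),
}

results_first = evaluate(models, X_tr, X_te, y_tr, y_te)
results_second = evaluate(models, X_tr, X_te, y_tr, y_te)
print(results_second)
```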



Displaying the final results:

In [17]:
df_results

| | classifier | accuracy | precision | recall | f1 |
|---|---|---|---|---|---|
| 0 | Decision Tree | 1.0 | 1.0 | 1.0 | 1.0 |
| 1 | KNN | 0.987179 | 0.985075 | 0.989011 | 0.986869 |
| 2 | Random Forest | 1.0 | 1.0 | 1.0 | 1.0 |
| 3 | SVC | 1.0 | 1.0 | 1.0 | 1.0 |
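A single train/test split yields one estimate per model, and that estimate depends on which rows happen to land in the test set. As a hedged complement, the sketch below runs stratified 5-fold cross-validation on synthetic data. The fold count and the `f1_macro` scoring string are assumptions chosen for illustration, not settings taken from the notebook:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the audit data
X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=0)

# Five stratified folds, shuffled with a fixed seed for reproducibility
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

model = RandomForestClassifier(n_estimators=10, criterion="entropy", random_state=0)

# One macro-averaged F1 score per fold
scores = cross_val_score(model, X_syn, y_syn, cv=cv, scoring="f1_macro")

print(scores.mean(), scores.std())
```

The mean and standard deviation across folds give a rough sense of how much a reported score could vary from one split to another.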
