<h1>Predicting the Onset of Diabetes</h1>
<h4>By <i>Samuel Almeida</i></h4>

The business problem is to predict whether a person will develop diabetes, so I will use <i>historical data</i> and <i>machine learning</i> to make that prediction. The model built below should reach an accuracy of at least 70%, so it is necessary to choose the most suitable analysis tools, and it is vital to understand the attributes of the collected data.
<br><br>
The dataset contains the following fields:<br>
<li>Number of pregnancies (Pregnancies).</li>
<li>Glucose level (Glucose).</li>
<li>Blood pressure (BloodPressure).</li>
<li>Skin thickness (SkinThickness).</li>
<li>Insulin (Insulin).</li>
<li>Body mass index (BMI).</li>
<li>Diabetes pedigree function (DiabetesPedigreeFunction).</li>
<li>Age (Age).</li>
<li>Outcome (Outcome): 0 for False and 1 for True.</li>

<h3>Reading the Dataset and Data Cleaning</h3>

In [144]:
# Importing libraries
import pandas as pd
import numpy as np

In [145]:
# Loading the dataset
df = pd.read_csv('diabetes.csv')

In [146]:
# Checking the shape of the data
df.shape

(768, 9)

In [147]:
# Checking that the dataset loaded correctly
df.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [148]:
# Checking whether there are any null values
df.isnull().values.any()

False

In [149]:
# Identifying the correlation between the variables
import matplotlib.pyplot as plt

def plot_corr(df, size=9):
    """Plot the correlation matrix of df as a heatmap."""
    corr = df.corr()
    fig, ax = plt.subplots(figsize=(size, size))  # square figure, size x size inches
    ax.matshow(corr)
    plt.xticks(range(len(corr.columns)), corr.columns)
    plt.yticks(range(len(corr.columns)), corr.columns)

In [150]:
# Viewing the correlation as a table
df.corr()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
Pregnancies,1.0,0.129459,0.141282,-0.081672,-0.073535,0.017683,-0.033523,0.544341,0.221898
Glucose,0.129459,1.0,0.15259,0.057328,0.331357,0.221071,0.137337,0.263514,0.466581
BloodPressure,0.141282,0.15259,1.0,0.207371,0.088933,0.281805,0.041265,0.239528,0.065068
SkinThickness,-0.081672,0.057328,0.207371,1.0,0.436783,0.392573,0.183928,-0.11397,0.074752
Insulin,-0.073535,0.331357,0.088933,0.436783,1.0,0.197859,0.185071,-0.042163,0.130548
BMI,0.017683,0.221071,0.281805,0.392573,0.197859,1.0,0.140647,0.036242,0.292695
DiabetesPedigreeFunction,-0.033523,0.137337,0.041265,0.183928,0.185071,0.140647,1.0,0.033561,0.173844
Age,0.544341,0.263514,0.239528,-0.11397,-0.042163,0.036242,0.033561,1.0,0.238356
Outcome,0.221898,0.466581,0.065068,0.074752,0.130548,0.292695,0.173844,0.238356,1.0
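
The last row of the table shows how strongly each feature correlates with Outcome. As a small sketch, the values can be ranked by absolute correlation; the dictionary below is copied by hand from that row and is not recomputed from the dataset.

```python
# Correlation of each feature with Outcome, copied from the table above.
corr_with_outcome = {
    "Pregnancies": 0.221898,
    "Glucose": 0.466581,
    "BloodPressure": 0.065068,
    "SkinThickness": 0.074752,
    "Insulin": 0.130548,
    "BMI": 0.292695,
    "DiabetesPedigreeFunction": 0.173844,
    "Age": 0.238356,
}

# Sort from the strongest to the weakest linear association with Outcome.
ranked = sorted(corr_with_outcome.items(), key=lambda kv: abs(kv[1]), reverse=True)
for name, value in ranked:
    print("{0:<26}{1:.3f}".format(name, value))
```

Glucose comes first and BloodPressure last. Correlation only captures linear association, so a low value does not by itself mean a feature is useless to the models.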


In [151]:
# Checking how the classes are distributed
num_true = len(df.loc[df['Outcome'] == True])
num_false = len(df.loc[df['Outcome'] == False])
print('Number of true cases : {0} ({1:2.2f}%)'.format(num_true, (num_true / (num_true + num_false)) * 100))
print('Number of false cases: {0} ({1:2.2f}%)'.format(num_false, (num_false / (num_true + num_false)) * 100))

Number of true cases : 268 (34.90%)
Number of false cases: 500 (65.10%)
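
These class counts give a useful reference point for the 70% target. A classifier that always predicts the majority class would already be right for every negative case. The sketch below computes that baseline from the counts printed above.

```python
# Class counts taken from the output above.
num_true, num_false = 268, 500
total = num_true + num_false

# Accuracy of a trivial model that always predicts the majority class.
baseline_accuracy = max(num_true, num_false) / total
print("Majority-class baseline accuracy: {0:.4f}".format(baseline_accuracy))
# prints: Majority-class baseline accuracy: 0.6510
```

Any model has to beat roughly 65% before it adds value over this trivial rule.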


<h3>Splitting</h3>

In [152]:
from sklearn.model_selection import train_test_split

In [153]:
# Selecting the predictor variables (feature selection)
atributos = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']

In [154]:
# Target variable to be predicted
atrib_prev = ['Outcome']

In [155]:
# Creating the feature and target arrays
x = df[atributos].values
y = df[atrib_prev].values

In [156]:
# Creating the training and test data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.30, random_state = 42)

In [157]:
# Printing the results
print('{0:0.2f}% in the training data'.format((len(x_train) / len(df.index)) * 100))
print('{0:0.2f}% in the test data'.format((len(x_test) / len(df.index)) * 100))

69.92% in the training data
30.08% in the test data


In [158]:
x_train

array([[  1.   ,  95.   ,  60.   , ...,  23.9  ,   0.26 ,  22.   ],
       [  5.   , 105.   ,  72.   , ...,  36.9  ,   0.159,  28.   ],
       [  0.   , 135.   ,  68.   , ...,  42.3  ,   0.365,  24.   ],
       ...,
       [ 10.   , 101.   ,  86.   , ...,  45.6  ,   1.136,  38.   ],
       [  0.   , 141.   ,   0.   , ...,  42.4  ,   0.205,  29.   ],
       [  0.   , 125.   ,  96.   , ...,  22.5  ,   0.262,  21.   ]])

In [159]:
# Checking the class balance of the split
print("Original True : {0} ({1:0.2f}%)".format(len(df.loc[df['Outcome'] == 1]), 
                                               (len(df.loc[df['Outcome'] ==1])/len(df.index) * 100)))

print("Original False : {0} ({1:0.2f}%)".format(len(df.loc[df['Outcome'] == 0]), 
                                               (len(df.loc[df['Outcome'] == 0])/len(df.index) * 100)))
print("")
print("Training True : {0} ({1:0.2f}%)".format(len(y_train[y_train[:] == 1]), 
                                               (len(y_train[y_train[:] == 1])/len(y_train) * 100)))

print("Training False : {0} ({1:0.2f}%)".format(len(y_train[y_train[:] == 0]), 
                                               (len(y_train[y_train[:] == 0])/len(y_train) * 100)))
print("")
print("Test True : {0} ({1:0.2f}%)".format(len(y_test[y_test[:] == 1]), 
                                               (len(y_test[y_test[:] == 1])/len(y_test) * 100)))

print("Test False : {0} ({1:0.2f}%)".format(len(y_test[y_test[:] == 0]), 
                                               (len(y_test[y_test[:] == 0])/len(y_test) * 100)))

Original True : 268 (34.90%)
Original False : 500 (65.10%)

Training True : 188 (35.01%)
Training False : 349 (64.99%)

Test True : 80 (34.63%)
Test False : 151 (65.37%)
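
The random split above happens to keep the class proportions close to the original 34.90% / 65.10%. If that balance should be guaranteed rather than left to chance, `train_test_split` accepts a `stratify` argument. The sketch below uses synthetic stand-in arrays (`x_demo`, `y_demo` are assumptions, not the real dataset) with the same size and class counts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape and class counts as the dataset (assumption).
rng = np.random.default_rng(42)
x_demo = rng.normal(size=(768, 8))
y_demo = np.array([1] * 268 + [0] * 500)

# stratify=y_demo keeps the share of positives nearly identical in both splits.
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.30, random_state=42, stratify=y_demo
)
print("Training positive rate: {0:.4f}".format(y_tr.mean()))
print("Test positive rate    : {0:.4f}".format(y_te.mean()))
```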


<h3>Handling Missing Values - Imputation</h3>

In [160]:
# isnull() finds no nulls because missing values were recorded as zero, so count the zeros per column
print("# Rows in dataframe {0}".format(len(df)))
print("# Rows missing Glucose: {0}".format(len(df.loc[df['Glucose'] == 0])))
print("# Rows missing BloodPressure: {0}".format(len(df.loc[df['BloodPressure'] == 0])))
print("# Rows missing SkinThickness: {0}".format(len(df.loc[df['SkinThickness'] == 0])))
print("# Rows missing Insulin: {0}".format(len(df.loc[df['Insulin'] == 0])))
print("# Rows missing BMI: {0}".format(len(df.loc[df['BMI'] == 0])))
print("# Rows missing Age: {0}".format(len(df.loc[df['Age'] == 0])))

# Rows in dataframe 768
# Rows missing Glucose: 5
# Rows missing BloodPressure: 35
# Rows missing SkinThickness: 227
# Rows missing Insulin: 374
# Rows missing BMI: 11
# Rows missing Age: 0


In [161]:
from sklearn.impute import SimpleImputer

In [162]:
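# Creating the imputer
preenche_0 = SimpleImputer(missing_values = 0, strategy = "mean")

# Replacing the values equal to zero with the column mean.
# Caveats: zeros are replaced in every column, including Pregnancies, where 0 is a valid value,
# and fit_transform on x_test re-estimates the means from the test data instead of reusing the training means.
x_train = preenche_0.fit_transform(x_train)
x_test = preenche_0.fit_transform(x_test)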

In [163]:
x_train

array([[  1.        ,  95.        ,  60.        , ...,  23.9       ,
          0.26      ,  22.        ],
       [  5.        , 105.        ,  72.        , ...,  36.9       ,
          0.159     ,  28.        ],
       [  4.34056399, 135.        ,  68.        , ...,  42.3       ,
          0.365     ,  24.        ],
       ...,
       [ 10.        , 101.        ,  86.        , ...,  45.6       ,
          1.136     ,  38.        ],
       [  4.34056399, 141.        ,  72.24131274, ...,  42.4       ,
          0.205     ,  29.        ],
       [  4.34056399, 125.        ,  96.        , ...,  22.5       ,
          0.262     ,  21.        ]])
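
The imputed array shows 4.34056399 in the first column, meaning that a Pregnancies value of 0 was treated as missing. One possible alternative, sketched below on toy data, is to impute only the columns where zero is physiologically implausible and to reuse the training means for the test data. The arrays and the `cols` selection are assumptions for illustration, not the notebook's method.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy rows with columns [Pregnancies, Glucose, BMI] (illustrative values only).
train_demo = np.array([[0.0, 100.0, 30.0],
                       [2.0,   0.0, 20.0],
                       [4.0, 140.0,  0.0]])
test_demo = np.array([[0.0, 0.0, 25.0]])

# Zero is implausible for Glucose and BMI, but valid for Pregnancies.
cols = [1, 2]
imputer = SimpleImputer(missing_values=0, strategy="mean")

train_imp = train_demo.copy()
test_imp = test_demo.copy()
# Learn the column means from the training rows only...
train_imp[:, cols] = imputer.fit_transform(train_demo[:, cols])
# ...and apply those same means to the test rows.
test_imp[:, cols] = imputer.transform(test_demo[:, cols])

print(train_imp)
print(test_imp)
```

Here the missing Glucose value becomes 120.0 and the missing BMI value becomes 25.0, both computed from the training rows, while the Pregnancies zeros are left untouched.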

<h3>Building and training the model with Naive Bayes</h3>

In [164]:
# Importing a Naive Bayes classifier
from sklearn.naive_bayes import GaussianNB

In [165]:
# Creating the predictive model
modelo_v1 = GaussianNB()

In [166]:
# Training the model
modelo_v1.fit(x_train, y_train.ravel())

GaussianNB(priors=None, var_smoothing=1e-09)

<h3>Checking the model's accuracy on the training data</h3>

In [167]:
from sklearn import metrics

In [168]:
nb_predict_train = modelo_v1.predict(x_train)
print("Accuracy: {0:.4f}".format(metrics.accuracy_score(y_train, nb_predict_train)))

Accuracy: 0.7542


<h3>Checking the model's accuracy on the test data</h3>

In [169]:
nb_predict_test = modelo_v1.predict(x_test)
print("Accuracy: {0:.4f}".format(metrics.accuracy_score(y_test, nb_predict_test)))

Accuracy: 0.7359


<h3>Metrics</h3>

In [170]:
# Creating a confusion matrix
print("Confusion Matrix")

print("{0}".format(metrics.confusion_matrix(y_test, nb_predict_test, labels = [1, 0])))

print("Classification Report")
print(metrics.classification_report(y_test, nb_predict_test, labels = [1, 0]))

Confusion Matrix
[[ 52  28]
 [ 33 118]]
Classification Report
              precision    recall  f1-score   support

           1       0.61      0.65      0.63        80
           0       0.81      0.78      0.79       151

   micro avg       0.74      0.74      0.74       231
   macro avg       0.71      0.72      0.71       231
weighted avg       0.74      0.74      0.74       231
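
The figures in the classification report follow directly from the confusion matrix. As a worked example, the sketch below recomputes accuracy, precision, recall and F1 for class 1 from the four counts printed above. With `labels = [1, 0]`, rows are the actual classes and columns the predicted ones.

```python
# Counts from the Naive Bayes confusion matrix above (labels ordered [1, 0]).
tp, fn = 52, 28    # actual 1: predicted 1, predicted 0
fp, tn = 33, 118   # actual 0: predicted 1, predicted 0

accuracy = (tp + tn) / (tp + fn + fp + tn)
precision_1 = tp / (tp + fp)   # of the predicted positives, how many are real
recall_1 = tp / (tp + fn)      # of the real positives, how many were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print("accuracy : {0:.4f}".format(accuracy))     # prints: accuracy : 0.7359
print("precision: {0:.2f}".format(precision_1))  # prints: precision: 0.61
print("recall   : {0:.2f}".format(recall_1))     # prints: recall   : 0.65
print("f1-score : {0:.2f}".format(f1_1))         # prints: f1-score : 0.63
```

A recall of 0.65 means that 28 of the 80 people with diabetes in the test set were missed, which accuracy alone does not reveal.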



<h3>Optimizing the model with Random Forest</h3>

In [171]:
# Importing the library
from sklearn.ensemble import RandomForestClassifier

In [172]:
modelo_v2 = RandomForestClassifier(random_state = 42)
modelo_v2.fit(x_train, y_train.ravel())



RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
            oob_score=False, random_state=42, verbose=0, warm_start=False)

In [173]:
# Checking accuracy on the training data
rf_predict_train = modelo_v2.predict(x_train)
print("Accuracy: {0:.4f}".format(metrics.accuracy_score(y_train, rf_predict_train)))

Accuracy: 0.9870


In [174]:
# Checking accuracy on the test data
rf_predict_test = modelo_v2.predict(x_test)
print("Accuracy: {0:.4f}".format(metrics.accuracy_score(y_test, rf_predict_test)))

Accuracy: 0.7100


In [175]:
print("Confusion Matrix")

print("{0}".format(metrics.confusion_matrix(y_test, rf_predict_test, labels = [1, 0])))

print("Classification Report")
print(metrics.classification_report(y_test, rf_predict_test, labels = [1, 0]))

Confusion Matrix
[[ 43  37]
 [ 30 121]]
Classification Report
              precision    recall  f1-score   support

           1       0.59      0.54      0.56        80
           0       0.77      0.80      0.78       151

   micro avg       0.71      0.71      0.71       231
   macro avg       0.68      0.67      0.67       231
weighted avg       0.70      0.71      0.71       231
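
The gap between 0.9870 on the training data and 0.7100 on the test data suggests that the forest is overfitting. One common way to investigate this is to limit tree depth and estimate accuracy with cross-validation. The sketch below does so on synthetic data; `x_demo`, `y_demo`, `max_depth=4` and `n_estimators=50` are illustrative assumptions, not tuned settings for this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the eight diabetes features (assumption, not the real data).
x_demo, y_demo = make_classification(n_samples=768, n_features=8, random_state=42)

# Shallower trees memorize less of the training data.
shallow_forest = RandomForestClassifier(n_estimators=50, max_depth=4, random_state=42)

# 5-fold cross-validation: each fold is scored by a model that never saw it.
scores = cross_val_score(shallow_forest, x_demo, y_demo, cv=5, scoring="accuracy")
print("Fold accuracies:", scores.round(4))
print("Mean accuracy  : {0:.4f}".format(scores.mean()))
```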



<h3>Optimizing the model with Logistic Regression</h3>

In [176]:
# Importing the library
from sklearn.linear_model import LogisticRegression

In [177]:
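# Note: the model is fitted on the test split (x_test, y_test) and then scored on that same split,
# so the accuracy reported below is not a held-out estimate.
modelo_v3 = LogisticRegression(C = 0.7, random_state = 42)
modelo_v3.fit(x_test, y_test.ravel())
lr_predict_test = modelo_v3.predict(x_test)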



In [178]:
print("Accuracy: {0:.4f}".format(metrics.accuracy_score(y_test, lr_predict_test)))

print("Classification Report")
print(metrics.classification_report(y_test, lr_predict_test, labels = [1, 0]))

Accuracy: 0.7706
Classification Report
              precision    recall  f1-score   support

           1       0.71      0.57      0.63        80
           0       0.80      0.87      0.83       151

   micro avg       0.77      0.77      0.77       231
   macro avg       0.75      0.72      0.73       231
weighted avg       0.76      0.77      0.76       231
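
For comparison with the other two models, the logistic regression would need to be fitted on the training split and scored on the test split. The sketch below shows that pattern on synthetic data; `x_demo` and `y_demo` are assumptions, and `max_iter=1000` is added only so the solver converges in the example. No accuracy figure for the real dataset is implied.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the diabetes data (assumption, not the real dataset).
x_demo, y_demo = make_classification(n_samples=768, n_features=8, random_state=42)
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.30, random_state=42
)

# Fit on the training split only.
model = LogisticRegression(C=0.7, random_state=42, max_iter=1000)
model.fit(x_tr, y_tr)

# Score on rows the model has never seen.
held_out_accuracy = accuracy_score(y_te, model.predict(x_te))
print("Held-out accuracy: {0:.4f}".format(held_out_accuracy))
```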



<h3> Summary - Accuracy on the test data</h3>
<li> Model using the Naive Bayes algorithm = 0.7359</li>
<li> Model using the Random Forest algorithm = 0.7100</li>
<li> Model using the Logistic Regression algorithm = 0.7706</li>

<h3>Making predictions with the trained model</h3>

In [179]:
import pickle

In [180]:
# Saving the model for later use
filename = 'modelo_treinado_v3.sav'
with open(filename, 'wb') as f:
    pickle.dump(modelo_v3, f)

In [181]:
x_test

array([[6.00000000e+00, 9.80000000e+01, 5.80000000e+01, ...,
        3.40000000e+01, 4.30000000e-01, 4.30000000e+01],
       [2.00000000e+00, 1.12000000e+02, 7.50000000e+01, ...,
        3.57000000e+01, 1.48000000e-01, 2.10000000e+01],
       [2.00000000e+00, 1.08000000e+02, 6.40000000e+01, ...,
        3.08000000e+01, 1.58000000e-01, 2.10000000e+01],
       ...,
       [4.85714286e+00, 1.27000000e+02, 8.00000000e+01, ...,
        3.63000000e+01, 8.04000000e-01, 2.30000000e+01],
       [6.00000000e+00, 1.05000000e+02, 7.00000000e+01, ...,
        3.08000000e+01, 1.22000000e-01, 3.70000000e+01],
       [5.00000000e+00, 7.70000000e+01, 8.20000000e+01, ...,
        3.58000000e+01, 1.56000000e-01, 3.50000000e+01]])

In [182]:
# Loading the model and predicting for individual observations
with open(filename, 'rb') as f:
    loaded_model = pickle.load(f)
resultado1 = loaded_model.predict(x_test[15].reshape(1, -1))
resultado2 = loaded_model.predict(x_test[18].reshape(1, -1))
print(resultado1)

[1]


For the selected observation (row 15 of x_test), the model predicts the value 1, that is, True.
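
The saved file contains only the classifier, so any new observation would have to be imputed the same way before calling predict. One option, sketched below under assumptions, is to bundle the imputer and the classifier in a scikit-learn `Pipeline` and pickle that single object. The synthetic data and the file name `pipeline_demo.sav` are hypothetical.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (assumption); some readings are stored as zero placeholders.
x_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)
x_demo[::10, 1] = 0

# The pipeline imputes first, then classifies, so both steps travel together.
pipeline = Pipeline([
    ("impute", SimpleImputer(missing_values=0, strategy="mean")),
    ("classify", LogisticRegression(C=0.7, random_state=42, max_iter=1000)),
])
pipeline.fit(x_demo, y_demo)

# Round trip through a temporary file; the handles are closed by the with blocks.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "pipeline_demo.sav")  # hypothetical file name
    with open(path, "wb") as f:
        pickle.dump(pipeline, f)
    with open(path, "rb") as f:
        loaded_pipeline = pickle.load(f)

# A raw row that still contains the zero placeholder can be passed directly.
raw_row = x_demo[10].reshape(1, -1)
same_prediction = loaded_pipeline.predict(raw_row)[0] == pipeline.predict(raw_row)[0]
print(same_prediction)
```

Pickle files should only be loaded from trusted sources, and with the same scikit-learn version that created them.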

<h3>Conclusion</h3>

Although all three models exceeded the pre-established target of 70% accuracy, Logistic Regression produced the best result, so it is the model chosen for production and for solving our business problem. Note that the Logistic Regression score was measured on the same split the model was fitted on, so it is not directly comparable with the other two.