_Using the class-imbalanced heart attack risk dataset:_
- _Build a Logistic Regression model and measure its performance,_
- _Try different methods and class ratios to overcome the class imbalance, and determine the best-performing method and class ratio._

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support
from sklearn.utils import resample
from imblearn.over_sampling import SMOTE, ADASYN
import warnings
warnings.filterwarnings('ignore')

In [3]:
df = pd.read_excel('heart.xlsx')

> The data was compiled in Excel.

In [4]:
df.head(3)

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,num
0,63,1,1,145,233,1,2,150,0,2.3,3,0,6,negative
1,37,1,3,130,250,0,0,187,0,3.5,3,0,3,negative
2,41,0,2,130,204,0,2,172,0,1.4,1,0,3,negative


In [5]:
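# Replace the '<null>' placeholders with default values (0 for `ca`, 3 for `thal`)
df['ca'] = df.ca.replace('<null>', 0)
df['thal'] = df.thal.replace('<null>', 3)
# Encode the target as 1 (positive) / 0 (negative)
df['num'] = df.num.replace({'positive': 1, 'negative': 0})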
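
The replacements above leave the column dtype to whatever pandas infers. A minimal sketch of an alternative, assuming the placeholder is the literal string `'<null>'` as in the cell above: coerce the column to numeric first, then fill the gaps. The toy series `raw` is made up for illustration and is not taken from `heart.xlsx`.

```python
import pandas as pd

# Hypothetical toy column mixing numbers with the '<null>' placeholder.
raw = pd.Series([0, 2, '<null>', 1], name='ca')

# Anything that cannot be parsed as a number becomes NaN ...
as_numeric = pd.to_numeric(raw, errors='coerce')

# ... and NaN is then filled with the same default the notebook uses for `ca`.
cleaned = as_numeric.fillna(0)

print(cleaned.tolist())
print(cleaned.dtype)
```

This makes the numeric dtype explicit before the column reaches `StandardScaler`.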

### Model

In [6]:
def confMatrix(X, Y):
    """Fit a logistic regression on an 80/20 split and return the confusion
    matrix alongside per-class precision, recall, F1 and support."""
    x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

    # Fit the scaler on the training split only, then apply it to the test split
    standardscaler = StandardScaler()
    x_train = standardscaler.fit_transform(x_train)
    x_test = standardscaler.transform(x_test)

    logr = LogisticRegression().fit(x_train, y_train)
    y_pred = logr.predict(x_test)

    # Rows are the true classes; columns are predictions followed by the metrics
    cm = pd.DataFrame(confusion_matrix(y_test, y_pred), columns=['pred_0', 'pred_1'])
    metrics = pd.DataFrame(precision_recall_fscore_support(y_test, y_pred),
                           index=['precision', 'recall', 'f1-score', 'support']).T
    dfMatrix = pd.concat([cm, metrics], axis=1)
    print("Model Score (Accuracy):", '%.5f' % logr.score(x_test, y_test))
    return dfMatrix

In [7]:
print('1:', '%.2f' % (sum(df.num)/len(df.num)*100)) 
print('0:', '%.2f' % ((len(df.num)-sum(df.num))/len(df.num)*100))

1: 7.34
0: 92.66


> The target percentages show a clear class imbalance: about 7% positive versus about 93% negative.
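
The same percentages can be read off with `value_counts(normalize=True)`. A small sketch on a made-up target column (the 2-in-20 split is illustrative, not the notebook's data):

```python
import pandas as pd

# Hypothetical binary target: 2 positives out of 20 rows.
target = pd.Series([1] * 2 + [0] * 18, name='num')

# normalize=True returns class shares instead of raw counts.
shares = target.value_counts(normalize=True) * 100

for label, pct in shares.sort_index().items():
    print(f'{label}: {pct:.2f}')
```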

In [16]:
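# Split the data by class
df_0 = df[df.num == 0]
df_1 = df[df.num == 1]

# Oversampling: draw positives with replacement until they match the negatives
reSample1 = resample(df_1, replace = True, n_samples = len(df_0), random_state = 111)
# Undersampling: draw as many negatives as there are positives
reSample0 = resample(df_0, replace = True, n_samples = len(df_1), random_state = 111)

df_increased = pd.concat([df_0, reSample1])
df_reduced = pd.concat([df_1, reSample0])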
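
A quick way to sanity-check resampled frames is to count the classes afterwards. The sketch below repeats the two steps on a hypothetical toy frame (`toy` and its columns are made up). It uses `replace=False` for the undersampling draw so that the selected negatives are distinct, which is a common choice but differs from the cell above.

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical toy frame: 8 negatives, 2 positives.
toy = pd.DataFrame({'x': range(10), 'num': [0] * 8 + [1] * 2})
toy_0 = toy[toy.num == 0]
toy_1 = toy[toy.num == 1]

# Oversampling: draw positives with replacement until they match the negatives.
up_1 = resample(toy_1, replace=True, n_samples=len(toy_0), random_state=111)
toy_increased = pd.concat([toy_0, up_1])

# Undersampling: draw as many distinct negatives as there are positives.
down_0 = resample(toy_0, replace=False, n_samples=len(toy_1), random_state=111)
toy_reduced = pd.concat([toy_1, down_0])

print(toy_increased.num.value_counts().to_dict())
print(toy_reduced.num.value_counts().to_dict())
```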

#### Imbalanced (Original Data)

In [23]:
Y = df['num']
X = df.drop('num', axis=1)
confMatrix(X, Y)

Model Score (Accuracy): 0.97222


Unnamed: 0,pred_0,pred_1,precision,recall,f1-score,support
0,33,0,0.970588,1.0,0.985075,33.0
1,1,2,1.0,0.666667,0.8,3.0
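
The per-class columns follow directly from the confusion matrix, and recomputing them shows why accuracy alone is misleading here: the test split holds only three positives, and one of them is missed. A short worked check with NumPy, using the four counts printed above:

```python
import numpy as np

# Confusion matrix from the output above: rows are true classes, columns are predictions.
cm = np.array([[33, 0],
               [1, 2]])

tp = np.diag(cm)                 # correct predictions per class
precision = tp / cm.sum(axis=0)  # divide by column totals (everything predicted as that class)
recall = tp / cm.sum(axis=1)     # divide by row totals (everything truly in that class)
f1 = 2 * precision * recall / (precision + recall)
accuracy = tp.sum() / cm.sum()

print(np.round(precision, 6), np.round(recall, 6), np.round(f1, 6), round(accuracy, 5))
```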


#### Oversampled

In [24]:
Y = df_increased['num']
X = df_increased.drop('num', axis=1)
confMatrix(X, Y)

Model Score (Accuracy): 1.00000


Unnamed: 0,pred_0,pred_1,precision,recall,f1-score,support
0,34,0,1.0,1.0,1.0,34.0
1,0,32,1.0,1.0,1.0,32.0
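
One likely reason for the perfect score is that the data is oversampled before `train_test_split` runs, so copies of the same positive row can land in both the training and the test split. The sketch below illustrates the effect on a hypothetical toy frame (`toy` is made up, not the heart data) by tracing duplicated rows through their original index.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Hypothetical toy data: 95 negatives and 5 positives, each row unique.
toy = pd.DataFrame({'x': range(100), 'num': [0] * 95 + [1] * 5})
toy_0 = toy[toy.num == 0]
toy_1 = toy[toy.num == 1]

# Oversample first (as in the cell above), then split.
up_1 = resample(toy_1, replace=True, n_samples=len(toy_0), random_state=111)
oversampled = pd.concat([toy_0, up_1])
train, test = train_test_split(oversampled, test_size=0.2, random_state=42)

# The index still points at the original row, so copies are easy to trace.
test_pos = test[test.num == 1]
leaked = test_pos.index.isin(train.index).sum()
print(f'{leaked} of {len(test_pos)} positive test rows also appear in the training set')
```

Splitting first and resampling only the training portion keeps the test rows unseen.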


#### Undersampled

In [19]:
Y = df_reduced['num']
X = df_reduced.drop('num', axis=1)
confMatrix(X, Y)

Model Score (Accuracy): 0.83333


Unnamed: 0,pred_0,pred_1,precision,recall,f1-score,support
0,2,0,0.666667,1.0,0.8,2.0
1,1,3,1.0,0.75,0.857143,4.0


#### SMOTE

In [25]:
Y = df.num
X = df.drop('num', axis=1)
x_smote, y_smote = SMOTE(random_state=11, sampling_strategy=1.0).fit_resample(X, Y)
confMatrix(x_smote, y_smote)

Model Score (Accuracy): 0.98485


Unnamed: 0,pred_0,pred_1,precision,recall,f1-score,support
0,33,1,1.0,0.970588,0.985075,34.0
1,0,32,0.969697,1.0,0.984615,32.0
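
For intuition, SMOTE builds each synthetic minority point by interpolating between a minority sample and one of its nearest minority neighbours. The sketch below illustrates that single step with NumPy. It is a simplified illustration, not imblearn's implementation, and the helper name `interpolate_point` and the toy array are hypothetical.

```python
import numpy as np

def interpolate_point(minority, i, k, rng):
    """Return one synthetic point between minority[i] and one of its k nearest neighbours."""
    dists = np.linalg.norm(minority - minority[i], axis=1)
    # Position 0 of argsort is the point itself (distance 0), so skip it.
    neighbours = np.argsort(dists)[1:k + 1]
    j = rng.choice(neighbours)
    lam = rng.random()  # uniform in [0, 1)
    return minority[i] + lam * (minority[j] - minority[i])

rng = np.random.default_rng(11)
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [8.0, 8.0]])
synthetic = interpolate_point(minority, i=0, k=2, rng=rng)
print(synthetic)
```

Because the new point lies on a line segment between two existing minority rows, it is not an exact copy of either, unlike the rows produced by plain random oversampling.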


#### ADASYN

In [26]:
Y = df.num
X = df.drop('num', axis=1)
x_adasyn, y_adasyn = ADASYN().fit_resample(X, Y)
confMatrix(x_adasyn, y_adasyn)

Model Score (Accuracy): 0.96970


Unnamed: 0,pred_0,pred_1,precision,recall,f1-score,support
0,31,2,1.0,0.939394,0.96875,33.0
1,0,33,0.942857,1.0,0.970588,33.0


> Because the test set is so small, these performance figures are not a reliable measure of the model.
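
One way to get steadier estimates from a small dataset, and to compare class ratios as the task asks, is stratified cross-validation with resampling applied only inside each training fold. The sketch below is a minimal illustration under several assumptions: it uses synthetic data from `make_classification` instead of the heart dataset, plain random oversampling as a stand-in for SMOTE or ADASYN, and a hypothetical helper `cv_f1_for_ratio`. The ratios tried are arbitrary examples.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.utils import resample

def cv_f1_for_ratio(X, y, ratio, n_splits=5, seed=42):
    """Mean positive-class F1 over stratified folds.

    ratio is the desired minority/majority size in each training fold;
    None leaves the training fold untouched.
    """
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, val_idx in skf.split(X, y):
        X_tr, y_tr = X[train_idx], y[train_idx]
        X_val, y_val = X[val_idx], y[val_idx]

        if ratio is not None:
            X_min, X_maj = X_tr[y_tr == 1], X_tr[y_tr == 0]
            n_target = max(len(X_min), int(round(ratio * len(X_maj))))
            # Random oversampling of the training fold only; validation rows stay untouched.
            X_min_up = resample(X_min, replace=True, n_samples=n_target, random_state=seed)
            X_tr = np.vstack([X_maj, X_min_up])
            y_tr = np.concatenate([np.zeros(len(X_maj), dtype=int),
                                   np.ones(n_target, dtype=int)])

        # Scale with statistics from the (resampled) training fold only.
        scaler = StandardScaler().fit(X_tr)
        model = LogisticRegression(max_iter=1000).fit(scaler.transform(X_tr), y_tr)
        y_pred = model.predict(scaler.transform(X_val))
        scores.append(f1_score(y_val, y_pred, zero_division=0))
    return float(np.mean(scores))

# Synthetic stand-in data with roughly 7% positives; not the heart dataset.
X, y = make_classification(n_samples=600, n_features=8, weights=[0.93, 0.07], random_state=0)

results = {ratio: cv_f1_for_ratio(X, y, ratio) for ratio in (None, 0.25, 0.5, 1.0)}
for ratio, score in results.items():
    print(ratio, round(score, 3))
```

With imblearn, the same idea can be expressed by placing `SMOTE(sampling_strategy=ratio)` inside an `imblearn.pipeline.Pipeline`, which applies the sampler only when fitting.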