## BUILDING THE PREDICTIVE MODEL

Let's now build our model using Logistic Regression. Logistic regression is a widely used algorithm for classification tasks, especially binary classification. It directly models the probability that a given data point belongs to a particular class.
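
To make "models the probability" concrete, here is a minimal sketch of how logistic regression turns a linear score into a probability via the sigmoid function. The weights, bias, and data point are hypothetical values chosen for illustration, not coefficients learned from the stroke data.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights and bias for two standardized features (illustrative only)
w = np.array([0.8, -0.5])
b = -1.0

x = np.array([1.2, 0.3])   # one hypothetical data point
p = sigmoid(w @ x + b)     # estimated P(class = 1 | x)
label = int(p >= 0.5)      # default decision threshold of 0.5
print(p, label)
```

In scikit-learn, `predict_proba()` returns this kind of probability and `predict()` applies the 0.5 threshold.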

Let's load the libraries:

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score 
from sklearn.metrics import log_loss
from sklearn.utils import resample
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC

Now, let's load the dataset into a DataFrame and drop the rows with a missing `bmi` value:

In [2]:
stroke = pd.read_csv(r"C:\Users\maria\Desktop\proyecto infarto de miocardio\healthcare-dataset-stroke-data.csv")
stroke = stroke.dropna(subset=['bmi'])
stroke.head()

| | id | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9046 | Male | 67.0 | 0 | 1 | Yes | Private | Urban | 228.69 | 36.6 | formerly smoked | 1 |
| 2 | 31112 | Male | 80.0 | 0 | 1 | Yes | Private | Rural | 105.92 | 32.5 | never smoked | 1 |
| 3 | 60182 | Female | 49.0 | 0 | 0 | Yes | Private | Urban | 171.23 | 34.4 | smokes | 1 |
| 4 | 1665 | Female | 79.0 | 1 | 0 | Yes | Self-employed | Rural | 174.12 | 24.0 | never smoked | 1 |
| 5 | 56669 | Male | 81.0 | 0 | 0 | Yes | Private | Urban | 186.21 | 29.0 | formerly smoked | 1 |


In [3]:
column_mapping = {
    'gender': {"Male": 0, "Female": 1, "Other": 2},
    'ever_married': {"Yes": 1, "No": 0},
    'work_type': {"Private": 0, "Self-employed": 1, "Govt_job": 2, "children": 3, "Never_worked": 4},
    'Residence_type':{"Urban": 0, "Rural": 1},
    'smoking_status': {"formerly smoked": 0, "never smoked": 1, "smokes": 2, "Unknown": 3}
}

for column, mapping in column_mapping.items():
    fmap = np.vectorize(lambda t: mapping.get(t, -1))
    stroke[column] = fmap(stroke[column])
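
The same label-to-integer encoding can be written with pandas' own `Series.map`, which avoids `np.vectorize`. This is only a sketch on a small hypothetical frame (the `df` below stands in for the stroke DataFrame, and the `"n/a"` label is made up to show the fallback):

```python
import pandas as pd

# Small hypothetical frame standing in for the stroke DataFrame
df = pd.DataFrame({"ever_married": ["Yes", "No", "Yes", "n/a"]})
mapping = {"Yes": 1, "No": 0}

# Series.map returns NaN for labels missing from the mapping;
# fillna(-1) reproduces the -1 fallback used in the loop above
df["ever_married"] = df["ever_married"].map(mapping).fillna(-1).astype(int)
print(df["ever_married"].tolist())  # prints [1, 0, 1, -1]
```

Keep in mind that integer codes imply an ordering between categories. For nominal columns such as `work_type`, one-hot encoding (for example with `pd.get_dummies`) is a common alternative.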

In [4]:
drop_columns=['id', 'Residence_type'] 

Let's create the feature matrix and the target vector, and split them into training and test sets:

In [5]:
X = stroke.drop('stroke',axis=1)
X = X.drop(drop_columns, axis=1)
Y = stroke['stroke'].values

X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size=0.3, random_state=0)
len(X_train), len(X_test), len(X_train.columns)

(3436, 1473, 9)

Let's standardize the features so that they share a common scale. The scaler is fitted on the training set only and then applied to the test set:

In [6]:
ss = StandardScaler()
X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)
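
As a rough illustration of what this step does, the sketch below standardizes hypothetical random data (not the stroke features). After `fit_transform`, each training column has mean 0 and standard deviation 1. The test set is transformed with the training statistics, so its mean and standard deviation are only approximately 0 and 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical data: two features on very different scales
X_train_demo = rng.normal(loc=[50.0, 100.0], scale=[20.0, 40.0], size=(200, 2))
X_test_demo = rng.normal(loc=[50.0, 100.0], scale=[20.0, 40.0], size=(50, 2))

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train_demo)  # learns mean/std from the training set only
X_test_std = scaler.transform(X_test_demo)        # reuses the training statistics

print(X_train_std.mean(axis=0), X_train_std.std(axis=0))
print(X_test_std.mean(axis=0), X_test_std.std(axis=0))
```

Fitting the scaler on the training set alone keeps information from the test set out of the preprocessing step.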

Let's now build our model using Logistic Regression:

In [7]:
lr = LogisticRegression()
lr.fit(X_train, Y_train)

Let's assess the quality of our model using two metrics:

- Accuracy: The fraction of the model's classifications that are correct; it ranges from 0 to 1 (higher is better).
- Negative Log-Likelihood (log loss): Takes the predicted probabilities into account, penalizing confident wrong predictions most heavily; it is 0 for perfect predictions and has no upper bound (lower is better).

To obtain the probability of belonging to a class instead of the class itself, we can use the predict_proba() method.
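
The sketch below computes the binary log loss by hand and compares it with `sklearn.metrics.log_loss`. The labels and probabilities are hypothetical; the first prediction is confidently wrong on purpose, to show that the metric can exceed 1.

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([1, 0, 1, 0])
# Hypothetical predicted probabilities for class 1; the first one is confidently wrong
p1 = np.array([0.01, 0.10, 0.80, 0.30])

# Binary log loss: mean of -[y*log(p) + (1-y)*log(1-p)]
manual = -np.mean(y_true * np.log(p1) + (1 - y_true) * np.log(1 - p1))

# scikit-learn expects one column per class, as returned by predict_proba
proba = np.column_stack([1 - p1, p1])
print(manual, log_loss(y_true, proba))
```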

In [8]:
Y_pred = lr.predict(X_train)
Y_pred_proba = lr.predict_proba(X_train)

print("TRAIN ACCURACY: "+str(accuracy_score(Y_train, Y_pred)))
print("TRAIN LOG LOSS: "+str(log_loss(Y_train, Y_pred_proba)))

TRAIN ACCURACY: 0.9551804423748544
TRAIN LOG LOSS: 0.14476207853231224


In [9]:
Y_pred = lr.predict(X_test)
Y_pred_proba = lr.predict_proba(X_test)

print("TEST ACCURACY: "+str(accuracy_score(Y_test, Y_pred)))
print("TEST LOG LOSS: "+str(log_loss(Y_test, Y_pred_proba)))

TEST ACCURACY: 0.9633401221995926
TEST LOG LOSS: 0.1287002806824131
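
When reading accuracy figures, it can help to compare them with a trivial baseline that always predicts the most frequent class. The sketch below uses hypothetical labels with an assumed share of about 5% positives; that ratio is an assumption for illustration, not a figure measured on the stroke data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
# Hypothetical labels with roughly 5% positives (an assumed ratio)
y_demo = (rng.random(1000) < 0.05).astype(int)
X_demo = np.zeros((1000, 1))  # the dummy model ignores the features

baseline = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
baseline_acc = accuracy_score(y_demo, baseline.predict(X_demo))
print(baseline_acc)
```

On the real data, the equivalent check would be to fit the same `DummyClassifier` on `X_train`, `Y_train` and score it on `X_test`, `Y_test`.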


As we can see, the accuracy of our model on the test set is 0.9633, which is very high and slightly above the training accuracy of 0.9552, so there is no sign of overfitting. Let's now try other classification algorithms (K-NN, Random Forest, and SVM) to see if we can achieve better results.

**K-NN**

In [10]:
Ks = [1,2,3,4,5,7,10,12,15,20]

for K in Ks:
    print("K="+str(K))
    knn = KNeighborsClassifier(n_neighbors=K)
    knn.fit(X_train,Y_train)
    
    Y_pred_train = knn.predict(X_train)
    Y_prob_train = knn.predict_proba(X_train)
    
    Y_pred = knn.predict(X_test)
    Y_prob = knn.predict_proba(X_test)
    
    accuracy_train = accuracy_score(Y_train, Y_pred_train)
    accuracy_test = accuracy_score(Y_test, Y_pred)

    loss_train = log_loss(Y_train, Y_prob_train)
    loss_test = log_loss(Y_test, Y_prob)
    
    print("ACCURACY: TRAIN=%.4f TEST=%.4f" % (accuracy_train,accuracy_test))
    print("LOG LOSS: TRAIN=%.4f TEST=%.4f" % (loss_train,loss_test))

K=1
ACCURACY: TRAIN=1.0000 TEST=0.9280
LOG LOSS: TRAIN=0.0000 TEST=2.5938
K=2
ACCURACY: TRAIN=0.9590 TEST=0.9566
LOG LOSS: TRAIN=0.0569 TEST=1.3486
K=3
ACCURACY: TRAIN=0.9584 TEST=0.9484
LOG LOSS: TRAIN=0.0781 TEST=0.9708
K=4
ACCURACY: TRAIN=0.9555 TEST=0.9593
LOG LOSS: TRAIN=0.0924 TEST=0.9242
K=5
ACCURACY: TRAIN=0.9555 TEST=0.9572
LOG LOSS: TRAIN=0.0977 TEST=0.8811
K=7
ACCURACY: TRAIN=0.9558 TEST=0.9593
LOG LOSS: TRAIN=0.1059 TEST=0.6718
K=10
ACCURACY: TRAIN=0.9552 TEST=0.9627
LOG LOSS: TRAIN=0.1152 TEST=0.6518
K=12
ACCURACY: TRAIN=0.9552 TEST=0.9627
LOG LOSS: TRAIN=0.1209 TEST=0.6065
K=15
ACCURACY: TRAIN=0.9552 TEST=0.9627
LOG LOSS: TRAIN=0.1255 TEST=0.4700
K=20
ACCURACY: TRAIN=0.9552 TEST=0.9627
LOG LOSS: TRAIN=0.1299 TEST=0.3986


As observed, the best test accuracy (0.9627) is obtained with K=10, 12, 15, and 20; among these, K=20 also gives the lowest test log loss (0.3986). However, Logistic Regression remains slightly better on both metrics (0.9633 and 0.1287).
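
The large test log loss for small K has a simple explanation: K-NN probabilities are fractions of neighbor votes. With K=1 they can only be 0 or 1, so every misclassified test point is penalized as a fully confident error. The sketch below illustrates this on hypothetical synthetic data generated with `make_classification`, not on the stroke dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical synthetic data, not the stroke dataset
X_demo, y_demo = make_classification(n_samples=600, n_features=5, flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

probas = {}
for k in (1, 20):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    probas[k] = knn.predict_proba(X_te)
    # Each probability is a fraction of neighbor votes, so at most k + 1 distinct values occur
    n_distinct = np.unique(probas[k][:, 1]).size
    print(k, n_distinct, log_loss(y_te, probas[k]))
```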

**Random Forest**

In [11]:
forest = RandomForestClassifier(n_estimators=30, max_depth=8, random_state=0)  # False was treated as seed 0

forest.fit(X_train, Y_train)

Y_pred_train = forest.predict(X_train)
Y_pred = forest.predict(X_test)

accuracy_train = accuracy_score(Y_train, Y_pred_train)
accuracy_test = accuracy_score(Y_test, Y_pred)

print("ACCURACY: TRAIN=%.4f TEST=%.4f" % (accuracy_train,accuracy_test))

ACCURACY: TRAIN=0.9642 TEST=0.9599


In this case, using 30 estimators and a maximum depth of 8, we achieve a test accuracy of 0.9599, which is slightly lower than the best K-NN result (0.9627).

**Support Vector Machine (SVM)**

In [12]:
svc = LinearSVC()
svc.fit(X_train, Y_train)
print("ACCURACY: Train=%.4f Test=%.4f" % (svc.score(X_train, Y_train), svc.score(X_test,Y_test)))

ACCURACY: Train=0.9552 Test=0.9627




As we can see, the linear SVM reaches exactly the same accuracy as K-NN with K=10 or more: 0.9552 on the training set and 0.9627 on the test set.

**And what would have happened if we had used all the features?**

In [13]:
X = stroke.drop('stroke',axis=1).values
Y = stroke['stroke'].values

X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size=0.3, random_state=1)

In [14]:
ss = StandardScaler()
X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)

lr = LogisticRegression()
lr.fit(X_train, Y_train)

In [15]:
Y_pred = lr.predict(X_train)
Y_pred_proba = lr.predict_proba(X_train)

print("TRAIN ACCURACY: "+str(accuracy_score(Y_train, Y_pred)))
print("TRAIN LOG LOSS: "+str(log_loss(Y_train, Y_pred_proba))) 

TRAIN ACCURACY: 0.9577997671711292
TRAIN LOG LOSS: 0.1394062385118252


In [16]:
Y_pred = lr.predict(X_test)
Y_pred_proba = lr.predict_proba(X_test)

print("TEST ACCURACY: "+str(accuracy_score(Y_test, Y_pred)))
print("TEST LOG LOSS: "+str(log_loss(Y_test, Y_pred_proba)))

TEST ACCURACY: 0.956551255940258
TEST LOG LOSS: 0.14152030074887548


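As observed, if we hadn't removed the features identified during the exploratory data analysis, we would have obtained a slightly worse model, with a test accuracy of 0.9566. Note that this run splits the data with random_state=1 instead of 0, so part of the difference may come from the different split rather than from the extra features.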
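
One way to make a feature-set comparison less sensitive to a single split is to score both variants on the same cross-validation folds. The sketch below does this on hypothetical synthetic data; the extra `ids` column is a made-up stand-in for an uninformative identifier such as `id`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data; the appended column plays the role of an uninformative 'id'
X_demo, y_demo = make_classification(n_samples=500, n_features=6, random_state=0)
ids = np.arange(len(y_demo)).reshape(-1, 1)
X_all = np.hstack([X_demo, ids])

# The pipeline refits the scaler inside each fold, on that fold's training part only
model = make_pipeline(StandardScaler(), LogisticRegression())

# cv=5 builds the same stratified folds for both runs, so the scores are directly comparable
scores_all = cross_val_score(model, X_all, y_demo, cv=5)
scores_reduced = cross_val_score(model, X_demo, y_demo, cv=5)
print(scores_all.mean(), scores_reduced.mean())
```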

## CONCLUSION

We conclude that by removing the features that the exploratory data analysis identified as least informative, we obtained a model with a high test accuracy of 96.33%. We also tried building our model using K-NN, Random Forest, and SVM. These approaches also perform very well (test accuracy above 96% for K-NN and SVM, and above 95% for Random Forest), but Logistic Regression remains narrowly ahead.