# Stroke Prediction Project
The goal of this project is to use different machine learning models to predict whether someone is likely to have a stroke.

I plan on using K-Nearest Neighbors and multiple linear regression to make the predictions.

First, we will import everything that we need.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import preprocessing

In [2]:
df = pd.read_csv('C:/Users/scott/Desktop/Programming Stuff/Python Stuff/Stroke Project/healthcare-dataset-stroke-data.csv')

In [3]:
print(df.head(10))
print(len(df))
print(df['work_type'].unique())
print(df['smoking_status'].unique())

      id  gender   age  hypertension  heart_disease ever_married  \
0   9046    Male  67.0             0              1          Yes   
1  51676  Female  61.0             0              0          Yes   
2  31112    Male  80.0             0              1          Yes   
3  60182  Female  49.0             0              0          Yes   
4   1665  Female  79.0             1              0          Yes   
5  56669    Male  81.0             0              0          Yes   
6  53882    Male  74.0             1              1          Yes   
7  10434  Female  69.0             0              0           No   
8  27419  Female  59.0             0              0          Yes   
9  60491  Female  78.0             0              0          Yes   

       work_type Residence_type  avg_glucose_level   bmi   smoking_status  \
0        Private          Urban             228.69  36.6  formerly smoked   
1  Self-employed          Rural             202.21   NaN     never smoked   
2        Private    

We have 5110 entries in our dataset.
Let's first do a multiple linear regression model.

- First, we need to clean the data a bit. All of the categorical variables will be changed to 1 or 0: two-valued columns become a single 0/1 column, and multi-valued columns (`work_type`, `smoking_status`) become one 0/1 column per value. The numeric columns (`age`, `avg_glucose_level`, `bmi`) will be min-max scaled to the range 0 to 1.

- We will also need to handle any entries with a NaN for BMI. Rather than dropping them, we'll replace the NaN with the average BMI.


In [4]:
# Drop the 'Other' gender rows, then encode the two remaining values as 0/1.
df.drop(df[df['gender'] == 'Other'].index, inplace = True)
# Assign the result back instead of calling replace(..., inplace=True) on a
# column selection, which newer pandas versions warn about.
df['gender'] = df['gender'].map({'Female': 1, 'Male': 0})

df['ever_married'] = df['ever_married'].map(lambda x: 1 if x == 'Yes' else 0)

df['Residence_type'] = df['Residence_type'].map({'Urban': 1, 'Rural': 0})


df['smokes'] = df['smoking_status'].apply(lambda x: 1 if x == 'smokes' else 0)
df['formerly_smoked'] = df['smoking_status'].apply(lambda x: 1 if x == 'formerly smoked' else 0)
df['never_smoked'] = df['smoking_status'].apply(lambda x: 1 if x == 'never smoked' else 0)

df['Private'] = df['work_type'].apply(lambda x: 1 if x == 'Private' else 0)
df['Self-employed'] = df['work_type'].apply(lambda x: 1 if x == 'Self-employed' else 0)
df['Govt_job'] = df['work_type'].apply(lambda x: 1 if x == 'Govt_job' else 0)
df['Child'] = df['work_type'].apply(lambda x: 1 if x == 'children' else 0)
df['Never_worked'] = df['work_type'].apply(lambda x: 1 if x == 'Never_worked' else 0)

max_age = max(df['age'])
min_age = min(df['age'])
df['age'] = df['age'].apply(lambda x: ((x - min_age)/(max_age - min_age)))

max_glucose = max(df['avg_glucose_level'])
min_glucose = min(df['avg_glucose_level'])
df['avg_glucose_level'] = df['avg_glucose_level'].apply(lambda x: ((x - min_glucose)/(max_glucose - min_glucose)))

# Fill missing BMI values with the column mean (mean() skips NaN by default).
df['bmi'] = df['bmi'].fillna(df['bmi'].mean())
max_bmi = max(df['bmi'])
min_bmi = min(df['bmi'])
df['bmi'] = df['bmi'].apply(lambda x: ((x - min_bmi)/(max_bmi - min_bmi)))

del df['work_type']
del df['smoking_status']
del df['id']

x = df[['gender', 'age', 'hypertension', 'heart_disease', 'ever_married', 'Residence_type', 'avg_glucose_level', 'bmi', 'smokes', 'formerly_smoked', 'never_smoked', 'Private', 'Self-employed', 'Govt_job', 'Child', 'Never_worked']]
y = df[['stroke']]

print(df.head(25))

    gender       age  hypertension  heart_disease  ever_married  \
0        0  0.816895             0              1             1   
1        1  0.743652             0              0             1   
2        0  0.975586             0              1             1   
3        1  0.597168             0              0             1   
4        1  0.963379             1              0             1   
5        0  0.987793             0              0             1   
6        0  0.902344             1              1             1   
7        1  0.841309             0              0             0   
8        1  0.719238             0              0             1   
9        1  0.951172             0              0             1   
10       1  0.987793             1              0             1   
11       1  0.743652             0              1             1   
12       1  0.658203             0              0             1   
13       0  0.951172             0              1             
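
As a side note, the same cleaning steps can also be written with `pd.get_dummies` and scikit-learn's `MinMaxScaler`. This is only a sketch on a made-up toy frame (the `toy` values are invented for illustration and are not rows from the stroke CSV), so treat it as an alternative to the cell above rather than a replacement for it.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy rows standing in for the real CSV (values are made up for illustration).
toy = pd.DataFrame({
    'age': [67.0, 61.0, 12.0, 49.0],
    'bmi': [36.6, np.nan, 18.0, 34.4],
    'work_type': ['Private', 'Self-employed', 'children', 'Private'],
    'smoking_status': ['formerly smoked', 'never smoked', 'never smoked', 'smokes'],
})

# Fill missing BMI with the column mean, as in the cell above.
toy['bmi'] = toy['bmi'].fillna(toy['bmi'].mean())

# One 0/1 column per category value that appears in the data.
toy = pd.get_dummies(toy, columns=['work_type', 'smoking_status'], dtype=int)

# Min-max scale the numeric columns to the range [0, 1].
numeric_cols = ['age', 'bmi']
toy[numeric_cols] = MinMaxScaler().fit_transform(toy[numeric_cols])

print(toy.round(3))
```

One difference to keep in mind: `get_dummies` names its columns `work_type_Private`, `smoking_status_smokes`, and so on, so the feature list passed to the model would need to use those names.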

Next, we want to split our data. We'll do an 80-20 split.
Then we'll make and fit our model.

In [5]:
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size = 0.8, test_size = 0.2, random_state=100)
lr = LinearRegression()
model = lr.fit(x_train, y_train)
y_predict = lr.predict(x_test)

Let's look at the coefficient the model learned for each of our features (these are regression coefficients, not correlations):

In [6]:
print(model.coef_)

[[ 0.00038966  0.26809247  0.03799421  0.02402563 -0.03885273  0.01064307
   0.04891244 -0.0461496  -0.00036265  0.0113806  -0.00374972 -0.00597773
  -0.02991557 -0.02289352  0.04491061  0.01387622]]


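Here's what we can tell from this data. The coefficients come out in the same order as the columns in `x`:
- Gender: .0004
- Age: .2681
- Hypertension: .0380
- Heart Disease: .0240
- Ever Married: -.0389
- Residence Type: .0106
- Average Glucose Level: .0489
- BMI: -.0461
- Smoking Status: smokes -.0004, formerly smoked .0114, never smoked -.0037
- Work Type: Private -.0060, Self-employed -.0299, Govt job -.0229, Child .0449, Never worked .0139

So the factor that *most* impacts the predicted risk of a stroke is Age, by a wide margin. Average Glucose Level, BMI, the Child work type, Ever Married, and Hypertension come next, all at roughly .04 to .05 in magnitude. Since age, glucose level, and BMI were scaled to the range 0 to 1, their coefficients describe the change in the prediction across each feature's full range.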
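
To keep coefficients and feature names from getting mismatched, one option is to label the coefficients with the column names and sort by magnitude. Here's a small sketch that uses the coefficient values printed above and the column order of `x`. The names `feature_names`, `coefs`, and `ranked` are just illustrative. In the notebook itself, something like `pd.Series(model.coef_.ravel(), index=x.columns)` should give the same labeled series.

```python
import pandas as pd

# Column order of x, copied from the cleaning cell.
feature_names = ['gender', 'age', 'hypertension', 'heart_disease', 'ever_married',
                 'Residence_type', 'avg_glucose_level', 'bmi', 'smokes',
                 'formerly_smoked', 'never_smoked', 'Private', 'Self-employed',
                 'Govt_job', 'Child', 'Never_worked']

# Coefficients copied from the printed model.coef_ output.
coefs = [0.00038966, 0.26809247, 0.03799421, 0.02402563, -0.03885273, 0.01064307,
         0.04891244, -0.0461496, -0.00036265, 0.0113806, -0.00374972, -0.00597773,
         -0.02991557, -0.02289352, 0.04491061, 0.01387622]

coef_by_feature = pd.Series(coefs, index=feature_names)

# Sort by absolute value so the largest effects (positive or negative) come first.
ranked = coef_by_feature.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked)
```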

Let's check our R² scores (for `LinearRegression`, `score` returns R², not classification accuracy) and see how the model has performed.

In [7]:
print(lr.score(x_test, y_test))
print(lr.score(x_train, y_train))

0.08651669046511956
0.08215215092395012
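
Since `score` on a linear regression reports R², getting an actual accuracy number means turning the predictions into 0/1 labels first. Below is a rough sketch of that on synthetic data with a rare positive class. The data, the 0.5 threshold, and names like `Xs_te` and `reg` are my assumptions for illustration, not something taken from the stroke dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in data: one feature in [0, 1], positives are rare and
# become more likely as the feature grows (made up, not the stroke data).
Xs = rng.random((2000, 1))
ys = (rng.random(2000) < 0.1 * Xs[:, 0]).astype(int)

Xs_tr, Xs_te, ys_tr, ys_te = train_test_split(Xs, ys, test_size=0.2, random_state=100)
reg = LinearRegression().fit(Xs_tr, ys_tr)

# score() is R^2: the same number r2_score gives for the predictions.
r2 = reg.score(Xs_te, ys_te)

# Accuracy needs hard labels, so threshold the regression output (0.5 is an assumed cutoff).
labels = (reg.predict(Xs_te) >= 0.5).astype(int)
acc = accuracy_score(ys_te, labels)

print(f"R^2: {r2:.3f}  accuracy at 0.5 threshold: {acc:.3f}")
```

In this toy setup no prediction reaches the threshold, so the accuracy is just the share of negatives in the test set. That is why accuracy on its own can look great for a rare outcome even when the model never flags anyone.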


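The test and train R² values of about .087 and .082 mean the model explains under 9% of the variance in the stroke label, so its outputs are best read as rough risk scores. With that in mind, let's run a prediction on myself and see if I'm at risk.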

In [11]:
my_prediction = [[0, 0.246514, 1, 0, 1, 1, 0.234512, 0.159221, 0, 1, 0, 1, 0, 0, 0, 0]]
predict = lr.predict(my_prediction)
print(predict)

[[0.01127006]]


I'd say I'm looking pretty good at .0113 (keeping in mind this is a regression output, not a true probability).
Let's test our model on someone who did have a stroke and see what it gives us.

In [9]:
print(df.head(1))


   gender       age  hypertension  heart_disease  ever_married  \
0       0  0.816895             0              1             1   

   Residence_type  avg_glucose_level      bmi  stroke  smokes  \
0               1           0.801265  0.30126       1       0   

   formerly_smoked  never_smoked  Private  Self-employed  Govt_job  Child  \
0                1             0        1              0         0      0   

   Never_worked  
0             0  


In [12]:
test_prediction = [[0, 0.816895, 0, 1, 1, 1, .801265, .30126, 0, 1, 0, 1, 0, 0, 0, 0]]
predict=lr.predict(test_prediction)
print(predict)

[[0.17138256]]


A much higher score at .1714, though still well below .5 for someone who did have a stroke.
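
Both predictions above were built by typing the scaled values in by hand. A helper that scales raw values and fills in the one-hot columns could make that less error-prone. This is only a sketch: `make_row`, `min_max`, and especially the numbers in `ranges` are hypothetical. In the notebook, the real `min_age`/`max_age`, `min_glucose`/`max_glucose`, and `min_bmi`/`max_bmi` from the cleaning cell would go there.

```python
import pandas as pd

# Column order of x, copied from the cleaning cell.
FEATURES = ['gender', 'age', 'hypertension', 'heart_disease', 'ever_married',
            'Residence_type', 'avg_glucose_level', 'bmi', 'smokes',
            'formerly_smoked', 'never_smoked', 'Private', 'Self-employed',
            'Govt_job', 'Child', 'Never_worked']

# Hypothetical (min, max) pairs; replace with the values computed during cleaning.
ranges = {'age': (0.0, 80.0), 'avg_glucose_level': (50.0, 250.0), 'bmi': (10.0, 60.0)}

def min_max(value, lo, hi):
    """Same min-max formula used in the cleaning cell."""
    return (value - lo) / (hi - lo)

def make_row(raw, ranges):
    """Build a one-row frame in model column order; unspecified columns default to 0."""
    row = {name: 0 for name in FEATURES}
    for name, value in raw.items():
        if name not in row:
            raise KeyError(f"unknown feature: {name}")
        if name in ranges:
            lo, hi = ranges[name]
            value = min_max(value, lo, hi)
        row[name] = value
    return pd.DataFrame([row], columns=FEATURES)

person = make_row({'age': 40.0, 'avg_glucose_level': 100.0, 'bmi': 25.0,
                   'hypertension': 1, 'Private': 1}, ranges)
print(person.T)
```

In the notebook, `lr.predict(person)` would then take the place of the hand-built list. Passing a DataFrame with the same column names the model was fitted on also avoids scikit-learn's feature-name warning.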