# Brain Stroke Prediction - Decision Tree

## Context:

A stroke is a medical condition in which poor blood flow to the brain causes cell death. There are two main types of stroke: ischemic, due to lack of blood flow, and hemorrhagic, due to bleeding. Both cause parts of the brain to stop functioning properly. Signs and symptoms of a stroke may include an inability to move or feel on one side of the body, problems understanding or speaking, dizziness, or loss of vision to one side. Signs and symptoms often appear soon after the stroke has occurred. If symptoms last less than one or two hours, the stroke is a transient ischemic attack (TIA), also called a mini-stroke. A hemorrhagic stroke may also be associated with a severe headache. The symptoms of a stroke can be permanent. Long-term complications may include pneumonia and loss of bladder control.

The main risk factor for stroke is high blood pressure. Other risk factors include high blood cholesterol, tobacco smoking, obesity, diabetes mellitus, a previous TIA, end-stage kidney disease, and atrial fibrillation. An ischemic stroke is typically caused by blockage of a blood vessel, though there are also less common causes. A hemorrhagic stroke is caused by either bleeding directly into the brain or into the space between the brain's membranes. Bleeding may occur due to a ruptured brain aneurysm. Diagnosis is typically based on a physical exam and supported by medical imaging such as a CT scan or MRI scan. A CT scan can rule out bleeding, but may not necessarily rule out ischemia, which early on typically does not show up on a CT scan. Other tests such as an electrocardiogram (ECG) and blood tests are done to determine risk factors and rule out other possible causes. Low blood sugar may cause similar symptoms.

Prevention includes decreasing risk factors, surgery to open up the arteries to the brain in those with problematic carotid narrowing, and warfarin in people with atrial fibrillation. Aspirin or statins may be recommended by physicians for prevention. A stroke or TIA often requires emergency care. An ischemic stroke, if detected within three to four and a half hours, may be treatable with a medication that can break down the clot. Some hemorrhagic strokes benefit from surgery. Treatment to attempt recovery of lost function is called stroke rehabilitation, and ideally takes place in a stroke unit; however, these are not available in much of the world.

### Attribute Information:

1) gender: "Male", "Female" or "Other"
2) age: age of the patient
3) hypertension: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension
4) heart_disease: 0 if the patient doesn't have any heart disease, 1 if the patient has a heart disease
5) ever_married: "No" or "Yes"
6) work_type: "children", "Govt_job", "Never_worked", "Private" or "Self-employed"
7) Residence_type: "Rural" or "Urban"
8) avg_glucose_level: average glucose level in blood
9) bmi: body mass index
10) smoking_status: "formerly smoked", "never smoked", "smokes" or "Unknown"*
11) stroke: 1 if the patient had a stroke or 0 if not

*Note: "Unknown" in smoking_status means that the information is unavailable for this patient
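The attribute list can also be written down as a small lookup table and used to sanity-check raw records before any encoding. The sketch below is not part of the original notebook: the `ALLOWED` table, `BINARY` tuple and `find_problems` helper are hypothetical names, and the category spellings follow the CSV itself (for example `Govt_job`).

```python
# Allowed values per categorical column, spelled as they appear in the CSV.
ALLOWED = {
    "gender": {"Male", "Female", "Other"},
    "ever_married": {"No", "Yes"},
    "work_type": {"children", "Govt_job", "Never_worked", "Private", "Self-employed"},
    "Residence_type": {"Rural", "Urban"},
    "smoking_status": {"formerly smoked", "never smoked", "smokes", "Unknown"},
}
# Columns that should only ever hold 0 or 1.
BINARY = ("hypertension", "heart_disease", "stroke")


def find_problems(record):
    """Return a list of human-readable problems found in one raw record."""
    problems = []
    for column, allowed in ALLOWED.items():
        if record.get(column) not in allowed:
            problems.append(f"{column}: unexpected value {record.get(column)!r}")
    for column in BINARY:
        if record.get(column) not in (0, 1):
            problems.append(f"{column}: expected 0 or 1, got {record.get(column)!r}")
    return problems


# First row of the dataset as shown by df.head().
row = {
    "gender": "Male", "age": 67.0, "hypertension": 0, "heart_disease": 1,
    "ever_married": "Yes", "work_type": "Private", "Residence_type": "Urban",
    "avg_glucose_level": 228.69, "bmi": 36.6,
    "smoking_status": "formerly smoked", "stroke": 1,
}
print(find_problems(row))  # []
```

A misspelled category such as `"Govtjov"` would be reported as one problem for `work_type`.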

## IMPORTING LIBRARIES AND LOADING DATA

In [1]:
import numpy as np 
import pandas as pd 
import os
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import joblib

In [2]:
df = pd.read_csv('full_data.csv')
df.head(10)

Unnamed: 0,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
1,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
2,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
3,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1
4,Male,81.0,0,0,Yes,Private,Urban,186.21,29.0,formerly smoked,1
5,Male,74.0,1,1,Yes,Private,Rural,70.09,27.4,never smoked,1
6,Female,69.0,0,0,No,Private,Urban,94.39,22.8,never smoked,1
7,Female,78.0,0,0,Yes,Private,Urban,58.57,24.2,Unknown,1
8,Female,81.0,1,0,Yes,Private,Rural,80.43,29.7,never smoked,1
9,Female,61.0,0,1,Yes,Govt_job,Rural,120.46,36.8,smokes,1


## DATA EXPLORATION

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4981 entries, 0 to 4980
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   gender             4981 non-null   object 
 1   age                4981 non-null   float64
 2   hypertension       4981 non-null   int64  
 3   heart_disease      4981 non-null   int64  
 4   ever_married       4981 non-null   object 
 5   work_type          4981 non-null   object 
 6   Residence_type     4981 non-null   object 
 7   avg_glucose_level  4981 non-null   float64
 8   bmi                4981 non-null   float64
 9   smoking_status     4981 non-null   object 
 10  stroke             4981 non-null   int64  
dtypes: float64(3), int64(3), object(5)
memory usage: 428.2+ KB


In [4]:
df.describe()

Unnamed: 0,age,hypertension,heart_disease,avg_glucose_level,bmi,stroke
count,4981.0,4981.0,4981.0,4981.0,4981.0,4981.0
mean,43.419859,0.096165,0.05521,105.943562,28.498173,0.049789
std,22.662755,0.294848,0.228412,45.075373,6.790464,0.217531
min,0.08,0.0,0.0,55.12,14.0,0.0
25%,25.0,0.0,0.0,77.23,23.7,0.0
50%,45.0,0.0,0.0,91.85,28.1,0.0
75%,61.0,0.0,0.0,113.86,32.6,0.0
max,82.0,1.0,1.0,271.74,48.9,1.0
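The `stroke` mean of 0.049789 says that only about 5% of the rows are positive cases. As a small worked calculation (plain arithmetic on the numbers above, not a cell from the original notebook), the snippet below derives the number of stroke rows and the accuracy that a model would reach by always answering "no stroke". That figure is a useful yardstick when reading accuracy scores on this dataset.

```python
n_rows = 4981            # "count" from df.describe()
stroke_rate = 0.049789   # "mean" of the 0/1 stroke column

# The mean of a 0/1 column is the fraction of ones, so rows * mean = positives.
n_stroke = round(n_rows * stroke_rate)

# A classifier that always predicts 0 is right on every non-stroke row.
baseline_accuracy = 1 - n_stroke / n_rows

print(n_stroke)                     # 248
print(round(baseline_accuracy, 4))  # 0.9502
```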


In [5]:
print(df['gender'].unique())
print(df['work_type'].unique())
print(df['Residence_type'].unique())
print(df['smoking_status'].unique())
print(df['ever_married'].unique())

['Male' 'Female']
['Private' 'Self-employed' 'Govt_job' 'children']
['Urban' 'Rural']
['formerly smoked' 'never smoked' 'smokes' 'Unknown']
['Yes' 'No']


## VISUALIZATION

In [6]:
import cufflinks as cf

cf.go_offline()
cf.set_config_file(offline=False, world_readable=True)

### The Proportion of Stroke among Gender

In [7]:
gender = df.groupby(df['gender'])['stroke'].sum()
df_gender = pd.DataFrame({'labels': gender.index,'values': gender.values})
colors = ['lightpink', 'lightskyblue']
df_gender.iplot(kind='pie',labels='labels',values='values', title='The Proportion of Stroke among Gender', colors = colors)
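The pie charts in this section sum the `stroke` column per group, so they show how the stroke cases are split across groups, not how likely a stroke is within each group. A minimal sketch of computing both side by side is given below. It uses a tiny invented DataFrame (`toy`), not the real data, and the column names of the summary are illustrative choices.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "gender": ["Male", "Male", "Female", "Female", "Female", "Female"],
    "stroke": [1, 0, 1, 0, 0, 0],
})

# sum = number of stroke cases, count = group size, mean = stroke rate in the group.
summary = toy.groupby("gender")["stroke"].agg(
    strokes="sum", patients="count", rate="mean"
)
print(summary)
```

In this toy example both groups contribute one stroke case each, yet the rate is 0.5 for one group and 0.25 for the other, which a pie chart of sums would not reveal.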

### Work type of people who had stroke

In [8]:
job = df.groupby(df['work_type'])['stroke'].sum()
df_job = pd.DataFrame({'labels': job.index,'values': job.values})
colors2= ['palegreen','paleturquoise','thistle','moccasin']
df_job.iplot(kind='pie',labels='labels',values='values', title='Work type of people who had stroke', colors = colors2, pull=[0.1, 0.1, 0.1, 0.2])

### Smoking status of people who had stroke

In [9]:
smoke = df.groupby(df['smoking_status'])['stroke'].sum()
df_smoke = pd.DataFrame({'labels': smoke.index,'values': smoke.values})
df_smoke.iplot(kind='pie',labels='labels',values='values', title='Smoking status of people who had stroke', colors = colors2, pull=[0.02, 0.02, 0.1, 0.02])

### Residence area of people who had stroke

In [10]:
Residence = df.groupby(df['Residence_type'])['stroke'].sum()
df_Residence = pd.DataFrame({'labels': Residence.index,'values': Residence.values})
df_Residence.iplot(kind='pie',labels='labels',values='values', title='Residence area of people who had stroke', colors = colors2, pull=[0.02, 0.02],hole = 0.3)

### Marriage status of people who had stroke

In [11]:
Married = df.groupby(df['ever_married'])['stroke'].sum()
df_Married = pd.DataFrame({'labels': Married.index,'values': Married.values})
df_Married.iplot(kind='pie',labels='labels',values='values', title='Marriage status of people who had stroke', colors = colors2, pull=[0.02, 0.02],hole = 0.3)

### Stroke age among gender

In [12]:
stroke = df.loc[df['stroke']== 1].reset_index()

stroke["male_age"]=stroke[stroke["gender"]=="Male"]["age"]
stroke["female_age"]=stroke[stroke["gender"]=="Female"]["age"]
stroke[["male_age","female_age"]].iplot(kind="histogram", bins=20, theme="white", title="Stroke Ages",xTitle='Ages', yTitle='Count')

## DATA PREPROCESSING

In [13]:
df['ever_married'] = (df['ever_married'] == 'Yes').astype(int)  # Yes -> 1, No -> 0
df['gender'] = (df['gender'] == 'Female').astype(int)  # Female -> 1, Male -> 0
df.head(5)

Unnamed: 0,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,0,67.0,0,1,1,Private,Urban,228.69,36.6,formerly smoked,1
1,0,80.0,0,1,1,Private,Rural,105.92,32.5,never smoked,1
2,1,49.0,0,0,1,Private,Urban,171.23,34.4,smokes,1
3,1,79.0,1,0,1,Self-employed,Rural,174.12,24.0,never smoked,1
4,0,81.0,0,0,1,Private,Urban,186.21,29.0,formerly smoked,1


In [14]:
df = pd.get_dummies(df, columns = ['work_type', 'Residence_type','smoking_status'])
df.sample(5)

Unnamed: 0,gender,age,hypertension,heart_disease,ever_married,avg_glucose_level,bmi,stroke,work_type_Govt_job,work_type_Private,work_type_Self-employed,work_type_children,Residence_type_Rural,Residence_type_Urban,smoking_status_Unknown,smoking_status_formerly smoked,smoking_status_never smoked,smoking_status_smokes
2217,0,1.64,0,0,0,115.12,21.1,0,0,0,0,1,0,1,1,0,0,0
3177,1,51.0,0,0,1,85.59,30.5,0,1,0,0,0,1,0,0,0,1,0
3788,1,61.0,0,0,1,71.4,29.2,0,0,1,0,0,1,0,0,1,0,0
2380,0,49.0,0,0,1,92.02,38.1,0,0,1,0,0,0,1,0,0,1,0
2177,0,62.0,0,0,1,145.37,33.3,0,0,1,0,0,0,1,1,0,0,0
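`pd.get_dummies` only creates columns for the categories it sees, so a single new patient encoded on its own would yield fewer columns than the training frame. One possible way to handle this, sketched here on an invented one-column example (the variable names are illustrative and not from the notebook), is to remember the training columns and reindex new data against them.

```python
import pandas as pd

# Invented training sample covering all four smoking categories.
train = pd.DataFrame(
    {"smoking_status": ["smokes", "never smoked", "Unknown", "formerly smoked"]}
)
train_encoded = pd.get_dummies(train, columns=["smoking_status"], dtype=int)
train_columns = list(train_encoded.columns)

# A single new record only produces the one column for its own category...
new_patient = pd.DataFrame({"smoking_status": ["smokes"]})
new_encoded = pd.get_dummies(new_patient, columns=["smoking_status"], dtype=int)

# ...so align it to the training layout, filling the missing dummies with 0.
aligned = new_encoded.reindex(columns=train_columns, fill_value=0)
print(train_columns)
print(aligned.iloc[0].tolist())  # [0, 0, 0, 1]
```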


In [15]:
df.isnull().sum()

gender                            0
age                               0
hypertension                      0
heart_disease                     0
ever_married                      0
avg_glucose_level                 0
bmi                               0
stroke                            0
work_type_Govt_job                0
work_type_Private                 0
work_type_Self-employed           0
work_type_children                0
Residence_type_Rural              0
Residence_type_Urban              0
smoking_status_Unknown            0
smoking_status_formerly smoked    0
smoking_status_never smoked       0
smoking_status_smokes             0
dtype: int64

### Target and Feature values / Train Test Split

In [16]:
X = df.drop(['stroke'], axis = 1)
y = df['stroke']

In [17]:
X_train, X_test, y_train , y_test = train_test_split(X,y, test_size = 0.33, random_state = 42)
X_train.shape, X_test.shape

((3337, 17), (1644, 17))
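With roughly 5% positive rows, a purely random split can leave the train and test sets with slightly different stroke rates. `train_test_split` accepts a `stratify` argument that keeps the class ratio about the same on both sides. The sketch below demonstrates it on synthetic labels (`X_demo`, `y_demo` are made up, not the real data); whether stratification changes the results of this notebook has not been checked here.

```python
from sklearn.model_selection import train_test_split

# Synthetic labels with a 5% positive rate; not the real dataset.
y_demo = [1] * 50 + [0] * 950
X_demo = [[i] for i in range(len(y_demo))]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42, stratify=y_demo
)

# Both rates should sit close to the overall 0.05.
print(sum(y_tr) / len(y_tr), sum(y_te) / len(y_te))
```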

## MODEL BUILDING

### Decision Tree Classifier and Gini method

In [18]:
clf_gini = DecisionTreeClassifier(criterion='gini', random_state=0,max_depth= 5)
clf_gini.fit(X_train, y_train)

### Prediction Model File Generation

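#### Uncomment the lines below to retrain and re-save the model whenever the data changes

In [19]:
# Note: this trains a separate tree with default settings (no max_depth limit),
# not the clf_gini model fitted above.
# model = DecisionTreeClassifier()
# model.fit(X_train,y_train)
# joblib.dump(model, 'stroke-prediction-model.joblib')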

### Prediction Model File Loading

In [20]:
model = joblib.load('stroke-prediction-model.joblib')
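Note that the file loaded here was written from a `DecisionTreeClassifier()` with default settings, while the accuracy cells in this notebook evaluate `clf_gini` (`max_depth=5`), so the two are not the same tree. If the intent is to persist the evaluated model, a round trip could look like the sketch below. It uses a tiny invented dataset and a temporary file name (`demo-model.joblib`), both of which are assumptions for illustration.

```python
import os
import tempfile

import joblib
from sklearn.tree import DecisionTreeClassifier

# Tiny invented dataset: [age, hypertension] -> stroke.
X_demo = [[20, 0], [35, 0], [60, 1], [80, 1]]
y_demo = [0, 0, 1, 1]

clf = DecisionTreeClassifier(criterion="gini", max_depth=5, random_state=0)
clf.fit(X_demo, y_demo)

# Dump the fitted estimator itself, then load it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo-model.joblib")
    joblib.dump(clf, path)
    restored = joblib.load(path)

# The restored object keeps both hyperparameters and learned splits.
print(restored.get_params()["max_depth"])                           # 5
print(list(restored.predict(X_demo)) == list(clf.predict(X_demo)))  # True
```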

### Model accuracy score

#### Testing Accuracy

In [21]:
y_pred_gini = clf_gini.predict(X_test)
print('Model accuracy score with criterion gini index: {0:0.4f}'.format(accuracy_score(y_test, y_pred_gini)))

Model accuracy score with criterion gini index: 0.9507


In [22]:
y_pred_train_gini = clf_gini.predict(X_train)

y_pred_train_gini

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

#### Training Accuracy

In [23]:
print('Training-set accuracy score: {0:0.4f}'.format(accuracy_score(y_train, y_pred_train_gini)))

Training-set accuracy score: 0.9527


In [24]:
print('Training set score: {:.4f}'.format(clf_gini.score(X_train, y_train)))

print('Test set score: {:.4f}'.format(clf_gini.score(X_test, y_test)))

Training set score: 0.9527
Test set score: 0.9507
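Training and test accuracy are close (0.9527 vs 0.9507), which suggests little overfitting. Accuracy alone can still be misleading here, because only about 5% of the rows are strokes (the `stroke` mean in `df.describe()` is 0.0498). The sketch below uses invented labels, not the notebook's predictions, to show how a predictor that never flags a stroke can score high accuracy with zero recall. The helper names are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def recall(y_true, y_pred):
    """Fraction of actual positives that were predicted positive."""
    tp, _, fn, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Invented example: 5 strokes among 100 patients, model always says "no stroke".
y_true = [1] * 5 + [0] * 95
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                # 0.95
print(recall(y_true, y_pred))  # 0.0
```

Running the same kind of check on `y_test` and `y_pred_gini` (for example with `sklearn.metrics.confusion_matrix`) would show how many actual strokes the tree catches.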


## MODEL WORKING GRAPH

In [25]:
export_graphviz(model, out_file='stroke-prediction-visual-model.dot', 
                        feature_names=list(X.columns),
                        label='all',
                        rounded=True,
                        filled=True)
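The `.dot` file needs Graphviz to render. For a quick look at the learned rules without extra tooling, scikit-learn also provides `export_text`. The sketch below fits a tiny tree on invented data (two made-up features, `age` and `hypertension`) just to show the call; it is not the notebook's model.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Invented data: only age separates the classes, hypertension carries no signal.
X_demo = [[20, 0], [30, 1], [70, 0], [80, 1]]
y_demo = [0, 0, 1, 1]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_demo, y_demo)

# Plain-text rendering of the split rules, one line per node.
rules = export_text(tree, feature_names=["age", "hypertension"])
print(rules)
```

For the notebook's own tree the equivalent call would pass the fitted estimator and `list(X.columns)` as feature names.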

## PREDICTION TESTING

### Case 1

In [26]:
print("Gender = Female \nAge = 40 \nHypertension = 0 \nHeart Disease = 0 \nEver Married = 0 \n\
Avg Glucose Level = 100 \nbmi = 35 \nWork Type Govt Job = 0 \nWork Type Private = 1 \nWork Type Self-Employed = 0 \n\
Work Type Children = 0 \nResidence Type Rural = 0 \nResidence Type Urban = 1 \nSmoking Status Unknown = 0 \nSmoking Status Formerly Smoked = 1 \n\
Smoking Status Never Smoked = 0 \nSmoking Status Smokes = 0")

# Values follow the column order of X; gender is encoded as 0 = Male, 1 = Female
predictionOutcome = model.predict([[1,40,0,0,0,100,35,0,1,0,0,0,1,0,1,0,0]])

if predictionOutcome == 0:
    print("\nModel predicts NO STROKE")
else:
    print("\nModel predicts STROKE")


Gender = Female
Age = 40
Hypertension = 0
Heart Disease = 0
Ever Married = 0
Avg Glucose Level = 100
bmi = 35
Work Type Govt Job = 0
Work Type Private = 1
Work Type Self-Employed = 0
Work Type Children = 0
Residence Type Rural = 0
Residence Type Urban = 1
Smoking Status Unknown = 0
Smoking Status Formerly Smoked = 1
Smoking Status Never Smoked = 0
Smoking Status Smokes = 0

Model predicts NO STROKE


### Case 2

In [27]:
print("Gender = Female \nAge = 35 \nHypertension = 1 \nHeart Disease = 0 \nEver Married = 1 \n\
Avg Glucose Level = 80 \nbmi = 45 \nWork Type Govt Job = 1 \nWork Type Private = 0 \nWork Type Self-Employed = 0 \n\
Work Type Children = 0 \nResidence Type Rural = 0 \nResidence Type Urban = 1 \nSmoking Status Unknown = 1 \nSmoking Status Formerly Smoked = 0 \n\
Smoking Status Never Smoked = 0 \nSmoking Status Smokes = 0")

# Values follow the column order of X; gender is encoded as 0 = Male, 1 = Female
predictionOutcome = model.predict([[1,35,1,0,1,80,45,1,0,0,0,0,1,1,0,0,0]])

if predictionOutcome == 0:
    print("\nModel predicts NO STROKE")
else:
    print("\nModel predicts STROKE")

Gender = Female
Age = 35
Hypertension = 1
Heart Disease = 0
Ever Married = 1
Avg Glucose Level = 80
bmi = 45
Work Type Govt Job = 1
Work Type Private = 0
Work Type Self-Employed = 0
Work Type Children = 0
Residence Type Rural = 0
Residence Type Urban = 1
Smoking Status Unknown = 1
Smoking Status Formerly Smoked = 0
Smoking Status Never Smoked = 0
Smoking Status Smokes = 0

Model predicts NO STROKE


### Case 3

In [28]:
print("Gender = Male \nAge = 20 \nHypertension = 0 \nHeart Disease = 0 \nEver Married = 0 \n\
Avg Glucose Level = 80 \nbmi = 27 \nWork Type Govt Job = 0 \nWork Type Private = 0 \nWork Type Self-Employed = 0 \n\
Work Type Children = 1 \nResidence Type Rural = 0 \nResidence Type Urban = 1 \nSmoking Status Unknown = 0  \nSmoking Status Formerly Smoked = 0 \n\
Smoking Status Never Smoked = 0 \nSmoking Status Smokes = 1")

predictionOutcome = model.predict([[0,20,0,0,0,80,27,0,0,0,1,0,1,0,0,0,1]])

if predictionOutcome == 0:
    print("\nModel predicts NO STROKE")
else:
    print("\nModel predicts STROKE")

Gender = Male 
Age = 20 
Hypertension = 0 
Heart Disease = 0 
Ever Married = 0 
Avg Glucose Level = 80 
bmi = 27 
Work Type Govt Job = 0 
Work Type Private = 0 
Work Type Self-Employed = 0 
Work Type Children = 1 
Residence Type Rural = 0 
Residence Type Urban = 1 
Smoking Status Unknown = 0  
Smoking Status Formerly Smoked = 0 
Smoking Status Never Smoked = 0 
Smoking Status Smokes = 1

Model predicts NO STROKE


### Case 4

In [29]:
print("Gender = Male \nAge = 20 \nHypertension = 0 \nHeart Disease = 0 \nEver Married = 0 \n\
Avg Glucose Level = 80 \nbmi = 22 \nWork Type Govt Job = 0 \nWork Type Private = 0 \nWork Type Self-Employed = 0 \n\
Work Type Children = 1 \nResidence Type Rural = 0 \nResidence Type Urban = 1 \nSmoking Status Unknown = 0  \nSmoking Status Formerly Smoked = 0 \n\
Smoking Status Never Smoked = 0 \nSmoking Status Smokes = 1")

predictionOutcome = model.predict([[0,20,0,0,0,80,22,0,0,0,1,0,1,0,0,0,1]])

if predictionOutcome == 0:
    print("\nModel predicts NO STROKE")
else:
    print("\nModel predicts STROKE")

Gender = Male 
Age = 20 
Hypertension = 0 
Heart Disease = 0 
Ever Married = 0 
Avg Glucose Level = 80 
bmi = 22 
Work Type Govt Job = 0 
Work Type Private = 0 
Work Type Self-Employed = 0 
Work Type Children = 1 
Residence Type Rural = 0 
Residence Type Urban = 1 
Smoking Status Unknown = 0  
Smoking Status Formerly Smoked = 0 
Smoking Status Never Smoked = 0 
Smoking Status Smokes = 1

Model predicts NO STROKE


### Case 5

In [30]:
print("Gender = Male \nAge = 20 \nHypertension = 1 \nHeart Disease = 0 \nEver Married = 0 \n\
Avg Glucose Level = 80 \nbmi = 26 \nWork Type Govt Job = 0 \nWork Type Private = 0 \nWork Type Self-Employed = 0 \n\
Work Type Children = 1 \nResidence Type Rural = 0 \nResidence Type Urban = 1 \nSmoking Status Unknown = 0  \nSmoking Status Formerly Smoked = 0 \n\
Smoking Status Never Smoked = 0 \nSmoking Status Smokes = 1")

predictionOutcome = model.predict([[0,20,1,0,0,80,26,0,0,0,1,0,1,0,0,0,1]])

if predictionOutcome == 0:
    print("\nModel predicts NO STROKE")
else:
    print("\nModel predicts STROKE")

Gender = Male 
Age = 20 
Hypertension = 1 
Heart Disease = 0 
Ever Married = 0 
Avg Glucose Level = 80 
bmi = 26 
Work Type Govt Job = 0 
Work Type Private = 0 
Work Type Self-Employed = 0 
Work Type Children = 1 
Residence Type Rural = 0 
Residence Type Urban = 1 
Smoking Status Unknown = 0  
Smoking Status Formerly Smoked = 0 
Smoking Status Never Smoked = 0 
Smoking Status Smokes = 1

Model predicts NO STROKE
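Typing 17-element vectors by hand makes it easy for the printed description and the values passed to `model.predict` to drift apart. A possible remedy, sketched below, is a small helper that builds the vector from readable values in the column order of `X`. `FEATURE_ORDER` and `encode_patient` are hypothetical names introduced for this example; the encoding (Female = 1, "Yes" = 1, one-hot dummies) follows the preprocessing cells above.

```python
# Column order of X after preprocessing, as listed in the notebook output.
FEATURE_ORDER = [
    "gender", "age", "hypertension", "heart_disease", "ever_married",
    "avg_glucose_level", "bmi",
    "work_type_Govt_job", "work_type_Private", "work_type_Self-employed",
    "work_type_children",
    "Residence_type_Rural", "Residence_type_Urban",
    "smoking_status_Unknown", "smoking_status_formerly smoked",
    "smoking_status_never smoked", "smoking_status_smokes",
]


def encode_patient(gender, age, hypertension, heart_disease, ever_married,
                   avg_glucose_level, bmi, work_type, residence_type,
                   smoking_status):
    """Build one feature row in FEATURE_ORDER from human-readable values."""
    row = dict.fromkeys(FEATURE_ORDER, 0)
    row["gender"] = 1 if gender == "Female" else 0
    row["age"] = age
    row["hypertension"] = hypertension
    row["heart_disease"] = heart_disease
    row["ever_married"] = 1 if ever_married == "Yes" else 0
    row["avg_glucose_level"] = avg_glucose_level
    row["bmi"] = bmi
    # Switch on exactly one dummy column per categorical attribute.
    for key in (f"work_type_{work_type}",
                f"Residence_type_{residence_type}",
                f"smoking_status_{smoking_status}"):
        if key not in row:
            raise ValueError(f"unknown category column: {key}")
        row[key] = 1
    return [row[name] for name in FEATURE_ORDER]


# Reproduces the vector used in Case 3.
case_3 = encode_patient("Male", 20, 0, 0, "No", 80, 27,
                        "children", "Urban", "smokes")
print(case_3)
# [0, 20, 0, 0, 0, 80, 27, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1]
```

The result can be passed straight to the model as `model.predict([case_3])`.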


### Automated Runs

In [40]:
import random

count = 0
predictionOutcome = [0]

print("######################################################### PROGRAM STARTED #########################################################")

# Keep drawing random patients until the model predicts a stroke.
while predictionOutcome[0] == 0:
    arr1 = np.random.randint(2, size=1)    # Gender (0 = Male, 1 = Female)
    arr2 = np.random.randint(83, size=1)   # Age, 0-82
    arr3 = np.random.randint(2, size=1)    # Hypertension
    arr4 = np.random.randint(2, size=1)    # Heart Disease
    arr5 = np.random.randint(2, size=1)    # Ever Married
    arr6 = np.random.randint(272, size=1)  # Avg Glucose Level, 0-271
    arr7 = np.random.randint(100, size=1)  # bmi, 0-99

    # One-hot work type: [Govt Job, Private, Self-Employed, Children]
    worktype = np.zeros(4, dtype=int)
    worktype[random.randint(0, 3)] = 1

    # One-hot residence type: [Rural, Urban]
    residencetype = np.zeros(2, dtype=int)
    residencetype[random.randint(0, 1)] = 1

    # One-hot smoking status: [Unknown, Formerly Smoked, Never Smoked, Smokes]
    smokingtype = np.zeros(4, dtype=int)
    smokingtype[random.randint(0, 3)] = 1

    # Same column order as X: numeric features, then work, residence, smoking
    array = np.concatenate((arr1, arr2, arr3, arr4, arr5, arr6, arr7, worktype, residencetype, smokingtype), axis=0)

    print("\nArray %d =" % count, array)

    predictionOutcome = model.predict([array])

    if predictionOutcome[0] == 0:
        print("Model predicts NO STROKE = ", predictionOutcome)
        print("###################################################################################################################################")
    else:
        print("Model predicts STROKE = ", predictionOutcome)

    count += 1

print("######################################################### PROGRAM FINISHED #########################################################")


######################################################### PROGRAM STARTED #########################################################

Array 0 = [  0  20   0   0   0 111  17   1   0   0   0   1   0   1   0   0   0]
Model predicts NO STROKE =  [0]
###################################################################################################################################

Array 1 = [ 1 19  0  1  1 85 98  1  0  0  0  1  0  1  0  0  0]
Model predicts NO STROKE =  [0]
###################################################################################################################################

Array 2 = [ 0 80  0  0  1 27 89  1  0  0  0  1  0  1  0  0  0]
Model predicts STROKE =  [1]
######################################################### PROGRAM FINISHED #########################################################
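A loop that runs until the model predicts a stroke has no upper bound on its running time. One way to keep such a search bounded is to cap the number of tries, as in the sketch below. Everything in it is a stand-in: `search_for_positive`, `sample_row` and `stub_predict` are hypothetical names, the rows have only two made-up features, and the stub rule is not the trained tree.

```python
import random


def search_for_positive(predict, sample, max_tries=1000):
    """Return (tries, row) for the first row predicted 1, or (max_tries, None)."""
    for tries in range(1, max_tries + 1):
        row = sample()
        if predict(row) == 1:
            return tries, row
    return max_tries, None


rng = random.Random(0)  # fixed seed so the run is repeatable


def sample_row():
    # Hypothetical two-feature row: [age, hypertension].
    return [rng.randint(0, 82), rng.randint(0, 1)]


def stub_predict(row):
    # Stand-in rule for illustration only, not the notebook's model.
    return 1 if row[0] >= 75 and row[1] == 1 else 0


tries, row = search_for_positive(stub_predict, sample_row)
print(tries, row)
```

With the real model, `predict` could be wrapped as `lambda r: model.predict([r])[0]`, and `sample` would return a full 17-feature row.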
