<a href="https://colab.research.google.com/github/manigs2007/Heart_Disease_Prediction/blob/main/Heart_Disease_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Problem Statement
#### Predict whether a person has heart disease from the given dataset.

## Work Flow
#### Get The Heart Dataset --> Data Pre-Processing --> Train Test Split --> Feeding into Machine Learning Model (Logistic Regression Model) --> We get a Trained Logistic Model --> Feed New Data for fresh prediction.

In [2]:
## This use case is a binary classification problem: each prediction is either positive (heart disease) or negative (no heart disease).
## Logistic Regression is a common and well-suited model for binary classification.
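
As a rough illustration of why logistic regression suits a yes/no problem, the sketch below passes a few made-up scores through the sigmoid function and thresholds the result at 0.5. The scores and variable names are hypothetical and are not taken from the heart dataset.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score to a probability in the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical linear scores (w . x + b) for three made-up samples
scores = np.array([-2.0, 0.5, 3.0])
probabilities = sigmoid(scores)

# Probabilities above 0.5 become class 1 (positive), the rest class 0 (negative)
labels = (probabilities > 0.5).astype(int)

print(probabilities)
print(labels)
```

The model learns the weights that produce the scores; the sigmoid and the threshold then turn each score into one of the two classes.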

## Importing the dependencies

In [3]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

## Data Collection and Processing

In [5]:
## Loading the csv data to a Pandas dataframe

heart_data = pd.read_csv('/heart_disease_data.csv')

In [6]:
## Print the first 5 rows of the dataset

heart_data.head()

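|   | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |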


In [7]:
## print last 5 rows of the dataset

heart_data.tail()

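|   | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 298 | 57 | 0 | 0 | 140 | 241 | 0 | 1 | 123 | 1 | 0.2 | 1 | 0 | 3 | 0 |
| 299 | 45 | 1 | 3 | 110 | 264 | 0 | 1 | 132 | 0 | 1.2 | 1 | 0 | 3 | 0 |
| 300 | 68 | 1 | 0 | 144 | 193 | 1 | 1 | 141 | 0 | 3.4 | 1 | 2 | 3 | 0 |
| 301 | 57 | 1 | 0 | 130 | 131 | 0 | 1 | 115 | 1 | 1.2 | 1 | 1 | 3 | 0 |
| 302 | 57 | 0 | 1 | 130 | 236 | 0 | 0 | 174 | 0 | 0.0 | 1 | 1 | 2 | 0 |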


In [10]:
## checking the number of rows and columns

heart_data.shape

(303, 14)

In [11]:
## getting some info about the data
## note: info is a method and needs parentheses; a bare heart_data.info only echoes
## the bound method and the DataFrame repr instead of the column summary

heart_data.info()

In [None]:
## the info() summary reports 303 non-null values in every column, so there are no missing values.

In [12]:
## checking for missing values in another way

heart_data.isnull().sum()

age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
slope       0
ca          0
thal        0
target      0
dtype: int64

In [13]:
## getting some statistical measures about the data 

heart_data.describe()

|   | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 |
| mean | 54.366337 | 0.683168 | 0.966997 | 131.623762 | 246.264026 | 0.148515 | 0.528053 | 149.646865 | 0.326733 | 1.039604 | 1.39934 | 0.729373 | 2.313531 | 0.544554 |
| std | 9.082101 | 0.466011 | 1.032052 | 17.538143 | 51.830751 | 0.356198 | 0.52586 | 22.905161 | 0.469794 | 1.161075 | 0.616226 | 1.022606 | 0.612277 | 0.498835 |
| min | 29.0 | 0.0 | 0.0 | 94.0 | 126.0 | 0.0 | 0.0 | 71.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 47.5 | 0.0 | 0.0 | 120.0 | 211.0 | 0.0 | 0.0 | 133.5 | 0.0 | 0.0 | 1.0 | 0.0 | 2.0 | 0.0 |
| 50% | 55.0 | 1.0 | 1.0 | 130.0 | 240.0 | 0.0 | 1.0 | 153.0 | 0.0 | 0.8 | 1.0 | 0.0 | 2.0 | 1.0 |
| 75% | 61.0 | 1.0 | 2.0 | 140.0 | 274.5 | 0.0 | 1.0 | 166.0 | 1.0 | 1.6 | 2.0 | 1.0 | 3.0 | 1.0 |
| max | 77.0 | 1.0 | 3.0 | 200.0 | 564.0 | 1.0 | 2.0 | 202.0 | 1.0 | 6.2 | 2.0 | 4.0 | 3.0 | 1.0 |


In [None]:
## The 25% row is the 25th percentile: in the age column, 25% of the values are at or below 47.5
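
To make the percentile rows concrete, the following sketch checks that the 25% row of `describe()` is the same number as `quantile(0.25)`. The ages are a small hypothetical sample, not values read from the heart dataset.

```python
import pandas as pd

# Hypothetical ages, chosen only for illustration
ages = pd.Series([29, 41, 47, 55, 61, 68, 77])

summary = ages.describe()
q25 = ages.quantile(0.25)

# The "25%" row of describe() and quantile(0.25) report the same value
print(summary["25%"], q25)

# Share of the sample at or below that value (only roughly a quarter for a tiny sample)
print((ages <= q25).mean())
```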

In [14]:
## Checking the distribution of target variable
## value_counts will give us the number of values that are 1 or 0

heart_data['target'].value_counts()

1    165
0    138
Name: target, dtype: int64

1 --> Defective Heart

0 --> Healthy Heart
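
The class balance can also be read as proportions. The sketch below rebuilds a label column from the counts reported above (165 ones and 138 zeros) instead of loading the original CSV, and uses `value_counts(normalize=True)`. The share of ones it prints, about 0.545, agrees with the mean of the target column in the describe() output.

```python
import pandas as pd

# Rebuild a label column with the reported counts: 165 ones and 138 zeros
target = pd.Series([1] * 165 + [0] * 138, name="target")

counts = target.value_counts()                     # absolute counts per class
proportions = target.value_counts(normalize=True)  # the same counts as fractions of the total

print(counts)
print(proportions.round(3))
```

Since neither class dominates, accuracy is a reasonable first metric for this dataset.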

## Splitting the Features and Target

In [15]:
## The dataset has 14 columns. The last column is named "target".
## All columns other than "target" are the "feature" columns.

In [16]:
X = heart_data.drop(columns='target')  ## columns= already names a column to drop, so axis=1 is redundant here
Y = heart_data['target']

In [17]:
print(X)

     age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  \
0     63    1   3       145   233    1        0      150      0      2.3   
1     37    1   2       130   250    0        1      187      0      3.5   
2     41    0   1       130   204    0        0      172      0      1.4   
3     56    1   1       120   236    0        1      178      0      0.8   
4     57    0   0       120   354    0        1      163      1      0.6   
..   ...  ...  ..       ...   ...  ...      ...      ...    ...      ...   
298   57    0   0       140   241    0        1      123      1      0.2   
299   45    1   3       110   264    0        1      132      0      1.2   
300   68    1   0       144   193    1        1      141      0      3.4   
301   57    1   0       130   131    0        1      115      1      1.2   
302   57    0   1       130   236    0        0      174      0      0.0   

     slope  ca  thal  
0        0   0     1  
1        0   0     2  
2        2   0    

In [18]:
print(Y)

0      1
1      1
2      1
3      1
4      1
      ..
298    0
299    0
300    0
301    0
302    0
Name: target, Length: 303, dtype: int64


## Splitting the data into training Data and Test data

In [20]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, stratify = Y, random_state = 2)


In [None]:
## test_size is the fraction of the data held out for testing (e.g. 0.1 for 10%, 0.2 for 20%)

''' stratify = Y makes the split preserve the class proportions of Y, so Y_train and
    Y_test contain roughly the same share of ones and zeros as the original data.
    Without stratify = Y the split is purely random, and the test set could by chance
    end up with a noticeably different share of ones and zeros than the full dataset. '''

''' random_state seeds the random shuffling, so the same value (here 2) reproduces exactly
    the same split on every run; another integer gives a different but equally valid split '''
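
The effect of `stratify` and `random_state` can be checked on a small example. This is a minimal sketch that assumes made-up, deliberately imbalanced labels (80 zeros, 20 ones) and a placeholder feature column; none of it comes from the heart dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 zeros and 20 ones
y = np.array([0] * 80 + [1] * 20)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature column

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=2
)

# With stratify=y, both parts keep the original 20% share of ones
print(y.mean(), y_train.mean(), y_test.mean())

# Re-running with the same random_state reproduces the identical split
_, X_test_again, _, _ = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=2
)
print(np.array_equal(X_test, X_test_again))
```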

In [21]:
print(X.shape, X_train.shape, X_test.shape)

(303, 13) (242, 13) (61, 13)


## Model Training

### LogisticRegression

In [22]:
model = LogisticRegression()

In [23]:
## Training the LogisticRegression model with Training Data

model.fit(X_train, Y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


LogisticRegression()
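
The convergence message above suggests two remedies: raising `max_iter` or scaling the features. Here is one way to combine both, assuming synthetic stand-in data from `make_classification` rather than the heart dataset. The names `X_demo`, `y_demo` and `scaled_model` are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with 13 features (an assumption, not the heart dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=2)

# Inflate a few columns to mimic features measured on very different scales
X_demo[:, :3] *= 1000

# Standardize the features first, then fit logistic regression with a larger iteration budget
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_demo, y_demo)

# Number of solver iterations actually used
n_iter = scaled_model.named_steps["logisticregression"].n_iter_
print(n_iter)
```

Wrapping the scaler and the model in one pipeline means the scaling parameters are learned from the training data only and are reapplied automatically at prediction time.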

## Model Evaluation

### Accuracy Score

In [24]:
## accuracy on training data

X_train_prediction = model.predict(X_train)
training_data_accuracy = accuracy_score(Y_train, X_train_prediction)  ## argument order is (y_true, y_pred)

In [25]:
print("Accuracy on Training Data : ", training_data_accuracy)

Accuracy on Training Data :  0.8512396694214877


In [26]:
## accuracy on test data

X_test_prediction = model.predict(X_test)
test_data_accuracy = accuracy_score(Y_test, X_test_prediction)  ## argument order is (y_true, y_pred)

In [27]:
print("Accuracy on Test Data : ", test_data_accuracy)

Accuracy on Test Data :  0.819672131147541
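
Accuracy is simply the fraction of predictions that match the true labels. The reported scores correspond to 206 of 242 training rows and 50 of 61 test rows being classified correctly. The sketch below shows the calculation on a handful of hypothetical labels and adds a confusion matrix, which separates the two kinds of error that accuracy alone hides.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and predictions, for illustration only
y_true = np.array([1, 0, 1, 1, 0, 1])
y_pred = np.array([1, 0, 0, 1, 0, 1])

# Accuracy by hand: share of positions where prediction equals truth
manual_accuracy = (y_true == y_pred).mean()
library_accuracy = accuracy_score(y_true, y_pred)
print(manual_accuracy, library_accuracy)

# Rows are true classes (0, 1); columns are predicted classes (0, 1)
matrix = confusion_matrix(y_true, y_pred)
print(matrix)

# The fractions behind the scores reported above
print(206 / 242, 50 / 61)
```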


## Building a Predictive System

In [29]:
input_data = (43,1,0,150,247,0,1,171,0,1.5,2,0,2)

## wrap the single instance in a one-row DataFrame that reuses the training column names,
## so the model receives the same feature names it was fitted with
input_data_as_dataframe = pd.DataFrame([input_data], columns=X.columns)

prediction = model.predict(input_data_as_dataframe)
print(prediction)

if prediction[0] == 0:
    print("The Person does not have a Heart Disease")
else:
    print("The Person has a Heart Disease")

[1]
The Person has a Heart Disease


In [None]:
## Similarly, check other cases by feeding in different rows from the heart dataset or any other data with the same 13 features.
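
When trying further inputs, it can help to look at the predicted probability as well as the label. The following sketch assumes a hypothetical helper `predict_with_probability` and a small synthetic model with made-up feature names. It is not the notebook's trained model, but the same pattern should apply to it.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data and feature names (assumptions for illustration)
feature_names = [f"feature_{i}" for i in range(5)]
X_arr, y_arr = make_classification(n_samples=200, n_features=5, random_state=2)
X_demo = pd.DataFrame(X_arr, columns=feature_names)
demo_model = LogisticRegression(max_iter=1000).fit(X_demo, y_arr)

def predict_with_probability(model, values, columns):
    """Return (label, probability of class 1) for a single row of feature values."""
    row = pd.DataFrame([values], columns=columns)  # one-row frame with matching column names
    label = int(model.predict(row)[0])
    probability = float(model.predict_proba(row)[0, 1])
    return label, probability

label, probability = predict_with_probability(
    demo_model, X_demo.iloc[0].tolist(), feature_names
)
print(label, round(probability, 3))
```

A probability close to 0.5 signals a borderline case, which the bare 0 or 1 label does not reveal.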