In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

import os
print(os.listdir())

import warnings
warnings.filterwarnings('ignore')

['.git', '.gitignore', '.ipynb_checkpoints', 'Heart.csv', 'initial_exploration.ipynb', 'linear_regression.ipynb', 'README.md']


In [2]:
heart_data=pd.read_csv('Heart.csv')

## 2. Features and target
1. Target: the last column, which indicates whether a person has heart disease or not.
2. Features: all other columns (age, sex, cp, trestbps, chol, fbs, restecg, thalach, exang, oldpeak, slope, ca, thal). A model is trained on these features to predict whether a person is suffering from heart disease.

In [3]:
X = heart_data.iloc[:, :-1].values
y = heart_data.iloc[:, -1].values

In [4]:
from sklearn import preprocessing
# Rescale every feature to the [0, 1] range
min_max_scaler = preprocessing.MinMaxScaler(feature_range=(0, 1))
X_preprocessed = min_max_scaler.fit_transform(X)

In [5]:
from sklearn.model_selection import train_test_split
X_train,X_test,Y_train,Y_test = train_test_split(X_preprocessed,y,test_size=0.20,random_state=0)
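
The cells above scale the full dataset before splitting it, so the scaler also sees the test rows. A minimal sketch of one alternative, assuming a scikit-learn pipeline is acceptable here: split first, then let the pipeline fit the scaler on the training rows only. The data is a synthetic stand-in generated with `make_classification` (not `Heart.csv`), and the names `X_demo`, `y_demo` and `leak_free_model` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for Heart.csv (assumption): 304 rows, 13 numeric features.
X_demo, y_demo = make_classification(n_samples=304, n_features=13, random_state=0)

# Split first, so the test rows stay unseen while the scaler is fitted.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=0
)

# The pipeline fits MinMaxScaler on the training rows only, then the classifier.
leak_free_model = make_pipeline(
    MinMaxScaler(feature_range=(0, 1)), LogisticRegression()
)
leak_free_model.fit(X_tr, y_tr)

# score() applies the already-fitted scaler to the test rows before predicting.
test_accuracy = leak_free_model.score(X_te, y_te)
print(f"Test accuracy on the synthetic data: {test_accuracy:.2%}")
```

On a dataset this small the difference from the original approach may be minor, but the pipeline keeps the test set fully held out.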

## Linear regression is not suitable for this type of data, since the target variable is categorical

In [8]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(X_train,Y_train)

In [9]:
from sklearn.metrics import accuracy_score

In [10]:
Y_pred_lr = lr.predict(X_test)

In [11]:
# accuracy_score expects (y_true, y_pred); the value is the same either way
score_lr = round(accuracy_score(Y_test, Y_pred_lr)*100,2)
print("The accuracy score achieved using Logistic Regression is: "+str(score_lr)+" %")

The accuracy score achieved using Logistic Regression is: 83.61 %


In [12]:
from sklearn.metrics import classification_report
print(classification_report(Y_test, Y_pred_lr))

              precision    recall  f1-score   support

           0       0.88      0.76      0.81        29
           1       0.81      0.91      0.85        32

    accuracy                           0.84        61
   macro avg       0.84      0.83      0.83        61
weighted avg       0.84      0.84      0.83        61



## 4. Results
1. Accuracy obtained using logistic regression is 83.61%.
2. Precision is higher for people who are not affected by heart disease (0.88 for class 0 versus 0.81 for class 1).
3. Recall is higher for people who are affected by heart disease (0.91 for class 1 versus 0.76 for class 0).
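
The figures in the report can be traced back to confusion-matrix counts. The notebook does not print the confusion matrix, so the counts below are an assumption: they are the integer counts consistent with the reported supports (29 and 32) and the rounded precision and recall values.

```python
# Counts inferred from the classification report (assumption, not notebook output).
tn, fp = 22, 7   # true class 0 (no disease): 29 people in total
fn, tp = 3, 29   # true class 1 (disease): 32 people in total

precision_0 = tn / (tn + fn)   # of those predicted class 0, how many really are
recall_0 = tn / (tn + fp)      # of those really class 0, how many were found
precision_1 = tp / (tp + fp)   # of those predicted class 1, how many really are
recall_1 = tp / (tp + fn)      # of those really class 1, how many were found
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"class 0: precision={precision_0:.2f} recall={recall_0:.2f}")
print(f"class 1: precision={precision_1:.2f} recall={recall_1:.2f}")
print(f"accuracy={accuracy * 100:.2f} %")
```

Under these counts, the lower recall for class 0 comes from 7 people without heart disease being predicted as having it, while only 3 people with heart disease are missed.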

In [13]:
from sklearn.model_selection import train_test_split

X_train2,X_test2,Y_train2,Y_test2 = train_test_split(X_preprocessed,y,test_size=0.3,random_state=0)

In [14]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()

lr.fit(X_train2,Y_train2)

In [15]:
from sklearn.metrics import accuracy_score
Y_pred_lr2 = lr.predict(X_test2)
# accuracy_score expects (y_true, y_pred); the value is the same either way
score_lr = round(accuracy_score(Y_test2, Y_pred_lr2)*100,2)
print(score_lr)

80.43


In [16]:
from sklearn.metrics import classification_report
print(classification_report(Y_test2, Y_pred_lr2))

              precision    recall  f1-score   support

           0       0.82      0.75      0.79        44
           1       0.79      0.85      0.82        48

    accuracy                           0.80        92
   macro avg       0.81      0.80      0.80        92
weighted avg       0.81      0.80      0.80        92



## What I found and learnt from these experiments:
- The target variable is categorical, so Linear Regression is not appropriate for this dataset
- Logistic Regression is used instead
- Changing the proportion of training and test data changes the measured accuracy: 83.61% with a 20% test set versus 80.43% with a 30% test set
- Precision and recall change with the split as well
- Comparing several splits gives a more reliable picture of model performance than relying on a single split
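
Because the score moves with the split, one common way to get a steadier estimate is k-fold cross-validation. The sketch below is not part of the original notebook; it assumes 5 stratified folds and uses synthetic data from `make_classification` as a stand-in for `Heart.csv`. The names `X_cv`, `y_cv` and `cv_scores` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for Heart.csv (assumption): 304 rows, 13 numeric features.
X_cv, y_cv = make_classification(n_samples=304, n_features=13, random_state=0)

# Scaling lives inside the pipeline, so it is refitted on each training fold.
model = make_pipeline(MinMaxScaler(), LogisticRegression())

# Five stratified folds keep the class balance similar in every fold.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_val_score(model, X_cv, y_cv, cv=folds, scoring="accuracy")

print("Accuracy per fold:", [round(float(s), 2) for s in cv_scores])
print(f"Mean accuracy: {cv_scores.mean():.2%} (std {cv_scores.std():.2%})")
```

The spread across folds gives a rough idea of how much of the gap between two single splits could be down to chance.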

## Important Features
| # | Column | Non-Null Count | Dtype |
| --- | --- | --- | --- |
| 0 | age | 304 non-null | int64 |
| 1 | sex | 304 non-null | int64 |
| 2 | cp | 304 non-null | int64 |
| 3 | trestbps | 304 non-null | int64 |
| 4 | chol | 304 non-null | int64 |
| 5 | fbs | 304 non-null | int64 |
| 6 | restecg | 304 non-null | int64 |
| 7 | thalach | 304 non-null | int64 |
| 8 | exang | 304 non-null | int64 |
| 9 | oldpeak | 304 non-null | float64 |
| 10 | slope | 304 non-null | int64 |
| 11 | ca | 304 non-null | int64 |
| 12 | thal | 304 non-null | int64 |

# Machine-Learning-Project-Starter
Repository of all project documentation and code.

## State of the project
Logistic Regression is used to implement the model

Important results (Logistic Regression, 30% test split):

              precision    recall  f1-score   support

           0       0.82      0.75      0.79        44
           1       0.79      0.85      0.82        48

    accuracy                           0.80        92
   macro avg       0.81      0.80      0.80        92
weighted avg       0.81      0.80      0.80        92

## Overview of the project
In this project, I worked with patient heart data to develop a machine learning model that predicts the likelihood of heart disease.
All of the features listed below contribute to predicting whether a person has heart disease. This is a binary classification problem: the model predicts 1 (positive) for a person suffering from heart disease and 0 (negative) for a person not suffering from heart disease.


## Important Features
| # | Column | Non-Null Count | Dtype |
| --- | --- | --- | --- |
| 0 | age | 304 non-null | int64 |
| 1 | sex | 304 non-null | int64 |
| 2 | cp | 304 non-null | int64 |
| 3 | trestbps | 304 non-null | int64 |
| 4 | chol | 304 non-null | int64 |
| 5 | fbs | 304 non-null | int64 |
| 6 | restecg | 304 non-null | int64 |
| 7 | thalach | 304 non-null | int64 |
| 8 | exang | 304 non-null | int64 |
| 9 | oldpeak | 304 non-null | float64 |
| 10 | slope | 304 non-null | int64 |
| 11 | ca | 304 non-null | int64 |
| 12 | thal | 304 non-null | int64 |

There were some interesting relationships between the features in the dataset, which I have shown using visualizations.

The target variable is 'target'. It has two possible values: 0 means the person does not have heart disease, and 1 means the person has heart disease.

## Exploratory Data Analysis
I came up with some interesting questions about the dataset and tried to answer them through visualization during the EDA process.

## Data Preprocessing
Steps performed:
1. Removing missing values
2. Feature scaling 
3. Splitting the data into training and testing sets
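
The first step can be sketched with pandas as follows. The frame below is a tiny hypothetical example: the column names follow the dataset, but the values are made up, and `demo` and `cleaned` are illustrative names rather than variables from the notebook.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical frame; values are invented for illustration only.
demo = pd.DataFrame({
    "age": [63, 37, np.nan, 56],
    "chol": [233, 250, 204, np.nan],
    "target": [1, 1, 0, 0],
})

# Count missing values per column before deciding how to handle them.
print(demo.isna().sum())

# Drop every row that has at least one missing value.
cleaned = demo.dropna()
print(len(demo), "rows before,", len(cleaned), "rows after")
```

Dropping rows is the simplest option; imputing missing values would be another choice if too many rows were lost this way.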

## Models used:
Logistic Regression is easy to implement and efficient to train.

(Logistic Regression is a better fit than Linear Regression here because the target variable in the dataset is categorical, not continuous, so the dataset is not suitable for linear regression.)
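
To illustrate why logistic regression suits a 0/1 target, here is a minimal sketch of how it turns a weighted sum of features into a class label. The weights, bias and feature values are hypothetical and are not the fitted coefficients from this project; `predict_label` is an illustrative helper, not a scikit-learn function.

```python
import math


def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_label(features, weights, bias, threshold=0.5):
    """Return (probability of class 1, predicted label) for one sample."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    probability = sigmoid(z)
    return probability, int(probability >= threshold)


# Hypothetical scaled features and weights, for illustration only.
probability, label = predict_label(
    features=[0.8, 0.2], weights=[2.0, -1.0], bias=-0.5
)
print(f"P(heart disease) = {probability:.2f}, predicted class = {label}")
```

A linear regression would output the raw weighted sum, which is unbounded; the sigmoid keeps the output in the 0 to 1 range so it can be read as a probability and thresholded.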


## Conclusion: 
I found that adding features to the data can help overcome underfitting, while getting more data (i.e., increasing the number of records) can help overcome overfitting.
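
One way to check for overfitting or underfitting is to compare training and validation accuracy as the number of training rows grows. The sketch below was not run as part of this project; it assumes scikit-learn's `learning_curve` with 5-fold cross-validation and uses synthetic data as a stand-in for `Heart.csv`. The names `X_lc`, `y_lc`, `sizes`, `train_scores` and `val_scores` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic stand-in for Heart.csv (assumption): 304 rows, 13 numeric features.
X_lc, y_lc = make_classification(n_samples=304, n_features=13, random_state=0)

# Train on 20%, 40%, ..., 100% of each training fold and score both sides.
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000),
    X_lc,
    y_lc,
    train_sizes=np.linspace(0.2, 1.0, 5),
    cv=5,
    scoring="accuracy",
)

# Average over the folds; a large train/validation gap suggests overfitting,
# while two low scores that sit close together suggest underfitting.
for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"{int(n):4d} rows: train={tr:.2f} validation={va:.2f} gap={tr - va:+.2f}")
```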





