### **Load the data into Google Colab**

In [None]:
from google.colab import files
uploaded = files.upload()

Saving heart.csv to heart.csv


### **1. Importing the required packages**

In [None]:
import pandas as pd
import numpy as np

#Machine Learning related packages
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

### **2. Reading and Exploring the data**

In [None]:
heart = pd.read_csv('heart.csv')

In [None]:
heart.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [None]:
heart.shape

(303, 14)

In [None]:
heart.dtypes  #print the datatypes of values in each column

Unnamed: 0,0
age,int64
sex,int64
cp,int64
trestbps,int64
chol,int64
fbs,int64
restecg,int64
thalach,int64
exang,int64
oldpeak,float64


In [None]:
heart.isnull().sum()  #print the total number of missing values column-wise

Unnamed: 0,0
age,0
sex,0
cp,0
trestbps,0
chol,0
fbs,0
restecg,0
thalach,0
exang,0
oldpeak,0


In [None]:
heart.isnull().sum(axis = 1)  #print the total number of missing values row-wise

Unnamed: 0,0
0,0
1,0
2,0
3,0
4,0
...,...
298,0
299,0
300,0
301,0


In [None]:
heart.duplicated().sum()  #print the total number of rows which are duplicates

1

In [None]:
heart[heart.duplicated()]  #print all the duplicate rows

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
164,38,1,2,138,175,0,1,173,0,0.0,2,4,2,1


In [None]:
heart[heart.duplicated(subset = ['trestbps', 'chol'])]  #print rows whose (trestbps, chol) pair repeats an earlier row

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
19,69,0,3,140,239,0,1,151,0,1.8,2,2,2,1
72,29,1,1,130,204,0,0,202,0,0.0,2,0,2,1
113,43,1,0,110,211,0,1,161,0,0.0,2,0,3,1
140,51,0,2,120,295,0,0,157,0,0.6,2,0,2,1
155,58,0,0,130,197,0,1,131,0,0.6,1,0,2,1
160,56,1,1,120,240,0,1,169,0,0.0,0,0,2,1
170,56,1,2,130,256,1,0,142,1,0.6,1,1,1,0
178,43,1,0,120,177,0,0,120,1,2.5,1,0,3,0
184,50,1,0,150,243,0,0,128,0,2.6,1,0,3,0
186,60,1,0,130,253,0,1,144,1,1.4,2,1,3,0


In [None]:
heart.drop_duplicates(inplace = True)

#### **Outlier Detection and Removal**

In [None]:
heart.describe()  #statistical description of the data

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
count,302.0,302.0,302.0,302.0,302.0,302.0,302.0,302.0,302.0,302.0,302.0,302.0,302.0,302.0
mean,54.42053,0.682119,0.963576,131.602649,246.5,0.149007,0.52649,149.569536,0.327815,1.043046,1.397351,0.718543,2.31457,0.543046
std,9.04797,0.466426,1.032044,17.563394,51.753489,0.356686,0.526027,22.903527,0.470196,1.161452,0.616274,1.006748,0.613026,0.49897
min,29.0,0.0,0.0,94.0,126.0,0.0,0.0,71.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,48.0,0.0,0.0,120.0,211.0,0.0,0.0,133.25,0.0,0.0,1.0,0.0,2.0,0.0
50%,55.5,1.0,1.0,130.0,240.5,0.0,1.0,152.5,0.0,0.8,1.0,0.0,2.0,1.0
75%,61.0,1.0,2.0,140.0,274.75,0.0,1.0,166.0,1.0,1.6,2.0,1.0,3.0,1.0
max,77.0,1.0,3.0,200.0,564.0,1.0,2.0,202.0,1.0,6.2,2.0,4.0,3.0,1.0


**Observation**: The `age`, `trestbps`, `chol`, `thalach`, and `oldpeak` columns may contain outliers. For example, the maximum `chol` value of 564 is far above the 75th percentile of 274.75. Boxplots of these columns would help confirm this.
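
One common way to check such a suspicion numerically is the 1.5 × IQR rule, which is also the rule boxplot whiskers usually follow. The sketch below is not part of the original notebook: `toy` and `iqr_outliers` are hypothetical names, and the values are made up to resemble the `chol` column.

```python
import pandas as pd

# Toy stand-in for one column of `heart`; these values are illustrative only.
toy = pd.DataFrame({"chol": [211, 240, 246, 275, 230, 260, 564]})

def iqr_outliers(series, k=1.5):
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

mask = iqr_outliers(toy["chol"])
flagged = toy[mask]   # rows the rule marks as outliers
print(flagged)
```

On the notebook's data, the same helper could be applied column by column (for example `iqr_outliers(heart['chol'])`), and `heart[['age', 'trestbps', 'chol', 'thalach', 'oldpeak']].boxplot()` would draw the boxplots the observation mentions, assuming matplotlib is installed.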

### **Machine Learning Process**

In [None]:
X = heart.drop(columns = 'target')  #store all the input columns
y = heart['target']   #store the output column

In [None]:
#split the data into training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 100)

#### **Standardization(Scaling) of the data**

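- **Note**: Always perform the scaling after splitting the data, so that the scaler is fitted on the training set only and no information from the test set leaks into training.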

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

#### **Apply the Logistic Regression Algorithm on the data**

In [None]:
log_reg = LogisticRegression()
log_reg.fit(X_train_scaled, y_train)

In [None]:
log_reg.coef_  #coefficients m1 to m13, one per input column

array([[ 0.02651805, -0.76115681,  0.78058821, -0.2162428 , -0.45756905,
        -0.11290994,  0.0609823 ,  0.44218324, -0.33736004, -0.5613874 ,
         0.37441386, -0.68913694, -0.62006129]])

In [None]:
log_reg.intercept_  #intercept (c value)

array([0.30759926])
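
To make the roles of `m1` to `m13` and `c` concrete, here is a minimal sketch of how a fitted binary logistic regression turns one scaled row into a probability: it computes `z = c + m1*x1 + ... + m13*x13` and passes `z` through the sigmoid function. The coefficients are copied from the outputs above; `predict_proba_positive` and the two example patients are hypothetical and exist only for illustration.

```python
import math

# Coefficients (m1..m13) and intercept (c) copied from the outputs above.
coef = [0.02651805, -0.76115681, 0.78058821, -0.2162428, -0.45756905,
        -0.11290994, 0.0609823, 0.44218324, -0.33736004, -0.5613874,
        0.37441386, -0.68913694, -0.62006129]
intercept = 0.30759926

def predict_proba_positive(x_scaled):
    """Probability of target == 1 for one row of already-scaled features."""
    z = intercept + sum(m * x for m, x in zip(coef, x_scaled))
    return 1.0 / (1.0 + math.exp(-z))   # sigmoid

# Hypothetical patient whose features all equal the training mean,
# which StandardScaler maps to 0 in every column.
average_patient = [0.0] * len(coef)
p_average = predict_proba_positive(average_patient)

# Same patient, but with `cp` (third column) one standard deviation higher.
higher_cp = list(average_patient)
higher_cp[2] = 1.0
p_higher_cp = predict_proba_positive(higher_cp)

print(round(p_average, 3), round(p_higher_cp, 3))
```

Because the coefficient of `cp` is positive, raising it increases the predicted probability, while columns with negative coefficients push it down. With the default settings, `predict` returns class 1 when this probability is above 0.5.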

In [None]:
y_pred = log_reg.predict(X_test_scaled)

In [None]:
accuracy_score(y_test, y_pred)

0.819672131147541

In [None]:
from sklearn.metrics import precision_score, recall_score

In [None]:
precision_score(y_test, y_pred)

0.7575757575757576

In [None]:
recall_score(y_test, y_pred)

0.8928571428571429

### Performance Metrics used in Classification Models

1. **`Confusion Matrix`** : A confusion matrix is a table that summarizes the performance of a classification model by showing the true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN).

    - **`When to Use`**: The confusion matrix is useful for understanding the types of errors the model is making and for calculating other metrics like precision, recall, and specificity.
    
![confusion_matrix](https://builtin.com/sites/www.builtin.com/files/styles/ckeditor_optimize/public/inline-images/8_confusion-matrix-python.jpg)
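
As a small illustration of what the four cells count, the sketch below tallies them by hand for a made-up set of labels. `confusion_counts`, `toy_true`, and `toy_pred` are hypothetical names and values, not notebook output.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels where 1 is the positive class."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1          # disease present, predicted present
        elif actual == 0 and predicted == 0:
            tn += 1          # disease absent, predicted absent
        elif actual == 0 and predicted == 1:
            fp += 1          # disease absent, predicted present
        else:
            fn += 1          # disease present, predicted absent
    return tp, tn, fp, fn

# Made-up labels for illustration only.
toy_true = [1, 1, 1, 0, 0, 0, 1, 0]
toy_pred = [1, 0, 1, 0, 1, 0, 1, 0]

tp, tn, fp, fn = confusion_counts(toy_true, toy_pred)
print(tp, tn, fp, fn)
```

In the notebook itself, `sklearn.metrics.confusion_matrix(y_test, y_pred)` would return the same four counts arranged as `[[TN, FP], [FN, TP]]` for labels 0 and 1.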

---

2. **`Precision Score`** : Precision is the fraction of predicted positives that are actually positive, TP / (TP + FP). It is important when the cost of false positives is high, so false positives must be reduced.
    - `For example` :
        - In spam detection, high precision ensures that most of the emails marked as spam really are spam, so genuine emails are rarely marked as spam, which could be a serious problem.
        - A financial fraud detection system might prioritize high precision, minimizing false positives (wrongly declined transactions) to avoid inconveniencing customers.
        - While classifying whether or not a bank customer is a loan defaulter, it is desirable to have high precision since the bank wouldn’t want to lose customers who were denied a loan based on the model’s prediction that they would be defaulters.

---

3. **`Recall Score/Sensitivity`** : Recall is the fraction of actual positives that the model correctly identifies, TP / (TP + FN). It is crucial when the cost of false negatives is high and we need to reduce false negatives as much as possible.
    - `For Example`:
        - In medical diagnosis, high recall is crucial since it ensures that most of the actual positive cases (e.g., diseases) are identified. A false negative would mean that we classified a sick patient as healthy, which could be fatal.
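
The precision and recall reported earlier can be reproduced by hand. The counts below are not notebook output: they were inferred as the counts consistent with the three reported scores, assuming the test set holds 61 rows (20% of 302, rounded up). Treat them as a worked example rather than the model's actual confusion matrix.

```python
# Counts inferred from the reported scores (an assumption, see above).
tp, fp, fn, tn = 25, 8, 3, 25

precision = tp / (tp + fp)   # of the 33 predicted positives, 25 are correct
recall = tp / (tp + fn)      # of the 28 actual positives, 25 are found
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(precision, recall, accuracy)
```

Read this way, the model misses 3 patients who have heart disease (false negatives) and raises 8 false alarms (false positives).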

---

4. **`Accuracy Score`** : Accuracy is suitable when the classes in the dataset are balanced. It provides a straightforward measure of overall correctness. However, it can be misleading in cases of class imbalance, where one class significantly outnumbers the other.

    - Using accuracy as the defining metric for our model makes sense intuitively, but more often than not it is advisable to use precision and recall too. There can be situations where accuracy is very high but precision or recall is low. Ideally, we would like our model to avoid any case where the patient has heart disease but the model classifies them as not having it, i.e., we aim for high recall.

    - On the other hand, when a patient does not have heart disease but our model predicts that they do, we would end up treating a patient with no heart disease. This is especially harmful when the input parameters actually indicate a different ailment and we treat the patient for a heart ailment instead.

    - Although we aim for both high precision and high recall, maximizing both at the same time is usually not possible. For example, if we change the model to one with higher recall, we might detect all the patients who actually have heart disease, but we might also end up treating many patients who don’t suffer from it.

    - Similarly, if we aim for high precision to avoid giving any wrong and unnecessary treatment, many patients who actually have heart disease may go without treatment.
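
The trade-off described above can be seen by moving the decision threshold on predicted probabilities. The following is a minimal sketch on made-up scores: `precision_recall_at`, `toy_true`, and `toy_scores` are hypothetical and not taken from the notebook.

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when scores >= threshold are predicted positive."""
    tp = fp = fn = 0
    for actual, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and actual == 1:
            tp += 1
        elif predicted == 1 and actual == 0:
            fp += 1
        elif predicted == 0 and actual == 1:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Made-up labels and predicted probabilities for illustration only.
toy_true = [0, 0, 0, 1, 0, 1, 1, 1]
toy_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.9]

low = precision_recall_at(toy_true, toy_scores, 0.25)    # lenient threshold
high = precision_recall_at(toy_true, toy_scores, 0.55)   # strict threshold
print("threshold 0.25:", low)
print("threshold 0.55:", high)
```

The lenient threshold catches every positive but admits false alarms, while the strict threshold removes the false alarms at the cost of missing a positive. In the notebook, the scores could come from `log_reg.predict_proba(X_test_scaled)[:, 1]`.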

---

5. **`F1 Score`** : We saw above that there is a trade-off between recall and precision: when we try to increase one, the other tends to decrease. But sometimes both scores are important. The F1 score is the harmonic mean of precision and recall, so it is high only when both are high.
    - `Ex`: If the doctor informs us that the patients who were incorrectly classified as suffering from heart disease are equally important, since their symptoms could be indicative of some other ailment, then we would aim for not only a high recall but a high precision as well.
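
As a worked example, F1 can be computed from the precision and recall reported earlier using `F1 = 2 * P * R / (P + R)`. The helper name `f1_from` is hypothetical, and the second call uses made-up values to show how the harmonic mean punishes an imbalance between the two scores.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall as reported by the notebook above.
f1_model = f1_from(0.7575757575757576, 0.8928571428571429)

# Made-up lopsided case: perfect precision, very poor recall.
f1_lopsided = f1_from(1.0, 0.1)
arithmetic_mean = (1.0 + 0.1) / 2

print(round(f1_model, 4), round(f1_lopsided, 4), arithmetic_mean)
```

In the lopsided case the arithmetic mean still looks moderate, while F1 stays close to the weaker of the two scores. In the notebook, `sklearn.metrics.f1_score(y_test, y_pred)` would compute the same quantity directly from the labels.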
    
---


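6. **`ROC_AUC Score`** : The area under the ROC curve, which plots the true positive rate against the false positive rate across all classification thresholds. A score of 1.0 means the model ranks every positive above every negative, while 0.5 is no better than random guessing. It is commonly used with imbalanced data.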
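
One way to build intuition for this score is its ranking interpretation: ROC AUC equals the probability that a randomly chosen positive receives a higher score than a randomly chosen negative, with ties counted as half. The sketch below computes it that way on made-up scores; `roc_auc_pairwise` is a hypothetical helper, not a library function.

```python
def roc_auc_pairwise(y_true, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # tie counts as half
    return wins / (len(positives) * len(negatives))

# Made-up labels and predicted probabilities for illustration only.
toy_true = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc_pairwise(toy_true, toy_scores)
print(auc)
```

Three of the four positive/negative pairs are ordered correctly here. In the notebook, `sklearn.metrics.roc_auc_score(y_test, log_reg.predict_proba(X_test_scaled)[:, 1])` would compute the score from predicted probabilities rather than hard class labels.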

---

7. **`Specificity`** : Specificity, also known as the true negative rate, is a performance metric used in classification models, particularly in binary classification. It measures the proportion of actual negative cases that are correctly identified by the model, TN / (TN + FP). This metric tells us how good the model is at identifying negative instances.
    - `Ex` : In medical testing, high specificity is important when a false positive result could lead to unnecessary stress, further invasive testing, or treatment. For example, in cancer screening, a test with high specificity ensures that healthy individuals are rarely incorrectly diagnosed with cancer, avoiding unnecessary biopsies or treatments.
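
Specificity can be computed the same way as recall, but over the negative class. A minimal sketch on made-up labels follows; the `specificity` helper and the toy values are hypothetical.

```python
def specificity(y_true, y_pred):
    """True negative rate: TN / (TN + FP), with 0 as the negative class."""
    tn = sum(1 for a, p in zip(y_true, y_pred) if a == 0 and p == 0)
    fp = sum(1 for a, p in zip(y_true, y_pred) if a == 0 and p == 1)
    return tn / (tn + fp) if (tn + fp) else 0.0

# Made-up labels for illustration only: 5 healthy patients, 3 with disease.
toy_true = [0, 0, 0, 0, 0, 1, 1, 1]
toy_pred = [0, 0, 0, 1, 1, 1, 1, 0]

spec = specificity(toy_true, toy_pred)
print(spec)
```

Here 3 of the 5 healthy patients are correctly cleared. In the notebook, `recall_score(y_test, y_pred, pos_label=0)` should give the same quantity, since specificity is the recall of the negative class.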