## Learning Outcomes
- Exploratory data analysis & preparing the data for model building. 
- Machine Learning - Supervised Learning Classification
  - Logistic Regression
  - Naive Bayes Classifier
  - KNN Classifier
  - Decision Tree Classifier
  - Random Forest Classifier
  - Ensemble methods
- Training and making predictions using different classification models.
- Model evaluation

## Objective: 
- The classification goal is to predict whether a person has heart disease, based on the risk factors recorded in the dataset.

## Context:
- Heart disease is one of the leading causes of death for people of most races in the US. About half of all Americans (47%) have at least 1 of 3 key risk factors for heart disease: high blood pressure, high cholesterol, and smoking.
- Detecting and preventing the factors that have the greatest impact on heart disease is very important in healthcare. Machine learning methods can detect patterns in the data and predict whether a patient has heart disease.

## Dataset Information

#### Source: https://www.kaggle.com/datasets/kamilpytlak/personal-key-indicators-of-heart-disease?datasetId=1936563&sortBy=voteCount
The dataset originally comes from the CDC and is a major part of the Behavioral Risk Factor Surveillance System (BRFSS), which conducts annual telephone surveys to gather data on the health status of U.S. residents.

This dataset consists of eighteen columns:
- HeartDisease: Respondents that have ever reported having coronary heart disease (CHD) or myocardial infarction (MI)
- BMI: Body Mass Index (BMI)
- Smoking: Have you smoked at least 100 cigarettes in your entire life?
- AlcoholDrinking: Heavy drinkers (adult men having more than 14 drinks per week and adult women having more than 7 drinks per week)
- Stroke: Ever had a stroke?
- PhysicalHealth: Thinking about your physical health, which includes physical illness and injury, for how many days during the past 30 days was your physical health not good?
- MentalHealth: for how many days during the past 30 days was your mental health not good?
- DiffWalking: Do you have serious difficulty walking or climbing stairs?
- Sex: male or female?
- AgeCategory: Fourteen-level age category
- Race: Imputed race/ethnicity value
- Diabetic: Ever told you had diabetes?
- PhysicalActivity: Adults who reported doing physical activity or exercise during the past 30 days other than their regular job
- GenHealth: Would you say that in general your health is good, fine or excellent?
- SleepTime: On average, how many hours of sleep do you get in a 24-hour period?
- Asthma: Ever told you had asthma?
- KidneyDisease: Not including kidney stones, bladder infection or incontinence, were you ever told you had kidney disease?
- SkinCancer: Ever had skin cancer?

### 1. Importing Libraries

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, StackingClassifier

### 2. Load the dataset and display a sample of five rows of the data frame.

In [None]:
df = pd.read_csv('heart_2020_cleaned.csv')
df.sample(5)

### 3. Check the shape of the data (number of rows and columns). Check the general information about the dataframe using the .info() method.

In [None]:
print(f"The dataset contains {df.shape[0]} rows and {df.shape[1]} columns.")
df.info()

### 4. Check the statistical summary of the dataset and write your inferences.

In [None]:
df.describe().T
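
By default, `describe()` summarizes only the numeric columns, so the many Yes/No columns in this dataset are left out. A minimal sketch of how the non-numeric columns could be summarized as well, using a small made-up frame named `toy` (hypothetical values, not loaded from the CSV):

```python
import pandas as pd

# Small illustrative frame; the values are made up for this sketch
toy = pd.DataFrame({
    "BMI": [16.6, 20.34, 26.58, 24.21],
    "Smoking": ["Yes", "No", "Yes", "No"],
    "Sex": ["Female", "Female", "Male", "Female"],
})

# Default behaviour: numeric columns only (count, mean, std, quartiles)
numeric_summary = toy.describe().T

# Everything that is not numeric: count, unique, top (mode) and freq
categorical_summary = toy.describe(exclude="number").T

print(numeric_summary)
print(categorical_summary)
```

On the real data the same two calls would be made on `df`; the `top` and `freq` columns give a quick view of the dominant category in each feature.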

### 5. Check the percentage of missing values in each column of the data frame. Drop the missing values if there are any.

In [None]:
missing_values = df.isnull().mean() * 100
print(missing_values)
df = df.dropna()

### 6. Check if there are any duplicate rows. If there are, drop them and check the shape of the dataframe after dropping duplicates.

In [None]:
duplicates = df.duplicated().sum()
print(f"Number of duplicate rows: {duplicates}")

df = df.drop_duplicates()
print(f"Shape after dropping duplicates: {df.shape}")

### 7. Check the distribution of the target variable (i.e. 'HeartDisease') and write your observations.

In [None]:
df['HeartDisease'].value_counts().plot(kind='pie',autopct='%1.0f%%')
plt.title('Distribution of HeartDisease')
plt.show()

### 8. Visualize the distribution of the target column 'HeartDisease' with respect to various categorical features and write your observations.

In [None]:
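# Categorical predictors only; the target itself is excluded from the feature list
categorical_features = df.select_dtypes(include=[object]).columns.drop('HeartDisease', errors='ignore')
plt.figure(figsize=(30, 25))

# One count plot per categorical feature, split by the target class
for i, feature in enumerate(categorical_features, start=1):
    plt.subplot(6, 3, i)
    sns.countplot(x=feature, hue='HeartDisease', data=df)
    plt.title(f'Distribution of Heart Disease by {feature}')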

plt.tight_layout()
plt.show()
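
Because the target is heavily imbalanced, raw counts can make it hard to compare groups by eye. A minimal sketch of a complementary view: the share of each target class within each category, computed with `pd.crosstab(..., normalize="index")`. The frame `toy` and its values are made up for illustration.

```python
import pandas as pd

# Made-up example: 4 smokers (1 with heart disease), 4 non-smokers (none)
toy = pd.DataFrame({
    "Smoking": ["Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No"],
    "HeartDisease": ["Yes", "No", "No", "No", "No", "No", "No", "No"],
})

# normalize="index" makes every row sum to 1, so each cell is the
# share of that target class within the category
rates = pd.crosstab(toy["Smoking"], toy["HeartDisease"], normalize="index")
print(rates)
```

On the real data, the same call could be run inside the loop over `categorical_features` to get a rate table next to each plot.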


### 9. Check the unique categories in the column 'Diabetic'. Replace 'Yes (during pregnancy)' with 'Yes' and 'No, borderline diabetes' with 'No'.

In [21]:
df['Diabetic'] = df['Diabetic'].replace({'Yes (during pregnancy)': 'Yes', 'No, borderline diabetes': 'No'})
df['Diabetic'].value_counts()

Diabetic
No     258572
Yes     43145
Name: count, dtype: int64

### 10. For the target column 'HeartDisease', replace 'No' with 0 and 'Yes' with 1.

In [23]:
# map() converts the labels directly and avoids the downcasting warning that replace() can raise
df['HeartDisease'] = df['HeartDisease'].map({'No': 0, 'Yes': 1}).astype(int)
df['HeartDisease'].value_counts()

HeartDisease
0    274456
1     27261
Name: count, dtype: int64
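
These counts already fix a reference point for every accuracy score reported later: a model that always predicts 0 would be right for every negative row. A small worked calculation using the counts printed above (computed on the full data, so the figure for the test split will be close but not identical):

```python
# Class counts copied from the value_counts() output above
n_no, n_yes = 274456, 27261
total = n_no + n_yes

positive_share = n_yes / total      # fraction of rows with heart disease
baseline_accuracy = n_no / total    # accuracy of always predicting 0

print(f"Total rows: {total}")                                        # 301717
print(f"Positive class share: {positive_share:.1%}")                 # 9.0%
print(f"Majority-class baseline accuracy: {baseline_accuracy:.1%}")  # 91.0%
```

Any classifier whose accuracy is near 91% is therefore doing little better than ignoring the features altogether.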

### 11. Label Encode the columns "AgeCategory", "Race", and "GenHealth". Encode the rest of the columns using dummy encoding approach.

In [24]:
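from sklearn.preprocessing import LabelEncoder

# LabelEncoder assigns integer codes in alphabetical order of the category names
label_enc = LabelEncoder()
for col in ['AgeCategory', 'Race', 'GenHealth']:
    df[col] = label_enc.fit_transform(df[col])

# Dummy-encode the remaining categorical columns, dropping the first level of each
df_encoded = pd.get_dummies(df, drop_first=True)
df.head(4)  # shows df after label encoding; the dummy columns are in df_encoded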

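|   | HeartDisease | BMI | Smoking | AlcoholDrinking | Stroke | PhysicalHealth | MentalHealth | DiffWalking | Sex | AgeCategory | Race | Diabetic | PhysicalActivity | GenHealth | SleepTime | Asthma | KidneyDisease | SkinCancer |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 16.6 | Yes | No | No | 3.0 | 30.0 | No | Female | 7 | 5 | Yes | Yes | 4 | 5.0 | Yes | No | Yes |
| 1 | 0 | 20.34 | No | No | Yes | 0.0 | 0.0 | No | Female | 12 | 5 | No | Yes | 4 | 7.0 | No | No | No |
| 2 | 0 | 26.58 | Yes | No | No | 20.0 | 30.0 | No | Male | 9 | 5 | Yes | Yes | 1 | 8.0 | Yes | No | No |
| 3 | 0 | 24.21 | No | No | No | 0.0 | 0.0 | No | Female | 11 | 5 | No | No | 2 | 6.0 | No | No | Yes |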
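
`LabelEncoder` numbers categories alphabetically, which does not follow the natural worst-to-best order of an ordinal column such as `GenHealth`. The sketch below contrasts the two encodings. The category names and the `health_order` mapping are assumptions for illustration; check `df['GenHealth'].unique()` on the raw data before relying on them.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Assumed GenHealth categories (verify against the raw column)
values = ["Very good", "Fair", "Good", "Poor", "Excellent"]

# LabelEncoder: codes follow the alphabetical order of the class names
le = LabelEncoder()
le.fit(values)
alphabetical_codes = {str(name): code for code, name in enumerate(le.classes_)}

# Hypothetical explicit mapping that keeps the worst-to-best order
health_order = {"Poor": 0, "Fair": 1, "Good": 2, "Very good": 3, "Excellent": 4}
ordinal_codes = pd.Series(values).map(health_order)

print(alphabetical_codes)      # 'Excellent' gets 0 and 'Very good' gets 4
print(ordinal_codes.tolist())  # [3, 1, 2, 0, 4]
```

Tree-based models are largely indifferent to the choice, but linear and distance-based models (logistic regression, KNN) treat the codes as magnitudes, so an order-preserving mapping may be worth trying.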


### 12. Store the target column (i.e. 'HeartDisease') in the y variable and the rest of the columns in the X variable.

In [25]:
X = df_encoded.drop('HeartDisease', axis=1)
y = df_encoded['HeartDisease']

### 13. Split the dataset into two parts (i.e. 70% train and 30% test) and print the shape of the train and test data

In [26]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

print(f"Train set shape: {X_train.shape}, Test set shape: {X_test.shape}")

Train set shape: (211201, 17), Test set shape: (90516, 17)
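
With roughly 9% positives, a purely random split can leave the train and test sets with slightly different class ratios. A minimal sketch of a stratified split, which keeps the ratio the same in both parts. It uses synthetic arrays (`X_demo`, `y_demo`, both made up) rather than the real `X` and `y`:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: 100 positives out of 1000 rows (illustrative only)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 100 + [0] * 900)

# stratify=y_demo preserves the class proportions in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

print(f"Train shape: {X_tr.shape}, positive share: {y_tr.mean():.3f}")
print(f"Test shape:  {X_te.shape}, positive share: {y_te.mean():.3f}")
```

On the notebook's data the equivalent change would be adding `stratify=y` to the existing `train_test_split` call; the split shapes stay the same.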


### 14. Standardize the numerical columns using the StandardScaler approach for both train and test data.

In [27]:
from sklearn.preprocessing import StandardScaler

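scaler = StandardScaler()

# Fit on the training split only, then apply the same transformation to the test split.
# Note: every column is scaled here, including the label-encoded and dummy columns.
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
X_train.head(4)  # unscaled frame; the scaled values are in the X_train_scaled array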

|   | BMI | PhysicalHealth | MentalHealth | AgeCategory | Race | GenHealth | SleepTime | Smoking_Yes | AlcoholDrinking_Yes | Stroke_Yes | DiffWalking_Yes | Sex_Male | Diabetic_Yes | PhysicalActivity_Yes | Asthma_Yes | KidneyDisease_Yes | SkinCancer_Yes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 176022 | 21.11 | 0.0 | 1.0 | 5 | 5 | 0 | 8.0 | True | False | False | False | False | False | True | False | False | False |
| 209180 | 28.7 | 0.0 | 0.0 | 11 | 5 | 2 | 6.0 | False | False | False | False | True | True | True | False | False | False |
| 167240 | 28.7 | 0.0 | 0.0 | 9 | 5 | 0 | 8.0 | False | False | False | False | False | True | True | False | False | False |
| 8444 | 25.77 | 0.0 | 5.0 | 9 | 5 | 4 | 8.0 | True | False | False | False | True | False | True | False | True | True |
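
The step asks for the numerical columns to be standardized. One way to restrict scaling to those columns and leave the dummy variables as they are is sketched below. The list `num_cols` is an assumption based on the dataset description, and `train_demo` / `test_demo` are small made-up stand-ins for `X_train` / `X_test`.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Assumed continuous columns; adjust after inspecting the real data
num_cols = ["BMI", "PhysicalHealth", "MentalHealth", "SleepTime"]

# Made-up stand-ins for X_train and X_test
train_demo = pd.DataFrame({
    "BMI": [21.0, 29.0, 27.0, 25.0],
    "PhysicalHealth": [0.0, 3.0, 0.0, 5.0],
    "MentalHealth": [1.0, 0.0, 0.0, 5.0],
    "SleepTime": [8.0, 6.0, 8.0, 7.0],
    "Smoking_Yes": [True, False, False, True],
})
test_demo = pd.DataFrame({
    "BMI": [24.0, 31.0],
    "PhysicalHealth": [2.0, 0.0],
    "MentalHealth": [0.0, 3.0],
    "SleepTime": [7.0, 5.0],
    "Smoking_Yes": [False, True],
})

scaler_demo = StandardScaler()
train_scaled = train_demo.copy()
test_scaled = test_demo.copy()

# Fit on train only, then reuse the same means and standard deviations on test
train_scaled[num_cols] = scaler_demo.fit_transform(train_demo[num_cols])
test_scaled[num_cols] = scaler_demo.transform(test_demo[num_cols])

print(train_scaled)
print(test_scaled)
```

Working on copies keeps the original frames intact, and the result stays a DataFrame with named columns, which makes later inspection easier than a bare NumPy array.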


### 15. Write a function that:
- i) Takes the model and data as inputs.
- ii) Fits the model on the train data.
- iii) Makes predictions on the test set.
- iv) Returns the accuracy score.

In [28]:
from sklearn.metrics import accuracy_score

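def evaluate_model(model, X_train, y_train, X_test, y_test):
    """Fit the model on the train data and return its accuracy on the test data."""
    # Train on the training split
    model.fit(X_train, y_train)

    # Predict on the held-out test split
    y_pred = model.predict(X_test)

    return accuracy_score(y_test, y_pred)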
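
Accuracy alone can look strong on an imbalanced target even when few positive cases are found. A minimal sketch of a variant that also reports precision, recall, and F1 for the positive class. The function name `evaluate_model_detailed` is hypothetical, and the demo data is synthetic.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def evaluate_model_detailed(model, X_train, y_train, X_test, y_test):
    """Hypothetical variant of evaluate_model that returns several metrics."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        # zero_division=0 avoids warnings if no positives are predicted
        "precision": precision_score(y_test, y_pred, zero_division=0),
        "recall": recall_score(y_test, y_pred, zero_division=0),
        "f1": f1_score(y_test, y_pred, zero_division=0),
    }

# Synthetic imbalanced data (illustrative only, not the heart disease data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(600, 4))
y_demo = (X_demo[:, 0] + rng.normal(scale=0.5, size=600) > 1.3).astype(int)

scores = evaluate_model_detailed(
    LogisticRegression(), X_demo[:400], y_demo[:400], X_demo[400:], y_demo[400:]
)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

Recall is the share of true heart disease cases the model catches, which is often the more relevant number in a screening context.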


### 16. Use the function to train Logistic Regression, KNN, Naive Bayes, Decision Tree, Random Forest, AdaBoost, Gradient Boosting, and Stacking classifier models. Make predictions on the test data, evaluate and compare the models, and write your conclusions along with the steps that could be taken in future to improve the accuracy of the model.

In [None]:
# Base models; random_state is fixed where supported so that results are reproducible
lr = LogisticRegression()
nb = GaussianNB()
knn = KNeighborsClassifier()
dt = DecisionTreeClassifier(random_state=42)
rf = RandomForestClassifier(random_state=42)
adb = AdaBoostClassifier(random_state=42)
gb = GradientBoostingClassifier(random_state=42)

# StackingClassifier clones the estimators it receives, so reusing rf here is safe
estimators = [('rf', rf), ('knn', knn), ('gb', gb), ('adb', adb)]
sc = StackingClassifier(estimators=estimators, final_estimator=rf)

models = {'Logistic Regression': lr, 'Naive Bayes': nb, 'KNN': knn, 'Decision tree': dt,
          'Random Forest': rf, 'Ada Boost': adb, 'Gradient Boost': gb, 'Stacking': sc}

result = pd.DataFrame(columns=['Accuracy'])

for model_name, model in models.items():
    # evaluate_model (step 15) fits on the train split and scores on the test split
    accuracy = evaluate_model(model, X_train_scaled, y_train, X_test_scaled, y_test)
    result.loc[model_name] = [accuracy]

result
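
As one possible future step for an imbalanced target like this one, class weighting can be tried. The sketch below compares recall for a default and a `class_weight="balanced"` logistic regression on synthetic data; the arrays and the `recalls` dictionary are made up for illustration and say nothing about results on the real dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic data with a small positive class (illustrative only)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(2000, 3))
y_demo = (X_demo[:, 0] + rng.normal(scale=1.0, size=2000) > 2.0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

recalls = {}
for label, weight in [("default", None), ("balanced", "balanced")]:
    # class_weight="balanced" reweights samples inversely to class frequency
    clf = LogisticRegression(class_weight=weight)
    clf.fit(X_tr, y_tr)
    recalls[label] = recall_score(y_te, clf.predict(X_te), zero_division=0)

print(recalls)
```

Weighting typically raises recall on the minority class at the cost of precision and overall accuracy, so the comparison should be judged on more than one metric.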




### Conclusion

----
## Happy Learning:)
----