## **Problem 1**
Define a Python function called "evaluate", which measures how good a classifier is.

This function has 3 inputs: model, X (features), and y (target variable).

This function cross-validates the model with shuffled 10-fold cross-validation, scored with the F1 metric.

This function returns the average F1 score across the 10 iterations.

In [None]:
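import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score

def evaluate(clf, X, y):
  # An integer cv does not shuffle, so build the shuffled 10-fold splitter explicitly.
  # random_state=42 is an arbitrary choice made here for reproducibility.
  cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
  # f1_macro averages the per-class F1 scores with equal weight.
  scores = cross_val_score(clf, X, y, cv=cv, scoring='f1_macro')
  return np.mean(scores)  # average F1 across the 10 folds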
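
The problem statement asks for shuffled folds. In scikit-learn, an integer `cv` gives unshuffled stratified folds for classifiers, so shuffling needs an explicit splitter. The sketch below checks such a function on synthetic data; the name `evaluate_shuffled`, the `make_classification` stand-in dataset, and the seed values are assumptions made for illustration, not part of the assignment.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

def evaluate_shuffled(clf, X, y, seed=42):
    """Mean macro-F1 over shuffled, stratified 10-fold cross-validation."""
    # shuffle=True reorders the rows before splitting; the seed makes it repeatable.
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    scores = cross_val_score(clf, X, y, cv=cv, scoring="f1_macro")
    return float(np.mean(scores))

# Synthetic stand-in with the same shape as the heart data (303 rows, 13 features).
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=0)
demo_score = evaluate_shuffled(LogisticRegression(max_iter=1000), X_demo, y_demo)
print(round(demo_score, 3))
```

Fixing the seed means two calls with the same model and data return the same score, which makes comparisons between parameter settings easier to trust.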

## Reading the dataset

In [None]:
import pandas as pd
import numpy as np

In [None]:
df = pd.read_csv("/content/heart_126487352.csv")
df.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


## **Problem 2**
Build a random forest classifier to model the heart disease dataset.

Experiment with different values of min_samples_leaf and max_depth to get the best model you can have. You should use the "evaluate" function in the previous problem to determine the performance of models with different parameters.

In [None]:
import pandas as pd

In [None]:
df.columns

Index(['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach',
       'exang', 'oldpeak', 'slope', 'ca', 'thal', 'target'],
      dtype='object')

In [None]:
df.shape

(303, 14)

In [None]:
X = df[['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach',
       'exang', 'oldpeak', 'slope', 'ca', 'thal']]
y = df['target']

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [None]:
from sklearn.ensemble import RandomForestClassifier
rc = RandomForestClassifier(min_samples_leaf=1, max_depth=2, random_state=42)
rc.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=2, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_samples_leaf=1,
                       min_samples_split=2, min_weight_fraction_leaf=0.0,
                       n_estimators=100, n_jobs=None, oob_score=False,
                       random_state=42, verbose=0, warm_start=False)

In [None]:
y_pred = rc.predict(X_test)

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)

0.87

In [None]:
rc.feature_importances_

array([0.02403587, 0.01970864, 0.19012011, 0.01688462, 0.00187467,
       0.00144509, 0.00483352, 0.13790344, 0.1273905 , 0.1128476 ,
       0.05463285, 0.16257459, 0.14574849])

In [None]:
evaluate(rc, X,y)

0.8473999288802425

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
parameters = {
    'random_state': [0, 42],
    'max_depth': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'min_samples_leaf': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    }

from sklearn.model_selection import GridSearchCV
clf = GridSearchCV(rc, param_grid = parameters, scoring='accuracy', cv=10)  
clf.fit(X_train, y_train)   # 200 parameter combinations x 10 folds, so this takes a while

GridSearchCV(cv=10, error_score=nan,
             estimator=RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                              class_weight=None,
                                              criterion='gini', max_depth=2,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              max_samples=None,
                                              min_impurity_decrease=0.0,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators=100, n_jobs=None,
                                              oob_score=False, random_state=42,
                                              verbose=0, warm_start=False),
             n_jobs=None,
     

In [None]:
print("Tuned Hyperparameters :", clf.best_params_)
print("Accuracy :",clf.best_score_)

Tuned Hyperparameters : {'max_depth': 2, 'min_samples_leaf': 7, 'random_state': 0}
Accuracy : 0.8471428571428572


Model after tuning the parameters.

In [None]:
from sklearn.ensemble import RandomForestClassifier
rc = RandomForestClassifier(max_depth=2, min_samples_leaf=7, random_state=0)
rc.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=2, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_samples_leaf=7,
                       min_samples_split=2, min_weight_fraction_leaf=0.0,
                       n_estimators=100, n_jobs=None, oob_score=False,
                       random_state=0, verbose=0, warm_start=False)

In [None]:
y_pred = rc.predict(X_test)
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)

0.85

In [None]:
evaluate(rc, X,y)

0.8365202786515935
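
The grid search above selects parameters by accuracy, while the assignment scores models with F1. A minimal sketch of a search that uses the same F1-based, shuffled cross-validation is shown below. It runs on synthetic data, uses a deliberately small grid and forest so it finishes quickly, and keeps `random_state` fixed instead of searching over it, since the seed is not a model hyperparameter. All dataset and grid values here are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for the heart data.
X_demo, y_demo = make_classification(n_samples=200, n_features=13, random_state=0)

# Small illustrative grid; the notebook searches 1..10 for both parameters.
param_grid = {"max_depth": [2, 4], "min_samples_leaf": [1, 5]}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

search = GridSearchCV(
    RandomForestClassifier(n_estimators=25, random_state=0),
    param_grid=param_grid,
    scoring="f1_macro",  # same metric family as evaluate()
    cv=cv,
)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```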

## **Problem 3**
Build a logistic regression classifier to model the heart disease dataset.

Experiment with the parameter C to find the best model you can have. You should use the "evaluate" function in the previous problem to determine the performance of models with different parameters.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [None]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(C=4,solver='lbfgs', max_iter=1000 )

In [None]:
lr.fit(X_train, y_train)

LogisticRegression(C=4, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=1000,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [None]:
y_pred = lr.predict(X_test)

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)

0.8

In [None]:
evaluate(lr, X, y)

0.8063490967737839

In [None]:
parameters = { 
    'C'       : np.logspace(-3,3,7),
    'solver'  : ['newton-cg', 'lbfgs', 'liblinear']
    }
from sklearn.model_selection import GridSearchCV
clf = GridSearchCV(lr, param_grid = parameters, scoring='accuracy', cv=10)  
clf.fit(X_train,y_train)

GridSearchCV(cv=10, error_score=nan,
             estimator=LogisticRegression(C=4, class_weight=None, dual=False,
                                          fit_intercept=True,
                                          intercept_scaling=1, l1_ratio=None,
                                          max_iter=1000, multi_class='auto',
                                          n_jobs=None, penalty='l2',
                                          random_state=None, solver='lbfgs',
                                          tol=0.0001, verbose=0,
                                          warm_start=False),
             n_jobs=None,
             param_grid={'C': array([1.e-03, 1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02, 1.e+03]),
                         'solver': ['newton-cg', 'lbfgs', 'liblinear']},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring='accuracy', verbose=0)

In [None]:
print("Tuned Hyperparameters :", clf.best_params_)
print("Accuracy :",clf.best_score_)

Tuned Hyperparameters : {'C': 1.0, 'solver': 'liblinear'}
Accuracy : 0.8761904761904761


Model after parameter tuning.

In [None]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(C=1.0, solver='liblinear', max_iter=1000 )
lr.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=1000,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

In [None]:
y_pred = lr.predict(X_test)
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)

0.8

In [None]:
evaluate(lr, X, y)

0.8200633262803187
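
Since the task is to experiment with `C`, one option is to score each candidate value directly with the cross-validated F1 instead of accuracy. The sketch below does this on synthetic data. Standardizing the features inside a pipeline is an added assumption (the notebook fits on raw features); it is included because regularization strength interacts with feature scale.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the heart data.
X_demo, y_demo = make_classification(n_samples=200, n_features=13, random_state=0)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

results = {}
for C in np.logspace(-3, 3, 7):  # same C grid as the notebook
    # The scaler is refit on each training fold, so no test-fold statistics leak in.
    model = make_pipeline(StandardScaler(), LogisticRegression(C=C, max_iter=1000))
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="f1_macro")
    results[float(C)] = float(scores.mean())

best_C = max(results, key=results.get)
print(best_C, round(results[best_C], 3))
```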

## **Problem 4**
The default AdaBoost classifier uses a decision tree with max_depth of 1 as base learners. Will using decision trees with max_depth of 2, 3, 4 or 5 improve the performance?

Using the heart dataset, compare the results and discuss which decision trees are best to use as base learners.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [None]:
from sklearn.tree import DecisionTreeClassifier
dc = DecisionTreeClassifier(max_depth=2)

In [None]:
from sklearn.ensemble import AdaBoostClassifier
ac = AdaBoostClassifier(base_estimator=dc)

In [None]:
ac.fit(X_train, y_train)

AdaBoostClassifier(algorithm='SAMME.R',
                   base_estimator=DecisionTreeClassifier(ccp_alpha=0.0,
                                                         class_weight=None,
                                                         criterion='gini',
                                                         max_depth=2,
                                                         max_features=None,
                                                         max_leaf_nodes=None,
                                                         min_impurity_decrease=0.0,
                                                         min_samples_leaf=1,
                                                         min_samples_split=2,
                                                         min_weight_fraction_leaf=0.0,
                                                         random_state=None,
                                                         splitter='best'),
                   learning_rate=1.0

In [None]:
y_pred = ac.predict(X_test)

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)

0.78

In [None]:
ac.feature_importances_

array([0.15531026, 0.02148023, 0.04707578, 0.13601764, 0.1633251 ,
       0.00465397, 0.01377356, 0.12178066, 0.01292005, 0.11783754,
       0.04746625, 0.1197616 , 0.03859737])

In [None]:
evaluate(ac, X, y)

0.7496118459838852

In [None]:
evaluation_scores = []
max_depth = [1, 3, 4, 5]  # max_depth=2 was already evaluated in the cells above
for depth in max_depth:
  dc = DecisionTreeClassifier(max_depth=depth)
  ac = AdaBoostClassifier(base_estimator=dc)
  ac.fit(X_train, y_train)  # keeps a fitted model available; evaluate() fits its own clones
  score = evaluate(ac, X, y)  # mean F1 over the 10 folds
  evaluation_scores.append(score)
evaluation_scores

[0.8070647507381787, 0.7780901521227777, 0.7720581303301891, 0.7840796467484015]

In [None]:
compare = {'Maxdepth':[1, 2, 3, 4, 5],
        'Score':[0.8070647507381787, 0.7496118459838852, 0.7929577444098346, 0.774754422627173, 0.7820630061659474]}
compare_df = pd.DataFrame(compare)
compare_df

Unnamed: 0,Maxdepth,Score
0,1,0.807065
1,2,0.749612
2,3,0.792958
3,4,0.774754
4,5,0.782063


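The table above shows that the default base learner, a decision tree with **max_depth = 1**, gives the highest score (0.807). Deeper trees with max_depth of 2 to 5 score between 0.750 and 0.793, so they do not improve performance on this dataset.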
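
The comparison can also be built directly from a loop, which avoids copying scores into a table by hand. The sketch below is a minimal version on synthetic data, with seeds fixed so repeated runs agree. The dataset, the reduced `n_estimators`, and the 5-fold splitter are assumptions chosen to keep the example fast.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the heart data.
X_demo, y_demo = make_classification(n_samples=200, n_features=13, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

rows = []
for depth in [1, 2, 3, 4, 5]:
    base = DecisionTreeClassifier(max_depth=depth, random_state=0)
    # The base learner is passed positionally because its keyword is
    # `base_estimator` in older scikit-learn releases and `estimator` in newer ones.
    booster = AdaBoostClassifier(base, n_estimators=20, random_state=0)
    score = cross_val_score(booster, X_demo, y_demo, cv=cv, scoring="f1_macro").mean()
    rows.append({"Maxdepth": depth, "Score": float(score)})

compare_demo = pd.DataFrame(rows)
print(compare_demo)
```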

## **Problem 5**
Modeling this data using random forest, which are the two most important features?

Modeling this data using AdaBoost, which are the two most important features?

In [None]:
rc.feature_importances_

array([0.02403587, 0.01970864, 0.19012011, 0.01688462, 0.00187467,
       0.00144509, 0.00483352, 0.13790344, 0.1273905 , 0.1128476 ,
       0.05463285, 0.16257459, 0.14574849])

In [None]:
rc_feature_importance = {'Features':['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach',
       'exang', 'oldpeak', 'slope', 'ca', 'thal'],
        'Importance':[0.02403587, 0.01970864, 0.19012011, 0.01688462, 0.00187467,
       0.00144509, 0.00483352, 0.13790344, 0.1273905 , 0.1128476 ,
       0.05463285, 0.16257459, 0.14574849]}
rc_feature_imp_df = pd.DataFrame(rc_feature_importance)
rc_feature_imp_df

Unnamed: 0,Features,Importance
0,age,0.024036
1,sex,0.019709
2,cp,0.19012
3,trestbps,0.016885
4,chol,0.001875
5,fbs,0.001445
6,restecg,0.004834
7,thalach,0.137903
8,exang,0.12739
9,oldpeak,0.112848
10,slope,0.054633
11,ca,0.162575
12,thal,0.145748


The two most important features when modelling this data with random forest are cp (importance 0.190) and ca (0.163).

In [None]:
df.columns

Index(['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach',
       'exang', 'oldpeak', 'slope', 'ca', 'thal', 'target'],
      dtype='object')

In [None]:
ac.feature_importances_

array([0.15531026, 0.02148023, 0.04707578, 0.13601764, 0.1633251 ,
       0.00465397, 0.01377356, 0.12178066, 0.01292005, 0.11783754,
       0.04746625, 0.1197616 , 0.03859737])

In [None]:
ac_feature_importance = {'Features':['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach',
       'exang', 'oldpeak', 'slope', 'ca', 'thal'],
        'Importance':[0.15531026, 0.02148023, 0.04707578, 0.13601764, 0.1633251 ,
       0.00465397, 0.01377356, 0.12178066, 0.01292005, 0.11783754,
       0.04746625, 0.1197616, 0.03859737]}
ac_feature_imp_df = pd.DataFrame(ac_feature_importance)
ac_feature_imp_df

Unnamed: 0,Features,Importance
0,age,0.15531
1,sex,0.02148
2,cp,0.047076
3,trestbps,0.136018
4,chol,0.163325
5,fbs,0.004654
6,restecg,0.013774
7,thalach,0.121781
8,exang,0.01292
9,oldpeak,0.117838
10,slope,0.047466
11,ca,0.119762
12,thal,0.038597


The two most important features when modelling this data with AdaBoost are chol (importance 0.163) and age (0.155).
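
The same ranking can be obtained without retyping the arrays by pairing the importances with the column names and taking the largest entries. The sketch below uses the importance values printed above; the helper name `top_features` is hypothetical.

```python
import pandas as pd

features = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg',
            'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal']

# Importance values copied from the notebook outputs above.
rf_importances = [0.02403587, 0.01970864, 0.19012011, 0.01688462, 0.00187467,
                  0.00144509, 0.00483352, 0.13790344, 0.1273905, 0.1128476,
                  0.05463285, 0.16257459, 0.14574849]
ada_importances = [0.15531026, 0.02148023, 0.04707578, 0.13601764, 0.1633251,
                   0.00465397, 0.01377356, 0.12178066, 0.01292005, 0.11783754,
                   0.04746625, 0.1197616, 0.03859737]

def top_features(importances, names, k=2):
    """Return the k feature names with the largest importance, highest first."""
    return pd.Series(importances, index=names).nlargest(k).index.tolist()

print(top_features(rf_importances, features))   # ['cp', 'ca']
print(top_features(ada_importances, features))  # ['chol', 'age']
```

In the notebook itself, the call would presumably take the fitted model's attribute directly, for example `top_features(rc.feature_importances_, X.columns)`.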

## **Problem 6**
Row 214 of the dataset is a 56-year-old man who is not diagnosed with heart disease.

This person has a cholesterol level of 249.

What is the chance that this person has heart disease?

If another person has the same profile but is 10 years older and has a cholesterol level of 500, what is the chance that this person has heart disease?

In [None]:
df.loc[ 214, : ]

age          56.0
sex           1.0
cp            0.0
trestbps    125.0
chol        249.0
fbs           1.0
restecg       0.0
thalach     144.0
exang         1.0
oldpeak       1.2
slope         1.0
ca            1.0
thal          2.0
target        0.0
Name: 214, dtype: float64

**Prediction for the man.**

In [None]:
ac.predict(np.array([56.0,1.0,0.0,125.0,249.0,1.0,0.0,144.0,1.0,1.2,1.0,1.0,2.0]).reshape(1,-1))[0]

0

The predicted class is 0, which means < 50% diameter angiographic narrowing, i.e., the model predicts that the **man does not have heart disease**. Note that `predict` returns a class label, not a probability.

**Prediction for another person.**

In [None]:
ac.predict(np.array([66.0,1.0,0.0,125.0,500.0,1.0,0.0,144.0,1.0,1.2,1.0,1.0,2.0]).reshape(1,-1))[0]

1

The predicted class is 1, which means > 50% diameter angiographic narrowing, i.e., the model predicts that **the person likely has heart disease**.
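
Because the question asks for a chance, not a class, `predict_proba` is the more direct call: it returns one estimated probability per class. The sketch below shows the pattern on synthetic data; the helper name `chance_of_positive`, the dataset, and the model settings are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Synthetic stand-in for the heart data, with labels 0 and 1.
X_demo, y_demo = make_classification(n_samples=200, n_features=13, random_state=0)
model = AdaBoostClassifier(n_estimators=20, random_state=0).fit(X_demo, y_demo)

def chance_of_positive(clf, row):
    """Estimated probability of class 1 for a single feature vector."""
    proba = clf.predict_proba(np.asarray(row, dtype=float).reshape(1, -1))[0]
    # Look up the column for label 1 instead of assuming the column order.
    positive_column = list(clf.classes_).index(1)
    return float(proba[positive_column])

p = chance_of_positive(model, X_demo[0])
print(round(p, 3))
```

Applied to the notebook's fitted `ac`, the two feature vectors above would be passed as `row`, giving a probability for each person in place of the 0/1 label.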