##### Heart Disease Detection (Decision tree)

<img src="heart.png">

## Decision Tree (with sklearn)
In the first part of this project, we build a decision tree with the help of the sklearn library. <br>
The dataset contains patient features, and the target indicates whether the patient has heart disease or not. <br>
We build a decision tree on all of the features, train it on 70 percent of the data, and test it on the remaining 30 percent.

### Read data
First, we read the CSV file with the __pandas read_csv__ function.

In [269]:
import pandas as pd

def read_data(file_path):
    return pd.read_csv(file_path)
csv_path = "./heart.csv"
heart_disease = read_data(csv_path)
heart_disease.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


### Splitting the data into train & test sets
We need two sets: one for training our model and one for testing it. Evaluating on data the model has not seen during training lets us detect overfitting to the train data and estimate how well the model works on unseen data. <br>
Our train set is 70 percent of the data and the other 30 percent is the test set. We do this split with __train_test_split__ from the __sklearn__ library.

In [270]:
from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(heart_disease, test_size=0.3, random_state=42)
train_set.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
124,39,0,2,94,199,0,1,179,0,0.0,2,0,2,1
72,29,1,1,130,204,0,0,202,0,0.0,2,0,2,1
15,50,0,2,120,219,0,1,158,0,1.6,1,0,2,1
10,54,1,0,140,239,0,1,160,0,1.2,2,0,2,1
163,38,1,2,138,175,0,1,173,0,0.0,2,4,2,1


### Split features and target
This function separates the target value from the features. We need these two parts separated for fitting our model and for predicting from the test features.

In [271]:
def split_features_target(dataset):
    data_X = dataset.drop("target", axis=1)
    data_y = dataset["target"].copy()
    return data_X, data_y

train_X, train_y = split_features_target(train_set)
test_X, test_y = split_features_target(test_set)
# train_X.head()
# train_y.head()

### Decision tree fitting by sklearn
The __scikit-learn__ library provides _DecisionTreeClassifier_. We fit this model on our train set with the _fit_ method of the _DecisionTreeClassifier_ class. During fitting, the model learns from our data and builds a decision tree over the features. With this model, we can predict the classes of unseen data.

In [272]:
from sklearn.tree import DecisionTreeClassifier

decision_tree = DecisionTreeClassifier()
decision_tree.fit(train_X, train_y)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')

### Calculate model accuracy
To calculate accuracy, we count the correctly predicted labels and divide by the total number of predictions. For this purpose, I use the __accuracy_score__ function of the __sklearn__ library. <br>
Accuracy of the decision tree model is __73.6%__.

In [273]:
from sklearn.metrics import accuracy_score

test_y_pred = decision_tree.predict(test_X)
print("Decision tree accuracy is:")
print(accuracy_score(test_y, test_y_pred))

Decision tree accuracy is:
0.7362637362637363
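
The counting described above can be sketched by hand. This is a minimal illustration, not how `accuracy_score` is implemented internally; the helper name `manual_accuracy` and the toy labels are made up for this example.

```python
import numpy as np

def manual_accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

# Toy labels (hypothetical): 3 of the 4 predictions are correct.
toy_true = [1, 0, 1, 1]
toy_pred = [1, 0, 0, 1]
print(manual_accuracy(toy_true, toy_pred))  # 0.75
```

Applied to `test_y` and `test_y_pred`, this should give the same number as `accuracy_score`.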


### Implementing bagging and random forest
A decision tree in its basic form is often not very accurate. It has some improved versions. <br> 
One technique to improve a basic decision tree is using __ensemble methods__. <br>
Two general ideas for ensemble methods exist: <br>
1. Bagging (Bootstrapping + aggregating)
2. Random forest
#### 1. Make 5 groups with 150 random records from main dataset
In these two techniques, we use multiple classifiers and rely on the majority of them to improve our classification result. <br>
So we have multiple classifiers and we want a different training set for each of them. Because of that, we draw smaller random groups from the main train set to create a dataset for each classifier. <br>
In this part, I create 5 groups with 150 records each. __make_one_data_group__ creates one of these groups by sampling row indices with replacement. __make_all_data_groups__ returns all training sets for the classifiers. <br>
This part is called __bootstrapping__.

In [274]:
import numpy as np

def make_one_data_group(dataset, records_count):
    # np.random.choice samples with replacement by default (a bootstrap sample)
    chosen_indices = np.random.choice(len(dataset), records_count)
    return dataset.iloc[chosen_indices]

def make_all_data_groups(dataset, group_count, records_count):
    data_groups = []
    for _ in range(group_count):
        data_group = make_one_data_group(dataset, records_count)
        data_groups.append(data_group)
    return data_groups

group_count = 5
group_members_count = 150
train_set, test_set = train_test_split(heart_disease, test_size=0.2, random_state=42)
data_groups = make_all_data_groups(train_set, group_count, group_members_count)
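
To see what sampling with replacement means for a data group, here is a small sketch on toy indices. The sizes and the seed are arbitrary assumptions for illustration, and it uses NumPy's `default_rng` generator rather than the `np.random.choice` call in the cell above.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility

n_rows = 10        # size of a toy "training set"
sample_size = 10   # size of one bootstrap sample

# Sampling WITH replacement: the same row index may be drawn several times,
# so some rows usually appear more than once and others not at all.
bootstrap_indices = rng.choice(n_rows, size=sample_size, replace=True)

unique_rows = np.unique(bootstrap_indices)
print("drawn indices:", bootstrap_indices)
print("distinct rows:", len(unique_rows), "of", n_rows)
```

Because rows can repeat, each of the 5 groups is a slightly different view of the same train set, which is what makes the trees differ from each other.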

#### 2. Implementing bagging ensemble method
In this part, we create the bagging ensemble model. It has two parts: __bootstrapping__ and __aggregating__. <br>
In __bootstrapping__, we prepare data groups for the different classifiers. _split_data_groups_ splits the features and target values in the data groups of the classifiers. After that we make multiple classifiers and fit each one on its own data group. <br>
In __aggregating__, the final step of bagging, we predict each record with every decision tree and label it with the majority class predicted by the classifiers. __bagging_predict__ is the main function for predicting the test set with the bagging ensemble method. __calculate_majority__ calculates the final label of each data record. <br>
Bagging accuracy is __77.0%__.

In [279]:
from operator import add

def split_data_groups(data_groups):
    splitted_data_groups = []
    for index in range(len(data_groups)):
        data_group_X, data_group_y = split_features_target(data_groups[index])
        splitted_data_groups.append({'X': data_group_X ,'y': data_group_y})
    return splitted_data_groups

def make_classifiers(training_sets):
    classifiers = []
    for index in range(len(training_sets)):
        decision_tree = DecisionTreeClassifier()
        decision_tree.fit(training_sets[index]['X'], training_sets[index]['y'])
        classifiers.append(decision_tree)
    return classifiers

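def calculate_majority(sum_prediction):
    # With 5 classifiers and 0/1 labels, label 1 wins when more than 2 trees vote for it
    return [int(sum_predict>2) for sum_predict in sum_prediction]

def bagging_predict(classifiers, test_X):
    # Size the vote counter from the argument instead of the global test_set
    prediction = [0] * len(test_X)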
    for i in range(len(classifiers)):
        predict_i = classifiers[i].predict(test_X)
        prediction = list(map(add, prediction, predict_i))
    return calculate_majority(prediction)

splitted_data_groups = split_data_groups(data_groups)
classifiers = make_classifiers(splitted_data_groups)
test_X, test_y = split_features_target(test_set)
test_y_pred = bagging_predict(classifiers, test_X)
print("Bagging accuracy is:")
print(accuracy_score(test_y, test_y_pred))

Bagging accuracy is:
0.7704918032786885
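
The majority rule in `calculate_majority` is hard-coded for 5 classifiers (`> 2`). A minimal sketch of the same vote for any number of classifiers could look like the following; the helper name `majority_vote` and the toy predictions are assumptions for illustration, and it only handles binary 0/1 labels.

```python
import numpy as np

def majority_vote(predictions):
    """predictions: array-like of shape (n_classifiers, n_samples) with 0/1 labels."""
    predictions = np.asarray(predictions)
    n_classifiers = predictions.shape[0]
    votes_for_one = predictions.sum(axis=0)
    # Label 1 wins when strictly more than half of the classifiers vote for it.
    return (votes_for_one > n_classifiers / 2).astype(int)

# Toy predictions (hypothetical): 3 classifiers, 3 samples.
toy_predictions = [
    [1, 0, 1],
    [1, 0, 0],
    [0, 1, 1],
]
print(majority_vote(toy_predictions))  # [1 0 1]
```

With 5 classifiers the threshold becomes 2.5, which matches the `> 2` test on integer vote counts used above.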


#### 3. Calculating accuracy with deleting features
In this part, we delete each feature once, build the bagging model without that feature, and calculate its accuracy to find the most important features. <br>
We see that removing _sex_ or _exang_ has no effect on accuracy. Removing the _ca_ or _oldpeak_ attributes improves the accuracy of the model. Removing _cp_ or _trestbps_ lowers accuracy the most (to 68.9%), so these are the most important features in this run.

In [282]:
def ommit_feature_from(data_groups, feature):
    # Drop one feature column from every bootstrap group
    modified_data_groups = []
    for data_group in data_groups:
        modified_data_group = data_group.drop(feature, axis=1)
        modified_data_groups.append(modified_data_group)
    return modified_data_groups

group_count = 5
group_members_count = 150
features = list(train_set.columns)
features.pop(-1)  # remove the "target" column, which is the last one
for feature in features:
    modified_data_groups = ommit_feature_from(data_groups, feature)
    splitted_data_groups = split_data_groups(modified_data_groups)
    classifiers = make_classifiers(splitted_data_groups)
    test_X, test_y = split_features_target(test_set)
    test_X.drop(feature, axis=1, inplace=True)
    test_y_pred = bagging_predict(classifiers, test_X)
    print("Bagging without " + feature + " accuracy is:")
    print(accuracy_score(test_y, test_y_pred))
    print("************************************")

Bagging without age accuracy is:
0.7540983606557377
************************************
Bagging without sex accuracy is:
0.7704918032786885
************************************
Bagging without cp accuracy is:
0.6885245901639344
************************************
Bagging without trestbps accuracy is:
0.6885245901639344
************************************
Bagging without chol accuracy is:
0.7377049180327869
************************************
Bagging without fbs accuracy is:
0.7540983606557377
************************************
Bagging without restecg accuracy is:
0.7213114754098361
************************************
Bagging without thalach accuracy is:
0.7540983606557377
************************************
Bagging without exang accuracy is:
0.7704918032786885
************************************
Bagging without oldpeak accuracy is:
0.7868852459016393
************************************
Bagging without slope accuracy is:
0.7540983606557377
************************************


#### 4. Choosing 5 random features and make decision tree for them
We choose 5 random features from the feature set and build a decision tree on these features. <br>
This is a warm-up exercise for the random forest, because a random forest consists of trees like this: each tree learns from a random subset of the features.

In [277]:
def get_random_features(features, features_count):
    shuffled_indices = np.random.permutation(len(features))
    chosen_indices = shuffled_indices[:features_count]
    return [features[index] for index in chosen_indices]

random_features = get_random_features(features, 5)
random_features.append('target')
modified_train_set = train_set[random_features]
modified_test_set = test_set[random_features]
train_X, train_y = split_features_target(modified_train_set)
test_X, test_y = split_features_target(modified_test_set)
decision_tree = DecisionTreeClassifier()
decision_tree.fit(train_X, train_y)
test_y_pred = decision_tree.predict(test_X)
print("Decision tree with random features" + str(random_features) + "accuracy is:")
print(accuracy_score(test_y, test_y_pred))

Decision tree with random features['oldpeak', 'restecg', 'ca', 'slope', 'exang', 'target']accuracy is:
0.7377049180327869


#### 5. Implementing random forest
We create multiple trees; each one is built from 5 randomly chosen features. <br>
We have 5 trees. The __make_classifiers__ function creates these trees. <br> 
After creating the trees, we predict each test record with every tree and label it with the majority class predicted by the trees. __random_forest_predict__ plays the main role in this part. The __calculate_majority__ function calculates the final label for each record. The final accuracy of this model is __80.3%__.

In [293]:
def make_classifiers(training_sets):
    classifiers = []
    selected_features = [] 
    for index in range(len(training_sets)):
        random_features = get_random_features(features, 5)
        modified_train_set = training_sets[index]['X'][random_features]
        decision_tree = DecisionTreeClassifier()
        decision_tree.fit(modified_train_set, training_sets[index]['y'])
        classifiers.append(decision_tree)
        selected_features.append(random_features)
    return classifiers, selected_features

def calculate_majority(sum_prediction):
    return [int(sum_predict>2) for sum_predict in sum_prediction]

def random_forest_predict(classifiers, features, test_X):
    # Size the vote counter from the argument instead of the global test_set
    prediction = [0] * len(test_X)
    for i in range(len(classifiers)):
        random_features = features[i]
        test_X_i = test_X[random_features]
        predict_i = classifiers[i].predict(test_X_i)
        prediction = list(map(add, prediction, predict_i))
    return calculate_majority(prediction)

splitted_data_groups = split_data_groups(data_groups)
classifiers, classifier_features = make_classifiers(splitted_data_groups)
test_X, test_y = split_features_target(test_set)
test_y_pred = random_forest_predict(classifiers, classifier_features, test_X)
print("Random forest accuracy is:")
print(accuracy_score(test_y, test_y_pred))

Random forest accuracy is:
0.8032786885245902
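
For comparison with the hand-written ensembles, scikit-learn ships `BaggingClassifier` and `RandomForestClassifier`. The sketch below runs them on synthetic stand-in data (not the heart dataset), so its scores are not comparable to the numbers in this notebook; the parameter values only mirror the 5 trees, 150 records, and 5 features used here. Note one difference: scikit-learn's random forest draws a random feature subset at every split, while the implementation above draws one subset per tree.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption for illustration): 300 rows, 13 features.
X, y = make_classification(n_samples=300, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 5 trees, each fitted on a bootstrap sample of 150 records.
bagging = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=5, max_samples=150, random_state=0
)
# 5 trees, each split considering 5 randomly chosen features.
forest = RandomForestClassifier(n_estimators=5, max_features=5, random_state=0)

scores = {}
for name, model in [("bagging", bagging), ("random forest", forest)]:
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))
    print(name, "accuracy:", scores[name])
```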


## Questions 
### 1. What is bootstrapping? What is its effect on variance and standard deviation?
Given a training set, we produce multiple different training sets (called bootstrap samples) by sampling with replacement from the original dataset. This process is called __bootstrapping__. <br>
Aggregating the predictions of models trained on different bootstrap samples decreases the _variance_, and therefore the _standard deviation_, of the final prediction.
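
The variance reduction from aggregating can be illustrated with a small simulation. This is a sketch under a strong simplifying assumption: each "model" is just the true value 0 plus independent unit-variance noise. Trees trained on bootstrap samples of the same data are correlated, so the reduction in practice is smaller than shown here.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed

n_trials = 20000   # number of repeated experiments
n_models = 5       # size of the ensemble, as in this project

# Each row holds the outputs of 5 independent noisy "models" in one experiment.
single_outputs = rng.normal(loc=0.0, scale=1.0, size=(n_trials, n_models))

var_single = single_outputs[:, 0].var()          # variance of one model alone
var_average = single_outputs.mean(axis=1).var()  # variance of the 5-model average

print("variance of one model:", var_single)
print("variance of the average of 5:", var_average)
```

For independent models the variance of the average is about one fifth of the single-model variance, and the standard deviation shrinks by a factor of about the square root of 5.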
### 2. What is overfitting and why is a decision tree sensitive to it? What is the effect of bagging?
Overfitting is when the trained model has very high accuracy on the train set but much lower accuracy on new data. In this situation, we can say the decision tree is fitted too closely to the train set and can't predict new unseen data well. <br> 
A decision tree is very sensitive to it because, given enough features, it can keep splitting until it fits the train set exactly: it memorizes the whole data and can reproduce every training label through its decision points. Our model accuracy on the train set can reach 1. <br> 
__Bagging__ is designed to improve the stability and accuracy of machine learning algorithms used in statistical classification and regression. It also reduces variance and helps to avoid overfitting. The ensemble depends less on any particular part of the train data because each tree learns from a different sample of the data, so the chance of overfitting decreases.
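
The claim that train accuracy can reach 1 is easy to check. The sketch below uses synthetic data with deliberately noisy labels (an assumption for illustration, not the heart dataset) and an unrestricted `DecisionTreeClassifier`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with label noise (flip_y), so the labels cannot be learned perfectly.
X, y = make_classification(n_samples=300, n_features=13, flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# No depth limit: the tree keeps splitting until every leaf is pure.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_accuracy = tree.score(X_train, y_train)
test_accuracy = tree.score(X_test, y_test)
print("train accuracy:", train_accuracy)
print("test accuracy:", test_accuracy)
```

The tree reproduces the noisy train labels exactly, while the gap to the test accuracy shows how much of that fit does not carry over to unseen data.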
### 3. What is the relation between random forest and bagging? What is the effect of random forest?
The __random forest__ approach is a __bagging method__ where deep trees, fitted on bootstrap samples, are combined to produce an output with lower variance. However, random forests also use another trick to make the multiple fitted trees a bit less correlated with each other: when growing each tree, _instead of only sampling over the observations in the dataset to generate a bootstrap sample, we also __sample over features__ and keep only a random subset of them to build the tree._ <br>
Sampling over features means that not all trees look at the exact same information to make their decisions, so it reduces the correlation between the different returned outputs. Another advantage of sampling over the features is that it makes the decision-making process more robust to missing data: observations (from the training dataset or not) with missing data can still be regressed or classified based on the trees that take into account only features where data are not missing. Thus, the random forest algorithm combines the concepts of bagging and random feature subspace selection to create more robust and more accurate models.
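
Why lower correlation helps can be made concrete with a standard identity for the variance of an average. As a sketch, assume $B$ tree predictions $T_1, \dots, T_B$ that each have variance $\sigma^2$ and pairwise correlation $\rho$ (these symbols are introduced here for illustration and do not appear in the project code):

$$
\operatorname{Var}\left(\frac{1}{B}\sum_{b=1}^{B} T_b\right) = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
$$

Adding more trees shrinks only the second term, so lowering $\rho$ through feature sampling is what reduces the first term.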
### 4. What is the conclusion and the relation between the decision tree, bagging, and random forest results?
__Bagging__ is more accurate than the __plain decision tree__ because it uses multiple classifiers and has less error. The __random forest__ is the most accurate of these models because it uses different random features and trains varied trees that consider different features. For this reason, it has less variance and higher accuracy.