In [1]:
# Import the libraries
from typing import List, Tuple

import pandas as pd
import numpy as np

# Random Forest

For this assignment you will implement an algorithm that is very commonly used in practice: `Random Forest`.  
So what is `Random Forest`? To answer this question, let's first learn a couple of new statistical concepts.

## Ensemble Methods

Recall the `Central Limit Theorem` from your probability and statistics course. It states that if you have a population with mean $\mu$ and standard deviation $\sigma$ and repeatedly draw sufficiently large random samples from it with replacement, then the distribution of the sample means will be approximately normal. This holds regardless of whether the source population is normal or skewed, provided the sample size is sufficiently large.

Concretely, we can say that if:

$$ X_1, X_2, \ldots, X_n \sim P(X) $$

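where $X_i$ are i.i.d. random samples from any distribution $P(X)$ with mean $\mu$ and standard deviation $\sigma$, then for sufficiently large $n$ the sample mean is approximately normal with mean $\mu$ and variance $\frac{\sigma^2}{n}$:
$$ \bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i \sim N\left(\mu, \frac{\sigma^2}{n}\right) $$

In particular, the variance of the sample mean is

$$ Var(\bar{X}) = \frac{\sigma^2}{n} $$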
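
As a quick numerical check of the variance formula above, the following sketch draws many samples of size $n$ from a skewed (exponential) distribution and compares the empirical variance of the sample means with $\sigma^2 / n$. The choice of distribution, the sample sizes, and the seed are arbitrary assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily

n = 50             # size of each sample
n_repeats = 20000  # number of sample means to collect
scale = 2.0        # exponential distribution with mean 2 and standard deviation 2

# Each row is one i.i.d. sample of size n from a skewed distribution
samples = rng.exponential(scale=scale, size=(n_repeats, n))
sample_means = samples.mean(axis=1)

empirical_var = sample_means.var()
theoretical_var = scale**2 / n  # sigma^2 / n

print(f"empirical variance of the mean:   {empirical_var:.4f}")
print(f"theoretical variance sigma^2 / n: {theoretical_var:.4f}")
```

The two numbers should agree closely even though the source distribution is far from normal.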

- In machine learning, we are often interested in decreasing the variance of our model.  
- In `Ensemble Methods`, we do so by training multiple machine learning models (Logistic Regression, Decision Trees, Support Vector Machines, etc.) and then combining their outputs in some way to get a final prediction.  
- Averaging over several models in this way decreases the variance of the combined prediction and can thus improve performance.

This formula for variance reduction holds only if the samples are independent. In practice, this is rarely the case: there tends to be some correlation between the samples. The formula can be modified to account for this correlation:

$$ Var(\bar{X}) = \rho \sigma^2 + \frac{1 - \rho}{n} \sigma^2$$

where $\rho$ is the pairwise correlation between the samples.

Notice that if $\rho = 0$, we recover the original formula for variance reduction, and if $\rho = 1$, there is no variance reduction at all.
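
The correlated-variance formula can be checked with a small simulation. The sketch below builds equally correlated Gaussian variables from one shared component and one independent component per variable, which is just one convenient construction assumed here for illustration; `variance_of_mean` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily

def variance_of_mean(rho, n, sigma=1.0, n_repeats=50000):
    """Empirical variance of the mean of n variables with pairwise correlation rho."""
    # X_i = sqrt(rho) * Z + sqrt(1 - rho) * E_i gives Var(X_i) = sigma^2
    # and Cov(X_i, X_j) = rho * sigma^2 for i != j
    shared = rng.normal(0.0, sigma, size=(n_repeats, 1))
    noise = rng.normal(0.0, sigma, size=(n_repeats, n))
    x = np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * noise
    return x.mean(axis=1).var()

n, sigma = 10, 1.0
for rho in (0.0, 0.5, 1.0):
    predicted = rho * sigma**2 + (1.0 - rho) / n * sigma**2
    observed = variance_of_mean(rho, n, sigma)
    print(f"rho={rho:.1f}  predicted={predicted:.3f}  observed={observed:.3f}")
```

With $\rho = 0.5$ the variance stays above $0.5\,\sigma^2$ no matter how large $n$ becomes, which is why ensemble methods try to reduce the correlation between their members.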

There are many different forms of ensemble methods. Each of them tries to decrease this correlation term ($\rho$) as much as possible.  
In this assignment (`Random Forest`), we will focus on one specific type called `Bagging`.

## Baseline Model

For the baseline model, implement a standard `Decision Tree` classifier.

### Data Preprocessing (if any)

In [2]:
df_train = pd.read_csv('train.csv')
df_val = pd.read_csv('val.csv')

In [3]:
df_train.head()

Unnamed: 0,Party,Feature-1,Feature-2,Feature-3,Feature-4,Feature-5,Feature-6,Feature-7,Feature-8,Feature-9,Feature-10,Feature-11,Feature-12,Feature-13,Feature-14,Feature-15,Feature-16
0,democrat,n,y,y,n,y,y,y,n,y,y,y,n,y,y,n,y
1,democrat,y,n,y,n,n,n,y,y,y,n,n,n,y,n,y,y
2,democrat,n,y,y,n,n,y,n,y,y,y,y,n,y,n,y,y
3,democrat,y,y,y,n,n,n,y,y,y,n,y,n,n,n,y,y
4,democrat,n,n,y,n,n,n,y,y,y,n,n,n,n,n,y,y


In [4]:
# encode the categorical columns
d = {'y': 1, 'n': 0}
d_party = {'democrat': 0, 'republican': 1}
for col in df_train.columns:
    if col != 'Party':
        df_train[col] = df_train[col].map(d)
        df_val[col] = df_val[col].map(d)
    else:
        df_train[col] = df_train[col].map(d_party)
        df_val[col] = df_val[col].map(d_party)
        
df_train.head()

Unnamed: 0,Party,Feature-1,Feature-2,Feature-3,Feature-4,Feature-5,Feature-6,Feature-7,Feature-8,Feature-9,Feature-10,Feature-11,Feature-12,Feature-13,Feature-14,Feature-15,Feature-16
0,0,0,1,1,0,1,1,1,0,1,1,1,0,1,1,0,1
1,0,1,0,1,0,0,0,1,1,1,0,0,0,1,0,1,1
2,0,0,1,1,0,0,1,0,1,1,1,1,0,1,0,1,1
3,0,1,1,1,0,0,0,1,1,1,0,1,0,0,0,1,1
4,0,0,0,1,0,0,0,1,1,1,0,0,0,0,0,1,1


Implement the Decision Tree class below.

In [5]:
class Node:
    def __init__(self, dataset: pd.DataFrame, features: list, target: str):
        """Initializes the Node class. This class represents the current state of the dataset after a split has been made.

        dataset : Current state of the dataset
        features : A list of features to consider for the split
        target : The target variable

        best_feature: Feature which gives the highest information gain. This starts off as None and is updated after the split
        best_feature_values: A dictionary which contains the unique values of the best feature and the corresponding subset of the dataset

        class_names: A list of the unique values in the target variable
        target_probabilities: A NumPy array which contains the probabilities of each unique value in the target attribute
        entropy: Entropy of the target variable in the given dataset

        Args:
            dataset (pd.DataFrame): Current state of the dataset
            features (list): A list of features to consider for the split
            target (str): The target variable
        """
        self.dataset = dataset
        self.features = features
        self.target = target

        self.best_feature = None
        self.best_feature_values = {}

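        # Take the class names from value_counts() so their order matches target_probabilities
        self.class_names = dataset[target].value_counts(normalize=True).index.tolist()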
        self.target_probabilities = self.calculate_target_probabilities(self.dataset)
        self.entropy = self.calculate_entropy(self.dataset)
    
    def calculate_target_probabilities(self, dataset: pd.DataFrame) -> np.ndarray:
        """This function calculates the probabilities of each unique value in the target variable.
        You can think of this as the prior probabilities (as in Naive Bayes)

        Args:
            dataset (pd.DataFrame): The dataset over which we want to calculate the probabilities

        Returns:
            np.ndarray: Returns a NumPy array which contains the probabilities of each unique value in the target attribute
        """
        target_probabilities = dataset[self.target].value_counts(normalize=True).values
        return target_probabilities
    
    def calculate_entropy(self, dataset: pd.DataFrame) -> float:
        """Calculates the entropy of the target variable in the given dataset

        Args:
            dataset (pd.DataFrame): dataset over which we want to calculate the entropy

        Returns:
            float: Entropy of the target variable in the given dataset
        """
        target_probabilities = dataset[self.target].value_counts(normalize=True).values
        target_log_probabilities = np.log2(target_probabilities)

        entropy = -np.sum(target_probabilities * target_log_probabilities)
        return entropy

    def calculate_information_gain(self, feature: str) -> float:
        """Calculates the information gain for the given feature

        Args:
            feature (str): Feature for which we want to calculate the information gain

        Returns:
            float: Information gain for the given feature
        """

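        # Find the unique values of the feature and their probabilities.
        # Both come from the same value_counts() call so that their order matches.
        feature_value_probabilities = self.dataset[feature].value_counts(normalize=True)
        feature_values = feature_value_probabilities.index
        feature_probabilities = feature_value_probabilities.values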

        # Find the entropy of each unique value
        feature_entropies = []
        for feature_value in feature_values:
            feature_subset = self.dataset[self.dataset[feature] == feature_value]
            feature_subset_entropy = self.calculate_entropy(feature_subset)
            feature_entropies.append(feature_subset_entropy)

        # Find the weighted average of the entropies
        feature_entropies = np.array(feature_entropies)
        weighted_average = np.sum(feature_probabilities * feature_entropies)

        # Find the information gain
        information_gain = self.entropy - weighted_average
        return information_gain

    def find_best_feature(self) -> str:
        """Finds the best feature which gives the highest information gain. This is the feature which we will use to split the dataset

        Returns:
            str: Feature which gives the highest information gain
        """
        # Get information gain for all features and select the one with the highest information gain
        information_gains = []
        for feature in self.features:
            information_gain = self.calculate_information_gain(feature)
            information_gains.append(information_gain)

        information_gains = np.array(information_gains)
        best_feature_index = np.argmax(information_gains)
        best_feature = self.features[best_feature_index]
        return best_feature
    
    def is_pure(self) -> bool:
        """Checks if the node is pure. A node is pure if all the target values are the same

        Returns:
            bool: Pure or not
        """
        return self.entropy == 0

    def predict(self) -> str:
        """Predicts the class of the current node. This is the class which has the highest probability

        Returns:
            str: Prediction
        """
        idx = np.argmax(self.target_probabilities)
        prediction = self.class_names[idx]
        return prediction

In [6]:
class DecisionTree:
    def __init__(self, max_depth=None):
        self.root = None
        self.max_depth = max_depth
    
    def fit(self, dataset: pd.DataFrame, features: list, target: str, node: Node=None, depth: int=1):
        """Trains the decision tree model on the given dataset

        Args:
            dataset (pd.DataFrame): Dataset on which we want to train the model
            features (list): List of features to consider for training
            target (str): The target attribute
            node (Node): The current node. Defaults to None.
            depth (int): Current depth of the tree.
        """

        if node is not None and node.is_pure(): # Do not split if the node is pure
            return
        
        if self.max_depth is not None and depth > self.max_depth: # Do not split if the max depth is reached
            return
        
        if not features: # Do not split if there are no features left
            return
        
        if node is None:
            node = Node(dataset, features, target)
            self.root = node
        
        # Find the best feature to split on
        best_feature = node.find_best_feature()
        node.best_feature = best_feature # Set the best feature of the node
        best_feature_values = dataset[best_feature].unique() # Find the unique values of the best feature

        for best_feature_value in best_feature_values:

            # Create a subset of the dataset which contains only the current best feature value and remove the best feature from the dataset
            #! NOTE: Remember to create a copy of the dataset while removing the feature. Otherwise, the original dataset will be modified
            best_feature_subset = dataset[dataset[best_feature] == best_feature_value]
            best_feature_subset = best_feature_subset.drop(columns=[best_feature])

            # Create a subset of the features which does not contain the best feature
            best_feature_subset_features = list(best_feature_subset.columns)
            best_feature_subset_features.remove(target)

            # Create a new node for the best split
            best_feature_subset_root = Node(best_feature_subset, best_feature_subset_features, target)
            node.best_feature_values[best_feature_value] = best_feature_subset_root


            # Recursively fit the model on the best feature subset
            self.fit(best_feature_subset, best_feature_subset_features, target, node=best_feature_subset_root, depth=depth+1)
        
    def predict(self, features: pd.Series) -> tuple:
        """Predict the class for the given features using the trained model

        Args:
            features (pd.Series): features

        Returns:
            str: Prediction of the class
            list: List of decisions made by the model. These are the features on which the model split the dataset
        """


        node = self.root
        decisions = []

        while node.best_feature is not None:
            decisions.append(node.best_feature)
            feature_value = features[node.best_feature]
            if feature_value not in node.best_feature_values:
                break
            node = node.best_feature_values[feature_value]
        
        prediction = node.predict()
        return prediction, decisions

In [7]:
print(df_train.columns)


Index(['Party', 'Feature-1', 'Feature-2', 'Feature-3', 'Feature-4',
       'Feature-5', 'Feature-6', 'Feature-7', 'Feature-8', 'Feature-9',
       'Feature-10', 'Feature-11', 'Feature-12', 'Feature-13', 'Feature-14',
       'Feature-15', 'Feature-16'],
      dtype='object')


Train the model on the training data and report the accuracy and [F1 score](https://en.wikipedia.org/wiki/F-score) on the validation data.
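
The accuracy and F1 score requested above can both be derived from the same counts of correct and incorrect predictions. The following is a minimal sketch, assuming binary labels with class 1 treated as the positive class; `accuracy_and_f1` is a hypothetical helper name and not part of the assignment skeleton.

```python
from typing import Sequence, Tuple

def accuracy_and_f1(actual: Sequence[int], predicted: Sequence[int]) -> Tuple[float, float]:
    """Returns (accuracy, F1) for binary labels, with class 1 as the positive class."""
    tp = fp = fn = correct = 0
    for a, p in zip(actual, predicted):
        correct += int(a == p)
        if a == 1 and p == 1:
            tp += 1
        elif a == 0 and p == 1:
            fp += 1
        elif a == 1 and p == 0:
            fn += 1

    accuracy = correct / len(actual) if len(actual) else 0.0
    # Fall back to 0.0 when a denominator would be zero
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, f1

# Toy labels: 3 of 5 predictions are correct, with 2 TP, 1 FP and 1 FN
acc, f1 = accuracy_and_f1([1, 0, 1, 1, 0], [1, 0, 0, 1, 1])
print(f"accuracy={acc:.3f}, f1={f1:.3f}")
```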

In [8]:
features = df_train.columns.tolist()
features.remove('Party')
target = 'Party'
dtree = DecisionTree()

In [9]:
clf = DecisionTree(max_depth=5)
clf.fit(df_train, features, target)

In [10]:
df_test = pd.read_csv('val.csv')  # same file name as loaded above; file names are case-sensitive on some systems
df_test.head()

Unnamed: 0,Party,Feature-1,Feature-2,Feature-3,Feature-4,Feature-5,Feature-6,Feature-7,Feature-8,Feature-9,Feature-10,Feature-11,Feature-12,Feature-13,Feature-14,Feature-15,Feature-16
0,republican,n,n,n,y,y,y,y,n,n,y,n,y,n,y,y,y
1,republican,y,y,y,y,n,n,y,y,y,y,y,n,n,y,n,y
2,democrat,y,n,y,n,n,n,y,y,y,n,n,n,n,n,y,y
3,republican,n,y,y,y,y,y,n,n,n,y,n,y,y,y,n,y
4,democrat,y,y,y,n,n,n,y,y,y,n,y,n,n,n,y,y


In [11]:
def score(actual, predicted):
    """Computes the F1 score for binary labels, treating class 1 as the positive class."""
    true_positives = 0
    false_positives = 0
    false_negatives = 0

    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            true_positives += 1
        elif a == 0 and p == 1:
            false_positives += 1
        elif a == 1 and p == 0:
            false_negatives += 1

    print(f'True positives: {true_positives}')
    print(f'False positives: {false_positives}')
    print(f'False negatives: {false_negatives}')

    # Guard against division by zero when there are no predicted or actual positives
    predicted_positives = true_positives + false_positives
    actual_positives = true_positives + false_negatives
    precision = true_positives / predicted_positives if predicted_positives else 0.0
    recall = true_positives / actual_positives if actual_positives else 0.0

    print(f'Precision: {precision}')
    print(f'Recall: {recall}')

    if precision + recall == 0:
        return 0.0

    f1_score = 2 * precision * recall / (precision + recall)

    return f1_score

In [12]:
# predict on validation data
# remove party column from validation data
y_actual = df_val['Party'].tolist()
df_val = df_val.drop(columns=['Party'])

# use the fitted tree (clf); dtree was created above but never trained
y_pred = []
for idx, row in df_val.iterrows():
    y_pred.append(clf.predict(row)[0])

# calculate F1 score manually
# store the result under a new name so that the score() function is not overwritten
f1 = score(y_actual, y_pred)
print(f"the F1 score is {f1}")

## Bagging

Bagging ([Bootstrap](https://en.wikipedia.org/wiki/Bootstrapping_(statistics)) Aggregation) is a method of constructing an ensemble of classifiers by training each classifier on a random subset (bootstrap sample) of the training set. Concretely,

- Given a training set $X = x_1, x_2, \ldots, x_n$ with labels $Y = y_1, y_2, \ldots, y_n$.
- We create $k$ bootstrap samples $X_1, X_2, \ldots, X_k$ by sampling $r$ examples from $X$ uniformly at random with replacement.
- We train $k$ different classifiers $h_1, h_2, \ldots, h_k$ on the bootstrap samples $X_1, X_2, \ldots, X_k$.
- We combine the predictions of each classifier using majority voting.

In doing so, we are able to decrease the correlation between the classifiers and thus decrease the variance of our model.  
Notice that if we use the same training set to train each classifier, then the correlation between the classifiers will be very high ($\rho \approx 1$) and thus there will be almost no variance reduction at all.
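
To see why bootstrap samples differ from one another, note that when $n$ rows are drawn with replacement from $n$ rows, each row appears at least once with probability $1 - (1 - 1/n)^n \approx 0.632$. The sketch below checks this on a toy dataframe; the dataframe, its size, and the seeds are assumptions made only for illustration.

```python
import numpy as np
import pandas as pd

# Toy dataframe with 1000 rows; the index identifies the original rows
df = pd.DataFrame({"x": np.arange(1000)})

# Two independent bootstrap samples of the same size as the original data
sample_a = df.sample(frac=1.0, replace=True, random_state=1)
sample_b = df.sample(frac=1.0, replace=True, random_state=2)

# Fraction of the original rows that appear at least once in sample_a
unique_fraction = sample_a.index.nunique() / len(df)

# Fraction of the original rows that appear in both samples
overlap = len(set(sample_a.index) & set(sample_b.index)) / len(df)

print(f"unique rows in one bootstrap sample: {unique_fraction:.3f}")
print(f"rows shared by two bootstrap samples: {overlap:.3f}")
```

Because each tree sees a different subset of the rows, the trees make partly different errors, which is what lowers $\rho$.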

Implement the function that creates bootstrap samples from the training data.

In [None]:
def get_bootstrap_samples(df: pd.DataFrame, n_samples: int, sample_fraction: float, target: str) -> List[pd.DataFrame]:
    """Generate bootstrap samples using the given dataframe.

    Args:
        df (pd.DataFrame): The dataframe to generate bootstrap samples from.
        n_samples (int): The number of bootstrap samples to generate.
        sample_fraction (float): The fraction of the dataframe to use for each sample with replacement. (between 0 and 1)
        target (str): The name of the target column.

    Returns:
        List[pd.DataFrame]: A list of bootstrap samples.
    """

    bootstrap_samples = []
    for i in range(n_samples):
        bootstrap_sample = df.sample(frac=sample_fraction, replace=True)
        bootstrap_samples.append(bootstrap_sample)

    
    return bootstrap_samples

Now take 'n' bootstrap samples from the training data and train a `Decision Tree` classifier on each of them. Make sure you use the `DecisionTree` class you implemented above to train the models and store the trained trees in a list.

In [None]:
n_trees = 10
sample_fraction = 0.8
target = "Party" # TODO: Set the target column
trees = []

bootstrap_samples = get_bootstrap_samples(df_train, n_trees, sample_fraction, target)

# Write your code below

# Create a decision tree for each bootstrap sample and append it to the trees list
for _s in bootstrap_samples:
    tree = DecisionTree()
    tree.fit(_s, features, target)
    trees.append(tree)

# Write code to predict the class of the current set of features and append the prediction to the preds list
# Also append the decisions made by the model to the decisions list
preds = []
decisions = []
for _, row in df_val.iterrows():
    decision_importance = []
    for tree in trees:
        pred, decision = tree.predict(row)
        decision_importance.append(pred)
        decisions.append(decision)
    
    pred = max(set(decision_importance), key=decision_importance.count)
    preds.append(pred)

Use the trained trees to make predictions on the validation data and report the accuracy and [F1 Score](https://en.wikipedia.org/wiki/F-score) on the validation data. To make the predictions, follow these steps:
- For each data point in the validation set, make predictions using each of the trained trees (you will have `n_trees` predictions for each data point).
- Combine the predictions of the classifiers using majority voting.
- E.g., if you have 5 trees and 3 of them predict __class 1__ and 2 of them predict __class 2__, then the final prediction will be __class 1__.
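
The majority-voting step described above can be sketched with `collections.Counter`. The per-tree predictions here are made-up toy values, and `majority_vote` is a hypothetical helper name.

```python
from collections import Counter
from typing import List

def majority_vote(votes: List[int]) -> int:
    """Returns the most frequent label among the votes."""
    # most_common(1) returns the (label, count) pair with the highest count;
    # ties go to the label that was encountered first
    return Counter(votes).most_common(1)[0][0]

# Toy predictions: one inner list per tree, one entry per validation row
per_tree_predictions = [
    [1, 0, 1],
    [1, 0, 0],
    [0, 0, 1],
    [1, 1, 1],
    [0, 0, 0],
]

# zip(*...) regroups the predictions so that each tuple holds all votes for one row
final_predictions = [majority_vote(list(votes)) for votes in zip(*per_tree_predictions)]
print(final_predictions)
```

Using an odd number of trees avoids ties for a binary target.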

In [None]:
# predict on validation data

aggregate_decisions = []

for tree in trees:
    y_pred = []
    for idx, row in df_val.iterrows():
        y_pred.append(tree.predict(row)[0])
    aggregate_decisions.append(y_pred)


# combine the predictions from all the trees
aggregate_preds = []
for i in range(len(aggregate_decisions[0])):
    votes = []
    for j in range(len(aggregate_decisions)):
        votes.append(aggregate_decisions[j][i])
    aggregate_preds.append(max(set(votes), key=votes.count))

# calculate F1 score manually
# store the result under a new name so that the score() function is not overwritten
bagging_f1 = score(y_actual, aggregate_preds)
print(f"the F1 score is {bagging_f1}")

## Random Forest Algorithm

`Random Forest` is a special type of bagging algorithm where we train each classifier not only on a bootstrap sample of the training data but also on a random subset of the features. This is done to further decrease the correlation between the classifiers. Concretely,

- Given a training set $X = x_1, x_2, \ldots, x_n$ with labels $Y = y_1, y_2, \ldots, y_n$.
- We create $k$ bootstrap samples $X_1, X_2, \ldots, X_k$ by sampling $r$ examples from $X$ uniformly at random with replacement.
- Each time we create a bootstrap sample, we also select a random subset of the features $F_i$ to train the classifier on.
- We train $k$ different classifiers $h_1, h_2, \ldots, h_k$, where $h_i$ is trained on the bootstrap sample $X_i$ using only the features in $F_i$.
- We combine the predictions of each classifier using majority voting.

Implement this new version of bootstrap sampling below.

In [None]:
def get_random_forest_bootstrap_samples(df: pd.DataFrame, n_samples: int, sample_fraction: float, feature_fraction: float, target: str) -> Tuple[List[pd.DataFrame], List[List[str]]]:
    """Generate bootstrap samples, each restricted to a random subset of the features.

    Args:
        df (pd.DataFrame): The dataframe to generate bootstrap samples from.
        n_samples (int): The number of bootstrap samples to generate.
        sample_fraction (float): The fraction of the dataframe to use for each sample with replacement. (between 0 and 1)
        feature_fraction (float): The fraction of features to use for each sample. (between 0 and 1)
        target (str): The name of the target column.

    Returns:
        List[pd.DataFrame]: A list of bootstrap samples.
        List[List[str]]: For each sample, the list of columns selected for it.
    """

    samples = []
    features = []

    target_features = df.columns.tolist()
    target_features.remove(target)

    for i in range(n_samples):
        bootstrap_sample = df.sample(frac=sample_fraction, replace=True)

        # pick a random subset of the feature columns (the target is not counted as a feature)
        n_features = int(feature_fraction * len(target_features))
        feats = np.random.choice(target_features, n_features, replace=False).tolist()
        features.append(feats)

        # keep only the selected features plus the target column
        bootstrap_sample = bootstrap_sample[feats + [target]]
        samples.append(bootstrap_sample)
    
    return samples, features

Now take 'n' bootstrap samples from the training data and train a `Decision Tree` classifier on each of them. Make sure you use the `DecisionTree` class you implemented above to train the models and store the trained trees in a list.

In [None]:
n_trees = 10
sample_fraction = 0.8
feature_fraction = 0.7
target = "Party" # TODO: Set the target column
trees = []

bootstrap_samples, sample_features = get_random_forest_bootstrap_samples(df_train, n_trees, sample_fraction, feature_fraction, target)

# Write your code below
print(bootstrap_samples[0])

# # Create a decision tree for each bootstrap sample and append it to the trees list
# for i in range(len(bootstrap_samples)):
#     tree = DecisionTree()
#     tree.fit(bootstrap_samples[i], sample_features[i], target)
#     trees.append(tree)




     Feature-11  Feature-12  Feature-5  Feature-3  Feature-9  Feature-15  \
226           1           1          0          1          1           1   
8             0           0          0          1          1           1   
248           0           0          1          0          0           0   
8             0           0          0          1          1           1   
193           1           0          0          1          1           1   
..          ...         ...        ...        ...        ...         ...   
303           1           1          1          0          0           0   
149           0           1          1          1          1           1   
246           1           1          1          0          0           0   
83            0           1          1          1          0           0   
111           1           0          0          1          1           1   

     Feature-10  Feature-4  Feature-6  Feature-13  Feature-8  Party  
226           0  

Use the trained trees to make predictions on the validation data and report the accuracy and [F1 Score](https://en.wikipedia.org/wiki/F-score) on the validation data. Use the same prediction method as in bagging. Try modifying the different parameters to get the best results on the validation data.

Ideas to try out once you have a working implementation and a score on the leaderboard:
- Try limiting the depth of the trees
- Try using different voting methods (weighted voting, etc.)
- Try using different methods to select the best split (information gain, gini index, etc.)
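
For the last idea, the Gini index can be swapped in wherever entropy is used as the impurity measure. The following is a minimal sketch comparing the two on a vector of class probabilities; `entropy` and `gini` are hypothetical standalone helpers, not methods of the `Node` class above.

```python
import numpy as np

def entropy(probabilities) -> float:
    """Shannon entropy in bits; zero-probability classes contribute nothing."""
    p = np.asarray(probabilities, dtype=float)
    p = p[p > 0]  # drop zeros so that log2 is well defined
    return float(-np.sum(p * np.log2(p)))

def gini(probabilities) -> float:
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    p = np.asarray(probabilities, dtype=float)
    return float(1.0 - np.sum(p ** 2))

# Both measures are zero for a pure node and largest for an even split
for probs in ([1.0, 0.0], [0.9, 0.1], [0.5, 0.5]):
    print(probs, f"entropy={entropy(probs):.3f}", f"gini={gini(probs):.3f}")
```

Both measures rank splits similarly in most cases, so any improvement from switching is not guaranteed.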

Try these ideas **ONLY AFTER** you have a working implementation and a score on the leaderboard. These ideas are not guaranteed to improve your score but are worth trying out.

In [None]:
df_test = pd.read_csv('test.csv')
# encode the categorical columns
d = {'y': 1, 'n': 0}
for col in df_test.columns:
    if col != 'ID':
        df_test[col] = df_test[col].map(d)
        
    
        
df_test.head()

Unnamed: 0,ID,Feature-1,Feature-2,Feature-3,Feature-4,Feature-5,Feature-6,Feature-7,Feature-8,Feature-9,Feature-10,Feature-11,Feature-12,Feature-13,Feature-14,Feature-15,Feature-16
0,1,0,1,0,1,1,1,0,0,0,1,0,1,1,1,0,0
1,2,0,1,0,1,1,1,0,0,0,0,0,1,1,1,0,1
2,3,1,1,0,1,1,1,0,0,0,0,1,1,1,1,0,1
3,4,1,1,1,0,0,0,1,1,1,0,1,0,0,0,0,1
4,5,1,1,1,0,0,0,1,1,0,1,0,0,0,0,0,1


## Prediction on Test Data

In [None]:
# predict using bootstrap aggregation
n_trees = 10
sample_fraction = 0.8
target = "Party" 
trees = []

bootstrap_samples = get_bootstrap_samples(df_train, n_trees, sample_fraction, target)

# Write your code below

# Create a decision tree for each bootstrap sample and append it to the trees list
for _s in bootstrap_samples:
    tree = DecisionTree()
    tree.fit(_s, features, target)
    trees.append(tree)

# Write code to predict the class of the current set of features and append the prediction to the preds list
# Also append the decisions made by the model to the decisions list
preds = []
decisions = []
for _, row in df_test.iterrows():
    decision_importance = []
    for tree in trees:
        pred, decision = tree.predict(row)
        decision_importance.append(pred)
        decisions.append(decision)
    
    pred = max(set(decision_importance), key=decision_importance.count)
    preds.append(pred)

# add column to test data
df_test['Party'] = preds

# create df_submission with only ID and Party columns
df_submission = df_test[['ID', 'Party']]
df_submission.to_csv('submission.csv', index=False)
df_submission.head()




Unnamed: 0,ID,Party
0,1,1
1,2,1
2,3,1
3,4,0
4,5,0


Pick your ensemble model and use it to make predictions on the test data.

## Submission Cells

We will now zip and prepare the notebook and csv for submission.

Preliminary checks to ensure `submission.csv` is in the correct format.

In [None]:
df_temp = pd.read_csv('submission.csv')
test_temp = pd.read_csv('test.csv')
assert len(df_temp.columns) == 2, "Number of columns in the submission file is not correct, check the submission format"
assert list(df_temp.columns) == ['ID', 'Party'] , "Column names are not correct, check the submission format"
assert df_temp['Party'].isin([0, 1]).all(), "The prediction should be 0 or 1 only"
assert len(df_temp) == len(test_temp), "Number of rows in the submission file is not correct"

Making the submission zip ready<br>
Note: Ensure that your notebook has been saved up to this point with the name eval.ipynb

In [None]:
import shutil
import os

if not os.path.exists('temp'):
    os.makedirs('temp')

if os.path.exists('submission.csv'):
    shutil.copy('submission.csv','temp/submission.csv')

if os.path.exists('eval.ipynb'):
    shutil.copy('eval.ipynb',os.path.join('temp','eval.ipynb'))

shutil.make_archive('submission', 'zip', 'temp')
shutil.rmtree('temp')

Submit the `submission.zip` file to Kaggle.