# ID2214/FID3214 Assignment 3 Group no. [3]
### Project members: 
[Francesco Gelain, gelain@kth.se] 
[Borja Javierre, javierre@kth.se] 
[Jingyi Hu, jingyihu@kth.se]

### Declaration
By submitting this solution, it is hereby declared that all individuals listed above have contributed to the solution, either with code that appear in the final solution below, or with code that has been evaluated and compared to the final solution, but for some reason has been excluded. It is also declared that all project members fully understand all parts of the final solution and can explain it upon request.

It is furthermore declared that the code below is a contribution by the project members only, and specifically that no part of the solution has been copied from any other source (except for lecture slides at the course ID2214/FID3214) and no part of the solution has been provided by someone not listed as project member above.

It is furthermore declared that it has been understood that no other library/package than the Python 3 standard library, NumPy, pandas, time and sklearn.tree, may be used in the solution for this assignment.

### Instructions
All parts of the assignment starting with number 1 below are mandatory. Satisfactory solutions
will give 1 point (in total). If they in addition are good (all parts work more or less 
as they should), completed on time (submitted before the deadline in Canvas) and according
to the instructions, together with satisfactory solutions of all parts of the assignment starting 
with number 2 below, then the assignment will receive 2 points (in total).

Note that you do not have to develop the code directly within the notebook
but may instead copy the comments and test cases to a more convenient development environment
and when everything works as expected, you may paste your functions into this
notebook, do a final testing (all cells should succeed) and submit the whole notebook 
(a single file) in Canvas (do not forget to fill in your group number and names above).

## Load NumPy, pandas, time and DecisionTreeClassifier from sklearn.tree

In [14]:
import numpy as np
import pandas as pd
import time
import sklearn
from sklearn.tree import DecisionTreeClassifier

from sklearn.preprocessing import OneHotEncoder

In [15]:
from platform import python_version

print(f"Python version: {python_version()}")
print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")
print(f"sklearn version: {sklearn.__version__}")

Python version: 3.9.13
NumPy version: 1.21.5
Pandas version: 1.4.4
sklearn version: 1.0.2


## Reused functions from Assignment 1

In [51]:
# Copy and paste functions from Assignment 1 here that you need for this assignment
def column_filter(df): 

    filtered_df = df.copy() #copy input dataframe

    #iterate through all columns and consider to drop a column only if it is not labeled "CLASS" or "ID"
    #you may check the number of unique (non-missing) values in a column by applying the pandas functions
    #dropna and unique to drop missing values and get the unique (remaining) values
    filtered_df = filtered_df.dropna(how = 'all', axis = 1)
    for col in filtered_df.columns:
        if col != "CLASS" and col != "ID":
            if filtered_df[col].dropna().unique().size == 1:
                filtered_df = filtered_df.drop(col, axis=1)

    column_filter = filtered_df.columns #list of the names of the remaining columns, including "CLASS" and "ID"
    
    return filtered_df, column_filter

def apply_column_filter(df, column_filter):

    filtered_new_df = df.copy() #copy input dataframe

    #drop each column that is not included in column_filter
    for col in filtered_new_df.columns:
        if col not in column_filter:
            filtered_new_df = filtered_new_df.drop(col, axis=1)

    return filtered_new_df

def imputation(df):
    df_temp = df.copy()
    values = {}
    for column in df_temp:
        columnSeriesObj = df_temp[column]
        if columnSeriesObj.dtype == int or columnSeriesObj.dtype == float:
             values[column] = columnSeriesObj.mean()
        elif columnSeriesObj.dtype == object:
             values[column] = columnSeriesObj.mode()[0]

    df_temp.fillna(value=values, inplace=True)

    return df_temp, values

def apply_imputation(df,imputation):
    df_temp = df.copy()
    values = imputation
    df_temp.fillna(value=values, inplace=True)
    return df_temp

def one_hot(df):

    new_df = df.copy() #copy input dataframe

    one_hot = {} #a mapping (dictionary) from column name to a set of categories (possible values for the feature)

    for col in new_df.columns:
        if (new_df[col].dtype == "object" or new_df[col].dtype == "category") and col != "CLASS" and col !="ID":
            one_hot[col] = set(new_df[col])
            for value in one_hot[col]:
                new_df[col + "_" + value] = (new_df[col] == value).astype(int)
            new_df = new_df.drop(col, axis=1)

    return new_df, one_hot

def apply_one_hot(df, one_hot):
    
    new_df = df.copy() #copy input dataframe

    for col in new_df.columns:
        if (new_df[col].dtype == "object" or new_df[col].dtype == "category") and col != "CLASS" and col !="ID":
            for value in one_hot[col]:
                new_df[col + "_" + value] = (new_df[col] == value).astype(int)
            new_df = new_df.drop(col, axis=1)

    return new_df

def auc(df, correctlabels):
    df_temp = df.copy()
    values = {}
    auc_percolumn = 0
    df_truepositive = pd.DataFrame(np.zeros((len(df_temp), len(df_temp.columns))), columns=df_temp.columns)
    for i in range(len(correctlabels)):
        df_truepositive.loc[i, correctlabels[i]] = 1

    for column in df_temp:
        columnseriesobj = df_temp[column]
        columntruepositive = df_truepositive[column]
        columnfalsepositive = columntruepositive.copy()
        for i in range(len(columntruepositive)):
            if columntruepositive[i] == 0:
                columnfalsepositive[i] = 1
            elif columntruepositive[i] == 1:
                columnfalsepositive[i] = 0
        df_auc_temp = pd.DataFrame({"s": columnseriesobj, "tp": columntruepositive, "fp": columnfalsepositive})        
        
        agg_functions = {'tp': 'sum', 'fp': 'sum'}
        df_auc_temp = df_auc_temp.groupby(df_auc_temp['s']).aggregate(agg_functions)
        df_auc_temp = df_auc_temp.sort_values(by = 's', ascending = False)

        # return auc_score
        AUC = 0
        cov_tp = 0
        tot_tp = np.sum(columntruepositive)
        tot_fp = np.sum(columnfalsepositive)

        for idx in df_auc_temp.index:
            if df_auc_temp["fp"][idx] == 0:
                cov_tp += df_auc_temp["tp"][idx]
            elif df_auc_temp["tp"][idx] == 0:
                AUC += (cov_tp / tot_tp) * (df_auc_temp["fp"][idx] / tot_fp)
            else:
                AUC += (cov_tp / tot_tp) * (df_auc_temp["fp"][idx] / tot_fp) + (df_auc_temp["tp"][idx] / tot_tp) * (
                    df_auc_temp["fp"][idx] / tot_fp) / 2
                cov_tp += df_auc_temp["tp"][idx]

        #AUC = metrics.roc_auc_score(columntruepositive, columnseriesobj)
        values[column] = AUC

    auc_score = 0
    for i in values:
        count = 0
        for output in correctlabels:
            if i == output:
                count += 1
        frequency = count / len(correctlabels)
        auc_score += values[i] * frequency
    return auc_score

def accuracy(df, correctlabels):
    df_temp = df.copy()
    count = 0
    outputlabels = df_temp.idxmax(axis = 1)
    for i in range(outputlabels.size):
        if correctlabels[i] == outputlabels[i]:
            count += 1
    accuracy = count/outputlabels.size
        
    return accuracy

def brier_score(df, correctlabels):
    df_temp = df.copy()
    brier_score = 0
    mean = 0
    df_correct = pd.DataFrame(np.zeros((len(df), len(np.unique(correctlabels)))), columns=np.unique(correctlabels))
    for i in range(len(correctlabels)):
        df_correct.loc[i, correctlabels[i]] = 1
    for column in df_correct:
        columnSeriesObj = df_correct[column]
        for i in range(columnSeriesObj.size):
            brier_score += (df_correct.loc[i, column] - df_temp.loc[i, column])**2
    brier_score = brier_score/len(df)
    
    return brier_score

## 1. Define the class RandomForest

In [79]:
# Define the class RandomForest with three functions __init__, fit and predict (after the comments):
#
# Input to __init__: 
# self - the object itself
#
# Output from __init__:
# <nothing>
#
# This function does not return anything but just initializes the following attributes of the object (self) to None:
# column_filter, imputation, one_hot, labels, model

class RandomForest:
    def __init__(self):
        column_filter = None
        imputation = None
        one_hot = None
        labels = None
        model = None

# Input to fit:
# self      - the object itself
# df        - a dataframe (where the column names "CLASS" and "ID" have special meaning)
# no_trees  - no. of trees in the random forest (default = 100)
#
# Output from fit:
# <nothing>
#
# The result of applying this function should be:
#
# self.column_filter - a column filter (see Assignment 1) from df
# self.imputation    - an imputation mapping (see Assignment 1) from df
# self.one_hot       - a one-hot mapping (see Assignment 1) from df
# self.labels        - a (sorted) list of the categories of the "CLASS" column of df
# self.model         - a random forest, consisting of no_trees trees, where each tree is generated from a bootstrap sample
#                      and the number of evaluated features is log2|F| where |F| is the total number of features
#                      (for details, see lecture slides)
#
# Note that the function does not return anything but just assigns values to the attributes of the object.
#
# Hint 1: First create the column filter, imputation and one-hot mappings
#
# Hint 2: Then get the class labels and the numerical values (as an ndarray) from the dataframe after dropping the class labels 
#
# Hint 3: Generate no_trees classification trees, where each tree is generated using DecisionTreeClassifier 
#         from a bootstrap sample (see lecture slides), e.g., generated by np.random.choice (with replacement) 
#         from the row numbers of the ndarray, and where a random sample of the features are evaluated in
#         each node of each tree, of size log2(|F|), where |F| is the total number of features;
#         see the parameter max_features of DecisionTreeClassifier

    def fit(self, df, no_trees=100):

        df_copy = df.copy() # make a copy of the dataframe

        filtered_df, self.column_filter = column_filter(df) # apply column filter
        df_temp, self.imputation = imputation(filtered_df) # apply imputation
        new_df, self.one_hot = one_hot(df_temp) # apply one-hot encoding
        
        training_labels = df["CLASS"].astype("category")
        self.labels = list(training_labels.cat.categories)#get values of class labels
        print(self.labels)

        #here we generate the random forest. Uses no_trees and df_onehot to generate a forest of trees
        random_forest = [] #list of trees
        for tree in range(no_trees):

            #generate the indices for the bootstrap sample
            rows = [idx for idx in range(len(new_df))] #list of row indices
            dflength = len(new_df) #number of instances in the bootstrap 

            randomsamples = np.random.choice(rows, size=dflength, replace=True) #generate indices of the bootstrap sample
            

            inputlabel = new_df["CLASS"].values #get class labels for the bootstrap sample as the values of the "CLASS" column
            inputlabel = inputlabel[randomsamples] #get class labels for the bootstrap sample

            inputX = new_df.drop(columns=["CLASS"]).values #get the instances for the bootstrap sample
            inputX = inputX[randomsamples, :] #get the instances for the bootstrap sample

            #generate the tree
            clf = DecisionTreeClassifier(max_features=int(np.log2(inputX.shape[1]))) #with max_features as number of features to be evaluated in each node
            #print(bootstrap_instances)
            #bootstrap_instances_onehot = apply_one_hot(new_df, self.one_hot) #apply one-hot to the bootstrap instances
            clf.fit(inputX, inputlabel) #fit the tree to the bootstrap sample
            random_forest.append(clf) #add the generated tree to the forest
        
        self.model = random_forest

# Input to predict:
# self - the object itself
# df   - a dataframe
# 
# Output from predict:
# predictions - a dataframe with class labels as column names and the rows corresponding to
#               predictions with estimated class probabilities for each row in df, where the class probabilities
#               are the averaged probabilities output by each decision tree in the forest
#
# Hint 1: Drop any "CLASS" and "ID" columns of the dataframe first and then apply column filter, imputation and one_hot
#
# Hint 2: Iterate over the trees in the forest to get the prediction of each tree by the method predict_proba(X) where 
#         X are the (numerical) values of the transformed dataframe; you may get the average predictions of all trees,
#         by first creating a zero-matrix with one row for each test instance and one column for each class label, 
#         to which you add the prediction of each tree on each iteration, and then finally divide the prediction matrix
#         by the number of trees.
#
# Hint 3: You may assume that each bootstrap sample that was used to generate each tree has included all possible
#         class labels and hence the prediction of each tree will contain probabilities for all class labels
#         (in the same order). Note that this assumption may be violated, and this limitation will be addressed 
#         in the next part of the assignment.

    def predict(self, df):

        df_copy = df.copy() #copy the dataframe to a new dataframe (as done in Assignment 1 and 2)
        
        df_drop = df_copy.drop(["CLASS"], axis=1) #dropping the CLASS column (we can't drp ID because we don't have that column on this dataset)
        filtered_df = apply_column_filter(df_drop, self.column_filter) #applying the column filter
        imputated_df = apply_imputation(filtered_df, self.imputation) #applying the imputation
        onehot_df = apply_one_hot(imputated_df, self.one_hot) #applying the one-hot
        input_data_values = onehot_df.values

        predictions = np.zeros((input_data_values.shape[0], len(self.labels)), dtype="float64") #list of predictions for each tree in the forest and list of average predictions for each tree in the forest
        num_trees = 0 #counter for the number of trees in the forest
        for clf in (self.model): #iterating over the trees in the forest
            X = input_data_values
            #print("Shape of X: ", np.shape(X))
            #print("Shape of predictions: ", np.shape(predictions))
            #print("Shape of predictions2: ", np.shape(clf.predict_proba(X)))
            predictions = predictions + clf.predict_proba(X) #appending the predictions of each tree in the forest
            num_trees = num_trees + 1 #incrementing the number of trees in the forest
        predictions = predictions / num_trees #dividing the predictions by the number of trees in the forest to get the average predictions
        predictions = pd.DataFrame(predictions, columns=self.labels) #averaging the predictions of each tree in the forest

        return predictions

In [80]:
# Test your code (leave this part unchanged, except for if auc is undefined)

train_df = pd.read_csv("tic-tac-toe_train.csv")

test_df = pd.read_csv("tic-tac-toe_test.csv")

rf = RandomForest()

t0 = time.perf_counter()
rf.fit(train_df)
print("Training time: {:.2f} s.".format(time.perf_counter()-t0))

test_labels = test_df["CLASS"]

t0 = time.perf_counter()
predictions = rf.predict(test_df)

print("Testing time: {:.2f} s.".format(time.perf_counter()-t0))

print("Accuracy: {:.4f}".format(accuracy(predictions,test_labels)))
print("AUC: {:.4f}".format(auc(predictions,test_labels))) # Comment this out if not implemented in assignment 1
print("Brier score: {:.4f}".format(brier_score(predictions,test_labels))) # Comment this out if not implemented in assignment 1

Training time: 0.16 s.
Testing time: 0.02 s.
Accuracy: 0.9165
AUC: 0.9895
Brier score: 0.1818


In [81]:
train_labels = train_df["CLASS"]
predictions = rf.predict(train_df)
print("Accuracy on training set: {0:.4f}".format(accuracy(predictions,train_labels)))
print("AUC on training set: {0:.4f}".format(auc(predictions,train_labels))) # Comment this out if not implemented in assignment 1
print("Brier score on training set: {0:.4f}".format(brier_score(predictions,train_labels))) # Comment this out if not implemented in assignment 1

Accuracy on training set: 1.0000
AUC on training set: 1.0000
Brier score on training set: 0.0212


### Comment on assumptions, things that do not work properly, etc.
The accuracy on the training set is very high (the highest possible) as well as the AUC using this set, meaning that this value is not as accurate as in the previous section.

## 2a. Handling trees with non-aligned predictions

In [108]:
# Define a revised version of the class RandomForest with the same input and output as described in part 1 above,
# where the predict function is able to handle the case where the individual trees are trained on bootstrap samples
# that do not include all class labels in the original training set. This leads to that the class probabilities output
# by the individual trees in the forest do not refer to the same set of class labels.
#
# Hint 1: The categories obtained with <pandas series>.cat.categories are sorted in the same way as the class labels
#         of a DecisionTreeClassifier; the latter are obtained by <DecisionTreeClassifier>.classes_ 
#         The problem is that classes_ may not include all possible labels, and hence the individual predictions 
#         obtained by <DecisionTreeClassifier>.predict_proba may be of different length or even if they are of the same
#         length do not necessarily refer to the same class labels. You may assume that each class label that is not included
#         in a bootstrap sample should be assigned zero probability by the tree generated from the bootstrap sample. 
#
# Hint 2: Create a mapping from the complete (and sorted) set of class labels l0, ..., lk-1 to a set of indexes 0, ..., k-1,
#         where k is the number of classes
#
# Hint 3: For each tree t in the forest, create a (zero) matrix with one row per test instance and one column per class label,
#         to which one column is added at a time from the output of t.predict_proba 
#
# Hint 4: For each column output by t.predict_proba, its index i may be used to obtain its label by t.classes_[i];
#         you may then obtain the index of this label in the ordered list of all possible labels from the above mapping (hint 2); 
#         this index points to which column in the prediction matrix the output column should be added to 

class RandomForest:
    def __init__(self):
        column_filter = None
        imputation = None
        one_hot = None
        labels = None
        model = None
    
    def fit(self, df, no_trees=100):

        df_copy = df.copy() # make a copy of the dataframe

        filtered_df, self.column_filter = column_filter(df) # apply column filter
        df_temp, self.imputation = imputation(filtered_df) # apply imputation
        new_df, self.one_hot = one_hot(df_temp) # apply one-hot encoding
        
        training_labels = df["CLASS"].astype("category")
        self.labels = list(training_labels.cat.categories)#get values of class labels
        #print(self.labels)

        #here we generate the random forest. Uses no_trees and df_onehot to generate a forest of trees
        random_forest = [] #list of trees
        for tree in range(no_trees):

            #generate the indices for the bootstrap sample
            rows = [idx for idx in range(len(new_df))] #list of row indices
            dflength = len(new_df) #number of instances in the bootstrap 

            randomsamples = np.random.choice(rows, size=dflength, replace=True) #generate indices of the bootstrap sample
            

            inputlabel = new_df["CLASS"].values #get class labels for the bootstrap sample as the values of the "CLASS" column
            inputlabel = inputlabel[randomsamples] #get class labels for the bootstrap sample

            inputX = new_df.drop(columns=["CLASS"]).values #get the instances for the bootstrap sample
            inputX = inputX[randomsamples, :] #get the instances for the bootstrap sample

            #generate the tree
            clf = DecisionTreeClassifier(max_features=int(np.log2(inputX.shape[1]))) #with max_features as number of features to be evaluated in each node
            #print(bootstrap_instances)
            #bootstrap_instances_onehot = apply_one_hot(new_df, self.one_hot) #apply one-hot to the bootstrap instances
            clf.fit(inputX, inputlabel) #fit the tree to the bootstrap sample
            random_forest.append(clf) #add the generated tree to the forest
        
        self.model = random_forest

    def predict(self, df):

        df_copy = df.copy() #copy the dataframe to a new dataframe (as done in Assignment 1 and 2)
        
        df_drop = df_copy.drop(["CLASS"], axis=1) #dropping the CLASS column (we can't drp ID because we don't have that column on this dataset)
        filtered_df = apply_column_filter(df_drop, self.column_filter) #applying the column filter
        imputated_df = apply_imputation(filtered_df, self.imputation) #applying the imputation
        onehot_df = apply_one_hot(imputated_df, self.one_hot) #applying the one-hot
        input_data_values = onehot_df.values

        predictions = np.zeros((input_data_values.shape[0], len(self.labels)), dtype="float64") #list of predictions for each tree in the forest and list of average predictions for each tree in the forest
        num_trees = 0 #counter for the number of trees in the forest
        for clf in (self.model): #iterating over the trees in the forest
            X = input_data_values
            #print("Shape of X: ", np.shape(X))
            #print("Shape of predictions: ", np.shape(predictions))
            #print("Shape of predictions2: ", np.shape(clf.predict_proba(X)))
            #print(predictions)
            
            #2a main implementation part
            roughprediction = clf.predict_proba(X)
            treelabel = clf.classes_ #get the labels for tree
            df_prediction = pd.DataFrame(roughprediction, columns = treelabel) #create a dataframe for tree output prediction, easier to add column
            for i in self.labels:
                if i in df_prediction.columns:
                    pass 
                else:  #each class label that is not included
                    df_prediction.insert(self.labels.index(i), i, np.zeros(input_data_values.shape[0]), True) #assigned zero probability, hint 1 and hint3
            #print(df_prediction)
            predictions = predictions + df_prediction.values #appending the predictions of each tree in the forest, change df back to ndarrays
            num_trees = num_trees + 1 #incrementing the number of trees in the forest
        predictions = predictions / num_trees #dividing the predictions by the number of trees in the forest to get the average predictions
        predictions = pd.DataFrame(predictions, columns=self.labels) #averaging the predictions of each tree in the forest

        return predictions

train_df = pd.read_csv("anneal_train.csv")

test_df = pd.read_csv("anneal_test.csv")

rf = RandomForest()

t0 = time.perf_counter()
rf.fit(train_df)
print("Training time: {:.2f} s.".format(time.perf_counter()-t0))

test_labels = test_df["CLASS"]

t0 = time.perf_counter()
predictions = rf.predict(test_df)

Training time: 0.15 s.


In [109]:
# Test your code (leave this part unchanged, except for if auc is undefined)

train_df = pd.read_csv("anneal_train.csv")

test_df = pd.read_csv("anneal_test.csv")

rf = RandomForest()

t0 = time.perf_counter()
rf.fit(train_df)
print("Training time: {:.2f} s.".format(time.perf_counter()-t0))

test_labels = test_df["CLASS"]

t0 = time.perf_counter()
predictions = rf.predict(test_df)
print("Testing time: {:.2f} s.".format(time.perf_counter()-t0))

print("Accuracy: {:.4f}".format(accuracy(predictions,test_labels)))
print("AUC: {:.4f}".format(auc(predictions,test_labels))) # Comment this out if not implemented in assignment 1
print("Brier score: {:.4f}".format(brier_score(predictions,test_labels))) # Comment this out if not implemented in assignment 1

Training time: 0.15 s.
Testing time: 0.04 s.
Accuracy: 0.9488
AUC: 0.9698
Brier score: 0.1018


## 2b. Estimate predictive performance using out-of-bag predictions

In [80]:
# Define an extended version of the class RandomForest with the same input and output as described in part 2a above,
# where the results of the fit function also should include:
# self.oob_acc - the accuracy estimated on the out-of-bag predictions, i.e., the fraction of training instances for 
#                which the given (correct) label is the same as the predicted label when using only trees for which
#                the instance is out-of-bag
#
# Hint 1: You may first create a zero matrix with one row for each training instance and one column for each class label
#         and one zero vector to allow for storing aggregated out-of-bag predictions and the number of out-of-bag predictions
#         for each training instance, respectively. By "aggregated out-of-bag predictions" is here meant the sum of all 
#         predicted probabilities (one sum per class and instance). These sums should be divided by the number of predictions
#         (stored in the vector) in order to obtain a single class probability distribution per training instance. 
#         This distribution is considered to be the out-of-bag prediction for each instance, and e.g., the class that 
#         receives the highest probability for each instance can be compared to the correct label of the instance, 
#         when calculating the accuracy using the out-of-bag predictions.
#
# Hint 2: After generating a tree in the forest, iterate over the indexes that were not included in the bootstrap sample
#         and add a prediction of the tree to the out-of-bag prediction matrix and update the count vector
#
# Hint 3: Note that the input to predict_proba has to be a matrix; from a single vector (row) x, a matrix with one row
#         can be obtained by x[None,:]
#
# Hint 4: Finally, divide each row in the out-of-bag prediction matrix with the corresponding element of the count vector
#
#         For example, assuming that we have two class labels, then we may end up with the following matrix:
#
#         2 4
#         4 4
#         5 0
#         ...
#
#         and the vector (no. of predictions) (6, 8, 5, ...)
#
#         The resulting class probability distributions are:
#
#         0.333... 0.666...
#         0.5 0.5
#         1.0 0

class RandomForest:

    def __init__(self):
        column_filter = None
        imputation = None
        one_hot = None
        labels = None
        model = None

    def fit(self, df, no_trees=100):

        df_copy = df.copy() #copy the dataframe to a new dataframe (as done in Assignment 1 and 2)
        self.labels = df_copy["CLASS"] #get the labels from the dataframe
        labels_set = sorted(set(self.labels)) #get the possible labels for the dataset (which are the unique labels)

        training_instances = df_copy.shape[0] #get the number of training instances
        class_labels = len(labels_set) #get the number of class labels
        print("Number of training instances: ", training_instances)
        print("Number of class labels: ", class_labels)
        out_of_bag_matrix = np.zeros((training_instances, class_labels)) #create a matrix with the same number of rows as the dataframe and the same number of columns as the number of possible labels
        out_of_bag_zerovector = np.zeros(len(df_copy)) #vector with the same length as the number of rows in the dataframe used to store aggregated out-of-bag predictions and the number of out-of-bag predictions for each training instance, respectively
        out_of_bag_predictions = pd.DataFrame(columns=labels_set, data=out_of_bag_matrix) #create a vector for the predictions with the same length as the number of rows in the dataframe
        
        #apply, as usual, column filter, imputation and one hot encoding to the copy of the dataframe
        filtered_df, self.column_filter = column_filter(df_copy) #applying the column filter
        imputated_df, self.imputation = imputation(filtered_df) #applying the imputation
        onehot_df, self.one_hot = one_hot(imputated_df) #applying one-hot encoding
        onehot_df.drop(columns=["CLASS"], inplace=True) #drop the CLASS column from the dataframe

        self.model = [
                        DecisionTreeClassifier(max_features = "log2", max_depth = 10) 
                        for i in range(no_trees)
                    ] #create a list of trees, i.e. a forest
        
        #iterate over the trees in the forest. iterate over the indexes that were not included in the bootstrap sample
        #and add a prediction of the tree to the out-of-bag prediction matrix and update the count vector
        for tree in self.model:
            #bootstrap sample
            index_subsapling = np.random.choice(onehot_df.index.tolist(), size=300, replace=True)
            tree.fit(X=onehot_df.iloc[index_subsapling], y=self.labels.iloc[index_subsapling]) #fit the tree with the bootstrap sample
            #bootstrap_samples = np.random.choice(rows, size=num_instances, replace=True) #generate indices of the bootstrap sample

            out_of_bag_index = sorted(set(onehot_df.index.tolist()).difference(set(index_subsapling))) #get the indices of the out-of-bag instances

            predictions = pd.DataFrame(data = tree.predict_proba(X = onehot_df.iloc[out_of_bag_index]), columns = tree.classes_) #get the tree predictions for the out-of-bag instances

            predictions.set_index(pd.Series(out_of_bag_index), inplace=True) #set the index of the tree predictions to the index of the out-of-bag original instances indices

            out_of_bag_predictions = out_of_bag_predictions.add(predictions, fill_value=0) #add the tree predictions to the out-of-bag predictions matrix

            for i in out_of_bag_index:
                out_of_bag_zerovector[i] = out_of_bag_zerovector[i] + 1 #update the count vector
        
        #divide each row with the correspondign element of the count vector
        out_of_bag_predictions = out_of_bag_predictions.divide(out_of_bag_zerovector, axis = "rows", fill_value = 0).fillna(0) #sum should be divided by the number of predictions
        
        self.oob_acc = accuracy(out_of_bag_predictions, self.labels) #calculate the accuracy using the out of bag predictions

            
    def predict(self, df):

        df_copy = df.copy() #copy the dataframe to a new dataframe (as done in Assignment 1 and 2)
        
        df_drop = df_copy.drop(["CLASS"], axis=1) #dropping the CLASS column (we can't drp ID because we don't have that column on this dataset)
        filtered_df = apply_column_filter(df_drop, self.column_filter) #applying the column filter
        imputated_df = apply_imputation(filtered_df, self.imputation) #applying the imputation
        onehot_df = apply_one_hot(imputated_df, self.one_hot) #applying the one-hot
        input_data_values = onehot_df.values

        predictions = np.zeros((input_data_values.shape[0], len(self.labels)), dtype="float64") #list of predictions for each tree in the forest and list of average predictions for each tree in the forest
        num_trees = 0 #counter for the number of trees in the forest
        for clf in (self.model): #iterating over the trees in the forest
            X = input_data_values
            #print("Shape of X: ", np.shape(X))
            #print("Shape of predictions: ", np.shape(predictions))
            #print("Shape of predictions2: ", np.shape(clf.predict_proba(X)))
            #print(predictions)

            #2a main implementation part
            roughprediction = clf.predict_proba(X)
            treelabel = clf.classes_ #get the labels for tree
            df_prediction = pd.DataFrame(roughprediction, columns = treelabel) #create a dataframe for tree output prediction, easier to add column
            
            for i in self.labels:
                if i in df_prediction.columns:
                    pass 
                else:  #each class label that is not included
                    df_prediction.insert(self.labels.index(i), i, np.zeros(input_data_values.shape[0]), True) #assigned zero probability, hint 1 and hint3
            #print(df_prediction)
            
            predictions = predictions + df_prediction.values #appending the predictions of each tree in the forest, change df back to ndarrays
            num_trees = num_trees + 1 #incrementing the number of trees in the forest
        
        predictions = predictions / num_trees #dividing the predictions by the number of trees in the forest to get the average predictions
        predictions = pd.DataFrame(predictions, columns=self.labels) #averaging the predictions of each tree in the forest

        return predictions

In [81]:
# Test your code (leave this part unchanged, except for if auc is undefined)

train_df = pd.read_csv("anneal_train.csv")

test_df = pd.read_csv("anneal_test.csv")

rf = RandomForest()

t0 = time.perf_counter()
rf.fit(train_df)
print("Training time: {:.2f} s.".format(time.perf_counter()-t0))

print("OOB accuracy: {:.4f}".format(rf.oob_acc))

test_labels = test_df["CLASS"]

t0 = time.perf_counter()
predictions = rf.predict(test_df)
print("Testing time: {:.2f} s.".format(time.perf_counter()-t0))

print("Accuracy: {:.4f}".format(accuracy(predictions,test_labels)))
print("AUC: {:.4f}".format(auc(predictions,test_labels))) # Comment this out if not implemented in assignment 1
print("Brier score: {:.4f}".format(brier_score(predictions,test_labels))) # Comment this out if not implemented in assignment 1

Number of training instances:  449
Number of class labels:  5
Training time: 1.55 s.
OOB accuracy: 0.9465




ValueError: operands could not be broadcast together with shapes (449,449) (449,5) 

In [10]:
train_labels = train_df["CLASS"]
rf = RandomForest()
rf.fit(train_df)
predictions = rf.predict(train_df)
print("Accuracy on training set: {0:.2f}".format(accuracy(predictions,train_labels)))
print("AUC on training set: {0:.2f}".format(auc(predictions,train_labels)))
print("Brier score on training set: {0:.2f}".format(brier_score(predictions,train_labels)))

Accuracy on training set: 1.00
AUC on training set: 1.00
Brier score on training set: 0.01


### Comment on assumptions, things that do not work properly, etc.