# Assignment 3 Group no. 6
### Project members: 
Mar Balibrea Rull, marbr@kth.se

Marcos Fernández Carbonell, marcosfc@kth.se

Tiger Zha, tigerz2@illinois.edu

### Declaration
By submitting this solution, it is hereby declared that all individuals listed above have contributed to the solution, either with code that appears in the final solution below, or with code that has been evaluated and compared to the final solution, but for some reason has been excluded. It is also declared that all project members fully understand all parts of the final solution and can explain it upon request.

It is furthermore declared that the code below is a contribution by the project members only, and specifically that no part of the solution has been copied from any other source (except for lecture slides at the course ID2214) and no part of the solution has been provided by someone not listed as project member above.

It is furthermore declared that it has been understood that no other library/package than the Python 3 standard library, NumPy, pandas and time may be used in the solution for this assignment.

### Instructions
All assignments starting with number 1 below are mandatory. Satisfactory solutions
will give 1 point (in total). If they in addition are good (all parts work more or less 
as they should), completed on time (submitted before the deadline in Canvas) and according
to the instructions, together with satisfactory solutions of assignments starting with 
number 2 below, then the assignment will receive 2 points (in total).

It is highly recommended that you do not develop the code directly within the notebook
but that you copy the comments and test cases to your regular development environment
and only when everything works as expected, that you paste your functions into this
notebook, do a final testing (all cells should succeed) and submit the whole notebook 
(a single file) in Canvas (do not forget to fill in your group number and names above).


## Load NumPy, pandas and time

In [1]:
import numpy as np
import pandas as pd
import time

## Reused functions from Assignment 1

In [2]:
# Copy and paste functions from Assignment 1 here that you need for this assignment

# Binning functions -------------------------------------------------------------------------
def create_bins(df, nobins, bintype="equal-width"):
    df_tmp = df.drop(["CLASS","ID"], axis=1, errors="ignore").select_dtypes(include=["int","float"])

    binning = {}
    df_out = pd.DataFrame()
    for col in df_tmp.columns:
        if (bintype == "equal-width"):
            res_tmp, bins_tmp = pd.cut(df_tmp[col], nobins, retbins=True, labels=False, duplicates="drop")
        elif (bintype == "equal-size"):
            res_tmp, bins_tmp = pd.qcut(df_tmp[col], nobins, retbins=True, labels=False, duplicates="drop")
            
        bins_tmp[0] = -np.inf # Set the first element of the binning list to -np.inf
        bins_tmp[-1] = np.inf # Set the last element of the binning list to np.inf
        
        binning[col] = tuple(bins_tmp)
        
    df_out = apply_bins(df,binning) # Apply binning
    return df_out,binning

def apply_bins(df,binning):
    df_out = df.copy()
    for col in binning:
        df_out[col] = pd.DataFrame(pd.cut(df_out[col],binning[col],labels=False, duplicates="drop")).astype("category")
    return df_out
# End of Binning functions ------------------------------------------------------------------


# Imputation functions ----------------------------------------------------------------------
def create_imputation(df):
    df_tmp = df.copy()
    
    df_tmp_num = df_tmp.drop(["CLASS","ID"], axis=1, errors="ignore").select_dtypes(include=["int","float"]) # Select only numeric columns
    df_mean = pd.DataFrame(df_tmp_num.mean())
    df_tmp_num[(df_tmp_num.isna().all()).index[df_tmp_num.isna().all()]] = 0 # When all values in a numeric column are missing, they are replaced by 0

    df_tmp_obj = df_tmp.drop(["CLASS","ID"], axis=1, errors="ignore").select_dtypes(include=["object","category"]) # Select only object and categorical columns
    df_mode = df_tmp_obj.mode().transpose()
    df_tmp_obj[(df_tmp_obj.isna().all()).index[df_tmp_obj.isna().all()]] = "" # When all values in an object or categorical column are missing, they are replaced by ""

    imputation = ((pd.concat([df_mean[0],df_mode[0]], axis=0)).fillna(0)).to_dict() # Create a mapping from column name to new value
    
    df_out = apply_imputation(df, imputation) # Apply imputation
    return df_out,imputation

def apply_imputation(df, imputation):
    df_out = df.copy()
    df_out.fillna(value=imputation, inplace=True) # Replace NaN values according to the imputation dictionary
    return df_out
# End of Imputation functions ---------------------------------------------------------------


# Brier score function ----------------------------------------------------------------------
def brier_score(df,correctlabels):
    cat = df.columns # Get categories (columns)
    target = [(lab == cat).astype(float) for lab in correctlabels] # Get a matrix of target values from a list of correct labels
    df_target = pd.DataFrame(target,columns=cat) # Matrix to dataframe
    return ((((df - df_target).pow(2)).sum(axis=1)).sum())/len(df) # Calculate brier score from dataframes
# End of brier score function ----------------------------------------------------------------


# Accuracy function -------------------------------------------------------------------------
def accuracy(df,correctlabels):
    predicted = df.idxmax(axis=1, skipna=True) # Index (column name) of first occurrence of maximum value
    acc = sum(predicted == correctlabels)/len(correctlabels)
    return acc
# End of Accuracy function ------------------------------------------------------------------


# AUC function ------------------------------------------------------------------------------
def auc(df,correctlabels):
    cat = df.columns
    target = [(lab == cat).astype(float) for lab in correctlabels]
    df_target = pd.DataFrame(target, columns=cat)

    AUC = 0
    for col in df.columns:                       # For each class, create a dictionary mapping each score to [num_pos,num_neg]
        dic = {score:[0,0] for score in df[col]} # Initialize dictionary

        for n in range(len(df[col])):            # Check for each row of the dataframe and update the dictionary
            score = df[col][n]
            if df_target[col][n] == 1:           # Check if positive
                dic[score] = [dic[score][0]+1, dic[score][1]] # Increment positive instances
            else:                                # Negative
                dic[score] = [dic[score][0], dic[score][1]+1] # Increment negative instances

        sorted_list = np.array([dic[key] for key in sorted(dic, reverse=True)]) # Create a reversely sorted list of positive and negative instances
        
        tp = sorted_list[:,0] # Get true positives from the list
        fp = sorted_list[:,1] # Get false positives from the list

        AUC_tmp = 0 # Initialize AUC
        cov_tp = 0 # Initialize cov_tp
        tot_tp = sum(tp) # Get total number of TP
        tot_fp = sum(fp) # Get total number of FP
        for i in range(len(tp)):
            if fp[i] == 0:
                cov_tp += tp[i]
            elif tp[i] == 0:
                AUC_tmp += (cov_tp/tot_tp)*(fp[i]/tot_fp) # Update AUC_tmp
            else:
                AUC_tmp += (cov_tp/tot_tp)*(fp[i]/tot_fp)+(tp[i]/tot_tp)*(fp[i]/tot_fp)/2 # Update AUC_tmp
                cov_tp += tp[i]

        #print('AUC_tmp: '+str(AUC_tmp))
        AUC += sum(df_target[col])/len(df_target)*AUC_tmp # Add weighted AUC_tmp to AUC 
    return AUC
# End of AUC function -----------------------------------------------------------------------
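
The binning helpers above compute the bin edges on the training data and then replace the first and last edge with `-np.inf` and `np.inf`. The following minimal sketch illustrates, on made-up numbers, how equal-width and equal-size edges differ and why the open-ended edges matter when values outside the training range show up later. The names `values`, `unseen` and `open_ended` are hypothetical and are not part of the solution.

```python
import numpy as np
import pandas as pd

# Toy training column with one large outlier (made-up data)
values = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])

# Equal-width: the observed range is cut into 2 intervals of the same width
_, width_edges = pd.cut(values, 2, retbins=True, labels=False)
# Equal-size: the edges are quantiles, so each bin holds roughly the same number of values
_, size_edges = pd.qcut(values, 2, retbins=True, labels=False)

def open_ended(edges):
    """Return a copy of the edges whose outer limits are -inf and +inf."""
    edges = np.array(edges, dtype=float)
    edges[0], edges[-1] = -np.inf, np.inf
    return edges

width_edges = open_ended(width_edges)
size_edges = open_ended(size_edges)

# Values outside the training range still fall into the first or last bin
unseen = pd.Series([-5.0, 10.0, 250.0])
width_bins = pd.cut(unseen, width_edges, labels=False)
size_bins = pd.cut(unseen, size_edges, labels=False)

print(list(width_edges), list(width_bins))
print(list(size_edges), list(size_bins))
```

With the outlier present, the equal-width split point sits in the middle of the range, while the equal-size split point is the median, so the value 10 ends up in different bins under the two schemes.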

### Other functions needed for DecisionTree and RandomForest

In [3]:
# This function takes a dataframe and returns a mapping from the features to 
# the group sizes for each possible value of the feature
def group_sizes_per_feature(df,features):
    output = dict.fromkeys(features, {})                                                         # Initialize dictionary from features
    for col in features:                                                                         # For each column excluding "CLASS" and "ID"
        output[col] = {}                                                                         # Necessary to Assign Nested Keys and Values in Dictionaries (https://stackoverflow.com/questions/22455384/assign-nested-keys-and-values-in-dictionaries)
        for value in df[col].unique():                                                           # For each value of the column
            lower_equal_cases_series = df[df[col] <= value].groupby("CLASS").size()              # Find those instances that are lower or equal than the value and group them by CLASS
            greater_cases_series = df[df[col] > value].groupby("CLASS").size()                   # Find those instances that are greater than the value and group them by CLASS
        
            le = {label:lower_equal_cases_series.get(label,0) for label in df["CLASS"].unique()} # Transform lower_equal_cases_series to a structured dictionary where empty classes have a 0 as value
            g = {label:greater_cases_series.get(label,0) for label in df["CLASS"].unique()}      # Transform greater_cases_series to a structured dictionary where empty classes have a 0 as value

            output[col][value] = [le, g]                                                         # Update the output
                        
    return output


# This function computes entropy from class_counts
def entropy(class_counts):
    counts = np.fromiter(class_counts.values(), dtype=float) # Class counts as a float array
    if np.count_nonzero(counts) <= 1:                        # Empty or pure group: no uncertainty
        return 0
    
    probs = counts/counts.sum()                              # Relative class frequencies
    return sum([0 if i == 0 else -i * np.log2(i) for i in probs])


# This function computes residual information
def information_content(group_sizes,class_counts,features): 
    res_inf = dict.fromkeys(features, {})
    for feature in group_sizes.keys():
        res_inf[feature] = {}
        for value in group_sizes[feature].keys():
            dict_lower_equal = group_sizes.get(feature).get(value)[0]
            dict_greater = group_sizes.get(feature).get(value)[1]

            lower_equal_instances = sum(dict_lower_equal.values())
            greater_instances = sum(dict_greater.values())
            total_instances = lower_equal_instances + greater_instances 

            lower_equal_entropy = entropy(dict_lower_equal)
            greater_entropy = entropy(dict_greater)

            res_inf[feature][value] = (lower_equal_instances/total_instances)*lower_equal_entropy + (greater_instances/total_instances)*greater_entropy
    return res_inf


# This function finds the best question (feature value) from residual information
def find_best_question(res_inf):
    lowest_information = np.inf # Any finite residual information is lower than this
    for feature in res_inf.keys():
        for value in res_inf[feature].keys():
            res_inf_value = res_inf[feature][value]
            if res_inf_value < lowest_information:
                lowest_information = res_inf_value
                feature_out = feature
                value_out = value
    #print("Feature:{0} Value:{1} ResInf:{2}".format(feature_out,value_out,lowest_information))            
    return feature_out,value_out


# This function splits the dataframe according to a feature and a value
def split(df,feature,value):
    lower_equal_cases = df[df[feature] <= value].drop(columns=feature)
    greater_cases = df[df[feature] > value].drop(columns=feature)
    return lower_equal_cases, greater_cases
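
As a worked example of the quantity that `information_content` computes for one candidate split, the sketch below evaluates the weighted entropy of the two groups for three hand-made splits of 8 instances with two classes. It is a standalone illustration in plain Python; the helper names `entropy_of` and `residual_information` are hypothetical and take lists of class counts rather than the dictionaries used above.

```python
import math

def entropy_of(counts):
    """Entropy (in bits) of a list of class counts; empty groups count as 0."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

def residual_information(left_counts, right_counts):
    """Entropy of both groups, weighted by the fraction of instances in each."""
    n_left, n_right = sum(left_counts), sum(right_counts)
    total = n_left + n_right
    return (n_left / total) * entropy_of(left_counts) \
         + (n_right / total) * entropy_of(right_counts)

# 8 instances, 4 of each class, split in three different ways
pure_split = residual_information([4, 0], [0, 4])    # classes fully separated
uneven_split = residual_information([3, 1], [1, 3])  # partly separated
mixed_split = residual_information([2, 2], [2, 2])   # split tells us nothing

print(pure_split, uneven_split, mixed_split)
```

The split with the lowest value is the one `find_best_question` would prefer: the pure split scores 0 bits, the uninformative one scores 1 bit, and the partly separating split lies in between.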

## 1. Define the class DecisionTree

In [4]:
# Define the class DecisionTree with three functions __init__, fit and predict (after the comments):
#
# Input to __init__: 
# self: the object itself
#
# Output from __init__:
# nothing
# 
# This function does not return anything but just initializes the following attributes of the object (self) to None:
# binning, imputation, labels, model
#
# Input to fit:
# self: the object itself
# df: a dataframe (where the column names "CLASS" and "ID" have special meaning)
# nobins: no. of bins (default = 10)
# bintype: either "equal-width" (default) or "equal-size"
# min_samples_split: no. of instances required to allow a split (default = 5)
#
# Output from fit:
# nothing
#
# The result of applying this function should be:
#
# self.binning should be a discretization mapping (see Assignment 1) from df
# self.imputation should be an imputation mapping (see Assignment 1) from df
# self.labels should be the categories of the "CLASS" column of df, set to be of type "category" 
# self.model should be a decision tree (for details, see lecture slides), where the leaves return class probabilities
# Note that the function does not return anything but just assigns values to the attributes of the object.
#
# Hint 1: First find the available features (excluding "CLASS" and "ID"), then find the class counts, e.g., using 
#         groupby, and calculate the default class probabilities (relative frequencies of the class labels)
# Hint 2: Define a function, e.g., called divide_and_conquer, that takes the above as input together with df 
#         and min_samples_split, and also a nodeno (starting with 0) to keep track of the generated nodes in the tree
# Hint 3: You may represent the tree under construction as a list of nodes (tuples), on the form:
#         (nodeno,"leaf",class_probabilities): corresponding to a leaf node where class_probabilities is a vector
#                                              with the relative class frequencies (ordered according to self.labels)
#         (nodeno,feature,node_dict): corresponding to an internal (non-leaf) node where node_dict is a mapping from
#                                     the possible values of feature to child nodes (their nodenos)
# Hint 4: You may evaluate each feature by a function information_content, which takes the group sizes
#         for each possible value of the feature together with the class counts of each group as input
# Hint 5: The best feature found (with lowest resulting information content) will be used to split the training
#         instances, and each sub-group is used for generating a sub-tree (recursively by divide_and_conquer,
#         see lecture slides for details)
# Hint 6: You may make divide_and_conquer return not only a list of nodes, but also a current_node_no; 
#         by this, each subsequent call to divide_and_conquer for each subset of instances, i.e. for each feature value, 
#         could use current_node_no as a starting point.
#         If you e.g. make the following call:
#
#         current_node_no, node_list = divide_and_conquer(current_node_no, ...)
#
#         then the returned value in current_node_no can be used in the next call to divide_and_conquer.
#         Note that node_list will contain an arbitrary number of tuples, each element corresponding to a node together 
#         with a node number. The first element in the list will have the same number as current_node_no when the call 
#         was made and the last element will have a number one less than current_node_no when returned, e.g., if there is
#         only one (leaf) node in the returned list, then current_node_no will only be incremented by one through the above call.
# Hint 7: The list of nodes output by divide_and_conquer may finally be converted to an array, where each nodeno in the 
#         tuples corresponds to an index of the array 
#
# Input to predict:
# self: the object itself
# df: a dataframe
# 
# Output from predict:
# predictions: a dataframe with class labels as column names and the rows corresponding to
#              predictions with estimated class probabilities for each row in df, where the class probabilities
#              are the relative class frequencies in the leaves of the decision tree into which the instances in
#              df fall
#
# Hint 1: Drop any "CLASS" and "ID" columns first and then apply imputation and binning
# Hint 2: Iterate over the rows calling some sub-function, e.g., make_prediction(nodeno,row), which for a test row
#         finds a leaf node from which class probabilities are obtained
# Hint 3: This sub-function may recursively traverse the tree (represented by an array), starting with the nodeno
#         that corresponds to the root



In [5]:
class DecisionTree:
    
    def __init__(self):
        self.binning = None
        self.imputation = None
        self.labels = None
        self.model = None

    def fit(self, df, nobins=10, bintype="equal-width", min_samples_split=5): # Where min_samples_split is no. of instances required to allow a split
                
        # Define a function, e.g., called divide_and_conquer, that takes the above as input together with df 
        # and min_samples_split, and also a nodeno (starting with 0) to keep track of the generated nodes in the tree
        def divide_and_conquer(df, features, class_counts, default_class_probabilities, min_samples_split, nodeno):  
            #print(node_list)
            if len(df) >= min_samples_split:
                if len(df["CLASS"].unique()) > 1:
                    if len(features) > 0:
                        # Get a mapping from the features to the group sizes for each possible value of the features
                        group_sizes = group_sizes_per_feature(df,features)
                        #print(group_sizes)
                        
                        # Evaluate each feature by a function information_content, which takes the group sizes
                        # for each possible value of the feature together with the class counts of each group as input
                        residual_information_features_values = information_content(group_sizes,class_counts,features)
                        
                        # Find the best question to split the dataframe (the one with the lowest residual information)
                        feature,value = find_best_question(residual_information_features_values)
                        #print(feature)
                        
                        # According to the "question" above (feature + value) split the dataframe
                        lower_equal_df, greater_df = split(df,feature,value)
                        #print(len(lower_equal_df))
                        #print(greater_df)
                        
                        # Update lower_equal_child_nodeno and greater_child_nodeno
                        if nodeno == 0:
                            lower_equal_child_nodeno = 1
                            greater_child_nodeno = 2

                        else:
                            lower_equal_child_nodeno = greater_child_nodes[-1] + 1
                            greater_child_nodeno = lower_equal_child_nodeno + 1
                    
                        greater_child_nodes.append(greater_child_nodeno) # Append last greater_child_nodeno to be able to calculate next lower_equal_child_nodeno and greater_child_nodeno
                        
                        
                        # Append a new non-leaf node
                        node_dict = {"<="+str(value):lower_equal_child_nodeno,">"+str(value):greater_child_nodeno} # node_dict is a mapping from the possible values of feature to child nodes (their nodenos)
                        node_list.append((nodeno,feature,node_dict)) 
                        #print(node_list)
                        
                        # Drop the used feature and get remaining features
                        features = features.drop(feature) # Get remaining features after last split
                        
                        # Get Lower or equal divide and conquer input parameters
                        class_counts_LE = lower_equal_df.groupby("CLASS").size() # Find the class counts
                        if len(lower_equal_df) >= 1 and len(np.unique(class_counts_LE.values)) != 1:
                            default_class_probabilities_LE = [class_counts_LE.get(label,0)/class_counts_LE.sum() for label in self.labels]
                        else:
                            default_class_probabilities_LE = default_class_probabilities
                            
                        # Get Greater divide and conquer input parameters
                        class_counts_G = greater_df.groupby("CLASS").size() # Find the class counts
                        if len(greater_df) >= 1 and len(np.unique(class_counts_G.values)) != 1:
                            default_class_probabilities_G = [class_counts_G.get(label,0)/class_counts_G.sum() for label in self.labels]
                        else:
                            default_class_probabilities_G = default_class_probabilities
                            
                        divide_and_conquer(lower_equal_df, features, class_counts_LE, default_class_probabilities_LE, min_samples_split, lower_equal_child_nodeno)
                        divide_and_conquer(greater_df, features, class_counts_G, default_class_probabilities_G, min_samples_split, greater_child_nodeno)

                    else:
                        # No more features left
                        class_probabilities = [class_counts.get(label,0)/class_counts.sum() for label in self.labels]
                        #print("Case 3 (No features): {0}".format(class_probabilities))
                        node_list.append((nodeno,"leaf",class_probabilities)) # list.append returns None, so append before returning
                        return nodeno, node_list
                else:
                    # Only one class left
                    class_probabilities = [class_counts.get(label,0)/class_counts.sum() for label in self.labels]
                    #print("Case 2 (Only one class): {0}".format(class_probabilities))
                    node_list.append((nodeno,"leaf",class_probabilities)) # list.append returns None, so append before returning
                    return nodeno, node_list
            else:
                # Number of instances left < min_samples_split
                #print("Case 1 : {0}".format(default_class_probabilities))
                if df.empty or len(np.unique(class_counts.values)) == 1:
                    node_list.append((nodeno,"leaf",default_class_probabilities)) # list.append returns None, so append before returning
                else:
                    class_probabilities = [class_counts.get(label,0)/class_counts.sum() for label in self.labels]
                    node_list.append((nodeno,"leaf",class_probabilities))
                return nodeno, node_list
                
            return nodeno, node_list
        
        df_tmp,self.imputation = create_imputation(df)                # Imputation mapping from df
        df_tmp,self.binning = create_bins(df_tmp, nobins, bintype)    # Discretization mapping from df_tmp
        self.labels = df["CLASS"].astype("category").cat.categories   # Categories of the "CLASS" column of df, set to be of type "category" 
        
        features = df_tmp.columns.drop(["CLASS","ID"], errors="ignore") # Find the available features (tolerates a missing "ID" column)
        class_counts = df_tmp.groupby("CLASS").size()                 # Find the class counts
        default_class_probabilities = [class_counts.get(label,0)/class_counts.sum() for label in self.labels] # Calculate the default class probabilities
        
        node_list = []                                                # Initialize node_list
        greater_child_nodes = []                                      # Initialize greater_child_nodes
        nodeno,tree = divide_and_conquer(df_tmp, features, class_counts, default_class_probabilities, min_samples_split, 0) # Build tree
        
        self.model = sorted(tree, key=lambda x: x[0])

        
    def predict(self, df):
        
        # This sub-function may recursively traverse the tree (represented by an array), starting with the nodeno
        # that corresponds to the root until it finds a leaf node from which class probabilities are obtained
        def make_prediction(nodeno,row):
            #print("Nodeno:{0} tree:{1} tree[nodeno][2]:{2}".format(nodeno,tree[nodeno],tree[nodeno][2]))
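            if tree[nodeno][1] != "leaf":
                feature = tree[nodeno][1]
                for question, new_nodeno in tree[nodeno][2].items():
                    # Questions are stored as "<=v" or ">v"; parse them instead of calling eval
                    if question.startswith("<="):
                        matches = row[feature] <= float(question[2:])
                    else:
                        matches = row[feature] > float(question[1:])
                    if matches:
                        return make_prediction(new_nodeno,row)
            else: 
                return tree[nodeno][2]       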
        
        
        tree = self.model                                    # Obtain tree from model
        #print(tree)
        
        df_tmp = df.drop(columns=["CLASS","ID"], errors="ignore") # Drop any "CLASS" and "ID" columns
        df_tmp = apply_imputation(df_tmp, self.imputation)   # Apply imputation 
        df_tmp = apply_bins(df_tmp,self.binning)             # Apply binning
        
        # Iterate over the rows calling some sub-function, e.g., make_prediction(nodeno,row)
        predictions = pd.DataFrame(0.0,index=np.arange(len(df_tmp)), columns=self.labels) # Float zeros, so probabilities keep their dtype
        for i, (_, row) in enumerate(df_tmp.iterrows()):
            prediction = make_prediction(0,row) # Obtain prediction
            #print(prediction)
            predictions.iloc[i] = prediction # Update predictions by position, independent of the index of df
        
        #print(predictions)
        return predictions

In [6]:
# Test your code (leave this part unchanged, except for if auc is undefined)

glass_train_df = pd.read_csv("glass_train.txt")

glass_test_df = pd.read_csv("glass_test.txt")

tree_model = DecisionTree()

test_labels = glass_test_df["CLASS"]

nobins_values = [5,10]
bintype_values = ["equal-width","equal-size"]
min_samples_split_values = [3,5,10]
parameters = [(nobins,bintype,min_samples_split) for nobins in nobins_values for bintype in bintype_values 
              for min_samples_split in min_samples_split_values]

results = np.empty((len(parameters),3))

for i in range(len(parameters)):
    t0 = time.perf_counter()
    tree_model.fit(glass_train_df,nobins=parameters[i][0],bintype=parameters[i][1],min_samples_split=parameters[i][2])
    print("Training time {0}: {1:.2f} s.".format(parameters[i],time.perf_counter()-t0))
    t0 = time.perf_counter()
    predictions = tree_model.predict(glass_test_df)
    print("Testing time {0}: {1:.2f} s.".format(parameters[i],time.perf_counter()-t0))
    results[i] = [accuracy(predictions,test_labels),brier_score(predictions,test_labels),
                  auc(predictions,test_labels)] # Assuming that you have defined auc - remove otherwise

results = pd.DataFrame(results,index=pd.MultiIndex.from_product([nobins_values,bintype_values,min_samples_split_values]),
                       columns=["Accuracy","Brier score","AUC"])

results



Training time (5, 'equal-width', 3): 1.42 s.
Testing time (5, 'equal-width', 3): 0.09 s.
Training time (5, 'equal-width', 5): 1.22 s.
Testing time (5, 'equal-width', 5): 0.09 s.
Training time (5, 'equal-width', 10): 0.99 s.
Testing time (5, 'equal-width', 10): 0.08 s.
Training time (5, 'equal-size', 3): 1.79 s.
Testing time (5, 'equal-size', 3): 0.09 s.
Training time (5, 'equal-size', 5): 1.47 s.
Testing time (5, 'equal-size', 5): 0.08 s.
Training time (5, 'equal-size', 10): 1.07 s.
Testing time (5, 'equal-size', 10): 0.08 s.
Training time (10, 'equal-width', 3): 2.00 s.
Testing time (10, 'equal-width', 3): 0.09 s.
Training time (10, 'equal-width', 5): 1.77 s.
Testing time (10, 'equal-width', 5): 0.09 s.
Training time (10, 'equal-width', 10): 1.33 s.
Testing time (10, 'equal-width', 10): 0.08 s.
Training time (10, 'equal-size', 3): 2.37 s.
Testing time (10, 'equal-size', 3): 0.09 s.
Training time (10, 'equal-size', 5): 2.20 s.
Testing time (10, 'equal-size', 5): 0.11 s.
Training time (

| nobins | bintype | min_samples_split | Accuracy | Brier score | AUC |
|---|---|---|---|---|---|
| 5 | equal-width | 3 | 0.663551 | 0.495742 | 0.823196 |
| 5 | equal-width | 5 | 0.663551 | 0.513286 | 0.815376 |
| 5 | equal-width | 10 | 0.663551 | 0.509703 | 0.815924 |
| 5 | equal-size | 3 | 0.570093 | 0.715156 | 0.747264 |
| 5 | equal-size | 5 | 0.570093 | 0.660114 | 0.76036 |
| 5 | equal-size | 10 | 0.53271 | 0.698032 | 0.740427 |
| 10 | equal-width | 3 | 0.663551 | 0.536818 | 0.819388 |
| 10 | equal-width | 5 | 0.616822 | 0.547826 | 0.819928 |
| 10 | equal-width | 10 | 0.626168 | 0.541446 | 0.810556 |
| 10 | equal-size | 3 | 0.570093 | 0.72261 | 0.744371 |


In [7]:
train_labels = glass_train_df["CLASS"]
tree_model.fit(glass_train_df,min_samples_split=1)
predictions = tree_model.predict(glass_train_df)
print("Accuracy on training set: {0:.2f}".format(accuracy(predictions,train_labels)))
print("AUC on training set: {0:.2f}".format(auc(predictions,train_labels)))
print("Brier score on training set: {0:.2f}".format(brier_score(predictions,train_labels)))

Accuracy on training set: 0.91
AUC on training set: 0.98
Brier score on training set: 0.13


### Comment on assumptions, things that do not work properly, etc.
The values above do not match those in "Assignment 3.html". After running some tests on our code, we suspect the cause is that some leaf nodes have tied (equal) class probabilities, so the predicted label, and with it the reported results, can depend on which of the tied classes is selected. For more details, see the example below.
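
To make the effect of ties concrete, here is a minimal sketch on made-up probabilities. `DataFrame.idxmax`, which the `accuracy` function relies on, returns the first column holding the maximum, so the label chosen for a tied row depends on the column order. The names `tied_predictions`, `correct`, `first_wins` and `reordered` are hypothetical.

```python
import pandas as pd

# Row 0 is a tie between classes "A" and "B"; row 1 clearly favours "B"
tied_predictions = pd.DataFrame({"A": [0.5, 0.2], "B": [0.5, 0.8]})
correct = pd.Series(["B", "B"])

# idxmax picks the first column with the maximum value in each row
first_wins = tied_predictions.idxmax(axis=1)
# Same probabilities, columns listed in the opposite order
reordered = tied_predictions[["B", "A"]].idxmax(axis=1)

acc_first = (first_wins == correct).mean()
acc_reordered = (reordered == correct).mean()
print(list(first_wins), acc_first)
print(list(reordered), acc_reordered)
```

In this toy case the same probabilities give an accuracy of 0.5 or 1.0 depending only on which tied class comes first.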


In [8]:
###-------------------------------------------- TESTING CELL --------------------------------------------###
# Define a function, e.g., called divide_and_conquer, that takes the above as input together with df 
# and min_samples_split, and also a nodeno (starting with 0) to keep track of the generated nodes in the tree
def divide_and_conquer(df, features, class_counts, default_class_probabilities, min_samples_split, nodeno):  
    #print(node_list)
    if len(df) >= min_samples_split:
        if len(df["CLASS"].unique()) > 1:
            if len(features) > 0:
                # Get a mapping from the features to the group sizes for each possible value of the features
                group_sizes = group_sizes_per_feature(df,features)
                #print(group_sizes)
                        
                # Evaluate each feature by a function information_content, which takes the group sizes
                # for each possible value of the feature together with the class counts of each group as input
                residual_information_features_values = information_content(group_sizes,class_counts,features)
                        
                # Find the best question to split the dataframe (the one with the lowest residual information)
                feature,value = find_best_question(residual_information_features_values)
                #print(feature)
                        
                # According to the "question" above (feature + value) split the dataframe
                lower_equal_df, greater_df = split(df,feature,value)
                #print(len(lower_equal_df))
                #print(greater_df)
                        
                # Update lower_equal_child_nodeno and greater_child_nodeno
                if nodeno == 0:
                    lower_equal_child_nodeno = 1
                    greater_child_nodeno = 2

                else:
                    lower_equal_child_nodeno = greater_child_nodes[-1] + 1
                    greater_child_nodeno = lower_equal_child_nodeno + 1
                    
                greater_child_nodes.append(greater_child_nodeno) # Append last greater_child_nodeno to be able to calculate next lower_equal_child_nodeno and greater_child_nodeno
                        
                        
                # Append a new non-leaf node
                node_dict = {"<="+str(value):lower_equal_child_nodeno,">"+str(value):greater_child_nodeno} # node_dict is a mapping from the possible values of feature to child nodes (their nodenos)
                node_list.append((nodeno,feature,node_dict)) 
                #print(node_list)
                        
                # Drop the used feature and get remaining features
                features = features.drop(feature) # Get remaining features after last split
                        
                # Get Lower or equal divide and conquer input parameters
                class_counts_LE = lower_equal_df.groupby("CLASS").size() # Find the class counts
                if len(lower_equal_df) >= 1 and len(np.unique(class_counts_LE.values)) != 1:
                    default_class_probabilities_LE = [class_counts_LE.get(label,0)/class_counts_LE.sum() for label in labels]
                else:
                    default_class_probabilities_LE = default_class_probabilities
                            
                # Get Greater divide and conquer input parameters
                class_counts_G = greater_df.groupby("CLASS").size() # Find the class counts
                if len(greater_df) >= 1 and len(np.unique(class_counts_G.values)) != 1:
                    default_class_probabilities_G = [class_counts_G.get(label,0)/class_counts_G.sum() for label in labels]
                else:
                    default_class_probabilities_G = default_class_probabilities
                            
                divide_and_conquer(lower_equal_df, features, class_counts_LE, default_class_probabilities_LE, min_samples_split, lower_equal_child_nodeno)
                divide_and_conquer(greater_df, features, class_counts_G, default_class_probabilities_G, min_samples_split, greater_child_nodeno)

            else:
                # No more features left
                class_probabilities = [class_counts.get(label,0)/class_counts.sum() for label in labels]
                #print("Case 3 (No features): {0}".format(class_probabilities))
                node_list.append((nodeno,"leaf",class_probabilities))
        else:
            # Only one class left
            class_probabilities = [class_counts.get(label,0)/class_counts.sum() for label in labels]
            #print("Case 2 (Only one class): {0}".format(class_probabilities))
            node_list.append((nodeno,"leaf",class_probabilities))
    else:
        # Number of instances left < min_samples_split
        #print("Case 1 : {0}".format(default_class_probabilities))
        if df.empty or len(np.unique(class_counts.values)) == 1:
            node_list.append((nodeno,"leaf",default_class_probabilities))
        else:
            class_probabilities = [class_counts.get(label,0)/class_counts.sum() for label in labels]
            node_list.append((nodeno,"leaf",class_probabilities))

    # list.append returns None, so every branch falls through and returns the list itself
    return nodeno, node_list



glass_train_df = pd.read_csv("glass_train.txt")
df = glass_train_df.copy()
df = df[["RI","Na","ID","CLASS"]][0:10]
#df = df[0:10]

nobins = 10
bintype = "equal-width"
df_tmp,imputation = create_imputation(df) # Imputation mapping from df_tmp 
df_tmp,binning = create_bins(df_tmp, nobins, bintype) # Discretization mapping from df
labels = df["CLASS"].astype("category").cat.categories # Categories of the "CLASS" column of df, set to be of type "category" 
print("Create a toy dataset")
print(df_tmp)
print("")

features = df_tmp.columns.drop(["CLASS","ID"]) # Find the available features
class_counts = df_tmp.groupby("CLASS").size() # Find the class counts
default_class_probabilities = [class_counts.get(label,0)/class_counts.sum() for label in labels]# Calculate the default class probabilities
print("Default class probabilities (classes in descending order): {0}".format(default_class_probabilities)) #OK
print("")


node_list = []
greater_child_nodes = []
min_samples_split = 1
nodeno,tree = divide_and_conquer(df_tmp, features, class_counts, default_class_probabilities, min_samples_split, 0)
tree = sorted(tree, key=lambda x: x[0])
print("Tree:{0}".format(tree))
print("")


# PREDICTION
df_tmp_prediction = df_tmp.drop(columns=["CLASS","ID"])
#print(df_tmp_prediction)
        
def make_prediction(nodeno,row):
    #print("Nodeno:{0} tree:{1} tree[nodeno][2]:{2}".format(nodeno,tree[nodeno],tree[nodeno][2]))
    if tree[nodeno][1] != "leaf":
        feature = tree[nodeno][1]
        for question in tree[nodeno][2].keys():
            #print(question)
            if eval(str(row[feature])+question):
                new_nodeno = tree[nodeno][2][question]
                #print("Row:{4}={0} Nodeno:{1} Question:{2} Newnodeno:{3}".format(row[feature],nodeno, question, new_nodeno,feature))
                        
                return make_prediction(new_nodeno,row)
    else: 
        return tree[nodeno][2]       
                
                
train_labels = df_tmp["CLASS"]           
#print(tree)
predictions = pd.DataFrame(0.0,index=np.arange(len(train_labels)), columns=labels) # Float zeros, since class probabilities are assigned below
for index, row in df_tmp_prediction.iterrows():
    prediction = make_prediction(0,row)
    #print(prediction)
    predictions.loc[index] = prediction
    #print(predictions)

print("Predictions DataFrame")
print(predictions)
print("")

print("Accuracy:{0}".format(accuracy(predictions,train_labels)))
###-------------------------------------------- TESTING CELL --------------------------------------------###

Create a toy dataset
   RI  Na   ID  CLASS
0   1   0  202      7
1   2   5  124      2
2   9   7  152      3
3   0   6  197      7
4   2   3  144      2
5   2   2   99      2
6   1   9  207      7
7   0   4   36      1
8   5   5   50      1
9   4   4  117      2

Default class probabilities (classes in descending order): [0.2, 0.4, 0.1, 0.3]

Tree:[(0, 'RI', {'<=1': 1, '>1': 2}), (1, 'Na', {'<=4': 3, '>4': 4}), (2, 'Na', {'<=5': 5, '>5': 6}), (3, 'leaf', [0.5, 0.0, 0.0, 0.5]), (4, 'leaf', [0.0, 0.0, 0.0, 1.0]), (5, 'leaf', [0.2, 0.8, 0.0, 0.0]), (6, 'leaf', [0.0, 0.0, 1.0, 0.0])]

Predictions DataFrame
     1    2    3    7
0  0.5  0.0  0.0  0.5
1  0.2  0.8  0.0  0.0
2  0.0  0.0  1.0  0.0
3  0.0  0.0  0.0  1.0
4  0.2  0.8  0.0  0.0
5  0.2  0.8  0.0  0.0
6  0.0  0.0  0.0  1.0
7  0.5  0.0  0.0  0.5
8  0.2  0.8  0.0  0.0
9  0.2  0.8  0.0  0.0

Accuracy:0.8


### Notes

In this example, one leaf node has two equally probable classes:

(3, 'leaf', [0.5, 0.0, 0.0, 0.5])

When the tree is used to predict the training instances, two of the ten instances (0 and 7, marked in red below) end up in that leaf node.

|            |     1      |  2         |  3         |  7         |    
|------------|------------|------------|------------|------------|
|  <font color='red'>0</font> |  <font color='red'>0.5</font> |  <font color='red'>0.0</font> |  <font color='red'>0.0</font> |  <font color='red'>0.5</font> |
|     1      |    0.2     |    0.8     |    0.0     |    0.0     |
|     2      |    0.0     |    0.0     |    1.0     |    0.0     |
|     3      |    0.0     |    0.0     |    0.0     |    1.0     |
|     4      |    0.2     |    0.8     |    0.0     |    0.0     |
|     5      |    0.2     |    0.8     |    0.0     |    0.0     |
|     6      |    0.0     |    0.0     |    0.0     |    1.0     |
|  <font color='red'>7</font> |  <font color='red'>0.5</font> |  <font color='red'>0.0</font> |  <font color='red'>0.0</font> |  <font color='red'>0.5</font> |
|     8      |    0.2     |    0.8     |    0.0     |    0.0     |
|     9      |    0.2     |    0.8     |    0.0     |    0.0     |

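Since our `accuracy` function chooses the first class with the highest probability, ties are resolved in favour of the first class, so the predictions for these two instances are:
    
Instance 0 --> CLASS 1 <br>
Instance 7 --> CLASS 1

The target value of instance 0 is CLASS 7, so it is misclassified, whereas instance 7 has target CLASS 1 and is classified correctly. The second error is instance 8 (target CLASS 1), which falls into the leaf `[0.2, 0.8, 0.0, 0.0]` and is predicted as CLASS 2. The accuracy is therefore 8/10 = 0.8.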
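
The tie-breaking behaviour can be checked with a small sketch. It assumes that "choosing the first class with the highest probability" is equivalent to pandas' `idxmax` along the columns, which returns the first column label when several values share the maximum; the `accuracy` function defined earlier in the notebook may be implemented differently. The probabilities and target classes are copied from the toy dataset output.

```python
import pandas as pd

labels = [1, 2, 3, 7]

# Class probabilities copied from the "Predictions DataFrame" output
predictions = pd.DataFrame(
    [[0.5, 0.0, 0.0, 0.5],
     [0.2, 0.8, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0],
     [0.0, 0.0, 0.0, 1.0],
     [0.2, 0.8, 0.0, 0.0],
     [0.2, 0.8, 0.0, 0.0],
     [0.0, 0.0, 0.0, 1.0],
     [0.5, 0.0, 0.0, 0.5],
     [0.2, 0.8, 0.0, 0.0],
     [0.2, 0.8, 0.0, 0.0]],
    columns=labels,
)

# CLASS column of the toy dataset
targets = pd.Series([7, 2, 3, 7, 2, 2, 7, 1, 1, 2])

# idxmax picks the first column holding the row maximum, so ties go to the lower label
predicted = predictions.idxmax(axis=1)
correct = predicted == targets

print("Predicted class for the tied rows:", predicted.loc[[0, 7]].tolist())
print("Misclassified rows:", correct[~correct].index.tolist())
print("Accuracy:", correct.mean())
```

Printing the misclassified rows next to the accuracy makes it easy to see which instances account for the two errors.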

### Deep testing

We used the script below to check that everything was working as expected.

In [9]:
###-------------------------------------------- TESTING CELL --------------------------------------------###
glass_train_df = pd.read_csv("glass_train.txt")
df = glass_train_df.copy()
df = df[["RI","Na","ID","CLASS"]][0:10]
#df = df[0:10]

nobins = 10
bintype = "equal-width"
df_tmp,imputation = create_imputation(df) # Imputation mapping from df_tmp 
df_tmp,binning = create_bins(df_tmp, nobins, bintype) # Discretization mapping from df
labels = df["CLASS"].astype("category").cat.categories # Categories of the "CLASS" column of df, set to be of type "category" 
print("Create a toy dataset")
print(df_tmp)
print("")

features = df_tmp.columns.drop(["CLASS","ID"]) # Find the available features
class_counts = df_tmp.groupby("CLASS").size() # Find the class counts
default_class_probabilities = [class_counts.get(label,0)/class_counts.sum() for label in labels]# Calculate the default class probabilities
print("Default class probabilities (classes in descending order): {0}".format(default_class_probabilities)) #OK
print("")

group_sizes = group_sizes_per_feature(df_tmp,features) #OK
print("Group sizes: {0}".format(group_sizes)) #OK
print("")

residual_information_features_values = information_content(group_sizes,class_counts,features) #OK
print("Residual information features values: {0}".format(residual_information_features_values)) #OK
print("")

feature,value = find_best_question(residual_information_features_values) #OK
print("Best question: Feature={0} Value={1}".format(feature,value)) #OK
print("")

lower_equal_df, greater_df = split(df_tmp,feature,value) #OK
print("Lower or equal Data Frame:") 
print(lower_equal_df) #OK
print("")
print("Greater Data Frame:")
print(greater_df) #OK
print("")

features = features.drop(feature) #OK
print(features) #OK
print("")
###-------------------------------------------- TESTING CELL --------------------------------------------###

Create a toy dataset
   RI  Na   ID  CLASS
0   1   0  202      7
1   2   5  124      2
2   9   7  152      3
3   0   6  197      7
4   2   3  144      2
5   2   2   99      2
6   1   9  207      7
7   0   4   36      1
8   5   5   50      1
9   4   4  117      2

Default class probabilities (classes in descending order): [0.2, 0.4, 0.1, 0.3]

Group sizes: {'RI': {1: [{7: 3, 2: 0, 3: 0, 1: 1}, {7: 0, 2: 4, 3: 1, 1: 1}], 2: [{7: 3, 2: 3, 3: 0, 1: 1}, {7: 0, 2: 1, 3: 1, 1: 1}], 9: [{7: 3, 2: 4, 3: 1, 1: 2}, {7: 0, 2: 0, 3: 0, 1: 0}], 0: [{7: 1, 2: 0, 3: 0, 1: 1}, {7: 2, 2: 4, 3: 1, 1: 1}], 5: [{7: 3, 2: 4, 3: 0, 1: 2}, {7: 0, 2: 0, 3: 1, 1: 0}], 4: [{7: 3, 2: 4, 3: 0, 1: 1}, {7: 0, 2: 0, 3: 1, 1: 1}]}, 'Na': {0: [{7: 1, 2: 0, 3: 0, 1: 0}, {7: 2, 2: 4, 3: 1, 1: 2}], 5: [{7: 1, 2: 4, 3: 0, 1: 2}, {7: 2, 2: 0, 3: 1, 1: 0}], 7: [{7: 2, 2: 4, 3: 1, 1: 2}, {7: 1, 2: 0, 3: 0, 1: 0}], 6: [{7: 2, 2: 4, 3: 0, 1: 2}, {7: 1, 2: 0, 3: 1, 1: 0}], 3: [{7: 1, 2: 2, 3: 0, 1: 0}, {7: 2, 2: 2, 3: 1, 1: 2}],

## 2. Define the class DecisionForest

In [10]:
# Define the class DecisionForest with three functions __init__, fit and predict (after the comments):
#
# Input to __init__: 
# self: the object itself
#
# Output from __init__:
# nothing
# 
# This function does not return anything but just initializes the following attributes of the object (self) to None:
# binning, imputation, labels, model
#
# Input to fit:
# self: the object itself
# df: a dataframe (where the column names "CLASS" and "ID" have special meaning)
# nobins: no. of bins (default = 10)
# bintype: either "equal-width" (default) or "equal-size"
# min_samples_split: no. of instances required to allow a split (default = 5)
# random_features: no. of features to evaluate at each split (default = 2), 0 means all features (no random sampling)
# notrees: no. of trees in the forest (default = 10)
#
# Output from fit:
# nothing
#
# The result of applying this function should be:
#
# self.binning should be a discretization mapping (see Assignment 1) from df
# self.imputation should be an imputation mapping (see Assignment 1) from df
# self.labels should be the categories of the "CLASS" column of df, set to be of type "category" 
# self.model should be a random forest (for details, see lecture slides)
# Note that the function does not return anything but just assigns values to the attributes of the object.
#
# Hint 1: Redefine divide_and_conquer to take one additional argument; random_features, and instead of
#         evaluating all features choose a random subset, e.g., by np.random.choice (without replacement)
# Hint 2: Generate each tree in the forest from a bootstrap replicate of df, e.g., by np.random.choice 
#         (with replacement) from the index values of df.
#
# Input to predict:
# self: the object itself
# df: a dataframe
# 
# Output from predict:
# predictions: a dataframe with class labels as column names and the rows corresponding to
#              predictions with estimated class probabilities for each row in df, where the class probabilities
#              are the mean of all relative class frequencies in the leaves of the forest into which the instances in
#              df fall
#
# Hint 1: Drop any "CLASS" and "ID" columns first and then apply imputation and binning
# Hint 2: Iterate over the rows calling some sub-function, e.g., make_prediction(row), which for a test row
#         finds all leaf nodes and calculates the average of their class probabilities
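
The two sampling hints for `fit` can be illustrated on a tiny, made-up dataframe. This is only a sketch of the sampling steps, not of the tree construction itself. The dataframe `toy_df`, the seed and the use of `np.random.default_rng` (rather than the legacy `np.random.choice` named in the hints) are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility

# Hypothetical toy data; the column names mimic the glass dataset
toy_df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "RI": [1, 2, 9, 0, 2],
    "Na": [0, 5, 7, 6, 3],
    "Mg": [3, 3, 0, 1, 2],
    "CLASS": [7, 2, 3, 7, 2],
})
features = toy_df.columns.drop(["CLASS", "ID"])

# Hint 2: a bootstrap replicate samples the index values WITH replacement
boot_idx = rng.choice(toy_df.index, size=len(toy_df), replace=True)
bootstrap_df = toy_df.loc[boot_idx].reset_index(drop=True)

# Hint 1: at a split, evaluate a random subset of the features,
# sampled WITHOUT replacement and never larger than what is available
random_features = 2
k = min(random_features, len(features))
candidate_features = rng.choice(features, size=k, replace=False)

print(bootstrap_df)
print(list(candidate_features))
```

Sampling without replacement guarantees that the candidate features at a split are distinct, which sampling with replacement does not.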



In [11]:
class DecisionForest:
    
    def __init__(self):
        self.binning = None
        self.imputation = None
        self.labels = None
        self.model = None
   
    # Input to fit:
    # self: the object itself
    # df: a dataframe (where the column names "CLASS" and "ID" have special meaning)
    # nobins: no. of bins (default = 10)
    # bintype: either "equal-width" (default) or "equal-size"
    # min_samples_split: no. of instances required to allow a split (default = 5)
    # random_features: no. of features to evaluate at each split (default = 2), 0 means all features (no random sampling)
    # notrees: no. of trees in the forest (default = 10)
    def fit(self, df, nobins=10, bintype="equal-width", min_samples_split=5, random_features=2, notrees=10):
        # Note that the function does not return anything but just assigns values to the attributes of the object.
        #
        # Hint 1: Redefine divide_and_conquer to take one additional argument; random_features, and instead of
        #         evaluating all features choose a random subset, e.g., by np.random.choice (without replacement)
        # Hint 2: Generate each tree in the forest from a bootstrap replicate of df, e.g., by np.random.choice 
        #         (with replacement) from the index values of df.  
        
        def divide_and_conquer(df, features, class_counts, default_class_probabilities, min_samples_split, random_features, nodeno):  
            #print(node_list)
            if len(df) >= min_samples_split:
                if len(df["CLASS"].unique()) > 1:
                    if (len(features if random_features == 0 else remaining_features) > 0):
                        #print(df)
                        #print(remaining_features)
                        #print(features)
                        if random_features != 0:
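                            features = pd.Series(np.random.choice(remaining_features, min(random_features, len(remaining_features)), replace=False)) # Sample without replacement (Hint 1), never more features than remain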
                          
                        # Get a mapping from the features to the group sizes for each possible value of the features
                        group_sizes = group_sizes_per_feature(df,features)
                        #print(group_sizes)
                        
                        # Evaluate each feature by a function information_content, which takes the group sizes
                        # for each possible value of the feature together with the class counts of each group as input
                        residual_information_features_values = information_content(group_sizes,class_counts,features)
                        
                        # Find the best question to split the dataframe (the one with the lowest residual information)
                        feature,value = find_best_question(residual_information_features_values)
                        
                        # According to the "question" above (feature + value) split the dataframe
                        lower_equal_df, greater_df = split(df,feature,value)
                        #print(len(lower_equal_df))
                        #print(greater_df)
                        
                        # Update lower_equal_child_nodeno and greater_child_nodeno
                        if nodeno == 0:
                            lower_equal_child_nodeno = 1
                            greater_child_nodeno = 2

                        else:
                            lower_equal_child_nodeno = greater_child_nodes[-1] + 1
                            greater_child_nodeno = lower_equal_child_nodeno + 1
                    
                        greater_child_nodes.append(greater_child_nodeno) # Remember the last greater_child_nodeno so the next lower_equal_child_nodeno and greater_child_nodeno can be derived from it
                        
                        # Append a new non-leaf node
                        node_dict = {"<="+str(value):lower_equal_child_nodeno,">"+str(value):greater_child_nodeno} # node_dict is a mapping from the possible values of feature to child nodes (their nodenos)
                        node_list.append((nodeno,feature,node_dict)) 
                        #print(node_list)
                
                    
                        if random_features != 0:
                            remaining_features.remove(feature) # Update remaining features
                                                
                        # Get Lower or equal divide and conquer input parameters
                        class_counts_LE = lower_equal_df.groupby("CLASS").size() # Find the class counts
                        if len(lower_equal_df) >= 1 and len(np.unique(class_counts_LE.values)) != 1:
                            default_class_probabilities_LE = [class_counts_LE.get(label,0)/class_counts_LE.sum() for label in self.labels]
                        else:
                            default_class_probabilities_LE = default_class_probabilities
                            
                        # Get Greater divide and conquer input parameters
                        class_counts_G = greater_df.groupby("CLASS").size() # Find the class counts
                        if len(greater_df) >= 1 and len(np.unique(class_counts_G.values)) != 1:
                            default_class_probabilities_G = [class_counts_G.get(label,0)/class_counts_G.sum() for label in self.labels]
                        else:
                            default_class_probabilities_G = default_class_probabilities
                            
                        divide_and_conquer(lower_equal_df, features, class_counts_LE, default_class_probabilities_LE, min_samples_split, random_features, lower_equal_child_nodeno)
                        divide_and_conquer(greater_df, features, class_counts_G, default_class_probabilities_G, min_samples_split, random_features, greater_child_nodeno)

                    else:
                        # No more features left
                        class_probabilities = [class_counts.get(label,0)/class_counts.sum() for label in self.labels]
                        #print("Case 3 (No features): {0}".format(class_probabilities))
                        node_list.append((nodeno,"leaf",class_probabilities))
                else:
                    # Only one class left
                    class_probabilities = [class_counts.get(label,0)/class_counts.sum() for label in self.labels]
                    #print("Case 2 (Only one class): {0}".format(class_probabilities))
                    node_list.append((nodeno,"leaf",class_probabilities))
            else:
                # Number of instances left < min_samples_split
                #print("Case 1 : {0}".format(default_class_probabilities))
                if df.empty or len(np.unique(class_counts.values)) == 1:
                    node_list.append((nodeno,"leaf",default_class_probabilities))
                else:
                    class_probabilities = [class_counts.get(label,0)/class_counts.sum() for label in self.labels]
                    node_list.append((nodeno,"leaf",class_probabilities))

            # list.append returns None, so every branch falls through and returns the list itself
            return nodeno, node_list
        
        def generate_bootstrap(df):
            idx = np.random.choice(len(df),len(df),replace=True)
            df_out = df.iloc[idx].reset_index(drop=True)
            return df_out
        
        df_tmp,self.imputation = create_imputation(df) # Imputation mapping from df_tmp
        df_tmp,self.binning = create_bins(df_tmp, nobins, bintype) # Discretization mapping from df
        self.labels = df_tmp["CLASS"].astype("category").cat.categories # Categories of the "CLASS" column of df, set to be of type "category" 
        features = df_tmp.columns.drop(["CLASS","ID"]) # Find the available features
        
        random_forest = []
        for notree in range(notrees):
            df_tree = generate_bootstrap(df_tmp) # Generate a new bootstrap from df_tmp
            class_counts = df_tree.groupby("CLASS").size() # Find the class counts
            default_class_probabilities = [class_counts.get(label,0)/class_counts.sum() for label in self.labels] # Calculate the default class probabilities
        
            node_list = []             # Initialize node_list
            greater_child_nodes = []   # Initialize greater_child_nodes
            remaining_features = list(features.copy())    # Initialize remaining_features
            nodeno,tree = divide_and_conquer(df_tree, features, class_counts, default_class_probabilities, min_samples_split, random_features, 0)
            tree = sorted(tree, key=lambda x: x[0]) # Sort tree by nodeno
            
            random_forest.append(tree)
        
        self.model = random_forest
        #print(random_forest)
        
        
    def predict(self, df):
        # predictions: a dataframe with class labels as column names and the rows corresponding to
        #              predictions with estimated class probabilities for each row in df, where the class probabilities
        #              are the mean of all relative class frequencies in the leaves of the forest into which the instances in
        #              df fall
        #
        # Iterate over the rows calling some sub-function, e.g., make_prediction( ), which for a test row
        # finds all leaf nodes and calculates the average of their class probabilities
        
        # This function makes predictions from nodeno row and tree
        def make_prediction_randforest(nodeno,row,tree):
            if tree[nodeno][1] != "leaf":
                feature = tree[nodeno][1]
                for question in tree[nodeno][2].keys():
                    #print(question)
                    if eval(str(row[feature])+question):
                        new_nodeno = tree[nodeno][2][question]
                        
                        return make_prediction_randforest(new_nodeno,row,tree)
            else: 
                return tree[nodeno][2] 
            
        df_tmp = df.drop(columns=["CLASS","ID"], errors="ignore") # Drop CLASS and ID if they are present
        df_tmp = apply_imputation(df_tmp, self.imputation) # Apply imputation
        df_tmp = apply_bins(df_tmp,self.binning) # Apply bins
        
        predictions = pd.DataFrame(0.0,index=np.arange(len(df_tmp)), columns=self.labels) # Initialize predictions as float zeros
        forest = self.model # Get forest
        for tree in forest: # For each tree in forest
            #print(tree)
            for index, row in df_tmp.iterrows():
                prediction = make_prediction_randforest(0,row,tree)
                #print(prediction)
                predictions.loc[index] += prediction
        predictions = predictions/len(forest) # Predictions divided by the number of trees in the forest (to get the mean)
        #print(predictions)
        return predictions
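
The tree walk in `make_prediction_randforest` builds a Python expression from the stored question string and evaluates it with `eval`. A minimal alternative sketch is shown here; it assumes the question keys always have the form `<=v` or `>v`, as produced by `node_dict`, and that a node's position in the sorted list equals its `nodeno`. The names `walk_tree` and `forest_probabilities` are hypothetical and not part of the solution above. The toy tree is the one printed in exercise 1.

```python
import numpy as np

def walk_tree(tree, row, nodeno=0):
    """Follow one row down a tree stored as a list of (nodeno, feature, payload) tuples."""
    while tree[nodeno][1] != "leaf":
        _, feature, children = tree[nodeno]
        for question, child in children.items():
            # Parse the threshold instead of evaluating the string as code
            if question.startswith("<="):
                matches = row[feature] <= float(question[2:])
            else:  # assumed to be of the form ">v"
                matches = row[feature] > float(question[1:])
            if matches:
                nodeno = child
                break
        else:
            # No question matched, e.g. because the value is missing
            raise ValueError("no branch matches feature {0}".format(feature))
    return tree[nodeno][2]

def forest_probabilities(forest, row):
    """Mean of the leaf class probabilities over all trees in the forest."""
    return np.mean([walk_tree(tree, row) for tree in forest], axis=0)

toy_tree = [(0, "RI", {"<=1": 1, ">1": 2}),
            (1, "Na", {"<=4": 3, ">4": 4}),
            (2, "Na", {"<=5": 5, ">5": 6}),
            (3, "leaf", [0.5, 0.0, 0.0, 0.5]),
            (4, "leaf", [0.0, 0.0, 0.0, 1.0]),
            (5, "leaf", [0.2, 0.8, 0.0, 0.0]),
            (6, "leaf", [0.0, 0.0, 1.0, 0.0])]

print(walk_tree(toy_tree, {"RI": 1, "Na": 0}))
print(forest_probabilities([toy_tree, toy_tree], {"RI": 9, "Na": 7}))
```

Parsing the threshold avoids executing strings as code and raises an explicit error when no branch matches, instead of silently returning `None`.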
        

In [12]:
glass_train_df = pd.read_csv("glass_train.txt")

glass_test_df = pd.read_csv("glass_test.txt")

forest_model = DecisionForest()

test_labels = glass_test_df["CLASS"]

min_samples_split_values = [1,2,5]
random_features_values = [1,2,5]

parameters = [(min_samples_split,random_features) for min_samples_split in min_samples_split_values 
              for random_features in random_features_values]

results = np.empty((len(parameters),3))

for i in range(len(parameters)):
    t0 = time.perf_counter()
    forest_model.fit(glass_train_df,min_samples_split=parameters[i][0],random_features=parameters[i][1])
    print("Training time {0}: {1:.2f} s.".format(parameters[i],time.perf_counter()-t0))
    t0 = time.perf_counter()
    predictions = forest_model.predict(glass_test_df)
    print("Testing time {0}: {1:.2f} s.".format(parameters[i],time.perf_counter()-t0))
    results[i] = [accuracy(predictions,test_labels),brier_score(predictions,test_labels),
                  auc(predictions,test_labels)] # Assuming that you have defined auc - remove otherwise

results = pd.DataFrame(results,index=pd.MultiIndex.from_product([min_samples_split_values,random_features_values]),
                       columns=["Accuracy","Brier score","AUC"])

results

Training time (1, 1): 1.74 s.
Testing time (1, 1): 1.69 s.
Training time (1, 2): 5.59 s.
Testing time (1, 2): 1.34 s.
Training time (1, 5): 6.32 s.
Testing time (1, 5): 1.38 s.
Training time (2, 1): 2.19 s.
Testing time (2, 1): 0.97 s.
Training time (2, 2): 2.86 s.
Testing time (2, 2): 0.93 s.
Training time (2, 5): 7.15 s.
Testing time (2, 5): 0.92 s.
Training time (5, 1): 1.85 s.
Testing time (5, 1): 1.16 s.
Training time (5, 2): 3.11 s.
Testing time (5, 2): 0.88 s.
Training time (5, 5): 6.43 s.
Testing time (5, 5): 0.91 s.


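| min_samples_split | random_features | Accuracy | Brier score | AUC |
|------------|------------|------------|------------|------------|
| 1 | 1 | 0.64486 | 0.495823 | 0.870735 |
| 1 | 2 | 0.616822 | 0.513723 | 0.833324 |
| 1 | 5 | 0.672897 | 0.478426 | 0.860754 |
| 2 | 1 | 0.663551 | 0.507072 | 0.844867 |
| 2 | 2 | 0.682243 | 0.480402 | 0.855857 |
| 2 | 5 | 0.728972 | 0.450066 | 0.880429 |
| 5 | 1 | 0.663551 | 0.509999 | 0.836641 |
| 5 | 2 | 0.682243 | 0.487489 | 0.838953 |
| 5 | 5 | 0.691589 | 0.451564 | 0.874586 |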


In [13]:
train_labels = glass_train_df["CLASS"]
forest_model.fit(glass_train_df,min_samples_split=1)
predictions = forest_model.predict(glass_train_df)
print("Accuracy on training set: {0:.2f}".format(accuracy(predictions,train_labels)))
print("AUC on training set: {0:.2f}".format(auc(predictions,train_labels)))
print("Brier score on training set: {0:.2f}".format(brier_score(predictions,train_labels)))

Accuracy on training set: 0.78
AUC on training set: 0.95
Brier score on training set: 0.43


### Comment on assumptions, things that do not work properly, etc.
For the same reason as above (see the comments on exercise 1), our AUC on the training set does not match the one in "Assignment 3.html". Apart from that, the algorithm seems to work fine.