# ID2214/FID3214 Assignment 3 Group no. [enter]
### Project members: 
[Enter Name, email]
[Enter Name, email]
[Enter Name, email]
[Enter Name, email]

### Declaration
By submitting this solution, it is hereby declared that all individuals listed above have contributed to the solution, either with code that appears in the final solution below, or with code that has been evaluated and compared to the final solution, but for some reason has been excluded. It is also declared that all project members fully understand all parts of the final solution and can explain it upon request.

It is furthermore declared that the code below is a contribution by the project members only, and specifically that no part of the solution has been copied from any other source (except for lecture slides at the course ID2214/FID3214) and no part of the solution has been provided by someone not listed as project member above.

It is furthermore declared that it has been understood that no other library/package than the Python 3 standard library, NumPy, pandas, time and sklearn.tree, may be used in the solution for this assignment.

### Instructions
All parts of the assignment starting with number 1 below are mandatory. Satisfactory solutions
will give 1 point (in total). If they in addition are good (all parts work more or less 
as they should), completed on time (submitted before the deadline in Canvas) and according
to the instructions, together with satisfactory solutions of all parts of the assignment starting 
with number 2 below, then the assignment will receive 2 points (in total).

Note that you do not have to develop the code directly within the notebook
but may instead copy the comments and test cases to a more convenient development environment
and when everything works as expected, you may paste your functions into this
notebook, do a final testing (all cells should succeed) and submit the whole notebook 
(a single file) in Canvas (do not forget to fill in your group number and names above).

In [1]:
#pip install numpy==1.17.2

#pip install pandas==1.1.3

## Load NumPy, pandas, time and DecisionTreeClassifier from sklearn.tree

In [2]:
import numpy as np
import pandas as pd
import time
from sklearn.tree import DecisionTreeClassifier

## Reused functions from Assignment 1

In [3]:
# Copy and paste functions from Assignment 1 here that you need for this assignment
from pandas.api.types import CategoricalDtype

def create_column_filter(df):
    # List of columns we need to consider.
    # Use sets to easily subtract ID and CLASS from the list of columns.
    subset = list(set(df.columns.tolist())-{"ID", "CLASS"})

    # List of columns to keep
    column_filter = ["ID", "CLASS"]

    # Only select columns that contain at least 2 unique values.
    for col in subset:
        if df[col].nunique() > 1:
            column_filter.append(col)

    return apply_column_filter(df, column_filter), column_filter


def apply_column_filter(input, column_filter):
    # Intersection of columns in input and columns to filter.
    keep_columns = list(set(input.columns.tolist()).intersection(set(column_filter)))
    return input[keep_columns]


def create_normalization(df, normalizationtype="minmax"):
    # List of columns we need to consider.
    # Use sets to easily subtract ID and CLASS from the list of columns.
    subset = list(set(df.columns.tolist())-{"ID", "CLASS"})

    # Zip 3 lists together: the normalization type and, per column, min and max (minmax) or mean and std (zscore)
    if normalizationtype == "minmax":
        norms = list(zip(["minmax" for _ in range(len(subset))], df[subset].min(), df[subset].max()))
    elif normalizationtype == "zscore":
        norms = list(zip(["zscore" for _ in range(len(subset))], df[subset].mean(), df[subset].std()))
    else:
        raise ValueError("wrong normalization type")

    # Combine the columns with their respective values, and convert to a dict.
    normalization = dict(zip(subset, norms))

    return apply_normalization(df, normalization), normalization


def apply_normalization(input_df, normalization):
    df = input_df.copy()

    # Loop through the dictionary and apply the respective normalization to each column.
    for col, norm in normalization.items():
        if norm[0] == "minmax":
            df[[col]] = (df[[col]]-norm[1])/(norm[2]-norm[1])
        if norm[0] == "zscore":
            df[[col]] = (df[[col]]-norm[1])/(norm[2])

    return df


def create_imputation(df):
    # List of columns we need to consider.
    # Use sets to easily subtract ID and CLASS from the list of columns.
    subset = list(set(df.columns.tolist())-{"ID", "CLASS"})

    numeric_means = df[subset].mean(numeric_only=True).to_dict()
    # For categorical (non-numeric) columns, calculate the modes and convert to a dict.
    catagoric_modes = df[subset].select_dtypes(exclude=[np.number]).mode().iloc[0].to_dict()

    imputation = {**numeric_means, **catagoric_modes}
    return apply_imputation(df, imputation), imputation


def apply_imputation(input_df, imputation):
    df = input_df.copy()
    df = df.fillna(imputation)
    return df


def create_bins(df, nobins, bintype="equal-width"):
    # List of columns we need to consider, only numeric.
    # Use sets to easily subtract ID and CLASS from the list of columns.
    subset = list(set(df.select_dtypes(include=np.number).columns.tolist())-{"ID", "CLASS"})

    df = df.copy()
    binning = dict()
    # For every column, call the respective pandas function for binning.
    for col in subset:
        if bintype == "equal-width":
            df[col], binning[col] = pd.cut(df[col], nobins, retbins=True, duplicates='drop', labels=False)
        if bintype == "equal-size":
            df[col], binning[col] = pd.qcut(df[col], nobins, retbins=True, duplicates='drop', labels=False)

        # Convert boundaries of outer bins to infinite.
        binning[col][0] = -float('inf')
        binning[col][-1] = float('inf')

    return df, binning


def apply_bins(df, binning):
    df = df.copy()
    for col, bins in binning.items():
        df[col] = pd.cut(df[col], bins, labels=False)
    return df


def create_one_hot(df):
    # List of columns we need to consider, only categorical (object or category dtype).
    # Use sets to easily subtract ID and CLASS from the list of columns.
    subset = list(set(df.select_dtypes(include=['object', 'category']).columns.tolist())-{"ID", "CLASS"})

    # Create dict of the possible values for each column
    one_hot = dict()
    for col in subset:
        one_hot[col] = df[col].dropna().unique()

    return apply_one_hot(df, one_hot), one_hot


def apply_one_hot(df, one_hot):
    df = df.copy()
    for col, values in one_hot.items():
        # convert values to categorical
        df[col] = df[col].astype(CategoricalDtype(values))
        # Use get_dummies to get one-hot encoding.
        dummies = pd.get_dummies(df[col], prefix=col, dtype=float)

        # Add new columns to dataframe, and drop the old column.
        df = pd.concat([df, dummies], axis=1)
        df.drop([col], axis=1, inplace=True)

    return df


def split(df, testfraction):
    df_split = df.copy()
    if testfraction > 1 or testfraction < 0:
        raise ValueError("testfraction must be between 0 and 1")

    # Calculate test length
    test_frac = int(df_split.shape[0] * testfraction)
    # Calculate random test indexes of length test_frac
    test_idx = np.random.permutation(df_split.shape[0])[:test_frac]

    # Create dataset based on test_idx
    trainingdf = df_split.drop(test_idx)
    testdf = df_split.loc[test_idx]

    return trainingdf, testdf


def accuracy(df, correctlabels):
    # Get the name of the column with the highest value in each row. Take the first if there are ties.
    predictions = df.idxmax(axis="columns")
    # Count the correct predictions with a generator expression.
    correct = sum(label == guess for label, guess in zip(correctlabels, predictions))

    return correct/len(predictions)


def folds(input_df, nofolds=10):
    if nofolds <= 1:
        raise ValueError("nofolds should be greater than 1")
    # Random shuffle by using a sample of fraction 1.
    df = input_df.sample(frac=1).reset_index(drop=True)
    # Use np.array_split to split into nofolds.
    return np.array_split(df, nofolds)


def brier_score(df, correctlabels):
    labels_vec = np.zeros([len(correctlabels), df.shape[1]])
    # Converts labels into one-hot encoded vectors
    for i, l in enumerate(correctlabels):
        labels_vec[i] = np.where(df.columns == l, 1, 0)

    values = df.copy().values

    # Calculate mean squared error.
    brier_score = np.sum(np.power(values - labels_vec, 2)) / values.shape[0]

    return brier_score


def auc(df: pd.DataFrame, correctlabels: list):
    auc_tot = 0
    for c in df.columns.tolist():
        class_labels = np.where(np.array(correctlabels)==c, True, False)
        weight = np.sum(class_labels) / class_labels.shape[0]

        tp = np.zeros(class_labels.shape[0])
        fp = np.zeros(class_labels.shape[0])
        for i, v in enumerate(df[c].values):
            if class_labels[i] == True:
                tp[i] = 1
            else:
                fp[i] = 1

        scores = np.zeros([len(tp),3])
        for i in range(len(tp)):
            scores[i] = [df[c][i], tp[i], fp[i]]
        
        # sort in reverse
        scores = scores[scores[:,0].argsort()[::-1]]
        
        auc = 0
        cov_tp = 0
        tot_tp = np.sum(tp)
        tot_fp = np.sum(fp)
        for i in range(scores.shape[0]):
            tp_i = scores[i][1]
            fp_i = scores[i][2]
            
            if fp_i == 0: # no false positives
                cov_tp += tp_i
            elif tp_i == 0: # no true positives
                auc += (cov_tp/tot_tp)*(fp_i/tot_fp)
            else:
                auc += (cov_tp/tot_tp)*(fp_i/tot_fp) + (tp_i/tot_tp)*(fp_i/tot_fp)/2
                cov_tp += tp_i
        
        auc_tot += auc * weight
    
    return auc_tot

## 1. Define the class RandomForest

In [4]:
# Define the class RandomForest with three functions __init__, fit and predict (after the comments):
#
# Input to __init__: 
# self - the object itself
#
# Output from __init__:
# <nothing>
# 
# This function does not return anything but just initializes the following attributes of the object (self) to None:
# column_filter, imputation, one_hot, labels, model
#
# Input to fit:
# self      - the object itself
# df        - a dataframe (where the column names "CLASS" and "ID" have special meaning)
# no_trees  - no. of trees in the random forest (default = 100)
#
# Output from fit:
# <nothing>
#
# The result of applying this function should be:
#
# self.column_filter - a column filter (see Assignment 1) from df
# self.imputation    - an imputation mapping (see Assignment 1) from df
# self.one_hot       - a one-hot mapping (see Assignment 1) from df
# self.labels        - a (sorted) list of the categories of the "CLASS" column of df
# self.model         - a random forest, consisting of no_trees trees, where each tree is generated from a bootstrap sample
#                      and the number of evaluated features is log2|F| where |F| is the total number of features
#                      (for details, see lecture slides)
#
# Note that the function does not return anything but just assigns values to the attributes of the object.
#
# Hint 1: First create the column filter, imputation and one-hot mappings
#
# Hint 2: Then get the class labels and the numerical values (as an ndarray) from the dataframe after dropping the class labels 
#
# Hint 3: Generate no_trees classification trees, where each tree is generated using DecisionTreeClassifier 
#         from a bootstrap sample (see lecture slides), e.g., generated by np.random.choice (with replacement) 
#         from the row numbers of the ndarray, and where a random sample of the features are evaluated in
#         each node of each tree, of size log2(|F|), where |F| is the total number of features;
#         see the parameter max_features of DecisionTreeClassifier
#
# Input to predict:
# self - the object itself
# df   - a dataframe
# 
# Output from predict:
# predictions - a dataframe with class labels as column names and the rows corresponding to
#               predictions with estimated class probabilities for each row in df, where the class probabilities
#               are the averaged probabilities output by each decision tree in the forest
#
# Hint 1: Drop any "CLASS" and "ID" columns of the dataframe first and then apply column filter, imputation and one_hot
#
# Hint 2: Iterate over the trees in the forest to get the prediction of each tree by the method predict_proba(X) where 
#         X are the (numerical) values of the transformed dataframe; you may get the average predictions of all trees,
#         by first creating a zero-matrix with one row for each test instance and one column for each class label, 
#         to which you add the prediction of each tree on each iteration, and then finally divide the prediction matrix
#         by the number of trees.
#
# Hint 3: You may assume that each bootstrap sample that was used to generate each tree has included all possible
#         class labels and hence the prediction of each tree will contain probabilities for all class labels
#         (in the same order). Note that this assumption may be violated, and this limitation will be addressed 
#         in the next part of the assignment. 

def create_tree(df):
    # Choose bootstrapped Indices
    idx = np.random.choice(len(df), replace=True, size=len(df))
    # Select bootstrap sample from indices
    sample = df.iloc[idx]
    features = list(set(df.columns.tolist())-{"ID", "CLASS"})
    # Create and fit Tree Classifier with log2 maximum features
    tree = DecisionTreeClassifier(max_features="log2")
    return tree.fit(sample[features], sample["CLASS"])
    

class RandomForest():
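    def __init__(self):
        # Initialize all attributes to None, as required by the specification above
        self.column_filter = None
        self.imputation = None
        self.one_hot = None
        self.labels = None
        self.model = None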
    
    def fit(self, df, no_trees=100):
        # Work on a copy so that the caller's dataframe is not modified
        df = df.copy()
        # Convert the column to category dtype
        df['CLASS'] = df['CLASS'].astype('category')
        # List of possible labels
        self.labels = df['CLASS'].cat.categories.tolist()
        
        # Create and Apply Preprocessing
        df, self.column_filter = create_column_filter(df)
        df, self.imputation = create_imputation(df)
        df, self.one_hot = create_one_hot(df)

        # Create no_trees different Tree Classifiers
        self.model = [create_tree(df) for _ in range(no_trees)]
        return

    def predict(self, df):
        # Apply Preprocessing
        df = apply_column_filter(df, self.column_filter)
        df = apply_imputation(df, self.imputation)
        df = apply_one_hot(df, self.one_hot)
        features = list(set(df.columns.tolist())-{"ID", "CLASS"})

        # Matrix to store predictions
        predictions = np.zeros((len(df), len(self.labels)))

        # For each tree, predict and add to matrix
        for tree in self.model:
            y = tree.predict_proba(df[features])
            predictions = predictions + y

        predictions = pd.DataFrame(predictions, columns=self.labels)
        # Divide predictions by no_trees to normalize
        predictions = predictions.div(len(self.model), axis=0)
        return predictions

In [5]:
# Test your code (leave this part unchanged, except for if auc is undefined)

train_df = pd.read_csv("tic-tac-toe_train.csv")

test_df = pd.read_csv("tic-tac-toe_test.csv")

rf = RandomForest()

t0 = time.perf_counter()
rf.fit(train_df)
print("Training time: {:.2f} s.".format(time.perf_counter()-t0))

test_labels = test_df["CLASS"]

t0 = time.perf_counter()
predictions = rf.predict(test_df)

print("Testing time: {:.2f} s.".format(time.perf_counter()-t0))

print("Accuracy: {:.4f}".format(accuracy(predictions,test_labels)))
print("AUC: {:.4f}".format(auc(predictions,test_labels))) # Comment this out if not implemented in assignment 1
print("Brier score: {:.4f}".format(brier_score(predictions,test_labels))) # Comment this out if not implemented in assignment 1

Training time: 0.25 s.
Testing time: 0.15 s.
Accuracy: 0.8977
AUC: 0.9873
Brier score: 0.1833
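
The averaging step in `predict` can be illustrated on its own. The following is a minimal sketch with made-up numbers: `tree_outputs` is a hypothetical stand-in for the per-tree `predict_proba` results (it is not produced by the code above), and all trees are assumed to report the same classes in the same order.

```python
import numpy as np

# Hypothetical predict_proba outputs of three trees for two test instances
# and two class labels (rows: instances, columns: classes).
tree_outputs = [
    np.array([[1.0, 0.0], [0.0, 1.0]]),
    np.array([[1.0, 0.0], [1.0, 0.0]]),
    np.array([[0.0, 1.0], [0.0, 1.0]]),
]

# Start from a zero matrix, add each tree's output, then divide by the tree count.
avg = np.zeros((2, 2))
for proba in tree_outputs:
    avg += proba
avg /= len(tree_outputs)

print(avg)
```

Each row of `avg` still sums to 1, because it is an average of rows that each sum to 1.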


In [6]:
train_labels = train_df["CLASS"]
predictions = rf.predict(train_df)
print("Accuracy on training set: {0:.4f}".format(accuracy(predictions,train_labels)))
print("AUC on training set: {0:.4f}".format(auc(predictions,train_labels))) # Comment this out if not implemented in assignment 1
print("Brier score on training set: {0:.4f}".format(brier_score(predictions,train_labels))) # Comment this out if not implemented in assignment 1

Accuracy on training set: 1.0000
AUC on training set: 1.0000
Brier score on training set: 0.0205


### Comment on assumptions, things that do not work properly, etc.
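
Part 1 assumes that every bootstrap sample contains all class labels. How often that fails depends on how rare a class is: a class with `m` instances among `n` training rows is absent from a bootstrap sample of size `n` with probability `(1 - m/n)^n`. A rough sketch of that arithmetic follows; the values of `n` and `m` are made up for illustration and are not counts from the data sets used here, and `prob_class_missing` is a hypothetical helper name.

```python
def prob_class_missing(n, m):
    """Probability that none of the m instances of a class is drawn
    in n draws with replacement from n rows."""
    return (1 - m / n) ** n

# Hypothetical sizes, chosen only for illustration.
for m in (1, 2, 5):
    print(m, round(prob_class_missing(100, m), 4))
```

Since a forest contains many trees, even a small per-tree probability can make it plausible that at least one tree never sees a rare class, which is the situation part 2a handles.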


## 2a. Handling trees with non-aligned predictions

In [7]:
# Define a revised version of the class RandomForest with the same input and output as described in part 1 above,
# where the predict function is able to handle the case where the individual trees are trained on bootstrap samples
# that do not include all class labels in the original training set. As a result, the class probabilities output
# by the individual trees in the forest do not refer to the same set of class labels.
#
# Hint 1: The categories obtained with <pandas series>.cat.categories are sorted in the same way as the class labels
#         of a DecisionTreeClassifier; the latter are obtained by <DecisionTreeClassifier>.classes_ 
#         The problem is that classes_ may not include all possible labels, and hence the individual predictions 
#         obtained by <DecisionTreeClassifier>.predict_proba may be of different length or even if they are of the same
#         length do not necessarily refer to the same class labels. You may assume that each class label that is not included
#         in a bootstrap sample should be assigned zero probability by the tree generated from the bootstrap sample. 
#
# Hint 2: Create a mapping from the complete (and sorted) set of class labels l0, ..., lk-1 to a set of indexes 0, ..., k-1,
#         where k is the number of classes
#
# Hint 3: For each tree t in the forest, create a (zero) matrix with one row per test instance and one column per class label,
#         to which one column is added at a time from the output of t.predict_proba 
#
# Hint 4: For each column output by t.predict_proba, its index i may be used to obtain its label by t.classes_[i];
#         you may then obtain the index of this label in the ordered list of all possible labels from the above mapping (hint 2); 
#         this index points to which column in the prediction matrix the output column should be added to 

# Method for creating a DecisionTreeClassifier with bootstrapped data
def create_tree(df):
    # Choose bootstrapped Indices
    idx = np.random.choice(len(df), replace=True, size=len(df))
    # Select bootstrap sample from indices
    sample = df.iloc[idx]
    features = list(set(df.columns.tolist())-{"ID", "CLASS"})
    # Create and fit Tree Classifier with log2 maximum features
    tree = DecisionTreeClassifier(max_features="log2")
    return tree.fit(sample[features], sample["CLASS"])    

class RandomForest():
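    def __init__(self):
        # Initialize all attributes to None, as required by the specification in part 1
        self.column_filter = None
        self.imputation = None
        self.one_hot = None
        self.labels = None
        self.model = None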
    
    def fit(self, df, no_trees=100):
        # Work on a copy so that the caller's dataframe is not modified
        df = df.copy()
        # Convert the column to category dtype
        df['CLASS'] = df['CLASS'].astype('category')
        # List of possible labels
        self.labels = df['CLASS'].cat.categories.tolist()
        
        # Create and Apply Preprocessing
        df, self.column_filter = create_column_filter(df)
        df, self.imputation = create_imputation(df)
        df, self.one_hot = create_one_hot(df)
        # Create no_trees different Tree Classifiers
        self.model = [create_tree(df) for _ in range(no_trees)]
        return

    def predict(self, df):
        # Apply Preprocessing
        df = apply_column_filter(df, self.column_filter)
        df = apply_imputation(df, self.imputation)
        df = apply_one_hot(df, self.one_hot)
        features = list(set(df.columns.tolist())-{"ID", "CLASS"})

        # Matrix to store predictions
        predictions = np.zeros((len(df), len(self.labels)))

        for tree in self.model:
            # Generate predictions for tree
            y = tree.predict_proba(df[features])

            # Temporary matrix to store tree predictions
            tree_predictions = np.zeros((len(df), len(self.labels)))

            # Transpose the predictions in order to loop through the columns (one per class)
            for i, probs in enumerate(y.T):
                # Find true index of class
                c = tree.classes_[i]
                c_i = self.labels.index(c)
                # Add class predictions for each datapoint to tree_predictions
                tree_predictions[:, c_i] += probs               
                    
            predictions = predictions + tree_predictions

        predictions = pd.DataFrame(predictions, columns=self.labels)
        # Divide predictions by no_trees to normalize
        predictions = predictions.div(len(self.model), axis=0)
        return predictions

In [8]:
# Test your code (leave this part unchanged, except for if auc is undefined)

train_df = pd.read_csv("anneal_train.csv")

test_df = pd.read_csv("anneal_test.csv")

rf = RandomForest()

t0 = time.perf_counter()
rf.fit(train_df)
print("Training time: {:.2f} s.".format(time.perf_counter()-t0))

test_labels = test_df["CLASS"]

t0 = time.perf_counter()
predictions = rf.predict(test_df)
print("Testing time: {:.2f} s.".format(time.perf_counter()-t0))

print("Accuracy: {:.4f}".format(accuracy(predictions,test_labels)))
print("AUC: {:.4f}".format(auc(predictions,test_labels))) # Comment this out if not implemented in assignment 1
print("Brier score: {:.4f}".format(brier_score(predictions,test_labels))) # Comment this out if not implemented in assignment 1

Training time: 0.26 s.
Testing time: 0.17 s.
Accuracy: 0.9510
AUC: 0.9721
Brier score: 0.0979
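
The alignment done in `predict` can be shown without training any trees. In this sketch, `tree_classes` and `proba` are hypothetical stand-ins for `tree.classes_` and the output of `tree.predict_proba`; the label names and probabilities are made up.

```python
import numpy as np

all_labels = ["a", "b", "c"]  # complete, sorted label list
label_to_idx = {label: i for i, label in enumerate(all_labels)}

# A tree whose bootstrap sample missed class "b" reports only two columns.
tree_classes = np.array(["a", "c"])
proba = np.array([[0.25, 0.75],
                  [1.00, 0.00]])

# Scatter the tree's columns into their positions in the full label list;
# classes the tree never saw keep probability zero.
aligned = np.zeros((proba.shape[0], len(all_labels)))
cols = [label_to_idx[c] for c in tree_classes]
aligned[:, cols] = proba

print(aligned)
```

A dictionary replaces the repeated `list.index` lookups here, although with a handful of classes the difference is negligible.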


## 2b. Estimate predictive performance using out-of-bag predictions

In [9]:
# Define an extended version of the class RandomForest with the same input and output as described in part 2a above,
# where the results of the fit function also should include:
# self.oob_acc - the accuracy estimated on the out-of-bag predictions, i.e., the fraction of training instances for 
#                which the given (correct) label is the same as the predicted label when using only trees for which
#                the instance is out-of-bag
#
# Hint 1: You may first create a zero matrix with one row for each training instance and one column for each class label
#         and one zero vector to allow for storing aggregated out-of-bag predictions and the number of out-of-bag predictions
#         for each training instance, respectively
#
# Hint 2: After generating a tree in the forest, iterate over the indexes that were not included in the bootstrap sample
#         and add a prediction of the tree to the out-of-bag prediction matrix and update the count vector
#
# Hint 3: Note that the input to predict_proba has to be a matrix; from a single vector (row) x, a matrix with one row
#         can be obtained by x[None,:]
#
# Hint 4: Finally, divide each row in the out-of-bag prediction matrix with the corresponding element of the count vector


class RandomForest():
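    def __init__(self):
        # Initialize all attributes to None, as required by the specification
        self.column_filter = None
        self.imputation = None
        self.one_hot = None
        self.labels = None
        self.model = None
        self.oob_acc = None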
    
    def fit(self, df, no_trees=100):
        # Work on a copy so that the caller's dataframe is not modified
        df = df.copy()
        # Convert the column to category dtype
        df['CLASS'] = df['CLASS'].astype('category')
        # List of possible labels
        self.labels = df['CLASS'].cat.categories.tolist()
    
        # Create and Apply Preprocessing
        df, self.column_filter = create_column_filter(df)
        df, self.imputation = create_imputation(df)
        df, self.one_hot = create_one_hot(df)

        self.model = []

        # Matrix to store oob predictions
        oob_predictions = np.zeros((len(df), len(self.labels)))
        # oob prediction counts for each datapoint
        prediction_count = np.zeros(len(df))
        for _ in range(no_trees):
            # Create a bootstrapped data sample
            idx = np.random.choice(len(df), replace=True, size=len(df))
            sample = df.iloc[idx]
            features = list(set(df.columns.tolist())-{"ID", "CLASS"})

            # Create and fit tree with log2 max features
            tree = DecisionTreeClassifier(max_features="log2")
            tree.fit(sample[features], sample["CLASS"])

            # Create oob sample
            oob = df.drop(idx)
            # Predictions for oob
            y = tree.predict_proba(oob[features])

            # Zip oob indices with their predictions and loop through
            for oob_i, probs in zip(list(oob.index.values), y):
                for i, prob in enumerate(probs):
                    # For each class probability, find the index of its class in the full label list
                    c = tree.classes_[i]
                    c_i = self.labels.index(c)
                    # Add class prediction to oob_predictions matrix
                    oob_predictions[oob_i, c_i] += prob
                # Increment the datapoints oob prediction count
                prediction_count[oob_i] += 1

            self.model.append(tree)
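        # Normalize oob predictions by dividing by the prediction counts for each data point.
        # An instance that was never out-of-bag has count 0; divide by 1 there to avoid NaN rows.
        oob_predictions = oob_predictions / np.maximum(prediction_count, 1)[:, None]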
        oob_predictions = pd.DataFrame(oob_predictions, columns=self.labels)
        # Calculate accuracy from oob predictions
        self.oob_acc = accuracy(oob_predictions, df["CLASS"])
        return

    def predict(self, df):
        # Apply Preprocessing
        df = apply_column_filter(df, self.column_filter)
        df = apply_imputation(df, self.imputation)
        df = apply_one_hot(df, self.one_hot)
        features = list(set(df.columns.tolist())-{"ID", "CLASS"})

        # Matrix to store predictions
        predictions = np.zeros((len(df), len(self.labels)))

        for tree in self.model:
            # Generate predictions for tree
            y = tree.predict_proba(df[features])

            # Temporary matrix to store tree predictions
            tree_predictions = np.zeros((len(df), len(self.labels)))

            # Transpose the predictions in order to loop through the columns (one per class)
            for i, probs in enumerate(y.T):
                # Find true index of class
                c = tree.classes_[i]
                c_i = self.labels.index(c)
                # Add class predictions for each datapoint to tree_predictions
                tree_predictions[:, c_i] += probs               
                    
            predictions = predictions + tree_predictions

        predictions = pd.DataFrame(predictions, columns=self.labels)
        # Divide predictions by no_trees to normalize
        predictions = predictions.div(len(self.model), axis=0)
        return predictions

In [10]:
# Test your code (leave this part unchanged, except for if auc is undefined)

train_df = pd.read_csv("anneal_train.csv")

test_df = pd.read_csv("anneal_test.csv")

rf = RandomForest()

t0 = time.perf_counter()
rf.fit(train_df)
print("Training time: {:.2f} s.".format(time.perf_counter()-t0))

print("OOB accuracy: {:.4f}".format(rf.oob_acc))

test_labels = test_df["CLASS"]

t0 = time.perf_counter()
predictions = rf.predict(test_df)
print("Testing time: {:.2f} s.".format(time.perf_counter()-t0))

print("Accuracy: {:.4f}".format(accuracy(predictions,test_labels)))
print("AUC: {:.4f}".format(auc(predictions,test_labels))) # Comment this out if not implemented in assignment 1
print("Brier score: {:.4f}".format(brier_score(predictions,test_labels))) # Comment this out if not implemented in assignment 1

Training time: 0.48 s.
OOB accuracy: 0.9443
Testing time: 0.17 s.
Accuracy: 0.9465
AUC: 0.9716
Brier score: 0.1032
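
The out-of-bag rows for one tree are the row positions that never appear in its bootstrap sample. Below is a minimal sketch that finds them with `np.setdiff1d` and compares the observed out-of-bag fraction with its expected value `(1 - 1/n)^n`, which approaches 1/e (about 0.368) as `n` grows. The seed and the size `n` are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
n = 1000                        # hypothetical number of training rows

idx = rng.choice(n, size=n, replace=True)   # bootstrap sample (row positions)
oob_idx = np.setdiff1d(np.arange(n), idx)   # rows never drawn

print(len(oob_idx) / n)   # observed out-of-bag fraction
print((1 - 1 / n) ** n)   # expected out-of-bag fraction
```

Because roughly a third of the rows are out-of-bag for each tree, every training instance is typically scored by about a third of the forest, which is what makes the out-of-bag accuracy a usable estimate.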


In [11]:
train_labels = train_df["CLASS"]
rf = RandomForest()
rf.fit(train_df)
predictions = rf.predict(train_df)
print("Accuracy on training set: {0:.2f}".format(accuracy(predictions,train_labels)))
print("AUC on training set: {0:.2f}".format(auc(predictions,train_labels)))
print("Brier score on training set: {0:.2f}".format(brier_score(predictions,train_labels)))

Accuracy on training set: 1.00
AUC on training set: 1.00
Brier score on training set: 0.01


### Comment on assumptions, things that do not work properly, etc.
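
The out-of-bag normalization divides each row by the number of trees for which that instance was out-of-bag. That count is zero for an instance that was part of every bootstrap sample, which is very unlikely with 100 trees but possible with only a few. A hedged sketch of one way to avoid the resulting division by zero is given below; the arrays hold made-up values, and leaving such rows at zero is a design choice, not something the assignment prescribes.

```python
import numpy as np

# Hypothetical aggregated OOB probabilities for three instances and two classes.
oob_sum = np.array([[2.0, 1.0],
                    [0.0, 0.0],   # never out-of-bag
                    [0.0, 4.0]])
counts = np.array([3.0, 0.0, 4.0])

# Divide only where the count is positive; other rows stay at zero.
oob_avg = np.divide(oob_sum, counts[:, None],
                    out=np.zeros_like(oob_sum),
                    where=counts[:, None] > 0)

print(oob_avg)
```

An all-zero row has no meaningful arg-max, so it may be preferable to exclude such instances from the accuracy estimate altogether.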