# Homework 5: Ensembling
### Write your own ensemble classifier

Creating a custom classifier for scikit-learn involves implementing a class that adheres to the scikit-learn estimator interface. Here's a basic outline of the steps and the classes/functions you need to implement for your own ensemble classifier:

In [1]:
from sklearn.base import BaseEstimator, ClassifierMixin


class CustomClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self, base_classifier=None, n_estimators=10):
        # Initialization code
        pass
    
    def fit(self, X, y):
        # Training code
        pass
    
    def predict(self, X):
        # Prediction code
        pass
    
    def predict_proba(self, X):
        # Prediction code with probabilities: The output should be a matrix of shape (n_samples, n_classes)
        pass


For example, a "majority class classifier" that always predicts the majority class can be implemented as follows:

In [2]:
import numpy as np

from sklearn.base import BaseEstimator, ClassifierMixin
class Majority(BaseEstimator, ClassifierMixin):
    """A classifier that classifies all instances as the majority class."""

    def __init__(self):   
        self.name = "Majority Classifier"
        
    def fit(self, X, y):
        ''' X is the training data, and y is the target labels. 
        Fitting the 'majority class classifier' involves determining and storing the majority class. '''
        
        # Count the number of instances of each class
        counts = np.bincount(y)
        # The class with the highest count is the majority class

        # Find the index with the maximum count (most frequent element)
        self.majority = np.argmax(counts)   # store the majority class label in a member attribute self.majority

    def predict(self, X):        
        '''X is the unlabaled data, the classifier returns the same prediciton (the majority class) for each instance in X'''
        return [self.majority for _ in range(len(X))]

For this homework, you will write your own ensemble classifier. 
1. In the ```__init__(self, parameters)``` function, you will set the parameters (number of trees, percentage of instances to use for each tree, random_seed, ...), initialize (diverse in terms of max_depth, min_samples_split, min_samples_leaf, and **max_features**) trees and store them in a member attribute (a Python _list_ called ```self.base_classifiers```).

2. In the ```fit(self, X, y)``` function, you need to fit the trees (in a for loop) on different subsets of data instances. 
3. In the ```predict(self, X)``` function, you need to predict all data points by all base classifiers (from ```self.base_classifiers```) and aggregate the result for each data point.
4. *Additional points: Implement also a probabilistic classifier by implementing ```predict_proba(self, X)```. 
5. Test your code, compare to the majority class classifier (above), to a single decision tree and to the scikit implementation of random forests.
6. Write a paragraph about your findings.

You can use ChatGPT, Copilot or other tools for this assignment. Mention the tools you used in your discussion.

Note:
In scikit-learn's decision tree implementation, the ```max_features``` parameter determines the maximum number of features considered for splitting a node. If set to an integer value, it considers that exact number of features. Alternatively, if set to a float between 0 and 1, it represents a fraction of total features. For example, max_features=0.5 means the algorithm will consider half of the features. More details: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

Notes about object-oriented programming: 
 - The whole Python _class_ (all the member functions, i.e. methods) need to be implemented in one Jupyter notebook cell.
 - ```self``` in front of variable names (or as a first parameter in functions) inside a class makes the variable (function) a member attribute (method). The members are accessible to other class members, i.e. if you define ```self.base_classifiers``` within the constructor (```__init__```), you can access it within ```fit``` and ```predict```.

 

In [3]:
# Note: The following code is useful to achieve randomness. You will use it to initialize decision trees with different parameters.
import random
options = [0, 1, 2, 3, 4]
random.choice(options)            # Run this line several times to see the result

3

In [4]:
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.tree import DecisionTreeClassifier
import numpy as np
import random

class MyForest(BaseEstimator, ClassifierMixin):
    """A simple implementation of the Random forest classifier."""

    def __init__(self, number_of_trees=10, features_subset=0.7, instances_subset=0.8, random_seed=42):
        # Initialization code
        self.number_of_trees = number_of_trees
        #self.features_subset = features_subset #Didn't need to use it in other functions, didn't store it
        self.instances_subset = instances_subset
        self.random_seed = random_seed
    
        random_seeds = np.random.randint(1,100,number_of_trees)  #I added this row because I realized that after setting the random seeds, I got basically the same tree parameters for each tree (because they got the same seed) 
        print("Random seeds: ", random_seeds)

        self.base_classifiers = []
        for i in range(number_of_trees):
            np.random.seed(random_seeds[i]) #Random choice has no seed, so had to do this
            max_depth = random.choice([2,3,4,5,6,7,8,9,10,11,12,13]) #I guess there is no point in selecting a larger depth than 13, because in the wine dataset there are only 13 attributes.
            min_samples_leaf = random.choice([1,2,5])# Min_samples_leaf I think isn't that important if you have min_samples_split,also I don't want to "ruin" for example max_depth = big value with big minimum leaf values (then there could only be few layers)   
            min_samples_split = random.choice([2,3,5,10,20,25,30])
            self.base_classifiers.append(DecisionTreeClassifier(max_features=features_subset, max_depth=max_depth, min_samples_leaf=min_samples_leaf, min_samples_split=min_samples_split, random_state=random_seed))
        pass
                
    def fit(self, X, y):
        # X is the training data, and y is the target labels
        np.random.seed(self.random_seed)
        for clf in self.base_classifiers:
            X_subset = np.empty((0,len(X[0])))
            y_subset = np.empty((0))

            js = list(range(len(X))) # All possible indices
            for i in range(int(len(X)*0.8)):
                j = random.choice(js) # tried t= np.random.choice(X) too, but requires an if in X_subset re-looping
                js.remove(j) # this way we don't resample
                X_subset = np.append(X_subset, X[j]) #np.append(X_subset, t)
                y_subset = np.append(y_subset, y[j]) #np.append(y_subset, y[X.index(t)]) 
                i += 1 #C-style for-loop
            X_subset = X_subset.reshape(int(len(X)*0.8),len(X[0]))
            clf.fit(X_subset,y_subset)
        
        pass

    def predict(self, X):        
        # X is the unlabaled data, the classifier should return a prediction for each instance in X
        # Get a prediction from each tree
        if self.base_classifiers == []: #just in case
            print("No proper fitting was done")
            return []
        
        np.random.seed(self.random_seed)
        modes = [] #Will collect the modes for each data row, of predictions. Decided not to store these in self.___ as they are input-dependent
        for _ in range(len(X)):
            predictions = [] #"Reset"
            for clf in self.base_classifiers:
                predictions.append(int(clf.predict(X[_].reshape(1,-1))) ) #I was thinking to use .tolist(), didn't work (must have a solution though), Copilot suggested .reshape(1,-1). Gives warning but works with np 1.25.2 + Python 3.11.5 
            bins = np.bincount(predictions) #above I assumed that the predictions are integers, not so "nice"
            modes.append(np.argmax(bins))
        return modes  
        pass
    
    def predict_proba(self,X):
        if self.base_classifiers == []:
            print("No proper fitting was done")
            return []

        np.random.seed(self.random_seed)
        probs = []
        for _ in range(len(X)):
            predictions = [] #"Reset"
            for clf in self.base_classifiers:
                predictions.append(int(clf.predict(X[_].reshape(1,-1))) )            
            pred = np.bincount(predictions)/self.number_of_trees
            while len(pred) < 3:
                pred = np.append(pred, 0) #If there is no prediction for class 2 or 3, then it is only [10] instead of [10,0,0], similarly [x,10-x] instead of [x,10-x,0]
            probs.append(pred)
        return probs
        pass

    

In [25]:
# Test your code here

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

# Load the data
wine = load_wine()
X = wine.data
y = wine.target

print(X.shape)

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Create the classifier
clf = MyForest(number_of_trees=10, features_subset=0.7, instances_subset=0.8, random_seed=42)
clf_majority = Majority()

# Train the classifier
clf.fit(X_train, y_train)
clf_majority.fit(X_train, y_train)

# Test the classifier
y_pred = clf.predict(X_test)
y_pred_majority = clf_majority.predict(X_test)

# Print the results
print("Accuracy:", sum(y_pred == y_test)/len(y_test))
print("Accuracy (majority):", sum(y_pred_majority == y_test)/len(y_test))


(178, 13)
Random seeds:  [52 93 15 72 61 21 83 87 75 75]
Accuracy: 0.9333333333333333
Accuracy (majority): 0.4




In [32]:
y_pred_probs = clf.predict_proba(X_test)
y_pred_probs

  predictions.append(int(clf.predict(X[_].reshape(1,-1))) )
  predictions.append(int(clf.predict(X[_].reshape(1,-1))) )
  predictions.append(int(clf.predict(X[_].reshape(1,-1))) )
  predictions.append(int(clf.predict(X[_].reshape(1,-1))) )
  predictions.append(int(clf.predict(X[_].reshape(1,-1))) )
  predictions.append(int(clf.predict(X[_].reshape(1,-1))) )
  predictions.append(int(clf.predict(X[_].reshape(1,-1))) )
  predictions.append(int(clf.predict(X[_].reshape(1,-1))) )
  predictions.append(int(clf.predict(X[_].reshape(1,-1))) )
  predictions.append(int(clf.predict(X[_].reshape(1,-1))) )
  predictions.append(int(clf.predict(X[_].reshape(1,-1))) )
  predictions.append(int(clf.predict(X[_].reshape(1,-1))) )
  predictions.append(int(clf.predict(X[_].reshape(1,-1))) )
  predictions.append(int(clf.predict(X[_].reshape(1,-1))) )
  predictions.append(int(clf.predict(X[_].reshape(1,-1))) )
  predictions.append(int(clf.predict(X[_].reshape(1,-1))) )
  predictions.append(int(clf.predict(X[_

[array([0.9, 0.1, 0. ]),
 array([1., 0., 0.]),
 array([0.1, 0.6, 0.3]),
 array([1., 0., 0.]),
 array([0.1, 0.9, 0. ]),
 array([1., 0., 0.]),
 array([0., 1., 0.]),
 array([0., 0., 1.]),
 array([0., 1., 0.]),
 array([0.5, 0.1, 0.4]),
 array([0.9, 0.1, 0. ]),
 array([0.1, 0.2, 0.7]),
 array([1., 0., 0.]),
 array([0. , 0.7, 0.3]),
 array([1., 0., 0.]),
 array([0., 1., 0.]),
 array([0., 1., 0.]),
 array([0., 1., 0.]),
 array([1., 0., 0.]),
 array([0., 1., 0.]),
 array([1., 0., 0.]),
 array([0.1, 0.9, 0. ]),
 array([0. , 0.7, 0.3]),
 array([0. , 0.1, 0.9]),
 array([0., 0., 1.]),
 array([0., 0., 1.]),
 array([0., 1., 0.]),
 array([0.1, 0.9, 0. ]),
 array([0., 1., 0.]),
 array([1., 0., 0.]),
 array([1., 0., 0.]),
 array([0., 1., 0.]),
 array([0.2, 0.3, 0.5]),
 array([1., 0., 0.]),
 array([1., 0., 0.]),
 array([1., 0., 0.]),
 array([0. , 0.3, 0.7]),
 array([0., 0., 1.]),
 array([0.1, 0.9, 0. ]),
 array([0. , 0.2, 0.8]),
 array([1., 0., 0.]),
 array([0., 1., 0.]),
 array([0., 1., 0.]),
 array([0

In [34]:
clfdt = DecisionTreeClassifier()

# Train the classifier
clfdt.fit(X_train, y_train)

# Test the classifier
y_preddt = clfdt.predict(X_test)

# Print the results
print("Accuracy of decision tree:", sum(y_preddt == y_test)/len(y_test))
print("---")


for _ in range(len(y_test)):
    if(y_test[_] != y_pred[_]):
        print("Index:", _)
        print("Pred myforest:", y_pred[_])
        print("Real data:", y_test[_], "\n")
    if(y_test[_] != y_preddt[_]):
        print("Index:", _)
        print("Pred decisiontree:", y_preddt[_])
        print("Real data:", y_test[_], "\n")

Accuracy of decision tree: 0.9555555555555556
---
Index: 2
Pred myforest: 1
Real data: 2 

Index: 9
Pred myforest: 0
Real data: 2 

Index: 10
Pred decisiontree: 1
Real data: 0 

Index: 11
Pred decisiontree: 0
Real data: 2 

Index: 44
Pred myforest: 1
Real data: 2 



#### Your discussion goes here. (300 words)






In [33]:
y_pred_probs[2]

array([0.1, 0.6, 0.3])

I used GitHub Copilot and the documentation of scikit-learn to write the code.
Please read the comments in the code (there is not that much).

Here we create a class that ensembles a forest of decision trees, building on the DecisionTreeClassifier class, combining multiple instances of diverse trees, with a "voting system" to decide the final class. I also implemented a probabilistic classifier, that gives back the probability of each class, based on the results from the trees. I tried to make it "stable" on randomness by setting a random seed, however, it weirdly only becomes stable after running the "test the results" function 3 times, from that point on we always get the same results. (See the printed "random seeds". )

Results: The majority class does only 40% well, which is understandable, as the test set is distributed in 15:18:12 and class (1) is most popular, it is correct only 18/(15+18+12)=40% of the times. The decision tree and forest ensembler do similarly well, they are mostly 93.333-95.555% good (with other seeds, the ensembler does 95.55% well most commonly). Test set has 45 elements, so 2 mistakes result in 95.55% accuracy, 3 mistakes results in 93.33% accuracy. While the decision tree varied in results between 91.11%-100% (different seeds), it's quite "unreliable", the ensembler is much more stable. I didn't get any other result than 93.33% or 95.55% for the ensembler. In fact, there were only 4 data points in the test set that ever got "failed" by the assembler: index 2 and 9, which are consistently failed, occasionally index 11 too, and rarely index 44. This also shows that it is much more consistent. 