### Assignment 1

Implement a random forest sampling algorithm (bootstrap aggregating + random subspace method). Implement three functions (bootstrap_sample, random_subspace, get_subsample) and combine them into the sample class, which returns a random subsample of the dataset suitable for training one of the trees in the random forest ensemble.

Note: Features in the final subsample must not be duplicated. Use only the tools implemented in numpy.random, setting random_state=42 wherever necessary.

#### Solution


1. bootstrap_sample works as follows: it chooses a random object from the dataset, records its index, and repeats this operation N_obj times, where N_obj is the number of objects in the dataset. It then returns the unique recorded indices. Thus the probability that a given object is returned is $1 - (1 - 1/N_{obj})^{N_{obj}}$, which approaches $1 - 1/e \approx 0.632$ for large N_obj.


2. random_subspace returns the indices of n_subspace randomly chosen features of the dataset, drawn without replacement so that no feature is duplicated.


3. get_subsample returns the subsample of the given dataset defined by the object indices from bootstrap_sample and the feature indices from random_subspace.
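The probability that a given object survives bootstrap_sample can be checked empirically. The following is a minimal sketch and not part of the assignment solution: the local generator `rng`, the dataset size `n_obj = 10_000` and the single vectorized draw are assumptions chosen for illustration.

```python
import numpy as np

# Local generator so the check does not disturb the global np.random state.
rng = np.random.RandomState(42)

n_obj = 10_000  # illustrative dataset size (assumption)

# Draw n_obj indices with replacement, as a bootstrap does.
idx = rng.choice(n_obj, size=n_obj, replace=True)

# Fraction of distinct objects that made it into the sample.
unique_fraction = np.unique(idx).size / n_obj

# Exact inclusion probability for one object; tends to 1 - 1/e as n_obj grows.
expected = 1 - (1 - 1 / n_obj) ** n_obj

print(f"empirical: {unique_fraction:.3f}, theoretical: {expected:.3f}")
```

For a dataset as small as the three-row example used later, the same formula gives 1 - (2/3)^3 ≈ 0.70, so the limiting value is only an approximation there.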

In [1]:
import numpy as np
np.random.seed(42)


class sample(object):
    def __init__(self, X, n_subspace):
        self.idx_subspace = self.random_subspace(X, n_subspace)

    def __call__(self, X, y):
        idx_obj = self.bootstrap_sample(X)
        X_sampled, y_sampled = self.get_subsample(X, y, self.idx_subspace, idx_obj)
        return X_sampled, y_sampled

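    @staticmethod
    def bootstrap_sample(X):
        # Draw one row index at a time, X.shape[0] times with replacement,
        # then keep the distinct indices (np.unique also sorts them).
        idx_obj = np.unique([ np.random.choice(X.shape[0], size=1) for _ in range(X.shape[0]) ])
        return idx_obj

    @staticmethod
    def random_subspace(X, n_subspace):
        # Pick n_subspace distinct column indices (replace=False prevents duplicates).
        idx_subspace = np.random.choice(X.shape[1], size=n_subspace, replace=False)
        return idx_subspace

    @staticmethod
    def get_subsample(X, y, idx_subspace, idx_obj):
        # Select the chosen columns first, then the chosen rows; y keeps the same rows.
        X_sample, y_sample = X[:, idx_subspace][idx_obj, :], y[idx_obj]
        return X_sample, y_sample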


An example of execution:

In [2]:
X = np.array([[1,2,3], [4,5,6], [7,8,9]])
Y = np.array([1, 2, 3])

for _ in range(3):
    s = sample(X, 2)
    bootstrap_indices = s.bootstrap_sample(X)
    X_sampled, y_sampled = s(X, Y)

In [3]:
# rows (objects) drawn by the standalone bootstrap_sample call above;
# s(X, Y) draws its own rows, so these need not match X_sampled
bootstrap_indices

array([1])

In [4]:
# columns (features) from given array X
s.idx_subspace

array([2, 1])

In [5]:
# output array
X_sampled

array([[3, 2],
       [6, 5]])

In [6]:
# corresponding responses output
y_sampled

array([1, 2])
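The column-then-row selection performed by get_subsample can also be written as a single indexing step with np.ix_. This is only an alternative sketch, assuming row and column indices hand-picked to match the output shown above; the names `two_step`, `one_step` and `y_sub` are illustrative.

```python
import numpy as np

X = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
y = np.array([1, 2, 3])

idx_obj = np.array([0, 1])       # rows, as a bootstrap draw might return them
idx_subspace = np.array([2, 1])  # columns, as a subspace draw might return them

# Selection as written in get_subsample: columns first, then rows.
two_step = X[:, idx_subspace][idx_obj, :]

# Same selection in one step: np.ix_ builds an open mesh of row and column indices.
one_step = X[np.ix_(idx_obj, idx_subspace)]

# Responses follow the selected rows only.
y_sub = y[idx_obj]

print(one_step)
print(y_sub)
```

Both forms return copies rather than views, so changing the subsample does not modify X.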

### Assignment 2

1. Write a random_forest class that solves the classification problem on the Fisher Iris dataset (sklearn.datasets.load_iris). Its constructor takes the arguments n_estimators, max_depth, subspaces_dim and random_state. Define the .fit() and .predict() methods.

2. Select the hyperparameters with which your algorithm achieves the best quality (measured by accuracy, the proportion of correct answers) on the test set obtained with test_size=0.3, and set them as the global variables N_ESTIMATORS, MAX_DEPTH, SUBSPACE_DIM.

Note: The use of the sklearn.ensemble.RandomForestClassifier class is prohibited.

In [7]:
import warnings
warnings.filterwarnings("ignore")

In [8]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

np.random.seed(42)

#### Solution

Here we implement a random forest classifier.

1. In the fit() method, several base models (DecisionTreeClassifier) are trained, each on its own subsample produced by the sampling algorithm above, so that we end up with $n$=n_estimators different models.

2. In the predict() method, each trained model predicts a class for every object and thereby casts one vote per object. The method returns, for each object, the class that received the most votes.
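The voting step can be illustrated in isolation. The following is a minimal sketch on hand-made predictions; the helper name `majority_vote` and the toy array are assumptions for illustration and do not come from the notebook.

```python
import numpy as np

def majority_vote(predictions):
    """Return one label per object from an (n_estimators, n_objects) array."""
    voted = []
    for column in predictions.T:  # all votes cast for a single object
        labels, counts = np.unique(column, return_counts=True)
        voted.append(labels[np.argmax(counts)])
    return np.array(voted)

# Three estimators (rows) voting on three objects (columns).
predictions = np.array([
    [0, 1, 2],
    [0, 1, 1],
    [1, 1, 2],
])
result = majority_vote(predictions)
print(result)  # → [0 1 2]
```

Because np.unique returns labels in sorted order and np.argmax returns the first maximum, a tie is resolved in favour of the smallest label. Using an odd number of estimators avoids ties between two classes, although three-way ties remain possible with three classes.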

In [9]:
class random_forest(object):
    def __init__(self, n_estimators: int, max_depth: int, subspaces_dim: int, random_state: int):
        self.n_estimators = n_estimators
        self.max_depth = max_depth
        self.subspaces_dim = subspaces_dim
        self.random_state = random_state
        self._estimators = []
        self.subspace_idx = []
        
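    def fit(self, X, y):
        # Start from a clean state so that calling fit() again does not
        # stack new trees on top of the old ones.
        self._estimators = []
        self.subspace_idx = []
        for i in range(self.n_estimators):
            # Each tree gets its own feature subspace and bootstrap rows.
            s = sample(X, self.subspaces_dim)
            X_sample, y_sample = s(X, y)
            self.subspace_idx.append(s.idx_subspace)

            estimator = DecisionTreeClassifier(max_depth=self.max_depth, random_state=self.random_state).fit(X_sample, y_sample)
            self._estimators.append(estimator)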
            
    def predict(self, X):
        # Each tree sees only the feature columns it was trained on.
        # predictions has shape (n_estimators, n_objects).
        predictions = np.array([ self._estimators[i].predict(X[:, self.subspace_idx[i]]) for i in range(self.n_estimators) ])
        y_pred = []
        for i in range(X.shape[0]):
            # Majority vote over the trees for object i.
            y_predict, counts = np.unique(predictions[:, i], return_counts=True)
            y_pred.append(y_predict[np.argmax(counts)])

        return np.array(y_pred)
    

Here we train the random forest on the Iris dataset.

In [10]:
X,y = load_iris(return_X_y=True)
X_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=True, random_state=42)

In [11]:
N_ESTIMATORS = 3
MAX_DEPTH = 2
SUBSPACE_DIM = 2

model = random_forest(N_ESTIMATORS, MAX_DEPTH, SUBSPACE_DIM, 42)
model.fit(X_train, y_train)
            
y_pred = model.predict(x_test)
acc = accuracy_score(y_test, y_pred)
acc

1.0
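The notebook does not show how the three hyperparameter values were chosen. One possible way to search for them is an exhaustive grid search, sketched below under stated assumptions: the helper `grid_search`, the stand-in scoring function `toy_score` and the grid values are all hypothetical, and `toy_score` has an arbitrary peak placed so that the result is easy to check. It says nothing about the Iris data.

```python
from itertools import product

def grid_search(evaluate, grid):
    """Try every combination in grid and return (best_params, best_score)."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(**params)
        if score > best_score:  # strict comparison: the first best combination wins ties
            best_params, best_score = params, score
    return best_params, best_score

# Toy stand-in for "train the forest and measure accuracy" (assumption).
def toy_score(n_estimators, max_depth, subspace_dim):
    return -abs(n_estimators - 3) - abs(max_depth - 2) - abs(subspace_dim - 2)

grid = {
    "n_estimators": [1, 3, 5],
    "max_depth": [1, 2, 3],
    "subspace_dim": [1, 2, 3, 4],
}
best_params, best_score = grid_search(toy_score, grid)
print(best_params, best_score)
```

In the notebook, the scoring function would instead build random_forest with the given parameters, fit it on X_train and y_train, and return accuracy_score on the test split.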