# Coil2000 dataset
[Source](https://www.openml.org/search?type=data&sort=qualities.NumberOfFeatures&status=active&qualities.NumberOfClasses=lte_1&qualities.NumberOfFeatures=between_10_100&format=ARFF&qualities.NumberOfInstances=between_1000_10000&id=298)

Can you predict who would be interested in buying a caravan insurance policy, and explain why?

This is a good challenge for a random forest regressor.

9822 instances

86 columns (85 features plus the `CARAVAN` target)

Each instance is a customer. The features describe the customer and include product usage data and socio-demographic data derived from zip area codes.

### Import libraries

In [1]:
import numpy as np
import pandas as pd
from scipy.io.arff import loadarff 
from sklearn.model_selection import train_test_split
from sklearn.metrics import root_mean_squared_error
import sklearn
from random_forest_regressor import RandomForestRegressor
import sklearn.ensemble
import time

RANDOM_SEED = 42


### Import data

In [2]:
raw_data = loadarff('coil2000.arff')
df = pd.DataFrame(raw_data[0])

In [3]:
df.head()

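| | MOSTYPE | MAANTHUI | MGEMOMV | MGEMLEEF | MOSHOOFD | MGODRK | MGODPR | MGODOV | MGODGE | MRELGE | ... | APERSONG | AGEZONG | AWAOREG | ABRAND | AZEILPL | APLEZIER | AFIETS | AINBOED | ABYSTAND | CARAVAN |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 33.0 | 1.0 | 3.0 | 2.0 | 8.0 | 0.0 | 5.0 | 1.0 | 3.0 | 7.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 37.0 | 1.0 | 2.0 | 2.0 | 8.0 | 1.0 | 4.0 | 1.0 | 4.0 | 6.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 37.0 | 1.0 | 2.0 | 2.0 | 8.0 | 0.0 | 4.0 | 2.0 | 4.0 | 3.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 9.0 | 1.0 | 3.0 | 3.0 | 3.0 | 2.0 | 3.0 | 2.0 | 4.0 | 5.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 40.0 | 1.0 | 4.0 | 2.0 | 10.0 | 1.0 | 4.0 | 1.0 | 4.0 | 7.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |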


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9822 entries, 0 to 9821
Data columns (total 86 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   MOSTYPE   9822 non-null   float64
 1   MAANTHUI  9822 non-null   float64
 2   MGEMOMV   9822 non-null   float64
 3   MGEMLEEF  9822 non-null   float64
 4   MOSHOOFD  9822 non-null   float64
 5   MGODRK    9822 non-null   float64
 6   MGODPR    9822 non-null   float64
 7   MGODOV    9822 non-null   float64
 8   MGODGE    9822 non-null   float64
 9   MRELGE    9822 non-null   float64
 10  MRELSA    9822 non-null   float64
 11  MRELOV    9822 non-null   float64
 12  MFALLEEN  9822 non-null   float64
 13  MFGEKIND  9822 non-null   float64
 14  MFWEKIND  9822 non-null   float64
 15  MOPLHOOG  9822 non-null   float64
 16  MOPLMIDD  9822 non-null   float64
 17  MOPLLAAG  9822 non-null   float64
 18  MBERHOOG  9822 non-null   float64
 19  MBERZELF  9822 non-null   float64
 20  MBERBOER  9822 non-null   floa

In [5]:
len(df)

9822

In [6]:
df.empty

False

### Prepare the data

In [7]:
X_data = df.drop("CARAVAN", axis=1)
y_data = df["CARAVAN"]

X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size=0.2, random_state=RANDOM_SEED)
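
As a rough sketch of what the split above does, the following NumPy-only snippet shuffles row indices with a fixed seed and holds out 20% of them. The helper name `split_indices` is hypothetical, and the rows it selects will differ from those chosen by `train_test_split`, but the resulting sizes for 9822 rows should match, since scikit-learn rounds the test share up.

```python
import math
import numpy as np

def split_indices(n_samples, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) for a seeded random hold-out split."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    n_test = math.ceil(test_size * n_samples)  # test share is rounded up
    return shuffled[n_test:], shuffled[:n_test]

train_idx, test_idx = split_indices(9822)
print(len(train_idx), len(test_idx))  # 7857 1965
```

Fixing the seed makes the split repeatable, so every experiment below is evaluated on the same held-out rows.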

## Run the experiments

### Baseline experiment

In [17]:
start_time_implementation = time.time()
rf = RandomForestRegressor(n_trees=5, max_features=1, min_samples_split=2, max_depth=5)
rf.train(X_train, y_train)
y_pred = rf.predict(X_test)
end_time_implementation = time.time()

In [None]:
start_time_sk = time.time()
sklearn_rf = sklearn.ensemble.RandomForestRegressor(n_estimators=5, max_depth=5, random_state=RANDOM_SEED)
sklearn_rf.fit(X_train, y_train)
sk_y_pred = sklearn_rf.predict(X_test)
end_time_sk = time.time()
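
The start/stop timing pattern in the two cells above is repeated for every experiment. A small context manager is one way to factor it out. The names `timed` and `runtimes` are hypothetical and not part of the notebook, and `time.perf_counter` is swapped in for `time.time` because it is a monotonic clock meant for measuring intervals.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, results):
    """Store the wall-clock duration of the with-block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start

runtimes = {}
with timed("demo", runtimes):
    sum(i * i for i in range(100_000))  # stand-in for fit + predict
print(runtimes)
```

In the notebook, the body of the `with` block would hold the `train`/`fit` and `predict` calls of one model.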

In [19]:
implementation_rmse = root_mean_squared_error(y_test, y_pred)
sklearn_rmse = root_mean_squared_error(y_test, sk_y_pred)
print("RMSE of implementation: ", implementation_rmse)
print("RMSE of sklearn: ", sklearn_rmse)
implementation_runtime = end_time_implementation - start_time_implementation
sklearn_runtime = end_time_sk - start_time_sk
print("Runtime of implementation", implementation_runtime)
print("Runtime of sklearn", sklearn_runtime)

RMSE of implementation:  0.24098255327130602
RMSE of sklearn:  0.2350161143782378
Runtime of implementation 116.48364019393921
Runtime of sklearn 0.14285898208618164
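
To put these RMSE values in context, it can help to compare them with a constant predictor that always outputs the mean of the target. The sketch below assumes the target is coded 0/1, as the `CARAVAN` values in `df.head()` suggest. In that case the RMSE of the mean predictor is `sqrt(p * (1 - p))`, where `p` is the share of positives. The 6% rate used here is a made-up illustration, not a measured property of this dataset, and `rmse` is a hypothetical helper that mirrors what `root_mean_squared_error` computes.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equally long arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Illustrative 0/1 target with 6 positives out of 100 (made-up numbers).
y_demo = np.array([1.0] * 6 + [0.0] * 94)
constant_pred = np.full_like(y_demo, y_demo.mean())

print(rmse(y_demo, constant_pred))  # equals sqrt(p * (1 - p)) with p = 0.06
```

Running the same comparison with `y_train.mean()` as the constant prediction on `y_test` would show how much either forest improves on this trivial baseline.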



### Experiments on tree depth
**1. Decrease tree depth to 2 instead of 5** 

In [20]:
start_time_implementation = time.time()
rf = RandomForestRegressor(n_trees=5, max_features=1, min_samples_split=2, max_depth=2)
rf.train(X_train, y_train)
y_pred = rf.predict(X_test)
end_time_implementation = time.time()

In [None]:
start_time_sk = time.time()
sklearn_rf = sklearn.ensemble.RandomForestRegressor(n_estimators=5, max_depth=2, random_state=RANDOM_SEED)
sklearn_rf.fit(X_train, y_train)
sk_y_pred = sklearn_rf.predict(X_test)
end_time_sk = time.time()

In [22]:
implementation_rmse = root_mean_squared_error(y_test, y_pred)
sklearn_rmse = root_mean_squared_error(y_test, sk_y_pred)
print("RMSE of implementation: ", implementation_rmse)
print("RMSE of sklearn: ", sklearn_rmse)
implementation_runtime = end_time_implementation - start_time_implementation
sklearn_runtime = end_time_sk - start_time_sk
print("Runtime of implementation", implementation_runtime)
print("Runtime of sklearn", sklearn_runtime)

RMSE of implementation:  0.23951705622424962
RMSE of sklearn:  0.2350467171819462
Runtime of implementation 48.99829983711243
Runtime of sklearn 0.05793952941894531
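
One way to read the effect of `max_depth`: a binary tree of depth `d` has at most `2 ** d` leaves, so each tree can output at most that many distinct values, and never more than one per training sample. The helper `max_leaves` below is a hypothetical illustration, and 7857 is the training-set size implied by the 80/20 split of 9822 rows.

```python
def max_leaves(max_depth, n_samples):
    """Upper bound on the number of leaves of one binary tree."""
    if max_depth is None:
        return n_samples  # at most one leaf per training sample
    return min(2 ** max_depth, n_samples)

for depth in (2, 5, 20, None):
    print(depth, max_leaves(depth, n_samples=7857))
```

With depth 2 a tree is limited to 4 leaves and with depth 5 to 32, while at depth 20 the bound is set by the number of samples rather than by the depth.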



**2. Remove the depth limit (max_depth=None instead of 5)**

In [8]:
start_time_implementation = time.time()
rf = RandomForestRegressor(n_trees=5, max_features=1, min_samples_split=2, max_depth=None)
rf.train(X_train, y_train)
y_pred = rf.predict(X_test)
end_time_implementation = time.time()

---------------------------------------------------------------------------
RecursionError                            Traceback (most recent call last)
Cell In[8], line 3
      1 start_time_implementation = time.time()
      2 rf = RandomForestRegressor(n_trees=5, max_features=1, min_samples_split=2, max_depth=None)
----> 3 rf.train(X_train, y_train)
      4 y_pred = rf.predict(X_test)
      5 end_time_implementation = time.time()

File ~/Documents/Winter24/Machine Learning/MachineLearningCourse/Exercise2/random_forest_regressor.py:36, in RandomForestRegressor.train(self, X_data, y_data)
     34 for i in range(self.n_tress):
     35     tree = regressor_tree.Node(max_features=self.max_features, min_samples_split=self.min_samples_split, max_depth=self.max_depth)
---> 36     tree.train(X_data, y_data)
     37     self.fitted_trees.append(tree)

File ~/Documents/Winter24/Machine Learning/MachineLearningCourse/Exercise2/regressor_tree.py:121, in Node.train(self, X_data, y_data)
    118     self.left_child = Node(height=self.height + 1, max_features=self.max_features, max_depth=self.max_depth, min_samples_split=self.min_samples_split)
    119     self.right_child = Node(height=self.height + 1, max_features=self.max_features, max_depth=self.max_depth, min_samples_split=self.min_samples_split)
--> 121     self.left_child.train(data_left_X, data_left_y)
    122     self.right_child.train(data_right_X, data_right_y)
    123 else:
    124     # Get the prediction value by the mean of the remaining target data (y_data)

File ~/Documents/Winter24/Machine Learning/MachineLearningCourse/Exercise2/regressor_tree.py:121, in Node.train(self, X_data, y_data)
    118     self.left_child = Node(height=self.height + 1, max_features=self.max_features, max_depth=self.max_depth, min_samples_split=self.min_samples_split)
...
     42 @classmethod  # type: ignore[misc]
     43 def _instancecheck(cls, inst) -> bool:
---> 44     return _check(inst) and not isinstance(inst, type)

RecursionError: maximum recursion depth exceeded

In [9]:
import sys
print(sys.getrecursionlimit())

3000


The recursion limit is 3000, so with no depth limit the call stack of our recursive `Node.train` grew beyond 3000 frames before a stopping condition was reached.
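
One common way to grow a tree without a depth cap and without hitting Python's recursion limit is to keep pending nodes on an explicit stack instead of calling `train` recursively. The sketch below is hypothetical and is not the `regressor_tree.Node` code. `worst_case_depth` only tracks node sizes for a degenerate tree that peels off one sample per split, which is the deepest shape a tree can take. The value 7857 is the training-set size implied by the 80/20 split of 9822 rows.

```python
def worst_case_depth(n_samples, min_samples_split=2):
    """Depth of a tree that splits off a single sample at every node."""
    stack = [(n_samples, 0)]  # pending nodes as (samples in node, depth)
    deepest = 0
    while stack:
        size, depth = stack.pop()
        deepest = max(deepest, depth)
        if size < min_samples_split:
            continue  # too small to split: this node is a leaf
        stack.append((1, depth + 1))         # child holding one sample
        stack.append((size - 1, depth + 1))  # child holding the rest
    return deepest

print(worst_case_depth(7857))  # 7856, deeper than a 3000-frame limit allows
```

Because the loop never calls itself, the depth it can reach is bounded by memory rather than by `sys.getrecursionlimit()`.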

In [None]:
start_time_sk = time.time()
sklearn_rf = sklearn.ensemble.RandomForestRegressor(n_estimators=5, max_depth=None, random_state=RANDOM_SEED)
sklearn_rf.fit(X_train, y_train)
sk_y_pred = sklearn_rf.predict(X_test)
end_time_sk = time.time()

In [24]:
#implementation_rmse = root_mean_squared_error(y_test, y_pred)
sklearn_rmse = root_mean_squared_error(y_test, sk_y_pred)
#print("RMSE of implementation: ", implementation_rmse)
print("RMSE of sklearn: ", sklearn_rmse)
#implementation_runtime = end_time_implementation - start_time_implementation
sklearn_runtime = end_time_sk - start_time_sk
#print("Runtime of implementation", implementation_runtime)
print("Runtime of sklearn", sklearn_runtime)

RMSE of sklearn:  0.2814079075177691
Runtime of sklearn 0.3791069984436035



**3. Increase tree depth to 20 instead of 5**

In [13]:
start_time_implementation = time.time()
rf = RandomForestRegressor(n_trees=5, max_features=1, min_samples_split=2, max_depth=20)
rf.train(X_train, y_train)
y_pred = rf.predict(X_test)
end_time_implementation = time.time()

In [13]:
start_time_sk = time.time()
sklearn_rf = sklearn.ensemble.RandomForestRegressor(n_estimators=5, max_depth=20, random_state=RANDOM_SEED)
sklearn_rf.fit(X_train, y_train)
sk_y_pred = sklearn_rf.predict(X_test)
end_time_sk = time.time()

In [14]:
implementation_rmse = root_mean_squared_error(y_test, y_pred)
sklearn_rmse = root_mean_squared_error(y_test, sk_y_pred)
print("RMSE of implementation: ", implementation_rmse)
print("RMSE of sklearn: ", sklearn_rmse)
implementation_runtime = end_time_implementation - start_time_implementation
sklearn_runtime = end_time_sk - start_time_sk
print("Runtime of implementation", implementation_runtime)
print("Runtime of sklearn", sklearn_runtime)

RMSE of implementation:  0.23957197065958202
RMSE of sklearn:  0.27971263323518153
Runtime of implementation 118.94622135162354
Runtime of sklearn 0.39827799797058105



### Experiments on max features
**1. Increase max features to 2**

In [None]:
start_time_implementation = time.time()
rf = RandomForestRegressor(n_trees=5, max_features=2, min_samples_split=2, max_depth=5)
rf.train(X_train, y_train)
y_pred = rf.predict(X_test)
end_time_implementation = time.time()

In [15]:
start_time_sk = time.time()
sklearn_rf = sklearn.ensemble.RandomForestRegressor(n_estimators=5, max_depth=5, max_features=2, random_state=RANDOM_SEED)
sklearn_rf.fit(X_train, y_train)
sk_y_pred = sklearn_rf.predict(X_test)
end_time_sk = time.time()
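
In scikit-learn, an integer `max_features` is the number of features considered at each split, while a float is read as a fraction of all features. `RandomForestRegressor` defaults to `1.0`, meaning all features. Whether the custom `RandomForestRegressor` interprets `max_features` the same way is not shown here, so the sketch below only illustrates the scikit-learn convention. `candidate_features` is a hypothetical helper, and 85 is the number of columns left after dropping `CARAVAN`.

```python
import numpy as np

def candidate_features(n_features, max_features, rng):
    """Pick the feature indices one split is allowed to consider."""
    if isinstance(max_features, float):
        k = max(1, int(max_features * n_features))  # fraction of all features
    else:
        k = max_features  # absolute number of features
    return rng.choice(n_features, size=k, replace=False)

rng = np.random.default_rng(42)
print(candidate_features(85, 2, rng))    # two distinct column indices
print(candidate_features(85, 1.0, rng))  # all 85 columns, in random order
```

Under this convention, scikit-learn calls that omit `max_features` search all features at every split, which is worth keeping in mind when comparing them with runs that set `max_features=1`.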

In [16]:
implementation_rmse = root_mean_squared_error(y_test, y_pred)
sklearn_rmse = root_mean_squared_error(y_test, sk_y_pred)
print("RMSE of implementation: ", implementation_rmse)
print("RMSE of sklearn: ", sklearn_rmse)
implementation_runtime = end_time_implementation - start_time_implementation
sklearn_runtime = end_time_sk - start_time_sk
print("Runtime of implementation", implementation_runtime)
print("Runtime of sklearn", sklearn_runtime)

RMSE of implementation:  0.23957197065958202
RMSE of sklearn:  0.23570879464546549
Runtime of implementation 118.94622135162354
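Runtime of sklearn 0.03200674057006836

Note: the implementation RMSE and runtime are identical to the values from the depth-20 experiment, and the implementation cell above has no execution count, so these two numbers are probably left over from that run rather than measured with max_features=2.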



**2. Increase max features to 5**

In [8]:
start_time_implementation = time.time()
rf = RandomForestRegressor(n_trees=5, max_features=5, min_samples_split=2, max_depth=5)
rf.train(X_train, y_train)
y_pred = rf.predict(X_test)
end_time_implementation = time.time()

In [17]:
start_time_sk = time.time()
sklearn_rf = sklearn.ensemble.RandomForestRegressor(n_estimators=5, max_depth=5, max_features=5,random_state=RANDOM_SEED)
sklearn_rf.fit(X_train, y_train)
sk_y_pred = sklearn_rf.predict(X_test)
end_time_sk = time.time()

In [18]:
implementation_rmse = root_mean_squared_error(y_test, y_pred)
sklearn_rmse = root_mean_squared_error(y_test, sk_y_pred)
print("RMSE of implementation: ", implementation_rmse)
print("RMSE of sklearn: ", sklearn_rmse)
implementation_runtime = end_time_implementation - start_time_implementation
sklearn_runtime = end_time_sk - start_time_sk
print("Runtime of implementation", implementation_runtime)
print("Runtime of sklearn", sklearn_runtime)

RMSE of implementation:  0.23957197065958202
RMSE of sklearn:  0.2355863726400265
Runtime of implementation 118.94622135162354
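Runtime of sklearn 0.04115128517150879

Note: the implementation RMSE and runtime match the depth-20 experiment to the last digit, so they were probably not recomputed with max_features=5.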



### Experiments on min_samples_split
**1. Increase min_samples_split to 5**

In [None]:
start_time_implementation = time.time()
rf = RandomForestRegressor(n_trees=5, max_features=1, min_samples_split=5, max_depth=5)
rf.train(X_train, y_train)
y_pred = rf.predict(X_test)
end_time_implementation = time.time()

In [None]:
start_time_sk = time.time()
sklearn_rf = sklearn.ensemble.RandomForestRegressor(n_estimators=5, max_depth=5, max_features=1, min_samples_split=5, random_state=RANDOM_SEED)
sklearn_rf.fit(X_train, y_train)
sk_y_pred = sklearn_rf.predict(X_test)
end_time_sk = time.time()
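
In scikit-learn, `min_samples_split` is the minimum number of samples a node must hold before it may be split. Combined with `max_depth`, the stopping rule can be sketched as below. `should_split` is a hypothetical helper and not taken from `regressor_tree.py`, whose exact checks are not shown here.

```python
def should_split(n_samples, depth, min_samples_split=2, max_depth=None):
    """Return True if a node may be split further under both limits."""
    if n_samples < min_samples_split:
        return False  # too few samples left in this node
    if max_depth is not None and depth >= max_depth:
        return False  # node already sits at the depth limit
    return True

print(should_split(n_samples=4, depth=1, min_samples_split=5, max_depth=5))  # False
print(should_split(n_samples=4, depth=1, min_samples_split=2, max_depth=5))  # True
```

Raising `min_samples_split` from 2 to 5 turns nodes with 2 to 4 samples into leaves, which leads to smaller trees.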