# 1. Voting Classifier
#### In this assignment, you are expected to build an ensemble of different models and train it on cover type dataset.

## 1.1. Load dataset
#### You will need to read the data from the file (cover.csv). It contains 581012 samples and 54 attributes for each sample. The target column is Cover_Type.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, VotingClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score
from sklearn.svm import LinearSVC

In [2]:
df = pd.read_csv('cover.csv')

In [3]:
df.head()

Unnamed: 0,Elevation,Aspect,Slope,Horizontal_Distance_To_Hydrology,Vertical_Distance_To_Hydrology,Horizontal_Distance_To_Roadways,Hillshade_9am,Hillshade_Noon,Hillshade_3pm,Horizontal_Distance_To_Fire_Points,...,Soil_Type_31,Soil_Type_32,Soil_Type_33,Soil_Type_34,Soil_Type_35,Soil_Type_36,Soil_Type_37,Soil_Type_38,Soil_Type_39,Cover_Type
0,2596.0,51.0,3.0,258.0,0.0,510.0,221.0,232.0,148.0,6279.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5
1,2590.0,56.0,2.0,212.0,-6.0,390.0,220.0,235.0,151.0,6225.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5
2,2804.0,139.0,9.0,268.0,65.0,3180.0,234.0,238.0,135.0,6121.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2
3,2785.0,155.0,18.0,242.0,118.0,3090.0,238.0,238.0,122.0,6211.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2
4,2595.0,45.0,2.0,153.0,-1.0,391.0,220.0,234.0,150.0,6172.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5


## 1.2. Prepare dataset
#### Split the data into train, validation, and test sets using train_test_split twice with 0.2 test_size. Your final distribution will be 371847-92962-116203.

In [4]:
test_proportion = 0.2
validation_proportion_of_remaining = 0.2
validation_proportion_of_total = test_proportion * validation_proportion_of_remaining / (1 - test_proportion)

In [5]:
x = df.drop('Cover_Type', axis=1)
y = df['Cover_Type']

In [6]:
x_remaining, x_test, y_remaining, y_test = train_test_split(x, y, test_size = test_proportion, random_state = 42)
x_train, x_val, y_train, y_val = train_test_split(x_remaining, y_remaining, test_size = validation_proportion_of_total, random_state = 42)

In [7]:
print(f"Train set has {x_train.shape[0]} samples")
print(f"Validation set has {x_val.shape[0]} samples")
print(f"Test set has {x_test.shape[0]} samples")

Train set has 441568 samples
Validation set has 23241 samples
Test set has 116203 samples


## 1.3. Modeling
#### Train 4-5 different classifiers on the data. You can train RandomForestClassifier, ExtraTreesClassifier, LinearSVC, SGDClassifier, MLPClassifier, etc. Evaluate their performances using validation set. Note that training may take quite a while (up to 30 minutes) depending on the hardware.

In [8]:
classifiers = {
    "RandomForestClassifier": RandomForestClassifier(n_jobs = -1, random_state = 42),
    "ExtraTreesClassifier": ExtraTreesClassifier(n_jobs = -1, random_state = 42),
    "LinearSVC": LinearSVC(max_iter = 10000, dual = False, random_state = 42),
    "SGDClassifier": SGDClassifier(max_iter = 2000, tol = 1e-3, n_jobs = -1, random_state = 42),
    "MLPClassifier": MLPClassifier(max_iter = 300, random_state = 42) 
}

In [9]:
for name, clf in classifiers.items():
    print(f"Training {name}...")
    clf.fit(x_train, y_train)
    y_val_pred = clf.predict(x_val)
    accuracy = accuracy_score(y_val, y_val_pred)
    print(f"{name} validation accuracy is {accuracy}")

Training RandomForestClassifier...
RandomForestClassifier validation accuracy is 0.9544770018501786
Training ExtraTreesClassifier...
ExtraTreesClassifier validation accuracy is 0.9530570973710254
Training LinearSVC...
LinearSVC validation accuracy is 0.7092207736328041
Training SGDClassifier...
SGDClassifier validation accuracy is 0.5627985026461856
Training MLPClassifier...
MLPClassifier validation accuracy is 0.7837872724925777


## 1.4. Ensembling
#### Create a hard and soft voting classifier using the models you have trained. You can use VotingClassifier. Check its performance on the validation set. Do you get better or worse performance than any of the individual classifiers?

In [10]:
class_for_soft_voting = {name: clf for name, clf in classifiers.items() if name != "LinearSVC"}

In [11]:
voting_clf_hard = VotingClassifier(estimators = [(name, clf) for name, clf in classifiers.items()], voting = 'hard')
print("Training hard voting...")
voting_clf_hard.fit(x_train, y_train)
y_val_pred_hard = voting_clf_hard.predict(x_val)
accuracy_hard = accuracy_score(y_val, y_val_pred_hard)
print(f"Hard voting classifier validation accuracy is {accuracy_hard}")

Training hard voting...
Hard voting classifier validation accuracy is 0.8747041865668431


In [12]:
class_for_soft_voting['SGDClassifier'] = SGDClassifier(loss = 'log_loss', max_iter = 2000, tol = 1e-3, n_jobs = -1, random_state = 42)

voting_clf_soft = VotingClassifier(
    estimators = [(name, clf) for name, clf in class_for_soft_voting.items()], 
    voting = 'soft'
)

print("Training soft voting...")
voting_clf_soft.fit(x_train, y_train) 
y_val_pred_soft = voting_clf_soft.predict(x_val)  
accuracy_soft = accuracy_score(y_val, y_val_pred_soft)  
print(f"Soft voting classifier validation accuracy is {accuracy_soft}")

Training soft voting...
Soft voting classifier validation accuracy is 0.9118368400671227


#### Check if any of the models hurts the performance of the ensemble. You can access the estimators of the ensemble using estimators_ attribute. If so, drop those using set_params and reevaluate.

In [13]:
ensemble_hard_accuracy = accuracy_score(y_val, voting_clf_hard.predict(x_val))
ensemble_soft_accuracy = accuracy_score(y_val, voting_clf_soft.predict(x_val))

In [14]:
print(f"Ensemble hard voting accuracy is {ensemble_hard_accuracy}")
print(f"Ensemble soft voting accuracy is {ensemble_soft_accuracy}")

Ensemble hard voting accuracy is 0.8747041865668431
Ensemble soft voting accuracy is 0.9118368400671227


In [15]:
hard_remaining = []

original_estimators = [(estimator.__class__.__name__, estimator) for estimator in voting_clf_hard.estimators_]

for i, (name, estimator) in enumerate(original_estimators):
    del voting_clf_hard.estimators_[i]
    print(f"{i+1}. Removing: {name}")

    voting_clf_hard.fit(x_train, y_train)
    
    new_hard_voting_pred = voting_clf_hard.predict(x_val)
    new_hard_voting_acc = accuracy_score(y_val, new_hard_voting_pred)
    print(f"Performance after removing {name}: {new_hard_voting_acc:.2%}")
    
    if new_hard_voting_acc <= accuracy_hard:
        hard_remaining.append((name, estimator))
        print(f"Removing {name} hurts performance, it should not be dropped")
    else:
        print(f"Removing {name} does not hurt performance, it can be dropped")
    
    voting_clf_hard.estimators_.insert(i, estimator)

voting_clf_hard.estimators_ = [est for _, est in original_estimators]

1. Removing: RandomForestClassifier
Performance after removing RandomForestClassifier: 87.47%
Removing RandomForestClassifier hurts performance, it should not be dropped
2. Removing: ExtraTreesClassifier
Performance after removing ExtraTreesClassifier: 87.47%
Removing ExtraTreesClassifier hurts performance, it should not be dropped
3. Removing: LinearSVC
Performance after removing LinearSVC: 87.47%
Removing LinearSVC hurts performance, it should not be dropped
4. Removing: SGDClassifier
Performance after removing SGDClassifier: 87.47%
Removing SGDClassifier hurts performance, it should not be dropped
5. Removing: MLPClassifier
Performance after removing MLPClassifier: 87.47%
Removing MLPClassifier hurts performance, it should not be dropped


# 2. Random Forest
#### In this assignment, you are expected to build a random forest that classifies a toy dataset.

## 2.1. Load dataset
#### You will need to read the data from the file (data.csv). It contains 15000 samples and two features for each sample.

In [16]:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

In [17]:
data_path = 'data.csv'
dt = pd.read_csv(data_path)

## 2.2. Prepare dataset
#### Split the data into train and test sets with 0.2 test size.

In [18]:
train_set, test_set = train_test_split(dt, test_size = 0.2, random_state = 42)
print(f"Training set shape: {train_set.shape}")
print(f"Test set shape: {test_set.shape}")

Training set shape: (11999, 3)
Test set shape: (3000, 3)


## 2.3. Modeling
#### Train a DecisionTreeClassifier on the data. Use GridSearchCV to tune the hyperparameters.

In [19]:
X = dt.drop('1.000000000000000000e+02', axis = 1).to_numpy() 
Y = dt['1.000000000000000000e+02'].to_numpy()
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2)

In [20]:
dt_clf = DecisionTreeClassifier()
dt_clf.fit(X_train, Y_train)
parameters = {
    'max_depth': range(2, 10), 
    'min_samples_split': range(2, 10), 
    'min_samples_leaf': range(2, 10)}

In [21]:
grid_search = GridSearchCV(dt_clf, parameters, cv = 5, scoring = 'accuracy') 
grid_search.fit(X_train, Y_train)

#### Train the best model on the whole train set (do you need to?) and evaluate the model on the test set.

In [22]:
best_cls = DecisionTreeClassifier(**grid_search.best_params_) 
best_cls.fit(X_train, Y_train)
predict_y = best_cls.predict(X_test) 
best_accuracy = accuracy_score(Y_test, predict_y)
print(f'Accuracy score of the best model: {best_accuracy * 100:.2f}%')

Accuracy score of the best model: 85.50%


#### Generate 1,200 subsets of the training set, each containing 100 randomly chosen instances. You can use ShuffleSplit.

In [23]:
from sklearn.model_selection import ShuffleSplit 

s = ShuffleSplit(n_splits=1200, train_size = 100, random_state = 42)

X_subsets = []
Y_subsets = []

In [24]:
for train_index, _ in s.split(X_train):
    X_subset = X_train[train_index]
    Y_subset = Y_train[train_index]
    X_subsets.append(X_subset)
    Y_subsets.append(Y_subset)


In [25]:
print('X subsets:', len(X_subsets))
print('Y subsets:', len(Y_subsets))

X subsets: 1200
Y subsets: 1200


#### Train one tree on each subset, using the best model you previously found. Evaluate the performance of the trees using the test set. Did you get lower or higher accuracy? Why?

In [26]:
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from scipy.stats import mode

In [27]:
accuracies = []
trees = []

for X_subset, Y_subset in zip(X_subsets, Y_subsets):
    dt_subset = DecisionTreeClassifier(**grid_search.best_params_)
    dt_subset.fit(X_subset, Y_subset)
    trees.append(dt_subset)
    
    Y_pred_subset = dt_subset.predict(X_test)
    acc = accuracy_score(Y_test, Y_pred_subset)
    accuracies.append(acc)

mean_accuracy = np.mean(accuracies)
print(f'Mean accuracy: {mean_accuracy * 100:.2f}%')


Mean accuracy: 81.87%


#### For each instance in the test set, predict its class using 1200 trees, and keep only the most frequent prediction. You can use mode from scipy.stats. Evaluate these predictions. Did you get lower or higher accuracy?

In [28]:
from scipy.stats import mode
import numpy as np

test_preds = []

for tree in trees:
    tree_test_preds = tree.predict(X_test) 
    test_preds.append(tree_test_preds)

test_preds = np.array(test_preds)

test_preds_mode = mode(test_preds, axis=0).mode[0]
all_trees_accuracy = np.sum(test_preds_mode.ravel() == Y_test) / len(Y_test)

print(f"All trees accuracy: {all_trees_accuracy * 100:.2f}%")


All trees accuracy: 49.93%
