# 1. Voting Classifier
#### In this assignment, you are expected to build an ensemble of different models and train it on the cover type dataset.

## 1.1. Load dataset
#### You will need to read the data from the file (cover.csv). It contains 581012 samples and 54 attributes for each sample. The target column is Cover_Type.

In [1]:
import warnings
warnings.filterwarnings("ignore")

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.svm import LinearSVC
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

In [3]:
data = pd.read_csv('cover.csv')
df = data.copy()

In [4]:
df.shape

(581012, 55)

## 1.2. Prepare dataset
#### Split the data into train, validation, and test sets using train_test_split twice with 0.2 test_size. Your final distribution will be 371847-92962-116203.

In [5]:
X = df.drop(columns = ['Cover_Type'])
y = df['Cover_Type']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size = 0.2, random_state = 42)

In [28]:
print(f"Shape of the Input of Train dataset: {X_train.shape}")
print(f"Shape of the Input of Validation dataset: {X_val.shape}")
print(f"Shape of the Input of Test dataset: {X_test.shape}")

Shape of the Input of Train dataset: (371847, 54)
Shape of the Input of Validation dataset: (92962, 54)
Shape of the Input of Test dataset: (116203, 54)
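
The 371847-92962-116203 distribution follows from applying a 0.2 test fraction twice. The sketch below reproduces the arithmetic in plain Python, assuming that `train_test_split` rounds a fractional test size up to the next whole sample (this matches the shapes above; the variable names are illustrative only).

```python
import math

n_total = 581012

# First split: 20% of all samples are held out as the test set.
n_test = math.ceil(0.2 * n_total)
n_rest = n_total - n_test

# Second split: 20% of the remaining samples become the validation set.
n_val = math.ceil(0.2 * n_rest)
n_train = n_rest - n_val

print(n_train, n_val, n_test)
```

Note that the validation set is 20% of the remaining 80%, so it ends up at roughly 16% of the full dataset rather than 20%.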


In [6]:
# Scale the features with StandardScaler before fitting. The scaler is fit on the training set only and then applied to the validation and test sets.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)
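
As a small illustration of what fitting the scaler on the training set only means, the sketch below uses synthetic data (not the cover type data; all names are made up for the example). The held-out data is standardized with the training mean and standard deviation, so its own mean is close to zero but not exactly zero.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_tr_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
X_te_demo = rng.normal(loc=50.0, scale=10.0, size=(50, 3))

demo_scaler = StandardScaler()
X_tr_s = demo_scaler.fit_transform(X_tr_demo)  # learns mean/std from the training data
X_te_s = demo_scaler.transform(X_te_demo)      # reuses the training mean/std

# Equivalent manual computation with the training statistics.
X_te_manual = (X_te_demo - X_tr_demo.mean(axis=0)) / X_tr_demo.std(axis=0)

print(X_tr_s.mean(axis=0).round(6), X_tr_s.std(axis=0).round(6))
print(X_te_s.mean(axis=0).round(2))
```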

## 1.3. Modeling
#### Train 4-5 different classifiers on the data. You can train RandomForestClassifier, ExtraTreesClassifier, LinearSVC, SGDClassifier, MLPClassifier, etc. Evaluate their performance using the validation set. Note that training may take quite a while (up to 30 minutes) depending on the hardware.

In [7]:
rf_clf = RandomForestClassifier(random_state=42)
rf_clf.fit(X_train_scaled, y_train)

In [8]:
tree_clf = ExtraTreesClassifier(random_state=42)
tree_clf.fit(X_train_scaled, y_train)

In [9]:
linear_clf = SGDClassifier(random_state=42)
linear_clf.fit(X_train_scaled, y_train)

In [10]:
svm_clf = LinearSVC(random_state=42)
svm_clf.fit(X_train_scaled, y_train)

In [11]:
nn_clf = MLPClassifier(hidden_layer_sizes=(100,), max_iter=300, random_state=42)
nn_clf.fit(X_train_scaled, y_train)

In [12]:
classifiers = [rf_clf, tree_clf, linear_clf, svm_clf, nn_clf]
names = ['RandomForest', 'ExtraTrees', 'SGD', 'LinearSVC', 'MLP']
validation_accuracies={}
for clf, name in zip(classifiers, names):
    y_val_pred = clf.predict(X_val_scaled)
    acc = accuracy_score(y_val, y_val_pred)
    validation_accuracies[name] = round(acc,2)
print(validation_accuracies)

{'RandomForest': 0.95, 'ExtraTrees': 0.95, 'SGD': 0.71, 'LinearSVC': 0.71, 'MLP': 0.86}


## 1.4. Ensembling
#### Create a hard and soft voting classifier using the models you have trained. You can use VotingClassifier. Check its performance on the validation set. Do you get better or worse performance than any of the individual classifiers?

In [19]:
# Hard Voting Classifier
hard_voting_clf = VotingClassifier(
    estimators=[('RandomForest', rf_clf), ('ExtraTrees', tree_clf), ('MLP', nn_clf), ('SGD', linear_clf), ('LinearSVC', svm_clf)],
    voting='hard'
)

# Train the hard voting classifier (VotingClassifier clones each estimator and refits the clones)
hard_voting_clf.fit(X_train_scaled, y_train)

# Evaluate on validation set
hard_y_val_pred = hard_voting_clf.predict(X_val_scaled)
hard_acc = accuracy_score(y_val, hard_y_val_pred)
print(f"Hard Voting Classifier Validation Accuracy: {hard_acc:.2f}")

Hard Voting Classifier Validation Accuracy: 0.90


In [14]:
# # Initializing Fresh Models for VotingClassifier:

# rf = RandomForestClassifier(random_state=42)
# tree = ExtraTreesClassifier(random_state=42)
# linear = SGDClassifier(random_state=42)
# svm = LinearSVC(random_state=42)
# nn = MLPClassifier(hidden_layer_sizes=(100,), max_iter=300, random_state=42)

# Soft Voting Classifier. LinearSVC has no `predict_proba`, and SGDClassifier only provides it for probabilistic losses (not the default hinge loss), so both are excluded.
soft_voting_clf = VotingClassifier(
    estimators=[('RandomForest', rf_clf), ('ExtraTrees', tree_clf), ('MLP', nn_clf)],
    voting='soft' 
)


soft_voting_clf.fit(X_train_scaled, y_train)
soft_y_val_pred = soft_voting_clf.predict(X_val_scaled)
soft_acc = accuracy_score(y_val, soft_y_val_pred)
print(f"Soft Voting Classifier Validation Accuracy: {soft_acc:.2f}")

Soft Voting Classifier Validation Accuracy: 0.94
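
To make the difference between the two voting modes concrete, here is a minimal NumPy sketch on made-up class probabilities (the numbers are illustrative and not taken from the models above). Hard voting takes the majority of the predicted labels, while soft voting averages the predicted probabilities and picks the class with the highest mean.

```python
import numpy as np

# Hypothetical probabilities, shape (n_models, n_samples, n_classes).
proba = np.array([
    [[0.55, 0.45], [0.20, 0.80]],
    [[0.55, 0.45], [0.30, 0.70]],
    [[0.10, 0.90], [0.60, 0.40]],
])
n_classes = proba.shape[2]

# Hard voting: each model casts one vote per sample, the majority label wins.
votes = proba.argmax(axis=2)  # shape (n_models, n_samples)
hard_pred = np.array(
    [np.bincount(sample_votes, minlength=n_classes).argmax() for sample_votes in votes.T]
)

# Soft voting: average the probabilities across models, then take the argmax.
soft_pred = proba.mean(axis=0).argmax(axis=1)

print("hard:", hard_pred)
print("soft:", soft_pred)
```

For the first sample, two models lean weakly toward class 0 and one is confident about class 1, so the two modes disagree. This is why a confident, accurate model carries more weight under soft voting.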


In [15]:
validation_accuracies # Individual Models

{'RandomForest': 0.95,
 'ExtraTrees': 0.95,
 'SGD': 0.71,
 'LinearSVC': 0.71,
 'MLP': 0.86}

#### Check if any of the models hurts the performance of the ensemble. You can access the estimators of the ensemble using the estimators_ attribute. If so, drop them using set_params and reevaluate.

In [22]:
final_hard_estimators = []
all_estimators = [(name, estimator) for name, estimator in hard_voting_clf.named_estimators_.items()]

# Loop through each estimator
for i, (name, estimator) in enumerate(all_estimators):
    temp_estimators = [
        (other_name, other_estimator) for j, (other_name, other_estimator) in enumerate(all_estimators) if i != j
    ]
    
    # Creating a new VotingClassifier with the reduced set of estimators
    temp_voting_clf = VotingClassifier(estimators=temp_estimators, voting='hard')
    temp_voting_clf.fit(X_train_scaled, y_train)
    
    # Evaluate the reduced ensemble
    temp_hard_voting_pred = temp_voting_clf.predict(X_val_scaled)
    temp_hard_voting_acc = accuracy_score(y_val, temp_hard_voting_pred)
    print(f"Performance after removing {name}: {temp_hard_voting_acc:.2%}")
    
    # Keep the estimator unless removing it strictly improves on the full ensemble (ties keep it)
    if temp_hard_voting_acc <= hard_acc:
        final_hard_estimators.append((name, estimator))
        print(f"Removing {name} decreases accuracy, it should not be dropped")
    else:
        print(f"Removing {name} does not hurt performance, it can be dropped")


final_hard_voting_clf = VotingClassifier(estimators=final_hard_estimators, voting='hard')

# Print final estimators
print("Final hard voting estimators:", [name for name, _ in final_hard_estimators])

Performance after removing RandomForest: 79.98%
Removing RandomForest decreases accuracy, it should not be dropped
Performance after removing ExtraTrees: 80.03%
Removing ExtraTrees decreases accuracy, it should not be dropped
Performance after removing MLP: 82.05%
Removing MLP decreases accuracy, it should not be dropped
Performance after removing SGD: 92.18%
Removing SGD does not hurt performance, it can be dropped
Performance after removing LinearSVC: 92.04%
Removing LinearSVC does not hurt performance, it can be dropped
Final hard voting estimators: ['RandomForest', 'ExtraTrees', 'MLP']


In [23]:
final_soft_estimators = []
all_estimators = [(name, estimator) for name, estimator in soft_voting_clf.named_estimators_.items()]

# Loop through each estimator
for i, (name, estimator) in enumerate(all_estimators):
    temp_estimators = [
        (other_name, other_estimator) for j, (other_name, other_estimator) in enumerate(all_estimators) if i != j
    ]
    
    # Creating a new VotingClassifier with the reduced set of estimators
    temp_voting_clf = VotingClassifier(estimators=temp_estimators, voting='soft')
    temp_voting_clf.fit(X_train_scaled, y_train)
    
    # Evaluate the reduced ensemble
    temp_soft_voting_pred = temp_voting_clf.predict(X_val_scaled)
    temp_soft_voting_acc = accuracy_score(y_val, temp_soft_voting_pred)
    print(f"Performance after removing {name}: {temp_soft_voting_acc:.2%}")
    
    # Keep the estimator unless removing it strictly improves on the full ensemble (ties keep it)
    if temp_soft_voting_acc <= soft_acc:
        final_soft_estimators.append((name, estimator))
        print(f"Removing {name} decreases accuracy, it should not be dropped")
    else:
        print(f"Removing {name} does not hurt performance, it can be dropped")


final_soft_voting_clf = VotingClassifier(estimators=final_soft_estimators, voting='soft')

# Print final estimators
print("Final soft voting estimators:", [name for name, _ in final_soft_estimators])

Performance after removing RandomForest: 91.91%
Removing RandomForest decreases accuracy, it should not be dropped
Performance after removing ExtraTrees: 91.91%
Removing ExtraTrees decreases accuracy, it should not be dropped
Performance after removing MLP: 95.15%
Removing MLP does not hurt performance, it can be dropped
Final soft voting estimators: ['RandomForest', 'ExtraTrees']


In [30]:
final_hard_voting_clf.fit(X_train_scaled, y_train)
print(f"Final Hard Voting Classifier Validation Accuracy: {accuracy_score(y_val, final_hard_voting_clf.predict(X_val_scaled)):.2f}")

# Only MLP was removed from the soft voting classifier, and that reduced ensemble already scored 95.15% above, so it is not retrained here
print("Final Soft Voting Classifier Validation Accuracy: 95.15%")

Final Hard Voting Classifier Validation Accuracy: 0.95
Final Soft Voting Classifier Validation Accuracy: 95.15%
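
The task suggests dropping estimators with `set_params`. A minimal sketch of that route is shown below on a small synthetic dataset, with stand-in estimators chosen only to keep the example fast (none of these names come from the notebook above). Setting an estimator to `'drop'` excludes it the next time the ensemble is fitted.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

voter = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("dt", DecisionTreeClassifier(random_state=0)),
        ("nb", GaussianNB()),
    ],
    voting="hard",
)
voter.fit(X_demo, y_demo)
n_before = len(voter.estimators_)

# Mark one estimator as dropped, then refit so estimators_ reflects the change.
voter.set_params(nb="drop")
voter.fit(X_demo, y_demo)
n_after = len(voter.estimators_)

print(n_before, n_after)
```

Compared with building a new `VotingClassifier` for every candidate, this keeps a single ensemble object, although the remaining estimators are still retrained on each `fit` call.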


# 2. Random Forest
#### In this assignment, you are expected to build a random forest that classifies a toy dataset.

## 2.1. Load dataset
#### You will need to read the data from the file (data.csv). It contains 15000 samples and two features for each sample.

In [3]:
dataset = pd.read_csv('data.csv')
dataset.head()

   1.018255499889504426e+04  -3.718306912453772384e+02  1.000000000000000000e+02
0              -8493.323486                7009.446179                       0.0
1              21322.088204                -390.558362                     100.0
2               5473.925002               -1878.223941                       0.0
3              -7422.540710                5291.351276                       0.0
4              -9103.655795                3197.164389                       0.0


In [4]:
column_names = ['feature1', 'feature2', 'label']
dataset = pd.read_csv('data.csv', header=None, names=column_names)
# dataset['label'].value_counts()
dataset.head()

       feature1      feature2  label
0  10182.554999   -371.830691  100.0
1  -8493.323486   7009.446179    0.0
2  21322.088204   -390.558362  100.0
3   5473.925002  -1878.223941    0.0
4  -7422.540710   5291.351276    0.0


## 2.2. Prepare dataset
#### Split the data into train and test sets with 0.2 test size.

In [5]:
X = dataset.drop(columns=['label'])
y = dataset['label']

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [7]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

## 2.3. Modeling
#### Train a DecisionTreeClassifier on the data. Use GridSearchCV to tune the hyperparameters.

In [8]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

In [9]:
param_grid = {
    'criterion': ['gini', 'entropy'],                  
    'max_depth': [None, 5, 10, 20, 50],               
    'min_samples_split': [2, 5, 10],                  
    'min_samples_leaf': [1, 2, 5, 10],           
}

In [10]:
dt_clf = DecisionTreeClassifier(random_state=42)

# Perform grid search
grid_search = GridSearchCV(
    estimator=dt_clf,
    param_grid=param_grid,
    scoring='accuracy',          
    cv=5,                        
    verbose=1,                   
    n_jobs=-1                  
)

In [11]:
grid_search.fit(X_train_scaled, y_train)

# Best hyperparameters and score
print("Best Hyperparameters:", grid_search.best_params_)
print("Best Cross-Validation Score:", round(grid_search.best_score_,2))

Fitting 5 folds for each of 120 candidates, totalling 600 fits
Best Hyperparameters: {'criterion': 'entropy', 'max_depth': 5, 'min_samples_leaf': 10, 'min_samples_split': 2}
Best Cross-Validation Score: 0.86
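
The "120 candidates, totalling 600 fits" line in the log is the product of the grid sizes times the number of folds. A short sketch using scikit-learn's `ParameterGrid` on the same grid confirms the count (the variable names here are illustrative).

```python
from sklearn.model_selection import ParameterGrid

demo_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [None, 5, 10, 20, 50],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 5, 10],
}

n_candidates = len(ParameterGrid(demo_grid))  # 2 * 5 * 3 * 4 combinations
n_fits = n_candidates * 5                     # one fit per candidate per CV fold

print(n_candidates, n_fits)
```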


#### Train the best model on the whole train set (do you need to?) and evaluate the model on the test set.

In [13]:
# GridSearchCV refits the best model on the entire training set by default (refit=True),
# so best_estimator_ is already trained and no extra fit is needed
best_model = grid_search.best_estimator_
# Evaluate on the test set
y_test_pred = best_model.predict(X_test_scaled)
test_accuracy = accuracy_score(y_test, y_test_pred)
print("Test Accuracy of the Best Model:", round(test_accuracy, 2))

Test Accuracy of the Best Model: 0.86
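
On the "do you need to?" question: with the default `refit=True`, `GridSearchCV` already retrains the best configuration on all the data passed to `fit`. The sketch below checks this on a small synthetic dataset (an assumption made for speed; the names are made up) by comparing `best_estimator_` with a manually refitted clone.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=400, n_features=5, random_state=0)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 6]},
    cv=3,
)
search.fit(X_demo, y_demo)

# Refit an unfitted copy with the same hyperparameters on the same data.
manual = clone(search.best_estimator_).fit(X_demo, y_demo)

same_predictions = np.array_equal(
    search.best_estimator_.predict(X_demo), manual.predict(X_demo)
)
print(search.refit, same_predictions)
```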


#### Generate 1,200 subsets of the training set, each containing 100 randomly chosen instances. You can use ShuffleSplit.

In [14]:
from sklearn.model_selection import ShuffleSplit

ss = ShuffleSplit(n_splits=1200, train_size=100, random_state=42)

subsets = list(ss.split(X_train_scaled))  # Each item is a (train_indices, test_indices) pair; the train indices select a 100-instance subset of X_train_scaled

In [15]:
X_train_scaled[subsets[3][0]].shape

(100, 2)
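
For reference, here is a small sketch of what `ShuffleSplit` yields when only `train_size` is given, using a toy array rather than the real training set. As I understand the default behavior, the "test" side of each split is simply the complement of the sampled subset, which is why only the first element of each pair is used.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

X_toy = np.arange(2000).reshape(1000, 2)

splitter = ShuffleSplit(n_splits=3, train_size=100, random_state=42)
split_sizes = []
for subset_idx, rest_idx in splitter.split(X_toy):
    # Indices within one subset are sampled without replacement.
    assert len(set(subset_idx)) == len(subset_idx)
    split_sizes.append((len(subset_idx), len(rest_idx)))

print(split_sizes)
```

Different subsets can overlap with each other, since each split is drawn independently.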

#### Train one tree on each subset, using the best model you previously found. Evaluate the performance of the trees using the test set. Did you get lower or higher accuracy? Why?

In [16]:
individual_trees = []

# Train one tree on each subset
for train_indices, _ in subsets:
    X_subset = X_train_scaled[train_indices]
    y_subset = y_train.iloc[train_indices]
    
    tree_clf = DecisionTreeClassifier(**grid_search.best_params_, random_state=42)
    tree_clf.fit(X_subset, y_subset)
    individual_trees.append(tree_clf)

# Evaluate individual trees
individual_accuracies = [
    accuracy_score(y_test, tree.predict(X_test_scaled)) for tree in individual_trees
]

# Sort the accuracies to get the 10 best and the 10 worst individual trees
top_10_accuracies = sorted(individual_accuracies, reverse=True)[:10]
worst_10_accuracies = sorted(individual_accuracies, reverse=False)[:10]

print("Top 10 Accuracies of Individual Trees:", [round(acc, 2) for acc in top_10_accuracies])
print("Worst 10 Accuracies of Individual Trees:", [round(acc, 2) for acc in worst_10_accuracies])

average_accuracy = sum(individual_accuracies) / len(individual_accuracies)
print("Average Accuracy of Individual Trees:", round(average_accuracy, 2))

Top 10 Accuracies of Individual Trees: [0.85, 0.85, 0.85, 0.85, 0.85, 0.85, 0.85, 0.85, 0.84, 0.84]
Worst 10 Accuracies of Individual Trees: [0.62, 0.66, 0.67, 0.67, 0.67, 0.68, 0.68, 0.68, 0.69, 0.69]
Average Accuracy of Individual Trees: 0.79


#### For each instance in the test set, predict its class using 1200 trees, and keep only the most frequent prediction. You can use mode from scipy.stats. Evaluate these predictions. Did you get lower or higher accuracy?

In [17]:
import numpy as np
from scipy.stats import mode

# Collect predictions from all trees
all_predictions = np.array([tree.predict(X_test_scaled) for tree in individual_trees])

# Majority voting
final_predictions, _ = mode(all_predictions, axis=0)

# Evaluation
ensemble_accuracy = accuracy_score(y_test, final_predictions.ravel())
print("Ensemble Accuracy Using Majority Voting:", round(ensemble_accuracy, 2))

Ensemble Accuracy Using Majority Voting: 0.83
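
The majority-vote step can be checked on a tiny hand-made example. The sketch below uses three hypothetical "trees" and four instances; `np.ravel` is applied so the result has the same shape regardless of whether the installed SciPy version keeps the reduced axis.

```python
import numpy as np
from scipy.stats import mode

# Rows are trees, columns are test instances (made-up predictions).
toy_votes = np.array([
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 1],
])

# Most frequent label per column, i.e. per test instance.
toy_majority = np.ravel(mode(toy_votes, axis=0).mode)

print(toy_majority)
```

Each individual tree is wrong on a different instance here, yet the vote recovers a consistent answer. The same averaging-out of uncorrelated errors is what lifts the 1,200 small trees from an average of 0.79 to 0.83 as an ensemble.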
