Train and fine-tune a decision tree on the wine dataset by following these steps:

  1. Use load_wine() to load the wine dataset
  2. Split the dataset into training and test sets
  3. Use randomized search CV (RandomizedSearchCV) to tune the hyperparameters of the decision tree
  4. Try to achieve an accuracy of at least 85% on the test set

In [1]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier
from scipy.stats import randint
from sklearn.metrics import accuracy_score

In [2]:
# Step 1: Load the wine dataset
wine = load_wine()
X = wine.data
y = wine.target

In [3]:
# Step 2: Split the dataset into train and test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [4]:
# Step 3: Use Random Search CV to hyperparameter tune the Decision Tree
param_dist = {
    "criterion": ["gini", "entropy"],
    "splitter": ["best", "random"],
    "max_depth": randint(1, 20),
    "min_samples_split": randint(2, 20),
    "min_samples_leaf": randint(1, 20),
}
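
The distributions above can be inspected directly. The sketch below is only illustrative (the names `max_depth_dist`, `samples` and `n_combinations` are not part of the original notebook). It shows that `randint(low, high)` from SciPy excludes the upper bound, so `randint(1, 20)` yields depths from 1 to 19, and it counts how many distinct parameter combinations the search space contains.

```python
from scipy.stats import randint

# randint(low, high) is a discrete uniform distribution over low, ..., high - 1
max_depth_dist = randint(1, 20)

# Draw some values to see the range the search can actually sample
samples = max_depth_dist.rvs(size=1000, random_state=0)
print("Sampled max_depth range:", samples.min(), "to", samples.max())

# The upper bound itself has zero probability
print("P(max_depth == 20):", max_depth_dist.pmf(20))

# Count the distinct combinations in the search space defined by param_dist
n_criterion = 2                      # "gini", "entropy"
n_splitter = 2                       # "best", "random"
n_max_depth = len(range(1, 20))      # 1..19
n_min_split = len(range(2, 20))      # 2..19
n_min_leaf = len(range(1, 20))       # 1..19
n_combinations = n_criterion * n_splitter * n_max_depth * n_min_split * n_min_leaf
print("Distinct combinations:", n_combinations)
```

With `n_iter=100`, the randomized search therefore evaluates only a small fraction of the possible combinations, which is the usual trade-off compared with an exhaustive grid search.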

In [5]:
tree = DecisionTreeClassifier(random_state=42)
random_search = RandomizedSearchCV(tree, param_distributions=param_dist, n_iter=100, cv=5, random_state=42)
random_search.fit(X_train, y_train)

In [6]:
# Step 4: Check accuracy and fine-tuned parameters
best_tree = random_search.best_estimator_
y_pred = best_tree.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print("Best parameters found by RandomizedSearchCV:")
print(random_search.best_params_)
print("Accuracy on test set:", accuracy)

Best parameters found by RandomizedSearchCV:
{'criterion': 'gini', 'max_depth': 13, 'min_samples_leaf': 1, 'min_samples_split': 8, 'splitter': 'best'}
Accuracy on test set: 0.9444444444444444
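
The test accuracy above comes from a single hold-out set of only 36 samples, so each misclassified sample moves the score by almost 3 percentage points. As a rough cross-check, the sketch below repeats the same search and prints the mean cross-validated accuracy of the best candidate (`best_score_`) next to the test accuracy. The variable names `search`, `cv_accuracy` and `test_accuracy` are introduced here for illustration only, and the exact numbers may differ between scikit-learn versions.

```python
from scipy.stats import randint
from sklearn.datasets import load_wine
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Same data and split as in the cells above
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Same search space as param_dist above
param_dist = {
    "criterion": ["gini", "entropy"],
    "splitter": ["best", "random"],
    "max_depth": randint(1, 20),
    "min_samples_split": randint(2, 20),
    "min_samples_leaf": randint(1, 20),
}

search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=100,
    cv=5,
    random_state=42,
)
search.fit(X_train, y_train)

# Mean accuracy over the 5 validation folds for the best candidate
cv_accuracy = search.best_score_
# Accuracy of the refitted best estimator on the held-out test set
test_accuracy = search.score(X_test, y_test)

print("Test set size:", X_test.shape[0])
print("Cross-validated accuracy of best candidate:", cv_accuracy)
print("Accuracy on test set:", test_accuracy)
```

If the two numbers are far apart, the test score is more likely to reflect the small test set than a real difference in model quality.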


Grow a random forest using the following steps:

  1. Continuing the previous question, create 10 subsets of the training dataset. You can use the ShuffleSplit class for this.
  2. Train one decision tree on each subset, using the best hyperparameter values found in the previous question.
  3. Evaluate all the trees on the test dataset. Do they perform better than the tree created in the previous question?

# Grow a random forest

In [21]:
from sklearn.model_selection import ShuffleSplit
import numpy as np

In [14]:
# Step 1: Create 10 subsets of the training dataset
shuffle_split = ShuffleSplit(n_splits=10, test_size=0.2, random_state=42)
forest_train_indices = []

In [15]:
for train_index, _ in shuffle_split.split(X_train):
    forest_train_indices.append(train_index)
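
To see what these subsets look like, the sketch below runs the same `ShuffleSplit` configuration on a placeholder array with 142 rows, which is the number of training rows left after the 80/20 split of the 178 wine samples. The names `placeholder`, `subsets`, `sizes` and `shared` are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# X_train has 142 rows: 178 wine samples minus the 36 held out for testing
n_train_rows = 142
placeholder = np.zeros((n_train_rows, 1))  # only the row count matters here

splitter = ShuffleSplit(n_splits=10, test_size=0.2, random_state=42)
subsets = [train_idx for train_idx, _ in splitter.split(placeholder)]

# Every subset has the same size, roughly 80% of the training rows
sizes = {len(idx) for idx in subsets}
print("Number of subsets:", len(subsets))
print("Subset sizes:", sizes)

# Subsets are drawn independently, so they overlap heavily
shared = len(np.intersect1d(subsets[0], subsets[1]))
print("Rows shared by the first two subsets:", shared)
```

Because every subset is drawn without replacement from the same 142 rows, the trees trained on them see largely the same data. This differs from the bootstrap sampling (with replacement) that `RandomForestClassifier` uses by default.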

In [16]:
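# Step 2: Train one decision tree on each subset using the best hyperparameters
# Note: no random_state is passed here, so ties between equally good splits
# may be broken differently from run to run and the accuracies can vary slightly.
forest = []
for train_index in forest_train_indices:
    tree = DecisionTreeClassifier(**random_search.best_params_)
    tree.fit(X_train[train_index], y_train[train_index])
    forest.append(tree)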

In [17]:
# Step 3: Evaluate all the trees on the test dataset
forest_predictions = []
for tree in forest:
    forest_predictions.append(tree.predict(X_test))

In [18]:
# Calculate the accuracy of each tree
forest_accuracies = [accuracy_score(y_test, pred) for pred in forest_predictions]

print("Random Forest accuracies for each tree:", forest_accuracies)

Random Forest accuracies for each tree: [0.9722222222222222, 0.9722222222222222, 0.9166666666666666, 0.9166666666666666, 0.9444444444444444, 0.8888888888888888, 0.9166666666666666, 0.9444444444444444, 0.9166666666666666, 0.9444444444444444]
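
Step 3 asks whether these trees perform better than the single tuned tree. The short sketch below answers that using only the numbers printed above; the variable names are illustrative. Two trees score higher than the single tree (0.944), three match it, and five score lower, which is plausible given that each tree was trained on only about 80% of the training data.

```python
import numpy as np

# Test accuracy of the single tuned tree from the first question
single_tree_accuracy = 0.9444444444444444

# Per-tree test accuracies as printed above
forest_accuracies = np.array([
    0.9722222222222222, 0.9722222222222222, 0.9166666666666666,
    0.9166666666666666, 0.9444444444444444, 0.8888888888888888,
    0.9166666666666666, 0.9444444444444444, 0.9166666666666666,
    0.9444444444444444,
])

# Use a tolerance so that floating-point noise does not affect the comparison
equal = int(np.sum(np.isclose(forest_accuracies, single_tree_accuracy)))
better = int(np.sum(forest_accuracies > single_tree_accuracy + 1e-9))
worse = int(np.sum(forest_accuracies < single_tree_accuracy - 1e-9))

print("Trees better than the single tree:", better)  # 2
print("Trees equal to the single tree:", equal)      # 3
print("Trees worse than the single tree:", worse)    # 5
```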


In [22]:
# Calculate the mean accuracy of the individual trees (no ensemble voting is applied here)
average_accuracy = np.mean(forest_accuracies)
print("Average accuracy of Random Forest:", average_accuracy)

Average accuracy of Random Forest: 0.9333333333333333
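
The figure above is the mean accuracy of ten separate trees, not the accuracy of a combined model. A forest normally combines its trees, for example by majority vote. The sketch below is one possible way to do that and is not part of the original exercise steps. It assumes the best parameters printed earlier (hard-coded as `best_params`) and adds a per-tree `random_state` for reproducibility, so its trees are not identical to those trained above.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score
from sklearn.model_selection import ShuffleSplit, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Assumption: the best parameters reported by the randomized search above
best_params = {
    "criterion": "gini",
    "max_depth": 13,
    "min_samples_leaf": 1,
    "min_samples_split": 8,
    "splitter": "best",
}

# Train one tree per ShuffleSplit subset, as in the cells above
splitter = ShuffleSplit(n_splits=10, test_size=0.2, random_state=42)
trees = []
for i, (train_idx, _) in enumerate(splitter.split(X_train)):
    tree = DecisionTreeClassifier(random_state=i, **best_params)
    tree.fit(X_train[train_idx], y_train[train_idx])
    trees.append(tree)

# Row t holds the predictions of tree t for every test sample
all_preds = np.array([tree.predict(X_test) for tree in trees])

# Count the votes per class for each test sample: shape (n_classes, n_test)
n_classes = len(np.unique(y))
votes = np.apply_along_axis(np.bincount, 0, all_preds, minlength=n_classes)

# Majority vote; argmax resolves ties in favour of the lowest class label
majority = votes.argmax(axis=0)
ensemble_accuracy = accuracy_score(y_test, majority)
print("Accuracy of the majority-vote ensemble:", ensemble_accuracy)
```

Comparing `ensemble_accuracy` with the single-tree accuracy and with the mean per-tree accuracy shows whether voting helps on this small test set. With only 36 test samples, small differences should be read with caution.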
