Train and fine-tune a decision tree on the wine dataset by following these steps:

  1. Use load_wine() to load the wine dataset
  2. Split the dataset into train and test sets
  3. Use randomized search CV (RandomizedSearchCV) to tune the Decision Tree's hyperparameters
  4. Try to achieve an accuracy of at least 85%


Grow a random forest using the following steps:

  1. Continuing the previous question, create 10 subsets of the training dataset. You can use the ShuffleSplit class for this.
  2. Train one decision tree on each subset, using the best hyperparameter values found in the previous question.
  3. Evaluate all the trees on the test dataset. Do they perform better than the tree created in the previous question?


In [1]:
# Import the necessary libraries
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, RandomizedSearchCV, ShuffleSplit
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from scipy.stats import randint
from sklearn.metrics import accuracy_score


In [7]:
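# Step 1: Load the wine dataset as NumPy arrays (features X, class labels y)
X, y = load_wine(return_X_y=True)
# Step 2: Split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)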
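As a quick sanity check on the split, the sketch below prints the resulting set sizes. It assumes X and y come from load_wine(return_X_y=True); the name accuracy_step is introduced here purely for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

# Assumption: the data is loaded as plain NumPy arrays
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

n_train, n_test = len(X_train), len(X_test)
# Smallest possible change in test accuracy: one sample flipping
accuracy_step = 1 / n_test

print(f"train samples: {n_train}, test samples: {n_test}")
print(f"one test sample changes accuracy by {accuracy_step:.4f}")
```

The wine dataset has 178 samples, so a 20% test split holds 36 of them and every test accuracy is a multiple of 1/36. Scores such as 0.9444 and 0.9722 therefore differ by a single test sample, which is worth keeping in mind when comparing models.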


In [8]:
# Step 3: Hyperparameter tuning using RandomizedSearchCV
param_dist = {
    'criterion': ['gini', 'entropy'],
    'max_depth': randint(2, 10),
    'min_samples_split': randint(2, 20),
    'min_samples_leaf': randint(1, 20),
    'max_features': ['sqrt', 'log2', None]
}
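Note that scipy.stats.randint(low, high) excludes the upper bound, so randint(2, 10) yields depths from 2 through 9. A minimal sketch to preview what the search space produces, using scikit-learn's ParameterSampler (the sampler that RandomizedSearchCV relies on). The seed and the number of draws are arbitrary choices, so these candidates will not match the ones the search itself evaluates.

```python
from scipy.stats import randint
from sklearn.model_selection import ParameterSampler

param_dist = {
    'criterion': ['gini', 'entropy'],
    'max_depth': randint(2, 10),          # integers 2..9 (upper bound excluded)
    'min_samples_split': randint(2, 20),  # integers 2..19
    'min_samples_leaf': randint(1, 20),   # integers 1..19
    'max_features': ['sqrt', 'log2', None],
}

# Draw 200 candidate settings; seed 0 is an arbitrary illustrative choice
samples = list(ParameterSampler(param_dist, n_iter=200, random_state=0))

depths = sorted({int(s['max_depth']) for s in samples})
leaves = sorted({int(s['min_samples_leaf']) for s in samples})
print("max_depth values seen:", depths)
print("min_samples_leaf range seen:", leaves[0], "to", leaves[-1])
print("first candidate:", samples[0])
```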

In [9]:
# Sample 100 hyperparameter candidates and score each with 5-fold cross-validation
dt_classifier = DecisionTreeClassifier(random_state=42)
random_search = RandomizedSearchCV(dt_classifier, param_distributions=param_dist, n_iter=100, cv=5, random_state=42)
random_search.fit(X_train, y_train)


In [10]:
print("Best hyperparameters for Decision Tree:", random_search.best_params_)
print("Best cross-validation score:", random_search.best_score_)

Best hyperparameters for Decision Tree: {'criterion': 'gini', 'max_depth': 4, 'max_features': None, 'min_samples_leaf': 1, 'min_samples_split': 12}
Best cross-validation score: 0.9224137931034484


In [11]:
# Evaluate the best model on the test set
best_dt_model = random_search.best_estimator_
y_pred = best_dt_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy of the best Decision Tree model on test set:", accuracy)


Accuracy of the best Decision Tree model on test set: 0.9444444444444444


In [12]:
# Part 2: Grow a random forest by hand (one tree per random subset)
n_subsets = 10
# Each split keeps 80% of the training set as a subset; the held-out 20% is ignored
shuffle_split = ShuffleSplit(n_splits=n_subsets, test_size=0.2, random_state=42)

forest = []
for train_index, _ in shuffle_split.split(X_train):
    subset_X_train, subset_y_train = X_train[train_index], y_train[train_index]
    # Reuse the best hyperparameters found by the randomized search
    dt = DecisionTreeClassifier(**random_search.best_params_)
    dt.fit(subset_X_train, subset_y_train)
    forest.append(dt)
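To see what ShuffleSplit actually hands to each tree, the sketch below inspects the index sets it generates. It assumes a training set of 142 rows (80% of the 178 wine samples) and uses a placeholder array, since only the row count matters for the split. The names placeholder, subset_sizes and overlap are illustrative.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

n_train_samples = 142  # assumption: size of the 80% training split
placeholder = np.zeros((n_train_samples, 1))  # ShuffleSplit only needs the row count

shuffle_split = ShuffleSplit(n_splits=10, test_size=0.2, random_state=42)
train_indices = [train_idx for train_idx, _ in shuffle_split.split(placeholder)]

# Number of rows each tree would be trained on
subset_sizes = [len(idx) for idx in train_indices]
# Rows shared by the first two subsets
overlap = len(np.intersect1d(train_indices[0], train_indices[1]))

print("subset sizes:", subset_sizes)
print("rows shared by subsets 1 and 2:", overlap)
```

Each subset is drawn without replacement and covers most of the training set, so any two subsets share the bulk of their rows. This differs from RandomForestClassifier, which by default draws bootstrap samples (with replacement) and can also restrict the features considered at each split.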


In [13]:
# Evaluate all trees on the test dataset
forest_predictions = [dt.predict(X_test) for dt in forest]
forest_accuracy = [accuracy_score(y_test, pred) for pred in forest_predictions]


In [15]:
# Compare with the single Decision Tree
print("Accuracy of the Decision Trees in the Random Forest:")
for i, acc in enumerate(forest_accuracy):
    print(f"Tree {i+1}: {acc}")


Accuracy of the Decision Trees in the Random Forest:
Tree 1: 0.9722222222222222
Tree 2: 0.9722222222222222
Tree 3: 0.9166666666666666
Tree 4: 0.9166666666666666
Tree 5: 0.9444444444444444
Tree 6: 0.9444444444444444
Tree 7: 0.9166666666666666
Tree 8: 0.9166666666666666
Tree 9: 0.9166666666666666
Tree 10: 0.9444444444444444
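In the output above, two of the ten trees score higher than the single tuned tree (0.9444), three match it, and five score lower, so the individual trees are not better on average. A forest usually gains its advantage by combining the trees, which the steps above do not do. The following is a minimal sketch of a majority vote over the ten trees. It hard-codes the best hyperparameters printed earlier and fixes random_state on each tree for repeatability; both are assumptions of this sketch, as are the names all_preds and ensemble_pred.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score
from sklearn.model_selection import ShuffleSplit, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Best hyperparameters as printed by the randomized search above
best_params = {'criterion': 'gini', 'max_depth': 4, 'max_features': None,
               'min_samples_leaf': 1, 'min_samples_split': 12}

forest = []
shuffle_split = ShuffleSplit(n_splits=10, test_size=0.2, random_state=42)
for train_index, _ in shuffle_split.split(X_train):
    tree = DecisionTreeClassifier(random_state=42, **best_params)
    tree.fit(X_train[train_index], y_train[train_index])
    forest.append(tree)

# Rows are trees, columns are test samples
all_preds = np.array([tree.predict(X_test) for tree in forest])

# Majority vote: for each test sample, pick the most frequently predicted class
n_classes = len(np.unique(y))
ensemble_pred = np.apply_along_axis(
    lambda votes: np.bincount(votes, minlength=n_classes).argmax(), 0, all_preds
)

tree_accuracies = [accuracy_score(y_test, preds) for preds in all_preds]
ensemble_accuracy = accuracy_score(y_test, ensemble_pred)
print("Mean accuracy of individual trees:", np.mean(tree_accuracies))
print("Accuracy of the majority vote:", ensemble_accuracy)
```

Ties are broken in favour of the lowest class label, because argmax returns the first maximum. Comparing the majority-vote accuracy with the mean per-tree accuracy shows whether combining the trees helps on this test set.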
