## Train and fine-tune a decision tree on the wine dataset by following these steps:

  1. Use load_wine() to load the wine dataset
  2. Split the dataset into training and test sets
  3. Use randomized search CV (RandomizedSearchCV) to tune the decision tree's hyperparameters
  4. Try to achieve an accuracy of at least 85%

In [1]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier
from scipy.stats import randint

In [2]:
# Step 1: Load the wine dataset
wine = load_wine()
X = wine.data
y = wine.target

In [5]:
# Step 2: Split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [6]:
# Step 3: Define the parameter distributions for RandomizedSearchCV
param_dist = {
    'criterion': ['gini', 'entropy'],
    'max_depth': randint(1, 20),
    'min_samples_split': randint(2, 20),
    'min_samples_leaf': randint(1, 20),
    'max_features': ['sqrt', 'log2', None]  # 'auto' is rejected for decision trees by scikit-learn >= 1.3
}
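To see what the search will actually try, the distributions can be sampled directly. The sketch below uses `ParameterSampler` from scikit-learn, which is not part of the original notebook; the name `candidates` is introduced here only for illustration. Note that `randint(low, high)` excludes the upper bound.

```python
from scipy.stats import randint
from sklearn.model_selection import ParameterSampler

# Search space for the decision tree; max_features is limited to values
# accepted by current scikit-learn releases.
param_dist = {
    'criterion': ['gini', 'entropy'],
    'max_depth': randint(1, 20),          # integers 1..19 (upper bound is exclusive)
    'min_samples_split': randint(2, 20),  # integers 2..19
    'min_samples_leaf': randint(1, 20),   # integers 1..19
    'max_features': ['sqrt', 'log2', None],
}

# Draw 10 candidate settings, mirroring a randomized search with n_iter=10.
candidates = list(ParameterSampler(param_dist, n_iter=10, random_state=42))
for params in candidates:
    print(params)
```

Each printed dictionary is one hyperparameter combination that a randomized search over this space could fit and cross-validate.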

In [8]:
# Step 3 (continued): Run the randomized search with 5-fold cross-validation
dt = DecisionTreeClassifier()

random_search = RandomizedSearchCV(
    dt,
    param_distributions=param_dist,
    n_iter=10,
    cv=5,
    verbose=2,
    random_state=42,
    n_jobs=-1,
)
random_search.fit(X_train, y_train)

# Step 4: Evaluate the tuned model on the test set
accuracy = random_search.best_estimator_.score(X_test, y_test)
print("Accuracy on test set:", accuracy)

Fitting 5 folds for each of 10 candidates, totalling 50 fits
Accuracy on test set: 0.9166666666666666
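It can also help to look at which hyperparameters the search selected and how the cross-validated score compares with the test score. The following is a self-contained sketch rather than a cell from the original run: fixing `random_state=42` on the tree and leaving out `'auto'` are assumptions made here for reproducibility and compatibility, so the numbers it prints may differ from the output above.

```python
from scipy.stats import randint
from sklearn.datasets import load_wine
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

param_dist = {
    'criterion': ['gini', 'entropy'],
    'max_depth': randint(1, 20),
    'min_samples_split': randint(2, 20),
    'min_samples_leaf': randint(1, 20),
    'max_features': ['sqrt', 'log2', None],
}

# random_state on the tree is an assumption added for repeatable results.
random_search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=10,
    cv=5,
    random_state=42,
)
random_search.fit(X_train, y_train)

# best_score_ is the mean cross-validated accuracy of the best candidate,
# computed on the training data only; the test score is a separate check.
print("Best hyperparameters:", random_search.best_params_)
print("Mean CV accuracy of best candidate:", random_search.best_score_)
print("Accuracy on test set:", random_search.best_estimator_.score(X_test, y_test))
```

A large gap between the cross-validated accuracy and the test accuracy would suggest that the result depends heavily on the particular split.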


## Grow a random forest using the following steps:

  1. Continuing from the previous question, create 10 subsets of the training dataset. You can use the ShuffleSplit class for this.
  2. Train 1 decision tree on each subset, using the best hyperparameter values found in the previous question.
  3. Evaluate all the trees on the test dataset. Are they performing better than the tree created in the previous question?

In [9]:
from sklearn.model_selection import ShuffleSplit
from sklearn.metrics import accuracy_score

In [10]:
# Step 1: Create 10 subsets of the training dataset using ShuffleSplit
rs = ShuffleSplit(n_splits=10, test_size=0.2, random_state=42)
subsets = []

for train_index, _ in rs.split(X_train):
    X_subset, y_subset = X_train[train_index], y_train[train_index]
    subsets.append((X_subset, y_subset))
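A quick sanity check on the subsets can be useful before training. The sketch below is self-contained and not part of the original notebook; the name `subset_indices` is introduced here for illustration. It shows that `ShuffleSplit` samples without replacement, so each subset is a strict, duplicate-free part of the training set (unlike the bootstrap samples that `RandomForestClassifier` uses by default).

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import ShuffleSplit, train_test_split

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

rs = ShuffleSplit(n_splits=10, test_size=0.2, random_state=42)
# Keep only the "train" side of each split; those rows form one subset.
subset_indices = [train_index for train_index, _ in rs.split(X_train)]

for i, idx in enumerate(subset_indices):
    # Equal counts mean no training row appears twice within a subset.
    print(f"subset {i}: {len(idx)} rows, {len(np.unique(idx))} unique")
```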

In [12]:
# Step 2: Train 1 decision tree on each subset using the best hyperparameters
trees = []

for X_subset, y_subset in subsets:
    dt = DecisionTreeClassifier(**random_search.best_params_)
    dt.fit(X_subset, y_subset)
    trees.append(dt)

In [16]:
# Step 3: Evaluate all the trees on the test dataset
tree_accuracies = []

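for tree in trees:
    y_pred = tree.predict(X_test)
    # Use a separate name so the single-tree `accuracy` from the previous question is not overwritten
    tree_accuracy = accuracy_score(y_test, y_pred)
    print("accuracy is ", tree_accuracy)
    tree_accuracies.append(tree_accuracy)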

accuracy is  0.9722222222222222
accuracy is  0.9166666666666666
accuracy is  0.8611111111111112
accuracy is  0.9166666666666666
accuracy is  0.9166666666666666
accuracy is  0.8888888888888888
accuracy is  0.9444444444444444
accuracy is  0.9722222222222222
accuracy is  0.9444444444444444
accuracy is  0.8611111111111112


In [15]:
# Compare performance with the decision tree trained in the previous question
average_accuracy = sum(tree_accuracies) / len(tree_accuracies)
# Score the tuned tree directly so the result does not depend on a variable reused elsewhere
single_tree_accuracy = random_search.best_estimator_.score(X_test, y_test)
print("Average accuracy of the decision trees on the test set:", average_accuracy)
print("Accuracy of the single decision tree trained in the previous question:", single_tree_accuracy)

Average accuracy of the decision trees on the test set: 0.9194444444444443
Accuracy of the single decision tree trained in the previous question: 0.9166666666666666
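
The comparison above averages the individual trees, but a forest normally combines them. One possible next step, not taken in the original notebook, is a majority vote over the 10 trees' predictions. The sketch below is self-contained; `assumed_params` is a hypothetical stand-in for `random_search.best_params_`, and the fixed `random_state` values are assumptions added for repeatability.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score
from sklearn.model_selection import ShuffleSplit, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Hypothetical hyperparameters standing in for random_search.best_params_.
assumed_params = {'criterion': 'gini', 'max_depth': 5}

# Train one tree per ShuffleSplit subset of the training data.
rs = ShuffleSplit(n_splits=10, test_size=0.2, random_state=42)
trees = []
for train_index, _ in rs.split(X_train):
    tree = DecisionTreeClassifier(random_state=0, **assumed_params)
    tree.fit(X_train[train_index], y_train[train_index])
    trees.append(tree)

# Stack predictions: one row per tree, one column per test sample.
all_preds = np.array([tree.predict(X_test) for tree in trees])

# For each test sample, pick the class predicted by the most trees
# (ties go to the lowest class label, because argmax returns the first maximum).
n_classes = len(np.unique(y))
majority_vote = np.array(
    [np.bincount(column, minlength=n_classes).argmax() for column in all_preds.T]
)

ensemble_accuracy = accuracy_score(y_test, majority_vote)
print("Accuracy of the majority-vote ensemble:", ensemble_accuracy)
```

Comparing `ensemble_accuracy` with the per-tree accuracies shows whether voting helps on this split; with only 36 test samples, small differences should be read with caution.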
