## 1. What is the estimated depth of a Decision Tree trained (unrestricted) on a one million instance training set?

**Ans:**

The estimated depth of an unrestricted Decision Tree on a one million instance training set can vary widely and depends on several factors, including the complexity of the data and the specific algorithm or software used to train the tree. Theoretically, an unrestricted Decision Tree could grow to have a depth equal to the number of instances in the training set.

## 2. Is the Gini impurity of a node usually lower or higher than that of its parent? Is it always lower/greater, or is it usually lower/greater?

**Ans:**

The Gini impurity of a node in a decision tree is usually lower than or equal to the Gini impurity of its parent node after a split. However, it is not an absolute rule and can depend on the specific dataset and split.

## 3. Explain if its a good idea to reduce max depth if a Decision Tree is overfitting the training set?

**Ans:**

Reducing the maximum depth of a decision tree can be a good idea if the tree is overfitting the training set. Overfitting occurs when the tree is too complex and captures noise in the data, making it perform poorly on unseen data (testing data).

Benefits of reducing the max depth of a decosion tree:
- Simplifies the Model
- Avoids Overfitting
- Improves Interpretability
- Enhances Training Speed

## 4. Explain if its a good idea to try scaling the input features if a Decision Tree underfits the training set?

**Ans**

Scaling the input features is generally not necessary or effective for addressing underfitting in a decision tree. Decision trees are a type of machine learning model that are not influenced by the scale of input features in the same way as some other algorithms, such as Support Vector Machines (SVMs) or k-Nearest Neighbors (k-NN)

## 5. How much time will it take to train another Decision Tree on a training set of 10 million instances if it takes an hour to train a Decision Tree on a training set with 1 million instances?

**Ans:**

The time required to train a Decision Tree on a training set depends on various factors, and it's not always a linear relationship with the number of instances. However, if we assume that the training time scales linearly with the number of instances, we can make an estimate.

If it takes 1 hour to train a Decision Tree on a training set with 1 million instances, and we want to train it on a training set with 10 million instances, the training time can be estimated as follows:

Time to train on 10 million instances = (Time to train on 1 million instances) * (Number of instances to train on / 1 million)

Time to train on 10 million instances = 1 hour * (10 million / 1 million)

Time to train on 10 million instances = 1 hour * 10

Time to train on 10 million instances = 10 hours

So, if we assume a linear relationship, it would take approximately 10 hours to train a Decision Tree on a training set with 10 million instances, assuming similar hardware and settings. Keep in mind that this is a simplified estimate, and in practice, training times can vary due to factors like the complexity of the tree, hardware resources, and software optimizations.

## 6. Will setting `presort = True` speed up training if your training set has `100,000` instances?

**Ans:**

Setting `presort = True` in a Decision Tree can help speed up training for small to moderately sized datasets. Presorting involves sorting the data for each feature to find the best split points, which can be computationally expensive. However, it can speed up training for smaller datasets because it allows the algorithm to quickly identify good split points.


In your case, with a training set of 100,000 instances, using `presort = True` could potentially speed up training, especially if the dataset is relatively small and can fit into memory. It precomputes feature values, which can make finding optimal splits faster.


However, for very large datasets, the cost of presorting may outweigh the benefits, as sorting all features for each node becomes computationally expensive. In such cases, it's often better to set `presort = False` (which is the default) and let the algorithm choose the optimal strategy.


The impact of setting `presort = True` on training time can vary depending on factors like the dataset size, hardware, and available memory, so it's a good practice to experiment with both settings to see which one performs better for your specific case.

## 7. Follow these steps to train and fine-tune a Decision Tree for the moons dataset:
### a. To build a moons dataset, use make moons(n samples=10000, noise=0.4).
### b. Divide the dataset into a training and a test collection with train test split().
### c. To find good hyperparameters values for a DecisionTreeClassifier, use grid search with cross-validation (with the GridSearchCV class). Try different values for max leaf nodes.
### d. Use these hyperparameters to train the model on the entire training set, and then assess its output on the test set. You can achieve an accuracy of 85 to 87 percent.

**Ans:**

In [8]:
# Step 1: Import necessary libraries
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

In [9]:
# Step 2: Create the moons dataset
X, y = make_moons(n_samples=10000, noise=0.4, random_state=42)

In [10]:
# Step 3: Split the dataset into a training and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [11]:
# Step 4: Define hyperparameters to search
param_grid = {
    'max_leaf_nodes': [None, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
}


In [12]:
# Step 5: Create a DecisionTreeClassifier
tree_clf = DecisionTreeClassifier(random_state=42)

In [13]:
# Step 6: Use GridSearchCV for hyperparameter tuning
grid_search = GridSearchCV(tree_clf, param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train, y_train)

In [14]:
# Step 7: Get the best hyperparameters
best_params = grid_search.best_params_

In [15]:
# Step 8: Train the model on the entire training set with the best hyperparameters
best_tree_clf = DecisionTreeClassifier(**best_params, random_state=42)
best_tree_clf.fit(X_train, y_train)

In [16]:
# Step 9: Assess the model's performance on the test set
y_pred = best_tree_clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy on the test set: {accuracy:.2%}")

Accuracy on the test set: 87.00%


## 8. Follow these steps to grow a forest:

### a. Using the same method as before, create 1,000 subsets of the training set, each containing 100 instances chosen at random. You can do this with Scikit-ShuffleSplit Learn&#39;s class.
### b. Using the best hyperparameter values found in the previous exercise, train one Decision Tree on each subset. On the test collection, evaluate these 1,000 Decision Trees. These Decision Trees would likely perform worse than the first Decision Tree, achieving only around 80% accuracy, since they were trained on smaller sets.
### c. Now the magic begins. Create 1,000 Decision Tree predictions for each test set case, and keep only the most common prediction (you can do this with SciPy&#39;s mode() function). Over the test collection, this method gives you majority-vote predictions.
### d. On the test range, evaluate these predictions: you should achieve a slightly higher accuracy than the first model (approx 0.5 to 1.5 percent higher). You&#39;ve successfully learned a Random Forest classifier!

**Ans:**

In [17]:
from sklearn.model_selection import ShuffleSplit
from sklearn.base import clone
import numpy as np
from scipy.stats import mode
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

In [18]:
# Step a: Create 1,000 subsets of the training set
n_trees = 1000
n_instances = 100

subset_estimators = []

rs = ShuffleSplit(n_splits=n_trees, test_size=n_instances, random_state=42)

for train_index, _ in rs.split(X_train):
    subset_tree = RandomForestClassifier(n_estimators=10, max_leaf_nodes=16)  
    subset_tree.fit(X_train[train_index], y_train[train_index])
    subset_estimators.append(subset_tree)

In [20]:
# Step b: Train Decision Trees on each subset and evaluate
accuracy_scores = []
for estimator in subset_estimators:
    y_pred = estimator.predict(X_test)
    accuracy = np.mean(y_pred == y_test)
    accuracy_scores.append(accuracy)

In [21]:
# Step c: Make predictions and take the majority vote
forest_predictions = np.empty([n_trees, len(X_test)], dtype=np.uint8)
for tree_index, estimator in enumerate(subset_estimators):
    forest_predictions[tree_index] = estimator.predict(X_test)

y_pred_majority, _ = mode(forest_predictions, axis=0)

# Step d: Evaluate the Random Forest
forest_accuracy = np.mean(y_pred_majority.ravel() == y_test)
print("Random Forest Accuracy: {:.2f}%".format(forest_accuracy * 100))

Random Forest Accuracy: 87.15%
