## Making Predictions

You start at the root node (depth 0, at the top): this node asks whether the flower’s petal length is smaller than 2.45 cm. If it is, then you move down to the root’s left child node (depth 1, left). In this case, it is a leaf node (i.e., it does not have any child nodes), so it does not ask any questions: you can simply look at the predicted class for that node, and the Decision Tree predicts that your flower is an Iris-Setosa (class=setosa).
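The traversal described above can be reproduced with a few lines of Scikit-Learn. This is a minimal sketch, assuming the tree is trained on the iris dataset using only petal length and petal width, with `max_depth=2` and `random_state=42` chosen here for illustration. Depending on the random seed, the root may test petal width instead of petal length, since both separate the setosa flowers equally well.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X = iris.data[:, 2:]  # petal length and petal width, in cm
y = iris.target

# max_depth and random_state are illustrative choices, not requirements.
tree_clf = DecisionTreeClassifier(max_depth=2, random_state=42)
tree_clf.fit(X, y)

# A flower with a 1.5 cm petal length falls into the root's left child (a leaf).
predicted = tree_clf.predict([[1.5, 0.2]])
predicted_name = iris.target_names[predicted[0]]
print(predicted_name)
```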

One of the many qualities of Decision Trees is that they require
very little data preparation. In particular, they don’t require feature
scaling or centering at all.
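One way to check this claim is to train the same tree on raw and on standardized features and compare the predictions. The sketch below assumes the iris dataset, `StandardScaler`, and an arbitrary `max_depth=3`. Because standardization preserves the ordering of values within each feature, both trees should partition the training instances identically.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)  # centered and scaled copy

# Same hyperparameters and seed, so only the feature scale differs.
tree_raw = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X, y)
tree_scaled = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_scaled, y)

pred_raw = tree_raw.predict(X)
pred_scaled = tree_scaled.predict(X_scaled)
same_predictions = bool(np.array_equal(pred_raw, pred_scaled))
print("Identical predictions:", same_predictions)
```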

A node’s samples attribute counts how many training instances it applies to.

A node’s value attribute tells you how many training instances of each class this node
applies to.

A node’s gini attribute measures its impurity: a node is “pure” (gini=0) if all training instances it applies to belong to the same class.
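These three attributes can be read from a fitted tree through its `tree_` object. The sketch below assumes a depth-2 tree on the iris petal features and recomputes the Gini impurity of each node as one minus the sum of squared class proportions. Recent Scikit-Learn versions store class fractions rather than raw counts in `tree_.value`, so the code normalizes the rows to work in either case.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris.data[:, 2:], iris.target
tree_clf = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X, y)

t = tree_clf.tree_
samples = t.n_node_samples   # training instances reaching each node
value = t.value[:, 0, :]     # per-class counts (or fractions in newer versions)

# Normalize each row into class proportions, then apply the Gini formula.
proportions = value / value.sum(axis=1, keepdims=True)
gini = 1.0 - (proportions ** 2).sum(axis=1)

for node_id in range(t.node_count):
    print(node_id, samples[node_id], proportions[node_id].round(3), round(gini[node_id], 3))
```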

## Estimating Class Probabilities

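A Decision Tree can also estimate the probability that an instance belongs to a particular class k: it traverses the tree to find the leaf node for this instance, then returns the ratio of training instances of class k in that node.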
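As a hedged illustration, the sketch below compares `predict_proba` with a manual computation of class ratios in the leaf that a query instance falls into. The query flower (5 cm petal length, 1.5 cm petal width) and the tree settings are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris.data[:, 2:], iris.target
tree_clf = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X, y)

query = np.array([[5.0, 1.5]])  # hypothetical flower
proba = tree_clf.predict_proba(query)[0]

# Manual version: find the query's leaf, then count training classes in it.
leaf_id = tree_clf.apply(query)[0]
in_leaf = tree_clf.apply(X) == leaf_id
manual = np.bincount(y[in_leaf], minlength=3) / in_leaf.sum()

print(proba, manual)
```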

## The CART Training Algorithm

Scikit-Learn uses the Classification And Regression Tree (CART) algorithm to train Decision Trees (also called “growing” trees). 

The idea is really quite simple: the algorithm first splits the training set into two subsets using a single feature k and a threshold t_k.

It searches for the pair (k, t_k) that produces the purest subsets (weighted by their size).
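The search for the best feature and threshold can be sketched as an exhaustive loop. This is a simplified illustration, not Scikit-Learn's implementation: it assumes Gini impurity, uses midpoints between consecutive distinct feature values as candidate thresholds, and scores each split by the size-weighted average impurity of the two subsets. The helper names `gini` and `best_split` are hypothetical.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a 1D array of class labels."""
    if labels.size == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Return (feature index, threshold, weighted impurity) of the best split."""
    m, n_features = X.shape
    best = (None, None, np.inf)
    for k in range(n_features):
        values = np.unique(X[:, k])
        # Candidate thresholds: midpoints between consecutive distinct values.
        thresholds = (values[:-1] + values[1:]) / 2.0
        for t in thresholds:
            left = X[:, k] <= t
            m_left = left.sum()
            cost = (m_left * gini(y[left]) + (m - m_left) * gini(y[~left])) / m
            if cost < best[2]:
                best = (k, t, cost)
    return best

# Toy data: feature 0 separates the classes perfectly, feature 1 does not.
X_toy = np.array([[1.0, 5.0], [2.0, 1.0], [3.0, 4.0], [4.0, 2.0]])
y_toy = np.array([0, 0, 1, 1])
k_best, t_best, cost_best = best_split(X_toy, y_toy)
print(k_best, t_best, cost_best)
```

Applying `best_split` recursively to each resulting subset would give the tree-growing procedure.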

Once it has successfully split the training set in two, it splits the subsets using the
same logic, then the sub-subsets and so on, recursively.

It stops recursing once it reaches the maximum depth (defined by the max_depth hyperparameter), or if it cannot find a split that will reduce impurity.

A few other hyperparameters (described in a moment) control additional stopping conditions (min_samples_split, min_samples_leaf, min_weight_fraction_leaf, and max_leaf_nodes).

The CART algorithm is a greedy algorithm: it greedily searches for an optimum split at the top level, then repeats the process at each level. It does not check whether the split will
lead to the lowest possible impurity several levels down. A greedy
algorithm often produces a reasonably good solution, but it is not
guaranteed to be the optimal solution.

## Gini Impurity or Entropy?

By default, the Gini impurity measure is used, but you can select the entropy impurity
measure instead by setting the criterion hyperparameter to "entropy".

Entropy is frequently used as an impurity measure: a set’s
entropy is zero when it contains instances of only one class.
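To make the two measures concrete, the sketch below computes both from a list of class counts and shows how to request entropy in Scikit-Learn. The counts `[0, 49, 5]` and `[50, 0, 0]` are hypothetical node contents, and the helper functions are written for this example only. Entropy is computed here as the sum of p · log2(1/p) over the classes with nonzero proportion.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def gini_impurity(counts):
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    return 1.0 - np.sum(p ** 2)

def entropy_impurity(counts):
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]  # classes with zero instances contribute nothing
    return np.sum(p * np.log2(1.0 / p))

mixed_node = [0, 49, 5]  # hypothetical class counts in a node
pure_node = [50, 0, 0]   # all instances belong to one class

print(gini_impurity(mixed_node), entropy_impurity(mixed_node))
print(gini_impurity(pure_node), entropy_impurity(pure_node))

# Selecting entropy instead of the default Gini impurity:
entropy_tree = DecisionTreeClassifier(criterion="entropy", max_depth=2, random_state=42)
```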

So should you use Gini impurity or entropy? The truth is, most of the time it does not
make a big difference: they lead to similar trees. Gini impurity is slightly faster to
compute, so it is a good default. However, when they differ, Gini impurity tends to
isolate the most frequent class in its own branch of the tree, while entropy tends to
produce slightly more balanced trees.

## Regularization Hyperparameters

Decision Trees make very few assumptions about the training data.

If left unconstrained, the tree structure will adapt itself to the training data, fitting it very
closely, and most likely overfitting it.

A Decision Tree is often called a nonparametric model, not because it does not have any parameters (it often has a lot) but because the
number of parameters is not determined prior to training, so the model structure is
free to stick closely to the data.

In contrast, a parametric model such as a linear model has a predetermined number of parameters, so its degrees of freedom are limited, reducing the risk of overfitting (but increasing the risk of underfitting).

To avoid overfitting the training data, you need to restrict the Decision Tree’s freedom
during training. This is called regularization.

The regularization hyperparameters depend on the algorithm used, but generally you can at least restrict
the maximum depth of the Decision Tree. In Scikit-Learn, this is controlled by the
max_depth hyperparameter.

Reducing max_depth will regularize the model and thus reduce the risk of overfitting.

The DecisionTreeClassifier class has a few other hyperparameters that similarly restrict
the shape of the Decision Tree: min_samples_split (the minimum number of samples a node must have before it can be split), min_samples_leaf (the minimum number of samples a leaf node must have), min_weight_fraction_leaf (same as min_samples_leaf but expressed as a fraction of the total number of weighted instances), max_leaf_nodes (maximum number of leaf nodes), and max_features
(maximum number of features that are evaluated for splitting at each node).
Increasing min_* hyperparameters or reducing max_* hyperparameters will regularize the model.
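The effect of one of these hyperparameters can be seen by comparing an unrestricted tree with a constrained one. The sketch below is an illustration under assumed settings: a noisy `make_moons` dataset, a 50/50 train/test split, and `min_samples_leaf=4`. The printed accuracies depend on these choices and are not asserted here.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=200, noise=0.25, random_state=53)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=42)

unrestricted = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
regularized = DecisionTreeClassifier(min_samples_leaf=4, random_state=42).fit(X_train, y_train)

for name, model in [("unrestricted", unrestricted), ("regularized", regularized)]:
    print(name,
          "leaves:", model.get_n_leaves(),
          "train acc:", model.score(X_train, y_train),
          "test acc:", model.score(X_test, y_test))

# Check that every leaf of the regularized tree holds at least 4 training samples.
leaf_sizes = np.bincount(regularized.apply(X_train))
smallest_leaf = leaf_sizes[leaf_sizes > 0].min()
print("smallest leaf:", smallest_leaf)
```

A large gap between training and test accuracy for the unrestricted tree would be a sign of overfitting.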

Other algorithms work by first training the Decision Tree without
restrictions, then pruning (deleting) unnecessary nodes.

A node whose children are all leaf nodes is considered unnecessary if the
purity improvement it provides is not statistically significant.

Standard statistical tests, such as the χ² test, are used to estimate the
probability that the improvement is purely the result of chance (this is called the null hypothesis).

If this probability, called the p-value, is higher than a given threshold (typically 5%, controlled by
a hyperparameter), then the node is considered unnecessary and its
children are deleted. The pruning continues until all unnecessary nodes have been pruned.
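The significance check for a single node can be sketched with SciPy's `chi2_contingency`. This illustrates the statistical test only; it is not a Scikit-Learn feature. The helper `split_is_unnecessary` and the class counts are hypothetical. Each row of the table holds the class counts of one child leaf, and the sketch assumes every class appears in at least one child.

```python
import numpy as np
from scipy.stats import chi2_contingency

def split_is_unnecessary(children_class_counts, p_threshold=0.05):
    """Return (unnecessary?, p-value) for a node whose children are all leaves.

    children_class_counts: one row per child leaf, one column per class.
    """
    table = np.asarray(children_class_counts)
    _, p_value, _, _ = chi2_contingency(table)
    # A high p-value means the split could plausibly be due to chance.
    return bool(p_value > p_threshold), p_value

informative = [[30, 2], [3, 25]]   # children have clearly different class mixes
uninformative = [[6, 5], [5, 6]]   # children look almost identical

prune_informative, p_informative = split_is_unnecessary(informative)
prune_uninformative, p_uninformative = split_is_unnecessary(uninformative)
print(prune_informative, p_informative)
print(prune_uninformative, p_uninformative)
```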

## Instability

#### Limitations

First, Decision Trees love orthogonal decision boundaries (all splits are perpendicular to an axis), which makes them sensitive to training set rotation.
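The sensitivity to rotation can be illustrated with synthetic data. In this sketch, which assumes 200 uniformly sampled points labeled by the sign of the first feature, a single axis-aligned split separates the classes. After rotating the same points by 45°, the boundary becomes diagonal and the tree needs many more splits to fit it.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
X = rng.uniform(-1.0, 1.0, size=(200, 2))
y = (X[:, 0] > 0).astype(int)  # classes separated by a vertical boundary

# Rotate every point by 45 degrees around the origin.
angle = np.pi / 4
rotation = np.array([[np.cos(angle), -np.sin(angle)],
                     [np.sin(angle),  np.cos(angle)]])
X_rotated = X @ rotation.T

tree_original = DecisionTreeClassifier(random_state=42).fit(X, y)
tree_rotated = DecisionTreeClassifier(random_state=42).fit(X_rotated, y)

leaves_original = tree_original.get_n_leaves()
leaves_rotated = tree_rotated.get_n_leaves()
print("leaves (original):", leaves_original)
print("leaves (rotated):", leaves_rotated)
```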

More generally, the main issue with Decision Trees is that they are very sensitive to
small variations in the training data.

Moreover, since the training algorithm used by Scikit-Learn is stochastic, you may
get very different models even on the same training data (unless you set the
random_state hyperparameter).
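The role of `random_state` can be checked directly. The sketch below, which assumes depth-2 trees on the iris petal features, verifies that two trees trained with the same seed have identical structure, then collects the feature chosen at the root across ten seeds. Whether more than one root feature shows up depends on how ties between equally good splits are broken, so that result is printed rather than asserted.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris.data[:, 2:], iris.target

def fit_tree(seed):
    return DecisionTreeClassifier(max_depth=2, random_state=seed).fit(X, y)

# Same seed, same training data: the two trees should match exactly.
first, second = fit_tree(42), fit_tree(42)
reproducible = bool(
    np.array_equal(first.tree_.feature, second.tree_.feature)
    and np.array_equal(first.tree_.threshold, second.tree_.threshold)
)
print("Reproducible with fixed seed:", reproducible)

# Different seeds may pick a different feature at the root (index 0 or 1).
root_features = {int(fit_tree(seed).tree_.feature[0]) for seed in range(10)}
print("Root features seen across seeds:", root_features)
```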