In [2]:
# Decision trees are powerful classification and regression models that also tell us a great deal about our dataset.
# They are trained on labeled data, where the labels are classes (classification) or values (regression).

# They follow an intuitive process to make predictions, one that resembles human reasoning.

"""
Parts of a decision tree:

ROOT NODE: the topmost node of the tree; it contains the first yes/no question

DECISION NODE: each yes/no question in the model is represented by a decision node, with 2 branches emanating from it (one for yes, one for no)

LEAF NODE: a node with no branches; it holds the final decision for its path

BRANCH: one of the 2 edges emanating from a decision node

DEPTH: the number of levels in the decision tree
"""


# It is important to pick a good question for each node (using the Gini index or entropy); the tree is built iteratively, one question at a time.
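
# The vocabulary in the docstring above (root, decision node, leaf, branch, depth) can be sketched
# as a small data structure. This is only an illustration: the `Node` class, the `depth` helper and
# the weather questions are made up for this sketch. Depth is counted here as the number of branches
# on the longest root-to-leaf path, which is one common convention.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Hypothetical node type: a decision node holds a question, a leaf holds a prediction."""
    question: Optional[str] = None    # yes/no question (decision nodes only)
    yes: Optional["Node"] = None      # branch followed when the answer is yes
    no: Optional["Node"] = None       # branch followed when the answer is no
    prediction: Optional[str] = None  # final decision (leaf nodes only)

    def is_leaf(self) -> bool:
        # A leaf has no branches emanating from it.
        return self.yes is None and self.no is None


def depth(node: Node) -> int:
    """Number of branches on the longest path from this node down to a leaf."""
    if node.is_leaf():
        return 0
    # Assumes every decision node has both a yes branch and a no branch.
    return 1 + max(depth(node.yes), depth(node.no))


# The root node is simply the topmost decision node.
root = Node(
    question="Is it raining?",
    yes=Node(prediction="take an umbrella"),
    no=Node(
        question="Is it cloudy?",
        yes=Node(prediction="take a jacket"),
        no=Node(prediction="take nothing"),
    ),
)

print(depth(root))
```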


In [None]:
"""
Building a decision tree involves even more probability.

The important mathematical concepts here are the Gini index and entropy, both of which lean heavily on probability.
It is worth studying probability, the Gini index, and entropy further; the definition given in the book was too math heavy and not practical.

We recursively split our dataset using the concepts above, where each split is performed by picking the best feature to split on. The best feature is found using any of the following metrics:
accuracy, Gini index, entropy
We finish when the portion of the dataset corresponding to each leaf node is pure, in other words, when all the samples in it have the same label.

Many issues arise, such as splitting too much, where every leaf contains very few samples, which leads to extreme overfitting. The way to prevent this is to introduce a stopping condition:
- don't split a node if the change in accuracy, Gini index, or entropy is below some threshold
- don't split a node if it has fewer than a certain number of samples
- split a node only if both of the resulting leaves contain at least a certain number of samples
- stop building the tree after you reach a certain depth

All of the above conditions require a hyperparameter.
In order, these are the corresponding hyperparameters:
- the minimum change in accuracy, Gini index, or entropy required to split
- the minimum number of samples that a node must have to split
- the minimum number of samples allowed in a leaf node
- the maximum depth of the tree
"""

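# The four stopping conditions in the notes above line up with constructor arguments of
# scikit-learn's DecisionTreeClassifier. A minimal sketch, assuming scikit-learn is installed;
# the toy data and the specific values (0.01, 4, 2, 3) are arbitrary choices for illustration,
# not recommendations.

```python
from sklearn.tree import DecisionTreeClassifier

# Toy data (illustrative): two numeric features and a binary label.
X = [[7, 1], [3, 2], [2, 3], [1, 5], [2, 6], [4, 7],
     [1, 9], [8, 10], [6, 5], [7, 8], [8, 4], [9, 6]]
y = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

limited_tree = DecisionTreeClassifier(
    criterion="gini",            # metric used to choose each split
    min_impurity_decrease=0.01,  # skip splits whose weighted impurity decrease is below this
    min_samples_split=4,         # a node needs at least 4 samples to be split
    min_samples_leaf=2,          # every resulting leaf must keep at least 2 samples
    max_depth=3,                 # stop growing once the tree reaches depth 3
    random_state=0,              # fixed seed so repeated runs build the same tree
)
limited_tree.fit(X, y)

print(limited_tree.get_depth(), limited_tree.get_n_leaves())
```

# Tightening any of these values gives a smaller tree that fits the training data less closely.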
In [4]:

"""
PSEUDOCODE:

INPUT:
- a training dataset of samples with their associated labels
- a metric to split the data (accuracy, Gini index, entropy)
- one or more stopping conditions

OUTPUT:
- a decision tree that fits the dataset

PROCEDURE:
- add a root node and associate it with the entire dataset
- repeat until the stopping conditions are met at every leaf node:
    - pick one of the leaf nodes at the highest level
    - go through all the features and select the one that splits the samples corresponding to that node in the optimal way, according to the selected metric. Associate that feature with the node
    - this feature splits the samples into 2 branches. Create 2 leaf nodes, one for each branch, and associate the corresponding samples with each of them
    - if the stopping conditions allow a split, turn the node into a decision node and add the 2 new leaf nodes underneath it. If the level of the node is i, the 2 new leaf nodes are at level i+1
    - if the stopping conditions don't allow a split, the node remains a leaf node. To this leaf node, associate the most common label among its samples. That label is the prediction at the leaf

RETURN:
- the decision tree obtained
"""

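# The pseudocode in this cell can be sketched in plain Python. This is a simplified illustration,
# not the book's or scikit-learn's implementation: it is recursive rather than iterative, it uses
# the Gini index as the metric, it splits numeric features with "value <= threshold" questions,
# and the function names, dict layout and toy data are all made up for the sketch.

```python
from collections import Counter


def gini(labels):
    """Gini impurity index: 1 minus the sum of squared label proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(rows, labels):
    """Try every (feature, threshold) pair and keep the lowest weighted Gini."""
    best = None  # (weighted_gini, feature_index, threshold)
    for f in range(len(rows[0])):
        for threshold in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            if not left or not right:
                continue  # not a real split
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, threshold)
    return best


def build_tree(rows, labels, depth=0, max_depth=3, min_samples_split=2):
    """Return a nested dict: decision nodes hold a question, leaves hold a prediction."""
    majority = Counter(labels).most_common(1)[0][0]
    # Stopping conditions: pure node, depth limit reached, or too few samples.
    if gini(labels) == 0 or depth >= max_depth or len(labels) < min_samples_split:
        return {"leaf": True, "prediction": majority}
    split = best_split(rows, labels)
    if split is None:  # no feature separates the samples
        return {"leaf": True, "prediction": majority}
    _, f, threshold = split
    left = [i for i, row in enumerate(rows) if row[f] <= threshold]
    right = [i for i, row in enumerate(rows) if row[f] > threshold]
    return {
        "leaf": False,
        "feature": f,
        "threshold": threshold,
        "left": build_tree([rows[i] for i in left], [labels[i] for i in left],
                           depth + 1, max_depth, min_samples_split),
        "right": build_tree([rows[i] for i in right], [labels[i] for i in right],
                            depth + 1, max_depth, min_samples_split),
    }


def predict(tree, row):
    """Follow the questions from the root down to a leaf."""
    while not tree["leaf"]:
        tree = tree["left"] if row[tree["feature"]] <= tree["threshold"] else tree["right"]
    return tree["prediction"]


# Toy data (illustrative): two numeric features and a binary label.
rows = [(1, 1), (2, 1), (3, 2), (6, 5), (7, 8), (8, 9)]
labels = [0, 0, 0, 1, 1, 1]
tree = build_tree(rows, labels)
print([predict(tree, row) for row in rows])
```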
In [5]:
"""
Beyond yes/no questions

A feature with more than two possible values is called a nonbinary feature. Instead of a yes/no question
such as "is the animal a dog?", the question becomes "which animal?", with answers such as 'dog', 'cat', or 'bird'.
"""

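# The notes above don't say how a tree handles a nonbinary feature. One common approach, assumed
# here, is to turn it into several yes/no features (one-hot encoding), so that "which animal?"
# becomes "is it a bird?", "is it a cat?" and "is it a dog?". The `is_...` names are made up
# for this sketch.

```python
animals = ["dog", "cat", "bird", "dog"]

# One yes/no column per distinct value, in alphabetical order.
categories = sorted(set(animals))

one_hot = [
    {f"is_{category}": int(animal == category) for category in categories}
    for animal in animals
]

for animal, encoded in zip(animals, one_hot):
    print(animal, encoded)
```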
In [6]:
import pandas as pd
dataset = pd.DataFrame({
'x_0':[7,3,2,1,2,4,1,8,6,7,8,9],
'x_1':[1,2,3,5,6,7,9,10,5,8,4,6],
'y': [0,0,0,0,0,0,1,1,1,1,1,1]})

In [7]:
features = dataset[['x_0', 'x_1']]
labels = dataset['y']

from sklearn.tree import DecisionTreeClassifier

decision_tree = DecisionTreeClassifier()
decision_tree.fit(features, labels)
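
# Once the tree is fitted, it can be inspected and used for predictions. A minimal sketch that
# repeats the setup above so it runs on its own; `random_state=0` and the two `new_points` are
# additions made for this example.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

dataset = pd.DataFrame({
    'x_0': [7, 3, 2, 1, 2, 4, 1, 8, 6, 7, 8, 9],
    'x_1': [1, 2, 3, 5, 6, 7, 9, 10, 5, 8, 4, 6],
    'y': [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]})
features = dataset[['x_0', 'x_1']]
labels = dataset['y']

decision_tree = DecisionTreeClassifier(random_state=0)  # fixed seed for repeatable ties
decision_tree.fit(features, labels)

# Text rendering of the learned questions (thresholds on x_0 and x_1).
print(export_text(decision_tree, feature_names=['x_0', 'x_1']))

# Accuracy on the training data; with no stopping conditions the tree keeps
# splitting until every leaf is pure.
print(decision_tree.score(features, labels))

# Predictions for two hypothetical new points.
new_points = pd.DataFrame({'x_0': [2, 8], 'x_1': [2, 8]})
print(decision_tree.predict(new_points))
```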

In [8]:
"""
SUMMARY:

-Decision trees are important machine learning models, used for classification and
regression.

-The way decision trees work is by asking binary questions about our data and making a
prediction based on the answers to those questions.

-The algorithm for building decision trees for classification consists of finding the feature
in our data that best determines the label and iterating over this step.

-We have several ways to tell if a feature determines the label best. The three that we
learned in this chapter are accuracy, Gini impurity index, and entropy.

-The Gini impurity index measures how impure (mixed) a set is. A set in which every
element has the same label has a Gini impurity index of 0. A set in which every element
has a different label has a Gini impurity index close to 1.

-Entropy is another measure of the impurity of a set. A set in which every element has the
same label has an entropy of 0. A set in which half of the elements have one label and the
other half have another label has an entropy of 1 (using base-2 logarithms). When building
a decision tree, the difference in entropy before and after a split is called information gain.

-The algorithm for building a decision tree for regression is similar to the one used for
classification. The only difference is that we use the mean square error to select the best
feature to split the data.

-In two dimensions, regression tree plots look like the union of several horizontal line
segments, where each segment is the prediction for the elements in a particular leaf.

-Applications of decision trees range very widely, from recommendation algorithms to
applications in medicine and biology.
"""

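# The Gini and entropy values quoted in the summary can be checked with a few lines of Python.
# A minimal sketch using the standard definitions (Gini = 1 - sum of p_i^2, entropy =
# -sum of p_i * log2(p_i)); the function names and example sets are made up, and information
# gain is computed with the children's entropies weighted by their sizes, which is assumed to be
# the intended meaning of "entropy after a split".

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity index of a list of labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Entropy (base 2) of a list of labels."""
    n = len(labels)
    return sum(-(count / n) * log2(count / n) for count in Counter(labels).values())


def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    after = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - after


pure = ["a"] * 6                     # every element has the same label
all_different = list("abcdefghij")   # every element has a different label
half_half = ["a"] * 3 + ["b"] * 3    # two labels, evenly split

print(gini(pure), gini(all_different))
print(entropy(pure), entropy(half_half))
print(information_gain(half_half, [["a"] * 3, ["b"] * 3]))
```

# With n elements that all have different labels, the Gini index is 1 - 1/n, which is why it
# approaches 1 without reaching it.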