## Decision Tree

### Building Your Model
The steps to building and using a model are:

1. Define: What type of model will it be? A decision tree? Some other type of model? Other parameters of the model type are specified here too.
2. Fit: Capture patterns from the provided data. This is the heart of modeling.
3. Predict: Just what it sounds like.
4. Evaluate: Determine how accurate the model's predictions are.
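
The four steps above can be sketched with scikit-learn as follows. This is a minimal illustration, not the original notebook's code: the synthetic dataset and the names `X`, `y`, and `rng` are assumptions made up for the example.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration): the target depends on
# two features plus some random noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 2))
y = 3.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(0, 0.5, size=500)

# Hold out validation rows so that evaluation uses data the model has not seen.
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

# 1. Define: choose the model type and its parameters.
model = DecisionTreeRegressor(random_state=0)

# 2. Fit: capture patterns from the training data.
model.fit(train_X, train_y)

# 3. Predict: apply the fitted model to the held-out rows.
preds_val = model.predict(val_X)

# 4. Evaluate: compare predictions with the true validation targets.
mae = mean_absolute_error(val_y, preds_val)
print(f"Validation MAE: {mae:.3f}")
```

Evaluating on held-out rows rather than on the training rows matters here, because an unrestricted tree can reproduce its training data almost exactly.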

### Underfitting & Overfitting
1. Overfitting - The tree is too deep, so each leaf contains only a few training rows. The model matches the training data closely but makes unreliable predictions on new data.
2. Underfitting - The tree is too shallow, so each leaf contains a large share of the training rows. The model fails to capture important distinctions and patterns in the data.

<img src=".\pic\underfitting_overfitting.png" width="50%" height="50%">

The **max_leaf_nodes** argument provides a sensible way to control the trade-off between overfitting and underfitting. The more leaves we allow the model to make, the more we move from the underfitting area in the above graph to the overfitting area.
We can use a utility function to help compare mean absolute error (MAE) scores for different values of **max_leaf_nodes**.

    from sklearn.metrics import mean_absolute_error
    from sklearn.tree import DecisionTreeRegressor

    def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
        # Fit a tree limited to max_leaf_nodes leaves on the training data.
        model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
        model.fit(train_X, train_y)
        # Score the tree on the validation data, which it has not seen.
        preds_val = model.predict(val_X)
        mae = mean_absolute_error(val_y, preds_val)
        return mae

We can use a for-loop to compare the accuracy of models built with different values for **max_leaf_nodes**.

    # Compare MAE with differing values of max_leaf_nodes.
    # Assumes train_X, val_X, train_y and val_y already exist from an
    # earlier train/validation split.
    for max_leaf_nodes in [5, 50, 500, 5000]:
        my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
        print("Max leaf nodes: %d \t\t Mean Absolute Error: %d" % (max_leaf_nodes, my_mae))

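    Max leaf nodes: 5        Mean Absolute Error: 347380
    Max leaf nodes: 50       Mean Absolute Error: 258171
    Max leaf nodes: 500      Mean Absolute Error: 243495
    Max leaf nodes: 5000     Mean Absolute Error: 254983

Of the values tried, 500 leaf nodes gives the lowest validation MAE. The trees with 5 and 50 leaves underfit, while the tree with 5000 leaves starts to overfit.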
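
Once the scores are computed, the best tree size can be picked programmatically. The sketch below is self-contained, so it repeats `get_mae` and runs on synthetic data; the dataset and the names `candidate_sizes`, `scores`, `best_size`, and `final_model` are assumptions for illustration, and the MAE values it produces will not match the figures above.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    # Fit a size-limited tree and return its MAE on the validation data.
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    return mean_absolute_error(val_y, preds_val)


# Synthetic regression data (an assumption; the original dataset is not shown).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(1000, 3))
y = X[:, 0] ** 2 + 5.0 * X[:, 1] + rng.normal(0, 2.0, size=1000)
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

# Map each candidate size to its validation MAE, then keep the smallest.
candidate_sizes = [5, 50, 500, 5000]
scores = {size: get_mae(size, train_X, val_X, train_y, val_y)
          for size in candidate_sizes}
best_size = min(scores, key=scores.get)
print("Best max_leaf_nodes:", best_size)

# One common follow-up: refit on all rows with the chosen size, since the
# validation split is no longer needed for model selection.
final_model = DecisionTreeRegressor(max_leaf_nodes=best_size, random_state=0)
final_model.fit(X, y)
```

Refitting on the full dataset is a design choice rather than a requirement. It gives the final model more data to learn from, at the cost of having no held-out rows left to score it on.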

## Random Forests

The random forest uses many trees, and it makes a prediction by **averaging the predictions of each component tree**. It generally has **much better predictive accuracy than a single decision tree** and it works well with default parameters. 
Random forests substantially reduce the overfitting problem in two ways:

1. Ensemble learning - combining the predictions of many decision trees averages out their individual errors and captures more representative patterns.
2. Random feature selection - each split in each tree considers only a random subset of the features, which keeps any single tree from memorizing noise.
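
A small sketch of both ideas, assuming scikit-learn's `RandomForestRegressor` and a synthetic dataset invented for the example. Note that scikit-learn's regressor considers all features at each split by default, so `max_features` is set explicitly here to turn on random feature selection; the value `0.6` is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption): two informative features, three pure-noise ones.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(1000, 5))
y = X[:, 0] ** 2 + 5.0 * X[:, 1] + rng.normal(0, 3.0, size=1000)
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

# A single, fully grown tree for comparison.
tree = DecisionTreeRegressor(random_state=0)
tree.fit(train_X, train_y)

# A forest of 100 trees; each split looks at a random 60% of the features.
forest = RandomForestRegressor(n_estimators=100, max_features=0.6, random_state=0)
forest.fit(train_X, train_y)

tree_mae = mean_absolute_error(val_y, tree.predict(val_X))
forest_mae = mean_absolute_error(val_y, forest.predict(val_X))
print(f"Single tree MAE:   {tree_mae:.3f}")
print(f"Random forest MAE: {forest_mae:.3f}")

# The forest's prediction is the average of its component trees' predictions.
per_tree_preds = np.stack([est.predict(val_X) for est in forest.estimators_])
averaged = per_tree_preds.mean(axis=0)
print("Matches forest.predict:", np.allclose(averaged, forest.predict(val_X)))
```

The last lines check the averaging claim directly: stacking the predictions of every tree in `forest.estimators_` and taking the mean reproduces `forest.predict`.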

### Dealing with missing values

#### 1) A Simple Option: Drop Columns with Missing Values   
Unless most values in the dropped columns are missing, the model loses access to a lot of (potentially useful!) information with this approach.
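
A minimal pandas sketch of this option. The toy table and its column names are hypothetical, chosen only to show the mechanics.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two of the three columns contain missing values.
X = pd.DataFrame({
    "rooms": [3, 4, 2, 5],
    "land_size": [200.0, np.nan, 150.0, 320.0],
    "year_built": [1990.0, 2005.0, np.nan, np.nan],
})

# Find every column that has at least one missing entry.
cols_with_missing = [col for col in X.columns if X[col].isnull().any()]

# Drop those columns entirely.
reduced_X = X.drop(columns=cols_with_missing)

print("Dropped:", cols_with_missing)
print("Remaining:", reduced_X.columns.tolist())
```

When there are separate training and validation sets, the same list of columns should be dropped from both so that the model sees the same features at fit and predict time.
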
#### 2) A Better Option: Imputation  
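Imputation fills in each missing value with some number, for example the mean value of that column. The imputed value will rarely be exactly right, but it usually gives a more accurate model than dropping the column entirely.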
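
Mean imputation can be sketched with scikit-learn's `SimpleImputer`. The toy tables and column names below are hypothetical. The imputer learns the column means from the training data only and then reuses them on the validation data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with missing entries in both splits.
train_X = pd.DataFrame({
    "rooms": [3.0, 4.0, 2.0, 5.0],
    "land_size": [200.0, np.nan, 150.0, 310.0],
})
val_X = pd.DataFrame({
    "rooms": [np.nan, 3.0],
    "land_size": [180.0, np.nan],
})

# Learn each column's mean from the training rows, then fill the gaps.
imputer = SimpleImputer(strategy="mean")
imputed_train_X = pd.DataFrame(imputer.fit_transform(train_X),
                               columns=train_X.columns)

# Reuse the training means on the validation rows (transform, not fit_transform).
imputed_val_X = pd.DataFrame(imputer.transform(val_X),
                             columns=val_X.columns)

print(imputed_train_X)
print(imputed_val_X)
```

The imputer returns a plain NumPy array, so the column names are restored by wrapping the result in a new `DataFrame`.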

#### 3) An Extension To Imputation  

In this approach, we impute the missing values as before. Additionally, for each column with missing entries in the original dataset, we add a new column that records which entries were imputed.

<img src=".\pic\imputation.png" width="50%" height="50%">
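
One way to sketch this extension: add an indicator column for each column that has missing entries, then impute. The toy data and the `_was_missing` suffix are assumptions for illustration, not names taken from the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data: only land_size has a missing entry.
X = pd.DataFrame({
    "rooms": [3.0, 4.0, 2.0, 5.0],
    "land_size": [200.0, np.nan, 150.0, 310.0],
})

cols_with_missing = [col for col in X.columns if X[col].isnull().any()]

# Record where values are missing BEFORE imputing (1 = was missing, 0 = present).
X_plus = X.copy()
for col in cols_with_missing:
    X_plus[col + "_was_missing"] = X_plus[col].isnull().astype(int)

# Impute as before; the indicator columns have no gaps, so they pass through.
imputer = SimpleImputer(strategy="mean")
imputed_X_plus = pd.DataFrame(imputer.fit_transform(X_plus),
                              columns=X_plus.columns)

print(imputed_X_plus)
```

The indicator lets a model treat imputed rows differently from rows with observed values, which can help when the fact that a value is missing is itself informative.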