# Kaggle - Learn: Intro to Machine Learning
- https://www.kaggle.com/learn/intro-to-machine-learning
## 5. Underfitting and Overfitting
- Fine-tune your model for better performance.

### Experimenting With Different Models
- The decision tree model (DecisionTreeRegressor) has many options; the most important ones determine the tree's depth.
    - https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html
    - https://scikit-learn.org/stable/modules/tree.html#tree
- A shallow tree can generate an __underfitting__ model.
- A tree that is too deep can lead to __overfitting__.
- __Underfitting:__ failing to capture relevant patterns in the data, leading to less accurate predictions.
- __Overfitting:__ capturing spurious patterns that won't recur in the future, which also leads to less accurate predictions.
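
The two failure modes above can be seen by comparing training and validation error as the tree gets deeper. This is a minimal sketch, not part of the course material: the synthetic noisy sine data, the helper name `train_val_mae`, and the chosen depths are all assumptions made for illustration. A very shallow tree typically has high error on both sets (underfitting), while an unrestricted tree fits the training data almost perfectly but does worse on the validation data than on the data it has memorized (overfitting).

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration): a noisy sine curve.
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(400, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=400)

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)


def train_val_mae(max_depth):
    """Fit a tree of the given depth; return (training MAE, validation MAE)."""
    model = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
    model.fit(train_X, train_y)
    train_mae = mean_absolute_error(train_y, model.predict(train_X))
    val_mae = mean_absolute_error(val_y, model.predict(val_X))
    return train_mae, val_mae


# max_depth=None lets the tree grow until every leaf is pure.
results = {depth: train_val_mae(depth) for depth in [1, 4, None]}
for depth, (train_mae, val_mae) in results.items():
    print(f"max_depth={depth}: train MAE={train_mae:.3f}, validation MAE={val_mae:.3f}")
```

The gap between the two columns is the thing to watch: a large gap with a near-zero training error is the usual sign of overfitting.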

### Example 
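The *max_leaf_nodes* argument provides a sensible way to control the trade-off between overfitting and underfitting: the more leaves the tree is allowed to have, the further it moves from the underfitting side toward the overfitting side.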
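
As a quick check of what the argument does, the sketch below fits trees on small synthetic data (an assumption for illustration, not the Melbourne dataset) and reports the size of each fitted tree using the estimator's `get_n_leaves()` and `get_depth()` methods. The variable name `leaf_counts` is hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration): 200 rows, 2 features.
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(200, 2))
y = X[:, 0] * 2.0 + rng.normal(scale=1.0, size=200)

leaf_counts = {}
for max_leaf_nodes in [5, 50, None]:
    # None means no cap: the tree keeps splitting until its leaves are pure.
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(X, y)
    leaf_counts[max_leaf_nodes] = model.get_n_leaves()
    print(f"max_leaf_nodes={max_leaf_nodes}: "
          f"leaves={model.get_n_leaves()}, depth={model.get_depth()}")
```

The fitted tree never has more leaves than the cap, so a small value forces a coarse model and a large value (or no cap) allows a very detailed one.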

In [1]:
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor
import pandas as pd
from sklearn.model_selection import train_test_split


def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae


# Load data
melbourne_df = pd.read_csv('min_melb_data.csv') 
# Filter rows with missing values
filtered_melbourne_df = melbourne_df.dropna(axis=0)
# Choose target and features
y = filtered_melbourne_df.Price
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'BuildingArea', 
                        'YearBuilt', 'Lattitude', 'Longtitude']
X = filtered_melbourne_df[melbourne_features]

# split data into training and validation data, for both features and target
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)


# compare MAE with differing values of max_leaf_nodes
for max_leaf_nodes in [5, 50, 500, 5000]:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print("Max leaf nodes: %d  \t\t Mean Absolute Error:  %d" %(max_leaf_nodes, my_mae))



Max leaf nodes: 5  		 Mean Absolute Error:  322937
Max leaf nodes: 50  		 Mean Absolute Error:  303430
Max leaf nodes: 500  		 Mean Absolute Error:  302932
Max leaf nodes: 5000  		 Mean Absolute Error:  303443


> Of the values tried, max_leaf_nodes = 500 gives the lowest validation MAE.

## Exercise: Underfitting and Overfitting
- Optimize the size of the tree to make better predictions. The next cell repeats the code for the model built previously.

In [15]:
# Better to evaluate on split data (training + validation, or test, data):
# build the model with the training data and measure its quality on the validation data.

# 0.- Import the libraries, modules, and functions needed.
import pandas as pd                                 # to load and manage DataFrames and Series
from sklearn.tree import DecisionTreeRegressor      # to define the model type
from sklearn.model_selection import train_test_split    # to split the DataFrame into training and validation (test) data
from sklearn.metrics import mean_absolute_error     # to calculate MAE

# 1.- read data + basic filter missing values +  obtain target & features + ...
df = pd.read_csv('train.csv')
#df = df_0.dropna(axis=0)  ---- NOTE: check whether this is appropriate for each dataset!
y = df.SalePrice
X = df[['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']]
        # (use df.columns to list the available columns)
# The model without a data split is skipped here; ONLY the more realistic split-data model is built.

# 2. - split data (between train and validation) + make and fit this model
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
df_model = DecisionTreeRegressor(random_state=1)
df_model.fit(train_X, train_y)

# 3.- Measure model quality: calc predict, calc mae, (optional) calc % mae / average_target
val_predict = df_model.predict(val_X)
mae = mean_absolute_error(val_y, val_predict)
print(f' MAE: {round(mae, 4):,} '.center(26, '-'))
ratio_mae_avg = mae / val_y.mean()
print(f'MAE % of real mean: {round(ratio_mae_avg * 100, 2)} %\
  - real mean (val_y.mean()): {round(val_y.mean(), 2):,}')

---- MAE: 29,652.9315 ----
MAE % of real mean: 16.78 %  - real mean (val_y.mean()): 176,725.51


## Same construction with Kaggle's naming:

In [3]:
### Code you have previously used to load data
# imports...

home_data = pd.read_csv('train.csv')
# Create target object and call it y
y = home_data.SalePrice
# Create X
features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
X = home_data[features]

# Split into validation and training data
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

# Specify Model
iowa_model = DecisionTreeRegressor(random_state=1)
# Fit Model
iowa_model.fit(train_X, train_y)

# Make validation predictions and calculate mean absolute error
val_predictions = iowa_model.predict(val_X)
val_mae = mean_absolute_error(val_y, val_predictions)  # argument order is (y_true, y_pred)
print("Validation MAE: {:,.2f}".format(val_mae))

Validation MAE: 29,652.93


In [None]:
### get_mae() is already defined (same function as in the example above)

### Step 1: Compare Different Tree Sizes
Write a loop that tries each value of max_leaf_nodes from a set of candidate values.

Call the get_mae function on each value of max_leaf_nodes. Store the output in some way that allows you to select the value of max_leaf_nodes that gives the most accurate model on your data.

In [4]:
    
candidate_max_leaf_nodes = [5, 25, 50, 100, 250, 500]
for max_lnodes in candidate_max_leaf_nodes:
    my_mae = get_mae(max_lnodes, train_X, val_X, train_y, val_y)
    print(f'max_leaf_nodes: {max_lnodes} \t\t MAE: {my_mae:,.2f}')

max_leaf_nodes: 5 		 MAE: 35,044.51
max_leaf_nodes: 25 		 MAE: 29,016.41
max_leaf_nodes: 50 		 MAE: 27,405.93
max_leaf_nodes: 100 		 MAE: 27,282.51
max_leaf_nodes: 250 		 MAE: 27,893.82
max_leaf_nodes: 500 		 MAE: 29,454.19


> best_tree_size = 100 (the max_leaf_nodes value with the lowest validation MAE)
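
One way to store the results so the best value can be selected programmatically is a dictionary keyed by max_leaf_nodes. This is a minimal sketch: the `scores` dictionary is hard-coded from the output printed above so that the snippet runs on its own, whereas in the notebook it would be filled by calling get_mae for each candidate.

```python
# Validation MAE per candidate, copied from the printed output above.
# In the notebook this could instead be built as:
#   scores = {n: get_mae(n, train_X, val_X, train_y, val_y)
#             for n in candidate_max_leaf_nodes}
scores = {
    5: 35044.51,
    25: 29016.41,
    50: 27405.93,
    100: 27282.51,
    250: 27893.82,
    500: 29454.19,
}

# min() over the keys, ranked by their MAE, returns the best tree size.
best_tree_size = min(scores, key=scores.get)
print(f"best_tree_size = {best_tree_size} (MAE: {scores[best_tree_size]:,.2f})")
```

Keeping the scores in a dictionary avoids reading the best value off the printed table by eye.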

### Step 2: Fit Model Using All Data
- You know the best tree size. If you were going to deploy this model in practice, you would make it even more accurate by using all of the data and keeping that tree size. That is, you don't need to hold out the validation data now that you've made all your modeling decisions.

In [5]:
# Use the best tree size found in Step 1
final_model = DecisionTreeRegressor(max_leaf_nodes=100, random_state=0)

# Fit the final model on all of the data
final_model.fit(X, y)   # using ALL the data should make the model more accurate

> You've tuned this model and improved your results. But we are still using Decision Tree models, which are not very sophisticated by modern machine learning standards. In the next step you will learn to use Random Forests to improve your models even more.