# kaggle - Learn: Intro to Machine Learning
- https://www.kaggle.com/learn/intro-to-machine-learning
## 5. Underfitting and Overfitting
- Fine-tune your model for better performance.

### Experimenting With Different Models
- The decision tree model (DecisionTreeRegressor) has many options; the most important ones determine the tree's depth.
    - https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html
    - https://scikit-learn.org/stable/modules/tree.html#tree
- A shallow tree can generate an __underfitting__ model
- A tree that is too deep can lead to __overfitting__.
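The effect of depth can be sketched on synthetic data. This is a minimal illustration, assuming a noisy sine curve as the target (not the housing data used in these notes); the depth values and the helper name `mae_for_depth` are arbitrary choices for the example.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption): a noisy sine curve, not the housing data.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(400, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=400)

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

def mae_for_depth(max_depth):
    """Fit a tree with the given depth limit; return (train MAE, validation MAE)."""
    model = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
    model.fit(train_X, train_y)
    train_mae = mean_absolute_error(train_y, model.predict(train_X))
    val_mae = mean_absolute_error(val_y, model.predict(val_X))
    return train_mae, val_mae

# max_depth=None lets the tree grow until every leaf is pure.
results = {depth: mae_for_depth(depth) for depth in (1, 4, None)}
for depth, (train_mae, val_mae) in results.items():
    print(f"max_depth={depth}: train MAE={train_mae:.3f}, validation MAE={val_mae:.3f}")
```

The unlimited tree reproduces its training data exactly, so its training MAE is zero while its validation MAE is not. The depth-1 tree is too coarse to follow the curve even on the training data.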

In [15]:
# --> Better to use split data (train + validation, or test, data):
### build the model with the training data and measure its quality on the validation data.

# 0.- Import the libraries, modules and functions needed.
import pandas as pd                                 # to load and manage DataFrames and Series
from sklearn.tree import DecisionTreeRegressor      # to define the model type
from sklearn.model_selection import train_test_split    # to split the whole df into training and validation (test) data
from sklearn.metrics import mean_absolute_error     # to calculate MAE

# 1.- Read the data, (optionally) filter missing values, then obtain the target and features.
df = pd.read_csv('train.csv')
#df = df_0.dropna(axis=0)  ---- NOTE: check whether this is appropriate for each dataset!!!
y = df.SalePrice
X = df[['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']]
        # (use df.columns to list the available columns)
# The non-split (in-sample) model is skipped here; only the more realistic split-data model is built.

# 2.- Split the data (training and validation), then define and fit the model.
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
df_model = DecisionTreeRegressor(random_state=1)
df_model.fit(train_X, train_y)

# 3.- Measure model quality: compute predictions, compute MAE, and (optionally) MAE as a % of the average target.
val_predict = df_model.predict(val_X)
mae = mean_absolute_error(val_y, val_predict)
print(f' MAE: {round(mae, 4):,} '.center(26, '-'))
ratio_mae_avg = mae / val_y.mean()
print(f'MAE % of real mean: {round(ratio_mae_avg * 100, 2)} %\
  - real mean (val_y.mean()): {round(val_y.mean(), 2):,}')

---- MAE: 29,652.9315 ----
MAE % of real mean: 16.78 %  - real mean (val_y.mean()): 176,725.51


- __MAE:__ Mean Absolute Error. On average, our predictions are off by about X (the MAE).
    - Error = actual_val - predicted_val
    - Absolute error (AE): abs(Error)
    - MAE: the mean (average) of all the absolute errors.
- To calculate MAE we first need a fitted model, because the calculation compares the actual target values with the model's predicted values. We use the mean_absolute_error function from the sklearn.metrics module.
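The three steps above can be worked through by hand. This is a small sketch with made-up prices (they are assumptions for illustration, not values from either dataset), using plain Python instead of sklearn.

```python
# Hypothetical actual and predicted prices (illustrative values only).
actual = [200_000, 150_000, 300_000]
predicted = [210_000, 130_000, 270_000]

# Step 1: error = actual - predicted, one value per house.
errors = [a - p for a, p in zip(actual, predicted)]

# Step 2: absolute value of each error.
absolute_errors = [abs(e) for e in errors]

# Step 3: mean of the absolute errors.
mae = sum(absolute_errors) / len(absolute_errors)

print(errors)  # [-10000, 20000, 30000]
print(mae)     # 20000.0
```

Taking absolute values first matters: the raw errors here would average to about 13,333, because the negative error partly cancels the positive ones.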

In [51]:
import pandas as pd

# Load data
melbourne_df = pd.read_csv('min_melb_data.csv') 
# Drop rows that contain any missing value
filtered_melbourne_df = melbourne_df.dropna(axis=0)
# Choose target and features
y = filtered_melbourne_df.Price
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'BuildingArea', 
                        'YearBuilt', 'Lattitude', 'Longtitude']
X = filtered_melbourne_df[melbourne_features]

from sklearn.tree import DecisionTreeRegressor
# Define model
melbourne_model = DecisionTreeRegressor()
# Fit model
melbourne_model.fit(X, y)

In [52]:
#### Once we have a model, here is how we calculate the mean absolute error:

from sklearn.metrics import mean_absolute_error

predicted_home_prices = melbourne_model.predict(X)
mean_absolute_error(y, predicted_home_prices)

63.72132540356839

### The Problem with "In-Sample" Scores
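- We used a single "sample" of houses for both building the model and evaluating it. This is misleading: the score only says how well the model reproduces data it has already seen, not how it performs on new data.
- __Validation Data:__ some data excluded from the model-building process and used only to measure the model's quality.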
- The scikit-learn library has a function __*train_test_split*__ to break up the data into two pieces.
    - We'll use some of that data as training data to fit the model, and we'll use the other data as validation data to calculate *mean_absolute_error*:

In [53]:
from sklearn.model_selection import train_test_split

# split data into training and validation data, for both features and target
# The split is based on a random number generator. Supplying a numeric value to
# the random_state argument guarantees we get the same split every time we
# run this script.
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=5)
# Define model
melbourne_mod2 = DecisionTreeRegressor()
# Fit model
melbourne_mod2.fit(train_X, train_y)

# get predicted prices on validation data
val_predictions = melbourne_mod2.predict(val_X)
print(mean_absolute_error(val_y, val_predictions))

303389.4915254237


### __Wow!__
- 'in-sample' MAE = 63.7; 'out-of-sample' MAE = 303,389.5 in the run above (split with random_state = 5; the model itself is not seeded, so the exact value can change between runs).
- This is the difference between a model that is almost exactly right and one that is unusable for most practical purposes.
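One way to see why the in-sample score is so flattering is a toy "memorizing" predictor. The sketch below is hypothetical and is not a decision tree: it looks up the exact training row, and falls back to the training mean for any house it has not seen. The prices and features are made up.

```python
from statistics import mean

# Hypothetical toy data: (rooms, land size) -> price. Not the Melbourne dataset.
train_rows = {(3, 400): 900_000, (2, 150): 600_000, (4, 650): 1_500_000}
val_rows = {(3, 420): 1_100_000, (2, 160): 550_000}

# Prediction used for houses that never appeared in training.
fallback = mean(train_rows.values())

def predict(features):
    """Return the memorized price for a known house, else the training mean."""
    return train_rows.get(features, fallback)

def mae(rows):
    """Mean absolute error of predict() over a dict of features -> actual price."""
    return mean(abs(price - predict(features)) for features, price in rows.items())

in_sample_mae = mae(train_rows)
validation_mae = mae(val_rows)
print(in_sample_mae)   # 0
print(validation_mae)  # 275000
```

A fully grown decision tree behaves similarly on its own training rows, which is why only the validation score says anything about new data.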

In [54]:
# The average home value in the validation data is about 1.1 million dollars.
# val_y holds the validation home values: a Series with only the randomly selected validation rows of the original dataset.
display(val_y)
# Take a look at the descriptive statistics of the validation target (val_y).
# mean (average)
print(f'\nValidation data home value average (val_y,mean()): {round(val_y.mean(), 2):,}\n'.center(185, '-'))
print()
print(f'  Val. data home value avg- (val_y,mean()): {round(val_y.mean(), 2):,}  '.center(64, '\\'))
print()
print("For more information let's see the descriptive statistics:")
print("----------------------------------------------------------")
val_y.describe()


1353     967500.0
1068    1200000.0
1042     400000.0
470      522500.0
1002    1875000.0
          ...    
1395    1360000.0
386     1665000.0
422      690000.0
395     1830000.0
1380    1450000.0
Name: Price, Length: 295, dtype: float64

------------------------------------------------------------
Validation data home value average (val_y,mean()): 1,172,083.05
------------------------------------------------------------

\\\  Val. data home value avg- (val_y,mean()): 1,172,083.05  \\\

For more information let's see the descriptive statistics:
----------------------------------------------------------


count    2.950000e+02
mean     1.172083e+06
std      6.196489e+05
min      2.725000e+05
25%      7.175000e+05
50%      1.001000e+06
75%      1.482500e+06
max      4.000000e+06
Name: Price, dtype: float64

In [58]:
# I need percent_mae_mean = (mae / mean) * 100  - the parentheses are not necessary (only for readability)
#    - mae = mean_absolute_error(actual_values, predicted_values)  - all computed on VALIDATION data
#                             (once the model is fitted, the training data is not used anymore)
#         - actual_values = val_y (a Series) - actual VALIDATION values
#         - predicted_values = fitted_model.predict(val_X) - predicted VALIDATION values
#    - mean = val_y.mean() - mean of the actual sale prices in the validation data (target)
mae = mean_absolute_error(val_y, melbourne_mod2.predict(val_X))
percent_mae_mean = (mae / val_y.mean()) * 100

t_val_hpmean = f'The average home value in the validation data is: {round(val_y.mean(), 2):,}'
t_mae = f'The mean_absolute_error (val. data of-course) is: {round(mae, 2):,}'
t_erro_nd = f'So the error in new data is: {round(percent_mae_mean, 2)} % of the average home value.'

print(t_val_hpmean)
print(t_mae)
print(t_erro_nd)

print()
# The ratio can also be printed with the % format spec, as below; add a precision (e.g. :.2%) to round.
ratio_mae_mean = mae / val_y.mean()
print(f'So the error in new data is: {ratio_mae_mean:%} of the average home value.')

print('\n' + ' THIS MODEL IS "UNUSABLE" FOR PRACTICAL PURPOSES '.center(56, '!'))

The average home value in the validation data is: 1,172,083.05
The mean_absolute_error (val. data of-course) is: 303,389.49
So the error in new data is: 25.88 % of the average home value.

So the error in new data is: 25.884641% of the average home value.

!!! THIS MODEL IS "UNUSABLE" FOR PRACTICAL PURPOSES !!!!
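The `:%` format specifier used in the cell above multiplies by 100 and prints six decimals by default; a precision can be added to round. A short sketch, with the ratio taken from the printed output above:

```python
# MAE / mean ratio, copied from the output above (25.884641 %).
ratio = 0.25884641

default_pct = f"{ratio:%}"    # six decimals by default
rounded_pct = f"{ratio:.2%}"  # two decimals

print(default_pct)  # 25.884641%
print(rounded_pct)  # 25.88%
```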


> __There are many ways to improve this model, such as experimenting to find better features or different model types.__
- Ways to improve the model:
    1. Find better features.
    2. Use a different model type.
    3. Both? Probably, but one at a time, I would say.
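Validation MAE is the yardstick for comparing such alternatives. The sketch below compares two candidate feature sets on synthetic data; the columns `signal` and `noise` and the helper `validation_mae` are assumptions made for the example, not part of the course material.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption): the target depends on `signal`, not on `noise`.
rng = np.random.default_rng(1)
signal = rng.uniform(0, 10, size=400)
noise = rng.uniform(0, 10, size=400)
y = 100 * signal + rng.normal(scale=1.0, size=400)
X = np.column_stack([signal, noise])  # column 0 = signal, column 1 = noise

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

def validation_mae(columns):
    """Fit a tree on the chosen feature columns and return its validation MAE."""
    model = DecisionTreeRegressor(random_state=1)
    model.fit(train_X[:, columns], train_y)
    return mean_absolute_error(val_y, model.predict(val_X[:, columns]))

mae_noise_only = validation_mae([1])
mae_signal_only = validation_mae([0])
print(f"noise only:  {mae_noise_only:.1f}")
print(f"signal only: {mae_signal_only:.1f}")
```

Keeping the split and the model settings fixed while changing only the features follows the "one at a time" advice: any change in validation MAE can then be attributed to the features.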

# Exercise: Model Validation

In [59]:
home_df = pd.read_csv('train.csv')
home_df

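| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | | Reg | Lvl | AllPub | ... | 0 | | | | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | | Reg | Lvl | AllPub | ... | 0 | | | | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 12 | 2008 | WD | Normal | 250000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1455 | 1456 | 60 | RL | 62.0 | 7917 | Pave | | Reg | Lvl | AllPub | ... | 0 | | | | 0 | 8 | 2007 | WD | Normal | 175000 |
| 1456 | 1457 | 20 | RL | 85.0 | 13175 | Pave | | Reg | Lvl | AllPub | ... | 0 | | MnPrv | | 0 | 2 | 2010 | WD | Normal | 210000 |
| 1457 | 1458 | 70 | RL | 66.0 | 9042 | Pave | | Reg | Lvl | AllPub | ... | 0 | | GdPrv | Shed | 2500 | 5 | 2010 | WD | Normal | 266500 |
| 1458 | 1459 | 20 | RL | 68.0 | 9717 | Pave | | Reg | Lvl | AllPub | ... | 0 | | | | 0 | 4 | 2010 | WD | Normal | 142125 |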


In [60]:
# prediction target of whole dataset
y = home_df.SalePrice
y

0       208500
1       181500
2       223500
3       140000
4       250000
         ...  
1455    175000
1456    210000
1457    266500
1458    142125
1459    147500
Name: SalePrice, Length: 1460, dtype: int64

In [61]:
# features of the whole dataset
X = home_df[['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']]
X

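| | LotArea | YearBuilt | 1stFlrSF | 2ndFlrSF | FullBath | BedroomAbvGr | TotRmsAbvGrd |
|---|---|---|---|---|---|---|---|
| 0 | 8450 | 2003 | 856 | 854 | 2 | 3 | 8 |
| 1 | 9600 | 1976 | 1262 | 0 | 2 | 3 | 6 |
| 2 | 11250 | 2001 | 920 | 866 | 2 | 3 | 6 |
| 3 | 9550 | 1915 | 961 | 756 | 1 | 3 | 7 |
| 4 | 14260 | 2000 | 1145 | 1053 | 2 | 4 | 9 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1455 | 7917 | 1999 | 953 | 694 | 2 | 3 | 7 |
| 1456 | 13175 | 1978 | 2073 | 0 | 2 | 3 | 7 |
| 1457 | 9042 | 1941 | 1188 | 1152 | 2 | 4 | 9 |
| 1458 | 9717 | 1950 | 1078 | 0 | 1 | 2 | 5 |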


In [65]:
## A NON-split-data model: the (flawed) 'in-sample' situation, shown only as an example
home_model = DecisionTreeRegressor()
home_model.fit(X, y)    # model fitted using whole dataset

print("First in-sample predictions:", home_model.predict(X.head()))
print("Actual target values for those homes:", y.head().tolist())

print()
print("First in-sample predictions:", home_model.predict(X))

print()
print("First in-sample predictions:", home_model.predict(X.head(9)))
print("Actual target values for those homes:", y.head(9).tolist())

First in-sample predictions: [208500. 181500. 223500. 140000. 250000.]
Actual target values for those homes: [208500, 181500, 223500, 140000, 250000]

First in-sample predictions: [208500. 181500. 223500. ... 266500. 142125. 147500.]

First in-sample predictions: [208500. 181500. 223500. 140000. 250000. 143000. 307000. 200000. 129900.]
Actual target values for those homes: [208500, 181500, 223500, 140000, 250000, 143000, 307000, 200000, 129900]


### Step 1: Split Your Data
- Obtain different data for training and for validation, to avoid the 'in-sample' situation.

In [68]:
from sklearn.model_selection import train_test_split

# Give it the argument random_state=1 so the check functions know what to expect when verifying your code.
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
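To make the role of `random_state` concrete, here is a rough pure-Python sketch of a seeded shuffle-and-slice split. It is an illustration only, not scikit-learn's implementation; the function name `simple_train_val_split` is hypothetical. The 25 % validation share mirrors the default `test_size` of `train_test_split`, which is consistent with the 365 validation rows out of 1,460 seen in these notes.

```python
import random

def simple_train_val_split(rows, val_fraction=0.25, seed=1):
    """Shuffle row indices with a seeded generator, then slice off a validation part."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)      # same seed -> same shuffle
    n_val = int(len(rows) * val_fraction)
    val_idx, train_idx = indices[:n_val], indices[n_val:]
    return [rows[i] for i in train_idx], [rows[i] for i in val_idx]

rows = list(range(20))                        # stand-in for 20 houses
train_a, val_a = simple_train_val_split(rows, seed=1)
train_b, val_b = simple_train_val_split(rows, seed=1)

print(len(train_a), len(val_a))               # 15 5
print(train_a == train_b and val_a == val_b)  # True
```

Every row lands in exactly one of the two parts, and repeating the call with the same seed reproduces the same split.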

### Step 2: Specify and Fit a New Model

In [78]:
# Set random_state to 1 again when creating the model.
home_mmod = DecisionTreeRegressor(random_state=1)
home_mmod.fit(train_X, train_y)

### Step 3: Make Predictions with Validation Data

In [80]:
val_predictions = home_mmod.predict(val_X)
pd.Series(val_predictions)    # predict() returns a NumPy array; wrap it in a Series for display

0      186500.0
1      184000.0
2      130000.0
3       92000.0
4      164500.0
         ...   
360    133750.0
361    188500.0
362    148500.0
363    284000.0
364    201800.0
Length: 365, dtype: float64

In [81]:
### Inspect your predictions and actual values from validation data.
print(pd.Series(val_predictions).head())
print(val_y.head())


0    186500.0
1    184000.0
2    130000.0
3     92000.0
4    164500.0
dtype: float64
258     231500
267     179500
288     122000
649      84500
1233    142000
Name: SalePrice, dtype: int64


What do you notice that is different from what you saw with the in-sample predictions (which are printed after the top code cell on this page)?

Do you remember why validation predictions differ from in-sample (or training) predictions? This is an important idea from the last lesson.

### Step 4: Calculate the Mean Absolute Error on the Validation Data

In [91]:
mae_ex = mean_absolute_error(val_y, val_predictions)
print(f' {round(mae_ex, 2):_} '.center(20, '_'))

____ 29_652.93 _____


In [None]:
### Future: calculate the MAE as a percentage of target.mean()
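The pending calculation noted above could be sketched as a small helper. The function name `mae_percent_of_mean` is hypothetical; the inputs are the MAE and validation mean printed earlier in these notes for the same data and split.

```python
def mae_percent_of_mean(mae, target_mean):
    """Express the MAE as a percentage of the mean target value."""
    return mae / target_mean * 100

# Values printed earlier in these notes: MAE 29,652.9315 and val_y.mean() 176,725.51.
pct = mae_percent_of_mean(29_652.9315, 176_725.51)
print(f"MAE is {pct:.2f} % of the mean sale price")
```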

> Is that MAE good? __JM thinks not: it is UNUSABLE.__ There isn't a general rule for what values are good that applies across applications. But you'll see how to use (and improve) this number in the next step.

### Keep Going

You are ready for **[Underfitting and Overfitting](https://www.kaggle.com/dansbecker/underfitting-and-overfitting).**
