# kaggle - Learn: Intro to Machine Learning
- https://www.kaggle.com/learn/intro-to-machine-learning
## 7. Machine Learning Competitions
1. rf_m0: Base RandomForest Model
2. rf_m1: rf_Improved #1 - Maximum Features
3. rf_m2: rf_Improved #2 - Max. Features + Tune Parameters
4. dt_m0: Base DecisionTree Model
5. dt_m1: dt_Improved #1 - Maximum Features
6. dt_m2: dt_Improved #2 - Max. Features + Tune Parameters

## Housing Prices Competition for Kaggle Learn Users
- In this exercise, you will create and submit predictions for a Kaggle competition. You can then improve your model (e.g. by adding features) to apply what you've learned and move up the leaderboard.

### 0.- Obtain data that will be the same for all cases (1. - 6.)

In [18]:
# Import necessary libraries
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

In [19]:
# Load data (train.csv) and separate target 'y' - features will be separated later for each case
df = pd.read_csv('train.csv')
y = df.SalePrice

### 1. rf_m0: Base RandomForest Model +
### 4. dt_m0: Base DecisionTree Model
- Both use the same feature columns (the basic set from the lecture)

In [20]:
# Separate basic features 'X' (same for 1. and 4.)
features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath',
            'BedroomAbvGr', 'TotRmsAbvGrd']
X = df[features]

# Split into training and validation (test) data (same for 1. and 4.)
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=18)
# random_state could have a different weight in dt than in rf (18, 42, 238, 777)

# 1. Basic Random Forest model: define, fit on training data, predict on validation data, compute MAE
rf_m0 = RandomForestRegressor(random_state=18)
rf_m0.fit(train_X, train_y)
rf_m0_pred = rf_m0.predict(val_X)
rf_m0_mae = mean_absolute_error(val_y, rf_m0_pred)
print(f'Validation MAE for Basic Random Forest model: {rf_m0_mae:,.0f}')

# 4. Basic Decision Tree model: define, fit on training data, predict on validation data, compute MAE
dt_m0 = DecisionTreeRegressor(random_state=18)
dt_m0.fit(train_X, train_y)
dt_m0_pred = dt_m0.predict(val_X)
dt_m0_mae = mean_absolute_error(val_y, dt_m0_pred)
print(f'Validation MAE for Basic Decision Tree model: {dt_m0_mae:,.0f}')

Validation MAE for Basic Random Forest model: 20,354
Validation MAE for Basic Decision Tree model: 28,717
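
The validation MAE printed above is the mean of the absolute differences between actual and predicted sale prices. As a minimal sketch with made-up numbers (the prices and the helper name `mean_abs_error` are illustrative assumptions, not taken from train.csv or scikit-learn), the metric can be computed by hand:

```python
# Illustrative values only; not taken from train.csv.
actual = [200_000, 150_000, 320_000, 180_000]
predicted = [210_000, 140_000, 300_000, 185_000]

def mean_abs_error(y_true, y_pred):
    """Average of |actual - predicted| over all rows (hypothetical helper)."""
    errors = [abs(t - p) for t, p in zip(y_true, y_pred)]
    return sum(errors) / len(errors)

mae = mean_abs_error(actual, predicted)
print(f'MAE: {mae:,.0f}')   # prints "MAE: 11,250"
```

A lower MAE means the predictions are, on average, closer to the real prices, which is why the Random Forest result above is preferred over the Decision Tree result.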


### 2. rf_m1: rf_Improved #1 - Maximum Features
### 5. dt_m1: dt_Improved #1 - Maximum Features
- Both use the same feature columns (the FULL set, the maximum possible from the lecture)

In [21]:
fF_tring= '''MSSubClass,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,1stFlrSF,2ndFlrSF,LowQualFinSF,\
GrLivArea,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,TotRmsAbvGrd,Fireplaces,WoodDeckSF,OpenPorchSF,\
EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold'''
featuresF = fF_tring.split(',')
XF = df[featuresF]
rs = 18                # to probe diff random_state
# Split into training and validation (test) data
train_XF, val_XF, train_yF, val_yF = train_test_split(XF, y, random_state=rs)

# 2. Imp#1 Random Forest model: define, fit on training data, predict on validation data, compute MAE
rf_m1 = RandomForestRegressor(random_state=rs)
rf_m1.fit(train_XF, train_yF)
rf_m1_pred = rf_m1.predict(val_XF)
rf_m1_mae = mean_absolute_error(val_yF, rf_m1_pred)
print(f'Validation MAE for Imp#1 Random Forest model: {rf_m1_mae:,.0f}')

# 5. Imp#1 Decision Tree model: define, fit on training data, predict on validation data, compute MAE
dt_m1 = DecisionTreeRegressor(random_state=rs)
dt_m1.fit(train_XF, train_yF)
dt_m1_pred = dt_m1.predict(val_XF)
dt_m1_mae = mean_absolute_error(val_yF, dt_m1_pred)
print(f'Validation MAE for Imp#1 Decision Tree model: {dt_m1_mae:,.0f}')

## CAUTION: risk of overfitting with too many (maybe spurious) features - select the best ones with AI?

Validation MAE for Imp#1 Random Forest model: 17,508
Validation MAE for Imp#1 Decision Tree model: 26,776
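
The caution in the cell above concerns spurious features. One possible way to look into it, sketched here on synthetic data rather than on `XF` (the columns `signal`, `weak` and `noise` are made up for illustration), is to inspect the fitted forest's `feature_importances_` attribute:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(18)
n = 400
# Synthetic stand-in for the housing features (hypothetical columns).
X_demo = pd.DataFrame({
    'signal': rng.normal(size=n),   # strongly drives the target
    'weak': rng.normal(size=n),     # small effect on the target
    'noise': rng.normal(size=n),    # unrelated to the target
})
y_demo = 10 * X_demo['signal'] + X_demo['weak'] + rng.normal(scale=0.5, size=n)

forest = RandomForestRegressor(n_estimators=50, random_state=18)
forest.fit(X_demo, y_demo)

# Impurity-based importances; they sum to 1 across all features.
importances = pd.Series(forest.feature_importances_, index=X_demo.columns)
print(importances.sort_values(ascending=False))
```

Features with very low importance are candidates to drop, although any such choice should still be checked against the validation MAE.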


### 3. rf_m2: rf_Improved #2 - Max. Features + Tune Parameters
- Use XF and the train_XF, val_XF, train_yF, val_yF splits that already exist
- Same as rf_m1, but rf_m2 uses different parameter values

In [22]:
# 3. Imp#2 Random Forest model: define, fit on training data, predict on validation data, compute MAE
#rf_m2 = RandomForestRegressor(random_state=rs, max_features=6, n_estimators=135)
rf_m2 = RandomForestRegressor(n_estimators=135, max_features=6, random_state=rs)
rf_m2.fit(train_XF, train_yF)
rf_m2_pred = rf_m2.predict(val_XF)
rf_m2_mae = mean_absolute_error(val_yF, rf_m2_pred)
print(f'Validation MAE for Imp#2 Random Forest model: {rf_m2_mae:,.0f}')

Validation MAE for Imp#2 Random Forest model: 16,270


In [23]:
# Best n_estimators
def get_rfm2_mae(ne, tX, ty, vX, vy):
    rf_m2 = RandomForestRegressor(n_estimators=ne, random_state=rs, max_features=6)
    rf_m2.fit(tX, ty)
    rf_m2_pred = rf_m2.predict(vX)
    rf_m2_mae = mean_absolute_error(vy, rf_m2_pred)
    return rf_m2_mae

rfm2_d = dict()
for ne in range(130, 140):          # probed from (2, 300)
    rfm2_d[ne] = get_rfm2_mae(ne, train_XF, train_yF, val_XF, val_yF)

#print(rfm2_d, type(rfm2_d))

mxmae = max(rfm2_d.values())
mimae = min(rfm2_d.values())

for k in rfm2_d:
    if rfm2_d[k] == mimae:
        kmin = k
    if rfm2_d[k] == mxmae:
        kmax = k


print(f'k_min: {kmin}  -  MAEmin: {mimae:,.2f}')
print(f'k_mxn: {kmax}  -  MAEmax: {mxmae:,.2f}')

# k_min: 135  -  MAEmin: 16,270.14
# k_mxn: 34  -  MAEmax: 17,208.54

k_min: 135  -  MAEmin: 16,270.14
k_mxn: 138  -  MAEmax: 16,321.09
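
The two loops that look up the best and worst keys could also be written with the built-in `min` and `max` and a `key` function. A small sketch, using a made-up dictionary `scores` in place of `rfm2_d`:

```python
# Hypothetical scores: parameter value -> validation MAE (toy numbers).
scores = {10: 3.2, 20: 2.5, 30: 2.9}

best_k = min(scores, key=scores.get)    # key with the lowest MAE
worst_k = max(scores, key=scores.get)   # key with the highest MAE

print(f'k_min: {best_k}  -  MAEmin: {scores[best_k]:,.2f}')
print(f'k_max: {worst_k}  -  MAEmax: {scores[worst_k]:,.2f}')
```

One difference to keep in mind: when two keys tie, `min` and `max` return the first one in the dictionary, whereas the loop above keeps the last one.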


In [24]:
# Best max_features
def get_rfm2_mae(mf, tX, ty, vX, vy):
    rf_m2 = RandomForestRegressor(n_estimators=135, random_state=rs, max_features=mf)
    rf_m2.fit(tX, ty)
    rf_m2_pred = rf_m2.predict(vX)
    rf_m2_mae = mean_absolute_error(vy, rf_m2_pred)
    return rf_m2_mae

rfm2_d = dict()
for n in range(1, 11):          # probes fractions 0.1 .. 1.0 of the features
    mf = n/10
    rfm2_d[mf] = get_rfm2_mae(mf, train_XF, train_yF, val_XF, val_yF)

#print(rfm2_d, type(rfm2_d))

mxmae = max(rfm2_d.values())
mimae = min(rfm2_d.values())

for k in rfm2_d:
    if rfm2_d[k] == mimae:
        kmin = k
    if rfm2_d[k] == mxmae:
        kmax = k


print(f'max_features _min: {kmin}  -  MAEmin: {mimae:,.2f}')
print(f'max_features _max: {kmax}  -  MAEmax: {mxmae:,.2f}')

# max_features _min: 6  -  MAEmin: 16,270.14
# max_features _max: 1  -  MAEmax: 19,040.38

max_features _min: 0.3  -  MAEmin: 16,427.99
max_features _max: 1.0  -  MAEmax: 17,528.56
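
The commented results in the cell above record 6 as the best `max_features`, presumably from an earlier sweep over integers, while the printed output reports 0.3. Recent scikit-learn documentation describes a float `max_features` as a fraction, with `max(1, int(max_features * n_features_in_))` features considered at each split. A sketch of that arithmetic for the 25 columns in `featuresF` (the dictionary name `per_split` is an assumption for illustration):

```python
n_features = 25  # number of columns in featuresF

# Mirror the documented rule for float max_features values.
per_split = {}
for n in range(1, 11):
    mf = n / 10
    per_split[mf] = max(1, int(mf * n_features))
    print(f'max_features={mf:.1f} -> {per_split[mf]} features per split')
```

Under this rule the fractional grid never maps to exactly 6 features (0.3 gives 7), which may be why the two sweeps report different minima.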


> Random Forest Model - Conclusion:
- Case: train_XF, val_XF, train_yF, val_yF = train_test_split(XF, y, random_state=18)
- Best: RandomForestRegressor(n_estimators=135, random_state=18, max_features=6)
- MAE: 16,270.14 (rf_m2) vs 17,508 (rf_m1) vs 20,354 (rf_m0); the best Decision Tree is much worse at 23,674.48
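
The cells above tune `n_estimators` and `max_features` one at a time against a single validation split. As an alternative sketch (not what this notebook does), scikit-learn's `GridSearchCV` can search both parameters jointly with cross-validation. The data below is synthetic and the grid values are arbitrary assumptions chosen to keep the run short:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for XF / y (the real data is not loaded here).
X_demo, y_demo = make_regression(n_samples=200, n_features=8,
                                 noise=10.0, random_state=18)

param_grid = {
    'n_estimators': [20, 50],    # small illustrative grid
    'max_features': [2, 4, 8],
}
search = GridSearchCV(
    RandomForestRegressor(random_state=18),
    param_grid,
    scoring='neg_mean_absolute_error',  # scores are maximized, so MAE is negated
    cv=3,
)
search.fit(X_demo, y_demo)

print('Best parameters:', search.best_params_)
print(f'Best cross-validated MAE: {-search.best_score_:,.2f}')
```

Because the score is averaged over several folds, the chosen parameters depend less on one particular `random_state` of `train_test_split`.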

### 6. dt_m2: dt_Improved #2 - Max. Features + Tune Parameters

In [25]:
# Best max_leaf_nodes in a DecisionTree model
def get_dtm2_mae(mln, tX, ty, vX, vy):
    dt_m2 = DecisionTreeRegressor(max_leaf_nodes=mln, random_state=rs)
    dt_m2.fit(tX, ty)
    dt_m2_pred = dt_m2.predict(vX)
    dt_m2_mae = mean_absolute_error(vy, dt_m2_pred)
    return dt_m2_mae

dtm2_d = dict()
for mln in range(2, 300):          # probed from (2, 300)
    dtm2_d[mln] = get_dtm2_mae(mln, train_XF, train_yF, val_XF, val_yF)

#print(dtm2_d, type(dtm2_d))

mxmae = max(dtm2_d.values())
mimae = min(dtm2_d.values())

for k in dtm2_d:
    if dtm2_d[k] == mimae:
        kmin = k
    if dtm2_d[k] == mxmae:
        kmax = k

print(f'max_leaf_nodes _min: {kmin}  -  MAEmin: {mimae:,.2f}')
print(f'max_leaf_nodes _max: {kmax}  -  MAEmax: {mxmae:,.2f}')

# max_leaf_nodes _min: 59  -  MAEmin: 23,674.48
# max_leaf_nodes _max: 2  -  MAEmax: 40,775.65

max_leaf_nodes _min: 59  -  MAEmin: 23,674.48
max_leaf_nodes _max: 2  -  MAEmax: 40,775.65


> Decision Tree Model - Conclusion:
- Never as good as Random Forest (in this exercise, at least)
- Case: train_XF, val_XF, train_yF, val_yF = train_test_split(XF, y, random_state=18)
- Best: DecisionTreeRegressor(max_leaf_nodes=59, random_state=18)
- Best MAE: 23,674.48 (dt_m2) vs 26,776 (dt_m1) vs 28,717 (dt_m0); but the best RF reaches 16,270.14
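
Both conclusions hold for one split with `random_state=18`. A rough way to check how much they depend on the seed is to repeat the comparison for the seeds listed in the earlier code comment (18, 42, 238, 777). The sketch below does this on synthetic data, so its numbers say nothing about the housing data itself. The names `X_demo`, `y_demo` and `results` are assumptions:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for XF / y.
X_demo, y_demo = make_regression(n_samples=300, n_features=8,
                                 noise=10.0, random_state=0)

results = {}
for seed in (18, 42, 238, 777):   # seeds listed in the notebook comment
    tX, vX, ty, vy = train_test_split(X_demo, y_demo, random_state=seed)
    models = (
        ('dt', DecisionTreeRegressor(random_state=seed)),
        ('rf', RandomForestRegressor(n_estimators=50, random_state=seed)),
    )
    maes = {}
    for name, model in models:
        model.fit(tX, ty)
        maes[name] = mean_absolute_error(vy, model.predict(vX))
    results[seed] = maes
    print(f"seed {seed}: DT MAE {maes['dt']:,.2f} | RF MAE {maes['rf']:,.2f}")
```

If the ranking of the two models stays the same across seeds, the conclusion is less likely to be an artifact of one particular split.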


### MORE WORK TO IMPROVE THE MODEL
- Choose better features! Creative feature engineering! AI?
- Other models, e.g. gradient boosting (mentioned LATER)!

# __________________________________________________________________________________

## Generate output file 'submission.csv'
### --- with the trained model for the competition

### Train a model for the competition
- The code cells above train a Random Forest model on train_XF and train_yF.
- Use the code cell below to build a Random Forest model and train it on all of the training data (XM and ym).

In [26]:
# target and features
ym = df.SalePrice

fM_trin= '''MSSubClass,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,1stFlrSF,2ndFlrSF,LowQualFinSF,\
GrLivArea,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,TotRmsAbvGrd,Fireplaces,WoodDeckSF,OpenPorchSF,\
EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold'''
featuresM = fM_trin.split(',')

XM = df[featuresM]

# To improve accuracy, create a new Random Forest model which you will train on all training data
rf_model = RandomForestRegressor(n_estimators=135, max_features=6, random_state=rs)

# fit rf_model on all of the training data
rf_model.fit(XM, ym)

In [27]:
# path to file you will use for predictions: local 'test.csv'

# read test data file using pandas
test_data = pd.read_csv('test.csv')
#test_data

# create test_X, which comes from test_data but includes only the columns used for prediction.
# The list of columns is stored in the variable featuresM
test_X = test_data[featuresM]

# make predictions which we will submit. 
test_preds = rf_model.predict(test_X)

In [28]:
output = pd.DataFrame({'Id': test_data.Id,
                       'SalePrice': test_preds})
output.to_csv('submission.csv', index=False)
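
Before uploading, it can help to sanity-check the submission frame. The sketch below uses a tiny made-up frame in place of `output` (the Ids and prices are hypothetical) and writes to an in-memory buffer instead of `submission.csv`:

```python
import io
import pandas as pd

# Hypothetical stand-ins for test_data.Id and test_preds.
demo_ids = pd.Series([1001, 1002, 1003], name='Id')
demo_preds = [125000.0, 158000.5, 181250.0]

demo_output = pd.DataFrame({'Id': demo_ids, 'SalePrice': demo_preds})

# Basic checks before writing the file.
assert list(demo_output.columns) == ['Id', 'SalePrice']
assert demo_output['Id'].is_unique
assert demo_output['SalePrice'].notna().all()

# Write to an in-memory buffer, then read it back to confirm the layout.
buffer = io.StringIO()
demo_output.to_csv(buffer, index=False)
buffer.seek(0)
round_trip = pd.read_csv(buffer)
print(round_trip)
```

The same three checks could be run on `output` itself before calling `to_csv`.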
