# Introduction
Machine learning competitions are a great way to improve your data science skills and measure your progress. 

In this exercise, you will create and submit predictions for a Kaggle competition. You can then improve your model (e.g. by adding features) and see how you stack up against others taking this course.

The steps in this notebook are:
1. Build a Random Forest model with all of your data (**X** and **y**)
2. Read in the "test" data, which doesn't include values for the target.  Predict home values in the test data with your Random Forest model.
3. Submit those predictions to the competition and see your score.
4. Optionally, come back to see if you can improve your model by adding features or changing your model. Then you can resubmit to see how that stacks up on the competition leaderboard.
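
The first three steps can be sketched end to end as follows. This is a minimal illustration on synthetic data, not the competition pipeline: the `make_homes` helper, the two feature columns, and the price formula are all made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the competition files (hypothetical columns and values)
rng = np.random.default_rng(0)

def make_homes(n, first_id):
    return pd.DataFrame({
        "Id": np.arange(first_id, first_id + n),
        "LotArea": rng.integers(2000, 15000, size=n),
        "YearBuilt": rng.integers(1950, 2010, size=n),
    })

train = make_homes(200, first_id=1)
noise = rng.normal(0, 5000, size=len(train))
train["SalePrice"] = 50 * train["LotArea"] + 500 * (train["YearBuilt"] - 1950) + noise
test = make_homes(50, first_id=201)  # no SalePrice column, as in the real test file

features = ["LotArea", "YearBuilt"]

# Step 1: fit on all of the training data
model = RandomForestRegressor(n_estimators=50, random_state=1)
model.fit(train[features], train["SalePrice"])

# Step 2: predict the target for the test rows
test_preds = model.predict(test[features])

# Step 3: assemble the submission table (one row per test Id)
submission = pd.DataFrame({"Id": test["Id"], "SalePrice": test_preds})
print(submission.head())
```

In the real exercise the two frames come from `train.csv` and `test.csv`, and the submission table is written to disk with `to_csv(..., index=False)`.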

## Recap
Here's the code you've written so far. Start by running it again.

In [None]:
# Code you have previously used to load data
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.tree import DecisionTreeRegressor

#from xgboost import XGBRegressor
from xgboost.sklearn import XGBRegressor


import xgboost as xgb
import lightgbm as lgb


#from sklearn.impute import SimpleImputer
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import StandardScaler
from sklearn.base import BaseEstimator, TransformerMixin, RegressorMixin, clone




# Path of the file to read. We changed the directory structure to simplify submitting to a competition
iowa_file_path = '../input/train.csv'
test_data_path = '../input/test.csv'

train_data = pd.read_csv(iowa_file_path)
test_data = pd.read_csv(test_data_path)

# Create target object and call it y
#y = train_data.SalePrice

#train_data["OverallQual-s2"] = train_data["OverallQual"] ** 2
#train_data["OverallQual-s3"] = train_data["OverallQual"] ** 3
#train_data["OverallQual-Sq"] = np.sqrt(train_data["OverallQual"])

#test_data["OverallQual-s2"] = test_data["OverallQual"] ** 2
#test_data["OverallQual-s3"] = test_data["OverallQual"] ** 3
#test_data["OverallQual-Sq"] = np.sqrt(test_data["OverallQual"])

# remove 2 abnormal points (outliers, identified by Id)
train_data = train_data.drop(train_data[train_data['Id'] == 1299].index)
train_data = train_data.drop(train_data[train_data['Id'] == 524].index)
# drop Id column
#train_data.drop("Id", axis = 1, inplace = True)
#test_data.drop("Id", axis = 1, inplace = True)

#y = train_data.SalePrice
#y = np.log1p(y)
#print (y.head())

#candidate_X_predictors = train_data.drop(['Id', 'SalePrice'] + cols_with_missing, axis=1)
#candidate_test_predictors = test_data.drop(['Id'] + cols_with_missing, axis=1)


#total = train_data.isnull().sum().sort_values(ascending=False)
#percent = (train_data.isnull().sum()/train_data.isnull().count()).sort_values(ascending=False)
#missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])

#train_data = train_data.drop((missing_data[missing_data['Total'] > 1]).index,1)
#candidate_X_predictors = train_data.drop(train_data.loc[train_data['Electrical'].isnull()].index)
#candidate_X_predictors = candidate_X_predictors.drop(pd.Int64Index([1379], dtype='int64'))
#test_data = test_data.drop((missing_data[missing_data['Total'] > 1]).index,1)
#y = y.drop(pd.Int64Index([1379], dtype='int64'))
#y = candidate_X_predictors.SalePrice
#candidate_X_predictors = candidate_X_predictors.drop(['Id', 'SalePrice'], axis=1)
#candidate_test_predictors = test_data.drop(['Id'], axis=1)

#low_cardinality_cols = [cname for cname in candidate_X_predictors.columns if 
#                                candidate_X_predictors[cname].nunique() < 10 and
#                                candidate_X_predictors[cname].dtype == "object"]
#numeric_cols = [cname for cname in candidate_X_predictors.columns if 
#                                candidate_X_predictors[cname].dtype in ['int64', 'float64']]
#my_cols = low_cardinality_cols + numeric_cols
#train_predictors = candidate_X_predictors[my_cols]
#test_predictors = candidate_test_predictors[my_cols]

#print (train_predictors.shape, test_predictors.shape, y.shape)

#cols_with_missing = [col for col in train_predictors.columns 
#                                 if train_predictors[col].isnull().any()]  

#print (cols_with_missing)
#imputed_X_train_plus = train_predictors.copy()
#imputed_X_test_plus = test_predictors.copy()

#for col in cols_with_missing:
#    imputed_X_train_plus[col + '_was_missing'] = imputed_X_train_plus[col].isnull()
#    imputed_X_test_plus[col + '_was_missing'] = imputed_X_test_plus[col].isnull()
    
#train_predictors = imputed_X_train_plus
#test_predictors = imputed_X_test_plus

# The test set has some columns that were dropped from the train set, so fill in missing values to align it with the train set
#print (test_predictors.describe(), test_predictors.shape[1])
#my_imputer = SimpleImputer(strategy='most_frequent')
#my_imputer2 = SimpleImputer()
#imputed_train = pd.DataFrame(my_imputer.fit_transform(train_predictors.select_dtypes(include = ['O']))) # for object
#imputed_test = pd.DataFrame(my_imputer.fit_transform(test_predictors.select_dtypes(include = ['O']))) # for object
#imputed_train2 = pd.DataFrame(my_imputer2.fit_transform(train_predictors.select_dtypes(include = ['int64', 'float64']))) # for number
#imputed_test2 = pd.DataFrame(my_imputer2.fit_transform(test_predictors.select_dtypes(include = ['int64', 'float64']))) # for number
#imputed_train.columns = train_predictors.select_dtypes(include = ['O']).columns
#imputed_test.columns = test_predictors.select_dtypes(include = ['O']).columns
#imputed_train2.columns = train_predictors.select_dtypes(include = ['int64', 'float64']).columns
#imputed_test2.columns = test_predictors.select_dtypes(include = ['int64', 'float64']).columns

#tmp_obj = train_predictors.copy()
#for col in imputed_train.columns:
#    tmp_obj[col] = imputed_train[col]
#for col in imputed_test2.columns:
#    tmp_obj[col] = imputed_train2[col]
#train_predictors = tmp_obj

#tmp_obj = test_predictors.copy()
#for col in imputed_test.columns:
#    tmp_obj[col] = imputed_test[col]
#for col in imputed_test2.columns:
#    tmp_obj[col] = imputed_test2[col]
#test_predictors = tmp_obj

#skewness = train_predictors.select_dtypes(include = ['int64', 'float64']).apply(lambda x: skew(x))
#skewness = skewness[abs(skewness) > 0.5]
#print(str(skewness.shape[0]) + " skewed numerical features to log transform")
#skewed_features = skewness.index
#train_predictors[skewed_features] = np.log1p(train_predictors[skewed_features])

#skewness = test_predictors.select_dtypes(include = ['int64', 'float64']).apply(lambda x: skew(x))
#skewness = skewness[abs(skewness) > 0.5]
#print(str(skewness.shape[0]) + " skewed numerical features to log transform")
#skewed_features = skewness.index
#test_predictors[skewed_features] = np.log1p(test_predictors[skewed_features])

#numerical_features = train_predictors.select_dtypes(exclude = ["object"]).columns


#one_hot_encoded_training_predictors = pd.get_dummies(train_predictors)
#one_hot_encoded_test_predictors = pd.get_dummies(test_predictors)

#print ("Cond2 ", one_hot_encoded_training_predictors['Condition2_Norm'])
#cols_with_missing = [col for col in one_hot_encoded_test_predictors.columns 
#                                 if col.startswith("Condition2")] 


# I found that join='inner' outperforms the join='left' option
#X, final_test = one_hot_encoded_training_predictors.align(one_hot_encoded_test_predictors,
#                                                join='inner', 
#                                                axis=1)

# following code is only meaningful for join='left' option
#my_imputer3 = SimpleImputer(strategy='constant', fill_value=0)
#final_test = pd.DataFrame(my_imputer3.fit_transform(final_test))
#final_test.columns = X.columns


#cols_with_missing = [col for col in final_test.columns 
#                                 if final_test[col].isnull().any()] 
#print (cols_with_missing)

#print (one_hot_encoded_training_predictors.describe())
#print (one_hot_encoded_test_predictors.describe())
#print (X.describe())
#print (final_test.describe())

all_data = pd.concat((train_data.loc[:,'MSSubClass':'SaleCondition'],
                      test_data.loc[:,'MSSubClass':'SaleCondition']))
print (all_data.shape)

train_data["SalePrice"] = np.log1p(train_data["SalePrice"])

# Feature engineering (Clean up)
all_data["PoolQC"] = all_data["PoolQC"].fillna("None")
all_data["MiscFeature"] = all_data["MiscFeature"].fillna("None")
all_data["Alley"] = all_data["Alley"].fillna("None")
all_data["Fence"] = all_data["Fence"].fillna("None")
all_data["FireplaceQu"] = all_data["FireplaceQu"].fillna("None")
all_data["LotFrontage"] = all_data.groupby("Neighborhood")["LotFrontage"].transform(
    lambda x: x.fillna(x.median()))
all_data["GarageType"] = all_data["GarageType"].fillna("None")
all_data["GarageFinish"] = all_data["GarageFinish"].fillna("None")
all_data["GarageQual"] = all_data["GarageQual"].fillna("None")
all_data["GarageCond"] = all_data["GarageCond"].fillna("None")
for col in ('GarageYrBlt', 'GarageArea', 'GarageCars'):
    all_data[col] = all_data[col].fillna(0)
for col in ('BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF','TotalBsmtSF', 'BsmtFullBath', 'BsmtHalfBath'):
    all_data[col] = all_data[col].fillna(0)
for col in ('BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2'):
    all_data[col] = all_data[col].fillna('None')
all_data["MasVnrType"] = all_data["MasVnrType"].fillna("None")
all_data["MasVnrArea"] = all_data["MasVnrArea"].fillna(0)
all_data['MSZoning'] = all_data['MSZoning'].fillna(all_data['MSZoning'].mode()[0])
all_data = all_data.drop(['Utilities'], axis=1)
all_data["Functional"] = all_data["Functional"].fillna("Typ")
all_data['Electrical'] = all_data['Electrical'].fillna(all_data['Electrical'].mode()[0])
all_data['KitchenQual'] = all_data['KitchenQual'].fillna(all_data['KitchenQual'].mode()[0])
all_data['Exterior1st'] = all_data['Exterior1st'].fillna(all_data['Exterior1st'].mode()[0])
all_data['Exterior2nd'] = all_data['Exterior2nd'].fillna(all_data['Exterior2nd'].mode()[0])
all_data['SaleType'] = all_data['SaleType'].fillna(all_data['SaleType'].mode()[0])
all_data['MSSubClass'] = all_data['MSSubClass'].fillna("None")

all_data_na = (all_data.isnull().sum() / len(all_data)) * 100
all_data_na = all_data_na.drop(all_data_na[all_data_na == 0].index).sort_values(ascending=False)
missing_data = pd.DataFrame({'Missing Ratio' :all_data_na})
print (missing_data.head())

# Feature engineering (Transform)
#MSSubClass=The building class
all_data['MSSubClass'] = all_data['MSSubClass'].apply(str)

#Changing OverallCond into a categorical variable
all_data['OverallCond'] = all_data['OverallCond'].astype(str)

#Year and month sold are transformed into categorical features.
all_data['YrSold'] = all_data['YrSold'].astype(str)
all_data['MoSold'] = all_data['MoSold'].astype(str)

all_data['YearBuilt'] = all_data['YearBuilt'].astype(str)
all_data['YearRemodAdd'] = all_data['YearRemodAdd'].astype(str)
#all_data['GarageYrBlt'] = all_data['GarageYrBlt'].astype(str) #seems column with NaN is treated as float number


from sklearn.preprocessing import LabelEncoder
cols = ('FireplaceQu', 'BsmtQual', 'BsmtCond', 'GarageQual', 'GarageCond', 
        'ExterQual', 'ExterCond','HeatingQC', 'PoolQC', 'KitchenQual', 'BsmtFinType1', 
        'BsmtFinType2', 'Functional', 'Fence', 'BsmtExposure', 'GarageFinish', 'LandSlope',
        'LotShape', 'PavedDrive', 'Street', 'Alley', 'CentralAir', 'MSSubClass', 'OverallCond', 
        'YrSold', 'MoSold', 'YearBuilt', 'YearRemodAdd')
# process columns, apply LabelEncoder to categorical features
for c in cols:
    lbl = LabelEncoder() 
    lbl.fit(list(all_data[c].values)) 
    all_data[c] = lbl.transform(list(all_data[c].values))

# shape 
print('Shape all_data: {}'.format(all_data.shape))


# Adding total sqfootage feature 
all_data['TotalSF'] = pd.Series(all_data['TotalBsmtSF'] + all_data['1stFlrSF'] + all_data['2ndFlrSF'])


numeric_feats = all_data.dtypes[all_data.dtypes != "object"].index

skewed_feats = all_data[numeric_feats].apply(lambda x: skew(x.dropna())) #compute skewness
skewed_feats = skewed_feats[abs(skewed_feats) > 0.5]
skewed_feats = skewed_feats.index

# Use the Box-Cox transform (boxcox1p) instead of log1p
print("There are {} skewed numerical features to Box Cox transform".format(skewed_feats.shape[0]))
from scipy.special import boxcox1p
lam = 0.15
for feat in skewed_feats:
    #all_data[feat] += 1
    all_data[feat] = boxcox1p(all_data[feat], lam)
#all_data[skewed_feats] = np.log1p(all_data[skewed_feats])

all_data = pd.get_dummies(all_data)
all_data = all_data.fillna(all_data.mean())

# make sure there is no missing data
all_data_na = (all_data.isnull().sum() / len(all_data)) * 100
all_data_na = all_data_na.drop(all_data_na[all_data_na == 0].index).sort_values(ascending=False)
missing_data = pd.DataFrame({'Missing Ratio' :all_data_na})
print (missing_data.head())


# Copy the slices so the scaling below modifies independent frames
X_train = all_data[:train_data.shape[0]].copy()
X_test = all_data[train_data.shape[0]:].copy()
y = train_data.SalePrice

# try to find outliers
#import statsmodels.api as sm

#ols = sm.OLS(endog = y, exog = X_train)
#fit = ols.fit()
#test2 = fit.outlier_test()['bonf(p)']
#outliers = list(test2[test2<1e-3].index) 
outliers = [462, 632, 1324, 1370, 1453]
print (outliers)

X_train = X_train.drop(X_train.index[outliers])
y = y.drop(y.index[outliers])

# find big mean columns and then normalize them with standard deviation
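# Select only the columns whose mean exceeds 100
# (taking .index of the boolean mask would return every column)
col_means = X_train.mean()
big_col = col_means[col_means > 100].index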
tmp_mean = X_train[big_col].mean().sort_values(ascending=False)
print (tmp_mean.head(5))


stdSc = StandardScaler()
X_train.loc[:, ["GarageArea"]] = stdSc.fit_transform(X_train.loc[:, ["GarageArea"]])
X_test.loc[:, ["GarageArea"]] = stdSc.transform(X_test.loc[:, ["GarageArea"]])


from sklearn.linear_model import Ridge, RidgeCV, ElasticNet, ElasticNetCV, Lasso, LassoCV, LassoLarsCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.kernel_ridge import KernelRidge


#def rmse_cv(model):
#    rmse= np.sqrt(-cross_val_score(model, X_train, y, scoring="neg_mean_squared_error", cv = 5))
#    return(rmse)

def rmse_cv(model):
    # Pass the KFold splitter itself; get_n_splits() would return the integer 5
    # and silently discard shuffle and random_state
    kf = KFold(n_splits=5, shuffle=True, random_state=42)
    rmse = np.sqrt(-cross_val_score(model, X_train.values, y.values, scoring="neg_mean_squared_error", cv=kf))
    return rmse

# Lasso - L1
model_lasso = LassoCV(alphas = [0.0001, 0.0003, 0.0005, 0.0006, 0.001, 0.003, 0.006, 0.01, 0.03, 0.06, 0.1, 
                          0.3, 0.6, 1], 
                max_iter = 50000, cv = 5)
model_lasso.fit(X_train, y)
alpha = model_lasso.alpha_
print("Best alpha :", alpha)

print("Try again for more precision with alphas centered around " + str(alpha))
model_lasso = LassoCV(alphas = [alpha * .6, alpha * .65, alpha * .7, alpha * .75, alpha * .8, 
                          alpha * .85, alpha * .9, alpha * .95, alpha, alpha * 1.05, 
                          alpha * 1.1, alpha * 1.15, alpha * 1.25, alpha * 1.3, alpha * 1.35, 
                          alpha * 1.4], 
                max_iter = 50000, cv = 5)
model_lasso.fit(X_train, y)
alpha = model_lasso.alpha_
print("Best alpha :", alpha)

#coef = pd.Series(model_lasso.coef_, index = X_train.columns)
#print("Lasso picked " + str(sum(coef != 0)) + " variables and eliminated the other " +  str(sum(coef == 0)) + " variables")

#print("Lasso RMSE on Training set : ", rmse_cv(model_lasso).mean())
# make more robust to outliers
p_lasso = make_pipeline(RobustScaler(), Lasso(alpha =alpha))
print ("Lasso RMSE on Training set with RobustScaler : ", rmse_cv(p_lasso).mean())

# Ridge - L2
ridge = RidgeCV(alphas = [0.01, 0.03, 0.06, 0.1, 0.3, 0.6, 1, 3, 6, 10, 20, 40, 60])
ridge.fit(X_train, y)
alpha = ridge.alpha_
print("Best alpha :", alpha)

print("Try again for more precision with alphas centered around " + str(alpha))
ridge = RidgeCV(alphas = [alpha * .6, alpha * .65, alpha * .7, alpha * .75, alpha * .8, alpha * .85, 
                          alpha * .9, alpha * .95, alpha, alpha * 1.05, alpha * 1.1, alpha * 1.15,
                          alpha * 1.25, alpha * 1.3, alpha * 1.35, alpha * 1.4, alpha * 1.5], 
                cv = 5)
ridge.fit(X_train, y)
alpha = ridge.alpha_
print("Best alpha :", alpha)

#print("Ridge RMSE on Training set :", rmse_cv(ridge).mean())
#y_train_rdg = ridge.predict(X_train)
p_ridge = make_pipeline(RobustScaler(), Ridge(alpha=alpha))
print("Ridge RMSE on Training set with RobustScaler :", rmse_cv(p_ridge).mean())


# ElasticNet - L1 & L2
elasticNet = ElasticNetCV(l1_ratio = [0.1, 0.3, 0.5, 0.6, 0.7, 0.8, 0.85, 0.9, 0.95, 1],
                          alphas = [0.0001, 0.0003, 0.0006, 0.001, 0.003, 0.006, 
                                    0.01, 0.03, 0.06, 0.1, 0.3, 0.6, 1, 3, 6], 
                          max_iter = 50000, cv = 5)
elasticNet.fit(X_train, y)
alpha = elasticNet.alpha_
ratio = elasticNet.l1_ratio_
print("Best l1_ratio :", ratio)
print("Best alpha :", alpha )

print("Try again for more precision with l1_ratio centered around " + str(ratio))
# Clip candidates at 1.0 so that every l1_ratio stays within [0, 1]
ratio_grid = [min(ratio * m, 1.0) for m in (.85, .9, .95, 1, 1.05, 1.1, 1.15)]
elasticNet = ElasticNetCV(l1_ratio = ratio_grid,
                          alphas = [0.0001, 0.0003, 0.0006, 0.001, 0.003, 0.006, 0.01, 0.03, 0.06, 0.1, 0.3, 0.6, 1, 3, 6], 
                          max_iter = 50000, cv = 5)
elasticNet.fit(X_train, y)
if (elasticNet.l1_ratio_ > 1):
    elasticNet.l1_ratio_ = 1    
alpha = elasticNet.alpha_
ratio = elasticNet.l1_ratio_
print("Best l1_ratio :", ratio)
print("Best alpha :", alpha )

print("Now try again for more precision on alpha, with l1_ratio fixed at " + str(ratio) + 
      " and alpha centered around " + str(alpha))
elasticNet = ElasticNetCV(l1_ratio = ratio,
                          alphas = [alpha * .6, alpha * .65, alpha * .7, alpha * .75, alpha * .8, alpha * .85, alpha * .9, 
                                    alpha * .95, alpha, alpha * 1.05, alpha * 1.1, alpha * 1.15, alpha * 1.25, alpha * 1.3, 
                                    alpha * 1.35, alpha * 1.4], 
                          max_iter = 50000, cv = 5)
elasticNet.fit(X_train, y)
if (elasticNet.l1_ratio_ > 1):
    elasticNet.l1_ratio_ = 1    
alpha = elasticNet.alpha_
ratio = elasticNet.l1_ratio_
print("Best l1_ratio :", ratio)
print("Best alpha :", alpha )

print("ElasticNet RMSE on Training set :", rmse_cv(elasticNet).mean())
#p_ENet = make_pipeline(RobustScaler(), ElasticNet(alpha=alpha, l1_ratio=ratio, random_state=3))
ENet = make_pipeline(RobustScaler(), ElasticNet(alpha=alpha, l1_ratio=ratio, random_state=3))
print("ElasticNet RMSE on Training set with RobustScaler:", rmse_cv(ENet).mean())


# KRR
KRR = KernelRidge(alpha=0.84, kernel='polynomial', degree=2, coef0=2.5)
print("KernelRidge RMSE on Training set :", rmse_cv(KRR).mean())

# Split into validation and training data
#train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

#stdSc = StandardScaler()
#train_X.loc[:, numerical_features] = stdSc.fit_transform(train_X.loc[:, numerical_features])
#val_X.loc[:, numerical_features] = stdSc.transform(val_X.loc[:, numerical_features])
#X.loc[:, numerical_features] = stdSc.transform(X.loc[:, numerical_features])
#final_test.loc[:, numerical_features] = stdSc.transform(final_test.loc[:, numerical_features])


# Specify Model
#iowa_model = DecisionTreeRegressor(random_state=1)
# Fit Model
#iowa_model.fit(train_X, train_y)

# Make validation predictions and calculate mean absolute error
#val_predictions = iowa_model.predict(val_X)
#val_mae = mean_absolute_error(val_predictions, val_y)
#print("Validation MAE when not specifying max_leaf_nodes: {:,.4f}".format(val_mae))

# Using best value for max_leaf_nodes
#iowa_model = DecisionTreeRegressor(max_leaf_nodes=100, random_state=1)
#iowa_model.fit(train_X, train_y)
#val_predictions = iowa_model.predict(val_X)
#val_mae = mean_absolute_error(val_predictions, val_y)
#print("Validation MAE for best value of max_leaf_nodes: {:,.4f}".format(val_mae))

# Define the model. Set random_state to 1
#rf_model = RandomForestRegressor(random_state=1)
#rf_model.fit(X_train, y)
#rf_val_predictions = rf_model.predict(val_X)
#rf_val_mae = mean_absolute_error(rf_val_predictions, val_y)

#print("Validation MAE for Random Forest Model: {:,.4f}".format(rmse_cv(rf_model).mean()))

# add XGBoost algo
#xgb_model = XGBRegressor(n_estimators=1000, learning_rate=0.05, random_state=1)
xgb_model = xgb.XGBRegressor(
 learning_rate =0.01,
 n_estimators=2407,
 max_depth=3,
 min_child_weight=3,
 gamma=0,
 subsample=0.8,
 colsample_bytree=0.8,
 reg_alpha = 1.5e-05,
 objective= 'reg:squarederror',  # 'reg:linear' is the deprecated name of this objective
 nthread=4,
 scale_pos_weight=1,
 seed=27)
#xgb_model.fit(X_train, y, early_stopping_rounds=5, eval_set=[(val_X, val_y)], verbose=False)
#print("XGBRegressor RMSE on Training set :", rmse_cv(xgb_model).mean())
#xgb_val_predictions = xgb_model.predict(val_X)
#xgb_val_mae = mean_absolute_error(np.expm1(xgb_val_predictions), np.expm1(val_y))
#print("Validation MAE for XGBoost Model: {:,.5f}".format(xgb_val_mae))


model_lgb = lgb.LGBMRegressor(objective='regression',num_leaves=5,
                              learning_rate=0.05, n_estimators=720,
                              max_bin = 55, bagging_fraction = 0.8,
                              bagging_freq = 5, feature_fraction = 0.2319,
                              feature_fraction_seed=9, bagging_seed=9,
                              min_data_in_leaf =6, min_sum_hessian_in_leaf = 11)

#dtrain = xgb.DMatrix(X_train, label = y)
#dtest = xgb.DMatrix(X_test)
#params = {"max_depth":2, "eta":0.1}
#model = xgb.cv(params, dtrain,  num_boost_round=500, early_stopping_rounds=100)


# Creating a Model For the Competition

Build a Random Forest model and train it on all of **X** and **y**.  
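
A common pattern behind this step is to estimate the error on a hold-out split first and then refit the same configuration on every row. The sketch below illustrates that pattern on synthetic data; the arrays `X` and `y` and the `n_estimators=50` setting are assumptions made for the example, not values from this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for X and y (hypothetical)
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(300, 3))
y = 3 * X[:, 0] + X[:, 1] ** 2 + rng.normal(0, 0.5, size=300)

# Estimate the error on rows the model has not seen
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
rf_model = RandomForestRegressor(n_estimators=50, random_state=1)
rf_model.fit(train_X, train_y)
val_mae = mean_absolute_error(val_y, rf_model.predict(val_X))
print("Validation MAE: {:.3f}".format(val_mae))

# Refit the same configuration on every row before predicting the test set
rf_model_on_full_data = RandomForestRegressor(n_estimators=50, random_state=1)
rf_model_on_full_data.fit(X, y)
```

The validation score describes the model fitted on the smaller split; the refitted model sees more data, so its competition score may differ somewhat.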

In [None]:
# To improve accuracy, create a new Random Forest model which you will train on all training data
#rf_model_on_full_data = RandomForestRegressor(random_state=1)
#xgb_model_on_full_data = XGBRegressor(n_estimators=1000, learning_rate=0.05, random_state=1)

# fit rf_model_on_full_data on all data from the 
#rf_model_on_full_data.fit(X, y)
#xgb_model_on_full_data.fit(X, y, early_stopping_rounds=5, eval_set=[(val_X, val_y)], verbose=False)


#model_xgb = xgb.XGBRegressor(n_estimators=360, max_depth=2, learning_rate=0.1) #the params were tuned using xgb.cv
#model_xgb = XGBRegressor(n_estimators=1000, max_depth=2, learning_rate=0.05) #the params were tuned using xgb.cv


class StackingAveragedModels(BaseEstimator, RegressorMixin, TransformerMixin):
    def __init__(self, base_models, meta_model, n_folds=5):
        self.base_models = base_models
        self.meta_model = meta_model
        self.n_folds = n_folds
   
    # We again fit the data on clones of the original models
    def fit(self, X, y):
        self.base_models_ = [list() for x in self.base_models]
        self.meta_model_ = clone(self.meta_model)
        kfold = KFold(n_splits=self.n_folds, shuffle=True, random_state=156)
        
        # Train cloned base models then create out-of-fold predictions
        # that are needed to train the cloned meta-model
        out_of_fold_predictions = np.zeros((X.shape[0], len(self.base_models)))
        for i, model in enumerate(self.base_models):
            for train_index, holdout_index in kfold.split(X, y):
                instance = clone(model)
                self.base_models_[i].append(instance)
                instance.fit(X[train_index], y[train_index])
                y_pred = instance.predict(X[holdout_index])
                out_of_fold_predictions[holdout_index, i] = y_pred
                
        # Now train the cloned  meta-model using the out-of-fold predictions as new feature
        self.meta_model_.fit(out_of_fold_predictions, y)
        return self
   
    #Do the predictions of all base models on the test data and use the averaged predictions as 
    #meta-features for the final prediction which is done by the meta-model
    def predict(self, X):
        meta_features = np.column_stack([
            np.column_stack([model.predict(X) for model in base_models]).mean(axis=1)
            for base_models in self.base_models_ ])
        return self.meta_model_.predict(meta_features)


def rmsle(y, y_pred):
    return np.sqrt(mean_squared_error(y, y_pred))



stacked_averaged_models = StackingAveragedModels(base_models = (ENet, p_ridge, KRR),
                                                 meta_model = p_lasso)
print ("Evaluating ...")
#score = rmse_cv(stacked_averaged_models)
#print("Stacking Averaged models score: ", score.mean())
stacked_averaged_models.fit(X_train.values, y.values)
stacked_train_pred = stacked_averaged_models.predict(X_train.values)
stacked_preds = np.expm1(stacked_averaged_models.predict(X_test.values))
print(rmsle(y, stacked_train_pred))

model_lgb.fit(X_train.values, y.values)
lgb_train_pred = model_lgb.predict(X_train.values)
lgb_pred = np.expm1(model_lgb.predict(X_test.values))
print(rmsle(y, lgb_train_pred))


print ("Starting fit ...")
#p_lasso.fit(X_train, y)
xgb_model.fit(X_train, y)
#stacked_averaged_models.fit(X_train.values, y.values)

#print (rmse_cv(model_lasso).mean())
#print (rmse_cv(xgb_model).mean())

#predictions = pd.DataFrame({"xgb":xgb_preds, "lasso":lasso_preds})
#predictions.plot(x = "xgb", y = "lasso", kind = "scatter")

#lasso_preds = np.expm1(p_lasso.predict(X_test))
#elasticNet_preds = np.expm1(elasticNet.predict(X_test))
xgb_preds = np.expm1(xgb_model.predict(X_test))


preds = 0.1*lgb_pred + 0.2*xgb_preds + 0.7*stacked_preds
#preds = stacked_preds

solution = pd.DataFrame({"Id":test_data.Id, "SalePrice":preds})
solution.to_csv("ridge_sol.csv", index = False)
print ("Done")

# Make Predictions
Read the file of "test" data and apply your model to make predictions.
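
The recap cell trains on `np.log1p(SalePrice)`, so model outputs are on the log scale and need `np.expm1` before they are written out. A minimal sketch of that round trip, using made-up prices and Ids and an in-memory buffer in place of a real file (all of these are illustrative assumptions):

```python
import io

import numpy as np
import pandas as pd

# Hypothetical sale prices and test Ids
prices = np.array([120000.0, 185500.0, 310000.0])
test_ids = pd.Series([1461, 1462, 1463], name="Id")

# Training target on the log scale, as done with np.log1p in the recap
log_prices = np.log1p(prices)

# Pretend these are model outputs; they live on the log scale too
log_preds = log_prices + np.array([0.01, -0.02, 0.0])

# Undo the transform before writing the submission
preds = np.expm1(log_preds)

output = pd.DataFrame({"Id": test_ids, "SalePrice": preds})
buffer = io.StringIO()  # in-memory stand-in for submission.csv
output.to_csv(buffer, index=False)
print(buffer.getvalue())
```

Skipping the `np.expm1` step would submit log-prices instead of prices, which scores very poorly.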

In [None]:
#from sklearn.impute import SimpleImputer
#import numpy as np

# X_test holds the preprocessed test rows built in the recap cell
# (final_test is only defined in commented-out code)
test_X = X_test
# create test_X which comes from test_data but includes only the columns you used for prediction.
# The list of columns is stored in a variable called features
#test_X = test_data[features]
print (test_X.describe())

# make predictions which we will submit. 
#test_preds = rf_model_on_full_data.predict(test_X)
# xgb_model was fitted on all training rows above; xgb_model_on_full_data is never defined
test_preds = xgb_model.predict(test_X)
test_preds = np.expm1(test_preds)

# The lines below show you how to save your data in the format needed to score it in the competition
output = pd.DataFrame({'Id': test_data.Id,
                       'SalePrice': test_preds})

output.to_csv('submission.csv', index=False)

# Test Your Work
After filling in the code above:
1. Click the **Commit and Run** button. 
2. After your code has finished running, click the small double brackets **<<** in the upper left of your screen.  This brings you into view mode of the same page. You will need to scroll down to get back to these instructions.
3. Go to the output tab at top of your screen. Select the button to submit your file to the competition.  
4. If you want to keep working to improve your model, select the edit button. Then you can change your model and repeat the process.

Congratulations, you've started competing in Machine Learning competitions.

# Continuing Your Progress
There are many ways to improve your model, and **experimenting is a great way to learn at this point.**

The best way to improve your model is to add features.  Look at the list of columns and think about what might affect home prices.  Some features will cause errors because of issues like missing values or non-numeric data types. 
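
To make the last point concrete, here is a small sketch of spotting and handling both problems with pandas. The `homes` table is hypothetical, and median filling plus one-hot encoding is only one reasonable choice among several.

```python
import numpy as np
import pandas as pd

# Hypothetical slice of home data with a missing value and a text column
homes = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0, 70.0],
    "YearBuilt": [2003, 1976, 2001, 1915],
    "Street": ["Pave", "Pave", "Grvl", "Pave"],
})

# Find the columns that many models cannot use directly
cols_with_missing = [c for c in homes.columns if homes[c].isnull().any()]
text_cols = list(homes.select_dtypes(exclude="number").columns)
print(cols_with_missing, text_cols)

# Fill numeric gaps with the column median, then one-hot encode text columns
prepared = homes.copy()
for c in cols_with_missing:
    prepared[c] = prepared[c].fillna(prepared[c].median())
prepared = pd.get_dummies(prepared, columns=text_cols)
print(prepared)
```

After this step every column is numeric (or boolean) and free of missing values, so the frame can be passed to a model.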

Level 2 of this course will teach you how to handle these types of features. You will also learn to use **xgboost**, a technique that often gives better accuracy than Random Forest.


# Other Courses
The **[Pandas course](https://kaggle.com/Learn/Pandas)** will give you the data manipulation skills to quickly go from conceptual idea to implementation in your data science projects. 

You are also ready for the **[Deep Learning](https://kaggle.com/Learn/Deep-Learning)** course, where you will build models with better-than-human level performance at computer vision tasks.

---
**[Course Home Page](https://www.kaggle.com/learn/machine-learning)**

**[Learn Discussion Forum](https://kaggle.com/learn-forum)**
