In [None]:
#Importing libraries
import matplotlib.pyplot as plt
import seaborn as sns
import xgboost
import pandas as pd
import numpy as np
from scipy.stats import norm
import warnings
warnings.filterwarnings('ignore')

### Importing test and train datasets

In [None]:
train = pd.read_csv('./data/train.csv',index_col = 'Id')
test = pd.read_csv('./data/test.csv',index_col = 'Id')

##### Shape of the Datasets

In [None]:
test.shape

In [None]:
train.shape

### Getting information about dataset

In [None]:
#train.info()
#test.info()

In [None]:
train.head()

List of all the features

In [None]:
train.columns

**Information about Features**

* SalePrice - the property's sale price in dollars. This is the target variable that you're trying to predict.
* MSSubClass: The building class
* MSZoning: The general zoning classification
* LotFrontage: Linear feet of street connected to property
* LotArea: Lot size in square feet
* Street: Type of road access
* Alley: Type of alley access
* LotShape: General shape of property
* LandContour: Flatness of the property
* Utilities: Type of utilities available
* LotConfig: Lot configuration
* LandSlope: Slope of property
* Neighborhood: Physical locations within Ames city limits
* Condition1: Proximity to main road or railroad
* Condition2: Proximity to main road or railroad (if a second is present)
* BldgType: Type of dwelling
* HouseStyle: Style of dwelling
* OverallQual: Overall material and finish quality
* OverallCond: Overall condition rating
* YearBuilt: Original construction date
* YearRemodAdd: Remodel date
* RoofStyle: Type of roof
* RoofMatl: Roof material
* Exterior1st: Exterior covering on house
* Exterior2nd: Exterior covering on house (if more than one material)
* MasVnrType: Masonry veneer type
* MasVnrArea: Masonry veneer area in square feet
* ExterQual: Exterior material quality
* ExterCond: Present condition of the material on the exterior
* Foundation: Type of foundation
* BsmtQual: Height of the basement
* BsmtCond: General condition of the basement
* BsmtExposure: Walkout or garden level basement walls
* BsmtFinType1: Quality of basement finished area
* BsmtFinSF1: Type 1 finished square feet
* BsmtFinType2: Quality of second finished area (if present)
* BsmtFinSF2: Type 2 finished square feet
* BsmtUnfSF: Unfinished square feet of basement area
* TotalBsmtSF: Total square feet of basement area
* Heating: Type of heating
* HeatingQC: Heating quality and condition
* CentralAir: Central air conditioning
* Electrical: Electrical system
* 1stFlrSF: First Floor square feet
* 2ndFlrSF: Second floor square feet
* LowQualFinSF: Low quality finished square feet (all floors)
* GrLivArea: Above grade (ground) living area square feet
* BsmtFullBath: Basement full bathrooms
* BsmtHalfBath: Basement half bathrooms
* FullBath: Full bathrooms above grade
* HalfBath: Half baths above grade
* BedroomAbvGr: Number of bedrooms above basement level
* KitchenAbvGr: Number of kitchens
* KitchenQual: Kitchen quality
* TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
* Functional: Home functionality rating
* Fireplaces: Number of fireplaces
* FireplaceQu: Fireplace quality
* GarageType: Garage location
* GarageYrBlt: Year garage was built
* GarageFinish: Interior finish of the garage
* GarageCars: Size of garage in car capacity
* GarageArea: Size of garage in square feet
* GarageQual: Garage quality
* GarageCond: Garage condition
* PavedDrive: Paved driveway
* WoodDeckSF: Wood deck area in square feet
* OpenPorchSF: Open porch area in square feet
* EnclosedPorch: Enclosed porch area in square feet
* 3SsnPorch: Three season porch area in square feet
* ScreenPorch: Screen porch area in square feet
* PoolArea: Pool area in square feet
* PoolQC: Pool quality
* Fence: Fence quality
* MiscFeature: Miscellaneous feature not covered in other categories
* MiscVal: Dollar value of miscellaneous feature
* MoSold: Month Sold
* YrSold: Year Sold
* SaleType: Type of sale
* SaleCondition: Condition of sale


### To get the descriptive statistics about the data

In [None]:
train.describe()

### Selecting numerical and categorical features

In [None]:
num_feat=train.select_dtypes(include=[np.number])
cat_feat=train.select_dtypes(include=['object'])
print('Numerical Features:\n',num_feat.dtypes,'\n')
print('Categorical Features:\n',cat_feat.dtypes)

### Target Analysis

In [None]:
plt.figure(figsize=(12,7))
sns.distplot(train['SalePrice'],fit = norm);

In [None]:
train['SalePrice'].describe()

#### Skewness:
If the bulk of the data is at the left and the right tail is
longer, we say that the distribution is skewed right or **positively skewed**;
if the peak is toward the right and the left tail is longer, we say that the
distribution is skewed left or **negatively skewed**.

#### Kurtosis:
Kurtosis measures how heavy the tails of a distribution are relative to
the rest of the data; it is often described through the height and
sharpness of the peak. **Higher values indicate heavier tails (and
usually a sharper peak); lower values indicate lighter tails (and a
flatter peak).**

A **normal distribution has kurtosis exactly 3** (excess kurtosis exactly
0). Any distribution with kurtosis ≈3 (excess ≈0) is called
mesokurtic.
A distribution with kurtosis <3 (excess kurtosis <0) is called
**platykurtic**. Compared to a normal distribution, its central peak is
lower and broader, and its tails are shorter and thinner.
A distribution with kurtosis >3 (excess kurtosis >0) is called
**leptokurtic**. Compared to a normal distribution, its central peak is
higher and sharper, and its tails are longer and fatter.

***So we can see that our Target variable is positively skewed and is leptokurtic.***

In [None]:
#Skewness and Kurtosis for Target Variable (pandas kurt() returns excess kurtosis, so compare it with 0)
print('Skewness :',train['SalePrice'].skew())
print('Kurtosis :',train['SalePrice'].kurt())
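
As a small illustration of the definitions above, the sketch below computes both statistics on a synthetic right-skewed sample. The log-normal sample and all variable names are assumptions made for the example, not part of the housing data. It also shows the two kurtosis conventions side by side: pandas reports excess kurtosis, while `scipy.stats.kurtosis` can report either one.

```python
import numpy as np
import pandas as pd
from scipy.stats import kurtosis

# Synthetic right-skewed sample (illustrative only, not the housing data)
rng = np.random.default_rng(0)
sample = pd.Series(rng.lognormal(mean=0.0, sigma=0.5, size=5000))

skewness = sample.skew()      # positive for a right-skewed sample
excess_kurt = sample.kurt()   # pandas reports excess kurtosis (normal = 0)

# scipy supports both conventions: fisher=True -> excess, fisher=False -> Pearson
excess_scipy = kurtosis(sample, fisher=True)
pearson_scipy = kurtosis(sample, fisher=False)   # equals excess + 3

print('Skewness        :', skewness)
print('Excess kurtosis :', excess_kurt)
print('Pearson kurtosis:', pearson_scipy)
```

A positive excess kurtosis here corresponds to the leptokurtic case described above.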

### To reduce the skewness we'll take the log of SalePrice
**The skewness of log(SalePrice) is 0.12, which is close to the value of 0 expected for a normal distribution**

In [None]:
plt.figure(figsize = (12,7))
sns.distplot(np.log(train.SalePrice),fit = norm);
print('Skewness = ',np.log(train.SalePrice).skew())

### Checking Correlations

In [None]:
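# restrict to numeric columns: recent pandas versions raise an error when corr() meets text columns
train.select_dtypes(include=[np.number]).corr()['SalePrice'].sort_values(ascending=False)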

#### List of highly correlated features: here we'll visualize them and clean the outliers

In [None]:
features = ['OverallQual','YearBuilt','YearRemodAdd','TotalBsmtSF','1stFlrSF',
                    'GrLivArea','FullBath','TotRmsAbvGrd','GarageCars','GarageArea']

In [None]:
plt.figure(figsize = (15,15))
sns.heatmap(train[features].corr(),annot = True,linewidths = 0.5,cmap='cubehelix_r');
plt.savefig('Correlation Heatmap.png')

# Data Visualization

In [None]:
#Plotting regression plot for GrLivArea
plt.figure(figsize = (10,7))
sns.regplot(x='GrLivArea',y='SalePrice',data=train,color = 'red');


**Outliers affect the mean value of the data but have little effect on the median or mode of a given set of data.**
We can see that there are outliers: houses with GrLivArea above 4000 that sold for less than 300,000. We'll remove them.
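
The claim about the mean and the median can be checked with a few numbers. The prices below are hypothetical values chosen for the example, not rows from the dataset.

```python
import numpy as np

# Hypothetical sale prices (illustrative values, not taken from the dataset)
prices = np.array([120000, 135000, 150000, 160000, 175000])
with_outlier = np.append(prices, 2000000)   # add one extreme value

print('Mean  :', prices.mean(), '->', with_outlier.mean())
print('Median:', np.median(prices), '->', np.median(with_outlier))
```

A single extreme value roughly triples the mean in this toy sample, while the median barely moves.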

In [None]:
train = train.drop(train[(train['GrLivArea']>4000) & (train['SalePrice']<300000)].index)
plt.figure(figsize = (10,7))
sns.regplot(x='GrLivArea',y='SalePrice',data=train,color = 'red');

In [None]:
#Visualizing Garage Area
plt.figure(figsize=(10,7))
sns.regplot(x='GarageArea',y='SalePrice',data=train,color='green');

**The error seems to have constant variance up to GarageArea=1000, but beyond that the points are widely dispersed, which can cause serious problems in the analysis. So we'll remove the outliers with a GarageArea of 1200 or more.**

In [None]:
#Removing Outliers from the GarageArea
train = train[train['GarageArea']<1200]
plt.figure(figsize=(10,7))
sns.regplot(x='GarageArea',y='SalePrice',data=train,color='green');

In [None]:
#Visualizing TotalBsmtSF
plt.figure(figsize=(10,7))
sns.regplot(x='TotalBsmtSF',y='SalePrice',data=train,color='Red');

**Combined '1stFlrSF' and '2ndFlrSF' has a stronger relationship with SalePrice than either of them alone**

In [None]:
plt.figure(figsize=(15,7))
plt.subplot(221)
sns.regplot(x='1stFlrSF',y='SalePrice',data=train,color = 'Brown');
plt.subplot(222)
sns.regplot(x='2ndFlrSF',y='SalePrice',data=train,color = 'Brown');
plt.subplot(223)
sns.regplot(x=train['1stFlrSF'] + train['2ndFlrSF'],y=train['SalePrice']);


In [None]:
plt.figure(figsize=(10,7))
sns.regplot(x='LotFrontage',y='SalePrice',data=train);  #we can see the outliers here

In [None]:
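# Removing outliers; rows with a missing LotFrontage are kept so they can be imputed later
train = train[(train['LotFrontage']<200) | (train['LotFrontage'].isnull())]
plt.figure(figsize=(10,7))
sns.regplot(x='LotFrontage',y='SalePrice',data=train);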
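
One detail worth keeping in mind with threshold filters like this one: a comparison against NaN evaluates to False, so a plain `column < 200` mask also drops every row where the value is missing. The toy frame below is an assumption made for illustration and only sketches the difference.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing and one extreme LotFrontage value (illustrative)
df = pd.DataFrame({'LotFrontage': [60.0, np.nan, 313.0, 80.0]})

# NaN < 200 is False, so the missing row disappears together with the outlier
strict = df[df['LotFrontage'] < 200]

# Adding an explicit isna() term keeps the missing row for later imputation
keep_missing = df[(df['LotFrontage'] < 200) | df['LotFrontage'].isna()]

print(len(strict), len(keep_missing))
```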

### This plot shows that the median Sale Price increases as the Overall Quality of a house increases.
**Also, the most common overall quality rating is 5.**

In [None]:
plt.figure(figsize= (15,7))
plt.subplot(121)
sns.boxplot(x=train['OverallQual'], y=train['SalePrice']);
plt.subplot(122)
train['OverallQual'].value_counts().plot(kind="bar");
plt.savefig('OverallQual Vs SalePrice.png')

In [None]:
plt.figure(figsize= (20,8))
plt.subplot(121)
sns.boxplot(x=train['TotRmsAbvGrd'], y=train['SalePrice']);
sns.stripplot(x=train["TotRmsAbvGrd"],y=train["SalePrice"], jitter=True, edgecolor="gray")
plt.subplot(122)
train['TotRmsAbvGrd'].value_counts().plot(kind="bar");
plt.savefig('TotRmsAbvGrd Vs SalePrice.png')

The sample size drops off once the total number of rooms above grade reaches 10.

In [None]:
plt.figure(figsize= (15,8))
plt.subplot(121)
sns.boxplot(x=train['GarageCars'], y=train['SalePrice']);
sns.stripplot(x=train["GarageCars"],y=train["SalePrice"], jitter=True, edgecolor="gray")
plt.subplot(122)
train['GarageCars'].value_counts().plot(kind="bar");
plt.savefig('GarageCars Vs SalePrice.png')
#The drop in median Sale Price at 4 GarageCars is understandable once the individual points are plotted on the boxes.

In [None]:
plt.figure(figsize= (15,8))
plt.subplot(121)
sns.boxplot(x=train['FullBath'], y=train['SalePrice']);
plt.subplot(122)
train['FullBath'].value_counts().plot(kind="bar");
plt.savefig('FullBath Vs SalePrice.png')

# Data Preprocessing and Cleaning

In [None]:
train['log_SalePrice']=np.log(train['SalePrice']+1)
saleprices=train[['SalePrice','log_SalePrice']]

saleprices.head(5)
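
A model trained on this target predicts log prices, so its predictions have to be mapped back to dollars with the inverse transform. The short sketch below (with illustrative prices) shows that `np.log1p` and `np.expm1` are the NumPy equivalents of `log(x+1)` and `exp(x)-1` and that they undo each other.

```python
import numpy as np

prices = np.array([34900.0, 163000.0, 755000.0])   # illustrative values

log_prices = np.log1p(prices)      # same as np.log(prices + 1)
recovered = np.expm1(log_prices)   # same as np.exp(log_prices) - 1

print(np.allclose(recovered, prices))
```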

In [None]:
train=train.drop(columns=['SalePrice','log_SalePrice'])

In [None]:
print(test.shape)
print(train.shape)

In [None]:
all_data = pd.concat((train, test))
print(all_data.shape)
all_data.head()

### Checking for NaN values in Data

In [None]:
null_data = pd.DataFrame(all_data.isnull().sum().sort_values(ascending=False))

null_data.columns = ['Null Count']
null_data.index.name = 'Feature'
null_data


In [None]:
# Percentage of Null Data in each Feature

(null_data/len(all_data)) * 100
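
An equivalent way to get the percentages, sketched here on a toy frame whose values are assumptions for the example: the mean of a boolean `isnull()` mask is the fraction of missing values, so no separate division by the row count is needed.

```python
import numpy as np
import pandas as pd

# Toy frame (illustrative); the column names mimic the dataset
df = pd.DataFrame({
    'PoolQC': [np.nan, np.nan, np.nan, 'Gd'],
    'LotFrontage': [65.0, np.nan, 80.0, 70.0],
    'LotArea': [8450, 9600, 11250, 9550],
})

# isnull().mean() is the fraction of missing values per column
missing_pct = (df.isnull().mean() * 100).sort_values(ascending=False)
missing_pct = missing_pct[missing_pct > 0]   # keep only the affected features
print(missing_pct)
```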

In [None]:
# Visualising missing data
f, ax = plt.subplots(figsize=(20, 7));
sns.barplot(x=null_data.index, y=null_data['Null Count']);
plt.xticks(rotation=90);
plt.xlabel('Features', fontsize=15);
plt.ylabel('Number of missing values', fontsize=15);
plt.title('Missing values by feature', fontsize=15);

## Imputing Missing Values
#### In the columns below, a missing value means the house lacks that feature (for several of them most values are missing), so we'll impute them with 'None'

In [None]:
for col in ('PoolQC', 'MiscFeature', 'Alley', 'Fence', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond',
            'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'MasVnrType', 'MSSubClass'):
    
    all_data[col] = all_data[col].fillna('None')

#### In these numerical features we'll impute NaN with zero, because a missing value here means the house doesn't have that feature

In [None]:
#Impute the numerical features and replace with a value of zero

for col in ('GarageYrBlt', 'GarageArea', 'GarageCars', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF','TotalBsmtSF', 'BsmtFullBath',
            'BsmtHalfBath', 'MasVnrArea'):
    
    all_data[col] = all_data[col].fillna(0)

#### The following features have very few missing values, so we'll impute them with the most frequent value.

In [None]:
for col in ('MSZoning', 'Electrical', 'KitchenQual', 'Exterior1st', 'Exterior2nd', 'SaleType', 'Functional', 'Utilities'):
    all_data[col] = all_data[col].fillna(all_data[col].mode()[0])

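###### Imputing LotFrontage with the median value of each neighborhood

In [None]:
# transform keeps the original index, so the result aligns with all_data
all_data['LotFrontage'] = all_data.groupby('Neighborhood')['LotFrontage'].transform(lambda x: x.fillna(x.median()))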
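
To make the group-wise imputation concrete, here is a minimal sketch on a toy frame (the values are assumptions for the example). `groupby(...).transform(...)` returns a Series aligned with the original index, which is what makes the assignment back into the frame safe.

```python
import numpy as np
import pandas as pd

# Toy data: two neighborhoods, each with one missing LotFrontage (illustrative)
df = pd.DataFrame({
    'Neighborhood': ['NAmes', 'NAmes', 'NAmes', 'OldTown', 'OldTown'],
    'LotFrontage': [70.0, np.nan, 80.0, 50.0, np.nan],
})

# Fill each gap with the median of its own neighborhood
df['LotFrontage'] = (df.groupby('Neighborhood')['LotFrontage']
                       .transform(lambda s: s.fillna(s.median())))
print(df)
```

The NAmes gap is filled with 75 (the median of 70 and 80) and the OldTown gap with 50.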


### Combining similar features to make new features

**TotalBsmtSF** - Total Basement Square Feet

**1stFlrSF** - First Floor Square Feet

**2ndFlrSF** - Second Floor Square Feet

All three of the above features describe the area of the house, and we can easily combine them to form **TotalSF** - total area in square feet

In [None]:
all_data['TotalSF']=all_data['TotalBsmtSF'] + all_data['1stFlrSF'] + all_data['2ndFlrSF']
all_data['No2ndFlr']=(all_data['2ndFlrSF']==0)
all_data['NoBsmt']=(all_data['TotalBsmtSF']==0)

sns.regplot(x=train['TotalBsmtSF']+train['1stFlrSF']+train['2ndFlrSF'],y=saleprices['SalePrice'],color='red');

The **BsmtFullBath, FullBath, BsmtHalfBath** and **HalfBath** features can be combined into a **TotalBath** feature, similar to TotalSF


In [None]:
plt.figure(figsize = (12,7))
sns.barplot(x=train['BsmtFullBath'] + train['FullBath'] + train['BsmtHalfBath'] + train['HalfBath'], y=saleprices['SalePrice']);

all_data['TotalBath']=all_data['BsmtFullBath'] + all_data['FullBath'] + all_data['BsmtHalfBath'] + all_data['HalfBath']

#### Combining YearBuilt and YearRemodAdd

In [None]:
plt.figure(figsize=(10,7))
sns.regplot(x=(train['YearBuilt']+train['YearRemodAdd']), y=saleprices['SalePrice']);

all_data['YrBltAndRemod']=all_data['YearBuilt']+all_data['YearRemodAdd']

#### These features show little relationship with SalePrice, so we'll drop them.

In [None]:
all_data=all_data.drop(columns=['Street','Utilities','Condition2','RoofMatl',
                                'Heating','PoolArea','PoolQC','MiscVal','MiscFeature'])

In [None]:
# treat some numeric columns as str because they are in fact categorical variables
all_data['MSSubClass']=all_data['MSSubClass'].astype(str)
all_data['MoSold']=all_data['MoSold'].astype(str)
all_data['YrSold']=all_data['YrSold'].astype(str)

#### For these features a value of 0 means the house lacks the feature, so we add boolean flags marking those rows.

In [None]:
all_data['NoLowQual']=(all_data['LowQualFinSF']==0)
all_data['NoOpenPorch']=(all_data['OpenPorchSF']==0)
all_data['NoWoodDeck']=(all_data['WoodDeckSF']==0)
all_data['NoGarage']=(all_data['GarageArea']==0)

In [None]:
Basement = ['BsmtCond', 'BsmtExposure', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtFinType1',
            'BsmtFinType2', 'BsmtQual', 'BsmtUnfSF','TotalBsmtSF']

Bsmt=all_data[Basement]
Bsmt.head()

In [None]:
Bsmt['BsmtCond'].unique()

BsmtQual: Evaluates the height of the basement

*    Ex   Excellent (100+ inches) 
*    Gd   Good (90-99 inches)
*    TA   Typical (80-89 inches)
*    Fa   Fair (70-79 inches)
*    Po   Poor (<70 inches)
*    NA   No Basement
   
BsmtCond: Evaluates the general condition of the basement

*    Ex   Excellent
*    Gd   Good
*    TA   Typical - slight dampness allowed
*    Fa   Fair - dampness or some cracking or settling
*    Po   Poor - Severe cracking, settling, or wetness
*    NA   No Basement

BsmtExposure: Refers to walkout or garden level walls

*    Gd   Good Exposure
*    Av   Average Exposure (split levels or foyers typically score average or above)  
*    Mn   Minimum Exposure
*    No   No Exposure
*    NA   No Basement

BsmtFinType1: Rating of basement finished area

*    GLQ  Good Living Quarters
*    ALQ  Average Living Quarters
*    BLQ  Below Average Living Quarters   
*    Rec  Average Rec Room
*    LwQ  Low Quality
*    Unf  Unfinished
*    NA   No Basement

BsmtFinSF1: Type 1 finished square feet

BsmtFinType2: Rating of basement finished area (if multiple types)

*    GLQ  Good Living Quarters
*    ALQ  Average Living Quarters
*    BLQ  Below Average Living Quarters   
*    Rec  Average Rec Room
*    LwQ  Low Quality
*    Unf  Unfinished
*    NA   No Basement

BsmtFinSF2: Type 2 finished square feet

BsmtUnfSF: Unfinished square feet of basement area

TotalBsmtSF: Total square feet of basement area   

In [None]:
Bsmt=Bsmt.replace(to_replace='Po', value=1)
Bsmt=Bsmt.replace(to_replace='Fa', value=2)
Bsmt=Bsmt.replace(to_replace='TA', value=3)
Bsmt=Bsmt.replace(to_replace='Gd', value=4)
Bsmt=Bsmt.replace(to_replace='Ex', value=5)
Bsmt=Bsmt.replace(to_replace='None', value=0)

Bsmt=Bsmt.replace(to_replace='No', value=1)
Bsmt=Bsmt.replace(to_replace='Mn', value=2)
Bsmt=Bsmt.replace(to_replace='Av', value=3)
Bsmt=Bsmt.replace(to_replace='Gd', value=4)

Bsmt=Bsmt.replace(to_replace='Unf', value=1)
Bsmt=Bsmt.replace(to_replace='LwQ', value=2)
Bsmt=Bsmt.replace(to_replace='Rec', value=3)
Bsmt=Bsmt.replace(to_replace='BLQ', value=4)
Bsmt=Bsmt.replace(to_replace='ALQ', value=5)
Bsmt=Bsmt.replace(to_replace='GLQ', value=6)

In [None]:
Bsmt.head()

In [None]:
Bsmt['BsmtScore']= Bsmt['BsmtQual']  * Bsmt['BsmtCond'] * Bsmt['TotalBsmtSF']
all_data['BsmtScore']=Bsmt['BsmtScore']

In [None]:
Bsmt['BsmtFin'] = (Bsmt['BsmtFinSF1'] * Bsmt['BsmtFinType1']) + (Bsmt['BsmtFinSF2'] * Bsmt['BsmtFinType2'])
all_data['BsmtFinScore']=Bsmt['BsmtFin']
all_data['BsmtDNF']=(all_data['BsmtFinScore']==0)

In [None]:
lot=['LotFrontage', 'LotArea','LotConfig','LotShape']
Lot=all_data[lot]
Lot.head()

In [None]:
garage=['GarageArea','GarageCars','GarageCond','GarageFinish','GarageQual','GarageType','GarageYrBlt']
Garage=all_data[garage]

Garage=Garage.replace(to_replace='Po', value=1)
Garage=Garage.replace(to_replace='Fa', value=2)
Garage=Garage.replace(to_replace='TA', value=3)
Garage=Garage.replace(to_replace='Gd', value=4)
Garage=Garage.replace(to_replace='Ex', value=5)
Garage=Garage.replace(to_replace='None', value=0)

Garage=Garage.replace(to_replace='Unf', value=1)
Garage=Garage.replace(to_replace='RFn', value=2)
Garage=Garage.replace(to_replace='Fin', value=3)

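# GarageType codes ('Basment' is the spelling used in the dataset, so the 'Basement' line below is a no-op here)
Garage=Garage.replace(to_replace='CarPort', value=1)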
Garage=Garage.replace(to_replace='Basment', value=4)
Garage=Garage.replace(to_replace='Detchd', value=2)
Garage=Garage.replace(to_replace='2Types', value=3)
Garage=Garage.replace(to_replace='Basement', value=5)
Garage=Garage.replace(to_replace='Attchd', value=6)
Garage=Garage.replace(to_replace='BuiltIn', value=7)

Garage.head()

In [None]:
all_data.head()

In [None]:
non_numeric=all_data.select_dtypes(exclude=[np.number, bool])
non_numeric.head()

In [None]:
def onehot(col_list):
    """One-hot encode the given columns of all_data (prefix = column name) and drop the originals."""
    global all_data
    all_data=pd.get_dummies(all_data, columns=list(col_list))
    print(all_data.shape)

In [None]:
onehot(list(non_numeric))
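
For readers new to one-hot encoding, the sketch below shows what `pd.get_dummies` does to a single categorical column. The toy frame is an assumption made for the example; the column names mimic the dataset.

```python
import pandas as pd

# Toy frame with one categorical and one numeric column (illustrative)
df = pd.DataFrame({
    'LotShape': ['Reg', 'IR1', 'Reg'],
    'LotArea': [8450, 9600, 11250],
})

# Each category becomes its own indicator column named <column>_<category>;
# the original categorical column is dropped
encoded = pd.get_dummies(df, columns=['LotShape'])
print(encoded.columns.tolist())
```

Numeric columns such as `LotArea` pass through unchanged.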

In [None]:
def log_transform(col_list):
    """Apply log(x+1) to every listed column of all_data whose skewness exceeds 0.5."""
    transformed_col=[]
    for col in col_list:
        if all_data[col].skew() > 0.5:
            all_data[col]=np.log(all_data[col]+1)
            transformed_col.append(col)
    print(f"{len(transformed_col)} features have been transformed")
    print(all_data.shape)

In [None]:
numeric=all_data.select_dtypes(include=np.number)
log_transform(list(numeric))
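
The effect of this skewness-based transform can be seen on synthetic data. The sketch below uses two made-up columns, one strongly right-skewed and one roughly symmetric (both are assumptions for the example), and applies the same 0.5 threshold.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    'skewed': rng.lognormal(mean=0.0, sigma=1.0, size=2000),    # strongly right-skewed
    'symmetric': rng.normal(loc=50.0, scale=5.0, size=2000),    # roughly symmetric
})

before = df.skew()
skewed_cols = before[before > 0.5].index       # columns above the 0.5 threshold
df[skewed_cols] = np.log1p(df[skewed_cols])    # log(x + 1) on those columns only
after = df.skew()

print(before, after, sep='\n')
```

Only the right-skewed column is transformed, and its skewness shrinks; the symmetric column is left as it was.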

In [None]:
print(train.shape)
print(test.shape)

#### Extracting Train and Test Data again


In [None]:
train=all_data[:len(train)]
test=all_data[len(train):]

In [None]:
print(train.shape)
print(test.shape)

# Modeling

In [None]:
# loading packages for the models
from sklearn.kernel_ridge import KernelRidge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.metrics import make_scorer

from sklearn import linear_model, model_selection, ensemble, preprocessing
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet, SGDRegressor
from sklearn.ensemble import RandomForestRegressor, BaggingRegressor, GradientBoostingRegressor, AdaBoostRegressor
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
import xgboost as xgb
import lightgbm as lgb

In [None]:
def rmse(predict, actual):
    predict = np.array(predict)
    actual = np.array(actual)
    distance = predict - actual
    square_distance = distance ** 2
    mean_square_distance = square_distance.mean()
    score = np.sqrt(mean_square_distance)
    return score
rmse_score = make_scorer(rmse)
rmse_score
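
Because the models are fitted on the log of the sale price, this RMSE is measured on the log scale, where it behaves like a relative error. The sketch below re-implements the same formula in compact form and uses made-up prices (an assumption for the example) in which both predictions are off by 10%.

```python
import numpy as np

def rmse(predict, actual):
    """Root mean squared error between two array-likes."""
    predict = np.asarray(predict, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return np.sqrt(np.mean((predict - actual) ** 2))

# Illustrative prices: both predictions are 10% too high
actual = np.array([100000.0, 500000.0])
predicted = np.array([110000.0, 550000.0])

rmse_prices = rmse(predicted, actual)                    # dominated by the expensive house
rmse_logs = rmse(np.log1p(predicted), np.log1p(actual))  # about log(1.1) for both houses

print('RMSE on prices    :', rmse_prices)
print('RMSE on log prices:', rmse_logs)
```

On the raw scale the expensive house dominates the score, whereas on the log scale both houses contribute equally.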

In [None]:
feature_names=list(all_data)
X_train = train[feature_names]
X_test = test[feature_names]
y_train = saleprices['log_SalePrice']

In [None]:
def score(model):
    score = cross_val_score(model, X_train, y_train, cv=5, scoring=rmse_score).mean()
    return score

### Tutorials for Models
##### Here I am adding a few tutorials for people who want to learn about the models I have used for prediction.

In [None]:
from IPython.display import YouTubeVideo
#Video tutorial on Bias-Variance Tradeoff

YouTubeVideo('EuBBz3bI-aA',width=700, height=350)

In [None]:
#Video tutorial on ridge regression

YouTubeVideo('Q81RR3yKn30',width=700, height=350)

In [None]:
#Tutorial on Lasso
YouTubeVideo('NGf0voTMlcs',width=700, height=350)

In [None]:
#Elastic Net Regression
YouTubeVideo('1dKRdX9bfIo',width=700, height=350)

In [None]:
#Decision Tree
YouTubeVideo('7VeUPuFGJHk',width=700, height=350)

In [None]:
model_Lasso= make_pipeline(RobustScaler(), Lasso(alpha =0.000327, random_state=18))

model_ENet = make_pipeline(RobustScaler(), ElasticNet(alpha=0.00052, l1_ratio=0.70654, random_state=18))


model_GBoost = GradientBoostingRegressor(n_estimators=3000, learning_rate=0.05,
                                   max_depth=4, max_features='sqrt',
                                   min_samples_leaf=15, min_samples_split=10, 
                                   loss='huber', random_state =18)

model_XGB=xgb.XGBRegressor(n_jobs=-1, n_estimators=849, learning_rate=0.015876, 
                           max_depth=58, colsample_bytree=0.599653, colsample_bylevel=0.287441, subsample=0.154134, random_state=18)

model_lgb = lgb.LGBMRegressor(objective='regression',num_leaves=5,
                              learning_rate=0.05, n_estimators=720,
                              max_bin = 55, bagging_fraction = 0.8,
                              bagging_freq = 5, feature_fraction = 0.2319,
                              feature_fraction_seed=9, bagging_seed=9,
                              min_data_in_leaf =6, min_sum_hessian_in_leaf = 11)

# Only the non-default settings are passed: the 'mse' criterion name and
# min_impurity_split were removed in recent scikit-learn releases
forest_reg = RandomForestRegressor(bootstrap=False, max_features=60,
           n_estimators=70, n_jobs=1, random_state=42)


## Predictions

In [None]:
model_Lasso.fit(X_train, y_train)
Lasso_Predictions=np.exp(model_Lasso.predict(X_test))-1

model_ENet.fit(X_train, y_train)
ENet_Predictions=np.exp(model_ENet.predict(X_test))-1

model_XGB.fit(X_train, y_train)
XGB_Predictions=np.exp(model_XGB.predict(X_test))-1

model_GBoost.fit(X_train, y_train)
GBoost_Predictions=np.exp(model_GBoost.predict(X_test))-1

model_lgb.fit(X_train, y_train)
lgb_Predictions=np.exp(model_lgb.predict(X_test))-1

forest_reg.fit(X_train, y_train)
forest_reg_Predictions=np.exp(forest_reg.predict(X_test))-1


In [None]:
scores ={}
scores.update({'Lasso':score(model_Lasso)})
scores.update({"Elastic Net":score(model_ENet)})

scores.update({"XGB":score(model_XGB)})
scores.update({"Gradient Boost":score(model_GBoost)})
scores.update({"lgb":score(model_lgb)})
scores.update({"Random Forest":score(forest_reg)})

In [None]:
scores

In [None]:
scores_df =pd.DataFrame(list(scores.items()),columns=['Model','Score'])
scores_df.sort_values(['Score'])

In [None]:
plt.figure(figsize=(15,7))
sns.barplot(x=scores_df['Model'],y=scores_df['Score']);

###### Combining models to get better prediction

Ensemble methods are commonly used to boost predictive accuracy by combining the predictions of multiple machine learning models. The traditional wisdom has been to combine so-called “weak” learners. However, a more modern approach is to create an ensemble of a well-chosen collection of strong yet diverse models.

Building powerful ensemble models has many parallels with building successful human teams in business, science, politics, and sports. Each team member makes a significant contribution and individual weaknesses and biases are offset by the strengths of other members.

The simplest kind of ensemble is the unweighted average of the predictions of the models that form a model library. For example, if a model library includes three models for an interval target (as shown in the following figure), the unweighted average would entail dividing the sum of the predicted values of the three candidate models by three. In an unweighted average, each model takes the same weight when an ensemble model is built.



![image.png](attachment:image.png)



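Reference: [https://blogs.sas.com/content/subconsciousmusings/2017/05/18/stacked-ensemble-models-win-data-science-competitions/](https://blogs.sas.com/content/subconsciousmusings/2017/05/18/stacked-ensemble-models-win-data-science-competitions/)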
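
The averaging described above can be sketched in a few lines. The three prediction arrays and the weights are hypothetical values chosen for the example; the weighted variant simply lets some models count more than others, with weights that sum to 1.

```python
import numpy as np

# Hypothetical predictions from three models for the same four houses
pred_a = np.array([200000.0, 150000.0, 310000.0, 120000.0])
pred_b = np.array([210000.0, 140000.0, 300000.0, 130000.0])
pred_c = np.array([190000.0, 160000.0, 320000.0, 125000.0])
preds = np.vstack([pred_a, pred_b, pred_c])

# Unweighted average: every model counts equally
unweighted = preds.mean(axis=0)

# Weighted average: illustrative weights that sum to 1
weights = np.array([0.5, 0.2, 0.3])
weighted = np.average(preds, axis=0, weights=weights)

print(unweighted)
print(weighted)
```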

In [None]:
# Weighted average of the Lasso, XGB and LightGBM predictions (the weights sum to 1)
ensemble = (Lasso_Predictions*0.59 + XGB_Predictions*0.06 + lgb_Predictions*0.35)

ensemble

In [None]:
submission=pd.read_csv('./data/sample_submission.csv')
submission['SalePrice']= ensemble

In [None]:
submission.head()

In [None]:
submission.to_csv('submission.csv',index=False)