## House Prices: Advanced Regression Techniques

This project aims to predict the house price based on various features.

#### Dataset link
https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data

## The lifecycle of the project
1. Data Analysis
2. Feature Engineering
3. Feature Selection
4. Model Building
5. Model Deployment

## Data Analysis Phase
#### To understand more about:
1. Missing values
2. The numerical variables and their distributions
3. The categorical variables and their cardinality
4. Outliers and abnormalities
5. The relationship between the independent features and the dependent feature (SalePrice)

In [None]:
# Import the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
# For feature engineering
import scipy.stats as stats
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
# For feature selection
from sklearn.linear_model import Lasso
from sklearn.feature_selection import SelectFromModel
# Suppress warnings
import warnings
warnings.filterwarnings('ignore')
# Display all the columns and rows of the dataframe
pd.set_option('display.max_columns',None)
pd.set_option('display.max_rows',None)

In [None]:
# Loading the data
df1=pd.read_csv('train.csv')
#df1=pd.read_csv('test.csv')

In [None]:
# Checking the shape of the data
df1.shape

In [None]:
# Taking a look at the first 5 rows of the dataset
df1.head()

In [None]:
# The statistical summary of dataset
df1.describe().T

In [None]:
# Column dtypes and non-null counts
df1.info()

### 1. Missing values

In [None]:
# Features that have missing values
cols_na=df1.columns[df1.isnull().any()]
# Non-null count, null count, and percentage of missing values for each feature
print("Missing Values by Column")
print("-"*30)
for col in cols_na:
    print(col,df1[col].count(),df1[col].isnull().sum(),np.round(df1[col].isnull().mean()*100,2),'%')
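
The same summary can also be collected into a table, which is easier to sort and filter than printed lines. This is a minimal sketch on a hypothetical toy frame (`toy` stands in for `train.csv`; the values are made up for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for train.csv
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "Alley": [np.nan, np.nan, "Pave", np.nan],
    "SalePrice": [208500, 181500, 223500, 140000],
})

# One row per column: number and percentage of missing values
missing = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "pct_missing": (toy.isnull().mean() * 100).round(2),
})
# Keep only the columns that actually have missing values, worst first
missing = missing[missing["n_missing"] > 0].sort_values("pct_missing", ascending=False)
print(missing)
```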

#### Plot the relationship between the numerical features that have missing values and SalePrice before deciding whether to drop them

In [None]:
# pairplot only draws the numerical columns; SalePrice is added so its relationships appear in the grid
sns.pairplot(df1[list(cols_na)+['SalePrice']])

The correlation between each numerical feature that has missing values and the target variable, using jointplot visualizations

In [None]:
num_cols_na=[col for col in df1.columns[df1.isnull().any()] if df1[col].dtype!='O']

for col in num_cols_na:
    # jointplot creates its own figure, so the size is set with height instead of plt.figure
    sns.jointplot(data=df1, x=col, y="SalePrice", kind="kde", height=8)

#### Plot the relationship between the categorical features that have missing values and SalePrice before deciding whether to drop them

In [None]:
# Copy of df1 restricted to the categorical features that have null values
df_na=df1.copy()
cat_cols_na=[col for col in df_na.columns[df_na.isnull().any()] if df_na[col].dtype=='O']
df_na=df_na[cat_cols_na+ ['SalePrice']]
for col in cat_cols_na:
    # Indicate 1 if the observation is null or 0 otherwise.
    df_na[col]= np.where(df_na[col].isnull(),1,0)
    # Plot the mean SalePrice for the null (1) and non-null (0) observations of the feature
    df_na.groupby(col)['SalePrice'].mean().plot.bar()
    plt.xlabel(col)
    plt.ylabel('Mean House Price')
    plt.title(col)
    plt.show()
df_na.head()

The relationship between the missing values and the dependent variable (SalePrice) is clearly visible, so we can't just drop these features. In the next phase (Feature Engineering), we will solve this problem by replacing the missing values.

### 2. Numerical Variables

In [None]:
# num_cols is a list of numerical features
num_cols=df1.select_dtypes(exclude='object').columns
# num_cols= [col for col in df1.columns if df1[col].dtype!='O']
df1[num_cols].shape
df1[num_cols].head()

#### Temporal variables
According to the data_description.txt file, there are 4 temporal (year) variables: YearBuilt, YearRemodAdd, GarageYrBlt, and YrSold.

In [None]:
for col in num_cols:
    print(col, df1[col].dtype) 

In [None]:
# temp_cols is a list of temporal features
temp_cols = [col for col in num_cols if 'Yr' in col or 'Year' in col]
# data type of the temporal features in the dataset
for col in temp_cols:
    print(col, df1[col].dtype) 

In the dataset, the 4 temporal features are stored as int or float.

#### Plot the relationship between the temporal features and SalePrice before deciding how to handle them

In [None]:
for col in temp_cols:
    df1.groupby(col)['SalePrice'].mean().plot()
    plt.xlabel(col)
    plt.ylabel('Mean House Price')
    plt.title('Mean House Price vs. '+col)
    plt.show()

#### Find the relationship between the houses' ages and SalePrice, by plotting scatter diagrams for these relationships

In [None]:
# Copy from df1
df_temp=df1.copy()

for col in temp_cols:
    # YrSold is the reference year, so it is not converted to an age
    if col=='YrSold':
        continue
    # Calculate the house age relative to each of the other temporal variables
    df_temp[col]=df_temp['YrSold']-df_temp[col]
    # Plot the relationship between that age and SalePrice
    plt.scatter(df_temp[col],df_temp['SalePrice'])
    plt.xlabel('House age since '+col)
    plt.ylabel('Sale Price')
    plt.title('House Price vs. house age since '+col)
    plt.show()

The relationship between the temporal features and the dependent variable (SalePrice) is clearly visible: there is an inverse relationship between the age of the house and its price. In the next phase (Feature Engineering), we will make use of this by converting these year features into ages relative to YrSold.
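
The year-to-age conversion can be sketched on a few hypothetical rows (the values below are made up; only the column names come from the dataset). New `*_age` columns are used here instead of overwriting the originals, which is an illustrative choice, not what the notebook does.

```python
import pandas as pd

# Hypothetical rows with two of the year columns and the sale year
toy = pd.DataFrame({
    "YearBuilt": [2003, 1976, 1915],
    "YearRemodAdd": [2003, 1976, 1970],
    "YrSold": [2008, 2007, 2006],
})

# Age at sale = year sold minus the year of the event
ages = pd.DataFrame({
    f"{col}_age": toy["YrSold"] - toy[col]
    for col in ["YearBuilt", "YearRemodAdd"]
})
print(ages)
```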

#### Continuous and discrete variables

In [None]:
dis_cols= [col for col in num_cols if len(df1[col].unique())<25 and col not in temp_cols]
con_cols= [col for col in num_cols if col not in dis_cols and col not in temp_cols+['Id']]

In [None]:
df1[dis_cols].head()

In [None]:
df1[con_cols].head()

#### Find the relationship between the discrete variables and SalePrice with bar charts, and inspect the distribution of the continuous variables with histograms

In [None]:
# Analyse the discrete values by creating barcharts
for col in dis_cols:
    df1.groupby(col)['SalePrice'].mean().plot.bar()
    plt.xlabel(col)
    plt.ylabel('Mean House Price')
    plt.title('Mean House Price vs. '+ col)
    plt.show()

There is a clear relationship between the discrete variables and SalePrice; for some of them it looks exponential.

In [None]:
# Analyse the continuous values by creating histograms to understand the distribution
for col in con_cols:
    df1[col].hist(bins=25)
    plt.xlabel(col)
    plt.ylabel("Count")
    plt.title(col)
    plt.show()

Most of the continuous variables are not normally distributed. In the next cell, we address this with a logarithmic transformation, which brings these features closer to a normal distribution, and plot the transformed features against SalePrice.

In [None]:
# Use a logarithmic transformation to bring the continuous features closer to a normal distribution
for col in con_cols:
    # Skip features that contain zeros (log(0) is undefined) and the target itself
    if 0 in df1[col].unique() or col=='SalePrice':
        continue
    plt.scatter(np.log(df1[col]),np.log(df1['SalePrice']))
    plt.xlabel('log('+col+')')
    plt.ylabel('log(SalePrice)')
    plt.title(col)
    plt.show()

There is a monotonic relationship between GrLivArea, 1stFlrSF and SalePrice. 
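
One way to quantify a monotonic relationship is the Spearman rank correlation, which is the Pearson correlation computed on the ranks. The sketch below uses made-up values for GrLivArea and SalePrice; it is an illustration, not a result from the dataset.

```python
import pandas as pd

# Hypothetical values: price always increases with area, but not linearly
toy = pd.DataFrame({
    "GrLivArea": [800, 1200, 1500, 2000, 3000],
    "SalePrice": [90000, 130000, 150000, 260000, 700000],
})

# Spearman correlation = Pearson correlation of the ranks
spearman = toy["GrLivArea"].rank().corr(toy["SalePrice"].rank())
# Plain Pearson correlation on the raw values, for comparison
pearson = toy["GrLivArea"].corr(toy["SalePrice"])
print(f"Spearman: {spearman:.3f}, Pearson: {pearson:.3f}")
```

A Spearman value close to 1 with a lower Pearson value suggests a relationship that is monotonic but not linear.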

### 3. Categorical Variables

In [None]:
# cat_cols=[col for col in df1.columns if df1[col].dtypes=='O']
cat_cols= df1.select_dtypes(include='object').columns
df1[cat_cols].head()

In [None]:
# Number of categories for each categorical feature
for col in cat_cols:
    print(col,len(df1[col].unique()))

#### Find the relationship between the categorical variables and SalePrice, by plotting bar diagrams for these relationships

In [None]:
for col in cat_cols:
    df1.groupby(col)['SalePrice'].mean().plot.bar()
    plt.xlabel(col)
    plt.ylabel('Mean House Price')
    plt.title('Mean House Price vs. '+ col)
    plt.show()

### 4. Outliers

In [None]:
# Using boxplot to show the outliers for the continuous variables
for col in con_cols:
    df_con=df1.copy()
    if 0 in df_con[col].unique():
        pass
    else:
        df_con[col]=np.log(df_con[col])
        df_con.boxplot(column=col)
        plt.ylabel(col)
        plt.title(col)
        plt.show()
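
A boxplot draws points beyond 1.5 times the interquartile range (IQR) from the quartiles as outliers. The same rule can be applied numerically to flag them. This is a sketch with a hypothetical helper (`iqr_outlier_mask`) and made-up LotArea values.

```python
import numpy as np
import pandas as pd

def iqr_outlier_mask(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

# Hypothetical lot areas; the last one is far larger than the rest
lot_area = pd.Series([8450, 9600, 11250, 9550, 14260, 215245], name="LotArea")
# Apply the rule on the log scale, as in the boxplots above
mask = iqr_outlier_mask(np.log(lot_area))
print(lot_area[mask])
```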

## Feature Engineering Phase
#### Performing:
1. Missing values.
2. Temporal variables.
3. Categorical variables.
4. Standardise the variables.

To avoid data leakage, the dataset should be split into train and test sets first, and the feature engineering should then be run on both. Our data is already split into train and test files, so the split below is commented out.
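
A common way to keep the test set from leaking into preprocessing is to learn each statistic on the train set only and then apply it to both sets. The sketch below shows this for median imputation on hypothetical values; it illustrates the idea and is not the notebook's exact procedure.

```python
import numpy as np
import pandas as pd

# Hypothetical train and test frames with a missing value each
train = pd.DataFrame({"LotFrontage": [60.0, 70.0, np.nan, 80.0]})
test = pd.DataFrame({"LotFrontage": [np.nan, 200.0]})

# Learn the statistic on the training data only
train_median = train["LotFrontage"].median()

# Apply the same value to both sets
train_filled = train.fillna({"LotFrontage": train_median})
test_filled = test.fillna({"LotFrontage": train_median})
print(test_filled)
```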

In [None]:
# Using train_test_split from the sklearn package
#X_train,X_test,y_train,y_test=train_test_split(df1,df1['SalePrice'],test_size=0.1,random_state=0)
#X_train.shape,X_test.shape

#### 1. Missing values.
Drop the columns that have a lot of missing data and fill the rest with appropriate values.

In [None]:
# Make a copy from dataset
df2=df1.copy()

In [None]:
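# Function to replace the null values in any dataframe
def replaceNull(df,colList,col_type):
    """Return a copy of df with nulls filled: 'n/a' for categorical columns, the column median for numerical ones."""
    new_df=df.copy()
    if col_type=='categorical':
        new_df[colList]=new_df[colList].fillna('n/a')
    elif col_type=='numerical':
        for col in colList:
            new_df[col]=new_df[col].fillna(new_df[col].median())
    else:
        raise ValueError("col_type must be 'categorical' or 'numerical'")
    return new_df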

1. Replace the missing values in the categorical features.

In [None]:
# find the percentage of missing values in the categorical features.
cat_na=[col for col in df2[cat_cols] if df2[col].isnull().sum()>0]
for col in cat_na:
    print(col, np.round(df2[col].isnull().mean()*100,2), '% missing values')

More than 75% of the values in the Alley, PoolQC, Fence, and MiscFeature features are null, so we will drop them.

In [None]:
df2=df2.drop(columns=['Alley', 'PoolQC', 'Fence', 'MiscFeature'])

In [None]:
# new_cat_cols=[col for col in df2.columns if df2[col].dtype=='O']
new_cat_cols= df2.select_dtypes(include='object').columns
new_cat_cols

In [None]:
# Replace the rest of categorical missing values with (n/a)
df3= replaceNull(df2,new_cat_cols,'categorical')
for col in new_cat_cols:
    print(col, df3[col].isnull().sum()>0)

2. Replace the missing values in the numerical features.

In [None]:
# find the percentage of missing values in the numerical features.
num_na=[col for col in df3[num_cols] if df3[col].isnull().sum()>0]
for col in num_na:
    print(col, np.round(df3[col].isnull().mean()*100,2), '% missing values')

Only three features have null values, and all of them have a low percentage.

Because we have outliers, we prefer to replace the null values with the median or mode rather than the mean.
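
The reason for this preference can be seen on a small made-up sample: a single extreme value pulls the mean far more than the median.

```python
import numpy as np

# Hypothetical values with one extreme observation
values = np.array([60.0, 65.0, 70.0, 75.0, 313.0])

mean = values.mean()        # pulled upwards by the extreme value
median = np.median(values)  # stays at the middle of the bulk of the data
print(f"mean: {mean:.1f}, median: {median:.1f}")
```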

In [None]:
# Replace the missing values in the numerical features with median
df4= replaceNull(df3,num_na,'numerical')
for col in num_na:
    print(col, df4[col].isnull().sum()>0)

#### 2. Temporal variables.
Change the data values of the temporal variables from year to age

In [None]:
df5=df4.copy()
for col in temp_cols:
    if col!='YrSold':
        df5[col]=df5['YrSold']-df5[col]

In [None]:
df5[temp_cols].head()

#### 3. Numerical variables.
Apply a log transformation to the skewed numerical variables so that their distributions are closer to normal.

In [None]:
# make a list that has all numeric independent features
col_num=[col for col in df5.columns if col not in ['Id','SalePrice'] and df5[col].dtype!='O']
#col_num=[col for col in df5.columns if col not in ['Id'] and df5[col].dtype!='O']

In [None]:
# Create a diagnostic plot function that draws a histogram and a Q-Q plot to detect the skewed features.
def diagnostic_plt(df,col):
    plt.figure(figsize=(15,6))
    # Histogram plot
    plt.subplot(1,2,1)
    df[col].hist()
    # Q-Q plot
    plt.subplot(1,2,2)
    stats.probplot(df[col],dist='norm',plot=plt)
    plt.title(col)
    plt.show()

In [None]:
# Plot histogram and Q-Q plots for each numerical feature in the dataset
for col in col_num:
    diagnostic_plt(df5,col)

From the graphs, most of the numerical features have a skewed distribution. Linear regression tends to work better when the data is closer to a Gaussian distribution, so a logarithmic transformation will be applied to the skewed features to bring them closer to a normal distribution.
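
The effect of the transformation can also be checked numerically with the sample skewness, which is close to 0 for a symmetric distribution. The sketch below uses a synthetic log-normal sample as a stand-in for a right-skewed column; the distribution parameters are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical right-skewed feature (log-normal), standing in for a column such as LotArea
lot_area = pd.Series(rng.lognormal(mean=9.0, sigma=0.5, size=1000), name="LotArea")

# Skewness before and after the log(1 + x) transformation
skew_before = lot_area.skew()
skew_after = np.log1p(lot_area).skew()
print(f"skewness before: {skew_before:.2f}, after log1p: {skew_after:.2f}")
```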

In [None]:
df6=df5.copy()
skewed_cols=['LotFrontage', 'LotArea', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea','MiscVal', 'SalePrice']
#skewed_cols=['LotFrontage', 'LotArea', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea','MiscVal']
for col in skewed_cols:
    # log1p computes log(1+x), which keeps zero values valid
    df6[col]=np.log1p(df6[col])
df6.head()

#### 3. Combine levels.
Replace every rare category (one that appears in 1% or less of the rows) of each categorical variable with a single label ('rare_val'), based on the category frequencies.
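
The idea can be sketched as a small reusable function. The helper name `combine_rare_levels` and the toy RoofStyle counts are hypothetical; the threshold and label follow the description above.

```python
import pandas as pd

def combine_rare_levels(s: pd.Series, min_freq: float = 0.01, label: str = "rare_val") -> pd.Series:
    """Replace levels whose relative frequency is not above min_freq with a single label."""
    freq = s.value_counts(normalize=True)
    frequent = freq[freq > min_freq].index
    # Keep frequent levels as they are; everything else becomes the rare label
    return s.where(s.isin(frequent), label)

# Hypothetical column: 'Shed' appears in 0.5% of the rows
roof = pd.Series(["Gable"] * 150 + ["Hip"] * 49 + ["Shed"] * 1, name="RoofStyle")
combined = combine_rare_levels(roof)
print(combined.value_counts())
```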

In [None]:
# cat_cols= [col for col in df5.columns if df5[col].dtype=='O']
cat_cols= df5.select_dtypes(include='object').columns
df5[cat_cols].head()

In [None]:
df7=df6.copy()
for col in cat_cols:
    temp=df7.groupby(col)['SalePrice'].count()/len(df7)
    #temp=df7.groupby(col)['Id'].count()/len(df7)
    temp_df=temp[temp>0.01].index
    df7[col]=np.where(df7[col].isin(temp_df),df7[col],'rare_val')

In [None]:
# Show the percentage of each category in each categorical feature after combining the rare values into one value
for col in cat_cols:
    temp=np.round((df7.groupby(col)['SalePrice'].count()/len(df7))*100,2)
    #temp=np.round((df7.groupby(col)['Id'].count()/len(df7))*100,2)
    print(temp)

#### 4. Convert to Number.
Some ML libraries do not take categorical variables as input. Thus, we convert them into numbers by ranking the categories of each feature by frequency (the most frequent category gets 0).
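
Whatever encoding is used, the mapping learned on the training data has to be reused for any new data, otherwise the same category can receive different codes. The sketch below shows a frequency-rank mapping built on a hypothetical training column and applied to a hypothetical test column; the fallback code of -1 for unseen levels is an assumption made for this example.

```python
import pandas as pd

# Hypothetical train and test values for one categorical feature
train_col = pd.Series(["TA", "TA", "TA", "Gd", "Gd", "Ex"], name="KitchenQual")
test_col = pd.Series(["Gd", "TA", "Fa"], name="KitchenQual")

# Rank the levels by training frequency: the most frequent level gets code 0
mapping = {level: code for code, level in enumerate(train_col.value_counts().index)}

train_encoded = train_col.map(mapping)
# Levels unseen in training get the (assumed) fallback code -1
test_encoded = test_col.map(mapping).fillna(-1).astype(int)
print(mapping, test_encoded.tolist())
```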

In [None]:
df8=df7.copy()
for col in cat_cols:
    # Order the categories by frequency and map each one to its rank
    ordered_labels=df8[col].value_counts().index
    #ordered_labels=df8.groupby([col])['Id'].mean().sort_values().index
    label_encoder={j:i for i,j in enumerate(ordered_labels)}
    df8[col]= df8[col].map(label_encoder)

# Check if the dataset has any categorical values
# cat_cols= [col for col in df8.columns if df8[col].dtype=='O']
cat_cols= df8.select_dtypes(include='object').columns
print(cat_cols)

In [None]:
df8.head()

#### 4. Feature scaling.
This dataset has many features measured in different units, so it is necessary to scale the features (except the Id and SalePrice columns) before applying the ML models.
Two common feature scaling (normalization) methods are:
1. Z-score
2. MinMax

In this project, I use MinMax normalization, which maps every feature to the [0, 1] range. MinMax scaling is itself sensitive to outliers, so it relies on the log transformation above to reduce their influence.
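
Both scalings are simple formulas: MinMax computes (x - min) / (max - min), and the Z-score computes (x - mean) / std. The sketch below applies them to made-up values that include one extreme observation, to show how each method reacts to it.

```python
import numpy as np

# Hypothetical feature values with one extreme observation
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# MinMax: rescale to the [0, 1] range
minmax = (x - x.min()) / (x.max() - x.min())
# Z-score: zero mean and unit standard deviation
zscore = (x - x.mean()) / x.std()

print(np.round(minmax, 3))
print(np.round(zscore, 3))
```

With MinMax, the extreme value becomes 1 and the remaining values are squeezed close to 0.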

In [None]:
df9=df8.copy()
# Make a list of all independent features
scale_cols=[col for col in df9.columns if col not in ['Id','SalePrice']]
#scale_cols=[col for col in df9.columns if col not in ['Id']]
# Use MinMaxScaler from sklearn.preprocessing to scale the features
scaler=MinMaxScaler(copy=True, feature_range=(0, 1))
scaler.fit(df9[scale_cols])
scaled=scaler.transform(df9[scale_cols])

In [None]:
# Concatenate the Id and SalePrice variables with the transformed dataset
df10=pd.concat([df9[['Id', 'SalePrice']].reset_index(drop=True), pd.DataFrame(scaled,columns=scale_cols)], axis=1)
#df10=pd.concat([df9[['Id']].reset_index(drop=True), pd.DataFrame(scaled,columns=scale_cols)], axis=1)

In [None]:
df10.head()

## Feature Selection
#### To reduce the number of features for the linear regression and skip the ones that are not useful.
1. Capture the dependent feature from the dataset (y_train).
2. Capture the independent features from the dataset (X_train).
3. Identify the independent features in X_train that are not useful and drop them.

#### 1- Capture the dependent feature

In [None]:
y_train=df10[['SalePrice']]

#### 2- Capture the independent features

In [None]:
X_train=df10.drop(['Id','SalePrice'],axis=1)
#X_test=df10.drop(['Id'],axis=1)

#### 3- Select the useful independent features
To identify the independent features that are not useful, we use a Lasso regression model with a SelectFromModel object, which keeps the features with non-zero coefficients.
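
How this selection behaves can be seen on synthetic data where the useful features are known in advance. In the sketch below, the data, the coefficients, and `alpha=0.1` are assumptions chosen for illustration; only the first two columns influence the target.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Only the first two columns drive the target; the other three are pure noise
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# The L1 penalty shrinks the coefficients of the noise columns to zero
selector = SelectFromModel(Lasso(alpha=0.1, random_state=0)).fit(X, y)
support = selector.get_support()
print(support)
print(np.round(selector.estimator_.coef_, 3))
```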

In [None]:
# Set alpha=0.005 (the strength of the L1 penalty). A bigger value will select fewer features.
sel_model = SelectFromModel(Lasso(alpha=0.005, random_state=0))
sel_model.fit(X_train, y_train)

In [None]:
# Show the results: True means the feature is kept by the selector, False means it is dropped.
for col,keep in zip(X_train.columns,sel_model.get_support()):
    print(col,keep)

In [None]:
# Number of selected features
selected_cols=X_train.columns[(sel_model.get_support())]
print('Total features: {}'.format((X_train.shape[1])))
print('No. of features with coefficients shrunk to zero: {}'.format(
    np.sum(sel_model.estimator_.coef_ == 0)))
print('No. of selected features: {}'.format(len(selected_cols)))
list(X_train[selected_cols])

In [None]:
# Drop the unselected independent features
X_train=X_train[selected_cols]

#X_test=X_test[selected_cols]

In [None]:
# Save the cleaned and transformed dataset.
X_train.to_csv('X_train.csv',index=False)
y_train.to_csv('y_train.csv',index=False)
#X_test.to_csv('X_test.csv',index=False)

## Model building Phase
1. Split the data
2. Defining evaluation functions.
3. Machine Learning Models.
4. Model Comparison.

#### 1. Split the train data into X_train, X_test, y_train, y_test

In [None]:
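import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# train_test_split is imported here so this phase can run on its own from the saved CSV files
from sklearn.model_selection import train_test_split
# %matplotlib
X_train=pd.read_csv('X_train.csv')
y_train=pd.read_csv('y_train.csv')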

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, test_size = 0.2, random_state = 3)

In [None]:
# !pip install xgboost

In [None]:

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from xgboost import XGBRegressor
from sklearn.preprocessing import PolynomialFeatures
from sklearn.ensemble import GradientBoostingRegressor 
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import ShuffleSplit
def best_model(X_train, y_train, X_test, y_test):
    algos={
#         'linear_regression':{
#             'model':LinearRegression(),
#             'params':{
#                 'normalize':[True,False]
#             }
#         },
#         'lasso':{
#             'model':Lasso(),
#             'params':{
#                 'alpha':[1,2],
#                 'selection':['random','cycle']
#             }
#         },
#         'ridge_regression':{
#             'model':Ridge(),
#             'params':{}
#         },
#         'decision_tree':{
#             'model':DecisionTreeRegressor(),
#             'params':{
#                 'criterion':['mse','friedman_mse'],
#                 'splitter':['best','random'],
#                 'max_depth' : [10],
#                 'min_samples_leaf' : [2]
#             }
#         },
#         'elastic_net':{
#             'model':ElasticNet(),
#             'params':{}
#         },
#         'SVR':{
#             'model':SVR(),
#             'params':{
#                 'C':[100000,0.7],
#                 'kernel' : ['rbf'],
#                 'gamma' : ['auto'],
#                 'degree':[4],
# #                 'epsilon':[0.002],
#                 'coef0':[20]
#             }
#         },
#         'Random_Forest_Regressor':{
#             'model':RandomForestRegressor(),
#             'params':{
#                 'n_estimators':[100,1500],
#                 'max_depth' : [3]
#             }
#         },
        'GradientBoostingRegressor':{
            'model':GradientBoostingRegressor(),
            # This grid has 6*16*5*148*5 = 355,200 combinations, each fitted once per CV split,
            # so the exhaustive search is very slow; narrow the ranges for a quick run.
            'params':{
#                 'n_estimators':[100,1500],
                'learning_rate':[1, 0.5, 0.25, 0.1, 0.05, 0.01],
#                 'learning_rate':[.01],
#                 'max_depth':[3],
#                 'min_samples_leaf':[1],
                'max_depth':range(1,32,2),
                'min_samples_split':range(200,1001,200),
                'n_estimators':range(20,1500,10),
                'min_samples_leaf':range(30,71,10),
#                 'max_features':range(0,5),
                'subsample':[.9]
            }
        }
#         ,
#         'XGBoost Regressor':{
#             'model':XGBRegressor(),
#             'params':{
#                 'n_estimators':[100],
#                 'learning_rate':[0.01]
#             }
#         }
#         ,
#         'Polynomial_Regression_d2 ':{
#             'model':PolynomialFeatures(),
#             'params':{
#                 'degree':[2]
#             }
#         }
    }
    
    ml_models,model_scores,predictions=[],[],[]
    cv=ShuffleSplit(n_splits=5,test_size=0.2,random_state=0)
    for algo_name,config in algos.items():
        gs=GridSearchCV(config['model'],config['params'],cv=cv,return_train_score=False).fit(X_train, y_train)
        y_pred=gs.predict(X_test)
        # Error validation functions
        mae = mean_absolute_error(y_test, y_pred)
        mse = mean_squared_error(y_test, y_pred)
        rmse = np.sqrt(mse)
        r_squared = r2_score(y_test, y_pred)
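        # Cross-validate the tuned estimator; passing gs itself would rerun the whole grid search in every fold
        rmse_cv = np.sqrt(-cross_val_score(gs.best_estimator_, X_train, y_train, scoring="neg_mean_squared_error", cv=5)).mean()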
        model_scores.append({
                          "Model_name": algo_name,
                          "best_score":gs.best_score_,
                          "mean_absolute_error": mae,
                          "mean_squared_error": mse,
                          "root_mean_squared_error": rmse,
                          "r2_score": r_squared,
                          "RMSE_Cross_Validation": rmse_cv,
                          "best_params":gs.best_params_,
                            "best_estimator":gs.best_estimator_})
        predictions.append({"Model_name": algo_name,"y_pred":y_pred})
        print(f'{algo_name} is done')        
    return pd.DataFrame(model_scores, columns=["Model_name","best_score",
                                            "mean_absolute_error",
                                            "mean_squared_error", 
                                            "root_mean_squared_error",
                                            "r2_score",
                                            "RMSE_Cross_Validation",
                                            "best_params",
                                            "best_estimator"]
                       ).sort_values(by="RMSE_Cross_Validation"
                                    ),pd.DataFrame(predictions, columns=["Model_name","y_pred"])

In [None]:
model_scores,predictions=best_model(X_train, y_train, X_test,y_test)

In [None]:
model_scores
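
Because the grid for GradientBoostingRegressor is very large, one alternative worth considering is a randomized search, which evaluates a fixed number of sampled combinations instead of all of them. The sketch below runs on synthetic data with a small, assumed parameter space; it illustrates the technique and is not the search used in this notebook.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV, ShuffleSplit

rng = np.random.default_rng(0)
# Synthetic regression data standing in for X_train / y_train
X = rng.normal(size=(120, 4))
y = 2.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.1, size=120)

# Assumed, deliberately small search space
param_distributions = {
    "learning_rate": [0.05, 0.1, 0.25],
    "max_depth": [1, 2, 3],
    "n_estimators": [50, 100],
    "subsample": [0.9],
}
search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_distributions,
    n_iter=5,  # only 5 sampled combinations are fitted
    cv=ShuffleSplit(n_splits=3, test_size=0.2, random_state=0),
    scoring="neg_mean_squared_error",
    random_state=0,
).fit(X, y)

# The score is a negated MSE, so negate it before taking the square root
best_rmse = np.sqrt(-search.best_score_)
print(search.best_params_, best_rmse)
```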

In [None]:
for i in range(len(predictions)):
    plt.scatter(y_test, predictions['y_pred'][i])
    plt.xlabel("Actual prices")
    plt.ylabel("Predicted prices")
    plt.title(predictions['Model_name'][i])
    plt.show()

In [None]:
fig, ax = plt.subplots()
# model_scores is the results dataframe returned by best_model
ax.bar(model_scores["Model_name"], model_scores["r2_score"], width = 0.35 , label='r2_score')
ax.bar(model_scores["Model_name"], model_scores["RMSE_Cross_Validation"],width = 0.35 , label='RMSE_Cross_Validation')
ax.set_ylabel('Scores')
plt.title("Evaluation of Models Based on R2 and RMSE (Cross-Validated)")
plt.xticks(rotation=90)
ax.legend()
plt.show()

## Test file
Prepare the test dataset (cleaning, feature engineering, and feature selection).

In [None]:
test_df1=pd.read_csv('test.csv')
sample_sub_df1 = pd.read_csv('sample_submission.csv')

In [None]:
cat_test_cols=[col for col in test_df1.columns if test_df1[col].dtypes=='O']
num_test_cols=[col for col in test_df1.columns if test_df1[col].dtypes!='O']


In [None]:
test_df2=test_df1.copy()
test_df2=test_df2.drop(columns=['Alley', 'PoolQC', 'Fence', 'MiscFeature'])

In [None]:
new_cat_test_cols=[col for col in test_df2.columns if test_df2[col].dtype=='O']


In [None]:
# Replace the rest of categorical missing values with (n/a)
test_df3= replaceNull(test_df2,new_cat_test_cols,'categorical')

In [None]:
# List the numerical features that have missing values.
num_test_na=[col for col in test_df3[num_test_cols] if test_df3[col].isnull().sum()>0]

In [None]:
# Replace the missing values in the numerical features with median
test_df4= replaceNull(test_df3,num_test_na,'numerical')

In [None]:
temp_test_cols = [col for col in num_test_cols if 'Yr' in col or 'Year' in col]
test_df5=test_df4.copy()
for col in temp_test_cols:
    if col!='YrSold':
        test_df5[col]=test_df5['YrSold']-test_df5[col]

In [None]:
# make a list that has all numeric independent features
test_col_num=[col for col in test_df5.columns if col not in ['Id'] and test_df5[col].dtype!='O']

In [None]:
test_df6=test_df5.copy()
skewed_test_cols=['LotFrontage', 'LotArea', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea','MiscVal']
for col in skewed_test_cols:
    test_df6[col]=np.log(test_df6[col]+1)

In [None]:
cat_test_cols= [col for col in test_df6.columns if test_df6[col].dtype=='O']

In [None]:
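test_df7=test_df6.copy()
for col in cat_test_cols:
    # Reuse the training encoding (frequency rank of each level in df7) so train and test codes agree
    label_encoder={j:i for i,j in enumerate(df7[col].value_counts().index)}
    # Levels that were rare or unseen in training are treated as 'rare_val'
    known=test_df7[col].isin(list(label_encoder))
    test_df7[col]=test_df7[col].where(known,'rare_val').map(label_encoder)
    # If training had no 'rare_val' level for this feature, use the next free code
    test_df7[col]=test_df7[col].fillna(len(label_encoder))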

In [None]:
test_df8=test_df7.copy()
# Reuse the MinMaxScaler fitted on the training features (scale_cols) instead of refitting it on the test set
scaled=scaler.transform(test_df8[scale_cols])
X_test=pd.DataFrame(scaled,columns=scale_cols)
X_test.head()

In [None]:
X_test=X_test[selected_cols]
X_test.to_csv('X_test.csv',index=False)

#### Predict on the X_test dataset
Use GradientBoostingRegressor with the best params to predict the sale price for the X_test dataset.

In [None]:
# model_scores is sorted by RMSE_Cross_Validation, so the first row holds the best model
model_scores['best_params'].iloc[0]

In [None]:
prediction_price = GradientBoostingRegressor(n_estimators=1500,
                learning_rate=0.01,
                max_depth=3,
                min_samples_leaf=1,
                random_state=2,
                subsample = 0.2).fit(X_train,y_train).predict(X_test)

In [None]:

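# SalePrice was transformed with log(1+x) during feature engineering, so expm1 is its exact inverse
sample_sub_df1['SalePrice_new'] = np.expm1(prediction_price)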
# sample_sub_df1.to_csv('submission.csv', index=False)
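
Since SalePrice was transformed with log(x + 1) in the feature engineering phase, predictions have to be mapped back with the matching inverse. The sketch below checks the round trip on made-up prices and shows that a plain exponential is off by exactly 1.

```python
import numpy as np

# Hypothetical prices, including zero to show why log(1 + x) is used
prices = np.array([0.0, 129500.0, 755000.0])

logged = np.log1p(prices)      # forward transform: log(1 + x)
restored = np.expm1(logged)    # exact inverse: exp(y) - 1
wrong = np.exp(logged)         # plain exp returns x + 1

print(restored)
print(wrong - prices)
```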

In [None]:
sample_sub_df1