## Business Problem: Estimate the price of a house from its features

# Breakdown of this notebook:
1. **Loading the dataset:** Import the libraries and load the data.
2. **Exploratory Data Analysis (EDA):**
 - Deleting redundant columns.
 - Renaming columns where required.
 - Applying some transformations.

   **Data Preparation**
 - Treating NaN values in the dataset, if any.
 - Standardizing the numerical columns.
 - Handling categorical data by encoding it.
3. **Modelling**
 - Random Forest Regressor
 - Gradient Boosting Regressor
4. **Analysis**

## Data Understanding

In [1]:
# Import the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

In [2]:
# Import the dataset 
df = pd.read_csv('House_Pricing.csv')

In [3]:
df.head()

Unnamed: 0,year_built,stories,num_bedrooms,full_bathrooms,half_bathrooms,livable_sqft,total_sqft,garage_type,garage_sqft,carport_sqft,has_fireplace,has_pool,has_central_heating,has_central_cooling,house_number,street_name,unit_number,city,zip_code,sale_price
0,1978,1,4,1,1,1689,1859,attached,508,0,True,False,True,True,42670,Lopez Crossing,,Hallfort,10907,270897.0
1,1958,1,3,1,1,1984,2002,attached,462,0,True,False,True,True,5194,Gardner Park,,Hallfort,10907,302404.0
2,2002,1,3,2,0,1581,1578,none,0,625,False,False,True,True,4366,Harding Islands,,Lake Christinaport,11203,2721596.0
3,2004,1,4,2,0,1829,2277,attached,479,0,True,False,True,True,3302,Michelle Highway,,Lake Christinaport,11203,212968.0
4,2006,1,4,2,0,1580,1749,attached,430,0,True,False,True,True,582,Jacob Cape,,Lake Christinaport,11203,224529.0


In [4]:
# Remove the address fields that we don't want to include in our model
df1 = df.drop(columns = ['house_number','street_name','unit_number','zip_code'])
# Note: df2 is an alias of df1, not a copy (use df1.copy() for an independent frame)
df2 = df1

In [5]:
df1.head()

Unnamed: 0,year_built,stories,num_bedrooms,full_bathrooms,half_bathrooms,livable_sqft,total_sqft,garage_type,garage_sqft,carport_sqft,has_fireplace,has_pool,has_central_heating,has_central_cooling,city,sale_price
0,1978,1,4,1,1,1689,1859,attached,508,0,True,False,True,True,Hallfort,270897.0
1,1958,1,3,1,1,1984,2002,attached,462,0,True,False,True,True,Hallfort,302404.0
2,2002,1,3,2,0,1581,1578,none,0,625,False,False,True,True,Lake Christinaport,2721596.0
3,2004,1,4,2,0,1829,2277,attached,479,0,True,False,True,True,Lake Christinaport,212968.0
4,2006,1,4,2,0,1580,1749,attached,430,0,True,False,True,True,Lake Christinaport,224529.0


**DataFrame.describe()** generates descriptive statistics that summarize the central tendency, dispersion and shape of a dataset's distribution, excluding NaN values. By default it summarizes only the numeric columns: categorical columns are skipped unless the parameter include="all" is passed, in which case they are summarized by count, unique, top and freq instead.

Now, let's understand the statistics that are generated by the describe() method:
* count tells us the number of non-null rows in a feature.
* mean tells us the mean value of that feature.
* std tells us the standard deviation of that feature.
* min tells us the minimum value of that feature.
* 25%, 50%, and 75% are the quartiles of each feature. This quartile information helps us to detect outliers.
* max tells us the maximum value of that feature.
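
As a minimal sketch of the points above, the following uses a small made-up frame (the `toy` values are hypothetical, not taken from the dataset) to show the effect of include="all" and one common way to turn the quartiles into an outlier check, Tukey's 1.5 × IQR rule. The rule is an assumption for illustration; the notebook itself does not apply it.

```python
import pandas as pd

# Hypothetical toy frame; the values are made up for illustration only.
toy = pd.DataFrame({
    "livable_sqft": [1500, 1600, 1700, 1800, 12000],
    "garage_type": ["attached", "none", "attached", "detached", "attached"],
})

# By default only the numeric column is summarized.
numeric_summary = toy.describe()

# include="all" also summarizes the categorical column (count/unique/top/freq).
full_summary = toy.describe(include="all")

# Tukey's rule: flag values more than 1.5 * IQR outside the quartiles.
q1 = toy["livable_sqft"].quantile(0.25)
q3 = toy["livable_sqft"].quantile(0.75)
iqr = q3 - q1
is_outlier = (toy["livable_sqft"] < q1 - 1.5 * iqr) | (toy["livable_sqft"] > q3 + 1.5 * iqr)
outliers = toy[is_outlier]
print(outliers)
```

The 1.5 multiplier is a convention rather than a fixed rule, so the threshold is worth adjusting to the data at hand.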

In [6]:
df.describe()

Unnamed: 0,year_built,stories,num_bedrooms,full_bathrooms,half_bathrooms,livable_sqft,total_sqft,garage_sqft,carport_sqft,house_number,unit_number,zip_code,sale_price
count,42703.0,42703.0,42703.0,42703.0,42703.0,42703.0,42703.0,42703.0,42703.0,42703.0,3088.0,42703.0,42703.0
mean,1990.993209,1.365759,3.209283,1.923659,0.527153,1987.758986,2127.155446,455.8498,41.656324,18211.767347,2027.395402,11030.991476,441986.2
std,19.199987,0.513602,1.043396,0.759699,0.499268,846.76627,922.807342,243.453463,168.715867,27457.109993,1141.38377,573.576228,344285.7
min,1852.0,0.0,0.0,0.0,0.0,-3.0,5.0,-4.0,0.0,0.0,3.0,10004.0,664.0
25%,1980.0,1.0,3.0,1.0,0.0,1380.0,1466.0,412.0,0.0,674.0,1063.0,10537.0,285591.5
50%,1994.0,1.0,3.0,2.0,1.0,1808.0,1937.0,464.0,0.0,4530.0,2033.0,11071.0,402191.0
75%,2005.0,2.0,4.0,2.0,1.0,2486.0,2640.0,606.0,0.0,24844.5,2921.0,11510.0,532715.0
max,2017.0,4.0,31.0,8.0,1.0,12406.0,15449.0,8318.0,9200.0,99971.0,3998.0,11989.0,22935780.0


In [7]:
df1['sale_price'].min()

664.0

In [8]:
df1['sale_price'].max()

22935778.0

In [9]:
m = df1['sale_price'].mean()
m

441986.20551249327

In [10]:
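# Flag each sale as at or above the mean price (True) or below it (False).
# Note: the regressors below are trained on the raw prices in df, not on this flag.
df1['sale_price'] = df1['sale_price'] >= m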

In [11]:
# Encode the two price categories (False/True) as 0/1
from sklearn.preprocessing import LabelEncoder

In [12]:
le = LabelEncoder()

In [13]:
df1['sale_price'] = le.fit_transform(df1['sale_price'])
df1.head()

Unnamed: 0,year_built,stories,num_bedrooms,full_bathrooms,half_bathrooms,livable_sqft,total_sqft,garage_type,garage_sqft,carport_sqft,has_fireplace,has_pool,has_central_heating,has_central_cooling,city,sale_price
0,1978,1,4,1,1,1689,1859,attached,508,0,True,False,True,True,Hallfort,0
1,1958,1,3,1,1,1984,2002,attached,462,0,True,False,True,True,Hallfort,0
2,2002,1,3,2,0,1581,1578,none,0,625,False,False,True,True,Lake Christinaport,1
3,2004,1,4,2,0,1829,2277,attached,479,0,True,False,True,True,Lake Christinaport,0
4,2006,1,4,2,0,1580,1749,attached,430,0,True,False,True,True,Lake Christinaport,0


In [14]:
print("count samples & features: ", df1.shape) # printing the number of rows and columns
print("Are there missing values: ", df1.isnull().values.any()) # printing if dataset has any NaN value

count samples & features:  (42703, 16)
Are there missing values:  False


In [15]:
# Replace categorical data with one-hot encoded data
# • Garage type
# • city
from sklearn.preprocessing import LabelBinarizer
df_one_hot = df1.copy()
lb = LabelBinarizer()
lb_results = lb.fit_transform(df_one_hot['garage_type'])
lb_results_df = pd.DataFrame(lb_results, columns=lb.classes_)
lb_results_df.head()

Unnamed: 0,attached,detached,none
0,1,0,0
1,1,0,0
2,0,0,1
3,1,0,0
4,1,0,0


In [16]:
final_df = pd.concat([df_one_hot,lb_results_df],axis = 1)

In [17]:
df_one_hot_en = final_df.copy()
lb_en = LabelBinarizer()
lb_results_en = lb_en.fit_transform(df_one_hot_en['city'])
lb_results_df_en = pd.DataFrame(lb_results_en, columns=lb_en.classes_)
lb_results_df_en.head()

Unnamed: 0,Amystad,Brownport,Chadstad,Clarkberg,Coletown,Davidfort,Davidtown,East Amychester,East Janiceville,East Justin,...,South Anthony,South Stevenfurt,Toddshire,Wendybury,West Ann,West Brittanyview,West Gerald,West Gregoryview,West Lydia,West Terrence
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [18]:
final_df_en = pd.concat([df_one_hot_en,lb_results_df_en],axis = 1)

In [19]:
print("Original dimensions :", df.shape)
print("One hot Encoded dimensions :", final_df_en.shape)
final_df_en.head()

Original dimensions : (42703, 20)
One hot Encoded dimensions : (42703, 66)


Unnamed: 0,year_built,stories,num_bedrooms,full_bathrooms,half_bathrooms,livable_sqft,total_sqft,garage_type,garage_sqft,carport_sqft,...,South Anthony,South Stevenfurt,Toddshire,Wendybury,West Ann,West Brittanyview,West Gerald,West Gregoryview,West Lydia,West Terrence
0,1978,1,4,1,1,1689,1859,attached,508,0,...,0,0,0,0,0,0,0,0,0,0
1,1958,1,3,1,1,1984,2002,attached,462,0,...,0,0,0,0,0,0,0,0,0,0
2,2002,1,3,2,0,1581,1578,none,0,625,...,0,0,0,0,0,0,0,0,0,0
3,2004,1,4,2,0,1829,2277,attached,479,0,...,0,0,0,0,0,0,0,0,0,0
4,2006,1,4,2,0,1580,1749,attached,430,0,...,0,0,0,0,0,0,0,0,0,0
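
The two LabelBinarizer passes above can also be sketched in a single step with `pd.get_dummies`. This is a swapped-in alternative, not the notebook's own method, and the `toy` frame is a hypothetical stand-in for the real data. Note that `get_dummies` prefixes each new column with the source column name and drops the original categorical columns.

```python
import pandas as pd

# Hypothetical toy frame standing in for the housing data.
toy = pd.DataFrame({
    "garage_type": ["attached", "none", "detached"],
    "city": ["Hallfort", "Lake Christinaport", "Hallfort"],
    "livable_sqft": [1689, 1581, 1829],
})

# One-hot encode both categorical columns at once; the originals are dropped.
encoded = pd.get_dummies(toy, columns=["garage_type", "city"], dtype=int)
print(encoded.columns.tolist())
```

The prefixes (for example `garage_type_none`) avoid ambiguous column names such as a bare `none`.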


In [20]:
final_df_en.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42703 entries, 0 to 42702
Data columns (total 66 columns):
year_built              42703 non-null int64
stories                 42703 non-null int64
num_bedrooms            42703 non-null int64
full_bathrooms          42703 non-null int64
half_bathrooms          42703 non-null int64
livable_sqft            42703 non-null int64
total_sqft              42703 non-null int64
garage_type             42703 non-null object
garage_sqft             42703 non-null int64
carport_sqft            42703 non-null int64
has_fireplace           42703 non-null bool
has_pool                42703 non-null bool
has_central_heating     42703 non-null bool
has_central_cooling     42703 non-null bool
city                    42703 non-null object
sale_price              42703 non-null int64
attached                42703 non-null int32
detached                42703 non-null int32
none                    42703 non-null int32
Amystad                 42703 non-null

In [21]:
final_df_en = final_df_en.drop(columns =['garage_type','city'])

In [22]:
final_df_en.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42703 entries, 0 to 42702
Data columns (total 64 columns):
year_built              42703 non-null int64
stories                 42703 non-null int64
num_bedrooms            42703 non-null int64
full_bathrooms          42703 non-null int64
half_bathrooms          42703 non-null int64
livable_sqft            42703 non-null int64
total_sqft              42703 non-null int64
garage_sqft             42703 non-null int64
carport_sqft            42703 non-null int64
has_fireplace           42703 non-null bool
has_pool                42703 non-null bool
has_central_heating     42703 non-null bool
has_central_cooling     42703 non-null bool
sale_price              42703 non-null int64
attached                42703 non-null int32
detached                42703 non-null int32
none                    42703 non-null int32
Amystad                 42703 non-null int32
Brownport               42703 non-null int32
Chadstad                42703 non-null i

In [23]:
final_df_en['sale'] = final_df_en['sale_price']
final_df_en.head()

Unnamed: 0,year_built,stories,num_bedrooms,full_bathrooms,half_bathrooms,livable_sqft,total_sqft,garage_sqft,carport_sqft,has_fireplace,...,South Stevenfurt,Toddshire,Wendybury,West Ann,West Brittanyview,West Gerald,West Gregoryview,West Lydia,West Terrence,sale
0,1978,1,4,1,1,1689,1859,508,0,True,...,0,0,0,0,0,0,0,0,0,0
1,1958,1,3,1,1,1984,2002,462,0,True,...,0,0,0,0,0,0,0,0,0,0
2,2002,1,3,2,0,1581,1578,0,625,False,...,0,0,0,0,0,0,0,0,0,1
3,2004,1,4,2,0,1829,2277,479,0,True,...,0,0,0,0,0,0,0,0,0,0
4,2006,1,4,2,0,1580,1749,430,0,True,...,0,0,0,0,0,0,0,0,0,0


In [24]:
final_df_en = final_df_en.drop(columns = ['sale_price'])

In [25]:
final_df_en.head()

Unnamed: 0,year_built,stories,num_bedrooms,full_bathrooms,half_bathrooms,livable_sqft,total_sqft,garage_sqft,carport_sqft,has_fireplace,...,South Stevenfurt,Toddshire,Wendybury,West Ann,West Brittanyview,West Gerald,West Gregoryview,West Lydia,West Terrence,sale
0,1978,1,4,1,1,1689,1859,508,0,True,...,0,0,0,0,0,0,0,0,0,0
1,1958,1,3,1,1,1984,2002,462,0,True,...,0,0,0,0,0,0,0,0,0,0
2,2002,1,3,2,0,1581,1578,0,625,False,...,0,0,0,0,0,0,0,0,0,1
3,2004,1,4,2,0,1829,2277,479,0,True,...,0,0,0,0,0,0,0,0,0,0
4,2006,1,4,2,0,1580,1749,430,0,True,...,0,0,0,0,0,0,0,0,0,0


In [26]:
final_df_en = final_df_en.rename(columns = {'sale':'sale_price'})

In [27]:
final_df_en.head()

Unnamed: 0,year_built,stories,num_bedrooms,full_bathrooms,half_bathrooms,livable_sqft,total_sqft,garage_sqft,carport_sqft,has_fireplace,...,South Stevenfurt,Toddshire,Wendybury,West Ann,West Brittanyview,West Gerald,West Gregoryview,West Lydia,West Terrence,sale_price
0,1978,1,4,1,1,1689,1859,508,0,True,...,0,0,0,0,0,0,0,0,0,0
1,1958,1,3,1,1,1984,2002,462,0,True,...,0,0,0,0,0,0,0,0,0,0
2,2002,1,3,2,0,1581,1578,0,625,False,...,0,0,0,0,0,0,0,0,0,1
3,2004,1,4,2,0,1829,2277,479,0,True,...,0,0,0,0,0,0,0,0,0,0
4,2006,1,4,2,0,1580,1749,430,0,True,...,0,0,0,0,0,0,0,0,0,0


In [28]:
final_df_en['has_fireplace'] = final_df_en['has_fireplace'].astype(int)

In [29]:
final_df_en['has_pool'] = final_df_en['has_pool'].astype(int)

In [30]:
final_df_en['has_central_heating'] = final_df_en['has_central_heating'].astype(int)

In [31]:
final_df_en['has_central_cooling'] = final_df_en['has_central_cooling'].astype(int)

In [32]:
final_df_en.head()

Unnamed: 0,year_built,stories,num_bedrooms,full_bathrooms,half_bathrooms,livable_sqft,total_sqft,garage_sqft,carport_sqft,has_fireplace,...,South Stevenfurt,Toddshire,Wendybury,West Ann,West Brittanyview,West Gerald,West Gregoryview,West Lydia,West Terrence,sale_price
0,1978,1,4,1,1,1689,1859,508,0,1,...,0,0,0,0,0,0,0,0,0,0
1,1958,1,3,1,1,1984,2002,462,0,1,...,0,0,0,0,0,0,0,0,0,0
2,2002,1,3,2,0,1581,1578,0,625,0,...,0,0,0,0,0,0,0,0,0,1
3,2004,1,4,2,0,1829,2277,479,0,1,...,0,0,0,0,0,0,0,0,0,0
4,2006,1,4,2,0,1580,1749,430,0,1,...,0,0,0,0,0,0,0,0,0,0
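
The four separate `astype(int)` cells above could be sketched as one conversion over every boolean column. The `toy` frame below is hypothetical; the approach assumes that all boolean columns should become 0/1 integers.

```python
import pandas as pd

# Hypothetical toy frame with two boolean flags and one integer column.
toy = pd.DataFrame({
    "has_fireplace": [True, False],
    "has_pool": [False, False],
    "stories": [1, 2],
})

# Find every boolean column and cast them all to int in one call.
bool_cols = toy.select_dtypes(include="bool").columns
toy = toy.astype({col: int for col in bool_cols})
print(toy.dtypes)
```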


In [33]:
final_df_en.dtypes

year_built              int64
stories                 int64
num_bedrooms            int64
full_bathrooms          int64
half_bathrooms          int64
livable_sqft            int64
total_sqft              int64
garage_sqft             int64
carport_sqft            int64
has_fireplace           int32
has_pool                int32
has_central_heating     int32
has_central_cooling     int32
attached                int32
detached                int32
none                    int32
Amystad                 int32
Brownport               int32
Chadstad                int32
Clarkberg               int32
Coletown                int32
Davidfort               int32
Davidtown               int32
East Amychester         int32
East Janiceville        int32
East Justin             int32
East Lucas              int32
Fosterberg              int32
Hallfort                int32
Jeffreyhaven            int32
                        ...  
Lake Carolyn            int32
Lake Christinaport      int32
Lake Dariu

In [36]:
# Fit a StandardScaler on the numerical square-footage columns
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
print(scaler.fit(final_df_en[['livable_sqft','total_sqft','garage_sqft','carport_sqft']]))

StandardScaler(copy=True, with_mean=True, with_std=True)


In [37]:
# Note: the scaled values are only printed here; they are not written back to final_df_en
print(scaler.transform(final_df_en[['livable_sqft','total_sqft','garage_sqft','carport_sqft']]))

[[-0.35282757 -0.29059     0.21421265 -0.24690512]
 [-0.00443928 -0.13562626  0.02526262 -0.24690512]
 [-0.48037311 -0.59509916 -1.87245288  3.45759125]
 ...
 [-1.64599767 -1.63541516 -1.87245288  0.92075214]
 [-0.69649195 -0.78690742 -0.22530155 -0.24690512]
 [-0.69294902 -0.78148911 -0.24583959 -0.24690512]]
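
For the standardized values to reach a model, they have to be stored back in the frame. A minimal sketch, assuming a hypothetical `toy` frame with two of the square-footage columns:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frame; floats are used so the scaled values fit the dtype.
toy = pd.DataFrame({
    "livable_sqft": [1689.0, 1984.0, 1581.0, 1829.0],
    "total_sqft": [1859.0, 2002.0, 1578.0, 2277.0],
})
sqft_cols = ["livable_sqft", "total_sqft"]

# fit_transform learns the mean/std and returns the scaled array,
# which is assigned back so later steps see the standardized columns.
scaler = StandardScaler()
toy[sqft_cols] = scaler.fit_transform(toy[sqft_cols])
print(toy)
```

Tree ensembles such as random forests and gradient boosting are generally insensitive to this kind of rescaling, so the step matters more for distance-based or linear models. When scaling is used, fitting the scaler on the training split only avoids leaking test-set statistics.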


# Modelling

In [35]:
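# Create the X and y arrays from the first 5,000 rows
array = final_df_en.values
# Columns 1..62: this skips column 0 (year_built) and the last column (the 0/1 sale_price flag)
X = array[0:5000,1:63]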

In [36]:
# Target: the raw sale prices from the original frame (not the 0/1 flag in final_df_en)
y = df['sale_price'][0:5000]

In [37]:
X

array([[1, 4, 1, ..., 0, 0, 0],
       [1, 3, 1, ..., 0, 0, 0],
       [1, 3, 2, ..., 0, 0, 0],
       ...,
       [1, 3, 1, ..., 0, 0, 0],
       [1, 3, 2, ..., 0, 0, 0],
       [1, 3, 3, ..., 0, 0, 0]], dtype=int64)

In [38]:
# Split the data set in a training set (70%) and a test set (30%)
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=7)
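
Selecting features by column name rather than by position makes it harder to drop a feature by accident. The sketch below assumes a hypothetical `toy` frame; the first five prices echo the rows shown earlier and the rest are made up.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame; values beyond the first five rows are invented.
toy = pd.DataFrame({
    "year_built": [1978, 1958, 2002, 2004, 2006, 1990, 1985, 2010, 1999, 1970],
    "livable_sqft": [1689, 1984, 1581, 1829, 1580, 1700, 1500, 2100, 1900, 1400],
    "sale_price": [270897.0, 302404.0, 2721596.0, 212968.0, 224529.0,
                   250000.0, 240000.0, 310000.0, 280000.0, 200000.0],
})

# Every column except the target becomes a feature, so none is skipped.
X = toy.drop(columns=["sale_price"])
y = toy["sale_price"]

# Same 70/30 split and seed as in the cell above.
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=7)
print(x_train.shape, x_test.shape)
```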

In [39]:
final_df_en.head()

Unnamed: 0,year_built,stories,num_bedrooms,full_bathrooms,half_bathrooms,livable_sqft,total_sqft,garage_sqft,carport_sqft,has_fireplace,...,South Stevenfurt,Toddshire,Wendybury,West Ann,West Brittanyview,West Gerald,West Gregoryview,West Lydia,West Terrence,sale_price
0,1978,1,4,1,1,1689,1859,508,0,1,...,0,0,0,0,0,0,0,0,0,0
1,1958,1,3,1,1,1984,2002,462,0,1,...,0,0,0,0,0,0,0,0,0,0
2,2002,1,3,2,0,1581,1578,0,625,0,...,0,0,0,0,0,0,0,0,0,1
3,2004,1,4,2,0,1829,2277,479,0,1,...,0,0,0,0,0,0,0,0,0,0
4,2006,1,4,2,0,1580,1749,430,0,1,...,0,0,0,0,0,0,0,0,0,0


In [40]:
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(random_state = 7)
regressor.fit(x_train,y_train)



RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
                      max_features='auto', max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=10,
                      n_jobs=None, oob_score=False, random_state=7, verbose=0,
                      warm_start=False)

In [41]:
# Predicting the Test set results
y_pred = regressor.predict(x_test)

In [42]:
from sklearn.metrics import r2_score
r2_score(y_test, y_pred)

0.606600855938997

##### Hyperparameter tuning for Random Forest Regression model

In [43]:
from sklearn.metrics import make_scorer,mean_squared_error
from sklearn.model_selection import GridSearchCV
param_grid = { 
    'n_estimators': [500,1000],
    'max_features': [1.0,0.3,0.1],
    'max_depth' : [4,6],
    'min_samples_leaf' : [3,5,9,17]
}
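# Caution: make_scorer(mean_squared_error) defaults to greater_is_better=True, so this
# search ranks candidates by the HIGHEST MSE; scoring='neg_mean_squared_error' ranks by the lowest.
grid = GridSearchCV(estimator = regressor,scoring = make_scorer(mean_squared_error), param_grid=param_grid,n_jobs = -1,cv = 4,
                    verbose = 2)
grid.fit(x_train,y_train)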

Fitting 4 folds for each of 48 candidates, totalling 192 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:  1.5min
[Parallel(n_jobs=-1)]: Done 154 tasks      | elapsed:  4.6min
[Parallel(n_jobs=-1)]: Done 192 out of 192 | elapsed:  5.2min finished


GridSearchCV(cv=4, error_score='raise-deprecating',
             estimator=RandomForestRegressor(bootstrap=True, criterion='mse',
                                             max_depth=None,
                                             max_features='auto',
                                             max_leaf_nodes=None,
                                             min_impurity_decrease=0.0,
                                             min_impurity_split=None,
                                             min_samples_leaf=1,
                                             min_samples_split=2,
                                             min_weight_fraction_leaf=0.0,
                                             n_estimators=10, n_jobs=None,
                                             oob_score=False, random_state=7,
                                             verbose=0, warm_start=False),
             iid='warn', n_jobs=-1,
             param_grid={'max_depth': [4, 6], 'max_features': [1.

In [44]:
# Predict on the test set.
# Note: this uses the original regressor, not grid.best_estimator_, so y_pred1 equals y_pred
y_pred1 = regressor.predict(x_test)

In [45]:
# best_score_ is the mean cross-validated MSE of the selected candidate, not an accuracy
best_accuracy = grid.best_score_
best_parameters = grid.best_params_
print(best_accuracy)
print(best_parameters)

49249255580.104355
{'max_depth': 4, 'max_features': 0.1, 'min_samples_leaf': 17, 'n_estimators': 500}
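
A hedged sketch of how the search could be set up so that it minimizes MSE and so that the tuned model is the one evaluated. It runs on synthetic data with a deliberately small grid (both are assumptions chosen to keep the example fast), uses scikit-learn's built-in `neg_mean_squared_error` scorer, and predicts with `grid.best_estimator_`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic regression data, for illustration only.
rng = np.random.default_rng(7)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=7)

# neg_mean_squared_error is "higher is better", so the search picks the lowest MSE.
grid = GridSearchCV(
    estimator=RandomForestRegressor(random_state=7),
    param_grid={"n_estimators": [20, 50], "max_depth": [2, 6]},
    scoring="neg_mean_squared_error",
    cv=4,
)
grid.fit(x_train, y_train)

# Flip the sign to report a conventional (positive) MSE.
best_mse = -grid.best_score_

# Evaluate the tuned model, not the estimator that was passed in.
tuned_pred = grid.best_estimator_.predict(x_test)
print(grid.best_params_, best_mse, r2_score(y_test, tuned_pred))
```

With the default `refit=True`, `grid.predict(x_test)` is equivalent to calling `predict` on `grid.best_estimator_`.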


In [47]:
from sklearn import metrics
from sklearn.metrics import r2_score
# RMSE
print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

173687.32223434898


In [48]:
x = r2_score(y_test, y_pred1)
x

0.606600855938997

In [49]:
# Fit a gradient boosting regressor on the training set
# (the variable is named classifier, but the model is a regressor)
classifier = GradientBoostingRegressor(random_state = 7)
classifier.fit(x_train, y_train)

GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
                          learning_rate=0.1, loss='ls', max_depth=3,
                          max_features=None, max_leaf_nodes=None,
                          min_impurity_decrease=0.0, min_impurity_split=None,
                          min_samples_leaf=1, min_samples_split=2,
                          min_weight_fraction_leaf=0.0, n_estimators=100,
                          n_iter_no_change=None, presort='auto', random_state=7,
                          subsample=1.0, tol=0.0001, validation_fraction=0.1,
                          verbose=0, warm_start=False)

In [50]:
y_pred3 = classifier.predict(x_test)

In [51]:
r2_score(y_test, y_pred3)

0.7061876793207948

##### Hyperparameter tuning for Gradient Boosting Regression model

In [52]:
param_grid = { 
    'n_estimators': [500,1000],
    'max_features': [1.0,0.3,0.1],
    'max_depth' : [4,6],
    'min_samples_leaf' : [3,5,9,17],
    'learning_rate' : [0.1, 0.05, 0.02, 0.01]
}
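# Caution: make_scorer(mean_squared_error) defaults to greater_is_better=True, so this
# search ranks candidates by the HIGHEST MSE; scoring='neg_mean_squared_error' ranks by the lowest.
grid_gb = GridSearchCV(estimator = classifier,scoring = make_scorer(mean_squared_error), param_grid=param_grid,n_jobs = -1,cv = 2,
                      verbose = 2)
grid_gb.fit(x_train,y_train)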

Fitting 2 folds for each of 192 candidates, totalling 384 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:  1.4min
[Parallel(n_jobs=-1)]: Done 154 tasks      | elapsed:  3.8min
[Parallel(n_jobs=-1)]: Done 357 tasks      | elapsed:  8.9min
[Parallel(n_jobs=-1)]: Done 384 out of 384 | elapsed:  9.5min finished


GridSearchCV(cv=2, error_score='raise-deprecating',
             estimator=GradientBoostingRegressor(alpha=0.9,
                                                 criterion='friedman_mse',
                                                 init=None, learning_rate=0.1,
                                                 loss='ls', max_depth=3,
                                                 max_features=None,
                                                 max_leaf_nodes=None,
                                                 min_impurity_decrease=0.0,
                                                 min_impurity_split=None,
                                                 min_samples_leaf=1,
                                                 min_samples_split=2,
                                                 min_weight_fraction_leaf=0.0,
                                                 n_estimators=100,
                                                 n_iter...
                             

In [53]:
# best_score_ is the mean cross-validated MSE of the selected candidate, not an accuracy
best_accuracy_gb = grid_gb.best_score_
best_parameters_gb = grid_gb.best_params_
print(best_accuracy_gb)
print(best_parameters_gb)

51794668750.39525
{'learning_rate': 0.1, 'max_depth': 6, 'max_features': 1.0, 'min_samples_leaf': 17, 'n_estimators': 1000}


In [54]:
# Predict on the test set.
# Note: this uses the original classifier model, not grid_gb.best_estimator_, so y_pred4 equals y_pred3
y_pred4 = classifier.predict(x_test)
y_pred4

array([548468.41259094, 452806.19933022, 632868.15231262, ...,
       521459.6503771 , 449173.86595439, 440203.52972092])

In [55]:
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23

# Save the random forest model (regressor) as a pickle file
joblib.dump(regressor, 'House_Price_Estimation.pkl')



['House_Price_Estimation.pkl']
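
A saved model is only useful if it can be loaded back. The sketch below shows a save and load round trip with a toy model; the data, the temporary directory and the file name `toy_model.pkl` are all hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy data and model, for illustration only.
X = np.arange(20, dtype=float).reshape(10, 2)
y = X.sum(axis=1)
model = RandomForestRegressor(n_estimators=10, random_state=7).fit(X, y)

# Write the fitted model to a temporary location and read it back.
path = os.path.join(tempfile.mkdtemp(), "toy_model.pkl")
joblib.dump(model, path)
restored = joblib.load(path)

# The restored model should reproduce the original predictions.
same = np.allclose(model.predict(X), restored.predict(X))
print(same)
```

Pickled models are generally tied to the library version that wrote them, so it helps to load them with the same scikit-learn version used for saving.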

In [56]:
# RMSE of the random forest predictions on the test set
print(np.sqrt(metrics.mean_squared_error(y_test, y_pred1)))

173687.32223434898


In [57]:
r2_score(y_test, y_pred4)

0.7061876793207948

In [58]:
# Training error, defined here as 1 - R² of the random forest on the training set
train_error = 1 - (regressor.score(x_train,y_train))

In [59]:
train_error

0.055917837040827645

In [60]:
# Test error, defined the same way (x holds the random forest's R² on the test set)
test_error = 1 - x
test_error

0.39339914406100296
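
The "error" used in the cells above is 1 − R², which is the residual sum of squares divided by the total sum of squares. It is a unitless ratio, unlike RMSE, which is expressed in the target's units. A small worked sketch with made-up numbers:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up targets and predictions, for illustration only.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 8.0])

# Residual and total sums of squares.
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

# 1 - R² equals ss_res / ss_tot.
one_minus_r2 = 1 - r2_score(y_true, y_hat)

# RMSE is the square root of the mean squared residual.
rmse = np.sqrt(mean_squared_error(y_true, y_hat))
print(one_minus_r2, ss_res / ss_tot, rmse)
```

The large gap between the training value (about 0.06) and the test value (about 0.39) reported above suggests that the random forest fits the training data much more closely than it generalizes.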

In [62]:
final_df_en.head()

Unnamed: 0,year_built,stories,num_bedrooms,full_bathrooms,half_bathrooms,livable_sqft,total_sqft,garage_sqft,carport_sqft,has_fireplace,...,South Stevenfurt,Toddshire,Wendybury,West Ann,West Brittanyview,West Gerald,West Gregoryview,West Lydia,West Terrence,sale_price
0,1978,1,4,1,1,1689,1859,508,0,1,...,0,0,0,0,0,0,0,0,0,0
1,1958,1,3,1,1,1984,2002,462,0,1,...,0,0,0,0,0,0,0,0,0,0
2,2002,1,3,2,0,1581,1578,0,625,0,...,0,0,0,0,0,0,0,0,0,1
3,2004,1,4,2,0,1829,2277,479,0,1,...,0,0,0,0,0,0,0,0,0,0
4,2006,1,4,2,0,1580,1749,430,0,1,...,0,0,0,0,0,0,0,0,0,0


In [63]:
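# Predict the price of a single mock house with the random forest.
# Note: the model was trained on X = array[0:5000,1:63], whose first feature is stories,
# so this row (which starts with year_built) is shifted by one column relative to training.
Real_predictions_rf = regressor.predict([[1998,1,3,2,0,1602,1986,420,0,1,0,1,1,0,1,0,0,0,1,1,0,1,0,0,0,1,1,1,0,0,
                                       0,1,1,1,0,1,1,0,0,1,1,1,0,1,0,1,1,1,0,0,0,0,1,0,1,1,0,1,1,1,
                                      1,1]])
Real_predictions_rf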

array([786757.8])

In [64]:
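# Predict the price of the same mock house with the gradient boosting model.
# Note: the same one-column shift relative to the training features applies here.
Real_predictions_gb = classifier.predict([[1998,1,3,2,0,1602,1986,420,0,1,0,1,1,0,1,0,0,0,1,1,0,1,0,0,0,1,1,1,0,0,
                                       0,1,1,1,0,1,1,0,0,1,1,1,0,1,0,1,1,1,0,0,0,0,1,0,1,1,0,1,1,1,
                                      1,1]])
Real_predictions_gb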

array([1320443.85180383])
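
Hand-typing a long positional feature list is error-prone, because every value must line up with the training columns. A minimal sketch of one alternative, assuming a hypothetical toy training frame: describe the new house by column name, encode it the same way, and reindex it to the training columns so that missing one-hot columns are filled with 0.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Hypothetical toy training data; the values are made up.
train = pd.DataFrame({
    "year_built": [1978, 1958, 2002, 2004, 2006, 1990],
    "livable_sqft": [1689, 1984, 1581, 1829, 1580, 1700],
    "garage_type": ["attached", "attached", "none", "attached", "detached", "none"],
    "sale_price": [270897.0, 302404.0, 250000.0, 212968.0, 224529.0, 240000.0],
})
X = pd.get_dummies(train.drop(columns=["sale_price"]), columns=["garage_type"], dtype=int)
model = RandomForestRegressor(n_estimators=10, random_state=7).fit(X, train["sale_price"])

# Describe the new house by name, then align it to the training columns.
new_house = pd.DataFrame([{"year_built": 1998, "livable_sqft": 1602, "garage_type": "attached"}])
new_X = pd.get_dummies(new_house, columns=["garage_type"], dtype=int)
new_X = new_X.reindex(columns=X.columns, fill_value=0)

prediction = model.predict(new_X)
print(prediction)
```

Because each categorical value sets exactly one indicator column, this also avoids rows in which several city or garage columns are 1 at the same time.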

## Points observed in Analysis:

1. The dataset provides the house details needed to predict the sale price.
2. The data was cleaned and preprocessed.
3. A Random Forest regressor and a Gradient Boosting regressor were fitted and compared by R² score. The Gradient Boosting regressor scored higher (about 0.71 versus 0.61 on the test set), so it is the model to use for further predictions.
4. Finally, both models were used to predict the price of a house from mock data.