# House Prices: Advanced Regression Techniques
Author: Jingwen ZHENG<br>
Update: 2019-05-06

## Content
- Project understanding
- Objective
- Practice skills
- Python packages to be applied
- Import data
- Data description
- Data cleaning
- Data analysis
- Data preprocessing for building models
- Train models

## Project understanding
Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition's dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.

With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.

## Objective
It is your job to predict the sales price for each house. For each Id in the test set, you must predict the value of the SalePrice variable. 

## Practice skills
- Creative feature engineering 
- Advanced regression techniques like random forest and gradient boosting

## Python packages to be applied

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder, StandardScaler


## Import data

In [None]:
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

In [None]:
print('Dimension train_df:', train_df.shape)
print('Dimension test_df:', test_df.shape)

In [None]:
train_df.head(3)

In [None]:
test_df.head(3)

## Data description

In [None]:
train_df.describe(include='all').T

In [None]:
train_df.info()

## Data cleaning

The following columns contain missing values: "LotFrontage", "Alley", "MasVnrType", "MasVnrArea", "BsmtQual", "BsmtCond", "BsmtExposure", "BsmtFinType1", "BsmtFinType2", "Electrical", "FireplaceQu", "GarageType", "GarageYrBlt", "GarageFinish", "GarageQual", "GarageCond", "PoolQC", "Fence" and "MiscFeature".

Among these fields,
- 94% of the "Alley" values are missing.
- 47% of the "FireplaceQu" values are missing.
- 99.5% of the "PoolQC" values are missing.
- 81% of the "Fence" values are missing.
- 96% of the "MiscFeature" values are missing.

Because so many of their values are missing, we drop these five columns from the analysis.
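The percentages above can be computed directly from the data. The following is a minimal sketch on a made-up toy frame, not the real `train_df`; the 40% cut-off is a hypothetical threshold chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_df; the values are made up for illustration.
toy_df = pd.DataFrame({
    'LotFrontage': [65.0, np.nan, 68.0, 60.0],
    'Alley': [np.nan, np.nan, np.nan, 'Pave'],
    'PoolQC': [np.nan, np.nan, np.nan, np.nan],
    'LotArea': [8450, 9600, 11250, 9550],
})

# isna().mean() gives the fraction of missing values in each column.
missing_share = toy_df.isna().mean().sort_values(ascending=False)

# Hypothetical cut-off: columns with more than 40% missing values count as too sparse.
sparse_cols = missing_share[missing_share > 0.4].index.tolist()

print(missing_share)
print(sparse_cols)
```

On the real data, the same two lines applied to `train_df` would list the sparse columns without typing their names by hand.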

For the other fields, we replace missing values with the median (numeric columns) or the mode, i.e. the most frequent value (categorical columns).

In [None]:
# Columns filled with the median (numeric) or the most frequent value (categorical)
median_cols = ['LotFrontage', 'MasVnrArea', 'GarageYrBlt']
mode_cols = ['MasVnrType', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1',
             'BsmtFinType2', 'Electrical', 'GarageType', 'GarageFinish',
             'GarageQual', 'GarageCond']

# Assign the result back instead of calling fillna(inplace=True) on a column
# selection: recent pandas versions warn about that pattern and may not
# update the parent frame.
for col in median_cols:
    train_df[col] = train_df[col].fillna(train_df[col].median())
for col in mode_cols:
    train_df[col] = train_df[col].fillna(train_df[col].value_counts().index[0])

In [None]:
# Same treatment for the test set, using the column lists defined above
for col in median_cols:
    test_df[col] = test_df[col].fillna(test_df[col].median())
for col in mode_cols:
    test_df[col] = test_df[col].fillna(test_df[col].value_counts().index[0])
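The cells above fill each set with its own statistics. A common alternative is to learn the fill values on the training set only and reuse them for the test set, so that both sets are treated identically. The sketch below illustrates this on made-up toy frames; `toy_train`, `toy_test` and `fill_values` are hypothetical names, and this is not necessarily the approach used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frames standing in for train_df and test_df (made-up values).
toy_train = pd.DataFrame({
    'LotFrontage': [60.0, 70.0, np.nan, 80.0],
    'Electrical': ['SBrkr', 'SBrkr', 'FuseA', np.nan],
})
toy_test = pd.DataFrame({
    'LotFrontage': [np.nan, 50.0],
    'Electrical': [np.nan, 'FuseA'],
})

# Learn one fill value per column from the training data only:
# the median for numeric columns, the most frequent value otherwise.
fill_values = {}
for col in toy_train.columns:
    if pd.api.types.is_numeric_dtype(toy_train[col]):
        fill_values[col] = toy_train[col].median()
    else:
        fill_values[col] = toy_train[col].mode().iloc[0]

# fillna with a dict fills each column with its own value and returns a new frame.
toy_train = toy_train.fillna(fill_values)
toy_test = toy_test.fillna(fill_values)

print(fill_values)
print(toy_test)
```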

In [None]:
train_df.drop(columns=['Alley', 'FireplaceQu', 'PoolQC', 'Fence', 'MiscFeature'], inplace=True)
test_df.drop(columns=['Alley', 'FireplaceQu', 'PoolQC', 'Fence', 'MiscFeature'], inplace=True)

In [None]:
# Transform some numerical variables that are really categorical
for col in ['MSSubClass', 'OverallCond', 'YrSold', 'MoSold']:
    train_df[col] = train_df[col].astype(str)
    test_df[col] = test_df[col].astype(str)

In [None]:
train_df.shape

In [None]:
test_df.shape

In [None]:
train_df['SalePrice_per_squareFeet'] = train_df['SalePrice'] / train_df['LotArea']

In [None]:
train_df.hist(bins=40, figsize=(16, 20), density=True)
plt.subplots_adjust(hspace=0.4, wspace=0.45)#, top=0.97, bottom=0.03, left=0.04, right=0.95)
plt.show()

According to the histograms and the summary statistics above, half of the "SalePrice" values lie between about 130k dollars (1st quartile) and 214k dollars (3rd quartile).
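The quartiles can be read from `describe()` or computed directly. Here is a minimal sketch on made-up prices, assuming a Series shaped like `train_df['SalePrice']`.

```python
import pandas as pd

# Made-up sale prices standing in for train_df['SalePrice'].
prices = pd.Series([100_000, 130_000, 150_000, 163_000, 180_000, 214_000, 300_000])

# 1st and 3rd quartiles (linear interpolation is the pandas default).
q1, q3 = prices.quantile([0.25, 0.75])

# Share of observations that fall inside the interquartile range.
share_inside = prices.between(q1, q3).mean()

print(q1, q3, share_inside)
```

By construction, roughly half of the observations fall between the two quartiles; on a small sample the share is only approximate.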

## Data analysis

### Correlation matrix between numerical values and "SalePrice"

In [None]:
train_df.info()

In [None]:
sns.set(rc={'figure.figsize':(20, 18)})
num_fields = ['LotFrontage', 'LotArea', 'OverallQual', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea',
              'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF',
              'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath',
              'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageYrBlt', 'GarageCars',
              'GarageArea', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch',
              'PoolArea', 'MiscVal', 'SalePrice', 'SalePrice_per_squareFeet']

sns.heatmap(train_df[num_fields].corr(),
            annot=True,
            fmt='.2f',
            cmap='coolwarm')
plt.show()

As the correlation heatmap shows, "SalePrice" is most strongly correlated with "OverallQual", "GrLivArea" and "GarageCars".
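Reading the strongest correlations off a large heatmap is error-prone, so ranking them numerically can help. The sketch below uses a made-up toy frame with a few of the notebook's column names; the values are invented for illustration.

```python
import pandas as pd

# Toy frame with made-up values; the column names mirror a few fields of train_df.
toy_df = pd.DataFrame({
    'OverallQual': [5, 6, 7, 8, 9],
    'GrLivArea': [1000, 1500, 1400, 2000, 2400],
    'MiscVal': [0, 500, 0, 0, 100],
    'SalePrice': [120_000, 160_000, 175_000, 230_000, 310_000],
})

# Pearson correlation of every field with SalePrice, excluding SalePrice itself.
corr_with_price = toy_df.corr()['SalePrice'].drop('SalePrice')

# Order the fields by absolute correlation, strongest first.
order = corr_with_price.abs().sort_values(ascending=False).index
ranked = corr_with_price.reindex(order)

print(ranked)
```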

### Relationship between "SalePrice" and numeric fields

In [None]:
# Each entry: (column, y-axis label, lower y-limit or None to keep the default)
scatter_specs = [
    ('BsmtFinSF1', 'Type 1 finished square feet (BsmtFinSF1)', -50),
    ('BsmtFinSF2', 'Type 2 finished square feet (BsmtFinSF2)', -50),
    ('LotArea', 'Lot size in square feet (LotArea)', -50),
    ('YearBuilt', 'Original construction date (YearBuilt)', 1860),
    ('YearRemodAdd', 'Remodel date (YearRemodAdd)', None),
    ('BsmtUnfSF', 'Unfinished square feet of \nbasement area (BsmtUnfSF)', -50),
    ('TotalBsmtSF', 'Total square feet of basement area\n(TotalBsmtSF)', -50),
    ('1stFlrSF', 'First Floor square feet (1stFlrSF)', -50),
    ('2ndFlrSF', 'Second floor square feet (2ndFlrSF)', -50),
    ('GrLivArea', 'Above grade (ground) living area\nsquare feet (GrLivArea)', -50),
    ('BedroomAbvGr', 'Number of bedrooms above\nbasement level (BedroomAbvGr)', -0.5),
    ('KitchenAbvGr', 'Number of kitchens (KitchenAbvGr)', -0.2),
    ('TotRmsAbvGrd', 'Total rooms above grade (does NOT\ninclude bathrooms) (TotRmsAbvGrd)', -0.2),
    ('Fireplaces', 'Number of fireplaces (Fireplaces)', -0.2),
    ('GarageArea', 'Size of garage in square feet\n(GarageArea)', -50),
    ('WoodDeckSF', 'Wood deck area in square feet\n(WoodDeckSF)', -50),
    ('OpenPorchSF', 'Open porch area in square feet\n(OpenPorchSF)', -50),
    ('EnclosedPorch', 'Enclosed porch area in square feet\n(EnclosedPorch)', -50),
]

fig, axarr = plt.subplots(nrows=6, ncols=3, figsize=(15, 30))

# axarr.flat walks the 6x3 grid row by row, matching the order of scatter_specs
for ax, (col, ylabel, y_bottom) in zip(axarr.flat, scatter_specs):
    ax.scatter(x=train_df['SalePrice_per_squareFeet'],
               y=train_df[col],
               s=30,
               alpha=0.2)
    ax.set_xlabel('SalePrice per square foot ($)')
    ax.set_ylabel(ylabel)
    if y_bottom is not None:
        ax.set_ylim(bottom=y_bottom)

plt.subplots_adjust(hspace=0.2, wspace=0.3)
plt.show()

Taking a subset of the numeric fields, the scatter plots above show the relationship between "SalePrice per square foot" and each of them:

- The more recent the construction or remodel, the higher the "SalePrice per square foot".
- The more rooms above grade, the higher the "SalePrice per square foot".
- The larger the lot size (LotArea), the lower the "SalePrice per square foot".
- For lots whose total basement area is no larger than 40 square feet, the larger the total basement area, the lower the "SalePrice per square foot"; for lots whose total basement area is larger than 40 square feet, the "SalePrice per square foot" is between 500\$ and 2000\$.
- For lots whose above grade (ground) living area is no larger than 40 square feet, the larger the above grade (ground) living area, the higher the "SalePrice per square foot"; for lots whose above grade (ground) living area is larger than 40 square feet, the "SalePrice per square foot" is between 500\$ and 2000\$.
- Etc.

### Relationship between "SalePrice_per_squareFeet" and category fields

"SalePrice_per_squareFeet" vs. "MSSubClass"

In [None]:
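# sns.factorplot was renamed to catplot and its `size` argument to `height`;
# catplot creates its own figure, so no separate plt.figure() call is needed.
class_price_plt = sns.catplot(data=train_df,
                              x='MSSubClass',
                              y='SalePrice_per_squareFeet',
                              height=6,
                              kind='bar',
                              palette='muted',
                              aspect=2)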
class_price_plt.despine(left=True)
class_price_plt.set_ylabels('SalePrice per square feet ($)')

plt.show()

Among all building classes, the three most expensive per square foot are "2-STORY PUD - 1946 & NEWER", "PUD - MULTILEVEL - INCL SPLIT LEV/FOYER" and "1-STORY PUD (Planned Unit Development) - 1946 & NEWER".

"SalePrice_per_squareFeet" vs. "MSZoning"

In [None]:
# catplot replaces the deprecated factorplot (`size` is now `height`)
zonecls_price_plt = sns.catplot(data=train_df,
                                x='MSZoning',
                                y='SalePrice_per_squareFeet',
                                height=6,
                                kind='bar',
                                palette='muted',
                                aspect=2)
zonecls_price_plt.despine(left=True)
zonecls_price_plt.set_ylabels('SalePrice per square feet ($)')

plt.show()

The graph above shows the sale price per square foot by general zoning classification. Among the 5 zoning classes, "Floating Village Residential (FV)" is the most expensive; next come "Residential Medium Density (RM)", "Residential High Density (RH)" and "Residential Low Density (RL)"; "Commercial (C)" is the cheapest.

Considering construction difficulty and rarity, it is understandable that "Floating Village Residential (FV)" has the highest price per square foot. The "Commercial" class, by contrast, has fewer restrictions, which may explain why it is the cheapest.

"SalePrice_per_squareFeet" vs. "LotShape"

In [None]:
# catplot replaces the deprecated factorplot (`size` is now `height`)
lotshape_price_plt = sns.catplot(data=train_df,
                                 x='LotShape',
                                 y='SalePrice_per_squareFeet',
                                 height=6,
                                 kind='bar',
                                 palette='muted',
                                 aspect=2)
lotshape_price_plt.despine(left=True)
lotshape_price_plt.set_ylabels('SalePrice per square feet ($)')

plt.show()

The relationship between the general shape of the property (LotShape) and the sale price per square foot is easy to understand: people usually prefer a regular shape (Reg), since it simplifies the overall layout and is more comfortable to live in.

"SalePrice_per_squareFeet" vs. "Utilities"

In [None]:
# catplot replaces the deprecated factorplot (`size` is now `height`)
utility_price_plt = sns.catplot(data=train_df,
                                x='Utilities',
                                y='SalePrice_per_squareFeet',
                                height=6,
                                kind='bar',
                                palette='muted',
                                aspect=2)
utility_price_plt.despine(left=True)
utility_price_plt.set_ylabels('SalePrice per square feet ($)')

plt.show()

The result of this plot is interesting: as expected, the more complete the utilities, the higher the price per square foot. In addition, the price per square foot of a property with all public utilities available is about double that of a property with only electricity and gas.

"SalePrice_per_squareFeet" vs. "LotConfig"

In [None]:
# catplot replaces the deprecated factorplot (`size` is now `height`)
lotconfig_price_plt = sns.catplot(data=train_df,
                                  x='LotConfig',
                                  y='SalePrice_per_squareFeet',
                                  height=6,
                                  kind='bar',
                                  palette='muted',
                                  aspect=2)
lotconfig_price_plt.despine(left=True)
lotconfig_price_plt.set_ylabels('SalePrice per square feet ($)')

plt.show()

Considering light, ventilation and view, a lot with "Frontage on 3 sides of property" is the best placed, so its price per square foot is the highest among the 5 configurations. By contrast, lots located on a cul-de-sac have the lowest price per square foot.

"SalePrice_per_squareFeet" vs. "Neighborhood"

In [None]:
# catplot replaces the deprecated factorplot (`size` is now `height`)
neighbor_price_plt = sns.catplot(data=train_df,
                                 x='Neighborhood',
                                 y='SalePrice_per_squareFeet',
                                 height=6,
                                 kind='bar',
                                 palette='muted',
                                 aspect=2)
neighbor_price_plt.despine(left=True)
neighbor_price_plt.set_ylabels('SalePrice per square feet ($)')
neighbor_price_plt.set_xticklabels(rotation=20)

plt.show()

For economic, political or geographical reasons, prices vary strongly by neighborhood: a lot near Bluestem costs nearly 90 dollars per square foot, and a lot near Bloomington Heights or Briardale about 60 dollars, whereas a lot near Clear Creek costs only about 15 dollars.

"SalePrice_per_squareFeet" vs. "OverallQual"

In [None]:
# catplot replaces the deprecated factorplot (`size` is now `height`)
overallQual_price_plt = sns.catplot(data=train_df,
                                    x='OverallQual',
                                    y='SalePrice_per_squareFeet',
                                    height=6,
                                    kind='bar',
                                    palette='muted',
                                    aspect=2)
overallQual_price_plt.despine(left=True)
overallQual_price_plt.set_ylabels('SalePrice per square feet ($)')

plt.show()

The better the overall material is, the more expensive the lot is. Interestingly, the mean price per square foot of "very excellent" lots is slightly lower than that of "excellent" ones, and its error bar is wider than for the other ratings (the bar plot shows the mean of each group with a confidence interval, not the median and variance).
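The bar plots in this section show a mean per category with an error bar. A per-group table of count, mean, median and standard deviation makes such comparisons explicit. The sketch below uses made-up values for two quality ratings; it only illustrates the technique and does not reproduce the notebook's numbers.

```python
import pandas as pd

# Made-up values standing in for two OverallQual ratings of train_df.
toy_df = pd.DataFrame({
    'OverallQual': [9, 9, 9, 9, 10, 10],
    'SalePrice_per_squareFeet': [30.0, 32.0, 31.0, 33.0, 15.0, 45.0],
})

# One row per rating: group size, mean, median and standard deviation.
summary = (
    toy_df.groupby('OverallQual')['SalePrice_per_squareFeet']
    .agg(['count', 'mean', 'median', 'std'])
)

print(summary)
```

A small group with a large spread, like rating 10 in this toy data, is exactly the situation in which a mean can drop below that of the neighbouring rating.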

"SalePrice_per_squareFeet" vs. "RoofMatl"

In [None]:
# catplot replaces the deprecated factorplot (`size` is now `height`)
roofMatl_price_plt = sns.catplot(data=train_df,
                                 x='RoofMatl',
                                 y='SalePrice_per_squareFeet',
                                 height=6,
                                 kind='bar',
                                 palette='muted',
                                 aspect=2)
roofMatl_price_plt.despine(left=True)
roofMatl_price_plt.set_ylabels('SalePrice per square feet ($)')

plt.show()

Considering insulation, drainage, material cost and robustness, lots with a Standard (Composite) Shingle roof or a Wood Shingles roof are more expensive than the others. Lots whose roof is constructed of Clay or Tile are the cheapest per square foot, perhaps because this material does not perform as well as the others.

"SalePrice_per_squareFeet" vs. "Heating"

In [None]:
# catplot replaces the deprecated factorplot (`size` is now `height`)
heating_price_plt = sns.catplot(data=train_df,
                                x='Heating',
                                y='SalePrice_per_squareFeet',
                                height=6,
                                kind='bar',
                                palette='muted',
                                aspect=2)
heating_price_plt.despine(left=True)
heating_price_plt.set_ylabels('SalePrice per square feet ($)')

plt.show()

Considering material cost and construction difficulty, lots with "Gas forced warm air furnace" heating are more expensive than those with other heating types.

"SalePrice_per_squareFeet" vs. "GarageType"

In [None]:
# catplot replaces the deprecated factorplot (`size` is now `height`)
garageType_price_plt = sns.catplot(data=train_df,
                                   x='GarageType',
                                   y='SalePrice_per_squareFeet',
                                   height=6,
                                   kind='bar',
                                   palette='muted',
                                   aspect=2)
garageType_price_plt.despine(left=True)
garageType_price_plt.set_ylabels('SalePrice per square feet ($)')

plt.show()

Considering construction and convenience, lots with a built-in garage are more expensive than those with other garage types; lots with only a car port as the garage are the cheapest per square foot.

"SalePrice_per_squareFeet" vs. "SaleType"

In [None]:
# catplot replaces the deprecated factorplot (`size` is now `height`)
saleType_price_plt = sns.catplot(data=train_df,
                                 x='SaleType',
                                 y='SalePrice_per_squareFeet',
                                 height=6,
                                 kind='bar',
                                 palette='muted',
                                 aspect=2)
saleType_price_plt.despine(left=True)
saleType_price_plt.set_ylabels('SalePrice per square feet ($)')

plt.show()

Let's look at the impact of sale type on the sale price. Unsurprisingly, a home that has just been constructed and sold is the most expensive, because it has lost the least value. I'm not sure why the other sale types are less expensive. If you know why, your ideas are welcome :)

"SalePrice_per_squareFeet" vs. "SaleCondition"

In [None]:
# catplot replaces the deprecated factorplot (`size` is now `height`)
saleCdt_price_plt = sns.catplot(data=train_df,
                                x='SaleCondition',
                                y='SalePrice_per_squareFeet',
                                height=6,
                                kind='bar',
                                palette='muted',
                                aspect=2)
saleCdt_price_plt.despine(left=True)
saleCdt_price_plt.set_ylabels('SalePrice per square feet ($)')

plt.show()

Among all sold lots, those that were not completed when last assessed (associated with new homes) are more expensive than the others, while adjoining land purchases are less expensive.

## Data preprocessing for building models

In [None]:
# num_attribs = ['Id', 'LotFrontage', 'LotArea', 'OverallQual', 'YearBuilt', 'YearRemodAdd',
#                'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF',
#                '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
#                'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageYrBlt',
#                'GarageCars', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch',
#                'ScreenPorch', 'PoolArea', 'MiscVal']

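# Numeric predictors: every numeric column except the recoded categoricals,
# the target and the derived price per square foot
num_attribs = train_df.drop(columns=['MSSubClass', 'OverallCond',
                                     'MoSold', 'YrSold',
                                     'SalePrice_per_squareFeet',
                                     'SalePrice']).select_dtypes(include='number').columns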

cat_attribs = ['MSSubClass', 'MSZoning', 'Street', 'LotShape', 'LandContour', 'OverallCond',
               'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2',
               'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd',
               'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond',
               'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Heating', 'HeatingQC', 'CentralAir',
               'Electrical', 'KitchenQual', 'Functional', 'GarageType', 'GarageFinish', 'GarageQual',
               'GarageCond', 'PavedDrive', 'YrSold', 'MoSold', 'SaleType', 'SaleCondition']

### Label encoding categorical variables

In [None]:
# Apply LabelEncoder to the categorical features. Each encoder is fitted on the
# union of the train and test values so that a category gets the same integer
# code in both sets. astype(str) also turns any remaining NaN into the label 'nan'.
for col in cat_attribs:
    lbl = LabelEncoder()
    lbl.fit(pd.concat([train_df[col], test_df[col]]).astype(str))
    train_df[col] = lbl.transform(train_df[col].astype(str))
    test_df[col] = lbl.transform(test_df[col].astype(str))
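Why the encoder should be shared between the two sets: `LabelEncoder` assigns integer codes in sorted order of the values it was fitted on, so two encoders fitted on different value sets can map the same category to different codes. The sketch below demonstrates this on made-up zoning values; the variable names are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up zoning values; 'FV' occurs only in the toy training set.
toy_train = pd.Series(['FV', 'RL', 'RM', 'RL'])
toy_test = pd.Series(['RL', 'RM'])

# Two separately fitted encoders disagree on the code for 'RL'.
separate_train = LabelEncoder().fit(toy_train)
separate_test = LabelEncoder().fit(toy_test)
code_rl_train = separate_train.transform(['RL'])[0]
code_rl_test = separate_test.transform(['RL'])[0]

# One encoder fitted on the union of both sets gives consistent codes.
shared = LabelEncoder().fit(pd.concat([toy_train, toy_test]))
shared_codes_test = shared.transform(toy_test)

print(code_rl_train, code_rl_test, list(shared_codes_test))
```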

In [None]:
train_df.shape

In [None]:
test_df.shape

### Build pipeline for numeric variables

In [None]:
num_attribs
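The cells further down call `full_pipeline`, but its definition does not appear in this notebook. The following is only a sketch of what such a pipeline might look like, under the assumption that numeric columns are median-imputed and standardized while the label-encoded categorical columns pass through unchanged. The toy frames and `toy_num_attribs` are hypothetical; on the real data one would pass `list(num_attribs)` instead.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data standing in for X_train / X_validation (made-up values).
toy_train = pd.DataFrame({
    'LotArea': [8000.0, 9000.0, np.nan, 11000.0],
    'GrLivArea': [1200.0, 1500.0, 1800.0, 2100.0],
    'MSZoning': [0, 1, 1, 2],  # already label-encoded
})
toy_validation = pd.DataFrame({
    'LotArea': [np.nan, 10000.0],
    'GrLivArea': [1600.0, 1700.0],
    'MSZoning': [1, 0],
})
toy_num_attribs = ['LotArea', 'GrLivArea']

# Numeric columns: fill gaps with the median, then scale to zero mean and unit variance.
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
])

# Assumed design: transform the numeric columns, pass all other columns through.
full_pipeline = ColumnTransformer(
    [('num', num_pipeline, toy_num_attribs)],
    remainder='passthrough',
)

# Fit on the training data only, then reuse the fitted statistics.
train_prepared = full_pipeline.fit_transform(toy_train)
validation_prepared = full_pipeline.transform(toy_validation)

print(train_prepared.shape, validation_prepared.shape)
```

Calling `transform` rather than `fit_transform` on the validation data keeps the medians and scaling factors learned from the training set.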

### Separate training set and validation set

In [None]:
X = train_df.drop(columns=['SalePrice_per_squareFeet', 'SalePrice'])
y = train_df['SalePrice']

# validation_ratio = 0.25
# train_df_size = len(X)

# validation_size = validation_ratio * train_df_size
# train_size = train_df_size - validation_size

# random_indices = np.random.permutation(train_df_size)

# X_train = X.loc[random_indices[:int(train_size)], :]
# y_train = y[random_indices[:int(train_size)]]
# X_validation = X.loc[random_indices[int(train_size):], :]
# y_validation = y[random_indices[int(train_size):]]

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_validation, y_train, y_validation = train_test_split(X,
                                                                y,
                                                                test_size=0.25,
                                                                random_state=42)


In [None]:
X_train.shape

In [None]:
X_validation.shape

In [None]:
full_pipeline.fit_transform(X_train).shape

In [None]:
full_pipeline.transform(X_validation).shape

In [None]:
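# Fit the pipeline on the training set only, then reuse the fitted
# transformation for the validation and test sets.
X_train = full_pipeline.fit_transform(X_train)
X_validation = full_pipeline.transform(X_validation)
test_df = full_pipeline.transform(test_df)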

## Train models

In [None]:
from sklearn.linear_model import LinearRegression

lin_rg = LinearRegression()
lin_rg.fit(X_train, y_train)

In [None]:
lin_rg.predict(X_validation)
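The predictions above are not yet scored. One option, which matches the metric this Kaggle competition uses for its leaderboard, is the root mean squared error between the logarithms of the predicted and observed prices. The helper `rmse_log` below is a hypothetical name, and the sketch assumes all prices are strictly positive; a linear model can predict non-positive values, which would need to be clipped first.

```python
import numpy as np

def rmse_log(y_true, y_pred):
    """Root mean squared error between the logarithms of two price arrays.

    Assumes every value is strictly positive.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2)))

# Made-up observed and predicted prices for illustration.
y_true_toy = [100_000, 200_000, 300_000]
y_pred_toy = [110_000, 190_000, 300_000]
score = rmse_log(y_true_toy, y_pred_toy)

print(score)
```

In the notebook this would be called as `rmse_log(y_validation, lin_rg.predict(X_validation))`. Working on the log scale means that a 10% error on a cheap house counts as much as a 10% error on an expensive one.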


In [None]:
lin_rg