# Descriptive Visualizations

The purpose of this notebook is to visually examine the nominal features, discard the useless ones among them, and create new factor variables.

The "main" plot used in this notebook is *GrLivArea* vs. *SalePrice*, as the overall living area is the most correlated predictor (which is also very intuitive). Many of the nominal variables significantly change the slopes of the regression lines for sub-groups of data points.

## "Housekeeping"

In [None]:
import json

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

from sklearn.ensemble import IsolationForest


In [None]:
pd.set_option("display.max_columns", 120)

In [None]:
sns.set_style("white")

## Load the Data


In [None]:

housing = pd.read_csv("housing.csv", index_col=0)
housing_new = pd.read_csv("housing_new.csv", index_col=0)
# Drop the very large houses; copy so that new columns can be added without warnings.
housing = housing[housing.GrLivArea < 3700].copy()

In [None]:
housing.shape

In [None]:
housing.head()

Collect the newly created variables in the *new_variables* list.

In [None]:
new_variables = []

## Derived Characteristics

Certain characteristics of a house are assumed to have a "binary" influence on the sales price. For example, the existence of a pool could be an important predictor while the exact size of the pool can be deemed not so important.

The cell below creates boolean factor variables out of a set of numeric variables.

In [None]:
derived_variables = {
    "has 2nd Flr": "2ndFlrSF",
    "has Bsmt": "TotalBsmtSF",
    "has Fireplace": "Fireplaces",
    "has Garage": "GarageArea",
    "has Pool": "PoolArea",
}
# Factorize numeric columns: 1 if the value is positive, else 0.
for factor_column, column in derived_variables.items():
    housing[factor_column] = (housing[column] > 0).astype(int)
derived_variables = list(derived_variables.keys())

In [None]:
housing[derived_variables].head()

### 2nd Floors

A second floor may have a positive effect on the sales price. However, having a second floor correlates with overall living space, so the individual effect is not as clear as it seems in the plot below. Properties that spread the same *GrLivArea* over two floors seem to cost less than those with only a single floor. A plausible reason is that the first floor is only about half the size and the lot is likely to be smaller.

In [None]:
sns.scatterplot(x="GrLivArea", y="SalePrice", hue="has 2nd Flr", s=15, data=housing);

### Basements

Nearly all houses in Ames, IA, have a basement. Therefore, *has Bsmt* is most likely not an important predictor.

In [None]:
sns.scatterplot(x="GrLivArea", y="SalePrice", hue="has Bsmt", s=15, data=housing);
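
The claim that an indicator is nearly constant can also be checked numerically. The following is a minimal sketch on made-up toy data (not the Ames data); the `threshold` value is a hypothetical cut-off chosen only for illustration.

```python
import pandas as pd

# Toy data standing in for the 0/1 indicator columns created above.
toy = pd.DataFrame({
    "has Bsmt": [1, 1, 1, 1, 0],
    "has Pool": [0, 0, 0, 0, 1],
    "has Garage": [1, 0, 1, 1, 0],
})

# The mean of a 0/1 column is the share of houses that have the feature.
share = toy.mean()

# Flag indicators where one of the two classes covers at least `threshold` of the rows.
threshold = 0.75  # hypothetical cut-off, for illustration only
near_constant = share[(share >= threshold) | (share <= 1 - threshold)].index.tolist()

print(share)
print(near_constant)
```

Applied to the real data, the same idea would be `housing[derived_variables].mean()`.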

### Fireplaces

Bigger houses are more likely to have a fireplace. Thus, the variable *has Fireplace* might be an interesting predictor for bigger houses.

In [None]:
sns.scatterplot(x="GrLivArea", y="SalePrice", hue="has Fireplace", s=10, data=housing);

### Garages

Holding the overall living area fixed, adding a garage seems to affect the price positively. Thus, *has Garage* seems like an interesting predictor as well.

In [None]:
sns.scatterplot(x="GrLivArea", y="SalePrice", hue="has Garage", s=10, data=housing);

### Pools

Unfortunately, almost no one in Ames, IA, has a pool. The predictor *has Pool* seems quite uninteresting.

In [None]:
sns.scatterplot(x="GrLivArea", y="SalePrice", hue="has Pool", s=15, data=housing);

### Quality

Bigger houses seem to be of higher overall quality.

In [None]:
sns.scatterplot(x="GrLivArea", y="SalePrice", hue="OverallQual", s=10, data=housing);

## Neighborhoods


In [None]:
sns.scatterplot(x="GrLivArea", y="SalePrice", hue="Neighborhood", s=10, data=housing);

In [None]:
housing

In [None]:
sns.scatterplot(x="GrLivArea", y="SalePrice", hue="dist", s=10, data=housing);

In [None]:
_, ax = plt.subplots(figsize=(10, 8))
sns.boxplot(x="Neighborhood", y="SalePrice", data=housing, ax=ax)
ax.set_title("Prices by Neighborhood", fontsize=24)
ax.set_xlabel("Neighborhood", fontsize=18)
# Rotate the tick labels without re-setting them.
ax.tick_params(axis="x", labelrotation=45)
ax.set_ylabel("House Price", fontsize=18);

In [None]:
housing_new

The 28 neighborhoods are encoded as factor variables.

In [None]:
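# get_dummies already inserts "_" between prefix and level, so the prefix needs no trailing underscore.
neighborhood = pd.get_dummies(housing["Neighborhood"], prefix="nhood", drop_first=True)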
housing = pd.concat([housing, neighborhood], axis=1)


In [None]:
housing

In [None]:
new_variables.extend(neighborhood.columns)

In [None]:
housing[neighborhood.columns].shape

In [None]:
housing[neighborhood.columns].head()

### Alleys

Almost no house has access to an alley.

In [None]:
sns.scatterplot(x="GrLivArea", y="SalePrice", hue="Alley", s=10, data=housing);

### Building Type

The type of a building clearly affects the valuation. The two types of townhouses as well as the 2-family condo and duplex type are summarized into a single category. This makes sense a) semantically, and b) by looking at the two sub-clusters in the scatter plot.

In [None]:
sns.scatterplot(x="GrLivArea", y="SalePrice", hue="BldgType", s=10, data=housing);
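
The grouping described above could be sketched as follows. This is only one possible reading of the text (townhouses merged into one group, 2-family and duplex houses into another). The level codes are assumed to match the raw Ames *BldgType* levels, and the group names as well as the column *BldgTypeGrouped* are hypothetical.

```python
import pandas as pd

# Toy rows; the codes are assumed to match the raw "BldgType" levels.
toy = pd.DataFrame({"BldgType": ["1Fam", "TwnhsE", "Twnhs", "2fmCon", "Duplex"]})

# Hypothetical grouping: townhouses together, 2-family and duplex together.
groups = {
    "TwnhsE": "Townhouse",
    "Twnhs": "Townhouse",
    "2fmCon": "2Fam",
    "Duplex": "2Fam",
}
# Codes missing from the mapping (e.g. "1Fam") keep their original value.
toy["BldgTypeGrouped"] = toy["BldgType"].replace(groups)

# One indicator column per grouped category; the first level is dropped as baseline.
dummies = pd.get_dummies(toy["BldgTypeGrouped"], prefix="bldg", drop_first=True)
print(dummies)
```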

In [None]:
# Houses where Condition1 is not "Norm" are priced significantly lower.
sns.scatterplot(x="GrLivArea", y="SalePrice", hue="Condition1", s=10, data=housing);

However, plotting the groups separately reveals different slopes.

In [None]:
street = [1]

plot = sns.lmplot(
    x="GrLivArea", y="SalePrice", col="Condition1", hue="Condition1",
    col_order=[0] + street,
    data=housing, robust=True, col_wrap=4, ci=None, truncate=True, scatter_kws={"s": 10},
)
# Adjust font sizes.
for ax in plot.axes:
    ax.set_title(ax.get_title(), fontsize=20)
    ax.set_xlabel(ax.get_xlabel(), fontsize=16)
    ax.set_ylabel(ax.get_ylabel(), fontsize=16)

Extract factor variables *major_street*, *railway*, and *park*.
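
A minimal sketch of such an extraction is shown below. It assumes that *Condition1* still holds the raw string codes of the Ames data (e.g. "Artery", "Feedr", "RRAn", "PosN"); if the column has already been recoded numerically, the mapping would have to be adapted. The assignment of codes to the three groups is an assumption as well.

```python
import pandas as pd

# Toy rows; the codes are assumed to match the raw "Condition1" levels.
toy = pd.DataFrame(
    {"Condition1": ["Norm", "Artery", "Feedr", "RRAn", "PosN", "RRNe", "PosA"]}
)

# Assumed assignment of the raw codes to the three factor variables.
code_groups = {
    "major_street": ["Artery", "Feedr"],
    "railway": ["RRNn", "RRAn", "RRNe", "RRAe"],
    "park": ["PosN", "PosA"],
}

# Each factor variable is 1 if the house carries one of the group's codes, else 0.
for name, codes in code_groups.items():
    toy[name] = toy["Condition1"].isin(codes).astype(int)

print(toy)
```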

### Exterior

This dimension describes the main exterior material of the houses. The category is too diverse, and the various grouped scatter plots did not reveal differing slopes.

In [None]:
sns.scatterplot(x="GrLivArea", y="SalePrice", hue="Exterior1st", s=10, data=housing);

In [None]:
sns.scatterplot(x="GrLivArea", y="SalePrice", hue="Exterior2nd", s=10, data=housing);

### Foundation

The type of foundation appears to have an effect. Houses with "PConc" foundations have higher prices compared to the rest.
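
One way to check such a statement is to compare the median price per foundation type. The sketch below uses made-up toy numbers (not the Ames data), and the indicator name *is_pconc* is hypothetical.

```python
import pandas as pd

# Made-up toy data for illustration only.
toy = pd.DataFrame({
    "Foundation": ["PConc", "PConc", "CBlock", "CBlock", "BrkTil"],
    "SalePrice": [230000, 210000, 150000, 140000, 120000],
})

# Median sales price per foundation type, highest first.
median_by_foundation = (
    toy.groupby("Foundation")["SalePrice"].median().sort_values(ascending=False)
)

# Hypothetical indicator that separates "PConc" from all other foundations.
toy["is_pconc"] = (toy["Foundation"] == "PConc").astype(int)

print(median_by_foundation)
```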

### Garage Type

The garage type appears to have an effect on *SalePrice*.

In [None]:
sns.scatterplot(x="GrLivArea", y="SalePrice", hue="GarageType", s=15, data=housing);

### Heating

Most of the houses have gas heating.

In [None]:
sns.scatterplot(x="GrLivArea", y="SalePrice", hue="Heating", s=10, data=housing);

### House Style


In [None]:
sns.scatterplot(x="GrLivArea", y="SalePrice", hue="HouseStyle", s=15, data=housing);

### Land Contour

This variable is assumed to contain the same information as the ordinal variable *Land Slope* and is dropped.

In [None]:
sns.scatterplot(x="GrLivArea", y="SalePrice", hue="LandContour", s=15, data=housing);

### Lot Configuration

This variable shows no clear pattern and is dropped.

In [None]:
sns.scatterplot(x="GrLivArea", y="SalePrice", hue="LotConfig", s=15, data=housing);

### Masonry Veneer Type

"Stone" has a steeper slope than the other types.

In [None]:
sns.scatterplot(x="GrLivArea", y="SalePrice", hue="MasVnrType", s=15, data=housing);

In [None]:
street = ["BrkFace"]
none = ["None"]
plot = sns.lmplot(
    x="GrLivArea", y="SalePrice", col="MasVnrType", hue="MasVnrType",
    col_order=["Stone"] + street + none,
    data=housing, robust=True, col_wrap=4, ci=None, truncate=True, scatter_kws={"s": 10},
)
# Adjust font sizes.
for ax in plot.axes:
    ax.set_title(ax.get_title(), fontsize=20)
    ax.set_xlabel(ax.get_xlabel(), fontsize=16)
    ax.set_ylabel(ax.get_ylabel(), fontsize=16)

### Roof

Roofs in Ames, IA, are not special enough to make a difference in the price. Even "hip" roofs seem to be already priced in through bigger houses.

In [None]:
sns.scatterplot(x="GrLivArea", y="SalePrice", hue="RoofMatl", s=10, data=housing);

In [None]:
sns.scatterplot(x="GrLivArea", y="SalePrice", hue="RoofStyle", s=10, data=housing);

### Sale Info

Partial and abnormal (= foreclosure) sales seem to go along with higher and lower prices, respectively. These two types will be encoded in the factor variables *partial_sale* and *abnormal_sale*. The impact does not seem to be big, though.

In [None]:
housing["SaleCondition"].value_counts()

In [None]:
partial = ["Partial"]
norm = ["Normal"]
plot = sns.lmplot(
    x="GrLivArea", y="SalePrice", col="SaleCondition", hue="SaleCondition",
    col_order=["Abnorml"] + partial + norm,
    data=housing, robust=True, col_wrap=4, ci=None, truncate=True, scatter_kws={"s": 10},
)
# Adjust font sizes.
for ax in plot.axes:
    ax.set_title(ax.get_title(), fontsize=20)
    ax.set_xlabel(ax.get_xlabel(), fontsize=16)
    ax.set_ylabel(ax.get_ylabel(), fontsize=16)
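
The factor variables *partial_sale* and *abnormal_sale* mentioned in this section could be derived along the following lines. This is a sketch on toy rows that reuses the level names "Partial" and "Abnorml" from the plot above.

```python
import pandas as pd

# Toy rows using the "SaleCondition" levels that appear in the plot above.
toy = pd.DataFrame(
    {"SaleCondition": ["Normal", "Partial", "Abnorml", "Normal", "Family"]}
)

# 1 if the sale was partial (or abnormal), else 0.
toy["partial_sale"] = (toy["SaleCondition"] == "Partial").astype(int)
toy["abnormal_sale"] = (toy["SaleCondition"] == "Abnorml").astype(int)

print(toy)
```

On the real data, the two column names would then be appended to *new_variables*.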

Sale types "New" and "Con" have higher prices.

In [None]:
sns.scatterplot(x="GrLivArea", y="SalePrice", hue="SaleType", s=15, data=housing);

## Age & Remodeling

The dataset was put together between 2006 and 2010. Therefore, the variables with year numbers need to be aligned to indicate the right ages.

Convert the years to ages by subtracting them from 2010. Then take the square root to reduce the effect of older houses on the outcome.
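
The conversion described above can be sketched as follows. The column names *YearBuilt* and *YearRemodAdd* are assumed to be the year columns in question, and the new names *age* and *remodel_age* are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy rows; "YearBuilt" and "YearRemodAdd" are assumed column names.
toy = pd.DataFrame({
    "YearBuilt": [2010, 2001, 1910],
    "YearRemodAdd": [2010, 2006, 1994],
})

reference_year = 2010  # last year covered by the dataset

for column, new_column in [("YearBuilt", "age"), ("YearRemodAdd", "remodel_age")]:
    years = reference_year - toy[column]
    # Clip at zero as a safeguard, then dampen large ages with the square root.
    toy[new_column] = np.sqrt(years.clip(lower=0))

print(toy)
```

With the square root, a house that is 100 years old gets the value 10 while a 9-year-old house gets 3, so very old houses pull less on a linear fit.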

### Correlation

In [None]:

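# Restrict to numeric columns; recent pandas versions raise on non-numeric data otherwise.
corr = pd.concat([housing.iloc[:, :30], housing["SalePrice"]], axis=1).corr(numeric_only=True)
sns.heatmap(corr);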

In [None]:
def plot_correlation(data, title):
    """Visualize a correlation matrix in a nice heatmap."""
    fig, ax = plt.subplots(figsize=(12, 12))
    ax.set_title(title, fontsize=24)
    # Blank out the upper triangular part of the matrix.
    # The builtin bool replaces the np.bool alias, which newer NumPy versions removed.
    mask = np.zeros_like(data, dtype=bool)
    mask[np.triu_indices_from(mask)] = True
    # Use a diverging color map.
    cmap = sns.diverging_palette(240, 0, as_cmap=True)
    # Plot it.
    sns.heatmap(
        data, vmin=-1, vmax=1, cmap=cmap, center=0, linewidths=.5,
        cbar_kws={"shrink": .5}, square=True, mask=mask, ax=ax
    )
    # Adjust the labels' font size after seaborn has drawn the ticks.
    ax.tick_params(labelsize=10)

size_related = housing.filter(regex='SF$|Area$').fillna(1)

pearson = size_related.corr(method="pearson")
plot_correlation(pearson, "Pearson's Correlation")

In [None]:
# TotalBsmtSF is highly correlated with 1stFlrSF.
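
Instead of reading such pairs off the heatmap, they can be listed directly from the correlation matrix. The sketch below runs on made-up toy numbers (not the Ames data); the `threshold` is a hypothetical cut-off.

```python
import numpy as np
import pandas as pd

# Made-up toy data for illustration only.
toy = pd.DataFrame({
    "TotalBsmtSF": [800, 1000, 1200, 1500, 900],
    "1stFlrSF": [820, 1010, 1250, 1480, 950],
    "PoolArea": [0, 0, 500, 0, 0],
})

corr = toy.corr(method="pearson")

# Keep only the strict upper triangle so that each pair is listed once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().dropna().sort_values(ascending=False)

threshold = 0.8  # hypothetical cut-off, for illustration only
high = pairs[pairs.abs() > threshold]
print(high)
```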

In [None]:
qual_related = housing.filter(regex='Qual$|Cond$')
pearson = qual_related.corr(method="pearson")
plot_correlation(pearson, "Pearson's Correlation")

### 3D Visualisation

In [None]:
housing.OverallQual

In [None]:
housing_plot1 = housing[[ "GrLivArea",'OverallQual']]
price = housing["SalePrice"]
from sklearn import linear_model
ols = linear_model.LinearRegression()
ols.fit(housing_plot1, price)
import matplotlib as mpl
from mpl_toolkits.mplot3d import Axes3D
N = len(price)
beta = np.array([np.round(ols.intercept_, 1), np.round(ols.coef_, 1)[0], np.round(ols.coef_, 1)[1]])
x_m = np.array(housing_plot1.head(N)) #np.random.randn(N, 2)
y_m = np.array(price.head(N))#np.dot(np.append(np.ones((N,1)), x_m, axis=1), beta) + np.random.randn(N)*4
fig = plt.figure(figsize=(14, 10))
ax = plt.axes(projection='3d')
# plot the data points
# Design matrix with a leading column of ones for the intercept.
X = np.column_stack([np.ones(len(x_m)), x_m])
fitted = X @ beta
up = np.where(y_m >= fitted)[0]
down = np.where(y_m < fitted)[0]
ax.scatter(x_m[up, 0], x_m[up, 1], y_m[up], c='blue', alpha=.6)
ax.scatter(x_m[down, 0], x_m[down, 1], y_m[down], c='orange', alpha=.4)

# plot the error bars (reuse the 3D axes created above)
x_up = x_m[up,:]; y_up = y_m[up]
up_kwargs = dict(color='red', alpha=.6, lw=0.8)
for i, j, k in zip(x_up[:, 0], x_up[:, 1], y_up):
    ax.plot([i, i], [j, j], [k, np.dot(beta, [1, i, j])], **up_kwargs)
    
x_down = x_m[down,:]; y_down = y_m[down]
down_kwargs = dict(color='red', alpha=.3, lw=0.8)
for i, j, k in zip(x_down[:,0], x_down[:,1], y_down):
    ax.plot([i, i], [j, j], [k, np.dot(beta, [1, i, j])], **down_kwargs)
    
    
# plot the plane which represents the fitted model
x_1 = np.linspace(min(x_m[:, 0])-.5, max(x_m[:, 0])+.5, 25)
x_2 = np.linspace(min(x_m[:, 1])-.5, max(x_m[:, 1])+.5, 25)
x_1, x_2 = np.meshgrid(x_1, x_2)
x_3 = beta[1]*x_1 + beta[2]*x_2 + beta[0]
surface_kwargs = dict(rstride=100, cstride=100, color='yellow', alpha=0.1)
ax.plot_surface(x_1, x_2, x_3, **surface_kwargs)
ax.set_xlabel('GrLivArea')
ax.set_ylabel('OverallQual')
ax.set_zlabel('SalePrice')
plt.show()

In [None]:
## 3D scatter plot
housing_plot2 = housing[[ "Price_by_hood",'OverallQual']]
from sklearn import linear_model
ols = linear_model.LinearRegression()
ols.fit(housing_plot2, price)

import matplotlib as mpl
from mpl_toolkits.mplot3d import Axes3D
N = len(price)
beta = np.array([np.round(ols.intercept_, 1), np.round(ols.coef_, 1)[0], np.round(ols.coef_, 1)[1]])
x_m = np.array(housing_plot2.head(N)) #np.random.randn(N, 2)
y_m = np.array(price.head(N))#np.dot(np.append(np.ones((N,1)), x_m, axis=1), beta) + np.random.randn(N)*4
fig = plt.figure(figsize=(14, 10))
ax = plt.axes(projection='3d')
# plot the data points
# Design matrix with a leading column of ones for the intercept.
X = np.column_stack([np.ones(len(x_m)), x_m])
fitted = X @ beta
up = np.where(y_m >= fitted)[0]
down = np.where(y_m < fitted)[0]
ax.scatter(x_m[up, 0], x_m[up, 1], y_m[up], c='blue', alpha=.6)
ax.scatter(x_m[down, 0], x_m[down, 1], y_m[down], c='orange', alpha=.4)

# plot the error bars (reuse the 3D axes created above)
x_up = x_m[up,:]; y_up = y_m[up]
up_kwargs = dict(color='red', alpha=.6, lw=0.8)
for i, j, k in zip(x_up[:, 0], x_up[:, 1], y_up):
    ax.plot([i, i], [j, j], [k, np.dot(beta, [1, i, j])], **up_kwargs)
    
x_down = x_m[down,:]; y_down = y_m[down]
down_kwargs = dict(color='red', alpha=.3, lw=0.8)
for i, j, k in zip(x_down[:,0], x_down[:,1], y_down):
    ax.plot([i, i], [j, j], [k, np.dot(beta, [1, i, j])], **down_kwargs)
    
    
# plot the plane which represents the fitted model
x_1 = np.linspace(min(x_m[:, 0])-.5, max(x_m[:, 0])+.5, 25)
x_2 = np.linspace(min(x_m[:, 1])-.5, max(x_m[:, 1])+.5, 25)
x_1, x_2 = np.meshgrid(x_1, x_2)
x_3 = beta[1]*x_1 + beta[2]*x_2 + beta[0]
surface_kwargs = dict(rstride=100, cstride=100, color='yellow', alpha=0.1)
ax.plot_surface(x_1, x_2, x_3, **surface_kwargs)
ax.set_xlabel('Price_by_hood')
ax.set_ylabel('OverallQual')
ax.set_zlabel('SalePrice')

In [None]:
## 3D scatter plot
housing_plot3 = housing[[ "dist",'GrLivArea']]
from sklearn import linear_model
ols = linear_model.LinearRegression()
ols.fit(housing_plot3, price)

import matplotlib as mpl
from mpl_toolkits.mplot3d import Axes3D
N = len(price)
beta = np.array([np.round(ols.intercept_, 1), np.round(ols.coef_, 1)[0], np.round(ols.coef_, 1)[1]])
x_m = np.array(housing_plot3.head(N)) #np.random.randn(N, 2)
y_m = np.array(price.head(N))#np.dot(np.append(np.ones((N,1)), x_m, axis=1), beta) + np.random.randn(N)*4
fig = plt.figure(figsize=(14, 10))
ax = plt.axes(projection='3d')
# plot the data points
# Design matrix with a leading column of ones for the intercept.
X = np.column_stack([np.ones(len(x_m)), x_m])
fitted = X @ beta
up = np.where(y_m >= fitted)[0]
down = np.where(y_m < fitted)[0]
ax.scatter(x_m[up, 0], x_m[up, 1], y_m[up], c='blue', alpha=.6)
ax.scatter(x_m[down, 0], x_m[down, 1], y_m[down], c='orange', alpha=.4)

# plot the error bars (reuse the 3D axes created above)
x_up = x_m[up,:]; y_up = y_m[up]
up_kwargs = dict(color='red', alpha=.6, lw=0.8)
for i, j, k in zip(x_up[:, 0], x_up[:, 1], y_up):
    ax.plot([i, i], [j, j], [k, np.dot(beta, [1, i, j])], **up_kwargs)
    
x_down = x_m[down,:]; y_down = y_m[down]
down_kwargs = dict(color='red', alpha=.3, lw=0.8)
for i, j, k in zip(x_down[:,0], x_down[:,1], y_down):
    ax.plot([i, i], [j, j], [k, np.dot(beta, [1, i, j])], **down_kwargs)
    
    
# plot the plane which represents the fitted model
x_1 = np.linspace(min(x_m[:, 0])-.5, max(x_m[:, 0])+.5, 25)
x_2 = np.linspace(min(x_m[:, 1])-.5, max(x_m[:, 1])+.5, 25)
x_1, x_2 = np.meshgrid(x_1, x_2)
x_3 = beta[1]*x_1 + beta[2]*x_2 + beta[0]
surface_kwargs = dict(rstride=100, cstride=100, color='yellow', alpha=0.1)
ax.plot_surface(x_1, x_2, x_3, **surface_kwargs)
ax.set_xlabel('dist')
ax.set_ylabel('GrLivArea')
ax.set_zlabel('SalePrice')
plt.show()

### Interaction of numerical features with SalePrice

In [None]:

import matplotlib.gridspec as gridspec
import matplotlib as mpl
import matplotlib.pyplot as plt
lst = list(size_related.columns)+list(qual_related.columns)

X_1= housing[lst]
y_bos = housing["SalePrice"]
fig = plt.figure(figsize=(14, 66))
gs = gridspec.GridSpec(len(lst), 2)

for i in range(len(lst)):
    ax1 = plt.subplot(gs[i, 0])
    ax2 = plt.subplot(gs[i, 1])    
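    # Only plot houses where the feature is positive; seaborn expects x and y as keywords.
    positive = X_1.iloc[:, i] > 0
    sns.regplot(x=y_bos[positive], y=X_1.iloc[:, i][positive], ax=ax1)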
    ax1.set_title('{}'.format(X_1.columns[i]))
    ax1.set_xlabel('')
    ylim = ax1.get_ylim()   
    X_1[X_1.columns[i]].hist(bins=50, ax=ax2, orientation='horizontal',color="g")    
    ax2.set_ylim((ylim[0], ylim[1]))
    ax2.set_xlabel('')
    ax2.set_xlim((0, 200))
plt.tight_layout(pad=0, w_pad=0, h_pad=0)
plt.show()

In [None]:

import matplotlib.gridspec as gridspec
import matplotlib as mpl
import matplotlib.pyplot as plt

X_1= size_related
L = len(X_1.columns)
y_bos = housing["SalePrice"]
fig = plt.figure(figsize=(10, 35))
gs = gridspec.GridSpec(L, 1)

for i in range(L-1):
    ax1 = plt.subplot(gs[i, 0])    
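    # Only plot houses where the second feature is positive; seaborn expects x and y as keywords.
    positive = X_1.iloc[:, i + 1] > 0
    sns.regplot(x=X_1.iloc[:, i][positive], y=X_1.iloc[:, i + 1][positive], ax=ax1)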
    ax1.set_title('{}'.format(X_1.columns[i+1]))
    ax1.set_xlabel('{}'.format(X_1.columns[i]))
    ylim = ax1.get_ylim()   
plt.tight_layout(pad=0, w_pad=0, h_pad=0)
plt.show()