In [None]:
import pandas as pd
import numpy as np
import math
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder, StandardScaler
import matplotlib.pyplot as plt
from scipy.stats import pearsonr
import seaborn as sns

## 1.0: General Overview of the Dataset and its Features

In [None]:
df = pd.read_csv("./train.csv")
test = pd.read_csv("./test.csv")

### 1.1: First few observations and the dataframe's shape

In [None]:
df.head()

In [None]:
df.shape

The training dataset has 1460 observations and 81 columns (including the Id column and the SalePrice target).

### 1.2: What about missing values? How many are there? Where are they concentrated?

In [None]:
df.isna().sum().sum()

In [None]:
df.isna().sum().sort_values(ascending = False).head(15)

There seem to be about 6965 missing values, concentrated in optional additions to a house like pools, fences, fireplaces, garages, basements, and miscellaneous features.
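
To see where the gaps concentrate rather than only how many there are, the counts can be turned into per-column shares. This is a minimal sketch on a made-up frame; the `toy` data and the `missing_share` helper are illustrative names and are not part of the notebook.

```python
import numpy as np
import pandas as pd

def missing_share(frame: pd.DataFrame) -> pd.Series:
    """Fraction of missing values per column, largest first.

    Columns without any missing value are left out.
    """
    share = frame.isna().mean()
    return share[share > 0].sort_values(ascending=False)

# Hypothetical toy data standing in for train.csv
toy = pd.DataFrame({
    "PoolQC": [np.nan, np.nan, np.nan, "Gd"],
    "Fence": [np.nan, "MnPrv", np.nan, "GdWo"],
    "LotArea": [8450, 9600, 11250, 9550],
})

shares = missing_share(toy)
print(shares)
```

On the real data the same call would be `missing_share(df).head(15)`.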

### 1.3: Breakdown Of Predictive Features

In [None]:
noms = ['MSSubClass','MSZoning', 'Street','Alley','LotShape','LandContour','Utilities','LotConfig','LandSlope','Neighborhood','Condition1','Condition2','BldgType','HouseStyle','RoofStyle','RoofMatl','Exterior1st','Exterior2nd','MasVnrType','Foundation','Heating','CentralAir','Electrical','GarageType','MiscFeature','SaleType','SaleCondition']

ords = ['ExterQual','ExterCond','BsmtQual','BsmtCond','BsmtExposure','BsmtFinType1','BsmtFinType2','HeatingQC','KitchenQual','Functional','FireplaceQu','GarageFinish','GarageQual','GarageCond','PavedDrive','PoolQC','Fence']

continuous = ['LotFrontage','MasVnrArea']

discrete = ['LotArea','BsmtFinSF1','BsmtFinSF2','BsmtUnfSF','TotalBsmtSF','1stFlrSF','2ndFlrSF','GrLivArea','GarageArea','WoodDeckSF','OpenPorchSF','EnclosedPorch','3SsnPorch','ScreenPorch','PoolArea','MiscVal','OverallQual','OverallCond','LowQualFinSF','BsmtFullBath','BsmtHalfBath','FullBath','HalfBath','BedroomAbvGr','KitchenAbvGr','TotRmsAbvGrd','Fireplaces','GarageCars', 'YearBuilt','YearRemodAdd','GarageYrBlt','MoSold','YrSold']

rest = [item for item in list(df.columns) if item not in list(noms + ords + continuous + discrete)]

An example of each of the different variable types (nominal, ordinal, discrete, and continuous) found within the dataset:

In [None]:
df[['Neighborhood','ExterQual','LotFrontage','LotArea']].head()

We have different labels, shown by the Neighborhood feature, several ranking systems, and numerical measurements of different fixtures within the house (both as floats and as integers). Of the four types of features, which are most prominent?

In [None]:
total = len(noms + ords + continuous + discrete)
plt.figure(figsize=(8,8))
# Recent seaborn versions require x and y to be passed as keyword arguments
sns.barplot(x=['Nominal','Ordinal','Discrete','Continuous'], y=[(len(noms) / total), (len(ords)/total),(len(discrete)/total),(len(continuous)/total)])
plt.title("Share of Total Variables as Each of the Different Types")


In [None]:
df[['Neighborhood','ExterQual','LotFrontage','LotArea']].dtypes

The majority of features are categorical and are stored with the generic object dtype; these often need to be converted into a more descriptive datatype (such as a date or a numerical value) during the cleaning phase. The discrete variables are stored as integers, and the continuous ones as floats.

### 1.4: Statistical Summary of the Numerical Variables

In [None]:
df[discrete + continuous].describe()

Although there is much to take in here, a few aspects of this dataset are worth noting. First, many features measure optional housing fixtures that many homes do without. These features are mostly zeros, which often leaves the mean lower than the standard deviation. See the WoodDeckSF histogram below for an example.
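
A small worked example may make the mean-versus-standard-deviation point concrete. The numbers are invented for illustration (seven homes without the fixture, three with it) and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical deck sizes in square feet: most homes have no deck at all
deck_sf = pd.Series([0, 0, 0, 0, 0, 0, 0, 100, 250, 400], name="DeckSF")

mean = deck_sf.mean()
std = deck_sf.std()  # sample standard deviation, as reported by describe()
zero_share = (deck_sf == 0).mean()

print(f"mean={mean:.1f}  std={std:.1f}  share of zeros={zero_share:.0%}")
```

The pile of zeros drags the mean down while the few large values keep the spread wide, so the standard deviation ends up larger than the mean.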

In [None]:
plt.figure(figsize=(8,8))
sns.histplot(df, x="WoodDeckSF", bins=30)
plt.title("The Distribution of Values in the WoodDeckSF Feature")

### 1.5: We are trying to predict the sales price (a continuous variable) of houses in the testing dataset. What does its distribution look like?

In [None]:
summary = df[['SalePrice']].describe().astype('int32')

sns.set_palette(sns.color_palette("pastel"))
sns.set_style('white')
plt.figure(figsize=(8,8))
plt.title("Distribution of SalePrice Target Feature")
sns.histplot(data = df, x = "SalePrice", bins=30, kde=True, color='green')
plt.table(cellText=summary.values, rowLabels=summary.index, colLabels=summary.columns, cellLoc='right', rowLoc='center', loc='right', bbox=[0.79, 0.69, 0.2, 0.3])

There is a clear right skew in the Sales Prices of these homes. Beyond that, there seem to be a few outliers at the extreme right. Both of these issues should be addressed. 

## 2.0: Comparing the Collinearity of the Differing Features

### 2.1: Creating a Heat Map

In [None]:
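plt.figure(figsize=(13,13))
# numeric_only=True skips the object columns, which recent pandas versions refuse to correlate
corr_matrix = df.drop('Id', axis = 1).corr(numeric_only=True)
sns.heatmap(corr_matrix, mask = np.triu(corr_matrix))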

From the looks of it, all of the strong and notable features have a positive relationship with the sale price of a house. Below I will gather the relationships that are both sizeable (|r| >= 0.3) and unlikely to have arisen by chance (p-value <= 0.05).

In [None]:
temp = pd.concat((df[continuous + discrete], df['SalePrice']), axis = 1).dropna()
h1 = {}
# Test only the predictors; SalePrice correlates perfectly with itself and must not be selected
for column in continuous + discrete:
    corr, pval = pearsonr(temp[column], temp['SalePrice'])
    if abs(corr) >= 0.3 and pval <= 0.05:
        h1[column] = (pval, corr)
dict(sorted(h1.items(), key=lambda item: item[1]))

The dictionary above holds the features that had a p-value of at most 0.05 and a correlation coefficient of at least 0.3 in absolute value. The latter indicates that the pair of features (x and SalePrice) has a noticeable linear relationship, and the former measures whether or not it is statistically significant. Once those hurdles are cleared (and as long as these features do not correlate with each other to a great degree), we can move on.
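
The screening rule can be wrapped in a small function. This is a sketch under stated assumptions: `screen_features` and the synthetic `toy` frame are hypothetical, and the thresholds simply mirror the 0.3 and 0.05 used in this section.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

def screen_features(frame, target, min_abs_corr=0.3, alpha=0.05):
    """Return {feature: (corr, pval)} for columns that pass both thresholds."""
    clean = frame.dropna()
    kept = {}
    for column in clean.columns:
        if column == target:
            continue  # the target always correlates perfectly with itself
        corr, pval = pearsonr(clean[column], clean[target])
        if abs(corr) >= min_abs_corr and pval <= alpha:
            kept[column] = (corr, pval)
    return kept

# Synthetic data: one informative feature and one pure-noise feature
rng = np.random.default_rng(42)
n = 200
quality = rng.integers(1, 11, size=n).astype(float)
toy = pd.DataFrame({
    "Quality": quality,
    "Noise": rng.normal(size=n),
    "Price": 20_000 * quality + rng.normal(scale=30_000, size=n),
})

selected = screen_features(toy, "Price")
print(sorted(selected))
```

Skipping the target inside the loop matters: if it were tested against itself it would pass with a correlation of 1 and could end up among the model inputs.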

### 2.2: Choosing Discrete and Continuous Features

In [None]:
# Correlations among the selected features only (all numeric, so no dtype issues)
selected = list(h1)
heat = df[selected].corr()

plt.figure(figsize=(13,13))
sns.heatmap(heat, mask = np.triu(heat))

We see that a couple of the features are correlated with each other, such as the number of cars in the garage and its size; that said, their correlation coefficient is not greater than 0.9, meaning that there is still a sizeable amount of independent information. Let's take a look at a few of these features and visualize how they relate to SalePrice.
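
Reading pairwise correlations off a heat map is error-prone, so the 0.9 check can also be done programmatically. The helper name `correlated_pairs` and the toy values below are assumptions made for illustration.

```python
import pandas as pd

def correlated_pairs(frame, threshold=0.9):
    """List (col_a, col_b, r) for numeric column pairs with |r| above threshold."""
    corr = frame.select_dtypes("number").corr()
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:  # upper triangle only, so each pair appears once
            r = float(corr.loc[a, b])
            if abs(r) > threshold:
                pairs.append((a, b, r))
    return pairs

# Hypothetical values: garage capacity and area move together, year does not
toy = pd.DataFrame({
    "GarageCars": [0, 1, 1, 2, 2, 3],
    "GarageArea": [0, 250, 300, 480, 520, 800],
    "YearBuilt": [1950, 2005, 1920, 1990, 1965, 1980],
})

pairs = correlated_pairs(toy)
print(pairs)
```

An empty list from `correlated_pairs(df[list(h1)])` would support the reading of the heat map given above.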

In [None]:
from pandas.plotting import scatter_matrix

attr = ['SalePrice','OverallQual','GarageCars','YearBuilt','TotalBsmtSF']
scatter_matrix(df[attr], figsize=(13,13))

If you look at the top row of the scatter matrix, you can see the positive relationship between SalePrice and the other highly predictive features. They make a certain amount of sense too; improved quality, more luxurious and spacious additions (like basements and garages), and newer buildings would often sell for more than their competitors. 

## 3.0: Feature Engineering

### 3.1: Missing Values

Missing values in this dataset are not random; instead, they signal that the observation lacks the optional add-on being measured. The simple solution is to create a 'None' label that holds that information. That only works for categorical features, though, and it becomes a problem for discrete features like 'GarageYrBlt', which has 81 missing values. I chose to fill these with 1900, the earliest year in the dataset, as if the house had a garage built that year. A value needed to be there, 0 would have distorted the feature, and, assuming a linear relationship with SalePrice, having an ancient garage is the closest we can manage to having none. I'll show the resulting relationship below.

In [None]:
df[noms + ords] = df[noms+ords].fillna("None")
df['GarageYrBlt'] = df['GarageYrBlt'].fillna(1900.0)

plt.figure(figsize = (8,8))
sns.scatterplot(data = df, x = "GarageYrBlt", y= "SalePrice")
plt.title("Representing No Garage")
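
The two fills can be checked on a tiny frame. The sketch below uses invented rows and also adds a `HasGarage` flag; that flag is an optional extra that does not appear in this notebook, shown only as one way to keep the "no garage" information separate from the placeholder year.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the middle house has no garage
toy = pd.DataFrame({
    "GarageType": ["Attchd", np.nan, "Detchd"],
    "GarageYrBlt": [2003.0, np.nan, 1976.0],
})

# Optional extra (an assumption, not the notebook's method):
# record which rows had no garage before the gap is filled
toy["HasGarage"] = toy["GarageYrBlt"].notna().astype(int)

# The fills described above: a 'None' label and the 1900 placeholder year
toy["GarageType"] = toy["GarageType"].fillna("None")
toy["GarageYrBlt"] = toy["GarageYrBlt"].fillna(1900.0)

print(toy)
```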

In [None]:
df.isna().sum().sort_values(ascending = False)

This leaves two sources of missing values, LotFrontage and MasVnrArea, which will be filled with the median value in the numeric pipeline.

### 3.2: Outliers

In [None]:
df['SalePrice'].sort_values(ascending = False).head(10)

The two outliers I found are homes that sold for more than $100,000 above their nearest competitors, which sit much closer to the rest of the observations. These values are extreme to the point that they would almost certainly have an outsized influence on the model, so they should be removed.

In [None]:
df = df.drop(index = [691, 1182])

### 3.3: SalePrice - Skewed Target Feature

As shown in 1.5, the target feature has a right skew; I'll address this by taking the natural log of the feature, which should pull in the long right tail and bring the distribution closer to normal.

In [None]:
plt.figure(figsize = (8,8))
sns.histplot(df['SalePrice'].transform(np.log), color = 'blue')
plt.title("The Transformed SalePrice Distribution")

df['SalePrice'] = df['SalePrice'].transform(np.log)
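
The effect of the log transform can be quantified with the sample skewness, and the transform has to be undone with `np.exp` when predictions are reported in dollars. This is a sketch on invented prices, not on the dataset itself.

```python
import numpy as np
import pandas as pd

# Hypothetical sale prices with one expensive home in the right tail
prices = pd.Series(
    [80_000, 120_000, 150_000, 180_000, 250_000, 600_000], name="SalePrice"
)

log_prices = np.log(prices)

skew_before = prices.skew()
skew_after = log_prices.skew()
print(f"skew before={skew_before:.2f}, after={skew_after:.2f}")

# Anything predicted on the log scale maps back to dollars with exp
recovered = np.exp(log_prices)
```

A model trained on the transformed target predicts log prices, so its output needs the same `np.exp` step before it is compared with actual sale prices.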

### 3.4: Dropping Features

In [None]:
df['Utilities'].value_counts()

The Utilities feature would be helpful if it distinguished between homes; as it stands, nearly every home has all public utilities, so it provides no useful information beyond its one observation that has no access to public water. Even then, a single record is not sizeable enough to be worthwhile.

In [None]:
if 'Utilities' in df.columns:
    df = df.drop('Utilities', axis = 1)
if "Utilities" in noms:
    noms.remove('Utilities')

Additionally, the house ID attribute adds nothing of value.

In [None]:
if 'Id' in df.columns:
    df = df.drop('Id', axis = 1)

### 3.5: Numeric Pipeline

In addition to filling in any missing values with the median, the numeric pipeline will standardize the data by using the StandardScaler; this method is more resistant to outliers than MinMaxScaler. 

In [None]:
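from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
# Fill gaps with the column median, then standardize to zero mean and unit variance
numeric_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy = 'median')),
    ('std', StandardScaler())])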
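
To make the two steps concrete, here is the same kind of pipeline applied to a single made-up column with one gap. The array `X` and the name `demo_pipeline` are illustrative assumptions.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

demo_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),  # gap -> column median
    ("std", StandardScaler()),                      # zero mean, unit variance
])

# Hypothetical LotFrontage-like column with one missing value
X = np.array([[60.0], [80.0], [np.nan], [100.0]])

X_scaled = demo_pipeline.fit_transform(X)
print(X_scaled.ravel())
```

The missing entry is replaced by the median of the observed values (80), which here equals the column mean, so it lands at 0 after scaling.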

### 3.6: Full Pipeline

Nominal features will be split into binary attributes using the OneHotEncoder; ordinal features can be ranked, so they can be transformed into numeric values through the OrdinalEncoder.

In [None]:
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

num_attributes = [col for col in heat.columns if col != 'SalePrice']  # never feed the target in as a predictor
ordinal_attributes = ords
cat_attributes = noms

full_pipeline = ColumnTransformer([
    ("num", numeric_pipeline, num_attributes),
    ("ord", OrdinalEncoder(), ordinal_attributes),
    ("cats", OneHotEncoder(), cat_attributes),
])

newdf = full_pipeline.fit_transform(df)
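
Two encoder defaults are worth keeping in mind here. With its default settings, OrdinalEncoder assigns codes in sorted label order, which for quality labels is alphabetical rather than worst to best, and OneHotEncoder raises an error when it meets a label at transform time that it did not see during fitting. The sketch below shows one possible way to handle both; the ranking in `quality_order` and the toy frames are assumptions, not something the notebook defines.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Assumed ranking, worst to best, with "None" for a missing fixture
quality_order = ["None", "Po", "Fa", "TA", "Gd", "Ex"]

# Hypothetical training and test rows
toy_train = pd.DataFrame({
    "KitchenQual": ["TA", "Ex", "Gd", "Fa"],
    "Street": ["Pave", "Pave", "Grvl", "Pave"],
})
toy_test = pd.DataFrame({"KitchenQual": ["Gd"], "Street": ["Dirt"]})  # unseen label

encoder = ColumnTransformer(
    [
        ("ord", OrdinalEncoder(categories=[quality_order]), ["KitchenQual"]),
        ("cats", OneHotEncoder(handle_unknown="ignore"), ["Street"]),
    ],
    sparse_threshold=0,  # always return a dense array, for readability
)

train_encoded = encoder.fit_transform(toy_train)
test_encoded = encoder.transform(toy_test)
print(train_encoded)
print(test_encoded)
```

With the explicit order, a better rating always maps to a larger number, and the unseen Street label in the test row becomes an all-zero one-hot block instead of stopping the transform.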

In [526]:
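# ColumnTransformer returns a sparse matrix or a dense array depending on the output's density
Xtrain = pd.DataFrame(newdf.toarray() if hasattr(newdf, "toarray") else newdf)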
ytrain = df['SalePrice']