### To dos
- check skew of variables
    - apply transformations as required
- convert categoricals to dummy variables
- deal with nulls/nans (or don't)
- split off dependent/independent variables
- scale/normalise
- split into train/validate
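
The first to-do item (checking skew and transforming where needed) could be sketched roughly as below. This is a minimal illustration on made-up data, not the housing set: the `toy` frame and its column names are hypothetical, and `log1p` is just one common choice for right-skewed, non-negative features.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one right-skewed column and one symmetric column.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "skewed": rng.lognormal(mean=0.0, sigma=1.0, size=1000),
    "symmetric": rng.normal(loc=0.0, scale=1.0, size=1000),
})

# Sample skewness per column; values far from 0 suggest a transformation may help.
skew_before = toy.skew()

# Assumption: log1p is an acceptable transform here (values are non-negative).
toy["skewed_log"] = np.log1p(toy["skewed"])
skew_after = toy["skewed_log"].skew()

print(skew_before)
print(skew_after)
```

The same `DataFrame.skew()` call should work on `train[numeric]` once the data is loaded.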


Let's start off with some imports

In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as st

pd.options.display.max_rows = 1000
pd.options.display.max_columns = 200

We'll then load up the training data and derive a list of numerical and categorical features

In [2]:
train = pd.read_csv("../data/train.csv")

numeric = [var for var in train.columns if train.dtypes[var] != 'object']
category = [var for var in train.columns if train.dtypes[var] == 'object']

Now let's analyse our null values, by feature type

In [3]:
num_nulls = train[numeric].isnull().sum().sort_values(ascending=False)
print("Numerical features with null values:\n{}".format(num_nulls[num_nulls > 0]))

print()

cat_nulls = train[category].isnull().sum().sort_values(ascending=False)
print("Categorical features with null values:\n{}".format(cat_nulls[cat_nulls > 0]))


# nulls = train.isnull().sum()
# nulls = nulls[nulls > 0]
# nulls = nulls.reset_index()
# nulls.columns = ['variable', 'count']
# nulls['percent'] = nulls['count'] / len(train)
# nulls.sort_values("count", ascending=False, inplace=True)
# nulls

Numerical features with null values:
LotFrontage    259
GarageYrBlt     81
MasVnrArea       8
dtype: int64

Categorical features with null values:
PoolQC          1453
MiscFeature     1406
Alley           1369
Fence           1179
FireplaceQu      690
GarageType        81
GarageCond        81
GarageQual        81
GarageFinish      81
BsmtFinType2      38
BsmtExposure      38
BsmtFinType1      37
BsmtQual          37
BsmtCond          37
MasVnrType         8
Electrical         1
dtype: int64


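The nulls in the numerical features should be replaceable with zeros, on the assumption that a missing value means the feature is absent (no garage, no masonry veneer, and so on). For the categoricals, we need to dig a little deeper.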

`PoolQC`, `MiscFeature`, `Alley`, `Fence`, and `FireplaceQu` all have ~50% or more missing values. Let's check in turn if we should drop the variable or convert it to a dummy.

Within each of the `Garage*`, `Bsmt*`, and `MasVnr*` groups, the variables have the same (or nearly the same) null counts, which suggests these nulls communicate an absence of that feature. We can capture this with dummy variables, which we create later.
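
One way to check that assumption, rather than inferring it from matching counts alone, is to test whether the nulls fall on the same rows. The sketch below uses a small hypothetical `toy` frame in place of `train`; only the column names are borrowed from the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: two houses have no garage,
# so every garage column is null for those rows.
toy = pd.DataFrame({
    "GarageType": ["Attchd", np.nan, "Detchd", np.nan],
    "GarageQual": ["TA", np.nan, "Gd", np.nan],
    "GarageYrBlt": [2003.0, np.nan, 1976.0, np.nan],
})

garage_cols = ["GarageType", "GarageQual", "GarageYrBlt"]
null_mask = toy[garage_cols].isna()

# True only if every row is null in all garage columns or in none of them.
nulls_co_occur = bool((null_mask.all(axis=1) | ~null_mask.any(axis=1)).all())
print(nulls_co_occur)
```

If the same check on `train` came back `False` for a group, those rows would be worth inspecting individually before treating the nulls as "feature absent".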

In [4]:
high_null = ["PoolQC", "MiscFeature", "Alley", "Fence", "FireplaceQu"]
for feat in high_null:
    print(train[feat].unique())

[nan 'Ex' 'Fa' 'Gd']
[nan 'Shed' 'Gar2' 'Othr' 'TenC']
[nan 'Grvl' 'Pave']
[nan 'MnPrv' 'GdWo' 'GdPrv' 'MnWw']
[nan 'TA' 'Gd' 'Fa' 'Ex' 'Po']


In my estimation, `PoolQC` is probably covered adequately
by the `PoolArea` feature; ditto `FireplaceQu` & `Fireplaces`.
`MiscFeature`, with so many null values, is unlikely to add value.

We will delete these three, and convert the other two into 
dummy variables representing yes/no for 'has fence' and 'has alley'
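
A minimal sketch of that yes/no conversion, assuming a null simply means "no fence" or "no alley". The `toy` frame and the `HasFence`/`HasAlley` column names are hypothetical, chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame using category values seen in the real columns.
toy = pd.DataFrame({
    "Fence": [np.nan, "MnPrv", "GdWo", np.nan],
    "Alley": [np.nan, np.nan, "Grvl", "Pave"],
})

# 1 where a value is present, 0 where it is null.
toy["HasFence"] = toy["Fence"].notna().astype(int)
toy["HasAlley"] = toy["Alley"].notna().astype(int)

# The original columns are no longer needed once the indicators exist.
toy = toy.drop(columns=["Fence", "Alley"])
print(toy)
```

This discards the distinction between, say, `GdPrv` and `MnWw` fences; keeping the full set of categories via `get_dummies()` is the alternative if that detail turns out to matter.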

In [5]:
# errors="ignore" makes this cell safe to re-run once the columns are gone
train = train.drop(columns=["PoolQC", "MiscFeature", "FireplaceQu"], errors="ignore")

Before we create our dummy variables, let's look at the last null value we haven't addressed: `Electrical`. Given that there is just one such entry, let's drop the data point with no electrical value (how does that work anyway?)

In [6]:
print(train[train["Electrical"].isna()]["Electrical"])
# Drop by condition rather than by a hard-coded index, so the cell is safe to re-run
train = train.dropna(subset=["Electrical"])


1379    NaN
Name: Electrical, dtype: object


Let's make those dummy variables. We will utilise the `pandas` built-in `get_dummies()`. Using this function, we capture natively the houses without garages, pools, etc. as entries with 0s for all the categorical options for a given variable. 

To address this on the numerical variable side, we will `fillna` with 0s
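
To illustrate the behaviour described above: by default `get_dummies()` does not create a column for NaN, so a row with a missing category ends up with no indicator set. The `toy` frame below is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: the middle house has no garage.
toy = pd.DataFrame({"GarageType": ["Attchd", np.nan, "Detchd"]})

# Default dummy_na=False: NaN gets no column of its own.
dummies = pd.get_dummies(toy)
print(dummies)

# Number of indicators set per row; the NaN row has none.
row_sums = dummies.sum(axis=1)
print(row_sums.tolist())
```

Passing `dummy_na=True` would instead add an explicit indicator column for the missing values, if a separate "absent" flag is preferred.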

In [7]:
# `category` was built before columns were dropped in In [5],
# so keep only the names still present in `train`
category = [var for var in category if var in train.columns]
categoricals = pd.get_dummies(train[category])
numericals = train[numeric].fillna(0)
dataset = pd.merge(numericals, categoricals, left_index=True, right_index=True)
dataset

In [None]:

# y = train['SalePrice'].sort_values()
# # x = np.arange(len(y))
# # fig, ax = plt.subplots(1,2)
# # plt.figure(1); plt.title('Johnson SU')
# # sns.displot(y)

# number_of_bins = 50
# bin_cutoffs = np.linspace(np.percentile(y,0), np.percentile(y,99),number_of_bins)
# h = plt.hist(y, bins = bin_cutoffs, color='0.75')

# # Create the plot
# # sns.displot(y)
# params = st.lognorm.fit(y)
# # print(params)

# fitted_pdf = st.lognorm.pdf(y, params[0], loc=params[-2], scale=params[-1])
# scale_pdf = np.trapz(h[0], h[1][:-1]) / np.trapz(fitted_pdf, y)
# fitted_pdf *= scale_pdf
# # sns.lineplot(x=y, y=fitted_pdf)
# plt.plot(y, fitted_pdf)
# # sns.displot(y)
# # plt.figure(2); plt.title('Normal')
# # sns.distplot(y, kde=False, fit=st.norm)
# # plt.figure(3); plt.title('Log Normal')
# # sns.distplot(y, kde=False, fit=st.lognorm)

In [17]:
isinstance(train['SalePrice'], pd.Series)

True