### To-dos
- check skew of variables
    - apply transformations as required
- convert categoricals to dummy variables
- deal with nulls/nans (or don't)
- split off dependent/independent variables
- scale/normalise
- split into train/validate
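
As a rough illustration of the later items on this list, here is a minimal sketch on a toy frame. The values and the `toy` frame are invented for the example, and the real notebook may well handle these steps differently. Note that the sketch splits before scaling, so that the scaling statistics come from the training rows only; that ordering is a choice made here, not something the list above prescribes.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the housing data; the values are invented for illustration.
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550, 14260, 14115, 10084, 10382],
    "Street": ["Pave", "Pave", "Grvl", "Pave", "Pave", "Grvl", "Pave", "Pave"],
    "SalePrice": [208500, 181500, 223500, 140000, 250000, 143000, 307000, 200000],
})

# Convert categoricals to dummy variables (drop_first removes a redundant column).
encoded = pd.get_dummies(toy, columns=["Street"], drop_first=True, dtype=float)

# Split off the dependent variable from the independent ones.
y = encoded["SalePrice"]
X = encoded.drop(columns="SalePrice")

# Shuffle the row labels reproducibly, then hold out 25% for validation.
rng = np.random.default_rng(0)
shuffled = rng.permutation(X.index.to_numpy())
n_val = int(len(shuffled) * 0.25)
val_idx, train_idx = shuffled[:n_val], shuffled[n_val:]
X_train, X_val = X.loc[train_idx], X.loc[val_idx]
y_train, y_val = y.loc[train_idx], y.loc[val_idx]

# Standardise with statistics from the training rows only, so nothing
# about the validation rows leaks into the scaling.
mean, std = X_train.mean(), X_train.std(ddof=0)
std = std.replace(0, 1.0)  # guard against constant columns
X_train_scaled = (X_train - mean) / std
X_val_scaled = (X_val - mean) / std
```
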


Let's start off with some imports

In [30]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as st
from collections import defaultdict
from src.analyse import test_trans
from src.analyse import analyse
from src.load import load
from src.load import clean

pd.options.display.max_rows = 1000
pd.options.display.max_columns = 200
plt.rcParams['figure.figsize'] = [10, 4]
plt.rcParams['figure.dpi'] = 100

We'll then load the training data and clean it up.

In [2]:
train_data = pd.read_csv('data/train.csv')

In [3]:
drops = ['PoolQC', 'MiscFeature', 'FireplaceQu', 'Id']
fills = {'MasVnrArea': 0.0, 'LotFrontage': 0.0}

# Drop the rows where Electrical is missing
elec_na = train_data["Electrical"].isna()
clean_data = train_data.loc[~elec_na]

clean_data = clean(clean_data, drop_list=drops, fill_na=fills)

# train_data = clean(train_data, drop_list=drops, fill_na=fills)

In [4]:
skew_kurt = analyse(clean_data)
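
`analyse` is a project-local helper whose source isn't shown here. Judging only by the table it returns, something along the following lines would give a similar result. This is a hypothetical stand-in (the name `skew_kurt_table` and the `demo` frame are made up), not the actual implementation.

```python
import pandas as pd

def skew_kurt_table(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for `analyse`: per-column skewness and kurtosis."""
    # Skewness and kurtosis only make sense for numeric columns.
    numeric = df.select_dtypes(include="number")
    table = pd.DataFrame({
        "Skewness": numeric.skew(),  # bias-corrected sample skewness
        "Kurtosis": numeric.kurt(),  # excess kurtosis (0 for a normal distribution)
    })
    # Most right-skewed columns first.
    return table.sort_values("Skewness", ascending=False)

# Small invented frame to exercise the function.
demo = pd.DataFrame({
    "symmetric": [1.0, 2.0, 3.0, 4.0, 5.0],
    "right_skewed": [1.0, 1.0, 1.0, 2.0, 50.0],
    "label": list("abcde"),
})
result = skew_kurt_table(demo)
```

The non-numeric `label` column is skipped, and the right-skewed column sorts to the top.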

In [20]:
skew_kurt.loc[(skew_kurt.Skewness >= 1) | (skew_kurt.Kurtosis >= 1)]

Variable,Skewness,Kurtosis
MiscVal,24.468441,700.524315
PoolArea,14.823236,223.112709
LotArea,12.203431,203.101592
3SsnPorch,10.300725,123.57497
LowQualFinSF,9.008149,83.174678
KitchenAbvGr,4.48664,21.514545
BsmtFinSF2,4.253594,20.096813
ScreenPorch,4.120572,18.423508
BsmtHalfBath,4.101759,16.382219
EnclosedPorch,3.088518,10.420906
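
For the "apply transformations as required" to-do, one way to pick a transformation is to compare the skewness it leaves behind. The sketch below does this on synthetic log-normal data rather than the real columns; it is not the imported `test_trans` helper, whose behaviour isn't shown here.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed column, loosely standing in for something like LotArea.
rng = np.random.default_rng(42)
values = pd.Series(rng.lognormal(mean=9.0, sigma=1.0, size=2000), name="area")

# Candidate transformations to compare.
candidates = {
    "none": values,
    "sqrt": np.sqrt(values),
    "log1p": np.log1p(values),  # log(1 + x), safe when a column contains zeros
}

# Skewness remaining after each transformation; closest to zero wins.
skew_by_transform = pd.Series({name: s.skew() for name, s in candidates.items()})
best = skew_by_transform.abs().idxmin()
print(skew_by_transform)
print("Least skewed after:", best)
```

Columns that are mostly zeros with a few large values (as `MiscVal` and `PoolArea` likely are) may stay heavily skewed under any of these, so it's worth checking the result per column rather than applying one transformation everywhere.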


In [34]:
unique_count = defaultdict(list)
for col in clean_data:
    unique_count[len(clean_data[col].unique())].append(col)

for key in sorted(unique_count):
    print('Number of unique values: {} | Column(s):\n{}'.format(key, unique_count[key]))

Number of unique values: 2 | Column(s):
['Street', 'Utilities', 'CentralAir']
Number of unique values: 3 | Column(s):
['Alley', 'LandSlope', 'BsmtHalfBath', 'HalfBath', 'PavedDrive']
Number of unique values: 4 | Column(s):
['LotShape', 'LandContour', 'ExterQual', 'BsmtFullBath', 'FullBath', 'KitchenAbvGr', 'KitchenQual', 'Fireplaces', 'GarageFinish']
Number of unique values: 5 | Column(s):
['MSZoning', 'LotConfig', 'BldgType', 'MasVnrType', 'ExterCond', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'HeatingQC', 'Electrical', 'GarageCars', 'Fence', 'YrSold']
Number of unique values: 6 | Column(s):
['RoofStyle', 'Foundation', 'Heating', 'GarageQual', 'GarageCond', 'SaleCondition']
Number of unique values: 7 | Column(s):
['BsmtFinType1', 'BsmtFinType2', 'Functional', 'GarageType']
Number of unique values: 8 | Column(s):
['Condition2', 'HouseStyle', 'RoofMatl', 'BedroomAbvGr', 'PoolArea']
Number of unique values: 9 | Column(s):
['Condition1', 'OverallCond', 'SaleType']
Number of unique values: 1