# Data Transformations

## Objectives

* To prepare data for modelling. The main tasks in this file are:
  * Where necessary, apply appropriate transformations to the features, for example a logarithmic transformation for features with a skewed distribution.
  * Where feasible, impute values for missing cells.
  * Convert categorical variables into sets of dummies.

## Inputs

* `house_prices_records.csv` data located at outputs/datasets/collection

## Outputs

* An updated version of `house_prices_records.csv` data

# Imports and Setup

In [9]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from feature_engine.encoding import OneHotEncoder

## Change Working Directory

We will change the directory to the project level. 

In [10]:
cd "~/Documents/GitHub/heritage-housing-issues"

/Users/mehtap/Documents/GitHub/heritage-housing-issues


# Load Data

Load the `house_prices_records` data.

In [11]:
df = pd.read_csv('outputs/datasets/collection/house_prices_records.csv')
df.head()

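| Unnamed: 0 | 1stFlrSF | 2ndFlrSF | BedroomAbvGr | BsmtExposure | BsmtFinSF1 | BsmtFinType1 | BsmtUnfSF | EnclosedPorch | GarageArea | GarageFinish | ... | LotFrontage | MasVnrArea | OpenPorchSF | OverallCond | OverallQual | TotalBsmtSF | WoodDeckSF | YearBuilt | YearRemodAdd | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 856 | 854.0 | 3.0 | No | 706 | GLQ | 150 | 0.0 | 548 | RFn | ... | 65.0 | 196.0 | 61 | 5 | 7 | 856 | 0.0 | 2003 | 2003 | 208500 |
| 1 | 1262 | 0.0 | 3.0 | Gd | 978 | ALQ | 284 |  | 460 | RFn | ... | 80.0 | 0.0 | 0 | 8 | 6 | 1262 |  | 1976 | 1976 | 181500 |
| 2 | 920 | 866.0 | 3.0 | Mn | 486 | GLQ | 434 | 0.0 | 608 | RFn | ... | 68.0 | 162.0 | 42 | 5 | 7 | 920 |  | 2001 | 2002 | 223500 |
| 3 | 961 |  |  | No | 216 | ALQ | 540 |  | 642 | Unf | ... | 60.0 | 0.0 | 35 | 5 | 7 | 756 |  | 1915 | 1970 | 140000 |
| 4 | 1145 |  | 4.0 | Av | 655 | GLQ | 490 | 0.0 | 836 | RFn | ... | 84.0 | 350.0 | 84 | 5 | 8 | 1145 |  | 2000 | 2000 | 250000 |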


# Prepare Data for Modelling

The data has information on 23 features. These will be examined one by one and necessary transformations will be applied. 
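
As a rough orientation, the three kinds of transformation listed in the objectives could look like the sketch below. It runs on a small toy frame, not the real data. The column choices, the median imputation, and the use of `pd.get_dummies` in place of the imported `OneHotEncoder` are all assumptions made for illustration; they are not necessarily the choices made later in this notebook.

```python
import numpy as np
import pandas as pd

# Toy data; the values and the choice of columns are illustrative assumptions
toy = pd.DataFrame({
    "LotFrontage": [65.0, 80.0, np.nan, 60.0],
    "SalePrice": [208500, 181500, 223500, 140000],
    "GarageFinish": ["RFn", "RFn", "Unf", "RFn"],
})

# 1) Log transformation for a (possibly) skewed numeric column
toy["LogSalePrice"] = np.log1p(toy["SalePrice"])

# 2) Impute missing cells, here with the column median (an assumed strategy)
toy["LotFrontage"] = toy["LotFrontage"].fillna(toy["LotFrontage"].median())

# 3) Convert a categorical column into a set of 0/1 dummy columns
toy = pd.get_dummies(toy, columns=["GarageFinish"], dtype=int)

print(toy.columns.tolist())
```
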

## `1stFlrSF`, `2ndFlrSF` and `GrLivArea`

In [12]:
# Count zero observations in the "2ndFlrSF" column
count_zero = (df["2ndFlrSF"] == 0).sum()
print(count_zero)


781


* Out of 1460 observations, 781 have a value of zero for `2ndFlrSF`.
* 86 observations have missing values. Given the context and the large number of zeros, these missing values very likely belong to houses with no second floor, so they will be replaced by zero.
* `GrLivArea` gives the size of the above-ground living area. This is likely to be the sum of `1stFlrSF` and `2ndFlrSF`.
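
The missing-value count quoted above can be obtained in the same way as the zero count. The following is a minimal sketch on a toy column (the values are made up for illustration); applied to `df`, the same calls would be expected to reproduce the figures reported above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real column; values are illustrative only
toy = pd.DataFrame({"2ndFlrSF": [854.0, 0.0, np.nan, 866.0, np.nan, 0.0]})

n_missing = toy["2ndFlrSF"].isna().sum()   # cells with no recorded value
n_zero = (toy["2ndFlrSF"] == 0).sum()      # houses recorded with no second floor
# Zeros as a share of the non-missing observations
share_zero = n_zero / toy["2ndFlrSF"].notna().sum()

print(n_missing, n_zero, share_zero)
```
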

In [13]:
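# Replace missing values in 2ndFlrSF with 0, then count the rows where
# 1stFlrSF + 2ndFlrSF does not equal GrLivArea.
# Assigning the result back avoids the chained in-place fillna,
# which is deprecated in recent pandas versions.
df["2ndFlrSF"] = df["2ndFlrSF"].fillna(0)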
count_diff = (
    df["1stFlrSF"] + df["2ndFlrSF"] != df["GrLivArea"]
).sum()

print(count_diff)
print(count_diff/len(df))


64
0.043835616438356165


* In 4.4% of observations, the sum of the first- and second-floor areas does not equal the above-ground living area. This is probably because these houses have an additional part that is not included in either the first- or second-floor area.
* While above-ground living area is important for house price, having a second floor may also be a determinant. Estimations will use `GrLivArea` together with a dummy variable for houses that have a second floor.
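
One way to build such a dummy is sketched below on a toy frame. The column name `HasSecondFloor` is hypothetical and chosen here only for illustration. The sketch assumes missing `2ndFlrSF` values have already been replaced by zero, as described above.

```python
import pandas as pd

# Toy data; values are illustrative only
toy = pd.DataFrame({
    "GrLivArea": [1710, 1262, 1786],
    "2ndFlrSF": [854.0, 0.0, 866.0],
})

# 1 if the house has any second-floor area, 0 otherwise
# ("HasSecondFloor" is a hypothetical column name)
toy["HasSecondFloor"] = (toy["2ndFlrSF"] > 0).astype(int)

print(toy)
```

A model could then use `GrLivArea` and `HasSecondFloor` together, with `2ndFlrSF` itself left out to avoid carrying largely redundant information.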