# Data Exploration

## Import Libraries & Dataset

In [1]:
import pandas as pd
pd.set_option('display.max_rows', 100) # to look at more rows of data later
pd.set_option('display.max_columns', 100) # to expand columns view so that all can be seen later

In [2]:
# Load datasets
train_df = pd.read_csv('../datasets/train.csv')
test_df = pd.read_csv('../datasets/test.csv')

## Exploring Data

### Identify data types in dataset

In [3]:
# We only need to check train_df here, since both datasets have the same columns
train_df.dtypes

Id                   int64
PID                  int64
MS SubClass          int64
MS Zoning           object
Lot Frontage       float64
Lot Area             int64
Street              object
Alley               object
Lot Shape           object
Land Contour        object
Utilities           object
Lot Config          object
Land Slope          object
Neighborhood        object
Condition 1         object
Condition 2         object
Bldg Type           object
House Style         object
Overall Qual         int64
Overall Cond         int64
Year Built           int64
Year Remod/Add       int64
Roof Style          object
Roof Matl           object
Exterior 1st        object
Exterior 2nd        object
Mas Vnr Type        object
Mas Vnr Area       float64
Exter Qual          object
Exter Cond          object
Foundation          object
Bsmt Qual           object
Bsmt Cond           object
Bsmt Exposure       object
BsmtFin Type 1      object
BsmtFin SF 1       float64
BsmtFin Type 2      object
B

#### Observations

According to the data dictionary, <b>MS SubClass</b> is categorical even though it is stored as a number (each value is a code for the class of the building), so it will need to be converted later.

The same goes for <b>Mo Sold</b>, since the value is the month of the year in which the house was sold and not a quantity.
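One possible way to do this conversion is sketched below. It assumes that casting the codes to strings is enough for later encoding; `demo_df` and its values are made up for illustration, and the real notebook would apply the same step to `train_df` and `test_df`.

```python
import pandas as pd

# Toy frame with made-up values, standing in for train_df.
demo_df = pd.DataFrame({
    "MS SubClass": [20, 60, 120],
    "Mo Sold": [3, 7, 12],
})

# Cast the numeric codes to strings so they are treated as labels,
# not as quantities with an order or a magnitude.
categorical_codes = ["MS SubClass", "Mo Sold"]
demo_df[categorical_codes] = demo_df[categorical_codes].astype(str)

print(demo_df.dtypes)
```

Casting to `str` keeps the original code visible in the label, which makes it easy to look the value up in the data dictionary later.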

The columns <b>Yr Sold, Year Built, Year Remod/Add</b> and <b>Garage Yr Blt</b> are numeric, but they hold calendar years.
Instead of using them as is, we can derive 3 new values that are easier to interpret:

1. age_house = Yr Sold - Year Built<br>
The age of the house when it was sold.
2. num_years_remod = Yr Sold - Year Remod/Add<br>
The number of years between the last remodel and the sale.
3. age_garage = Yr Sold - Garage Yr Blt<br>
The age of the garage when the house was sold. Houses without a garage have a null Garage Yr Blt, so the result is null for them; filling it with 0 marks that no garage is present, although 0 then also describes a garage built in the year of sale.
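The three derived values above could be computed along these lines. This is a minimal sketch on a made-up `demo_df`; filling the missing garage age with 0 is an assumed reading of the note on houses without a garage, not a confirmed step of the final pipeline.

```python
import pandas as pd

# Toy frame with made-up values, standing in for train_df.
demo_df = pd.DataFrame({
    "Yr Sold": [2010, 2008, 2009],
    "Year Built": [1976, 2008, 1950],
    "Year Remod/Add": [2005, 2008, 1950],
    "Garage Yr Blt": [1976.0, 2008.0, None],  # None: no garage recorded
})

demo_df["age_house"] = demo_df["Yr Sold"] - demo_df["Year Built"]
demo_df["num_years_remod"] = demo_df["Yr Sold"] - demo_df["Year Remod/Add"]

# A missing Garage Yr Blt gives a null difference; fill it with 0 (assumption).
demo_df["age_garage"] = (demo_df["Yr Sold"] - demo_df["Garage Yr Blt"]).fillna(0)

print(demo_df[["age_house", "num_years_remod", "age_garage"]])
```

In this toy data the second and third rows both end up with `age_garage` equal to 0, one because the garage was built in the year of sale and one because there is no garage, so a separate garage indicator may be worth keeping.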


### Check for amount of null values per column

In [4]:
# For training data: share of null values per column, highest first
s = train_df.isnull().mean().sort_values(ascending=False)
s[s > 0]

Pool QC           0.995612
Misc Feature      0.968308
Alley             0.931741
Fence             0.804973
Fireplace Qu      0.487567
Lot Frontage      0.160897
Garage Finish     0.055583
Garage Cond       0.055583
Garage Qual       0.055583
Garage Yr Blt     0.055583
Garage Type       0.055095
Bsmt Exposure     0.028279
BsmtFin Type 2    0.027304
BsmtFin Type 1    0.026816
Bsmt Cond         0.026816
Bsmt Qual         0.026816
Mas Vnr Type      0.010726
Mas Vnr Area      0.010726
Bsmt Half Bath    0.000975
Bsmt Full Bath    0.000975
Garage Cars       0.000488
Garage Area       0.000488
Bsmt Unf SF       0.000488
BsmtFin SF 2      0.000488
Total Bsmt SF     0.000488
BsmtFin SF 1      0.000488
dtype: float64

In [5]:
# For test data: share of null values per column, highest first
s = test_df.isnull().mean().sort_values(ascending=False)
s[s > 0]

Pool QC           0.995449
Misc Feature      0.953356
Alley             0.934016
Fence             0.804323
Fireplace Qu      0.480091
Lot Frontage      0.182025
Garage Cond       0.051195
Garage Qual       0.051195
Garage Yr Blt     0.051195
Garage Finish     0.051195
Garage Type       0.050057
Bsmt Exposure     0.028441
BsmtFin Type 1    0.028441
Bsmt Qual         0.028441
BsmtFin Type 2    0.028441
Bsmt Cond         0.028441
Mas Vnr Area      0.001138
Mas Vnr Type      0.001138
Electrical        0.001138
dtype: float64

#### Observations

For the features <b>Pool QC, Misc Feature</b> and <b>Alley</b>, more than 90% of the values are null in both datasets, so we will drop these columns completely.
The remaining null values above will need to be filled in later.
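As a rough sketch of the drop step, the columns can be selected by their null share instead of being listed by hand. The 0.9 threshold mirrors the observation above; `demo_df`, `null_share` and `cols_to_drop` are illustrative names, and the toy values are made up.

```python
import pandas as pd

# Toy frame with made-up values: 20 rows, one column that is 95% null.
demo_df = pd.DataFrame({
    "Lot Area": list(range(8000, 8020)),
    "Pool QC": [None] * 19 + ["Gd"],       # 95% null, should be dropped
    "Fence": [None] * 16 + ["MnPrv"] * 4,  # 80% null, should be kept
})

# Share of null values per column, then the columns above the threshold.
null_share = demo_df.isnull().mean()
cols_to_drop = null_share[null_share > 0.9].index.tolist()

demo_df = demo_df.drop(columns=cols_to_drop)
print(cols_to_drop)
```

If this approach is used, it is probably safest to compute `cols_to_drop` on the training data only and drop the same list from the test data, so both datasets keep identical columns.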

### Logic checks and relationships between columns

We will need to check that the following columns are logically consistent with one another.
This also validates the relationships we expect between them.

1. Yr Sold >= Year Built and Yr Sold >= Garage Yr Blt
2. Year Remod/Add >= Year Built
3. Total Bsmt SF = BsmtFin SF 1 + BsmtFin SF 2 + Bsmt Unf SF
4. Gr Liv Area = 1st Flr SF + 2nd Flr SF + Low Qual Fin SF
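The checks listed above could be expressed as boolean flags, one per rule, and then counted. This is only a sketch: `demo_df` holds made-up values, and the flag names in `violations` are hypothetical.

```python
import pandas as pd

# Toy frame with made-up values; the second row is sold before it was built.
demo_df = pd.DataFrame({
    "Yr Sold": [2010, 2007],
    "Year Built": [1976, 2008],
    "Year Remod/Add": [2005, 2008],
    "Garage Yr Blt": [1976.0, None],
    "Total Bsmt SF": [1000.0, 800.0],
    "BsmtFin SF 1": [600.0, 500.0],
    "BsmtFin SF 2": [0.0, 100.0],
    "Bsmt Unf SF": [400.0, 200.0],
    "Gr Liv Area": [1500, 1200],
    "1st Flr SF": [1000, 1200],
    "2nd Flr SF": [500, 0],
    "Low Qual Fin SF": [0, 0],
})

bsmt_parts = ["BsmtFin SF 1", "BsmtFin SF 2", "Bsmt Unf SF"]
floor_parts = ["1st Flr SF", "2nd Flr SF", "Low Qual Fin SF"]

# Each flag is True where a row breaks the rule.
# Note: a null year compares as False with <, so null rows are not flagged
# by the year checks, while a null total is flagged by the != checks.
violations = pd.DataFrame({
    "sold_before_built": demo_df["Yr Sold"] < demo_df["Year Built"],
    "sold_before_garage": demo_df["Yr Sold"] < demo_df["Garage Yr Blt"],
    "remod_before_built": demo_df["Year Remod/Add"] < demo_df["Year Built"],
    "bsmt_sf_mismatch": demo_df["Total Bsmt SF"] != demo_df[bsmt_parts].sum(axis=1),
    "liv_area_mismatch": demo_df["Gr Liv Area"] != demo_df[floor_parts].sum(axis=1),
})

# Number of offending rows per rule.
print(violations.sum())
```

Rows flagged by any rule can then be inspected with `demo_df[violations.any(axis=1)]` before deciding whether to correct or remove them.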


### Any other new insights we can gather with existing data?

Assuming the above relationships hold, we can calculate the overall area of the house (which we can name overallsf). <br>
The calculation is as follows: <br>

overallsf = Gr Liv Area + Total Bsmt SF

This will give us a more general idea of the size of the house, which may help us to determine how size affects the price of a house.

During feature selection, we can drop BsmtFin SF 1, BsmtFin SF 2, Bsmt Unf SF, 1st Flr SF, 2nd Flr SF and Low Qual Fin SF to avoid multicollinearity issues when doing our modeling.
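A minimal sketch of the new feature and the follow-up drop is shown here, again on a made-up `demo_df`. The name `component_cols` is illustrative, and whether to drop these columns is still a modeling decision to confirm during feature selection.

```python
import pandas as pd

# Toy frame with made-up values, standing in for train_df.
demo_df = pd.DataFrame({
    "Gr Liv Area": [1500, 1200],
    "Total Bsmt SF": [1000.0, 800.0],
    "BsmtFin SF 1": [600.0, 500.0],
    "BsmtFin SF 2": [0.0, 100.0],
    "Bsmt Unf SF": [400.0, 200.0],
    "1st Flr SF": [1000, 1200],
    "2nd Flr SF": [500, 0],
    "Low Qual Fin SF": [0, 0],
})

# Overall size: above-ground living area plus total basement area.
demo_df["overallsf"] = demo_df["Gr Liv Area"] + demo_df["Total Bsmt SF"]

# The component columns are already summarized by the two totals.
component_cols = [
    "BsmtFin SF 1", "BsmtFin SF 2", "Bsmt Unf SF",
    "1st Flr SF", "2nd Flr SF", "Low Qual Fin SF",
]
demo_df = demo_df.drop(columns=component_cols)

print(demo_df)
```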

Additionally, according to the data dictionary, <b>Garage Cars</b> and <b>Garage Area</b> both measure the size of the garage, but in different units (car capacity versus square feet). We may opt to drop <b>Garage Cars</b> to avoid collinearity issues as well.
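Before dropping either garage column, the overlap can be checked with a simple correlation. The sketch below uses made-up values in `demo_df`, so the printed number says nothing about the real dataset; it only shows the call.

```python
import pandas as pd

# Toy frame with made-up values, standing in for train_df.
demo_df = pd.DataFrame({
    "Garage Cars": [1, 2, 2, 3],
    "Garage Area": [280, 480, 520, 800],
})

# Pearson correlation between the two garage size measures.
garage_corr = demo_df["Garage Cars"].corr(demo_df["Garage Area"])
print(round(garage_corr, 3))
```

A value close to 1 on the real training data would support keeping only one of the two columns.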

The data dictionary also recommends removing houses with <b>Gr Liv Area greater than 4000</b>, as these are deemed outliers, so we will need to take care of them.
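The outlier rule could be applied with a boolean filter, as in the sketch below. `demo_df` and its values are made up, and the rows at exactly 4000 are kept because the rule is stated as "greater than 4000".

```python
import pandas as pd

# Toy frame with made-up values, standing in for train_df.
demo_df = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "Gr Liv Area": [1500, 4000, 4200, 5600],
})

# Keep only houses at or below the 4000 threshold.
filtered_df = demo_df[demo_df["Gr Liv Area"] <= 4000].reset_index(drop=True)

print(len(demo_df) - len(filtered_df), "rows removed")
```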