### 0. Loading and High-level Overview of the Dataset

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import seaborn as sns
import warnings 

# silence warnings
warnings.filterwarnings('ignore')

# plotting settings
plt.style.use(['ggplot'])

In [2]:
# checking the current working directory
pwd

'/Users/jensen/Desktop/aiig-suss/learning_materials/regression/notebooks'

In [3]:
# changing directory
os.chdir('../data/raw')

# loading the dataset
housing_data = pd.read_csv("train.csv")
# look at first 5 rows
housing_data.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


After loading your dataset, develop the habit of quickly getting a very high-level overview of it. This can involve:

1. Understanding the `shape` of your dataset, that is, the number of `rows` and `columns` present.
2. Observing the data types (`dtypes`) of each column, so that you know whether certain columns may need to be cast to a different type.
3. Printing the summary statistics of the dataset to better understand the value distributions.
4. Checking for missing or null values in the dataset. More often than not, real-world data is not clean and contains missing values in different rows. This may reflect a small abnormality in the data collection process, or even highlight a significant fault when a column is missing most of its data (> 50%).
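
As a rough sketch, the four checks above could be bundled into a single helper. The function name `quick_overview` and the toy DataFrame are hypothetical, made up for illustration only; they are not part of this notebook or of `train.csv`.

```python
import numpy as np
import pandas as pd


def quick_overview(df: pd.DataFrame) -> dict:
    """Collect the four high-level checks into one dictionary (hypothetical helper)."""
    missing_counts = df.isnull().sum()
    return {
        "shape": df.shape,                              # 1. (rows, columns)
        "dtypes": df.dtypes,                            # 2. data type of each column
        "summary": df.describe(include="all"),          # 3. summary statistics
        "missing": missing_counts[missing_counts > 0],  # 4. columns containing nulls
    }


# Toy data invented for illustration (not taken from train.csv)
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550],
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "Street": ["Pave", "Pave", "Grvl", "Pave"],
})

overview = quick_overview(toy)
print(overview["shape"])
print(overview["missing"])
```

In the notebook you would pass `housing_data` instead of `toy`. The cells that follow run the same checks one at a time.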

In [40]:
# let's print out the overall shape of the dataframe loaded in `housing_data`
print(f"The dataset consists of: {housing_data.shape[0]} rows and {housing_data.shape[1]} columns.\n")

# let's also check the column names and dtypes
print(housing_data.dtypes)

# since the output above is truncated (only a few rows are displayed),
# let's print the column names as well
print(f"\n{housing_data.columns}")

The dataset consists of: 1460 rows and 81 columns.

Id                 int64
MSSubClass         int64
MSZoning          object
LotFrontage      float64
LotArea            int64
                  ...   
MoSold             int64
YrSold             int64
SaleType          object
SaleCondition     object
SalePrice          int64
Length: 81, dtype: object

Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrS

After looking through the column names and reading the [overview](https://www.kaggle.com/c/home-data-for-ml-course/overview) of the dataset on Kaggle, do you know which column we are trying to predict in the `Regression` module?

If you guessed `SalePrice`, you are right! Recall that `Supervised Machine Learning` is broadly classified into two categories:

    1. Classification: categorical target (e.g., predicting whether I will win the lottery or not)
    2. Regression: numerical target (e.g., predicting how much a 4-room HDB flat would cost in Choa Chu Kang)

Since this is a housing price competition on Kaggle, it is only logical that we are trying to predict the housing price. As such, `SalePrice` is our target variable.
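
To make the distinction concrete, here is a minimal sketch that separates the target from the features and guesses the task type from the target's dtype. The toy DataFrame is made up for illustration, and treating "numeric dtype" as "regression" is only a rough heuristic assumed here: integer-coded class labels would be misread as a regression target.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Toy data invented for illustration (not taken from train.csv)
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "MSZoning": ["RL", "RL", "RM"],
    "SalePrice": [208500, 181500, 223500],
})

target = "SalePrice"
y = toy[target]                  # the column we want to predict
X = toy.drop(columns=[target])   # every other column is a candidate feature

# Rough heuristic (an assumption): a numeric target suggests regression
task = "regression" if is_numeric_dtype(y) else "classification"
print(f"Target: {target}, task: {task}, features: {list(X.columns)}")
```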

In [38]:
# since we have a total of 81 columns,
# let's count the number of numerical vs categorical columns

cat_cols = housing_data.dtypes[housing_data.dtypes == 'object']
num_cols = housing_data.dtypes[housing_data.dtypes != 'object']

# printing to see what's in the two Series; feel free to uncomment
# print(cat_cols)
# print(num_cols)     # you will observe that the numerical columns consist of both `int64` and `float64`

# printing the summary
print(f"The total number of categorical columns: {len(cat_cols)}, numerical columns: {len(num_cols)}")

The total number of categorical columns: 43, numerical columns: 38


By understanding the number of columns, or `features` as I prefer to call them, of each numerical/categorical type, we can better plan the visualisations needed to best present each column. You will see this later in `Section 1`.
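
If you want the column names of each group rather than just the counts, one possible alternative is `DataFrame.select_dtypes`. The sketch below uses a toy DataFrame invented for illustration. Note that `exclude="number"` puts every non-numeric column (not only strings) in the second group, which may or may not be what you want.

```python
import pandas as pd

# Toy data invented for illustration (not taken from train.csv)
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],        # int64
    "LotFrontage": [65.0, 80.0, 68.0],     # float64
    "MSZoning": ["RL", "RL", "RM"],        # strings
})

# "number" covers both integer and float columns
num_features = toy.select_dtypes(include="number").columns.tolist()
# everything that is not numeric is treated as categorical here
cat_features = toy.select_dtypes(exclude="number").columns.tolist()

print(f"Numerical features: {num_features}")
print(f"Categorical features: {cat_features}")
```

Having the names as plain lists makes it easy to loop over each group when plotting.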

In [18]:
# let's also check the summary statistics of every column
print(housing_data.describe(include='all'))      # use `housing_data.dtypes` if you only want to check the data types

                 Id   MSSubClass MSZoning  LotFrontage        LotArea Street  \
count   1460.000000  1460.000000     1460  1201.000000    1460.000000   1460   
unique          NaN          NaN        5          NaN            NaN      2   
top             NaN          NaN       RL          NaN            NaN   Pave   
freq            NaN          NaN     1151          NaN            NaN   1454   
mean     730.500000    56.897260      NaN    70.049958   10516.828082    NaN   
std      421.610009    42.300571      NaN    24.284752    9981.264932    NaN   
min        1.000000    20.000000      NaN    21.000000    1300.000000    NaN   
25%      365.750000    20.000000      NaN    59.000000    7553.500000    NaN   
50%      730.500000    50.000000      NaN    69.000000    9478.500000    NaN   
75%     1095.250000    70.000000      NaN    80.000000   11601.500000    NaN   
max     1460.000000   190.000000      NaN   313.000000  215245.000000    NaN   

       Alley LotShape LandContour Utili

In [28]:
# checking the prevalence of missing values
missing_cols = housing_data.columns[housing_data.isnull().sum() > 0]      # filtering for columns with null values
print(missing_cols)

# what percentage of all columns contain missing values?
print(f"\nThe total % of columns with missing values: {len(missing_cols)/housing_data.shape[1] * 100:.2f} %")

Index(['LotFrontage', 'Alley', 'MasVnrType', 'MasVnrArea', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2',
       'Electrical', 'FireplaceQu', 'GarageType', 'GarageYrBlt',
       'GarageFinish', 'GarageQual', 'GarageCond', 'PoolQC', 'Fence',
       'MiscFeature'],
      dtype='object')

The total % of columns with missing values: 23.46 %


After this round of high-level data exploration, we understand that `train.csv` (loaded as the `housing_data` DataFrame) is not 100% perfect: `23.46%` of its columns (19 out of 81) contain missing/null values. As data analysts/scientists exploring this dataset, we should keep this in mind as we proceed with Exploratory Data Analysis. We do not need to start cleaning any data yet, but eventually we will need to find ways to prepare the `housing_data` DataFrame so that we can apply machine learning to it and generate predictions.
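
The figure above only tells us how many columns are affected, not how badly. A possible follow-up, sketched here on a toy DataFrame invented for illustration, is to compute the percentage of missing values per column and flag those above the 50% mark mentioned earlier. The variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data invented for illustration (not taken from train.csv)
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "Alley": [np.nan, np.nan, np.nan, "Grvl"],
    "LotArea": [8450, 9600, 11250, 9550],
})

# isnull() gives True/False per cell; the column mean is the fraction missing
missing_pct = toy.isnull().mean() * 100

# keep only the affected columns, worst first
missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=False)

# columns missing most of their data (> 50%)
mostly_missing = missing_pct[missing_pct > 50].index.tolist()

print(missing_pct)
print(f"Columns missing more than 50% of their values: {mostly_missing}")
```

Running the same lines on `housing_data` would show which of the 19 affected columns deserve the most attention.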

As such, we have answered the following questions in `Section 0`:

        * What is the shape of your data i.e. number of rows and columns?
        * For the numerical columns, what do the distributions look like?
        * What is the name of the column to be predicted?

### 1. Exploratory Data Analysis (EDA)