#### The dataset is a training set that I got from Kaggle. It consists of many columns describing the attributes of houses in a certain area. The analysis helps the reader understand the variables in the dataset by splitting them into numerical and categorical variables. Some of the data are plotted as graphs for ease of understanding. The final part analyzes the relationships between the variables.


Here's a brief version of what you'll find in the data description file:

    SalePrice - the property's sale price in dollars. This is the target variable that you're trying to predict.
    MSSubClass: The building class
    MSZoning: The general zoning classification
    LotFrontage: Linear feet of street connected to property
    LotArea: Lot size in square feet
    Street: Type of road access
    Alley: Type of alley access
    LotShape: General shape of property
    LandContour: Flatness of the property
    Utilities: Type of utilities available
    LotConfig: Lot configuration
    LandSlope: Slope of property
    Neighborhood: Physical locations within Ames city limits
    Condition1: Proximity to main road or railroad
    Condition2: Proximity to main road or railroad (if a second is present)
    BldgType: Type of dwelling
    HouseStyle: Style of dwelling
    OverallQual: Overall material and finish quality
    OverallCond: Overall condition rating
    YearBuilt: Original construction date
    YearRemodAdd: Remodel date
    RoofStyle: Type of roof
    RoofMatl: Roof material
    Exterior1st: Exterior covering on house
    Exterior2nd: Exterior covering on house (if more than one material)
    MasVnrType: Masonry veneer type
    MasVnrArea: Masonry veneer area in square feet
    ExterQual: Exterior material quality
    ExterCond: Present condition of the material on the exterior
    Foundation: Type of foundation
    BsmtQual: Height of the basement
    BsmtCond: General condition of the basement
    BsmtExposure: Walkout or garden level basement walls
    BsmtFinType1: Quality of basement finished area
    BsmtFinSF1: Type 1 finished square feet
    BsmtFinType2: Quality of second finished area (if present)
    BsmtFinSF2: Type 2 finished square feet
    BsmtUnfSF: Unfinished square feet of basement area
    TotalBsmtSF: Total square feet of basement area
    Heating: Type of heating
    HeatingQC: Heating quality and condition
    CentralAir: Central air conditioning
    Electrical: Electrical system
    1stFlrSF: First Floor square feet
    2ndFlrSF: Second floor square feet
    LowQualFinSF: Low quality finished square feet (all floors)
    GrLivArea: Above grade (ground) living area square feet
    BsmtFullBath: Basement full bathrooms
    BsmtHalfBath: Basement half bathrooms
    FullBath: Full bathrooms above grade
    HalfBath: Half baths above grade
    Bedroom: Number of bedrooms above basement level (column name in the data: BedroomAbvGr)
    Kitchen: Number of kitchens (column name in the data: KitchenAbvGr)
    KitchenQual: Kitchen quality
    TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
    Functional: Home functionality rating
    Fireplaces: Number of fireplaces
    FireplaceQu: Fireplace quality
    GarageType: Garage location
    GarageYrBlt: Year garage was built
    GarageFinish: Interior finish of the garage
    GarageCars: Size of garage in car capacity
    GarageArea: Size of garage in square feet
    GarageQual: Garage quality
    GarageCond: Garage condition
    PavedDrive: Paved driveway
    WoodDeckSF: Wood deck area in square feet
    OpenPorchSF: Open porch area in square feet
    EnclosedPorch: Enclosed porch area in square feet
    3SsnPorch: Three season porch area in square feet
    ScreenPorch: Screen porch area in square feet
    PoolArea: Pool area in square feet
    PoolQC: Pool quality
    Fence: Fence quality
    MiscFeature: Miscellaneous feature not covered in other categories
    MiscVal: $Value of miscellaneous feature
    MoSold: Month Sold
    YrSold: Year Sold
    SaleType: Type of sale
    SaleCondition: Condition of sale


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
housing = pd.read_csv('house_train.csv')

In [None]:
housing.shape

In [None]:
housing.info()

# Objectives

> Understand the variables in the dataset

> Understand how the variables in this dataset relate to the SalePrice of the house

### To understand the variables, we split the data into two categories:

> Numerical variables: sale price, lot area, overall quality and condition ratings, year built, first- and second-floor area, number of bedrooms

> Categorical variables: zoning classification, lot shape, neighborhood, central air conditioning, sale condition, month/year sold
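
As a rough illustration of why the two lists are written by hand, here is a minimal sketch on a toy frame (the values are made up; only the column names mirror the dataset). Splitting by dtype alone puts `MoSold` on the numerical side, because it is stored as an integer even though we treat it as a category.

```python
import pandas as pd

# Toy frame with made-up values; only the column names mirror the dataset.
toy = pd.DataFrame({
    'SalePrice': [208500, 181500, 223500],
    'LotArea': [8450, 9600, 11250],
    'MSZoning': ['RL', 'RL', 'RM'],
    'CentralAir': ['Y', 'Y', 'N'],
    'MoSold': [2, 5, 9],
})

# Split purely by dtype: numeric columns vs. everything else.
numeric_cols = toy.select_dtypes(include='number').columns.tolist()
other_cols = toy.select_dtypes(exclude='number').columns.tolist()

print('numeric dtype:', numeric_cols)
print('other dtype:  ', other_cols)
```

Since the month of sale is a label rather than a quantity, we keep it in the categorical list instead of relying on the dtype.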

In [None]:
numerical_vars = ['SalePrice','LotArea','OverallQual','OverallCond','YearBuilt','1stFlrSF','2ndFlrSF','BedroomAbvGr']
categorical_vars = ['MSZoning','LotShape','Neighborhood','CentralAir','SaleCondition','MoSold','YrSold']

In [None]:
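# keep only the selected columns; .copy() avoids a SettingWithCopyWarning when we add columns later
housing = housing[numerical_vars + categorical_vars].copy()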

In [None]:
housing.shape

### Understand the main variable: SalePrice.

In [None]:
housing['SalePrice'].describe()

In [None]:
housing['SalePrice'].hist(edgecolor='black', bins=20)

In [None]:
# skewness & kurtosis
print('Skewness: {:0.3f}'.format(housing['SalePrice'].skew()))
print('Kurtosis: {:0.3f}'.format(housing['SalePrice'].kurt()))

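From the graph and the statistics, we learn that the sale price is __highly positively skewed and has positive kurtosis__: most houses sell near the lower end of the range, with a long tail of expensive houses.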
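
To get a feel for what these two numbers mean, here is a small sketch on synthetic samples (not the housing data). It compares a symmetric sample with a right-skewed one, using the same `skew()` and `kurt()` calls as above. The sample sizes and distribution parameters are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic samples, not the housing data.
symmetric = pd.Series(rng.normal(loc=0.0, scale=1.0, size=10_000))
right_skewed = pd.Series(rng.lognormal(mean=0.0, sigma=0.75, size=10_000))

for name, s in [('symmetric', symmetric), ('right_skewed', right_skewed)]:
    # pandas reports excess kurtosis, so a normal sample is close to 0
    print('{}: skewness={:0.3f}, kurtosis={:0.3f}'.format(name, s.skew(), s.kurt()))
```

The symmetric sample gives values near zero for both statistics, while the right-skewed sample gives clearly positive values, which is the pattern we see for SalePrice.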

### Understand the Numerical Variables.

In [None]:
housing[numerical_vars].describe()

In [None]:
housing[numerical_vars].hist(edgecolor='black', bins=15, figsize=(14,5), layout=(2,4));

From the histograms, we can see that:

1. The distribution of first-floor sizes is skewed to the right. This is expected: there are a few big houses.
2. There is a big peak at zero in the 2ndFlrSF variable. Those are the houses that don't have a second floor, so we can derive a new variable from this one.
3. Most houses have 3 bedrooms.
4. The lot area is highly skewed: there are a few houses with a very large lot area.
5. The ratings for condition and quality tend to be around 5; few houses have very high or low ratings.
6. The YearBuilt variable is not very useful in its present form. However, we can use it to construct a variable that makes more sense: the age of the house at the time of the sale.
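
For point 2, one possible way to derive the new variable is an indicator for whether a house has a second floor. This is only a sketch on a toy frame with made-up values, and the column name `Has2ndFlr` is hypothetical (it is not part of the dataset).

```python
import pandas as pd

# Toy values for illustration; 0 means the house has no second floor.
toy = pd.DataFrame({'2ndFlrSF': [854, 0, 866, 0, 1053]})

# Hypothetical new column: 1 if the house has a second floor, 0 otherwise.
toy['Has2ndFlr'] = (toy['2ndFlrSF'] > 0).astype(int)

print(toy)
print('Share of houses with a second floor:', toy['Has2ndFlr'].mean())
```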

### Let's check how old each house was when it was sold

In [None]:
housing['Age'] = housing['YrSold'] - housing['YearBuilt']
numerical_vars.remove('YearBuilt')
numerical_vars.append('Age')

In [None]:
housing[numerical_vars].hist(edgecolor='black', bins=15, figsize=(14,5), layout=(2,4));

Now we can clearly see how many houses were new when they were sold: they fall in the leftmost bin of the Age histogram.
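
The histogram only shows this roughly, because each bin spans several years. To count the new houses directly, a sketch like the following could be used. It runs on a toy frame with made-up years, and it assumes that "new" means the house was sold in the same year it was built (Age equal to 0), which is our own working definition.

```python
import pandas as pd

# Toy values for illustration only.
toy = pd.DataFrame({
    'YrSold': [2008, 2007, 2008, 2006, 2010],
    'YearBuilt': [2003, 1976, 2008, 2006, 1915],
})
toy['Age'] = toy['YrSold'] - toy['YearBuilt']

# Assumption: a house is "new" if it was sold in the year it was built.
is_new = toy['Age'] == 0
n_new = int(is_new.sum())
share_new = is_new.mean()

print('New houses: {} ({:.0%} of sales)'.format(n_new, share_new))
```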

### Understanding the Categorical Variables.

In [None]:
housing['SaleCondition'].value_counts().plot(kind='bar', title='SaleCondition')

In [None]:
fig, ax = plt.subplots(2,4, figsize=(14,6))
for var, subplot in zip(categorical_vars, ax.flatten()):
    housing[var].value_counts().plot(kind='bar', ax=subplot, title=var)
    
fig.tight_layout()

Some graphs have too many categories, so we will drop the levels with few observations to make the plots easier to read.
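
Before applying this to the real data, here is the idea on a toy frame (made-up values, and a hypothetical threshold of 2 instead of 30 so that the effect is visible on six rows): count the observations per level, keep the levels that reach the threshold, and drop the rows that belong to the other levels.

```python
import pandas as pd

# Toy values for illustration only.
toy = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'B', 'B', 'C'],
    'SalePrice': [100, 110, 120, 200, 210, 300],
})

min_count = 2  # hypothetical threshold for the toy data; the notebook uses 30

# Count the observations per level and keep the levels that reach the threshold.
counts = toy['Neighborhood'].value_counts()
keep = counts[counts >= min_count].index

# Keep only the rows whose level is in the list.
filtered = toy[toy['Neighborhood'].isin(keep)]
print(filtered)
```

Note that this removes the rows themselves, not just the bars, so every later plot and statistic is computed on the reduced data.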

In [None]:
# return the levels of a categorical variable that have at least 30 observations
def identify_cat_above30(series):
    counts = series.value_counts()
    return list(counts[counts>=30].index)

In [None]:
levels_to_keep = housing[categorical_vars].apply(identify_cat_above30, axis=0)
levels_to_keep

In [None]:
for var in categorical_vars:
    housing = housing.loc[housing[var].isin(levels_to_keep[var])]

In [None]:
# re-plot the data after keeping only the levels with at least 30 observations
fig, ax=plt.subplots(2,4, figsize=(14,6))
for var, subplot in zip(categorical_vars, ax.flatten()):
    housing[var].value_counts().plot(kind='bar', ax=subplot, title=var)
    
fig.tight_layout()

From the graphs, we can see that:

1. Most houses were sold between May and July
2. Most houses have central air conditioning

### Relationships between Numerical Variables

In [None]:
housing.plot.scatter(x='1stFlrSF', y='SalePrice')

In [None]:
sns.jointplot(x='1stFlrSF', y='SalePrice', data=housing, joint_kws={"s":10})

In [None]:
sns.pairplot(housing[numerical_vars[:4]], plot_kws={'s':10});

In [None]:
sns.pairplot(housing[['SalePrice']+numerical_vars[4:]], plot_kws={'s':10});

In [None]:
housing[numerical_vars].corr()

In [None]:
housing[numerical_vars].corr()['SalePrice'].sort_values(ascending=False)

In [None]:
correlations = housing[numerical_vars].corr()

In [None]:
fig, ax = plt.subplots(figsize=(7,5))
sns.heatmap(correlations, ax=ax);

### Relationships between Categorical Variables

In [None]:
sns.boxplot(x='CentralAir', y='SalePrice', data=housing);

In [None]:
fig, ax = plt.subplots(3,3, figsize=(14,9))
for var, subplot in zip(categorical_vars, ax.flatten()):
    sns.boxplot(x=var, y='SalePrice', data=housing, ax=subplot)
    
fig.tight_layout()

In [None]:
fig, ax = plt.subplots(figsize=(14,4))
sns.boxplot(x='Neighborhood', y='SalePrice', data=housing, ax=ax);

In [None]:
sorted_nb = housing.groupby('Neighborhood')['SalePrice'].median().sort_values().index.values

In [None]:
fig, ax = plt.subplots(figsize=(14, 4))
sns.boxplot(x='Neighborhood', y='SalePrice', data=housing, order=sorted_nb, ax=ax)
plt.xticks(rotation='vertical');

### More Complex Plots

In [None]:
conditional_plot = sns.FacetGrid(housing, col='Neighborhood', col_wrap=4)
conditional_plot.map(plt.scatter, 'OverallQual', 'SalePrice');

In [None]:
conditional_plot = sns.FacetGrid(housing, col='YrSold', row='SaleCondition', hue='CentralAir')
conditional_plot.map(plt.scatter, 'Age', 'SalePrice').add_legend();