## About the dataset

Ames Housing dataset, compiled by Dean De Cock (Link: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data)

## Housing Prices visualizations

In [None]:
%pip install pandas matplotlib seaborn numpy scikit-learn scipy

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy.stats import norm
from sklearn.preprocessing import StandardScaler
from scipy import stats
import warnings
import os
import requests
warnings.filterwarnings('ignore')
%matplotlib inline

thePath = "./"
theFile = 'train.csv'
theLink = "https://dse200.dev/Day3/train.csv"

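# Download the training data once; skip if the file is already present
if not os.path.exists(thePath + theFile):
    r = requests.get(theLink)
    r.raise_for_status()  # stop here rather than saving an error page as train.csv
    with open(thePath + theFile, 'wb') as f:
        f.write(r.content)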


### Data Exploration
- Visualize
- Find Missing Data
- Look For Correlations

In [None]:
df = pd.read_csv(thePath + theFile) # loading the Ames data

#### Analysing Sale Price

In [None]:
df['SalePrice'].describe()

In [None]:
# histplot with kde=True replaces the deprecated distplot
sns.histplot(df['SalePrice'], kde=True, stat='density');

We observe that the sale price
- Deviates from the normal distribution.
- Has appreciable positive skewness (a long right tail).
- Shows peakedness (positive excess kurtosis).

In [None]:
print("Skewness: %f" % df['SalePrice'].skew())
print("Kurtosis: %f" % df['SalePrice'].kurt())
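To help read these two numbers: pandas reports the bias-corrected sample skewness and the *excess* kurtosis, so a normally distributed sample scores close to 0 on both. The sketch below illustrates this on synthetic data; the seed, sample size and lognormal parameters are arbitrary assumptions and are not taken from the housing data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily

# Synthetic samples: one symmetric, one right-skewed like many price distributions
normal_sample = pd.Series(rng.normal(loc=0.0, scale=1.0, size=50_000))
skewed_sample = pd.Series(rng.lognormal(mean=0.0, sigma=0.5, size=50_000))

for name, sample in [("normal", normal_sample), ("lognormal", skewed_sample)]:
    # Series.skew() and Series.kurt() are both near 0 for normal data
    print(f"{name:>9}: skew={sample.skew():.2f}, excess kurtosis={sample.kurt():.2f}")
```

A clearly positive skewness together with a clearly positive excess kurtosis, as for the lognormal sample, is the pattern described for 'SalePrice' above.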

### Examples of Relations with Numerical Variables

In [None]:
var = 'GrLivArea'
data = pd.concat([df['SalePrice'], df[var]], axis=1)
data.plot.scatter(x=var, y='SalePrice', ylim=(0,800000));

We notice a roughly linear relationship between 'GrLivArea' and 'SalePrice'.

In [None]:
var = 'TotalBsmtSF'
data = pd.concat([df['SalePrice'], df[var]], axis=1)
data.plot.scatter(x=var, y='SalePrice', ylim=(0,800000));

Again a roughly linear relationship, this time with a steeper slope.

### Examples of Relations with Categorical Variables

In [None]:
var = 'OverallQual'
data = pd.concat([df['SalePrice'], df[var]], axis=1)
f, ax = plt.subplots(figsize=(8, 6))
fig = sns.boxplot(x=var, y="SalePrice", data=data)
fig.axis(ymin=0, ymax=800000);


We see a positive correlation between these two variables: the median sale price rises with the quality rating.
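Since 'OverallQual' is an ordinal rating, one way to put a number on the trend seen in the box plot is a rank correlation. The following is a minimal sketch on synthetic data; the toy frame, the noise level and the choice of Spearman's rho are assumptions made for illustration, not part of the original analysis.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)  # arbitrary seed

# Synthetic stand-in: an ordinal score from 1 to 10 and a noisy price that rises with it
quality = rng.integers(1, 11, size=500)
price = 20_000 * quality + rng.normal(scale=15_000, size=500)
toy = pd.DataFrame({"OverallQual": quality, "SalePrice": price})

# Spearman's rho only uses ranks, which suits an ordinal rating
rho = toy["OverallQual"].corr(toy["SalePrice"], method="spearman")

# Median price per quality level: the numbers behind the box plot's centre lines
medians = toy.groupby("OverallQual")["SalePrice"].median()

print(f"Spearman rho = {rho:.2f}")
print(medians.round(0))
```

In the notebook, the same two calls could be applied to `df` instead of `toy`.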

In [None]:
var = 'YearBuilt'
data = pd.concat([df['SalePrice'], df[var]], axis=1)
f, ax = plt.subplots(figsize=(16, 8))
fig = sns.boxplot(x=var, y="SalePrice", data=data)
fig.axis(ymin=0, ymax=800000);
plt.xticks(rotation=90);

Although it is not a strong one, we still see a positive correlation between these two variables: newer houses tend to sell for more.

### Correlation Matrix

In [None]:
df.head()

In [None]:

# Correlation matrix over all numeric columns
# (numeric_only skips the string columns; needs pandas >= 1.5)
corrmat = df.corr(numeric_only=True)
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corrmat, vmax=.8, square=True);

This heatmap gives an overview of the pairwise relations between the variables.

### We pick "k" columns which are most correlated with Sale Price

In [None]:
k = 10 
cols = corrmat.nlargest(k, 'SalePrice')['SalePrice'].index
cols
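The selection step can be wrapped in a small helper. This is a sketch under stated assumptions: `top_k_correlated` and the toy frame are hypothetical names introduced here, and the ranking uses the signed correlation exactly as `nlargest` does above, so strongly negative correlations are not picked up.

```python
import numpy as np
import pandas as pd

def top_k_correlated(frame: pd.DataFrame, target: str, k: int) -> pd.Index:
    """Return the k numeric columns most correlated with target (target included)."""
    corr = frame.corr(numeric_only=True)  # skips string/categorical columns
    return corr.nlargest(k, target)[target].index

# Tiny synthetic frame standing in for the housing data
rng = np.random.default_rng(1)
area = rng.uniform(500, 3000, size=200)
toy = pd.DataFrame({
    "Area": area,
    "Noise": rng.normal(size=200),
    "Street": ["Pave"] * 200,  # non-numeric, ignored by corr
    "Price": 100 * area + rng.normal(scale=5_000, size=200),
})

print(list(top_k_correlated(toy, "Price", k=2)))
```

The target column always ranks first because its correlation with itself is 1, so asking for `k` columns yields `k - 1` candidate features.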

In [None]:
cmvals = df[cols].values.T
cmvals

In [None]:
cm = np.corrcoef(df[cols].values.T)
sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values)
plt.show()

'OverallQual', 'GrLivArea' and 'TotalBsmtSF' are strongly correlated with 'SalePrice'.<br/>
'GarageCars' and 'GarageArea' carry largely the same information, so we keep only 'GarageCars', whose correlation with 'SalePrice' is higher.<br/>
'TotalBsmtSF' and '1stFlrSF' also represent much the same thing, so we keep only one of them ('TotalBsmtSF').<br/>
'TotRmsAbvGrd' and 'GrLivArea' are highly correlated with each other, as expected.<br/>
TODO: Time series analysis for 'YearBuilt'


In [None]:
sns.set()
cols = ['SalePrice', 'OverallQual', 'GrLivArea', 'GarageCars', 'TotalBsmtSF', 'FullBath', 'YearBuilt']
sns.pairplot(df[cols], height=2.5)  # 'size' was renamed to 'height' in newer seaborn
plt.show()

### Data Cleaning

#### Missing Data

In [None]:
total = df.isnull().sum().sort_values(ascending=False)
percent = (df.isnull().sum()/df.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(20)

Columns with more than 80% of their values missing are chosen for deletion (e.g. 'PoolQC', 'MiscFeature' and 'Alley'). These features also show nearly zero correlation with 'SalePrice'.

'GarageCars' will represent most info regarding garages, hence other 'GarageX' variables can be ignored.

For 'Electrical', we can either fill the missing value with another value or drop the observation. In this case we drop the observation.
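The cleaning rules described above can be sketched as two small helpers. The function names, the toy frame and the default threshold of 0.8 (taken from the 80% rule stated above) are illustrative assumptions; this is not necessarily the exact procedure the notebook ends up applying.

```python
import numpy as np
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Count and fraction of missing values per column, most-missing first."""
    total = frame.isnull().sum()
    percent = frame.isnull().mean()  # same as sum() divided by the number of rows
    summary = pd.concat([total, percent], axis=1, keys=["Total", "Percent"])
    return summary.sort_values("Total", ascending=False)

def drop_sparse_columns(frame: pd.DataFrame, threshold: float = 0.8) -> pd.DataFrame:
    """Drop the columns whose missing fraction exceeds threshold."""
    summary = missing_summary(frame)
    sparse = summary.index[summary["Percent"] > threshold]
    return frame.drop(columns=sparse)

# Toy frame: one mostly-empty column and one column with a single gap
toy = pd.DataFrame({
    "PoolQC": [np.nan] * 9 + ["Gd"],         # 90% missing
    "Electrical": ["SBrkr"] * 9 + [np.nan],  # 10% missing
    "SalePrice": range(10),
})

cleaned = drop_sparse_columns(toy)
cleaned = cleaned.dropna(subset=["Electrical"])  # drop the single incomplete row
print(cleaned.columns.tolist(), len(cleaned))
```

Dropping a column discards information for every row, while dropping a row only costs one observation, which is why the single missing 'Electrical' value is handled row-wise.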

In [None]:
(missing_data[missing_data['Total'] > 1]).index

In [None]:
#df = df.drop(columns=missing_data[missing_data['Total'] > 1].index)
#df = df.drop(df.loc[df['Electrical'].isnull()].index)

#### Outliers

In [None]:
var = 'GrLivArea'
data = pd.concat([df['SalePrice'], df[var]], axis=1)
data.plot.scatter(x=var, y='SalePrice', ylim=(0,800000));

The rightmost observations in 'GrLivArea' seem to be outliers <br/>
The topmost observations in 'SalePrice' seem to follow the trend hence we do not consider them as outliers

In [None]:
# Show the two largest 'GrLivArea' observations, then drop them by Id
print(df.sort_values(by='GrLivArea', ascending=False)[:2][['Id', 'GrLivArea', 'SalePrice']])
df = df.drop(df[df['Id'].isin([1299, 524])].index)
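Instead of hard-coding Ids, the same idea can be written as a rule: flag houses with a very large living area that sold for an unusually low price. The helper below is a sketch; `flag_area_outliers` and both thresholds are hypothetical values chosen for illustration, and they would need to be checked against the scatter plot before use.

```python
import pandas as pd

def flag_area_outliers(frame, area_col="GrLivArea", price_col="SalePrice",
                       min_area=4000, max_price=300_000):
    """Boolean mask: very large living area sold at an unusually low price.

    min_area and max_price are illustrative assumptions, not tuned values.
    """
    return (frame[area_col] > min_area) & (frame[price_col] < max_price)

# Toy frame: Id 2 is large but expensive (follows the trend), Ids 3 and 4 do not
toy = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "GrLivArea": [1500, 4500, 5600, 4300],
    "SalePrice": [180_000, 700_000, 160_000, 185_000],
})

mask = flag_area_outliers(toy)
print(toy.loc[mask, "Id"].tolist())
trimmed = toy.loc[~mask]  # keep everything that was not flagged
```

A rule like this keeps large, expensive houses that follow the trend and removes only the points that break it.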

#### Normality

In [None]:
sns.distplot(df['SalePrice'], fit=norm);
fig = plt.figure()
res = stats.probplot(df['SalePrice'], plot=plt)

<b>Normal Probability Plot : </b>The data are plotted against a theoretical normal distribution in such a way that the points should form an approximate straight line. Departures from this straight line indicate departures from normality.

In the case of positive skewness, a log transformation usually works well.

In [None]:
df['SalePrice'] = np.log(df['SalePrice'])
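To illustrate why the log transform helps, the sketch below applies it to a synthetic right-skewed sample and compares the skewness and the straight-line fit of the normal probability plot before and after. The lognormal parameters and the seed are arbitrary assumptions, not estimates from the housing data.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(42)  # arbitrary seed
prices = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=5_000))  # synthetic, right-skewed

log_prices = np.log(prices)

# probplot returns ((theoretical quantiles, ordered values), (slope, intercept, r));
# r close to 1 means the points lie close to the straight line
(_, _), (_, _, r_raw) = stats.probplot(prices)
(_, _), (_, _, r_log) = stats.probplot(log_prices)

print(f"skew:       raw={prices.skew():.2f}  log={log_prices.skew():.2f}")
print(f"probplot r: raw={r_raw:.4f}  log={r_log:.4f}")

# Values on the log scale (e.g. model predictions) are mapped back with np.exp
restored = np.exp(log_prices)
```

Keep in mind that once 'SalePrice' is stored on the log scale, anything computed from it is also on the log scale until `np.exp` is applied.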

Replotting

In [None]:
sns.distplot(df['SalePrice'], fit=norm);
fig = plt.figure()
res = stats.probplot(df['SalePrice'], plot=plt)

Checking the column 'GrLivArea'

In [None]:
sns.distplot(df['GrLivArea'], fit=norm);
fig = plt.figure()
res = stats.probplot(df['GrLivArea'], plot=plt)

Applying similar transformation

In [None]:
df['GrLivArea'] = np.log(df['GrLivArea'])

In [None]:
sns.distplot(df['GrLivArea'], fit=norm);
fig = plt.figure()
res = stats.probplot(df['GrLivArea'], plot=plt)

Checking the column 'TotalBsmtSF'

In [None]:
sns.distplot(df['TotalBsmtSF'], fit=norm);
fig = plt.figure()
res = stats.probplot(df['TotalBsmtSF'], plot=plt)

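Here we see positive skewness, but quite a few of the values are 0 (houses without a basement), and the log of 0 is undefined. We therefore transform only the non-zero values.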

In [None]:
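# Log-transform only the non-zero values; houses without a basement stay at 0
positive = df['TotalBsmtSF'].where(df['TotalBsmtSF'] > 0)
df['TotalBsmtSF'] = np.log(positive).fillna(0)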
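One possible way to keep track of which rows were left untransformed is an explicit indicator column. The sketch below uses a toy frame; the column name `HasBsmt` is a hypothetical addition and does not exist in the dataset.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"TotalBsmtSF": [0, 856, 0, 1262, 920]})

# Hypothetical indicator column: 1 if the house has a basement, else 0
toy["HasBsmt"] = (toy["TotalBsmtSF"] > 0).astype(int)

# Cast to float first so the log values are not written into an integer column
toy["TotalBsmtSF"] = toy["TotalBsmtSF"].astype(float)

# Transform only the rows with a basement; the zeros are left untouched
has_bsmt = toy["HasBsmt"] == 1
toy.loc[has_bsmt, "TotalBsmtSF"] = np.log(toy.loc[has_bsmt, "TotalBsmtSF"])

print(toy)
```

With the indicator in place, a model can treat "no basement" separately from "small basement" rather than reading the leftover 0 as a log-area.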

In [None]:
sns.distplot(df[df['TotalBsmtSF']>0]['TotalBsmtSF'], fit=norm);
fig = plt.figure()
res = stats.probplot(df[df['TotalBsmtSF']>0]['TotalBsmtSF'], plot=plt)


#### Credits

https://www.kaggle.com/pmarcelino/comprehensive-data-exploration-with-python <br/>
https://www.kaggle.com/ekami66/detailed-exploratory-data-analysis-with-python/notebook