# Introduction

This is my attempt at [Kaggle's Housing Price Competition](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/overview), where participants use regression techniques to predict housing prices.

To predict housing prices accurately, I will first perform exploratory data analysis (EDA) on the given training dataset. This should reveal trends and patterns that may prove useful when I build the prediction model later on.

More specifically, I will first try to understand the data overall (e.g. the number and variety of predictors and the number of entries), then analyze the dependent variable (the housing price) and the independent variables. Afterwards, I will perform some basic cleaning, such as dealing with outliers and missing data, followed by feature engineering and feature extraction.
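As a rough illustration of what "understanding the data overall" can look like in practice, here is a minimal sketch. The `quick_overview` helper and the toy frame are hypothetical and not part of the competition files; the `Alley` values in particular are made up so that there is something missing to count.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for train.csv.
# The Alley values are invented for illustration only.
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550, 14260],
    "Alley": [np.nan, np.nan, "Grvl", np.nan, np.nan],
    "SalePrice": [208500, 181500, 223500, 140000, 250000],
})

def quick_overview(df, target):
    """Collect a few basic facts worth checking before modelling (hypothetical helper)."""
    return {
        "n_rows": df.shape[0],
        "n_columns": df.shape[1],
        # Share of missing values per column, largest first.
        "missing_share": df.isna().mean().sort_values(ascending=False),
        # Summary statistics of the dependent variable.
        "target_summary": df[target].describe(),
    }

overview = quick_overview(toy, "SalePrice")
print(overview["n_rows"], overview["n_columns"])
print(overview["missing_share"])
print(overview["target_summary"])
```

The same call on the real training data would take `data` and `"SalePrice"` as arguments; columns with a large missing share are the ones I would look at first during cleaning.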

This portion of the project is heavily inspired by [Pedro Marcelino's data exploration of this competition](https://www.kaggle.com/pmarcelino/comprehensive-data-exploration-with-python#COMPREHENSIVE-DATA-EXPLORATION-WITH-PYTHON) as well as [this other Kaggle notebook](https://www.kaggle.com/dgawlik/house-prices-eda/notebook).

Finally, I will create my model and submit my predictions using the provided testing dataset to the Kaggle leaderboard, and report what rank I get.

## Taking a Look at the Data...

In [3]:
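# Data handling and plotting
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Statistical tools and feature scaling
from scipy import stats
from scipy.stats import norm
from sklearn.preprocessing import StandardScaler

# Silence library warnings to keep the notebook output readable
import warnings
warnings.filterwarnings('ignore')

# Render plots inline (IPython/Jupyter magic; only valid inside a notebook)
%matplotlib inline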

In [4]:
data = pd.read_csv('train.csv')

data.head()

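|   | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|----|------------|----------|-------------|---------|--------|-------|----------|-------------|-----------|-----|----------|--------|-------|-------------|---------|--------|--------|----------|---------------|-----------|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |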


In [6]:
data.shape

(1460, 81)

Above, we see that there are 1460 entries with 81 variables. One of these variables (`SalePrice`) is our dependent variable, the value we are trying to predict. Another variable (`Id`) simply marks unique entries and will not help our model predict `SalePrice`.

So, **each entry has 79 independent variables and 1 dependent variable.**
The descriptions for these variables can be found in `data_description.txt`.
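
To make this split concrete, here is a minimal sketch of separating the identifier and the target from the predictors, then grouping the predictors by type. The `sample` frame is a hypothetical miniature built from a few columns of the first rows shown above, and the names `X`, `y`, `numeric_cols` and `categorical_cols` are my own choices for illustration.

```python
import pandas as pd

# Hypothetical miniature of the training data:
# four columns taken from the first three rows shown above.
sample = pd.DataFrame({
    "Id": [1, 2, 3],
    "MSZoning": ["RL", "RL", "RL"],
    "LotArea": [8450, 9600, 11250],
    "SalePrice": [208500, 181500, 223500],
})

# Dependent variable: the value we want to predict.
y = sample["SalePrice"]

# Independent variables: everything except the row identifier and the target.
X = sample.drop(columns=["Id", "SalePrice"])

# Group the predictors by type, since numeric and categorical
# columns usually need different treatment later on.
numeric_cols = X.select_dtypes(include="number").columns.tolist()
categorical_cols = X.select_dtypes(exclude="number").columns.tolist()

print(X.shape)
print(numeric_cols, categorical_cols)
```

Run against the full `data` frame instead of `sample`, the same two lines for `y` and `X` should leave 79 predictor columns, matching the count above.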