<br/>
# CSIT 5800 Introduction to Big Data
### Assignment 1 - Data Pre-processing and Exploratory Analysis

### Description
In this assignment, you will have an opportunity to apply the data pre-processing techniques that you learned in class to a problem. In addition, you will do exploratory analysis on the given dataset.

To get started on this assignment, you need to download the given dataset and carefully read the description on this page. Please note that your entire program should be implemented in Python.
<br/>

### Intended Learning Outcomes

- Upon completion of this assignment, you should be able to:
<ol>
    <li>Demonstrate your understanding on how to pre-process data using the algorithms / techniques as described in the class.</li>
    <li>Use simple descriptive statistical approaches to understand your data.</li>
    <li>Construct a Python program to analyse the data and draw simple conclusions from it.</li>
</ol>

### Required Libraries
The following libraries are required for this assignment:
<ol>
    <li>Numpy - Numerical python</li>
    <li>Scipy - Scientific python</li>
    <li>Matplotlib - Python 2D plotting library</li>
    <li>Seaborn - Visualization library based on matplotlib</li>
    <li>Pandas - Python data analysis library</li>
</ol>

### Dataset ~ House Prices (<a href="https://canvas.ust.hk/courses/23028/files/folder/Assignment1">house-train.csv</a>)

This dataset consists of sales prices of houses in Ames, Iowa (<a href="http://www.amstat.org/publications/jse/v19n3/decock.pdf">The Ames Housing Dataset</a>).
The training dataset has 1460 instances with unique Ids, sales prices, and 79 more features.

<ul>
<li>Pricing: Monetary values, one of which is the sales price we are trying to determine<br />
Examples: SalePrice, MiscVal
</li>
<li>Dates: Time-based data about when the house was built, remodeled or sold.<br />
Examples: YearBuilt, YearRemodAdd, GarageYrBlt, YrSold
</li>
<li>Quality/Condition: Categorical assessments of the various features of the houses, most likely from the property assessor.<br />
Examples: PoolQC, SaleCondition, GarageQual, HeatingQC
</li>
<li>Property Features: Categorical collection of additional features and attributes of the building<br />
Examples: Foundation, Exterior1st, BsmtFinType1, Utilities
</li>
<li>Square Footage: Area measurements of sections of the building and of features like porches and lot area (which is in square feet)<br />
Examples: TotalBsmtSF, GrLivArea, GarageArea, PoolArea, LotArea
</li>
<li>Room/Feature Count: Quantitative counts of features (versus categorical) like rooms; prime candidates for feature engineering<br />
Examples: FullBath, BedroomAbvGr, Fireplaces, GarageCars
</li>
<li>Neighborhood: Information about the neighborhood, zoning and lot.<br />
Examples: MSSubClass, LandContour, Neighborhood, BldgType
</li>
</ul>

You may refer to the data description for more details (<a href="https://canvas.ust.hk/courses/23028/files/folder/Assignment1">data_description.txt</a>). 

### Steps:
<ol>
    <li>Importing data and exploring the features.</li>
    <li>Cleaning data: Handling missing values.</li>
    <li>Transforming Categorical data.</li>
    <li>Creating new features and dropping redundant features.</li>
    <li>Analysing data statistically.</li>
    <li>Transforming Numerical Data: Normalization.</li>
</ol>

## Step 1: Importing data and exploring the features

### Step 1.1 
To start working with the House Prices dataset, you will need to import the required libraries, and read the data into a pandas DataFrame.
- Import the following libraries using import statements.
<ul>
    <li>pandas (for data manipulation)</li>
    <li>numpy (for multidimensional array computation)</li>
    <li>seaborn and matplotlib.pyplot (both for data visualization)</li>
</ul>
- Read the CSV file 'house-train.csv' using the pandas read_csv function
(<a href="https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html">pandas.read_csv</a>)
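
As a rough illustration of how read_csv behaves, the sketch below parses a small in-memory CSV instead of the real file. The three rows are made up for the example, and the variable names `csv_text` and `sample` are assumptions, not part of the assignment.

```python
import io
import pandas as pd

# Hypothetical three-row extract; the real file has 1460 rows.
csv_text = """Id,LotArea,Alley,SalePrice
1,8450,NA,208500
2,9600,Grvl,181500
3,11250,NA,223500
"""

# read_csv accepts a file path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)            # (rows, columns)
print(sample["Alley"].isna())  # the string "NA" is parsed as a missing value by default
```

Note that the literal text "NA" is read as a missing value by default, which is why Step 2.1 later has to restore its meaning.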

Note: Run a code cell by clicking on the cell and using the keyboard shortcut &lt;Shift&gt; + &lt;Enter&gt;.

In [None]:
# Put your statements here


### Step 1.2
Use the head function (<a href="https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.head.html">pandas.DataFrame.head</a>) of the pandas library to preview the first 10 rows.

In [None]:
# Put your statement here


### Step 1.3
Use the tail function (<a href="https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.tail.html">pandas.DataFrame.tail</a>) of the pandas library to preview the last 10 rows.

In [None]:
# Put your statement here


### Step 1.4
Display information on the DataFrame using the info function (<a href="https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.info.html">pandas.DataFrame.info</a>) of the pandas library.

In [None]:
# Put your statement here


### Step 1.5

Exploring the data

#### Step 1.5.0
Use select_dtypes function (<a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.select_dtypes.html">pandas.DataFrame.select_dtypes</a>) of pandas library to 
get the features (excluding SalePrice and Id) that are numerical (i.e. not categorical).

In [None]:
numerical_features = trainData.select_dtypes(exclude = ["object"]).columns
numerical_features = numerical_features.drop("SalePrice")
numerical_features = numerical_features.drop("Id")
print("# of numerical features: " + str(len(numerical_features)))
print (numerical_features)
# Run the above code

#### Step 1.5.1
Use select_dtypes function (<a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.select_dtypes.html">pandas.DataFrame.select_dtypes</a>) of pandas library to 
get the features that are categorical.


In [None]:
# Put your statements here


### Step 1.6
Evaluate the data quality & perform missing values assessment using isnull function (<a href="https://pandas.pydata.org/pandas-docs/stable/generated/pandas.isnull.html">pandas.isnull</a>) and sum function (<a href="https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sum.html">pandas.DataFrame.sum</a>) of pandas library.
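
A minimal sketch of the isnull-then-sum pattern on a made-up DataFrame (the values and the name `toy` are assumptions for illustration only): isnull marks each cell as True or False, and sum then counts the True values per column.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, np.nan],
    "Alley": [None, "Grvl", None, None],
    "SalePrice": [208500, 181500, 223500, 140000],
})

# One count of missing cells per column.
missing_counts = toy.isnull().sum()

# Keep only columns that have at least one missing value, largest first.
missing_counts = missing_counts[missing_counts > 0].sort_values(ascending=False)
print(missing_counts)
```

Filtering and sorting the counts is optional, but it tends to make a table with many columns easier to read.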

In [None]:
# Put your statements here


<span style="color:red">What is your observation?</span> 
(Write your observation here.)


### Step 1.7
Evaluate the distribution of categorical features using describe function (<a href="https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.describe.html">pandas.DataFrame.describe</a>) of the pandas library.

Note: if you cannot see all the features, use the command
<pre>
pd.options.display.max_columns = 81
</pre>
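
By default, describe summarises numeric columns only. One possible way to get a summary of the string-valued columns is to exclude the numeric ones, as in this sketch on made-up data (the DataFrame `toy` is an assumption for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "Street": ["Pave", "Pave", "Grvl", "Pave"],
    "LotShape": ["Reg", "IR1", "Reg", "Reg"],
    "LotArea": [8450, 9600, 11250, 9550],
})

# Excluding numeric columns leaves the string columns, which are
# summarised by count, unique, top (most frequent value) and freq.
summary = toy.describe(exclude=["number"])
print(summary)
```

A column whose `freq` is close to `count` is dominated by a single category.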

In [None]:
# Put your statements here


<span style="color:red">What is/are your observation(s)?</span> 
(Write your observation(s) here.)


## Step 2: Cleaning data: Handling missing values

### Step 2.1 Not Really NA Values
According to the data description, NA actually has a particular meaning for many features.<br />
However, the value "NA" is treated as a missing value when the data is loaded into a DataFrame.<br />
We need to replace those values with another value.

#### Step 2.1.0
Considering the feature <strong>Alley</strong>, data description says NA means "no alley access".
Use fillna function (<a href="https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html">pandas.DataFrame.fillna</a>) of pandas library 
to replace those NA values with "None".<br />

In [None]:
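# Assign the result back: fillna(..., inplace=True) on a selected column
# may not update trainData in recent pandas versions.
trainData["Alley"] = trainData["Alley"].fillna("None")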
# Run the above code

#### Step 2.1.1
Similarly, for features:<br />
<strong>    BsmtQual, BsmtCond, BsmtExposure, BsmtFinType1, BsmtFinType2,</strong><br />
the data description says NA for basement features is "no basement".

Use fillna function (<a href="https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html">pandas.DataFrame.fillna</a>) of pandas library 
to replace those NA values with "No".<br />

In [None]:
# Put your statements here


#### Step 2.1.2
Similarly, for features:<br />
<ul>
<li>Fence : data description says NA means "no fence"</li>
<li>FireplaceQu : data description says NA means "no fireplace"</li>
<li>Functional : data description says NA means typical</li>
<li>GarageType etc : data description says NA for garage features is "no garage"</li>
<li>PoolQC : data description says NA for pool quality is "no pool"</li>
<li>MiscFeature: Miscellaneous feature not covered in other categories, NA means no miscellaneous features</li>
</ul>

Use fillna function (<a href="https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html">pandas.DataFrame.fillna</a>) of pandas library 
to replace those NA values with "No".<br />
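
When several columns share the same replacement value, fillna can be applied to all of them at once. The sketch below uses a made-up DataFrame and only two of the columns listed above; the names `toy` and `absent_means_no` are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "Fence": ["MnPrv", None, None],
    "PoolQC": [None, None, "Gd"],
    "LotArea": [8450, 9600, 11250],
})

# Columns where a missing value means "feature absent" (subset chosen for illustration).
absent_means_no = ["Fence", "PoolQC"]

# Assign the result back instead of using inplace=True on a column selection.
toy[absent_means_no] = toy[absent_means_no].fillna("No")
print(toy)
```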

In [None]:
#Put your statements here


### Step 2.2
Besides the features above, there are two other features with missing values: <br />
<strong>MasVnrType</strong> and <strong>MasVnrArea</strong> <br />

#### Step 2.2.0
If we look at those instances with missing values of MasVnrType, we can observe that those instances will also have values of MasVnrArea missing.

In [None]:
tempData = trainData.isnull()[["MasVnrType", "MasVnrArea"]]
tempData.loc[tempData.MasVnrType == True]
# Run the above code

#### Step 2.2.1

First, let's explore the <strong>MasVnrType</strong> feature.
For this feature, evaluate the distribution using countplot function (<a href="https://seaborn.pydata.org/generated/seaborn.countplot.html">seaborn.countplot</a>) of seaborn library.

In [None]:
# Put your statements here


<span style="color:red">What is your observation?</span>(Write your observation here.)




#### Step 2.2.2
Use the most common value of the feature to impute the missing values. Again, fillna function (<a href="https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html">pandas.DataFrame.fillna</a>) of the pandas library can be used.
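
A small sketch of most-common-value imputation on a made-up Series (the values and the name `veneer` are assumptions): mode ignores missing entries and returns a Series, because several values can tie for first place.

```python
import pandas as pd

veneer = pd.Series(["None", "BrkFace", "None", None, "Stone", "None"], name="MasVnrType")

# mode() returns a Series because ties are possible; take the first entry.
most_common = veneer.mode()[0]

# fillna returns a new Series with the missing entries replaced.
filled = veneer.fillna(most_common)
print(most_common)
print(filled.value_counts())
```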

In [None]:
# Put your statement here


#### Step 2.2.3

Then, let's look at the feature: <strong>MasVnrArea</strong>.

Since we have replaced the missing values of MasVnrType with "None", we should look at the values of <del>MasVnrType</del> <strong>MasVnrArea</strong> with respect to those with MasVnrType of "None".

To do this, evaluate the distribution using countplot function (<a href="https://seaborn.pydata.org/generated/seaborn.countplot.html">seaborn.countplot</a>) of seaborn library.

Note: to get those instances whose value of MasVnrType is "None":
<pre>
trainData.loc[trainData.MasVnrType =="None"]
</pre>

In [None]:
# Put your statement here


<span style="color:red">What is your observation?</span> (Write your observation here.)


#### Step 2.2.4
Use the most common value of the feature to impute the missing values. Again, fillna function (<a href="https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html">pandas.DataFrame.fillna</a>) of the pandas library can be used.

In [None]:
# Put your statement here


### Step 2.3

The feature <strong>LotFrontage</strong> also has missing values.

#### Step 2.3.1
For the feature <strong>LotFrontage</strong>, evaluate its distribution using hist function (<a href="https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.hist.html">pandas.DataFrame.hist</a>) of pandas library.

In [None]:
# Put your statements here 


#### Step 2.3.2
Compute the mean OR median of the <del>second least missing values</del> <strong>feature LotFrontage</strong> using mean (<a href="https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.mean.html">pandas.DataFrame.mean</a>) / median function (<a href="https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.median.html">pandas.DataFrame.median</a>) of pandas library.

Note: You have to skip all the missing values when computing the mean or median.
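
In pandas, mean and median skip missing values by default (`skipna=True`). The following sketch on made-up numbers also shows why the choice between the two can matter when a column contains an extreme value; the Series `frontage` is an assumption for illustration.

```python
import numpy as np
import pandas as pd

frontage = pd.Series([60.0, 80.0, np.nan, 70.0, 310.0], name="LotFrontage")

# skipna=True is the default, so the missing value is left out of both statistics.
mean_value = frontage.mean()      # (60 + 80 + 70 + 310) / 4
median_value = frontage.median()  # midpoint of 70 and 80

print(mean_value, median_value)

# The median is less affected by the single very large value.
filled = frontage.fillna(median_value)
```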

In [None]:
# Put your statements here


#### Step 2.3.3
Use mean / median to impute the missing values of the feature <del>with the second least missing values</del> <strong>LotFrontage</strong>. fillna function (<a href="https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html">pandas.DataFrame.fillna</a>) of pandas library can be used.

In [None]:
# Put your statement here


### Step 2.4

Since there is only one missing instance in the feature 'Electrical', we will keep the feature and just delete that instance.

In [None]:
trainData = trainData.drop(trainData.loc[trainData['Electrical'].isnull()].index)
# Run the above code

### Step 2.5

For the last feature with missing values, <strong>GarageYrBlt</strong>, we will just drop this feature
using the drop function (<a href="https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html">pandas.DataFrame.drop</a>) of the pandas library, since there is already the feature 'YearBuilt'.
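
For reference, a minimal sketch of dropping a column by name on a made-up DataFrame (the data and the name `toy` are assumptions for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "YearBuilt": [2003, 1976],
    "GarageYrBlt": [2003.0, None],
    "SalePrice": [208500, 181500],
})

# drop returns a new DataFrame; columns= avoids having to pass axis=1.
toy = toy.drop(columns=["GarageYrBlt"])
print(toy.columns.tolist())
```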

In [None]:
# Put your statement here


## Step 3: Transforming data

### Step 3.1
Using the <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.replace.html">pandas.DataFrame.replace</a> function, transform the following numerical features to categorical.

<ul>
<li>MSSubClass:<br />
20 : "SC20", 30 : "SC30", 40 : "SC40", 45 : "SC45", 
50 : "SC50", 60 : "SC60", 70 : "SC70", 75 : "SC75", 
80 : "SC80", 85 : "SC85", 90 : "SC90", 120 : "SC120", 
150 : "SC150", 160 : "SC160", 180 : "SC180", 190 : "SC190"
</li>
<li>MoSold:<br />
1 : "Jan", 2 : "Feb", 3 : "Mar", 4 : "Apr", 5 : "May", 6 : "Jun", 7 : "Jul", 8 : "Aug", 9 : "Sep", 10 : "Oct", 11 : "Nov", 12 : "Dec"
</li>
</ul>

In [None]:
# Put your statements here


### Step 3.2
Using the <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.replace.html">pandas.DataFrame.replace</a> function, transform the values of the following categorical features to numerical values.

<ul>
<li>
Alley : {"Grvl" : 1, "Pave" : 2},
</li><li>
BsmtCond : {"No" : 0, "Po" : 1, "Fa" : 2, "TA" : 3, "Gd" : 4, "Ex" : 5}
</li><li>
BsmtExposure : {"No" : 0, "Mn" : 1, "Av": 2, "Gd" : 3}
</li><li>
BsmtFinType1 : {"No" : 0, "Unf" : 1, "LwQ": 2, "Rec" : 3, "BLQ" : 4, 
                                         "ALQ" : 5, "GLQ" : 6}
</li><li>
BsmtFinType2 : {"No" : 0, "Unf" : 1, "LwQ": 2, "Rec" : 3, "BLQ" : 4, 
                                         "ALQ" : 5, "GLQ" : 6}
</li><li>
BsmtQual : {"No" : 0, "Po" : 1, "Fa" : 2, "TA": 3, "Gd" : 4, "Ex" : 5}
</li><li>
ExterCond : {"Po" : 1, "Fa" : 2, "TA": 3, "Gd": 4, "Ex" : 5}
</li><li>
ExterQual : {"Po" : 1, "Fa" : 2, "TA": 3, "Gd": 4, "Ex" : 5}
</li><li>
FireplaceQu : {"No" : 0, "Po" : 1, "Fa" : 2, "TA" : 3, "Gd" : 4, "Ex" : 5}
</li><li>
Functional : {"Sal" : 1, "Sev" : 2, "Maj2" : 3, "Maj1" : 4, "Mod": 5, 
                                       "Min2" : 6, "Min1" : 7, "Typ" : 8}
</li><li>
GarageCond : {"No" : 0, "Po" : 1, "Fa" : 2, "TA" : 3, "Gd" : 4, "Ex" : 5}
</li><li>
GarageQual : {"No" : 0, "Po" : 1, "Fa" : 2, "TA" : 3, "Gd" : 4, "Ex" : 5}
</li><li>
HeatingQC : {"Po" : 1, "Fa" : 2, "TA" : 3, "Gd" : 4, "Ex" : 5}
</li><li>
KitchenQual : {"Po" : 1, "Fa" : 2, "TA" : 3, "Gd" : 4, "Ex" : 5}
</li><li>
LandSlope : {"Sev" : 1, "Mod" : 2, "Gtl" : 3}
</li><li>
LotShape : {"IR3" : 1, "IR2" : 2, "IR1" : 3, "Reg" : 4}
</li><li>
PavedDrive : {"N" : 0, "P" : 1, "Y" : 2}
</li><li>
PoolQC : {"No" : 0, "Fa" : 1, "TA" : 2, "Gd" : 3, "Ex" : 4}
</li><li>
Street : {"Grvl" : 1, "Pave" : 2}
</li><li>
Utilities : {"ELO" : 1, "NoSeWa" : 2, "NoSewr" : 3, "AllPub" : 4}
</li>
</ul>
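
DataFrame.replace accepts a nested dictionary, where the outer keys are column names and the inner dictionaries map old values to new ones. The sketch below applies this to a made-up DataFrame with three of the features above; the names `toy` and `mappings` are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "MoSold": [2, 5, 12],
    "ExterQual": ["TA", "Gd", "Ex"],
    "PavedDrive": ["Y", "N", "P"],
})

# Outer keys are column names; inner dicts map old values to new ones.
mappings = {
    "MoSold": {2: "Feb", 5: "May", 12: "Dec"},
    "ExterQual": {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5},
    "PavedDrive": {"N": 0, "P": 1, "Y": 2},
}
toy = toy.replace(mappings)

# Make the numeric dtype explicit, since the resulting dtype can vary by pandas version.
toy["ExterQual"] = toy["ExterQual"].astype(int)
toy["PavedDrive"] = toy["PavedDrive"].astype(int)
print(toy)
```

Values that do not appear as keys in an inner dictionary are left unchanged, so it is worth checking the result for leftover strings.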

In [None]:
# Put your statements here


## Step 4: Creating new features and dropping redundant features

### Step 4.0
We can create new features by combining some existing features.
For example, we can combine GrLivArea with TotalBsmtSF to form a new feature called TotalSF.
- Define a new feature 'TotalSF' and assign it the sum of GrLivArea and TotalBsmtSF.

In [None]:
trainData["TotalSF"] = trainData["GrLivArea"] + trainData["TotalBsmtSF"]
# Run the above code

### Step 4.1
Similarly, we can create more new features by considering the following:
<ul>
<li>
Overall quality of the house: product of OverallQual and OverallCond
</li><li>
Overall quality of the garage: product of GarageQual and GarageCond
</li><li>
Overall quality of the exterior: product of ExterQual and ExterCond
</li><li>
Overall kitchen score: product of KitchenAbvGr and KitchenQual
</li><li>
Overall fireplace score: product of Fireplaces and FireplaceQu
</li><li>
Overall garage score: product of GarageArea and GarageQual
</li><li>
Overall pool score: product of PoolArea and PoolQC
</li><li>
Total number of bathrooms: sum of BsmtFullBath, half of BsmtHalfBath, FullBath and half of HalfBath
</li><li>
Total SF for 1st + 2nd floors: sum of 1stFlrSF and 2ndFlrSF
</li><li>
Total SF for porch: sum of OpenPorchSF, EnclosedPorch, 3SsnPorch, and ScreenPorch
</li>
</ul>
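
Product and weighted-sum features like those listed above can be built with ordinary column arithmetic. This is only a sketch on made-up values; the new column names `OverallGrade` and `TotalBath` are hypothetical and not prescribed by the assignment.

```python
import pandas as pd

toy = pd.DataFrame({
    "OverallQual": [7, 6],
    "OverallCond": [5, 8],
    "FullBath": [2, 1],
    "HalfBath": [1, 0],
    "BsmtFullBath": [1, 0],
    "BsmtHalfBath": [0, 1],
})

# Product of two ordinal scores (hypothetical column name).
toy["OverallGrade"] = toy["OverallQual"] * toy["OverallCond"]

# Weighted sum: each half bath counts as 0.5 (hypothetical column name).
toy["TotalBath"] = (
    toy["BsmtFullBath"] + 0.5 * toy["BsmtHalfBath"]
    + toy["FullBath"] + 0.5 * toy["HalfBath"]
)
print(toy[["OverallGrade", "TotalBath"]])
```

Products involving quality columns only work after those columns have been converted to numbers, as in Step 3.2.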

In [None]:
# Put your statements here


### Step 4.2
- The feature "Id" can be dropped (using drop function (<a href="https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html">pandas.DataFrame.drop</a>) of the pandas library).

In [None]:
# Put your statement here


## Step 5: Analysing data statistically and graphically

### Step 5.1
Use the describe function (<a href="https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.describe.html">pandas.DataFrame.describe</a>) of the pandas library to generate descriptive statistics that summarize the central tendency, dispersion and shape of a dataset's distribution.

In [None]:
# Put your statement here


### Step 5.2
We can explore the correlations between features.

#### Step 5.2.1
Use the <a href="http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html">pandas.DataFrame.corr</a> function to compute the pairwise correlation.
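
A sketch of computing and reading a correlation matrix on made-up data follows. It assumes pandas 1.5 or later, where corr accepts `numeric_only`; the DataFrame `toy` and the column `Age` are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "GrLivArea": [1000, 1500, 2000, 2500],
    "SalePrice": [100000, 150000, 200000, 250000],
    "Age": [40, 30, 20, 10],
    "Street": ["Pave", "Pave", "Grvl", "Pave"],
})

# numeric_only=True skips string columns; recent pandas versions raise an
# error on non-numeric columns otherwise.
corr_matrix = toy.corr(numeric_only=True)

# Rank features by the strength of their correlation with the target.
ranking = (
    corr_matrix["SalePrice"]
    .drop("SalePrice")
    .abs()
    .sort_values(ascending=False)
)
print(corr_matrix)
print(ranking)
```

Taking the absolute value before sorting keeps strongly negative correlations, such as `Age` here, near the top of the ranking.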

In [None]:
# Put your statements here


#### Step 5.2.2
Display the pairwise correlation with a heatmap using <a href="http://seaborn.pydata.org/generated/seaborn.heatmap.html">seaborn.heatmap</a>.

In [None]:
# Put your statements here


<span style="color:red">Give two observations.</span> (Write your observation(s) here.)


### Step 5.3
Using Scatter Plot

#### Step 5.3.1
Explore SalePrice with respect to GrLivArea using scatter function 
(<a href="https://matplotlib.org/api/_as_gen/matplotlib.pyplot.scatter.html">matplotlib.pyplot.scatter</a>) of matplotlib library.

In [None]:
# Put your statements here


<span style="color:red">What is your observation?</span> (Write your observation here.)


<span style="color:red">What can we do?</span> (Write your answer here.)



#### Step 5.3.2
Explore SalePrice with respect to TotalSF using scatter function 
(<a href="https://matplotlib.org/api/_as_gen/matplotlib.pyplot.scatter.html">matplotlib.pyplot.scatter</a>) of matplotlib library.

In [None]:
# Put your statements here


<span style="color:red">What is your observation?</span> (Write your observation here.)



#### Step 5.3.3
Explore SalePrice with respect to YearBuilt using scatter function 
(<a href="https://matplotlib.org/api/_as_gen/matplotlib.pyplot.scatter.html">matplotlib.pyplot.scatter</a>) of matplotlib library.

In [None]:
# Put your statements here


<span style="color:red">What is your observation?</span> (Write your observation here.)



### Step 5.4
Using Count Plot

#### Step 5.4.1
Explore MoSold (Month Sold) using countplot function (<a href="https://seaborn.pydata.org/generated/seaborn.countplot.html">seaborn.countplot</a>) of seaborn library.

In [None]:
# Put your statements here


<span style="color:red">What is your observation?</span> (Write your observation here.)



### Step 5.5
Using Box Plot

#### Step 5.5.1
Explore the new feature OverallQual with respect to SalePrice using boxplot function (<a href="https://seaborn.pydata.org/generated/seaborn.boxplot.html">seaborn.boxplot</a>) of seaborn library.

In [None]:
# Put your statements here


<span style="color:red">What is your observation?</span> (Write your observation here.)



#### Step 5.5.2
Explore the new feature (Total Number of Bathrooms) created in step 4.1 with respect to SalePrice using boxplot function (<a href="https://seaborn.pydata.org/generated/seaborn.boxplot.html">seaborn.boxplot</a>) of seaborn library.

In [None]:
# Put your statements here


<span style="color:red">What is your observation?</span> (Write your observation here.)



#### Step 5.5.3
Explore the Neighborhood feature with respect to SalePrice using the boxplot function (<a href="https://seaborn.pydata.org/generated/seaborn.boxplot.html">seaborn.boxplot</a>) of the seaborn library.

Note: you may want to change the size of the plot using the <a href="https://matplotlib.org/api/_as_gen/matplotlib.pyplot.subplots.html">matplotlib.pyplot.subplots</a> function
<pre>f, ax = plt.subplots(figsize=(26, 12))</pre>
before creating the box plot.

In [None]:
# Put your statements here


<span style="color:red">What is your observation?</span> (Write your observation here.)



## Step 6: Normalization

### Step 6.1 
Explore the distribution of SalePrice using the <a href="https://seaborn.pydata.org/generated/seaborn.distplot.html">seaborn.distplot</a> function.

In [None]:
# Put your statements here


### Step 6.2

We can also get the skewness and kurtosis using 
<a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.skew.html">pandas.DataFrame.skew</a>
and 
<a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.kurt.html">pandas.DataFrame.kurt</a> functions.
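
To get a feel for what these two numbers mean, the sketch below compares a symmetric made-up sample with one that has a long right tail. Both Series are assumptions for illustration and are unrelated to the real SalePrice values.

```python
import pandas as pd

symmetric = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
right_skewed = pd.Series([1.0, 1.0, 2.0, 2.0, 3.0, 50.0])

# Skewness is close to 0 for symmetric data.
print(symmetric.skew())

# A positive skewness indicates a long tail to the right.
print(right_skewed.skew())

# kurt() reports excess kurtosis, so a normal distribution scores about 0
# and heavy tails give positive values.
print(right_skewed.kurt())
```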

In [None]:
# Put your statements here


### Step 6.3

Apply log transformation to SalePrice using <a href="https://docs.scipy.org/doc/numpy/reference/generated/numpy.log.html">numpy.log</a> function.
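
The effect of a log transformation on a right-skewed variable can be sketched with a few made-up prices (the Series `prices` is an assumption for illustration). numpy.log works element-wise on a pandas Series and returns a Series.

```python
import numpy as np
import pandas as pd

prices = pd.Series([100000.0, 120000.0, 130000.0, 150000.0, 180000.0, 755000.0])

# Natural logarithm, applied element-wise.
log_prices = np.log(prices)

# The log compresses large values more than small ones, which pulls in the right tail.
print(prices.skew(), log_prices.skew())

# np.exp undoes the transform when values are needed in the original units.
restored = np.exp(log_prices)
```

numpy.log is undefined for zero and negative values, so this approach only suits strictly positive columns; numpy.log1p is a common alternative when zeros are present.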

In [None]:
# Put your statement here


### Step 6.4

Plot the distribution of SalePrice using seaborn.distplot function again.

In [None]:
# Put your statement here


## Bonus Tasks 

You may consider working on the following:
<ol>
<li>Investigating whether normalization should be performed on any other features.
</li><li>Performing more/other exploratory data analysis to explore other factors contributing to a higher/lower SalePrice.
</li>
</ol>

Note: The bonus tasks are worth at most 10 points, depending on the amount and quality of the work. 
This assignment is worth 100 points. The maximum score for this assignment, including the bonus, is 110 points.
However, the maximum combined score for the 2 assignments is 200.