# Storytelling Data Visualization Lab

In this lab you'll use a dataset called `housing_prices.csv` which contains the sales data of houses. The dataset and descriptions of the columns are available from [Kaggle](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data). For your convenience, you can review the descriptions of the data columns from [here](https://drive.google.com/file/d/1-iXooLjNuEXU41dqz8ORQ5JEZPHd9x0X/view?usp=sharing).

Pretend you are a data analyst at an investment company whose board has decided to invest in real estate. Your boss asked you to analyze this housing sales dataset and present to the investment managers **which features of houses are strong indicators of the final sale price**. You need to present your findings in intuitive ways so that the investment managers understand where your conclusions come from.

#### You will use the appropriate data visualization graphs to tell your stories.

## Challenge 1 - Understanding the Dataset

After receiving the data and clarifying your objectives with your boss, you will first try to understand the dataset. This allows you to decide how you will start your research in the next step.

The dataset is [here](https://drive.google.com/file/d/1MRhRtdX8QuPPEhelBIS_FEl5vJjRLSeE/view?usp=sharing). Please download it and place it in the data folder.<br>
First, import the basic libraries and the dataset.


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('../data/housing_prices.csv')

#### As a routine before analyzing a dataset, print the first few rows of the dataset

In [None]:
df.head()

The dataset has 81 columns, which is a lot.

#### Since the column `Id` is meaningless in our data visualization work, let's drop it

In [None]:
# your code here
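# Name the column explicitly: pandas 2.0 no longer accepts a positional axis argument in drop()
df.drop(columns='Id', inplace=True)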

Missing values matter. If a column has too many missing values, it cannot be used reliably to predict the sale price.

#### In the cell below, calculate the percentage of missing values for each column. 

Make a table containing the column name and the percentage of missing values. Print the columns where more than 20% of values are missing. An example of what your output should look like is [here](https://drive.google.com/file/d/1cuq6qhFZC5wavm-_STcxktBKdAc4xvH8/view?usp=sharing).

[This reference](https://stackoverflow.com/questions/51070985/find-out-the-percentage-of-missing-values-in-each-column-in-the-given-dataset) can help you make the missing values table.
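As a minimal sketch of the idea, the snippet below builds the missing-values table on a small made-up frame. The `toy` data and the names `missing_table` and `high_missing` are assumptions for illustration only; in the lab you would apply the same steps to `df`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the housing data (values are made up for illustration).
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550, 14260, 14115, 10084, 10382],
    "Alley": [np.nan, np.nan, "Grvl", np.nan, np.nan, np.nan, "Pave", np.nan],
    "LotFrontage": [65.0, 80.0, 68.0, np.nan, 84.0, 85.0, 75.0, 70.0],
})

# isna() gives a boolean frame; the mean of booleans is the fraction missing.
percent_missing = toy.isna().mean() * 100

# Turn the Series into a two-column table, largest percentage first.
missing_table = (
    percent_missing.rename("percent_missing")
    .rename_axis("column_name")
    .reset_index()
    .sort_values("percent_missing", ascending=False)
)

# Keep only the columns above the 20% threshold used in this lab.
high_missing = missing_table[missing_table["percent_missing"] > 20]
print(high_missing)
```

Taking the mean of the boolean mask is equivalent to dividing the count of missing values by the number of rows.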

In [None]:
# your code here
# Percentage of missing values per column
percent_missing = df.isnull().sum() * 100 / len(df)
missing_value_df = pd.DataFrame({'column_name': df.columns, 'percent_missing': percent_missing})
# Note: sort_values(by=['percent_missing'] > 20) fails because it compares a list with an int.
# Sort first, then filter with a boolean mask.
missing_value_df = missing_value_df.sort_values(by='percent_missing')
missing_value_df = missing_value_df[missing_value_df['percent_missing'] > 20]
missing_value_df


#### Drop the columns you find that have more than 20% missing values.

After dropping, check the shape of your dataframe. You should have 75 columns now.

In [None]:
# your code here
# missing_value_df already lists only the columns with more than 20% missing values
df.drop(columns=missing_value_df['column_name'].tolist(), inplace=True)
df.shape

Since you're asked to analyze sale prices, first let's see if the sale prices (column `SalePrice`) have a normal distribution. This is important because normally distributed data can be better represented with mathematical models.

#### In the cell below, use the appropriate graph to visualize the shape of the distribution of the sale prices. Then explain what the graph tells you about the data distribution.

In [None]:
# your code here
# histplot with kde=True replaces the deprecated sns.distplot
sns.histplot(df['SalePrice'], kde=True)

In [None]:
# your comment here
#The data is not normally distributed: it is strongly right-skewed (positively skewed), with the right tail being significantly longer.


## Bonus Challenge 1 - Adjust Data Distribution

If you used the correct method in the previous step, you should have found that the data distribution is skewed to the right (it has a long right tail). To improve your data visualization in the next steps, you can opt to adjust the `SalePrice` column by applying a mathematical function to the values. The goal is to produce a bell-shaped normal distribution after applying the mathematical function to the sale price.

*This technique is optional in data visualization but you'll find it useful in your future machine learning analysis.*

#### In the cell below, adjust the `SalePrice` column so that the data are normally distributed.

Try applying various mathematical functions such as square root, power, and log to the `SalePrice` column. Visualize the distribution of the adjusted data until you find a function that makes the data normally distributed. **Create a new column called `SalePriceAdjusted` to store the adjusted sale price.**

[This reference](https://trainingdatascience.com/workshops/histograms-and-skewed-data/) shows you examples on how to adjust skewed data.
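One way to compare candidate functions is to look at the skewness each one leaves behind. The sketch below does this on synthetic, log-normally distributed "prices", not on the real `SalePrice` column; the names `prices`, `candidates`, and `skewness` are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "prices" (log-normal draw); not the real SalePrice data.
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=1000), name="SalePrice")

# Candidate transformations to compare.
candidates = {
    "original": prices,
    "sqrt": np.sqrt(prices),
    "log": np.log(prices),
}

# Skewness near 0 suggests a roughly symmetric distribution.
skewness = {name: float(values.skew()) for name, values in candidates.items()}
for name, value in skewness.items():
    print(f"{name:>8}: skew = {value:.2f}")
```

On the real data you would still plot each candidate, since a skewness near zero does not by itself guarantee a bell shape.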

In [None]:
# your code here


## Challenge 2 - Exploring Data with Common Sense

Now that we have a general understanding of the dataset, we start exploring the data with common sense by means of data visualization. Yes, in data analysis and even machine learning you are often required to use common sense. You use your common sense to make a scientific guess (i.e. hypothesis) then use data analytics methods to test your hypothesis.

This dataset is about housing sales. According to common sense, housing prices depend on the following factors:

* **Size of the house** (`GrLivArea`, `LotArea`, and `GarageArea`).

* **Number of rooms** (`BedroomAbvGr`, `KitchenAbvGr`, `FullBath`, `HalfBath`, `BsmtFullBath`, `BsmtHalfBath`).

* **How long ago the house was built or remodeled** (`YearBuilt` and `YearRemodAdd`).

* **Neighborhood of the house** (`Neighborhood`).

#### In this challenge, use the appropriate graph type to visualize the relationships between `SalePrice` (or `SalePriceAdjusted`) and the fields above. 

Note that:

* Transform certain columns in order to visualize the data properly based on common sense. For example:
    * Visualizing how the number of half bathrooms affected the sale price probably does not make sense. You can create a new column to calculate the total number of bathrooms/rooms then visualize with the calculated number.
    * `YearBuilt` and `YearRemodAdd` are year numbers, not the age of the house. You can create two new columns for how long ago the house was built or remodeled, then visualize with the calculated columns.
* Make comments to explain your thinking process.
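The transformations suggested above could be sketched as follows on a few made-up rows. The column names come from the Kaggle data, but the values, the new column names (`TotalBaths`, `AgeAtSale`, `YearsSinceRemodel`), and the choice to count a half bath as 0.5 are all assumptions for illustration.

```python
import pandas as pd

# Toy rows using column names from the Kaggle data; the values are made up.
toy = pd.DataFrame({
    "FullBath": [2, 1, 2],
    "HalfBath": [1, 0, 1],
    "BsmtFullBath": [1, 0, 1],
    "BsmtHalfBath": [0, 1, 0],
    "YearBuilt": [2003, 1976, 1915],
    "YearRemodAdd": [2003, 1976, 1970],
    "YrSold": [2008, 2007, 2006],
})

# Count a half bath as 0.5 (a modelling choice, not something the lab prescribes).
toy["TotalBaths"] = (
    toy["FullBath"] + toy["BsmtFullBath"]
    + 0.5 * (toy["HalfBath"] + toy["BsmtHalfBath"])
)

# Age at the time of sale, rather than age relative to a hard-coded year.
toy["AgeAtSale"] = toy["YrSold"] - toy["YearBuilt"]
toy["YearsSinceRemodel"] = toy["YrSold"] - toy["YearRemodAdd"]

print(toy[["TotalBaths", "AgeAtSale", "YearsSinceRemodel"]])
```

Measuring age against `YrSold` keeps the result independent of when the notebook is run; subtracting from a fixed year is a simpler alternative that shifts every age by a constant per sale year.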

In [None]:
# your code here
# add cells as needed
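# Size: living area + lot area + garage area (square feet)
df['House Size'] = df['GrLivArea'] + df['LotArea'] + df['GarageArea']
# Rooms: all bathrooms (full and half, above ground and basement) plus bedrooms and kitchens
df['Bathrooms'] = df['FullBath'] + df['HalfBath'] + df['BsmtFullBath'] + df['BsmtHalfBath']
df['Rooms Number'] = df['BedroomAbvGr'] + df['KitchenAbvGr'] + df['Bathrooms']
# Age: years since construction and since remodeling, measured from 2020
df['Years Built'] = 2020 - df['YearBuilt']
df['Years Remodeled'] = 2020 - df['YearRemodAdd']
df['Years Built and Remodeled'] = df['Years Built'] + df['Years Remodeled']
df.describe()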

In [None]:
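# .copy() avoids pandas' SettingWithCopyWarning when adding a column to a slice of df
total_area = df[['GrLivArea', 'LotArea', 'GarageArea']].copy()
total_area['Total Area'] = total_area['GrLivArea'] + total_area['LotArea'] + total_area['GarageArea']
total_area.describe()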

In [None]:
#Visualization: Size of the house vs SalePrice
#plt.bar(df['House Size'], df['SalePrice'])
#plt.xlabel('House Size')
#plt.ylabel('Price')
#plt.title('House Size vs Sale price')
#plt.show()
#doesn't look great; values too low 
sns.jointplot(x="House Size", y="SalePrice", data = df, kind="hex", color="k");

In [None]:
sns.jointplot(x="House Size", y="SalePrice", data=df, kind="kde")
# To zoom in, jointplot accepts axis limits, e.g. xlim=(0, 40000) and ylim=(0, 500000);
# these bounds are guesses and should be tuned to the data.

In [None]:
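#Visualization: Number of rooms vs SalePrice
# Note: with one bar per row, bars at the same x overlap, so the visible height
# is the maximum sale price for each room count.
plt.bar(df['Rooms Number'], df['SalePrice'])

plt.xlabel('Rooms Number')
plt.ylabel('Price')
plt.title('Rooms Number vs Sale price')

plt.show()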

In [None]:
#Visualization: How long ago the house was built or remodeled vs SalePrice
# As above, overlapping bars show the maximum sale price for each x value.
plt.bar(df['Years Built and Remodeled'], df['SalePrice'])

plt.xlabel('Years Built and Remodeled')
plt.ylabel('Price')
plt.title('Years Built and Remodeled vs Sale price')

plt.show()


In [None]:
#Visualization: Neighborhood of the house (Neighborhood) vs SalePrice
# A vertical violin plot with hue="Neighborhood" looked cluttered,
# so flip the axes and enlarge the figure for readability.
plt.figure(figsize=(10, 12))  # violinplot has no figsize argument; size the figure instead
sns.violinplot(y="Neighborhood", x="SalePrice", data=df, palette="muted")
plt.show()
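
One option that may make a per-neighborhood plot easier to read is to sort the neighborhoods by median sale price. The sketch below computes such an ordering on made-up prices; the `toy` frame and the names `median_price` and `order` are assumptions for illustration.

```python
import pandas as pd

# Made-up prices for three neighborhoods; only the labels come from the dataset.
toy = pd.DataFrame({
    "Neighborhood": ["NoRidge", "NoRidge", "Meadow", "Meadow", "NAmes", "NAmes"],
    "SalePrice": [330000, 290000, 95000, 105000, 140000, 150000],
})

# Median sale price per neighborhood, highest first.
median_price = (
    toy.groupby("Neighborhood")["SalePrice"].median().sort_values(ascending=False)
)

# List of neighborhood labels in ranked order.
order = median_price.index.tolist()
print(median_price)
```

The resulting list can be passed as the `order` argument of `sns.violinplot` so that the violins appear from most to least expensive.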

## Bonus Challenge 2 - Exploring Data with Correlation Heatmap

You have now explored visualizations of fields chosen by common sense. The dataset contains many other fields, and you do not yet know whether they are important factors for the sale price. What is the best way to explore those fields without investigating them individually?

Making a scatter matrix is not an option here because there are too many fields, which makes it extremely time-consuming to create. One option you have is to create a heatmap. Heatmaps are much less expensive to create than scatter matrices. You can use a heatmap to visualize the pairwise correlations between variables.

Here is a [reference](https://seaborn.pydata.org/examples/many_pairwise_correlations.html) you can use to learn how to create the pairwise correlation heatmap. Your heatmap should look like this [example](https://drive.google.com/file/d/1JhdNvbAnnWDFXEtDoBtx3B2KKIkqsnSH/view?usp=sharing).
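
Alongside the heatmap, it can help to rank the features numerically by their correlation with the target. The sketch below does this on synthetic data: `HouseAge` and `Noise` are hypothetical columns, and the relationship between the columns is invented for the example.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; only some column names mirror the housing dataset.
rng = np.random.default_rng(1)
n = 200
area = rng.normal(1500, 400, n)
age = rng.uniform(0, 100, n)
toy = pd.DataFrame({
    "GrLivArea": area,
    "HouseAge": age,  # hypothetical derived column
    "Noise": rng.normal(0, 1, n),
    "Neighborhood": rng.choice(["NAmes", "OldTown"], n),  # non-numeric, excluded below
    "SalePrice": 100 * area - 500 * age + rng.normal(0, 10000, n),
})

# Correlations are only defined for numeric columns.
corr = toy.select_dtypes(include="number").corr()

# Rank features by the absolute strength of their correlation with SalePrice.
target = corr["SalePrice"].drop("SalePrice")
ranked = target.reindex(target.abs().sort_values(ascending=False).index)
print(ranked)
```

Sorting by absolute value keeps strong negative correlations near the top instead of burying them at the bottom of the list.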

In [None]:
# your code here
sns.set(style="white")

# Compute the correlation matrix (numeric columns only; text columns have no correlation)
corr = df.select_dtypes(include='number').corr()

# Generate a mask for the upper triangle
# (np.bool was removed in NumPy 1.24; the built-in bool works in all versions)
mask = np.triu(np.ones_like(corr, dtype=bool))

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

In your heatmap, you can easily identify the highly correlated (either positively or negatively) variables by looking for the cells with darker colors.

#### In the cell below, summarize which variables are highly correlated with the sale price.

In [None]:
# your comment here 
#Positively correlated with SalePrice: LotArea; YearBuilt; YearRemodAdd; FullBath; the garage cluster (GarageYrBlt, GarageCars, GarageArea); Bathrooms

## Challenge 3 - Present Your Stories

Now based on your findings from the explorations, summarize and present your stories.

#### Present the top 5 factors that affect the sale price.

Use the following format to present each factor:

1. A title line about the factor.

1. No more than 3 sentences to describe the relationship between the factor and the sale price.

1. Support your point with the appropriate graph.

In [None]:
# your responses here
# add cells as needed

# Factor 1: Rooms Number 

Plotted against the number of rooms, the sale price follows a roughly bell-shaped but skewed profile. It peaks at 10 rooms and drops at 12 and 14 rooms, and 6-room houses show a higher price than 8-room houses.
Reference graph: bar plot, Challenge 2.

# Factor 2: House Size

House size is positively correlated with price, in an almost purely linear fashion. This is intuitive, and the hexbin and kernel density plots in Challenge 2 illustrate it.

# Factor 3 & 4: Year built; Year Remodelled

Price decreases as the time since the house was built or remodelled increases, except for peaks at about 50 and 160 years.
These peaks may reflect specific construction properties or architectural preferences of those periods, as opposed to the general preference for new builds. Graph reference: bar graph, Challenge 2; heatmap, Bonus Challenge 2.

# Factor 5: Neighbourhood 

Sale price seems to be influenced by neighbourhood, both in the range between minimum and maximum values and in the distribution.

* The highest-priced neighbourhoods are "NoRidge", "NridgHt", and "StoneBr".
* The lowest-priced neighbourhoods are "NPkVill", "BrDale", and "Meadow".

Graph reference: violin plot, Challenge 2.