## Regression of real estate data
For this problem, you will analyze some real estate data. The dataset contains multiple listing service (MLS) real estate transactions for houses sold in 2015-16. We are primarily interested in regressing `SoldPrice` on the house attributes (`property size`, `house size`, `number of bedrooms`, etc.).

Tasks 2.1-2.3 cover the exploratory data analysis (EDA) of the data and are therefore not graded. However, they are an important part of the process. The goal is to work on these tasks either as a group or individually and to share good advice with each other on how to proceed. In the process, we expect that you develop some intuition about the data and explanations 


### Task 2.1: Import the data (already done :))
Use the [`pandas.read_csv()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) function to import the dataset (`houses.csv`). This pandas dataframe will be used for data exploration and linear regression. 

In [4]:
# imports and setup 
import pandas as pd
import scipy as sc
import numpy as np

import statsmodels.formula.api as sm

#%matplotlib notebook
import matplotlib.pyplot as plt 
plt.style.use('ggplot')
%matplotlib inline  
plt.rcParams['figure.figsize'] = (10, 6) 

In [8]:
h = pd.read_csv('housespp.csv',index_col=0) #load data
print(h.shape)

(348, 21)


### Task 2.2: Clean the data 

1. There are 21 different variables associated with each of the 348 houses in this dataset. Skim them and try to get a rough understanding of what information this dataset contains. Here is an explanation of the variables.

'Access': status of the road to the property (asphalt, concrete etc.)<br>
'Acres': total area of the property in acres (according to Wikipedia, "an acre may be declared as exactly 4,046.8564224 square metres")<br>
'AirType': air-conditioning provider type (e.g. central, electric etc.)<br>
'Amenities': extra things available (e.g. cable tv, etc.) <br>
'DaysOnMkt': days that the property stayed on the market <br>
'Deck': how many floors <br>
'GaragCap': # of parking spots in the garage <br>
'Heat': heating type (e.g. electric, gas, etc.) <br>
'Latitude': location info (hint: we are in the USA but where?) <br>
'Longitude': location info (hint: we are in the USA but where?) <br>
'LstPrice': listed price <br>
'Patio': # of patios <br>
'PkgSpacs': # of parking spaces <br>
'PropType': property type (condo, single family, townhouse, ...) <br>
'Roof': type of roof (flat, asphalt, etc.) <br>
'SoldPrice': actual price that listing was sold <br>
'Taxes': taxes paid <br>
'TotBed': # of bedrooms <br>
'TotBth': # of bathrooms <br>
'TotSqf': area of the house in square feet <br>
'YearBlt': year built <br>

+ Only keep houses with List Price between 200,000 and 1,000,000 dollars. This is an arbitrary choice and we realize that some people are high rollers, but for our purposes we'll consider the others as outliers. 

+ As a minimal and required step, we are going to keep the following columns. However, for the last task (2.6) you are free to explore how other attributes affect the performance as well. But don't go crazy and avoid overfitting!

`['Acres', 'Deck', 'GaragCap', 'Latitude', 'Longitude', 'LstPrice', 'Patio', 'PkgSpacs', 'PropType', 'SoldPrice', 'Taxes', 'TotBed', 'TotBth', 'TotSqf', 'YearBlt']` 

+ Check the datatypes and convert any numbers that were read as strings to numerical values. (Hint: You can use [`str.replace()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.replace.html) to work with strings.) If there are any categorical values you're interested in, include them as you see fit. In particular, convert 'TotSqf' to an integer and add a column titled `Prop_Type_num` that is 
$$
\text{Prop_Type_num}_i = \begin{cases} 
0 & \text{if $i$-th listing is a condo or townhouse} \\
1 & \text{if $i$-th listing is a single family house}
\end{cases}. 
$$
+ Remove the listings with erroneous `Longitude` (one has Longitude = 0) and `Taxes` values (two have unreasonably large values).
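The cleaning steps above can be sketched on a small made-up table. This is only an illustration under stated assumptions: the toy values are invented, the label `"Single Family"` is a guess at how the category is spelled in the real `PropType` column, and the price filter is taken to be inclusive at both ends.

```python
import pandas as pd

# Toy stand-in for the real data; every value below is made up for illustration.
toy = pd.DataFrame({
    "LstPrice": [150000, 450000, 1200000, 600000, 350000],
    "SoldPrice": [148000, 440000, 1150000, 590000, 345000],
    "TotSqf": ["1,200", "2,450", "5,000", "3,100", "1,800"],
    # Assumed spelling of the categories; check the real column with .unique().
    "PropType": ["Condo", "Single Family", "Single Family", "Townhouse", "Single Family"],
    "Longitude": [-111.9, -111.8, -111.7, -111.85, 0.0],
})

# Keep listings with a list price between 200,000 and 1,000,000 dollars (inclusive).
clean = toy[toy["LstPrice"].between(200_000, 1_000_000)].copy()

# Strip thousands separators, then convert the strings to integers.
clean["TotSqf"] = clean["TotSqf"].str.replace(",", "", regex=False).astype(int)

# 1 for single family houses, 0 for condos and townhouses.
clean["Prop_Type_num"] = (clean["PropType"] == "Single Family").astype(int)

# Drop rows with an obviously invalid longitude.
clean = clean[clean["Longitude"] != 0]

print(clean.dtypes)
```

On the real data, `h.dtypes` shows which columns were read as strings, and the threshold for "unreasonably large" `Taxes` values is a judgment call best made after looking at the distribution.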

In [20]:
# your code goes here

### Task 2.3: Exploratory data analysis 

1. Explore the dataset. Write a short description of the dataset describing the number of items, the number of variables and check to see if the values are reasonable. 

+ Make a bar chart showing the breakdown of the different types of houses (single family, townhouse, condo). 

+ Compute the correlation matrix and use a heat map to visualize the correlation coefficients. 
    - Use a diverging color scale from -1 to +1 (see `vmin` and `vmax` parameters for [pcolor](https://matplotlib.org/devdocs/api/_as_gen/matplotlib.pyplot.pcolor.html))
    - Show a legend
    - Make sure the proper labels are visible and readable (see [`xticks`](https://matplotlib.org/devdocs/api/_as_gen/matplotlib.pyplot.xticks.html) and the corresponding [`yticks`](https://matplotlib.org/devdocs/api/_as_gen/matplotlib.pyplot.yticks.html)).

+ Make a scatter plot matrix to visualize the correlations. Color-code the dots by property type. For the plot, only use a subset of the columns: `['Acres', 'LstPrice', 'SoldPrice', 'Taxes', 'TotBed', 'TotBth', 'TotSqf', 'YearBlt']`. Determine which columns have strong correlations. 

+ Discuss your findings with each other and share useful insights for the modeling part (to follow).
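One possible way to draw the correlation heat map is sketched below on synthetic data. The toy columns and the `RdBu_r` colormap are assumptions chosen for illustration; any diverging colormap would serve, and the real column list comes from Task 2.2.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in: price depends on size, year built is unrelated noise.
rng = np.random.default_rng(0)
n = 100
sqf = rng.normal(2000, 500, n)
toy = pd.DataFrame({
    "TotSqf": sqf,
    "SoldPrice": 150 * sqf + rng.normal(0, 30000, n),
    "YearBlt": rng.integers(1950, 2016, n),
})

corr = toy.corr()  # Pearson correlation matrix

fig, ax = plt.subplots()
# Diverging color scale pinned to [-1, 1] so that zero maps to the neutral color.
mesh = ax.pcolor(corr.values, cmap="RdBu_r", vmin=-1, vmax=1)
fig.colorbar(mesh, ax=ax)  # the legend for the color scale

# pcolor draws cells between integer coordinates, so ticks sit at the cell centers.
ticks = np.arange(len(corr.columns)) + 0.5
ax.set_xticks(ticks)
ax.set_xticklabels(corr.columns, rotation=90)
ax.set_yticks(ticks)
ax.set_yticklabels(corr.columns)
```

For the scatter plot matrix, `pandas.plotting.scatter_matrix` accepts a `c` argument that is passed through to the scatter call, which is one way to color-code by property type.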


In [None]:
# your code goes here


**Your Interpretation:** TODO

## DELIVERABLES (DEADLINE 5/March late night, wildcards possible)

Honor code applies from these tasks onwards (only individual work)

Instructions for the deliverable: 

* The tasks that are graded are 2.4-2.6. However, include your work from tasks 2.1-2.3. Although it is not graded, it is important to include any preprocessing steps you have done, any decisions you made, etc.

* Make sure that you include a proper amount/mix of comments, results and code.

* In the end, make sure that all cells are executed properly and everything you need to show is in your (executed) notebook.

* You are asked to deliver **only your executed notebook file, .ipynb** and nothing else. Enjoy!

### Task 2.4: Simple Linear Regression 
Use the `ols` function from the [statsmodels](http://www.statsmodels.org/stable/index.html) package to regress the Sold price on some of the other variables (feel free to include all of them; however, your work here should be based on the EDA you have done). Your model should be of the form:
$$
\text{Sold Price} = \beta_0 + \beta_1 x, 
$$
where $x$ is one of the other variables. 

You'll find that the best predictor of sold price is the list price. Report the $R^2$ value for this model (`SoldPrice ~ LstPrice`) and give an interpretation for its meaning. Also give an interpretation of $\beta_1$ for this model. Make a scatterplot of list price vs. sold price and overlay the prediction coming from your regression model.
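As a rough illustration of the mechanics only, the sketch below fits a single-predictor model with the formula interface on synthetic data. The generated prices and the 0.97 slope are made up; the alias `sm` follows the import used at the top of this notebook.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as sm

# Synthetic data: sold price is roughly 97% of list price plus noise (made-up numbers).
rng = np.random.default_rng(1)
lst = rng.uniform(200_000, 1_000_000, 200)
toy = pd.DataFrame({
    "LstPrice": lst,
    "SoldPrice": 0.97 * lst + rng.normal(0, 15_000, 200),
})

fit = sm.ols("SoldPrice ~ LstPrice", data=toy).fit()

r_squared = fit.rsquared            # share of the variance in SoldPrice explained by the model
slope = fit.params["LstPrice"]      # beta_1: expected change in SoldPrice per extra dollar of LstPrice
intercept = fit.params["Intercept"] # beta_0

# Fitted values for overlaying the regression line on a scatter plot.
toy["predicted"] = fit.predict(toy)
print(fit.summary())
```

Plotting `toy["LstPrice"]` against `toy["predicted"]` as a line on top of the scatter plot gives the requested overlay.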

Let's put categorical variables into play! We will distinguish between single family houses on the one hand and townhouses and condos on the other hand (so using the `Prop_Type_num` variable you constructed in 2.2). Consider the two regression models: 
$$
\text{SoldPrice} = \beta_0 + \beta_1 \times \text{Prop_Type_num}
$$
and 
$$
\text{SoldPrice} = \beta_0  + \beta_1 \times \text{Prop_Type_num} + \beta_2 \times \text{TotSqf}
$$

What happens with the significance of the `Prop_Type_num` coefficient when we consider the first and the second model? How do you explain this (hint: confounders)? Make a scatterplot of `TotSqf` vs. `SoldPrice` where the house types are colored differently to illustrate your explanation.
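The effect of a confounder can be seen in a small simulation. In the sketch below, which uses entirely made-up numbers, the price depends only on the size, but single family houses tend to be larger. The property type therefore looks important on its own and loses most of its apparent effect once the size is in the model.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as sm

rng = np.random.default_rng(2)
n = 300
is_house = rng.integers(0, 2, n)                      # 1 = single family, 0 = condo/townhouse
sqf = 1500 + 1000 * is_house + rng.normal(0, 400, n)  # houses are larger on average
price = 150 * sqf + rng.normal(0, 30_000, n)          # price depends on size only
toy = pd.DataFrame({"Prop_Type_num": is_house, "TotSqf": sqf, "SoldPrice": price})

# Model 1: property type alone picks up the size effect.
m1 = sm.ols("SoldPrice ~ Prop_Type_num", data=toy).fit()
# Model 2: once size is controlled for, little is left for property type to explain.
m2 = sm.ols("SoldPrice ~ Prop_Type_num + TotSqf", data=toy).fit()

coef1 = m1.params["Prop_Type_num"]
coef2 = m2.params["Prop_Type_num"]
print("model 1:", coef1, "p =", m1.pvalues["Prop_Type_num"])
print("model 2:", coef2, "p =", m2.pvalues["Prop_Type_num"])
```

Whether the real data behaves the same way is exactly what the task asks you to check.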

REMARK: For part 2.4 you do not need to apply cross-validation or regularization or a more complex model (that comes in part 2.6)

In [None]:
# Your code here


**Your Interpretation:** TODO

### Task 2.5: Multilinear Regression 
Develop a multilinear regression model for house prices in this neighborhood. We could use this to come up with a list price for houses coming on the market, so do not include the list price in your model and, for now, ignore the categorical variable `Prop_Type` (or `Prop_Type_num`). Your model should be of the form:
$$
\text{Sold Price} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots +  \beta_n x_n, 
$$
where $x_i$ are predictive variables. Which variables are the best predictors for the Sold Price? 

Specific questions (feel free to extend the scope of your analysis):
1. Often the price per square foot for a house is advertised. Is this what the coefficient for `TotSqf` is measuring? Provide an interpretation for the coefficient for `TotSqf`.  
+ Estimate the value that each Garage space adds to a house. 
+ Does latitude or longitude have an impact on house price? Explain. 

REMARK: For part 2.5 you do not need to apply cross-validation or regularization or a more complex model (that comes in part 2.6)
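A minimal sketch of a multilinear fit follows, again on synthetic data. The coefficients used to generate the data (120 dollars per square foot, 15,000 dollars per garage space) are invented so that the estimates can be compared with known values; the choice of predictors is illustrative only.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as sm

rng = np.random.default_rng(3)
n = 300
toy = pd.DataFrame({
    "TotSqf": rng.normal(2000, 500, n),
    "GaragCap": rng.integers(0, 4, n),
    "TotBed": rng.integers(1, 6, n),
})
# Made-up data-generating process; TotBed has no effect here by construction.
toy["SoldPrice"] = (50_000 + 120 * toy["TotSqf"] + 15_000 * toy["GaragCap"]
                    + rng.normal(0, 20_000, n))

fit = sm.ols("SoldPrice ~ TotSqf + GaragCap + TotBed", data=toy).fit()

# Each coefficient is the expected change in SoldPrice for a one-unit change
# in that predictor while the other predictors are held fixed.
per_sqf = fit.params["TotSqf"]
per_garage = fit.params["GaragCap"]
print(fit.params)
print(fit.pvalues)
```

Note that the `TotSqf` coefficient is a marginal effect with the other predictors held fixed and with an intercept in the model, so it need not match an advertised price per square foot, which is simply price divided by area.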

In [None]:
# your code goes here


**Your Interpretation:** TODO

### Task 2.6: Deliver a robust model, where you have included an analysis of all variables etc.

If we wanted to start a 'house flipping' company, we'd have to be able to do a better job of predicting the sold price than the list price does. How does your model compare?

Based on the exploration in the tasks above, build and deliver a robust model for predicting the sold price. As a minimal and required step here, you need to use cross-validation and regularization and demonstrate their effect on the model.
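One way to combine cross-validation with regularization is sketched below. It assumes scikit-learn is available, which this notebook does not import, and uses ridge regression on synthetic data; lasso or elastic net would be equally valid choices, and the grid of `alpha` values is arbitrary.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem: two of the five predictors are pure noise.
rng = np.random.default_rng(4)
n, p = 200, 5
X = rng.normal(size=(n, p))
true_coef = np.array([3.0, -2.0, 0.0, 0.0, 1.0])
y = X @ true_coef + rng.normal(scale=1.0, size=n)

cv = KFold(n_splits=5, shuffle=True, random_state=0)

mean_r2 = {}
for alpha in [0.01, 1.0, 100.0, 10000.0]:
    # Scaling inside the pipeline keeps test-fold statistics out of the training step.
    model = make_pipeline(StandardScaler(), Ridge(alpha=alpha))
    scores = cross_val_score(model, X, y, cv=cv, scoring="r2")
    mean_r2[alpha] = scores.mean()

best_alpha = max(mean_r2, key=mean_r2.get)
print(mean_r2)
print("best alpha:", best_alpha)
```

Comparing the cross-validated score across `alpha` values is one way to demonstrate the effect of regularization: too much shrinkage hurts, while a moderate amount can help when predictors are many or correlated.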

Once you have such a model, you are free to explore any other models you want (also beyond the scope of the course); however, that is not necessary. You are not going to be judged on the performance of your model, but on the methodology you followed to build your model and the interpretation of the results.

**Your Interpretation:** TODO