# Experimental Write Up: Predicting Housing Prices

## I. Project Problem & Hypothesis

#### What's the project about? What problem are you solving?
This project is about predicting the value (price) of a home based on the home's characteristics.  The housing market plays a vital role in the health of our economy.  People and companies invest in property all the time for financial gain, but the challenge is forecasting the return on an investment, which requires accurately predicting the final sale price of a home.
* Problem Statement:  Can we predict the price at which a home will sell on the market given its characteristics?
* Hypothesis: Housing units with more than one bedroom and a pool are worth more and will sell at a much higher price than single bedroom homes without a pool.
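
As a first, informal check of the hypothesis, one could compare median sale prices for the two groups it names. The sketch below is only illustrative: it assumes a bedroom-count column called `BedroomAbvGr` (an assumed name, not shown in the preview of the data), treats `PoolArea > 0` as "has a pool", and runs on a tiny made-up frame rather than the Ames records.

```python
import pandas as pd


def hypothesis_group_medians(df):
    """Median SalePrice for the two groups named in the hypothesis."""
    # Assumption: PoolArea > 0 means the home has a pool.
    has_pool = df["PoolArea"] > 0
    # Assumption: BedroomAbvGr holds the bedroom count (hypothetical column name).
    multi_bed = df["BedroomAbvGr"] > 1
    single_bed = df["BedroomAbvGr"] == 1
    return pd.Series({
        "multi_bed_with_pool": df.loc[multi_bed & has_pool, "SalePrice"].median(),
        "single_bed_no_pool": df.loc[single_bed & ~has_pool, "SalePrice"].median(),
    })


# Tiny made-up frame, for illustration only (not real Ames records).
toy = pd.DataFrame({
    "BedroomAbvGr": [1, 3, 4, 2, 1],
    "PoolArea": [0, 512, 0, 480, 0],
    "SalePrice": [100000, 250000, 180000, 230000, 120000],
})
medians = hypothesis_group_medians(toy)
print(medians)
```

A gap between the two medians would be suggestive only; it does not control for size, location, or any other characteristic.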

####  Where does this seem to reside as a machine learning problem? Are you predicting some continuous number, or predicting a binary value?
With regard to machine learning, this is a supervised learning problem.  The goal is to build a regression model from a given set of predictors (independent variables), since the response (dependent) variable, the outcome, is continuous.  We will be predicting the market value (price) of a housing unit for one particular region of the United States.  The price of the home is our output and the characteristics of the home are our predictors.  With this regression analysis I will generate an equation that describes the statistical relationship between one or more predictor variables and the response variable.
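
To make the idea of "an equation relating predictors to the response" concrete, here is a minimal sketch that fits an ordinary least squares line with NumPy. Everything in it is an assumption for illustration: the helper `fit_ols`, the two predictors, and the synthetic, noise-free prices. It is not the model used in this project.

```python
import numpy as np


def fit_ols(X, y):
    """Return (intercept, coefficients) that minimize squared error."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so the first fitted value is the intercept.
    design = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta[0], beta[1:]


# Synthetic data: price = 50,000 + 100 * living_area + 10,000 * bedrooms
rng = np.random.default_rng(0)
living_area = rng.uniform(800, 3000, size=50)
bedrooms = rng.integers(1, 6, size=50)
price = 50_000 + 100 * living_area + 10_000 * bedrooms

intercept, coefs = fit_ols(np.column_stack([living_area, bedrooms]), price)
print(intercept, coefs)
```

Because the synthetic prices contain no noise, the fitted intercept and coefficients should match the values used to generate them, up to floating-point error.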

#### What kind of impact do you think it could have?
As I mentioned before, the housing market has a large impact on the economy; it directly affects home builders, the mortgage market, real estate firms, investment banks, home supply retail outlets, etc. The housing bubble that started in early 2006 affected a majority of states in America. Overvaluation of houses resulted in increased foreclosures and a credit crisis, leading to high and prolonged unemployment rates.  I think that proper valuation of the housing market can help avoid the onset of a housing bubble.  Lending companies, banks, and individual home owners can all benefit from proper valuation of homes.  In particular, I see value for individuals who invest in homes and sell them for a profit: they would be able to predict the price of a home accurately before it is sold.

#### What do you think will have the most impact in predicting the value you are interested in solving for?
In this project I am analyzing a single city, Ames, Iowa.  It is important to note that I am not looking at data from all states, just one region within Iowa.  It is also important to consider the demographics of the population in Ames, which are likely to be very different from those of a more urban city.  Given these factors, I think that variables relating to the size of a home will have the most impact on predicting the value of the property, more so than variables relating to the type of home.

## II. The Dataset

#### Description of data set available, at the field level (see table)
In this project I am considering real estate data from the city of Ames, Iowa.  The details of every real estate transaction are recorded by a city's Tax Assessor's office.  Sometimes that data is readily available for public use and other times it is not.  The dataset I have for the city of Ames covers residential home sales between 2006 and 2010.  The type of information contained in the data is similar to what a typical home buyer would want to know before making a purchase.  The dataset is not large (2930 records) and the variables are a mix of nominal, ordinal, continuous, and discrete.  A few more details on the dataset:

* Source:  Ames, Iowa Assessor's Office
* 80 variables directly related to property sales
* 2930 unique observations
* Time period:  2006 - 2010
* The dataset has 82 columns; 80 of them break down by type as follows:
    * 23 categorical nominal
    * 23 categorical ordinal
    * 20 continuous
    * 14 discrete

```python
import pandas as pd

df = pd.read_csv('../Data/train.csv')
df.head()
```

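| Unnamed: 0 | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | | Reg | Lvl | AllPub | ... | 0 | | | | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | | Reg | Lvl | AllPub | ... | 0 | | | | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 12 | 2008 | WD | Normal | 250000 |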


## III. Domain Knowledge

#### What experience do you already have around this area?  Does it relate or help inform the project in any way?
My experience in the real estate domain is limited.  Although I have never purchased or sold property myself, I have been involved with several friends and family members who have.  I have also moved several times from city to city.  Although I am no expert in real estate, my experiences and interest in the housing market have given me a good understanding of the features and characteristics of a home and how they may affect its price.  This experience will help inform my data dictionary, which will be quite large given that I have 80 variables in my dataset.

#### What other research efforts exist?
The problem statement for this project is a common one.  Thus, several other research efforts exist, especially in the data science community.  This question has become a challenge on

## IV. Project Concerns

#### What questions do you have about your project?  What are you not sure you quite yet understand?
The housing price data ranges from early 2006 to mid-2010, and I think it is important to note that the mortgage crisis happened during this period and contributed to the economic recession of 2008.  I think house sales in Ames, Iowa were no exception and were influenced by the mortgage crisis during this time.  How do I use time series methods to account for this?  Another question I have is how to handle missing data.  I have a lot of variables, so I am not sure how to handle all of the missing values given that the dataset is small.
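
A reasonable first step on the missing-data question is simply to measure how much is missing in each column before choosing a strategy. The sketch below is a hedged illustration: `missing_summary` is a hypothetical helper, and the small frame is made up (it borrows column names from the preview above but is not real Ames data).

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Count and share of missing values per column, worst first."""
    counts = df.isna().sum()
    summary = pd.DataFrame({"n_missing": counts, "share": counts / len(df)})
    # Keep only columns that actually have missing values.
    return summary[summary["n_missing"] > 0].sort_values("n_missing", ascending=False)


# Made-up values, for illustration only.
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "Alley": [np.nan, np.nan, np.nan, "Grvl"],
    "LotArea": [8450, 9600, 11250, 9550],
})
summary = missing_summary(toy)
print(summary)
```

Columns that are mostly empty may be encoding "feature not present" (for example, no alley) rather than an unknown value, so it is worth checking the data dictionary before imputing or dropping anything.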

#### What are the assumptions and caveats to the problem?  What data do you not have access to, but wish you did?  What is already implied about the observations in your dataset?
One caveat to the problem is that our dataset is small.  We have fewer than 3,000 observations, which we have to split into training and test data.  Therefore, we must watch out for overfitting in our model, all the more so because we have a lot of variables.
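
The overfitting concern can be illustrated with a hold-out split: fit on one part of the data and compare the error there with the error on rows the model never saw. This is a sketch on synthetic data with deliberately many predictors relative to rows; the helpers `train_test_indices` and `rmse` are hypothetical names, and in practice a library routine such as scikit-learn's `train_test_split` would likely be used instead.

```python
import numpy as np


def train_test_indices(n_rows, test_fraction=0.2, seed=0):
    """Shuffle row indices and split them into (train, test) arrays."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_rows)
    n_test = int(round(n_rows * test_fraction))
    return order[n_test:], order[:n_test]


def rmse(y_true, y_pred):
    """Root-mean-squared error between two arrays."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))


# Synthetic data: 60 rows, 40 predictors, but only the first one matters.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 40))
y = 3.0 * X[:, 0] + rng.normal(scale=1.0, size=60)

train_idx, test_idx = train_test_indices(len(X), test_fraction=0.25)
design = np.column_stack([np.ones(len(X)), X])  # add intercept column
beta, *_ = np.linalg.lstsq(design[train_idx], y[train_idx], rcond=None)

train_rmse = rmse(y[train_idx], design[train_idx] @ beta)
test_rmse = rmse(y[test_idx], design[test_idx] @ beta)
print(train_rmse, test_rmse)
```

A training error far below the test error is the signature of overfitting; with few rows and many variables, the gap tends to be large.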

## V. Outcomes

#### What do you expect the output to look like? What does your target audience expect the output to look like?
I am expecting the output of the regression model to be a continuous value, which represents the price of a home.  My target audience would be individuals interested in investing and selling property in Ames, Iowa.

#### What gain do you expect from your most important feature on its own?
I would expect the most important feature to have the largest coefficient, provided the predictors are on comparable scales (for example, after standardization), since a regression coefficient represents the mean change in the response variable for a one-unit change in the predictor while the other predictors in the model are held constant. This statistical control is important because it isolates the role of one variable from all of the others in the model.
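
One caution when reading importance off coefficients: the raw size of a coefficient depends on the units of its predictor. The sketch below, on synthetic data with hypothetical predictors (`living_area_sqft`, `n_garage_cars`), shows how the ranking can flip once predictors are standardized to zero mean and unit variance. It is an illustration under those assumptions, not a result from the Ames data.

```python
import numpy as np


def ols_coefficients(X, y):
    """Least-squares slope coefficients (intercept fitted but not returned)."""
    design = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta[1:]


# Synthetic data: area is measured in large units, garage size in small ones.
rng = np.random.default_rng(0)
living_area_sqft = rng.uniform(800, 3000, size=200)
n_garage_cars = rng.integers(0, 4, size=200).astype(float)
price = (100 * living_area_sqft + 5_000 * n_garage_cars
         + rng.normal(scale=5_000, size=200))

X = np.column_stack([living_area_sqft, n_garage_cars])
raw = ols_coefficients(X, price)

# Standardize each predictor so coefficients are per standard deviation.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
standardized = ols_coefficients(X_std, price)

print("raw:", raw)
print("standardized:", standardized)
```

In this toy setup the garage predictor has the larger raw coefficient, while living area has the larger standardized one, because a one-standard-deviation change in area moves the price far more.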

#### How complicated does your model have to be?
This will depend on which variables I determine are statistically significant.  For every statistically significant predictor, there will be an associated coefficient and p-value.  Since there are 80 potential predictors in the dataset, I imagine this model could be quite complex.

#### How successful does your project have to be in order to be considered a "success"?
I can evaluate my model on the root-mean-squared error (RMSE) between the logarithm of the predicted price and the logarithm of the observed sale price. Taking logs means that the error is measured in relative rather than absolute terms, so errors in predicting expensive houses and cheap houses affect the result equally.  The lower the RMSE, the more accurate the prediction model.
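
The metric described above can be sketched in a few lines. The function name `rmse_of_logs` and the example prices are assumptions made for illustration; the point is that a 10% miss on a cheap house and a 10% miss on an expensive house yield the same score.

```python
import numpy as np


def rmse_of_logs(y_true, y_pred):
    """RMSE between log(predicted price) and log(observed price)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2)))


# Both predictions are 10% too high, on very different price levels.
cheap = rmse_of_logs([100_000], [110_000])
expensive = rmse_of_logs([500_000], [550_000])
print(cheap, expensive)
```

Both calls return log(1.1), roughly 0.095, even though the absolute errors are 10,000 and 50,000.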



