# Predicting House Prices of Washington Using Machine Learning

Hanna Seyoum

## Problem Statement

House prices generally rise over time, with some risk of a crash. Price increases and decreases affect many families, whether they are looking to buy or sell. The goal is to predict the current value of houses so that buyers and sellers can make informed decisions.

## The Data

The data covers houses in the state of Washington and has 18 attributes, such as the number of bedrooms and bathrooms, the square footage of different parts of the house, and the house location. The dataset was acquired from Kaggle.

## Methodology

I have treated this as a supervised learning regression problem.

* Acquired dataset from Kaggle  
* Applied data wrangling & cleaning for feature engineering and selection, and to handle missing values and outliers  
* Exploratory Data Analysis and Visualizations to find patterns and insights with respect to various features in the housing data  
* Hypothesis testing leveraging inferential statistics  
* Predictive modeling for house prices leveraging linear regression, ridge regression, and random forest regression. 
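The modeling step can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the data is synthetic, and the hyperparameters (`alpha=1.0`, `n_estimators=50`), the 5-fold split, and the R^2 scoring are assumptions chosen only to make the example run.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the housing features and prices (not the real data).
rng = np.random.default_rng(0)
X = rng.uniform(500, 5000, size=(200, 3))
y = 150 * X[:, 0] + 20 * X[:, 1] + rng.normal(0, 20_000, size=200)

# The three model families named in the methodology; settings are assumed.
models = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Mean 5-fold cross-validated R^2 for each model.
cv_r2 = {
    name: cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    for name, model in models.items()
}

for name, score in cv_r2.items():
    print(f"{name}: mean R^2 = {score:.3f}")
```

Scoring every model with the same cross-validation folds keeps the comparison between the three regressors like for like.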

### Libraries

pandas for:

* data loading, wrangling, cleaning, and manipulation
* feature selection and engineering
* descriptive statistics

numpy for:

* generating arrays of values
* array sorting and manipulation

matplotlib and seaborn for:

* data visualization

scikit-learn for:

* data preprocessing
* regression and ensemble models
* cross-validation
* model selection
* model performance metrics

### Data Wrangling & Cleaning

The data is a CSV file, which I loaded into a pandas DataFrame. I used a combination of feature engineering and selection. There were no NaNs; instead, some entries used 0 to indicate a missing value.
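A quick way to see the difference between NaNs and placeholder zeros is to count both per column. The sketch below uses a hypothetical three-row excerpt held in a string; the real file would be read from disk with `pd.read_csv`, and the values shown are made up.

```python
import io

import pandas as pd

# Hypothetical excerpt; values are invented for illustration only.
csv_text = """date,price,bedrooms,bathrooms,sqft_living,yr_renovated
2014-05-02,313000,3,1.5,1340,2005
2014-05-02,0,4,2.5,2900,0
2014-06-10,1095000,0,0,3064,0
"""
df = pd.read_csv(io.StringIO(csv_text))

# NaN entries per column.
nan_counts = df.isna().sum()

# Zero entries per numeric column (possible placeholders for missing values).
numeric = df.select_dtypes("number")
zero_counts = (numeric == 0).sum()

print(nan_counts.sum())
print(zero_counts[zero_counts > 0])
```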

#### Cleaning steps

* Used a combination of **feature engineering** & **feature selection**. More features generally help up to a point, beyond which the curse of dimensionality sets in, so I kept all features except `statezip`, which I deleted after splitting it into `state` and `zipcode` features.

* Created a `month` feature by extracting the months from the `date` column to factor in how price is affected over time.

* Created a `total_sqft` feature by summing `sqft_living` and `sqft_lot`.

* Changed the data types of the `waterfront` & `condition` features from `int` to `category`, because their integer values are codes rather than quantities (for `waterfront`, 1 means yes & 0 means no).

* Split the `statezip` feature into two features, `state` & `zipcode`, and deleted `statezip` after the split.
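The cleaning steps above could look roughly like this in pandas. The column names come from the text, but the rows are made up, and the assumption that `statezip` holds a state code and a zip code separated by a space is mine.

```python
import pandas as pd

# Hypothetical rows; column names follow the text, values are invented.
df = pd.DataFrame({
    "date": ["2014-05-02 00:00:00", "2014-06-10 00:00:00", "2014-07-01 00:00:00"],
    "sqft_living": [1340, 2900, 1930],
    "sqft_lot": [7912, 9050, 11947],
    "waterfront": [0, 1, 0],
    "condition": [3, 5, 4],
    "statezip": ["WA 98133", "WA 98119", "WA 98042"],
})

# Month of sale, extracted from the date column.
df["month"] = pd.to_datetime(df["date"]).dt.month

# Combined square footage of living space and lot.
df["total_sqft"] = df["sqft_living"] + df["sqft_lot"]

# Split "WA 98133" into state and zipcode, then drop the original column.
# Assumes a single space separates the two parts.
df[["state", "zipcode"]] = df["statezip"].str.split(" ", n=1, expand=True)
df = df.drop(columns="statezip")

# Treat the coded integer columns as categories.
df = df.astype({"waterfront": "category", "condition": "category"})

print(df.dtypes)
```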


#### Handling missing/zero values

* Found 2 houses with 0 bedrooms & 0 bathrooms that were priced over \$1 million. These values were clearly erroneous, so I replaced the bedroom & bathroom values of both houses with the mean bedroom & bathroom values.

* The `yr_renovated` column is a numeric column holding the year each house was last renovated, & 59.5\% of its rows are 0. I was not sure whether a 0 meant that a house was never renovated or that the renovation date was missing. Rather than dropping the column, I added a boolean indicator column with 1 for every house that has a 0 and 0 for every house that has a renovation year listed.

* 1.1\% of the rows in the `price` column are 0. It is unlikely that these houses were worth \$0, so I did the same as with the `yr_renovated` column and added a boolean indicator column. I also created a new dataframe with these houses removed to compare its models to the ones with the \$0 prices included. There was a slight improvement in the models with the 0s removed. However, I don't think this improvement in model performance is worth removing 1.1\% of the data.
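A minimal sketch of the zero handling described above, on made-up rows. The indicator column names (`renovation_missing`, `price_missing`) are hypothetical, and computing the replacement mean over the non-zero rows only is an assumption; the text does not say which rows the mean was taken over.

```python
import pandas as pd

# Hypothetical rows; values are invented for illustration.
df = pd.DataFrame({
    "price": [313000.0, 0.0, 1095000.0, 550000.0],
    "bedrooms": [3.0, 4.0, 0.0, 2.0],
    "bathrooms": [1.5, 2.5, 0.0, 1.0],
    "yr_renovated": [2005, 0, 0, 1994],
})

# Houses with neither bedrooms nor bathrooms are treated as erroneous.
bad_rooms = (df["bedrooms"] == 0) & (df["bathrooms"] == 0)

# Replace their values with the column mean (assumed: mean of non-zero rows).
for col in ["bedrooms", "bathrooms"]:
    col_mean = df.loc[df[col] > 0, col].mean()
    df.loc[bad_rooms, col] = col_mean

# Indicator columns: 1 where the original value was 0, else 0.
df["renovation_missing"] = (df["yr_renovated"] == 0).astype(int)
df["price_missing"] = (df["price"] == 0).astype(int)

# Alternative dataframe without the $0 houses, for model comparison.
df_no_zero_price = df[df["price"] > 0].copy()

print(df)
```

Keeping the indicator columns lets a model learn from the "was zero" signal without discarding any rows.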

#### Outliers

* Found two 3-bedroom houses with prices over \$10 million. To check whether they were outliers or possibly erroneous, I plotted `price` against `sqft_living` with a linear regression line to see whether the high prices could be explained by square footage.

Result: They seem to be erroneous entries, so I created a new dataframe with both houses removed and compared its model performance with the original dataframe's model. There was a significant improvement in model performance, so I conducted the remaining analysis with the dataframe excluding these 2 houses.
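The same check can be done numerically instead of visually. The sketch below is one possible variant, not the original analysis: it fits a line to the houses under \$10 million and compares each flagged house's price to what that line predicts for its square footage. The rows and the `predicted` and `ratio` column names are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; one house is priced far above the rest.
df = pd.DataFrame({
    "sqft_living": [1000, 1500, 2000, 2500, 3000, 1800, 2200],
    "price": [300_000, 420_000, 610_000, 700_000, 910_000, 12_000_000, 650_000],
})

# Candidate outliers: the threshold mirrors the $10 million mentioned in the text.
candidates = df["price"] > 10_000_000

# Fit price ~ sqft_living on the remaining houses only.
x = df.loc[~candidates, "sqft_living"].to_numpy(dtype=float)
y = df.loc[~candidates, "price"].to_numpy(dtype=float)
slope, intercept = np.polyfit(x, y, deg=1)

# How far above the fitted line does each house sit?
df["predicted"] = slope * df["sqft_living"] + intercept
df["ratio"] = df["price"] / df["predicted"]

# Dataframe with the flagged houses removed, for model comparison.
df_clean = df[~candidates].copy()

print(df.loc[candidates, ["sqft_living", "price", "predicted", "ratio"]])
```

A ratio far above 1 means square footage alone does not explain the price, which supports treating the entry as erroneous.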

### Exploratory Data Analysis & Statistical Analysis Plots

#### Feature Observation

Some assumptions:

* Houses with more bedrooms will be worth more
* Houses with more bathrooms will be worth more
* Houses in bad condition will be worth less
* Houses with higher total square footage will be worth more
* Houses with a waterfront will be worth more
* Houses sold most recently will be slightly more expensive than houses sold earlier

* Plotted a heatmap of all the features and target variable to get an idea of their relationships.
* Plotted scatter plots of `price` against the number of bedrooms, house size, and house condition. These plots showed a positive linear relationship between `sqft_living` & `price` and a non-linear (polynomial) relationship between `bedrooms` & `price`. The `condition` plot showed that houses in poor condition are worth much less than houses in moderate and good condition. Also, the majority of the houses are in moderate condition, and the plot has a slightly parabolic shape.
* Plotted a barplot of `waterfront` vs `price` which showed that houses with waterfronts are worth more.
* Plotted a line plot of `yr_built` and `price`. This plot shows that prices are higher for very old houses and for new ones. Old houses could be priced higher due to their historical value.
* Plotted mean & median prices per number of bedrooms. There is a slight difference between the two plots, with mean prices slightly higher than median prices. This could be due to some price outliers.
* Plotted mean & median house sizes per number of bedrooms. We see a strong positive linear relationship between `bedrooms` and house size, and the mean & median plots are very similar.
* An ECDF plot of house prices for each month (May, June, & July) shows that house prices remained about the same in each month.
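The numbers behind several of these plots can be computed as below. This is an illustrative sketch on made-up rows: `ecdf` is a hypothetical helper, not a library function, and the resulting tables would still need to be passed to matplotlib or seaborn (for example `seaborn.heatmap` for the correlation matrix) to draw the figures.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; values are invented for illustration.
df = pd.DataFrame({
    "bedrooms": [2, 2, 3, 3, 3, 4, 4],
    "price": [250_000, 310_000, 400_000, 420_000, 900_000, 600_000, 640_000],
    "month": [5, 6, 5, 6, 7, 7, 5],
})

# Pairwise correlations between features and target (input for a heatmap).
corr = df.corr()

# Mean and median price per bedroom count; a high-priced house pulls the
# mean above the median within its group.
price_by_bedrooms = df.groupby("bedrooms")["price"].agg(["mean", "median"])


def ecdf(values):
    """Return sorted values and the fraction of observations <= each one."""
    x = np.sort(np.asarray(values))
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y


# One ECDF per month of sale, to compare price distributions across months.
ecdf_by_month = {
    month: ecdf(group["price"]) for month, group in df.groupby("month")
}

print(price_by_bedrooms)
```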