# Predicting House Prices 🏠

Submitted by Hanaan Shafi.

### Structure

This project consists of the following Python files (all available on GitHub):

1. main.py: Contains the web scraping logic using Redfin's API endpoint + some preliminary analysis
2. data_processing.py: Handles data loading and preprocessing
3. model_builder.py: Contains model building and evaluation functions
4. run_analysis.py: Main script for the analysis
5. analyze_results.py: Analyzes and visualizes the model results
6. utils.py: Helper functions used across the project

### City Strategy

I started with a single city (Chicago) but had to expand to include nearby cities:

* Aurora
* Naperville
* Joliet
* Evanston
* Oak Park
* Schaumburg
* Skokie
* Des Plaines
* Arlington Heights

This was because single-city data was far too limited: I was only able to get 3-5 properties per request, so I could barely build a functioning decision tree.

### Web Scraping

I began by scraping HTML directly from Redfin's website with BeautifulSoup, but I kept getting blocked by the site's anti-scraping protections. I then switched to Redfin's internal API endpoint (https://www.redfin.com/stingray/api/gis-cs).
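As a rough illustration of this approach, here is a minimal sketch of building requests against the endpoint above with a rotating `User-Agent` header (rotating headers are mentioned under Challenges) and a pause between calls. It is not the code from main.py: the `User-Agent` strings, the `city` query parameter, the delay values, and the function names `build_request` and `fetch_all` are all assumptions made for the example.

```python
import itertools
import random
import time
import urllib.parse
import urllib.request

# Endpoint exactly as written in this write-up.
BASE_URL = "https://www.redfin.com/stingray/api/gis-cs"

# Hypothetical User-Agent strings; the real ones used in main.py are not shown here.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/1.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 13_0) ExampleBrowser/2.0",
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/3.0",
]
_agent_cycle = itertools.cycle(USER_AGENTS)


def build_request(params):
    """Build (but do not send) a GET request using the next User-Agent in the rotation."""
    url = BASE_URL + "?" + urllib.parse.urlencode(params)
    headers = {"User-Agent": next(_agent_cycle), "Accept": "*/*"}
    return urllib.request.Request(url, headers=headers)


def fetch_all(param_sets, send=urllib.request.urlopen, min_delay=1.0, max_delay=3.0):
    """Send one request per parameter set, sleeping a random interval between calls."""
    responses = []
    for i, params in enumerate(param_sets):
        if i > 0:
            # Spread requests out to stay under the site's rate limits.
            time.sleep(random.uniform(min_delay, max_delay))
        responses.append(send(build_request(params)))
    return responses


# Placeholder query parameter: the real endpoint's parameters are not documented here.
request = build_request({"city": "Chicago"})
print(request.full_url)
```

Passing `send` as an argument keeps the pacing logic separate from the network call, so the loop can be tried out with a stub function before hitting the real site.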

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("housing_data.csv")
df

| Unnamed: 0 | Price | Bedrooms | Bathrooms | SquareFeet | YearBuilt | PropertyType | ZipCode |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 329900.0 | 3.0 | 2.0 | 2187.0 | 2012.0 | Unknown | 84655.0 |
| 1 | 250000.0 | 4.0 | 2.5 | 2003.0 | 1973.0 | Single Family Residential | 36421.0 |
| 2 | 469900.0 | 4.0 | 2.0 | 2536.0 | 2006.0 | Single Family Residential | 36420.0 |
| 3 | 155000.0 | 3.0 | 2.0 | 2187.0 | 2012.0 | Vacant Land | 84655.0 |
| 4 | 120000.0 | 3.0 | 2.0 | 2187.0 | 2012.0 | Vacant Land | 84655.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 497 | 2795000.0 | 4.0 | 3.5 | 3200.0 | 2024.0 | Single Family Residential | 98119.0 |
| 498 | 4053777.0 | 3.0 | 3.0 | 2095.0 | 2012.0 | Single Family Residential | 98101.0 |
| 499 | 999900.0 | 3.0 | 2.0 | 2254.0 | 2024.0 | Single Family Residential | 98388.0 |
| 500 | 1159900.0 | 4.0 | 4.5 | 3650.0 | 2024.0 | Single Family Residential | 98388.0 |


In [3]:
!python run_analysis.py

2024-11-30 06:25:45,074 - INFO - Loading and cleaning data...
2024-11-30 06:25:45,102 - INFO - Data loaded and cleaned successfully
2024-11-30 06:25:45,102 - INFO - Analyzing housing data...

Data Summary:
Total properties: 393
Property types:
PropertyType
Single Family Residential    309
Townhouse                     34
Ranch                         22
Condo/Co-op                   12
Multi-Family (2-4 Unit)        7
Mobile/Manufactured Home       6
Other                          2
Unknown                        1
Name: count, dtype: int64

Price statistics:
count    3.930000e+02
mean     4.511144e+05
std      4.541336e+05
min      2.500000e+04
25%      2.200000e+05
50%      3.779900e+05
75%      5.400000e+05
max      4.500000e+06
Name: Price, dtype: float64
2024-11-30 06:25:46,496 - INFO - Preparing features...
2024-11-30 06:25:46,526 - INFO - Building models...
2024-11-30 06:26:41,931 - INFO - Analyzing model results...

Model Performance Comparison:

Decision Tree:
RMSE: $0.52
R² S

## Observations

The decision tree yields an R² score of 0.4667, so it explains only about 47% of the variance in housing prices. There is clear room for improvement, which I suspect could come from more features and better hyperparameter tuning. I could also consider combining multiple decision trees using bagging or boosting.
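To make the "variance explained" reading concrete, here is a small sketch of how R² is computed: one minus the ratio of the residual sum of squares to the total sum of squares. The function name `r_squared` and the prices are made up for illustration and are not taken from the project code or data.

```python
def r_squared(y_true, y_pred):
    """Return 1 - SS_res / SS_tot, the fraction of variance the predictions explain."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # unexplained variation
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total variation
    return 1 - ss_res / ss_tot


# Made-up prices, purely illustrative.
prices = [200_000, 300_000, 400_000, 500_000]

print(r_squared(prices, prices))         # perfect predictions -> 1.0
print(r_squared(prices, [350_000] * 4))  # always predicting the mean -> 0.0
```

A score of 0.4667 therefore sits roughly halfway between always guessing the mean price and predicting every price exactly.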

I also tried a random forest model to see how it compares to the decision tree, since random forests often outperform single decision trees and are less prone to overfitting. The random forest got an R² score of 0.5650, about 0.10 higher than the decision tree's 0.4667, which is a modest improvement.
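For reference, a comparison like this can be set up in a few lines with scikit-learn. The sketch below runs on synthetic stand-in data rather than the scraped dataset, and the hyperparameters (`max_depth=5`, `n_estimators=200`) are assumptions, not the settings used in model_builder.py, so its scores will not match the ones reported here.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (NOT the scraped dataset): price rises with
# log(square feet), falls with age, and includes random noise.
rng = np.random.default_rng(0)
sqft = rng.uniform(600, 4000, size=400)
age = rng.uniform(0, 100, size=400)
price = 150_000 * np.log(sqft) - 1_000 * age + rng.normal(0, 60_000, size=400)
X = np.column_stack([np.log(sqft), age])

X_train, X_test, y_train, y_test = train_test_split(
    X, price, test_size=0.2, random_state=0
)

# Hypothetical hyperparameters chosen only for this example.
models = {
    "Decision Tree": DecisionTreeRegressor(max_depth=5, random_state=0),
    "Random Forest": RandomForestRegressor(n_estimators=200, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # Score on the held-out split so both models are judged on unseen data.
    scores[name] = r2_score(y_test, model.predict(X_test))
    print(f"{name}: R² = {scores[name]:.4f}")
```

Fixing `random_state` for the split and for both models keeps the comparison repeatable from run to run.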

Either way, both R² values are lower than ideal. This could be because of the limited feature set, potentially non-linear relationships between features and price, and data quality issues stemming from web scraping. I think it would help to incorporate more features (such as those in the famous [Ames housing pricing dataset on Kaggle](https://www.kaggle.com/datasets/marcopale/housing)), for instance info on the street, neighborhood, lot area, and property condition. I could also improve the model by simply increasing the dataset size. One of the main challenges I faced was the long wait during data scraping (4 hours), so I had to make a tradeoff between run time and amount of data.

Taking a closer look at the feature importances: log(square feet) is by far the most important feature in both models, accounting for 74-85% of the importance, so the size of the property is the primary driver of housing prices. I applied the log transformation to reduce the skew in the distribution of square footage and to capture the non-linear relationship between size and price. The age of the property is the second most important feature, though its importance is much lower in the random forest model (7% vs 13% in the decision tree). So newer properties tend to be more expensive, but the relationship may not be strictly linear.

I also added a column with the bedroom-to-bathroom ratio, which ended up being the 3rd most important feature. So the balance between bedrooms and bathrooms affects the price. Interestingly, this feature is more important than the total number of rooms and than the number of bathrooms or bedrooms individually, which does make sense.
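The derived features discussed above (log of square footage, property age, and the bedroom/bathroom ratio) could be computed along these lines. This is a sketch, not the code from data_processing.py: the output names `log_sqft`, `age`, and `bed_bath_ratio`, the reference year of 2024, and the handling of zero bathrooms are all assumptions.

```python
import math

# Assumption: age is measured relative to the year the data was collected.
REFERENCE_YEAR = 2024


def engineer_features(row):
    """Derive model features from one listing, given as a dict of raw columns."""
    baths = row["Bathrooms"]
    return {
        # The log compresses very large homes so they do not dominate the scale.
        "log_sqft": math.log(row["SquareFeet"]),
        "age": REFERENCE_YEAR - row["YearBuilt"],
        # Guard against division by zero for listings with no bathrooms recorded.
        "bed_bath_ratio": row["Bedrooms"] / baths if baths else None,
    }


# Values taken from row 1 of the table shown earlier.
listing = {"Bedrooms": 4.0, "Bathrooms": 2.5, "SquareFeet": 2003.0, "YearBuilt": 1973.0}
features = engineer_features(listing)
print(features["age"], features["bed_bath_ratio"])  # 51.0 1.6
```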


## Challenges

The real estate data was really hard to collect:
* Zillow and Redfin kept blocking my scraping attempts with their anti-scraping protections.
* I couldn't get any data from Zillow at all but had better luck with Redfin, where I ended up using rotating headers to avoid being blocked.
* I encountered many duplicate properties, so I lost a lot of rows during processing and cleaning: housing_data.csv has 501 rows, but only 393 properties remained for the analysis.
* Retrieving the data was very slow (4 hours) because of rate limiting.
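As an illustration of the duplicate problem, here is a minimal sketch of dropping repeated listings while keeping the first occurrence of each. Treating two rows as duplicates when all the listed fields match is an assumption made for the example; the actual rule in data_processing.py may differ, and the function name `drop_duplicate_listings` is hypothetical.

```python
def drop_duplicate_listings(rows, key_fields):
    """Keep the first row for each distinct combination of key_fields."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row.get(field) for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Made-up rows: the second is an exact repeat of the first.
rows = [
    {"Price": 250000.0, "SquareFeet": 2003.0, "ZipCode": 36421.0},
    {"Price": 250000.0, "SquareFeet": 2003.0, "ZipCode": 36421.0},
    {"Price": 469900.0, "SquareFeet": 2536.0, "ZipCode": 36420.0},
]
cleaned = drop_duplicate_listings(rows, ["Price", "SquareFeet", "ZipCode"])
print(len(rows), "->", len(cleaned))  # 3 -> 2
```

On a pandas DataFrame, `df.drop_duplicates()` does the same job, with `subset=` to restrict the comparison to chosen columns.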

______________________