# Examining The Influence Of Square Footage On The Selling Price Of Houses In Vancouver #

## Introduction: ##
Vancouver's housing market is notorious for its skyrocketing prices, which make it the "second-most unaffordable [housing] market" after Hong Kong, according to Kwan of Bloomberg News ([source](https://www.bnnbloomberg.ca/hong-kong-housing-ranked-world-s-least-affordable-for-9th-year-1.1201263)).

A newcomer to Vancouver looking for a home might wonder what counts as "cheap". On the flip side, a seller might wonder whether their home is priced below, above, or at market expectations, which can shorten or lengthen its time to sale.

The price of a house can depend on a multitude of factors, including location, size, the age of the home, and many others. Based on information about the property, predictions can be made about the selling price of a home compared to others in the area.

The goal of our project is to use regression analysis to determine the relationship between the price and total square footage of houses in Vancouver. We will do this using a [publicly available dataset from Kaggle](https://www.kaggle.com/datasets/darianghorbanian/vancouver-home-price-analysis-regression), which contains prices, square footage, and other details about Vancouver houses from 2017-2020.

Our Predictive Question is: “Does the square footage of a house in Vancouver have an impact on its selling price?”

# Methods & Results: #


## Loading data from the original source on the web ##

The dataset we have chosen for this project is [publicly available on Kaggle](https://www.kaggle.com/datasets/darianghorbanian/vancouver-home-price-analysis-regression) and contains prices, square footage, and other details about Vancouver houses.

To read the dataset directly from Kaggle, we need to work with the Kaggle API and set it up with an authentication username and key.

First, we will import the necessary libraries to use throughout our project.

In [23]:
### importing the necessary libraries
import altair as alt
import numpy as np
import pandas as pd
from sklearn import set_config
from sklearn.model_selection import GridSearchCV, cross_validate, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error

# Simplify working with large datasets in Altair
alt.data_transformers.disable_max_rows()

# Output dataframes instead of arrays
set_config(transform_output="pandas")

Then we will install the Kaggle package, to interact with the Kaggle API as outlined in their documentation: https://www.kaggle.com/docs/api

In [24]:
# Set up Kaggle for downloading the data set.
# Supply your own credentials here; a real API key should never be committed to a notebook.

!pip install kaggle
import os

os.environ['KAGGLE_USERNAME'] = 'your_kaggle_username'
os.environ['KAGGLE_KEY'] = 'your_kaggle_api_key'



In [25]:
# download data set
!kaggle datasets download -d darianghorbanian/vancouver-home-price-analysis-regression --unzip

Dataset URL: https://www.kaggle.com/datasets/darianghorbanian/vancouver-home-price-analysis-regression
License(s): unknown
Downloading vancouver-home-price-analysis-regression.zip to /home/jovyan/work/Working Folder/vancouver_housing_predictions
  0%|                                               | 0.00/30.1k [00:00<?, ?B/s]
100%|██████████████████████████████████████| 30.1k/30.1k [00:00<00:00, 9.36MB/s]


Now that we have downloaded our data set, we can load it into a pandas data frame.

In [26]:
home_prices = pd.read_csv("House sale data Vancouver.csv")
home_prices

Unnamed: 0,Number,Address,List Date,Price,Days on market,Total floor area,Year Built,Age,Lot Size
0,1,3178 GRAVELEY STREET,5/8/2020,1500000,18,2447,1946,74,5674.00
1,2,1438 E 28TH AVENUE,1/22/2020,1300000,7,2146,1982,38,3631.98
2,3,2831 W 49TH AVENUE,6/18/2019,2650000,1,3108,1929,90,9111.00
3,4,2645 TRIUMPH STREET,6/18/2019,1385000,28,2602,1922,97,4022.70
4,5,741-743 E 10TH AVENUE,11/28/2019,1590000,17,1843,1970,49,4026.00
...,...,...,...,...,...,...,...,...,...
1297,1298,65 W KING EDWARD AVENUE,8/22/2019,2630000,42,3035,1939,80,7456.00
1298,1299,3150 E 52ND AVENUE,8/17/2019,1450000,14,2282,1974,45,3993.00
1299,1300,4478 PRINCE ALBERT STREET,2/24/2020,2798000,4,3501,2016,4,3960.00
1300,1301,4038 MILLER STREET,4/5/2019,900000,194,2440,1912,107,3297.00


## Wrangling and cleaning the data into the needed format ##

The tidy data format adheres to the following three principles:

- Each variable corresponds to a column.
- Each observation corresponds to a row.
- Each measurement is a cell value.
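
A few quick checks can catch common violations of these principles. The sketch below is illustrative only: it uses the first three `Price` and `Total floor area` values shown in the output above, and the `checks` dictionary is a hypothetical helper rather than part of our analysis. Passing these checks does not prove a table is tidy, but failing one is a sign that wrangling is needed.

```python
import pandas as pd

# First three rows of the two columns we care about, copied from the output above
demo = pd.DataFrame({
    "Price": [1_500_000, 1_300_000, 2_650_000],
    "Total floor area": [2447, 2146, 3108],
})

# Hypothetical sanity checks (not an exhaustive test of tidiness)
checks = {
    "unique column names": demo.columns.is_unique,
    "no missing cells": not demo.isna().any().any(),
    "numeric columns": all(pd.api.types.is_numeric_dtype(t) for t in demo.dtypes),
}
print(checks)
```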

Fortunately, our dataset already meets these requirements and is therefore tidy. For simplicity, however, we can drop all the columns except those that are relevant to our analysis.

In [27]:
home_prices = home_prices[['Price', 'Total floor area']]
home_prices

Unnamed: 0,Price,Total floor area
0,1500000,2447
1,1300000,2146
2,2650000,3108
3,1385000,2602
4,1590000,1843
...,...,...
1297,2630000,3035
1298,1450000,2282
1299,2798000,3501
1300,900000,2440


Next, we will split the data, using 75% as the training set and the remaining 25% as the testing set. We will set the `Price` column as the target (y) and the `Total floor area` column as the input feature (X).

In [28]:
home_training, home_testing = train_test_split(
    home_prices,
    test_size=0.25,
    random_state=2000,
)

# Input feature: floor area; target: selling price
X_train = home_training[["Total floor area"]]
y_train = home_training["Price"]

X_test = home_testing[["Total floor area"]]
y_test = home_testing["Price"]

## Summary of the data set ##

In the following table, we provide basic numerical summaries of `Total floor area` in the training set: the five-number summary for a quick overview, the mean and standard deviation to describe the center and spread, the number of missing values, and the overall number of data points.

In [29]:
summary_table = pd.DataFrame({
    'Total floor area': [
        home_training['Total floor area'].count(),
        round(home_training['Total floor area'].mean(), 2),
        round(home_training['Total floor area'].median(), 2),
        round(home_training['Total floor area'].std(), 2),
        round(home_training['Total floor area'].min(), 2),
        round(home_training['Total floor area'].max(), 2),
        home_training['Total floor area'].isnull().sum(),
        round(home_training['Total floor area'].quantile(0.25), 2),
        round(home_training['Total floor area'].quantile(0.75), 2)
    ]
}, index=['Count', 'Mean', 'Median', 'Std', 'Min', 'Max', 'Missing Values', '25th Percentile', '75th Percentile'])

# Display the summary table
summary_table

Unnamed: 0,Total floor area
Count,976.0
Mean,2448.36
Median,2399.0
Std,715.83
Min,301.0
Max,6556.0
Missing Values,0.0
25th Percentile,1980.75
75th Percentile,2832.5
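
A more compact way to build a similar table is pandas' `describe()`, with the missing-value count appended. The sketch below is a possible alternative, not the code we ran: the `summarize` helper and the `demo` values are hypothetical and are not taken from the Kaggle data.

```python
import pandas as pd

def summarize(column: pd.Series) -> pd.Series:
    """Return count, mean, std, min, quartiles, max, and the number of missing values."""
    summary = column.describe().round(2)
    summary["missing"] = column.isna().sum()
    return summary

# Hypothetical floor areas for illustration only, with one missing value
demo = pd.Series([1800, 2100, 2400, None, 3000], name="Total floor area")
print(summarize(demo))
```

Applied to our data, the call would be `summarize(home_training["Total floor area"])`.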


## Visualization of the dataset ##

As we have two quantitative variables in our dataset, with square footage as our explanatory variable and house selling price as our response variable, we've opted to visualize our data with a scatter plot. This choice lets us see the relationship between the two variables at a glance.

In [30]:
#record minimum and maximum values in the prices columns
min_price = home_prices['Price'].min()
max_price = home_prices['Price'].max()

# Create the scatter plot with adjusted y-axis scale
scatter_plot = alt.Chart(home_training).mark_circle(opacity=0.5).encode(
    alt.X('Total floor area:Q', title='Total Floor Area (sq ft)'),
    alt.Y('Price:Q', scale=alt.Scale(domain=(min_price, max_price)), title='Price (CAD)'),
    tooltip=['Total floor area:Q', 'Price:Q']
).properties(
    width=600,
    height=600,
    title='Scatter Plot of Price vs. Total Floor Area'
)

scatter_plot

## Performing the data analysis ##

By using cross-validation on our training data, we can choose the optimal $k$. First, we will create a pipeline for $k$-NN regression and then perform 5-fold cross-validation with the `cross_validate` function, which evaluates the pipeline at its default number of neighbors.

In [31]:
home_pipe = make_pipeline(
   StandardScaler(), KNeighborsRegressor())


home_cv = pd.DataFrame(
    cross_validate(
        estimator=home_pipe,
        cv=5,
        X = X_train,
        y = y_train,
        scoring = "neg_root_mean_squared_error",
        return_train_score=True
    )
)
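
The data frame returned by `cross_validate` holds one row per fold, with scores stored as negated RMSE values. The sketch below shows one way to turn those into a single RMSPE estimate. It runs on synthetic stand-in data (the `area` and `price` arrays are made up, not the Vancouver data set), so the printed numbers say nothing about our model.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical values for illustration only)
rng = np.random.default_rng(0)
area = rng.uniform(500, 5000, size=200)
price = 400 * area + rng.normal(0, 100_000, size=200)
X_demo = pd.DataFrame({"Total floor area": area})
y_demo = pd.Series(price, name="Price")

demo_pipe = make_pipeline(StandardScaler(), KNeighborsRegressor())
demo_cv = pd.DataFrame(
    cross_validate(
        demo_pipe,
        X_demo,
        y_demo,
        cv=5,
        scoring="neg_root_mean_squared_error",
        return_train_score=True,
    )
)

# scikit-learn negates the RMSE so that larger is better; flip the sign back
cv_rmspe = -demo_cv["test_score"]
print(f"mean RMSPE: {cv_rmspe.mean():.0f}, std across folds: {cv_rmspe.std():.0f}")
```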

We will test 200 values of $k$. First, we will create a parameter grid called `param_grid` that contains the values 1 to 200. We will then tune the model using `GridSearchCV` and fit it to the training dataset. Calling `best_params_` on the fitted model gives the optimal number of neighbors, and calling `best_score_` gives the score of the best model. Because the scoring is `neg_root_mean_squared_error`, that score is the negated RMSPE.

In [37]:
np.random.seed(2019) 

param_grid = {
    "kneighborsregressor__n_neighbors": range(1, 201, 1),
}
home_tuned = GridSearchCV(estimator=home_pipe, param_grid=param_grid, cv=5, scoring = "neg_root_mean_squared_error", n_jobs=-1)

home_results = pd.DataFrame(home_tuned.fit(home_training[["Total floor area"]], home_training["Price"]).cv_results_) 


home_results

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_kneighborsregressor__n_neighbors,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,0.003483,0.000324,0.002369,0.000110,1,{'kneighborsregressor__n_neighbors': 1},-555528.754220,-629172.956999,-631816.083758,-603687.286602,-597927.594062,-603626.535128,27543.233182,200
1,0.003298,0.000028,0.002275,0.000031,2,{'kneighborsregressor__n_neighbors': 2},-515797.485603,-543752.558005,-521718.867047,-552993.736387,-519001.936253,-530652.916659,14879.242241,199
2,0.003260,0.000019,0.002275,0.000013,3,{'kneighborsregressor__n_neighbors': 3},-491882.991900,-515478.954608,-475642.828958,-510114.475759,-505044.002217,-499632.650688,14323.585362,198
3,0.003230,0.000011,0.002299,0.000009,4,{'kneighborsregressor__n_neighbors': 4},-487744.125636,-491257.379900,-463348.450025,-509884.917520,-498289.497365,-490104.874089,15369.097618,197
4,0.003215,0.000010,0.002289,0.000009,5,{'kneighborsregressor__n_neighbors': 5},-479640.466931,-457903.045593,-450008.998877,-494313.504613,-480274.407790,-472428.084761,16162.592728,196
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
195,0.003283,0.000048,0.005489,0.000036,196,{'kneighborsregressor__n_neighbors': 196},-455167.962553,-428261.899558,-445561.554033,-430016.452402,-445479.511148,-440897.475939,10241.581157,156
196,0.003224,0.000006,0.005505,0.000058,197,{'kneighborsregressor__n_neighbors': 197},-455109.384495,-428227.363962,-445765.290500,-430069.449395,-445463.277936,-440926.953258,10240.059209,158
197,0.003213,0.000008,0.005476,0.000024,198,{'kneighborsregressor__n_neighbors': 198},-455021.844664,-428413.432888,-445650.701358,-430245.090162,-445443.207177,-440954.855250,10119.909169,161
198,0.003243,0.000029,0.005548,0.000031,199,{'kneighborsregressor__n_neighbors': 199},-454866.818828,-428495.295880,-445559.949725,-430531.425876,-445214.844162,-440933.666894,9967.781961,160


In [38]:
home_min = home_tuned.best_params_
home_min

{'kneighborsregressor__n_neighbors': 111}

In [39]:
home_best_RMSPE = -home_tuned.best_score_
home_best_RMSPE

436830.21025052667
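
The value above is a cross-validation estimate computed on the training set only. A natural follow-up is to measure the RMSPE of the tuned model on the held-out testing set. The sketch below shows how that could look. It runs on synthetic stand-in data with a smaller grid (all values and the `demo_*` names are hypothetical), so its output is not a result for the Vancouver data.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical values for illustration only)
rng = np.random.default_rng(1)
area = rng.uniform(500, 5000, size=300)
demo = pd.DataFrame({
    "Total floor area": area,
    "Price": 400 * area + rng.normal(0, 100_000, size=300),
})
demo_train, demo_test = train_test_split(demo, test_size=0.25, random_state=2000)

demo_tuned = GridSearchCV(
    estimator=make_pipeline(StandardScaler(), KNeighborsRegressor()),
    param_grid={"kneighborsregressor__n_neighbors": range(1, 51)},
    cv=5,
    scoring="neg_root_mean_squared_error",
)
demo_tuned.fit(demo_train[["Total floor area"]], demo_train["Price"])

# With the default refit=True, the best pipeline is refit on all training rows,
# so the fitted search object can predict directly on the testing set
test_pred = demo_tuned.predict(demo_test[["Total floor area"]])
test_rmspe = np.sqrt(mean_squared_error(demo_test["Price"], test_pred))
print(f"best k: {demo_tuned.best_params_}, test RMSPE: {test_rmspe:.0f}")
```

With our objects, the equivalent calls would be `home_tuned.predict(home_testing[["Total floor area"]])` followed by the same RMSPE computation against `home_testing["Price"]`.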

## Creating a visualization of the analysis ##

To provide a clear visual representation of how the number of neighbors affects the K-Nearest Neighbors regression model's accuracy, we plotted the Root Mean Squared Prediction Error (RMSPE) against varying values of `n_neighbors`. This graph is instrumental in identifying the optimal balance between model complexity and predictive performance.

In [43]:
# Convert mean_test_score to positive values for the demonstration
home_results['mean_test_score'] = home_results['mean_test_score'].abs()

# Altair plot
chart = alt.Chart(home_results).mark_line(point=True).encode(
    x=alt.X('param_kneighborsregressor__n_neighbors:Q', title='Number of Neighbors (n_neighbors)'),
    y=alt.Y('mean_test_score:Q', title='Root Mean Squared Prediction Error (RMSPE)'),
    tooltip=['param_kneighborsregressor__n_neighbors', 'mean_test_score']
).properties(
    title='Model Performance vs. Number of Neighbors',
    width=600,
    height=400
).interactive()

chart


The graph shows a sharp decline in the cross-validated Root Mean Squared Prediction Error (RMSPE) as the number of neighbors increases from 1 (about 603,600 at $k = 1$ versus about 472,400 at $k = 5$), suggesting that a single neighbor overfits the training data. The curve then flattens: adding more neighbors improves the error only gradually, and the grid search selects $k = 111$, with an RMSPE of about 436,830. Near $k = 200$ the error is only slightly higher (about 440,900), so the model's accuracy is not very sensitive to the exact choice of $k$ once it is past the first handful of neighbors.

In [47]:
# Use the tuned pipeline to predict over a range of floor areas for visualization
area_min = home_training['Total floor area'].min()
area_max = home_training['Total floor area'].max()
X_vis = pd.DataFrame(np.linspace(area_min, area_max, 200), columns=['Total floor area'])
home_pipe.set_params(**home_min)  # Set the best found parameters
home_pipe.fit(home_training[['Total floor area']], home_training['Price'])  # Fit on the training data
y_pred = home_pipe.predict(X_vis)

# Create DataFrames for plotting
train_df = home_training[['Total floor area', 'Price']]
test_df = home_testing[['Total floor area', 'Price']]
pred_df = pd.DataFrame({'Total floor area': X_vis['Total floor area'], 'Price': y_pred})

# Base chart for the training points
train_points = alt.Chart(train_df).mark_circle(size=60, opacity=0.5, color='blue').encode(
    x='Total floor area:Q',
    y='Price:Q'
)

# Points for the test data
test_points = alt.Chart(test_df).mark_circle(size=60, opacity=0.5, color='green').encode(
    x='Total floor area:Q',
    y='Price:Q'
)

# Line for the prediction
prediction_line = alt.Chart(pred_df).mark_line(color='red').encode(
    x='Total floor area:Q',
    y='Price:Q'
)

# Combine the charts
chart = (train_points + test_points + prediction_line).properties(
    width=600,
    height=400,
    title='KNN Regression Fit'
)

# Display the chart
chart

# Discussion: #


# References: #
