### Title: ### 

# Examining the effect of Square Footage on the selling price of Houses in Vancouver #

**Predictive Question:** 
Can we predict the selling price of a house in Vancouver based on its square footage?

## Introduction (Arienne): ##
- Provide some relevant background information on the topic so that someone unfamiliar with it will be prepared to understand the rest of your proposal
- Clearly state the question you will try to answer with your project
- Identify and describe the dataset that will be used to answer the question

# Preliminary exploratory data analysis (Alex): #
- Demonstrate that the dataset can be read from the web into Python
- Clean and wrangle your data into a tidy format
- Using only training data, summarize the data in at least one table (this is exploratory data analysis). An example of a useful table could be one that reports the number of observations in each class, the means of the predictor variables you plan to use in your analysis and how many rows have missing data. 
- Using only training data, visualize the data with at least one plot relevant to the analysis you plan to do (this is exploratory data analysis). An example of a useful visualization could be one that compares the distributions of each of the predictor variables you plan to use in your analysis.

## Demonstrating the Data Can Be Read from the Web Into Python: ##

The dataset that we have chosen for this project is a publicly available dataset from the website Kaggle, which has price data, square footage, and other details about Vancouver houses from its respective time period.

Dataset: https://www.kaggle.com/datasets/darianghorbanian/vancouver-home-price-analysis-regression

To read our dataset directly from the Kaggle website, we need to use the Kaggle API, which requires authenticating with a username and key.

First, we will import the libraries used throughout our project.

In [1]:
# importing the necessary libraries 
import pandas as pd
import altair as alt
from sklearn.model_selection import train_test_split

Next, we install the Kaggle package, which lets us interact with the Kaggle API as outlined in its documentation (https://www.kaggle.com/docs/api), and set our credentials as environment variables.

In [2]:
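# set up Kaggle for downloading the data set
!pip install kaggle
import os

# Kaggle API credentials; the key is replaced by a placeholder because
# a real API key should never be committed to a shared notebook
os.environ['KAGGLE_USERNAME'] = 'alexannn'
os.environ['KAGGLE_KEY'] = '<your-kaggle-api-key>'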

Collecting kaggle
  Downloading kaggle-1.6.6.tar.gz (84 kB)
  Preparing metadata (setup.py) ... done
Collecting python-slugify (from kaggle)
  Downloading python_slugify-8.0.4-py2.py3-none-any.whl.metadata (8.5 kB)
Collecting text-unidecode>=1.3 (from python-slugify->kaggle)
  Downloading text_unidecode-1.3-py2.py3-none-any.whl.metadata (2.4 kB)
Downloading python_slugify-8.0.4-py2.py3-none-any.whl (10 kB)
Downloading text_unidecode-1.3-py2.py3-none-any.whl (78 kB)
Building wheels for collected packages: kaggle
  Building wheel for kaggle (setup.py) ... done
  Created wheel for kaggle: filename=kaggle-1.6.6-py3-none-any.whl size=111943 sha256=2696c20a511fbc38aa2c6b9dfd2b2ffe0a819d942178f528ca353ceec8d83b4c
  Stored in dir

In [3]:
# download data set
!kaggle datasets download -d darianghorbanian/vancouver-home-price-analysis-regression --unzip

Downloading vancouver-home-price-analysis-regression.zip to /home/jovyan/work/vancouver_housing_predictions
  0%|                                               | 0.00/30.1k [00:00<?, ?B/s]
100%|██████████████████████████████████████| 30.1k/30.1k [00:00<00:00, 9.93MB/s]


In [4]:
home_prices = pd.read_csv("House sale data Vancouver.csv")
home_prices

| | Number | Address | List Date | Price | Days on market | Total floor area | Year Built | Age | Lot Size |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 3178 GRAVELEY STREET | 5/8/2020 | 1500000 | 18 | 2447 | 1946 | 74 | 5674.00 |
| 1 | 2 | 1438 E 28TH AVENUE | 1/22/2020 | 1300000 | 7 | 2146 | 1982 | 38 | 3631.98 |
| 2 | 3 | 2831 W 49TH AVENUE | 6/18/2019 | 2650000 | 1 | 3108 | 1929 | 90 | 9111.00 |
| 3 | 4 | 2645 TRIUMPH STREET | 6/18/2019 | 1385000 | 28 | 2602 | 1922 | 97 | 4022.70 |
| 4 | 5 | 741-743 E 10TH AVENUE | 11/28/2019 | 1590000 | 17 | 1843 | 1970 | 49 | 4026.00 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1297 | 1298 | 65 W KING EDWARD AVENUE | 8/22/2019 | 2630000 | 42 | 3035 | 1939 | 80 | 7456.00 |
| 1298 | 1299 | 3150 E 52ND AVENUE | 8/17/2019 | 1450000 | 14 | 2282 | 1974 | 45 | 3993.00 |
| 1299 | 1300 | 4478 PRINCE ALBERT STREET | 2/24/2020 | 2798000 | 4 | 3501 | 2016 | 4 | 3960.00 |
| 1300 | 1301 | 4038 MILLER STREET | 4/5/2019 | 900000 | 194 | 2440 | 1912 | 107 | 3297.00 |


## Cleaning and Wrangling the data into a tidy format: ##

The tidy data format adheres to the following three principles:
1. Each variable corresponds to a column.
2. Each observation corresponds to a row.
3. Each measurement is a cell value.

Fortunately, our dataset already meets these requirements and is therefore tidy. For simplicity, we keep only the two columns relevant to our analysis, `Price` and `Total floor area`, and drop the rest.

In [5]:
home_prices = home_prices[['Price', 'Total floor area']]
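
The summaries and plot that follow are meant to use only training data, and `train_test_split` is imported in the first cell but never called. Below is a minimal sketch of how the split could be done. The 75/25 proportion, the random seed, and the names `home_train` and `home_test` are assumptions rather than choices made in this notebook, and the small stand-in DataFrame reuses eight rows from the output shown earlier so that the block runs on its own.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in for home_prices, built from eight rows displayed earlier
home_prices = pd.DataFrame({
    "Price": [1500000, 1300000, 2650000, 1385000,
              1590000, 2630000, 1450000, 2798000],
    "Total floor area": [2447, 2146, 3108, 2602, 1843, 3035, 2282, 3501],
})

# Hold out 25% of the rows for testing (proportion and seed are assumptions)
home_train, home_test = train_test_split(
    home_prices, test_size=0.25, random_state=2024
)

print(len(home_train), len(home_test))
```

With a split like this in place, the summary table and the plot would be computed from `home_train` instead of the full `home_prices` DataFrame.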

## Summarizing the data in at least one table: ##

The following table provides basic numerical summaries of our predictor, total floor area. It includes the five-number summary (minimum, 25th percentile, median, 75th percentile, and maximum) for a quick overview, the mean, the standard deviation to represent spread, the number of missing values, and the total number of observations.

In [6]:
summary_table = pd.DataFrame({
    'Total floor area': [
        home_prices['Total floor area'].count(),
        round(home_prices['Total floor area'].mean(), 2),
        round(home_prices['Total floor area'].median(), 2),
        round(home_prices['Total floor area'].std(), 2),
        round(home_prices['Total floor area'].min(), 2),
        round(home_prices['Total floor area'].max(), 2),
        home_prices['Total floor area'].isnull().sum(),
        round(home_prices['Total floor area'].quantile(0.25), 2),
        round(home_prices['Total floor area'].quantile(0.75), 2)
    ]
}, index=['Count', 'Mean', 'Median', 'Std', 'Min', 'Max', 'Missing Values', '25th Percentile', '75th Percentile'])

# Display the summary table
summary_table

| | Total floor area |
|---|---|
| Count | 1302.0 |
| Mean | 2453.41 |
| Median | 2399.0 |
| Std | 700.05 |
| Min | 301.0 |
| Max | 6556.0 |
| Missing Values | 0.0 |
| 25th Percentile | 1993.0 |
| 75th Percentile | 2837.0 |
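
The same kind of summary can be produced more compactly, and for `Price` as well as `Total floor area`, with pandas' built-in `describe()`. The sketch below assumes a training DataFrame named `home_train` (a hypothetical name) and uses a small stand-in built from rows shown earlier, so the numbers it prints are not the notebook's results.

```python
import pandas as pd

# Stand-in training data (hypothetical), reusing rows from the displayed output
home_train = pd.DataFrame({
    "Price": [1500000, 1300000, 2650000, 1385000,
              1590000, 2630000, 1450000, 2798000],
    "Total floor area": [2447, 2146, 3108, 2602, 1843, 3035, 2282, 3501],
})

# describe() reports count, mean, std, min, quartiles, and max per column
summary = home_train.describe().round(2)

# Append the number of missing values in each column as an extra row
summary.loc["missing"] = home_train.isnull().sum()

print(summary)
```

Here the `50%` row is the median, so the five-number summary is the `min`, `25%`, `50%`, `75%`, and `max` rows.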


## Visualize The Data: ##

As we have two quantitative variables in our dataset, with square footage as our explanatory variable and house selling price as our response variable, we've opted to visualize our data with a scatter plot. This choice lets us see the relationship between the two variables at a glance.

In [7]:
# record the minimum and maximum values in the Price column
min_price = home_prices['Price'].min()
max_price = home_prices['Price'].max()

# Create the scatter plot with adjusted y-axis scale
scatter_plot = alt.Chart(home_prices).mark_circle(opacity=0.5).encode(
    alt.X('Total floor area:Q', title='Total Floor Area (sq ft)'),
    alt.Y('Price:Q', scale=alt.Scale(domain=(min_price, max_price)), title='Price (CAD)'),
    tooltip=['Total floor area:Q', 'Price:Q']
).properties(
    width=600,
    height=600,
    title='Scatter Plot of Price vs. Total Floor Area'
)

scatter_plot

# Methods: #
The data analysis will be conducted using K-nearest neighbors (KNN) regression. The predictor variable will be the total floor area of each house, a feature that buyers weigh heavily when looking at properties. This makes it a useful variable for the model, which will predict a house's price from its size.

During our analysis, we will use cross-validation to choose K. A good K yields predictions that follow the overall trend in the training data without chasing the fluctuations of individual prices: too small a K overfits, while too large a K underfits. Choosing K this way should help the model generalize to the unseen test data. After determining K, we will train the regression model on the training set, make predictions on the test set, and compute the root mean squared prediction error (RMSPE) to evaluate the model.
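
The workflow described above could look roughly like the following sketch, which uses scikit-learn's `GridSearchCV` and `KNeighborsRegressor`. Everything data-related here is an assumption for illustration: the data are synthetic (not the Vancouver dataset), and the 5 folds, the K range of 1 to 30, and the 75/25 split are placeholder choices. RMSPE is computed as the square root of the mean squared difference between observed and predicted test prices.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data (assumption): price grows with floor area plus noise
rng = np.random.default_rng(0)
area = rng.uniform(1000, 4000, size=200)
price = 600 * area + rng.normal(0, 150_000, size=200)
homes = pd.DataFrame({"Total floor area": area, "Price": price})

train, test = train_test_split(homes, test_size=0.25, random_state=0)
X_train, y_train = train[["Total floor area"]], train["Price"]
X_test, y_test = test[["Total floor area"]], test["Price"]

# 5-fold cross-validation over K = 1..30, scored by (negated) RMSE
grid = GridSearchCV(
    KNeighborsRegressor(),
    param_grid={"n_neighbors": list(range(1, 31))},
    cv=5,
    scoring="neg_root_mean_squared_error",
)
grid.fit(X_train, y_train)
best_k = grid.best_params_["n_neighbors"]

# GridSearchCV refits the best model on all training data by default,
# so it can be used directly to predict on the held-out test set
predictions = grid.predict(X_test)
rmspe = float(np.sqrt(np.mean((y_test - predictions) ** 2)))

print("best K:", best_k)
print("RMSPE:", round(rmspe))
```

Because there is only one predictor, no scaling step is included; with several predictors on different scales, standardizing them before KNN would normally be advisable.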

We will visualize the results with a scatter plot of floor area (x-axis) against price (y-axis), overlaid with a prediction line that shows how well the regression model fits the data and captures the overall trend.
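
One way to obtain the data behind such a prediction line is to predict prices over an evenly spaced grid of floor areas. The sketch below is illustrative only: the training data are synthetic, `n_neighbors=10` is a placeholder rather than a tuned value, and the names `line_df` and `Predicted price` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in training data (assumption)
rng = np.random.default_rng(1)
area = rng.uniform(1000, 4000, size=150)
train = pd.DataFrame({
    "Total floor area": area,
    "Price": 600 * area + rng.normal(0, 150_000, size=150),
})

# K is a placeholder here; in the analysis it would come from cross-validation
knn = KNeighborsRegressor(n_neighbors=10)
knn.fit(train[["Total floor area"]], train["Price"])

# Evenly spaced floor areas spanning the range seen in the training data
line_df = pd.DataFrame({
    "Total floor area": np.linspace(
        train["Total floor area"].min(), train["Total floor area"].max(), 100
    )
})
line_df["Predicted price"] = knn.predict(line_df[["Total floor area"]])

print(line_df.head())
```

In Altair, `line_df` could then be drawn with `mark_line()` and layered over the scatter plot using the `+` operator.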


# Expected outcomes and significance (Vincent): #

We expect to find that as the total floor area increases, the selling price will also increase.

Being able to predict the selling price of a house will help home buyers and sellers understand whether a home is fairly priced, overpriced, or underpriced, which will inform pricing and negotiations.

This project could lead to further questions like:

1. Where are homes most expensive?
2. How does square footage affect how long it takes to sell a house?
3. How does the home's age affect its sale price?