# Milestone 1: Evaluation metrics

# Title: Predicting the value of houses in Strathcona County
## Summary:
Our team will predict house prices using the 2023 Property Tax Assessment dataset from the Strathcona County Open Data portal. The dataset describes individual houses through attributes such as size and the presence of a garage or fireplace. Using this data, we aim to build a predictive model that accurately estimates house values.
## Introduction:
The team will use `Ridge`, a linear model, to predict the value of houses. Ridge regression adds an L2 regularization penalty to linear regression, which mitigates overfitting and improves model stability, especially when features are highly correlated. This helps the model generalize well to new data.
The question we aim to answer: Can we predict house prices using publicly available housing data, and which features most influence the predictions?
Data description: For this project we are going to use the 2023 Property Tax Assessment from the Strathcona County Open Data portal. The data set contains attributes describing individual houses. The variables we selected for the model are: <br>
                `meters` - numeric variable that shows the size of the house <br>
                `garage` - categorical variable where Y means there is a garage and N means no garage <br>
                `firepl` - categorical variable where Y means there is a fireplace and N means no fireplace <br>
                `bsmt` - categorical variable coded Y or N, used as a feature in the code below <br>
                `bdevl` - categorical variable where Y means the building was evaluated and N means it was not evaluated <br>
The data set was chosen for its rich feature set, adequate sample size, and public availability, which make it suitable for building a predictive model.
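
To illustrate the stability claim about correlated features, here is a minimal sketch on synthetic data (not the county data set) in which one feature is nearly a copy of another. All variable names and the choice of `alpha=1.0` are assumptions made for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic data: x2 is almost an exact copy of x1 (highly correlated features)
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(scale=0.5, size=n)

# Fit an unregularized linear model and a Ridge model on the same data
ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# Compare the overall size of the coefficient vectors
ols_norm = np.linalg.norm(ols.coef_)
ridge_norm = np.linalg.norm(ridge.coef_)
print("OLS coefficients:  ", ols.coef_, "norm:", ols_norm)
print("Ridge coefficients:", ridge.coef_, "norm:", ridge_norm)
```

The L2 penalty shrinks the coefficient vector, so the Ridge norm comes out smaller than the unregularized one; with nearly duplicated features this tends to spread the weight across both columns instead of letting them take large offsetting values.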

In [21]:
import pandas as pd
from sklearn.model_selection import train_test_split
import altair_ally as aly
import altair as alt

In [6]:
housing_df = pd.read_csv("data/2023_Property_Tax_Assessment.csv")
housing_df = housing_df[['meters','garage','firepl','bsmt','bdevl','assess_2022']]
housing_df

Unnamed: 0,meters,garage,firepl,bsmt,bdevl,assess_2022
0,150.590,Y,Y,Y,N,382460
1,123.560,N,Y,N,N,280370
2,104.980,N,N,N,N,402000
3,66.611,N,N,N,N,3690
4,123.830,Y,Y,Y,Y,295910
...,...,...,...,...,...,...
35756,121.330,Y,Y,Y,Y,363000
35757,132.470,Y,Y,Y,N,355000
35758,121.330,Y,Y,Y,N,347000
35759,121.330,Y,Y,Y,Y,363000


In [None]:
alt.data_transformers.enable("vegafusion")

aly.dist(housing_df.assign(garage=lambda df: df['garage'].astype(object)), dtype='object').properties(
    title="Counts of categorical features"
)

In [30]:
grg = alt.Chart(housing_df).mark_point().encode(
    x='garage',
    y='assess_2022',
)

frp = alt.Chart(housing_df).mark_point().encode(
    x='firepl',
    y='assess_2022',
)

bst = alt.Chart(housing_df).mark_point().encode(
    x='bsmt',
    y='assess_2022',
)

bdl = alt.Chart(housing_df).mark_point().encode(
    x='bdevl',
    y='assess_2022',
)

(grg | frp | bst | bdl).properties(
    title="House value assessment per categorical feature"
)

In [7]:
train_df, test_df = train_test_split(housing_df, test_size=0.3, random_state=123)

In [8]:
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Lists of feature names
categorical_features = ['garage', 'firepl', 'bsmt', 'bdevl']
numeric_features = ['meters']
# Create the column transformer
preprocessor = make_column_transformer(
    (OneHotEncoder(), categorical_features),  # One-hot encode categorical columns (all categories are kept)
    (StandardScaler(), numeric_features),  # Standardize numeric columns
)

# Show the preprocessor
preprocessor

In [9]:
X_train = train_df.drop(columns=["assess_2022"])
X_test = test_df.drop(columns=["assess_2022"])
y_train = train_df["assess_2022"]
y_test = test_df["assess_2022"]

In [10]:
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline

# The Ridge model pipeline
pipeline = make_pipeline(preprocessor, Ridge())

# The mean and std of the cross validated scores for all metrics as a dataframe
cross_val_results = (
    pd.DataFrame(cross_validate(pipeline, X_train, y_train, cv=5, return_train_score=True))
    .agg(['mean', 'std'])
    .round(3)
    .T
)

# Show the train and validation scores
cross_val_results

Unnamed: 0,mean,std
fit_time,0.032,0.003
score_time,0.008,0.002
test_score,0.04,1.572
train_score,0.695,0.09
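
A mean validation score of 0.04 with a standard deviation of 1.572 means the folds disagree strongly: the default score for `Ridge` is R², which cannot exceed 1, so at least one fold must score well below zero. One way to investigate is to keep the per-fold scores and request error metrics in the target's units alongside R². The sketch below runs on synthetic stand-in data; `demo_df`, `demo_y`, and the formula that generates them are hypothetical, and only the `cross_validate` call pattern is meant to carry over.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical stand-in data with a few of the same column names
rng = np.random.default_rng(123)
n = 500
demo_df = pd.DataFrame({
    "meters": rng.uniform(60, 250, size=n),
    "garage": rng.choice(["Y", "N"], size=n),
    "firepl": rng.choice(["Y", "N"], size=n),
})
demo_y = (
    2000 * demo_df["meters"]
    + 30000 * (demo_df["garage"] == "Y")
    + rng.normal(0, 20000, size=n)
)

demo_pipeline = make_pipeline(
    make_column_transformer(
        (OneHotEncoder(), ["garage", "firepl"]),
        (StandardScaler(), ["meters"]),
    ),
    Ridge(),
)

# Several metrics at once; scikit-learn negates error metrics so higher is better
scoring = {
    "r2": "r2",
    "neg_mae": "neg_mean_absolute_error",
    "neg_rmse": "neg_root_mean_squared_error",
}
fold_scores = pd.DataFrame(
    cross_validate(demo_pipeline, demo_df, demo_y, cv=5, scoring=scoring)
)

# One row per fold, so an outlying fold is visible before averaging
print(fold_scores[["test_r2", "test_neg_mae", "test_neg_rmse"]])
```

Looking at the rows before aggregating shows whether a single fold drives the low mean, which a `mean`/`std` summary alone hides.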


In [11]:
pipeline.fit(X_train, y_train)
pipeline.score(X_test, y_test)


0.34617887045802354

In [12]:
X_predict = pd.read_csv("data/2023_Property_Assessment_Predictions.csv").drop(columns='assess_2022')

In [13]:
y_predict = pd.DataFrame(pipeline.predict(X_predict), columns=['Predicted_Values'])

In [14]:
predictions_df = pd.concat([X_predict, y_predict], axis=1)
predictions_df

Unnamed: 0,meters,garage,firepl,bsmt,bdevl,Predicted_Values
0,174.23,Y,Y,Y,N,527136.172807
1,132.76,Y,N,Y,Y,451521.996667
2,90.82,Y,Y,N,Y,390750.919488
3,68.54,N,Y,N,N,224816.005694
4,221.3,Y,N,Y,Y,654274.738135
5,145.03,N,N,N,Y,419325.088911
6,102.96,N,N,Y,Y,328692.750689
7,164.28,Y,Y,N,N,498644.978037
8,142.79,N,Y,Y,N,400551.3983
9,115.94,Y,Y,Y,Y,453980.753156


In [None]:
grg2 = alt.Chart(predictions_df).mark_line().encode(
    x='meters',
    y='Predicted_Values',
    color = 'garage'
)

frp2 = alt.Chart(predictions_df).mark_line().encode(
    x='meters',
    y='Predicted_Values',
    color = 'firepl'
)

bst2 = alt.Chart(predictions_df).mark_line().encode(
    x='meters',
    y='Predicted_Values',
    color = 'bsmt'
)

bdl2 = alt.Chart(predictions_df).mark_line().encode(
    x='meters',
    y='Predicted_Values',
    color = 'bdevl'
)

grg2 | frp2 | bst2 | bdl2