# House Prices - Regression Predictions YData 2024
Team: AcadEMY

Teammates: Eran T, Maya L, Yair BH, Adir Golan.

TODO: add table of content with links

## Part 1 - EDA

In [None]:
from utils import load_house_prices_data
import matplotlib.pyplot as plt 

train_origin_df = load_house_prices_data('train')
train_features = train_origin_df.drop('SalePrice', axis='columns')

### 1.1 Which 3 features have the highest number of missing values?

In [None]:
from utils import calc_num_missing_vals_per_col, np
from plot_utils import plot_num_missing_values

num_of_nans = calc_num_missing_vals_per_col(train_features)

plot_num_missing_values(num_of_nans)

max_nans = num_of_nans.nlargest(3).index
print(f"Top 3 features with the most missing values: {max_nans.values}")
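The helper `calc_num_missing_vals_per_col` lives in `utils` and is not shown here. A minimal sketch of what such a helper could look like, assuming it returns a pandas Series of NaN counts indexed by column name (the function name `count_missing_per_column` and the toy data are hypothetical):

```python
import numpy as np
import pandas as pd


def count_missing_per_column(df: pd.DataFrame) -> pd.Series:
    """Return the NaN count of every column with at least one NaN, largest first."""
    counts = df.isna().sum()
    return counts[counts > 0].sort_values(ascending=False)


# Toy frame standing in for the training features (values are made up).
toy = pd.DataFrame({
    "PoolQC": [np.nan, np.nan, np.nan, "Gd"],
    "Alley": [np.nan, "Grvl", np.nan, "Pave"],
    "LotArea": [8450, 9600, 11250, 9550],
})

missing = count_missing_per_column(toy)
top = missing.nlargest(2).index.tolist()
print(top)
```

Returning a Series keeps the `nlargest(3).index` call used above working unchanged.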

### 1.2 How does the price behave over the years?

In [None]:
from plot_utils import plot_price_dist_per_year

plot_price_dist_per_year(train_origin_df)

Graph Insight:
Over the 4 years shown, the mean price has fluctuated: it rose until 2007 and then moved downward overall.

### 1.3 Plotting feature distribution using histograms

In [None]:
from plot_utils import plot_column_histograms

plot_column_histograms(train_origin_df)

Graph insights:
- Many categorical features are unbalanced, including "SaleType", "GarageCond", "PavedDrive", "Street".
- Some numerical features resemble a normal distribution: "OverallQual", "TotRmsAbvGrd", "GarageArea".
- There is a consistent increase in the number of houses built per year.
- There is seasonality in the month sold: most sales happen in summer (June, July) and the fewest happen in autumn (September, October).

### 1.4 Computing Feature Correlation to Label

#### Numeric Features

In [None]:
from plot_utils import plot_numeric_features_correlation_to_target

corr_vector = train_origin_df.select_dtypes(include='number').corr()['SalePrice'].sort_values().drop('SalePrice', axis=0)

plot_numeric_features_correlation_to_target(corr_vector)

Graph insights:
- The number of kitchens above ground has the most negative correlation with the house price.
- "BsmtFinSF2" has little to no correlation with the price.
- "OverallQual" has the highest positive correlation, while "OverallCond" has a slightly negative correlation, suggesting that physical condition matters less than the more subjective quality rating.
- The top 5 features correlated with price suggest that people value quality, living area and garage space.

#### Categorical Features

In [None]:

from plot_utils import plot_head_and_tail_categorical_corr_to_target
from utils import calc_categorical_feature_correlation_to_target

sorted_correlation = calc_categorical_feature_correlation_to_target(train_origin_df)

plot_head_and_tail_categorical_corr_to_target(sorted_correlation)
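The measure used inside `calc_categorical_feature_correlation_to_target` is not shown in this notebook. One common choice for relating a categorical feature to a numeric target is the correlation ratio (eta squared), the share of the target's variance explained by the category means. The sketch below is an assumption about how such a score could be computed, not necessarily what the helper does; `correlation_ratio` and the toy data are hypothetical:

```python
import pandas as pd


def correlation_ratio(categories: pd.Series, target: pd.Series) -> float:
    """Eta squared: between-group variance of the target over its total variance."""
    overall_mean = target.mean()
    grouped = target.groupby(categories)  # rows with a NaN category are left out
    between = (grouped.size() * (grouped.mean() - overall_mean) ** 2).sum()
    total = ((target - overall_mean) ** 2).sum()
    return float(between / total) if total > 0 else 0.0


# Toy data: "Quality" separates prices well, "Colour" does not.
toy = pd.DataFrame({
    "Quality": ["Ex", "Ex", "Fa", "Fa"],
    "Colour": ["red", "blue", "red", "blue"],
    "SalePrice": [300, 310, 100, 110],
})

eta_quality = correlation_ratio(toy["Quality"], toy["SalePrice"])
eta_colour = correlation_ratio(toy["Colour"], toy["SalePrice"])
print(round(eta_quality, 4), round(eta_colour, 4))
```

A value near 1 means the category almost determines the price, and a value near 0 means the category means barely differ.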

### 1.5 More EDA that will help us understand the data and support our modelling decisions

In [None]:
# todo - what will be our modeling decision?
# todo - what graphs will support this?

#### Feature selection (searching features that can be dropped)
The idea is that due to the large number of features in the original dataset (80, not including the target), it might be beneficial to reduce the number of features. We do this in different ways:

##### Highly correlated numerical features
We looked for highly correlated features and decided to drop one of each pair:

In [None]:
from utils import calc_numeric_feature_correlation

numeric_correlations = calc_numeric_feature_correlation(train_features)
threshold = 0.7
highly_correlated_numeric_features = [t for t in numeric_correlations if t[2] >= threshold]

print(highly_correlated_numeric_features)
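The list comprehension above indexes `t[2]`, so `calc_numeric_feature_correlation` presumably returns tuples of the form `(feature_a, feature_b, correlation)`. A minimal sketch of a helper with that shape, assuming absolute Pearson correlation over each unordered pair of numeric columns (the name `numeric_correlation_pairs` and the toy data are hypothetical):

```python
from itertools import combinations

import pandas as pd


def numeric_correlation_pairs(df: pd.DataFrame) -> list:
    """Return (feature_a, feature_b, |corr|) for every pair of numeric columns."""
    corr = df.select_dtypes(include="number").corr().abs()
    pairs = [(a, b, float(corr.loc[a, b])) for a, b in combinations(corr.columns, 2)]
    return sorted(pairs, key=lambda t: t[2], reverse=True)


# Toy frame: the two garage columns move together, the month does not.
toy = pd.DataFrame({
    "GarageCars": [1, 2, 2, 3, 3],
    "GarageArea": [250, 480, 500, 720, 760],
    "MoSold": [6, 1, 7, 2, 6],
})

pairs = numeric_correlation_pairs(toy)
high = [t for t in pairs if t[2] >= 0.7]
print(high)
```

Using `combinations` reports each pair once, which makes "drop one of each pair" straightforward to apply.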

In [None]:
#features to drop due to high correlation with another feature (one from each pair):
# we drop features that are not common to all samples (e.g., all buildings must have YearBuilt but not necessarily GarageYrBlt)
high_correlated_features_to_drop = ['GarageYrBlt', '1stFlrSF', 'TotRmsAbvGrd', 'GarageCars']

##### Correlation of categorical object type features with the target
By plotting the distribution of the target for each category of every categorical feature, we can pick out features that seem to hold little meaningful information (mostly features with an approximately uniform distribution or a highly imbalanced distribution).

In [None]:
# Finding correlation (indirectly) between 'object' features and target:
from plot_utils import plot_mean_price_and_stddev_per_category

plot_mean_price_and_stddev_per_category(train_origin_df)

In [None]:
# Object features that show low correlation to target (by indirect impression):
cat_cols_uncor_w_target = ['LotShape', 'LandContour', 'LotConfig',
                           'LandSlope', 'Condition2', 'RoofMatl', 'BsmtExposure',
                           'BsmtFinType1', 'BsmtFinType2', 'Electrical',
                           'Functional', 'Fence', 'MiscFeature'
                           ]

##### Categorical features with imbalanced data

In [None]:
# categorical features to drop due to high imbalance of the data:
drop_imbalanced = ['Heating', 'Alley', 'Street', 'Utilities']


In [None]:
# features with NaN values that reflect 'None' and should not be discarded (should be counted):
convert_nan_to_str = ['BsmtQual', 'BsmtCond', 'FireplaceQu', 'GarageType',
                      'GarageFinish', 'GarageQual', 'GarageCond'
                      ]

# features with the same problem, but already dropped for other reasons:
# ['BsmtExposure', 'BsmtFinType1', 'PoolQC', 'MiscFeature']

In [None]:
# filtering the data frame according to selected features to drop:
filtered_df = train_origin_df.drop(high_correlated_features_to_drop, axis=1)
filtered_df = filtered_df.drop(cat_cols_uncor_w_target, axis=1)
filtered_df.drop(drop_imbalanced, axis=1, inplace=True)

for feature in convert_nan_to_str:
    # assign back: chained inplace fillna may not update filtered_df in recent pandas
    filtered_df[feature] = filtered_df[feature].fillna('No')

##### Feature engineering on pool information + Filling na in LotFrontage

In [None]:
from preprocessing import preprocess

# only 7 samples with pool, but might be important, so:
# we create new *binary* feature 'HavePool' and drop 'PoolQC' 'PoolArea'

# And replacing missing values in 'LotFrontage' with mean values:

filtered_preprocessed_df = preprocess(filtered_df)
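The body of `preprocess` is not shown here. Based on the comments in the cell above, it could be sketched roughly as follows, assuming the binary flag is derived from `PoolArea > 0` (the function name `add_pool_flag_and_fill_frontage` and the toy data are hypothetical):

```python
import numpy as np
import pandas as pd


def add_pool_flag_and_fill_frontage(df: pd.DataFrame) -> pd.DataFrame:
    """Add a binary 'HavePool' column, drop the pool columns, mean-fill 'LotFrontage'."""
    out = df.copy()
    # Assumption: a house has a pool exactly when its pool area is positive.
    out["HavePool"] = (out["PoolArea"] > 0).astype(int)
    out = out.drop(columns=["PoolQC", "PoolArea"])
    out["LotFrontage"] = out["LotFrontage"].fillna(out["LotFrontage"].mean())
    return out


toy = pd.DataFrame({
    "PoolArea": [0, 512, 0],
    "PoolQC": [np.nan, "Ex", np.nan],
    "LotFrontage": [60.0, np.nan, 80.0],
})

result = add_pool_flag_and_fill_frontage(toy)
print(result)
```

Working on a copy leaves the input frame untouched, so the cell can be re-run safely.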

##### Treating missing values

In [None]:
# replacing missing values in 'LotFrontage' with the column mean:
mean_value_LotFrontage = filtered_df['LotFrontage'].mean()
filtered_df['LotFrontage'] = filtered_df['LotFrontage'].fillna(mean_value_LotFrontage)

#### Understanding The Data

In [None]:
from plot_utils import plot_number_of_sales_and_prices_across_time

plot_number_of_sales_and_prices_across_time(filtered_preprocessed_df)

The plots above show fairly clear seasonality in two quantities:
* Sales seasonality: the number of sales peaks every year around May-June, then decreases from June to January, with January typically the weakest month.
* Price seasonality: sale prices also show some seasonality, although it is less consistent than for the number of sales.

Interestingly, the combined plot shows some periods where the number of sales drops drastically while the average price hits a peak.
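The aggregation behind a plot like this can be sketched with a plain `groupby`. This is a minimal illustration on made-up data, assuming the plot helper counts sales and averages `SalePrice` per `YrSold`/`MoSold` pair; the helper's actual implementation is not shown in this notebook:

```python
import pandas as pd

# Toy sales records (values are made up).
toy = pd.DataFrame({
    "YrSold": [2006, 2006, 2006, 2007, 2007],
    "MoSold": [6, 6, 1, 6, 1],
    "SalePrice": [200000, 220000, 150000, 210000, 160000],
})

# Number of sales and mean price for every (year, month) pair.
monthly = (
    toy.groupby(["YrSold", "MoSold"])["SalePrice"]
    .agg(num_sales="count", mean_price="mean")
    .reset_index()
)

# Collapsing over years exposes the month-of-year pattern.
by_month = monthly.groupby("MoSold")["num_sales"].sum()
print(monthly)
print(by_month)
```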

#### Influence of remodeling

In [None]:
# housing_train is not defined in this notebook; train_origin_df (loaded in Part 1) is used instead
train_origin_df.loc[train_origin_df['YearRemodAdd'] == train_origin_df['YearBuilt'], 'SalePrice'].mean()

In [None]:
train_origin_df['SalePrice'].mean()

**The average price of houses that were never remodeled is a bit higher than the overall average price.
This finding raised suspicions until we thought of looking at the houses' ages:**

##### Average age of never remodeled houses:

In [None]:
2024 - train_origin_df.loc[train_origin_df['YearRemodAdd'] == train_origin_df['YearBuilt'], 'YearBuilt'].mean()

##### Average age of all houses:

In [None]:
2024 - train_origin_df['YearBuilt'].mean()

##### Average age of houses that were remodeled at some point:

In [None]:
2024 - train_origin_df.loc[train_origin_df['YearRemodAdd'] != train_origin_df['YearBuilt'], 'YearBuilt'].mean()

**As can be seen, houses that were never remodeled are simply younger on average, which can explain the result above.
Furthermore, the mean age of never-remodeled houses is ~40 years, and the ydata-profiling report shows that for houses around that age, YearBuilt is still
a significant factor for price (for older houses age matters much less: 60 or 90 years makes little difference).**
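The comparison above can also be computed in a single pass by grouping on a remodeled flag. A minimal sketch on made-up data, relying on the dataset convention that `YearRemodAdd` equals `YearBuilt` when a house was never remodeled (the `Remodeled` and `Age` column names are hypothetical):

```python
import pandas as pd

# Toy houses (values are made up).
toy = pd.DataFrame({
    "YearBuilt": [2000, 2005, 1950, 1960],
    "YearRemodAdd": [2000, 2005, 1995, 2000],
    "SalePrice": [250000, 270000, 180000, 200000],
})

# A house counts as remodeled when the remodel year differs from the build year.
remodeled = toy["YearRemodAdd"] != toy["YearBuilt"]

summary = (
    toy.assign(Remodeled=remodeled, Age=2024 - toy["YearBuilt"])
    .groupby("Remodeled")[["SalePrice", "Age"]]
    .mean()
)
print(summary)
```

Putting mean price and mean age side by side makes the age confound visible in one table.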

## Part 2 - Baseline Model

In [None]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer, make_column_selector

**As a baseline model we chose a simple linear regression.
We tried two approaches: the first includes the categorical features and the second excludes them.
In the first approach the categorical values were converted to integer codes using pandas `factorize`.
Surprisingly (?), the second approach performed much better: its model scored 0.20614 upon submission,
whereas the model based on the first approach scored 0.77086 (lower is better).**

### Features imputation

**Most features don't have missing values. Below we explain how we preprocessed for those that do have missing values.**

* According to the data_description file, the following columns have NA values for houses that lack the actual feature: BsmtQual, BsmtCond, BsmtExposure, FireplaceQu, GarageType, GarageFinish, GarageQual, GarageCond, PoolQC, MiscFeature.
Therefore, we decided to replace NA with 'None'.

* In features BsmtFinType1 and BsmtFinType2, 'Unf' value means unfinished. However, many houses have this value (430 houses have it for BsmtFinType1 and 1256 have it for BsmtFinType2), which seems unlikely. So 'Unf's were also changed to 'None'.

* GarageYrBlt is a numeric feature, yet imputing NAs with the mean or something similar didn't look like the right way to go here. A 'None' string would have complicated this feature, as it's supposed to be numeric. We decided to impute it with the YearBuilt value.
_Note that we're strongly considering dropping this feature in our final model, as it is highly correlated with YearBuilt anyway._

* LotFrontage (numeric) is ~normally distributed; its median and mean are very close to each other. However, the mode is a bit lower and very common. For now all numeric features (except GarageYrBlt) are imputed with the mean, but imputation with the mode will be considered for this feature in the final model.

* MasVnrArea (numeric) - more than half of the values are just 0. The remaining values are roughly normally distributed, though with a very "flat" bell and somewhat left-skewed.
The same comment applies about the mode (here it is of course 0) as an imputation option in the final model.

* MasVnrType (categorical) - at this point imputed with 'None'. However, there is one very common value (BrkFace), so imputing with the most frequent value will be considered.

* Fence - many missing values, imputed with 'None'. Imputing with the most frequent value should be considered here too, although the most frequent value is less distinct.

* Electrical - has only 1 missing value which will be imputed with 'None'.

* Alley (categorical) - vast majority of values are NA's and will be replaced with 'None'. The values that do exist are divided ~50-50 between the 2 possible options.

### Model with both numeric and categorical values:

In [None]:
train_for_model = train_origin_df.copy()

train_for_model['BsmtFinType1'] = train_for_model['BsmtFinType1'].replace('Unf', 'None')
train_for_model['BsmtFinType2'] = train_for_model['BsmtFinType2'].replace('Unf', 'None')
train_for_model['GarageYrBlt'] = train_for_model['GarageYrBlt'].fillna(train_for_model['YearBuilt'])

imputation_pipe = Pipeline([
    ("preprocess", ColumnTransformer(
        transformers=[
        ("mean_imputer", SimpleImputer(missing_values=pd.NA, strategy='mean'),
         make_column_selector(dtype_include='number')),
        ("category_imputer", SimpleImputer(missing_values=pd.NA, strategy='constant', fill_value='None'),
         make_column_selector(dtype_include='object'))
        ],
        verbose_feature_names_out=False).set_output(transform='pandas')
    )
])

y_train = train_for_model['SalePrice']
X_train = train_for_model.drop(columns='SalePrice')

imputation_pipe.fit(X_train)
X_train_wrong_column_order = imputation_pipe.transform(X_train)
X_train = X_train_wrong_column_order[X_train.columns]

categorical_columns = X_train.select_dtypes(include='object').columns
for col in categorical_columns:
    X_train[col], _ = pd.factorize(X_train[col])

simple_linear_model = LinearRegression()
simple_linear_model.fit(X_train, y_train)

X_test = pd.read_csv('test.csv')
X_test['BsmtFinType1'] = X_test['BsmtFinType1'].replace('Unf', 'None')
X_test['BsmtFinType2'] = X_test['BsmtFinType2'].replace('Unf', 'None')
X_test['GarageYrBlt'] = X_test['GarageYrBlt'].fillna(X_test['YearBuilt'])

X_test_wrong_column_order = imputation_pipe.transform(X_test)
X_test = X_test_wrong_column_order[X_test.columns]

for col in categorical_columns:
    X_test[col], _ = pd.factorize(X_test[col])
    
y_pred = simple_linear_model.predict(X_test)
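One thing worth checking in the cell above: `pd.factorize` is called separately on the train and test frames, and it assigns codes in order of first appearance, so the same category can receive different integers in the two sets. This may contribute to the weak score of this approach, although that has not been verified here. A minimal sketch of the effect on made-up data, together with one way to reuse the training categories (unseen test categories get the code -1):

```python
import pandas as pd

# Toy train/test columns (values are made up).
train = pd.DataFrame({"Street": ["Pave", "Grvl", "Pave"]})
test = pd.DataFrame({"Street": ["Grvl", "Pave", "Dirt"]})

# Learn the category -> code mapping on the training data only.
train_codes, uniques = pd.factorize(train["Street"])

# Reuse the training mapping for the test data; unknown categories become -1.
test_codes = uniques.get_indexer(test["Street"])

# For comparison: factorizing the test set on its own yields different codes.
independent_codes, _ = pd.factorize(test["Street"])

print(train_codes.tolist(), test_codes.tolist(), independent_codes.tolist())
```

Even with consistent codes, integers impose an arbitrary order on the categories, which a linear model will interpret literally. One-hot encoding is a common alternative.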

### Model with numeric values only:

In [None]:
train_for_model = train_origin_df.copy()
train_for_model['GarageYrBlt'] = train_for_model['GarageYrBlt'].fillna(train_for_model['YearBuilt'])

y_train = train_for_model['SalePrice']
X_train = train_for_model.drop(columns='SalePrice').select_dtypes(include='number')

simple_imputer = SimpleImputer(missing_values=pd.NA, strategy='mean')
simple_imputer.set_output(transform='pandas')
simple_imputer.fit(X_train)
X_train = simple_imputer.transform(X_train)

simple_linear_model = LinearRegression()
simple_linear_model.fit(X_train, y_train)

X_test = pd.read_csv('test.csv').select_dtypes(include='number')
X_test['GarageYrBlt'] = X_test['GarageYrBlt'].fillna(X_test['YearBuilt'])
X_test = simple_imputer.transform(X_test)

y_pred = simple_linear_model.predict(X_test)
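Both baselines were scored by submitting predictions. Assuming the scores quoted above are the Kaggle leaderboard metric (RMSE between the logarithm of the predicted price and the logarithm of the observed price), a similar number can be estimated locally with cross-validation before submitting. The sketch below uses synthetic data in place of the real features; the data and the clipping floor are assumptions:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in for one numeric feature (living area) and the sale price.
rng = np.random.default_rng(0)
X = rng.uniform(500, 3000, size=(200, 1))
y = 50_000 + 100 * X[:, 0] + rng.normal(0, 10_000, size=200)

# Out-of-fold predictions: every row is predicted by a model that never saw it.
pred = cross_val_predict(LinearRegression(), X, y, cv=5)

# A linear model can predict non-positive prices, for which the log is undefined.
pred = np.clip(pred, 1, None)

rmsle = float(np.sqrt(np.mean((np.log(pred) - np.log(y)) ** 2)))
print(f"cross-validated RMSE of log prices: {rmsle:.4f}")
```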