# Modeling

In [17]:
import pandas as pd
import numpy as np
from scipy.stats import spearmanr
from src.static import DATA_DIR

In [18]:
df = pd.read_csv(f'{DATA_DIR}/preprocessed_data.csv')

In [19]:
df.dtypes

year                                             int64
record_number                                    int64
census_tract_2020                                int64
tract_income_ratio                               int64
affordability_cat                                int64
num_units                                      float64
num_bedrooms_0-1                               float64
num_bedrooms_>=2                               float64
affordability_level_>100%                      float64
affordability_level_>50, <=60%                 float64
affordability_level_>60, <=80%                 float64
affordability_level_>80, <=100%                float64
affordability_level_>=0, <=50%                 float64
enterprise_flag_fannie                            bool
enterprise_flag_freddie                           bool
date_of_mortgage_note_prior to year aquired       bool
date_of_mortgage_note_same year as acquired       bool
purpose_of_loan_improvement/rehab                 bool
purpose_of

## Train Test Split
The next step performs our train/test split by finding the cutoff year toward the end of the dataset that gives an approximate 80/20 split. The cumulative row counts per year shown below are compared against the target of `# of rows * 0.80`.

In [20]:
df.year.value_counts().sort_index().cumsum()

year
2018     7600
2019    14884
2020    22895
2021    30364
2022    37342
2023    41868
Name: count, dtype: int64

In [21]:
df.shape[0]*0.80

33494.4

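Based on the observed values, we use records from years after 2021 as the test set and all earlier records as the training set. The 2021 cutoff (30,364 rows, about 72.5% of the data) is closer to the 33,494-row target than the 2022 cutoff (37,342 rows). Splitting by year mimics real-world conditions, where future information is not allowed into the training data.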

In [22]:
train, test = (df[df.year <= 2021], df[df.year > 2021])
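
The cutoff choice above was made by eye. As a minimal sketch, the same selection could be automated by picking the year whose cumulative share of rows is closest to the target. The helper name `pick_cutoff_year` is hypothetical, and the per-year counts are reconstructed from the cumulative sums shown earlier rather than read from the real file.

```python
import pandas as pd

def pick_cutoff_year(years, target_frac=0.80):
    """Return the last training year: the year whose cumulative row share is closest to target_frac."""
    cum_frac = years.value_counts().sort_index().cumsum() / len(years)
    return int((cum_frac - target_frac).abs().idxmin())

# Per-year row counts derived from the cumulative sums above (assumption: stand-in for df.year)
rows_per_year = {2018: 7600, 2019: 7284, 2020: 8011, 2021: 7469, 2022: 6978, 2023: 4526}
years = (
    pd.Series(list(rows_per_year))
    .repeat(list(rows_per_year.values()))
    .reset_index(drop=True)
)

cutoff = pick_cutoff_year(years)
train_mask = years <= cutoff  # rows at or before the cutoff go to training

print(cutoff, round(train_mask.mean(), 3))
```

With the real data, `pick_cutoff_year(df.year)` would play the same role, and `df[df.year <= cutoff]` and `df[df.year > cutoff]` would reproduce the split used here.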

## EDA
### Correlation Testing

In [23]:
train.dtypes

year                                             int64
record_number                                    int64
census_tract_2020                                int64
tract_income_ratio                               int64
affordability_cat                                int64
num_units                                      float64
num_bedrooms_0-1                               float64
num_bedrooms_>=2                               float64
affordability_level_>100%                      float64
affordability_level_>50, <=60%                 float64
affordability_level_>60, <=80%                 float64
affordability_level_>80, <=100%                float64
affordability_level_>=0, <=50%                 float64
enterprise_flag_fannie                            bool
enterprise_flag_freddie                           bool
date_of_mortgage_note_prior to year aquired       bool
date_of_mortgage_note_same year as acquired       bool
purpose_of_loan_improvement/rehab                 bool
purpose_of

In [24]:
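def get_spearman_corr(df, target):
    """Spearman correlation and p-value of each numeric column against `target`."""
    correlations = dict()
    # select_dtypes(include=np.number) keeps only numeric columns, so the bool flags are skipped
    for col in df.drop(target, axis=1).select_dtypes(include=np.number).columns:
        corr, p_value = spearmanr(df[col], df[target])
        correlations[col] = {'spearman_corr': corr, 'p_value': p_value}
    # Transpose so each row is a column name with its correlation and p-value
    correlation_df = pd.DataFrame(correlations).T
    return correlation_df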
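
For context on what `spearmanr` measures: the Spearman coefficient is the Pearson correlation computed on the ranks of the two variables, so it captures monotonic rather than strictly linear relationships. The sketch below checks this on synthetic data. The variables `x` and `y` are invented for illustration and are not taken from the dataset.

```python
import numpy as np
from scipy.stats import spearmanr, rankdata

# Synthetic data (assumption): y is a noisy, monotonic, non-linear function of x
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = np.exp(x) + rng.normal(scale=0.1, size=200)

# Spearman correlation as reported by scipy
rho, p_value = spearmanr(x, y)

# Same quantity computed by hand: Pearson correlation of the ranks
rho_manual = np.corrcoef(rankdata(x), rankdata(y))[0, 1]

print(np.isclose(rho, rho_manual))
```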


In [35]:
corr_df = get_spearman_corr(train, 'affordability_cat')
corr_df[corr_df['p_value'] < 0.05].sort_values(by='spearman_corr')

| | spearman_corr | p_value |
|---|---|---|
| affordability_level_>=0, <=50% | -0.661867 | 0.0 |
| affordability_level_>50, <=60% | -0.438803 | 0.0 |
| num_affordable_units | -0.366007 | 0.0 |
| num_units | -0.101249 | 5.218657e-70 |
| num_bedrooms_>=2 | -0.067073 | 1.270486e-31 |
| census_tract_2020 | -0.032483 | 1.50056e-08 |
| year | 0.025289 | 1.04704e-05 |
| tract_income_ratio | 0.248511 | 0.0 |
| affordability_level_>60, <=80% | 0.271909 | 0.0 |
| affordability_level_>100% | 0.420299 | 0.0 |


We have a number of statistically significant correlations, with the key takeaways below:
- As the affordability category increases (less affordable), the number of more affordable units decreases.
- The overall number of units also decreases, though only slightly.
- The tract income ratio has only a small positive correlation with an increase in affordability category.
- Unsurprisingly, as the affordability category increases, the number of less affordable units increases.

These observations cement our understanding of the interplay between affordability_cat and the other variables. The most interesting observation is that, despite their apparent implied relationship, tract_income_ratio has only a weak positive correlation with the affordability category (ρ ≈ 0.25), suggesting that greater incomes do not strongly go hand in hand with lesser affordability.
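
With 30,364 training rows, even very small correlations pass the p < 0.05 filter (for example, `year` at about 0.025). It can help to screen on effect size as well. The sketch below does this on a handful of values copied from the table above. The `MIN_ABS_RHO` threshold of 0.1 is an arbitrary assumption for illustration, not a value used in this notebook.

```python
import pandas as pd

# A few rows copied from the affordability_cat correlation table above
reported = pd.DataFrame(
    {
        "spearman_corr": [-0.661867, -0.101249, -0.032483, 0.025289, 0.248511],
        "p_value": [0.0, 5.218657e-70, 1.50056e-08, 1.04704e-05, 0.0],
    },
    index=[
        "affordability_level_>=0, <=50%",
        "num_units",
        "census_tract_2020",
        "year",
        "tract_income_ratio",
    ],
)

MIN_ABS_RHO = 0.1  # hypothetical effect-size cutoff

# Keep rows that are both statistically significant and above the effect-size cutoff
notable = reported[
    (reported["p_value"] < 0.05) & (reported["spearman_corr"].abs() >= MIN_ABS_RHO)
]
print(notable.sort_values(by="spearman_corr"))
```

The same filter could be applied directly to `corr_df` in place of the hand-copied `reported` frame.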

In [33]:
corr_df = get_spearman_corr(train, 'census_tract_2020')
corr_df[corr_df['p_value'] < 0.05]

| | spearman_corr | p_value |
|---|---|---|
| year | -0.036293 | 2.516942e-10 |
| record_number | 0.015241 | 0.007912828 |
| tract_income_ratio | -0.459865 | 0.0 |
| affordability_cat | -0.032483 | 1.50056e-08 |
| num_units | 0.035473 | 6.288903e-10 |
| num_bedrooms_0-1 | 0.048626 | 2.293769e-17 |
| affordability_level_>50, <=60% | 0.040756 | 1.208105e-12 |
| affordability_level_>60, <=80% | 0.021181 | 0.0002233286 |
| affordability_level_>=0, <=50% | 0.065107 | 6.87205e-30 |
| num_affordable_units | 0.022377 | 9.633311e-05 |


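Among the significant results, only tract_income_ratio shows a sizeable correlation with census_tract_2020 (ρ ≈ -0.46); every other coefficient has a magnitude below 0.07. Because census tract codes are geographic identifiers rather than quantities, these rank correlations should be read with caution.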

In [34]:
corr_df = get_spearman_corr(train, 'tract_income_ratio')
corr_df[corr_df['p_value'] < 0.05]

| | spearman_corr | p_value |
|---|---|---|
| year | 0.030598 | 9.670157e-08 |
| census_tract_2020 | -0.459865 | 0.0 |
| affordability_cat | 0.248511 | 0.0 |
| num_bedrooms_0-1 | 0.02475 | 1.608354e-05 |
| num_bedrooms_>=2 | 0.012054 | 0.03569393 |
| affordability_level_>100% | 0.242876 | 0.0 |
| affordability_level_>50, <=60% | -0.183027 | 5.676378e-227 |
| affordability_level_>60, <=80% | 0.039513 | 5.669559e-12 |
| affordability_level_>80, <=100% | 0.220539 | 0.0 |
| affordability_level_>=0, <=50% | -0.22711 | 0.0 |


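Apart from census_tract_2020 (ρ ≈ -0.46), the correlations with tract_income_ratio are all weak (magnitude below 0.25). Their direction is consistent, however: higher tract income ratios go with more units in the less affordable bands (>80%) and fewer units in the most affordable bands (<=60%).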