# Predicting Gentrification
*A study into planning application features that can help to signal early warnings of gentrification*
</br></br></br></br>
`Notebook 4: Classification Models`</br>
Author: Mariia Shapovalova</br>
Date: June, 2023

In [2]:
import pandas as pd
import numpy as np

#visualisations
import matplotlib.pyplot as plt
import seaborn as sns

import re
import random
import joblib

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

from sklearn.preprocessing import OneHotEncoder

# To set up a temporary directory for caching pipeline results
from tempfile import mkdtemp
import tempfile

# To build pipelines
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

# To build custom column transformers
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer

# To do a cross-validated grid search
from sklearn.model_selection import GridSearchCV

import warnings
warnings.filterwarnings('ignore')

In [11]:
import sys
sys.path.append('..')  # Add the parent directory to the Python path
from functions import *

* Instead of aggregating the rows, use `.transform()` to generate the actual class for each row
* Test with the combined metric
* Add more NLP features
* Test different classification thresholds
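
As a rough illustration of the last item, the sketch below scores one fitted model at several probability cut-offs. It assumes a binary target and uses synthetic data from `make_classification`, not the permit dataset; the thresholds 0.3, 0.5 and 0.7 are arbitrary picks for demonstration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (an assumption, not the permit data)
X, y = make_classification(n_samples=1000, n_features=8,
                           weights=[0.8, 0.2], random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Probability of the positive class for each validation row
proba = model.predict_proba(X_val)[:, 1]

threshold_scores = {}
for threshold in (0.3, 0.5, 0.7):
    # Predict positive only when the probability reaches the cut-off
    y_pred = (proba >= threshold).astype(int)
    threshold_scores[threshold] = (
        precision_score(y_val, y_pred, zero_division=0),
        recall_score(y_val, y_pred, zero_division=0),
    )
    precision, recall = threshold_scores[threshold]
    print(f'threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}')
```

Raising the threshold can only shrink the set of predicted positives, so recall never increases as the cut-off goes up; precision usually, but not always, moves the other way.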

---

## Load Data

In [13]:
df=pd.read_csv('../../data/clean/merged_income_df',index_col=0)

In [14]:
overview(df)

The dataframe shape is (724869, 29)


| Column_Name | Data Types | Total Null Values | Null Values Percentage | Sample Value Head | Sample Value Tail | Sample Value |
|---|---|---|---|---|---|---|
| PERMIT_TYPE | object | 0 | 0.0 | RENOVATION/ALTERATION | ELECTRIC WIRING | RENOVATION/ALTERATION |
| REVIEW_TYPE | object | 0 | 0.0 | STANDARD PLAN REVIEW | EASY PERMIT WEB | SELF CERT |
| WORK_DESCRIPTION | object | 0 | 0.0 | INTERIOR REMODELING OF EXISTING 3 D.U. PER PLA... | INSTALLATION OF (1) 30 AMP DUAL POLE BREAKER A... | SELF CERT: INTERIOR ALTERATIONS TO 10TH FLOOR ... |
| CONTACT_1_TYPE | object | 0 | 0.0 | OWNER AS GENERAL CONTRACTOR | CONTRACTOR-ELECTRICAL | SELF CERT ARCHITECT |
| CONTACT_1_CITY | object | 0 | 0.0 | CHICAGO | CHICAGO | OAK BROOK |
| CONTACT_1_STATE | object | 0 | 0.0 | IL | IL | IL |
| LOG_PROCESSING_TIME | float64 | 0 | 0.0 | 4.394449 | 1.791759 | -23.025851 |
| LOG_BUILDING_FEE_PAID | float64 | 0 | 0.0 | 4.828314 | 3.688879 | 6.948475 |
| LOG_ZONING_FEE_PAID | float64 | 0 | 0.0 | 4.317488 | -23.025851 | 4.317488 |
| LOG_OTHER_FEE_PAID | float64 | 0 | 0.0 | -23.025851 | -23.025851 | -23.025851 |


***
<center><h2>Test/Remainder Split</h2></center>

* Test/Validation/Train Splits will be done based on the census tracts (geographies)
* Let's select 20% of the distinct census tracts and separate them in the test dataframe

In [15]:
import random

test_size=0.2

#create a set of distinct census tracts
geo_set=set(df['Census_Tract'])

#measure its length and multiply by the specified test size
test_len=int(len(geo_set)*test_size)

random.seed(42)
#select a random subset of distinct census tracts of the required size
geo_test=random.sample(list(geo_set),k=test_len)

#create test mask by testing if census tracts belong to the test subset
test_mask=df['Census_Tract'].isin(geo_test)

#apply the mask to generate the test dataset and the inverse to generate remainder dataset
df_test=df[test_mask].reset_index(drop=True)
df_rem=df[~test_mask].reset_index(drop=True)
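
A quick way to confirm that a geography-based split behaves as intended is to check that no census tract lands on both sides. The sketch below repeats the same idea on a toy dataframe; `split_by_group` is a hypothetical helper, not part of `functions`, and it sorts the distinct tracts before sampling so that the draw does not depend on set iteration order.

```python
import random
import pandas as pd

# Toy stand-in for the permit data: several rows per census tract
toy = pd.DataFrame({
    'Census_Tract': [101, 101, 102, 103, 103, 103, 104, 105, 105, 106],
    'LOG_BUILDING_FEE_PAID': [4.8, 3.7, 6.9, 5.1, 4.2, 3.9, 6.0, 5.5, 4.4, 3.3],
})

def split_by_group(df, group_col, test_size, seed=42):
    """Hypothetical helper: hold out a share of distinct groups, not rows."""
    # Sort the distinct groups so the sample is reproducible across runs
    groups = sorted(df[group_col].unique())
    n_test = int(len(groups) * test_size)
    # A local generator avoids touching the global random state
    rng = random.Random(seed)
    test_groups = rng.sample(groups, k=n_test)
    mask = df[group_col].isin(test_groups)
    return df[mask].reset_index(drop=True), df[~mask].reset_index(drop=True)

toy_test, toy_rem = split_by_group(toy, 'Census_Tract', test_size=0.34)

# No tract may appear on both sides of the split
overlap = set(toy_test['Census_Tract']) & set(toy_rem['Census_Tract'])
assert not overlap, f'Tracts leaked across the split: {overlap}'
```

Because tracts contain different numbers of permits, the share of rows in the test set will generally differ from the share of tracts that was requested.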

* Saving the columns that would need to be one hot encoded

In [18]:
ohe_col=list(df_rem.drop(columns='WORK_DESCRIPTION').select_dtypes(include=['object']).columns)
ohe_col

['PERMIT_TYPE',
 'REVIEW_TYPE',
 'CONTACT_1_TYPE',
 'CONTACT_1_CITY',
 'CONTACT_1_STATE',
 'Class_Names']
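
For orientation, a list of categorical columns like this one could be fed into an encoder roughly as follows. This is a minimal sketch using scikit-learn's stock `OneHotEncoder` inside a `ColumnTransformer`, not the `CustomOneHotEncoder_CT` imported from `functions`, whose behaviour is not shown here. The toy rows and the `SIGNS` category are made up for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy training rows (illustrative values only)
toy_train = pd.DataFrame({
    'PERMIT_TYPE': ['RENOVATION/ALTERATION', 'ELECTRIC WIRING', 'RENOVATION/ALTERATION'],
    'CONTACT_1_STATE': ['IL', 'IL', 'IN'],
    'LOG_BUILDING_FEE_PAID': [4.83, 3.69, 6.95],
})
# A new row whose PERMIT_TYPE was never seen during fitting
toy_new = pd.DataFrame({
    'PERMIT_TYPE': ['SIGNS'],
    'CONTACT_1_STATE': ['IL'],
    'LOG_BUILDING_FEE_PAID': [5.0],
})

toy_ohe_col = ['PERMIT_TYPE', 'CONTACT_1_STATE']

encoder = ColumnTransformer(
    transformers=[
        # handle_unknown='ignore' encodes unseen categories as all zeros
        ('ohe', OneHotEncoder(handle_unknown='ignore'), toy_ohe_col),
    ],
    remainder='passthrough',  # numeric columns are appended unchanged
    sparse_threshold=0,       # always return a dense array
)

encoded_train = encoder.fit_transform(toy_train)
encoded_new = encoder.transform(toy_new)
print(encoded_train.shape, encoded_new.shape)
```

Fitting the encoder on the remainder data only, then applying it to the held-out tracts, keeps categories that occur only in the test set from influencing the feature layout.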

* The previous `df_window_multi_type` was used to group by Census Tract and Year, based on the year we are making the prediction from and the training length
* The output was reduced to include only distinct census tracts; let's rewrite the function to perform a `.transform()` operation instead
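
The difference between the two approaches can be seen on a toy dataframe (values are made up; `TRACT_MEAN_FEE` is a hypothetical column name): aggregation returns one row per census tract, while `.transform()` broadcasts the group result back onto every original row.

```python
import pandas as pd

toy = pd.DataFrame({
    'Census_Tract': [101, 101, 102, 102, 102],
    'LOG_BUILDING_FEE_PAID': [4.0, 6.0, 3.0, 6.0, 9.0],
})

# Aggregation: collapses to one value per census tract (2 rows)
aggregated = toy.groupby('Census_Tract')['LOG_BUILDING_FEE_PAID'].mean()

# Transform: same group means, aligned to the original index (5 rows)
toy['TRACT_MEAN_FEE'] = (
    toy.groupby('Census_Tract')['LOG_BUILDING_FEE_PAID'].transform('mean')
)

print(len(aggregated), len(toy))
```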

In [21]:
help(df_window_multi_type)

Help on function df_window_multi_type in module functions:

df_window_multi_type(df, year, t)
    Combines descriptions & numeric data
    Inputs:
    #df_description to only contain description columns



`Before` : aggregating to keep only one row per unique census tract, which severely reduces the number of data points

In [None]:
def df_window_multi_type(df,year,t):

    '''
    Combines descriptions & numeric data
    Inputs:
    #df_description to only contain description columns
    '''
    #copy dataframe to avoid accidental overwriting
    df_temp=df.copy()

    if 'YEAR' not in df.columns:
        df_temp['YEAR']=df['ISSUE_DATE'].dt.year
        df_temp=df_temp.drop(columns='ISSUE_DATE')

    assert df_temp.index.is_monotonic_increasing, 'Check Indexing: Should be a simple arithmetic sequence'

    df_temp=df_temp.set_index(['Census_Tract','YEAR'])

    #to set 'YEAR' index as a column
    df_temp=df_temp.reset_index(level=1)

    #select relevant years
    #year+1 as the range end to ensure data for the current year is also included
    df_temp=df_temp[df_temp['YEAR'].isin(range(year-t,year+1))].drop(columns='YEAR')

    #instantiate the output dataframe
    df_result=pd.DataFrame()

    #select columns with descriptions
    obj_cols = df_temp.select_dtypes(include=['object']).columns
    num_cols = df_temp.select_dtypes(include=['number']).columns

    ### CAN ADD MORE FEATURES HERE ###

    #Taking averages for numeric columns
    for col in num_cols:
        df_result[col]=df_temp.groupby(level=0)[col].mean()

    #Concatenating qualitative columns
    for col in obj_cols:
        #need to keep in mind that some strings (descriptions) might be missing
        df_result[col]=df_temp.groupby(level=0)[col].apply(lambda x: ' '.join(str(i) for i in x))

    return df_result

`After` : using `.transform()` to reassign the determined classes back to the original dataframe, preserving the same number of data points

In [None]:
def df_window_multi_type(df,year,t):

    '''
    Combines descriptions & numeric data
    Inputs:
    #df_description to only contain description columns
    '''
    #copy dataframe to avoid accidental overwriting
    df_temp=df.copy()

    if 'YEAR' not in df.columns:
        df_temp['YEAR']=df['ISSUE_DATE'].dt.year
        df_temp=df_temp.drop(columns='ISSUE_DATE')

    assert df_temp.index.is_monotonic_increasing, 'Check Indexing: Should be a simple arithmetic sequence'

    df_temp=df_temp.set_index(['Census_Tract','YEAR'])

    #to set 'YEAR' index as a column
    df_temp=df_temp.reset_index(level=1)

    #select relevant years
    #year+1 as the range end to ensure data for the current year is also included
    df_temp=df_temp[df_temp['YEAR'].isin(range(year-t,year+1))].drop(columns='YEAR')

    #instantiate the output dataframe
    df_result=pd.DataFrame()

    #select columns with descriptions
    obj_cols = df_temp.select_dtypes(include=['object']).columns
    num_cols = df_temp.select_dtypes(include=['number']).columns

    ### CAN ADD MORE FEATURES HERE ###

    #Taking averages for numeric columns
    for col in num_cols:
        df_result[col]=df_temp.groupby(level=0)[col].mean()

    #Concatenating qualitative columns
    for col in obj_cols:
        #need to keep in mind that some strings (descriptions) might be missing
        df_result[col]=df_temp.groupby(level=0)[col].apply(lambda x: ' '.join(str(i) for i in x))

    return df_result
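
The cell above still builds `df_result` from grouped aggregates, so as listed it returns one row per tract. A transform-based variant might look like the sketch below. It is an assumption about the intended behaviour, not the version stored in `functions`: the name `df_window_multi_type_transform` and the `_TRACT_MEAN` suffix are hypothetical, and text columns are simply left at row level.

```python
import pandas as pd

def df_window_multi_type_transform(df, year, t):
    """Hypothetical variant: keep one row per permit and attach tract-level
    means computed over the window [year - t, year]."""
    # copy dataframe to avoid accidental overwriting
    df_temp = df.copy()

    if 'YEAR' not in df_temp.columns:
        df_temp['YEAR'] = df_temp['ISSUE_DATE'].dt.year
        df_temp = df_temp.drop(columns='ISSUE_DATE')

    # keep rows inside the window; both ends are inclusive
    in_window = df_temp['YEAR'].between(year - t, year)
    df_temp = df_temp[in_window].drop(columns='YEAR').reset_index(drop=True)

    # numeric feature columns, excluding the grouping key itself
    num_cols = (df_temp.select_dtypes(include=['number'])
                .columns.drop('Census_Tract', errors='ignore'))

    # broadcast each tract mean back to every row of that tract
    for col in num_cols:
        df_temp[col + '_TRACT_MEAN'] = (
            df_temp.groupby('Census_Tract')[col].transform('mean')
        )
    return df_temp

# Toy usage with made-up values
toy = pd.DataFrame({
    'Census_Tract': [1, 1, 2, 2],
    'YEAR': [2015, 2016, 2016, 2010],
    'LOG_BUILDING_FEE_PAID': [2.0, 4.0, 6.0, 100.0],
    'WORK_DESCRIPTION': ['NEW ROOF', 'WIRING', 'PORCH REPAIR', 'DEMOLITION'],
})
out = df_window_multi_type_transform(toy, year=2016, t=2)
print(out)
```

In the toy call the 2010 permit falls outside the 2014 to 2016 window and is dropped, while the three remaining rows keep their own descriptions alongside the tract-level mean.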

In [20]:
from functions import CustomOneHotEncoder_CT

# Grouping by Census Tract and Year based on the year we are making the prediction from and the training length
from functions import df_window_multi_type