# Predicting Gentrification
*A study into planning application features that can help to signal early warnings of gentrification*
</br></br></br></br>
`Notebook 3`</br>
Author: Mariia Shapovalova</br>
Date: April, 2023

---
## Table of Contents
Notebook 1: Planning Application Dataset Cleaning and EDA
- [0.0 Introduction](#identifier_0)
- [1.0 High Level Overview](#identifier_1)
- [2.0 Initial Feature Selection](#identifier_2)
- 3.0 Infer Missing Census Tracts


---
<center><h2 id="identifier_0">INTRODUCTION</h2></center>

This notebook:
* Merges the two datasets: permits and income data
* Separates the dataset into **test** and **remainder** sets
* Develops workflows to be applied later at the modelling stage, such as overriding sklearn classes and writing functions that can then be converted into Column Transformers
  * Note: some functions and classes from this notebook are not used directly in the modelling notebook, but they nevertheless explain how the final modelling workflow was established
  * In short, this notebook defines and tests the steps required for modelling
* Conducts further EDA
* Fits initial models

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import re
import seaborn as sns

from sklearn.preprocessing import OneHotEncoder

from functions import *

***

### LOAD DATA

In [5]:
#load cleaned income data
income_df=pd.read_csv('../data/clean/income_cleaned.csv',index_col=0)

#load cleaned planning data
permit_df=pd.read_csv('../data/clean/permits_cleaned.csv',index_col=0)

* After cleaning up the columns, some duplicate rows appeared; let's drop them

In [6]:
print('Total number of duplicate rows',permit_df.duplicated().sum())
print(f'Percentage number of duplicate rows {permit_df.duplicated().sum()*100/permit_df.shape[0]:.3f}%')

#DROP DUPLICATES
permit_df=permit_df.drop_duplicates()

Total number of duplicate rows 19409
Percentage number of duplicate rows 2.657%


In [7]:
#Expand the cell to check permit_df 
overview(permit_df)

The dataframe shape is (711102, 23)


Unnamed: 0_level_0,Data Types,Total Null Values,Null Values Percentage,Sample Value Head,Sample Value Tail,Sample Value
Column_Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
PERMIT_TYPE,object,0,0.0,RENOVATION/ALTERATION,ELECTRIC WIRING,ELECTRIC WIRING
REVIEW_TYPE,object,0,0.0,STANDARD PLAN REVIEW,EASY PERMIT WEB,EASY PERMIT WEB
ISSUE_DATE,object,0,0.0,01/03/2006,02/24/2023,01/03/2020
WORK_DESCRIPTION,object,0,0.0,INTERIOR REMODELING OF EXISTING 3 D.U. PER PLA...,REPAIR SERVICE,KITCHEN REMODEL: ALL FIXTURES AND APPLIANCES T...
CONTACT_1_TYPE,object,0,0.0,OWNER AS GENERAL CONTRACTOR,CONTRACTOR-ELECTRICAL,CONTRACTOR-ELECTRICAL
CONTACT_1_CITY,object,0,0.0,CHICAGO,CHICAGO_SUBURBS,OTHER
CONTACT_1_STATE,object,0,0.0,IL,NJ,IL
CENSUS_TRACT,int64,0,0.0,220702,530503,81500
LOG_PROCESSING_TIME,float64,0,0.0,4.394449,-23.025851,-23.025851
LOG_BUILDING_FEE_PAID,float64,0,0.0,4.828314,4.317488,5.010635


In [9]:
#Expand the cell to check income_df 
overview (income_df)

The dataframe shape is (725, 14)


Unnamed: 0_level_0,Data Types,Total Null Values,Null Values Percentage,Sample Value Head,Sample Value Tail,Sample Value
Column_Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Census_Tract,float64,0,0.0,10100.0,844700.0,480500.0
Median_Income_2010,float64,0,0.0,36905.0,31204.5,43977.0
Median_Income_2011,float64,0,0.0,31919.0,28496.0,51953.0
Median_Income_2012,float64,0,0.0,31063.0,21358.0,52500.0
Median_Income_2013,float64,0,0.0,32191.0,27439.5,49046.0
Median_Income_2014,float64,0,0.0,30798.0,22994.0,46667.0
Median_Income_2015,float64,0,0.0,32188.0,22792.5,48375.0
Median_Income_2016,float64,0,0.0,29861.0,24070.0,44514.0
Median_Income_2017,float64,0,0.0,33750.0,25552.5,39329.0
Median_Income_2018,float64,0,0.0,37985.0,24206.5,40250.0


---
<center><h2>MERGE</h2></center>

* Merge **Permit_df** and **Income_df** on `Census Tract`

In [10]:
#Merge on Census_Tract
df_merged=pd.merge(permit_df,income_df,how='inner',left_on='CENSUS_TRACT', right_on='Census_Tract').drop(columns='CENSUS_TRACT')

In [11]:
#Expand the cell to check merged dataframe - df_merged
overview (df_merged)

The dataframe shape is (705524, 36)


Unnamed: 0_level_0,Data Types,Total Null Values,Null Values Percentage,Sample Value Head,Sample Value Tail,Sample Value
Column_Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
PERMIT_TYPE,object,0,0.0,RENOVATION/ALTERATION,ELECTRIC WIRING,ELECTRIC WIRING
REVIEW_TYPE,object,0,0.0,STANDARD PLAN REVIEW,EASY PERMIT WEB,EASY PERMIT WEB
ISSUE_DATE,object,0,0.0,01/03/2006,10/23/2007,06/13/2007
WORK_DESCRIPTION,object,0,0.0,INTERIOR REMODELING OF EXISTING 3 D.U. PER PLA...,INSTALLATION OF (1) 30 AMP DUAL POLE BREAKER A...,COVER ELECTRICAL VIOLATION. INSTALL GFCI IN BA...
CONTACT_1_TYPE,object,0,0.0,OWNER AS GENERAL CONTRACTOR,CONTRACTOR-ELECTRICAL,CONTRACTOR-ELECTRICAL
CONTACT_1_CITY,object,0,0.0,CHICAGO,CHICAGO,CHICAGO
CONTACT_1_STATE,object,0,0.0,IL,IL,IL
LOG_PROCESSING_TIME,float64,0,0.0,4.394449,1.791759,-23.025851
LOG_BUILDING_FEE_PAID,float64,0,0.0,4.828314,3.688879,3.688879
LOG_ZONING_FEE_PAID,float64,0,0.0,4.317488,-23.025851,-23.025851
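
The merged dataframe has 705,524 rows, against 711,102 permit rows after dropping duplicates, so the inner join left out roughly 5,578 permits (assuming each tract appears once in `income_df`). A minimal sketch of how such unmatched keys could be inspected is shown below. It uses small made-up frames rather than the real data, and the variable names `permits`, `income` and `audit` are illustrative only.

```python
import pandas as pd

# Toy stand-ins for permit_df and income_df (values are illustrative only)
permits = pd.DataFrame({'CENSUS_TRACT': [10100, 10100, 10202, 99999],
                        'PERMIT_TYPE': ['A', 'B', 'A', 'C']})
income = pd.DataFrame({'Census_Tract': [10100.0, 10202.0, 10300.0],
                       'Median_Income_2010': [36905.0, 35724.0, 45224.0]})

# A left join with indicator=True labels each permit row by where its key was found
audit = pd.merge(permits, income, how='left',
                 left_on='CENSUS_TRACT', right_on='Census_Tract', indicator=True)

# Permit rows that an inner join would drop
unmatched = audit[audit['_merge'] == 'left_only']
unmatched_tracts = sorted(unmatched['CENSUS_TRACT'].unique().tolist())
print(len(unmatched), unmatched_tracts)
```

Listing the unmatched tracts this way makes it easy to check whether the dropped permits belong to tracts that genuinely have no income record.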


***
<center><h2>Test/Remainder Split</h2></center>

* Test/Validation/Train Splits will be done based on the census tracts (geographies)
* Let's select 20% of the distinct census tracts and separate them in the test dataframe

In [12]:
import random

test_size=0.2

#create a set of distinct census tracts
geo_set=set(df_merged['Census_Tract'])

#measure its length and multiply by the specified test size
test_len=int(len(geo_set)*test_size)

random.seed(42)
#select a random subset of distinct census tract of the required size
geo_test=random.sample(list(geo_set),k=test_len)

#create test mask by testing if census tracts belong to the test subset
test_mask=df_merged['Census_Tract'].isin(geo_test)

#apply the mask to generate the test dataset and the inverse to generate remainder dataset
df_test=df_merged[test_mask].reset_index(drop=True)
df_rem=df_merged[~test_mask].reset_index(drop=True)

In [13]:
#check the shapes
print('remained df shape is ', df_rem.shape)
print('test df shape is ', df_test.shape)

remained df shape is  (533533, 36)
test df shape is  (171991, 36)
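
The split above is made on distinct census tracts, not on rows, which is why the test set holds about 24% of the rows (171,991 of 705,524) rather than exactly 20%: tracts differ in how many permits they contain. The sketch below illustrates the same idea with the standard library only. The helper `split_by_group` is a hypothetical name, and unlike the cell above it sorts the distinct tracts before sampling, so it would not reproduce the exact same draw.

```python
import random

def split_by_group(groups, test_size=0.2, seed=42):
    """Return (test_groups, remainder_groups) as two disjoint sets."""
    # Sort the distinct groups first so the draw does not depend on set ordering
    distinct = sorted(set(groups))
    test_len = int(len(distinct) * test_size)
    rng = random.Random(seed)  # local generator, leaves the global seed untouched
    test_groups = set(rng.sample(distinct, k=test_len))
    return test_groups, set(distinct) - test_groups

# Toy tract ids, one entry per permit row (illustrative values)
tracts = [10100, 10100, 10202, 10300, 10300, 10400, 10503,
          10503, 10600, 10700, 10800, 10900, 11000]
test_groups, rem_groups = split_by_group(tracts, test_size=0.2)

# Every row of a given tract falls on exactly one side of the split
test_rows = [t for t in tracts if t in test_groups]
rem_rows = [t for t in tracts if t not in test_groups]
print(len(test_groups), len(rem_groups), len(test_rows), len(rem_rows))
```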


* Create 'YEAR' column and drop 'ISSUE_DATE' column

In [14]:
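#parse the date strings into datetimes; sample values such as 02/24/2023 indicate MM/DD/YYYY
#(a unit-less .astype('datetime64') is rejected by recent pandas versions)
df_rem['ISSUE_DATE']=pd.to_datetime(df_rem['ISSUE_DATE'],format='%m/%d/%Y')
df_test['ISSUE_DATE']=pd.to_datetime(df_test['ISSUE_DATE'],format='%m/%d/%Y')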

#Create YEAR column based on the 'ISSUE_DATE' column
df_rem['YEAR']=df_rem['ISSUE_DATE'].dt.year
df_test['YEAR']=df_test['ISSUE_DATE'].dt.year

df_rem=df_rem.drop(columns='ISSUE_DATE')
df_test=df_test.drop(columns='ISSUE_DATE')

***
<center><h2>OHE</h2></center>

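* One hot encoding (OHE) won't be applied to the dataset exported from this notebook, because encoding the full dataset up front would prevent cross validation from being run correctly
* Instead, this section focuses on defining a workflow that can later be replicated in the modelling notebook
* Additionally, OHE is required to conduct further EDA
* Data leakage: an encoder fitted on the full dataset learns its categories from rows that later end up in validation folds, so it should be fitted on the training data only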

* The first step in the modelling preparation is to one hot encode the categorical columns
* This converts the relevant qualitative columns into quantitative columns
  * The cleaning process was done with OHE in mind, hence all qualitative columns, apart from `WORK_DESCRIPTION`, should be ready to be one hot encoded

* Consider the number of unique categories to determine the appropriateness of one hot encoding

In [16]:
df_obj=df_rem.select_dtypes("object")
df_obj.nunique()

PERMIT_TYPE              7
REVIEW_TYPE             11
WORK_DESCRIPTION    423971
CONTACT_1_TYPE          25
CONTACT_1_CITY          16
CONTACT_1_STATE         11
dtype: int64

* All categorical columns, apart from 'WORK_DESCRIPTION' can be one hot encoded

In [20]:
#select all categorical columns apart from the 'WORK_DESCRIPTION' column
ohe_col=df_rem.drop(columns='WORK_DESCRIPTION').select_dtypes(include=['object']).columns
ohe_col

Index(['PERMIT_TYPE', 'REVIEW_TYPE', 'CONTACT_1_TYPE', 'CONTACT_1_CITY',
       'CONTACT_1_STATE'],
      dtype='object')

* Override the `OneHotEncoder` class so that it encodes only the specified columns and returns the full DataFrame

In [24]:
#Define CustomOneHotEncoder that encodes only the specified columns and returns the full dataframe
class CustomOneHotEncoder(OneHotEncoder):
    
    # Override the fit method to fit only the specified columns and return self
    def fit(self, X, col_li, y=None):
        '''
        Inputs:
            X(DataFrame) : Input DataFrame
            col_li(list) : List of columns to apply one hot encoding to
        Output:
            fitted self
        '''
        #super() is a built-in function to call the methods of the base class.
        super().fit(X[col_li])
        return self
        
    # Override the transform method to transform only the specified columns and return a DataFrame
    def transform(self, X, col_li):
        #encode the specified columns (sparse matrix by default)
        transformed_data = super().transform(X[col_li])
        columns = self.get_feature_names_out()
        #keep the index of X so that the concatenation below aligns row by row
        df = pd.DataFrame(transformed_data.toarray(), columns=columns, index=X.index)
        df = pd.concat([X.drop(col_li, axis=1), df], axis=1)
        return df

my_ohe = CustomOneHotEncoder()
#fit returns the fitted encoder, so its result is not assigned to df_rem_ohe
my_ohe.fit(df_rem, ohe_col)
df_rem_ohe = my_ohe.transform(df_rem, ohe_col)

* Note: initially a similar result was achieved using a function that applied one hot encoding in a single step
* However, the encoder should only be fitted on the training set, hence the fit and transform methods had to be separated
* To view the initial function, please expand the cell below

In [17]:
def ohe_cat(df,col_li):
    '''
    Use:
        Apply OHE for the specified columns only and returns a new full dataframe
    
    Inputs:
        df (pandas.DataFrame): Input DataFrame
        col_li(list): List of column names to apply OHE
    
    Returns:
        df_result(Pandas DataFrame): Result DataFrame with OHE applied to specified columns'''

    #copy dataframe 
    df_result=df.copy()

    # Iterate through each column in the input list
    for col in col_li:

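        #Instantiate the encoder with its default sparse output and densify below,
        #so the code does not depend on the sparse/sparse_output argument name (renamed in sklearn 1.2)
        enc = OneHotEncoder()

        #Encode and convert it into a new DataFrame with column names including the original column name
        df_temp=pd.DataFrame(enc.fit_transform(df[[col]]).toarray(),columns=[col+'_'+i for i in enc.categories_[0]],index=df.index)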

        # Concatenate the encoded df with the existing DataFrame
        df_result=pd.concat([df_result,df_temp],axis=1)

    # Drop the original categorical columns from the result DataFrame
    df_result=df_result.drop(columns=col_li)

    return df_result

In [30]:
# Create an instance of the CustomOneHotEncoder class
encode_df=CustomOneHotEncoder()

# Fit the encoder to the specified columns (ohe_col) in df_rem 
encode_df.fit(df_rem,ohe_col)

# Transform to get a one-hot encoded version of df_rem
df_rem_ohe=encode_df.transform(df_rem,ohe_col)

#NOTE since the fit_transform() method was not explicitly overridden, the fit and transform methods of CustomOneHotEncoder should be called separately

df_rem_ohe.head()

Unnamed: 0,WORK_DESCRIPTION,LOG_PROCESSING_TIME,LOG_BUILDING_FEE_PAID,LOG_ZONING_FEE_PAID,LOG_OTHER_FEE_PAID,LOG_SUBTOTAL_PAID,LOG_BUILDING_FEE_UNPAID,LOG_ZONING_FEE_UNPAID,LOG_OTHER_FEE_UNPAID,LOG_SUBTOTAL_UNPAID,...,CONTACT_1_STATE_FL,CONTACT_1_STATE_IL,CONTACT_1_STATE_IN,CONTACT_1_STATE_MI,CONTACT_1_STATE_NJ,CONTACT_1_STATE_NY,CONTACT_1_STATE_OTHER,CONTACT_1_STATE_TX,CONTACT_1_STATE_UT,CONTACT_1_STATE_WI
0,INTERIOR REMODELING OF EXISTING 3 D.U. PER PLA...,4.394449,4.828314,4.317488,-23.025851,5.298317,-23.025851,-23.025851,-23.025851,-23.025851,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,REPLACE REAR OPEN WOOD PORCH WITH A NEW STEEL/...,3.583519,5.298317,3.912023,-23.025851,5.521461,-23.025851,-23.025851,-23.025851,-23.025851,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,DECONVERT 2ND FLOOR APARTMENT TO MAKE BUILDING...,-23.025851,4.442651,4.317488,-23.025851,5.075174,-23.025851,-23.025851,-23.025851,-23.025851,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,LOW VOLTAGE BURGLAR ALARM,2.197225,3.688879,-23.025851,-23.025851,3.688879,-23.025851,-23.025851,-23.025851,-23.025851,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,REPAIR EXISTNG TWO STORY WOOD FRAME PORCH PER ...,2.564949,4.442651,3.912023,-23.025851,4.905275,-23.025851,-23.025851,-23.025851,-23.025851,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


* Note: after one hot encoding there is still 1 qualitative column remaining - `WORK_DESCRIPTION`
* `WORK_DESCRIPTION` will be addressed in the `EDA for NLP` section

###  `XXXXX`

***
<center><h3>Grouping by Census Tract and Year</h3></center>


**Dataframe Shape Aim:**

| `Census Tract`| `Year` | Averaged Numerical X Feature 1 (for the given `ct` and `year`) | More X features | Aggregated Qualitative Feature (for the given `ct` and `year`)| <span style="color:green">Median Income in Year 2010 </span>| <span style="color:green">...</span>| <span style="color:green">Median Income in Year 2021 </span>|
| --- | --- | --- | --- | --- | --- | --- | --- |
| 10100 | 2021 | XXX | XXX | XXX | XXX | XXX | XXX |
| 10100 | 2020 | XXX | XXX | XXX | XXX | XXX | XXX |
| 10100 | 2019 | XXX | XXX | XXX | XXX | XXX | XXX |
| ... | ... | ... | ... | ... | ... | ... | ... |

The next step is to aggregate the individual applications into groups defined by census tract and year.

Quantitative and qualitative columns need different treatment when grouping the datapoints by year and census tract:
* Quantitative columns (including one hot encoded columns): take the mean
  * taking the mean is more appropriate than summing the quantitative columns because, as seen in notebook 2, household counts vary significantly between census tracts
  * hence, totals would largely reflect the size of a tract and would not be comparable across tracts
* Qualitative columns (descriptions only): join all the descriptions together

In [92]:
def df_window_multi_type(df,year,t):

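    '''
    Aggregates permit-level rows into one row per census tract over a window of years,
    combining descriptions & numeric data

    Inputs:
        df (pandas.DataFrame): permit-level data with 'Census_Tract' and either 'YEAR' or a datetime 'ISSUE_DATE' column
        year (int): last year of the window (the year the prediction is made from)
        t (int): number of years before `year` to include

    Returns:
        df_result (pandas.DataFrame): numeric columns averaged and object (description) columns joined, per census tract
    '''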
    #copy dataframe to avoid accidental overwriting
    df_temp=df.copy()

    if 'YEAR' not in df.columns:
        df_temp['YEAR']=df['ISSUE_DATE'].dt.year
        df_temp=df_temp.drop(columns='ISSUE_DATE')

    assert df_temp.index.is_monotonic_increasing, 'Check Indexing: Should be a simple arithmetic sequence'

    df_temp=df_temp.set_index(['Census_Tract','YEAR'])

    #to set 'YEAR' index as a column
    df_temp=df_temp.reset_index(level=1)

    #select relevant years
    #year+1 as the range end to ensure data for the current year is also included
    df_temp=df_temp[df_temp['YEAR'].isin(range(year-t,year+1))].drop(columns='YEAR')

    #instantiate the output dataframe
    df_result=pd.DataFrame()

    #select description (object) columns and numeric columns
    obj_cols = df_temp.select_dtypes(include=['object']).columns
    num_cols = df_temp.select_dtypes(include=['number']).columns

    ### CAN ADD MORE FEATURES HERE ###

    #Taking averages for numeric columns
    for col in num_cols:
        df_result[col]=df_temp.groupby(level=0)[col].mean()

    #Concatenating qualitative columns
    for col in obj_cols:
        #cast each value to str because some strings (descriptions) might be missing
        df_result[col]=df_temp.groupby(level=0)[col].apply(lambda x: ' '.join(str(i) for i in x))

    return df_result

In [93]:
#Example of the function use
#Making a prediction in 2015 based on training data for 2015 and 5 years before
df_window_multi_type(df_ohe,2015,5)

Unnamed: 0_level_0,PROCESSING_TIME,BUILDING_FEE_PAID,ZONING_FEE_PAID,OTHER_FEE_PAID,SUBTOTAL_PAID,TOTAL_FEE,REPORTED_COST,Median_Income_2010,Median_Income_2011,Median_Income_2012,...,CONTACT_1_STATE_FL,CONTACT_1_STATE_IL,CONTACT_1_STATE_IN,CONTACT_1_STATE_MI,CONTACT_1_STATE_NJ,CONTACT_1_STATE_NY,CONTACT_1_STATE_OTHER,CONTACT_1_STATE_TX,CONTACT_1_STATE_UT,CONTACT_1_STATE_WI
Census_Tract,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
10100,10.234450,248.012153,37.440191,4.306220,310.279330,317.179139,21741.763780,36905.0,31919.0,31063.0,...,0.000000,0.995215,0.004785,0.000000,0.00000,0.000000,0.000000,0.000000,0.0,0.000000
10202,13.248408,233.542643,64.108280,3.343949,305.632038,311.351401,21271.133758,35724.0,44107.0,36369.0,...,0.000000,0.980892,0.006369,0.000000,0.00000,0.000000,0.012739,0.000000,0.0,0.000000
10300,12.329588,310.629869,40.215356,1.966292,369.346142,396.416404,38633.235955,45224.0,45964.0,41315.0,...,0.011236,0.966292,0.000000,0.003745,0.00000,0.003745,0.003745,0.007491,0.0,0.003745
10400,8.633229,202.652304,31.387147,1.645768,245.681285,461.353229,63600.407524,44018.0,48138.0,43125.0,...,0.012539,0.971787,0.003135,0.000000,0.00000,0.000000,0.006270,0.006270,0.0,0.000000
10503,20.709091,416.965545,79.772727,4.318182,517.749364,554.577818,64575.268182,18250.0,18952.0,20524.0,...,0.000000,0.990909,0.000000,0.000000,0.00000,0.000000,0.009091,0.000000,0.0,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
843200,13.364964,325.202199,48.608577,3.467153,407.154471,445.150675,31914.177336,40133.0,44537.0,34881.0,...,0.000000,0.963504,0.016423,0.000000,0.00365,0.000000,0.012774,0.001825,0.0,0.000000
843600,12.973134,388.979194,56.940299,9.558209,525.768746,540.059313,38246.080597,24844.0,22606.0,22373.0,...,0.005970,0.967164,0.000000,0.000000,0.00000,0.002985,0.011940,0.005970,0.0,0.005970
843900,8.782828,352.530278,38.446970,5.303030,420.533990,444.389242,38493.449495,35663.0,31774.0,29094.0,...,0.000000,0.964646,0.000000,0.000000,0.00000,0.000000,0.010101,0.000000,0.0,0.025253
844600,19.598639,378.886463,43.282313,7.993197,462.941905,621.624966,52162.993197,33571.0,29052.0,36979.0,...,0.000000,0.986395,0.000000,0.006803,0.00000,0.000000,0.000000,0.006803,0.0,0.000000


***BASELINE MODEL HERE MAYBE?***

---
**Feature Creation**

In [94]:
# Reported cost divided by the number of households

df_model['COST_PER_HOUSEHOLD']=df_model['REPORTED_COST']/df_model['Household_Count']

In [95]:
# Average Cost per one Renovation
# Do fees vary based on the permission type?

---
<center><h3>Classification Model : 2 Sample T-test</h3></center>


* **2 sample t-test unpaired**
* **Barcharts**

* Determine y using the pre-defined functions (the same functions will be used to calculate y when modelling)
* Split data into 'gentrification' & 'non-gentrification' sets

In [96]:
help(abs_perc_income_change)

Help on function abs_perc_income_change in module functions:

abs_perc_income_change(df, year_x, n)
    year_x - current year (the year we are making the prediction from)
    n - the number of years ahead we are making the prediction



---
**Split data into 'gentrification' & 'non-gentrification' sets**

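The analysis uses 2006--2021, split into windows of 7 and 8 years:</br>
2006--2013 -> permit features (training window)</br>
2013--2021 -> income change to be predicted (prediction window)</br>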

In [97]:
def stat_test_prep(df,current_year,train_years,prediction_years):
    df_temp=df.copy()

    if 'WORK_DESCRIPTION' in df_temp.columns:
        df_temp=df_temp.drop(columns='WORK_DESCRIPTION')

    df_temp=df_window_multi_type(df_temp,current_year,train_years)

    change=abs_perc_income_change(df_temp,current_year,prediction_years)
    median_temp=np.median(change)
    y_class_mask=[True if c>median_temp else False for c in change]

    #Prepare the dataframe by dropping the income ('Median') columns so that only permit features remain
    df_temp=df_temp.drop(columns=sel_col(df_temp,'Median')) 

    #Apply the masks
    df_pos_class=df_temp[y_class_mask]
    df_neg_class=df_temp[[not i for i in y_class_mask]]

    pos_len=df_pos_class.shape[0]
    neg_len=df_neg_class.shape[0]

    if pos_len>neg_len:
        n = pos_len-neg_len
        random_rows = df_pos_class.sample(n)
        df_pos_class = df_pos_class.drop(random_rows.index)
    elif neg_len>pos_len:
        n = neg_len-pos_len
        random_rows = df_neg_class.sample(n)
        df_neg_class = df_neg_class.drop(random_rows.index)

    return df_temp, df_pos_class, df_neg_class

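* With a median split there can be a small difference in the number of rows in the two classes
* The function trims the longer dataframe at random so that both classes have the same length for the t-test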
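
The labelling and trimming steps can be sketched with the standard library as follows. This is an illustration only, not the notebook's implementation: `median_split_balanced` is a hypothetical helper, the income changes are made up, and a fixed seed is used so the random trim is repeatable (the notebook's `DataFrame.sample(n)` call does not set one).

```python
import random
import statistics

def median_split_balanced(changes, seed=0):
    """Split indices into above-median / not-above-median and trim to equal size."""
    med = statistics.median(changes)
    pos = [i for i, c in enumerate(changes) if c > med]
    neg = [i for i, c in enumerate(changes) if c <= med]
    # Drop random extras from the longer list; a fixed seed keeps the trim reproducible
    rng = random.Random(seed)
    size = min(len(pos), len(neg))
    pos = sorted(rng.sample(pos, size))
    neg = sorted(rng.sample(neg, size))
    return med, pos, neg

# Illustrative income changes (%) for seven tracts
changes = [12.0, -3.5, 40.2, 8.1, 8.1, 25.0, 1.4]
med, pos, neg = median_split_balanced(changes)
print(med, pos, neg)
```

With an odd number of tracts, or with ties at the median, the strict `>` comparison leaves the two classes unequal, which is where the small difference in row counts comes from.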

* Loop through each column in df_temp
* Run ttest for each column
* Store results in a dataframe 

In [98]:
from scipy.stats import ttest_ind

def ttest_2(sample1,sample2):

    # Perform the 2-sample t-test
    t,p = ttest_ind(sample1, sample2)
    result=(t,p)

    return result


In [99]:
df_temp, df_pos_class, df_neg_class=stat_test_prep(df_model,2013,7,8)

results_df=pd.DataFrame({'feature':[],'corr_coef':[],'p-values':[]})

for col in df_temp.columns:
    res=ttest_2(df_pos_class[col],df_neg_class[col])

    #.loc[len(results_df)] adds a row at the end of the dataframe
    #note: the 'corr_coef' column holds the t-statistic, not a correlation coefficient
    results_df.loc[len(results_df)]=[col,res[0],res[1]]

* Are there any nulls?

In [100]:
#are there any nulls?
results_df.isna().sum()

feature      0
corr_coef    1
p-values     1
dtype: int64

In [101]:
results_df[results_df['p-values'].isna()]

Unnamed: 0,feature,corr_coef,p-values
17,REVIEW_TYPE_DIRECT DEVELOPER SERVICES,,


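* Note: in EDA it was established that the REVIEW_TYPE_DIRECT DEVELOPER SERVICES category was only added in 2015, so this column is constant (all zeros) within the 2006--2013 window and the t-test is undefined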
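
The NaN result can be reproduced by hand. The sketch below computes Student's two-sample t statistic with a pooled variance, which is the form `ttest_ind` uses by default (equal variances assumed). The helper name `pooled_t_statistic` and the sample values are illustrative only.

```python
import math
import statistics

def pooled_t_statistic(sample1, sample2):
    """Student's two-sample t statistic with pooled variance (equal variances assumed)."""
    n1, n2 = len(sample1), len(sample2)
    m1, m2 = statistics.fmean(sample1), statistics.fmean(sample2)
    v1, v2 = statistics.variance(sample1), statistics.variance(sample2)
    pooled_var = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    denom = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    if denom == 0:
        # Both samples are constant: the statistic is 0/0, i.e. undefined
        return math.nan
    return (m1 - m2) / denom

t_regular = pooled_t_statistic([2.0, 4.0, 6.0], [1.0, 2.0, 3.0])
t_constant = pooled_t_statistic([0.0, 0.0, 0.0], [0.0, 0.0, 0.0])
print(t_regular, t_constant)
```

A one hot encoded column whose category never occurs in the window is zero for every tract in both classes, so both the difference in means and the pooled variance are zero.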

In [102]:
#mask where p-values not null
#.copy() so that the new columns below are added to an independent dataframe rather than a slice

mask=~results_df['p-values'].isna()
results_df=results_df[mask].copy()

p_val=np.array(results_df['p-values'])

#need to readjust given that multiple t-tests are run
from statsmodels.stats.multitest import multipletests

# apply the Benjamini-Hochberg correction
# a single call returns both the reject flags and the adjusted p-values
pval_pass, pvals_updated = multipletests(p_val, method='fdr_bh')[:2]

results_df.loc[:,'new_pval']=pvals_updated
results_df.loc[:,'pval_pass']=pval_pass
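
For reference, the Benjamini-Hochberg adjustment can be written out in a few lines. This is a plain-Python sketch of the standard procedure, intended to show what the adjusted p-values mean rather than to replace `multipletests`. The function name `benjamini_hochberg` and the p-values are illustrative, and the 0.05 threshold is an assumption matching the usual default.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity with a running minimum
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

p_raw = [0.01, 0.04, 0.03, 0.005]
p_adj = benjamini_hochberg(p_raw)
rejected = [p <= 0.05 for p in p_adj]
print(p_adj, rejected)
```

Each p-value is scaled by the number of tests divided by its rank, so a feature with a raw p-value just under 0.05 (such as PERMIT_TYPE_REINSTATE REVOKED PMT) can fail after adjustment.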

In [103]:
pval_not_pass=results_df[results_df['pval_pass']==False]
pval_not_pass

Unnamed: 0,feature,corr_coef,p-values,new_pval,pval_pass
3,OTHER_FEE_PAID,0.754231,0.451017,0.54263,False
8,PERMIT_TYPE_DROP,0.811451,0.417441,0.510205,False
9,PERMIT_TYPE_EASY PERMIT PROCESS,-1.073705,0.283403,0.396764,False
12,PERMIT_TYPE_REINSTATE REVOKED PMT,1.994281,0.046591,0.086725,False
15,REVIEW_TYPE_CONVEYANCE DEVICE PERMIT,-0.257918,0.796562,0.851879,False
16,REVIEW_TYPE_DEMOLITION PERMIT,-1.00513,0.315255,0.414663,False
18,REVIEW_TYPE_EASY PERMIT,-0.220517,0.825547,0.870782,False
20,REVIEW_TYPE_ELECTRICAL PLAN REVIEW,-1.170274,0.242373,0.352127,False
21,REVIEW_TYPE_FIRE PROTECTION SYSTEM,0.157622,0.87481,0.886321,False
25,REVIEW_TYPE_TRADITIONAL DEVELOPER SERVICES,-2.03396,0.042412,0.081643,False


In [104]:
results_df.loc[[True if 'ARCHI' in x else False for x in results_df['feature']],:]

Unnamed: 0,feature,corr_coef,p-values,new_pval,pval_pass
26,CONTACT_1_TYPE_ARCHITECT,1.882897,0.06021628,0.1053785,False
39,CONTACT_1_TYPE_OWNER AS ARCHITECT,0.919654,0.3581372,0.4520749,False
40,CONTACT_1_TYPE_OWNER AS ARCHITECT & CONTRACTR,0.681003,0.4961422,0.5877377,False
45,CONTACT_1_TYPE_SELF CERT ARCHITECT,5.549826,4.36049e-08,3.730642e-07,True


* try merging 'CONTACT_1_TYPE_OWNER AS ARCHITECT' and 'CONTACT_1_TYPE_OWNER AS ARCHITECT & CONTRACTR' into 'CONTACT_1_TYPE_ARCHITECT'

In [105]:
results_df.loc[[True if 'CONTRACTOR' in x else False for x in results_df['feature']],:].sort_values('corr_coef')

Unnamed: 0,feature,corr_coef,p-values,new_pval,pval_pass
28,CONTACT_1_TYPE_CONTRACTOR-ELECTRICAL,-4.315076,1.876747e-05,0.0001032211,True
31,CONTACT_1_TYPE_CONTRACTOR-PLUMBER/PLUMBING,-4.18216,3.335818e-05,0.0001712387,True
34,CONTACT_1_TYPE_CONTRACTOR_ENERGY,-3.307736,0.0009989598,0.003076796,True
43,CONTACT_1_TYPE_PLUMBING CONTRACTOR,-2.050934,0.04072371,0.08040322,False
32,CONTACT_1_TYPE_CONTRACTOR-VENTILATION,-1.44227,0.1497679,0.2402526,False
30,CONTACT_1_TYPE_CONTRACTOR-GENERAL CONTRACTOR,-1.345344,0.1790418,0.270318,False
33,CONTACT_1_TYPE_CONTRACTOR-WRECKING,-1.20267,0.2295965,0.3399794,False
29,CONTACT_1_TYPE_CONTRACTOR-ELEVATOR,-0.346828,0.7288471,0.8131394,False
41,CONTACT_1_TYPE_OWNER AS GENERAL CONTRACTOR,0.180314,0.8569694,0.8798219,False
48,CONTACT_1_TYPE_TENT CONTRACTOR,1.045633,0.2961674,0.401553,False


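* try merging 'CONTACT_1_TYPE_CONTRACTOR-VENTILATION', 'CONTACT_1_TYPE_CONTRACTOR-GENERAL CONTRACTOR' and 'CONTACT_1_TYPE_CONTRACTOR-WRECKING' into 'CONTACT_1_TYPE_GENERAL_CONTRACTOR', and 'CONTACT_1_TYPE_CONTRACTOR-PLUMBER/PLUMBING' into 'CONTACT_1_TYPE_PLUMBING CONTRACTOR'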

In [106]:
change_cat={'CONTACT_1_TYPE':{
    'OWNER AS ARCHITECT':'ARCHITECT',
    'OWNER AS ARCHITECT & CONTRACTR':'ARCHITECT',
    'CONTRACTOR-PLUMBER/PLUMBING':'PLUMBING CONTRACTOR',
    'CONTRACTOR-VENTILATION':'GENERAL_CONTRACTOR',
    'CONTRACTOR-GENERAL CONTRACTOR':'GENERAL_CONTRACTOR',
    'CONTRACTOR-WRECKING':'GENERAL_CONTRACTOR',
}}

In [107]:
df_rem['CONTACT_1_TYPE']=df_rem['CONTACT_1_TYPE'].replace(change_cat['CONTACT_1_TYPE'])

In [108]:
df_model=ohe_cat(df_rem,ohe_col)

In [109]:
df_temp, df_pos_class, df_neg_class=stat_test_prep(df_model,2013,7,8)

results_df=pd.DataFrame({'feature':[],'corr_coef':[],'p-values':[]})

for col in df_temp.columns:
    res=ttest_2(df_pos_class[col],df_neg_class[col])

    #.loc[len(results_df)] adds a row at the end of the dataframe
    #note: the 'corr_coef' column holds the t-statistic, not a correlation coefficient
    results_df.loc[len(results_df)]=[col,res[0],res[1]]

#mask where p-values not null
#.copy() so that the new columns below are added to an independent dataframe rather than a slice

mask=~results_df['p-values'].isna()
results_df=results_df[mask].copy()

p_val=np.array(results_df['p-values'])

# apply the Benjamini-Hochberg correction
# a single call returns both the reject flags and the adjusted p-values
pval_pass, pvals_updated = multipletests(p_val, method='fdr_bh')[:2]

results_df.loc[:,'new_pval']=pvals_updated
results_df.loc[:,'pval_pass']=pval_pass

pval_not_pass=results_df[results_df['pval_pass']==False]
pval_not_pass

Unnamed: 0,feature,corr_coef,p-values,new_pval,pval_pass
3,OTHER_FEE_PAID,0.754231,0.451017,0.533704,False
8,PERMIT_TYPE_DROP,0.811451,0.417441,0.502344,False
9,PERMIT_TYPE_EASY PERMIT PROCESS,-1.073705,0.283403,0.386954,False
12,PERMIT_TYPE_REINSTATE REVOKED PMT,1.994281,0.046591,0.079934,False
15,REVIEW_TYPE_CONVEYANCE DEVICE PERMIT,-0.257918,0.796562,0.856908,False
16,REVIEW_TYPE_DEMOLITION PERMIT,-1.00513,0.315255,0.402835,False
18,REVIEW_TYPE_EASY PERMIT,-0.220517,0.825547,0.874833,False
20,REVIEW_TYPE_ELECTRICAL PLAN REVIEW,-1.170274,0.242373,0.343971,False
21,REVIEW_TYPE_FIRE PROTECTION SYSTEM,0.157622,0.87481,0.887307,False
25,REVIEW_TYPE_TRADITIONAL DEVELOPER SERVICES,-2.03396,0.042412,0.075281,False


In [110]:
results_df.loc[[True if 'ARCHI' in x else False for x in results_df['feature']],:]

Unnamed: 0,feature,corr_coef,p-values,new_pval,pval_pass
26,CONTACT_1_TYPE_ARCHITECT,2.263157,0.02399614,0.04368529,True
40,CONTACT_1_TYPE_SELF CERT ARCHITECT,5.549826,4.36049e-08,3.439942e-07,True


In [111]:
results_df.loc[[True if 'CONTRACTOR' in x else False for x in results_df['feature']],:].sort_values('corr_coef')

Unnamed: 0,feature,corr_coef,p-values,new_pval,pval_pass
38,CONTACT_1_TYPE_PLUMBING CONTRACTOR,-4.46178,9.772158e-06,4.95588e-05,True
28,CONTACT_1_TYPE_CONTRACTOR-ELECTRICAL,-4.315076,1.876747e-05,8.883268e-05,True
30,CONTACT_1_TYPE_CONTRACTOR_ENERGY,-3.307736,0.0009989598,0.002837046,True
32,CONTACT_1_TYPE_GENERAL_CONTRACTOR,-1.96808,0.04953641,0.07993375,False
29,CONTACT_1_TYPE_CONTRACTOR-ELEVATOR,-0.346828,0.7288471,0.8200696,False
36,CONTACT_1_TYPE_OWNER AS GENERAL CONTRACTOR,0.180314,0.8569694,0.881809,False
43,CONTACT_1_TYPE_TENT CONTRACTOR,1.045633,0.2961674,0.3908334,False
41,CONTACT_1_TYPE_SIGN CONTRACTOR,2.831206,0.00479871,0.01009499,True
34,CONTACT_1_TYPE_MASONRY CONTRACTOR,5.35746,1.220482e-07,8.665423e-07,True


In [119]:
col_drop=list(pval_not_pass['feature'])

import joblib

joblib.dump(col_drop, '../data/interim/drop_feature.pkl')

['../data/interim/drop_feature.pkl']

In [113]:
#Apply the same modifications to the test set
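#restrict the replacement to the CONTACT_1_TYPE column, as was done for df_rem
df_test['CONTACT_1_TYPE']=df_test['CONTACT_1_TYPE'].replace(change_cat['CONTACT_1_TYPE'])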
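
A short sketch of why the scope of `replace` matters when applying a category mapping: `Series.replace` touches one column, whereas `DataFrame.replace` with the same dictionary matches whole cell values in every column. The frame below is made up, and its first description is deliberately set equal to a mapping key to make the difference visible.

```python
import pandas as pd

# Subset of the mapping defined in the notebook
mapping = {'OWNER AS ARCHITECT': 'ARCHITECT',
           'CONTRACTOR-WRECKING': 'GENERAL_CONTRACTOR'}

# Toy rows (illustrative values); the first description happens to equal a mapping key
toy = pd.DataFrame({
    'CONTACT_1_TYPE': ['OWNER AS ARCHITECT', 'CONTRACTOR-WRECKING', 'ARCHITECT'],
    'WORK_DESCRIPTION': ['OWNER AS ARCHITECT', 'WRECK GARAGE', 'NEW PORCH'],
})

# Series.replace touches only the selected column
by_column = toy.copy()
by_column['CONTACT_1_TYPE'] = by_column['CONTACT_1_TYPE'].replace(mapping)

# DataFrame.replace matches whole cell values in every column
whole_frame = toy.replace(mapping)
print(by_column, whole_frame, sep='\n')
```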

* Export to CSV the full dataframe
* Export the list of columns to drop after OHE

In [117]:
df_rem.to_csv('../data/interim/df_rem.csv')
df_test.to_csv('../data/interim/df_test.csv')

---
### EDA for NLP
**Classification + NLP**

* Different grouping methods have to be applied to numeric and qualitative columns
* OHE was applied to most columns, but there is still the `WORK_DESCRIPTION` feature, which will have to be vectorized separately for the train and test sets
* OHE was only applied to columns that had a few unique categories (hence unlikely to create data leakage, as all categories would be present in both train & test)
* However, `WORK_DESCRIPTION` vectorization would be highly dependent on the train/test split, as the vocabulary varies greatly from one data point to another
* To group the `WORK_DESCRIPTION` column by 'Census_Tract' & 'YEAR':
  * Join all strings (descriptions) together into one field
  * Separately join the descriptions of the most relevant permit types, i.e. 'PERMIT_TYPE' in RENOVATION/ALTERATION & NEW CONSTRUCTION

In [118]:
df_temp=df_rem[['PERMIT_TYPE','Census_Tract','YEAR','WORK_DESCRIPTION']]
df_temp

Unnamed: 0,PERMIT_TYPE,Census_Tract,YEAR,WORK_DESCRIPTION
0,RENOVATION/ALTERATION,220702,2006,INTERIOR REMODELING OF EXISTING 3 D.U. PER PLA...
1,DROP,220702,2006,REPLACE REAR OPEN WOOD PORCH WITH A NEW STEEL/...
2,RENOVATION/ALTERATION,220702,2006,DECONVERT 2ND FLOOR APARTMENT TO MAKE BUILDING...
3,ELECTRIC WIRING,220702,2006,LOW VOLTAGE BURGLAR ALARM
4,DROP,220702,2006,REPAIR EXISTNG TWO STORY WOOD FRAME PORCH PER ...
...,...,...,...,...
535506,ELECTRIC WIRING,820901,2006,adding gfi protection to existing receptacles ...
535507,RENOVATION/ALTERATION,821600,2007,PORCH REPAIR PERMIT ONLY (NO INTERIOR WORK UND...
535508,EASY PERMIT PROCESS,821600,2007,REPLACE APPRX. 100 SHEETS OF GYP. BRD. MAXIMUM...
535509,ELECTRIC WIRING,812500,2007,INSTALLATION OF (1) 30 AMP DUAL POLE BREAKER A...


**Merging strings for all permit types**

In [None]:
df_description=pd.DataFrame()

In [None]:
df_description['All_Description']=df_temp.groupby(['Census_Tract','YEAR'])['WORK_DESCRIPTION'].apply(' '.join)

df_renovation=df_temp[df_temp['PERMIT_TYPE']=='RENOVATION/ALTERATION']
df_description['Renovation']=df_renovation.groupby(['Census_Tract','YEAR'])['WORK_DESCRIPTION'].apply(' '.join)

* Overview of the grouped descriptions dataframe

In [None]:
df_description

Unnamed: 0_level_0,Unnamed: 1_level_0,All_Description,Renovation
Census_Tract,YEAR,Unnamed: 2_level_1,Unnamed: 3_level_1
10100,2006,REBUILD PARAPET WALLS AS NEEDED AND REPLACE 12...,Porch repair per plans and per code violations...
10100,2007,Deconversion of existing 32-unit residential b...,Deconversion of existing 32-unit residential b...
10100,2008,LIMITED REHABILITATION OF EXISTING RESIDENTIAL...,LIMITED REHABILITATION OF EXISTING RESIDENTIAL...
10100,2009,"REMOVE EXISTING DOORS WITH C LABEL DOORS, INST...",REMOVE AND REPLACE 2 PORCHES AT EXISTING 3 STO...
10100,2010,CHANGE OF GENERAL CONTRACTOR TO PERMIT #100317...,CONVERT THE THREE DWELLING UNITS INTO FOUR DWE...
...,...,...,...
844700,2019,THE WORK INCLUDES A DE-CONVERSION AND AN INTER...,THE WORK INCLUDES A DE-CONVERSION AND AN INTER...
844700,2020,FIRE DAMAGE REPAIR OF EXISTING 2 DWELLING UNIT...,FIRE DAMAGE REPAIR OF EXISTING 2 DWELLING UNIT...
844700,2021,THREE EXISTING WINDOWS ON THE SOUTH ELEVATION ...,THREE EXISTING WINDOWS ON THE SOUTH ELEVATION ...
844700,2022,COMPLETE ROOF TEAR OFF AND REPLACEMENT WITH IN...,REPLACE THE EXISTING REAR OPEN WOOD PORCH AS P...


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
import string

# import the nltk stopwords
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.tag import pos_tag 
from nltk.stem import PorterStemmer

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\44742\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
df_description.isna().sum()

All_Description      0
Renovation         458
dtype: int64

In [None]:
STOP_WORDS = set(stopwords.words('english'))
# add a domain-specific stop word
STOP_WORDS.add('per')

def my_tokenizer(sentence):
    # set to lower case and remove punctuation
    sentence = sentence.lower()
    for punctuation_mark in string.punctuation:
        sentence = sentence.replace(punctuation_mark, '')

    # split sentence into words
    listofwords = sentence.split(' ')
    listof_words = []

    # keep a word only if it is not a stop word, is longer than 2 characters
    # (this also drops empty strings) and contains no digits
    for word in listofwords:
        if (word not in STOP_WORDS) and (len(word) > 2) and (not re.search(r'\d', word)):
            # stemming is currently switched off
            #stemmer = PorterStemmer()
            #stemmed_word = stemmer.stem(word)
            #listof_words.append(stemmed_word)
            listof_words.append(word)

    return listof_words
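
To see what the filtering rules do, here is a minimal standalone sketch. It re-implements the same three rules with a small hand-written stop-word set (`TOY_STOP_WORDS`, an assumption standing in for the NLTK list) so that it runs without any downloads; `toy_tokenizer` is a hypothetical name and not part of the notebook.

```python
import re
import string

# assumption: a tiny stand-in for the NLTK English stop words plus 'per'
TOY_STOP_WORDS = {'of', 'and', 'the', 'per'}

def toy_tokenizer(sentence):
    # lower-case, then strip every punctuation character
    sentence = sentence.lower()
    for punctuation_mark in string.punctuation:
        sentence = sentence.replace(punctuation_mark, '')
    # keep words that are not stop words, are longer than 2 characters and have no digits
    return [word for word in sentence.split(' ')
            if word not in TOY_STOP_WORDS and len(word) > 2 and not re.search(r'\d', word)]

print(toy_tokenizer('Deconversion of existing 32-unit residential building per plans.'))
```

Note that removing the hyphen fuses `32-unit` into `32unit`, which the digit rule then drops.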

---
### Topics

---
***Topics in all Descriptions***

`3 min to run`
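
The vectoriser in the next cell prunes the vocabulary by document frequency through `min_df` and `max_df`. As a rough illustration of that rule only (a pure-Python sketch on toy documents, not scikit-learn's implementation; `filter_by_document_frequency` is a hypothetical helper), a word survives when the share of documents containing it lies between the two thresholds:

```python
def filter_by_document_frequency(tokenized_docs, min_df, max_df):
    # count in how many documents each word appears (not how often overall)
    n_docs = len(tokenized_docs)
    doc_count = {}
    for tokens in tokenized_docs:
        for word in set(tokens):
            doc_count[word] = doc_count.get(word, 0) + 1
    # keep words whose document share lies within [min_df, max_df]
    return sorted(word for word, count in doc_count.items()
                  if min_df <= count / n_docs <= max_df)

# toy data, made up for illustration
toy_docs = [
    ['porch', 'repair', 'wood'],
    ['porch', 'sign', 'illuminated'],
    ['porch', 'wood', 'deck'],
    ['porch', 'antennas', 'wood'],
    ['porch', 'repair', 'elevator'],
]

# 'porch' is in every document (above max_df); words found in a single
# document (20%) fall below min_df
print(filter_by_document_frequency(toy_docs, min_df=0.3, max_df=0.8))
```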

In [None]:
from sklearn.decomposition import LatentDirichletAllocation

all_types=[str(x) for x in df_description['All_Description']]

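#min_df=0.01 --> ignore words found in fewer than 1% of the documents
#max_df=0.8 --> ignore words found in more than 80% of the documents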
bagofwords = TfidfVectorizer(tokenizer=my_tokenizer,min_df=0.01, max_df=0.8)
bagofwords.fit(all_types)

words_tranformed = bagofwords.transform(all_types)
print(words_tranformed.shape)


(12426, 770)


`2 min to run`

In [None]:
# fit the LDA topic model
lda = LatentDirichletAllocation(n_components=10, max_iter=30,random_state=1,verbose=0)
lda.fit(words_tranformed)

# for each topic, print the top 10 most representative words
words = bagofwords.get_feature_names_out()

print('\n')

for i, topic in enumerate(lda.components_):
    topic_words = " ".join([words[j] for j in topic.argsort()[: -11: -1]])
    print(f"Topic #{i} words: {topic_words}")



Topic #0 words: retrofit il socket meters electric certification monthly doh plansconditional mobile
Topic #1 words: frame change residence porch family single deck masonry basement contractor
Topic #2 words: sign illuminated elevation channel letters mounted internally led facing set
Topic #3 words: erection elevator unit floor passenger one change space elevation sign
Topic #4 words: maintenance monthly april january february march july may december october
Topic #5 words: elevator city pursuant bureau passenger chicago submitted capacity scope elevators
Topic #6 words: porch wood fixtures low open voltage burglar alarm amp violations
Topic #7 words: porch low voltage erect alarm wood inspection replacement hot heater
Topic #8 words: antennas equipment radios site wireless sprint antenna fiber tmobile att
Topic #9 words: solar array photovoltaic panel spr cbrc cbc qty erect type
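
The slice `topic.argsort()[: -11: -1]` in the cell above returns the positions of the ten largest weights, largest first. A pure-Python sketch of the same selection on made-up weights (the words and numbers are illustrative and not taken from the fitted model; `top_words` is a hypothetical helper):

```python
def top_words(weights, vocabulary, n):
    # rank word positions by weight, largest first, and keep the first n
    ranked = sorted(range(len(weights)), key=lambda j: weights[j], reverse=True)
    return [vocabulary[j] for j in ranked[:n]]

# illustrative topic weights, one per vocabulary word
vocabulary = ['porch', 'sign', 'wood', 'elevator', 'solar']
weights = [4.2, 0.3, 3.1, 0.1, 1.7]

print(" ".join(top_words(weights, vocabulary, n=3)))
```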



---
***Topics in Renovation Permits***

In [None]:
from sklearn.decomposition import LatentDirichletAllocation

#str() turns the 458 missing renovation descriptions (NaN) into the literal string 'nan'
renovation=[str(x) for x in df_description['Renovation']]

#0.01 -> 1% of the documents
bagofwords_renovation = TfidfVectorizer(tokenizer=my_tokenizer,min_df=0.01, max_df=0.8)
bagofwords_renovation.fit(renovation)

words_tranformed_renovation = bagofwords_renovation.transform(renovation)
print(words_tranformed_renovation.shape)



(12426, 853)


In [None]:
# fit the LDA topic model
lda = LatentDirichletAllocation(n_components=10, max_iter=30,random_state=1,verbose=0)

lda.fit(words_tranformed_renovation)

# for each topic, print the top 10 most representative words
words = bagofwords_renovation.get_feature_names_out()

for i, topic in enumerate(lda.components_):
    topic_words = " ".join([words[j] for j in topic.argsort()[: -11: -1]])
    print(f"Topic #{i} words: {topic_words}")

Topic #0 words: porch rear building story wood permit replace basement open floor
Topic #1 words: space floor office permit building work alterations tenant use field
Topic #2 words: addition family single residence story floor rear erect second frame
Topic #3 words: erect addition family single residence owners basement floor deconversion deconvert
Topic #4 words: antennas site install equipment previous radios sector wireless associated tmobile
Topic #5 words: rear permit porch addition story floor deck alterations replace wood
Topic #6 words: antennas communications wireless equipment facility technology install tower monopole related
Topic #7 words: nan retaining compliant habitable columns solar entire process pizza ia
Topic #8 words: spr cbrc type cbc occupancy construction group building story iiia
Topic #9 words: porch wood open replace rear story size location repair building




---
Data available: 2006--2021</br>
Split into windows of 7 and 8 years:</br>
2006--2013 -> train</br>
2013--2021 -> test</br>

Outline a workflow that can be replicated later when modelling:
* Select years relevant for the training window. For the purpose of EDA, let's stick with 2006 to 2013 for training data
* Merge all the descriptions (strings) for each census tract together 
* Vectorise descriptions for each census tract
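
The first two steps can be sketched without pandas on a few toy records. This is only an illustration under assumed data (the tuples and the helper name `merge_window` are hypothetical); it mirrors the inclusive `range(year - t, year + 1)` window used in this notebook:

```python
def merge_window(records, year, t):
    # records: iterable of (census_tract, year, description) tuples
    window = range(year - t, year + 1)  # year + 1 so the current year is included
    merged = {}
    for tract, record_year, description in records:
        if record_year in window:
            merged.setdefault(tract, []).append(description)
    # one joined string per census tract
    return {tract: ' '.join(texts) for tract, texts in merged.items()}

# toy records, made up for illustration
toy_records = [
    (10100, 2005, 'new sign'),          # before the window, dropped
    (10100, 2006, 'porch repair'),
    (10100, 2013, 'replace windows'),
    (10100, 2014, 'roof tear off'),     # after the window, dropped
    (844700, 2010, 'fire damage repair'),
]

print(merge_window(toy_records, year=2013, t=7))
```

With `year=2013` and `t=7` the window covers 2006 through 2013, i.e. eight calendar years.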

In [None]:
# First, let's write a function to select the years in a relevant window and join all the strings

def df_window_description(df_description,year,t):

    '''
    Combines the descriptions of each census tract over a window of years.

    Inputs:
    df_description: dataframe indexed by (Census_Tract, YEAR) that contains only description columns
    year: last year of the window (included)
    t: number of years to look back from `year`

    Returns:
    dataframe indexed by Census_Tract, with one joined string per description column
    '''
    df_temp=df_description.copy()

    #to set 'YEAR' index as a column
    df_temp=df_temp.reset_index(level=1)

    #year+1 as the range end to ensure data for the current year is also included
    df_temp=df_temp[df_temp['YEAR'].isin(range(year-t,year+1))].drop(columns='YEAR')

    df_result=pd.DataFrame()

    for col in df_temp.columns:
        #str() is needed because some strings might be missing (NaN)
        df_result[col]=df_temp.groupby(level=0)[col].apply(lambda x: ' '.join(str(i) for i in x))

    return df_result

***All Descriptions***

In [None]:
current_year=2013
train_window=7

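all_types=df_window_description(df_description,current_year,train_window)['All_Description']
#NOTE: the next line overwrites the windowed selection with descriptions from all years,
#which is why the printed shape has 12426 rows; remove it to vectorise the training window only
all_types=df_description['All_Description']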

bagofwords = TfidfVectorizer(tokenizer=my_tokenizer,min_df=0.05, max_df=0.8)
bagofwords.fit(all_types)

words_tranformed = bagofwords.transform(all_types)
print(words_tranformed.shape)



(12426, 770)


*`Most occurring words in all descriptions`*

In [None]:
#Convert matrix to array
dense_matrix = words_tranformed.toarray()

#Convert array to a dataframe and assign words as column names
df = pd.DataFrame(dense_matrix,columns=bagofwords.get_feature_names_out())

words_occurence=df.sum()

words_occurence.sort_values(ascending=False)[0:30]



porch          1094.353429
wood            867.931815
low             862.566396
voltage         849.563234
erect           824.863120
frame           811.530602
change          792.707736
floor           786.794567
alarm           763.149070
replacement     730.347619
single          713.269381
open            704.351518
residence       702.026446
basement        694.055008
family          652.665264
unit            652.315881
qty             646.251439
amp             632.113474
one             614.676114
masonry         614.297293
fixtures        584.044081
front           581.875450
structural      576.135950
burglar         560.492280
contractor      554.775656
sign            553.601663
ft              553.301100
maintenance     543.244973
drywall         532.358900
revision        530.178937
dtype: float64

---
***Renovation Descriptions***

In [None]:
current_year=2013
train_window=7

renovation_words=df_window_description(df_description,current_year,train_window)['Renovation']

bagofwords_renovation = TfidfVectorizer(tokenizer=my_tokenizer,min_df=0.01, max_df=0.8)
bagofwords_renovation.fit(renovation_words)

renovation_tranformed = bagofwords_renovation.transform(renovation_words)
print(renovation_tranformed.shape)



(724, 2151)


*Most occurring words in `renovation` descriptions*

In [None]:
#Convert matrix to array
dense_matrix = renovation_tranformed.toarray()

#Convert array to a dataframe and assign words as column names
df = pd.DataFrame(dense_matrix,columns=bagofwords_renovation.get_feature_names_out())

words_occurence=df.sum()

words_occurence.sort_values(ascending=False)[0:10]



porches      82.163618
du           52.214660
second       52.166213
enclosed     49.550788
garage       48.581108
deconvert    42.924754
stair        41.724267
sfr          38.666112
office       38.366544
revision     38.286063
dtype: float64

* `As a part of EDA, run baseline Classification Models to determine which type is more useful: all descriptions or renovation only`

---
***`Prepare DataFrame Export`***

In [None]:
sel_col_model(df_model,'Income').set_index(['Census_Tract','YEAR']).head()

Census_Tract,YEAR,Median_Income_2010,Median_Income_2011,Median_Income_2012,Median_Income_2013,Median_Income_2014,Median_Income_2015,Median_Income_2016,Median_Income_2017,Median_Income_2018,Median_Income_2019,Median_Income_2020,Median_Income_2021
10100,2006,36905.0,31919.0,31063.0,32191.0,30798.0,32188.0,29861.0,33750.0,37985.0,32474.0,42891.0,60316.0
10100,2007,36905.0,31919.0,31063.0,32191.0,30798.0,32188.0,29861.0,33750.0,37985.0,32474.0,42891.0,60316.0
10100,2008,36905.0,31919.0,31063.0,32191.0,30798.0,32188.0,29861.0,33750.0,37985.0,32474.0,42891.0,60316.0
10100,2009,36905.0,31919.0,31063.0,32191.0,30798.0,32188.0,29861.0,33750.0,37985.0,32474.0,42891.0,60316.0
10100,2010,36905.0,31919.0,31063.0,32191.0,30798.0,32188.0,29861.0,33750.0,37985.0,32474.0,42891.0,60316.0


In [None]:
df_y_tempp=sel_col_model(df_model,'Income').set_index(['Census_Tract','YEAR'])

In [None]:
print(df_y_tempp.shape)
print(df_description.shape)

(12426, 12)
(12426, 2)


In [None]:
df_nlp=df_description.merge(df_y_tempp,left_index=True, right_index=True)

In [None]:
df_nlp.to_csv('../data/interim/model.csv')

In [None]:
#NOTE: same path as the df_nlp export above, so this call overwrites that file
df_model.to_csv('../data/interim/model.csv')