# Predicting Gentrification
*A study into planning application features that can help to signal early warnings of gentrification*
</br></br></br></br>
`Notebook 1b: Further EDA of Permits Dataset --  NLP and Topic Modelling`</br>
Author: Mariia Shapovalova</br>
Date: April, 2023

---
## Table of Contents
Notebook 1b: 
* Vectorization
* LDA
* Adding the newly engineered topics to the dataframe 



---
<center><h2 id="identifier_0">INTRODUCTION</h2></center>

The purpose of this notebook is to perform an Exploratory Data Analysis (EDA) on the Permits Dataset. The focus is on conducting Natural Language Processing (NLP) analysis and utilizing Latent Dirichlet Allocation (LDA) to extract project types. This approach aims to provide more meaningful information for prediction tasks compared to relying solely on the existing permit types, which might not capture the true essence of the architectural projects.

In [4]:
#import relevant packages
import numpy as np
import pandas as pd

import joblib
import re

import seaborn as sns
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings('ignore')

from functions import *

#NLP

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
import string

from sklearn.decomposition import LatentDirichletAllocation

# import the nltk stopwords
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.tag import pos_tag 
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/mshapovalova/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [5]:
df=pd.read_csv('../data/clean/permits_cleaned.csv',index_col=0)

In [6]:
overview(df)

The dataframe shape is (730511, 15)


Column_Name,Data Types,Total Null Values,Null Values Percentage,Sample Value Head,Sample Value Tail,Sample Value
PERMIT_TYPE,object,0,0.0,RENOVATION/ALTERATION,ELECTRIC WIRING,DROP
REVIEW_TYPE,object,0,0.0,STANDARD PLAN REVIEW,EASY PERMIT WEB,SIGN PERMIT
WORK_DESCRIPTION,object,0,0.0,INTERIOR REMODELING OF EXISTING 3 D.U. PER PLA...,REPAIR SERVICE,INSTALL A WALL SIGN FOR THE LD PHO AT 4722 N K...
CONTACT_1_TYPE,object,0,0.0,OWNER AS GENERAL CONTRACTOR,CONTRACTOR-ELECTRICAL,SIGN CONTRACTOR
CONTACT_1_CITY,object,0,0.0,CHICAGO,CHICAGO_SUBURBS,CHICAGO
CONTACT_1_STATE,object,0,0.0,IL,OTHER,IL
CENSUS_TRACT,int64,0,0.0,220702,530503,140701
LOG_PROCESSING_TIME,float64,0,0.0,4.394449,-23.025851,5.129899
LOG_BUILDING_FEE_PAID,float64,0,0.0,4.828314,4.317488,4.60517
LOG_ZONING_FEE_PAID,float64,0,0.0,4.317488,-23.025851,5.298317


In [7]:
df['PERMIT_TYPE'].unique()

array(['RENOVATION/ALTERATION', 'NEW CONSTRUCTION', 'DROP',
       'ELECTRIC WIRING', 'EASY PERMIT PROCESS', 'SCAFFOLDING',
       'REINSTATE REVOKED PMT'], dtype=object)

In [8]:
df['REVIEW_TYPE'].unique()

array(['STANDARD PLAN REVIEW', 'SIGN PERMIT', 'SELF CERT',
       'EASY PERMIT WEB', 'EASY PERMIT', 'ELECTRICAL PLAN REVIEW',
       'DEMOLITION PERMIT', 'CONVEYANCE DEVICE PERMIT',
       'TRADITIONAL DEVELOPER SERVICES', 'FIRE PROTECTION SYSTEM',
       'DIRECT DEVELOPER SERVICES'], dtype=object)

* WORK_DESCRIPTION will be the main column of study in this notebook

In [9]:
description_series=df['WORK_DESCRIPTION']

In [10]:
type(description_series)

pandas.core.series.Series

***
<center><h2 id="identifier_1">all_descriptions</h2></center>

* We will start by looking at the descriptions of all projects

In [11]:
#define tokenizer
STOP_WORDS = stopwords.words('english')

def my_tokenizer(document,nouns_only=False):
    # set to lower case (note: removing punctuation is not necessary in our case, because we use a regex pattern to extract tokens)
    document = document.lower()

    # pattern denoting a sequence of at least 2 alphanumeric characters 
    pattern=r"(?u)\b\w\w+\b"

    # tokenize - split by matching a pattern
    tokenized_document = re.findall(pattern, document)

    # tokenized_document contains a list of tokens from the document
    # ['....','....','....']

    if nouns_only:
        # Tag each token with its part of speech
        tagged_words = pos_tag(tokenized_document)

        # Extract only the nouns from the sentence
        tokenized_document = [word for (word, pos) in tagged_words if pos.startswith('N')]
    
    listof_words_result = []

    # remove stopwords and any tokens that contain a digit, then stem what remains
    stemmer = PorterStemmer()
    for word in tokenized_document:
        if word not in STOP_WORDS and not re.search(r'\d', word):
            listof_words_result.append(stemmer.stem(word))

    return listof_words_result
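
To make the tokenizer's filtering steps concrete, here is a minimal standard-library sketch of the regex and filter logic only. The stop-word set and the sample sentence are made up for illustration, and the POS tagging and Porter stemming steps of `my_tokenizer` are deliberately left out.

```python
import re

# Hypothetical, tiny stop-word set standing in for NLTK's English list
DEMO_STOP_WORDS = {"of", "the", "to", "and", "on"}

def demo_tokenizer(document):
    """Lower-case, keep tokens of 2+ word characters, drop stop words and tokens containing digits."""
    tokens = re.findall(r"(?u)\b\w\w+\b", document.lower())
    return [t for t in tokens if t not in DEMO_STOP_WORDS and not re.search(r"\d", t)]

# Made-up description in the style of WORK_DESCRIPTION
sample = "Interior alteration of existing 34th floor office per plans"
print(demo_tokenizer(sample))
```

Note that a token such as `34th` is dropped because it contains a digit, not because it consists only of digits.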

---
<center><h2 id="identifier_1">TfidfVectorizer</h2></center>

In [12]:
### THIS CODE BLOCK IS COMMENTED OUT BECAUSE IT TAKES TOO LONG TO RUN. THE OUTPUT WAS PICKLED AND LOADED BACK
# min_df=0.0001 --> keep only tokens that appear in at least 0.01% of the documents
#vect = TfidfVectorizer(tokenizer=lambda document: my_tokenizer(document, nouns_only=True),min_df=0.0001)
#words_tranformed=vect.fit_transform(description_series)

#print(words_tranformed.shape)

#joblib.dump(words_tranformed, '../data/interim/further_nlp_exploration/matrix_nouns_all.pkl')
#joblib.dump(vect, '../data/interim/further_nlp_exploration/vector_nouns_all.pkl')

In [13]:
words_tranformed=joblib.load('../data/interim/further_nlp_exploration/matrix_nouns_all.pkl')
bagofwords=joblib.load('../data/interim/further_nlp_exploration/vector_nouns_all.pkl')
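
A fractional `min_df` is a share of the documents, so it can help to translate it into an absolute document count. The sketch below does that arithmetic for the 730,511 permits; the helper name is hypothetical, and it assumes the vectorizer keeps terms whose document frequency is at least `min_df * n_documents`.

```python
import math

def min_document_count(min_df, n_documents):
    """Smallest number of documents a term must appear in to survive a fractional min_df."""
    return math.ceil(min_df * n_documents)

n_permits = 730511  # rows in the permits dataframe

print(min_document_count(0.0001, n_permits))  # threshold used in this notebook (0.01% of documents)
print(min_document_count(0.05, n_permits))    # for comparison: a 5% threshold
```

With `min_df=0.0001` a term only needs to appear in a few dozen descriptions, whereas a 5% threshold would require tens of thousands.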

---
<center><h2 id="identifier_1">LatentDirichletAllocation</h2></center>

* Fitting LDA for 7 topics

In [14]:
### THIS CODE BLOCK IS COMMENTED OUT BECAUSE IT TAKES TOO LONG TO RUN. THE OUTPUT WAS PICKLED AND LOADED BACK
# fit the LDA topic model
#lda = LatentDirichletAllocation(n_components=7, max_iter=30,random_state=1,verbose=1)
#lda.fit(words_tranformed)

#joblib.dump(lda, '../data/interim/further_nlp_exploration/lda_nouns_all.pkl')

In [15]:
lda=joblib.load('../data/interim/further_nlp_exploration/lda_nouns_all.pkl')

In [16]:
# for each topic, print the 9 most representative words ([: -10: -1] takes the last 9 indices in reverse order)
words = bagofwords.get_feature_names_out()

for i, topic in enumerate(lda.components_):
    topic_words = " ".join([words[j] for j in topic.argsort()[: -10: -1]])
    print(f"Topic #{i} words: {topic_words}")

Topic #0 words: floor work field inspect space alter plan offic permit
Topic #1 words: porch plan wood stori repair famili stair violat deck
Topic #2 words: contractor chang wreck revis stori permit frame resid elev
Topic #3 words: roof fixtur panel outlet circuit switch light unit furnac
Topic #4 words: garag sign elev fenc wall qti scaffold letter work
Topic #5 words: servic voltag alarm instal meter system amp interior work
Topic #6 words: mainten permit heater water replac month antenna march april
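
Each topic above lists nine words because the slice `[: -10: -1]` stops before position -10. The following sketch reproduces that slicing on a plain Python list; the weights are made up, and `sorted(range(...))` stands in for the ascending ordering that `argsort` returns.

```python
# Hypothetical topic-word weights for a 12-word vocabulary
weights = [0.3, 2.1, 0.9, 1.7, 0.1, 3.2, 0.5, 1.1, 2.8, 0.7, 1.3, 0.2]

# Indices sorted by ascending weight, the same ordering an argsort would give
order = sorted(range(len(weights)), key=weights.__getitem__)

top_nine = order[:-10:-1]  # walks back from the end and stops before position -10
top_ten = order[:-11:-1]   # stops before position -11, so one more index is kept

print(len(top_nine), len(top_ten))
print(top_nine)
```

Using `[: -11: -1]` in the loop would print ten words per topic instead of nine.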


In [17]:
perplexity = lda.perplexity(words_tranformed)
print("Perplexity:", perplexity)

Perplexity: 460.6145244954449


Perplexity is a lower-is-better measure, so a value of about 460 is relatively high, suggesting that the LDA model does not yet provide meaningful topic allocations.

To address this issue, we have two potential approaches: 
* firstly, we can adjust the number of topics we aim to identify, in the hope of achieving better model fits; 
* alternatively, we can refine the dataset by pre-selecting specific project types, allowing us to focus our analysis on a more targeted subset
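
As a rough guide to reading perplexity values, the sketch below computes the textbook per-token perplexity, exp of the negative mean log-probability, for two made-up models. It is an illustration of the direction of the metric only: scikit-learn derives its value from an approximate likelihood bound, so the numbers are not directly comparable, and the vocabulary size here is hypothetical.

```python
import math

def perplexity(token_probabilities):
    """exp of the negative mean log-probability assigned to each observed token."""
    mean_log_prob = sum(math.log(p) for p in token_probabilities) / len(token_probabilities)
    return math.exp(-mean_log_prob)

vocabulary_size = 500  # hypothetical vocabulary size

# A model that spreads probability uniformly over the vocabulary
uniform = perplexity([1 / vocabulary_size] * 2000)

# A sharper model that assigns each observed token a probability of 0.02
sharper = perplexity([0.02] * 2000)

print(round(uniform, 2), round(sharper, 2))
```

A model that guesses uniformly has a perplexity equal to the vocabulary size, and a better model scores lower, which is why a drop in perplexity is read as an improvement.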

---

## RENOVATION/ALTERATION
* Let's consider RENOVATION/ALTERATION project type

*-----------------------------------------------------------*
### LDA

In [18]:
# filter the dataframe to only include RENOVATION/ALTERATION permit type
renovation_description_series=df[df['PERMIT_TYPE']=='RENOVATION/ALTERATION']['WORK_DESCRIPTION'].reset_index(drop=True)

In [19]:
#preview the descriptions for RENOVATION/ALTERATION permit type
renovation_description_series

0         INTERIOR REMODELING OF EXISTING 3 D.U. PER PLA...
1         Interior alteration of existing partial 34th f...
2         Interior alterations to the retail space on th...
3         DECOVERT EXISTING 11 DU APARTMENT TO 10 DU AS ...
4         REPLACE 2 EXISTING WOOD PORCHES (SAME CONFIGUR...
                                ...                        
146849    SPR 2019 CBC. REVISION TO 100992339. BASEMENT ...
146850    REMOVE/REPLACE FRONT AND REAR WOOD STEPS, INTE...
146851    INTERIOR REMODEL OF EXISTING TWO FAMILY RESIDE...
146852    [SELF-CERTIFICATION 2019 CBRC:] REVISION TO TH...
146853    NEW-(3) ANTENNAS, NEW (6) RADIO UNITS WITH ASS...
Name: WORK_DESCRIPTION, Length: 146854, dtype: object

* Refit TfidfVectorizer on RENOVATION/ALTERATION permit type

In [20]:
vectorizer = TfidfVectorizer(tokenizer=lambda document: my_tokenizer(document, nouns_only=True),min_df=0.001)
renovation_matrix=vectorizer.fit_transform(renovation_description_series)
print(renovation_matrix.shape)

(146854, 494)


In [21]:
#aggregate tfidf scores per word to find the most prominent words
#(summing the sparse matrix directly avoids building a dense 146854 x 494 array)
tfidf_totals = np.asarray(renovation_matrix.sum(axis=0)).ravel()
vocab_df=pd.DataFrame({'tfidf_scores': tfidf_totals}, index=vectorizer.get_feature_names_out())

* Preview the 5 words with the highest aggregate TF-IDF scores

In [22]:
vocab_df.sort_values('tfidf_scores',ascending=False)[:5]

Unnamed: 0,tfidf_scores
plan,14760.438855
porch,13095.013725
stori,9377.984468
wood,9260.17692
floor,8950.144244


* To perform topic modeling specifically on the 'RENOVATION/ALTERATION' permit type, we will fit an LDA model with **3** topics.
* Many of the topics centre on the word 'porch'.
* Some topics, such as 'antennas equipment install site wireless tmobile previous swapping rrhs rrus' and 'antennas facility wireless communications equipment radios associated sprint cabling sector', resemble electrical wiring work.
* Although these are topics within renovation permits, many of them appear to be maintenance-related rather than renovation-related.
* Further analysis would be needed to determine the significance of these topics, but it is clear that more information can be extracted from building permits data, since the pre-defined features are limited by the way the data is recorded.
  

In [23]:
lda = LatentDirichletAllocation(n_components=3, max_iter=30,random_state=1,verbose=1)
lda.fit(renovation_matrix)

iteration: 1 of max_iter: 30
iteration: 2 of max_iter: 30
iteration: 3 of max_iter: 30
iteration: 4 of max_iter: 30
iteration: 5 of max_iter: 30
iteration: 6 of max_iter: 30
iteration: 7 of max_iter: 30
iteration: 8 of max_iter: 30
iteration: 9 of max_iter: 30
iteration: 10 of max_iter: 30
iteration: 11 of max_iter: 30
iteration: 12 of max_iter: 30
iteration: 13 of max_iter: 30
iteration: 14 of max_iter: 30
iteration: 15 of max_iter: 30
iteration: 16 of max_iter: 30
iteration: 17 of max_iter: 30
iteration: 18 of max_iter: 30
iteration: 19 of max_iter: 30
iteration: 20 of max_iter: 30
iteration: 21 of max_iter: 30
iteration: 22 of max_iter: 30
iteration: 23 of max_iter: 30
iteration: 24 of max_iter: 30
iteration: 25 of max_iter: 30
iteration: 26 of max_iter: 30
iteration: 27 of max_iter: 30
iteration: 28 of max_iter: 30
iteration: 29 of max_iter: 30
iteration: 30 of max_iter: 30


In [24]:
#check perplexity
perplexity = lda.perplexity(renovation_matrix)
print("Perplexity:", perplexity)

Perplexity: 201.80017013401414


Perplexity has dropped from about 461 to about 202, which is an improvement (lower is better), but the fit is still weak and unlikely to boost the model performance significantly. Next steps:
* 1a - refit on all words, not just the nouns
* 1b - test LDA on more topics (20)
* 2 - compare LDA to HDP

In [25]:
#print topics
words = vectorizer.get_feature_names_out()

for i, topic in enumerate(lda.components_):
    topic_words = " ".join([words[j] for j in topic.argsort()[: -10: -1]])
    print(f"Topic #{i} words: {topic_words}")

Topic #0 words: porch wood plan stair repair locat size stori deck
Topic #1 words: unit stori plan famili addit resid basement build alter
Topic #2 words: space offic floor alter self use cert chang plan


*-----------------------------------------------------------*
### 1a. All words
Let's try refitting on all words, not just nouns to compare the performance

In [26]:
vectorizer_all = TfidfVectorizer(tokenizer=my_tokenizer,min_df=0.001)
renovation_matrix_all_words=vectorizer_all.fit_transform(renovation_description_series)
print(renovation_matrix_all_words.shape)

(146854, 712)


In [27]:
lda_all_words = LatentDirichletAllocation(n_components=3, max_iter=30,random_state=1,verbose=1)
lda_all_words.fit(renovation_matrix_all_words)

iteration: 1 of max_iter: 30
iteration: 2 of max_iter: 30
iteration: 3 of max_iter: 30
iteration: 4 of max_iter: 30
iteration: 5 of max_iter: 30
iteration: 6 of max_iter: 30
iteration: 7 of max_iter: 30
iteration: 8 of max_iter: 30
iteration: 9 of max_iter: 30
iteration: 10 of max_iter: 30
iteration: 11 of max_iter: 30
iteration: 12 of max_iter: 30
iteration: 13 of max_iter: 30
iteration: 14 of max_iter: 30
iteration: 15 of max_iter: 30
iteration: 16 of max_iter: 30
iteration: 17 of max_iter: 30
iteration: 18 of max_iter: 30
iteration: 19 of max_iter: 30
iteration: 20 of max_iter: 30
iteration: 21 of max_iter: 30
iteration: 22 of max_iter: 30
iteration: 23 of max_iter: 30
iteration: 24 of max_iter: 30
iteration: 25 of max_iter: 30
iteration: 26 of max_iter: 30
iteration: 27 of max_iter: 30
iteration: 28 of max_iter: 30
iteration: 29 of max_iter: 30
iteration: 30 of max_iter: 30


In [28]:
#check perplexity
perplexity = lda_all_words.perplexity(renovation_matrix_all_words)
print("Perplexity:", perplexity)

Perplexity: 272.87602897499846


*-----------------------------------------------------------*
### 1b. All words -- 20 topics

In [29]:
lda_all_words_20 = LatentDirichletAllocation(n_components=20, max_iter=30,random_state=1,verbose=1)
lda_all_words_20.fit(renovation_matrix_all_words)

iteration: 1 of max_iter: 30
iteration: 2 of max_iter: 30
iteration: 3 of max_iter: 30
iteration: 4 of max_iter: 30
iteration: 5 of max_iter: 30
iteration: 6 of max_iter: 30
iteration: 7 of max_iter: 30
iteration: 8 of max_iter: 30
iteration: 9 of max_iter: 30
iteration: 10 of max_iter: 30
iteration: 11 of max_iter: 30
iteration: 12 of max_iter: 30
iteration: 13 of max_iter: 30
iteration: 14 of max_iter: 30
iteration: 15 of max_iter: 30
iteration: 16 of max_iter: 30
iteration: 17 of max_iter: 30
iteration: 18 of max_iter: 30
iteration: 19 of max_iter: 30
iteration: 20 of max_iter: 30
iteration: 21 of max_iter: 30
iteration: 22 of max_iter: 30
iteration: 23 of max_iter: 30
iteration: 24 of max_iter: 30
iteration: 25 of max_iter: 30
iteration: 26 of max_iter: 30
iteration: 27 of max_iter: 30
iteration: 28 of max_iter: 30
iteration: 29 of max_iter: 30
iteration: 30 of max_iter: 30


In [30]:
#check perplexity
perplexity = lda_all_words_20.perplexity(renovation_matrix_all_words)
print("Perplexity:", perplexity)

Perplexity: 463.27574552787235


* With 3 topics, perplexity got worse when fitting on all words rather than only the nouns, rising from about 202 to about 273.
* Increasing the number of topics drastically, to 20, raised perplexity further to about 463, so adding topics does not help here.

---
### 2. Hierarchical Dirichlet Process (HDP)

We will be using HdpModel from gensim to conduct Hierarchical Dirichlet Process (HDP).</br>
The format to fit the model is as follows: **HdpModel(corpus, dictionary)** where:
* The **corpus** parameter represents the bag-of-words corpus, which is a collection of documents in the form of a list of lists. 
    * Each document in the corpus is represented as a list of tuples, 
    * where **each tuple** contains **a word ID** and **its corresponding frequency** in that document. 
* The **dictionary** parameter represents the mapping between words and their unique integer IDs. 
    * It is an instance of the gensim.corpora.Dictionary class. 
    * The dictionary is used to convert words into their corresponding integer IDs and vice versa. 
    * It is necessary for training and using the HDP model because the model operates on word IDs rather than the actual words.
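
The corpus and dictionary formats described above can be sketched in plain Python. This is an imitation of the data shapes only, not gensim's implementation: the helper names and the two toy documents are hypothetical, and gensim's `Dictionary` may assign different IDs.

```python
from collections import Counter

def build_dictionary(tokenized_docs):
    """Assign each distinct token an integer ID in order of first appearance."""
    token2id = {}
    for doc in tokenized_docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Represent one document as sorted (word ID, frequency) tuples."""
    counts = Counter(token2id[token] for token in doc if token in token2id)
    return sorted(counts.items())

# Two toy documents of stemmed tokens
docs = [["porch", "wood", "repair", "porch"], ["offic", "floor", "alter"]]

token2id = build_dictionary(docs)
bow_corpus = [to_bow(doc, token2id) for doc in docs]

print(token2id)
print(bow_corpus)
```

Each document becomes a list of `(word ID, count)` tuples, which is the shape HdpModel and LdaModel expect for their corpus argument.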


In [31]:
#pip install gensim

from gensim.models import LdaModel, LsiModel, HdpModel
from gensim.corpora import Dictionary

*-----------------------------------------------------------------------------------------------*


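* The corpus parameter in the HdpModel of Gensim expects a bag-of-words representation of the documents, i.e. raw token counts rather than TF-IDF weights
* Build that corpus from token counts with a gensim Dictionary and doc2bow instead of reusing the Tfidf matrix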

In [32]:
from gensim.matutils import Sparse2Corpus
from gensim import corpora, models

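# Convert the sparse TF-IDF matrix to a gensim corpus
# (rows are documents here, so documents_columns=False; gensim assumes columns by default)
# NOTE: this TF-IDF corpus is not used below; the count-based BoW_corpus is used instead
corpus = Sparse2Corpus(renovation_matrix, documents_columns=False)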

renovation_nouns=[my_tokenizer(doc,nouns_only=True) for doc in renovation_description_series]
dict_LoS = corpora.Dictionary(renovation_nouns)

BoW_corpus = [dict_LoS.doc2bow(doc, allow_update=True) for doc in renovation_nouns]

In [33]:
print(dict_LoS)
print(dict_LoS.token2id)

print(BoW_corpus[0])

Dictionary<10681 unique tokens: ['architect', 'plan', 'remodel', 'alter', 'field']...>
{'architect': 0, 'plan': 1, 'remodel': 2, 'alter': 3, 'field': 4, 'floor': 5, 'inspect': 6, 'offic': 7, 'permit': 8, 'space': 9, 'apart': 10, 'basement': 11, 'decovert': 12, 'du': 13, 'configur': 14, 'porch': 15, 'wood': 16, 'work': 17, 'conveni': 18, 'per': 19, 'stair': 20, 'bear': 21, 'famili': 22, 'fixtur': 23, 'frame': 24, 'millwork': 25, 'nonload': 26, 'plumb': 27, 'resid': 28, 'wall': 29, 'garag': 30, 'partit': 31, 'remov': 32, 'baghous': 33, 'equip': 34, 'erect': 35, 'foot': 36, 'station': 37, 'steel': 38, 'structur': 39, 'support': 40, 'addit': 41, 'alm': 42, 'em': 43, 'fire': 44, 'loc': 45, 'maint': 46, 'req': 47, 'stori': 48, 'sy': 49, 'deconvert': 50, 'interior': 51, 'bur': 52, 'cn': 53, 'construct': 54, 'demolit': 55, 'insp': 56, 'masonri': 57, 'owner': 58, 'chang': 59, 'renov': 60, 'system': 61, 'verif': 62, 'coffe': 63, 'shop': 64, 'rear': 65, 'damag': 66, 'joist': 67, 'repair': 68, 'wa

In [34]:
# Create an LDA model (num_topics is left at the gensim default of 100)
lda_model = models.LdaModel(BoW_corpus, id2word=dict_LoS)

# Explore the topics
topics = lda_model.show_topics()

for topic in topics:
    print(topic)

(44, '0.242*"damag" + 0.206*"repair" + 0.184*"sink" + 0.117*"joist" + 0.095*"floor" + 0.043*"plan" + 0.020*"shingl" + 0.020*"case" + 0.020*"rafter" + 0.017*"gutter"')
(86, '0.308*"tower" + 0.186*"homeown" + 0.164*"cell" + 0.146*"pad" + 0.069*"plan" + 0.039*"monopol" + 0.032*"shelter" + 0.019*"antenna" + 0.000*"stori" + 0.000*"floor"')
(32, '0.736*"school" + 0.089*"vestibul" + 0.055*"conveni" + 0.052*"pavement" + 0.015*"stori" + 0.009*"plansno" + 0.008*"franklin" + 0.005*"plan" + 0.004*"clay" + 0.002*"floor"')
(30, '0.615*"frame" + 0.170*"stori" + 0.097*"garag" + 0.042*"plan" + 0.038*"ductwork" + 0.026*"porch" + 0.005*"alter" + 0.002*"floor" + 0.000*"addit" + 0.000*"wood"')
(12, '0.468*"buildout" + 0.307*"sf" + 0.094*"duct" + 0.040*"floor" + 0.031*"plan" + 0.022*"skylight" + 0.014*"stori" + 0.005*"decor" + 0.004*"awn" + 0.000*"build"')
(90, '0.849*"use" + 0.080*"storag" + 0.038*"stori" + 0.012*"garden" + 0.009*"floor" + 0.003*"plan" + 0.002*"alter" + 0.002*"build" + 0.001*"appart" + 0.0

In [35]:
from gensim.models import CoherenceModel

# Compute coherence score
coherence_model = CoherenceModel(model=lda_model, texts=renovation_nouns, dictionary=dict_LoS, coherence='c_v')
coherence_score = coherence_model.get_coherence()

print("Coherence Score:", coherence_score)

Coherence Score: 0.3430438330205731


In [36]:
from gensim.models import HdpModel

# BoW_corpus and dict_LoS were built in the cells above

# Create an HDP model
hdp_model = HdpModel(BoW_corpus, id2word=dict_LoS)

# Explore the topics
topics = hdp_model.show_topics()

for topic in topics:
    print(topic)

(0, '0.058*plan + 0.034*floor + 0.033*stori + 0.028*porch + 0.024*alter + 0.022*build + 0.021*permit + 0.020*work + 0.020*wood + 0.019*unit + 0.016*space + 0.016*offic + 0.016*field + 0.015*inspect + 0.015*renov + 0.013*famili + 0.013*addit + 0.013*use + 0.013*basement + 0.012*self')
(1, '0.030*plan + 0.017*floor + 0.016*stori + 0.014*porch + 0.013*alter + 0.010*build + 0.010*wood + 0.009*work + 0.009*permit + 0.009*unit + 0.008*addit + 0.008*famili + 0.007*space + 0.007*field + 0.007*offic + 0.006*inspect + 0.006*renov + 0.006*resid + 0.006*basement + 0.005*use')
(2, '0.030*plan + 0.019*porch + 0.017*stori + 0.014*floor + 0.013*wood + 0.011*alter + 0.010*build + 0.009*permit + 0.009*work + 0.009*unit + 0.007*space + 0.007*field + 0.006*offic + 0.006*inspect + 0.006*stair + 0.006*renov + 0.006*repair + 0.006*addit + 0.006*famili + 0.006*basement')
(3, '0.026*plan + 0.013*floor + 0.013*stori + 0.013*porch + 0.010*alter + 0.009*wood + 0.009*build + 0.009*unit + 0.009*work + 0.008*permit 

In [37]:
# Compute coherence score
coherence_model = CoherenceModel(model=hdp_model, texts=renovation_nouns, dictionary=dict_LoS)
coherence_score = coherence_model.get_coherence()

print("Coherence Score:", coherence_score)

Coherence Score: 0.32898946240735827


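* HdpModel is not yet performing as expected: its coherence score (0.33) is no better than that of the default LDA model (0.34), and its topics are all dominated by the same words (plan, floor, stori, porch)
* Let's stick with LDA for now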

---

In [38]:
# Create an LDA model
lda_model = models.LdaModel(BoW_corpus,num_topics=3,id2word=dict_LoS)

# Explore the topics
topics = lda_model.show_topics()

for topic in topics:
    print(topic)

(0, '0.086*"stori" + 0.065*"plan" + 0.057*"unit" + 0.041*"basement" + 0.040*"famili" + 0.039*"build" + 0.036*"addit" + 0.034*"floor" + 0.030*"resid" + 0.026*"alter"')
(1, '0.046*"floor" + 0.037*"plan" + 0.036*"offic" + 0.035*"space" + 0.032*"alter" + 0.030*"permit" + 0.029*"self" + 0.029*"use" + 0.027*"work" + 0.025*"field"')
(2, '0.148*"porch" + 0.107*"wood" + 0.089*"plan" + 0.062*"stair" + 0.049*"repair" + 0.042*"locat" + 0.037*"size" + 0.037*"stori" + 0.023*"deck" + 0.021*"violat"')


In [39]:
# Compute coherence score
coherence_model = CoherenceModel(model=lda_model, texts=renovation_nouns, dictionary=dict_LoS, coherence='c_v')
coherence_score = coherence_model.get_coherence()

print("Coherence Score:", coherence_score)

Coherence Score: 0.5386100218289125


* Changing the number of topics from the gensim default of 100 to 3 has increased the coherence score from 0.34 to 0.54

In [46]:
topic_predictions = lda_model.get_document_topics(BoW_corpus)
topic_predictions

<gensim.interfaces.TransformedCorpus at 0x17810aec0>

**sorted(doc_topics, key=lambda x: x[1], reverse=True)** would return an output such as **[(1, 0.8486103), (0, 0.124885045), (2, 0.026504716)]**
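
The sorting step can be tried on the example distribution quoted above. The sketch also shows `max` with the same key as a possible alternative when only the top topic is needed; that alternative is a suggestion, not what the notebook uses.

```python
# Topic distribution for one document, as (topic_id, probability) tuples
doc_topics = [(0, 0.124885045), (1, 0.8486103), (2, 0.026504716)]

# Sort by the second tuple item (the probability), highest first
sorted_topics = sorted(doc_topics, key=lambda x: x[1], reverse=True)
top_topic, top_probability = sorted_topics[0]

# Alternative: pick the most probable topic without sorting the whole list
best = max(doc_topics, key=lambda x: x[1])

print(sorted_topics)
print(top_topic, top_probability)
```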

In [58]:
# Collect the most probable topic per document, then build the dataframe once
# (much faster than growing a dataframe row by row with .loc)
top_topics = []
for doc_topics in topic_predictions:
    # Sort the topic distribution in descending order
    # key=lambda x: x[1] to sort by the second [1] item from the tuple
    sorted_topics = sorted(doc_topics, key=lambda x: x[1], reverse=True)
    # Extract the top predicted topic and its probability
    top_topics.append(sorted_topics[0])
pred_df=pd.DataFrame(top_topics, columns=['pred_class','probability'])

In [60]:
class_di={0:'extension',1:'office/alteration',2:'repair/porch'}

In [66]:
pred_df['class_names']=[class_di[k] for k in pred_df.pred_class]
pred_df

Unnamed: 0,pred_class,probability,class_names
0,0,0.81836,extension
1,1,0.91466,office/alteration
2,1,0.910198,office/alteration
3,0,0.810642,extension
4,2,0.713313,repair/porch
...,...,...,...
146849,0,0.91508,extension
146850,0,0.718805,extension
146851,0,0.715179,extension
146852,1,0.700495,office/alteration


In [73]:
filter=df['PERMIT_TYPE']=='RENOVATION/ALTERATION'
filter

0          True
1         False
2         False
3          True
4          True
          ...  
730506    False
730507    False
730508    False
730509    False
730510    False
Name: PERMIT_TYPE, Length: 730511, dtype: bool

In [79]:
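# assign by position: pred_df has a fresh 0..n-1 index, so an index-aligned assignment would attach predictions to the wrong rows
df.loc[filter,'class_names']=pred_df['class_names'].to_numpy()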

In [82]:
df['class_names']=df['class_names'].fillna('not_renovation')

In [90]:
overview(df)

The dataframe shape is (730511, 16)


Column_Name,Data Types,Total Null Values,Null Values Percentage,Sample Value Head,Sample Value Tail,Sample Value
PERMIT_TYPE,object,0,0.0,RENOVATION/ALTERATION,ELECTRIC WIRING,EASY PERMIT PROCESS
REVIEW_TYPE,object,0,0.0,STANDARD PLAN REVIEW,EASY PERMIT WEB,EASY PERMIT
WORK_DESCRIPTION,object,0,0.0,INTERIOR REMODELING OF EXISTING 3 D.U. PER PLA...,REPAIR SERVICE,INSTALL INTERIOR DRAIN TILE PER CODE AND USING...
CONTACT_1_TYPE,object,0,0.0,OWNER AS GENERAL CONTRACTOR,CONTRACTOR-ELECTRICAL,EXPEDITOR
CONTACT_1_CITY,object,0,0.0,CHICAGO,CHICAGO_SUBURBS,CHICAGO
CONTACT_1_STATE,object,0,0.0,IL,OTHER,IL
CENSUS_TRACT,int64,0,0.0,220702,530503,20802
LOG_PROCESSING_TIME,float64,0,0.0,4.394449,-23.025851,-23.025851
LOG_BUILDING_FEE_PAID,float64,0,0.0,4.828314,4.317488,6.437752
LOG_ZONING_FEE_PAID,float64,0,0.0,4.317488,-23.025851,3.912023


***
## EXPORT CLEANED DATAFRAME

In [91]:
df.to_csv('../data/clean/permits_cleaned.csv')