# Topic Modeling with Empath 

## Empath module information 
GitHub repo: [Empath](https://github.com/Ejhfast/empath-client)  
Research publication: [Empath: Understanding Topic Signals in Large-Scale Text](https://arxiv.org/pdf/1602.06979.pdf)



We will compare three methods for topic modeling:
1. **Document-Term Matrix** from the TF-IDF vectorizer  
   The text is preprocessed (stop-word removal, lemmatization, part-of-speech tagging, etc.).  
   It is a bag-of-words representation, so word order is not preserved.

2. **Corpus** (that is, the full transcript)  
   Word order is preserved.

3. **Subtitle**  
   The original subtitle of each transcript, typically a one- or two-line summary of the transcript.

In [8]:
import pandas as pd
from empath import Empath

In [4]:
lexicon = Empath()

### 1. Topic Modeling with Document-Term Matrix 

In [112]:
# load the top 30 common words dictionary created using TF-IDF vectorizer
common_words_dict = pd.read_pickle('/Users/lihuicham/Desktop/Y2S2/BT4222/project/standup-comedy-analysis/main/pickle/common_words_tfidf.pkl')

In [113]:
# keep only the first element of each entry (the word itself)
words = [i[0] for i in common_words_dict[32]]
    
sent = ' '.join(words)
sent

'woman ali cum cheat husband radiologist pussy dick men fan ct chill colonoscopy barbara orgasm magician standup wong power dude face single respect underwear millionaire year quality money pubes earn'

In [114]:
lext_dict_dtm = lexicon.analyze(sent, normalize=True)

In [115]:
# top 15 categories 
dtm_topics = sorted(lext_dict_dtm, key=lext_dict_dtm.get, reverse=True)[:15]
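
To make the scores behind this ranking easier to read: as we understand it, a normalized lexicon score is roughly the number of tokens that fall into a category divided by the total number of tokens. The sketch below illustrates that idea with a hypothetical two-category lexicon. `TOY_LEXICON` and `toy_analyze` are invented for this example and are not part of Empath, whose own category lists are far larger. It also shows that a purely count-based score does not depend on word order.

```python
# Hypothetical toy lexicon, for illustration only (not Empath's data).
TOY_LEXICON = {
    "money": {"money", "millionaire", "earn"},
    "family": {"husband", "wife"},
}

def toy_analyze(text, lexicon=TOY_LEXICON, normalize=True):
    """Count lexicon hits per category, optionally divided by the token count."""
    tokens = text.split()
    counts = {category: 0.0 for category in lexicon}
    for token in tokens:
        for category, words in lexicon.items():
            if token in words:
                counts[category] += 1.0
    if normalize and tokens:
        counts = {category: n / len(tokens) for category, n in counts.items()}
    return counts

scores = toy_analyze("husband money earn millionaire")
shuffled = toy_analyze("millionaire earn money husband")
print(scores)              # {'money': 0.75, 'family': 0.25}
print(scores == shuffled)  # True: word order does not change the counts
```

Under this assumption, the bag-of-words input and the full transcript differ mainly in which words they contain and how often, not in their order.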

### 2. Topic Modeling with Corpus (Full Transcript) 

In [50]:
# load corpus 
df = pd.read_pickle('/Users/lihuicham/Desktop/Y2S2/BT4222/project/standup-comedy-analysis/main/pickle/corpus.pkl')
df.head()

Unnamed: 0,Comedian,Date,Title,Subtitle,Transcript
0,Chris Rock,"March 8, 2023",Selective Outrage (2023) | Transcript,,lets go she said ill do anything you w...
1,Marc Maron,"March 3, 2023",Thinky Pain (2013) | Transcript,Marc Maron returns to his old stomping grounds...,i dont know what you were thinking like im no...
2,Chelsea Handler,"March 3, 2023",Evolution (2020) | Transcript,Chelsea Handler is back and better than ever -...,join me in welcoming the author of six number ...
3,Tom Papa,"March 3, 2023",What A Day! (2022) | Transcript,"Follows Papa as he shares about parenting, his...",premiered on december ladies and gentlemen g...
4,Jim Jefferies,"February 22, 2023",High n’ Dry (2023) | Transcript,Jim Jefferies is back and no topic is off limi...,please welcome to the stage jim jefferies hell...


In [120]:
transcript = df.Transcript[32].strip()
# transcript

In [117]:
lext_dict_transcript = lexicon.analyze(transcript, normalize=True)

In [118]:
# top 20 categories 
corpus_topics = sorted(lext_dict_transcript, key=lext_dict_transcript.get, reverse=True)[:20]

### 3. Topic Modeling with Subtitle 

In [121]:
subtitle = df.Subtitle[32]
subtitle

'Ali Wong discusses her deepest fantasies, the challenges of monogamy, and her feelings about single people.'

In [122]:
lext_dict_subtitle = lexicon.analyze(subtitle, normalize=True)

In [123]:
# top 10 categories 
subtitle_topics = sorted(lext_dict_subtitle, key=lext_dict_subtitle.get, reverse=True)[:10]

## Comparing the three methods 

In [124]:
corpus_topics  # 20 topics 

['children',
 'family',
 'speaking',
 'swearing_terms',
 'body',
 'communication',
 'positive_emotion',
 'negative_emotion',
 'wedding',
 'giving',
 'sexual',
 'business',
 'masculine',
 'feminine',
 'home',
 'party',
 'trust',
 'appearance',
 'valuable',
 'friends']

In [125]:
dtm_topics  # 15 topics 

['power',
 'money',
 'magic',
 'banking',
 'sexual',
 'valuable',
 'children',
 'wedding',
 'cold',
 'family',
 'masculine',
 'wealthy',
 'social_media',
 'fear',
 'body']

In [126]:
subtitle_topics  # 10 topics 

['help',
 'office',
 'dance',
 'money',
 'wedding',
 'domestic_work',
 'sleep',
 'medical_emergency',
 'cold',
 'hate']
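
A top-N slice over the full score dictionary always returns N names, even when fewer than N categories scored above zero. For a one-sentence input such as the subtitle, the list may therefore be padded with zero-scored categories in dictionary order. The scores are not shown above, so this is a possibility rather than a finding. The sketch below drops zero scores before ranking. `top_nonzero` and `example_scores` are hypothetical, with made-up values.

```python
def top_nonzero(scores, n):
    """Return up to n category names with a score above zero, highest first."""
    nonzero = {category: score for category, score in scores.items() if score > 0}
    return sorted(nonzero, key=nonzero.get, reverse=True)[:n]

# Made-up scores for illustration only.
example_scores = {"help": 0.0, "office": 0.0, "wedding": 0.1, "sexual": 0.2}

# Plain slicing pads the result with a zero-scored category.
plain_top3 = sorted(example_scores, key=example_scores.get, reverse=True)[:3]
filtered_top3 = top_nonzero(example_scores, 3)

print(plain_top3)     # ['sexual', 'wedding', 'help']
print(filtered_top3)  # ['sexual', 'wedding']
```

Calling `top_nonzero(lext_dict_subtitle, 10)` in this notebook would show how many subtitle categories actually matched.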

### Getting the common topics by pairing the methods 

In [127]:
# corpus X dtm 
set_corpus_dtm = set(corpus_topics)&set(dtm_topics) 
corpus_dtm = sorted(set_corpus_dtm, key = lambda k : corpus_topics.index(k))
corpus_dtm

['children', 'family', 'body', 'wedding', 'sexual', 'masculine', 'valuable']

In [128]:
# dtm X subtitle 
set_dtm_subtitle = set(dtm_topics)&set(subtitle_topics) 
dtm_subtitle = sorted(set_dtm_subtitle, key = lambda k : dtm_topics.index(k))
dtm_subtitle

['money', 'wedding', 'cold']

In [129]:
# corpus X subtitle  
set_corpus_subtitle = set(corpus_topics)&set(subtitle_topics) 
corpus_subtitle = sorted(set_corpus_subtitle, key = lambda k : corpus_topics.index(k))
corpus_subtitle

['wedding']

In [130]:
# corpus X dtm X subtitle 
set_all_methods = set(corpus_topics)&set(subtitle_topics)&set(dtm_topics) 
all_methods = sorted(set_all_methods, key = lambda k : corpus_topics.index(k))
all_methods

['wedding']
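
The pairwise and three-way comparisons above repeat the same pattern: intersect the topic lists, then restore the order of one reference list. A small helper can cover any number of lists. `ordered_intersection` is a hypothetical name, and the lists below are shortened, illustrative examples rather than the full outputs.

```python
def ordered_intersection(reference, *others):
    """Topics present in every list, kept in the order of `reference`."""
    shared = set(reference).intersection(*others)
    return [topic for topic in reference if topic in shared]

# Shortened, illustrative lists (not the full notebook outputs).
corpus = ["children", "family", "body", "wedding", "sexual"]
dtm = ["money", "sexual", "children", "wedding", "family"]
subtitle = ["help", "money", "wedding", "cold"]

pair = ordered_intersection(corpus, dtm)
triple = ordered_intersection(corpus, dtm, subtitle)

print(pair)    # ['children', 'family', 'wedding', 'sexual']
print(triple)  # ['wedding']
```

Because the set is named once inside the helper, the sorted result cannot accidentally be built from a different set than the one just computed.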

## Conclusion 

We can consider combining all three methods to produce the one or two most likely topics for each transcript. 
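
One possible way to combine the rankings is a Borda-style vote, in which each list awards more points to topics near its top and the points are summed across lists. This is an illustrative assumption, not a method used in this notebook. `combine_rankings` and the short example rankings are hypothetical.

```python
from collections import defaultdict

def combine_rankings(rankings, top_k=2):
    """Borda-style vote: position i in a list of length n earns n - i points."""
    points = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, topic in enumerate(ranking):
            points[topic] += n - position
    # Sort by points (descending), then alphabetically so ties are deterministic.
    ordered = sorted(points, key=lambda topic: (-points[topic], topic))
    return ordered[:top_k]

# Made-up rankings for illustration only.
rankings = [
    ["children", "family", "wedding"],
    ["money", "wedding", "children"],
    ["wedding", "money"],
]
best = combine_rankings(rankings)
print(best)  # ['wedding', 'children']
```

With this scoring, longer lists hand out more points per topic, so the 20-topic corpus list would outweigh the 10-topic subtitle list. Truncating all lists to the same length first would remove that imbalance.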

## Test 
Use a simple sentence to test the `Empath` library.

In [27]:
lex_dict_test = lexicon.analyze("he hit the other person", normalize=True)
# uncomment the line below to see the normalized count for each category
# lex_dict_test 

In [63]:
# get the top 5 categories 
sorted(lex_dict_test, key=lex_dict_test.get, reverse=True)[:5]

['movement', 'violence', 'pain', 'negative_emotion', 'help']