# 4. Topic modeling

For topic modeling, the inputs are a document-term matrix (DTM), the number of topics, and the number of iterations (the `passes` argument in the code below).
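To make the first input concrete, here is a minimal sketch of a document-term matrix built with the standard library only. The two toy documents and the helper name `build_dtm` are made up for illustration; the notebook itself builds its DTM with `CountVectorizer`.

```python
from collections import Counter

# Toy documents (hypothetical; not taken from the transcripts)
docs = {
    "intro": "love rose love",
    "outro": "rose date",
}

def build_dtm(documents):
    """Return (vocabulary, rows); rows maps each document name to its term counts."""
    counts = {name: Counter(text.split()) for name, text in documents.items()}
    vocabulary = sorted(set().union(*counts.values()))
    rows = {name: [c[term] for term in vocabulary] for name, c in counts.items()}
    return vocabulary, rows

vocabulary, dtm = build_dtm(docs)
print(vocabulary)
for name, row in dtm.items():
    print(name, row)
```

Each row is one document and each column is one term, which is the same layout as the `dtm_seg` DataFrame built later in this notebook.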

<b>4.1</b>
The goal is to perform topic modeling across the following combined segments of the 11 episodes: intro, 1/3, 2/3, 3/3, and outro.

<i>Our goal is to find themes across the intro, 1/3, 2/3, 3/3, and outro and see which segments tend to talk about which themes.</i>

In [1]:
# Import the necessary modules for LDA with gensim
# Terminal / Anaconda Navigator: conda install -c conda-forge gensim

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
import scipy.sparse

from gensim import corpora, models, matutils
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [2]:
#!pip install python-Levenshtein

In [3]:
#!pip uninstall numpy -y
#!pip install numpy

In [4]:
#pip install -U gensim

## 4.1 Preprocessing

In [5]:
#first need to combine all segments across episodes
#performed the above in 1. Data Cleaning jupyter notebook
df = pd.read_pickle("s19_df_corpus.pkl")
df.head()

Unnamed: 0,episode,start_times,segment,5_min_interval,lines
0,1,2022-10-04 00:00:00,intro,0,i still cannot believe that i am the bachelor...
1,1,2022-10-04 00:05:00,first third,5,i did not want you to leave but now you do ...
2,1,2022-10-04 00:10:00,first third,10,by going through what we did go through so i...
3,1,2022-10-04 00:15:00,first third,15,all right anxiety attack here we go wow yea...
4,1,2022-10-04 00:20:00,first third,20,nice clearly i am not good at juggling two w...


In [6]:
#let's delete unnecessary columns
df = df.drop(columns=['start_times', '5_min_interval', 'episode'])
df.head()

Unnamed: 0,segment,lines
0,intro,i still cannot believe that i am the bachelor...
1,first third,i did not want you to leave but now you do ...
2,first third,by going through what we did go through so i...
3,first third,all right anxiety attack here we go wow yea...
4,first third,nice clearly i am not good at juggling two w...


In [7]:
#going to combine all intros, first thirds, second thirds, third thirds, and outros
df["lines_by_segment"] = df.groupby(["segment"])["lines"].transform(lambda x: ' '.join(x))
#df = df.drop_duplicates().reset_index(drop=True)


In [8]:
df = df.groupby("segment").last()
df = df.drop(['lines'], axis=1)

In [9]:
df.head()

Unnamed: 0_level_0,lines_by_segment
segment,Unnamed: 1_level_1
first third,i did not want you to leave but now you do ...
intro,i still cannot believe that i am the bachelor...
outro,gabby and rachel we are the bachelorette wh...
second third,yeah he is my type yeah yeah i am immediat...
third third,so i am really curious to see what he has for...
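The two cells above reach this result with `transform` followed by `groupby(...).last()`. A one-step alternative is sketched below, assuming pandas is available; the toy frame and the name `by_segment` are made up for illustration.

```python
import pandas as pd

# Toy frame with the same two columns as df (values are hypothetical)
toy = pd.DataFrame({
    "segment": ["intro", "first third", "intro", "first third"],
    "lines": ["hello there", "first date", "welcome back", "group date"],
})

# Join all lines of a segment into one string, one row per segment
by_segment = (
    toy.groupby("segment")["lines"]
       .agg(" ".join)
       .to_frame("lines_by_segment")
)
print(by_segment)
```

As in the notebook's output, `segment` becomes the index and the groups come back in alphabetical order.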


In [10]:
#should only have 1 column (segment is now the index) and 5 rows
df.info()

#looks good

<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, first third to third third
Data columns (total 1 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   lines_by_segment  5 non-null      object
dtypes: object(1)
memory usage: 80.0+ bytes


In [11]:
#let's perform some cleaning on this corpus before we apply topic modeling
df_segment = list(df.lines_by_segment)
df_segment

[' i did not want you to leave  but now you do  so what the is the difference  except that one time it was going to be my decision  which you did not want it to be  and now it is your decision so it is easier  no yes it is  i tried so hard  i gave you everything  i fought for this every single day  and you never once fought for me  you never did  i was really heartbroken  but gabby and i really picked up the pieces for each other  and i am so ready to start this new chapter with her  just got back to colorado  and i have to prepare for the bachelorette  so i need to pack get my together  organize say goodbye to my dog  in a matter of hours  which seems insane  but i am so ready to do it  come give your mom a kiss  i love you i am going to miss you  as crazy as this all is  i feel like i am packed  i am ready to go  get a good night s rest  get on the plane  fly to la and see gabby  being the bachelorette has not even sunk in yet  and i do not want it to  because i think the second it s

In [12]:
#we make our NEW DTM
cv = CountVectorizer(stop_words='english') 
X = cv.fit_transform(df_segment)
#get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
dtm_seg = pd.DataFrame(X.toarray(), columns=cv.get_feature_names_out())
dtm_seg.index = df.index
dtm_seg


Unnamed: 0_level_0,aah,abc,abilities,ability,able,abs,absolute,absolutely,absurd,abundance,...,younger,youse,yum,yummy,zach,zachary,zero,zhh,zodiac,zone
segment,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
first third,0,2,0,1,35,0,4,39,1,0,...,3,1,0,0,52,1,5,0,0,5
intro,0,0,0,0,4,2,0,4,1,0,...,0,0,0,0,8,0,0,0,0,0
outro,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,1,0,0,0,0,0
second third,1,1,1,1,34,1,3,43,0,2,...,1,0,1,1,17,0,0,2,1,4
third third,0,1,0,1,36,0,0,40,0,0,...,0,0,0,0,15,0,3,0,0,0


In [13]:
#now we build a TERM document matrix (terms as rows, segments as columns)
#easier to manipulate
term_doc = dtm_seg.transpose()
term_doc
#code to help me do all the below cells:
#https://github.com/adashofdata/nlp-in-python-tutorial

segment,first third,intro,outro,second third,third third
aah,0,0,0,1,0
abc,2,0,0,1,1
abilities,0,0,0,1,0
ability,1,0,0,1,1
able,35,4,0,34,36
...,...,...,...,...,...
zachary,1,0,0,0,0
zero,5,0,0,0,3
zhh,0,0,0,2,0
zodiac,0,0,0,1,0


In [14]:
#what are the top 20 words by segment?
top_dict = {}
for c in term_doc.columns:
    top = term_doc[c].sort_values(ascending=False).head(20)
    top_dict[c]=list(zip(top.index, top.values))
    
top_dict

{'first third': [('like', 1149),
  ('know', 559),
  ('yeah', 473),
  ('just', 465),
  ('really', 371),
  ('going', 304),
  ('want', 259),
  ('think', 256),
  ('oh', 252),
  ('feel', 234),
  ('love', 208),
  ('right', 179),
  ('good', 173),
  ('rachel', 173),
  ('did', 165),
  ('time', 152),
  ('gabby', 148),
  ('okay', 130),
  ('today', 116),
  ('kind', 110)],
 'intro': [('like', 136),
  ('going', 70),
  ('just', 67),
  ('know', 65),
  ('yeah', 58),
  ('rachel', 58),
  ('gabby', 55),
  ('want', 48),
  ('right', 47),
  ('love', 47),
  ('really', 46),
  ('feel', 44),
  ('think', 43),
  ('week', 34),
  ('guys', 32),
  ('oh', 32),
  ('excited', 27),
  ('did', 27),
  ('time', 21),
  ('tonight', 20)],
 'outro': [('like', 61),
  ('going', 36),
  ('know', 34),
  ('want', 30),
  ('love', 29),
  ('just', 28),
  ('oh', 26),
  ('yeah', 24),
  ('feel', 19),
  ('think', 17),
  ('gabby', 16),
  ('right', 14),
  ('rachel', 14),
  ('good', 13),
  ('did', 12),
  ('really', 11),
  ('bachelorette', 10),
 

In [15]:
#checking out the top words per segment (the slice [0:11] keeps 11 words)
for i, top_words in top_dict.items():
    print(i)
    print(', '.join([word for word, count in top_words[0:11]]))
    print("  ")

first third
like, know, yeah, just, really, going, want, think, oh, feel, love
  
intro
like, going, just, know, yeah, rachel, gabby, want, right, love, really
  
outro
like, going, know, want, love, just, oh, yeah, feel, think, gabby
  
second third
like, know, just, yeah, really, going, oh, think, feel, want, guys
  
third third
like, know, just, yeah, really, think, want, feel, going, rose, right
  


<B>NOTE</B>: "like" needs to be removed because it is not a useful term. The same goes for "yeah" and "oh".

In [16]:
#collect the top 20 words from each segment into one list
words = []
for c in term_doc.columns:
    top = [word for (word,count) in top_dict[c]]
    for t in top:
        words.append(t)
        
words

['like',
 'know',
 'yeah',
 'just',
 'really',
 'going',
 'want',
 'think',
 'oh',
 'feel',
 'love',
 'right',
 'good',
 'rachel',
 'did',
 'time',
 'gabby',
 'okay',
 'today',
 'kind',
 'like',
 'going',
 'just',
 'know',
 'yeah',
 'rachel',
 'gabby',
 'want',
 'right',
 'love',
 'really',
 'feel',
 'think',
 'week',
 'guys',
 'oh',
 'excited',
 'did',
 'time',
 'tonight',
 'like',
 'going',
 'know',
 'want',
 'love',
 'just',
 'oh',
 'yeah',
 'feel',
 'think',
 'gabby',
 'right',
 'rachel',
 'good',
 'did',
 'really',
 'bachelorette',
 'mm',
 'look',
 'got',
 'like',
 'know',
 'just',
 'yeah',
 'really',
 'going',
 'oh',
 'think',
 'feel',
 'want',
 'guys',
 'right',
 'love',
 'good',
 'did',
 'gabby',
 'rachel',
 'time',
 'okay',
 'kind',
 'like',
 'know',
 'just',
 'yeah',
 'really',
 'think',
 'want',
 'feel',
 'going',
 'rose',
 'right',
 'rachel',
 'did',
 'gabby',
 'oh',
 'time',
 'love',
 'okay',
 'guys',
 'good']

In [122]:
from collections import Counter
Counter(words).most_common()
#we can see in how many of our 5 documents each of these top words
#makes the top 20
#the overlap shouldn't be too shocking since 3 of our 5 documents contain a LOT of words.

[('like', 5),
 ('know', 5),
 ('yeah', 5),
 ('just', 5),
 ('really', 5),
 ('going', 5),
 ('want', 5),
 ('think', 5),
 ('oh', 5),
 ('feel', 5),
 ('love', 5),
 ('right', 5),
 ('rachel', 5),
 ('did', 5),
 ('gabby', 5),
 ('good', 4),
 ('time', 4),
 ('okay', 3),
 ('guys', 3),
 ('kind', 2),
 ('today', 1),
 ('week', 1),
 ('excited', 1),
 ('tonight', 1),
 ('bachelorette', 1),
 ('mm', 1),
 ('look', 1),
 ('got', 1),
 ('rose', 1)]
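One way to turn counts like these into a candidate stop-word list is to keep the words that rank among the top words of many segments. The sketch below uses toy lists rather than the real `top_dict`, and the helper `shared_top_words` and its `min_segments` threshold are hypothetical names introduced for illustration.

```python
from collections import Counter

# Toy top-word lists per segment (hypothetical; the real lists live in top_dict)
top_words_by_segment = {
    "intro": ["like", "going", "week"],
    "outro": ["like", "going", "love"],
    "first third": ["like", "know", "love"],
}

def shared_top_words(top_words, min_segments):
    """Return words found in the top list of at least `min_segments` segments."""
    counts = Counter(word for words in top_words.values() for word in set(words))
    return sorted(word for word, n in counts.items() if n >= min_segments)

print(shared_top_words(top_words_by_segment, min_segments=3))
print(shared_top_words(top_words_by_segment, min_segments=2))
```

A shared word is not automatically a stop word: names such as "rachel" and "gabby" appear in all five segments here but carry meaning, so the final list still needs a manual pass.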

In [123]:
#extra stop words to remove on top of scikit-learn's English list
add_stop_words = ['like', 'yeah', 'know', 'going', 'just', 'really', 'think',
                  'feel', 'want', 'oh', 'right', 'today', 'tonight', 'did', 'able']

add_stop_words
#these are the same stop words removed in the 2. EDA notebook, plus today, tonight, did,
#and able

['like',
 'yeah',
 'know',
 'going',
 'just',
 'really',
 'think',
 'feel',
 'want',
 'oh',
 'right',
 'today',
 'tonight',
 'did',
 'able']

In [124]:
#let's make a new dtm without these stop words now
from sklearn.feature_extraction import text

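#adding new stop words
#(converted to a list because newer scikit-learn versions reject a frozenset here)
stop_words = list(text.ENGLISH_STOP_WORDS.union(add_stop_words))

#creating new DTM
cv_r = CountVectorizer(stop_words=stop_words) 
X_r = cv_r.fit_transform(df_segment)
#get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
dtm_seg_r = pd.DataFrame(X_r.toarray(), columns=cv_r.get_feature_names_out())
dtm_seg_r.index = df.index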

dtm_seg_r.to_pickle("DTM_v3.pkl")
#DTM_V3 is a DTM where each document/row is A SEGMENT and stop words removed
#DTM_V2 is a DTM where each document/row is AN EPISODE
#DTM_V1 is a DTM where each document/row is a 5 min segment across ALL the episodes!

dtm_seg_r

Unnamed: 0_level_0,aah,abc,abilities,ability,abs,absolute,absolutely,absurd,abundance,accept,...,younger,youse,yum,yummy,zach,zachary,zero,zhh,zodiac,zone
segment,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
first third,0,2,0,1,0,4,39,1,0,21,...,3,1,0,0,52,1,5,0,0,5
intro,0,0,0,0,2,0,4,1,0,1,...,0,0,0,0,8,0,0,0,0,0
outro,0,0,0,0,0,0,1,0,0,3,...,0,0,0,0,1,0,0,0,0,0
second third,1,1,1,1,1,3,43,0,2,17,...,1,0,1,1,17,0,0,2,1,4
third third,0,1,0,1,0,0,40,0,0,54,...,0,0,0,0,15,0,3,0,0,0


In [125]:
term_doc_2 = dtm_seg_r.transpose()
term_doc_2

segment,first third,intro,outro,second third,third third
aah,0,0,0,1,0
abc,2,0,0,1,1
abilities,0,0,0,1,0
ability,1,0,0,1,1
abs,0,2,0,1,0
...,...,...,...,...,...
zachary,1,0,0,0,0
zero,5,0,0,0,3
zhh,0,0,0,2,0
zodiac,0,0,0,1,0


## 4.2 Topic Modeling - Attempt #1 (all text minus stop words)
- To perform LDA topic modeling we need 3 things: the transposed DTM (a term-document matrix), the number of topics, and the number of passes over the corpus.
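The transpose matters because `matutils.Sparse2Corpus` reads each column as one document by default (`documents_columns=True`). The plain-Python sketch below mimics the bag-of-words structure that results; the toy matrix and the helper `to_bow` are made up for illustration and are not part of gensim.

```python
# Toy term-document matrix: rows are terms, columns are documents
# (values are hypothetical)
terms = ["date", "love", "rose"]
term_doc = [
    [0, 1],  # "date": absent from document 0, once in document 1
    [2, 0],  # "love"
    [1, 1],  # "rose"
]

def to_bow(term_doc_rows):
    """Turn a terms-by-documents matrix into one [(term_id, count), ...] list per document."""
    n_docs = len(term_doc_rows[0])
    return [
        [(term_id, row[doc]) for term_id, row in enumerate(term_doc_rows) if row[doc] > 0]
        for doc in range(n_docs)
    ]

corpus = to_bow(term_doc)
id2word = dict(enumerate(terms))  # term id -> term, the same role as id2word below
print(corpus)
print(id2word)
```

Passing the untransposed DTM would instead treat every term as a document, and the model would be trained on thousands of tiny "documents" of at most five entries each.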

In [126]:
#trying topic modeling with 2 topics
sparse_counts = scipy.sparse.csr_matrix(term_doc_2)
corpus = matutils.Sparse2Corpus(sparse_counts)
id2word = dict((v, k) for k, v in cv_r.vocabulary_.items())
lda = models.LdaModel(corpus=corpus, num_topics=2, id2word=id2word, passes=10)
lda.print_topics()

2022-10-05 01:23:00,102 : INFO : using symmetric alpha at 0.5
2022-10-05 01:23:00,103 : INFO : using symmetric eta at 0.5
2022-10-05 01:23:00,106 : INFO : using serial LDA version on this node
2022-10-05 01:23:00,109 : INFO : running online (multi-pass) LDA training, 2 topics, 10 passes over the supplied corpus of 5 documents, updating model once every 5 documents, evaluating perplexity every 5 documents, iterating 50x with a convergence threshold of 0.001000
2022-10-05 01:23:00,221 : INFO : -8.766 per-word bound, 435.3 perplexity estimate based on a held-out corpus of 5 documents with 41406 words
2022-10-05 01:23:00,221 : INFO : PROGRESS: pass 0, at document #5/5
2022-10-05 01:23:00,245 : INFO : topic #0 (0.500): 0.014*"love" + 0.013*"gabby" + 0.013*"rachel" + 0.011*"good" + 0.010*"time" + 0.010*"okay" + 0.009*"guys" + 0.008*"kind" + 0.008*"rose" + 0.007*"got"
2022-10-05 01:23:00,249 : INFO : topic #1 (0.500): 0.012*"love" + 0.012*"rachel" + 0.011*"guys" + 0.010*"time" + 0.010*"gabby"

[(0,
  '0.014*"rachel" + 0.013*"love" + 0.012*"gabby" + 0.011*"good" + 0.011*"time" + 0.009*"okay" + 0.009*"rose" + 0.009*"guys" + 0.008*"kind" + 0.007*"thank"'),
 (1,
  '0.012*"guys" + 0.011*"love" + 0.010*"good" + 0.009*"gabby" + 0.009*"time" + 0.009*"rachel" + 0.008*"kind" + 0.007*"okay" + 0.006*"lot" + 0.006*"little"')]
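The log lines report both a per-word bound and a perplexity estimate. The two figures are consistent with perplexity = 2^(-bound), which the short check below reproduces for two bounds copied from the logs in this notebook. The function name is hypothetical.

```python
def perplexity_from_bound(per_word_bound):
    """Convert a per-word likelihood bound (log base 2) into a perplexity."""
    return 2 ** (-per_word_bound)

# Bounds copied from the training logs in this notebook
for bound in (-8.766, -7.042):
    print(bound, round(perplexity_from_bound(bound), 1))
```

A bound closer to zero means lower perplexity. Note that the logs describe a "held-out corpus of 5 documents", which is the same size as the whole corpus, so these values are best read as a rough training-time signal rather than a real held-out score.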

In [127]:
#trying topic modeling with 3 topics
lda = models.LdaModel(corpus=corpus, num_topics=3, id2word=id2word, passes=10)
lda.print_topics()

2022-10-05 01:23:01,437 : INFO : using symmetric alpha at 0.3333333333333333
2022-10-05 01:23:01,438 : INFO : using symmetric eta at 0.3333333333333333
2022-10-05 01:23:01,440 : INFO : using serial LDA version on this node
2022-10-05 01:23:01,447 : INFO : running online (multi-pass) LDA training, 3 topics, 10 passes over the supplied corpus of 5 documents, updating model once every 5 documents, evaluating perplexity every 5 documents, iterating 50x with a convergence threshold of 0.001000
2022-10-05 01:23:01,578 : INFO : -8.881 per-word bound, 471.4 perplexity estimate based on a held-out corpus of 5 documents with 41406 words
2022-10-05 01:23:01,579 : INFO : PROGRESS: pass 0, at document #5/5
2022-10-05 01:23:01,595 : INFO : topic #0 (0.333): 0.013*"rachel" + 0.011*"guys" + 0.011*"love" + 0.011*"time" + 0.010*"gabby" + 0.008*"okay" + 0.007*"kind" + 0.007*"good" + 0.006*"lot" + 0.006*"rose"
2022-10-05 01:23:01,597 : INFO : topic #1 (0.333): 0.014*"gabby" + 0.013*"rachel" + 0.012*"good"

2022-10-05 01:23:02,617 : INFO : topic #2 (0.333): 0.014*"love" + 0.011*"good" + 0.009*"rachel" + 0.009*"time" + 0.007*"excited" + 0.007*"okay" + 0.007*"gabby" + 0.006*"mean" + 0.006*"date" + 0.006*"kind"
2022-10-05 01:23:02,619 : INFO : topic diff=0.042187, rho=0.316228
2022-10-05 01:23:02,691 : INFO : -7.042 per-word bound, 131.8 perplexity estimate based on a held-out corpus of 5 documents with 41406 words
2022-10-05 01:23:02,691 : INFO : PROGRESS: pass 9, at document #5/5
2022-10-05 01:23:02,705 : INFO : topic #0 (0.333): 0.004*"blanco" + 0.004*"history" + 0.003*"renaissance" + 0.003*"love" + 0.003*"child" + 0.002*"dreams" + 0.002*"kid" + 0.002*"gnarly" + 0.002*"rachel" + 0.002*"feet"
2022-10-05 01:23:02,706 : INFO : topic #1 (0.333): 0.014*"rachel" + 0.013*"gabby" + 0.013*"love" + 0.012*"guys" + 0.011*"good" + 0.011*"time" + 0.010*"rose" + 0.010*"okay" + 0.008*"kind" + 0.007*"thank"
2022-10-05 01:23:02,707 : INFO : topic #2 (0.333): 0.014*"love" + 0.011*"good" + 0.009*"rachel" + 0

[(0,
  '0.004*"blanco" + 0.004*"history" + 0.003*"renaissance" + 0.003*"love" + 0.003*"child" + 0.002*"dreams" + 0.002*"kid" + 0.002*"gnarly" + 0.002*"rachel" + 0.002*"feet"'),
 (1,
  '0.014*"rachel" + 0.013*"gabby" + 0.013*"love" + 0.012*"guys" + 0.011*"good" + 0.011*"time" + 0.010*"rose" + 0.010*"okay" + 0.008*"kind" + 0.007*"thank"'),
 (2,
  '0.014*"love" + 0.011*"good" + 0.009*"rachel" + 0.009*"time" + 0.007*"excited" + 0.007*"okay" + 0.007*"gabby" + 0.006*"mean" + 0.006*"date" + 0.006*"kind"')]

In [128]:
#trying topic modeling with 4 topics
lda = models.LdaModel(corpus=corpus, num_topics=4, id2word=id2word, passes=10)
lda.print_topics()

2022-10-05 01:23:02,724 : INFO : using symmetric alpha at 0.25
2022-10-05 01:23:02,728 : INFO : using symmetric eta at 0.25
2022-10-05 01:23:02,733 : INFO : using serial LDA version on this node
2022-10-05 01:23:02,743 : INFO : running online (multi-pass) LDA training, 4 topics, 10 passes over the supplied corpus of 5 documents, updating model once every 5 documents, evaluating perplexity every 5 documents, iterating 50x with a convergence threshold of 0.001000
2022-10-05 01:23:02,843 : INFO : -9.012 per-word bound, 516.2 perplexity estimate based on a held-out corpus of 5 documents with 41406 words
2022-10-05 01:23:02,844 : INFO : PROGRESS: pass 0, at document #5/5
2022-10-05 01:23:02,860 : INFO : topic #0 (0.250): 0.011*"rachel" + 0.011*"love" + 0.010*"good" + 0.009*"time" + 0.009*"gabby" + 0.009*"guys" + 0.008*"okay" + 0.007*"kind" + 0.007*"let" + 0.007*"thank"
2022-10-05 01:23:02,861 : INFO : topic #1 (0.250): 0.013*"good" + 0.012*"rachel" + 0.010*"time" + 0.010*"love" + 0.010*"ros

2022-10-05 01:23:03,675 : INFO : PROGRESS: pass 7, at document #5/5
2022-10-05 01:23:03,711 : INFO : topic #0 (0.250): 0.007*"love" + 0.005*"blanco" + 0.004*"history" + 0.004*"bachelorette" + 0.003*"rachel" + 0.003*"renaissance" + 0.003*"said" + 0.003*"gabby" + 0.003*"good" + 0.003*"child"
2022-10-05 01:23:03,713 : INFO : topic #1 (0.250): 0.020*"rose" + 0.013*"rachel" + 0.011*"gabby" + 0.010*"time" + 0.010*"thank" + 0.010*"okay" + 0.009*"good" + 0.008*"love" + 0.008*"guys" + 0.007*"kind"
2022-10-05 01:23:03,714 : INFO : topic #2 (0.250): 0.002*"rachel" + 0.002*"love" + 0.002*"good" + 0.001*"kind" + 0.001*"time" + 0.001*"guys" + 0.001*"okay" + 0.001*"rose" + 0.001*"thank" + 0.001*"gabby"
2022-10-05 01:23:03,716 : INFO : topic #3 (0.250): 0.015*"love" + 0.013*"rachel" + 0.012*"gabby" + 0.012*"good" + 0.011*"guys" + 0.011*"time" + 0.009*"okay" + 0.008*"kind" + 0.007*"got" + 0.007*"excited"
2022-10-05 01:23:03,718 : INFO : topic diff=0.060626, rho=0.333333
2022-10-05 01:23:03,804 : INFO :

[(0,
  '0.006*"love" + 0.005*"blanco" + 0.005*"history" + 0.004*"renaissance" + 0.004*"bachelorette" + 0.003*"child" + 0.003*"said" + 0.003*"dreams" + 0.003*"kid" + 0.003*"mm"'),
 (1,
  '0.020*"rose" + 0.013*"rachel" + 0.011*"gabby" + 0.010*"time" + 0.010*"thank" + 0.010*"okay" + 0.009*"good" + 0.008*"love" + 0.008*"guys" + 0.007*"hard"'),
 (2,
  '0.001*"rachel" + 0.001*"love" + 0.001*"good" + 0.001*"kind" + 0.001*"time" + 0.001*"guys" + 0.001*"okay" + 0.001*"rose" + 0.001*"thank" + 0.001*"gabby"'),
 (3,
  '0.015*"love" + 0.013*"rachel" + 0.012*"gabby" + 0.012*"good" + 0.011*"guys" + 0.011*"time" + 0.009*"okay" + 0.008*"kind" + 0.007*"got" + 0.007*"excited"')]

## 4.2 Topic Modeling - Attempt #2 (nouns+adjectives)

In [129]:
# Let's create a function to pull out nouns and adjectives from a string of text
#https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html.
from nltk import word_tokenize, pos_tag

def nouns_adj(text):
    '''Given a string of text, tokenize the text and pull out only the nouns and adjectives.'''
    is_noun_adj = lambda pos: pos[:2] == 'NN' or pos[:2] == 'JJ'
    tokenized = word_tokenize(text)
    nouns_adj = [word for (word, pos) in pos_tag(tokenized) if is_noun_adj(pos)] 
    return ' '.join(nouns_adj)

In [130]:
# Apply the nouns_adj function to the transcripts to filter only on nouns and adjectives
data_nouns_adj = pd.DataFrame(df.lines_by_segment.apply(nouns_adj))
data_nouns_adj

Unnamed: 0_level_0,lines_by_segment
segment,Unnamed: 1_level_1
first third,i difference time decision decision easier yes...
intro,i i bachelorette rachel bachelorette amazing e...
outro,gabby rachel bachelorette more best friends wo...
second third,yeah type yeah yeah i i anyone tino s entrance...
third third,i curious i incredible journey i incredible cr...


In [131]:
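# Create a new document-term matrix using only nouns and adjectives
#also remove very common and very rare words with max_df and min_df
#max_df=0.80 removes words that appear in more than 80% of the documents
#min_df=0.25 removes words that appear in fewer than 25% of the documents

#list(...) keeps this working on scikit-learn versions that reject a frozenset of stop words
cvna = CountVectorizer(stop_words=list(stop_words), min_df=0.25, max_df=0.80)
data_cvna = cvna.fit_transform(data_nouns_adj.lines_by_segment)
#get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
data_dtmna = pd.DataFrame(data_cvna.toarray(), columns=cvna.get_feature_names_out())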
data_dtmna.index = data_nouns_adj.index
data_dtmna

Unnamed: 0_level_0,abc,ability,abs,absolute,absurd,action,active,actual,adult,advance,...,wrong,ya,yay,yep,yes,yesterday,yo,young,younger,zone
segment,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
first third,1,1,0,4,1,0,1,2,1,1,...,9,1,0,1,16,1,3,2,3,5
intro,0,0,2,0,1,0,1,0,2,0,...,2,0,0,0,0,0,0,0,0,0
outro,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,2,0,0,0,0,0
second third,0,1,1,3,0,1,0,4,2,0,...,8,0,2,1,25,11,4,1,1,4
third third,1,1,0,0,0,1,0,1,0,1,...,4,1,2,0,22,4,1,1,0,0
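With only 5 documents, fractional `min_df` and `max_df` values translate into very coarse cut-offs. The sketch below works out which document frequencies survive. It assumes the vectorizer scales a float threshold by the number of documents and keeps terms whose document frequency lies between the two bounds, inclusive. The helper name is hypothetical and this is not scikit-learn's own code.

```python
from numbers import Integral

def kept_document_frequencies(n_docs, min_df, max_df):
    """List the document frequencies a term may have and still be kept."""
    # Integers are absolute counts; floats are proportions of the corpus
    low = min_df if isinstance(min_df, Integral) else min_df * n_docs
    high = max_df if isinstance(max_df, Integral) else max_df * n_docs
    return [freq for freq in range(1, n_docs + 1) if low <= freq <= high]

print(kept_document_frequencies(5, min_df=0.25, max_df=0.80))
print(kept_document_frequencies(5, min_df=1, max_df=1.0))
```

Under that assumption, the settings used above keep only terms that occur in 2, 3, or 4 of the 5 segments, which matches the table: every visible column is non-zero in at least two rows and zero in at least one.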


In [132]:
# Create the gensim corpus
corpusna = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(data_dtmna.transpose()))

# Create the vocabulary dictionary
id2wordna = dict((v, k) for k, v in cvna.vocabulary_.items())

In [133]:
# Let's start with 2 topics
ldana = models.LdaModel(corpus=corpusna, num_topics=2, id2word=id2wordna, passes=10)
ldana.print_topics()

2022-10-05 01:23:09,298 : INFO : using symmetric alpha at 0.5
2022-10-05 01:23:09,305 : INFO : using symmetric eta at 0.5
2022-10-05 01:23:09,315 : INFO : using serial LDA version on this node
2022-10-05 01:23:09,318 : INFO : running online (multi-pass) LDA training, 2 topics, 10 passes over the supplied corpus of 5 documents, updating model once every 5 documents, evaluating perplexity every 5 documents, iterating 50x with a convergence threshold of 0.001000
2022-10-05 01:23:09,376 : INFO : -7.448 per-word bound, 174.7 perplexity estimate based on a held-out corpus of 5 documents with 8879 words
2022-10-05 01:23:09,377 : INFO : PROGRESS: pass 0, at document #5/5
2022-10-05 01:23:09,404 : INFO : topic #0 (0.500): 0.010*"fun" + 0.009*"real" + 0.009*"group" + 0.008*"better" + 0.008*"relationship" + 0.007*"special" + 0.007*"decision" + 0.007*"strong" + 0.007*"dad" + 0.006*"easy"
2022-10-05 01:23:09,405 : INFO : topic #1 (0.500): 0.010*"real" + 0.010*"relationship" + 0.010*"fun" + 0.009*"g

[(0,
  '0.014*"fun" + 0.011*"group" + 0.010*"better" + 0.008*"couple" + 0.008*"relationship" + 0.008*"dad" + 0.007*"special" + 0.006*"strong" + 0.006*"yes" + 0.006*"nate"'),
 (1,
  '0.012*"real" + 0.010*"relationship" + 0.009*"special" + 0.009*"decision" + 0.008*"easy" + 0.007*"group" + 0.007*"fun" + 0.007*"yes" + 0.007*"aven" + 0.006*"dad"')]

In [134]:
# Let's try 3 topics
ldana = models.LdaModel(corpus=corpusna, num_topics=3, id2word=id2wordna, passes=10)
ldana.print_topics()

2022-10-05 01:23:10,051 : INFO : using symmetric alpha at 0.3333333333333333
2022-10-05 01:23:10,076 : INFO : using symmetric eta at 0.3333333333333333
2022-10-05 01:23:10,083 : INFO : using serial LDA version on this node
2022-10-05 01:23:10,093 : INFO : running online (multi-pass) LDA training, 3 topics, 10 passes over the supplied corpus of 5 documents, updating model once every 5 documents, evaluating perplexity every 5 documents, iterating 50x with a convergence threshold of 0.001000
2022-10-05 01:23:10,390 : INFO : -7.594 per-word bound, 193.1 perplexity estimate based on a held-out corpus of 5 documents with 8879 words
2022-10-05 01:23:10,391 : INFO : PROGRESS: pass 0, at document #5/5
2022-10-05 01:23:10,438 : INFO : topic #0 (0.333): 0.010*"fun" + 0.009*"relationship" + 0.009*"decision" + 0.008*"aven" + 0.008*"better" + 0.008*"real" + 0.008*"group" + 0.007*"special" + 0.006*"easy" + 0.006*"yes"
2022-10-05 01:23:10,440 : INFO : topic #1 (0.333): 0.011*"fun" + 0.011*"real" + 0.0

2022-10-05 01:23:10,992 : INFO : topic #1 (0.333): 0.011*"fun" + 0.010*"real" + 0.010*"relationship" + 0.009*"group" + 0.009*"special" + 0.008*"better" + 0.008*"dad" + 0.008*"easy" + 0.008*"yes" + 0.007*"decision"
2022-10-05 01:23:10,992 : INFO : topic #2 (0.333): 0.012*"group" + 0.011*"decision" + 0.011*"couple" + 0.008*"second" + 0.008*"meatball" + 0.008*"families" + 0.007*"romantic" + 0.007*"dates" + 0.006*"relationships" + 0.006*"word"
2022-10-05 01:23:10,993 : INFO : topic diff=0.043969, rho=0.316228
2022-10-05 01:23:11,038 : INFO : -6.716 per-word bound, 105.1 perplexity estimate based on a held-out corpus of 5 documents with 8879 words
2022-10-05 01:23:11,039 : INFO : PROGRESS: pass 9, at document #5/5
2022-10-05 01:23:11,059 : INFO : topic #0 (0.333): 0.002*"relationship" + 0.002*"aven" + 0.002*"fun" + 0.002*"decision" + 0.002*"real" + 0.002*"special" + 0.002*"better" + 0.002*"group" + 0.002*"days" + 0.002*"easy"
2022-10-05 01:23:11,061 : INFO : topic #1 (0.333): 0.011*"fun" + 

[(0,
  '0.002*"relationship" + 0.002*"aven" + 0.002*"fun" + 0.002*"decision" + 0.002*"real" + 0.002*"special" + 0.002*"better" + 0.002*"group" + 0.002*"days" + 0.002*"easy"'),
 (1,
  '0.011*"fun" + 0.010*"real" + 0.010*"relationship" + 0.009*"group" + 0.009*"special" + 0.008*"better" + 0.008*"dad" + 0.008*"easy" + 0.008*"yes" + 0.007*"decision"'),
 (2,
  '0.012*"group" + 0.011*"decision" + 0.011*"couple" + 0.008*"second" + 0.008*"meatball" + 0.008*"families" + 0.007*"romantic" + 0.007*"dates" + 0.006*"relationships" + 0.006*"word"')]

In [135]:
# Let's try 4 topics
ldana = models.LdaModel(corpus=corpusna, num_topics=4, id2word=id2wordna, passes=10)
ldana.print_topics()

2022-10-05 01:23:11,090 : INFO : using symmetric alpha at 0.25
2022-10-05 01:23:11,092 : INFO : using symmetric eta at 0.25
2022-10-05 01:23:11,094 : INFO : using serial LDA version on this node
2022-10-05 01:23:11,099 : INFO : running online (multi-pass) LDA training, 4 topics, 10 passes over the supplied corpus of 5 documents, updating model once every 5 documents, evaluating perplexity every 5 documents, iterating 50x with a convergence threshold of 0.001000
2022-10-05 01:23:11,155 : INFO : -7.743 per-word bound, 214.3 perplexity estimate based on a held-out corpus of 5 documents with 8879 words
2022-10-05 01:23:11,157 : INFO : PROGRESS: pass 0, at document #5/5
2022-10-05 01:23:11,187 : INFO : topic #0 (0.250): 0.010*"real" + 0.010*"fun" + 0.009*"special" + 0.008*"better" + 0.008*"group" + 0.008*"relationship" + 0.008*"dad" + 0.007*"yes" + 0.007*"decision" + 0.006*"strong"
2022-10-05 01:23:11,188 : INFO : topic #1 (0.250): 0.011*"relationship" + 0.010*"fun" + 0.009*"group" + 0.009*

2022-10-05 01:23:12,633 : INFO : topic diff=0.108661, rho=0.353553
2022-10-05 01:23:12,710 : INFO : -6.783 per-word bound, 110.2 perplexity estimate based on a held-out corpus of 5 documents with 8879 words
2022-10-05 01:23:12,711 : INFO : PROGRESS: pass 7, at document #5/5
2022-10-05 01:23:12,731 : INFO : topic #0 (0.250): 0.002*"real" + 0.002*"fun" + 0.002*"special" + 0.002*"better" + 0.002*"group" + 0.002*"relationship" + 0.002*"dad" + 0.002*"yes" + 0.002*"decision" + 0.002*"strong"
2022-10-05 01:23:12,732 : INFO : topic #1 (0.250): 0.015*"fun" + 0.012*"group" + 0.010*"better" + 0.009*"relationship" + 0.008*"dad" + 0.008*"special" + 0.008*"couple" + 0.007*"strong" + 0.007*"nate" + 0.007*"mom"
2022-10-05 01:23:12,734 : INFO : topic #2 (0.250): 0.012*"child" + 0.010*"kid" + 0.010*"blanco" + 0.008*"sex" + 0.008*"feet" + 0.007*"dad" + 0.007*"pilot" + 0.006*"horse" + 0.006*"dreams" + 0.005*"easy"
2022-10-05 01:23:12,735 : INFO : topic #3 (0.250): 0.014*"real" + 0.010*"decision" + 0.010*"

[(0,
  '0.001*"real" + 0.001*"fun" + 0.001*"special" + 0.001*"better" + 0.001*"group" + 0.001*"relationship" + 0.001*"dad" + 0.001*"yes" + 0.001*"decision" + 0.001*"strong"'),
 (1,
  '0.015*"fun" + 0.012*"group" + 0.010*"better" + 0.009*"relationship" + 0.008*"dad" + 0.008*"special" + 0.008*"couple" + 0.007*"strong" + 0.007*"nate" + 0.007*"mom"'),
 (2,
  '0.013*"child" + 0.011*"kid" + 0.011*"blanco" + 0.009*"sex" + 0.009*"feet" + 0.007*"dad" + 0.007*"pilot" + 0.007*"horse" + 0.007*"dreams" + 0.005*"easy"'),
 (3,
  '0.014*"real" + 0.011*"decision" + 0.010*"relationship" + 0.009*"special" + 0.008*"easy" + 0.008*"impression" + 0.008*"yes" + 0.007*"meatball" + 0.007*"group" + 0.007*"aven"')]

## 4.2 Topic Modeling - Attempt #3 (nouns+adjectives+Pnouns)

In [136]:
# Let's create a function to pull out nouns, proper nouns, and adjectives from a string of text
#https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html.

def nouns_adj_Pnouns(text):
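    '''Given a string of text, tokenize the text and pull out only the nouns (including proper nouns) and adjectives.'''
    # Proper-noun tags (NNP, NNPS) start with 'NN', so the two-character prefix test already covers them
    is_noun_adj = lambda pos: pos[:2] == 'NN' or pos[:2] == 'JJ'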
    tokenized = word_tokenize(text)
    nouns_adj = [word for (word, pos) in pos_tag(tokenized) if is_noun_adj(pos)] 
    return ' '.join(nouns_adj)
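Penn Treebank tags for proper nouns (NNP, NNPS) share the `NN` prefix, so a two-character prefix test on `'NN'` already selects them, while a two-character slice can never equal the three-character string `'NNP'`. The sketch below checks both points on a hand-written list of tags; the helper name is hypothetical.

```python
def is_noun_or_adjective(tag):
    """Prefix test on the first two characters of a Penn Treebank tag."""
    return tag[:2] in ("NN", "JJ")

# A hand-picked sample of Penn Treebank tags
tags = ["NN", "NNS", "NNP", "NNPS", "JJ", "JJR", "JJS", "VB", "RB", "PRP"]

selected = [tag for tag in tags if is_noun_or_adjective(tag)]
print(selected)

# A two-character slice never equals a three-character string
print(any(tag[:2] == "NNP" for tag in tags))
```

This is why the filtered text in Attempt #3 looks the same as in Attempt #2: the practical difference between the two attempts is the `min_df`/`max_df` filtering, not the part-of-speech filter.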

In [137]:
# Apply the nouns_adj_Pnouns function to the transcripts to filter only on nouns, proper nouns, and adjectives
data_nouns_adj_pnoun = pd.DataFrame(df.lines_by_segment.apply(nouns_adj_Pnouns))
data_nouns_adj_pnoun

Unnamed: 0_level_0,lines_by_segment
segment,Unnamed: 1_level_1
first third,i difference time decision decision easier yes...
intro,i i bachelorette rachel bachelorette amazing e...
outro,gabby rachel bachelorette more best friends wo...
second third,yeah type yeah yeah i i anyone tino s entrance...
third third,i curious i incredible journey i incredible cr...


In [138]:
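# Create a new document-term matrix using only nouns and adjectives (no min_df/max_df filtering this time)
#list(...) keeps this working on scikit-learn versions that reject a frozenset of stop words
cvna = CountVectorizer(stop_words=list(stop_words))
data_cvna = cvna.fit_transform(data_nouns_adj_pnoun.lines_by_segment)
#get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
data_dtmna = pd.DataFrame(data_cvna.toarray(), columns=cvna.get_feature_names_out())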
data_dtmna.index = data_nouns_adj_pnoun.index
data_dtmna

Unnamed: 0_level_0,aah,abc,abilities,ability,abs,absolute,absurd,abundance,accept,acceptance,...,york,young,younger,youse,yum,zach,zachary,zero,zhh,zone
segment,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
first third,0,1,0,1,0,4,1,0,2,0,...,0,2,3,1,0,43,1,1,0,5
intro,0,0,0,0,2,0,1,0,0,0,...,0,0,0,0,0,6,0,0,0,0
outro,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
second third,1,0,1,1,1,3,0,2,0,1,...,0,1,1,0,1,12,0,0,1,4
third third,0,1,0,1,0,0,0,0,0,0,...,3,1,0,0,0,13,0,0,0,0


In [139]:
# Create the gensim corpus
corpusna = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(data_dtmna.transpose()))

# Create the vocabulary dictionary
id2wordna = dict((v, k) for k, v in cvna.vocabulary_.items())

In [140]:
# Let's start with 2 topics
ldana = models.LdaModel(corpus=corpusna, num_topics=2, id2word=id2wordna, passes=10)
ldana.print_topics()

2022-10-05 01:23:18,717 : INFO : using symmetric alpha at 0.5
2022-10-05 01:23:18,722 : INFO : using symmetric eta at 0.5
2022-10-05 01:23:18,723 : INFO : using serial LDA version on this node
2022-10-05 01:23:18,726 : INFO : running online (multi-pass) LDA training, 2 topics, 10 passes over the supplied corpus of 5 documents, updating model once every 5 documents, evaluating perplexity every 5 documents, iterating 50x with a convergence threshold of 0.001000
2022-10-05 01:23:18,786 : INFO : -8.422 per-word bound, 343.0 perplexity estimate based on a held-out corpus of 5 documents with 24948 words
2022-10-05 01:23:18,787 : INFO : PROGRESS: pass 0, at document #5/5
2022-10-05 01:23:18,798 : INFO : topic #0 (0.500): 0.019*"good" + 0.018*"rachel" + 0.017*"time" + 0.017*"gabby" + 0.013*"kind" + 0.010*"love" + 0.010*"okay" + 0.010*"guys" + 0.009*"lot" + 0.009*"way"
2022-10-05 01:23:18,799 : INFO : topic #1 (0.500): 0.015*"time" + 0.015*"gabby" + 0.013*"good" + 0.011*"rachel" + 0.011*"guys" 

[(0,
  '0.019*"good" + 0.018*"time" + 0.017*"gabby" + 0.017*"rachel" + 0.013*"kind" + 0.011*"guys" + 0.011*"love" + 0.010*"way" + 0.009*"little" + 0.009*"okay"'),
 (1,
  '0.005*"gabby" + 0.005*"week" + 0.004*"rachel" + 0.004*"bachelorette" + 0.004*"font" + 0.003*"new" + 0.003*"love" + 0.003*"excited" + 0.003*"orleans" + 0.002*"guys"')]
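
The INFO lines report a per-word bound together with a perplexity estimate. Judging from the logged pairs, the perplexity is 2 raised to the negative bound, so a bound closer to zero means a lower perplexity. The sketch below checks that relationship against a few pairs copied from the logs in this section. The function name is hypothetical. Also note that the "held-out corpus" in these lines is the same five segments the model is trained on, so the figures are probably better read as an indicator of training fit than as a true held-out score.

```python
# (per-word bound, reported perplexity) pairs copied from the INFO lines in this section.
logged = [(-8.422, 343.0), (-6.717, 105.2), (-7.452, 175.1)]

def perplexity_from_bound(per_word_bound):
    """Perplexity implied by a base-2 per-word log-likelihood bound."""
    return 2.0 ** (-per_word_bound)

for bound, reported in logged:
    computed = perplexity_from_bound(bound)
    print(f"bound={bound:.3f}  computed={computed:.1f}  logged={reported}")
```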

In [141]:
# Let's try 3 topics
ldana = models.LdaModel(corpus=corpusna, num_topics=3, id2word=id2wordna, passes=10)
ldana.print_topics()

2022-10-05 01:23:19,542 : INFO : using symmetric alpha at 0.3333333333333333
2022-10-05 01:23:19,544 : INFO : using symmetric eta at 0.3333333333333333
2022-10-05 01:23:19,554 : INFO : using serial LDA version on this node
2022-10-05 01:23:19,558 : INFO : running online (multi-pass) LDA training, 3 topics, 10 passes over the supplied corpus of 5 documents, updating model once every 5 documents, evaluating perplexity every 5 documents, iterating 50x with a convergence threshold of 0.001000
2022-10-05 01:23:19,650 : INFO : -8.558 per-word bound, 377.0 perplexity estimate based on a held-out corpus of 5 documents with 24948 words
2022-10-05 01:23:19,651 : INFO : PROGRESS: pass 0, at document #5/5
2022-10-05 01:23:19,664 : INFO : topic #0 (0.333): 0.017*"good" + 0.017*"rachel" + 0.014*"gabby" + 0.014*"kind" + 0.013*"time" + 0.010*"guys" + 0.010*"love" + 0.010*"okay" + 0.009*"lot" + 0.009*"way"
2022-10-05 01:23:19,665 : INFO : topic #1 (0.333): 0.021*"gabby" + 0.020*"good" + 0.017*"rachel" 

2022-10-05 01:23:20,504 : INFO : topic diff=0.027360, rho=0.316228
2022-10-05 01:23:20,553 : INFO : -6.717 per-word bound, 105.2 perplexity estimate based on a held-out corpus of 5 documents with 24948 words
2022-10-05 01:23:20,553 : INFO : PROGRESS: pass 9, at document #5/5
2022-10-05 01:23:20,566 : INFO : topic #0 (0.333): 0.018*"time" + 0.017*"rachel" + 0.017*"gabby" + 0.016*"good" + 0.012*"kind" + 0.010*"way" + 0.009*"okay" + 0.009*"hard" + 0.009*"guys" + 0.009*"night"
2022-10-05 01:23:20,567 : INFO : topic #1 (0.333): 0.019*"good" + 0.017*"time" + 0.015*"rachel" + 0.014*"gabby" + 0.012*"love" + 0.012*"kind" + 0.010*"date" + 0.009*"way" + 0.008*"okay" + 0.008*"ready"
2022-10-05 01:23:20,567 : INFO : topic #2 (0.333): 0.017*"gabby" + 0.017*"good" + 0.016*"time" + 0.015*"rachel" + 0.013*"kind" + 0.013*"guys" + 0.010*"little" + 0.010*"love" + 0.009*"lot" + 0.009*"day"
2022-10-05 01:23:20,568 : INFO : topic diff=0.018666, rho=0.301511
2022-10-05 01:23:20,569 : INFO : LdaModel lifecycle

[(0,
  '0.018*"time" + 0.017*"rachel" + 0.017*"gabby" + 0.016*"good" + 0.012*"kind" + 0.010*"way" + 0.009*"okay" + 0.009*"hard" + 0.009*"guys" + 0.009*"night"'),
 (1,
  '0.019*"good" + 0.017*"time" + 0.015*"rachel" + 0.014*"gabby" + 0.012*"love" + 0.012*"kind" + 0.010*"date" + 0.009*"way" + 0.008*"okay" + 0.008*"ready"'),
 (2,
  '0.017*"gabby" + 0.017*"good" + 0.016*"time" + 0.015*"rachel" + 0.013*"kind" + 0.013*"guys" + 0.010*"little" + 0.010*"love" + 0.009*"lot" + 0.009*"day"')]
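
The three topics above share most of their top words ("good", "time", "gabby", "rachel", "kind"), which makes them hard to tell apart. One rough way to put a number on that is the Jaccard overlap between the top-word sets. The sketch below parses the printed topic strings with plain Python. The helper names are hypothetical, and Jaccard overlap is just one convenient measure, not something gensim reports here.

```python
import re
from itertools import combinations

# Topic strings copied from the 3-topic print_topics() output above.
topics = {
    0: '0.018*"time" + 0.017*"rachel" + 0.017*"gabby" + 0.016*"good" + 0.012*"kind" + '
       '0.010*"way" + 0.009*"okay" + 0.009*"hard" + 0.009*"guys" + 0.009*"night"',
    1: '0.019*"good" + 0.017*"time" + 0.015*"rachel" + 0.014*"gabby" + 0.012*"love" + '
       '0.012*"kind" + 0.010*"date" + 0.009*"way" + 0.008*"okay" + 0.008*"ready"',
    2: '0.017*"gabby" + 0.017*"good" + 0.016*"time" + 0.015*"rachel" + 0.013*"kind" + '
       '0.013*"guys" + 0.010*"little" + 0.010*"love" + 0.009*"lot" + 0.009*"day"',
}

def top_words(topic_string):
    """Extract the quoted words from a print_topics() string."""
    return set(re.findall(r'"([^"]+)"', topic_string))

def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    return len(a & b) / len(a | b)

overlaps = {
    (i, j): jaccard(top_words(topics[i]), top_words(topics[j]))
    for i, j in combinations(sorted(topics), 2)
}

for pair, score in overlaps.items():
    print(pair, round(score, 2))
```

High overlap between every pair suggests the shared, very frequent words are driving all topics, which is the motivation for the document-frequency filtering tried in Attempt #5.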

In [142]:
# Let's try 4 topics
ldana = models.LdaModel(corpus=corpusna, num_topics=4, id2word=id2wordna, passes=10)
ldana.print_topics()

2022-10-05 01:23:20,603 : INFO : using symmetric alpha at 0.25
2022-10-05 01:23:20,610 : INFO : using symmetric eta at 0.25
2022-10-05 01:23:20,613 : INFO : using serial LDA version on this node
2022-10-05 01:23:20,619 : INFO : running online (multi-pass) LDA training, 4 topics, 10 passes over the supplied corpus of 5 documents, updating model once every 5 documents, evaluating perplexity every 5 documents, iterating 50x with a convergence threshold of 0.001000
2022-10-05 01:23:20,687 : INFO : -8.707 per-word bound, 418.0 perplexity estimate based on a held-out corpus of 5 documents with 24948 words
2022-10-05 01:23:20,688 : INFO : PROGRESS: pass 0, at document #5/5
2022-10-05 01:23:20,702 : INFO : topic #0 (0.250): 0.019*"good" + 0.019*"rachel" + 0.017*"time" + 0.014*"gabby" + 0.014*"kind" + 0.011*"love" + 0.010*"little" + 0.009*"ready" + 0.009*"thing" + 0.008*"way"
2022-10-05 01:23:20,703 : INFO : topic #1 (0.250): 0.019*"gabby" + 0.017*"good" + 0.016*"time" + 0.015*"rachel" + 0.013*

2022-10-05 01:23:21,324 : INFO : PROGRESS: pass 7, at document #5/5
2022-10-05 01:23:21,349 : INFO : topic #0 (0.250): 0.018*"time" + 0.018*"rachel" + 0.017*"gabby" + 0.016*"good" + 0.012*"kind" + 0.010*"way" + 0.010*"okay" + 0.009*"hard" + 0.009*"guys" + 0.009*"night"
2022-10-05 01:23:21,350 : INFO : topic #1 (0.250): 0.021*"gabby" + 0.019*"rachel" + 0.013*"week" + 0.013*"love" + 0.010*"time" + 0.010*"guys" + 0.009*"bachelorette" + 0.009*"good" + 0.007*"things" + 0.007*"kind"
2022-10-05 01:23:21,351 : INFO : topic #2 (0.250): 0.003*"good" + 0.003*"kind" + 0.003*"rachel" + 0.003*"time" + 0.003*"gabby" + 0.003*"little" + 0.002*"guys" + 0.002*"okay" + 0.002*"lot" + 0.002*"way"
2022-10-05 01:23:21,351 : INFO : topic #3 (0.250): 0.020*"good" + 0.018*"time" + 0.016*"gabby" + 0.015*"rachel" + 0.014*"kind" + 0.011*"guys" + 0.011*"love" + 0.010*"little" + 0.010*"date" + 0.009*"lot"
2022-10-05 01:23:21,352 : INFO : topic diff=0.059180, rho=0.333333
2022-10-05 01:23:21,404 : INFO : -6.713 per-wo

[(0,
  '0.018*"time" + 0.018*"rachel" + 0.017*"gabby" + 0.016*"good" + 0.012*"kind" + 0.010*"way" + 0.010*"okay" + 0.009*"hard" + 0.009*"guys" + 0.009*"night"'),
 (1,
  '0.021*"gabby" + 0.019*"rachel" + 0.014*"week" + 0.014*"love" + 0.010*"time" + 0.010*"bachelorette" + 0.009*"guys" + 0.009*"good" + 0.007*"things" + 0.007*"kind"'),
 (2,
  '0.002*"good" + 0.002*"kind" + 0.002*"rachel" + 0.002*"time" + 0.002*"gabby" + 0.002*"little" + 0.001*"guys" + 0.001*"okay" + 0.001*"lot" + 0.001*"way"'),
 (3,
  '0.020*"good" + 0.018*"time" + 0.016*"gabby" + 0.015*"rachel" + 0.014*"kind" + 0.011*"guys" + 0.011*"love" + 0.010*"little" + 0.010*"date" + 0.009*"lot"')]

## 4.2 Topic Modeling - Attempt #4 (nouns + adjectives + proper nouns + plural nouns)


In [143]:
# Let's create a function to pull out nouns and adjectives from a string of text
# Penn Treebank tag reference:
# https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
from nltk import word_tokenize, pos_tag

def NN_JJ_NNP_NNS(text):
    '''Given a string of text, tokenize the text and pull out only the nouns and adjectives.'''
    # pos[:2] is always two characters, so comparing it to 'NNP' or 'NNS' can never match.
    # The 'NN' prefix already covers NN, NNS, NNP, and NNPS; 'JJ' covers JJ, JJR, and JJS.
    is_noun_adj = lambda pos: pos[:2] in ('NN', 'JJ')
    tokenized = word_tokenize(text)
    nouns_adj = [word for (word, pos) in pos_tag(tokenized) if is_noun_adj(pos)]
    return ' '.join(nouns_adj)
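
A two-character slice such as `pos[:2]` can equal `'NN'` or `'JJ'`, but never a three-character tag like `'NNP'` or `'NNS'`. The sketch below uses plain Python, a hand-written tag list, and hypothetical helper names to check which Penn Treebank tags a prefix test of this kind accepts. It shows that adding the `'NNP'` and `'NNS'` comparisons does not change the result.

```python
# Common Penn Treebank noun and adjective tags, plus a few others for contrast.
tags = ["NN", "NNS", "NNP", "NNPS", "JJ", "JJR", "JJS", "VB", "RB", "PRP"]

def four_way_test(pos):
    """The four-way prefix comparison, written out in full."""
    return pos[:2] == "NN" or pos[:2] == "JJ" or pos[:2] == "NNP" or pos[:2] == "NNS"

def two_way_test(pos):
    """Only the two comparisons that a two-character prefix can ever satisfy."""
    return pos[:2] in ("NN", "JJ")

accepted = [tag for tag in tags if four_way_test(tag)]
same_everywhere = all(four_way_test(tag) == two_way_test(tag) for tag in tags)

print(accepted)
print(same_everywhere)
```

Because both tests accept the same tags, this attempt selects the same words as a plain nouns-and-adjectives filter. That is consistent with the document-term matrix below showing the same columns and counts as the previous attempt.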

In [144]:
data_NN_JJ_NNP_NNS = pd.DataFrame(df.lines_by_segment.apply(NN_JJ_NNP_NNS))
data_NN_JJ_NNP_NNS

segment,lines_by_segment
first third,i difference time decision decision easier yes...
intro,i i bachelorette rachel bachelorette amazing e...
outro,gabby rachel bachelorette more best friends wo...
second third,yeah type yeah yeah i i anyone tino s entrance...
third third,i curious i incredible journey i incredible cr...


In [145]:
# Create a new document-term matrix using only NN JJ NNP NNS
cvnns = CountVectorizer(stop_words=stop_words)
data_cvnns = cvnns.fit_transform(data_NN_JJ_NNP_NNS.lines_by_segment)
data_dtmnns = pd.DataFrame(data_cvnns.toarray(), columns=cvnns.get_feature_names_out())
data_dtmnns.index = data_NN_JJ_NNP_NNS.index
data_dtmnns

segment,aah,abc,abilities,ability,abs,absolute,absurd,abundance,accept,acceptance,...,york,young,younger,youse,yum,zach,zachary,zero,zhh,zone
first third,0,1,0,1,0,4,1,0,2,0,...,0,2,3,1,0,43,1,1,0,5
intro,0,0,0,0,2,0,1,0,0,0,...,0,0,0,0,0,6,0,0,0,0
outro,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
second third,1,0,1,1,1,3,0,2,0,1,...,0,1,1,0,1,12,0,0,1,4
third third,0,1,0,1,0,0,0,0,0,0,...,3,1,0,0,0,13,0,0,0,0


In [146]:
# Create the gensim corpus
corpusn = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(data_dtmnns.transpose()))

# Create the vocabulary dictionary
id2wordn = dict((v, k) for k, v in cvnns.vocabulary_.items())

In [147]:
#trying topic modeling with 2 topics
lda = models.LdaModel(corpus=corpusn, num_topics=2, id2word=id2wordn, passes=10)
lda.print_topics()

2022-10-05 01:23:26,802 : INFO : using symmetric alpha at 0.5
2022-10-05 01:23:26,803 : INFO : using symmetric eta at 0.5
2022-10-05 01:23:26,805 : INFO : using serial LDA version on this node
2022-10-05 01:23:26,810 : INFO : running online (multi-pass) LDA training, 2 topics, 10 passes over the supplied corpus of 5 documents, updating model once every 5 documents, evaluating perplexity every 5 documents, iterating 50x with a convergence threshold of 0.001000
2022-10-05 01:23:26,871 : INFO : -8.420 per-word bound, 342.6 perplexity estimate based on a held-out corpus of 5 documents with 24948 words
2022-10-05 01:23:26,872 : INFO : PROGRESS: pass 0, at document #5/5
2022-10-05 01:23:26,889 : INFO : topic #0 (0.500): 0.018*"gabby" + 0.017*"good" + 0.016*"rachel" + 0.014*"time" + 0.010*"kind" + 0.009*"week" + 0.009*"day" + 0.009*"way" + 0.008*"hard" + 0.008*"lot"
2022-10-05 01:23:26,890 : INFO : topic #1 (0.500): 0.018*"time" + 0.017*"good" + 0.016*"rachel" + 0.015*"gabby" + 0.014*"kind" +

[(0,
  '0.009*"gabby" + 0.009*"rachel" + 0.007*"time" + 0.007*"good" + 0.006*"hard" + 0.006*"ceremony" + 0.006*"conversation" + 0.005*"week" + 0.005*"night" + 0.005*"ready"'),
 (1,
  '0.019*"good" + 0.018*"time" + 0.017*"gabby" + 0.017*"rachel" + 0.014*"kind" + 0.011*"guys" + 0.011*"love" + 0.010*"way" + 0.009*"little" + 0.009*"okay"')]

In [148]:
#trying topic modeling with 3 topics
lda = models.LdaModel(corpus=corpusn, num_topics=3, id2word=id2wordn, passes=10)
lda.print_topics()

2022-10-05 01:23:27,646 : INFO : using symmetric alpha at 0.3333333333333333
2022-10-05 01:23:27,648 : INFO : using symmetric eta at 0.3333333333333333
2022-10-05 01:23:27,650 : INFO : using serial LDA version on this node
2022-10-05 01:23:27,652 : INFO : running online (multi-pass) LDA training, 3 topics, 10 passes over the supplied corpus of 5 documents, updating model once every 5 documents, evaluating perplexity every 5 documents, iterating 50x with a convergence threshold of 0.001000
2022-10-05 01:23:27,723 : INFO : -8.557 per-word bound, 376.7 perplexity estimate based on a held-out corpus of 5 documents with 24948 words
2022-10-05 01:23:27,724 : INFO : PROGRESS: pass 0, at document #5/5
2022-10-05 01:23:27,739 : INFO : topic #0 (0.333): 0.018*"good" + 0.018*"gabby" + 0.017*"time" + 0.015*"rachel" + 0.013*"kind" + 0.011*"guys" + 0.011*"love" + 0.009*"way" + 0.009*"ready" + 0.009*"little"
2022-10-05 01:23:27,740 : INFO : topic #1 (0.333): 0.017*"good" + 0.016*"time" + 0.015*"gabby

2022-10-05 01:23:28,529 : INFO : topic diff=0.030422, rho=0.316228
2022-10-05 01:23:28,581 : INFO : -6.693 per-word bound, 103.4 perplexity estimate based on a held-out corpus of 5 documents with 24948 words
2022-10-05 01:23:28,582 : INFO : PROGRESS: pass 9, at document #5/5
2022-10-05 01:23:28,592 : INFO : topic #0 (0.333): 0.019*"good" + 0.017*"time" + 0.017*"gabby" + 0.016*"rachel" + 0.013*"kind" + 0.011*"love" + 0.011*"guys" + 0.009*"little" + 0.009*"date" + 0.009*"way"
2022-10-05 01:23:28,594 : INFO : topic #1 (0.333): 0.002*"good" + 0.002*"time" + 0.002*"gabby" + 0.001*"kind" + 0.001*"rachel" + 0.001*"lot" + 0.001*"way" + 0.001*"day" + 0.001*"guys" + 0.001*"okay"
2022-10-05 01:23:28,595 : INFO : topic #2 (0.333): 0.018*"time" + 0.017*"rachel" + 0.017*"gabby" + 0.015*"good" + 0.012*"kind" + 0.010*"way" + 0.009*"okay" + 0.009*"hard" + 0.009*"guys" + 0.009*"night"
2022-10-05 01:23:28,595 : INFO : topic diff=0.022420, rho=0.301511
2022-10-05 01:23:28,597 : INFO : LdaModel lifecycle e

[(0,
  '0.019*"good" + 0.017*"time" + 0.017*"gabby" + 0.016*"rachel" + 0.013*"kind" + 0.011*"love" + 0.011*"guys" + 0.009*"little" + 0.009*"date" + 0.009*"way"'),
 (1,
  '0.002*"good" + 0.002*"time" + 0.002*"gabby" + 0.001*"kind" + 0.001*"rachel" + 0.001*"lot" + 0.001*"way" + 0.001*"day" + 0.001*"guys" + 0.001*"okay"'),
 (2,
  '0.018*"time" + 0.017*"rachel" + 0.017*"gabby" + 0.015*"good" + 0.012*"kind" + 0.010*"way" + 0.009*"okay" + 0.009*"hard" + 0.009*"guys" + 0.009*"night"')]

In [149]:
#trying topic modeling with 4 topics
lda = models.LdaModel(corpus=corpusn, num_topics=4, id2word=id2wordn, passes=10)
lda.print_topics()

2022-10-05 01:23:28,621 : INFO : using symmetric alpha at 0.25
2022-10-05 01:23:28,623 : INFO : using symmetric eta at 0.25
2022-10-05 01:23:28,624 : INFO : using serial LDA version on this node
2022-10-05 01:23:28,628 : INFO : running online (multi-pass) LDA training, 4 topics, 10 passes over the supplied corpus of 5 documents, updating model once every 5 documents, evaluating perplexity every 5 documents, iterating 50x with a convergence threshold of 0.001000
2022-10-05 01:23:28,695 : INFO : -8.714 per-word bound, 419.9 perplexity estimate based on a held-out corpus of 5 documents with 24948 words
2022-10-05 01:23:28,696 : INFO : PROGRESS: pass 0, at document #5/5
2022-10-05 01:23:28,710 : INFO : topic #0 (0.250): 0.019*"time" + 0.019*"good" + 0.016*"rachel" + 0.015*"gabby" + 0.013*"kind" + 0.012*"love" + 0.010*"way" + 0.009*"day" + 0.009*"lot" + 0.009*"little"
2022-10-05 01:23:28,711 : INFO : topic #1 (0.250): 0.021*"gabby" + 0.018*"good" + 0.016*"time" + 0.012*"kind" + 0.011*"rache

2022-10-05 01:23:29,342 : INFO : PROGRESS: pass 7, at document #5/5
2022-10-05 01:23:29,357 : INFO : topic #0 (0.250): 0.019*"good" + 0.018*"time" + 0.016*"rachel" + 0.015*"gabby" + 0.013*"love" + 0.013*"kind" + 0.010*"date" + 0.009*"way" + 0.009*"okay" + 0.008*"ready"
2022-10-05 01:23:29,357 : INFO : topic #1 (0.250): 0.003*"gabby" + 0.003*"good" + 0.002*"time" + 0.002*"kind" + 0.002*"rachel" + 0.002*"guys" + 0.002*"little" + 0.001*"love" + 0.001*"date" + 0.001*"lot"
2022-10-05 01:23:29,358 : INFO : topic #2 (0.250): 0.021*"gabby" + 0.019*"rachel" + 0.013*"week" + 0.013*"love" + 0.010*"time" + 0.009*"guys" + 0.009*"bachelorette" + 0.009*"good" + 0.007*"kind" + 0.007*"things"
2022-10-05 01:23:29,360 : INFO : topic #3 (0.250): 0.018*"good" + 0.018*"time" + 0.017*"gabby" + 0.016*"rachel" + 0.014*"kind" + 0.012*"guys" + 0.010*"lot" + 0.010*"little" + 0.010*"way" + 0.010*"okay"
2022-10-05 01:23:29,362 : INFO : topic diff=0.071835, rho=0.333333
2022-10-05 01:23:29,411 : INFO : -6.717 per-wo

[(0,
  '0.019*"good" + 0.017*"time" + 0.016*"rachel" + 0.015*"gabby" + 0.013*"love" + 0.012*"kind" + 0.010*"date" + 0.009*"way" + 0.009*"okay" + 0.008*"ready"'),
 (1,
  '0.002*"gabby" + 0.002*"good" + 0.001*"time" + 0.001*"kind" + 0.001*"rachel" + 0.001*"guys" + 0.001*"little" + 0.001*"love" + 0.001*"date" + 0.001*"lot"'),
 (2,
  '0.021*"gabby" + 0.019*"rachel" + 0.014*"week" + 0.013*"love" + 0.010*"time" + 0.010*"bachelorette" + 0.009*"guys" + 0.009*"good" + 0.007*"things" + 0.007*"kind"'),
 (3,
  '0.019*"good" + 0.018*"time" + 0.017*"gabby" + 0.016*"rachel" + 0.014*"kind" + 0.012*"guys" + 0.010*"lot" + 0.010*"little" + 0.010*"way" + 0.010*"okay"')]

## 4.2 Topic Modeling - Attempt #5 (nouns + adjectives + proper nouns + plural nouns, with min_df and max_df)

In [150]:
# Create a new document-term matrix using only NN JJ NNP NNS
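# min_df=0.25: ignore terms that appear in fewer than 25% of the documents
# (with 5 segments, this drops terms found in only one segment)
# max_df=0.80: ignore terms that appear in more than 80% of the documents
# (with 5 segments, this drops terms found in all five segments)
cvnns_2 = CountVectorizer(stop_words=stop_words, min_df=0.25, max_df=0.80)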
data_cvnns_2 = cvnns_2.fit_transform(data_NN_JJ_NNP_NNS.lines_by_segment)
data_dtmnns_2 = pd.DataFrame(data_cvnns_2.toarray(), columns=cvnns_2.get_feature_names_out())
data_dtmnns_2.index = data_NN_JJ_NNP_NNS.index
data_dtmnns_2

segment,abc,ability,abs,absolute,absurd,action,active,actual,adult,advance,...,wrong,ya,yay,yep,yes,yesterday,yo,young,younger,zone
first third,1,1,0,4,1,0,1,2,1,1,...,9,1,0,1,16,1,3,2,3,5
intro,0,0,2,0,1,0,1,0,2,0,...,2,0,0,0,0,0,0,0,0,0
outro,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,2,0,0,0,0,0
second third,0,1,1,3,0,1,0,4,2,0,...,8,0,2,1,25,11,4,1,1,4
third third,1,1,0,0,0,1,0,1,0,1,...,4,1,2,0,22,4,1,1,0,0
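
When `min_df` and `max_df` are given as proportions, they are scaled by the number of documents before being compared with each term's document frequency. With only five segments, `min_df=0.25` works out to 1.25 documents and `max_df=0.80` to 4 documents. A term therefore survives only if it appears in two, three, or four segments. The sketch below reproduces that arithmetic in plain Python on a made-up five-document corpus. It is an illustration of the thresholds, not scikit-learn's actual implementation.

```python
# Hypothetical toy corpus with 5 documents, mirroring the 5 segments.
docs = [
    "love rose date",
    "love rose",
    "love date horse",
    "love rose",
    "love pilot",
]
min_df, max_df = 0.25, 0.80

# Count in how many documents each term appears.
n_docs = len(docs)
doc_freq = {}
for doc in docs:
    for term in set(doc.split()):
        doc_freq[term] = doc_freq.get(term, 0) + 1

# Scale the proportions by the number of documents (1.25 and 4.0 here).
low, high = min_df * n_docs, max_df * n_docs
kept = sorted(term for term, df in doc_freq.items() if low <= df <= high)

print(doc_freq)
print(kept)
```

In the toy corpus, "love" is dropped because it occurs in every document, and "horse" and "pilot" are dropped because each occurs in only one. The same mechanism removes words such as "good", "time", "gabby", and "rachel" from the matrix above.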


In [151]:
# Create the gensim corpus
corpusn = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(data_dtmnns_2.transpose()))

# Create the vocabulary dictionary
id2wordn = dict((v, k) for k, v in cvnns_2.vocabulary_.items())

In [152]:
#trying topic modeling with 2 topics
lda = models.LdaModel(corpus=corpusn, num_topics=2, id2word=id2wordn, passes=10)
lda.print_topics()

2022-10-05 01:23:29,570 : INFO : using symmetric alpha at 0.5
2022-10-05 01:23:29,572 : INFO : using symmetric eta at 0.5
2022-10-05 01:23:29,573 : INFO : using serial LDA version on this node
2022-10-05 01:23:29,574 : INFO : running online (multi-pass) LDA training, 2 topics, 10 passes over the supplied corpus of 5 documents, updating model once every 5 documents, evaluating perplexity every 5 documents, iterating 50x with a convergence threshold of 0.001000
2022-10-05 01:23:29,640 : INFO : -7.452 per-word bound, 175.1 perplexity estimate based on a held-out corpus of 5 documents with 8879 words
2022-10-05 01:23:29,642 : INFO : PROGRESS: pass 0, at document #5/5
2022-10-05 01:23:29,651 : INFO : topic #0 (0.500): 0.010*"better" + 0.009*"relationship" + 0.009*"real" + 0.008*"fun" + 0.008*"special" + 0.008*"dad" + 0.008*"easy" + 0.006*"yes" + 0.006*"group" + 0.006*"aven"
2022-10-05 01:23:29,652 : INFO : topic #1 (0.500): 0.012*"fun" + 0.011*"group" + 0.010*"real" + 0.009*"relationship" +

[(0,
  '0.010*"special" + 0.010*"relationship" + 0.009*"real" + 0.009*"fun" + 0.008*"easy" + 0.008*"aven" + 0.007*"group" + 0.007*"mom" + 0.007*"decision" + 0.007*"dad"'),
 (1,
  '0.011*"fun" + 0.010*"group" + 0.009*"real" + 0.009*"better" + 0.008*"relationship" + 0.008*"decision" + 0.008*"couple" + 0.008*"yes" + 0.007*"dad" + 0.006*"special"')]

In [153]:
#trying topic modeling with 3 topics
lda = models.LdaModel(corpus=corpusn, num_topics=3, id2word=id2wordn, passes=10)
lda.print_topics()

2022-10-05 01:23:30,483 : INFO : using symmetric alpha at 0.3333333333333333
2022-10-05 01:23:30,484 : INFO : using symmetric eta at 0.3333333333333333
2022-10-05 01:23:30,486 : INFO : using serial LDA version on this node
2022-10-05 01:23:30,488 : INFO : running online (multi-pass) LDA training, 3 topics, 10 passes over the supplied corpus of 5 documents, updating model once every 5 documents, evaluating perplexity every 5 documents, iterating 50x with a convergence threshold of 0.001000
2022-10-05 01:23:30,538 : INFO : -7.591 per-word bound, 192.8 perplexity estimate based on a held-out corpus of 5 documents with 8879 words
2022-10-05 01:23:30,539 : INFO : PROGRESS: pass 0, at document #5/5
2022-10-05 01:23:30,567 : INFO : topic #0 (0.333): 0.011*"fun" + 0.009*"real" + 0.008*"group" + 0.007*"yes" + 0.007*"relationship" + 0.007*"special" + 0.007*"easy" + 0.007*"better" + 0.006*"nate" + 0.006*"dad"
2022-10-05 01:23:30,570 : INFO : topic #1 (0.333): 0.011*"real" + 0.009*"decision" + 0.0

2022-10-05 01:23:31,158 : INFO : topic #2 (0.333): 0.016*"fun" + 0.011*"better" + 0.010*"group" + 0.009*"dad" + 0.008*"relationship" + 0.007*"nate" + 0.007*"yes" + 0.007*"strong" + 0.007*"special" + 0.007*"easy"
2022-10-05 01:23:31,159 : INFO : topic diff=0.027459, rho=0.316228
2022-10-05 01:23:31,206 : INFO : -6.744 per-word bound, 107.2 perplexity estimate based on a held-out corpus of 5 documents with 8879 words
2022-10-05 01:23:31,206 : INFO : PROGRESS: pass 9, at document #5/5
2022-10-05 01:23:31,215 : INFO : topic #0 (0.333): 0.011*"child" + 0.009*"kid" + 0.009*"blanco" + 0.008*"sex" + 0.008*"feet" + 0.006*"dad" + 0.006*"pilot" + 0.006*"horse" + 0.006*"dreams" + 0.004*"yes"
2022-10-05 01:23:31,216 : INFO : topic #1 (0.333): 0.012*"real" + 0.010*"decision" + 0.010*"relationship" + 0.009*"special" + 0.009*"group" + 0.007*"fun" + 0.007*"easy" + 0.007*"meatball" + 0.007*"aven" + 0.006*"better"
2022-10-05 01:23:31,217 : INFO : topic #2 (0.333): 0.016*"fun" + 0.011*"better" + 0.010*"gr

[(0,
  '0.011*"child" + 0.009*"kid" + 0.009*"blanco" + 0.008*"sex" + 0.008*"feet" + 0.006*"dad" + 0.006*"pilot" + 0.006*"horse" + 0.006*"dreams" + 0.004*"yes"'),
 (1,
  '0.012*"real" + 0.010*"decision" + 0.010*"relationship" + 0.009*"special" + 0.009*"group" + 0.007*"fun" + 0.007*"easy" + 0.007*"meatball" + 0.007*"aven" + 0.006*"better"'),
 (2,
  '0.016*"fun" + 0.011*"better" + 0.010*"group" + 0.009*"dad" + 0.008*"relationship" + 0.007*"nate" + 0.007*"yes" + 0.007*"strong" + 0.007*"special" + 0.007*"easy"')]

In [154]:
#trying topic modeling with 4 topics
lda = models.LdaModel(corpus=corpusn, num_topics=4, id2word=id2wordn, passes=10)
lda.print_topics()

2022-10-05 01:23:31,231 : INFO : using symmetric alpha at 0.25
2022-10-05 01:23:31,233 : INFO : using symmetric eta at 0.25
2022-10-05 01:23:31,234 : INFO : using serial LDA version on this node
2022-10-05 01:23:31,236 : INFO : running online (multi-pass) LDA training, 4 topics, 10 passes over the supplied corpus of 5 documents, updating model once every 5 documents, evaluating perplexity every 5 documents, iterating 50x with a convergence threshold of 0.001000
2022-10-05 01:23:31,281 : INFO : -7.743 per-word bound, 214.2 perplexity estimate based on a held-out corpus of 5 documents with 8879 words
2022-10-05 01:23:31,282 : INFO : PROGRESS: pass 0, at document #5/5
2022-10-05 01:23:31,292 : INFO : topic #0 (0.250): 0.010*"group" + 0.010*"decision" + 0.010*"fun" + 0.009*"relationship" + 0.009*"real" + 0.008*"special" + 0.007*"couple" + 0.007*"dad" + 0.007*"easy" + 0.007*"better"
2022-10-05 01:23:31,293 : INFO : topic #1 (0.250): 0.011*"fun" + 0.009*"yes" + 0.008*"relationship" + 0.008*"

2022-10-05 01:23:31,719 : INFO : topic diff=0.092188, rho=0.353553
2022-10-05 01:23:31,770 : INFO : -6.783 per-word bound, 110.1 perplexity estimate based on a held-out corpus of 5 documents with 8879 words
2022-10-05 01:23:31,770 : INFO : PROGRESS: pass 7, at document #5/5
2022-10-05 01:23:31,792 : INFO : topic #0 (0.250): 0.011*"special" + 0.011*"relationship" + 0.010*"group" + 0.010*"real" + 0.010*"decision" + 0.008*"fun" + 0.007*"easy" + 0.007*"aven" + 0.006*"paris" + 0.006*"mom"
2022-10-05 01:23:31,792 : INFO : topic #1 (0.250): 0.002*"fun" + 0.002*"yes" + 0.002*"relationship" + 0.002*"group" + 0.002*"special" + 0.002*"real" + 0.002*"better" + 0.002*"easy" + 0.002*"dad" + 0.001*"perfect"
2022-10-05 01:23:31,793 : INFO : topic #2 (0.250): 0.012*"fun" + 0.010*"real" + 0.010*"better" + 0.009*"group" + 0.009*"yes" + 0.009*"relationship" + 0.008*"dad" + 0.007*"nate" + 0.007*"easy" + 0.007*"special"
2022-10-05 01:23:31,794 : INFO : topic #3 (0.250): 0.013*"child" + 0.011*"kid" + 0.011*"

[(0,
  '0.011*"special" + 0.011*"relationship" + 0.010*"group" + 0.010*"real" + 0.010*"decision" + 0.008*"fun" + 0.007*"aven" + 0.007*"easy" + 0.006*"paris" + 0.006*"mom"'),
 (1,
  '0.001*"fun" + 0.001*"yes" + 0.001*"relationship" + 0.001*"group" + 0.001*"special" + 0.001*"real" + 0.001*"better" + 0.001*"easy" + 0.001*"dad" + 0.001*"perfect"'),
 (2,
  '0.012*"fun" + 0.010*"real" + 0.009*"better" + 0.009*"group" + 0.009*"yes" + 0.009*"relationship" + 0.008*"dad" + 0.007*"nate" + 0.007*"easy" + 0.007*"special"'),
 (3,
  '0.013*"child" + 0.011*"kid" + 0.011*"blanco" + 0.009*"sex" + 0.009*"feet" + 0.007*"dad" + 0.007*"pilot" + 0.007*"horse" + 0.007*"dreams" + 0.005*"easy"')]

In [155]:
#trying topic modeling with 5 topics
lda = models.LdaModel(corpus=corpusn, num_topics=5, id2word=id2wordn, passes=10)
lda.print_topics()

2022-10-05 01:23:31,915 : INFO : using symmetric alpha at 0.2
2022-10-05 01:23:31,918 : INFO : using symmetric eta at 0.2
2022-10-05 01:23:31,919 : INFO : using serial LDA version on this node
2022-10-05 01:23:31,920 : INFO : running online (multi-pass) LDA training, 5 topics, 10 passes over the supplied corpus of 5 documents, updating model once every 5 documents, evaluating perplexity every 5 documents, iterating 50x with a convergence threshold of 0.001000
2022-10-05 01:23:31,960 : INFO : -7.913 per-word bound, 241.0 perplexity estimate based on a held-out corpus of 5 documents with 8879 words
2022-10-05 01:23:31,961 : INFO : PROGRESS: pass 0, at document #5/5
2022-10-05 01:23:31,974 : INFO : topic #0 (0.200): 0.010*"fun" + 0.010*"group" + 0.009*"real" + 0.009*"relationship" + 0.007*"better" + 0.007*"aven" + 0.007*"special" + 0.007*"mom" + 0.006*"couple" + 0.006*"easy"
2022-10-05 01:23:31,975 : INFO : topic #1 (0.200): 0.010*"real" + 0.008*"relationship" + 0.008*"special" + 0.008*"g

2022-10-05 01:23:32,360 : INFO : topic #4 (0.200): 0.014*"real" + 0.011*"decision" + 0.010*"impression" + 0.009*"yes" + 0.008*"gentlemen" + 0.008*"meatball" + 0.008*"relationship" + 0.008*"roses" + 0.007*"group" + 0.007*"mario"
2022-10-05 01:23:32,362 : INFO : topic diff=0.162139, rho=0.377964
2022-10-05 01:23:32,425 : INFO : -6.848 per-word bound, 115.2 perplexity estimate based on a held-out corpus of 5 documents with 8879 words
2022-10-05 01:23:32,425 : INFO : PROGRESS: pass 6, at document #5/5
2022-10-05 01:23:32,437 : INFO : topic #0 (0.200): 0.002*"fun" + 0.002*"group" + 0.002*"real" + 0.002*"relationship" + 0.002*"better" + 0.002*"aven" + 0.002*"special" + 0.002*"mom" + 0.002*"couple" + 0.002*"easy"
2022-10-05 01:23:32,438 : INFO : topic #1 (0.200): 0.002*"real" + 0.002*"relationship" + 0.002*"special" + 0.002*"group" + 0.002*"decision" + 0.002*"fun" + 0.002*"easy" + 0.002*"mom" + 0.002*"aven" + 0.002*"better"
2022-10-05 01:23:32,438 : INFO : topic #2 (0.200): 0.017*"fun" + 0.01

[(0,
  '0.001*"fun" + 0.001*"group" + 0.001*"real" + 0.001*"relationship" + 0.001*"better" + 0.001*"aven" + 0.001*"special" + 0.001*"mom" + 0.001*"couple" + 0.001*"easy"'),
 (1,
  '0.001*"real" + 0.001*"relationship" + 0.001*"special" + 0.001*"group" + 0.001*"decision" + 0.001*"fun" + 0.001*"easy" + 0.001*"mom" + 0.001*"aven" + 0.001*"better"'),
 (2,
  '0.017*"fun" + 0.011*"better" + 0.010*"group" + 0.010*"dad" + 0.009*"relationship" + 0.008*"nate" + 0.008*"yes" + 0.007*"strong" + 0.007*"mom" + 0.007*"special"'),
 (3,
  '0.011*"special" + 0.010*"relationship" + 0.009*"group" + 0.009*"decision" + 0.009*"real" + 0.008*"fun" + 0.008*"easy" + 0.007*"aven" + 0.007*"paris" + 0.006*"mom"'),
 (4,
  '0.014*"real" + 0.011*"decision" + 0.010*"impression" + 0.010*"yes" + 0.009*"gentlemen" + 0.009*"meatball" + 0.008*"roses" + 0.008*"relationship" + 0.007*"group" + 0.007*"mario"')]
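
In the 5-topic output, topics 0 and 1 give roughly 0.001 to every listed term. A flat profile like this usually suggests that no segment was allocated to those topics, and that five topics is more than five documents can support. The plain-Python check below flags such topics. The 0.005 cutoff is an arbitrary assumption, the helper name is hypothetical, and each topic string is abridged to its first three terms.

```python
import re

# First three terms of each topic, copied from the 5-topic print_topics() output above.
topics = {
    0: '0.001*"fun" + 0.001*"group" + 0.001*"real"',
    1: '0.001*"real" + 0.001*"relationship" + 0.001*"special"',
    2: '0.017*"fun" + 0.011*"better" + 0.010*"group"',
    3: '0.011*"special" + 0.010*"relationship" + 0.009*"group"',
    4: '0.014*"real" + 0.011*"decision" + 0.010*"impression"',
}
FLAT_CUTOFF = 0.005  # assumed threshold, not a gensim setting

def max_weight(topic_string):
    """Largest term weight in a print_topics() string."""
    return max(float(w) for w in re.findall(r'(\d+\.\d+)\*', topic_string))

flat_topics = [tid for tid, text in topics.items() if max_weight(text) < FLAT_CUTOFF]
print(flat_topics)
```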

## 4.2 Topic Modeling - Attempt #6 (nouns + adjectives + proper nouns + plural nouns, with min_df, max_df, and n-grams)

In [156]:
# Create a new document-term matrix using only NN JJ NNP NNS
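# ngram_range=(1,2): count single words and two-word sequences
# min_df=0.25: ignore terms that appear in fewer than 25% of the documents
# (with 5 segments, this drops terms found in only one segment)
# max_df=0.80: ignore terms that appear in more than 80% of the documents
# (with 5 segments, this drops terms found in all five segments)
cvnns_3 = CountVectorizer(stop_words=stop_words, ngram_range=(1,2), min_df=0.25, max_df=0.80)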
data_cvnns_3 = cvnns_3.fit_transform(data_NN_JJ_NNP_NNS.lines_by_segment)
data_dtmnns_3 = pd.DataFrame(data_cvnns_3.toarray(), columns=cvnns_3.get_feature_names_out())
data_dtmnns_3.index = data_NN_JJ_NNP_NNS.index
data_dtmnns_3

segment,abc,ability,abs,absolute,absurd,action,active,actual,adult,advance,...,young,young lot,younger,zach great,zach incredible,zach love,zach man,zach thank,zach tyler,zone
first third,1,1,0,4,1,0,1,2,1,1,...,2,1,3,3,1,1,1,1,0,5
intro,0,0,2,0,1,0,1,0,2,0,...,0,0,0,0,0,1,0,0,1,0
outro,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
second third,0,1,1,3,0,1,0,4,2,0,...,1,1,1,1,1,1,0,1,0,4
third third,1,1,0,0,0,1,0,1,0,1,...,1,0,0,0,0,0,1,1,1,0
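
With `ngram_range=(1,2)`, the vectorizer counts two-word sequences in addition to single words, which is where columns such as "zach great" come from. The bigrams are built from the text that was already reduced to nouns and adjectives, with stop words removed. Two words in a bigram therefore need not have been adjacent in the original transcript. The sketch below shows the idea in plain Python. The function name and the token list are hypothetical, and it is a simplified stand-in for what `CountVectorizer` does internally.

```python
def word_ngrams(tokens, ngram_range=(1, 2)):
    """Join consecutive tokens into space-separated n-grams for each n in the range."""
    low, high = ngram_range
    grams = []
    for n in range(low, high + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

# Hypothetical tokens, as they might look after POS filtering and stop-word removal.
tokens = ["zach", "great", "guy"]
print(word_ngrams(tokens))
```

Adding bigrams enlarges the vocabulary considerably, which is in line with the jump from 8879 to 15563 counted words reported in the training logs for this attempt.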


In [157]:
# Create the gensim corpus
corpusn = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(data_dtmnns_3.transpose()))

# Create the vocabulary dictionary
id2wordn = dict((v, k) for k, v in cvnns_3.vocabulary_.items())

In [158]:
#trying topic modeling with 2 topics
lda = models.LdaModel(corpus=corpusn, num_topics=2, id2word=id2wordn, passes=10)
lda.print_topics()

2022-10-05 01:23:32,706 : INFO : using symmetric alpha at 0.5
2022-10-05 01:23:32,708 : INFO : using symmetric eta at 0.5
2022-10-05 01:23:32,709 : INFO : using serial LDA version on this node
2022-10-05 01:23:32,711 : INFO : running online (multi-pass) LDA training, 2 topics, 10 passes over the supplied corpus of 5 documents, updating model once every 5 documents, evaluating perplexity every 5 documents, iterating 50x with a convergence threshold of 0.001000
2022-10-05 01:23:32,789 : INFO : -8.559 per-word bound, 377.1 perplexity estimate based on a held-out corpus of 5 documents with 15563 words
2022-10-05 01:23:32,790 : INFO : PROGRESS: pass 0, at document #5/5
2022-10-05 01:23:32,804 : INFO : topic #0 (0.500): 0.006*"real" + 0.006*"relationship" + 0.005*"fun" + 0.004*"special" + 0.004*"easy" + 0.004*"better" + 0.004*"decision" + 0.004*"group" + 0.004*"yes" + 0.003*"strong"
2022-10-05 01:23:32,805 : INFO : topic #1 (0.500): 0.006*"group" + 0.005*"fun" + 0.004*"better" + 0.004*"speci

[(0,
  '0.005*"special" + 0.005*"relationship" + 0.005*"real" + 0.004*"fun" + 0.004*"easy" + 0.004*"aven" + 0.004*"decision" + 0.004*"group" + 0.003*"mom" + 0.003*"dad"'),
 (1,
  '0.006*"fun" + 0.005*"group" + 0.005*"real" + 0.005*"better" + 0.004*"relationship" + 0.004*"decision" + 0.004*"yes" + 0.004*"couple" + 0.004*"dad" + 0.004*"special"')]

In [159]:
#trying topic modeling with 3 topics
lda = models.LdaModel(corpus=corpusn, num_topics=3, id2word=id2wordn, passes=10)
lda.print_topics()

2022-10-05 01:23:33,569 : INFO : using symmetric alpha at 0.3333333333333333
2022-10-05 01:23:33,571 : INFO : using symmetric eta at 0.3333333333333333
2022-10-05 01:23:33,573 : INFO : using serial LDA version on this node
2022-10-05 01:23:33,576 : INFO : running online (multi-pass) LDA training, 3 topics, 10 passes over the supplied corpus of 5 documents, updating model once every 5 documents, evaluating perplexity every 5 documents, iterating 50x with a convergence threshold of 0.001000
2022-10-05 01:23:33,653 : INFO : -8.759 per-word bound, 433.4 perplexity estimate based on a held-out corpus of 5 documents with 15563 words
2022-10-05 01:23:33,654 : INFO : PROGRESS: pass 0, at document #5/5
2022-10-05 01:23:33,667 : INFO : topic #0 (0.333): 0.006*"fun" + 0.004*"real" + 0.004*"group" + 0.004*"relationship" + 0.004*"special" + 0.004*"decision" + 0.004*"better" + 0.003*"yes" + 0.003*"dad" + 0.003*"aven"
2022-10-05 01:23:33,668 : INFO : topic #1 (0.333): 0.006*"group" + 0.005*"special" 

2022-10-05 01:23:34,358 : INFO : topic #1 (0.333): 0.006*"fun" + 0.006*"real" + 0.005*"group" + 0.005*"relationship" + 0.005*"special" + 0.005*"better" + 0.005*"decision" + 0.004*"easy" + 0.004*"dad" + 0.004*"yes"
2022-10-05 01:23:34,359 : INFO : topic #2 (0.333): 0.004*"child" + 0.003*"kid" + 0.003*"blanco" + 0.003*"sex" + 0.003*"feet" + 0.002*"dad" + 0.002*"pilot" + 0.002*"live finale" + 0.002*"horse" + 0.002*"dreams"
2022-10-05 01:23:34,360 : INFO : topic diff=0.032659, rho=0.316228
2022-10-05 01:23:34,417 : INFO : -7.898 per-word bound, 238.5 perplexity estimate based on a held-out corpus of 5 documents with 15563 words
2022-10-05 01:23:34,418 : INFO : PROGRESS: pass 9, at document #5/5
2022-10-05 01:23:34,426 : INFO : topic #0 (0.333): 0.001*"fun" + 0.000*"group" + 0.000*"real" + 0.000*"relationship" + 0.000*"decision" + 0.000*"special" + 0.000*"better" + 0.000*"couple" + 0.000*"yes" + 0.000*"aven"
2022-10-05 01:23:34,426 : INFO : topic #1 (0.333): 0.006*"fun" + 0.006*"real" + 0.0

[(0,
  '0.001*"fun" + 0.000*"group" + 0.000*"real" + 0.000*"relationship" + 0.000*"decision" + 0.000*"special" + 0.000*"better" + 0.000*"couple" + 0.000*"yes" + 0.000*"aven"'),
 (1,
  '0.006*"fun" + 0.006*"real" + 0.005*"group" + 0.005*"relationship" + 0.005*"special" + 0.005*"better" + 0.005*"decision" + 0.004*"easy" + 0.004*"dad" + 0.004*"yes"'),
 (2,
  '0.004*"child" + 0.003*"kid" + 0.003*"blanco" + 0.003*"sex" + 0.003*"feet" + 0.002*"dad" + 0.002*"pilot" + 0.002*"live finale" + 0.002*"horse" + 0.002*"dreams"')]

In [160]:
#trying topic modeling with 4 topics
lda = models.LdaModel(corpus=corpusn, num_topics=4, id2word=id2wordn, passes=10)
lda.print_topics()

2022-10-05 01:23:34,442 : INFO : using symmetric alpha at 0.25
2022-10-05 01:23:34,446 : INFO : using symmetric eta at 0.25
2022-10-05 01:23:34,448 : INFO : using serial LDA version on this node
2022-10-05 01:23:34,453 : INFO : running online (multi-pass) LDA training, 4 topics, 10 passes over the supplied corpus of 5 documents, updating model once every 5 documents, evaluating perplexity every 5 documents, iterating 50x with a convergence threshold of 0.001000
2022-10-05 01:23:34,527 : INFO : -8.989 per-word bound, 508.1 perplexity estimate based on a held-out corpus of 5 documents with 15563 words
2022-10-05 01:23:34,528 : INFO : PROGRESS: pass 0, at document #5/5
2022-10-05 01:23:34,546 : INFO : topic #0 (0.250): 0.007*"real" + 0.006*"fun" + 0.005*"better" + 0.005*"special" + 0.005*"relationship" + 0.005*"group" + 0.004*"decision" + 0.003*"easy" + 0.003*"nate" + 0.003*"second"
2022-10-05 01:23:34,546 : INFO : topic #1 (0.250): 0.005*"group" + 0.005*"relationship" + 0.005*"fun" + 0.0

2022-10-05 01:23:35,112 : INFO : topic diff=0.111048, rho=0.353553
2022-10-05 01:23:35,193 : INFO : -7.927 per-word bound, 243.3 perplexity estimate based on a held-out corpus of 5 documents with 15563 words
2022-10-05 01:23:35,194 : INFO : PROGRESS: pass 7, at document #5/5
2022-10-05 01:23:35,206 : INFO : topic #0 (0.250): 0.006*"group" + 0.005*"decision" + 0.005*"couple" + 0.004*"second" + 0.004*"meatball" + 0.004*"families" + 0.003*"romantic" + 0.003*"dates" + 0.003*"relationships" + 0.003*"word"
2022-10-05 01:23:35,207 : INFO : topic #1 (0.250): 0.005*"child" + 0.004*"kid" + 0.004*"blanco" + 0.003*"sex" + 0.003*"feet" + 0.003*"dad" + 0.003*"pilot" + 0.002*"live finale" + 0.002*"horse" + 0.002*"dreams"
2022-10-05 01:23:35,208 : INFO : topic #2 (0.250): 0.001*"fun" + 0.001*"special" + 0.001*"yes" + 0.001*"group" + 0.001*"real" + 0.001*"aven" + 0.001*"dad" + 0.001*"second" + 0.001*"easy" + 0.001*"mom"
2022-10-05 01:23:35,209 : INFO : topic #3 (0.250): 0.007*"fun" + 0.006*"real" + 0.0

[(0,
  '0.006*"group" + 0.005*"decision" + 0.005*"couple" + 0.004*"second" + 0.004*"meatball" + 0.004*"families" + 0.003*"romantic" + 0.003*"dates" + 0.003*"relationships" + 0.003*"word"'),
 (1,
  '0.005*"child" + 0.004*"kid" + 0.004*"blanco" + 0.003*"sex" + 0.003*"feet" + 0.003*"dad" + 0.003*"pilot" + 0.003*"live finale" + 0.003*"horse" + 0.003*"dreams"'),
 (2,
  '0.001*"fun" + 0.000*"special" + 0.000*"yes" + 0.000*"group" + 0.000*"real" + 0.000*"aven" + 0.000*"dad" + 0.000*"second" + 0.000*"easy" + 0.000*"mom"'),
 (3,
  '0.007*"fun" + 0.006*"real" + 0.006*"relationship" + 0.005*"group" + 0.005*"special" + 0.005*"better" + 0.004*"easy" + 0.004*"dad" + 0.004*"decision" + 0.004*"yes"')]

## 4.2 Topic Modeling - Attempt #7 (nouns+adjectives+proper nouns+plural nouns & ngrams)

In [161]:
# Create a new document-term matrix using only NN JJ NNP NNS
# ngram_range=(1,2) keeps single words (unigrams) and adjacent word pairs (bigrams)
# note: no min_df / max_df filtering is applied in this attempt
cvnns_4 = CountVectorizer(stop_words=stop_words, ngram_range=(1,2))
data_cvnns_4 = cvnns_4.fit_transform(data_NN_JJ_NNP_NNS.lines_by_segment)
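# get_feature_names() was removed in scikit-learn 1.2; on versions older than 1.0 use get_feature_names() instead
data_dtmnns_4 = pd.DataFrame(data_cvnns_4.toarray(), columns=cvnns_4.get_feature_names_out())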
data_dtmnns_4.index = data_NN_JJ_NNP_NNS.index
data_dtmnns_4

Unnamed: 0_level_0,aah,aah baby,abc,abc progress,abc time,abilities,abilities ethan,ability,ability greater,ability incredible,...,zone,zone cream,zone good,zone hi,zone honest,zone let,zone mom,zone okay,zone way,zone yes
segment,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
first third,0,0,1,0,1,0,0,1,0,0,...,5,0,1,1,1,0,1,0,0,1
intro,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
outro,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
second third,1,1,0,0,0,1,1,1,1,0,...,4,1,0,0,0,1,0,1,1,0
third third,0,0,1,1,0,0,0,1,0,1,...,0,0,0,0,0,0,0,0,0,0
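
For reference, `ngram_range=(1,2)` is why the matrix above has columns such as `aah baby` alongside `aah`: each document contributes its single words plus every adjacent word pair. Below is a minimal pure-Python sketch of that expansion. The helper `unigrams_and_bigrams` and the sample string are illustrative only and not part of the notebook; `CountVectorizer` also applies its own tokenization and stop-word removal, which this sketch skips.

```python
def unigrams_and_bigrams(text):
    """Return the unigrams followed by the adjacent-word bigrams of a whitespace-tokenized string."""
    tokens = text.split()
    # Pair every token with its right-hand neighbour to form bigrams.
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams

features = unigrams_and_bigrams("gabby rachel bachelorette")
print(features)
```

Adding bigrams multiplies the number of columns, and most bigrams occur in a single segment, which is one reason the attempt #7 matrix is so wide and sparse.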


In [162]:
# Create the gensim corpus
corpusn = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(data_dtmnns_4.transpose()))

# Create the vocabulary dictionary
id2wordn = dict((v, k) for k, v in cvnns_4.vocabulary_.items())
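
To make the two objects above less opaque: gensim represents each document as a list of `(term_id, count)` pairs, and `id2word` maps those ids back to terms. `Sparse2Corpus` treats columns as documents by default, which is why the document-term matrix is transposed first. The sketch below mimics that conversion in plain Python on a made-up 2x3 matrix; `toy_dtm`, `toy_vocabulary`, and `rows_to_bow` are hypothetical names used only for illustration.

```python
# Toy document-term matrix: rows are documents, columns are terms.
# term -> column index, the same shape as CountVectorizer.vocabulary_
toy_vocabulary = {"date": 0, "rose": 1, "group": 2}
toy_dtm = [
    [2, 0, 1],  # document 0
    [0, 3, 0],  # document 1
]

def rows_to_bow(dtm):
    """Convert each row into a list of (term_id, count) pairs, skipping zero counts."""
    return [[(j, c) for j, c in enumerate(row) if c > 0] for row in dtm]

toy_corpus = rows_to_bow(toy_dtm)

# Invert term -> id into id -> term, as done for id2wordn.
toy_id2word = {idx: term for term, idx in toy_vocabulary.items()}

for doc in toy_corpus:
    print([(toy_id2word[j], c) for j, c in doc])
```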

In [163]:
#trying topic modeling with 2 topics
lda = models.LdaModel(corpus=corpusn, num_topics=2, id2word=id2wordn, passes=10)
lda.print_topics()

2022-10-05 01:23:35,527 : INFO : using symmetric alpha at 0.5
2022-10-05 01:23:35,530 : INFO : using symmetric eta at 0.5
2022-10-05 01:23:35,536 : INFO : using serial LDA version on this node
2022-10-05 01:23:35,542 : INFO : running online (multi-pass) LDA training, 2 topics, 10 passes over the supplied corpus of 5 documents, updating model once every 5 documents, evaluating perplexity every 5 documents, iterating 50x with a convergence threshold of 0.001000
2022-10-05 01:23:35,738 : INFO : -10.613 per-word bound, 1566.0 perplexity estimate based on a held-out corpus of 5 documents with 49891 words
2022-10-05 01:23:35,739 : INFO : PROGRESS: pass 0, at document #5/5
2022-10-05 01:23:35,783 : INFO : topic #0 (0.500): 0.008*"good" + 0.007*"rachel" + 0.007*"time" + 0.005*"gabby" + 0.005*"kind" + 0.004*"way" + 0.004*"okay" + 0.004*"love" + 0.004*"date" + 0.004*"guys"
2022-10-05 01:23:35,785 : INFO : topic #1 (0.500): 0.008*"gabby" + 0.006*"time" + 0.005*"kind" + 0.005*"rachel" + 0.005*"goo

[(0,
  '0.008*"good" + 0.008*"time" + 0.007*"gabby" + 0.007*"rachel" + 0.006*"kind" + 0.005*"guys" + 0.004*"love" + 0.004*"way" + 0.004*"little" + 0.004*"okay"'),
 (1,
  '0.003*"gabby" + 0.002*"rachel" + 0.002*"week" + 0.002*"love" + 0.002*"bachelorette" + 0.001*"guys" + 0.001*"gabby rachel" + 0.001*"history" + 0.001*"time" + 0.001*"men"')]

In [164]:
#trying topic modeling with 3 topics
lda = models.LdaModel(corpus=corpusn, num_topics=3, id2word=id2wordn, passes=10)
lda.print_topics()

2022-10-05 01:23:37,793 : INFO : using symmetric alpha at 0.3333333333333333
2022-10-05 01:23:37,809 : INFO : using symmetric eta at 0.3333333333333333
2022-10-05 01:23:37,823 : INFO : using serial LDA version on this node
2022-10-05 01:23:37,838 : INFO : running online (multi-pass) LDA training, 3 topics, 10 passes over the supplied corpus of 5 documents, updating model once every 5 documents, evaluating perplexity every 5 documents, iterating 50x with a convergence threshold of 0.001000
2022-10-05 01:23:38,061 : INFO : -10.986 per-word bound, 2027.6 perplexity estimate based on a held-out corpus of 5 documents with 49891 words
2022-10-05 01:23:38,062 : INFO : PROGRESS: pass 0, at document #5/5
2022-10-05 01:23:38,105 : INFO : topic #0 (0.333): 0.007*"good" + 0.006*"time" + 0.006*"rachel" + 0.005*"gabby" + 0.005*"love" + 0.004*"kind" + 0.004*"guys" + 0.004*"way" + 0.004*"little" + 0.004*"okay"
2022-10-05 01:23:38,106 : INFO : topic #1 (0.333): 0.008*"gabby" + 0.007*"good" + 0.007*"tim

2022-10-05 01:23:40,025 : INFO : topic #2 (0.333): 0.006*"rachel" + 0.006*"gabby" + 0.006*"time" + 0.006*"good" + 0.004*"kind" + 0.004*"way" + 0.003*"guys" + 0.003*"hard" + 0.003*"okay" + 0.003*"love"
2022-10-05 01:23:40,026 : INFO : topic diff=0.018158, rho=0.316228
2022-10-05 01:23:40,218 : INFO : -9.211 per-word bound, 592.8 perplexity estimate based on a held-out corpus of 5 documents with 49891 words
2022-10-05 01:23:40,219 : INFO : PROGRESS: pass 9, at document #5/5
2022-10-05 01:23:40,248 : INFO : topic #0 (0.333): 0.009*"good" + 0.008*"time" + 0.007*"gabby" + 0.006*"rachel" + 0.006*"kind" + 0.005*"guys" + 0.005*"love" + 0.004*"little" + 0.004*"date" + 0.004*"lot"
2022-10-05 01:23:40,249 : INFO : topic #1 (0.333): 0.005*"gabby" + 0.004*"rachel" + 0.003*"week" + 0.002*"love" + 0.002*"guys" + 0.002*"time" + 0.002*"bachelorette" + 0.002*"excited" + 0.002*"kind" + 0.002*"gabby rachel"
2022-10-05 01:23:40,251 : INFO : topic #2 (0.333): 0.006*"rachel" + 0.006*"gabby" + 0.006*"time" + 

[(0,
  '0.009*"good" + 0.008*"time" + 0.007*"gabby" + 0.006*"rachel" + 0.006*"kind" + 0.005*"guys" + 0.005*"love" + 0.004*"little" + 0.004*"date" + 0.004*"lot"'),
 (1,
  '0.005*"gabby" + 0.004*"rachel" + 0.003*"week" + 0.002*"love" + 0.002*"guys" + 0.002*"time" + 0.002*"bachelorette" + 0.002*"excited" + 0.002*"kind" + 0.002*"gabby rachel"'),
 (2,
  '0.006*"rachel" + 0.006*"gabby" + 0.006*"time" + 0.006*"good" + 0.004*"kind" + 0.004*"way" + 0.003*"guys" + 0.003*"hard" + 0.003*"okay" + 0.003*"love"')]

In [165]:
#trying topic modeling with 4 topics
lda = models.LdaModel(corpus=corpusn, num_topics=4, id2word=id2wordn, passes=10)
lda.print_topics()

2022-10-05 01:23:40,278 : INFO : using symmetric alpha at 0.25
2022-10-05 01:23:40,284 : INFO : using symmetric eta at 0.25
2022-10-05 01:23:40,296 : INFO : using serial LDA version on this node
2022-10-05 01:23:40,326 : INFO : running online (multi-pass) LDA training, 4 topics, 10 passes over the supplied corpus of 5 documents, updating model once every 5 documents, evaluating perplexity every 5 documents, iterating 50x with a convergence threshold of 0.001000
2022-10-05 01:23:40,523 : INFO : -11.438 per-word bound, 2775.2 perplexity estimate based on a held-out corpus of 5 documents with 49891 words
2022-10-05 01:23:40,524 : INFO : PROGRESS: pass 0, at document #5/5
2022-10-05 01:23:40,560 : INFO : topic #0 (0.250): 0.007*"good" + 0.007*"gabby" + 0.007*"time" + 0.006*"rachel" + 0.006*"kind" + 0.005*"guys" + 0.004*"way" + 0.004*"lot" + 0.004*"love" + 0.004*"okay"
2022-10-05 01:23:40,562 : INFO : topic #1 (0.250): 0.007*"time" + 0.007*"good" + 0.006*"rachel" + 0.005*"gabby" + 0.005*"ki

2022-10-05 01:23:42,147 : INFO : PROGRESS: pass 7, at document #5/5
2022-10-05 01:23:42,175 : INFO : topic #0 (0.250): 0.008*"good" + 0.008*"time" + 0.008*"gabby" + 0.007*"rachel" + 0.006*"kind" + 0.005*"guys" + 0.004*"lot" + 0.004*"way" + 0.004*"little" + 0.004*"okay"
2022-10-05 01:23:42,176 : INFO : topic #1 (0.250): 0.006*"gabby" + 0.005*"rachel" + 0.004*"week" + 0.003*"love" + 0.003*"time" + 0.003*"guys" + 0.002*"bachelorette" + 0.002*"kind" + 0.002*"excited" + 0.002*"gabby rachel"
2022-10-05 01:23:42,178 : INFO : topic #2 (0.250): 0.000*"gabby" + 0.000*"time" + 0.000*"good" + 0.000*"rachel" + 0.000*"guys" + 0.000*"love" + 0.000*"day" + 0.000*"family" + 0.000*"kind" + 0.000*"way"
2022-10-05 01:23:42,179 : INFO : topic #3 (0.250): 0.008*"good" + 0.007*"time" + 0.007*"rachel" + 0.006*"gabby" + 0.006*"love" + 0.005*"kind" + 0.004*"date" + 0.004*"way" + 0.003*"ready" + 0.003*"guys"
2022-10-05 01:23:42,181 : INFO : topic diff=0.033603, rho=0.333333
2022-10-05 01:23:42,362 : INFO : -9.23

[(0,
  '0.008*"good" + 0.008*"time" + 0.008*"gabby" + 0.007*"rachel" + 0.006*"kind" + 0.005*"guys" + 0.004*"lot" + 0.004*"way" + 0.004*"little" + 0.004*"okay"'),
 (1,
  '0.006*"gabby" + 0.005*"rachel" + 0.004*"week" + 0.003*"love" + 0.003*"guys" + 0.003*"time" + 0.002*"bachelorette" + 0.002*"excited" + 0.002*"kind" + 0.002*"gabby rachel"'),
 (2,
  '0.000*"gabby" + 0.000*"time" + 0.000*"good" + 0.000*"rachel" + 0.000*"guys" + 0.000*"love" + 0.000*"day" + 0.000*"family" + 0.000*"kind" + 0.000*"way"'),
 (3,
  '0.008*"good" + 0.007*"time" + 0.007*"rachel" + 0.006*"gabby" + 0.006*"love" + 0.005*"kind" + 0.004*"date" + 0.004*"way" + 0.003*"ready" + 0.003*"guys"')]

## 4.2 Topic Modeling - Attempt #8 (nouns+adjectives+noun plural & max_df & min_df)

In [166]:
# Let's create a function to pull out nouns and adjectives from a string of text
# Penn Treebank POS tags: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

def NN_JJ_NNS(text):
    '''Given a string of text, tokenize the text and pull out only the nouns and adjectives.'''
    # pos[:2] == 'NN' matches NN, NNS, NNP and NNPS; pos[:2] == 'JJ' matches JJ, JJR and JJS
    # (a separate check against 'NNS' is not needed: a two-character slice can never equal it)
    is_noun_adj = lambda pos: pos[:2] in ('NN', 'JJ')
    tokenized = word_tokenize(text)
    nouns_adj = [word for (word, pos) in pos_tag(tokenized) if is_noun_adj(pos)]
    return ' '.join(nouns_adj)
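
Matching on the first two characters of a Penn Treebank tag selects whole tag families rather than single tags. The quick check below shows which tags a two-character `NN`/`JJ` prefix test lets through. No tagger is run here; the tag list is an illustrative subset and `keeps` is a made-up helper name.

```python
# Illustrative subset of Penn Treebank tags.
tags = ["NN", "NNS", "NNP", "NNPS", "JJ", "JJR", "JJS", "VB", "RB", "PRP"]

def keeps(pos):
    """Keep tags whose first two characters are NN or JJ."""
    return pos[:2] in ("NN", "JJ")

kept = [t for t in tags if keeps(t)]
print(kept)
```

So a prefix filter of this kind keeps proper nouns (`NNP`, `NNPS`) as well; excluding them would need an exact comparison against the full tag.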

In [167]:
data_NN_JJ_NNS = pd.DataFrame(df.lines_by_segment.apply(NN_JJ_NNS))
data_NN_JJ_NNS

Unnamed: 0_level_0,lines_by_segment
segment,Unnamed: 1_level_1
first third,i difference time decision decision easier yes...
intro,i i bachelorette rachel bachelorette amazing e...
outro,gabby rachel bachelorette more best friends wo...
second third,yeah type yeah yeah i i anyone tino s entrance...
third third,i curious i incredible journey i incredible cr...


In [168]:
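# Create a new document-term matrix using the noun/adjective-filtered text
# min_df=0.25: ignore terms that appear in fewer than 25% of the documents
# max_df=0.80: ignore terms that appear in more than 80% of the documents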
cvnns = CountVectorizer(stop_words=stop_words, min_df=0.25, max_df=0.80)
data_cvnns = cvnns.fit_transform(data_NN_JJ_NNS.lines_by_segment)
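# get_feature_names() was removed in scikit-learn 1.2; on versions older than 1.0 use get_feature_names() instead
data_dtmnns = pd.DataFrame(data_cvnns.toarray(), columns=cvnns.get_feature_names_out())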
data_dtmnns.index = data_NN_JJ_NNS.index
data_dtmnns

Unnamed: 0_level_0,abc,ability,abs,absolute,absurd,action,active,actual,adult,advance,...,wrong,ya,yay,yep,yes,yesterday,yo,young,younger,zone
segment,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
first third,1,1,0,4,1,0,1,2,1,1,...,9,1,0,1,16,1,3,2,3,5
intro,0,0,2,0,1,0,1,0,2,0,...,2,0,0,0,0,0,0,0,0,0
outro,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,2,0,0,0,0,0
second third,0,1,1,3,0,1,0,4,2,0,...,8,0,2,1,25,11,4,1,1,4
third third,1,1,0,0,0,1,0,1,0,1,...,4,1,2,0,22,4,1,1,0,0
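
With only 5 documents, the fractional thresholds translate into small integer cut-offs. As far as I understand scikit-learn's behaviour, a term is kept when its document frequency lies between `min_df * n_docs` and `max_df * n_docs` inclusive, so `min_df=0.25` and `max_df=0.80` should keep terms that occur in 2, 3, or 4 of the 5 segments. Below is a small sketch of that arithmetic; the helper name `kept_doc_frequencies` is made up for illustration.

```python
def kept_doc_frequencies(n_docs, min_df, max_df):
    """Document frequencies that survive fractional min_df / max_df filtering."""
    low = min_df * n_docs    # 0.25 * 5 = 1.25
    high = max_df * n_docs   # 0.80 * 5 = 4.0
    return [df for df in range(1, n_docs + 1) if low <= df <= high]

surviving = kept_doc_frequencies(n_docs=5, min_df=0.25, max_df=0.80)
print(surviving)
```

This matches the table above: `yes` appears in 4 segments and `zone` in 2, while words used in all 5 segments or in only 1 are dropped.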


In [169]:
# Create the gensim corpus
corpusn = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(data_dtmnns.transpose()))

# Create the vocabulary dictionary
id2wordn = dict((v, k) for k, v in cvnns.vocabulary_.items())

In [170]:
#trying topic modeling with 2 topics
lda = models.LdaModel(corpus=corpusn, num_topics=2, id2word=id2wordn, passes=10)
lda.print_topics()

2022-10-05 01:23:48,073 : INFO : using symmetric alpha at 0.5
2022-10-05 01:23:48,078 : INFO : using symmetric eta at 0.5
2022-10-05 01:23:48,079 : INFO : using serial LDA version on this node
2022-10-05 01:23:48,080 : INFO : running online (multi-pass) LDA training, 2 topics, 10 passes over the supplied corpus of 5 documents, updating model once every 5 documents, evaluating perplexity every 5 documents, iterating 50x with a convergence threshold of 0.001000
2022-10-05 01:23:48,155 : INFO : -7.453 per-word bound, 175.2 perplexity estimate based on a held-out corpus of 5 documents with 8879 words
2022-10-05 01:23:48,156 : INFO : PROGRESS: pass 0, at document #5/5
2022-10-05 01:23:48,190 : INFO : topic #0 (0.500): 0.011*"relationship" + 0.010*"fun" + 0.009*"group" + 0.008*"better" + 0.008*"real" + 0.007*"special" + 0.007*"dad" + 0.007*"decision" + 0.006*"nate" + 0.006*"easy"
2022-10-05 01:23:48,191 : INFO : topic #1 (0.500): 0.011*"real" + 0.010*"fun" + 0.009*"special" + 0.008*"group" +

[(0,
  '0.011*"fun" + 0.010*"group" + 0.009*"real" + 0.009*"better" + 0.008*"relationship" + 0.008*"decision" + 0.008*"yes" + 0.008*"couple" + 0.007*"dad" + 0.007*"special"'),
 (1,
  '0.010*"special" + 0.010*"relationship" + 0.009*"real" + 0.008*"fun" + 0.008*"easy" + 0.007*"aven" + 0.007*"group" + 0.007*"decision" + 0.007*"mom" + 0.006*"dad"')]

In [171]:
#trying topic modeling with 3 topics
lda = models.LdaModel(corpus=corpusn, num_topics=3, id2word=id2wordn, passes=10)
lda.print_topics()

2022-10-05 01:23:48,838 : INFO : using symmetric alpha at 0.3333333333333333
2022-10-05 01:23:48,842 : INFO : using symmetric eta at 0.3333333333333333
2022-10-05 01:23:48,844 : INFO : using serial LDA version on this node
2022-10-05 01:23:48,847 : INFO : running online (multi-pass) LDA training, 3 topics, 10 passes over the supplied corpus of 5 documents, updating model once every 5 documents, evaluating perplexity every 5 documents, iterating 50x with a convergence threshold of 0.001000
2022-10-05 01:23:48,888 : INFO : -7.594 per-word bound, 193.2 perplexity estimate based on a held-out corpus of 5 documents with 8879 words
2022-10-05 01:23:48,889 : INFO : PROGRESS: pass 0, at document #5/5
2022-10-05 01:23:48,902 : INFO : topic #0 (0.333): 0.010*"real" + 0.010*"relationship" + 0.010*"group" + 0.009*"easy" + 0.008*"special" + 0.008*"better" + 0.007*"decision" + 0.007*"nate" + 0.007*"fun" + 0.007*"yes"
2022-10-05 01:23:48,903 : INFO : topic #1 (0.333): 0.013*"fun" + 0.009*"real" + 0.0

2022-10-05 01:23:49,363 : INFO : topic #0 (0.333): 0.013*"real" + 0.011*"decision" + 0.009*"relationship" + 0.008*"impression" + 0.008*"yes" + 0.008*"meatball" + 0.008*"special" + 0.007*"easy" + 0.007*"group" + 0.007*"gentlemen"
2022-10-05 01:23:49,364 : INFO : topic #1 (0.333): 0.015*"fun" + 0.010*"group" + 0.010*"relationship" + 0.010*"better" + 0.009*"special" + 0.009*"dad" + 0.008*"mom" + 0.008*"easy" + 0.007*"real" + 0.007*"aven"
2022-10-05 01:23:49,364 : INFO : topic #2 (0.333): 0.012*"group" + 0.011*"decision" + 0.011*"couple" + 0.008*"second" + 0.008*"meatball" + 0.008*"families" + 0.007*"romantic" + 0.007*"dates" + 0.006*"relationships" + 0.006*"word"
2022-10-05 01:23:49,365 : INFO : topic diff=0.049156, rho=0.316228
2022-10-05 01:23:49,398 : INFO : -6.744 per-word bound, 107.2 perplexity estimate based on a held-out corpus of 5 documents with 8879 words
2022-10-05 01:23:49,399 : INFO : PROGRESS: pass 9, at document #5/5
2022-10-05 01:23:49,411 : INFO : topic #0 (0.333): 0.013

[(0,
  '0.013*"real" + 0.011*"decision" + 0.008*"relationship" + 0.008*"impression" + 0.008*"yes" + 0.008*"meatball" + 0.008*"special" + 0.007*"easy" + 0.007*"gentlemen" + 0.007*"group"'),
 (1,
  '0.015*"fun" + 0.010*"group" + 0.010*"relationship" + 0.010*"better" + 0.009*"special" + 0.009*"dad" + 0.008*"mom" + 0.008*"easy" + 0.007*"aven" + 0.007*"real"'),
 (2,
  '0.012*"group" + 0.011*"decision" + 0.011*"couple" + 0.008*"second" + 0.008*"meatball" + 0.008*"families" + 0.007*"romantic" + 0.007*"dates" + 0.006*"relationships" + 0.006*"word"')]

In [172]:
#trying topic modeling with 4 topics
lda = models.LdaModel(corpus=corpusn, num_topics=4, id2word=id2wordn, passes=10)
lda.print_topics()

2022-10-05 01:23:49,430 : INFO : using symmetric alpha at 0.25
2022-10-05 01:23:49,433 : INFO : using symmetric eta at 0.25
2022-10-05 01:23:49,435 : INFO : using serial LDA version on this node
2022-10-05 01:23:49,436 : INFO : running online (multi-pass) LDA training, 4 topics, 10 passes over the supplied corpus of 5 documents, updating model once every 5 documents, evaluating perplexity every 5 documents, iterating 50x with a convergence threshold of 0.001000
2022-10-05 01:23:49,480 : INFO : -7.745 per-word bound, 214.6 perplexity estimate based on a held-out corpus of 5 documents with 8879 words
2022-10-05 01:23:49,481 : INFO : PROGRESS: pass 0, at document #5/5
2022-10-05 01:23:49,491 : INFO : topic #0 (0.250): 0.011*"fun" + 0.010*"real" + 0.009*"group" + 0.008*"relationship" + 0.008*"decision" + 0.007*"better" + 0.007*"special" + 0.007*"mom" + 0.007*"nate" + 0.006*"easy"
2022-10-05 01:23:49,492 : INFO : topic #1 (0.250): 0.011*"relationship" + 0.010*"real" + 0.010*"fun" + 0.010*"g

2022-10-05 01:23:49,851 : INFO : topic diff=0.128818, rho=0.353553
2022-10-05 01:23:49,894 : INFO : -6.740 per-word bound, 106.9 perplexity estimate based on a held-out corpus of 5 documents with 8879 words
2022-10-05 01:23:49,895 : INFO : PROGRESS: pass 7, at document #5/5
2022-10-05 01:23:49,903 : INFO : topic #0 (0.250): 0.003*"fun" + 0.002*"real" + 0.002*"group" + 0.002*"special" + 0.002*"relationship" + 0.002*"decision" + 0.002*"mom" + 0.002*"better" + 0.002*"easy" + 0.002*"nate"
2022-10-05 01:23:49,903 : INFO : topic #1 (0.250): 0.011*"fun" + 0.010*"real" + 0.010*"group" + 0.010*"relationship" + 0.009*"special" + 0.008*"better" + 0.008*"decision" + 0.007*"easy" + 0.007*"dad" + 0.007*"yes"
2022-10-05 01:23:49,904 : INFO : topic #2 (0.250): 0.011*"child" + 0.010*"blanco" + 0.010*"kid" + 0.008*"feet" + 0.008*"sex" + 0.007*"dad" + 0.006*"horse" + 0.006*"pilot" + 0.006*"dreams" + 0.005*"easy"
2022-10-05 01:23:49,905 : INFO : topic #3 (0.250): 0.002*"real" + 0.002*"special" + 0.002*"fu

[(0,
  '0.002*"fun" + 0.002*"real" + 0.002*"group" + 0.002*"special" + 0.002*"relationship" + 0.002*"decision" + 0.001*"mom" + 0.001*"better" + 0.001*"easy" + 0.001*"nate"'),
 (1,
  '0.011*"fun" + 0.010*"real" + 0.010*"group" + 0.010*"relationship" + 0.009*"special" + 0.008*"better" + 0.008*"decision" + 0.007*"easy" + 0.007*"dad" + 0.007*"yes"'),
 (2,
  '0.012*"child" + 0.010*"blanco" + 0.010*"kid" + 0.009*"feet" + 0.009*"sex" + 0.007*"dad" + 0.007*"horse" + 0.007*"pilot" + 0.006*"dreams" + 0.005*"easy"'),
 (3,
  '0.001*"real" + 0.001*"special" + 0.001*"fun" + 0.001*"group" + 0.001*"better" + 0.001*"easy" + 0.001*"dad" + 0.001*"relationship" + 0.001*"decision" + 0.001*"strong"')]

## 4.3 Identify Topics in Each Document

Out of all the topic models, attempt #8 with 2 topics makes the most sense. 
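
A note on reading the gensim logs in this section: the `per-word bound` and the `perplexity estimate` carry the same information, since the perplexity is 2 raised to the negative bound. The sketch below reproduces the logged perplexities for the attempt #8 runs from the bounds shown in the log lines. Small differences come from the bound being printed to three decimals. The "held-out corpus" in these lines appears to be the same 5 training documents, so the numbers are only a rough guide and not a real held-out evaluation.

```python
# (label, per-word bound) pairs copied from the gensim log lines in this section.
logged_bounds = [
    ("attempt #8, 3 topics", -6.744),
    ("attempt #8, 4 topics", -6.740),
    ("attempt #8, 2 topics, final run", -6.707),
]

def perplexity_from_bound(per_word_bound):
    """Perplexity is 2 ** (-per_word_bound)."""
    return 2 ** (-per_word_bound)

perplexities = {label: perplexity_from_bound(b) for label, b in logged_bounds}
for label, value in perplexities.items():
    print(f"{label}: {value:.1f}")
```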

In [174]:
#Final LDA model for now
#trying topic modeling with 2 topics
ldana = models.LdaModel(corpus=corpusn, num_topics=2, id2word=id2wordn, passes=80)
ldana.print_topics()

2022-10-05 01:29:06,394 : INFO : using symmetric alpha at 0.5
2022-10-05 01:29:06,396 : INFO : using symmetric eta at 0.5
2022-10-05 01:29:06,397 : INFO : using serial LDA version on this node
2022-10-05 01:29:06,399 : INFO : running online (multi-pass) LDA training, 2 topics, 80 passes over the supplied corpus of 5 documents, updating model once every 5 documents, evaluating perplexity every 5 documents, iterating 50x with a convergence threshold of 0.001000
2022-10-05 01:29:06,451 : INFO : -7.453 per-word bound, 175.2 perplexity estimate based on a held-out corpus of 5 documents with 8879 words
2022-10-05 01:29:06,452 : INFO : PROGRESS: pass 0, at document #5/5
2022-10-05 01:29:06,470 : INFO : topic #0 (0.500): 0.011*"fun" + 0.010*"group" + 0.009*"special" + 0.009*"real" + 0.009*"relationship" + 0.008*"better" + 0.007*"dad" + 0.007*"couple" + 0.007*"nate" + 0.007*"decision"
2022-10-05 01:29:06,475 : INFO : topic #1 (0.500): 0.010*"real" + 0.009*"relationship" + 0.009*"fun" + 0.008*"d

2022-10-05 01:29:07,111 : INFO : topic diff=0.013943, rho=0.288675
2022-10-05 01:29:07,166 : INFO : -6.707 per-word bound, 104.5 perplexity estimate based on a held-out corpus of 5 documents with 8879 words
2022-10-05 01:29:07,167 : INFO : PROGRESS: pass 11, at document #5/5
2022-10-05 01:29:07,184 : INFO : topic #0 (0.500): 0.012*"fun" + 0.010*"group" + 0.010*"relationship" + 0.009*"special" + 0.009*"better" + 0.008*"real" + 0.007*"dad" + 0.007*"easy" + 0.007*"aven" + 0.007*"mom"
2022-10-05 01:29:07,185 : INFO : topic #1 (0.500): 0.012*"real" + 0.009*"decision" + 0.009*"impression" + 0.009*"yes" + 0.008*"meatball" + 0.008*"gentlemen" + 0.007*"roses" + 0.007*"relationship" + 0.006*"group" + 0.006*"mario"
2022-10-05 01:29:07,186 : INFO : topic diff=0.010066, rho=0.277350
2022-10-05 01:29:07,221 : INFO : -6.707 per-word bound, 104.5 perplexity estimate based on a held-out corpus of 5 documents with 8879 words
2022-10-05 01:29:07,224 : INFO : PROGRESS: pass 12, at document #5/5
2022-10-05

2022-10-05 01:29:07,765 : INFO : topic #1 (0.500): 0.012*"real" + 0.009*"decision" + 0.009*"impression" + 0.009*"yes" + 0.008*"meatball" + 0.008*"gentlemen" + 0.007*"roses" + 0.006*"relationship" + 0.006*"mario" + 0.006*"group"
2022-10-05 01:29:07,765 : INFO : topic diff=0.000453, rho=0.204124
2022-10-05 01:29:07,799 : INFO : -6.707 per-word bound, 104.5 perplexity estimate based on a held-out corpus of 5 documents with 8879 words
2022-10-05 01:29:07,800 : INFO : PROGRESS: pass 23, at document #5/5
2022-10-05 01:29:07,812 : INFO : topic #0 (0.500): 0.012*"fun" + 0.010*"group" + 0.010*"relationship" + 0.009*"special" + 0.008*"better" + 0.008*"real" + 0.007*"dad" + 0.007*"easy" + 0.007*"aven" + 0.007*"mom"
2022-10-05 01:29:07,813 : INFO : topic #1 (0.500): 0.012*"real" + 0.009*"decision" + 0.009*"impression" + 0.009*"yes" + 0.008*"meatball" + 0.008*"gentlemen" + 0.007*"roses" + 0.006*"relationship" + 0.006*"mario" + 0.006*"group"
2022-10-05 01:29:07,814 : INFO : topic diff=0.000347, rho=

2022-10-05 01:29:08,351 : INFO : PROGRESS: pass 34, at document #5/5
2022-10-05 01:29:08,361 : INFO : topic #0 (0.500): 0.012*"fun" + 0.010*"group" + 0.010*"relationship" + 0.009*"special" + 0.008*"better" + 0.008*"real" + 0.007*"easy" + 0.007*"dad" + 0.007*"aven" + 0.007*"mom"
2022-10-05 01:29:08,363 : INFO : topic #1 (0.500): 0.012*"real" + 0.009*"decision" + 0.009*"impression" + 0.009*"yes" + 0.008*"gentlemen" + 0.008*"meatball" + 0.007*"roses" + 0.006*"relationship" + 0.006*"mario" + 0.006*"group"
2022-10-05 01:29:08,363 : INFO : topic diff=0.000032, rho=0.166667
2022-10-05 01:29:08,399 : INFO : -6.707 per-word bound, 104.5 perplexity estimate based on a held-out corpus of 5 documents with 8879 words
2022-10-05 01:29:08,400 : INFO : PROGRESS: pass 35, at document #5/5
2022-10-05 01:29:08,409 : INFO : topic #0 (0.500): 0.012*"fun" + 0.010*"group" + 0.010*"relationship" + 0.009*"special" + 0.008*"better" + 0.008*"real" + 0.007*"easy" + 0.007*"dad" + 0.007*"aven" + 0.007*"mom"
2022-10

2022-10-05 01:29:08,921 : INFO : topic diff=0.000008, rho=0.145865
2022-10-05 01:29:08,955 : INFO : -6.707 per-word bound, 104.5 perplexity estimate based on a held-out corpus of 5 documents with 8879 words
2022-10-05 01:29:08,955 : INFO : PROGRESS: pass 46, at document #5/5
2022-10-05 01:29:08,967 : INFO : topic #0 (0.500): 0.012*"fun" + 0.010*"group" + 0.010*"relationship" + 0.009*"special" + 0.008*"better" + 0.008*"real" + 0.007*"easy" + 0.007*"dad" + 0.007*"aven" + 0.007*"mom"
2022-10-05 01:29:08,967 : INFO : topic #1 (0.500): 0.012*"real" + 0.009*"decision" + 0.009*"impression" + 0.009*"yes" + 0.008*"gentlemen" + 0.008*"meatball" + 0.007*"roses" + 0.006*"relationship" + 0.006*"mario" + 0.006*"group"
2022-10-05 01:29:08,968 : INFO : topic diff=0.000007, rho=0.144338
2022-10-05 01:29:08,998 : INFO : -6.707 per-word bound, 104.5 perplexity estimate based on a held-out corpus of 5 documents with 8879 words
2022-10-05 01:29:08,999 : INFO : PROGRESS: pass 47, at document #5/5
2022-10-05

2022-10-05 01:29:09,504 : INFO : topic #1 (0.500): 0.012*"real" + 0.009*"decision" + 0.009*"impression" + 0.009*"yes" + 0.008*"gentlemen" + 0.008*"meatball" + 0.007*"roses" + 0.006*"relationship" + 0.006*"mario" + 0.006*"group"
2022-10-05 01:29:09,505 : INFO : topic diff=0.000013, rho=0.130189
2022-10-05 01:29:09,537 : INFO : -6.707 per-word bound, 104.5 perplexity estimate based on a held-out corpus of 5 documents with 8879 words
2022-10-05 01:29:09,537 : INFO : PROGRESS: pass 58, at document #5/5
2022-10-05 01:29:09,550 : INFO : topic #0 (0.500): 0.012*"fun" + 0.010*"group" + 0.010*"relationship" + 0.009*"special" + 0.008*"better" + 0.008*"real" + 0.007*"easy" + 0.007*"dad" + 0.007*"aven" + 0.007*"mom"
2022-10-05 01:29:09,550 : INFO : topic #1 (0.500): 0.012*"real" + 0.009*"decision" + 0.009*"impression" + 0.009*"yes" + 0.008*"gentlemen" + 0.008*"meatball" + 0.007*"roses" + 0.006*"relationship" + 0.006*"mario" + 0.006*"group"
2022-10-05 01:29:09,551 : INFO : topic diff=0.000010, rho=

2022-10-05 01:29:10,182 : INFO : PROGRESS: pass 69, at document #5/5
2022-10-05 01:29:10,197 : INFO : topic #0 (0.500): 0.012*"fun" + 0.010*"group" + 0.010*"relationship" + 0.009*"special" + 0.008*"better" + 0.008*"real" + 0.007*"easy" + 0.007*"dad" + 0.007*"aven" + 0.007*"mom"
2022-10-05 01:29:10,198 : INFO : topic #1 (0.500): 0.012*"real" + 0.009*"decision" + 0.009*"impression" + 0.009*"yes" + 0.008*"gentlemen" + 0.008*"meatball" + 0.007*"roses" + 0.006*"relationship" + 0.006*"mario" + 0.006*"group"
2022-10-05 01:29:10,199 : INFO : topic diff=0.000002, rho=0.118678
2022-10-05 01:29:10,231 : INFO : -6.707 per-word bound, 104.5 perplexity estimate based on a held-out corpus of 5 documents with 8879 words
2022-10-05 01:29:10,231 : INFO : PROGRESS: pass 70, at document #5/5
2022-10-05 01:29:10,239 : INFO : topic #0 (0.500): 0.012*"fun" + 0.010*"group" + 0.010*"relationship" + 0.009*"special" + 0.008*"better" + 0.008*"real" + 0.007*"easy" + 0.007*"dad" + 0.007*"aven" + 0.007*"mom"
2022-10

2022-10-05 01:29:10,685 : INFO : topic #1 (0.500): 0.012*"real" + 0.009*"decision" + 0.009*"impression" + 0.009*"yes" + 0.008*"gentlemen" + 0.008*"meatball" + 0.007*"roses" + 0.006*"relationship" + 0.006*"mario" + 0.006*"group"


[(0,
  '0.012*"fun" + 0.010*"group" + 0.010*"relationship" + 0.009*"special" + 0.008*"better" + 0.008*"real" + 0.007*"easy" + 0.007*"dad" + 0.007*"aven" + 0.007*"mom"'),
 (1,
  '0.012*"real" + 0.009*"decision" + 0.009*"impression" + 0.009*"yes" + 0.008*"gentlemen" + 0.008*"meatball" + 0.007*"roses" + 0.006*"relationship" + 0.006*"mario" + 0.006*"group"')]

These two topics are rough, but their top words suggest the following labels:

- Topic 0: fun, relationship ~ "dating"
- Topic 1: decision, group ~ "rose selection"

In [175]:
# Let's take a look at which topics each segment contains
corpus_transformed = ldana[corpusn]
list(zip([a for [(a,b)] in corpus_transformed], data_dtmnns.index))

[(0, 'first third'),
 (0, 'intro'),
 (1, 'outro'),
 (0, 'second third'),
 (1, 'third third')]
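
For each document, gensim returns a list of `(topic_id, probability)` pairs, and destructuring with `[(a,b)]` works only while every list holds a single pair. If a segment ever comes back with several pairs, one option is to pick the pair with the highest probability. Below is a sketch on made-up numbers; the probabilities are invented for illustration and are not model output, and `dominant_topic` is a hypothetical helper.

```python
# Hypothetical per-segment topic distributions in gensim's (topic_id, probability) format.
example_doc_topics = {
    "intro": [(0, 0.97), (1, 0.03)],
    "outro": [(1, 0.99)],
}

def dominant_topic(doc_topics):
    """Return the topic id with the highest probability."""
    return max(doc_topics, key=lambda pair: pair[1])[0]

assignments = {segment: dominant_topic(pairs) for segment, pairs in example_doc_topics.items()}
print(assignments)
```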

These assignments roughly make sense to me, so let's keep this model.

Topic 0: fun, relationship ~ "dating" [intro, 1/3, 2/3]  
Topic 1: decision, group ~ "rose selection" [3/3, outro]

Future Work:
- Lemmatize text
- Topic modeling across episodes
- Topic modeling across each episode x segment

In [6]:
pip install pyldavis

Note: you may need to restart the kernel to use updated packages.
