
# Topic Modeling

# Introduction

Another popular text analysis technique is called topic modeling. The ultimate goal of topic modeling is to find various topics that are present in your corpus. Each document in the corpus will be made up of at least one topic, if not multiple topics.

In this notebook, we walk through the steps of Latent Dirichlet Allocation (LDA), one of many topic modeling techniques and one that was designed specifically for text data.

To use a topic modeling technique, you need to provide (1) a document-term matrix and (2) the number of topics you would like the algorithm to pick up.
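
To make input (1) concrete, here is a minimal sketch of a document-term matrix built by hand with only the standard library. The three toy documents and the helper name `build_dtm` are illustrative assumptions, not part of this notebook's data; in practice a vectorizer such as scikit-learn's `CountVectorizer` produces the same kind of table.

```python
from collections import Counter

def build_dtm(docs):
    """Return (vocabulary, matrix) where matrix[i][j] counts term j in document i."""
    # Tokenize naively on whitespace; real pipelines clean the text first.
    tokenized = [doc.lower().split() for doc in docs]
    # A sorted vocabulary gives each term a stable column position.
    vocab = sorted({token for tokens in tokenized for token in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[term] for term in vocab])
    return vocab, matrix

# Hypothetical toy corpus (not the TED transcripts used in this notebook).
toy_docs = [
    "climate change climate policy",
    "brain heartbeat brain",
    "policy news",
]
vocab, dtm = build_dtm(toy_docs)
print(vocab)
for row in dtm:
    print(row)
```

Each row is a document and each column is a term, which matches the layout of the `dtm.pkl` table loaded below.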

Once the topic modeling technique is applied, your job as a human is to interpret the results and see if the mix of words in each topic makes sense. If it doesn't, you can try changing the number of topics, the terms in the document-term matrix, or the model parameters, or even try a different model.

# Topic Modeling - Attempt #1 (All Text)

In [None]:
import pandas as pd
import pickle

data = pd.read_pickle('/content/dtm.pkl')
data

Unnamed: 0,abilities,absurd,accord,adam,adolescent,adriana,ads,adulthood,advertiserbreak,advertisercan,...,worldchanging,worlds,worldted,wreck,writer,year,youre,youth,youtube,zone
Adam +Munder,1,0,1,2,0,0,3,0,0,0,...,2,2,1,0,0,0,0,0,0,0
Adriana Galván,0,0,1,0,1,2,3,1,0,0,...,2,0,1,0,0,0,0,0,0,0
Andy Jarvis,0,0,1,0,0,0,3,0,0,0,...,2,0,1,1,0,0,0,0,0,0
Angus Hervey,0,0,1,0,0,0,3,0,1,0,...,2,0,1,0,0,0,0,0,0,0
Chin-Teng Lin,0,0,1,0,0,0,3,0,0,0,...,2,0,1,0,0,0,0,0,0,0
Eugenia Kuyda,0,0,1,0,0,0,3,0,0,1,...,2,0,1,0,0,0,0,0,0,0
Huiyi Lin,0,0,1,0,0,0,3,0,0,0,...,2,0,1,0,0,0,0,0,0,0
Irena Arslanova,0,0,1,0,0,0,3,0,0,0,...,2,0,1,0,0,0,2,0,0,0
Joshua Amponsem,0,2,1,0,0,0,3,0,0,0,...,2,0,1,0,0,0,0,1,0,0
Mariana Atencio,0,0,1,0,0,0,2,0,0,0,...,2,0,1,0,0,1,0,0,1,0


In [None]:
!pip install numpy==1.24.4 scipy==1.10.1 gensim==4.3.1



In [None]:
from gensim import matutils, models
import scipy.sparse

In [None]:
tdm = data.transpose()
tdm.head()

Unnamed: 0,Adam +Munder,Adriana Galván,Andy Jarvis,Angus Hervey,Chin-Teng Lin,Eugenia Kuyda,Huiyi Lin,Irena Arslanova,Joshua Amponsem,Mariana Atencio,Mel Robbins,Simone Stolzoff
abilities,1,0,0,0,0,0,0,0,0,0,0,0
absurd,0,0,0,0,0,0,0,0,2,0,0,0
accord,1,1,1,1,1,1,1,1,1,1,1,1
adam,2,0,0,0,0,0,0,0,0,0,0,0
adolescent,0,1,0,0,0,0,0,0,0,0,0,0


In [None]:
# We're going to put the term-document matrix into a new gensim format, from df --> sparse matrix --> gensim corpus
sparse_counts = scipy.sparse.csr_matrix(tdm)
corpus = matutils.Sparse2Corpus(sparse_counts)
print(corpus)

<gensim.matutils.Sparse2Corpus object at 0x7d4d6e7eb690>


In [None]:
# Gensim also requires a dictionary of all the terms and their respective locations in the term-document matrix
cv = pickle.load(open("/content/cv_stop (1).pkl", "rb"))
id2word = dict((v, k) for k, v in cv.vocabulary_.items())

Now that we have the corpus (term-document matrix) and id2word (dictionary of location: term), we need to specify two other parameters - the number of topics and the number of passes. Let's start the number of topics at 2, see if the results make sense, and increase the number from there.
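
As a rough picture of what those two objects contain: a gensim corpus yields, for each document, a list of `(term_id, count)` pairs with zero counts omitted, and `id2word` maps each `term_id` back to its term. The sketch below builds both by hand from a tiny made-up document-term matrix. The terms, counts, and the helper name `dtm_to_bow` are assumptions for illustration, and this is not how gensim implements `Sparse2Corpus` internally.

```python
def dtm_to_bow(dtm_rows):
    """Convert dense count rows (one per document) to sparse (term_id, count) lists."""
    return [
        [(term_id, count) for term_id, count in enumerate(row) if count > 0]
        for row in dtm_rows
    ]

# Hypothetical vocabulary and counts: rows are documents, columns are terms.
terms = ["brain", "climate", "news", "poverty"]
dtm_rows = [
    [2, 0, 0, 1],
    [0, 3, 1, 0],
]

toy_corpus = dtm_to_bow(dtm_rows)
toy_id2word = dict(enumerate(terms))  # column position -> term

for doc in toy_corpus:
    print([(toy_id2word[term_id], count) for term_id, count in doc])
```

The term ids in the corpus and the keys of `id2word` must refer to the same column order; otherwise the words printed for each topic will be mislabeled.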

In [None]:
import pickle
import pandas as pd
from gensim import corpora, models

# 1. Load vectorizer and document-term matrix
cv = pickle.load(open('/content/cv_stop (1).pkl', 'rb'))
data_stop = pd.read_pickle('/content/dtm.pkl')  # Adjust path if needed

# 2. Get the feature names (i.e., terms used)
feature_names = cv.get_feature_names_out()

# 3. Reconstruct tokenized text for each document
texts = []
for _, row in data_stop.iterrows():
    doc = []
    for word, count in zip(feature_names, row):
        doc.extend([word] * int(count))
    texts.append(doc)

# 4. Create dictionary and corpus
id2word = corpora.Dictionary(texts)
corpus = [id2word.doc2bow(text) for text in texts]

# 5. Train the LDA model
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=2, passes=10)

# 6. Print topics
for idx, topic in lda.print_topics():
    print(f"Topic #{idx}: {topic}")


Topic #0: 0.043*"tedxsf" + 0.015*"majority" + 0.015*"tech" + 0.012*"adulthood" + 0.008*"personal" + 0.008*"initiativespartner" + 0.008*"december" + 0.008*"partnersrelated" + 0.008*"content" + 0.008*"fix"
Topic #1: 0.041*"tedxsf" + 0.015*"majority" + 0.013*"tech" + 0.010*"adulthood" + 0.008*"december" + 0.007*"thoughts" + 0.007*"lens" + 0.007*"fix" + 0.007*"content" + 0.007*"partnersrelated"
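
The two topics above share most of their top words, which is one sign that the model has not separated the corpus into distinct themes. A quick way to quantify this is the Jaccard overlap between the top-word sets. The sketch below parses topic strings in the format printed by `print_topics()`; the helper names `top_words` and `jaccard` are hypothetical, and the two strings are copied from the output above.

```python
import re

def top_words(topic_string):
    """Extract the quoted terms from a string like '0.043*"tedxsf" + 0.015*"tech"'."""
    return set(re.findall(r'"([^"]+)"', topic_string))

def jaccard(a, b):
    """Size of the intersection divided by size of the union (0 = disjoint, 1 = identical)."""
    return len(a & b) / len(a | b)

topic_0 = ('0.043*"tedxsf" + 0.015*"majority" + 0.015*"tech" + 0.012*"adulthood" + '
           '0.008*"personal" + 0.008*"initiativespartner" + 0.008*"december" + '
           '0.008*"partnersrelated" + 0.008*"content" + 0.008*"fix"')
topic_1 = ('0.041*"tedxsf" + 0.015*"majority" + 0.013*"tech" + 0.010*"adulthood" + '
           '0.008*"december" + 0.007*"thoughts" + 0.007*"lens" + 0.007*"fix" + '
           '0.007*"content" + 0.007*"partnersrelated"')

overlap = jaccard(top_words(topic_0), top_words(topic_1))
print(f"Jaccard overlap of top words: {overlap:.2f}")
```

An overlap close to 1 suggests the topics are near-duplicates, which is a hint to revisit the vocabulary or the number of topics.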


In [None]:
import pickle
import pandas as pd
from gensim import corpora, models

# 1. Load vectorizer and document-term matrix
cv = pickle.load(open('/content/cv_stop (1).pkl', 'rb'))
data_stop = pd.read_pickle('/content/dtm.pkl')  # Adjust path if needed

# 2. Get the feature names (i.e., terms used)
feature_names = cv.get_feature_names_out()

# 3. Reconstruct tokenized text for each document
texts = []
for _, row in data_stop.iterrows():
    doc = []
    for word, count in zip(feature_names, row):
        doc.extend([word] * int(count))
    texts.append(doc)

# 4. Create dictionary and corpus
id2word = corpora.Dictionary(texts)
corpus = [id2word.doc2bow(text) for text in texts]

# 5. Train the LDA model
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=3, passes=10)

# 6. Print topics
for idx, topic in lda.print_topics():
    print(f"Topic #{idx}: {topic}")


Topic #0: 0.048*"tedxsf" + 0.017*"majority" + 0.016*"tech" + 0.012*"adulthood" + 0.009*"december" + 0.008*"content" + 0.008*"partnersrelated" + 0.008*"fix" + 0.008*"initiativespartner" + 0.008*"thoughts"
Topic #1: 0.011*"screw" + 0.008*"mix" + 0.008*"summer" + 0.008*"embrace" + 0.008*"set" + 0.005*"youre" + 0.005*"time" + 0.005*"epidemic" + 0.005*"bad" + 0.005*"encourage"
Topic #2: 0.003*"tedxsf" + 0.002*"majority" + 0.002*"tech" + 0.002*"december" + 0.002*"adulthood" + 0.002*"initiativespartner" + 0.002*"fix" + 0.002*"lens" + 0.002*"thoughts" + 0.002*"partnersrelated"


In [None]:
import pickle
import pandas as pd
from gensim import corpora, models

# 1. Load vectorizer and document-term matrix
cv = pickle.load(open('/content/cv_stop (1).pkl', 'rb'))
data_stop = pd.read_pickle('/content/dtm.pkl')  # Adjust path if needed

# 2. Get the feature names (i.e., terms used)
feature_names = cv.get_feature_names_out()

# 3. Reconstruct tokenized text for each document
texts = []
for _, row in data_stop.iterrows():
    doc = []
    for word, count in zip(feature_names, row):
        doc.extend([word] * int(count))
    texts.append(doc)

# 4. Create dictionary and corpus
id2word = corpora.Dictionary(texts)
corpus = [id2word.doc2bow(text) for text in texts]

# 5. Train the LDA model
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=4, passes=12)

# 6. Print topics
for idx, topic in lda.print_topics():
    print(f"Topic #{idx}: {topic}")


Topic #0: 0.042*"tedxsf" + 0.016*"majority" + 0.015*"tech" + 0.012*"adulthood" + 0.008*"partnersrelated" + 0.008*"december" + 0.008*"content" + 0.008*"personal" + 0.008*"thoughts" + 0.008*"fix"
Topic #1: 0.049*"tedxsf" + 0.017*"majority" + 0.016*"tech" + 0.012*"adulthood" + 0.009*"december" + 0.009*"lens" + 0.009*"fix" + 0.009*"thoughts" + 0.009*"content" + 0.009*"initiativespartner"
Topic #2: 0.002*"tedxsf" + 0.002*"majority" + 0.002*"tech" + 0.002*"adulthood" + 0.002*"fix" + 0.002*"initiativespartner" + 0.002*"december" + 0.002*"thoughts" + 0.002*"content" + 0.002*"lens"
Topic #3: 0.016*"tedxsf" + 0.015*"attend" + 0.011*"english" + 0.011*"globeaboutour" + 0.011*"screen" + 0.010*"menu" + 0.008*"mindsted" + 0.008*"carbon" + 0.008*"research" + 0.008*"adriana"


In [None]:
import pickle
import pandas as pd
from gensim import corpora, models

# 1. Load vectorizer and document-term matrix
cv = pickle.load(open('/content/cv_stop (1).pkl', 'rb'))
data_stop = pd.read_pickle('/content/dtm.pkl')  # Adjust path if needed

# 2. Get the feature names (i.e., terms used)
feature_names = cv.get_feature_names_out()

# 3. Reconstruct tokenized text for each document
texts = []
for _, row in data_stop.iterrows():
    doc = []
    for word, count in zip(feature_names, row):
        doc.extend([word] * int(count))
    texts.append(doc)

# 4. Create dictionary and corpus
id2word = corpora.Dictionary(texts)
corpus = [id2word.doc2bow(text) for text in texts]

# 5. Train the LDA model
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=5, passes=20)

# 6. Print topics
for idx, topic in lda.print_topics():
    print(f"Topic #{idx}: {topic}")


Topic #0: 0.002*"tedxsf" + 0.002*"majority" + 0.002*"tech" + 0.002*"partnersrelated" + 0.002*"adulthood" + 0.002*"december" + 0.002*"initiativespartner" + 0.002*"lens" + 0.002*"content" + 0.002*"personal"
Topic #1: 0.002*"tedxsf" + 0.002*"tech" + 0.002*"adulthood" + 0.002*"majority" + 0.002*"december" + 0.002*"lens" + 0.002*"fix" + 0.002*"initiativespartner" + 0.002*"thoughts" + 0.002*"content"
Topic #2: 0.050*"tedxsf" + 0.018*"majority" + 0.017*"tech" + 0.012*"adulthood" + 0.009*"december" + 0.009*"partnersrelated" + 0.009*"fix" + 0.009*"content" + 0.009*"lens" + 0.009*"thoughts"
Topic #3: 0.028*"closer" + 0.014*"analytical" + 0.014*"screen" + 0.014*"globeaboutour" + 0.010*"tedxsf" + 0.010*"changeenvironmentglobal" + 0.010*"kuyda" + 0.010*"absurd" + 0.010*"inmembershiptype" + 0.010*"talk"
Topic #4: 0.002*"tedxsf" + 0.002*"majority" + 0.002*"adulthood" + 0.002*"tech" + 0.002*"content" + 0.002*"lens" + 0.002*"december" + 0.002*"personal" + 0.002*"thoughts" + 0.002*"fix"


# Topic Modeling - Attempt #2 (Nouns Only)


In [None]:
# Let's create a function to pull out nouns from a string of text
from nltk import word_tokenize, pos_tag

def nouns(text):
    '''Given a string of text, tokenize the text and pull out only the nouns.'''
    is_noun = lambda pos: pos[:2] == 'NN'
    tokenized = word_tokenize(text)
    all_nouns = [word for (word, pos) in pos_tag(tokenized) if is_noun(pos)]
    return ' '.join(all_nouns)
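
The `pos[:2] == 'NN'` test works because the Penn Treebank noun tags all begin with `NN` (`NN`, `NNS`, `NNP`, `NNPS`). The sketch below applies the same filter to a hand-tagged sentence so it runs without downloading any NLTK data; the tagged tokens and the helper name `keep_nouns` are made up for illustration.

```python
def keep_nouns(tagged_tokens):
    """Keep words whose Penn Treebank tag starts with 'NN' (common and proper nouns)."""
    return [word for word, pos in tagged_tokens if pos[:2] == "NN"]

# Hypothetical tagger output for "researchers in London study brains quickly".
tagged = [
    ("researchers", "NNS"),
    ("in", "IN"),
    ("London", "NNP"),
    ("study", "VBP"),
    ("brains", "NNS"),
    ("quickly", "RB"),
]
print(keep_nouns(tagged))
```

Note that this prefix test keeps proper nouns (`NNP`), whereas a filter on spaCy's `pos_ == "NOUN"` does not, because spaCy tags proper nouns as `PROPN`.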

In [None]:
# Read in the cleaned data, before the CountVectorizer step
data_clean = pd.read_pickle('data_clean.pkl')
data_clean

Unnamed: 0,transcript
Adam +Munder,adam munder ai bridge deaf hear worlds skip ma...
Adriana Galván,adriana galván reason take risk like teenager ...
Andy Jarvis,andy jarvis fee billion people without destroy...
Angus Hervey,angus hervey break bad news bubble part skip m...
Chin-Teng Lin,chinteng lin mindreading potential ai skip mai...
Eugenia Kuyda,eugenia kuyda ai companion help heal lonelines...
Huiyi Lin,huiyi lin poverty look like plate skip main co...
Irena Arslanova,irena arslanova heartbeat shape sense time ski...
Joshua Amponsem,joshua amponsem absurd inequality climate work...
Mariana Atencio,mariana atencio make special skip main content...


In [None]:
!pip install -U spacy
!python -m spacy download en_core_web_sm


Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [None]:
import spacy
import pandas as pd

# Load spaCy English model
nlp = spacy.load("en_core_web_sm")

# Function to extract nouns
def get_nouns(text):
    doc = nlp(text)
    return [token.text for token in doc if token.pos_ == "NOUN"]



# Apply noun extraction
data_nouns = pd.DataFrame(data_clean["transcript"].apply(get_nouns))
data_nouns.head()


Unnamed: 0,transcript
Adam +Munder,"[deaf, worlds, contentskip, searchideas, libra..."
Adriana Galván,"[risk, teenager, contentskip, searchideas, lib..."
Andy Jarvis,"[people, nature, contentskip, searchideas, lib..."
Angus Hervey,"[hervey, news, bubble, part, contentskip, sear..."
Chin-Teng Lin,"[potential, contentskip, searchideas, library,..."


In [None]:
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# If you haven't already defined this:
add_stop_words = ['like', 'im', 'know', 'just', 'dont', 'thats', 'right', 'people',
                  'youre', 'got', 'gonna', 'time', 'think', 'yeah', 'said']
stop_words = list(text.ENGLISH_STOP_WORDS.union(add_stop_words))  # ✅ convert to list

# Let's make sure data_nouns is a proper column
data_nouns.columns = ['nouns']  # ✅ rename the column

# Join list of nouns into string for each row
data_nouns['nouns_str'] = data_nouns['nouns'].apply(lambda x: ' '.join(x))

# Now build the document-term matrix
cvn = CountVectorizer(stop_words=stop_words)
data_cvn = cvn.fit_transform(data_nouns['nouns_str'])

# Build a DataFrame from the matrix
data_dtmn = pd.DataFrame(data_cvn.toarray(), columns=cvn.get_feature_names_out())
data_dtmn.index = data_nouns.index

# Show it
data_dtmn.head()


Unnamed: 0,abilities,accord,ads,adulthood,advertiserbreak,advertiserdoes,advertiserwhat,advocate,alarm,angus,...,ways,work,world,worldchanging,worlds,wreck,writer,year,youth,youtube
Adam +Munder,1,1,3,0,0,0,0,0,0,0,...,1,0,1,1,2,0,0,0,0,0
Adriana Galván,0,1,3,1,0,0,0,0,0,0,...,1,0,1,1,0,0,0,0,0,0
Andy Jarvis,0,1,3,0,0,0,0,0,0,0,...,1,0,1,1,0,1,0,0,0,0
Angus Hervey,0,1,3,0,1,0,0,0,0,1,...,1,0,1,1,0,0,0,0,0,0
Chin-Teng Lin,0,1,3,0,0,0,0,0,0,0,...,1,0,1,1,0,0,0,0,0,0


In [None]:
# Create the gensim corpus
corpusn = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(data_dtmn.transpose()))

# Create the vocabulary dictionary
id2wordn = dict((v, k) for k, v in cvn.vocabulary_.items())

In [None]:
# Let's start with 2 topics
ldan = models.LdaModel(corpus=corpusn, num_topics=2, id2word=id2wordn, passes=10)
ldan.print_topics()

[(0,
  '0.037*"support" + 0.027*"ads" + 0.019*"conferences" + 0.019*"tedx" + 0.016*"partner" + 0.011*"talk" + 0.011*"lessons" + 0.011*"innovators" + 0.011*"share" + 0.011*"ways"'),
 (1,
  '0.020*"poverty" + 0.019*"support" + 0.016*"life" + 0.014*"ads" + 0.012*"tedx" + 0.012*"conferences" + 0.011*"robbins" + 0.010*"line" + 0.010*"talk" + 0.010*"world"')]

In [None]:
# Let's try topics = 3
ldan = models.LdaModel(corpus=corpusn, num_topics=3, id2word=id2wordn, passes=10)
ldan.print_topics()

[(0,
  '0.005*"support" + 0.004*"ads" + 0.004*"partner" + 0.004*"tedx" + 0.004*"conferences" + 0.004*"tededtedx" + 0.004*"store" + 0.004*"eventsdiscovertopicsexplore" + 0.004*"world" + 0.004*"lessons"'),
 (1,
  '0.037*"support" + 0.028*"ads" + 0.020*"tedx" + 0.020*"conferences" + 0.016*"partner" + 0.012*"talk" + 0.012*"lessons" + 0.011*"ways" + 0.011*"innovators" + 0.011*"challenge"'),
 (2,
  '0.024*"news" + 0.018*"hervey" + 0.016*"support" + 0.013*"update" + 0.013*"bubble" + 0.013*"changeenvironmentglobal" + 0.013*"issuesdiseasehealthpublic" + 0.013*"progress" + 0.013*"conversation" + 0.013*"deaf"')]

In [None]:
# Let's try 3 topics again, this time with 20 passes
ldan = models.LdaModel(corpus=corpusn, num_topics=3, id2word=id2wordn, passes=20)
ldan.print_topics()

[(0,
  '0.036*"support" + 0.026*"ads" + 0.019*"conferences" + 0.019*"tedx" + 0.016*"partner" + 0.012*"lessons" + 0.012*"talk" + 0.011*"challenge" + 0.011*"ways" + 0.011*"world"'),
 (1,
  '0.031*"support" + 0.024*"ads" + 0.016*"tedx" + 0.016*"conferences" + 0.014*"partner" + 0.014*"climate" + 0.014*"poverty" + 0.011*"work" + 0.011*"innovators" + 0.011*"share"'),
 (2,
  '0.004*"ads" + 0.004*"support" + 0.004*"conferences" + 0.004*"partner" + 0.004*"tedx" + 0.004*"talk" + 0.004*"collectiveideas" + 0.004*"highlight" + 0.004*"coverage" + 0.004*"futureprograms"')]

In [None]:
# Let's try 4 topics
ldan = models.LdaModel(corpus=corpusn, num_topics=4, id2word=id2wordn, passes=10)
ldan.print_topics()

[(0,
  '0.039*"news" + 0.030*"hervey" + 0.021*"update" + 0.021*"bubble" + 0.021*"issuesdiseasehealthpublic" + 0.021*"progress" + 0.021*"changeenvironmentglobal" + 0.011*"conversation" + 0.011*"sanctuaries" + 0.011*"stories"'),
 (1,
  '0.038*"support" + 0.029*"ads" + 0.020*"conferences" + 0.020*"tedx" + 0.017*"partner" + 0.012*"lessons" + 0.012*"talk" + 0.011*"challenge" + 0.011*"ways" + 0.011*"innovators"'),
 (2,
  '0.029*"support" + 0.021*"ads" + 0.017*"tedx" + 0.017*"life" + 0.017*"conferences" + 0.014*"partner" + 0.012*"world" + 0.012*"talk" + 0.010*"robbins" + 0.009*"highlight"'),
 (3,
  '0.005*"support" + 0.004*"tedx" + 0.004*"conferences" + 0.004*"ads" + 0.004*"term" + 0.004*"partner" + 0.004*"lessons" + 0.004*"talk" + 0.004*"video" + 0.004*"events"')]

In [None]:
# Let's try 5 topics and 20 passes
ldan = models.LdaModel(corpus=corpusn, num_topics=5, id2word=id2wordn, passes=20)
ldan.print_topics()

[(0,
  '0.038*"support" + 0.028*"ads" + 0.021*"conferences" + 0.021*"tedx" + 0.016*"partner" + 0.013*"lessons" + 0.013*"talk" + 0.012*"challenge" + 0.012*"innovators" + 0.012*"ways"'),
 (1,
  '0.058*"life" + 0.036*"robbins" + 0.025*"effort" + 0.024*"reclaim" + 0.024*"simone" + 0.013*"reward" + 0.013*"alarm" + 0.013*"button" + 0.013*"snooze" + 0.013*"change"'),
 (2,
  '0.004*"support" + 0.004*"ads" + 0.004*"conferences" + 0.004*"poverty" + 0.004*"partner" + 0.004*"series" + 0.004*"talk" + 0.004*"innovators" + 0.004*"content" + 0.004*"tededtedx"'),
 (3,
  '0.035*"support" + 0.026*"ads" + 0.018*"conferences" + 0.018*"partner" + 0.018*"tedx" + 0.013*"poverty" + 0.011*"share" + 0.011*"talk" + 0.009*"futureprograms" + 0.009*"coverage"'),
 (4,
  '0.004*"support" + 0.004*"ads" + 0.004*"poverty" + 0.004*"tedx" + 0.004*"partner" + 0.004*"conferences" + 0.004*"lessons" + 0.004*"inmembershiptype" + 0.004*"privacy" + 0.004*"tededtedx"')]

# Topic Modeling - Attempt #3 (Nouns and Adjectives)

In [None]:
# Let's create a function to pull out nouns and adjectives from a string of text
def nouns_adj(text):
    '''Given a string of text, tokenize the text and pull out only the nouns and adjectives.'''
    is_noun_adj = lambda pos: pos[:2] == 'NN' or pos[:2] == 'JJ'
    tokenized = word_tokenize(text)
    nouns_adj = [word for (word, pos) in pos_tag(tokenized) if is_noun_adj(pos)]
    return ' '.join(nouns_adj)


In [None]:
import spacy
nlp = spacy.load("en_core_web_sm")

def nouns_adj(text):
    doc = nlp(text)
    return [token.text for token in doc if token.pos_ in ["NOUN", "ADJ"]]


In [None]:
data_nouns_adj = pd.DataFrame(data_clean.transcript.apply(nouns_adj))
data_nouns_adj.columns = ['nouns_adj']  # Rename column for clarity
data_nouns_adj.head()

Unnamed: 0,nouns_adj
Adam +Munder,"[deaf, worlds, main, contentskip, searchideas,..."
Adriana Galván,"[risk, teenager, main, contentskip, searchidea..."
Andy Jarvis,"[people, nature, main, contentskip, searchidea..."
Angus Hervey,"[hervey, bad, news, bubble, part, main, conten..."
Chin-Teng Lin,"[potential, skip, main, contentskip, searchide..."


In [None]:
data_nouns_adj['joined'] = data_nouns_adj['nouns_adj'].apply(lambda x: ' '.join(x))


In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# If not done already: convert list of tokens into a string
data_nouns_adj['joined'] = data_nouns_adj['nouns_adj'].apply(lambda x: ' '.join(x))

# Set up the vectorizer with stop words and max_df
cvna = CountVectorizer(stop_words=stop_words, max_df=0.8)

# Apply it on the correct column
data_cvna = cvna.fit_transform(data_nouns_adj['joined'])

# Create the document-term matrix
data_dtmna = pd.DataFrame(data_cvna.toarray(), columns=cvna.get_feature_names_out())
data_dtmna.index = data_nouns_adj.index

# Show result
data_dtmna.head()

Unnamed: 0,abilities,absurd,adolescent,adulthood,advertiserbreak,advertiserdoes,advertiserwhat,advocate,african,alarm,...,viewssimone,warm,way,work,worlds,wreck,writer,year,youth,youtube
Adam +Munder,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,2,0,0,0,0,0
Adriana Galván,0,0,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Andy Jarvis,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
Angus Hervey,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Chin-Teng Lin,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
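
The `max_df=0.8` setting drops any term that appears in more than 80% of the documents, which appears to be what removes site-wide boilerplate such as `support` and `ads` that dominated the nouns-only topics. A minimal standard-library sketch of that document-frequency filter follows. The toy documents and the helper name `filter_by_max_df` are assumptions, and the sketch only mirrors the float form of the parameter.

```python
def filter_by_max_df(tokenized_docs, max_df=0.8):
    """Return the set of terms whose document frequency is at most max_df * n_docs."""
    n_docs = len(tokenized_docs)
    doc_freq = {}
    for tokens in tokenized_docs:
        for term in set(tokens):  # count each term once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term for term, df in doc_freq.items() if df <= max_df * n_docs}

# Hypothetical corpus: "support" and "ads" occur in every document.
toy_tokenized = [
    ["support", "ads", "climate", "work"],
    ["support", "ads", "poverty", "food"],
    ["support", "ads", "news", "progress"],
    ["support", "ads", "brain", "heartbeat"],
    ["support", "ads", "climate", "news"],
]
kept = filter_by_max_df(toy_tokenized, max_df=0.8)
print(sorted(kept))
```

With only 12 transcripts in this notebook, a threshold of 0.8 removes terms found in 10 or more of them, so it is a fairly blunt filter; lowering it would remove more shared vocabulary.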


In [None]:
# Create the gensim corpus
corpusna = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(data_dtmna.transpose()))

# Create the vocabulary dictionary
id2wordna = dict((v, k) for k, v in cvna.vocabulary_.items())

In [None]:
# Let's start with 2 topics
ldana = models.LdaModel(corpus=corpusna, num_topics=2, id2word=id2wordna, passes=10)
ldana.print_topics()

[(0,
  '0.025*"climate" + 0.021*"work" + 0.014*"life" + 0.014*"risk" + 0.010*"topicsclimate" + 0.010*"potential" + 0.010*"simone" + 0.010*"reclaim" + 0.010*"explore" + 0.010*"teenager"'),
 (1,
  '0.019*"poverty" + 0.013*"news" + 0.010*"heartbeat" + 0.010*"shape" + 0.010*"kuyda" + 0.010*"companion" + 0.010*"hervey" + 0.010*"deaf" + 0.010*"robbins" + 0.007*"skip"')]

In [None]:
# Let's try 3 topics
ldana = models.LdaModel(corpus=corpusna, num_topics=3, id2word=id2wordna, passes=10)
ldana.print_topics()

[(0,
  '0.023*"news" + 0.018*"heartbeat" + 0.018*"shape" + 0.018*"hervey" + 0.018*"robbins" + 0.013*"sense" + 0.013*"irena" + 0.013*"progress" + 0.013*"changeenvironmentglobal" + 0.013*"brain"'),
 (1,
  '0.038*"climate" + 0.020*"work" + 0.020*"risk" + 0.014*"potential" + 0.014*"strategic" + 0.014*"teenager" + 0.014*"key" + 0.014*"inequality" + 0.014*"changeenvironmentafricaactivismleadershipyouthcountdowninternational" + 0.008*"communication"'),
 (2,
  '0.026*"poverty" + 0.014*"deaf" + 0.014*"food" + 0.014*"kuyda" + 0.014*"life" + 0.014*"line" + 0.014*"companion" + 0.010*"simone" + 0.010*"reclaim" + 0.010*"real"')]

In [None]:
# Let's try 4 topics
ldana = models.LdaModel(corpus=corpusna, num_topics=4, id2word=id2wordna, passes=10)
ldana.print_topics()

[(0,
  '0.037*"poverty" + 0.019*"risk" + 0.019*"food" + 0.013*"teenager" + 0.013*"way" + 0.013*"policy" + 0.013*"issuessocial" + 0.013*"line" + 0.007*"brain" + 0.007*"neuroscientist"'),
 (1,
  '0.034*"climate" + 0.018*"work" + 0.018*"robbins" + 0.018*"shape" + 0.018*"heartbeat" + 0.018*"deaf" + 0.012*"inequality" + 0.012*"changeenvironmentafricaactivismleadershipyouthcountdowninternational" + 0.012*"life" + 0.012*"effort"'),
 (2,
  '0.024*"news" + 0.018*"life" + 0.018*"hervey" + 0.018*"potential" + 0.018*"companion" + 0.018*"kuyda" + 0.013*"changeenvironmentglobal" + 0.013*"issuesdiseasehealthpublic" + 0.013*"bad" + 0.013*"progress"'),
 (3,
  '0.004*"deaf" + 0.004*"worlds" + 0.004*"real" + 0.004*"demo" + 0.004*"live" + 0.004*"participants" + 0.004*"engineer" + 0.004*"american" + 0.004*"language" + 0.004*"sign"')]

In [None]:
# Let's try 4 topics and 20 passes
ldana = models.LdaModel(corpus=corpusna, num_topics=4, id2word=id2wordna, passes=20)
ldana.print_topics()

[(0,
  '0.029*"climate" + 0.029*"poverty" + 0.015*"work" + 0.015*"companion" + 0.015*"kuyda" + 0.015*"heartbeat" + 0.015*"shape" + 0.015*"food" + 0.010*"sense" + 0.010*"line"'),
 (1,
  '0.027*"robbins" + 0.027*"deaf" + 0.019*"life" + 0.019*"effort" + 0.019*"worlds" + 0.010*"real" + 0.010*"searchloading" + 0.010*"youtube" + 0.010*"brain" + 0.010*"demo"'),
 (2,
  '0.037*"risk" + 0.025*"teenager" + 0.014*"brain" + 0.014*"neuroscientist" + 0.014*"explore" + 0.014*"key" + 0.014*"strategic" + 0.014*"bold" + 0.014*"reason" + 0.014*"inner"'),
 (3,
  '0.025*"news" + 0.019*"hervey" + 0.019*"life" + 0.013*"skip" + 0.013*"potential" + 0.013*"bubble" + 0.013*"simone" + 0.013*"update" + 0.013*"reclaim" + 0.013*"changeenvironmentglobal"')]

# Identify Topics in Each Document

Of the 13 topic models we looked at, the 4-topic model built on nouns and adjectives made the most sense. So let's pull that down here and run it for more passes to get more fine-tuned topics.

In [None]:
# Our final LDA model (for now)
ldana = models.LdaModel(corpus=corpusna, num_topics=4, id2word=id2wordna, passes=80)
ldana.print_topics()

[(0,
  '0.027*"robbins" + 0.027*"shape" + 0.027*"heartbeat" + 0.019*"sense" + 0.019*"brain" + 0.019*"life" + 0.019*"effort" + 0.019*"irena" + 0.010*"searchloading" + 0.010*"youtube"'),
 (1,
  '0.031*"climate" + 0.016*"work" + 0.016*"companion" + 0.016*"kuyda" + 0.016*"skip" + 0.016*"potential" + 0.011*"inequality" + 0.011*"changeenvironmentafricaactivismleadershipyouthcountdowninternational" + 0.011*"loneliness" + 0.011*"healthsocietyartificial"'),
 (2,
  '0.052*"poverty" + 0.027*"risk" + 0.019*"line" + 0.019*"food" + 0.019*"policy" + 0.019*"issuessocial" + 0.019*"teenager" + 0.010*"way" + 0.010*"researcher" + 0.010*"problem"'),
 (3,
  '0.028*"news" + 0.022*"hervey" + 0.022*"deaf" + 0.022*"life" + 0.015*"issuesdiseasehealthpublic" + 0.015*"progress" + 0.015*"bubble" + 0.015*"changeenvironmentglobal" + 0.015*"bad" + 0.015*"update"')]

These four topics look pretty decent. Let's settle on these for now.

- Topic 0: shape, heartbeat
- Topic 1: climate, work
- Topic 2: poverty, risk
- Topic 3: news, life

In [None]:
# Let's take a look at which topics each transcript contains
corpus_transformed = ldana[corpusna]
# Get the dominant topic for each document
dominant_topics = [max(doc, key=lambda x: x[1])[0] for doc in corpus_transformed]

# Pair with document index
list(zip(dominant_topics, data_dtmna.index))

[(3, 'Adam +Munder'),
 (2, 'Adriana Galván'),
 (1, 'Andy Jarvis'),
 (3, 'Angus Hervey'),
 (1, 'Chin-Teng Lin'),
 (1, 'Eugenia Kuyda'),
 (2, 'Huiyi Lin'),
 (0, 'Irena Arslanova'),
 (1, 'Joshua Amponsem'),
 (1, 'Mariana Atencio'),
 (0, 'Mel Robbins'),
 (3, 'Simone Stolzoff')]
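
The list comprehension above relies on the fact that a trained gensim LDA model returns, for each document, a list of `(topic_id, probability)` pairs. The sketch below reproduces the dominant-topic step on made-up distributions and then groups documents by topic. The probabilities, document names, and the helper name `dominant_topic` are hypothetical.

```python
from collections import defaultdict

def dominant_topic(doc_topics):
    """Return the topic id with the highest probability for one document."""
    return max(doc_topics, key=lambda pair: pair[1])[0]

# Hypothetical per-document topic distributions, shaped like gensim's output.
toy_transformed = {
    "doc_a": [(0, 0.05), (1, 0.90), (2, 0.05)],
    "doc_b": [(0, 0.70), (2, 0.30)],  # low-probability topics may be omitted
    "doc_c": [(1, 0.55), (2, 0.45)],
}

docs_by_topic = defaultdict(list)
for name, doc_topics in toy_transformed.items():
    docs_by_topic[dominant_topic(doc_topics)].append(name)

print(dict(docs_by_topic))
```

A document such as `doc_c` is close to an even split, so reporting only the dominant topic hides that mix. Inspecting the full distribution is worthwhile when the probabilities are close.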

# Additional Exercises

1. Try further modifying the parameters of the topic models above and see if you can get better topics.
2. Create a new topic model that includes terms from a different part of speech and see if you can get better topics.

# 1. Fine-tune your LDA model parameters

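Adjust key LDA parameters such as:

1. num_topics: try 3, 5, 7, etc.
2. passes: more passes over the corpus can help refine topics, at the cost of training time.
3. alpha and eta: priors that control how sparse the document-topic (alpha) and topic-word (eta) distributions are.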
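
To build intuition for what `alpha` does, the sketch below draws topic mixtures from a symmetric Dirichlet distribution with a small and a large concentration value, using only the standard library. Small values tend to give mixtures dominated by one topic; large values give mixtures spread more evenly. The helper names, the values 0.1 and 10.0, and the sample size are illustrative assumptions, and this is not gensim's training code.

```python
import random

def sample_dirichlet(alpha, k, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) over k topics."""
    # A Dirichlet sample is a vector of Gamma(alpha, 1) draws normalized to sum to 1.
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def mean_largest_share(alpha, k=5, n_samples=500, seed=0):
    """Average weight of the single largest topic across many sampled mixtures."""
    rng = random.Random(seed)
    return sum(max(sample_dirichlet(alpha, k, rng)) for _ in range(n_samples)) / n_samples

sparse = mean_largest_share(alpha=0.1)   # small alpha: one topic tends to dominate
dense = mean_largest_share(alpha=10.0)   # large alpha: weight is spread out
print(f"alpha=0.1  -> mean largest share {sparse:.2f}")
print(f"alpha=10.0 -> mean largest share {dense:.2f}")
```

The same reasoning applies to `eta`, with words in a topic taking the place of topics in a document.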

In [None]:
from gensim import models

# Try with 5 topics and more passes for deeper learning
ldana_tuned = models.LdaModel(
    corpus=corpusna,
    id2word=id2wordna,
    num_topics=5,
    passes=20,
    alpha='auto',
    eta='auto',
    random_state=42
)

# Print out the topics
topics_tuned = ldana_tuned.print_topics(num_words=10)
for topic in topics_tuned:
    print(topic)


(0, '0.033*"deaf" + 0.023*"worlds" + 0.013*"skip" + 0.013*"text" + 0.013*"partnership" + 0.013*"live" + 0.013*"barriers" + 0.013*"participants" + 0.013*"demo" + 0.013*"english"')
(1, '0.028*"life" + 0.028*"risk" + 0.019*"work" + 0.019*"explore" + 0.019*"teenager" + 0.019*"simone" + 0.019*"reclaim" + 0.019*"potential" + 0.010*"strategic" + 0.010*"key"')
(2, '0.039*"robbins" + 0.027*"life" + 0.027*"effort" + 0.015*"change" + 0.015*"bed" + 0.015*"psychologyhappinesssuccessmotivationpersonal" + 0.015*"autopilot" + 0.015*"reward" + 0.015*"happier" + 0.015*"usual"')
(3, '0.041*"poverty" + 0.028*"news" + 0.021*"heartbeat" + 0.021*"shape" + 0.021*"hervey" + 0.015*"food" + 0.015*"sense" + 0.015*"irena" + 0.015*"issuessocial" + 0.015*"line"')
(4, '0.042*"climate" + 0.022*"kuyda" + 0.022*"companion" + 0.022*"work" + 0.015*"topicsclimate" + 0.015*"help" + 0.015*"healthsocietyartificial" + 0.015*"loneliness" + 0.015*"changeenvironmentafricaactivismleadershipyouthcountdowninternational" + 0.015*"ine

In [None]:
!pip install pyLDAvis


Collecting pyLDAvis
  Downloading pyLDAvis-3.4.1-py3-none-any.whl.metadata (4.2 kB)
Collecting funcy (from pyLDAvis)
  Downloading funcy-2.0-py2.py3-none-any.whl.metadata (5.9 kB)
Downloading pyLDAvis-3.4.1-py3-none-any.whl (2.6 MB)
Downloading funcy-2.0-py2.py3-none-any.whl (30 kB)
Installing collected packages: funcy, pyLDAvis
Successfully installed funcy-2.0 pyLDAvis-3.4.1


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim import corpora, models
import pyLDAvis.gensim_models
import pyLDAvis


In [None]:
import gensim


In [None]:
from gensim import matutils


In [None]:
# Convert cleaned transcripts to a list
docs = data_clean['transcript'].tolist()

# Initialize TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_df=0.8)
X_tfidf = tfidf_vectorizer.fit_transform(docs)

# Convert to gensim corpus format
corpus_tfidf = gensim.matutils.Sparse2Corpus(X_tfidf, documents_columns=False)

# Create dictionary
id2word_tfidf = dict((v, k) for k, v in tfidf_vectorizer.vocabulary_.items())
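
For reference, scikit-learn's `TfidfVectorizer` with default settings weights each count by a smoothed inverse document frequency, idf(t) = ln((1 + n) / (1 + df(t))) + 1, and then scales each document vector to unit L2 length. The sketch below reimplements that weighting in plain Python on a toy corpus; the helper name `tfidf_rows` and the documents are hypothetical. Because LDA is formulated over word counts, feeding it TF-IDF weights is a heuristic variation rather than the standard setup, so the resulting topics should be read with that in mind.

```python
import math
from collections import Counter

def tfidf_rows(tokenized_docs):
    """Return (vocab, rows) with smoothed-idf, L2-normalized TF-IDF weights."""
    n_docs = len(tokenized_docs)
    vocab = sorted({t for tokens in tokenized_docs for t in tokens})
    # Document frequency: number of documents containing each term.
    df = Counter(t for tokens in tokenized_docs for t in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for tokens in tokenized_docs:
        counts = Counter(tokens)
        weights = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in weights))
        rows.append([w / norm if norm else 0.0 for w in weights])
    return vocab, rows

# Hypothetical corpus: "talk" is in every document, the other terms are rarer.
toy = [["talk", "climate"], ["talk", "poverty"], ["talk", "brain"]]
vocab_tfidf, rows = tfidf_rows(toy)
for row in rows:
    print([round(w, 3) for w in row])
```

Within each document the rarer term receives the larger weight, which is the effect TF-IDF is meant to have on ubiquitous words.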


In [None]:
lda_tfidf = models.LdaModel(
    corpus=corpus_tfidf,
    id2word=id2word_tfidf,
    num_topics=5,
    passes=10,
    random_state=42
)


In [None]:
tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()


In [None]:
from gensim import corpora

id2word_tfidf = corpora.Dictionary()
id2word_tfidf.token2id = {term: idx for idx, term in enumerate(tfidf_feature_names)}
id2word_tfidf.id2token = {idx: term for term, idx in id2word_tfidf.token2id.items()}


In [None]:
import pyLDAvis.gensim_models
pyLDAvis.enable_notebook()
pyLDAvis.gensim_models.prepare(lda_tfidf, corpus_tfidf, id2word_tfidf)


# 🔵 Left Panel: Intertopic Distance Map (via MDS)
This shows how the different topics relate to each other spatially using multidimensional scaling (MDS).

Each circle represents a topic:

- The size of the circle is the topic's prevalence (how dominant that topic is across your corpus).
- The distance between circles reflects how different the topics are from each other. More distance means more distinct topics.

📌 Observation:
Topics 4 and 5 are overlapping significantly, suggesting they might be closely related or redundant. Topics 1, 2, and 3 are more distinct.
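
By default pyLDAvis measures the distance between two topics with the Jensen-Shannon divergence of their word distributions before projecting the topics to two dimensions. The sketch below computes that divergence for small made-up distributions; the helper names and the numbers are assumptions for illustration, not values from this model.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, 0 for identical inputs, at most ln 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical topic-word distributions over a four-word vocabulary.
topic_a = [0.70, 0.20, 0.05, 0.05]
topic_b = [0.65, 0.25, 0.05, 0.05]  # nearly the same as topic_a
topic_c = [0.05, 0.05, 0.20, 0.70]  # emphasizes different words

print(f"a vs b: {js_divergence(topic_a, topic_b):.4f}")
print(f"a vs c: {js_divergence(topic_a, topic_c):.4f}")
```

Topics whose circles overlap on the map tend to be those with a small divergence, like `topic_a` and `topic_b` here, although the two-dimensional projection can distort distances.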



# 📊 Right Panel: Top 30 Most Salient Terms
This shows the most important terms across the whole corpus, not limited to any single topic.

Terms like poverty, heartbeat, billion, news, and people are prominent in your documents.

📌 Observation:
The model has picked up on some meaningful and specific terms (e.g. arslanova, heartbeat, neuroscience, poverty), which could suggest topics related to:

- Health or neuroscience
- Economic or social issues
- Media/news coverage

# 👀 Notable Terms
Words like:

- topicshealthtimebrainneuroscienceonehuman
- bodyheartexploredtexted

appear to be several words run together, which points to a tokenization or scraping issue. You might want to:

- Re-check your preprocessing (for example, how tokens were split and whether stemming or lemmatization was applied correctly).
- Refine token cleaning or remove these compound artifacts.
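
One quick, heuristic way to find such artifacts is to flag unusually long tokens in the vocabulary. The sketch below does this with a length threshold. The threshold of 20 characters and the helper name `flag_long_tokens` are assumptions to tune, and the sample tokens are taken from the topic output earlier in this notebook.

```python
def flag_long_tokens(vocabulary, max_len=20):
    """Return tokens longer than max_len characters, longest first."""
    return sorted((t for t in vocabulary if len(t) > max_len), key=len, reverse=True)

# Sample tokens copied from topics printed earlier in the notebook.
sample_vocab = [
    "poverty",
    "heartbeat",
    "changeenvironmentglobal",
    "issuesdiseasehealthpublic",
    "changeenvironmentafricaactivismleadershipyouthcountdowninternational",
    "initiativespartner",
]
suspicious = flag_long_tokens(sample_vocab)
print(suspicious)
```

Flagged tokens could be added to the stop-word list passed to the vectorizer, or fixed upstream by restoring the whitespace between the page elements that were joined during scraping. Shorter run-together tokens such as `initiativespartner` slip under the threshold, so a length check alone will not catch everything.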

# ✅ Conclusion Summary:
- The model was fit with 5 topics (num_topics=5).
- Some topics overlap, so you might reduce the number of topics or improve token filtering.
- Meaningful terms (poverty, heartbeat, brain, human, perception) suggest your model is picking up interpretable themes.
- There may be a text preprocessing issue (check those long, unbroken words).