# Topic Modeling

## Introduction

Another popular text analysis technique is called topic modeling. The ultimate goal of topic modeling is to find various topics that are present in your corpus. Each document in the corpus will be made up of at least one topic, if not multiple topics.

In this notebook, we will be covering the steps on how to do **Latent Dirichlet Allocation (LDA)**, which is one of many topic modeling techniques. It was specifically designed for text data.

To use a topic modeling technique, you need to provide (1) a document-term matrix and (2) the number of topics you would like the algorithm to pick up.

Once the topic modeling technique is applied, your job as a human is to interpret the results and see if the mix of words in each topic make sense. If they don't make sense, you can try changing up the number of topics, the terms in the document-term matrix, model parameters, or even try a different model.

## Topic Modeling - Attempt #1 (All Text)

In [48]:
# Let's read in our document-term matrix
import pandas as pd
import pickle
import nltk

data = pd.read_pickle('dtm_stop.pkl')
data.head()

Unnamed: 0,aashima,abbas,abdul,abha,abhay,abhigyan,abhijit,abhishek,abigail,abilash,...,zealands,zeroes,zerosum,zika,zombie,zoom,zurbuchen,¹⁵,¹⁸,āwe
file1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
file10,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
file11,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
file12,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
file13,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


- gensim: Gensim is a Python library for topic modeling, document indexing, and similarity retrieval with large corpora. It is particularly useful for unsupervised machine learning tasks on text data.

- matutils: This module in Gensim provides utility functions for working with matrices and vectors. It includes functions for converting between different matrix formats, such as converting sparse matrices to dense matrices and vice versa.

- models: The models module in Gensim contains implementations for various models used in natural language processing, such as Word2Vec, Doc2Vec, and Latent Dirichlet Allocation (LDA). These models are commonly used for tasks like word and document embeddings, topic modeling, and similarity analysis.

- scipy.sparse: SciPy is a scientific computing library for Python, and scipy.sparse provides functionality for working with sparse matrices. Sparse matrices are efficient data structures for representing matrices where the majority of elements are zero. They are commonly used in NLP tasks to represent large document-term matrices efficiently, especially when dealing with large corpora.

In [2]:
# Import the necessary modules for LDA with gensim
# Terminal / Anaconda Navigator: conda install -c conda-forge gensim
from gensim import matutils, models
import scipy.sparse

# import logging
# logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [3]:
# One of the required inputs is a term-document matrix
tdm = data.transpose()
tdm.head()

Unnamed: 0,file1,file10,file11,file12,file13,file14,file15,file16,file17,file18,...,file38,file39,file4,file40,file41,file5,file6,file7,file8,file9
aashima,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
abbas,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
abdul,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
abha,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
abhay,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [4]:
# We're going to put the term-document matrix into a new gensim format, from df --> sparse matrix --> gensim corpus
sparse_counts = scipy.sparse.csr_matrix(tdm)
corpus = matutils.Sparse2Corpus(sparse_counts)

In [5]:
# Gensim also requires dictionary of the all terms and their respective location in the term-document matrix
cv = pickle.load(open("cv_stop.pkl", "rb"))
id2word = dict((v, k) for k, v in cv.vocabulary_.items())

In [6]:
cv.vocabulary_.items()

dict_items([('wooden', 10606), ('structure', 9158), ('says', 8297), ('ancient', 424), ('relativesthe', 7796), ('anusandhan', 496), ('national', 6104), ('research', 7897), ('foundation', 3667), ('nrf', 6326), ('bill', 1039), ('wasrecently', 10419), ('approvedin', 547), ('parliament', 6711), ('aims', 271), ('revolutionise', 8011), ('development', 2519), ('rd', 7587), ('ecosystem', 2905), ('india', 4596), ('bringing', 1206), ('significant', 8667), ('additional', 126), ('investment', 4868), ('government', 3940), ('private', 7221), ('sector', 8412), ('international', 4822), ('collaborationsthe', 1721), ('vision', 10352), ('promote', 7316), ('longterm', 5444), ('innovative', 4712), ('acrossbasic', 92), ('sciences', 8367), ('humanities', 4373), ('social', 8791), ('sciencesin', 8368), ('bid', 1032), ('position', 7059), ('global', 3910), ('superpowerhowever', 9324), ('happen', 4087), ('existing', 3271), ('issues', 4928), ('need', 6147), ('addressedone', 129), ('issue', 4926), ('acquiring', 88),

Now that we have the corpus (term-document matrix) and id2word (dictionary of location: term), we need to specify two other parameters - the number of topics and the number of passes. Let's start the number of topics at 2, see if the results make sense, and increase the number from there.

In [7]:
# Now that we have the corpus (term-document matrix) and id2word (dictionary of location: term),
# we need to specify two other parameters as well - the number of topics and the number of passes
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=2, passes=80)
lda.print_topics()

[(0,
  '0.015*"faculty" + 0.008*"university" + 0.007*"student" + 0.007*"institute" + 0.005*"indian" + 0.005*"science" + 0.004*"fellow" + 0.003*"postdoctoral" + 0.003*"students" + 0.003*"iisc"'),
 (1,
  '0.005*"students" + 0.004*"also" + 0.004*"research" + 0.004*"one" + 0.003*"may" + 0.003*"cancer" + 0.003*"said" + 0.003*"would" + 0.003*"science" + 0.003*"people"')]

In [8]:
# LDA for num_topics = 3
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=3, passes=80)
lda.print_topics()

[(0,
  '0.006*"sleep" + 0.005*"health" + 0.004*"cancer" + 0.003*"also" + 0.003*"food" + 0.003*"may" + 0.003*"light" + 0.003*"cells" + 0.003*"work" + 0.003*"ultraprocessed"'),
 (1,
  '0.006*"students" + 0.004*"also" + 0.004*"research" + 0.003*"science" + 0.003*"one" + 0.003*"like" + 0.003*"scientific" + 0.003*"fusion" + 0.003*"time" + 0.003*"many"'),
 (2,
  '0.016*"faculty" + 0.008*"university" + 0.008*"institute" + 0.007*"student" + 0.006*"science" + 0.005*"students" + 0.005*"indian" + 0.004*"fellow" + 0.004*"postdoctoral" + 0.003*"iisc"')]

In [9]:
# LDA for num_topics = 4
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=4, passes=80)
lda.print_topics()

[(0,
  '0.021*"faculty" + 0.010*"institute" + 0.010*"student" + 0.010*"university" + 0.008*"students" + 0.007*"indian" + 0.006*"science" + 0.005*"fellow" + 0.005*"postdoctoral" + 0.004*"iisc"'),
 (1,
  '0.005*"science" + 0.005*"scientists" + 0.004*"neanderthals" + 0.004*"also" + 0.004*"one" + 0.004*"time" + 0.004*"wild" + 0.003*"new" + 0.003*"may" + 0.003*"said"'),
 (2,
  '0.007*"students" + 0.006*"research" + 0.004*"sleep" + 0.004*"also" + 0.003*"like" + 0.003*"one" + 0.003*"ai" + 0.003*"would" + 0.003*"years" + 0.003*"many"'),
 (3,
  '0.004*"fusion" + 0.004*"also" + 0.004*"cancer" + 0.003*"plasma" + 0.003*"cells" + 0.003*"research" + 0.003*"said" + 0.003*"health" + 0.003*"researchers" + 0.003*"work"')]

These topics aren't looking too great. We've tried modifying our parameters. Let's try modifying our terms list as well.

## Topic Modeling - Attempt #2 (Nouns Only)

One popular trick is to look only at terms that are from one part of speech (only nouns, only adjectives, etc.). Check out the UPenn tag set: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html.

In [10]:
# Let's create a function to pull out nouns from a string of text
from nltk import word_tokenize, pos_tag

def nouns(text):
    '''Given a string of text, tokenize the text and pull out only the nouns.'''
    is_noun = lambda pos: pos[:2] == 'NN'
    tokenized = word_tokenize(text)
    all_nouns = [word for (word, pos) in pos_tag(tokenized) if is_noun(pos)]  # [('John', 'NNP'), ("'s", 'POS'), ('big', 'JJ'), ('idea', 'NN'), ('is', 'VBZ'),
    return ' '.join(all_nouns)

In [45]:
# Read in the cleaned data, before the CountVectorizer step
data_clean = pd.read_pickle('data_clean2.pkl')
data_clean.head()

Unnamed: 0,transcript,name
file1,what a wooden structure says about our ancien...,file1
file10,how fiefdoms and do or die imperil the success...,file10
file11,years later and still no clarity on green cle...,file11
file12,a newfound neutron star might light the way fo...,file12
file13,uttering the uterus mapping myths and menstrua...,file13


POS tagging is the process of assigning a part-of-speech tag (such as noun, verb, adjective, etc.) to each word in a given text. NLTK uses pre-trained statistical models, like the averaged perceptron tagger, to perform this task.

In [12]:
nltk.download('averaged_perceptron_tagger')


[nltk_data] Downloading package averaged_perceptron_tagger to

[nltk_data]     C:\Users\ASUS\AppData\Roaming\nltk_data...

[nltk_data]   Package averaged_perceptron_tagger is already up-to-

[nltk_data]       date!


True

In [46]:
# Apply the nouns function to the transcripts to filter only on nouns
data_nouns = pd.DataFrame(data_clean.transcript.apply(nouns))
data_nouns.head()

Unnamed: 0,transcript
file1,structure ancient relativesthe research founda...
file10,fiefdoms success sportspeoplea photo lecture h...
file11,years clarity clearance waterway whyillustrati...
file12,neutron star way class objecta teacher teaches...
file13,mapping myths health indiaillustration fractio...


In [47]:
# Create a new document-term matrix using only nouns
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer

# Re-add the additional stop words since we are recreating the document-term matrix
add_stop_words = ['a', 'about', 'above', 'across', 'after', 'again', 'against', 'ain', 'all', 'almost',
                        'alone', 'along', 'already', 'also', 'although', 'always', 'am', 'among', 'an', 'and',
                        'another', 'any', 'anybody', 'anyone', 'anything', 'anywhere', 'are', 'aren', "aren't",
                        'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both',
                        'but', 'by', 'can', 'cannot', 'could', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't",
                        'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few',
                        'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven',
                        "haven't", 'having', 'he', 'her', 'here', 'hers', 'herself', 'him', 'himself', 'his', 'how',
                        'however', 'i', 'if', 'in', 'into', 'is', 'isn', "isn't", 'it', "it's", 'its', 'itself', 'just',
                        'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself',
                        'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or',
                        'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own','same', 'shan', "shan't",
                        'she', "she's", 'should', "shouldn't", 'so', 'some', 'such', 't', 'than', 'that', "that'll",
                        'the', 'their', 'theirs', 'them', 'themselves', 'then', 'there', 'these', 'they', 'this',
                        'those', 'through', 'to', 'too', 'under', 'until', 'up', 've', 'very', 'was', 'wasn', "wasn't",
                        'we', 'were', 'weren', "weren't", 'what', 'when', 'where', 'which', 'while', 'who', 'whom',
                        'why', 'will', 'with', 'won', "won't", 'would', 'wouldn', "wouldn't", 'y', 'you', "you'd",
                        "you'll", "you're", "you've", 'your', 'yours', 'yourself', 'yourselves', 'couldn', 'aren',
                        "aren't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven',
                        "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't",
                        'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't"]
stop_words = list(text.ENGLISH_STOP_WORDS.union(add_stop_words))

# Recreate a document-term matrix with only nouns
cvn = CountVectorizer(stop_words=stop_words)
data_cvn = cvn.fit_transform(data_nouns.transcript)
data_dtmn = pd.DataFrame(data_cvn.toarray(), columns=cvn.get_feature_names_out())
data_dtmn.index = data_nouns.index
data_dtmn.head()




Unnamed: 0,aashima,abbas,abdul,abha,abhay,abhijit,abhishek,abigail,abilash,ability,...,zachariah,zagreb,zealands,zeroes,zombie,zoom,zurbuchen,¹⁵,¹⁸,āwe
file1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
file10,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
file11,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
file12,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
file13,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


In [15]:
# Create the gensim corpus
corpusn = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(data_dtmn.transpose()))

# Create the vocabulary dictionary
id2wordn = dict((v, k) for k, v in cvn.vocabulary_.items())

In [16]:
# Let's start with 2 topics
ldan = models.LdaModel(corpus=corpusn, num_topics=2, id2word=id2wordn, passes=80)
ldan.print_topics()

[(0,
  '0.016*"students" + 0.008*"research" + 0.007*"science" + 0.005*"time" + 0.005*"years" + 0.005*"education" + 0.004*"work" + 0.004*"people" + 0.004*"student" + 0.004*"university"'),
 (1,
  '0.025*"faculty" + 0.013*"university" + 0.012*"institute" + 0.011*"student" + 0.007*"science" + 0.007*"health" + 0.005*"fellow" + 0.005*"scientists" + 0.005*"fusion" + 0.005*"cancer"')]

In [17]:
# Let's try topics = 3
ldan = models.LdaModel(corpus=corpusn, num_topics=3, id2word=id2wordn, passes=80)
ldan.print_topics()

[(0,
  '0.030*"faculty" + 0.014*"university" + 0.014*"institute" + 0.013*"student" + 0.008*"science" + 0.007*"health" + 0.005*"fellow" + 0.005*"people" + 0.005*"students" + 0.005*"cancer"'),
 (1,
  '0.007*"students" + 0.006*"fusion" + 0.006*"time" + 0.005*"neanderthals" + 0.005*"scientists" + 0.004*"university" + 0.004*"years" + 0.004*"food" + 0.004*"way" + 0.004*"plasma"'),
 (2,
  '0.015*"students" + 0.014*"research" + 0.011*"science" + 0.006*"cancer" + 0.006*"cells" + 0.006*"researchers" + 0.005*"spaces" + 0.005*"universities" + 0.005*"gender" + 0.005*"years"')]

In [18]:
# Let's try 4 topics
ldan = models.LdaModel(corpus=corpusn, num_topics=4, id2word=id2wordn, passes=80)
ldan.print_topics()

[(0,
  '0.032*"faculty" + 0.017*"university" + 0.014*"student" + 0.014*"institute" + 0.008*"science" + 0.006*"fellow" + 0.005*"iit" + 0.005*"neanderthals" + 0.005*"scientists" + 0.005*"years"'),
 (1,
  '0.017*"students" + 0.012*"research" + 0.011*"science" + 0.007*"fusion" + 0.006*"cancer" + 0.005*"time" + 0.005*"way" + 0.005*"spaces" + 0.005*"universities" + 0.004*"gender"'),
 (2,
  '0.011*"health" + 0.010*"cancer" + 0.009*"cells" + 0.006*"violence" + 0.006*"countries" + 0.005*"education" + 0.005*"researchers" + 0.005*"survivors" + 0.004*"research" + 0.004*"women"'),
 (3,
  '0.013*"students" + 0.007*"people" + 0.006*"food" + 0.005*"scientists" + 0.005*"health" + 0.005*"student" + 0.004*"air" + 0.004*"science" + 0.004*"evidence" + 0.004*"iiser"')]

## Topic Modeling - Attempt #3 (Nouns and Adjectives)

In [19]:
# Let's create a function to pull out nouns from a string of text
def nouns_adj(text):
    '''Given a string of text, tokenize the text and pull out only the nouns and adjectives.'''
    is_noun_adj = lambda pos: pos[:2] == 'NN' or pos[:2] == 'JJ'
    tokenized = word_tokenize(text)
    nouns_adj = [word for (word, pos) in pos_tag(tokenized) if is_noun_adj(pos)] 
    return ' '.join(nouns_adj)

In [49]:
# Apply the nouns function to the transcripts to filter only on nouns
data_nouns_adj = pd.DataFrame(data_clean.transcript.apply(nouns_adj))
data_nouns_adj.head()

Unnamed: 0,transcript
file1,wooden structure ancient relativesthe anusandh...
file10,fiefdoms success indias sportspeoplea represen...
file11,years clarity green clearance ganga waterway w...
file12,newfound neutron star way new class stellar ob...
file13,uterus mapping myths menstrual health indiaill...


In [50]:
# Create a new document-term matrix using only nouns and adjectives, also remove common words with max_df
cvna = CountVectorizer(stop_words=stop_words, max_df=.8)
data_cvna = cvna.fit_transform(data_nouns_adj.transcript)
data_dtmna = pd.DataFrame(data_cvna.toarray(), columns=cvna.get_feature_names_out())
data_dtmna.index = data_nouns_adj.index
data_dtmna.head()




Unnamed: 0,aashima,abbas,abdul,abha,abhay,abhigyan,abhijit,abhishek,abigail,abilash,...,zambiathis,zealands,zeroes,zerosum,zombie,zoom,zurbuchen,¹⁵,¹⁸,āwe
file1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
file10,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
file11,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
file12,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
file13,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [22]:
# Create the gensim corpus
corpusna = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(data_dtmna.transpose()))

# Create the vocabulary dictionary
id2wordna = dict((v, k) for k, v in cvna.vocabulary_.items())

In [23]:
# Let's start with 2 topics
ldana = models.LdaModel(corpus=corpusna, num_topics=2, id2word=id2wordna, passes=80)
ldana.print_topics()

[(0,
  '0.005*"students" + 0.005*"science" + 0.005*"cancer" + 0.004*"research" + 0.003*"scientists" + 0.003*"health" + 0.003*"people" + 0.003*"researchers" + 0.003*"work" + 0.003*"time"'),
 (1,
  '0.021*"faculty" + 0.011*"university" + 0.010*"student" + 0.009*"institute" + 0.008*"students" + 0.006*"indian" + 0.005*"research" + 0.005*"science" + 0.005*"fellow" + 0.005*"postdoctoral"')]

In [24]:
# Let's try 3 topics
ldana = models.LdaModel(corpus=corpusna, num_topics=3, id2word=id2wordna, passes=80)
ldana.print_topics()

[(0,
  '0.012*"students" + 0.006*"research" + 0.004*"health" + 0.004*"years" + 0.004*"cancer" + 0.004*"neanderthals" + 0.003*"education" + 0.003*"science" + 0.003*"time" + 0.003*"age"'),
 (1,
  '0.005*"health" + 0.005*"time" + 0.005*"people" + 0.004*"sleep" + 0.004*"scientists" + 0.003*"work" + 0.003*"research" + 0.003*"light" + 0.003*"new" + 0.003*"years"'),
 (2,
  '0.024*"faculty" + 0.012*"university" + 0.011*"student" + 0.011*"institute" + 0.009*"science" + 0.007*"indian" + 0.006*"fellow" + 0.005*"postdoctoral" + 0.005*"research" + 0.005*"iisc"')]

In [25]:
# Let's try 4 topics
ldana = models.LdaModel(corpus=corpusna, num_topics=4, id2word=id2wordna, passes=80)
ldana.print_topics()

[(0,
  '0.006*"people" + 0.006*"time" + 0.005*"students" + 0.004*"scientists" + 0.004*"research" + 0.004*"years" + 0.004*"science" + 0.003*"brain" + 0.003*"data" + 0.003*"work"'),
 (1,
  '0.012*"students" + 0.009*"science" + 0.007*"fusion" + 0.005*"plasma" + 0.005*"spaces" + 0.005*"sleep" + 0.005*"gender" + 0.004*"way" + 0.004*"reactor" + 0.004*"energy"'),
 (2,
  '0.010*"cancer" + 0.006*"research" + 0.005*"years" + 0.005*"neanderthals" + 0.004*"cells" + 0.004*"researchers" + 0.004*"new" + 0.004*"food" + 0.004*"health" + 0.004*"wild"'),
 (3,
  '0.033*"faculty" + 0.016*"university" + 0.015*"student" + 0.015*"institute" + 0.011*"indian" + 0.008*"fellow" + 0.008*"postdoctoral" + 0.007*"science" + 0.006*"iisc" + 0.006*"students"')]

## Identify Topics in Each Document

Out of the 9 topic models we looked at, the nouns and adjectives, 4 topic one made the most sense. So let's pull that down here and run it through some more iterations to get more fine-tuned topics.

In [44]:
# Our final LDA model (for now)
ldana = models.LdaModel(corpus=corpusna, num_topics=4, id2word=id2wordna, passes=80)
ldana.print_topics()

[(0,
  '0.011*"students" + 0.008*"cancer" + 0.005*"health" + 0.004*"women" + 0.004*"gender" + 0.004*"cells" + 0.004*"india" + 0.003*"science" + 0.003*"education" + 0.003*"spaces"'),
 (1,
  '0.026*"faculty" + 0.013*"university" + 0.012*"student" + 0.011*"institute" + 0.008*"indian" + 0.007*"science" + 0.006*"fellow" + 0.006*"postdoctoral" + 0.005*"iisc" + 0.004*"iit"'),
 (2,
  '0.006*"fusion" + 0.004*"plasma" + 0.004*"wild" + 0.004*"years" + 0.004*"scientists" + 0.004*"new" + 0.003*"research" + 0.003*"reactor" + 0.003*"energy" + 0.003*"using"'),
 (3,
  '0.006*"students" + 0.006*"research" + 0.004*"science" + 0.004*"scientific" + 0.004*"work" + 0.003*"time" + 0.003*"said" + 0.003*"years" + 0.003*"researchers" + 0.002*"light"')]

These four topics look pretty decent. Let's settle on these for now.
* Topic 0: faculty, students, student, university
* Topic 1: cancer, health, cells
* Topic 2: fusion, plasma, sleep, health
* Topic 3: research, scientists, cancer

In [27]:
# # Extracting topic distribution for each document
topics_per_doc_with_words = [(max(prob, key=lambda y:y[1]), ldana.show_topic(max(prob, key=lambda y:y[1])[0])) for prob in ldana[corpusna]]
topics_per_doc_with_words


[((0, 0.9975674),
  [('students', 0.016134335),
   ('cancer', 0.008812206),
   ('research', 0.0075111883),
   ('education', 0.0059801457),
   ('science', 0.004566536),
   ('treatment', 0.004523931),
   ('student', 0.0041044373),
   ('university', 0.0037702366),
   ('gender', 0.0036870108),
   ('spaces', 0.0035824713)]),
 ((0, 0.9977634),
  [('students', 0.016134335),
   ('cancer', 0.008812206),
   ('research', 0.0075111883),
   ('education', 0.0059801457),
   ('science', 0.004566536),
   ('treatment', 0.004523931),
   ('student', 0.0041044373),
   ('university', 0.0037702366),
   ('gender', 0.0036870108),
   ('spaces', 0.0035824713)]),
 ((0, 0.99897337),
  [('students', 0.016134335),
   ('cancer', 0.008812206),
   ('research', 0.0075111883),
   ('education', 0.0059801457),
   ('science', 0.004566536),
   ('treatment', 0.004523931),
   ('student', 0.0041044373),
   ('university', 0.0037702366),
   ('gender', 0.0036870108),
   ('spaces', 0.0035824713)]),
 ((1, 0.9984666),
  [('fusion', 0

* Topic 0: faculty, students, student, university
* Topic 1: cancer, health, cells
* Topic 2: fusion, plasma, sleep, health
* Topic 3: research, scientists, cancer

### Assignment:
1. Try further modifying the parameters of the topic models above and see if you can get better topics.
2. Create a new topic model that includes terms from a different [part of speech](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html) and see if you can get better topics.

In [28]:
ldana = models.LdaModel(corpus=corpusna, num_topics=10, id2word=id2wordna, passes=80)
ldana.print_topics()

[(0,
  '0.038*"cancer" + 0.021*"treatment" + 0.013*"cells" + 0.011*"breast" + 0.010*"senescent" + 0.009*"air" + 0.005*"evidence" + 0.005*"health" + 0.005*"centre" + 0.005*"patients"'),
 (1,
  '0.015*"countries" + 0.013*"education" + 0.008*"pandemic" + 0.007*"global" + 0.007*"closures" + 0.005*"school" + 0.004*"average" + 0.004*"world" + 0.004*"students" + 0.003*"effects"'),
 (2,
  '0.016*"sleep" + 0.009*"students" + 0.006*"india" + 0.006*"health" + 0.005*"light" + 0.005*"brain" + 0.005*"education" + 0.004*"suicide" + 0.004*"suicides" + 0.004*"rhythm"'),
 (3,
  '0.000*"faculty" + 0.000*"student" + 0.000*"students" + 0.000*"university" + 0.000*"science" + 0.000*"research" + 0.000*"institute" + 0.000*"work" + 0.000*"indian" + 0.000*"new"'),
 (4,
  '0.009*"scientific" + 0.009*"science" + 0.008*"papers" + 0.007*"research" + 0.007*"ai" + 0.006*"scientists" + 0.006*"researchers" + 0.005*"paper" + 0.005*"scholars" + 0.005*"solar"'),
 (5,
  '0.010*"health" + 0.009*"universities" + 0.008*"agricu

In [29]:
# Let's create a function to pull out nouns, adjectives, and preposition from a string of text
from nltk import word_tokenize, pos_tag

def nouns_adj_verb(text):
    '''Given a string of text, tokenize the text and pull out only the nouns.'''
    is_noun_adj_verb = lambda pos: pos[:2] == 'NN' or pos[:2] == 'JJ' or pos[:2] == 'VB'
    tokenized = word_tokenize(text)
    all_noun_adj_verb = [word for (word, pos) in pos_tag(tokenized) if is_noun_adj_verb(pos)] 
    return ' '.join(all_noun_adj_verb)

In [51]:
# Apply the nouns_adj_preposition function to the transcripts to filter only on nouns, adjectives, and preposition
data_nouns_adj_verb = pd.DataFrame(data_clean.transcript.apply(nouns_adj_verb))
data_nouns_adj_verb.head()

Unnamed: 0,transcript
file1,wooden structure says ancient relativesthe anu...
file10,fiefdoms do die success indias sportspeoplea r...
file11,years clarity green clearance ganga waterway w...
file12,newfound neutron star light way new class stel...
file13,uttering uterus mapping myths menstrual health...


In [52]:
# Create a new document-term matrix using only nouns and adjectives, also remove common words with max_df
cvna = CountVectorizer(stop_words=stop_words, max_df=.8)
data_cvna = cvna.fit_transform(data_nouns_adj_verb.transcript)
data_dtmna = pd.DataFrame(data_cvna.toarray(), columns=cvna.get_feature_names_out())
data_dtmna.index = data_nouns_adj.index
data_dtmna.head()




Unnamed: 0,aashima,abbas,abdul,abha,abhay,abhigyan,abhijit,abhishek,abigail,abilash,...,zambiathis,zealands,zeroes,zerosum,zombie,zoom,zurbuchen,¹⁵,¹⁸,āwe
file1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
file10,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
file11,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
file12,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
file13,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [32]:
# Create the gensim corpus
corpusna = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(data_dtmna.transpose()))

# Create the vocabulary dictionary
id2wordna = dict((v, k) for k, v in cvna.vocabulary_.items())

In [33]:
# Let's start with 5 topics
ldana = models.LdaModel(corpus=corpusna, num_topics=5, id2word=id2wordna, passes=80)
ldana.print_topics()

[(0,
  '0.027*"faculty" + 0.013*"student" + 0.013*"institute" + 0.013*"university" + 0.012*"students" + 0.009*"indian" + 0.008*"science" + 0.006*"fellow" + 0.006*"postdoctoral" + 0.006*"iiser"'),
 (1,
  '0.008*"cancer" + 0.007*"research" + 0.005*"said" + 0.004*"health" + 0.003*"scientists" + 0.003*"students" + 0.003*"wild" + 0.003*"years" + 0.003*"food" + 0.003*"scientific"'),
 (2,
  '0.008*"science" + 0.005*"research" + 0.004*"universities" + 0.004*"scientists" + 0.004*"scientific" + 0.004*"problems" + 0.004*"agricultural" + 0.004*"people" + 0.004*"think" + 0.003*"light"'),
 (3,
  '0.012*"fusion" + 0.009*"plasma" + 0.006*"reactor" + 0.006*"energy" + 0.004*"time" + 0.004*"power" + 0.004*"fuel" + 0.003*"mix" + 0.003*"way" + 0.003*"knowledge"'),
 (4,
  '0.006*"sleep" + 0.005*"neanderthals" + 0.003*"years" + 0.003*"work" + 0.003*"neanderthal" + 0.003*"researchers" + 0.003*"brain" + 0.003*"new" + 0.003*"disease" + 0.003*"says"')]

In [34]:
# Let's start with 10 topics
ldana = models.LdaModel(corpus=corpusna, num_topics=10, id2word=id2wordna, passes=80)
ldana.print_topics()

[(0,
  '0.012*"cancer" + 0.011*"health" + 0.008*"treatment" + 0.007*"violence" + 0.007*"women" + 0.006*"breast" + 0.005*"survivors" + 0.005*"screening" + 0.004*"work" + 0.004*"ashas"'),
 (1,
  '0.020*"sleep" + 0.012*"cancer" + 0.009*"cells" + 0.006*"senescent" + 0.006*"brain" + 0.005*"health" + 0.005*"work" + 0.005*"hours" + 0.005*"rhythm" + 0.004*"mental"'),
 (2,
  '0.009*"fusion" + 0.009*"students" + 0.006*"plasma" + 0.005*"reactor" + 0.005*"energy" + 0.004*"food" + 0.004*"ultraprocessed" + 0.004*"time" + 0.004*"ai" + 0.004*"using"'),
 (3,
  '0.010*"research" + 0.006*"wild" + 0.006*"researchers" + 0.006*"scientists" + 0.005*"light" + 0.004*"new" + 0.004*"said" + 0.004*"vaccine" + 0.004*"crops" + 0.004*"plants"'),
 (4,
  '0.008*"said" + 0.005*"preparators" + 0.004*"hindi" + 0.004*"think" + 0.004*"scientists" + 0.004*"students" + 0.004*"work" + 0.004*"fossils" + 0.003*"paleontology" + 0.003*"language"'),
 (5,
  '0.007*"disease" + 0.007*"air" + 0.006*"solar" + 0.005*"nanoplastics" + 0.0

while using LDA for topic modelling for no of topic = 5, we are able to identify all the topics.
But for no of topics = 10 or 12, topics are repeated. 

## Latent Semantic Analysis (LSA): 
LSA is a classical technique used for dimensionality reduction. It applies Singular Value Decomposition (SVD) to a term-document matrix to identify underlying topics.

In [36]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

# Convert DataFrame to sparse matrix
sparse_data_dtmn = data_dtmna

# Perform LSA
lsa_model = TruncatedSVD(n_components=10)  # Adjust the number of components as needed
lsa_data = lsa_model.fit_transform(sparse_data_dtmn)

# Get the vocabulary from CountVectorizer
cvn = CountVectorizer(stop_words=stop_words, max_df=.8)
cvn.fit_transform(data_nouns_adj_verb.transcript)  # Assuming your_text_data contains the text data
id2wordn = dict((v, k) for k, v in cvn.vocabulary_.items())
topics = []
# Print topics (terms) generated by LSA
# The components of the LSA model represent topics. Each component is a linear combination of terms
# You can print the top terms for each component to interpret the topics
terms = lsa_model.components_
for i, comp in enumerate(terms):
    topic_terms = []
    terms_comp = zip(id2wordn.values(), comp)
    sorted_terms = sorted(terms_comp, key=lambda x: x[1], reverse=True)[:10]  # Print top 10 terms per topic
    for term in sorted_terms:
        topic_terms.append(term[0])
    topics.append(topic_terms)

# Print topics in the requested format
for i, topic in enumerate(topics):
    print(f"Topic {i + 1}: {topic}")





Topic 1: ['distant', 'cascading', 'pradeep', 'gujarathetal', 'vivek', 'codehowever', 'outer', 'daynight', 'paris', 'sarkar']

Topic 2: ['accredited', 'hugging', 'daynight', 'remainunvaccinated', 'umang', 'bathing', 'studentorganisers', 'renowned', 'held', 'physician']

Topic 3: ['cohens', 'commit', 'requiringexceptional', 'paperud', 'analogiesprofessor', 'copernicus', 'warm', 'pile', 'interventionone', 'alsorepresentative']

Topic 4: ['alsorepresentative', 'interventionone', 'bathing', 'redressal', 'pile', 'onefifth', 'effortthose', 'digestionconsider', 'eliminates', 'bluesince']

Topic 5: ['digestionconsider', 'hardiness', 'conjured', 'bythe', 'cbti', 'renowned', 'religion', 'hurrying', 'homeskathuria', 'bacteria']

Topic 6: ['money', 'pouring', 'final', 'swimming', 'supports', 'reproduction', 'fuse', 'maternal', 'dementia', 'medication']

Topic 7: ['accredited', 'developedthe', 'shortage', 'vanish', 'culminating', 'anomaliesdo', 'scientists', 'deemed', 'waist', 'thenbhi']

Topic 8: [