# **INFO5731 In-class Exercise 4**

**This exercise provides hands-on experience in working with text data and extracting features using various topic modeling algorithms. Key concepts include Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA), lda2vec, and BERTopic.**

***Please use the text corpus you collected in your last in-class-exercise for this exercise. Perform the following tasks***.

**Expectations**:
*   Students are expected to complete the exercise during the lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission, and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)

**Generate K topics using LDA, choosing the number of topics K by coherence score, then summarize the topics.**

You may refer to the code here: https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/
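Since every question selects K by coherence score, it may help to see what such a score measures. The sketch below is a simplified, pure-Python UMass-style coherence (averaged over word pairs), not the `c_v` measure that gensim's `CoherenceModel` computes in the cells that follow. The function name `umass_coherence` and the toy documents are assumptions made for illustration.

```python
import math
from itertools import combinations

def umass_coherence(topic_words, documents):
    """UMass-style coherence for one topic: the mean over ranked word pairs of
    log((docs containing both words + 1) / docs containing the earlier word)."""
    doc_sets = [set(doc) for doc in documents]

    def doc_count(*words):
        # Number of documents that contain every word in `words`
        return sum(1 for doc in doc_sets if all(w in doc for w in words))

    pair_scores = []
    for earlier, later in combinations(topic_words, 2):
        denom = doc_count(earlier)
        if denom == 0:
            continue  # skip words that never occur, to avoid dividing by zero
        pair_scores.append(math.log((doc_count(earlier, later) + 1) / denom))
    return sum(pair_scores) / len(pair_scores) if pair_scores else float("nan")

# Hypothetical toy corpus of already-tokenized documents
docs = [
    ["love", "family", "life"],
    ["love", "family"],
    ["crime", "police", "drug"],
    ["crime", "police"],
    ["love", "police"],
]

coherent = umass_coherence(["love", "family"], docs)  # words that co-occur
mixed = umass_coherence(["love", "crime"], docs)      # words that never co-occur
print(coherent > mixed)
```

Words that appear together in many documents score closer to zero; words that rarely share a document score strongly negative, which is why a higher value is read as a more coherent topic.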

In [1]:
import pandas as pd

# Load data
netflix_data_lda = pd.read_csv('netflix_titles_with_IMDB.csv', usecols=['show_id','listed_in', 'description'])
netflix_data_lda.head(10)

Unnamed: 0,show_id,listed_in,description
0,s1,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...
5,s6,"TV Dramas, TV Horror, TV Mysteries",The arrival of a charismatic young priest brin...
6,s7,Children & Family Movies,Equestria's divided. But a bright-eyed hero be...
7,s8,"Dramas, Independent Movies, International Movies","On a photo shoot in Ghana, an American model s..."
8,s9,"British TV Shows, Reality TV",A talented batch of amateur bakers face off in...
9,s10,"Comedies, Dramas",A woman adjusting to life after a loss contend...


In [2]:
# Load the regular expression library
import re
# Remove punctuation from the descriptions (lowercasing happens later in simple_preprocess)
netflix_data_lda['description_processed'] = \
netflix_data_lda['description'].map(lambda x: re.sub(r'[,\.!?]', '', x))
netflix_data_lda['description_processed'].head(10)

0    As her father nears the end of his life filmma...
1    After crossing paths at a party a Cape Town te...
2    To protect his family from a powerful drug lor...
3    Feuds flirtations and toilet talk go down amon...
4    In a city of coaching centers known to train I...
5    The arrival of a charismatic young priest brin...
6    Equestria's divided But a bright-eyed hero bel...
7    On a photo shoot in Ghana an American model sl...
8    A talented batch of amateur bakers face off in...
9    A woman adjusting to life after a loss contend...
Name: description_processed, dtype: object

In [3]:
# Import required gensim libraries
import gensim
from gensim.utils import simple_preprocess
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [4]:
# List and remove stopwords
stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])

# Tokenize each sentence into lowercase words, stripping punctuation and accents
def sent_to_words(sentences):
    for sentence in sentences:
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)

# Drop stop words from each already-tokenized document
def remove_stopwords(texts):
    return [[word for word in doc if word not in stop_words] for doc in texts]

# Convert the cleaned 'description' column to a list
data = netflix_data_lda.description_processed.values.tolist()
data_words = list(sent_to_words(data))
#Remove stop words
data_words = remove_stopwords(data_words)
print(data_words[:1][0][:30])
print(data_words[4:5][0][:30])

['father', 'nears', 'end', 'life', 'filmmaker', 'kirsten', 'johnson', 'stages', 'death', 'inventive', 'comical', 'ways', 'help', 'face', 'inevitable']
['city', 'coaching', 'centers', 'known', 'train', 'india', 'finest', 'collegiate', 'minds', 'earnest', 'unexceptional', 'student', 'friends', 'navigate', 'campus', 'life']
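To make the tokenize-then-filter step above concrete, here is a minimal standard-library sketch. It is a simplified stand-in for `simple_preprocess` plus the NLTK stop-word list, not a reimplementation of either: the tiny `STOP_WORDS` set, the `tokenize` helper, and the sample sentence are assumptions chosen for illustration.

```python
import re

# Hypothetical, deliberately tiny stop-word list (the notebook uses NLTK's English list)
STOP_WORDS = {"a", "the", "of", "her", "his", "as", "in", "to"}

def tokenize(text, stop_words=STOP_WORDS, min_len=2):
    """Lowercase the text, keep alphabetic runs, drop short tokens and stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if len(t) >= min_len and t not in stop_words]

sample = "As her father nears the end of his life, a filmmaker stages his death."
tokens = tokenize(sample)
print(tokens)
```

Lowercasing before the stop-word check matters: a capitalized "As" would otherwise slip past a lowercase stop-word list.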


In [5]:
# Import the required libraries
import gensim.corpora as corpora
# Create Dictionary and Create Corpus to be used for LDA model
id2word = corpora.Dictionary(data_words)
texts = data_words
# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]
print(corpus[:1][0][:30])
print(corpus[4:5][0][:30])

[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1)]
[(11, 1), (62, 1), (63, 1), (64, 1), (65, 1), (66, 1), (67, 1), (68, 1), (69, 1), (70, 1), (71, 1), (72, 1), (73, 1), (74, 1), (75, 1), (76, 1)]
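Each pair in the output above is `(token_id, count)` for one document. The following sketch shows the same idea with plain Python containers. The helpers `build_vocab` and `to_bow` are hypothetical, and ids are assigned in order of first appearance, which need not match the ids gensim's `Dictionary` assigns.

```python
from collections import Counter

def build_vocab(docs):
    """Map each distinct token to an integer id, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())

# Hypothetical toy documents, already tokenized
docs = [["life", "love", "life"], ["love", "crime"]]
vocab = build_vocab(docs)
bows = [to_bow(doc, vocab) for doc in docs]
print(vocab)
print(bows)
```

Word order is discarded at this point; only which tokens occur, and how often, is passed on to the topic model.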


In [6]:
# Import the required libraries
from gensim import corpora
# Create a dictionary with the corpus
corpus = data_words
dictionary = corpora.Dictionary(corpus)
# Convert the corpus into a bag of words
bow = [dictionary.doc2bow(text) for text in corpus]

In [9]:
# Import the required libraries
from gensim.models import LsiModel
from gensim.models.coherencemodel import CoherenceModel

# Compute the c_v coherence score for K = 2..10 topics
# NOTE: the models scored here are LsiModel instances, not LDA models
for i in range(2, 11):
    lsi = LsiModel(bow, num_topics=i, id2word=dictionary)

    # Score this model's topics with CoherenceModel
    coherence_model = CoherenceModel(model=lsi, texts=data_words, dictionary=dictionary, coherence='c_v')
    coherence_score = coherence_model.get_coherence()
    print('Coherence score with {} clusters: {}'.format(i, coherence_score))

Coherence score with 2 clusters: 0.24387326605597992
Coherence score with 3 clusters: 0.2576667664132633
Coherence score with 4 clusters: 0.3168215794725647
Coherence score with 5 clusters: 0.2903055939219434
Coherence score with 6 clusters: 0.260698223848533
Coherence score with 7 clusters: 0.27761391603323526
Coherence score with 8 clusters: 0.27922338904195115
Coherence score with 9 clusters: 0.2830709835213531
Coherence score with 10 clusters: 0.2506689470205811
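The printed scores can be reduced to a single choice of K by taking the maximum. The sketch below copies the scores from the output above (rounded to four decimals) into a dictionary; the helper name `pick_best_k` is an assumption.

```python
# Coherence scores copied from the sweep output above (K -> c_v score, rounded)
scores = {
    2: 0.2439, 3: 0.2577, 4: 0.3168, 5: 0.2903, 6: 0.2607,
    7: 0.2776, 8: 0.2792, 9: 0.2831, 10: 0.2507,
}

def pick_best_k(scores_by_k):
    """Return the number of topics with the highest coherence score."""
    return max(scores_by_k, key=scores_by_k.get)

best_k = pick_best_k(scores)
print(best_k, scores[best_k])
```

In a live notebook the same dictionary could be filled inside the loop instead of being typed in by hand.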


In [11]:
from pprint import pprint

# Number of topics: K = 4 scored highest (c_v = 0.3168) in the coherence sweep above
num_topics = 4

# Build an LDA model from the bag-of-words corpus and its matching dictionary
lda_model = gensim.models.LdaMulticore(corpus=bow, id2word=dictionary, num_topics=num_topics, random_state=42)

# Print the keywords associated with each topic
pprint(lda_model.print_topics())

# Apply the LDA model to the bag-of-words corpus to get per-document topic distributions
doc_lda = lda_model[bow]

## Question 2 (10 Points)

**Generate K topics using LSA, choosing the number of topics K by coherence score, then summarize the topics.**

You may refer to the code here: https://www.datacamp.com/community/tutorials/discovering-hidden-topics-python
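LSA is, at its core, a truncated singular value decomposition of the term-document matrix. The NumPy sketch below shows that step on a hypothetical 4-term, 4-document count matrix; it is an illustration of the idea rather than what gensim's `LsiModel` does internally, and all names and values are assumptions.

```python
import numpy as np

# Hypothetical term-document count matrix: rows are terms, columns are documents
terms = ["love", "family", "crime", "police"]
X = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 2, 1],
    [0, 0, 1, 2],
], dtype=float)

# Full SVD, then keep only the k largest singular values (the "topics")
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
term_topics = U[:, :k]                      # loading of each term on each topic
doc_topics = (np.diag(s[:k]) @ Vt[:k]).T    # coordinates of each document in topic space
X_k = term_topics @ np.diag(s[:k]) @ Vt[:k] # best rank-k approximation of X

print(np.round(s, 3))
print(np.round(X_k, 3))
```

The rank-2 reconstruction smooths the two blocks of related terms, which is the sense in which LSA groups words that occur in similar documents.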

In [12]:
# Write your code here
import pandas as pd

# Load data
netflix_data_lsa = pd.read_csv('netflix_titles_with_IMDB.csv', usecols=['show_id','listed_in', 'description'])
netflix_data_lsa.head(10)


Unnamed: 0,show_id,listed_in,description
0,s1,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...
5,s6,"TV Dramas, TV Horror, TV Mysteries",The arrival of a charismatic young priest brin...
6,s7,Children & Family Movies,Equestria's divided. But a bright-eyed hero be...
7,s8,"Dramas, Independent Movies, International Movies","On a photo shoot in Ghana, an American model s..."
8,s9,"British TV Shows, Reality TV",A talented batch of amateur bakers face off in...
9,s10,"Comedies, Dramas",A woman adjusting to life after a loss contend...


In [13]:
#check for the null values
netflix_data_lsa['description'].isna().value_counts()
netflix_data_lsa['listed_in'].isna().value_counts()

listed_in
False    8807
Name: count, dtype: int64

In [14]:
from gensim.parsing.preprocessing import remove_stopwords, strip_punctuation, preprocess_string, strip_short, stem_text

# Define a preprocessing function
def preprocess(text):
    # Lowercase, drop stop words, strip punctuation, drop short tokens, then stem
    CUSTOM_FILTERS = [lambda x: x.lower(),
                      remove_stopwords,
                      strip_punctuation,
                      strip_short, stem_text]
    text = preprocess_string(text, CUSTOM_FILTERS)

    return text

# Apply the preprocessing function to every entry in the 'description' column
netflix_data_lsa['Review_Text (Clean)'] = netflix_data_lsa['description'].apply(preprocess)
netflix_data_lsa.head()

Unnamed: 0,show_id,listed_in,description,Review_Text (Clean)
0,s1,Documentaries,"As her father nears the end of his life, filmm...","[father, near, end, life, filmmak, kirsten, jo..."
1,s2,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t...","[cross, path, parti, cape, town, teen, set, pr..."
2,s3,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...,"[protect, famili, power, drug, lord, skill, th..."
3,s4,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo...","[feud, flirtat, toilet, talk, incarcer, women,..."
4,s5,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...,"[citi, coach, center, known, train, india’, fi..."


In [15]:
# Import required libraries
from gensim import corpora

# Create a dictionary with the corpus.
corpus = netflix_data_lsa['Review_Text (Clean)']
dictionary = corpora.Dictionary(corpus)

# Changing the corpus into a bag of words
bow = [dictionary.doc2bow(text) for text in corpus]

In [16]:
from gensim.models import LsiModel
from gensim.models.coherencemodel import CoherenceModel

# Compute the c_v coherence score for each number of topics from 3 to 7
for i in range(3,8):
    lsi = LsiModel(bow, num_topics=i, id2word=dictionary)
    coherence_model = CoherenceModel(model=lsi, texts=netflix_data_lsa['Review_Text (Clean)'], dictionary=dictionary, coherence='c_v')
    coherence_score = coherence_model.get_coherence()
    print('Coherence score with {} clusters: {}'.format(i, coherence_score))

Coherence score with 3 clusters: 0.31217603759955453
Coherence score with 4 clusters: 0.3276309621035619
Coherence score with 5 clusters: 0.2588906800903731
Coherence score with 6 clusters: 0.31124532780018166
Coherence score with 7 clusters: 0.3030814875169092


In [17]:
# Create an LSI model with 2 topics
lsi = LsiModel(bow, num_topics=2, id2word=dictionary)

# Print the 5 words with the strongest association
for topic_num, words in lsi.print_topics(num_words=5):
    print('Words in Topic {}: {}.'.format(topic_num, words))

Words in Topic 0: 0.326*"life" + 0.275*"young" + 0.269*"new" + 0.231*"famili" + 0.227*"love".
Words in Topic 1: -0.563*"young" + 0.476*"new" + -0.308*"man" + 0.296*"life" + -0.253*"woman".
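Unlike LDA probabilities, LSI weights are signed, so words with opposite signs sit at opposite ends of the same topic axis. The sketch below parses the Topic 1 string printed above and splits its words by sign; the helper `parse_topic` is an assumption.

```python
import re

# Topic 1 string copied from the output above
topic_1 = '-0.563*"young" + 0.476*"new" + -0.308*"man" + 0.296*"life" + -0.253*"woman"'

def parse_topic(topic_str):
    """Turn a print_topics-style string into a list of (word, weight) pairs."""
    pairs = re.findall(r'(-?\d+\.\d+)\*"([^"]+)"', topic_str)
    return [(word, float(weight)) for weight, word in pairs]

weights = parse_topic(topic_1)
positive = [word for word, weight in weights if weight > 0]
negative = [word for word, weight in weights if weight < 0]
print(positive)
print(negative)
```

Read this way, Topic 1 contrasts descriptions that mention "new" and "life" with those that mention "young", "man", and "woman".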


In [18]:
# Project every document's bag of words into the 2-topic LSI space
corpus_lsi = lsi[bow]

# Collect each document's score on both topics (0.0 if a topic is absent)
score1 = []
score2 = []
for doc in corpus_lsi:
    scores = dict(doc)
    score1.append(round(scores.get(0, 0.0), 2))
    score2.append(round(scores.get(1, 0.0), 2))

# Build a DataFrame with the text and both topic scores
topic_df = pd.DataFrame()
topic_df['Text'] = netflix_data_lsa['description']
topic_df['Topic 0 score'] = score1
topic_df['Topic 1 score'] = score2

# Assign each description to the topic with the highest score
topic_df['Topic'] = topic_df[['Topic 0 score', 'Topic 1 score']].to_numpy().argmax(axis=1)
topic_df.head()


## Question 3 (10 points):
**Generate K topics using lda2vec, choosing the number of topics K by coherence score, then summarize the topics.**

You may refer to the code here: https://nbviewer.org/github/cemoody/lda2vec/blob/master/examples/twenty_newsgroups/lda2vec/lda2vec.ipynb
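As a rough picture of what sets lda2vec apart from plain LDA: each document gets a vector built as a probability-weighted mix of topic vectors, and that document vector is added to a word vector to form the context used for prediction. The NumPy sketch below illustrates only that composition step with random, hypothetical vectors. It does not use the `lda2vec` package and involves no training.

```python
import numpy as np

rng = np.random.default_rng(0)
n_topics, dim = 3, 4

# All values below are hypothetical stand-ins for learned parameters
topic_vectors = rng.normal(size=(n_topics, dim))  # one embedding per topic
doc_weights = np.array([2.0, 0.5, -1.0])          # unnormalized document-topic weights
word_vector = rng.normal(size=dim)                # embedding of the pivot word

def softmax(x):
    """Numerically stable softmax, turning weights into proportions."""
    e = np.exp(x - x.max())
    return e / e.sum()

doc_topic_probs = softmax(doc_weights)            # topic proportions for the document
doc_vector = doc_topic_probs @ topic_vectors      # weighted mix of topic vectors
context_vector = word_vector + doc_vector         # word plus document context

print(np.round(doc_topic_probs, 3))
```

The topic proportions in `doc_topic_probs` play the role that the document-topic distribution plays in LDA, which is what makes the resulting topics interpretable.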

In [19]:
# Write your code here
!pip install lda2vec
!pip install preprocessing
!pip install Corpus
!pip install pyLDAvis



In [20]:
# Import required libraries
import numpy as np

import pyLDAvis
pyLDAvis.enable_notebook()

In [21]:
import pandas as pd
import re
# Loading the  data
netflix_data_ida = pd.read_csv('netflix_titles_with_IMDB.csv', usecols=['show_id','listed_in', 'description'])

# Remove punctuation from the descriptions (lowercasing happens later in simple_preprocess)
netflix_data_ida['description_processed'] = \
netflix_data_ida['description'].map(lambda x: re.sub(r'[,\.!?]', '', x))

# Show first few rows from dataframe
netflix_data_ida['description_processed'].head(10)


0    As her father nears the end of his life filmma...
1    After crossing paths at a party a Cape Town te...
2    To protect his family from a powerful drug lor...
3    Feuds flirtations and toilet talk go down amon...
4    In a city of coaching centers known to train I...
5    The arrival of a charismatic young priest brin...
6    Equestria's divided But a bright-eyed hero bel...
7    On a photo shoot in Ghana an American model sl...
8    A talented batch of amateur bakers face off in...
9    A woman adjusting to life after a loss contend...
Name: description_processed, dtype: object

In [22]:
# Import required gensim libraries
import gensim
from gensim.utils import simple_preprocess
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!



In [23]:
# Build the stop-word list
stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])

# Tokenize each sentence into lowercase words, stripping punctuation and accents
def sent_to_words(sentences):
    for sentence in sentences:
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)

# Drop stop words from each already-tokenized document
def remove_stopwords(texts):
    return [[word for word in doc if word not in stop_words] for doc in texts]

# Convert the cleaned 'description' column to a list
data = netflix_data_ida.description_processed.values.tolist()
data_words = list(sent_to_words(data))

# remove stop words
data_words = remove_stopwords(data_words)
print(data_words[:1][0][:30])
print(data_words[4:5][0][:30])


['father', 'nears', 'end', 'life', 'filmmaker', 'kirsten', 'johnson', 'stages', 'death', 'inventive', 'comical', 'ways', 'help', 'face', 'inevitable']
['city', 'coaching', 'centers', 'known', 'train', 'india', 'finest', 'collegiate', 'minds', 'earnest', 'unexceptional', 'student', 'friends', 'navigate', 'campus', 'life']


In [24]:
# Import the required libraries
from gensim import corpora

# Create a dictionary with the corpus and Convert corpus to a bag of words
corpus = data_words
dictionary = corpora.Dictionary(corpus)
bow = [dictionary.doc2bow(text) for text in corpus]


In [25]:
# Import the required libraries
from gensim.models import LsiModel
from gensim.models.coherencemodel import CoherenceModel

for i in range(3,8):
    lsi = LsiModel(bow,num_topics=i, id2word=dictionary)
    coherence_model = CoherenceModel(model=lsi, texts=data_words, dictionary=dictionary, coherence='c_v')
    coherence_score = coherence_model.get_coherence()
    print('Coherence score with {} clusters: {}'.format(i, coherence_score))

Coherence score with 3 clusters: 0.25766676641326325
Coherence score with 4 clusters: 0.3036966431179765
Coherence score with 5 clusters: 0.2477139507672014
Coherence score with 6 clusters: 0.26796498857924017
Coherence score with 7 clusters: 0.25066666167407503


In [26]:
# Import the required libraries
from gensim.models.ldamulticore import LdaMulticore

# NOTE: this cell fits a gensim LDA model; lda2vec itself is not used here
# K = 4 scored highest (c_v = 0.3037) in the coherence sweep above
num_topics_lda = 4

# Build the LDA model from the bag-of-words corpus and its matching dictionary
lda_model = LdaMulticore(corpus=bow, id2word=dictionary, num_topics=num_topics_lda, random_state=42)

# Print the keywords for each topic
print(lda_model.print_topics())

# Per-document topic distributions
doc_lda = lda_model[bow]

## Question 4 (10 points):
**Generate K topics using BERTopic, choosing the number of topics K by coherence score, then summarize the topics.**

You may refer to the code here: https://colab.research.google.com/drive/1FieRA9fLdkQEGDIMYl0I3MCjSUKVF8C-?usp=sharing
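To summarize BERTopic topics it helps to know roughly how topic words are ranked: documents are clustered first, then words are scored per cluster with a class-based TF-IDF. The standard-library sketch below illustrates that weighting as term frequency within a topic times `log(1 + average words per topic / term frequency across all topics)`. The exact normalization, the helper `class_tf_idf`, and the toy tokens are assumptions, and no call to the `bertopic` library is made.

```python
import math
from collections import Counter

def class_tf_idf(tokens_by_topic):
    """Score each word per topic: frequent within the topic, rare across topics."""
    counts = {topic: Counter(tokens) for topic, tokens in tokens_by_topic.items()}

    # Term frequency across all topics, and the average number of words per topic
    corpus_counts = Counter()
    for topic_counts in counts.values():
        corpus_counts.update(topic_counts)
    avg_words_per_topic = sum(corpus_counts.values()) / len(counts)

    scores = {}
    for topic, topic_counts in counts.items():
        n_words = sum(topic_counts.values())
        scores[topic] = {
            word: (count / n_words) * math.log(1 + avg_words_per_topic / corpus_counts[word])
            for word, count in topic_counts.items()
        }
    return scores

# Hypothetical tokens pooled from all documents assigned to each topic
tokens_by_topic = {
    0: ["love", "family", "love", "life"],
    1: ["crime", "police", "crime", "life"],
}
scores = class_tf_idf(tokens_by_topic)
top_words = {topic: max(s, key=s.get) for topic, s in scores.items()}
print(top_words)
```

The word "life" appears in both topics, so it scores lower than the topic-specific words even though it is as frequent as "family" within topic 0.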

In [28]:
# Write your code here
import pandas as pd
import re
# Load data
netflix_data_bert = pd.read_csv('netflix_titles_with_IMDB.csv', usecols=['show_id','listed_in', 'description'])


# Remove punctuation from the descriptions (lowercasing is applied later inside preprocess)
netflix_data_bert['description_processed'] = \
netflix_data_bert['description'].map(lambda x: re.sub(r'[,\.!?]', '', x))

netflix_data_bert['description_processed'].head(10)



0    As her father nears the end of his life filmma...
1    After crossing paths at a party a Cape Town te...
2    To protect his family from a powerful drug lor...
3    Feuds flirtations and toilet talk go down amon...
4    In a city of coaching centers known to train I...
5    The arrival of a charismatic young priest brin...
6    Equestria's divided But a bright-eyed hero bel...
7    On a photo shoot in Ghana an American model sl...
8    A talented batch of amateur bakers face off in...
9    A woman adjusting to life after a loss contend...
Name: description_processed, dtype: object

In [29]:
from gensim.parsing.preprocessing import remove_stopwords, strip_punctuation , preprocess_string, strip_short, stem_text

# Preprocess the given data
def preprocess(text):

    # clean text based on given filters
    CUSTOM_FILTERS = [lambda x: x.lower(),
                                remove_stopwords,
                                strip_punctuation,
                                strip_short,
                                stem_text]
    text = preprocess_string(text, CUSTOM_FILTERS)

    return text

# Apply the preprocessing function to every description in this question's DataFrame
netflix_data_bert['Review_Text'] = netflix_data_bert['description'].apply(preprocess)
netflix_data_bert


Unnamed: 0,show_id,listed_in,description,description_processed,Review_Text
0,s1,Documentaries,"As her father nears the end of his life, filmm...",As her father nears the end of his life filmma...,"[father, near, end, life, filmmak, kirsten, jo..."
1,s2,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t...",After crossing paths at a party a Cape Town te...,"[cross, path, parti, cape, town, teen, set, pr..."
2,s3,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...,To protect his family from a powerful drug lor...,"[protect, famili, power, drug, lord, skill, th..."
3,s4,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo...",Feuds flirtations and toilet talk go down amon...,"[feud, flirtat, toilet, talk, incarcer, women,..."
4,s5,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...,In a city of coaching centers known to train I...,"[citi, coach, center, known, train, india’, fi..."
...,...,...,...,...,...
8802,s8803,"Cult Movies, Dramas, Thrillers","A political cartoonist, a crime reporter and a...",A political cartoonist a crime reporter and a ...,"[polit, cartoonist, crime, report, pair, cop, ..."
8803,s8804,"Kids' TV, Korean TV Shows, TV Comedies","While living alone in a spooky town, a young g...",While living alone in a spooky town a young gi...,"[live, spooki, town, young, girl, befriend, mo..."
8804,s8805,"Comedies, Horror Movies",Looking to survive in a world taken over by zo...,Looking to survive in a world taken over by zo...,"[look, surviv, world, taken, zombi, dorki, col..."
8805,s8806,"Children & Family Movies, Comedies","Dragged from civilian life, a former superhero...",Dragged from civilian life a former superhero ...,"[drag, civilian, life, superhero, train, new, ..."


In [30]:
!pip install bertopic



In [31]:
# Import the required libraries
from bertopic import BERTopic

# BERTopic embeds raw sentences, so pass the original descriptions as a list of strings
docs = netflix_data_bert['description'].tolist()

# Create and fit the BERTopic model
topic_model = BERTopic(language="english")
topics, probs = topic_model.fit_transform(docs)


In [32]:
# Import the required libraries
from gensim import corpora

# Create a dictionary with the corpus
corpus = netflix_data_bert['Review_Text']
dictionary = corpora.Dictionary(corpus)

# Convert corpus into a bag of words
bow = [dictionary.doc2bow(text) for text in corpus]


In [33]:
# Import the required libraries
from gensim.models import LsiModel
from gensim.models.coherencemodel import CoherenceModel

# Score K = 3..7 with c_v coherence; texts must be the same stemmed tokens the dictionary was built from
for i in range(3,8):
    lsi = LsiModel(bow,num_topics=i, id2word=dictionary)
    coherence_model = CoherenceModel(model=lsi, texts=netflix_data_bert['Review_Text'].tolist(), dictionary=dictionary, coherence='c_v')
    coherence_score = coherence_model.get_coherence()
    print('Coherence score with {} clusters: {}'.format(i, coherence_score))

# Show the top 5 words for the 5 largest BERTopic topics
topic_model.visualize_barchart(top_n_topics=5, n_words = 5, width = 350, height = 350)


## Extra Question (5 Points)

**Compare the results generated by the four topic modeling algorithms. Which one is better? Explain your reasons in detail.**

**This question will compensate for any points deducted in this exercise. Maximum marks for the exercise is 40 points.**

In [None]:
# Write your code here


# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment.

Consider the following points in your response:

**Learning Experience:** Describe your overall learning experience in working with text data and extracting features using various topic modeling algorithms. Did you understand these algorithms, and did the implementations help you grasp the nuances of feature extraction from text data?

**Challenges Encountered:** Were there specific difficulties in completing this exercise?

**Relevance to Your Field of Study:** How does this exercise relate to the field of NLP?

**(Your submission will not be graded if this question is left unanswered)**



In [34]:
# Your answer here (no code for this question; write your answer in as much detail as possible for the above questions):

'''
Please write your answer here:
I found this exercise difficult because these modeling techniques are very new to me, but I learned them by working through the exercise.
I am comfortable extracting the data and applying feature extraction to the text data.
The main challenges I encountered were applying the models and setting the number of topics K in the code.
'''

