# QM 701: Advanced Data Analytics and Applications
# Homework 4
---
## Objective
In this homework, we look at Latent Dirichlet Allocation (LDA) for topic modeling and apply it to a dataset of book summaries. Through this assignment, you will work through the process of topic modeling: training multiple LDA models, evaluating their performance, and inspecting the topics they generate.
## Tasks
This homework includes the following 7 questions:
* **Q1**: Model Training. You will train two LDA models, one with 5 topics and another with 10 topics. (10 points)
* **Q2**: Model Evaluation. You will calculate the perplexity of each model and perform model selection. (15 points)
* **Q3**: Model Inspection. You will find the topic words of each model and compare these two models. (10 points)
* **Q4**: Model Test with Classic Books. You will test the model with several books and interpret the results (20 points)
* **Q5**: Conceptual Questions on Topic Modeling (20 points)
* **Q6**: Parsing with spaCy (15 points)
* **Q7**: Named Entity Recognition (NER) (10 points)

## Environment Setup

In [2]:
import pandas as pd
import numpy as np
import csv
import yaml

from gensim.models import LdaModel
from gensim.corpora import Dictionary
from gensim import corpora
import gensim.parsing.preprocessing as preprocessing
from gensim.utils import simple_preprocess

import nltk
nltk.download(["averaged_perceptron_tagger", "wordnet"])

from pprint import pprint
from tqdm.auto import tqdm
tqdm.pandas()

# plotting
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
from wordcloud import WordCloud

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


## Data Setup
We use the CMU Book Summary Dataset for this homework. The dataset contains plot summaries for 16,559 books extracted from Wikipedia. You can check more details about the dataset from its [website](https://www.cs.cmu.edu/~dbamman/booksummaries.html).

### Data Loading
By running the following code, we download the dataset and read it into the pandas DataFrame `book_df`.

In [3]:
# download the dataset
!wget https://duke.box.com/shared/static/jeumvyorvtl4fpcy75bh3dw4ai5ap43b -O book.csv

# read the dataset in a pandas dataframe
book_df = pd.read_csv("book.csv", header = 0, usecols = ["title", "summary", "summary_name_removed"])

--2024-07-26 02:57:57--  https://duke.box.com/shared/static/jeumvyorvtl4fpcy75bh3dw4ai5ap43b
Resolving duke.box.com (duke.box.com)... 74.112.186.157
Connecting to duke.box.com (duke.box.com)|74.112.186.157|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /public/static/jeumvyorvtl4fpcy75bh3dw4ai5ap43b [following]
--2024-07-26 02:57:57--  https://duke.box.com/public/static/jeumvyorvtl4fpcy75bh3dw4ai5ap43b
Reusing existing connection to duke.box.com:443.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://duke.app.box.com/public/static/jeumvyorvtl4fpcy75bh3dw4ai5ap43b [following]
--2024-07-26 02:57:57--  https://duke.app.box.com/public/static/jeumvyorvtl4fpcy75bh3dw4ai5ap43b
Resolving duke.app.box.com (duke.app.box.com)... 74.112.186.157
Connecting to duke.app.box.com (duke.app.box.com)|74.112.186.157|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://public.boxcloud.com/d/1/b1!2r4fx

The dataset has the following three columns:

* `title`: the title of each book.
* `summary`: the plot summary of each book.
* `summary_name_removed`: the plot summary with all people's names removed.

Since people's names are irrelevant for topic modeling of the books, we use named entity recognition (NER) to remove them.

Let's take a peek at the first five rows in `book_df`.

In [None]:
book_df.head()

| | title | summary | summary_name_removed |
|---|---|---|---|
| 0 | Animal Farm | Old Major, the old boar on the Manor Farm, ca... | Old Major , the old boar on , calls the anim... |
| 1 | A Clockwork Orange | Alex, a teenager living in near-future Englan... | , a teenager living in near - future England... |
| 2 | The Plague | The text of The Plague is divided into five p... | The text of The Plague is divided into five ... |
| 3 | An Enquiry Concerning Human Understanding | The argument of the Enquiry proceeds by a ser... | The argument of the proceeds by a series of ... |
| 4 | A Fire Upon the Deep | The novel posits that space around the Milky ... | The novel posits that space around the Milky... |


### Data Preprocessing
We preprocess the text in `summary_name_removed` with the following steps:

* Lowercase the text.
* Remove punctuation, digits, repeated whitespace, and stopwords.
* Lemmatize the tokens.

Each processed summary will be converted into a list of tokens and stored in the column `processed_summary`.

You can run the preprocessing code as-is, with no edits needed. It will take **6-8 minutes**.
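To make the filtering steps concrete, here is a minimal standalone sketch on a toy sentence. It uses only the standard library rather than the `gensim` filters from the notebook, and both the `TOY_STOPWORDS` set and the `toy_preprocess` helper are hypothetical names introduced for illustration. Lemmatization is left out, so "calls" is not reduced to "call".

```python
import re
import string

# Tiny illustrative stopword list (an assumption; gensim ships a much larger one).
TOY_STOPWORDS = {"the", "on", "a", "an", "and", "of", "to", "in"}

def toy_preprocess(text):
    """Lowercase, strip punctuation and digits, collapse whitespace, drop stopwords."""
    text = text.lower()
    # Replace punctuation with spaces so that "near-future" would split into two tokens.
    text = re.sub(f"[{re.escape(string.punctuation)}]", " ", text)
    # Remove digits.
    text = re.sub(r"\d+", "", text)
    # split() with no arguments also collapses repeated whitespace.
    return [tok for tok in text.split() if tok not in TOY_STOPWORDS]

tokens = toy_preprocess("The old boar on the farm, in 1945, calls the animals!")
print(tokens)  # ['old', 'boar', 'farm', 'calls', 'animals']
```

The notebook pipeline applies the same kind of filters first and then lemmatizes the surviving tokens with WordNet.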

In [4]:
# drop books with empty summaries
book_df.dropna(axis="index", subset=["summary_name_removed"], inplace=True)

# preprocess
def lemmatize_text(token_list, wnl):
  """
  This function tags the pos of the input tokens,
  and lemmatize the tokens.
  """
  # POS tag each word
  for word, tag in nltk.pos_tag(token_list):
    # Mapping the pos tags to the types supported by wnl
    if tag.startswith("NN"):
      yield wnl.lemmatize(word, pos='n')
    elif tag.startswith('VB'):
      yield wnl.lemmatize(word, pos='v')
    elif tag.startswith('JJ'):
      yield wnl.lemmatize(word, pos='a')
    elif tag.startswith('RB'):
      yield wnl.lemmatize(word, pos='r')
    else:
      yield wnl.lemmatize(word)

# lower letters, remove punctuations, digits, multiple whitespaces, and stopwords
CUSTOM_FILTERS = [lambda x: x.lower(), preprocessing.strip_punctuation, preprocessing.strip_numeric, preprocessing.strip_multiple_whitespaces, preprocessing.remove_stopwords]
book_df["processed_summary"] = book_df["summary_name_removed"].apply(lambda x: preprocessing.preprocess_string(x, CUSTOM_FILTERS))

# lemmatize the tokens
wnl = nltk.WordNetLemmatizer()
book_df["processed_summary"] = book_df["processed_summary"].progress_apply(lambda x: simple_preprocess(" ".join(lemmatize_text(x, wnl))))


Let's compare the summaries before and after the preprocessing steps.

In [None]:
book_df[["summary", "summary_name_removed", "processed_summary"]].head()

| | summary | summary_name_removed | processed_summary |
|---|---|---|---|
| 0 | Old Major, the old boar on the Manor Farm, ca... | Old Major , the old boar on , calls the anim... | [old, major, old, boar, call, animal, farm, me... |
| 1 | Alex, a teenager living in near-future Englan... | , a teenager living in near - future England... | [teenager, live, near, future, england, lead, ... |
| 2 | The text of The Plague is divided into five p... | The text of The Plague is divided into five ... | [text, plague, divide, part, town, oran, thous... |
| 3 | The argument of the Enquiry proceeds by a ser... | The argument of the proceeds by a series of ... | [argument, proceeds, series, incremental, step... |
| 4 | The novel posits that space around the Milky ... | The novel posits that space around the Milky... | [novel, posit, space, milky, way, divide, conc... |


## Q1 **Model Training**
Based on the processed plot summaries, we train LDA models to infer the topics of each book.

The first step is to build the corpus. You can run the following code to generate `book_corpus` as-is, with no edits needed. Here, we filter out the 50 most frequent tokens across all summaries.

In [5]:
# create a dictionary from the processed summaries
summary_list = book_df["processed_summary"].tolist()
book_dictionary = Dictionary(summary_list)
# filter top 50 most frequent tokens
book_dictionary.filter_n_most_frequent(50)
# build the corpus for LDA training
book_corpus = [book_dictionary.doc2bow(text) for text in summary_list]
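The dictionary and bag-of-words steps above can be sketched on a toy corpus with the standard library alone. This is an illustration under stated assumptions, not `gensim`'s implementation: we assume "most frequent" is ranked by the number of documents a token appears in, and `to_bow`, `token2id`, and `toy_corpus` are hypothetical names.

```python
from collections import Counter

docs = [
    ["king", "battle", "army", "king"],
    ["king", "ship", "planet"],
    ["king", "murder", "police"],
]

# Document frequency: in how many documents does each token appear?
doc_freq = Counter(tok for doc in docs for tok in set(doc))

# Drop the single most frequent token (the notebook drops the top 50).
n_remove = 1
too_common = {tok for tok, _ in doc_freq.most_common(n_remove)}

# Assign integer ids to the remaining vocabulary, sorted for reproducibility.
vocab = sorted(set(doc_freq) - too_common)
token2id = {tok: i for i, tok in enumerate(vocab)}

def to_bow(doc):
    """Return sorted (token_id, count) pairs, ignoring out-of-vocabulary tokens."""
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())

toy_corpus = [to_bow(doc) for doc in docs]
print(too_common)     # {'king'}
print(toy_corpus[0])  # [(0, 1), (1, 1)]
```

Removing a token that appears in nearly every document keeps it from dominating every topic, which is the motivation for the filtering step in the notebook.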

### a) Training an LDA model with 10 topics
The following code trains an LDA model called `lda_10topics` with `num_topics = 10`. You can run it as-is, with no edits needed.

For more information about the parameters, you may check `gensim`'s official document about the LDA model [here](https://radimrehurek.com/gensim/models/ldamodel.html#gensim.models.ldamodel.LdaModel).

In [6]:
lda_10topics = LdaModel(corpus = book_corpus, num_topics = 10, id2word = book_dictionary, passes = 2, iterations = 50, random_state = 30)

### b) Training an LDA model with 5 topics
Next, train an LDA model called `lda_5topics` with 5 topics.

In [7]:
# Your code for Q1 b)
lda_5topics = LdaModel(corpus=book_corpus, num_topics=5, id2word=book_dictionary, passes=2, iterations=50, random_state=30)

## Q2 **Model Evaluation**
After training the two LDA models, `lda_10topics` and `lda_5topics`, we need to evaluate their performance.

### a) Perplexity
**For each LDA model**, write code to print out the perplexity and fill in the blanks.

Hint: The perplexity should be a positive number. You may check the class notebook about how to calculate perplexity scores. The calculation may take up to **2 mins**.
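As a reminder of the definition, perplexity is 2 raised to the negative average log2-probability per token, so it is always at least 1 and lower is better. `gensim`'s `log_perplexity` returns a per-word likelihood bound, which is negative in practice, so the exponent needs a minus sign. The sketch below illustrates the definition on hand-written token probabilities; the `perplexity` helper is a hypothetical name and does not call `gensim`.

```python
import math

def perplexity(token_probs):
    """Perplexity = 2 ** (-average log2-probability per token)."""
    avg_log2 = sum(math.log2(p) for p in token_probs) / len(token_probs)
    return 2 ** (-avg_log2)

# A model that gives probability 1/4 to every observed token
# is as uncertain as a fair four-way choice.
uniform = perplexity([0.25, 0.25, 0.25, 0.25])

# A model that is more confident about the observed tokens scores lower.
confident = perplexity([0.5, 0.5, 0.5, 0.5])

print(uniform, confident)  # 4.0 2.0
```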

In [8]:
# Your code for Q2 a)
# log_perplexity returns the per-word likelihood bound (a negative number),
# so the perplexity is 2 ** (-bound).

# Calculate and print perplexity for lda_10topics
perplexity_10topics = 2 ** (-lda_10topics.log_perplexity(book_corpus))
print(f'Perplexity for lda_10topics: {perplexity_10topics:.1f}')

# Calculate and print perplexity for lda_5topics
perplexity_5topics = 2 ** (-lda_5topics.log_perplexity(book_corpus))
print(f'Perplexity for lda_5topics: {perplexity_5topics:.1f}')

Perplexity for lda_10topics: 545.5
Perplexity for lda_5topics: 434.2


The perplexity for `lda_5topics` = 434.2
The perplexity for `lda_10topics` = 545.5

### b) Hyperparameter tuning via Perplexity

Revise the LDA model training task as follows: Construct two additional LDA models, setting `num_topics` to 10 for both, but vary the `passes` parameter - use `passes = 3` for one model and `passes = 5` for the other. After training, print and compare their perplexity scores with the original `lda_10topics` model.

Hint: you should give each trained model (one per value of `passes`) a different name.

In [9]:
# Your code for Q2 b)
# Train LDA model with 10 topics and 3 passes
lda_10topics_3passes = LdaModel(corpus=book_corpus, num_topics=10, id2word=book_dictionary, passes=3, iterations=50, random_state=30)

# Train LDA model with 10 topics and 5 passes
lda_10topics_5passes = LdaModel(corpus=book_corpus, num_topics=10, id2word=book_dictionary, passes=5, iterations=50, random_state=30)

# Calculate and print perplexity for lda_10topics (perplexity = 2 ** (-bound))
perplexity_10topics = 2 ** (-lda_10topics.log_perplexity(book_corpus))
print(f'Perplexity for lda_10topics with 2 passes: {perplexity_10topics:.1f}')

# Calculate and print perplexity for lda_10topics_3passes
perplexity_10topics_3passes = 2 ** (-lda_10topics_3passes.log_perplexity(book_corpus))
print(f'Perplexity for lda_10topics with 3 passes: {perplexity_10topics_3passes:.1f}')

# Calculate and print perplexity for lda_10topics_5passes
perplexity_10topics_5passes = 2 ** (-lda_10topics_5passes.log_perplexity(book_corpus))
print(f'Perplexity for lda_10topics with 5 passes: {perplexity_10topics_5passes:.1f}')

Perplexity for lda_10topics with 2 passes: 545.5
Perplexity for lda_10topics with 3 passes: 538.9
Perplexity for lda_10topics with 5 passes: 532.9


### c) Model selection based on perplexity

As seen in Q2 b), the perplexity score can be used as one criterion for choosing a model. However, it should not be the only criterion when we compare models with different numbers of topics.

Please explain why selecting between a 5-topic model and a 10-topic model solely based on perplexity scores is **NOT** a great idea if we aim to categorize the books in the dataset.


Please enter your answer to Q2 c) here: <br>
Choosing a model based only on perplexity can result in a model that offers no meaningful insights. <br><br>
A low perplexity score does not necessarily mean a more interpretable or useful model. <br><br>
Perplexity and Interpretability: Perplexity measures how well the model predicts the data, but it doesn't consider how easy the topics are to understand. A model with more topics (like the 10-topic model) might have a better perplexity score but produce topics that are harder to interpret. A 5-topic model might offer broader, easier-to-understand categories that are more useful for categorization.<br><br>

Topic Coherence: This measures how logically the words within each topic fit together. A model might have a good perplexity score but low coherence, meaning the topics it creates are not meaningful or useful for categorizing books.<br><br>

Relevance to Application: The goal is to create meaningful categories. If the data naturally falls into about five main themes, a 10-topic model might create unnecessary, less useful distinctions.<br><br>

Overfitting: More topics can mean the model captures noise rather than the actual patterns in the data. This makes the topics less useful for future data.<br><br>

Balance of Detail and Usability: More topics provide more detail but can overwhelm users. It's important to balance having enough topics to capture distinct categories without becoming overly complex.<br><br>

Perplexity is useful but should be combined with other metrics like topic coherence and interpretability. This ensures the chosen model is both statistically sound and practically useful.
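
To illustrate what a coherence score measures, here is a minimal sketch of a UMass-style score computed from document co-occurrence counts. It is a simplified illustration rather than the exact metric of any particular library; `umass_coherence` is a hypothetical helper, and we assume every top word appears in at least one document.

```python
import math
from itertools import combinations

def umass_coherence(top_words, docs):
    """Sum of log((D(w_i, w_j) + 1) / D(w_j)) over pairs of top words."""
    doc_sets = [set(d) for d in docs]

    def doc_count(*words):
        # Number of documents that contain all of the given words.
        return sum(all(w in s for w in words) for s in doc_sets)

    score = 0.0
    # combinations keeps list order, so w_j is always ranked above w_i.
    for w_j, w_i in combinations(top_words, 2):
        score += math.log((doc_count(w_i, w_j) + 1) / doc_count(w_j))
    return score

docs = [
    ["murder", "police", "suspect"],
    ["murder", "police"],
    ["ship", "planet", "crew"],
    ["ship", "planet"],
]

coherent_score = umass_coherence(["murder", "police"], docs)
mixed_score = umass_coherence(["murder", "planet"], docs)
print(coherent_score > mixed_score)  # True
```

Words that tend to appear in the same documents score higher than words that never co-occur, which is the intuition behind using coherence alongside perplexity.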

## Q3 **Model Inspection**

### a) Topic words

For both `lda_10topics` and `lda_5topics`, write code to extract and display the top 10 words from each topic.


Hint: You may need the function `.print_topics()`. You can check the class notebook for more details.

In [10]:
# Your code for Q3 a)
# Extract and display the top 10 words from each topic for lda_10topics
print("Top 10 words for each topic in lda_10topics:")
for idx, topic in lda_10topics.print_topics(num_words=10):
    print(f"Topic {idx + 1}: {topic}")

# Extract and display the top 10 words from each topic for lda_5topics
print("\nTop 10 words for each topic in lda_5topics:")
for idx, topic in lda_5topics.print_topics(num_words=10):
    print(f"Topic {idx + 1}: {topic}")

Top 10 words for each topic in lda_10topics:
Topic 1: 0.008*"child" + 0.008*"mother" + 0.007*"son" + 0.007*"marry" + 0.007*"daughter" + 0.006*"wife" + 0.006*"brother" + 0.005*"husband" + 0.005*"fall" + 0.004*"marriage"
Topic 2: 0.020*"character" + 0.012*"write" + 0.008*"author" + 0.007*"narrator" + 0.006*"include" + 0.005*"section" + 0.005*"reader" + 0.005*"film" + 0.005*"protagonist" + 0.005*"play"
Topic 3: 0.052*"vampire" + 0.032*"heaven" + 0.017*"angel" + 0.015*"human" + 0.012*"blood" + 0.009*"marina" + 0.008*"god" + 0.006*"hell" + 0.006*"egypt" + 0.006*"fledgling"
Topic 4: 0.021*"murder" + 0.010*"case" + 0.009*"police" + 0.006*"crime" + 0.006*"suspect" + 0.006*"wife" + 0.005*"miss" + 0.005*"investigation" + 0.005*"killer" + 0.004*"body"
Topic 5: 0.008*"mother" + 0.008*"school" + 0.007*"get" + 0.006*"house" + 0.006*"say" + 0.006*"girl" + 0.005*"want" + 0.005*"start" + 0.005*"run" + 0.004*"boy"
Topic 6: 0.009*"agent" + 0.006*"terrorist" + 0.006*"island" + 0.005*"president" + 0.005*"e

### b) Model selection based on topic words
Based on the outputs of the topic words, answer the following question.

Imagine that we want to get a broad overview of the books in our dataset. Which of the LDA models, `lda_10topics` or `lda_5topics`, would you prefer, and why?

Please enter your answer to Q3 b) here: <br><br>
I'd choose the `lda_5topics` model because it is simpler and more interpretable. Its five broad topics give a high-level overview of the books.

## Q4 **Model Test with Classic Books**

We choose several classic books to test our `lda_10topics` model. For each book, we provide its index in the corpus and the code to display its summary. You can also check the wiki link to learn more about each book.

Your task is to get the topics predicted by the `lda_10topics` model for each book.

Note: In this problem, **we will only test the `lda_10topics` model**.

Hint: You may need the function `.get_document_topics()`. You can check the class notebook for more details.

### a) The Adventures of Tom Sawyer [[wiki]](https://en.wikipedia.org/wiki/The_Adventures_of_Tom_Sawyer)

The book is indexed as 565 in the corpus. You can run the following code to display its title and the first 1000 characters of its summary.

In [11]:
pprint(book_df["title"][565])
pprint(book_df["summary"][565][:1000], width=200, compact=True)

'The Adventures of Tom Sawyer'
(' In the 1840s an imaginative and mischievous boy named Tom Sawyer lives with his Aunt Polly and his half-brother, Sid, in the Mississippi River town of St. Petersburg, Missouri. After playing '
 'hooky from school on Friday and dirtying his clothes in a fight, Tom is made to whitewash the fence as punishment all of the next day. At first, Tom is disheartened by having to forfeit his day '
 'off. However, he soon cleverly persuades his friends to trade him small treasures for the privilege of doing his work. He trades the treasures he got by tricking his friends into whitewashing the '
 'fence for tickets given out in Sunday school for memorizing Bible verses, which can be used to claim a Bible as a prize. He received enough tickets to be given the Bible. However, he loses much of '
 'his glory when, in response to a question to show off his knowledge, he incorrectly answers that the first disciples were David and Goliath. Tom falls in love with Becky Th

Write code to get the topics predicted by the `lda_10topics` model for *The Adventures of Tom Sawyer*.

In [13]:
# Your code for Q4 a)
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
# Sample text from the summary of "The Adventures of Tom Sawyer"
tom_sawyer_summary = book_df["summary"][565]

# Preprocess the text
def preprocess(text):
    return [word for word in simple_preprocess(text) if word not in STOPWORDS]
# Preprocess the text
tom_sawyer_preprocessed_summary = preprocess(tom_sawyer_summary)

# Convert to bag-of-words format
tom_sawyer_bow_vector = book_dictionary.doc2bow(tom_sawyer_preprocessed_summary)

# Get the topic distribution for the document
tom_sawyer_topic_distribution = lda_10topics[tom_sawyer_bow_vector]

print(f"Predicted topics for 'The Adventures of Tom Sawyer':")
pprint(tom_sawyer_topic_distribution)

Predicted topics for 'The Adventures of Tom Sawyer':
[(0, 0.10516674), (3, 0.044732172), (4, 0.5651654), (6, 0.27699742)]


### b) Harry Potter and the Chamber of Secrets [[wiki]](https://en.wikipedia.org/wiki/Harry_Potter_and_the_Chamber_of_Secrets)

The book is indexed as 480 in the corpus. You can run the following code to display its title and the first 1000 characters of its summary.

In [14]:
pprint(book_df["title"][480])
pprint(book_df["summary"][480][:1000], width=200, compact=True)

'Harry Potter and the Chamber of Secrets'
(' Harry Potter and the Chamber of Secrets begins as Harry spends a miserable summer with his only remaining family, the Dursleys. During a dinner party hosted by his uncle and aunt, Harry is '
 'visited by Dobby, a house-elf. Dobby warns Harry not to return to Hogwarts, the magical school for wizards that Harry attended the previous year, explaining that terrible things will happen there. '
 'that Harry is not allowed to use magic away from Hogwarts. Harry is rescued by his friend Ron Weasley and his brothers Fred and George in a flying car, and spends the rest of the summer at the '
 'Weasley home. When Harry uses Floo Powder to get to Diagon Alley he accidentally ends up in a dark-arts dealing end of town, Knockturn Alley. Fortunately, he meets Hagrid who gets him back to '
 'Diagon Alley. While shopping for s')


Write code to get the topics predicted by the `lda_10topics` model for *Harry Potter and the Chamber of Secrets*.

In [15]:
# Your code for Q4 b)
# Sample text from the summary of "Harry Potter and the Chamber of Secrets"
harry_potter_summary = book_df["summary"][480]

# Preprocess the text
harry_potter_preprocessed_summary = preprocess(harry_potter_summary)

# Convert to bag-of-words format
harry_potter_bow_vector = book_dictionary.doc2bow(harry_potter_preprocessed_summary)

# Get the topic distribution for the document
harry_potter_topic_distribution = lda_10topics[harry_potter_bow_vector]

print(f"Predicted topics for 'Harry Potter and the Chamber of Secrets':")
pprint(harry_potter_topic_distribution)

Predicted topics for 'Harry Potter and the Chamber of Secrets':
[(0, 0.054424617), (3, 0.0261911), (4, 0.38797382), (6, 0.53019416)]


### c) Contact [[wiki]](https://en.wikipedia.org/wiki/Contact_(novel))

The book is indexed as 532 in the corpus. You can run the following code to display its title and the first 1000 characters of its summary.

In [16]:
pprint(book_df["title"][532])
pprint(book_df["summary"][532][:1000], width=200, compact=True)

'Contact'
(' Eleanor "Ellie" Arroway is the director of "Project Argus," in which scores of radio telescopes in New Mexico have been dedicated to the search for extraterrestrial intelligence (SETI). The '
 'project discovers the first confirmed communication from extraterrestrial beings. The communication is a repeating series of the first 261 prime numbers (a sequence of prime numbers is a commonly '
 'predicted first message from alien intelligence, since mathematics is considered a "universal language," and it is conjectured that algorithms that produce successive prime numbers are '
 'sufficiently complicated so as to require intelligence to implement them). Further analysis reveals that a second message is contained in polarization modulation of the signal. The second message '
 "is a retransmission of Earth's first television signal broadcast powerful enough to escape the ionosphere and be received in interstellar space; in this case, Adolf Hitler's opening speech at the "
 '1936

Write code to get the topics predicted by the `lda_10topics` model for *Contact*.

In [17]:
# Your code for Q4 c)
# Sample text from the summary of "Contact"
contact_summary = book_df["summary"][532]

# Preprocess the text
contact_preprocessed_summary = preprocess(contact_summary)

# Convert to bag-of-words format
contact_bow_vector = book_dictionary.doc2bow(contact_preprocessed_summary)

# Get the topic distribution for the document
contact_topic_distribution = lda_10topics[contact_bow_vector]

print(f"Predicted topics for 'Contact':")
pprint(contact_topic_distribution)

Predicted topics for 'Contact':
[(0, 0.02746116),
 (1, 0.0225474),
 (3, 0.030117117),
 (4, 0.122058876),
 (7, 0.55840635),
 (9, 0.22698976)]



### d) Model Interpretation
Compare the predicted topics for each book from Q4 a), b) and c) with the topic words displayed in Q3 a). Do you agree with the model's predictions?

Hint: You may first guess the meaning of each topic based on the topic words. That will help you to understand the model predictions.

Please enter your answer to Q4 d) here:<br><br>
Top 10 words for each topic in lda_10topics:<br>
Topic 1: 0.008*"child" + 0.008*"mother" + 0.007*"son" + 0.007*"marry" + 0.007*"daughter" + 0.006*"wife" + 0.006*"brother" + 0.005*"husband" + 0.005*"fall" + 0.004*"marriage"<br>
Topic 2: 0.020*"character" + 0.012*"write" + 0.008*"author" + 0.007*"narrator" + 0.006*"include" + 0.005*"section" + 0.005*"reader" + 0.005*"film" + 0.005*"protagonist" + 0.005*"play"<br>
Topic 3: 0.052*"vampire" + 0.032*"heaven" + 0.017*"angel" + 0.015*"human" + 0.012*"blood" + 0.009*"marina" + 0.008*"god" + 0.006*"hell" + 0.006*"egypt" + 0.006*"fledgling"<br>
Topic 4: 0.021*"murder" + 0.010*"case" + 0.009*"police" + 0.006*"crime" + 0.006*"suspect" + 0.006*"wife" + 0.005*"miss" + 0.005*"investigation" + 0.005*"killer" + 0.004*"body"<br>
Topic 5: 0.008*"mother" + 0.008*"school" + 0.007*"get" + 0.006*"house" + 0.006*"say" + 0.006*"girl" + 0.005*"want" + 0.005*"start" + 0.005*"run" + 0.004*"boy"<br>
Topic 6: 0.009*"agent" + 0.006*"terrorist" + 0.006*"island" + 0.005*"president" + 0.005*"escape" + 0.005*"san" + 0.005*"american" + 0.004*"francisco" + 0.004*"tiger" + 0.004*"chinese"<br>
Topic 7: 0.007*"attack" + 0.006*"escape" + 0.005*"king" + 0.005*"city" + 0.005*"fight" + 0.005*"battle" + 0.005*"power" + 0.004*"army" + 0.004*"group" + 0.004*"order"<br>
Topic 8: 0.013*"ship" + 0.010*"human" + 0.009*"earth" + 0.006*"planet" + 0.005*"crew" + 0.005*"destroy" + 0.004*"space" + 0.004*"attack" + 0.004*"alien" + 0.004*"doctor"<br>
Topic 9: 0.023*"roman" + 0.011*"shannon" + 0.010*"gaul" + 0.008*"gallic" + 0.007*"prospero" + 0.007*"druid" + 0.005*"rome" + 0.005*"elephant" + 0.004*"village" + 0.003*"carnage"<br>
Topic 10: 0.010*"war" + 0.007*"state" + 0.006*"chapter" + 0.004*"american" + 0.004*"political" + 0.004*"government" + 0.003*"include" + 0.003*"society" + 0.003*"history" + 0.003*"military"<br>

Predicted topics for 'The Adventures of Tom Sawyer':
[(0, 0.10516674), (3, 0.044732172), (4, 0.5651654), (6, 0.27699742)]<br>

Predicted topics for 'Harry Potter and the Chamber of Secrets':
[(0, 0.054424617), (3, 0.0261911), (4, 0.38797382), (6, 0.53019416)]<br>

Predicted topics for 'Contact':
[(0, 0.02746116),
 (1, 0.0225474),
 (3, 0.030117117),
 (4, 0.122058876),
 (7, 0.55840635),
 (9, 0.22698976)]<br>

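The topic ids returned by the model are 0-based, while the topic list above is printed as `Topic {idx + 1}`, so id 4 corresponds to Topic 5, id 6 to Topic 7, and so on. With that mapping, the model's predictions align at least partially with the main themes (summary section) of "The Adventures of Tom Sawyer," "Harry Potter," and "Contact."<br>
Tom Sawyer loads mainly on id 4 (Topic 5: mother, school, girl, boy) and id 6 (Topic 7: attack, escape, fight, battle), which fits a story about a schoolboy and his adventures.<br>
Harry Potter loads mainly on id 6 (Topic 7) and id 4 (Topic 5), which fits a story set at a school that involves fights and a struggle for power, although no topic captures magic specifically.<br>
Contact loads mainly on id 7 (Topic 8: ship, human, earth, planet, space, alien) and id 9 (Topic 10: war, state, political, government), which fits a science fiction novel with a political thread.<br>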
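
When reading these predictions, note that the topic ids returned by the model are 0-based, whereas the loop in Q3 a) prints `Topic {idx + 1}`. The short sketch below ranks the topics for *Contact* and converts the ids to the printed labels. The numbers are copied from the output of Q4 c); `ranked` and `labelled` are hypothetical names.

```python
# Topic distribution for "Contact", copied from the output of Q4 c).
contact_topics = [(0, 0.02746116), (1, 0.0225474), (3, 0.030117117),
                  (4, 0.122058876), (7, 0.55840635), (9, 0.22698976)]

# Sort from most to least probable and convert 0-based ids to the printed labels.
ranked = sorted(contact_topics, key=lambda pair: pair[1], reverse=True)
labelled = [(f"Topic {idx + 1}", round(prob, 3)) for idx, prob in ranked]

print(labelled[:2])  # [('Topic 8', 0.558), ('Topic 10', 0.227)]
```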


## Q5 **Conceptual Questions on Topic Modeling**



### a) Difference between LSI and LDA.

Please discuss why the LDA model may be preferred over the LSI model when we try to categorize the books in this example.

Please enter your answer to Q5 a) here:<br><br>
Even though LDA (Latent Dirichlet Allocation) may be slower and more difficult to implement, its output is easier for humans to interpret and use, and it is based on statistical inference (an iterative algorithm). Interpretability is very important when we categorize books.<br>
LDA also represents each book as a mixture of topics and each topic as a probability distribution over words, and it tends to produce more coherent and meaningful topics.
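
One concrete reason LSI is harder to interpret is that its components come from a singular value decomposition and can carry negative weights, whereas an LDA topic is a probability distribution over words. The sketch below shows this on a small hand-made term-document matrix with `numpy`. The matrix values are invented for illustration, and `lsi_topics` is a hypothetical name.

```python
import numpy as np

# Toy term-document count matrix: rows are terms, columns are documents.
terms = ["murder", "police", "ship", "planet"]
X = np.array([
    [2, 1, 0, 1],
    [1, 2, 0, 0],
    [0, 0, 2, 1],
    [0, 1, 1, 2],
], dtype=float)

# LSI is a truncated SVD of this matrix; keep the two strongest components.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
lsi_topics = U[:, :2]

for k in range(2):
    weights = ", ".join(f"{t}: {w:+.2f}" for t, w in zip(terms, lsi_topics[:, k]))
    print(f"LSI component {k}: {weights}")

# The second component mixes positive and negative term weights.
has_mixed_signs = bool((lsi_topics[:, 1] > 0).any() and (lsi_topics[:, 1] < 0).any())
print(has_mixed_signs)  # True
```

A negative weight has no direct reading as "this word belongs to the topic", which makes labelling LSI components less straightforward than reading the top words of an LDA topic.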

### b) LDA for Sentiment Analysis

Professor W wants to perform sentiment analysis on a dataset of online restaurant reviews using the Latent Dirichlet Allocation (LDA) model. He plans to train an LDA model with 3 topics using the restaurant reviews. Then, he will manually examine the top ten words in each topic to determine whether each topic is 'positive', 'neutral', or 'negative'.

Do you think this approach will produce a useful sentiment analyzer? Explain why or why not.



Please enter your answer to Q5 b) here:<br><br>
Using the Latent Dirichlet Allocation (LDA) model for sentiment analysis of online restaurant reviews may not be the most effective approach.

### LDA
1. LDA is a topic modeling tool to identify topics (clusters of related words) in a corpus of text. It does not inherently distinguish between sentiments (positive, neutral, negative) but rather groups words into topics based on their co-occurrence patterns in the text.

2. Topics generated by LDA are not guaranteed to correspond to sentiments. For example, a topic might include words related to food items or restaurant ambiance without conveying whether the sentiment is positive or negative.

In addition, Professor W's manual examination of the top ten words could be very subjective, and LDA has a hard time separating mixed feelings and contextual nuance within a review.


## Q6 **Parsing Sentences with spaCy**

In this question, we will shift gears from LDA and apply spaCy to uncover grammatical relationships between tokens in sentences. First, run the code provided below to install and import the spaCy library.


In [19]:
!python -m spacy download en_core_web_md
import spacy
nlp = spacy.load('en_core_web_md')
from spacy import displacy

Collecting en-core-web-md==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.7.1/en_core_web_md-3.7.1-py3-none-any.whl (42.8 MB)
     42.8/42.8 MB 11.2 MB/s eta 0:00:00
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-3.7.1
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


### a) Run the code provided below to perform parsing with spaCy to uncover grammatical relationships between tokens in the example sentence.

The columns in `token_list` stand for the following:
- TOKEN:     The original word text of the token.
- POS:     The simple part-of-speech tag of the token.
- TAG:     The "fine-grained" part-of-speech tag of the token.
- ENTITYTYPE:     The entity type tag of the token.
- DEPENDENT:     The syntactic dependency relation of the token.

In [20]:
# code for parsing a sentence
test_phrase = 'April and Mary went to the library to get math books.'
doc = nlp(test_phrase)
token_list = [(token.text, token.pos_, token.tag_, token.ent_type_, token.dep_) for token in doc]
print(pd.DataFrame(token_list, columns=['TOKEN','POS','TAG','ENTITYTYPE','DEPENDENT']))

      TOKEN    POS  TAG ENTITYTYPE DEPENDENT
0     April  PROPN  NNP       DATE  npadvmod
1       and  CCONJ   CC                   cc
2      Mary  PROPN  NNP     PERSON      conj
3      went   VERB  VBD                 ROOT
4        to    ADP   IN                 prep
5       the    DET   DT                  det
6   library   NOUN   NN                 pobj
7        to   PART   TO                  aux
8       get   VERB   VB                advcl
9      math   NOUN   NN             compound
10    books   NOUN  NNS                 dobj
11        .  PUNCT    .                punct


### b) Do you agree with the simple part-of-speech tags from spaCy? What about the entity type tags?

Please enter your answer to the question above here:<br><br>
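I agree with the simple part-of-speech tags; most of them are correct. For example, "math" is a noun, "books" is a plural noun, and "get" is a verb. I do not fully agree with the entity type tags: "April" is tagged as DATE, but in this sentence it is a person's name, so it should be PERSON. "Mary" is correctly tagged as PERSON.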

### c) Write a short sentence, and use spaCy to do dependency parsing on it. You may borrow some code from above.

In [21]:
# Your code for the question above.

# Loop through each token in the processed 'doc' object (reused from part a).
for token in doc:
    # Print a list of attributes of the token:
    #   token.text:     The original word text of the token.
    #   token.dep_:     The syntactic dependency relation of the token.
    #   token.head.text: The original word text of the token's head in the dependency tree.
    #   token.pos_:     The simple part-of-speech tag of the token.
    #   token.tag_:     The "fine-grained" part-of-speech tag of the token.
    print([token.text, token.dep_, token.head.text, token.pos_, token.tag_])

['April', 'npadvmod', 'went', 'PROPN', 'NNP']
['and', 'cc', 'April', 'CCONJ', 'CC']
['Mary', 'conj', 'April', 'PROPN', 'NNP']
['went', 'ROOT', 'went', 'VERB', 'VBD']
['to', 'prep', 'went', 'ADP', 'IN']
['the', 'det', 'library', 'DET', 'DT']
['library', 'pobj', 'to', 'NOUN', 'NN']
['to', 'aux', 'get', 'PART', 'TO']
['get', 'advcl', 'went', 'VERB', 'VB']
['math', 'compound', 'books', 'NOUN', 'NN']
['books', 'dobj', 'get', 'NOUN', 'NNS']
['.', 'punct', 'went', 'PUNCT', '.']
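The flat list printed above can be hard to read as a tree. The following is a minimal sketch, in plain Python, that rebuilds the tree from the printed triples and indents each token under its head. The head indices are an assumption: the output shows only the head's text, and "to" occurs twice, so each head was resolved by hand from the relation labels. With a live spaCy `doc`, `token.i` and `token.head.i` would give these indices directly.

```python
from collections import defaultdict

# (index, text, dep, head_index) taken from the printed output above.
# Head indices are hand-resolved (assumption): "library" (pobj) attaches to
# the preposition "to" at index 4, and the "to" at index 7 is the aux of "get".
TOKENS = [
    (0, "April", "npadvmod", 3),
    (1, "and", "cc", 0),
    (2, "Mary", "conj", 0),
    (3, "went", "ROOT", 3),
    (4, "to", "prep", 3),
    (5, "the", "det", 6),
    (6, "library", "pobj", 4),
    (7, "to", "aux", 8),
    (8, "get", "advcl", 3),
    (9, "math", "compound", 10),
    (10, "books", "dobj", 8),
    (11, ".", "punct", 3),
]


def build_children(tokens):
    """Return the root index and a head -> [child indices] mapping."""
    children = defaultdict(list)
    root = None
    for index, _text, dep, head in tokens:
        if dep == "ROOT":
            root = index
        else:
            children[head].append(index)
    return root, children


def render_tree(tokens):
    """Return one indented line per token, depth-first from the root."""
    root, children = build_children(tokens)
    lines = []

    def walk(index, depth):
        _, text, dep, _ = tokens[index]
        lines.append("  " * depth + f"{text} ({dep})")
        for child in children[index]:
            walk(child, depth + 1)

    walk(root, 0)
    return lines


tree_lines = render_tree(TOKENS)
print("\n".join(tree_lines))
```

The root "went" appears at the top, with "April", "to", "get", and "." as its direct dependents; "and" and "Mary" hang under "April" because spaCy attaches conjuncts to the first conjunct.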


### d) Visualization for the Dependencies

Next, use spaCy's displaCy visualizer to display the dependencies of the example sentence from part a).

In [22]:
# Your code for the question above.
# Visualize the dependency parsing
displacy.render(doc, style='dep', jupyter=True)

## Q7 **Named Entity Recognition (NER)**
In this section, we use spaCy to perform NER on a dataset of financial news. Before that, we need to clarify which entity types spaCy supports. The following code prints all supported types and their explanations. You can run it without any edits.

In [23]:
import pprint as pp

# entities supported by spaCy and their explanations
pp.pprint([(label, spacy.explain(label)) for label in nlp.get_pipe("ner").labels])

[('CARDINAL', 'Numerals that do not fall under another type'),
 ('DATE', 'Absolute or relative dates or periods'),
 ('EVENT', 'Named hurricanes, battles, wars, sports events, etc.'),
 ('FAC', 'Buildings, airports, highways, bridges, etc.'),
 ('GPE', 'Countries, cities, states'),
 ('LANGUAGE', 'Any named language'),
 ('LAW', 'Named documents made into laws.'),
 ('LOC', 'Non-GPE locations, mountain ranges, bodies of water'),
 ('MONEY', 'Monetary values, including unit'),
 ('NORP', 'Nationalities or religious or political groups'),
 ('ORDINAL', '"first", "second", etc.'),
 ('ORG', 'Companies, agencies, institutions, etc.'),
 ('PERCENT', 'Percentage, including "%"'),
 ('PERSON', 'People, including fictional'),
 ('PRODUCT', 'Objects, vehicles, foods, etc. (not services)'),
 ('QUANTITY', 'Measurements, as of weight or distance'),
 ('TIME', 'Times smaller than a day'),
 ('WORK_OF_ART', 'Titles of books, songs, etc.')]


### Data Setup
Our dataset contains 4846 sentences extracted from **financial news articles** published before 2013 about companies listed on the **Finnish stock market**.

The dataset is from the paper Malo, Pekka, et al. "Good debt or bad debt: Detecting semantic orientations in economic texts." *Journal of the Association for Information Science and Technology* 65.4 (2014): 782-796. You can check the paper [here](https://asistdl.onlinelibrary.wiley.com/doi/abs/10.1002/asi.23062).

First, let's download the dataset.

In [24]:
# download the dataset
!wget https://duke.box.com/shared/static/anno25ihglqxif8jcjoo4f7p7vd2ajt3 -O all-data.csv
# store it in news_df
news_df = pd.read_csv("all-data.csv", usecols=[1], names=["content"], encoding="latin-1")

--2024-07-26 03:37:02--  https://duke.box.com/shared/static/anno25ihglqxif8jcjoo4f7p7vd2ajt3
Resolving duke.box.com (duke.box.com)... 74.112.186.157
Connecting to duke.box.com (duke.box.com)|74.112.186.157|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /public/static/anno25ihglqxif8jcjoo4f7p7vd2ajt3 [following]
--2024-07-26 03:37:02--  https://duke.box.com/public/static/anno25ihglqxif8jcjoo4f7p7vd2ajt3
Reusing existing connection to duke.box.com:443.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://duke.app.box.com/public/static/anno25ihglqxif8jcjoo4f7p7vd2ajt3 [following]
--2024-07-26 03:37:02--  https://duke.app.box.com/public/static/anno25ihglqxif8jcjoo4f7p7vd2ajt3
Resolving duke.app.box.com (duke.app.box.com)... 74.112.186.157
Connecting to duke.app.box.com (duke.app.box.com)|74.112.186.157|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://public.boxcloud.com/d/1/b1!e4l95

In [25]:
# Next, we apply several preprocessing steps to the dataset: removing punctuation, digits, stopwords, and single-letter words.
# You can run the code without any edits.

CUSTOM_FILTERS = [preprocessing.strip_punctuation, preprocessing.strip_numeric, preprocessing.remove_stopwords, lambda x: preprocessing.strip_short(x, minsize=2)]
news_df["processed_content"] = news_df["content"].apply(lambda x: ' '.join(preprocessing.preprocess_string(x, CUSTOM_FILTERS)))
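To make the effect of these four filters concrete, here is a rough pure-Python approximation of the pipeline. It is a sketch only: the stopword set `TOY_STOPWORDS` is a tiny hypothetical stand-in for gensim's much longer list, the example sentence is invented, and the helper functions merely imitate the gensim filters of the same name rather than call them.

```python
import re
import string

# Hypothetical, tiny stopword set for illustration only; gensim's own
# stopword list is much longer.
TOY_STOPWORDS = frozenset({"the", "to", "in", "of", "a", "and", "its"})


def strip_punctuation(text):
    # Replace each run of punctuation characters with a single space.
    return re.sub(r"[%s]+" % re.escape(string.punctuation), " ", text)


def strip_numeric(text):
    # Delete every run of digits.
    return re.sub(r"[0-9]+", "", text)


def remove_stopwords(text, stopwords=TOY_STOPWORDS):
    # Case-sensitive lookup: "The" would survive, "the" would not.
    return " ".join(w for w in text.split() if w not in stopwords)


def strip_short(text, minsize=2):
    # Drop words shorter than `minsize` characters.
    return " ".join(w for w in text.split() if len(w) >= minsize)


def preprocess(text):
    # Apply the filters in the same order as CUSTOM_FILTERS.
    for step in (strip_punctuation, strip_numeric, remove_stopwords, strip_short):
        text = step(text)
    return text


example = "Operating profit rose to EUR 13.1 mn in Q3 of 2010."
processed = preprocess(example)
print(processed)
```

Note that the amount "13.1" disappears entirely while "EUR" and "mn" remain. This means that any entity recognizer run on the processed text sees currency and unit words without their numbers.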

We use spaCy to analyze the preprocessed sentences in the dataset with the single line of code below. You can run it without any edits. It may take up to **2 minutes** to finish.

In [None]:
news_df["nlp"] = news_df["processed_content"].apply(nlp)

### a) Most frequent monetary entities
The most frequent entities within an entity type can capture important information about the corpus. The following code displays the 10 most frequent entities of type "**MONEY**" (monetary values, including unit). You can run it without any edits.

In [None]:
# collect the tokens tagged as part of a "MONEY" entity in each sentence
news_df["money"] = news_df['nlp'].apply(lambda doc: [token.text for token in doc if token.ent_type_ == "MONEY"])
# return the top-10 frequent entities
news_df["money"].explode().value_counts()[:10]

money
EUR        464
mn         305
million    265
mln        218
euro       115
USD         55
quarter     46
billion     36
euros       28
eur         25
Name: count, dtype: int64

### b) Do the top-10 frequent monetary entities match your expectation?

Please enter your answer to the question above here:<br><br>
Yes. I see several variants of the euro: EUR, euro, euros, and eur.<br>
I also see mn, mln, and million for "million", as well as billion.
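Because several of the top-10 tokens are spelling variants of the same unit, it can help to merge them before comparing counts. The sketch below does this for the counts printed in part a). The `CANONICAL` mapping is a hypothetical choice made for illustration; it is not part of spaCy or of the assignment.

```python
from collections import Counter

# Token counts copied from the MONEY output in part a).
money_counts = {
    "EUR": 464, "mn": 305, "million": 265, "mln": 218, "euro": 115,
    "USD": 55, "quarter": 46, "billion": 36, "euros": 28, "eur": 25,
}

# Hypothetical mapping from lowercase variants to one canonical label.
CANONICAL = {
    "eur": "EUR", "euro": "EUR", "euros": "EUR",
    "mn": "million", "mln": "million",
}


def merge_variants(counts, canonical):
    """Sum the counts of tokens that map to the same canonical label."""
    merged = Counter()
    for token, n in counts.items():
        # Tokens without a mapping keep their original spelling.
        merged[canonical.get(token.lower(), token)] += n
    return merged


merged_counts = merge_variants(money_counts, CANONICAL)
for label, n in merged_counts.most_common():
    print(f"{label:10s}{n}")
```

After merging, the ten tokens collapse into five labels, with "million" (788) and "EUR" (632) clearly dominating. The token "quarter" stays separate, since it is not a currency or magnitude word.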

### c) Most frequent geographic entities.
Write (and run) code to find the 10 most frequent geographic entities.

Hint 1: The code structure is similar to that of the monetary entities

Hint 2: Geographic entities are labeled as "GPE" in spaCy, which include countries, cities, and states.

In [None]:
# Your code for the question above.
# Find the "GPE" entities in each sentence
news_df["geographic"] = news_df['nlp'].apply(lambda doc: [token.text for token in doc if token.ent_type_ == "GPE"])

# Return the top-10 frequent entities
top_10_geographic_entities = news_df["geographic"].explode().value_counts()[:10]
print(top_10_geographic_entities)

geographic
Finland       299
Russia         81
Helsinki       61
China          42
Sweden         40
US             34
Petersburg     28
Estonia        24
UK             23
India          21
Name: count, dtype: int64


### d) Most frequent organizational entities.
Write (and run) code to find the 10 most frequent organizational entities, which spaCy labels as "ORG".


In [None]:
# Your code for the question above.
# Find the "ORG" entities in each sentence
news_df["organizational"] = news_df['nlp'].apply(lambda doc: [token.text for token in doc if token.ent_type_ == "ORG"])

# Return the top-10 frequent entities
top_10_organizational_entities = news_df["organizational"].explode().value_counts()[:10]
print(top_10_organizational_entities)

organizational
Group          160
Nokia          132
Corporation    116
Oyj            110
Bank            93
Finnish         71
HEL             69
Ltd             57
Plc             55
Oy              47
Name: count, dtype: int64
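The list above is dominated by name fragments such as "Group", "Corporation", and "Oyj", because the code counts individual tokens rather than whole entities. The sketch below shows the difference in plain Python on two hand-tagged sentences. The tokens and tags are invented for illustration and are not spaCy output; each tuple mimics `(token.text, token.ent_iob_, token.ent_type_)`. In spaCy itself, `doc.ents` exposes whole entity spans, each with `.text` and `.label_`.

```python
from collections import Counter

# Hand-tagged toy sentences (assumption, not real model output).
# Each tuple is (text, IOB tag, entity type).
SENTENCES = [
    [("Nokia", "B", "ORG"), ("Corporation", "I", "ORG"),
     ("said", "O", ""), ("profit", "O", ""), ("rose", "O", "")],
    [("Stora", "B", "ORG"), ("Enso", "I", "ORG"), ("Oyj", "I", "ORG"),
     ("Nokia", "B", "ORG"), ("Corporation", "I", "ORG"),
     ("signed", "O", ""), ("deal", "O", "")],
]


def entity_spans(tokens, label):
    """Join the consecutive tokens of each `label` entity into one string."""
    spans, current = [], []
    for text, iob, ent_type in tokens:
        if ent_type == label and iob == "B":
            # A "B" tag always starts a new entity, even right after another.
            if current:
                spans.append(" ".join(current))
            current = [text]
        elif ent_type == label and iob == "I" and current:
            current.append(text)
        else:
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans


# Token-level counting, as in the cell above.
token_counts = Counter(
    text for sent in SENTENCES for text, _, ent_type in sent if ent_type == "ORG"
)
# Span-level counting: one count per whole entity.
span_counts = Counter(
    span for sent in SENTENCES for span in entity_spans(sent, "ORG")
)

print(token_counts.most_common())
print(span_counts.most_common())
```

Counting spans keeps "Nokia Corporation" and "Stora Enso Oyj" intact, so generic suffixes no longer crowd out the company names. The same idea applies to multi-word GPE and MONEY entities.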
