# QM 701: Advanced Data Analytics and Applications
# Homework 3
---
## Objective
This homework is designed to enhance your understanding of word embeddings using Word2Vec and FastText models. You will import, preprocess, and analyze a dataset of movie reviews. The primary objectives include training Word2Vec and FastText models, exploring word similarity, and comparing models trained on domain-specific data with models pre-trained on larger corpora.


## Tasks
In this homework, you will implement **language models** for **learning word embeddings**. The homework includes the following 5 questions:

* **Q1**: Text Pre-Processing. You will start by importing and preprocessing a dataset of movie reviews. (15 points)
* **Q2**: Model Training. You need to train **Word2Vec** and **FastText** models under different settings on the processed dataset. (20 points)
* **Q3**: Model Evaluation. Using these trained models, you are required to find the synonyms of the given tokens. (20 points)
* **Q4**: Model Comparison. You will compare your models with other models that are pre-trained on larger datasets. (20 points)
* **Q5**: Non-Coding Questions. You can check your understanding of language models and word embeddings by answering these questions. (25 points)



In [1]:
# import packages
# feel free to import other libraries here
import pprint as pp
import numpy as np
import pandas as pd
import gensim

## File and Data setup

For this assignment, we will use the movie review dataset, which is originally from the paper *Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts (Pang & Lee, ACL 2004)*. This dataset includes 2,000 reviews of movies released **before 2002**. You can find a detailed description of the dataset in their [paper](https://aclanthology.org/P04-1035/) and on their [website](https://www.cs.cornell.edu/people/pabo/movie-review-data/).

This dataset is supported and distributed by NLTK, which enables us to download it easily.

In [2]:
# download required corpora
import nltk
nltk.download(["movie_reviews", "wordnet", "averaged_perceptron_tagger"])
from nltk.corpus import movie_reviews

[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


In [3]:
# transform the dataset into a pandas Series object,
# where each element of the Series is the full text of a movie review,
# with words separated by spaces
doc = [" ".join(movie_reviews.words(file_id)) for file_id in movie_reviews.fileids()]
text_series = pd.Series(doc)

In [4]:
# inspect the Series using the describe() method
text_series.str.len().describe()

count     2000.0000
mean      3904.2600
std       1718.6621
min         91.0000
25%       2745.5000
50%       3640.5000
75%       4730.5000
max      15097.0000
dtype: float64

In [5]:
# Display a review
pp.pprint(text_series[1100][0:2000], width=100, compact=True)

('james l . brooks , one of the developers of the simpsons and director of broadcast news , '
 'returns to the big screen with this entertaining , if slightly flawed comedy . nicholson plays '
 "melvin udall , probably the most horrible person ever on the screen . he ' s racist , homophobic "
 ', and never has a good word to say to anyone . so , nobody talks to him , except waitress carol '
 'conelly ( t . v sitcom star hunt , who was last seen in twister , 1996 ) . naturally , udall , '
 'conelly and gay neighbor simon bishop ( kinnear ) who nicholson hates , all hit it off in the '
 'end . like good will hunting ( 1997 ) and titanic ( 1997 ) , even though the outcome is '
 'completely obvious , as good as it gets is an enjoyable , funny and warm comedy . nicholson is '
 'hilarious as melvin , churning out insults with superb relish . only nicholson could get away '
 'with the lines that melvin delivers . hunt is also good as waitress carol , and easily rises to '
 "the challenge of n

## Q1. Text Pre-Processing

In this section, we will first preprocess the dataset and format it for the model training.

We already provide the code for the main part of the preprocessing. Your task is to define `CUSTOM_FILTERS`, a list of gensim functions that remove punctuation, digits, and stopwords.

Hint: You may check our posted `Class3.ipynb` notebook about preprocessing for some help. You can also find the official `gensim` documentation of the preprocessing functions [here](https://radimrehurek.com/gensim/parsing/preprocessing.html#gensim.parsing.preprocessing.preprocess_string).
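
To make the idea of a filter chain concrete, here is a minimal standard-library sketch of applying a list of filters in order and then splitting the result into tokens. The helper names (`strip_punct`, `strip_digits`, `drop_stopwords`, `apply_filters`) and the tiny `TOY_STOPWORDS` set are hypothetical stand-ins for illustration; they only approximate the behaviour of the gensim functions and are not their actual implementation.

```python
import re
import string

# Hypothetical, tiny stopword list for illustration only;
# gensim ships a much longer built-in list.
TOY_STOPWORDS = {"the", "of", "and", "a", "to", "is", "in"}


def strip_punct(text):
    """Replace every ASCII punctuation character with a space."""
    return re.sub(f"[{re.escape(string.punctuation)}]", " ", text)


def strip_digits(text):
    """Remove all digit characters."""
    return re.sub(r"[0-9]+", "", text)


def drop_stopwords(text):
    """Drop whitespace-separated words that appear in the toy stopword list."""
    return " ".join(w for w in text.split() if w not in TOY_STOPWORDS)


def apply_filters(text, filters):
    """Apply each filter in order, then split the result on whitespace."""
    for f in filters:
        text = f(text)
    return text.split()


TOY_FILTERS = [strip_punct, strip_digits, drop_stopwords]
tokens = apply_filters("the director of titanic ( 1997 ) , james cameron .", TOY_FILTERS)
print(tokens)
```

The order of the filters matters in general: a filter that works on whole words (such as stopword removal) behaves differently depending on whether punctuation has already been stripped.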

In [8]:
import gensim.parsing.preprocessing as preprocess
from nltk.stem import WordNetLemmatizer
# Your code for Q1: define your CUSTOM_FILTERS of gensim functions to apply preprocessing steps.
CUSTOM_FILTERS =  [preprocess.strip_punctuation, preprocess.strip_numeric, preprocess.remove_stopwords]

With your `CUSTOM_FILTERS`, you can run the following code for preprocessing without any edits.

It may take up to 2 minutes to finish the preprocessing (mostly due to the lemmatization).

In [9]:
def lemmatize_text(token_list, wnl):
  """
  This function tags the pos of the input tokens,
  and lemmatize the tokens.
  """
  # POS tag each word
  for word, tag in nltk.pos_tag(token_list):
    # Mapping the pos tags to the types supported by wnl
    if tag.startswith("NN"):
      yield wnl.lemmatize(word, pos='n')
    elif tag.startswith('VB'):
      yield wnl.lemmatize(word, pos='v')
    elif tag.startswith('JJ'):
      yield wnl.lemmatize(word, pos='a')
    elif tag.startswith('RB'):
      yield wnl.lemmatize(word, pos='r')
    else:
      yield wnl.lemmatize(word)

# Remove punctuation, digits, and stopwords with your defined CUSTOM_FILTERS
processed_text = text_series.apply(lambda x: preprocess.preprocess_string(x, CUSTOM_FILTERS))

# Lemmatize the tokens
wnl = nltk.WordNetLemmatizer()
processed_text = processed_text.apply(lambda x: " ".join(lemmatize_text(x, wnl)))

# Lowercase the text and return a token list for each review
processed_text = processed_text.apply(lambda x: gensim.utils.simple_preprocess(x, deacc=False))

## Q2. Model Training

After preprocessing, let's turn to some well-known models for learning word embeddings. You are required to train two **Word2Vec** models and two **FastText** models, one each with the **continuous bag-of-words (CBOW)** and **skip-gram (SG)** architectures.

In total, you should have the following four models. Please **name** the models using the following names:

*   w2v_cbow: Word2Vec + CBOW
*   w2v_sg: Word2Vec + SG
*   ft_cbow: FastText + CBOW
*   ft_sg: FastText + SG

Please make sure to name your models exactly as above.

You may consider training these models using `gensim`. In that case, training each of the models takes around 1-2 minutes. You may check the `gensim` documentation on [Word2Vec](https://radimrehurek.com/gensim/models/word2vec.html) and [FastText](https://radimrehurek.com/gensim/models/fasttext.html). You are free to try different hyperparameters.

Note: Skip-Gram tends to work well with a small amount of data and is found to represent rare words well. On the other hand, CBOW is faster and has better representations for more frequent words.
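
The difference between the two architectures can be sketched by looking at the training examples each one derives from the same token window. The functions `cbow_pairs` and `skipgram_pairs` below are hypothetical helpers written for this illustration. Real implementations also apply details such as randomly shrinking the window and subsampling frequent words, which are left out here.

```python
def cbow_pairs(tokens, window=2):
    """Yield (context_words, target) pairs: CBOW predicts the target from its context."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        if context:
            yield context, target


def skipgram_pairs(tokens, window=2):
    """Yield (target, context_word) pairs: skip-gram predicts each context word from the target."""
    for context, target in cbow_pairs(tokens, window):
        for word in context:
            yield target, word


tokens = ["nicholson", "play", "melvin", "udall"]
cbow_examples = list(cbow_pairs(tokens, window=1))
sg_examples = list(skipgram_pairs(tokens, window=1))

for example in cbow_examples:
    print("CBOW:", example)
for example in sg_examples:
    print("SG:  ", example)
```

CBOW produces one example per position (the whole context is combined), while skip-gram produces one example per (target, context word) pair. This is one way to see why skip-gram is slower to train but gives each word, including rare ones, more individual updates.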

Hint 1: If you choose to use `gensim`, you may first import these models like this

``` python
from gensim.models import Word2Vec, FastText
```
Hint 2: Like to Class 3 notebbok, you should start your `gensim` model training like this
```python
w2v_cbow = Word2Vec(...)
```

In [10]:
# Your code for Q2.
from gensim.models import Word2Vec, FastText


w2v_cbow = Word2Vec(sentences=processed_text, vector_size=100, window=5, min_count=5, epochs=10, workers=4, sg=0)
w2v_sg = Word2Vec(sentences=processed_text, vector_size=100, window=5, min_count=5, epochs=10, workers=4, sg=1)
ft_cbow = FastText(sentences=processed_text, vector_size=100, window=5, min_count=5, epochs=10, workers=4, sg=0)
ft_sg = FastText(sentences=processed_text, vector_size=100, window=5, min_count=5, epochs=10, workers=4, sg=1)



## Q3. Model Evaluation: Synonyms
With the trained models, we can find the most similar words (i.e., synonyms) of given tokens based on their word embeddings.
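
The scores returned by `most_similar` are cosine similarities between embedding vectors. As a minimal sketch of the idea, the code below ranks words by cosine similarity using a hand-made dictionary `toy_vectors`. The vectors and the helper names `cosine` and `rank_by_similarity` are invented for illustration and do not come from the trained models.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_similarity(query, vectors, topn=3):
    """Return the topn words most similar to `query`, excluding the query itself."""
    scores = [(w, cosine(vectors[query], vec))
              for w, vec in vectors.items() if w != query]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Hypothetical 3-dimensional embeddings, made up for this example.
toy_vectors = {
    "disney": [0.9, 0.8, 0.1],
    "animation": [0.8, 0.9, 0.2],
    "pocahontas": [0.7, 0.6, 0.3],
    "thriller": [0.1, 0.2, 0.9],
}

ranking = rank_by_similarity("disney", toy_vectors)
for word, score in ranking:
    print(f"{word}: {score:.3f}")
```

Because cosine similarity depends only on the angle between vectors, two words score highly when their vectors point in a similar direction, regardless of vector length.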

### a) Single Input Token
For 3a), we will examine the top-10 most similar words for several tokens. After each displayed top-10 list, be sure to discuss anything you find interesting.

The following code generates the top-10 most similar words for "**disney**" using w2v_sg (Word2Vec + SG).

If you named and trained your model according to the instructions in Q2, you can run the following code without any edits.

In [11]:
# top-10 most similar words for "disney" using w2v_sg
w2v_sg.wv.most_similar(positive="disney")

[('animation', 0.7560753226280212),
 ('animate', 0.7494263052940369),
 ('dreamworks', 0.7273356318473816),
 ('animated', 0.7238619923591614),
 ('pocahontas', 0.722666323184967),
 ('mermaid', 0.7204435467720032),
 ('hercules', 0.6872342824935913),
 ('hunchback', 0.6869755983352661),
 ('walt', 0.6835805177688599),
 ('dame', 0.6810756325721741)]

The following code generates the top-10 most similar words for "**oscar**" using ft_sg (FastText + SG).

If you named and trained your model correctly, you can run the following code without any edits.

In [12]:
# top-10 most similar words for "oscar" using ft_sg
ft_sg.wv.most_similar(positive="oscar")

[('nominee', 0.8424609899520874),
 ('nomi', 0.8273019194602966),
 ('nominate', 0.8212134838104248),
 ('nominated', 0.8141596913337708),
 ('nomination', 0.8136633038520813),
 ('academy', 0.7813323140144348),
 ('award', 0.7514181733131409),
 ('accolade', 0.7081405520439148),
 ('winner', 0.6830302476882935),
 ('globe', 0.677161693572998)]

The following code generates the top-10 most similar words for "**actress**" using ft_cbow (FastText + CBOW).

If you named and trained your model correctly, you can run the following code without any edits.

In [13]:
# top-10 most similar words for "actress" using ft_cbow
ft_cbow.wv.most_similar(positive="actress")

[('actresses', 0.9699483513832092),
 ('roles', 0.8727955222129822),
 ('keanu', 0.8705829977989197),
 ('role', 0.8694130778312683),
 ('support', 0.8523364663124084),
 ('actor', 0.8521780371665955),
 ('performer', 0.8507847785949707),
 ('charisma', 0.8448072075843811),
 ('perky', 0.8443838953971863),
 ('nicole', 0.8372035026550293)]

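**Answer:** w2v_sg: Skip-gram learns a word's vector by predicting its surrounding context words, so the neighbors of "disney" are words that appear in similar contexts, such as animation terms and the titles of Disney films. <br>
ft_sg: Similar to w2v_sg, but it also uses subword information, so related forms such as 'nominee', 'nomi', 'nominate', 'nominated', and 'nomination' rank highly. <br>
ft_cbow: It predicts a target word from the average of its context words, which are themselves represented through subwords. For "actress", I believe the shared subword "act" pulls in "actresses" and "actor", while the surrounding contexts contribute "role", "roles", and people's names such as "keanu" and "nicole".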
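
The subword effect described above can be illustrated by extracting character n-grams directly. The sketch below assumes n-grams of length 3 to 6 with `<` and `>` as word-boundary markers, which matches the defaults commonly documented for FastText. The functions `char_ngrams` and `ngram_overlap` are hypothetical helpers, not part of `gensim`.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the set of character n-grams of `word`, with < and > as boundary markers."""
    padded = f"<{word}>"
    grams = set()
    for n in range(min_n, max_n + 1):
        for i in range(len(padded) - n + 1):
            grams.add(padded[i:i + n])
    return grams


def ngram_overlap(a, b):
    """Jaccard overlap between the character n-gram sets of two words."""
    grams_a, grams_b = char_ngrams(a), char_ngrams(b)
    return len(grams_a & grams_b) / len(grams_a | grams_b)


shared = char_ngrams("nominee") & char_ngrams("nominate")
print(sorted(shared))
print("nominee vs nominate:", ngram_overlap("nominee", "nominate"))
print("nominee vs oscar:   ", ngram_overlap("nominee", "oscar"))
```

Words that share many n-grams also share many trained subword vectors, which pushes their embeddings together even when the words rarely co-occur. This helps explain why FastText neighbor lists often contain spelling variants of the query.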

### b) Multiple Input Tokens
Our models can also find similar words for multiple input tokens. The following code generates the top-10 most similar words for two tokens, "**kung**" and "**fu**", using w2v_cbow. Examine the output and describe any interesting findings.

If you named and trained your model correctly, you can run the following code without any edits.

In [14]:
# top-10 most similar words for "kung" and "fu" using w2v_cbow
w2v_cbow.wv.most_similar(positive=["kung", "fu"])

[('choreograph', 0.8920067548751831),
 ('jet', 0.8469336032867432),
 ('kong', 0.8339348435401917),
 ('hong', 0.8251234889030457),
 ('slo', 0.8224210143089294),
 ('spectacular', 0.8176441788673401),
 ('heavy', 0.8067103624343872),
 ('lam', 0.8032325506210327),
 ('jaw', 0.8006710410118103),
 ('choreography', 0.7998129725456238)]

**Answer:** w2v_cbow: It works in the opposite direction from w2v_sg. CBOW predicts a target word from its context. For example, given context words like "saving" and "account", it might predict "bank". <br>
In our case, the query "kung fu" returns contextually related words such as "choreograph", "jet" (Jet Li), "spectacular", and "hong" and "kong" (Hong Kong).
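
For a multi-token query, the `gensim` documentation describes `most_similar` as computing cosine similarity against a simple mean of the query vectors. The sketch below mimics that idea on made-up vectors. The dictionary `toy`, the helpers `unit`, `mean_query`, and `rank`, and the choice to unit-normalize each vector before averaging are assumptions of this illustration rather than a statement of the library's exact internals.

```python
import math


def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def mean_query(words, vectors):
    """Average the unit-normalized vectors of the query words."""
    units = [unit(vectors[w]) for w in words]
    return [sum(col) / len(units) for col in zip(*units)]


def rank(query_vec, vectors, exclude):
    """Rank all non-excluded words by cosine similarity to the query vector."""
    q = unit(query_vec)
    scores = [(w, sum(a * b for a, b in zip(q, unit(v))))
              for w, v in vectors.items() if w not in exclude]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical 3-dimensional embeddings, made up for this example.
toy = {
    "kung": [1.0, 0.2, 0.0],
    "fu": [0.8, 0.4, 0.0],
    "choreography": [0.9, 0.3, 0.1],
    "romance": [0.0, 0.1, 1.0],
}

query = mean_query(["kung", "fu"], toy)
result = rank(query, toy, exclude={"kung", "fu"})
print(result)
```

Averaging means the result favors words that are close to both query tokens at once, which is why a two-word query such as "kung" plus "fu" can surface words related to the phrase rather than to either token alone.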

### c) Your Own Token
It's your turn to play with the models! Pick one or more tokens and use at least one of our models to generate the top-10 similar words. Then explain your finding (i.e., why the top-10 words make sense or why they do not).

Hint 1: The dataset only includes reviews on movies before 2002. Our models are not familiar with any movies released after that.

Hint 2: For the same input token, our models may perform very differently. So try all four models and find the best one.

In [19]:
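# Your code for Q3 c)
# Note: a notebook cell displays only the value of its last expression,
# so the output shown below is the ft_cbow result.
w2v_sg.wv.most_similar(positive="film")
w2v_cbow.wv.most_similar(positive="film")
ft_sg.wv.most_similar(positive="film")
ft_cbow.wv.most_similar(positive="film")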

[('ilm', 0.9665961265563965),
 ('films', 0.9600110650062561),
 ('filmed', 0.9400854706764221),
 ('filming', 0.8896265625953674),
 ('filmmaking', 0.8443182706832886),
 ('moviemaking', 0.8414859175682068),
 ('file', 0.8361199498176575),
 ('filmmaker', 0.8339509963989258),
 ('movie', 0.8196320533752441),
 ('filthy', 0.7854481339454651)]

**Answer:**
I used the common word "film" to find contextually related words.<br>
w2v_sg: reshoot, cerebral, artistically, inexcusable, counterbalance, adjectives, deliverance, stimulate, mislead<br>
Explanation: SG predicts the context from a target word, so it tends to associate "film" with terms that appear in similar contexts in movie reviews, such as "artistically".<br>
w2v_cbow: movie, consider, entertainment, thriller, matrix, list, masterpiece, entertaining, cinematic, product<br>
CBOW predicts a word from its context, so it captures common collocates of "film". This list includes general synonyms and related industry terms.<br>
ft_sg: movie, films, filmed, moviegoer, duguay, hk, moviemaking, mainstream, ilm, caligula<br>
Because FastText breaks words into subwords, it recognizes related forms such as "films" and "filmed", along with closely related terms such as "movie", "moviegoer", and "moviemaking". This shows its sensitivity to subword cues.<br>
ft_cbow: ilm, films, filmed, filming, filmmaking, moviemaking, file, filmmaker, movie, filthy<br>
ft_cbow captured many subword matches, including words such as "file" and "filthy" that share characters with "film" but are unrelated in meaning.<br>
I would say ft_sg performs best, since its results are the most closely related to the word "film".


## Q4. Model Comparison with Pre-trained Embeddings
`gensim` also provides word embeddings pre-trained on large corpora using different models. Next, we download some of these embeddings and compare them with our models to see the difference.

### a) Downloading Pre-trained Embeddings
The embeddings that we choose are called `word2vec-google-news-300`. They were generated by a Word2Vec model pre-trained on a part of the Google News dataset with about 100 billion words.

You can run the following code without any edits to download the embeddings. It may take around 10 minutes.

In [20]:
import gensim.downloader
trained_w2c = gensim.downloader.load('word2vec-google-news-300')



### b) Comparison with Our Models
Next, we will compare our models with these pre-trained word embeddings.

The following code generates the top-10 most similar words for "**disney**" using the pre-trained word embeddings. If the model is downloaded correctly, the words "disneyland" and "orlando" should appear.

In [21]:
trained_w2c.most_similar(positive="disney")

[('alice', 0.6571455597877502),
 ('harry_potter', 0.6108603477478027),
 ('gwen', 0.598574161529541),
 ('mario', 0.5946714878082275),
 ('nolan', 0.5925750732421875),
 ('disneyland', 0.584934413433075),
 ('orlando', 0.584503173828125),
 ('hannah_montana', 0.5791226029396057),
 ('jackie', 0.5755028128623962),
 ('nikki', 0.5743952989578247)]

If you compare the above 10 words with the results in Q3 a), where our w2v_sg also generates the top-10 similar words of "disney", you may notice that although our models are trained on a much smaller dataset (the ratio is around 12800:1 in terms of the number of tokens), they capture word similarities in the context of movies just as well, or even better. This illustrates the importance of domain-specific knowledge in language models.

However, do our models always perform well? Next, try to find the top-10 similar words of "**computer**" using both the pre-trained word embeddings `trained_w2c` and any one of our models. Compare the results and discuss why they are so different.

In [22]:
# Your codes for Q4 b)
trained_w2c.most_similar(positive="computer")

[('computers', 0.7979379892349243),
 ('laptop', 0.6640493273735046),
 ('laptop_computer', 0.6548868417739868),
 ('Computer', 0.647333562374115),
 ('com_puter', 0.6082080006599426),
 ('technician_Leonard_Luchko', 0.5662748217582703),
 ('mainframes_minicomputers', 0.5617720484733582),
 ('laptop_computers', 0.5585449934005737),
 ('PC', 0.5539618730545044),
 ('maker_Dell_DELL.O', 0.5519254207611084)]

In [23]:
w2v_sg.wv.most_similar(positive="computer")

[('plug', 0.7988744974136353),
 ('panama', 0.7644680738449097),
 ('aircraft', 0.7613723874092102),
 ('immerse', 0.7590673565864563),
 ('electrical', 0.7573085427284241),
 ('slime', 0.7563731074333191),
 ('exposure', 0.7525171041488647),
 ('detonate', 0.752166748046875),
 ('advanced', 0.7460222244262695),
 ('destroys', 0.7444332242012024)]

**Answer**: trained_w2c gives much better results (laptop, PC, technician) than the w2v_sg model used above (plug, aircraft, exposure). The reason is that movie reviews provide little domain-specific knowledge about "computer", so w2v_sg cannot learn its meaning as well as a model pre-trained on a large news corpus.


### c) Your Own Token
Come up with a word on your own, then generate the top-10 similar words under the pre-trained `trained_w2c` model and compare them with the top-10 similar words under the `ft_sg` model we trained in Q2. Examine the output and discuss anything you find interesting.

In [24]:
# Your codes for Q4 c)
trained_w2c.most_similar(positive="AI")

[('Steven_Spielberg_Artificial_Intelligence', 0.5575934052467346),
 ('Index_MDE_##/###/####', 0.5415324568748474),
 ('Enemy_AI', 0.5256390571594238),
 ('Ace_Combat_Zero', 0.522663414478302),
 ('DOA4', 0.5182536244392395),
 ('mechs', 0.5137375593185425),
 ('mech', 0.5077533721923828),
 ('playstyle', 0.507252037525177),
 ('AI_bots', 0.5051203370094299),
 ('deathmatch_mode', 0.5045916438102722)]

In [25]:
ft_sg.wv.most_similar(positive="AI")

[('pego', 0.8042883276939392),
 ('niccol', 0.7451541423797607),
 ('egoyan', 0.7275657057762146),
 ('screenwriting', 0.7199376225471497),
 ('akiva', 0.7163833379745483),
 ('irvin', 0.7145070433616638),
 ('koontz', 0.7029660940170288),
 ('crichton', 0.7007303237915039),
 ('irving', 0.6890618801116943),
 ('adapt', 0.6870779395103455)]

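**Answer**: Again, trained_w2c gives much better results (Steven_Spielberg_Artificial_Intelligence, Enemy_AI, AI_bots) than the ft_sg model used above (pego, niccol, egoyan). The reason is that ft_sg lacks domain-specific knowledge about "AI" from movie reviews. In addition, the training tokens were lowercased during preprocessing, so the uppercase query "AI" is out of vocabulary for ft_sg, and its neighbors are derived from subword n-grams alone.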

When you finish Q4, delete the pre-trained word embeddings by running the following code to free up RAM.


In [26]:
del trained_w2c

---



## Q5. Non-Coding Questions

### a) Word Embeddings (for Word2Vec):
1. What does the word embedding vector tell us?


A word embedding vector tells us the following:<br><br>

1.   Similarity among words, measured with cosine similarity. For example, "doctor" and "hospital" would be close.
2.   SG learns embeddings by predicting the surrounding words in a sentence given a target word, capturing the contextual usage of words. CBOW predicts a target word from its surrounding context, focusing more on how words tend to co-occur.
3.   Word embeddings can solve word analogies by simple vector arithmetic.
4.   FastText embeddings include subword information, which allows them to capture morphological relationships between words, such as tense, plurality, and other grammatical variations.
5.   Embeddings can capture broader language patterns and features. For example, they can implicitly reflect gender, tense, and other syntactic distinctions, because these features affect how words are used in context.
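
The analogy point in the list above can be illustrated with vector arithmetic on hand-made vectors. Everything in this sketch is hypothetical: the two-dimensional vectors in `toy` are invented so that one axis loosely stands for "royalty" and the other for "maleness", and the helpers `cosine` and `analogy` are written only for this example. Trained embeddings are much higher-dimensional and only approximately follow such regularities.

```python
import math

# Hypothetical 2-D embeddings: axis 0 ~ "royalty", axis 1 ~ "maleness".
toy = {
    "king": [0.9, 0.9],
    "queen": [0.9, 0.1],
    "man": [0.1, 0.9],
    "woman": [0.1, 0.1],
    "prince": [0.7, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def analogy(a, b, c, vectors):
    """Solve 'a is to b as c is to ?' via vec(b) - vec(a) + vec(c), excluding the inputs."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


answer = analogy("man", "king", "woman", toy)
print(answer)
```

With `gensim`, the same kind of query is usually written by passing both `positive` and `negative` word lists to `most_similar`.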


2. What can I do if I care about the embeddings for a sentence as opposed to a word?


We can use:


1.   Unweighted averaging of the word embeddings.
2.   Weighted averaging, for example with TF-IDF weights, so that more informative words count more.
3.   Sentence encoders based on RNNs (recurrent neural networks) or CNNs (convolutional neural networks).
4.   Transformer-based models: BERT, GPT. BERT processes the entire sentence at once and generates a contextual embedding for each word.
5.   The Universal Sentence Encoder (USE), which uses either a Transformer architecture or a deep averaging network.
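
The first two options in the list above can be sketched in a few lines. The code below averages word vectors with and without weights. It uses IDF-style weights computed from a two-document toy corpus as a simplified stand-in for TF-IDF. The vectors, the corpus, and the helper names `mean_embedding`, `idf_weights`, and `weighted_embedding` are all invented for illustration.

```python
import math
from collections import Counter


def mean_embedding(tokens, vectors):
    """Unweighted average of the vectors of all in-vocabulary tokens."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    return [sum(col) / len(known) for col in zip(*known)]


def idf_weights(corpus):
    """IDF-style weight per token: rarer across documents means a larger weight."""
    n_docs = len(corpus)
    doc_freq = Counter(t for doc in corpus for t in set(doc))
    return {t: math.log(n_docs / df) + 1.0 for t, df in doc_freq.items()}


def weighted_embedding(tokens, vectors, weights):
    """Weighted average of the vectors of all in-vocabulary tokens."""
    pairs = [(vectors[t], weights.get(t, 1.0)) for t in tokens if t in vectors]
    if not pairs:
        return None
    total = sum(w for _, w in pairs)
    dim = len(pairs[0][0])
    return [sum(vec[i] * w for vec, w in pairs) / total for i in range(dim)]


# Hypothetical 2-D word vectors and a two-review toy corpus.
toy_vectors = {"film": [1.0, 0.0], "funny": [0.0, 1.0], "boring": [0.0, -1.0]}
corpus = [["film", "funny"], ["film", "boring"]]

idf = idf_weights(corpus)
plain = mean_embedding(corpus[0], toy_vectors)
weighted = weighted_embedding(corpus[0], toy_vectors, idf)
print(plain, weighted)
```

In the weighted version, "film" appears in every document and gets the smallest weight, so the sentence vector leans toward the more distinctive word "funny".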


3. What happens if the word or token is not in the dictionary?



w2v_sg will throw an error, whereas ft_sg will still return a list of similar words. <br>
The unknown word can be ignored, mapped to a special unknown token, broken into subwords (as FastText does), or given a meaning from its context by contextual embedding models.
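
The contrast between a word-level lookup and a subword fallback can be sketched as follows. This is a toy illustration under stated assumptions: `word_vectors`, `NUM_BUCKETS`, and `bucket_vector` are hypothetical, and the "vectors" produced for n-gram buckets are deterministic placeholders rather than learned weights. The only point is that a subword model can always compose some vector for an unseen word, while a pure word-level table cannot.

```python
import zlib

DIM = 4
NUM_BUCKETS = 50  # hypothetical size; real FastText models use a far larger table


def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of `word`, with < and > marking the word boundaries."""
    padded = f"<{word}>"
    return [padded[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(padded) - n + 1)]


def bucket_vector(gram):
    """Deterministic toy vector for an n-gram's hash bucket (stand-in for learned weights)."""
    bucket = zlib.crc32(gram.encode("utf-8")) % NUM_BUCKETS
    return [((bucket * (i + 1)) % 7 - 3) / 3.0 for i in range(DIM)]


# Toy word-level vocabulary: only "film" was seen during training.
word_vectors = {"film": [0.1, 0.2, 0.3, 0.4]}


def lookup_word_level(word):
    """Word-level lookup: raises KeyError for out-of-vocabulary words."""
    return word_vectors[word]


def lookup_subword(word):
    """Subword lookup: average the vectors of the word's character n-grams."""
    vecs = [bucket_vector(g) for g in char_ngrams(word)]
    return [sum(col) / len(vecs) for col in zip(*vecs)]


try:
    lookup_word_level("filmz")
except KeyError:
    print("word-level lookup failed for an unseen word")

print(lookup_subword("filmz"))  # still returns a vector of length DIM
```

A vector composed this way is only as meaningful as the n-grams behind it, so neighbors of a very short or unusual out-of-vocabulary word can look arbitrary.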

### b) What is the difference between 'cbow' and 'skip-gram'? Does this explain the difference between results from

```python
w2v_cbow.wv.most_similar(positive="disney")
```

and

```python
w2v_sg.wv.most_similar(positive="disney")
```



**Answer:**
CBOW results: animate, dvd, decade, cinema, remake, version, animated, entry, theatrical, classic <br>
SG results: animation, animate, dreamworks, animated, pocahontas, mermaid, hercules, hunchback, walt, dame<br>
CBOW predicts a target word based on its context. Essentially, the model looks at a window of surrounding words (the context) and tries to predict the word in the middle of this window. It is usually faster than SG and represents frequent words well. However, because it averages the context, it tends to oversimplify it. <br>
SG predicts the context words from a target word. For each word in the corpus, it uses the current word as input and tries to predict words within a certain range before and after the current word. While slower, SG tends to handle infrequent words better and can produce more detailed embeddings for a wider range of words.<br>
In our case, CBOW returned a wider variety of general words, whereas SG returned several forms of the same word, such as "animation", "animate", and "animated". SG also captured "pocahontas" and "hunchback", which are rare words that CBOW missed, consistent with SG handling infrequent words better.

In [27]:
w2v_cbow.wv.most_similar(positive="disney")

[('animate', 0.8779931664466858),
 ('dvd', 0.8596371412277222),
 ('decade', 0.8538360595703125),
 ('cinema', 0.8518430590629578),
 ('remake', 0.8399472832679749),
 ('version', 0.8386768102645874),
 ('animated', 0.8268529176712036),
 ('entry', 0.8246707916259766),
 ('theatrical', 0.8244009017944336),
 ('classic', 0.8214408755302429)]

In [28]:
w2v_sg.wv.most_similar(positive="disney")

[('animation', 0.7560753226280212),
 ('animate', 0.7494263052940369),
 ('dreamworks', 0.7273356318473816),
 ('animated', 0.7238619923591614),
 ('pocahontas', 0.722666323184967),
 ('mermaid', 0.7204435467720032),
 ('hercules', 0.6872342824935913),
 ('hunchback', 0.6869755983352661),
 ('walt', 0.6835805177688599),
 ('dame', 0.6810756325721741)]



---

## End of Assignment