# Homework 1

The first homework consists of two separate parts:


1.  Text preprocessing and representation
2.  Word embeddings

You can get 10 points for this homework: 5 points for the first part, and 5 points for the second part.

To do this homework make a copy of this notebook (`File` -> `Save a copy in Drive`) and work on it. Alternatively, download the notebook, if you want to work on it locally (`File` -> `Download` -> `Download .ipynb`).

When submitting the homework, please make sure to run all the cells, see that there are no errors, and the outputs for all cells are present in the saved version that you submit.

**Note: Please also state at the beginning of your homework if you collaborated with anybody when doing the work and the names of your collaborators if you had them.**

In [1]:
!pip install --quiet datasets evaluate

## Part A: Text preprocessing and representation (5 points)

In part A of the homework, we will use the [IMDB](https://huggingface.co/datasets/imdb) dataset. IMDB is a movie review dataset for binary sentiment classification. The dataset provides 25,000 movie reviews for training and 25,000 for testing. You will explore the dataset in Task 1 and build a text classification pipeline in Task 2.


In [2]:
from datasets import load_dataset
import nltk
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
dataset = load_dataset("imdb")

dataset


DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

In [3]:
dataset['train'][15001]

{'text': 'Those of the "Instant Gratification" era of horror films will no doubt complain about this film\'s pace and lack of gratuitous effects and body count. The fact is, "The Empty Acre" is a good a example of how independent horror films should be done.<br /><br />If you avoid the indie racks because you are tired of annoying teens or twenty somethings getting killed by some baddie whose back-story could have come off the back of a Count Chocula box, "The Empty Acre" is the movie for you.<br /><br />Set in the decaying remnants of the rural American dream, "The Empty Acre" is the tale of a young couple struggling with the disappearance of their six-month-old baby. As the couple\'s weak relationship falls apart, a larger story plays out in the background. At night, a shapeless dark mass seethes from a sun baked barren acre on their farm and seemingly devours anything in its path, leaving no sign that it was ever there.<br /><br />The film is loaded with enigmatic characters and vis

### Task 1. Dataset statistics (2 points)


#### Task 1.1. General dataset statistics [1 point]

Tokenize reviews using either NLTK or spaCy.
Compute the number of sentences, number of tokens, and number of unique tokens per review.
Report the following statistics for both train and test sets:

*   average number of sentences
*   average number of tokens
*   average number of unique tokens (word types)
*   maximum number of sentences
*   maximum number of tokens
*   maximum number of unique tokens (word types)


In [4]:
from nltk.tokenize import sent_tokenize, word_tokenize

import numpy as np

# Per-review counts of sentences, tokens, and unique tokens (train split)
sent_counts, word_counts, dict_lens = [], [], []
for review in dataset['train']:
  review = review['text']
  sent_tokenized = sent_tokenize(review)
  word_tokenized = word_tokenize(review)
  n_sent = len(sent_tokenized)
  n_words = len(word_tokenized)
  n_uq_words = len(set(word_tokenized))

  sent_counts.append(n_sent)
  word_counts.append(n_words)
  dict_lens.append(n_uq_words)

print(f'avg sentence count: {np.mean(sent_counts)}')
print(f'avg word count: {np.mean(word_counts)}')
print(f'avg dict size: {np.mean(dict_lens)}')

print(f'max sentence count: {np.max(sent_counts)}')
print(f'max word count: {np.max(word_counts)}')
print(f'max dict size: {np.max(dict_lens)}')


avg sentence count: 10.84228
avg word count: 282.61352
avg dict size: 153.71348
max sentence count: 282
max word count: 2818
max dict size: 726
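
The task asks for these statistics on both the train and the test set, while the cell above covers the train split. The following is a minimal sketch of a reusable helper, not a definitive implementation. The name `review_stats` and the regex-based stand-in tokenizers are hypothetical and exist only so the example runs without NLTK data; in the notebook one would pass NLTK's `sent_tokenize` and `word_tokenize` and call the helper once per split.

```python
import re
from statistics import mean

def review_stats(texts, sent_tokenize, word_tokenize):
    """Return average and maximum sentence, token, and type counts per review."""
    sent_counts, token_counts, type_counts = [], [], []
    for text in texts:
        tokens = word_tokenize(text)
        sent_counts.append(len(sent_tokenize(text)))
        token_counts.append(len(tokens))
        type_counts.append(len(set(tokens)))
    return {
        'avg_sentences': mean(sent_counts), 'max_sentences': max(sent_counts),
        'avg_tokens': mean(token_counts), 'max_tokens': max(token_counts),
        'avg_types': mean(type_counts), 'max_types': max(type_counts),
    }

# Simplistic stand-in tokenizers (an assumption of this sketch, not NLTK's behavior)
def simple_sents(text):
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

def simple_words(text):
    return re.findall(r"\w+|[^\w\s]", text)

toy_reviews = ["Good film. Good cast!", "Bad."]
stats = review_stats(toy_reviews, simple_sents, simple_words)
print(stats)
```

With the real data, the call would look like `review_stats(dataset[split]['text'], sent_tokenize, word_tokenize)` for `split` in `('train', 'test')`.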


#### Task 1.2. Lemmatization and case-folding [1 point]

Lemmatize reviews and provide the following statistics:
*   Average number of unique lemmas (in original case)
*   Average number of unique lemmas (in lowercase)

***Question.*** Why is there a difference between the numbers of unique lemmas in original and lowercase texts? Provide examples of lemmas that contribute to the difference.

Note: Lemmatization could take some time (8-16 minutes). You can take 5000 samples from the train set and do this task on a smaller portion of the dataset.
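
Taking a fixed-size sample is easy to get off by one when done with a manual counter. A small sketch of one way to take exactly the first 5000 items from any iterable follows; the generator `reviews` is a hypothetical stand-in for the train split. The `datasets` library also provides `Dataset.select` for the same purpose.

```python
from itertools import islice

# Hypothetical stand-in for iterating over dataset['train']
reviews = (f"review {i}" for i in range(25000))

# islice stops after exactly 5000 items (indices 0..4999)
subset = list(islice(reviews, 5000))
print(len(subset))
```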

In [5]:
from nltk.stem import WordNetLemmatizer
wnl = WordNetLemmatizer()

lemm_dict_sizes, lemm_dict_sizes_lower = [], []
print('Example difference lemmas: ')
for i, review in enumerate(dataset['train']):
  # Use only the first 5000 reviews to keep the lemmatization time manageable
  if i >= 5000:
    break
  review = review['text']

  # Lemmatize the tokens in their original case
  tokens = word_tokenize(review)
  lemm_text = [wnl.lemmatize(word) for word in tokens]

  # Lemmatize the tokens after lowercasing the whole review
  tokens_lower = word_tokenize(review.lower())
  lemm_text_lower = [wnl.lemmatize(word) for word in tokens_lower]

  # For a few reviews, print the lemmas that occur only in the original-case version
  if i in [1, 2, 5, 7, 12, 21]:
    print(set(lemm_text).difference(set(lemm_text_lower)))

  lemm_dict_sizes.append(len(set(lemm_text)))
  lemm_dict_sizes_lower.append(len(set(lemm_text_lower)))

print('----------------------------------------------')
print(f'avg unique lemmas: {np.mean(lemm_dict_sizes)}')
print(f'avg unique lemmas lower: {np.mean(lemm_dict_sizes_lower)}')

Example difference lemmas: 
{'NC-17', 'Before', 'This', 'As', 'The', 'Sevigny', 'Granted', 'And', 'Am', 'Bunny', 'Gallo', 'It', 'Yellow', 'Chloe', 'R-rated', 'I', 'Vincent', 'Brown', 'Nowhere', 'In', 'Curious', 'American'}
{'IMPORTANT', 'One', 'This', 'If', 'The'}
{'Some', 'And', 'Marxism', 'I', 'There', 'The', 'Even'}
{'Before', 'A', 'Masters', 'Ricky.', 'Danny', 'The', 'Although', 'Overall', 'Ricky', 'Pino', 'Lucille', 'PBS', 'E', 'Ball', 'It', 'Rachel', 'I', 'Desi', 'If', 'He', 'To', 'Biography', 'Finding', 'York', 'She', 'Lucy', 'Arnaz', 'Ethel', 'At', 'Love', 'Laughter', 'Fred', 'American', 'When'}
{'Never', 'Fosse', 'Last', 'This', 'The', 'Star', 'Bogdanovich', 'Show', 'Dorothy', 'Bob', 'Stratten', 'I', 'Very', 'Orson', 'Welles', 'Hansen', 'Hepburn', 'Ms.', 'Playboy', 'Paper', 'Patty', 'Audrey', 'Moon', 'Picture', 'SISTER'}
{'We', 'And', '2D', '3D', '3-D', 'I', 'Watch', 'Dir-Brad', 'Sykes', 'Wow', 'The', 'Well', 'That', 'Mindless'}
----------------------------------------------


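The difference arises because unique lemmas are counted case-sensitively: in the original text, a capitalized or all-caps token is a different string from its lowercase form, so "The" and "the" count as two lemmas. The examples printed above show the main sources: sentence-initial function words ('The', 'This', 'And', 'If'), words written in all caps for emphasis ('IMPORTANT', 'SISTER'), and capitalized words from titles ('Love', 'Star'). Lowercasing merges such a variant with its lowercase counterpart whenever both occur in the same review, which reduces the number of unique lemmas. Proper nouns such as 'Gallo' also appear in the printed sets, but they reduce the count only if their lowercase form occurs in the same review as well.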
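
The effect can be illustrated on a toy token list. This is a sketch with made-up tokens and no lemmatizer involved, since the difference comes from case alone:

```python
# Hypothetical tokens from a short review
tokens = ["The", "movie", "was", "good", "and", "the", "cast", "was", "GOOD"]

original_types = set(tokens)
lowercase_types = {token.lower() for token in tokens}

# Types that exist only in the original-case version
case_only = original_types - lowercase_types

print(len(original_types), len(lowercase_types), sorted(case_only))
```

Here "The" merges with "the" and "GOOD" merges with "good", so lowercasing removes two types.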

### Task 2. Text classification pipeline [3 points]

In task 2, you will focus on the sentiment analysis task. Sentiment analysis is concerned with analyzing the expressed opinion in a sentence or a text document. You will set up a basic text classification pipeline for binary sentiment classification (0 for negative sentiment and 1 for positive sentiment). You can start by loading the IMDB dataset (again).

In [6]:
dataset = load_dataset("imdb")
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})


#### Task 2.1. Custom tokenization [0.5 points]
Create a custom tokenization function. Your tokenization function should minimally:
- tokenize each text
- lowercase
- remove punctuation

Optionally:
- lemmatize or stem each text

In [121]:
wnl = WordNetLemmatizer()

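def tokenize_review(review):
    # Split the review into tokens and lowercase them
    tokens = word_tokenize(review)
    tokens = [token.lower() for token in tokens]
    # Keep purely alphabetic tokens; this drops punctuation, numbers, and tokens such as "n't"
    alpha_tokens = [word for word in tokens if word.isalpha()]
    # Lemmatize the remaining tokens
    lemm_tokens = [wnl.lemmatize(word) for word in alpha_tokens]

    return lemm_tokens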


#### Task 2.2. Vectorization [0.5 points]
Initialize the `TfidfVectorizer`. Pass the custom tokenizer to the vectorizer. Fit and transform the vectorizer on the `text` column of the train split and transform the `text` column of the test split into vector representations: `train_x` and `test_x`. Accordingly, assign the `label` column in each split to `train_y` and `test_y`.
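
To make the weighting concrete, here is a pure-Python sketch of TF-IDF on pre-tokenized documents. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is what scikit-learn documents as the default for `TfidfVectorizer`; the function name `tfidf` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed inverse document frequency
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        raw = {term: count * idf[term] for term, count in tf.items()}
        # L2-normalize; fall back to 1.0 for an empty document
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        vectors.append({term: w / norm for term, w in raw.items()})
    return vectors

docs = [["great", "movie"], ["bad", "movie"]]
vecs = tfidf(docs)
print(vecs[0])
```

"movie" occurs in both documents, so it receives a lower weight than "great" or "bad", which occur in only one.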

In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(tokenizer=tokenize_review)
train_x = vectorizer.fit_transform(dataset['train']['text'])
train_y = dataset['train']['label']
test_x = vectorizer.transform(dataset['test']['text'])
test_y = dataset['test']['label']



#### Task 2.3. Training the model [0.5 points]

Initialize and train `LogisticRegression` with different hyperparameters than we used in the class. Refer to the [sklearn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html?highlight=logistic%20regression#sklearn.linear_model.LogisticRegression) for more details on different hyperparameters.


In [10]:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression().fit(train_x, train_y)

#### Task 2.4. Evaluating the model [0.5 points]

Evaluate the model on the test set. Report precision, recall, and F1-score. You can use sklearn's [`classification_report`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html?highlight=classification%20report#sklearn.metrics.classification_report) to get the scores with a precision of 4 digits, i.e., each score should have 4 digits after the decimal point (e.g., 0.8896).




In [11]:
from sklearn.metrics import classification_report

y_pred = clf.predict(test_x)
print(classification_report(test_y, y_pred, digits=4))

              precision    recall  f1-score   support

           0     0.8808    0.8803    0.8806     12500
           1     0.8804    0.8809    0.8806     12500

    accuracy                         0.8806     25000
   macro avg     0.8806    0.8806    0.8806     25000
weighted avg     0.8806    0.8806    0.8806     25000
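
For reference, the per-class scores in the report follow from counts of true positives, false positives, and false negatives. A minimal sketch on made-up labels follows; `binary_prf` is a hypothetical helper, not part of sklearn.

```python
def binary_prf(y_true, y_pred, positive=1):
    """Return (precision, recall, f1) for the class given by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels: 2 true positives, 1 false positive, 1 false negative
y_true_toy = [1, 1, 0, 0, 1]
y_pred_toy = [1, 0, 0, 1, 1]
precision, recall, f1 = binary_prf(y_true_toy, y_pred_toy)
print(f'{precision:.4f} {recall:.4f} {f1:.4f}')
```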



#### Task 2.5. Inference function [0.5 points]
Write code that allows you to input any text into the model and get the prediction. To do that, use the same vectorizer as for the training data to transform the input text for the model.

Predict a label for the example text below.

In [13]:
example_text = """"Don't Look Up" tells a chilling story of lies, oppression, explosion, and deceit in modern day world, but in a light hearted way. The story itself is disturbing, but the delivery is not too depressing. The numerous stars add to the entertaining factor too. I enjoyed watching it."""

In [14]:
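def predict(text, model=clf, vectorizer=vectorizer):
  # transform() expects an iterable of raw documents and applies tokenize_review itself;
  # a list of tokens would be treated as one document per token
  vectorized = vectorizer.transform([text])
  pred = model.predict(vectorized)[0]
  return pred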

predict(example_text)

0

#### Task 2.6. Qualitative evaluation [0.5 points]

Come up with four short movie reviews that would be predicted as true positive, true negative, false positive, and false negative by the model.

Usually, just one or two short sentences are enough. Also, your writing skills are not assessed here, so you can write anything as long as it works! If you cannot come up with anything that meets the criteria, you can write down below why you think it didn't work and what your strategy was.

In [124]:
tn = 'This movie sucks and is bad, horrible'
tp = 'The movie is amazing, great and wonderful!'
fn = 'There are flavourful dark undertones in every scene, making it a horrifically exciting watch'
fp = 'The movie is great for the twisted people that get off on indiscriminate violence.'

print(f'TN: {predict(tn)}')
print(f'TP: {predict(tp)}')
print(f'FN: {predict(fn)}')
print(f'FP: {predict(fp)}')


TN: 0
TP: 1
FN: 0
FP: 1
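
Whether a prediction counts as a true or false positive or negative depends on the intended (gold) label as well as on the predicted one. A small sketch of that mapping follows; `outcome` is a hypothetical helper and the label pairs are made up.

```python
def outcome(gold, pred):
    """Map a gold label and a predicted label (0 or 1) to TP, TN, FP, or FN."""
    if gold == pred:
        return 'TP' if pred == 1 else 'TN'
    return 'FP' if pred == 1 else 'FN'

# (gold, predicted) pairs covering all four cases
cases = [(1, 1), (0, 0), (0, 1), (1, 0)]
labels = [outcome(gold, pred) for gold, pred in cases]
print(labels)
```

A review intended as a false negative therefore needs a gold label of 1 and a prediction of 0, and a false positive needs a gold label of 0 and a prediction of 1.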


## Part B: Word embeddings (5 points)

In [125]:
#!pip install --quiet --upgrade datasets evaluate gensim nltk

### Task 1 Word embeddings analysis [1.5 points]

#### Task 1.1 Load the model [0.5 points]
Download the Word2Vec model trained on the English Wikipedia dump of February 2017 from this [URL](http://vectors.nlpl.eu/repository/20/6.zip).

Load the model either from `bin` or `txt` file with `gensim` or your custom code.

In [126]:
from gensim.models import KeyedVectors
model_path = 'model.bin'
model = KeyedVectors.load_word2vec_format(model_path, binary=True)

#### Task 1.2 Words with multiple meanings [1 point]

The same word can represent multiple meanings due to homonymy or polysemy. Consider, for example, the word ["bow"](https://en.wiktionary.org/wiki/bow#Noun) which can mean several unrelated things including a type of weapon and a type of knot. Word2Vec and similar word embedding models are generally unable to distinguish between different meanings of the same word. Let us examine how this affects the words' neighbors.

For this task, come up with at least 3 words with multiple meanings in English. For each word output its 10 nearest neighbors. For each neighbor find the related meaning on the word's Wiktionary page (if there is such a meaning on [Wiktionary](https://en.wiktionary.org/wiki/Wiktionary:Main_Page)).

Analyze the related meanings. What do you think is possible to say about the corpus the model was trained on based on these meanings?

In [29]:
for word in ['well', 'bank', 'bat']:
    neighbors = model.most_similar(word, topn=10)
    print(f'"{word}" top 10 neighbors are: ')
    print(', '.join([tup[0] for tup in neighbors]))
    print('\n')

"well" top 10 neighbors are: 
including, besides, notably, addition, also, especially, particularly, Besides, etc., importantly


"bank" top 10 neighbors are: 
banks, Bank, banking, Citibank, non-bank, Eurobank, depositor, depositors, inter-bank, state-chartered


"bat" top 10 neighbors are: 
bats, Rhinolophus, sheath-tailed, mouse-eared, Hipposideros, Phyllostomidae, Molossidae, Rhinolophidae, tube-nosed, leaf-nosed




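- 'well' - the neighbors match its adverbial use in additive constructions such as "as well (as)", not the noun (a water well) or the adjective (healthy). This use is common in almost any written text, so it is hard to say anything about the corpus here.
- 'bank' - the neighbors all reflect its meaning in finance and none relate to a river bank, so the corpus likely had much more financial content.
- 'bat' - the neighbors reflect the animal meaning and include Latin names of bat genera and families. None relate to sports equipment, which points to more scientific, encyclopedic content in the corpus.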

### Task 2 Reverse dictionary with word embeddings [3.5 points]

The goal of this task is to build a very simple reverse dictionary based on word embeddings. A reverse dictionary is a resource that helps you find words based on their definitions or descriptions, rather than the traditional dictionary format which helps you find the definition of a word. In a reverse dictionary, you start with an idea, concept, or description of what you're trying to express and then look up the term or phrase that matches that description.

#### Task 2.1 Load the data [0.5 points]

Use the Google Drive [link](https://drive.google.com/file/d/1emoyY4Nfhu8O6MSvnwAtSry3JdehyBwe/view?usp=sharing) to download the dataset first introduced in [Learning to Understand Phrases by Embedding the Dictionary](https://aclanthology.org/Q16-1002/). Not everything in this JSON is relevant to this task. We are only interested in the "word" and "definitions" fields. The "word" field contains the target words we want to predict, while the "definitions" field, despite the name, contains a single definition per word. We want to use the definitions to find the target words.

In [85]:
import json
fpath = 'data_desc_c.json'
with open(fpath, 'r') as f:
    data = json.load(f)

[{'word': 'forget', 'lexnames': ['verb.cognition'], 'root_affix': [], 'sememes': ['forget'], 'definitions': 'when you knew a fact or to do something in the past but then without trying you lost this knowledge'}, {'word': 'office', 'lexnames': ['noun.group', 'noun.act', 'noun.artifact', 'noun.state'], 'root_affix': [], 'sememes': ['part', 'room', 'engage', 'handle', 'organization', 'duty', 'alive', 'Occupation', 'earn', 'affairs'], 'definitions': 'a room in a house or building where people study or work'}, {'word': 'cheap', 'lexnames': ['adj.all'], 'root_affix': [], 'sememes': ['cheap', 'commerce'], 'definitions': 'something that does not cost a lot of money'}, {'word': 'obtain', 'lexnames': ['verb.change', 'verb.possession', 'verb.stative'], 'root_affix': [], 'sememes': ['obtain'], 'definitions': 'to get or achieve something that you want'}, {'word': 'foot', 'lexnames': ['noun.person', 'noun.act', 'noun.artifact', 'noun.quantity', 'verb.motion', 'noun.group', 'noun.communication', 'nou

#### Task 2.2 Vectorize the definitions [1 point]

Transform each definition in the dataset into the vector representation using the Word2Vec model that we loaded in the previous task. To do so, follow these steps:


1. Tokenize the definition
2. Remove the words that are not present in the model's dictionary
3. Remove repeating words
4. For each token retrieve the embedding from the model
5. Take the mean of the retrieved embeddings to produce a single vector representation of the definition
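
The steps above can be sketched on a toy example in pure Python. The 2-dimensional embeddings and the helper name `definition_vector` are hypothetical; with the real model, the lookup table would be the Word2Vec vocabulary.

```python
def definition_vector(tokens, embeddings):
    """Average the embeddings of the unique in-vocabulary tokens of a definition."""
    # Steps 2 and 3: drop out-of-vocabulary tokens, then drop repeats (order preserved)
    known = list(dict.fromkeys(t for t in tokens if t in embeddings))
    if not known:
        return None  # no token of the definition is in the vocabulary
    dim = len(embeddings[known[0]])
    # Steps 4 and 5: look up each embedding and take the component-wise mean
    return [sum(embeddings[t][i] for t in known) / len(known) for i in range(dim)]

# Toy embedding table (made-up values)
toy_embeddings = {'cost': [1.0, 0.0], 'money': [0.0, 1.0], 'not': [1.0, 1.0]}

# 'money' is repeated and 'zzz' is out of vocabulary
vec = definition_vector(['not', 'cost', 'money', 'money', 'zzz'], toy_embeddings)
print(vec)
```

Returning `None` for a definition without any known token is one possible design choice; it avoids averaging an empty list.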

In [127]:
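model = KeyedVectors.load_word2vec_format('model.bin', binary=True)
vocab = model.key_to_index

def vectorize_dictionary(dict_data):
    emb_pairs = []
    for word in dict_data:
        # Tokenize, lowercase, and lemmatize the definition
        tokenized_def = tokenize_review(word['definitions'])
        # Remove repeated tokens while preserving their order
        tokenized_def = list(dict.fromkeys(tokenized_def))
        # Keep only the tokens that are in the model's vocabulary
        tokenized_def = [token for token in tokenized_def if token in vocab]
        # Average the token embeddings into a single definition vector
        # (assumes every definition has at least one token in the vocabulary)
        embeddings = [model[token] for token in tokenized_def]
        def_emb = np.mean(embeddings, axis=0)
        emb_pairs.append((word['word'], def_emb))
    return emb_pairs

def_emb_pairs = vectorize_dictionary(data)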

#### Task 2.3 Retrieve words similar to definitions [2 points]

1. For each definition compute the cosine similarity between the words in the model's vocabulary and the definition vector, sort the words in descending order based on the similarity to the definition
2. For each definition find the index of the target word in the list obtained in the previous step - this is the rank of the target word
3. Compute the following statistics for the ranks: max, min, median
4. Find the definition with the highest target word rank and output the target word, the definition, and the top 10 words most similar to the definition. Are there any other words in the top 10 that fit the definition?
5. Repeat the previous step for the word with the lowest target word rank
6. Analyze the definitions from the two previous steps. Suggest an explanation for the results
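
Steps 1 and 2 can be sketched on toy data in pure Python. The 2-dimensional embeddings, the definition vector, and the helper names `cosine` and `target_rank` are hypothetical; on the full vocabulary, a matrix product over normalized embeddings is much faster than this loop.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def target_rank(def_vec, target, embeddings):
    """Rank all words by similarity to the definition; rank 0 is the most similar."""
    ranked = sorted(embeddings, key=lambda w: cosine(embeddings[w], def_vec), reverse=True)
    return ranked.index(target), ranked

# Toy vocabulary (made-up values) and a definition vector pointing towards 'art'
toy_vocab = {'art': [0.9, 0.1], 'painting': [0.8, 0.3], 'bank': [0.1, 0.9]}
toy_def_vec = [1.0, 0.2]

rank, ranked = target_rank(toy_def_vec, 'art', toy_vocab)
print(rank, ranked)
```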

In [128]:
# Embeddings of all vocabulary words (rows follow the vocabulary index order) and of all definitions
vocab_embeds = np.array([model[token] for token in vocab])
def_embeds = np.array([pair[1] for pair in def_emb_pairs])

# L2-normalize the rows so that a dot product equals the cosine similarity
norm_vocab = np.linalg.norm(vocab_embeds, axis=1, keepdims=True)
norm_def = np.linalg.norm(def_embeds, axis=1, keepdims=True)
vocab_embeds_norm = vocab_embeds / norm_vocab
def_embeds_norm = def_embeds / norm_def

# similarities[i, j] is the cosine similarity between vocabulary word i and definition j
similarities = np.dot(vocab_embeds_norm, def_embeds_norm.T)
print(similarities.shape)

# Sort each column in descending order of similarity
sorted_indices = np.argsort(similarities, axis=0)[::-1, :]
print(sorted_indices.shape)

(302866, 200)
(302866, 200)


In [102]:
target_ranks = {}
for i, word in enumerate(data):
    target_word = word['word']
    target_word_vocab_ind = vocab[target_word]
    definition_ranks = sorted_indices[:, i]

    target_rank = np.where(definition_ranks == target_word_vocab_ind)[0][0]
    target_ranks[target_word] = target_rank

In [104]:
pure_target_ranks = np.array(list(target_ranks.values()))

print(f'Vocab size {len(vocab)}')
print(f'Target rank median {np.median(pure_target_ranks)}')
print(f'Target rank min {np.min(pure_target_ranks)}')
print(f'Target rank max {np.max(pure_target_ranks)}')

Vocab size 302866
Target rank median 521.0
Target rank min 0
Target rank max 104167


In [114]:
min_rank_target_ind = np.argmin(pure_target_ranks)
target_word = data[min_rank_target_ind]
# Double-quoted f-strings keep the nested single quotes valid on Python versions before 3.12
print(f"Highest target word rank word: \"{target_word['word']}\"")
print(f"Highest target word rank definition: \"{target_word['definitions']}\"")

similar_word_vocab_inds = sorted_indices[:10, min_rank_target_ind]
vocab_list = np.array(list(vocab.keys()))
print('The 10 most similar words to the definition are: ')
print(', '.join(vocab_list[similar_word_vocab_inds]))

Highest target word rank word: "art"
Highest target word rank definition: "the word for painting music theatre sculpture and other creative activities"
The 10 most similar words to the definition are: 
art, painting, music, the, sculpture, for, a, dance, is, of


The definition of "art" directly contains several of the words closest to it (painting, music, sculpture). These words are close to "art" and to each other in the embedding space because they appear together in the corpus of the embedding model, so averaging their embeddings yields a vector very close to "art", which explains its top rank. "Dance" is not in the definition, but it also fits it as another creative activity. The other close words (the, for, a, is, of) are function words, which are generally very common in language; "the" and "for" come directly from the definition.

In [117]:
max_rank_target_ind = np.argmax(pure_target_ranks)
target_word = data[max_rank_target_ind]

# Double-quoted f-strings keep the nested single quotes valid on Python versions before 3.12
print(f"Lowest target word rank word: \"{target_word['word']}\"")
print(f"Lowest target word rank definition: \"{target_word['definitions']}\"")

similar_word_vocab_inds = sorted_indices[:10, max_rank_target_ind]
print('The 10 most similar words to the definition are: ')
print(', '.join(vocab_list[similar_word_vocab_inds]))

Lowest target word rank word: "president"
Lowest target word rank definition: "the most important politician in various countries including the usa"
The 10 most similar words to the definition are: 
various, notably, including, important, well, many, besides, numerous, in, shrew-forms


Here we see that the embedding approach works poorly for the word "president": the most similar words do not appear to be closely related to it, hence the low ranking of the word. This is likely because the definition describes the target word not by association with closely related words, but in a more contextual and indirect manner. Most of its tokens are generic words (various, including, important), and these dominate the averaged vector, as the top 10 neighbors show. This suggests that a more complex model is needed to capture the meaning of such definitions.