<a href="https://colab.research.google.com/github/jdasam/aat3020-2023/blob/main/Assignment1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment 1

In this assignment, you will explore word vectors.

- Submission: A report in ``pdf``, your completed notebook file in ``ipynb``, and your training data in ``txt``
    - The assignment will be evaluated mainly on the report, so please include every detail you want to present in your report
    - Report: Free format. You can copy and paste part of your code for some problems.
      - Report has to be written in English
    - ipynb: Save your notebook (with output of each cell if possible) as ipynb and submit it
- Evaluation criteria
    - How interesting and original the presented examples are
    - How well you explain why your examples succeed or fail, considering how Word2Vec is trained

## 0. Setup
- Check that the ``gensim`` library is installed
  - If not, you can install it using ``!pip install gensim``
- List the downloadable vectors from ``gensim``


In [1]:
import gensim
import numpy as np
import pprint as pp

In [None]:
import gensim.downloader
list(gensim.downloader.info()['models'].keys())

- From the model codes listed above, select one ``glove-wiki-gigaword`` or ``glove-twitter`` model of your choice
    - The number at the end of each code is the dimensionality of the word vectors
        - e.g. ``glove-twitter-200`` was trained on Twitter data and embeds each word into a 200-dim vector
        - e.g. ``glove-wiki-gigaword-300`` was trained on Wikipedia and Gigaword data and embeds each word into a 300-dim vector
- Download the selected model and load it as ``model``

In [None]:
your_model_code = 'glove-wiki-gigaword-300' # select one of the model codes above
model = gensim.downloader.load(your_model_code) # download and load the model. It can take some time

In [None]:
# test the model output
model['cat']

## Problem 1. Find Most Similar Words (10 pts)
- One of the simplest and most typical use cases of Word2Vec is finding words based on similarity.
- You can list the most similar words for a given query word by using ``model.most_similar(your_word)``
    - Usually, every word in the model's vocabulary is in lowercase
- **In your report**, present more than **5** interesting examples and explain **why they were interesting to you**
    - Try to explain why those words are regarded as similar in Word2Vec, considering how it was trained
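
When explaining why two words come out as similar, it can help to look at which context words they share, since both Word2Vec and GloVe learn from words that appear near each other within a window. The following is a minimal sketch of that idea, not the actual training procedure. The toy sentences, the window size, and the names `count_cooccurrences` and `contexts` are illustrative assumptions.

```python
from collections import Counter

def count_cooccurrences(sentences, window=2):
    """Count how often each (center, context) pair appears within `window` tokens."""
    counts = Counter()
    for tokens in sentences:
        for i, center in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(center, tokens[j])] += 1
    return counts

# Toy corpus (assumption): two sentences that differ only in the drink and time of day
toy_sentences = [
    "i drink hot coffee every morning".split(),
    "i drink hot tea every evening".split(),
]
cooc = count_cooccurrences(toy_sentences, window=2)

def contexts(word):
    """Set of words seen within the window around `word`."""
    return {ctx for (center, ctx) in cooc if center == word}

shared = contexts("coffee") & contexts("tea")
print(sorted(shared))  # ['drink', 'every', 'hot']
```

Here "coffee" and "tea" never appear in the same sentence, yet they share most of their contexts. Words used in the same contexts tend to receive similar vectors, which is also why opposites such as "good" and "bad" can rank as highly similar.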
   

In [None]:
target_word = 'sogang' # Enter your word string here
# check that the word is in the vocabulary of the model
assert model.has_index_for(target_word), f"The selected word, {target_word}, is not included in the model's vocabulary"
model.most_similar(target_word)

## Problem 2. Word Analogy (10 pts)
- Another interesting thing you can do with Word2Vec is word analogy
- Word analogy is done by adding and subtracting word vectors
- In the cell below, you can run an example like this
    - ``analogy(model, 'man', 'king', 'woman')`` represents the question "man is to king as woman is to what?"
- Try it with words of your own choice.
- **In your report**, present at least **5** interesting examples of your choice
    - You can include failure cases
    - Describe what you expected and why the result was interesting to you
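
To see the geometry behind an analogy query without loading the full model, here is a small sketch on hand-made 2-D vectors. The vectors, the axis interpretation, and the helper name `toy_analogy` are assumptions for illustration only. Real embeddings have hundreds of dimensions and are far less tidy.

```python
import numpy as np

# Hand-made 2-D vectors (assumption): axis 0 ~ "royalty", axis 1 ~ "gender"
toy_vectors = {
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "apple": np.array([-1.0, 0.2]),
}

def toy_analogy(x1, x2, y1):
    """Rank candidates for 'x1 is to x2 as y1 is to ?' by cosine similarity."""
    target = toy_vectors[x2] - toy_vectors[x1] + toy_vectors[y1]
    scores = {}
    for word, vec in toy_vectors.items():
        if word in (x1, x2, y1):
            continue  # skip the query words themselves
        cos = np.dot(target, vec) / (np.linalg.norm(target) * np.linalg.norm(vec))
        scores[word] = float(cos)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

ranked = toy_analogy("man", "king", "woman")
print(ranked[0][0])  # queen
```

In this toy space, `king - man + woman` lands exactly on `queen`. With real vectors the target only lands near some word, so failure cases are common and worth discussing.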

In [None]:
def analogy(model, x1, x2, y1):
  pp.pprint(model.most_similar([x2, y1], negative=[x1]))

# Try with your own word choice
analogy(model, 'man', 'king', 'woman')

## Problem 3. Simple Mathematics with Word2Vec (5 pts)
- In this problem, you have to complete the given functions ``word_analogy_with_vector`` and ``get_cosine_similarity``
  - To get exactly the same result as ``model.most_similar()``, you have to normalize each vector before doing arithmetic.
  - Use the L2 norm (the square root of the sum of squares of every item in the vector)
  - Because ``model.most_similar()`` receives a vector rather than words here, the result will naturally include the positive query words themselves.
- In your report, **please include your code for these functions**
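
As a quick illustration of the L2 norm mentioned above, here is a sketch on two small toy vectors rather than on the model. The values are chosen only to keep the arithmetic easy to check by hand. It shows that the dot product of two unit-length vectors equals their cosine similarity, and that normalizing keeps a long vector from dominating a sum.

```python
import numpy as np

v = np.array([3.0, 4.0])
w = np.array([4.0, 3.0])

# L2 norm: square root of the sum of squares, here sqrt(9 + 16) = 5
l2_norm = np.sqrt(np.sum(v ** 2))
v_unit = v / l2_norm
w_unit = w / np.linalg.norm(w)  # np.linalg.norm computes the same L2 norm

# For unit vectors, the dot product equals the cosine similarity
cos_from_units = float(np.dot(v_unit, w_unit))
cos_direct = float(np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w)))

# Without normalization, a long vector dominates a sum
long_vec = 10 * v
raw_sum = long_vec + w                                   # mostly points along v
unit_sum = long_vec / np.linalg.norm(long_vec) + w_unit  # both contribute equally

print(l2_norm, cos_from_units, cos_direct)
print(raw_sum, unit_sum)
```

Both cosine values are 0.96 up to floating-point rounding, and `unit_sum` points exactly between the two directions.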


In [None]:
def word_analogy_with_vector(model, x_1, x_2, y_1):
  '''
  This function takes a gensim Word2Vec model and outputs a vector to find y_2 such that x_1 → x_2 == y_1 → y_2
  e.g. x_1 (man) → x_2 (king) == y_1 (woman) → y_2(?)

  inputs
  model (gensim.models.keyedvectors.KeyedVectors): Word2Vec model in KeyedVectors in gensim library
  x_1, x_2, y_1 (str): Words in the model's vocabulary.

  output (np.ndarray): A vector in np.ndarray, which can be used to find proper y_2 for given (model, x_1, x_2, y_1)
  '''

  # Write your code from here
  return

# test whether the function works well
result_vector = word_analogy_with_vector(model, 'man', 'king', 'woman')
print('result vector is ', result_vector)
assert isinstance(result_vector, np.ndarray), "Output of the function has to be np.ndarray"
model.most_similar(result_vector)

In [None]:
def get_cosine_similarity(model, x, y):
  '''
  This function returns the cosine similarity of x and y

  inputs
  model (gensim.models.keyedvectors.KeyedVectors): Word2Vec model in KeyedVectors in gensim library
  x, y (str): Words in the model's vocabulary.

  output
  similarity (float): cosine similarity between x's vector and y's vector
  '''
  # Write your code from here
  return

# test the output with your own choice
word_a = 'good'
word_b = 'bad'

similarity = get_cosine_similarity(model, word_a, word_b)
print(similarity)
assert -1 <= similarity <= 1, "Similarity has to be between -1 and 1"

print('gensim library result:', model.similarity(word_a, word_b))

## Problem 4. Visualize Word Vectors (10 pts)
- Select a list of words of your interest
    - **At least 30 words**
    - ``word_list`` is a list of strings
    - every element in ``word_list`` has to be included in the model's vocabulary
- Visualize the vectors of words using dimensionality reduction (in this case, PCA)
- In your report, describe how words are located in 2D space
    - How are the words clustered?
    - Do you think the words are properly located based on their semantic meanings?
    - Are there any surprising or unexpected examples?
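
When interpreting the 2D plot, it helps to know what PCA does to the vectors. The sketch below reproduces the idea with plain NumPy: center the data, take the singular value decomposition, and project onto the two directions with the largest variance. The random stand-in vectors are an assumption so the code runs without the model, and the axes may be flipped relative to what scikit-learn returns.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for word vectors (assumption): 6 random "words" with 5 dimensions each
toy_word_vectors = rng.normal(size=(6, 5))

# PCA step 1: subtract the mean of each dimension
centered = toy_word_vectors - toy_word_vectors.mean(axis=0)

# PCA step 2: rows of vt are the principal directions, ordered by explained variance
u, s, vt = np.linalg.svd(centered, full_matrices=False)

# PCA step 3: project onto the first two directions
twodim = centered @ vt[:2].T

# Share of the total variance that each component captures
explained_ratio = (s ** 2) / np.sum(s ** 2)
print(twodim.shape)               # (6, 2)
print(explained_ratio[:2].sum())  # fraction of variance kept in the 2D plot
```

With 300-dimensional word vectors, two components may keep only a small fraction of the variance, so distances in the plot are a rough approximation of distances in the original space.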

In [None]:
# Run this cell to define the plotting function
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
import plotly.express as px

def display_pca_scatterplot(model, words=None, sample=0):
  if len(words) < 30:
    print("WARNING: For your report, please select at least 30 words for the visualization")
    print(f"Current length of input word list: {len(words)}")
  word_vectors = np.array([model[w] for w in words])

  twodim = PCA().fit_transform(word_vectors)[:,:2]

  # plt.figure(figsize=(12,12))
  # plt.scatter(twodim[:,0], twodim[:,1], edgecolors='k', c='r')
  # for word, (x,y) in zip(words, twodim):
  #     plt.text(x+0.05, y+0.05, word, fontsize=15)
  fig = px.scatter(twodim, x=0, y=1, text=words)
  fig.update_traces(textposition='top center')
  fig.show()



In [None]:
# Select word list of your own interests
word_list = [
]

display_pca_scatterplot(model, word_list)

## Problem 5. Train a New Word2Vec Model (15 pts)
- Word2Vec models can be trained on different corpora (texts)
- Train your own model with your custom selection of text
- In your report, present at least **5** interesting examples whose results differ depending on the training dataset
    - You can compare word analogy examples, similarities, or visualizations
- You can refer to the [Official Documentation](https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec) of the Word2Vec model
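
Before choosing a text, it may help to see what the training input looks like and which words survive a minimum-frequency cutoff. The sketch below uses a made-up three-sentence text (an assumption) and a simplified version of the splitting and punctuation removal used in this notebook. In gensim's ``Word2Vec``, the ``min_count`` parameter ignores words whose total frequency is lower than the given value.

```python
import string
from collections import Counter

# Toy text (assumption), standing in for the contents of your txt file
raw_text = "Mr. Darcy smiled. Elizabeth smiled, too. Darcy left."

# Protect the abbreviation, then split into sentences on '. '
sentences = raw_text.replace("Mr.", "mr").split(". ")

# Remove punctuation, lowercase, and split each sentence into tokens
table = str.maketrans("", "", string.punctuation)
corpus = [[tok.lower() for tok in s.translate(table).split(" ") if tok] for s in sentences]
corpus = [sent for sent in corpus if sent]

# Words seen fewer than min_count times would be dropped from the vocabulary
counts = Counter(tok for sent in corpus for tok in sent)
min_count = 2
kept = sorted(word for word, c in counts.items() if c >= min_count)

print(corpus)  # [['mr', 'darcy', 'smiled'], ['elizabeth', 'smiled', 'too'], ['darcy', 'left']]
print(kept)    # ['darcy', 'smiled']
```

On a small corpus, many words fall below the cutoff, so check that your query words are in the trained vocabulary before comparing results across models.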

In [None]:
# You don't have to change this cell
import string
from gensim.models import Word2Vec

def remove_punctuation(x):
  return x.translate(''.maketrans('', '', string.punctuation))
def make_tokenized_corpus(corpus):
  out= [ [y.lower() for y in remove_punctuation(sentence).split(' ') if y] for sentence in corpus]
  return [x for x in out if x!=[]]

In [None]:
your_text_fn = '' # Enter your text file name here

with open(your_text_fn, 'r') as f:
  strings = f.readlines()
sample_text = "".join(strings).replace('\n', ' ').replace('Mr.', 'mr').replace('Mrs.', 'mrs').split('. ')
corpus = make_tokenized_corpus(sample_text)

In [None]:
model = Word2Vec(sentences=corpus, vector_size=100, window=5, min_count=2, workers=4)
model = model.wv # To match the earlier code, we use wv (KeyedVectors) of the Word2Vec class