![Logo Uni Köln](https://raw.githubusercontent.com/jmelsbach/ai-im/main/img/uni-logo.png)

# Exercise 02 Notebook - Preprocessing and Word2Vec


In this exercise you will create vector representations of words and documents.
The Python ecosystem offers several libraries that make this easy.

In this notebook we will use the `gensim` library for text preprocessing and for training a `Word2Vec` model.

In [None]:
!pip uninstall gensim -y
!pip install gensim

## 1. Using a pretrained Word2Vec Model
In this part we will download a pretrained word2vec model that Google trained on a huge news corpus. The model was trained in 2013, so the data consists of news articles from before that year. The model is quite large, so downloading it will take about 10 minutes.

In [None]:
%%time
import gensim.downloader as api
# download takes about 10 minutes
wv = api.load('word2vec-google-news-300')

The download has finished and we can now start to explore the model. As a first step, find out how to get the vector for the word `king`. How many dimensions does the vector have? You can learn how to extract word vectors in [this word2vec tutorial](https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html#introducing-the-word2vec-model).

In [None]:
# Get the word vector of the word 'king'


In [None]:
# Verify that the dimensionality is indeed 300


In [None]:
# What are the most similar words to king?


In [None]:
# Test if the example from the lecture works: v(King) - v(Man) + v(Woman) = v(Queen)


### 1.1 Exploring the model

We saw in the lecture that it can be useful to preprocess the text data before training a vectorization technique. Try to find out, by looking for examples, how the data in the Google corpus has been preprocessed. Things you should check:
* lowercasing
* stopword removal
* stemming
* n-grams

As you are now more familiar with the model, complete the following tasks:
* use analogies to find out the capital of Cameroon
* use analogies to find out who was the prime minister of Spain
* use analogies to find out what the German national dish is
* find other examples where this works out and makes sense
* find examples where this doesn't work well
* use the word2vec model to find out which of the following words does not fit in the group: `["breakfast", "cereal", "dinner", "lunch"]`
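The analogy tasks above come down to vector arithmetic followed by a nearest-neighbor search by cosine similarity. The following is a minimal sketch with made-up 3-dimensional toy vectors (the values and the helper `toy_analogy` are assumptions for illustration, not taken from the model). gensim's `most_similar` additionally works on length-normalized vectors, which this sketch skips.

```python
import numpy as np

# Made-up 3-dimensional toy vectors (NOT taken from the Google News model)
toy_vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "apple": np.array([0.5, 0.5, 0.5]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def toy_analogy(positive, negative, vectors):
    """Hypothetical helper: add the positive vectors, subtract the negative
    ones, and return the closest remaining word by cosine similarity."""
    target = sum(vectors[w] for w in positive) - sum(vectors[w] for w in negative)
    # Exclude the query words themselves, as gensim does
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

result = toy_analogy(positive=["king", "woman"], negative=["man"], vectors=toy_vectors)
print(result)  # queen
```

With real embeddings the nearest neighbor is not always the expected word, which is exactly what the last tasks in the list ask you to explore.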

### 1.2 Bias in word embeddings
Word2Vec learns word relationships from a training corpus. If the training corpus is biased, there is a good chance that the bias is reflected in the resulting word embeddings. [This paper](https://arxiv.org/pdf/1607.06520.pdf) studies bias in word2vec word embeddings. Look at the examples given in the paper and see whether you can reproduce any of them.

In [None]:
wv.most_similar(positive=['computer_programmer', 'woman'], negative=['man'])

### 1.3 Visualizing Word Embeddings
Word Embeddings usually live in a very high dimensional space. Visualizing data with more than three dimensions is very difficult and becomes impossible very quickly.

To visualize high-dimensional vectors we use dimensionality reduction techniques, which reduce the number of dimensions of a group of vectors while preserving their relative properties as well as possible. Reducing the dimensionality loses a lot of information, but it is a good way to visualize the relations between word embeddings. The following code uses `principal component analysis (PCA)` to produce 2-dimensional vectors. Click [here](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) for details about the implementation.

In [None]:
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(0)
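def display_pca_scatterplot(model, words):
    """Project the vectors of `words` to 2D with PCA and draw a labeled scatter plot."""
    # Look up the embedding of every word (raises a KeyError for unknown words)
    word_vectors = np.array([model[w] for w in words])

    # Fit PCA on the selected vectors and keep the first two principal components
    twodim = PCA().fit_transform(word_vectors)[:,:2]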

    plt.figure(figsize=(20,12))
    plt.rcParams.update({'font.size': 18})
    plt.scatter(twodim[:,0], twodim[:,1], edgecolors='k', c='r')
    for word, (x,y) in zip(words, twodim):
        plt.text(x+0.05, y+0.05, word)
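
To see what the PCA step in the function above does, the projection can be sketched by hand with NumPy: center the data, compute a singular value decomposition, and project onto the first two principal directions. This is a simplified sketch under the assumption of random toy data; `pca_2d` is a hypothetical helper, and the signs of its axes may differ from scikit-learn's output.

```python
import numpy as np

def pca_2d(vectors):
    """Project row vectors onto their first two principal components."""
    # PCA operates on mean-centered data
    centered = vectors - vectors.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(6, 10))  # 6 made-up "word vectors" with 10 dimensions
projected = pca_2d(toy_embeddings)
print(projected.shape)  # (6, 2)
```

The first column carries at least as much variance as the second, which is why the horizontal axis of the plot usually separates the word groups most clearly.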

Create a list of about 20 words you would like to plot in the diagram.

In [None]:
word_list = ['Germany']

Run the `display_pca_scatterplot` function and interpret the results.

In [None]:
display_pca_scatterplot(wv, word_list)

## 2. Training a Word2Vec Model
In this part we will train our first word2vec model ourselves using the gensim library. We will train the model on the text8 dataset, which we can download directly with the gensim library. The dataset is already in the right format for training.

In [None]:
import gensim
import gensim.downloader as api
from gensim.models import Word2Vec

In [None]:
# the text8 corpus consists of Wikipedia data from the year 2006
# http://mattmahoney.net/dc/textdata.html
corpus = api.load('text8')

Look into the word2vec [documentation](https://radimrehurek.com/gensim_3.8.3/models/word2vec.html) of the gensim library and do the following tasks:

Train a Word2Vec Model with the following hyperparameters:
  * vector size of 100
  * window size of 5
  * negative_sampling of 3

  The training of the model on the text8 dataset should take about 2 minutes.
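
To make the window size hyperparameter above concrete: for every word in a sentence, the model only looks at neighbors that are at most `window` positions away. The sketch below lists these (center, context) pairs for a toy sentence. `context_pairs` is a hypothetical helper for illustration; real implementations typically also vary the effective window and subsample frequent words, which is omitted here.

```python
def context_pairs(tokens, window):
    """Return (center, context) pairs for all tokens within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

toy_sentence = ["the", "quick", "brown", "fox", "jumps"]
pairs = context_pairs(toy_sentence, window=2)
print(len(pairs))  # 14
print(pairs[:2])   # [('the', 'quick'), ('the', 'brown')]
```

A larger window produces more pairs per word and tends to capture broader topical similarity, while a smaller window focuses on direct neighbors.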

In [None]:
%%time
# train word2vec model (the %%time cell magic must be the first line of the cell)


### 2.1 Exploring the model

The resulting model is much smaller than the pretrained model we used previously. Solve the same tasks as in Section 1. Does it work as well?



In [None]:
# most similar to king


In [None]:
# king - man + woman = ?


In [None]:
# What is the capital of Cameroon?

### 2.2 Visualizing Word Embeddings
Once again visualize a list of words. You can use the list you created in Section 1 but you might have to make some adjustments.

In [None]:
# example list
word_list = ['coffee', 'tea', 'beer', 'wine', 'water',                                      # Beverages
             'spaghetti', 'hamburger', 'pizza',                                             # Food
             'dog', 'horse', 'cat', 'mouse',                                                # Animals
             'france', 'germany', 'hungary', 'china',                                       # countries
             'school', 'college', 'university', 'institute',                                # Education
             'soccer', 'basketball', 'baseball', 'football']                                # Sports

In [None]:
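# Note: `wv` still refers to the pretrained model from Section 1.
# To plot the text8 model, pass the word vectors of the model you trained above instead.
display_pca_scatterplot(wv, word_list)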

## 3. Training a word2vec model with a custom dataset

In this section we will train word embeddings on our own dataset. We will preprocess the data ourselves and will therefore have full control over the process.

### 3.1 Downloading the dataset
First of all we need to download our dataset. You can directly read the data from this link: `https://raw.githubusercontent.com/LawrenceDuan/IMDb-Review-Analysis/master/IMDb_Reviews.csv` into a pandas `DataFrame`.

In [None]:
# create a pandas dataframe and save it in a variable called data
import pandas as pd

data = pd.read_csv("https://raw.githubusercontent.com/LawrenceDuan/IMDb-Review-Analysis/master/IMDb_Reviews.csv")

In [None]:
data

Explore the data.
* What kind of data is it?
* How large is the dataset?
* What are the labels?
* How are the labels distributed?

In [None]:
# get the first 10 rows of the data


In [None]:
# how many training examples do we have?


In [None]:
# how many labels do we have for each class?


### 3.2 Preprocessing the corpus



Implement a function that gets a text as an input and returns the preprocessed version of it. Your preprocessing should include the following steps:
* lowercase text
* remove punctuation
* remove stopwords
* remove numbers

You can use the gensim library for the preprocessing. Visit the [documentation](https://radimrehurek.com/gensim_3.8.3/parsing/preprocessing.html) to learn how it works.
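
To illustrate what the four steps do to a text, here is a minimal sketch in plain Python on a different example sentence. It deliberately does not use gensim; `TOY_STOPWORDS` and `toy_preprocess` are assumptions for illustration, and gensim's functions may handle whitespace differently, so their output will not match this sketch exactly.

```python
import string

# Tiny illustrative stopword set (an assumption; real stopword lists are much longer)
TOY_STOPWORDS = {"the", "a", "is", "in", "of", "was"}

def toy_preprocess(text):
    """Lowercase, strip punctuation and digits, then drop stopwords."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = "".join(ch for ch in text if not ch.isdigit())
    tokens = [tok for tok in text.split() if tok not in TOY_STOPWORDS]
    return " ".join(tokens)

example = toy_preprocess("The movie was released in 1999, a GREAT year!")
print(example)  # movie released great year
```

The order of the steps matters: stopwords are removed after lowercasing, otherwise "The" would not match the lowercase entry "the".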


In [None]:
# import gensim library for preprocessing
import gensim.parsing.preprocessing as pp

In [None]:
# define preprocessing function
def preprocess_text(text):
  # lowercasing
  # remove punctuation
  # remove stopword
  # remove numbers

  return text

Execute the following cell. Your function should print something like `text includes number` (the assertion expects a trailing space).

In [None]:
test_text = "This is a text that INCLUDES the number 34."
preprocessed_text = preprocess_text(test_text)
print(preprocessed_text)
assert preprocessed_text == 'text includes number '

In [None]:
# apply it to the text in our DataFrame
# and save the results in a new column called review_pp


Some words appear very frequently next to each other. It can be very useful to combine these words into a single representation before tokenizing the corpus.

A combination of two words is called a bigram. Have a look at the following examples:
* new york -> new_york
* star wars -> star_wars
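
How can such pairs be detected automatically? One common idea, in the spirit of the phrase score from Mikolov et al. (2013), is to compare how often two words occur together with how often they occur at all. The sketch below uses a made-up toy corpus; `phrase_score`, the scaling by vocabulary size, and the threshold of 1.0 are assumptions for illustration and may differ from what a library does internally.

```python
from collections import Counter

# Made-up toy corpus of tokenized sentences
toy_corpus = [
    ["new", "york", "is", "big"],
    ["i", "love", "new", "york"],
    ["new", "york", "has", "big", "parks"],
    ["i", "have", "a", "big", "dog"],
]

unigrams = Counter(tok for sent in toy_corpus for tok in sent)
bigrams = Counter(pair for sent in toy_corpus for pair in zip(sent, sent[1:]))

def phrase_score(a, b, min_count=1):
    """High when `a b` occurs together more often than the single counts suggest."""
    vocab_size = len(unigrams)
    return (bigrams[(a, b)] - min_count) / (unigrams[a] * unigrams[b]) * vocab_size

# Keep only the pairs whose score exceeds the (assumed) threshold
detected = [pair for pair in bigrams if phrase_score(*pair) > 1.0]
print(detected)  # [('new', 'york')]
```

Pairs that occur only once get a score of zero here, so frequent but incidental neighbors such as "is big" are not merged.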

The following code blocks show how you can train a phraser with the gensim library. You can learn more about the Phraser [here](https://radimrehurek.com/gensim_3.8.3/models/phrases.html).



In [None]:
"Hi Ho".split()

In [None]:
%%time
from gensim.models.phrases import Phrases
phrases = Phrases(data['review_pp'].apply(lambda x: x.split()))

You can apply the bigram model like this:

In [None]:
phrases["the lion king is by far my favorite movie!".split()]

Apply the bigrams to the text in the `review_pp` column and overwrite it.

### 3.3 Creating a Corpus Class

In this section we create a class for our corpus that will be the input for the training algorithm of our word2vec model.

Write a class called `MyCorpus`.
It should have an `__init__` method that takes a list or array of strings as input and stores it in the instance attribute `self.data`.

Also implement an `__iter__` method that loops over `self.data` and `yield`s each line as a list of words. (Hint: use `.split()` to split a string into words.)
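
Why a class with `__iter__` instead of a plain generator? Training makes several passes over the corpus, and a generator is exhausted after the first one. The following sketch shows the difference on an unrelated toy example (`squares_generator` and `Squares` are hypothetical names, not part of the exercise solution).

```python
def squares_generator(n):
    """Plain generator: can be consumed only once."""
    for i in range(n):
        yield i * i

class Squares:
    """Restartable iterable: every call to __iter__ starts a fresh pass."""
    def __init__(self, n):
        self.n = n

    def __iter__(self):
        for i in range(self.n):
            yield i * i

gen = squares_generator(3)
first_pass_gen, second_pass_gen = list(gen), list(gen)

obj = Squares(3)
first_pass_obj, second_pass_obj = list(obj), list(obj)

print(first_pass_gen, second_pass_gen)  # [0, 1, 4] []
print(first_pass_obj, second_pass_obj)  # [0, 1, 4] [0, 1, 4]
```

Your `MyCorpus` class follows the same pattern, with the preprocessed reviews taking the place of the numbers.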

In [None]:
class MyCorpus:
  def __init__(self, data):
    pass
  def __iter__(self):
    pass

In [None]:
# instantiate the MyCorpus class with a list of the preprocessed texts in our dataframe.
# save it in a variable called sentences
sentences = None

In [None]:
%%time
model = gensim.models.Word2Vec(sentences=sentences)

Once again explore the model by using the similarity function. Remember that this is a very specific dataset we trained the model on. It is about movie reviews and the analogies we used before probably won't work here.

But maybe there are some interesting relationships between actors and movies encoded in the word embeddings.

In [None]:
model.wv.most_similar('matrix', topn=10)

In [None]:
word_list = ['stallone' ,'clint_eastwood', 'schwarzenegger', 'pulp_fiction',
             'aladdin', 'brad_pitt', 'angelina_jolie', 'cameron_diaz',  'orlando_bloom', 'incredibles',
             'mulan', 'godfather', 'tarantino', 'shrek', 'star_wars', 'matrix', 'lord_rings', "jurassic_park"]

### 3.4 Visualizing Word Embeddings
Once again create a list of words that you want to plot in 2D space.
Try to use a list of movie titles and actors and see how similar they are.

In [None]:
display_pca_scatterplot(model.wv, word_list)