<a href="https://colab.research.google.com/github/griu/deeplearningupc/blob/master/embeddings_template.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Word Embeddings and Data Processing Lab

In this lab we explore how to work with word embeddings (training, usage and visualisation) and check how the embeddings differ depending on the data processing we apply. We feed the model corpora of words, stems or lemmas, and we learn how to obtain the part of speech of a word.


### Packages

We will use `gensim` for word embeddings and `nltk` for some data processing tasks. We also use the standard `logging` module to follow progress, `matplotlib` for visualisation and `sklearn` for the dimensionality reduction behind the plots.

In [0]:
# `logging` ships with the Python standard library, so there is nothing to install.
# `pip install logging` would only fetch an old package of the same name from PyPI.


In [1]:
# imports needed and set up logging
import gensim 
import nltk
import logging

import warnings
warnings.filterwarnings("ignore")

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)




### Dataset: Game of Thrones books

The most important thing is data. We'll use five volumes of Game of Thrones, which I downloaded from
https://github.com/nihitx/game-of-thrones- and cleaned lightly to speed up the process.

Download the file to use today here: http://www.lsi.upc.edu/~cristinae/labNLP1/got.5books.clean.txt

Do you see part of the extension? _.clean._ That's the MOST IMPORTANT thing in machine learning: take care of your data, look at it, clean it, look at it again. We'll do some cleaning and pre-processing in the next lab.

Now, let's take a closer look at this data below by printing the first line. 

In [2]:
# Open the file and print the first line
dataFile = "got.5books.clean.txt"

## CODE HERE
with open(dataFile, 'rb') as f:
    for line in f:
        print(line)
        break


"We should start back," Gared urged as the woods began to grow dark around them. "The wildlings are dead."



### Read files into a list
Now that we've had a sneak peek at our dataset, we can read it into a list so that we can pass it on to the Word2Vec model. We'll first apply a mild pre-processing using `gensim.utils.simple_preprocess(sentence)`. This does some basic pre-processing (lowercasing tokens, ignoring tokens that are too short or too long) and returns a list of tokens (words). This method is documented on the official [Gensim documentation site](https://radimrehurek.com/gensim/utils.html).



In [3]:
# Write a function `readInput(inputFile)` that reads a file and applies the `simple_preprocess`
def readInput(inputFile):
    """Read the input file and yield one list of tokens per line"""
    ## CODE HERE
    logging.info("reading file {0}.. this may take a while".format(inputFile))
    with open(inputFile, 'rb') as f:
        for i, line in enumerate(f):
            # report progress every 10000 lines
            if i % 10000 == 0:
                logging.info("read {0} lines".format(i))
            yield gensim.utils.simple_preprocess(line)


# read the tokenized file into a list (sentences) of lists (tokens) named `sentences`
## CODE HERE
sentences = list(readInput(dataFile))
logging.info("Done reading data file")
# print some examples
## CODE HERE
print(sentences[0])



2019-04-23 19:07:12,632 : INFO : reading file got.5books.clean.txt.. this may take a while
2019-04-23 19:07:12,634 : INFO : read 0 lines
2019-04-23 19:07:13,343 : INFO : read 10000 lines
2019-04-23 19:07:14,014 : INFO : read 20000 lines
2019-04-23 19:07:14,421 : INFO : Done reading data file


[u'we', u'should', u'start', u'back', u'gared', u'urged', u'as', u'the', u'woods', u'began', u'to', u'grow', u'dark', u'around', u'them', u'the', u'wildlings', u'are', u'dead']


## Training the Word2Vec model

Training the model is fairly straightforward. You just instantiate Word2Vec and pass the data, a list (sentences) of lists (tokens) for a complete corpus. Word2Vec uses all these tokens to internally create a vocabulary.

After building the vocabulary, we just need to call `train(...)` to start training the Word2Vec model. Note that when the sentences are passed to the constructor, as in the cell below, gensim already builds the vocabulary and runs a first round of training, so the explicit `train(...)` call continues training for further epochs. Remember you are training a simple neural network with a single hidden layer. But we are not actually going to use the neural network after training. Instead, the goal is to learn the weights of the hidden layer. These weights are essentially the word vectors that we're trying to learn.

In [10]:
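# Define a basic Word2Vec model (gensim.models.Word2Vec) with CBOW and train (model.train) it on `sentences`

## CODE HERE
# Passing `sentences` to the constructor builds the vocabulary and runs a first training pass
model = gensim.models.Word2Vec(sentences, size=100, window=3, min_count=2, workers=1)
# Continue training for 10 more epochs; returns (effective words trained, raw words processed)
model.train(sentences, total_examples=len(sentences), epochs=10)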

2019-04-23 19:11:41,148 : INFO : collecting all words and their counts
2019-04-23 19:11:41,150 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2019-04-23 19:11:41,245 : INFO : PROGRESS: at sentence #10000, processed 405638 words, keeping 13375 word types
2019-04-23 19:11:41,335 : INFO : PROGRESS: at sentence #20000, processed 784788 words, keeping 17230 word types
2019-04-23 19:11:41,390 : INFO : collected 18740 word types from a corpus of 1010744 raw words and 25690 sentences
2019-04-23 19:11:41,392 : INFO : Loading a fresh vocabulary
2019-04-23 19:11:41,433 : INFO : effective_min_count=2 retains 13673 unique words (72% of original 18740, drops 5067)
2019-04-23 19:11:41,434 : INFO : effective_min_count=2 leaves 1005677 word corpus (99% of original 1010744, drops 5067)
2019-04-23 19:11:41,478 : INFO : deleting the raw counts dictionary of 18740 items
2019-04-23 19:11:41,480 : INFO : sample=0.001 downsamples 49 most-common words
2019-04-23 19:11:41,481 : INFO 

(7664171, 10107440)

### Understanding the parameters:

```
model = gensim.models.Word2Vec(documents, size=100, window=10, min_count=2, workers=2, sg=0)
```

#### `size`
The size of the dense vector that represents each token or word. If you have very limited data, `size` should be a much smaller value. If you have lots of data, it's good to experiment with various sizes. In gensim 4.0 and later this parameter is called `vector_size`.

#### `window`
The maximum distance between the target word and a neighbouring word. Neighbours that lie further than `window` positions to the left or right of the target are not considered related to it. In theory, a smaller window should give you terms that are more closely related. If you have lots of data, the exact window size should not matter too much, as long as it is reasonably sized.

#### `min_count`
Minimum frequency count of words. The model ignores words that occur fewer than `min_count` times. Extremely infrequent words are usually unimportant, so it's best to get rid of them. Unless your dataset is really tiny, this does not really affect the model.

#### `workers`
How many worker threads to use for training.

#### `sg`
Training algorithm: CBOW (0) or skip-gram (1)
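
As an illustration of what `min_count` does, the sketch below counts token frequencies in a made-up toy corpus and keeps only the word types that reach the threshold. This mirrors the idea behind the `effective_min_count=2 retains ...` log lines above. It is not gensim's internal code, and `toySentences` is an assumed example.

```python
from collections import Counter

# Toy corpus (an assumption for illustration).
toySentences = [
    ["the", "wall", "is", "cold"],
    ["the", "wall", "is", "high"],
    ["winter", "is", "coming"],
]

min_count = 2

# Count how often each word type occurs in the whole corpus.
counts = Counter(word for sentence in toySentences for word in sentence)

# Keep only the types that occur at least `min_count` times.
retained = sorted(word for word, n in counts.items() if n >= min_count)
dropped = sorted(word for word, n in counts.items() if n < min_count)

print("retained:", retained)
print("dropped:", dropped)
```

In a corpus this small, more than half of the types are dropped; in the lab corpus the same threshold removes 5067 of 18740 types but only about 1% of the running words.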


### What have you trained?
Several functions allow you to explore the results. The `most_similar` function returns the words most similar to a given input word (the top 10 by default). `similarity` returns the similarity between two words that are present in the vocabulary. `doesnt_match` returns the word in a list that is least similar to the others. Let's play with these functions.


In [8]:
# Choose a word to see the 10 closest words with `most_similar`
w1 = "throne"
## CODE HERE
model.wv.most_similar(positive=w1)

[(u'islands', 0.6903221607208252),
 (u'fleet', 0.6046836376190186),
 (u'holt', 0.6003888845443726),
 (u'chair', 0.5949640274047852),
 (u'council', 0.590110182762146),
 (u'realm', 0.5506998896598816),
 (u'kingslayer', 0.5349704027175903),
 (u'spikes', 0.5182492733001709),
 (u'tourney', 0.5094053149223328),
 (u'price', 0.5049958229064941)]

What happens if the word is not in the vocabulary? We are using a tiny corpus in a specific domain...

In [13]:
# Choose a word that you think does not belong to the corpus to see the 10 closest words
## CODE HERE
w1 = "omelette"
model.wv.most_similar(positive=w1, topn=6)


KeyError: ignored

You can also give several positive examples to get words related to all of them, and negative examples to say what should not count as related, with `most_similar(positive=w1,negative=w2,topn=n)`.

In [18]:
# get everything related to stuff on the bed for instance
## CODE HERE
w1 = ["bed","sheet","pillow"]
w2 = ["face"]
model.wv.most_similar(positive=w1, negative=w2, topn=10)


[(u'cart', 0.7314779758453369),
 (u'basin', 0.7270445823669434),
 (u'bunk', 0.6982508897781372),
 (u'tub', 0.6680843830108643),
 (u'bedrobe', 0.6667254567146301),
 (u'horsehide', 0.664029598236084),
 (u'pillows', 0.6630938649177551),
 (u'cushion', 0.6628068089485168),
 (u'cask', 0.6619012355804443),
 (u'cracks', 0.661659836769104)]
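
To see what the positive and negative lists do, here is a NumPy sketch of roughly the arithmetic involved: unit vectors of the positive words are added, those of the negative words are subtracted, and the remaining words are ranked by cosine similarity to the result. The vectors in `toyVectors` are hand-made for illustration, and `mostSimilar` is a hypothetical helper, not gensim's implementation.

```python
import numpy as np

# Hypothetical 3-dimensional vectors, hand-made for illustration only.
toyVectors = {
    "king":   np.array([1.0, 1.0, 0.0]),
    "queen":  np.array([1.0, 0.0, 1.0]),
    "man":    np.array([0.0, 1.0, 0.0]),
    "woman":  np.array([0.0, 0.0, 1.0]),
    "throne": np.array([1.0, 0.5, 0.5]),
}


def unit(v):
    """Scale a vector to length 1."""
    return v / np.linalg.norm(v)


def mostSimilar(positive, negative, topn=3):
    """Rank words by cosine similarity to the combined query vector."""
    # Add the unit vectors of positive words, subtract those of negative words.
    parts = [unit(toyVectors[w]) for w in positive]
    parts += [-unit(toyVectors[w]) for w in negative]
    query = unit(np.mean(parts, axis=0))
    # Score every word that was not part of the query.
    scores = [(w, float(np.dot(unit(v), query)))
              for w, v in toyVectors.items()
              if w not in positive and w not in negative]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


result = mostSimilar(positive=["king", "woman"], negative=["man"])
print(result)
```

With these toy vectors the query lands closest to `queen`, which is the pattern behind the well-known king, man, woman analogy.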

In [0]:
# try what happens without the negative constraint
## CODE HERE

Calculate some similarities now.

In [21]:
# similarity between two different words
## CODE HERE
w1 = "bed"
w2 = "sleep"
model.wv.similarity(w1, w2)


0.6102705

In [0]:
# similarity between two identical words
## CODE HERE

In [22]:
# similarity between two unrelated words
## CODE HERE
w1 = "happy"
w2 = "sad"
model.wv.similarity(w1, w2)

0.61103094

Under the hood, the above three snippets compute the cosine similarity between the word vectors of the two specified words. The scores show that `bed` and `sleep` are fairly similar (0.61), but `happy` and `sad` score just as high (0.61): antonyms tend to appear in the same contexts, so embeddings learned from context alone place them close together. If you compute the similarity between two identical words, the score will be 1.0, the maximum value, since cosine similarity always lies in the range [-1.0, 1.0]. You can read more about cosine similarity scoring [here](https://en.wikipedia.org/wiki/Cosine_similarity).
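
For reference, cosine similarity divides the dot product of two vectors by the product of their lengths. A minimal NumPy sketch with made-up vectors (`cosineSimilarity` is an illustrative helper, not a gensim function) shows the three boundary cases:

```python
import numpy as np


def cosineSimilarity(a, b):
    """Cosine of the angle between vectors a and b."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


v = np.array([1.0, 2.0, 3.0])
same = cosineSimilarity(v, v)            # same direction
opposite = cosineSimilarity(v, -v)       # opposite direction
orthogonal = cosineSimilarity(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
print(same, opposite, orthogonal)
```

Vectors pointing the same way score close to 1, opposite vectors close to -1, and perpendicular vectors 0. The length of the vectors does not matter, only their direction.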

### Find the odd one out
You can use Word2Vec to find odd items given a list of items with `doesnt_match`.

In [23]:
# Define a list of words and look for the strange word
# Which one is the odd one out in this list?
## CODE HERE
model.wv.doesnt_match(["snow","winter","sword"])

'sword'
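
The idea behind `doesnt_match` can be sketched as follows: average the unit vectors of all the words and return the word that is least similar to that average. The vectors below are hand-made for illustration and `oddOneOut` is a hypothetical helper that approximates the approach, not gensim's own code.

```python
import numpy as np

# Hypothetical 2-dimensional vectors, hand-made for illustration only.
toyVectors = {
    "snow":   np.array([0.9, 0.1]),
    "winter": np.array([0.8, 0.2]),
    "sword":  np.array([0.1, 0.9]),
}


def oddOneOut(words):
    """Return the word whose vector is least similar to the mean of the group."""
    # Normalise every vector to unit length so that dot products are cosines.
    units = np.array([toyVectors[w] / np.linalg.norm(toyVectors[w]) for w in words])
    mean = units.mean(axis=0)
    mean = mean / np.linalg.norm(mean)
    similarities = units @ mean
    return words[int(np.argmin(similarities))]


odd = oddOneOut(["snow", "winter", "sword"])
print(odd)
```

Because the answer depends on the mean of the whole list, adding or removing a word can change which one counts as the odd one out.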

In [0]:
# Which one is the odd one out in this other list?
## CODE HERE


### Pre-trained vectors

We'll now use better vectors that have been trained on large corpora such as Wikipedia and Gigaword. `gensim` also allows us to load pre-trained embeddings, so we can repeat what we did with our word2vec embeddings using general GloVe vectors.

In [24]:
#Download and load the model
import gensim.downloader as api
model_gigaword = api.load("glove-wiki-gigaword-100")

# If you have them already just load them
# from gensim.models import KeyedVectors
# gloveModel="PATH_TO_FILE"
# model_gigaword = KeyedVectors.load_word2vec_format(gloveModel, binary=False)

2019-04-23 19:21:45,791 : INFO : Creating /root/gensim-data




2019-04-23 19:22:07,920 : INFO : glove-wiki-gigaword-100 downloaded
2019-04-23 19:22:07,922 : INFO : loading projection weights from /root/gensim-data/glove-wiki-gigaword-100/glove-wiki-gigaword-100.gz
2019-04-23 19:23:05,875 : INFO : loaded (400000, 100) matrix from /root/gensim-data/glove-wiki-gigaword-100/glove-wiki-gigaword-100.gz


In [28]:
# find the similarity between two words. 
# Use the same examples as before and also some examples with out-of-domain vocabulary. I'm sure the word "phone" was
# not in the vocabulary before!

## CODE HERE
# `api.load` returns the word vectors themselves, so `most_similar` is called on them directly
model_gigaword.most_similar(positive=["phone","winter"], topn=10)

[(u'summer', 0.7735719680786133),
 (u'day', 0.7169382572174072),
 (u'days', 0.7116138339042664),
 (u'spring', 0.7099618315696716),
 (u'time', 0.7047103643417358),
 (u'next', 0.6884782910346985),
 (u'weekend', 0.6871427297592163),
 (u'week', 0.6843944787979126),
 (u'telephone', 0.6821471452713013),
 (u'even', 0.6820520162582397)]

In [0]:
# as before, get everything related to stuff on the bed

## CODE HERE

In [0]:
# The famous (king - man) + woman
## CODE HERE

In [0]:
# What happened with the small GoT model?
## CODE HERE

## Data processing


Let's see what stemming, part-of-speech tagging and lemmatisation do to our corpus.

### Stemming with Porter Stemmer

PorterStemmer uses **suffix stripping** to produce stems. The algorithm does not follow linguistic analysis but rather a set of rules for different cases, applied in phases (step by step) to generate stems. Therefore PorterStemmer often generates stems that are not actual English words. It uses the rules to decide whether it is wise to strip a suffix.

So why use it? It is simple, fast and reduces sparsity!

In [0]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

# initialise the stemmer
porterStemmer=PorterStemmer()

In [30]:
# Let's see the first sentence before and after stemming to understand what we are doing.
# porterStemmer.stem(word) does the stemming
## CODE HERE

stemSentences=[]
print(sentences[0])
for word in sentences[0]:
  stemSentences.append(porterStemmer.stem(word))
print(stemSentences)
  

[u'we', u'should', u'start', u'back', u'gared', u'urged', u'as', u'the', u'woods', u'began', u'to', u'grow', u'dark', u'around', u'them', u'the', u'wildlings', u'are', u'dead']
[u'we', u'should', u'start', u'back', u'gare', u'urg', u'as', u'the', u'wood', u'began', u'to', u'grow', u'dark', u'around', u'them', u'the', u'wildl', u'are', u'dead']


In [31]:
# It's OK, let's do the whole corpus
# Stem the full corpus
## CODE HERE

# Look some examples
example=16169
print(sentences[example])
print(stemmedSentences[example])

[u'she', u'will', u'without', u'highgarden', u'the', u'lannisters', u'have', u'no', u'hope', u'of', u'keeping', u'joffrey', u'on', u'his', u'throne', u'if', u'my', u'son', u'the', u'lord', u'oaf', u'asks', u'she', u'will', u'have', u'no', u'choice', u'but', u'to', u'grant', u'his', u'request']


NameError: ignored
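
The `NameError` above is raised because `stemmedSentences` was never defined: the step that stems the full corpus is still to be written. A minimal sketch of that step, on a small stand-in corpus (`sampleSentences` is an assumption for illustration; in the lab you would iterate over `sentences`), could look like this:

```python
from nltk.stem import PorterStemmer

porterStemmer = PorterStemmer()

# Stand-in for the tokenised corpus (an assumption for illustration);
# in the lab, the list built by `readInput` would be used instead.
sampleSentences = [["gared", "urged", "woods", "wildlings"],
                   ["the", "lannisters", "have", "no", "hope"]]

# Stem every token of every sentence, keeping the list-of-lists shape.
stemmedSentences = [[porterStemmer.stem(word) for word in sentence]
                    for sentence in sampleSentences]

print(stemmedSentences[0])
```

Keeping the list-of-lists shape means the result can be passed to `gensim.models.Word2Vec` exactly like the unstemmed sentences.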

### Word embeddings on the stemmed corpus

In [0]:
# Another training with word2vec, now with stems `modelStems`
## CODE HERE

Explore similarities in `modelStems`

In [0]:
w1 = "throne"
## CODE HERE


In [0]:
# More examples?
## CODE HERE

### Lemmatisation with a WordNet lemmatiser

Lemmatisation, unlike stemming, reduces inflected words properly, ensuring that the root word, the _lemma_, belongs to the language.
NLTK provides a WordNet lemmatiser that uses the WordNet database to look up the lemmas of words.

In [32]:
# Import the package and initialise the lemmatiser
# We need to download Wordnet too
from nltk.stem.wordnet import WordNetLemmatizer
nltk.download('wordnet')

# Initialise the lemmatiser
lemmatiser = WordNetLemmatizer()


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


In [35]:
# Let's see the first sentence before and after lemmatising
## CODE HERE
lemSentence=[]
print(sentences[0])
for word in sentences[0]:
  lemSentence.append(lemmatiser.lemmatize(word))
print(lemSentence)

[u'we', u'should', u'start', u'back', u'gared', u'urged', u'as', u'the', u'woods', u'began', u'to', u'grow', u'dark', u'around', u'them', u'the', u'wildlings', u'are', u'dead']
[u'we', u'should', u'start', u'back', u'gared', u'urged', u'a', u'the', u'wood', u'began', u'to', u'grow', u'dark', u'around', u'them', u'the', u'wildlings', u'are', u'dead']


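Are you happy with that? Without a PoS, every word is treated as a noun, which is why `as` became `a` and `urged` was left unchanged. The `lemmatize(word)` function also allows us to include information about the PoS of the word: `lemmatize(word, PoS)`. Let's use it!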

In [0]:
# Imports needed for PoS tagging
from nltk import pos_tag
from nltk.corpus import wordnet

# You might need to download this
#nltk.download('averaged_perceptron_tagger')

# Write a function to map the PoS tag of a word to the single letter that WordNet uses
def getWordnetPoS(word):
    """Map POS tag to first character lemmatize() accepts"""
    # WordNet POS tags are only: NOUN = 'n', VERB = 'v', ADV = 'r', ADJ = 'a', ADJ_SAT = 's'

    tag = nltk.pos_tag([word])[0][1][0].upper()
    tagDict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tagDict.get(tag, wordnet.NOUN)

# Tag with PoS the first sentence of the corpus and print it
## CODE HERE

# Lemmatise the full corpus with the information of PoS now
## CODE HERE
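
`pos_tag` is normally applied to a whole tokenised sentence, so that the tagger can use the context of each word. The sketch below shows only the mapping step, on a hand-written tagged sentence. The tags in `taggedSentence` are assumed for illustration, in the `(word, tag)` format that `pos_tag` returns, and the WordNet letter codes are written out directly so that the sketch runs without downloading any NLTK data. `toWordnetPairs` is a hypothetical helper.

```python
# Hand-written (word, Penn Treebank tag) pairs, in the format that
# nltk.pos_tag returns; the tags are assumed for illustration.
taggedSentence = [("gared", "NNP"), ("urged", "VBD"), ("the", "DT"),
                  ("woods", "NNS"), ("dark", "JJ"), ("around", "RB")]

# First letter of the Penn Treebank tag -> WordNet PoS code.
tagDict = {"J": "a", "N": "n", "V": "v", "R": "r"}


def toWordnetPairs(tagged):
    """Map each (word, tag) pair to (word, WordNet PoS), defaulting to noun."""
    return [(word, tagDict.get(tag[0].upper(), "n")) for word, tag in tagged]


pairs = toWordnetPairs(taggedSentence)
print(pairs)
```

Each resulting pair can then be passed to `lemmatiser.lemmatize(word, pos)`.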

In [0]:
# It's OK, let's lemmatise the whole corpus
## CODE HERE

# Look again at some examples
print(sentences[example])
print(lemSentences[example])


### Word embeddings on the lemmatised corpus

In [0]:
# Another training with word2vec, now with lemmas, create the model `modelLemmas`
## CODE HERE

Explore similarities, distances and the odd one out again...

In [0]:
w1 = "throne"
## CODE HERE

## ...

## Visualisation

Finally, we will visualise the _n_-dimensional word embeddings by projecting them down to 2-dimensional x,y coordinate pairs. 
Several techniques exist (PCA, t-SNE, etc.). In the following we use PCA (the `PCA` class in `sklearn.decomposition`).
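
As a minimal sketch of the projection step, the code below fits a 2-component PCA on random stand-in vectors. Both the vectors and the word list are assumptions for illustration; in the lab the rows would be the vectors of a trained model.

```python
import numpy as np
from sklearn.decomposition import PCA

# Random stand-in for embeddings (an assumption for illustration):
# 5 "words" with 100-dimensional vectors.
rng = np.random.RandomState(0)
words = ["throne", "king", "queen", "winter", "sword"]
vectors = rng.normal(size=(len(words), 100))

# Fit PCA and project every vector onto the first two principal components.
pca = PCA(n_components=2)
coordinates = pca.fit_transform(vectors)

for word, (x, y) in zip(words, coordinates):
    print(word, round(float(x), 3), round(float(y), 3))
```

The two columns of `coordinates` can then be passed to `pyplot.scatter`, and each word attached to its point with `pyplot.annotate`.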

In [0]:
# Imports needed for the visualisation
from sklearn.decomposition import PCA
from matplotlib import pyplot
%matplotlib inline

# fit a 2d PCA model to the vectors
## CODE HERE

# create a scatter plot of the projection
## CODE HERE

# add the labels to the plot
## CODE HERE

# look at the plot
pyplot.show()


Too much information. Let's select only a subset of words

In [0]:
# Select what we wanna see ('most_similar' words to something for instance)
setToPlot = modelLemmas.wv.most_similar(positive='throne', topn=10)

# Look for the vectors for the desired words only, and store them as vectorX and vectorY
vectorX =  []
vectorY =  []
words = []
## CODE HERE

# create the scatter plot for these words
## CODE HERE

# add the labels
## CODE HERE

# look at the plot
pyplot.show()

We can do many more things, but the best way to learn is to practise by yourself.

## What do I do now?

The whole exercise has been done with what is almost a _toy corpus_ (only 25k sentences, 1M words, 19k types). Everything is fast in this case but the quality is limited. I recommend that you work with a real corpus now.

* Download a corpus of tweets, of news or a Wikipedia edition.
* Clean and preprocess the data. No stemming or lemmatisation is usually applied with large corpora.
* Train word vectors with different software: word2vec, glove, fasttext
* Load them and explore the embeddings, you can still use `gensim`
* Evaluate the embeddings. The functions `evaluate_word_pairs` and `evaluate_word_analogies` can help on that, you only need to download standard test sets to use a gold standard.
