<h1 style="background-color:#0071BD;color:white;text-align:center;padding-top:0.8em;padding-bottom: 0.8em">
Word Embeddings - A Numeric Building Block of Natural Language Processing
</h1>



<p style="background-color:#66A5D1;padding-top:0.2em;padding-bottom: 0.2em" />

# Description of this notebook

Words can be represented by vectors, so that co-occurrence relations among words in a corpus can be approximated 
through numeric measurements on their vector representations, such as the cosine similarity. 
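
To make the measurement concrete, here is a minimal sketch of cosine similarity in plain Python. The helper name `cosine_similarity` and the three toy vectors are made up for illustration; real word vectors have many more dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-dimensional vectors, invented for this example
way = [1.0, 2.0, 0.0]
path = [2.0, 4.0, 0.0]    # points in the same direction as `way`
banana = [0.0, 0.0, 3.0]  # orthogonal to `way`

print(cosine_similarity(way, path))    # close to 1: same direction
print(cosine_similarity(way, banana))  # 0: no shared direction
```

The value depends only on the angle between the vectors, not on their lengths, which is why `path` scores as (almost exactly) 1 although it is twice as long as `way`.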

In this notebook, we use the `gensim` package to demonstrate basic operations on word embeddings. `gensim` is a Python package for topic modeling (e.g., LDA) in Natural Language Processing, and it also provides a number of other useful tools, including word embeddings. 

We first import `Word2Vec` from `gensim` as follows.

In [1]:
from gensim.models import Word2Vec

`Word2Vec` accepts any iterable of tokenized sentences, so corpora from `NLTK` (the Natural Language Toolkit) can be passed to it directly. We import NLTK's movie review corpus and the Brown corpus. 

In [2]:
from nltk.corpus import movie_reviews, brown

Word vectors that capture word co-occurrence relations in the movie reviews can be trained in a single line of Python.

In [3]:
reviewVec = Word2Vec(movie_reviews.sents())

The 10 words most similar to `way` in the movie review corpus are listed below.

In [None]:
reviewVec.wv.most_similar('way', topn=10)

[('situation', 0.7207054495811462),
 ('stuff', 0.6776666641235352),
 ('material', 0.6655994653701782),
 ('audience', 0.661513090133667),
 ('money', 0.6551351547241211),
 ('viewer', 0.6523748636245728),
 ('thing', 0.6413513422012329),
 ('place', 0.6360496282577515),
 ('it', 0.6343892216682434),
 ('mind', 0.6280417442321777)]

For comparison, we train a second model on the Brown corpus and run the same query.

In [None]:
brownVec = Word2Vec(brown.sents())

In [None]:
brownVec.wv.most_similar('way', topn=10)

Co-occurrence relations are determined (or biased) by the corpus. As a second example, we look up the neighbors of `brown`, first in the movie review model.

In [None]:
reviewVec.wv.most_similar('brown', topn=10)

We can compare this with the neighbors of `brown` in the Brown corpus model.

In [None]:
brownVec.wv.most_similar('brown', topn=10)

We see that the same word has different neighbors in the two corpora. In real applications, we therefore need to train word embeddings on the particular corpus at hand. This is also simple to do. Suppose we have the following 5 sentences. 

In [None]:
snts = [
    ['why','do','so','many','egyptian','statues','have','broken','noses'],
    ['why','are','the','statues','noses','broken'],
    ['it','might','seem','inevitable','that','after','thousands','of','years','an','ancient','artifact','would','show','wear','and','tear'],
    ['but','this','simple','observation','led','bleiberg','to','uncover','a','widespread','pattern','of','deliberate','destruction',],
    ['which','pointed','to','a','complex','set','of','reasons','why','most','works','of','egyptian','art','came','to','be','defaced','in','the','first','place']
]

Vector representations for the words in these 5 sentences can be trained with the following call. Setting `min_count=1` keeps every word, including words that occur only once.

In [None]:
w2vModel = Word2Vec(snts, min_count=1)

We can see the top-level features by just using the `print` function. In the output, `vocab=55` means that the vocabulary contains 55 words, and `size=100` means that each vector has 100 dimensions (gensim 4.x prints this as `vector_size=100`). `alpha` is the learning rate used during training: it starts at 0.025 by default and decreases linearly as training progresses.

In [None]:
print(w2vModel)
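
The linear decrease of `alpha` can be sketched as follows. This is only an illustration of the schedule, not gensim's internal code: the helper `linear_alpha` is hypothetical, and the end value 0.0001 is assumed from gensim's default `min_alpha`.

```python
def linear_alpha(progress, start_alpha=0.025, min_alpha=0.0001):
    """Learning rate after a fraction `progress` (0.0 to 1.0) of training."""
    return start_alpha - (start_alpha - min_alpha) * progress

# Sample the schedule at the boundaries of 5 training epochs
epochs = 5
for epoch in range(epochs + 1):
    print(epoch, round(linear_alpha(epoch / epochs), 5))
```

The learning rate starts at `start_alpha` and reaches `min_alpha` when training is complete, so early updates move the vectors more than late ones.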

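The vocabulary can be accessed and printed out. The `wv.vocab` attribute used here belongs to gensim 3.x; gensim 4.x removed it in favor of `wv.key_to_index` and `wv.index_to_key`.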

In [None]:
words = list(w2vModel.wv.vocab)

In [None]:
print(words)

We can use the following expression to see the vector representation of a word.

In [None]:
print(w2vModel.wv['why'])

Use the following expression to see the 10 words with the highest cosine similarity to a given word.

In [None]:
w2vModel.wv.most_similar('noses', topn=10)
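
What such a query computes can be sketched with NumPy on a toy vocabulary. The function below is a simplified stand-in written for this example, not gensim's implementation, and the 2-dimensional vectors are invented.

```python
import numpy as np

def most_similar(query, vocab, vectors, topn=3):
    """Rank every other word by cosine similarity to `query`."""
    # Normalize each row to unit length so dot products equal cosines
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit[vocab.index(query)]
    order = np.argsort(-sims)  # indices sorted by decreasing similarity
    return [(vocab[i], float(sims[i])) for i in order if vocab[i] != query][:topn]

# Invented vectors; trained vectors would have 100 dimensions
vocab = ['noses', 'statues', 'broken', 'years']
vectors = np.array([[1.0, 0.0],
                    [0.9, 0.1],
                    [0.5, 0.5],
                    [0.0, 1.0]])

ranking = most_similar('noses', vocab, vectors)
print([word for word, _ in ranking])
```

With only 5 training sentences, the similarities returned by the real model are mostly noise, so the neighbors of `noses` should not be over-interpreted.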

We can save this model to a file `myModel.bin` and load it back from that file.

In [None]:
w2vModel.save('myModel.bin')

In [None]:
myModel = Word2Vec.load('myModel.bin')

In [None]:
print(myModel)

To get a better intuition for the co-occurrence relations among all words, we can plot these vectors in a 2-dimensional space. We first need to reduce their dimensions from 100 to 2 by using the `PCA` (Principal Component Analysis) class.

In [None]:
from sklearn.decomposition import PCA

In [None]:
pca = PCA(n_components=2)

In [None]:
points2D = pca.fit_transform(w2vModel.wv[w2vModel.wv.vocab])
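
As a rough sketch of what this reduction does, PCA can be written with a singular value decomposition in NumPy. The helper `pca_2d` and the random stand-in data are assumptions made for this example; scikit-learn's result may differ from it in the sign of each axis.

```python
import numpy as np

def pca_2d(X):
    """Project the rows of X onto their first two principal components."""
    centered = X - X.mean(axis=0)
    # Rows of vt are the principal directions, sorted by decreasing variance
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

rng = np.random.default_rng(0)
X = rng.normal(size=(55, 100))  # stand-in for 55 word vectors of size 100
points = pca_2d(X)
print(points.shape)  # (55, 2)
```

The two retained axes are the directions along which the vectors vary most, so the 2-dimensional plot keeps as much of the original spread as a linear projection can.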

To visualize these points, we use `pyplot` from the `matplotlib` library.

In [None]:
import matplotlib
from matplotlib import pyplot

`matplotlib.rcParams['figure.figsize']` controls the figure size as `[width, height]` in inches. Matplotlib's built-in default is `[6.4, 4.8]`; we enlarge it so that the word labels stay readable.

In [None]:
matplotlib.rcParams['figure.figsize'] = [20, 10]

In [None]:
pyplot.scatter(points2D[:, 0], points2D[:,1])

Which word does a point represent? Let us tag them.

In [None]:
pyplot.scatter(points2D[:, 0], points2D[:,1])
for index, word in enumerate(words):
    pyplot.annotate(word, xy=(points2D[index,0], points2D[index,1]))

Now, let us write a function that accepts an HTTP address as input, extracts all the sentences (tagged by `contextClass` in the webpage) from the page at that address, computes word embeddings, saves them into an output file, and plots the result with tagged words.
```
def computing_word2vec_in_webpage(httpAddress   = 'http://...', 
                              contextClass  = '',
                              outputW2VFile = '/home/user/.../w2vWeb.bin'):
    ...  # the body is developed step by step in the following cells
```
We decompose this process into four steps:
* retrieve the webpage at the HTTP address and extract its text
* transform the text into a list of sentences, where each sentence is a list of tokens
* use gensim to train word embeddings
* save the result into an output file and plot it


We use the `urllib.request` package to retrieve the webpage at 'https://edition.cnn.com/style/article/egyptian-statues-broken-noses-artsy/index.html', and the `BeautifulSoup` package to extract its text.

### note: make sure that your computer is allowed to access and retrieve webpages

In [None]:
import urllib.request  # `import urllib` alone does not load the `request` submodule

In [None]:
from bs4 import BeautifulSoup

In [None]:
webpage = ''
httpAddress = 'https://edition.cnn.com/style/article/egyptian-statues-broken-noses-artsy/index.html'
contextClass = 'Paragraph__component BasicArticle__paragraph BasicArticle__pad'
webpage = urllib.request.urlopen(httpAddress).read().decode('utf-8')

In [None]:
print(len(webpage))

The value should not be 0. 
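
If the page cannot be retrieved, `urlopen` raises an exception instead of returning an empty string. A guarded variant could look like the sketch below; the helper name `fetch_page` and the 10-second timeout are assumptions, and the `data:` URL is used only so that the example runs without network access.

```python
import urllib.error
import urllib.request

def fetch_page(url, timeout=10):
    """Return the decoded page text, or '' if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode('utf-8')
    except urllib.error.URLError as err:
        # Covers unreachable hosts, unknown URL schemes, and HTTP error codes
        print('Could not retrieve', url, '->', err.reason)
        return ''

# Offline check: a data: URL carries its content inline
page = fetch_page('data:text/plain,hello')
print(len(page))
```

With this variant, a length of 0 signals that the retrieval failed, and the printed reason helps to tell a blocked connection from a wrong address.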

In [None]:
soup = BeautifulSoup(webpage, 'html.parser')
sections = [sec.text for sec in soup.find_all(class_= contextClass)]
corpora = " ".join(sections)

In [None]:
print(corpora)

Now, we use NLTK tools to transform this text into a list of sentences, where each sentence is a list of tokens.

In [None]:
import nltk
from nltk.tokenize import TweetTokenizer, sent_tokenize

In [None]:
tokenizer = TweetTokenizer()

In [None]:
inputCorpora = [tokenizer.tokenize(snt) for snt in nltk.sent_tokenize(corpora.lower())]

In [None]:
from pprint import pprint

In [None]:
pprint(inputCorpora)

In [None]:
w2vModel = Word2Vec(inputCorpora, min_count=1)

In [None]:
print(w2vModel)

In [None]:
w2vModel.wv.most_similar('noses', topn=10)

Now, we can put these steps together into the function.

In [None]:
def computing_word2vec_in_webpage(httpAddress   = 'http://...', 
                              contextClass  = 'p',
                              outputW2VFile = 'w2vWeb.bin'):
    # get raw corpus
    import urllib.request  # `import urllib` alone does not load the `request` submodule
    from bs4 import BeautifulSoup
    webpage = urllib.request.urlopen(httpAddress).read().decode('utf-8')
    soup = BeautifulSoup(webpage, 'html.parser')
    sections = [sec.text for sec in soup.find_all(class_= contextClass)]
    corpora = " ".join(sections)
    
    # pre-process raw corpus: a list of sentences, each a list of lower-cased tokens
    import nltk
    from nltk.tokenize import TweetTokenizer
    tokenizer = TweetTokenizer()
    inputCorpora = [tokenizer.tokenize(snt) for snt in nltk.sent_tokenize(corpora.lower())]
    
    # the main machine learning process
    from gensim.models import Word2Vec
    w2vModel = Word2Vec(inputCorpora, min_count=1)
    print(w2vModel)
    
    # save result
    w2vModel.save(outputW2VFile)
    
    # visualize result
    from sklearn.decomposition import PCA
    # use this model's own vocabulary, not the global `words` from the earlier cells
    words = list(w2vModel.wv.vocab)
    pca = PCA(n_components=2)
    points2D = pca.fit_transform(w2vModel.wv[words])
    import matplotlib
    matplotlib.rcParams['figure.figsize'] = [20, 10]
    from matplotlib import pyplot
    pyplot.scatter(points2D[:, 0], points2D[:,1])
    # label each point with the word it represents
    for index, word in enumerate(words):
        pyplot.annotate(word, xy=(points2D[index,0], points2D[index,1]))

In [None]:
computing_word2vec_in_webpage(httpAddress = 'https://edition.cnn.com/style/article/egyptian-statues-broken-noses-artsy/index.html',
                              contextClass = 'Paragraph__component BasicArticle__paragraph BasicArticle__pad',
                              outputW2VFile = 'w2vEgyptianBrokenBoses.bin')                          


# References

* Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean (2013). Efficient Estimation of Word Representations in Vector Space. CoRR:abs/1301.3781. <a>http://arxiv.org/abs/1301.3781</a> 
* Jeffrey Pennington, Richard Socher, and Christopher D. Manning (2014). GloVe: Global Vectors for Word Representation. EMNLP-14. <a>https://nlp.stanford.edu/projects/glove/</a>
* Omer Levy and Yoav Goldberg (2014). Dependency-Based Word Embeddings. ACL-14. <a>https://levyomer.wordpress.com/2014/04/25/dependency-based-word-embeddings/2</a>

<p style="background-color:#66A5D1;padding-top:0.2em;padding-bottom: 0.2em" />

<table style="width:100%">
  <tr>
      <td colspan="1" style="text-align:left;background-color:#0071BD;color:white">
        <a rel="license" href="http://creativecommons.org/licenses/by-nc/4.0/">
            <img alt="Creative Commons License" style="border-width:0;float:left;padding-right:10pt"
                 src="https://i.creativecommons.org/l/by-nc/4.0/88x31.png" />
        </a>
        &copy; T. Dong, C. Bauckhage<br/>
        Licensed under a 
        <a rel="license" href="http://creativecommons.org/licenses/by-nc/4.0/" style="color:white">
            CC BY-NC 4.0
        </a>.
      </td>
      <td colspan="2" style="text-align:left;background-color:#66A5D1">
          <b>Acknowledgments:</b>
          This material was prepared within the project
          <a href="http://www.b-it-center.de/b-it-programmes/teaching-material/p3ml/" style="color:black">
              P3ML
          </a> 
          which is funded by the Ministry of Education and Research of Germany (BMBF)
          under grant number 01/S17064. The authors gratefully acknowledge this support.
      </td>
  </tr>
</table>