# Finding The Words - Word Associations, Word Embeddings and the Word2Vec model
##### David Miller - August 2018 - [Link to Github](https://github.com/millerdw/millerdw.github.io/tree/master/_notebooks/FindingTheWords_2)
---

## Introduction
In a previous post on [simple NLP using R](http://millerdw.github.io/_posts/2018-07-23-RSS and Simple Natural Language Processing.html), we developed an algorithm that used frequency analysis of the vocabulary in different texts to cluster those texts together. We looked at a couple of improvements on this theme, including using w-shingling to compare more complex word combinations rather than just vocabulary.

In this post I want to develop a couple of those ideas further, and address a few of the shortcomings of those approaches; namely that they are relatively lightweight, and attempt a rather superficial form of unsupervised learning, rather than diving deeper into the meaning of the texts.

Finally, I'm going to take a little bit of a dive into the Word2Vec model ...

- Word Associations
    + Word Roots and Synonyms
- Word Embeddings
    + Translating words into 'meaning' vectors
- Transfer Learning
    + Using the Word2Vec model

## Word Associations
An immediate downside to comparing complete words and vocabulary within texts is that your algorithm is at the mercy of the author; idioms, favoured words, spelling *conventions* (think 'colour' vs 'color'), spelling *mistakes*... If an algorithm doesn't know to recognise the similarities between such differences (i.e. hasn't been explicitly coded, or taught, to do so), then these all add up to a lot of noise in the signal you are trying to process. This can be a serious problem in texts that are condensed, and have very few words to go by, such as a news article.
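To make the noise problem concrete, here is a minimal sketch that compares the vocabularies of two short texts using Jaccard similarity (shared words divided by total distinct words). The `jaccard` helper and the two example sentences are my own assumptions for illustration, not code from the previous post.

```python
def jaccard(a, b):
    """Jaccard similarity between the vocabularies (sets of words) of two texts."""
    vocab_a, vocab_b = set(a.lower().split()), set(b.lower().split())
    return len(vocab_a & vocab_b) / len(vocab_a | vocab_b)

# two hypothetical texts that differ only in spelling convention
british = "the colour of the harbour at dusk"
american = "the color of the harbor at dusk"

print(jaccard(british, british))   # identical texts
print(jaccard(british, american))  # same meaning, different spelling
```

In a text this short, two spelling differences are enough to halve the similarity score, even though a human reader would call the sentences identical.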

This is an important point to consider when you're working on an NLP problem. Do I try to solve it by preprocessing the data before it reaches my algorithm? Do I try an algorithm that's more robust to some of these issues, say by focussing on strings of characters rather than complete words? I think both are viable options, depending on what your goals are, but for the purposes of this post I'm going to focus on the former. In short, I'm more interested in document-level text comparison, so a character-level algorithm is likely to be overkill. Added to this, I'm a big believer in a plug-and-play approach to programming, whereby the different components of an algorithm can be separated (see *pre*-processing), upgraded, replaced, generally-messed-with, forgotten, and even reintroduced at a later date, *without affecting any code elsewhere in the project*.


> I'm a big believer in a plug-and-play approach to programming, whereby the different components of an algorithm can be separated (see *pre*-processing), upgraded, replaced, generally-messed-with, forgotten, and even reintroduced at a later date, *without affecting any code elsewhere in the project*.

### Variety is the Spice of Life
A lot of work has been published around the idea of cleaning or normalising individual words before they're input into an algorithm. This is often referred to as [stemming or lemmatizing](https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html) the text, and consists of either simply removing suffixes until only the core of a word remains (stemming), or performing a more complex analysis over a large vocabulary in order to group the various forms of a word together (lemmatization).
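The difference between the two approaches can be sketched with a pair of toy functions. Everything here is a made-up illustration: the `SUFFIXES` list and the `LEMMAS` lookup table are hand-picked assumptions, and real stemmers and lemmatizers use far more careful rules and much larger dictionaries.

```python
# Hypothetical, hand-picked suffix list; longer suffixes are tried first.
SUFFIXES = ['ization', 'isation', 'izing', 'ising', 'ing', 'ed', 's']

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters of stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

# Hypothetical lookup table standing in for a real lemmatizer's dictionary.
LEMMAS = {'ran': 'run', 'running': 'run', 'better': 'good', 'mice': 'mouse'}

def toy_lemmatize(word):
    """Return the dictionary form if the word is known, otherwise the word itself."""
    return LEMMAS.get(word, word)

print('Raw Word\t Toy Stem\t Toy Lemma')
for w in ['generalizing', 'crashed', 'barriers', 'running', 'mice']:
    print(w, '\t', toy_stem(w), '\t', toy_lemmatize(w))
```

The stemmer is cheap and needs no vocabulary, but it produces non-words like 'runn' and cannot handle irregular forms such as 'mice'. The lemmatizer handles both, but only for words that someone has already put in its dictionary.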

For the purposes of demonstration, I'm going to use the [Porter Stemmer](http://stp.lingfil.uu.se/~marie/undervisning/textanalys16/porter.pdf) algorithm to clean up our vocabulary. The Porter Stemmer works by applying a set of rules or heuristics in order, to accurately reduce as many words as possible to a 'correct' word stem. However, the English language is rather varied - given its [variety of historical influences](https://www.merriam-webster.com/help/faq-history), and England's more recent history of global trade, colonialism, and various forms of cultural osmosis (see ['pukka'](https://blog.oxforddictionaries.com/2013/06/14/pukka/), ['tattoo'](https://www.tattoo.com/blog/origin-word-tattoo/), ['chit'](https://www.etymonline.com/word/chit)) - which means that such a set of rules will never be perfect.

There are lots of different stemming algorithms, but this one is the most widely known, and has the advantage of being published in its own library; [Snowball](http://snowballstem.org/).


> *"We don't just borrow from other languages; English pursues them down alleyways, beats them unconscious, and rifles their pockets for loose grammar."*

### Getting to work
For the purposes of this blog, Python is the programming language of choice. Most of my previous NLP work has been in R, so this will be a first for me. The main reason for my doing this is to make use of the wide variety of tools in the Python community, especially libraries such as [Numpy](http://www.numpy.org/), [Plotly](https://plot.ly/), the [Natural Language Toolkit (NLTK)](https://www.nltk.org/index.html), and later, perhaps, [Keras](https://keras.io/) for deep learning. 

While the Snowball library is included in NLTK, it is also available in a lighter Python wrapper called 'PyStemmer'. As with all of these third-party libraries, you'll have to install a copy in addition to your Python installation if you're doing this at home. You'll need to open a `cmd` terminal (in Windows) and run `pip install PyStemmer` to install the library, and then import it into the relevant Python script, as usual.

Believe it or not, this will also be my first attempt to build a functioning Python script of my own, so let me know if you find something that's bad practice or poorly executed - I've been working with C#, a very similar language, for my whole career, and I've had plenty of experience in reading, contemplating, and occasionally fixing other people's work in Python, but I've never had the joy of starting from scratch! 

Anyway, here goes... In the script below, we're going to import the Snowball/PyStemmer library containing the Porter Stemmer algorithm, and apply it to a set of similar-sounding words. Note also that we have to choose the language that we're interested in. This might seem obvious, but it highlights something worth remembering; the algorithm uses a set of fixed rules based on the cases, tenses and plural forms of various words, and these will obviously change depending on the language or dialect used. PyStemmer offers both `porter` (the original Porter algorithm) and `english` (Snowball's revised English stemmer, often called Porter2); the script uses `english`.

In [1]:
import numpy as np
import Stemmer as ps

# list available stemmers
print('Stemmer Languages available:\t',ps.algorithms())


Stemmer Languages available:	 ['danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'porter', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish', 'turkish']


In [2]:
# create stemmer class
stemmer = ps.Stemmer('english')

# stemmer examples
rawWords = ['general','generalizing','generalization','generalise','generalising','generalisation']

print('Raw Word\t Stemmed Word')
for w in rawWords :
    print(w,'\t',stemmer.stemWord(w))

Raw Word	 Stemmed Word
general 	 general
generalizing 	 general
generalization 	 general
generalise 	 generalis
generalising 	 generalis
generalisation 	 generalis


## [Extra] JSON, NewsAPI and building a news dataset

In [27]:
from newsapi import NewsApiClient

news = NewsApiClient(api_key='YOUR_API_KEY')  # placeholder: substitute your own NewsAPI key

all_news = []
for i in range(1,11):
    all_news.append(news.get_everything(sources='bbc-news,the-verge,abc-news,ary news,associated press,wired,aftenposten,bbc news,bild,blasting news,bloomberg,business insider,engadget,google news,the verge',
                                        from_param='2018-08-01',
                                        to='2018-08-14',
                                        language='en',
                                        page_size=100,
                                        page=i))

In [49]:
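rawArticles=[]
for response in all_news:
    # each response is a dict; its 'articles' entry holds that page of results
    # (the loop variable is not called `news`, to avoid overwriting the API client)
    rawArticles.extend(response['articles'])

# dict comprehension syntax - similar to usage in F#, or the foreach library in R
rawArticles = {i : article for i, article in enumerate(rawArticles)}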

print(len(rawArticles))
print(rawArticles[1])

900
{'source': {'id': 'abc-news', 'name': 'ABC News'}, 'author': 'ABC News', 'title': 'WATCH: Driver arrested for terrorism after crashing into barrier at British Parliament', 'description': 'Metropolitan Police have arrested the driver of a vehicle after he crashed into the barriers outside British Parliament in London and are holding him on suspicion of terrorist offenses.', 'url': 'https://abcnews.go.com/International/video/driver-arrested-terrorism-crashing-barrier-british-parliament-57164907', 'urlToImage': 'https://s.abcnews.com/images/International/180813_atm_parliament_crash_hpMain_16x9_992.jpg', 'publishedAt': '2018-08-14T11:04:07Z'}


In [96]:
import re

def preprocessArticles(articles) :
    # raw strings stop Python from treating '\W' as an (invalid) string escape
    reEndOfSentence = re.compile(r'\. ')
    reNonAlphaNumeric = re.compile(r'[\W_]')
    processedArticles = {}
    for i,article in articles.items() :
        # collect text
        text = str(article['description'])+' '
        # convert to lower case
        text = text.lower()
        # replace full stops with ENDOFSEN
        text = reEndOfSentence.sub(' ENDOFSEN ',text)
        # split text into individual words
        words = text.split(' ')
        # remove remaining non-alphanumerics
        words = [reNonAlphaNumeric.sub('',word) for word in words]
        

        # combine into list of processed articles
        processedArticles[i] = words
        
    return processedArticles

def stemText(words) :
    return [stemmer.stemWord(word) for word in words]

def stemArticles(articles) :
    return {i:stemText(words) for i,words in articles.items()}


articles = preprocessArticles(rawArticles)
stemmedArticles = stemArticles(articles)

print('\nRaw Description:')
print(rawArticles[1]['description'])
print('\nPreprocessed Description:')
print(articles[1])
print('\nStemmed Description:')
print(stemmedArticles[1])



Raw Description:
Metropolitan Police have arrested the driver of a vehicle after he crashed into the barriers outside British Parliament in London and are holding him on suspicion of terrorist offenses.

Preprocessed Description:
['metropolitan', 'police', 'have', 'arrested', 'the', 'driver', 'of', 'a', 'vehicle', 'after', 'he', 'crashed', 'into', 'the', 'barriers', 'outside', 'british', 'parliament', 'in', 'london', 'and', 'are', 'holding', 'him', 'on', 'suspicion', 'of', 'terrorist', 'offenses', 'ENDOFSEN', '']

Stemmed Description:
['metropolitan', 'polic', 'have', 'arrest', 'the', 'driver', 'of', 'a', 'vehicl', 'after', 'he', 'crash', 'into', 'the', 'barrier', 'outsid', 'british', 'parliament', 'in', 'london', 'and', 'are', 'hold', 'him', 'on', 'suspicion', 'of', 'terrorist', 'offens', 'ENDOFSEN', '']
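Note the empty string at the end of both lists: splitting on single spaces leaves an empty token wherever two spaces sit side by side or the text ends in a space. As a hedged alternative, here is a minimal sketch of a tokeniser that avoids this by matching the words themselves rather than splitting on the gaps. The `tokenise` function is my own assumption rather than part of the pipeline above, and it behaves slightly differently: punctuation inside a word (as in "don't") splits the word in two instead of being deleted.

```python
import re

def tokenise(text):
    """Lower-case a text and return its words, with ENDOFSEN marking sentence ends."""
    # mark sentence boundaries: a full stop followed by whitespace or the end of the text
    text = re.sub(r'\.(\s+|$)', ' ENDOFSEN ', text.lower())
    # [^\W_]+ matches runs of letters and digits (word characters minus the underscore),
    # so punctuation and repeated spaces never produce empty tokens
    return re.findall(r'[^\W_]+', text)

print(tokenise('Police arrested the driver.  He crashed into the barriers.'))
```

Because the marker is inserted after lower-casing, `ENDOFSEN` stays upper-case and cannot collide with a real word from the text.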


## Vectorised Implementations

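The simplest vectorised representation is a word-count vector: each text is described by a single vector with one entry per word in the vocabulary, holding the number of times that word appears in the text.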



In [103]:

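def buildVocabulary(processedArticles):
    # collect the set of distinct words across all articles
    vocabulary = set()
    for article in processedArticles.values():
        vocabulary.update(article)

    # sort so that each word gets a stable, repeatable index
    vocabulary = sorted(vocabulary)

    # return as dictionary of index -> word
    return {i:word for i,word in enumerate(vocabulary)}

def vectoriseText(vocabToIndexMap,article):
    # convert to a numpy array so that `indexVector==i` is an element-wise comparison;
    # comparing a plain list to an int gives a single False, so every count would be 0
    indexVector=np.array([vocabToIndexMap[word] for word in article])
    return [int(np.sum(indexVector==i)) for i in range(len(vocabToIndexMap))]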
        

def vectoriseArticles(processedArticles):
    # build vocabulary
    vocabulary = buildVocabulary(processedArticles)
    # create map of word to vocabulary index
    vocabToIndexMap={w:i for i,w in vocabulary.items()}
    # convert to dictionary of vectors
    vectorisedArticles = {i:vectoriseText(vocabToIndexMap,article) for i,article in processedArticles.items()}
    return vectorisedArticles, vocabulary
    
  
vectorisedArticles, vocabulary = vectoriseArticles(articles)
vectorisedStemmedArticles, stemmedVocabulary = vectoriseArticles(stemmedArticles)

print('Original Vocabulary:\t'+str(len(vocabulary)))
print('Stemmed Vocabulary:\t'+str(len(stemmedVocabulary)))

Original Vocabulary:	4820
Stemmed Vocabulary:	3823


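As expected, the original vocabulary is significantly larger than the stemmed vocabulary (4820 words against 3823), because stemming collapses the different forms of a word onto a single stem.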
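For comparison, the same kind of count vector can be built with `collections.Counter` from the standard library. This is a sketch under my own assumptions (the `count_vector` helper and the two tiny stemmed documents are made up for illustration), not a replacement for the functions above.

```python
from collections import Counter

def count_vector(words, vocabulary):
    """Return word counts for `words`, ordered to match the sorted `vocabulary` list."""
    counts = Counter(words)  # a Counter returns 0 for words it has not seen
    return [counts[word] for word in vocabulary]

# two hypothetical, already-stemmed documents
docs = {
    0: ['the', 'driver', 'crash', 'the', 'barrier'],
    1: ['the', 'polic', 'arrest', 'the', 'driver'],
}

# the vocabulary is the sorted set of every word in every document
vocab = sorted(set(word for words in docs.values() for word in words))

print(vocab)
for i, words in docs.items():
    print(i, count_vector(words, vocab))
```

Each document is counted once, and the vector is then read off in vocabulary order, so the cost per document grows with the document length plus the vocabulary size rather than with their product.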



## Vectorised k-Means

In [104]:
print(vectorisedArticles[1])

(output truncated: a list of word counts with one entry per vocabulary word, nearly all of them 0)

## Word Embeddings
Superficial text vs deeper meaning