# Intro to NLP

#### Areas covered

Natural language processing (NLP) covers a wide range of activities. This notebook covers the basic steps of preparing data for NLP applications:
1. Word tokenization
2. Creating word embeddings
3. POS tagging
4. Lemmatization
5. Tree parsing

#### Import libraries

In [15]:
from nltk import word_tokenize, pos_tag
from nltk.corpus import brown
from gensim.models import Word2Vec
from nltk.stem import WordNetLemmatizer
from nltk.chunk import RegexpParser

#### Word tokenization

Word tokenization is the task of separating text into words.

The simplest way to tokenize text is to have Python split the string on spaces. For English and other European languages, this does not always do what we want: punctuation is typically mishandled, leaving chunks that are not words. Moreover, many Asian languages do not place spaces between words, so this strategy does not work at all for those languages.
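
To see the punctuation problem concretely, here is a minimal sketch of the space-splitting approach on a made-up sentence (the sample text is an assumption chosen only for illustration):

```python
# Naive tokenization: split the string on whitespace only
sample = "Hello, world! Isn't tokenization fun?"
naive_tokens = sample.split()
print(naive_tokens)

# Chunks that still have punctuation glued to them are not clean words
glued = [tok for tok in naive_tokens if not tok.isalpha()]
print(glued)
```

Here `'Hello,'`, `'world!'`, and `'fun?'` each come out as a single chunk with the punctuation attached, which is the kind of result a dedicated tokenizer is meant to avoid.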

The function `word_tokenize` from `nltk` is useful because it accurately splits text in languages that place spaces between words. Languages that do not do this need special modules. For languages like Hmong, which place spaces between syllables and whose morpheme boundaries coincide with syllable boundaries, `word_tokenize` can be combined with IOB tagging (labeling each syllable as the Beginning of, Inside, or Outside a word) to recover units at the word level.

In [2]:
text = 'The method word_tokenize from nltk is useful in that it can accurately split text from languages that use word-based spacing. Other languages that do not do this need special modules to achieve this, or, in the case of languages like Hmong that place spaces between syllables, where morpheme boundaries coincide with syllable boundaries, word_tokenize can be used with IOB tagging (explained below) at the word level.'
tokens = word_tokenize(text)
print(tokens)

['The', 'method', 'word_tokenize', 'from', 'nltk', 'is', 'useful', 'in', 'that', 'it', 'can', 'accurately', 'split', 'text', 'from', 'languages', 'that', 'use', 'word-based', 'spacing', '.', 'Other', 'languages', 'that', 'do', 'not', 'do', 'this', 'need', 'special', 'modules', 'to', 'achieve', 'this', ',', 'or', ',', 'in', 'the', 'case', 'of', 'languages', 'like', 'Hmong', 'that', 'place', 'spaces', 'between', 'syllables', ',', 'where', 'morpheme', 'boundaries', 'coincide', 'with', 'syllable', 'boundaries', ',', 'word_tokenize', 'can', 'be', 'used', 'with', 'IOB', 'tagging', '(', 'explained', 'below', ')', 'at', 'the', 'word', 'level', '.']
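
The IOB idea mentioned above for syllable-spaced languages can be sketched as follows. This is only an illustration: the helper `group_syllables` is hypothetical, and the tags are assigned by hand (including the assumption that the greeting _nyob zoo_ counts as one word). In practice the tags would come from a trained tagger.

```python
def group_syllables(tagged_syllables):
    """Merge (syllable, tag) pairs into words using IOB tags.

    B = first syllable of a word, I = later syllable of the same word,
    O = token outside any word (for example, punctuation).
    """
    words = []
    current = []
    for syllable, tag in tagged_syllables:
        if tag == "I" and current:
            # continue the word that is currently open
            current.append(syllable)
            continue
        if current:
            # any other tag closes the open word
            words.append(" ".join(current))
            current = []
        if tag in ("B", "I"):
            # a stray I with no open word is treated as a new word
            current = [syllable]
        else:
            words.append(syllable)
    if current:
        words.append(" ".join(current))
    return words

# Hand-tagged, illustrative input (not the output of a real tagger)
tagged = [("nyob", "B"), ("zoo", "I"), ("kuv", "B"), ("yog", "B"), (".", "O")]
words = group_syllables(tagged)
print(words)
```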


#### Word embeddings

Word embeddings represent words as vectors, that is, as lists of numbers. Because words that occur in similar contexts receive similar vectors, models can be trained on these vectors rather than on the raw words.

A simple version might represent _cat_ as `[0.36403364 -0.49103507 0.18100677 0.82723975 0.641248 0.34181604 0.01125114 -0.7229596 0.24586748 -0.5252472]`.

We create a very simple, illustrative word embedding model for English below, using the Brown corpus from `nltk.corpus.brown` and the `Word2Vec` class from `gensim.models`. The sentences are drawn from the Brown corpus with `brown.sents()`, because `Word2Vec` expects tokenized sentences as input.

The `Word2Vec` constructor takes the following parameters (names as of gensim 4.0; earlier releases call `epochs` `iter` and `vector_size` `size`):
1. the data (`sentences`)
2. `epochs`: the number of passes over the data during training
3. `workers`: the number of worker threads to use during training
4. `vector_size`: the number of dimensions in each word embedding vector
5. `window`: the maximum number of words on either side of the current word to consider
6. `min_count`: the minimum number of times a word must occur to receive its own word embedding vector

In [9]:
sentences = brown.sents()
# argument names for gensim >= 4.0; on gensim 3.x use iter=10 and size=10 instead
model = Word2Vec(sentences, epochs=10, workers=10, vector_size=10, window=5, min_count=5)
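
To make the context window and the minimum-count cutoff more concrete, here is a small pure-Python sketch. The helpers `context_windows` and `frequent_words` and the toy sentence are hypothetical; they only mimic the idea and leave out the details of gensim's actual training loop.

```python
from collections import Counter

def context_windows(sentence, window):
    """Pair each word with up to `window` neighbors on either side."""
    pairs = []
    for i, target in enumerate(sentence):
        left = sentence[max(0, i - window):i]
        right = sentence[i + 1:i + 1 + window]
        pairs.append((target, left + right))
    return pairs

def frequent_words(sentences, min_count):
    """Keep only the words that occur at least `min_count` times in total."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, count in counts.items() if count >= min_count}

# Toy data, standing in for the tokenized sentences of a real corpus
toy_sentences = [["the", "cat", "sat", "on", "the", "mat"]]

for target, context in context_windows(toy_sentences[0], window=2):
    print(target, context)

print(frequent_words(toy_sentences, min_count=2))
```

With a window of 2, _sat_ is paired with _the_, _cat_, _on_, and _the_, while with a minimum count of 2 only _the_ would keep a vector of its own.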

Now, we can see what the word embedding vectors look like for individual words using `model.wv['word']`.

In [11]:
print(model.wv['cat'])
print(model.wv['new'])

[ 0.36403364 -0.49103507  0.18100677  0.82723975  0.641248    0.34181604
  0.01125114 -0.7229596   0.24586748 -0.5252472 ]
[-0.51304495  0.0516611  -1.1650859   0.30826917 -0.3926486  -1.0635322
 -3.607118    0.06678296  0.6079878  -1.8219658 ]


One interesting trick is to use word embeddings to find similar words, using `model.wv.most_similar`.

In [18]:
print(model.wv.most_similar(['blue']))

[('black', 0.968340277671814), ('brown', 0.9671033620834351), ('green', 0.9663865566253662), ('gray', 0.9647021293640137), ('red', 0.9555657505989075), ('white', 0.9543337821960449), ('thin', 0.9507400989532471), ('pink', 0.9382452964782715), ('deep', 0.9381362199783325), ('pale', 0.9359599351882935)]


When we try a color word, lots of other color words also appear: _black_, _brown_, _green_, _gray_, _red_, _white_, and _pink_, along with other adjectives of physical description such as _thin_, and especially those associated with colors, such as _deep_ (as in _deep red_) or _pale_ (as in _pale red_).
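
The scores returned by `most_similar` are cosine similarities between word vectors. As a rough sketch of what that measure computes, the function below (a hypothetical helper, not gensim's implementation) applies it to the two vectors printed earlier for _cat_ and _new_:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Vectors copied from the notebook output above
cat = [0.36403364, -0.49103507, 0.18100677, 0.82723975, 0.641248,
       0.34181604, 0.01125114, -0.7229596, 0.24586748, -0.5252472]
new = [-0.51304495, 0.0516611, -1.1650859, 0.30826917, -0.3926486,
       -1.0635322, -3.607118, 0.06678296, 0.6079878, -1.8219658]

print(cosine_similarity(cat, cat))  # a vector compared with itself
print(cosine_similarity(cat, new))
```

Values near 1 mean the two vectors point in nearly the same direction, values near 0 mean they are unrelated, and negative values mean they point in opposing directions.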

However, given the small size of the vectors (only 10!) and the small number of iterations (also only 10!), other words give much noisier results: _lawyer_ returns _governor_ and _critic_, along with strange neighbors such as _Holy_, _old_, and _thinks_. Larger vectors and more training iterations reduce this noise.

Even better, however, are more robust approaches such as BERT, XLNet, and ALBERT. Note that training these models requires far more computing power than a typical laptop or desktop offers. Out of curiosity, I once attempted to run ALBERT on my laptop, and at the rate it was processing the data, training a model for Hmong on the SCH Corpus would have taken about three years. The good news is that these more robust models have already been trained for many languages by the Google team, but only for languages whose Wikipedia has a large number of articles.

#### POS tagging

Part-of-speech (POS) tagging assigns a word class label to each word in a text. This not only makes the text more accessible for corpus-based data mining but also typically serves as a basis for more advanced NLP applications.
`nltk` provides the function `pos_tag`, which does this automatically for English because `nltk` offers a tagger already trained on tagged English text.

In [4]:
pos_tagged_tokens = pos_tag(tokens)
print(pos_tagged_tokens)

[('The', 'DT'), ('method', 'NN'), ('word_tokenize', 'NN'), ('from', 'IN'), ('nltk', 'NN'), ('is', 'VBZ'), ('useful', 'JJ'), ('in', 'IN'), ('that', 'IN'), ('it', 'PRP'), ('can', 'MD'), ('accurately', 'RB'), ('split', 'VB'), ('text', 'NN'), ('from', 'IN'), ('languages', 'NNS'), ('that', 'WDT'), ('use', 'VBP'), ('word-based', 'JJ'), ('spacing', 'NN'), ('.', '.'), ('Other', 'JJ'), ('languages', 'NNS'), ('that', 'WDT'), ('do', 'VBP'), ('not', 'RB'), ('do', 'VB'), ('this', 'DT'), ('need', 'NN'), ('special', 'JJ'), ('modules', 'NNS'), ('to', 'TO'), ('achieve', 'VB'), ('this', 'DT'), (',', ','), ('or', 'CC'), (',', ','), ('in', 'IN'), ('the', 'DT'), ('case', 'NN'), ('of', 'IN'), ('languages', 'NNS'), ('like', 'IN'), ('Hmong', 'NNP'), ('that', 'WDT'), ('place', 'NN'), ('spaces', 'NNS'), ('between', 'IN'), ('syllables', 'NNS'), (',', ','), ('where', 'WRB'), ('morpheme', 'NN'), ('boundaries', 'NNS'), ('coincide', 'VBP'), ('with', 'IN'), ('syllable', 'JJ'), ('boundaries', 'NNS'), (',', ','), ('wor

As can be seen above, POS tags typically have 2-3 letters representing the word class. _DT_ is 'determiner', _NN_ is 'noun', _PRP_ is 'personal pronoun', and so on.
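
As a small illustration of how tagged text supports corpus-style queries, the sketch below counts tags and pulls out the nouns from the first few pairs of the output above. The pairs are copied by hand rather than recomputed.

```python
from collections import Counter

# The first few (word, tag) pairs from the tagger output above
tagged = [
    ("The", "DT"), ("method", "NN"), ("word_tokenize", "NN"),
    ("from", "IN"), ("nltk", "NN"), ("is", "VBZ"), ("useful", "JJ"),
]

# How often does each word class occur?
tag_counts = Counter(tag for _, tag in tagged)
print(tag_counts.most_common())

# Which words were tagged as nouns? In this tagset, noun tags start with "NN".
nouns = [word for word, tag in tagged if tag.startswith("NN")]
print(nouns)
```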

For most other languages, a POS-tagging model has to be trained from scratch, for example by using the word embeddings described above as input.

#### Lemmatization

Lemmatization is the process of determining the lemma (roughly, the root or base form) of each word. For example, _walking_ produces _walk_ and _bigger_ produces _big_. This matters because it makes it simpler to recognize words that share a root. For linguistics research, it is especially useful for finding all of the forms of a given root and studying their distribution.
To do lemmatization, we can use `WordNetLemmatizer` from `nltk.stem`. `WordNetLemmatizer` is based on WordNet, a database of English words and the relations between them.
For other languages, a specific lemmatizer has to be developed. Several major world languages already have one, while Hmong, which has fewer than 30 affixes and readily identifiable word roots, would not benefit from lemmatization.

In [21]:
# uncomment on first use to download the WordNet data
#from nltk import download
#download('wordnet')

lemmatizer = WordNetLemmatizer()
sent = 'I saw them at the shop everyday when I was younger.'
tokens = word_tokenize(sent)
tagged_tokens = pos_tag(tokens)
print(tagged_tokens)

# map POS tags from pos_tag to the WordNet POS codes accepted by lemmatize()
tag_conversions = {'VBD': 'v', 'JJR': 'a'}

# lemmatize each token, passing a POS code when a conversion is known
lemmas = []
for word, tag in tagged_tokens:
    if tag in tag_conversions:
        lemma = lemmatizer.lemmatize(word, tag_conversions[tag])
    else:
        # with no POS code, lemmatize() treats the word as a noun
        lemma = lemmatizer.lemmatize(word)
    lemmas.append(lemma)

print(lemmas)

[('I', 'PRP'), ('saw', 'VBD'), ('them', 'PRP'), ('at', 'IN'), ('the', 'DT'), ('shop', 'NN'), ('everyday', 'NN'), ('when', 'WRB'), ('I', 'PRP'), ('was', 'VBD'), ('younger', 'JJR'), ('.', '.')]
['I', 'saw', 'them', 'at', 'the', 'shop', 'everyday', 'when', 'I', 'be', 'young', '.']
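
The conversion table in the cell above covers only two tags. One common way to generalize it, sketched here as an assumption rather than as part of the original notebook, is a first-letter heuristic that maps each tag to one of the four WordNet POS codes (`'a'`, `'v'`, `'n'`, `'r'`). The helper name `penn_to_wordnet` is hypothetical.

```python
def penn_to_wordnet(tag):
    """Map a POS tag to a WordNet POS code, or None if there is no match.

    First-letter heuristic: J* -> adjective, V* -> verb,
    N* -> noun, R* -> adverb.
    """
    prefix_to_pos = {"J": "a", "V": "v", "N": "n", "R": "r"}
    return prefix_to_pos.get(tag[:1])

for tag in ["VBD", "JJR", "NNS", "RB", "DT"]:
    print(tag, penn_to_wordnet(tag))
```

When the helper returns `None`, the lemmatizer can simply be called without a POS argument, as in the `else` branch above.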


#### Tree parsing

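Tree parsing combines words into constituent phrases; for example, an adjective + noun or a demonstrative + noun sequence forms a noun phrase.

This can be done with `nltk` by writing rules by hand as strings, which are then loaded into a `RegexpParser` object. Each rule in braces is a sequence of POS tags in angle brackets, and any matching sequence of tokens is grouped under the given label.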

In [20]:
# a POS-tagged Hmong phrase in word/TAG format
phrase = 'kuv/PN yog/VV ib/QU tug/CL neeg/NN'
tokens = phrase.split(' ')
tagged_tokens = [tuple(w.split('/')) for w in tokens]

# each {...} line is one rule: a tag sequence to group as an NP
NP_rule = """NP:
{<QU><CL><NN>}
{<PN>}"""

parser = RegexpParser(NP_rule)
result = parser.parse(tagged_tokens)
print(result)

(S (NP kuv/PN) yog/VV (NP ib/QU tug/CL neeg/NN))
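
To show what the two rules are doing, here is a simplified pure-Python stand-in. It is not how `RegexpParser` is implemented (it handles only fixed tag sequences and scans left to right), and the function `chunk` is hypothetical, but on this phrase it produces the same grouping as the tree above.

```python
def chunk(tagged_tokens, patterns, label="NP"):
    """Group tokens whose tags match one of the fixed tag sequences."""
    result = []
    i = 0
    while i < len(tagged_tokens):
        for pattern in patterns:
            window = tagged_tokens[i:i + len(pattern)]
            if [tag for _, tag in window] == list(pattern):
                # matched: wrap these tokens in a labeled chunk
                result.append((label, window))
                i += len(pattern)
                break
        else:
            # no rule matched here: leave the token outside any chunk
            result.append(tagged_tokens[i])
            i += 1
    return result

tagged = [("kuv", "PN"), ("yog", "VV"), ("ib", "QU"), ("tug", "CL"), ("neeg", "NN")]
patterns = [("QU", "CL", "NN"), ("PN",)]
chunks = chunk(tagged, patterns)
print(chunks)
```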


Beyond these basics, many other things can be done with NLP, building on the steps discussed above:
1. Named entity recognition (recognizing names of people, companies, places, etc.) and linking entities in relationships
2. Sentiment classification (deciding whether an evaluation is good, bad, etc.)
3. Text classification (spam vs. non-spam, genre, etc.)
4. Machine translation
5. Spell checking