# AIAC536 Assignment 2 - Building Your NLP Vocabulary
This is code from chapter 3 of *Hands-On Python Natural Language Processing*. We learn new vocabulary and explore various normalization techniques such as stemming, lemmatization, stopword removal, and case folding.

## Theory

### Lexicons
A lexicon is the vocabulary of a person, a language, or a profession. It consists of several **lexemes**.

### Phonemes
Phonemes are the smallest units of speech sound that distinguish one word from another.

### Graphemes
A grapheme is a group of one or more letters that represents a single phoneme. E.g. the word `spoon` consists of four phonemes, written with the graphemes `s`, `p`, `oo`, and `n`.

### Morpheme
The smallest meaningful unit in a language. E.g. the word `unbreakable` consists of `un`, `break`, and `able`.

## Tokenization
Tokenization is the process of breaking a document or text into smaller chunks called tokens.

In [5]:
print("The capital of China is Beijing".split())
print("China's capital is Beijing".split())
print("Let's travel from Hong Kong to Beijing".split())
print("A friend is pursuing his M.S from Beijing".split())

['The', 'capital', 'of', 'China', 'is', 'Beijing']
["China's", 'capital', 'is', 'Beijing']
["Let's", 'travel', 'from', 'Hong', 'Kong', 'to', 'Beijing']
['A', 'friend', 'is', 'pursuing', 'his', 'M.S', 'from', 'Beijing']


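The tokens above illustrate **unigrams** and **bigrams**. `Hong Kong` is a bigram because it spans two words, while `Beijing` is a unigram because it is a single word. Note that `split()` separates `Hong` and `Kong`, so the bigram has to be reassembled from adjacent tokens. The naming generalizes to **n-grams**: sequences of n consecutive tokens.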
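As a rough illustration of the n-gram idea, the sketch below builds unigrams and bigrams from a whitespace-tokenized sentence using only the standard library. The helper name `make_ngrams` is hypothetical and introduced here for illustration; it is not part of NLTK.

```python
def make_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    # zip over n staggered views of the token list; zip stops at the shortest view
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "Let's travel from Hong Kong to Beijing".split()
unigrams = make_ngrams(tokens, 1)
bigrams = make_ngrams(tokens, 2)

print(unigrams[:3])
print([" ".join(b) for b in bigrams])
```

With n = 2, the adjacent tokens `Hong` and `Kong` come back together as one bigram.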

In [12]:
## RegEx Tokenizer
from nltk.tokenize import RegexpTokenizer

s = "A Rolex watch costs in the range of $3000.0 - $8000.0 in USA"

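# Keep dollar amounts such as $3000.0 together as one token
tokenizer = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')
print(tokenizer.tokenize(s))

# Split every run of punctuation into its own token
tokenizer = RegexpTokenizer(r'\w+|[^\w\s]+')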
print(tokenizer.tokenize(s))

['A', 'Rolex', 'watch', 'costs', 'in', 'the', 'range', 'of', '$3000.0', '-', '$8000.0', 'in', 'USA']
['A', 'Rolex', 'watch', 'costs', 'in', 'the', 'range', 'of', '$', '3000', '.', '0', '-', '$', '8000', '.', '0', 'in', 'USA']
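To see what the two patterns do without NLTK, the sketch below applies them with the standard library's `re.findall`. It assumes that, for these patterns and this input, pattern matching over the string is all the tokenizer needs to do; it is an illustration of the patterns, not a description of NLTK's internals.

```python
import re

s = "A Rolex watch costs in the range of $3000.0 - $8000.0 in USA"

# Alternatives are tried left to right at each position in the string.
# The second alternative matches a dollar sign followed by digits and dots.
price_aware = re.findall(r"\w+|\$[\d\.]+|\S+", s)

# Here anything that is neither a word character nor whitespace
# becomes its own token, so "$3000.0" is broken apart.
split_punct = re.findall(r"\w+|[^\w\s]+", s)

print(price_aware)
print(split_punct)
```

The first pattern is the better fit when prices should survive as single tokens; the second is simpler but fragments them.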


In [13]:
## Treebank Tokenizer
from nltk.tokenize import TreebankWordTokenizer

s = "I'm going to buy a Rolex watch that doesn't cost more than $3000.0"
tokenizer = TreebankWordTokenizer()
tokenizer.tokenize(s)

['I',
 "'m",
 'going',
 'to',
 'buy',
 'a',
 'Rolex',
 'watch',
 'that',
 'does',
 "n't",
 'cost',
 'more',
 'than',
 '$',
 '3000.0']

In [17]:
## TweetTokenizer
from nltk.tokenize import TweetTokenizer

s = "@amankedia I'm going to buy a Rolexxxxxx watch!!! :-D #happiness #rolex <3"
tokenizer = TweetTokenizer(strip_handles=True, reduce_len=True)
tokenizer.tokenize(s)

["I'm",
 'going',
 'to',
 'buy',
 'a',
 'Rolexxx',
 'watch',
 '!',
 '!',
 '!',
 ':-D',
 '#happiness',
 '#rolex',
 '<3']

## Stemming
Stemming strips the inflected forms of a word down to a base form called the **stem**. The letters removed during stemming are called **affixes**.
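The idea of stripping affixes can be sketched with a few hand-written suffix rules. The rule list and the helper `toy_stem` below are hypothetical and deliberately crude; this is not the Porter or Snowball algorithm, which apply many more rules and conditions.

```python
# Hypothetical toy rules for illustration only; NOT the Porter algorithm.
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three letters of stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["meeting", "owned", "caresses", "plots", "is"]
print([toy_stem(w) for w in words])
```

Because the rules only look at the end of the word, a stem produced this way is not guaranteed to be a dictionary word.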

In [18]:
from nltk.stem.snowball import SnowballStemmer

print(SnowballStemmer.languages)

('arabic', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'porter', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish')


In [22]:
from nltk.stem.porter import PorterStemmer

plurals = ['caresses', 'flies', 'dies', 'mules', 'died', 'agreed', 'owned', 'humbled', 'sized', 'meeting', 'stating', 'siezing', 'itemization', 'traditional', 'reference', 'colonizer', 'plotted', 'having', 'generously']
stemmer = PorterStemmer()
singles = [stemmer.stem(p) for p in plurals]
print(singles)

['caress', 'fli', 'die', 'mule', 'die', 'agre', 'own', 'humbl', 'size', 'meet', 'state', 'siez', 'item', 'tradit', 'refer', 'colon', 'plot', 'have', 'gener']


## Lemmatization
Lemmatization uses context to convert words to their base form, called the **lemma**. The context can come from the surrounding words in a sentence, or even from the whole document.
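One way to picture why context matters: the same surface form can have different lemmas depending on its part of speech. The sketch below uses a tiny hand-written lookup table keyed by word and POS. Both the table `LEMMAS` and the helper `toy_lemmatize` are hypothetical; a real lemmatizer relies on a full dictionary such as WordNet.

```python
# Hypothetical hand-written table for illustration; not a real lexical resource.
LEMMAS = {
    ("meeting", "n"): "meeting",  # "a long meeting"
    ("meeting", "v"): "meet",     # "we are meeting today"
    ("are", "v"): "be",
    ("efforts", "n"): "effort",
}

def toy_lemmatize(word, pos="n"):
    """Look up (word, pos); fall back to the word itself when unknown."""
    return LEMMAS.get((word.lower(), pos), word)

print(toy_lemmatize("meeting", "n"))
print(toy_lemmatize("meeting", "v"))
print(toy_lemmatize("are"))       # default POS is noun, so no entry is found
print(toy_lemmatize("are", "v"))
```

Defaulting to noun when no POS is given means verbs pass through unchanged unless the caller supplies the tag.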

In [24]:
import nltk
#nltk.download('wordnet')

from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
s = "We are putting in efforts to enhance our understanding of Lemmatization"
token_list = s.split()
print("Tokens are: ", token_list)

lemmatized_tokens = [lemmatizer.lemmatize(t) for t in token_list]
print("Lemmatized tokens are:", lemmatized_tokens)

Tokens are:  ['We', 'are', 'putting', 'in', 'efforts', 'to', 'enhance', 'our', 'understanding', 'of', 'Lemmatization']
Lemmatized tokens are: ['We', 'are', 'putting', 'in', 'effort', 'to', 'enhance', 'our', 'understanding', 'of', 'Lemmatization']


In [30]:
## Let's include POS data for better lemmatization

from nltk.corpus import wordnet

## This is a common helper, widely used across the NLP community
def get_part_of_speech_tags(token):
    """Map the first letter of a token's POS tag to the WordNet POS constant that lemmatize() accepts. Covers verbs, nouns, adjectives, and adverbs; anything else defaults to noun."""
    tag_dict = {"J": wordnet.ADJ,"N": wordnet.NOUN,"V": wordnet.VERB,"R": wordnet.ADV}
    tag = nltk.pos_tag([token])[0][1][0].upper()
    return tag_dict.get(tag, wordnet.NOUN)

lemmatized_output_with_POS_information = [lemmatizer.lemmatize(token, get_part_of_speech_tags(token)) for token in token_list]

print('Lemmatized text:', ' '.join(lemmatized_output_with_POS_information))


Lemmatized text: We be put in effort to enhance our understand of Lemmatization


In [31]:
## Let's try the Snowball stemmer for comparison

stemmer2 = SnowballStemmer(language='english')
stemmed_sentence = [stemmer2.stem(token) for token in token_list]
print(' '.join(stemmed_sentence))

we are put in effort to enhanc our understand of lemmat


In [33]:
## spaCy lemmatizer

import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp('We are putting in efforts to enhance our understanding of Lemmatization')

print(' '.join([token.lemma_ for token in doc]))

we be put in effort to enhance our understanding of lemmatization


## Stopword Removal
Stopwords are words that occur frequently in a text but carry little information. Words such as `a`, `an`, and `the` are considered stopwords. They can be filtered out in many NLP tasks, where they contribute little to the result.
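The "occur frequently" part can be checked with a simple word count. The sketch below uses a made-up sample sentence and the standard library's `collections.Counter`; frequency alone is only a rough signal, and curated stopword lists are built by hand rather than derived this way.

```python
from collections import Counter

# Made-up sample text for illustration.
text = (
    "the cat sat on the mat and the dog sat on the rug "
    "while a bird watched the cat and the dog"
)

counts = Counter(text.split())

# The most frequent token is a function word, not a content word.
print(counts.most_common(3))
```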

In [34]:
## List all stopwords in the NLTK English stopword package
nltk.download('stopwords')
from nltk.corpus import stopwords

stop = set(stopwords.words('english'))
", ".join(stop)

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/kristian.aars/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


"at, by, himself, weren't, i, its, that'll, own, he, further, do, yours, hasn, through, his, we, how, didn, wasn, have, just, any, don't, did, you're, haven, such, that, few, should, aren't, doing, because, she's, on, until, her, when, which, only, those, in, to, it, y, they, isn't, m, who, themselves, down, t, needn, most, their, the, couldn, ll, can, this, him, under, below, then, doesn't, shouldn, you'd, myself, or, same, hadn't, why, not, yourselves, more, herself, was, o, mustn, whom, again, ourselves, these, no, needn't, doesn, our, with, are, my, all, don, being, where, d, me, ma, wouldn, mightn, couldn't, for, re, but, of, shan't, theirs, a, now, she, against, up, having, while, is, hers, there, aren, shan, into, weren, won, other, you've, some, shouldn't, had, about, between, mustn't, each, ours, yourself, wasn't, if, out, am, very, haven't, were, nor, too, both, mightn't, your, wouldn't, so, after, what, during, as, here, and, above, be, an, off, ain, before, has, isn, over, 

In [35]:
## We want to keep the wh-words, so remove them from the stopword list

wh_words = ['who', 'what', 'when', 'why', 'how', 'which', 'where', 'whom']
stop = set(stopwords.words('english'))

sentence = "how are we putting in efforts to enhance our understanding of Lemmatization"

for word in wh_words:
    stop.remove(word)

sentence_after_stopword_removal = [token for token in sentence.split() if token not in stop]
' '.join(sentence_after_stopword_removal)

'how putting efforts enhance understanding Lemmatization'

## Case folding
Case folding normalizes all letters to a common case, usually lower case.
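For plain English text `str.lower()` is usually enough, but Python also provides `str.casefold()`, which is more aggressive and intended for caseless matching. The sketch below compares the two on a small word list chosen for illustration.

```python
words = ["Beijing", "BEIJING", "Straße", "STRASSE"]

lowered = [w.lower() for w in words]
folded = [w.casefold() for w in words]

print(lowered)
print(folded)

# lower() keeps the German sharp s, so the two spellings still differ;
# casefold() expands it to "ss", so they compare equal.
print(lowered[2] == lowered[3], folded[2] == folded[3])
```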

In [36]:
s = "We are putting in efforts to enhance our understanding of Lemmatization"
s = s.lower()
s

'we are putting in efforts to enhance our understanding of lemmatization'

## N-Grams

In [39]:
from nltk.util import ngrams

s = 'Natural Language Processing is the way to go'
tokens = s.split()
# n=3 yields trigrams, so name the result accordingly
trigrams = list(ngrams(tokens, 3))
[' '.join(t) for t in trigrams]

['Natural Language Processing',
 'Language Processing is',
 'Processing is the',
 'is the way',
 'the way to',
 'way to go']