# Notes on NLP techniques
The notes below were taken mostly from an NLP class by Keith Galli, available at [this link](https://www.youtube.com/watch?v=vyOgWhwUmec) on the PyCon US YouTube channel, and from the documentation of some of the libraries mentioned.
<hr>

## Bag of Words 

The bag of words model splits each sentence in the training set into words and collects every unique word into a single vocabulary vector.<br>
Each sentence is then mapped to a row of a binary matrix, in which every column represents one word of the vocabulary.<br>
If $a_{11} = 1$ (counting rows and columns from zero), for example, it means that the second sentence in the set contains the second word at least once.
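
As a rough illustration of that mapping, the sketch below builds the vocabulary and the binary matrix by hand with the standard library only. The four sentences, the helper name `tokenize` and the choice to drop one-character tokens are assumptions made for this example; the last one mimics the default token pattern of scikit-learn's `CountVectorizer`.

```python
import re

sentences = ["i love the book", "this is a great book",
             "the fit is great", "i love the shoes"]

def tokenize(sentence):
    # Lowercase, then keep tokens of two or more word characters (assumption).
    return re.findall(r"\b\w\w+\b", sentence.lower())

# Sorted list of the unique words found across all sentences.
vocabulary = sorted({word for s in sentences for word in tokenize(s)})

# One row per sentence, one column per vocabulary word: 1 if present, else 0.
matrix = [[int(word in tokenize(s)) for word in vocabulary] for s in sentences]

print(vocabulary)
for row in matrix:
    print(row)
```

Because short tokens are dropped, "i" and "a" never reach the vocabulary, which leaves eight columns for these four sentences.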

In [3]:
from sklearn.feature_extraction.text import CountVectorizer

In [5]:
train_x = ["i love the book", "this is a great book", "the fit is great", "i love the shoes"]

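The `fit_transform` method builds the vector of unique words mentioned above (obtainable via the `get_feature_names` method, renamed to `get_feature_names_out` in recent scikit-learn releases) and returns the corresponding binary matrix. Single-character tokens such as "i" and "a" are dropped by the default tokenizer, so they do not appear in the vocabulary.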

In [10]:
vectorizer = CountVectorizer(binary=True)
vectors = vectorizer.fit_transform(train_x)

In [11]:
print(vectorizer.get_feature_names())

['book', 'fit', 'great', 'is', 'love', 'shoes', 'the', 'this']


In [12]:
print(vectors.toarray())

[[1 0 0 0 1 0 1 0]
 [1 0 1 1 0 0 0 1]
 [0 1 1 1 0 0 1 0]
 [0 0 0 0 1 1 1 0]]


The matrix above is the representation, in the Bag of Words model, of the sentences in _train_x_.<br>
Now we can feed those vectors to a Support Vector Machine classifier. Given a new sentence, converted to a binary vector over the same vocabulary, it predicts whether the sentence is about books or clothing.

In [16]:
class Category:
    BOOKS = "BOOKS"
    CLOTHING = "CLOTHING"

train_y = [Category.BOOKS, Category.BOOKS, Category.CLOTHING, Category.CLOTHING]

In [15]:
from sklearn import svm
clf_svm = svm.SVC(kernel="linear")
clf_svm.fit(vectors, train_y)

SVC(kernel='linear')

The `transform` method does not add new words to the vector of "features".<br>It only converts the sentences it receives into binary vectors over the words defined when `fit_transform` was called.

In [23]:
vectorizer.transform(["i like the book", 'i like the soup']).toarray()

array([[1, 0, 0, 0, 0, 0, 1, 0],
       [0, 0, 0, 0, 0, 0, 1, 0]], dtype=int64)
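
The same behaviour can be sketched without scikit-learn. In the example below, the vocabulary is copied from the feature names printed earlier, while `to_binary_vector` and `unknown_words` are hypothetical helpers written for this note, not part of any library.

```python
import re

# Vocabulary fixed at "fit" time, copied from the feature names printed earlier.
vocabulary = ["book", "fit", "great", "is", "love", "shoes", "the", "this"]

def to_binary_vector(sentence, vocabulary):
    # Words outside the vocabulary leave no trace in the resulting vector.
    tokens = set(re.findall(r"\b\w\w+\b", sentence.lower()))
    return [int(word in tokens) for word in vocabulary]

def unknown_words(sentence, vocabulary):
    # Tokens the fitted vocabulary has never seen.
    tokens = re.findall(r"\b\w\w+\b", sentence.lower())
    return [token for token in tokens if token not in vocabulary]

print(to_binary_vector("i like the book", vocabulary))
print(to_binary_vector("i like the soup", vocabulary))
print(unknown_words("i like the soup", vocabulary))
```

Since "like" and "soup" were never seen during fitting, the second sentence is reduced to the single word "the", which carries no information about either category.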

In [27]:
test_x = vectorizer.transform(["i like the book", "i hate the shoes", "I love this", "this is great"])
clf_svm.predict(test_x)

array(['BOOKS', 'CLOTHING', 'BOOKS', 'CLOTHING'], dtype='<U8')

The main problem with the bag of words model is that it does not capture the meaning of words or _n-grams_.<br>
It only associates words with categories, so that "love" points to `BOOKS` and "great" points to `CLOTHING`, even though they are virtually unrelated.<br>
The model also cannot interpret words that are absent from the training set, so it has no way of estimating whether they are related to some category or not.

In [36]:
test_x = vectorizer.transform(["i love the story", "this is a great author"])
clf_svm.predict(test_x)

array(['CLOTHING', 'CLOTHING'], dtype='<U8')

## Word Vectors

Word vectors are another approach to converting text into numerical vectors. This approach attempts to assign semantic meaning to words based on the words that appear around them.<br>
For example, given the sentences "I read the **book**", "The **book** has great **characters**", "The **story** in the **book** is great", this model could relate the word "story" to "book", "book" to "characters", and, indirectly, "story" to "characters".

In [4]:
import spacy
nlp = spacy.load("en_core_web_md")

In [38]:
docs = [nlp(text) for text in train_x]
train_x_wv = [x.vector for x in docs]

`nlp(text)` processes the sentence `text` and gives every word a pretrained embedding shipped with the "en_core_web_md" model; the `vector` attribute then summarizes those embeddings for the whole sentence.<br>
For example, the text those embeddings were learned from probably contains sentences with both "book" and "characters", "book" and "story", etc., so that when we try to classify a sentence containing "characters" or "story", it will land close to the word "book", which is ultimately assigned to the category `BOOKS` according to our training set.

In [22]:
clf_svm_wv = svm.SVC(kernel="linear")
clf_svm_wv.fit(train_x_wv, train_y)

SVC(kernel='linear')

In [37]:
test_x = ["i love the book", "a pointy hat"]
test_docs = [nlp(text) for text in test_x]
test_x_wv = [x.vector for x in test_docs]

clf_svm_wv.predict(test_x_wv)

array(['BOOKS', 'CLOTHING'], dtype='<U8')

It's worth mentioning that the Word Vectors model assigns semantic meaning to a sentence by averaging the meaning of each word in it. This means that, for longer sentences, meaning might get lost in this process.

In [44]:
test_x = ["A mind needs books as a sword needs a whetstone, if it is to keep its edge.",
          "A mind needs books."]
test_docs = [nlp(text) for text in test_x]
test_x_wv = [x.vector for x in test_docs]

clf_svm_wv.predict(test_x_wv)

array(['CLOTHING', 'BOOKS'], dtype='<U8')
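
The dilution caused by averaging can be illustrated with a toy model. The two-dimensional embeddings below are invented for this sketch (real models use hundreds of dimensions), and `sentence_vector` and `cosine` are hypothetical helpers, not spaCy functions.

```python
import math

# Hypothetical 2-dimensional embeddings, invented for illustration only.
toy_vectors = {
    "book": (1.0, 0.0),
    "mind": (0.8, 0.2),
    "sword": (0.0, 1.0),
    "whetstone": (0.0, 1.0),
    "edge": (0.1, 0.9),
}

def sentence_vector(words):
    # Average the vectors of the known words, component by component.
    known = [toy_vectors[w] for w in words if w in toy_vectors]
    return tuple(sum(v[i] for v in known) / len(known) for i in range(2))

def cosine(a, b):
    # Cosine similarity between two 2-dimensional vectors.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

short_vec = sentence_vector(["mind", "book"])
long_vec = sentence_vector(["mind", "book", "sword", "whetstone", "edge"])

print(round(cosine(short_vec, toy_vectors["book"]), 3))
print(round(cosine(long_vec, toy_vectors["book"]), 3))
```

In this toy setup the longer sentence ends up further from "book" than the shorter one, because the extra words pull the average in another direction.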

Furthermore, since the model relies on the semantic meaning of each isolated word, words whose meaning varies according to context can't be relied on to define the meaning of a whole sentence.

## Regular Expressions

Regular expressions are patterns used to match character combinations in strings. By defining a number of boundaries to our pattern, we can verify if a string matches it or not.

In [18]:
import re
regexp = re.compile(r"^ab[\S]*cd$")

The regular expression above describes any string that starts with **ab**, continues with any number of non-whitespace characters and ends with **cd**.<br>
These expressions can be defined by combining various rules. A quick reference to those rules can be found [here](https://www.rexegg.com/regex-quickstart.html).

In [23]:
expressions = ["abcd", "abbacd", "cdabba", "ab cd", "xxabcdxx"]
matches = [bool(re.match(regexp, expression)) for expression in expressions]
print(matches)

[True, True, False, False, False]


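It's worth noting that the `re.match` method only looks for a match at the _beginning_ of a string; it does not require the match to reach the end (that is what `re.fullmatch` does). `regexp`, as it is defined above, forces the pattern to span the whole string through its `^` and `$` anchors.<br>
Let's define a new regular expression without the anchors and verify this.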

In [30]:
regexp = re.compile(r"ab[\S]*cd")

expressions = ["xxx abcd xxx", "xxabbacdxx", "xxabcdxx"]
matches = [bool(re.match(regexp, expression)) for expression in expressions]
print(matches)

[False, False, False]


However, the `re.search` method is capable of recognizing a pattern inside a string.

In [31]:
matches = [bool(re.search(regexp, expression)) for expression in expressions]
print(matches)

[True, True, True]
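
For a side-by-side comparison, the sketch below runs the unanchored pattern through `match`, `fullmatch` and `search` on three sample strings chosen for this note.

```python
import re

pattern = re.compile(r"ab\S*cd")

results = {}
for text in ["abcd", "abcdxx", "xxabcd"]:
    results[text] = (
        bool(pattern.match(text)),      # must match at the start
        bool(pattern.fullmatch(text)),  # must match the entire string
        bool(pattern.search(text)),     # may match anywhere
    )
    print(text, results[text])
```

Only "abcd" satisfies all three; "abcdxx" passes `match` but not `fullmatch`, and "xxabcd" is found by `search` alone.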


Regular expressions are useful in finding elements in text that always follow a certain formatting, such as phone numbers, document IDs, user tags (@username), etc.
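
As a small sketch of that use case, the pattern below extracts user tags from a sentence. The tag format (an `@` followed by letters, digits or underscores) and the sample text are assumptions made for this example; the lookbehind keeps e-mail addresses from being mistaken for tags.

```python
import re

# "@" not preceded by a word character, followed by one or more word characters.
tag_pattern = re.compile(r"(?<!\w)@\w+")

text = "Thanks @alice and @bob_92 for the review, write to carol@example.com"
tags = tag_pattern.findall(text)
print(tags)
```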

## Stemming and Lemmatization

These are techniques used to reduce the words in a text to their essential content. Removing the inflections of a word, for example, strips information that may not be relevant to a specific context, such as number, gender or verb tense.<br>
For example, reducing the word "books" to "book" or "reading" to "read" could help a certain algorithm recognize those words.<br>
<br>
The difference between stemming and lemmatization is that, while **stemming** simply cuts out pieces of a word, **lemmatization** uses a dictionary to match inflected words to their corresponding base.<br>
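
The contrast can be sketched with two toy functions. Neither is a real algorithm: the suffix list and the lookup table below are invented for this note, and `toy_stem` is far cruder than the Porter stemmer used further down.

```python
# Toy suffix stripper: cuts a known ending if at least three letters remain.
SUFFIXES = ("ing", "es", "s")

def toy_stem(word):
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

# Toy dictionary lookup: maps inflected forms to their base form.
LEMMAS = {"books": "book", "reading": "read", "was": "be", "mice": "mouse"}

def toy_lemmatize(word):
    return LEMMAS.get(word, word)

for word in ["books", "reading", "was", "mice"]:
    print(word, toy_stem(word), toy_lemmatize(word))
```

Cutting suffixes handles "books" and "reading" but leaves irregular forms such as "was" and "mice" untouched, whereas the dictionary lookup resolves them at the cost of needing an entry for every form.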

In [34]:
import nltk

In [48]:
# Downloading NLTK content packages
# nltk.download("punkt")  # tokenizer data for word_tokenize ("punkt_tab" on recent NLTK releases)
# nltk.download("wordnet")
# nltk.download("stopwords")

In [39]:
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

NLTK stands for Natural Language Toolkit, and has a variety of methods useful in NLP.

In [63]:
phrase = "joining the dots"
words = word_tokenize(phrase)
print(words)

['joining', 'the', 'dots']


In [65]:
stemmed_words = [stemmer.stem(word) for word in words]
print(stemmed_words)

['join', 'the', 'dot']


In the example below, WordNet is the corpus used for lemmatizing words, with NLTK establishing the mapping from an inflected word to its simplest form.

In [66]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

In [67]:
phrase = "joining the dots"
words = word_tokenize(phrase)

lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
print(lemmatized_words)

['joining', 'the', 'dot']


Note that the lemmatizer above expects a part of speech for each word. By default, every word is assumed to be a noun.<br>
For "joining the dots", "joining" is treated as a noun instead of a verb, so it is left unchanged. To get around this, the `lemmatize` method accepts a part of speech argument.

In [68]:
lemmatized_words = [lemmatizer.lemmatize(word, pos='v') for word in words]
print(lemmatized_words)

['join', 'the', 'dot']


Now, both "joining" and "dots" are treated as verbs, since the part of speech (`pos`) specified is verb (`'v'`) for every word.<br>
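
Using the same `pos` for every word is a blunt instrument. One common workaround, which is not part of the class notes, is to tag each word first and translate the tag into the single-letter code that `lemmatize` accepts. The helper `penn_to_wordnet` below is a hypothetical name, and the Penn Treebank tags are written by hand for illustration; a part-of-speech tagger would normally supply them.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet part-of-speech letter."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # nouns and everything else, matching the lemmatizer default

# Hand-written tags for illustration only.
tagged_words = [("joining", "VBG"), ("the", "DT"), ("dots", "NNS")]

pos_per_word = [(word, penn_to_wordnet(tag)) for word, tag in tagged_words]
print(pos_per_word)
# Each pair could then be passed on as lemmatizer.lemmatize(word, pos=pos).
```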

## Stopwords removal

This technique consists of removing common words that don't play an essential role in semantics.<br>
It can help reduce the "meaning noise" in long sentences while using the Word Vectors model.

In [69]:
from nltk.corpus import stopwords

In [76]:
stopwords = stopwords.words('english')
print(stopwords)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [84]:
phrase = "Songs that we will not get sued for but at the end of the day it all goes away anyway"
words = word_tokenize(phrase)

stripped_phrase = []
for word in words:
    if word not in stopwords:
        stripped_phrase.append(word)

print(" ".join(stripped_phrase))

Songs get sued end day goes away anyway
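
One detail to keep in mind is that the stopword list is lowercase, so a capitalized stopword at the start of a sentence survives a case-sensitive comparison. The sketch below uses a small hand-picked stopword set and sample phrase, both assumptions for this example, to show the difference.

```python
# Small hand-picked stopword set for illustration (not the full NLTK list).
stop_words = {"the", "of", "it", "all", "at", "but", "for", "not", "we", "will", "that"}

phrase = "The end of the day"
words = phrase.split()

# Case-sensitive comparison keeps the capitalized "The".
case_sensitive = [w for w in words if w not in stop_words]

# Lowercasing each word before the lookup removes it as well.
case_insensitive = [w for w in words if w.lower() not in stop_words]

print(case_sensitive)
print(case_insensitive)
```

Storing the stopwords in a `set` also keeps each lookup fast when the text is long.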


___

## Other techniques

### TextBlob

TextBlob is a library that contains various utilities for processing textual data, such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation and more.

In [87]:
from textblob import TextBlob

In [90]:
phrase = TextBlob("The quick brwon fox jumps over the lazy dog")

The `correct` method, for example, corrects misspelled words in a sentence.

In [95]:
phrase = phrase.correct()
print(phrase)

The quick brown fox jumps over the lazy dog


The `tags` property assigns part of speech tags to each word.

In [94]:
phrase.tags

[('The', 'DT'),
 ('quick', 'JJ'),
 ('brwon', 'NN'),
 ('fox', 'NN'),
 ('jumps', 'VBZ'),
 ('over', 'IN'),
 ('the', 'DT'),
 ('lazy', 'JJ'),
 ('dog', 'NN')]

The `sentiment` property captures sentiment in a sentence.

In [107]:
print(TextBlob("The evil brown fox jumps over the ugly dog").sentiment)
print(TextBlob("The beautiful brown fox jumps over the ellegant dog").sentiment)

Sentiment(polarity=-0.85, subjectivity=1.0)
Sentiment(polarity=0.85, subjectivity=1.0)
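
One plausible reading of these scores, assuming a lexicon-based analyzer that averages the polarity of the words it recognizes, is sketched below. The lexicon and its scores are invented for this note and `toy_polarity` is a hypothetical helper; this is not TextBlob's implementation.

```python
# Hypothetical polarity lexicon; the scores are invented for this sketch.
toy_lexicon = {"evil": -1.0, "ugly": -0.7, "beautiful": 0.85}

def toy_polarity(sentence):
    # Collect the scores of the words found in the lexicon.
    scores = [toy_lexicon[w] for w in sentence.lower().split() if w in toy_lexicon]
    # Sentences without any known word are treated as neutral.
    return sum(scores) / len(scores) if scores else 0.0

print(toy_polarity("The evil brown fox jumps over the ugly dog"))
print(toy_polarity("The beautiful brown fox jumps over the ellegant dog"))
```

Under this assumption, a word that is missing from the lexicon, such as the misspelled "ellegant", contributes nothing to the score.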
