In [31]:
import re

import numpy as np
from numpy import dot
from numpy.linalg import norm

import nltk

import spacy

from nltk.tokenize import RegexpTokenizer
from nltk.tokenize import BlanklineTokenizer
from nltk.tokenize import WordPunctTokenizer
from nltk.tokenize import WhitespaceTokenizer


from nltk.stem.snowball import SnowballStemmer
from nltk.corpus import stopwords
from nltk.util import ngrams

# Questions  

*  Look into the `str` built-in functions.

```python
def vectorize():
    vocabulary = build_vocabulary()
    vectors = []
    for doc in docs:
        words = doc.split()  # split into tokens so that only whole words are counted
        vector = np.array([words.count(word) for word in vocabulary])
        vectors.append(vector)

    return vectors
```
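A minimal sketch of why the `doc.split()` step matters when counting vocabulary words (the sample document below is made up for illustration): `str.count` counts substring occurrences, whereas `list.count` on the split tokens counts whole words only.

```python
# Hypothetical one-document corpus, for illustration only
doc = "the cat sat on the mat"
words = doc.split()

# str.count counts substrings: "at" is found inside "cat", "sat" and "mat"
substring_count = doc.count("at")

# list.count counts whole tokens: no token is exactly "at"
token_count = words.count("at")

print(substring_count, token_count)
```

For a word such as `the`, both approaches happen to agree here, which is why the difference is easy to miss.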

**BLU07 - Part 3 of 3**

Encoding the labels is not necessary.

```python
# Encode the labels
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit(train_df['sentiment'].values)

train_df['sentiment'] = le.transform(train_df['sentiment'].values)
validation_df['sentiment'] = le.transform(validation_df['sentiment'].values)
```

# References  

\[1\] - [RegExr](https://regexr.com/3lvai)

\[2\] - [NLTK Book](https://www.nltk.org/book/)

\[3\] - [Deep Learning MIT Book](http://www.deeplearningbook.org/)


### Word of Advice  

Although we use the NLTK library throughout this BLU, other libraries are commonly used and may serve you better. Here are some to consider in your future NLP challenges:

- [spaCy](https://spacy.io/)
- [gensim](https://radimrehurek.com/gensim/)
- [Stanford CoreNLP](https://stanfordnlp.github.io/CoreNLP/other-languages.html#python)

# BLU07 - Feature Extraction

### Regular Expressions (aka Regex)

Regular expressions are sequences of characters that allow us to define search patterns. They follow a small set of syntax rules and are one of the most fundamental tools in computer science for working with textual data.

We use Python's [`re` library](https://docs.python.org/3/library/re.html). Using `search()` we can take a certain pattern and look for it in a text. This function returns a `Match` object, from which we can obtain the portion of the text that was matched by our pattern, or `None` if there is no match.
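As a minimal sketch of reading a `Match` object (the sample string and pattern are made up for illustration), `group()` returns the matched text and `span()` returns its position in the string:

```python
import re

# Hypothetical sample text, used only for this illustration
sample = "The race starts at 14:30 in Lisbon"

match = re.search(r"\d{2}:\d{2}", sample)

# search() returns None when nothing matches, so check before using the result
if match is not None:
    matched_text = match.group()  # the text portion matched by the pattern
    position = match.span()       # (start, end) indices of the match
    print(matched_text, position)
```

Slicing the original string with `sample[match.start():match.end()]` gives back the same text as `match.group()`.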

#### Cheatsheet [\[1\]](https://regexr.com/3lvai)

`.` - matches any character, except newline.

`\d`, `\s`, `\S` - match a digit, a whitespace character, a non-whitespace character.

`\b`, `\B` - match a word boundary, a position that is not a word boundary.

`[xyz]` - matches x, y or z.

`[^xyz]` - matches anything that is not x, y or z.

`[x-z]` - matches a character between x and z.

`^xyz$` - `^` is the start of the string, `$` is the end of the string.

`\.` - use escaping to match special characters.

`\t`, `\n` - matches tab and newline.

`x*` - matches 0 or more symbols x.

`x+` - matches 1 or more symbols x.

`x?` - matches 0 or 1 symbol x.

`??`, `*?`, `+?`, etc. - non-greedy versions of the quantifiers, which match as few characters as possible.

`x{5}` - matches exactly 5 symbols x.

`x{5,}` - matches 5 or more symbols x.

`x{5,8}` - matches between 5 and 8 symbols x (no space is allowed inside the braces).

`xy|yz` - matches `xy` or `yz`.
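A few of the cheatsheet entries can be tried out in a short sketch. The sample strings below are made up for illustration, and only `re.findall` from the standard library is used:

```python
import re

# Greedy vs. non-greedy: `.+` grabs as much as possible, `.+?` as little as possible
html = "<b>fast</b> and <b>slow</b>"
greedy = re.findall(r"<b>.+</b>", html)
lazy = re.findall(r"<b>.+?</b>", html)

# Counted repetition: runs of exactly three digits delimited by word boundaries
numbers = re.findall(r"\b\d{3}\b", "12 345 6789 012")

# Alternation and character sets
teams = re.findall(r"Toyota|Rebellion", "Lopez Toyota, Senna Rebellion")
vowels = re.findall(r"[aeiou]", "tyres")

print(greedy)
print(lazy)
print(numbers)
print(teams)
print(vowels)
```

The greedy pattern returns a single match spanning both tags, while the non-greedy one returns each tag pair separately.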

## Functions

[**`re.search(pattern, string)`**](https://docs.python.org/3/library/re.html#re.search)  

Scan through string looking for the **first location** where the regular expression pattern produces a match, and return a corresponding [match object](https://docs.python.org/3/library/re.html#match-objects).

In [2]:
text = "Lisbon Madrid Lisbon Toulose Oslo Lisbona"

print("Looking for \"Madrid\":")
match = re.search("Madrid", text)
print(match)

print("\nLooking for \"Rome\":")
match = re.search("Rome", text)
print(match)

print("\nLooking for \"Lisbon\":")
match = re.search("Lisbon", text)
print(match) 

Looking for "Madrid":
<re.Match object; span=(7, 13), match='Madrid'>

Looking for "Rome":
None

Looking for "Lisbon":
<re.Match object; span=(0, 6), match='Lisbon'>


[**`re.findall(pattern, string)`**](https://docs.python.org/3/library/re.html?highlight=findall#re.findall)

If we want to **return all the matches** of our pattern in a given text, we can use the function `findall()`. In this case, the matched portions of the text are returned as strings, instead of `Match` objects.

In [3]:
pattern = "Lisbon"

re.findall(pattern, text)

['Lisbon', 'Lisbon', 'Lisbon']

In [4]:
# pattern = "Lisbon"

for match in re.findall(pattern, text):
    print(match)

Lisbon
Lisbon
Lisbon


[**`re.finditer(pattern, string)`**](https://docs.python.org/3/library/re.html?highlight=finditer#re.finditer)

If we really want the `Match` objects (for example, to know where each match occurs), `finditer()` should be used instead.

In [5]:
pattern = "Lisbon"

for match in re.finditer(pattern, text):
    print(match)

<re.Match object; span=(0, 6), match='Lisbon'>
<re.Match object; span=(14, 20), match='Lisbon'>
<re.Match object; span=(34, 40), match='Lisbon'>


[**`re.MULTILINE`**](https://docs.python.org/3/library/re.html#re.MULTILINE)

When specified, the pattern character `^` matches at the beginning of the string and at the beginning of each line (immediately following each newline); and the pattern character `$` matches at the end of the string and at the end of each line (immediately preceding each newline).

In [6]:
text="Lotterer Rebellion\nJani rebellion\nSenna Rebellion\nconway toyota\nKobayashi Toyota\nLopez Toyota\nbuemi Toyota\nNakajima toyota\nalonso Toyota"
re.findall("^[A-Z][a-z]+", text)

['Lotterer']

In [7]:
text="Lotterer Rebellion\nJani rebellion\nSenna Rebellion\nconway toyota\nKobayashi Toyota\nLopez Toyota\nbuemi Toyota\nNakajima toyota\nalonso Toyota"
re.findall("^[A-Z][a-z]+", text, re.MULTILINE)

['Lotterer', 'Jani', 'Senna', 'Kobayashi', 'Lopez', 'Nakajima']

## Tokenizing

One important step when dealing with text data is to _tokenize_ it. In practice, this means **splitting the strings of a corpus into substrings**. For instance, the sentence

> _"The car went too fast on the second lap. This damaged the tyres."_

is better handled **as a list**,

> _["The", "car", "went", "too", "fast", "on", "the", "second", "lap", ".", "This", "damaged", "the", "tyres", "."]_ .
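As a rough sketch, the list above can be reproduced with a single regular expression. The pattern is our own choice for this example and not necessarily what any particular library uses:

```python
import re

sentence = "The car went too fast on the second lap. This damaged the tyres."

# \w+ keeps runs of word characters together;
# [^\w\s] emits each punctuation mark as a token of its own
tokens = re.findall(r"\w+|[^\w\s]", sentence)
print(tokens)
```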


We will start with the [NLTK](https://www.nltk.org/_modules/nltk/tokenize/regexp.html) implementations, although the simplest tokenizer of all is [`str.split()`](https://docs.python.org/3/library/stdtypes.html?highlight=str%20split#str.split).    

Later on, _tokenization_ will be handled automatically by the tools in [`sklearn.feature_extraction.text`](https://scikit-learn.org/stable/modules/feature_extraction.html).

In [8]:
text = "The car went too fast.This damaged the tyres... 456 $89.7.5345.3...3456 foram-se"

### [`str.split()`](https://docs.python.org/3/library/stdtypes.html?highlight=str%20split#str.split)

In [9]:
print(text.split())

['The', 'car', 'went', 'too', 'fast.This', 'damaged', 'the', 'tyres...', '456', '$89.7.5345.3...3456', 'foram-se']


### [NLTK Tokenizer](https://www.nltk.org/_modules/nltk/tokenize/regexp.html)

In [10]:
# Raw strings (r'...') stop Python from interpreting the backslashes itself
tokenizer = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')
#tokenizer = RegexpTokenizer(r'\w+|\$\d+\.?\d+|\S+')
#tokenizer = RegexpTokenizer(r'\S+')
tokens = tokenizer.tokenize(text)
print(tokens)

['The', 'car', 'went', 'too', 'fast', '.This', 'damaged', 'the', 'tyres', '...', '456', '$89.7.5345.3...3456', 'foram', '-se']


**The regular expression is used to define what enters the list and what is left behind.**
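A rough sketch of the same idea using only `re.findall`, which we assume selects tokens in the same way as `RegexpTokenizer` does in its default mode for these patterns. Changing the pattern changes what enters the list:

```python
import re

text = "The car went too fast.This damaged the tyres... 456 $89.7.5345.3...3456 foram-se"

# Only word characters are described, so punctuation and symbols are left behind
words_only = re.findall(r"\w+", text)

# Adding alternatives lets prices and leftover symbols enter the list too
with_symbols = re.findall(r"\w+|\$[\d\.]+|\S+", text)

print(words_only)
print(with_symbols)
```

With the first pattern the price is broken into separate numbers and the `$` disappears; with the second it survives as a single token.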

NLTK already provides some predefined tokenizers built on top of `RegexpTokenizer`:
- `BlanklineTokenizer` - tokenizes a string using blank lines as delimiters.
- `WordPunctTokenizer` - tokenizes a string into sequences of alphabetic and non-alphabetic characters.
- `WhitespaceTokenizer` - tokenizes a string using spaces, tabs and newlines as delimiters.

`from nltk.tokenize import BlanklineTokenizer`  
`from nltk.tokenize import WordPunctTokenizer`  
`from nltk.tokenize import WhitespaceTokenizer` 

In [11]:
print(WordPunctTokenizer().tokenize(text))

['The', 'car', 'went', 'too', 'fast', '.', 'This', 'damaged', 'the', 'tyres', '...', '456', '$', '89', '.', '7', '.', '5345', '.', '3', '...', '3456', 'foram', '-', 'se']


In [12]:
print(WhitespaceTokenizer().tokenize(text))

['The', 'car', 'went', 'too', 'fast.This', 'damaged', 'the', 'tyres...', '456', '$89.7.5345.3...3456', 'foram-se']


Notice that **`WordPunctTokenizer()` behaves _similarly_ to the first tokenizer we defined. This kind of tokenization is commonly used, and it is the default we will assume** whenever tokenization comes up from here on.

## Stemming

Stemming reduces words to their "root" form (the _stem_).  
We are going to use the NLTK implementation of the [snowball stemmer](https://www.nltk.org/api/nltk.stem.html#nltk.stem.snowball.SnowballStemmer).

In [13]:
text = 'We are counting the occurrences of tokens in each of the documents.'

#tokenizing
tokenizer = WordPunctTokenizer()
words = tokenizer.tokenize(text)

#stemming
stemmer = SnowballStemmer("english")
stems = list(map(stemmer.stem, words))
print(stems)

['we', 'are', 'count', 'the', 'occurr', 'of', 'token', 'in', 'each', 'of', 'the', 'document', '.']


An alternative process to stemming is **lemmatization**.  

Both processes share the goal of getting the root of the word [\[7\]](https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html), but they act differently. **Whereas stemming drops the suffix of words, lemmatization uses a dictionary to return the base form of words**, known as _lemma_.

Using the example in the cited reference, given the word _saw_, stemming might return just *s*, while lemmatization would take into account whether the word is used as a verb or as a noun and return _see_ or _saw_, respectively, as the base form.
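The contrast can be sketched with two toy functions. The suffix list, the lemma dictionary and the function names are all invented for this illustration; real stemmers and lemmatizers are far more elaborate:

```python
# Toy resources, made up for this example only
SUFFIXES = ("ing", "es", "s")
LEMMAS = {("saw", "VERB"): "see", ("saw", "NOUN"): "saw", ("went", "VERB"): "go"}

def toy_stem(word):
    """Strip the first matching suffix, without looking at meaning."""
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word, pos):
    """Look up the base form for a (word, part-of-speech) pair."""
    return LEMMAS.get((word, pos), word)

print(toy_stem("counting"), toy_stem("tokens"), toy_stem("went"))
print(toy_lemmatize("saw", "VERB"), toy_lemmatize("saw", "NOUN"), toy_lemmatize("went", "VERB"))
```

The stemmer cannot relate _went_ to _go_ because no suffix rule connects them, whereas the dictionary lookup can.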

### N-Grams

_n-grams_ correspond to sequences of n consecutive elements from a given sentence. Usually we refer to unigrams, bigrams, trigrams, four-grams, etc. according to the length of the sequence of elements.

For instance, for the sentence

`"The driver made a mistake"`,

we would have:

- unigrams: `The`, `driver`, `made`, `a`, `mistake`
- bigrams: `The driver`, `driver made`, `made a`, `a mistake`
- trigrams: `The driver made`, `driver made a`, `made a mistake`
- four-grams: `The driver made a`, `driver made a mistake`
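The lists above can be produced with a sliding window over the tokens. The helper `make_ngrams` below is a hypothetical name for a plain-Python sketch of that idea:

```python
def make_ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    # zip over n shifted copies of the list; zip stops at the shortest copy
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "The driver made a mistake".split()

bigrams = make_ngrams(tokens, 2)
trigrams = make_ngrams(tokens, 3)
print(bigrams)
print(trigrams)
```

A sentence with `k` tokens yields `k - n + 1` n-grams, so asking for n-grams longer than the sentence returns an empty list.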

In [14]:
print(list(ngrams(words, 1)))

[('We',), ('are',), ('counting',), ('the',), ('occurrences',), ('of',), ('tokens',), ('in',), ('each',), ('of',), ('the',), ('documents',), ('.',)]


In [15]:
print(list(ngrams(words, 2)))

[('We', 'are'), ('are', 'counting'), ('counting', 'the'), ('the', 'occurrences'), ('occurrences', 'of'), ('of', 'tokens'), ('tokens', 'in'), ('in', 'each'), ('each', 'of'), ('of', 'the'), ('the', 'documents'), ('documents', '.')]


## Feature Selection through statistical analysis

## Word vectors

`import spacy`

In [16]:
nlp = spacy.load('en_core_web_md')

In [20]:
token = nlp("house")
print("token vector is of type: {}".format(type(token.vector)))
print("token vector is of length: {}".format(len(token.vector)))

token vector is of type: <class 'numpy.ndarray'>
token vector is of length: 300


In [25]:
# Two different ways of getting the same vector
assert np.all(token.vector == nlp.vocab[token.text].vector)

In [21]:
doc = nlp("Some text is written here.")
print("doc vector is of type: {}".format(type(doc.vector)))
print("doc vector is of length: {}".format(len(doc.vector)))

doc vector is of type: <class 'numpy.ndarray'>
doc vector is of length: 300


In [39]:
doc = nlp("Give it back! He pleaded.")
assert len(doc) == 7

for word in doc:
    print(word.text)

Give
it
back
!
He
pleaded
.


We can define a simple function just to make it easier and avoid rewriting the same thing over and over again.

In [27]:
def vec(s):
    return nlp(s).vector

In [28]:
# cosine similarity
def cosine(v1, v2):
    if norm(v1) > 0 and norm(v2) > 0:
        return dot(v1, v2) / (norm(v1) * norm(v2))
    else:
        return 0.0

In [32]:
cosine(vec('house'), vec('home'))

0.7388624

In [33]:
cosine(vec('house'), vec('mouse'))

0.16095257
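To build some intuition for these scores without loading a model, here is a dependency-free restatement of the same formula on hand-made vectors. The helper name `cosine_plain` and the vectors are assumptions made for this sketch:

```python
import math

def cosine_plain(v1, v2):
    """Cosine similarity for plain sequences; 0.0 if either vector has zero length."""
    dot_product = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 > 0 and norm2 > 0:
        return dot_product / (norm1 * norm2)
    return 0.0

same_direction = cosine_plain([1, 2, 3], [2, 4, 6])  # close to 1
orthogonal = cosine_plain([1, 0], [0, 1])            # no shared direction
opposite = cosine_plain([1, 2], [-1, -2])            # close to -1
zero_vector = cosine_plain([0, 0], [1, 2])           # guarded case

print(same_direction, orthogonal, opposite, zero_vector)
```

The score depends only on the direction of the vectors, not on their length, which is why scaling a vector leaves its similarity to other vectors unchanged.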

Once again, to simplify our next examples, let's create a function that gets us the closest words to the vector that we are interested in:

In [34]:
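def spacy_closest(token_list, vec_to_check, n=10, dont_include_list=()):
    # Score every candidate token against the target vector, skipping excluded tokens
    similarity_list = [(x, cosine(vec_to_check, vec(x))) for x in token_list if x not in dont_include_list]

    # Highest similarity first; keep the top n
    return sorted(similarity_list, key=lambda x: x[1], reverse=True)[:n]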

There are several ways to build a representation for a complete sentence from the same vectors we have used for words. Averaging the word vectors is a good enough approach to start with.

In [35]:
def sentvec(s):
    sent = nlp(s)
    return np.mean(np.array([w.vector for w in sent]), axis=0)
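The averaging step can be seen in isolation on made-up 3-dimensional vectors. Plain lists are used so that the sketch runs without the spaCy model, and `toy_vectors` and `toy_sentvec` are hypothetical names:

```python
# Made-up 3-dimensional "word vectors", for illustration only
toy_vectors = {
    "fast": [1.0, 0.0, 2.0],
    "car": [3.0, 2.0, 0.0],
}

def toy_sentvec(sentence):
    """Average the vectors of the known words, one dimension at a time."""
    vectors = [toy_vectors[w] for w in sentence.split() if w in toy_vectors]
    # zip(*vectors) groups the values dimension by dimension
    return [sum(column) / len(column) for column in zip(*vectors)]

sentence_vector = toy_sentvec("fast car")
print(sentence_vector)
```

Each dimension of the result is the mean of that dimension across the words, so the sentence vector has the same length as the word vectors.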

In [38]:
# Two different ways of getting the same vector
vect1 = sentvec("i am against the trump administration .")
vect2 = nlp("i am against the trump administration .").vector
assert np.all(vect1 == vect2)

In [36]:
cosine(sentvec("i am against the trump administration ."), nlp("i am against the trump administration .").vector)

1.0