In [2]:
import re
import nltk

from nltk.tokenize import RegexpTokenizer
from nltk.stem.snowball import SnowballStemmer
from nltk.corpus import stopwords
from nltk.util import ngrams

# BLU07 - Feature Extraction

### Regular Expressions (aka Regex)

Regular expressions are sequences of characters that define search patterns. They follow a small set of rules and are one of the most fundamental tools for working with textual data.

We use Python's [`re` library](https://docs.python.org/3/library/re.html). With `search()` we can look for a pattern in a text. The function returns a `Match` object, from which we can obtain the portion of the text that matched our pattern, or `None` if nothing matched.

#### Cheatsheet [\[1\]](https://regexr.com/3lvai)

`.` - matches any character, except newline.

`\d`, `\s`, `\S` - match a digit, a whitespace character, a non-whitespace character.

`\b`, `\B` - match a word boundary, a position that is not a word boundary.

`[xyz]` - matches x, y or z.

`[^xyz]` - matches anything that is not x, y or z.

`[x-z]` - matches a character between x and z.

`^xyz$` - `^` is the start of the string, `$` is the end of the string.

`\.` - use escaping to match special characters.

`\t`, `\n` - matches tab and newline.

`x*` - matches 0 or more symbols x.

`x+` - matches 1 or more symbols x.

`x?` - matches 0 or 1 symbol x.

`*?`, `+?`, `??`, etc. - non-greedy (lazy) versions of the quantifiers, which match as few characters as possible.

`x{5}` - matches exactly 5 symbols x.

`x{5,}` - matches 5 or more symbols x.

`x{5,8}` - matches between 5 and 8 symbols x (no space after the comma, otherwise Python treats the braces as literal text).

`xy|yz` - matches `xy` or `yz`.
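
Here are a few of the patterns above in action. This is a minimal sketch: the `sample` string is made up for illustration and is not part of the original material.

```python
import re

# Hypothetical sample text, chosen only to exercise the cheatsheet patterns.
sample = "Order 66 shipped on 2024-01-15 to <b>Lisbon</b> and <b>Oslo</b>."

# \d+ : one or more digits in a row
digits = re.findall(r"\d+", sample)
print(digits)        # ['66', '2024', '01', '15']

# \d{4} : exactly four digits in a row
years = re.findall(r"\d{4}", sample)
print(years)         # ['2024']

# [A-Z][a-z]+ : one uppercase letter followed by one or more lowercase letters
capitalised = re.findall(r"[A-Z][a-z]+", sample)
print(capitalised)   # ['Order', 'Lisbon', 'Oslo']

# Greedy .* grabs as much as possible; non-greedy .*? stops at the first </b>
greedy = re.findall(r"<b>.*</b>", sample)
lazy = re.findall(r"<b>.*?</b>", sample)
print(greedy)        # ['<b>Lisbon</b> and <b>Oslo</b>']
print(lazy)          # ['<b>Lisbon</b>', '<b>Oslo</b>']

# xy|yz : alternation between two literal patterns
cities = re.findall(r"Lisbon|Oslo", sample)
print(cities)        # ['Lisbon', 'Oslo']
```

Writing patterns as raw strings (`r"..."`) keeps Python from interpreting backslashes before the `re` module sees them.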

## Functions

[**`re.search(pattern, string)`**](https://docs.python.org/3/library/re.html#re.search)  

Scan through string looking for the **first location** where the regular expression pattern produces a match, and return a corresponding [match object](https://docs.python.org/3/library/re.html#match-objects).

In [8]:
text = "Lisbon Madrid Lisbon Toulose Oslo Lisbona"

print("Looking for \"Madrid\":")
match = re.search("Madrid", text)
print(match)

print("\nLooking for \"Rome\":")
match = re.search("Rome", text)
print(match)

print("\nLooking for \"Lisbon\":")
match = re.search("Lisbon", text)
print(match) 

Looking for "Madrid":
<re.Match object; span=(7, 13), match='Madrid'>

Looking for "Rome":
None

Looking for "Lisbon":
<re.Match object; span=(0, 6), match='Lisbon'>
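
To get the matched text and its position out of a `Match` object, `group()`, `start()`, `end()` and `span()` can be used. The sketch below reuses the same `text`. Since `search()` returns `None` when nothing matches, it checks for that before calling any method.

```python
import re

text = "Lisbon Madrid Lisbon Toulose Oslo Lisbona"

match = re.search("Madrid", text)

# search() returns None when there is no match, so guard before using it.
if match is not None:
    matched_text = match.group()   # the substring that matched
    start, end = match.span()      # (start, end) indices into text
    print(matched_text)            # Madrid
    print(start, end)              # 7 13
    print(text[start:end])         # Madrid

# A pattern that is absent gives None rather than raising an error.
missing = re.search("Rome", text)
print(missing is None)             # True
```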


[**`re.findall(pattern, string)`**](https://docs.python.org/3/library/re.html?highlight=findall#re.findall)

If we want to **return all the matches** of our pattern in a given text, we can use the function `findall()`. In this case, the matched portions of the text are returned as a list of strings, instead of `Match` objects.

In [9]:
pattern = "Lisbon"
re.findall(pattern, text)

['Lisbon', 'Lisbon', 'Lisbon']

In [4]:
pattern = "Lisbon"

for match in re.findall(pattern, text):
    print(match)

Lisbon
Lisbon
Lisbon
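
One detail worth knowing: when the pattern contains capture groups (parentheses), `findall()` returns the contents of the groups rather than the whole match. This is a small sketch. The `drivers` string is an illustrative assumption, not data from the original cells.

```python
import re

# Hypothetical input used only to illustrate capture groups.
drivers = "Lotterer Rebellion, Kobayashi Toyota"

# Two groups: findall returns a list of tuples, one tuple per match.
pairs = re.findall(r"(\w+) (\w+)", drivers)
print(pairs)   # [('Lotterer', 'Rebellion'), ('Kobayashi', 'Toyota')]

# No groups: findall returns the full matched strings.
whole = re.findall(r"\w+ \w+", drivers)
print(whole)   # ['Lotterer Rebellion', 'Kobayashi Toyota']
```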


[**`re.finditer(pattern, string)`**](https://docs.python.org/3/library/re.html?highlight=finditer#re.finditer)

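If we need the `Match` objects themselves (for example, to know where each match starts and ends), `finditer()` should be used instead. It returns an iterator that yields one `Match` object per match.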

In [5]:
pattern = "Lisbon"

for match in re.finditer(pattern, text):
    print(match)

<re.Match object; span=(0, 6), match='Lisbon'>
<re.Match object; span=(14, 20), match='Lisbon'>
<re.Match object; span=(34, 40), match='Lisbon'>
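
Note that the third match above is the first six letters of "Lisbona". If only the standalone word is wanted, one option is to add a word boundary `\b` after the pattern. The sketch below collects the start position and text of each match with `finditer()`.

```python
import re

text = "Lisbon Madrid Lisbon Toulose Oslo Lisbona"

# \b requires a word boundary after "Lisbon", so "Lisbona" no longer matches.
positions = [(m.start(), m.group()) for m in re.finditer(r"Lisbon\b", text)]
print(positions)   # [(0, 'Lisbon'), (14, 'Lisbon')]
```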


[**`re.MULTILINE`**](https://docs.python.org/3/library/re.html#re.MULTILINE)

When specified, the pattern character `^` matches at the beginning of the string and at the beginning of each line (immediately following each newline); and the pattern character `$` matches at the end of the string and at the end of each line (immediately preceding each newline).

In [6]:
text="Lotterer Rebellion\nJani rebellion\nSenna Rebellion\nconway toyota\nKobayashi Toyota\nLopez Toyota\nbuemi Toyota\nNakajima toyota\nalonso Toyota"
re.findall("^[A-Z][a-z]+", text)

['Lotterer']

In [7]:
text="Lotterer Rebellion\nJani rebellion\nSenna Rebellion\nconway toyota\nKobayashi Toyota\nLopez Toyota\nbuemi Toyota\nNakajima toyota\nalonso Toyota"
re.findall("^[A-Z][a-z]+", text, re.MULTILINE)

['Lotterer', 'Jani', 'Senna', 'Kobayashi', 'Lopez', 'Nakajima']
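
The same flag changes the behaviour of `$`. As a short sketch, using a shortened version of the text above (an assumption made to keep the output small), the pattern below picks the last word of the string without the flag and the last word of every line with it.

```python
import re

# Shortened version of the example text, four lines only.
text = "Lotterer Rebellion\nJani rebellion\nSenna Rebellion\nconway toyota"

# Without re.MULTILINE, $ only matches at the end of the whole string.
last_word = re.findall(r"[A-Za-z]+$", text)
print(last_word)    # ['toyota']

# With re.MULTILINE, $ also matches just before each newline.
last_words = re.findall(r"[A-Za-z]+$", text, re.MULTILINE)
print(last_words)   # ['Rebellion', 'rebellion', 'Rebellion', 'toyota']
```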

## Tokenizer

One important step when dealing with text data is to _tokenize_ it. In practice, this means splitting the strings of a corpus into substrings (tokens). This matters because most natural language processing tools operate on tokens rather than on raw strings. For instance, the sentence

_"The car went too fast on the second lap. This damaged the tyres."_ ,

is better handled as a list,

_["The", "car", "went", "too", "fast", "on", "the", "second", "lap", ".", "This", "damaged", "the", "tyres", "."]_ .
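
As a rough illustration, this split can be reproduced with the `re` module alone. This is a minimal sketch, not NLTK's implementation. The pattern is an assumption chosen for this sentence: it treats runs of word characters as tokens and each punctuation mark as its own token.

```python
import re

sentence = "The car went too fast on the second lap. This damaged the tyres."

# \w+      : a run of letters, digits or underscores (a word)
# [^\w\s]  : a single character that is neither a word character nor whitespace
tokens = re.findall(r"\w+|[^\w\s]", sentence)
print(tokens)
# ['The', 'car', 'went', 'too', 'fast', 'on', 'the', 'second', 'lap', '.',
#  'This', 'damaged', 'the', 'tyres', '.']
```

A pattern this simple will split contractions such as "don't" into three tokens, which is one reason to rely on a dedicated tokenizer for real text.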

We will be using the tokenizer implementations from [NLTK](https://www.nltk.org/_modules/nltk/tokenize/regexp.html).