# Week 9 Discussion

## Infographic

* [History of Infographics](http://infowetrust.com/scroll/)

## Links

* [Python's Regular Expression HOWTO](https://docs.python.org/3/howto/regex.html#regex-howto)
* [RegExr](https://regexr.com/) -- test regex online


## Regular Expressions

A _regular expression_, or regex, is a language for expressing patterns in strings.

Regular expressions are not Python-specific! They are also supported in R, Java, Perl, and many other languages.

In Python, regex is supported by the built-in [`re` module][re] as well as the [`regex` package][regex] and some of the Pandas [`.str` methods][pandas-str].

Regular expressions are __slow and brittle__. Use them as a last resort! Try to solve problems with [string methods][str] or [an appropriate parser][lxml] instead. There's even a [famous SO post][so-regex-html] about this.

[re]: https://docs.python.org/3/library/re.html
[regex]: https://pypi.python.org/pypi/regex/
[pandas-str]: https://pandas.pydata.org/pandas-docs/stable/text.html
[str]: https://docs.python.org/3/library/stdtypes.html#string-methods
[lxml]: http://lxml.de/
[so-regex-html]: https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454
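
As a small illustration of the advice above, the sketch below does the same toy tasks twice: once with string methods and once with `re`. The example string is made up; the point is only that the string-method versions are easier to read.

```python
import re

line = "apple,banana,cherry"  # made-up example data

# Splitting on a fixed separator: the string method is enough.
by_method = line.split(",")
by_regex = re.split(",", line)
print(by_method)  # ['apple', 'banana', 'cherry']
print(by_regex)   # ['apple', 'banana', 'cherry']

# Replacing fixed text: again, no regex needed.
print(line.replace(",", ";"))   # apple;banana;cherry
print(re.sub(",", ";", line))   # apple;banana;cherry
```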

### Metacharacters

A regular expression can describe a complicated pattern in just a few characters because some non-alphabet characters have special meanings.

These special characters are called metacharacters.

Metacharacter | Meaning
------------- | -------
`.`           | any 1 character (a _wildcard_)
`[ ]`         | any 1 character listed (a _set_)
`[^ ]`        | any 1 character except those listed (a _set complement_)
`^`           | start of string
`$`           | end of string

The `re.findall()` function returns a list of each part of the string that matched the pattern. Experimenting with `re.findall()` is a good way to learn regular expressions.

In [2]:
import re

# re.findall(PATTERN, STRING)
re.findall(".", "xyz")

['x', 'y', 'z']

Square brackets `[ ]` mark a set. A set matches any 1 of the symbols inside.

In [7]:
re.findall("[abc]", "caxyb")

['c', 'a', 'b']

Putting `^` at the beginning of a set matches the _set complement_ (any character except the ones in the set):

In [9]:
re.findall("[^12]", "31214")

['3', '4']

Sets can contain ranges of letters or numbers:

In [13]:
re.findall("[h-z0-9]", "0abcdefghi")

['0', 'h', 'i']

To include a literal dash `-` in a set, put it first or last so it isn't read as part of a range:

In [15]:
re.findall("[a-z]", "-")

[]

In [16]:
re.findall("[a-z-]", "-")

['-']

Use `^` and `$` to mark the start and end of the string, respectively:

In [18]:
re.findall("^a", "aba")

['a']
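
The cell above only exercises `^`. A minimal sketch of `$`, and of both anchors together, might look like this (the test strings are arbitrary):

```python
import re

# "^a" only matches an "a" at the very start of the string.
start = re.search("^a", "aba")
# "a$" only matches an "a" at the very end of the string.
end = re.search("a$", "aba")

print(start.span())  # (0, 1)
print(end.span())    # (2, 3)

# The "b" is in the middle, so neither anchor lets it match.
print(re.findall("^b", "aba"))  # []
print(re.findall("b$", "aba"))  # []

# Both anchors together force the pattern to cover the whole string.
print(re.findall("^aba$", "aba"))  # ['aba']
print(re.findall("^ab$", "aba"))   # []
```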

### Escape Sequences

The backslash `\` has a special meaning in Python strings: it marks the beginning of an escape sequence.

Escape sequences let us write characters that are awkward or impossible to type directly in a string. For instance, a newline is `\n`:

In [3]:
print("Hello Goodbye")

Hello Goodbye


In [5]:
print("Hello\nGoodbye")

Hello
Goodbye


Because backslash marks the beginning of an escape sequence, if we want to write a literal backslash, we have to write `\\`:

In [6]:
print("\\")

\


This is inconvenient since regex uses backslash to make a metacharacter into a literal character.

So if we wanted to match a literal `.`, we'd have to write:

In [21]:
pattern = "\\."
print(pattern)
re.findall(pattern, "x.y")

\.


['.']

Even worse, to match a literal backslash, we'd have to write `"\\\\"`!

Python provides _raw strings_ to fix this problem. In a raw string, backslash has no special meaning for Python (but it still has a special meaning for regex).

To make a raw string, put the letter `r` before the quotes:

In [30]:
pattern = r"\."
print(pattern)
re.findall(pattern, "x.y")

\.


['.']
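
To make the difference concrete, here is a short sketch comparing ordinary and raw strings, including the literal-backslash case mentioned earlier. The test string `a\b` is arbitrary.

```python
import re

# A raw string keeps the backslash, so r"\n" is two characters, not a newline.
print(len("\n"))   # 1
print(len(r"\n"))  # 2

# These are the same 2-character pattern: an escaped backslash for regex.
print("\\\\" == r"\\")  # True

# Match the literal backslash in the 3-character string a\b.
matches = re.findall(r"\\", "a\\b")
print(len(matches))  # 1
```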

### Quantifiers

You can describe repeated characters with quantifiers:

Metacharacter | Meaning
------------- | -------
`*`           | repeat previous character 0 or more times
`+`           | repeat previous character 1 or more times
`?`           | repeat previous character 0 or 1 times


In [32]:
re.findall("a", "aaa")

['a', 'a', 'a']

In [31]:
re.findall("a+", "aaa")

['aaa']

In [33]:
re.findall("xy?z", "xz")

['xz']

In [34]:
re.findall("xy?z", "xyz")

['xyz']
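
The cells above cover `+` and `?`. A few more cases, sketched with arbitrary test strings, show `*`, a quantifier applied to a set, and the fact that quantifiers match as much as they can:

```python
import re

# "b*" allows zero or more b's after the a.
star = re.findall("ab*", "a ab abbb")
print(star)  # ['a', 'ab', 'abbb']

# A quantifier can follow a set: one or more digits in a row.
digits = re.findall("[0-9]+", "room 101, floor 3")
print(digits)  # ['101', '3']

# Quantifiers are greedy: ".*" runs to the last "c", not the first.
greedy = re.findall("a.*c", "abcabc")
print(greedy)  # ['abcabc']
```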

### Groups

You can make a group with parentheses `( )`.

Groups can be repeated just like single characters.

When a pattern contains groups, `re.findall()` returns only the text matched by the groups rather than the whole match, so instead we use `re.search()` here:

In [43]:
re.search("x(abc)+y", "xabcabcabcy")

<_sre.SRE_Match object; span=(0, 11), match='xabcabcabcy'>
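
The sketch below shows how `re.findall()` treats groups, using the same pattern as above. It also uses a non-capturing group `(?: )`, which is not covered in these notes but is part of the standard `re` syntax.

```python
import re

# With a group in the pattern, findall() returns only the group's text
# (here, the last repetition of the group), not the whole match.
captured = re.findall("x(abc)+y", "xabcabcabcy")
print(captured)  # ['abc']

# A non-capturing group (?: ) still groups for repetition, but findall()
# goes back to returning the whole match.
whole = re.findall("x(?:abc)+y", "xabcabcabcy")
print(whole)  # ['xabcabcabcy']

# With search(), group(0) is the whole match and group(1) is the first group.
m = re.search("x(abc)+y", "xabcabcabcy")
print(m.group(0))  # xabcabcabcy
print(m.group(1))  # abc
```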

## Natural Language Processing (NLP)

Basic NLP workflow:

1. __Tokenize__ -- split text into words
2. __Remove Noise__ (optional) -- remove stop words, convert words to lemmas, correct spelling, ...
3. __Vectorize__ -- compute term frequencies, tf-idfs, or some other statistic
4. __Analyze__

This workflow is the same regardless of what language you're using.
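
To show that the steps do not depend on any particular package, here is a rough sketch of the workflow using only the standard library. The regex tokenizer and the two-word stop word list are simplifications made up for this example; a real project would use one of the packages below.

```python
import re
from collections import Counter

text = "The quick brown foxes jumped lightly over the lazy dog."

# 1. Tokenize: lowercase the text and pull out runs of letters.
tokens = re.findall("[a-z]+", text.lower())

# 2. Remove noise: drop words from a tiny, hand-picked stop word list.
stop_words = {"the", "over"}  # made up for this example
tokens = [t for t in tokens if t not in stop_words]

# 3. Vectorize: count how many times each remaining word appears.
counts = Counter(tokens)

# 4. Analyze: here, just inspect the result.
print(tokens)
print(counts["foxes"], counts["fox"])  # 1 0
```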

### In Python

Python has lots of packages for natural language processing:

* [TextBlob][] is the most Pythonic NLP package. Good for learning and prototyping.
* [NLTK][] is the most comprehensive NLP package. Good for _learning_ NLP, but runs slowly.
* [spaCy][] is the fastest NLP package.
* [Stanford's Core NLP][CoreNLP] is the cutting edge of NLP research. It's written in Java, but several Python packages provide an interface. The [pynlp][] and [stanford-corenlp][] packages look promising.
* [SyntaxNet][] is Google's NLP package. Good if you're already using TensorFlow.
* [Pattern][] combines web scraping and NLP.
* [gensim][] is good for topic modelling (a specific NLP method).

Nick recommends __TextBlob__ with __NLTK__ for beginners, and __spaCy__ or __Core NLP__ for serious projects.

[TextBlob]: http://textblob.readthedocs.io/en/dev/
[NLTK]: https://www.nltk.org/
[spaCy]: https://spacy.io/
[CoreNLP]: https://stanfordnlp.github.io/CoreNLP/
[pynlp]: https://github.com/sina-al/pynlp
[stanford-corenlp]: https://github.com/Lynten/stanford-corenlp
[SyntaxNet]: https://github.com/tensorflow/models/tree/master/research/syntaxnet
[Pattern]: https://www.clips.uantwerpen.be/pattern
[gensim]: https://radimrehurek.com/gensim/

In [74]:
# Set up NLTK packages used by TextBlob.
import nltk

nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")
nltk.download("brown")
nltk.download("wordnet")
nltk.download("stopwords")

[nltk_data] Downloading package punkt to /home/nick/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/nick/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package brown to /home/nick/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package wordnet to /home/nick/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /home/nick/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [137]:
from textblob import TextBlob

text = "The quick brown foxes jumped lightly over the lazy dog."

blob = TextBlob(text)

In [106]:
blob.words

WordList(['The', 'quick', 'brown', 'foxes', 'jumped', 'lightly', 'over', 'the', 'lazy', 'dog'])

### Removing Noise

Sometimes it's useful to remove words that appear frequently, like "the". These are called _stopwords_.

Removing stopwords is something you need to think about carefully. Does it make sense for the problem you're trying to solve?

Also, the NLTK stopwords are not necessarily the right stopwords for your problem.

In [139]:
from nltk.corpus import stopwords

# Use a new name so the imported `stopwords` corpus isn't overwritten.
stop_words = stopwords.words("english")

new_text = " ".join(w for w in blob.words if w.lower() not in stop_words)
blob = TextBlob(new_text)
blob

TextBlob("quick brown foxes jumped lightly lazy dog")

In addition to removing stopwords, you might want to _lemmatize_ the words.

Lemmatization converts words to a single inflection. After lemmatizing, plurals and verb tenses are eliminated.

A related procedure, _stemming_, also converts words to a single inflection, but uses programmatic rules rather than a dictionary. Stemming is simpler but less accurate.
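
To give a feel for what "programmatic rules" means, here is a toy suffix-stripping stemmer. It is a deliberately crude sketch written for these notes, not the Porter algorithm or any stemmer shipped with NLTK; the suffix list and the minimum stem length of 3 are arbitrary choices.

```python
# Suffixes to strip, checked in this order (an arbitrary, made-up list).
SUFFIXES = ("ing", "es", "ed", "ly", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["foxes", "jumped", "lightly", "dogs", "houses", "geese"]:
    print(w, "->", toy_stem(w))
# foxes -> fox
# jumped -> jump
# lightly -> light
# dogs -> dog
# houses -> hous
# geese -> geese
```

The last two results show the trade-off: the rules turn "houses" into the non-word "hous" and cannot handle the irregular plural "geese", which a dictionary-based lemmatizer is designed to handle.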

As an example, if we lemmatize all of the words, "foxes" becomes "fox":

In [108]:
blob.words.lemmatize()

WordList(['quick', 'brown', 'fox', 'jumped', 'lightly', 'lazy', 'dog'])

Why wasn't "jumped" converted to "jump"?

By default, NLTK's lemmatizer only looks for nouns. We need to tell it which words are verbs.

In [109]:
blob.words[3]

'jumped'

In [123]:
blob.words[3].lemmatize("v")

'jump'

Fortunately, TextBlob can detect parts of speech (POS):

In [129]:
blob.pos_tags

[('quick', 'JJ'),
 ('brown', 'NN'),
 ('foxes', 'NNS'),
 ('jumped', 'VBD'),
 ('lightly', 'RB'),
 ('lazy', 'JJ'),
 ('dog', 'NN')]

These are [Brown POS tags][brown], but the lemmatizer uses WordNet POS tags. Use this function to convert:

[brown]: https://en.wikipedia.org/wiki/Brown_Corpus#Part-of-speech_tags_used

In [142]:
from nltk.corpus import wordnet

def wordnet_pos(tag):
    """Map a Brown POS tag to a WordNet POS tag."""
    
    table = {"N": wordnet.NOUN, "V": wordnet.VERB, "R": wordnet.ADV, "J": wordnet.ADJ}
    
    # Default to a noun.
    return table.get(tag[0], wordnet.NOUN)

Now we can get the tags and use them to lemmatize:

In [133]:
tags = [wordnet_pos(x[1]) for x in blob.pos_tags]
tags

['a', 'n', 'n', 'v', 'r', 'a', 'n']

In [141]:
new_text = " ".join(x.lemmatize(t) for x, t in zip(blob.words, tags))
blob = TextBlob(new_text)
blob

TextBlob("quick brown fox jump lightly lazy dog")

### Term Frequencies

How can we quantify a text document so we can do statistics?

In a _bag of words_ model, we assume that the order of the words in a document doesn't matter.

Then we can quantify the document by counting how many times each word appears. This is called a _word vector_.

Surprisingly, the bag of words model works well for many tasks, even though it throws away word order.


In [66]:
blob.word_counts

defaultdict(int,
            {'brown': 1,
             'dog': 1,
             'fox': 1,
             'jumped': 1,
             'lazy': 1,
             'over': 1,
             'quick': 1,
             'the': 2})

We can check how similar two documents are by computing the dot product of their word vectors.

The more words they have in common, the larger the dot product will be.
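
As a minimal sketch of this idea, the example below builds word vectors from raw counts with the standard library and takes their dot product. Splitting on whitespace is a crude stand-in for a real tokenizer, and the third document is made up to show the no-overlap case.

```python
from collections import Counter

def word_vector(doc):
    """Count words, splitting on whitespace (a deliberately crude tokenizer)."""
    return Counter(doc.lower().split())

def dot(u, v):
    """Dot product of two word vectors stored as dict-like counters."""
    return sum(u[w] * v[w] for w in u if w in v)

a = word_vector("the quick brown fox")
b = word_vector("the fast brown fox")
c = word_vector("ducks lay eggs")

print(dot(a, b))  # 3 (shared words: the, brown, fox)
print(dot(a, c))  # 0 (no shared words)
```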

In [None]:
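docs = ["the quick brown fox", "the fast brown fox"]

def term_frequencies(doc):
    """Map each word in a TextBlob to its relative frequency."""
    n_words = len(doc.words)
    return {w: count / n_words for w, count in doc.word_counts.items()}

# Compute the term frequencies for each document.
tfs = [term_frequencies(TextBlob(doc)) for doc in docs]
tfs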

In [220]:
import pandas as pd

docs = [
    "The quick brown foxes jumped lightly over the lazy dog.",
    "Brown ducks lay eggs once a year and are omnivorous.",
    "The fleet brown foxes jumped over the dogs."
]

# Get word counts.
docs_df = pd.DataFrame(TextBlob(d).word_counts for d in docs).T
docs_df = docs_df.fillna(0)
docs_df.head()

Word  | 0   | 1   | 2
----- | --- | --- | ---
a     | 0.0 | 1.0 | 0.0
and   | 0.0 | 1.0 | 0.0
are   | 0.0 | 1.0 | 0.0
brown | 1.0 | 1.0 | 1.0
dog   | 1.0 | 0.0 | 0.0


In [204]:
# Compute frequencies.
tf = docs_df / docs_df.sum()
tf.head()

Word  | 0   | 1   | 2
----- | --- | --- | -----
a     | 0.0 | 0.1 | 0.0
and   | 0.0 | 0.1 | 0.0
are   | 0.0 | 0.1 | 0.0
brown | 0.1 | 0.1 | 0.125
dog   | 0.1 | 0.0 | 0.0


In [217]:
sum(tf[0] * tf[1])

0.010000000000000002

In [218]:
sum(tf[0] * tf[2])

0.1

One problem with term frequencies is that some terms have high frequencies simply because they appear frequently in the language. These words can cause a high similarity score for documents that are otherwise different.

_Term frequency-inverse document frequency_ (tf-idf) statistics solve this problem. There are several different tf-idf statistics.

The _smoothed term frequency-inverse document frequency_ (smoothed tf-idf), for a term $t$ and document $d$, is given by
$$
\operatorname{tf-idf}(t, d) = \operatorname{tf}(t, d) \cdot \log \left( \frac{N}{1 + n_t} \right)
$$
where $N$ is the total number of documents and $n_t$ is the number of documents that contain $t$.
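
The formula can be computed directly with the standard library. The sketch below applies it to three short example documents; the regex tokenizer and the helper names (`tokenize`, `doc_freq`, `tf_idf`) are choices made for this example, not part of any package.

```python
import math
import re
from collections import Counter

docs = [
    "The quick brown foxes jumped lightly over the lazy dog.",
    "Brown ducks lay eggs once a year and are omnivorous.",
    "The fleet brown foxes jumped over the dogs.",
]

def tokenize(doc):
    """Lowercase the text and pull out runs of letters."""
    return re.findall("[a-z]+", doc.lower())

tokenized = [tokenize(d) for d in docs]
N = len(tokenized)

# n_t: the number of documents that contain each term.
doc_freq = Counter(t for tokens in tokenized for t in set(tokens))

def tf_idf(term, tokens):
    """Smoothed tf-idf of a term in one tokenized document."""
    tf = tokens.count(term) / len(tokens)
    idf = math.log(N / (1 + doc_freq[term]))
    return tf * idf

for term in ["quick", "the", "brown"]:
    print(term, round(tf_idf(term, tokenized[0]), 3))
# quick 0.041
# the 0.0
# brown -0.029
```

With only three documents, this particular formula gives a weight of zero to "the" (in two documents) and a negative weight to "brown" (in all three), which is one reason several tf-idf variants exist.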

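In practice, it's easiest to use a package to compute tf-idf. Note that scikit-learn's `TfidfVectorizer` uses a slightly different smoothed formula than the one above, and by default scales each document's vector to unit length, so the dot products below are cosine similarities: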

In [219]:
from sklearn.feature_extraction.text import TfidfVectorizer

tf_idf = TfidfVectorizer().fit_transform(docs)

# Use .toarray() to display a sparse matrix as a dense array.
(tf_idf @ tf_idf.T).toarray()

array([[1.        , 0.04166004, 0.59997135],
       [0.04166004, 1.        , 0.04772968],
       [0.59997135, 0.04772968, 1.        ]])