# STA 141B Data & Web Technologies for Data Analysis

### Lecture 12, 2/20/24, Natural language processing


### Today's topics
- Natural Language Processing
     - `nltk` package
     - Tokenization
     - Regular Expressions
     - Standardizing Text

### Resources
- [Natural Language Processing with Python][nlpp], chapters 1-3. Beware: the print version is for Python 2.
- [Scikit-Learn Documentation][skl], especially the section about [Text Feature Extraction](https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction)


[PDSH]: https://jakevdp.github.io/PythonDataScienceHandbook/
[ProGit]: https://git-scm.com/book/
[nlpp]: https://www.nltk.org/book/
[atap]: https://search.library.ucdavis.edu/primo-explore/fulldisplay?docid=01UCD_ALMA51320822340003126&context=L&vid=01UCD_V1&search_scope=everything_scope&tab=default_tab&lang=en_US
[skl]: https://scikit-learn.org/stable/documentation.html


### Natural Language Processing

A _natural language_ is a language people use to communicate, like English, Spanish, or Mandarin. These languages evolved over thousands of years and do not have simple, explicit rules.

_Natural language processing_ (NLP) means using a computer to analyze, manipulate, or synthesize natural language. Some examples of NLP tasks are:
* Translating from one language to another
* Recognizing speech or handwriting
* Tagging sentences with metadata, such as parts of speech (verbs, nouns, etc) or sentiment
* Extracting information or computing statistics from text

Compared to artificial languages like Python and XML, it's much more difficult to extract information from natural languages. NLP is a wide field; we only have time to learn the absolute basics. If you want to learn more, consider reading the entire [Natural Language Processing with Python][nlpp] book or taking a class in computational linguistics.



#### The Python NLP Ecosystem

There are lots of Python packages for NLP (try searching online)! A few popular ones are:

* [Natural Language Tool Kit][nltk] (`nltk`) is the most popular. It's designed for learning and research, so it's well-documented and has lots of features. We will use `nltk` for this class. 
* [TextBlob][textblob] is a "simplified" package. It has a nicer interface than NLTK, but fewer features.
* [SpaCy][spacy] is a "production-ready" package, and the fastest of all the packages listed here. Useful for working with large natural language datasets.
* [gensim][gensim] is a package for creating topic models, a kind of statistical model that predicts the topics of a text.

We're going to learn `nltk`, but you might want to try some of the others if your project involves NLP.

[Stanford's Core NLP][CoreNLP] library is at the cutting edge of NLP research. It's developed in Java, but several Python packages provide an interface (such as [pynlp][] and [stanford-corenlp][]).

[nltk]: https://www.nltk.org/
[spacy]: https://spacy.io/
[textblob]: https://textblob.readthedocs.io/en/dev/
[gensim]: https://radimrehurek.com/gensim/
[CoreNLP]: https://stanfordnlp.github.io/CoreNLP/
[pynlp]: https://github.com/sina-al/pynlp
[stanford-corenlp]: https://github.com/Lynten/stanford-corenlp

#### Corpora and Documents

A _document_ is a single body of text. When working with natural language data, documents are the unit of observation.

What you choose as a document depends on the purpose of your analysis. If you're studying how people react to news on Twitter, it makes sense to use individual tweets as documents. If you're studying how animals are portrayed in 19th-century literature, you could use individual novels as documents.

A _corpus_ is a collection of documents. In other words, a corpus is a dataset.

`nltk` provides some example corpora in the `nltk.corpus` submodule. The documentation gives a [complete list](http://www.nltk.org/nltk_data/). Most have to be downloaded with `nltk.download()` before use.

In [1]:
import nltk.corpus

# Download books from Project Gutenberg
nltk.download("gutenberg")

[nltk_data] Downloading package gutenberg to /Users/peter/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!


True

The `.fileids()` method lists the documents in a corpus.

In [2]:
nltk.corpus.gutenberg.fileids()

['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

Let's talk about [whales](https://www.gutenberg.org/files/2701/2701-h/2701-h.htm#link2H_4_0002). The `.raw()` method returns the raw text for a single document. Specify the document by its file ID.

In [3]:
moby = nltk.corpus.gutenberg.raw("melville-moby_dick.txt")

In [4]:
moby[0:2000]

'[Moby Dick by Herman Melville 1851]\r\n\r\n\r\nETYMOLOGY.\r\n\r\n(Supplied by a Late Consumptive Usher to a Grammar School)\r\n\r\nThe pale Usher--threadbare in coat, heart, body, and brain; I see him\r\nnow.  He was ever dusting his old lexicons and grammars, with a queer\r\nhandkerchief, mockingly embellished with all the gay flags of all the\r\nknown nations of the world.  He loved to dust his old grammars; it\r\nsomehow mildly reminded him of his mortality.\r\n\r\n"While you take in hand to school others, and to teach them by what\r\nname a whale-fish is to be called in our tongue leaving out, through\r\nignorance, the letter H, which almost alone maketh the signification\r\nof the word, you deliver that which is not true." --HACKLUYT\r\n\r\n"WHALE. ... Sw. and Dan. HVAL.  This animal is named from roundness\r\nor rolling; for in Dan. HVALT is arched or vaulted." --WEBSTER\'S\r\nDICTIONARY\r\n\r\n"WHALE. ... It is more immediately from the Dut. and Ger. WALLEN;\r\nA.S. WALW-IAN, t

### Tokenization

A _token_ is a sequence of characters to be treated as a group. Tokens are the unit of analysis for an individual document.

Tokens can represent paragraphs, sentences, words, or something else. Most of the time, tokens will be words.

When you analyze a document, the first step will usually be to split the document into tokens. Functions that do this are called _tokenizers_, and this process is called _tokenization_.

The `nltk.sent_tokenize()` function splits a document into sentences, and the `nltk.word_tokenize()` function splits a document into words.

In [5]:
type(nltk.sent_tokenize(moby))

list

In [6]:
nltk.sent_tokenize(moby)[0]

'[Moby Dick by Herman Melville 1851]\r\n\r\n\r\nETYMOLOGY.'

In [7]:
nltk.sent_tokenize(moby)[283]

'Call me Ishmael.'

In [8]:
nltk.word_tokenize(moby)[0:10]

['[',
 'Moby',
 'Dick',
 'by',
 'Herman',
 'Melville',
 '1851',
 ']',
 'ETYMOLOGY',
 '.']

Corpora also have `.sents()` and `.words()` methods for tokenization. These methods are specialized to the corpus, so they sometimes use different strategies than `sent_tokenize()` and `word_tokenize()`.

In [9]:
tmp = nltk.corpus.gutenberg.sents("melville-moby_dick.txt")

In [10]:
#tmp[3]

In [11]:
nltk.corpus.gutenberg.words("melville-moby_dick.txt")[:10]

['[',
 'Moby',
 'Dick',
 'by',
 'Herman',
 'Melville',
 '1851',
 ']',
 'ETYMOLOGY',
 '.']

### Strings and String Methods

Let's continue talking about 🐋. How does word tokenization actually work? The simplest strategy is to split at whitespace. You can do this with Python's built-in string methods:

In [12]:
moby.split()[:10] # splits on whitespace

['[Moby',
 'Dick',
 'by',
 'Herman',
 'Melville',
 '1851]',
 'ETYMOLOGY.',
 '(Supplied',
 'by',
 'a']

Splitting on whitespace doesn't handle punctuation. You can use regular expressions to split on more complex patterns. Python's built-in [`re` module](https://docs.python.org/3/library/re.html) provides regular expression functions, including `re.split`:

```
re.split(pattern, string, maxsplit=0, flags=0)
```

In [13]:
import re

In [14]:
moby[:100]

'[Moby Dick by Herman Melville 1851]\r\n\r\n\r\nETYMOLOGY.\r\n\r\n(Supplied by a Late Consumptive Usher to a Gr'

In [15]:
re.split("[, :;']", 'I can\'t, let:Go')

['I', 'can', 't', '', 'let', 'Go']
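
The empty string in the result above appears because the comma and the space after it are two separate delimiters with nothing between them. Here is a minimal sketch of two ways to avoid empty tokens, reusing the same example string (the variable names are only for illustration):

```python
import re

text = "I can't, let:Go"

# Each delimiter character splits on its own, so ", " leaves an empty string.
pieces = re.split("[, :;']", text)

# Option 1: drop the empty strings after splitting.
tokens_filtered = [p for p in pieces if p]

# Option 2: add + so that a run of delimiters counts as one split point.
tokens_plus = re.split("[, :;']+", text)

print(pieces)
print(tokens_filtered)
print(tokens_plus)
```

Both options give the same tokens here; the second avoids building the intermediate list.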

What if we also want to split at newlines?

### Escape Sequences and Raw Strings

In Python strings, backslash `\` marks the beginning of an _escape sequence_. Escape sequences are special codes for writing characters that you can't otherwise type. For example, `\n` is a new line character and `\t` is a tab character.

Since `\` has a special meaning in strings, to write a literal `\` you must use the escape sequence `\\`.

You can see the actual characters in a string by printing the string:

In [16]:
print("hello\nworld.")

hello
world.


The regular expression (Regex) language is independent of Python and also uses backslash `\` to mark the beginning of an escape sequence. Regex escape sequences disable special behavior for characters. For example, `.` matches any character, but `\.` only matches a literal `.`.

As a result, writing a regular expression in an ordinary Python string is awkward. For example, to match a literal `\`, we need to write `\\` in regular expressions, which is `\\\\` in an ordinary Python string.

In [17]:
print("\\\\")

\\


Python provides _raw strings_, where `\` has no special meaning for Python, to help solve this problem. You can create a raw string by putting an `r` before the starting quote:

More about raw strings: [here](https://www.journaldev.com/23598/python-raw-string#:~:text=Python%20raw%20string%20is%20created,treated%20as%20an%20escape%20character.)

In [18]:
print(r"\")

SyntaxError: EOL while scanning string literal (835106092.py, line 1)

In [None]:
print(r"\"") 

In [None]:
print(r'\\')

In [None]:
s = 'Hi\nHello'
print(s)

In [None]:
raw_s = r'Hi\nHello'
raw_s

In [None]:
print(raw_s)

Even raw strings can't end in a single `\` (or in any odd number of backslashes); this is a limitation of the Python parser.
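
As a quick check of these rules, the following sketch compares an ordinary string and a raw string that hold the same two backslashes, and then uses a raw-string pattern to find a literal backslash. The example path and variable names are made up for illustration.

```python
import re

ordinary = "\\\\"   # four typed characters, two actual backslashes
raw = r"\\"         # two typed characters, two actual backslashes
print(ordinary == raw, len(raw))

# The regex \\ matches one literal backslash; a raw string lets us write it as is.
path = "C:\\temp"   # hypothetical example containing a single backslash
found = re.findall(r"\\", path)
print(len(found))

# To end a string with a backslash, use an ordinary string and escape it.
trailing = "folder" + "\\"
print(trailing)
```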

Now we can write a better regular expression to split with:

In [None]:
len(set(re.split(r"[ \[\](),.:;!?'\n\r]", moby)))

In [None]:
re.split(r"\s", moby)

### Regular Expressions

The regular expressions language includes _character classes_ that describe common sets of characters. The whitespace class `\s` and the word class `\w` are useful here (see [Reference](https://docs.python.org/3/library/re.html)). So to split on any whitespace character:

In [None]:
string = r'[ ,.:;!\n\r]'
string


In [None]:
moby[:10]

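When you pass an ordinary (non-raw) string to `re.split`, Python processes escape sequences first, so the regex engine receives the actual character (for example, a real newline for `\n`). When you pass a raw string, the regex engine receives the backslash and the letter and interprets the escape itself. For escapes such as `\n`, `\r`, and `\t`, both routes match the same character, so the pattern works with or without the `r` prefix. Sequences such as `\[` or `\s` are not Python escapes, so Python passes them through unchanged, although recent Python versions warn about them; a raw string avoids the warning.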
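
To see why both spellings can work, here is a small sketch comparing what the regex engine receives in each case. The sample text is invented for illustration.

```python
import re

text = "one\ntwo"

plain = "\n"    # Python converts this to a single newline character
raw = r"\n"     # two characters: a backslash followed by the letter n
print(len(plain), len(raw))

# Both patterns match a newline, so the results agree.
split_plain = re.split(plain, text)
split_raw = re.split(raw, text)
print(split_plain, split_raw)

# \s is not a Python escape, so the raw string is the safe way to write it;
# it is the same as spelling out the backslash explicitly.
same_pattern = r"\s" == "\\s"
print(same_pattern)
```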

In [None]:
re.split("[ \[\],.:;!'()\n\r-]", moby) # note the '

In [None]:
re.split("[ \[\],.:;!'()\n]", moby) # note the '

In [None]:
re.split("[\s\[\],.:;!'()-]", moby)

Capitalizing a character class inverts its meaning, so to split on all non-word characters:

In [None]:
re.split(r"\W+", moby) # + matches 1 or more repetitions of the preceding element

`\w` means _any word character_ (letters, digits, and the underscore); `\W` means any character that is not a word character.

`+` causes the resulting RE to match 1 or more repetitions of the preceding RE.
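
The effect of `+` is easiest to see by splitting the same string with and without it. A minimal sketch on two short made-up strings:

```python
import re

text = "the...dog"

# Without +, every non-word character is its own split point.
single = re.split(r"\W", text)

# With +, the whole run "..." is treated as one split point.
run = re.split(r"\W+", text)

# A delimiter at the end of the string still leaves a trailing empty string.
trailing = re.split(r"\W+", "The dog barked!")

print(single)
print(run)
print(trailing)
```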

In [None]:
re.split(r"\W+", "the...dog")

In [None]:
re.split(r"\W+", "the,dog")

In [None]:
re.split(r"\W+", "the,I:! dog")

In [None]:
re.split(r"\W+", moby)

Rather than splitting the text, you can also approach the problem from the perspective of extracting tokens. The `findall()` function returns all matches for a regular expression:

In [None]:
re.findall(r"\w+", "The dog barked!")

In [None]:
print("\w") # \w is not a special python escape sequence, so it passes through

In [None]:
re.split(r"\W+", "The dog barked!")

In [None]:
re.findall(r"\w+'?\w{1}", "The dog's toy barked!")

In [None]:
re.findall(r"[\w']+", "I think the dog's toy barked!")

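- `r"..."`: a raw string, so Python leaves backslashes untouched and passes them on to the regex engine
- `(...)+`: the pattern inside the parentheses must appear one or more times
- `\w+`: one or more word characters, that is, a whole word
- `|`: or (matches the pattern on either side)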
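
Small differences between token patterns matter. The sketch below compares three patterns on a made-up sentence: `\w+'?\w{1}` needs at least two characters and so skips one-letter words, `[\w']+` also accepts stray quotation marks, and `\w+(?:'\w+)?` (one possible alternative, not the only one) allows a single apostrophe inside a word.

```python
import re

sentence = "I see a dog's 'toy'"

# Needs \w+ followed by one more \w, so "I" and "a" cannot match.
two_plus = re.findall(r"\w+'?\w{1}", sentence)

# Any run of word characters or apostrophes, including surrounding quotes.
with_quotes = re.findall(r"[\w']+", sentence)

# A word, optionally followed by an apostrophe and more word characters.
inner_apostrophe = re.findall(r"\w+(?:'\w+)?", sentence)

print(two_plus)
print(with_quotes)
print(inner_apostrophe)
```

Which behavior is right depends on the analysis; the third pattern is only a starting point and still ignores hyphenated words.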

For more practice, try [regex101](https://regex101.com/).

In [19]:
words = re.findall(r"\w+", moby)
words

['Moby',
 'Dick',
 'by',
 'Herman',
 'Melville',
 '1851',
 'ETYMOLOGY',
 'Supplied',
 'by',
 'a',
 'Late',
 'Consumptive',
 'Usher',
 'to',
 'a',
 'Grammar',
 'School',
 'The',
 'pale',
 'Usher',
 'threadbare',
 'in',
 'coat',
 'heart',
 'body',
 'and',
 'brain',
 'I',
 'see',
 'him',
 'now',
 'He',
 'was',
 'ever',
 'dusting',
 'his',
 'old',
 'lexicons',
 'and',
 'grammars',
 'with',
 'a',
 'queer',
 'handkerchief',
 'mockingly',
 'embellished',
 'with',
 'all',
 'the',
 'gay',
 'flags',
 'of',
 'all',
 'the',
 'known',
 'nations',
 'of',
 'the',
 'world',
 'He',
 'loved',
 'to',
 'dust',
 'his',
 'old',
 'grammars',
 'it',
 'somehow',
 'mildly',
 'reminded',
 'him',
 'of',
 'his',
 'mortality',
 'While',
 'you',
 'take',
 'in',
 'hand',
 'to',
 'school',
 'others',
 'and',
 'to',
 'teach',
 'them',
 'by',
 'what',
 'name',
 'a',
 'whale',
 'fish',
 'is',
 'to',
 'be',
 'called',
 'in',
 'our',
 'tongue',
 'leaving',
 'out',
 'through',
 'ignorance',
 'the',
 'letter',
 'H',
 'which'

In [29]:
moby[21000:22500]

'rful unbroken colt, with the mere appliance of a rope\r\ntied to the root of his tail." --A CHAPTER ON WHALING IN RIBS AND\r\nTRUCKS.\r\n\r\n"On one occasion I saw two of these monsters (whales) probably male\r\nand female, slowly swimming, one after the other, within less than a\r\nstone\'s throw of the shore" (Terra Del Fuego), "over which the beech\r\ntree extended its branches." --DARWIN\'S VOYAGE OF A NATURALIST.\r\n\r\n"\'Stern all!\' exclaimed the mate, as upon turning his head, he saw\r\nthe distended jaws of a large Sperm Whale close to the head of the\r\nboat, threatening it with instant destruction;--\'Stern all, for your\r\nlives!\'" --WHARTON THE WHALE KILLER.\r\n\r\n"So be cheery, my lads, let your hearts never fail,\r\nWhile the bold harpooneer is striking the whale!" --NANTUCKET SONG.\r\n\r\n"Oh, the rare old Whale, mid storm and gale\r\nIn his ocean home will be\r\nA giant in might, where might is right,\r\nAnd King of the boundless sea." --WHALE SONG.\r\n\r\n\r\n\r\n

In [28]:
print(moby[21000:22500])

rful unbroken colt, with the mere appliance of a rope
tied to the root of his tail." --A CHAPTER ON WHALING IN RIBS AND
TRUCKS.

"On one occasion I saw two of these monsters (whales) probably male
and female, slowly swimming, one after the other, within less than a
stone's throw of the shore" (Terra Del Fuego), "over which the beech
tree extended its branches." --DARWIN'S VOYAGE OF A NATURALIST.

"'Stern all!' exclaimed the mate, as upon turning his head, he saw
the distended jaws of a large Sperm Whale close to the head of the
boat, threatening it with instant destruction;--'Stern all, for your
lives!'" --WHARTON THE WHALE KILLER.

"So be cheery, my lads, let your hearts never fail,
While the bold harpooneer is striking the whale!" --NANTUCKET SONG.

"Oh, the rare old Whale, mid storm and gale
In his ocean home will be
A giant in might, where might is right,
And King of the boundless sea." --WHALE SONG.



CHAPTER 1

Loomings.


Call me Ishmael.  Some year

In [21]:
moby.find('CHAPTER 1')

21945

Let's try to match all chapters in the book. First, let's match the chapter headings, which look like "\nCHAPTER 1\r\n\r\nLoomings.\r\n". Check the novel [here](https://www.gutenberg.org/files/2701/2701-h/2701-h.htm#link2H_4_0002). Note that the full stop after the chapter number is not in the string.

In [27]:
re.findall(r"CHAPTER\s{1}\d+\s*\w+\.{1}", moby)

['CHAPTER 1\r\n\r\nLoomings.',
 'CHAPTER 5\r\n\r\nBreakfast.',
 'CHAPTER 11\r\n\r\nNightgown.',
 'CHAPTER 12\r\n\r\nBiographical.',
 'CHAPTER 13\r\n\r\nWheelbarrow.',
 'CHAPTER 14\r\n\r\nNantucket.',
 'CHAPTER 15\r\n\r\nChowder.',
 'CHAPTER 25\r\n\r\nPostscript.',
 'CHAPTER 28\r\n\r\nAhab.',
 'CHAPTER 32\r\n\r\nCetology.',
 'CHAPTER 37\r\n\r\nSunset.',
 'CHAPTER 38\r\n\r\nDusk.',
 'CHAPTER 46\r\n\r\nSurmises.',
 'CHAPTER 58\r\n\r\nBrit.',
 'CHAPTER 59\r\n\r\nSquid.',
 'CHAPTER 84\r\n\r\nPitchpoling.',
 'CHAPTER 92\r\n\r\nAmbergris.',
 'CHAPTER 121\r\n\r\nMidnight.']

In [24]:
re.findall(r"(CHAPTER\s{1}\d+)\s*(\w+\.{1})", moby)

[('CHAPTER 1', 'Loomings.'),
 ('CHAPTER 5', 'Breakfast.'),
 ('CHAPTER 11', 'Nightgown.'),
 ('CHAPTER 12', 'Biographical.'),
 ('CHAPTER 13', 'Wheelbarrow.'),
 ('CHAPTER 14', 'Nantucket.'),
 ('CHAPTER 15', 'Chowder.'),
 ('CHAPTER 25', 'Postscript.'),
 ('CHAPTER 28', 'Ahab.'),
 ('CHAPTER 32', 'Cetology.'),
 ('CHAPTER 37', 'Sunset.'),
 ('CHAPTER 38', 'Dusk.'),
 ('CHAPTER 46', 'Surmises.'),
 ('CHAPTER 58', 'Brit.'),
 ('CHAPTER 59', 'Squid.'),
 ('CHAPTER 84', 'Pitchpoling.'),
 ('CHAPTER 92', 'Ambergris.'),
 ('CHAPTER 121', 'Midnight.')]
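
When the pattern contains capturing groups, `re.findall` returns one tuple per match with one entry per group, as in the output above. Here is a minimal sketch of turning such pairs into a lookup table, using a short made-up excerpt in place of `moby` (the names `sample` and `titles` are only for illustration):

```python
import re

# Made-up excerpt that mimics the layout of the chapter headings.
sample = (
    "CHAPTER 1\r\n\r\nLoomings.\r\n\r\nSome text.\r\n\r\n"
    "CHAPTER 5\r\n\r\nBreakfast.\r\n\r\nSome more text.\r\n"
)

# Two groups: the heading and the one-word title.
pairs = re.findall(r"(CHAPTER\s\d+)\s*(\w+\.)", sample)

# Map chapter number -> title, dropping the trailing full stop.
titles = {int(heading.split()[1]): title.rstrip(".") for heading, title in pairs}

print(pairs)
print(titles)
```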

In [25]:
re.findall(r"(CHAPTER\s{1}\d+)\s*(.+\.{1})", moby)

[('CHAPTER 1', 'Loomings.'),
 ('CHAPTER 2', 'The Carpet-Bag.'),
 ('CHAPTER 3', 'The Spouter-Inn.'),
 ('CHAPTER 4', 'The Counterpane.'),
 ('CHAPTER 5', 'Breakfast.'),
 ('CHAPTER 6', 'The Street.'),
 ('CHAPTER 7', 'The Chapel.'),
 ('CHAPTER 8', 'The Pulpit.'),
 ('CHAPTER 9', 'The Sermon.'),
 ('CHAPTER 10', 'A Bosom Friend.'),
 ('CHAPTER 11', 'Nightgown.'),
 ('CHAPTER 12', 'Biographical.'),
 ('CHAPTER 13', 'Wheelbarrow.'),
 ('CHAPTER 14', 'Nantucket.'),
 ('CHAPTER 15', 'Chowder.'),
 ('CHAPTER 16', 'The Ship.'),
 ('CHAPTER 17', 'The Ramadan.'),
 ('CHAPTER 18', 'His Mark.'),
 ('CHAPTER 19', 'The Prophet.'),
 ('CHAPTER 20', 'All Astir.'),
 ('CHAPTER 21', 'Going Aboard.'),
 ('CHAPTER 22', 'Merry Christmas.'),
 ('CHAPTER 23', 'The Lee Shore.'),
 ('CHAPTER 24', 'The Advocate.'),
 ('CHAPTER 25', 'Postscript.'),
 ('CHAPTER 26', 'Knights and Squires.'),
 ('CHAPTER 27', 'Knights and Squires.'),
 ('CHAPTER 28', 'Ahab.'),
 ('CHAPTER 29', 'Enter Ahab; to Him, Stubb.'),
 ('CHAPTER 30', 'The Pipe.'),
 (

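See Chapter 43: its title, "Hark!", ends with an exclamation mark rather than a full stop, so the pattern above does not match it.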

In [26]:
re.findall(r"(CHAPTER\s{1}\d+)\s*(.+[\.{1}|!{1}|\?{1}])", moby)

[('CHAPTER 1', 'Loomings.'),
 ('CHAPTER 2', 'The Carpet-Bag.'),
 ('CHAPTER 3', 'The Spouter-Inn.'),
 ('CHAPTER 4', 'The Counterpane.'),
 ('CHAPTER 5', 'Breakfast.'),
 ('CHAPTER 6', 'The Street.'),
 ('CHAPTER 7', 'The Chapel.'),
 ('CHAPTER 8', 'The Pulpit.'),
 ('CHAPTER 9', 'The Sermon.'),
 ('CHAPTER 10', 'A Bosom Friend.'),
 ('CHAPTER 11', 'Nightgown.'),
 ('CHAPTER 12', 'Biographical.'),
 ('CHAPTER 13', 'Wheelbarrow.'),
 ('CHAPTER 14', 'Nantucket.'),
 ('CHAPTER 15', 'Chowder.'),
 ('CHAPTER 16', 'The Ship.'),
 ('CHAPTER 17', 'The Ramadan.'),
 ('CHAPTER 18', 'His Mark.'),
 ('CHAPTER 19', 'The Prophet.'),
 ('CHAPTER 20', 'All Astir.'),
 ('CHAPTER 21', 'Going Aboard.'),
 ('CHAPTER 22', 'Merry Christmas.'),
 ('CHAPTER 23', 'The Lee Shore.'),
 ('CHAPTER 24', 'The Advocate.'),
 ('CHAPTER 25', 'Postscript.'),
 ('CHAPTER 26', 'Knights and Squires.'),
 ('CHAPTER 27', 'Knights and Squires.'),
 ('CHAPTER 28', 'Ahab.'),
 ('CHAPTER 29', 'Enter Ahab; to Him, Stubb.'),
 ('CHAPTER 30', 'The Pipe.'),
 (

Chapter 1 reappeared! 

In [None]:
re.findall(r"(?<!,\s)(CHAPTER\s{1}\d+)\s*(.+[\.{1}|!{1}|\?{1}])", moby) # do not capture

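Let's use a negative lookbehind, `(?<!...)`! Python's `re` module only accepts lookbehind patterns of fixed width, so `(?<!,\s*)` raises an error while `(?<!,\s{1})` works: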

In [None]:
re.findall(r"(?<!,\s*)(CHAPTER\s{1}\d+)\s*(.+[\.{1}|!{1}])", moby)

In [None]:
re.findall(r"(?<!,\s{1})(CHAPTER\s{1}\d+)\s*(.+[\.{1}|!{1}])", moby)
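
The behavior of the lookbehind can be checked on a small made-up string. This sketch shows a fixed-width negative lookbehind skipping a match that follows a comma and a space, and the error that `re` raises for a variable-width lookbehind:

```python
import re

text = "see CHAPTER 1, CHAPTER 2\nCHAPTER 3"

# Fixed-width lookbehind: skip a match that follows a comma and one whitespace.
fixed = re.findall(r"(?<!,\s)CHAPTER\s\d+", text)
print(fixed)

# Variable-width lookbehind is rejected when the pattern is compiled.
try:
    re.compile(r"(?<!,\s*)CHAPTER\s\d+")
    compiled = True
except re.error as err:
    compiled = False
    print("re.error:", err)
```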

Let's find the unmatched chapters.

In [None]:
all_chapters = list(range(1, 136))  # Moby Dick has 135 numbered chapters
matched_chapters = [int(i) for i in 
                    re.findall(r"(?<!,\s{1})(?:CHAPTER\s{1})(\d+)(?:\s*.+[\.{1}|!{1}])", moby)]

In [None]:
[i for i in all_chapters if i not in matched_chapters]

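There is another new line! Some chapter titles continue on a second line, so the pattern has to allow a line break inside the title.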

In [None]:
re.findall(r"(?<!,\s)(CHAPTER\s{1}\d+)\s*(.+\s*.*[\.{1}|!{1}])", moby)

Let's be lazy! Adding `?` after a quantifier, as in `.+?`, makes it match as few characters as possible.

In [None]:
re.findall(r"(?<!,\s{1})(CHAPTER\s{1}\d+)\s*(.+?\s*.*[\.{1}|!{1}|\?{1}])", moby)
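
The difference between a greedy and a lazy quantifier is easiest to see on one line of text. A minimal sketch with a made-up string:

```python
import re

text = "Loomings. Call me Ishmael."

# Greedy: .+ grabs as much as possible, so the match ends at the last full stop.
greedy = re.findall(r".+\.", text)

# Lazy: .+? stops at the first full stop it can reach.
lazy = re.findall(r".+?\.", text)

print(greedy)
print(lazy)
```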

In [None]:
re.findall(r"(?<!,\s{1})(CHAPTER\s{1}\d+|Epilogue|EXTRACTS)\s*(.+?\s*.*[\.{1}|!{1}|\?{1}])", moby)

Let's use a positive lookahead `(?=...)`.

In [None]:
re.findall(r"(?<!,\s{1})(CHAPTER\s{1}\d+|Epilogue|EXTRACTS(?=\s*\())\s*(.+?\s*.*[\.{1}|!{1}])", moby)

To match `"ETYMOLOGY."`, we have to account for parentheses. (Note the extra `\.*`!)

In [None]:
re.findall(r"(?<!,\s{1})(ETYMOLOGY|CHAPTER\s{1}\d+|Epilogue|EXTRACTS(?=\s*\())\.*\s*(.+?\s*.*[\.{1}|!{1}|\){1}|\?{1}])", moby)
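
One detail worth knowing about the bracket expressions in these patterns: inside `[...]`, characters such as `{`, `1`, `}`, and `|` lose their special meaning and become ordinary members of the set. The sketch below checks which single characters such a class accepts and compares it with the shorter class `[.!?]`. Treating `[.!?]` as the intended meaning is an assumption, and it is not a drop-in replacement, since results on the full text may differ.

```python
import re

verbose = re.compile(r"[\.{1}|!{1}|\?{1}]")
simple = re.compile(r"[.!?]")

candidates = ".!?{}|1x"

# Collect the characters that each class matches on its own.
verbose_ok = [c for c in candidates if verbose.fullmatch(c)]
simple_ok = [c for c in candidates if simple.fullmatch(c)]

print(verbose_ok)
print(simple_ok)
```

Because the longer class also accepts `1`, `{`, `}`, and `|`, a title line may be matched up to one of those characters instead of up to its final punctuation mark.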

Perfect! But what if we want the chapter text that follows each matched heading?

In [None]:
re.findall(r"((?<!,\s{1})(?:ETYMOLOGY|CHAPTER\s{1}\d+|Epilogue|EXTRACTS(?=\s*\())(?:\.*\s*).+?\s*.*[\.{1}|!{1}|\){1}|\?{1}])", moby)

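Check the [docs](https://docs.python.org/3/library/re.html#re.split): if the pattern contains capturing groups, `re.split` also returns the text of those groups in the result. Remove the capturing group when splitting!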

In [None]:
chapters = re.split(r"(?<!,\s{1})(?:ETYMOLOGY|CHAPTER\s{1}\d+|Epilogue|EXTRACTS(?=\s*\())(?:\.*\s*).+?\s*.*[\.{1}|!{1}|\){1}|\?{1}]", moby)
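
The effect of capturing groups on `re.split` can be seen on a short made-up string. In this sketch, the capturing version keeps the headings in the result, while the non-capturing version `(?:...)` returns only the text between them:

```python
import re

text = "CHAPTER 1 one CHAPTER 2 two"

# Capturing group: the matched headings are included in the result.
with_group = re.split(r"(CHAPTER \d+)", text)

# Non-capturing group: only the pieces between headings are returned.
without_group = re.split(r"(?:CHAPTER \d+)", text)

print(with_group)
print(without_group)
```

In both cases the first element is an empty string, because the text starts with a match.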

In [None]:
chapters[2]

In [None]:
chapters = [re.sub(r"\s+", " ", chapter) for chapter in chapters]

In [None]:
chapter = chapters[3]
chapter

Back to tokenizing! Tokenizing natural languages is a difficult problem. Some tokenizers work better for certain kinds of documents than others.

Before building your own tokenizer, try the tokenizers included with __nltk__, in the `nltk.tokenize` submodule.

### Standardizing Text

We standardize numerical data in order to make fair comparisons, comparisons that are not influenced by the location and scale of the data. Similarly, you can standardize text (tokens) to make sure comparisons are fair and accurate.

For example, `"Cat"` and `"cat"` are the same word even though they're different tokens. Converting all characters to lowercase is one way to standardize a document.

Some common standardization techniques for text are:

* Lowercasing
* Stemming: Use patterns to remove prefixes and suffixes from words.
* Lemmatization: Look up each token in a dictionary and replace it with a root word. Similar to stemming, but more accurate.
* Stopword Removal: Remove tokens that don't contribute meaning. For example, "the" is meaningless on its own.
* Identifying Outliers: Identify and possibly remove non-standard "words" like numbers, misspellings, code, etc.

How and whether you should standardize a document or corpus depends on what kind of analysis you want to do. There is no formula; you must think carefully and experiment to determine which standardization techniques work best for your problem.

#### Lowercasing

You can use Python's string methods for simple text transformations.

In [None]:
chapter[:100]

In [None]:
chapter.lower()[:100]

In [None]:
chapter.upper()[:100]

In [None]:
words = re.findall(r"\w+", chapter)

In [None]:
words

In [None]:
lower = [w.lower() for w in words] # lower and upper
lower[:20]

#### Stemming

_Stemming_ runs an algorithm on each token to remove affixes (prefixes and suffixes). The result is called a _stem_.

Stemming is useful if you want to ignore affixes.

For example, most English verbs use suffixes to mark the tense. We write "They fish" (present) and "They fished" (past). Without any standardization, the tokens "fish" and "fished" would be treated as separate words. Stemming converts both tokens to the common stem "fish":

In [None]:
stemmer = nltk.PorterStemmer()  # create the stemmer once and reuse it
[stemmer.stem(w) for w in words]

In [None]:
print(nltk.PorterStemmer().stem("whales"))
print(nltk.PorterStemmer().stem("whaling"))
print(nltk.PorterStemmer().stem("whalebone"))
print(nltk.PorterStemmer().stem("narwhales"))

Stemmers use a sequence of rules to determine the stem for each token, but natural languages are full of special cases and exceptions. So as you can see in the example above, some stems are not words, and sometimes tokens that seem like they should have the same stem don't.
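
To make the idea of rule-based stemming concrete, here is a deliberately tiny suffix-stripping sketch. It is not the Porter algorithm, and the suffix list and minimum stem length are arbitrary assumptions chosen for illustration, but it shows the same kinds of failure.

```python
# Toy rules, checked in order; NOT the rules used by nltk.PorterStemmer.
SUFFIXES = ("ing", "ed", "es", "s")


def toy_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    word = token.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for token in ["fish", "fished", "whale", "whales", "whaling"]:
    print(token, "->", toy_stem(token))
```

With these rules, "fish" and "fished" share a stem, but "whale" keeps its final "e" while "whales" and "whaling" lose it, so related tokens end up with different stems and one of the stems is not a word.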

Several different stemmers are provided in the `nltk.stem` submodule.

#### Lemmatization

_Lemmatization_ looks up each token in a dictionary to find a root word, or _lemma_.

Lemmatization serves the same purpose as stemming. Lemmatization is more accurate, but requires a dictionary and usually takes longer.

In [None]:
nltk.download('wordnet')

In [None]:
nltk.WordNetLemmatizer().lemmatize("whales")

In [None]:
nltk.WordNetLemmatizer().lemmatize("whaling")

In [None]:
nltk.WordNetLemmatizer().lemmatize("whaling", "v") #this is a verb - it should be lemmatized to 'whale'

In [None]:
nltk.WordNetLemmatizer().lemmatize("whalebone")

The WordNet lemmatizer needs part of speech information to lemmatize words accurately; without it, every token is treated as a noun. You can get approximate part of speech information with __nltk__'s `pos_tag()` function.

In [None]:
nltk.download('averaged_perceptron_tagger')

In [None]:
nltk.pos_tag(["whaling"])

In [None]:
nltk.pos_tag(["whale"])

By default, `nltk.pos_tag()` returns Penn Treebank tags, a simplified descendant of the [Brown POS tags][brown].

[brown]: https://en.wikipedia.org/wiki/Brown_Corpus#Part-of-speech_tags_used

#### Foreign language

In [None]:
from nltk.stem.snowball import SnowballStemmer

In [None]:
fr = SnowballStemmer('french')

sent = "En mathématiques, une fonction càdlàg (continue à droite, limite à gauche) est ..."
nltk.word_tokenize(sent)

nltk.pos_tag([fr.stem(word) for word in nltk.word_tokenize(sent)])

In [None]:
moby_tags = nltk.pos_tag(words)
moby_tags

The `nltk.stem` submodule also provides several different lemmatizers.

#### Stopword Removal

_Stopwords_ are words that appear frequently but don't add meaning.

In English, "the", "a", and "at" are examples. However, exactly which words are stopwords depends on your analysis. Words that are meaningless in one analysis might be very important in others.

You can filter out stopwords with a list comprehension:

In [None]:
stopwords = ["the", "a", "and", "or", "in", "by"]
[w for w in words if w not in stopwords]

__nltk__ also provides a stopwords corpus that contains common stopwords for several languages.

In [None]:
nltk.download("stopwords")

In [None]:
stopwords = nltk.corpus.stopwords.words("english")
[w for w in words if w not in stopwords]
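
The list comprehensions above compare tokens exactly, so a capitalized token such as "The" is not removed by a lowercase stopword list. Here is a minimal sketch of a case-insensitive filter on a made-up sentence; storing the stopwords in a set is an optional choice that makes each lookup fast.

```python
import re

text = "The whale and the sea"
words = re.findall(r"\w+", text)

# A set gives fast membership tests; the entries are all lowercase.
stopwords = {"the", "a", "and", "or", "in", "by"}

# Exact comparison keeps "The" because it is capitalized.
case_sensitive = [w for w in words if w not in stopwords]

# Lowercasing each token before the lookup removes it as well.
case_insensitive = [w for w in words if w.lower() not in stopwords]

print(case_sensitive)
print(case_insensitive)
```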

### Summary 

- Learn Regular Expressions to rule natural languages 
- Processing depends on use