# STA 141B Data & Web Technologies for Data Analysis

### Lecture 11, 11/07/23, Natural language processing


### Announcements

 - Proposal is due this week. 
 - Exam is graded. 

### Last Week's topics
- Scraping with `pandas` 
- Parsing HTML and XML
- Web Scraping: 
    - Foodwise
    - Tornado Watch

### Today's topics
- Natural Language Processing
     - `nltk` package
     - Tokenization
     - Regular Expressions
     - Standardizing Text

### Resources
- [Natural Language Processing with Python][nlpp], chapters 1-3. Beware: the print version is for Python 2.
- [Scikit-Learn Documentation][skl], especially the section about [Text Feature Extraction](https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction)


[PDSH]: https://jakevdp.github.io/PythonDataScienceHandbook/
[ProGit]: https://git-scm.com/book/
[nlpp]: https://www.nltk.org/book/
[atap]: https://search.library.ucdavis.edu/primo-explore/fulldisplay?docid=01UCD_ALMA51320822340003126&context=L&vid=01UCD_V1&search_scope=everything_scope&tab=default_tab&lang=en_US
[skl]: https://scikit-learn.org/stable/documentation.html


### Natural Language Processing

A _natural language_ is a language people use to communicate, like English, Spanish, or Mandarin. These languages evolved over thousands of years and do not have simple, explicit rules.

_Natural language processing_ (NLP) means using a computer to analyze, manipulate, or synthesize natural language. Some examples of NLP tasks are:
* Translating from one language to another
* Recognizing speech or handwriting
* Tagging sentences with metadata, such as parts of speech (verbs, nouns, etc) or sentiment
* Extracting information or computing statistics from text

Compared to artificial languages like Python and XML, it's much more difficult to extract information from natural languages. NLP is a wide field; we only have time to learn the absolute basics. If you want to learn more, consider reading the entire [Natural Language Processing with Python][nlpp] book or taking a class in computational linguistics.

[nlpp]: https://www.nltk.org/book/


#### The Python NLP Ecosystem

There are lots of Python packages for NLP (try searching online)! A few popular ones are:

* [Natural Language Tool Kit][nltk] (`nltk`) is the most popular. It's designed for learning and research, so it's well-documented and has lots of features. We will use `nltk` for this class. 
* [TextBlob][textblob] is a "simplified" package. It has a nicer interface than NLTK, but fewer features.
* [SpaCy][spacy] is a "production-ready" package, and the fastest of all the packages listed here. Useful for working with large natural language datasets.
* [gensim][gensim] is a package for creating topic models, which are a kind of statistical model that predicts the topics of a text.

We're going to learn `nltk`, but you might want to try some of the others if your project involves NLP.

[Stanford's Core NLP][CoreNLP] library is at the cutting edge of NLP research. It's developed in Java, but several Python packages provide an interface (such as [pynlp][] and [stanford-corenlp][]).

[nltk]: https://www.nltk.org/
[spacy]: https://spacy.io/
[textblob]: https://textblob.readthedocs.io/en/dev/
[gensim]: https://radimrehurek.com/gensim/
[CoreNLP]: https://stanfordnlp.github.io/CoreNLP/
[pynlp]: https://github.com/sina-al/pynlp
[stanford-corenlp]: https://github.com/Lynten/stanford-corenlp

#### Corpora and Documents

A _document_ is a single body of text. When working with natural language data, documents are the unit of observation.

What you choose as a document depends on the purpose of your analysis. If you're studying how people react to news on Twitter, it makes sense to use individual tweets as documents. If you're studying how animals are portrayed in 19th-century literature, you could use individual novels as documents.

A _corpus_ is a collection of documents. In other words, a corpus is a dataset.

`nltk` provides some example corpora in the `nltk.corpus` submodule. The documentation gives a [complete list](http://www.nltk.org/nltk_data/). Most have to be downloaded with `nltk.download()` before use.

In [1]:
import nltk.corpus

# Download books from Project Gutenberg
nltk.download("gutenberg")

[nltk_data] Downloading package gutenberg to /Users/peter/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!


True

The `.fileids()` method lists the documents in a corpus.

In [2]:
nltk.corpus.gutenberg.fileids()

['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

Let's talk about [whales](https://www.gutenberg.org/files/2701/2701-h/2701-h.htm#link2H_4_0002). The `.raw()` method returns the raw text for a single document. Specify the document by its file ID.

In [3]:
moby = nltk.corpus.gutenberg.raw("melville-moby_dick.txt")

In [4]:
moby[0:200000]

'[Moby Dick by Herman Melville 1851]\r\n\r\n\r\nETYMOLOGY.\r\n\r\n(Supplied by a Late Consumptive Usher to a Grammar School)\r\n\r\nThe pale Usher--threadbare in coat, heart, body, and brain; I see him\r\nnow.  He was ever dusting his old lexicons and grammars, with a queer\r\nhandkerchief, mockingly embellished with all the gay flags of all the\r\nknown nations of the world.  He loved to dust his old grammars; it\r\nsomehow mildly reminded him of his mortality.\r\n\r\n"While you take in hand to school others, and to teach them by what\r\nname a whale-fish is to be called in our tongue leaving out, through\r\nignorance, the letter H, which almost alone maketh the signification\r\nof the word, you deliver that which is not true." --HACKLUYT\r\n\r\n"WHALE. ... Sw. and Dan. HVAL.  This animal is named from roundness\r\nor rolling; for in Dan. HVALT is arched or vaulted." --WEBSTER\'S\r\nDICTIONARY\r\n\r\n"WHALE. ... It is more immediately from the Dut. and Ger. WALLEN;\r\nA.S. WALW-IAN, t

### Tokenization

A _token_ is a sequence of characters to be treated as a group. Tokens are the unit of analysis for an individual document.

Tokens can represent paragraphs, sentences, words, or something else. Most of the time, tokens will be words.

When you analyze a document, the first step will usually be to split the document into tokens. Functions that do this are called _tokenizers_, and this process is called _tokenization_.

The `nltk.sent_tokenize()` function splits a document into sentences, and the `nltk.word_tokenize()` function splits a document into words.

In [5]:
nltk.sent_tokenize(moby)

['[Moby Dick by Herman Melville 1851]\r\n\r\n\r\nETYMOLOGY.',
 '(Supplied by a Late Consumptive Usher to a Grammar School)\r\n\r\nThe pale Usher--threadbare in coat, heart, body, and brain; I see him\r\nnow.',
 'He was ever dusting his old lexicons and grammars, with a queer\r\nhandkerchief, mockingly embellished with all the gay flags of all the\r\nknown nations of the world.',
 'He loved to dust his old grammars; it\r\nsomehow mildly reminded him of his mortality.',
 '"While you take in hand to school others, and to teach them by what\r\nname a whale-fish is to be called in our tongue leaving out, through\r\nignorance, the letter H, which almost alone maketh the signification\r\nof the word, you deliver that which is not true."',
 '--HACKLUYT\r\n\r\n"WHALE.',
 '... Sw. and Dan.',
 'HVAL.',
 'This animal is named from roundness\r\nor rolling; for in Dan.',
 'HVALT is arched or vaulted."',
 '--WEBSTER\'S\r\nDICTIONARY\r\n\r\n"WHALE.',
 '...',
 'It is more immediately from the Dut.',
 '

In [6]:
nltk.sent_tokenize(moby)[0]

'[Moby Dick by Herman Melville 1851]\r\n\r\n\r\nETYMOLOGY.'

In [7]:
nltk.sent_tokenize(moby)[283]

'Call me Ishmael.'
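
Tokens do not have to be words. The sketch below, which uses only built-in string methods rather than `nltk`'s tokenizers, splits a made-up `sample` text (an assumption for illustration, not the corpus text) into paragraph tokens and whitespace-separated word tokens.

```python
# A made-up two-paragraph sample; not taken from the corpus.
sample = "Call me Ishmael. Some years ago I went to sea.\n\nIt is a way I have."

# Paragraph tokens: split at blank lines.
paragraphs = sample.split("\n\n")

# Word tokens: split at any run of whitespace (punctuation stays attached).
words = sample.split()

print(paragraphs)
print(len(words), words[:3])
```

Which granularity is appropriate depends on the analysis; the same document can be tokenized in more than one way.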

In [8]:
nltk.word_tokenize(moby)

['[',
 'Moby',
 'Dick',
 'by',
 'Herman',
 'Melville',
 '1851',
 ']',
 'ETYMOLOGY',
 '.',
 '(',
 'Supplied',
 'by',
 'a',
 'Late',
 'Consumptive',
 'Usher',
 'to',
 'a',
 'Grammar',
 'School',
 ')',
 'The',
 'pale',
 'Usher',
 '--',
 'threadbare',
 'in',
 'coat',
 ',',
 'heart',
 ',',
 'body',
 ',',
 'and',
 'brain',
 ';',
 'I',
 'see',
 'him',
 'now',
 '.',
 'He',
 'was',
 'ever',
 'dusting',
 'his',
 'old',
 'lexicons',
 'and',
 'grammars',
 ',',
 'with',
 'a',
 'queer',
 'handkerchief',
 ',',
 'mockingly',
 'embellished',
 'with',
 'all',
 'the',
 'gay',
 'flags',
 'of',
 'all',
 'the',
 'known',
 'nations',
 'of',
 'the',
 'world',
 '.',
 'He',
 'loved',
 'to',
 'dust',
 'his',
 'old',
 'grammars',
 ';',
 'it',
 'somehow',
 'mildly',
 'reminded',
 'him',
 'of',
 'his',
 'mortality',
 '.',
 '``',
 'While',
 'you',
 'take',
 'in',
 'hand',
 'to',
 'school',
 'others',
 ',',
 'and',
 'to',
 'teach',
 'them',
 'by',
 'what',
 'name',
 'a',
 'whale-fish',
 'is',
 'to',
 'be',
 'called',

Corpora also have `.sents()` and `.words()` methods for tokenization. These methods are specialized to the corpus, so they sometimes use different strategies than `sent_tokenize()` and `word_tokenize()`.

In [9]:
nltk.corpus.gutenberg.sents("melville-moby_dick.txt")

[['[', 'Moby', 'Dick', 'by', 'Herman', 'Melville', '1851', ']'], ['ETYMOLOGY', '.'], ...]

In [10]:
nltk.corpus.gutenberg.words("melville-moby_dick.txt")[:10]

['[',
 'Moby',
 'Dick',
 'by',
 'Herman',
 'Melville',
 '1851',
 ']',
 'ETYMOLOGY',
 '.']

### Strings and String Methods

Let's continue talking about 🐋. How does word tokenization actually work? The simplest strategy is to split at whitespace. You can do this with Python's built-in string methods:

In [11]:
moby.split() # splits on whitespace

['[Moby',
 'Dick',
 'by',
 'Herman',
 'Melville',
 '1851]',
 'ETYMOLOGY.',
 '(Supplied',
 'by',
 'a',
 'Late',
 'Consumptive',
 'Usher',
 'to',
 'a',
 'Grammar',
 'School)',
 'The',
 'pale',
 'Usher--threadbare',
 'in',
 'coat,',
 'heart,',
 'body,',
 'and',
 'brain;',
 'I',
 'see',
 'him',
 'now.',
 'He',
 'was',
 'ever',
 'dusting',
 'his',
 'old',
 'lexicons',
 'and',
 'grammars,',
 'with',
 'a',
 'queer',
 'handkerchief,',
 'mockingly',
 'embellished',
 'with',
 'all',
 'the',
 'gay',
 'flags',
 'of',
 'all',
 'the',
 'known',
 'nations',
 'of',
 'the',
 'world.',
 'He',
 'loved',
 'to',
 'dust',
 'his',
 'old',
 'grammars;',
 'it',
 'somehow',
 'mildly',
 'reminded',
 'him',
 'of',
 'his',
 'mortality.',
 '"While',
 'you',
 'take',
 'in',
 'hand',
 'to',
 'school',
 'others,',
 'and',
 'to',
 'teach',
 'them',
 'by',
 'what',
 'name',
 'a',
 'whale-fish',
 'is',
 'to',
 'be',
 'called',
 'in',
 'our',
 'tongue',
 'leaving',
 'out,',
 'through',
 'ignorance,',
 'the',
 'letter',
 '

Splitting on whitespace doesn't handle punctuation. You can use regular expressions to split on more complex patterns. Python's built-in [`re` module](https://docs.python.org/3/library/re.html) provides regular expression functions, including `re.split()`:

```
re.split(pattern, string, maxsplit=0, flags=0)
```
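
As a small sketch of how the `pattern` and `maxsplit` arguments behave, the example below splits a made-up phrase (chosen for illustration) at commas and spaces. Adjacent separators produce empty strings, and `maxsplit` limits the number of splits.

```python
import re

# Made-up sample phrase for illustration.
text = "coat, heart, body, and brain"

# Split at every comma or space; adjacent separators produce empty strings.
parts = re.split(r"[ ,]", text)

# maxsplit=2 stops after two splits and leaves the rest of the string intact.
first_two = re.split(r"[ ,]", text, maxsplit=2)

print(parts)
print(first_two)
```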

In [12]:
import re


In [13]:
re.split("[ ,.:;!()']", moby)

['[Moby',
 'Dick',
 'by',
 'Herman',
 'Melville',
 '1851]\r\n\r\n\r\nETYMOLOGY',
 '\r\n\r\n',
 'Supplied',
 'by',
 'a',
 'Late',
 'Consumptive',
 'Usher',
 'to',
 'a',
 'Grammar',
 'School',
 '\r\n\r\nThe',
 'pale',
 'Usher--threadbare',
 'in',
 'coat',
 '',
 'heart',
 '',
 'body',
 '',
 'and',
 'brain',
 '',
 'I',
 'see',
 'him\r\nnow',
 '',
 '',
 'He',
 'was',
 'ever',
 'dusting',
 'his',
 'old',
 'lexicons',
 'and',
 'grammars',
 '',
 'with',
 'a',
 'queer\r\nhandkerchief',
 '',
 'mockingly',
 'embellished',
 'with',
 'all',
 'the',
 'gay',
 'flags',
 'of',
 'all',
 'the\r\nknown',
 'nations',
 'of',
 'the',
 'world',
 '',
 '',
 'He',
 'loved',
 'to',
 'dust',
 'his',
 'old',
 'grammars',
 '',
 'it\r\nsomehow',
 'mildly',
 'reminded',
 'him',
 'of',
 'his',
 'mortality',
 '\r\n\r\n"While',
 'you',
 'take',
 'in',
 'hand',
 'to',
 'school',
 'others',
 '',
 'and',
 'to',
 'teach',
 'them',
 'by',
 'what\r\nname',
 'a',
 'whale-fish',
 'is',
 'to',
 'be',
 'called',
 'in',
 'our',
 't

What if we also want to split at newlines?

### Escape Sequences and Raw Strings

In Python strings, backslash `\` marks the beginning of an _escape sequence_. Escape sequences are special codes for writing characters that you can't otherwise type. For example, `\n` is a new line character and `\t` is a tab character.

Since `\` has a special meaning in strings, to write a literal `\` you must use the escape sequence `\\`.

You can see the actual characters in a string by printing the string:

In [14]:
print("hello\nworld\.")

hello
world\.
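
To make the idea concrete, here is a short sketch showing that an escape sequence is typed as two characters but stored as one. The variable names are only for illustration.

```python
# Each escape sequence is written with two characters but stored as one.
newline = "\n"
tab = "\t"
backslash = "\\"

print(len(newline), len(tab), len(backslash))

# repr() shows the escaped form; print() shows the actual characters.
print(repr("a\tb"))
print("a\tb")
```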


The regular expression (Regex) language is independent of Python and also uses backslash `\` to mark the beginning of an escape sequence. Many regex escape sequences disable special behavior for characters. For example, `.` matches any character, but `\.` only matches a literal `.`.

As a result, writing a regular expression in an ordinary Python string is awkward. For example, to match a literal `\`, we need to write `\\` in regular expressions, which is `\\\\` in an ordinary Python string.

In [15]:
print("\\\\")

\\
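
The double escaping can be sketched with a made-up Windows-style path (an assumption for illustration). Both patterns below reach the regex engine as the same two characters, so they split the path identically.

```python
import re

# Made-up Windows-style path; the raw string keeps each backslash as typed.
path = r"C:\temp\notes.txt"

# Ordinary string: Python turns "\\\\" into two backslash characters,
# which the regex engine reads as one literal backslash.
ordinary = re.split("\\\\", path)

# Raw string: r"\\" is already the two backslash characters.
raw = re.split(r"\\", path)

print(ordinary)
print(raw)
```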


Python provides _raw strings_, where `\` has no special meaning for Python, to help solve this problem. You can read more about raw strings [here](https://www.journaldev.com/23598/python-raw-string#:~:text=Python%20raw%20string%20is%20created,treated%20as%20an%20escape%20character.).

You can create a raw string by putting an `r` before the starting quote:

In [16]:
print(r"\ ") # print(r"\") raises a SyntaxError

\ 


In [17]:
print(r"\"") 

\"


In [18]:
print(r'\\')

\\


In [19]:
s = 'Hi\nHello'
print(s)

Hi
Hello


In [20]:
raw_s = r'Hi\nHello'
raw_s

'Hi\\nHello'

In [21]:
print(raw_s)

Hi\nHello


Even raw strings can't end in a single `\`; this is a limitation of the Python parser.
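
One possible workaround, sketched below with a made-up folder path, is to append the trailing backslash from an ordinary string.

```python
# r"C:\data\" would be a syntax error, so the trailing backslash
# is added from an ordinary string with the escape sequence "\\".
folder = r"C:\data" + "\\"

print(folder)
print(len(folder))
```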

Now we can write a better regular expression to split with:

In [22]:
re.split(r"[ \[\](),.:;!?'\n\r]", moby)

['',
 'Moby',
 'Dick',
 'by',
 'Herman',
 'Melville',
 '1851',
 '',
 '',
 '',
 '',
 '',
 '',
 'ETYMOLOGY',
 '',
 '',
 '',
 '',
 '',
 'Supplied',
 'by',
 'a',
 'Late',
 'Consumptive',
 'Usher',
 'to',
 'a',
 'Grammar',
 'School',
 '',
 '',
 '',
 '',
 'The',
 'pale',
 'Usher--threadbare',
 'in',
 'coat',
 '',
 'heart',
 '',
 'body',
 '',
 'and',
 'brain',
 '',
 'I',
 'see',
 'him',
 '',
 'now',
 '',
 '',
 'He',
 'was',
 'ever',
 'dusting',
 'his',
 'old',
 'lexicons',
 'and',
 'grammars',
 '',
 'with',
 'a',
 'queer',
 '',
 'handkerchief',
 '',
 'mockingly',
 'embellished',
 'with',
 'all',
 'the',
 'gay',
 'flags',
 'of',
 'all',
 'the',
 '',
 'known',
 'nations',
 'of',
 'the',
 'world',
 '',
 '',
 'He',
 'loved',
 'to',
 'dust',
 'his',
 'old',
 'grammars',
 '',
 'it',
 '',
 'somehow',
 'mildly',
 'reminded',
 'him',
 'of',
 'his',
 'mortality',
 '',
 '',
 '',
 '',
 '"While',
 'you',
 'take',
 'in',
 'hand',
 'to',
 'school',
 'others',
 '',
 'and',
 'to',
 'teach',
 'them',
 'by',
 '

### Regular Expressions

The regular expressions language includes _character classes_ that describe common sets of characters. The whitespace class `\s` and the word class `\w` are useful here (see the [`re` reference](https://docs.python.org/3/library/re.html)). Before splitting on any whitespace character, look at how a raw-string pattern is stored:

In [23]:
string = r'[ ,.:;!\n\r]'
string

'[ ,.:;!\\n\\r]'

In [24]:
print(string)

[ ,.:;!\n\r]


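With a raw string, the regex engine receives the two characters `\n` and interprets them as the regex escape for a newline; with an ordinary string, Python converts `\n` to a newline character before `re.split` sees it. Both patterns match the same character here, so the pattern does not have to be a raw string.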
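
The sketch below, on a made-up two-line string, shows one case where ordinary and raw patterns agree (`\n`) and one where they do not (`\b`, which is a backspace character in an ordinary string but a word-boundary anchor in a regex).

```python
import re

text = "line one\nline two"

# "\n" is a newline character; r"\n" is the regex escape for a newline.
# Both patterns match the same character, so the results agree.
same = re.split("\n", text) == re.split(r"\n", text)
print(same)

# "\b" is a backspace character; r"\b" is the regex word-boundary anchor.
# Here the two patterns do not coincide.
plain_hits = re.findall("\bline", text)
raw_hits = re.findall(r"\bline", text)
print(plain_hits)
print(raw_hits)
```

Using raw strings for every pattern avoids having to remember which escapes happen to coincide.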

In [25]:
re.split("[ \[\],.:;!'()\n]", moby) # note the '

['',
 'Moby',
 'Dick',
 'by',
 'Herman',
 'Melville',
 '1851',
 '\r',
 '\r',
 '\r',
 'ETYMOLOGY',
 '\r',
 '\r',
 '',
 'Supplied',
 'by',
 'a',
 'Late',
 'Consumptive',
 'Usher',
 'to',
 'a',
 'Grammar',
 'School',
 '\r',
 '\r',
 'The',
 'pale',
 'Usher--threadbare',
 'in',
 'coat',
 '',
 'heart',
 '',
 'body',
 '',
 'and',
 'brain',
 '',
 'I',
 'see',
 'him\r',
 'now',
 '',
 '',
 'He',
 'was',
 'ever',
 'dusting',
 'his',
 'old',
 'lexicons',
 'and',
 'grammars',
 '',
 'with',
 'a',
 'queer\r',
 'handkerchief',
 '',
 'mockingly',
 'embellished',
 'with',
 'all',
 'the',
 'gay',
 'flags',
 'of',
 'all',
 'the\r',
 'known',
 'nations',
 'of',
 'the',
 'world',
 '',
 '',
 'He',
 'loved',
 'to',
 'dust',
 'his',
 'old',
 'grammars',
 '',
 'it\r',
 'somehow',
 'mildly',
 'reminded',
 'him',
 'of',
 'his',
 'mortality',
 '\r',
 '\r',
 '"While',
 'you',
 'take',
 'in',
 'hand',
 'to',
 'school',
 'others',
 '',
 'and',
 'to',
 'teach',
 'them',
 'by',
 'what\r',
 'name',
 'a',
 'whale-fish',


In [28]:
re.split(r"\s", moby)

['[Moby',
 'Dick',
 'by',
 'Herman',
 'Melville',
 '1851]',
 '',
 '',
 '',
 '',
 '',
 'ETYMOLOGY.',
 '',
 '',
 '',
 '(Supplied',
 'by',
 'a',
 'Late',
 'Consumptive',
 'Usher',
 'to',
 'a',
 'Grammar',
 'School)',
 '',
 '',
 '',
 'The',
 'pale',
 'Usher--threadbare',
 'in',
 'coat,',
 'heart,',
 'body,',
 'and',
 'brain;',
 'I',
 'see',
 'him',
 '',
 'now.',
 '',
 'He',
 'was',
 'ever',
 'dusting',
 'his',
 'old',
 'lexicons',
 'and',
 'grammars,',
 'with',
 'a',
 'queer',
 '',
 'handkerchief,',
 'mockingly',
 'embellished',
 'with',
 'all',
 'the',
 'gay',
 'flags',
 'of',
 'all',
 'the',
 '',
 'known',
 'nations',
 'of',
 'the',
 'world.',
 '',
 'He',
 'loved',
 'to',
 'dust',
 'his',
 'old',
 'grammars;',
 'it',
 '',
 'somehow',
 'mildly',
 'reminded',
 'him',
 'of',
 'his',
 'mortality.',
 '',
 '',
 '',
 '"While',
 'you',
 'take',
 'in',
 'hand',
 'to',
 'school',
 'others,',
 'and',
 'to',
 'teach',
 'them',
 'by',
 'what',
 '',
 'name',
 'a',
 'whale-fish',
 'is',
 'to',
 'be',
 

Capitalizing a character class inverts its meaning, so to split on all non-word characters:

In [29]:
re.split(r"\W+", moby) # + matches one or more repetitions of the preceding pattern

['',
 'Moby',
 'Dick',
 'by',
 'Herman',
 'Melville',
 '1851',
 'ETYMOLOGY',
 'Supplied',
 'by',
 'a',
 'Late',
 'Consumptive',
 'Usher',
 'to',
 'a',
 'Grammar',
 'School',
 'The',
 'pale',
 'Usher',
 'threadbare',
 'in',
 'coat',
 'heart',
 'body',
 'and',
 'brain',
 'I',
 'see',
 'him',
 'now',
 'He',
 'was',
 'ever',
 'dusting',
 'his',
 'old',
 'lexicons',
 'and',
 'grammars',
 'with',
 'a',
 'queer',
 'handkerchief',
 'mockingly',
 'embellished',
 'with',
 'all',
 'the',
 'gay',
 'flags',
 'of',
 'all',
 'the',
 'known',
 'nations',
 'of',
 'the',
 'world',
 'He',
 'loved',
 'to',
 'dust',
 'his',
 'old',
 'grammars',
 'it',
 'somehow',
 'mildly',
 'reminded',
 'him',
 'of',
 'his',
 'mortality',
 'While',
 'you',
 'take',
 'in',
 'hand',
 'to',
 'school',
 'others',
 'and',
 'to',
 'teach',
 'them',
 'by',
 'what',
 'name',
 'a',
 'whale',
 'fish',
 'is',
 'to',
 'be',
 'called',
 'in',
 'our',
 'tongue',
 'leaving',
 'out',
 'through',
 'ignorance',
 'the',
 'letter',
 'H',
 'w

`\w` matches any word character (letters, digits, and the underscore); `\W` matches any character that is not a word character.

`+` causes the resulting RE to match 1 or more repetitions of the preceding RE.
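
As a quick sketch of these character classes and the `+` quantifier, the example below applies them to a made-up string containing a tab, a double space, and a newline.

```python
import re

# Made-up sample with a tab, a double space, and a trailing newline.
text = "Chapter 1:\tLoomings  (1851)\n"

digits = re.findall(r"\d+", text)     # runs of digits
word_runs = re.findall(r"\w+", text)  # runs of word characters
single = re.split(r"\s", text)        # split at each whitespace character
runs = re.split(r"\s+", text)         # split at each run of whitespace

print(digits)
print(word_runs)
print(single)
print(runs)
```

Without `+`, consecutive whitespace characters produce empty strings in the result.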

In [30]:
re.split(r"\W+", "the...dog")

['the', 'dog']

In [31]:
re.split("\W+", "the,dog")

['the', 'dog']

In [32]:
re.split(r"\W+", "the,I:dog")

['the', 'I', 'dog']

In [33]:
re.split(r"\W+", moby)

['',
 'Moby',
 'Dick',
 'by',
 'Herman',
 'Melville',
 '1851',
 'ETYMOLOGY',
 'Supplied',
 'by',
 'a',
 'Late',
 'Consumptive',
 'Usher',
 'to',
 'a',
 'Grammar',
 'School',
 'The',
 'pale',
 'Usher',
 'threadbare',
 'in',
 'coat',
 'heart',
 'body',
 'and',
 'brain',
 'I',
 'see',
 'him',
 'now',
 'He',
 'was',
 'ever',
 'dusting',
 'his',
 'old',
 'lexicons',
 'and',
 'grammars',
 'with',
 'a',
 'queer',
 'handkerchief',
 'mockingly',
 'embellished',
 'with',
 'all',
 'the',
 'gay',
 'flags',
 'of',
 'all',
 'the',
 'known',
 'nations',
 'of',
 'the',
 'world',
 'He',
 'loved',
 'to',
 'dust',
 'his',
 'old',
 'grammars',
 'it',
 'somehow',
 'mildly',
 'reminded',
 'him',
 'of',
 'his',
 'mortality',
 'While',
 'you',
 'take',
 'in',
 'hand',
 'to',
 'school',
 'others',
 'and',
 'to',
 'teach',
 'them',
 'by',
 'what',
 'name',
 'a',
 'whale',
 'fish',
 'is',
 'to',
 'be',
 'called',
 'in',
 'our',
 'tongue',
 'leaving',
 'out',
 'through',
 'ignorance',
 'the',
 'letter',
 'H',
 'w

Rather than splitting the text, you can also approach the problem from the perspective of extracting tokens. The `findall()` function returns all matches for a regular expression:

In [34]:
re.findall(r"\w+", "The dog barked!")

['The', 'dog', 'barked']

In [35]:
print("\w") # \w is not a special python escape sequence, so it passes through

\w


In [36]:
re.split(r"\W+", "The dog barked!")

['The', 'dog', 'barked', '']

In [37]:
re.findall(r"\w+'?\w{1}", "The dog's toy barked!")

['The', "dog's", 'toy', 'barked']

In [38]:
re.findall(r"\w+'?\w{1}!?", "The dog's toy barked!")

['The', "dog's", 'toy', 'barked!']

- `r" "`: a raw string, so backslashes reach the regex engine unchanged
- `()+`: the pattern inside the parentheses should appear once or more
- `\w+`: one or more word characters
- `|`: or

For more practice, try [regex101](https://regex101.com/?fbclid=IwAR36UyAxywvpSvTOh7F-KYI72IZAVQ0wRcBc0OEOu6h4MifEf-iLcFedfyk).
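
Grouping and alternation are listed above but not yet demonstrated, so here is a minimal sketch. The `token_pattern` below is an illustrative assumption, not the pattern `nltk` uses; it relies on non-capturing groups `(?:...)` so that `findall()` returns whole matches.

```python
import re

sentence = "The dog's toy barked!"

# Either a word (optionally with apostrophe parts such as 's) or a single
# punctuation character. (?:...) groups without capturing, so findall()
# returns whole matches rather than group contents.
token_pattern = r"\w+(?:'\w+)*|[^\w\s]"
tokens = re.findall(token_pattern, sentence)
print(tokens)

# (?:ha)+ repeats the whole group "ha" one or more times.
laughs = re.findall(r"(?:ha)+", "ha haha hahaha!")
print(laughs)
```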

In [39]:
words = re.findall(r"\w+", moby)
words

['Moby',
 'Dick',
 'by',
 'Herman',
 'Melville',
 '1851',
 'ETYMOLOGY',
 'Supplied',
 'by',
 'a',
 'Late',
 'Consumptive',
 'Usher',
 'to',
 'a',
 'Grammar',
 'School',
 'The',
 'pale',
 'Usher',
 'threadbare',
 'in',
 'coat',
 'heart',
 'body',
 'and',
 'brain',
 'I',
 'see',
 'him',
 'now',
 'He',
 'was',
 'ever',
 'dusting',
 'his',
 'old',
 'lexicons',
 'and',
 'grammars',
 'with',
 'a',
 'queer',
 'handkerchief',
 'mockingly',
 'embellished',
 'with',
 'all',
 'the',
 'gay',
 'flags',
 'of',
 'all',
 'the',
 'known',
 'nations',
 'of',
 'the',
 'world',
 'He',
 'loved',
 'to',
 'dust',
 'his',
 'old',
 'grammars',
 'it',
 'somehow',
 'mildly',
 'reminded',
 'him',
 'of',
 'his',
 'mortality',
 'While',
 'you',
 'take',
 'in',
 'hand',
 'to',
 'school',
 'others',
 'and',
 'to',
 'teach',
 'them',
 'by',
 'what',
 'name',
 'a',
 'whale',
 'fish',
 'is',
 'to',
 'be',
 'called',
 'in',
 'our',
 'tongue',
 'leaving',
 'out',
 'through',
 'ignorance',
 'the',
 'letter',
 'H',
 'which'

Let's try to match all chapters in the book. First, let's match the chapter headings, which look similar to `"\nCHAPTER 1\r\n\r\nLoomings.\r\n"`. Check the novel [here](https://www.gutenberg.org/files/2701/2701-h/2701-h.htm#link2H_4_0002). Note that the full stop after the chapter number is not in the string.

In [40]:
re.findall(r"CHAPTER\s{1}\d+\s*\w+\.{1}", moby)

['CHAPTER 1\r\n\r\nLoomings.',
 'CHAPTER 5\r\n\r\nBreakfast.',
 'CHAPTER 11\r\n\r\nNightgown.',
 'CHAPTER 12\r\n\r\nBiographical.',
 'CHAPTER 13\r\n\r\nWheelbarrow.',
 'CHAPTER 14\r\n\r\nNantucket.',
 'CHAPTER 15\r\n\r\nChowder.',
 'CHAPTER 25\r\n\r\nPostscript.',
 'CHAPTER 28\r\n\r\nAhab.',
 'CHAPTER 32\r\n\r\nCetology.',
 'CHAPTER 37\r\n\r\nSunset.',
 'CHAPTER 38\r\n\r\nDusk.',
 'CHAPTER 46\r\n\r\nSurmises.',
 'CHAPTER 58\r\n\r\nBrit.',
 'CHAPTER 59\r\n\r\nSquid.',
 'CHAPTER 84\r\n\r\nPitchpoling.',
 'CHAPTER 92\r\n\r\nAmbergris.',
 'CHAPTER 121\r\n\r\nMidnight.']

In [41]:
re.findall(r"(CHAPTER\s{1}\d+)\s*(\w+\.{1})", moby)

[('CHAPTER 1', 'Loomings.'),
 ('CHAPTER 5', 'Breakfast.'),
 ('CHAPTER 11', 'Nightgown.'),
 ('CHAPTER 12', 'Biographical.'),
 ('CHAPTER 13', 'Wheelbarrow.'),
 ('CHAPTER 14', 'Nantucket.'),
 ('CHAPTER 15', 'Chowder.'),
 ('CHAPTER 25', 'Postscript.'),
 ('CHAPTER 28', 'Ahab.'),
 ('CHAPTER 32', 'Cetology.'),
 ('CHAPTER 37', 'Sunset.'),
 ('CHAPTER 38', 'Dusk.'),
 ('CHAPTER 46', 'Surmises.'),
 ('CHAPTER 58', 'Brit.'),
 ('CHAPTER 59', 'Squid.'),
 ('CHAPTER 84', 'Pitchpoling.'),
 ('CHAPTER 92', 'Ambergris.'),
 ('CHAPTER 121', 'Midnight.')]
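
Capture groups can also be named. The sketch below uses `re.finditer()` with named groups on a made-up excerpt that imitates the heading layout; the group names `number` and `title` are chosen for illustration.

```python
import re

# Made-up excerpt that imitates the layout of the chapter headings.
sample = "CHAPTER 1\r\n\r\nLoomings.\r\n\r\nCHAPTER 5\r\n\r\nBreakfast.\r\n"

# Named groups make each captured piece accessible by name.
heading = re.compile(r"CHAPTER\s(?P<number>\d+)\s*(?P<title>\w+)\.")

chapters = [(int(m.group("number")), m.group("title"))
            for m in heading.finditer(sample)]
print(chapters)
```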

In [42]:
re.findall(r"(CHAPTER\s{1}\d+)\s*(.+\.{1})", moby)

[('CHAPTER 1', 'Loomings.'),
 ('CHAPTER 2', 'The Carpet-Bag.'),
 ('CHAPTER 3', 'The Spouter-Inn.'),
 ('CHAPTER 4', 'The Counterpane.'),
 ('CHAPTER 5', 'Breakfast.'),
 ('CHAPTER 6', 'The Street.'),
 ('CHAPTER 7', 'The Chapel.'),
 ('CHAPTER 8', 'The Pulpit.'),
 ('CHAPTER 9', 'The Sermon.'),
 ('CHAPTER 10', 'A Bosom Friend.'),
 ('CHAPTER 11', 'Nightgown.'),
 ('CHAPTER 12', 'Biographical.'),
 ('CHAPTER 13', 'Wheelbarrow.'),
 ('CHAPTER 14', 'Nantucket.'),
 ('CHAPTER 15', 'Chowder.'),
 ('CHAPTER 16', 'The Ship.'),
 ('CHAPTER 17', 'The Ramadan.'),
 ('CHAPTER 18', 'His Mark.'),
 ('CHAPTER 19', 'The Prophet.'),
 ('CHAPTER 20', 'All Astir.'),
 ('CHAPTER 21', 'Going Aboard.'),
 ('CHAPTER 22', 'Merry Christmas.'),
 ('CHAPTER 23', 'The Lee Shore.'),
 ('CHAPTER 24', 'The Advocate.'),
 ('CHAPTER 25', 'Postscript.'),
 ('CHAPTER 26', 'Knights and Squires.'),
 ('CHAPTER 27', 'Knights and Squires.'),
 ('CHAPTER 28', 'Ahab.'),
 ('CHAPTER 29', 'Enter Ahab; to Him, Stubb.'),
 ('CHAPTER 30', 'The Pipe.'),
 (

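This pattern only matches titles that end with a full stop. See chapter 43, whose title ends with an exclamation mark.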

In [43]:
re.findall(r"(CHAPTER\s{1}\d+)\s*(.+[\.{1}|!{1}])", moby)

[('CHAPTER 1', 'Loomings.'),
 ('CHAPTER 2', 'The Carpet-Bag.'),
 ('CHAPTER 3', 'The Spouter-Inn.'),
 ('CHAPTER 4', 'The Counterpane.'),
 ('CHAPTER 5', 'Breakfast.'),
 ('CHAPTER 6', 'The Street.'),
 ('CHAPTER 7', 'The Chapel.'),
 ('CHAPTER 8', 'The Pulpit.'),
 ('CHAPTER 9', 'The Sermon.'),
 ('CHAPTER 10', 'A Bosom Friend.'),
 ('CHAPTER 11', 'Nightgown.'),
 ('CHAPTER 12', 'Biographical.'),
 ('CHAPTER 13', 'Wheelbarrow.'),
 ('CHAPTER 14', 'Nantucket.'),
 ('CHAPTER 15', 'Chowder.'),
 ('CHAPTER 16', 'The Ship.'),
 ('CHAPTER 17', 'The Ramadan.'),
 ('CHAPTER 18', 'His Mark.'),
 ('CHAPTER 19', 'The Prophet.'),
 ('CHAPTER 20', 'All Astir.'),
 ('CHAPTER 21', 'Going Aboard.'),
 ('CHAPTER 22', 'Merry Christmas.'),
 ('CHAPTER 23', 'The Lee Shore.'),
 ('CHAPTER 24', 'The Advocate.'),
 ('CHAPTER 25', 'Postscript.'),
 ('CHAPTER 26', 'Knights and Squires.'),
 ('CHAPTER 27', 'Knights and Squires.'),
 ('CHAPTER 28', 'Ahab.'),
 ('CHAPTER 29', 'Enter Ahab; to Him, Stubb.'),
 ('CHAPTER 30', 'The Pipe.'),
 (

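Chapter 1 reappeared further down the results! The text contains a second "CHAPTER 1" that follows a comma, which we can isolate with a non-capturing group `(?:...)`: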

In [44]:
re.findall(r"(?:,)\s*(CHAPTER\s{1}\d+)\s*(.+[\.{1}|!{1}])", moby) # do not capture

[('CHAPTER 1', '. (HUZZA PORPOISE).')]
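
A side note on the character class `[\.{1}|!{1}]` used in these patterns: inside square brackets, `{`, `1`, `}`, and `|` are ordinary characters rather than a quantifier or alternation, so the class also matches them. The sketch below, on a made-up probe string, compares it with the narrower class `[.!]`, which is suggested here as an assumption about the intended meaning and has not been rerun against the full novel.

```python
import re

original_class = r"[\.{1}|!{1}]"
narrow_class = r"[.!]"

# Made-up probe string containing every character the original class accepts.
probe = ". ! 1 | { } ? a"

original_hits = re.findall(original_class, probe)
narrow_hits = re.findall(narrow_class, probe)

print(original_hits)
print(narrow_hits)
```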

Let's use a negative lookbehind `(?<!...)`!

In [45]:
re.findall(r"(?<!,\s*)(CHAPTER\s{1}\d+)\s*(.+[\.{1}|!{1}])", moby)

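error: look-behind requires fixed-width pattern

Python's `re` module only allows lookbehind patterns of a fixed width, so `\s*` is replaced with `\s{1}`: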

In [46]:
re.findall(r"(?<!,\s{1})(CHAPTER\s{1}\d+)\s*(.+[\.{1}|!{1}])", moby)

[('CHAPTER 1', 'Loomings.'),
 ('CHAPTER 2', 'The Carpet-Bag.'),
 ('CHAPTER 3', 'The Spouter-Inn.'),
 ('CHAPTER 4', 'The Counterpane.'),
 ('CHAPTER 5', 'Breakfast.'),
 ('CHAPTER 6', 'The Street.'),
 ('CHAPTER 7', 'The Chapel.'),
 ('CHAPTER 8', 'The Pulpit.'),
 ('CHAPTER 9', 'The Sermon.'),
 ('CHAPTER 10', 'A Bosom Friend.'),
 ('CHAPTER 11', 'Nightgown.'),
 ('CHAPTER 12', 'Biographical.'),
 ('CHAPTER 13', 'Wheelbarrow.'),
 ('CHAPTER 14', 'Nantucket.'),
 ('CHAPTER 15', 'Chowder.'),
 ('CHAPTER 16', 'The Ship.'),
 ('CHAPTER 17', 'The Ramadan.'),
 ('CHAPTER 18', 'His Mark.'),
 ('CHAPTER 19', 'The Prophet.'),
 ('CHAPTER 20', 'All Astir.'),
 ('CHAPTER 21', 'Going Aboard.'),
 ('CHAPTER 22', 'Merry Christmas.'),
 ('CHAPTER 23', 'The Lee Shore.'),
 ('CHAPTER 24', 'The Advocate.'),
 ('CHAPTER 25', 'Postscript.'),
 ('CHAPTER 26', 'Knights and Squires.'),
 ('CHAPTER 27', 'Knights and Squires.'),
 ('CHAPTER 28', 'Ahab.'),
 ('CHAPTER 29', 'Enter Ahab; to Him, Stubb.'),
 ('CHAPTER 30', 'The Pipe.'),
 (
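
To see the negative lookbehind in isolation, here is a small sketch on a made-up excerpt with one heading and one in-text reference that follows a comma.

```python
import re

# Made-up excerpt: one heading and one in-text reference after a comma.
sample = "CHAPTER 1\r\n\r\nLoomings.\r\nsee BOOK III, CHAPTER 1. (HUZZA PORPOISE).\r\n"

# Without the lookbehind, both occurrences are found.
all_hits = re.findall(r"CHAPTER\s\d+", sample)
print(all_hits)

# (?<!,\s) rejects a match when the two characters before it are a comma
# followed by one whitespace character.
starts = [m.start() for m in re.finditer(r"(?<!,\s)CHAPTER\s\d+", sample)]
print(starts)
```

The lookbehind checks the preceding characters without consuming them, so the match itself still starts at `CHAPTER`.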

Let's find the unmatched chapters.

In [47]:
all_chapters = list(range(1, 136)) # the novel has 135 numbered chapters
matched_chapters = {int(i) for i in
                    re.findall(r"(?<!,\s{1})(?:CHAPTER\s{1})(\d+)(?:\s*.+[\.{1}|!{1}])", moby)}

In [48]:
[i for i in all_chapters if i not in matched_chapters]

[56, 57, 105]

Some chapter titles continue on a second line, so there is another newline to match!

In [49]:
re.findall(r"(?<!,\s{1})(CHAPTER\s{1}\d+)\s*(.+\s*.*[\.{1}|!{1}])", moby)

[('CHAPTER 1', 'Loomings.\r\n\r\n\r\nCall me Ishmael.'),
 ('CHAPTER 2', 'The Carpet-Bag.'),
 ('CHAPTER 3', 'The Spouter-Inn.'),
 ('CHAPTER 4', 'The Counterpane.'),
 ('CHAPTER 5', 'Breakfast.'),
 ('CHAPTER 6', 'The Street.'),
 ('CHAPTER 7', 'The Chapel.'),
 ('CHAPTER 8', 'The Pulpit.'),
 ('CHAPTER 9', 'The Sermon.'),
 ('CHAPTER 10', 'A Bosom Friend.'),
 ('CHAPTER 11', 'Nightgown.'),
 ('CHAPTER 12', 'Biographical.'),
 ('CHAPTER 13', 'Wheelbarrow.'),
 ('CHAPTER 14', 'Nantucket.'),
 ('CHAPTER 15', 'Chowder.'),
 ('CHAPTER 16',
  'The Ship.\r\n\r\n\r\nIn bed we concocted our plans for the morrow.'),
 ('CHAPTER 17', 'The Ramadan.'),
 ('CHAPTER 18', 'His Mark.'),
 ('CHAPTER 19', 'The Prophet.'),
 ('CHAPTER 20',
  'All Astir.\r\n\r\n\r\nA day or two passed, and there was great activity aboard the Pequod.'),
 ('CHAPTER 21', 'Going Aboard.'),
 ('CHAPTER 22', 'Merry Christmas.'),
 ('CHAPTER 23', 'The Lee Shore.'),
 ('CHAPTER 24', 'The Advocate.'),
 ('CHAPTER 25', 'Postscript.'),
 ('CHAPTER 26', 'K

Let's be lazy! Adding `?` after a quantifier (here `.+?`) makes it match as few characters as possible.

In [50]:
re.findall(r"(?<!,\s{1})(CHAPTER\s{1}\d+)\s*(.+?\s*.*[\.{1}|!{1}])", moby)

[('CHAPTER 1', 'Loomings.'),
 ('CHAPTER 2', 'The Carpet-Bag.'),
 ('CHAPTER 3', 'The Spouter-Inn.'),
 ('CHAPTER 4', 'The Counterpane.'),
 ('CHAPTER 5', 'Breakfast.'),
 ('CHAPTER 6', 'The Street.'),
 ('CHAPTER 7', 'The Chapel.'),
 ('CHAPTER 8', 'The Pulpit.'),
 ('CHAPTER 9', 'The Sermon.'),
 ('CHAPTER 10', 'A Bosom Friend.'),
 ('CHAPTER 11', 'Nightgown.'),
 ('CHAPTER 12', 'Biographical.'),
 ('CHAPTER 13', 'Wheelbarrow.'),
 ('CHAPTER 14', 'Nantucket.'),
 ('CHAPTER 15', 'Chowder.'),
 ('CHAPTER 16', 'The Ship.'),
 ('CHAPTER 17', 'The Ramadan.'),
 ('CHAPTER 18', 'His Mark.'),
 ('CHAPTER 19', 'The Prophet.'),
 ('CHAPTER 20', 'All Astir.'),
 ('CHAPTER 21', 'Going Aboard.'),
 ('CHAPTER 22', 'Merry Christmas.'),
 ('CHAPTER 23', 'The Lee Shore.'),
 ('CHAPTER 24', 'The Advocate.'),
 ('CHAPTER 25', 'Postscript.'),
 ('CHAPTER 26', 'Knights and Squires.'),
 ('CHAPTER 27', 'Knights and Squires.'),
 ('CHAPTER 28', 'Ahab.'),
 ('CHAPTER 29', 'Enter Ahab; to Him, Stubb.'),
 ('CHAPTER 30', 'The Pipe.'),
 (
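
The difference between greedy and lazy matching can be sketched on a made-up one-line string:

```python
import re

sample = "Loomings. Call me Ishmael."

greedy = re.findall(r".+\.", sample)   # as many characters as possible
lazy = re.findall(r".+?\.", sample)    # as few characters as possible

print(greedy)
print(lazy)
```

The greedy version runs to the last full stop on the line, while the lazy version stops at the first one it can.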

In [51]:
re.findall(r"(?<!,\s{1})(CHAPTER\s{1}\d+|Epilogue|EXTRACTS)\s*(.+?\s*.*[\.{1}|!{1}])", moby)

[('EXTRACTS', '(Supplied by a Sub-Sub-Librarian).'),
 ('EXTRACTS', '.\r\n\r\n"And God created great whales." --GENESIS.'),
 ('CHAPTER 1', 'Loomings.'),
 ('CHAPTER 2', 'The Carpet-Bag.'),
 ('CHAPTER 3', 'The Spouter-Inn.'),
 ('CHAPTER 4', 'The Counterpane.'),
 ('CHAPTER 5', 'Breakfast.'),
 ('CHAPTER 6', 'The Street.'),
 ('CHAPTER 7', 'The Chapel.'),
 ('CHAPTER 8', 'The Pulpit.'),
 ('CHAPTER 9', 'The Sermon.'),
 ('CHAPTER 10', 'A Bosom Friend.'),
 ('CHAPTER 11', 'Nightgown.'),
 ('CHAPTER 12', 'Biographical.'),
 ('CHAPTER 13', 'Wheelbarrow.'),
 ('CHAPTER 14', 'Nantucket.'),
 ('CHAPTER 15', 'Chowder.'),
 ('CHAPTER 16', 'The Ship.'),
 ('CHAPTER 17', 'The Ramadan.'),
 ('CHAPTER 18', 'His Mark.'),
 ('CHAPTER 19', 'The Prophet.'),
 ('CHAPTER 20', 'All Astir.'),
 ('CHAPTER 21', 'Going Aboard.'),
 ('CHAPTER 22', 'Merry Christmas.'),
 ('CHAPTER 23', 'The Lee Shore.'),
 ('CHAPTER 24', 'The Advocate.'),
 ('CHAPTER 25', 'Postscript.'),
 ('CHAPTER 26', 'Knights and Squires.'),
 ('CHAPTER 27', 'Knight

Let's use a positive lookahead `(?=...)`, so that `EXTRACTS` only matches when an opening parenthesis follows it.

In [52]:
re.findall(r"(?<!,\s{1})(CHAPTER\s{1}\d+|Epilogue|EXTRACTS(?=\s*\())\s*(.+?\s*.*[\.{1}|!{1}])", moby)

[('EXTRACTS', '(Supplied by a Sub-Sub-Librarian).'),
 ('CHAPTER 1', 'Loomings.'),
 ('CHAPTER 2', 'The Carpet-Bag.'),
 ('CHAPTER 3', 'The Spouter-Inn.'),
 ('CHAPTER 4', 'The Counterpane.'),
 ('CHAPTER 5', 'Breakfast.'),
 ('CHAPTER 6', 'The Street.'),
 ('CHAPTER 7', 'The Chapel.'),
 ('CHAPTER 8', 'The Pulpit.'),
 ('CHAPTER 9', 'The Sermon.'),
 ('CHAPTER 10', 'A Bosom Friend.'),
 ('CHAPTER 11', 'Nightgown.'),
 ('CHAPTER 12', 'Biographical.'),
 ('CHAPTER 13', 'Wheelbarrow.'),
 ('CHAPTER 14', 'Nantucket.'),
 ('CHAPTER 15', 'Chowder.'),
 ('CHAPTER 16', 'The Ship.'),
 ('CHAPTER 17', 'The Ramadan.'),
 ('CHAPTER 18', 'His Mark.'),
 ('CHAPTER 19', 'The Prophet.'),
 ('CHAPTER 20', 'All Astir.'),
 ('CHAPTER 21', 'Going Aboard.'),
 ('CHAPTER 22', 'Merry Christmas.'),
 ('CHAPTER 23', 'The Lee Shore.'),
 ('CHAPTER 24', 'The Advocate.'),
 ('CHAPTER 25', 'Postscript.'),
 ('CHAPTER 26', 'Knights and Squires.'),
 ('CHAPTER 27', 'Knights and Squires.'),
 ('CHAPTER 28', 'Ahab.'),
 ('CHAPTER 29', 'Enter Aha
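
The lookaround assertions used in these patterns can be tried in isolation. The sketch below uses two made-up strings (`sample` and `cited` are assumptions, not taken from `moby`) to show that a lookahead or lookbehind checks the surrounding text without consuming it.

```python
import re

# Hypothetical snippet containing the word EXTRACTS twice.
sample = 'EXTRACTS (Supplied by a Sub-Sub-Librarian).\n\nEXTRACTS.\n\n"And God created great whales."'

plain = re.findall(r"EXTRACTS", sample)             # matches both occurrences
guarded = re.findall(r"EXTRACTS(?=\s*\()", sample)  # only when "(" follows

# The lookahead is zero-width: the match ends right after "EXTRACTS".
m = re.search(r"EXTRACTS(?=\s*\()", sample)
print(m.group(), m.end())

# Negative lookbehind: skip "CHAPTER n" when it is preceded by a comma and a space.
cited = "CHAPTER 1\n\nas noted in Vol. II, CHAPTER 5"
headings = re.findall(r"(?<!,\s)CHAPTER \d+", cited)

print(plain, guarded, headings)
```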

To match `"ETYMOLOGY."`, we have to allow a closing parenthesis at the end of the title. (Note the extra `\.*`, which skips the period after the heading!)

In [53]:
re.findall(r"(?<!,\s{1})(ETYMOLOGY|CHAPTER\s{1}\d+|Epilogue|EXTRACTS(?=\s*\())\.*\s*(.+?\s*.*[\.{1}|!{1}|\){1}])", moby)

[('ETYMOLOGY', '(Supplied by a Late Consumptive Usher to a Grammar School)'),
 ('EXTRACTS', '(Supplied by a Sub-Sub-Librarian).'),
 ('CHAPTER 1', 'Loomings.'),
 ('CHAPTER 2', 'The Carpet-Bag.'),
 ('CHAPTER 3', 'The Spouter-Inn.'),
 ('CHAPTER 4', 'The Counterpane.'),
 ('CHAPTER 5', 'Breakfast.'),
 ('CHAPTER 6', 'The Street.'),
 ('CHAPTER 7', 'The Chapel.'),
 ('CHAPTER 8', 'The Pulpit.'),
 ('CHAPTER 9', 'The Sermon.'),
 ('CHAPTER 10', 'A Bosom Friend.'),
 ('CHAPTER 11', 'Nightgown.'),
 ('CHAPTER 12', 'Biographical.'),
 ('CHAPTER 13', 'Wheelbarrow.'),
 ('CHAPTER 14', 'Nantucket.'),
 ('CHAPTER 15', 'Chowder.'),
 ('CHAPTER 16', 'The Ship.'),
 ('CHAPTER 17', 'The Ramadan.'),
 ('CHAPTER 18', 'His Mark.'),
 ('CHAPTER 19', 'The Prophet.'),
 ('CHAPTER 20', 'All Astir.'),
 ('CHAPTER 21', 'Going Aboard.'),
 ('CHAPTER 22', 'Merry Christmas.'),
 ('CHAPTER 23', 'The Lee Shore.'),
 ('CHAPTER 24', 'The Advocate.'),
 ('CHAPTER 25', 'Postscript.'),
 ('CHAPTER 26', 'Knights and Squires.'),
 ('CHAPTER 27',
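
One detail of these patterns is worth checking: inside square brackets, `{1}` and `|` are not a quantifier and an alternation but literal characters. The sketch below compares the character class used above with a shorter one; the names `loose` and `tight` are made up for illustration, and whether the shorter class gives identical results on `moby` is not verified here.

```python
import re

loose = r"[\.{1}|!{1}]"  # the class used above
tight = r"[.!]"          # a period or an exclamation mark only

# Every one of these single characters is accepted by the loose class.
accepted_by_loose = [ch for ch in ".!{1}|" if re.fullmatch(loose, ch)]
accepted_by_tight = [ch for ch in ".!{1}|" if re.fullmatch(tight, ch)]

print(accepted_by_loose)
print(accepted_by_tight)
```

Since a character class always matches exactly one character, the `{1}` is not needed in either spelling.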

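Perfect! But what if we want the chapter text that follows each matched heading? First, let's match each heading and title as a single string, using one capturing group around the whole pattern and non-capturing groups `(?:...)` inside it.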

In [54]:
re.findall(r"((?<!,\s{1})(?:ETYMOLOGY|CHAPTER\s{1}\d+|Epilogue|EXTRACTS(?=\s*\())(?:\.*\s*).+?\s*.*[\.{1}|!{1}|\){1}])", moby)

['ETYMOLOGY.\r\n\r\n(Supplied by a Late Consumptive Usher to a Grammar School)',
 'EXTRACTS (Supplied by a Sub-Sub-Librarian).',
 'CHAPTER 1\r\n\r\nLoomings.',
 'CHAPTER 2\r\n\r\nThe Carpet-Bag.',
 'CHAPTER 3\r\n\r\nThe Spouter-Inn.',
 'CHAPTER 4\r\n\r\nThe Counterpane.',
 'CHAPTER 5\r\n\r\nBreakfast.',
 'CHAPTER 6\r\n\r\nThe Street.',
 'CHAPTER 7\r\n\r\nThe Chapel.',
 'CHAPTER 8\r\n\r\nThe Pulpit.',
 'CHAPTER 9\r\n\r\nThe Sermon.',
 'CHAPTER 10\r\n\r\nA Bosom Friend.',
 'CHAPTER 11\r\n\r\nNightgown.',
 'CHAPTER 12\r\n\r\nBiographical.',
 'CHAPTER 13\r\n\r\nWheelbarrow.',
 'CHAPTER 14\r\n\r\nNantucket.',
 'CHAPTER 15\r\n\r\nChowder.',
 'CHAPTER 16\r\n\r\nThe Ship.',
 'CHAPTER 17\r\n\r\nThe Ramadan.',
 'CHAPTER 18\r\n\r\nHis Mark.',
 'CHAPTER 19\r\n\r\nThe Prophet.',
 'CHAPTER 20\r\n\r\nAll Astir.',
 'CHAPTER 21\r\n\r\nGoing Aboard.',
 'CHAPTER 22\r\n\r\nMerry Christmas.',
 'CHAPTER 23\r\n\r\nThe Lee Shore.',
 'CHAPTER 24\r\n\r\nThe Advocate.',
 'CHAPTER 25\r\n\r\nPostscript.',
 'CHAPTE

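Check the [docs](https://docs.python.org/3/library/re.html#re.split) for `re.split`: if the pattern contains a capturing group, the text matched by the group is also returned in the result list. Remove the capturing group when splitting!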

In [55]:
chapters = re.split(r"(?<!,\s{1})(?:ETYMOLOGY|CHAPTER\s{1}\d+|Epilogue|EXTRACTS(?=\s*\())(?:\.*\s*).+?\s*.*[\.{1}|!{1}|\){1}]", moby)
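
The effect of the capturing group on `re.split` is easiest to see on a short string. This is a minimal sketch with a made-up `sample`, not the actual text of the novel.

```python
import re

# Hypothetical one-line stand-in for the novel.
sample = "front matter CHAPTER 1 Loomings text CHAPTER 2 Carpet-Bag text"

# With a capturing group, the separators are kept in the result.
with_group = re.split(r"(CHAPTER \d+)", sample)

# With a non-capturing group, only the text between separators is returned.
without_group = re.split(r"(?:CHAPTER \d+)", sample)

print(with_group)
print(without_group)
```

In both cases the first element is whatever precedes the first heading, which is why the first chapter is not at index 0.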

In [56]:
chapters = [re.sub(r"\s+", " ", chapter) for chapter in chapters]

In [57]:
chapter = chapters[3]
chapter

' Call me Ishmael. Some years ago--never mind how long precisely--having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world. It is a way I have of driving off the spleen and regulating the circulation. Whenever I find myself growing grim about the mouth; whenever it is a damp, drizzly November in my soul; whenever I find myself involuntarily pausing before coffin warehouses, and bringing up the rear of every funeral I meet; and especially whenever my hypos get such an upper hand of me, that it requires a strong moral principle to prevent me from deliberately stepping into the street, and methodically knocking people\'s hats off--then, I account it high time to get to sea as soon as I can. This is my substitute for pistol and ball. With a philosophical flourish Cato throws himself upon his sword; I quietly take to the ship. There is nothing surprising in this. If they but knew it, alm

Back to tokenizing! Tokenizing natural languages is a difficult problem. Some tokenizers work better for certain kinds of documents than others.

Before building your own tokenizer, try the tokenizers included with __nltk__, in the `nltk.tokenize` submodule.
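
To illustrate why tokenizing is hard, here is a small sketch using only the standard `re` module on a phrase from the chapter. The second pattern is just one possible refinement, shown as an assumption rather than a recommended tokenizer.

```python
import re

sample = "knocking people's hats off--then, I account it high time"

# Splitting on word characters breaks the possessive into two tokens.
simple = re.findall(r"\w+", sample)

# Allowing an apostrophe inside a word keeps "people's" together.
with_apostrophes = re.findall(r"\w+(?:'\w+)*", sample)

print(simple)
print(with_apostrophes)
```

Neither pattern is right for every document; hyphenated words, abbreviations, and numbers raise similar questions.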

### Standardizing Text

We standardize numerical data in order to make fair comparisons, comparisons that are not influenced by the location and scale of the data. Similarly, you can standardize text (tokens) to make sure comparisons are fair and accurate.

For example, `"Cat"` and `"cat"` are the same word even though they're different tokens. Converting all characters to lowercase is one way to standardize a document.

Some common standardization techniques for text are:

* Lowercasing
* Stemming: Use patterns to remove prefixes and suffixes from words.
* Lemmatization: Look up each token in a dictionary and replace it with a root word. Similar to stemming, but more accurate.
* Stopword Removal: Remove tokens that don't contribute meaning. For example, "the" is meaningless on its own.
* Identifying Outliers: Identify and possibly remove non-standard "words" like numbers, misspellings, code, etc.

How and whether you should standardize a document or corpus depends on what kind of analysis you want to do. There is no formula; you must think carefully and experiment to determine which standardization techniques work best for your problem.

#### Lowercasing

You can use Python's string methods for simple text transformations.

In [58]:
chapter[:100]

' Call me Ishmael. Some years ago--never mind how long precisely--having little or no money in my pur'

In [59]:
chapter.lower()[:100]

' call me ishmael. some years ago--never mind how long precisely--having little or no money in my pur'

In [60]:
chapter.upper()[:100]

' CALL ME ISHMAEL. SOME YEARS AGO--NEVER MIND HOW LONG PRECISELY--HAVING LITTLE OR NO MONEY IN MY PUR'

In [61]:
words = re.findall(r"\w+", chapter)

In [62]:
words

['Call',
 'me',
 'Ishmael',
 'Some',
 'years',
 'ago',
 'never',
 'mind',
 'how',
 'long',
 'precisely',
 'having',
 'little',
 'or',
 'no',
 'money',
 'in',
 'my',
 'purse',
 'and',
 'nothing',
 'particular',
 'to',
 'interest',
 'me',
 'on',
 'shore',
 'I',
 'thought',
 'I',
 'would',
 'sail',
 'about',
 'a',
 'little',
 'and',
 'see',
 'the',
 'watery',
 'part',
 'of',
 'the',
 'world',
 'It',
 'is',
 'a',
 'way',
 'I',
 'have',
 'of',
 'driving',
 'off',
 'the',
 'spleen',
 'and',
 'regulating',
 'the',
 'circulation',
 'Whenever',
 'I',
 'find',
 'myself',
 'growing',
 'grim',
 'about',
 'the',
 'mouth',
 'whenever',
 'it',
 'is',
 'a',
 'damp',
 'drizzly',
 'November',
 'in',
 'my',
 'soul',
 'whenever',
 'I',
 'find',
 'myself',
 'involuntarily',
 'pausing',
 'before',
 'coffin',
 'warehouses',
 'and',
 'bringing',
 'up',
 'the',
 'rear',
 'of',
 'every',
 'funeral',
 'I',
 'meet',
 'and',
 'especially',
 'whenever',
 'my',
 'hypos',
 'get',
 'such',
 'an',
 'upper',
 'hand',
 '

In [63]:
lower = [w.lower() for w in words]  # lowercase every token; use w.upper() for uppercase
lower[:20]

['call',
 'me',
 'ishmael',
 'some',
 'years',
 'ago',
 'never',
 'mind',
 'how',
 'long',
 'precisely',
 'having',
 'little',
 'or',
 'no',
 'money',
 'in',
 'my',
 'purse',
 'and']
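
Lowercasing matters as soon as you count tokens. The following sketch uses a short, made-up token list (`tokens` is an assumption) and `collections.Counter` to show how the counts for "Whenever" and "whenever" merge.

```python
from collections import Counter

# Hypothetical token list in the style of the chapter.
tokens = ["Whenever", "I", "find", "myself", "whenever", "it", "is", "whenever"]

raw_counts = Counter(tokens)
lower_counts = Counter(t.lower() for t in tokens)

print(raw_counts["Whenever"], raw_counts["whenever"])  # counted separately
print(lower_counts["whenever"])                        # counted together

# str.casefold() is a more aggressive variant intended for caseless matching.
print("Straße".casefold())
```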

#### Stemming

_Stemming_ runs an algorithm on each token to remove affixes (prefixes and suffixes). The result is called a _stem_.

Stemming is useful if you want to ignore affixes.

For example, most English verbs use suffixes to mark the tense. We write "They fish" (present) and "They fished" (past). Without any standardization, the tokens "fish" and "fished" would be treated as separate words. Stemming converts both tokens to the common stem "fish". Here is the Porter stemmer applied to the tokens from our chapter:

In [64]:
stemmer = nltk.PorterStemmer()  # create the stemmer once and reuse it for every token
[stemmer.stem(w) for w in words]

['call',
 'me',
 'ishmael',
 'some',
 'year',
 'ago',
 'never',
 'mind',
 'how',
 'long',
 'precis',
 'have',
 'littl',
 'or',
 'no',
 'money',
 'in',
 'my',
 'purs',
 'and',
 'noth',
 'particular',
 'to',
 'interest',
 'me',
 'on',
 'shore',
 'i',
 'thought',
 'i',
 'would',
 'sail',
 'about',
 'a',
 'littl',
 'and',
 'see',
 'the',
 'wateri',
 'part',
 'of',
 'the',
 'world',
 'it',
 'is',
 'a',
 'way',
 'i',
 'have',
 'of',
 'drive',
 'off',
 'the',
 'spleen',
 'and',
 'regul',
 'the',
 'circul',
 'whenev',
 'i',
 'find',
 'myself',
 'grow',
 'grim',
 'about',
 'the',
 'mouth',
 'whenev',
 'it',
 'is',
 'a',
 'damp',
 'drizzli',
 'novemb',
 'in',
 'my',
 'soul',
 'whenev',
 'i',
 'find',
 'myself',
 'involuntarili',
 'paus',
 'befor',
 'coffin',
 'warehous',
 'and',
 'bring',
 'up',
 'the',
 'rear',
 'of',
 'everi',
 'funer',
 'i',
 'meet',
 'and',
 'especi',
 'whenev',
 'my',
 'hypo',
 'get',
 'such',
 'an',
 'upper',
 'hand',
 'of',
 'me',
 'that',
 'it',
 'requir',
 'a',
 'strong

In [65]:
print(nltk.PorterStemmer().stem("whales"))
print(nltk.PorterStemmer().stem("whaling"))
print(nltk.PorterStemmer().stem("whalebone"))
print(nltk.PorterStemmer().stem("narwhales"))

whale
whale
whalebon
narwhal


Stemmers use a sequence of rules to determine the stem for each token, but natural languages are full of special cases and exceptions. So as you can see in the example above, some stems are not words (`"whalebon"`), and sometimes tokens that seem like they should have the same stem don't.
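
To make the idea of rule-based stemming concrete, here is a deliberately tiny suffix stripper. It is a toy written for illustration, not the Porter algorithm; the rule list `SUFFIXES`, the function `toy_stem`, and the `min_stem_len` parameter are all hypothetical.

```python
# Toy suffix rules, tried in order; this is NOT the Porter algorithm.
SUFFIXES = ("ing", "ed", "es", "s")


def toy_stem(token, min_stem_len=3):
    """Lowercase the token and strip the first matching suffix."""
    token = token.lower()
    for suffix in SUFFIXES:
        stem = token[: len(token) - len(suffix)]
        # Only strip when enough of the word is left over.
        if token.endswith(suffix) and len(stem) >= min_stem_len:
            return stem
    return token


for word in ["fish", "fished", "fishing", "whale", "whales"]:
    print(word, "->", toy_stem(word))
```

Even this small rule set shows both problems: "whales" becomes "whal", which is not a word, and it no longer shares a stem with "whale".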

Several different stemmers are provided in the `nltk.stem` submodule.

#### Lemmatization

_Lemmatization_ looks up each token in a dictionary to find a root word, or _lemma_.

Lemmatization serves the same purpose as stemming. Lemmatization is more accurate, but requires a dictionary and usually takes longer.
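
The dictionary lookup can be sketched in a few lines. The tiny `LEMMAS` table and the `toy_lemmatize` function below are made up for illustration; a real lemmatizer relies on a large lexical database such as WordNet.

```python
# Hypothetical miniature dictionary mapping word forms to lemmas.
LEMMAS = {"whales": "whale", "mice": "mouse", "went": "go"}


def toy_lemmatize(token):
    """Return the dictionary lemma, or the lowercased token if it is unknown."""
    token = token.lower()
    return LEMMAS.get(token, token)


for word in ["Whales", "mice", "went", "whalebone"]:
    print(word, "->", toy_lemmatize(word))
```

Irregular forms like "mice" and "went" are where a dictionary has the advantage over suffix rules, at the cost of building and searching the dictionary.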

In [66]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /Users/peter/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [67]:
nltk.WordNetLemmatizer().lemmatize("whales")

'whale'

In [68]:
nltk.WordNetLemmatizer().lemmatize("whaling")

'whaling'

In [69]:
nltk.WordNetLemmatizer().lemmatize("whaling", "v")  # "v" marks the token as a verb, so the lemma is 'whale'

'whale'

In [70]:
nltk.WordNetLemmatizer().lemmatize("whalebone")

'whalebone'

The WordNet lemmatizer needs part-of-speech information to lemmatize words correctly; without it, every token is treated as a noun, which is why `"whaling"` was left unchanged above. You can get approximate part-of-speech information with __nltk__'s `pos_tag()` function.

In [71]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/peter/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [72]:
nltk.pos_tag(["whaling"])

[('whaling', 'VBG')]

In [73]:
nltk.pos_tag(["whale"])

[('whale', 'NN')]
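
The tags returned by `pos_tag()` (such as `VBG` or `NN`) are not in the form the WordNet lemmatizer expects, which is a single letter: `"n"`, `"v"`, `"a"`, or `"r"`. One common way to bridge the two is to look at the start of the tag. The helper below is a sketch; the name `tag_to_wordnet_pos` is made up, and falling back to noun for all other tags is an assumption.

```python
def tag_to_wordnet_pos(tag):
    """Map a tag such as 'VBG' to a WordNet part-of-speech letter."""
    if tag.startswith("J"):   # JJ, JJR, JJS: adjectives
        return "a"
    if tag.startswith("V"):   # VB, VBD, VBG, ...: verbs
        return "v"
    if tag.startswith("RB"):  # RB, RBR, RBS: adverbs
        return "r"
    return "n"                # assumption: treat everything else as a noun


for tag in ["VBG", "NN", "JJ", "RB", "PRP"]:
    print(tag, "->", tag_to_wordnet_pos(tag))
```

With __nltk__ and its data installed, the result can be passed as the second argument, for example `nltk.WordNetLemmatizer().lemmatize(word, tag_to_wordnet_pos(tag))` for each `(word, tag)` pair from `nltk.pos_tag()`.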

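By default, `nltk.pos_tag()` returns Penn Treebank POS tags such as `VBG` and `NN`. They are related to, but not the same as, the older [Brown POS tags][brown].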

[brown]: https://en.wikipedia.org/wiki/Brown_Corpus#Part-of-speech_tags_used

#### Foreign language

In [74]:
from nltk.stem.snowball import SnowballStemmer

In [75]:
fr = SnowballStemmer('french')

sent = "En mathématiques, une fonction càdlàg (continue à droite, limite à gauche) est ..."
nltk.word_tokenize(sent)

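# Caution: pos_tag() uses an English model by default, so the tags for French stems are unreliable.
nltk.pos_tag([fr.stem(word) for word in nltk.word_tokenize(sent)])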

[('en', 'JJ'),
 ('mathémat', 'NN'),
 (',', ','),
 ('une', 'JJ'),
 ('fonction', 'NN'),
 ('càdlàg', 'NN'),
 ('(', '('),
 ('continu', 'JJ'),
 ('à', 'NNP'),
 ('droit', 'NN'),
 (',', ','),
 ('limit', 'NN'),
 ('à', 'NNP'),
 ('gauch', 'NN'),
 (')', ')'),
 ('est', 'NN'),
 ('...', ':')]

In [77]:
moby_tags = nltk.pos_tag(words)
moby_tags

[('Call', 'VB'),
 ('me', 'PRP'),
 ('Ishmael', 'NNP'),
 ('Some', 'DT'),
 ('years', 'NNS'),
 ('ago', 'RB'),
 ('never', 'RB'),
 ('mind', 'VB'),
 ('how', 'WRB'),
 ('long', 'JJ'),
 ('precisely', 'RB'),
 ('having', 'VBG'),
 ('little', 'JJ'),
 ('or', 'CC'),
 ('no', 'DT'),
 ('money', 'NN'),
 ('in', 'IN'),
 ('my', 'PRP$'),
 ('purse', 'NN'),
 ('and', 'CC'),
 ('nothing', 'NN'),
 ('particular', 'JJ'),
 ('to', 'TO'),
 ('interest', 'NN'),
 ('me', 'PRP'),
 ('on', 'IN'),
 ('shore', 'NN'),
 ('I', 'PRP'),
 ('thought', 'VBD'),
 ('I', 'PRP'),
 ('would', 'MD'),
 ('sail', 'VB'),
 ('about', 'IN'),
 ('a', 'DT'),
 ('little', 'JJ'),
 ('and', 'CC'),
 ('see', 'VB'),
 ('the', 'DT'),
 ('watery', 'JJ'),
 ('part', 'NN'),
 ('of', 'IN'),
 ('the', 'DT'),
 ('world', 'NN'),
 ('It', 'PRP'),
 ('is', 'VBZ'),
 ('a', 'DT'),
 ('way', 'NN'),
 ('I', 'PRP'),
 ('have', 'VBP'),
 ('of', 'IN'),
 ('driving', 'VBG'),
 ('off', 'RP'),
 ('the', 'DT'),
 ('spleen', 'NN'),
 ('and', 'CC'),
 ('regulating', 'VBG'),
 ('the', 'DT'),
 ('circulation

The `nltk.stem` submodule also provides several different lemmatizers.

#### Stopword Removal

_Stopwords_ are words that appear frequently but don't add meaning.

In English, "the", "a", and "at" are examples. However, exactly which words are stopwords depends on your analysis. Words that are meaningless in one analysis might be very important in others.

You can filter out stopwords with a list comprehension:

In [78]:
stopwords = ["the", "a", "and", "or", "in", "by"]
[w for w in words if w not in stopwords]

['Call',
 'me',
 'Ishmael',
 'Some',
 'years',
 'ago',
 'never',
 'mind',
 'how',
 'long',
 'precisely',
 'having',
 'little',
 'no',
 'money',
 'my',
 'purse',
 'nothing',
 'particular',
 'to',
 'interest',
 'me',
 'on',
 'shore',
 'I',
 'thought',
 'I',
 'would',
 'sail',
 'about',
 'little',
 'see',
 'watery',
 'part',
 'of',
 'world',
 'It',
 'is',
 'way',
 'I',
 'have',
 'of',
 'driving',
 'off',
 'spleen',
 'regulating',
 'circulation',
 'Whenever',
 'I',
 'find',
 'myself',
 'growing',
 'grim',
 'about',
 'mouth',
 'whenever',
 'it',
 'is',
 'damp',
 'drizzly',
 'November',
 'my',
 'soul',
 'whenever',
 'I',
 'find',
 'myself',
 'involuntarily',
 'pausing',
 'before',
 'coffin',
 'warehouses',
 'bringing',
 'up',
 'rear',
 'of',
 'every',
 'funeral',
 'I',
 'meet',
 'especially',
 'whenever',
 'my',
 'hypos',
 'get',
 'such',
 'an',
 'upper',
 'hand',
 'of',
 'me',
 'that',
 'it',
 'requires',
 'strong',
 'moral',
 'principle',
 'to',
 'prevent',
 'me',
 'from',
 'deliberately',

__nltk__ also provides a stopwords corpus that contains common stopwords for several languages.

In [79]:
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to /Users/peter/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [80]:
stopwords = nltk.corpus.stopwords.words("english")
[w for w in words if w not in stopwords]

['Call',
 'Ishmael',
 'Some',
 'years',
 'ago',
 'never',
 'mind',
 'long',
 'precisely',
 'little',
 'money',
 'purse',
 'nothing',
 'particular',
 'interest',
 'shore',
 'I',
 'thought',
 'I',
 'would',
 'sail',
 'little',
 'see',
 'watery',
 'part',
 'world',
 'It',
 'way',
 'I',
 'driving',
 'spleen',
 'regulating',
 'circulation',
 'Whenever',
 'I',
 'find',
 'growing',
 'grim',
 'mouth',
 'whenever',
 'damp',
 'drizzly',
 'November',
 'soul',
 'whenever',
 'I',
 'find',
 'involuntarily',
 'pausing',
 'coffin',
 'warehouses',
 'bringing',
 'rear',
 'every',
 'funeral',
 'I',
 'meet',
 'especially',
 'whenever',
 'hypos',
 'get',
 'upper',
 'hand',
 'requires',
 'strong',
 'moral',
 'principle',
 'prevent',
 'deliberately',
 'stepping',
 'street',
 'methodically',
 'knocking',
 'people',
 'hats',
 'I',
 'account',
 'high',
 'time',
 'get',
 'sea',
 'soon',
 'I',
 'This',
 'substitute',
 'pistol',
 'ball',
 'With',
 'philosophical',
 'flourish',
 'Cato',
 'throws',
 'upon',
 'sword'
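
Notice that capitalized tokens such as "It", "I", and "This" survive the filter above, because the comparison is case-sensitive and the stopword list is lowercase. A minimal sketch of a case-insensitive filter follows; the short `tokens` and `stop_set` values are made up so that the example runs without any downloads.

```python
# Hypothetical tokens and a small lowercase stopword set.
tokens = ["It", "is", "a", "way", "I", "have", "of", "driving", "off", "the", "spleen"]
stop_set = {"it", "is", "a", "i", "have", "of", "off", "the"}

# Case-sensitive comparison keeps "It" and "I".
case_sensitive = [w for w in tokens if w not in stop_set]

# Lowercasing each token before the lookup removes them as well.
case_insensitive = [w for w in tokens if w.lower() not in stop_set]

print(case_sensitive)
print(case_insensitive)
```

Storing the stopwords in a `set` rather than a list also makes each membership test faster, which matters for a whole novel.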

### Summary 

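- Regular expressions are a powerful tool for finding structure, such as chapter headings, in natural-language text.
- Which standardization steps you apply (lowercasing, stemming, lemmatization, stopword removal) depends on the analysis you want to do.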