# STA 141B Data & Web Technologies for Data Analysis

### Lecture 12, 2/18/25, Natural language processing


### Announcements 

- HW 2 is graded
- HW 3 is due this Sunday
- Form groups for the final project

### Last week's topics

- Scraping


### Today's topics
- Natural Language Processing
     - `nltk` package
     - Tokenization
     - Regular Expressions
     - Standardizing Text

### Resources
- [Natural Language Processing with Python][nlpp], chapters 1-3. Beware: the print version is for Python 2.
- [Scikit-Learn Documentation][skl], especially the section about [Text Feature Extraction](https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction)


[PDSH]: https://jakevdp.github.io/PythonDataScienceHandbook/
[ProGit]: https://git-scm.com/book/
[nlpp]: https://www.nltk.org/book/
[atap]: https://search.library.ucdavis.edu/primo-explore/fulldisplay?docid=01UCD_ALMA51320822340003126&context=L&vid=01UCD_V1&search_scope=everything_scope&tab=default_tab&lang=en_US
[skl]: https://scikit-learn.org/stable/documentation.html


### Natural Language Processing

A _natural language_ is a language people use to communicate, like English, Spanish, or Mandarin. These languages evolved over thousands of years and do not have simple, explicit rules.

_Natural language processing_ (NLP) means using a computer to analyze, manipulate, or synthesize natural language. Some examples of NLP tasks are:
* Translating from one language to another
* Recognizing speech or handwriting
* Tagging sentences with metadata, such as parts of speech (verbs, nouns, etc.) or sentiment
* Extracting information or computing statistics from text

Compared to artificial languages like Python and XML, it's much more difficult to extract information from natural languages. NLP is a wide field; we only have time to learn the absolute basics. If you want to learn more, consider reading the entire [Natural Language Processing with Python][nlpp] book or taking a class in computational linguistics.


#### The Python NLP Ecosystem

There are lots of Python packages for NLP (try searching online)! A few popular ones are:

* [Natural Language Tool Kit][nltk] (`nltk`) is the most popular. It's designed for learning and research, so it's well-documented and has lots of features. We will use `nltk` for this class. 
* [TextBlob][textblob] is a "simplified" package. It has a nicer interface than NLTK, but fewer features.
* [SpaCy][spacy] is a "production-ready" package, and the fastest of all the packages listed here. Useful for working with large natural language datasets.
* [gensim][gensim] is a package for creating topic models, which are a kind of statistical model that predicts the topics of a text.

We're going to learn `nltk`, but you might want to try some of the others if your project involves NLP.

[Stanford's Core NLP][CoreNLP] library is at the cutting edge of NLP research. It's developed in Java, but several Python packages provide an interface (such as [pynlp][] and [stanford-corenlp][]).

[nltk]: https://www.nltk.org/
[spacy]: https://spacy.io/
[textblob]: https://textblob.readthedocs.io/en/dev/
[gensim]: https://radimrehurek.com/gensim/
[CoreNLP]: https://stanfordnlp.github.io/CoreNLP/
[pynlp]: https://github.com/sina-al/pynlp
[stanford-corenlp]: https://github.com/Lynten/stanford-corenlp

#### Corpora and Documents

A _document_ is a single body of text. When working with natural language data, documents are the unit of observation.

What you choose as a document depends on the purpose of your analysis. If you're studying how people react to news on Twitter, it makes sense to use individual tweets as documents. If you're studying how animals are portrayed in 19th-century literature, you could use individual novels as documents.

A _corpus_ is a collection of documents. In other words, a corpus is a dataset.

`nltk` provides some example corpora in the `nltk.corpus` submodule. The documentation gives a [complete list](http://www.nltk.org/nltk_data/). Most have to be downloaded with `nltk.download()` before use.

In [None]:
import nltk.corpus

In [None]:
# Download books from Project Gutenberg
nltk.download("gutenberg")

The `.fileids()` method lists the documents in a corpus.

In [None]:
nltk.corpus.gutenberg.fileids()

Let's talk about [whales](https://www.gutenberg.org/files/2701/2701-h/2701-h.htm#link2H_4_0002). The `.raw()` method returns the raw text for a single document. Specify the document by its file ID.

In [None]:
moby = nltk.corpus.gutenberg.raw("melville-moby_dick.txt")

In [None]:
moby[0:2000]

### Tokenization

A _token_ is a sequence of characters to be treated as a group. Tokens are the unit of analysis for an individual document.

Tokens can represent paragraphs, sentences, words, or something else. Most of the time, tokens will be words.

When you analyze a document, the first step will usually be to split the document into tokens. Functions that do this are called _tokenizers_, and this process is called _tokenization_.

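The `nltk.sent_tokenize()` function splits a document into sentences, and the `nltk.word_tokenize()` function splits a document into words. Both rely on the Punkt tokenizer models, which have to be downloaded once with `nltk.download()` (the resource is named `"punkt"`, or `"punkt_tab"` in recent NLTK releases).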

In [None]:
type(nltk.sent_tokenize(moby))

In [None]:
nltk.sent_tokenize(moby)[1]

In [None]:
nltk.sent_tokenize(moby)[283]

In [None]:
nltk.word_tokenize(moby)[0:20]

Corpora also have `.sents()` and `.words()` methods for tokenization. These methods are specialized to the corpus, so they sometimes use different strategies than `sent_tokenize()` and `word_tokenize()`.

In [None]:
tmp = nltk.corpus.gutenberg.sents("melville-moby_dick.txt")

In [None]:
tmp[2]

In [None]:
nltk.corpus.gutenberg.words("melville-moby_dick.txt")[:10]

### Strings and String Methods

Let's continue talking about 🐋. How does word tokenization actually work? The simplest strategy is to split at whitespace. You can do this with Python's built-in string methods:

In [None]:
moby.split()[:10] # splits on whitespace

Splitting on whitespace doesn't handle punctuation. You can use regular expressions to split on more complex patterns. Python's built-in [`re` module](https://docs.python.org/3/library/re.html) provides regular expression functions, including `re.split`:

```
re.split(pattern, string, maxsplit=0, flags=0)
```
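As a small illustration of the signature above, the sketch below splits a made-up sample string (not taken from the corpus) on spaces and commas, first without a limit and then with `maxsplit=2`.

```python
import re

# Hypothetical sample text, chosen only for illustration
sample = "call me,Ishmael some,years"

# Split at every space or comma
all_parts = re.split(r"[ ,]", sample)

# Stop after two splits; the remainder stays in the last element
first_two = re.split(r"[ ,]", sample, maxsplit=2)

print(all_parts)
print(first_two)
```

With `maxsplit=0` (the default), there is no limit on the number of splits.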

In [None]:
import re

In [None]:
moby[:100]

In [None]:
re.split("[ ',:]", '[Moby Dick by Herman Melville 1851]\r\n\r\n\r\nETYMOLOGY.\r\n\r\n(Supplied by a Late Consumptive Usher to a Gr')

What if we also want to split at newlines?

### Escape Sequences and Raw Strings

In Python strings, backslash `\` marks the beginning of an _escape sequence_. Escape sequences are special codes for writing characters that you can't otherwise type. For example, `\n` is a new line character and `\t` is a tab character.

Since `\` has a special meaning in strings, to write a literal `\` you must use the escape sequence `\\`.

You can see the actual characters in a string by printing the string:

In [None]:
print("hello\n\t\\world.")
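To check how escape sequences are stored, it can help to compare what `repr()` and `len()` report. A minimal sketch, reusing the string from the cell above:

```python
s = "hello\n\t\\world."

# Each escape sequence is stored as a single character
print(len("\n"), len("\t"), len("\\"))

# repr() shows the escape sequences; print() shows the characters themselves
print(repr(s))
print(s)

# 5 letters + newline + tab + backslash + "world." (6 characters)
print(len(s))
```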

The regular expression (Regex) language is independent of Python and also uses backslash `\` to mark the beginning of an escape sequence. Regex escape sequences disable special behavior for characters. For example, `.` matches any character, but `\.` only matches a literal `.`.

As a result, writing a regular expression in an ordinary Python string is awkward. For example, to match a literal `\`, we need to write `\\` in regular expressions, which is `\\\\` in an ordinary Python string.

In [None]:
print("\\\\")  # ordinary string: prints \\, the regex for one literal backslash

Python provides _raw strings_, where `\` has no special meaning for Python, to help solve this problem ([more about raw strings](https://www.journaldev.com/23598/python-raw-string#:~:text=Python%20raw%20string%20is%20created,treated%20as%20an%20escape%20character.)). You can create a raw string by putting an `r` before the starting quote:

In [None]:
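print(r"\")  # SyntaxError: a raw string cannot end in a single backslash

In [None]:
print(r"\"")  # prints \" (the backslash stays in the string)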

In [None]:
print(r'\\')

In [None]:
s = 'Hi\nHello'
print(s)

In [None]:
raw_s = r'Hi\nHello'
raw_s

In [None]:
print(raw_s)

Even raw strings can't end in a single `\`; this is a limitation of the Python parser.
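The sketch below checks the earlier claim about matching a literal backslash: the ordinary string `"\\\\"` and the raw string `r"\\"` are the same two-character pattern. The sample path is made up for illustration.

```python
import re

# Both spellings produce the same two-character pattern: \\
ordinary = "\\\\"
raw = r"\\"
print(ordinary == raw, len(raw))

# Hypothetical sample containing two literal backslashes
path = r"C:\data\moby.txt"

# The pattern \\ matches one literal backslash in the text
parts = re.split(raw, path)
print(parts)
```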

Now we can write a better regular expression to split with:

In [None]:
moby[:100]

In [None]:
re.split(r"[\.,:;\r\n\t ()\[\]]", moby[:100])

In [None]:
len(re.split(r"[ \[\](),.:;!?'\n\r]", moby))

In [None]:
len(re.split(r" ", moby))

### Regular Expressions

The regular expression language includes _character classes_ that describe common sets of characters. The whitespace class `\s` and the word class `\w` are useful here (see the [`re` reference](https://docs.python.org/3/library/re.html)). `\s` matches any whitespace character, so splitting on `\s+` splits at every run of whitespace.
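A minimal sketch of that whitespace split, using a short made-up string rather than the novel:

```python
import re

# Hypothetical sample with spaces, a tab, and Windows-style line endings
sample = "Call me\tIshmael.\r\n\r\nSome years ago"

# \s+ matches one or more whitespace characters (space, tab, \r, \n, ...)
tokens = re.split(r"\s+", sample)
print(tokens)
```

Punctuation stays attached to the words, just as with `str.split()`.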

In [None]:
string = moby[:200]
string

In [None]:
print(string)

In [None]:
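re.findall(r"\w+", string)  # extract runs of word characters

In [None]:
re.split(r"\W*", string)  # \W* can match the empty string, so Python 3.7+ also splits between every character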

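The `re` functions never see whether a pattern was written as a raw string; they only see the resulting characters. In a raw string, `\n` reaches `re.split` as a backslash followed by `n`, which the regex engine interprets as its own newline escape. In an ordinary string, Python converts `\n` to a newline character first, and the regex engine matches that character literally. When the two interpretations coincide, as they do for `\n` and `\r`, the pattern does not have to be a raw string. Sequences such as `\[` and `\s` are not Python escapes, so recent Python versions warn about them in ordinary strings; prefer raw strings for regular expressions.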
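A small sketch of both cases, using a made-up string: `\n` means the same thing to Python and to the regex engine, while `\b` does not (a backspace character in an ordinary Python string, a word boundary in a regex).

```python
import re

text = "the cat\nsat"

# \n: both spellings split at the newline
same = re.split("\n", text) == re.split(r"\n", text)
print(same)

# \b in an ordinary string is a backspace character, so nothing matches
plain_matches = re.findall("\bcat\b", text)
print(plain_matches)

# In a raw string the regex engine sees \b, a word boundary
raw_matches = re.findall(r"\bcat\b", text)
print(raw_matches)
```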

In [None]:
re.split(r"[ \[\],.:;!'()\n\r-]", moby) # note the '

In [None]:
re.split("[ \[\],.:;!'()\n]", moby) # note the '

In [None]:
re.split(r"[\s\[\],.:;!'()-]", moby)

Capitalizing a character class inverts its meaning, so to split on all non-word characters:

In [None]:
re.split(r"\W+", moby[:100]) # + matches 1 or more of the preceding characters

`\w` matches any word character (a letter, digit, or underscore); `\W` matches any other character.

`+` causes the resulting RE to match 1 or more repetitions of the preceding RE.
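To see what the `+` changes, the sketch below splits the same short string with and without it.

```python
import re

# Without +, every non-word character is its own separator,
# so consecutive separators produce empty strings
single = re.split(r"\W", "the...dog")

# With +, a run of non-word characters counts as one separator
run = re.split(r"\W+", "the...dog")

print(single)
print(run)
```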

In [None]:
re.split(r"\W+", "the...dog")

In [None]:
re.split(r"\W+", "the,dog")

In [None]:
re.split(r"\W+", "the,I:! dog")

In [None]:
re.split(r"\W+", moby)

Rather than splitting the text, you can also approach the problem from the perspective of extracting tokens. The `findall()` function returns all matches for a regular expression:

In [None]:
re.findall(r"\w+", "The dog barked!")

In [None]:
print("\w") # \w is not a special Python escape sequence, so it passes through (newer Python versions warn about it)

In [None]:
re.split(r"\W+", "The dog barked!")

In [None]:
re.findall(r"\w+'?\w*", "The' dog'ss toy barked!")

In [None]:
re.findall(r"[\w']+", "I think the dog's toy barked!")

- `r"..."`: a raw string, so backslashes reach the regex engine unchanged
- `(...)+`: the pattern inside the parentheses should appear once or more
- `\w+`: one or more word characters
- `|`: or
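A short sketch of grouping and alternation on made-up strings. It uses a non-capturing group `(?:...)` for the first pattern because `re.findall` returns only the group contents when a pattern contains a capturing group.

```python
import re

# (?:ha)+ : the group "ha" repeated one or more times
laughs = re.findall(r"(?:ha)+", "hahaha, ha!")

# | : match either alternative
pets = re.findall(r"cat|dog", "cat, dog, bird")

# With a capturing group, findall returns the group, not the whole match
captured = re.findall(r"(ha)+", "hahaha, ha!")

print(laughs)
print(pets)
print(captured)
```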

For more practice, try [regex101](https://regex101.com/?fbclid=IwAR36UyAxywvpSvTOh7F-KYI72IZAVQ0wRcBc0OEOu6h4MifEf-iLcFedfyk).

In [None]:
words = re.findall(r"\w+", moby)
words

In [None]:
moby[34400:34530]

In [None]:
print(moby[34400:34530])

In [None]:
moby.find('CHAPTER 2')

Let's try to match all chapters in the book. First, let's match the chapter headings, which look like `"\nCHAPTER 1\r\n\r\nLoomings.\r\n"`. Check the novel [here](https://www.gutenberg.org/files/2701/2701-h/2701-h.htm#link2H_4_0002). Note that the full stop after the chapter number is not in the string.

In [None]:
re.findall(r"CHAPTER \d+\s+.*[\.\?\!]", moby)

In [None]:
re.findall(r"(CHAPTER\s{1}\d+)\s*(\w+\.{1})", moby)

In [None]:
print(moby[322415:322615])

See chapter 43. 

In [None]:
re.findall(r"CHAPTER \d+\s*.+[\.!\?]", moby)

Chapter 1 reappeared! 

In [None]:
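re.findall(r"(?<!,\s)(CHAPTER\s{1}\d+)\s*(.+[\.{1}|!{1}|\?{1}])", moby) # the lookbehind (?<!,\s) is not captured

Let's use a negative lookbehind! `(?<!,\s)` rejects any `CHAPTER` that is immediately preceded by a comma and a whitespace character.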

In [None]:
re.findall(r"(?<!,\s{1})(CHAPTER\s{1}\d+)\s*(.+[\.{1}|!{1}])", moby)
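A toy illustration of the negative lookbehind, on a made-up string rather than the novel: the mid-sentence `CHAPTER 2`, which follows a comma and a space, is skipped.

```python
import re

# Hypothetical sample: two headings and one mid-sentence mention
sample = "CHAPTER 1\nLoomings.\nAs noted, CHAPTER 2 is short.\nCHAPTER 3\nThe Spouter-Inn."

# (?<!,\s) rejects a match that is immediately preceded by
# a comma and one whitespace character
headings = re.findall(r"(?<!,\s)CHAPTER \d+", sample)
print(headings)
```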

In [None]:
re.findall(r"CHAPTER (\d+)\s+.+[\.!\?]", moby)
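One caveat about the character class `[\.{1}|!{1}|\?{1}]` that appears in some of these patterns: inside square brackets, `{`, `1`, `}`, and `|` are ordinary characters rather than a quantifier or alternation. The sketch below, on a made-up string, shows which characters the class accepts. `[.!?]` would be a narrower class with the apparent intent, though swapping it in may change what the patterns in this section match.

```python
import re

sample = "a|b{1}c.d!e?"

# Inside [...], {1} and | are literal characters, so they match too
broad = re.findall(r"[\.{1}|!{1}|\?{1}]", sample)

# A class containing only the three punctuation marks
narrow = re.findall(r"[.!?]", sample)

print(broad)
print(narrow)
```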

Let's find the unmatched chapters.

In [None]:
all_chapters = list(range(1, 136))
matched_chapters = [int(i) for i in 
                    re.findall(r"CHAPTER (\d+)\s+.+[\.!\?]", moby)]

In [None]:
[i for i in all_chapters if i not in matched_chapters]
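The same comparison can be written as a set difference. A minimal sketch with made-up chapter numbers (not the real output), using hypothetical `toy_` names so the notebook's variables are left alone:

```python
# Hypothetical example: pretend the book has 6 chapters and the pattern found 4
toy_all = list(range(1, 7))
toy_matched = [1, 2, 4, 6]

# Set difference keeps the chapter numbers that were never matched
toy_unmatched = sorted(set(toy_all) - set(toy_matched))
print(toy_unmatched)
```

For long lists this avoids scanning the matched list once per chapter.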

There is another newline! A chapter title can run onto a second line.

In [None]:
re.findall(r"(CHAPTER \d+\s+.+\n.+[\.!\?])", moby)

Let's be lazy! A lazy quantifier such as `.+?` matches as few characters as possible.

In [None]:
re.findall(r"(?<!,\s{1})(CHAPTER\s{1}\d+)\s*(.+?\s*.*[\.{1}|!{1}|\?{1}])", moby)
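The difference between a greedy and a lazy quantifier is easiest to see on a small made-up string:

```python
import re

sample = "<Loomings.> <The Carpet-Bag.>"

# Greedy: .+ grabs as much as possible, so the match runs to the last >
greedy = re.findall(r"<.+>", sample)

# Lazy: .+? stops at the first > that lets the pattern succeed
lazy = re.findall(r"<.+?>", sample)

print(greedy)
print(lazy)
```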

In [None]:
re.findall(r'''((CHAPTER \d+|Epilogue)\s+.+\n.+[\.!\?])''', moby)

In [None]:
re.findall(r"(?<!,\s{1})(CHAPTER\s{1}\d+|Epilogue|EXTRACTS)\s*(.+?\s*.*[\.{1}|!{1}|\?{1}])", moby)

Let's use a positive lookahead `(?=...)`.

In [None]:
re.findall(r"(?<!,\s{1})(CHAPTER\s{1}\d+|Epilogue|EXTRACTS(?=\s*\())\s*(.+?\s*.*[\.{1}|!{1}])", moby)
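A toy illustration of the positive lookahead on a short sample string: only the `EXTRACTS` that is followed by an opening parenthesis matches, and the parenthesis itself is not part of the match.

```python
import re

sample = "EXTRACTS (Supplied by a Sub-Sub-Librarian). These EXTRACTS are long."

# (?=\s*\() requires optional whitespace and a "(" after the word,
# but does not consume them
spans = [m.span() for m in re.finditer(r"EXTRACTS(?=\s*\()", sample)]
print(spans)
```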

To match `"ETYMOLOGY."`, we have to account for parentheses. (Note the extra `\.*`!)

In [None]:
re.findall(r"(?<!,\s{1})(ETYMOLOGY|CHAPTER\s{1}\d+|Epilogue|EXTRACTS(?=\s*\())\.*\s*(.+?\s*.*[\.{1}|!{1}|\){1}|\?{1}])", moby)

Perfect! But what if we want the chapter text that follows each matched heading?

In [None]:
re.findall(r"((?<!,\s{1})(?:ETYMOLOGY|CHAPTER\s{1}\d+|Epilogue|EXTRACTS(?=\s*\())(?:\.*\s*).+?\s*.*[\.{1}|!{1}|\){1}|\?{1}])", moby)

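Check the [docs](https://docs.python.org/3/library/re.html#re.split): if the pattern contains capturing groups, `re.split` also returns the captured text in the result. Remove the capturing groups when splitting!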

In [None]:
chapters = re.split(r"(?<!,\s{1})(?:ETYMOLOGY|CHAPTER\s{1}\d+|Epilogue|EXTRACTS(?=\s*\())(?:\.*\s*).+?\s*.*[\.{1}|!{1}|\){1}|\?{1}]", moby)
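The effect of capturing groups on `re.split` can be seen on a short made-up string:

```python
import re

sample = "intro CHAPTER 1 first text CHAPTER 2 second text"

# Capturing group: the matched headings are returned along with the pieces
with_group = re.split(r"\s*(CHAPTER \d+)\s*", sample)

# Non-capturing group: only the text between the headings is returned
without_group = re.split(r"\s*(?:CHAPTER \d+)\s*", sample)

print(with_group)
print(without_group)
```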

In [None]:
chapters[2]

In [None]:
chapters = [re.sub(r"\s+", " ", chapter) for chapter in chapters]

In [None]:
chapter = chapters[3]
chapter

Back to tokenizing! Tokenizing natural languages is a difficult problem. Some tokenizers work better for certain kinds of documents than others.

Before building your own tokenizer, try the tokenizers included with `nltk`, in the `nltk.tokenize` submodule.