# Introduction to `nltk` in Python

## Lecture goals

- Introducing the `nltk` library in Python.  
- Basic text processing with `nltk`.  
   - Tokenization. 
   - Stemming and lemmatizing.  
   - Part-of-speech tagging. 
   - Processing web text.

## The `nltk` library

> The **`nltk`** ("Natural Language Toolkit") library contains a set of functions, classes, and *corpora* (databases) to help process and analyze language data.

In [1]:
### usually we'll import more specific functions or classes
import nltk

### `nltk`: a brief history

- [`nltk`](https://en.wikipedia.org/wiki/Natural_Language_Toolkit) was created by Steven Bird and Edward Loper. 
- Has a *ton* of useful functions.  
  - Built-in **corpora** of multiple genres and languages. 
  - Can **pre-process** (e.g., tokenize) text.
  - Can **extract** information from text.  
  - Includes basic **machine learning** applications. 
- Used by *linguists*, *NLP practitioners*, and more.

## Tokenization

> `nltk` has a number of **tokenization** functions built in, including for tokenizing *sentences* (`sent_tokenize`) and individual words (`word_tokenize`).

See the [`nltk.tokenize` documentation](https://www.nltk.org/api/nltk.tokenize.html?highlight=tokenize) for more detail.

### Refresher: what is tokenization?

> **Tokenization** is the process of splitting a chunk of text into individual elements or "tokens" (often *words*). 

- Tokenization is critical for *pre-processing* text.  
- Before counting, calculating sentiment, etc., must identify the tokens!
- It's also surprisingly challenging to do.
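To see why tokenization is harder than it looks, here is a minimal sketch using only the standard library. It compares whitespace splitting with a hand-written regular expression. The sample sentence and the pattern are illustrative assumptions, not what `nltk` uses internally.

```python
import re

# Illustrative sentence (not from the lecture data)
text = "Hello, world! It's 3.88 dollars."

# Naive approach: split on whitespace, so punctuation stays attached to words
naive_tokens = text.split()

# Hand-written pattern (an assumption for illustration):
#   decimal numbers | words with an optional apostrophe part | single punctuation marks
pattern = r"\d+(?:\.\d+)?|\w+(?:'\w+)?|[^\w\s]"
regex_tokens = re.findall(pattern, text)

print(naive_tokens)   # punctuation glued to words, e.g. 'Hello,'
print(regex_tokens)   # punctuation separated, '3.88' kept whole
```

Even this small pattern has to make choices (should `It's` be one token or two?), which is one reason to prefer a well-tested tokenizer.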

### Tokenizing *words*

- Before, we used `split` to tokenize a string into words.  
- `nltk` has a function called `word_tokenize` that does this for us.

In [2]:
# Pre-processing step
import nltk
nltk.download('punkt') ### May need to install/download this; newer nltk versions may ask for 'punkt_tab' instead

### import word_tokenize
from nltk import word_tokenize

[nltk_data] Downloading package punkt to /Users/seantrott/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


#### `word_tokenize` in action (1)

Here, we see that it works well on our example sentence from last time.

In [3]:
example_str = "The quick brown fox jumped over the lazy dog"
word_tokenize(example_str)

['The', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']

#### `word_tokenize` in action (2)

It also does a better job with the Declaration of Independence, which `split` handled poorly.

In [4]:
### Declaration of Independence
with open("data/text/declaration.txt", "r") as f:
    contents = f.read()

tokens = word_tokenize(contents)

In [5]:
tokens[0:6]

['In', 'Congress', ',', 'July', '4', ',']

#### `word_tokenize` in action (3)

It also deals pretty well with newline (`\n`) characters and other punctuation.

In [6]:
s = '''Good muffins cost $3.88\nin New York.  Please buy me
... two of them.\n\nThanks.'''

In [7]:
word_tokenize(s)

['Good',
 'muffins',
 'cost',
 '$',
 '3.88',
 'in',
 'New',
 'York',
 '.',
 'Please',
 'buy',
 'me',
 'two',
 'of',
 'them',
 '.',
 'Thanks',
 '.']

#### How does it work?

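Under the hood, `word_tokenize` first splits the text into sentences and then applies a tokenizer based on the [`TreebankWordTokenizer`](https://www.nltk.org/api/nltk.tokenize.TreebankWordTokenizer.html#nltk.tokenize.TreebankWordTokenizer) (the exact class can vary by `nltk` version), which implements the following steps: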

- Split standard contractions (e.g., `"don't"` becomes `do` and `n't`). 
- Treat most punctuation characters as separate tokens.
- Split off commas and single quotes when followed by whitespace.
- Separate periods that appear at the end of a line.
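The rules above can be imitated roughly with a few regular expressions. This is a simplified sketch for intuition only: `toy_treebank_tokenize` is a hypothetical helper, not `nltk`'s implementation, and it covers far fewer cases.

```python
import re

def toy_treebank_tokenize(text):
    """Rough, simplified imitation of Treebank-style rules (not nltk's code)."""
    # Split "n't" contractions: "don't" -> "do n't"
    text = re.sub(r"(\w)(n't)\b", r"\1 \2", text)
    # Split other common clitics: "she's" -> "she 's"
    text = re.sub(r"(\w)('s|'re|'ve|'ll|'d|'m)\b", r"\1 \2", text)
    # Treat common punctuation marks as separate tokens
    text = re.sub(r"([,;:?!])", r" \1 ", text)
    # Separate a period at the very end of the text
    text = re.sub(r"\.\s*$", " .", text)
    return text.split()

toy_tokens = toy_treebank_tokenize("I don't know, but she's here.")
print(toy_tokens)
```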

#### Tokenization in other languages

- `nltk.word_tokenize` supports some other languages besides English, which can be entered using the `language` parameter.

In [8]:
spanish_example = "Después del almuerzo, Diego y Carlos fueron a la casa. Diego le dio las llaves a Carlos."

In [9]:
nltk.word_tokenize(spanish_example, language = 'spanish')[0:5]

['Después', 'del', 'almuerzo', ',', 'Diego']

### Tokenizing *sentences*

- Often, we are processing more than one sentence at a time.  
- Before we tokenize words, it is useful to tokenize *sentences*.
- `nltk` has a function called `sent_tokenize` that does this for us.

In [10]:
from nltk import sent_tokenize

In [11]:
sentences = "I visited my friend in Seattle. We went to the park and then downtown. It was a fun time."
sent_tokenize(sentences)

['I visited my friend in Seattle.',
 'We went to the park and then downtown.',
 'It was a fun time.']
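Splitting sentences is less trivial than splitting on periods. The sketch below is a deliberately naive standard-library splitter (not how `sent_tokenize` works): it handles the example above, but breaks on abbreviations. The second sentence and the helper name `naive_sent_split` are made up for illustration. `sent_tokenize` instead relies on the Punkt resources downloaded earlier, which are meant to cope with cases like these.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' when followed by whitespace (a deliberately simple rule)."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

easy = "I visited my friend in Seattle. We went to the park and then downtown. It was a fun time."
tricky = "Dr. Smith arrived at 5 p.m. sharp. He was late."  # made-up example

easy_sentences = naive_sent_split(easy)
tricky_sentences = naive_sent_split(tricky)

print(easy_sentences)    # three sentences, as expected
print(tricky_sentences)  # wrongly split after "Dr." and "p.m."
```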

## Stemming and lemmatizing

> **Stemming** or **lemmatizing** a word means removing any *inflectional morphology* (e.g., plural markers) to reveal only the "root".

- Very useful if you only have *root words* in a vocabulary set.  
   - "loved" --> "love"
   - "running" --> "run"
- Stemming ≠ lemmatizing (the latter is a bit harder).

### Stemming

> **Stemming** is a process that removes common *affixes* from a word.

- English examples: plural ("-s"), past tense ("-ed"), etc.  
- Simple approach:
   - Build list of common suffixes.  
   - Identify these substrings in words.  
   - Remove those substrings.
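The simple approach above can be sketched in a few lines. The suffix list and the minimum stem length are assumptions chosen for illustration. Real stemmers such as the Porter stemmer use a much richer set of rules.

```python
# Hypothetical suffix list for illustration; longer suffixes are checked first
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word, suffixes=SUFFIXES, min_stem_len=3):
    """Strip the first matching suffix, keeping at least `min_stem_len` characters."""
    word = word.lower()
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

for w in ["dogs", "jumped", "loved", "running", "bus"]:
    print(w, "->", naive_stem(w))
```

The rule works for "dogs" and "jumped", but turns "loved" into "lov" and "running" into "runn", which shows why suffix stripping alone is error-prone.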

#### Stemming with `nltk`

- The `PorterStemmer` implements the [**Porter Stemming Algorithm**](https://tartarus.org/martin/PorterStemmer/).  
- Imperfect process!

In [12]:
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()

In [13]:
ps.stem("dogs") ### nice!

'dog'

In [14]:
ps.stem("houses") ### hmm

'hous'

In [15]:
ps.stem("flies") ### not so good

'fli'

#### Stemming in action

- Typically, you would first use `word_tokenize`, then **stem** those words. 
- E.g., useful for sentiment analysis.

In [16]:
sentence = "I loved that film"
words = word_tokenize(sentence)
words

['I', 'loved', 'that', 'film']

In [17]:
stemmed_words = [ps.stem(w) for w in words]
stemmed_words

['i', 'love', 'that', 'film']

#### Check-in

Try to come up with some examples where `ps.stem` works and where it doesn't work. What do you think about its performance?

In [18]:
### Your code here

#### A qualitative evaluation

`ps.stem` works for prototypical examples, but also fails often.

In [19]:
ps.stem("movies")

'movi'

In [20]:
ps.stem("bottles")

'bottl'

### Lemmatizing

> **Lemmatizing** is a process for identifying the *root* or *lemma* of a word, which ensures that the output is indeed a root form.  

- Stemming can be implemented with some simple rules (e.g., remove "-s").  
  - Fast, but error-prone.
- Lemmatizing requires a *list* of all the valid lemmas of a language. 
  - Slower and harder to build, but more accurate.
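The role of the lemma list can be sketched with a toy lookup: generate candidate roots with a few suffix rules, but accept a candidate only if it appears in a known vocabulary. The vocabulary, the rules, and the name `toy_lemmatize` are made up for illustration and are far smaller than a real resource.

```python
# Tiny, made-up vocabulary of valid lemmas
LEMMAS = {"movie", "bottle", "battle", "fly", "house", "run"}

def toy_lemmatize(word, lemmas=LEMMAS):
    """Return a candidate root only if it is a known lemma; otherwise return the word unchanged."""
    word = word.lower()
    if word in lemmas:
        return word
    candidates = []
    if word.endswith("ies"):
        candidates.append(word[:-3] + "y")   # flies -> fly
    if word.endswith("es"):
        candidates.append(word[:-2])         # houses -> hous (rejected below)
    if word.endswith("s"):
        candidates.append(word[:-1])         # houses -> house
    for candidate in candidates:
        if candidate in lemmas:
            return candidate
    return word

for w in ["movies", "flies", "houses", "cats"]:
    print(w, "->", toy_lemmatize(w))
```

Because every output is checked against the vocabulary, the function never returns a non-word like "movi". The cost is that words missing from the list (here, "cats") are left unchanged.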

#### Using `WordNetLemmatizer`

- The `WordNetLemmatizer` uses the [**WordNet database**](https://www.nltk.org/howto/wordnet.html) as its dictionary of *lemmas*.  
- Unlike stemming, lemmatizing should return a valid word.

In [21]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

In [22]:
lemmatizer.lemmatize("movies")

'movie'

In [23]:
lemmatizer.lemmatize("bottles")

'bottle'

#### Check-in

Try to come up with some more examples that `WordNetLemmatizer` does well on, but `PorterStemmer` does not.

In [24]:
### Your code here

#### More qualitative evaluation

In [25]:
print(ps.stem("battles"))
print(lemmatizer.lemmatize("battles"))

battl
battle


#### Specifying the `pos`

- By specifying the `pos` tag, you can improve the accuracy of the lemmatization.
- Common values: noun (`"n"`, the default), verb (`"v"`), and adjective (`"a"`).

In [26]:
### Without POS
print(lemmatizer.lemmatize("running"))  # Expected verb lemma "run", but defaults to noun
print(lemmatizer.lemmatize("better"))   # Expected adjective lemma "good"

running
better


In [27]:
### With POS
print(lemmatizer.lemmatize("running", pos = "v"))  # Verb lemma: "run"
print(lemmatizer.lemmatize("better", pos = "a"))   # Adjective lemma: "good"

run
good


#### But how do you figure out part of speech?

## Part-of-speech tagging

> **Part-of-speech** tagging is the process of identifying the part of speech of different words in a sentence (e.g., "noun", "verb", etc.).

- The same *wordform* has different meanings depending on its **usage**.  
- Can also be lemmatized differently ("in the running" vs. "she is running").

In [28]:
from nltk import pos_tag

### `nltk.pos_tag`

- Given a list of tokens...
- ...`pos_tag` returns *tuples* of each word and its POS.

In [29]:
s1 = "She is in the running for office."
s2 = "She is running for office."

In [30]:
### First sentence
pos_tag(word_tokenize(s1))

[('She', 'PRP'),
 ('is', 'VBZ'),
 ('in', 'IN'),
 ('the', 'DT'),
 ('running', 'NN'),
 ('for', 'IN'),
 ('office', 'NN'),
 ('.', '.')]

#### Check-in

Use `pos_tag` on the second sentence. What do you notice about the tag for "running"?

In [31]:
### Your code here

#### A different POS

Now, "running" is tagged as "VBG" (a verb in gerund or present-participle form).

In [32]:
pos_tag(word_tokenize(s2))

[('She', 'PRP'),
 ('is', 'VBZ'),
 ('running', 'VBG'),
 ('for', 'IN'),
 ('office', 'NN'),
 ('.', '.')]
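The tags returned by `pos_tag` follow the Penn Treebank convention (`NN`, `VBG`, ...), while the lemmatizer's `pos` argument expects short codes like `"n"` and `"v"`. One way to connect the two is a small mapping function. `penn_to_wordnet` below is a hypothetical helper, not part of `nltk`, and the fallback to noun is an assumption that mirrors the lemmatizer's default.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (as returned by pos_tag) to a WordNet-style pos code."""
    if tag.startswith("JJ"):
        return "a"  # adjective
    if tag.startswith("VB"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # fall back to noun

# Tagged tokens copied from the outputs for the two example sentences
tagged_s1 = [("She", "PRP"), ("is", "VBZ"), ("in", "IN"), ("the", "DT"), ("running", "NN")]
tagged_s2 = [("She", "PRP"), ("is", "VBZ"), ("running", "VBG")]

for word, tag in tagged_s1 + tagged_s2:
    if word == "running":
        print(word, tag, "->", penn_to_wordnet(tag))
```

The result could then be passed along as `lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))`, so that "running" is lemmatized as a noun in the first sentence and as a verb in the second.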

## Processing web text

> **Web scraping** is a common use case for computational social science (CSS).  

- We won't discuss the details too much here, but enough to get you started.  
- You'll use `requests` to access text of web pages.
- And then use `BeautifulSoup` and `nltk` to process that text.

### Necessary installations

- If you don't have it already, you may need to install some of these libraries.

In [33]:
pip install nltk requests beautifulsoup4

Note: you may need to restart the kernel to use updated packages.


### Fetching web content with `requests`

> The [`requests` library](https://pypi.org/project/requests/) retrieves the HTML from a web address.

In [34]:
import requests ### requests library

In [35]:
### Let's look at my CSS 2 syllabus page
url = "https://ucsd-css2.github.io/ucsd-css2-website/course/syllabus.html"

In [36]:
# Fetch the content from the URL
response = requests.get(url)
type(response)

requests.models.Response

#### What's in the `Response`?

- The `Response` object has a number of attributes and methods, like `.json()`, `.links`, and `.text`.  
- We'll focus on `.text`, which is...all HTML!

In [37]:
response.text[0:100]

'\n\n<!DOCTYPE html>\n\n\n<html lang="en" data-content_root="" >\n\n  <head>\n    <meta charset="utf-8" />\n  '

### Simplifying with `BeautifulSoup`

- `BeautifulSoup` can be used to **extract structure** from that HTML text.

In [38]:
from bs4 import BeautifulSoup ### import

In [39]:
# Use BeautifulSoup to parse the HTML content
soup = BeautifulSoup(response.text, 'html.parser')
type(soup)

bs4.BeautifulSoup

#### Getting the *text* from the page

In [40]:
# Extract the text from the web page
text = soup.get_text(strip=True)
text[0:100]

'CSS 2 Syllabus: Winter 2024 — CSS 2Skip to main contentBack to topCtrl+KWelcome to CSS 2!Course Logi'

In [41]:
tokens = word_tokenize(text)
tokens[0:10] ### not perfect, but not bad

['CSS', '2', 'Syllabus', ':', 'Winter', '2024', '—', 'CSS', '2Skip', 'to']
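The fused token `'2Skip'` appears because text from adjacent HTML elements is concatenated with nothing in between. `get_text` accepts a `separator` argument, so `soup.get_text(separator=" ", strip=True)` is one option worth trying. The sketch below reproduces the effect with the standard library's `html.parser` so that it runs without extra installs. The HTML snippet and the `TextCollector` class are made up for illustration.

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Collect the non-empty text chunks found between HTML tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        stripped = data.strip()
        if stripped:
            self.chunks.append(stripped)

# Made-up snippet resembling the top of the syllabus page
html = "<h1>CSS 2</h1><a href='#main'>Skip to main content</a>"

collector = TextCollector()
collector.feed(html)
collector.close()

fused = "".join(collector.chunks)    # no separator: words from different elements run together
spaced = " ".join(collector.chunks)  # a space separator keeps them apart

print(fused)
print(spaced)
```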

### Note on web APIs

- Many websites have an **API**, or *Application Programming Interface*.  
   - E.g., [Reddit API](https://praw.readthedocs.io/en/stable/).
- This means you don't have to process all the web text yourself.  
- Instead, you can work with more customized objects specific to the web page.  
- But also often requires creating an **account**.


### Web scraping and privacy

- Just because the text is *available*, doesn't mean it's *legal* or *ethical* to scrape it.  
- Before scraping, think about our [privacy lectures](https://ucsd-css2.github.io/ucsd-css2-website/intro.html) from CSS 2:
   - Is scraping this content against the company's terms of service?  
   - Would the people who produced this content be comfortable with me scraping it?  
   - How do I plan to *use* that scraped content?

## Lecture wrap-up

- A big part of CSS is dealing with *unstructured* text data.  
- `nltk` has methods to help us **extract structure**.  
   - Tokenizing to find the words.  
   - Stemming and lemmatizing to find the *roots* of those words.  
- Other packages, like `BeautifulSoup`, can help simplify web text.