### Week 7 - Regex

```markdown
# NLP Tutorial with Definitions


---

## 1. Overview

Natural Language Processing (NLP) is the field of enabling computers to understand, interpret, and generate human language. Python libraries such as **NLTK**, **TextBlob**, **spaCy**, and **gensim** help accomplish various NLP tasks including tokenization, parsing, and semantic analysis.

This tutorial shows how to use **NLTK** and **regular expressions** (`re`) for the basic task of text **tokenization**. We also define key terms for clarity.

---

## 2. Key Definitions

1. **Natural Language Processing (NLP)**  
   Using computers to analyze, understand, or generate human language (e.g., English, Spanish). NLP often deals with tasks like sentiment analysis, language translation, topic modeling, etc.

2. **Corpus**  
   A **corpus** is a large and structured set of texts (or documents). Examples include a database of tweets, a collection of news articles, or the entire text of several books.

3. **Document**  
   A **document** is usually the smallest standalone text unit within a corpus. For example, one news article, one tweet, or one book in a set of books.

4. **Token**  
   A **token** is a sequence of characters treated as a single meaningful entity. Typically, tokens are words, but they can also be punctuation marks, hashtags, or other atomic units.

5. **Tokenization**  
   The process of splitting text into smaller pieces called *tokens*. It is often the first step in text preprocessing.

6. **Regular Expression (Regex)**  
   A **regular expression** is a special text string used for describing search patterns. In Python, the `re` module provides functions to match or split strings using regexes.

7. **Stopwords**  
   **Stopwords** are common words (e.g., “and,” “the,” “to”) that usually carry little meaning in text analysis. Removing them can often improve analysis results.

8. **Stemming**  
   **Stemming** crudely chops off word endings to reduce words to their base forms (e.g., “running” → “run”).

9. **Lemmatization**  
   **Lemmatization** also reduces words to a base form (lemma), but it uses vocabulary and morphological analysis (e.g., “am,” “are,” “is” → “be”).

10. **Frequency Distribution**  
    A summary of how often different tokens (words) appear in a text. Helps find the most frequent words.

---

## 3. Installing and Loading NLTK

1. **Install NLTK** (if needed):
   ```bash
   pip install nltk
   ```
2. **Import NLTK and Download Resources**:
   ```python
   import nltk
   nltk.download('punkt')      # for tokenizers (recent NLTK releases may ask for 'punkt_tab' instead)
   nltk.download('gutenberg')  # sample Project Gutenberg texts
   nltk.download('stopwords')  # for stopword lists
   ```

---

## 4. Using NLTK for Tokenization

### 4.1 Loading a Corpus (Gutenberg Example)

```python
import nltk
from nltk.corpus import gutenberg

# List file IDs in the Gutenberg corpus
print(gutenberg.fileids())

# Load the raw text of "Moby Dick"
moby_raw = gutenberg.raw("melville-moby_dick.txt")
print(moby_raw[:500])  # print the first 500 characters
```

### 4.2 Sentence Tokenization

```python
# Split text into sentences
sentences = nltk.sent_tokenize(moby_raw)
print(f"Number of sentences: {len(sentences)}")
print(sentences[0])  # view the first sentence
```

### 4.3 Word Tokenization

```python
# Split the raw text into individual word tokens
words = nltk.word_tokenize(moby_raw)
print(f"Number of word tokens: {len(words)}")
print(words[:20])  # see the first 20 tokens
```

### 4.4 Tokenizing a Custom Sentence

```python
custom_sentence = "Hello, world! This is a test-sentence, with punctuation."
custom_tokens = nltk.word_tokenize(custom_sentence)
print(custom_tokens)
# Example output (not shown here to keep it raw):
# ['Hello', ',', 'world', '!', 'This', 'is', 'a', 'test-sentence', ',', 'with', 'punctuation', '.']
```

---

## 5. Splitting Text with Regular Expressions

### 5.1 Splitting on Non-Word Characters

```python
import re

sample_text = "Hello, world!\nThis is regex 101.\tLet's split by non-word chars."
split_tokens = re.split(r"\W+", sample_text)
print(split_tokens)
# Might include empty strings where the pattern matches at start/end
```
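
The empty strings mentioned in the comment appear whenever the text begins or ends with a non-word character. A minimal sketch of one way to drop them, reusing the same `sample_text` (the name `clean_tokens` is only illustrative):

```python
import re

sample_text = "Hello, world!\nThis is regex 101.\tLet's split by non-word chars."

# re.split leaves an empty string when the delimiter pattern matches at the very end
split_tokens = re.split(r"\W+", sample_text)

# Keep only non-empty pieces
clean_tokens = [tok for tok in split_tokens if tok]
print(clean_tokens)
```

Here the text ends with a period, so the last element of `split_tokens` is `''` and the filter removes it.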

### 5.2 Finding Words with `re.findall`

```python
# Grab all word-like sequences (\w+)
words_regex = re.findall(r"\w+", sample_text)
print(words_regex)
# Potentially: ['Hello', 'world', 'This', 'is', 'regex', '101', 'Let', 's', 'split', 'by', 'non', 'word', 'chars']
```

### 5.3 Extracting Chapter Headings from Moby Dick

```python
moby_chapters = re.findall(r"(CHAPTER\s+\d+.*?)\r?\n\r?\n", moby_raw)
print(f"Found {len(moby_chapters)} potential chapters.")
# Each element in moby_chapters should start with “CHAPTER <number>” up to a blank line.
```

### 5.4 Extracting Specific Chapter Text

```python
# Grab "CHAPTER 1" text up until the next "CHAPTER <number>"
# (\b keeps "CHAPTER 1" from also matching "CHAPTER 10", "CHAPTER 11", ...)
chapter_1_pattern = r"(CHAPTER 1\b.*?)(?=CHAPTER \d+|$)"
match = re.search(chapter_1_pattern, moby_raw, re.DOTALL)

if match:
    chapter_1_text = match.group(1)
    chapter_1_tokens = nltk.word_tokenize(chapter_1_text)
    print(f"Number of tokens in Chapter 1: {len(chapter_1_tokens)}")
else:
    print("Chapter 1 not found.")
```
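
The same lookahead idea can split a text into every chapter at once. The sketch below runs on a small made-up string (`toy_book` is not the real novel) so it works without downloading any corpus; the pattern and variable names are illustrative assumptions.

```python
import re

# Made-up miniature "book" standing in for moby_raw
toy_book = (
    "CHAPTER 1\n\nCall me Ishmael.\n\n"
    "CHAPTER 2\n\nI stuffed a shirt.\n\n"
    "CHAPTER 10\n\nA bosom friend.\n"
)

# Capture the chapter number, then lazily capture everything up to the
# next "CHAPTER <number>" heading or the end of the text (\Z).
chapter_pattern = re.compile(r"CHAPTER (\d+)\b(.*?)(?=CHAPTER \d+\b|\Z)", re.DOTALL)

chapters = {
    int(m.group(1)): m.group(2).strip()
    for m in chapter_pattern.finditer(toy_book)
}
print(chapters)
```

A dictionary keyed by chapter number makes it easy to tokenize one chapter at a time, for example `nltk.word_tokenize(chapters[1])`.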

---

## 6. Further NLP Examples

### 6.1 Frequency Distribution

```python
from nltk.probability import FreqDist

# Convert to lowercase and create a frequency distribution
words_lower = [w.lower() for w in words]
fdist = FreqDist(words_lower)

# Show the top 10 most common tokens
common_tokens = fdist.most_common(10)
print(common_tokens)
```
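
For intuition, a frequency distribution is a tally of tokens. A rough standard-library sketch of the same idea on a made-up sentence uses `collections.Counter`, which offers the same `most_common` method (`toy_text` is an assumption for illustration):

```python
import re
from collections import Counter

toy_text = "The whale, the sea, and the ship. The sea was calm."

# Lowercase first so "The" and "the" are counted together
toy_tokens = re.findall(r"\w+", toy_text.lower())

counts = Counter(toy_tokens)
top_three = counts.most_common(3)
print(top_three)
```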

### 6.2 Removing Stopwords

```python
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
filtered_tokens = [w for w in words_lower if w.isalpha() and w not in stop_words]

print(f"Original token count: {len(words_lower)}")
print(f"Filtered token count: {len(filtered_tokens)}")
```



---

## 7. Conclusion

- **Definition Recap**:
  - **NLP**: Processing human language with computers.
  - **Corpus/Document**: Structured text collections vs. individual text units.
  - **Token/Tokenization**: Splitting text into meaningful chunks.
  - **Regex**: Pattern-based string manipulation.
  - **Stopwords**, **Stemming**, **Lemmatization**: Techniques to clean and normalize textual data.
- **NLTK** provides:
  - Ready-to-use corpora (e.g., Gutenberg).
  - Easy tokenization methods (`sent_tokenize`, `word_tokenize`).
  - Tools for frequency analysis, stopword removal, etc.
- **Regular Expressions** (`re`) let you handle custom or advanced pattern matching.

From here, you can move into more advanced tasks like **Part-of-Speech Tagging**, **Named Entity Recognition**, **Text Classification**, or explore other libraries like **spaCy** for fast production-level NLP.

``` 
````````markdown


```markdown
# Regular Expressions: Basics, Grammar, and Homework-Specific Usage

This tutorial explains **regular expressions (regex)** from the ground up for students with little or no prior experience, highlighting patterns, methods, and examples needed for the Wiki Game and Movie Scraping homework in STA 141B.

---

## 1. Regex Basics

Regular expressions are a mini-language for **pattern matching** and **searching** within text. In Python, they are provided by the built-in [`re` module](https://docs.python.org/3/library/re.html).

### 1.1. Metacharacters and Special Symbols

- **`.` (dot)**: Matches **any single character** except a newline (set the `re.DOTALL` flag to make it match newlines too).
- **`^`**: Asserts the **start of a string** (or start of a line with `re.MULTILINE`).
- **`$`**: Asserts the **end of a string** (or end of a line with `re.MULTILINE`).
- **`*`**: Matches **0 or more** of the preceding element.
- **`+`**: Matches **1 or more** of the preceding element.
- **`?`**: 
  1. Matches **0 or 1** of the preceding element (makes something optional), or 
  2. Makes a quantifier (like `*` or `+`) **non-greedy** when combined (`*?`, `+?`).
- **`{m,n}`**: Matches **m to n** of the preceding element, e.g., `\d{2,4}` matches 2–4 digits in a row.
- **`(...)`**: **Parentheses** for **grouping** or **capturing**. 
- **`|`**: **Alternation**, acts like a logical OR. For example, `(cat|dog)` matches `"cat"` or `"dog"`.
- **`\`**: Escape character. For instance, `\.` matches a literal dot (`.`), `\(` matches a literal `(`, etc.

Below are short code examples demonstrating these metacharacters:

```python
import re

text_sample = "cat cot cut c#t c t"

# 1) Dot (.)
#    Pattern: "c.t" matches any single character between 'c' and 't'
match_dot = re.findall(r"c.t", text_sample)
print("Matches for c.t:", match_dot)
# Output: ['cat', 'cot', 'cut', 'c#t', 'c t'] (the dot also matches '#' and a space)

# 2) Start (^) and End ($)
lines = """dog
cat
car
cart
"""
# Matches lines that start with 'c' and end with 't'
match_start_end = re.findall(r"^c.*t$", lines, flags=re.MULTILINE)
print("Lines starting with c and ending with t:", match_start_end)
# Example output: ['cat', 'cart']

# 3) Star (*) vs Plus (+) vs Question Mark (?)
text_nums = "ab, abb, abbb, abbbb"
# 'ab*' => "a" followed by zero or more "b"
match_star = re.findall(r"ab*", text_nums)
print("Matches for ab*:", match_star)
# Output: ['ab', 'abb', 'abbb', 'abbbb'] (greedy: each match takes every available "b")

# 'ab+' => "a" followed by one or more "b"
match_plus = re.findall(r"ab+", text_nums)
print("Matches for ab+:", match_plus)
# Output: ['ab', 'abb', 'abbb', 'abbbb'] (same here, since every "a" is followed by a "b")

# 'ab?' => "a" followed by zero or one "b"
match_question = re.findall(r"ab?", text_nums)
print("Matches for ab?:", match_question)
# Output: ['ab', 'ab', 'ab', 'ab'] (at most one "b" per match)

# 4) Curly Braces {m,n}
text_digits = "Phone: 1234567, Code: 999, Number: 1234"
# Pattern for a standalone run of 3 to 4 digits
match_digits = re.findall(r"\b\d{3,4}\b", text_digits)
print("Matches for 3-4 digits:", match_digits)
# Output: ['999', '1234'] (1234567 is too long to fit between the word boundaries)

# 5) Grouping and Alternation
animals = "cat dog mouse rat catdog"
# Pattern: (cat|dog)
group_alt = re.findall(r"(cat|dog)", animals)
print("Matches for (cat|dog):", group_alt)
# Output: ['cat', 'dog', 'cat', 'dog'] ('catdog' yields two separate matches, not one)
```
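
The non-greedy form of `?` described in the list above is easiest to see side by side with the greedy default. A minimal sketch on a made-up string (`tagged` is only an illustration):

```python
import re

tagged = "<b>bold</b> and <i>italic</i>"

# Greedy: .* grabs as much as possible, so one match spans the whole string
greedy = re.findall(r"<.*>", tagged)

# Non-greedy: .*? stops at the first '>' it can, giving one match per tag
lazy = re.findall(r"<.*?>", tagged)

print(greedy)
print(lazy)
```

The non-greedy version is the one used later for stripping `( ... )` and `<i>...</i>` blocks, where each block should be matched separately.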

---

### 1.2. Character Classes

- **`[ABC]`**: Matches **one** character that is `A`, `B`, or `C`.
- **`[A-Za-z0-9]`**: Matches any alphanumeric character.
- **`[^...]`**: **Negated** character class. For instance, `[^0-9]` matches any character *not* a digit.

```python
import re

text_chars = "ABC 123 #$% xyz"

# [ABC]
match_abc = re.findall(r"[ABC]", text_chars)
print("Matches for [ABC]:", match_abc)

# Negated class [^0-9] => non-digit
match_non_digit = re.findall(r"[^0-9]+", text_chars)
print("Matches for [^0-9]+:", match_non_digit)
```

---

### 1.3. Common Shorthand Escapes

- **`\d`**: Digit (0–9; in Python 3 `str` patterns it also matches other Unicode decimal digits unless `re.ASCII` is set).
- **`\s`**: Whitespace (spaces, tabs, newlines, etc.).
- **`\w`**: Word characters (letters, digits, underscores).
- **`\b`**: Word boundary.

```python
import re

sample_text = "ABC123 foo_bar   7"

# 1) \d => digit
digits = re.findall(r"\d", sample_text)
print("Digits:", digits)

# 2) \s => whitespace
spaces = re.findall(r"\s+", sample_text)
print("Whitespace chunks:", spaces)

# 3) \w => word characters (letters, digits, underscore)
words = re.findall(r"\w+", sample_text)
print("Word chunks:", words)

# 4) \b => word boundary
boundary_test = re.findall(r"\bfoo\b", "foo_bar foo bar foobar")
print("Exact 'foo' word:", boundary_test)
```
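
In Python 3, these shorthands are Unicode-aware for `str` patterns, which matters when scraped text contains accented letters or non-Latin digits. A small sketch of the difference the `re.ASCII` flag makes (the string `mixed` is a made-up example):

```python
import re

# "café 42" followed by the Arabic-Indic digits for 42
mixed = "caf\u00e9 42 \u0664\u0662"

unicode_words = re.findall(r"\w+", mixed)                  # accented letter and both numbers kept
ascii_words = re.findall(r"\w+", mixed, flags=re.ASCII)    # only [A-Za-z0-9_] counts as \w

unicode_digits = re.findall(r"\d+", mixed)                 # both digit runs
ascii_digits = re.findall(r"\d+", mixed, flags=re.ASCII)   # only 0-9

print(unicode_words, ascii_words)
print(unicode_digits, ascii_digits)
```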


---

## 2. The `re` Module: Common Methods

1. **`re.search(pattern, string)`**: Finds the *first* occurrence of `pattern`.  
2. **`re.match(pattern, string)`**: Like `search` but only at the beginning of the string.  
3. **`re.findall(pattern, string)`**: Returns a list of all **non-overlapping** matches.  
4. **`re.finditer(pattern, string)`**: Returns an iterator of match objects.  
5. **`re.sub(pattern, repl, string)`**: **Substitutes** matches of `pattern` with `repl`.  
6. **`re.split(pattern, string)`**: Splits `string` using `pattern` as a delimiter.  
7. **`re.compile(pattern)`**: Compiles a pattern for repeated use.

**Example**:

```python
import re

text = "Hello world! 2023 is a year, 1999 was another year."

# 1) re.search
match_search = re.search(r"\d{4}", text)
print(match_search.group())  # '2023' (the first 4-digit number found)

# 2) re.findall
all_numbers = re.findall(r"\d{4}", text)
print(all_numbers)  # ['2023', '1999']

# 3) re.sub
text_no_digits = re.sub(r"\d+", "[NUM]", text)
print(text_no_digits)
# "Hello world! [NUM] is a year, [NUM] was another year."
```
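
The remaining methods from the list (`re.match`, `re.finditer`, `re.split`, `re.compile`) can be sketched on the same sentence. Variable names here are illustrative:

```python
import re

text = "Hello world! 2023 is a year, 1999 was another year."

# re.compile: build the pattern once, reuse it several times
year_re = re.compile(r"\d{4}")

# re.match only looks at the start of the string
starts_with_year = re.match(r"\d{4}", text)       # None: the text starts with "Hello"
starts_with_hello = re.match(r"Hello", text)      # a match object

# re.finditer yields match objects, which carry positions as well as text
positions = [(m.group(), m.start()) for m in year_re.finditer(text)]

# re.split cuts the string wherever the pattern matches
parts = re.split(r"[,!.]\s*", text)

print(starts_with_year, starts_with_hello.group())
print(positions)
print(parts)
```

Because `re.search` and `re.match` return `None` when nothing matches, it is safer to check the result before calling `.group()` on it.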

---

## 3. Why Parentheses in Regex?

1. **Grouping**: `(19|20)\d{2}` means “19 or 20, followed by 2 digits.” Without parentheses, `19|20\d{2}` is parsed as “`19` **or** `20\d{2}`,” because alternation has the lowest precedence, so it also matches a bare `19`.  
2. **Capturing**: After `match = re.search(r"(19|20)\d{2}", text)`, `match.group(1)` holds the alternative that matched (`19` or `20`), while `match.group()` holds the full four-digit match.
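
A short sketch of both points on a made-up sentence. It also shows a common surprise: `re.findall` returns only the captured group when the pattern contains one, so a non-capturing group `(?:...)` is often the better fit.

```python
import re

text = "Built in 1987, renovated 2015, 19 rooms."

# Grouped alternation: "19" or "20", then two more digits
full_matches = [m.group() for m in re.finditer(r"(19|20)\d{2}", text)]

# Ungrouped: "19" alone, OR "20" followed by two digits
ungrouped = re.findall(r"19|20\d{2}", text)

# findall with a capturing group returns just the group contents
captured_only = re.findall(r"(19|20)\d{2}", text)

# A non-capturing group keeps the grouping but returns the whole match
non_capturing = re.findall(r"(?:19|20)\d{2}", text)

print(full_matches)
print(ungrouped)
print(captured_only)
print(non_capturing)
```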

---

## 4. Homework-Specific Patterns & Logic

### 4.1. Removing Parenthetical Text

**Goal**: If a string contains `( ... )`, you often want to remove the parentheses and everything inside them (e.g., in Wiki articles).

```python
import re

text_example = "Some text (hidden info) more text (another) done."
removed = re.sub(r"\(.*?\)", "", text_example)
print(removed)
# Output: "Some text  more text  done."
```
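
The pattern `\(.*?\)` pairs each `(` with the nearest `)`, so nested parentheses leave fragments behind. One possible workaround, sketched here with a hypothetical helper `strip_parentheticals`, is to remove innermost pairs repeatedly until nothing changes and then tidy the leftover spaces:

```python
import re

def strip_parentheticals(text):
    """Remove parenthesized text, including nested pairs, and collapse extra spaces."""
    innermost = re.compile(r"\([^()]*\)")  # a pair with no parentheses inside it
    previous = None
    while previous != text:
        previous = text
        text = innermost.sub("", text)
    return re.sub(r"\s{2,}", " ", text).strip()

# The simple pattern stops at the first ")" and leaves a stray fragment
naive = re.sub(r"\(.*?\)", "", "a (b (c) d) e")
print(repr(naive))

cleaned = strip_parentheticals("Some text (hidden (nested) info) more text (another) done.")
print(cleaned)
```

Whether the homework needs nested handling depends on the pages involved, so treat this as an option rather than a requirement.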

### 4.2. Ignoring External or Special Links (e.g., `/wiki/File:`)

**Goal**: In the Wiki Game, skip `/wiki/File:`, `/wiki/Category:`, etc.

```python
import re

def is_valid_link(url):
    # Return False if it matches these special forms
    if re.match(r"^/wiki/(File|Category|Special):", url):
        return False
    # Skip external 'http' or disambiguation pages
    if re.match(r"^http", url) or "disambiguation" in url.lower():
        return False
    return True

print(is_valid_link("/wiki/File:Example.jpg"))  # False
print(is_valid_link("/wiki/Regular_Article"))   # True
```

### 4.3. Removing Italicized Sections

To ignore italicized links (`<i><a href=...>`), remove all `<i>...</i>` blocks:

```python
import re

html = "<p>Text <i><a href='/wiki/Italics'>Ignore me</a></i> and <a href='/wiki/Valid'>Use me</a>.</p>"
cleaned = re.sub(r"<i>.*?</i>", "", html, flags=re.DOTALL)
print(cleaned)
# "<p>Text  and <a href='/wiki/Valid'>Use me</a>.</p>"
```

### 4.4. Matching Years (e.g., 1900–2099)

```python
import re

text_movie = "Script date: 2004. Release date: 2006. Old year: 1879."
years = re.findall(r"\b\d{4}\b", text_movie)
print(years)  # ['2004', '2006', '1879']

modern_years = re.findall(r"\b(?:19|20)\d{2}\b", text_movie)
print(modern_years)
# ['2004', '2006'] (1879 won't match; \b prevents matches inside longer digit runs)
```

---

## 5. Putting It All Together: Wiki Game Example

```python
import re

page_html = """
<p>This is an example (ignore me) <i><a href="/wiki/Italics">Italics link</a></i>
<a href="/wiki/File:Example.jpg">File link</a>
<a href="/wiki/Valid_Link">Go here</a></p>
"""

# Step 1: Remove parentheses
step1 = re.sub(r"\(.*?\)", "", page_html)

# Step 2: Remove italicized blocks (the <i> tags and everything inside them)
step2 = re.sub(r"<i>.*?</i>", "", step1, flags=re.DOTALL)

# Step 3: Extract links with a regex for href="..."
links = re.findall(r'href="([^"]+)"', step2)

# Step 4: Filter out invalid links (File, Category, Special, http)
valid_links = []
for lk in links:
    if re.match(r"^/wiki/(File|Category|Special):", lk):
        continue
    if lk.startswith("http"):
        continue
    valid_links.append(lk)

print("Valid links found:", valid_links)
# Next: take the first valid link, follow it, repeat the process in your Wiki Game.
```
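
The "follow it, repeat" step can be sketched without any network access by standing in a dictionary for Wikipedia. Everything here is an assumption for illustration: `fake_pages`, `first_valid_link`, and `follow_links` are hypothetical names, and a real solution would fetch pages over HTTP instead.

```python
import re

def first_valid_link(page_html):
    """Return the first link that survives the filters used above, or None."""
    text = re.sub(r"\(.*?\)", "", page_html)                  # drop parenthetical text
    text = re.sub(r"<i>.*?</i>", "", text, flags=re.DOTALL)   # drop italicized blocks
    for link in re.findall(r'href="([^"]+)"', text):
        if link.startswith("http"):
            continue
        if re.match(r"/wiki/(File|Category|Special):", link):
            continue
        return link
    return None

def follow_links(start, fetch, max_steps=20):
    """Follow first valid links from `start`; stop on a dead end, a loop, or max_steps."""
    path = [start]
    visited = {start}
    while len(path) <= max_steps:
        next_link = first_valid_link(fetch(path[-1]))
        if next_link is None or next_link in visited:
            break  # dead end, or we are about to revisit a page
        path.append(next_link)
        visited.add(next_link)
    return path

# Made-up pages: A -> B -> C -> back to A (a loop)
fake_pages = {
    "/wiki/A": '<p>Intro (<a href="/wiki/Skipped">aside</a>) then <a href="/wiki/B">B</a>.</p>',
    "/wiki/B": '<p><i><a href="/wiki/Italic">i</a></i> <a href="/wiki/File:pic.jpg">f</a> <a href="/wiki/C">C</a></p>',
    "/wiki/C": '<p><a href="/wiki/A">back to A</a></p>',
}

path = follow_links("/wiki/A", fake_pages.__getitem__)
print(path)
```

Keeping a `visited` set is what stops the walk when page C points back to page A.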

---

## 6. Conclusion

With these **regex basics** and **examples** in hand, you can:
1. **Skip** or **remove** unwanted text (parentheses, italic tags, special links).
2. **Extract** relevant data (years, normal wiki links).
3. **Build** the Wiki Game logic (follow valid links, handle loops).
4. **Scrape** or parse text from HTML by focusing on stable patterns (like `<a href="...">`) and ignoring sections you don't need.

**Key Takeaways**:
- **Master** the core symbols (`.`, `^`, `$`, `*`, `+`, `?`, `{m,n}`, `(...)`, `|`) and how to **escape** them (`\(`, `\.`).
- **Use** `re.findall`, `re.search`, `re.sub` wisely.
- **Test** your regex thoroughly on example strings to ensure correctness for edge cases.

Happy coding and good luck with your homework!
```


```markdown
# NLTK Grammar and Methods for STA 141B Homework

In some STA 141B tasks, you may want to go beyond simple **regex** or **HTML parsing** and leverage more robust **Natural Language Processing (NLP)** approaches. The **Natural Language Toolkit (NLTK)** is a popular Python library that provides tools for tokenizing text, tagging parts of speech, parsing syntax, and more. Below is a brief overview of NLTK basics that might be helpful if your homework requires deeper text analysis.

---

## 1. Installation & Basic Setup

If you are working in a standard environment (e.g., Anaconda, pip), install NLTK with:

```bash
pip install nltk
```

Then, within Python, you can download certain corpora or tokenizers:

```python
import nltk
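nltk.download('punkt')  # For tokenizers (recent NLTK releases may ask for 'punkt_tab' instead)
nltk.download('averaged_perceptron_tagger')  # For part-of-speech tagging (recent releases may ask for 'averaged_perceptron_tagger_eng')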
```

---

## 2. Tokenizing Text

- **Sentence Tokenization**: Splits a large text into individual sentences.  
- **Word Tokenization**: Splits a sentence into words (tokens).

```python
import nltk

text = "Hello world! This is a test. Let's see how NLTK tokenizes sentences."

# Sentence Tokenize
sentences = nltk.sent_tokenize(text)
print("Sentences:", sentences)

# Word Tokenize each sentence
for i, sent in enumerate(sentences, start=1):
    tokens = nltk.word_tokenize(sent)
    print(f"Sentence {i} tokens:", tokens)
```

You could use this approach, for example, if you want to detect and skip parentheses or italic sections at a token level rather than relying solely on regex.

---

## 3. Part-of-Speech (POS) Tagging

If you need to determine whether a token is a noun, verb, adjective, etc., you can use **POS tagging**. This might be helpful if you need to filter certain words or analyze link anchor text based on grammar.

```python
import nltk

sentence = "Wiki articles often contain many references."
tokens = nltk.word_tokenize(sentence)
pos_tags = nltk.pos_tag(tokens)
print(pos_tags)
# Example output might be: [('Wiki', 'NNP'), ('articles', 'NNS'), ('often', 'RB'), ('contain', 'VBP'), ...]
```

Each tuple has the form `(word, POS_tag)`, using Penn Treebank tags. For instance, `NN` = noun (singular), `NNS` = noun (plural), `NNP` = proper noun (singular), `RB` = adverb, `VBP` = verb (non-third-person singular present), etc.
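
Once text is tagged, filtering by part of speech is a plain list comprehension. The sketch below uses a hand-written list containing only the four pairs from the example output above, so it runs without NLTK; real tagger output may differ.

```python
# Hand-written (word, tag) pairs copied from the example output above
tagged_tokens = [
    ("Wiki", "NNP"),
    ("articles", "NNS"),
    ("often", "RB"),
    ("contain", "VBP"),
]

# Noun tags all start with "NN" (NN, NNS, NNP, NNPS)
nouns = [word for word, tag in tagged_tokens if tag.startswith("NN")]

# Verb tags all start with "VB"
verbs = [word for word, tag in tagged_tokens if tag.startswith("VB")]

print(nouns)
print(verbs)
```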

---

## 4. Chunking & Parsing

For more advanced tasks, you can define **chunk grammars** to group tokens into phrases:

```python
import nltk

grammar = "NP: {<DT>?<JJ>*<NN>}"  # A simple grammar for a Noun Phrase: optional Det, any # of Adjs, then a Noun
chunk_parser = nltk.RegexpParser(grammar)

text_sample = "The big dog jumped over the lazy fox."
tokens = nltk.word_tokenize(text_sample)
pos_tags = nltk.pos_tag(tokens)
tree = chunk_parser.parse(pos_tags)

print(tree)
tree.draw()  # If you have a GUI environment, this displays a tree diagram
```

**Relevance to Homework**: Chunking is usually not required unless you are analyzing the grammatical structure of anchor texts or script lines. Still, it’s good to know that NLTK can handle these tasks if your assignment asks for textual patterns more nuanced than regex can express.

---

## 5. Potential Use-Cases in Homework

1. **Filtering Certain Terms**: After tokenizing, you could remove stop words or punctuation.  
2. **Detecting Key Phrases**: If the assignment requires analyzing script lines or complex textual fields, NLTK can parse or chunk them.  
3. **Advanced Searching**: Instead of relying purely on `re.search`, you can tokenize, then do logical checks (e.g., skip tokens in parentheses or ignore italicized segments).

> **Note**: In many STA 141B use-cases, **regex + HTML parsing** will suffice. NLTK is an extra layer for more complex text manipulation (e.g., ignoring certain words by part of speech, or extracting lines that contain specific grammatical structures).

---

## 6. Practical Example

Below is a simplified example showing how you might combine **regex** to remove HTML tags, then **NLTK** to tokenize or further analyze the clean text:

```python
import re
import nltk

# Suppose you scraped some raw HTML:
raw_html = "<p>This is <i>italic text</i> in a sample script (1985 draft version).</p>"

# 1) Remove HTML tags with regex
text_no_html = re.sub(r"<.*?>", "", raw_html)

# 2) Remove parentheses text with regex
text_clean = re.sub(r"\(.*?\)", "", text_no_html)

# 3) Tokenize with NLTK
tokens = nltk.word_tokenize(text_clean)
print("Tokens:", tokens)

# 4) Perform POS tagging
pos_tags = nltk.pos_tag(tokens)
print("POS Tags:", pos_tags)
```

This approach is helpful if your assignment requires **linguistic** or **grammatical** analysis after stripping out extraneous HTML elements.

---

## 7. Summary

- **NLTK** provides a higher-level way to work with textual data, beyond simple regex.  
- **Tokenizing** text into sentences or words is crucial for more advanced tasks like part-of-speech tagging or chunking.  
- While not always necessary for the Wiki or IMSDb scraping, it can be useful if your approach requires deeper text processing (e.g., ignoring certain words by part-of-speech or analyzing script lines).

Feel free to experiment with NLTK’s capabilities if you find that the basic HTML parsing and regex approach **isn’t** enough for your homework.
```


Let's consider an example related to the lecture. [This](https://en.wikipedia.org/wiki/Around_the_World_in_Eighty_Days) novel contains geographical information, so let's identify all the cities named in it and put them on a map.

The file is not available in `nltk`, so we need to scrape it. 

```markdown
# Annotated NLP Example: Identifying City Names from a Novel, with Regex Explanation

This tutorial walks through scraping text from Project Gutenberg, identifying city names with `nltk.ne_chunk`, verifying them on Wikipedia, and plotting them on a map. **Additional emphasis** is placed on explaining the usage of `re.findall(r"\w+", document)`.

---

## 1. Overview

In this example, we:

1. **Scrape** Jules Verne’s _Around the World in 80 Days_ (Project Gutenberg ID #103).  
2. **Extract** text from `<p>` paragraphs.  
3. **Tokenize** using **regular expressions** and **NLTK**.  
4. **Perform** Named Entity Recognition to find GPEs (“Geo-Political Entities”), which may be cities.  
5. **Verify** each potential city with Wikipedia to confirm it’s actually a city.  
6. **Fetch** latitude & longitude for each verified city.  
7. **Plot** the results on a map with Plotly.

---

## 2. Key Regex Explanation

- **`re.findall(r"\w+", document)`**:  
  - `\w` is a shorthand character class in regex that matches **“word characters”** (letters, digits, and underscores by default).  
  - The `+` quantifier means **“one or more”** of those word characters.  
  - Thus, `r"\w+"` captures **consecutive alphanumeric** “word-like” sequences while effectively discarding punctuation and whitespace.  
  - In this example, we use `re.findall(...)` to obtain a **list of word tokens** from each paragraph before feeding them into NLTK for POS tagging and NER.

This approach is a **simpler form** of tokenization: punctuation will be excluded, and words like `O'Neill` would be split at the apostrophe. If you need a more nuanced approach (keeping some punctuation, handling contractions, etc.), other **tokenizers** (e.g., `nltk.word_tokenize`) might be preferable. But `\w+` is typically enough for quick “word-only” tasks.
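
To make the trade-off concrete, here is a small sketch comparing `\w+` with a slightly richer pattern that keeps internal apostrophes and hyphens. The sentence and the second pattern are illustrative assumptions, not part of the tutorial's pipeline.

```python
import re

sentence = "Mr. O'Neill's well-known route ran from London to Suez."

# Plain \w+ splits at every apostrophe and hyphen
simple_tokens = re.findall(r"\w+", sentence)

# Allow an apostrophe or hyphen only when it sits between word characters
joined_tokens = re.findall(r"\w+(?:['-]\w+)*", sentence)

print(simple_tokens)
print(joined_tokens)
```

Note that this handles only the straight apostrophe `'`; text that uses the typographic `’` would need that character added to the class.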

---

## 3. Full Code with Annotations

```python
###########################################################
# 1. IMPORTS
###########################################################
import requests_cache
from lxml.html import fromstring
import re
import nltk
import time
import pandas as pd
import plotly.express as px

# Explanation:
# - requests_cache: to cache repeated HTTP requests (avoid re-downloading pages).
# - lxml.html: parse HTML from strings so we can run XPath queries.
# - re: Python's built-in regex module for text matching/splitting.
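# - nltk: Natural Language Toolkit, used here for POS tagging and Named Entity Recognition (NER).
#   (pos_tag and ne_chunk rely on data fetched with nltk.download, e.g. the perceptron tagger,
#   'maxent_ne_chunker', and 'words'; exact resource names vary by NLTK version.)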
# - time: to add small sleep intervals (prevent hitting server rate limits).
# - pandas, plotly.express: for data manipulation and interactive mapping.

###########################################################
# 2. SCRAPE DATA FROM PROJECT GUTENBERG
###########################################################
# We fetch Jules Verne’s "Around the World in 80 Days" from Project Gutenberg.
s = requests_cache.CachedSession()
r = s.get("https://www.gutenberg.org/files/103/103-h/103-h.htm")
html = fromstring(r.text)

# Extract paragraphs under <div class="chapter"> into a list.
corpus = html.xpath('//div[@class="chapter"]/p/text()')

# Display the first paragraph for illustration.
document = corpus[0]
print("EXAMPLE PARAGRAPH:\n", document)

###########################################################
# 3. TOKENIZING & NAMED ENTITY RECOGNITION
###########################################################
# Let's see how we tokenize a single paragraph and identify GPEs (cities).
# Step 1: Basic regex-based tokenization with \w+ to capture word-like tokens.

words = re.findall(r"\w+", document)
# Explanation:
# "\w+" = sequences of word chars (letters, digits, underscores).
# This effectively breaks text into alphanumeric "words", ignoring punctuation.

print("\nTOKENS:\n", words)

# Step 2: Part-of-Speech tagging
token = nltk.pos_tag(words)
# Example: [("Mr", "NNP"), ("Phileas", "NNP"), ("Fogg", "NNP"), ... ]

# Step 3: Named Entity Recognition using nltk.ne_chunk
chunk = nltk.ne_chunk(token)
# chunk is an nltk.Tree with labeled subtrees like PERSON, ORGANIZATION, GPE, etc.

print("\nCHUNK (NER TREE):\n", chunk)

###########################################################
# 4. EXTRACTING CITY CANDIDATES
###########################################################
# Let's store GPE-labeled entities in a global list (not best practice, but simple here).
city_candidates = []

# A function to preprocess each paragraph:
def preprocess_paragraph(doc_text):
    global city_candidates
    
    # a) Tokenize with regex
    w = re.findall(r"\w+", doc_text)
    # b) POS tag
    t = nltk.pos_tag(w)
    # c) Named Entity Recognition
    c = nltk.ne_chunk(t)
    
    # d) Extract GPE-labeled subtrees
    for subtree in c:
        if isinstance(subtree, nltk.Tree):
            if subtree.label() == "GPE":
                # Gather all tokens from that subtree
                city_candidates.append([leaf[0] for leaf in subtree.leaves()])

# Apply to every paragraph in the corpus
for doc in corpus:
    preprocess_paragraph(doc)

# Convert the collected GPE-lists to a unique set of strings
city_candidates = set(" ".join(c) for c in city_candidates)
print("\nPOTENTIAL CITY CANDIDATES:\n", city_candidates)

###########################################################
# 5. WIKIPEDIA LOOKUP
###########################################################
# Check if the candidate is actually a city using Wikipedia categories.
def check_for_city(city):
    """
    Return True if 'city' has 'cities' in its Wikipedia categories.
    If not found, tries appending ' city' (e.g., 'London city').
    """
    is_city = False
    time.sleep(0.05)
    r = s.get("https://en.wikipedia.org/wiki/" + city)
    html = fromstring(r.text)

    try:
        categories = html.xpath('//div[@id="mw-normal-catlinks"]')[0].text_content()
    except:
        categories = ""

    if 'cities' in categories:
        is_city = True
    elif 'city' not in city:
        # Attempt with " city" appended
        is_city = check_for_city(city + " city")

    return is_city

def get_coord(city):
    """
    If 'city' is recognized as a city on Wikipedia,
    return (latitude, longitude). Otherwise return False.
    """
    print(city)
    is_city = False
    time.sleep(0.05)
    r = s.get("https://en.wikipedia.org/wiki/" + city)
    html = fromstring(r.text)

    try:
        categories = html.xpath('//div[@id="mw-normal-catlinks"]')[0].text_content()
    except:
        categories = ""

    if 'cities' in categories:
        is_city = True
    elif 'city' not in city:
        # Attempt city + " city"
        appended_check = get_coord(city + " city")
        if not isinstance(appended_check, bool):
            return appended_check

    if is_city:
        try:
            lat_string = html.xpath('//span[@class="latitude"]/text()')[0]
            # e.g. "22°18′N" -> degrees and minutes; minutes are sixtieths of a degree
            lat_nums = re.findall(r'\d+', lat_string)
            lat = float(lat_nums[0]) + float(lat_nums[1]) / 60
            if 'S' in lat_string:
                lat = -lat

            long_string = html.xpath('//span[@class="longitude"]/text()')[0]
            long_nums = re.findall(r'\d+', long_string)
            long = float(long_nums[0]) + float(long_nums[1]) / 60
            if 'W' in long_string:
                long = -long

            return (lat, long)
        except (IndexError, ValueError):
            return False

    return False

# Apply get_coord to every candidate, storing in a dict
coord = {city: get_coord(city) for city in city_candidates}

# Filter out those that returned False
valid_cities = {k: v for k, v in coord.items() if v is not False}

###########################################################
# 6. BUILD A DATAFRAME AND PLOT
###########################################################
df = pd.DataFrame(valid_cities).T.reset_index()
df = df.rename(columns={'index': 'Name', 0: 'latitude', 1: 'longitude'})

# Set your Mapbox token (read from file or environment variable)
px.set_mapbox_access_token(open("./../keys/mapbox.txt").read())

# Plot with Plotly
fig = px.scatter_mapbox(df, 
                        lat='latitude', 
                        lon='longitude', 
                        hover_name="Name", 
                        zoom=4)
fig.show()
```
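
Coordinates written in degrees and minutes (such as the `22°18′N` example in the code comments) are not decimal numbers: 18 minutes is 18/60 of a degree, so `22°18′N` is 22.3°, not 22.18°. A minimal conversion sketch follows, assuming the coordinate string holds degrees, optional minutes, optional seconds, and a trailing hemisphere letter; `dms_to_decimal` is a hypothetical helper name.

```python
import re

def dms_to_decimal(dms):
    """Convert a string like '22°18′N' or '33°51′54″S' to signed decimal degrees."""
    numbers = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", dms)]
    if not numbers:
        raise ValueError(f"no numeric parts found in {dms!r}")
    # Pad missing minutes/seconds with zeros, then keep the first three values
    degrees, minutes, seconds = (numbers + [0.0, 0.0])[:3]
    value = degrees + minutes / 60 + seconds / 3600
    # South latitudes and West longitudes are negative
    hemisphere = dms.strip()[-1].upper()
    return -value if hemisphere in ("S", "W") else value

print(dms_to_decimal("22°18′N"))
print(dms_to_decimal("114°10′E"))
print(dms_to_decimal("33°51′54″S"))
```

A helper like this would be applied to both the latitude and the longitude strings before building the DataFrame.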

---

## 4. Key Takeaways

1. **Regex Tokenization** via `\w+`:  
   - Finds sequences of letters/digits, effectively ignoring punctuation and whitespace.  
   - Straightforward but may split words at contractions or miss hyphenated terms.

2. **Named Entity Recognition with `nltk.ne_chunk`**:  
   - After POS-tagging, `ne_chunk` identifies labeled subtrees (e.g., `PERSON`, `ORGANIZATION`, `GPE`).  
   - We specifically look for `GPE` as a clue to potential city or country names.

3. **Wikipedia Verification**:  
   - Avoid false positives by checking whether a candidate has `'cities'` in its category links.

4. **Coordinate Extraction**:  
   - Parse `<span class="latitude">` and `<span class="longitude">` from Wikipedia’s infobox.  
   - Convert them into floating-point coordinates.

5. **Interactive Map**:  
   - With **Plotly**, we can visualize these recognized cities around the world.

``` 


```markdown
# **STA 141B WQ 25 Homework 3 (HINTS / PSEUDO-CODE VERSION)**

Below are **hints** and **partial/pseudo-code** for the given exercises. **No full solutions** are provided—only pointers on **logic** and **key methods/libraries**. You’re expected to fill in the details yourself.

---

## **Exercise 2: UC Irvine Compensation Data**

**Goal**: Retrieve UC Irvine (UCI) *professor* compensation data from UC’s open compensation site for 2023, then parse department info from the UCI directory.

### (a) Getting Compensation Data

1. **Send POST Request**  
   - Use `requests.post(...)` to query the UC compensation API (e.g., `https://ucannualwage.ucop.edu/wage/search.do`).
   - Provide parameters: e.g. `year=2023`, `title='PROF'`, `location='Irvine'`, etc.  
2. **Check/Parse Response**  
   - Use `.json()` to convert the response to Python structures.
   - Inspect `data['rows']` for the returned entries.  
   - Count with `len(...)`.

#### **Pseudo-Code Sketch**

```python
import requests

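# Example pattern (fill in correct params/URL)
# `params` sends the fields in the query string; if the server expects
# form fields in the request body, pass the same dict as `data=` instead.
result = requests.post("API_URL", params={
    'year': 2023, 
    'title': 'PROF',
    'location': 'Irvine',
    # other params...
}, timeout=30)
result.raise_for_status()   # stop early on an HTTP error instead of parsing an error page
data = result.json()
all_entries = data['rows']  # might hold your records

print(len(all_entries))  # total returned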
```

**Hint**: Some fields may appear as `"*****"`. You can filter those if needed.
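
One way to act on this hint is a small filter over the parsed rows. The record layout below (a dict with `firstname`, `lastname`, and `gross` keys) is a placeholder assumption, not the API's documented schema, and `is_masked` is a hypothetical helper; adapt the field access to whatever `data['rows']` actually contains.

```python
MASK = "*****"

def is_masked(record):
    """Return True if any field of the record is the masked placeholder."""
    return any(value == MASK for value in record.values())

# Hypothetical rows standing in for data['rows']
sample_rows = [
    {"firstname": "JANE", "lastname": "DOE", "gross": "150000.00"},
    {"firstname": "*****", "lastname": "*****", "gross": "98000.00"},
]

usable_rows = [row for row in sample_rows if not is_masked(row)]
print(len(usable_rows))  # 1
```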

---

### (b) Department Data via UCI Directory

1. **Extract Name Info**  
   - For each record, pick out first name/last name.  
2. **Call the UCI Directory**  
   - Possibly `requests.post("https://directory.uci.edu/render-list", data={...})` or similar.  
   - Provide a string like `"FirstName LastName"` in the form data.  
3. **Parse HTML**  
   - Use `lxml.html` or similar to parse the returned page.  
   - Extract the department text if available (e.g., an XPath like `//td[strong[text()="Department"]]/...`).  
4. **Store in a DataFrame**  
   - Convert your data into a `pandas.DataFrame`.  
   - Convert numeric pay columns with `pandas.to_numeric(...)`.  
5. **Group & Aggregate**  
   - `df.groupby("department").mean()` or `.agg(...)`.  
   - Sort by `gross` or `base` to find top departments.

#### **Pseudo-Code Sketch**

```python
import lxml.html as lx
import pandas as pd
from time import sleep

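gross, base, departments = [], [], []
for record in all_entries:
    # 1) Extract first/last name and the pay fields
    # 2) Directory request
    # 3) Parse HTML, find "Department" text
    # 4) Append one value per record to gross, base, and departments
    sleep(0.01)  # short pause between requests; lengthen it if the server throttles you

# All three lists must have the same length
df = pd.DataFrame({
    'gross': gross, 
    'base': base,
    'department': departments
})

# Convert to numeric; non-numeric values (e.g. a masked "*****") become NaN
df['gross'] = pd.to_numeric(df['gross'], errors='coerce')
df['base']  = pd.to_numeric(df['base'],  errors='coerce')

grouped = df.groupby('department').mean(numeric_only=True)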
top_by_gross = grouped.sort_values('gross', ascending=False).head(4)
top_by_base  = grouped.sort_values('base',  ascending=False).head(4)
```

**Hint**: You might find 500–1000 records with valid department data (some names may not match).

---

## **Exercise 2 (Wiki Game Variation)**

**Goal**: Start from a random (or specified) Wikipedia article, follow the **first non-italicized link** outside parentheses/infobox. Stop under any of these conditions:

- Reaches `/wiki/Philosophy`
- Finds a dead end (no valid links)
- Loops (revisiting a page already in the path)

### 1. Removing Parentheses

- Create a function using `re.sub(...)` to remove `( ... )` sections from text repeatedly.
- Ensure you only remove plain parentheses text, not parts of link markup.
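
A minimal sketch of both bullets, assuming the page is handled as an HTML string. The helper names `strip_parens_text` and `strip_parens_html` are hypothetical, and the tag-aware version is only one possible way to protect link markup such as `href="/wiki/Lazio_(region)"`.

```python
import re

INNERMOST = re.compile(r"\([^()]*\)")   # a "( ... )" group with no nested parentheses
TAG_SPLIT = re.compile(r"(<[^>]*>)")    # capture group keeps the tags in the split result

def strip_parens_text(text):
    """Remove parenthesised spans from plain text, innermost first."""
    previous = None
    while previous != text:
        previous = text
        text = INNERMOST.sub("", text)
    return text

def strip_parens_html(html):
    """Remove parenthesised text (and any tags inside it) without touching tag markup."""
    depth = 0
    kept = []
    for piece in TAG_SPLIT.split(html):
        if piece.startswith("<"):
            if depth == 0:          # tags outside parentheses are copied verbatim
                kept.append(piece)
            continue
        for ch in piece:
            if ch == "(":
                depth += 1
            elif ch == ")" and depth > 0:
                depth -= 1
            elif depth == 0:
                kept.append(ch)
    return "".join(kept)

html = ('<p>Rome (<a href="/wiki/Latin">Latin</a>) is in '
        '<a href="/wiki/Lazio_(region)">Lazio</a>.</p>')
cleaned = strip_parens_html(html)
print(cleaned)
```

The link inside the parentheses disappears, while the parentheses inside the second `href` survive because tags are never scanned character by character.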

### 2. Skipping Special / External Links

- Validate link with a regex like `^/wiki/` but exclude `/wiki/File:`, `/wiki/Category:`, `/wiki/Special:`.  
- Also skip external links (`http://`, `https://`).
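
The two rules above can be folded into one pattern with a negative lookahead. This is a sketch, not a complete filter: the prefix list contains only the namespaces named above, and `is_article_link` is a hypothetical name.

```python
import re

# Namespaces named in the text; extend the tuple if you meet others.
EXCLUDED_PREFIXES = ("File", "Category", "Special")

# Must start with /wiki/ and must not continue with an excluded "Prefix:".
ARTICLE_LINK = re.compile(r"^/wiki/(?!(?:" + "|".join(EXCLUDED_PREFIXES) + r"):)")

def is_article_link(href):
    """Return True for internal article links, False for special or external ones."""
    return bool(ARTICLE_LINK.match(href))

for href in ["/wiki/Philosophy", "/wiki/File:Plato.jpg",
             "/wiki/Special_relativity", "https://example.org/wiki/X"]:
    print(href, is_article_link(href))
```

Because the lookahead requires the colon, an ordinary title such as `/wiki/Special_relativity` still passes. External links fail simply because they do not start with `/wiki/`.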

### 3. Gathering Articles in a Chain

- Start from either `Special:Random` or a given `article` link.
- **Fetch** the page, **parse** main content, **remove** parentheses if needed.
- **Pick** the first valid link **not** in italic or an excluded region (infobox, note boxes, etc.).
- If no valid link, return `None`.

#### **Pseudo-Code Sketch**

```python
import requests
import lxml.html as lx
import re
import time

def remove_parenthesis(text):
    # repeatedly use re.sub to remove ( ... )
    # return cleaned text
    ...  # stub body so the file parses; replace with your implementation

def check_link(link):
    # verify link starts with /wiki/
    # exclude /wiki/File: etc.
    ...

def get_article(article_link):
    # 1) request: "https://en.wikipedia.org" + article_link
    # 2) remove parentheses from text
    # 3) parse valid <a> from the main content
    # 4) pick the first passing check_link
    # 5) return that link or None
    ...

def play(start_article=None):
    visited = []
    # if no start_article => use Special:Random
    current_article = start_article or "/wiki/Special:Random"
    while True:
        visited.append(current_article)
        next_link = get_article(current_article)
        dead_end = next_link is None
        looped = next_link in visited
        if dead_end or looped or next_link == "/wiki/Philosophy":
            break
        current_article = next_link

    # add last link (None for a dead end, the repeated page for a loop)
    visited.append(next_link)
    return visited
```

**Hint**: Use an **XPath** to exclude `<i>`, `<table>`, `<figure>`, or note-like `<div>`s from your search for the first link.
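
To make the exclusion logic concrete without extra installs, the sketch below uses the standard-library `html.parser` as a stand-in for the XPath approach. With `lxml` the same idea would typically be written with `not(ancestor::...)` predicates. The class and function names are hypothetical, and note-like `<div>`s are left out for brevity.

```python
from html.parser import HTMLParser

SKIPPED_TAGS = {"i", "table", "figure"}   # regions the hint says to ignore

class FirstLinkFinder(HTMLParser):
    """Record the href of the first /wiki/ link that is not inside a skipped tag."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0       # number of skipped tags currently open
        self.first_href = None

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1
        elif tag == "a" and self.skip_depth == 0 and self.first_href is None:
            href = dict(attrs).get("href")
            if href and href.startswith("/wiki/"):
                self.first_href = href

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

def first_link(html):
    """Return the first acceptable link in the HTML string, or None."""
    finder = FirstLinkFinder()
    finder.feed(html)
    return finder.first_href

page = ('<table><tr><td><a href="/wiki/Infobox_item">x</a></td></tr></table>'
        '<p><i><a href="/wiki/Italic">y</a></i> is a '
        '<a href="/wiki/Target">z</a> and <a href="/wiki/Later">w</a>.</p>')
print(first_link(page))  # /wiki/Target
```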

---

### 4. Running 200 Times + Collecting Stats

- Call `play()` in a loop (200 runs).  
- **Count** how many end with `/wiki/Philosophy`.  
- **Compute** average and max path lengths.  
- **Collect** all visited articles to find the top 10 most visited (`collections.Counter` or `pandas.Series.value_counts`).  
- **Unique** article count is `len(set(...))`.

#### **Pseudo-Code Sketch**

```python
from collections import Counter

results = []
for i in range(200):
    chain = play()
    results.append(chain)

# (i) how many end in Philosophy
count_philos = sum(1 for c in results if c[-1] == '/wiki/Philosophy')

# (ii) average length
avg_len = sum(len(c) for c in results) / len(results)

# (iii) maximum length
max_len = max(len(c) for c in results)

# (iv) ten most often visited
# play() appends None after a dead end, so drop it before counting
all_articles = [art for chain in results for art in chain if art is not None]
counts = Counter(all_articles)
most_common_10 = counts.most_common(10)

# (v) number of distinct visited articles
unique_count = len(set(all_articles))
```

### 5. Starting from Philosophy

- If you want the chain starting at `/wiki/Philosophy`, do `play("/wiki/Philosophy")`.
- Print/inspect the list.

---

### **Final Notes**

These hints cover **key steps** (HTTP requests, regex filtering, HTML parsing, data aggregation) but **do not provide** the full working solution. You must fill in details and handle edge cases (like missing fields, unexpected HTML structures, etc.). Good luck!
```
