# BYOC: Build Your Own Corpus

There's a lot in this notebook. One reason for that is that scraping websites is more art than science: you have to be patient and be prepared either to tune your current method or to change to a different method. After the recap below, I discuss four methods:

- CLI tool `wget`
- Python `urllib` in conjunction with `BeautifulSoup`
- Python `selenium` in conjunction with `BeautifulSoup` *and* `wget`
- Python `praw` library to work with Reddit

My best advice is to read quickly through this notebook just to get a sense of all the possibilities, then to turn to your own corpus and discover for yourself what works. Be prepared for bumps and bruises along the way!

## Recap of Reasons to Create Your Own Corpus

**The Big Picture (for our small data)**
- Any machine learning application is primarily focused on discovering **signal**(s) within noise. 
- Extraction is done via **feature analysis**: determining which features, properties, or dimensions best encode the data's meaning and underlying structure. 
- Determining a **representation** requires us to define the units (of language)—the things we count, measure, analyze, and learn from.

**What is a corpus?**
- A collection of related documents (that contain natural language)
- Size matters less but good design matters more. (ChatGPT is proof that large size can overcome some bad design decisions. A variety of focused BERTs reveal that design can beat size.)
- Typically, a corpus can be broken down into categories (or genres) of documents, which can emerge as principal components, centroids, neighborhoods, etc.
- Corpora can be text (data) only or they can contain metadata: the titles of novels, the date a tweet was published, the name of the user who posted the TikTok, the subreddit in which a post appeared, the number of comments or likes any of these have received.

**What are the components?**
- Documents can be broken into paragraphs (units of discourse which conventionally express one idea or action), into sentences (units of syntax), into words and punctuation. (Words can be further broken down into syllables and characters, if interested.)
- The unit of analysis is the decision of the analyst.

**Advantages of Domain-Specific Corpora**
- A corpus that is relatively focused on a specific domain is easier to analyze and model than one made up of mixed domains.
- By fitting models in a narrower context, the prediction space is smaller and more specific, and therefore better able to handle the flexible aspects of language. (Again, ChatGPT versus all the BERTs.)
- How do we do it: scraping, RSS ingestion, API, finding a raw text corpus already out there. (See: Kaggle, Harvard Dataverse, etc.)
- Acquisition is the first step. Determining how to structure and manage the data, which includes “cleaning”, is the next step, and one that is often overlooked. 

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import re
import nltk
import pandas as pd

%matplotlib inline
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['figure.dpi'] = 300

## Getting Data

Places to get stuff:

**Books**
* Project Gutenberg: (http://gutenberg.org/)
* Google Books (http://books.google.com)
* EEBO (http://quod.lib.umich.edu/e/eebogroup/)
* ECCO (http://quod.lib.umich.edu/e/ecco/)
* Evans (http://quod.lib.umich.edu/e/evans/)

**Scholarship & Science**
* JSTOR DFR (http://dfr.jstor.org/)
* Open Access: PMC Open Access Set, PLoS, BioMed Central

**Social Media**
* The author of _Mining the Social Web_ (O'Reilly) has made the code available (on GitHub of course). Here's his website: [Mining the Social Web](https://miningthesocialweb.com).
* [Twitter/X APIs](https://dev.x.com)
* [Facebook API](https://developers.facebook.com)
* [Reddit API](https://developers.reddit.com/docs/api)

### Troubles with Access and Quality

The elephant in the room is **copyright**. The reason so much work is done with older books is that they are out of copyright. Getting access to contemporary books is difficult and/or expensive and is largely the purview of well-funded institutions. 

The next issue is quality. Most older materials have been digitized, which means a picture of the page was taken and then run through optical character recognition software (OCR). Some texts are more thoroughly checked than others. (Crowd-sourcing can vary: it all depends on the community.) 

There was a real rush to digitize things in the 2000s and 2010s, but those efforts were often underfunded and underconsidered. E.g., Louisiana Digital Library is kind of a mess. This doesn't mean you shouldn't use resources from under-resourced agencies and organizations, but it does mean that you should do hand-checking. (Depending on size, you may want to find a way to give your material either a thorough check or a considered random check.) OCR scanning errors can lead to signals that are weak or just plain wrong.

On social media, see Boyd and Crawford 2011 (SSRN: 1926431).

Not only is OCR problematic, but automated tasks, like named entity extraction, are also questionable. 

### CLI Tool: `wget`

Sometimes CLI tools, like `wget`, are more powerful than GUI tools. The key difference is that GUI tools are easier to use at first, but repetitive tasks are difficult or expensive (in terms of time). CLI tools are a little more difficult at first, but once you have an established collection of them, they are not only easier to use but just plain easier. 

**`wget`** is one of those tools. E.g.:

    % wget -r -l 1 -w 2 --limit-rate=20k https://www.cs.cmu.edu/~spok/grimmtmp/

Since CLI commands can often appear like magical incantations, let's break down what's happening in the line above:

* `wget` is the program we are going to use. It wants to know where to go and how to proceed.
* `-r` (or `--recursive`) turns on recursive retrieving (by default, up to 5 levels deep). 
* `-l 1` (or `--level=1`) keeps the depth to 1.
* `-w 2` gives the amount of time to wait between retrievals. (Two seconds lessens the server load.)
* `--limit-rate=20k` sets the retrieval rate to 20kB/s. (This is being polite in a shared connection setting.)

K.M. Kinnaird and I used `wget` to "crawl" the TED website and download transcripts, talk and speaker descriptions, and comments pages. I have also used it, as the URL in the example above suggests, to grab the texts of Grimms' fairy tales. It's a truly useful tool, and, luckily for us, available through the **conda** package management system. *Yay!*
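
For anyone who would rather launch the crawl from a notebook cell, here is a minimal sketch that assembles the same command in Python. The helper name `build_wget_command` and its parameter names are my own inventions for illustration; only the flags come from the example above. Passing the resulting list to `subprocess.run` should start the crawl, assuming `wget` is installed and on your PATH.

```python
import subprocess


def build_wget_command(url, depth=1, wait_seconds=2, rate="20k"):
    """Assemble the wget call from the example above as an argument list."""
    return [
        "wget",
        "-r",                     # recursive retrieval
        "-l", str(depth),         # limit the recursion depth
        "-w", str(wait_seconds),  # seconds to wait between retrievals
        f"--limit-rate={rate}",   # cap the download speed
        url,
    ]


command = build_wget_command("https://www.cs.cmu.edu/~spok/grimmtmp/")
print(" ".join(command))

# Uncomment to actually run the crawl (requires wget on your PATH):
# subprocess.run(command, check=True)
```

Keeping the command as a list, rather than one long string, avoids shell-quoting headaches when a URL contains characters like `&` or `?`.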

### Python Libraries: `urllib`

#### Case 1: Downloading Files from a Directory

Some sites do not like being crawled by `wget` and will return **ERROR 403: Forbidden**. In cases where you can see the directory contents in a browser, but the site has probably been configured so that directories cannot be browsed directly, you will have to plot an alternative path.

Sometimes you can use `wget` to retrieve a list of files and then feed that list to a Python script, or sometimes you can do everything you need from within Python. When I was helping someone working with treaties between Plains Indians and various government agencies, we developed the following:

In [None]:
import urllib.request
from bs4 import BeautifulSoup

# To use this script, the user needs to provide the three values below: 
# myurl, myfilter, mydirectory
# Please make sure `mydirectory` is already created before running

myurl = "http://digital.library.okstate.edu/kappler/Vol2/Toc.htm"
myfilter = "http://digital.library.okstate.edu/kappler/Vol2/treaties/"
mydirectory = "/Users/me/Desktop/downloadedfiles/"

myconnection = urllib.request.urlopen(myurl)
myhtml = myconnection.read()
mysoup = BeautifulSoup(myhtml, "lxml")
mylinks = mysoup.find_all('a')

all_links = []
for tag in mylinks:
    link = tag.get('href',None)
    if link is not None:
        all_links.append(link)

myresults = [k for k in all_links if myfilter in k]

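for result in myresults:
    # Name each local file after the part of the URL that follows the filter
    localpath = mydirectory + result.replace(myfilter, '')
    # The `with` block closes both the connection and the file for us
    with urllib.request.urlopen(result) as remotefile, open(localpath, 'wb') as localfile:
        localfile.write(remotefile.read())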
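
One thing to watch for: the filter in the script above only matches if the table of contents uses absolute URLs. If a page uses relative links instead, a sketch like the following, which relies on the standard library's `urllib.parse.urljoin`, can resolve them first. The function name `absolute_links` and the sample `href` values are made up for illustration; they are not taken from the Kappler site.

```python
from urllib.parse import urljoin


def absolute_links(page_url, hrefs, prefix):
    """Resolve hrefs against page_url and keep those that start with prefix."""
    resolved = [urljoin(page_url, href) for href in hrefs]
    return [link for link in resolved if link.startswith(prefix)]


# Hypothetical hrefs, as they might appear in a table of contents
sample_hrefs = ["treaties/sample0001.htm", "../index.htm", "treaties/sample0002.htm"]

links = absolute_links(
    "http://digital.library.okstate.edu/kappler/Vol2/Toc.htm",
    sample_hrefs,
    "http://digital.library.okstate.edu/kappler/Vol2/treaties/",
)
for link in links:
    print(link)
```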

#### Case 2: Working with a Content Management System

What happens when the texts with which you want to work are not sitting in a directory, but are in a content management system (CMS)? While helping someone interested in Paul Laurence Dunbar's poetry and fiction, we started with the previous script to map how the CMS generated URLs. Once we had a map, we were pretty sure we had a way to get what we wanted.

Here is the link for the digital archive of Dunbar’s work at Wright State: http://www.libraries.wright.edu/special/dunbar/

![Screenshot of Dunbar Archive Web Page](../assets/2-1-ScreenShot_Dunbar.png)

If we click on the "poetry" link in the lefthand navigation pane, and then hover over one of the books (see image above), we see the following URL: 

    http://www.libraries.wright.edu/special/dunbar/explore?book=8

Clicking on a book takes us to a table of contents, with a series of links like this:

    http://www.libraries.wright.edu/special/dunbar/explore?book=9&id=236

The `id`s are not sequential within a book; however, by playing with the URLs in a browser, it looks like you can insert an asterisk into the portion of the URL that identifies the book, `book=*`, and still get back results based simply on the `id=`:

    http://www.libraries.wright.edu/special/dunbar/explore?book=*&id=99

In fact, after a little experimentation, typing in different `id` numbers and getting back results, it looks like we just need to iterate through all the `id`s. If we start with `1`, how far up do we need to go? Since I saw numbers in the 300s earlier, I am going to start with 400 and go up by 100 until I get no results, then narrow by 10s and then 1s until I know where to stop ... and it appears we stop at 433.
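
I did that probing by hand in the browser, but the same coarse-to-fine search can be sketched in a few lines. Everything named here is hypothetical: `find_last_id` is an invented helper, and `has_content` stands in for whatever check tells you a page came back with real content. The sketch also assumes that the starting `id` is valid and that every `id` below the last one returns something.

```python
def find_last_id(has_content, start=400, steps=(100, 10, 1)):
    """Walk upward in shrinking steps and return the last id that has content."""
    current = start  # assumed to be an id that returns content
    for step in steps:
        # Keep jumping by this step size until the next jump would overshoot
        while has_content(current + step):
            current += step
    return current


# Stand-in for a real page check: pretend ids 1 through 433 exist
def pretend_site(i):
    return 1 <= i <= 433


print(find_last_id(pretend_site))
```

In practice `has_content` would request the page for a given `id` and report whether the expected content came back; that part is left out so the sketch runs offline.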

Now let's go build, er, revise us some code...

In [None]:
#! /usr/bin/env python

import urllib.request
from bs4 import BeautifulSoup
import re

baseurl = "http://www.libraries.wright.edu/special/dunbar/explore?book=*&id="
mydirectory = "/Users/jjl/Desktop/downloadedfiles/"

mylist = []
for i in range(1, 434):
    link = baseurl+str(i)
    mylist.append(link)

for link in mylist:
    remotefile = urllib.request.urlopen(link).read()
    soup = BeautifulSoup(remotefile, "lxml")
    div = soup.find('div', 'bookContain-right')
    localfile = open(mydirectory+link.replace(baseurl, '')+".html",'wt')
    localfile.write(str(div.encode('utf-8')))
    localfile.close()

The code works, and it returns only the contents of the desired `div`:

    <div class="bookContain-right">

But the contents remain ugly: `str(div.encode('utf-8'))` writes out the printed form of a bytes object, so the file is littered with escaped characters (those that begin with a backslash). At the very least, some regex is needed to clean those up. Perhaps better would be to use `html2text` to convert the documents to plain text. 
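
To see where the backslashes come from, here is a small sketch that runs offline. The sample string is made up; it simply contains curly quotes of the kind that turn up in web pages. The second half shows one possible alternative, which is to write the text itself and name the encoding when opening the file.

```python
import tempfile
from pathlib import Path

# Made-up stand-in for the contents of a scraped div
snippet = "<div>“Sympathy”</div>"

# What the script above effectively writes: the printed form of a bytes object
ugly = str(snippet.encode("utf-8"))
print(ugly)

# One alternative: write the string as text and state the encoding explicitly
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.html"
    path.write_text(snippet, encoding="utf-8")
    clean = path.read_text(encoding="utf-8")
print(clean)
```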

In [None]:
#! /usr/bin/env python

import urllib.request
from bs4 import BeautifulSoup
import html2text

baseurl = "http://www.libraries.wright.edu/special/dunbar/explore?book=*&id="
mydirectory = "/Users/jjl/Desktop/downloadedfiles/"

mylist = []
for i in range(1, 2):  # only id 1 while testing; range(1, 434) would fetch everything
    link = baseurl+str(i)
    mylist.append(link)

for link in mylist:
    remotefile = urllib.request.urlopen(link).read()
    soup = BeautifulSoup(remotefile, "lxml")
    div = soup.find('div', 'bookContain-right')
    text = html2text.html2text(str(div))
    localfile = open(mydirectory+link.replace(baseurl, '')+".txt",'wt')
    localfile.write(str(text))
    localfile.close()

### Python Library: `selenium`

While I worked for the Army, I was interested in possibly writing a paper about how the Army's thinking about China had changed over time. My goal was to build a corpus of all the materials available from the Army War College (which has a journal as well as theses written by students and papers by faculty), West Point, and TRADOC. 

Trying to keep things as simple as possible, I started by trying to scrape the AWC Press website, which uses BePress, with the `requests` library. (See [DataQuest's tutorial](https://www.dataquest.io/blog/web-scraping-python-using-beautiful-soup/) as a reminder/guide.) Using a URL that worked in my browser, I met with failure:

```python
import requests
page = requests.get("https://press.armywarcollege.edu/do/search/?q=china&start=25&context=18225338&facet=#query-results")
page
```

```
ConnectionError: HTTPSConnectionPool(host='press.armywarcollege.edu', port=443): Max retries exceeded with url: /do/search/?q=china&start=25&context=18225338&facet= (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fb9318dab90>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known'))
```

After trying a few other approaches, including my old friend `wget`, I turned to Selenium, with some help from [ScrapFly's "Web Scraping with Selenium and Python"](https://scrapfly.io/blog/web-scraping-with-selenium-and-python/). It requires the `chromedriver`; both `selenium` and `chromedriver` are available through **conda**.

<div class="alert alert-block alert-danger">
In my experience, the chromedriver wants something fairly specific: <br/><code>export PATH=$PATH:/Users/jl/miniconda3/bin</code>. <br/>
Make sure to add that line to your shell startup file (<code>.zshrc</code> on macOS).</div>

Once you connect, you can turn to the trusty parser of web pages, `BeautifulSoup`, and drill down into the search page HTML to get the URLs you need to create a list of URLs. In my case I decided to feed the list to `wget` and let it fetch the PDFs from the AWC Press website.

In [None]:
from selenium import webdriver
from bs4 import BeautifulSoup
import time

# ==== To run headless:
# from selenium.webdriver.chrome.options import Options

# configure webdriver
# options = Options()
# options.add_argument("--headless=new")  # hide GUI (newer Selenium releases dropped options.headless)
# options.add_argument("--window-size=1920,1080")  # set window size to native GUI size
# options.add_argument("start-maximized")  # ensure window is full-screen

# driver = webdriver.Chrome(options=options)
# ====

for i in range(0, 475, 25):  # the search results are paged 25 at a time
    driver = webdriver.Chrome()
    driver.get(f"https://press.armywarcollege.edu/do/search/?q=china&start={i}&context=18225338&facet=#query-results")
    time.sleep(15)  # Selenium is a bit slow, so patience.
    soup = BeautifulSoup(driver.page_source, "lxml")  # name the parser explicitly
    soupy_pdfs = soup.find_all('a', class_="pdf")
    # Keep only the link targets so the file can be fed to wget
    pdfs = [item['href'] for item in soupy_pdfs if item.has_attr('href')]
    with open("pdfurls4.txt", "a") as file:
        file.writelines(s + '\n' for s in pdfs)
    driver.quit()  # I thought the driver remaining open was the problem
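
Because the loop opens `pdfurls4.txt` in append mode, rerunning the cell adds the same lines to the file all over again. A minimal sketch of cleaning that up before handing the list to `wget` follows. The helper name `unique_lines` and the sample URLs are invented for illustration.

```python
def unique_lines(lines):
    """Drop blank lines and repeats while keeping first-seen order."""
    stripped = (line.strip() for line in lines)
    # dict.fromkeys keeps the first occurrence of each key, in order
    return list(dict.fromkeys(line for line in stripped if line))


# Made-up lines, as they might look after running the cell twice
sample = [
    "https://example.org/a.pdf\n",
    "https://example.org/b.pdf\n",
    "https://example.org/a.pdf\n",
    "\n",
]
print(unique_lines(sample))
```

With a real file, the same function could be applied to the lines read back from `pdfurls4.txt`.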

### AltPath: Using the Search Results File

In [None]:
# IMPORTS
import requests
from bs4 import BeautifulSoup

# Use our search results to create URLs
with open('../notes/AWC-search_results.txt', 'r') as the_file:
    urls = the_file.read().splitlines()

# Create an empty list
pdflinks = []

# And then fill it
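for url in urls:
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    anchor = soup.find('a', attrs={'id': 'alpha-pdf'})
    # Skip pages with no PDF link instead of stopping on a TypeError
    if anchor is not None and anchor.has_attr('href'):
        pdflinks.append(anchor['href'])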

In [None]:
len(pdflinks)

In [None]:
with open('../notes/pdfurls.txt', 'w') as f:
    f.writelines(i + '\n' for i in pdflinks)

## Working with APIs

### Reddit's `praw`

Our final example uses the `praw` library to download the 100 hot posts in `r/conspiracy` and save them to a dataframe (and later to a CSV file). 

*For those interested, it is also possible to use the ID for each post to grab all the comments. You can then add the comments to the dataframe, all of which can then be saved as a CSV.*

<div class="alert alert-block alert-danger">
<b>Warning!</b> Please note that most APIs will require you to set up credentials. You may be tempted to include those credentials in your script. <b>NEVER INCLUDE YOUR CREDENTIALS IN YOUR CODE.</b> It is too easy to commit and push and before you know it, your credentials are out on the big, bad web. Instead, save your credentials in a separate file which sits outside the repository. 
</div>

Loading credentials is fairly easy: I have saved mine as environment variables in my `.zshrc` file, and in the code below I grab them using the `os.getenv()` function. Easy peasy.
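
One wrinkle: `os.getenv()` quietly returns `None` when a variable is missing, and the failure then shows up later, somewhere less obvious. A small wrapper that complains right away can save some head-scratching. The sketch below is only an illustration: `require_env` is an invented helper, and `DEMO_API_ID` is a throwaway variable set inside the example so that it runs anywhere.

```python
import os


def require_env(name):
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Throwaway variable so the example runs anywhere; real credentials would be
# exported in your shell startup file and never written into the notebook
os.environ["DEMO_API_ID"] = "not-a-real-id"
print(require_env("DEMO_API_ID"))
```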

In [None]:
import os
import praw
import pandas as pd
import numpy as np

# SECRET DECODER RING
webber_id = os.getenv("REDDIT_API_ID")
webber_secret = os.getenv("REDDIT_API_SECRET")

In [None]:
reddit = praw.Reddit(
    client_id=webber_id,          # loaded from the environment above
    client_secret=webber_secret,  # never paste the actual values here
    user_agent="webber"
    )
print(reddit.read_only)
# try:
#     print("Authenticated as {}".format(reddit.user.me()))
# except ResponseException:
#     print("Something went wrong during authentication")

# print(reddit.auth.url(scopes=["identity"], state="...", duration="permanent"))

In [None]:
# hot_posts = reddit.subreddit('conspiracy').hot(limit=5)
# for post in hot_posts:
#     print(post.title)

In [None]:
conspiracy = reddit.subreddit('conspiracy')

posts = []
for post in conspiracy.hot(limit=100):
    posts.append([post.title, 
                  post.score, 
                  post.id, 
                  post.url, 
                  post.num_comments, 
                  post.selftext, 
                  post.created])

posts = pd.DataFrame(posts,
                     columns=['title', 'score', 'id', 'url', 'num_comments', 'body', 'created'])

posts.shape

In [None]:
posts.head()

In [None]:
# posts.to_csv('c2.csv')
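
For those who want the comments as well, here is a hedged sketch of one way to collect them for a single post. It relies on `reddit.submission(id=...)`, `comments.replace_more(limit=0)`, and `comments.list()` as I understand them from the PRAW documentation; the function name `fetch_comments` and the dictionary keys are my own choices.

```python
def fetch_comments(reddit, post_id):
    """Return a flat list of comment records for one submission."""
    submission = reddit.submission(id=post_id)
    # Discard the "load more comments" placeholders rather than expanding them
    submission.comments.replace_more(limit=0)
    return [
        {"post_id": post_id, "comment_id": c.id, "score": c.score, "body": c.body}
        for c in submission.comments.list()
    ]


# Possible usage with the objects created above (makes one request per post):
# rows = [row for pid in posts['id'] for row in fetch_comments(reddit, pid)]
# comments = pd.DataFrame(rows)
```

Setting `limit=0` keeps the number of requests down at the cost of missing deeply nested or collapsed comments, which may or may not matter for your corpus.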

### Braxton's Lyrics

In [None]:
import urllib.request
from bs4 import BeautifulSoup

# To use this script, provide the URL of the artist's page below.
# (Unlike the earlier script, no filter or download directory is needed yet.)

myurl = "https://www.azlyrics.com/m/mfdoom.html"


myconnection = urllib.request.urlopen(myurl)
myhtml = myconnection.read()
mysoup = BeautifulSoup(myhtml, "lxml")
mylinks = mysoup.find_all(class_="listalbum-item")

In [None]:
type(mylinks)

In [None]:
for item in mylinks[0:4]:
    print(item)

In [None]:
mydivs = mysoup.find_all(class_="listalbum-item")

links = []
for div in mydivs[0:2]:
    the_link = div.find('a')['href']
    links.append(the_link)


print(links)

Okay, so now we have a list holding the piece of each URL that we want to use.

Let's take a look at the URL we are trying to replicate:
```
https://www.azlyrics.com/lyrics/mfdoom/meddlewithmetal.html
```

It looks like all we need to do is prepend `https://www.azlyrics.com` to build the full URL. (Note that we have a slash already, so we don't need the trailing slash in our base URL.)

In [None]:
%pwd

In [None]:
%cd ~/Desktop/downloaded

In [None]:
baseURL = "https://www.azlyrics.com"
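# First try: the links are relative, so this absolute-URL filter never matches
# and the printed filenames still include the whole path
myfilter = "https://www.azlyrics.com/lyrics/mfdoom/"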

# mydirectory = "/Users/me/Desktop/downloaded/"

for link in links:
    remotefile = urllib.request.urlopen(baseURL+link)
    filename = link.replace(myfilter, '')
    print(filename)
    # localfile = open(link.replace(myfilter, ''),'wb')
    # localfile.write(remotefile.read())
    # localfile.close()
    # remotefile.close()

In [None]:
baseURL = "https://www.azlyrics.com"
myfilter = "/lyrics/mfdoom/"

# mydirectory = "/Users/me/Desktop/downloaded/"

for link in links:
    remotefile = urllib.request.urlopen(baseURL+link)
    filename = link.replace(myfilter, '')
    localfile = open(filename,'wb')
    localfile.write(remotefile.read())
    localfile.close()
    remotefile.close()
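
The two attempts above show how touchy string replacement can be: the filter has to match the link exactly or nothing happens. A sketch of an alternative, using only the standard library, is to ask for the last segment of the URL path, which works whether the link is relative or absolute. The function name `filename_from_link` is invented for illustration.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def filename_from_link(link):
    """Return the final path segment of a relative or absolute URL."""
    return PurePosixPath(urlparse(link).path).name


print(filename_from_link("/lyrics/mfdoom/meddlewithmetal.html"))
print(filename_from_link("https://www.azlyrics.com/lyrics/mfdoom/meddlewithmetal.html"))
```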