## Processing PDFs

This lesson covers how to scrape the UC Berkeley Police Department (UCPD) website for the latest crime statistics, then extract the data from a PDF.

In [None]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import pdfplumber
import requests
import time
import os

Whenever you access a website, your computer sends a bit of data about itself, called headers, to that website, including which web browser you're using and where you're coming from. Because we're making requests from a server rather than a browser, we will set the headers ourselves: a browser-style user agent, and a referer that points to the organization we actually represent.

In [None]:
headers = {
    'referer': 'https://journalism.berkeley.edu/',
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'
}

If you visit the [https://ucpd.berkeley.edu/alerts-data/daily-crime-log](https://ucpd.berkeley.edu/alerts-data/daily-crime-log) page, you'll see that all of the data is published as PDFs, a format that is hard to work with. This is common with many government agencies. Instead of manually downloading these PDFs one by one, we can programmatically scrape the website and download all of them with code.

We will use the **requests** Python library, which is built for requesting information from other web properties.

In [None]:
webpage = requests.get("https://ucpd.berkeley.edu/alerts-data/daily-crime-log", headers=headers, timeout=4)
webpage.encoding = 'utf-8'
webpage

Next, we will use a common Python web scraping parser called Beautiful Soup to make sense of the webpage text string. 

In [None]:
soup = BeautifulSoup(webpage.text, 'html.parser')
soup

Much cleaner. Now we just need to find all of the `<a>` tags that link to a PDF file. First, let's find all of the `<a>` tags that have an `href` attribute. BeautifulSoup will return them in a list-like object that we can loop over.

In [None]:
urls = soup.find_all('a', href=True)
urls

Great, now we just need a for loop that goes through each link and builds a new Python list containing only the URLs that point to a PDF.

In [None]:
pdfs = []

for url in urls:
    if url['href'].endswith(".pdf"):
        pdfs.append(url['href'])

pdfs
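
If any of the links on the page turn out to be relative (for example, starting with `/sites/...`), `requests.get` will not be able to fetch them as-is. The following is a minimal sketch of one way to handle that, using `urljoin` from the standard library. The helper name `absolute_pdf_links` and the sample hrefs are made up for illustration and are not taken from the real page.

```python
from urllib.parse import urljoin

# Assumed base URL: the page the links were scraped from.
BASE_URL = "https://ucpd.berkeley.edu/alerts-data/daily-crime-log"


def absolute_pdf_links(hrefs, base_url=BASE_URL):
    """Return absolute URLs for every href that points at a PDF file."""
    links = []
    for href in hrefs:
        # Compare in lower case so ".PDF" is matched as well as ".pdf".
        if href.lower().endswith(".pdf"):
            # urljoin leaves absolute URLs alone and resolves relative ones.
            links.append(urljoin(base_url, href))
    return links


# Made-up hrefs for illustration only; the real page's links will differ.
sample_hrefs = [
    "/sites/default/files/example_log.pdf",
    "https://example.com/files/REPORT.PDF",
    "/contact",
]
print(absolute_pdf_links(sample_hrefs))
```

With the scraped links, this could be called as `absolute_pdf_links(url['href'] for url in urls)`.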

That first item in the pdfs list is an errant document. We can use the `pop` method to get rid of it. 

In [None]:
pdfs.pop(0)
pdfs

Before we download all of these PDFs, we need to be aware of any rate limits the server might have. Servers may block scrapers that request files too many times in quick succession. Many websites list their policies in a special file called robots.txt that sits at the root of the site. Let's review UCPD's policy:

In [None]:
print(requests.get("https://ucpd.berkeley.edu/robots.txt", headers=headers, timeout=4).text)
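
Instead of reading the file by eye, the rules can also be parsed with `urllib.robotparser` from the standard library. This is only a sketch: the `sample_robots` text below is a stand-in written for illustration, not a copy of UCPD's actual file, so paste in the text printed above to check the real rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here only for illustration.
sample_robots = """\
User-agent: *
Crawl-delay: 10
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(sample_robots.splitlines())

# crawl_delay() returns None when the file declares no Crawl-delay.
delay = parser.crawl_delay("*")
print(delay)

# can_fetch() reports whether a given user agent may request a URL.
print(parser.can_fetch("*", "https://example.com/files/log.pdf"))
print(parser.can_fetch("*", "https://example.com/admin/settings"))
```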

Now that we know the crawl delay is 10 seconds, we will wait 10 seconds between each request. This for loop downloads each file and saves it to our server.

In [None]:
for pdf in pdfs:
    try:
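        # A timeout keeps the loop from hanging forever on a stalled request
        response = requests.get(pdf, headers=headers, timeout=30)
        response.raise_for_status()  # raise an error for 4xx/5xx responses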
        
        filename = os.path.basename(pdf)
        with open(filename, 'wb') as f:
            f.write(response.content)
        
        print(f'Downloaded: {filename}')
        time.sleep(10)  # wait 10 seconds between each request

    except requests.exceptions.RequestException as e:
        print(f"Failed to download {pdf}: {e}")
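
Because each download is followed by a 10-second wait, re-running the loop from scratch can be slow. One possible refinement, sketched here under the assumption that a file already on disk is complete, is to skip URLs whose file name already exists. The helper `files_still_needed` and the example URLs are hypothetical.

```python
import os
import tempfile


def files_still_needed(pdf_urls, directory="."):
    """Return the URLs whose file name is not already present in `directory`."""
    needed = []
    for url in pdf_urls:
        filename = os.path.basename(url)
        if not os.path.exists(os.path.join(directory, filename)):
            needed.append(url)
    return needed


# Demonstration with made-up URLs and a temporary folder.
with tempfile.TemporaryDirectory() as folder:
    # Pretend a.pdf was downloaded on an earlier run.
    open(os.path.join(folder, "a.pdf"), "wb").close()
    demo_urls = ["https://example.com/a.pdf", "https://example.com/b.pdf"]
    remaining = files_still_needed(demo_urls, folder)
    print(remaining)
```

The download loop could then iterate over `files_still_needed(pdfs)` instead of `pdfs`.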

Let's take the first file and try to extract its table using a utility called PDF Plumber. Replace the value of `pdf_name` with the name of one of your files.

In [None]:
pdf_name = "dcl20250401.pdf"

pdf = pdfplumber.open(pdf_name)
page = pdf.pages[0]
image = page.to_image(resolution=150)
image.reset().debug_tablefinder()

This shows us what PDF Plumber's automatic table detection finds. Unfortunately, the lines on the table are partially broken, so the table is not detected very well. But we can help it along by cropping the page to just the table portion and explicitly stating where the vertical lines are in this table. PDF Plumber does the rest.

In [None]:
table_settings = {"explicit_vertical_lines": [10, 73, 387, 470, 695]}
crop_settings = (0, 140, 792, 580)

image = page.crop(crop_settings).to_image(resolution=150)
image.reset().debug_tablefinder(table_settings)

Much better. Now we can extract the text of the table.

In [None]:
table_text = page.crop(crop_settings).extract_table(table_settings)
table_text
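
Extracted cells may contain line breaks where text wrapped inside a cell, and a cell with no text may come back as `None`. The sketch below shows one way to tidy such rows before further use. The helper `clean_table` and the sample row are made up for illustration and do not come from the actual crime log.

```python
def clean_table(rows):
    """Collapse whitespace in each cell and turn missing cells into empty strings."""
    cleaned = []
    for row in rows:
        # split() with no arguments breaks on any whitespace, including newlines.
        cleaned.append([" ".join(cell.split()) if cell else "" for cell in row])
    return cleaned


# Made-up row shaped like extract_table() output: a list of lists of str or None.
sample_rows = [
    ["25-00001", "PETTY\nTHEFT", "04/01/25", None, "EXAMPLE\nHALL"],
]
print(clean_table(sample_rows))
```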

Our data is in a 2-dimensional Python list, which is perfect for importing to a Pandas DataFrame. We just need to list the column headers.

In [None]:
columns = ["Case", "Crimes", "Reported", "Occurred Range", "Location"]

df = pd.DataFrame(table_text, columns=columns)
df

Lastly, we could modify our code to loop through every page of the PDF, and even every PDF file, combining all of the data into a single spreadsheet. This is beyond the scope of this lesson, so we'll just save the page we extracted.

In [None]:
df.to_csv("first_page_exported.csv", encoding="utf-8", index=False)
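
As a starting point for the multi-page case, here is a rough sketch of how per-page tables might be merged into one list of rows. It assumes every page shares the same layout, so the same crop and table settings apply, which may not hold for the real files. The helper `combine_page_tables` and the sample tables are hypothetical.

```python
def combine_page_tables(page_tables, header=None):
    """Flatten per-page tables into one list of rows.

    Pages where no table was found (None or empty) are skipped, and any
    row equal to `header` is dropped so repeated headings do not end up
    in the data.
    """
    combined = []
    for table in page_tables:
        if not table:
            continue
        for row in table:
            if header is not None and row == header:
                continue
            combined.append(row)
    return combined


# Made-up tables standing in for the result of calling
# page.crop(crop_settings).extract_table(table_settings) on each page.
sample_tables = [
    [["Case", "Crimes"], ["1", "A"]],
    None,
    [["Case", "Crimes"], ["2", "B"]],
]
rows = combine_page_tables(sample_tables, header=["Case", "Crimes"])
print(rows)
```

In the notebook, the list of tables could be built with `[p.crop(crop_settings).extract_table(table_settings) for p in pdf.pages]`, and the combined rows passed to `pd.DataFrame(rows, columns=columns)`.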