# Downloading Notre Dame Capstan PDFs

From the [University Archives](http://archives.nd.edu/digital/):
- "During World War II, the United States Navy trained many officers at Notre Dame. The naval program published its own yearbook, called Capstan."
- [Capstan, 1943-1945, Digital Collection](http://archives.nd.edu/Capstan/)

This Jupyter Notebook includes code and comments for downloading all PDFs of *Capstan* and for matching yearbook titles to file names.

# Import Libraries, Load URL, and Create Beautiful Soup Object

In [None]:
# import libraries
import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
import csv

In [None]:
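# load url; stop with an error if the request was not successful
page = requests.get('http://archives.nd.edu/Capstan/Capstan.htm')
page.raise_for_status()

# create beautifulsoup object
soup = BeautifulSoup(page.text, 'html.parser')

# isolate the first 'ol' element, which holds the list of yearbook links
url_names = soup.find('ol')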

# find all instances of 'a' tag
items = url_names.find_all('a')

# show items
items
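
To make the extraction step easier to inspect without network access, here is a minimal, dependency-free sketch of the same idea using the standard library's `html.parser` instead of Beautiful Soup. The class name `OrderedListLinkParser`, the sample HTML, and the sample file names are hypothetical and are not taken from the Capstan page. Unlike `soup.find('ol')`, which looks only at the first ordered list, this sketch collects links from every `ol` it meets.

```python
from html.parser import HTMLParser


class OrderedListLinkParser(HTMLParser):
    """Collect (href, text) pairs for <a> tags that sit inside an <ol>."""

    def __init__(self):
        super().__init__()
        self.ol_depth = 0          # number of <ol> elements currently open
        self.current_href = None   # href of the <a> being read, if any
        self.current_text = []     # text fragments of the <a> being read
        self.links = []            # collected (href, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag == "ol":
            self.ol_depth += 1
        elif tag == "a" and self.ol_depth > 0:
            self.current_href = dict(attrs).get("href")
            self.current_text = []

    def handle_data(self, data):
        # keep text only while an <a> with an href is open
        if self.current_href is not None:
            self.current_text.append(data)

    def handle_endtag(self, tag):
        if tag == "ol" and self.ol_depth > 0:
            self.ol_depth -= 1
        elif tag == "a" and self.current_href is not None:
            text = "".join(self.current_text).strip()
            self.links.append((self.current_href, text))
            self.current_href = None


# hypothetical sample markup: one link outside the list, two inside it
SAMPLE_HTML = """
<p><a href="index.htm">Home</a></p>
<ol>
  <li><a href="sample_1.pdf">Sample Title One</a></li>
  <li><a href="sample_2.pdf">Sample Title Two</a></li>
</ol>
"""

parser = OrderedListLinkParser()
parser.feed(SAMPLE_HTML)
links = parser.links
print(links)
```

The link outside the list is ignored, which mirrors the reason for isolating the `ol` tag before searching for `a` tags.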

# Get List of Yearbook Links and Titles

In [None]:
# create empty list for urls
url_list = []

# create empty list for yearbook titles
title_list = []

# for each link: build the full url from the href and append it to url_list;
# extract the link text (yearbook title) and append it to title_list
for item in items:
    url_list.append(urljoin("http://archives.nd.edu/Capstan/", item.get('href')))
    title_list.append(item.get_text(strip=True))
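
How a base address and an `href` combine depends on the form of the `href`: it may be a bare file name, a site-relative path, or already a full URL. The sketch below shows how `urllib.parse.urljoin` resolves each case. The helper name `absolute_url` and the file name `Capstan_sample.pdf` are hypothetical and used only for illustration.

```python
from urllib.parse import urljoin

BASE_URL = "http://archives.nd.edu/Capstan/"


def absolute_url(href, base=BASE_URL):
    """Resolve an href against the collection's base URL."""
    return urljoin(base, href)


# hypothetical hrefs in three common forms
sample_hrefs = [
    "Capstan_sample.pdf",                               # bare file name
    "/Capstan/Capstan_sample.pdf",                      # site-relative path
    "http://archives.nd.edu/Capstan/Capstan_sample.pdf" # already absolute
]

resolved = [absolute_url(href) for href in sample_hrefs]
for url in resolved:
    print(url)
```

All three forms resolve to the same address, whereas plain string concatenation would only be correct for the first.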

In [None]:
# show list of links
url_list

In [None]:
# show list of yearbook titles
title_list

# Download PDFs from List of URLs

In [None]:
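# import libraries
import urllib3
import os

# configure urllib3
http = urllib3.PoolManager()
print("downloading with urllib3")

# for loop that downloads each PDF into the current working directory
for url in url_list:
    r = http.request('GET', url)
    # skip unsuccessful responses so error pages are not saved as PDFs
    if r.status != 200:
        print("skipped", url, "status", r.status)
        continue
    # use the last part of the url as the local file name
    filename = os.path.basename(url)
    with open(filename, 'wb') as fcont:
        fcont.write(r.data)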
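
Taking `os.path.basename(url)` as the local file name works when a URL ends in a plain file name. As a hedged sketch of a more defensive variant, the hypothetical helper `local_filename` below parses the URL first, so a query string or percent-encoded characters do not end up in the file name. The sample URLs are invented for illustration.

```python
import posixpath
from urllib.parse import urlparse, unquote


def local_filename(url):
    """Return the decoded last path segment of a URL, ignoring any query string."""
    path = urlparse(url).path              # drops '?query' and '#fragment'
    return posixpath.basename(unquote(path))  # decodes '%20' and similar


plain = local_filename("http://archives.nd.edu/Capstan/sample.pdf")
tricky = local_filename("http://archives.nd.edu/Capstan/sample%20file.pdf?download=1")
print(plain)
print(tricky)
```

`posixpath` is used rather than `os.path` because URL paths always use forward slashes, regardless of the operating system running the notebook.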

# Matching File Names and Yearbook Titles

In [None]:
# create empty list for file names
file_names = []

# for loop that extracts href contents (the PDF file name) and appends to file_names list
for item in items:
    file_names.append(item.get('href'))

# show file_names list
file_names

In [None]:
# import pandas
import pandas as pd

# create empty dataframe with two columns
df = pd.DataFrame(columns=['file_name', 'title'])

# assign file_names list to file_name column
df['file_name'] = file_names

# assign title_list to title column
df['title'] = title_list

# show dataframe
df

In [None]:
# write dataframe to csv file (named for the Capstan collection)
df.to_csv('capstan_file_name_master.csv', index=False)
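
The same two-column mapping can also be written with the standard library's `csv` module. The sketch below is a minimal alternative, not a replacement for the pandas cell; the sample file names and titles are hypothetical, and the output goes to an in-memory buffer so nothing is written to disk.

```python
import csv
import io

# hypothetical sample data standing in for file_names and title_list
file_names = ["sample_1.pdf", "sample_2.pdf"]
title_list = ["Sample Title One", "Sample Title, Two"]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["file_name", "title"])          # header row
writer.writerows(zip(file_names, title_list))    # one row per yearbook
csv_text = buffer.getvalue()
print(csv_text)
```

Note that `zip` silently stops at the shorter list, so it is worth checking that both lists have the same length before writing. Titles containing commas are quoted automatically, as the second sample row shows.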