# Downloading Notre Dame Directories

From the [University Archives](http://archives.nd.edu/digital/):
- "Lists of Notre Dame officers, administrators, rectors, prefects, faculty, post-doctoral research fellows, and students. The alphabetical list of faculty generally indicates academic department, campus address and home address. The alphabetical list of students gives major subject or academic program, dorm or local address, and home address."
- [Notre Dame Directories, 1922 - 1974](http://archives.nd.edu/dir/)

This Jupyter Notebook includes code and comments that download all directory PDFs and match directory titles to file names.

# Import Libraries, Load URL, and Create Beautiful Soup Object

In [None]:
# import libraries
import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
import csv

In [None]:
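# load url, stop early if the request failed, create beautifulsoup object
page = requests.get('http://archives.nd.edu/dir/dir.htm')
page.raise_for_status()
soup = BeautifulSoup(page.text, 'html.parser')

# isolate the first 'ul' element, which holds the list of directories
url_names = soup.find('ul')

# find all 'a' tags (one link per directory) inside that list
items = url_names.find_all('a')

# show the link tags
items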
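
The extraction step above can be tried offline on a small inline HTML sample before running it against the live page. This is a minimal sketch: the sample markup, its file names, and the helper name `extract_links` are hypothetical and are not taken from the archive page. It also guards against a page with no `ul` element and skips `a` tags that have no `href`.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the real page may be structured differently.
SAMPLE_HTML = """
<ul>
  <li><a href="example_1922.pdf">Example Directory 1922</a></li>
  <li><a href="example_1923.pdf"><b>Example</b> Directory 1923</a></li>
  <li><a name="anchor-only">No link here</a></li>
</ul>
"""


def extract_links(html):
    """Return (href, title) pairs for every linked item in the first <ul>."""
    soup = BeautifulSoup(html, "html.parser")
    listing = soup.find("ul")
    if listing is None:
        # No list on the page: return nothing instead of raising AttributeError.
        return []
    return [
        (a["href"], a.get_text(" ", strip=True))
        for a in listing.find_all("a", href=True)  # skip anchors without href
    ]


for href, title in extract_links(SAMPLE_HTML):
    print(href, "|", title)
```

Passing `" "` as the separator to `get_text` keeps a space between nested pieces of text, so a title that is partly wrapped in another tag is not run together.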

# Get List of Directory Links and Titles

In [None]:
# create empty list for urls
url_list = []

# create empty list for directory titles
title_list = []

# for each link: resolve the href against the directory base url and append it to url_list;
# extract the link text (the directory title) and append it to title_list
for item in items:
    url_list.append(urljoin('http://archives.nd.edu/dir/', item.get('href')))
    title_list.append(item.get_text(' ', strip=True))
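
A quick sketch of how `urljoin` from the standard library (imported at the top of this notebook) resolves different kinds of `href` values against a base URL, which plain string concatenation does not handle. The `href` values below are hypothetical examples, not file names from the archive.

```python
from urllib.parse import urljoin

BASE = "http://archives.nd.edu/dir/dir.htm"

# Hypothetical href values of the three kinds a page can contain.
relative = urljoin(BASE, "example_1922.pdf")           # relative to the page
rooted = urljoin(BASE, "/dir/example_1922.pdf")        # relative to the site root
absolute = urljoin(BASE, "http://example.org/a.pdf")   # already a full URL

# Concatenation only gives a valid URL for the first kind.
concatenated = "http://archives.nd.edu/dir/" + "/dir/example_1922.pdf"

print(relative)
print(rooted)
print(absolute)
print(concatenated)
```

The first two calls resolve to the same address, the third is returned unchanged, and the concatenated string ends up with a duplicated path segment.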

In [None]:
# show list of links
url_list

In [None]:
# show list of directory titles
title_list

# Download PDFs from List of URLs

In [None]:
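# import libraries
import urllib3
import os

# create a urllib3 connection pool
http = urllib3.PoolManager()
print("downloading with urllib3")

# for loop that downloads each PDF into the current working directory
for url in url_list:
    r = http.request('GET', url)
    # skip unsuccessful responses so error pages are not saved as PDFs
    if r.status != 200:
        print("skipped", url, "status", r.status)
        continue
    filename = os.path.basename(url)
    with open(filename, 'wb') as fcont:
        fcont.write(r.data)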
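
For a long list of PDFs it can help to make the download resumable. The sketch below is one possible variant, not the method used in this notebook: it swaps in the standard library's `urllib.request` instead of `urllib3`, streams each file to disk, skips files that already exist, and collects failures instead of stopping. The function name `download_files` and the `pdfs` folder are hypothetical.

```python
import os
import shutil
import urllib.request
from urllib.parse import urlparse


def download_files(urls, dest_dir="pdfs", timeout=30):
    """Download each URL into dest_dir; return (saved_paths, failures)."""
    os.makedirs(dest_dir, exist_ok=True)
    saved, failed = [], []
    for url in urls:
        filename = os.path.basename(urlparse(url).path)
        target = os.path.join(dest_dir, filename)
        partial = target + ".part"
        if os.path.exists(target):
            # Already downloaded in an earlier run.
            saved.append(target)
            continue
        try:
            # urlopen raises an OSError subclass on network and HTTP errors.
            with urllib.request.urlopen(url, timeout=timeout) as response, \
                    open(partial, "wb") as out:
                shutil.copyfileobj(response, out)  # stream in chunks
        except OSError as err:
            if os.path.exists(partial):
                os.remove(partial)
            failed.append((url, str(err)))
            continue
        # Rename only after a complete download, so partial files never
        # look like finished PDFs.
        os.replace(partial, target)
        saved.append(target)
    return saved, failed
```

With the list built earlier, this could be called as `saved, failed = download_files(url_list)`, after which `failed` holds the URLs that need another attempt.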

# Matching File Names and Directory Titles

In [None]:
# create empty list for file names
file_names = []

# for loop that extracts href contents (the PDF file name), appends to file_names list
for item in items:
    file_names.append(item.get('href'))

# show file_names list
file_names

In [None]:
# import pandas
import pandas as pd

# create dataframe with two columns: file_names fills 'file_name', title_list fills 'title'
df = pd.DataFrame({'file_name': file_names, 'title': title_list})

# show dataframe
df

In [None]:
# write dataframe to csv file
df.to_csv('directories_file_name_master.csv', index=False)
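
Once the CSV is written, it can be used to check that every listed PDF was actually downloaded. The following is a minimal sketch using the standard library's `csv` module; the helper name `find_missing_files` is hypothetical, and it assumes the PDFs sit in the directory passed as `directory` (the current working directory by default).

```python
import csv
import os


def find_missing_files(csv_path, directory="."):
    """Return file names listed in the CSV that are absent from directory."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        rows = list(csv.DictReader(handle))
    return [
        row["file_name"]
        for row in rows
        if not os.path.isfile(os.path.join(directory, row["file_name"]))
    ]
```

Calling `find_missing_files('directories_file_name_master.csv')` returns an empty list when every file named in the CSV is present.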