# Downloading Notre Dame Annual Catalogues and Bulletins

From the [University Archives](http://archives.nd.edu/digital/):
- "Notre Dame's catalogues or bulletins included descriptions of courses, programs, curricula, facilities, and faculty. They generally [listed students](http://archives.nd.edu/bulletin/stdnts.htm) and provided information on [graduation ceremonies](http://archives.nd.edu/bulletin/cmmncmts.htm), degree recipients, and academic prizes won by students."
- [Notre Dame Annual Catalogues or Bulletins, 1850 - 1914](http://archives.nd.edu/bulletin/)

This Jupyter Notebook includes code and comments that download all PDFs of the Notre Dame annual catalogues and bulletins, and match document titles to file names.

# Import Libraries, Load URL, and Create Beautiful Soup Object

In [None]:
# import libraries
import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
import csv

In [None]:
# load url, create beautifulsoup object
page = requests.get('http://archives.nd.edu/bulletin/catalogs.htm')

# stop early if the request failed (e.g. 404 or 500)
page.raise_for_status()

soup = BeautifulSoup(page.text, 'html.parser')

# isolate the first 'ul' element on the page
link_list = soup.find('ul')

# find all instances of 'a' tag inside that list
items = link_list.find_all('a')

# show items
items

# Get List of Publication Links and Titles

In [None]:
# empty list for urls
url_list = []

# empty list for publication titles
title_list = []

# for each link: resolve the href against the url root and append to url_list;
# extract the visible link text and append to title_list
for item in items:
    url_list.append(urljoin("http://archives.nd.edu/bulletin/", item.get('href')))
    title_list.append(item.get_text(strip=True))
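
How an `href` resolves against the page it was scraped from can be checked in isolation. The following is a minimal sketch using `urllib.parse.urljoin` from the standard library; the `resolve` helper and the three `href` values are hypothetical and are not taken from the archive page.

```python
from urllib.parse import urljoin

# Page the links were scraped from (the URL requested earlier in this notebook).
PAGE_URL = "http://archives.nd.edu/bulletin/catalogs.htm"


def resolve(href, page_url=PAGE_URL):
    """Return the absolute URL that href points to when found on page_url."""
    return urljoin(page_url, href)


# Hypothetical href values, for illustration only.
relative = resolve("example_1850.pdf")                 # bare file name
rooted = resolve("/other/example.pdf")                  # path from the site root
absolute = resolve("http://example.org/example.pdf")    # already absolute

print(relative, rooted, absolute, sep="\n")
```

A bare file name lands next to the page, a root-relative path replaces the whole path, and an absolute URL is returned unchanged, so one call covers all three cases.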

In [None]:
# show list of urls
url_list

In [None]:
# show list of titles
title_list

# Download PDFs from List of URLs

In [None]:
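# import libraries
import urllib3
import os

# configure urllib3
http = urllib3.PoolManager()
print("downloading with urllib3")

# for loop that downloads PDF for each url in url_list
for url in url_list:
    r = http.request('GET', url)
    # skip anything that did not come back successfully
    if r.status != 200:
        print("skipping", url, "status", r.status)
        continue
    # save under the last path component of the url, in the working directory
    filename = os.path.basename(url)
    with open(filename, 'wb') as fcont:
        fcont.write(r.data)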
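
Re-running the download cell fetches every file again. One possible variation, sketched here under assumptions, skips files that already exist and records failures instead of stopping. It swaps in `urllib.request` from the standard library rather than `urllib3`; `fetch_bytes`, `download_all`, and the `fetch` parameter are hypothetical names, and the demonstration uses a stand-in fetcher so it runs without network access.

```python
import os
import tempfile
import urllib.request


def fetch_bytes(url):
    """Default fetcher: return the response body for url."""
    with urllib.request.urlopen(url, timeout=60) as response:
        return response.read()


def download_all(urls, dest_dir, fetch=fetch_bytes):
    """Save each URL into dest_dir, skipping files that already exist.

    Returns three lists: saved file names, skipped file names, failed URLs.
    """
    os.makedirs(dest_dir, exist_ok=True)
    saved, skipped, failed = [], [], []
    for url in urls:
        filename = os.path.basename(url)
        target = os.path.join(dest_dir, filename)
        if os.path.exists(target):
            skipped.append(filename)
            continue
        try:
            data = fetch(url)
        except OSError:  # urllib.error.URLError is a subclass of OSError
            failed.append(url)
            continue
        with open(target, "wb") as handle:
            handle.write(data)
        saved.append(filename)
    return saved, skipped, failed


# Offline demonstration with a stand-in fetcher and a hypothetical URL.
demo_dir = tempfile.mkdtemp()
fake_fetch = lambda url: b"%PDF-1.4 demo"
first_run = download_all(["http://example.org/a.pdf"], demo_dir, fetch=fake_fetch)
second_run = download_all(["http://example.org/a.pdf"], demo_dir, fetch=fake_fetch)
print(first_run)
print(second_run)
```

With the real list, the call would be `download_all(url_list, "pdfs")`, where `"pdfs"` is an arbitrary folder name chosen for this sketch.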

# Matching File Names and Publication Titles

In [None]:
# create empty list for file names
file_names = []

# for loop that extracts href contents, appends to file_names list
for item in items:
    file_names.append(item.get('href'))

# show file_names list
file_names

In [None]:
# import pandas
import pandas as pd

# create dataframe
df = pd.DataFrame(columns=['file_name', 'title'])

# write file names to column
df['file_name'] = file_names

# write publication titles to column
df['title'] = title_list

# output dataframe
df

## Create Cleaned Version of Publication Titles

In [None]:
# import libraries
import re
import string

# build a character class that matches any punctuation/special character;
# re.escape keeps characters such as ] and \ from breaking the pattern
pattern = r"[{}]".format(re.escape(string.punctuation))

# create title_clean column in dataframe and replace special characters with dashes
# (regex=True is required: pandas 2.0+ treats the pattern as literal text by default)
df['title_clean'] = df['title'].str.replace(pattern, "-", regex=True)

# replace spaces with dashes
df['title_clean'] = df['title_clean'].str.replace(" ", "-", regex=False)

# collapse any run of dashes into a single dash
df['title_clean'] = df['title_clean'].str.replace(r"-+", "-", regex=True)

# remove trailing dash
df['title_clean'] = df['title_clean'].str.replace(r"-$", "", regex=True)

# show updated dataframe
df
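
The cleaning steps can be tried on a single string before they are applied to the whole column. This is a minimal sketch in plain Python that mirrors the same four steps; the function name `clean_title` and the sample titles are hypothetical and do not come from the archive.

```python
import re
import string

# Character class matching any single ASCII punctuation character.
PUNCTUATION = re.compile("[{}]".format(re.escape(string.punctuation)))


def clean_title(title):
    """Turn a publication title into a dash-separated file-name stem."""
    cleaned = PUNCTUATION.sub("-", title)    # punctuation -> dash
    cleaned = cleaned.replace(" ", "-")      # spaces -> dash
    cleaned = re.sub(r"-+", "-", cleaned)    # collapse runs of dashes
    return re.sub(r"-$", "", cleaned)        # drop a trailing dash


# Hypothetical titles, for illustration only.
print(clean_title("Annual Catalogue, 1850-1851."))
print(clean_title("Bulletin (Vol. 1)"))
```

Two different titles can clean to the same stem, so it may be worth checking `df['title_clean'].duplicated().any()` before the stems are used as file names.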

## Write DataFrame to CSV File

In [None]:
# write dataframe to csv file
df.to_csv('bulletins_catalogs_name_master.csv', index=False)
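
Because the mapping is saved to disk, the renaming step could also be run in a later session without repeating the scrape. A minimal sketch, assuming the CSV has the `file_name` and `title_clean` columns created above; `load_name_map` is a hypothetical helper, and the demonstration writes a throwaway file with a made-up row rather than touching the real CSV.

```python
import csv
import os
import tempfile


def load_name_map(csv_path):
    """Return {file_name: title_clean} read from a CSV with those columns."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return {row["file_name"]: row["title_clean"]
                for row in csv.DictReader(handle)}


# Demonstration on a temporary file with one hypothetical row.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_name_master.csv")
with open(demo_path, "w", newline="", encoding="utf-8") as handle:
    writer = csv.writer(handle)
    writer.writerow(["file_name", "title", "title_clean"])
    writer.writerow(["a.pdf", "Example Title, 1850.", "Example-Title-1850"])

name_map = load_name_map(demo_path)
print(name_map)
```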

# Rename PDFs

The code below renames each downloaded file with its cleaned publication title.

In [None]:
# import os
import os

# create dictionary from file_name and title_clean columns in dataframe
references = dict(df.set_index("file_name")["title_clean"])

# show dictionary
references

In [None]:
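# for loop that isolates dictionary elements and renames PDFs
for old_name, title in references.items():
    new_name = title + ".pdf"
    try:
        # downloads were saved under the base name of each url
        os.rename(os.path.basename(old_name), new_name)
    except OSError as err:
        # report files that could not be renamed instead of failing silently
        print(f"could not rename {old_name}: {err}")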
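
On POSIX systems `os.rename` silently replaces an existing target file, so two publications that share a cleaned title would leave only one PDF. The sketch below is one way to guard against that, not the notebook's own method: it checks for missing sources and existing targets before renaming. `rename_with_report` and the file names in the demonstration are hypothetical.

```python
import os
import tempfile


def rename_with_report(mapping, directory="."):
    """Rename files according to {old_name: new_stem} without overwriting.

    Returns three lists of old names: renamed, missing, collisions.
    """
    renamed, missing, collisions = [], [], []
    for old_name, stem in mapping.items():
        source = os.path.join(directory, os.path.basename(old_name))
        target = os.path.join(directory, stem + ".pdf")
        if not os.path.exists(source):
            missing.append(old_name)      # nothing was downloaded under this name
        elif os.path.exists(target):
            collisions.append(old_name)   # never overwrite an existing file
        else:
            os.rename(source, target)
            renamed.append(old_name)
    return renamed, missing, collisions


# Demonstration in a temporary folder with two dummy files.
demo_dir = tempfile.mkdtemp()
for name in ("a.pdf", "b.pdf"):
    with open(os.path.join(demo_dir, name), "wb") as handle:
        handle.write(b"demo")

report = rename_with_report(
    {"a.pdf": "Same-Title", "b.pdf": "Same-Title", "c.pdf": "Other-Title"},
    directory=demo_dir,
)
print(report)
```

With the dictionary built above, the call would be `rename_with_report(references)`; anything listed under collisions keeps its original file name and can be handled by hand.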