# Downloading Notre Dame Scholastic Football Review

From the [University Archives](http://archives.nd.edu/digital/):
- "The Notre Dame Scholastic published reviews of the football season starting in 1901. In 1910 a separate publication covered the "Gridiron Season" and from 1919 to 1921 a Football Review provided competition for the Scholastic. From 1924 to 1932 the Football Review prevailed as the Scholastic provided little or no commentary. In later years the Scholastic generally published its own special issue on the football season, though Irish Eye took over for a time in the 1980s."
- [Notre Dame Football Review, 1901 - 2010](http://archives.nd.edu/Football/)

This Jupyter Notebook includes code and comments that download all Football Review PDFs and match publication titles to file names.

# Import Libraries, Load URL, and Create Beautiful Soup Object

In [None]:
# import libraries
import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
import csv

In [None]:
# load url and stop early on an HTTP error
page = requests.get('http://archives.nd.edu/Football/Football.htm')
page.raise_for_status()

# create beautifulsoup object
soup = BeautifulSoup(page.text, 'html.parser')

# isolate HTML with 'ol' tag
url_names = soup.find('ol')

# find all instances of 'a' tag
items = url_names.find_all('a')

items

# Get List of Publication Links and Titles

In [None]:
# create empty list for urls
url_list = []

# create empty list for publication titles
title_list = []

# for loop that resolves each href against the base url and appends it to url_list;
# extracts the link text (publication title) and appends it to title_list
for item in items:
    url_list.append(urljoin("http://archives.nd.edu/Football/", item.get('href')))
    title_list.append(item.get_text(strip=True))

In [None]:
# show list of urls
url_list

In [None]:
# show list of publication titles
title_list
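
The URL-building step depends on what form each `href` takes. The following is a minimal sketch of how `urllib.parse.urljoin` (imported above) resolves relative, root-relative, and absolute links against the page URL. The helper `resolve_links` and the sample `href` values are hypothetical and are not taken from the real archive page.

```python
from urllib.parse import urljoin

# Page the notebook requests; hrefs are resolved relative to it.
BASE_URL = "http://archives.nd.edu/Football/Football.htm"


def resolve_links(hrefs, base=BASE_URL):
    """Resolve each href against the page URL, as a browser would."""
    return [urljoin(base, href) for href in hrefs]


# Hypothetical href values for illustration only.
sample_hrefs = [
    "Football-1901.pdf",                                   # relative
    "/Football/Football-1902.pdf",                         # root-relative
    "http://archives.nd.edu/Football/Football-1903.pdf",   # already absolute
]

resolved = resolve_links(sample_hrefs)
for url in resolved:
    print(url)
```

Plain string concatenation gives the same result only for the first form, so `urljoin` is the safer choice if the page mixes link styles.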

# Download PDFs from List of URLs

In [None]:
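# import libraries
import urllib3
import os

# configure urllib3
http = urllib3.PoolManager()
print("downloading with urllib3")

# for loop that downloads each PDF into the current working directory
for url in url_list:
    r = http.request('GET', url)
    # skip unsuccessful responses so error pages are not saved as PDFs
    if r.status != 200:
        print("skipped", url, "status", r.status)
        continue
    filename = os.path.basename(url)
    with open(filename, 'wb') as fcont:
        fcont.write(r.data)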
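
The download loop names each file after the last segment of its URL. The sketch below shows one way to derive that name more defensively and to check whether a file is already on disk before requesting it again. The helpers `filename_from_url` and `needs_download` are hypothetical names introduced here, not part of the notebook.

```python
import os
from urllib.parse import unquote, urlparse


def filename_from_url(url):
    """Return the last path segment of a URL, ignoring query and fragment."""
    path = urlparse(url).path
    # Decode percent-escapes such as %20 so the name on disk is readable.
    return os.path.basename(unquote(path))


def needs_download(url, directory="."):
    """True when the URL has a file name and no such file exists in `directory`."""
    name = filename_from_url(url)
    return bool(name) and not os.path.exists(os.path.join(directory, name))


# Hypothetical URL for illustration only.
example = "http://archives.nd.edu/Football/Football-1901.pdf?download=1"
print(filename_from_url(example))
```

Inside the loop, a check such as `if not needs_download(url): continue` would let an interrupted run resume without fetching every PDF again.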

# Matching File Names and Publication Titles

In [None]:
# create empty list for file names
file_names = []

# for loop that extracts href contents, appends to file_names list
for item in items:
    file_names.append(item.get('href'))

# show file_names list
file_names

In [None]:
# import pandas
import pandas as pd

# create empty dataframe with two columns
df = pd.DataFrame(columns=['file_name', 'title'])

# assign file_names to file_name column
df['file_name'] = file_names

# assign title_list to title column
df['title'] = title_list

# show dataframe
df

In [None]:
# write dataframe to csv file
df.to_csv('scholastic_file_name_master.csv', index=False)
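
The matching step relies on the file names and titles being in the same order and of the same length. As a rough, dependency-free sketch of that pairing, the code below uses the standard `csv` module instead of pandas and refuses mismatched lists. The helpers `build_rows` and `rows_to_csv` and the sample titles are hypothetical.

```python
import csv
import io


def build_rows(file_names, titles):
    """Pair each file name with its title, refusing mismatched lengths."""
    if len(file_names) != len(titles):
        raise ValueError(
            f"{len(file_names)} file names but {len(titles)} titles"
        )
    return [
        {"file_name": name, "title": str(title).strip()}
        for name, title in zip(file_names, titles)
    ]


def rows_to_csv(rows):
    """Serialize the rows to CSV text with a header line."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=["file_name", "title"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical sample data for illustration only.
sample_rows = build_rows(
    ["Football-1901.pdf", "Football-1902.pdf"],
    [" 1901 Review ", "1902 Review"],
)
csv_text = rows_to_csv(sample_rows)
print(csv_text)
```

The column names mirror those in the dataframe above, so the output has the same shape as `scholastic_file_name_master.csv`.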