# Downloading Notre Dame Bagby Negative Images

From the [University Archives](http://archives.nd.edu/digital/):
- "The Bagby company, a South Bend photographic studio, took pictures of athletes for Notre Dame. The digitized Glass Plate Negative Collection is part of a [larger Bagby collection](http://archives.nd.edu/findaids/ead/xml/bby.xml)."
- [Bagby Glass Plate Negative Collection (Notre Dame Sports), 1920s-1930s](http://archives.nd.edu/Bagby/index.htm)

This Jupyter Notebook includes code and comments that download all images in the Bagby Glass Plate Negative Collection (Notre Dame Sports) and match image metadata to file names.

# Import Libraries, Load URL, and Create Beautiful Soup Object

In [1]:
# import libraries
import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
import csv
import pandas as pd

In [2]:
# load url, create beautifulsoup object
page = requests.get('http://archives.nd.edu/Bagby/index.htm')

# stop early if the server returned an error status (4xx/5xx)
page.raise_for_status()

soup = BeautifulSoup(page.text, 'html.parser')

# isolate html with 'table' tag
url_names = soup.find('table')

# find all instances of 'img' tag
img_list = url_names.find_all('img')

# Get List of Image File Names

In [None]:
# create empty list for image file names
image_file_names = []

# for loop that isolates src contents, removes the thumbnail prefix (tn\tn-), and appends to empty list
for img in img_list:
    image_file_names.append(img.get('src').replace("tn\\tn-", ""))
    
# list of image file names
image_file_names
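The `replace` call above only works when every `src` uses exactly the `tn\tn-` prefix. As a minimal sketch, assuming the thumbnails follow that pattern but might use either slash direction, the same clean-up can be done with the standard library. The helper name `thumbnail_to_file_name` is hypothetical and not part of the original notebook.

```python
from pathlib import PureWindowsPath


def thumbnail_to_file_name(src, prefix="tn-"):
    """Return the full-size file name for a thumbnail src (hypothetical helper)."""
    # PureWindowsPath treats both "\" and "/" as separators,
    # so .name drops the thumbnail folder either way
    name = PureWindowsPath(src).name
    # strip the thumbnail prefix only if it is actually present
    if name.startswith(prefix):
        name = name[len(prefix):]
    return name


print(thumbnail_to_file_name("tn\\tn-GBBY-45g001.jpg"))
```

A `src` that has no thumbnail folder or prefix passes through unchanged, which makes unexpected rows easier to spot in the resulting list.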

# Get List of Image URLs

In [4]:
# create empty list for image urls
image_url_list = []

# for loop that concatenates URL root with image file name (end of link)
for name in image_file_names:
    image_url_list.append("http://archives.nd.edu/Bagby/" + name)
    
# list of urls
image_url_list

['http://archives.nd.edu/Bagby/GBBY-45g001.jpg',
 'http://archives.nd.edu/Bagby/GBBY-45g002.jpg',
 'http://archives.nd.edu/Bagby/GBBY-45g003.jpg',
 'http://archives.nd.edu/Bagby/GBBY-45g004.jpg',
 'http://archives.nd.edu/Bagby/GBBY-45g005.jpg',
 'http://archives.nd.edu/Bagby/GBBY-45g006.jpg',
 'http://archives.nd.edu/Bagby/GBBY-45g007.jpg',
 'http://archives.nd.edu/Bagby/GBBY-45g008.jpg',
 'http://archives.nd.edu/Bagby/GBBY-45g009.jpg',
 'http://archives.nd.edu/Bagby/GBBY-45g010.jpg',
 'http://archives.nd.edu/Bagby/GBBY-45g011.jpg',
 'http://archives.nd.edu/Bagby/GBBY-45g012.jpg',
 'http://archives.nd.edu/Bagby/GBBY-45g013.jpg',
 'http://archives.nd.edu/Bagby/GBBY-45g014.jpg',
 'http://archives.nd.edu/Bagby/GBBY-45g015.jpg',
 'http://archives.nd.edu/Bagby/GBBY-45g016.jpg',
 'http://archives.nd.edu/Bagby/GBBY-45g017.jpg',
 'http://archives.nd.edu/Bagby/GBBY-45g018.jpg',
 'http://archives.nd.edu/Bagby/GBBY-45g019.jpg',
 'http://archives.nd.edu/Bagby/GBBY-45g020.jpg',
 'http://archives.nd
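The notebook imports `urljoin` but builds the URLs by string concatenation. A possible alternative, sketched here with two file names taken from the output above, is to resolve each name against the index page URL so the root does not have to be typed a second time.

```python
from urllib.parse import urljoin

# the page that was scraped; relative links are resolved against it
INDEX_URL = "http://archives.nd.edu/Bagby/index.htm"

# two file names from the output above, used only as sample input
sample_names = ["GBBY-45g001.jpg", "GBBY-45g002.jpg"]

# urljoin replaces the last path segment (index.htm) with the file name
sample_urls = [urljoin(INDEX_URL, name) for name in sample_names]

for url in sample_urls:
    print(url)
```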

# Download Image Files from List of Full URLs

In [None]:
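# import libraries
import urllib3
import os

# configure urllib3
http = urllib3.PoolManager()
print("downloading with urllib3")

# for loop that downloads image for each url in image_url_list
for url in image_url_list:
    r = http.request('GET', url)
    # urllib3 does not raise on HTTP error codes, so skip anything that is not 200 OK
    if r.status != 200:
        print("skipping", url, "status", r.status)
        continue
    # use the last part of the url as the local file name
    filename = os.path.basename(url)
    with open(filename, 'wb') as fcont:
        fcont.write(r.data)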
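For a large collection it can help to make the download loop restartable. The following is a sketch rather than the notebook's method: it swaps in the standard library's `urllib.request` instead of `urllib3`, and the names `fetch_bytes`, `download_all`, and the `dest_dir` argument are assumptions introduced for illustration. It skips files that already exist and records failures instead of stopping.

```python
import os
import urllib.request


def fetch_bytes(url, timeout=30):
    """Fetch one URL and return its body; urlopen raises on HTTP error codes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def download_all(urls, dest_dir, fetch=fetch_bytes):
    """Save each URL into dest_dir and return a list of (url, error) failures."""
    os.makedirs(dest_dir, exist_ok=True)
    failed = []
    for url in urls:
        path = os.path.join(dest_dir, os.path.basename(url))
        # skip files saved by an earlier run so the loop can be restarted
        if os.path.exists(path):
            continue
        try:
            data = fetch(url)
        except OSError as exc:  # URLError and HTTPError are OSError subclasses
            failed.append((url, str(exc)))
            continue
        with open(path, "wb") as handle:
            handle.write(data)
    return failed
```

Calling `download_all(image_url_list, "bagby_images")` would write the files into a folder named `bagby_images` (a hypothetical name) and return the URLs that could not be fetched.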

# Matching File Names and Image Info

In [52]:
# get image descriptions

# isolate html with 'table' tag
url_names = soup.find('table')

# find all table rows ('tr' tags)
images = soup.find_all('tr')

# create empty list for image titles
image_titles = []

# for loop to extract image titles and append to list
for title in images:
    # second child node of the row
    test = title.contents[1]
    # first child of that node, which holds the description
    test = test.contents[0]
    image_titles.append(test)
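Indexing into `contents` by position depends on the exact whitespace and nesting of the page. As a hedged sketch, the file name and description can instead be read from the same row so the two always stay paired. The HTML below is a made-up sample, not the real page markup, and it assumes the description sits in the last cell of each row; `rows_to_records` is a hypothetical helper.

```python
from bs4 import BeautifulSoup

# hypothetical sample markup; the real page layout may differ
SAMPLE_HTML = r"""
<table>
  <tr><td><img src="tn\tn-GBBY-45g001.jpg"></td><td>Hypothetical caption A</td></tr>
  <tr><td><img src="tn\tn-GBBY-45g002.jpg"></td><td>Hypothetical caption B</td></tr>
  <tr><td>Row without a thumbnail</td></tr>
</table>
"""


def rows_to_records(html):
    """Return one dict per table row that contains a thumbnail image."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for row in soup.find("table").find_all("tr"):
        img = row.find("img")
        # rows with no image (for example header rows) are skipped
        if img is None or not img.get("src"):
            continue
        cells = row.find_all("td")
        # assumption: the description is the text of the last cell
        description = cells[-1].get_text(" ", strip=True)
        records.append({"src": img["src"], "description": description})
    return records


records = rows_to_records(SAMPLE_HTML)
print(records)
```

Because each record carries both values, a skipped or malformed row cannot shift descriptions onto the wrong file name.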

In [None]:
# show sample image title
image_titles[0]

In [None]:
# import pandas
import pandas as pd

# create empty dataframe with named columns
df = pd.DataFrame(columns=["Image_URL", "Image_File_Name", "Image_Description"])

# map lists to data frame columns
df['Image_URL'] = image_url_list

df['Image_File_Name'] = image_file_names

df['Image_Description'] = image_titles

# show newly-created dataframe
df
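The column assignments above assume the three lists have the same length and order. A small check like the following, with the hypothetical helper name `check_aligned`, can catch a mismatch (for example, an extra table row with no image) before the dataframe is built.

```python
def check_aligned(**columns):
    """Raise ValueError if the named lists do not all have the same length."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"column lengths differ: {lengths}")
    return lengths


# sample data only; in the notebook these would be the three scraped lists
print(check_aligned(urls=["u1", "u2"], names=["n1", "n2"], titles=["t1", "t2"]))
```

In the notebook this could be called as `check_aligned(urls=image_url_list, names=image_file_names, titles=image_titles)` just before the columns are assigned.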

In [None]:
# write dataframe to csv
df.to_csv('bagby_images_file_name_master.csv', index=False)