# workflow

1. Grab LCCNs for different newspaper titles from this link: [https://chroniclingamerica.loc.gov/newspapers.txt](https://chroniclingamerica.loc.gov/newspapers.txt). An LCCN is essentially a unique identifier for each newspaper title. For your project, I'd recommend sticking to titles published after 1900, as headlines didn't really become prominent on newspaper front pages until the turn of the century.

2. Once you have the LCCN, you can access all of the different issues for that specific newspaper title in JSON format by going to: [https://chroniclingamerica.loc.gov/lccn/sn86069873.json](https://chroniclingamerica.loc.gov/lccn/sn86069873.json) (here, I'm using The Bourbon News from Kentucky with the LCCN "sn86069873" as an example).

3. In the JSON, you'll find a URL to a JSON file containing the page-level data for each issue. For example, see: [https://chroniclingamerica.loc.gov/lccn/sn86069873/1897-01-08/ed-1.json](https://chroniclingamerica.loc.gov/lccn/sn86069873/1897-01-08/ed-1.json). If you replace the ".json" at the end of this URL with "/seq-1.jp2", you can download the front page (for example, [https://chroniclingamerica.loc.gov/lccn/sn86069873/1897-01-08/ed-1/seq-1.jp2](https://chroniclingamerica.loc.gov/lccn/sn86069873/1897-01-08/ed-1/seq-1.jp2)). The number of front pages is then just the number of newspaper issues listed.
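
The URL patterns in the three steps above can be sketched as small string helpers. The function names and the `edition` parameter are illustrative assumptions, not part of the notebook; the patterns themselves are copied from the example links.

```python
BASE_URL = "https://chroniclingamerica.loc.gov/lccn"


def title_json_url(lccn):
    """URL of the JSON file listing every issue of one newspaper title."""
    return f"{BASE_URL}/{lccn}.json"


def issue_json_url(lccn, issue_date, edition=1):
    """URL of the page-level JSON for one issue; issue_date is 'YYYY-MM-DD'."""
    return f"{BASE_URL}/{lccn}/{issue_date}/ed-{edition}.json"


def front_page_url(issue_url):
    """Swap the trailing '.json' of an issue URL for '/seq-1.jp2'."""
    suffix = ".json"
    if not issue_url.endswith(suffix):
        raise ValueError(f"expected a URL ending in {suffix!r}: {issue_url!r}")
    return issue_url[: -len(suffix)] + "/seq-1.jp2"


print(front_page_url(issue_json_url("sn86069873", "1897-01-08")))
```

Keeping these as pure functions means the URL logic can be checked without touching the network.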

# getting LCCNs

In [1]:
import pandas as pd
import time
import re
import pickle

Grab LCCNs for different newspaper titles at this link: https://chroniclingamerica.loc.gov/newspapers.txt. This LCCN is essentially a unique identifier for each newspaper title.

## preprocessing / data formatting

In [3]:
newspapers_df = pd.read_csv('https://chroniclingamerica.loc.gov/newspapers.txt', sep='|')

In [11]:
newspapers_df.columns = [col_name.strip() for col_name in newspapers_df.columns]

In [42]:
def get_month(date_string):
    """
    given date string, returns month and index where numerical date starts (`date_index`)
    """
    date_index = re.search(r"\d", date_string).start()
    month = date_string[:date_index].strip('. ')
    return month, date_index

In [50]:
def convert_datetime(date_string):
    """
    converts date_string (from either 'First Issue Date' or 'Last Issue Date' in LOC newspaper dataset) to time.struct_time object
    """
    date_string = date_string.strip()
    month, date_index = get_month(date_string)
    if len(month) == 3 and month != 'May':
        return time.strptime(date_string, '%b. %d, %Y')
    if month == 'Sept':
        date_string = 'September ' + date_string[date_index:]
    return time.strptime(date_string, '%B %d, %Y')
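
As an alternative to the two-format approach above, here is a minimal sketch that parses the same date strings into `datetime.date` objects. It assumes every date looks like `'Aug. 1, 1897'`, `'Sept. 3, 1895'`, or `'May 20, 1902'` (a capitalized month name or abbreviation, an optional period, then day and year); I have not checked this against every row of the file. The name `parse_issue_date` is hypothetical.

```python
import re
from datetime import date

MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]

# month text, optional period, day, comma, four-digit year
DATE_PATTERN = re.compile(r"^\s*([A-Za-z]+)\.?\s+(\d{1,2}),\s*(\d{4})\s*$")


def parse_issue_date(date_string):
    """Parse strings such as 'Sept. 3, 1895' into a datetime.date."""
    match = DATE_PATTERN.match(date_string)
    if match is None:
        raise ValueError(f"unrecognized date: {date_string!r}")
    month_text, day, year = match.groups()
    # An abbreviation is accepted if it is a prefix of exactly one month name.
    candidates = [
        number for number, name in enumerate(MONTHS, start=1)
        if name.startswith(month_text)
    ]
    if len(candidates) != 1:
        raise ValueError(f"ambiguous or unknown month: {month_text!r}")
    return date(int(year), candidates[0], int(day))
```

Matching by prefix avoids special-casing 'Sept' and 'May', and `date` objects compare and sort directly.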

In [53]:
newspapers_df['First Issue Date'] = newspapers_df['First Issue Date'].map(convert_datetime)
newspapers_df['Last Issue Date'] = newspapers_df['Last Issue Date'].map(convert_datetime)
newspapers_df['LCCN'] = newspapers_df['LCCN'].map(lambda lccn : lccn.strip())

In [2]:
# newspapers_df.to_pickle("newspapers_df.pkl") # uncomment to save the processed DataFrame
newspapers_df = pd.read_pickle("newspapers_df.pkl") # reload the previously pickled DataFrame

In [3]:
newspapers_df[:3]

| Unnamed: 0 | Persistent Link | State | Title | LCCN | OCLC | ISSN | No. of Issues | First Issue Date | Last Issue Date | More Info |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | https://chroniclingamerica.loc.gov/lccn/sn8607... | Alabama | The age-herald. [volume] (Birmingham, Ala.) 1... | sn86072192 | 14948274 | 2692-4099 | 1630 | (1897, 8, 1, 0, 0, 0, 6, 213, -1) | (1902, 5, 20, 0, 0, 0, 1, 140, -1) | |
| 1 | https://chroniclingamerica.loc.gov/lccn/sn8402... | Alabama | Alabama state intelligencer. [volume] (Tuscal... | sn84021903 | 2683862 | 2574-4089 | 50 | (1831, 1, 1, 0, 0, 0, 5, 1, -1) | (1831, 12, 24, 0, 0, 0, 5, 358, -1) | |
| 2 | https://chroniclingamerica.loc.gov/lccn/sn8402... | Alabama | Birmingham age-herald. [volume] (Birmingham, ... | sn84020639 | 4066065 | 2692-4226 | 423 | (1894, 7, 1, 0, 0, 0, 6, 182, -1) | (1895, 10, 3, 0, 0, 0, 3, 276, -1) | |


# filtering
For your project, I'd recommend sticking to titles published after 1900, as headlines didn't really become prominent on newspaper front pages until the turn of the century.

In [189]:
print(f"There are {sum(newspapers_df['No. of Issues'])} newspaper issues and {len(newspapers_df['LCCN'])} LCCNs total")

There are 2443029 newspaper issues and 3327 LCCNs total


In [174]:
# start_before = newspapers_df['First Issue Date'] < START_DATE
# end_after = newspapers_df['Last Issue Date'] > START_DATE
# turn_of_century_papers = newspapers_df[start_before & end_after]
# print(f"There are {sum(turn_of_century_papers['No. of Issues'])} issues from publications started before but active during {time.asctime(START_DATE)}")

There are 1114757 issues from publications started before but active during Mon Jan  1 00:00:00 1900


In [4]:
def get_lccns(start_date_str, newspapers_df):
    """
    start_date_str should be a string of the format 'Month Date Year' e.g. 'January 1 1950'; titles whose first issue falls on or after this date are kept
    prints the number of LCCNs and the number of issues total
    returns a list of LCCNs
    """
    start_date = time.strptime(start_date_str, '%B %d %Y')
    filtered_newspapers = newspapers_df[newspapers_df['First Issue Date'] >= start_date]
    filtered_LCCNs = filtered_newspapers['LCCN']
    
    print(f"There are {sum(filtered_newspapers['No. of Issues'])} issues and {len(filtered_LCCNs)} LCCNs from publications started after {start_date_str}")
    
    return list(filtered_LCCNs)
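
The filter above relies on `time.struct_time` values comparing like tuples, field by field starting with the year. A minimal sketch of that behavior without pandas follows; the first date is taken from the `sn86072192` row shown earlier, while the second entry is made up for illustration.

```python
import time

start_date = time.strptime("January 1 1900", "%B %d %Y")

first_issue_dates = {
    "sn86072192": time.strptime("Aug. 1, 1897", "%b. %d, %Y"),
    # hypothetical title, not a real LCCN
    "hypothetical-lccn": time.strptime("Mar. 5, 1912", "%b. %d, %Y"),
}

# Keep titles whose first issue is on or after the start date.
kept = [lccn for lccn, first in first_issue_dates.items() if first >= start_date]
print(kept)  # ['hypothetical-lccn']
```

Because the comparison is `>=`, a title that started exactly on the start date is included.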

In [5]:
lccn_1900_683352 = get_lccns('January 1 1900', newspapers_df)

There are 683352 issues and 1123 LCCNs from publications started after January 1 1900


In [6]:
lccn_1959_1003 = get_lccns('January 1 1959', newspapers_df)

There are 1003 issues and 8 LCCNs from publications started after January 1 1959


# writing list of front page image links

Once you have the LCCN, you can access all of the different issues for that specific newspaper title in JSON format by going to: https://chroniclingamerica.loc.gov/lccn/sn86069873.json (here, I'm using The Bourbon News from Kentucky with the LCCN "sn86069873" as an example).

In the JSON, you'll find a URL to a JSON file containing the page-level data for each issue. For example, see: https://chroniclingamerica.loc.gov/lccn/sn86069873/1897-01-08/ed-1.json. If you replace the ".json" at the end of this URL with "/seq-1.jp2", you can download the front page (for example, https://chroniclingamerica.loc.gov/lccn/sn86069873/1897-01-08/ed-1/seq-1.jp2). The number of front pages is then just the number of newspaper issues listed.

In [7]:
import json
import requests
import sys
from progressbar import ProgressBar

In [220]:
def write_image_links(lccns, out_filename, error_filename, one_page_per_lccn=False):
    """
    writes links to jp2 images for all front pages published by outlets with lccns in `lccns` to out_filename; writes errors to error_filename.
    if one_page_per_lccn == True, only gets one front page, that of MOST RECENT issue, per publisher (default is to get all front pages from all issues)
    """
    pbar = ProgressBar()

    with open(out_filename, 'a') as out_file, open(error_filename, 'a') as error_file:
        for lccn in pbar(lccns):
            try:
                paper = requests.get('https://chroniclingamerica.loc.gov/lccn/' + lccn + '.json').json()
                if one_page_per_lccn:
                    url = paper['issues'][-1]['url'] # get url for most recent issue
                    out_file.write(url[:-5] + '/seq-1.jp2\n') # strip '.json', then append link to first page to out_file
                else:
                    urls = [issue['url'] for issue in paper['issues']] # get urls for all issues
                    out_file.writelines([issue_url[:-5] + '/seq-1.jp2\n' for issue_url in urls]) # append all first page links to out_file
            except Exception as e: # log the failure and move on to the next LCCN
                error_file.write(f"LCCN {lccn} failed with error {repr(e)}\n")
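
The link-building step can also be separated from the network request so that it can be checked offline. This is a sketch under assumptions: `front_page_links` is a hypothetical helper, and the sample payload below only mimics the `issues` / `url` keys used above (its second issue date is made up).

```python
def front_page_links(paper, latest_only=False):
    """Build front-page .jp2 links from a parsed title JSON payload.

    `paper` is assumed to be a dict with an 'issues' list whose items each
    carry a 'url' ending in '.json', listed oldest first.
    """
    issues = paper.get("issues", [])
    if latest_only:
        issues = issues[-1:]  # slicing stays safe when the list is empty
    links = []
    for issue in issues:
        url = issue["url"]
        if url.endswith(".json"):
            url = url[: -len(".json")]
        links.append(url + "/seq-1.jp2")
    return links


sample_paper = {
    "issues": [
        {"url": "https://chroniclingamerica.loc.gov/lccn/sn86069873/1897-01-08/ed-1.json"},
        # made-up second issue, for illustration only
        {"url": "https://chroniclingamerica.loc.gov/lccn/sn86069873/1897-01-12/ed-1.json"},
    ]
}
print(front_page_links(sample_paper, latest_only=True))
```

A title with no issues then yields an empty list instead of raising an `IndexError`.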


In [221]:
write_image_links(lccn_1959_1003, 'links_1959_unique_8.txt', 'errors_1959_unique_8.txt', one_page_per_lccn=True)

100% |########################################################################|


In [223]:
write_image_links(lccn_1959_1003, 'links_1959_1003.txt', 'errors_1959_1003.txt')

100% |########################################################################|


In [224]:
write_image_links(lccn_1900_683352, 'links_1900_unique_1123.txt', 'errors_1900_unique_1123.txt', one_page_per_lccn=True)

100% |########################################################################|


# getting images from links into specified directory

**DEPRECATED // OUT OF DATE**
see `get_images.py` instead :^)

In [56]:
import os
import sys
import requests
import PIL
from PIL import Image
import io

In [None]:
# $ brew tap hhatto/pgmagick
# $ brew install pgmagick

In [37]:
def get_image_name(image_link):
    """
    given link to jp2 image (e.g. 'https://chroniclingamerica.loc.gov/lccn/sn83045298/1963-12-20/ed-1/seq-1.jp2'),
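    returns name of image by joining all parts of the url following and including the lccn value (e.g. sn83045298) with underscores;
    note that the '.jp2' extension is renamed to '.jpg' (only the name changes, the file contents are not converted)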
    """
    return ('_'.join(image_link.split('/')[4:])).strip()[:-1] + 'g'

In [38]:
get_image_name('https://chroniclingamerica.loc.gov/lccn/sn83045298/1963-12-20/ed-1/seq-1.jp2')

'sn83045298_1963-12-20_ed-1_seq-1.jpg'
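
A more explicit variant of the naming step, sketched with `urllib.parse` and `pathlib` instead of string slicing. The name `image_name_from_link` and the `extension` parameter are assumptions for illustration; unlike `get_image_name`, this version keeps the original `.jp2` extension unless another one is requested.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def image_name_from_link(image_link, extension=None):
    """Join the URL path parts from the LCCN onward with underscores."""
    parts = PurePosixPath(urlparse(image_link.strip()).path).parts
    # parts looks like ('/', 'lccn', 'sn83045298', '1963-12-20', 'ed-1', 'seq-1.jp2')
    start = parts.index("lccn") + 1
    name = "_".join(parts[start:])
    if extension is not None:
        # swap the final suffix, e.g. '.jp2' -> '.png'
        name = str(PurePosixPath(name).with_suffix(extension))
    return name


link = "https://chroniclingamerica.loc.gov/lccn/sn83045298/1963-12-20/ed-1/seq-1.jp2\n"
print(image_name_from_link(link))
print(image_name_from_link(link, extension=".png"))
```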

In [59]:
def write_jp2_images(links_filename, error_filename, out_dir):
    """
    writes all .jp2 images with links in links_filename into out_dir; writes errors into error_filename
    """
    with open(links_filename, 'r') as links, open(error_filename, 'a') as error_file:
        for image_link in links:
            image_link = image_link.strip() # drop the trailing newline before requesting
            try:
                image_name = get_image_name(image_link)
                image_path = os.path.join(out_dir, image_name)
                if not os.path.isfile(image_path):
                    # pulls image
                    r = requests.get(image_link)
                    # makes sure the request passed:
                    if r.status_code == 200:
                        with open(image_path, 'wb') as f:
                            f.write(r.content)
                    else:
                        error_file.write(f"Could not get image {image_link}; status code {r.status_code}\n")
            except Exception as e:
                error_file.write(f"Could not get image {image_link}; failed with error {repr(e)}\n")

In [39]:
write_jp2_images('links_1959_unique_8.txt', 'error_1959_unique_8.txt', 'testjpg')

In [53]:
im = Image.open('testjpg/sn83045298_1963-12-20_ed-1_seq-1.jpg')

In [None]:
def jp2s_to_pngs(jp2s_dir):
    command = f"opj_decompress -ImgDir {jp2s_dir} -OutFor png" #TODO: TEST THIS ON AWS
    os.system(command)

In [49]:
def crop_resize(png_file, dim=1024):
    # resize to width `dim`, maintaining aspect ratio
    img = Image.open(png_file)
    wpercent = (dim / float(img.size[0]))
    hsize = int((float(img.size[1]) * float(wpercent)))
    img = img.resize((dim, hsize), Image.LANCZOS) # ANTIALIAS was an alias of LANCZOS and was removed in Pillow 10
    
    # crop to the top dim x dim square
    (left, upper, right, lower) = (0, 0, dim, dim)
    img = img.crop((left, upper, right, lower))

    img.save('resized_image.jpg') # NOTE: output name is hardcoded
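
The size arithmetic in `crop_resize` can be checked on its own, without opening an image. The helper below is a hypothetical sketch: it uses integer division rather than the float computation above (the two may differ by a pixel in rare cases), and it clamps the crop box to the resized height, which the original does not do.

```python
def resize_and_crop_box(width, height, dim=1024):
    """Return the resized (width, height) and the top crop box.

    The image is scaled so its width equals `dim`, keeping the aspect ratio;
    the crop box then takes the top `dim` rows, or fewer if the resized
    image is shorter than `dim`.
    """
    new_height = height * dim // width
    crop_box = (0, 0, dim, min(dim, new_height))
    return (dim, new_height), crop_box


# A 2000 x 3000 portrait page scales to 1024 x 1536; the top square is kept.
print(resize_and_crop_box(2000, 3000))  # ((1024, 1536), (0, 0, 1024, 1024))
```

For a landscape input such as 2000 x 1000, the resized image is only 512 pixels tall, so the unclamped `(0, 0, dim, dim)` box in `crop_resize` would extend past the bottom of the image.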

In [60]:
png_file = 'blabla.png'

In [62]:
png_file[:-4]

'blabla'