The Internet Archive offers an API (Application Programming Interface) that makes it easier to access their archives programmatically. For a given item, say [the first issue of Scientific American](https://archive.org/details/scientific-american-1845-08-28), we can get the metadata for the item by changing the URL.

| Data View | URL |
| --- | --- |
| Pretty Web Page for Humans | [https://archive.org/**details**/scientific-american-1845-08-28](https://archive.org/details/scientific-american-1845-08-28) |
| Easy Metadata for Machines | [https://archive.org/**metadata**/scientific-american-1845-08-28](https://archive.org/metadata/scientific-american-1845-08-28) |
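
The URL swap in the table can also be done in code. Here is a minimal sketch using only plain Python string handling; the helper name `metadata_url` is made up for this illustration and is not part of any library:

```python
def metadata_url(details_url):
    """Turn an item's /details/ URL into its /metadata/ URL."""
    # Replace only the first occurrence, in case the identifier contains the same text.
    return details_url.replace("/details/", "/metadata/", 1)

url = metadata_url("https://archive.org/details/scientific-american-1845-08-28")
print(url)
```

Pasting the printed URL into a browser should show the same metadata as the second row of the table.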

The metadata is offered in a format called [JSON](https://en.wikipedia.org/wiki/JSON), a widely used standard for communicating chunks of simple data in a way that is both human-readable and machine-parseable.

Let's take a look at what the JSON for the above item looks like:

    {"created":1448448826,"d1":"ia600806.us.archive.org","d2":"ia700806.us.archive.org","dir":"/34/items/scientific-american-1845-08-28","files":[{"name":"scientific-american-v01-n01-1845-08-28.djvu","source":"derivative","format":"DjVu","original":"scientific-american-v01-n01-1845-08-28_djvu.xml","mtime":"1329085972","size":"1054500","md5":"e24b0db3861efd985ed11172ec0f5677","crc32":"a156ba3a","sha1":"7ca3d6ce76717b19ba91a3baaecbe6bb7b897d10"},{"name":"scientific-american-v01-n01-1845-08-28.epub","source":"derivative","format":"EPUB","original":"scientific-american-v01-n01-1845-08-28_abbyy.gz","mtime":"1329085978","size":"97103","md5":"b27bd43cf6af61d6e70a6d135e2178e9","crc32":"3c9a3d5f","sha1":"474c04d1000809886afe0c52712735be890c2057"},...

Now at first glance, this is...not human-readable. But let's consider the data in a different way: as a table.

| Key | Value |
| --- | --- |
| "created" | 1448448826 |
| "d1" | "ia600806.us.archive.org" |
| "d2" | "ia700806.us.archive.org" |
| "dir" | "/34/items/scientific-american-1845-08-28"|
| "files" | [{"name":...}] |

Still esoteric, but more organized. A piece of JSON data relates keys to values, much like how a phone book relates names to phone numbers. But what do the values mean? 
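
In Python, this key-to-value structure maps directly onto a dictionary. The sketch below parses a shortened copy of the JSON above with the standard `json` module. The `files` list is deliberately left empty here to keep the example small, so this is not the item's full metadata:

```python
import json

# A shortened copy of the metadata above; the "files" list is emptied for brevity.
raw = ('{"created":1448448826,"d1":"ia600806.us.archive.org",'
       '"d2":"ia700806.us.archive.org",'
       '"dir":"/34/items/scientific-american-1845-08-28","files":[]}')

metadata = json.loads(raw)   # parse the JSON text into a Python dictionary
print(metadata["d1"])        # look up a value by its key
print(sorted(metadata))      # list all of the keys
```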

Whelp, one of the unfortunate things about JSON is that, while the _format_ is universally understood, the _meanings_ for keys and values are not standardized across the web.

Here, for example, the Internet Archive somewhat curiously decided that "created" should refer to when the request was first made, as seconds since midnight UTC, January 1st, 1970 (or what is called the [Unix Epoch](https://en.wikipedia.org/wiki/Unix_time)). So ```1448448826``` refers to Wed, 25 Nov 2015 10:53:46 UTC. If Bard's library were to implement a JSON API for their collections, they might decide to use the key "created" to refer to when the item was first uploaded, or first published, or first acquired. Facebook might use "created" to refer to when a particular user first joined Facebook, or when a user was born. And so on.
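
To check the timestamp conversion yourself, Python's standard `datetime` module can turn seconds since the Unix Epoch into a calendar date. This is a small sketch that interprets the value in UTC:

```python
from datetime import datetime, timezone

created = 1448448826
# Interpret the number as seconds since 1970-01-01 00:00:00 UTC.
when = datetime.fromtimestamp(created, tz=timezone.utc)
print(when.isoformat())
```

The result is a `datetime` object, so the usual attributes such as `when.year` or `when.hour` are available for further processing.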

The upshot is that, while JSON is great for getting information from Web Point A to Web Point B, there's still the need to process and interpret that information according to how the publisher meant for it to be interpreted.

In [1]:
import internetarchive as ia

This line loads the "internetarchive" library for Python, and gives it a shorter name ("ia") for brevity's sake. This library handles communicating with the Internet Archive, and deals with interpreting the JSON for us. Now we run a search for a particular collection, and print out how many documents we found:

In [2]:
search_results = ia.search_items('collection:scientific-american-1845-1909')
print(search_results.num_found)

2870


(Once you've read through this notebook and get what's going on, try replacing ```scientific-american-1845-1909``` with the name of a collection you're interested in. You can get the name by going to the collection in your browser and looking at the last part of the URL.)

Right now, the ```search_results``` variable is just a pointer to the search. But in order to get the actual IDs for all the items in the collection, we need to perform that search and store what it returns in something we can process further, like a list:

In [3]:
search_items = list(search_results)
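
If the difference between a "pointer to the search" and a list feels abstract, a Python generator behaves in a loosely similar way. The `fake_search` function below is a toy stand-in written for this illustration. It does not contact the Internet Archive and is not how the `internetarchive` library is implemented:

```python
def fake_search():
    """A toy stand-in for a search: it hands out results one at a time."""
    for issue_date in ["1845-08-28", "1846-03-19", "1846-03-26"]:
        yield {"identifier": "scientific-american-" + issue_date}

results = fake_search()    # nothing has been produced yet
as_list = list(results)    # walking through the results collects them into a list
print(len(as_list))
print(as_list[0]["identifier"])
```

Once the results are in a list, they can be counted, sliced, and indexed, which is what the next cells rely on.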

Now let's see what the list contains. This should print out the IDs for the first ten items on the list:

In [4]:
for item in search_items[0:10]:
    print("ID: {}".format(item['identifier']))

ID: scientific-american-1845-08-28
ID: scientific-american-1846-03-19
ID: scientific-american-1846-03-26
ID: scientific-american-1846-04-02
ID: scientific-american-1846-04-09
ID: scientific-american-1846-04-16
ID: scientific-american-1846-04-23
ID: scientific-american-1846-04-30
ID: scientific-american-1846-05-06
ID: scientific-american-1846-05-14


Now that we know the IDs of what we're interested in, we can work with those items. To see the files associated with the first item returned, we first need to fetch the item itself:

In [5]:
first_item = ia.get_item(search_items[0]['identifier'])

That line is a little dense, so let's break it down, starting from the inside and working our way out.
1. ```search_items``` is that same list of items and IDs from above.
2. ```search_items[0]``` gets the first item of that list (remember: computers start counting from zero!)
3. ```search_items[0]['identifier']``` gets the value for the ```identifier``` key of that first item.
4. ```ia.get_item(search_items[0]['identifier'])``` takes that identifier value for the first item and asks the Internet Archive for all the information it has.
5. Finally, we keep that information in the ```first_item``` variable for later processing.
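
Steps 1 through 3 can be tried without any network access. The sketch below uses a hand-made list (`demo_items`, a stand-in for the real search results) and shows that taking the steps one at a time gives the same value as the dense one-liner:

```python
# A hand-made stand-in for the search results, using the first two IDs printed above.
demo_items = [
    {"identifier": "scientific-american-1845-08-28"},
    {"identifier": "scientific-american-1846-03-19"},
]

first = demo_items[0]             # step 2: the first item in the list
first_id = first["identifier"]    # step 3: the value stored under the 'identifier' key

# The same lookup written in one go, as in the cell above.
same_id = demo_items[0]["identifier"]
print(first_id == same_id)
```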

Let's get a listing of all the files associated with this item. There are quite a lot, so let's just look at the first two.

In [6]:
first_item.files[0:2]

[{'crc32': 'a156ba3a',
  'format': 'DjVu',
  'md5': 'e24b0db3861efd985ed11172ec0f5677',
  'mtime': '1329085972',
  'name': 'scientific-american-v01-n01-1845-08-28.djvu',
  'original': 'scientific-american-v01-n01-1845-08-28_djvu.xml',
  'sha1': '7ca3d6ce76717b19ba91a3baaecbe6bb7b897d10',
  'size': '1054500',
  'source': 'derivative'},
 {'crc32': '3c9a3d5f',
  'format': 'EPUB',
  'md5': 'b27bd43cf6af61d6e70a6d135e2178e9',
  'mtime': '1329085978',
  'name': 'scientific-american-v01-n01-1845-08-28.epub',
  'original': 'scientific-american-v01-n01-1845-08-28_abbyy.gz',
  'sha1': '474c04d1000809886afe0c52712735be890c2057',
  'size': '97103',
  'source': 'derivative'}]
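
Note that every value in these records is a string, including the numbers. The sketch below copies two fields from the records above into a plain list (`demo_files`, a name made up here) and converts the sizes before adding them up. That `size` is measured in bytes is an assumption, not something the output itself states:

```python
# The two file records shown above, trimmed to the fields used here.
demo_files = [
    {"name": "scientific-american-v01-n01-1845-08-28.djvu",
     "format": "DjVu", "size": "1054500"},
    {"name": "scientific-american-v01-n01-1845-08-28.epub",
     "format": "EPUB", "size": "97103"},
]

for record in demo_files:
    # Convert the string to an integer before doing arithmetic with it.
    print("{}: {} (assumed bytes)".format(record["format"], int(record["size"])))

total_size = sum(int(record["size"]) for record in demo_files)
print(total_size)
```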

We want PDFs, right? But take a look at the PDF files associated with the item with id 'scientific-american-1898-11-12':

In [7]:
uh_oh = ia.get_item('scientific-american-1898-11-12')
[file for file in uh_oh.files if "PDF" in file['format']]

[{'crc32': '482d2aed',
  'format': 'Image Container PDF',
  'md5': 'eeb493b3fb722f839bc9a92e5427be62',
  'mtime': '1329172430',
  'name': 'scientific-american-v79-n20-1898-11-12.pdf',
  'sha1': 'dcb1e1443e4d049195f2e5ea75ad94ac65d2fed2',
  'size': '10096731',
  'source': 'original'},
 {'crc32': '301df72b',
  'format': 'Additional Text PDF',
  'md5': '3ae6328f04fb3847da845e5b79bb2bd6',
  'mtime': '1329178757',
  'name': 'scientific-american-v79-n20-1898-11-12_text.pdf',
  'original': 'scientific-american-v79-n20-1898-11-12_djvu.xml',
  'sha1': '9a35631132cf87c58801543753fe2569ffc8f6e7',
  'size': '3437337',
  'source': 'derivative'}]

There are two types of PDF here: full image containers, and those processed by OCR software to compress and represent text. For our purposes, we don't really care which we get -- we just want one.

The second line of that cell is an example of a thing in Python called a *list comprehension*. It's basically shorthand for the specific case of constructing a list using a ```for``` loop. That line is functionally the same as doing this:

```
pdf_files = []
for file in uh_oh.files:
    if "PDF" in file['format']:
        pdf_files.append(file)
print(pdf_files)
```

...but it's clearly a lot shorter.
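
Since an item can have more than one PDF, it helps to settle on a rule for choosing one. Here is a minimal sketch that reuses the same list comprehension. The helper `pick_pdf` and its `preferred` parameter are hypothetical names introduced for this example, and the records are trimmed copies of the output above:

```python
# The two PDF records shown above, trimmed to the fields used here.
pdf_records = [
    {"name": "scientific-american-v79-n20-1898-11-12.pdf",
     "format": "Image Container PDF"},
    {"name": "scientific-american-v79-n20-1898-11-12_text.pdf",
     "format": "Additional Text PDF"},
]

def pick_pdf(records, preferred=None):
    """Return the name of one PDF file, or None if there are no PDFs."""
    pdfs = [r for r in records if "PDF" in r["format"]]
    if preferred is not None:
        for r in pdfs:
            if r["format"] == preferred:
                return r["name"]
    # No preference given (or no match): fall back to the first PDF, if any.
    return pdfs[0]["name"] if pdfs else None

print(pick_pdf(pdf_records))
print(pick_pdf(pdf_records, preferred="Additional Text PDF"))
```

With no preference, this behaves like taking the first match, which is the approach the download cell below uses.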


So let's download PDFs for the first 10 documents.
First we make a directory to store the documents in, if it doesn't already exist.
Then we download and save the needed files to that directory.
We also record the filename of each PDF for later use.

In [8]:
import os
from requests import ConnectionError

try:
    os.mkdir("DownloadedPDFs")
except FileExistsError:
    # if it already exists, then don't worry about making it.
    pass

files = []

for search_item in search_items[0:10]:
    item_id = search_item['identifier']
    item = ia.get_item(item_id)
    filenames = [f['name'] for f in item.files if 'PDF' in f['format']]
    if len(filenames) == 0:
        print("Ooops, looks like the item with id {} has no PDFs!".format(item_id))
    else:
        fn  = filenames[0]
        print("Getting file {}...".format(fn))
        file = item.get_file(fn) 
        try:
            filename = os.path.join("DownloadedPDFs", fn)
            file.download(filename, silent=True)
            files.append(filename)                   
            print("Gotten!")
        except ConnectionError:
            print("Oops, there's a connection error. Try to get {} again later".format(fn))      

Getting file scientific-american-v01-n01-1845-08-28.pdf...
Gotten!
Getting file scientific-american-v01-n27-1846-03-19.pdf...
Gotten!
Getting file scientific-american-v01-n28-1846-03-26.pdf...
Gotten!
Getting file scientific-american-v01-n29-1846-04-02.pdf...
Gotten!
Getting file scientific-american-v01-n30-1846-04-09.pdf...
Gotten!
Getting file scientific-american-v01-n31-1846-04-16.pdf...
Gotten!
Getting file scientific-american-v01-n32-1846-04-23.pdf...
Gotten!
Getting file scientific-american-v01-n33-1846-04-30.pdf...
Gotten!
Getting file scientific-american-v01-n34-1846-05-06.pdf...
Gotten!
Getting file scientific-american-v01-n35-1846-05-14.pdf...
Gotten!
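
As an aside, the `try`/`except FileExistsError` pattern used to create the directory has a shorter equivalent in the standard library: `os.makedirs` with `exist_ok=True`. The sketch below runs inside a temporary scratch directory (an addition for this example) so that it leaves nothing behind:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as scratch:
    target = os.path.join(scratch, "DownloadedPDFs")
    os.makedirs(target, exist_ok=True)   # creates the directory
    os.makedirs(target, exist_ok=True)   # already exists: no error is raised
    created_ok = os.path.isdir(target)

print(created_ok)
```

Either form works; the one-line version simply saves a few lines when the only goal is "make sure this directory exists".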


Right, now we've got our PDFs! Here's a function that converts a PDF to images, saving one JPEG per page, using the Wand library.

In [9]:
from wand.image import Image

def convert_pdf(pdf_filename, converted_dir = "converted", to_directory=True):
    magazine_name = os.path.basename(pdf_filename)
    magazine_name, _ = os.path.splitext(magazine_name)
    try:
        os.mkdir(converted_dir)
    except FileExistsError:
        pass
    
    if to_directory:
        converted_dir = os.path.join(converted_dir, magazine_name)
        try:
            os.mkdir(converted_dir)
        except FileExistsError:
            pass
        
    with Image(filename=pdf_filename, resolution=200) as magazine:
        # enumerate numbers the pages from 1 without searching the sequence on every pass
        for i, page in enumerate(magazine.sequence, start=1):
            print("Converting page {}".format(i))
            converted = Image(page).convert('jpg')
            
            # Make filename for converted files from pdf filename
            # magazine_name already has its extension removed, so just add the page suffix
            converted_filename = magazine_name + "-pg{}.jpg".format(i)
            converted_filename = os.path.join(converted_dir, converted_filename)
            
            converted.save(filename=converted_filename)
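
The filename logic in that function can be tried on its own, without Wand or any PDFs. Here is a sketch that pulls it out into a small helper. The name `page_filename` is made up for this example, and it mirrors the `to_directory=False` case only:

```python
import os

def page_filename(pdf_filename, page_number, converted_dir="converted"):
    """Build the output path for one converted page of a PDF."""
    base = os.path.basename(pdf_filename)        # drop the directory part
    magazine_name, _ = os.path.splitext(base)    # drop the ".pdf" extension
    return os.path.join(converted_dir,
                        "{}-pg{}.jpg".format(magazine_name, page_number))

example = page_filename(
    os.path.join("DownloadedPDFs", "scientific-american-v01-n01-1845-08-28.pdf"), 1)
print(example)
```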

In [10]:
for file in files:
    convert_pdf(file, to_directory=False)
print('Complete!')

Converting page 1
Converting page 2
Converting page 3
Converting page 4
