## Using the Internet Archive API to access the Medical Heritage Library 

The Internet Archive offers an API (Application Programming Interface) that makes it easier to access its archives programmatically, including the [Medical Heritage Library](https://archive.org/details/medicalheritagelibrary)! With just a few keystrokes we can download thousands of PDFs and analyze them in different ways. This Jupyter Notebook, which you should be running in the MedicalHeritageLibraryVM, is set up to help you download PDFs from the Medical Heritage Library.

First, load the "internetarchive" library for Python. This library handles communicating with the Internet Archive.

In [None]:
import internetarchive as ia

Now we run a search for the Medical Heritage Library, and print out how many documents we found:

In [None]:
search_results = ia.search_items('collection:medicalheritagelibrary')
print(search_results.num_found)

You will see that there are over 100,000 items in the Medical Heritage Library. Wow! That's a lot of medical history. Internet Archive's API, right now, can only handle 10,000 items at a time, so try limiting your results by year.

In [None]:
search_results = ia.search_items('collection:medicalheritagelibrary AND date:[1900 TO 1905]')
print(search_results.num_found)

Great! Now we have a little over 6,000 items. Let's check out what is in that list. 
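To cover more of the collection than one slice, the same date-range trick can be repeated over consecutive windows of years. The sketch below only builds the query strings; it does not contact the Internet Archive. The helper name `year_window_queries` and the `step` parameter are made up for this illustration and are not part of the internetarchive library.

```python
def year_window_queries(collection, first_year, last_year, step=5):
    """Return one query string per window of `step` years (both ends inclusive)."""
    queries = []
    for start in range(first_year, last_year + 1, step):
        # The last window is cut short so it never runs past last_year
        end = min(start + step - 1, last_year)
        queries.append(
            "collection:{} AND date:[{} TO {}]".format(collection, start, end)
        )
    return queries


# Two six-year windows: 1900 to 1905 and 1906 to 1911
queries = year_window_queries("medicalheritagelibrary", 1900, 1911, step=6)
for query in queries:
    print(query)
```

Each string can then be passed to `ia.search_items(...)` in turn. If `num_found` for a window is still too large, try a smaller `step`.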

Right now, the ```search_results``` variable is just a pointer to the search. But in order to get the actual IDs for all the items in the collection, we need to perform that search and store what it returns in a list:

In [None]:
search_items = list(search_results)

Now let's see what the list contains. This should print out the IDs for the first ten items on the list:

In [None]:
for item in search_items[0:10]:
    print("ID: {}".format(item['identifier']))
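Turning the whole search into a list means waiting for every matching record to arrive. If you only want a peek at the first few, one option is `itertools.islice`, which stops reading after a set number of results. This is a minimal sketch that assumes the search object hands back results one at a time as you loop over it; `fake_results` and its identifiers are stand-ins invented here so the example runs without a network connection.

```python
from itertools import islice


def first_identifiers(results, n):
    """Collect the identifiers of the first n results without reading the rest."""
    return [result["identifier"] for result in islice(results, n)]


def fake_results():
    """Stand-in for a search: yields made-up records one at a time."""
    for i in range(1000):
        yield {"identifier": "example-item-{}".format(i)}


print(first_identifiers(fake_results(), 3))
```

With a real search you would call `first_identifiers(search_results, 10)` instead of building the full list first.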

Now that we know the IDs of what we're interested in, we can work with those items. Let's look at the files attached to each of the first ten items and download the first PDF we find for each one into a folder called MedicalHeritagePDFS:

In [None]:
import os
# requests' own ConnectionError (note: it shadows the built-in of the same name)
from requests import ConnectionError

# Create the download folder; do nothing if it already exists
os.makedirs("MedicalHeritagePDFS", exist_ok=True)

# Paths of the PDFs that downloaded successfully
files = []

for search_item in search_items[0:10]:
    item_id = search_item['identifier']
    item = ia.get_item(item_id)
    # Keep only the files whose format mentions PDF
    filenames = [f['name'] for f in item.files if 'PDF' in f.get('format', '')]
    if len(filenames) == 0:
        print("The item with id {} has no PDFs!".format(item_id))
    else:
        # Take the first PDF listed for this item
        fn = filenames[0]
        print("Getting file {}...".format(fn))
        file = item.get_file(fn)
        try:
            filename = os.path.join("MedicalHeritagePDFS", fn)
            file.download(filename, silent=True)
            files.append(filename)
            print("Check!")
        except ConnectionError:
            print("There's a connection error. Try to get {} again later".format(fn))
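Instead of coming back later by hand after a connection error, you could let the notebook try again a few times. The helper below is a sketch of that idea and not part of the internetarchive library: `download_with_retries`, `attempts`, and `wait_seconds` are names chosen for this example. It catches `OSError`, which the exceptions raised by requests inherit from, so it would also retry on unrelated problems such as a full disk.

```python
import time


def download_with_retries(download, attempts=3, wait_seconds=2.0):
    """Call download() until it succeeds or the attempts run out.

    Returns True on success and False if every attempt failed.
    """
    for attempt in range(1, attempts + 1):
        try:
            download()
            return True
        except OSError as error:
            print("Attempt {} of {} failed: {}".format(attempt, attempts, error))
            if attempt < attempts:
                # Wait a little longer after each failure
                time.sleep(wait_seconds * attempt)
    return False
```

In the loop above, this could be used as `download_with_retries(lambda: file.download(filename))`, adding the filename to `files` only when it returns `True`.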

Now, let's say you want to do other types of searches rather than searching by date. Don't worry! You can! 

You can search for things like: 

title (the title of the work) 
ex: title:"civil war"

date (the date of the work formatted like so: [YEAR-MONTH-DAY TO YEAR-MONTH-DAY]) 
ex: date:[1820-01-01 TO 1830-12-31]

description (the description of the work)
ex: description:photograph

addeddate (the date the work was added)
ex: addeddate:2016

In [None]:
search_results = ia.search_items('collection:medicalheritagelibrary AND title:medical')
print(search_results.num_found)
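These fields can be combined with `AND`, just like the collection and date were combined earlier. As a rough sketch, the helper below glues field searches into one query string. It assumes phrases with spaces should be wrapped in double quotes and that bracketed ranges should be left alone; `build_query` is a name invented for this example, not a function from the internetarchive library.

```python
def build_query(collection, **fields):
    """Join a collection and any number of field:value pairs with AND."""
    clauses = ["collection:{}".format(collection)]
    for name, value in fields.items():
        value = str(value)
        is_range = value.startswith("[") and value.endswith("]")
        # Quote multi-word phrases, but leave [A TO B] ranges untouched
        if " " in value and not is_range:
            value = '"{}"'.format(value)
        clauses.append("{}:{}".format(name, value))
    return " AND ".join(clauses)


query = build_query(
    "medicalheritagelibrary",
    title="civil war",
    date="[1820-01-01 TO 1830-12-31]",
)
print(query)
```

The resulting string can be handed to `ia.search_items(query)` in the same way as the searches above.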