In [2]:
import requests, pprint, json, os
from datetime import datetime
from pathlib import PurePath, Path
from lxml import etree

pp = pprint.PrettyPrinter(indent=4)

# NYPL API
# main parameters
output_dir = PurePath(os.getcwd(), 'output')
n_pagin = 500
pagin = 'per_page=' + str(n_pagin)
baseurl = 'http://api.repo.nypl.org/api/v1/'
trail_url = '/search?&publicDomainOnly=true&q='
token = '1bt13bkug32reiu4'
auth = 'Token token=' + token

# custom parameters
# s_terms = 'Farm Security Administration Photographs'
# s_terms = 'still image manhattan street portrait 1960 photograph '
s_terms = "Photographs"
adv_s_terms = '&field=genre'
# coll_id = 'e5462600-c5d9-012f-a6a3-58d385a7bc34'  # Farm Security Administration Photographs
# coll_id = 'a301da20-c52e-012f-cc55-58d385a7bc34'  # Photographic views of New York City, 1870's-1970's
coll_id = '439afdd0-c62b-012f-66d1-58d385a7bc34'  # Detroit Publishing Company postcards
# Cf. http://api.repo.nypl.org/api/v1/collections?per_page=200

coll_url = baseurl + 'collections/' + coll_id + '?' + pagin
full_url = baseurl + 'items' + trail_url + s_terms + '&' + pagin + adv_s_terms
item_url = baseurl + 'items/mods_captures/'   # item_details


# data = requests.get(coll_url, headers={'Authorization': auth}).json()['nyplAPI']
# data = requests.get(full_url, headers={'Authorization': auth}).json()['nyplAPI']


# print(rq.keys())
# print(meta.keys())
# check_n_pages = total_results % n_pagin


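def timestamp():
    """
    Returns the current UTC time as an ISO 8601 string (names the run directory)
    """
    # note: the string contains ':' characters, which are not valid in Windows paths
    return datetime.utcnow().isoformat()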

def create_logfile(log_path):
    """
    Creates a (txt) log file
    """
    if os.path.isfile(log_path):
        os.remove(log_path)
    log_stream = open(log_path, 'w+')
    return log_stream

def write_to_log(log_path, message):
    """
    Writes to (txt) log file
    """
    with open(log_path, 'a') as log:

        if isinstance(message, str):
            mes_str = '\n' + '/!\\ ' + message
            log.write(mes_str)

        elif isinstance(message, list):
            mes_main = '\n' + '/!\\ ' + message[0]
            log.write(mes_main)
            for p in message[1:]:
                mes_p = '\n' + '\t' + ' - ' + p
                log.write(mes_p)


# loop over API
page = 1
n_pages = 1
run_dir = PurePath(output_dir, timestamp())
xml_dir = PurePath(run_dir, 'data', 'xml')
json_dir = PurePath(run_dir, 'data', 'json')
log_dir = PurePath(run_dir, 'log')
log_path = PurePath(log_dir, 'log.txt')

for f in [run_dir, xml_dir, json_dir, log_dir]:
    # os.mkdir(f)
    Path(f).mkdir(parents=True, exist_ok=True)

log_stream = create_logfile(log_path)
while page <= n_pages:
    
    # page
    if page > 1:
        page_url = full_url + '&page=' + str(page)
    else:
        page_url = full_url
    init_mes = '\n--> Page ' + str(page)
    print(init_mes)
    write_to_log(log_path, init_mes)
    
    # data, params
    print(page_url)
    write_to_log(log_path, page_url)
    data = requests.get(page_url, headers={'Authorization': auth}).json()['nyplAPI']
    metata = data['response']['result']
    # metata = data['response']
    
    # prepare page numbers
    if page == 1:
        total_results = int(data['response']['numResults'])
        # integer division, plus one extra page for any remainder
        n_pages = total_results // n_pagin
        if total_results % n_pagin > 0:
            n_pages += 1
        
        # logs
        mes_res = '{:,}'.format(total_results) + " items retrieved from the search '" + s_terms + "'"
        mes_pages = '\n{:,}'.format(n_pages) + " pages to process"
        print(mes_res, mes_pages)
        write_to_log(log_path, [mes_res, mes_pages])
    
    # loop over items in this page
    total = len(metata)
    for p in range(total):
        
        # preps
        pos = p + (page - 1) * n_pagin   # 0-based index of the item across all pages
        el = metata[p]
        item_mes = '\n--> Item #' + '{:,}'.format(pos+1) + '/' + '{:,}'.format(total_results)
        print(item_mes)
        write_to_log(log_path, item_mes)
        url = item_url + el['uuid']
        url_xml = url + '.xml'
        filename = 'item_' + str(pos) + '_page_' + str(page)
        print(url)
        write_to_log(log_path, url)
        
        # json
        try:
            data = requests.get(url, headers={'Authorization': auth}).json()['nyplAPI']
            json_f = PurePath(json_dir, filename + '.json')
            with open(json_f, 'w') as f:
                json.dump(data['response'], f)
                log = data['response']['headers']['message']['$']
                # pp.pprint(data['response']['headers']['message'])
                json_mes = '- json file created --> ' + str(log)
                print(json_mes)
                write_to_log(log_path, json_mes)
        except Exception as exc:
            # print the caught error itself, not the Exception class
            print(repr(exc))
            print('--> Issue with file/url ...')
        
        # xml
        try:
            data = requests.get(url_xml, headers={'Authorization': auth}).text
            xml_f = PurePath(xml_dir, filename + '.xml')
            tree = etree.fromstring(bytes(data, encoding='utf-8'))
            meta_xml = tree.findall('response')[0]
            xml_data = etree.tostring(meta_xml).decode('UTF-8')
            with open(xml_f, "w") as f:
                f.write(xml_data)
                log = tree.findall('response/headers/message')[0].text
                xml_mes = '- xml file created --> ' + log
                print(xml_mes)
                write_to_log(log_path, xml_mes)
        except Exception as exc:
            # print the caught error itself, not the Exception class
            print(repr(exc))
            print('--> Issue with file/url ...')

    # iterate to next page
    page += 1

# end loop
log_stream.close()



--> Page 1
http://api.repo.nypl.org/api/v1/items/search?&publicDomainOnly=true&q=Photographs&per_page=500&field=genre
138,604 items retrieved from the search 'Photographs' 
278 pages to process

--> Item #1/138,604
http://api.repo.nypl.org/api/v1/items/mods_captures/7d71ff40-b3cb-0132-6b3b-58d385a7bbd0
- json file created --> ok
- xml file created --> ok

--> Item #2/138,604
http://api.repo.nypl.org/api/v1/items/mods_captures/d62998c0-b45c-0132-b812-58d385a7b928
- json file created --> ok
- xml file created --> ok

--> Item #3/138,604
http://api.repo.nypl.org/api/v1/items/mods_captures/c91a6ca0-b462-0132-b9ba-58d385a7bbd0
- json file created --> ok
- xml file created --> ok

--> Item #4/138,604
http://api.repo.nypl.org/api/v1/items/mods_captures/6cf6f150-b53e-0132-e134-58d385a7bbd0
- json file created --> ok
- xml file created --> ok

--> Item #5/138,604
http://api.repo.nypl.org/api/v1/items/mods_captures/92bb1610-b54e-0132-aa1c-58d385a7b928
- json file created --> ok
[...]

- xml file created --> ok

--> Item #50/138,604
http://api.repo.nypl.org/api/v1/items/mods_captures/510d47e3-b54d-a3d9-e040-e00a18064a99
- json file created --> ok
- xml file created --> ok

--> Item #51/138,604
http://api.repo.nypl.org/api/v1/items/mods_captures/510d47dd-c491-a3d9-e040-e00a18064a99
- json file created --> ok
- xml file created --> ok

--> Item #52/138,604
http://api.repo.nypl.org/api/v1/items/mods_captures/510d47dd-c49d-a3d9-e040-e00a18064a99
- json file created --> ok
- xml file created --> ok

--> Item #53/138,604
http://api.repo.nypl.org/api/v1/items/mods_captures/510d47dd-c49a-a3d9-e040-e00a18064a99
- json file created --> ok
- xml file created --> ok

--> Item #54/138,604
http://api.repo.nypl.org/api/v1/items/mods_captures/510d47dd-c495-a3d9-e040-e00a18064a99
- json file created --> ok
- xml file created --> ok

--> Item #55/138,604
http://api.repo.nypl.org/api/v1/items/mods_captures/510d47dd-c45c-a3d9-e040-e00a18064a99
- json file created --> ok
[...]

- xml file created --> ok

--> Item #100/138,604
http://api.repo.nypl.org/api/v1/items/mods_captures/510d47dd-b0e4-a3d9-e040-e00a18064a99
- json file created --> ok
- xml file created --> ok

--> Item #101/138,604
http://api.repo.nypl.org/api/v1/items/mods_captures/510d47dd-b0e3-a3d9-e040-e00a18064a99
- json file created --> ok
- xml file created --> ok

--> Item #102/138,604
http://api.repo.nypl.org/api/v1/items/mods_captures/510d47dd-470b-a3d9-e040-e00a18064a99
- json file created --> ok
- xml file created --> ok

--> Item #103/138,604
http://api.repo.nypl.org/api/v1/items/mods_captures/510d47dd-b0c4-a3d9-e040-e00a18064a99
- json file created --> ok
- xml file created --> ok

--> Item #104/138,604
http://api.repo.nypl.org/api/v1/items/mods_captures/510d47dd-b0e0-a3d9-e040-e00a18064a99
- json file created --> ok
- xml file created --> ok

--> Item #105/138,604
http://api.repo.nypl.org/api/v1/items/mods_captures/510d47dd-d2aa-a3d9-e040-e00a18064a99
- json file created --> ok
[...]

- json file created --> ok
- xml file created --> ok

--> Item #150/138,604
http://api.repo.nypl.org/api/v1/items/mods_captures/510d47dc-a057-a3d9-e040-e00a18064a99
- json file created --> ok
- xml file created --> ok

--> Item #151/138,604
http://api.repo.nypl.org/api/v1/items/mods_captures/510d47dc-a054-a3d9-e040-e00a18064a99
- json file created --> ok
- xml file created --> ok

--> Item #152/138,604
http://api.repo.nypl.org/api/v1/items/mods_captures/510d47dc-a059-a3d9-e040-e00a18064a99
- json file created --> ok
- xml file created --> ok

--> Item #153/138,604
http://api.repo.nypl.org/api/v1/items/mods_captures/510d47dc-a05a-a3d9-e040-e00a18064a99
- json file created --> ok
- xml file created --> ok

--> Item #154/138,604
http://api.repo.nypl.org/api/v1/items/mods_captures/510d47dc-a04e-a3d9-e040-e00a18064a99
- json file created --> ok
- xml file created --> ok

--> Item #155/138,604
http://api.repo.nypl.org/api/v1/items/mods_captures/510d47dc-a04f-a3d9-e040-e00a18064a99
[...]

- json file created --> ok
- xml file created --> ok

--> Item #200/138,604
http://api.repo.nypl.org/api/v1/items/mods_captures/510d47dd-c12f-a3d9-e040-e00a18064a99
- json file created --> ok
- xml file created --> ok

--> Item #201/138,604
http://api.repo.nypl.org/api/v1/items/mods_captures/510d47dd-c128-a3d9-e040-e00a18064a99
- json file created --> ok
- xml file created --> ok

--> Item #202/138,604
http://api.repo.nypl.org/api/v1/items/mods_captures/510d47dd-b1d9-a3d9-e040-e00a18064a99
- json file created --> ok
- xml file created --> ok

--> Item #203/138,604
http://api.repo.nypl.org/api/v1/items/mods_captures/510d47dd-c11a-a3d9-e040-e00a18064a99
- json file created --> ok
- xml file created --> ok

--> Item #204/138,604
http://api.repo.nypl.org/api/v1/items/mods_captures/510d47dd-b192-a3d9-e040-e00a18064a99
- json file created --> ok
- xml file created --> ok

--> Item #205/138,604
http://api.repo.nypl.org/api/v1/items/mods_captures/510d47dd-c13f-a3d9-e040-e00a18064a99
[...]

- json file created --> ok
- xml file created --> ok

--> Item #250/138,604
http://api.repo.nypl.org/api/v1/items/mods_captures/510d47d9-c461-a3d9-e040-e00a18064a99


KeyboardInterrupt: 
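
The page arithmetic used in the loop above can be checked on its own, without calling the API. The sketch below is only an illustration: `page_count` and `page_urls` are hypothetical helper names that do not appear in the cell above, and the figures (138,604 results, 500 per page) are taken from the printed output.

```python
# Sketch only: page_count and page_urls are hypothetical helpers,
# not functions defined in the notebook cell above.

def page_count(total_results, per_page):
    """Number of pages needed to cover total_results (ceiling division)."""
    if per_page <= 0:
        raise ValueError("per_page must be positive")
    return -(-total_results // per_page)


def page_urls(base_url, n_pages):
    """Yield one URL per page; page 1 is the bare search URL, as in the loop above."""
    for page in range(1, n_pages + 1):
        yield base_url if page == 1 else base_url + '&page=' + str(page)


# Search URL exactly as printed in the output above
search_url = ('http://api.repo.nypl.org/api/v1/items/search'
              '?&publicDomainOnly=true&q=Photographs&per_page=500&field=genre')

n = page_count(138604, 500)
urls = list(page_urls(search_url, n))
print(n)                        # 278
print(urls[1].split('&')[-1])   # page=2
```

Building the list of page URLs up front makes it possible to see how many requests a run will issue before starting it; here that is 278 search requests, plus two requests (JSON and XML) per item.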