# Accessing LMEC Collections via JSON API

This notebook provides some tips for using Digital Commonwealth's JSON API to query the LMEC collections portal and programmatically retrieve metadata about collections items.

### URL syntax

To retrieve any page as JSON, append `.json` to the path of the page URL, before any query string. For a search on the collections portal, that means placing it directly after `search`:

    # normal, return HTML
    https://collections.leventhalmap.org/search?utf8=%E2%9C%93&q=lowell&search_field=all_fields

    # return JSON
    https://collections.leventhalmap.org/search.json?utf8=%E2%9C%93&q=lowell&search_field=all_fields
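
If you'd rather not edit URLs by hand, the same rule can be sketched in a few lines of Python. This is only an illustration: `to_json_url` is a helper name we made up, not part of any API, and it assumes the only change needed is adding `.json` to the end of the URL path.

```python
from urllib.parse import urlsplit, urlunsplit


def to_json_url(page_url):
    """Return the JSON variant of a collections portal page URL.

    Hypothetical helper: it only appends ".json" to the URL path.
    """
    parts = urlsplit(page_url)
    path = parts.path
    # Append ".json" to the path only, leaving the query string untouched.
    if not path.endswith(".json"):
        path += ".json"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


html_url = "https://collections.leventhalmap.org/search?utf8=%E2%9C%93&q=lowell&search_field=all_fields"
print(to_json_url(html_url))
```

Because the helper works on the path rather than the whole string, the `.json` suffix lands before the `?` no matter how long the query string is.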

### Increasing max items returned from search query

By default, this query returns at most 20 items, because the JSON response mirrors a single page of search results. You can raise the limit to 100 by replacing `utf8=%E2%9C%93&` with `per_page=100&`:

    # normal, return HTML with up to 100 items per page
    https://collections.leventhalmap.org/search?per_page=100&q=lowell&search_field=all_fields

    # return JSON
    https://collections.leventhalmap.org/search.json?per_page=100&q=lowell&search_field=all_fields

### Tweaking the query with other filters

You can also tweak your search by adjusting things like "Place," "Topic," and "Date" on the collections portal itself before grabbing the URL. The following query combines two constraints: 1) the keyword "Lowell" and 2) a date range from 1850 to 1951. It also requests 100 items per page, although only 19 maps are returned:

    # normal, return HTML with up to 100 items and a date constraint
    https://collections.leventhalmap.org/search?per_page=100&q=lowell&range%5Bdate_facet_yearly_itim%5D%5Bbegin%5D=1850&range%5Bdate_facet_yearly_itim%5D%5Bend%5D=1951&search_field=dummy_range

    # return JSON
    https://collections.leventhalmap.org/search.json?per_page=100&q=lowell&range%5Bdate_facet_yearly_itim%5D%5Bbegin%5D=1850&range%5Bdate_facet_yearly_itim%5D%5Bend%5D=1951&search_field=dummy_range
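
Long query strings like these are easy to mistype, so it can help to assemble them from a dictionary. Here is a minimal sketch using only the parameters that appear in the URLs above; `build_search_url` is a hypothetical helper of ours, and we have only assumed (not verified) that the portal ignores the order in which parameters are listed.

```python
from urllib.parse import urlencode

SEARCH_JSON = "https://collections.leventhalmap.org/search.json"


def build_search_url(keyword, per_page=100, begin=None, end=None):
    """Build a JSON search URL, optionally constrained to a year range.

    Hypothetical helper; parameter names are copied from the URLs in this notebook.
    """
    params = {"per_page": per_page, "q": keyword}
    if begin is not None and end is not None:
        # urlencode percent-encodes the square brackets as %5B and %5D.
        params["range[date_facet_yearly_itim][begin]"] = begin
        params["range[date_facet_yearly_itim][end]"] = end
        params["search_field"] = "dummy_range"
    else:
        params["search_field"] = "all_fields"
    return SEARCH_JSON + "?" + urlencode(params)


print(build_search_url("lowell"))
print(build_search_url("lowell", begin=1850, end=1951))
```

The helper only adds the date range when both years are supplied, since every date-constrained URL in this notebook carries both a `begin` and an `end` value.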

### Collections item-level syntax

At the item level, append `.json` to the end of the item URL, directly after the commonwealth ID:

    # normal, return HTML
    https://collections.leventhalmap.org/search/commonwealth:3f463717c

    # return JSON
    https://collections.leventhalmap.org/search/commonwealth:3f463717c.json

### Parsing a single item

We can parse a single item by first reading JSON data from a given URL into a Python dictionary, and then printing it as a string:

In [None]:
# import the relevant python libraries:
# `json` for parsing json formatted data,
# `requests` for easily accessing json data, and
# `pandas` for viewing data in tables/frames

import json
import requests
import pandas as pd

data = requests.get("https://collections.leventhalmap.org/search/commonwealth:3f463717c.json")

print(json.dumps(data.json(), indent=2))
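
The cell above assumes the request succeeds. For anything longer than a quick experiment, you may want a request that times out and fails loudly on HTTP errors. The following is one possible sketch, not the only way to do it: `fetch_json` and its `get` parameter are names we introduced (the latter only so the function can be exercised without a network connection), and the 30-second timeout is an arbitrary choice.

```python
import requests


def fetch_json(url, timeout=30, get=requests.get):
    """Fetch a URL and return its parsed JSON body, raising on HTTP errors.

    Hypothetical helper; `get` defaults to requests.get and exists only
    so a stand-in can be passed during testing.
    """
    response = get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of
    # failing later with a confusing JSON decoding error.
    response.raise_for_status()
    return response.json()
```

With this in place, `fetch_json("https://collections.leventhalmap.org/search/commonwealth:3f463717c.json")` would return the same dictionary that `data.json()` gives above.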

### Retrieving a larger query

That was just JSON from one item. We can also retrieve and parse multiple items at once by redefining the `data` variable with a **search URL** instead of a single item.

For example, the URL

`https://collections.leventhalmap.org/search.json?per_page=100&q=lowell&search_field=all_fields`

will return a larger response. This search for "lowell" returned 35 items total:

In [None]:

data = requests.get("https://collections.leventhalmap.org/search.json?per_page=100&q=lowell&search_field=all_fields")

len(data.json()["response"]["docs"])

We printed the number of items, as opposed to the full JSON for the query, because printing the full JSON would take up way too much space.

### Architecture of the API response

If you loop through the query's `response`, you can parse each section of the collections portal's web page:

In [None]:
for a in data.json()["response"]:
    print(a)

where `docs` contains collection items, `facets` contains filters (e.g., the "Date" or "Subject" filter), and `pages` contains actions for moving through pages. Since this query contains 35 items and we've set our page view to 100, there's only 1 page here, but this is useful to know for something like building a Python scraper.
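
To give a sense of what such a scraper might look like, here is a rough sketch of a loop that gathers `docs` across several pages. It rests on an assumption this notebook does not confirm: that `pages` includes a `next_page` value which is empty on the last page. Check the actual keys in your own response before relying on it. `collect_docs` and `fetch_page` are hypothetical names.

```python
def collect_docs(fetch_page, max_pages=50):
    """Gather `docs` from successive result pages.

    `fetch_page(page_number)` is a caller-supplied function that must
    return the parsed JSON for that page of results.
    """
    docs = []
    page = 1
    # max_pages is a safety cap so a surprising response cannot loop forever.
    for _ in range(max_pages):
        response = fetch_page(page)["response"]
        docs.extend(response["docs"])
        # ASSUMPTION: `pages` holds a `next_page` value that is empty on the last page.
        next_page = response.get("pages", {}).get("next_page")
        if not next_page:
            break
        page = next_page
    return docs
```

In practice, `fetch_page` could wrap `requests.get` on a search URL with a page number added to the query string. We assume, without having verified it here, that the portal accepts a `page` parameter for this.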

We mostly will want to interact with `docs`. Let's start by figuring out what kind of metadata each item contains.

### Accessing metadata fields

We can easily list metadata fields by:

1. putting our data into a data frame and
2. listing the data frame's columns

In [None]:
df = pd.DataFrame(data.json()['response']['docs'])

print(list(df.columns.values))

There are a *lot* of metadata fields here (77!). You don't need to see all of them, since many contain irrelevant detail or null values, so our next step is to filter some fields out.

You could start by examining what each field contains by visiting the [BPL's field name reference guide](https://github.com/boston-library/solr-core-conf/wiki/SolrDocument-field-reference:-public-API).
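
Another way to narrow things down is to measure how often each field is actually filled in. The sketch below uses a tiny made-up list of records in place of `data.json()['response']['docs']`; the IDs, titles, and the `example_sparse_field` column are invented for illustration, and the 50% cutoff is an arbitrary choice.

```python
import pandas as pd

# Hypothetical stand-in for data.json()['response']['docs'].
sample_docs = [
    {"id": "commonwealth:aaa", "title_info_primary_tsi": "Map A", "georeferenced_bsi": True},
    {"id": "commonwealth:bbb", "title_info_primary_tsi": "Map B"},
    {"id": "commonwealth:ccc", "title_info_primary_tsi": "Map C", "example_sparse_field": "x"},
]
sample_df = pd.DataFrame(sample_docs)

# Share of non-null values per column, with the best-populated fields first.
coverage = sample_df.notna().mean().sort_values(ascending=False)
print(coverage)

# Keep only the columns that are populated in at least half of the rows.
well_populated = coverage[coverage >= 0.5].index.tolist()
print(well_populated)
```

Running the same two steps on the real `df` should give a quick shortlist of fields worth looking up in the reference guide.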

### Filtering a data frame by columns

Still, this task can be pretty arduous. Below, we selected a few particularly useful fields and stored them as a list so that we only see selected columns in the resultant data frame. We also renamed the fields for readability.

In [None]:
fields = ['title_info_primary_tsi', 'name_tsim', 'id', 'date_end_dtsi', 'georeferenced_bsi']
newFieldNames = {'title_info_primary_tsi':'title', 'name_tsim':'creator', 'id':'commonwealth_id', 'date_end_dtsi':'date', 'georeferenced_bsi':'georef'}

df_fltr = pd.DataFrame(df[fields])
df_fltr.rename(columns = newFieldNames, inplace = True)
df_fltr

### Geospatial data fields

In addition to the boolean georeferencing field, there are three other geospatial data fields that may be of interest:

1. `subject_bbox_geospatial` is a string array of four coordinates that create a bounding box on georeferenced items
2. `subject_coordinates_geospatial` is a string array of lat and long for locations depicted in an item
3. `subject_point_geospatial` is a lat-long string depicting the center point of an item

Let's create a data frame that filters against these fields:

In [None]:
fields = ['title_info_primary_tsi', 'subject_bbox_geospatial', 'subject_coordinates_geospatial', 'subject_point_geospatial', 'georeferenced_bsi']
newFieldNames = {'title_info_primary_tsi':'title', 'subject_bbox_geospatial':'bbox', 'subject_coordinates_geospatial':'coords', 'subject_point_geospatial':'centerpoint', 'georeferenced_bsi':'georef'}

df_fltr = pd.DataFrame(df[fields])
df_fltr.rename(columns = newFieldNames, inplace = True)
df_fltr
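
Since these fields arrive as strings, you will probably want numbers before mapping anything. As a small sketch, here is how a center point could be split into floats. It assumes the value is a single comma-separated `"lat,long"` string, which is our guess from the description above rather than something this notebook confirms, so inspect a few real values first. `parse_point` is a hypothetical helper and the coordinates are made up.

```python
def parse_point(point):
    """Split a "lat,long" string into a (lat, long) tuple of floats.

    ASSUMPTION: the two values are separated by a single comma.
    """
    lat_text, long_text = point.split(",")
    # float() tolerates surrounding whitespace, so "42.6, -71.3" also works.
    return float(lat_text), float(long_text)


# Hypothetical value, not taken from the portal.
print(parse_point("42.6334,-71.3162"))
```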

### Commonwealth IDs

One thing to highlight here is the **commonwealth ID**. In our collections, the commonwealth ID is a stable item identifier. Prefixing any of these IDs with `https://collections.leventhalmap.org/search/` will take you directly to the item's web page.
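
That prefixing rule is simple enough to wrap in a function. Here is a minimal sketch; `item_url` is a name we chose for illustration, and the ID is the one used earlier in this notebook.

```python
BASE = "https://collections.leventhalmap.org/search/"


def item_url(commonwealth_id, as_json=False):
    """Return the web page URL (or JSON URL) for a commonwealth ID."""
    url = BASE + commonwealth_id
    # The JSON variant is the same URL with ".json" appended.
    return url + ".json" if as_json else url


print(item_url("commonwealth:3f463717c"))
print(item_url("commonwealth:3f463717c", as_json=True))
```

Applied to a data frame, something like `df_fltr['commonwealth_id'].map(item_url)` would produce a column of links, provided the frame you are working with includes that column.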

### Filtering data frame by column values

Now, let's say we want to filter our response according to certain metadata fields, for example only retrieving IDs for maps that have been georeferenced.

The [`.loc` property](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html) of pandas makes it easy to access rows and columns by a label or a boolean array.

In [None]:
df_fltr.loc[df_fltr['georef'] == True]

### Calculate a new column based on an existing one

You could also add more conditions, such as keeping only georeferenced maps dated 1850 or later. This requires creating a new column, since our `date` field is stored as an [ISO 8601](https://en.wikipedia.org/wiki/ISO_8601) string.

Here, we just extracted the first 4 characters from the `date` field and turned them into an integer:

In [None]:
fields = ['title_info_primary_tsi', 'name_tsim', 'id', 'date_end_dtsi', 'georeferenced_bsi']
newFieldNames = {'title_info_primary_tsi':'title', 'name_tsim':'creator', 'id':'commonwealth_id', 'date_end_dtsi':'date', 'georeferenced_bsi':'georef'}

df_fltr = pd.DataFrame(df[fields])
df_fltr.rename(columns = newFieldNames, inplace = True)

df_fltr['year'] = df_fltr['date'].str[:4].astype(int)
df_fltr
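
One caveat: `.astype(int)` raises an error if any item has no date at all. If you run into that with a different query, a more forgiving variant is sketched below. The dates are made up for illustration, and the choice of the nullable `Int64` dtype is ours.

```python
import pandas as pd

# Hypothetical dates; the middle row has no date at all.
dates = pd.Series(["1850-12-31T23:59:59.999Z", None, "1912-01-01T00:00:00.000Z"])

# to_numeric(errors="coerce") turns anything unparseable into a missing value
# instead of raising, and the nullable Int64 dtype keeps whole-number years
# alongside those missing values.
years = pd.to_numeric(dates.str[:4], errors="coerce").astype("Int64")
print(years.tolist())
```

Rows without a date end up with a missing year rather than stopping the whole cell.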

### Filtering against multiple parameters

Now we can use the `.loc` property again, but this time filtering on two conditions: 1) the map is georeferenced and 2) it is dated 1850 or later.

In [None]:
df_fltr.loc[(df_fltr['georef'] == True) & (df_fltr['year'] >= 1850)]

### Filtering by date via the API request

A less programmatic way to filter by date is to just manually set filters to your desired search on the [LMEC collections portal](https://collections.leventhalmap.org), and then grab the resulting URL.

To do it this way, we'll first redefine our original request, and then we'll recreate the necessary data frames:

In [None]:
# request a search query that is pre-filtered by a date range

data_Date = requests.get("https://collections.leventhalmap.org/search.json?utf8=%E2%9C%93&q=lowell&search_field=dummy_range&range%5Bdate_facet_yearly_itim%5D%5Bbegin%5D=1850&range%5Bdate_facet_yearly_itim%5D%5Bend%5D=1951&commit=Apply")

# define the results of that query as a data frame

df_Date = pd.DataFrame(data_Date.json()['response']['docs'])

# filter the data frame so that it only shows relevant columns
# and rename the columns so they're human readable

df_Date_fltr = pd.DataFrame(df_Date[fields])
df_Date_fltr.rename(columns = newFieldNames, inplace = True)

# print data frame

df_Date_fltr

### Filtering by column again

Filtering by the `georef` column shows us the same 5 maps of Lowell that meet both conditions: 1) georeferenced and 2) dated 1850 or later.

In [None]:
df_Date_fltr.loc[df_Date_fltr['georef'] == True]

# That's it!

Next, check out the notebook on using the IIIF API.