**October 13, 2020**

**1:00-1:40**
## Optimized Python for Working with Data and API's
_Dolsy Smith_

_George Washington University_

dsmith@gwu.edu

Thanks to Alma’s API’s, we can create custom applications, automate workflows, and perform batch operations not possible with Jobs and Sets. However, using the API’s on large amounts of data can be slow. In this session, we will explore tools and strategies available in Python that can make these tasks and applications more efficient. 

### Introduction

#### Why I Use Python

- It's a scripting language with user-friendly syntax, making it relatively easy to learn.
- It's a high-level language with some great features (like list comprehensions) that make it useful for rapid prototyping. 
- It has a huge open-source ecosystem, with third-party Python libraries available for almost every task you might imagine.
- For an interpreted language, it's fairly performant.

#### Why "Optimized" Python?

Being interpreted, native Python cannot achieve the efficiency of compiled languages like Java and C. Several robust approaches exist to mitigate this limitation:

  - Use high-performance libraries that delegate repetitive operations to a lower-level language under the hood.
  - Invoke multiple threads and/or processors.
  - Take advantage of Python's support for asynchronous/concurrent I/O.
 
We will look at the first and third of these approaches today.


#### Use Cases

The following are a handful of ways I've used these approaches in my work with Alma and its API's.

##### Discharging over 60,000 items using Alma's scan-In API. 

With concurrent requests, this took a [Python script](https://github.com/gwu-libraries/bulk-loans-discharge-alma)  roughly 45 minutes to complete on my laptop.


##### Looking up Alma users in real-time for a LibCal integration. 

Needing to adhere to strict constraints on usage of our library's physical spaces during the pandemic, we require patrons to book an appointment in LibCal. [Our application](https://github.com/gwu-libraries/libcal_pp_integration) retrieves appointments from the LibCal API, enriches them with user data from Alma, and loads them into our visitor-management software. Concurrent requests allow us to retrieve data for multiple users from Alma at the same time -- an important piece of efficiency, since our app is fetching appointments at 5-minute intervals.


##### Testing legacy portfolio URL's

After migration, we found ourselves with several thousand portfolios that had been created from Voyager catalog records (as opposed to the activations from our ERM). Being unattached to collections, these would have to be analyzed one by one. But with concurrent requests, I was able to test the URLs quickly and identify those that did not return a valid HTTP response so that they could be deactivated.


##### Migration cleanup: merging and munging

After our migration, we had (and still have) lots of cleanup to do. Frequently, this involves comparing Alma Analytics reports with data from other sources, including our legacy Voyager database. With Python's `pandas` module, I can  filter, clean, re-shape, and merge large datasets far more efficiently than in Excel. Plus, `pandas` supports a variety of input and output formats, including `.csv` and `.xlsx` files. Doing this work in Jupyter notebooks has the further advantage of allowing me to document my process in Markdown, which makes the work more shareable and reproducible.

This workshop presents an optimized workflow for batch operations using Python and the Alma API's. 

#### Outline
1. API housekeeping (YAML)
2. Reading an Analytics report in Python (`pandas`)
3. A brief primer on asynchronous programming
4. Making API requests asynchronously (`aiohttp`)
5. Processing the results (`pandas`)

### Setup with `pandas` and YAML

#### YAML files for configuration

YAML is a "human-readable data-serialization language" ([Wikipedia](https://en.wikipedia.org/wiki/YAML)) that works much like JSON but without all the extra punctuation. Like Python, it uses whitespace/indentation to create nested blocks (instead of JSON's curly braces), and it doesn't require quotation marks around strings except in certain situations.

I've stored my API key, the Alma API endpoint I'll be using, and the path to a CSV file containing identifiers in a file called `workshop-config.yml`. The following allows me to read these config values into a Python dictionary.

In [74]:
# If you haven't already, run !pip install pyyaml
import yaml

In [75]:
with open('./workshop-config.yml', 'r') as f:
    config = yaml.load(f, Loader=yaml.FullLoader)

In [76]:
config

{'test_url': 'http://slowwly.robertomurray.co.uk/delay/1000/url/http://www.google.com',
 'api_key': 'l8xx77562cda96264fa1afb585e50d992fad',
 'get_item_url': 'https://api-na.hosted.exlibrisgroup.com/almaws/v1/bibs/{mms_id}/holdings/{holding_id}/items/{item_id}',
 'csv_path': './items_technical_migration.csv'}

For now, I'm leaving this as a global variable. We'll use it throughout the workflow that follows.

#### Reading Analytics data with `pandas`

The report I'm using contains items that received an error status during our migration to Alma. The report has about 20,000 rows. We'll work with a subset of them for this example. 

In what follows, I'll walk through some of the features of the `pandas` library that make it useful for cleaning and re-shaping data.

In [77]:
import pandas as pd

First, we read our report into `pandas`. The report has been saved locally as a CSV.

The `pd.read_csv` function returns a `DataFrame` object.

In [78]:
migration_errors = pd.read_csv(config['csv_path'])

The column names are the standard Analytics column names, in title case with whitespace. These will be easier to work with if we make them valid Python identifiers. We can do this by way of a list comprehension applied to the `.columns` attribute on our `DataFrame`. 

In [79]:
migration_errors.columns = [column.lower().replace(' ', '_') for column in migration_errors.columns]

We have two columns named `suppressed_from_discovery`. The first is for the holdings record, the second for the bib. We should probably rename them to avoid confusion.

The following `for` loop uses built-in Python list methods to create new headers for the `suppressed_from_discovery` columns. 

In [80]:
new_names = ['suppressed_holdings', 'suppressed_bibs']
new_columns = []
for column in migration_errors.columns:
    if column == 'suppressed_from_discovery':
        new_columns.append(new_names.pop(0))
    else:
        new_columns.append(column)

In [81]:
migration_errors.columns = new_columns

Now let's exclude those suppressed items (holdings or bibs) from our set. We can do this using the `.loc` functionality of a `DataFrame`, which can accept a Boolean expression, returning only those rows for which the Boolean expression evaluates to `True`.

If you evaluate either one of the parenthesized expressions below on its own, you'll notice that it returns a data structure called a `Series`, which in this case has the same number of rows as the original `migration_errors` `DataFrame`. The values of each `Series`, however, are only `True` and `False`. Passing those expressions to `migration_errors.loc` (which is not a function, but an indexer, hence the square brackets) gives us the result we want. 

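The pipe symbol `|` is used here in place of the keyword `or` because we are comparing two Boolean values for _every row of the `DataFrame`_, rather than comparing two objects (the Boolean `Series` themselves). Note that `|` keeps a row when _either_ condition holds; to keep only rows where neither the holdings record nor the bib is suppressed, combine the conditions with `&` (the element-wise counterpart of `and`) instead.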

In [82]:
migration_errors = migration_errors.loc[(migration_errors.suppressed_holdings == 0) | (migration_errors.suppressed_bibs == 'No')]
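
To see how the element-wise operators behave, here is a minimal sketch on a four-row `DataFrame`. The values are invented for illustration; only the column names come from the report above.

```python
import pandas as pd

# Toy data, invented for illustration.
toy = pd.DataFrame({
    'suppressed_holdings': [0, 0, 1, 1],
    'suppressed_bibs': ['No', 'Yes', 'No', 'Yes'],
})

holdings_ok = toy.suppressed_holdings == 0   # Boolean Series, one value per row
bibs_ok = toy.suppressed_bibs == 'No'        # Boolean Series, one value per row

either = toy.loc[holdings_ok | bibs_ok]      # rows where at least one condition holds
both = toy.loc[holdings_ok & bibs_ok]        # rows where both conditions hold

print(len(either), len(both))
```

The parentheses in the original expression matter: `|` and `&` bind more tightly than `==`, so each comparison has to be wrapped (or assigned to a variable first, as here).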

We can use the same process to exclude items in temporary locations.

(Here, the value `None` coming from Analytics has been read by Python as a string, not as the Python null type `None`.)

In [83]:
migration_errors = migration_errors.loc[migration_errors.temporary_location_name == 'None']

Finally, what if we want to limit our set by call number, only to items in the N's? 

`pandas` supports efficient indexing by string conditions, using the `.str` attribute of a column that consists of Python strings.

In [84]:
art_books = migration_errors.loc[migration_errors.permanent_call_number.str.startswith('N')]
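
One caveat, sketched here on invented call numbers: if the column contains missing values, `.str.startswith` can (depending on the `pandas` version and column type) return a missing value for those rows, and `.loc` rejects a mask that contains missing values. Passing `na=False` treats a missing call number as a non-match.

```python
import pandas as pd

# Invented call numbers; the third row is missing a value.
calls = pd.DataFrame({
    'permanent_call_number': ['N7430 .A1', 'PS3545 .I5', None, 'NA2500 .B3']
})

# na=False: rows with no call number count as "does not start with N"
mask = calls.permanent_call_number.str.startswith('N', na=False)
n_class = calls.loc[mask]

print(n_class)
```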

We could, of course, achieve these same results in Analytics by using filters. But the ability to do so in Python often gives me greater flexibility, since I can quickly try different approaches without re-running the query. In addition, `pandas` has far more tools for data cleanup and analysis than the restricted SQL set of Analytics. Finally, I can use `pandas` to merge Analytics data sets with data from other sources (including other Analytics subject areas). 

We could easily devote a whole workshop to exploring `pandas`. But for now, we'll see how to use our filtered report to create concurrent requests to the Alma API.

### Asynchronous programming: a brief primer

The following section attempts to illustrate some of the principles of asynchronous programming via more atomic Python constructs, namely, iterators, generators, and coroutines. At the end of this notebook, I've provided additional resources on this topic. Depending on your background in writing code, it may not be an easy topic to grasp; I certainly struggled with it for several years before it began to crystallize (before I could really understand why the code I wrote sometimes behaved the way it did). What helped was seeing the connection between the higher-level asynchronous syntax I'll introduce below, and the more foundational parts of the language we'll turn to now.

Most Python code we write executes in a **synchronous** fashion. But what does _that_ mean? 
  - It's not necessarily about things happening _at the same time_. 
  - Rather, it's about things happening _in sync_ with one another. As in synchronized swimming.

In [85]:
for i in range(5):
    print(i)

0
1
2
3
4


By definition, a `for` loop in Python executes the statements in the block sequentially. Another way to put it is that the outcome of the loop is deterministic: the variable `i` will take on the values from `range(5)` always in this order. If it somehow printed the `3` before the `2`, we would conclude that something had gone terribly wrong.

In fact, strictly speaking, the Python interpreter _never_ does more than one thing at the same time. For complicated reasons, and as conventional wisdom has it, Python isn't very good at parallel processing. There is the  `multiprocessing` library, but it effectively spawns multiple copies of the Python interpreter, which comes with a certain amount of overhead. 

Libraries like `pandas`, which support fast computation, tend to delegate CPU-intensive operations to lower-level code (usually written in C). But there are other things we use Python for that don't consume a lot of CPU cycles, but which can still slow down our code. Chief among these are operations involving what's called _blocking I/O_. 

#### Do Python scripts dream of electric sheep?

Let's say we need to request data from the same webserver five times in a row. If our requests **block**, then between sending the request and receiving the response, nothing else can happen on our end. It's as though whenever you sent an email, you had to wait until the recipient responded before sending another. That would certainly make your inbox easier to organize, but on the other hand, you might not be able to accomplish much.

Using Python's `requests` library, we can see this behavior in action. (Do `!pip install requests` first if you get a `ModuleNotFoundError` when running the `import` statement.)

The URL we're using for this test causes a delay of at least one second before the server returns a response.

In [86]:
import requests
from datetime import datetime

In [87]:
def sync_fn(i):
    resp = requests.get(config['test_url'])
    print(f'Loop {i}; URL status: {resp.status_code}; Timestamp: {datetime.now().time()}')

In [88]:
for i in range(5):
    sync_fn(i)

Loop 0; URL status: 200; Timestamp: 09:15:29.592248
Loop 1; URL status: 200; Timestamp: 09:15:30.918724
Loop 2; URL status: 200; Timestamp: 09:15:32.451844
Loop 3; URL status: 200; Timestamp: 09:15:33.870340
Loop 4; URL status: 200; Timestamp: 09:15:35.171670


The timestamps should be roughly one second apart. Most of that time was spent by Python idling for a response. Which is not such a big deal if we're making 5 requests, or even 500, but what about 5,000 or 50,000?
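
The cost of blocking adds up linearly. The following sketch substitutes `time.sleep` for the network round trip (an assumption made so that the example runs without a network connection; the name `fake_request` and the 0.1-second delay are arbitrary) and times five calls made in sequence.

```python
import time

def fake_request(delay=0.1):
    # Stand-in for a blocking HTTP request: nothing else runs while we sleep.
    time.sleep(delay)
    return 200

start = time.perf_counter()
statuses = [fake_request() for _ in range(5)]
elapsed = time.perf_counter() - start

print(f'{len(statuses)} requests took {elapsed:.2f} seconds')
```

The total is roughly the sum of the individual delays, which is why 50,000 blocking requests at about a second apiece would take most of a day.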

The alternative is to write code that doesn't block on certain kinds of I/O. How do we accomplish that, if the Python interpreter only does one thing at a time?

It might help to think about multitasking. Human beings do many things at the same time, like walking and breathing and digesting food. But when we talk about multitasking, we're usually talking about something short of true simultaneity. Handling email correspondence is a good example: you might be engaged in multiple threads of conversation throughout the day. You might be writing a couple of emails while watching a webinar. But it would be quite the neurological feat if three separate parts of your brain were each working on a different task, completely in parallel. More likely, the same parts of your brain are quickly switching back and forth between the tasks. Multitasking is an exercise in effective sequencing. Consider the case where you compose an email, send it, compose another, send it, send a third, receive a response to the first, reply, compose a fourth email while waiting for replies to the other two, and so on. 

Such commonplace multitasking is analogous to **non-blocking** I/O. Different languages support non-blocking I/O (if they support it at all) in different ways. In Python 3, the `asyncio` library provides this support. 

Before we turn to `asyncio`, let's take a closer look at the building blocks of Python's support for asynchronous programming.

#### Iterators, generators, and coroutines, oh my!

An **iterator** is a special Python object that permits iteration. It does so by producing a sequence of values _on demand_. Some built-in Python functions return iterators. `enumerate`, for example, accepts a sequence of values and returns, for each element in the sequence, a pair of values: the element's index and the element itself. Normally, we use `enumerate` in a `for` loop, but we can expose its iterator nature by using the `next` function.

In [89]:
enum = enumerate(['a', 'b', 'c'])

Used outside of a loop context, `enumerate` doesn't actually enumerate anything. But running `next(enum)` repeatedly will produce values from the sequence until it is exhausted.

In [90]:
next(enum)

(0, 'a')

Once the sequence is exhausted, a further call to `next(enum)` raises a `StopIteration` exception. A Python `for` loop catches this exception, so normally we don't see it. But it represents the iterator's signal that it has no more values to emit.
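
Roughly speaking, a `for` loop does the following behind the scenes. This is a simplified sketch of the iteration protocol, not the interpreter's actual implementation.

```python
letters = enumerate(['a', 'b', 'c'])
seen = []

while True:
    try:
        pair = next(letters)      # ask the iterator for its next value
    except StopIteration:         # the iterator has nothing left to emit
        break
    seen.append(pair)

print(seen)
```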

An easy way to create your own iterator in Python is to write a function that uses the `yield` keyword. Such functions are called **generators**. Calling a generator function does not run its body; instead, Python returns an iterator object that runs the body step by step as values are requested. 

We could write our own version of `enumerate` as follows:

In [91]:
def my_enumerate(seq):
    i = 0
    while seq:
        yield i, seq.pop(0)
        i += 1

In [92]:
enum2 = my_enumerate(['a', 'b', 'c'])

In [93]:
next(enum2)

(0, 'a')

`my_enumerate` isn't as useful as the built-in `enumerate`; for one, our version works only on Python lists (and empties the list it is given). But the relevant point for our discussion of asynchronous programming is this. A regular Python function (one without the `yield` keyword) is "one-and-done," so to speak: it is called from a particular context, it executes its code, and then it returns the control flow to the calling context. A regular Python function demands the interpreter's undivided attention. Or from the user's perspective, calling a regular Python function is a bit like ordering food for curbside pickup: you submit your order, you wait, and then you receive your food. 

A generator, on the other hand, when called with `next` or used in a `for` loop, behaves more like a server in a sit-down restaurant, bringing your meal one dish at a time. But unlike dine-in service, generators can actually be more efficient than regular functions in many contexts. One reason for this is that generators -- or more precisely, the iterators that they become -- do not need to allocate a set amount of memory in advance. For instance, we can use them to parse a file line by line without reading the whole file into memory first. Or we can create generators capable of producing infinite sequences:

In [94]:
def to_infinity():
    i = 0
    while True:
        yield i
        i += 1

In [95]:
gen = to_infinity()
for _ in range(10):
    print(next(gen))

0
1
2
3
4
5
6
7
8
9


Our `to_infinity` function will never raise a `StopIteration` exception. It will keep producing values for as long as we keep asking for them, so it is up to the calling code to decide when to stop. 
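
Because an infinite generator never stops on its own, the caller decides how much to take. One option from the standard library is `itertools.islice`, which takes a fixed number of values from any iterator. The generator is repeated here under the made-up name `count_up` so that the snippet runs on its own.

```python
from itertools import islice

def count_up():
    i = 0
    while True:
        yield i
        i += 1

# Take only the first five values; the generator is never asked for a sixth.
first_five = list(islice(count_up(), 5))
print(first_five)
```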

Thus far we haven't seen any asynchronous behavior, but in the guise of generators, we have met Python functions that can start and stop on demand. We can also write generators that can communicate with their calling context. If you're dining in a restaurant, you generally don't have to order all your food at once. You can order an appetizer while you decide on your entree. The following generator, which is technically called a **coroutine**, we can use like a calculator to perform running sums:

In [96]:
def co_sum():
    total = 0
    while True:
        n = yield total
        if not n:
            return total
        total += n
        

Notice that the `yield` keyword appears on the right-hand side of an equals sign. The line `n = yield total` instructs the function to provide the value of `total` to the calling context, and then to "wait" until it receives a new value from that context. The calling context can provide such a value with the `.send` method (built into every generator by default). I put _wait_ in quotation marks because the key thing about a coroutine is that **while it waits for new input, it doesn't actually block the Python interpreter from doing other tasks.** It just sits there until its `.send` method is called, at which point it resumes execution where it left off, pausing again at **the next `yield` statement**. 

The only quirk is that we have to "prime" the coroutine by calling `.send(None)` before we can use it. 

In [97]:
calc = co_sum()
calc.send(None)

0

In [98]:
calc.send(1)

1

In [99]:
calc.send(10)

11

In [100]:
calc.send(14)

25

Our `co_sum` coroutine will keep adding until it encounters an error (_e.g._, it runs out of memory, or we send it a non-numeric value). We can make it quit gracefully by sending `None`, which returns the current value of `total` inside a `StopIteration` exception. 

In [101]:
calc.send(None)

StopIteration: 25
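
In a script, we would normally catch that exception rather than let it surface. The value passed to `return` inside a generator is available on the exception's `value` attribute. A minimal sketch, repeating `co_sum` so that the snippet is self-contained:

```python
def co_sum():
    total = 0
    while True:
        n = yield total
        if not n:
            return total
        total += n

calc = co_sum()
next(calc)                 # prime the coroutine; same effect as calc.send(None)
for number in (1, 10, 14):
    calc.send(number)

try:
    calc.send(None)        # ask the coroutine to finish
except StopIteration as stop:
    final_total = stop.value   # the value from `return total`

print(final_total)
```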

#### Coroutines in action: `async`, `await`, and `asyncio`

Since version 3.5, Python has provided higher-level abstractions for using coroutines with non-blocking I/O. Again, the use cases here are those where a Python program is waiting on input from an external process, such as the response from a webserver. These abstractions consist of two new keywords -- `async` and `await` -- and a module in the standard library, `asyncio`. 

`asyncio` provides an implementation of an **event loop**. There are a variety of patterns for using the event loop, but in one of the most straightforward patterns, which we'll employ below, the event loop manages a collection of coroutines, each of which has one task that involves non-blocking I/O. To modify our previous analogy, imagine a server in a busy restaurant. The server is the event loop; their tables are the coroutines. Because diners spend far more time (as a general rule) eating than they do ordering food, the server can manage many tables at once. All they need to do is keep checking with each table to see if they want to order something else (if the coroutine has a value to `yield`). 

Using `asyncio`, we don't have to write the event loop ourselves. All we have to do is supply it with a collection of coroutines to manage.
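
To make the idea concrete, here is a toy round-robin scheduler built from plain generators. It is only a sketch of the concept: the names `table` and `run_all` are invented for this example, and the real `asyncio` event loop is far more sophisticated (among other things, it waits on the operating system for I/O readiness instead of polling each coroutine in turn).

```python
from collections import deque

def table(name, courses, log):
    # Each bare `yield` hands control back to the scheduler.
    for course in courses:
        log.append(f'{name}: {course}')
        yield

def run_all(tasks):
    queue = deque(tasks)
    while queue:
        task = queue.popleft()
        try:
            next(task)             # let the task run up to its next yield
        except StopIteration:
            continue               # the task has finished; drop it
        queue.append(task)         # otherwise, send it to the back of the line

log = []
run_all([
    table('Table 1', ['soup', 'entree'], log),
    table('Table 2', ['salad'], log),
])
print(log)
```

The two "tables" take turns: neither has to finish before the other gets attention.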

Our coroutines we define by using the `async` and `await` keywords. 

Unfortunately, we can't just stick `await` in front of every Python function to make it asynchronous. Even Python functions that work with I/O -- like those in the `requests` library we used above -- cannot be used asynchronously if they were not designed to be. But there are a growing number of Python libraries for asynchronous I/O. Here we'll use the library called `aiohttp` for making asynchronous HTTP requests.

You may need to install `aiohttp` first: `!pip install aiohttp`. 

Then we import both `aiohttp` and `asyncio`.

In [102]:
import aiohttp
import asyncio

To define an asynchronous coroutine, use `async def` in place of `def`. Such a coroutine will typically include at least one `await` expression; without one, it never hands control back to the event loop while it runs.

Here our coroutine makes a simple HTTP request. The code is a bit more complex because `aiohttp` uses _context managers_ to handle opening and closing HTTP connections. The `async with` statement is an asynchronous version of the regular Python `with` statement that creates an instance of a context manager. 

Our coroutine also accepts a `client` argument, which will be an object created by the `aiohttp.ClientSession` context manager. This allows us to re-use the same connection for multiple requests.

Note that the `yield` keyword does not appear inside our async coroutine. (In Python 3.6+, it's possible to `yield` from an async coroutine, but it's not required.) Here `await` does the work of pausing the coroutine until it receives a value from "outside," as it were. An important difference between `await` and `yield` is that **we** are **not** sending the value (as we did above with `calc.send()`). Rather, the value is coming from the special asynchronous `text()` method on our `aiohttp` object. 

You can only `await` other async coroutines (and some other specialized Python objects called Futures and Tasks). 

The `async` keyword in front of the `def` and the `with` keywords is important. Without them, the code will either throw an exception or not work as intended.

In [103]:
async def async_fn(i, client):
    async with client.get(config['test_url']) as session:
        resp = await session.text()
        print(f'Loop {i}; URL status: {session.status}; Timestamp: {datetime.now().time()}')

Typically, we initialize our collection of coroutines inside another asynchronous function (coroutine). We'll call this one `main`.

1. `main` invokes the `aiohttp.ClientSession` context manager, creating an instance of a client that we can re-use across all our requests. 
2. Then it creates a collection of `async_fn` coroutines, initializing each with a new value between 0 and 4 and with the client created above.
3. Next we pass this collection to `asyncio.gather` and `await` it. `gather` will execute the coroutines concurrently, ensuring that their results (if any) are accumulated and arranged according to the order in which they were submitted.

The `await` keyword before `asyncio.gather` is important. Our `main` coroutine, like our `async_fn` coroutines, will run _inside_ the event loop. This seems counterintuitive, since `main` manages other coroutines. But it's actually more of a helper; there is a different function that kicks off the event loop itself, which we'll see below.

In [104]:
async def main():
    async with aiohttp.ClientSession() as client:
        awaitables = [async_fn(i, client) for i in range(5)] # Here async_fn(i, client) doesn't execute -- it only initializes the coroutine
        await asyncio.gather(*awaitables)  # The asterisk before the variable unpacks the list -- gather() expects one or more coroutines but not a list of them

In Python 3.7+, we would write -- from some **non** async function or from the main part of our script -- the following to call the `main` coroutine:

`asyncio.run(main())`

- This **blocking** command populates the event loop with the coroutine `main`. 
- The `main` coroutine then adds each initialized instance of our `async_fn` coroutine (via the `awaitables` variable) to the event loop. 
- **Each instance of `async_fn` makes an HTTP request then cedes control back to the event loop.** 
- **The event loop checks each coroutine to see if it has received a response.**
- Once **all** of the `async_fn` coroutines have finished -- either by returning or raising an exception -- the `main` coroutine will return.
- At this point, execution will resume after the call to `asyncio.run`. 
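
Put together as a standalone script, the pattern might look like the following. To keep the sketch runnable without a network connection, `asyncio.sleep` stands in for the HTTP request (an assumption for illustration; `fake_fetch` is a made-up name).

```python
import asyncio
import time

async def fake_fetch(i):
    await asyncio.sleep(0.2)      # non-blocking pause; control returns to the event loop
    return i

async def main():
    awaitables = [fake_fetch(i) for i in range(5)]
    return await asyncio.gather(*awaitables)

start = time.perf_counter()
results = asyncio.run(main())     # note the parentheses: we pass a coroutine object
elapsed = time.perf_counter() - start

print(results, f'{elapsed:.2f} seconds')
```

Five 0.2-second waits finish in roughly 0.2 seconds overall, not 1 second, because the waits overlap.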

In a Jupyter Notebook, however, we are already inside an event loop. So we can't call `asyncio.run` without getting an error. Fortunately, we can just write `await main()` to get the same behavior.

In [105]:
await main()

Loop 0; URL status: 200; Timestamp: 09:16:08.227113
Loop 2; URL status: 200; Timestamp: 09:16:08.228655
Loop 1; URL status: 200; Timestamp: 09:16:08.229905
Loop 4; URL status: 200; Timestamp: 09:16:09.132163
Loop 3; URL status: 200; Timestamp: 09:16:09.133798


Comparing this output with our synchronous loop above, notice the following:
- The timestamps fall much closer together than in the synchronous loop (here, all five within about one second), even though the server still took at least 1 second to respond to each request. The requests were made concurrently, so the **total** time to send the requests and receive the responses should be little more than 1 second.
- `asyncio.gather` guarantees that it will **return results** from coroutines in the order that they were passed to it. In this case, however, the output on screen is from a `print` statement inside each coroutine, showing that the coroutines do not necessarily **complete** in the same order. 
- That's the key to asynchronous programming: it eases the requirement that every operation occur in the sequence stipulated by the programmer. 
- And in exchange for some loss of synchronicity, we get significant gains in performance.
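
The difference between completion order and result order can be reproduced without a network. In this sketch, written for a plain script, `asyncio.sleep` with made-up delays stands in for requests of varying latency.

```python
import asyncio

completed = []

async def task(name, delay):
    await asyncio.sleep(delay)
    completed.append(name)     # records the order in which tasks finish
    return name

async def main():
    return await asyncio.gather(task('a', 0.3), task('b', 0.1), task('c', 0.2))

returned = asyncio.run(main())
print('completed:', completed)
print('returned: ', returned)
```

The tasks finish shortest-delay first, but `gather` still hands back the results in the order the coroutines were passed in.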

### Using the Alma API's asynchronously

The preceding tour of iterators, generators, and coroutines was intended to provide some conceptual grounding for a grasp of how Python's `async` coroutines work. To summarize, the key points are as follows:

1. **Coroutines** are Python functions that can suspend their operation while waiting for more input. While they are paused, the Python interpreter can do other work.
2. `async` coroutines, which typically handle input from I/O processes, are managed by the `asyncio` **event loop**. The event loop is responsible for resuming the coroutines based on the availability of their input from external processes (like an HTTP response). 
3. The event loop allows us to achieve **concurrency** in our I/O requests. This is not the same as true parallelism, but more like highly efficient multitasking. Since the processes involved typically don't require much, if any, of the Python interpreter's resources -- and may not even involve many system resources -- asynchronous programming in Python is essentially a way to **occupy the idle time** that the interpreter would otherwise spend waiting for responses from elsewhere.

The **main challenge** in writing asynchronous Python is adapting to a different way of programming, one in which the path from input to output is less straightforward (and in some sense, less predictable).

The Alma API's impose a **rate limit** of 25 requests per second. With synchronous approaches, that's typically not a problem, because the API generally doesn't respond quickly enough for us to be able to make 25 requests _in sequence_ in under a second. But _concurrent_ requests are different. We can pass 100, 1,000, or 100,000 `async` coroutines to the event loop, and if each coroutine makes one request, the event loop will issue those requests immediately, the only constraint being how fast the hardware on our end can handle it. As a result, we can easily exceed the rate limit if we don't **throttle** the requests somehow.
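
The idea behind throttling can be sketched with the standard library alone. The `SimpleThrottler` class below is a hypothetical illustration written for this example, not the implementation used by any particular library. It allows at most `rate_limit` coroutines to proceed within any window of `period` seconds, and the small numbers (5 per half second) are chosen only to keep the demonstration short.

```python
import asyncio
import time
from collections import deque

class SimpleThrottler:
    """Allow at most `rate_limit` entries per `period` seconds (illustrative only)."""

    def __init__(self, rate_limit, period=1.0):
        self.rate_limit = rate_limit
        self.period = period
        self._starts = deque()           # timestamps of recent entries

    async def __aenter__(self):
        while True:
            now = time.monotonic()
            # Forget entries that have aged out of the window.
            while self._starts and now - self._starts[0] >= self.period:
                self._starts.popleft()
            if len(self._starts) < self.rate_limit:
                self._starts.append(now)
                return self
            await asyncio.sleep(0.01)    # yield to the event loop, then re-check

    async def __aexit__(self, exc_type, exc, tb):
        return False

async def worker(throttler, stamps):
    async with throttler:
        stamps.append(time.monotonic())

async def main():
    throttler = SimpleThrottler(rate_limit=5, period=0.5)
    stamps = []
    await asyncio.gather(*(worker(throttler, stamps) for _ in range(10)))
    return stamps

stamps = asyncio.run(main())
print(f'fifth after {stamps[4] - stamps[0]:.2f}s; sixth after {stamps[5] - stamps[0]:.2f}s')
```

The first five workers proceed at once; the remaining five wait until the half-second window has passed.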

We'll use a tiny Python library called `asyncio-throttle` to do that, which you can install by running
```
!pip install asyncio-throttle
```
in your notebook.

In [106]:
from asyncio_throttle import Throttler

#### Writing an `async` coroutine to make API requests

The following function will make a single GET request. 

1. The function accepts the following arguments:
  - An instance of the `aiohttp.ClientSession` class. This allows us to reuse the same connection across requests.
  - An instance of the `Throttler` class from `asyncio_throttle`. 
  - A url string. The URL in this case will be formatted for retrieving a specific item from the Bibs API.
  - A Python dictionary called `headers`. This will contain the header information required by the API.


2. The function will return either
  - a valid response from the API, which we expect to be in JSON format,
  - or an error.
  - Both types of return values will be Python dictionaries.


3. Error handling with asynchronous programming can be challenging. 
  - If a coroutine raises an exception, the `asyncio` event loop will allow the rest to continue execution. That's helpful, because in a case like ours, we wouldn't want one API error -- which might be caused by a bad identifier -- to cause the whole batch to fail. 
  - However, it's important to keep track of which API calls **did** fail and why. In our function, we use `try...except` blocks to catch exceptions, and we package errors in Python dictionaries, including, where possible, the API response, which may include a useful message. 


4. Note the nested `async with` statements:
  - `async with throttler` applies the throttler to our request. This essentially keeps our coroutine in a queue until it's time to be executed (at a rate of no more than 25 per second).
  - `async with client.get(...) as session`: This context manager creates a single request session, closing it out when we exit the block. This ensures that we can reuse the same client between requests effectively.


5. There are two `await` statements here.
  - The first, which is executed in the case of an HTTP error, gets the HTTP response body as a string and assigns it to the `resp` variable.
  - The second, executed in the case of a successful HTTP request, parses the HTTP response body as JSON.
  
  
6. Our function uses the `aiohttp.ClientSession.get` method, but with a couple of tweaks we could make this function handle POST requests instead. 
  - Replace the above method call with one to `aiohttp.ClientSession.post`.
  - Accept as an argument some data to be POST-ed, and include this in the method call as a `data` keyword argument.

In [108]:
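async def get_item(client, throttler, url, headers):
    # Wait for the throttler before sending the request
    async with throttler:
        try:
            resp = None
            async with client.get(url, headers=headers) as session:
                if session.status != 200:
                    # Capture the response body (it may contain a useful message) before raising
                    resp = await session.text()
                    session.raise_for_status()
                # Parse a successful response as JSON
                resp = await session.json()
                return {'url': url, 'response': resp}
        # Package any error, with the response body if we have it, instead of raising
        except Exception as e:
            return {'error': e, 'message': resp}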
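
The POST variant described in point 6 might look like the following sketch. The name `post_item` and the extra `data` parameter are assumptions made for illustration; the structure otherwise mirrors `get_item`. It has not been run against the Alma API.

```python
async def post_item(client, throttler, url, headers, data):
    # Same structure as get_item, but sends `data` as the body of a POST request.
    async with throttler:
        resp = None
        try:
            async with client.post(url, headers=headers, data=data) as session:
                if session.status != 200:
                    resp = await session.text()
                    session.raise_for_status()
                resp = await session.json()
                return {'url': url, 'response': resp}
        except Exception as e:
            return {'error': e, 'message': resp}
```

Here `data` would typically be a string or bytes payload whose format matches the `Content-Type` in `headers`. For a JSON body, `aiohttp` also accepts a `json` keyword argument in place of `data`.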

#### Creating the API URL's

We used `pandas` to load the identifiers for our API requests from a CSV file. Let's look at how we can extract the data we need from our `DataFrame`.

Our API endpoint looks like this:
```
almaws/v1/bibs/{mms_id}/holdings/{holding_id}/items/{item_id}
```

So we will need the MMS, Holdings, and Item ID numbers. 

- A `DataFrame` has an `.itertuples()` method that is an iterator; it produces a Python [named tuple](https://docs.python.org/3/library/collections.html#collections.namedtuple) from each row of the DataFrame.
- Provided the DataFrame's column labels are valid Python identifiers (no spaces or special characters, must start with a letter of the alphabet), we can convert the tuple to a Python dictionary by calling its `_asdict()` method.
- Finally, we can use Python's `str.format()` [method](https://docs.python.org/3/library/stdtypes.html#str.format) to replace the placeholders in the URL with the appropriate values from each row. `str.format` accepts optional keyword arguments and uses them to fill any matching placeholder keys (the parts of the string between curly braces).
- By passing our row-dictionary to `str.format` with the double-asterisk prefix, as in `url.format(**row_dict)`, we can unpack it into keyword arguments. **Provided our column names for MMS, Holdings, and Item ID match the keys in the URL string**, `str.format` will replace those keys with the values we need.
- `str.format` will ignore any keyword arguments that don't match the string, so it doesn't matter that our row-dictionary contains more columns than there are keys in the string.
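
The steps above can be tried without `pandas`, using a plain named tuple as a stand-in for one row from `.itertuples()`. This is only an illustrative sketch: the `Row` type, the extra `title` field, and the `barcode` placeholder are assumptions, and the identifiers are taken from the first URL in the output further down.

```python
from collections import namedtuple

# Stand-in for one row produced by DataFrame.itertuples(); the field names
# mirror the placeholders in the URL, and `title` is an extra, unused column.
Row = namedtuple('Row', ['mms_id', 'holding_id', 'item_id', 'title'])
row = Row('9922645213604107', '22401585210004107', '23401585190004107', 'Example title')

url_str = 'almaws/v1/bibs/{mms_id}/holdings/{holding_id}/items/{item_id}'

# _asdict() turns the named tuple into a dictionary keyed by field name;
# ** unpacks that dictionary into keyword arguments for str.format.
url = url_str.format(**row._asdict())
print(url)

# The extra `title` key is silently ignored, but a placeholder with no
# matching key raises a KeyError.
try:
    '{mms_id}/{barcode}'.format(**row._asdict())
except KeyError as e:
    missing = e.args[0]
    print(f'Missing key: {missing}')
```

The `KeyError` case is worth keeping in mind: if a column in the CSV is named differently from the placeholder in the URL, formatting fails on the first row rather than producing a malformed URL.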

In [117]:
def format_urls(url_str, df):
    for row in df.itertuples(index=False):
        yield url_str.format(**row._asdict())

In [118]:
[url for url in format_urls(config['get_item_url'], art_books.iloc[:50])]

['https://api-na.hosted.exlibrisgroup.com/almaws/v1/bibs/9922645213604107/holdings/22401585210004107/items/23401585190004107',
 'https://api-na.hosted.exlibrisgroup.com/almaws/v1/bibs/9925377663604107/holdings/22401724220004107/items/23401724200004107',
 'https://api-na.hosted.exlibrisgroup.com/almaws/v1/bibs/9923212873604107/holdings/22401782970004107/items/23401782910004107',
 'https://api-na.hosted.exlibrisgroup.com/almaws/v1/bibs/9923212873604107/holdings/22401782970004107/items/23401782920004107',
 'https://api-na.hosted.exlibrisgroup.com/almaws/v1/bibs/9923212873604107/holdings/22401782970004107/items/23401782930004107',
 'https://api-na.hosted.exlibrisgroup.com/almaws/v1/bibs/9923212853604107/holdings/22401801900004107/items/23401801880004107',
 'https://api-na.hosted.exlibrisgroup.com/almaws/v1/bibs/9923212833604107/holdings/22401816810004107/items/23401816790004107',
 'https://api-na.hosted.exlibrisgroup.com/almaws/v1/bibs/9931940353604107/holdings/22401998400004107/items/2340

We also need a header that contains our API key and one that instructs the API to return JSON. 

In [122]:
headers = {'Authorization': f'apikey {config["api_key"]}',
           'Accept': 'application/json'}

#### Putting it all together

Finally, we make an `async` coroutine to create our concurrent requests. 

This coroutine accepts the following arguments:
- A `DataFrame` (`data`) in which each row contains a set of identifiers we want to pass to the Bibs API.
- A Python dictionary containing the API headers (`headers`).
- A complete URL for an API endpoint (`base_url`), with placeholders suitable for formatting with the identifiers in our dataset.


It does the following:
- Creates an instance of the `asyncio_throttle.Throttler` class with a fixed rate limit (here, 25 requests per second).
- Creates an `aiohttp.ClientSession` instance via context manager.
- Initializes a list of `get_item` coroutines with the formatted URLs.
- Accumulates the results from those coroutines via `asyncio.gather`.
- Returns the results.


This is the coroutine we will pass to `asyncio.run` in order to kick off the event loop.

In [127]:
async def make_requests(data, headers, base_url):
    throttler = Throttler(rate_limit=25)
    async with aiohttp.ClientSession() as client:
        awaitables = [get_item(client=client,
                               throttler=throttler,
                               headers=headers,
                               url=url) for url in format_urls(base_url, data)]
        results = await asyncio.gather(*awaitables)
    return results

And we can launch our concurrent requests as follows. Inside a Jupyter notebook, we can `await` the coroutine directly. If you are running this outside of a notebook (and assuming you're using Python 3.7+), you will need to write:
```
results = asyncio.run(make_requests(art_books, headers, config['get_item_url']))
```
The syntax for Python 3.5 and 3.6 is a little different. It should be
```
loop = asyncio.get_event_loop()
results = loop.run_until_complete(make_requests(art_books, headers, config['get_item_url']))
```

In [128]:
results = await make_requests(art_books, headers, config['get_item_url'])

`results` should be a list of objects returned by the API, along with any errors. Its length should equal that of our original dataset.

In [131]:
assert len(results) == len(art_books)
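
The length check holds because `asyncio.gather` returns one result per awaitable, in the order the awaitables were passed in, regardless of the order in which they finish. The sketch below illustrates this with simulated requests. `fake_request` and its random delays are stand-ins rather than calls to the Alma API, and the dictionaries it returns only imitate the shape used by `get_item`. It uses `asyncio.run`, so run it as a script, or replace that call with `await main(8)` in a notebook.

```python
import asyncio
import random

async def fake_request(i):
    # Simulated request: sleep for a random interval so that the coroutines
    # finish in an unpredictable order.
    await asyncio.sleep(random.uniform(0, 0.05))
    if i % 4 == 0:
        # Imitate a failed request by returning an error dictionary.
        return {'error': ValueError('simulated failure'), 'message': f'item {i}'}
    return {'url': f'item {i}', 'response': i}

async def main(n):
    return await asyncio.gather(*[fake_request(i) for i in range(n)])

results_demo = asyncio.run(main(8))

# One result per input, in input order, whatever order the requests finished in.
labels = [r['url'] if 'url' in r else r['message'] for r in results_demo]
print(labels)
```

Because errors are returned as dictionaries instead of being raised, a failed request still occupies its slot in the list, which is why the lengths match even when some requests fail.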

We can identify errors by looking for any objects within results that have the `error` key.

In [134]:
[r['message'] for r in results if 'error' in r]

['{"errorsExist":true,"errorList":{"error":[{"errorCode":"401683","errorMessage":"No Item found for mmsId 9927473753604107, holdings id 22406647770004107 and item id 23406647760004107.","trackingId":"E01-2309145237-4ZFZJ-AWAE392162647"}]},"result":null}',
 '{"errorsExist":true,"errorList":{"error":[{"errorCode":"401683","errorMessage":"No Item found for mmsId 9924212553604107, holdings id 22408788330004107 and item id 23408788310004107.","trackingId":"E01-2309145313-KWZBF-AWAE582686539"}]},"result":null}',
 '{"errorsExist":true,"errorList":{"error":[{"errorCode":"401683","errorMessage":"No Item found for mmsId 99139479343604107, holdings id 22425722010004107 and item id 23425721990004107.","trackingId":"E01-2309145244-XEQYM-AWAE392162647"}]},"result":null}',
 '{"errorsExist":true,"errorList":{"error":[{"errorCode":"401683","errorMessage":"No Item found for mmsId 99139476363604107, holdings id 22425753010004107 and item id 23425752980004107.","trackingId":"E01-2309145244-TNBR3-AWAE39216
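
Each error message above is the raw JSON body returned by the API, so it can be parsed to pull out just the error code and text. The following is a sketch under that assumption; `summarize_error` is a hypothetical helper, and the sample string is the first message from the output above.

```python
import json

# First error message from the output above, copied verbatim.
message = ('{"errorsExist":true,"errorList":{"error":[{"errorCode":"401683",'
           '"errorMessage":"No Item found for mmsId 9927473753604107, holdings id '
           '22406647770004107 and item id 23406647760004107.",'
           '"trackingId":"E01-2309145237-4ZFZJ-AWAE392162647"}]},"result":null}')

def summarize_error(message):
    """Return a list of (errorCode, errorMessage) pairs from an error body.

    Falls back to the raw text if the body is missing or is not the expected JSON.
    """
    try:
        errors = json.loads(message)['errorList']['error']
    except (TypeError, ValueError, KeyError):
        return [(None, message)]
    return [(err.get('errorCode'), err.get('errorMessage')) for err in errors]

print(summarize_error(message))
```

The fallback matters because `message` will be `None` when a request fails before any response body is read, for example on a connection error.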

Alternatively, we may want to mark the rows in our original list that we have successfully completed.

First, we create a list of unique Item identifiers in our results set (filtering out any errors).

In [140]:
items = [r['response']['item_data']['pid'] for r in results if 'error' not in r]

Then we can use `pandas` functionality to add a column with values based on a Boolean condition: whether the value in the `item_id` column is in our `items` list.

**Note** that our `item_id` column was imported as an integer value, but `items` is a list of strings. So we need to do an explicit type cast in order for the test to work. `DataFrame['item_id'].astype(str)` converts the values in that column to strings.

Then we can use the `.isin` method to check for membership in a list. This will return a Boolean `Series` of `True`/`False` values aligned with the original column.

In [150]:
art_books['item_id'] = art_books['item_id'].astype(str)
art_books['completed'] = art_books['item_id'].isin(items)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  art_books['item_id'] = art_books['item_id'].astype(str)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  art_books['completed'] = art_books['item_id'].isin(items)


If that code produces a `SettingWithCopyWarning`, it's safe to ignore it in this case. We can check to make sure that our flag works by comparing the subset of values in the `completed` column that are `False` with the list of error messages we received.

In [151]:
len(art_books.loc[art_books.completed == False]) == len([r['message'] for r in results if 'error' in r])

True

And then we can save our flagged dataset back to the disk as CSV.

In [152]:
art_books.to_csv('art_books_completed.csv', index=False)