# Accessing data on the web through APIs

## Gender of names, country information, hate speech categorization

by Koenraad De Smedt at UiB

---
An *Application Programming Interface* (API), in the sense used here, is a web service that accepts HTTP/HTTPS requests and sends a response. If the request is valid, the response has a successful status code and the program can extract data from the response. Many web APIs provide data in the JSON format, which can easily be converted to a Python *dict*.

This notebook shows how to:

1.  Access a remote API with *get* and *post* requests and parameters
2.  Get a dict from the JSON data in an API response
3.  Select information from parts of the dict
4.  Transform a dict to another shape
5.  Convert a dict to a *pandas* series
6.  Make a barplot of a series
7.  Use a personal token in Colab Secrets

**Warning**: The external websites in these examples are used for illustration only. They are regularly updated, so the responses may differ from those obtained earlier. The APIs themselves may also change.

---

In [None]:
import requests
import pandas as pd

---
## Gender of first names

The following example accesses an API that provides the most likely gender of a first name. This could be useful, for instance, in social media analysis or in choosing pronouns. Here we ask for the gender of *Alexa*.


In [None]:
requests.get('https://api.genderize.io', params={'name':'Alexa'}).json()['gender']
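
The `params` dict is not sent separately: `requests` encodes it into the URL as a query string. Here is a minimal sketch of that encoding, using the standard library's `urllib.parse.urlencode` rather than `requests` itself. The name *Åse* and the variable names are only illustrative; they show how non-ASCII characters are percent-encoded.

```python
from urllib.parse import urlencode

base_url = 'https://api.genderize.io'
example_params = {'name': 'Åse'}   # illustrative name with a non-ASCII letter

# urlencode turns the dict into 'key=value' pairs joined by '&'
# and percent-encodes characters that are not allowed in a URL
query = urlencode(example_params)
full_url = f'{base_url}?{query}'
print(full_url)
```

Pasting a URL of this shape into a browser is a quick way to inspect what an API returns.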

Let us break this example down in steps. First, send a *get* request and provide the parameters as a *dict*.

If the `status_code` of the response is 200, the request was successful and a valid response is obtained.

In [None]:
response = requests.get('https://api.genderize.io', params={'name':'Alexa'})
print(response)
print(response.status_code)
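
Before decoding a response, it can help to check the status code explicitly. The following sketch treats any 2xx code as success. `FakeResponse` is a hypothetical stand-in, used only so the example runs without a network connection; `json_if_ok` is likewise an illustrative name.

```python
class FakeResponse:
    """Hypothetical stand-in with the two members used below."""

    def __init__(self, status_code, payload):
        self.status_code = status_code
        self._payload = payload

    def json(self):
        return self._payload


def json_if_ok(response):
    """Return the decoded JSON for a 2xx status code, otherwise None."""
    if 200 <= response.status_code < 300:
        return response.json()
    return None


print(json_if_ok(FakeResponse(200, {'gender': 'female'})))
print(json_if_ok(FakeResponse(404, {'error': 'not found'})))
```

With a real response from `requests.get`, calling `response.raise_for_status()` is an alternative: it raises an exception for 4xx and 5xx status codes.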

The body of the response is a JSON object containing data. By calling the method `.json()` we decode this object into a dict. It contains various pieces of information: the count, the name itself, the most likely gender, and its probability.

In [None]:
data = response.json()
print(data)
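
What `.json()` does can be imitated with the standard library: the response body is text, and `json.loads` parses that text into a dict. The body below is a made-up example with the four keys mentioned above, not an actual API result.

```python
import json

# Illustrative response body; the values are made up
body = '{"count": 1000, "name": "Alexa", "gender": "female", "probability": 0.99}'

data_example = json.loads(body)      # JSON object -> Python dict
print(type(data_example).__name__)   # the type of the decoded object
print(sorted(data_example))          # the keys, in alphabetical order
```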

Some or all of the information in this dict can be used for further processing. Here we use only `gender`.

In [None]:
print(data['gender'])
print(data['name'], 'is', data['gender'])
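
Not every name will be known to the service. Assuming that the API then returns a JSON `null` for `gender` (which becomes `None` in Python), a small guard avoids printing "is None". The function name `describe_gender` and the sample dicts are illustrative.

```python
def describe_gender(data):
    """Build a short sentence from a dict with 'name' and 'gender' keys."""
    if data.get('gender') is None:
        return f"{data['name']}: gender unknown"
    return f"{data['name']} is {data['gender']}"


print(describe_gender({'name': 'Alexa', 'gender': 'female'}))
print(describe_gender({'name': 'Qwrtz', 'gender': None}))
```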

### Exercise 1

Define a function `print_gender` with one argument, a name. The function should print the gender and its probability, as in the following example.

```
>>> print_gender('Alexa')
Alexa is female with probability 0.99
```

Then change your function definition so that the probability is printed as a percentage.

```
>>> print_gender('Alexa')
Alexa is female (99%)
```

### Exercise 2

According to the [documentation](https://genderize.io/), this API accepts an optional extra parameter `country_id` which should be a [two-letter country code](https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2). Extend your function `print_gender` with an optional parameter for the country. If the country is given, it is passed as a parameter to the API and included in the output as well. Example:

```
>>> print_gender('Kim', 'KR')
Kim is male in KR (80%)
>>> print_gender('Kim')
Kim is female (70%)
```



---
## Countries

Another API is that of https://restcountries.com/ which returns information about countries. This JSON response has a more complicated structure than the one above.

Check out this example in a browser window: https://restcountries.com/v3.1/alpha?codes=be which returns a lot of information about Belgium. Observe that the result appears as a list containing a dict.

In [None]:
countries_url = 'https://restcountries.com/v3.1/alpha'


Let's say that we are only interested in the country name and the population. These pieces of information can be 'mined' from the API result. So we make a function that takes a country code as its argument and returns two values: the common country name (in English) and the population.

In [None]:
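def get_country_info(country_code):
  """Return the common name and population for a two-letter country code."""
  info = requests.get(countries_url, params={'codes':country_code}).json()
  #print(info)  # uncomment to inspect the full response
  # The result is a list containing a dict; take the first element
  return info[0]['name']['common'], info[0]['population']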

Test.

In [None]:
get_country_info('be')

Print example.

In [None]:
country_name, population = get_country_info('be')
print(f'{country_name} (pop. {population})')
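
Because the result is a list of dicts, the same extraction can be written as a loop over the list. The sketch below works on a hand-made sample instead of a live response. It also assumes, without having verified it, that an error response may not be a list, in which case it returns an empty list. The function name and the Norwegian figure are illustrative.

```python
def summarize_countries(info):
    """Return (common name, population) pairs from a decoded API result."""
    if not isinstance(info, list):
        return []   # assumption: anything that is not a list is an error response
    return [(c['name']['common'], c['population']) for c in info]


# Trimmed, hand-made sample in the shape described above
sample = [
    {'name': {'common': 'Belgium'}, 'population': 11555997},
    {'name': {'common': 'Norway'}, 'population': 5379475},   # illustrative figure
]

for name, population in summarize_countries(sample):
    print(f'{name} (pop. {population})')
```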

### Exercise 3

Extend the function `print_gender` from the previous exercise so that it prints the common name and population of the country instead of the percentage. Example:

```
>>> print_gender('Kim', 'BE')
Kim is female in Belgium (pop. 11555997)
```

As a slightly more complex variant, also print the number of people recorded with that name in the given country.

```
>>> print_gender('Kim', 'BE')
Kim is female in Belgium (3373 recorded out of pop. 11555997)
```

## Hate speech (optional)

[Hugging Face has a classifier for hate speech](https://huggingface.co/IMSyPP/hate_speech_en). The URL for the API is as follows.

In [None]:
hs_url = 'https://api-inference.huggingface.co/models/IMSyPP/hate_speech_en'

Text sent to the API receives a score for each of four classes. Following the documentation, make a dict that translates the class labels into more understandable names.

In [None]:
hs_labels = {'LABEL_0':'acceptable', 'LABEL_1':'inappropriate',
             'LABEL_2':'offensive', 'LABEL_3':'violent'}

Hugging Face is unfortunately not fully open, but requires a user account. Register for an account, then [apply for an access token](https://huggingface.co/settings/tokens) that you can use to identify yourself in API calls. Save your personal token in your Colab Secrets with name *HuggingFace* (🗝️ in the Colab left margin).

Now make headers containing the token.

In [None]:
from google.colab import userdata
headers = {'Authorization': 'Bearer ' + userdata.get('HuggingFace')}
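
Outside Colab, `google.colab.userdata` is not available. One possible alternative, sketched here, is to read the token from an environment variable. The variable name `HF_TOKEN`, the placeholder default and the function name `make_headers` are assumptions made for this example.

```python
import os


def make_headers(token):
    """Build the Authorization header for a bearer token."""
    return {'Authorization': 'Bearer ' + token}


# HF_TOKEN is a hypothetical variable name; the default is only a placeholder
token = os.environ.get('HF_TOKEN', 'missing-token')
example_headers = make_headers(token)
print(list(example_headers))   # print the header names only, never the token
```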

Define a function that sends an input to the API and returns the response. This time we use `requests.post` because we will send data to the server.

In [None]:
def hs_query(text):
  response = requests.post(hs_url, headers=headers, json={'inputs': text})
  return response.json()

Test. Be aware that the system may not always be active. If you get an error saying that the system is loading, try again about 20 seconds later.

In [None]:
response = hs_query(input('Type a line of text: '))
response
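
Waiting and trying again can also be automated. The sketch below assumes that a failed call returns a dict containing an `'error'` key, while a successful call returns a list. `query_with_retry` and `fake_query` are hypothetical names; `fake_query` simulates one failed attempt so that the example runs without a network connection.

```python
import time


def query_with_retry(query, text, retries=3, wait=20):
    """Call query(text) until the result is not an error dict, at most `retries` times."""
    result = None
    for attempt in range(retries):
        result = query(text)
        if not (isinstance(result, dict) and 'error' in result):
            return result
        if attempt < retries - 1:
            time.sleep(wait)   # give the model time to load
    return result              # still an error after the last attempt


calls = []

def fake_query(text):
    """Simulated API: the first call fails, the second succeeds."""
    calls.append(text)
    if len(calls) < 2:
        return {'error': 'model is loading'}
    return [[{'label': 'LABEL_0', 'score': 0.9}]]


print(query_with_retry(fake_query, 'hello', wait=0))
```

In the notebook, the same helper could be called as `query_with_retry(hs_query, text)`.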

The response is a list containing a list with four dicts. That is more nesting than we need, so let's flatten it into a simple dict that maps each class name to its score.

In [None]:
hs_dict = {hs_labels[d['label']]:d['score'] for d in response[0]}
hs_dict
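
To see the reshaping without calling the API, the same comprehension can be applied to a hand-made response. The scores below are made up, and the names `labels`, `sample_response`, `scores` and `top_class` are illustrative. The last step picks the class with the highest score.

```python
labels = {'LABEL_0': 'acceptable', 'LABEL_1': 'inappropriate',
          'LABEL_2': 'offensive', 'LABEL_3': 'violent'}

# Illustrative response: a list containing a list of four dicts
sample_response = [[{'label': 'LABEL_0', 'score': 0.90},
                    {'label': 'LABEL_1', 'score': 0.06},
                    {'label': 'LABEL_2', 'score': 0.03},
                    {'label': 'LABEL_3', 'score': 0.01}]]

# Flatten to {class name: score}
scores = {labels[d['label']]: d['score'] for d in sample_response[0]}

# The key whose value is largest
top_class = max(scores, key=scores.get)
print(scores)
print(top_class)
```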

Transform the dict to a series.

In [None]:
hs_series = pd.Series(hs_dict, name='Hate speech classes')
hs_series

Plot the series. Optionally specify rotation of labels.

In [None]:
hs_series.plot.bar(rot=0)
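
A series can also be sorted or queried before plotting. This sketch uses made-up scores, so it does not depend on an API call; the names `example_series` and `ranked` are illustrative.

```python
import pandas as pd

# Made-up scores for illustration
example_series = pd.Series(
    {'acceptable': 0.10, 'inappropriate': 0.25, 'offensive': 0.60, 'violent': 0.05},
    name='Hate speech classes')

# Highest score first
ranked = example_series.sort_values(ascending=False)
print(ranked)

# Label of the highest-scoring class
print(example_series.idxmax())
```

Plotting `ranked` instead of the unsorted series puts the bars in descending order.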

### Exercise 4

Get your own access token at Hugging Face and try to classify a brief text for hate speech.


---

### Exercise 5

(optional) This is a slightly more complicated project for those who want to try some more APIs. There is a Digital Humanities Course Registry (a cooperation between CLARIN and DARIAH) which has an [API](https://dhcr.clarin-dariah.eu/api/v2/). Write a program to get information from the API. For instance, define a function which retrieves all courses given in certain languages, together with their institutions, such as the following:

```
>>> find_courses_lang(['Norwegian','Swedish','Danish'])
Masterprogram i digital kultur - Universitetet i Bergen
Digitala Humaniora - Åbo Akademi University
IT and Cognition - Copenhagen University
```

Other possible exercises are plotting the number of courses by language, country, institution, discipline, etc.

### Exercise 6

Try the [API for digital text analysis at the National Library of Norway](https://www.nb.no/dh-lab/digital-tekstanalyse/). This resource is in Norwegian only and is currently being revised. The tutorial currently does not work in Colab, but it can be run in Binder after first installing the necessary modules with `!pip install dhlab`.
