# Tutorial: Get all 2024 AMJ articles using API calls

## Step 1. Familiarize yourself with the API

We are going to use the Crossref API to get all articles published in 
2024 by the *Academy of Management Journal*. 

The first step is to familiarize yourself with the API. So take a look 
at the following two pages first:

- [Page 1: A non-technical introduction to the API](https://www.crossref.org/documentation/retrieve-metadata/rest-api/a-non-technical-introduction-to-our-api/)
- [Page 2: The API documentation](https://api.crossref.org/swagger-ui/index.html) - This page is a bit technical, but it is the official documentation of the API.

## Step 2. Our first API calls

On the non-technical introduction page, you were provided with 
a few example API calls to execute in your browser. Doing it
manually in the browser isn't very helpful for us if we're trying 
to automate the process.

Let's use Python to make some of these API calls instead.

*Make sure to replace your_email@ucf.edu with your actual email address
in the cell below.*

In [None]:
# Politeness matters - We want to be polite, so we send them
# our email with the requests so they know who to contact with
# concerns if our code goes wild. See their API documentation
# for more information.

your_email = "your_email@ucf.edu"
if "your_email" in your_email:
    print(
        "Nope, can't continue until you replace 'your_email@ucf.edu' with your email address in the code cell"
    )
else:
    print("Good to go!")


Let's try that first API call from the non-technical introduction page:

> How many accounts do we have? (This includes members and others, both inactive and active) `https://api.crossref.org/members?rows=0`

In [None]:
from curl_cffi import requests

api_url = "https://api.crossref.org/members?rows=0"
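# Crossref documents a `mailto` query parameter (or a User-Agent header) for
# identifying yourself; passing it via `params` adds it to the URL for us.
api_response = requests.get(api_url, params={"mailto": your_email})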
print("The Crossref server responded with status code: ", api_response.status_code)

- What was the status code?
- What does this mean? If you don't remember, go back to the video on web scraping.
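
If you would rather have Python name the code for you, the standard library's `http.HTTPStatus` maps numbers to their official phrases. Here is a small sketch; `describe_status` is a helper name made up for this example, not something the tutorial or Crossref provides.

```python
from http import HTTPStatus


def describe_status(code):
    """Return the status code followed by its standard phrase."""
    try:
        return f"{code} {HTTPStatus(code).phrase}"
    except ValueError:
        # HTTPStatus raises ValueError for numbers it does not know.
        return f"{code} (not a standard status code)"


for code in (200, 404, 429):
    print(describe_status(code))
```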

In [None]:
# Print the contents of the response

print("Here is the content of what the server responded with:\n")
print(api_response.content)

Probably pretty difficult to read, right? That's because it's raw JSON.
The `.content` attribute of the response object contains the raw bytes that were 
returned from the server. But JSON data is common enough that there is a 
built-in `.json()` method that will parse the data for us.

Let's try it!

In [None]:
# Print the JSON content of the response
from pprint import pprint

print("Here is the JSON content of what the server responded with:\n")
pprint(api_response.json())

Looks about the same, but prettier, right? Well, the technical difference is 
that the data is now a Python dictionary, which we can interact with much more
easily. Previously it was one long string of characters that Python didn't
know how to interpret.
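
To see the difference without calling the API, here is a minimal sketch that uses the standard `json` module on a hand-written byte string. The payload is made up and far smaller than a real reply; it only illustrates what parsing does to the type.

```python
import json

# Hand-written bytes standing in for a server reply (values are made up).
raw = b'{"status": "ok", "message": {"total-results": 3}}'

parsed = json.loads(raw)

print(type(raw).__name__)     # the unparsed reply is a bytes object
print(type(parsed).__name__)  # the parsed reply is a dictionary
print(list(parsed.keys()))    # dictionaries let us ask for their keys
```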

Let's see if we can get at that 'total-results' key in the dictionary.

In [None]:
# Get the total-results from the JSON response
total_results = api_response.json()["message"]["total-results"]
print("The total number of members in the Crossref database is: ", total_results)

You access information from a dictionary by using square brackets and the key name.
Here, the 'total-results' key is not at the top level; it sits underneath the
'message' key, so we have two steps to access it:

```python
api_response.json()['message']['total-results']
```
1. `api_response.json()` returns the top-level dictionary from the JSON data
2. `['message']` accesses the data stored under the 'message' key. This includes:
   1. `items`
   2. `items-per-page`
   3. `query`
   4. `total-results`
3. `['total-results']` accesses the data stored under the 'total-results' key
   within the `message` data
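
The same steps can be taken one at a time with intermediate variables. The sketch below does that on a small hand-written dictionary, so it runs without calling the API. The values are made up; only the shape mirrors the reply above.

```python
# Made-up stand-in for api_response.json().
top_level = {
    "status": "ok",
    "message": {"items": [], "items-per-page": 0, "query": {}, "total-results": 3},
}

message = top_level["message"]            # one level down
message_keys = sorted(message.keys())     # what is stored under 'message'?
total_results = message["total-results"]  # the value we were after

print(message_keys)
print(total_results)
```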

Let's try something that has more than one piece of information in it that
we want to access. Let's try another API call from the non-technical
introduction page: 

> Let’s look at some of the results. `https://api.crossref.org/works?query=%22blood%22&`

In [None]:
api_url = "https://api.crossref.org/works?query=%22blood%22&"
api_response = requests.get(api_url, params={"mailto": your_email})
print("The Crossref server responded with status code: ", api_response.status_code)
blood_json = api_response.json()
blood_total_results = blood_json["message"]["total-results"]
print(
    f"The total number of works in the Crossref database matching the "
    f"query 'blood' is: {blood_total_results}"
)
print("Here is the JSON the server responded with:\n")
pprint(blood_json)

That is a lot of results! Let's see how many they actually sent us!

In [None]:
n_items = len(blood_json["message"]["items"])
print("The number of items in the 'items' list is: ", n_items)

OK, so even though the total number of results was enormous, they only sent us 20.
That's because the API has a default limit of 20 results per page. We can
change that later with the `rows` parameter, but for right now, let's just explore
what we can do with these 20 results.

Let's see just what the data they sent us in the first result contains.

When you have a dictionary, you can use the `.keys()` method to see what keys
are available to access. Let's try that with the first result.

In [None]:
data_keys = sorted(list(blood_json["message"]["items"][0].keys()))
print(f"There are {len(data_keys)} keys in the first item in the list.\n")
print("The keys are: ")
pprint(data_keys)

So we can see that there are a lot of keys available to us. Let's try to access
the 'title' key to see what the titles of the 20 returned items are.

In [None]:
for item in blood_json["message"]["items"]:
    print(item["title"])

Notice how they're all in square brackets? That's because the 'title' key
in Crossref data actually returns a list of titles. This is because an article
can have more than one title, such as in multiple languages. For now, let's 
just look at the first title of each result.

In [None]:
for item in blood_json["message"]["items"]:
    print(item["title"][0])

Better, now what about authors? Let's try to access the 'author' key.

In [None]:
for item in blood_json["message"]["items"]:
    pprint(item.get("author", []))

What gives? Why did I change the way I retrieved the data from the dictionary?

Well, I expected 'title' to be present in these results, so using the 
`['title']` method was fine. But 'author' is not always going to be in the data,
because not all articles have authors. So I used the `.get()` method instead.
This method allows you to specify a default value if the key you're looking for
is not in the dictionary. In this case, I specified an empty list `[]` as the 
default because I'm expecting to work with the authors data as a list, since
there can be multiple authors.

So all of those empty lists mean that there were no authors for that particular
article. 
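
Here is a small offline sketch of the difference between the two access styles. The two records are made up for illustration; one has an 'author' key and the other does not.

```python
# Two made-up records: one with an author list, one without the key at all.
items = [
    {"title": ["Paper A"], "author": [{"given": "Ada", "family": "Lovelace"}]},
    {"title": ["Paper B"]},
]

author_counts = []
for item in items:
    authors = item.get("author", [])  # falls back to [] when the key is absent
    author_counts.append(len(authors))
print(author_counts)

try:
    items[1]["author"]  # square brackets raise an error when the key is missing
except KeyError as err:
    print("KeyError for missing key:", err)
```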

Now let's use that author data to calculate the number of authors for each 
article - something they don't provide us with directly.

In [None]:
for item in blood_json["message"]["items"]:
    author_list = item.get("author", [])
    print(len(author_list))

Cool, now let's make the query customizable, so we can search for things
other than 'blood'.

In [None]:
import urllib.parse

# Replace this with the search term/phrase you want to use
search_phrase = "entrepreneurial orientation"
n_rows = 10

search_term = urllib.parse.quote(search_phrase)
api_url = f"https://api.crossref.org/works?query=%22{search_term}%22&rows={n_rows}"
api_response = requests.get(api_url, params={"mailto": your_email})
print("The Crossref server responded with status code: ", api_response.status_code)
custom_json = api_response.json()
custom_total_results = custom_json["message"]["total-results"]
print(
    f"The total number of works in the Crossref database matching the query "
    f"'{search_phrase}' is: {custom_total_results}"
)
print(f"The server replied with {len(custom_json['message']['items'])} items.")
print("\nHere is the JSON the server responded with:\n")
pprint(custom_json)

Notice we did a couple of things here:

```python
search_term = urllib.parse.quote(search_phrase)
```
We did this to make sure that the search term was properly formatted for a URL.
This is important because URLs can't have spaces or certain other characters in
them. This function replaces those characters with the appropriate URL encoding.
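
To see what the encoding actually looks like, here is a quick sketch on a few made-up phrases. Note that `quote` leaves `/` alone by default, which matters if a search phrase contains a slash.

```python
from urllib.parse import quote

phrases = ["entrepreneurial orientation", "R&D spending", "work/life balance"]

for phrase in phrases:
    print(f"{phrase!r} -> {quote(phrase)}")

# Passing safe="" encodes the slash as well.
print(quote("work/life balance", safe=""))
```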

```python
api_url = f"https://api.crossref.org/works?query=%22{search_term}%22&rows={n_rows}"
```
We used an f-string to insert the search term and the number of rows we want
directly into the URL. This is a nice way to make sure that the URL is always
formatted correctly while enabling us to update the URL dynamically. For instance,
if we wanted to loop through a variety of search terms, we could update
`search_phrase` inside the loop and rebuild the URL with the same f-string each time.
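
A minimal sketch of that loop, with the URL-building lines wrapped in a function. `build_search_url` is a hypothetical helper name and the second search phrase is just an example; no request is sent here.

```python
from urllib.parse import quote


def build_search_url(search_phrase, n_rows=10):
    """Hypothetical helper: build the /works search URL for one phrase."""
    search_term = quote(search_phrase)
    return f"https://api.crossref.org/works?query=%22{search_term}%22&rows={n_rows}"


for phrase in ["entrepreneurial orientation", "dynamic capabilities"]:
    print(build_search_url(phrase, n_rows=5))
```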

OK, now let's try to get all the articles from the *Academy of Management Journal*.

## Step 3. Get all articles from the *Academy of Management Journal* in 2024

The examples are only going to get us so far. None of these examples tell us 
how to get all the articles from a specific journal in a specific year. We're
going to have to figure that out on our own.

Let's start by looking at the [API documentation](https://api.crossref.org/swagger-ui/index.html).

Do you see anything that looks like it would return a list of journals?

### Step 3.1. Get the ISSN for the *Academy of Management Journal*

Hopefully you found the section that says "Journals Endpoints that expose 
journal related data". Within that section, you should see information about
three different endpoints:
- `/journals` - Returns a list of journals in the Crossref database.
- `/journals/{issn}` - Returns information about the journal with the given ISSN.
- `/journals/{issn}/works` - Returns a list of works in the journal identified
  by the given ISSN.

We could look up the ISSN for the *Academy of Management Journal* on the
journal's website, but we're going to pretend that we don't know it and
are stuck using the `/journals` endpoint to find it.

In [None]:
endpoint = "/journals"
journal_title = "Academy of Management Journal"
query = f"query={urllib.parse.quote(journal_title)}"
api_url = f"https://api.crossref.org{endpoint}?{query}"
print("The URL we are going to use is: ", api_url)

Try opening the URL printed above in your browser. What do you see?
This is hard to manage manually, so let's use Python to get the data for us.

In [None]:
api_response = requests.get(api_url, params={"mailto": your_email})
print("The Crossref server responded with status code: ", api_response.status_code)
journal_json = api_response.json()
journal_total_results = journal_json["message"]["total-results"]
print(
    f"The total number of journals in the Crossref database with the title "
    f"containing the words {journal_title} is: {journal_total_results}"
)
print(f"The server replied with {len(journal_json['message']['items'])} items.")

Eight items? We were expecting to see only one! What's going on?

Let's diagnose the problem by looking at the items in the list.

In [None]:
# What data do we have for each journal?
data_keys = sorted(list(journal_json["message"]["items"][0].keys()))
print(f"There are {len(data_keys)} keys in the first item in the list.\n")
print("The keys are: ")
pprint(data_keys)

'title' looks promising...

In [None]:
for item in journal_json["message"]["items"]:
    print(item["title"])

AHA! We found the *Academy of Management Journal*! The others just use the same
words. Let's find the ISSN for the *Academy of Management Journal*.

In [None]:
issn_list = []
for item in journal_json["message"]["items"]:
    if item["title"] == "Academy of Management Journal":
        issn_list.extend(item["ISSN"])
print(issn_list)

Verify that these are the correct ISSNs:

https://en.wikipedia.org/wiki/Academy_of_Management_Journal
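
A complementary sanity check that needs no lookup: the last character of an ISSN is a check digit computed from the first seven digits. The sketch below implements that standard rule. `is_valid_issn` is a name made up for this example, and the sample ISSN is a commonly cited textbook example rather than one of the journal's own. Passing the check only shows that the number is well formed, not that it belongs to the right journal.

```python
def is_valid_issn(issn):
    """Return True if the ISSN's final character matches its check digit."""
    chars = issn.replace("-", "").upper()
    if len(chars) != 8 or not chars[:7].isdigit():
        return False
    # Weight the first seven digits 8, 7, ..., 2 and add them up.
    total = sum(int(digit) * weight for digit, weight in zip(chars[:7], range(8, 1, -1)))
    check_value = (11 - total % 11) % 11
    expected = "X" if check_value == 10 else str(check_value)
    return chars[7] == expected


print(is_valid_issn("0378-5955"))  # well-formed example ISSN
print(is_valid_issn("0378-5954"))  # same number with the last digit altered
```

You could run it over `issn_list` with `[is_valid_issn(issn) for issn in issn_list]`.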

Let's get the information about the journal from the second API endpoint:
`/journals/{issn}`

In [None]:
for issn in issn_list:
    endpoint = f"/journals/{issn}"
    api_url = f"https://api.crossref.org{endpoint}"
    print(f"The URL we are about to use for ISSN {issn} is: ", api_url)
    api_response = requests.get(api_url, params={"mailto": your_email})
    print("The Crossref server responded with status code: ", api_response.status_code)
    journal_details_json = api_response.json()
    print("Here is the JSON the server responded with:\n")
    pprint(journal_details_json)

OK. So we have the ISSNs for the *Academy of Management Journal*. Let's use
the `/journals/{issn}/works` endpoint to get all the articles from the journal
in 2024.

**We're just going to use the first ISSN for now because they will provide duplicate results for the most part.**
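
If you later decide to query both ISSNs, the overlap can be removed by keying on the DOI. This is a minimal sketch, assuming each record carries a 'DOI' key as in the earlier replies. `merge_by_doi` and the sample records are made up for illustration.

```python
def merge_by_doi(*item_lists):
    """Hypothetical helper: combine lists of records, keeping one per DOI."""
    merged = {}
    for items in item_lists:
        for item in items:
            # DOIs are case-insensitive, so normalise before comparing.
            merged.setdefault(item["DOI"].lower(), item)
    return list(merged.values())


# Made-up records standing in for the results of two ISSN queries.
items_from_first_issn = [{"DOI": "10.1234/example.1"}, {"DOI": "10.1234/example.2"}]
items_from_second_issn = [{"DOI": "10.1234/EXAMPLE.2"}, {"DOI": "10.1234/example.3"}]

combined = merge_by_doi(items_from_first_issn, items_from_second_issn)
print(len(combined))
```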

In [None]:
issn = issn_list[0]
endpoint = f"/journals/{issn}/works"

from_date = "2024-01-01"
until_date = "2024-12-31"
n_rows = 100
query = f"filter=from-pub-date:{from_date},until-pub-date:{until_date}&rows={n_rows}"

api_url = f"https://api.crossref.org{endpoint}?{query}"
print("The URL we are going to use is: ", api_url)
api_response = requests.get(api_url, params={"mailto": your_email})
print("The Crossref server responded with status code: ", api_response.status_code)
journal_works_json = api_response.json()
journal_works_total_results = journal_works_json["message"]["total-results"]
print(
    f"The total number of works in the Crossref database with the ISSN {issn} "
    f"published between {from_date} and {until_date} is: {journal_works_total_results}"
)
print(f"The server replied with {len(journal_works_json['message']['items'])} items.")
print("\nHere is the JSON the server responded with:\n")
pprint(journal_works_json)

We got the data we're looking for! Now we just need to extract the information
we want from it and save it to a datafile.
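
One caveat before moving on: if the total reported above is larger than `n_rows`, a single request does not return everything. Crossref documents cursor-based paging for lists of works: start with `cursor=*`, then pass back the `next-cursor` value from each reply until no items come back. Below is a minimal sketch of that loop. `fetch_all_items`, `fetch_page`, and the fake pages are illustrative names and data, not part of the Crossref API, and the server is simulated so the sketch runs offline.

```python
def fetch_all_items(fetch_page, rows=100):
    """Collect items across pages using cursor-style paging.

    fetch_page(cursor, rows) is a caller-supplied function that returns
    the parsed JSON dictionary for one page.
    """
    cursor = "*"  # starting cursor
    all_items = []
    while True:
        message = fetch_page(cursor, rows)["message"]
        items = message["items"]
        if not items:
            break  # an empty page means we have seen everything
        all_items.extend(items)
        cursor = message.get("next-cursor")
        if cursor is None:
            break
    return all_items


# Offline stand-in for the server: three pages keyed by cursor value.
fake_pages = {
    "*": {"message": {"items": [{"DOI": "a"}, {"DOI": "b"}], "next-cursor": "c1"}},
    "c1": {"message": {"items": [{"DOI": "c"}], "next-cursor": "c2"}},
    "c2": {"message": {"items": [], "next-cursor": "c2"}},
}


def fake_fetch(cursor, rows):
    return fake_pages[cursor]


all_items = fetch_all_items(fake_fetch)
print(len(all_items))
```

In real use, `fetch_page` would wrap a `requests.get` call that adds `cursor` and `rows` to the query string. The cursor value should be URL-encoded, for example with `urllib.parse.quote`.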

## Step 4. Extract and save the data to a file

Let's say that we want the following information from each article:
- DOI
- Title
- Number of references
- Number of authors
- Number of citations
- Publication date
- URL

Let's see if we can get this just for the first article.

In [None]:
first_article = journal_works_json["message"]["items"][0]
print("The data we have for the first article in the list is:")
pprint(sorted(list(first_article.keys())))

From this we can get the variables we want to extract:

In [None]:
doi = first_article["DOI"]
title = first_article["title"][0]
n_refs = first_article["references-count"]
n_authors = len(first_article.get("author", []))
n_cites = first_article["is-referenced-by-count"]
pub_date = first_article["published"]
url = first_article["URL"]

print(f"DOI: {doi}")
print(f"Title: {title}")
print(f"Number of references: {n_refs}")
print(f"Number of authors: {n_authors}")
print(f"Number of citations: {n_cites}")
print(f"Publication date: {pub_date}")
print(f"URL: {url}")

This looks good, but what's with the date? It's not in a very useful format.
We can use the `datetime` module to convert it to a more useful format.

In [None]:
import datetime

date_parts = pub_date["date-parts"][0]
# Crossref may give only a year, or a year and month; fill the gaps with 1.
year = date_parts[0]
month = date_parts[1] if len(date_parts) > 1 else 1
day = date_parts[2] if len(date_parts) > 2 else 1

pub_date = datetime.datetime(year, month, day)
print(f"The publication date is: {pub_date.strftime('%B %d, %Y')}")

Hrmm... that seems to have worked, but we don't want to have to do that every
time we want to restructure the date. Let's make a function that will do that 
for us.

In [None]:
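def build_date(date_parts):
    """Convert Crossref date-parts, e.g. [[2024, 7]], to a datetime."""
    parts = date_parts[0]
    # Crossref may give only a year, or a year and month; fill the gaps with 1.
    year = parts[0]
    month = parts[1] if len(parts) > 1 else 1
    day = parts[2] if len(parts) > 2 else 1
    return datetime.datetime(year, month, day)


print(build_date([[2024, 1, 1]]).strftime("%B %d, %Y"))
print(build_date([[2024, 12, 31]]).strftime("%B %d, %Y"))
print(build_date([[2024, 7]]).strftime("%B %d, %Y"))
print(build_date([[2024]]).strftime("%B %d, %Y"))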

Well, that's useful, but we also want to be able to parse an entire article at a time,
not just the date. Let's make a function that will do that for us.

In [None]:
def parse_article(article):
    doi = article["DOI"]
    title = article["title"][0]
    n_refs = article["references-count"]
    n_authors = len(article.get("author", []))
    n_cites = article["is-referenced-by-count"]
    pub_date = build_date(article["published"]["date-parts"])
    url = article["URL"]
    return {
        "doi": doi,
        "title": title,
        "n_refs": n_refs,
        "n_authors": n_authors,
        "n_cites": n_cites,
        "pub_date": pub_date,
        "url": url,
    }


print("Here is the first article's information using the function: ")
pprint(parse_article(first_article))

print("\nHere is the second article's information using the function: ")
pprint(parse_article(journal_works_json["message"]["items"][1]))

OK, so now we can systematically apply this function to all the articles in the
data we got from the API call. Let's do it.

In [None]:
import pandas as pd

article_data = []
for article in journal_works_json["message"]["items"]:
    article_data.append(parse_article(article))

article_df = pd.DataFrame(article_data).sort_values("pub_date", ascending=True)
print(article_df.head())

Let's see what we got in the dataset.

In [None]:
n_rows = article_df.shape[0]
n_columns = article_df.shape[1]
print(f"The dataframe has {n_rows} articles with {n_columns} columns of data.")

avg_refs = article_df["n_refs"].mean()
avg_authors = article_df["n_authors"].mean()
avg_cites = article_df["n_cites"].mean()
print(f"The average number of references is: {avg_refs}")
print(f"The average number of authors is: {avg_authors}")
print(f"The average number of citations is: {avg_cites}")


Chances are, you aren't going to want to work with the data in this notebook.

Let's save the dataframe to a CSV file so that you can work with it in
another notebook, Excel, or any other program that can read CSV files.

In [None]:
from pathlib import Path

filename = Path.cwd() / "journal_articles.csv"
article_df.to_csv(filename, index=False, encoding="utf-8-sig", header=True)
print(f"Data saved to {filename}")

Now you have the 2024 AMJ articles in a CSV file!

Try opening it - does it look like what you expected?
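
If you want to check the file from Python rather than Excel, the standard `csv` module can read it back. The sketch below writes and re-reads a one-row file in a temporary folder, so it does not touch your real CSV. The row is made up, but the column names and the `utf-8-sig` encoding match the cell above.

```python
import csv
import tempfile
from pathlib import Path

# Made-up row with the same columns as the tutorial's dataframe.
sample_rows = [
    {
        "doi": "10.1234/example.1",
        "title": "Example title",
        "n_refs": 50,
        "n_authors": 3,
        "n_cites": 2,
        "pub_date": "2024-02-01",
        "url": "https://doi.org/10.1234/example.1",
    },
]

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "journal_articles.csv"

    # Write with UTF-8 plus a byte-order mark, like the tutorial does.
    with open(csv_path, "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.DictWriter(f, fieldnames=list(sample_rows[0].keys()))
        writer.writeheader()
        writer.writerows(sample_rows)

    # Read with the same encoding so the byte-order mark is stripped again.
    with open(csv_path, newline="", encoding="utf-8-sig") as f:
        rows_read = list(csv.DictReader(f))

print(list(rows_read[0].keys()))
print(repr(rows_read[0]["n_refs"]))  # csv returns every value as text
```

Two things to watch for: reading with plain `utf-8` would leave the byte-order mark stuck to the first column name, and the `csv` module returns numbers as strings, so convert them before doing arithmetic.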

Done...