# Working With APIs

This notebook follows the corresponding DataQuest course. Further down, we dive deeper into the GitHub REST API, using `ghapi` for easy access.

In [None]:
import requests
import re
import os

In [None]:
response = requests.get("http://api.open-notify.org/astros.json")
print(response.status_code)

In [None]:
def process_astros_response(response):
    if response.status_code == 200:
        res_dict = response.json()
        print(f"There are currently {res_dict['number']} people in space:")
        astros = [f"{e['name']} on the {e['craft']}" for e in res_dict['people']]
        print("\n".join(astros))
    else:
        print(f"Call to astros.json resulted in status_code = {response.status_code}")

In [None]:
process_astros_response(response)

In [None]:
response.content

Query parameters are passed as a dictionary via the `params` argument. `requests` encodes them into the URL.

In [None]:
# This is the latitude and longitude of New York City.
parameters = {"lat": 40.71, "lon": -74}

# Make a get request with the parameters.
response = requests.get("http://api.open-notify.org/iss-pass.json", params=parameters)

print(response.url)

print(response.status_code)
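
To make the encoding step more concrete, here is a minimal sketch using only the standard library. It builds the query string by hand with `urllib.parse.urlencode`; the URL printed by `response.url` above should have the same shape. The variable names `base_url`, `query_string` and `full_url` are only for illustration.

```python
from urllib.parse import urlencode

base_url = "http://api.open-notify.org/iss-pass.json"
# Same parameters as above: latitude and longitude of New York City
parameters = {"lat": 40.71, "lon": -74}

# Each key-value pair becomes "key=value", pairs are joined with "&"
query_string = urlencode(parameters)
full_url = f"{base_url}?{query_string}"
print(full_url)
```

No request is sent here, so this can be run offline to check how a parameter dictionary ends up in the URL.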

## GitHub API

Next, we look at the GitHub API.

First, we consider GET requests. If successful, these return the status code 200. Codes in the 4xx range denote client errors, codes in the 5xx range denote server errors.
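
The status code classes can be sketched with a small helper. The function name `describe_status` is hypothetical and not part of `requests`; it only maps a numeric code to its class.

```python
def describe_status(status_code):
    """Map an HTTP status code to the name of its class."""
    if 100 <= status_code < 200:
        return "informational"
    if 200 <= status_code < 300:
        return "success"
    if 300 <= status_code < 400:
        return "redirection"
    if 400 <= status_code < 500:
        return "client error"
    if 500 <= status_code < 600:
        return "server error"
    return "unknown"


for code in (200, 201, 404, 422, 500):
    print(f"{code}: {describe_status(code)}")
```

In practice, `requests` already covers the common case: `response.ok` is true for codes below 400, and `response.raise_for_status()` raises an exception for 4xx and 5xx codes.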

In [None]:
# Create a dictionary of headers containing our Authorization header.
# Note: This is a temporary token, with restricted access only

token = os.environ["GITHUB_TOKEN"]
headers = {"Authorization": f"token {token}"}

# Make a GET request to the GitHub API with our headers.
response = requests.get("https://api.github.com/users/mseeger/repos", headers=headers)

for e in response.json():
    print(f"{e['html_url']}: {e['description']}")


In [None]:
response.status_code

Let us list all repositories in `awslabs`. This requires pagination. Following the recommended practice, we scan the `Link` response header for the `rel="next"` entry, which contains the URL of the next page.

In [None]:
def collect_and_extract(response, pages, st_entry=None):
    json_data = response.json()
    names = [e['name'] for e in json_data]
    if st_entry is None:
        try:
            pos = names.index("syne-tune")
            st_entry = json_data[pos]
        except ValueError:
            pass
    new_pages = pages + [names]
    return new_pages, st_entry

endpoint = "https://api.github.com/orgs/awslabs/repos"
regex = r"(?<=<)([\S]*)(?=>; rel=\"next\")"
pages = []
st_entry = None
st_not_yet_found = True
num_pages = 0

while endpoint is not None:
    response = requests.get(endpoint, headers=headers)
    if response.status_code != 200:
        print(f"Request for page {num_pages + 1} failed! status_code = {response.status_code}")
        break
    pages, st_entry = collect_and_extract(response, pages, st_entry)
    num_pages += 1
    print(f"Obtained page {num_pages}")
    if st_not_yet_found and st_entry is not None:
        print("Found syne-tune")
        st_not_yet_found = False
    endpoint = None
    links = response.headers.get('Link')
    if links:
        m = re.search(regex, links)
        if m:
            endpoint = m.group(1)

num_repos = sum(len(names) for names in pages)
print(f"\nFound {num_repos} repositories:")
for names in pages:
    print("\n".join(names))

if st_entry:
    print("\nFound syne-tune:\n" + str(st_entry))
else:
    print("\nDid not find syne-tune!")
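
The `Link` header parsing can be tried out offline. The sketch below applies the same regular expression to made-up header strings (the URLs are placeholders, not real GitHub responses) and wraps it in a hypothetical helper `next_page_url`.

```python
import re

# Same pattern as above: the URL in angle brackets which precedes rel="next"
NEXT_LINK_REGEX = r'(?<=<)([\S]*)(?=>; rel="next")'


def next_page_url(link_header):
    """Return the URL of the next page, or None if there is none."""
    if not link_header:
        return None
    m = re.search(NEXT_LINK_REGEX, link_header)
    return m.group(1) if m else None


# Made-up header values for a first, middle, and last page
first_page = (
    '<https://api.example.com/repos?page=2>; rel="next", '
    '<https://api.example.com/repos?page=5>; rel="last"'
)
middle_page = (
    '<https://api.example.com/repos?page=2>; rel="prev", '
    '<https://api.example.com/repos?page=4>; rel="next", '
    '<https://api.example.com/repos?page=5>; rel="last"'
)
last_page = '<https://api.example.com/repos?page=4>; rel="prev"'

for header in (first_page, middle_page, last_page, None):
    print(next_page_url(header))
```

Note that `requests` also parses this header itself: `response.links` is a dictionary keyed by the `rel` value, so `response.links.get("next")` could replace the regular expression.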

A POST request sends data to an API, or is used to create a new object on the server. If an object is created successfully, a POST request typically returns status code 201 (Created).

In [None]:
# Here, we create a new (empty) repository

# Note: If we run this again, we obtain error code 422 for
# "Validation failed, or the endpoint has been spammed". In our case,
# this means that the repo already exists.

payload = {"name": "small-test-repo"}
response = requests.post(
    "https://api.github.com/user/repos", headers=headers, json=payload
)
print(response.status_code)

In [None]:
response.json()

In [None]:
# Let us delete the repo again

response = requests.delete(
    "https://api.github.com/repos/mseeger/small-test-repo", headers=headers
)
print(response.status_code)

This request fails, because our token is not authorized to delete the repository.

A PUT or PATCH request is used to modify an object on the server. Typically, PATCH changes only some attributes, while PUT needs the full object as input, which then replaces the one on the server. If successful, these requests return status code 200.
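
The difference between the two can be illustrated with a toy model, in which the object on the server is a plain dictionary. This is only a sketch of the semantics, not of how GitHub implements them; the functions `apply_put` and `apply_patch` are hypothetical.

```python
def apply_put(stored, payload):
    """PUT-style update: the payload replaces the stored object entirely."""
    return dict(payload)


def apply_patch(stored, payload):
    """PATCH-style update: only attributes present in the payload change."""
    return {**stored, **payload}


# Toy stand-in for a repository object on the server
repo = {"name": "small-test-repo", "description": None, "private": False}
payload = {"description": "This repo is just for testing API calls"}

patched = apply_patch(repo, payload)
replaced = apply_put(repo, payload)
print(patched)
print(replaced)
```

After the PATCH-style update, `name` and `private` are still present, while the PUT-style update keeps only what was in the payload.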

In [None]:
# Let us modify the new repo by assigning a description

payload = {"name": "small-test-repo", "description": "This repo is just for testing API calls"}
response = requests.patch(
    "https://api.github.com/repos/mseeger/small-test-repo", headers=headers, json=payload
)
print(response.status_code)

In [None]:
response.json()

## Search and Download Files from Repository Using ghapi

Let us do something more complicated. First, we use the GitHub search API to find files which contain a certain string. Next, we download these files.

We make our life easier by using https://ghapi.fast.ai/, which provides a Pythonic API to the GitHub REST API, dealing with lots of details in the background, such as composing headers or pagination. It offers full tab completion in the notebook, as well as links to the GitHub API documentation.

**Note**: As seen on https://github.com/fastai/ghapi, `ghapi` does not seem to be actively maintained anymore. The last release was in 2022, and there are more than 40 open issues. Maybe it is better to look for alternatives?

Alternatives are listed here: https://docs.github.com/en/rest/using-the-rest-api/libraries-for-the-rest-api?apiVersion=2022-11-28#python. PyGithub has about 6500 stars, but comes with an LGPL 3 licence (ghapi is Apache 2). As for `ghapi`, it was probably used heavily at `fastai`.

In [None]:
from ghapi.all import GhApi

token = os.environ["GITHUB_TOKEN"]

api = GhApi(owner="awslabs", repo="syne-tune", token=token)

In [None]:
api.search

The search retrieves all files containing the search term. We can also obtain the fragments where the term is found, along with the spans. Below, we print the fragments, highlighting the spans in red.

Note: If the search term appears more than twice in a fragment, only the first two appearances are returned.

In [None]:
# We are using ANSI escape sequences in order to highlight the spans in red:
# https://www.geeksforgeeks.org/print-colors-python-terminal/
def highlight_spans_in_red(fragment, spans):
    start_seq = "\033[91m"
    end_seq = "\033[00m"
    spans = [[0, 0]] + [s['indices'] for s in spans]
    parts = []
    for prev, curr in zip(spans[:-1], spans[1:]):
        parts.extend([fragment[prev[1]:curr[0]], start_seq, fragment[curr[0]:curr[1]], end_seq])
    parts.append(fragment[spans[-1][1]:])
    return "".join(parts)
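
A quick offline check of the highlighting logic, assuming each span is a dictionary with an `indices` entry of the form `[start, end]`, as in the text-match metadata. The fragment below is made up for illustration, and the function is repeated so that the snippet runs on its own.

```python
# ANSI escape sequences for red text and for resetting the color
RED = "\033[91m"
RESET = "\033[00m"


def highlight_spans_in_red(fragment, spans):
    # Same logic as above: text between spans is copied, each span is wrapped in red
    spans = [[0, 0]] + [s["indices"] for s in spans]
    parts = []
    for prev, curr in zip(spans[:-1], spans[1:]):
        parts.extend([fragment[prev[1]:curr[0]], RED, fragment[curr[0]:curr[1]], RESET])
    parts.append(fragment[spans[-1][1]:])
    return "".join(parts)


# Hypothetical fragment and match entry, shaped like the search results
term = "SimulatorBackend"
fragment = "backend = SimulatorBackend(entry_point)"
start = fragment.index(term)
spans = [{"text": term, "indices": [start, start + len(term)]}]

highlighted = highlight_spans_in_red(fragment, spans)
print(highlighted)
```

Removing the escape sequences from the result gives back the original fragment, which is a simple way to confirm that no characters are lost or duplicated.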

In [None]:
# We search for all Python files which contain the `search_term`
from ghapi.page import paged

# TODO: Can we enforce case-sensitivity?
search_term = "SimulatorBackend"
query = f"{search_term} in:file repo:awslabs/syne-tune extension:py language:python"

# We use "text-match+json" in the Accept header, in order to obtain details on where the
# search term is matched in each file:
# https://docs.github.com/en/rest/search/search?apiVersion=2022-11-28#text-match-metadata
results = paged(
    api.search.code,
    q=query,
    headers={'Accept': 'application/vnd.github.text-match+json'},
)

all_paths = []
total_count = None
count_so_far = 0

for page in results:
    if total_count is None:
        total_count = int(page['total_count'])
        print(f"Found {total_count} matching files in total")
    for result in page['items']:
        path = result['path']
        print(f"\n[File: {path}]")
        num_found = 0
        text_matches = result['text_matches']
        num_frags = len(text_matches)
        print(f"{num_frags} fragment{'s' if num_frags != 1 else ''}:")
        for match in text_matches:
            fragment = match['fragment']
            spans = match['matches']
            num_spans = len(spans)
            num_found += num_spans
            print(f"--- [{num_spans}] ---")
            print(highlight_spans_in_red(fragment, spans))
        print("-----------")
        all_paths.append((path, num_found))
    num_results = len(page['items'])
    count_so_far += num_results
    # The iterator does not properly stop on its own
    if count_so_far >= total_count or num_results == 0:
        if count_so_far < total_count:
            print(f"Retrieved {count_so_far} of {total_count} results only, but page is empty")
        break

Note that the search does not seem to be case-sensitive. Let us now download all files containing the search term.

In [None]:
from base64 import b64decode

for path, num_matches in all_paths:
    result = api.repos.get_content(path)
    print(f"\n******** [{path}]: Found {num_matches} match{'es' if num_matches != 1 else ''} {'*' * 40}\n")
    print(b64decode(result['content']).decode("utf-8"))
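
The decoding step can be checked without calling the API. The sketch below builds a stand-in for the returned object (the dictionary `fake_result` and the file text are made up), with the content base64-encoded and broken into lines, and decodes it the same way as above.

```python
from base64 import b64decode, encodebytes

# Made-up file content, long enough that the encoded form spans several lines
text = "".join(f"line {i}\n" for i in range(20))

# Stand-in for the API result: base64 content with embedded newlines
fake_result = {
    "encoding": "base64",
    "content": encodebytes(text.encode("utf-8")).decode("ascii"),
}

# By default, b64decode discards characters outside the base64 alphabet,
# so the newlines in the content do not need to be stripped first
decoded = b64decode(fake_result["content"]).decode("utf-8")
print(decoded[:21])
```

This only models the `content` and `encoding` fields used here. The real response contains further fields.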

Another cool project would be to search for and access pull requests:

* Which files have been modified?
* What are the diffs?

This could be useful in order to obtain data about how certain issues have been fixed, which in turn could be used as "instruction tuning" data for an AI assistant.

In [None]:
api.pulls

Let us fetch information for pull request `717` in `syne-tune`. First, `pulls.get` returns the basic information.

In [None]:
api.pulls.get

In [None]:
pr_kwargs = dict(
    owner="awslabs",
    repo="syne-tune",
    pull_number=717,
)

response = api.pulls.get(**pr_kwargs)

In [None]:
list(sorted(response.keys()))

In [None]:
response

In [None]:
basic_info = """
-----------------------------------------------------------
{title} #{number}

{body}
-----------------------------------------------------------
User posting PR:           {user}
State:                     {state}
Number of files changed:   {changed_files}
Was merged:                {merged}
Created at:                {created_at}
Closed at:                 {closed_at}
Merged at:                 {merged_at}
Merged by:                 {merged_by}
ID:                        {id}
Base SHA:                  {base_sha}
Merge Commit SHA:          {merge_commit_sha}
Labels:                    {labels}
Requested reviewers:       {requested_reviewers}
Number of comments:        {comments}
Number of review comments: {review_comments}
"""

direct_names = [
    "title", "number", "body", "state", "changed_files", "created_at", "closed_at",
    "issue_url", "id", "merged", "merged_at", "requested_reviewers", "comments",
    "review_comments", "merge_commit_sha",
]
pr_info = {name: response[name] for name in direct_names}
pr_info["labels"] = [e["name"] for e in response["labels"]]
# `merged_by` is None if the PR has not been merged
merged_by = response["merged_by"]
pr_info["merged_by"] = merged_by["login"] if merged_by else None
pr_info["user"] = response["user"]["login"]
pr_info["base_sha"] = response["base"]["sha"]

print(basic_info.format(**pr_info))

The following information can have multiple parts, and needs to be fetched by separate API calls:

* Files changed (and change patches)
* Reviews, along with their comments linked to code parts

Let us first look at reviews and their comments: `pulls.list_reviews`, `pulls.list_comments_for_review`. Note that `pulls.list_review_comments` provides a flat list of all review comments linked to code parts, but does not contain the summary comment for each review. Also, `pulls.get_review` returns the same information as `pulls.list_reviews`, but for one specific review only.

In [None]:
info_per_review = """
[ REVIEW {id} ]
{body}

User:      {user}
State:     {state}
"""

info_per_review_comment = """
[ COMMENT {id} ]
{body}

Path:       {path}
Position:   {position}
Diff hunk:
{diff_hunk}
"""

# Note: Both API calls are paginated, but we are lazy here (there are only 3 reviews, and they have 0 or 1 comment each)
response = api.pulls.list_reviews(**pr_kwargs)
print(f"Found {len(response)} reviews:")
for review in response:
    review_id = review['id']
    kwargs = {name: review[name] for name in ["body", "id", "state"]}
    kwargs["user"] = review["user"]["login"]
    print(info_per_review.format(**kwargs))
    response2 = api.pulls.list_comments_for_review(**pr_kwargs, review_id=review_id)
    for comment in response2:
        kwargs = {name: comment[name] for name in ["body", "id", "path", "position", "diff_hunk"]}
        print(info_per_review_comment.format(**kwargs))

Finally, we look at files changed: `pulls.list_files`.

In [None]:
info_per_file = """
[ PATCH {filename} ]
{patch}

Additions: {additions}
Deletions: {deletions}
Changes:   {changes}
"""

# Note: The API call is paginated, but we are lazy here (there are only 9 files changed)
response = api.pulls.list_files(**pr_kwargs)
print(f"Found {len(response)} files changed:")
for file_info in response:
    # Each entry describes one changed file
    kwargs = {name: file_info[name] for name in ["filename", "patch", "additions", "deletions", "changes"]}
    print(info_per_file.format(**kwargs))

What does `list_commits` do?

For this PR, there is one commit. It lists 'author', 'committer' (both 'mseeger'). Its SHA is 'afde70bfec901d64f9d2603893dacacf2b5945f2', which is not the same as the SHA for the merge commit. This is most likely the commit for which the PR was issued. There may be further commits for changes based on review comments, all of which are different from the final merge commit.

We would like to iterate over all PRs of a repository and filter them according to some criterion. We can use `pulls.list`.

In [None]:
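# `pulls.list` does not take a `pull_number`, so we pass owner and repo only.
# Note: Only the first page of results is returned (30 entries by default)
response = api.pulls.list(owner="awslabs", repo="syne-tune", state="closed")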

In [None]:
response[0]

In [None]:
print("\n".join(f"{e['number']}: {e['title']} ({e['merged_at']})" for e in response))

Fields in entries returned by `pulls.list` useful for filtering:

* `number`: Number of PR
* `merged_at`: Timestamp when PR was merged, or `None` if PR was not merged
* `labels`: Labels assigned to the PR
* `created_at`, `closed_at`: Further timestamps
* `user`: Who submitted the PR?
* `title`: Title of PR

Annoyingly, there is no information about the number of files changed, or the total number or size of changes. This needs `pulls.get` for every PR. All in all, we can use `pulls.list` for some first round of filtering (e.g., by timestamps; only PRs which were merged; by user; by labels assigned), and iterate over the remaining ones with `pulls.get` for a second round of filtering.
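
The first round of filtering could look like the sketch below. It works on plain dictionaries shaped like the entries of `pulls.list`. The sample entries, user names and the function `passes_first_round` are made up for illustration.

```python
def passes_first_round(entry, merged_only=True, user=None, label=None, created_after=None):
    """Cheap checks which only need fields returned by `pulls.list`."""
    if merged_only and entry["merged_at"] is None:
        return False
    if user is not None and entry["user"]["login"] != user:
        return False
    if label is not None and label not in [e["name"] for e in entry["labels"]]:
        return False
    # Timestamps share one ISO 8601 format, so string comparison preserves order
    if created_after is not None and entry["created_at"] <= created_after:
        return False
    return True


# Made-up entries with only the fields needed here
entries = [
    {"number": 1, "merged_at": "2023-07-02T10:00:00Z", "created_at": "2023-07-01T09:00:00Z",
     "user": {"login": "alice"}, "labels": [{"name": "bug"}]},
    {"number": 2, "merged_at": None, "created_at": "2023-07-03T09:00:00Z",
     "user": {"login": "alice"}, "labels": []},
    {"number": 3, "merged_at": "2023-08-01T10:00:00Z", "created_at": "2023-07-30T09:00:00Z",
     "user": {"login": "bob"}, "labels": [{"name": "bug"}, {"name": "docs"}]},
]

candidates = [e["number"] for e in entries if passes_first_round(e, label="bug")]
print(candidates)
```

Only the PR numbers in `candidates` would then be passed on to `pulls.get` for the second round.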

These fields are returned by `pulls.get`, but not by `pulls.list`:

* `additions`, `deletions`, `changed_files`: Total number of additions, deletions (lines), and changed files
* `comments`, `commits`, `review_comments`: Number of comments, commits, and review comments
* `mergeable`, `mergeable_state`, `merged`, `merged_by`: Note that `merged_at` is returned by `pulls.list`
* `maintainer_can_modify`, `rebaseable`

In [None]:
keys_for_list = set(response[0].keys())

In [None]:
response = api.pulls.get(**pr_kwargs)

In [None]:
keys_for_get = set(response.keys())

In [None]:
keys_for_get.difference(keys_for_list)

In [None]:
keys_for_list.difference(keys_for_get)

Finally, we'd like to explore searching for particular repositories or pull requests.

In [None]:
api.search

We can use `search.repos`. With `sort="stars"` (and `order="desc"`, the default), repos with the highest number of stars are returned first; with `sort="forks"`, repos with the largest number of forks are returned first. The search query in `q` can use the qualifiers detailed here:

https://docs.github.com/en/search-github/searching-on-github/searching-for-repositories.

* `jquery in:name` matches repositories with "jquery" in the repository name. `jquery in:description`, `jquery in:readme`
* `user:xyz`, `org:xyz`: Repos of this user or org
* `size:<n`, `size:>n`, `size:<=n`, `size:>=n`, `size:n..m`: repos of size less than, more than, between. Numbers are in KB
* `followers:>=n`, ...: Number of followers
* `forks:>=n`, ...: Number of forks
* `stars:>=n`, ...: Number of stars
* `fork:true`: Include forks in search results
* `language:python`: Repos with code in Python
* `created:<YYYY-MM-DD`: Creation date
* `pushed:>YYYY-MM-DD`: Date of the last push (can be used to filter out inactive repos)
* `license:LICENSE_KEYWORD`: Repos with certain license
* `is:public`: Only public repos (not private ones you have access to)

This should be very useful to narrow down a search!
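
Since a query is just qualifiers joined by spaces, it can be assembled from a dictionary. The helper `build_search_query` below is hypothetical, not part of `ghapi`. A dictionary is used instead of keyword arguments because `is` is a reserved word in Python.

```python
def build_search_query(qualifiers, text=""):
    """Join optional free text and `name:value` qualifiers into one query string."""
    parts = [text] if text else []
    parts.extend(f"{name}:{value}" for name, value in qualifiers.items())
    return " ".join(parts)


query = build_search_query(
    {
        "size": "<50000",
        "forks": ">100",
        "stars": ">2000",
        "language": "python",
        "pushed": ">2023-11-01",
        "is": "public",
    }
)
print(query)
print(build_search_query({"language": "python"}, text="jquery in:name"))
```

The resulting string can be passed as `q` to the search call.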

In [None]:
query = "size:<50000 forks:>100 stars:>2000 language:python pushed:>2023-11-01 is:public"

response = api.search.repos(q=query, sort="stars")

In [None]:
len(response)

In [None]:
api.recv_hdrs

In [None]:
response

In [None]:
len(response['items'])

In [None]:
response['items'][10]

## Reddit API

Calls to the Reddit API need to be authenticated using `OAuth`. This seems challenging to set up, but it would have to be done before the API can be used.