# Denison CS181/DA210 APIs Case Study

---

In [1]:
import os
import os.path
import sys
import importlib
import io
import pandas as pd
from lxml import etree

if os.path.isdir(os.path.join("../../..", "modules")):
    module_dir = os.path.join("../../..", "modules")
else:
    module_dir = os.path.join("../..", "modules")

module_path = os.path.abspath(module_dir)
if module_path not in sys.path:
    sys.path.append(module_path)

import util
importlib.reload(util)

import requests

---

## Part A: High-Level Planning

We can use the GitHub API to retrieve information about organizations, repositories, and users.  Even without authenticating, we can find the commits and users that have made changes to specific files in a repository.

With this goal in mind, we will divide our work into two phases:
1. Build a table of commits to a specific file.
2. Build a table of users who have modified that file.

To do this, we'll make use of two GitHub API endpoints:
* [/repos/{owner}/{repo}/commits](https://docs.github.com/en/rest/commits/commits#list-commits)
* [/users/{username}](https://docs.github.com/en/rest/users/users#get-a-user)

---

## Part B: Building a table of commits to a specific file

We'll try to gather information about the `pandas` repository on GitHub, specifically looking at changes to the `groupby.py` file.  Here is its [file docstring](https://github.com/pandas-dev/pandas/blob/main/pandas/core/groupby/groupby.py):

```
"""
Provide the groupby split-apply-combine paradigm. Define the GroupBy
class providing the base-class of operations.

The SeriesGroupBy and DataFrameGroupBy sub-class
(defined in pandas.core.groupby.generic)
expose these user-facing objects to provide specific functionality.
"""
```

#### Design a function to issue a request

First, we'll write a function `getRepositoryCommits(owner, repo, path, num_per_page=10, page=1)`, which retrieves one _page_ of results from the list-commits endpoint.  The `num_per_page` and `page` parameters let us request different pages of results programmatically.

In [2]:
def getRepositoryCommits(owner, repo, path, num_per_page=10, page=1):
    # Build the URL
    host = "api.github.com"
    resource_path = f"/repos/{owner}/{repo}/commits"
    url = util.buildURL(resource_path, host, protocol="https")

    # Make the request
    query_params = {"path": path,
                    "per_page": num_per_page,
                    "page": page}
    response = requests.get(url, params=query_params)
    if response.status_code != 200:
        print(f"Failed: {resource_path} with status code {response.status_code}")
        # Return an empty page so callers can still treat the result as a list
        return []

    # Return the parsed JSON array (a list of commit objects)
    return response.json()
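
To see what kind of URL such a request ends up targeting, here is a minimal sketch that uses only the standard library. The source of `util.buildURL` is not shown in this notebook, so `build_commits_url` is a hypothetical stand-in, and the exact encoding that `requests` applies may differ in small details.

```python
from urllib.parse import urlencode, urlunsplit

def build_commits_url(owner, repo, path, per_page=10, page=1):
    """Hypothetical helper: assemble the list-commits URL with its query string."""
    resource_path = f"/repos/{owner}/{repo}/commits"
    # urlencode percent-encodes the values, so "/" in the file path becomes %2F
    query = urlencode({"path": path, "per_page": per_page, "page": page})
    return urlunsplit(("https", "api.github.com", resource_path, query, ""))

url = build_commits_url("pandas-dev", "pandas",
                        "pandas/core/groupby/groupby.py", page=2)
print(url)
```

Changing only the `page` argument yields the URL for the next chunk of results, which is what makes the parameter convenient for looping.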

In [3]:
# Example using list-commits endpoint
owner = "pandas-dev"
repo = "pandas"
query_path = "pandas/core/groupby/groupby.py"
data = getRepositoryCommits(owner, repo, query_path)

util.print_json(data, level=4)

         [
           {
            "sha": "1213a173335527aa445a5cd90ea0ef457e09e24d"
            "node_id": "C_kwDOAA0YD9oAKDEyMTNhMTczMzM1NTI3YWE0NDVhNWNkOTBlYTBlZjQ1N2UwOWUyNGQ"
            "commit": 
             {
               ...
             }
            "url": "https://api.github.com/repos/pandas-dev/pandas/commits/1213a173335527aa445a5cd90ea0ef457e09e24d"
            "html_url": "https://github.com/pandas-dev/pandas/commit/1213a173335527aa445a5cd90ea0ef457e09e24d"
             ...
           }
           {
            "sha": "79fb2debb14e77d6d4af9c4db058e6f994507e29"
            "node_id": "C_kwDOAA0YD9oAKDc5ZmIyZGViYjE0ZTc3ZDZkNGFmOWM0ZGIwNThlNmY5OTQ1MDdlMjk"
            "commit": 
             {
               ...
             }
            "url": "https://api.github.com/repos/pandas-dev/pandas/commits/79fb2debb14e77d6d4af9c4db058e6f994507e29"
            "html_url": "https://github.com/pandas-dev/pandas/commit/79fb2debb14e77d6d4af9c4db058e6f994507e29"
             ...
  

#### Understand results

Let's explore the results we get from this endpoint.

We can view the commits for this file on GitHub: https://github.com/pandas-dev/pandas/commits/main/pandas/core/groupby/groupby.py.

Also, as this is a GET request, we can view the general version (without the query parameters) in a web browser to get a feel for the results: https://api.github.com/repos/pandas-dev/pandas/commits.  (Firefox in particular has a very nice view of the headers, raw JSON data, and parsed JSON result.)

The GitHub documentation shows that this result should be a JSON array (corresponding to a Python list) of JSON objects (Python dictionaries).  There are at most 30 (by default) objects per "page", and each object should represent a single commit.
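
As a small offline illustration of that mapping, the sketch below parses a hand-trimmed JSON array with the standard `json` module. The two objects reuse SHAs and messages shown in this notebook, but every other field is left out, so this is not the full shape of a real response.

```python
import json

# A trimmed-down stand-in for the response body (most fields omitted)
raw = """
[
  {"sha": "1213a173335527aa445a5cd90ea0ef457e09e24d",
   "commit": {"message": "DOC: add examples to groupby.first (#46766)"}},
  {"sha": "79fb2debb14e77d6d4af9c4db058e6f994507e29",
   "commit": {"message": "TYP: rename (#46428)"}}
]
"""

commits = json.loads(raw)           # JSON array  -> Python list
first = commits[0]                  # JSON object -> Python dict
messages = [c["commit"]["message"] for c in commits]

print(type(commits).__name__, type(first).__name__)
print(messages)
```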

In [4]:
# Look at the most recent commit's info
commit_obj = data[0]
util.print_json(commit_obj, level=2)

     {
      "sha": "1213a173335527aa445a5cd90ea0ef457e09e24d"
      "node_id": "C_kwDOAA0YD9oAKDEyMTNhMTczMzM1NTI3YWE0NDVhNWNkOTBlYTBlZjQ1N2UwOWUyNGQ"
      "commit": 
       {
        "author": 
         {
          "name": "Thierry Moisan"
          "email": "thierry.moisan@gmail.com"
          "date": "2022-04-26T00:25:32Z"
         }
        "committer": 
         {
          "name": "GitHub"
          "email": "noreply@github.com"
          "date": "2022-04-26T00:25:32Z"
         }
        "message": "DOC: add examples to groupby.first (#46766)"
        "tree": 
         {
          "sha": "15147591e35755006082e466b4d2914c444cb350"
          "url": "https://api.github.com/repos/pandas-dev/pandas/git/trees/15147591e35755006082e466b4d2914c444cb350"
         }
        "url": "https://api.github.com/repos/pandas-dev/pandas/git/commits/1213a173335527aa445a5cd90ea0ef457e09e24d"
         ...
       }
      "url": "https://api.github.com/repos/pandas-dev/pandas/commits/1213a173335527aa44

In [5]:
# Look at the most recent commit's message
commit_obj["commit"]["message"]

'DOC: add examples to groupby.first (#46766)'

In [6]:
# Look at the most recent commit's timestamp
commit_obj["commit"]["author"]["date"]

'2022-04-26T00:25:32Z'
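
The timestamp is an ISO 8601 string whose trailing `Z` denotes UTC. If we later want to sort or compare commits by time, one option is to convert it to a timezone-aware `datetime`. The helper name `parse_github_timestamp` is an assumption for this sketch and is not part of the notebook's `util` module.

```python
from datetime import datetime, timezone

def parse_github_timestamp(ts):
    """Hypothetical helper: parse a 'YYYY-MM-DDTHH:MM:SSZ' string as UTC."""
    naive = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")
    return naive.replace(tzinfo=timezone.utc)

when = parse_github_timestamp("2022-04-26T00:25:32Z")
print(when.isoformat())
```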

#### Design commit table

We'll collect the following information about each commit:
- commit ID
- message
- author username
- commit timestamp

We can write a function that converts the JSON-parsed result of a request into a list of row dictionaries (LoD).

In [7]:
def commitResult2LoD(result, maxelements=None):
    assert isinstance(result, list)

    LoD = []
    count = 0
    for commit_obj in result:
        if maxelements is not None and count >= maxelements:
            break

        D = {}
        D["id"] = commit_obj["sha"]
        D["message"] = commit_obj["commit"]["message"]
        D["author"] = commit_obj["author"]["login"]
        D["timestamp"] = commit_obj["commit"]["author"]["date"]
        LoD.append(D)

        count += 1

    return LoD

In [8]:
# Try parsing the commit results from our previous request
LoD = commitResult2LoD(data)
for row in LoD[:3]:
    util.print_data(row)

{
  "id": "1213a173335527aa445a5cd90ea0ef457e09e24d",
  "message": "DOC: add examples to groupby.first (#46766)",
  "author": "Moisan",
  "timestamp": "2022-04-26T00:25:32Z"
}
{
  "id": "79fb2debb14e77d6d4af9c4db058e6f994507e29",
  "message": "TYP: rename (#46428)",
  "author": "twoertwein",
  "timestamp": "2022-04-03T03:20:40Z"
}
{
  "id": "382aefc6b746b20d047313c15a591f99b210fbf4",
  "message": "REGR: groupby.transform producing segfault ...
  "author": "rhshadrach",
  "timestamp": "2022-03-31T17:59:23Z"
}


#### Handle multiple pages

API service providers often throttle results to avoid sending too much data at once.  The results are typically divided into _chunks_, or _pages_, and the request must specify the desired page and/or the desired number of results per page.  Then, it is up to the client to navigate this, and issue additional requests if necessary, until the desired amount of data is acquired.

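If we want more than 30 results (the default page size for the list-commits endpoint), we need to make more than one request.  We'll do that in a function `getCommits(owner, repo, query_path, num_commits)`.  Note that it appends whole pages, so it can return more than `num_commits` rows when `num_commits` is not a multiple of the page size.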

In [9]:
def getCommits(owner, repo, query_path, num_commits = 15, num_per_page = 10):
    fullLoD = []

    page = 1
    commits_left = num_commits
    more_pages = True

    while more_pages and commits_left > 0:
        commit_page = getRepositoryCommits(owner, repo, query_path, num_per_page, page)

        if len(commit_page) < num_per_page:
            more_pages = False

        pageLoD = commitResult2LoD(commit_page)
        fullLoD.extend(pageLoD)

        commits_left -= len(pageLoD)
        page += 1

    df = pd.DataFrame(fullLoD)
    return df

In [10]:
# Build a table of commits
num_commits = 12
num_per_page = 8
commits_df = getCommits(owner, repo, query_path, num_commits, num_per_page)

print("Number of commits in DataFrame:", len(commits_df))
commits_df.iloc[:5, :]

Number of commits in DataFrame: 16


|   | id | message | author | timestamp |
| - | -- | ------- | ------ | --------- |
| 0 | 1213a173335527aa445a5cd90ea0ef457e09e24d | DOC: add examples to groupby.first (#46766) | Moisan | 2022-04-26T00:25:32Z |
| 1 | 79fb2debb14e77d6d4af9c4db058e6f994507e29 | TYP: rename (#46428) | twoertwein | 2022-04-03T03:20:40Z |
| 2 | 382aefc6b746b20d047313c15a591f99b210fbf4 | REGR: groupby.transform producing segfault (#4... | rhshadrach | 2022-03-31T17:59:23Z |
| 3 | efb262ff526b6f8810b3ad1f675ffdb919af343f | API: User-control of result keys in GroupBy.ap... | TomAugspurger | 2022-03-30T15:43:37Z |
| 4 | 258cfccf8106814504ca07a118037d3f32073732 | BUG: Allow passing of args/kwargs to groupby.a... | rhshadrach | 2022-03-25T23:00:33Z |
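
The run above reports 16 rows even though 12 were requested, because two full pages of 8 were appended. If an exact count matters, one possible approach is to trim each page to the number of items still needed. The sketch below shows that idea offline: `fake_fetch_page` is a made-up stand-in for an API request, and `collect` is a hypothetical name, not a function from this notebook.

```python
def fake_fetch_page(page, per_page, total=20):
    """Stand-in for one API request: returns item numbers for the given page."""
    start = (page - 1) * per_page
    return list(range(start, min(start + per_page, total)))

def collect(num_items, per_page, fetch=fake_fetch_page):
    """Gather at most num_items items, requesting one page at a time."""
    items = []
    page = 1
    while len(items) < num_items:
        batch = fetch(page, per_page)
        # Keep only as many items as are still needed
        items.extend(batch[: num_items - len(items)])
        if len(batch) < per_page:
            break  # a short page means there is nothing left to fetch
        page += 1
    return items

result = collect(12, 8)
print(len(result))
```

The same trimming could be done in `getCommits` by passing the remaining count as `maxelements` to `commitResult2LoD`, at the cost of changing the row counts shown in this notebook.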


---

## Part C: Building a table of users who modified the file

Given the set of author usernames from the previous part, we can build a table of user information for those users.  This will involve multiple requests, one per user, to obtain information about each user.  From this, we can build a table and remove any duplicates.

#### Understand `users` API endpoint

First, we'll need to understand the `users` API endpoint.  Here is the documentation: https://docs.github.com/en/rest/users/users#get-a-user.

From the documentation, we can see that:
- The root of the returned value is a JSON object.
- The root has lots of children, including: `"login"`, `"type"`, `"company"`, `"name"`, and `"email"`.
- Each of the values of these children is a string, or null when the user has not provided that information.

In [11]:
def getUser(username):
    # Build the URL
    host = "api.github.com"
    resource_path = f"/users/{username}"
    url = util.buildURL(resource_path, host, protocol="https")

    # Make the request
    response = requests.get(url)
    if response.status_code != 200:
        print(f"Failed: {resource_path} with status code {response.status_code}")
        return None

    # Return the parsed JSON object
    return response.json()

In [12]:
# Look up one of the people in the pandas-dev org
# (https://github.com/orgs/pandas-dev/people)
user1 = getUser("cpcloud")
util.print_json(user1, level=1, maxchildren=30)

   {
    "login": "cpcloud"
    "id": 417981
    "node_id": "MDQ6VXNlcjQxNzk4MQ=="
    "avatar_url": "https://avatars.githubusercontent.com/u/417981?v=4"
    "gravatar_id": ""
    "url": "https://api.github.com/users/cpcloud"
    "html_url": "https://github.com/cpcloud"
    "followers_url": "https://api.github.com/users/cpcloud/followers"
    "following_url": "https://api.github.com/users/cpcloud/following{/other_user}"
    "gists_url": "https://api.github.com/users/cpcloud/gists{/gist_id}"
    "starred_url": "https://api.github.com/users/cpcloud/starred{/owner}{/repo}"
    "subscriptions_url": "https://api.github.com/users/cpcloud/subscriptions"
    "organizations_url": "https://api.github.com/users/cpcloud/orgs"
    "repos_url": "https://api.github.com/users/cpcloud/repos"
    "events_url": "https://api.github.com/users/cpcloud/events{/privacy}"
    "received_events_url": "https://api.github.com/users/cpcloud/received_events"
    "type": "User"
    "site_admin": False
    "name": "Ph

#### Design users table

To build a tabular representation of GitHub users, we need to decide on the fields.  For simplicity, we'll specify just four fields:

| Field     | Python type | Notes |
| --------- | ----------- | ----- |
| `username`   | `str`       | The GitHub username of the user |
| `name`    | `str`       | The name of the user |
| `location`   | `str`       | The location of the user |
| `company` | `str`       | The company of the user |

We can get this information from a user object returned from the `users` API endpoint, which we'll do in a function `getUserRow(user)`.

In [13]:
def getUserRow(user):
    if user is None:
        return {}

    D = {}
    D["username"] = user["login"]
    D["name"] = user["name"]
    D["location"] = user["location"]
    D["company"] = user["company"]
    return D

In [14]:
# Try this out
getUserRow(user1)

{'username': 'cpcloud',
 'name': 'Phillip Cloud',
 'location': 'New York, NY',
 'company': '@voltrondata'}
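
Indexing with `user["location"]` raises a `KeyError` if a key is ever absent from the object. A more defensive variant, sketched here under the assumption that missing fields should simply become `None`, uses `dict.get`. The name `get_user_row_safe` is hypothetical, and the sample dictionary is a trimmed copy of the values shown above with `"company"` deliberately left out.

```python
def get_user_row_safe(user):
    """Hypothetical variant of getUserRow that tolerates missing keys."""
    if user is None:
        return {}
    # Map table column names to the keys used in the API's user object
    fields = {"username": "login", "name": "name",
              "location": "location", "company": "company"}
    return {column: user.get(key) for column, key in fields.items()}

sample_user = {"login": "cpcloud", "name": "Phillip Cloud",
               "location": "New York, NY"}  # "company" omitted on purpose
row = get_user_row_safe(sample_user)
print(row)
```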

Using these functions, we can build the full list of users given a list of usernames.

In [15]:
def getUsers(usernames):
    LoD = []

    for username in usernames:
        user = getUser(username)
        row = getUserRow(user)
        LoD.append(row)

    df = pd.DataFrame(LoD)
    df.drop_duplicates("username", inplace=True) # remove dups
    return df

In [16]:
# Build the users DataFrame
usernames = list(commits_df["author"])
users_df = getUsers(usernames)

print("Number of users in DataFrame:", len(users_df))
users_df.head()

Number of users in DataFrame: 11


|   | username | name | location | company |
| - | -------- | ---- | -------- | ------- |
| 0 | Moisan | Thierry Moisan | Montréal, QC, Canada |  |
| 1 | twoertwein | Torsten Wörtwein |  |  |
| 2 | rhshadrach | Richard Shadrach | Cincinnati, OH | 84.51 |
| 3 | TomAugspurger | Tom Augspurger |  | @microsoft |
| 6 | nafarya | Danil Iashchenko | London |  |
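
The gap in the index (3 to 6) reflects duplicate rows that were dropped after the requests had already been made. Since unauthenticated requests to the GitHub API are rate limited, it may be worth removing repeated usernames before calling the API at all. A minimal sketch, using the first five authors from the commits table and a hypothetical helper name `unique_in_order`:

```python
def unique_in_order(names):
    """Drop repeated names while keeping the order of first appearance."""
    # dict keys are unique and preserve insertion order
    return list(dict.fromkeys(names))

authors = ["Moisan", "twoertwein", "rhshadrach", "TomAugspurger", "rhshadrach"]
unique_authors = unique_in_order(authors)
print(unique_authors)
```

Passing the deduplicated list to `getUsers` would issue one request per distinct user, although the resulting DataFrame index would then run consecutively rather than showing gaps.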


---

## Summary

We created two DataFrames, one for commit info for a given file, and another for the users involved in those commits.  The two DataFrames are shown again below.

In [17]:
commits_df.head()

|   | id | message | author | timestamp |
| - | -- | ------- | ------ | --------- |
| 0 | 1213a173335527aa445a5cd90ea0ef457e09e24d | DOC: add examples to groupby.first (#46766) | Moisan | 2022-04-26T00:25:32Z |
| 1 | 79fb2debb14e77d6d4af9c4db058e6f994507e29 | TYP: rename (#46428) | twoertwein | 2022-04-03T03:20:40Z |
| 2 | 382aefc6b746b20d047313c15a591f99b210fbf4 | REGR: groupby.transform producing segfault (#4... | rhshadrach | 2022-03-31T17:59:23Z |
| 3 | efb262ff526b6f8810b3ad1f675ffdb919af343f | API: User-control of result keys in GroupBy.ap... | TomAugspurger | 2022-03-30T15:43:37Z |
| 4 | 258cfccf8106814504ca07a118037d3f32073732 | BUG: Allow passing of args/kwargs to groupby.a... | rhshadrach | 2022-03-25T23:00:33Z |


In [18]:
users_df.head()

|   | username | name | location | company |
| - | -------- | ---- | -------- | ------- |
| 0 | Moisan | Thierry Moisan | Montréal, QC, Canada |  |
| 1 | twoertwein | Torsten Wörtwein |  |  |
| 2 | rhshadrach | Richard Shadrach | Cincinnati, OH | 84.51 |
| 3 | TomAugspurger | Tom Augspurger |  | @microsoft |
| 6 | nafarya | Danil Iashchenko | London |  |
