# Django 4.1 Contributors

Skip to: [Scope](#scope), [Assumptions](#assumptions), [Data Gathering](#data-gathering), [Analysis](#analysis).

---

Who contributed to Django 4.1?

This answer should be easy to get, surely, because Django is hosted on GitHub. 

Well, yes, but more goes into a release of Django than just GitHub. There's also the separate ticketing system, Trac, and translations from Transifex, just to name two of the major systems involved. 

`django/django` also uses an interesting way of cutting releases, as described in [Carlton's notes](https://noumenal.es/posts/what-is-django-4/zj2/) from when he tried to answer this question. 

To get a more complete view of the information, we need to get data from a few systems. 

To start: which version do we want to get the information for? In this case, 4.1. 

Note, of course, that this data may still be in flux: 4.1 is still the current minor version, and its data will keep changing until it is EOL'd some time in 2023 or thereabouts. 

There's also the information in the commits themselves: 

* The author and the committer (not always the same person)
* The Trac issue (often in the commit message itself)
* Any manual thanks (often in the case of security fixes) as part of the message
* Any discussions in attached pull requests

From there, we get the Trac data, particularly the discussion on each ticket. A ticket may have been open for longer than the release window, so initially we limit this to comments made on the ticket since the release was branched off. 

We also add the same sort of data for commits in the branch from `django/django-docs-translations`, where the contributors are listed in the commit itself. 

---


## Defining Scope  <a class="anchor" id="scope"></a>

Before we can start analysing, we need to gather the data. 

Based on the original blog post, we first need to determine the start and stop commit range for the branch in question. 

To do that, let's get set up: 

 * Clone a local copy of the django repo,

In [16]:
!git clone https://github.com/django/django django-codebase

Cloning into 'django-codebase'...
remote: Enumerating objects: 502374, done.
remote: Counting objects: 100% (173/173), done.
remote: Compressing objects: 100% (108/108), done.
remote: Total 502374 (delta 77), reused 138 (delta 65), pack-reused 502201
Receiving objects: 100% (502374/502374), 224.50 MiB | 12.54 MiB/s, done.
Resolving deltas: 100% (368533/368533), done.
Updating files: 100% (6649/6649), done.


 * Install the `GitPython` package,

In [18]:
import sys
!{sys.executable} -m pip install GitPython


 * and set up for analysis

In [34]:
from git import Repo
repo = Repo("django-codebase")

We want to target the 4.1 release: 

In [35]:
target_release = "4.1"
previous_release = "4.0"  # semver -1

So, using Carlton's method, let's get the start and end commits, and their merge base: 

In [23]:
start_commit = repo.commit(previous_release)
end_commit = repo.commit(target_release)
merge_base = repo.merge_base(start_commit, end_commit)[0]

commits = list(repo.iter_commits(str(merge_base) + ".." + str(end_commit)))

We can check the start and end commits to confirm we're in the right range, and that the commit count looks about right: 

In [33]:
from datetime import datetime

for commit in [start_commit, end_commit]: 
    print(datetime.fromtimestamp(commit.authored_date), commit.author, commit.message)
    
print("Commit count:", len(commits))

2021-12-07 20:07:32 Mariusz Felisiak [4.0.x] Bumped version for 4.0 release.

2022-08-03 18:33:01 Carlton Gibson [4.1.x] Bumped version for 4.1 release.

Commit count: 829


This looks about right. 
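To illustrate what the `merge_base..end_commit` range selects, here is a minimal sketch on a toy commit graph. The commit names and the graph shape are hypothetical stand-ins for a main branch with two release branches; this is plain Python, not GitPython, and the merge-base helper is only good enough for a graph like this one where the base is unique.

```python
# Toy commit graph: each commit maps to its list of parents.
# "m*" are commits on main, "r40"/"r41" are release commits (hypothetical names).
parents = {
    "m1": [],
    "m2": ["m1"],   # the 4.0 release branch forks from main here
    "m3": ["m2"],
    "m4": ["m3"],   # the 4.1 release branch forks from main here
    "r40": ["m2"],
    "r41": ["m4"],
}


def ancestors(commit):
    """Return the commit plus everything reachable through parent links."""
    seen = set()
    stack = [commit]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def merge_base(a, b):
    """Return the most recent common ancestor (assumes it is unique)."""
    common = ancestors(a) & ancestors(b)
    return max(common, key=lambda c: len(ancestors(c)))


def commit_range(start, end):
    """Commits reachable from `end` but not from `start`, like `start..end`."""
    return ancestors(end) - ancestors(start)


base = merge_base("r40", "r41")
print(base, sorted(commit_range(base, "r41")))
```

The range picks up the work done on main after the previous release branched, plus the commits on the new release branch itself.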

---


## Assumptions <a class="anchor" id="assumptions"></a>



One of the assumptions we're making here is that we don't just want a list of the authors of the commits. 

We could already get this data. 
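For reference, a minimal sketch of that author-only tally. It assumes commit objects that expose `author.name`, as the GitPython commits used in this notebook do; the sample commits below are hypothetical stand-ins so the block runs without a repository.

```python
from collections import Counter
from types import SimpleNamespace


def count_authors(commits):
    """Tally the number of commits per author name."""
    return Counter(commit.author.name for commit in commits)


# Hypothetical stand-ins for GitPython commit objects.
sample_commits = [
    SimpleNamespace(author=SimpleNamespace(name="Ada")),
    SimpleNamespace(author=SimpleNamespace(name="Grace")),
    SimpleNamespace(author=SimpleNamespace(name="Ada")),
]

author_counts = count_authors(sample_commits)
print(author_counts.most_common())
```

In the notebook, `count_authors(commits)` would give the per-author totals for the commit range gathered above.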

But more work goes into releasing open source software than just authoring code. 

Django uses a separate issue tracking system, [Trac](https://code.djangoproject.com/). 

Django also uses a separate system for handling translations, [Transifex](https://www.transifex.com/django/). 

For our purposes, here are our assumptions: 


Commits: 

* The commits in the release branch are the scope of the changes that we will focus on for a release. 
* The authors of those commits **are** contributors. 

Associating Tickets: 

* For any commit in scope, the linked tickets are associated work. 
* The state of those tickets is not in question, because if work happened to make a commit, it counts. 

Tickets: 

* Any non-trivial interaction on any linked ticket is a candidate for inclusion, where activity on a ticket will be further analysed. 
  
Associating Pull Requests: 

* Where tickets have linked GitHub Pull Requests, the linked pull requests are associated work. 
* Those pull requests may have additional interactions (e.g. code review, comments). 

Pull Requests: 

* Any non-trivial interaction on any linked pull request is a candidate for inclusion, where activity on a pull request will be further analysed. 


**Note**: Here, "non-trivial" interactions are interactions that represent work done. Arguments can be made about how much each interaction counts. The raw number of interactions is not the measure (a single high-quality interaction could count, while many spam interactions would not). These will be analysed later. 

---

## Gathering Data <a class="anchor" id="data-gathering"></a>

### From `git` 

With our local git clone, we can go through all the commits in our scope and gather the information about the associated tickets. 

We're assuming that ticket numbers appear in commit messages. 
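To make that assumption concrete, here is a minimal sketch of pulling `#NNNN` references out of a message. The sample messages are hypothetical, written in the "Fixed #NNNN --" and "Refs #NNNN --" style that Django commits use. Note that any `#` followed by digits matches, so a reference to something other than a Trac ticket would be picked up too.

```python
import re

# Hypothetical commit messages in Django's usual style.
messages = [
    "Fixed #33333 -- Corrected an example in the docs.",
    "[4.1.x] Refs #12345, #23456 -- Added tests.",
    "Bumped version for release.",
]

# A "#" followed by one or more digits; the group captures just the number.
TICKET_RE = re.compile(r"#(\d+)")


def extract_ticket_ids(message):
    """Return the ticket numbers (as strings) referenced in a message."""
    return TICKET_RE.findall(message)


for message in messages:
    print(extract_ticket_ids(message), "<-", message)
```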

In [173]:
import re


def get_git_commits(commit):
    # Record the commit's metadata
    git_commits.append(
        {
            "django_version": target_release,
            "commit_sha": commit.hexsha,
            "datetime": commit.authored_date,
            "author": commit.author.name,
            "author_email": commit.author.email,
            "committer": commit.committer.name,
            "committer_email": commit.committer.email,
            "message": commit.message,
        }
    )

    # Get all ticket references (e.g. "#12345") in the message
    tickets = re.findall(r"#(\d+)", commit.message)

    for ticket in tickets:
        git_trac_links.append(
            {"commit_sha": commit.hexsha, "trac_ticket_id": ticket}
        )

            
git_commits = []
git_trac_links = []
tickets = [] 
            
for commit in commits:
    get_git_commits(commit)
    
# Get unique list
tickets = list(set([k["trac_ticket_id"] for k in git_trac_links]))

print("Git Commits:", len(git_commits))
print("Git Trac Links:", len(git_trac_links))
print("Tickets:", len(tickets))

Git Commits: 829
Git Trac Links: 563
Tickets: 402


So we now have our associated tickets. 

---

### From Trac
From here, we need to get the information for each of these tickets out of Trac. Since this operation is expensive, we'll [cache the results](https://realpython.com/caching-external-api-requests/). 

In [118]:
%%capture
import sys
!{sys.executable} -m pip install requests "requests-cache[all]" tqdm

In [107]:
import requests
from requests_cache import CachedSession

# Note: POST may be ignored by default! So ensure we cache that. 
session = CachedSession('api_cache', backend='sqlite', allowable_methods=('GET', 'POST'))

With that set up, we'll now go through all the Trac tickets we found: 

In [123]:
import json

# Shout out to John Sandall https://twitter.com/John_Sandall/status/1573711570894462977 
from tqdm.notebook import tqdm, trange

def get_trac_details(ticket_no):
    
    ticket_comments = []
    
    # Shout out to rixx https://gist.github.com/rixx/422392d2aa580b5d286e585418bf6915 
    resp = session.post(
        DJANGO_TRAC,
        data=json.dumps(
            {"method": "ticket.get", "id": ticket_no, "params": [ticket_no]}
        ),
        headers={"Content-Type": "application/json"},
    )

    data = resp.json()["result"][3]
    
    ticket = {
        "ticket_id": ticket_no,
        "status": data["status"],
        "reporter": data["reporter"],
        "resolution": data["resolution"],
        "description": data["description"],
    }

    # struct ticket.changeLog(int id, int when=0)
    # Return the changelog as a list of tuples of the form
    # (time, author, field, oldvalue, newvalue, permanent).
    response = session.post(
        DJANGO_TRAC,
        data=json.dumps(
            {"method": "ticket.changeLog", "id": ticket_no, "params": [ticket_no]}
        ),
        headers={"Content-Type": "application/json"},
    )
    
    changes = response.json()["result"]

    for change in changes:
        ticket_comments.append(
            {
                "ticket_id": ticket_no,
                "datetime": change[0]["__jsonclass__"][1],
                "name": change[1],
                "change_type": change[2],
                "old_value": change[3],
                "new_value": change[4],
            }
        )
    return ticket, ticket_comments



trac_tickets = []
trac_ticket_comments = {}

DJANGO_TRAC = "https://code.djangoproject.com/jsonrpc"

for ticket_no in tqdm(tickets): 
    ticket, ticket_comments = get_trac_details(ticket_no)
    trac_tickets.append(ticket)
    trac_ticket_comments[ticket_no] = ticket_comments



We now have a list of all the Trac tickets and comments associated with commits in our target release. 
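The introduction suggested initially limiting ticket activity to what happened since the release was branched off. A minimal sketch of that filter follows. It assumes the `datetime` field holds an ISO 8601 string such as `2022-02-14T12:30:00`; the sample entries and the cut-off date are hypothetical placeholders, not real data.

```python
from datetime import datetime


def comments_since(ticket_comments, since):
    """Keep only change-log entries made at or after `since`."""
    kept = []
    for comment in ticket_comments:
        # Assumption: timestamps are ISO 8601 strings without a timezone.
        when = datetime.fromisoformat(comment["datetime"])
        if when >= since:
            kept.append(comment)
    return kept


# Hypothetical entries shaped like the records built in get_trac_details.
sample_comments = [
    {"ticket_id": "1", "datetime": "2020-05-01T09:00:00", "name": "alice"},
    {"ticket_id": "1", "datetime": "2022-02-14T12:30:00", "name": "bob"},
]

# Placeholder cut-off, not the real branch date.
branch_point = datetime(2021, 9, 1)

recent = comments_since(sample_comments, branch_point)
print([comment["name"] for comment in recent])
```

In the notebook, the cut-off could be derived from the merge base commit's date, keeping in mind that the two sources may not use the same timezone.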

----

### From GitHub

From here, we can check for additional interactions on GitHub. 

Django's Trac uses a [custom patch](https://github.com/django/code.djangoproject.com/blob/main/trac-env/htdocs/tickethacks.js#L38) to search the GitHub API for pull requests that mention the linked tickets, much like we did for associating commits with Trac tickets. 

We can use this same GitHub Search API to get the list of associated pull requests, then all the interactions on those pull requests. 

In [139]:
%load_ext dotenv
%dotenv

The dotenv extension is already loaded. To reload it, use:
  %reload_ext dotenv


In [178]:
# Load a GitHub token to get better rate limits. 
import os
GITHUB_TOKEN = os.getenv("GITHUB_TOKEN")

In [181]:
# Check your rate limits. 

# https://docs.github.com/en/rest/rate-limit 
# curl \
#  -H "Accept: application/vnd.github+json" \
#  -H "Authorization: Bearer <YOUR-TOKEN>" \
#  https://api.github.com/rate_limit

from datetime import datetime

with session.cache_disabled():
    resp = session.get("https://api.github.com/rate_limit",
            headers={
                "Authorization": f"Bearer {GITHUB_TOKEN}",
                "Accept": "application/vnd.github.v3.raw",
            }
        )
data = resp.json()["resources"]
now = datetime.now()

for limit_type in ["core", "search"]: 
    d = data[limit_type]
    print(f"GitHub {limit_type} API limit:", d["used"], "/", d["limit"], ", resets", datetime.fromtimestamp(d["reset"]) )


GitHub core API limit: 80 / 5000 , resets 2022-10-10 17:04:41
GitHub search API limit: 0 / 30 , resets 2022-10-10 16:09:59


In [182]:
import time 

def github_api(uri):
    resp = session.get(
        "https://api.github.com" + uri,
        headers={
            "Authorization": f"Bearer {GITHUB_TOKEN}",
            "Accept": "application/vnd.github+json",
        }
    )

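    if resp.status_code != 200:
        # GitHub sends x-ratelimit-* headers on every response, so only report
        # a wait time when the remaining quota is actually exhausted.
        if resp.headers.get("x-ratelimit-remaining") == "0":
            wait_seconds = int(resp.headers["x-ratelimit-reset"]) - int(time.time())
            print(f"Rate limit exhausted. Wait {wait_seconds} seconds.")
        raise ValueError(resp.json()["message"])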

    return resp.json()

def get_pull_requests(ticket_id):
    data = github_api(
        "/search/issues?q=repo:django/django+in:title+type:pr+"
        + "%23" + ticket_id + "%20" 
        + "+%23"+ ticket_id + "%2C" 
        + "+%23"+ ticket_id + "%3A" 
        + "+%23"+ ticket_id + "%29"
    )["items"]

    return [x["number"] for x in data]

pull_requests = []

for ticket_no in tqdm(tickets): 
    pull_requests += get_pull_requests(ticket_no)

pull_requests = list(set(pull_requests))
print("Pull Requests:", len(pull_requests))


Pull Requests: 756
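The query string in `get_pull_requests` is percent-encoded by hand. As a minimal sketch, the same string can be built with `urllib.parse.quote`, which makes the four title patterns easier to read. The function name `build_pr_search_uri` is hypothetical, and the comment about why the suffixes are there is a guess at the intent.

```python
from urllib.parse import quote


def build_pr_search_uri(ticket_id):
    """Build the search URI for pull requests whose title mentions the ticket."""
    # Search for "#<id>" followed by a space, comma, colon or closing bracket,
    # presumably so that "#123" does not also match "#1234".
    terms = [f"#{ticket_id}{suffix}" for suffix in (" ", ",", ":", ")")]
    encoded_terms = "+".join(quote(term, safe="") for term in terms)
    return "/search/issues?q=repo:django/django+in:title+type:pr+" + encoded_terms


print(build_pr_search_uri("33333"))
```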


For all these pull requests, we need to get the associated comments. 

In [183]:
def get_comments_from_pull_request(pull_request_id):
    comments = []

    # Review comments (inline comments on the diff)
    data = github_api(f"/repos/django/django/pulls/{pull_request_id}/comments")

    for record in data:
        comments.append(
            {
                "user": record["user"]["login"],
                "commit_id": record["commit_id"],
                "message": record["body"],
                "pull_request": pull_request_id,
            }
        )

    # Issue comments (general conversation on the pull request)
    data = github_api(f"/repos/django/django/issues/{pull_request_id}/comments")

    for record in data:
        comments.append(
            {
                "user": record["user"]["login"],
                "commit_id": None,
                "message": record["body"],
                "pull_request": pull_request_id,
            }
        )

    return comments

pr_comments = []

for request in tqdm(pull_requests):
    pr_comments += get_comments_from_pull_request(request)

print("Pull Request Comments:", len(pr_comments))



Pull Request Comments: 6263
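GitHub's REST list endpoints return results a page at a time (30 items by default, up to 100 with `per_page`), so a single request per pull request may miss comments on busy pull requests. A minimal sketch of walking the pages follows. `fetch_page` is a hypothetical callable standing in for a wrapper around `github_api`, and the fake data is only there so the block runs without network access.

```python
from typing import Callable, Iterator


def paginate(fetch_page: Callable[[int], list], per_page: int = 100) -> Iterator[dict]:
    """Yield every record from a paginated list endpoint.

    `fetch_page(page)` is a hypothetical callable returning the decoded JSON
    list for a 1-based page number.
    """
    page = 1
    while True:
        records = fetch_page(page)
        yield from records
        # A short (or empty) page means there is nothing left to fetch.
        if len(records) < per_page:
            break
        page += 1


# Fake endpoint holding 250 records, served 100 at a time.
fake_records = [{"id": i} for i in range(250)]


def fake_fetch_page(page):
    start = (page - 1) * 100
    return fake_records[start:start + 100]


all_records = list(paginate(fake_fetch_page))
print(len(all_records))
```

With the notebook's helper, `fetch_page` could be something like `lambda page: github_api(f"/repos/django/django/issues/{pull_request_id}/comments?per_page=100&page={page}")`.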


Now, we have all the data we need to start analysing. 

---

## Analysis <a class="anchor" id="analysis"></a>
 
 
TODO