# Django 4.1 Contributors

Skip to: [Scope](#scope), [Assumptions](#assumptions), [Data Gathering](#data-gathering), [Analysis](#analysis).

---

Who contributed to Django 4.1?

This answer should be easy to get, surely, because Django is hosted on GitHub. 

Well, yes, but more goes into a release of Django than just GitHub. There's also the separate ticketing system, Trac, and translations from Transifex, just to name two of the major systems involved. 

`django/django` also uses an interesting way of cutting releases, as described in [Carlton's notes](https://noumenal.es/posts/what-is-django-4/zj2/) from when he tried to answer this question. 

To get a more complete view of the information, we need to get data from a few systems. 

To start: which version do we want to get the information for? In this case, 4.1. 

Note that this data may still be in flux: 4.1 is the current minor version, and its data will keep changing until it reaches end of life some time in 2023 or thereabouts. 

There's also the information in the commits themselves: 

* The author and the committer (not always the same person)
* The Trac issue (often in the commit message itself)
* Any manual thanks (often in the case of security fixes) as part of the message
* Any discussions in attached pull requests. 

And from there, the Trac data, particularly the discussions in the ticket itself. But a ticket may have been open for longer than the release window, so let's initially limit this to comments made on the ticket since the release was branched off. 

We also add the same sort of data for commits in the branch of `django/django-docs-translations`, where the contributors are recorded in the commits themselves. 

---


## Defining Scope  <a class="anchor" id="scope"></a>

Before we can start analysing, we need to gather the data. 

Based on the original blog post, we first need to determine the start and stop commit range for the branch in question. 

To do that, let's get set up: 

 * Clone a local copy of the django repo,

In [1]:
!git clone https://github.com/django/django django-codebase

fatal: destination path 'django-codebase' already exists and is not an empty directory.


 * Install the `GitPython` package,

In [2]:
import sys
!{sys.executable} -m pip install GitPython


[notice] A new release of pip available: 22.2.2 -> 22.3.1
[notice] To update, run: pip install --upgrade pip


 * and set up for analysis

In [3]:
from git import Repo
repo = Repo("django-codebase")

We want to target the 4.1 release: 

In [4]:
target_release = "4.1"
previous_release = "4.0"  # semver -1

So using Carlton's method, let's get the start and end commits, and their merge base: 

In [5]:
start_commit = repo.commit(previous_release)
end_commit = repo.commit(target_release)
merge_base = repo.merge_base(start_commit, end_commit)[0]

commits = list(repo.iter_commits(str(merge_base) + ".." + str(end_commit)))

We can check the start and end commits to confirm we're in the right range, and about the right commit number: 

In [6]:
from datetime import datetime

for commit in [start_commit, end_commit]: 
    print(datetime.fromtimestamp(commit.authored_date), commit.author, commit.message)
    
print("Commit count:", len(commits))

2021-12-07 20:07:32 Mariusz Felisiak [4.0.x] Bumped version for 4.0 release.

2022-08-03 18:33:01 Carlton Gibson [4.1.x] Bumped version for 4.1 release.

Commit count: 829


This looks about right. 
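
To make the range concrete: `merge_base..end_commit` selects the commits reachable from the end commit that are not reachable from the merge base. The following is a minimal sketch of that idea on a made-up commit graph. The commit names and helper functions are hypothetical, and this is not how GitPython implements it.

```python
def ancestors(graph, tip):
    """Return every commit reachable from `tip`, including `tip` itself."""
    seen = set()
    stack = [tip]
    while stack:
        commit = stack.pop()
        if commit in seen:
            continue
        seen.add(commit)
        stack.extend(graph[commit])
    return seen


def commit_range(graph, exclude, include):
    """Mimic `exclude..include`: reachable from `include` but not from `exclude`."""
    return ancestors(graph, include) - ancestors(graph, exclude)


# Hypothetical history, stored as child -> list of parents.
toy_graph = {
    "root": [],
    "base": ["root"],          # where the two release branches diverge
    "old-1": ["base"],
    "old-release": ["old-1"],  # tip of the previous release branch
    "new-1": ["base"],
    "new-2": ["new-1"],
    "new-release": ["new-2"],  # tip of the target release branch
}

in_scope = commit_range(toy_graph, "base", "new-release")
print(sorted(in_scope))
```

Commits that exist only on the previous release branch fall outside the range, which is why the merge base, rather than the previous release tag, is used as the starting point.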

---


# Assumptions <a class="anchor" id="assumptions"></a>



One of the assumptions we're making here is that we don't just want a list of the authors of the commits. 

We could already get this data, very easily. 

But more work goes into releasing open source software than just authoring code. 

Django uses a separate issue tracking system, [Trac](https://code.djangoproject.com/). 

Django also uses a separate system for handling translations, [Transifex](https://www.transifex.com/django/). 

For our purposes, here are our assumptions: 


Commits: 

* The commits in the release branch are the scope of the changes that we will focus on for a release. 
* The authors of those commits are contributors. 

Associating Tickets: 

* For any commit in scope, the linked tickets are associated work. 
* The state of those tickets is not in question, because if work happened to make a commit, it counts. 

Tickets: 

* Any non-trivial interaction on any linked ticket is a candidate for inclusion, where activity on a ticket will be further analysed. 
  
Associating Pull Requests: 

* Where tickets have linked GitHub Pull Requests, the linked pull requests are associated work. 
* Those pull requests may have additional interactions (e.g. code review, comments). 

Pull Requests: 

* Any non-trivial interaction on any linked pull request is a candidate for inclusion, where activity on a pull request will be further analysed. 


**Note**: Here, "non-trivial" interactions are those that represent work done. How much any one interaction should count is open to argument, and the raw number of interactions is not the measure: a single high-quality interaction could count, while many spam interactions would not. These will be analysed later. 
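
As an illustration only, a first-pass filter for "non-trivial" activity might look like the sketch below. The rule is an assumption made for this example, not a decision taken in this notebook: changes to the `cc` field and empty comments are treated as trivial. The records are made up, using the `name`, `change_type` and `new_value` keys.

```python
# Hypothetical rule: these Trac fields are treated as bookkeeping, not work.
TRIVIAL_FIELDS = {"cc"}


def is_non_trivial(change):
    """Return True when a changelog entry looks like work done (assumed rule)."""
    if change["change_type"] in TRIVIAL_FIELDS:
        return False
    # An empty comment carries no content of its own.
    if change["change_type"] == "comment" and not change["new_value"].strip():
        return False
    return True


# Made-up changelog entries.
sample_changes = [
    {"name": "alice", "change_type": "comment", "new_value": "Reproduced on main."},
    {"name": "bob", "change_type": "cc", "new_value": "bob"},
    {"name": "carol", "change_type": "comment", "new_value": ""},
    {"name": "dave", "change_type": "status", "new_value": "closed"},
]

kept = [c["name"] for c in sample_changes if is_non_trivial(c)]
print(kept)
```

A rule like this only removes obvious noise; judging the quality of what remains still needs a human.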

---

## Gathering Data <a class="anchor" id="data-gathering"></a>

### From `git` 

With our local git clone, we can go through all the commits in our scope and gather the information about the associated tickets. 

We're using the assumption that ticket numbers appear in git messages. 

In [7]:
import re 
def get_git_commits(commit):
    git_commits.append(
        {
            "django_version": target_release,
            "commit_sha": commit.hexsha,
            "datetime": commit.authored_date,
            "author": commit.author.name,
            "author_email": commit.author.email,
            "committer": commit.committer.name,
            "committer_email": commit.committer.email,
            "message": commit.message,
        }
    )

    # Get all ticket references in the message (e.g. "#12345" -> "12345").
    # A raw string avoids the invalid "\#" escape, and "+" skips a bare "#".
    tickets = re.findall(r"#([0-9]+)", commit.message)

    for ticket in tickets:
        if ticket:
            git_trac_links.append(
                {"commit_sha": commit.hexsha, "trac_ticket_id": ticket}
            )

            
git_commits = []
git_trac_links = []
tickets = [] 
            
for commit in commits:
    get_git_commits(commit)
    
# Get unique list
tickets = list(set([k["trac_ticket_id"] for k in git_trac_links]))

print("Git Commits:", len(git_commits))
print("Git Trac Links:", len(git_trac_links))
print("Tickets:", len(tickets))

Git Commits: 829
Git Trac Links: 563
Tickets: 402


So we now have our associated tickets. 
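
To show what the ticket extraction does on its own, here is a small self-contained sketch. The commit messages are made up in the style of Django commit subjects, and `ticket_ids` is a hypothetical helper name.

```python
import re

TICKET_PATTERN = re.compile(r"#(\d+)")


def ticket_ids(message):
    """Return the ticket numbers referenced in a commit message, in order."""
    return TICKET_PATTERN.findall(message)


# Made-up messages in the style of Django commit subjects.
samples = [
    "Fixed #11111 -- Corrected an example.",
    "Refs #22222, #33333 -- Added tests.",
    "Bumped version for release.",
]

for message in samples:
    print(ticket_ids(message))
```

Any `#` followed by digits is counted, so a message that mentions a GitHub pull request number in the same form would be picked up as a ticket too.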

---

### From Trac
From here, we need to get the information for each of these tickets out of Trac. Since this operation is expensive, we'll [cache the results](https://realpython.com/caching-external-api-requests/). 

In [8]:
%%capture
import sys
!{sys.executable} -m pip install requests "requests-cache[all]" tqdm ipywidgets

In [9]:
import requests
from requests_cache import CachedSession

# Note: POST may be ignored by default! So ensure we cache that. 
session = CachedSession('api_cache', backend='sqlite', allowable_methods=('GET', 'POST'))

With that set up, we'll now go through all the Trac tickets we found: 

In [10]:
import json

# Shout out to John Sandall https://twitter.com/John_Sandall/status/1573711570894462977 
from tqdm.notebook import tqdm, trange

def get_trac_details(ticket_no):
    
    ticket_comments = []
    
    # Shout out to rixx https://gist.github.com/rixx/422392d2aa580b5d286e585418bf6915 
    resp = session.post(
        DJANGO_TRAC,
        data=json.dumps(
            {"method": "ticket.get", "id": ticket_no, "params": [ticket_no]}
        ),
        headers={"Content-Type": "application/json"},
    )

    data = resp.json()["result"][3]
    
    ticket = {
        "ticket_id": ticket_no,
        "status": data["status"],
        "reporter": data["reporter"],
        "resolution": data["resolution"],
        "description": data["description"],
    }

    # struct ticket.changeLog(int id, int when=0)
    # Return the changelog as a list of tuples of the form
    # (time, author, field, oldvalue, newvalue, permanent).
    response = session.post(
        DJANGO_TRAC,
        data=json.dumps(
            {"method": "ticket.changeLog", "id": ticket_no, "params": [ticket_no]}
        ),
        headers={"Content-Type": "application/json"},
    )
    
    changes = response.json()["result"]

    for change in changes:
        ticket_comments.append(
            {
                "ticket_id": ticket_no,
                "datetime": change[0]["__jsonclass__"][1],
                "name": change[1],
                "change_type": change[2],
                "old_value": change[3],
                "new_value": change[4],
            }
        )
    return ticket, ticket_comments



trac_tickets = []
trac_ticket_comments = []

DJANGO_TRAC = "https://code.djangoproject.com/jsonrpc"

for ticket_no in tqdm(tickets): 
    ticket, ticket_comments = get_trac_details(ticket_no)
    trac_tickets.append(ticket)
    trac_ticket_comments += ticket_comments

print("Trac Tickets:", len(trac_tickets))
print("Trac Ticket Comments:", len(trac_ticket_comments))

  0%|          | 0/402 [00:00<?, ?it/s]

Trac Tickets: 402
Trac Ticket Comments: 9799


We now have a list of all the Trac tickets and comments associated with commits in our target release. 
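
The introduction suggested limiting ticket activity to changes made since the release was branched off. A minimal sketch of that filter follows. It assumes the `datetime` values are ISO 8601 strings without a timezone, which should be checked against the real data. The cutoff date and the records are made up.

```python
from datetime import datetime


def changes_since(changes, cutoff):
    """Keep changelog entries whose timestamp is at or after `cutoff`."""
    kept = []
    for change in changes:
        when = datetime.fromisoformat(change["datetime"])
        if when >= cutoff:
            kept.append(change)
    return kept


# Hypothetical cutoff; in practice this would be the date of the branch point.
cutoff = datetime(2021, 12, 7)

# Made-up changelog entries.
sample = [
    {"ticket_id": "1", "datetime": "2015-03-01T10:00:00", "name": "alice"},
    {"ticket_id": "1", "datetime": "2022-02-14T09:30:00", "name": "bob"},
]

recent = changes_since(sample, cutoff)
print([c["name"] for c in recent])
```

The trade-off is that earlier work on a long-lived ticket is no longer counted, even though it may have shaped the eventual fix.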

----

### From GitHub

From here, we can check for additional interactions on GitHub. 

Django's Trac uses a [custom patch](https://github.com/django/code.djangoproject.com/blob/main/trac-env/htdocs/tickethacks.js#L38) to use the GitHub API to search for Pull Requests with the linked tickets, much like we did for associating commits with Trac tickets. 

We can use this same GitHub Search API to get the list of associated pull requests, then all the interactions on those pull requests. 

In [11]:
%load_ext dotenv
%dotenv

In [12]:
# Load a GitHub token to get better rate limits. 
import os
GITHUB_TOKEN = os.getenv("GITHUB_TOKEN")

In [13]:
# Check your rate limits. 

# https://docs.github.com/en/rest/rate-limit 
# curl \
#  -H "Accept: application/vnd.github+json" \
#  -H "Authorization: Bearer <YOUR-TOKEN>" \
#  https://api.github.com/rate_limit

from datetime import datetime
from pprint import pprint

with session.cache_disabled():
    resp = session.get("https://api.github.com/rate_limit",
            headers={
                "Authorization": f"Bearer {GITHUB_TOKEN}",
                "Accept": "application/vnd.github.v3.raw",
            }
        )
if "message" in resp.json().keys():
    print(resp.json())
else: 

    pprint(resp.json())
    data = resp.json()["resources"]
    now = datetime.now()

    for limit_type in ["core", "search"]: 
        d = data[limit_type]
        print(f"GitHub {limit_type} API limit:", d["used"], "/", d["limit"], ", resets", datetime.fromtimestamp(d["reset"]) )


{'rate': {'limit': 5000, 'remaining': 4996, 'reset': 1668902241, 'used': 4},
 'resources': {'actions_runner_registration': {'limit': 10000,
                                               'remaining': 10000,
                                               'reset': 1668902532,
                                               'used': 0},
               'code_scanning_upload': {'limit': 1000,
                                        'remaining': 1000,
                                        'reset': 1668902532,
                                        'used': 0},
               'core': {'limit': 5000,
                        'remaining': 4996,
                        'reset': 1668902241,
                        'used': 4},
               'dependency_snapshots': {'limit': 100,
                                        'remaining': 100,
                                        'reset': 1668898992,
                                        'used': 0},
               'graphql': {'limit': 5000,
         

In [14]:
import time 

def github_api(uri):
    resp = session.get(
        "https://api.github.com" + uri,
        headers={
            "Authorization": f"Bearer {GITHUB_TOKEN}",
            "Accept": "application/vnd.github+json",
        }
    )

    if resp.status_code != 200:
        if "x-ratelimit-reset" in resp.headers:
            wait_seconds = int(resp.headers.get("x-ratelimit-reset")) - int(time.time())
            wait_minutes = int(wait_seconds / 60)
            print(f"Rate limit expired. Wait {wait_minutes} minutes (or {wait_seconds} seconds).")
        raise ValueError(resp.json()["message"])

    return resp.json()

def search_for_pull_requests(ticket_id):
    data = github_api(
        "/search/issues?q=repo:django/django+in:title+type:pr+"
        + "%23" + ticket_id + "%20" 
        + "+%23"+ ticket_id + "%2C" 
        + "+%23"+ ticket_id + "%3A" 
        + "+%23"+ ticket_id + "%29"
    )["items"]

    return [x["number"] for x in data]


def get_pull_request(pr_id): 
    data = github_api(f"/repos/django/django/pulls/{pr_id}")
    
    pr = {"id": data["number"], 
          "state": data["state"], 
          "user": data["user"]["login"], 
         }
    
    return pr

pull_request_ids = []
pull_requests = []

for ticket_no in tqdm(tickets): 
    pull_request_ids += search_for_pull_requests(ticket_no)

pull_request_ids = list(set(pull_request_ids))

for pr_id in tqdm(pull_request_ids):
    pull_requests.append(get_pull_request(pr_id))

print("Pull requests:", len(pull_requests))

  0%|          | 0/402 [00:00<?, ?it/s]

  0%|          | 0/756 [00:00<?, ?it/s]

Pull requests: 756
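
The search query above is assembled from hand-encoded fragments (`%23` for `#`, `%20` for a space, and so on). As a sketch, the same string can be built with `urllib.parse.quote`, which may make the intent easier to read. `pull_request_search_uri` is a hypothetical helper name.

```python
from urllib.parse import quote


def pull_request_search_uri(ticket_id):
    """Build the search URI for pull requests whose title mentions a ticket."""
    # In a title, a ticket reference is followed by a space, comma, colon or ")".
    terms = [quote(f"#{ticket_id}{suffix}") for suffix in (" ", ",", ":", ")")]
    query = "repo:django/django+in:title+type:pr+" + "+".join(terms)
    return "/search/issues?q=" + query


print(pull_request_search_uri("12345"))
```

Matching the character after the number keeps a search for one ticket from also returning titles that mention a longer ticket number starting with the same digits.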


For all these pull requests, we need to get the associated comments.  

In [15]:
def get_comments_from_pull_request(pull_request_id):
    comments = []

    # Review comments (inline comments on the diff)
    data = github_api(f"/repos/django/django/pulls/{pull_request_id}/comments")

    for record in data:
        comments.append(
            {
                "user": record["user"]["login"],
                "commit_id": record["commit_id"],
                "message": record["body"],
                "pull_request": pull_request_id
            }
        )

    # Issue comments (general conversation on the pull request)
    data = github_api(f"/repos/django/django/issues/{pull_request_id}/comments")

    for record in data:
        comments.append(
            {
                "user": record["user"]["login"],
                "commit_id": None,
                "message": record["body"],
                "pull_request": pull_request_id,
            }
        )

    return comments

pr_comments = []

for request in tqdm(pull_requests):
    pr_comments += get_comments_from_pull_request(request["id"])

print("Pull Request Comments:", len(pr_comments))


  0%|          | 0/756 [00:00<?, ?it/s]

Pull Request Comments: 6263
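
One caveat: GitHub list endpoints are paginated, returning 30 items per page by default, so a single request per pull request may miss comments on busy pull requests. GitHub signals further pages through the `Link` response header. The sketch below parses such a header by hand. The example header is made up, and the `requests` library also exposes the parsed form as `resp.links`.

```python
import re

LINK_PATTERN = re.compile(r'<([^>]+)>;\s*rel="([^"]+)"')


def next_page_url(link_header):
    """Return the URL marked rel="next" in a Link header, or None."""
    if not link_header:
        return None
    for url, rel in LINK_PATTERN.findall(link_header):
        if rel == "next":
            return url
    return None


# Made-up header in the shape GitHub uses for paginated responses.
example = (
    '<https://api.github.com/repositories/1/issues/2/comments?page=2>; rel="next", '
    '<https://api.github.com/repositories/1/issues/2/comments?page=5>; rel="last"'
)

print(next_page_url(example))
print(next_page_url(None))
```

Following the `next` URL until it disappears, or requesting larger pages with `per_page=100`, would give a more complete count.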


Now, we have all the data we need to start analysing. 

---

# Analysis <a class="anchor" id="analysis"></a>
 
 
Our analysis is going to be based on our [assumptions](#assumptions). 

TODO: settle on a definition of a contribution. For comparison, the [CNCF devstats README](https://github.com/cncf/devstats/blob/master/README_K8s.md) uses: "Contribution: a review, comment, commit, issue or PR".

To get started, we'll set up pandas

In [16]:
%%capture
import sys
!{sys.executable} -m pip install pandas

and convert our lists to DataFrames

In [17]:
import pandas as pd

git_commits_df = pd.DataFrame(git_commits)
print("Git Commit keys:", ", ".join(list(git_commits_df.keys())))

git_trac_links_df = pd.DataFrame(git_trac_links)
pull_requests_df = pd.DataFrame(pull_requests)
print("Pull Request keys:", ", ".join(list(pull_requests_df.keys())))
pull_request_comments_df = pd.DataFrame(pr_comments)
print("Pull Request Comments keys:", ", ".join(list(pull_request_comments_df.keys())))
trac_tickets_df = pd.DataFrame(trac_tickets)
print("Trac Ticket keys:", ", ".join(list(trac_tickets_df.keys())))
trac_ticket_comments_df = pd.DataFrame(trac_ticket_comments)
print("Trac Ticket Comments keys:", ", ".join(list(trac_ticket_comments_df.keys())))

Git Commit keys: django_version, commit_sha, datetime, author, author_email, committer, committer_email, message
Pull Request keys: id, state, user
Pull Request Comments keys: user, commit_id, message, pull_request
Trac Ticket keys: ticket_id, status, reporter, resolution, description
Trac Ticket Comments keys: ticket_id, datetime, name, change_type, old_value, new_value


For starters, who are our Git Commit Authors:

In [18]:
git_commits_df.author.value_counts()

Mariusz Felisiak    205
Carlton Gibson       60
Adam Johnson         40
David Smith          25
Jacob Walls          24
                   ... 
Sage Abdullah         1
Biel Frontera         1
David Sanders         1
Adrian Torres         1
Cleiton Lima          1
Name: author, Length: 213, dtype: int64

And who are our Pull Request Authors:

In [19]:
pull_requests_df.user.value_counts()

felixxm            145
smithdc1            37
carltongibson       36
claudep             27
jacobtylerwalls     20
                  ... 
ChihSeanHsu          1
vishalpandeyvip      1
alvaromlg            1
omerfarukabaci       1
mzjp2                1
Name: user, Length: 245, dtype: int64

This is where things start to break: git commits record authors by name, while GitHub records users by login (alias), so the same person can appear under two identifiers. 

In [23]:
github_users = []

def get_github_user(user):
    data = github_api(f"/users/{user}")
    return {"login": data["login"], 
           "name": data["name"]}

    
for user in tqdm(list(pull_requests_df.user.unique())):
    github_users.append(get_github_user(user))
    
# TODO get full github user list (RATELIMITED)
for user in tqdm(list(pull_request_comments_df.user.unique())):
    github_users.append(get_github_user(user))

print("GitHub users:", len(github_users))
github_users_df = pd.DataFrame(github_users)
print("GitHub keys:", github_users_df.keys())
github_users_df

  0%|          | 0/245 [00:00<?, ?it/s]

  0%|          | 0/321 [00:00<?, ?it/s]

Rate limit expired. Wait 51 minutes (or 3079 seconds).


ValueError: Not Found
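
The output above pairs a rate-limit message with a `Not Found` error. That suggests the message is printed for any failed response that carries the reset header, not only when the limit is exhausted. The sketch below shows one way to tell the two cases apart, using the `x-ratelimit-remaining` and `x-ratelimit-reset` headers that GitHub sends. `describe_failure` is a hypothetical helper and the header values are made up.

```python
import time


def describe_failure(status_code, headers, now=None):
    """Return a short explanation for a non-200 GitHub response."""
    now = int(time.time()) if now is None else now
    remaining = headers.get("x-ratelimit-remaining")
    reset = headers.get("x-ratelimit-reset")
    # Only blame the rate limit when no requests remain in the window.
    if remaining == "0" and reset is not None:
        wait_seconds = max(int(reset) - now, 0)
        return f"rate limited, retry in {wait_seconds} seconds"
    if status_code == 404:
        return "not found"
    return f"request failed with status {status_code}"


print(describe_failure(404, {"x-ratelimit-remaining": "4990", "x-ratelimit-reset": "2000"}, now=1000))
print(describe_failure(403, {"x-ratelimit-remaining": "0", "x-ratelimit-reset": "2000"}, now=1000))
```

A `Not Found` for a user lookup may simply mean the account has since been renamed or deleted, in which case skipping that user is probably more useful than stopping the whole loop.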

Then, we can start mapping things together.

In [21]:
pull_request_comments_df.user.value_counts()

felixxm          2163
carltongibson     493
ngnpope           291
smithdc1          239
timgraham         214
                 ... 
n2ygk               1
flaeppe             1
luzfcb              1
pifantastic         1
mzjp2               1
Name: user, Length: 321, dtype: int64

In [22]:
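def get_name(username):
    # Match the login exactly; str.contains would also match substrings
    # and interpret the username as a regular expression.
    name = github_users_df[github_users_df["login"] == username].drop_duplicates()
    if len(name) == 1 and name["name"].item():
        return name["name"].item()
    else:
        # name doesn't exist in known github users, so return self.
        return username

def get_username(name):
    # Exact match on the display name, for the same reason as above.
    user = github_users_df[github_users_df["name"] == name].drop_duplicates()
    if len(user) == 1:
        return user["login"].item()
    else:
        return name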

for username in pull_request_comments_df.user.unique(): 
    print(username, "(", get_name(username), ")")
    #get_name(username)


NameError: name 'github_users_df' is not defined

## Sample
Let's take an example user, and check all their contributions: 

In [None]:
def user_contributions(user): 
    def dooutput(msg, df): 
        print("\n" + msg + ":", len(df))
        if len(df) < 10 and len(df) > 0: 
            print(df)
    
    
    name = user 
    mapping = github_users_df[github_users_df["login"]==user]
    if len(mapping) > 0: #.any(): 
        name = mapping["name"].item()
    
    print(f"## Data for {user} ({name})")
    
    gcs = git_commits_df[git_commits_df["author"].isin([name])][["author", "message"]]
    dooutput("Git Commits", gcs)
          
    prs = pull_requests_df[pull_requests_df["user"].isin([user])]
    dooutput("Pull Requests Authored", prs)
    
    prcs = pull_request_comments_df[pull_request_comments_df["user"].isin([user])][["pull_request", "message"]]
    dooutput("Pull Request comments", prcs)
    
    ttcs = trac_ticket_comments_df[trac_ticket_comments_df["name"].isin([user])][["ticket_id", "change_type"]]
    dooutput("Trac Ticket comments", ttcs)
    
user_contributions("glasnt")

So let's expand on this: getting a list of all users and their total number of interactions. 

In [None]:
pull_requests_df.user.value_counts()
pull_request_comments_df.user.value_counts()
git_commits_df.author.value_counts()
trac_ticket_comments_df.name.value_counts()
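
The four counts above are displayed separately. As a sketch of how they might be combined into one total per identifier, the following uses `collections.Counter` on plain dictionaries. The counts are made up, and each dictionary stands in for one `value_counts()` result.

```python
from collections import Counter


def total_interactions(*sources):
    """Sum per-user counts across several {user: count} mappings."""
    totals = Counter()
    for counts in sources:
        totals.update(counts)
    return totals


# Made-up counts standing in for the four value_counts() results.
pr_counts = {"alice": 3, "bob": 1}
pr_comment_counts = {"alice": 10, "carol": 2}
commit_counts = {"Alice Example": 4}
trac_counts = {"bob": 5}

totals = total_interactions(pr_counts, pr_comment_counts, commit_counts, trac_counts)
print(totals.most_common())
```

Note that `alice` and `Alice Example` stay separate here: totals are only meaningful once names and logins have been reconciled.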

## Security Thanks

Another dataset we need to add in is security acknowledgements. 

Security reports are [separate to tickets](https://docs.djangoproject.com/en/dev/internals/security/), and thus don't follow the normal git flow. They are handled by the security team, and [credit is given separately](https://twitter.com/carltongibson/status/1588455611049676800).

So, we need to manually search for that credit. 

In [None]:
pd.set_option('display.max_colwidth', None)
git_commits_df[git_commits_df.message.str.contains("Thanks")][["author", "message"]]

From here, let's isolate those thanks mentions (shout out to [https://pythex.org/](https://pythex.org/))!

In [None]:
git_commits_df.message.str.extract(r'Thanks ([a-zA-Z ]+) for').dropna()

The problem you're seeing here is that we're dropping records: 

In [None]:
contains_records = len(git_commits_df[git_commits_df.message.str.contains("Thanks")][["author", "message"]])
extracts_records = len(git_commits_df.message.str.extract(r'Thanks ([a-zA-Z ]+) for').dropna())

print("Compare: contains has", contains_records, "and extracts has", extracts_records)
print("Do they match?", contains_records == extracts_records)


The issue here is that some names contain characters outside `[a-zA-Z ]`, such as accented letters. 

In [None]:
example_user = "Michal"
example_commits = git_commits_df[git_commits_df.message.str.contains(example_user).dropna()]["message"]
print(example_commits)

So we need to make sure we're using a better regex.

In [None]:
w_filter = git_commits_df.message.str.extract(r'Thanks ([\w ]+) for').dropna()
print(w_filter)
print("Do they match?", contains_records == len(w_filter))

But even then we're missing some! Such as: 



In [None]:
git_commits_df[git_commits_df.message.str.contains("Splunk")][["author", "message"]]

So let's just open it up to any character. 

In [None]:
dot_filter = git_commits_df.message.str.extract(r'Thanks (.*) for').dropna()

print(dot_filter)
print("Do they match?", contains_records == len(dot_filter))

But then we're still missing some!

😨
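
Two plausible causes, shown here on made-up messages rather than the real data: `.` does not match a newline, so a thanks line that wraps is missed, and some messages thank people without a following "for". The looser pattern below is an assumption-laden sketch, not a complete parser. For instance, it would cut a name short at an initial such as "J.".

```python
import re

SIMPLE = re.compile(r"Thanks (.*) for")
# DOTALL lets the name span a line break; the match ends at " for" or a period.
LOOSE = re.compile(r"Thanks (?:to )?(.+?)(?:\s+for\b|\.)", re.DOTALL)


def thanked(pattern, message):
    """Return the thanked names found by `pattern`, with whitespace collapsed."""
    return [" ".join(match.split()) for match in pattern.findall(message)]


# Made-up commit messages.
wrapped = "Fixed #2 -- Changed something.\n\nThanks Bob Example and Carol\nExample for reviews."
no_for = "Fixed #3 -- Changed more.\n\nThanks to Dave Example."

for message in (wrapped, no_for):
    print(thanked(SIMPLE, message), thanked(LOOSE, message))
```

Even a looser pattern leaves free-form acknowledgements unparsed, so a manual pass remains necessary.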

So let's just do it manually. 

In [None]:
def flatten(l): 
    return [item for sublist in l for item in sublist]

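thanks_messages = git_commits_df[git_commits_df.message.str.contains("Thanks")]["message"].values.tolist()

print("Messages to parse:", len(thanks_messages))
thanks = []          # one entry per parsed "Thanks ... for" phrase
complex_thanks = []  # messages that need manual parsing

for msg in thanks_messages: 
    # Use a separate name for the matches so the `thanks` accumulator isn't overwritten.
    matches = re.findall("Thanks (.*) for", msg)
    if not matches:
        complex_thanks.append(msg)
        continue
    for thank in matches:
        # Drop a leading "to " (as in "Thanks to X") without touching the name itself.
        thank = re.sub(r"^to ", "", thank)
        if "," in thank:
            thank = [name.strip() for name in thank.split(",")]
        print(thank)
        thanks.append(thank)

# TODO BETTER PARSING
        
print(complex_thanks)
    
# Count each unparsed message as one entry.
thanks.extend(complex_thanks)
    
print(len(thanks))
print("Do they match?", contains_records == len(thanks))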


# Results <a class="anchor" id="results"></a>

With the data we have, we can now generate a unique list of all the humans who contributed. 

In [None]:

print("Pull Request keys:", ", ".join(list(pull_requests_df.keys())))
print("Pull Request Comments keys:", ", ".join(list(pull_request_comments_df.keys())))
print("Trac Ticket keys:", ", ".join(list(trac_tickets_df.keys())))
print("Trac Ticket Comments keys:", ", ".join(list(trac_ticket_comments_df.keys())))

In [None]:
prs = pull_requests_df.user.unique().tolist()
prcs = pull_request_comments_df.user.unique().tolist()
tts = trac_tickets_df.reporter.unique().tolist()
ttcs = trac_ticket_comments_df.name.unique().tolist()
gc = git_commits_df.author.unique().tolist()
# Include pull request authors (prs); previously prcs was added twice.
_all = prs + prcs + tts + ttcs + gc
#TODO add transifex, thanks

# cleanup emails
all_users = []
for g in _all:
    if g:
        all_users.append(g.split("<")[0].strip())


all_users = list(set(all_users)) # make unique
all_users = sorted(all_users, key=str.casefold) # sort

# TODO fold usernames and names

all_names = []
counter = 0  # number of identifiers printed

# TODO include get_username with get_name to get unique pairs

for user in all_users:
    name = get_name(user)
    #if name in all_names:
    #    continue
    all_names.append(name)
    counter += 1
    if name == user:
        print(user)
    else:
        print(f"{name} ({user})")
print(counter)
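
The remaining TODO is to fold logins and display names into one entry per person. Below is a minimal sketch of one way to do that with plain dictionaries. The data is made up, `fold_identities` is a hypothetical helper, and real data would need more care (two people sharing a display name, for example).

```python
def fold_identities(identifiers, login_to_name):
    """Collapse logins and display names that refer to the same person."""
    # Reverse lookup so a display name finds its login, compared case-insensitively.
    name_to_login = {
        name.casefold(): login for login, name in login_to_name.items() if name
    }
    people = {}
    for identifier in identifiers:
        key = identifier.casefold()
        if identifier in login_to_name:
            login = identifier
        else:
            login = name_to_login.get(key)
        if login is None:
            # Unknown identifier: keep it as its own person.
            people.setdefault(key, identifier)
        else:
            name = login_to_name[login]
            # Show "Name (login)" when a display name is known, else just the login.
            people[login.casefold()] = f"{name} ({login})" if name else login
    return sorted(people.values(), key=str.casefold)


# Made-up data: one person appears both by login and by name.
login_to_name = {"aexample": "Alice Example", "bob": None}
identifiers = ["aexample", "Alice Example", "bob", "Carol Example"]

for person in fold_identities(identifiers, login_to_name):
    print(person)
```

Keying on the login makes the two spellings of the same person land on a single entry, while identifiers with no known login are kept as they are.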
