# Django 4.1 Contributors

Skip to: [Scope](#scope), [Assumptions](#assumptions), [Data Gathering](#data-gathering), [Analysis](#analysis).

---

Who contributed to Django 4.1?

This answer should be easy to get, surely, because Django is hosted on GitHub. 

Well, yes, but more goes into a release of Django than just GitHub. There's also the separate ticketing system and translations, just to name two of the major systems involved. 

The Django repo also uses an interesting way of cutting releases, as described in [Carlton's notes](https://noumenal.es/posts/what-is-django-4/zj2/) from when he tried to answer this question. 

To get a more complete view of the information, we need to get data from a few systems. 

To start: which version do we want to get the information for? In this case, 4.1. (Note that this data may still be in flux: 4.1 is still the current minor version, and its data will keep changing until it reaches end of life some time in 2023 or thereabouts.)

There's also the information in the commits themselves: 

* The author and the committer (not always the same person)
* The Trac issue (often in the commit message itself)
* Any manual thanks (often in the case of security fixes) as part of the message
* Any discussions in attached pull requests. 

And from there, the Trac data, particularly the discussions in the ticket itself. 

So let's dive in.

_Along the way there will also be italicised notes that mostly serve as "here's how I did the thing" (I'm very new to notebooks!)_

---


#TODO

* Remove pandas
* add rich (is supported! https://www.willmcgugan.com/blog/tech/post/rich-adds-support-for-jupyter-notebooks/)

## Defining Scope  <a class="anchor" id="scope"></a>

Before we can start analysing, we need to gather the data. 

Based on the original blog post, we first need to determine the start and stop commit range for the branch in question. 

To do that, let's get set up: 

 * Clone a local copy of the django repo
   * _this command uses `!` to call out to `git` in the shell_

In [1]:
!git clone https://github.com/django/django django-codebase

fatal: destination path 'django-codebase' already exists and is not an empty directory.


 * Install the `GitPython` package
   * _this command uses `%%capture` so as not to print the possibly lengthy output to the notebook_


In [2]:
%%capture
import sys
!{sys.executable} -m pip install GitPython

 * and set up for analysis

In [3]:
from git import Repo
repo = Repo("django-codebase")

We want to target the 4.1 release: 

In [4]:
target_release = "4.1"
previous_release = "4.0"  # semver -1

So using Carlton's method, let's get the start and end commits, and their merge base: 

In [5]:
start_commit = repo.commit(previous_release)
end_commit = repo.commit(target_release)
merge_base = repo.merge_base(start_commit, end_commit)[0]

commits = list(repo.iter_commits(str(merge_base) + ".." + str(end_commit)))

We can check the start and end commits to confirm we're in the right range, and about the right commit number: 

In [6]:
from datetime import datetime

for commit in [start_commit, end_commit]: 
    print(datetime.fromtimestamp(commit.authored_date), commit.author, "\n" + commit.message)
    
print("Commit count:", len(commits))

2021-12-07 20:07:32 Mariusz Felisiak 
[4.0.x] Bumped version for 4.0 release.

2022-08-03 18:33:01 Carlton Gibson 
[4.1.x] Bumped version for 4.1 release.

Commit count: 829


This looks about right: a roughly 8-month development window, bookended by commits from Django Fellows. 
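
To make the range concrete, here is a minimal sketch of what `merge_base..end_commit` selects, using a toy commit graph. The commit names and the `parents` mapping are invented for illustration; GitPython's `merge_base` and `iter_commits` do the equivalent work on the real repository.

```python
# Toy history: each commit maps to its list of parents (all names are hypothetical).
parents = {
    "m1": [],
    "m2": ["m1"],         # main; the 4.0 release branch forks here
    "r40": ["m2"],        # a backport on the 4.0 release branch
    "tag_4.0": ["r40"],   # version bump for 4.0
    "m3": ["m2"],         # development continues on main
    "m4": ["m3"],         # the 4.1 release branch forks here
    "tag_4.1": ["m4"],    # version bump for 4.1
}


def ancestors(commit):
    """Return the commit plus everything reachable through parent links."""
    seen, stack = set(), [commit]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def merge_base(a, b):
    """Pick the shared ancestor with the most history behind it (enough for this toy graph)."""
    common = ancestors(a) & ancestors(b)
    return max(common, key=lambda c: len(ancestors(c)))


def commits_in_range(base, tip):
    """Equivalent of `base..tip`: reachable from tip but not from base."""
    return ancestors(tip) - ancestors(base)


base = merge_base("tag_4.0", "tag_4.1")
print(base)
print(sorted(commits_in_range(base, "tag_4.1")))
```

In this toy graph the backport on the 4.0 branch is excluded from the range, while the work that landed on main between the two forks is included.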

---


# Assumptions <a class="anchor" id="assumptions"></a>



One of the assumptions we're making here is that we don't just want a list of the authors of the commits. 

We could already get this data, very easily. 

But more work goes into releasing open source software than just authoring code. 

Django uses a separate issue tracking system, [Trac](https://code.djangoproject.com/). 

Django also uses a separate system for handling translations of the documentation, [Transifex](https://www.transifex.com/django/), which are output to https://github.com/django/django-docs-translations. (Documentation itself is in the main repo.)

For our purposes, here are our assumptions: 


Commits: 

* The commits in the release branch are the scope of the changes that we will focus on for a release. 
* The authors of those commits are contributors. 

Associating Tickets: 

* For any commit in scope, the linked tickets are associated work. 
* The state of those tickets is not in question, because if work happened to make a commit, it counts. 

Tickets: 

* Any non-trivial interaction on any linked ticket is a candidate for inclusion, where activity on a ticket will be further analysed. 
  
Associating Pull Requests: 

* Where tickets have linked GitHub Pull Requests, the linked pull requests are associated work. 
* Those pull requests may have additional interactions (e.g. code review, comments). 

Pull Requests: 

* Any non-trivial interaction on any linked pull request is a candidate for inclusion, where activity on a pull request will be further analysed. 

Translations: 
 
* All contributed translations are associated work. 

**Note**: Here, "non-trivial" interactions are those that represent work done. How much any given interaction counts is open to argument. The raw number of interactions is not what matters: a single high-quality interaction could count, while many spam interactions would not. These will be analysed later. 

---

## Gathering Data <a class="anchor" id="data-gathering"></a>

### From `git` 

With our local git clone, we can go through all the commits in our scope and gather the information about the associated tickets. 

We're using the assumption that ticket numbers appear in commit messages. 

In [243]:
import re 
def get_git_commits(commit):
    git_commits.append(
        {
            "django_version": target_release,
            "commit_sha": commit.hexsha,
            "datetime": commit.authored_date,
            "author": commit.author.name,
            "author_email": commit.author.email,
            "committer": commit.committer.name,
            "committer_email": commit.committer.email,
            "message": commit.message,
        }
    )

    # Get all ticket references (e.g. "#33876") in the message.
    # TODO(glasnt): this will include "Fixed", but also "Refs" (which may include older tickets)
    # The capture group keeps only the digits, so a bare "#" never matches.
    tickets = re.findall(r"#(\d+)", commit.message)

    for ticket in tickets:
        git_trac_links.append(
            {"commit_sha": commit.hexsha, "trac_ticket_id": ticket}
        )


git_commits = []
git_trac_links = []
            
for commit in commits:
    get_git_commits(commit)
    
# Get unique list
tickets = list(set([k["trac_ticket_id"] for k in git_trac_links]))

print("Git Commits:", len(git_commits))
print("Git Trac Links:", len(git_trac_links))
print("Tickets:", len(tickets))

Git Commits: 829
Git Trac Links: 563
Tickets: 402


In [242]:
print([g for g in git_trac_links if g["trac_ticket_id"] == "23646"])

[{'commit_sha': 'd35ce682e31ea4a86c2079c60721fae171f03d7c', 'trac_ticket_id': '23646'}]


So we now have our associated tickets. 
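
The TODO in the cell above notes that "Fixed" and "Refs" references are lumped together. A minimal sketch of separating them follows, assuming commit messages use the `Fixed #N` and `Refs #N` prefixes seen in this repository; `classify_ticket_refs` is a hypothetical helper, not part of the notebook.

```python
import re

# Matches "Fixed #123" or "Refs #123" (assumes these two keywords only).
TICKET_REF = re.compile(r"\b(Fixed|Refs)\s+#(\d+)")


def classify_ticket_refs(message):
    """Group the ticket ids in a commit message by the keyword before them."""
    refs = {"Fixed": [], "Refs": []}
    for keyword, ticket_id in TICKET_REF.findall(message):
        refs[keyword].append(ticket_id)
    return refs


message = (
    "[4.1.x] Fixed #33876, Refs #32229 -- "
    "Made management forms render with div.html template."
)
print(classify_ticket_refs(message))
```

A bare `#N` with no keyword in front of it is ignored by this pattern, so it would complement the broader match used above rather than replace it.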

---

### From Trac
From here, we need to get the information for each of these tickets out of Trac. Since this operation is expensive, we'll [cache the results](https://realpython.com/caching-external-api-requests/). 

In [8]:
%%capture
import sys
!{sys.executable} -m pip install requests "requests-cache[all]" tqdm ipywidgets

In [9]:
import requests
from requests_cache import CachedSession

# Note: POST may be ignored by default! So ensure we cache that. 
session = CachedSession('api_cache', backend='sqlite', allowable_methods=('GET', 'POST'))
# Note: this will create an api_cache.sqlite file, which will contain your GitHub token (later).
# Don't commit this file!

With that set up, we'll now go through all the Trac tickets we found: 

In [239]:
import json

# Shout out to John Sandall https://twitter.com/John_Sandall/status/1573711570894462977 
from tqdm.notebook import tqdm, trange

def get_trac_details(ticket_no):
    
    ticket_comments = []
    
    # Shout out to rixx https://gist.github.com/rixx/422392d2aa580b5d286e585418bf6915 
    resp = session.post(
        DJANGO_TRAC,
        data=json.dumps(
            {"method": "ticket.get", "id": ticket_no, "params": [ticket_no]}
        ),
        headers={"Content-Type": "application/json"},
    )

    data = resp.json()["result"][3]
    
    ticket = {
        "ticket_id": ticket_no,
        "status": data["status"],
        "reporter": data["reporter"],
        "resolution": data["resolution"],
        "description": data["description"],
    }

    # struct ticket.changeLog(int id, int when=0)
    # Return the changelog as a list of tuples of the form
    # (time, author, field, oldvalue, newvalue, permanent).
    response = session.post(
        DJANGO_TRAC,
        data=json.dumps(
            {"method": "ticket.changeLog", "id": ticket_no, "params": [ticket_no]}
        ),
        headers={"Content-Type": "application/json"},
    )
    
    changes = response.json()["result"]

    for change in changes:
        name = change[1]
        if name:
            name = name.split("<")[0].strip() # remove emails,
        ticket_comments.append(
            {
                "ticket_id": ticket_no,
                "datetime": change[0]["__jsonclass__"][1],
                "name": name,
                "change_type": change[2],
                "old_value": change[3],
                "new_value": change[4],
            }
        )
    return ticket, ticket_comments



trac_tickets = []
trac_ticket_comments = []

DJANGO_TRAC = "https://code.djangoproject.com/jsonrpc"

for ticket_no in tqdm(tickets): 
    ticket, ticket_comments = get_trac_details(ticket_no)
    trac_tickets.append(ticket)
    trac_ticket_comments += ticket_comments

print("Trac Tickets:", len(trac_tickets))
print("Trac Ticket Comments:", len(trac_ticket_comments))

  0%|          | 0/402 [00:00<?, ?it/s]

Trac Tickets: 402
Trac Ticket Comments: 9799


We now have a list of all the Trac tickets and comments associated with commits in our target release. 
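
Not every change log entry is a written comment: status changes, field edits and so on are logged too. As a rough sketch of narrowing these down, the following counts entries per person where the changed field is `comment` and the new value is non-empty. The sample records are invented, and treating a non-empty `comment` entry as a written comment is an assumption about the Trac change log.

```python
from collections import Counter

# Hypothetical records in the same shape as trac_ticket_comments above.
sample_changes = [
    {"ticket_id": "1", "name": "alice", "change_type": "comment", "new_value": "Patch looks good."},
    {"ticket_id": "1", "name": "alice", "change_type": "status", "new_value": "closed"},
    {"ticket_id": "1", "name": "bob", "change_type": "comment", "new_value": ""},
    {"ticket_id": "2", "name": "bob", "change_type": "comment", "new_value": "Reproduced on main."},
]


def count_written_comments(changes):
    """Count change log entries per person that carry actual comment text."""
    counts = Counter()
    for change in changes:
        text = change["new_value"] or ""
        if change["change_type"] == "comment" and text.strip():
            counts[change["name"]] += 1
    return counts


print(count_written_comments(sample_changes).most_common())
```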

----

### From GitHub

From here, we can check for additional interactions on GitHub. 

Django's Trac uses a [custom patch](https://github.com/django/code.djangoproject.com/blob/main/trac-env/htdocs/tickethacks.js#L38) to use the GitHub API to search for pull requests with the linked tickets, much like we did for associating commits with Trac tickets. 

We can use this same GitHub Search API to get the list of associated pull requests, then all the interactions on those pull requests. 

In [11]:
%load_ext dotenv
%dotenv

In [12]:
# Use a GitHub token to get better rate limits!
import os
GITHUB_TOKEN = os.getenv("GITHUB_TOKEN")

In [13]:
# Check your rate limits. 

# https://docs.github.com/en/rest/rate-limit 
# curl \
#  -H "Accept: application/vnd.github+json" \
#  -H "Authorization: Bearer <YOUR-TOKEN>" \
#  https://api.github.com/rate_limit

from datetime import datetime
from pprint import pprint

with session.cache_disabled():
    resp = session.get("https://api.github.com/rate_limit",
            headers={
                "Authorization": f"Bearer {GITHUB_TOKEN}",
                "Accept": "application/vnd.github.v3.raw",
            }
        )
if "message" in resp.json().keys():
    print(resp.json())
else: 
    data = resp.json()["resources"]
    now = datetime.now()

    for limit_type in ["core", "search"]: 
        d = data[limit_type]
        print(f"GitHub {limit_type} API limit:", d["used"], "/", d["limit"], ", resets", datetime.fromtimestamp(d["reset"]) )


GitHub core API limit: 0 / 5000 , resets 2022-11-24 09:01:27
GitHub search API limit: 0 / 30 , resets 2022-11-24 08:02:27


In [14]:
import time 

def github_api(uri):
    resp = session.get(
        "https://api.github.com" + uri,
        headers={
            "Authorization": f"Bearer {GITHUB_TOKEN}",
            "Accept": "application/vnd.github+json",
        }
    )

    if resp.status_code != 200:
        # Header values are strings, so compare against "0" rather than 0.
        if resp.headers.get("X-RateLimit-Remaining") == "0":
            wait_seconds = int(resp.headers.get("X-RateLimit-Reset")) - int(time.time())
            wait_minutes = wait_seconds // 60
            print(f"Rate limit exceeded. Wait {wait_minutes} minutes (or {wait_seconds} seconds).")
        raise ValueError(resp.json()["message"])

    return resp.json()

def search_for_pull_requests(ticket_id):
    data = github_api(
        "/search/issues?q=repo:django/django+in:title+type:pr+"
        + "%23" + ticket_id + "%20" 
        + "+%23"+ ticket_id + "%2C" 
        + "+%23"+ ticket_id + "%3A" 
        + "+%23"+ ticket_id + "%29"
    )["items"]

    return [x["number"] for x in data]


def get_pull_request(pr_id): 
    data = github_api(f"/repos/django/django/pulls/{pr_id}")
    
    pr = {"id": data["number"], 
          "state": data["state"], 
          "user": data["user"]["login"], 
         }
    
    return pr

pull_request_ids = []
pull_requests = []

for ticket_no in tqdm(tickets): 
    pull_request_ids += search_for_pull_requests(ticket_no)

pull_request_ids = list(set(pull_request_ids))

for pr_id in tqdm(pull_request_ids):
    pull_requests.append(get_pull_request(pr_id))

print("Pull requests:", len(pull_requests))

  0%|          | 0/402 [00:00<?, ?it/s]

  0%|          | 0/756 [00:00<?, ?it/s]

Pull requests: 756
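
The percent escapes in `search_for_pull_requests` are URL-encoded characters: `%23` is `#`, and the four suffixes are a space, comma, colon and closing parenthesis. Presumably the suffixes stop `#123` from also matching `#1234`. A small sketch that builds the same query string with `urllib.parse.quote` may be easier to read; `build_pr_search_query` is a hypothetical name.

```python
from urllib.parse import quote


def build_pr_search_query(ticket_id):
    """Build the search query for a ticket id followed by ' ', ',', ':' or ')'."""
    terms = [f"#{ticket_id}{suffix}" for suffix in (" ", ",", ":", ")")]
    encoded = "+".join(quote(term) for term in terms)
    return "repo:django/django+in:title+type:pr+" + encoded


print(build_pr_search_query("33876"))
```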


For all these pull requests, we need to get the associated comments.  

In [15]:
def get_comments_from_pull_request(pull_request_id):
    comments = []

    # Review comments (inline comments on the diff)
    data = github_api(f"/repos/django/django/pulls/{pull_request_id}/comments")

    for record in data:
        comments.append(
            {
                "user": record["user"]["login"],
                "commit_id": record["commit_id"],
                "message": record["body"],
                "pull_request": pull_request_id
            }
        )

    # Issue comments (the pull request's conversation thread)
    data = github_api(f"/repos/django/django/issues/{pull_request_id}/comments")

    for record in data:
        comments.append(
            {
                "user": record["user"]["login"],
                "commit_id": None,
                "message": record["body"],
                "pull_request": pull_request_id,
            }
        )

    return comments

pr_comments = []

for request in tqdm(pull_requests):
    pr_comments += get_comments_from_pull_request(request["id"])

print("Pull Request Comments:", len(pr_comments))


  0%|          | 0/756 [00:00<?, ?it/s]

Pull Request Comments: 6263
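
One caveat: GitHub list endpoints are paginated (30 items per page by default), so a single request per pull request may miss comments on busy threads. A minimal sketch of following the `Link` header via the `links` attribute of a `requests` response is shown here. `fetch_all_pages` and its `get` parameter are hypothetical; `get` stands in for something like a wrapper around `session.get` that adds the headers used above.

```python
def fetch_all_pages(get, url):
    """Collect items from every page by following the 'next' link of each response."""
    results = []
    while url:
        resp = get(url)
        resp.raise_for_status()
        results.extend(resp.json())
        # requests parses the Link header into resp.links; no "next" means last page.
        url = resp.links.get("next", {}).get("url")
    return results
```

Passing the fetching function in as an argument keeps the loop independent of the session and makes it easy to try out with stand-in responses.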


### Security Thanks

Another dataset we need to add in is security acknowledgements. 

Security reports are [separate to tickets](https://docs.djangoproject.com/en/dev/internals/security/), and thus don't follow the normal git flow. They are handled by the security team, and [credit is given separately](https://twitter.com/carltongibson/status/1588455611049676800).

So, we need to manually search for that credit. 

In [88]:
import pandas as pd

git_commits_df = pd.DataFrame(git_commits)
pd.set_option('display.max_colwidth', None)
git_commits_df[git_commits_df.message.str.contains("Thanks")][["author", "message"]]


Unnamed: 0,author,message
3,Carlton Gibson,[4.1.x] Fixed CVE-2022-36359 -- Escaped filename in Content-Disposition header.\n\nThanks to Motoyasu Saburi for the report.\n
6,Carlton Gibson,[4.1.x] Doc'd TextField.db_collation as optional.\n\nMatches CharField.db_collation docs.\n\nThanks to Paolo Melchiorre for the report.\n\nBackport of 5028a02352cb1fe3e64d63a614912ef694838862 from main\n
7,Carlton Gibson,"[4.1.x] Fixed #33876, Refs #32229 -- Made management forms render with div.html template.\n\nThanks to Claude Paroz for the report.\n\nBackport of 89e695a69b16b8c0e720169b3ca4852cfd0c485f from main\n"
14,Mariusz Felisiak,"[4.1.x] Fixed #33820 -- Doc'd ""true""/""false""/""null"" caveat for JSONField key transforms on SQLite.\n\nThanks Johnny Metz for the report.\n\nRegression in 71ec102b01fcc85acae3819426a4e02ef423b0fa.\nBackport of e20e5d1557785ba71e8ef0ceb8ccb85bdc13840a from main\n"
24,Shawn Dong,[4.1.x] Fixed #33822 -- Fixed save() crash on model formsets when not created by modelformset_factory().\n\nThanks Claude Paroz for the report.\n\nRegression in e87f57fdb8dcdabc452bd15abd015bf6c9b1f7a8.\n\nBackport of 18c5ba07cc81be993941ecc2ecc17923b401b66f from main\n
...,...,...
778,Mariusz Felisiak,"Fixed #33159 -- Reverted ""Fixed #32970 -- Changed WhereNode.clone() to create a shallow copy of children.""\n\nThis reverts commit e441847ecae99dd1ccd0d9ce76dbcff51afa863c.\r\n\r\nA shallow copy is not enough because querysets can be reused and\r\nevaluated in nested nodes, which shouldn't mutate JOIN aliases.\r\n\r\nThanks Michal Čihař for the report."
779,David Wobrock,Fixed #33018 -- Fixed annotations with empty queryset.\n\nThanks Simon Charette for the review and implementation idea.\n
791,ali,Fixed #33114 -- Defined default output_field of StringAgg.\n\nThanks Simon Charette for the review.\n
797,Carlton Gibson,Refs #33129 -- Added missing return statement.\n\nThanks to Claude Paroz for spotting it.\n\nRegression in 221b2f85febcf68629fc3a4007dc7edb5a305b91.\n


From here, let's isolate those thanks mentions (shout out to [https://pythex.org/](https://pythex.org/)!)

In [89]:
git_commits_df.message.str.extract(r'Thanks ([a-zA-Z ]+) for').dropna()

Unnamed: 0,0
3,to Motoyasu Saburi
6,to Paolo Melchiorre
7,to Claude Paroz
14,Johnny Metz
24,Claude Paroz
...,...
744,Lucidot for the report and Mariusz Felisiak
779,Simon Charette
791,Simon Charette
797,to Claude Paroz


The problem you're seeing here is that we're dropping records: 

In [90]:
contains_records = len(git_commits_df[git_commits_df.message.str.contains("Thanks")][["author", "message"]])
extracts_records = len(git_commits_df.message.str.extract(r'Thanks ([a-zA-Z ]+) for').dropna())

print("Compare: contains has", contains_records, "and extracts has", extracts_records)
print("Do they match?", contains_records == extracts_records)

Compare: contains has 83 and extracts has 67
Do they match? False


The issue here is that some people have characters outside `a-zA-Z` in their names, such as accented letters. 

In [91]:
example_user = "Michal"
example_commits = git_commits_df[git_commits_df.message.str.contains(example_user).dropna()]["message"]
print(example_commits)

778    Fixed #33159 -- Reverted "Fixed #32970 -- Changed WhereNode.clone() to create a shallow copy of children."\n\nThis reverts commit e441847ecae99dd1ccd0d9ce76dbcff51afa863c.\r\n\r\nA shallow copy is not enough because querysets can be reused and\r\nevaluated in nested nodes, which shouldn't mutate JOIN aliases.\r\n\r\nThanks Michal Čihař for the report.
Name: message, dtype: object


So we need to make sure we're using a better regex.

In [92]:
w_filter = git_commits_df.message.str.extract(r'Thanks ([\w ]+) for').dropna()
print(w_filter)
print("Do they match?", contains_records == len(w_filter))

                       0
3     to Motoyasu Saburi
6    to Paolo Melchiorre
7        to Claude Paroz
14           Johnny Metz
24          Claude Paroz
..                   ...
778         Michal Čihař
779       Simon Charette
791       Simon Charette
797      to Claude Paroz
809      Benjamin Locher

[74 rows x 1 columns]
Do they match? False


But even then we're missing some! Such as: 



In [93]:
git_commits_df[git_commits_df.message.str.contains("Splunk")][["author", "message"]]

Unnamed: 0,author,message
202,Mariusz Felisiak,"Fixed CVE-2022-28346 -- Protected QuerySet.annotate(), aggregate(), and extra() against SQL injection in column aliases.\n\nThanks Splunk team: Preston Elder, Jacob Davis, Jacob Moore,\nMatt Hanson, David Briggs, and a security researcher: Danylo Dmytriiev\n(DDV_UA) for the report.\n"


So let's just open it up to any character. 

In [22]:
dot_filter = git_commits_df.message.str.extract(r'Thanks (.*) for').dropna()

print(dot_filter)
print("Do they match?", contains_records == len(dot_filter))

                       0
3     to Motoyasu Saburi
6    to Paolo Melchiorre
7        to Claude Paroz
14           Johnny Metz
24          Claude Paroz
..                   ...
778         Michal Čihař
779       Simon Charette
791       Simon Charette
797      to Claude Paroz
809      Benjamin Locher

[80 rows x 1 columns]
Do they match? False


But then we're still missing some!

😨
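
One likely reason for the remaining misses is that `.` does not match a newline by default, so a thanks that wraps across lines never reaches its `for`. A minimal sketch using the Splunk message from above with `re.DOTALL` and a non-greedy match follows. pandas' `str.extract` accepts a `flags` argument, so the same flag could be passed there.

```python
import re

message = (
    "Fixed CVE-2022-28346 -- Protected QuerySet.annotate(), aggregate(), and extra() "
    "against SQL injection in column aliases.\n\n"
    "Thanks Splunk team: Preston Elder, Jacob Davis, Jacob Moore,\n"
    "Matt Hanson, David Briggs, and a security researcher: Danylo Dmytriiev\n"
    "(DDV_UA) for the report.\n"
)

# Without DOTALL, "Thanks" and "for" must be on the same line, so this finds nothing.
print(re.search(r"Thanks (.*) for", message))

# With DOTALL the match may span lines; the non-greedy group stops at the first " for".
match = re.search(r"Thanks (?:to )?(.*?) for", message, flags=re.DOTALL)
credited = " ".join(match.group(1).split())  # collapse the line breaks
print(credited)
```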

So let's just do it manually. 

In [211]:
def flatten(l): 
    return [item for sublist in l for item in sublist]

def unique(l): 
    l = list(filter(lambda i: i is not None, l))
    l = [a.strip() for a in l]
    return sorted(list(set(l)), key=str.casefold)

thanks_messages = git_commits_df[git_commits_df.message.str.contains("Thanks")]["message"].values.tolist()

thanks = []
security_thanks = []
second_pass = []

for msg in thanks_messages: 
    # Find every "Thanks ... for" phrase on a single line of the message.
    matches = re.findall(r"Thanks (.*) for", msg)
    if not matches:
        second_pass.append(msg)
        continue
    for thank in matches:
        thank = thank.replace("to ", "")
        if "," in thank:
            thank = thank.split(",")
            security_thanks += [t.strip() for t in thank]
        elif "and" in thank:
            thank = thank.split("and")
            security_thanks += [t.strip() for t in thank]
        else:
            security_thanks.append(thank)

# inline cleanup
for i, thank in enumerate(security_thanks):
    security_thanks[i] = thank.split("for")[0]
    
for i, thank in enumerate(security_thanks):
    security_thanks[i] = thank.split("and")[0]
    
    
# remove duplicates
security_thanks = unique(security_thanks)
            
# parsing more complicated thanks, just stripping context. 
# outputs groups as a whole. Manual cleanup if required. 
for thank in second_pass: 
    thanks = ""
    found = False
    for line in thank.splitlines():
        if "Thanks" in line:
            found = True
        if found:
            thanks += line + " "
    security_thanks.append(thanks)

print("Security thanks:", len(security_thanks))
print("\n".join(security_thanks))


Security thanks: 64

Adam Johnson
Adam Zimmerman
Alan Crosswell
Alan Ryan
Allen Jonathan David
Andrew Chen Wang
Antonio Terceiro
Arsalan Ghassemi
Baptiste Mispelon
bcail
Ben Picolo
Benjamin Locher
Chris Bailey
Chris Lee
Claude Paroz
Daniel Swain
David Glenck
David Smith
David Wyde
Dennis Brinkrolf
Eugene Kovalev
Ferran Jovell
Florian Apolloner
Hervé Le Roy
Jacob Walls
Jakob Köhler
Johnny Metz
Jonny Park
Joseph Abrahams
Keryn Knight
Kevin Marsh
lind-marcus
Lucidot
Mariusz Felisiak
Matt Westcott
Michal Čihař
Motoyasu Saburi
Nick Pope
OutOfFocus4
Paolo Melchiorre
Paul in 't Hout
Pēteris Caune
Rick Yang
Sergey Fedoseev
Shai Berger
Silvio Gutierrez
Simon Charette
Sjoerd Job Postmus
Sourav Kumar
TakuYoshikai (Aeye Security Lab)
TengMA(@te3t123)
Terence Honles
Theodore Ni
Tim Graham
Tim McCurrach
Tobias Beng
Todor Velichkov
Tom Carrick
yakimka
אורי
Thanks Splunk team: Preston Elder, Jacob Davis, Jacob Moore, Matt Hanson, David Briggs, and a security researcher: Danylo Dmytriiev (DDV_UA) for t
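
A caveat on the cell above: splitting on the plain substrings `"and"` and `"for"` also splits inside names that happen to contain those letters, which may explain a shortened entry such as "Tobias Beng" in the list. A minimal sketch that splits only on commas and on the whole word "and" is shown below; `split_names` is a hypothetical helper.

```python
import re


def split_names(thanks_text):
    """Split 'A, B, and C' into names, matching 'and' only as a whole word."""
    parts = re.split(r",|\band\b", thanks_text)
    return [part.strip() for part in parts if part.strip()]


print(split_names("Adam Johnson, Claude Paroz, Keryn Knight, and Thibaud Colas"))

# A made-up name containing both "and" and "for" stays in one piece.
print(split_names("Alexander Bradford"))
```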

Or, we could just use [NLTK](https://www.nltk.org/)

_Much of the following is based on https://unbiased-coder.com/extract-names-python-nltk/_

In [1]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.tree import Tree

def get_people(text): 
    people = []
    nltk_results = ne_chunk(pos_tag(word_tokenize(text)))
    for nltk_result in nltk_results:
        if isinstance(nltk_result, Tree):
            name = ''
            for nltk_result_leaf in nltk_result.leaves():
                name += nltk_result_leaf[0] + ' '
            if nltk_result.label() == "PERSON":
                people.append(name.strip())
    return people

thanks_messages = git_commits_df[git_commits_df.message.str.contains("Thanks")]["message"].values.tolist()

people = []

for message in thanks_messages:
    people += get_people(message)

people = unique(people)

print("People:", len(people))
print("\n".join(people))

[nltk_data] Downloading package punkt to /Users/glasnt/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/glasnt/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /Users/glasnt/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /Users/glasnt/nltk_data...
[nltk_data]   Package words is already up-to-date!


NameError: name 'git_commits_df' is not defined

There are some false positives in there, and some names that start with "Thanks", so we can remove those manually: 


In [72]:
# Words that show a chunk is not actually a person's name
word_set = ["Fixed", "Changed", "Django", "Matches", "Redis", "Catalan", "Exact", 
            "Ref", "Sphinx", "Splunk", "Moved", "Month", "Python", "Switched"]

# Trim the leading "Thanks " from names
people = [person.replace("Thanks ", "") for person in people]

# Build a new list rather than popping while iterating, which skips entries
people = [
    person for person in people
    if not any(phrase in person for phrase in word_set)
]

people = unique(people)
print("People:", len(people))
print("\n".join(people))

People: 70

Adam Johnson
Adam Zimmerman
Alan Crosswell
Alan Ryan
Allen Jonathan David
Andrew Chen Wang 
Antonio Terceiro
Arsalan Ghassemi
Baptiste Mispelon
bcail
Ben Picolo
Benjamin Locher
Chris Bailey
Chris Jerdonek
Chris Lee
Claude Paroz
Daniel Swain 
Danylo Dmytriiev
David Briggs
David Glenck
David Smith
David Wyde
Dennis Brinkrolf
Eugene Kovalev 
Ferran Jovell
Florian Apolloner
Hannes Ljungberg
Hervé Le Roy
Jacob Davis
Jacob Moore
Jacob Walls
Jakob Köhler
Johnny Metz
Jonny Park
Joseph Abrahams
Keryn Knight
Kevin Marsh
lind-marcus
Lucidot 
Mariusz Felisiak
Matt Hanson
Matt Westcott
Michal Čihař
Motoyasu Saburi
Nick Pope
OutOfFocus4
Paolo Melchiorre
Paul in 't Hout
Preston Elder
Pēteris Caune
Rick Yang
Sergey Fedoseev
Shai Berger
Silvio Gutierrez
Simon Charette
Sjoerd Job Postmus
Sourav Kumar
TakuYoshikai (Aeye Security Lab)
TengMA(@te3t123)
Terence Honles
Theodore Ni
Thibaud Colas
Tim Graham
Tim McCurrach
Tobias Beng
Todor Velichkov
Tom Carrick
yakimka
אורי


But now we have a different set of names compared to the original list. 


In [64]:
print("Regex:", len(security_thanks), "vs NLTK:", len(people))
for person in security_thanks:
    if person not in people:
        print(person)

Regex: 64 vs NLTK: 61
Andrew Chen Wang 
Daniel Swain 
Eugene Kovalev 
Lucidot 
Thanks Splunk team: Preston Elder, Jacob Davis, Jacob Moore, Matt Hanson, David Briggs, and a security researcher: Danylo Dmytriiev (DDV_UA) for the report. 
Thanks to Adam Johnson, Claude Paroz, Keryn Knight, and Thibaud Colas for review. 
Thanks Florian Apolloner, Chris Jerdonek, Hannes Ljungberg, Nick Pope, and Mariusz Felisiak for reviews. 


So instead, let's just use NLTK on those lists of people. 


In [74]:
complex_thanks = []
people = []

# split out our thanks to parse and keep
for person in security_thanks:
    if "," in person:
        complex_thanks.append(person)
    else:
        people.append(person)
        
for thank in complex_thanks:
    people += get_people(thank)
    
# Filter into a new list rather than popping while iterating, which skips entries
people = [person for person in unique(people) if person not in word_set]

print("People:", len(people))
print("\n".join(people))

People: 70

Adam Johnson
Adam Zimmerman
Alan Crosswell
Alan Ryan
Allen Jonathan David
Andrew Chen Wang 
Antonio Terceiro
Arsalan Ghassemi
Baptiste Mispelon
bcail
Ben Picolo
Benjamin Locher
Chris Bailey
Chris Jerdonek
Chris Lee
Claude Paroz
Daniel Swain 
Danylo Dmytriiev
David Briggs
David Glenck
David Smith
David Wyde
Dennis Brinkrolf
Eugene Kovalev 
Ferran Jovell
Florian Apolloner
Hannes Ljungberg
Hervé Le Roy
Jacob Davis
Jacob Moore
Jacob Walls
Jakob Köhler
Johnny Metz
Jonny Park
Joseph Abrahams
Keryn Knight
Kevin Marsh
lind-marcus
Lucidot 
Mariusz Felisiak
Matt Hanson
Matt Westcott
Michal Čihař
Motoyasu Saburi
Nick Pope
OutOfFocus4
Paolo Melchiorre
Paul in 't Hout
Preston Elder
Pēteris Caune
Rick Yang
Sergey Fedoseev
Shai Berger
Silvio Gutierrez
Simon Charette
Sjoerd Job Postmus
Sourav Kumar
TakuYoshikai (Aeye Security Lab)
TengMA(@te3t123)
Terence Honles
Theodore Ni
Thibaud Colas
Tim Graham
Tim McCurrach
Tobias Beng
Todor Velichkov
Tom Carrick
yakimka
אורי


We'll use this list as our list of people with thanks in the security list. 


In [75]:
security_thanks = people

### Transifex

Another dataset we have is the Transifex data. 

For each release, a new branch is created, for example https://github.com/django/django-docs-translations/tree/stable/4.1.x

If we use this repo, we can go through the commits since the last release and check for updates. 

However, these updates only include authors by year. 

For example, from a [recent commit](https://github.com/django/django-docs-translations/commit/ce809e91c8d8ade2de7982aa0014e9d1e77c1aa9#diff-7568ed75ebcd72f094cb0b97517e4f871aa26ba70271ada23ec71dd603273398R8):

```
-# Claude Paroz <claude@2xlibre.net>, 2013-2021
+# Claude Paroz <claude@2xlibre.net>, 2013-2022
```

This repo is based on output from a utility script [`manage_translations.py`](https://github.com/django/django-docs-translations/blob/stable/4.1.x/manage_translations.py) which itself relies on the package `transifex-client`.


#TODO(glasnt) work out if the new transifex API can be used to pull authors in a better way

A naive way of getting the translators for the branch is to map it, _very roughly_, to the year in which most development happened, and collect any authors tagged with that year. 

Roughly: `git grep "^#"  | grep 2022 | cut -d"#" -f2 | cut -d',' -f1  | cut -d'<' -f1 | sort -n | uniq`

In [49]:
!git clone -b stable/4.1.x https://github.com/django/django-docs-translations
#TODO(glasnt) make the branch dynamic

fatal: destination path 'django-docs-translations' already exists and is not an empty directory.


In [212]:
import re
from pathlib import Path
folder = Path.cwd() / "django-docs-translations"

year = "2022"

translators = []

for file in folder.glob("**/*.po"):
    with open(file, encoding="utf-8") as f: 
        for line in f:
            if re.search(f"^#(.*){year}", line):
                translators.append(line.split(",")[0])

for i, author in enumerate(translators):
    translators[i] = author.replace("#","").strip().split("<")[0]
                
translator_thanks = unique(translators)

print("Translators:", len(translator_thanks))
print("\n".join(translator_thanks))

Translators: 42
0d21a39e384d88c2313b89b5042c04cb
Albert Lei
Bertalan Ivan
Claude Paroz
Darek505
decyma
e4db27214f7e7544f2022c647b585925_bb0e321
eitaro
Fery Setiawan
Giacomo M
Gyeongjun Paik
Hiroki Nakayama
Ikemoto Hideki
Jiawei Xu
Juna Salviati
Lu
Maciej Olko
Marco De Paoli
Marco Richetta
Mariusz Felisiak
Mirco Grillo
Nicola Guglielmi
Paolo Melchiorre
peiji liu
seungho
Stefan Ocetkiewicz
Taichi Taniguchi
Todd Kasaki
Tomaz Marcelino Cunha Neto
Toru Ohno
Veoco
Wu Xiangfeng
xiaqi516
yeongkwang
Youngjun Lee
Youxi Luo
汪心禾
龙虎义
문채원
이우현
이재준
정환 윤
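
The pattern used above matches the year anywhere on a header line, so an obfuscated username or an email address that happens to contain the digits `2022` would also match, and a range such as `2013-2023` would not. A stricter sketch is shown below, assuming header lines look like `# Name <email>, years`, where `years` is a comma-separated list of years or ranges. `active_in_year` is a hypothetical helper, and the second sample line is invented.

```python
import re

# "# Name <optional email>, 2013-2022" or "# Name, 2019,2021"
HEADER = re.compile(
    r"^#\s*(?P<name>[^<,]+?)\s*(?:<[^>]*>)?\s*,\s*(?P<years>[\d,\s-]+)$"
)


def active_in_year(line, year):
    """Return the translator's name if the header's years cover `year`, else None."""
    match = HEADER.match(line.strip())
    if not match:
        return None
    for part in match.group("years").split(","):
        start, _, end = part.strip().partition("-")
        end = end or start
        if start.isdigit() and end.isdigit() and int(start) <= year <= int(end):
            return match.group("name")
    return None


print(active_in_year("# Claude Paroz <claude@2xlibre.net>, 2013-2022", 2022))

# Invented line: the username contains "2022" but the year does not.
print(active_in_year("# user2022abc, 2019", 2022))
```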


# Analysis <a class="anchor" id="analysis"></a>
 
 
Our analysis is going to be based on our [assumptions](#assumptions). 

Some other assumptions from other projects: 

 * https://github.com/cncf/devstats/blob/master/README_K8s.md "Contribution: a review, comment, commit, issue or PR"

In [99]:
%%capture
import sys
!{sys.executable} -m pip install pandas
!{sys.executable} -m pip install "rich[jupyter]"

and convert our lists to DataFrames

In [170]:
"""import pandas as pd

git_commits_df = pd.DataFrame(git_commits)
print("Git Commit keys:", ", ".join(list(git_commits_df.keys())))

git_trac_links_df = pd.DataFrame(git_trac_links)
pull_requests_df = pd.DataFrame(pull_requests)
print("Pull Request keys:", ", ".join(list(pull_requests_df.keys())))
pull_request_comments_df = pd.DataFrame(pr_comments)
print("Pull Request Comments keys:", ", ".join(list(pull_request_comments_df.keys())))
trac_tickets_df = pd.DataFrame(trac_tickets)
print("Trac Ticket keys:", ", ".join(list(trac_tickets_df.keys())))
trac_ticket_comments_df = pd.DataFrame(trac_ticket_comments)
print("Trac Ticket Comments keys:", ", ".join(list(trac_ticket_comments_df.keys())))"""


For starters, who are our Git commit authors?

In [175]:
#git_commits_df.author.value_counts()

from collections import Counter


from rich.console import Console
from rich.table import Table


def print_table(table_name="Table", columns=("a", "b"), data=((1, 2), (3, 4)), limit=None):
    """Render rows as a rich table, optionally truncated to the first `limit` rows."""
    title = f"{table_name} (limit {limit})" if limit else table_name
    table = Table(title=title)
    for col in columns:
        table.add_column(col)
    if limit:
        data = data[:limit]
    for row in data:
        table.add_row(*[str(cell) for cell in row])

    console = Console()
    console.print(table)


# Count commits per author name (this is the git author, not the committer).
gc_authors = Counter(g["author"] for g in git_commits)
print_table("Git Commit Authors", ["Name", "Commits"], gc_authors.most_common(), limit=10)




And who are our Pull Request authors?

In [176]:
#pull_requests_df.user.value_counts()
pr_authors = Counter([p["user"] for p in pull_requests])
print_table("Pull Request Authors", ["username", "Pull Requests"], pr_authors.most_common(), limit=10)

This is where things start to break, because we now have both names and aliases. 

Git identifies people by `user.name` and `user.email` (you may be familiar with setting these up yourself in your local `git` config).

GitHub identifies people by username.

You can use the GitHub API to get someone's GitHub display name from their username, but this may not match the name on their git commits. 
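
_As a rough illustration of that lookup, here is a minimal sketch that reads the `name` field from the public `/users/{login}` endpoint. The helper names `fetch_user` and `display_name` are made up for this note and are not the notebook's own `github_api` helper; unauthenticated requests are rate-limited, so a real run would want a token._

```python
import json
from urllib.request import urlopen


def fetch_user(login):
    """Fetch the public profile for `login` from the GitHub REST API."""
    with urlopen(f"https://api.github.com/users/{login}") as response:
        return json.load(response)


def display_name(login, fetch=fetch_user):
    """Return the profile's display name, or the login when none is set."""
    profile = fetch(login)
    return profile.get("name") or login


# Stubbed profiles (hypothetical data) so the example runs without network access.
fake_profiles = {"ada": {"name": "Ada Example"}, "bo": {"name": None}}
print(display_name("ada", fetch=fake_profiles.get))
print(display_name("bo", fetch=fake_profiles.get))
```

_Passing the fetcher in as a parameter keeps the name logic testable without calling the API._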

We'll set up a lookup dict and some helper functions for later. 

In [359]:
github_name = {}

def get_github_user(user):
    """Return the GitHub display name for a username, or a labelled fallback."""
    try:
        data = github_api(f"/users/{user}")
    except ValueError:
        return user + " [not found on GitHub]"
    if data["name"] is None:
        return user + " [no GitHub name]"
    return data["name"]


# Every username seen across pull requests, PR comments, and Trac.
users = (
    [u["user"] for u in pull_requests]
    + [u["user"] for u in pr_comments]
    + [u["reporter"] for u in trac_tickets]
    + [u["name"] for u in trac_ticket_comments]
)

for user in tqdm(unique(users)):
    github_name[user] = get_github_user(user)

  0%|          | 0/704 [00:00<?, ?it/s]

In [294]:
"""def get_name(username):
    name = github_users_df[github_users_df.login.str.contains(username)]
    if len(name) == 1:
        return name["name"].item()
    else:
        # name doesn't exist in known github users, so return self.
        return username

def get_username(name):
    user =  github_users_df[github_users_df.name.str.contains(name, na=False)].dropna()
    if len(user) == 1:
        return user["login"].item()
    else:
        return name
"""

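# Invert the mapping: display name -> username.
# Usernames that share a display name collapse into a single entry.
github_user = {v: k for k, v in github_name.items()}

print("GitHub users:", len(github_user))

print_table("GitHub Users", ["Name", "Username"], list(github_user.items()), limit=10)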


GitHub users: 697


So we now have a conversion dict in each direction: `github_name` (username to name) and `github_user` (name to username). These will be useful later. 
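
_One caveat: 704 usernames were looked up, but the inverted dict has 697 entries, which suggests some display names are shared by more than one username. A plain dict inversion keeps only the last username seen for each name. Here is a minimal sketch for spotting those collisions; the function name and the sample data are hypothetical._

```python
from collections import defaultdict


def invert_with_collisions(username_to_name):
    """Map each display name to every username that uses it."""
    by_name = defaultdict(list)
    for username, name in username_to_name.items():
        by_name[name].append(username)
    return dict(by_name)


# Hypothetical data: two accounts share one display name.
sample = {"adam1": "Adam", "adam2": "Adam", "bea": "Bea Example"}
by_name = invert_with_collisions(sample)

# Names claimed by more than one username are the ones a plain inversion loses.
collisions = {name: users for name, users in by_name.items() if len(users) > 1}
print(collisions)
```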

## Sampling by User

Let's take an example user, and check all their contributions. 

Keeping in mind the set of data we have, and the mapping of identities: 

```
git_commits author
pull_requests user->gname
pr_comments user->gname
trac_tickets reporter->gname
trac_ticket_comments name->gname
security_thanks (name)
translators (name)
```

In [298]:
def user_contributions(user):
    """Print a table of one GitHub user's contribution counts, by type."""
    if user not in github_name:
        return "User not found."

    name = github_name[user]

    title = f"{name} ({user})"
    columns = ["Contribution", "Count"]
    data = []
    
    gcs = [g for g in git_commits if g["author"] == name]
    data.append(["Git Commits", str(len(gcs))])
    
    prs = [p for p in pull_requests if p["user"] == user]
    data.append(["Pull Requests Authored", str(len(prs))])
    
    prcs = [p for p in pr_comments if p["user"] == user]
    data.append(["Pull Request comments",  str(len(prcs))])
    
    tts = [t for t in trac_tickets if t["reporter"] == user]
    data.append(["Trac Tickets created",  str(len(tts))])
    
    ttcs = [t for t in trac_ticket_comments if t["name"] == user]
    data.append(["Trac Ticket comments",  str(len(ttcs))])
    
    st = [s for s in security_thanks if s == name]
    data.append(["Security thanks", str(len(st))])
    
    tt = [t for t in translator_thanks if t == name]
    data.append(["Translation thanks", str(len(tt))])
             
    print_table(title, columns, data)


user_contributions("felixxm")
user_contributions("carltongibson")
user_contributions("glasnt")

We can also collate all the contribution types by user. 

Important notes to remember:

 * Data for PRs and Trac Tickets can be mapped to a name from a username.
 * Names in thanks lists _may_ be able to be mapped to a username.
 * Git names _may_ be able to be mapped to a username. 
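
_To make the "may be able to be mapped" part concrete, here is a minimal sketch of the fallback pattern: look an identity up in the name-to-username dict and keep it unchanged when there is no match. The data and the `canonical` helper are hypothetical, for illustration only._

```python
from collections import Counter

# Hypothetical lookup tables standing in for github_name / github_user.
username_to_name = {"ada": "Ada Example"}
name_to_username = {name: user for user, name in username_to_name.items()}


def canonical(identity, lookup=name_to_username):
    """Return the username for a known display name, else the identity as given."""
    return lookup.get(identity, identity)


git_authors = ["Ada Example", "Ada Example", "Unmapped Person"]  # git uses names
pr_users = ["ada"]  # GitHub uses usernames

# Counts from both sources land under one key when the name can be mapped.
totals = Counter(canonical(a) for a in git_authors) + Counter(pr_users)
print(totals)
```

_Anyone whose git name differs from their GitHub display name stays under a separate key, so the merged totals are a lower bound on how much can be joined up automatically._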

In [355]:
contrib_array = {"g": 0, "pr": 0, "prc": 0, "ttr": 0, "ttc": 0, "st": 0, "tt": 0}

contributors = {}

def merge_contrib(records, key):
    """Store each (user, count) pair under `key`, creating the user's row if needed."""
    for user, num in records:
        if user not in contributors:
            contributors[user] = contrib_array.copy()
        contributors[user][key] = num
    

gc = Counter([github_user.get(g["author"], g["author"]) for g in git_commits]).most_common()
merge_contrib(gc, "g")
    
prc = Counter([g["user"] for g in pull_requests]).most_common()
merge_contrib(prc, "pr")

prcsc = Counter([g["user"] for g in pr_comments]).most_common()
merge_contrib(prcsc, "prc")

ttrc = Counter([g["reporter"] for g in trac_tickets]).most_common()
merge_contrib(ttrc, "ttr")

ttcc = Counter([github_user.get(g["name"], g["name"]) for g in trac_ticket_comments]).most_common()
merge_contrib(ttcc, "ttc")

stc = Counter([github_user.get(g, g) for g in security_thanks]).most_common()
merge_contrib(stc, "st")

ttc = Counter([github_user.get(g, g) for g in translator_thanks]).most_common()
merge_contrib(ttc, "tt")
    
console = Console()
table = Table(title=f"Contributions to Django {target_release}")
for col in ["User", "Name", "Commits", "PRs", "PR Comm.", "Trac", "Trac Comm.", "Sec.", "Transl."]:
    table.add_column(col)

# Rows are keyed by username where one is known, with the display name alongside.
for user, counts in contributors.items():
    table.add_row(user, github_name.get(user, user), *[str(n) for n in counts.values()])

console.print(table)

# Results <a class="anchor" id="results"></a>

With the data we have, we can now generate a unique list of all the humans who contributed. 
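
_The `unique` helper used throughout is defined elsewhere in the notebook. Judging only from the printed lists, which are de-duplicated and ordered without regard to case, one plausible definition is sketched below. This is an assumption, not necessarily the definition actually used._

```python
def unique(items):
    """Return the distinct items, sorted case-insensitively (assumed behaviour)."""
    return sorted(set(items), key=str.casefold)


print(unique(["decyma", "Darek505", "eitaro", "Darek505"]))
```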

In [358]:
django_contributors = unique([github_name.get(g, g) for g in unique(contributors.keys())])
print(len(django_contributors))
print("\n".join(django_contributors))

810

0d21a39e384d88c2313b89b5042c04cb
0xC4
1337 H4X0|2
4mpty4mpty
Aapo Rista
Aaron Chong
Aaron Forsander
Aaron Pineda
Abhijeet
Abhinav Yadav
Abhyudai
Ad Timmering
Adam
Adam Johnson
Adam Wróbel
Adam Zimmerman
AdamDonna (no name)
Ade Lee
adontz (no name)
Adonys Alea Boffill
Adrian Smith
Adrian Torres
Ahmad A. Hussein
ahmadekhalili
Ahsan Shafiq
Akshesh Doshi
Alan Crosswell
Alan Ryan
Albert Defler
Albert Lei
Aleksandr Sobolev
Alex Aktsipetrov
Alex Gaynor
Alex Vandiver
alex7217 (not found)
Alexander Filimonov
Alexander Nestorov
Alexander Shchapov
Alexandr Tatarinov
Alexandre
Alexandre Laplante
Alexandru Mărășteanu
ali
ali sayyah
Ali Toosi
Aljaž Košir
Allen Jonathan David
AllenJonathan (no name)
Alokik Vijay
Amartya Gaur
Amir Hadi
Anders Kaseorg
Andreas Pelme
Andrei Fokau
andrei kulakov
Andrew
Andrew Chen Wang
Andrew Godwin
Andrew Neitsch
Andrew Nicolaou
andrewdotn (no name)
Andrey Fedoseev
Andrey Kuzminov
Andrey Otto
Andrey Shpak
Andriy Sokolovskiy
Andy Baker
Andy Chosak
Aniket Subhash Ujga