# Audit labels in a GitHub organization

This notebook creates TSV files containing information about the labels used in the repositories of an organization on GitHub.

## 1. Set up the notebook

Before running the other cells in this notebook:
- Set the `ORG_NAME` variable to the name of the organization of interest
- Set the `ACCESS_TOKEN` variable to a GitHub personal access token that has access to the organization of interest

In [None]:
ORG_NAME = r"microbiomedata"  # TODO: Replace me!
ACCESS_TOKEN = r"github_pat_..."  # TODO: Replace me!
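
A token typed directly into a cell is easy to commit by accident. As one possible alternative, the sketch below reads the token from an environment variable. The variable name `GITHUB_TOKEN` and the helper `read_access_token` are assumptions made for illustration; nothing else in this notebook depends on them.

```python
import os


def read_access_token(env_var_name: str = "GITHUB_TOKEN") -> str:
    """Returns the token stored in the given environment variable.

    Note: `GITHUB_TOKEN` is a hypothetical variable name chosen for this sketch.
    """
    # Treat a missing variable and a blank variable the same way.
    token = os.environ.get(env_var_name, "").strip()
    if not token:
        raise RuntimeError(f"Environment variable {env_var_name} is not set or is empty.")
    return token
```

With the variable set before the notebook is launched, the assignment above could become `ACCESS_TOKEN = read_access_token()`.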

Install dependencies.

In [None]:
# Reference: https://pygithub.readthedocs.io/en/stable/introduction.html
%pip install PyGithub

Import modules.

In [None]:
import csv
import io

from github import Github, Auth

Define helper functions you can use later.

> References:
> - https://docs.python.org/3/library/csv.html#csv.DictWriter

In [None]:
def get_lowercase_key(key_value_pair: tuple) -> str:
    r"""Returns the key from a `(key, value)` tuple, in lowercase."""
    return key_value_pair[0].lower()

def write_list_of_dicts_to_tsv_file(keys: list, list_of_dicts: list, file_path: str) -> None:
    r"""Writes a list of dictionaries to a TSV file."""
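    # Use UTF-8 explicitly so that non-ASCII characters (e.g. emoji in label
    # descriptions) are written the same way on every platform.
    with open(file_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=keys, delimiter="\t")
        writer.writeheader()
        writer.writerows(list_of_dicts)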
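
To see what the helpers above do, here is a minimal sketch that runs on made-up data. It repeats the definition of `get_lowercase_key` so that it can run on its own, and it writes to a temporary directory instead of the working directory; the label names and counts are invented for illustration.

```python
import csv
import os
import tempfile


def get_lowercase_key(key_value_pair: tuple) -> str:
    """Returns the key from a `(key, value)` tuple, in lowercase."""
    return key_value_pair[0].lower()


# Made-up label counts, used only for illustration.
counts_by_label_name = {"bug": 3, "Documentation": 1, "enhancement": 2}

# Default sorting compares code points, so uppercase letters sort before lowercase ones.
default_order = [name for name, _ in sorted(counts_by_label_name.items())]

# Sorting with the helper ignores letter casing.
caseless_order = [name for name, _ in sorted(counts_by_label_name.items(), key=get_lowercase_key)]

print(default_order)   # ['Documentation', 'bug', 'enhancement']
print(caseless_order)  # ['bug', 'Documentation', 'enhancement']

# Write the rows to a temporary TSV file, then read them back.
rows = [dict(label_name=name, count=counts_by_label_name[name]) for name in caseless_order]
with tempfile.TemporaryDirectory() as temp_dir:
    file_path = os.path.join(temp_dir, "example.tsv")
    with open(file_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["label_name", "count"], delimiter="\t")
        writer.writeheader()
        writer.writerows(rows)
    with open(file_path, newline="") as f:
        rows_read = [dict(row) for row in csv.DictReader(f, delimiter="\t")]

print(rows_read[0])  # {'label_name': 'bug', 'count': '3'}
```

Note that every value read back from a TSV file is a string, so the count `3` comes back as `'3'`.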

## 2. Fetch data from the GitHub API

Create an authenticated GitHub client.

In [None]:
auth = Auth.Token(ACCESS_TOKEN)
g = Github(auth=auth)

Fetch each label in each repository.

> References:
> - https://pygithub.readthedocs.io/en/stable/examples/Repository.html#get-all-the-labels-of-the-repository

In [None]:
# Note: This cell takes about 1 minute to run on my laptop.

repos_by_repo_name = {}

labels_by_repo_name = {}

org = g.get_organization(ORG_NAME)

for repo in org.get_repos():
    repo_name = repo.name
    repos_by_repo_name[repo_name] = repo
    labels_by_repo_name[repo_name] = []
    
    labels = repo.get_labels()
    for label in labels:
        labels_by_repo_name[repo_name].append(label)

Fetch the number of "labelings" (i.e. uses) of each label in each repository, regardless of whether the labeled item is an Issue or a PR and regardless of whether it is "Open" or "Closed." Both kinds of item are counted because the GitHub Issues API treats every PR as an Issue.

> References:
> - https://github.com/PyGithub/PyGithub/blob/cc766a6ffdfa4b24c395dd958df46704348637fb/github/Repository.py#L2816
> - https://github.com/PyGithub/PyGithub/blob/cc766a6ffdfa4b24c395dd958df46704348637fb/github/PaginatedList.py#L187

In [None]:
# Note: This cell takes 10-11 minutes to run on my laptop.

num_labelings_by_label_name_by_repo_name = {}

# Initialize a counter that we can use as a progress indicator.
repo_num = 0
num_repos = len(repos_by_repo_name.keys())

for repo_name, repo in sorted(repos_by_repo_name.items(), key=get_lowercase_key):
    # Print the name of the repo in which we are about to count "labelings."
    print(f"{repo_name} (repo {repo_num + 1} of {num_repos}): ", end="")

    # Ensure a dictionary entry for this repo name exists.
    if repo_name not in num_labelings_by_label_name_by_repo_name.keys():
        num_labelings_by_label_name_by_repo_name[repo_name] = {}
    
    for label in labels_by_repo_name[repo_name]:
        issues = repo.get_issues(state="all", labels=[label])
        num_labelings_by_label_name_by_repo_name[repo_name][label.name] = issues.totalCount
        print(".", end="")  # print a dot to represent a unit of progress

    # End the line of dots, then increment the counter used as a progress indicator.
    print("")
    repo_num += 1

Close the GitHub client's connection to the GitHub API.

In [None]:
g.close()

## 3. Process the fetched data

Make a dictionary of label descriptions.

Example dictionary item, which shows a single label being described differently in different repos:
```json
{ 
    "small": {
        "some-repo": "Will take 1-2 days to complete",
        "other-repo": "Will take 1-2 hours to complete",
        "other-other-repo": "Will take 1-2 hours to complete"
    }
}
```

In [None]:
label_descriptions_by_repo_name_by_label_name = {}

for repo_name, labels in labels_by_repo_name.items():
    for label in labels:
        label_name = label.name

        # Ensure a dictionary entry for this label name exists.
        if label_name not in label_descriptions_by_repo_name_by_label_name.keys():
            label_descriptions_by_repo_name_by_label_name[label_name] = {}

        label_descriptions_by_repo_name_by_label_name[label_name][repo_name] = label.description
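
A dictionary shaped like this makes it straightforward to spot labels whose descriptions disagree between repos. The following is a minimal sketch on sample data, not part of the notebook's pipeline: the helper `find_inconsistently_described_labels` is hypothetical, the `"small"` entry is copied from the example above, and the `"bug"` entry is made up.

```python
# Sample data; in the notebook, this dictionary is built from the GitHub API responses.
sample_descriptions_by_repo_name_by_label_name = {
    "small": {
        "some-repo": "Will take 1-2 days to complete",
        "other-repo": "Will take 1-2 hours to complete",
        "other-other-repo": "Will take 1-2 hours to complete",
    },
    "bug": {  # made-up entry whose descriptions agree
        "some-repo": "Something isn't working",
        "other-repo": "Something isn't working",
    },
}


def find_inconsistently_described_labels(descriptions_by_repo_name_by_label_name: dict) -> dict:
    """Returns the sorted distinct descriptions of each label that has more than one."""
    result = {}
    for label_name, descriptions_by_repo_name in descriptions_by_repo_name_by_label_name.items():
        # Treat `None` as an empty description and ignore surrounding whitespace.
        distinct = {(description or "").strip() for description in descriptions_by_repo_name.values()}
        if len(distinct) > 1:
            result[label_name] = sorted(distinct)
    return result


inconsistent = find_inconsistently_described_labels(sample_descriptions_by_repo_name_by_label_name)
print(inconsistent)
# {'small': ['Will take 1-2 days to complete', 'Will take 1-2 hours to complete']}
```

Only `"small"` is reported, because `"bug"` has the same description in every repo that defines it.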

Write that dictionary out to a TSV file.

In [None]:
label_descriptions = []

for label_name, label_descriptions_by_repo_name in sorted(label_descriptions_by_repo_name_by_label_name.items(), key=get_lowercase_key):
    for repo_name, label_description in label_descriptions_by_repo_name.items():
        label_descriptions.append(dict(repo_name=repo_name, label_name=label_name, label_description=label_description))

write_list_of_dicts_to_tsv_file(keys=["repo_name", "label_name", "label_description"], 
                                list_of_dicts=label_descriptions,
                                file_path="label_descriptions.tsv")

Make a list of labels lacking descriptions.

Example list item, which shows the label name and the name of the repo in which that label lacks a description:
```json
{
    "label_name": "small",
    "repo_name": "mystery-repo"
}
```

In [None]:
labels_lacking_description = []

for label_name, label_descriptions_by_repo_name in sorted(label_descriptions_by_repo_name_by_label_name.items(), key=get_lowercase_key):
    for repo_name, label_description in label_descriptions_by_repo_name.items():
        
        # Note: I did, in fact, encounter a description of `None` in practice. :shrug:
        if label_description is None or len(label_description.strip()) == 0:
            labels_lacking_description.append(dict(label_name=label_name, repo_name=repo_name))

Write that list out to a TSV file.

In [None]:
write_list_of_dicts_to_tsv_file(keys=["repo_name", "label_name"], 
                                list_of_dicts=labels_lacking_description,
                                file_path="labels_lacking_description.tsv")

Make a fancy table showing (among other things) how much each label is used in each repo, and write that table out to a TSV file.

The table will have this format:
- Column headers (first row) are repository names
- Row headers (first column) are label names
- First non-header row (second row) contains the number of labels that exist in a given repo
- First non-header column (second column) contains the number of repos in which a given label exists
- Each remaining cell contains the number of "labelings" (i.e. uses) of a given label in a given repo:
    - `None` (written to the file as an empty cell) means the label does not exist in that repo
    - `0` means the label exists in that repo, but is not used in that repo
    - `1`+ means the label is used in that repo (which implies that it also exists in that repo)

In [None]:
# Make a dictionary of names of repos in which a given label exists.
repo_names_by_label_name = {}
for repo_name, labels in labels_by_repo_name.items():
    for label in labels:

        # Ensure a dictionary entry for this label name exists.
        if label.name not in repo_names_by_label_name.keys():
            repo_names_by_label_name[label.name] = []

        repo_names_by_label_name[label.name].append(repo_name)

# Write this stuff as TSV content to a buffer in memory (we'll dump it to a file later).
all_repo_names = sorted(labels_by_repo_name.keys(), key=str.lower)  # ignores letter casing when sorting
field_names = ["Label name", "Number of repos"] + all_repo_names
buffer = io.StringIO()
tsv_writer = csv.DictWriter(buffer, fieldnames=field_names, delimiter="\t")
tsv_writer.writeheader()

# Make the first data row, which contains the number of labels in each repo.
first_data_row = {"Label name": "Number of labels"}
for repo_name, labels in labels_by_repo_name.items():
    first_data_row[repo_name] = len(labels)
tsv_writer.writerow(first_data_row)

# Make the subsequent rows.
for label_name, repo_names in sorted(repo_names_by_label_name.items(), key=get_lowercase_key):
    row = {"Label name": label_name, "Number of repos": len(repo_names)}
    for repo_name in all_repo_names:
        if label_name in num_labelings_by_label_name_by_repo_name[repo_name]:
            row[repo_name] = num_labelings_by_label_name_by_repo_name[repo_name][label_name]  # some number >= 0
        else:
            row[repo_name] = None
    tsv_writer.writerow(row)

# Write the TSV content to a file.
with open("label_usage.tsv", "w", newline="", encoding="utf-8") as f:
    f.write(buffer.getvalue())
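
The layout of that table is easier to picture on a small example. The sketch below builds the same kind of table from made-up counts for two hypothetical repos (`repo-a` and `repo-b`) and keeps the result in memory instead of writing a file. It is a simplified illustration, not a replacement for the cell above.

```python
import csv
import io

# Made-up counts, for illustration only: repo name -> label name -> number of labelings.
sample_counts_by_label_name_by_repo_name = {
    "repo-a": {"bug": 4, "small": 0},
    "repo-b": {"bug": 1},
}

repo_names = sorted(sample_counts_by_label_name_by_repo_name, key=str.lower)
label_names = sorted(
    {name for counts in sample_counts_by_label_name_by_repo_name.values() for name in counts},
    key=str.lower,
)

buffer = io.StringIO()
writer = csv.DictWriter(
    buffer,
    fieldnames=["Label name", "Number of repos"] + repo_names,
    delimiter="\t",
    lineterminator="\n",  # keeps the in-memory text free of "\r" characters
)
writer.writeheader()

# First data row: the number of labels in each repo ("Number of repos" is left blank).
first_data_row = {"Label name": "Number of labels"}
for repo_name in repo_names:
    first_data_row[repo_name] = len(sample_counts_by_label_name_by_repo_name[repo_name])
writer.writerow(first_data_row)

# Subsequent rows: one row per label.
for label_name in label_names:
    row = {"Label name": label_name, "Number of repos": 0}
    for repo_name in repo_names:
        # `dict.get` returns `None` when the label does not exist in the repo.
        count = sample_counts_by_label_name_by_repo_name[repo_name].get(label_name)
        row[repo_name] = count
        if count is not None:
            row["Number of repos"] += 1
    writer.writerow(row)

table = buffer.getvalue()
print(table)
```

In the resulting text, the `small` row ends with an empty cell for `repo-b`, because the `csv` module writes `None` as an empty string, while the `0` under `repo-a` shows that the label exists there but is unused.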