In [1]:
import glob

import h5py
import numpy as np

In [2]:
# repo_id values of the repositories to match
awkward_repos = np.array([202413762, 137079949])
distinct_people = set()

for filename in sorted(glob.glob("/home/jpivarski/storage/data/GHArchive/GHArchive-*-aggregated.h5")):
    print(filename)
    with h5py.File(filename, mode="r") as file:
        actors = file["actor_id"][np.isin(file["repo_id"], awkward_repos)]
        distinct_people.update(actors)
    print(len(distinct_people))

/home/jpivarski/storage/data/GHArchive/GHArchive-2015-aggregated.h5
0
/home/jpivarski/storage/data/GHArchive/GHArchive-2016-aggregated.h5
0
/home/jpivarski/storage/data/GHArchive/GHArchive-2017-aggregated.h5
0
/home/jpivarski/storage/data/GHArchive/GHArchive-2018-aggregated.h5
31
/home/jpivarski/storage/data/GHArchive/GHArchive-2019-aggregated.h5
198
/home/jpivarski/storage/data/GHArchive/GHArchive-2020-aggregated.h5
450
/home/jpivarski/storage/data/GHArchive/GHArchive-2021-aggregated.h5
767
/home/jpivarski/storage/data/GHArchive/GHArchive-2022-aggregated.h5
969
/home/jpivarski/storage/data/GHArchive/GHArchive-2023-aggregated.h5
1026


In [None]:
with open("/home/jpivarski/storage/data/GHArchive/actor_id_name.txt") as file:
    for line in file:
        idstr, name = line.rstrip("\n").split("\t")
        if int(idstr) in distinct_people:
            # print(line, end="")
            pass

In [23]:
people_usernames = {}
people_categories = {}

with open("distinct_people.txt") as file:
    for line in file:
        idstr, username, category = line.rstrip("\n").split("\t")
        people_usernames[int(idstr)] = username
        people_categories[int(idstr)] = category

In [24]:
set(people_categories.values())

{'ACOUSTIC',
 'AI',
 'ASTRO',
 'BIOLOGY',
 'CHEMISTRY',
 'CS',
 'DS',
 'ECONOMICS',
 'GEO',
 'HEALTH',
 'HEP',
 'IDK',
 'MATH',
 'MEDICAL',
 'PHYSICS',
 'QUANT',
 'SE'}

`HEP` versus everything else is the main categorization. This definition of HEP includes dark matter searches, neutrino experiments, and nuclear physics. It is as lenient a definition as possible: it includes everyone who has ever been in HEP, even if they are not now, as well as students who do not study HEP but were involved in developing HEP projects (e.g. IRIS-HEP fellows). Thus, "non-HEP" in the plot really does mean non-HEP: interest in Awkward Array that is truly beyond our HEP community.

`PHYSICS` covers anything outside HEP that touches on physics, including materials science and chemistry. It can be merged with the single instance of `ACOUSTIC` and the only two instances of `CHEMISTRY`.

`BIOLOGY` is mostly neuroscience, but also bioinformatics and genomics. It can be merged with `MEDICAL` and `HEALTH`.

`GEO` may be its own group, although it combines Earth sciences, climatology, and geosciences.

`SE` (software engineering), `AI` (artificial intelligence), `DS` (data science), `CS` (computer science), and `QUANT` (finance) are hopelessly interchangeable. It was very hard for me to make useful distinctions among them based on the users' GitHub info. Even so, I wish I had made a separate category for robotics.

There's only one instance of `ECONOMICS` and four of `MATH`. They can be included in the giant group above.

`IDK` is for everyone who could not be classified. It usually meant that they had no repos, or the only repos they had were forks, and that wasn't enough to tell me their interests.
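The merges described above can be sketched as a lookup table from the fine-grained category to a merged group. This is only an illustration: the group labels (`BIO/MED`, `COMPUTING`) and the helper names `merge_categories` and `count_groups` are hypothetical. `ASTRO` is not discussed above, so keeping it as its own group is an assumption, as is keeping `IDK` separate.

```python
from collections import Counter

# Fine-grained category -> merged group.
# Group labels are hypothetical names chosen for this sketch.
CATEGORY_TO_GROUP = {
    "HEP": "HEP",
    # physics outside HEP, plus the few ACOUSTIC and CHEMISTRY instances
    "PHYSICS": "PHYSICS",
    "ACOUSTIC": "PHYSICS",
    "CHEMISTRY": "PHYSICS",
    # biology merged with medical and health
    "BIOLOGY": "BIO/MED",
    "MEDICAL": "BIO/MED",
    "HEALTH": "BIO/MED",
    "GEO": "GEO",
    # assumption: ASTRO is not discussed in the text, so it stays separate here
    "ASTRO": "ASTRO",
    # the interchangeable computing categories, plus ECONOMICS and MATH
    "SE": "COMPUTING",
    "AI": "COMPUTING",
    "DS": "COMPUTING",
    "CS": "COMPUTING",
    "QUANT": "COMPUTING",
    "ECONOMICS": "COMPUTING",
    "MATH": "COMPUTING",
    # assumption: unclassifiable users are kept as their own group
    "IDK": "IDK",
}


def merge_categories(people_categories):
    """Map each actor ID's fine-grained category to its merged group."""
    return {
        actor_id: CATEGORY_TO_GROUP[category]
        for actor_id, category in people_categories.items()
    }


def count_groups(people_categories):
    """Count how many people fall into each merged group."""
    return Counter(merge_categories(people_categories).values())


# Made-up IDs for illustration only (not actor IDs from the sample).
example = {1: "HEP", 2: "ACOUSTIC", 3: "QUANT", 4: "MATH", 5: "IDK"}
print(count_groups(example))
```

With the real data, `count_groups(people_categories)` would give the group sizes. A `KeyError` from `merge_categories` would flag a category missing from the table.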

GitHub users who no longer exist (18% of the total) were removed from the sample.