# Coverage of github-eu-covid data

This notebook outlines preliminary findings on the coverage of the github-eu-covid data.

## Geographic coverage

Geographic coverage can be assessed in two broad ways:

1. Location inferred from topics and description
2. Location of owner and contributors

These correspond to two research questions:

1. What geography does the codebase aim to represent?
2. What countries are contributing to this codebase?

A given repository may have no geographic information associated with it. This could, for example, be due to privacy settings, insufficient information, or the codebase not being designed for any specific geography.

Of the 231 distinct covid-related topics, 150 indicate a specific country, and 20 of those indicate an EU (or closely affiliated) country. The remaining 81 may have more general or global coverage. These counts do not include countries mentioned in descriptions, or the affiliations of contributors (both require more in-depth analysis).

In [7]:
import json
with open("../collect/data_samples/repos_sample.json") as f:
    data = json.load(f)

In [35]:
list(data['covid-19'][0].keys())

['id',
 'node_id',
 'name',
 'full_name',
 'private',
 'owner',
 'html_url',
 'description',
 'fork',
 'url',
 'forks_url',
 'keys_url',
 'collaborators_url',
 'teams_url',
 'hooks_url',
 'issue_events_url',
 'events_url',
 'assignees_url',
 'branches_url',
 'tags_url',
 'blobs_url',
 'git_tags_url',
 'git_refs_url',
 'trees_url',
 'statuses_url',
 'languages_url',
 'stargazers_url',
 'contributors_url',
 'subscribers_url',
 'subscription_url',
 'commits_url',
 'git_commits_url',
 'comments_url',
 'issue_comment_url',
 'contents_url',
 'compare_url',
 'merges_url',
 'archive_url',
 'downloads_url',
 'issues_url',
 'pulls_url',
 'milestones_url',
 'notifications_url',
 'labels_url',
 'releases_url',
 'deployments_url',
 'created_at',
 'updated_at',
 'pushed_at',
 'git_url',
 'ssh_url',
 'clone_url',
 'svn_url',
 'homepage',
 'size',
 'stargazers_count',
 'watchers_count',
 'language',
 'has_issues',
 'has_projects',
 'has_downloads',
 'has_wiki',
 'has_pages',
 'forks_count',
 'mirror_u

In [28]:
import pandas as pd

# Alpha-2 codes of the EU (or closely affiliated) countries
eu_codes = set(pd.read_csv("../collect/data_samples/iso3166_alpha2_codes.csv")['ISO2'])
# Country codes with English and local-language names
df = pd.read_csv("../collect/data_samples/country_codes_names.csv")
df.columns = ['iso2', 'name_en', 'name_local', 'lang']
df_eu = df.loc[df.iso2.isin(eu_codes)]
other_names = ['czechia']  # extra name variants to match

def ctry_list(df, other_names):
    """Return lowercase country codes and names, splitting comma-separated entries."""
    _ctry_list = list(df.iso2) + list(df.name_en) + list(df.name_local) + other_names
    output = []
    for ctry in _ctry_list:
        if pd.isnull(ctry):
            continue
        output += ctry.lower().split(",")
    return output

In [30]:
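import re

def split_tag(tag):
    """Split a topic on non-letter characters and lowercase the pieces."""
    return [x.lower() for x in re.split(r'[^a-zA-Z]', tag)]

def print_n_countries(data, df, other_names):
    """Print how many topics contain a country code or name as a whole token."""
    _ctry_list = ctry_list(df, other_names)
    n = sum(any(ctry in split_tag(tag) for ctry in _ctry_list)
            for tag in data.keys())
    print(n)

print_n_countries(data, df, other_names)     # all countries
print_n_countries(data, df_eu, other_names)  # EU (or closely affiliated) countries only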

150
20
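To make the matching rule above concrete, here is a minimal, self-contained sketch run on made-up inputs. The topic names, the short list of country terms, and the helper `count_country_topics` are all illustrative assumptions, not taken from the dataset. The sketch splits each topic on non-letter characters and counts the topics in which at least one token exactly equals a country term, which is the same test that `print_n_countries` applies.

```python
import re

def split_tag(tag):
    """Split a topic on any non-letter character and lowercase the tokens."""
    return [x.lower() for x in re.split(r"[^a-zA-Z]", tag)]

def count_country_topics(topics, country_terms):
    """Count topics with at least one token that exactly equals a country term."""
    terms = set(country_terms)
    return sum(any(tok in terms for tok in split_tag(t)) for t in topics)

# Hypothetical topics and country terms, for illustration only
toy_topics = [
    "covid-19",
    "covid19-italy",
    "coronavirus-de",
    "covid-tracker",
    "covid-in-numbers",
]
toy_terms = ["italy", "it", "germany", "de", "india", "in"]

n_matched = count_country_topics(toy_topics, toy_terms)
print(n_matched)
```

In this toy run, `covid-in-numbers` is counted because the token `in` equals the two-letter code for India. Two-letter codes that are also common words may therefore inflate the counts reported above, which is one reason to treat them as preliminary.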


## Temporal coverage

### Is the data:

~~1. Static (it will never change)~~

2. Updating (it will increase in size). If so, at what rate?

~~3. Refreshing (old data will disappear). If so, at what rate and interval?~~

In the first month in which covid became a significant issue in Western nations (March 2020), [around 6000 repos were created with a covid tag](https://github.blog/2020-03-23-open-collaboration-on-covid-19/). This number has less than doubled in the subsequent 3 months, from which I infer that the number of repositories is likely to plateau. This does not, however, mean that contributions or downloads have stopped.

A lack of new activity is nevertheless an up-to-date statistic, so I would describe the data as having excellent temporal coverage from at least March 2020 onwards.
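The plateau argument rests on simple arithmetic, sketched below using only the two figures quoted in the text (around 6000 repos in the first month, and a total that less than doubled over the following 3 months). The variable names are illustrative, and the result is an upper bound rather than a measured rate.

```python
# Figures taken from the text above
first_month_repos = 6000   # approximate repos created in March 2020
months_after = 3           # length of the follow-up period

# "Less than doubled" means the total stayed below 2 x 6000
max_total = 2 * first_month_repos
max_new_repos = max_total - first_month_repos

# Upper bound on the average number of new repos per month afterwards
max_avg_per_month = max_new_repos / months_after
ratio_to_first_month = max_avg_per_month / first_month_repos

print(max_avg_per_month)
print(round(ratio_to_first_month, 2))
```

Under these assumptions, the average creation rate after March is below 2000 repos per month, at most a third of the first month's rate.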

## Ecosystem coverage

### List which ecosystems are covered by this dataset

From inspection, repositories are broadly:

* Data journalism
* Epidemiological modelling
* Epidemiological measurement
* New technologies with applications to covid (e.g. tracing apps)


### To what extent could there be partial coverage or blindspots?

It is unknown how well the data covers ethnic minorities, or how balanced it is by gender.

It would be useful to have a counterfactual to fully understand the geographic and temporal coverage, perhaps a random selection of repositories from general tags that is 10 times larger. Note that this would cost an additional $400 or so.