# Collect github-eu-covid data

This notebook outlines preliminary findings from collecting github-eu-covid data.

## Source of the data

The GitHub REST API is used to extract repositories that have been manually tagged with COVID-related topics.
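
For orientation, the following is a minimal sketch of how a request to the GitHub search endpoints might be assembled. It is not the contents of `collect.py`: the helper name `build_search_request` is hypothetical, and the `topic:` qualifier is only one way of expressing a topic search. The sketch builds the URL and query parameters without sending anything.

```python
def build_search_request(endpoint, query, page=1, per_page=100):
    """Return (url, params) for a GitHub search endpoint.

    `endpoint` is e.g. 'repositories' or 'topics'. This helper is a
    hypothetical illustration, not the `search` function in collect.py.
    """
    url = f"https://api.github.com/search/{endpoint}"
    # per_page is capped at 100 by the API; further results need more pages
    params = {"q": query, "per_page": per_page, "page": page}
    return url, params


# Example: first page of repositories tagged with the topic 'covid'
url, params = build_search_request("repositories", "topic:covid")

# The request itself could then be sent with, for example:
#   requests.get(url, params=params)
# The JSON response contains 'total_count' and a list of 'items'.
```

Keeping request construction separate from sending makes the paging logic easy to check without touching the network.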

## Collection options

Here we show different options for collection, where applicable. Any prototype code used for data collection is provided in `collect.py`.

### Option a) 

In [1]:
from collect import search  # See collect.py
from collections import defaultdict
import requests_cache
requests_cache.install_cache('collect_cache', backend='sqlite')

In [2]:
# Find every topic that matches one of the COVID-related search terms
topics = []
for topic in ['sars-cov-2', 'covid', 'coronavirus', '2019-ncov']:
    topics += [result['name'] for _, result in search(q=topic, endpoint='topics', 
                                                      created='>2020-01-01', repositories=':>1')]
topics = set(topics)
print('Found', len(topics), 'topics')

# For each topic, record the total repo count and keep a sample of repos
results = defaultdict(list)
totals = {}
for t in topics:
    for total, result in search(endpoint='repositories', topic=t):
        totals[t] = total
        results[t].append(result)
        if len(results[t]) == 100:  # stop after a sample of 100 repos per topic
            break

Found 231 topics


In [3]:
# contributors = [contributors_url --> COUNT, url --> LOCATION]
# Other fields: contributor_count, contributor_locations, owner_location, topic_count_owner_repos
# Will require (3 + 2*n_contributors + n_repos_per_owner)*n_repos total calls

In [4]:
# Calculate the total number of API calls required to collect all repos
# NB: floor division makes this a lower bound (e.g. 150 repos needs 2 calls, not 1)
sum(v // 100 if v > 100 else 1 for v in totals.values())

309
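
The count above uses floor division, so a topic with a partly filled last page is undercounted. A minimal sketch of a ceiling-based count follows. The function name `pages_needed` and the example totals are made up for illustration. The optional `max_results` cap is an assumption reflecting the search API's limit of 1000 results per query, which should be checked against the current API documentation.

```python
import math


def pages_needed(total, per_page=100, max_results=None):
    """Number of API calls needed to page through `total` results."""
    if max_results is not None:
        total = min(total, max_results)
    # At least one call is always needed, even for very small topics
    return max(1, math.ceil(total / per_page))


# Made-up totals, for illustration only (not the real `totals` dict)
example_totals = {"topic-a": 50, "topic-b": 100, "topic-c": 150, "topic-d": 2500}

floor_estimate = sum(v // 100 if v > 100 else 1 for v in example_totals.values())
ceil_estimate = sum(pages_needed(v) for v in example_totals.values())
capped_estimate = sum(pages_needed(v, max_results=1000) for v in example_totals.values())
```

With these made-up totals, the floor-based estimate is one call short of the ceiling-based one, because `topic-c` needs a second page.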

In [5]:
import json
with open('data_samples/repos_sample.json', 'w') as f:
    json.dump(results, f)

## Practical considerations
This is where we consider CPU time, financial cost, disk space requirements, and last (but not least) development time/uncertainty.

### CPU time
#### Integrated collection time
*This is an estimate of the time required to collect the data, without batching or parallelisation.*

There are a total of 11696 repos associated with the covid topics. The repos are collected in chunks of roughly 100, though the tail of smaller topics is collected in smaller chunks. In total, there will be 309 API calls. The rate limit of 10 calls per minute therefore requires about 31 minutes for repo collection.
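
As an illustration of how the rate limit translates into wall-clock time, here is a minimal sketch of a throttling generator. The name `throttle` and its injectable `sleep` and `clock` arguments are hypothetical choices made so the behaviour can be checked without waiting. It is not taken from `collect.py`.

```python
import math
import time


def throttle(items, calls_per_minute=10, sleep=time.sleep, clock=time.monotonic):
    """Yield items no faster than `calls_per_minute`."""
    min_interval = 60.0 / calls_per_minute
    last = None
    for item in items:
        if last is not None:
            # Wait out whatever remains of the minimum interval
            wait = min_interval - (clock() - last)
            if wait > 0:
                sleep(wait)
        last = clock()
        yield item


# 309 calls at 10 calls per minute, rounded up to whole minutes
estimated_minutes = math.ceil(309 / 10)
```

A collection loop could then iterate over `throttle(pages)` and issue one request per yielded page.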

To collect extended metadata:

- Repo owner location
- Repo contributor locations
- Repo contributor count
- Topic counts of owner's other repos

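Collecting these fields will require $(3 + 2 \times n_{\text{contributor}} + n_{\text{repos per owner}}) \times n_{\text{repos}}$ total calls. Assuming $n_{\text{contributor}}=3$ and $n_{\text{repos per owner}}=5$, and the same rate limit of 10 calls per minute, this amounts to roughly 164,000 calls, or about 16000 minutes (11 days).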
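
Worked through with the figures above ($n_{\text{repos}} = 11696$, and assuming the 10 calls per minute rate limit also applies to these calls), the estimate is:

$$
n_{\text{calls}} = (3 + 2 \times 3 + 5) \times 11696 = 14 \times 11696 = 163744
$$

$$
t = \frac{163744 \text{ calls}}{10 \text{ calls/min}} \approx 16374 \text{ min} \approx 273 \text{ hours} \approx 11.4 \text{ days}
$$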

#### Can the procedure be batched? Are there any caveats to this?

The second part (collection of the extended metadata) could certainly be batched, for example by splitting the list of repos across machines.
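
One possible way to batch the work is to split the repo list into near-equal chunks, one per machine. The sketch below assumes that each repo's metadata can be collected independently of the others. The helper name `make_batches` is hypothetical, and integer IDs stand in for the real repo records.

```python
def make_batches(items, n_batches):
    """Split `items` into at most `n_batches` contiguous, near-equal chunks."""
    items = list(items)
    size, extra = divmod(len(items), n_batches)
    batches, start = [], 0
    for i in range(n_batches):
        # The first `extra` batches take one additional item each
        end = start + size + (1 if i < extra else 0)
        if end > start:  # skip empty batches when there are few items
            batches.append(items[start:end])
        start = end
    return batches


# Stand-in repo IDs: 11696 repos split across 200 machines
repo_batches = make_batches(range(11696), 200)
```

Each batch holds 58 or 59 repos, so no machine carries noticeably more work than another under the stated assumption.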

#### Real world collection time / cost
*Assume a maximum of 200 concurrent 8GB 2-core machines*

*NB (at time of writing based on [this](https://aws.amazon.com/ec2/pricing/on-demand/)) such a machine would cost $0.0944 per hour*

The total collection time with 200 concurrent machines would be less than 2 hours. Budgeting for the full 2 hours on every machine, the cost would be about $38.
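
The arithmetic behind this estimate can be sketched as follows. The function name `fleet_cost` is hypothetical, and rounding up to whole billed hours is an assumption made here to give a conservative figure. The actual billing granularity should be checked against the provider's pricing page.

```python
import math


def fleet_cost(total_minutes, n_machines, hourly_rate, bill_whole_hours=True):
    """Return (wall_clock_hours, total_cost) for evenly divided work."""
    wall_hours = total_minutes / n_machines / 60
    # Assumption: each machine is billed for every started hour
    billed_hours = math.ceil(wall_hours) if bill_whole_hours else wall_hours
    return wall_hours, n_machines * billed_hours * hourly_rate


# 163744 calls at 10 calls per minute, spread over 200 machines
total_minutes = 163744 / 10
wall_hours, cost = fleet_cost(total_minutes, n_machines=200, hourly_rate=0.0944)
_, prorated_cost = fleet_cost(total_minutes, 200, 0.0944, bill_whole_hours=False)
```

Under these assumptions the wall-clock time is a little under an hour and a half, and the pro-rated cost is lower than the whole-hour figure.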

### Disk space (GB)

#### By entity type, estimate how many "rows" there are to collect (e.g. 100s, 1000s, etc)

Repos: 10000s (11696 across the covid topics)

#### By entity type, and based on the field types, what is the estimated disk space?

#### What does this imply for database storage costs?

Negligible

### Development time
*How long do you think it will take to develop the codebase for the collection?*
*What uncertainties can you foresee?*

Unless extensions are required (e.g. in-depth analysis of the codebase), development will likely take one week.

There are uncertainties: because the collection will be batched, it could be temperamental. The number of required collections could also be significantly higher than estimated, though this is difficult to calculate properly in advance.