Analyze web-hosted JSON data
============================

This notebook reads and processes JSON-encoded data hosted on the web using a combination of [Dask Bag](https://docs.dask.org/en/latest/bag.html) and [Dask Dataframe](https://docs.dask.org/en/latest/dataframe.html).

This data comes from [mybinder.org](https://mybinder.org), a web service that runs Jupyter notebooks live on the web (you may be running this notebook there now).  My Binder publishes a record every time someone launches a live notebook like this one, and stores those records in publicly accessible JSON files, one file per day.

## Introduction to the dataset

This data is stored as JSON-encoded text files on the public web.  Here are some example lines.

In [1]:
import dask.bag as db
db.read_text('https://archive.analytics.mybinder.org/events-2018-11-03.jsonl').take(3)

('{"timestamp": "2018-11-03T00:00:00+00:00", "schema": "binderhub.jupyter.org/launch", "version": 1, "provider": "GitHub", "spec": "Qiskit/qiskit-tutorial/master", "status": "success"}\n',
 '{"timestamp": "2018-11-03T00:00:00+00:00", "schema": "binderhub.jupyter.org/launch", "version": 1, "provider": "GitHub", "spec": "ipython/ipython-in-depth/master", "status": "success"}\n',
 '{"timestamp": "2018-11-03T00:00:00+00:00", "schema": "binderhub.jupyter.org/launch", "version": 1, "provider": "GitHub", "spec": "QISKit/qiskit-tutorial/master", "status": "success"}\n')

Each file contains one line for every time someone started a live notebook on the site.  Each line records the time the notebook was started, as well as the repository from which it was served.

In this notebook we'll look at many such files, parse them from JSON to Python dictionaries, and then from there to Pandas dataframes.  We'll then do some simple analyses on this data.
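As a minimal sketch of that first parsing step, the snippet below applies `json.loads` from the standard library to two of the example lines shown above, without Dask or network access. The `sample_lines` list is copied from the output above purely for illustration.

```python
import json

# Two raw lines copied from the example output above (trailing newline included).
sample_lines = [
    '{"timestamp": "2018-11-03T00:00:00+00:00", "schema": "binderhub.jupyter.org/launch", "version": 1, "provider": "GitHub", "spec": "Qiskit/qiskit-tutorial/master", "status": "success"}\n',
    '{"timestamp": "2018-11-03T00:00:00+00:00", "schema": "binderhub.jupyter.org/launch", "version": 1, "provider": "GitHub", "spec": "ipython/ipython-in-depth/master", "status": "success"}\n',
]

# json.loads ignores the trailing newline and returns one dictionary per line.
records = [json.loads(line) for line in sample_lines]

for record in records:
    print(record["timestamp"], record["provider"], record["spec"])
```

Once each line is a dictionary, fields such as `spec` and `provider` can be accessed by key, which is what the later `pluck` and dataframe steps rely on.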

## Start Dask Client for Dashboard

Starting the Dask Client is optional.  It also starts the dashboard, which
is useful for gaining insight into the computation.

In [2]:
from dask.distributed import Client, progress
client = Client(threads_per_worker=1, 
                n_workers=4,
                memory_limit='2GB')
client

Client:  Scheduler: tcp://127.0.0.1:46599  Dashboard: http://127.0.0.1:8787/status
Cluster:  Workers: 4  Cores: 4  Memory: 8.00 GB


## Get a list of files on the web

The mybinder.org team maintains an index file that points to all other available JSON files of data.  Let's convert this to a list of URLs that we'll read in the next section.

In [3]:
import dask.bag as db
import json

In [4]:
db.read_text('https://archive.analytics.mybinder.org/index.jsonl').map(json.loads).compute()

[{'name': 'events-2018-11-03.jsonl', 'date': '2018-11-03', 'count': '7057'},
 {'name': 'events-2018-11-04.jsonl', 'date': '2018-11-04', 'count': '7489'},
 {'name': 'events-2018-11-05.jsonl', 'date': '2018-11-05', 'count': '13590'},
 {'name': 'events-2018-11-06.jsonl', 'date': '2018-11-06', 'count': '13920'},
 {'name': 'events-2018-11-07.jsonl', 'date': '2018-11-07', 'count': '12766'},
 {'name': 'events-2018-11-08.jsonl', 'date': '2018-11-08', 'count': '14105'},
 {'name': 'events-2018-11-09.jsonl', 'date': '2018-11-09', 'count': '11843'},
 {'name': 'events-2018-11-10.jsonl', 'date': '2018-11-10', 'count': '7047'},
 {'name': 'events-2018-11-11.jsonl', 'date': '2018-11-11', 'count': '6940'},
 {'name': 'events-2018-11-12.jsonl', 'date': '2018-11-12', 'count': '16322'},
 {'name': 'events-2018-11-13.jsonl', 'date': '2018-11-13', 'count': '16530'},
 {'name': 'events-2018-11-14.jsonl', 'date': '2018-11-14', 'count': '14099'},
 {'name': 'events-2018-11-15.jsonl', 'date': '2018-11-15', 'count': 

In [5]:
filenames = (db.read_text('https://archive.analytics.mybinder.org/index.jsonl')
               .map(json.loads)
               .pluck('name')
               .compute())

filenames = ['https://archive.analytics.mybinder.org/' + fn for fn in filenames]
filenames[:5]

['https://archive.analytics.mybinder.org/events-2018-11-03.jsonl',
 'https://archive.analytics.mybinder.org/events-2018-11-04.jsonl',
 'https://archive.analytics.mybinder.org/events-2018-11-05.jsonl',
 'https://archive.analytics.mybinder.org/events-2018-11-06.jsonl',
 'https://archive.analytics.mybinder.org/events-2018-11-07.jsonl']
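The same index-to-URL step can be sketched without network access. This is an illustration under assumptions: `index_records` is a hand-copied subset of the index output above, and `BASE_URL` and `to_url` are names introduced here. It uses `urllib.parse.urljoin` in place of string concatenation, and converts `count` to an integer because the index stores it as a string.

```python
from urllib.parse import urljoin

# Assumed base location, taken from the URLs used in this notebook.
BASE_URL = "https://archive.analytics.mybinder.org/"

# Hand-copied subset of the index records shown above.
index_records = [
    {"name": "events-2018-11-03.jsonl", "date": "2018-11-03", "count": "7057"},
    {"name": "events-2018-11-04.jsonl", "date": "2018-11-04", "count": "7489"},
    {"name": "events-2018-11-05.jsonl", "date": "2018-11-05", "count": "13590"},
]


def to_url(record):
    """Build the full URL of a daily file from its index record (hypothetical helper)."""
    return urljoin(BASE_URL, record["name"])


urls = [to_url(record) for record in index_records]

# The index stores counts as strings, so convert before summing.
total_events = sum(int(record["count"]) for record in index_records)

print(urls[0])
print(total_events)
```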

## Create Bag of all events

We now create a [Dask Bag](https://docs.dask.org/en/latest/bag.html) around that list of URLs, and then call the `json.loads` function on every line to turn those lines of JSON-encoded text into Python dictionaries that can be more easily manipulated.

In [6]:
events = db.read_text(filenames).map(json.loads)
events.take(2)

({'timestamp': '2018-11-03T00:00:00+00:00',
  'schema': 'binderhub.jupyter.org/launch',
  'version': 1,
  'provider': 'GitHub',
  'spec': 'Qiskit/qiskit-tutorial/master',
  'status': 'success'},
 {'timestamp': '2018-11-03T00:00:00+00:00',
  'schema': 'binderhub.jupyter.org/launch',
  'version': 1,
  'provider': 'GitHub',
  'spec': 'ipython/ipython-in-depth/master',
  'status': 'success'})

## Most Popular Binders

Let's do a simple frequency count to find the binders that are run most often.

In [7]:
events.pluck('spec').frequencies(sort=True).take(20)

(('ipython/ipython-in-depth/master', 4842086),
 ('jupyterlab/jupyterlab-demo/try.jupyter.org', 1589797),
 ('jupyterlab/jupyterlab-demo/master', 910425),
 ('binder-examples/requirements/master', 423057),
 ('ines/spacy-io-binder/live', 331240),
 ('DS-100/textbook/master', 221404),
 ('bokeh/bokeh-notebooks/master', 180361),
 ('binder-examples/r/master', 171114),
 ('ines/spacy-course/binder', 123571),
 ('ELC/8fdc0f490b3058872a7014f01416dfb6/master', 120699),
 ('QuantStack/xeus-cling/stable', 111675),
 ('explosion/spacy-io-binder/live', 108631),
 ('SamLau95/nbinteract/master', 94141),
 ('rationalmatter/juno-demo-notebooks/master', 69154),
 ('numba/numba-examples/master', 66974),
 ('joelachance/thebelab-requirements/master', 66961),
 ('rasahq/docs-binder/master', 63646),
 ('binder-examples/demo-julia/master', 59692),
 ('Petlja/AnalizaPodatakaGim2/master', 58819),
 ('ELC/380e584b87227b15727ec886223d9d4a/master', 51165))
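For intuition, `pluck('spec').frequencies(sort=True)` gives the same kind of result as counting values with `collections.Counter` on a single machine. The sketch below is a plain-Python analogue on a tiny hand-made sample (the `sample_events` list is hypothetical); it does not show how Dask performs the count across partitions.

```python
from collections import Counter

# Hypothetical, hand-made sample of parsed events.
sample_events = [
    {"spec": "ipython/ipython-in-depth/master", "provider": "GitHub"},
    {"spec": "Qiskit/qiskit-tutorial/master", "provider": "GitHub"},
    {"spec": "ipython/ipython-in-depth/master", "provider": "GitHub"},
    {"spec": "jupyterlab/jupyterlab-demo/master", "provider": "GitHub"},
    {"spec": "ipython/ipython-in-depth/master", "provider": "GitHub"},
    {"spec": "Qiskit/qiskit-tutorial/master", "provider": "GitHub"},
]

# "pluck" the spec field from each event, then count occurrences.
spec_counts = Counter(event["spec"] for event in sample_events)

# most_common(n) returns the n most frequent (spec, count) pairs, highest first.
top_two = spec_counts.most_common(2)
print(top_two)
```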

## Convert to Dask Dataframe

Finally, we can convert our bag of Python dictionaries into a [Dask Dataframe](https://docs.dask.org/en/latest/dataframe.html), and follow up with more Pandas-like computations.

We'll do the same computation as above, now with Pandas syntax.

In [8]:
df = events.to_dataframe()
df.head()

| | timestamp | schema | version | provider | spec | status |
|---|---|---|---|---|---|---|
| 0 | 2018-11-03T00:00:00+00:00 | binderhub.jupyter.org/launch | 1 | GitHub | Qiskit/qiskit-tutorial/master | success |
| 1 | 2018-11-03T00:00:00+00:00 | binderhub.jupyter.org/launch | 1 | GitHub | ipython/ipython-in-depth/master | success |
| 2 | 2018-11-03T00:00:00+00:00 | binderhub.jupyter.org/launch | 1 | GitHub | QISKit/qiskit-tutorial/master | success |
| 3 | 2018-11-03T00:01:00+00:00 | binderhub.jupyter.org/launch | 1 | GitHub | QISKit/qiskit-tutorial/master | success |
| 4 | 2018-11-03T00:01:00+00:00 | binderhub.jupyter.org/launch | 1 | GitHub | jupyterlab/jupyterlab-demo/master | success |


In [9]:
df.spec.value_counts().nlargest(20).to_frame().compute()

| | spec |
|---|---|
| ipython/ipython-in-depth/master | 4842086 |
| jupyterlab/jupyterlab-demo/try.jupyter.org | 1589797 |
| jupyterlab/jupyterlab-demo/master | 910425 |
| binder-examples/requirements/master | 423057 |
| ines/spacy-io-binder/live | 331240 |
| DS-100/textbook/master | 221404 |
| bokeh/bokeh-notebooks/master | 180361 |
| binder-examples/r/master | 171114 |
| ines/spacy-course/binder | 123571 |
| ELC/8fdc0f490b3058872a7014f01416dfb6/master | 120699 |


## Persist in memory

This dataset fits nicely into memory. Let's avoid downloading the data every time we do an operation and instead keep it in local memory.

In [10]:
df = df.persist()

Honestly, at this point it makes more sense to just switch to Pandas, but this is a Dask example, so we'll continue with Dask dataframe.

## Investigate providers other than GitHub

Most binders are specified as git repositories on GitHub, but not all.  Let's investigate other providers.

In [11]:
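import urllib.parse  # import the submodule explicitly; `import urllib` alone does not guarantee `urllib.parse` is loaded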

In [12]:
df.provider.value_counts().compute()

GitHub        13167205
Gist            220750
Git             182100
GitLab           56784
Zenodo            1273
Hydroshare         462
Figshare           316
Dataverse          201
Name: provider, dtype: int64
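To put these counts in proportion, the sketch below recomputes each provider's share of all launches in plain Python. The `provider_counts` dictionary is copied by hand from the output above; the names `total` and `shares` are introduced here for illustration.

```python
# Counts copied from the value_counts() output above.
provider_counts = {
    "GitHub": 13167205,
    "Gist": 220750,
    "Git": 182100,
    "GitLab": 56784,
    "Zenodo": 1273,
    "Hydroshare": 462,
    "Figshare": 316,
    "Dataverse": 201,
}

total = sum(provider_counts.values())

# Fraction of all launches attributed to each provider.
shares = {name: count / total for name, count in provider_counts.items()}

for name, share in shares.items():
    print(f"{name:<12}{share:8.3%}")
```

In this snapshot GitHub accounts for a little over 96% of launches, which is why the remaining providers are easier to study once they are filtered out separately.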

In [13]:
(df[df.provider == 'GitLab']
 .spec
 .map(urllib.parse.unquote, meta=('spec', object))
 .value_counts()
 .to_frame()
 .compute())

| | spec |
|---|---|
| DGothrek/ipyaggrid/binder-demo | 2897 |
| rafelyall/biology-on-the-command-line/master | 2808 |
| JC_Bonnefoy/snt_2019/master | 2627 |
| lfortran/web/lfortran-binder/master | 2586 |
| rafelyall/introduction-to-ngs/master | 1779 |
| ... | ... |
| jmduarte1/rta-workshop/master | 1 |
| jibe-b/toupie-data-store/master | 1 |
| jibe-b/sens-de-la-vie-workflow-brouillon-tests-tuto/dev | 1 |
| jibe-b/sabre/master | 1 |
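The `urllib.parse.unquote` call in the cell above decodes percent-escapes in each spec. As a small sketch, assume a GitLab spec stores the namespace and project path with `/` encoded as `%2F`; the `raw_spec` value below is a hypothetical example constructed for illustration, not a string copied from the dataset.

```python
from urllib.parse import quote, unquote

# Hypothetical raw spec: project path percent-encoded, followed by the ref.
raw_spec = "lfortran%2Fweb%2Flfortran-binder/master"

# unquote replaces each %XX escape with the character it encodes.
decoded = unquote(raw_spec)
print(decoded)

# quote with safe="" does the reverse for a path, encoding every slash.
encoded_path = quote("lfortran/web/lfortran-binder", safe="")
print(encoded_path)
```

Decoding before `value_counts` makes the resulting index readable as ordinary repository paths.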


In [14]:
(df[df.provider == 'Git']
 .spec
 .apply(urllib.parse.unquote, meta=('spec', object))
 .value_counts()
 .to_frame()
 .compute())

| | spec |
|---|---|
| https://jovian.ml/api/git/6213bcb90b054261b82e68a6b8c75c3f_11.git/fb8b5c097292a32d8a35281e6db69031e9e33c9c | 8637 |
| https://jovian.ml/api/git/48a3164311c945628d7053812c1ce869_15.git/a6696d9684d6e695fe2b4fc98896beaf789e40c7 | 6833 |
| https://gitlab.gwdg.de/fuelle1/statistik.git/master | 6517 |
| https://jovian.ml/api/git/77bf991ec84148f191e8b573242b9c60_6.git/106c482fe6a2753e8fad4c94c5c7bf6b49169e0a | 6203 |
| https://gitlab.tubit.tu-berlin.de/dima/ISDA-SS20.git/master | 6124 |
| ... | ... |
| https://jovian.ml/api/git/96fb6378bca34ed6bdec85af965923d8_7.git/a773e16e280e92611ff8496400d1cf1570b6b748 | 1 |
| https://jovian.ml/api/git/96fe3a549ebb49b790f998a21e6125cd_1.git/0b6f7e7ad9cddde5d80702b770f64e233c54a535 | 1 |
| https://jovian.ml/api/git/97032d80dbc34ba5a3a47dc7d394f200_14.git/b32ead231283ad732824c0618fc684d6eceee08a | 1 |
| https://jovian.ml/api/git/97032d80dbc34ba5a3a47dc7d394f200_15.git/e69d216310323fac4c3711889e6d8b41152313c8 | 1 |
