# Analysis of Jupyter notebooks by kernel language

Data were collected by running the following query
on Google BigQuery on 28 October 2018.
The results were saved to a CSV file.

```sql
SELECT
  COUNT(contents.id) AS N,
  JSON_EXTRACT(contents.content, '$.metadata.kernelspec.language') AS kernel_lang,
  JSON_EXTRACT(contents.content, '$.metadata.kernelspec.name') AS kernel_name
FROM
  `bigquery-public-data.github_repos.files` files
LEFT JOIN
  `bigquery-public-data.github_repos.contents` contents
ON
  files.id = contents.id
WHERE
#  languages.language.name = 'Jupyter Notebook'
  NOT contents.binary
  AND files.path LIKE '%.ipynb'
GROUP BY
  kernel_lang,
  kernel_name
ORDER BY N DESC

```



## Results from GitHub data on Google BigQuery

In [1]:
!head results-20181028-160521.csv

N,kernel_lang,kernel_name
90734,"""python""","""python3"""
57172,,
56363,"""python""","""python2"""
3540,"""python""","""Python [Root]"""
3400,"""python""","""conda-root-py"""
1395,"""R""","""ir"""
1032,"""julia""","""julia-0.4"""
544,"""julia""","""julia-0.6"""
463,"""julia""","""julia-0.5"""


In [2]:
import re
import pandas as pd
def unquote(s):
    if s:
        return s.strip('"').lower()
df = pd.read_csv(
    "results-20181028-160521.csv",
    converters={
        'kernel_name': unquote,
        'kernel_lang': unquote,
    },
).fillna('')
lang = df.kernel_lang
name = df.kernel_name
df.head()


Unnamed: 0,N,kernel_lang,kernel_name
0,90734,python,python3
1,57172,,
2,56363,python,python2
3,3540,python,python [root]
4,3400,python,conda-root-py
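
The raw CSV shows values such as `"""python"""` because `JSON_EXTRACT` returns JSON-formatted strings (quotes included), and the CSV export then escapes those quotes. The following is a minimal sketch of why the `unquote` converter is needed, using only the standard library and one row copied from the preview above. (BigQuery's `JSON_EXTRACT_SCALAR` would likely avoid the extra quotes at query time, but that is an assumption not tested here.)

```python
import csv
import io

# Header plus one data row, copied from the CSV preview above.
raw = 'N,kernel_lang,kernel_name\n90734,"""python""","""python3"""\n'

header, first = list(csv.reader(io.StringIO(raw)))

# CSV decoding strips the outer quotes and collapses each "" to ",
# leaving the JSON string literal, quotes and all.
assert first == ['90734', '"python"', '"python3"']


def unquote(s):
    """Same idea as the converter above: drop JSON quotes, lowercase."""
    if s:
        return s.strip('"').lower()


cleaned = [unquote(field) for field in first[1:]]
print(cleaned)  # ['python', 'python3']
```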


In [3]:
len(df.kernel_lang.unique())

71

In [4]:
len(df.kernel_name.unique())

1074

Total notebooks found:

In [5]:
df.N.sum()

226372

Top languages:

In [6]:
df.groupby('kernel_lang').N.sum().sort_values(ascending=False).head(10)

kernel_lang
python     160426
            58605
julia        2664
r            1712
scala         332
haskell       322
bash          317
groovy        268
c++           267
ruby          184
Name: N, dtype: int64

In [7]:
df[lang == "gap"]

Unnamed: 0,N,kernel_lang,kernel_name
238,8,gap,gap-native


Finding xeus-cling kernels:

In [8]:
import re
df[
    lang.str.contains(
        re.compile(r"c\+\+", re.IGNORECASE)
    )
]

Unnamed: 0,N,kernel_lang,kernel_name
15,267,c++,root
44,68,c++14,xeus-cling-cpp14
551,2,c++17,xeus-cling-cpp17
696,2,c++11,xeus-cling-cpp11


In [9]:
df[
    name.str.contains("cling")
]

Unnamed: 0,N,kernel_lang,kernel_name
44,68,c++14,xeus-cling-cpp14
52,58,,xeus-cling-cpp14
83,29,,cling
125,16,,cling-cpp14
332,5,,cling-cpp11
551,2,c++17,xeus-cling-cpp17
696,2,c++11,xeus-cling-cpp11


In [10]:
_.N.sum()

180

180 notebooks using cling-based kernels (xeus-cling and plain cling)
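
The `"cling"` filter above matches both `xeus-cling-*` kernel names and plain `cling*` ones, so the total of 180 mixes the two. As a rough sketch, the rows from the output above can be split by kernel-name prefix. Treating the `xeus-cling` prefix as the distinguishing mark is an assumption made here for illustration.

```python
# (N, kernel_lang, kernel_name) rows copied from the "cling" filter output above.
rows = [
    (68, "c++14", "xeus-cling-cpp14"),
    (58, "", "xeus-cling-cpp14"),
    (29, "", "cling"),
    (16, "", "cling-cpp14"),
    (5, "", "cling-cpp11"),
    (2, "c++17", "xeus-cling-cpp17"),
    (2, "c++11", "xeus-cling-cpp11"),
]

# Assumption: a kernel name starting with "xeus-cling" identifies xeus-cling.
xeus_total = sum(n for n, _, name in rows if name.startswith("xeus-cling"))
other_total = sum(n for n, _, name in rows if not name.startswith("xeus-cling"))

print(xeus_total, other_total, xeus_total + other_total)  # 130 50 180
```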

Finding sage:

In [11]:
df[name.str.contains("sage") | lang.str.contains("sage")]

Unnamed: 0,N,kernel_lang,kernel_name
26,124,,sagemath
154,13,r,ir-sage
315,6,,sage-7.6
501,3,sagemath,sagemath
731,1,,sagemath


In [12]:
_.N.sum()

147

147 Sage notebooks

In [13]:
df[name.str.contains("pari") | lang.str.contains("gp")]

Unnamed: 0,N,kernel_lang,kernel_name


Finding Singular:

In [14]:
df[name.str.contains("singular") | lang.str.contains("singular")]

Unnamed: 0,N,kernel_lang,kernel_name


No results? That's odd, but [this report](https://medium.com/google-cloud/github-on-bigquery-analyze-all-the-code-b3576fd2b150) tells us why: not all GitHub data is on BigQuery. In particular, small repos or those without a license are not included. This has a big impact on our results, because we are especially interested in small user repos.

## Results from GitHub Search API

However, we can use the GitHub search API, following the method [described here](https://github.com/parente/nbestimate) and the information above, to get alternative estimates of notebook counts.

We can design our searches based on the results above:

In [15]:
import requests
import requests_cache
requests_cache.install_cache()

def count_notebooks(**kwargs):
    """Count the notebooks matching a particular metadata query"""
    query = "extension:ipynb"
    for key, value in kwargs.items():
        query = f'{query} "{key}: {value}"'
    if not kwargs:
        query += ' nbformat'
    r = requests.get('https://api.github.com/search/code', params={
        'q': query,
    })
    r.raise_for_status()
    return r.json()['total_count']
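
To see what search string `count_notebooks` sends without making any network request, the query-building step can be pulled out on its own. This is only a sketch: `build_query` is a hypothetical helper name, not part of the notebook, and it mirrors the loop above line for line.

```python
def build_query(**kwargs):
    """Build the search string the same way count_notebooks does (no request sent)."""
    query = "extension:ipynb"
    for key, value in kwargs.items():
        # Each metadata pair becomes a quoted phrase such as "name: sagemath".
        query = f'{query} "{key}: {value}"'
    if not kwargs:
        # With no filters, match on a key that appears in every notebook file.
        query += " nbformat"
    return query


print(build_query())                  # extension:ipynb nbformat
print(build_query(name="sagemath"))   # extension:ipynb "name: sagemath"
print(build_query(language="gap"))    # extension:ipynb "language: gap"
```

Because the filter is a phrase match on the notebook's text rather than a structured metadata lookup, the resulting counts are best read as estimates.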


The total number of notebooks on GitHub:

In [16]:
count_notebooks()

3087257

That's a *lot* more notebooks!
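
To put the two totals side by side, here is a quick back-of-the-envelope calculation using the numbers reported above. The two counts come from different methods, so the ratio is only a rough indication of how much of GitHub the BigQuery dataset covers.

```python
bigquery_total = 226372   # df.N.sum() from the BigQuery results above
search_total = 3087257    # count_notebooks() from the GitHub search API

# Fraction of the search API total that the BigQuery dataset accounts for.
coverage = bigquery_total / search_total
print(f"{coverage:.1%}")  # 7.3%
```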

And for each ODK-related kernel:

In [17]:
count_notebooks(name='sagemath')

6199

In [18]:
count_notebooks(name='xeus-cling')

684

In [19]:
count_notebooks(language='gap')

63

In [20]:
count_notebooks(language='singular')

8

In [21]:
count_notebooks(language='gp') # PARI/GP

3

In [22]:
count_notebooks(language='mmt') # MMT is brand new!

1