# Estimate of Public Jupyter Notebooks on GitHub

This notebook shows the historical count and future estimate of the number of `*.ipynb` files on GitHub. The daily count comes from executing the query [extension:ipynb nbformat_minor](https://github.com/search?utf8=%E2%9C%93&q=extension%3Aipynb+nbformat_minor) once a day, on most days. We re-render the notebook and publish it daily after the update.

## Assumptions

1. That the search query hits and notebooks on GitHub are in 1:1 correspondence.
1. That GitHub is accurately reporting the total number of `*.ipynb` file hits.
1. That the result is **not** inflated due to GitHub forks.
    * Evidence: Despite the 2,000+ forks of the ipython/ipython GitHub repository, its tutorial notebooks do not appear duplicated in the search results.
1. That the result **is** inflated a tiny bit by manually created duplicates of notebooks.
    * Evidence: Some people seem to download their favorite notebooks and then upload them into their own git repositories for safekeeping.

In [None]:
import warnings
warnings.simplefilter('ignore', FutureWarning)

In [None]:
%matplotlib inline

In [None]:
import datetime
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import requests
import statsmodels.api as sm

In [None]:
mpl.style.use('ggplot')
figsize = (14,7)

In [None]:
now = datetime.datetime.utcnow()
print(f'This notebook was last rendered at {now} UTC')

## Raw Hits

First, let's load the historical data into a DataFrame indexed by date.

In [None]:
hits_df = pd.read_csv('ipynb_counts.csv', index_col=0, header=0, parse_dates=True)
hits_df.reset_index(inplace=True)
hits_df.drop_duplicates(subset='date', inplace=True)
hits_df.set_index('date', inplace=True)
hits_df.sort_index(ascending=True, inplace=True)
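
As a minimal sketch of what the loading cell above does, the following runs the same kind of de-duplication and sort on a tiny, made-up CSV. The sample rows and the names `sample_csv` and `demo_df` are hypothetical; only the `date` and `hits` column names come from the notebook. It filters on the index instead of resetting it, which should have the same effect on this sample.

```python
import io
import pandas as pd

# Hypothetical sample using the notebook's column names (date, hits)
sample_csv = io.StringIO(
    "date,hits\n"
    "2020-01-03,120\n"
    "2020-01-01,100\n"
    "2020-01-03,125\n"
)

demo_df = pd.read_csv(sample_csv, index_col=0, header=0, parse_dates=True)
# keep the first row seen for each date, then order chronologically
demo_df = demo_df[~demo_df.index.duplicated(keep='first')].sort_index()
print(demo_df)
```

When a date appears more than once, the first sample for that date is kept, matching the default behavior of `drop_duplicates`.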

In [None]:
hits_df.tail(3)

There might be missing counts for days that we failed to sample. We build up the expected date range and insert NaNs for dates we missed.

In [None]:
til_today = pd.date_range(hits_df.index[0], hits_df.index[-1])

In [None]:
hits_df = hits_df.reindex(til_today)
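
To illustrate the reindexing step, here is a small sketch on made-up data with two missing days. The values and the names `demo_df` and `full_range` are hypothetical.

```python
import pandas as pd

# Hypothetical counts sampled on Jan 1, 2, and 5 (Jan 3 and 4 were missed)
demo_df = pd.DataFrame(
    {'hits': [100, 120, 150]},
    index=pd.to_datetime(['2020-01-01', '2020-01-02', '2020-01-05']),
)

# build every calendar day between the first and last sample
full_range = pd.date_range(demo_df.index[0], demo_df.index[-1])
# dates absent from the original index get NaN
demo_df = demo_df.reindex(full_range)
print(demo_df)
```

The resulting frame has one row per calendar day, with `NaN` in the `hits` column for the two missed days.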

Now we plot the known notebook counts for each day we've been tracking the query results.

In [None]:
ax = hits_df.plot(title=f"GitHub search hits for {len(hits_df)} days", figsize=figsize)
ax.set_xlabel('Date')
ax.set_ylabel('# of ipynb files');

## Smoothed Hits

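We don't have data for every day in the collection timeframe. We'll fill the gaps by interpolating linearly over time (`method='time'`), so each missing day gets a value on the straight line between its nearest known neighbors.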

In [None]:
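# Deltas of the raw counts before gap filling; recomputed from the
# interpolated data in the Daily Change section.
daily_deltas = (hits_df.hits - hits_df.hits.shift()).fillna(0)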

In [None]:
hits_df = hits_df.interpolate(method='time')
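
A short sketch of time-based interpolation on a toy series may help here. The values and the names `demo_s` and `filled` are hypothetical.

```python
import numpy as np
import pandas as pd

idx = pd.date_range('2020-01-01', '2020-01-05')
# Hypothetical counts with a two-day gap between 100 and 160
demo_s = pd.Series([100.0, np.nan, np.nan, 160.0, 170.0], index=idx)

# weight the fill by elapsed time between the known neighbors
filled = demo_s.interpolate(method='time')
print(filled)
```

On an evenly spaced daily index such as this one, the two missing days land on the straight line between 100 and 160; the time weighting only matters when the index spacing is irregular.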

In [None]:
ax = hits_df.plot(title=f"GitHub search hits for {len(hits_df)} days with gaps interpolated",
                  figsize=figsize)
ax.set_xlabel('Date')
_ = ax.set_ylabel('# of ipynb files')

The total change in the number of `*.ipynb` hits between the tracking start date and today is:

In [None]:
total_delta_nbs = hits_df.iloc[-1] - hits_df.iloc[0]
total_delta_nbs

## Daily Change

The average daily change over the entire tracking period is:

In [None]:
avg_delta_nbs = total_delta_nbs / len(hits_df)
avg_delta_nbs

We can look at the daily change over the entire period alongside the rolling 30-day mean of the daily deltas.

The large jumps in the data are from GitHub reporting drastically different counts from one day to the next. We suspect this happens when they rebuild their search index.

In [None]:
daily_deltas = (hits_df.hits - hits_df.hits.shift()).fillna(0)

In [None]:
fig, ax = plt.subplots(figsize=figsize)
ax.plot(daily_deltas.rolling(window=30, min_periods=0, center=False).mean(), 
        label='30-day rolling mean of daily-change')
ax.plot(daily_deltas, label='24-hour change')
ax.set_xlabel('Date')
ax.set_ylabel('Delta notebook count')
ax.set_title('Change in notebook count')
_ = ax.legend(loc='upper left')
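
The delta and rolling-mean computation can be sketched on a toy series as follows. The numbers are made up, and the sketch uses a 3-day window with `min_periods=1` where the notebook uses a 30-day window, so that the effect is visible on five points.

```python
import pandas as pd

idx = pd.date_range('2020-01-01', periods=5)
# Hypothetical cumulative notebook counts
hits = pd.Series([100, 110, 130, 135, 160], index=idx)

# day-over-day change; the first day has no predecessor, so fill with 0
daily = (hits - hits.shift()).fillna(0)
# trailing mean over up to 3 days (shorter windows at the start)
rolling = daily.rolling(window=3, min_periods=1).mean()
print(pd.DataFrame({'daily': daily, 'rolling': rolling}))
```

The rolling mean damps single-day spikes, which is why it is easier to read than the raw deltas when the reported counts jump.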

Let's look at the rolling mean in isolation.

In [None]:
fig, ax = plt.subplots(figsize=figsize)
ax.plot(daily_deltas.rolling(window=30, min_periods=0, center=False).mean())
ax.set_xlabel('Date')
ax.set_ylabel('Delta notebook count')
_ = ax.set_title('30-day rolling mean of daily-change')

## Count Prediction

We next train an [autoregressive model](http://en.wikipedia.org/wiki/Autoregressive_model) on the prior year of data. We then use the model to predict the number of notebooks on GitHub a year from now.

In [None]:
delta = datetime.timedelta(365)

In [None]:
train_df = hits_df.loc[now.date()-delta:]
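# NOTE: sm.tsa.AR has been removed from recent statsmodels releases in favor
# of sm.tsa.AutoReg, so this cell requires an older statsmodels release.
model = sm.tsa.AR(train_df, freq='D').fit(ic='aic')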
model.params
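
For intuition about what an autoregressive fit estimates, here is a minimal sketch that fits an AR(p) model by ordinary least squares with NumPy. It is not the statsmodels implementation and does no lag selection; the helper `fit_ar` and the synthetic data are hypothetical.

```python
import numpy as np

def fit_ar(series, p):
    """Fit y[t] = c + a1*y[t-1] + ... + ap*y[t-p] by ordinary least squares."""
    y = np.asarray(series, dtype=float)
    n = len(y)
    # each row holds [1, y[t-1], ..., y[t-p]] for t = p .. n-1
    lagged = [y[p - k:n - k] for k in range(1, p + 1)]
    X = np.column_stack([np.ones(n - p)] + lagged)
    coef, *_ = np.linalg.lstsq(X, y[p:], rcond=None)
    return coef  # [c, a1, ..., ap]

# synthetic, noise-free AR(1) data: y[t] = 5 + 0.8 * y[t-1]
y = [10.0]
for _ in range(50):
    y.append(5 + 0.8 * y[-1])

coef = fit_ar(y, 1)
print(coef)
```

On this noise-free series the fit should recover the intercept and lag coefficient used to generate it, up to floating-point error.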

We predict counts through a full year in the future.

In [None]:
pred_s = model.predict(
    (now.date()-delta + datetime.timedelta(model.k_ar)).strftime('%Y-%m-%d'),
    (now.date()+delta).strftime('%Y-%m-%d'),
    dynamic=True
)
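
Forecasting past the end of the observed data means each prediction has to be fed back in as input for the next step. A minimal sketch of that recursion is below, assuming a fitted coefficient vector laid out as `[c, a1, ..., ap]`. The helper `forecast_ar` and the coefficients are hypothetical, not values from the notebook's model.

```python
def forecast_ar(history, coef, steps):
    """Roll an AR model forward, feeding each prediction back in as input."""
    c, lags = coef[0], list(coef[1:])
    values = list(history)
    out = []
    for _ in range(steps):
        # the most recent value pairs with the first lag coefficient
        nxt = c + sum(a * values[-k] for k, a in enumerate(lags, start=1))
        values.append(nxt)
        out.append(nxt)
    return out

# hypothetical AR(2): y[t] = 2 + 1.5*y[t-1] - 0.5*y[t-2]
preds = forecast_ar([100.0, 110.0], coef=[2.0, 1.5, -0.5], steps=3)
print(preds)
```

Because later steps depend on earlier predictions and not on observations, errors can compound over a long horizon such as a full year.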

Now we can plot the truth and prediction.

In [None]:
fig, ax = plt.subplots(figsize=figsize)
ax.set_title(f'GitHub search hits predicted until {now.date()+delta}')
# plot the raw search numbers
ax.plot(hits_df, 'ko', markersize=1, label='truth')
# plot the model's predictions as a line on the same axes
ax.plot(pred_s, linewidth=2, label='predicted')
# show labels
ax.legend()
_ = ax.set_ylabel('# of ipynb files')

We plot the residuals to get a sense of how well the model matches recent observations.

In [None]:
eval_df = pred_s.to_frame(name='predicted')
eval_df['truth'] = hits_df.hits

In [None]:
# residual = predicted - truth, kept only for dates where both values exist
residual_df = eval_df.subtract(eval_df.truth, axis=0).dropna().drop('truth', axis=1)

In [None]:
fig, ax = plt.subplots(figsize=figsize)
ret = ax.plot(residual_df, 'o', ms=2)
ax.set_ylabel('# predicted hits - # true hits')
ax.set_title('Residuals')
fig.autofmt_xdate()
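
The residual computation can be sketched on toy data as follows. The values and the names `eval_demo` and `residual` are hypothetical; the last date has a prediction but no observation, mimicking the future part of the forecast.

```python
import pandas as pd

idx = pd.date_range('2020-01-01', periods=4)
eval_demo = pd.DataFrame({'predicted': [105.0, 118.0, 131.0, 140.0]}, index=idx)
# truth exists only for the first three dates; the fourth is in the "future"
eval_demo['truth'] = pd.Series([100.0, 120.0, 130.0], index=idx[:3])

# predicted minus observed, dropping dates that have no observation
residual = (eval_demo.predicted - eval_demo.truth).dropna()
print(residual)
```

Positive residuals mean the model overestimated the count for that day, and negative residuals mean it underestimated.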