
In this workshop we will take a data analysis pipeline implemented in a Jupyter notebook and convert it to a script that can be run from the command line. We will then convert this script into a Python package: a collection of code modules supporting a pre-defined set of command-line tools.

Why do this? This is a very important question. The easiest answer is that, often, there is no reason to. If you are already using Jupyter notebooks, you are familiar with how convenient they make it to create and test an experiment from scratch, allowing you to separate different parts of the experiment across "cells" for modular execution. Especially if you want to create plots quickly, notebooks' built-in GUI means that you can write code and produce plots within the same browser window.

Jupyter notebooks are great for experiments that are "linear" and "one-off", meaning that they consist of a single chain of steps carried out one after the other, and that these steps will not have to be updated or rearranged at some point in the future. Indeed, the very visual structure of a notebook reinforces this linear nature: one cell following another, each executed in turn. One can of course execute individual cells in any order, but this will usually result in errors, and notebooks have no built-in mechanism for indicating which cells depend on which others, apart from the order of the cells themselves.
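
The ordering problem can be illustrated with a small sketch that mimics a notebook kernel: each "cell" is a snippet executed against one shared namespace. The cell names and the `run_cells` helper are hypothetical and exist only for this illustration; real notebook kernels are more involved.

```python
# Each "cell" is a code snippet run against one shared namespace,
# loosely mimicking how a notebook kernel keeps state between cells.
cells = {
    "load": "values = [3, 1, 2]",
    "sort": "ordered = sorted(values)",
    "report": "summary = f'smallest={ordered[0]}'",
}


def run_cells(order):
    """Run the named cells in the given order; report the first failure."""
    namespace = {}
    for name in order:
        try:
            exec(cells[name], namespace)
        except NameError as err:
            # Nothing told us in advance that this cell needed an earlier one.
            return f"cell '{name}' failed: {err}"
    return namespace["summary"]


print(run_cells(["load", "sort", "report"]))  # the intended, linear order
print(run_cells(["sort", "load", "report"]))  # "sort" runs before "load"
```

The second call fails only at run time, because nothing in the cells themselves records that `sort` depends on `load`.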

This linearity simplifies things, but it is also extremely limiting in terms of the kinds of experiments we can design. Pipelines which execute heterogeneous steps in parallel are off the table, as are pipelines which reuse code from other pipelines without simply copying the text. The modular structure of notebooks is something of an illusion; in reality, the different cells have a very rigid relationship with one another.

Jupyter notebooks are also difficult to expand upon beyond the analysis they were designed to carry out. One of the more obvious ways this problem manifests itself is when we try to parametrize an existing experiment. If, for example, we are training a machine learning classifier with a regularization penalty of `alpha=0.01` and want to try other values of `alpha` in a systematic way, the notebook itself offers no way of doing so short of manually editing the stated value of `alpha`. For testing a handful of values of `alpha` this is fine, but notebooks quickly become cumbersome if we want to test hundreds of such values. The penalty you pay for being able to execute individual notebook cells within a pretty GUI is that cells (or the entire notebook) cannot readily be turned into functions with arbitrary arguments and argument values.
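
As a minimal sketch of the alternative, the hard-coded value can become a function argument that is also exposed on the command line. The names `fit_ridge_slope` and `run_experiment` and the `--alpha` flag are hypothetical, and the toy one-dimensional ridge fit merely stands in for a real classifier; none of this is taken from the pipeline introduced here.

```python
import argparse


def fit_ridge_slope(xs, ys, alpha):
    """Closed-form 1-D ridge regression without an intercept."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + alpha)


def run_experiment(alpha):
    """Stand-in for the notebook's cells, with alpha as an argument."""
    xs = [1.0, 2.0, 3.0, 4.0]
    ys = [2.0, 4.0, 6.0, 8.0]  # toy data following y = 2x
    return fit_ridge_slope(xs, ys, alpha)


def parse_args(argv=None):
    """Read one or more penalties from the command line (hypothetical flag)."""
    parser = argparse.ArgumentParser(description="Sweep a ridge penalty.")
    parser.add_argument("--alpha", type=float, nargs="+", default=[0.01],
                        help="one or more regularization penalties to try")
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    # One run per requested penalty; no source edits between runs.
    results = {alpha: run_experiment(alpha) for alpha in args.alpha}
    for alpha, slope in results.items():
        print(f"alpha={alpha:g}  slope={slope:.4f}")
    return results


# Passing the arguments explicitly here; a script would call main() instead
# so that the values come from the actual command line.
results = main(["--alpha", "0", "0.01", "30"])
```

Sweeping hundreds of values then only requires a longer argument list or a loop in the calling shell.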

It is difficult to appreciate the full gravity of these considerations until one actually tries to build upon an experiment in Jupyter. Thus we will dispense with any further preamble and introduce a simple data pipeline implemented in a notebook to better understand where exactly the properties inherent to notebooks limit further analysis.

In [None]:
import itertools
import re
import requests
from bs4 import BeautifulSoup
import plotly.express as px

!pip install -U kaleido
!pip install -U skits

%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (14, 9)


base_url = 'https://nuforc.org/webreports'
grab = requests.get('/'.join([base_url, 'ndxevent.html']))
soup = BeautifulSoup(grab.text, 'html.parser')

sightings = []
col_labels = ['Date', 'City', 'Region', 'Country', 'Shape', 'Duration', 
              'Summary', 'Posted', 'Images']

# Raw string avoids the invalid "\/" escape; "/" needs no escaping in a regex.
for link in soup('a', string=re.compile(r"[0-9]{2}/2000")):
  data = link.get('href')
  grab_date = requests.get('/'.join([base_url, data]))
  date_soup = BeautifulSoup(grab_date.text, 'html.parser')

  for row in date_soup('tr'):
    cols = row.find_all('td')

    if cols:
      cur_sighting = None

      for lbl, col in zip(itertools.cycle(col_labels), cols):
        if lbl == 'Date':
          if cur_sighting is not None:
            sightings.append(cur_sighting)

          cur_sighting = {'Date': col.string}

        else:
          cur_sighting[lbl] = col.string

      if cur_sighting is not None:
        sightings.append(cur_sighting)


In [None]:
import pandas as pd

valid_states = {
    'AK', 'AL', 'AR', 'AZ', 'CA', 'CO', 'CT', 'DC', 'DE', 'FL', 'GA', 'HI',
    'IA', 'ID', 'IL', 'IN', 'KS', 'KY', 'LA', 'MA', 'MD', 'ME', 'MI', 'MN',
    'MO', 'MS', 'MT', 'NC', 'ND', 'NE', 'NH', 'NJ', 'NM', 'NV', 'NY', 'OH',
    'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN', 'TX',
    'UT', 'VA', 'VT', 'WA', 'WI', 'WV', 'WY'
    }

sights_df = pd.DataFrame(sightings)
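# Keep only sightings in valid US states; .copy() makes the filtered frame
# independent so that the column assignment below does not trigger a
# SettingWithCopyWarning.
sights_df = sights_df.loc[(sights_df.Country == 'USA') & sights_df.Region.isin(valid_states), :].copy()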
sights_df['Date'] = pd.to_datetime([dt.split()[0] for dt in sights_df['Date']], format='%m/%d/%y')

print(sights_df)


In [None]:
# Total number of sightings per state across the whole scraped period.
counts = sights_df.groupby('Region').size()

fig = px.choropleth(locations=[str(x) for x in counts.index],
                    locationmode="USA-states",
                    color=counts.values, range_color=[0, counts.max()],
                    scope="usa",
                    color_continuous_scale=['white', 'black'])
fig.show()


In [None]:
import imageio

counts = sights_df.groupby(['Date', 'Region']).size()
plt_files = list()

for dt, dt_counts in counts.groupby('Date'):
    # '%F' is a platform-specific shorthand; spell the ISO date out portably.
    date_lbl = dt.strftime('%Y-%m-%d')

    fig = px.choropleth(locations=[str(x) for x in dt_counts.index.get_level_values('Region')],
                        locationmode="USA-states", title=date_lbl,
                        color=dt_counts.values, range_color=[0, 100],
                        scope="usa", color_continuous_scale=['white', 'black'])

    plt_file = f"counts_{date_lbl}.png"
    fig.write_image(plt_file, format='png')
    plt_files += [imageio.imread(plt_file)]

imageio.mimsave("counts.gif", plt_files, duration=0.03)
from IPython.display import Image
Image(filename="counts.gif")

In [None]:
import numpy as np

from skits.feature_extraction import (AutoregressiveTransformer,
                                      SeasonalTransformer)
from skits.pipeline import ForecasterPipeline
from skits.preprocessing import (ReversibleImputer,
                                 DifferenceTransformer)

from sklearn.ensemble import (RandomForestClassifier, GradientBoostingClassifier,
                              RandomForestRegressor)
from sklearn.linear_model import LinearRegression, LogisticRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import TimeSeriesSplit
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import StandardScaler

pipeline = ForecasterPipeline([
    ('pre_scaler', StandardScaler()),
    ('features', FeatureUnion([
        ('ar_features', AutoregressiveTransformer(num_lags=3)),
        ('seasonal_features', SeasonalTransformer(seasonal_period=10)),
    ])),
    ('post_feature_imputer', ReversibleImputer()),
    ('post_feature_scaler', StandardScaler()),
    ('regressor', LinearRegression(fit_intercept=True))
    ])

tscv = TimeSeriesSplit(n_splits=5)
ca_counts = counts.loc[(slice(None), 'CA')]
ca_dates = ca_counts.index.get_level_values('Date').values.reshape(-1, 1)
ca_values = ca_counts.values

real_values = list()
pred_values = list()

for train_index, test_index in tscv.split(ca_counts):
    pipeline.fit(ca_dates[train_index], ca_values[train_index])

    preds = pipeline.predict(ca_dates[test_index], to_scale=True)

    real_values += ca_values[test_index].flatten().tolist()
    pred_values += preds.flatten().tolist()

    plt.plot(ca_dates[test_index], ca_values[test_index], color='black')
    plt.plot(ca_dates[test_index], preds, color='red')

# Report the mean (not the sum) of squared errors over all test folds.
print(f"MSE: {mean_squared_error(real_values, pred_values):.1f}")
