# A Day In The Life Of A Data Science Platform

*Or How To Monitor Everyone All The Time And Still Get Things To Prod*

Also: **Mostly Meta-metrics**

### A Talk w/ Antics by Lucas Durand

# Who is this?

<div style="float:left; position:relative">
<img src="https://media.licdn.com/dms/image/C4E03AQH4IqbY1sqyjw/profile-displayphoto-shrink_200_200/0?e=1579132800&v=beta&t=hHTrA-4kUx8YSdpAExwX7XiDHuiiaZ2_x_9i0g6ZG04">
</div>

<div style="left:50%; position:absolute">
    <b>Lucas Durand</b>
    <li>Theoretical Physicist?</li>
    <li>Data Scientist</li>
    <li>Software Engineer</li>
    <li>Pronouns: he, him, his</li>
</div>

## What do I do?
<p><img src="https://www.td.com/ca/en/personal-banking/system/v1.5/assets/img/header-nav/td-logo.png">Software Engineer - Big Data/ML/JupyterHub</p>
<p><img src="https://aideepdive.com/wp-content/uploads/2019/09/aiddFinalNewMedium-300x97.png">Data Science Instructor</p>


# The Goal

Provide a single, central, end-user computing environment for ad hoc data science that doesn't make everyone sad

**and ...**

operate within all of the restrictions of a heavily regulated environment 

# The Users

*Who are we doing this for?*

* Data Scientists
* Developers
* Excel Gurus
* Business Types

## Capabilities We Want

* Do *spreadsheet things*
* Create *reports*
* Some way to "get to prod"

## Security We Need

* User monitoring tools
* User isolation
* Protect sensitive and confidential data

# The Solution

![JupyterHub](https://jupyter.org/assets/hublogo.svg)
https://jupyter.org/hub

## JupyterHub

* Notebooks are a common language between Analytics, Development, and (maybe?) Business
* Approved environment (control packages centrally)
* Extensions for everything (data classification, logging, reporting, deploying)
* Integrations to other services (data feeds, deployment pipelines, compute clusters)

# Hands-on

We have a general architecture now. What does this look like in practice?

# Logging

Everyone loves logging!

## Jupyter Events

We can do it in `javascript`? Wild!

```javascript
Jupyter.notebook.events.on('execute.CodeCell', function(evt, data) {
    // data.cell is the cell object
});
```

In [1]:
%%javascript
Jupyter.notebook.events.on('execute.CodeCell', function(evt, data) {
    console.log(data.cell.input[0].innerText);
});

<IPython.core.display.Javascript object>

In [2]:
print("Are we logging yet?")

Are we logging yet?


### Where does this go?

* Put it in a logging.js file that users **can't edit**
* Point to it in their `jupyter_notebook_config.py` file
* **That should also not be editable!**

## IPython Events

We also have a few hooks on the Python side:
https://ipython.readthedocs.io/en/stable/config/callbacks.html


```python
class KernelHook:
    def __init__(self, ip):
        self.ip = ip
    def pre_run_cell(self, info):
        pass
def load_ipython_extension(ip):
    hook = KernelHook(ip)
    ip.events.register("pre_run_cell", hook.pre_run_cell)
```
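The same registration pattern extends to the other callbacks. Here is a minimal sketch, assuming IPython 7 or later, where `pre_run_cell` receives an info object with a `raw_cell` attribute and `post_run_cell` receives a result object. `CellTimer`, its `durations` list, and the injectable `clock` are hypothetical names used only for illustration:

```python
import time


class CellTimer:
    """Records how long each cell takes to run (hypothetical helper)."""

    def __init__(self, clock=time.perf_counter):
        self.clock = clock      # injectable so the class can be tested
        self.started = None
        self.current = None
        self.durations = []     # list of (raw_cell, seconds)

    def pre_run_cell(self, info):
        # Called by IPython just before a cell runs.
        self.current = info.raw_cell
        self.started = self.clock()

    def post_run_cell(self, result):
        # Called by IPython after the cell finishes. If the hook was
        # registered mid-cell there is no start time yet, so skip it.
        if self.started is None:
            return
        self.durations.append((self.current, self.clock() - self.started))
        self.started = None


def load_ipython_extension(ip):
    timer = CellTimer()
    ip.events.register("pre_run_cell", timer.pre_run_cell)
    ip.events.register("post_run_cell", timer.post_run_cell)
    return timer
```

Keeping the state on an object rather than in module globals makes it easy to unregister the hooks again later.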

## Simple Logging

In [3]:
class CellPrinter:
    def __init__(self, ip):
        self.ip = ip
    def pre_run_cell(self, info):
        cell = info.raw_cell
        print(cell)
        print("~"*min(len(cell),100))
ip = get_ipython()
hook = CellPrinter(ip)
ip.events.register("pre_run_cell", hook.pre_run_cell)

In [4]:
print("Python is great!")

print("Python is great!")
~~~~~~~~~~~~~~~~~~~~~~~~~
Python is great!


In [5]:
ip.events.unregister("pre_run_cell", hook.pre_run_cell)

ip.events.unregister("pre_run_cell", hook.pre_run_cell)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


### Real Logging

In [6]:
import os, datetime, json
import pandas as pd # always as pd!

class CellLogger:
    def __init__(self, ip):
        self.ip = ip
    def pre_run_cell(self, info):
        cell = info.raw_cell
        user = os.environ["USER"]
        time = datetime.datetime.now().strftime("%Y%m%dT%H:%M:%S")
        log = {
            "input":cell,
            "user":user,
            "timestamp":time
        }
        print(json.dumps(log))
        print(">"*min(len(cell),100))
        # Upload to logging service? Or collect logs from here?
        with open("log_mania.csv","a") as f:
            f.write(pd.DataFrame([log]).to_csv(index=False, header=False))
def load_ipython_extension(ip):
    logger = CellLogger(ip)
    ip.events.register("pre_run_cell", logger.pre_run_cell)

load_ipython_extension(get_ipython())

In [7]:
print("Python is great!")

{"input": "print(\"Python is great!\")", "user": "durand", "timestamp": "20191116T21:56:37"}
>>>>>>>>>>>>>>>>>>>>>>>>>
Python is great!
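
On the "upload or collect?" question in the cell above, one option is to let the standard `logging` module write one JSON object per line, so a log collector can tail the file. This is a sketch under assumptions, not what the platform actually runs: the file name `cell_log.jsonl` and the functions `make_cell_logger` and `log_cell` are made up for illustration.

```python
import datetime
import getpass
import json
import logging


def make_cell_logger(path="cell_log.jsonl"):
    """Return a logger that appends one JSON object per line to `path`."""
    logger = logging.getLogger("cell_audit." + path)
    logger.setLevel(logging.INFO)
    logger.propagate = False            # keep records out of notebook output
    if not logger.handlers:             # avoid duplicate handlers on re-run
        handler = logging.FileHandler(path, encoding="utf-8")
        handler.setFormatter(logging.Formatter("%(message)s"))
        logger.addHandler(handler)
    return logger


def log_cell(logger, raw_cell, user=None, now=None):
    """Write a single audit record and return it."""
    now = now or datetime.datetime.now()
    record = {
        "input": raw_cell,
        "user": user or getpass.getuser(),
        "timestamp": now.isoformat(timespec="seconds"),
    }
    # json.dumps escapes newlines, so each record stays on one line.
    logger.info(json.dumps(record))
    return record
```

A file written this way can be loaded back with `pd.read_json(path, lines=True)`.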


## Some Quick Data Science

Why not?

In [8]:
def get_logs():
    logs = pd.read_csv("log_mania.csv", names=["input","user","timestamp"], parse_dates=["timestamp"])
    logs.loc[logs["input"].str.contains("%%javascript"),"language"]="javascript"
    logs.loc[~logs["input"].str.contains("%%javascript"),"language"]="python"
    return logs
logs = get_logs()
logs.tail()

{"input": "def get_logs():\n    logs = pd.read_csv(\"log_mania.csv\", names=[\"input\",\"user\",\"timestamp\"], parse_dates=[\"timestamp\"])\n    logs.loc[logs[\"input\"].str.contains(\"%%javascript\"),\"language\"]=\"javascript\"\n    logs.loc[~logs[\"input\"].str.contains(\"%%javascript\"),\"language\"]=\"python\"\n    return logs\nlogs = get_logs()\nlogs.tail()", "user": "durand", "timestamp": "20191116T21:56:37"}
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>


| | input | user | timestamp | language |
|---|---|---|---|---|
| 772 | `#app.run_server()` | durand | 2019-11-16 21:46:40 | python |
| 773 | `print("That's it!")` | durand | 2019-11-16 21:47:02 | python |
| 774 | `print("That's it! Now we've got a working moni...` | durand | 2019-11-16 21:47:24 | python |
| 775 | `print("Python is great!")` | durand | 2019-11-16 21:56:37 | python |
| 776 | `def get_logs():\n logs = pd.read_csv("log_m...` | durand | 2019-11-16 21:56:37 | javascript |


In [9]:
logs.groupby("language").count()

{"input": "logs.groupby(\"language\").count()", "user": "durand", "timestamp": "20191116T21:56:37"}
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>


| language | input | user | timestamp |
|---|---|---|---|
| javascript | 28 | 28 | 28 |
| python | 749 | 749 | 749 |
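
The `str.contains("%%javascript")` check in `get_logs` tags any cell that merely mentions the magic, which is why row 776 (the `get_logs` definition itself) is counted as `javascript`. A stricter classifier could look only at the start of the cell. This is a sketch: `cell_language` is a hypothetical helper, and it assumes the only non-Python cells are those using the `%%javascript` or `%%js` cell magics.

```python
def cell_language(source):
    """Classify a cell by its leading cell magic, defaulting to Python."""
    # Cell magics must open the cell, so only the start of the text matters.
    if source.lstrip().startswith(("%%javascript", "%%js")):
        return "javascript"
    return "python"
```

With the log loaded, `logs["language"] = logs["input"].map(cell_language)` would replace the two `.loc` assignments.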


### AST

Abstract Syntax Trees (ASTs) represent Python code as a tree of nodes, which lets us explore (and change) the code's structure, e.g. what are the arguments to this function call?
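
As a small illustration of that question, the sketch below parses one call and reads its name and arguments from the tree. It assumes Python 3.8 or later, where literals are parsed as `ast.Constant` nodes; the sample source string is invented for the example.

```python
import ast

tree = ast.parse('print("Hello", end="")')
call = tree.body[0].value            # the Call node inside the Expr statement

function_name = call.func.id         # the name being called
positional = [arg.value for arg in call.args
              if isinstance(arg, ast.Constant)]
keywords = {kw.arg: kw.value.value for kw in call.keywords
            if isinstance(kw.value, ast.Constant)}

print(function_name, positional, keywords)
```

`ast.dump(tree)` prints the whole tree, which is handy when working out which attributes to follow.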

In [10]:
import ast
def bag_to_trees(logs):
    bag_of_words = logs.input.values

    trees = []
    for cell in bag_of_words:
        try:
            trees += [ast.parse(cell)]
        except SyntaxError:
            pass
    return trees
trees = bag_to_trees(logs)

{"input": "import ast\ndef bag_to_trees(logs):\n    bag_of_words = logs.input.values\n\n    trees = []\n    for cell in bag_of_words:\n        try:\n            trees += [ast.parse(cell)]\n        except SyntaxError:\n            pass\n    return trees\ntrees = bag_to_trees(logs)", "user": "durand", "timestamp": "20191116T21:56:37"}
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>


In [73]:
trees[0].body[0].value.func.id, trees[0].body[0].value.args[0].s

{"input": "trees[0].body[0].value.func.id, trees[0].body[0].value.args[0].s", "user": "durand", "timestamp": "20191116T22:42:01"}
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>


('print', 'Hello')

In [74]:
class Analyzer(ast.NodeVisitor):
    def __init__(self):
        self.stats = {"import": [], "from": []}

    def visit_Import(self, node):
        for alias in node.names:
            self.stats["import"].append(alias.name)
        self.generic_visit(node)

    def visit_ImportFrom(self, node):
        for alias in node.names:
            self.stats["from"].append(alias.name)
        self.generic_visit(node)
        
def get_stats(logs):     
    analyzer = Analyzer()
    trees = bag_to_trees(logs)
    [analyzer.visit(tree) for tree in trees]
    stats = analyzer.stats
    return stats
stats = get_stats(logs)

{"input": "class Analyzer(ast.NodeVisitor):\n    def __init__(self):\n        self.stats = {\"import\": [], \"from\": []}\n\n    def visit_Import(self, node):\n        for alias in node.names:\n            self.stats[\"import\"].append(alias.name)\n        self.generic_visit(node)\n\n    def visit_ImportFrom(self, node):\n        for alias in node.names:\n            self.stats[\"from\"].append(alias.name)\n        self.generic_visit(node)\n        \ndef get_stats(logs):     \n    analyzer = Analyzer()\n    trees = bag_to_trees(logs)\n    [analyzer.visit(tree) for tree in trees]\n    stats = analyzer.stats\n    return stats\nstats = get_stats(logs)", "user": "durand", "timestamp": "20191116T22:42:09"}
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
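
Note that `visit_ImportFrom` above records the imported names (`clear_output` for `from IPython.display import clear_output`), not the package they come from. If the goal is a "most imported library" chart, a variant could count top-level packages instead. This is a sketch, and `ModuleCounter` and `count_modules` are hypothetical names:

```python
import ast
from collections import Counter


class ModuleCounter(ast.NodeVisitor):
    """Counts top-level packages named in import statements."""

    def __init__(self):
        self.counts = Counter()

    def visit_Import(self, node):
        for alias in node.names:
            self.counts[alias.name.split(".")[0]] += 1

    def visit_ImportFrom(self, node):
        # Skip relative imports (`from . import x`), which name no package.
        if node.module and node.level == 0:
            self.counts[node.module.split(".")[0]] += 1


def count_modules(sources):
    """Count packages across an iterable of cell source strings."""
    counter = ModuleCounter()
    for source in sources:
        try:
            counter.visit(ast.parse(source))
        except SyntaxError:
            continue    # cell magics and partial code do not parse
    return counter.counts
```

`count_modules(logs.input.values).most_common(10)` would then give the ten most imported packages.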


In [59]:
import plotly.express as px
from IPython.display import clear_output
imports = pd.DataFrame({"count":stats['import']+stats['from']})['count'].value_counts().to_frame().reset_index()
clear_output()
px.bar(imports, x="index",y="count", title="Most Imported", color="index")

In [14]:
def log_fig(logs):
    fig = px.line(logs, x="timestamp", hover_name="input", title="How I Made This Notebook", template="presentation", color="language")
    fig.update_traces(mode='markers+lines')
    return fig

{"input": "def log_fig(logs):\n    fig = px.line(logs, x=\"timestamp\", hover_name=\"input\", title=\"How I Made This Notebook\", template=\"presentation\", color=\"language\")\n    fig.update_traces(mode='markers+lines')\n    return fig", "user": "durand", "timestamp": "20191116T21:56:38"}
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>


In [15]:
clear_output()
log_fig(get_logs())

# Monitoring

It's not enough to track what code users are executing. Let's also see what resources they use.

In [16]:
# metrics starts here
import psutil
cur_process = psutil.Process()
all_processes = [cur_process] + cur_process.children(recursive=True)
rss = sum([p.memory_info().rss for p in all_processes])
rss / 1000**2

{"input": "# metrics starts here\nimport psutil\ncur_process = psutil.Process()\nall_processes = [cur_process] + cur_process.children(recursive=True)\nrss = sum([p.memory_info().rss for p in all_processes])\nrss / 1000**2", "user": "durand", "timestamp": "20191116T21:56:38"}
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>


181.055488

## Let's try and keep some of our metrics

https://github.com/yuvipanda/nbresuse

In [17]:
import nbresuse
nbresuse.__file__

{"input": "import nbresuse\nnbresuse.__file__", "user": "durand", "timestamp": "20191116T21:56:38"}
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>


'/home/durand/dev/pycon2019/lib/python3.7/site-packages/nbresuse/__init__.py'
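
The next cell reads a `memory.csv` file with `timestamp` and `rss` columns, but the notebook does not show how it is written. One way such a file could be produced is to append a sample on a timer, reusing the `psutil` measurement from above. This is only a sketch and not part of `nbresuse`: `read_rss_bytes` and `append_memory_sample` are hypothetical helpers.

```python
import csv
import datetime


def read_rss_bytes():
    """Total resident memory of this process and its children, in bytes."""
    import psutil  # third-party; imported lazily so the rest runs without it

    current = psutil.Process()
    processes = [current] + current.children(recursive=True)
    return sum(p.memory_info().rss for p in processes)


def append_memory_sample(path="memory.csv", read_rss=read_rss_bytes, now=None):
    """Append one `timestamp,rss` row (no header) and return it."""
    now = now or datetime.datetime.now()
    row = [now.strftime("%Y-%m-%d %H:%M:%S"), read_rss()]
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow(row)
    return row
```

Calling `append_memory_sample()` every few seconds from a background thread or a server-side periodic callback would build up the history plotted below.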

In [18]:
def get_memory():
    memory = pd.read_csv("memory.csv", names=["timestamp","rss"], parse_dates=["timestamp"])
    memory["MB"] = memory.rss/1000**2
    memory["GB"] = memory.MB/1000
    memory["momentum"] = memory.GB.diff().fillna(0)
    memory["speed"] = memory.momentum.abs().shift(-1).fillna(0)
    return memory
memory = get_memory()
memory.tail()

{"input": "def get_memory():\n    memory = pd.read_csv(\"memory.csv\", names=[\"timestamp\",\"rss\"], parse_dates=[\"timestamp\"])\n    memory[\"MB\"] = memory.rss/1000**2\n    memory[\"GB\"] = memory.MB/1000\n    memory[\"momentum\"] = memory.GB.diff().fillna(0)\n    memory[\"speed\"] = memory.momentum.abs().shift(-1).fillna(0)\n    return memory\nmemory = get_memory()\nmemory.tail()", "user": "durand", "timestamp": "20191116T21:56:38"}
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>


| | timestamp | rss | MB | GB | momentum | speed |
|---|---|---|---|---|---|---|
| 3001 | 2019-11-16 21:56:16 | 1786839040 | 1786.83904 | 1.786839 | -0.100454 | 0.010637 |
| 3002 | 2019-11-16 21:56:20 | 1797476352 | 1797.476352 | 1.797476 | 0.010637 | 0.040485 |
| 3003 | 2019-11-16 21:56:25 | 1756991488 | 1756.991488 | 1.756991 | -0.040485 | 0.168223 |
| 3004 | 2019-11-16 21:56:30 | 1925214208 | 1925.214208 | 1.925214 | 0.168223 | 0.122991 |
| 3005 | 2019-11-16 21:56:36 | 1802223616 | 1802.223616 | 1.802224 | -0.122991 | 0.0 |


In [19]:
clear_output()
px.line(memory.tail(50), x="timestamp", y="GB", template="presentation", title="Memory Lately")

## Let's be a real user now ...

In [20]:
import numpy as np

for i in range(5):
    print(i)
    #a = np.random.random([1000,i*100,1000])
a = 1

{"input": "import numpy as np\n\nfor i in range(5):\n    print(i)\n    #a = np.random.random([1000,i*100,1000])\na = 1", "user": "durand", "timestamp": "20191116T21:56:39"}
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
0
1
2
3
4


In [21]:
clear_output()
def mem_fig(memory):
    mem_fig = px.scatter(memory, x="timestamp", y="GB", color="speed", color_continuous_scale=px.colors.diverging.Spectral,template="presentation", title="Memory Lately")
    return mem_fig
mem_fig(get_memory())

## User Profiling

*Not a terrible profile of what's happening in this Notebook*

In [22]:
clear_output()
from plotly.subplots import make_subplots
def combined_plot(plot1,plot2):
    combined = make_subplots(specs=[[{"secondary_y": True}]]).update_layout(template="presentation", title="Notebook Profile").update_coloraxes(colorscale="spectral")
    combined.add_traces(plot1.data, secondary_ys=[True,True]).add_trace(plot2.data[0])
    return combined
combined_plot(log_fig(logs), mem_fig(memory))

# What's Next For Notebooks

We have a pretty useful notebook, but where does it go now?

## What's Next For Notebooks

* Report (HTML)
    * We can save this as an HTML file
    * We can strip out all of the inputs to protect non-coder eyes
* API (flask)
    * Can we expose this data in a useful way?
* Dashboard (Dash, Voila)
    * Does this need a UI?
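
For the report option, `nbconvert` can hide inputs at export time (its `TemplateExporter.exclude_input` option). As a dependency-free illustration of the same idea, the sketch below blanks the source of every code cell in a notebook that has already been loaded as a dict. It assumes the nbformat 4 layout (a top-level `cells` list whose entries have `cell_type` and `source`), and `strip_code_inputs` is a hypothetical helper:

```python
import copy


def strip_code_inputs(notebook):
    """Return a copy of an nbformat-4 notebook dict with code sources blanked."""
    stripped = copy.deepcopy(notebook)   # leave the caller's notebook untouched
    for cell in stripped.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["source"] = []          # keep the outputs, drop the code
    return stripped
```

The result can be written back with `json.dump` and exported to HTML as usual. Markdown cells and cell outputs are left as they are.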

## Dash

We're already using `plotly`, so it will be quick work to make a `Dash` app. Let's show:

* Near-realtime updates of our plots
* Stats on most used libraries
* User stats

In [23]:
import dash
import dash_core_components as dcc
import dash_html_components as html

app = dash.Dash("")
app.title = "Notebook Stats"

app.layout = html.Div([
    html.H1(app.title),
    html.Div([
        dcc.Graph(
            figure=combined_plot(log_fig(logs),mem_fig(memory)),
            id="historical-profile"
        ),
        dcc.Graph(
            figure=px.bar(imports, x="index",y="count", title="Most Imported", color="index"),
            id="top-libraries"
        ),
        html.H3(
            f"Total Cell Executions: {logs.count()[0]}",
            id="total-executions"
        ),
        html.H3(
            f"Ticks: 0",
            id="ticks"
        )
    ]),
    dcc.Interval(
        id="interval-component",
        interval=5*1000, # in milliseconds
    )
])

{"input": "import dash\nimport dash_core_components as dcc\nimport dash_html_components as html\n\napp = dash.Dash(\"\")\napp.title = \"Notebook Stats\"\n\napp.layout = html.Div([\n    html.H1(app.title),\n    html.Div([\n        dcc.Graph(\n            figure=combined_plot(log_fig(logs),mem_fig(memory)),\n            id=\"historical-profile\"\n        ),\n        dcc.Graph(\n            figure=px.bar(imports, x=\"index\",y=\"count\", title=\"Most Imported\", color=\"index\"),\n            id=\"top-libraries\"\n        ),\n        html.H3(\n            f\"Total Cell Executions: {logs.count()[0]}\",\n            id=\"total-executions\"\n        ),\n        html.H3(\n            f\"Ticks: 0\",\n            id=\"ticks\"\n        )\n    ]),\n    dcc.Interval(\n        id=\"interval-component\",\n        interval=5*1000, # in milliseconds\n    )\n])", "user": "durand", "timestamp": "20191116T21:56:40"}
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

In [24]:
app.run_server()

{"input": "app.run_server()", "user": "durand", "timestamp": "20191116T21:56:41"}
>>>>>>>>>>>>>>>>
 * Serving Flask app "__main__" (lazy loading)
 * Environment: production
   Use a production WSGI server instead.
 * Debug mode: off


OSError: [Errno 98] Address already in use

In [None]:
from dash.dependencies import Input, Output
@app.callback([
    Output("top-libraries","figure"),
    Output("historical-profile", "figure"),
    Output("total-executions", "children"),
    Output("ticks", "children")
],
    [Input("interval-component", "n_intervals")])
def update_info(n_intervals):
    print(n_intervals)
    logs, memory = get_logs(), get_memory()
    stats = get_stats(logs)
    
    imports = pd.DataFrame({"count":stats['import']+stats['from']})['count'].value_counts().to_frame().reset_index()
    lib_fig = px.bar(imports, x="index",y="count", title="Most Imported", color="index")
    
    fig = combined_plot(log_fig(logs),mem_fig(memory))
    fig['layout']['uirevision'] = 'some-constant'

    return [lib_fig, fig, f"Total Cell Executions: {logs.count()[0]}", f"Ticks: {n_intervals}"]

# Deploying

We've created an app, but running it from a notebook is a pain. Let's *deploy* this as a *real app*

    gunicorn dash_monitor:server
    
* To make this more user-friendly, we should create an *app deployment server* to launch and manage the server processes
* A simple UI extension can be added to trigger converting the current notebook into the required .py file and launch the app
* While we're at it, we can register its URL within the JupyterHub proxy
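
The first bullet could start as small as a class that remembers one server process per app. This is a rough sketch under stated assumptions: `AppLauncher` is a hypothetical helper, the module is assumed to expose the Flask instance as `server` (for a Dash app, `server = app.server`), and `gunicorn` is assumed to be on the `PATH`.

```python
import subprocess


class AppLauncher:
    """Tracks one server process per app name (hypothetical helper)."""

    def __init__(self, spawn=subprocess.Popen):
        self.spawn = spawn          # injectable so the class can be tested
        self.processes = {}

    @staticmethod
    def command(module, port):
        # `module:server` follows gunicorn's MODULE:VARIABLE convention.
        return ["gunicorn", f"{module}:server", "--bind", f"127.0.0.1:{port}"]

    def launch(self, name, module, port):
        if name in self.processes:
            raise ValueError(f"{name!r} is already running")
        self.processes[name] = self.spawn(self.command(module, port))
        return f"http://127.0.0.1:{port}"

    def stop(self, name):
        process = self.processes.pop(name)
        process.terminate()
        process.wait()
```

The URL returned by `launch` is what would be registered as a route with the JupyterHub proxy.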

In [None]:
# let's bump pandas up in the list
import pandas

# What Have We Done?

* Used Jupyter Notebooks as primary UI for development and analysis
* Implemented logging and monitoring for each user
* Talked (vaguely) about deploying to prod and what that could look like

Takeaway: we can do everything with Python and Jupyter notebooks