# Python for Everyone

*Building a Data Science Platform in a Corporate Setting*

## Mostly meta-metrics

### A Talk w/ Antics by Lucas Durand

# Who is this?

<div style="float:left; position:relative">
<img src="https://media.licdn.com/dms/image/C4E03AQH4IqbY1sqyjw/profile-displayphoto-shrink_200_200/0?e=1579132800&v=beta&t=hHTrA-4kUx8YSdpAExwX7XiDHuiiaZ2_x_9i0g6ZG04">
</div>

# Python in a Corporate Environment

# Logging

Let's implement logging!

## Jupyter Events

We can do it in `javascript`? Wild!

```javascript
Jupyter.events.on('execute.CodeCell', function(evt, data) {
    // data.cell is the cell object
});
```

In [1]:
%%javascript
Jupyter.notebook.events.on('execute.CodeCell', function(evt, data) {
    console.log(data.cell.input[0].innerText);
});

<IPython.core.display.Javascript object>

In [2]:
print("Are we logging yet?")

Are we logging yet?


### Where does this go?

* Put it in a `logging.js` file that users **can't edit**
* Point to it in their `jupyter_notebook_config.py` file
* **That should also not be editable!**

## IPython Events

We also have a few hooks on the Python side:
https://ipython.readthedocs.io/en/stable/config/callbacks.html


```python
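class VarPrinter:
    """Holds a reference to the shell and exposes event callbacks."""

    def __init__(self, ip):
        self.ip = ip  # the InteractiveShell instance

    def pre_run_cell(self, info):
        # Called before each cell runs; info.raw_cell holds the cell source.
        pass


def load_ipython_extension(ip):
    # Entry point IPython calls when the extension is loaded.
    vp = VarPrinter(ip)
    ip.events.register("pre_run_cell", vp.pre_run_cell)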
```
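
The same registration pattern extends to other events. As a hedged sketch (the `CellTimer` class and its `durations` list are illustrative names, not something from this talk), pairing `pre_run_cell` with `post_run_cell` gives a rough wall-clock timing per cell:

```python
import time


class CellTimer:
    """Hypothetical helper: records how long each cell takes to run."""

    def __init__(self, ip):
        self.ip = ip
        self.started = None
        self.durations = []  # seconds per executed cell

    def pre_run_cell(self, info):
        # Fired just before the cell body executes.
        self.started = time.perf_counter()

    def post_run_cell(self, result):
        # Fired after the cell finishes.
        if self.started is not None:
            self.durations.append(time.perf_counter() - self.started)
            self.started = None


def load_ipython_extension(ip):
    timer = CellTimer(ip)
    ip.events.register("pre_run_cell", timer.pre_run_cell)
    ip.events.register("post_run_cell", timer.post_run_cell)
    return timer  # returned only so the caller can inspect it
```

Keeping the state on an object, as `VarPrinter` does, is what lets the two callbacks share the start time.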

In [3]:
class VarPrinter:
    def __init__(self, ip):
        self.ip = ip
    def pre_run_cell(self, info):
        cell = info.raw_cell
        print(cell)
        print("~"*min(len(cell),100))
ip = get_ipython()
vp = VarPrinter(ip)
ip.events.register("pre_run_cell", vp.pre_run_cell)

In [4]:
print("Python is great!")

print("Python is great!")
~~~~~~~~~~~~~~~~~~~~~~~~~
Python is great!


In [5]:
ip.events.unregister("pre_run_cell", vp.pre_run_cell)

ip.events.unregister("pre_run_cell", vp.pre_run_cell)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


### Real Logging

In [6]:
import os, datetime, json
import pandas as pd # always as pd!
class VarPrinter:
    def __init__(self, ip):
        self.ip = ip
    def pre_run_cell(self, info):
        cell = info.raw_cell
        user = os.environ.get("USER", "unknown")  # USER may be unset, e.g. on Windows or in some containers
        time = datetime.datetime.now().strftime("%Y%m%dT%H:%M:%S")
        log = {
            "input":cell,
            "user":user,
            "timestamp":time
        }
        print(json.dumps(log))
        print(">"*min(len(cell),100))
        # Upload to logging service? Or collect logs from here?
        with open("log_mania.csv","a") as f:
            f.write(pd.DataFrame([log]).to_csv(index=False, header=False))
def load_ipython_extension(ip):
    vp = VarPrinter(ip)
    ip.events.register("pre_run_cell", vp.pre_run_cell)

load_ipython_extension(get_ipython())

In [7]:
print("Python is great!")

{"input": "print(\"Python is great!\")", "user": "durand", "timestamp": "20191114T23:40:20"}
>>>>>>>>>>>>>>>>>>>>>>>>>
Python is great!
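
On the open question in the code comment above (upload to a service, or collect logs from here), one low-tech option is to append one JSON object per line and let a separate collector ship the file. A minimal sketch, assuming hypothetical helper names `make_record` and `append_jsonl`; it uses only the standard library, so the hook does not have to build a DataFrame for every cell:

```python
import datetime
import getpass
import json


def make_record(cell, user=None, now=None):
    """Build the same three-field record the hook above logs."""
    now = now or datetime.datetime.now()
    return {
        "input": cell,
        "user": user or getpass.getuser(),
        "timestamp": now.strftime("%Y%m%dT%H:%M:%S"),
    }


def append_jsonl(path, record):
    """Append one JSON document per line; json escapes newlines inside the cell."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```

Inside `pre_run_cell` this would be called as `append_jsonl("log_mania.jsonl", make_record(info.raw_cell))`, where the file name is again just an example.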


## Some Quick Data Science

In [196]:
def get_logs():
    logs = pd.read_csv("log_mania.csv", names=["input","user","timestamp"], parse_dates=["timestamp"])
    logs.loc[logs["input"].str.contains("%%javascript"),"language"]="javascript"
    logs.loc[~logs["input"].str.contains("%%javascript"),"language"]="python"
    return logs
logs = get_logs()
logs.tail()

{"input": "def get_logs():\n    logs = pd.read_csv(\"log_mania.csv\", names=[\"input\",\"user\",\"timestamp\"], parse_dates=[\"timestamp\"])\n    logs.loc[logs[\"input\"].str.contains(\"%%javascript\"),\"language\"]=\"javascript\"\n    logs.loc[~logs[\"input\"].str.contains(\"%%javascript\"),\"language\"]=\"python\"\n    return logs\nlogs = get_logs()\nlogs.tail()", "user": "durand", "timestamp": "20191115T00:34:28"}
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>


| | input | user | timestamp | language |
|---|---|---|---|---|
| 314 | `def get_logs()\n return pd.read_csv("log_ma...` | durand | 2019-11-15 00:33:32 | python |
| 315 | `def get_logs:\n return pd.read_csv("log_man...` | durand | 2019-11-15 00:33:37 | python |
| 316 | `def get_logs():\n return pd.read_csv("log_m...` | durand | 2019-11-15 00:33:39 | python |
| 317 | `from IPython.display import clear_output\nclea...` | durand | 2019-11-15 00:33:57 | python |
| 318 | `def get_logs():\n logs = pd.read_csv("log_m...` | durand | 2019-11-15 00:34:28 | javascript |
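
Note that row 318 is labeled `javascript` even though it is Python: `str.contains` matches the literal `%%javascript` inside the `get_logs` source itself. If that is unwanted, a stricter check looks only at the first line of the cell. A minimal sketch with a hypothetical `cell_language` helper, assuming the cell magic always sits at the very start of the cell:

```python
def cell_language(source):
    """Label a cell by its first line only, so string literals further down don't count."""
    first_line = source.split("\n", 1)[0]
    return "javascript" if first_line.startswith("%%javascript") else "python"
```

With pandas this could replace the two `.loc` assignments via `logs["language"] = logs["input"].map(cell_language)`, assuming the `input` column has no missing values.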


In [197]:
logs.groupby("language").count()

{"input": "logs.groupby(\"language\").count()", "user": "durand", "timestamp": "20191115T00:34:38"}
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>


| language | input | user | timestamp |
|---|---|---|---|
| javascript | 10 | 10 | 10 |
| python | 309 | 309 | 309 |


In [245]:
import plotly.express as px
from IPython.display import clear_output
clear_output()
def log_fig():
    logs = get_logs()
    fig = px.line(logs, x="timestamp", hover_name="input", title="How I Made This Notebook", template="presentation", color="language")
    fig.update_traces(mode='markers+lines')
    return fig
log_fig()

# Metrics or Monitoring?

It's not enough to track what code users execute; let's also see what resources they use.

https://github.com/yuvipanda/nbresuse

In [199]:
# metrics starts here
import psutil
cur_process = psutil.Process()
all_processes = [cur_process] + cur_process.children(recursive=True)
rss = sum([p.memory_info().rss for p in all_processes])
rss / 1000**2

{"input": "# metrics starts here\nimport psutil\ncur_process = psutil.Process()\nall_processes = [cur_process] + cur_process.children(recursive=True)\nrss = sum([p.memory_info().rss for p in all_processes])\nrss / 1000**2", "user": "durand", "timestamp": "20191115T00:34:49"}
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>


2172.399616

## Let's try and keep some of our metrics

In [200]:
import nbresuse
nbresuse.__file__

{"input": "import nbresuse\nnbresuse.__file__", "user": "durand", "timestamp": "20191115T00:34:51"}
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>


'/home/durand/dev/pycon2019/lib/python3.7/site-packages/nbresuse/__init__.py'
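
The next cell reads a `memory.csv` with two columns, a timestamp and an RSS value in bytes. How that file gets written is not shown here. One way to produce a file of that shape, as a sketch under assumptions: `append_sample` and `read_rss` are hypothetical names, and `read_rss` is any zero-argument callable returning bytes (for example, one wrapping the psutil snippet above).

```python
import csv
import datetime


def append_sample(path, read_rss, now=None):
    """Append one (timestamp, rss_bytes) row to a headerless CSV file."""
    now = now or datetime.datetime.now()
    row = [now.strftime("%Y-%m-%d %H:%M:%S"), int(read_rss())]
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow(row)
    return row
```

Calling this from a timer or background thread every few seconds would yield a series like the one plotted later; the samples shown there are five seconds apart.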

In [239]:
def get_memory():
    memory = pd.read_csv("memory.csv", names=["timestamp","rss"], parse_dates=["timestamp"])
    memory["MB"] = memory.rss/1000**2
    memory["GB"] = memory.MB/1000
    memory["momentum"] = memory.GB.diff().fillna(0)
    memory["speed"] = memory.momentum.abs().shift(-1).fillna(0)
    return memory
memory = get_memory()
memory.tail()

{"input": "def get_memory():\n    memory = pd.read_csv(\"memory.csv\", names=[\"timestamp\",\"rss\"], parse_dates=[\"timestamp\"])\n    memory[\"MB\"] = memory.rss/1000**2\n    memory[\"GB\"] = memory.MB/1000\n    memory[\"momentum\"] = memory.GB.diff().fillna(0)\n    memory[\"speed\"] = memory.momentum.abs().shift(-1).fillna(0)\n    return memory\nmemory = get_memory()\nmemory.tail()", "user": "durand", "timestamp": "20191115T00:49:24"}
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>


| | timestamp | rss | MB | GB | momentum | speed |
|---|---|---|---|---|---|---|
| 820 | 2019-11-15 00:49:03 | 2405269504 | 2405.269504 | 2.40527 | 0.0 | 0.0 |
| 821 | 2019-11-15 00:49:08 | 2405269504 | 2405.269504 | 2.40527 | 0.0 | 0.0 |
| 822 | 2019-11-15 00:49:13 | 2405269504 | 2405.269504 | 2.40527 | 0.0 | 0.001352 |
| 823 | 2019-11-15 00:49:18 | 2406621184 | 2406.621184 | 2.406621 | 0.001352 | 0.0 |
| 824 | 2019-11-15 00:49:23 | 2406621184 | 2406.621184 | 2.406621 | 0.0 | 0.0 |
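
The two derived columns are easier to read with the shifts spelled out: `momentum` at row i is the change in `GB` from row i-1, and `speed` at row i is the absolute `momentum` of row i+1. That is why row 822 shows a nonzero `speed` one sample before the jump appears in `momentum`. A plain-Python sketch of the same arithmetic on the five `rss` values above (the helper names `diff` and `lead_abs` are illustrative):

```python
def diff(values):
    """Change from the previous sample; the first entry has no predecessor, so 0."""
    if not values:
        return []
    return [0.0] + [b - a for a, b in zip(values, values[1:])]


def lead_abs(values):
    """Absolute value of the next entry; the last entry has no successor, so 0."""
    if not values:
        return []
    return [abs(v) for v in values[1:]] + [0.0]


rss = [2405269504, 2405269504, 2405269504, 2406621184, 2406621184]
gb = [r / 1000**3 for r in rss]
momentum = diff(gb)          # mirrors memory.GB.diff().fillna(0)
speed = lead_abs(momentum)   # mirrors memory.momentum.abs().shift(-1).fillna(0)
```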


In [240]:
clear_output()
px.line(memory.tail(10), x="timestamp", y="GB", template="presentation", title="Memory Lately")

## Let's be a real user now ...

In [141]:
import numpy as np

for i in range(10):
    print(i)
    a = np.random.random([1000,i*100,1000])

{"input": "import numpy as np\n\nfor i in range(10):\n    print(i)\n    a = np.random.random([1000,i*100,1000])", "user": "durand", "timestamp": "20191115T00:13:08"}
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
0
1
2
3


KeyboardInterrupt: 
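
For a sense of scale: `np.random.random` returns `float64` values, so each pass asks for 1000 × (i·100) × 1000 elements at 8 bytes each, and the request grows by 0.8 GB per iteration. A quick back-of-the-envelope check in plain Python (the `array_gb` helper is illustrative, and GB here means 1000³ bytes, matching the conversion used in `get_memory`):

```python
def array_gb(shape, itemsize=8):
    """Size in GB (1000**3 bytes) of a dense array with the given shape."""
    n = 1
    for dim in shape:
        n *= dim
    return n * itemsize / 1000**3


for i in range(10):
    print(i, array_gb([1000, i * 100, 1000]))
```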

In [250]:
clear_output()
def mem_fig():
    memory = get_memory()
    mem_fig = px.scatter(memory, x="timestamp", y="GB", color="speed", color_continuous_scale=px.colors.diverging.Spectral,template="presentation", title="Memory Lately")
    return mem_fig
mem_fig()

## User Profiling

*Not a terrible profile of what's happening in this Notebook*

In [251]:
from plotly.subplots import make_subplots
clear_output()
combined = make_subplots(specs=[[{"secondary_y": True}]]).update_layout(template="presentation", title="Notebook Profile").update_coloraxes(colorscale="spectral")
combined.add_traces(log_fig().data, secondary_ys=[True,True]).add_trace(mem_fig().data[0])

# Getting to Prod