<div style="display: inline-block; float:left; width: 50%">
<img src="img/jupyter-logo.png" width="300px"/>
</div>
<div style="float: left; width: 50%">
    <h1>PyData Workflow with JupyterLab</h1>
    <h3>My notes on the next generation of notebooks</h3>
    <div><br>
    <p><cite>"I’m extremely excited the potential of JupyterLab. I have a high level of confidence in the team who is making this possible"</cite> - <a href="https://medium.com/@brianray_7981/jupyterlab-first-impressions-e6d70d8a175d">Brian Ray</a></p>
    <p><cite>"Finally a tool where I can do all my work. And yes, I have worked with Emacs."</cite> - <a href="https://medium.com/@boyanangelov/is-this-the-best-data-science-ide-jupyter-lab-review-fdd165470f13">Boyan Angelov, Data Scientist @mindmatch.ai</a></p>
        </div>
</div>

## About the Author

![Some pic of me](http://nico.kreiling.family/assets/img/me.png)

**Nico Kreiling**
* "Big Data Scientist" @ [inovex](https://www.inovex.de/de/)
* Hadoop, Data Science, Kubernetes, Cloud
* Tech podcast for developers: [inoTecCast](https://inoteccast.de/)
* My tech talks at [https://github.com/krlng/techtalks](https://github.com/krlng/techtalks)

# Chapter Overview

* Notebooks in general
* Introduction to JupyterLab
* Customizing
    * Keyboard shortcuts
    * The best plugins
    * Writing plugins
* Workflow demo
    * Data exploration
    * Algorithm development
    * Sharing results
* Conclusion

# Notebooks - Autobiographical

## The Good Sides

* Clear presentation
* Interactive and intuitive to use
* Good for exploratory analysis and training

## The Bad Sides

* Execution order matters
* Usually requires a later re-implementation
* Less development support than in IDEs

## The Bestsellers

* Jupyter 
    * Jupyter Notebooks
    * JupyterLab
    * JupyterHub
* [Zeppelin](https://zeppelin.apache.org/)
* [Cloud Datalab (Google)](https://cloud.google.com/datalab/), [Watson Studio](https://www.ibm.com/cloud/watson-studio) ...

# JupyterLab - The Introduction

## The Story So Far

* Re-implementation of Jupyter Notebooks
* In development for 3 years by 100 contributors
* 60 components, 12,000 commits
* [Public since February](https://blog.jupyter.org/jupyterlab-is-ready-for-users-5a6f039b8906)
* [1.0](https://github.com/jupyterlab/jupyterlab/milestone/2) probably in November


## The Essentials

* New UI & UX
* Window management
* File explorer & terminal
* Extended CSV and Markdown support
* Collapsible cells & dark theme

# The Key to Productivity

* JupyterLab is designed as a composition of plugins
* Modern technology stack (JavaScript, [Yarn](https://yarnpkg.com/en/))
* But [nbextensions](https://github.com/ipython-contrib/jupyter_contrib_nbextensions) are not compatible

## Know What Exists

![Know your tools](./img/suit1.jpg)

### Default Keyboard Shortcuts

Command  | Shortcut
------------- | -------------
Command Palette |`Accel Shift C`
File Explorer |`Accel Shift F`
Toggle Bar |`Accel B`
Fullscreen Mode |`Accel Shift D`
Close Tab |`Ctrl Q`
Launcher |`Accel Shift L`

## Extend What Exists

![Tweak your tools](./img/suit2.jpg)

### Custom Keyboard Shortcuts

Settings > Keyboard Shortcuts
```
{
    "notebook:move-cell-up": {
      "command": "notebook:move-cell-up",
      "keys": [
        "Accel Alt ArrowUp"
      ],
      "selector": "body"
    },
    "notebook:move-cell-down": {
      "command": "notebook:move-cell-down",
      "keys": [
        "Accel Alt ArrowDown"
      ],
      "selector": "body"
    }        
}
```
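Writing many overrides by hand gets repetitive. The following is a minimal sketch, assuming the override format shown above (`command`, `keys`, `selector`), that generates the JSON to paste into the settings editor. The helper name `shortcut_override` is my own invention and not part of JupyterLab.

```python
import json


def shortcut_override(command, keys, selector="body"):
    """Return one override entry in the same shape as the settings above.

    Hypothetical helper for illustration; JupyterLab itself only sees the JSON.
    """
    return {command: {"command": command, "keys": list(keys), "selector": selector}}


overrides = {}
overrides.update(shortcut_override("notebook:move-cell-up", ["Accel Alt ArrowUp"]))
overrides.update(shortcut_override("notebook:move-cell-down", ["Accel Alt ArrowDown"]))

# Paste the printed JSON into Settings > Keyboard Shortcuts.
print(json.dumps(overrides, indent=4))
```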

### Useful Plugins

* Graphical extensions
    * [Git](https://github.com/jupyterlab/jupyterlab-git)
    * [toc](https://github.com/jupyterlab/jupyterlab-toc)
    * [Bokeh](https://github.com/bokeh/jupyterlab_bokeh)
    * [Tensorboard](https://github.com/chaoleili/jupyterlab_tensorboard)
* Additional data types
    * [Latex](https://github.com/jupyterlab/jupyterlab-latex)
    * [HTML Files](https://github.com/mflevine/jupyterlab_html)
    * [Drawio](https://github.com/QuantStack/jupyterlab-drawio)
    * [Geo support](https://github.com/jupyterlab/jupyter-renderers)
* In development
    * [RISE](https://github.com/lsst-sqre/RISE/tree/jupyterlab_extension/jupyterlab/jupyterlab_rise)
    * [Google Docs Collaboration](https://github.com/jupyterlab/jupyterlab-google-drive/issues/108)
    * [Variable Inspector](https://github.com/lckr/jupyterlab-variableInspector)
    * [Monaco Editor](https://github.com/jupyterlab/jupyterlab-monaco)


### Install Plugins


Either directly **via the terminal**
```
jupyter labextension install @jupyterlab/<your-plugin-name>
```
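When several extensions have to be installed on a fresh machine, the terminal command can be scripted. This is a rough sketch, assuming the `jupyter` executable is on the `PATH`; the helper names and the two package names are examples of mine, so swap in the extensions you actually use. By default it only prints the commands.

```python
import shlex
import subprocess


def labextension_install_cmd(package):
    """Build the argument list for installing one JupyterLab extension."""
    if not package:
        raise ValueError("package name must not be empty")
    return ["jupyter", "labextension", "install", package]


def install(package, dry_run=True):
    """Print the command; only execute it when dry_run is False."""
    cmd = labextension_install_cmd(package)
    print(" ".join(shlex.quote(part) for part in cmd))
    if not dry_run:
        # check=True raises CalledProcessError if the install fails.
        subprocess.run(cmd, check=True)
    return cmd


# Example package names (assumptions); replace them with what you need.
for name in ["@jupyterlab/git", "@jupyterlab/toc"]:
    install(name)
```

Call `install(name, dry_run=False)` once the printed commands look right.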

or via the **Extension Manager**

```
Settings > Extension Manager:

{
    "enabled": true
}
```

## Make your Tools

![Make your tools](./img/suit3.jpg)

# The Main Part

## Exploration

For the demo we use the ["New York City Taxi Fare Prediction" Kaggle Challenge](https://www.kaggle.com/c/new-york-city-taxi-fare-prediction). The full dataset is available there; this repo contains only a minimal demo dataset!

In [13]:
# %load ./imports.py
import os
from datetime import datetime as dt

import numpy as np
import pandas as pd
import sklearn as skl

# Pandas display options
pd.set_option('display.float_format', lambda x: '%.3f' % x)
import IPython.display as ipd

# Set random seed 
RSEED = 42

# Visualizations
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (25, 5)
%matplotlib inline
#plt.style.use('fivethirtyeight')
plt.rcParams['font.size'] = 18

import seaborn as sns
cm = sns.light_palette("green", as_cmap=True)

In [34]:
#data = pd.read_csv('input/train.csv', nrows = 50000, parse_dates = ['pickup_datetime']).drop(columns = 'key')
data = pd.read_csv('data/sample.csv', nrows = 50000, parse_dates = ['pickup_datetime']).drop(columns = 'key')

# Remove na
data = data.dropna()
data.head()

| | fare_amount | pickup_datetime | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | passenger_count |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 4.5 | 2009-06-15 17:26:21 | -73.844 | 40.721 | -73.842 | 40.712 | 1 |
| 1 | 16.9 | 2010-01-05 16:52:16 | -74.016 | 40.711 | -73.979 | 40.782 | 1 |
| 2 | 5.7 | 2011-08-18 00:35:00 | -73.983 | 40.761 | -73.991 | 40.751 | 2 |
| 3 | 7.7 | 2012-04-21 04:30:42 | -73.987 | 40.733 | -73.992 | 40.758 | 1 |
| 4 | 5.3 | 2010-03-09 07:51:00 | -73.968 | 40.768 | -73.957 | 40.784 | 1 |


### Describe Data

An effective way to catch outliers and anomalies is to look at the summary statistics from the `.describe()` method. I concentrate on the minimum and maximum values to spot outliers.

In [15]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 50000 entries, 0 to 49999
Data columns (total 7 columns):
fare_amount          50000 non-null float64
pickup_datetime      50000 non-null datetime64[ns]
pickup_longitude     50000 non-null float64
pickup_latitude      50000 non-null float64
dropoff_longitude    50000 non-null float64
dropoff_latitude     50000 non-null float64
passenger_count      50000 non-null int64
dtypes: datetime64[ns](1), float64(5), int64(1)
memory usage: 3.1 MB


In [16]:
data.describe()

| | fare_amount | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | passenger_count |
| --- | --- | --- | --- | --- | --- | --- |
| count | 50000.0 | 50000.0 | 50000.0 | 50000.0 | 50000.0 | 50000.0 |
| mean | 11.364 | -72.51 | 39.934 | -72.505 | 39.926 | 1.668 |
| std | 9.686 | 10.394 | 6.225 | 10.408 | 6.015 | 1.289 |
| min | -5.0 | -75.424 | -74.007 | -84.654 | -74.006 | 0.0 |
| 25% | 6.0 | -73.992 | 40.735 | -73.991 | 40.734 | 1.0 |
| 50% | 8.5 | -73.982 | 40.753 | -73.98 | 40.753 | 1.0 |
| 75% | 12.5 | -73.967 | 40.767 | -73.964 | 40.768 | 2.0 |
| max | 200.0 | 40.783 | 401.083 | 40.851 | 43.415 | 6.0 |
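The summary above already shows implausible values, such as a minimum fare of -5.0 and a maximum pickup latitude of 401.083. As a small sketch of how the min/max check can turn into a filter, the example below uses a hand-made four-row frame (not the Kaggle data). The plausibility bounds are assumptions of mine and would need tuning on the real dataset.

```python
import pandas as pd

# Tiny hand-made sample (not the Kaggle data) with two implausible rows.
trips = pd.DataFrame({
    "fare_amount": [4.5, 16.9, -5.0, 8.5],
    "pickup_latitude": [40.721, 40.711, 40.761, 401.083],
    "pickup_longitude": [-73.844, -74.016, -73.983, -73.987],
})

# The min and max rows of describe() are where the outliers show up.
summary = trips.describe()
print(summary.loc[["min", "max"]])

# Assumed plausibility bounds (inclusive); adjust for real data.
FARE_RANGE = (0.0, 200.0)
LAT_RANGE = (40.0, 42.0)
LON_RANGE = (-75.0, -72.0)

mask = (
    trips["fare_amount"].between(*FARE_RANGE)
    & trips["pickup_latitude"].between(*LAT_RANGE)
    & trips["pickup_longitude"].between(*LON_RANGE)
)
clean = trips[mask]
print(f"kept {len(clean)} of {len(trips)} rows")
```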


### Pandas Profiling

In [17]:
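# Write the pandas-profiling report to an HTML file in the notebook directory.
import os

import pandas_profiling

# Remember the notebook directory once, even if this cell is run again later.
if 'workbookDir' not in globals():
    workbookDir = os.getcwd()
print(workbookDir)

profile = pandas_profiling.ProfileReport(data)
# `outputfile` is the keyword as used in the original notebook.
profile.to_file(outputfile=os.path.join(workbookDir, "data_report.html"))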

/Users/nkreiling/dev/d2d-jupyterlab


In [18]:
%%! 
open "$(pwd)/data_report.html"

[]

### View Map Data

**Our location**

In [21]:
from IPython.display import GeoJSON
GeoJSON({
    "type": "Feature",
    "geometry": {
        "type": "Point",
        "coordinates": [8.677636, 49.404927]  # [longitude, latitude]
    }
})

<IPython.display.GeoJSON object>

In [20]:
from IPython.display import GeoJSON
point = data.sample(1)
GeoJSON({
    "type": "Feature",
    "geometry": {
        "type": "LineString",
        "coordinates": [
            [point.pickup_longitude.values[0], point.pickup_latitude.values[0]],
            [point.dropoff_longitude.values[0], point.dropoff_latitude.values[0]]
        ]
    }
})

<IPython.display.GeoJSON object>
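The dictionary passed to `GeoJSON` can be built by a small function, which makes it easy to draw any trip. The sketch below uses only the standard library; `trip_feature` is a hypothetical helper name, and the coordinates are taken from the first row of the sample shown earlier. Note that GeoJSON expects `[longitude, latitude]` order, and that the GeoJSON specification (RFC 7946) asks for a `properties` member on every Feature, which I add here as an empty object.

```python
import json


def trip_feature(pickup_lon, pickup_lat, dropoff_lon, dropoff_lat):
    """Return a GeoJSON Feature with a straight line from pickup to dropoff."""
    return {
        "type": "Feature",
        "properties": {},
        "geometry": {
            "type": "LineString",
            # GeoJSON positions are [longitude, latitude].
            "coordinates": [
                [float(pickup_lon), float(pickup_lat)],
                [float(dropoff_lon), float(dropoff_lat)],
            ],
        },
    }


# First row of the sample data shown earlier.
feature = trip_feature(-73.844, 40.721, -73.842, 40.712)
print(json.dumps(feature, indent=2))
```

In a notebook, `GeoJSON(feature)` should render the same kind of map as the cell above.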

### Bokeh

## Great Expectations

In [22]:
import great_expectations as ge
data = ge.read_csv('input/train.csv', nrows = 50000).drop(columns = 'key')

In [23]:
data.head()

| | fare_amount | pickup_datetime | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | passenger_count |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 4.5 | 2009-06-15 17:26:21 UTC | -73.844 | 40.721 | -73.842 | 40.712 | 1 |
| 1 | 16.9 | 2010-01-05 16:52:16 UTC | -74.016 | 40.711 | -73.979 | 40.782 | 1 |
| 2 | 5.7 | 2011-08-18 00:35:00 UTC | -73.983 | 40.761 | -73.991 | 40.751 | 2 |
| 3 | 7.7 | 2012-04-21 04:30:42 UTC | -73.987 | 40.733 | -73.992 | 40.758 | 1 |
| 4 | 5.3 | 2010-03-09 07:51:00 UTC | -73.968 | 40.768 | -73.957 | 40.784 | 1 |


In [24]:
data.expect_column_values_to_be_of_type("fare_amount","float")

{'success': True,
 'result': {'element_count': 50000,
  'missing_count': 0,
  'missing_percent': 0.0,
  'unexpected_count': 0,
  'unexpected_percent': 0.0,
  'unexpected_percent_nonmissing': 0.0,
  'partial_unexpected_list': []}}

In [26]:
data.expect_column_values_to_be_between("fare_amount",min_value=1,max_value=100)

{'success': False,
 'result': {'element_count': 50000,
  'missing_count': 0,
  'missing_percent': 0.0,
  'unexpected_count': 26,
  'unexpected_percent': 0.00052,
  'unexpected_percent_nonmissing': 0.00052,
  'partial_unexpected_list': [180.0,
   165.0,
   -2.9,
   -2.5,
   0.01,
   128.83,
   0.0,
   104.67,
   -3.0,
   108.0,
   120.0,
   135.0,
   110.0,
   149.0,
   0.0,
   200.0,
   -2.5,
   136.0,
   -2.5,
   128.61]}}
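To make the result above easier to read, here is a plain-Python sketch of what a range expectation computes. It is my own re-implementation for illustration, not the Great Expectations code; it assumes inclusive bounds and no missing values. The sample fares are made up, with a few values borrowed from the unexpected list above.

```python
def values_between(values, min_value, max_value):
    """Report which values fall outside [min_value, max_value] (inclusive)."""
    unexpected = [v for v in values if not (min_value <= v <= max_value)]
    total = len(values)
    return {
        "success": not unexpected,
        "element_count": total,
        "unexpected_count": len(unexpected),
        # Stored as a fraction of all elements, like 26 / 50000 = 0.00052 above.
        "unexpected_percent": len(unexpected) / total if total else 0.0,
        "unexpected_list": unexpected,
    }


# Made-up sample; 180.0, -2.9 and 0.01 also appear in the result above.
fares = [4.5, 16.9, 5.7, 180.0, -2.9, 0.01, 100.0, 1.0]
result = values_between(fares, min_value=1, max_value=100)
print(result)
```

The boundary values 1.0 and 100.0 pass because the bounds are treated as inclusive in this sketch.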

## Model Training

### DVC

## Presenting Results

### Reveal.js

### Binder

* Serve notebooks interactively from a GitHub repo
* Including dependencies and extensions
* [Example page](https://github.com/binder-examples/jupyterlab)
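A Binder launch link can be assembled from the repository coordinates. The sketch below assumes the public mybinder.org service with its `/v2/gh/<user>/<repo>/<ref>` URL scheme and the `urlpath=lab` query parameter to open JupyterLab instead of the classic notebook. The function name `binder_lab_url` and the default ref `master` are assumptions of mine.

```python
from urllib.parse import quote


def binder_lab_url(user, repo, ref="master"):
    """Build a mybinder.org link that opens the repo in JupyterLab.

    `ref` is a branch, tag or commit; "master" is only an assumed default.
    """
    return (
        "https://mybinder.org/v2/gh/"
        f"{quote(user, safe='')}/{quote(repo, safe='')}/{quote(ref, safe='')}"
        "?urlpath=lab"
    )


print(binder_lab_url("binder-examples", "jupyterlab"))
```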

# My Review


<p>
&starf;
&starf;
&starf;
&starf;
&starf;
&starf;
&starf;
&starf;
&starf;
&star;
(9/10) </p> 

**Highlights worth praising**

* The new interface
* Support for different formats
* Plugin system

**Further wishes**

* A more powerful cell editor
* Compatibility with old extensions
* Multi-user editing