<div style="display: inline-block; float:left; width: 50%">
<img src="img/jupyter-logo.png" width="300px"/>
</div>
<div style="float: left; width: 50%">
    <h1>PyData Workflow with JupyterLab</h1>
    <h3>My notes on the next generation of notebooks</h3>
    <div><br>
    <p><cite>"Finally a tool where I can do all my work. And yes, I have worked with Emacs."</cite> - <a href="https://medium.com/@boyanangelov/is-this-the-best-data-science-ide-jupyter-lab-review-fdd165470f13">Boyan Angelov, Data Scientist @mindmatch.ai</a></p>
    <p><cite>"I’m extremely excited the potential of JupyterLab. I have a high level of confidence in the team who is making this possible"</cite> - <a href="https://medium.com/@brianray_7981/jupyterlab-first-impressions-e6d70d8a175d">Brian Ray</a></p>
        </div>
</div>

## About the Author

![Some pic of me](http://nico.kreiling.family/assets/img/me.png)

**Nico Kreiling**
* "Big Data Scientist" @ [inovex](https://www.inovex.de/de/)
* Hadoop, Data Science, Kubernetes, Cloud
* Tech podcast for developers: [inoTecCast](https://inoteccast.de/)
* My tech talks at [https://github.com/krlng/techtalks](https://github.com/krlng/techtalks)

# Chapter Overview

* Notebooks in general
* Introduction to JupyterLab
* Customizing
    * Keyboard shortcuts
    * The best plugins
    * Writing plugins
* Workflow demo
    * Data exploration
    * Algorithm development
    * Sharing results
* Conclusion

# Notebooks - An Autobiography

## The Good Sides

* Clear presentation
* Interactive and intuitive to use
* Well suited to exploratory analysis and training sessions

## The Bad Sides

* Execution order matters
* Usually requires re-implementation later
* Less development support than in IDEs
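
Why execution order matters can be sketched in plain Python. This is a toy model under stated assumptions, not how Jupyter is implemented: each "cell" is a hypothetical function that mutates a shared namespace dict, and the `run` helper is made up for illustration. Running one cell twice changes the final result, which is exactly what happens when a notebook cell is re-executed by accident.

```python
# Toy model of notebook cells sharing one kernel namespace.
# The cell functions and the `run` helper are hypothetical, for illustration only.

def cell_load(ns):
    ns["prices"] = [10.0, 20.0, 30.0]

def cell_add_tax(ns):
    ns["prices"] = [p * 1.1 for p in ns["prices"]]

def cell_total(ns):
    ns["total"] = round(sum(ns["prices"]), 2)

def run(cells):
    """Execute the given cells in order against a fresh namespace."""
    ns = {}
    for cell in cells:
        cell(ns)
    return ns

top_to_bottom = run([cell_load, cell_add_tax, cell_total])
tax_cell_run_twice = run([cell_load, cell_add_tax, cell_add_tax, cell_total])

print(top_to_bottom["total"])
print(tax_cell_run_twice["total"])
```

The two totals differ although both runs use the same cells, so a notebook is only reproducible if it still gives the expected result after "Restart Kernel and Run All".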

## The Bestsellers

* Jupyter 
    * Jupyter Notebooks
    * JupyterLab
    * JupyterHub
* [Zeppelin](https://zeppelin.apache.org/)
* [Cloud Datalab (Google)](https://cloud.google.com/datalab/), [Watson Studio](https://www.ibm.com/cloud/watson-studio) ...

# JupyterLab - The Introduction

## The Story So Far...

* Re-implementation of Jupyter Notebooks
* In development for 3 years by 100 contributors
* 60 components, 12,000 commits
* [Public since February](https://blog.jupyter.org/jupyterlab-is-ready-for-users-5a6f039b8906)
* [1.0](https://github.com/jupyterlab/jupyterlab/milestone/2) expected in November


## The Essentials

* New UI & UX
* Window management
* File explorer & terminal
* Extended CSV and Markdown support
* Collapsible cells & dark theme

# The Key to Productivity

* JupyterLab is designed as a composition of plugins
* Modern technology stack (JavaScript, [Yarn](https://yarnpkg.com/en/))
* But [nbextensions](https://github.com/ipython-contrib/jupyter_contrib_nbextensions) are not compatible

## Know What Exists

![Know your tools](./img/suit1.jpg)

### Default Keyboard Shortcuts

Command  | Shortcut
------------- | -------------
Command Palette |`Accel Shift C`
File Explorer |`Accel Shift F`
Toggle Bar |`Accel B`
Fullscreen Mode |`Accel Shift D`
Close Tab |`Ctrl Q`
Launcher |`Accel Shift L`

## Extend What Exists

![Tweak your tools](./img/suit2.jpg)

### Custom Keyboard Shortcuts

Settings > Keyboard Shortcuts
```
{
    "notebook:move-cell-up": {
      "command": "notebook:move-cell-up",
      "keys": [
        "Accel Alt ArrowUp"
      ],
      "selector": "body"
    },
    "notebook:move-cell-down": {
      "command": "notebook:move-cell-down",
      "keys": [
        "Accel Alt ArrowDown"
      ],
      "selector": "body"
    }        
}
```
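
When many shortcuts are needed, the override JSON can be generated instead of typed by hand. The following is a minimal sketch: the `shortcut` helper is hypothetical, and the entry layout simply mirrors the snippet above, so it may not match the settings schema of other JupyterLab versions.

```python
import json

def shortcut(command, keys, selector="body"):
    """Build one shortcut entry in the layout used by the settings snippet above."""
    return {command: {"command": command, "keys": [keys], "selector": selector}}

overrides = {}
overrides.update(shortcut("notebook:move-cell-up", "Accel Alt ArrowUp"))
overrides.update(shortcut("notebook:move-cell-down", "Accel Alt ArrowDown"))

# Paste the printed JSON into the user overrides of the keyboard shortcut settings.
print(json.dumps(overrides, indent=4))
```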

### Useful Plugins

* Graphical extensions
    * [Git](https://github.com/jupyterlab/jupyterlab-git)
    * [toc](https://github.com/jupyterlab/jupyterlab-toc)
    * [Bokeh](https://github.com/bokeh/jupyterlab_bokeh)
    * [Tensorboard](https://github.com/chaoleili/jupyterlab_tensorboard)
* Additional data types
    * [Latex](https://github.com/jupyterlab/jupyterlab-latex)
    * [HTML Files](https://github.com/mflevine/jupyterlab_html)
    * [Drawio](https://github.com/QuantStack/jupyterlab-drawio)
    * [Geo support](https://github.com/jupyterlab/jupyter-renderers)
* In development
    * [RISE](https://github.com/lsst-sqre/RISE/tree/jupyterlab_extension/jupyterlab/jupyterlab_rise)
    * [Google Docs Collaboration](https://github.com/jupyterlab/jupyterlab-google-drive/issues/108)
    * [Variable Inspector](https://github.com/lckr/jupyterlab-variableInspector)
    * [Monaco Editor](https://github.com/jupyterlab/jupyterlab-monaco)


### Install Plugins


Either directly **via the terminal**
```
jupyter labextension install @jupyterlab/<your-plugin-name>
```

or through the **Extension Manager**

```
Settings > Extension Manager:

{
    "enabled": true
}
```
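
To set up several extensions in one go, the terminal command can be wrapped in a small script. This is only a sketch: the helper names are hypothetical, and the package names in the example are assumptions, so check each plugin's README for the npm name it is actually published under. By default the script only prints the commands instead of running them.

```python
import subprocess

def labextension_command(package):
    """Return the argv list for installing one JupyterLab extension."""
    return ["jupyter", "labextension", "install", package]

def install_all(packages, dry_run=True):
    """Install each package in turn; with dry_run=True only print the commands."""
    commands = [labextension_command(p) for p in packages]
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)
    return commands

# Package names are assumptions; verify them in each plugin's README.
commands = install_all(["@jupyterlab/git", "@jupyterlab/toc"])
```

Calling `install_all(..., dry_run=False)` runs the commands for real and stops at the first failing install because of `check=True`.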

## Make your Tools

![Make your tools](./img/suit3.jpg)

# The Main Part

## Exploration

For the demo we use the ["New York City Taxi Fare Prediction" Kaggle Challenge](https://www.kaggle.com/c/new-york-city-taxi-fare-prediction). The full dataset is available there; this repo contains only a minimal demo dataset!

In [None]:
# %load ./imports.py
import os
from datetime import datetime as dt

import numpy as np
import pandas as pd
import sklearn as skl

import IPython.display as ipd

# Pandas display options
pd.set_option('display.float_format', lambda x: '%.3f' % x)

# Set random seed 
RSEED = 42

# Visualizations
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (25, 5)
%matplotlib inline
#plt.style.use('fivethirtyeight')
plt.rcParams['font.size'] = 18

import seaborn as sns
cm = sns.light_palette("green", as_cmap=True)


In [None]:
#data = pd.read_csv('input/train.csv', nrows = 50000, parse_dates = ['pickup_datetime']).drop(columns = 'key')
data = pd.read_csv('data/sample.csv', nrows = 50000, parse_dates = ['pickup_datetime']).drop(columns = 'key')

# Remove na
data = data.dropna()
data

### Describe Data

An effective way to catch outliers and anomalies is to look at the summary statistics of the data using the `.describe()` method. I like to concentrate on the maximums and minimums when looking for outliers.

In [None]:
data.info()

In [None]:
data.describe()
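
How the minimums and maximums expose outliers can be shown on a toy frame. This is a minimal sketch: the rides are made up, and the plausibility limits are illustrative assumptions, not values from the Kaggle challenge.

```python
import pandas as pd

# Made-up rides: one negative fare and one implausible passenger count.
toy = pd.DataFrame({
    "fare_amount": [6.5, 12.0, -3.0, 9.5],
    "passenger_count": [1, 2, 1, 208],
})

# Plausible (min, max) per column; these limits are illustrative assumptions.
limits = {"fare_amount": (0, 100), "passenger_count": (0, 6)}

summary = toy.describe()
suspicious = [
    column
    for column, (low, high) in limits.items()
    if summary.loc["min", column] < low or summary.loc["max", column] > high
]

print(summary.loc[["min", "max"]])
print(suspicious)
```

Both columns are flagged here: the fare has a minimum below zero and the passenger count has a maximum far above what a taxi can carry.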

### Pandas Profiling

In [35]:
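#pandas_profiling.ProfileReport(data)
import os
import pandas_profiling

# Remember the notebook's directory so the report is written relative to it
if 'workbookDir' not in globals():
    workbookDir = os.getcwd()
print(workbookDir)

# Make sure the output folder exists before writing the report
output_dir = os.path.join(workbookDir, "output")
os.makedirs(output_dir, exist_ok=True)

profile = pandas_profiling.ProfileReport(data)
profile.to_file(outputfile=os.path.join(output_dir, "data_report.html"))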

This call to matplotlib.use() has no effect because the backend has already
been chosen; matplotlib.use() must be called *before* pylab, matplotlib.pyplot,
or matplotlib.backends is imported for the first time.

The backend was *originally* set to 'module://ipykernel.pylab.backend_inline' by the following code:
  File "/Users/nkreiling/miniconda3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/Users/nkreiling/miniconda3/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/nkreiling/miniconda3/lib/python3.6/site-packages/ipykernel_launcher.py", line 16, in <module>
    app.launch_new_instance()
  File "/Users/nkreiling/miniconda3/lib/python3.6/site-packages/traitlets/config/application.py", line 658, in launch_instance
    app.start()
  File "/Users/nkreiling/miniconda3/lib/python3.6/site-packages/ipykernel/kernelapp.py", line 497, in start
    self.io_loop.start()
  File "/Users/nkreiling/miniconda3/lib/p

/Users/nkreiling/dev/d2d-jupyterlab


In [36]:
%%!
open "$(pwd)/output/data_report.html"

[]

### View Map Data

**Our location**

In [None]:
from IPython.display import GeoJSON
GeoJSON({
    "type": "Feature",
    "geometry": {
        "type": "Point",
        "coordinates": [8.677636, 49.404927]
    }
})

In [None]:
from IPython.display import GeoJSON
point = data.sample(1)
GeoJSON({
    "type": "Feature",
    "geometry": {
        "type": "LineString",
        "coordinates": [
            [point.pickup_longitude.values[0], point.pickup_latitude.values[0]],
            [point.dropoff_longitude.values[0], point.dropoff_latitude.values[0]]
        ]
    }
})
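
To look at more than one ride at a time, the same structure can be wrapped in a GeoJSON `FeatureCollection`. The sketch below builds it as a plain dict: the helper name and the coordinates are made up for illustration. In a notebook, the resulting dict should be displayable by passing it to `GeoJSON(...)` as in the cells above.

```python
def rides_to_feature_collection(rides):
    """Turn (pickup_lon, pickup_lat, dropoff_lon, dropoff_lat) tuples into GeoJSON."""
    features = []
    for pickup_lon, pickup_lat, dropoff_lon, dropoff_lat in rides:
        features.append({
            "type": "Feature",
            "geometry": {
                "type": "LineString",
                # GeoJSON positions are [longitude, latitude]
                "coordinates": [[pickup_lon, pickup_lat], [dropoff_lon, dropoff_lat]],
            },
            "properties": {},
        })
    return {"type": "FeatureCollection", "features": features}

# Made-up coordinates roughly in Manhattan, for illustration only.
rides = [(-73.99, 40.73, -73.98, 40.76), (-73.97, 40.75, -74.00, 40.72)]
collection = rides_to_feature_collection(rides)
print(len(collection["features"]))
```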

### Great Expectations

In [None]:
import great_expectations as ge
# Use the bundled demo sample; swap in 'input/train.csv' for the full Kaggle dataset
data = ge.read_csv('data/sample.csv', nrows = 50000).drop(columns = 'key')

In [None]:
data.head()

In [None]:
data.expect_column_values_to_be_of_type("fare_amount","float")

In [None]:
data.expect_column_values_to_be_between("fare_amount",min_value=1,max_value=100)
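
What a range expectation checks can be approximated in plain pandas. This is a simplified sketch and not the Great Expectations implementation: the `check_between` helper is hypothetical, its result keys are only loosely modeled on the library's output, and the fares are made up.

```python
import pandas as pd

def check_between(series, min_value, max_value):
    """Report which values fall outside [min_value, max_value] (bounds inclusive)."""
    inside = series.between(min_value, max_value)
    unexpected = series[~inside].tolist()
    return {"success": not unexpected, "unexpected_list": unexpected}

fares = pd.Series([6.5, 12.0, -3.0, 250.0], name="fare_amount")
result = check_between(fares, min_value=1, max_value=100)
print(result)
```

The check fails for this series because a negative fare and a fare of 250 both fall outside the expected range.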

## Feature Engineering

In [None]:
def minkowski_distance(x1, x2, y1, y2, p):
    return ((abs(x2 - x1) ** p) + (abs(y2 - y1)) ** p) ** (1 / p)

data['manhattan'] = minkowski_distance(data['pickup_longitude'], data['dropoff_longitude'],
                                       data['pickup_latitude'], data['dropoff_latitude'], 1)
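
The role of `p` is easiest to see on a small worked example. The sketch below restates the formula for scalar inputs (the function name `minkowski` is chosen here for illustration) and evaluates it for the points (0, 0) and (3, 4): `p = 1` gives the Manhattan distance, `p = 2` the Euclidean distance.

```python
def minkowski(x1, x2, y1, y2, p):
    """Minkowski distance of order p between (x1, y1) and (x2, y2)."""
    return (abs(x2 - x1) ** p + abs(y2 - y1) ** p) ** (1 / p)

# Points (0, 0) and (3, 4): the classic 3-4-5 triangle.
manhattan = minkowski(0, 3, 0, 4, p=1)  # |3| + |4|
euclidean = minkowski(0, 3, 0, 4, p=2)  # sqrt(3^2 + 4^2)
print(manhattan, euclidean)
```

Note that applying this to longitude and latitude columns yields a distance in degrees, not in kilometers.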

In [None]:
# Radius of the earth in kilometers
R = 6378

def haversine_np(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)

    All args must be of equal length.    
    
    source: https://stackoverflow.com/a/29546836

    """
    # Convert latitude and longitude to radians
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])

    # Find the differences
    dlon = lon2 - lon1
    dlat = lat2 - lat1

    # Apply the formula 
    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2
    # Calculate the angle (in radians)
    c = 2 * np.arcsin(np.sqrt(a))
    # Convert to kilometers
    km = R * c
    
    return km

data['haversine'] =  haversine_np(data['pickup_longitude'], data['pickup_latitude'],
                         data['dropoff_longitude'], data['dropoff_latitude']) 
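
A quick sanity check for the haversine formula is to compare it against a distance that is known in closed form. The sketch below is self-contained and restates the function under the name `haversine_km` (chosen here for illustration) with the same radius as above: one degree of longitude along the equator must equal 1/360 of the circumference.

```python
import numpy as np

R = 6378  # same radius in kilometers as in the cell above

def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometers for inputs in decimal degrees."""
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    a = (np.sin((lat2 - lat1) / 2.0) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2)
    return R * 2 * np.arcsin(np.sqrt(a))

# One degree of longitude along the equator is 1/360 of the circumference.
expected = 2 * np.pi * R / 360
actual = haversine_km(0.0, 0.0, 1.0, 0.0)
print(round(float(actual), 3), round(float(expected), 3))
```

Both values come out at roughly 111.3 km, and identical start and end points give a distance of zero.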

In [None]:
data.head()

In [None]:
# Calculate distribution by each fare bin

palette = sns.color_palette('Paired', 10)
data['fare-bin'] = pd.cut(data['fare_amount'], bins = list(range(0, 50, 5))).astype(str)
color_mapping = {fare_bin: palette[i] for i, fare_bin in enumerate(data['fare-bin'].unique())}
data['color'] = data['fare-bin'].map(color_mapping)

In [None]:
plt.figure(figsize = (12, 6))
for f, grouped in data.groupby('fare-bin'):
    # Use the color assigned to this fare bin, not the first row of the full frame
    sns.kdeplot(grouped['haversine'], label = f'{f}', color = list(grouped['color'])[0]);

plt.xlabel('haversine distance (km)'); plt.ylabel('density')
plt.title('Haversine Distance by Fare Bin');

## Model Training

### Baseline Model

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

lr = LinearRegression()

# Split data
X_train, X_valid, y_train, y_valid = train_test_split(data, np.array(data['fare_amount']), 
                                                      stratify = data['fare-bin'],
                                                      random_state = RSEED, test_size = 10000)
lr.fit(X_train[['manhattan', 'passenger_count']], y_train)

print('Intercept', round(lr.intercept_, 4))
print('manhattan coef: ', round(lr.coef_[0], 4), 
      '\tpassenger_count coef:', round(lr.coef_[1], 4))


In [None]:
from sklearn.metrics import mean_squared_error
import warnings
warnings.filterwarnings('ignore', category = RuntimeWarning)

def metrics(train_pred, valid_pred, y_train, y_valid):
    """Calculate metrics:
       Root mean squared error and mean absolute percentage error"""
    
    # Root mean squared error
    train_rmse = np.sqrt(mean_squared_error(y_train, train_pred))
    valid_rmse = np.sqrt(mean_squared_error(y_valid, valid_pred))
    
    # Calculate absolute percentage error
    train_ape = abs((y_train - train_pred) / y_train)
    valid_ape = abs((y_valid - valid_pred) / y_valid)
    
    # Account for y values of 0
    train_ape[train_ape == np.inf] = 0
    train_ape[train_ape == -np.inf] = 0
    valid_ape[valid_ape == np.inf] = 0
    valid_ape[valid_ape == -np.inf] = 0
    
    train_mape = 100 * np.mean(train_ape)
    valid_mape = 100 * np.mean(valid_ape)
    
    return train_rmse, valid_rmse, train_mape, valid_mape

def evaluate(model, features, X_train, X_valid, y_train, y_valid):
    """Print RMSE and MAPE for the training and validation sets"""
    
    # Make predictions
    train_pred = model.predict(X_train[features])
    valid_pred = model.predict(X_valid[features])
    
    # Get metrics
    train_rmse, valid_rmse, train_mape, valid_mape = metrics(train_pred, valid_pred,
                                                             y_train, y_valid)
    
    print(f'Training:   rmse = {round(train_rmse, 2)} \t mape = {round(train_mape, 2)}')
    print(f'Validation: rmse = {round(valid_rmse, 2)} \t mape = {round(valid_mape, 2)}')
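
The two metrics are easy to verify by hand on a tiny example. The sketch below uses made-up values and only NumPy: with true fares of 10 and 20 and predictions of 12 and 18, both errors have magnitude 2, so the RMSE is 2, and the predictions are off by 20 % and 10 %, so the MAPE is 15 %.

```python
import numpy as np

# Made-up fares and predictions, for illustration only.
y_true = np.array([10.0, 20.0])
y_pred = np.array([12.0, 18.0])

# Root mean squared error: square the errors, average, take the root
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

# Mean absolute percentage error: average relative error in percent
mape = 100 * np.mean(np.abs((y_true - y_pred) / y_true))

print(rmse, mape)
```

Because the MAPE divides by the true value, a true fare of 0 produces an infinite term, which is why the `metrics` function above resets such entries to 0.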

In [None]:
evaluate(lr, ['manhattan', 'passenger_count'], 
        X_train, X_valid, y_train, y_valid)

### DVC

In [None]:
data

In [None]:
!dvc init

In [None]:
!dvc run -d data/Posts.xml.tgz \
              -o data/Posts.xml \
              -f extract.dvc \
              tar -xvf data/Posts.xml.tgz -C data

In [None]:
!dvc add images.zip

In [None]:
!dvc run -d data/sample.csv -o output unzip -q images.zip

## Presenting Results

### Reveal.js

### Binder

* Serve notebooks interactively from a GitHub repo
* Including dependencies and extensions
* [Example page](https://github.com/binder-examples/jupyterlab)

# My Review


<p>
&starf;
&starf;
&starf;
&starf;
&starf;
&starf;
&starf;
&starf;
&starf;
&star;
(9/10) </p> 

**Worth praising**

* The new interface
* Support for a variety of formats
* Plugin system

**Further wishes**

* A more powerful cell editor
* Compatibility with old extensions
* Multi-user editing