In [None]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

In [None]:
from IPython.display import display, HTML
display(HTML("<style>.container { width:80% !important; }</style>"))
display(HTML("<style>div.output_scroll { height: 44em; }</style>"))

In [None]:
import pandas as pd
import popmon

# Data generation
Let's first load some data!

In [None]:
df = pd.read_csv("flight_delays.csv.gz", index_col=0, parse_dates=["DATE"])

# Simple report
Now we can go ahead and generate our first report!

In [None]:
df.pm_stability_report(time_axis='DATE')

If you inspect the report in the above example, you can see that, for example, the maximum `departure_delay` on 2015-08-22 was more extreme than expected.

The time axis is a bit weird now (split into 40 bins of 9 days each), but fortunately we can specify the bin width ourselves using the `time_width` parameter!
We'll also set `time_offset` to the first date in the dataset (otherwise we may end up with the first bin containing only half a week of data).
For the remaining examples, we'll also use `extended_report=False` to keep the size of the notebook somewhat limited.

In [None]:
df.pm_stability_report(time_axis='DATE', time_width='1w', time_offset='2015-07-02', extended_report=False)
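To see why the offset matters, here is a minimal sketch of how a timestamp could be assigned to a fixed-width time bin. It is plain Python and not popmon's actual implementation; the helper name `time_bin` is made up for this illustration, and we assume that bins simply start at the offset and repeat every `width`.

```python
from datetime import datetime, timedelta


def time_bin(date, width=timedelta(weeks=1), offset=datetime(2015, 7, 2)):
    """Return (bin index, bin start, bin center) for a timestamp.

    Hypothetical helper: assumes bins start at `offset` and repeat every `width`.
    """
    index = (date - offset) // width   # floor division of two timedeltas gives an int
    start = offset + index * width     # left edge of the bin
    center = start + width / 2         # midpoint of the bin
    return index, start, center


# The first weekly bin starts exactly at the offset, so it covers a full week.
first = time_bin(datetime(2015, 7, 2))
# 2015-08-22 lies 51 days after the offset, so it falls into bin 7.
later = time_bin(datetime(2015, 8, 22))
```

Under these assumptions, the center of the first bin is `2015-07-05 12:00:00`, which matches the time slot label queried in the histogram section of this notebook.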

Finally, we could adjust the thresholds used in the traffic lights.
For example, we could show the yellow traffic light for deviations bigger than 7 standard deviations, and the red traffic light for deviations bigger than 10 standard deviations. Note that wider bounds like these trigger fewer alerts than narrower ones.

In [None]:
df.pm_stability_report(time_axis='DATE', time_width='1w', time_offset='2015-07-02', extended_report=False, pull_rules={"*_pull": [10, 7, -7, -10]})
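The four numbers in `pull_rules` can be read as `[red high, yellow high, yellow low, red low]`. The sketch below illustrates that reading in plain Python. It is only an illustration, not popmon's own code; the function name `traffic_light` is hypothetical, and whether the comparisons are strict at the boundaries is an assumption.

```python
def traffic_light(value, bounds):
    """Classify a pull value given [red_high, yellow_high, yellow_low, red_low].

    Hypothetical helper; strict inequalities at the bounds are an assumption.
    """
    red_high, yellow_high, yellow_low, red_low = bounds
    if value > red_high or value < red_low:
        return "red"
    if value > yellow_high or value < yellow_low:
        return "yellow"
    return "green"


wide_bounds = [10, 7, -7, -10]    # the bounds used in the cell above
narrow_bounds = [7, 4, -4, -7]    # narrower bounds, for comparison

# The same pull of 8 standard deviations is yellow with the wide bounds
# and red with the narrow ones.
light_wide = traffic_light(8, wide_bounds)
light_narrow = traffic_light(8, narrow_bounds)
```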

There are quite a few more parameters in `pm_stability_report()`, for example to select which features to use (e.g. `features=['x']`), or how to bin the different features (`bin_specs={'x': {'bin_width': 1, 'bin_offset': 0}}`). 
We suggest that you check them out on your own!
Have a look at the documentation for `popmon.pipeline.report.df_stability_report()` (which corresponds to `df.pm_stability_report()`).

# What about Spark DataFrames?
No problem! We can easily perform the same steps on a Spark DataFrame. Note that we need to include two jar files (used to create the histograms using Histogrammar) when we create our Spark session.
These will be downloaded automatically the first time you run the cell below.

In [None]:
# download histogrammar jar files if not already installed, used for histogramming of spark dataframe
from pyspark.sql import SparkSession
spark = SparkSession.builder.config('spark.jars.packages','org.diana-hep:histogrammar-sparksql_2.11:1.0.4').getOrCreate()

In [None]:
sdf = spark.createDataFrame(df)

In [None]:
sdf.pm_stability_report(time_axis='DATE', time_width='1w', time_offset='2015-07-02', extended_report=False)

# Using other reference types
## Using an external reference
Let's go back to Pandas again! (While all of this functionality also works on Spark DataFrames, it's just faster to illustrate it with Pandas.)
What if we want to compare our DataFrame to another DataFrame?
For example, we may have trained a machine learning model on another DataFrame (which we'll call the reference data) and want to monitor whether the new data (the current DataFrame) comes from a similar distribution.
We can do that by specifying an external reference DataFrame.

In [None]:
df_ref = pd.read_csv("flight_delays_reference.csv.gz", index_col=0, parse_dates=['DATE'])
df.pm_stability_report(time_axis='DATE', time_width='1w', time_offset='2015-07-02', extended_report=False, reference_type='external', reference=df_ref)

## Using an expanding reference
We can also use an expanding reference, which for each time slot uses all preceding time slots as a reference.

In [None]:
df.pm_stability_report(time_axis='DATE', time_width='1w', time_offset='2015-07-02', extended_report=False, reference_type="expanding")
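To make the idea concrete, here is a small sketch with made-up weekly counts instead of histograms. It is plain Python, not popmon code: for each time slot, the reference is the sum of everything that came before it, so the first slot has no reference at all.

```python
from itertools import accumulate

# Made-up values per time slot (stand-ins for the per-slot histograms)
weekly_counts = [10, 12, 11, 30, 9]

# Running totals up to and including each slot
running_totals = list(accumulate(weekly_counts))

# The reference for slot i is the total of slots 0..i-1;
# the first slot has nothing preceding it, so its reference is None.
expanding_ref = [None] + running_totals[:-1]
```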

## Using a rolling window reference
And finally, we can use a rolling window reference. Here we can play with some additional parameters: `shift` and `window`.
We'll set the `window` parameter to 5.

In [None]:
df.pm_stability_report(time_axis='DATE', time_width='1w', time_offset='2015-07-02', extended_report=False, reference_type="rolling", window=5)
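The sketch below shows one way to think about a rolling reference, again using made-up weekly counts. It is not popmon's implementation: the helper `rolling_reference` is hypothetical, and we assume that the reference for a slot is the sum of the `window` slots that end `shift` slots before it, and that slots without a full window get no reference.

```python
def rolling_reference(values, window, shift=1):
    """Sum of `window` slots ending `shift` slots before each slot.

    Hypothetical helper; requiring a full window is an assumption.
    """
    refs = []
    for i in range(len(values)):
        end = i - shift + 1      # exclusive end of the reference window
        start = end - window     # inclusive start of the reference window
        if start < 0:
            refs.append(None)    # not enough preceding slots yet
        else:
            refs.append(sum(values[start:end]))
    return refs


weekly_counts = [10, 12, 11, 30, 9]
rolling_ref = rolling_reference(weekly_counts, window=2)
```

With `window=2` and the default `shift=1`, the third slot is compared against the first two slots, the fourth against the second and third, and so on.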

# Plotting the individual histograms
Sometimes, when you're diving into alerts from the report, you may want to plot some individual histograms. 
Fortunately, you can! Let's first have a look at how these histograms are stored.

In [None]:
report = df.pm_stability_report(time_axis='DATE', time_width='1w', time_offset='2015-07-02')
split_hists = report.datastore['split_hists']['DEPARTURE_DELAY']
split_hists

Here we see the histograms for each time slot. Let us focus on the first time slot and plot the corresponding histogram.

In [None]:
split_hist = split_hists.query("date == '2015-07-05 12:00:00'")
# use .iloc[0] to take the first row by position, whatever the index looks like
split_hist.histogram.iloc[0].hist.plot.matplotlib();

And let's also plot the corresponding reference histogram.

In [None]:
split_hist.histogram_ref.iloc[0].hist.plot.matplotlib();

# Saving the report and the histograms to disk
If you run popmon regularly on the same dataset, you may want to store the report and the histograms to disk, so you can keep track of the alerts and easily inspect the histograms if anything goes wrong.

In [None]:
import pickle
with open('report.pkl', 'wb') as f: 
    pickle.dump(report, f)
report.to_file('report.html')
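Loading a pickled object back works the same way in reverse. The sketch below uses a small stand-in dictionary (`report_stub`, a made-up placeholder) and a temporary directory so that it runs on its own; with a real report you would open `report.pkl` in `'rb'` mode instead.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in for the report object (hypothetical; any picklable object behaves the same)
report_stub = {"name": "weekly flight delays", "n_time_slots": 8}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "report.pkl"

    # Write the object to disk...
    with open(path, "wb") as f:
        pickle.dump(report_stub, f)

    # ...and read it back in, for example in a later session
    with open(path, "rb") as f:
        restored = pickle.load(f)
```

Keep in mind that unpickling requires the same classes to be importable, so a pickled report should be loaded in an environment with a compatible popmon version installed.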

# Tuning parameters after generating the report
If you want to tune parameters after you've created the report, you can do so easily using `report.regenerate()`.

In [None]:
report.regenerate(last_n=0, skip_first_n=0, skip_last_n=0, plot_hist_n=2, skip_empty_plots=True,
                  report_filepath=None, store_key='html_report', sections_key='report_sections')

# Building your own pipelines
The `stability_report()` interface covers many use cases, but if you need more flexibility, you can define your own custom pipeline. We provide an example here!

In [None]:
from popmon.hist.hist_splitter import HistSplitter
from popmon.analysis.profiling import HistProfiler
from popmon.pipeline.report import StabilityReport
from popmon.base import Pipeline
from popmon.visualization import SectionGenerator, ReportGenerator

monitoring_rules = {"*_pull": [7, 4, -4, -7], "*_zscore": [7, 4, -4, -7], "[!p]*_unknown_labels": [0.5, 0.5, 0, 0]}
datastore = dict()
datastore['hists'] = df.pm_make_histograms(time_axis='DATE', time_width='1w', time_offset='2015-07-02')

modules = [
    HistSplitter(read_key='hists', store_key='split_hists', feature_begins_with='DATE'),
    HistProfiler(read_key='split_hists', store_key='profiles'),
    SectionGenerator(section_name='Profiles', read_key="profiles", store_key="report_sections"),
    ReportGenerator(read_key="report_sections", store_key="html_report")
]

pipeline = Pipeline(modules)

stability_report = StabilityReport()
stability_report.transform(pipeline.transform(datastore))
stability_report

The above makes a very simple report, containing only the profiles (and no comparisons, traffic lights or alerts). The next example shows how you can add the comparisons!
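A side note on the `monitoring_rules` dictionary defined above: its keys look like shell-style wildcard patterns. Assuming they are matched against statistic names in that way (an assumption on our part), the standard library's `fnmatch` module gives a feel for which names each pattern would select. The statistic names below are made up for illustration.

```python
from fnmatch import fnmatchcase

patterns = ["*_pull", "*_zscore", "[!p]*_unknown_labels"]

# Made-up statistic names, only for illustration
names = ["mean_pull", "ref_max_zscore", "ref_unknown_labels", "prev1_unknown_labels", "mean"]

# For each pattern, collect the names it matches (case-sensitive matching)
matches = {
    pattern: [name for name in names if fnmatchcase(name, pattern)]
    for pattern in patterns
}
```

Here `[!p]` means "any first character except `p`", so that pattern picks up `ref_unknown_labels` but skips `prev1_unknown_labels`.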

In [None]:
from popmon.analysis.comparison.hist_comparer import ReferenceHistComparer

datastore = dict()
datastore['hists'] = df.pm_make_histograms(time_axis='DATE', time_width='1w', time_offset='2015-07-02')

modules = [
    HistSplitter(read_key='hists', store_key='split_hists', feature_begins_with='DATE'),
    HistProfiler(read_key='split_hists', store_key='profiles'),
    ReferenceHistComparer(reference_key='split_hists', assign_to_key='split_hists', store_key='comparisons'),
    SectionGenerator(section_name='Profiles', read_key="profiles", store_key="report_sections"),
    SectionGenerator(section_name="Comparisons", read_key="comparisons", store_key="report_sections"),
    ReportGenerator(read_key="report_sections", store_key="html_report")
]

pipeline = Pipeline(modules)

stability_report = StabilityReport()
stability_report.transform(pipeline.transform(datastore))
stability_report

If you're interested in more complex examples, check the code in `popmon.pipeline.report_pipelines`.

Using custom pipelines, it becomes relatively easy to include new profiles and new comparisons.
If you do, be sure to let us know! You may be able to make a pull request and add it to the package.