### Telemetry Hello World

This is a very brief introduction to Spark and Telemetry in Python. Have a look at the [tutorial](https://gist.github.com/vitillo/25a20b7c8685c0c82422) in Scala and the associated [talk](http://www.slideshare.net/RobertoAgostinoVitil/spark-meets-telemetry) if you are interested in learning more about Spark.

In [None]:
import ujson as json
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

from moztelemetry.dataset import Dataset

%matplotlib inline

### Basics

We will use the Dataset API to fetch data.  Documentation can be found at: https://python-moztelemetry.readthedocs.io/en/stable/api.html#dataset

The goal of this example is to plot the startup distribution for each OS. Let's see how many parallel workers we have at our disposal:

In [None]:
sc.defaultParallelism

We can look at the schema of the dataset we are interested in:

In [None]:
Dataset.from_source('telemetry').schema

Let's create a Dataset of Telemetry submissions for a given submission date:

In [None]:
pings_dataset = (
    Dataset.from_source('telemetry')
    .where(docType='main')
    .where(submissionDate='20180105')
    .where(appUpdateChannel="nightly")
)

Select only the properties we need and then take a 10% sample:

In [None]:
pings = (
    pings_dataset
    .select(
        'clientId',
        osName='environment.system.os.name',
        firstPaint='payload.simpleMeasurements.firstPaint')
    .records(sc, sample=0.1)
)
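The keyword arguments to `select()` pair an output name with a dotted path inside each ping. As a rough illustration of that idea only (this is not moztelemetry's implementation), the hypothetical helper `lookup` below walks a nested dictionary along such a path and, by assumption, yields `None` when a component is missing. The sample ping is invented for the example.

```python
def lookup(record, path):
    """Follow a dotted path through nested dicts; return None if a step is missing."""
    current = record
    for key in path.split('.'):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


# Invented sample ping, shaped like the paths used in the cell above.
sample_ping = {
    'clientId': 'abc',
    'environment': {'system': {'os': {'name': 'Linux'}}},
    'payload': {'simpleMeasurements': {}},  # firstPaint deliberately absent
}

# Output name -> dotted path, mirroring the select() call above.
paths = {
    'clientId': 'clientId',
    'osName': 'environment.system.os.name',
    'firstPaint': 'payload.simpleMeasurements.firstPaint',
}
selected = {name: lookup(sample_ping, path) for name, path in paths.items()}
print(selected)
```

In this sketch a ping without a `firstPaint` measurement still produces a record, with `None` in that field, so later steps need to cope with missing values.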

Let's filter out submissions with an invalid startup time:

In [None]:
# Drop records whose firstPaint is missing (None) or negative.
subset = pings.filter(lambda p: p.get('firstPaint') is not None and p['firstPaint'] >= 0)

To prevent pseudoreplication, let's consider only a single submission for each client. Because this step requires a distributed shuffle, run it only after extracting the attributes of interest with *Dataset.select()*, so that less data has to move between workers.

In [None]:
subset = (
    subset
    .map(lambda p: (p['clientId'], p))
    .reduceByKey(lambda p1, p2: p1)
    .map(lambda p: p[1])
)
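For readers new to `reduceByKey`, here is a minimal plain-Python sketch of what the three steps above do, run on a small invented list instead of an RDD. The records and variable names are assumptions made for illustration.

```python
# Invented records standing in for the pings RDD.
records = [
    {'clientId': 'a', 'firstPaint': 120},
    {'clientId': 'b', 'firstPaint': 300},
    {'clientId': 'a', 'firstPaint': 95},
]

# map: build (key, value) pairs keyed by client.
pairs = [(r['clientId'], r) for r in records]

# reduceByKey with (p1, p2) -> p1: keep the first value seen for each key.
first_by_client = {}
for client_id, record in pairs:
    first_by_client.setdefault(client_id, record)

# map: drop the key again, leaving one record per client.
deduplicated = list(first_by_client.values())
print(len(deduplicated))
```

On a cluster the surviving submission is not necessarily the first one in the input, because the order in which Spark merges values across partitions is not fixed. That is fine here, since any single submission per client is enough.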

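Caching keeps the computed RDD in memory so that later actions reuse it instead of re-reading and re-processing the source data, which allows for an iterative, interactive development workflow: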

In [None]:
cached = subset.cache()

How many pings are we looking at?

In [None]:
cached.count()

Let's group the startup timings by OS:

In [None]:
grouped = (
    cached
    .map(lambda p: (p['osName'], p['firstPaint']))
    .groupByKey()
    .collectAsMap()
)
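The result of `groupByKey().collectAsMap()` is a local dictionary that maps each OS name to an iterable of startup times. A plain-Python sketch of the same grouping, using invented values, might look like this:

```python
from collections import defaultdict

# Invented (osName, firstPaint) pairs standing in for the mapped RDD.
pairs = [
    ('Windows_NT', 1200),
    ('Linux', 800),
    ('Windows_NT', 950),
    ('Darwin', 1100),
]

# Collect every value under its key, as groupByKey does.
grouped_local = defaultdict(list)
for os_name, first_paint in pairs:
    grouped_local[os_name].append(first_paint)

# collectAsMap hands back an ordinary dict on the driver.
grouped_local = dict(grouped_local)
print(grouped_local)
```

Because everything ends up in the driver's memory, this pattern is only reasonable when the grouped data is small, as it is after sampling and deduplication.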

And finally plot the data:

In [None]:
frame = pd.DataFrame({x: np.log10(pd.Series(list(y))) for x, y in grouped.items()})
plt.figure(figsize=(17, 7))
frame.boxplot(return_type='axes')
plt.ylabel('log10(firstPaint)')
plt.show()

In [None]:
plt.title('startup distribution for Windows')
plt.ylabel('count')
plt.xlabel('log10(firstPaint)')
frame['Windows_NT'].plot(kind='hist', bins=50, figsize=(14, 7))

### Histograms

Let's extract a histogram of GC_MARK_MS (time spent running the JS garbage collection mark phase) from the submissions:

(see https://developer.mozilla.org/en-US/docs/Web/JavaScript/Memory_Management for more information)

In [None]:
histograms = (
    pings_dataset
    .select(GC_MARK_MS_content='payload.processes.content.histograms.GC_MARK_MS.values',
            GC_MARK_MS_parent='payload.histograms.GC_MARK_MS.values')
    .records(sc, sample=0.05)
)

- `payload.histograms.GC_MARK_MS.values` is a path to the GC_MARK_MS values of the parent (main) process
- `payload.processes.content.histograms.GC_MARK_MS.values` is a path to the GC_MARK_MS values of the child processes

Let's aggregate the histogram over all submissions and plot it. Since the parent and child processes are recorded separately, we can create a histogram for each one and then add them together.

Each histogram is a pandas series where the index is the bucket and the value is the count.

In [None]:
def aggregate_series(s1, s2):
    """Function to sum up series; if one is None, return other"""
    if s1 is None:
        return s2
    if s2 is None:
        return s1
    return s1.add(s2, fill_value=0)

aggregated_content = (
    histograms
    .map(lambda p: pd.Series(p['GC_MARK_MS_content']))
    .reduce(aggregate_series)
)
aggregated_content.index = [int(i) for i in aggregated_content.index]
aggregated_content = aggregated_content.sort_index()

aggregated_parent = (
    histograms
    .map(lambda p: pd.Series(p['GC_MARK_MS_parent']))
    .reduce(aggregate_series)
)
aggregated_parent.index = [int(i) for i in aggregated_parent.index]
aggregated_parent = aggregated_parent.sort_index()
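To make the bucket-wise addition concrete, here is a small dictionary-based sketch of the same reduction without pandas or Spark. The helper `add_histograms` and the sample histograms are hypothetical. It mirrors `aggregate_series`: a missing histogram is skipped, and a bucket absent from one side counts as 0.

```python
from functools import reduce


def add_histograms(h1, h2):
    """Sum two bucket -> count mappings, treating a missing bucket as 0."""
    if h1 is None:
        return h2
    if h2 is None:
        return h1
    buckets = set(h1) | set(h2)
    return {b: h1.get(b, 0) + h2.get(b, 0) for b in buckets}


# Invented per-ping histograms; bucket labels are strings, one ping has no data.
per_ping = [{'0': 3, '1': 1}, None, {'1': 2, '5': 4}]

total = reduce(add_histograms, per_ping)

# Convert bucket labels to integers and sort them, as done for the pandas index.
total_sorted = dict(sorted((int(b), c) for b, c in total.items()))
print(total_sorted)
```

Converting the labels to integers before sorting matters because string labels sort lexicographically, which would place bucket `'10'` before bucket `'2'`.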

In [None]:
plt.title('GC_MARK_MS_content')
aggregated_content.plot(kind='bar', figsize=(15, 7))

In [None]:
plt.title('GC_MARK_MS_parent')
aggregated_parent.plot(kind='bar', figsize=(15, 7))

We can also aggregate the values of the parent and child processes:

In [None]:
plt.title('GC_MARK_MS')
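# Use add() with fill_value=0 so buckets present in only one series are kept instead of becoming NaN.
aggregated_content.add(aggregated_parent, fill_value=0).plot(kind='bar', figsize=(15, 7))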

Keyed histograms follow a similar pattern. To extract a keyed histogram when we already know the key (label) we are interested in, include the key in the path:

In [None]:
keyed_hist = (
    pings_dataset
    .select(redirects='payload.keyedHistograms.NETWORK_HTTP_REDIRECT_TO_SCHEME.https.values')
    .records(sc, sample=0.05)
)

Add up the counts from every ping and plot the result:

In [None]:
aggregated = (
    keyed_hist
    .filter(lambda p: p['redirects'] is not None)
    .map(lambda p: pd.Series(p['redirects']))
    # fill_value=0 keeps buckets that appear in only one of the two series
    .reduce(lambda c1, c2: c1.add(c2, fill_value=0))
)

In [None]:
aggregated.plot(kind='bar', figsize=(15, 7))