### TELEMETRY_SEND Failure Logs

[Bug 1319026](https://bugzilla.mozilla.org/show_bug.cgi?id=1319026) introduced logging to help pin down what kinds of failures users experience when sending Telemetry pings. Let's see what we've managed to collect.

In [None]:
import ujson as json
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import plotly.plotly as py

from plotly.graph_objs import *
from moztelemetry import get_pings_properties, get_one_ping_per_client
from moztelemetry.dataset import Dataset

%matplotlib inline

In [None]:
pings = Dataset.from_source("telemetry") \
    .where(docType='main') \
    .where(appUpdateChannel='nightly') \
    .where(submissionDate=lambda x: x >= "20170429") \
    .where(appBuildId=lambda x: x >= '20170429') \
    .records(sc, sample=1)

In [None]:
subset = get_pings_properties(pings, ["clientId",
                                      "environment/system/os/name",
                                      "payload/log"])

In [None]:
# Keep only the TELEMETRY_SEND_FAILURE log entries; pings without a log contribute nothing.
log_entries = subset.flatMap(
    lambda p: [] if p['payload/log'] is None
    else [l for l in p['payload/log'] if l[0] == 'TELEMETRY_SEND_FAILURE'])
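
To make the filter above concrete, here is a minimal sketch of the same logic in plain Python, without Spark. The sample pings and the failure reasons in them are made up for illustration, and the entry layout (`[id, timestamp, details...]`) is an assumption inferred from the indexing this notebook uses (`l[0]` for the id, `l[2:]` for the failure details).

```python
# Hypothetical pings, shaped like the dicts returned by get_pings_properties.
sample_pings = [
    {"clientId": "a", "payload/log": [
        ["TELEMETRY_SEND_FAILURE", 1200, "timeout"],
        ["SOME_OTHER_ENTRY", 1300, "x"],  # unrelated entry, should be dropped
    ]},
    {"clientId": "b", "payload/log": None},  # ping without a log
    {"clientId": "c", "payload/log": [
        ["TELEMETRY_SEND_FAILURE", 900, "error"],
    ]},
]


def send_failures(ping):
    """Return the TELEMETRY_SEND_FAILURE entries of one ping (possibly empty)."""
    log = ping["payload/log"]
    if log is None:
        return []
    return [entry for entry in log if entry[0] == "TELEMETRY_SEND_FAILURE"]


# Plain-Python stand-in for RDD.flatMap: apply per ping, then flatten the results.
sample_entries = [entry for ping in sample_pings for entry in send_failures(ping)]
```

Pings with a missing log and entries with any other id both drop out, so `sample_entries` ends up holding only the two send-failure entries.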

In [None]:
log_entries = log_entries.cache()

In [None]:
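# Key each entry on its failure details (everything after the id and timestamp) and count occurrences.
error_counts = log_entries.map(lambda l: (tuple(l[2:]), 1)).countByKey()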

In [None]:
entries_count = log_entries.count()
# Pair each (error, count) with its share of all failures, most frequent first.
sorted((('{:.2%}'.format(1.0 * count / entries_count), (error, count))
        for error, count in error_counts.iteritems()),
       key=lambda x: x[1][1], reverse=True)
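
The tally in the last two cells can be sketched on a small made-up sample, using `collections.Counter` as a stand-in for Spark's `countByKey`. The entries below are hypothetical, so the resulting percentages say nothing about the real data; the sketch only shows how the keys and shares are formed.

```python
from collections import Counter

# Hypothetical failure entries, assumed to be laid out as [id, timestamp, reason].
sample_entries = [
    ["TELEMETRY_SEND_FAILURE", 100, "error"],
    ["TELEMETRY_SEND_FAILURE", 200, "timeout"],
    ["TELEMETRY_SEND_FAILURE", 300, "error"],
    ["TELEMETRY_SEND_FAILURE", 400, "error"],
]

# Key on everything after the id and timestamp, as the notebook does.
sample_counts = Counter(tuple(entry[2:]) for entry in sample_entries)
total = len(sample_entries)

# (share of all failures, reason tuple, count), most common first.
breakdown = [("{:.2%}".format(float(count) / total), reason, count)
             for reason, count in sample_counts.most_common()]
```

Keying on a tuple rather than a single string keeps the grouping correct even if an entry carries more than one detail field.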

#### Conclusion

Alrighty, looks like most failures are logged as a plain "error". Not too helpful, but it does narrow things down a bit.

"timeout" is the reason for more than one in every four failures. That's a smaller cohort than I'd originally thought.

There are a few Gateway Timeouts (504), which could be down to server load, very few aborts, and essentially no Forbidden (403) or Bad Gateway (502) responses.