# Overview
While investigating long subsession lengths in the longitudinal dataset, @spensrose and I discussed establishing an upper bound on subsession lengths.
The upper bound we settled on is that subsessions should not overlap.
This notebook explores long (> 24H) overlapping subsessions.
A long overlapping pair of subsessions occurs, for example, when a subsession (> 24H) starts on one day and completes on the following day, while another subsession is also present on the first day.
This should not occur.
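
To make the definition concrete, here is a minimal sketch of the overlap check on plain Python objects. It assumes each subsession is a hypothetical `(start, length)` pair with full timestamps; `find_long_overlaps` and `LONG_THRESHOLD` are illustrative names, and the actual `find_overlaps` in `transform.py` may work differently (for example on start dates rather than timestamps).

```python
from datetime import datetime, timedelta

# Assumption: "long" means strictly longer than 24 hours.
LONG_THRESHOLD = timedelta(hours=24)


def find_long_overlaps(subsessions):
    """Return overlapping interval pairs in which at least one side is long.

    `subsessions` is a list of (start, length) tuples for a single profile.
    """
    # Convert to (start, end) intervals sorted by start time.
    intervals = sorted((start, start + length) for start, length in subsessions)
    overlaps = []
    for i, (start_a, end_a) in enumerate(intervals):
        for start_b, end_b in intervals[i + 1:]:
            if start_b >= end_a:
                # Sorted by start, so no later interval can overlap this one.
                break
            a_is_long = (end_a - start_a) > LONG_THRESHOLD
            b_is_long = (end_b - start_b) > LONG_THRESHOLD
            if a_is_long or b_is_long:
                overlaps.append(((start_a, end_a), (start_b, end_b)))
    return overlaps


subsessions = [
    (datetime(2016, 3, 1, 8, 0), timedelta(hours=30)),  # long, ends the next day
    (datetime(2016, 3, 1, 20, 0), timedelta(hours=1)),  # starts inside the long one
    (datetime(2016, 3, 3, 9, 0), timedelta(hours=2)),   # isolated
]
pairs = find_long_overlaps(subsessions)
```

In this example only the first two subsessions form a long overlapping pair; the third starts after both have ended.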

After determining how many users this issue affects, I analyze specifically the portion of users who have long overlapping subsessions.

## Window of Analysis
I look at users in `sample_id=42` whose `subsession_start_date` falls between the beginning of the year and the current date (exclusive).
I chose this window primarily to investigate newer data.
If this window is extended in the future, the pandas code will have to be refactored to run on Spark; otherwise the local pandas dataframes in this notebook are likely to run into memory issues.
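
As an illustration of the window, the sketch below applies the same two conditions to plain Python records. It is not the code in `extract.py`: the record layout, the helper names `in_analysis_window` and `select_records`, and the reading of "(exclusive)" as excluding both endpoints are assumptions made for the example.

```python
from datetime import date


def in_analysis_window(subsession_start, today):
    """True if the date lies strictly between 1 January of today's year and today."""
    year_start = date(today.year, 1, 1)
    return year_start < subsession_start < today


def select_records(records, today):
    """Keep records from sample 42 whose start date is inside the window."""
    return [
        rec for rec in records
        if rec["sample_id"] == 42
        and in_analysis_window(rec["subsession_start_date"], today)
    ]


today = date(2016, 6, 15)
records = [
    {"sample_id": 42, "subsession_start_date": date(2016, 2, 10)},   # kept
    {"sample_id": 7, "subsession_start_date": date(2016, 2, 10)},    # wrong sample
    {"sample_id": 42, "subsession_start_date": date(2015, 12, 31)},  # before window
    {"sample_id": 42, "subsession_start_date": date(2016, 1, 1)},    # boundary, excluded
    {"sample_id": 42, "subsession_start_date": date(2016, 6, 15)},   # boundary, excluded
]
selected = select_records(records, today)
```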

In [None]:
# ship the local helper modules to the Spark executors, then import them
file_names = ['extract.py', 'transform.py', 'utils.py']
for file_name in file_names:
    sc.addPyFile(file_name)
from extract import *
from transform import *

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(color_codes=True)

In [None]:
df = read_from_main_summary(sqlContext)

## Analysis By Aggregating Profile
The following questions are answered by aggregating by profile to determine distributions.
Later, we will aggregate by profile and other fields, such as `app_version` and `app_build_id`.

In [None]:
aggregate_fields = ['client_id']
wide_df_by_client = long_to_wide(sqlContext, df, aggregate_fields)

In [None]:
overlaps_rdd_by_client = wide_df_by_client.rdd.map(find_overlaps).cache()

### What's the percentage of profiles that have at least one overlapping session?

In [None]:
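# the wide dataset has one row per profile; df has one row per subsession
num_total_profiles = wide_df_by_client.count()
# tup[-1] is the percentage of overlapping subsessions for the profile
num_profiles_with_overlap = overlaps_rdd_by_client.filter(lambda tup: tup[-1] != 0).count()
percentage = (float(num_profiles_with_overlap) / num_total_profiles) * 100
print('Percentage of profiles with at least one overlap: %.4f' % percentage)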

### What percentage of each profile's subsessions are overlapping, for each channel?
The violin plot below tells the story quite clearly.
The percentage of long overlapping subsessions in this scenario is very small.
The statistical summaries per profile are shown after the violin plot.
I had used `sns.stripplot` & `sns.boxplot` previously, but they didn't show how narrow the bulk of the distribution is on the `release` channel.
In the end I opted for a violin plot and a supplementary statistical summary.

In [None]:
# tup[-1] is the percentage
perc_client_df = overlaps_rdd_by_client\
.filter(lambda tup: tup[-1] != 0)\
.toDF(aggregate_fields + ['normalized_channel', 'perc_overlaps'])\
.drop('client_id')\
.toPandas()

In [None]:
sns.violinplot(x='normalized_channel', y='perc_overlaps', data=perc_client_df)

In [None]:
perc_client_df\
.groupby('normalized_channel')\
.describe()\
.reset_index()\
.pivot('normalized_channel', 'level_1', values='perc_overlaps')

### How do the above distributions vary when considering only subsessions originating from recent Firefox builds?

In [None]:
# app_version <-> app_build_id should be 1 to 1
aggregate_fields = ['client_id', 'app_version', 'app_build_id']
wide_df_by_build = long_to_wide(sqlContext, df, aggregate_fields)

In [None]:
overlaps_rdd_by_build = wide_df_by_build.rdd.map(find_overlaps).cache()

In [None]:
# we should now have percentages per client & build
perc_builds_df = overlaps_rdd_by_build\
.filter(lambda tup: tup[-1] != 0)\
.toDF(aggregate_fields + ['normalized_channel', 'perc_overlaps'])\
.drop('client_id')\
.toPandas()

In [None]:
def build_month(row):
    # app_build_id starts with the build date as YYYYMMDD; keep it as YYYY-MM
    app_build_id = row['app_build_id']
    return app_build_id[0:4] + "-" + app_build_id[4:6]

perc_builds_df['build_month'] = perc_builds_df.apply(build_month, axis=1)

In [None]:
violin_order_month = sorted(list(set(perc_builds_df['build_month'])))
violin_order_month = [month for month in violin_order_month if month >= '2016-01']
violin_df = perc_builds_df[perc_builds_df['build_month'] >= '2016-01']

The following plot & summary statistics dataframe show the distribution of the percentage of long, overlapping subsessions per client per build, aggregated by the month of the build.
Overall, we can see that each aggregation still has outliers that affect the distribution, similar to when we aggregated only by `normalized_channel`.

In [None]:
pivot_column_str = 'build_month'
plt.xticks(rotation=90)
sns.violinplot(x=pivot_column_str, y='perc_overlaps', data=violin_df, order=violin_order_month)

In [None]:
perc_builds_df\
.groupby(pivot_column_str)\
.describe()\
.reset_index()\
.pivot(pivot_column_str, 'level_1', values='perc_overlaps')

### Using DocId
I deduplicate by docid in `extract.py`.
After deduplicating, I move from a long dataset to a wide dataset, similar to longitudinal.
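
The two steps can be sketched on plain Python dictionaries as follows. This is only an illustration of the idea, not the code in `extract.py` or `transform.py`: the helper names `dedupe_by_document_id` and `long_to_wide_sketch`, the field names `document_id` and `subsession_length`, and the choice to keep the first row seen for each docid are all assumptions.

```python
from collections import defaultdict


def dedupe_by_document_id(rows):
    """Keep the first row seen for each document_id."""
    seen = set()
    unique = []
    for row in rows:
        if row["document_id"] not in seen:
            seen.add(row["document_id"])
            unique.append(row)
    return unique


def long_to_wide_sketch(rows, key="client_id"):
    """Collapse one-row-per-subsession data into one entry per key."""
    wide = defaultdict(list)
    for row in rows:
        wide[row[key]].append(
            (row["subsession_start_date"], row["subsession_length"])
        )
    return dict(wide)


rows = [
    {"document_id": "a", "client_id": "c1",
     "subsession_start_date": "2016-03-01", "subsession_length": 3600},
    {"document_id": "a", "client_id": "c1",  # duplicate submission of "a"
     "subsession_start_date": "2016-03-01", "subsession_length": 3600},
    {"document_id": "b", "client_id": "c1",
     "subsession_start_date": "2016-03-02", "subsession_length": 90000},
    {"document_id": "c", "client_id": "c2",
     "subsession_start_date": "2016-03-01", "subsession_length": 120},
]
wide = long_to_wide_sketch(dedupe_by_document_id(rows))
```

Deduplicating first matters here: without it, the repeated document `a` would appear as two identical subsessions for `c1` and would look like an overlap.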