In [1]:
from __future__ import division  # __future__ imports must come before any other import

import json

import numpy as np
import pandas as pd

In [2]:
Total calls: 113714009
% of calls made to a session replay provider: 3.7% (4211829/113714009)
Total unique base locations: 87325
% of sites (base URLs) that use a session replay provider: 5.6% (4858/87325)
.ru:
  Total calls with a .ru location: 6949846
  % of calls with a .ru location: 6.1% (6949846/113714009)
  Total unique base .ru locations: 2492
  % of sites (base URLs) that are .ru: 2.9% (2492/87325)
Session Replay:
  Total calls: 4211829
  Total unique base locations: 4858   
  .ru
    Total calls to .ru: 1940300
    Total .ru sites (unique base locations): 1634
    given a session replay call, % with a .ru location: 46.1% (1940300/4211829)
    given a call with a .ru location, % that are session replay calls: 27.9% (1940300/6949846)
    given a .ru site, % that use a session replay provider: 65.6% (1634/2492)
    given a site that uses a session replay provider, % that are .ru: 33.6% (1634/4858)
  Http vs Https:
    script_https + location_http = 942662
    script_https + location_https = 2110433
    script_http + location_http = 1158734
    surprisingly, there are no script_http + location_https calls
    script:
      http: 1158734
      https: 3053095
      http % of session replay call scripts: 27.5% (1158734/(1158734+3053095))
    location:
      http: 2101396
      https: 2110433
      http % of session replay call locations: 49.9% (2101396/(2101396+2110433))
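
The percentages in the summary above are plain ratios of the listed counts. As a quick sanity check, the following sketch recomputes the .ru figures from those counts. The variable names and the `pct` helper are illustrative and not part of the original notebook.

```python
# Counts copied from the summary above.
total_calls = 113714009
replay_calls = 4211829
ru_calls = 6949846
ru_replay_calls = 1940300

replay_sites_count = 4858
ru_sites = 2492
ru_replay_sites = 1634


def pct(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Call-level conditional shares.
ru_given_replay_call = pct(ru_replay_calls, replay_calls)
replay_given_ru_call = pct(ru_replay_calls, ru_calls)

# Site-level conditional shares (unique base locations).
replay_given_ru_site = pct(ru_replay_sites, ru_sites)
ru_given_replay_site = pct(ru_replay_sites, replay_sites_count)

print(ru_given_replay_call, replay_given_ru_call)
print(replay_given_ru_site, ru_given_replay_site)
```

Keeping the two directions of each conditional side by side makes it harder to confuse "share of replay calls that are .ru" with "share of .ru calls that are replay calls".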

**This analysis aims to look at known sites that call known session replay providers, to see if we can find signs of session replay activity.**

Session replay providers are services that offer websites a way to track their users - from how they interact with the site to what searches they performed and input they provided. Some session replay providers may even record personal information such as personal addresses and credit card information.

The list of session replay sites comes from the [**Princeton WebTAP project**](https://webtransparency.cs.princeton.edu/no_boundaries/session_replay_sites.html), which listed sites within the Alexa top 10,000 that show signs of session replay scripts.

In a sample of 2494 base sites, 90 were found to make session replay calls.
However, when we boil it down to the base URL, the sample shrinks to 1136 unique base URLs, 62 of which use session replay. Overall, 3.15% of all calls were to a session replay provider.

**Of the sites found to be making session replay calls, 35.5% had a .ru suffix, which is far higher than the 4.0% share of .ru among all URLs captured. Additionally, 27 of the 45 .ru sites tracked (60%) were using a session replay provider, compared with 62 of 1136 (5.5%) across all domains. Among the sites we track, a .ru site is roughly 20 times more likely to use session replay scripts than a non-.ru site.**
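
The 20x comparison can be checked from the counts quoted above. The sketch below assumes that the 27 .ru sites using session replay are included in the 62 replay sites overall, so the non-.ru rate is obtained by subtraction. The variable names are illustrative.

```python
# Counts quoted in the paragraph above.
ru_sites = 45
ru_replay_sites = 27
all_sites = 1136
all_replay_sites = 62

# Rate of session replay use among .ru sites.
ru_rate = ru_replay_sites / ru_sites

# Assumption: the .ru replay sites are a subset of all replay sites,
# so the non-.ru counts are the differences.
non_ru_rate = (all_replay_sites - ru_replay_sites) / (all_sites - ru_sites)

relative_likelihood = ru_rate / non_ru_rate

print("ru rate: {:.1%}".format(ru_rate))
print("non-ru rate: {:.1%}".format(non_ru_rate))
print("ratio: {:.1f}x".format(relative_likelihood))
```

Under that assumption the ratio comes out a little under 20, which is in line with the rounded figure in the text.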

**Considering http vs https: sites that make session replay calls have a higher http share (53.3%) than the overall http share (31.3%).**

Sites that use session replay scripts:

 '24smi.org',
 'открытыйурок.рф',
 'base.garant.ru',
 'dnevnik.ru',
 'dugtor.ru',
 'football.ua',
 'getcourse.ru',
 'hdrezka.ag',
 'hh.ru',
 'ibizlife.com',
 'lady.nur.kz',
 'mp3party.net',
 'my-shop.ru',
 'netbarg.com',
 'nl.justporno.tv',
 'out.pladform.ru',
 'pagseguro.uol.com.br',
 'porn555.com',
 'povar.ru',
 'rutube.ru',
 'seasonvar.ru',
 'serienstream.to',
 'sprashivai.ru',
 'steam.softonic.com',
 'studfiles.net',
 'tap.az',
 'top.mail.ru',
 'torrent-filmi.net',
 'trinixy.ru',
 'www.autodesk.com',
 'www.avito.ru',
 'www.azet.sk',
 'www.bamilo.com',
 'www.banggood.com',
 'www.bbcgoodfood.com',
 'www.cardinalcommerce.com',
 'www.casadellibro.com',
 'www.eleman.net',
 'www.eurosport.fr',
 'www.fastweb.it',
 'www.fotocasa.es',
 'www.geekbuying.com',
 'www.gl5.ru',
 'www.jbhifi.com.au',
 'www.kommersant.ru',
 'www.labirint.ru',
 'www.maam.ru',
 'www.maxcdn.com',
 'www.msu.ru',
 'www.net-a-porter.com',
 'www.newchic.com',
 'www.rbcplus.ru',
 'www.sports.ru',
 'www.stackoverflowbusiness.com',
 'www.stranamam.ru',
 'www.templatemonster.com',
 'www.the-star.co.ke',
 'www.thermofisher.com',
 'www.twirpx.com',
 'www.universal.org',
 'www.vseinstrumenti.ru',
 'xhamster.com'

We also looked at the correlation between call symbols and whether or not a call is a session replay call. The correlations were not very strong, with the highest being **window.navigator.plugins[Shockwave Flash].version** at 0.105229.
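
One way to read the correlation figure: encode each call symbol as a 0/1 indicator and correlate it with a 0/1 flag for whether the call is a session replay call. The original computation is not shown in this section, so the following is only a minimal sketch of that idea on made-up rows, using a hand-written `pearson` helper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical (symbol, is_session_replay_call) rows, for illustration only.
calls = [
    ("window.navigator.userAgent", True),
    ("window.navigator.plugins[Shockwave Flash].version", True),
    ("window.document.cookie", False),
    ("window.navigator.userAgent", False),
    ("window.navigator.plugins[Shockwave Flash].version", False),
    ("window.document.cookie", False),
]

is_replay = [int(flag) for _, flag in calls]

# One indicator column per symbol, each correlated with the replay flag.
correlations = {}
for symbol in sorted({s for s, _ in calls}):
    indicator = [int(s == symbol) for s, _ in calls]
    correlations[symbol] = pearson(indicator, is_replay)

for symbol, value in correlations.items():
    print(symbol, round(value, 3))
```

With two binary variables this coefficient is the phi coefficient, so a value near 0.1 means the symbol is only weakly informative about whether a call is a session replay call.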

Get the list of known session replay providers

In [5]:
def get_replay_sites():
    """Loads a list of session replay providers from the Princeton WebTAP project,
    which listed sites within the Alexa top 10,000 that show signs of session replay scripts.
    """
    session_replay = spark.sql("SELECT DISTINCT third_party FROM sr_site_list_csv")
    rows_array = session_replay.collect()
    return [row.third_party for row in rows_array]

In [6]:
replay_sites = get_replay_sites()

Get the list of calls where the script_url is one of the known session replay providers

In [8]:
BUCKET = 'safe-ucosp-2017/safe_dataset/v1'

ACCESS_KEY = "YOUR-ACCESS-KEY"
SECRET_KEY = "YOUR-SECRET-KEY"
ENCODED_SECRET_KEY = SECRET_KEY.replace("/", "%2F")
AWS_BUCKET_NAME = BUCKET

S3_LOCATION = "s3a://{}:{}@{}".format(ACCESS_KEY, ENCODED_SECRET_KEY, AWS_BUCKET_NAME)
MOUNT = "/mnt/{}".format(BUCKET.replace("/", "-"))

mountPoints = lambda: np.array([m.mountPoint for m in dbutils.fs.mounts()])
already_mounted = np.any(mountPoints() == MOUNT)
if not already_mounted:
    dbutils.fs.mount(S3_LOCATION, MOUNT)
display(dbutils.fs.ls(MOUNT))

In [9]:
df = spark.read.parquet("{}/{}".format(MOUNT, 'clean.parquet'))

In [10]:
from urllib.parse import urlparse
from pyspark.sql.functions import udf
from pyspark.sql.types import *

In [11]:
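import re

# Escape each hostname so that "." matches a literal dot rather than any character.
url_string = "|".join(re.escape(site) for site in replay_sites)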

In [12]:
url_string
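
Spark's `rlike` treats its argument as a regular expression and matches it anywhere in the column value, so a `|`-joined string acts as an alternation over provider hostnames. The sketch below mimics that check in plain Python with `re.search`. The provider names and URLs are made up for illustration, and `is_replay_call` is a hypothetical helper. It also shows why escaping the hostnames matters, since an unescaped `.` matches any character.

```python
import re

# Illustrative provider hostnames; the real list comes from get_replay_sites().
providers = ["mc.yandex.ru", "static.hotjar.com"]

# Alternation with literal dots (escaped) and with wildcard dots (unescaped).
escaped = re.compile("|".join(re.escape(p) for p in providers))
unescaped = re.compile("|".join(providers))


def is_replay_call(script_url, pattern=escaped):
    """Return True if any provider hostname occurs anywhere in script_url."""
    return pattern.search(script_url) is not None


genuine = "https://mc.yandex.ru/metrika/watch.js"
lookalike = "https://mcXyandex.ru/watch.js"

print(is_replay_call(genuine))
print(is_replay_call(lookalike))
print(is_replay_call(lookalike, unescaped))
```

Because the match is partial, a provider hostname that appears in the path or query string of an unrelated URL would also count as a hit, so the resulting counts are best read as an upper bound.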

In [13]:
def parse_base_url(url):
  """Return the network location (host, plus port if present) of a URL."""
  return urlparse(url).netloc
udf_parse_base_url = udf(parse_base_url, StringType())

In [14]:
df_with_base_script_url = df.withColumn("base_script_url", udf_parse_base_url(df.script_url)).withColumn("base_location_url", udf_parse_base_url(df.location))

In [15]:
df.filter(df.script_url.rlike(url_string)).count()

In [16]:
sites_using_session_replay =  df_with_base_script_url.filter(df_with_base_script_url.script_url.rlike(url_string))

In [17]:
sites_using_session_replay.count()

In [18]:
df_with_base_script_url.dropDuplicates(['base_location_url']).count()

In [19]:
def parse_url_scheme(url):
  """Return the scheme of a URL, e.g. 'http' or 'https'."""
  return urlparse(url).scheme
udf_parse_url_scheme = udf(parse_url_scheme, StringType())
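
Both helpers above rely on `urllib.parse.urlparse`. As a small illustration of what the two fields contain, here is the same parsing applied to a single made-up URL outside of Spark.

```python
from urllib.parse import urlparse

# Example URL, for illustration only.
parts = urlparse("https://www.avito.ru/moskva?page=2")

base_url = parts.netloc  # what parse_base_url returns
scheme = parts.scheme    # what parse_url_scheme returns
print(scheme, base_url)

# Without a scheme (or a leading "//"), urlparse puts everything in the path,
# so both fields come back empty.
no_scheme = urlparse("www.avito.ru/moskva")
print(repr(no_scheme.scheme), repr(no_scheme.netloc))
```

Rows whose URL lacks a scheme therefore end up with an empty base URL, which is worth keeping in mind when counting unique base locations.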

In [20]:
df_with_scheme = df.withColumn("script_scheme", udf_parse_url_scheme(df.script_url)).withColumn("location_scheme", udf_parse_url_scheme(df.location))

In [21]:
df_with_scheme.groupBy("script_scheme", "location_scheme").count().collect()

In [22]:
session_replay_scheme =  df_with_scheme.filter(df_with_scheme.script_url.rlike(url_string))

In [23]:
session_replay_scheme.groupBy("script_scheme", "location_scheme").count().collect()
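
The `groupBy("script_scheme", "location_scheme").count()` step tallies how often each combination of script scheme and page scheme occurs. A minimal plain-Python sketch of the same tally, on made-up URL pairs, could look like this:

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical (script_url, location) pairs, for illustration only.
calls = [
    ("https://cdn.example.com/replay.js", "http://shop.example.ru/"),
    ("https://cdn.example.com/replay.js", "https://news.example.com/"),
    ("http://cdn.example.com/replay.js", "http://blog.example.org/"),
    ("https://cdn.example.com/replay.js", "http://shop.example.ru/cart"),
]

# Count each (script scheme, location scheme) combination.
scheme_pairs = Counter(
    (urlparse(script_url).scheme, urlparse(location).scheme)
    for script_url, location in calls
)

for (script_scheme, location_scheme), count in sorted(scheme_pairs.items()):
    print("script_{} + location_{} = {}".format(script_scheme, location_scheme, count))
```

Summing the counts over one of the two keys gives the per-script or per-location totals.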

In [24]:
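def parse_suffix(url):
  """Return the text after the last dot, e.g. 'ru' for 'www.avito.ru'.

  Note: a port in the input (e.g. 'example.ru:8080') stays in the result.
  """
  return url.split(".")[-1]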
udf_parse_suffix = udf(parse_suffix, StringType())
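
Taking the last dot-separated piece of the base URL works for most hosts, but it keeps a port number and the original letter case. The sketch below contrasts that approach with a hypothetical variant, `suffix_from_hostname`, built on `urlparse(...).hostname`. It is not part of the original notebook and is shown only to illustrate the edge cases.

```python
from urllib.parse import urlparse


def suffix_from_netloc(netloc):
    """Mirror the notebook's approach: last dot-separated piece of the netloc."""
    return netloc.split(".")[-1]


def suffix_from_hostname(url):
    """Hypothetical variant: hostname drops any port and is lowercased."""
    host = urlparse(url).hostname
    return host.split(".")[-1] if host else None


# Made-up URL with upper-case letters and an explicit port.
url = "http://Example.RU:8080/page"

print(suffix_from_netloc(urlparse(url).netloc))
print(suffix_from_hostname(url))
```

Whether this matters depends on how many locations in the dataset carry explicit ports or mixed-case hosts; for counts grouped by suffix, such rows would otherwise land in their own small groups.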

In [25]:
df_suffix = df_with_base_script_url.withColumn("location_suffix", udf_parse_suffix(df_with_base_script_url.base_location_url))

In [26]:
suffixes_session_replay = df_suffix.filter(df_suffix.script_url.rlike(url_string)).groupBy("location_suffix").count().collect()

In [27]:
suffixes_session_replay_unique = df_suffix.filter(df_suffix.script_url.rlike(url_string)).dropDuplicates(['base_location_url']).groupBy("location_suffix").count().collect()

In [28]:
suffixes_unique = df_suffix.dropDuplicates(['base_location_url']).groupBy("location_suffix").count().collect()

In [29]:
suffixes_unique.sort(key = lambda x: x[1], reverse = True)

In [30]:
suffixes_session_replay_unique.sort(key = lambda x: x[1], reverse = True)

In [31]:
suffixes_session_replay.sort(key = lambda x: x[1], reverse = True)

In [32]:
display(suffixes_session_replay)

In [33]:
display(suffixes_session_replay_unique)

In [34]:
suffixes = df_suffix.groupBy("location_suffix").count().collect()

In [35]:
suffixes.sort(key = lambda x: x[1], reverse = True)

In [36]:
display(suffixes)

In [37]:
display(suffixes_unique)

In [38]:
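percentages = []
# Map each suffix to its session replay call count for direct lookup.
replay_counts = {row[0]: row[1] for row in suffixes_session_replay}
# Share of session replay calls for each of the 30 most frequent suffixes.
for suffix, total in suffixes[:30]:
  if suffix in replay_counts:
    percentages.append([suffix, replay_counts[suffix] / total])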

In [39]:
percentages

In [40]:
display(percentages)

In [41]:
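percentages_unique = []
# Map each suffix to its number of unique sites that use session replay.
replay_site_counts = {row[0]: row[1] for row in suffixes_session_replay_unique}
# Share of session replay sites for each of the 30 most frequent suffixes.
for suffix, total in suffixes_unique[:30]:
  if suffix in replay_site_counts:
    percentages_unique.append([suffix, replay_site_counts[suffix] / total])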

In [42]:
percentages_unique

In [43]:
display(percentages_unique)

In [44]:
df_suffix.filter(df_suffix.location_suffix == "ru").count()

In [45]:
df_suffix.filter(df_suffix.location_suffix == "ru").dropDuplicates(['base_location_url']).count()

In [46]:
session_suffix = df_suffix.filter(df_suffix.script_url.rlike(url_string))

In [47]:
session_suffix.filter(session_suffix.location_suffix == "ru").count()

In [48]:
session_suffix.filter(session_suffix.location_suffix == "ru").dropDuplicates(['base_location_url']).count()