## Browser Attribute Fingerprinting.

Browser Attribute Fingerprinting referes to uniquely identifing a client based on a set of browser attributes. This kind of fingerprinting is difficult to evade because it doesn't store any cookies on the client-end. Secondly innocuous choices like enabling or disabling a plugin or an add-on adds to a unique identity of a client. Something like adding the system to a no-tracker list in itself adds a unique attribute about the client to the list and in turn increases the chances of a successful fingerprinting. 

Browser Fingerprinting can loosely be defined (but not restricted to) as collecting following information: Browser Plugins, Browser Add-on Enumeration, System Fonts Enumeration, User Agent String, Screen Resolution etc. Details about these can be found [here](https://multilogin.com/browser-fingerprinting-the-surveillance-you-can-t-stop/)

The questions I am trying to answer in this analysis:

- __Detect all the scripts linked in the dataset which can be used for browser finger printing.__ As discussed in [#34](https://github.com/mozilla/overscripted/issues/34) we already know that `hs-analytics`, `fingerprint2.js` and `/akam/` are scripts used for fingerprinting. Can we find features in the dataset using which we can identify other such scripts?

- __Understand which script uses what information to do the fingerprinting.__ This is useful to develop a broad heuristic which we can then employ to understand how prevalent is browser attribute fingerprinting in our dataset. Here I want to identify scripts which use plugin information vs scripts which employ add-ons or screen resolution details to fingerprint browsers. Idea here is that even if a script only takes plugin information it can potentially do fingerprinting without asking client for other attributes. Therefore we can not be too restrictive in our analysis/filtering.

- __Develop a heuristic to identify cases of Browser Attribute Fingerprinting.__ Finally I want to come up with a rule based heuristic which is smart enough to catch all cases of browser attribute fingerprinting. This is linked to both Q1 and Q2. Answering this would require a reasonable progress on both of the earlier questions.

*Note: Present work is towards my application to Outreachy program. I have restricted the analysis to 1 parquet file and all of this can be extended to the whole dataset with dask/spark. In interest of time and to focus more on analysis I have shown results only with pandas*

## Getting Started
Basic import statements for different libraries used below and corresponding display settings

In [1]:
import numpy as np
import pandas as pd
import tldextract
from urllib.parse import urlparse
import os

In [2]:
## Don't limit pandas display.
pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)
pd.set_option("display.max_colwidth", -1)

## Helper functions
Some useful functions defined here which can be used in analysis

In [3]:
def extract_domain(url):
    """Use tldextract to return the base domain from a url"""
    try:
        extracted = tldextract.extract(url)
        return '{}.{}'.format(extracted.domain, extracted.suffix)
    except Exception as e:
        return 'ERROR'

In [4]:
#Borrowed from 2018_09_biskit1_mordax__canvas_fingerprinting notebook
def parse_base_url(url):
  return urlparse(url).netloc # Extract the base part of a URL (netloc, up until the first '/'). 

In [5]:
def write_csv(path,name,df):
    df.to_csv(os.path.join(path,name))
    

In [44]:
def print_groupby(grouped_df):
    for key, item in grouped_df:
        print(grouped_df.get_group(key), "\n\n")

In [49]:
def get_end_of_path(x):
    splits = x.split('/')
    val = ''
    if len(splits) > 0:
        val = splits[-1]
    else:
        val = x
    if len(val) == 0:
        val = EMPTY_STRING
    return val

## Data directory
Change below to point to data location

In [21]:
DATA_DIR = '/home/alvis/Desktop/Richa/overscripted/sample.parquet/'
DATA_DIR = '/home/alvis/Desktop/Richa/overscripted/sample/'
PARQUET_FILE = DATA_DIR + 'part.0.parquet'
PARQUET_FILE = DATA_DIR + 'sample1.parquet'# I ran this with sample data*
IMP_COLUMNS = ['arguments','in_iframe', 'location', 'operation', 'script_url','symbol','time_stamp','value_1000','location_domain','script_domain','location_base_url']

In [22]:
df = pd.read_parquet(PARQUET_FILE, engine='pyarrow')
df.head()

Unnamed: 0,argument_0,argument_1,argument_2,argument_3,argument_4,argument_5,argument_6,argument_7,argument_8,arguments,arguments_n_keys,call_id,call_stack,crawl_id,file_name,func_name,in_iframe,location,operation,script_col,script_line,script_loc_eval,script_url,symbol,time_stamp,value,value_1000,value_len,valid,errors
0,,,,,,,,,,{},0,1_028048bbce3f7816a5f1277ac3ac2372d6607581a77a4bfb7a1873ab.json__0,,1,1_028048bbce3f7816a5f1277ac3ac2372d6607581a77a4bfb7a1873ab.json,a/<,True,https://staticxx.facebook.com/connect/xd_arbiter/r/lY4eZXm_YWu.js?version=42#channel=f30ef17b61f384&origin=http%3A%2F%2Fwww.ubitennis.com,get,1802,57,,https://staticxx.facebook.com/connect/xd_arbiter/r/lY4eZXm_YWu.js?version=42#channel=f30ef17b61f384&origin=http%3A%2F%2Fwww.ubitennis.com,window.name,2017-12-16 02:54:10.079,fb_xdm_frame_https,fb_xdm_frame_https,18,True,
1,,,,,,,,,,{},0,1_028048bbce3f7816a5f1277ac3ac2372d6607581a77a4bfb7a1873ab.json__1,,1,1_028048bbce3f7816a5f1277ac3ac2372d6607581a77a4bfb7a1873ab.json,a/<,True,https://staticxx.facebook.com/connect/xd_arbiter/r/lY4eZXm_YWu.js?version=42#channel=f30ef17b61f384&origin=http%3A%2F%2Fwww.ubitennis.com,get,2895,57,,https://staticxx.facebook.com/connect/xd_arbiter/r/lY4eZXm_YWu.js?version=42#channel=f30ef17b61f384&origin=http%3A%2F%2Fwww.ubitennis.com,window.name,2017-12-16 02:54:10.080,fb_xdm_frame_https,fb_xdm_frame_https,18,True,
2,,,,,,,,,,{},0,1_028048bbce3f7816a5f1277ac3ac2372d6607581a77a4bfb7a1873ab.json__2,A@https://staticxx.facebook.com/connect/xd_arbiter/r/lY4eZXm_YWu.js?version=42#channel=f30ef17b61f384&origin=http%3A%2F%2Fwww.ubitennis.com:57:2781\nx@https://staticxx.facebook.com/connect/xd_arbiter/r/lY4eZXm_YWu.js?version=42#channel=f30ef17b61f384&origin=http%3A%2F%2Fwww.ubitennis.com:55:3028\nw@https://staticxx.facebook.com/connect/xd_arbiter/r/lY4eZXm_YWu.js?version=42#channel=f30ef17b61f384&origin=http%3A%2F%2Fwww.ubitennis.com:55:931\na/<@https://staticxx.facebook.com/connect/xd_arbiter/r/lY4eZXm_YWu.js?version=42#channel=f30ef17b61f384&origin=http%3A%2F%2Fwww.ubitennis.com:57:2353\na@https://staticxx.facebook.com/connect/xd_arbiter/r/lY4eZXm_YWu.js?version=42#channel=f30ef17b61f384&origin=http%3A%2F%2Fwww.ubitennis.com:57:114\nrequire@https://staticxx.facebook.com/connect/xd_arbiter/r/lY4eZXm_YWu.js?version=42#channel=f30ef17b61f384&origin=http%3A%2F%2Fwww.ubitennis.com:36:610\n@https://staticxx.facebook.com/connect/xd_arbiter/r/lY4eZXm_YWu.js?version=42#channel=f30ef17b61f384&origin=http%3A%2F%2Fwww.ubitennis.com:57:3019,1,1_028048bbce3f7816a5f1277ac3ac2372d6607581a77a4bfb7a1873ab.json,A,True,https://staticxx.facebook.com/connect/xd_arbiter/r/lY4eZXm_YWu.js?version=42#channel=f30ef17b61f384&origin=http%3A%2F%2Fwww.ubitennis.com,get,2781,57,,https://staticxx.facebook.com/connect/xd_arbiter/r/lY4eZXm_YWu.js?version=42#channel=f30ef17b61f384&origin=http%3A%2F%2Fwww.ubitennis.com,window.document.cookie,2017-12-16 02:54:10.086,,,0,True,
3,,,,,,,,,,{},0,1_028048bbce3f7816a5f1277ac3ac2372d6607581a77a4bfb7a1873ab.json__3,,1,1_028048bbce3f7816a5f1277ac3ac2372d6607581a77a4bfb7a1873ab.json,x,True,https://staticxx.facebook.com/connect/xd_arbiter/r/lY4eZXm_YWu.js?version=42#channel=f30ef17b61f384&origin=http%3A%2F%2Fwww.ubitennis.com,get,156,49,,https://staticxx.facebook.com/connect/xd_arbiter/r/lY4eZXm_YWu.js?version=42#channel=f30ef17b61f384&origin=http%3A%2F%2Fwww.ubitennis.com,window.navigator.userAgent,2017-12-16 02:54:10.088,Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0,Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0,68,True,
4,,,,,,,,,,{},0,1_0401c74e1e381c6f4ebd5ca99102f1529c9a843360d6c9211525136f.json__0,,1,1_0401c74e1e381c6f4ebd5ca99102f1529c9a843360d6c9211525136f.json,ra/<,True,https://cas.us.criteo.com/delivery/r/afr.php?did=5a34c73ff17390f0eaeb591979874b00&z=WjTHPwAHZYkKT7BIAAcskrRh4Qozh3b-c-mtZg&u=%7C7J5NcLNwKWZvhHazrdQ0r3pEybQM2VrhNSue519M%2FnU%3D%7C&c1=M5BADJe1UR3zJ2HNju9b10FggySKKMK0AoYTtPDcqDnSIQIZUQPlDupK--OP2eR-eNGQ46cgN3mwCl5UMg4IstlvomsUbHEHUzImPBAbL0KpTFeMsdEkBo28MAQVY_79HvMen3pU9pjoRxbnxk_AxatU3fdvCPtFY7Wzui5q962zi71J5i_HHNmYi7XbHxLl1v3NLOEqWiI-3QfHE1byzwOhuyge44QAJfUpukDSr4X723xUoquihjIy6b6D_yU9AsLHIIxKQk64_ES4G8moUw5dbt7SG3KWRhyjzAZW5acfRwaX8v33UzaCSZKj4O0XffzJaDiMmsprtAOP0J4xHPtfZqvurKt_x3z5y83mK1o&ct0=https://adclick.g.doubleclick.net/aclk%3Fsa%3Dl%26ai%3DCcqaRP8c0WonLHcjgvgKS2ZyADO7lmPBNsu23nZ0BwI23ARABIABgyQaCARdjYS1wdWItNTc4NzU5MjQ4Mzc2Njc2MKABrN3-6APIAQngAgCoAwGqBMUBT9DGnU9Xf5zpWjsp7PXxVDLu7mhvsOzx8jjeTb-wk_FUQNpBqVd4QxwydKBkX31VemFtAuP1QMeGjoHagpA44JfU11OU46ZLmBKcADPeCDg8kDPJvowA7EbbZ6gvml2aRO7nKo1LHNbLoGTBvP6gmhnbhVqThagbrECDM6qxbcRiiWobTKajDG8KeWma5flmrMZiQe5Lu3cyX_WMmu36IIP2lojiMZaZvgiE_ncYb24UZCKxrORb0gO54t1XHQFwzDnBRPXgBAGABufvkeKYhIzL9gGgBiGoB6a-G9gHANIIBQiAYRAB8ggbYWR4LXN1YnN5bi0wOTI1MDI4NTk2NjIxNjE3%26num%3D1%26sig%3DAOD64_1ibrZmc1pVLttz4doqoDmSHOvwXg%26client%3Dca-pub-5787592483766760%26adurl%3D,get,306,25,,https://ajax.googleapis.com/ajax/libs/webfont/1.6.26/webfont.js,window.navigator.userAgent,2017-12-16 07:12:07.104,Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0,Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0,68,True,


In [23]:
len(df)

9234

In [27]:
# Some location and script url cleansing. Nice idea from 2018_09_biskit1_mordax__canvas_fingerprinting
df['location_domain'] = df.location.apply(extract_domain)
df['script_domain'] = df.script_url.apply(extract_domain)
df['location_base_url'] = df.location.apply(parse_base_url)
#Reduced dataframe = Rdf
Rdf = df[IMP_COLUMNS]
Rdf.head()

Unnamed: 0,arguments,in_iframe,location,operation,script_url,symbol,time_stamp,value_1000,location_domain,script_domain,location_base_url
0,{},True,https://staticxx.facebook.com/connect/xd_arbiter/r/lY4eZXm_YWu.js?version=42#channel=f30ef17b61f384&origin=http%3A%2F%2Fwww.ubitennis.com,get,https://staticxx.facebook.com/connect/xd_arbiter/r/lY4eZXm_YWu.js?version=42#channel=f30ef17b61f384&origin=http%3A%2F%2Fwww.ubitennis.com,window.name,2017-12-16 02:54:10.079,fb_xdm_frame_https,facebook.com,facebook.com,staticxx.facebook.com
1,{},True,https://staticxx.facebook.com/connect/xd_arbiter/r/lY4eZXm_YWu.js?version=42#channel=f30ef17b61f384&origin=http%3A%2F%2Fwww.ubitennis.com,get,https://staticxx.facebook.com/connect/xd_arbiter/r/lY4eZXm_YWu.js?version=42#channel=f30ef17b61f384&origin=http%3A%2F%2Fwww.ubitennis.com,window.name,2017-12-16 02:54:10.080,fb_xdm_frame_https,facebook.com,facebook.com,staticxx.facebook.com
2,{},True,https://staticxx.facebook.com/connect/xd_arbiter/r/lY4eZXm_YWu.js?version=42#channel=f30ef17b61f384&origin=http%3A%2F%2Fwww.ubitennis.com,get,https://staticxx.facebook.com/connect/xd_arbiter/r/lY4eZXm_YWu.js?version=42#channel=f30ef17b61f384&origin=http%3A%2F%2Fwww.ubitennis.com,window.document.cookie,2017-12-16 02:54:10.086,,facebook.com,facebook.com,staticxx.facebook.com
3,{},True,https://staticxx.facebook.com/connect/xd_arbiter/r/lY4eZXm_YWu.js?version=42#channel=f30ef17b61f384&origin=http%3A%2F%2Fwww.ubitennis.com,get,https://staticxx.facebook.com/connect/xd_arbiter/r/lY4eZXm_YWu.js?version=42#channel=f30ef17b61f384&origin=http%3A%2F%2Fwww.ubitennis.com,window.navigator.userAgent,2017-12-16 02:54:10.088,Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0,facebook.com,facebook.com,staticxx.facebook.com
4,{},True,https://cas.us.criteo.com/delivery/r/afr.php?did=5a34c73ff17390f0eaeb591979874b00&z=WjTHPwAHZYkKT7BIAAcskrRh4Qozh3b-c-mtZg&u=%7C7J5NcLNwKWZvhHazrdQ0r3pEybQM2VrhNSue519M%2FnU%3D%7C&c1=M5BADJe1UR3zJ2HNju9b10FggySKKMK0AoYTtPDcqDnSIQIZUQPlDupK--OP2eR-eNGQ46cgN3mwCl5UMg4IstlvomsUbHEHUzImPBAbL0KpTFeMsdEkBo28MAQVY_79HvMen3pU9pjoRxbnxk_AxatU3fdvCPtFY7Wzui5q962zi71J5i_HHNmYi7XbHxLl1v3NLOEqWiI-3QfHE1byzwOhuyge44QAJfUpukDSr4X723xUoquihjIy6b6D_yU9AsLHIIxKQk64_ES4G8moUw5dbt7SG3KWRhyjzAZW5acfRwaX8v33UzaCSZKj4O0XffzJaDiMmsprtAOP0J4xHPtfZqvurKt_x3z5y83mK1o&ct0=https://adclick.g.doubleclick.net/aclk%3Fsa%3Dl%26ai%3DCcqaRP8c0WonLHcjgvgKS2ZyADO7lmPBNsu23nZ0BwI23ARABIABgyQaCARdjYS1wdWItNTc4NzU5MjQ4Mzc2Njc2MKABrN3-6APIAQngAgCoAwGqBMUBT9DGnU9Xf5zpWjsp7PXxVDLu7mhvsOzx8jjeTb-wk_FUQNpBqVd4QxwydKBkX31VemFtAuP1QMeGjoHagpA44JfU11OU46ZLmBKcADPeCDg8kDPJvowA7EbbZ6gvml2aRO7nKo1LHNbLoGTBvP6gmhnbhVqThagbrECDM6qxbcRiiWobTKajDG8KeWma5flmrMZiQe5Lu3cyX_WMmu36IIP2lojiMZaZvgiE_ncYb24UZCKxrORb0gO54t1XHQFwzDnBRPXgBAGABufvkeKYhIzL9gGgBiGoB6a-G9gHANIIBQiAYRAB8ggbYWR4LXN1YnN5bi0wOTI1MDI4NTk2NjIxNjE3%26num%3D1%26sig%3DAOD64_1ibrZmc1pVLttz4doqoDmSHOvwXg%26client%3Dca-pub-5787592483766760%26adurl%3D,get,https://ajax.googleapis.com/ajax/libs/webfont/1.6.26/webfont.js,window.navigator.userAgent,2017-12-16 07:12:07.104,Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0,criteo.com,googleapis.com,cas.us.criteo.com


Now that I have the dataframe in a more manageable form, I am looking for scripts/instances where information about Browser plugins is sought. `navigator.plugins[SOME PLUGIN]` is used to validate if a given plugin is enabled on a browser or not. [This](http://www.howtocreate.co.uk/wrongWithIE/?chapter=navigator.plugins) tells me that `navigator.mimeTypes` is a similar function. If navigator.plugin and navigator.mimeTypes are called together multiple times with different arguments they can be used for fingerprinting.

In [70]:
df_plugins = Rdf[Rdf.symbol.str.contains('navigator.mimeTypes|navigator.plugins')]
#write_csv(DATA_DIR,'df_plugins.csv',df_plugins)
#Locally write CSV, sometimes it is easier to slice and dice the data in excel. :)

## Dataframe with rows which queried plugin information
Let us see what these scripts are and what are the domains where they are hosted

In [37]:
Scripts_Calling_Plugins = df_plugins.script_url.unique()
Scripts_Calling_Plugins


In [71]:
# Lets see which script is being used the most
df_plugins['script_url'].value_counts()


https://mc.yandex.ru/metrika/watch.js                                                                                                                                                                                                                                                                                                  70
https://www.google-analytics.com/analytics.js                                                                                                                                                                                                                                                                                          62
https://securepubads.g.doubleclick.net/gpt/pubads_impl_170.js                                                                                                                                                                                                                                                                          58
http://sta

A lot of interesting scripts have come up. Some of these should be safe like those from google-analytics. fingerprint.js is definitely a fingerpringint script. What is "https://mc.yandex.ru/metrika/watch.js" ?

In [43]:
res_df = df_plugins[df_plugins.script_url.str.contains('https://mc.yandex.ru/metrika/watch.js')]
res_df

Unnamed: 0,arguments,in_iframe,location,operation,script_url,symbol,time_stamp,value_1000,location_domain,script_domain,location_base_url
793,{},False,https://www.vjav.com/tags/masturbating-958/,get,https://mc.yandex.ru/metrika/watch.js,window.navigator.plugins[Shockwave Flash].name,2017-12-16 03:35:09.284,Shockwave Flash,vjav.com,yandex.ru,www.vjav.com
797,{},False,https://www.vjav.com/tags/masturbating-958/,get,https://mc.yandex.ru/metrika/watch.js,window.navigator.plugins[Shockwave Flash].version,2017-12-16 03:35:09.288,28.0.0.126,vjav.com,yandex.ru,www.vjav.com
798,{},False,https://www.vjav.com/tags/masturbating-958/,get,https://mc.yandex.ru/metrika/watch.js,window.navigator.plugins[Shockwave Flash].description,2017-12-16 03:35:09.288,Shockwave Flash 28.0 r0,vjav.com,yandex.ru,www.vjav.com
799,{},False,https://www.vjav.com/tags/masturbating-958/,get,https://mc.yandex.ru/metrika/watch.js,window.navigator.plugins[Shockwave Flash].name,2017-12-16 03:35:09.288,Shockwave Flash,vjav.com,yandex.ru,www.vjav.com
800,{},False,https://www.vjav.com/tags/masturbating-958/,get,https://mc.yandex.ru/metrika/watch.js,window.navigator.plugins[Shockwave Flash].version,2017-12-16 03:35:09.288,28.0.0.126,vjav.com,yandex.ru,www.vjav.com
801,{},False,https://www.vjav.com/tags/masturbating-958/,get,https://mc.yandex.ru/metrika/watch.js,window.navigator.plugins[Shockwave Flash].description,2017-12-16 03:35:09.288,Shockwave Flash 28.0 r0,vjav.com,yandex.ru,www.vjav.com
802,{},False,https://www.vjav.com/tags/masturbating-958/,get,https://mc.yandex.ru/metrika/watch.js,window.navigator.plugins[Shockwave Flash].filename,2017-12-16 03:35:09.288,libflashplayer.so,vjav.com,yandex.ru,www.vjav.com
803,{},False,https://www.vjav.com/tags/masturbating-958/,get,https://mc.yandex.ru/metrika/watch.js,window.navigator.mimeTypes[application/futuresplash].type,2017-12-16 03:35:09.289,application/futuresplash,vjav.com,yandex.ru,www.vjav.com
804,{},False,https://www.vjav.com/tags/masturbating-958/,get,https://mc.yandex.ru/metrika/watch.js,window.navigator.mimeTypes[application/futuresplash].description,2017-12-16 03:35:09.289,FutureSplash Player,vjav.com,yandex.ru,www.vjav.com
805,{},False,https://www.vjav.com/tags/masturbating-958/,get,https://mc.yandex.ru/metrika/watch.js,window.navigator.mimeTypes[application/futuresplash].suffixes,2017-12-16 03:35:09.289,spl,vjav.com,yandex.ru,www.vjav.com


Let us check how does it look when grouped by location

In [52]:
Gres = res_df.groupby('location')
Gres.describe()

Unnamed: 0_level_0,arguments,arguments,arguments,arguments,arguments,arguments,in_iframe,in_iframe,in_iframe,in_iframe,in_iframe,in_iframe,operation,operation,operation,operation,operation,operation,script_url,script_url,script_url,script_url,script_url,script_url,symbol,symbol,symbol,symbol,symbol,symbol,time_stamp,time_stamp,time_stamp,time_stamp,time_stamp,time_stamp,value_1000,value_1000,value_1000,value_1000,value_1000,value_1000,location_domain,location_domain,location_domain,location_domain,location_domain,location_domain,script_domain,script_domain,script_domain,script_domain,script_domain,script_domain,location_base_url,location_base_url,location_base_url,location_base_url,location_base_url,location_base_url
Unnamed: 0_level_1,count,unique,top,freq,first,last,count,unique,top,freq,first,last,count,unique,top,freq,first,last,count,unique,top,freq,first,last,count,unique,top,freq,first,last,count,unique,top,freq,first,last,count,unique,top,freq,first,last,count,unique,top,freq,first,last,count,unique,top,freq,first,last,count,unique,top,freq,first,last
location,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2,Unnamed: 22_level_2,Unnamed: 23_level_2,Unnamed: 24_level_2,Unnamed: 25_level_2,Unnamed: 26_level_2,Unnamed: 27_level_2,Unnamed: 28_level_2,Unnamed: 29_level_2,Unnamed: 30_level_2,Unnamed: 31_level_2,Unnamed: 32_level_2,Unnamed: 33_level_2,Unnamed: 34_level_2,Unnamed: 35_level_2,Unnamed: 36_level_2,Unnamed: 37_level_2,Unnamed: 38_level_2,Unnamed: 39_level_2,Unnamed: 40_level_2,Unnamed: 41_level_2,Unnamed: 42_level_2,Unnamed: 43_level_2,Unnamed: 44_level_2,Unnamed: 45_level_2,Unnamed: 46_level_2,Unnamed: 47_level_2,Unnamed: 48_level_2,Unnamed: 49_level_2,Unnamed: 50_level_2,Unnamed: 51_level_2,Unnamed: 52_level_2,Unnamed: 53_level_2,Unnamed: 54_level_2,Unnamed: 55_level_2,Unnamed: 56_level_2,Unnamed: 57_level_2,Unnamed: 58_level_2,Unnamed: 59_level_2,Unnamed: 60_level_2
https://www.newchic.com/vintage-dresses-c-3664/,23,1,{},23,,,23,1,False,23,,,23,1,get,23,,,23,1,https://mc.yandex.ru/metrika/watch.js,23,,,23,10,window.navigator.plugins[Shockwave Flash].name,3,,,23,7,2017-12-16 14:09:31.515000,10,2017-12-16 14:09:31.512000,2017-12-16 14:09:46.543000,23,9,Shockwave Flash,5,,,23,1,newchic.com,23,,,23,1,yandex.ru,23,,,23,1,www.newchic.com,23,,
https://www.vjav.com/tags/masturbating-958/,34,1,{},34,,,34,1,False,34,,,34,1,get,34,,,34,1,https://mc.yandex.ru/metrika/watch.js,34,,,34,10,window.navigator.plugins[Shockwave Flash].name,5,,,34,8,2017-12-16 03:35:09.375000,8,2017-12-16 03:35:09.284000,2017-12-16 03:35:09.376000,34,9,Shockwave Flash,8,,,34,1,vjav.com,34,,,34,1,yandex.ru,34,,,34,1,www.vjav.com,34,,
https://zona.mobi/movies/gadkii-ya-3,13,1,{},13,,,13,1,False,13,,,13,1,get,13,,,13,1,https://mc.yandex.ru/metrika/watch.js,13,,,13,10,window.navigator.plugins[Shockwave Flash].name,2,,,13,4,2017-12-16 05:29:38.176000,5,2017-12-16 05:29:38.172000,2017-12-16 05:29:38.177000,13,9,Shockwave Flash,3,,,13,1,zona.mobi,13,,,13,1,yandex.ru,13,,,13,1,zona.mobi,13,,


So this script always asks for same 10 symbols which we can see below. This talks only about flash player and wants to know the name, version, description etc. Can we fingerprint using this? Yes but should not be very unique. What other things can this script be collecting??

In [55]:
res_df['symbol'].value_counts()

window.navigator.plugins[Shockwave Flash].name                           10
window.navigator.plugins[Shockwave Flash].description                    9 
window.navigator.plugins[Shockwave Flash].version                        9 
window.navigator.mimeTypes[application/x-shockwave-flash].description    6 
window.navigator.mimeTypes[application/futuresplash].type                6 
window.navigator.mimeTypes[application/futuresplash].description         6 
window.navigator.mimeTypes[application/x-shockwave-flash].suffixes       6 
window.navigator.plugins[Shockwave Flash].filename                       6 
window.navigator.mimeTypes[application/x-shockwave-flash].type           6 
window.navigator.mimeTypes[application/futuresplash].suffixes            6 
Name: symbol, dtype: int64

Now I am little confused. On one hand this script is asking a lot of information about the Flash players on client side, but this could be because of the kind of sites it is hosted on and they require precise knowledge about flashplayers to render the content correctly. It could though also be doing fingerprinting by collecting this information. Who knows! :(

I took a look at other scripts which we found above and found an interesting one: http://dthq3mor50viz.cloudfront.net/zbajck9faU.js. This is a JS tracker for Snowplow which is a web tracking service. [Here](https://github.com/snowplow/snowplow/wiki/1-General-parameters-for-the-Javascript-tracker) is a documentation of the script. Yay! we found another script which does fingerprinting. We can add this to our list of `hs-analytics`, `/akam/` and `fingeprint.js`. This analysis got us another fingerprinting script (answer to Q1) which is definitely using plugin information (answer to Q2). 


## Browser fingerprinting using fingerprint.js
fingerprint.j2 or fingerprint2.js are common javascript libraries used for fingerprinting. They query a bunch of features from the client and produces a hashcode using murmur hash functions based on the query results. This assigns a unique (almost) hash to each device and can be used to track client across websites. 

Here I am interested in looking at all calls of fingerprint2.js. I want to understand what all arguments and values are associated with calls to fingerprint2.js. Can I infer a pattern with such calls and filter the calls to fingerprint2.js without explicitly looking for it?

In [#34](https://github.com/mozilla/overscripted/issues/34), Sarah has suggested to use "Cwm fjordbank glyphs vext quiz" in argument_0, but I have found that the above string is a panagram and can be used in other scripts as well and not just fingerprint.js. Therefore I have looked for "fingerprint" in the URL column of the dataset. 

In [30]:
FingerPrint = Rdf[Rdf.script_url.str.contains('fingerprint')]
#T = df[df.argument_0.str.contains('Cwm fjordbank glyphs vext quiz')]
FingerPrint

Unnamed: 0,arguments,in_iframe,location,operation,script_url,symbol,time_stamp,value_1000,location_domain,script_domain,location_base_url
2660,{},False,http://www.lemonde.fr/service/licence_et_droits_de_reproduction.html,get,http://s1.lemde.fr/medias/web/e6a5b4d902b44812df0af2f4d94778d1/js/lib/fingerprint.js,window.navigator.userAgent,2017-12-16 17:01:59.862,Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0,lemonde.fr,lemde.fr,www.lemonde.fr
2661,{},False,http://www.lemonde.fr/service/licence_et_droits_de_reproduction.html,get,http://s1.lemde.fr/medias/web/e6a5b4d902b44812df0af2f4d94778d1/js/lib/fingerprint.js,window.navigator.language,2017-12-16 17:01:59.862,en-US,lemonde.fr,lemde.fr,www.lemonde.fr
2662,{},False,http://www.lemonde.fr/service/licence_et_droits_de_reproduction.html,get,http://s1.lemde.fr/medias/web/e6a5b4d902b44812df0af2f4d94778d1/js/lib/fingerprint.js,window.screen.colorDepth,2017-12-16 17:01:59.862,24,lemonde.fr,lemde.fr,www.lemonde.fr
2663,{},False,http://www.lemonde.fr/service/licence_et_droits_de_reproduction.html,get,http://s1.lemde.fr/medias/web/e6a5b4d902b44812df0af2f4d94778d1/js/lib/fingerprint.js,window.sessionStorage,2017-12-16 17:01:59.862,{},lemonde.fr,lemde.fr,www.lemonde.fr
2664,{},False,http://www.lemonde.fr/service/licence_et_droits_de_reproduction.html,get,http://s1.lemde.fr/medias/web/e6a5b4d902b44812df0af2f4d94778d1/js/lib/fingerprint.js,window.localStorage,2017-12-16 17:01:59.862,"{""alerte_tracking"":""{\""data\"":true,\""timeout\"":1544979718260}""}",lemonde.fr,lemde.fr,www.lemonde.fr
2665,{},False,http://www.lemonde.fr/service/licence_et_droits_de_reproduction.html,get,http://s1.lemde.fr/medias/web/e6a5b4d902b44812df0af2f4d94778d1/js/lib/fingerprint.js,window.navigator.platform,2017-12-16 17:01:59.864,Linux x86_64,lemonde.fr,lemde.fr,www.lemonde.fr
2666,{},False,http://www.lemonde.fr/service/licence_et_droits_de_reproduction.html,get,http://s1.lemde.fr/medias/web/e6a5b4d902b44812df0af2f4d94778d1/js/lib/fingerprint.js,window.navigator.doNotTrack,2017-12-16 17:01:59.865,unspecified,lemonde.fr,lemde.fr,www.lemonde.fr
2667,{},False,http://www.lemonde.fr/service/licence_et_droits_de_reproduction.html,get,http://s1.lemde.fr/medias/web/e6a5b4d902b44812df0af2f4d94778d1/js/lib/fingerprint.js,window.navigator.appName,2017-12-16 17:01:59.865,Netscape,lemonde.fr,lemde.fr,www.lemonde.fr
2668,{},False,http://www.lemonde.fr/service/licence_et_droits_de_reproduction.html,get,http://s1.lemde.fr/medias/web/e6a5b4d902b44812df0af2f4d94778d1/js/lib/fingerprint.js,window.navigator.appName,2017-12-16 17:01:59.865,Netscape,lemonde.fr,lemde.fr,www.lemonde.fr
2669,{},False,http://www.lemonde.fr/service/licence_et_droits_de_reproduction.html,get,http://s1.lemde.fr/medias/web/e6a5b4d902b44812df0af2f4d94778d1/js/lib/fingerprint.js,window.navigator.userAgent,2017-12-16 17:01:59.866,Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0,lemonde.fr,lemde.fr,www.lemonde.fr


If you see the table above and look at the timestamps, almost all calls are made around same time and it is querying a bunch of attributes to produce a hash. Looking at the calls above and comparing it with counts in the dataset [here](https://github.com/mozilla/overscripted/blob/master/data_prep/symbol_counts.csv) I see that symbols like `window.navigator.doNotTrack`, `window.navigator.plugins[Shockwave Flash].length`, `window.navigator.mimeTypes[application/futuresplash].suffixes`	are fairly rare when compared to `window.sessionStorage` or `window.navigator.userAgent`. I want to test if I just filter on rare symbols can I catch `fingerprint.js` calls? Hypothesis is that these rare calls to symbols is only done by fingerprinting scripts.

In [67]:
sessionStorage = Rdf[Rdf.symbol == "window.sessionStorage"]
doNotTrack = Rdf[Rdf.symbol == "window.navigator.doNotTrack"]
ShockWaveLength = Rdf[Rdf.symbol.str.contains("[Shockwave Flash].length")]
FuturesplashSuffixes = Rdf[Rdf.symbol == "window.navigator.mimeTypes[application/futuresplash].suffixes"]
FuturesplashType = Rdf[Rdf.symbol == "window.navigator.mimeTypes[application/futuresplash].type"]
print("FingerPrint: ", len(FingerPrint))
print("sessionStorage: ",len(sessionStorage))
print("ShockWaveLength: ",len(ShockWaveLength))
print("doNotTrack: ",len(doNotTrack))
print("FuturesplashSuffixes: ",len(FuturesplashSuffixes))
print("FuturesplashType: ",len(FuturesplashType))

FingerPrint:  19
sessionStorage:  236
ShockWaveLength:  78
doNotTrack:  21
FuturesplashSuffixes:  9
FuturesplashType:  13


As expected `sessionStorage` is pretty common followed by `ShockWaveLength`. The count reduces a lot for `FingerPrint`, `doNotTrack` and `FuturesplashSuffixes`. I am interested to know which all scripts use `doNotTrack` and `FuturesplashSuffixes` 

### Scripts with call to doNotTrack

In [59]:
print(doNotTrack.script_url.unique())
print(len(doNotTrack.script_url.unique()))

['http://dthq3mor50viz.cloudfront.net/zbajck9faU.js'
 'https://cdn.krxd.net/ctjs/controltag.js.c3e8e6311e44dfc4f051e4a261784fa1'
 'http://s1.lemde.fr/medias/web/e6a5b4d902b44812df0af2f4d94778d1/js/lib/fingerprint.js'
 'http://www.gap.com/akam/10/6e136b78'
 'https://c.disquscdn.com/next/embed/lounge.bundle.8d07a4869c3ec17ee1881ae6bd353027.js'
 'https://script.crazyegg.com/pages/scripts/0014/1290.js?420406'
 'http://media1.break.com/campaigns/global_lib/defy-prebid/0.31.0/defy_prebid.js'
 'https://script.crazyegg.com/pages/scripts/0032/8588.js?420389'
 'https://script.crazyegg.com/pages/scripts/0013/0568.js?420385'
 'https://g.alicdn.com/secdev/sufei_data/3.2.2/index.js']
10


In [60]:
print(doNotTrack.script_domain.unique())
print(len(doNotTrack.script_domain.unique()))

['cloudfront.net' 'krxd.net' 'lemde.fr' 'gap.com' 'disquscdn.com'
 'crazyegg.com' 'break.com' 'alicdn.com']
8


There are 9 unique `script_urls` in the sample dataset and 8 unique `script_domains`. This feature has caught *fingerprint.js* and *akam* scripts and have also suggested other scripts which can be potentially doing some kind of fingerprinting. I would need to do some more background search on the kind of domains which we see here. Some of these like cloudfront.net are CDNs and can be overlooked.

In [47]:
doNotTrack.groupby(['script_url']).describe()

Unnamed: 0_level_0,arguments,arguments,arguments,arguments,arguments,arguments,in_iframe,in_iframe,in_iframe,in_iframe,in_iframe,in_iframe,location,location,location,location,location,location,operation,operation,operation,operation,operation,operation,symbol,symbol,symbol,symbol,symbol,symbol,time_stamp,time_stamp,time_stamp,time_stamp,time_stamp,time_stamp,value_1000,value_1000,value_1000,value_1000,value_1000,value_1000,location_domain,location_domain,location_domain,location_domain,location_domain,location_domain,script_domain,script_domain,script_domain,script_domain,script_domain,script_domain,location_base_url,location_base_url,location_base_url,location_base_url,location_base_url,location_base_url
Unnamed: 0_level_1,count,unique,top,freq,first,last,count,unique,top,freq,first,last,count,unique,top,freq,first,last,count,unique,top,freq,first,last,count,unique,top,freq,first,last,count,unique,top,freq,first,last,count,unique,top,freq,first,last,count,unique,top,freq,first,last,count,unique,top,freq,first,last,count,unique,top,freq,first,last
script_url,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2,Unnamed: 22_level_2,Unnamed: 23_level_2,Unnamed: 24_level_2,Unnamed: 25_level_2,Unnamed: 26_level_2,Unnamed: 27_level_2,Unnamed: 28_level_2,Unnamed: 29_level_2,Unnamed: 30_level_2,Unnamed: 31_level_2,Unnamed: 32_level_2,Unnamed: 33_level_2,Unnamed: 34_level_2,Unnamed: 35_level_2,Unnamed: 36_level_2,Unnamed: 37_level_2,Unnamed: 38_level_2,Unnamed: 39_level_2,Unnamed: 40_level_2,Unnamed: 41_level_2,Unnamed: 42_level_2,Unnamed: 43_level_2,Unnamed: 44_level_2,Unnamed: 45_level_2,Unnamed: 46_level_2,Unnamed: 47_level_2,Unnamed: 48_level_2,Unnamed: 49_level_2,Unnamed: 50_level_2,Unnamed: 51_level_2,Unnamed: 52_level_2,Unnamed: 53_level_2,Unnamed: 54_level_2,Unnamed: 55_level_2,Unnamed: 56_level_2,Unnamed: 57_level_2,Unnamed: 58_level_2,Unnamed: 59_level_2,Unnamed: 60_level_2
http://dthq3mor50viz.cloudfront.net/zbajck9faU.js,1,1,{},1,,,1,1,False,1,,,1,1,http://www.ufc.ca/media/Submission-of-the-Week-Mickey-Gall-vs-Mike-Jackson,1,,,1,1,get,1,,,1,1,window.navigator.doNotTrack,1,,,1,1,2017-12-16 00:17:13.993000,1,2017-12-16 00:17:13.993000,2017-12-16 00:17:13.993000,1,1,unspecified,1,,,1,1,ufc.ca,1,,,1,1,cloudfront.net,1,,,1,1,www.ufc.ca,1,,
http://media1.break.com/campaigns/global_lib/defy-prebid/0.31.0/defy_prebid.js,2,1,{},2,,,2,1,False,2,,,2,1,http://www.addictinggames.com/help/parents.jsp,2,,,2,1,get,2,,,2,1,window.navigator.doNotTrack,2,,,2,1,2017-12-16 22:55:09.661000,2,2017-12-16 22:55:09.661000,2017-12-16 22:55:09.661000,2,1,unspecified,2,,,2,1,addictinggames.com,2,,,2,1,break.com,2,,,2,1,www.addictinggames.com,2,,
http://s1.lemde.fr/medias/web/e6a5b4d902b44812df0af2f4d94778d1/js/lib/fingerprint.js,1,1,{},1,,,1,1,False,1,,,1,1,http://www.lemonde.fr/service/licence_et_droits_de_reproduction.html,1,,,1,1,get,1,,,1,1,window.navigator.doNotTrack,1,,,1,1,2017-12-16 17:01:59.865000,1,2017-12-16 17:01:59.865000,2017-12-16 17:01:59.865000,1,1,unspecified,1,,,1,1,lemonde.fr,1,,,1,1,lemde.fr,1,,,1,1,www.lemonde.fr,1,,
http://www.gap.com/akam/10/6e136b78,2,1,{},2,,,2,1,False,2,,,2,1,http://www.gap.com/browse/category.do?cid=1096402&sop=true,2,,,2,1,get,2,,,2,1,window.navigator.doNotTrack,2,,,2,1,2017-12-16 22:51:16.026000,2,2017-12-16 22:51:16.026000,2017-12-16 22:51:16.026000,2,1,unspecified,2,,,2,1,gap.com,2,,,2,1,gap.com,2,,,2,1,www.gap.com,2,,
https://c.disquscdn.com/next/embed/lounge.bundle.8d07a4869c3ec17ee1881ae6bd353027.js,6,1,{},6,,,6,1,True,6,,,6,3,https://disqus.com/embed/comments/?base=default&f=sofifa&t_u=http%3A%2F%2Fsofifa.com%2Fplayer%2F231443&t_e=O.%20Demb%C3%A9l%C3%A9%20-%20Player%20-%20SoFIFA&t_d=Ousmane%20Demb%C3%A9l%C3%A9%20FIFA%2018%20Dec%2014%2C%202017%20SoFIFA&t_t=O.%20Demb%C3%A9l%C3%A9%20-%20Player%20-%20SoFIFA&s_o=default&l=en#version=8e0609b122e4529350708e7c7b2d6a12,2,,,6,1,get,6,,,6,1,window.navigator.doNotTrack,6,,,6,3,2017-12-16 02:09:35.686000,2,2017-12-16 02:09:35.686000,2017-12-16 19:38:17.239000,6,1,unspecified,6,,,6,1,disqus.com,6,,,6,1,disquscdn.com,6,,,6,1,disqus.com,6,,
https://cdn.krxd.net/ctjs/controltag.js.c3e8e6311e44dfc4f051e4a261784fa1,5,1,{},5,,,5,2,False,3,,,5,2,https://www.autoscout24.it/auto/audi/audi-a4/,3,,,5,1,get,5,,,5,1,window.navigator.doNotTrack,5,,,5,5,2017-12-16 23:00:21.948000,1,2017-12-16 07:32:34.471000,2017-12-16 23:00:21.948000,5,1,unspecified,5,,,5,2,autoscout24.it,3,,,5,1,krxd.net,5,,,5,2,www.autoscout24.it,3,,
https://g.alicdn.com/secdev/sufei_data/3.2.2/index.js,1,1,{},1,,,1,1,False,1,,,1,1,https://list.tmall.com/search_product.htm?abbucket=&pic_detail=1&active=1&acm=lb-zebra-22355-807833.1003.4.764609&sort=s&industryCatId=50106419&spm=a220m.1000858.1000721.5.8puO4R&abtest=&pos=13&cat=52588013&from=sn_1_cat&style=g&search_condition=55&scm=1003.4.lb-zebra-22355-807833.OTHER_147327151378711_764609&aldid=226900,1,,,1,1,get,1,,,1,1,window.navigator.doNotTrack,1,,,1,1,2017-12-16 17:23:58.982000,1,2017-12-16 17:23:58.982000,2017-12-16 17:23:58.982000,1,1,unspecified,1,,,1,1,tmall.com,1,,,1,1,alicdn.com,1,,,1,1,list.tmall.com,1,,
https://script.crazyegg.com/pages/scripts/0013/0568.js?420385,1,1,{},1,,,1,1,False,1,,,1,1,https://www.pwc.com/gx/en/industries/transportation-logistics.html,1,,,1,1,get,1,,,1,1,window.navigator.doNotTrack,1,,,1,1,2017-12-16 01:54:07.386000,1,2017-12-16 01:54:07.386000,2017-12-16 01:54:07.386000,1,1,unspecified,1,,,1,1,pwc.com,1,,,1,1,crazyegg.com,1,,,1,1,www.pwc.com,1,,
https://script.crazyegg.com/pages/scripts/0014/1290.js?420406,1,1,{},1,,,1,1,False,1,,,1,1,https://www.backcountry.com/travel?show=all&p=gender%3Amale%7Cattr_age%3Aadult&nf=12,1,,,1,1,get,1,,,1,1,window.navigator.doNotTrack,1,,,1,1,2017-12-16 22:22:45.461000,1,2017-12-16 22:22:45.461000,2017-12-16 22:22:45.461000,1,1,unspecified,1,,,1,1,backcountry.com,1,,,1,1,crazyegg.com,1,,,1,1,www.backcountry.com,1,,
https://script.crazyegg.com/pages/scripts/0032/8588.js?420389,1,1,{},1,,,1,1,False,1,,,1,1,https://www.rakuten.com.tw/search/%E5%B0%8F%E7%B1%B3/,1,,,1,1,get,1,,,1,1,window.navigator.doNotTrack,1,,,1,1,2017-12-16 05:05:13.394000,1,2017-12-16 05:05:13.394000,2017-12-16 05:05:13.394000,1,1,unspecified,1,,,1,1,rakuten.com.tw,1,,,1,1,crazyegg.com,1,,,1,1,www.rakuten.com.tw,1,,


### Scripts with call to FuturesplashSuffixes

In [62]:
print(FuturesplashSuffixes.script_url.unique())
print(len(FuturesplashSuffixes.script_url.unique()))

['http://dthq3mor50viz.cloudfront.net/zbajck9faU.js'
 'https://mc.yandex.ru/metrika/watch.js'
 'http://s1.lemde.fr/medias/web/e6a5b4d902b44812df0af2f4d94778d1/js/lib/fingerprint.js'
 'http://www.gap.com/akam/10/6e136b78']
4


In [63]:
print(FuturesplashSuffixes.script_domain.unique())
print(len(FuturesplashSuffixes.script_domain.unique()))

['cloudfront.net' 'yandex.ru' 'lemde.fr' 'gap.com']
4


There are 4 unique script_urls in the sample dataset and 4 unique script_domains. This feature has caught fingerprint.js and akam scripts and have also suggested other scripts like `https://mc.yandex.ru/metrika/watch.js` and `http://dthq3mor50viz.cloudfront.net/zbajck9faU.js`. This feature is very selective and useful because it has restricted the dataset to only suspicious scripts. I would definitely like to keep this feature in my heuristic to detect browser attribute fingerprinting.

In [64]:
FuturesplashSuffixes.groupby(['script_url']).describe()

Unnamed: 0_level_0,arguments,arguments,arguments,arguments,arguments,arguments,in_iframe,in_iframe,in_iframe,in_iframe,in_iframe,in_iframe,location,location,location,location,location,location,operation,operation,operation,operation,operation,operation,symbol,symbol,symbol,symbol,symbol,symbol,time_stamp,time_stamp,time_stamp,time_stamp,time_stamp,time_stamp,value_1000,value_1000,value_1000,value_1000,value_1000,value_1000,location_domain,location_domain,location_domain,location_domain,location_domain,location_domain,script_domain,script_domain,script_domain,script_domain,script_domain,script_domain,location_base_url,location_base_url,location_base_url,location_base_url,location_base_url,location_base_url
Unnamed: 0_level_1,count,unique,top,freq,first,last,count,unique,top,freq,first,last,count,unique,top,freq,first,last,count,unique,top,freq,first,last,count,unique,top,freq,first,last,count,unique,top,freq,first,last,count,unique,top,freq,first,last,count,unique,top,freq,first,last,count,unique,top,freq,first,last,count,unique,top,freq,first,last
script_url,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2,Unnamed: 22_level_2,Unnamed: 23_level_2,Unnamed: 24_level_2,Unnamed: 25_level_2,Unnamed: 26_level_2,Unnamed: 27_level_2,Unnamed: 28_level_2,Unnamed: 29_level_2,Unnamed: 30_level_2,Unnamed: 31_level_2,Unnamed: 32_level_2,Unnamed: 33_level_2,Unnamed: 34_level_2,Unnamed: 35_level_2,Unnamed: 36_level_2,Unnamed: 37_level_2,Unnamed: 38_level_2,Unnamed: 39_level_2,Unnamed: 40_level_2,Unnamed: 41_level_2,Unnamed: 42_level_2,Unnamed: 43_level_2,Unnamed: 44_level_2,Unnamed: 45_level_2,Unnamed: 46_level_2,Unnamed: 47_level_2,Unnamed: 48_level_2,Unnamed: 49_level_2,Unnamed: 50_level_2,Unnamed: 51_level_2,Unnamed: 52_level_2,Unnamed: 53_level_2,Unnamed: 54_level_2,Unnamed: 55_level_2,Unnamed: 56_level_2,Unnamed: 57_level_2,Unnamed: 58_level_2,Unnamed: 59_level_2,Unnamed: 60_level_2
http://dthq3mor50viz.cloudfront.net/zbajck9faU.js,1,1,{},1,,,1,1,False,1,,,1,1,http://www.ufc.ca/media/Submission-of-the-Week-Mickey-Gall-vs-Mike-Jackson,1,,,1,1,get,1,,,1,1,window.navigator.mimeTypes[application/futuresplash].suffixes,1,,,1,1,2017-12-16 00:17:13.996000,1,2017-12-16 00:17:13.996000,2017-12-16 00:17:13.996000,1,1,spl,1,,,1,1,ufc.ca,1,,,1,1,cloudfront.net,1,,,1,1,www.ufc.ca,1,,
http://s1.lemde.fr/medias/web/e6a5b4d902b44812df0af2f4d94778d1/js/lib/fingerprint.js,1,1,{},1,,,1,1,False,1,,,1,1,http://www.lemonde.fr/service/licence_et_droits_de_reproduction.html,1,,,1,1,get,1,,,1,1,window.navigator.mimeTypes[application/futuresplash].suffixes,1,,,1,1,2017-12-16 17:01:59.869000,1,2017-12-16 17:01:59.869000,2017-12-16 17:01:59.869000,1,1,spl,1,,,1,1,lemonde.fr,1,,,1,1,lemde.fr,1,,,1,1,www.lemonde.fr,1,,
http://www.gap.com/akam/10/6e136b78,1,1,{},1,,,1,1,False,1,,,1,1,http://www.gap.com/browse/category.do?cid=1096402&sop=true,1,,,1,1,get,1,,,1,1,window.navigator.mimeTypes[application/futuresplash].suffixes,1,,,1,1,2017-12-16 22:51:15.961000,1,2017-12-16 22:51:15.961000,2017-12-16 22:51:15.961000,1,1,spl,1,,,1,1,gap.com,1,,,1,1,gap.com,1,,,1,1,www.gap.com,1,,
https://mc.yandex.ru/metrika/watch.js,6,1,{},6,,,6,1,False,6,,,6,3,https://www.vjav.com/tags/masturbating-958/,3,,,6,1,get,6,,,6,1,window.navigator.mimeTypes[application/futuresplash].suffixes,6,,,6,6,2017-12-16 05:29:38.177000,1,2017-12-16 03:35:09.289000,2017-12-16 14:09:46.541000,6,1,spl,6,,,6,3,vjav.com,3,,,6,1,yandex.ru,6,,,6,3,www.vjav.com,3,,


I did little internet search on `https://mc.yandex.ru/metrika/watch.js` and found this [script](https://github.com/ValdikSS/p0f-mtu-script/blob/master/index.php). Looking at the script I can see that it produces a hash and matches it with stored hashes to identify a client. The information used is fethed from `https://mc.yandex.ru/metrika/watch.js`. Therefore this can also potentially be used as fingerprinting script. 



Finally `FuturesplashSuffixes` identifies 4 scripts all of which are tracking scripts. Therefore this is a very powerful feature and we can run this on a bigger dataset to find more scripts. This should also go in our heuristic to detect attempts at browser attribute fingerprinting. 

We are walking towards building a heuristic by collecting such example and therefore answering my Q3 using this analysis.