# Understanding Browser Fingerprinting

Browser fingerprinting is a way of recognizing a browser from the attributes it exposes, such as the fonts it has installed or even something as trivial as the way it renders an emoji. It has pros and cons: when an individual needs to be recognized in order to prevent foul play, fingerprinting works well, but when the goal is merely advertising, it harms the individual's privacy. Several projects have studied which features matter most for recognizing a specific browser. The ones I went through are Panopticlick, AmIUnique.org, slido, and Erik Flood and Joel Karlsson, Browser fingerprinting, 2012.



In [2]:
from dask.distributed import Client,progress
Client()


Client: Scheduler: tcp://127.0.0.1:3358, Dashboard: http://127.0.0.1:8787
Cluster: Workers: 4, Cores: 4, Memory: 8.50 GB


In [3]:
import dask.dataframe as dd
from dask.diagnostics import ProgressBar


In [4]:
import tldextract

#DATA_DIR = 'sample_10percent_value_1000_only.parquet'

def extract_domain(url):
    """Use tldextract to return the base domain from a url"""
    try:
        extracted = tldextract.extract(url)
        return '{}.{}'.format(extracted.domain, extracted.suffix)
    except Exception as e:
        return 'ERROR'

# Feature Evaluation
Based on what I learned, I decided to dive further into the features `location`, `script_url`, `symbol` and `value_1000`. The initial analysis showed that additional features can be extracted from the `script_url` attribute in combination with the `symbol` and value attributes. Many rows share the same `location` URL but have different `symbol` values, and these can uncover important elements such as the font used in canvas fingerprinting, audio fingerprinting details, and whether a script might have been blocked by an ad blocker.



In [5]:
#df = dd.read_parquet('sample_10percent_value_1000_only.parquet', engine='pyarrow',columns=['argument_0', 'func_name','in_iframe', 'location', 'operation','script_url', 'symbol', 'time_stamp', 'value_1000'])

Because of memory limits, I analyse only a single part of the dataset:

In [6]:
df = dd.read_parquet(
    # A forward slash avoids the invalid '\p' escape and works on every platform.
    'sample_10percent_value_1000_only.parquet/part.10.parquet',
    engine='pyarrow',
    columns=['argument_0', 'func_name', 'in_iframe', 'location', 'operation',
             'script_url', 'symbol', 'time_stamp', 'value_1000'])

In [12]:
#df.head()
df.head()


Unnamed: 0,argument_0,func_name,in_iframe,location,operation,script_url,symbol,time_stamp,value_1000
0,,__nr_require<.loader<,False,https://www.jumia.ma/soin-du-visage/,get,https://www.jumia.ma/soin-du-visage/,window.navigator.userAgent,2017-12-16 13:26:28.589,Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko...
1,,__nr_require<[1]<,False,https://www.jumia.ma/soin-du-visage/,get,https://www.jumia.ma/soin-du-visage/,window.localStorage,2017-12-16 13:26:28.594,{}
2,__nr_flags,__nr_require<[1]<,False,https://www.jumia.ma/soin-du-visage/,call,https://www.jumia.ma/soin-du-visage/,window.Storage.getItem,2017-12-16 13:26:28.595,
3,,__nr_require<[14]<,False,https://www.jumia.ma/soin-du-visage/,get,https://www.jumia.ma/soin-du-visage/,window.navigator.userAgent,2017-12-16 13:26:28.598,Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko...
4,,n,False,https://www.jumia.ma/soin-du-visage/,get,https://www.jumia.ma/scripts/common_desktop.99...,window.localStorage,2017-12-16 13:26:29.267,{}


In [7]:
import pandas as pd
import numpy as np
import matplotlib 
import os
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE
from sklearn.linear_model import RidgeCV, LassoCV, Ridge, Lasso




I used the string processing functions from the repository:

In [8]:
from urllib.parse import urlparse

EMPTY_STRING = 'EMPTY_STRING'


def get_netloc(x):
    p = urlparse(x)
    val = p.netloc
    if len(val) == 0:
        val = EMPTY_STRING
    return val


def get_path(x):
    p = urlparse(x)
    val = p.path
    if len(val) == 0:
        val = EMPTY_STRING
    return val


def get_end_of_path(x):
    # str.split always returns at least one element, so [-1] is safe.
    val = x.split('/')[-1]
    if len(val) == 0:
        val = EMPTY_STRING
    return val


def get_clean_script(x):
    p = urlparse(x)
    return f'{p.netloc}{p.path}'


To extract the features, I split the location, the script URL and the symbols into sub-parts. This lets me group locations together with the features collected on them.

In [9]:
df['script_netloc'] = df.script_url.apply(get_netloc, meta=('O'))
df['script_path'] = df.script_url.apply(get_path, meta=('O'))
df['script_path_end'] = df.script_path.apply(get_end_of_path, meta=('O'))
df['agg'] = df.script_netloc + '||' + df.script_path_end + '||' + df.func_name
df['location_domain'] = df.location.apply(extract_domain, meta=('x', 'str'))

df['location_base_url'] = df.location.apply(get_netloc, meta=('x', 'str'))
df['script_domain'] = df.script_url.apply(extract_domain, meta=('x', 'str'))




In [10]:
df.head()

Unnamed: 0,argument_0,func_name,in_iframe,location,operation,script_url,symbol,time_stamp,value_1000,script_netloc,script_path,script_path_end,agg,location_domain,location_base_url,script_domain
0,,__nr_require<.loader<,False,https://www.jumia.ma/soin-du-visage/,get,https://www.jumia.ma/soin-du-visage/,window.navigator.userAgent,2017-12-16 13:26:28.589,Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko...,www.jumia.ma,/soin-du-visage/,EMPTY_STRING,www.jumia.ma||EMPTY_STRING||__nr_require<.loader<,jumia.ma,www.jumia.ma,jumia.ma
1,,__nr_require<[1]<,False,https://www.jumia.ma/soin-du-visage/,get,https://www.jumia.ma/soin-du-visage/,window.localStorage,2017-12-16 13:26:28.594,{},www.jumia.ma,/soin-du-visage/,EMPTY_STRING,www.jumia.ma||EMPTY_STRING||__nr_require<[1]<,jumia.ma,www.jumia.ma,jumia.ma
2,__nr_flags,__nr_require<[1]<,False,https://www.jumia.ma/soin-du-visage/,call,https://www.jumia.ma/soin-du-visage/,window.Storage.getItem,2017-12-16 13:26:28.595,,www.jumia.ma,/soin-du-visage/,EMPTY_STRING,www.jumia.ma||EMPTY_STRING||__nr_require<[1]<,jumia.ma,www.jumia.ma,jumia.ma
3,,__nr_require<[14]<,False,https://www.jumia.ma/soin-du-visage/,get,https://www.jumia.ma/soin-du-visage/,window.navigator.userAgent,2017-12-16 13:26:28.598,Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko...,www.jumia.ma,/soin-du-visage/,EMPTY_STRING,www.jumia.ma||EMPTY_STRING||__nr_require<[14]<,jumia.ma,www.jumia.ma,jumia.ma
4,,n,False,https://www.jumia.ma/soin-du-visage/,get,https://www.jumia.ma/scripts/common_desktop.99...,window.localStorage,2017-12-16 13:26:29.267,{},www.jumia.ma,/scripts/common_desktop.9966544a3e.js,common_desktop.9966544a3e.js,www.jumia.ma||common_desktop.9966544a3e.js||n,jumia.ma,www.jumia.ma,jumia.ma


In [11]:
df['symbol_parts'] = df.symbol.str.split('.')
df['symbol_0'] = df.symbol_parts.str.get(0)
df['symbol_1'] = df.symbol_parts.str.get(1)
df['symbol_2'] = df.symbol_parts.str.get(2)
df.head()

Unnamed: 0,argument_0,func_name,in_iframe,location,operation,script_url,symbol,time_stamp,value_1000,script_netloc,script_path,script_path_end,agg,location_domain,location_base_url,script_domain,symbol_parts,symbol_0,symbol_1,symbol_2
0,,__nr_require<.loader<,False,https://www.jumia.ma/soin-du-visage/,get,https://www.jumia.ma/soin-du-visage/,window.navigator.userAgent,2017-12-16 13:26:28.589,Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko...,www.jumia.ma,/soin-du-visage/,EMPTY_STRING,www.jumia.ma||EMPTY_STRING||__nr_require<.loader<,jumia.ma,www.jumia.ma,jumia.ma,"[window, navigator, userAgent]",window,navigator,userAgent
1,,__nr_require<[1]<,False,https://www.jumia.ma/soin-du-visage/,get,https://www.jumia.ma/soin-du-visage/,window.localStorage,2017-12-16 13:26:28.594,{},www.jumia.ma,/soin-du-visage/,EMPTY_STRING,www.jumia.ma||EMPTY_STRING||__nr_require<[1]<,jumia.ma,www.jumia.ma,jumia.ma,"[window, localStorage]",window,localStorage,
2,__nr_flags,__nr_require<[1]<,False,https://www.jumia.ma/soin-du-visage/,call,https://www.jumia.ma/soin-du-visage/,window.Storage.getItem,2017-12-16 13:26:28.595,,www.jumia.ma,/soin-du-visage/,EMPTY_STRING,www.jumia.ma||EMPTY_STRING||__nr_require<[1]<,jumia.ma,www.jumia.ma,jumia.ma,"[window, Storage, getItem]",window,Storage,getItem
3,,__nr_require<[14]<,False,https://www.jumia.ma/soin-du-visage/,get,https://www.jumia.ma/soin-du-visage/,window.navigator.userAgent,2017-12-16 13:26:28.598,Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko...,www.jumia.ma,/soin-du-visage/,EMPTY_STRING,www.jumia.ma||EMPTY_STRING||__nr_require<[14]<,jumia.ma,www.jumia.ma,jumia.ma,"[window, navigator, userAgent]",window,navigator,userAgent
4,,n,False,https://www.jumia.ma/soin-du-visage/,get,https://www.jumia.ma/scripts/common_desktop.99...,window.localStorage,2017-12-16 13:26:29.267,{},www.jumia.ma,/scripts/common_desktop.9966544a3e.js,common_desktop.9966544a3e.js,www.jumia.ma||common_desktop.9966544a3e.js||n,jumia.ma,www.jumia.ma,jumia.ma,"[window, localStorage]",window,localStorage,


Using a specific location, I looked at which features can be extracted from the symbol column, such as plugins, userAgent, language and other details.
Some arguments contain "Cwm fjordbank glyphs vext quiz", a pangram that several papers associate with canvas fingerprinting. Canvas fingerprinting is reported to identify a browser with high accuracy. On mobile phones, emojis are also considered a significant trait, because the way they are rendered differs between devices.
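A minimal sketch of how rows carrying the canvas test string could be flagged. The helper name `looks_like_canvas_text` and the sample arguments are hypothetical; only the pangram itself comes from the data described above.

```python
# Pangram associated with canvas fingerprinting (observed in argument_0).
CANVAS_PANGRAM = "cwm fjordbank glyphs vext quiz"


def looks_like_canvas_text(argument):
    """Return True if a logged argument contains the canvas test pangram.

    Hypothetical helper: matching is case-insensitive and tolerates
    missing values (None or NaN), which are common in argument_0.
    """
    if not isinstance(argument, str):
        return False
    return CANVAS_PANGRAM in argument.lower()


# Made-up sample arguments, for illustration only.
sample_arguments = [
    "Cwm fjordbank glyphs vext quiz, \U0001F603",
    "__nr_flags",
    None,
]
flags = [looks_like_canvas_text(a) for a in sample_arguments]
print(flags)  # [True, False, False]
```

On the Dask frame, the same function could probably be applied with `df.argument_0.apply(looks_like_canvas_text, meta=('argument_0', 'bool'))`.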

In [38]:
df.loc[df["location"] == "https://www.johnlewis.com/browse/gifts/gift-food-wine-champagne/view-all-hampers/_/N-amr", ("symbol_2")].compute()

0                            NaN
1                        getItem
2                      userAgent
3                            NaN
4                        getItem
5                            NaN
6                        getItem
7                      userAgent
8                            NaN
9                            NaN
10                        cookie
11                     userAgent
12                        cookie
13                        cookie
14                        cookie
15                        cookie
16                        cookie
17                        cookie
18                        cookie
19                     userAgent
20                     userAgent
21                        cookie
22                     userAgent
23                      platform
24      plugins[Shockwave Flash]
25                     userAgent
26                        cookie
27                     userAgent
28                       appName
29                     userAgent
          

Further, using the string operations mentioned in issue_36, we can find significant factors that show how a browser is recognized:
- location: http://ca.puma.com/en_CA/kids
- window.navigator.language: en-US
- window.navigator.plugins[Shockwave Flash].name: Shockwave Flash 28.0 r0
- window.Storage.length: 0
- window.screen.pixelDepth: 24
- window.screen.colorDepth: 24
- window.navigator.platform: Linux x86_64
- window.navigator.doNotTrack: unspecified
- window.navigator.userAgent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0
- time_stamp: 12/16/2017 3:46:19 AM
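
The list above amounts to collapsing the rows logged for one location into a symbol-to-value mapping. A minimal sketch, assuming the rows are available as (location, symbol, value) tuples; the function name `build_profile` is illustrative, and the sample values are copied from the list and tables above.

```python
def build_profile(rows, location):
    """Collapse (location, symbol, value) rows into {symbol: value} for one location.

    Hypothetical helper: when a symbol is read several times,
    the first non-empty value is kept.
    """
    profile = {}
    for row_location, symbol, value in rows:
        if row_location != location or not value:
            continue
        profile.setdefault(symbol, value)
    return profile


# Sample rows; values are taken from the outputs shown earlier.
rows = [
    ("http://ca.puma.com/en_CA/kids", "window.navigator.language", "en-US"),
    ("http://ca.puma.com/en_CA/kids", "window.screen.colorDepth", "24"),
    ("http://ca.puma.com/en_CA/kids", "window.navigator.platform", "Linux x86_64"),
    ("http://ca.puma.com/en_CA/kids", "window.navigator.language", "en-US"),
    ("https://www.jumia.ma/soin-du-visage/", "window.localStorage", "{}"),
]
profile = build_profile(rows, "http://ca.puma.com/en_CA/kids")
for symbol, value in sorted(profile.items()):
    print(f"{symbol}: {value}")
```

Keeping only the first non-empty value is a simplification; a symbol whose value changes between reads would need a list of values instead.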

## Audio fingerprint aspect

OfflineAudioContext/AudioContext
- createOscillator
- createDynamicsCompressor
- destination
- startRendering
- oncomplete


## Canvas fingerprinting aspect
CanvasRenderingContext2D
- font
- getImageData
- fill/fillStyle
- globalCompositeOperation

HTMLCanvasElement
- height, width, style, draggable
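
One way to use the audio and canvas lists above is to tag each logged symbol with the technique it hints at. This is a rough sketch: the interface tuples and the function name `classify_symbol` are assumptions made for illustration, and a symbol touching one of these interfaces is only a hint, not proof of fingerprinting.

```python
# Interface names taken from the lists above; matching on them is a heuristic.
AUDIO_INTERFACES = ("OfflineAudioContext", "AudioContext")
CANVAS_INTERFACES = ("CanvasRenderingContext2D", "HTMLCanvasElement")


def classify_symbol(symbol):
    """Label a symbol such as 'CanvasRenderingContext2D.font' as audio, canvas or other."""
    if not isinstance(symbol, str):
        return "other"
    parts = symbol.split(".")
    # Some symbols start with 'window', others start with the interface name.
    interface = parts[1] if parts[0] == "window" and len(parts) > 1 else parts[0]
    if interface in AUDIO_INTERFACES:
        return "audio"
    if interface in CANVAS_INTERFACES:
        return "canvas"
    return "other"


samples = [
    "CanvasRenderingContext2D.font",
    "OfflineAudioContext.createOscillator",
    "window.navigator.userAgent",
]
labels = {s: classify_symbol(s) for s in samples}
print(labels)
```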


The initial process is to consider a certain location and its corresponding features, such as plugin, font, language and screen size, and then find the entropy of every feature related to that location.

In [19]:
feature_plugin_df = df[df.symbol_2 == 'plugins[Shockwave Flash]']
fp_urls = feature_plugin_df.location.unique().persist()
progress(fp_urls, notebook=False)


[########################################] | 100% Completed | 12.0s

In [20]:
fp_urls = fp_urls.compute()
fp_urls[0:5]


0             https://theporndude.com/989/babesmachine
1    https://ad.doubleclick.net/ddm/adi/N5620.15339...
2    http://www.lazada.com.ph/shop-air-conditioners...
3               http://animefreak.tv/watch/pupa-online
4    http://www.eventim.de/disneys-musical-tarzan-i...
Name: location, dtype: object

In [21]:
len(fp_urls)

192

In [30]:
featur_font_df = df[df.symbol_1 == 'font']
ff_urls = featur_font_df.location.unique().persist()
progress(ff_urls, notebook=False)
print(' # unique font features',len(ff_urls))

[########################################] | 100% Completed |  0.0s # unique font features 22


In [31]:
ffnu_urls = featur_font_df.location.persist()
progress(ffnu_urls, notebook=False)
ffnu_urls = ffnu_urls.compute()
print(' # total font features',len(ffnu_urls))

[########################################] | 100% Completed | 12.3s # total font features 131


In [15]:
ff_urls = ff_urls.compute()
ff_urls[0:5]

0    http://www.eventim.de/disneys-musical-tarzan-i...
1    https://www.hesport.com/matches.html?season=14113
2                  http://www.oregonlive.com/trending/
3    https://w.soundcloud.com/player/?url=https%3A/...
4                https://filmakinesi.org/kategori/spor
Name: location, dtype: object

## Plugins and Fonts
According to some papers, the plugin list together with the list of fonts is very significant for identifying a browser. For example, the iPhone is considered to be among the least affected by browser fingerprinting because it has no Flash support. Desktop machines are therefore considered more vulnerable to this than mobile browsers. Hence we count the locations that use both font and plugin information: there are 16, out of a total of 37234 records in this one parquet file.

In [33]:
togethrfp_urls = set(fp_urls) & \
    set(ffnu_urls) 
print('# of location using plugins and font features', len(togethrfp_urls))

# of location using plugins and font features 16


# Entropy
Entropy measures how unpredictable the value of a categorical feature is, based on how often each value occurs in an attribute: the more evenly the observations are spread over many values, the higher the entropy. It is also used in decision trees to find the right questions to ask. Here, the entropy of the combined plugins and fonts is to be considered.
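
As a minimal sketch, Shannon entropy $H = -\sum_i p_i \log_2 p_i$ can be computed from value counts, where $p_i$ is the relative frequency of the i-th value. The function name `shannon_entropy` and the sample values are made up for illustration; combining two features is done here by pairing their values row by row.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy, in bits, of the observed values of one categorical feature."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


# Made-up observations, for illustration only.
plugins = ["flash", "flash", "none", "none"]
fonts = ["Arial", "Times", "Arial", "Times"]
combined = list(zip(plugins, fonts))  # treat each (plugin, font) pair as one value

print(shannon_entropy(plugins))   # 1.0
print(shannon_entropy(combined))  # 2.0
```

In this toy data the combined feature separates the four observations completely, so its entropy is higher than that of either feature alone.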

# Further Analysis
I have used Excel for most of my analysis because of the size of the file and the memory limitations on my side. That being said, I have mostly focused on the location, symbol and value features, which could be sufficient to identify a browser on the basis of its font type, user agent, date format, platform, and canvas and audio fingerprints. I plan to use entropy to find the feature combination that best identifies a browser. We still have the question of which URLs could be blocked by ad blockers. I decided to consider the in_iframe, location, operation and script_url attributes to scratch the surface of the problem. The heuristic here is to check whether in_iframe is true and whether the script_url of a certain location points to a '.js' file similar to one that was blocked in the past. Can we build a model to predict whether a script would be blocked by an ad blocker?
We could consider the features with the highest entropy to uniquely identify a browser and train the model on the past data.
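
The ad-blocker heuristic described above could be prototyped roughly as follows. This is a sketch under assumptions: `previously_blocked` is a hypothetical set of script file names that were blocked in the past (the data set shown here contains no such label), "similar" is simplified to an exact file-name match, and the function names and example URLs are illustrative.

```python
from urllib.parse import urlparse


def script_file_name(script_url):
    """Return the last path segment of a script URL, e.g. 'ads.js'."""
    return urlparse(script_url).path.split("/")[-1]


def likely_blocked(in_iframe, script_url, previously_blocked):
    """Heuristic: the call runs in an iframe and its .js file name was blocked before."""
    name = script_file_name(script_url)
    return bool(in_iframe) and name.endswith(".js") and name in previously_blocked


# Hypothetical history of blocked script names, for illustration only.
previously_blocked = {"ads.js", "tracker.js"}

results = [
    likely_blocked(True, "https://cdn.example.com/static/ads.js", previously_blocked),
    likely_blocked(False, "https://cdn.example.com/static/ads.js", previously_blocked),
    likely_blocked(True, "https://www.example.com/scripts/app.js", previously_blocked),
]
print(results)  # [True, False, False]
```

A rule like this could serve as a baseline to compare a trained model against.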