In [1]:
import dask.dataframe as dd

In [2]:
data_set = './sample_10percent_value_1000_only.parquet'  # create a variable for the dataset 

TASK: We want to identify the scripts that have stored or created unique IDs, and the scripts that have not.

ANALYSIS: When a site is visited for the first time, it can set cookies that persist for subsequent visits. A website might generate a unique ID for each visitor and store it on the user's machine in a cookie. Because such values are kept in the browser's client-side storage, we look at calls to the storage APIs to discover scripts that may have been setting unique IDs.

Creating unique IDs is a pointer to fingerprinting. However, storing a unique ID does not prove that fingerprinting occurred, and failing to detect unique IDs in storage does not prove that it did not. This task is therefore foundational: it helps us understand what a script is possibly doing. The local storage APIs we consider are: window.localStorage, window.document.cookie, and window.sessionStorage.
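In the browser, document.cookie returns a single string of name=value pairs separated by semicolons, so looking for a candidate ID means splitting that string first. The sketch below is only an illustration of that step: the helper name parse_cookie_string and the sample string are hypothetical and are not taken from the data set.

```python
def parse_cookie_string(cookie_string):
    """Split a document.cookie style string into a {name: value} dict."""
    pairs = {}
    for part in cookie_string.split(";"):
        part = part.strip()
        if not part or "=" not in part:
            continue  # skip empty or malformed fragments
        name, value = part.split("=", 1)
        pairs[name.strip()] = value.strip()
    return pairs


# Hypothetical cookie string, used only for illustration.
sample = "visitor_id=a1b2c3d4e5; theme=dark"
cookies = parse_cookie_string(sample)
print(cookies)
```

Whether a given value is really a unique ID cannot be decided from its shape alone; this only makes the individual values available for inspection.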

SECTION 1: In this section, we will do the following:

a. Take a look at the data 

b. Check number of calls to local Storage

c. Show scripts that made calls to local Storage

d. Show scripts that did not make calls to local Storage.

SECTION 1A: We take a look at the data

In [3]:
ddf = dd.read_parquet(data_set, engine='pyarrow')
ddf.head()

Unnamed: 0,argument_0,argument_1,argument_2,argument_3,argument_4,argument_5,argument_6,argument_7,argument_8,arguments,...,location,operation,script_col,script_line,script_loc_eval,script_url,symbol,time_stamp,value_1000,value_len
0,,,,,,,,,,{},...,https://vk.com/widget_comments.php?app=2297596...,get,1,163,,https://vk.com/js/api/xdm.js?1449919642,window.name,2017-12-16 19:02:31.406,fXDcab74,8
1,,,,,,,,,,{},...,https://vk.com/widget_comments.php?app=2297596...,get,9,164,,https://vk.com/js/api/xdm.js?1449919642,window.name,2017-12-16 19:02:31.407,fXDcab74,8
2,,,,,,,,,,{},...,https://vk.com/widget_comments.php?app=2297596...,get,67,1,,https://vk.com/js/al/aes_light.js?592436914,window.navigator.userAgent,2017-12-16 19:02:31.659,Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko...,68
3,,,,,,,,,,{},...,https://pos.baidu.com/s?hei=70&wid=670&di=u313...,get,1,1,,https://cpro.baidustatic.com/cpro/ui/noexpire/...,window.navigator.userAgent,2017-12-16 00:24:09.355,Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko...,68
4,,,,,,,,,,{},...,http://serienjunkies.org/smilf/smilf-season-1-...,get,32,59,,https://apis.google.com/js/plusone.js?_=151338...,window.document.cookie,2017-12-16 01:24:30.372,_ga=GA1.2.1529583939.1513387469; _gid=GA1.2.17...,288


Fig. 1

In the table in Fig. 1, the "script_url" column identifies the script that made each call, and the "symbol" column identifies the API that was called.

Now to the task: we want to count the calls that scripts made to local storage.

In [4]:
# The local storage APIs we are interested in
stores = ['window.localStorage', 'window.document.cookie', 'window.sessionStorage']

In [21]:
# Define a function 'findID' that reads the parquet file into a Dask DataFrame,
# keeps only the rows whose symbol is one of the local storage APIs,
# and returns the filtered DataFrame for further analysis.

def findID(data):
    ddf = dd.read_parquet(data, engine='pyarrow')
    unique_symbl = ddf.symbol.isin(stores)  # boolean mask: True for local storage calls
    new_table = ddf[unique_symbl]
    return new_table
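To make the filter in findID easier to follow, here is a minimal sketch of the same isin mask on a tiny in-memory frame. It assumes pandas rather than Dask (the two share this part of the API), and the rows and example.com style URLs are hypothetical; only the column names mirror the data set.

```python
import pandas as pd

stores = ['window.localStorage', 'window.document.cookie', 'window.sessionStorage']

# Hypothetical rows; only the column names mirror the data set.
toy = pd.DataFrame({
    'script_url': ['https://a.example/a.js', 'https://a.example/a.js',
                   'https://b.example/b.js', 'https://c.example/c.js'],
    'symbol': ['window.document.cookie', 'window.navigator.userAgent',
               'window.localStorage', 'window.name'],
})

mask = toy.symbol.isin(stores)  # True where the call targets a local storage API
storage_calls = toy[mask]       # keep only those rows
print(storage_calls)
```

Each row is one call, so the filtered frame has one row per local storage call, not one row per script.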

SECTION 1B: Number of calls to each local Storage 

In [6]:
findID(data_set).symbol.value_counts().compute() # number of times calls were made to local Storage

window.document.cookie    3521632
window.localStorage        866712
window.sessionStorage      408549
Name: symbol, dtype: int64

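As a cross-check, the cell below counts the calls for every symbol in the data set. The counts for the three local Storage APIs match those returned by the function above. 
We show only the first 10 rows with head(10) to keep the output compact; all three local Storage APIs appear among them. 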

In [7]:
# Count the calls made to every symbol, without any filtering. This serves as a check
# on the counts returned by findID.

ddf = dd.read_parquet(data_set, engine='pyarrow')
ddf.symbol.value_counts().compute().head(10)

window.document.cookie                                   3521632
window.navigator.userAgent                               1542070
window.Storage.getItem                                   1046105
window.localStorage                                       866712
window.Storage.setItem                                    408566
window.sessionStorage                                     408549
window.Storage.removeItem                                 284856
window.name                                               248318
CanvasRenderingContext2D.fillStyle                        196829
window.navigator.plugins[Shockwave Flash].description     184937
Name: symbol, dtype: int64
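The comparison between the filtered counts and the unfiltered counts can also be made programmatically instead of by eye. The following is a sketch on hypothetical toy data, using pandas as a stand-in for the Dask frame.

```python
import pandas as pd

stores = ['window.localStorage', 'window.document.cookie', 'window.sessionStorage']

# Hypothetical symbols standing in for the 'symbol' column.
toy = pd.DataFrame({'symbol': [
    'window.document.cookie', 'window.document.cookie',
    'window.localStorage', 'window.navigator.userAgent', 'window.name',
]})

# Counts over all symbols, then restricted to the local storage APIs.
all_counts = toy.symbol.value_counts()
expected = all_counts[all_counts.index.isin(stores)]

# Counts computed after filtering the rows first, as findID does.
filtered_counts = toy[toy.symbol.isin(stores)].symbol.value_counts()

# Both routes should agree symbol by symbol.
assert expected.to_dict() == filtered_counts.to_dict()
print(filtered_counts.to_dict())
```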

SECTION 1C: Below we check for scripts that made calls to local Storage, using the function findID defined earlier.

In [32]:
# Scripts that made calls to local storage APIs.  

findID(data_set).script_url.unique().compute() # we used 'unique()' to eliminate duplicates. 

0        https://apis.google.com/js/plusone.js?_=151338...
1        https://assets.adobedtm.com/caacec67651710193d...
2        https://assets.adobedtm.com/caacec67651710193d...
3            https://www.google-analytics.com/analytics.js
4        https://www.google-analytics.com/plugins/ua/li...
5        https://www.canada.ca/etc/designs/canada/wet-b...
6        https://g.alicdn.com/sea/sitenav-global/0.8.0/...
7           https://g.alicdn.com/tb/tracker/3.0.7/index.js
8        https://maniform.world.tmall.com/category-1282...
9        https://g.alicdn.com/sanwant/shop-render/0.0.9...
10            https://g.alicdn.com/alilog/mlog/aplus_v2.js
11       https://g.alicdn.com/tb/tracker/4.0.1/p/index/...
12       https://g.alicdn.com/aliww/web.ww/scripts/webw...
13       https://g.alicdn.com/kissy/k/1.4.2/??io-min.js...
14       https://g.alicdn.com/tbc/??search-suggest/1.3....
15       https://g.alicdn.com/shop/wangpu/1.7.5/??decor...
16       https://g.alicdn.com/secdev/sufei_data/3.2.2/i.

The full result shows that 95,801 unique scripts made calls to local Storage.

SECTION 1D: We now ascertain scripts that did not make calls to local Storage

In [29]:
def other_script(data):
    ddf = dd.read_parquet(data, engine='pyarrow')
    no_calls = ddf[~ddf.symbol.isin(stores)]  # rows whose symbol is not a local storage API
    return no_calls

In [33]:
other_script(data_set).script_url.unique().compute()

0                   https://vk.com/js/api/xdm.js?1449919642
1               https://vk.com/js/al/aes_light.js?592436914
2         https://cpro.baidustatic.com/cpro/ui/noexpire/...
3         https://apis.google.com/_/scs/apps-static/_/js...
4         https://assets.adobedtm.com/caacec67651710193d...
5             https://www.google-analytics.com/analytics.js
6         https://assets.adobedtm.com/caacec67651710193d...
7         https://www.canada.ca/etc/designs/canada/wet-b...
8         https://www.canada.ca/etc/designs/canada/wet-b...
9              https://s0.2mdn.net/879366/Enabler_01_197.js
10           https://g.alicdn.com/kissy/k/1.4.2/seed-min.js
11        https://g.alicdn.com/sea/sitenav-global/0.8.0/...
12           https://g.alicdn.com/tb/tracker/3.0.7/index.js
13        https://g.alicdn.com/sanwant/shop-render/0.0.9...
14        https://g.alicdn.com/mtb/videox/0.1.33/videox-...
15                      https://uaction.alicdn.com/js/ua.js
16             https://g.alicdn.com/alil

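The full result lists 138,814 unique scripts that made at least one call to an API other than local Storage. This is not the same as the set of scripts that never called local Storage: a script that calls both kinds of API passes both filters. For example, https://www.google-analytics.com/analytics.js appears in the output of Section 1C as well as here.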
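One way to isolate the scripts that never touched local Storage is to subtract the set of scripts with at least one storage call from the set of all scripts. The sketch below shows the idea on hypothetical toy rows with pandas; the variable names are illustrative and not part of the notebook.

```python
import pandas as pd

stores = ['window.localStorage', 'window.document.cookie', 'window.sessionStorage']

# Hypothetical rows: a.js calls both a storage API and a non-storage API.
toy = pd.DataFrame({
    'script_url': ['https://a.example/a.js', 'https://a.example/a.js',
                   'https://b.example/b.js', 'https://c.example/c.js'],
    'symbol': ['window.document.cookie', 'window.navigator.userAgent',
               'window.localStorage', 'window.name'],
})

is_storage = toy.symbol.isin(stores)

storage_scripts = set(toy.loc[is_storage, 'script_url'])   # at least one storage call
other_scripts = set(toy.loc[~is_storage, 'script_url'])    # at least one non-storage call

never_storage = set(toy.script_url) - storage_scripts      # no storage call at all
overlap = storage_scripts & other_scripts                  # scripts that pass both filters

print(sorted(never_storage))
print(sorted(overlap))
```

On the Dask frame, the two results of unique().compute() could be turned into sets in the same way before taking the difference.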

SECTION 2: CONCLUSION

1. Adding the number of calls made to each local Storage API gives 4,796,893. window.document.cookie has the
highest number of calls, 3,521,632. 

2. We used 'unique()' to list each script only once and keep things simple. With 'unique()', the 
number of scripts that made local storage calls is 95,801 (SECTION 1C). Without 'unique()', we count rows, 
that is individual calls, so the figure should tally with the total number of times the local storage APIs 
were called. To verify that claim, we do the following:

In [36]:
len(findID(data_set))

4796893

As you can see, the number of rows tallies with the number of times the local Storage APIs were called. 

3. The number of unique scripts that made calls to APIs other than local storage stood at 138,813. If we get rid of 
'unique()', we get the number of such calls:

In [37]:
len(other_script(data_set))

6495974

4. If we add the number of calls made to local storage APIs and the number of calls made to other APIs, we get the
number of rows in the data set. 

In [38]:
total_scripts = 4796893 + 6495974
total_scripts

11292867

So the total number of calls is 11,292,867. The number of rows in the data set is:

In [39]:
len(ddf)

11292867
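The same bookkeeping can be done without hard-coding the two totals, which also makes the distinction between calls (rows) and scripts (unique URLs) explicit. This is a minimal sketch on hypothetical toy rows, again with pandas standing in for Dask.

```python
import pandas as pd

stores = ['window.localStorage', 'window.document.cookie', 'window.sessionStorage']

# Hypothetical rows; each row is one call made by a script.
toy = pd.DataFrame({
    'script_url': ['https://a.example/a.js', 'https://a.example/a.js',
                   'https://b.example/b.js', 'https://c.example/c.js'],
    'symbol': ['window.document.cookie', 'window.navigator.userAgent',
               'window.localStorage', 'window.name'],
})

mask = toy.symbol.isin(stores)
n_storage_calls = int(mask.sum())      # rows that are local storage calls
n_other_calls = int((~mask).sum())     # all remaining rows

# The two masks split the rows, so the counts must add up to the row total.
assert n_storage_calls + n_other_calls == len(toy)

# The number of distinct scripts is a separate, smaller quantity.
n_scripts = toy.script_url.nunique()
print(n_storage_calls, n_other_calls, n_scripts)
```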