This notebook collects string operations that I frequently run on this dataset.

For the purposes of issue 36, they may not be the most efficient, but they should be enough to get started, and we can dig in further if needed.

For the dask demo, it provides plenty of examples of applying functions and string operations with dask.

In [2]:
import dask.dataframe as dd
from dask.distributed import Client

Client()


Client
  Scheduler: tcp://127.0.0.1:35921
  Dashboard: http://127.0.0.1:8787/status
Cluster
  Workers: 4
  Cores: 12
  Memory: 33.35 GB


In [4]:
df = dd.read_parquet(
    'sample_10percent_value_1000_only.parquet', engine='pyarrow',
    columns=['argument_0', 'func_name', 'symbol', 'location', 'script_url']
)
df.head()

Unnamed: 0,argument_0,func_name,symbol,location,script_url
0,,w.fastXDM.Client,window.name,https://vk.com/widget_comments.php?app=2297596...,https://vk.com/js/api/xdm.js?1449919642
1,,w.fastXDM.Client,window.name,https://vk.com/widget_comments.php?app=2297596...,https://vk.com/js/api/xdm.js?1449919642
2,,,window.navigator.userAgent,https://vk.com/widget_comments.php?app=2297596...,https://vk.com/js/al/aes_light.js?592436914
3,,,window.navigator.userAgent,https://pos.baidu.com/s?hei=70&wid=670&di=u313...,https://cpro.baidustatic.com/cpro/ui/noexpire/...
4,,Fd.iterate,window.document.cookie,http://serienjunkies.org/smilf/smilf-season-1-...,https://apis.google.com/js/plusone.js?_=151338...


Some common string processing tasks:
* extracting domains
* extracting the end of a URL path
* building a "grouping" string
* splitting the symbol column
* finding substrings in strings

In [10]:
from urllib.parse import urlparse
from openwpm_utils.domain import get_ps_plus_1

EMPTY_STRING = 'EMPTY_STRING'


def get_netloc(x):
    p = urlparse(x)
    val = p.netloc
    if len(val) == 0:
        val = EMPTY_STRING
    return val


def get_path(x):
    p = urlparse(x)
    val = p.path
    if len(val) == 0:
        val = EMPTY_STRING
    return val


def get_end_of_path(x):
    # str.split always returns at least one element, so [-1] is safe.
    val = x.split('/')[-1]
    if len(val) == 0:
        val = EMPTY_STRING
    return val


def get_clean_script(x):
    p = urlparse(x)
    return f'{p.netloc}{p.path}'
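
As a quick sanity check, the helpers can be tried on a single URL in plain Python before being mapped over the dask dataframe. The sketch below redefines compact equivalents of `get_netloc`, `get_path` and `get_end_of_path` so that it runs on its own; the sample URL is the first `script_url` shown in the table above, and the empty-input cases are illustrative assumptions.

```python
from urllib.parse import urlparse

EMPTY_STRING = 'EMPTY_STRING'


def get_netloc(x):
    # Fall back to the sentinel when the URL has no network location.
    return urlparse(x).netloc or EMPTY_STRING


def get_path(x):
    # Fall back to the sentinel when the URL has no path.
    return urlparse(x).path or EMPTY_STRING


def get_end_of_path(x):
    # Last '/'-separated segment, or the sentinel if that segment is empty.
    return x.split('/')[-1] or EMPTY_STRING


url = 'https://vk.com/js/api/xdm.js?1449919642'
print(get_netloc(url))                 # vk.com
print(get_path(url))                   # /js/api/xdm.js
print(get_end_of_path(get_path(url)))  # xdm.js

# Edge cases: no netloc at all, and a path that ends in a slash.
print(get_netloc(''))                  # EMPTY_STRING
print(get_end_of_path('/js/'))         # EMPTY_STRING
```

The query string (`?1449919642`) is dropped because `urlparse` keeps it in a separate `query` field.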

#### Build aggregator

In [11]:
df['script_netloc'] = df.script_url.apply(get_netloc, meta='O')
df['script_path'] = df.script_url.apply(get_path, meta='O')
df['script_path_end'] = df.script_path.apply(get_end_of_path, meta='O')
df['agg'] = df.script_netloc + '||' + df.script_path_end + '||' + df.func_name
df.head()

Unnamed: 0,argument_0,func_name,symbol,location,script_url,script_netloc,script_path,script_path_end,agg
0,,w.fastXDM.Client,window.name,https://vk.com/widget_comments.php?app=2297596...,https://vk.com/js/api/xdm.js?1449919642,vk.com,/js/api/xdm.js,xdm.js,vk.com||xdm.js||w.fastXDM.Client
1,,w.fastXDM.Client,window.name,https://vk.com/widget_comments.php?app=2297596...,https://vk.com/js/api/xdm.js?1449919642,vk.com,/js/api/xdm.js,xdm.js,vk.com||xdm.js||w.fastXDM.Client
2,,,window.navigator.userAgent,https://vk.com/widget_comments.php?app=2297596...,https://vk.com/js/al/aes_light.js?592436914,vk.com,/js/al/aes_light.js,aes_light.js,vk.com||aes_light.js||
3,,,window.navigator.userAgent,https://pos.baidu.com/s?hei=70&wid=670&di=u313...,https://cpro.baidustatic.com/cpro/ui/noexpire/...,cpro.baidustatic.com,/cpro/ui/noexpire/js/4.0.0/adClosefeedbackUpgr...,adClosefeedbackUpgrade.min.js,cpro.baidustatic.com||adClosefeedbackUpgrade.m...
4,,Fd.iterate,window.document.cookie,http://serienjunkies.org/smilf/smilf-season-1-...,https://apis.google.com/js/plusone.js?_=151338...,apis.google.com,/js/plusone.js,plusone.js,apis.google.com||plusone.js||Fd.iterate


In [12]:
%%time
n_unique_aggs = df.agg.nunique().compute()

CPU times: user 27 s, sys: 8.27 s, total: 35.3 s
Wall time: 2min 42s


In [13]:
n_unique_aggs

185084
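
To make the "grouping" string concrete, here is a minimal plain-Python sketch of the same idea: join the script netloc, the end of the script path and the function name with `||`, then count distinct keys with a set. The helper `build_agg` is a hypothetical name introduced for illustration, the three rows are copied from the first rows of the output above, and the empty `func_name` for the third row is an assumption based on how that row is displayed.

```python
from urllib.parse import urlparse

EMPTY_STRING = 'EMPTY_STRING'

# (script_url, func_name) pairs taken from the first three rows shown above.
# The empty func_name in the last row is assumed from the displayed output.
rows = [
    ('https://vk.com/js/api/xdm.js?1449919642', 'w.fastXDM.Client'),
    ('https://vk.com/js/api/xdm.js?1449919642', 'w.fastXDM.Client'),
    ('https://vk.com/js/al/aes_light.js?592436914', ''),
]


def build_agg(script_url, func_name):
    # Hypothetical helper: netloc || last path segment || function name.
    p = urlparse(script_url)
    netloc = p.netloc or EMPTY_STRING
    path_end = p.path.split('/')[-1] or EMPTY_STRING
    return '||'.join([netloc, path_end, func_name])


# A set keeps one copy of each key, so its size is the number of unique aggs.
unique_aggs = {build_agg(url, func) for url, func in rows}
print(sorted(unique_aggs))
print(len(unique_aggs))
```

On the full dataset this counting is what `df.agg.nunique().compute()` does across partitions; the set-based version is only meant to show what is being counted.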

#### Count unique location domains

In [14]:
df = dd.read_parquet(
    'sample_10percent_value_1000_only.parquet', engine='pyarrow',
    columns=['location', 'script_url']
)
df.head()

Unnamed: 0,location,script_url
0,https://vk.com/widget_comments.php?app=2297596...,https://vk.com/js/api/xdm.js?1449919642
1,https://vk.com/widget_comments.php?app=2297596...,https://vk.com/js/api/xdm.js?1449919642
2,https://vk.com/widget_comments.php?app=2297596...,https://vk.com/js/al/aes_light.js?592436914
3,https://pos.baidu.com/s?hei=70&wid=670&di=u313...,https://cpro.baidustatic.com/cpro/ui/noexpire/...
4,http://serienjunkies.org/smilf/smilf-season-1-...,https://apis.google.com/js/plusone.js?_=151338...


In [15]:
df['location_domain'] = df.location.apply(get_ps_plus_1, meta='O')
df['script_domain'] = df.script_url.apply(get_ps_plus_1, meta='O')
df.head()

Unnamed: 0,location,script_url,location_domain,script_domain
0,https://vk.com/widget_comments.php?app=2297596...,https://vk.com/js/api/xdm.js?1449919642,vk.com,vk.com
1,https://vk.com/widget_comments.php?app=2297596...,https://vk.com/js/api/xdm.js?1449919642,vk.com,vk.com
2,https://vk.com/widget_comments.php?app=2297596...,https://vk.com/js/al/aes_light.js?592436914,vk.com,vk.com
3,https://pos.baidu.com/s?hei=70&wid=670&di=u313...,https://cpro.baidustatic.com/cpro/ui/noexpire/...,baidu.com,baidustatic.com
4,http://serienjunkies.org/smilf/smilf-season-1-...,https://apis.google.com/js/plusone.js?_=151338...,serienjunkies.org,google.com


In [16]:
%%time
print(df.location_domain.nunique().compute())

11335
CPU times: user 28.8 s, sys: 11.3 s, total: 40.1 s
Wall time: 3min 46s


In [17]:
%%time
print(df.script_domain.nunique().compute())

11641
CPU times: user 6.93 s, sys: 5.28 s, total: 12.2 s
Wall time: 2min


#### Looking for strings

In [18]:
df = dd.read_parquet(
    'sample_10percent_value_1000_only.parquet', engine='pyarrow',
    columns=['argument_0', 'script_url']
)
df.head()

Unnamed: 0,argument_0,script_url
0,,https://vk.com/js/api/xdm.js?1449919642
1,,https://vk.com/js/api/xdm.js?1449919642
2,,https://vk.com/js/al/aes_light.js?592436914
3,,https://cpro.baidustatic.com/cpro/ui/noexpire/...
4,,https://apis.google.com/js/plusone.js?_=151338...


In [22]:
%%time
print(df[df.argument_0.str.contains('modernizr')].script_url.nunique().compute())

1374
CPU times: user 1.39 s, sys: 487 ms, total: 1.88 s
Wall time: 4.37 s
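
The filter-then-count pattern above can be sketched in plain Python as follows. The rows and the helper `count_urls_containing` are hypothetical and not taken from the dataset. One difference worth keeping in mind: pandas-style `str.contains` treats its pattern as a regular expression by default, while the `in` operator used here is a literal substring test. For a plain word such as `'modernizr'` the two agree.

```python
# Hypothetical (argument_0, script_url) rows, for illustration only.
rows = [
    ('{"modernizr": true}', 'https://a.example/one.js'),
    ('modernizr', 'https://a.example/one.js'),
    ('modernizr', 'https://b.example/two.js'),
    ('other value', 'https://c.example/three.js'),
    (None, 'https://d.example/four.js'),
]


def count_urls_containing(rows, needle):
    # Keep the script URLs whose argument contains the needle (skipping
    # missing arguments), then count the distinct URLs.
    urls = {url for arg, url in rows if arg is not None and needle in arg}
    return len(urls)


n_urls = count_urls_containing(rows, 'modernizr')
print(n_urls)
```

Two of the three matching rows share a script URL, so they are counted once.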


#### Splitting symbol

In [23]:
df = dd.read_parquet(
    'sample_10percent_value_1000_only.parquet', engine='pyarrow',
    columns=['argument_0', 'func_name', 'symbol', 'location', 'script_url']
)
df.head()

Unnamed: 0,argument_0,func_name,symbol,location,script_url
0,,w.fastXDM.Client,window.name,https://vk.com/widget_comments.php?app=2297596...,https://vk.com/js/api/xdm.js?1449919642
1,,w.fastXDM.Client,window.name,https://vk.com/widget_comments.php?app=2297596...,https://vk.com/js/api/xdm.js?1449919642
2,,,window.navigator.userAgent,https://vk.com/widget_comments.php?app=2297596...,https://vk.com/js/al/aes_light.js?592436914
3,,,window.navigator.userAgent,https://pos.baidu.com/s?hei=70&wid=670&di=u313...,https://cpro.baidustatic.com/cpro/ui/noexpire/...
4,,Fd.iterate,window.document.cookie,http://serienjunkies.org/smilf/smilf-season-1-...,https://apis.google.com/js/plusone.js?_=151338...


In [28]:
df['symbol_parts'] = df.symbol.str.split('.')
df['symbol_0'] = df.symbol_parts.str.get(0)
df['symbol_1'] = df.symbol_parts.str.get(1)
df['symbol_2'] = df.symbol_parts.str.get(2)
df.head()

Unnamed: 0,argument_0,func_name,symbol,location,script_url,symbol_parts,symbol_0,symbol_1,symbol_2
0,,w.fastXDM.Client,window.name,https://vk.com/widget_comments.php?app=2297596...,https://vk.com/js/api/xdm.js?1449919642,"[window, name]",window,name,
1,,w.fastXDM.Client,window.name,https://vk.com/widget_comments.php?app=2297596...,https://vk.com/js/api/xdm.js?1449919642,"[window, name]",window,name,
2,,,window.navigator.userAgent,https://vk.com/widget_comments.php?app=2297596...,https://vk.com/js/al/aes_light.js?592436914,"[window, navigator, userAgent]",window,navigator,userAgent
3,,,window.navigator.userAgent,https://pos.baidu.com/s?hei=70&wid=670&di=u313...,https://cpro.baidustatic.com/cpro/ui/noexpire/...,"[window, navigator, userAgent]",window,navigator,userAgent
4,,Fd.iterate,window.document.cookie,http://serienjunkies.org/smilf/smilf-season-1-...,https://apis.google.com/js/plusone.js?_=151338...,"[window, document, cookie]",window,document,cookie
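
The split-and-pick behaviour shown in the table can be sketched in plain Python like this. `split_symbol` is a hypothetical helper, and `None` stands in for the missing value that `.str.get(i)` produces when a symbol has fewer parts than requested.

```python
def split_symbol(symbol, n_parts=3):
    # Split on '.' and keep at most n_parts pieces.
    parts = symbol.split('.')[:n_parts]
    # Pad with None so every symbol yields exactly n_parts values.
    return parts + [None] * (n_parts - len(parts))


for symbol in ['window.name', 'window.navigator.userAgent']:
    symbol_0, symbol_1, symbol_2 = split_symbol(symbol)
    print(symbol, '->', symbol_0, symbol_1, symbol_2)
```

For `window.name` the third value is `None`, which corresponds to the empty `symbol_2` cells in the first two rows above.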


In [30]:
%%time
print(df[df.symbol_1 == 'fillText'].func_name.nunique().compute())

302
CPU times: user 27.1 s, sys: 7.46 s, total: 34.5 s
Wall time: 2min 7s
