## Introduction
This notebook takes the existing string-processing examples that @birdsarah produced in 'analyses/issue_36.ipynb' and re-runs them using **array extensions**.

This notebook contains a series of string operations that I find myself doing frequently on this dataset.

For the purposes of issue 36 they may not be the most efficient, but they should be enough to get started, and we can dig in further if needed.

For the dask demo, this shows plenty of examples of applying functions and string operations using dask.

In [26]:
import dask.dataframe as dd
from dask.distributed import Client
import fletcher as fr
Client()

Client: Scheduler: tcp://127.0.0.1:60056, Dashboard: http://127.0.0.1:60057/status
Cluster: Workers: 4, Cores: 4, Memory: 4.16 GB


In [27]:
df = dd.read_parquet(
    "C:\\Users\\Ayman Hasan\\Desktop\\Outreachy\\sample\\part-00000-34d9b361-ea79-42eb-82ee-9c9f9259c339-c000.snappy.parquet", engine='pyarrow',
    columns=['argument_0', 'func_name', 'symbol', 'location', 'script_url']
)
df.head()


Unnamed: 0,argument_0,func_name,symbol,location,script_url
0,,a/<,window.name,https://staticxx.facebook.com/connect/xd_arbit...,https://staticxx.facebook.com/connect/xd_arbit...
1,,a/<,window.name,https://staticxx.facebook.com/connect/xd_arbit...,https://staticxx.facebook.com/connect/xd_arbit...
2,,A,window.document.cookie,https://staticxx.facebook.com/connect/xd_arbit...,https://staticxx.facebook.com/connect/xd_arbit...
3,,x,window.navigator.userAgent,https://staticxx.facebook.com/connect/xd_arbit...,https://staticxx.facebook.com/connect/xd_arbit...
4,,ra/<,window.navigator.userAgent,https://cas.us.criteo.com/delivery/r/afr.php?d...,https://ajax.googleapis.com/ajax/libs/webfont/...


Some common string processing tasks:
* pulling domains
* pulling end of url
* building "grouping" string
* splitting symbol column
* finding things in strings

In [28]:
from urllib.parse import urlparse
#from openwpm_utils.domain import get_ps_plus_1

EMPTY_STRING = 'EMPTY_STRING'


def get_netloc(x):
    p = urlparse(x)
    val = p.netloc
    if len(val) == 0:
        val = EMPTY_STRING
    return val


def get_path(x):
    p = urlparse(x)
    val = p.path
    if len(val) == 0:
        val = EMPTY_STRING
    return val


def get_end_of_path(x):
    # str.split always returns at least one element, so [-1] is safe
    val = x.split('/')[-1]
    if len(val) == 0:
        val = EMPTY_STRING
    return val


def get_clean_script(x):
    p = urlparse(x)
    return f'{p.netloc}{p.path}'
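
As a quick check of what these helpers return, the sketch below applies the same `urlparse` logic to one script URL from this dataset (rebuilt from the `script_netloc` and `script_path` values in the output further down) and to a bare host with no path. The `or_empty` helper and the bare-host URL are illustrative assumptions, not part of the original notebook.

```python
from urllib.parse import urlparse

EMPTY_STRING = 'EMPTY_STRING'


def or_empty(val):
    """Hypothetical helper: substitute the sentinel for an empty string."""
    return val if val else EMPTY_STRING


url = 'https://ajax.googleapis.com/ajax/libs/webfont/1.6.26/webfont.js'
parsed = urlparse(url)

netloc = or_empty(parsed.netloc)            # host part of the URL
path = or_empty(parsed.path)                # path component, without any query string
path_end = or_empty(path.split('/')[-1])    # last path segment

# A URL without a path parses to an empty path, so the sentinel is used.
bare_path = or_empty(urlparse('https://ajax.googleapis.com').path)

print(netloc, path, path_end, bare_path)
```

The sentinel keeps every derived column non-empty, which makes the concatenated grouping string easier to read when a component is missing.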

#### Build aggregator

In [29]:
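# meta is given as a (name, dtype) tuple describing the resulting Series
df['script_netloc'] = df.script_url.apply(get_netloc, meta=('script_netloc', 'object'))
df['script_path'] = df.script_url.apply(get_path, meta=('script_path', 'object'))
df['script_path_end'] = df.script_path.apply(get_end_of_path, meta=('script_path_end', 'object'))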
df['agg'] = df.script_netloc + '||' + df.script_path_end + '||' + df.func_name
df.head()

Unnamed: 0,argument_0,func_name,symbol,location,script_url,script_netloc,script_path,script_path_end,agg
0,,a/<,window.name,https://staticxx.facebook.com/connect/xd_arbit...,https://staticxx.facebook.com/connect/xd_arbit...,staticxx.facebook.com,/connect/xd_arbiter/r/lY4eZXm_YWu.js,lY4eZXm_YWu.js,staticxx.facebook.com||lY4eZXm_YWu.js||a/<
1,,a/<,window.name,https://staticxx.facebook.com/connect/xd_arbit...,https://staticxx.facebook.com/connect/xd_arbit...,staticxx.facebook.com,/connect/xd_arbiter/r/lY4eZXm_YWu.js,lY4eZXm_YWu.js,staticxx.facebook.com||lY4eZXm_YWu.js||a/<
2,,A,window.document.cookie,https://staticxx.facebook.com/connect/xd_arbit...,https://staticxx.facebook.com/connect/xd_arbit...,staticxx.facebook.com,/connect/xd_arbiter/r/lY4eZXm_YWu.js,lY4eZXm_YWu.js,staticxx.facebook.com||lY4eZXm_YWu.js||A
3,,x,window.navigator.userAgent,https://staticxx.facebook.com/connect/xd_arbit...,https://staticxx.facebook.com/connect/xd_arbit...,staticxx.facebook.com,/connect/xd_arbiter/r/lY4eZXm_YWu.js,lY4eZXm_YWu.js,staticxx.facebook.com||lY4eZXm_YWu.js||x
4,,ra/<,window.navigator.userAgent,https://cas.us.criteo.com/delivery/r/afr.php?d...,https://ajax.googleapis.com/ajax/libs/webfont/...,ajax.googleapis.com,/ajax/libs/webfont/1.6.26/webfont.js,webfont.js,ajax.googleapis.com||webfont.js||ra/<
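
The `agg` column is a composite key: rows that share a script host, file name and function name collapse to one value, which is what the unique count further down measures. A minimal plain-Python sketch of the same idea, using the five rows shown above (the `rows` list and the `make_key` helper are illustrative names only):

```python
# (script_netloc, script_path_end, func_name) for the five rows shown above
rows = [
    ('staticxx.facebook.com', 'lY4eZXm_YWu.js', 'a/<'),
    ('staticxx.facebook.com', 'lY4eZXm_YWu.js', 'a/<'),
    ('staticxx.facebook.com', 'lY4eZXm_YWu.js', 'A'),
    ('staticxx.facebook.com', 'lY4eZXm_YWu.js', 'x'),
    ('ajax.googleapis.com', 'webfont.js', 'ra/<'),
]


def make_key(netloc, path_end, func_name):
    """Join the three components with the same '||' separator as the notebook."""
    return '||'.join((netloc, path_end, func_name))


keys = [make_key(*row) for row in rows]
n_unique = len(set(keys))  # the first two rows produce the same key

print(keys[0])
print(n_unique)
```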


In [30]:
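dfpd = df.compute()  # materialize the dask DataFrame as a pandas DataFrame

# Wrap every column in a fletcher (Arrow-backed) extension array
for i in dfpd.columns:
    dfpd[i] = fr.FletcherArray(dfpd[i])
# Convert back to a dask DataFrame; dfpd is a dask object from here on
dfpd = dd.from_pandas(dfpd, npartitions=2)
dfpd.head()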

Unnamed: 0,argument_0,func_name,symbol,location,script_url,script_netloc,script_path,script_path_end,agg
0,,a/<,window.name,https://staticxx.facebook.com/connect/xd_arbit...,https://staticxx.facebook.com/connect/xd_arbit...,staticxx.facebook.com,/connect/xd_arbiter/r/lY4eZXm_YWu.js,lY4eZXm_YWu.js,staticxx.facebook.com||lY4eZXm_YWu.js||a/<
1,,a/<,window.name,https://staticxx.facebook.com/connect/xd_arbit...,https://staticxx.facebook.com/connect/xd_arbit...,staticxx.facebook.com,/connect/xd_arbiter/r/lY4eZXm_YWu.js,lY4eZXm_YWu.js,staticxx.facebook.com||lY4eZXm_YWu.js||a/<
2,,A,window.document.cookie,https://staticxx.facebook.com/connect/xd_arbit...,https://staticxx.facebook.com/connect/xd_arbit...,staticxx.facebook.com,/connect/xd_arbiter/r/lY4eZXm_YWu.js,lY4eZXm_YWu.js,staticxx.facebook.com||lY4eZXm_YWu.js||A
3,,x,window.navigator.userAgent,https://staticxx.facebook.com/connect/xd_arbit...,https://staticxx.facebook.com/connect/xd_arbit...,staticxx.facebook.com,/connect/xd_arbiter/r/lY4eZXm_YWu.js,lY4eZXm_YWu.js,staticxx.facebook.com||lY4eZXm_YWu.js||x
4,,ra/<,window.navigator.userAgent,https://cas.us.criteo.com/delivery/r/afr.php?d...,https://ajax.googleapis.com/ajax/libs/webfont/...,ajax.googleapis.com,/ajax/libs/webfont/1.6.26/webfont.js,webfont.js,ajax.googleapis.com||webfont.js||ra/<


In [31]:
%%time
n_unique_aggs = df.agg.nunique().compute()

Wall time: 567 ms


In [32]:
n_unique_aggs

1261

In [33]:
n_unique_aggs_from_fr = dfpd.agg.nunique()
#The above command returns an error


TypeError: data type not understood

In [34]:
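#Using different approach
# Note: `%time` is a line magic, so here it times only its own (empty) line;
# `%%time` as the first line of the cell would time the whole cell.
%time
n_unique_aggs_from_fr = len(dfpd.agg.unique().compute())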


Wall time: 0 ns


In [35]:
n_unique_aggs_from_fr

1261

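## Analysis #1
### Comparing time to calculate count of unique values
Calculating count of unique values not using fletcher - Wall time = 567 ms <br>
Calculating count of unique values using fletcher - Wall time = 0 ns (reported by a bare `%time` line, which times an empty statement rather than the `unique()` call)

### Findings
Both approaches return the same count (1261), but `nunique()` fails on the fletcher-backed column, so `len(unique())` was used instead. The two timings are not comparable, because the fletcher cell was never actually timed. Whether array extensions reduce the wall time here still has to be measured, for example by re-running that cell with `%%time`.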
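
Outside of IPython magics, a timing comparison like this one can be made explicit with `time.perf_counter`, which sidesteps the pitfall that `%time` on a line by itself times an empty statement. The sketch below is a plain-Python stand-in: `timed`, `count_unique` and the synthetic `values` list are assumptions for illustration and do not reproduce the dask or fletcher workloads.

```python
import time


def timed(func, *args):
    """Run func(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


def count_unique(values):
    """Count distinct values by building a set."""
    return len(set(values))


# Synthetic data with 1261 distinct keys, matching the count found above.
values = [f'key-{i % 1261}' for i in range(100_000)]

n_unique, elapsed = timed(count_unique, values)
print(n_unique, f'{elapsed * 1000:.3f} ms')
```

Wrapping both variants in the same `timed` call keeps the measurement method identical, so any difference reflects the computation rather than how it was timed.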

#### Looking for strings

In [40]:
df = dd.read_parquet(
    'C:\\Users\\Ayman Hasan\\Desktop\\Outreachy\\sample\\part-00000-34d9b361-ea79-42eb-82ee-9c9f9259c339-c000.snappy.parquet', engine='pyarrow',
    columns=['argument_0', 'script_url']
)
df.head()

Unnamed: 0,argument_0,script_url
0,,https://staticxx.facebook.com/connect/xd_arbit...
1,,https://staticxx.facebook.com/connect/xd_arbit...
2,,https://staticxx.facebook.com/connect/xd_arbit...
3,,https://staticxx.facebook.com/connect/xd_arbit...
4,,https://ajax.googleapis.com/ajax/libs/webfont/...


In [41]:
dfpd = df.compute() #Converting to Pandas Dataframe


for i in dfpd.columns:
    dfpd[i] = fr.FletcherArray(dfpd[i])
dfpd = dd.from_pandas(dfpd, npartitions=2)
dfpd.head()

Unnamed: 0,argument_0,script_url
0,,https://staticxx.facebook.com/connect/xd_arbit...
1,,https://staticxx.facebook.com/connect/xd_arbit...
2,,https://staticxx.facebook.com/connect/xd_arbit...
3,,https://staticxx.facebook.com/connect/xd_arbit...
4,,https://ajax.googleapis.com/ajax/libs/webfont/...


In [42]:
%%time
print(df[df.argument_0.str.contains('modernizr')].script_url.nunique().compute())
#print(df[df.argument_0.str.contains('modernizr')].head())

5
Wall time: 603 ms
Compiler : 258 ms


In [43]:
%%time
print(len(dfpd[dfpd.argument_0.str.contains('modernizr')].script_url.unique().compute())) #contains is not supported with fletcher array extension

AttributeError: Can only use .str accessor with object dtype

## Analysis #2
### .str in Dask with Fletcher Array Extension is not supported
The cell above fails because Dask only exposes the `.str` accessor on columns with object dtype, and a column wrapped in a Fletcher array has an extension dtype instead.
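
A similar guard can be seen in plain pandas, independent of Dask and Fletcher: `.str` works on a Series of strings (object dtype in the pandas versions current when this notebook was written) and raises `AttributeError` on a Series that does not hold strings. The sketch below uses made-up values (the `example.com` URLs and the `frame` name are hypothetical) and assumes pandas is installed; it does not reproduce the Fletcher failure itself.

```python
import pandas as pd

# Hypothetical data shaped like the two columns used in this section.
frame = pd.DataFrame({
    'argument_0': ['', 'modernizr', 'x-modernizr-y', ''],
    'script_url': [
        'https://example.com/a.js',
        'https://example.com/b.js',
        'https://example.com/b.js',
        'https://example.com/c.js',
    ],
})

# regex=False matches a literal substring; na=False treats missing values as no match.
mask = frame['argument_0'].str.contains('modernizr', regex=False, na=False)
n_urls = frame.loc[mask, 'script_url'].nunique()
print(n_urls)

# An integer Series has no string methods, so accessing .str raises AttributeError.
numbers = pd.Series([1, 2, 3])
try:
    numbers.str.contains('1')
    str_supported = True
except AttributeError:
    str_supported = False
print(str_supported)
```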

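# Conclusion

1. These runs do not yet show whether array extensions make string processing faster: the only fletcher timing (0 ns) came from a bare `%time` line and did not measure the computation
1. Some Dask methods need workarounds with Fletcher arrays, e.g. `len(unique())` in place of `nunique()`
1. Some methods, such as the `.str` accessor, are not supported for Fletcher Array type objects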

#### Splitting symbol

In [44]:
df = dd.read_parquet(
    'C:\\Users\\Ayman Hasan\\Desktop\\Outreachy\\sample\\part-00000-34d9b361-ea79-42eb-82ee-9c9f9259c339-c000.snappy.parquet', engine='pyarrow',
    columns=['argument_0', 'func_name', 'symbol', 'location', 'script_url']
)
df.head()

Unnamed: 0,argument_0,func_name,symbol,location,script_url
0,,a/<,window.name,https://staticxx.facebook.com/connect/xd_arbit...,https://staticxx.facebook.com/connect/xd_arbit...
1,,a/<,window.name,https://staticxx.facebook.com/connect/xd_arbit...,https://staticxx.facebook.com/connect/xd_arbit...
2,,A,window.document.cookie,https://staticxx.facebook.com/connect/xd_arbit...,https://staticxx.facebook.com/connect/xd_arbit...
3,,x,window.navigator.userAgent,https://staticxx.facebook.com/connect/xd_arbit...,https://staticxx.facebook.com/connect/xd_arbit...
4,,ra/<,window.navigator.userAgent,https://cas.us.criteo.com/delivery/r/afr.php?d...,https://ajax.googleapis.com/ajax/libs/webfont/...


In [45]:
dfpd = df.compute() #Converting to Pandas Dataframe


for i in dfpd.columns:
    dfpd[i] = fr.FletcherArray(dfpd[i])
dfpd = dd.from_pandas(dfpd, npartitions=2)
dfpd.head()

Unnamed: 0,argument_0,func_name,symbol,location,script_url
0,,a/<,window.name,https://staticxx.facebook.com/connect/xd_arbit...,https://staticxx.facebook.com/connect/xd_arbit...
1,,a/<,window.name,https://staticxx.facebook.com/connect/xd_arbit...,https://staticxx.facebook.com/connect/xd_arbit...
2,,A,window.document.cookie,https://staticxx.facebook.com/connect/xd_arbit...,https://staticxx.facebook.com/connect/xd_arbit...
3,,x,window.navigator.userAgent,https://staticxx.facebook.com/connect/xd_arbit...,https://staticxx.facebook.com/connect/xd_arbit...
4,,ra/<,window.navigator.userAgent,https://cas.us.criteo.com/delivery/r/afr.php?d...,https://ajax.googleapis.com/ajax/libs/webfont/...


In [46]:
df['symbol_parts'] = df.symbol.str.split('.')
df['symbol_0'] = df.symbol_parts.str.get(0)
df['symbol_1'] = df.symbol_parts.str.get(1)
df['symbol_2'] = df.symbol_parts.str.get(2)
df.head()
dfpd['symbol_0'] = dfpd.symbol_parts.str.get(0)

AttributeError: 'DataFrame' object has no attribute 'symbol_parts'

In [47]:
dfpd['symbol_parts'] = dfpd.symbol.str.split('.')
dfpd['symbol_0'] = dfpd.symbol_parts.str.get(0)
dfpd['symbol_1'] = dfpd.symbol_parts.str.get(1)

df.head()

AttributeError: Can only use .str accessor with object dtype

In [48]:
%%time
print(df[df.symbol_1 == 'fillText'].func_name.nunique().compute())

5
Wall time: 367 ms
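
Per row, the `str.split('.')` followed by `str.get(i)` chain amounts to splitting the symbol on dots and picking one part, with a missing value when the symbol has fewer parts. A plain-Python sketch using symbols from the table above (`get_part` is a hypothetical helper, and `None` stands in for the missing value pandas would report):

```python
symbols = [
    'window.name',
    'window.document.cookie',
    'window.navigator.userAgent',
]


def get_part(parts, index):
    """Return parts[index], or None when the symbol has too few parts."""
    return parts[index] if index < len(parts) else None


symbol_parts = [s.split('.') for s in symbols]
symbol_1 = [get_part(p, 1) for p in symbol_parts]
symbol_2 = [get_part(p, 2) for p in symbol_parts]

print(symbol_1)
print(symbol_2)
```

Filtering on `symbol_1 == 'fillText'`, as in the last cell, then only has to compare one short string per row instead of searching the full symbol.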
