## Introduction
This notebook takes the existing string-processing examples that @birdsarah produced in 'analyses/issue_36.ipynb' and re-runs them using **array extensions**.

### Fletcher Array Extensions
Source: [fletcher docs](https://fletcher.readthedocs.io/en/latest/)

Fletcher provides a generic implementation of the ExtensionDtype and ExtensionArray interfaces of Pandas for columns backed by Apache Arrow. By using it you can use any data type available in Apache Arrow natively in Pandas. Most prominently, fletcher provides native String and List types.

In addition to bringing an alternative memory backend to NumPy, fletcher also provides high-performance operations on the new column types. It will either use the native implementation of an algorithm if provided in pyarrow or otherwise provide an implementation by itself using Numba.

An example of usage is shown below:


In [16]:
import fletcher as fr
import pandas as pd

df = pd.DataFrame({
    'str_column': fr.FletcherArray(['Test', None, 'Strings'])
})
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 1 columns):
str_column    2 non-null fletcher[string]
dtypes: fletcher[string](1)
memory usage: 108.0 bytes


## Analysis #1
### Comparing time to calculate count of unique values and nunique method usage 
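Calculating the count of unique values without fletcher: Wall time = 567ms (the run recorded in the cell output shows 418 ms). See <a href='#analysis-1.1'>Analysis-1.1</a> for code and processing <br>
Calculating the count of unique values using fletcher: Wall time = 0ns. This reading comes from `%time` placed on a line of its own, which times an empty statement rather than the computation, so it is not a valid measurement. See <a href='#analysis-1.1'>Analysis-1.1</a> for code and processing <br>
The `nunique` method doesn't work on a Fletcher Array (see <a href='#analysis-1.2'>Analysis-1.2</a>), thus `len(unique())` is used instead.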



### Findings
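Array extensions appear to have decreased the wall time, but the 0 ns reading does not time the computation itself (`%time` was used on a line of its own), so the efficiency gain is not reliably measured.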
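
Wall times like the ones compared above can also be taken without IPython magics, which avoids any dependence on where a magic sits in the cell. The following is a minimal sketch, assuming a plain Python list stands in for the `agg` column. The data is made up, and the timing it prints says nothing about fletcher or Dask performance.

```python
import time

# Hypothetical stand-in data for the 'agg' column (not from the dataset).
values = [f"host{i % 50}||script.js||f{i % 7}" for i in range(100_000)]

# Time exactly the statement of interest with an explicit clock.
start = time.perf_counter()
n_unique = len(set(values))  # plain-Python analogue of len(series.unique())
elapsed = time.perf_counter() - start

print(f"{n_unique} unique values in {elapsed * 1000:.3f} ms")
```

Wrapping the statement between two `time.perf_counter()` calls makes it explicit which line is being measured.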

## Analysis #2
### .str in Dask with Fletcher Array Extension is not supported
Accessing `.str` on a column backed by a Fletcher Array raises an `AttributeError`, because Dask does not allow the `.str` accessor on that dtype. See <a href='#analysis-2'>Analysis-2</a> for code and processing

# Conclusion

1. Array extensions may make string processing faster in some cases
1. Extra methods and operations are needed for Dask methods to work correctly
1. Some methods are not supported for Fletcher Array objects

This notebook contains a series of string operations that I find myself doing frequently on this dataset.

For the purposes of issue 36 they may not be the most efficient, but they should be enough to get started and we can dig in further if needed.

For the Dask demo, this shows plenty of examples of applying functions and string operations using Dask.

In [1]:
import dask.dataframe as dd
from dask.distributed import Client
import fletcher as fr
Client()

Client: Scheduler: tcp://127.0.0.1:56669, Dashboard: http://127.0.0.1:8787/status
Cluster: Workers: 4, Cores: 4, Memory: 3.98 GB


In [2]:
df = dd.read_parquet(
    "C:\\Users\\hamza\\Downloads\\safe_dataset.sample (1)\\sample\\part-00000-34d9b361-ea79-42eb-82ee-9c9f9259c339-c000.snappy.parquet", engine='pyarrow',
    columns=['argument_0', 'func_name', 'symbol', 'location', 'script_url']
)
df.head()


| | argument_0 | func_name | symbol | location | script_url |
|---|---|---|---|---|---|
| 0 | | a/< | window.name | https://staticxx.facebook.com/connect/xd_arbit... | https://staticxx.facebook.com/connect/xd_arbit... |
| 1 | | a/< | window.name | https://staticxx.facebook.com/connect/xd_arbit... | https://staticxx.facebook.com/connect/xd_arbit... |
| 2 | | A | window.document.cookie | https://staticxx.facebook.com/connect/xd_arbit... | https://staticxx.facebook.com/connect/xd_arbit... |
| 3 | | x | window.navigator.userAgent | https://staticxx.facebook.com/connect/xd_arbit... | https://staticxx.facebook.com/connect/xd_arbit... |
| 4 | | ra/< | window.navigator.userAgent | https://cas.us.criteo.com/delivery/r/afr.php?d... | https://ajax.googleapis.com/ajax/libs/webfont/... |


Some common string processing tasks:
* pulling domains
* pulling end of url
* building "grouping" string
* splitting symbol column
* finding things in strings

In [3]:
from urllib.parse import urlparse
#from openwpm_utils.domain import get_ps_plus_1

EMPTY_STRING = 'EMPTY_STRING'


def get_netloc(x):
    p = urlparse(x)
    val = p.netloc
    if len(val) == 0:
        val = EMPTY_STRING
    return val


def get_path(x):
    p = urlparse(x)
    val = p.path
    if len(val) == 0:
        val = EMPTY_STRING
    return val


def get_end_of_path(x):
    splits = x.split('/')
    val = ''
    if len(splits) > 0:
        val = splits[-1]
    else:
        val = x
    if len(val) == 0:
        val = EMPTY_STRING
    return val


def get_clean_script(x):
    p = urlparse(x)
    return f'{p.netloc}{p.path}'
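
To make the helpers above concrete, here is a small sketch of what `urlparse` returns for one URL. The URL is hypothetical and chosen only for illustration; the variable names are likewise not from the notebook.

```python
from urllib.parse import urlparse

# Hypothetical URL, used only to show which pieces the helpers extract.
url = "https://example.com/static/js/app.js?v=3"

parts = urlparse(url)
netloc = parts.netloc                          # host part, as in get_netloc
path = parts.path                              # path without the query string, as in get_path
path_end = path.split("/")[-1]                 # last path segment, as in get_end_of_path
clean_script = f"{parts.netloc}{parts.path}"   # as in get_clean_script

print(netloc, path, path_end, clean_script)

# A string without a scheme has an empty netloc; this is the case
# that get_netloc maps to EMPTY_STRING.
no_scheme = urlparse("not a url")
print(repr(no_scheme.netloc))
```

Note that `str.split` always returns at least one element, so `splits[-1]` in `get_end_of_path` is always defined and its `else` branch is never taken.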

#### Build aggregator

In [4]:
df['script_netloc'] = df.script_url.apply(get_netloc, meta=('O'))
df['script_path'] = df.script_url.apply(get_path, meta=('O'))
df['script_path_end'] = df.script_path.apply(get_end_of_path, meta=('O'))
df['agg'] = df.script_netloc + '||' + df.script_path_end + '||' + df.func_name
df.head()

| | argument_0 | func_name | symbol | location | script_url | script_netloc | script_path | script_path_end | agg |
|---|---|---|---|---|---|---|---|---|---|
| 0 | | a/< | window.name | https://staticxx.facebook.com/connect/xd_arbit... | https://staticxx.facebook.com/connect/xd_arbit... | staticxx.facebook.com | /connect/xd_arbiter/r/lY4eZXm_YWu.js | lY4eZXm_YWu.js | staticxx.facebook.com\|\|lY4eZXm_YWu.js\|\|a/< |
| 1 | | a/< | window.name | https://staticxx.facebook.com/connect/xd_arbit... | https://staticxx.facebook.com/connect/xd_arbit... | staticxx.facebook.com | /connect/xd_arbiter/r/lY4eZXm_YWu.js | lY4eZXm_YWu.js | staticxx.facebook.com\|\|lY4eZXm_YWu.js\|\|a/< |
| 2 | | A | window.document.cookie | https://staticxx.facebook.com/connect/xd_arbit... | https://staticxx.facebook.com/connect/xd_arbit... | staticxx.facebook.com | /connect/xd_arbiter/r/lY4eZXm_YWu.js | lY4eZXm_YWu.js | staticxx.facebook.com\|\|lY4eZXm_YWu.js\|\|A |
| 3 | | x | window.navigator.userAgent | https://staticxx.facebook.com/connect/xd_arbit... | https://staticxx.facebook.com/connect/xd_arbit... | staticxx.facebook.com | /connect/xd_arbiter/r/lY4eZXm_YWu.js | lY4eZXm_YWu.js | staticxx.facebook.com\|\|lY4eZXm_YWu.js\|\|x |
| 4 | | ra/< | window.navigator.userAgent | https://cas.us.criteo.com/delivery/r/afr.php?d... | https://ajax.googleapis.com/ajax/libs/webfont/... | ajax.googleapis.com | /ajax/libs/webfont/1.6.26/webfont.js | webfont.js | ajax.googleapis.com\|\|webfont.js\|\|ra/< |
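
The `agg` column shown above is a grouping key built from three pieces joined with `||`. As a rough sketch of the same idea in plain Python, assuming a few made-up `(script_url, func_name)` rows and a hypothetical helper `build_agg` that is not part of the notebook:

```python
from urllib.parse import urlparse

EMPTY_STRING = "EMPTY_STRING"

# Hypothetical (script_url, func_name) rows; not taken from the dataset.
rows = [
    ("https://example.com/js/app.js", "init"),
    ("https://example.com/js/app.js?cache=1", "init"),
    ("https://cdn.example.org/lib/widget.js", "render"),
]


def build_agg(script_url, func_name):
    """Join netloc, last path segment and function name with '||'."""
    parsed = urlparse(script_url)
    netloc = parsed.netloc or EMPTY_STRING
    path_end = parsed.path.split("/")[-1] or EMPTY_STRING
    return "||".join([netloc, path_end, func_name])


agg_keys = [build_agg(url, name) for url, name in rows]
n_unique = len(set(agg_keys))  # plain-Python analogue of nunique()

print(agg_keys)
print(n_unique)
```

Because the query string is not part of `parsed.path`, the first two rows collapse to the same key, which is the point of grouping on this string rather than on the raw URL.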


In [5]:
dfpd = df.compute() #Converting to Pandas Dataframe


for i in dfpd.columns:
    dfpd[i] = fr.FletcherArray(dfpd[i])
dfpd = dd.from_pandas(dfpd, npartitions=2)


## Analysis - 1.1
<a id='analysis-1.1'></a>


In [6]:
%%time
n_unique_aggs = df['agg'].nunique().compute()  # bracket access avoids a clash with any DataFrame `agg` method

Wall time: 418 ms


In [7]:
n_unique_aggs

1261

In [8]:
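# Using a different approach: len(unique()) instead of nunique()
# NOTE: `%time` on a line of its own times an empty statement, so the recorded
# 0 ns does not measure the next line; `%%time` as the first cell line would.
%time
n_unique_aggs_from_fr = len(dfpd['agg'].unique().compute())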


Wall time: 0 ns


In [9]:
n_unique_aggs_from_fr

1261

## Analysis - 1.2
<a id='analysis-1.2'></a>
See <a href='#analysis-1.1'>Analysis-1.1</a> for the alternative approach using `len(unique())` <br>

In [10]:
# nunique() on a Fletcher-backed column raises the error shown below
n_unique_aggs_from_fr = dfpd['agg'].nunique()


TypeError: data type not understood

#### Looking for strings

In [None]:
df = dd.read_parquet(
    'C:\\Users\\Ayman Hasan\\Desktop\\Outreachy\\sample\\part-00000-34d9b361-ea79-42eb-82ee-9c9f9259c339-c000.snappy.parquet', engine='pyarrow',
    columns=['argument_0', 'script_url']
)
df.head()

In [11]:
dfpd = df.compute() #Converting to Pandas Dataframe


for i in dfpd.columns:
    dfpd[i] = fr.FletcherArray(dfpd[i])
dfpd = dd.from_pandas(dfpd, npartitions=2)
dfpd.head()

Warning from dask.distributed (the start of the message is truncated in this export):
Consider scattering large objects ahead of time with client.scatter to reduce scheduler burden and keep data on workers

    future = client.submit(func, big_data)    # bad

    big_future = client.scatter(big_data)     # good
    future = client.submit(func, big_future)  # good


| | argument_0 | func_name | symbol | location | script_url | script_netloc | script_path | script_path_end | agg |
|---|---|---|---|---|---|---|---|---|---|
| 0 | | a/< | window.name | https://staticxx.facebook.com/connect/xd_arbit... | https://staticxx.facebook.com/connect/xd_arbit... | staticxx.facebook.com | /connect/xd_arbiter/r/lY4eZXm_YWu.js | lY4eZXm_YWu.js | staticxx.facebook.com\|\|lY4eZXm_YWu.js\|\|a/< |
| 1 | | a/< | window.name | https://staticxx.facebook.com/connect/xd_arbit... | https://staticxx.facebook.com/connect/xd_arbit... | staticxx.facebook.com | /connect/xd_arbiter/r/lY4eZXm_YWu.js | lY4eZXm_YWu.js | staticxx.facebook.com\|\|lY4eZXm_YWu.js\|\|a/< |
| 2 | | A | window.document.cookie | https://staticxx.facebook.com/connect/xd_arbit... | https://staticxx.facebook.com/connect/xd_arbit... | staticxx.facebook.com | /connect/xd_arbiter/r/lY4eZXm_YWu.js | lY4eZXm_YWu.js | staticxx.facebook.com\|\|lY4eZXm_YWu.js\|\|A |
| 3 | | x | window.navigator.userAgent | https://staticxx.facebook.com/connect/xd_arbit... | https://staticxx.facebook.com/connect/xd_arbit... | staticxx.facebook.com | /connect/xd_arbiter/r/lY4eZXm_YWu.js | lY4eZXm_YWu.js | staticxx.facebook.com\|\|lY4eZXm_YWu.js\|\|x |
| 4 | | ra/< | window.navigator.userAgent | https://cas.us.criteo.com/delivery/r/afr.php?d... | https://ajax.googleapis.com/ajax/libs/webfont/... | ajax.googleapis.com | /ajax/libs/webfont/1.6.26/webfont.js | webfont.js | ajax.googleapis.com\|\|webfont.js\|\|ra/< |


In [12]:
%%time
print(df[df.argument_0.str.contains('modernizr')].script_url.nunique().compute())
#print(df[df.argument_0.str.contains('modernizr')].head())

5
Wall time: 439 ms


## Analysis - 2
<a id='analysis-2'></a>


In [13]:
%%time
print(len(dfpd[dfpd.argument_0.str.contains('modernizr')].script_url.unique().compute())) #contains is not supported with fletcher array extension

AttributeError: Can only use .str accessor with object dtype
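
Since `.str.contains` is unavailable here, the substring check itself can be written as an ordinary element-wise function. The sketch below is plain Python with made-up values and a hypothetical helper name, `contains_substring`. Whether such a function can be applied to a Fletcher-backed Dask column (for example through `.apply`) was not checked in this notebook and is an assumption.

```python
def contains_substring(value, needle):
    """Return True when `value` is a string containing `needle`; safe for None."""
    return isinstance(value, str) and needle in value


# Hypothetical argument_0 values, including a missing and an empty one.
values = ["load modernizr.js", None, "", "jquery"]
mask = [contains_substring(v, "modernizr") for v in values]

print(mask)
```

The explicit `isinstance` check matters because the `argument_0` column contains missing values, which a bare `needle in value` would reject with a `TypeError`.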

#### Splitting symbol

In [14]:
df = dd.read_parquet(
    'C:\\Users\\Ayman Hasan\\Desktop\\Outreachy\\sample\\part-00000-34d9b361-ea79-42eb-82ee-9c9f9259c339-c000.snappy.parquet', engine='pyarrow',
    columns=['argument_0', 'func_name', 'symbol', 'location', 'script_url']
)
df.head()

OSError: Passed non-file path: C:\Users\Ayman Hasan\Desktop\Outreachy\sample\part-00000-34d9b361-ea79-42eb-82ee-9c9f9259c339-c000.snappy.parquet

In [None]:
dfpd = df.compute() #Converting to Pandas Dataframe


for i in dfpd.columns:
    dfpd[i] = fr.FletcherArray(dfpd[i])
dfpd = dd.from_pandas(dfpd, npartitions=2)
dfpd.head()

In [None]:
df['symbol_parts'] = df.symbol.str.split('.')
df['symbol_0'] = df.symbol_parts.str.get(0)
df['symbol_1'] = df.symbol_parts.str.get(1)
df['symbol_2'] = df.symbol_parts.str.get(2)
df.head()

In [None]:
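# Same steps on the Fletcher-backed frame; given the `.str` limitation in Analysis 2, these are likely to raise the same AttributeError
dfpd['symbol_parts'] = dfpd.symbol.str.split('.')
dfpd['symbol_0'] = dfpd.symbol_parts.str.get(0)
dfpd['symbol_1'] = dfpd.symbol_parts.str.get(1)

dfpd.head()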

In [None]:
%%time
print(df[df.symbol_1 == 'fillText'].func_name.nunique().compute())
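
The symbol-splitting steps above amount to splitting on `.` and picking a position, with a missing value when the position does not exist. A minimal plain-Python sketch of that logic follows, using symbol values that appear in the sample output earlier in this notebook. The helper names `split_symbol` and `get_part` are hypothetical, and `None` stands in for the missing value that pandas would produce.

```python
def split_symbol(symbol, sep="."):
    """Split a symbol such as 'window.document.cookie' into its parts."""
    return symbol.split(sep) if isinstance(symbol, str) else []


def get_part(parts, index):
    """Return parts[index], or None when the index is out of range."""
    return parts[index] if 0 <= index < len(parts) else None


symbols = ["window.name", "window.document.cookie", "window.navigator.userAgent"]
split = [split_symbol(s) for s in symbols]

symbol_0 = [get_part(p, 0) for p in split]
symbol_1 = [get_part(p, 1) for p in split]
symbol_2 = [get_part(p, 2) for p in split]

print(symbol_0)
print(symbol_1)
print(symbol_2)
```

Symbols with only two parts, such as `window.name`, have no third component, so `symbol_2` is missing for them.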