## Setup

Reads the data from the dataset. At the moment, this is set up to use pandas with only the data sample provided at the hackathon. It will need to be updated to scrape JS files from the full dataset.

In [1]:
import pandas as pd

DATA_DIR = r'/media/ddobre/UCOSP_DATA/'
PARQUET_FILE = DATA_DIR + r'sample'  # run against the hackathon sample data
df = pd.read_parquet(PARQUET_FILE, engine='pyarrow')

In [2]:
df['script_url'].unique()

array(['https://staticxx.facebook.com/connect/xd_arbiter/r/lY4eZXm_YWu.js?version=42#channel=f30ef17b61f384&origin=http%3A%2F%2Fwww.ubitennis.com',
       'https://ajax.googleapis.com/ajax/libs/webfont/1.6.26/webfont.js',
       'http://cpro.baidustatic.com/cpro/ui/noexpire/js/4.0.1/adClosefeedbackUpgrade.min.js',
       'https://staticxx.facebook.com/connect/xd_arbiter/r/lY4eZXm_YWu.js?version=42#channel=fe1ad16a94c816&origin=http%3A%2F%2Farabi21.com',
       'https://static.dynamicyield.com/scripts/12290/dy-coll-min.js',
       'https://www.syracuse.edu/about/',
       'https://www.googletagmanager.com/gtm.js?id=GTM-5FC97GL',
       'https://www.syracuse.edu/wp-includes/js/wp-emoji-release.min.js?ver=4.9.1',
       'https://www.google-analytics.com/analytics.js',
       'https://code.jquery.com/jquery-migrate-1.4.1.min.js',
       'https://www.syracuse.edu/wp-content/themes/g6-carbon/js/carbon-all.js?ver=6.3.6',
       'https://www.syracuse.edu/wp-includes/js/wp-embed.min.js?ver=4.9.

## Scrape the JS files

Does the actual scraping. Setting `ssl._create_default_https_context = ssl._create_unverified_context` was needed to get around some SSL certificate verification errors. Note that this disables certificate verification for every HTTPS request made in the session, so there is probably a cleaner way of handling it. Certain text from the URL was also replaced because it caused issues when naming the scripts after the URL; a different naming scheme may be more effective there. Finally, some special characters threw errors when decoding the scripts; those are handled with the `'backslashreplace'` error handler.

In [None]:
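import os
import ssl
import urllib.error
import urllib.request

from slugify import slugify

# Work around SSL certificate errors by disabling verification globally
ssl._create_default_https_context = ssl._create_unverified_context
failed = []

# Create the output folder once, before the loop
folder_name = os.path.join(DATA_DIR, 'js_source_files')
os.makedirs(folder_name, exist_ok=True)

for url_name in df['script_url'].unique():
    # Build a filesystem-safe file name from the URL (capped at 250 characters)
    shortened_url = url_name.replace('https://', '').replace('http://', '').replace('/', '_')
    shortened_url = slugify(shortened_url)[:250]
    file_name = os.path.join(folder_name, shortened_url + '.txt')

    try:
        # Download first so that a failed request does not leave an empty file behind
        source = urllib.request.urlopen(url_name).read().decode('utf-8', 'backslashreplace')
    except (urllib.error.URLError, ValueError) as e:
        failed.append(url_name)
        print('Attempted:', url_name)
        print(str(e), '\n')
        continue

    with open(file_name, 'w', encoding='utf-8') as source_file:
        source_file.write(source)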
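
One possibly cleaner way to handle the SSL errors is to pass an explicit context to each request rather than patching the `ssl` module globally. The sketch below is only an illustration, not what the notebook currently does: `make_contexts` and `fetch_bytes` are hypothetical helper names, the 30 second timeout is an arbitrary assumption, and the fallback still downloads without certificate verification when verification fails.

```python
import ssl
import urllib.error
import urllib.request


def make_contexts():
    """Return a (verified, unverified) pair of SSL contexts."""
    verified = ssl.create_default_context()
    unverified = ssl.create_default_context()
    # check_hostname has to be switched off before verify_mode can be CERT_NONE
    unverified.check_hostname = False
    unverified.verify_mode = ssl.CERT_NONE
    return verified, unverified


def fetch_bytes(url, timeout=30):
    """Try a verified request first; fall back to unverified only on SSL errors."""
    verified, unverified = make_contexts()
    try:
        with urllib.request.urlopen(url, context=verified, timeout=timeout) as resp:
            return resp.read()
    except urllib.error.URLError as e:
        # Only SSL problems trigger the fallback; other errors are re-raised
        if isinstance(e.reason, ssl.SSLError):
            with urllib.request.urlopen(url, context=unverified, timeout=timeout) as resp:
                return resp.read()
        raise
```

With this in place, the body of the loop could call `fetch_bytes(url_name).decode('utf-8', 'backslashreplace')`, and the global override of `ssl._create_default_https_context` would no longer be needed.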

## Scrape the Princeton examples

Reads the Princeton .tsv files, which are categorized by the type of fingerprinting performed, and then scrapes the JS files from those sources. There were many more failures here (possibly because the script URLs are updated on some cadence to prevent this from happening), so all of the failures are written to an output file to keep track of them. The same SSL, naming, and special-character issues mentioned above apply.
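
For the naming issue, one alternative scheme is to combine the host name with a short hash of the full URL, so that names stay short and distinct URLs are very unlikely to collide. This is a minimal sketch under stated assumptions: `url_to_filename` is a hypothetical helper, and the 16-character digest and 60-character host limit are arbitrary choices rather than anything taken from the notebook.

```python
import hashlib
from urllib.parse import urlsplit


def url_to_filename(url, suffix='.txt', max_host_len=60):
    """Build a filesystem-safe name from the host plus a hash of the whole URL."""
    host = urlsplit(url).netloc or 'unknown-host'
    # Keep letters, digits, dots and hyphens; replace anything else with '_'
    safe_host = ''.join(c if c.isalnum() or c in '.-' else '_' for c in host)
    digest = hashlib.sha256(url.encode('utf-8')).hexdigest()[:16]
    return '{}_{}{}'.format(safe_host[:max_host_len], digest, suffix)


if __name__ == '__main__':
    print(url_to_filename('https://www.google-analytics.com/analytics.js'))
```

The trade-off is that the file name no longer shows the script path, so a separate mapping from file name back to URL (for example, an extra DataFrame column) would be needed to trace a file to its source.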

In [None]:
import os

import pandas as pd

directory = '/Users/rob/Projects/ucosp/Overscripted-Data-Analysis-Challenge/scripts'
headers = ['site_url', 'script_url']
scripts = ['audio_fingerprinting', 'canvas_fingerprinting', 'font_fingerprinting', 'webrtc_ip_retrieval']
fingerprinting = []

for script in scripts:
    loc = os.path.join(directory, '{}.tsv'.format(script))
    fingerprinting.append(pd.read_table(loc, header=None, names=headers))

In [None]:
import os
import ssl
import urllib.request

ssl._create_default_https_context = ssl._create_unverified_context

# for index, script in enumerate(fingerprinting):
script = fingerprinting[3]
index = 3
print('Trying:', scripts[index])
print(len(script['script_url'].unique()), 'scripts')
failed = []
folder_name = os.path.join(directory, scripts[index])
failed_file = os.path.join(directory, '{}_failed.txt'.format(scripts[index]))
os.makedirs(folder_name, exist_ok=True)
for url_name in script['script_url'].unique():
    shortened_url = url_name.replace('https://', '').replace('http://', '').replace('/', '_')
    suffix = '.txt'
    file_name = os.path.join(folder_name, shortened_url + suffix)

    try:
        response = urllib.request.urlopen(url_name).read().decode('utf-8', 'backslashreplace')
        with open(file_name, 'w') as source_file:
            source_file.write(response)
    except Exception as e:
        failed.append((url_name, e))

print('Failed:', len(failed))
with open(failed_file, 'w') as filename:
    for failure in failed:
        filename.write(failure[0] + ', ' + str(failure[1]))
        filename.write('\n')
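
Because URLs and error messages can themselves contain commas, the comma-separated failure log above may be awkward to parse later. As a hedged sketch, the failures could instead be written with the standard `csv` module, which quotes such fields. `write_failures` and `read_failed_urls` are hypothetical helper names, and the three-column layout is an assumption, not the notebook's format.

```python
import csv


def write_failures(failed, path):
    """Write (url, exception) pairs to a CSV file with a header row."""
    with open(path, 'w', newline='', encoding='utf-8') as handle:
        writer = csv.writer(handle)
        writer.writerow(['script_url', 'error_type', 'error'])
        for url, error in failed:
            writer.writerow([url, type(error).__name__, str(error)])


def read_failed_urls(path):
    """Read back only the URLs, e.g. to retry them later."""
    with open(path, newline='', encoding='utf-8') as handle:
        return [row['script_url'] for row in csv.DictReader(handle)]
```

Called as `write_failures(failed, failed_file)`, this would take the place of the final `with` block, and `read_failed_urls(failed_file)` would give the list of URLs to retry.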