# Generate presence of TPs over harvests
Create a DF in which all TPs from visited EU/EEA sites are the columns and all visited sites (their IDs) are the index. All harvests are iterated over, and every unique TP response from a visited EU/EEA site is recorded with 1 in the DF.

    • input: (i) all enriched response .csv files, and (ii) EU-UNIQUE-RES-TPs.csv - list of unique TPs
    • output: EU_TP_total_occurence.csv - presence of each unique TP at every visited site over harvesting period
    • script steps:
        1. Import libraries
        2. Load all unique TPs as a DF
        3. Create an empty dataframe (df_distribution) with visited site IDs as index (1 – 12 778) and unique TPs as columns
        4. Iterate through all enriched responses:
            (a) Load response file as a DF
            (b) Filter out all responses belonging to visited sites of non-EU/EEA origin
            (c) Filter out all FP-to-FP communication
            (d) Group the response DF by visited site IDs (visit_ID) and TP root domains (RD_url), get the size, and create a new DF out of it
            (e) Iterate through the grouped DF and insert 1 into df_distribution at position [visit_ID, RD_url] for every TP that appears in the responses of the given visited site
        5. Export the df_distribution containing the occurrence of each unique TP at every visited site, during all harvests - EU_TP_total_occurence.csv
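
Steps 4(b) to 4(e) can be sketched on a toy table as follows. This is only an illustration under stated assumptions: the rows, the two TP domains, and the 'Other' label for non-European sites are made up, while the column names (visit_id, RD_url, Europe, third_party) are the ones used in the notebook cells.

```python
import pandas as pd

# Toy stand-in for one enriched response file (values are invented for illustration)
df_single_harvest = pd.DataFrame({
    'visit_id':    [1, 1, 1, 2, 2, 3],
    'RD_url':      ['tracker.com', 'tracker.com', 'site-a.eu',
                    'cdn.net', 'tracker.com', 'cdn.net'],
    'Europe':      ['EU', 'EU', 'EU', 'EEA', 'EEA', 'Other'],  # 'Other' is an assumed label
    'third_party': [True, True, False, True, True, True],
})

# Step 3: empty presence matrix, visited site IDs as index and unique TPs as columns
df_distribution = pd.DataFrame(index=[1, 2, 3], columns=['tracker.com', 'cdn.net'])

# (b) keep only responses belonging to EU/EEA visited sites
df_eu = df_single_harvest[df_single_harvest['Europe'].isin(['EU', 'EEA'])]

# (c) keep only third-party responses
df_tp = df_eu[df_eu['third_party'] == True]

# (d) one entry per (visited site, TP root domain) pair
df_grouped = df_tp.groupby(['visit_id', 'RD_url']).size()

# (e) mark each observed pair with 1; unobserved pairs stay NaN
for (visit_id, tp), _count in df_grouped.items():
    df_distribution.at[visit_id, tp] = 1

print(df_distribution)
```

Site 3 stays empty because its only response belongs to a non-EU/EEA site, and site 1 has no entry for cdn.net because that TP never appeared in its responses.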

In [1]:
# Import
import pandas as pd
import csv
import os

In [2]:
# Define path and file name
TPs_path = '/home/ubuntu/data/processed/TPs/TPs_merged/'
TPs_name = 'EU-UNIQUE-RES-TPs.csv'

In [3]:
# Load all unique TPs
df_TPs = pd.read_csv(TPs_path + TPs_name, header=None)
df_TPs.head()

Unnamed: 0,0
0,01mspmd5yalky8.com
1,01net.com
2,030876vw.com
3,0914.global.ssl.fastly.net
4,0klxjejyxak3.com


In [4]:
# Define visit ids
visit_ids = list(range(1, 12779))

In [5]:
# Create a DF with visited site IDs as indexes and all unique TPs as columns
df_distribution = pd.DataFrame(index=visit_ids, columns=df_TPs[0])
df_distribution.head()

Unnamed: 0,01mspmd5yalky8.com,01net.com,030876vw.com,0914.global.ssl.fastly.net,0klxjejyxak3.com,1.98.201.35.bc.googleusercontent.com,100posto.hr,1053041200.rsc.cdn77.org,108.59.8.1,108.59.8.35,...,zrh50.cloudfront.net,zro56hd6szoy.com,ztat.net,ztkcdn.net,ztsrv.com,zumby.io,zuora.com,zuuvi.com,zvuki.ru,zxcvads.com
1,,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,,,,,,,,,...,,,,,,,,,,


In [None]:
# Define folder with all harvest data
harvests_path = '/home/ubuntu/data/processed/crawls/response_enriched/v.3/'

# Iterate through all file names in the folder, counting from 1
for i, harvests_name in enumerate(os.listdir(harvests_path), start=1):
    # Load harvest data
    df_single_harvest = pd.read_csv(harvests_path + harvests_name)
    
    # Filter out all responses belonging to non-EU/EEA sites
    df_res_TP_EU = df_single_harvest[df_single_harvest['Europe'].isin(['EU', 'EEA'])]

    # Filter out all first-to-first party communication (keep third-party responses only)
    df_res_filt = df_res_TP_EU[df_res_TP_EU['third_party'] == True]
    df_res_filt = df_res_filt[['visit_id', 'url', 'site_url', 'RD_url', 'RD_site_url']]
    
    # Group DF by visit id and TP root domain
    df_res_filt = df_res_filt.astype({'visit_id': 'int32'})
    df_grouped = df_res_filt.groupby(['visit_id', 'RD_url']).size()
    
    # For each (visited site, TP) pair appearing in the given harvest, assign 1 in the dataframe
    for (visit_id, TP), _ in df_grouped.items():
        df_distribution.at[visit_id, TP] = 1
    
    print('Completed ', i, ' - ', harvests_name)

print('Completed')
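
The cell-by-cell assignment in the inner loop can become slow when a harvest contains many (site, TP) pairs. As a possible alternative, and not the method used in this notebook, the pairs can be translated into row and column positions and written in one step. The helper name mark_presence is hypothetical. One behavioural difference is assumed and should be noted: TPs or visit IDs that are missing from df_distribution are skipped here, whereas `.at` would add them as new columns or rows.

```python
import pandas as pd

def mark_presence(df_distribution, df_pairs):
    """Return a copy of df_distribution with 1 at every (visit_id, RD_url) pair.

    Hypothetical helper: assumes df_distribution has unique index and column
    labels, and that df_pairs has 'visit_id' and 'RD_url' columns.
    """
    pairs = df_pairs[['visit_id', 'RD_url']].drop_duplicates()

    # Translate labels into positions; get_indexer returns -1 for unknown labels
    rows = df_distribution.index.get_indexer(pairs['visit_id'])
    cols = df_distribution.columns.get_indexer(pairs['RD_url'])
    known = (rows >= 0) & (cols >= 0)

    # Write all known pairs at once on a copy of the underlying array
    values = df_distribution.to_numpy(copy=True)
    values[rows[known], cols[known]] = 1

    return pd.DataFrame(values, index=df_distribution.index,
                        columns=df_distribution.columns)

# Usage on toy data (invented values)
dist = pd.DataFrame(index=[1, 2, 3], columns=['a.com', 'b.com'])
pairs = pd.DataFrame({'visit_id': [1, 1, 2, 9],
                      'RD_url': ['a.com', 'a.com', 'b.com', 'a.com']})
out = mark_presence(dist, pairs)
print(out)
```

Inside the harvest loop this would be called once per file, for example `df_distribution = mark_presence(df_distribution, df_res_filt)`, in place of the `groupby` and the inner `for` loop.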


In [11]:
# Export
df_distribution.to_csv('/home/ubuntu/data/processed/crawls/response_enriched/analysis_v.3/' + 'EU_TP_total_occurence.csv', index=True, header=True)