## Data splitting
As before, this notebook iterates on the dataset creation process. It takes the full dataset and splits it into the datasets described in the notes of the paper. First, we split the data on a per-domain basis. Then, for each domain and each label, we extract a separate dataset containing only the pages on that domain that contain that label.

In [1]:
%matplotlib inline
# standard library
import itertools
import sys, os
import re

from urllib.parse import urlparse

# pandas
import pandas as pd
import dask.dataframe as dd

# numpy, matplotlib, seaborn
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


# sklearn
from sklearn.preprocessing import LabelBinarizer

# local imports
sys.path.append(os.path.join(os.getcwd(), "src"))
from utils import get_domain_from_url

# this styling is purely my preference
# less chartjunk
# sns.set() resets the plotting context, so it has to run before set_context()
sns.set(style='ticks', palette='Set2')
sns.set_context('notebook', font_scale=1.5, rc={'lines.linewidth': 2.5})

In [2]:
# read every partition of the full dataset and drop the leftover index column
data = dd.read_csv('../data/ecommerce-new/final/dom-full-*.csv').drop('Unnamed: 0', axis='columns')

In [3]:
# get url stats
sum_df = data.loc[:, ['url', 'detail_description_label', 'detail_image_label', 'detail_price_label',
       'detail_title_label', 'list_image_label', 'list_price_label',
       'list_title_label']]
sum_df = sum_df.groupby('url').sum().reset_index()
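
To make the aggregation above concrete, here is a minimal pandas sketch on made-up data. It assumes each row of the dataset describes one node of a page and that a label column holds 1.0 where the node carries that label; the URLs and values are hypothetical and are not taken from the real dataset.

```python
import pandas as pd

# Made-up rows: several rows per url, 1.0 where the row carries the label (assumption).
toy = pd.DataFrame({
    'url': ['a.example/1', 'a.example/1', 'a.example/2', 'a.example/2'],
    'list_price_label': [1.0, 1.0, 0.0, 0.0],
    'list_title_label': [0.0, 1.0, 0.0, 0.0],
})

# Sum the label columns per page, mirroring the groupby in the cell above.
toy_sum = toy.groupby('url').sum().reset_index()

# A page counts as containing a label when at least one of its rows carries it.
toy_flags = toy_sum.set_index('url') > 0
print(toy_flags)
```

Comparing the per-page sums against zero is the same step the notebook uses to build `per_page`.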


In [8]:
# compute how many pages per site with each label
per_page = (sum_df.set_index('url').loc[:, ['detail_description_label', 'detail_image_label', 'detail_price_label',
       'detail_title_label', 'list_image_label', 'list_price_label', 'list_title_label']] > 0).reset_index()

per_page = per_page.assign(domain=per_page['url'].apply(get_domain_from_url, meta=('domain', str)))
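
The helper `get_domain_from_url` is imported from the local `utils` module and is not shown in this notebook. The sketch below is a hypothetical stand-in, assuming the helper simply returns the host part of the URL, which is consistent with the domain values listed in the next output; the real implementation may differ.

```python
from urllib.parse import urlparse


def get_domain_from_url_sketch(url):
    """Hypothetical stand-in for utils.get_domain_from_url: return the URL's host part."""
    return urlparse(url).netloc


# The example URL is made up for illustration.
print(get_domain_from_url_sketch('https://www.olx.ro/oferte/'))  # www.olx.ro
```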

In [9]:
# get the domains
domains = per_page['domain'].unique()
domains.compute()

0       www.piata-az.ro
1           lajumate.ro
2    www.aliexpress.com
3        www.amazon.com
4           www.emag.ro
5         www.okazii.ro
6            www.olx.ro
Name: domain, dtype: object

In [10]:
per_page.groupby('domain').sum().compute()

| domain | detail_description_label | detail_image_label | detail_price_label | detail_title_label | list_image_label | list_price_label | list_title_label |
|---|---|---|---|---|---|---|---|
| lajumate.ro | 62.0 | 54.0 | 63.0 | 62.0 | 0.0 | 60.0 | 60.0 |
| www.aliexpress.com | 4.0 | 4.0 | 0.0 | 4.0 | 1.0 | 1.0 | 1.0 |
| www.amazon.com | 0.0 | 22.0 | 20.0 | 23.0 | 37.0 | 36.0 | 36.0 |
| www.emag.ro | 7.0 | 10.0 | 9.0 | 10.0 | 76.0 | 86.0 | 86.0 |
| www.okazii.ro | 48.0 | 48.0 | 41.0 | 48.0 | 42.0 | 48.0 | 42.0 |
| www.olx.ro | 36.0 | 16.0 | 36.0 | 36.0 | 152.0 | 136.0 | 137.0 |
| www.piata-az.ro | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


### Selection
For each site, we select two kinds of datasets:
* the pages of the entire website (i.e. up to 10 pages containing each type of label, which may overlap, plus up to 10 pages containing no label whatsoever)
* up to 10 pages containing each type of label, written as one dataset per label
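
The selection rule in the list above can be sketched on a small made-up table. This is only an illustration, not the notebook's actual cell: the function name `select_urls`, the parameter `n`, and the toy URLs are assumptions introduced here, and the table is assumed to have the same shape as `per_page` (one row per URL, a `domain` column, and one boolean column per label).

```python
import pandas as pd


def select_urls(per_page, domain, label_names, n=10):
    """Return up to n URLs per label plus up to n unlabeled URLs for one domain."""
    on_domain = per_page['domain'] == domain
    urls = set()
    for label_name in label_names:
        # Pages on this domain that contain the label; keep the first n.
        urls |= set(per_page.loc[on_domain & per_page[label_name], 'url'].iloc[:n])
    # Pages on this domain that contain none of the labels; keep the first n.
    no_label = ~per_page[label_names].any(axis='columns')
    urls |= set(per_page.loc[on_domain & no_label, 'url'].iloc[:n])
    return urls


# Made-up per-page flags for two hypothetical domains.
toy_pages = pd.DataFrame({
    'url': ['s.example/1', 's.example/2', 's.example/3', 't.example/1'],
    'domain': ['s.example', 's.example', 's.example', 't.example'],
    'list_price_label': [True, True, False, True],
    'list_title_label': [False, True, False, False],
})

selected = select_urls(toy_pages, 's.example', ['list_price_label', 'list_title_label'], n=1)
print(sorted(selected))
```

Because the per-label selections are merged into a set, a page that carries several labels is only counted once, which is why the per-site dataset can hold fewer pages than the sum of the per-label limits.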

In [11]:
label_names = ['detail_description_label', 'detail_image_label', 'detail_price_label', 'detail_title_label', 'list_image_label', 'list_price_label', 'list_title_label']
domains = per_page['domain'].unique().compute()

per_page = per_page.compute()

In [8]:
# filter by urls and labels
for label_name in label_names:
    for domain in domains:
        # get the urls fitting the descriptions
        mask_with_label = per_page[label_name] & (per_page['domain'] == domain)
        
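        # get up to 10 pages each
        urls_with_label = per_page.loc[mask_with_label, 'url'].iloc[:10]
        # the frame is computed to pandas before writing, so a '*' placeholder would end up literally in the file name
        filename = '{0}-{1}.csv'.format(domain, label_name)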
        
        # write them to csvs
        data.loc[data['url'].isin(urls_with_label), :].compute().to_csv('../data/ecommerce-new/final/split-label/' + filename)


In [34]:
# filter and split by urls
# for each domain get 10 urls for each label and 10 without any
for domain in domains:
    urls = set()  # urls to use
    # get 10 urls for each label
    for label_name in label_names:
        # get the urls fitting the descriptions
        mask_with_label = per_page[label_name] & (per_page['domain'] == domain)
        urls_with_label = per_page.loc[mask_with_label, 'url'].iloc[:10]
        
        urls |= set(urls_with_label) # append them
    
    # get another 10 without any label on the domain
    mask_without_label = (~per_page[label_names].any(axis='columns')) &  (per_page['domain'] == domain)
    urls_without_label = per_page.loc[mask_without_label, 'url'].iloc[:10]
    urls |= set(urls_without_label)

    # write every selected url for this domain to csv (labeled and unlabeled pages)
    filename = '{0}.csv'.format(domain)
    data.loc[data['url'].isin(list(urls)), :].compute().to_csv('../data/ecommerce-new/final/split-url/' + filename)