### Data labeling
In this notebook we will be attempting to label scraped data from several e-commerce websites to later use the labels to train models. We will be associating semantical label to the data. This way, we will be able to test model's performance at classifying tags on the same sites, and also the model's ability to generalize to other websites. 


We will be assessing the extracted data's quality and giving it the following semantic labels, relevant to e-commerce websites:
* list/detail_price
* list/detail_description
* list/detail_name
* list/detail_image
* product container

The price, description and name are self-explanatory, having a detail/list version representing whether the data is on a list or on a detail page. We will also be extracting *product containers*. They represent boxes which encapsulate a products data and would help associate information into cohesive groups.

![Olx label example](imgs/labelvis.png)

### Constants 

In [1]:
# constatants
FIRST_RAW_FILENAME = '../data/raw/first-ecommerce.csv'

In [2]:
%matplotlib inline
import sys, os
import re

# lxml
from lxml import etree

# pandas
import pandas as pd

# numpy, matplotlib, seaborn
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# add the library path
sys.path.append(os.path.join(os.getcwd(), "../src"))
from features import extract_features_from_html, extract_features_from_df

# this styling is purely my preference
# less chartjunk
sns.set_context('notebook', font_scale=1.5, rc={'line.linewidth': 2.5})
sns.set(style='ticks', palette='Set2')

### Data reading
The data is not split on a per domain basis, so we will be doing so, to ease our work

In [3]:
# print some statistics
WEBSITES = ['emag', 'okazii', 'olx', 'aliexpress', 'amazon', 'lajumate', 'piata-az']
df = pd.read_csv(FIRST_RAW_FILENAME)
dfs = {}

# print how many pages are there per site
print('number of pages per site')
print('='*80)
for site in WEBSITES:
    dfs[site] = df.loc[df['url'].str.contains(site), :]  # it's naive, but works
    print('{0}: {1} pages'.format(site, dfs[site].size))
    
dfs['amazon'].head()

number of pages per site
emag: 368 pages
okazii: 316 pages
olx: 384 pages
aliexpress: 88 pages
amazon: 394 pages
lajumate: 314 pages
piata-az: 86 pages


Unnamed: 0,html,url
768,"<!DOCTYPE html><html class="" a-js a-audio a-vi...",https://www.amazon.com/
778,"<!DOCTYPE html><html class=""a-js a-audio a-vid...",https://www.amazon.com/stream/ref=nav_upnav_La...
780,"<!DOCTYPE html><html class="" a-js a-audio a-vi...",https://www.amazon.com/dp/B00DBYBNEE/140-38209...
781,"<!DOCTYPE html><html class="" a-js a-audio a-vi...",https://www.amazon.com/ref=nav_logo/146-105384...
782,"<!DOCTYPE html><html class="" a-js a-audio a-vi...",https://www.amazon.com/dp/B00DBYBNEE/144-77301...


### Labeling
In this section, we will be labeling tags on all pages of a website based on url filters to select list and detail pages. For example:
`https://www.olx.ro/oferta/jante-mercedes-amg-ID8Vi7d.html` is a detail page. On each such page we will apply a custom-made XPath to extract the desired elements. The order of the labels is given by `lxml` iteration order so it's consistent with the feature extraction step.

In [4]:
def match_urls(df, url_regex):
    """Returns a datarame of all the urls, telling whther they are matched or not"""
    url_pattrn = re.compile(url_regex)
    url_filter = lambda url: bool(url_pattrn.match(url))
    
    df['url_match'] = df['url'].apply(url_filter)
    df['tree'] = df['html'].apply(etree.HTML)  # get the nodes
    
    return df


def label_pages(df, url_regex, rules):
    """Returns a DataFrame wherhe each row represents a tag 
    and there are columns that specify whther the tag has a label or not.
    
    Rules is a dictionary of {label: xpath} which tells the xpath to match
    tags with the corresponding label.
    """
    df = match_urls(df, url_regex)  # match the urls
    
    for label, xpath in rules.items():
        # add columns with the tags matching the xpaths
        label_tag_col_name = '{0}_label_tags'.format(label)
        df[label_tag_col_name] = None  # set it to the list of tags
        
        # the series of lists of tags that match the rule
        # applied only to matching ulrs
        tag_series = df.loc[df['url_match'], 'tree'].apply(lambda tree: tree.xpath(xpath))
        df.loc[df['url_match'], label_tag_col_name] = tag_series
    
    # iterate over rows, explode into component tags and mark them 
    # as being labeled or not
    pages_labels = []  # the row values
    for row_data in df.iterrows():
        row = row_data[1]  # ignore the index, which is on the first pos
        
        # a series of the tags
        tags_series = pd.Series(list(row['tree'].iter()))  
        label_cols = []
        
        for label, xpath in rules.items():
            label_tag_col_name = '{0}_label_tags'.format(label)
            label_col_name = '{0}_label'.format(label)
            
            if row['url_match']:
                # if the url is among the sought ones
                # chekc if the tags are in the current list of xpath-ed tags
                label_series = tags_series.apply(lambda tag: tag in row[label_tag_col_name])
                
            else:
                label_series = pd.Series(data=np.zeros(tags_series.size, dtype=bool))
                
            label_series.name = label_col_name  # rename it to the final name
            label_cols.append(label_series)
            
        # cocnatenate the labels for the current page
        page_label_df = pd.concat(label_cols, axis='columns')
        #page_label_df['url'] = row['url']  # add the url
        pages_labels.append(page_label_df)
    
    # return the veritcally stacked values of all the pages
    return pd.concat(pages_labels, axis='rows', ignore_index=True)


def label_data(df, rules):
    """Returns a df of all the labels given a set of url filters
    and tag xpaths"""
    return pd.concat([label_pages(df, rule['url_regex'], rule['xpaths']) for rule in rules], axis='columns')

In [5]:
# print urls matching the pattern
# just the first 10 for a quick sanity-check

df_match = match_urls(dfs['olx'], r'.*oferta.*')
print('\n'.join(df_match[df_match['url_match']].iloc[1:10].url))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  


https://www.olx.ro/oferta/boiler-80l-stare-perfecta-de-functionare-ID5HoTe.html#6e84df002a
https://www.olx.ro/oferta/etrieri-pod-cabluri-ID7d3PV.html#2bceab15ed
https://www.olx.ro/oferta/zmeur-polka-si-citrya-galben-ID6ZqZq.html#9689dfbbb9
https://www.olx.ro/oferta/mimoza-pudica-ID93rAh.html#6ac64d3a4f
https://www.olx.ro/oferta/volkswagen-passat-cc-r-line-170cp-extra-full-pret-fix-ID8GD57.html#6256e9ac30
https://www.olx.ro/oferta/mercedes-e220-anfab-2006-motor-2148cm-150-cp-inscris-ro-ID8Gk0P.html#6256e9ac30
https://www.olx.ro/oferta/impecabila-inmatriculata-ro-si-gpl-variante-ID7jflC.html#6256e9ac30
https://www.olx.ro/oferta/audi-a4-tdi-1-9-ID97DlF.html#6256e9ac30
https://www.olx.ro/oferta/vand-vw-polo-1-4i-ID97De9.html#6256e9ac30


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  import sys


In [6]:
# same
url_regex = r'.*oferta.*'
rules =  {'detail_price': r'//*[contains(concat( " ", @class, " " ), concat( " ", "arranged", " " ))]',
          'detail_title': r'//h1'}
df = label_pages(dfs['olx'], url_regex, rules)

# print hte pages that have the either of the labels
df[df['detail_title_label'] | df['detail_price_label']].head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  import sys
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


Unnamed: 0,detail_price_label,detail_title_label
6641,False,True
7622,False,True
186434,False,True
187598,False,True
189168,False,True


### Labeling the tags
From here on, we will be labeling tags by filtering urls from each site with regular expressions and extracting relevant tags with xpaths.

#### OLX

In [7]:
# define the rules
rules =  [
            {
                'url_regex': r'.*oferta.*',
                'xpaths': 
                {
                    'detail_price': r'//*[contains(concat( " ", @class, " " ), concat( " ", "arranged", " " ))]',
                    'detail_title': r'//h1',
                    'detail_description': r'//*[(@id = "textContent")]//*[contains(concat( " ", @class, " " ), concat( " ", "large", " " ))]',
                    'detail_image': r'//*[contains(concat( " ", @class, " " ), concat( " ", "scale4", " " ))]//*[contains(concat( " ", @class, " " ), concat( " ", "fleft", " " ))]'
                }
            },
            {
                'url_regex': r'((?!oferta).)*',
                'xpaths': 
                {
                    'list_container': r'//td[contains(@class,"offer")]',
                    'list_title': r'//td[contains(@class,"offer")]//a/strong',
                    'list_price': r'//td[contains(@class,"offer")]//p[contains(@class,"price")]/strong',
                    'list_image': r'//*[contains(concat( " ", @class, " " ), concat( " ", "scale4", " " ))]//*[contains(concat( " ", @class, " " ), concat( " ", "fleft", " " ))]'
                }
            }
        ]

# extract from them
olx_df = label_data(dfs['olx'], rules)

In [8]:
# sum all tags to see a total
olx_df.filter(axis='columns', regex=r'.*label').sum()

detail_price_label            30
detail_title_label            36
detail_description_label      36
detail_image_label            56
list_container_label        5750
list_title_label            5686
list_price_label            5574
list_image_label            5603
dtype: int64

#### Emag

In [9]:
rules =  [
            {
                'url_regex': r'.*/pd/[A-Z0-9]*/.*',
                'xpaths': 
                {
                    'detail_price': r'//*[contains(concat( " ", @class, " " ), concat( " ", "product-page-pricing", " " ))]//*[contains(concat( " ", @class, " " ), concat( " ", "product-new-price", " " ))]',
                    'detail_title': r'//*[contains(concat( " ", @class, " " ), concat( " ", "page-title", " " ))]',
                    'detail_description': r'//*[(@id = "description-body")]',
                    'detail_image': r'//*[(@id = "product-gallery")]//img'
                }
            },
            {
                'url_regex': r'.*',
                'xpaths': 
                {
                    'list_container': r'//div[contains(@class, "card-section-wrapper")]',
                    'list_title': r'//*[contains(concat( " ", @class, " " ), concat( " ", "product-title", " " ))]',
                    'list_price': r'//*[contains(concat( " ", @class, " " ), concat( " ", "product-new-price", " " ))]',
                    'list_image': r'//*[(@id = "card_grid")]//img'
                }
            }
        ]

# extract from them
emag_df = label_data(dfs['emag'], rules)

In [10]:
emag_df.filter(axis='columns', regex=r'.*label').sum()

detail_price_label             9
detail_title_label            10
detail_description_label       7
detail_image_label            18
list_container_label        1227
list_title_label            2127
list_price_label            2205
list_image_label            1227
dtype: int64

#### Lajumate

In [11]:
rules =  [
            {
                'url_regex': r'^((?!anunturi).)+$',
                'xpaths': 
                {
                    'detail_price': r'//*[(@id = "price")]',
                    'detail_title': r'//*[@id="main_info_holder"]//h1',
                    'detail_description': r'//*[(@id = "description")]//p',
                    'detail_image': r'//*[(@id = "holder_popup")]//img'
                }
            },
            {
                'url_regex': r'.*anunturi.*',
                'xpaths': 
                {
                    'list_container': r'//*[@id="list_cart_holder"]//*[contains(concat( " ", @class, " " ), concat( " ", "item_cart", " " ))]',
                    'list_title': r'//*[@id="list_cart_holder"]//*[contains(@class,"title")]',
                    'list_price': r'//*[(@id = "list_cart_holder")]//*[contains(concat( " ", @class, " " ), concat( " ", "shadow", " " ))]',
                    'list_image': r'//*[(@id = "image_holder")]//img'
                }
            }
        ]

# extract from them
lajumate_df = label_data(dfs['lajumate'], rules)

In [12]:
lajumate_df.filter(axis='columns', regex=r'.*label').sum()

detail_price_label            63
detail_title_label            62
detail_description_label      66
detail_image_label           317
list_container_label        1873
list_title_label            1873
list_price_label            1856
list_image_label               0
dtype: int64

In [13]:
print('\n'.join(dfs['lajumate'].url))

https://lajumate.ro/anunturi_casa-gradina.html
https://lajumate.ro/
https://lajumate.ro/
https://lajumate.ro/favorite/cautari
https://lajumate.ro/favorite/anunturi
https://lajumate.ro/anunturi_casa-gradina.html
https://lajumate.ro/anunt/nou
https://lajumate.ro/anunturi.html
https://lajumate.ro/anunturi_mobila-decoratiuni.html
https://lajumate.ro/anunturi_gradina-plante-pomi.html
https://lajumate.ro/anunturi_renovari-constructii-bricolaj.html
https://lajumate.ro/anunturi_climatizare-si-electrice.html
https://lajumate.ro/anunturi_menaj-uz-casnic.html
https://lajumate.ro/cort-pavilion-3x3-m-nou-pt-piata-expozitii-camping-1840880.html
https://lajumate.ro/gard-beton-garduri-beton-montaj-transport-inclus-1020761.html
https://lajumate.ro/araci-lemn-tratat-sustinere-plasa-antipasariantigrindina-5505636.html
https://lajumate.ro/pavilion-cort-3x-45-m-pliabil-nou-1541301.html
https://lajumate.ro/scaune-pentru-bar-1673383.html
https://lajumate.ro/ventilator-sero-impecabil-5588248.html
https://laju

#### Okazii

In [14]:
rules =  [
            {
                'url_regex': r'.*a[0-9]+$',
                'xpaths': 
                {
                    'detail_price': r'//*[@id="buy_form"]/div[1]/div[2]/span/span/span[1]/span/span[1]',
                    'detail_title': r'//*[contains(concat( " ", @class, " " ), concat( " ", "fn", " " ))]',
                    'detail_description': r'//*[(@id = "description_tmce")]',
                    'detail_image': r'//*[(@id = "gallery_show_big_img")]'
                }
            },
            {
                'url_regex': r'.*[^0-9][0-9]{0,5}$',
                'xpaths': 
                {
                    'list_container': r'//div[contains(@class, "list-item")]',
                    'list_title': r'//*[contains(@class,"item-title")]//h2//span',
                    'list_price': r'//*[contains(@class,"prSup")][span]',
                    'list_image': r'//*[contains(concat( " ", @class, " " ), concat( " ", "ajaxTrackable", " " ))]//img'
                }
            }
        ]

# extract from them
okazii_df = label_data(dfs['okazii'], rules)

In [15]:
okazii_df.filter(axis='columns', regex=r'.*label').sum()

detail_price_label            41
detail_title_label            56
detail_description_label      48
detail_image_label            48
list_container_label        1834
list_title_label            1690
list_price_label            1829
list_image_label            1834
dtype: int64

**NOTE:** Apparently the okazii data doesn't have any detail pages.

#### Piata-AZ

In [16]:
rules =  [
            {
                'url_regex': r'.*anunturi.*[0-9]+$',
                'xpaths': 
                {
                    'detail_price': r'//*[@id="detaliu-pret"]',
                    'detail_title': r'//h1',
                    'detail_description': r'//*[@id="anunt-descriere"]/p[1]',
                    'detail_image': r'//*[@id="slider"]/div/ul/li[1]/a/img'
                }
            },
            {
                'url_regex': r'.*(page|cautare).*',
                'xpaths': 
                {
                    'list_container': r'//div[contains(@class, "anunt-promo")]',
                    'list_title': r'//div[contains(@class, "anunt-promo")]//h3',
                    'list_price': r'//div[contains(@class, "anunt-promo")]//div[contains(@class, "pret")]',
                    'list_image': r'//div[contains(@class, "anunt-promo")]//div[contains(@class, "pic")]//img'
                }
            }
        ]

piataaz_df = label_data(dfs['piata-az'], rules)

In [17]:
piataaz_df.filter(axis='columns', regex=r'.*label').sum()

detail_price_label           0
detail_title_label          33
detail_description_label     0
detail_image_label           0
list_container_label        33
list_title_label            33
list_price_label            28
list_image_label            25
dtype: int64

**NOTE:** images are missing because they are loaded with ajax.

#### Aliexpress

In [18]:
rules =  [
            {
                'url_regex': r'.*item.*',
                'xpaths': 
                {
                    'detail_price': r'//*[@id="detaliu-pret"]',
                    'detail_title': r'//h1[contains(@itemprop,"name")]',
                    'detail_description': r'//*[@id="j-product-description"]//div[contains(@class,"description-content")]',
                    'detail_image': r'//*[@id="magnifier"]/div/a/img'
                }
            },
            {
                'url_regex': r'.*category.*',
                'xpaths': 
                {
                    'list_container': r'//li[contains(@class,"list-item")]',
                    'list_title': r'//li[contains(@class,"list-item")]//a[contains(@class,"product")]',
                    'list_price': r'//li[contains(@class,"list-item")]//span[contains(@class,"price")]',
                    'list_image': r'//li[contains(@class,"list-item")]//img[contains(@itemprop,"image")]'
                }
            }
        ]

aliexpress_df = label_data(dfs['aliexpress'], rules)

In [19]:
aliexpress_df.filter(axis='columns', regex=r'.*label').sum()

detail_price_label           0
detail_title_label           4
detail_description_label     4
detail_image_label           4
list_container_label        44
list_title_label            44
list_price_label            44
list_image_label            44
dtype: int64

**NOTE:** Aliexpress employs heavy scrape-protection. Apparently no data was scraped, it might not be viable to use its data

#### Amazon

In [20]:
rules =  [
            {
                'url_regex': r'.*',
                'xpaths': 
                {
                    'detail_price': r'//*[@id="priceblock_ourprice"]',
                    'detail_title': r'//*[@id="productTitle"]',
                    'detail_description': r'//*[@id="productDescription"]',
                    'detail_image': r'//*[@id="landingImage"]'
                }
            },
            {
                'url_regex': r'.*',
                'xpaths': 
                {
                    'list_container': r'//div[contains(@class,"s-item-container")]',
                    'list_title': r'//div[contains(@class,"s-item-container")]//h2',
                    'list_price': r'//span[contains(@class,"-price-whole")]',
                    'list_image': r'//div[contains(@class,"s-item-container")]//img[contains(@class,"s-access-image")]'
                }
            }
        ]

amazon_df = label_data(dfs['amazon'], rules)

In [21]:
amazon_df.filter(axis='columns', regex=r'.*label').sum()

detail_price_label            20
detail_title_label            23
detail_description_label       0
detail_image_label            22
list_container_label        1082
list_title_label            1050
list_price_label             829
list_image_label            1055
dtype: int64