## Label Extraction: Amazon Webpage

In [1]:
# import dependencies for system configuration
import os
import sys
import warnings
import time

# import dependencies for data handling
import pandas as pd 
import numpy as np 

Approach: Image Extraction

+ Loading Dataset of amazon urls
+ Extracting images per url 
+ Downloading mobile phone images with created folder
+ Storing images

## Data Loading

In [3]:
# Load the amazon weblink dataset
weblink_file = "amazon_links2.csv"
amazon_ds = pd.read_csv(weblink_file)
amazon_ds.head()

Unnamed: 0,links
0,https://www.amazon.com.be/-/en/s?i=electronics...
1,https://www.amazon.com.be/-/en/s?i=electronics...
2,https://www.amazon.com.be/-/en/s?i=electronics...
3,https://www.amazon.com.be/-/en/s?i=electronics...
4,https://www.amazon.com.be/-/en/s?i=electronics...


## Data Extraction from URL (P1): Test Case

+ Setup WebDriver
+ Extracting labels from amazon webpage
+ Storing labels in DataFrame

In [4]:
# Import Selenium dependencies for extracting label text from the Amazon webpage
from selenium import webdriver 
from selenium.webdriver.edge.options import Options
from selenium.webdriver.edge.service import Service
from selenium.webdriver.common.by import By 

In [5]:
# Set up Webdriver for Edge Browser
options = Options()
options.add_argument("--headless")

# Set up Browser service 
edgedriver_path = "D:\\Data_Engineering\\data_extraction\\msedgedriver.exe"
service = Service(executable_path=edgedriver_path)
driver = webdriver.Edge(service=service, options=options)

Test Case 1: Amazon with 10 labels

In [6]:
# Test case 1: 1 amazon link 
amazon_link = amazon_ds["links"].values[0]

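def extract_labels(webdriver, link) -> list:
    """Load `link` and return the product titles found on the page."""
    # Note: `webdriver` is a driver instance here and shadows the selenium module inside this function
    webdriver.get(link)
    time.sleep(5)  # fixed pause so the page can finish rendering

    # Use the driver that was passed in rather than the global `driver`
    product_titles = webdriver.execute_script(""" 
        var h2_selector = "h2.a-size-base-plus.a-spacing-none.a-color-base.a-text-normal";
        var titles = [];
        var elements = document.querySelectorAll(h2_selector);
        elements.forEach(function(element){
            titles.push(element.innerText);
        });
        return titles;
    """)

    return product_titles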

# Test case 2: 3 amazon links 
amazon_links = amazon_ds["links"].values[0:3]
item_title_lst = []
item_tot = 0
for i, link in enumerate(amazon_links): 
    # Extract the product titles from each link + take the size of each sequence
    text_sequence = extract_labels(webdriver=driver, link=link)
    seq_len = len(text_sequence)

    # Store text sequence in list of all item titles
    item_title_lst.append(text_sequence)
    print(f"Amazon URL {i + 1}: {seq_len} items (extracted successfully) ")
    item_tot += seq_len
print(f"Extraction Successful. Number of Items: {item_tot}")


Amazon URL 1: 2 items (extracted successfully) 
Amazon URL 2: 2 items (extracted successfully) 
Amazon URL 3: 24 items (extracted successfully) 
Extraction Successful. Number of Items: 28
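
The fixed `time.sleep(5)` in `extract_labels` always waits the full five seconds, and it may still be too short on a slow page. One alternative is to poll until titles appear. The sketch below is dependency-free and illustrative only: `poll_until_nonempty` and its parameters are hypothetical names, not part of the notebook, and Selenium's own `WebDriverWait` covers the same idea.

```python
import time
from typing import Callable, List


def poll_until_nonempty(fetch: Callable[[], List[str]],
                        timeout: float = 10.0,
                        interval: float = 0.5) -> List[str]:
    """Call `fetch` repeatedly until it returns a non-empty list
    or `timeout` seconds have passed (hypothetical helper)."""
    deadline = time.monotonic() + timeout
    while True:
        result = fetch()
        # Stop as soon as something was found, or when time is up
        if result or time.monotonic() >= deadline:
            return result
        time.sleep(interval)
```

Inside `extract_labels`, `fetch` could be a small function that runs the `execute_script` call, so the loop returns as soon as the first titles are present.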


In [7]:
item_titles = np.array([item for title_seq in item_title_lst for item in title_seq])
item_titles[:5]

array(['Samsung Galaxy A53 5G 256GB Black',
       'S906 Galaxy S22+ 5G 128GB Pink Gold',
       'Samsung Smartphone Galaxy A50 - Enterprise Edition SM-A505FZKSE28 128 GB 6.4 Inch (16.2 cm) Dual SIM Android™ 9.0 25',
       'SAMSUNG G389F Galaxy XCOVER 3 Dark Silver',
       'Nokia 105-2019 Dual SIM Black (TA-1174)'], dtype='<U140')

## Data Extraction P1: Iterative Extraction Loop

The iterative extraction loop collects the titles of all mobile phone items across the web pages. For each link it calls `extract_labels`, reports how many titles were scraped from that page, and adds the count to a running total. Once every link has been processed, it flattens the per-page lists into a single array and prints the total number of extracted labels across all requested URLs.

In [None]:
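# Iterative extraction loop
def iterative_extraction(amazon_links):
    """Extract product titles from every link and return them as one flat NumPy array."""
    # Accumulators: per-link title lists and the running item count
    item_title_lst = []
    item_tot = 0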

    # Extraction process
    for i, link in enumerate(amazon_links): 
        # Extract the product titles from each link + take the size of each sequence
        text_sequence = extract_labels(webdriver=driver, link=link)
        seq_len = len(text_sequence)

        # Store text sequence in list of all item titles
        item_title_lst.append(text_sequence)
        print(f"Amazon URL {i}: {seq_len} items (extracted successfully) ")
        item_tot += seq_len
       
    # Flatten all item titles
    item_titles = np.array([item for title_seq in item_title_lst for item in title_seq])
    print(f"Extraction Successful. Number of Items: {item_tot}")
    
    return item_titles
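
The loop in `iterative_extraction` stops at the first link that raises an exception, which discards the titles gathered so far. A minimal sketch of a more tolerant variant follows, assuming it is acceptable to skip a failing URL and report it afterwards. `extract_all` and its `extract` parameter are hypothetical names introduced for illustration. The extractor is passed in as a function so the loop can be tried without a browser.

```python
from typing import Callable, Iterable, List, Tuple


def extract_all(links: Iterable[str],
                extract: Callable[[str], List[str]]
                ) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Run `extract` on every link; return (all titles, failed links)."""
    titles: List[str] = []
    failures: List[Tuple[str, str]] = []

    for i, link in enumerate(links):
        try:
            page_titles = extract(link)
        except Exception as exc:
            # Record the failure and keep going with the next link
            failures.append((link, repr(exc)))
            print(f"Amazon URL {i}: failed ({exc!r})")
            continue

        # Extending keeps the result flat, so no separate flattening step is needed
        titles.extend(page_titles)
        print(f"Amazon URL {i}: {len(page_titles)} items")

    print(f"Extraction finished. Items: {len(titles)}, failed links: {len(failures)}")
    return titles, failures
```

With the notebook's objects, a call could look like `extract_all(full_page, lambda link: extract_labels(webdriver=driver, link=link))`.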

In [10]:
# Extract with full page
full_page = amazon_ds["links"].values
label_list = iterative_extraction(amazon_links=full_page)
print(f"Extracted labels: {label_list[:5]}")

Amazon URL 0: 2 items (extracted successfully) 
Amazon URL 1: 2 items (extracted successfully) 
Amazon URL 2: 24 items (extracted successfully) 
Amazon URL 3: 24 items (extracted successfully) 
Amazon URL 4: 24 items (extracted successfully) 
Amazon URL 5: 24 items (extracted successfully) 
Amazon URL 6: 24 items (extracted successfully) 
Amazon URL 7: 24 items (extracted successfully) 
Amazon URL 8: 24 items (extracted successfully) 
Amazon URL 9: 24 items (extracted successfully) 
Amazon URL 10: 24 items (extracted successfully) 
Amazon URL 11: 24 items (extracted successfully) 
Amazon URL 12: 24 items (extracted successfully) 
Amazon URL 13: 24 items (extracted successfully) 
Amazon URL 14: 24 items (extracted successfully) 
Amazon URL 15: 24 items (extracted successfully) 
Amazon URL 16: 24 items (extracted successfully) 
Amazon URL 17: 24 items (extracted successfully) 
Amazon URL 18: 10 items (extracted successfully) 
Amazon URL 19: 24 items (extracted successfully) 
Amazon URL 2

In [12]:
print(f"Extraction Successful. Number of Items: {len(label_list)}")

Extraction Successful. Number of Items: 1064


## Label Storage

Label storage acts as a simple database in which all extracted labels are kept. The labels are the product titles of the smartphones listed on the Amazon web pages. They name brands such as iPhone, Samsung Galaxy, Nokia and many more, which will serve as target labels for training the image recognition system. Once collected, the labels require further processing to meet the objectives of the project.

In [13]:
# Store label list (index=False avoids an extra "Unnamed: 0" column when the CSV is read back)
label_ds = pd.DataFrame({"Labels": label_list})
label_ds.to_csv("labels2.csv", index=False)
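
As one possible example of the further processing mentioned above, the raw product titles could be reduced to a brand label by keyword lookup. This is only a sketch under stated assumptions: the `BRAND_KEYWORDS` table and the `brand_from_title` function are hypothetical, and a real mapping would need to cover every brand present in the dataset.

```python
import re
from typing import Dict, Optional

# Hypothetical keyword-to-brand table; extend it to match the actual data
BRAND_KEYWORDS: Dict[str, str] = {
    "samsung": "Samsung",
    "galaxy": "Samsung",
    "nokia": "Nokia",
    "iphone": "Apple",
}


def brand_from_title(title: str) -> Optional[str]:
    """Return the brand of the first known keyword in `title`, or None."""
    # Split the lowercased title into alphanumeric tokens
    for word in re.findall(r"[a-z0-9]+", title.lower()):
        if word in BRAND_KEYWORDS:
            return BRAND_KEYWORDS[word]
    return None
```

Applied to the stored labels, this could look like `label_ds["Brand"] = label_ds["Labels"].map(brand_from_title)`. Titles with no known keyword map to `None` and can be reviewed by hand.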