## **Q3**  How can we scrape the material composition (percentages) of products on online stores? *(Jasmine)*

### Qualitative:
#### Problem -
- How can we use web scraping to get the material composition for clothing on online stores?
#### Hypothesis -
- This should be easy: web scraping reads the page's HTML and extracts the elements we need.
#### Context, Motivation & Rationale -
- We want the Chrome Extension to access the links provided in the descriptions of YouTube clothing haul videos, so that we can find the materials and generate a sustainability score for the items featured in each video.
#### Definitions, Data, and Methods -
- Using links to clothing items on online stores.
- Using Python and the web scraping libraries Beautiful Soup and Requests.

### Quantitative:

Step 1: import the libraries

In [17]:
from bs4 import BeautifulSoup
import requests
import time

Step 2: find a link to scrape. For this I will use an American Eagle clothing item:
https://www.ae.com/us/en/p/women/t-shirts/view-all-t-shirts-/ae-hey-baby-ribbed-t-shirt/2370_9630_275?menu=cat4840004


In [31]:
url = "https://www.ae.com/us/en/p/women/t-shirts/view-all-t-shirts-/ae-hey-baby-ribbed-t-shirt/2370_9630_275?menu=cat4840004"


Step 3: use basic web scraping to get the material percentages by searching the HTML text for "% " (a percent sign followed by a space)

In [32]:
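# Download the page source (Requests does not execute any JavaScript)
page = requests.get(url)
soup = BeautifulSoup(page.text, "html.parser")
# Collect every text node that contains "% ", e.g. "57% Cotton"
composition_elements = soup.find_all(string=lambda text: '% ' in str(text))
if composition_elements:
    for element in composition_elements:
        print(element)
else:
    print("none")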

none



### Qualitative (pt. 2):
#### Answer/Update to Question/Claim
- How can we use web scraping to get the material composition for clothing on online stores?
    - We can use scraping to obtain the HTML page source, but the search does not return the material text I am looking for.
    - My method cannot find "%" on the page. Why is that?
#### Summary
- We aren't able to find the materials with this method.
#### Uncertainty, Limitations & Caveats
- The materials are on the page when I look at it manually in a browser.
- This simple method works for some sites and links but not all. Why?
- The content seems to be loaded by JavaScript after the initial HTML is served, so it would be missing from the page source that Requests downloads.
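One way to test the JavaScript suspicion before reaching for heavier tools is to check whether composition-like text appears anywhere in the raw page source. The sketch below is only an illustration: `has_composition_text` is a hypothetical helper, and the two HTML strings are made-up stand-ins for a static page and a JavaScript-rendered shell, not real responses from any store.

```python
import re

# Matches text such as "57% Cotton": 1-3 digits, "%", then a letter.
# Requiring a letter after "%" skips CSS values like "width:100%".
COMPOSITION_PATTERN = re.compile(r"\b\d{1,3}\s*%\s*[A-Za-z]")


def has_composition_text(html_text):
    """Return True if the raw HTML contains something like '57% Cotton'."""
    return COMPOSITION_PATTERN.search(html_text) is not None


# Made-up example: materials are present in the static HTML.
static_html = "<div class='materials'><p>57% Cotton, 38% Recycled Polyester</p></div>"

# Made-up example: an empty shell that a script would fill in later.
js_shell_html = "<div id='root' style='width:100%'></div><script src='app.js'></script>"

print(has_composition_text(static_html))    # True
print(has_composition_text(js_shell_html))  # False
```

If this returns `False` for `page.text` while the materials are visible in the browser, that is consistent with the content being added after the initial load. The pattern is rough and can give false positives (for example on CSS such as `50% auto`), so it is a quick diagnostic rather than a parser.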
#### New Problems & Next Steps
- Is there a way to access the materials?
- Find a way to access the materials when they are dynamically loaded.

## **Q4**  How can we scrape product material composition (percentages) that is dynamically loaded by JavaScript? *(Jasmine)*

### Qualitative:
#### Problem -
- Is there a way to access the materials that are loaded in by JavaScript?
- Web Scraping is not as easy as I originally thought.
#### Hypothesis -
- If the HTML page source does not contain JavaScript-loaded content, there must be another way to access it, since the browser is able to display it.
#### Context, Motivation & Rationale -
- We want to obtain material information for many brands, not just the few that the simple code above can handle.
#### Definitions, Data, and Methods -
- Using Selenium WebDriver, a browser automation library that lets you control a web browser, interact with dynamic elements, and scrape the resulting content.

### Quantitative:

Step 1: import relevant libraries and tools

In [18]:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from lxml import html

Step 2: examine the website to find which elements need to be interacted with, then get the XPath of the interactive element and of the loaded content.

In [35]:
url = "https://www.ae.com/us/en/p/women/t-shirts/view-all-t-shirts-/ae-hey-baby-ribbed-t-shirt/2370_9630_275?menu=cat4840004"
interactive_element_xpath = '//*[@id="main-content-focus"]/div[2]/div[2]/div[2]/div/div[3]/div/div[1]/div[1]'
loaded_content_xpath = '//*[@id="main-content-focus"]/div[2]/div[2]/div[2]/div/div[3]/div/div[1]/div[2]/div/div[2]'

Step 3: use Selenium WebDriver to interact with the page and scrape the loaded material information

In [38]:
def scrape_american_eagle(url, interactive_element_xpath, loaded_content_xpath, wait_time=10):
    driver = None  # Defined up front so the finally block can always check it
    try:
        # Set Chrome options (defaults for now)
        chrome_options = Options()

        # Initialize the WebDriver
        driver = webdriver.Chrome(options=chrome_options)

        # Open the webpage
        driver.get(url)

        # Wait up to wait_time seconds for the interactive element to be clickable
        interactive_element = WebDriverWait(driver, wait_time).until(
            EC.element_to_be_clickable((By.XPATH, interactive_element_xpath))
        )

        # Click the interactive element to reveal the materials section
        interactive_element.click()

        # Wait for the loaded content to be visible
        loaded_element = WebDriverWait(driver, wait_time).until(
            EC.visibility_of_element_located((By.XPATH, loaded_content_xpath))
        )

        # Once loaded, return the text of the content
        return loaded_element.text

    except Exception as e:
        print(f"An error occurred: {str(e)}")
        return None

    finally:
        # Close the WebDriver, but only if it was started successfully
        if driver is not None:
            driver.quit()


dynamic_content = scrape_american_eagle(url, interactive_element_xpath, loaded_content_xpath)
if dynamic_content:
    print(dynamic_content)

Materials & Care
57% Cotton, 38% Recycled Polyester, 5% Elastane
Machine wash
Imported



### Qualitative (pt. 2):
#### Answer/Update to Question/Claim
- Is there a way to access the materials that are loaded in by JavaScript?
    - Yes!
- Web Scraping is not as easy as I originally thought.
    - Now I have learned about WebDriver and automation!
#### Summary
- I can now scrape dynamically loaded content, which I wasn't able to do before.
#### Uncertainty, Limitations & Caveats
- Not all websites are structured the same way, so I can't reuse these XPaths for every site.
- Selenium opens an automated test browser, which can trigger bot detection on some sites and prevent me from scraping what I need.
- This scrapes the relevant text, but how can I extract the material names and percentages from it?
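Since each site needs its own XPaths, one possible way to organize them is a lookup table keyed by hostname. This is only a sketch of that idea: `SITE_CONFIGS` and `get_site_config` are hypothetical names, and the only entry is the American Eagle pair of XPaths used above. Other stores would need their own entries found by inspecting their pages.

```python
from urllib.parse import urlparse

# Hypothetical per-site settings, keyed by hostname.
# Only the American Eagle XPaths come from this notebook.
SITE_CONFIGS = {
    "www.ae.com": {
        "interactive_element_xpath": '//*[@id="main-content-focus"]/div[2]/div[2]/div[2]/div/div[3]/div/div[1]/div[1]',
        "loaded_content_xpath": '//*[@id="main-content-focus"]/div[2]/div[2]/div[2]/div/div[3]/div/div[1]/div[2]/div/div[2]',
    },
}


def get_site_config(url):
    """Return the XPath settings for the URL's hostname, or None if the site is unknown."""
    hostname = urlparse(url).netloc.lower()
    return SITE_CONFIGS.get(hostname)


config = get_site_config(
    "https://www.ae.com/us/en/p/women/t-shirts/view-all-t-shirts-/ae-hey-baby-ribbed-t-shirt/2370_9630_275?menu=cat4840004"
)
print(config is not None)  # True

# A site without an entry returns None, so the caller can skip it.
print(get_site_config("https://example.com/some-shirt"))  # None
```

A known site's settings could then be passed into the scraping function, for example `scrape_american_eagle(url, **config)`, while unknown sites are skipped instead of failing on a wrong XPath.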
#### New Problems & Next Steps
- Scrape more websites!
- Some sites cannot be scraped because of bot detection. For ethical reasons I won't try to bypass it, as doing so can violate a site's terms of use.
- Use regular expressions to pull the information out into variables for the sustainability score.
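As a first pass at the regular expression step, the sketch below parses the text scraped above into a dictionary of material names and percentages. `parse_composition` and `MATERIAL_PATTERN` are hypothetical names, and the pattern assumes the "57% Cotton, 38% ..." layout seen on this one product page. Other stores may format their materials differently.

```python
import re

# A number, "%", then a material name made of letters, spaces, or hyphens.
# The name stops at a comma or a line break.
MATERIAL_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*%\s*([A-Za-z][A-Za-z \-]*)")


def parse_composition(text):
    """Return a dict mapping each material name to its percentage."""
    composition = {}
    for percent, material in MATERIAL_PATTERN.findall(text):
        composition[material.strip()] = float(percent)
    return composition


# The text returned by the Selenium scrape above.
scraped_text = (
    "Materials & Care\n"
    "57% Cotton, 38% Recycled Polyester, 5% Elastane\n"
    "Machine wash\n"
    "Imported"
)

composition = parse_composition(scraped_text)
print(composition)
# {'Cotton': 57.0, 'Recycled Polyester': 38.0, 'Elastane': 5.0}
```

One limitation of this sketch: if a page lists the same material more than once (for example separate body and lining sections), later values overwrite earlier ones, so multi-part garments would need extra handling.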