# Webscraping
    * For this project we are going to create our own data set
    * The best tool for this is webscraping
        * Using tools to extract data from websites
        * This data is generally represented in HTML

## HTML
    * The standard markup language for web pages
    * HTML stores information inside nested tags

    * Here is an example HTML document
        <html>
        <head>
        <title>Sample Web Page</title>
        </head>
        <body>
        <h1>Welcome to My Web Page</h1>
        <p>Here is some sample text with a <a href="https://www.example.com">link</a>.</p>
        <p class="description">This paragraph contains a brief description.</p>
        </body>
        </html>
    * Thanks to the tags (the names inside angle brackets) in HTML, we can find information very quickly. If we are looking for the title of this document, we look for the title tag and grab the text between its opening and closing tags
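
To make the tag idea concrete, here is a minimal sketch that pulls the title out of a shortened copy of the sample page using only Python's built-in `html.parser` module. The class name `TitleGrabber` and the variable `SAMPLE_HTML` are made up for this illustration.

```python
from html.parser import HTMLParser

SAMPLE_HTML = """
<html>
<head>
<title>Sample Web Page</title>
</head>
<body>
<h1>Welcome to My Web Page</h1>
<p class="description">This paragraph contains a brief description.</p>
</body>
</html>
"""


class TitleGrabber(HTMLParser):
    """Collects the text found between <title> and </title>."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Only keep text that sits between the opening and closing title tags
        if self.in_title:
            self.title += data


parser = TitleGrabber()
parser.feed(SAMPLE_HTML)
print(parser.title)
```

The parser walks the document tag by tag, which is why well-named tags make information easy to locate.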

## Two main types of webscraping 
        * Static 
        * Dynamic 

### Static Webscraping
    * Once a page is loaded, all of its content is already generated and present in the HTML
    * This is the easiest kind of page to scrape and parse for the useful information in our data set
    * This works very well for simple or older websites, but nowadays much content is personalized and loaded in real time!

### Dynamic Webscraping 
    * Once a page is loaded, not all of the content is present; some of it is loaded by clicks, scrolls, and other interactions
    * This can be a very complex task without the right tools, but with them we can work around it!
    * The extraction process is very similar to the one for static pages, but the information must first be loaded
    

## Packages 
### Requests
    * Python package that lets us load these pages
    * It lets us get the HTML or other data from a webpage and can do many other tasks
### Beautiful Soup 
    * Python package that lets us extract this information
    * Beautiful Soup does not load web pages, but instead is designed to parse HTML that is already loaded
    * It contains methods that let us find information like titles, images, or anything else inside HTML or other web documents
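
As a small sketch of those methods, the snippet below parses a compact, made-up page held in the variable `PAGE` (an assumption for this example, not a real site) and uses `find`, `find_all`, and `get_text` to pull out a title, a link, and a paragraph.

```python
from bs4 import BeautifulSoup

# A compact stand-in page; any HTML string that is already loaded works the same way
PAGE = """
<html>
<head><title>Sample Web Page</title></head>
<body>
<p>Some text with a <a href="https://www.example.com">link</a>.</p>
<p class="description">This paragraph contains a brief description.</p>
</body>
</html>
"""

soup = BeautifulSoup(PAGE, "html.parser")

title = soup.title.get_text()                                 # text inside <title>
link = soup.find("a")["href"]                                 # attribute of the first <a>
description = soup.find("p", class_="description").get_text() # <p> with a given class
paragraphs = soup.find_all("p")                               # every <p> on the page

print(title)
print(link)
print(description)
print(len(paragraphs))
```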
### Selenium
    * Although requests can load these pages, the HTML returned by requests is static.
    * Selenium is a browser automation tool that lets us not just load pages automatically but also click, scroll, and do other things that mimic human interaction
    * This solves the main problem with dynamic pages

In [1]:
pip install requests

Note: you may need to restart the kernel to use updated packages.


In [2]:
pip install beautifulsoup4

Note: you may need to restart the kernel to use updated packages.


In [3]:
pip install selenium

Note: you may need to restart the kernel to use updated packages.


In [2]:
import requests 
from bs4 import BeautifulSoup
import selenium

## Goat 
    * Secondhand market for reselling sneakers and clothes
    * Contains prices, images, and other data perfect for our dataset!
    * There are no repeated sneakers; each listing is a distinct sneaker

Let's take a look at the static HTML for Goat!

In [3]:
# Page we are looking at scraping
url = 'https://www.goat.com/sneakers?gender=men'

# Requests to get the html for this page 
response = requests.get(url)

# Use soup to parse this html
soup = BeautifulSoup(response.text, 'html.parser')

# Make this html pretty
pretty_html = soup.prettify()
print(pretty_html)

<!DOCTYPE html>
<html lang="en-us">
 <head>
  <meta charset="utf-8"/>
  <link href="/images/icons/apple-touch-icon.png" rel="apple-touch-icon"/>
  <link href="/favicon.ico" rel="icon"/>
  <link href="/manifest.json" rel="manifest"/>
  <meta content="#000" name="theme-color"/>
  <meta content="summary_large_image" name="twitter:card"/>
  <meta content="width=device-width, initial-scale=1.0 maximum-scale=1.0" name="viewport"/>
  <meta content="/images/icons/goat-logo-512.png" property="og:image"/>
  <link href="https://www.goat.com/sneakers" hreflang="en-us" rel="alternate"/>
  <link href="https://www.goat.com/en-gb/sneakers" hreflang="en-gb" rel="alternate"/>
  <link href="https://www.goat.com/en-ca/sneakers" hreflang="en-ca" rel="alternate"/>
  <link href="https://www.goat.com/en-au/sneakers" hreflang="en-au" rel="alternate"/>
  <link href="https://www.goat.com/fr-fr/sneakers" hreflang="fr-fr" rel="alternate"/>
  <link href="https://www.goat.com/en-nl/sneakers" hreflang="en-nl" rel="al

### Goat scraping
    * Using Chrome's Inspect tool, we can find a sneaker in the listings, identify the elements we want, and search for them in the original HTML.
    * For example, clicking on a listed Jordan 4 shows the HTML tags that hold its data

    



In [5]:
price_span = soup.find('span', class_='LocalizedCurrency__Amount-sc-yoa0om-0 jDDuev')
if price_span:
    price = price_span.get_text()
    print(price)
else:
    print('Price information could not be found.')

$217
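
Class names such as `LocalizedCurrency__Amount-sc-yoa0om-0 jDDuev` look auto-generated, so the trailing parts may change when the site is rebuilt (an assumption, not something verified here). A more tolerant lookup, sketched below on a made-up fragment, matches any `span` whose class attribute contains `LocalizedCurrency__Amount`. The wrapper `div` and its `data-qa` value are invented for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating one listing; the real page markup may differ
FRAGMENT = """
<div data-qa="example_listing">
  <img data-qa="grid_cell_product_image" alt="Air Jordan 4 Retro 'Red Cement'" src="placeholder.gif"/>
  <span class="LocalizedCurrency__Amount-sc-yoa0om-0 jDDuev">$217</span>
</div>
"""

soup = BeautifulSoup(FRAGMENT, "html.parser")

# [class*="..."] matches when the class attribute contains the given substring
price_span = soup.select_one('span[class*="LocalizedCurrency__Amount"]')
price = price_span.get_text(strip=True) if price_span else None
print(price)
```

If the stable part of the name ever changes too, the selector returns `None` instead of raising, which makes the failure easy to detect.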


### Results of static scraping 
    * Notice that the information pulled from the statically loaded page differs from what the browser shows. We have found the sneaker, but it has a placeholder for the information we need.
    * We are going to need to load this information dynamically
    

In [25]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time

# Selenium takes options that configure the browser, such as window size and user agent
options = webdriver.ChromeOptions()
options.add_argument('window-size=1920x1080')
options.add_argument('user-agent=Mozilla/5.0')

url = 'https://www.goat.com/sneakers?gender=men'
# Open this browser
browser = webdriver.Chrome(options=options)
browser.get(url)
time.sleep(2)

try:
    # Wait until element becomes present, a main benefit of selenium
    element = WebDriverWait(browser, 10).until(
        EC.presence_of_element_located((By.XPATH, "//img[@alt=\"Air Jordan 4 Retro 'Red Cement'\"]"))
    )
    # grab the html 
    raw = browser.page_source
    # use soup to find this element 
    soup = BeautifulSoup(raw, 'html.parser')
    image = soup.find('img', alt="Air Jordan 4 Retro 'Red Cement'")
    # if element exists, print it!
    if image:
        pretty_html = image.prettify()
        print(pretty_html)
    else:
        print("Image not found")
finally:
    browser.quit()





In [8]:
# Code to show the prices 
price_span = soup.find('span', class_='LocalizedCurrency__Amount-sc-yoa0om-0 jDDuev')
if price_span:
    price = price_span.get_text()
    print(price)
else:
    print('Price information could not be found.')

$217


### Results of Dynamic Scraping
    * Dynamic scraping allowed us to extract the information
    * This lets us see the title, price, and image, which is all we need for our dataset
    * We still need to extract all 10,000-plus listings, not just one!

In [11]:
options = webdriver.ChromeOptions()
options.add_argument('window-size=1920x1080')
options.add_argument('user-agent=Mozilla/5.0')

url = 'https://www.goat.com/sneakers?gender=men'
browser = webdriver.Chrome(options=options)
browser.get(url)
try:
    WebDriverWait(browser, 10).until(
        EC.presence_of_all_elements_located((By.CSS_SELECTOR, "img[data-qa='grid_cell_product_image']"))
    )
    raw_html = browser.page_source
    soup = BeautifulSoup(raw_html, 'html.parser')
    # Finding image data; all product images carry this grid_cell_product_image attribute
    images = soup.find_all('img', {'data-qa': 'grid_cell_product_image'})
    image_data = [{'src': img['src'], 'alt': img.get('alt', 'No Alt')} for img in images]

    # Finding price data
    price_spans = soup.find_all('span', class_='LocalizedCurrency__Amount-sc-yoa0om-0')
    prices = [span.get_text() for span in price_spans]

    # Assuming each product has one price and one image in the same order
    products = [{'title': data['alt'], 'image': data['src'], 'price': price} for data, price in zip(image_data, prices)]

    # Print results
    for product in products:
        print(f"Title: {product['title']}, Image: {product['image']}, Price: {product['price']}")

finally:
    browser.quit()

Title: Air Jordan 4 Retro 'Red Cement', Image: data:image/gif;base64,R0lGODlhAQABAIAAAP///wAAACH5BAEAAAAALAAAAAABAAEAAAICRAEAOw==, Price: $217
Title: Dunk Low Premium 'Light Orewood Brown', Image: data:image/gif;base64,R0lGODlhAQABAIAAAP///wAAACH5BAEAAAAALAAAAAABAAEAAAICRAEAOw==, Price: $96
Title: Yeezy Slides 'Slate Marine', Image: data:image/gif;base64,R0lGODlhAQABAIAAAP///wAAACH5BAEAAAAALAAAAAABAAEAAAICRAEAOw==, Price: $120
Title: Dunk Low 'Clear Jade', Image: data:image/gif;base64,R0lGODlhAQABAIAAAP///wAAACH5BAEAAAAALAAAAAABAAEAAAICRAEAOw==, Price: $87
Title: Air Jordan 12 Retro 'Cherry' 2023, Image: data:image/gif;base64,R0lGODlhAQABAIAAAP///wAAACH5BAEAAAAALAAAAAABAAEAAAICRAEAOw==, Price: $83
Title: Air Jordan 8 Retro 'Playoff' 2023, Image: data:image/gif;base64,R0lGODlhAQABAIAAAP///wAAACH5BAEAAAAALAAAAAABAAEAAAICRAEAOw==, Price: $110
Title: Air Jordan 1 Retro High OG 'Palomino', Image: data:image/gif;base64,R0lGODlhAQABAIAAAP///wAAACH5BAEAAAAALAAAAAABAAEAAAICRAEAOw==, Price: $209
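
The cell above pairs images and prices by position with `zip`, which silently stops at the shorter list. As a hedged sketch, the hypothetical helper `pair_products` below refuses to build rows when the two lists have different lengths, so one missing price cannot shift every later row. The sample `src` values are placeholders for the example.

```python
def pair_products(image_data, prices):
    """Pair image records with prices, failing loudly on a length mismatch."""
    if len(image_data) != len(prices):
        raise ValueError(
            f"{len(image_data)} images but {len(prices)} prices; rows would be misaligned"
        )
    return [
        {"title": data["alt"], "image": data["src"], "price": price}
        for data, price in zip(image_data, prices)
    ]


images = [
    {"alt": "Air Jordan 4 Retro 'Red Cement'", "src": "a.png"},
    {"alt": "Dunk Low 'Clear Jade'", "src": "b.png"},
]
products = pair_products(images, ["$217", "$87"])
print(products[1]["price"])

# One price missing: the helper raises instead of quietly dropping a row
try:
    pair_products(images, ["$217"])
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
print(mismatch_detected)
```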

## Dynamic Content Problems
    * Modern pages often load images lazily, so we can notice that after a certain point the images are placeholders
    * Let's introduce some scrolling to see if we can load them
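
In the output above, every placeholder `src` begins with `data:image/gif;base64`, while a loaded image has an `https://` URL. A small check like the one below (the function name `is_placeholder` is only illustrative) makes it easy to count how many listings still need loading.

```python
def is_placeholder(src):
    """Return True when an image src is an inline data URI rather than a real URL."""
    return src.strip().lower().startswith("data:")


srcs = [
    "data:image/gif;base64,R0lGODlhAQABAIAAAP///wAAACH5BAEAAAAALAAAAAABAAEAAAICRAEAOw==",
    "https://image.goat.com/transform/v1/attachments/product_template_pictures/images/092/326/342/original/1200361_00.png?action=crop&width=750",
]
missing = [s for s in srcs if is_placeholder(s)]
print(len(missing))
```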

## Goat Structure
    * Listings are loaded 25 at a time, triggered by scrolling to the bottom of the screen
    * This means we can load all listings by repeatedly scrolling to the bottom until we reach the limit
    * This will look like a loop that runs from the start until this limit
    * We can add commands like wait and sleep to make sure this information is loaded
        * A wait returns as soon as the element is found, or times out after a set amount of time
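
The loop described above can be sketched without a browser by passing in the two page operations as functions. Everything named here (`scroll_until_stable`, `FakePage`, `max_rounds`) is an assumption for illustration; with Selenium, `get_height` would wrap `browser.execute_script("return document.body.scrollHeight")` and `scroll_to_bottom` would wrap the `window.scrollTo` call.

```python
import time


def scroll_until_stable(get_height, scroll_to_bottom, pause=0.0, max_rounds=500):
    """Scroll repeatedly until the page height stops growing.

    Returns the number of scrolls performed.
    """
    last_height = get_height()
    for rounds in range(1, max_rounds + 1):
        scroll_to_bottom()
        time.sleep(pause)  # give new listings time to load
        new_height = get_height()
        if new_height == last_height:
            return rounds  # nothing new appeared, so this is the end
        last_height = new_height
    return max_rounds  # safety cap so the loop cannot run forever


class FakePage:
    """Stand-in for a browser: grows by one batch per scroll, up to a limit."""

    def __init__(self, batches):
        self.height = 1000
        self.remaining = batches

    def get_height(self):
        return self.height

    def scroll(self):
        if self.remaining > 0:
            self.height += 1000
            self.remaining -= 1


page = FakePage(batches=3)
rounds = scroll_until_stable(page.get_height, page.scroll)
print(rounds, page.height)
```

The extra scroll at the end is what confirms that the height has stopped changing.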

In [46]:
options = webdriver.ChromeOptions()
options.add_argument('window-size=1920x1080')
options.add_argument('headless')
options.add_argument('user-agent=Mozilla/5.0')
# To keep the run time manageable, instead of scraping all sneakers, let's take a filtered subset
url = 'https://www.goat.com/sneakers?gender=men'
browser = webdriver.Chrome(options=options)
browser.get(url)
try:
    # This is the current height of the page
    last_height = browser.execute_script("return document.body.scrollHeight")
    # Keep scrolling until the page stops growing
    while True:
        # Scroll down to the bottom of the page
        browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(8)  # Give the next batch of listings time to load
        # Wait until the browser reports the document as fully loaded
        WebDriverWait(browser, 10).until(
            lambda driver: driver.execute_script("return document.readyState") == "complete"
        )
        new_height = browser.execute_script("return document.body.scrollHeight")
        if new_height == last_height:  # The page height has not changed, so this is the end of the page!
            break
        last_height = new_height

    # Make sure the product images exist before parsing
    WebDriverWait(browser, 10).until(
        EC.presence_of_all_elements_located((By.CSS_SELECTOR, "img[data-qa='grid_cell_product_image']"))
    )
    raw_html = browser.page_source
    soup = BeautifulSoup(raw_html, 'html.parser')
    # Finding image data; all product images carry this grid_cell_product_image attribute
    images = soup.find_all('img', {'data-qa': 'grid_cell_product_image'})
    image_data = [{'src': img['src'], 'alt': img.get('alt', 'No Alt')} for img in images]

    # Finding price data
    price_spans = soup.find_all('span', class_='LocalizedCurrency__Amount-sc-yoa0om-0')
    prices = [span.get_text() for span in price_spans]

    # Assuming each product has one price and one image in the same order
    products = [{'title': data['alt'], 'image': data['src'], 'price': price} for data, price in zip(image_data, prices)]

    # Print results
    for product in products:
        print(f"Title: {product['title']}, Image: {product['image']}, Price: {product['price']}")

finally:
    browser.quit()



Title: Air Jordan 4 Retro 'Red Cement', Image: https://image.goat.com/transform/v1/attachments/product_template_pictures/images/092/326/342/original/1200361_00.png?action=crop&width=750, Price: $227
Title: Dunk Low Premium 'Light Orewood Brown', Image: https://image.goat.com/transform/v1/attachments/product_template_pictures/images/092/362/981/original/1172478_00.png.png?action=crop&width=750, Price: $96
Title: Yeezy Slides 'Slate Marine', Image: https://image.goat.com/transform/v1/attachments/product_template_pictures/images/090/667/634/original/1209357_00.png.png?action=crop&width=750, Price: $120
Title: Dunk Low 'Clear Jade', Image: https://image.goat.com/transform/v1/attachments/product_template_pictures/images/090/898/230/original/1218094_00.png.png?action=crop&width=750, Price: $81
Title: Air Jordan 12 Retro 'Cherry' 2023, Image: https://image.goat.com/transform/v1/attachments/product_template_pictures/images/094/167/791/original/1152263_00.png.png?action=crop&width=750, Price: $

## Modifications
    * Scrolling straight to the bottom skips over some images, as shown by the placeholder in place of a link
    * We must scroll all the way down to load the new sneaker listings, then scroll back up to load the skipped images, and finally scroll down to the bottom again, repeating this process!
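
One alternative, which is not what the cell below does, is to move down the page in fixed steps so that every image passes through the viewport once. The hypothetical helper `scroll_positions` sketches how those offsets could be computed; the 750-pixel step is only an example value.

```python
def scroll_positions(total_height, step):
    """Return the y offsets to visit so that the whole page is covered."""
    if step <= 0:
        raise ValueError("step must be positive")
    positions = list(range(0, total_height, step))
    positions.append(total_height)  # finish exactly at the bottom
    return positions


print(scroll_positions(2000, 750))
```

Each offset `y` could then be passed to the browser with `window.scrollTo(0, y)`, followed by a short pause.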

In [51]:
options = webdriver.ChromeOptions()
options.add_argument('window-size=1920x1080')
options.add_argument('user-agent=Mozilla/5.0')

url = 'https://www.goat.com/sneakers?gender=men'
browser = webdriver.Chrome(options=options)
browser.get(url)

try:
    last_height = browser.execute_script("return document.body.scrollHeight")
    while True:
        # Scroll down to the bottom of the page
        browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(7)  # Wait to load page

        # Scroll back up in two steps so the skipped images enter the viewport
        browser.execute_script("window.scrollBy(0, -750);")  # Scroll up by 750 pixels
        time.sleep(6)
        
        browser.execute_script("window.scrollBy(0, -750);")  # Scroll up by another 750 pixels
        time.sleep(3)

        browser.execute_script("window.scrollBy(0, 375);")  # Scroll down by 375 pixels
        time.sleep(3)

        browser.execute_script("window.scrollBy(0, 500);")  # Scroll down by 500 pixels
        time.sleep(3)
        # Wait until the browser reports the document as fully loaded
        WebDriverWait(browser, 10).until(
            lambda driver: driver.execute_script("return document.readyState") == "complete"
        )

        new_height = browser.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height

    # Make sure the product images exist before parsing
    WebDriverWait(browser, 10).until(
        EC.presence_of_all_elements_located((By.CSS_SELECTOR, "img[data-qa='grid_cell_product_image']"))
    )
    raw_html = browser.page_source
    soup = BeautifulSoup(raw_html, 'html.parser')
    # Finding image data; all product images carry this grid_cell_product_image attribute
    images = soup.find_all('img', {'data-qa': 'grid_cell_product_image'})
    image_data = [{'src': img['src'], 'alt': img.get('alt', 'No Alt')} for img in images]

    # Finding price data
    price_spans = soup.find_all('span', class_='LocalizedCurrency__Amount-sc-yoa0om-0')
    prices = [span.get_text() for span in price_spans]

    # Assuming each product has one price and one image in the same order
    products = [{'title': data['alt'], 'image': data['src'], 'price': price} for data, price in zip(image_data, prices)]

    # Print results
    for product in products:
        print(f"Title: {product['title']}, Image: {product['image']}, Price: {product['price']}")

finally:
    browser.quit()


# Save the scraped listings to a DataFrame and a CSV file
import pandas as pd

df = pd.DataFrame(products)
df.to_csv('products.csv', index=False)





Title: Air Jordan 4 Retro 'Red Cement', Image: https://image.goat.com/transform/v1/attachments/product_template_pictures/images/092/326/342/original/1200361_00.png?action=crop&width=750, Price: $227
Title: Air Jordan 12 Retro 'Cherry' 2023, Image: https://image.goat.com/transform/v1/attachments/product_template_pictures/images/094/167/791/original/1152263_00.png.png?action=crop&width=750, Price: $213
Title: Dunk Low Premium 'Light Orewood Brown', Image: https://image.goat.com/transform/v1/attachments/product_template_pictures/images/092/362/981/original/1172478_00.png.png?action=crop&width=750, Price: $96
Title: Yeezy Slides 'Slate Marine', Image: https://image.goat.com/transform/v1/attachments/product_template_pictures/images/090/667/634/original/1209357_00.png.png?action=crop&width=750, Price: $120
Title: Bapesta #1 'Black', Image: https://image.goat.com/transform/v1/attachments/product_template_pictures/images/093/829/390/original/1276374_00.png.png?action=crop&width=750, Price: $81

### Saving this data to a df and csv file!
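
Before saving, it can help to turn price strings such as `$227` into numbers. The sketch below uses the standard `csv` module in place of pandas and writes to an in-memory buffer so it runs anywhere. The helper `parse_price` and the sample rows are made up for illustration.

```python
import csv
import io


def parse_price(text):
    """Convert a string like '$1,217' to a float; return None if no number remains."""
    cleaned = text.replace("$", "").replace(",", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        return None


# Made-up rows; the second one imitates a listing whose price did not load
products = [
    {"title": "Example Sneaker A", "image": "a.png", "price": "$227"},
    {"title": "Example Sneaker B", "image": "b.png", "price": "$"},
]

buffer = io.StringIO()  # stands in for open("products.csv", "w", newline="")
writer = csv.DictWriter(buffer, fieldnames=["title", "image", "price"])
writer.writeheader()
for product in products:
    writer.writerow({**product, "price": parse_price(product["price"])})

print(buffer.getvalue())
```

A missing price is written as an empty cell, which pandas would read back as a missing value.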