# Selenium Web Scraper - Grailed.com

### About Grailed.com:

Grailed is an online community driven marketplace where individuals can buy and sell clothes. Sellers can upload images and descriptions of their items whereas buyers can select which brand or designer they want to browse through. 

### About Selenium

Selenium is a headless browser, which means it enables users to mock human-browsing behavior. Text can be entered in search boxes, buttons can be clicked, and new tabs can be created. It's super fun! More information can be found in the following: [Selenium docs](http://selenium-python.readthedocs.io/getting-started.html), [locating elements](http://selenium-python.readthedocs.io/locating-elements.html#locating-elements), and [FAQs](http://selenium-python.readthedocs.io/faq.html). 


### Objective:

Personally, I spend a bit too much time on grailed. I am also a bit lazy, so I wanted to create a webscraper that will automatically return the brand, name, picture, size, and price of clothes that I like!

I have created 2 functions.
1. `grailed_scraper()` scrapes and filters items from grailed.com and returns the scraped information in a dataframe
2. `photo_download_displayer()` downloads each item's picture and displays each item's picture, price, and name in the Jupyter Notebook

More detailed information is located below:

#### Import Modules
The following modules are imported. Pandas is needed for constructing a dataframe, Numpy is used to randomly select numbers, selenium is needed for interacting with the webpage and scraping information, and HTML related modules are needed as well to filter and grab necessary information.

### Function #1
The objective is to scrape and filter items from grailed.com.
The function `grailed_scraper()` receives 3 inputs.
1. `search_text` - Allows us to input whatever query into the search box. For example, I can look up the brand 'Online Ceramics' (which you will see repeatedly in this example)
![grailed search box](../pics/GRAILED_SEARCH_BOX.jpeg)


2. `category` - Filter for what kind of article of clothing we want, such as Tops, Bottoms, or Shoes.
![grailed search box](../pics/GRAILED_CATEGORY.jpeg)


3. `size` - Filter for sizes, ie small/medium/large for Tops, size 30 for Bottoms, and size 10.5 for shoes
<img src="../pics/GRAILED_SIZE.jpeg" alt="drawing" width="250"/>


After taking in 3 inputs, the `grailed_scraper()` function will open up a separate chrome driver, fill in the search box, click on the categories and the sizes, scrape all of the necessary information of each item, such as its name, price, url, picture url, and posting date, and then neatly return a dataframe with the information.


In [5]:
import pandas as pd
import numpy as np
from selenium import webdriver
from selenium.webdriver.chrome.service import Service as Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains
from IPython.display import Image, display, HTML  
from bs4 import BeautifulSoup
import requests
from time import sleep

def grailed_scraper(search_text, category = 'Tops', size = 'medium'):  

    # Connect to the chrome driver
    chrome_options = Options()
    chrome_options.add_argument("--start-maximized")  # Using full screen
    driver = webdriver.Chrome(service=Service(executable_path="chromedriver.exe"), options=chrome_options)

    # Open new tab. We are going straight to grailed.com
    driver.get('https://www.grailed.com/')
    
    # Expand chrome window
    # driver.maximize_window()
    
    # Wait one second
    sleep(1)

    # Search whatever brand we want in the search text box
# Search whatever brand we want in the search text box
    driver.find_element("id", "header_search-input").send_keys(search_text)
    sleep(1)
    
    # Press the enter key
    driver.find_element("id", "header_search-input").send_keys(u'\ue007', Keys.ENTER)
    elem = driver.find_element(By.XPATH, "//div[@class='UsersAuthentication']")  # User login window will pop up
    ac = ActionChains(driver)
    ac.move_to_element(elem).move_by_offset(250, 0).click().perform()  # Clicking away from the login window
    driver.implicitly_wait(1)
    element = driver.find_elements(By.XPATH, "//*[contains(text(),'Show Only')]")  # Finding the show-only button again
    element[0].click()  # Clicking now that the pop-up is away
    # Clicks on sizes, then tops, then sizes!
    sleep(5)
    
    '''
    Category and size filtering process:
    If the category is equal to Tops, then the function will examine the following:
    If there are both tops and outerwear in the categories list, it will return the size for both categories.
    If there is only outerwear in the categories list, it will return the size for outerwear.
    If there is only tops in the categories list, it will return the size for tops.
    The nested while loops are used to tell selenium to continue to click on the right buttons until it works correctly
    '''
    
    if category == 'Tops':
        
    # Size Filter
        size_dict = {
                     'small'  : 'S/44-46',
                     'medium' : 'M/48-50',
                     'large'  : 'L/52-54'
                    }
        
        selected_size = size_dict[size]


        # Grab HTML page source
        html = driver.page_source
        html = BeautifulSoup(html, 'lxml')

        categories_list = [category.text for category in list(set(html.find_all('h3')))]

        # if outerwear and tops are part of the cateogories, filter for both 'tops' and 'outerwear'
        if ('Outerwear' in categories_list) and ('Tops' in categories_list):
            while True:
                try:
                    driver.find_element(By.XPATH, "//h3[contains(text(), 'Tops')]").click()
                    sleep(1)
                    try:
                        driver.find_element(By.XPATH, f"//input[@type='checkbox' and @name = '{selected_size}' and @value = '{selected_size}']").click()
                        sleep(1)
                    except:
                        print('Different Sizes')
                        break
                    driver.find_element(By.XPATH,"//h3[contains(text(), 'Outerwear')]").click()
                    sleep(1)
                    try:
                        driver.find_elements(By.XPATH, f"//*[contains(text(), '{selected_size}')]")[1].click()
                    except:
                        print('Different Sizes')
                        break
                except:
                     continue
                else:
                     break
        elif ('Outerwear' not in categories_list) and ('Tops' in categories_list):
            while True:
                try:
                    driver.find_element(By.XPATH, "//h3[contains(text(), 'Tops')]").click()
                    sleep(1)
                    driver.find_element(By.XPATH, f"//input[@type='checkbox' and @name = '{selected_size}' and @value = '{selected_size}']").click()
                    sleep(1)
                except:
                     continue
                else:
                     break
        elif ('Outerwear' in categories_list) and ('Tops' not in categories_list):
            while True:
                try:
                    driver.find_element(By.XPATH, "//h3[contains(text(), 'Outerwear')]").click()
                    sleep(1)
                    driver.find_element(By.XPATH, f"//input[@type='checkbox' and @name = '{selected_size}' and @value = '{selected_size}']").click()
                    sleep(1)
                except:
                     continue
                else:
                     break
    else: 
        while True:
                try:
                    driver.find_element(By.XPATH, f"//h3[contains(text(), '{category}')]").click()
                    sleep(1)
                    driver.find_element(By.XPATH, f"//input[@type='checkbox' and @name = '{size}' and @value = '{size}']").click()
                    sleep(1)
                except:
                     continue
                else:
                     break
    

    # Scroll all the way down to the page (because more items load as you go down)
    # https://stackoverflow.com/questions/20986631/how-can-i-scroll-a-web-page-using-selenium-webdriver-in-python
    SCROLL_PAUSE_TIME = 1

    # Get scroll height
    last_height = driver.execute_script("return document.body.scrollHeight")

    while True:
        # Scroll down to bottom
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

        # Wait to load page
        sleep(SCROLL_PAUSE_TIME)

        # Calculate new scroll height and compare with last scroll height
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height

    # Wait one second.
    sleep(1)

    # Grab the page source.

    html = driver.page_source

    # Beautiful Soup it!
    html = BeautifulSoup(html, 'lxml')


    # finding all posts on page
    all_posts = html.find_all('p', {'class' : "listing-size sub-title"})

    # total number of postings 
    num_postings = len(all_posts)

    posts_to_keep = list(range(num_postings))

    # Print how many postings there are
    print(f'Number of Items: {num_postings}')

    # loop through the items to grab all of the information

    urls = []
    names = []
    prices = []
    bump_date = []
    post_date = []
    pics = []

    for position in posts_to_keep:

        url = 'https://www.grailed.com' + html.find_all('div', {'class' : 'feed-item'})[position].find('a')['href']
        urls.append(url)

        name = html.find_all('div', {'class' : "truncate"})[position].text
        names.append(name)

        price = html.find_all('div', {'class' : "listing-price"})[position].text.split('$')[1].replace(',', "")
        prices.append(int(price))

        all_date_info = html.find_all('div', {'class' : 'feed-item'})[position].find_all('p')[0].text.split('(')

        # removing unwanted text such as 'about', 'ago', ')'
        all_date_info = [item.replace('\xa0ago', '').replace(')', '').replace('about ', '') for item in all_date_info]

        if len(all_date_info) == 2:
            bump_date.append(all_date_info[0])
            post_date.append(all_date_info[1])
            
        else: # meaning there is only one timestamp, so bump date = posted date
            bump_date.append(all_date_info[0])
            post_date.append(all_date_info[0])

        # Grab each item's picture url. Some items don't have them because they are lazy loaded :( 
        try:
            pic_url = html.find_all('div', {'class' : "listing-cover-photo"})[position].find('img')['src']
            pics.append(pic_url)
        except:
            pics.append('LazyLoader')


    clothes_dict = {'name' : names,
                    'bump_date' : bump_date,
                    'post_date' : post_date,
                    'price_dollar' : prices,
                    'url' : urls,
                    'pic_url' : pics
                   }

    df = pd.DataFrame(clothes_dict)
    df.head()

    # Sort dataframe based on how expensive the items are
    df = df.sort_values('price_dollar')
    df.reset_index(drop = True, inplace= True)
    
    # Close the driver, you are done!
    driver.close()
    
    return df

### Function #1: Example
Let's use the brand [Online Ceramics](https://online-ceramics.com/) as an example! I am a fan of their medium sized t-shirts and hoodies 🙂🙂

**Note:** I can be more specific with my search query. For example, I can choose 'Online Ceramics white T-Shirts' for example

In [6]:
rick_owens = grailed_scraper('Rick Owens', category = 'Tops', size='medium')

NoSuchElementException: Message: no such element: Unable to locate element: {"method":"xpath","selector":"//div[@class='UsersAuthentication']"}
  (Session info: chrome=123.0.6312.124); For documentation on this error, please visit: https://www.selenium.dev/documentation/webdriver/troubleshooting/errors#no-such-element-exception
Stacktrace:
	GetHandleVerifier [0x00007FF712027032+63090]
	(No symbol) [0x00007FF711F92C82]
	(No symbol) [0x00007FF711E2EC65]
	(No symbol) [0x00007FF711E7499D]
	(No symbol) [0x00007FF711E74ADC]
	(No symbol) [0x00007FF711EB5B37]
	(No symbol) [0x00007FF711E9701F]
	(No symbol) [0x00007FF711EB3412]
	(No symbol) [0x00007FF711E96D83]
	(No symbol) [0x00007FF711E683A8]
	(No symbol) [0x00007FF711E69441]
	GetHandleVerifier [0x00007FF7124225AD+4238317]
	GetHandleVerifier [0x00007FF71245F70D+4488525]
	GetHandleVerifier [0x00007FF7124579EF+4456495]
	GetHandleVerifier [0x00007FF712100576+953270]
	(No symbol) [0x00007FF711F9E54F]
	(No symbol) [0x00007FF711F99224]
	(No symbol) [0x00007FF711F9935B]
	(No symbol) [0x00007FF711F89B94]
	BaseThreadInitThunk [0x00007FF985257344+20]
	RtlUserThreadStart [0x00007FF985B026B1+33]


Seems like there are 39 medium t-shirts/hoodies from Online Ceramics!

Let's examine the dataframe:

In [29]:
online_ceramics.head()

Unnamed: 0,name,bump_date,post_date,price_dollar,url,pic_url


### Function #2

The objective of `photo_downloader_displayer()` is to download each item's picture and display each item's picture, price, name and URL in the Jupyter Notebook!

The function `photo_downloader_displayer()` receives 2 inputs.
1. `folder_name` - This name is quite self-explanatory. The function will create a folder with the folder name and store all of the downloaded photos in the folder. 


2. `df` - The df is the dataframe that was created from `grailed_scraper()`


After taking in 2 inputs, the `photo_downloader_displayer()` function will open up a separate chrome driver using each individual item's stored URL in the dataframe. It will then grab and open up a new tab with the item's picture URL. Next, a screenshot will be taken and saved into the created folder. After downloading all of the screenshots, a for-loop will run through the dataframe and display the item's information in the notebook.

In [8]:
def photo_downloader_displayer(folder_name, df):
    
    import os 
    
    # create folder
    
    directory = f'../{folder_name}_pics/'
    try:
        if not os.path.exists(directory):
            os.makedirs(directory)
    except OSError:
        print ('Error: Creating directory. ' +  directory)
    
    # Create a new picture url list
    pic_url = []

    # Loop through each row in the dataframe
    for pos in df.index:
        
        url = df['url'][pos]

        driver = webdriver.Chrome(executable_path="../chromedriver/chromedriver")

        driver.get(url)
        
        # Window name
        window_before = driver.window_handles[0]
        
        # Wait one second.
        sleep(2)

        html = driver.page_source

        # Beautiful Soup it!
        html = BeautifulSoup(html, 'lxml')

        # Grab image_url
        image_url = html.find('div', {'class' : '-image-wrapper'}).find('img')['src']

        # Store it in pic_url
        pic_url.append(image_url)
        
        
        # Open image url in new tab
        driver.execute_script(f'''window.open("{image_url}","_blank");''')
        
        
        # switch to image url
        window_after = driver.window_handles[1]
        
        driver.switch_to.window(window_after)
        
        sleep(np.random.choice(range(1,3)))
        
        # save the image
        driver.save_screenshot(f"{directory}{str(pos)}_screenshot.png")
        
        # Close both tabs
        driver.quit()


    # Reassigning pic url's for the dataframe:
    df['pic_url'] = pic_url
    
    # Displays images and necessary information in the jupyter notebook
    for idx in df.index:
        display(Image(f"{directory}{str(idx)}_screenshot.png"))
        print(f"Name: {df.at[idx, 'name']}")
        print(f"Price: ${df.at[idx, 'price_dollar']}")
        print(f"URL: {df.at[idx, 'url']}")
        print('')
    
    return df

### Function #2: Example
No surprises here, we're going to use Online Ceramics as an example already. It's also because I already have the created dataframe from running the first function.

In [None]:
online_ceramics = photo_downloader_displayer('online_ceramics', online_ceramics)

### Ta-Da!

There you have it! You can play around Selenium and any other website to mess around with. Based on this function, I'm definitely not going to buy some of these shirts. They're wayyyyy too expensive.