# WebScraping-Sephora: Step 1. Get Product URLs
NYCDSA web scraping project

## Project Description
The goal of this project is to explore the color spectrum of foundations and lipsticks in relation to reviewers' dominant features (hair color, eye color, and skin tone, as entered by reviewers on Sephora), and to see whether particular features are strongly correlated with the colors of the foundations and lipsticks that reviewers purchased and liked.

Please see Readme.md for more information, including the repository layout.

---
### Project Outline
- Step 1. Scrape product URLs
- Step 2. Scrape product reviews
- Step 3. Load all data and explore data


### Step 1. Get Product URLs
Because sephora.com loads its product listings dynamically (more products appear as the page is scrolled), the product URLs were first collected with Selenium, which drives a real browser, instead of Scrapy.
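
The collection cell below keeps scrolling until the number of matched product links stops growing. As a minimal sketch of that stopping rule, here it is separated from Selenium so it can be run on its own. The names `scroll_until_stable`, `FakePage`, and the `max_rounds` guard are hypothetical and not part of the original notebook; `FakePage` only imitates a page that reveals more items on each scroll.

```python
from typing import Callable


def scroll_until_stable(count_items: Callable[[], int],
                        scroll: Callable[[], None],
                        max_rounds: int = 50) -> int:
    """Scroll until the item count stops growing or max_rounds is reached.

    count_items and scroll are hypothetical callables standing in for the
    Selenium calls (counting matched elements, sending PAGE_DOWN keys).
    """
    old_count = 0
    for _ in range(max_rounds):
        scroll()
        new_count = count_items()
        if new_count == old_count:
            break  # the last scroll loaded nothing new
        old_count = new_count
    return old_count


class FakePage:
    """Stand-in for a lazily loading page: each scroll reveals `batch` more items."""

    def __init__(self, total: int, batch: int = 60):
        self.total = total
        self.batch = batch
        self.loaded = 0

    def scroll(self) -> None:
        self.loaded = min(self.total, self.loaded + self.batch)

    def count(self) -> int:
        return self.loaded


page = FakePage(total=150)
n_found = scroll_until_stable(page.count, page.scroll)
print(n_found)
```

With a real browser, a pause after each scroll (as in the cell below) is still needed so that newly requested products have time to render before they are counted. The `max_rounds` cap is an added safeguard against a page that never stops loading.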

In [None]:
import time

import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

# Selenium 4 API: the chromedriver path is passed via a Service object
driver = webdriver.Chrome(service=Service("/usr/local/bin/chromedriver"))


# Define product categories to scrape URLs for
categories = ["foundation-makeup", "lipstick"]

# Collect one record (Category, Product, URL) per product link
records = []


def scrollDown(driver, n_scrolls):
    """Send PAGE_DOWN to the page body n_scrolls times to trigger lazy loading."""
    body = driver.find_element(By.TAG_NAME, "body")
    for _ in range(n_scrolls):
        body.send_keys(Keys.PAGE_DOWN)
    return driver


for category in categories:
    url = 'https://www.sephora.com/shop/' + category + '?pageSize=300'
    driver.get(url)
    time.sleep(10)  # wait for the initial page load

    # Keep scrolling until no new product links appear
    old_len = 0
    while True:
        scrollDown(driver, 100)
        time.sleep(5)  # give newly requested products time to render
        # "css-ix8km1" is the class name used to select the product link elements
        slug = driver.find_elements(By.CLASS_NAME, "css-ix8km1")
        new_len = len(slug)
        if old_len == new_len:
            break
        old_len = new_len

    # Record the category, product name, and URL of each product link
    for a in slug:
        records.append({
            'Category': category,
            'Product': a.get_attribute('aria-label'),
            'URL': a.get_attribute('href'),
        })

# Build the full URL table once all categories have been scraped
df_urls = pd.DataFrame(records, columns=['Category', 'Product', 'URL'])

driver.quit()  # close the browser and stop the chromedriver process

df_urls.to_csv('./data/product_urls.csv', index = False)
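
Before the URL list is passed to Step 2, it may be worth checking for repeated links. Whether the listing pages actually return the same product more than once is an assumption here, not something shown in the notebook. The sketch below uses a hypothetical helper, `dedupe_by_url`, on made-up rows that have the same keys as the scraped records.

```python
def dedupe_by_url(rows):
    """Keep the first row seen for each URL, preserving the original order."""
    seen = set()
    unique = []
    for row in rows:
        if row["URL"] not in seen:
            seen.add(row["URL"])
            unique.append(row)
    return unique


# Made-up example rows in the same shape as the scraped records
rows = [
    {"Category": "lipstick", "Product": "Example A", "URL": "https://example.com/a"},
    {"Category": "lipstick", "Product": "Example B", "URL": "https://example.com/b"},
    {"Category": "lipstick", "Product": "Example A", "URL": "https://example.com/a"},
]
unique_rows = dedupe_by_url(rows)
print(len(unique_rows))
```

On the DataFrame itself, `df_urls.drop_duplicates(subset="URL")` would do the same job in one call.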