# Capstone Project: Dog Toy Recommendation System 
### Data Collection 
In order to collect my data, I will use Selenium and ChromeDriver in order to scrape dog toy reviews from Chewy. 

Sources used throughout this pages:
https://selenium-python.readthedocs.io/locating-elements.html#locating-hyperlinks-by-link-text
https://www.scrapingbee.com/blog/selenium-python/
https://www.scrapingbee.com/blog/practical-xpath-for-web-scraping/
https://www.scrapingbee.com/blog/scraping-single-page-applications/
https://stackoverflow.com/questions/11549647/getting-the-url-of-the-current-page-using-selenium-webdriver
https://towardsdatascience.com/in-10-minutes-web-scraping-with-beautiful-soup-and-selenium-for-data-professionals-8de169d36319
https://medium.com/ymedialabs-innovation/web-scraping-using-beautiful-soup-and-selenium-for-dynamic-page-2f8ad15efe25
https://towardsdatascience.com/web-scraping-using-selenium-python-8a60f4cf40ab
https://towardsdatascience.com/5-top-tips-for-data-scraping-using-selenium-d8b83804681c

In [2]:
# Code in my selenium_practice.py file so far for scraping data

import pandas as pd

# imports 
import selenium
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import time

In [3]:
def scrape_toy_title(page_source):
    toy_list = []
    soup = BeautifulSoup(page_source, 'lxml')
    
    # Getting the toy's title 
    section = soup.find('section', id='right-column')
    title = section.find('div', id='product-title').find('h1').get_text().strip()
    return title

In [4]:
def scrape_toy_price(page_source):
    # Getting the toy's price 
    soup = BeautifulSoup(page_source, 'lxml')
    price = soup.find('div', id='pricing').find(
        'ul', class_='product-pricing').find(
        'li', class_='our-price').find(
        'p', class_='price').find(
        'span', class_='ga-eec__price').get_text().strip()
    return price

In [5]:
def scrape_toy_description(page_source):
    soup = BeautifulSoup(page_source, 'lxml')
    try:
        descriptions =  soup.find(
            'div', class_='cw-tabs__body container').find(
            'article', id='descriptions').find(
            'section', class_='descriptions__content cw-tabs__content--left').find_all(
            'p')
        text_list = []
        for description in descriptions:
            text = description.get_text()
            text_list.append(text)
    
    except:
        description =  soup.find(
                'div', class_='cw-tabs__body container').find(
                'article', id='descriptions').find(
                'section', class_='descriptions__content cw-tabs__content--left').find(
                'p')
        text_list = []
        text = description.get_text()
        text_list.append(text)
    
    else: 
        pass
    return text_list
    

In [6]:
def scrape_toy_description(page_source):
    soup = BeautifulSoup(page_source, 'lxml')
    descriptions =  soup.find(
            'div', class_='cw-tabs__body container').find(
            'article', id='descriptions').find(
            'section', class_='descriptions__content cw-tabs__content--left').find_all(
            'p')
    text_list = []
    for description in descriptions:
        text = description.get_text()
        text_list.append(text)
    return text_list

In [7]:
def scrape_toy_keybenefits(page_source):
    soup = BeautifulSoup(page_source, 'lxml')
    ul = soup.find(
        'div', class_='cw-tabs__body container').find(
        'article', id='descriptions').find(
        'section', class_='descriptions__content cw-tabs__content--left').find(
        'ul')
    lis = ul.find_all('li')
    text_list = []
    for li in lis:
        text = li.get_text()
        text_list.append(text)

#             If you want each key benefit to be in its own list run this instead 
#             text_item = []
#             text = li.get_text()
#             text_item.append(text)
#             text_list.append(text_item)

    return text_list

In [8]:
def scrape_toy_rating(page_source):
    soup = BeautifulSoup(page_source, 'lxml')
    picture = soup.find(
        'div', class_='product-header-extras').find(
        'div', class_='ugc ugc-head').find(
        'picture')
    img = picture.find('img') # How do I access the img and then the stuff inside the img? 
    rating = img['src']
    return rating[-7:-4] # Grabbing the number itself from the 'src' attribute 

In [9]:
def scrape_toy_reviews(page_source): 
    soup = BeautifulSoup(page_source, 'lxml')
    reviews = soup.find_all('span', class_='ugc-list__review__display')
#     print(len(reviews))
#     print(reviews[0].get_text())
    text_list = []
    for review in reviews:
        review.get_text()
#         print(review)
        text_list.append(review)
    return text_list

# Need to figure out the best ways to get all the reviews 

In [10]:
def scrape_toy(page_source):
    # Getting elements off page
    toy_dict = {}
    
    # toy title
    toy_title = scrape_toy_title(page_source)
    toy_dict['title'] = toy_title

    # toy price 
    toy_price = scrape_toy_price(page_source)
    toy_dict['price'] = toy_price
    
    # toy description 
    toy_description = scrape_toy_description(page_source)
    toy_dict['descriptions'] = toy_description
    
    try:
        # toy key benefits 
        toy_keybenefits = scrape_toy_keybenefits(page_source)
        toy_dict['key_benefits'] = toy_keybenefits
    except:
        pass
    
    # toy rating -- NEEDS FIXING
    toy_rating = scrape_toy_rating(page_source)
    toy_dict['rating'] = toy_rating

    # toy reviews
    toy_reviews = scrape_toy_reviews(page_source)
#     print(f'Toy Reviews: {toy_reviews}')
    toy_dict['reviews'] = toy_reviews
    return toy_dict

In [11]:
def scrape_toy_page(toy_cat_dict, toy_subcat, toy_links): #products
#     # Lopping through all products and scraping
#     toys_links =[]
#     for product in products:
#         link = product.get_attribute('href')
#         toys_links.append(link)

    toy_subcat_dict = {}
    for link in toy_links:
        driver.get(link)
        page_source = driver.page_source
        toy_dict = scrape_toy(page_source)
        toy_subcat_dict[link] = toy_dict
        time.sleep(30)

    toy_cat_dict[toy_subcat] = toy_subcat_dict

In [12]:
def number_of_toys(page_source):
    soup = BeautifulSoup(page_source, 'lxml')
    numbers = soup.find_all('span', class_='category-count')
#     print(numbers[0].text)
    subcat_numbers = []
    for span in numbers:
        number = span.text
        subcat_numbers.append(int(number[1:-1]))
    return subcat_numbers

In [13]:
def grab_subcat_links(link, number_of_toys):
    # https://www.chewy.com/b/moderate-2718
    # https://www.chewy.com/b/moderate_c2718_p5
    
    main_href = f'{link[:-5]}_c{link[-4:]}_p'    
    subcat_pages = []
    subcat_pages.append(link)
    for i in range(2, round(number_of_toys / 36)+1):
        href = f'{main_href}{i}'
        subcat_pages.append(href)
    return subcat_pages

In [14]:
def get_links(page_source):
    soup = BeautifulSoup(page_source, 'lxml')
    subcats = soup.find_all('a', class_='facet_selection')
    links_list = []
    for subcat in subcats:
        link = subcat['href']
        full_link = f'https://www.chewy.com{link}'
        links_list.append(full_link)
    return links_list

In [15]:
def grab_toy_links(subcat_pages):
    toys_links =[]
    for page in subcat_pages:
        driver.get(page)
        products = driver.find_elements_by_class_name('product')
        # Lopping through all products on first page 
        for product in products:
            link = product.get_attribute('href')
            toys_links.append(link)
    return toys_links

In [15]:
all_toy_links = []

# Chew Toys

In [83]:
# Defining a larger dictionary to hold subcat dictionaries
chew_toys = {}

In [None]:
# CHEW TOYS 


DRIVER_PATH = '/Users/haleytaft/Downloads/chromedriver'
driver = webdriver.Chrome( executable_path=DRIVER_PATH) 
original_link = "https://www.chewy.com/b/toys-315"
driver.get(original_link)

# To first just look at CHEW TOYS
chew_toys_link = driver.find_element_by_link_text('Chew Toys')
chew_toys_link.click()


# Going to MODERATE chew toys
element = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.LINK_TEXT, "Moderate")))
element.click()

# Checking for number of toys in each subcategory (looking at side bar)
cat_page_source = driver.page_source
chew_numbers = number_of_toys(cat_page_source)

# Getting all first page links for each subcategory
chew_links = get_links(cat_page_source)
# print(chew_links)

# Getting links for all pages for moderate toys 
mod_pages_links = grab_subcat_links(chew_links[0], chew_numbers[0])
all_moderate_links = grab_toy_links(mod_pages_links)
# all_toy_links.append(all_moderate_links)

# Collecting all MODERATE chew toys 
check = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.LINK_TEXT, "Chew Toys")))
scrape_toy_page(chew_toys, 'moderate', all_moderate_links)

# Back to Chew Toys
driver.get('https://www.chewy.com/b/chew-toys-316')

print('Done with Moderate Toys')

################################################################################################

# To get to TOUGH chew toys
# element = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.LINK_TEXT, "Tough")))
# element.click()

# # Getting links for all pages for tough toys 
# tough_pages_links = grab_subcat_links(chew_links[1], chew_numbers[1])
# # print(tough_pages_links)
# all_tough_links = grab_toy_links(tough_pages_links)
# # all_toy_links.append(all_tough_links)

# # Collecting all TOUGH chew toys 
# check = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.LINK_TEXT, "Chew Toys")))
# scrape_toy_page(chew_toys, 'tough', all_tough_links)

# #To get back to Chew Toys
# driver.get('https://www.chewy.com/b/chew-toys-316')

# print("Done with Tough Toys")

################################################################################################
# To get to EXTREME chew toys
# element = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.LINK_TEXT, "Extreme")))
# element.click()

# # Getting links for all pages for extreme toys 
# extreme_pages_links = grab_subcat_links(chew_links[2], chew_numbers[2])
# all_extreme_links = grab_toy_links(extreme_pages_links)
# # all_toy_links.append(all_extreme_links)

# # Collecting the extreme chew toys 
# check = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.LINK_TEXT, "Chew Toys")))
# scrape_toy_page(chew_toys, 'extreme', all_extreme_links)

# print('Done with Extreme Toys and Chew Toys')

In [None]:
chew_toy_list = []
for subcat in ['moderate', 'tough', 'extreme']:
    for index, link in enumerate(chew_toys[subcat]):
        chew_toys[subcat][link]['link'] = link
        chew_toys[subcat][link]['subcat'] = subcat
        chew_toys[subcat][link]['cat'] = 'chew toys'
        chew_toy_list.append(chew_toys[subcat][link])
chew_toy_df = pd.DataFrame(chew_toy_list)

In [88]:
chew_toy_df

Unnamed: 0,title,price,descriptions,key_benefits,rating,reviews,link,subcat,cat
0,Nylabone Teething Pacifier Puppy Chew Toy,$3.23,[Every puppy needs a pacifier to soothe teethi...,[Designed to encourage positive play and teach...,4_2,[[I read the reviews and thought we'd be safe....,https://www.chewy.com/nylabone-teething-pacifi...,moderate,chew toys
1,"KONG Puppy Dog Toy, Color Varies",$6.99,[The Puppy KONG dog toy is customized for a gr...,[Unpredictable bounce is great for energetic p...,4_3,"[[I have had dozens of dogs over the years, an...",https://www.chewy.com/kong-puppy-dog-toy-color...,moderate,chew toys
2,Petstages Dogwood Tough Dog Chew Toy,$8.83,"[Chewing is a natural behavior in all dogs, as...",[Chew toy that combines real wood with synthet...,4_2,[[My dogs like chasing sticks and the two of t...,https://www.chewy.com/petstages-dogwood-tough-...,moderate,chew toys
3,KONG Classic Dog Toy,$12.99,[Give your furry friend a reliable and fun pla...,[Made in the USA from globally sourced materia...,4_5,[[I got the small Kong classic for my Westie. ...,https://www.chewy.com/kong-classic-dog-toy-lar...,tough,chew toys
4,Nylabone Strong Chew Stick Maple Bacon Flavore...,$12.49,[Help fulfill your dog’s natural chewing insti...,[Real wood and a strong nylon chew toy won't s...,4_0,"[[We weren’t sure about this bone, but it is a...",https://www.chewy.com/nylabone-strong-chew-sti...,tough,chew toys
...,...,...,...,...,...,...,...,...,...
142,Starmark Everlasting Treat Ball with Dental Tr...,$11.69,[A fun chewing challenge for your dog! The Eve...,[Includes an edible dental treat - other Everl...,3_4,[[My 2 year old pittie seriously loves this to...,https://www.chewy.com/starmark-everlasting-tre...,extreme,chew toys
143,"GoughNuts Pro 50 Ball Dog Toy, 3-in",$24.63,[Playful pooches will go nuts for the GoughNut...,[Extremely durable and long-lasting ball is ma...,4_4,[[My 90 lb super chewer loves this ball! It wa...,https://www.chewy.com/goughnuts-pro-50-ball-do...,extreme,chew toys
144,"DuraForce Ring Squeaky Dog Toy, Pink, Medium",$16.11,[Keep your pup on her toes with DuraForce’s Ri...,[Soft on the outside but built multiple layers...,3_0,[[My chewer worked at the seams and had it fal...,https://www.chewy.com/duraforce-ring-squeaky-d...,extreme,chew toys
145,"DuraForce Gear Ring Squeaky Dog Toy, Blue, Medium",$19.89,[Keep your pup on her toes with DuraForce’s Ri...,[Soft on the outside but built multiple layers...,2_6,[[This is less than 15 mins with my bully boy....,https://www.chewy.com/duraforce-gear-ring-sque...,extreme,chew toys


In [24]:
# convert chew toy data frame to csv -- uncomment to rerun 
# chew_toy_df.to_csv('./data/chewtoy_df.csv', index=False)

# Fetch Toys

In [30]:
fetch_toys = {}

In [36]:
# FETCH TOYS 

DRIVER_PATH = '/Users/haleytaft/Downloads/chromedriver'
driver = webdriver.Chrome( executable_path=DRIVER_PATH) 
driver.get("https://www.chewy.com/b/toys-315")

# To first just look at CHEW TOYS
chew_toys_link = driver.find_element_by_link_text('Fetch Toys')
chew_toys_link.click()


# Now Looking at FETCH TOYS
element = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.LINK_TEXT, "Balls")))
element.click()

# Checking for number of toys in each subcategory (looking at side bar)
cat_page_source = driver.page_source
fetch_numbers = number_of_toys(cat_page_source)

# Getting all first page links for each subcategory
fetch_links = get_links(cat_page_source)

# # Getting links for all pages for treat toys & dispensers
# ball_pages_links = grab_subcat_links(fetch_links[0], fetch_numbers[0])
# all_ball_links = grab_toy_links(ball_pages_links)
# print(all_ball_links)
# # all_toy_links.append(all_ball_links)

# # Looking a the Balls toys
# element = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.LINK_TEXT, "Fetch Toys")))
# element.click()

# # Scraping
# scrape_toy_page(fetch_toys, 'balls', all_ball_links)

driver.get('https://www.chewy.com/b/fetch-toys-317')

# print('Done with Ball Fetch Toys!')
# #######################################################################################################

# To look at the ball fetch toys 
# check = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.LINK_TEXT, "Balls")))

# # Looking a the Discs toys
# element = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.LINK_TEXT, "Discs")))
# element.click()

# # Getting links for all pages for treat toys & dispensers
# disc_pages_links = grab_subcat_links(fetch_links[1], fetch_numbers[1])
# all_disc_links = grab_toy_links(disc_pages_links)
# all_toy_links.append(all_disc_links)

# # To look at the disc fetch toys -- NEED TO FIGURE OUT HOW TO ACCESS THEM
# check = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.LINK_TEXT, "Fetch Toys")))

# # Scraping
# scrape_toy_page(fetch_toys, 'discs', all_disc_links)

# # To get back to fetch toys
# driver.get('https://www.chewy.com/b/fetch-toys-317')

# print("Done with Disc Fetch Toys!")

###################################################################################################

# # Looking a the Launcher toys
# element = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.LINK_TEXT, "Launchers")))
# element.click()

# # Getting links for all pages for treat toys & dispensers
# launcher_pages_links = grab_subcat_links(fetch_links[2], fetch_numbers[2])
# all_launcher_links = grab_toy_links(launcher_pages_links)
# all_toy_links.append(all_launcher_links)


# # To look at the launcher fetch toys -- NEED TO FIGURE OUT HOW TO ACCESS THEM
# check = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.LINK_TEXT, "Fetch Toys")))

# # Scraping
# scrape_toy_page(fetch_toys, 'launchers', all_launcher_links)

# print("Done with Launcher Fetch Toys!")

# # To get back to fetch toys
# driver.back()

######################################################################################################

# Looking a the Stick toys
element = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.LINK_TEXT, "Sticks")))
element.click()

# Getting links for all pages for treat toys & dispensers
stick_pages_links = grab_subcat_links(fetch_links[2], fetch_numbers[2])
all_stick_links = grab_toy_links(stick_pages_links)
all_toy_links.append(all_stick_links)

# To look at the stick fetch toys 
check = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.LINK_TEXT, "Fetch Toys")))
# To get back to fetch toys

# Scraping
scrape_toy_page(fetch_toys, 'sticks', all_stick_links)

print('Done with Stick Fetch Toys!')

print('Done with Fetch Toys!')


# 

KeyboardInterrupt: 

In [82]:
len(fetch_toys)

NameError: name 'fetch_toys' is not defined

In [36]:
len(all_toy_links)

168

In [80]:
fetch_toy_list = []
for subcat in ['balls', 'discs', 'launchers', 'sticks']: #, 'launchers'
    for index, link in enumerate(fetch_toys[subcat]):
        fetch_toys[subcat][link]['link'] = link
        fetch_toys[subcat][link]['subcat'] = subcat
        fetch_toys[subcat][link]['cat'] = 'fetch toys'
        fetch_toy_list.append(fetch_toys[subcat][link])
fetch_toy_df = pd.DataFrame(fetch_toy_list)

NameError: name 'fetch_toys' is not defined

In [81]:
fetch_toy_df

NameError: name 'fetch_toy_df' is not defined

In [54]:
# fetch_toy_df.to_csv('./data/fetchtoy_df.csv', index=False)

In [None]:
# fetch_toys['balls']
# ball_df = pd.DataFrame(fetch_toys['balls']).T
# ball_df.to_csv('./data/ball_df.csv')

# Plush Toys

In [28]:
plush_toys = {}

In [26]:
# PlUSH TOYS

DRIVER_PATH = '/Users/haleytaft/Downloads/chromedriver'
driver = webdriver.Chrome( executable_path=DRIVER_PATH) 
driver.get("https://www.chewy.com/b/toys-315")

# To first just look at CHEW TOYS
chew_toys_link = driver.find_element_by_link_text('Plush Toys')
chew_toys_link.click()

# plush_toys = {}

# Looking a the Stuffed toys
element = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.LINK_TEXT, "Stuffed Toys")))
element.click()

# Checking for number of toys in each subcategory (looking at side bar)
cat_page_source = driver.page_source
plush_numbers = number_of_toys(cat_page_source)

# Getting all first page links for each subcategory
plush_links = get_links(cat_page_source)
# print(plush_links)

# Getting links for all pages for stuffed toys 
stuffed_pages_links = grab_subcat_links(plush_links[0], plush_numbers[0])
# print(len(stuffed_pages_links))
all_stuffed_links = grab_toy_links(stuffed_pages_links)
# all_toy_links.append(all_stuffed_links)

# To look at the STUFFED plush toys 
check = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.LINK_TEXT, "Plush Toys")))

# Scraping
# plush_toys_1 = {}
# scrape_toy_page(plush_toys_1, 'stuffed', all_stuffed_links[:100])
# print('Done with 1st round')
# plush_toys_2 = {}
# scrape_toy_page(plush_toys_2, 'stuffed', all_stuffed_links[101:200])
# print('Done with 2nd round')
# plush_toys_3 = {}
# scrape_toy_page(plush_toys_3, 'stuffed', all_stuffed_links[201:300])
# print('Done with 3rd round')
# plush_toys_4 = {}
# scrape_toy_page(plush_toys_4, 'stuffed', all_stuffed_links[301:400])
# print('Done with 4th round')
# plush_toys_5 = {}
# scrape_toy_page(plush_toys_5, 'stuffed', all_stuffed_links[401:500])
# print('Done with 5th round')
# plush_toys_6 = {}
# scrape_toy_page(plush_toys_6, 'stuffed', all_stuffed_links[501:600])
# print('Done with 6th round')
# plush_toys_7 = {}
# scrape_toy_page(plush_toys_7, 'stuffed', all_stuffed_links[601:700])
# print('Done with 7th round')
# plush_toys_8 = {}
# scrape_toy_page(plush_toys_8, 'stuffed', all_stuffed_links[701:800])
# print('Done with 8th round')
# plush_toys_9 = {}
# scrape_toy_page(plush_toys_9, 'stuffed', all_stuffed_links[801:900])
# print('Done with 9th round')
# plush_toys_10 = {}
# scrape_toy_page(plush_toys_10, 'stuffed', all_stuffed_links[901:1000])
# print('Done with 10th round')
# plush_toys_11 = {}
# scrape_toy_page(plush_toys_11, 'stuffed', all_stuffed_links[1001:1100])
# print('Done with 11th round')
# plush_toys_12 = {}
# scrape_toy_page(plush_toys_12, 'stuffed', all_stuffed_links[1101:1200])
# print('Done with 12th round')
# plush_toys_13 = {}
# scrape_toy_page(plush_toys_13, 'stuffed', all_stuffed_links[1201:1300])
# print('Done with 13th round')

# plush_toys_14 = {}
# scrape_toy_page(plush_toys_14, 'stuffed', all_stuffed_links[1301:1403])
# print('Done with 14th round')
    
#To get back to Chew Toys
driver.get('https://www.chewy.com/b/plush-toys-320')

print("Done with Stuffed subcategory!")

# ##########################################################################################################

# Looking a the Unstuffed toys
element = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.LINK_TEXT, "Unstuffed Toys")))
element.click()

# Getting links for all pages for unstuffed toys 
unstuffed_pages_links = grab_subcat_links(plush_links[1], plush_numbers[1])
# print(unstuffed_pages_links)
all_unstuffed_links = grab_toy_links(unstuffed_pages_links)
# all_toy_links.append(all_unstuffed_links)

# To look at the unstuffed plush toys 
check = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.LINK_TEXT, "Plush Toys")))

# # Scraping 
# plush_toys_15 = {}
# scrape_toy_page(plush_toys_15, 'unstuffed', all_unstuffed_links[:50])
# print("Done with 1st unstuffed toys")
# plush_toys_16 = {}
# scrape_toy_page(plush_toys_16, 'unstuffed', all_unstuffed_links[51:100])
# print("Done with 2nd unstuffed toys")
# plush_toys_17 = {}
# scrape_toy_page(plush_toys_17, 'unstuffed', all_unstuffed_links[101:173])
# print("Done with 3rd unstuffed toys")

# print("Done with Unstuffed subcategory!")

# print('Done with Plush category!')


Done with Stuffed subcategory!
Done with 1st unstuffed toys
Done with 2nd unstuffed toys
Done with 3rd unstuffed toys
Done with Unstuffed subcategory!
Done with Plush category!


In [97]:
len(all_toy_links)

5

In [146]:
len(plush_toys_15['unstuffed'])

50

In [29]:
plush_toys['unstuffed'] = {}
plush_toys['stuffed'] = {}

In [30]:
plushlist = [plush_toys_1, plush_toys_2, plush_toys_3, plush_toys_4, plush_toys_5, plush_toys_6,
            plush_toys_7, plush_toys_8, plush_toys_9, plush_toys_10, plush_toys_11,
            plush_toys_12, plush_toys_13, plush_toys_15, plush_toys_16, plush_toys_17]

In [160]:
# plush_toys_15['unstuffed']

In [31]:
for toydict in plushlist[:12]:
    for item, link in enumerate(toydict['stuffed']):
        plush_toys['stuffed'][link] = toydict['stuffed'][link]
        
for toydict in plushlist[14:]:
    for item, link in enumerate(toydict['unstuffed']):
        plush_toys['unstuffed'][link] = toydict['unstuffed'][link]

In [32]:
len(plush_toys['stuffed'])

1189

In [33]:
len(plush_toys['unstuffed'])

92

In [78]:
plush_toy_list = []
for subcat in ['stuffed', 'unstuffed']: 
    for index, link in enumerate(plush_toys[subcat]):
        plush_toys[subcat][link]['link'] = link
        plush_toys[subcat][link]['subcat'] = subcat
        plush_toys[subcat][link]['cat'] = 'plush toys'
        plush_toy_list.append(plush_toys[subcat][link])
plush_toy_df = pd.DataFrame(plush_toy_list)

In [79]:
# plush_toy_df.to_csv('./data/plushtoy_df.csv', index=False)

# Rope and Tug Toys

In [64]:
ropetug_toys = {}

In [58]:
# ROPE & TUG TOYS

DRIVER_PATH = '/Users/haleytaft/Downloads/chromedriver'
driver = webdriver.Chrome( executable_path=DRIVER_PATH) 
driver.get("https://www.chewy.com/b/toys-315")

# Getting the total number of toys
page_source = driver.page_source 
soup = BeautifulSoup(page_source, 'lxml')
numbers = soup.find_all('span', class_='category-count')
number = int(numbers[3].text[1:-1])

# getting all the links to the pages to scrape
rope_toy_link = 'https://www.chewy.com/b/rope-tug-toys-321'
main_href = f'{rope_toy_link[:-4]}_c{rope_toy_link[-3:]}_p'    
rope_pages = []
# rope_pages.append(link)
for i in range(2, round(number / 36)+1):  # make sure to put in the right numbers 
    href = f'{main_href}{i}'
    rope_pages.append(href)
# print(rope_pages)

all_rope_links = grab_toy_links(rope_pages)
all_toy_links.append(all_rope_links)

# # To first just look at ROPE & TUG TOYS
# rope_toys_link = driver.find_element_by_link_text('Rope & Tug Toys')
# rope_toys_link.click()


# To look at the rope & tug toys 
check = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.LINK_TEXT, "Toys")))


# The actual scraping
# rope_tug_toys_1 = {}
# scrape_toy_page(rope_tug_toys_1, 'rope_tug_toys', all_rope_links[:100])
# print("Done with 1")
# time.sleep(60)
# rope_tug_toys_2 = {}
# scrape_toy_page(rope_tug_toys_2, 'rope_tug_toys', all_rope_links[101:200])
# print("Done with 2")
time.sleep(60)
rope_tug_toys_3 = {}
scrape_toy_page(rope_tug_toys_3, 'rope_tug_toys', all_rope_links[201:300])
print("Done with 3")
time.sleep(60)
rope_tug_toys_4 = {}
scrape_toy_page(rope_tug_toys_4, 'rope_tug_toys', all_rope_links[301:347])
print("Done with 4")


Done with 3
Done with 4


In [99]:
len(all_toy_links)

6

In [60]:
rope_tug_toys_1['rope_tug_toys']

{'https://www.chewy.com/chuckit-amphibious-bumper-dog-toy/dp/38378': {'title': 'Chuckit! Amphibious Bumper Dog Toy, Color Varies',
  'price': '$9.95',
  'descriptions': ["Games of tug-of-war don't need to end just because you're in a pool or lake when you have the floating Amphibious Bumper from Chuckit! Your four-legged swimmer can pull to his heart's delight while enjoying a splashing day outdoors. Let him sink his teeth into the soft-but-durable memory foam material while you pull on the cord. Just don't forget: the dog always wins!",
   'Ships in a variety of random and fun colors!',
   'Created especially for paw-some games of fetch and tug, this toy is not designed to be a chew toy. Not recommended for aggressive chewers. Every dog plays differently and, since not all toys are created equal, it’s always best to keep a close watch on your pup in case things get ruff. Supervised play will help toys last longer and most importantly keep your pal safe. No dog toy is truly indestructi

In [65]:
ropetug_toys['rope_tug_toys'] = {}
ropelist = [rope_tug_toys_1, rope_tug_toys_2, rope_tug_toys_3, rope_tug_toys_4]

In [66]:
for toydict in ropelist:
    for item, link in enumerate(toydict['rope_tug_toys']):
        ropetug_toys['rope_tug_toys'][link] = toydict['rope_tug_toys'][link]

In [67]:
len(ropetug_toys['rope_tug_toys'])

309

In [70]:
ropetug_toy_list = []
for subcat in ['rope_tug_toys']: #, 'unstuffed'
    for index, link in enumerate(ropetug_toys[subcat]):
        ropetug_toys[subcat][link]['link'] = link
        ropetug_toys[subcat][link]['subcat'] = subcat
        ropetug_toys[subcat][link]['cat'] = 'rope & tug toys'
        ropetug_toy_list.append(ropetug_toys[subcat][link])
ropetug_toy_df = pd.DataFrame(ropetug_toy_list)

In [71]:
ropetug_toy_df

Unnamed: 0,title,price,descriptions,key_benefits,rating,reviews,link,subcat,cat
0,"Chuckit! Amphibious Bumper Dog Toy, Color Varies",$9.95,[Games of tug-of-war don't need to end just be...,[Designed for exceptional visibility and perfo...,4_2,[[Omg! My Labrador loves these things !!i only...,https://www.chewy.com/chuckit-amphibious-bumpe...,rope_tug_toys,rope & tug toys
1,KONG SqueakStix Dog Toy,$13.99,[Give your chewsy pooch a stick that will stic...,"[Extra-long, durable design built for easy gam...",4_2,[[My dog chewed a hole through the end of this...,https://www.chewy.com/kong-squeakstix-dog-toy-...,rope_tug_toys,rope & tug toys
2,Booda Fresh N Floss Spearmint 3-Knot Rope Dog Toy,$4.16,"[With Booda Fresh N' Floss, you can finally ge...",[Combines durable cotton with mint-scented flo...,4_2,[[Bought this due to such positive reviews and...,https://www.chewy.com/booda-fresh-n-floss-spea...,rope_tug_toys,rope & tug toys
3,"KONG Squeezz Crackle Bone for Dogs, Color Varies",$9.89,[KONG Crackle is a new twist on a fun and favo...,[All Squeezz products are sold in assorted jew...,3_2,[[We have a 1 year old St. Bernard puppy and h...,https://www.chewy.com/kong-squeezz-crackle-bon...,rope_tug_toys,rope & tug toys
4,"Otterly Pets Assorted Small to Medium Ropes, F...",$13.87,[Treat your energetic pooch to an all-star ass...,[Play-ready assortment of toys includes food-g...,3_7,[[This collection is perfect for my new 5 mont...,https://www.chewy.com/otterly-pets-assorted-sm...,rope_tug_toys,rope & tug toys
...,...,...,...,...,...,...,...,...,...
304,"Pets First NCAA Basketball Rope Dog Toy, Purdue",$8.96,[Get your furry friend ready for tipoff with t...,[Squeaker on the inside to keep your furry fri...,0_0,[],https://www.chewy.com/pets-first-ncaa-basketba...,rope_tug_toys,rope & tug toys
305,Cheering Pet Agility Equipment & Treat Bag Dog...,$79.99,[Train your aspiring or experienced athlete in...,"[Includes 58.5-inch tunnel, 2 vertical poles, ...",0_0,[],https://www.chewy.com/cheering-pet-agility-equ...,rope_tug_toys,rope & tug toys
306,"Snugarooz Knot Yours Rope Dog Toy, 9-in",$2.66,"[Give Spot a fun new knot to trot, chew and tu...",[Spot will love the intricacies of his new kno...,0_0,[],https://www.chewy.com/snugarooz-knot-yours-rop...,rope_tug_toys,rope & tug toys
307,Squishy Face Studio Fleece Tug Dog Toy,$9.99,"[If your dog loves tug of war, the Squishy Fac...","[Designed to be used on its own, or as a repla...",3_8,[[Yay! Great quality product. I ordered the bl...,https://www.chewy.com/squishy-face-studio-flee...,rope_tug_toys,rope & tug toys


In [72]:
# ropetug_toy_df.to_csv('./data/ropetugtoy_df.csv', index=False)

# Interactive Toys

In [None]:
# interactive_toys = {}

In [48]:
# INTERACTIVE TOYS

DRIVER_PATH = '/Users/haleytaft/Downloads/chromedriver'
driver = webdriver.Chrome( executable_path=DRIVER_PATH) 
driver.get("https://www.chewy.com/b/toys-315")

# To first just look at CHEW TOYS
chew_toys_link = driver.find_element_by_link_text('Interactive Toys')
chew_toys_link.click()

# The interactive toys
element = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.LINK_TEXT, "Treat Toys & Dispensers")))
element.click()

# Checking for number of toys in each subcategory (looking at side bar)
cat_page_source = driver.page_source
interactive_numbers = number_of_toys(cat_page_source)

# Getting all first page links for each subcategory
interactive_links = get_links(cat_page_source)
# print(interactive_links)

# # Getting links for all pages for treat toys & dispensers
# dispenser_pages_links = grab_subcat_links(interactive_links[0], interactive_numbers[0])
# all_dispenser_links = grab_toy_links(dispenser_pages_links)
# # all_toy_links.append(all_dispenser_links)

# # To look at the dog treat toys & dispenser interactive toys
# check = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.LINK_TEXT, "Interactive Toys")))

# interactive_toys_1 = {}

# # Scraping
# scrape_toy_page(interactive_toys_1, 'treat toys & dispensers', all_dispenser_links)

driver.get('https://www.chewy.com/b/interactive-toys-319')

print('Done with Treat Toys & Dispensers subcategory')

############################################################################################################

# # Treat Dispenser Refills
# element = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.LINK_TEXT, "Treat Dispenser Refills")))
# element.click()

# interactive_toys_2 = {}

# # Getting links for all pages for treat toys & refills
# refills_pages_links = grab_subcat_links(interactive_links[1], interactive_numbers[1])
# all_refills_links = grab_toy_links(refills_pages_links)
# # all_toy_links.append(all_refills_links)

# # To look at the dog treat dispensers refills interactive toys
# check = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.LINK_TEXT, "Interactive Toys")))

# # Scraping
# scrape_toy_page(interactive_toys_2, 'treat dispenser refills', all_refills_links)

# driver.get('https://www.chewy.com/b/interactive-toys-319')

# print('Done with Treat Dispenser Refills subcategory!')

########################################################################################################

# # Puzzle toys and Games 
# element = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.LINK_TEXT, "Puzzle Toys & Games")))
# element.click()

# interactive_toys_3 = {}

# # Getting links for all pages for puzzle toys & games
# game_pages_links = grab_subcat_links(interactive_links[2], interactive_numbers[2])
# all_game_links = grab_toy_links(game_pages_links)
# # all_toy_links.append(all_game_links)

# # To look at the dog puzzle toys & games
# check = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.LINK_TEXT, "Interactive Toys")))

# # Scraping
# scrape_toy_page(interactive_toys_3, 'puzzle toys & games', all_game_links)

# driver.get('https://www.chewy.com/b/interactive-toys-319')

# print('Done with Puzzle Toys & Games')

#########################################################################################################

# # Automatic Ball Launchers
# element = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.LINK_TEXT, "Automatic Ball Launchers")))
# element.click()

# interactive_toys_4 = {}

# # Getting links for all pages for automatic ball launchers
# auto_pages_links = grab_subcat_links(interactive_links[3], interactive_numbers[3])
# all_auto_links = grab_toy_links(auto_pages_links)

# # To look at the dog automatic ball launchers
# check = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.LINK_TEXT, "Interactive Toys")))


# # Scraping
# scrape_toy_page(interactive_toys_4, 'automatic ball launchers', all_auto_links)

# print("Done with Automatic Ball Launchers subcategory!")

# print("Done with Interactive Toys category!")

Done with Treat Toys & Dispensers subcategory
Done with Automatic Ball Launchers subcategory!
Done with Interactive Toys category!


In [101]:
len(all_toy_links)

9

In [None]:
len(interactive_toys)

In [49]:
interactive_toys = {}
interactivelist = [interactive_toys_1, interactive_toys_2, interactive_toys_3, interactive_toys_4]

In [67]:
# interactive_toys['treat toys & dispensers'] = 0
# interactivelist[0]['treat toys & dispensers']
interactive_toys['treat toys & dispensers'] = {}
interactive_toys['treat dispenser refills'] = {}
interactive_toys['puzzle toys & games'] = {}
interactive_toys['automatic ball launchers'] = {}
interactive_toys

{'treat toys & dispensers': {},
 'treat dispenser refills': {},
 'puzzle toys & games': {},
 'automatic ball launchers': {}}

In [71]:
# for toydict in interactivelist[0]:
#     print(toydict)
# #     for subcat in ['treat toys & dispensers', 'treat dispenser refills', 'puzzle toys & games']:
#         for item, link in enumerate(toydict[subcat]):
#             interactive_toys_full[subcat][link] = toydict[subcat][link]
        
# for toydict in interactivelist[0]:
for item, link in enumerate(interactivelist[0]['treat toys & dispensers']):
    interactive_toys['treat toys & dispensers'][link] = interactivelist[0]['treat toys & dispensers'][link]

# for toydict in interactivelist[1]:
for item, link in enumerate(interactivelist[1]['treat dispenser refills']):
    interactive_toys['treat dispenser refills'][link] = interactivelist[1]['treat dispenser refills'][link]
        
# for toydict in interactivelist[2]:
for item, link in enumerate(interactivelist[2]['puzzle toys & games']):
    interactive_toys['puzzle toys & games'][link] = interactivelist[2]['puzzle toys & games'][link]
        
# for toydict in interactivelist[3]: 
for item, link in enumerate(interactivelist[3]['automatic ball launchers']):
    interactive_toys['automatic ball launchers'][link] = interactivelist[3]['automatic ball launchers'][link]

In [73]:
# interactive_toys

In [78]:
# interactivelist = [interactive_toys, interactive_toys_2]

In [90]:
# for toydict in interactivelist[1]:
#     print(toydict)

automatic ball launchers


In [None]:
# for toydict in interactivelist:
#     for item, link in enumerate(toydict['interactive_toys']):
#         ropetug_toys['rope_tug_toys'][link] = toydict['rope_tug_toys'][link]a

In [74]:
interactive_toy_list = []
for subcat in ['treat toys & dispensers', 'treat dispenser refills', 'puzzle toys & games', 'automatic ball launchers']: #, 'automatic ball launchers'
    for index, link in enumerate(interactive_toys[subcat]):
        interactive_toys[subcat][link]['link'] = link
        interactive_toys[subcat][link]['subcat'] = subcat
        interactive_toys[subcat][link]['cat'] = 'interactive toys'
        interactive_toy_list.append(interactive_toys[subcat][link])
interactive_toy_df = pd.DataFrame(interactive_toy_list)

In [76]:
# interactive_toy_df

In [54]:
# interactive_df_list = []
# for cat in interactive_toys:
#     for toy in cat:
#         interactive_df_list.append(toy)
# interactive_df = pd.DataFrame(interactive_df_list)
# interactive_df

In [77]:
# interactive_toy_df.to_csv('./data/interactivetoy_df.csv', index=False)

In [102]:
# # Combining all the links into one large list
# big_toy_links = []
# for i in range(len(all_toy_links)):
#     print(len(all_toy_links[i]))
#     for j in range(len(all_toy_links[i])):
#         big_toy_links.append(i)
# len(big_toy_links)

169
72
72
1392
144
311
180
33
36


2409

In [41]:
# # FETCH TOYS 

# DRIVER_PATH = '/Users/haleytaft/Downloads/chromedriver'
# driver = webdriver.Chrome( executable_path=DRIVER_PATH) 
# driver.get("https://www.chewy.com/b/toys-315")

# # To first just look at CHEW TOYS
# chew_toys_link = driver.find_element_by_link_text('Fetch Toys')
# chew_toys_link.click()

# fetch_toys = {}

# # Now Looking at FETCH TOYS
# element = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.LINK_TEXT, "Balls")))
# element.click()

# # Checking for number of toys in each subcategory (looking at side bar)
# cat_page_source = driver.page_source
# fetch_numbers = number_of_toys(cat_page_source)

# # Getting all first page links for each subcategory
# fetch_links = get_links(cat_page_source)
# # print(fetch_links)

# # Getting links for all pages for treat toys & dispensers
# ball_pages_links = grab_subcat_links(fetch_links[0], fetch_numbers[0])
# all_ball_links = grab_toy_links(ball_pages_links)
# all_toy_links.append(all_ball_links)

# # Looking a the Balls toys
# element = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.LINK_TEXT, "Fetch Toys")))
# element.click()

# # Scraping
# # scrape_toy_page(fetch_toys, 'balls', all_ball_links)

# driver.get('https://www.chewy.com/b/fetch-toys-317')

# print('Done with Ball Fetch Toys!')
# #######################################################################################################

# # To look at the ball fetch toys 
# check = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.LINK_TEXT, "Balls")))

# # Looking a the Discs toys
# element = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.LINK_TEXT, "Discs")))
# element.click()

# # Getting links for all pages for treat toys & dispensers
# disc_pages_links = grab_subcat_links(fetch_links[1], fetch_numbers[1])
# all_disc_links = grab_toy_links(disc_pages_links)
# all_toy_links.append(all_disc_links)

# # To look at the disc fetch toys -- NEED TO FIGURE OUT HOW TO ACCESS THEM
# check = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.LINK_TEXT, "Fetch Toys")))

# # Scraping
# # scrape_toy_page(fetch_toys, 'discs', all_disc_links)

# # To get back to fetch toys
# driver.get('https://www.chewy.com/b/fetch-toys-317')

# print("Done with Disc Fetch Toys!")

# ###################################################################################################

# # Looking a the Launcher toys
# element = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.LINK_TEXT, "Launchers")))
# element.click()

# # Getting links for all pages for treat toys & dispensers
# launcher_pages_links = grab_subcat_links(fetch_links[2], fetch_numbers[2])
# all_launcher_links = grab_toy_links(launcher_pages_links)
# all_toy_links.append(all_launcher_links)


# # To look at the launcher fetch toys -- NEED TO FIGURE OUT HOW TO ACCESS THEM
# check = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.LINK_TEXT, "Fetch Toys")))

# # Scraping
# # scrape_toy_page(fetch_toys, 'launchers', all_launcher_links)

# print("Done with Launcher Fetch Toys!")

# # To get back to fetch toys
# driver.back()

# ######################################################################################################

# # Looking a the Stick toys
# element = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.LINK_TEXT, "Sticks")))
# element.click()

# # Getting links for all pages for treat toys & dispensers
# stick_pages_links = grab_subcat_links(fetch_links[2], fetch_numbers[2])
# all_stick_links = grab_toy_links(stick_pages_links)
# all_toy_links.append(all_stick_links)

# # To look at the stick fetch toys 
# check = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.LINK_TEXT, "Fetch Toys")))
# # To get back to fetch toys

# # Scraping
# # scrape_toy_page(fetch_toys, 'sticks', all_stick_links)

# print('Done with Stick Fetch Toys!')

# print('Done with Fetch Toys!')



<span class="ugc-list__review__display">My dog LOVES chewing his toys and tearing them apart. Unless I’m getting him a toy that I know is gonna last, I typically buy him the cheapest ones since they won’t last more than 20 minutes. I thought  these balls would only last a day or two max, but we’re going on a week AND THERES NOT EVEN A DENT!! 

He absolutely loves these balls. They’re scented too which I thinks makes him like them even more- but we have an indoor one and an outside one since he cuddles this toy and tries his darn best to chew it. Also, it’s $4 cheaper on chewy than petsmart.</span>
<span class="ugc-list__review__display">Our young pup loves this ball! She is an extreme chewer and does not play traditional "fetch". Instead, she loves to have the ball bounced so that she can chase it down and take it somewhere in the yard to chew. We have had this ball for about 2 months and so far no damage. It has outlived every toy other than the extreme medium Kong toy. The ball bounc

AttributeError: 'NoneType' object has no attribute 'find'

In [None]:
# fetch_toy_list = []
# for subcat in ['balls', 'discs', 'sticks']: #, 'launchers'
#     for index, link in enumerate(fetch_toys[subcat]):
#         fetch_toys[subcat][link]['subcat'] = subcat
#         fetch_toys[subcat][link]['cat'] = 'interactive toys'
#         fetch_toy_list.append(fetch_toys[subcat][link])
# fetch_toy_df = pd.DataFrame(fetch_toy_list)

In [None]:
# fetch_toy_df.to_csv('./data/fetchtoy_df.csv', index=False)