# Adidas Footwear Scraping

Scrape footwear product information from the [official Adidas website](https://www.adidas.com/us/shoes?grid=true). The following information is collected for each product:
* Product title and subtitle
* Direct URL to the individual product page
* Prices (original and discounted)
* Product description
* Color choices / number of color choices
* Number of reviews
* Average rating for the product based on the reviews
* Other product details, such as materials and product code

In [1]:
from time import sleep
import json
import requests

from selenium import webdriver 
from bs4 import BeautifulSoup

import numpy as np
import pandas as pd

In [None]:
from utils import AdidasProductCard

In [17]:
def parse_adidas_shoes(adidas_soup):
    """Return one dict of product fields per product card found in a listing page."""
    shoes_tags = adidas_soup.select('.glass-product-card')
    shoes_info = []
    for tag in shoes_tags:
        card = AdidasProductCard(tag)
        # Call each multi-value getter once and index into the result.
        prices = card.get_prices()            # [0] original price, [1] reduced price
        review_info = card.get_review_info()  # [0] number of reviews, [1] average stars
        shoes_info.append({'title': card.get_title(),
                           'subtitle': card.get_subtitle(),
                           'num_colors': card.get_num_colors(),
                           'price': prices[0],
                           'reduced_price': prices[1],
                           'url': card.url,
                           'description': card.get_description(),
                           'details': card.get_details(),
                           'colors': card.get_colors(), # each color is separated by "; "
                           'n_reviews': review_info[0],
                           'avg_stars': review_info[1]
                          })
    return shoes_info

In [4]:
driver = webdriver.Chrome()

adidas_shoes_list = []
url_root = 'https://www.adidas.com'
page_num = 1
current_page_url = "https://www.adidas.com/us/shoes?grid=true%2F" # first page
while True:
    print(f"page {page_num}: {current_page_url}")
    driver.get(current_page_url)
    sleep(2)
    adidas_soup = BeautifulSoup(driver.page_source, "html.parser")
    current_list = parse_adidas_shoes(adidas_soup)
    print(f"Number of shoes on page {page_num}: {len(current_list)}")
    adidas_shoes_list.extend(current_list)
    
    # Find the next page to scrape in the pagination.
    next_page_element = adidas_soup.find(attrs = {'data-auto-id': 'plp-pagination-next'})
    if not next_page_element: # no next page
        break
    current_page_url = next_page_element.get('href')
    current_page_url = url_root + current_page_url
    page_num += 1
    
driver.quit()  # quit() ends the whole WebDriver session; close() only closes the current window
adidas_shoes_df = pd.DataFrame(adidas_shoes_list)
print(f"\nTotal number of Adidas shoes: {len(adidas_shoes_list)}")

Connected to the page
/us/shoes?grid=true%2F&start=48
Number of shoes on page 2: 48
/us/shoes?grid=true%2F&start=96
Number of shoes on page 3: 48
/us/shoes?grid=true%2F&start=144
Number of shoes on page 4: 48
/us/shoes?grid=true%2F&start=192
Number of shoes on page 5: 0

Total number of Adidas shoes: 192
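
The last page in the log above reports 0 shoes. One possible cause (not confirmed here) is that the product cards had not finished rendering after the fixed `sleep(2)`. A hedged alternative is to poll until the cards appear or a timeout expires. The sketch below is library-agnostic; `wait_for` and its parameters are hypothetical names, not part of Selenium, and the demo uses a stand-in function instead of a real browser.

```python
import time


def wait_for(fetch, timeout=10.0, interval=0.5):
    """Call fetch() repeatedly until it returns a non-empty result or the timeout elapses.

    Returns the last result, which may be empty if the timeout was reached.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = fetch()
        if result or time.monotonic() >= deadline:
            return result
        time.sleep(interval)


# Stand-in for a page whose cards only show up on the third check.
attempts = {"count": 0}


def fake_fetch():
    attempts["count"] += 1
    return ["card"] * 48 if attempts["count"] >= 3 else []


cards = wait_for(fake_fetch, timeout=5.0, interval=0.01)
print(len(cards), attempts["count"])  # prints "48 3"
```

In the scraping loop, `fetch` could be a function that re-parses `driver.page_source` and returns the `.glass-product-card` tags, replacing the fixed sleep. Selenium also provides its own explicit-wait helper, `WebDriverWait`, which may be preferable when working directly with the driver.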


In [5]:
display(adidas_shoes_df.head())
adidas_shoes_df.to_csv("../data/adidas_raw.csv")

| Unnamed: 0 | title | subtitle | num_colors | url | price | reduced_price | description | details | colors | n_reviews | avg_stars |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Start Your Run Shoes | Women's Running | 4 colors | /us/start-your-run-shoes/GY9233.html | $65 | $33 | You'll want these adidas running shoes the nex... | | Dash Grey / Matte Silver / Core Black | 5 | 4.8 |
| 1 | NMD_R1 Shoes | Youth Originals | | /us/nmd_r1-shoes/H03994.html | $130 | $91 | One shoe to rule them all. School, work or kic... | | | 131 | 4.6 |
| 2 | Edge Lux Shoes | Women's Training | 5 colors | /us/edge-lux-shoes/GZ6741.html | $90 | $45 | Comfort is key, whether you're racing to catch... | | Core Black / Core Black / Iron Metallic | 191 | 4.0 |
| 3 | Adilette Comfort Slides | Sportswear | 19 colors | /us/adilette-comfort-slides/GW9647.html | $40 | $24 | Classics for a reason. These adidas slides are... | | Core Black / Core White / Grey Six | 9735 | 4.7 |
| 4 | Fluidflow 2.0 Shoes | Men's Sportswear | 3 colors | https://www.adidas.com/us/fluidflow-2.0-shoes/... | $85 | $51 | It doesn't really matter whether or not a run ... | | Legend Ink / Cloud White / Shadow Maroon | 866 | 4.6 |
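
The preview shows that the raw fields are not yet uniform: most `url` values are relative paths while at least one is already absolute, and prices are strings with a currency symbol. A minimal sketch of how these could be normalized before analysis, assuming US-dollar prices of the form `$65`; the helper names `absolute_url` and `parse_price` are hypothetical.

```python
from urllib.parse import urljoin

URL_ROOT = "https://www.adidas.com"


def absolute_url(url):
    """Resolve a relative product path against the site root; absolute URLs pass through."""
    return urljoin(URL_ROOT, url)


def parse_price(text):
    """Convert a price string such as '$65' to a float; return None for missing values."""
    if not isinstance(text, str) or not text.strip():
        return None
    return float(text.replace("$", "").replace(",", "").strip())


print(absolute_url("/us/start-your-run-shoes/GY9233.html"))
# https://www.adidas.com/us/start-your-run-shoes/GY9233.html
print(absolute_url("https://www.adidas.com/us/edge-lux-shoes/GZ6741.html"))
# https://www.adidas.com/us/edge-lux-shoes/GZ6741.html
print(parse_price("$65"), parse_price(""))
# 65.0 None
```

Using `urljoin` rather than string concatenation avoids producing a doubled host when a link is already absolute.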


### Extract and save by page
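
Saving one CSV per page makes it possible to resume after an interruption without re-scraping earlier pages. Judging by the pagination links in the log above, the listing advances in steps of 48 products through a `start` query parameter, so the URL of any page can be computed directly. A minimal sketch, where `PAGE_SIZE` and `listing_page_url` are hypothetical names and the step of 48 is an assumption read off the observed URLs:

```python
URL_ROOT = "https://www.adidas.com"
PAGE_SIZE = 48  # assumption: products per listing page, as seen in the scraped URLs


def listing_page_url(page_num, page_size=PAGE_SIZE):
    """Build the listing URL for a 1-based page number."""
    if page_num < 1:
        raise ValueError("page_num must be >= 1")
    start = page_size * (page_num - 1)
    return f"{URL_ROOT}/us/shoes?grid=true%2F&start={start}"


print(listing_page_url(42))
# https://www.adidas.com/us/shoes?grid=true%2F&start=1968
```

For page 1 this yields `start=0`, which is assumed here to be equivalent to the listing URL without a `start` parameter.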

In [20]:
page_num = 41

driver = webdriver.Chrome()
print(f"Start connection from page {page_num}")
url_root = "https://www.adidas.com"
current_page_url = url_root + f"/us/shoes?grid=true%2F&start={48 * (page_num - 1)}" 

while True:
    print(f"page {page_num}: {current_page_url}")
    driver.get(current_page_url)
    sleep(2)
    adidas_soup = BeautifulSoup(driver.page_source, "html.parser")
    current_list = parse_adidas_shoes(adidas_soup)
    print(f"Number of shoes on page {page_num}: {len(current_list)}")
    adidas_shoes_df = pd.DataFrame(current_list)
    adidas_shoes_df.to_csv(f"../data/adidas_raw_page{page_num}.csv")
    
    # Find the next page to scrape in the pagination.
    next_page_element = adidas_soup.find(attrs = {'data-auto-id': 'plp-pagination-next'})
    if not next_page_element: # no next page
        break
    current_page_url = next_page_element.get('href')
    current_page_url = url_root + current_page_url
    page_num += 1

driver.quit()  # quit() ends the whole WebDriver session; close() only closes the current window

Connected to the page 41

Total number of Adidas shoes on page 41: 48
url page 42: /us/shoes?grid=true%2F&start=1968

Total number of Adidas shoes on page 42: 48
url page 43: /us/shoes?grid=true%2F&start=2016

Total number of Adidas shoes on page 43: 48
url page 44: /us/shoes?grid=true%2F&start=2064

Total number of Adidas shoes on page 44: 48
url page 45: /us/shoes?grid=true%2F&start=2112

Total number of Adidas shoes on page 45: 48
url page 46: /us/shoes?grid=true%2F&start=2160

Total number of Adidas shoes on page 46: 48
url page 47: /us/shoes?grid=true%2F&start=2208

Total number of Adidas shoes on page 47: 48
url page 48: /us/shoes?grid=true%2F&start=2256

Total number of Adidas shoes on page 48: 48
url page 49: /us/shoes?grid=true%2F&start=2304

Total number of Adidas shoes on page 49: 48
url page 50: /us/shoes?grid=true%2F&start=2352

Total number of Adidas shoes on page 50: 33
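
Once every page has been saved, the per-page files need to be stitched back together in page order. A plain alphabetical sort would place `page10` before `page9`, so the sketch below sorts on the numeric page number instead. It uses only the standard library, and the function names and the added `page` column are hypothetical; the demo runs on two made-up files in a temporary directory rather than on scraped data.

```python
import csv
import re
import tempfile
from pathlib import Path


def page_number(path):
    """Extract the page number from a name like 'adidas_raw_page41.csv'."""
    match = re.search(r"page(\d+)\.csv$", path.name)
    return int(match.group(1)) if match else -1


def combine_pages(data_dir):
    """Read every per-page CSV in data_dir, in page order, into one list of row dicts."""
    rows = []
    for path in sorted(Path(data_dir).glob("adidas_raw_page*.csv"), key=page_number):
        with path.open(newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                row["page"] = page_number(path)  # keep track of the source page
                rows.append(row)
    return rows


# Demo on two tiny stand-in files (made-up rows, not scraped data).
with tempfile.TemporaryDirectory() as tmp:
    for page, title in [(10, "Shoe B"), (9, "Shoe A")]:
        out_path = Path(tmp) / f"adidas_raw_page{page}.csv"
        with out_path.open("w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(["title", "price"])
            writer.writerow([title, "$65"])
    combined = combine_pages(tmp)

print([(row["page"], row["title"]) for row in combined])
# [(9, 'Shoe A'), (10, 'Shoe B')]
```

With pandas, the same result could be obtained by reading each file with `pd.read_csv` and joining the frames with `pd.concat`.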
