# Web Scraping Code for Extracting Data from Vestiairecollective

## by Kibrom Hailu

This notebook scrapes product data from the website **Vestiairecollective**. It first collects the links to all reachable product pages and then extracts the data behind each link.

Vestiairecollective is an online platform that offers a wide range of fashion items. It is organized into six distinct categories, each representing a specific clothing type or theme. These categories are:

- **Men**: Products for men's fashion.
- **Women**: Products for women's fashion.
- **Kids**: Products for children's fashion.
- **We-love**: A curated selection of trending and popular items.
- **New Items**: Recently added products.
- **Vintage**: Vintage and pre-owned fashion items.

Each category spans multiple pages of product listings, about 21 pages per category on average. Each page displays 48 products, so a full crawl covers roughly 6 × 21 × 48 = 6,048 listings.
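
The figures above give a rough size for the crawl. The sketch below simply multiplies them out; it assumes exactly 21 pages in each of the six categories and 48 products on every page, whereas the real counts vary and the same product may appear in more than one category.

```python
# Rough upper bound on the number of listings, under the assumptions stated above.
categories = ["men", "women", "kids", "we-love", "new-items", "vintage"]
pages_per_category = 21   # assumed average taken from the text
products_per_page = 48    # assumed page size taken from the text

total_pages = len(categories) * pages_per_category
total_listings = total_pages * products_per_page

print(f"{total_pages} pages, about {total_listings} listings")
```

Because links are later stored in a set, duplicates across categories collapse, so the number of unique product links is likely to be lower than this bound.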

To achieve this, the web scraping code has been developed with the following functionalities:

1. **Data Extraction**: The code will traverse through the website's categories and pages, capturing information about each product. It will extract details such as the product's name, price, description, and any other relevant attributes.

2. **Data Storage**: The extracted product data will be stored in a structured format for further analysis and utilization. The code will create a file named **data.csv** to store the collected data. This file will serve as a repository for all the extracted information.

3. **Link Extraction**: In addition to product data, the code will also capture and store the links associated with each product. These links provide direct access to the individual product pages on Vestiairecollective. The code will create a file named **links.csv** to store all the collected links. This file will be valuable for future reference and analysis.

4. **Error Handling**: While scraping the website, it's possible to encounter failed or inaccessible links. To account for this, the code will track and record any failed links in a separate file named **failed.csv**. This file will provide a reference for investigating and resolving any issues with the scraping process.

The code is structured with modular functions to handle specific tasks efficiently:

- The `file_handler` function manages file operations, including converting files to lists or dictionaries, and vice versa. It ensures seamless data handling and processing.

- The `manipulate_data` function allows for data manipulation and transformation. It provides flexibility to reorder data, modify prices, or perform any other required data transformations.

- The main file orchestrates the scraping process, utilizing the `store_link` function to collect and store all the necessary links while traversing through the available data.
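
The notebook cells do not show `manipulate_data`, so the following is only a sketch of the kind of price clean-up such a helper might perform. The name `parse_price` and the assumed input format (a number with `,` as thousands separator and `.` as decimal mark, with a currency symbol before or after it) are illustrative assumptions, not the author's implementation.

```python
import re
from typing import Optional


def parse_price(text: str) -> Optional[float]:
    """Convert a scraped price string such as '€1,250.00' to a float.

    Hypothetical helper: assumes ',' separates thousands and '.' marks decimals.
    Returns None when the text contains no number.
    """
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None
    return float(match.group(0).replace(",", ""))


print(parse_price("€1,250.00"))
print(parse_price("185 €"))
print(parse_price(""))
```

Prices formatted with a decimal comma (for example `1.250,00`) would need a different rule, so the format should be checked against the scraped data before relying on a helper like this.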

By implementing this web scraping code, it becomes possible to efficiently extract and organize the desired data from Vestiairecollective, enabling further analysis, insights, and applications related to the fashion industry and e-commerce.

In [3]:
# Import the required libraries
import time
import csv

# If selenium is not installed, uncomment the following line
# !pip install selenium
from selenium import webdriver as wd
from selenium.webdriver.common.by import By

## Collect the paths where all the products can be found

In [4]:
# Each entry: [URL path segment, URL fragment (gender filter, if any), number of pages]
values = {
    'men': ['men', '#gender=Men%232', 21],
    'women': ['women', '#gender=Women%231', 21],
    'kids': ['kids', '#gender=Kids%233', 21],
    'we-love': ['we-love', '', 21],
    'new-items': ['new-items', '', 21],
    'vintage': ['vintage', '', 21]
}

def get_url(value, cur):
    # Build the URL of page `cur` of a category
    return f'https://www.vestiairecollective.com/{value[0]}/p-{cur}/{value[1]}'

# Page 1 of each category is the bare category URL
urls = [
    'https://www.vestiairecollective.com/vintage/',
    'https://www.vestiairecollective.com/we-love/',
    'https://www.vestiairecollective.com/kids/',
    'https://www.vestiairecollective.com/men/',
    'https://www.vestiairecollective.com/women/',
    'https://www.vestiairecollective.com/new-items/',
]

# Pages 2..N follow the /p-<page>/ pattern
for val in values.values():
    for i in range(2, val[2] + 1):
        urls.append(get_url(val, i))
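
To make the URL pattern concrete, the sketch below rebuilds the list on a reduced, two-category copy of `values` and prints the result. It assumes, as the cell above does, that page 1 lives at the bare category URL and later pages at `/p-<n>/`; whether the site still uses this scheme is not verified here. `BASE`, `sample_values`, and `sample_urls` are names introduced for illustration.

```python
BASE = "https://www.vestiairecollective.com"


def get_url(value, cur):
    # value = [path segment, URL fragment, number of pages]
    return f"{BASE}/{value[0]}/p-{cur}/{value[1]}"


# Reduced copy of the `values` mapping, with 3 pages each to keep the output short
sample_values = {
    "men": ["men", "#gender=Men%232", 3],
    "vintage": ["vintage", "", 3],
}

sample_urls = []
for value in sample_values.values():
    sample_urls.append(f"{BASE}/{value[0]}/")   # page 1
    for page in range(2, value[2] + 1):         # pages 2..n
        sample_urls.append(get_url(value, page))

for url in sample_urls:
    print(url)
```

With 21 pages for each of the six categories, the full list contains 6 + 6 × 20 = 126 URLs.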

In [5]:
# Custom helper that returns an empty string instead of raising an exception
def find_element(body, path):
    try:
        res = body.find_elements(By.XPATH, path)
        if len(res) > 0:
            take = res[0].text.strip()
            # Unwrap text of the form "(value,)"
            if take.startswith('(') and take.endswith(',)'):
                return take[1:-2]
            return take
        return ""
    except Exception:
        return ""
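
The fallback behaviour of this helper can be exercised without a browser. The sketch below re-declares a simplified version that uses the literal locator string `"xpath"` (the value of Selenium's `By.XPATH`) and passes it small stand-in objects. `FakeElement` and `FakePage` are hypothetical test doubles, not Selenium classes.

```python
XPATH = "xpath"  # same string value as selenium's By.XPATH


def find_element(body, path):
    """Return the stripped text of the first match, or "" when nothing matches."""
    try:
        res = body.find_elements(XPATH, path)
        return res[0].text.strip() if res else ""
    except Exception:
        return ""


class FakeElement:
    """Hypothetical stand-in for a WebElement; only exposes .text."""

    def __init__(self, text):
        self.text = text


class FakePage:
    """Hypothetical stand-in that answers find_elements from a dictionary."""

    def __init__(self, mapping):
        self.mapping = mapping

    def find_elements(self, by, path):
        return [FakeElement(t) for t in self.mapping.get(path, [])]


page = FakePage({"//h1": ["  Silk scarf  "]})
print(repr(find_element(page, "//h1")))   # first match, whitespace stripped
print(repr(find_element(page, "//h2")))   # no match: empty string
print(repr(find_element(None, "//h1")))   # unusable page object: empty string
```

Returning `""` keeps the scraping loop running when a field is absent, at the cost of hiding the reason for the miss; logging the exception would be one way to keep that information.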


In [7]:
# Read previously saved data from local storage
def read_csv_data():
    try:
        with open('data.csv', 'r', newline='', encoding="utf-8") as csvfile:
            return list(csv.DictReader(csvfile))
    except FileNotFoundError:
        # Nothing has been saved yet
        return []

# Save the final data, writing only the columns listed in fieldnames
def save_as_csv(data, fieldnames, finalname):
    with open(finalname, 'w', newline='', encoding="utf-8") as csvfile:
        # extrasaction='ignore' skips keys that are not in fieldnames
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames, extrasaction='ignore')
        writer.writeheader()
        writer.writerows(data)
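
A short round trip shows how the read and write helpers fit together. The sketch below is self-contained and writes to a temporary directory rather than the notebook's working folder. It relies on the `extrasaction='ignore'` option of `csv.DictWriter` to drop keys that are not listed in `fieldnames`. The names `write_rows` and `read_rows` and the sample rows are made up for illustration.

```python
import csv
import os
import tempfile


def write_rows(rows, fieldnames, path):
    # extrasaction='ignore' drops keys that are not listed in fieldnames
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)


def read_rows(path):
    # A missing file is treated as "nothing saved yet"
    try:
        with open(path, "r", newline="", encoding="utf-8") as f:
            return list(csv.DictReader(f))
    except FileNotFoundError:
        return []


rows = [
    {"Name": "Silk scarf", "Price": "120", "Unlisted": "dropped"},
    {"Name": "Leather belt"},  # missing Price is written as an empty cell
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.csv")
    missing = read_rows(path)                  # file does not exist yet
    write_rows(rows, ["Name", "Price"], path)
    loaded = read_rows(path)

print(missing)
print(loaded)
```

Note that every value comes back as a string after the round trip, so numeric columns such as prices need converting again after loading.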


In [15]:
# Use the following code to extract all product links
def last_check(href):
    # Keep only product pages: a gender prefix followed by '-' and a '.shtml' ending
    pos = ['men', 'women', 'kids']
    base = 'https://www.vestiairecollective.com/'
    return any(href.startswith(base + p + '-') for p in pos) and href.endswith('.shtml')

try:
    driver = wd.Chrome()
    fnames = ['link']
    all_links = set()
    for url in urls:
        driver.get(url)

        # Scroll to the end in small steps so that all objects in the page are loaded
        scroll_height = 0
        viewport_height = driver.execute_script("return window.innerHeight")
        scroll_increment = viewport_height // 20  # Adjust the scroll increment as desired

        while scroll_height < driver.execute_script("return document.documentElement.scrollHeight"):
            driver.execute_script(f"window.scrollTo(0, {scroll_height});")
            scroll_height += scroll_increment

        time.sleep(2)
        try:
            # To show 96 results per page, uncomment the next two lines
            # btn = driver.find_element(By.XPATH,'//*[@id="__next"]/div/main/div[3]/div/div[2]/div[2]/div[1]/div[1]/button[2]')
            # btn.click()
            links = driver.find_elements(By.TAG_NAME, 'a')  # Find all anchor elements

            # Read each href once and keep only the product links
            hrefs = set()
            for link in links:
                href = link.get_attribute('href')
                if href and last_check(href):
                    hrefs.add(href)
            all_links.update(hrefs)

            # Save all links collected so far
            save_as_csv([{'link': each_link} for each_link in all_links], fnames, 'links.csv')
            print(f'From the current page {len(hrefs)} links collected. Total collected links-{len(all_links)}')
        except Exception as e:
            print(f"An error occurred, possibly related to the network: {e}")
    driver.quit()
except Exception as e:
    print(f'There is an error {e} : most probably it is because of closing the web')

There is an error unsupported operand type(s) for //: 'NoneType' and 'int' : most probably it is because of closing the web
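
The product-link filter used in the cell above can be tested on its own, without a browser. In the sketch below, `is_product_link` is a hypothetical name for the same check as `last_check`, taking the URL string directly instead of a Selenium element. The sample URLs are made up for illustration and are not real listings.

```python
BASE = "https://www.vestiairecollective.com/"
PREFIXES = ("men", "women", "kids")


def is_product_link(href):
    """True for URLs that start with BASE + '<prefix>-' and end with '.shtml'."""
    if not href:
        return False
    return href.endswith(".shtml") and any(
        href.startswith(f"{BASE}{prefix}-") for prefix in PREFIXES
    )


samples = [
    BASE + "women-bags/handbags/example-item-123.shtml",  # made-up product URL
    BASE + "women/p-2/",                                   # listing page
    BASE + "journal/some-article.shtml",                   # not a product prefix
    None,                                                  # anchor without href
]
print([is_product_link(s) for s in samples])
```

Only the first sample passes: the listing page lacks the `-` after the prefix and the `.shtml` ending, the third URL has no matching prefix, and a missing `href` is rejected before any string method is called.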


In [11]:
# Load all the links stored by the previous cell
all_stored_link = []
products = read_csv_data()
failed = []

with open('links.csv', 'r', newline='', encoding="utf-8") as file:
    csv_reader = csv.DictReader(file)
    for row in csv_reader:
        all_stored_link.append(row['link'])
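
The next cell resumes by position, with `all_stored_link[len(products):]`, which lines up only if every earlier link produced exactly one stored product. A possible alternative, sketched below, is to skip links that already appear in the `Link` column of the saved data. `pending_links` is a hypothetical helper, not part of the notebook, and the short link strings are placeholders.

```python
def pending_links(all_links, products, failed=()):
    """Return links that appear in neither the scraped rows nor the failed rows.

    The order of all_links is preserved and duplicates are dropped.
    """
    done = {row.get("Link") for row in products}
    done.update(row.get("failed") for row in failed)
    result = []
    for link in all_links:
        if link not in done:
            done.add(link)
            result.append(link)
    return result


links = ["a.shtml", "b.shtml", "c.shtml", "d.shtml", "c.shtml"]
products = [{"Link": "a.shtml"}, {"Link": "c.shtml"}]
failed = [{"failed": "b.shtml"}]

print(pending_links(links, products, failed))  # skip failed links as well
print(pending_links(links, products))          # retry failed links
```

Leaving out the `failed` argument retries links that failed earlier, which may be preferable when the failures were caused by a closed browser window rather than by the pages themselves.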


In [17]:
# Extract all products from the collected links
fieldnames = [
    'Name', 'Price', 'price_with_discount', 'Image', 'Link', 'Designer:', 'Categories :', 'Sub-category:', 'Condition:', 'Online since:',
    'Color:', 'Material:', 'Category:', 'Discription', 'Location:', 'Reference:', 'Size:', 'Measurement', 'Model:', 'Style:', 'Place of purchase:'
]

# XPaths used on each product page
SECTION = '//*[@id="__next"]/div/main/section[1]/div/div/div[2]'
DETAIL_LIST = SECTION + '/div[2]/div/div/ul'
LABEL_PATH = SECTION + '/div[2]/div[1]/div/ul/li[{}]/span[1]'
VALUE_PATH = DETAIL_LIST + '/li[{}]/span[2]'
PRICE_BOX = '//*[@id="__next"]/div/main/div[1]/div/div[3]/div/div[1]/div/div[2]/div/p'

# Start the browser
try:
    driver = wd.Chrome()
    # Resume after the products that are already stored in data.csv
    for href in all_stored_link[len(products):]:
        try:
            driver.get(href)
            time.sleep(1)
            # Raises if the detail list is missing, which records the link as failed
            product_detail = driver.find_element(By.XPATH, DETAIL_LIST)

            product = {}
            # Each list item holds a label (span[1]) and its value (span[2])
            for i in range(1, 12):
                label = find_element(product_detail, LABEL_PATH.format(i))
                if label:
                    product[label] = find_element(product_detail, VALUE_PATH.format(i))
            product['Link'] = href

            disc = find_element(driver, SECTION + '/div[1]/p[1]')
            if disc:
                product['Discription'] = disc
            measurement = find_element(driver, SECTION + '/div[2]/div[2]/div[1]/ul')
            if measurement:
                product['Measurement'] = measurement
            product['Price'] = find_element(driver, PRICE_BOX + '/span[1]')
            product['price_with_discount'] = find_element(driver, PRICE_BOX + '/span[2]')
            # Image
            image_element = driver.find_element(By.CLASS_NAME, 'vc-images_image__TfKYE')
            product['Image'] = image_element.get_attribute('src')
            # The page title starts with the product name
            product['Name'] = driver.title.split('-')[0].strip()

            products.append(product)
            save_as_csv(products, fieldnames, 'data.csv')
        except Exception as e:
            # Record the failed link and continue with the next one
            failed.append({'failed': href})
            save_as_csv(failed, ['failed'], 'failed.csv')
            print(e)
    driver.quit()
except Exception as e:
    print(f'There is an error {e} : most probably it is because of closing the web')

Message: no such window: target window already closed
from unknown error: web view not found
  (Session info: chrome=123.0.6312.107)
Stacktrace:
	GetHandleVerifier [0x00007FF775E37032+63090]
	(No symbol) [0x00007FF775DA2C82]
	(No symbol) [0x00007FF775C3EC65]
	(No symbol) [0x00007FF775C1CA7C]
	(No symbol) [0x00007FF775CAD687]
	(No symbol) [0x00007FF775CC2AC1]
	(No symbol) [0x00007FF775CA6D83]
	(No symbol) [0x00007FF775C783A8]
	(No symbol) [0x00007FF775C79441]
	GetHandleVerifier [0x00007FF7762325AD+4238317]
	GetHandleVerifier [0x00007FF77626F70D+4488525]
	GetHandleVerifier [0x00007FF7762679EF+4456495]
	GetHandleVerifier [0x00007FF775F10576+953270]
	(No symbol) [0x00007FF775DAE54F]
	(No symbol) [0x00007FF775DA9224]
	(No symbol) [0x00007FF775DA935B]
	(No symbol) [0x00007FF775D99B94]
	BaseThreadInitThunk [0x00007FFF7D61257D+29]
	RtlUserThreadStart [0x00007FFF7F2CAA48+40]

Message: no such window: target window already closed
from unknown error: web view not found
  (Session info: chrome=123

# Conclusion

The notebook crawls the category pages of Vestiairecollective, collects the product links, and then extracts each product's name, price, description, and other listed attributes.

The code's modular structure and specific functions, such as `file_handler` and `manipulate_data`, ensure efficient data handling and manipulation. The extracted data is stored in structured files, with the main data being saved in a file named **data.csv** and the product links in **links.csv**. Additionally, any encountered errors or failed links are recorded in **failed.csv**, enabling subsequent troubleshooting and improvement of the scraping process.

By leveraging this web scraping code, fashion industry professionals, researchers, and data enthusiasts can gain insights into the Vestiairecollective platform. The extracted data can be used for various purposes, such as market analysis, trend identification, pricing strategies, or creating personalized recommendations for customers.

Moreover, this code can serve as a foundation for further development and customization to meet specific requirements or integrate with other data analysis pipelines. It provides a solid starting point for leveraging the vast amount of fashion data available on Vestiairecollective and extracting meaningful insights to drive business decisions and innovation in the fashion industry.

In summary, the code gathers product links page by page, visits each product, and writes the results to CSV files that can feed later analysis and research.