Web Scraping & Data Processing <br>
KARUNATHASAN Nilany <br>
SAMBATH Sïndoumady <br>
DIA2

#### <center><font  align="center"> Project: Sustainable Nike Sneaker Marketplace 👟</font> </center>
# <center><font  align="center"> Vinted Scraping </font> </center>


___

## SUMMARY

[1. Vinted Data Scraping and Processing](#1.-Vinted-Data-Scraping-and-Processing) <br>
[2. Approximate Carbon Footprint Calculation](#2.-Approximate-Carbon-Footprint-Calculation) <br>
[3. Nike Price Scraping for Profit Calculation](#3.-Nike-Price-Scraping-for-Profit-Calculation) <br>

## 1. Vinted Data Scraping and Processing

The code below scrapes Nike shoe listings from Vinted. The `get_vinted_data` function accepts optional `gender`, `color`, and `size` parameters so that the search can be tailored to the user's criteria.

**• Mapping Dictionaries for Filters**

Because changes to Vinted's filters alter the URL structure, the function relies on mapping dictionaries. When a filter changes, only the relevant dictionary needs updating for the URLs to be built correctly and the search to match the user's preferences.

User-provided gender, color, and size values are mapped to the corresponding Vinted IDs. <br>
<br>
For gender, the function maps men (`homme`), women (`femme`), girls (`fille`), and boys (`garcon`) to distinct Vinted category IDs. <br>
<br>
Color and size are mapped in the same way to Vinted color IDs and size IDs, with the size table chosen according to the selected gender. The resulting IDs are appended to the URL as query parameters together with the search text, producing a search query that matches the user's preferences. <br>
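
The URL construction described above can be sketched as follows. This is a minimal illustration rather than the notebook's own code: the helper name `build_vinted_url` is hypothetical, only a few mapping entries are reproduced, and it uses `urllib.parse.urlencode` instead of manual string concatenation. The parameter names (`brand_ids[]`, `catalog[]`, `color_ids[]`, `search_text`) and the IDs are the ones used by `get_vinted_data`.

```python
from urllib.parse import quote, urlencode

# Subset of the mappings, for illustration only
GENDER_IDS = {"homme": 1231, "femme": 1904}
COLOR_IDS = {"noir": 1, "jaune": 8}
NIKE_BRAND_ID = 53


def build_vinted_url(query, gender=None, color=None):
    """Build a Vinted catalog URL from optional, user-facing filter values."""
    params = [("brand_ids[]", NIKE_BRAND_ID)]

    # Filters are optional: skip any value that is missing or unknown
    gender_id = GENDER_IDS.get(gender.lower()) if gender else None
    if gender_id is not None:
        params.append(("catalog[]", gender_id))

    color_id = COLOR_IDS.get(color.lower()) if color else None
    if color_id is not None:
        params.append(("color_ids[]", color_id))

    params.append(("search_text", query))
    # quote_via=quote encodes spaces as %20 rather than '+'
    return "https://www.vinted.fr/catalog?" + urlencode(params, quote_via=quote)


url = build_vinted_url("nike cortez", gender="femme", color="jaune")
print(url)
```

Omitted or unknown filters are simply left out of the URL. Note that `urlencode` percent-encodes the square brackets in the parameter names; whether Vinted treats this identically to literal brackets has not been verified here.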



**• Navigation & Data Extraction**

The function then opens the constructed URL and accepts the cookie banner so that it can interact with the Vinted page. <br>


Iterating through the paginated search results, the function retrieves the title, price, brand, size, link, image source, and location of each article. Extracting the location requires visiting the article's own page: the function opens a new tab, loads the article's href link, reads the location from the details section, then closes the tab and returns to the results.


**• Data Processing**

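Once the data is collected, the `process_dataframe` function is applied. It splits each raw title string into its title, price, brand, and size components, converts the price to a float, uppercases the brand, and returns the result as a structured DataFrame.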
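
To illustrate the kind of parsing this step performs, here is a minimal sketch based on a regular expression. It assumes the overlay's title attribute has the layout `<title>, prix : <price> €, marque : <brand>, taille : <size>` with non-breaking spaces, which is inferred from the field order used in `process_dataframe`; the function name `parse_overlay_title` and the sample string are hypothetical.

```python
import re

# Assumed layout of the overlay's title attribute
TITLE_PATTERN = re.compile(
    r"^(?P<title>.*?), prix\s*:\s*(?P<price>[\d.,]+)\s*€"
    r", marque\s*:\s*(?P<brand>.*?), taille\s*:\s*(?P<size>.*)$"
)


def parse_overlay_title(raw):
    """Split a raw overlay title into title, price, brand, and size."""
    # Normalize non-breaking spaces before matching
    text = raw.replace("\xa0", " ")
    match = TITLE_PATTERN.match(text)
    if match is None:
        return None
    return {
        "Title": match["title"],
        "Price": float(match["price"].replace(",", ".")),
        "Brand": match["brand"].strip().upper(),
        "Size": match["size"].strip(),
    }


sample = "Nike cortez, prix\xa0: 25,00\xa0€, marque\xa0: Nike, taille\xa0: 40"
print(parse_overlay_title(sample))
```

Anchoring on the field labels rather than splitting on `', '` keeps the parse intact when the listing title itself contains a comma.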

In [8]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException
import pandas as pd


def get_vinted_data(query, gender=None, color=None, size=None):
    # Define dictionaries for gender, color, and size mappings
    gender_mapping = {"homme": 1231, "femme": 1904, "fille": 1255, "garcon": 1256}
    color_mapping = {
        "noir": 1, "blanc": 12, "gris": 3, "bleu": 9, "rose": 5,
        "rouge": 7, "marine": 27, "jaune": 8, "orange": 11, "multicolore": 15,
        "vert": 10, "violet": 6, "turquoise": 17, "bordeaux": 23, "bleu clair": 26,
        "lila": 25, "corail": 22, "beige": 4, "kaki": 16, "doré": 14,
        "argenté": 13, "menthe": 30, "marron": 2, "moutarde": 29, "vert foncé": 28,
        "crème": 20, "abricot": 21
    }
    size_mapping_femme = {
        "34": 1364, "34.5": 1580, "35": 55, "35.5": 1195, "36": 56, "36.5": 1196,
        "37": 57, "37.5": 1197, "38": 58, "38.5": 1198, "39": 59, "39.5": 1199,
        "40": 60, "40.5": 1200, "41": 61, "41.5": 1201, "42": 62, "42.5": 1579,
        "43": 63, "43.5": 1573, "44": 1574, "44.5": 1575, "45": 1576, "45.5": 1577,
        "46": 1578
    }
    size_mapping_homme = {
        "38": 776, "38.5": 777, "39": 778, "39.5": 779, "40": 780, "40.5": 781,
        "41": 782, "41.5": 783, "42": 784, "42.5": 785, "43": 786, "43.5": 787,
        "44": 788, "44.5": 789, "45": 790, "45.5": 791, "46": 792, "46.5": 793,
        "47": 794, "47.5": 795, "48": 1190, "48.5": 1621, "49": 1191, "50": 1327,
        "51": 1622, "52": 1623
    }
    size_mapping_fille_garcon = {
        "15 et moins": 657, "16": 585, "17": 586, "18": 587, "19": 588, "20": 589,
        "21": 590, "22": 591, "23": 592, "24": 593, "25": 594, "26": 595, "27": 596,
        "28": 597, "29": 598, "30": 599, "31": 600, "32": 601, "33": 602, "34": 603,
        "35": 604, "36": 605, "37": 606, "38": 607, "39": 608, "40": 609
    }

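    # Resolve gender, color, and size IDs based on user input.
    # All filters are optional, so guard against None before calling .lower()
    gender_id = gender_mapping.get(gender.lower()) if gender else None
    color_id = color_mapping.get(color.lower()) if color else None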
    size_id = None
    if gender_id == 1904:  # Femme
        size_id = size_mapping_femme.get(size, None)
    elif gender_id == 1231:  # Homme
        size_id = size_mapping_homme.get(size, None)
    elif gender_id in (1255, 1256):  # Fille or Garcon
        size_id = size_mapping_fille_garcon.get(size, None)

    # Constructing the base URL with gender_ids, color_ids, and size_ids
    base_url = "https://www.vinted.fr/catalog?brand_ids[]=53&"
    if gender_id:
        base_url += f"catalog[]={gender_id}&"
    if color_id:
        base_url += f"color_ids[]={color_id}&"
    if size_id:
        base_url += f"size_ids[]={size_id}&"

    my_url = f"{base_url}search_text={query.replace(' ', '%20')}"

    driver = webdriver.Firefox()
    driver.get(my_url)

    # Handle cookie acceptance
    try:
        cookie_button = WebDriverWait(driver, 10).until(
            EC.element_to_be_clickable((By.ID, 'onetrust-accept-btn-handler'))
        )
        cookie_button.click()
    except Exception:  # e.g. a timeout when no cookie banner appears
        print("Cookie button not found or timed out")

    data = {"Title": [], "Price": [], "Brand": [], "Size": [], "Link": [], "Image Source": [], "Localisation": []}

    while True:
        try:
            # Wait for the articles to load on the current page
            WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.CLASS_NAME, 'feed-grid'))
            )

            # Find the new articles on the current page
            article_body = driver.find_element(By.CLASS_NAME, 'feed-grid')
            articles = article_body.find_elements(By.CLASS_NAME, "feed-grid__item")

            for article in articles:
                try:
                    title_element = article.find_element(By.CLASS_NAME, 'new-item-box__overlay')
                    title = title_element.get_attribute('title')

                    # Extracting data from the title attribute
                    price_start = title.find('prix') + len('prix&nbsp;:')
                    price_end = title.find('&nbsp;€', price_start)
                    price = title[price_start:price_end].strip()

                    brand_start = title.find('marque') + len('marque&nbsp;:')
                    brand_end = title.find(',', brand_start)
                    brand = title[brand_start:brand_end].strip()

                    size_start = title.find('taille') + len('taille&nbsp;:')
                    size_end = title.find(',', size_start)
                    size = title[size_start:size_end].strip()

                    href = title_element.get_attribute('href')

                    # Extract image source from the correct elements
                    div_element = article.find_element(By.CLASS_NAME, 'new-item-box__image')
                    img_element = div_element.find_element(By.CLASS_NAME, 'web_ui__Image__content')
                    img_src = img_element.get_attribute('src')

                    # Extract localisation information from article's href link
                    driver.execute_script("window.open('', '_blank');")
                    driver.switch_to.window(driver.window_handles[1])
                    driver.get(href)

                    try:
                        localisation_element = WebDriverWait(driver, 10).until(
                            EC.presence_of_element_located((By.XPATH, '//div[@class="details-list__item" and @data-testid="item-details-location"]/div[@class="details-list__item-value"]'))
                        )
                        localisation = localisation_element.text
                    except Exception:  # location not found or page timed out
                        localisation = None

                    driver.close()
                    driver.switch_to.window(driver.window_handles[0])

                    data["Title"].append(title)
                    data["Price"].append(price)
                    data["Brand"].append(brand)
                    data["Size"].append(size)
                    data["Link"].append(href)
                    data["Image Source"].append(img_src)
                    data["Localisation"].append(localisation)

                except (ValueError, NoSuchElementException):
                    # Skip an article with missing elements instead of ending pagination
                    continue

            next_page_link = driver.find_element(By.CLASS_NAME, "web_ui__Pagination__next").get_attribute('href')

            driver.get(next_page_link)

        except NoSuchElementException:
            # If the "Next Page" is not found, break out of the loop
            break

    driver.quit()
    return pd.DataFrame(data)

def process_dataframe(df):
    data = {
        'Title': [],
        'Price': [],
        'Brand': [],
        'Size': [],
        'Link': [],
        'Image Source': [],
        'Localisation': []
    }

    for index, row in df.iterrows():
        title_components = row['Title'].split(', ')
        data['Title'].append(title_components[0])

        price_components = title_components[1].split(':')
        if len(price_components) > 1:
            price = price_components[1].strip().replace('€', '').replace('\xa0', '')
            data['Price'].append(float(price.replace(',', '.')) if price else None)
        else:
            data['Price'].append(None)

        brand_components = title_components[2].split(':')
        data['Brand'].append(brand_components[1].strip().upper() if len(brand_components) > 1 else None)

        size_components = title_components[3].split(':')
        data['Size'].append(size_components[1].strip() if len(size_components) > 1 else None)
        data['Link'].append(row['Link'])
        data['Image Source'].append(row['Image Source'])
        data['Localisation'].append(row['Localisation'])

    return pd.DataFrame(data)

# User-defined query
query = "nike cortez"
gender = "femme"
color = "jaune"
size = "40"

vinted_data = get_vinted_data(query, gender=gender, color=color, size=size)
processed_data = process_dataframe(vinted_data)

processed_data


Unnamed: 0,Title,Price,Brand,Size,Link,Image Source,Localisation
0,Nike cortez,25.0,NIKE,40,https://www.vinted.fr/items/3756249914-nike-co...,https://images1.vinted.net/t/01_01c7f_y3ViupHW...,"BONDOUFLE, FRANCE"
1,Nike Cortez Jewel Two tone,15.0,NIKE,40,https://www.vinted.fr/items/3502006600-nike-co...,https://images1.vinted.net/t/02_020d6_kbXyN2o9...,PAYS-BAS
2,Nike cortez,14.0,NIKE,40,https://www.vinted.fr/items/3604624670-nike-co...,https://images1.vinted.net/t/02_00b49_CBjpKg3A...,"SALLANCHES, FRANCE"
3,nike cortez,15.0,NIKE,40,https://www.vinted.fr/items/3574580483-nike-co...,https://images1.vinted.net/t/03_025ac_tULt4tpe...,ESPAGNE
4,Nike cortez,23.0,NIKE,40,https://www.vinted.fr/items/3377202709-nike-co...,https://images1.vinted.net/t/01_02091_Lz4GGBxN...,"ÉCOUFLANT, FRANCE"
5,Nike Cortez Union Sesame,100.0,NIKE,40,https://www.vinted.fr/items/3813692439-nike-co...,https://images1.vinted.net/t/02_024bf_P3hsGRG2...,"MONTPELLIER, FRANCE"
6,Cortèze,23.0,NIKE,40,https://www.vinted.fr/items/234485412-corteze?...,https://images1.vinted.net/t/01_00101_rCh9EQ5E...,"ARNOUVILLE, FRANCE"
7,Chaussures nike cortez en 40 violette et jaune,7.5,NIKE,40,https://www.vinted.fr/items/3310750318-chaussu...,https://images1.vinted.net/t/03_00431_wqdoFRZU...,"MÉRU, FRANCE"
8,Nike cortez jaune,20.0,NIKE,40,https://www.vinted.fr/items/995385937-nike-cor...,https://images1.vinted.net/t/01_00596_hPKoh8YH...,"NICE, FRANCE"
9,baskets NIKE,38.0,NIKE,40,https://www.vinted.fr/items/3642764164-baskets...,https://images1.vinted.net/t/01_00fa9_5VX2qaVT...,"DOUAI, FRANCE"


## 2. Approximate Carbon Footprint Calculation

After extracting the data, the analysis was extended to the environmental impact of shipping each item from its listed location to Paris. To quantify this impact, a function was developed to estimate the carbon footprint associated with each article's location.

The function geocodes the location, computes its distance to Paris, and multiplies that distance by a constant emission rate per kilometer (0.2 in the code below). In addition, a fixed value of 405 kg CO2 per ton is used for the carbon footprint of ordering from Nike's website (explained in our report).
<br>
The "Carbon Profit" column in the DataFrame is this fixed Nike value minus the item's estimated footprint. It represents the reduction in carbon footprint obtained by ordering through Vinted rather than directly from Nike's website.
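
As a worked example of this calculation, the sketch below computes the footprint and Carbon Profit for a single location without any network access. It is an approximation under stated assumptions: it substitutes a haversine great-circle distance for geopy's `geodesic`, the coordinates for Nice are approximate and hard-coded instead of being geocoded, and the names `haversine_km` and `carbon_profit` are hypothetical.

```python
from math import asin, cos, radians, sin, sqrt

PARIS = (48.8566, 2.3522)        # same reference point as the notebook
EMISSION_RATE_PER_KM = 0.2       # constant rate assumed in the notebook
NIKE_FIXED_VALUE = 405           # fixed Nike value taken from the report


def haversine_km(origin, destination):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (*origin, *destination))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * asin(sqrt(a))  # 6371 km: mean Earth radius


def carbon_profit(origin):
    """Return (footprint, carbon profit) for shipping from origin to Paris."""
    footprint = haversine_km(origin, PARIS) * EMISSION_RATE_PER_KM
    return footprint, NIKE_FIXED_VALUE - footprint


nice = (43.7102, 7.2620)  # approximate coordinates, assumed for illustration
footprint, profit = carbon_profit(nice)
print(f"footprint = {footprint:.1f}, carbon profit = {profit:.1f}")
```

The haversine result differs slightly from the geodesic distance, so the figures will not match the notebook's output to the last digit, but they should be of the same order.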

In [9]:
from geopy.distance import geodesic
from geopy.exc import GeocoderTimedOut
from geopy.geocoders import Nominatim

# Initialize geocoder
geolocator = Nominatim(user_agent="vinted_carbon_footprint")

def calculate_approximate_carbon_footprint(location):
    try:
        # Geocode the location to get coordinates
        location_info = geolocator.geocode(location, timeout=10)

        if location_info:
            location_coordinates = (location_info.latitude, location_info.longitude)

            # Coordinates of Paris, France
            paris_coordinates = (48.8566, 2.3522)

            # Calculate the distance between the location and Paris using geopy
            distance_km = geodesic(location_coordinates, paris_coordinates).kilometers

            # Assume a constant emission rate (kg CO2 per kilometer)
            emission_rate_per_km = 0.2
            # Calculate the approximate carbon footprint
            carbon_footprint = distance_km * emission_rate_per_km

            return carbon_footprint

    except (ValueError, TypeError, GeocoderTimedOut):
        # Handle cases where geocoding fails or times out
        return None

# Adding columns to the DataFrame for the approximate carbon footprint and Nike's fixed value
processed_data["Approximate Carbon Footprint (kg CO2)"] = processed_data["Localisation"].apply(calculate_approximate_carbon_footprint)
processed_data["Approximative Carbon Print from Ordering on Nike (kg CO2/ton)"] = 405  

# Adding a column for Carbon Profit
processed_data["Carbon Profit (kg CO2)"] = processed_data["Approximative Carbon Print from Ordering on Nike (kg CO2/ton)"] - processed_data["Approximate Carbon Footprint (kg CO2)"]

processed_data


Unnamed: 0,Title,Price,Brand,Size,Link,Image Source,Localisation,Approximate Carbon Footprint (kg CO2),Approximative Carbon Print from Ordering on Nike (kg CO2/ton),Carbon Profit (kg CO2)
0,Nike cortez,25.0,NIKE,40,https://www.vinted.fr/items/3756249914-nike-co...,https://images1.vinted.net/t/01_01c7f_y3ViupHW...,"BONDOUFLE, FRANCE",5.418004,405,399.581996
1,Nike Cortez Jewel Two tone,15.0,NIKE,40,https://www.vinted.fr/items/3502006600-nike-co...,https://images1.vinted.net/t/02_020d6_kbXyN2o9...,PAYS-BAS,87.928759,405,317.071241
2,Nike cortez,14.0,NIKE,40,https://www.vinted.fr/items/3604624670-nike-co...,https://images1.vinted.net/t/02_00b49_CBjpKg3A...,"SALLANCHES, FRANCE",91.563944,405,313.436056
3,nike cortez,15.0,NIKE,40,https://www.vinted.fr/items/3574580483-nike-co...,https://images1.vinted.net/t/03_025ac_tULt4tpe...,ESPAGNE,240.819687,405,164.180313
4,Nike cortez,23.0,NIKE,40,https://www.vinted.fr/items/3377202709-nike-co...,https://images1.vinted.net/t/01_02091_Lz4GGBxN...,"ÉCOUFLANT, FRANCE",52.005156,405,352.994844
5,Nike Cortez Union Sesame,100.0,NIKE,40,https://www.vinted.fr/items/3813692439-nike-co...,https://images1.vinted.net/t/02_024bf_P3hsGRG2...,"MONTPELLIER, FRANCE",118.951617,405,286.048383
6,Cortèze,23.0,NIKE,40,https://www.vinted.fr/items/234485412-corteze?...,https://images1.vinted.net/t/01_00101_rCh9EQ5E...,"ARNOUVILLE, FRANCE",3.113751,405,401.886249
7,Chaussures nike cortez en 40 violette et jaune,7.5,NIKE,40,https://www.vinted.fr/items/3310750318-chaussu...,https://images1.vinted.net/t/03_00431_wqdoFRZU...,"MÉRU, FRANCE",9.017902,405,395.982098
8,Nike cortez jaune,20.0,NIKE,40,https://www.vinted.fr/items/995385937-nike-cor...,https://images1.vinted.net/t/01_00596_hPKoh8YH...,"NICE, FRANCE",137.33721,405,267.66279
9,baskets NIKE,38.0,NIKE,40,https://www.vinted.fr/items/3642764164-baskets...,https://images1.vinted.net/t/01_00fa9_5VX2qaVT...,"DOUAI, FRANCE",35.2198,405,369.7802


## 3. Nike Price Scraping for Profit Calculation

To assess potential cost savings, all prices listed for the query on the Nike website are scraped and averaged. This average serves as the reference price of the desired item.
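
The averaging step can be sketched independently of the browser. The example below is a minimal illustration with made-up price labels; the helper names `parse_price` and `average_price` are hypothetical. It skips labels that cannot be parsed and returns `None` instead of dividing by zero when no price is found.

```python
def parse_price(text):
    """Convert a price label such as '109,99 €' to a float, or None if unparseable."""
    cleaned = text.replace("€", "").replace("\xa0", "").replace(",", ".").strip()
    try:
        return float(cleaned)
    except ValueError:
        return None


def average_price(labels):
    """Average the parseable prices; None if there are none."""
    prices = [p for p in map(parse_price, labels) if p is not None]
    # Avoid ZeroDivisionError when nothing was scraped
    return sum(prices) / len(prices) if prices else None


labels = ["109,99 €", "89,99\xa0€", "", "119,99 €"]  # made-up sample labels
print(average_price(labels))
```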

In [12]:
from selenium import webdriver

from selenium.webdriver.common.by import By

def accept_cookies(driver):
    try:
        # Accept cookies
        accept_cookies_button = driver.find_element(By.CLASS_NAME, 'btn-primary-dark')
        accept_cookies_button.click()
    except Exception:
        pass  # If the cookies banner is not present or there's an error, ignore

def get_nike_prices(search_query):
    # Set up the Firefox driver
    driver = webdriver.Firefox()
    
    try:
        # Go to nike.fr and accept cookies
        driver.get("https://www.nike.com/fr/")
        accept_cookies(driver)

        # Construct the URL with the search query
        url = f"https://www.nike.com/fr/w?q={search_query.replace(' ', '%20')}"

        # Navigate to the URL with the search query
        driver.get(url)

        # Wait for the prices to load (you may need to adjust the waiting time)
        driver.implicitly_wait(10)

        # Retrieve prices
        price_elements = driver.find_elements(By.CLASS_NAME, 'product-price')
        prices = [float(price.text.replace('€', '').replace(',', '.')) for price in price_elements]

        # Calculate the average price
        average_price = sum(prices) / len(prices)

        return average_price

    except Exception as e:
        print("An error occurred:", e)

    finally:
        # Close the browser window
        driver.quit()

average_nike_price = get_nike_prices(query)
print(average_nike_price)

107.99000000000001


The potential savings from buying on Vinted are then computed as the average Nike price minus each item's Vinted price. The `calculate_profit_made` function performs this computation for each row, and `processed_data` is enriched with two columns: the average price of the product on Nike and the resulting "Profit Made".
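
As an alternative to a row-wise function, the same subtraction can be written as a single vectorized pandas expression. The sketch below uses a small hand-built DataFrame, not the scraped data; the third row is a hypothetical listing with no price, included to show that missing prices propagate as `NaN`.

```python
import pandas as pd

average_nike_price = 107.99  # average printed by the Nike scraping cell

# Small hand-built sample; the last row is hypothetical
sample = pd.DataFrame({
    "Title": ["Chaussures nike cortez en 40 violette et jaune", "Nike cortez", "listing without price"],
    "Price": [7.5, 14.0, None],
})

# Vectorized: subtracts every Vinted price from the Nike average at once
sample["Profit Made"] = average_nike_price - sample["Price"]
print(sample[["Price", "Profit Made"]])
```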

In [13]:
def calculate_profit_made(row, average_nike_price):
    try:
        item_price = float(row['Price'])
        profit_made = average_nike_price - item_price
        return profit_made

    except (ValueError, TypeError):  # TypeError covers a missing (None) price or average
        return None

processed_data.sort_values(by='Price', inplace=True)
# Calculate profit made for Vinted data
processed_data['Average price of the product on Nike'] = average_nike_price
processed_data['Profit Made'] = processed_data.apply(lambda row: calculate_profit_made(row, average_nike_price), axis=1)

processed_data

Unnamed: 0,Title,Price,Brand,Size,Link,Image Source,Localisation,Approximate Carbon Footprint (kg CO2),Approximative Carbon Print from Ordering on Nike (kg CO2/ton),Carbon Profit (kg CO2),Average price of the product on Nike,Profit Made
7,Chaussures nike cortez en 40 violette et jaune,7.5,NIKE,40,https://www.vinted.fr/items/3310750318-chaussu...,https://images1.vinted.net/t/03_00431_wqdoFRZU...,"MÉRU, FRANCE",9.017902,405,395.982098,107.99,100.49
2,Nike cortez,14.0,NIKE,40,https://www.vinted.fr/items/3604624670-nike-co...,https://images1.vinted.net/t/02_00b49_CBjpKg3A...,"SALLANCHES, FRANCE",91.563944,405,313.436056,107.99,93.99
17,Baskets Cortez jaune nike,14.0,NIKE,40,https://www.vinted.fr/items/929940934-baskets-...,https://images1.vinted.net/t/02_00627_hymwci5t...,FRANCE,50.586158,405,354.413842,107.99,93.99
1,Nike Cortez Jewel Two tone,15.0,NIKE,40,https://www.vinted.fr/items/3502006600-nike-co...,https://images1.vinted.net/t/02_020d6_kbXyN2o9...,PAYS-BAS,87.928759,405,317.071241,107.99,92.99
3,nike cortez,15.0,NIKE,40,https://www.vinted.fr/items/3574580483-nike-co...,https://images1.vinted.net/t/03_025ac_tULt4tpe...,ESPAGNE,240.819687,405,164.180313,107.99,92.99
24,Yellow Nike cortez,20.0,NIKE,40,https://www.vinted.fr/items/985728423-yellow-n...,https://images1.vinted.net/t/01_00363_o5z4vu2a...,"WUUSTWEZEL, BELGIQUE",64.539516,405,340.460484,107.99,87.99
8,Nike cortez jaune,20.0,NIKE,40,https://www.vinted.fr/items/995385937-nike-cor...,https://images1.vinted.net/t/01_00596_hPKoh8YH...,"NICE, FRANCE",137.33721,405,267.66279,107.99,87.99
4,Nike cortez,23.0,NIKE,40,https://www.vinted.fr/items/3377202709-nike-co...,https://images1.vinted.net/t/01_02091_Lz4GGBxN...,"ÉCOUFLANT, FRANCE",52.005156,405,352.994844,107.99,84.99
6,Cortèze,23.0,NIKE,40,https://www.vinted.fr/items/234485412-corteze?...,https://images1.vinted.net/t/01_00101_rCh9EQ5E...,"ARNOUVILLE, FRANCE",3.113751,405,401.886249,107.99,84.99
0,Nike cortez,25.0,NIKE,40,https://www.vinted.fr/items/3756249914-nike-co...,https://images1.vinted.net/t/01_01c7f_y3ViupHW...,"BONDOUFLE, FRANCE",5.418004,405,399.581996,107.99,82.99
