### 1. Setup

We start by importing the necessary libraries and defining key parameters for scraping product data from Pyaterochka's online store. These include the base URLs for requests to the store's API, headers that simulate a browser request, and the current date for reference.

In [None]:
import requests
import json
import time
import random
import pandas as pd
from datetime import date

# define base URLs and headers for making requests throughout the scraping process
BASE_URL = "https://5d.5ka.ru/api/catalog/v2/stores/Y233/categories/"
CATEGORIES_URL = "https://5d.5ka.ru/api/catalog/v2/stores/Y233/categories?mode=delivery&include_subcategories=1"
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/132.0.0.0 Safari/537.36"
CHECK_DATE = date.today() # current date for reference

HEADERS = {
    "user-agent": USER_AGENT,
    "origin": "https://5ka.ru"
}

### 2. Web Scraping

To extract product data from the Pyaterochka website, we follow a multi-step process: retrieving product categories, fetching product data by category, cleaning the raw data, and storing the results in a structured format.

#### 2.1. Fetching Categories

Since products are grouped by categories, we first need to request a list of all available categories and subcategories. The list of categories is fetched from a dedicated API endpoint used by the site’s catalogue view. Each subcategory includes a unique alphanumeric ID, which is later used to query the corresponding product listings.

We define a function to query the categories endpoint, extract only the necessary fields (*id*, *name*, and *parent_id*), and return a cleaned list of subcategories.

In [None]:
def fetch_categories():
    """Return a flat list of subcategories with their id, name and parent category id."""

    # fetch category data from the API endpoint (the 30-second timeout is an arbitrary choice)
    response = requests.get(CATEGORIES_URL, headers=HEADERS, timeout=30)
    response.raise_for_status() # stop early on an HTTP error instead of parsing an error page
    raw_categories = response.json() # convert JSON response to a list of category dictionaries

    cleaned_categories = []

    # go through the raw data and select only the necessary fields
    for category in raw_categories:    # iterate through each top-level category
        for subcategory in category["categories"]:    # save every subcategory together with its parent's id
            cleaned_categories.append({
                "id": subcategory["id"], # id used later to fetch products
                "name": subcategory["name"], # name for reference
                "parent_id": category["id"] # id of the parent category for reference
            })

    return cleaned_categories
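
The flattening step can be tried without touching the network. The sketch below applies the same nested loop to a small hand-written payload. The function name `flatten_categories` and the sample IDs and names are made up for illustration, and the real API response may contain additional fields.

```python
def flatten_categories(raw_categories):
    """Flatten a nested category tree into a list of subcategory records."""
    flat = []
    for category in raw_categories:
        # .get() with a default keeps the loop safe if a category has no children
        for subcategory in category.get("categories", []):
            flat.append({
                "id": subcategory["id"],
                "name": subcategory["name"],
                "parent_id": category["id"],
            })
    return flat

# Hypothetical payload, invented for illustration only
sample_payload = [
    {"id": "A1", "name": "Dairy", "categories": [
        {"id": "A1X", "name": "Milk"},
        {"id": "A1Y", "name": "Yogurt"},
    ]},
    {"id": "B2", "name": "Bakery", "categories": []},
]

flat_categories = flatten_categories(sample_payload)
print(flat_categories)
```

Top-level categories themselves are not kept as rows; they survive only as the `parent_id` of their subcategories, so a top-level category without children leaves no trace in the output.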

We fetch all available categories and save them to a CSV file for transparency and reproducibility.

In [None]:
categories = fetch_categories()
# categories = categories[83:] # uncomment to resume scraping from a specific category after an interruption;
#                              # note that the cells below rewrite the CSV files for the current date, so back up earlier output first
total_categories_count = len(categories) # for fetching progress indication

categories_df = pd.DataFrame(categories)
categories_df.to_csv(f'categories-{CHECK_DATE}-pyaterochka.csv', index=False)

In [None]:
# optional intermediate output of the categories list
# with pd.option_context('display.max_rows', None, 'display.max_colwidth', None):
#     display(categories_df)

#### 2.2. Fetching Raw Product Data

The Pyaterochka website loads product data dynamically through GET requests tied to specific category codes. Each category page has a unique alphanumeric identifier, which is included in the URL when fetching product data. The site first loads 12 items and then raises the number incrementally (e.g., 24, 36, 48…) as the user scrolls.
To replicate this behavior, we increase the `limit` parameter by 12 with each request and replace the product list with the latest, most complete response. The loop stops once a response contains fewer products than the requested limit, which means the category has been exhausted. A random delay between requests reduces the risk of being rate-limited or blocked.

This step returns raw product data in JSON format, preserving the site’s original structure.

In [None]:
def fetch_products(category_id):
    """Return the raw product list of one category, growing the limit until nothing new arrives."""

    current_limit = 0
    total_products = 0
    products = []

    # keep going while the last response was completely filled, i.e. more products may exist
    while total_products >= current_limit:

        time.sleep(random.uniform(1, 5)) # random time delay to avoid being blocked

        current_limit += 12 # increase the limit to fetch more products
        url = f'{BASE_URL}{category_id}/products?mode=delivery&include_restrict=false&limit={current_limit}'

        # fetch product data for the current category and limit (the 30-second timeout is an arbitrary choice)
        response = requests.get(url, headers=HEADERS, timeout=30)
        response.raise_for_status() # stop early on an HTTP error instead of parsing an error page
        response_data = response.json() # convert into a dictionary

        products = response_data["products"] # extract only products
        total_products = len(products) # check the total number of products

        print(f'{total_products}..', end='') # progress indicator: number of products returned by the current request

    return products
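
To see when the loop stops, it helps to run the same logic against a stand-in for the endpoint. The sketch below is an illustration, not the scraper itself: `fetch_all`, `fake_fetch_page` and the 30-item catalogue are hypothetical, and the delay and HTTP call are left out.

```python
def fetch_all(fetch_page, step=12):
    """Grow the limit by `step` until a response comes back shorter than the limit."""
    limit = 0
    items = []
    while len(items) >= limit:
        limit += step
        items = fetch_page(limit)  # each response replaces the previous, shorter one
    return items

# Stand-in for the real endpoint: a hypothetical category holding 30 products
CATALOGUE = [f"product-{i}" for i in range(30)]
requested_limits = []

def fake_fetch_page(limit):
    """Record the requested limit and return at most `limit` items."""
    requested_limits.append(limit)
    return CATALOGUE[:limit]

all_items = fetch_all(fake_fetch_page)
print(len(all_items), requested_limits)  # prints: 30 [12, 24, 36]
```

Because every request starts again from the first product, the total number of items transferred grows quadratically with the size of the category. A category whose size is an exact multiple of 12 also costs one extra request, since only a short response signals the end.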

#### 2.3. Cleaning Product Data

Next, we define a function to extract only the relevant product fields from the raw data:

- Category ID (for traceability)
- Product name
- Unit of measurement
- Regular and discounted prices
- Pricing clarification (e.g. net weight or price per unit info)

The result is a cleaned list of product entries ready for storage or further analysis.

In [None]:
def clean_product_data(category, raw_products):

    cleaned_products = []

    # go through the raw data and select only the necessary fields
    for product in raw_products:
        cleaned_products.append({
            "category_id": category,
            "name": product["name"],
            "unit_of_measurement": product["uom"],
            "price_reg": product["prices"]["regular"],
            "price_disc": product["prices"]["discount"],
            "pricing_clarification": product["property_clarification"] # clarifies the unit for the price or the net weight
        })

    return cleaned_products
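
The function above assumes that every product carries all of the listed keys. Whether the API ever omits one is not known here, so the following is only a defensive sketch: a hypothetical `clean_product` helper that uses `dict.get` so a missing field becomes `None` instead of raising `KeyError`. Both sample records are invented.

```python
def clean_product(category_id, product):
    """Pick the fields of interest; missing keys become None instead of raising KeyError."""
    prices = product.get("prices") or {}  # tolerate a missing or null price block
    return {
        "category_id": category_id,
        "name": product.get("name"),
        "unit_of_measurement": product.get("uom"),
        "price_reg": prices.get("regular"),
        "price_disc": prices.get("discount"),
        "pricing_clarification": product.get("property_clarification"),
    }

# Hypothetical records, invented for illustration only
complete = {
    "name": "Молоко 3.2% 930мл",
    "uom": "шт",
    "prices": {"regular": 89.99, "discount": None},
    "property_clarification": "930 мл",
}
partial = {"name": "Бананы", "prices": None}

cleaned_complete = clean_product("A1X", complete)
cleaned_partial = clean_product("A1X", partial)
print(cleaned_partial)
```

The trade-off is that silent `None` values can hide a change in the API's structure, whereas the strict version fails loudly on the first unexpected record.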

#### 2.4. Putting It All Together

In the main loop, we:

1. Iterate through the selected categories,
2. Fetch and clean the product data for each category,
3. Append the cleaned data to a single CSV file to build a complete dataset.

A progress tracker prints feedback for each category to help monitor the scraping process. We also pause for a short random interval between categories to avoid potential rate-limiting. The collection date (`CHECK_DATE`) is included in the filename to record when the data was gathered.

In [None]:
fetched_categories_count = 0    # counter for fetching progress tracker

# create the file with headers first
products_df = pd.DataFrame(columns=["category_id", "name", "unit_of_measurement", "price_reg", "price_disc", "pricing_clarification"])
products_df.to_csv(f'scraped_products-{CHECK_DATE}-pyaterochka.csv', index=False, mode='w')

for category in categories:
    
    raw_products = fetch_products(category["id"]) # fetch products
    new_products = clean_product_data(category["id"], raw_products) # select only relevant data and add new products to the list

    products_df = pd.DataFrame(new_products)
    products_df.to_csv(f'scraped_products-{CHECK_DATE}-pyaterochka.csv', index=False, mode='a', header=False)

    fetched_categories_count += 1
    print(f'Category ID: {category["id"]} finished, {fetched_categories_count} out of {total_categories_count} categories fetched')
    
    time.sleep(random.uniform(1, 5))

print(f'Fetching complete. Results saved to scraped_products-{CHECK_DATE}-pyaterochka.csv')
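
The cell above creates the file with `mode='w'`, so re-running it after an interruption starts the file from scratch. One possible alternative, sketched here with the standard `csv` module, writes the header only when the file is new or empty, which lets an interrupted run be continued without losing earlier rows. The helper `append_rows`, the shortened field list and the sample rows are assumptions for illustration.

```python
import csv
import os
import tempfile

FIELDS = ["category_id", "name", "price_reg"]  # shortened column list for the example

def append_rows(path, rows):
    """Append rows to a CSV file, writing the header only if the file is new or empty."""
    needs_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if needs_header:
            writer.writeheader()
        writer.writerows(rows)

# Two hypothetical batches, standing in for two categories
csv_path = os.path.join(tempfile.mkdtemp(), "products.csv")
append_rows(csv_path, [{"category_id": "A1X", "name": "Молоко", "price_reg": 89.99}])
append_rows(csv_path, [{"category_id": "A1Y", "name": "Йогурт", "price_reg": 54.5}])

with open(csv_path, newline="", encoding="utf-8") as f:
    saved_rows = list(csv.DictReader(f))
print(len(saved_rows))  # prints: 2
```

With this pattern a resumed run has to skip the categories that were already written, otherwise their products would appear twice.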

### 3. Filtering and Normalizing Product Data

After collecting and cleaning the raw product data, we proceed with filtering the dataset to include only the products relevant for comparison. This step involves several stages:

#### 3.1. Initial Preprocessing

We start by loading the previously saved product and category datasets and dropping columns that are no longer needed. The column names are simplified for ease of further processing. We also remove exact duplicates, which can appear when the same item is listed in more than one category.

In [None]:
import pandas as pd
import re

categories = pd.read_csv('categories-2025-03-03-pyaterochka.csv')
products_original = pd.read_csv('scraped_products-2025-03-03-complete.csv')

# drop unnecessary columns and rename for simplicity
products = products_original.drop(['category_id', 'price_disc', 'unit_of_measurement'], axis=1)    # drop category_id, uom and price_disc columns
products = products.rename(columns={'price_reg': 'price', 'pricing_clarification': 'pricing_unit'})    # rename price column for simplicity
products = products.drop_duplicates() # remove duplicates

#### 3.2. Filtering Products by Type

To identify relevant products for comparison, we define a dictionary mapping product types to regular expressions. Each expression captures the base form of the product while deliberately excluding variations (e.g., flavored, processed, or pickled) that fall outside the scope of this analysis.

In [None]:
product_regex_map = {
    'rice': r'(^|^")рис\b',
    'bread': r'(^хлеб\b|^багет\b|^батон\b)(?!.*(чесн|заморож))', # matches "хлеб", "багет", or "батон"
                                                                 # but excludes "багет с чесноком" or "багет замороженный"
    'chicken_fillet': r'^филе\b.*(кур|цыпл)(?!.*запеч)',
    'pork_leg': r'^окорок\b.*свин',
    'egg': r'^яйцо.*курин',
    'cucumber': r'^огур(цы|ец)(?!.*(солен|маринован))', # matches "огурец" or "огурцы" but excludes "огурцы соленые" or "огурцы маринованные"
    'carrot': r'^морковь(?!.*корей)',
    'onion': r'^лук.*реп(?!.*зелен)',
    'tomato': r'^томаты(?!.*(сок|очищ|маринован|вялен|солен))',
    'cabbage': r'^капуста\b.*белокоч',
    'eggplant': r'^баклажаны?($|.*теплич)',
    'banana': r'^банан',
    'orange': r'^апельсин',
    'milk': r'^молоко(?!.*(сгущ|сух))',
    'yogurt': r'^йогурт\b(?!.*питье)', # matches "йогурт" but excludes "йогурт питьевой"
    'condensed_milk': r'(^молоко.*сгущ|^сгущ)(?!.*(варен|какао|шокол))',
    'green_tea': r'^чай.* зел(?!.*(порош|л$))',
    'black_tea': r'^чай.* черн(?!.*л$)',
    'ground_coffee': r'^кофе(?!.*(капсул|раствор)).*молот',
    'sugar': r'^сахар\b(?!.*ванил)',
    'salt': r'^соль(?!.*(розов|посуд|чесн|ванн|спец))',
    'sunflower_oil': r'^масло\b.*подсолн(?!.*добавл)',
    'water': r'^вода(?!.*(малин|лимон)).*негаз',
    'buckwheat': r'(^крупа\b.*гречн|^гречка\b)(?!.*(\bпшен|\bкиноа))',
    'spaghetti': r'(^макароны.*спагетти|^спагетти\b)(?!.*(заморож|кукуруз))',
    'rice_noodles': r'(^лапша|^вермишель).*(фунчоз)',
    'tofu': r'^тофу\b',
    'mango': r'^манго\b(?!.*(суш|заморож))'
}
product_regex_list = '|'.join(product_regex_map.values()) # create a single regex by joining all individual regexes with the OR operator (|)

# filter products matching any of the product types
filtered_products = products.loc[products.name.str.contains(product_regex_list, case=False, regex=True)]
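
The effect of the `^` anchor and the negative lookahead is easiest to see on a few names. The sketch below applies the `bread` pattern from the dictionary to four product names that are invented for illustration.

```python
import re

# the 'bread' pattern from product_regex_map
bread_pattern = r'(^хлеб\b|^багет\b|^батон\b)(?!.*(чесн|заморож))'

# Hypothetical product names
names = [
    "Хлеб Бородинский 400г",   # base product: should match
    "Багет с чесноком 200г",   # excluded by the lookahead (чесн)
    "Батон нарезной 350г",     # base product: should match
    "Сухари из хлеба 100г",    # 'хлеб' is not at the start: should not match
]

matches = {
    name: bool(re.search(bread_pattern, name, flags=re.IGNORECASE))
    for name in names
}
for name, is_match in matches.items():
    print(f"{is_match!s:5} {name}")
```

The anchor restricts matches to names that begin with the product word, and the lookahead `(?!.*(чесн|заморож))` rejects a name if either fragment appears anywhere after that word.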

#### 3.3. Assigning Tags

Each filtered product is tagged with its corresponding product type based on regex matching and is also assigned the supermarket name (“Pyaterochka”) for traceability.

In [None]:
def assign_product_type(row):
    name = row['name']
    for product_type, regex in product_regex_map.items():
        match = re.search(regex, name, flags=re.IGNORECASE)
        if match:
            return product_type
    return None

filtered_products = filtered_products.copy()  # create a copy to avoid a warning when adding new columns with .loc
filtered_products.loc[:,'product_type'] = filtered_products.apply(assign_product_type, axis=1)
filtered_products.loc[:,'supermarket'] = 'Pyaterochka'

# optional intermediate output of the filtered products list
# with pd.option_context('display.max_rows', None, 'display.max_colwidth', None):
#     display(filtered_products)
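
Since `assign_product_type` returns the first pattern that matches, the result depends on dictionary order whenever two patterns can match the same name. The sketch below compares the `milk` and `condensed_milk` patterns from the dictionary with a deliberately loosened variant; `first_matching_type`, `loose_map` and the product name are hypothetical.

```python
import re

def first_matching_type(name, regex_map):
    """Return the key of the first pattern that matches `name`, or None."""
    for product_type, pattern in regex_map.items():
        if re.search(pattern, name, flags=re.IGNORECASE):
            return product_type
    return None

# the two patterns as defined in product_regex_map
strict_map = {
    'milk': r'^молоко(?!.*(сгущ|сух))',
    'condensed_milk': r'(^молоко.*сгущ|^сгущ)(?!.*(варен|какао|шокол))',
}
# hypothetical variant: 'milk' without its lookahead
loose_map = {
    'milk': r'^молоко',
    'condensed_milk': strict_map['condensed_milk'],
}

name = "Молоко сгущенное 380г"  # invented product name
strict_tag = first_matching_type(name, strict_map)
loose_tag = first_matching_type(name, loose_map)
print(strict_tag, loose_tag)  # prints: condensed_milk milk
```

The lookaheads make the two patterns mutually exclusive for this name, so the tag no longer depends on which entry comes first in the dictionary.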

#### 3.4. Extracting and Normalizing Units

Many product listings differ in quantity, weight, or volume. To enable a fair comparison, we extract the relevant information from the product name or the pricing clarification field and calculate normalized price metrics: price per kilogram, per liter, or per unit. Three quantities are extracted:

- Weight in grams
- Number of units (in particular, eggs)
- Volume in milliliters

Each value is extracted using pattern matching. Not every product states all three quantities, so the corresponding normalized columns (*price_kg*, *price_lit*, *price_unit*) are left empty wherever a quantity could not be found.

In [None]:
def extract_weight(row):
    """Extracts total weight in grams from the product name or pricing clarification.
    Supports both single weights and multi-portion formats (e.g., '5x100г').
    """
    
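    name = row['name']
    # pricing_unit is NaN when the CSV field is empty; use an empty string so re.search does not raise TypeError
    pricing_unit = row['pricing_unit'] if isinstance(row['pricing_unit'], str) else ''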

    # multi-portion format (e.g., 5x100г)
    match = re.search(r'(\d+)(x|х)(\d+|\d+[.]\d+)\s?г', name) # matches digits х digits g
    if match:
        portion, per_portion = map(float, match.group(1,3)) # extract the number of portions and the weight per portion
        return portion * per_portion # return total weight
    # single weight (grams or kilograms)
    match = re.search(r'(\d+|\d+[.]\d+)\s?(г|кг)', name)
    if match:
        weight = float(match.group(1))
        unit = match.group(2)
        return weight * 1000 if unit == 'кг' else weight # convert kilograms to grams
    # if name doesn't contain anything, check pricing_unit
    match = re.search(r'(\d+|\d+[.]\d+)\s?(г|кг)', pricing_unit)
    if match:
        weight = float(match.group(1))
        unit = match.group(2)
        return weight * 1000 if unit == 'кг' else weight

    return None  # if nothing matched

# the next two functions follow the same logic as extract_weight, but for units and milliliters
def extract_number_of_units(row):
    """Extracts number of units ('шт') from the product name or pricing clarification."""
    
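    name = row['name']
    # pricing_unit is NaN when the CSV field is empty; use an empty string so re.search does not raise TypeError
    pricing_unit = row['pricing_unit'] if isinstance(row['pricing_unit'], str) else ''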

    # check name
    match = re.search(r'(\d+)\s?шт', name)
    if match:
        number_of_units = int(match.group(1))
        return number_of_units
    # check pricing_unit
    match = re.search(r'(\d+)\s?шт', pricing_unit)
    if match:
        number_of_units = int(match.group(1))
        return number_of_units

    return None  # if nothing matched

def extract_volume(row):
    """Extracts total volume in milliliters from the product name or pricing clarification.
    Supports both single and multi-portion formats (e.g., '5x100мл').
    """
    
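    name = row['name']
    # pricing_unit is NaN when the CSV field is empty; use an empty string so re.search does not raise TypeError
    pricing_unit = row['pricing_unit'] if isinstance(row['pricing_unit'], str) else ''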

    # multi-portion format (e.g., 5x100мл)
    match = re.search(r'(\d+)(x|х)(\d+|\d+[.]\d+)\s?мл', name)
    if match:
        portion, per_portion = map(float, match.group(1,3))
        return portion * per_portion
    # single volume (liters or milliliters)
    match = re.search(r'(\d+|\d+[.]\d+)\s?(мл|л\b)', name)
    if match:
        volume = float(match.group(1))
        unit = match.group(2)
        return volume * 1000 if unit == 'л' else volume
    # check pricing_unit
    match = re.search(r'(\d+|\d+[.]\d+)\s?(мл|л\b)', pricing_unit)
    if match:
        volume = float(match.group(1))
        unit = match.group(2)
        return volume * 1000 if unit == 'л' else volume

    return None  # if nothing matched

filtered_products = filtered_products.copy()  # recreate the dataframe

# calculate normalized prices
filtered_products.loc[:,'weight'] = filtered_products.apply(extract_weight, axis=1)  # a column with weights in grams
filtered_products.loc[:,'price_kg'] = filtered_products.price / filtered_products.weight * 1000   # a column with prices per kg

filtered_products.loc[:,'number_of_units'] = filtered_products.apply(extract_number_of_units, axis=1)  # a column with number of units
filtered_products.loc[:,'price_unit'] = filtered_products.price / filtered_products.number_of_units   # a column with prices per unit

filtered_products.loc[:,'volume'] = filtered_products.apply(extract_volume, axis=1)  # a column with volume in ml
filtered_products.loc[:,'price_lit'] = filtered_products.price / filtered_products.volume * 1000   # a column with prices per liter

# optional intermediate output
# with pd.option_context('display.max_rows', None, 'display.max_colwidth', None):
#     display(filtered_products)
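
As a worked example of the normalization: a 400 g loaf priced at 60 costs 60 / 400 × 1000 = 150 per kilogram. The sketch below condenses the weight extraction into a single-string helper and applies that formula. `weight_in_grams`, `price_per_kg`, the product names and the price are made up for illustration, and the patterns are a compact restatement rather than the exact ones used above.

```python
import re

def weight_in_grams(text):
    """Return the total weight in grams mentioned in `text`, or None if there is none."""
    # multi-portion format first, so '25x2г' is read as 50 g rather than 2 g
    match = re.search(r'(\d+)[xх](\d+(?:[.]\d+)?)\s?г', text)  # Latin x or Cyrillic х
    if match:
        return float(match.group(1)) * float(match.group(2))
    # single weight in kilograms or grams
    match = re.search(r'(\d+(?:[.]\d+)?)\s?(кг|г)', text)
    if match:
        value = float(match.group(1))
        return value * 1000 if match.group(2) == 'кг' else value
    return None

def price_per_kg(price, grams):
    """Scale a shelf price to one kilogram."""
    return price / grams * 1000

# Hypothetical product names and price
tea_weight = weight_in_grams("Чай зеленый 25x2г")
rice_weight = weight_in_grams("Рис круглозерный 1.5 кг")
no_weight = weight_in_grams("Бананы")
bread_price_kg = price_per_kg(60.0, weight_in_grams("Хлеб пшеничный 400г"))

print(tea_weight, rice_weight, no_weight, round(bread_price_kg, 2))
```

One limitation applies to these patterns and to the ones in the cell above: only a dot is recognized as the decimal separator. A name written with a decimal comma, such as "1,5 кг", would be read as 5 kg, so it is worth checking whether the dataset contains such names.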

#### 3.5. Saving the Final Filtered Dataset

Finally, the enriched dataset is saved to a new CSV file for further analysis.

In [None]:
filtered_products.to_csv('filtered_products-2025-03-03-pyaterochka.csv')