### 1. Setup

We import the libraries needed for web scraping and data handling. `CHECK_DATE` stores the current date, which is used to label output files.

In [None]:
from bs4 import BeautifulSoup
import requests
import re
import time
import random
import pandas as pd
import json
from datetime import date

CHECK_DATE = date.today()

### 2. Web Scraping

To extract product data from Co.op, we follow a multi-step process: fetch product categories, fetch product codes from category pages, collect item data by category, clean the results, and save everything in a structured format.

#### 2.1. Fetching Categories

We start by extracting product categories from a saved HTML page that represents the structure of the Co.op Online website. The script navigates through three levels of category hierarchy: top-level categories, their subcategories, and third-level subsubcategories.

Each category is stored as a dictionary containing the category *name*, *level* (1 for top-level, 2 for subcategory, 3 for subsubcategory), a flag indicating whether it has children, its *parent* category (if any), and the *link* associated with it.
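As a rough illustration of this record layout, the sketch below builds a tiny category list by hand and uses the `hasChild` flag to pick out the leaf entries. The category names and links are made-up placeholders rather than real Co.op data, and `leaf_categories` is a helper name introduced only for this example.

```python
# Hypothetical sample records in the layout described above (names and links are placeholders)
sample_categories = [
    {"name": "Drinks", "level": 1, "hasChild": True, "parent": None, "link": "https://example.com/groups/drinks"},
    {"name": "Tea", "level": 2, "hasChild": True, "parent": "Drinks", "link": "https://example.com/groups/tea"},
    {"name": "Green tea", "level": 3, "hasChild": False, "parent": "Tea", "link": "https://example.com/groups/green-tea"},
    {"name": "Water", "level": 2, "hasChild": False, "parent": "Drinks", "link": "https://example.com/groups/water"},
]

def leaf_categories(categories):
    """Return only the categories that have no children."""
    return [c for c in categories if not c["hasChild"]]

leaves = leaf_categories(sample_categories)
print([c["name"] for c in leaves])
```

A flat list with `parent` references keeps every level in one table, which is convenient when the list is later written to a single CSV file.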

In [None]:
with open("groups.html", "r", encoding="utf-8") as file:
    groups_page = BeautifulSoup(file, "html.parser")    # take the html page saved before and put it into a BeautifulSoup object

categories = []
top_categories = groups_page.find_all("a", class_="clearfix", href=re.compile(r"https://cooponline.vn/groups/[^#]"))    # find all top-level categories

# loop through all categories and extract relevant fields
for category in top_categories:
    
    categories.append({                  # extract only relevant data and save into a dictionary
        "name": category.span.string, # category name for reference
        "level": 1, # hierarchy level for reference
        "hasChild": True, # hardcoded True: top-level categories always have children, which matters when fetching products
        "parent": None, # parent category for reference
        "link": category["href"] # link to a category page, important for getting product codes further
    })
    
    submenu = category.find_next_sibling("div", class_="sub-menu")    # the subcategories are contained in the following div block
    subcategories = submenu.find_all("a", class_="main-menu")    # find all subcategories and loop through them too
    for subcategory in subcategories:

        subsubmenu = subcategory.find_next_sibling("ul")    # check if the subcategory has child categories
        
        categories.append({    # add subcategories to the list of dictionaries too
            "name": subcategory.string,
            "level": 2,
            "hasChild": True if subsubmenu else False,   # subsubmenu can be either None or a list of third-level categories
            "parent": category.span.string,
            "link": subcategory["href"]
        })

        # if subsubcategories exist, loop through them and extract them too
        if subsubmenu: 
            
            subsubcategories = subsubmenu.find_all("a")
            for subsubcategory in subsubcategories:
                
                categories.append({
                    "name": subsubcategory.string,
                    "level": 3,
                    "hasChild": False,
                    "parent": subcategory.string,
                    "link": subsubcategory["href"]
                })

After building the list, we clean up the category names by stripping leading and trailing spaces to ensure consistency.

In [None]:
for category in categories:
    category["name"] = category["name"].strip()
    if category["parent"] is not None:
        category["parent"] = category["parent"].strip()

#### 2.2. Fetching Product Codes

Unlike the other websites in this project, Co.op Online requires product codes to be explicitly specified in the POST request when fetching product data. These codes are embedded within the HTML of each category’s webpage. To proceed, we need to visit each category page and extract the relevant codes.

While all categories contain products, we focus on the most specific ones—those without any child categories. These are smaller and more targeted, which helps reduce the risk of timeouts or rate-limiting during batch data requests. On each category page, we locate the `module-taxonomy` tag, which holds the category’s `term_id` and a list of product codes within its `items` attribute. This information is saved along with the category metadata for use in the next stage.
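To make the target of this step concrete, here is a dependency-free sketch that reads `term_id` and `items` from a `module-taxonomy` tag using the standard library's `html.parser` instead of BeautifulSoup. The HTML fragment, the attribute values, and the comma-separated format of `items` are assumptions made for illustration; the real page may encode the codes differently.

```python
from html.parser import HTMLParser

# Hypothetical fragment; the real attribute values and their format may differ
SAMPLE_HTML = '<div><module-taxonomy term_id="123" items="1001,1002,1003"></module-taxonomy></div>'

class TaxonomyTagParser(HTMLParser):
    """Records the attributes of the first <module-taxonomy> tag encountered."""

    def __init__(self):
        super().__init__()
        self.attributes = None

    def handle_starttag(self, tag, attrs):
        if tag == "module-taxonomy" and self.attributes is None:
            self.attributes = dict(attrs)

parser = TaxonomyTagParser()
parser.feed(SAMPLE_HTML)

# Fall back to None when the tag is absent, mirroring the "no codes found" case
term_id = parser.attributes.get("term_id") if parser.attributes else None
item_codes = parser.attributes.get("items") if parser.attributes else None
print(term_id, item_codes)
```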

In [None]:
# create the file with headers first
categories_df = pd.DataFrame(columns=["name", "level", "hasChild", "parent", "link", "term_id", "item_codes"])
categories_df.to_csv(f'categories-{CHECK_DATE}-coop.csv', index=False, mode='w')

total_categories_count = len(categories)    # number of categories that we will scan, for progress indication
fetched_categories_count = 0 # counter for fetching progress tracker

for category in categories:
    
    print(f'{category["name"]}..', end='') # current category indication
    
    if not category["hasChild"]: # if category has no children, we get into it
        
        print('has no children, looking for codes', end='') # progress indication

        # default to None so that the CSV row keeps all seven columns even if the tag is missing
        category["term_id"] = None
        category["item_codes"] = None

        # fetch the HTML of a category page, locate term_id and item_codes and save them
        current_page = requests.get(category["link"]).text
        page_bs = BeautifulSoup(current_page, "html.parser")
        products_tag = page_bs.find("module-taxonomy")
        if products_tag is not None:
            category["term_id"] = products_tag["term_id"]
            category["item_codes"] = products_tag["items"]

        # save into the DataFrame and append to the CSV
        categories_df = pd.DataFrame([category])
        categories_df.to_csv(f'categories-{CHECK_DATE}-coop.csv', index=False, mode='a', header=False)
    else: # if category has child categories, we don't parse HTML, but still add it to the DataFrame and CSV
        print('has children, skip', end='')
        category["term_id"] = None
        category["item_codes"] = None
        categories_df = pd.DataFrame([category])
        categories_df.to_csv(f'categories-{CHECK_DATE}-coop.csv', index=False, mode='a', header=False)
        
    fetched_categories_count += 1
    print(f'..fetched - {fetched_categories_count} out of {total_categories_count}')
    
    time.sleep(random.uniform(1, 3))

We load the resulting category reference table and convert missing values to `None` so that later checks such as `is not None` behave as expected.

In [None]:
categories = pd.read_csv('categories-2025-03-06-coop-complete.csv', dtype={"term_id": str})
categories = categories.where(pd.notna(categories), None)  # convert NaN to None
categories = categories.to_dict('records')
# categories = categories[:] # this line can be used to resume scraping from a specific point in case of interruption

#### 2.3. Fetching Raw Product Data

To retrieve product listings from Co.op Online, we need to replicate the website’s behavior when users browse a category. Unlike some other sites where product data is retrieved based on category IDs or slugs, Co.op requires a list of product codes to be included in each POST request.

We iterate through each specific (non-parent) category and use the `term_id` and product codes we previously extracted to send POST requests to the site’s backend. Each request fetches a batch of up to 24 products. Pagination is handled using an increasing page number (`trang`), and we continue making requests until the returned batch contains fewer than 24 products, signaling the end of available items for that category.

To avoid detection or rate-limiting, a random delay is introduced between requests. The final result is a list of raw product dictionaries in the site’s internal JSON format, ready for further cleaning and analysis.
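The stopping rule can be tried offline with a stand-in for the network call. In the sketch below, `fake_fetch_page` is a hypothetical stub that slices a local list instead of sending a POST request, and `PAGE_SIZE` is simply the batch size of 24 mentioned above.

```python
PAGE_SIZE = 24  # batch size described above

def fake_fetch_page(all_items, page):
    """Hypothetical stand-in for the POST request: returns one page of items (pages start at 1)."""
    start = (page - 1) * PAGE_SIZE
    return all_items[start:start + PAGE_SIZE]

def fetch_all(all_items):
    """Request pages until one comes back with fewer than PAGE_SIZE items."""
    collected, page, requests_made = [], 1, 0
    while True:
        batch = fake_fetch_page(all_items, page)
        requests_made += 1
        collected += batch
        if len(batch) < PAGE_SIZE:  # a short (or empty) page marks the end
            break
        page += 1
    return collected, requests_made

items_50, requests_50 = fetch_all(list(range(50)))  # pages of 24, 24, 2
items_48, requests_48 = fetch_all(list(range(48)))  # pages of 24, 24, 0
print(len(items_50), requests_50)
print(len(items_48), requests_48)
```

When the item count is an exact multiple of 24, this rule costs one extra request that returns an empty page, as the second call shows.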

In [None]:
ITEMS_HEADERS = {
    'origin': 'https://cooponline.vn',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/133.0.0.0 Safari/537.36'
}

In [None]:
def fetch_products(category):
    
    current_products = 24 # set to 24 in order to have at least 1 iteration of while loop
    current_page = 1
    products = []

    # set the referer header to match the category being requested (without mutating the shared headers)
    headers = {**ITEMS_HEADERS, 'referer': category["link"]}
    url = 'https://cooponline.vn/ajax/'
        
    while current_products >= 24: # loop until the current page contains fewer than 24 products
        
        time.sleep(random.uniform(1, 5)) # random delay to avoid rate-limiting or blocking

        # construct request body for POST request
        data = {
            'request': 'w_getProductsTaxonomy',
            'termid': category["term_id"],
            'taxonomy': 'groups',
            'store': 'xtanphong',
            'items': category["item_codes"],
            'trang': current_page,
        }
    
        response = requests.post(url, headers=headers, data=data) # fetch product data for the current category and page
        response_data = response.json() # parse the JSON response into a list of product dictionaries
        products += response_data # add the products from this page to the list
        
        current_products = len(response_data) # update current number of products
        current_page += 1 # update page number
        
        print(f'{len(products)}..', end='') # progress indication

    return products

#### 2.4. Cleaning Product Data

Once the raw product listings are collected, we extract only the essential information for analysis. This includes:

- Category name (for context)
- Product name
- Price
- Unit of measurement
- Supermarket name (hardcoded for clarity)

This step reduces the raw JSON response to a clean, uniform structure that can easily be stored, aggregated, or visualized. The output is a list of dictionaries, each representing a simplified product entry, ready for export or merging with data from other sources.

In [None]:
def clean_product_data(category, raw_products):

    cleaned_products = []

    # go through the raw data and select only the necessary fields
    for product in raw_products:
        cleaned_products.append({
            "category_name": category["name"],
            "name": product["name"],
            "price": product["price"],
            "uom": product["unit"],
            "supermarket": "Co.op"
        })

    return cleaned_products

#### 2.5. Putting It All Together

The following code:

- Iterates through the selected categories containing product codes
- Fetches and cleans product data for each
- Appends the cleaned data to a single CSV file to build a complete dataset

A progress tracker prints a summary after each category to monitor scraping progress. A random delay is added between requests to reduce the risk of being blocked. The output filename includes the scrape date (via `CHECK_DATE`) so that results from different runs can be told apart.

In [None]:
fetched_categories_count = 0    # counter for fetching progress tracker
total_categories_count = len(categories)

# create the file with headers first
products_df = pd.DataFrame(columns=["category_name", "name", "price", "uom", "supermarket"])
products_df.to_csv(f'scraped_products-{CHECK_DATE}-coop.csv', index=False, mode='w')

for category in categories:
    
    if category["item_codes"] is not None: # check that the category isn't empty
        
        raw_products = fetch_products(category) # fetch products
        new_products = clean_product_data(category, raw_products) # select only relevant data and add new products to the list
    
        products_df = pd.DataFrame(new_products)
        products_df.to_csv(f'scraped_products-{CHECK_DATE}-coop.csv', index=False, mode='a', header=False)

        print(f'Category "{category["name"]}" finished..', end='')
    
    else:
        print(f'Skip category "{category["name"]}"..', end='')    
    
    fetched_categories_count += 1
    print(f'{fetched_categories_count} out of {total_categories_count} categories fetched')
    
    time.sleep(random.uniform(1, 5))

print(f'Fetching complete. Results saved to scraped_products-{CHECK_DATE}-coop.csv')

### 3. Filtering and Normalizing Product Data

After collecting and cleaning the raw product data, we proceed with filtering the dataset to include only the products relevant for comparison. This step involves several stages:

#### 3.1. Initial Preprocessing

We begin by loading the previously saved product and category datasets.

- The `category_name` column is dropped, as it is no longer needed for the next steps.
- Duplicate product entries are removed; these may occur if a product is listed under multiple categories.
- Extra spaces in product names are stripped to ensure consistent formatting.

In [None]:
import pandas as pd
import re

categories = pd.read_csv('categories-2025-03-06-coop-complete.csv')
products_original = pd.read_csv('scraped_products-2025-03-07-coop-complete.csv')

products = products_original.drop(['category_name'], axis=1)    # drop category_name column
products = products.drop_duplicates() # remove duplicates
products.loc[:,'name'] = products['name'].str.strip() # strip extra spaces

#### 3.2. Filtering Products by Type

To identify relevant products for comparison, we define a dictionary mapping product types to regular expressions.
Each expression captures the base form of the product while deliberately excluding variations (e.g., flavored, processed, or pickled) that fall outside the scope of this analysis.

In [None]:
product_regex_map = {
    'rice': r'^gạo(?!.*(lứt|lức|dưỡng|nếp))',
    'bread': r'^bánh (mì|mỳ|sandw|bag)(?!.*(bông|thịt|bơ|kem|hoa cúc|gà|pate|xốt|sữa|floss|socola|khoai|trứng|trong|hươu|nho|smile))',
    'chicken_fillet': r'(file|phi lê|\bức)(?!.*đùi).*gà',
    'pork_leg': r'đùi.*heo',
    'egg': r'^trứng gà(?!.*(ăn liền|tiềm|nướng|cay))', # matches "trứng gà", but excludes "trứng vịt" and varieties like already cooked eggs
    'cucumber': r'^dưa.*leo',
    'carrot': r'^cà rốt',
    'onion': r'hành tây',
    'tomato': r'^cà chua(?!.*(puree|đặc))',
    'cabbage': r'bắp cải trắng',
    'banana': r'^chuối(?!.*sấy)',
    'orange': r'^cam\b(?!.*sấy)',
    'milk': r'^sữa (tươi|tiệt|dinh|vina)(?!.*(melon|chuối|trái cây|có đường|ít đường|soco|dâu|vani|trân châu|ngữ|choco|lacto))',
    'yogurt': r'^sữa chua(?!.*(uống|men|khô|dẻo|ml))',
    'condensed_milk': r'sữa đặc(?!.*xanh lá)',
    'black_tea': r'^(hồng trà|trà\b)(?!.*(ml|l\b|xanh|sữa|khổ|sen|atiso|hoa cúc|ô long|olong|o long|green|ice|nestea|thảo|gừng|lài|matcha|chia|sâm|thế hệ|hà thủ|thái nguyên|15g|blendy|linh chi|happy|tân cương|huế|tết|thanh nhiệt))',
    'green_tea': r'^trà\b(?!.*(ml|l\b|sữa|khổ|atiso|hoa cúc|ice|nestea|thảo|gừng|matcha|chia|thế hệ|hà thủ|blendy|linh chi|happy|huế|thanh nhiệt|dilmah|twinings|tết|tim sen|chanh|tâm sen|đen|lipton|dâu|bạc hà|hàn quốc|đào|quất))',
    'ground_coffee': r'^(cà phê|cafe)(?!.*(hòa tan|hoà tan|sữa|in1|nesca|hạt|425g|bịch|fin|cino|hương))', # matches ground coffees, but excludes instant coffees and coffee with additives
    'sugar': r'^đường\s(tinh|trắng|mía|kính)',
    'salt': r'^muối(?!.*(tôm|ớt|tiêu)).*(biển|iot|tinh|sạch)',
    'sunflower_oil': r'^dầu.*hướng dương',
    'soybean_oil': r'dầu.*nành',
    'water': r'nước\s(uống đóng|khoáng|tinh)(?!.*(ion|chanh|perr))',
    'spaghetti': r'^mì(?!.*(kool|trộn|bò|omto|kem)).*(ý|spag|hair|buca)',
    'rice_noodles': r'^(bún|phở)(?!.*(lứt|đen|60g|65g|\sg$)).*(wai|minh hảo|nuffam|bình tây|sa đéc|saf|select|mikiri|hùng lô)',
    'tofu': r'^(đậu|tàu)\shũ(?!.*(chiên|trứng|cá\b|nấm|hạt|ky))', # matches plain tofus, but excludes fried and flavored varieties
    'water_spinach': r'^rau.*muống',
    'mango': r'^xoài(?!.*(sấy|ngâm))',
    'fish_sauce': r'^nước mắm(?!.*(ớt|me\b|gừng|chua\b|chay|tỏi|ngừ|nục|ăn liền))'
}
product_regex_list = '|'.join(product_regex_map.values()) # create a single regex by joining all individual regexes with the OR operator (|)

# filter products matching any of the product types
filtered_products = products.loc[products.name.str.contains(product_regex_list, case=False, regex=True)]
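
To illustrate how one of these patterns behaves, the sketch below applies the rice expression to a few product names. The names are invented for the example and are not taken from the scraped data.

```python
import re

# Anchored at the start of the name; the negative lookahead rejects names
# that contain any of the excluded words anywhere after "gạo"
RICE_PATTERN = r'^gạo(?!.*(lứt|lức|dưỡng|nếp))'

# Invented sample names (not from the scraped dataset)
sample_names = [
    "Gạo thơm 5kg",       # plain rice: expected to match
    "Gạo lứt đỏ 1kg",     # brown rice: excluded by the lookahead
    "Bánh gạo 150g",      # rice cracker: "gạo" is not at the start
]

matches = [bool(re.search(RICE_PATTERN, name, flags=re.IGNORECASE)) for name in sample_names]
print(matches)
```

The `^` anchor keeps the pattern from firing on products that merely mention the ingredient, while the lookahead filters out unwanted varieties without consuming any text.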

#### 3.3. Assigning Product Types

Each filtered product is tagged with its corresponding product type by matching its name against the predefined regex patterns.

In [None]:
def assign_product_type(row):
    name = row['name']
    for product_type, regex in product_regex_map.items():
        match = re.search(regex, name, flags=re.IGNORECASE)
        if match:
            return product_type
    return None

filtered_products = filtered_products.copy()  # work on an explicit copy so that adding columns does not trigger pandas' SettingWithCopyWarning
filtered_products.loc[:,'product_type'] = filtered_products.apply(assign_product_type, axis=1)

# optional intermediate output of the filtered products list
# with pd.option_context('display.max_rows', None, 'display.max_colwidth', None):
#     display(filtered_products)
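
Because the loop returns on the first hit, a name that satisfies several patterns is labeled according to the order of the dictionary. The sketch below shows this with two simplified, deliberately overlapping patterns; `toy_regex_map`, `first_matching_type`, and the product name are illustrative assumptions, not part of the notebook's data.

```python
import re

# Two deliberately overlapping, simplified patterns (not the ones used in the notebook)
toy_regex_map = {
    "condensed_milk": r"sữa đặc",
    "milk": r"^sữa",
}

def first_matching_type(name, regex_map):
    """Return the first product type whose pattern matches, following dict insertion order."""
    for product_type, regex in regex_map.items():
        if re.search(regex, name, flags=re.IGNORECASE):
            return product_type
    return None

name = "Sữa đặc có đường 380g"  # invented product name
reordered = {"milk": toy_regex_map["milk"], "condensed_milk": toy_regex_map["condensed_milk"]}

label_original = first_matching_type(name, toy_regex_map)
label_reordered = first_matching_type(name, reordered)
print(label_original, label_reordered)
```

Keeping patterns mutually exclusive through anchors and lookaheads avoids relying on dictionary order for correct labels.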

#### 3.4. Extracting and Normalizing Units

Product listings differ in quantity, weight, or volume. To enable a fair comparison, we extract the following values from the product name (for weight, falling back to the `uom` field when the name gives no weight) and use them to calculate normalized prices per kilogram, per unit, and per liter:

- Weight in grams
- Number of units (in particular, eggs)
- Volume in milliliters

Each value is extracted using pattern matching. Not every product carries all three values, so the normalized price columns (*price_kg*, *price_unit*, *price_lit*) may be empty for some items.
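
As a worked example of the normalization, the sketch below uses a reduced extractor that covers only two of the formats (an `N x Wg` pack and a single weight in g or kg) and derives the price per kilogram. `total_grams`, the product names, and the prices are invented for illustration and are simpler than the notebook's own functions.

```python
import re

def total_grams(name):
    """Simplified extractor: handles 'N x Wg' packs and single 'Wg' / 'Wkg' weights only."""
    pack = re.search(r'(\d+)\s?x\s?(\d+(?:[,.]\d+)?)\s?g\b', name, flags=re.IGNORECASE)
    if pack:
        return int(pack.group(1)) * float(pack.group(2).replace(',', '.'))
    single = re.search(r'(\d+(?:[,.]\d+)?)\s?(kg|g)\b', name, flags=re.IGNORECASE)
    if single:
        value = float(single.group(1).replace(',', '.'))
        return value * 1000 if single.group(2).lower() == 'kg' else value
    return None  # no weight found in the name

# Invented names and prices, for illustration only
examples = [("Mì 5 x 100g", 50000), ("Gạo 1,5kg", 45000)]
for name, price in examples:
    grams = total_grams(name)
    price_kg = price / grams * 1000  # same formula as the price_kg column
    print(name, grams, price_kg)
```

The pack format is checked first so that `5 x 100g` is read as 500 g rather than as a single 100 g item.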

In [None]:
def extract_weight(row):
    """Extracts total weight in grams from the product name or unit info.
    Supports single weights and multi-portion formats (e.g., '5x100g', '5 gói x 100g').
    If only 'kg' is mentioned or implied, defaults to 1kg.
    """
    
    name, uom = row['name'], row['uom']
    
    # multi-portion format (weight goes first)
    match = re.search(r'(\d+|\d+[,.]\d+)\s?(g\b|gr\b)\s?(x|gói)\s?(\d+)', name, flags=re.IGNORECASE) # matches e.g. '100g x 5' or '100g gói 5'
    if match:
        portion = int(match.group(4)) # extract the number of portions
        per_portion = float(match.group(1).replace(',', '.')) # extract the weight per portion
        return portion * per_portion # return total weight
    # multi-portion format (weight goes second)    
    match = re.search(r'(\d+)(\s|\shủ\s?|\shộp\s?|\sgói\s?|\stúi\s?)?x\s?(\d+|\d+[,.]\d+)\s?g\b', name, flags=re.IGNORECASE)
    if match:
        portion = float(match.group(1).replace(',', '.'))
        per_portion = float(match.group(3).replace(',', '.'))
        return portion * per_portion
    # single weight (grams or kilograms)
    match = re.search(r'(\d+|\d+[,.]\d+)\s?(g\b|gr\b|kg)', name, flags=re.IGNORECASE)
    if match:
        weight = float(match.group(1).replace(',', '.'))
        unit = match.group(2)
        return weight * 1000 if unit.lower() == 'kg' else weight # convert kilograms to grams (any capitalization, since the search ignores case)
    # if name doesn't contain anything, check uom
    if uom == 'kg':
        weight = 1000
        return weight
    # if none of above worked but there's 'kg' in the name
    match = re.search(r'kg', name, flags=re.IGNORECASE)
    if match:
        weight = 1000
        return weight
    
    return None  # if nothing matched

# the next two functions follow the same logic as extract_weight, but for units and milliliters
def extract_number_of_units(row):
    """Extracts the number of units from the product name.
    Supports formats like '10x', '10 túi', '10 trứng'.
    """
    
    name = row['name']
    
    # check name
    match = re.search(r'(\d+)\s?(túi|gói|trứng|t\b|x)', name, flags=re.IGNORECASE) # túi - bag, gói - package, trứng/t - egg
    if match:
        number_of_units = int(match.group(1))
        return number_of_units

    return None  # if nothing matched
   
def extract_volume(row):
    """Extracts total volume in milliliters from the product name.
    Supports single and multi-portion formats (e.g., '5x100ml', 'thùng 6 x 330ml').
    """
    
    name, uom = row['name'], row['uom']
    
    # multi-portion format (volume goes first)
    match = re.search(r'(\d+|\d+[,.]\d+)\s?(ml|l\b|lít)\s?(x|thùng)\s?(\d+)', name, flags=re.IGNORECASE) # thùng - box
    if match:
        portion = int(match.group(4))
        per_portion = float(match.group(1).replace(',', '.'))
        unit = match.group(2)
        return portion * per_portion * 1000 if unit.lower() in ('l', 'lít') else portion * per_portion
    # multi-portion format (volume goes second)    
    match = re.search(r'(\d+)(\s|\sgói\s?|\sbịch\s?|\shộp\s?|\schai\s?)?[x×]\s?(\d+|\d+[,.]\d+)\s?(ml|l\b|lít)', name, flags=re.IGNORECASE) # bịch - bag, hộp - box, chai - bottle
    if match:
        portion = int(match.group(1))
        per_portion = float(match.group(3).replace(',', '.'))
        unit = match.group(4)
        return portion * per_portion * 1000 if unit.lower() in ('l', 'lít') else portion * per_portion
    # single volume (liters or milliliters)
    match = re.search(r'(\d+|\d+[,.]\d+)\s?(ml|l\b|lít)', name, flags=re.IGNORECASE)
    if match:
        volume = float(match.group(1).replace(',', '.'))
        unit = match.group(2)
        return volume * 1000 if unit.lower() in ('l', 'lít') else volume
    
    return None  # if nothing matched

filtered_products = filtered_products.copy()  # work on an explicit copy to avoid pandas' SettingWithCopyWarning

# calculate normalized prices
filtered_products.loc[:,'weight'] = filtered_products.apply(extract_weight, axis=1)  # a column with weights in grams
filtered_products.loc[:,'price_kg'] = filtered_products.price / filtered_products.weight * 1000   # a column with prices per kg

filtered_products.loc[:,'number_of_units'] = filtered_products.apply(extract_number_of_units, axis=1)  # a column with number of units
filtered_products.loc[:,'price_unit'] = filtered_products.price / filtered_products.number_of_units   # a column with prices per unit

filtered_products.loc[:,'volume'] = filtered_products.apply(extract_volume, axis=1)  # a column with volume in ml
filtered_products.loc[:,'price_lit'] = filtered_products.price / filtered_products.volume * 1000   # a column with prices per liter

# optional intermediate output
# with pd.option_context('display.max_rows', None, 'display.max_colwidth', None):
#     display(filtered_products)

#### 3.5. Saving the Final Filtered Dataset

Finally, the enriched dataset is saved to a new CSV file for further analysis.

In [None]:
filtered_products.to_csv('filtered_products-2025-03-07-coop.csv')