# Webscraping practice

## Static websites
1. Get robots.txt and build an exclusion-aware wrapper for all other calls (a wrapper around requests.get())
2. Data model with Pydantic
3. Get the website structure for directories
4. Follow pagination
5. Catalogue vs. product page
6. Save data to CSV

## Soumari

### Robots.txt
1. Explore the website and its robots.txt
2. Create a wrapper for requests.get() to avoid forbidden directories

In [56]:
import requests
from bs4 import BeautifulSoup

In [57]:
# Setup session and user-agent
headers = {  # Need to be replaced with your details
    'User-Agent': 'Webscraping Capacity Building 1.0',
    'From': 'luigi.palumbo@unitus.it'  
}

s = requests.Session()
s.headers.update(headers)

In [58]:
soumari_root = "https://www.soumari.com/"

Getting robots.txt file

In [59]:
robots_soumari = s.get(soumari_root + "robots.txt")

In [60]:
print(robots_soumari.text)

User-agent: *



In this case the file is not very informative: it sets no restrictions, so the whole website may be accessed.

### Data Model
1. Define what data needs to be captured and its type
2. Try to keep consistency across different websites to facilitate post-processing

We use Pydantic [https://docs.pydantic.dev/](https://docs.pydantic.dev/)

In [6]:
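from datetime import date
from typing import Optional
from pydantic import BaseModel

Basic product class, to be reused across different websites

In [7]:
class Product(BaseModel):
    link: str
    source: str
    # Optional fields default to None when the website does not provide them
    category: Optional[str] = None
    subcategory: Optional[str] = None
    subsubcategory: Optional[str] = None
    name: Optional[str] = None
    brand: Optional[str] = None
    uid: Optional[str] = None
    price: float
    regular_price: Optional[float] = None
    currency: str
    in_stock: Optional[str] = None
    description: Optional[str] = None
    # Note: this default is computed once, when the class is defined
    date: str = date.today().strftime("%Y-%m-%d")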


Specific product class for a source website

In [8]:
class Soumari(Product):
    source: str = "Soumari"
    currency: str = "CFA"

You could add more validation for data fields in Pydantic, but it may impact the speed and stability of webscraping. We suggest performing most of the cleaning afterwards.

### Website structure

In [61]:
homepage = s.get(soumari_root)

Use the browser's Inspector to check how the menu is structured

![Soumari menu](./images/soumari_menu.png "Soumari menu")

In [62]:
page = BeautifulSoup(homepage.text, 'html.parser')

In [63]:
links = [{"link": item.get("href"), "name": item.get_text()} for item in page.find(id="menu-mega-menu").find_all("a")]
# Have a closer look at list comprehensions if the above does not seem intuitive. It is simpler than it looks!

In [64]:
links

[{'link': 'https://www.soumari.com/shop/', 'name': ' TENDANCES'},
 {'link': 'https://www.soumari.com/categorie-produit/cosmetique-bio/',
  'name': ' COSMETIQUE & BIO'},
 {'link': 'https://www.soumari.com/categorie-produit/cosmetique-bio/bio/',
  'name': 'BIO'},
 {'link': 'https://www.soumari.com/categorie-produit/cosmetique-bio/cosmetique/',
  'name': 'COSMETIQUE'},
 {'link': 'https://www.soumari.com/categorie-produit/meuble-deco/',
  'name': ' MEUBLE & DECO'},
 {'link': 'https://www.soumari.com/categorie-produit/smartphones-haut-de-gamme/',
  'name': ' TÉLÉPHONES'},
 {'link': 'https://www.soumari.com/categorie-produit/smartphones-haut-de-gamme/tecno/',
  'name': 'TECNO'},
 {'link': 'https://www.soumari.com/categorie-produit/smartphones-haut-de-gamme/samsung/',
  'name': 'SAMSUNG'},
 {'link': 'https://www.soumari.com/categorie-produit/smartphones-haut-de-gamme/huawei/',
  'name': 'HUAWEI'},
 {'link': 'https://www.soumari.com/categorie-produit/smartphones-haut-de-gamme/iphone/',
  'name

We can perform some cleaning, removing invalid links and the general "shop" one that lists all products.  
This is an opinionated choice: you could also just use the "shop" page and get each individual product.

In [65]:
clean_links = [
    {"link": item.get("href"), "category": item.get_text()}
    for item in page.find(id="menu-mega-menu").find_all("a")
    if (
        item.get("href").startswith("https://") and  # this is specific for this case, you may also find relative links
        not item.get("href").endswith("/shop/")  # also this one is specific for this case
        )
    ]
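The comment in the cell above notes that other websites may use relative links. A minimal sketch of how such links could be normalized with the standard library follows; `normalize_link` is a hypothetical helper, not part of the notebook, and the `javascript:` link is only an illustrative example of an invalid menu entry.

```python
from typing import Optional
from urllib.parse import urljoin, urlparse


def normalize_link(base: str, href: Optional[str]) -> Optional[str]:
    """Return an absolute http(s) link, or None if the href is not usable.

    Hypothetical helper: relative links are resolved against the site root,
    and anything that is not http/https (e.g. "javascript:" links) is dropped.
    """
    if not href:
        return None
    absolute = urljoin(base, href)
    if urlparse(absolute).scheme not in ("http", "https"):
        return None
    return absolute


soumari_root = "https://www.soumari.com/"
relative = normalize_link(soumari_root, "/categorie-produit/tablette/")
already_absolute = normalize_link(soumari_root, "https://www.soumari.com/shop/")
invalid = normalize_link(soumari_root, "javascript:void(0)")
missing = normalize_link(soumari_root, None)
print(relative)
```

With a helper like this, the filter in the list comprehension could test the normalized link instead of relying on `startswith("https://")`.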

In [66]:
# Order the links putting the deepest categories first. This is an opinionated choice
clean_links.sort(key=lambda x: x.get("link", "").count('/'), reverse=True)

In [67]:
clean_links

[{'link': 'https://www.soumari.com/categorie-produit/electromenager/climatiseur-ventiallateur/climatiseur/split-syinix/',
  'category': 'Syinix'},
 {'link': 'https://www.soumari.com/categorie-produit/electromenager/climatiseur-ventiallateur/climatiseur/astech/',
  'category': 'Astech'},
 {'link': 'https://www.soumari.com/categorie-produit/electromenager/climatiseur-ventiallateur/climatiseur/beko/',
  'category': 'Beko'},
 {'link': 'https://www.soumari.com/categorie-produit/electromenager/climatiseur-ventiallateur/climatiseur/continental/',
  'category': 'Continental'},
 {'link': 'https://www.soumari.com/categorie-produit/electromenager/climatiseur-ventiallateur/climatiseur/daikin/',
  'category': 'Daikin'},
 {'link': 'https://www.soumari.com/categorie-produit/electromenager/climatiseur-ventiallateur/climatiseur/general-max/',
  'category': 'Général Max'},
 {'link': 'https://www.soumari.com/categorie-produit/electromenager/climatiseur-ventiallateur/climatiseur/gree/',
  'category': 'Gre

Check how the information is presented in some catalogue pages. Unfortunately it does not seem to be consistent (and it changed during the last week!)

In [68]:
electric_insect = s.get("https://www.soumari.com/categorie-produit/electronique/electric-insect/")

In [69]:
electric_insect = BeautifulSoup(electric_insect.text, 'html.parser')

In [21]:
# I had to change category, as the website changed
category_list = [item.get_text() for item in electric_insect.find("ul", {"class": "breadcrumbs"}).find_all("span", {"itemprop": "name"})]

In [70]:
category_list  # need to drop the first two items

['accueil  ', 'Shop', 'ÉLECTRONIQUE', 'ELECTRIC INSECT']
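As noted in the cell above, the first two breadcrumb items are generic and need to be dropped. A small sketch of how the remaining items could be mapped to the category fields of the data model is shown here; `breadcrumbs_to_categories` is a hypothetical helper, and mapping the breadcrumbs top-down (general category first) is an assumption.

```python
from typing import Dict, List


def breadcrumbs_to_categories(crumbs: List[str], skip: int = 2) -> Dict[str, str]:
    """Map breadcrumb texts to category fields, ignoring the first `skip` items.

    Hypothetical helper: at most three levels are kept, matching the
    category / subcategory / subsubcategory fields of the Product model.
    """
    fields = ["category", "subcategory", "subsubcategory"]
    cleaned = [crumb.strip() for crumb in crumbs[skip:] if crumb.strip()]
    return dict(zip(fields, cleaned))


category_list = ['accueil  ', 'Shop', 'ÉLECTRONIQUE', 'ELECTRIC INSECT']
categories = breadcrumbs_to_categories(category_list)
print(categories)
```

The resulting dict can be unpacked directly into the data model, since missing levels are simply absent.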

In [71]:
tecno = s.get("https://www.soumari.com/categorie-produit/smartphones-haut-de-gamme/tecno/")
tecno = BeautifulSoup(tecno.text, 'html.parser')

In [72]:
not_working = [item.get_text() for item in tecno.find("ul", {"class": "breadcrumbs"}).find_all("span", {"itemprop": "name"})]


AttributeError: 'NoneType' object has no attribute 'find_all'

Probably we are better off using only one category, the deepest one we can find. Let's see how to implement this.

### Pagination

In [75]:
# This is the second page: run this cell to see that there is no "Next" link there
tecno = s.get("https://www.soumari.com/categorie-produit/smartphones-haut-de-gamme/tecno/page/2/")
tecno = BeautifulSoup(tecno.text, 'html.parser')

In [76]:
next_page = tecno.find("nav", {"class": "woocommerce-pagination"}).find("a", {"class": "next"})

# If there is no next page, the result from above is None

if next_page is not None:
    next_page = next_page.get("href")

In [77]:
next_page

### Catalog and product pages

In [78]:
tv_catalogue_link = "https://www.soumari.com/categorie-produit/television/"

tv_catalogue = s.get(tv_catalogue_link)
tv_catalogue = BeautifulSoup(tv_catalogue.text, 'html.parser')

Let's find the data in the catalogue page

In [80]:
tv_catalogue.find("ul", {"class": "products"}).find_all("div", {"class": "mf-product-details"})[1].find("h2").get_text()

'Televiseur SAMSUNG 85 Pouces QA85QN800ATXZT 8K'

In [82]:
tv_catalogue.find("ul", {"class": "products"}).find_all("div", {"class": "mf-product-details"})[1].find("bdi").get_text()

'3.465.000\xa0CFA'

In [83]:
products = [
    {"name": item.find("h2").get_text(), "price": item.find("bdi").get_text()}
    for item in
    tv_catalogue.find("ul", {"class": "products"}).find_all("div", {"class": "mf-product-details"})
    ]

In [86]:
products[3]

{'name': 'Televiseur ELACTRON 65 Pouces QLED Smart Android TS-6561AS',
 'price': '363.800\xa0CFA'}

The price needs to be cleaned so that only the digits remain

In [88]:
"".join(filter(str.isdigit, products[1].get("price")))

'3465000'

Let's make it a function

In [31]:
# Here it is easier because our prices seem to have no decimals.
# It may be very different in other countries.
# In production we may want to add some checks to avoid errors
# (for instance, if the string is empty).

def filter_digits(rawdata: str) -> int:
    """Extract the digits from a string and return them as an integer

    Parameters:
        rawdata (str): String from which to extract digits

    Returns:
        Integer built from the digits found in the string
    """
    return int("".join(filter(str.isdigit, rawdata)))
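The comments above mention two cases the function does not cover: empty strings and prices with decimals. The following is only a sketch of one way to handle them; `parse_price` and its `decimal_sep` parameter are hypothetical, and the `"6,29 TND"` string is an invented example of a price with a decimal comma.

```python
from typing import Optional


def parse_price(raw: str, decimal_sep: Optional[str] = None) -> Optional[float]:
    """Parse a price string, returning None when no digits are found.

    Hypothetical helper: `decimal_sep` is the character separating the
    decimals (None means the price has no decimals, as on Soumari).
    """
    if not raw:
        return None
    if decimal_sep is None:
        digits = "".join(ch for ch in raw if ch.isdigit())
        return float(digits) if digits else None
    # Split once on the decimal separator, then keep only digits on each side
    integer_part, _, fraction_part = raw.partition(decimal_sep)
    integer_digits = "".join(ch for ch in integer_part if ch.isdigit())
    fraction_digits = "".join(ch for ch in fraction_part if ch.isdigit())
    if not integer_digits and not fraction_digits:
        return None
    return float(f"{integer_digits or '0'}.{fraction_digits or '0'}")


no_decimals = parse_price("3.465.000\xa0CFA")
empty = parse_price("")
with_decimals = parse_price("6,29 TND", decimal_sep=",")
print(no_decimals, empty, with_decimals)
```

Returning `None` instead of raising keeps a long scraping run alive; the missing prices can then be inspected during post-processing.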

This product is discounted, we may want to get the regular price too

In [89]:
products[15]

{'name': 'TELEVISEUR Astech 85 SMART ANDROID QLED TV 85ZX500-QD',
 'price': '985.000\xa0CFA'}

It seems when a product is discounted there are two more tags: `<ins>` and `<del>`

![Discounted price](./images/discounted.png "Discounted price")

Let's turn this parsing into a function too, also including the product link

In [90]:
def parse_catalog_items(page: BeautifulSoup) -> list:
    """Function to parse all the items in a catalogue page

    Parameters:
        page (BeautifulSoup): parsed HTML catalogue page with product information

    Returns:
        parsed_products (list): list of dicts with parsed product information
    """
    parsed_products = []
    products = page.find("ul", {"class": "products"}).find_all("div", {"class": "mf-product-details"})
    for product in products:
        parsed_product = {}
        parsed_product["name"] = product.find("h2").get_text()
        parsed_product["link"] = product.find("h2").find("a").get("href")
        if product.find("ins") is not None:
            parsed_product["price"] = filter_digits(product.find("ins").find("bdi").get_text())
            parsed_product["regular_price"] = filter_digits(product.find("del").find("bdi").get_text())
        else:
            parsed_product["price"] = filter_digits(product.find("bdi").get_text())
        parsed_products.append(parsed_product)

    return parsed_products

In [91]:
clean_products = parse_catalog_items(tv_catalogue)

In [92]:
clean_products

[{'name': 'Televiseur READY Led 32 Pouces',
  'link': 'https://www.soumari.com/produit/televiseur-ready-led-32-pouces/',
  'price': 66000},
 {'name': 'Televiseur SAMSUNG 85 Pouces QA85QN800ATXZT 8K',
  'link': 'https://www.soumari.com/produit/televiseur-samsung-85-pouces-qa85qn800atxzt-8k/',
  'price': 3465000},
 {'name': 'Televiseur WAIGAA 85 Pouces LE-WG85 Smart TV 4K UHD',
  'link': 'https://www.soumari.com/produit/televiseur-waigaa-85-pouces-le-wg85-smart-tv-4k-uhd/',
  'price': 742000},
 {'name': 'Televiseur ELACTRON 65 Pouces QLED Smart Android TS-6561AS',
  'link': 'https://www.soumari.com/produit/televiseur-elactron-65-pouces-qled-smart-android-ts-6561as/',
  'price': 363800},
 {'name': 'Televiseur ELACTRON 55 Pouces QLED Smart Android TS-5561AS',
  'link': 'https://www.soumari.com/produit/televiseur-elactron-55-pouces-qled-smart-android-ts-5561as/',
  'price': 248400},
 {'name': 'Televiseur ELACTRON 50 Pouces QLED Smart Android TS-5061AS',
  'link': 'https://www.soumari.com/pr

Now we define a function that combines parsing with following the pagination, using recursion.

In [93]:
import time


def scrape_category(link: str, category: str, Item: BaseModel, s: requests.Session, delay: float = 1) -> list:
    """Function to scrape a category following pagination.
    Parameters:
        link (str): starting link for a category
        category (str): category name
        Item (BaseModel): class of the data object for the specific source
        s (requests.Session): Requests session with User-Agent properly set
        delay (float): delay in seconds between calls to prevent overloading the source

    Returns:
        list of products with all information
    """
    time.sleep(delay)
    page = s.get(link)
    page = BeautifulSoup(page.text, 'html.parser')
    results = parse_catalog_items(page)
    results = [Item(**res, category=category) for res in results]
    
    # Follow pagination if exists
    try:
        next_page = page.find("nav", {"class": "woocommerce-pagination"}).find("a", {"class": "next"})
        if next_page is not None:
            next_page = next_page.get("href")
            next_results = scrape_category(link=next_page, category=category, Item=Item, s=s, delay=delay)
            results.extend(next_results)
    except AttributeError:
        pass
            
    return results

Modified function that accepts a parameter to scrape only the first two pages of a category

In [None]:
def scrape_category_dev(
        link: str,
        category: str,
        Item: BaseModel,
        s: requests.Session,
        delay: float = 1,
        dev: bool = False,
        pages_scraped: int = 0) -> list:
    """Function to scrape a category following pagination.
    Parameters:
        link (str): starting link for a category
        category (str): category name
        Item (BaseModel): class of the data object for the specific source
        s (requests.Session): Requests session with User-Agent properly set
        delay (float): delay in seconds between calls to prevent overloading the source
        dev (bool): if dev == True, the function will only parse the initial two pages
        pages_scraped (int): Number of pages already scraped

    Returns:
        list of products with all information
    """
    time.sleep(delay)
    page = s.get(link)
    page = BeautifulSoup(page.text, 'html.parser')
    results = parse_catalog_items(page)
    results = [Item(**res, category=category) for res in results]
    pages_scraped += 1

    if pages_scraped >= 2 and dev:
        return results
    
    # Follow pagination if exists
    try:
        next_page = page.find("nav", {"class": "woocommerce-pagination"}).find("a", {"class": "next"})
        if next_page is not None:
            next_page = next_page.get("href")
            next_results = scrape_category_dev(
                link=next_page,
                category=category,
                Item=Item,
                s=s,
                delay=delay,
                dev=dev,
                pages_scraped=pages_scraped)
            results.extend(next_results)
    except AttributeError:
        pass
            
    return results
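Recursion works well for short categories, but the same pagination logic can also be written as a loop, which avoids deep call stacks and makes a page limit easy to add. The sketch below shows only the link-following part; `iterate_pages`, `get_next_link` and `fake_site` are hypothetical names, and the dictionary stands in for real pages so the example runs without network access.

```python
from typing import Callable, Iterator, Optional


def iterate_pages(
    start_link: str,
    get_next_link: Callable[[str], Optional[str]],
    max_pages: Optional[int] = None,
) -> Iterator[str]:
    """Yield page links by following "next" links with a loop instead of recursion.

    Hypothetical helper: `get_next_link` returns the link of the following page,
    or None when there is no "Next" link. Already visited links stop the loop.
    """
    link: Optional[str] = start_link
    visited = set()
    while link is not None and link not in visited:
        if max_pages is not None and len(visited) >= max_pages:
            break
        visited.add(link)
        yield link
        link = get_next_link(link)


# Stand-in for the website: each page maps to its "next" link (None on the last page)
fake_site = {
    "/cat/": "/cat/page/2/",
    "/cat/page/2/": "/cat/page/3/",
    "/cat/page/3/": None,
}
all_pages = list(iterate_pages("/cat/", fake_site.get))
first_two = list(iterate_pages("/cat/", fake_site.get, max_pages=2))
print(all_pages, first_two)
```

In a real scraper the page would be downloaded and parsed once per iteration, with the products and the "next" link both extracted from the same parsed page.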

A little example of what scraping from the catalogue may look like, just for a few categories

In [94]:
import random  # used to get a random sample of categories

data_list = []

for cat in random.sample(clean_links, 3):
    print(cat)
    data_list.extend(scrape_category(cat["link"], cat["category"], Soumari, s, 1))

{'link': 'https://www.soumari.com/categorie-produit/electromenager/climatiseur-ventiallateur/climatiseur/lg-climatiseur/', 'category': 'LG'}
{'link': 'https://www.soumari.com/categorie-produit/tablette/', 'category': ' TABLETTES'}
{'link': 'https://www.soumari.com/categorie-produit/energie-solaire/batterie/', 'category': 'Batterie Solaire'}


In [40]:
# Double check the workings for a category with pagination
results = scrape_category("https://www.soumari.com/categorie-produit/cosmetique-bio/cosmetique/", "Cosmetique", Soumari, s, 1)

In [41]:
len(results)

73

Issues specific to the Soumari website (you will always find specific issues...):
1. Categorization with breadcrumbs is not available in all categories
2. The categories parsed from the menu are not unique (Samsung appears in both smartphones and air conditioning)
3. Categories are not exclusive: there is a lot of overlap and it's not easy to determine the level of each category
4. Some information (such as stock availability) is not available at catalogue level


Parsing each individual page may provide better data quality. In that case it may be better to use the large "Shop" directory that we excluded when scraping by category.

Let's try to get the information from each product page:

In [95]:
example_page = s.get("https://www.soumari.com/produit/televiseur-samsung-85-pouces-qa85qn800atxzt-8k/")
example_page = BeautifulSoup(example_page.text, 'html.parser')

In [96]:
example_page.find("h1", {"class": "product_title"}).get_text()

'Televiseur SAMSUNG 85 Pouces QA85QN800ATXZT 8K'

In [97]:
example_page.find("p", {"class": "price"}).find("bdi").get_text()

'3.465.000\xa0CFA'

In [98]:
filter_digits(example_page.find("p", {"class": "price"}).find("bdi").get_text())

3465000

In [99]:
example_page.find("ul", {"class": "entry-meta"}).find("li", {"class": "meta-brand"}).find("a").get_text()

'Samsung'

In [101]:
example_page.find("p", {"class": "stock"}).get_text().split(":")[-1].strip() 

'En stock'

In [102]:
example_page.find("span", {"class": "posted_in"}).find_all("a")

[<a href="https://www.soumari.com/categorie-produit/television/samsung-television/" rel="tag">Samsung</a>,
 <a href="https://www.soumari.com/categorie-produit/television/" rel="tag">TÉLÉVISEURS</a>]

We put all this information together in a function

In [103]:
def scrape_individual_pages(link: str, Item: BaseModel, s: requests.Session, delay: float = 1) -> list:
    """Function to scrape each individual page in a directory following pagination.
    Parameters:
        link (str): starting link for the overall directory catalogue
        Item (BaseModel): class of the data object for the specific source
        s (requests.Session): Requests session with User-Agent properly set
        delay (float): delay in seconds between calls to prevent overloading the source

    Returns:
        list of products with all information
    """
    time.sleep(delay)
    page = s.get(link)
    page = BeautifulSoup(page.text, 'html.parser')
    links = [item.get("link") for item in parse_catalog_items(page)]  # We can reuse the previous function, but we only keep the link
    results = []
    for l in links[:2]:  # for testing purposes, only get the first two products in each page. Remove the list selection in production
        time.sleep(delay)
        product = s.get(l)
        product = BeautifulSoup(product.text, 'html.parser')
        parsed_product = {}
        parsed_product["name"] = product.find("h1", {"class": "product_title"}).get_text()
        parsed_product["link"] = l
        if product.find("p", {"class": "price"}).find("ins") is not None:
            parsed_product["price"] = filter_digits(product.find("p", {"class": "price"}).find("ins").find("bdi").get_text())
            parsed_product["regular_price"] = filter_digits(product.find("p", {"class": "price"}).find("del").find("bdi").get_text())
        else:
            parsed_product["price"] = filter_digits(product.find("p", {"class": "price"}).find("bdi").get_text())
        if product.find("ul", {"class": "entry-meta"}).find("li", {"class": "meta-brand"}) is not None:
            parsed_product["brand"] = product.find("ul", {"class": "entry-meta"}).find("li", {"class": "meta-brand"}).find("a").get_text()
        if product.find("p", {"class": "stock"}) is not None:
            parsed_product["in_stock"] = product.find("p", {"class": "stock"}).get_text().split(":")[-1].strip()  # This may be quite fragile...
        if product.find("span", {"class": "posted_in"}) is not None:
            category_tags = product.find("span", {"class": "posted_in"}).find_all("a")
            parsed_product["category"] = category_tags.pop().get_text()
            # The number of category tags may vary...this function accounts for up to 3 tags
            if len(category_tags) > 0:
                parsed_product["subcategory"] = category_tags.pop().get_text()
            if len(category_tags) > 0:
                parsed_product["subsubcategory"] = category_tags.pop().get_text()
        results.append(Item(**parsed_product))

    # Follow pagination if exists
    try:
        next_page = page.find("nav", {"class": "woocommerce-pagination"}).find("a", {"class": "next"})
        if next_page is not None:
            next_page = next_page.get("href")
            next_results = scrape_individual_pages(link=next_page, Item=Item, s=s, delay=delay)
            results.extend(next_results)
    except AttributeError:
        pass
            
    return results

In [104]:
individual_results = scrape_individual_pages("https://www.soumari.com/shop/", Soumari, s, 1)

### Save data as CSV

It is a good practice to automatically set today's date in the file name, to avoid overwriting past data by mistake

In [105]:
import pandas as pd

soumari_catalog_df = pd.DataFrame([prod.dict(exclude_none=True) for prod in data_list])
soumari_catalog_df.to_csv(f"soumari_catalog_{date.today().strftime('%Y-%m-%d')}.csv", index=False)

soumari_individual_df = pd.DataFrame([prod.dict(exclude_none=True) for prod in individual_results])
soumari_individual_df.to_csv("soumari_individual_{}.csv".format(date.today().strftime("%Y-%m-%d")), index=False)

# F-strings and .format() can be used for creating dynamic names, it's a matter of personal preferences.

In [107]:
soumari_individual_df.head()

| | link | source | category | subcategory | name | brand | price | currency | in_stock | date | regular_price |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | https://www.soumari.com/produit/tecno-spark-10... | Soumari | TÉLÉPHONES | TECNO | Tecno SPARK 10 Pro – Mémoire 256 Go – RAM 8 Go... | Tecno | 103000.0 | CFA | En stock | 2023-03-13 | |
| 1 | https://www.soumari.com/produit/refrigerateur-... | Soumari | Réfrigérateurs et Congélateurs | ELECTROMÉNAGER | Refrigerateur ICONA Bar 1 Porte Silver ILRF-101AA | Icona | 84000.0 | CFA | En stock | 2023-03-13 | |
| 2 | https://www.soumari.com/produit/ensemble-therm... | Soumari | ELECTROMÉNAGER | ACCESSOIRES DE CUISINE | Ensemble Thermos 2 Pieces | | 31250.0 | CFA | En stock | 2023-03-13 | |
| 3 | https://www.soumari.com/produit/lave-gobelet-a... | Soumari | ELECTROMÉNAGER | ACCESSOIRES DE CUISINE | Lave Gobelet Automatique Accessoire de Barre p... | | 4500.0 | CFA | En stock | 2023-03-13 | |
| 4 | https://www.soumari.com/produit/cartouche-hp-9... | Soumari | INFORMATIQUE | CARTOUCHES | Cartouche HP 912 XL Noir | HP | 30000.0 | CFA | En stock | 2023-03-13 | |


## Carrefour

### Robots.txt

Reusing session and libraries from previous example

In [108]:
carrefour_root = "https://www.carrefour.tn/"
robots_carrefour = s.get(carrefour_root + "robots.txt")
print(robots_carrefour.text)

User-agent: *
Disallow: /index.php/
Disallow: /checkout/
Disallow: /app/
Disallow: /*?p=*&
Disallow: /lib/
Disallow: /*.php$
Disallow: /pkginfo/
Disallow: /report/
Disallow: /var/
Disallow: /catalog/
Disallow: /customer/
Disallow: /sendfriend/
Disallow: /review/
Disallow: /*SID=
# Disable checkout & customer account
Disallow: /checkout/
Disallow: /onestepcheckout/
Disallow: /customer/
Disallow: /customer/account/
Disallow: /customer/account/login/
Disallow: /customer/account/login/referer/*

# Disable Search pages
Disallow: /catalogsearch/
Disallow: /catalog/product_compare/
Disallow: /catalog/category/view/
Disallow: /catalog/product/view/

# Disable common folders
Disallow: /app/
Disallow: /bin/
Disallow: /dev/
Disallow: /lib/
Disallow: /phpserver/
Disallow: /pub/

# Disable Tag & Review (Avoid duplicate content)

Disallow: /tag/
Disallow: /review/

# Common files
Disallow: /composer.json
Disallow: /composer.lock
Disallow: /CONTRIBUTING.md
Disallow: /CONTRIBUTOR_LICENSE_AGREEMENT.htm

There is much more information here. Let's use a tool to extract actionable information from the robots.txt file: [https://docs.python.org/3/library/urllib.robotparser.html](https://docs.python.org/3/library/urllib.robotparser.html)

In [109]:
from urllib.robotparser import RobotFileParser

robots_carrefour = RobotFileParser(carrefour_root + "robots.txt")
robots_carrefour.read()

In [110]:
print(robots_carrefour.request_rate(s.headers.get("User-Agent")))
print(robots_carrefour.crawl_delay(s.headers.get("User-Agent")))
# No information on the request rate or crawl delay

None
None
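Since the robots.txt file gives no crawl delay, the scraper has to pick one itself. A minimal sketch of a fallback follows, assuming a default of one second (the same value used as the `delay` default elsewhere in this notebook); `polite_delay` is a hypothetical helper, and the rules are parsed from inline lines so the example runs offline. The `Crawl-delay: 5` line is invented for illustration.

```python
from urllib.robotparser import RobotFileParser


def polite_delay(robots: RobotFileParser, user_agent: str, default: float = 1.0) -> float:
    """Return the crawl delay requested by robots.txt, or a default when absent."""
    delay = robots.crawl_delay(user_agent)
    return float(delay) if delay is not None else default


user_agent = "Webscraping Capacity Building 1.0"

# robots.txt without any crawl delay, as in the Carrefour case
silent = RobotFileParser()
silent.parse(["User-agent: *", "Disallow: /checkout/"])

# Illustrative robots.txt that does ask for a delay
explicit = RobotFileParser()
explicit.parse(["User-agent: *", "Crawl-delay: 5", "Disallow: /checkout/"])

default_delay = polite_delay(silent, user_agent)
requested_delay = polite_delay(explicit, user_agent)
print(default_delay, requested_delay)
```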


In [111]:
s.headers.get("User-Agent")

'Webscraping Capacity Building 1.0'

In [112]:
robots_carrefour.can_fetch(s.headers.get("User-Agent"), "https://www.carrefour.tn/default/chocolat-au-yaourt-et-aux-framboises-null-4000417237002.html")

True
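To see how `can_fetch` reacts to forbidden paths without calling the website, a few of the rules printed above can be parsed from inline lines. This is only a sketch: the checkout link is a hypothetical example, and only plain path rules are included because, as far as I can tell, `urllib.robotparser` compares rules as simple path prefixes, so wildcard rules such as `/*?p=*&` may not be applied the way the site intends.

```python
from urllib.robotparser import RobotFileParser

user_agent = "Webscraping Capacity Building 1.0"

# A subset of the Carrefour rules shown above (plain path prefixes only)
rules = [
    "User-agent: *",
    "Disallow: /checkout/",
    "Disallow: /catalog/",
    "Disallow: /customer/",
]
offline_robots = RobotFileParser()
offline_robots.parse(rules)

product_allowed = offline_robots.can_fetch(
    user_agent,
    "https://www.carrefour.tn/default/chocolat-au-yaourt-et-aux-framboises-null-4000417237002.html",
)
# Hypothetical link under a forbidden directory
checkout_allowed = offline_robots.can_fetch(user_agent, "https://www.carrefour.tn/checkout/cart/")
print(product_allowed, checkout_allowed)
```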

In [44]:
type(robots_carrefour)

urllib.robotparser.RobotFileParser

In [45]:
type(s)

requests.sessions.Session

In [113]:
test = s.get("https://www.carrefour.tn/default/chocolat-au-yaourt-et-aux-framboises-null-4000417237002.html")
type(test)

requests.models.Response

We define a "safe get" function that respects the robots.txt 

In [116]:
def safe_get(s: requests.sessions.Session, robots: RobotFileParser, link: str) -> requests.models.Response:
    """Wrapper for a request session get call that respects the robots.txt file
    Parameters:
        s (requests.sessions.Session): Requests session with User-Agent properly set
        robots (RobotFileParser): Initialized robots.txt parsed object for the specific website
        link (str): link that you want to retrieve while respecting the robots.txt file
    
    Returns:
        Response from session get call or None if the link is forbidden by robots.txt
    """
    if robots.can_fetch(s.headers.get("User-Agent"), link):
        response = s.get(link)
    else:
        response = None
    return response

# This is only one of the possible ways to implement this. We could have created a subclass of requests.sessions.Session 
# and added a new method implementing the "safe_get". It is a matter of style (functional vs. object-oriented) and personal preference

Now we leverage the sitemap to get more information

In [123]:
robots_carrefour.site_maps()

['https://www.carrefour.tn/sitemap.xml']

In [124]:
sitemap = safe_get(s, robots_carrefour, robots_carrefour.site_maps()[0])

Parsing XML should be done safely, as there are multiple security risks: [https://docs.python.org/3/library/xml.html](https://docs.python.org/3/library/xml.html)  
Safe parsing is a rather convoluted operation. There are multiple libraries that offer much simpler parsing, but they come with security vulnerabilities. Use them at your own risk!  
The first step is simply to print the sitemap, so we can learn more about its namespaces and structure.

In [125]:
print(sitemap.text)

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap><loc>https://www.carrefour.tn/pub/sitemaps/categories_1.xml</loc><lastmod>2023-03-13T05:11:06+00:00</lastmod></sitemap>
<sitemap><loc>https://www.carrefour.tn/pub/sitemaps/cms_1.xml</loc><lastmod>2023-03-13T05:11:06+00:00</lastmod></sitemap>
<sitemap><loc>https://www.carrefour.tn/pub/sitemaps/stores.xml</loc><lastmod>2023-03-13T05:11:06+00:00</lastmod></sitemap>
<sitemap><loc>https://www.carrefour.tn/pub/sitemaps/category_products_1.xml</loc><lastmod>2023-03-13T05:11:06+00:00</lastmod></sitemap>
<sitemap><loc>https://www.carrefour.tn/pub/sitemaps/category_images_1.xml</loc><lastmod>2023-03-13T05:11:06+00:00</lastmod></sitemap>
</sitemapindex>


In [126]:
from defusedxml.ElementTree import fromstring

sitemap = fromstring(sitemap.text)
sitemap.tag

'{http://www.sitemaps.org/schemas/sitemap/0.9}sitemapindex'

In [127]:
for elem in sitemap.findall("{http://www.sitemaps.org/schemas/sitemap/0.9}sitemap"):
    print(elem.find("{http://www.sitemaps.org/schemas/sitemap/0.9}loc").text, elem.find("{http://www.sitemaps.org/schemas/sitemap/0.9}lastmod").text)

https://www.carrefour.tn/pub/sitemaps/categories_1.xml 2023-03-13T05:11:06+00:00
https://www.carrefour.tn/pub/sitemaps/cms_1.xml 2023-03-13T05:11:06+00:00
https://www.carrefour.tn/pub/sitemaps/stores.xml 2023-03-13T05:11:06+00:00
https://www.carrefour.tn/pub/sitemaps/category_products_1.xml 2023-03-13T05:11:06+00:00
https://www.carrefour.tn/pub/sitemaps/category_images_1.xml 2023-03-13T05:11:06+00:00
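Repeating the full namespace in every call is verbose. The ElementTree API also accepts a prefix-to-namespace mapping, as sketched below on a trusted inline sample; the `sm` prefix and the `NS` name are arbitrary choices. The standard library parser is used here only because the string is hard-coded: for downloaded content, `fromstring` from `defusedxml.ElementTree` should be kept, as in the cells above.

```python
import xml.etree.ElementTree as ET

# Prefix chosen for this example; any prefix works as long as the URI matches
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Trusted inline sample reproducing two entries of the sitemap index above
sample = (
    '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<sitemap><loc>https://www.carrefour.tn/pub/sitemaps/categories_1.xml</loc>"
    "<lastmod>2023-03-13T05:11:06+00:00</lastmod></sitemap>"
    "<sitemap><loc>https://www.carrefour.tn/pub/sitemaps/stores.xml</loc>"
    "<lastmod>2023-03-13T05:11:06+00:00</lastmod></sitemap>"
    "</sitemapindex>"
)

root = ET.fromstring(sample)
entries = [
    (elem.findtext("sm:loc", namespaces=NS), elem.findtext("sm:lastmod", namespaces=NS))
    for elem in root.findall("sm:sitemap", NS)
]
for loc, lastmod in entries:
    print(loc, lastmod)
```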


We can explore the sitemaps by opening the links above. The most promising seems to be the fourth one, which is another sitemap index covering the different categories. The first one, instead, may be useful if we want to scrape from catalogue pages.

In [129]:
sitemap = safe_get(s, robots_carrefour, "https://www.carrefour.tn/pub/sitemaps/category_products_1.xml")
sitemap = fromstring(sitemap.text)
print(sitemap.tag)

{http://www.sitemaps.org/schemas/sitemap/0.9}sitemapindex


Again, it lists more sitemaps. It seems that there is a sitemap for each product category. We save them into a list.  
For future data acquisition, we could store the date of our last harvest and fetch only the sitemaps that changed since then. First, however, we need to validate that the "lastmod" value is actually reliable, as we have no guarantee of this.

In [130]:
category_map_links = []
for elem in sitemap.findall("{http://www.sitemaps.org/schemas/sitemap/0.9}sitemap"):
    new_cat = {}
    new_cat["link"] = elem.find("{http://www.sitemaps.org/schemas/sitemap/0.9}loc").text
    new_cat["lastmod"] = elem.find("{http://www.sitemaps.org/schemas/sitemap/0.9}lastmod").text
    new_cat["category"] = elem.find("{http://www.sitemaps.org/schemas/sitemap/0.9}loc").text.split("/")[-1].split("_")[0]
    category_map_links.append(new_cat)
    # This last one is quite fragile...will need improvement for production
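The category name extraction is fragile because splitting on the first underscore would truncate any category whose name contains one. A slightly more defensive sketch uses a regular expression anchored on the `_products_<number>.xml` suffix seen in the links; `category_from_sitemap` is a hypothetical helper, the file naming scheme is assumed from the examples in this notebook, and the `Fruits_et_Legumes` link is invented for illustration.

```python
import re
from typing import Optional
from urllib.parse import urlparse

# Assumed naming scheme: <Category>_products_<number>.xml
SITEMAP_NAME = re.compile(r"^(?P<category>.+)_products_\d+\.xml$")


def category_from_sitemap(link: str) -> Optional[str]:
    """Return the category encoded in a sitemap file name, or None if it does not match."""
    filename = urlparse(link).path.rsplit("/", 1)[-1]
    match = SITEMAP_NAME.match(filename)
    return match.group("category") if match else None


simple = category_from_sitemap("https://www.carrefour.tn/pub/sitemaps/Alimentaire_products_1.xml")
# Invented link with an underscore in the category name
with_underscore = category_from_sitemap("https://www.carrefour.tn/pub/sitemaps/Fruits_et_Legumes_products_2.xml")
no_match = category_from_sitemap("https://www.carrefour.tn/pub/sitemaps/stores.xml")
print(simple, with_underscore, no_match)
```

Returning `None` for unexpected file names makes a change in the naming scheme visible instead of silently producing wrong categories.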

In [131]:
category_map_links

[{'link': 'https://www.carrefour.tn/pub/sitemaps/Alimentaire_products_1.xml',
  'lastmod': '2023-03-13T05:11:05+00:00',
  'category': 'Alimentaire'},
 {'link': 'https://www.carrefour.tn/pub/sitemaps/Hygieneadulteetbebe_products_1.xml',
  'lastmod': '2023-03-13T05:11:05+00:00',
  'category': 'Hygieneadulteetbebe'},
 {'link': 'https://www.carrefour.tn/pub/sitemaps/EntretienMaison_products_1.xml',
  'lastmod': '2023-03-13T05:11:05+00:00',
  'category': 'EntretienMaison'},
 {'link': 'https://www.carrefour.tn/pub/sitemaps/EquipezVotreCuisine_products_1.xml',
  'lastmod': '2023-03-13T05:11:05+00:00',
  'category': 'EquipezVotreCuisine'},
 {'link': 'https://www.carrefour.tn/pub/sitemaps/ElectromenageretHighTech_products_1.xml',
  'lastmod': '2023-03-13T05:11:05+00:00',
  'category': 'ElectromenageretHighTech'},
 {'link': 'https://www.carrefour.tn/pub/sitemaps/LingedeMaison_products_1.xml',
  'lastmod': '2023-03-13T05:11:05+00:00',
  'category': 'LingedeMaison'},
 {'link': 'https://www.carrefo
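If the "lastmod" values turn out to be reliable, selecting only the sitemaps changed since the last harvest could look like the sketch below. `changed_since` is a hypothetical helper, the harvest date is invented, and the second entry uses a made-up older "lastmod" value so that the filter has something to exclude.

```python
from datetime import datetime, timezone
from typing import Dict, List


def changed_since(sitemaps: List[Dict[str, str]], last_harvest: datetime) -> List[Dict[str, str]]:
    """Keep only the sitemaps whose "lastmod" is later than the last harvest."""
    return [sm for sm in sitemaps if datetime.fromisoformat(sm["lastmod"]) > last_harvest]


sample_links = [
    {
        "link": "https://www.carrefour.tn/pub/sitemaps/Alimentaire_products_1.xml",
        "lastmod": "2023-03-13T05:11:05+00:00",
        "category": "Alimentaire",
    },
    {
        "link": "https://www.carrefour.tn/pub/sitemaps/EntretienMaison_products_1.xml",
        "lastmod": "2023-03-01T05:11:05+00:00",  # made-up older date for illustration
        "category": "EntretienMaison",
    },
]

# Invented date of the previous harvest (timezone-aware, like the sitemap values)
last_harvest = datetime(2023, 3, 10, tzinfo=timezone.utc)
to_refresh = changed_since(sample_links, last_harvest)
print([sm["category"] for sm in to_refresh])
```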

We select a random sample of categories to get product links for scraping

In [132]:
import copy

category_maps = []

for cat in random.sample(category_map_links, 3):
    time.sleep(1)
    sitemap = safe_get(s, robots_carrefour, cat.get("link"))
    sitemap = fromstring(sitemap.text)
    new_cat = copy.deepcopy(cat)  # we make a copy of the dictionary for our modification
    new_cat["products"] = []
    for elem in sitemap.findall("{http://www.sitemaps.org/schemas/sitemap/0.9}url"):
        new_cat["products"].append(elem.find("{http://www.sitemaps.org/schemas/sitemap/0.9}loc").text)
    category_maps.append(new_cat)
    # We may track the last modification date also for each individual product.
    # Not implemented in the context of this course, but could be a good practice point.

In [152]:
category_maps

[{'link': 'https://www.carrefour.tn/pub/sitemaps/Sprays_products_1.xml',
  'lastmod': '2023-03-13T05:11:05+00:00',
  'category': 'Sprays',
  'products': ['https://www.carrefour.tn/default/deodorant-parfumant-instinct-barbare-200ml-6192013103510.html',
   'https://www.carrefour.tn/default/deodorant-parfumant-le-gladiateur-200ml-6192013103527.html',
   'https://www.carrefour.tn/default/deodorant-classique-golf-seduction-200ml-6192013103503.html',
   'https://www.carrefour.tn/default/deodorant-classique-golf-extreme-200ml-6192013103497.html',
   'https://www.carrefour.tn/default/deodorant-blue-man-200ml-6192010056468.html',
   'https://www.carrefour.tn/default/deodorant-green-man-200ml-6192010056475.html',
   'https://www.carrefour.tn/default/deodorant-red-man-200ml-6192010056482.html',
   'https://www.carrefour.tn/default/deodorant-white-man-200ml-6192010056499.html',
   'https://www.carrefour.tn/default/deodorant-invisible-fresh-null-4005900371997.html',
   'https://www.carrefour.tn/def

### Data class

We reuse the basic Product class and customize it for Carrefour

In [134]:
class Carrefour(Product):
    source: str = "Carrefour"
    currency: str = "TND"

### Scraping individual pages

In [135]:
product = safe_get(s, robots_carrefour, category_maps[0]["products"][0])
product = BeautifulSoup(product.text, 'html.parser')

In [136]:
category_maps[0]["products"][0]

'https://www.carrefour.tn/default/deodorant-parfumant-instinct-barbare-200ml-6192013103510.html'

In [139]:
# Name
product.find("h1", {"class": "page-title"}).get_text().strip()

'Déodorant parfumant Instinct Barbare'

In [140]:
# Brand
product.find("a", {"class": "cr-brand-name"}).get_text()

"C'COOL"

In [141]:
# price
product.find("meta", {"itemprop": "price"}).get("content")  # Our data model will make it a float for us

'6.29'

Regular expressions can be tested interactively on Regex101: [https://regex101.com/](https://regex101.com/)

In [142]:
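# Parsing the GTIN code (13 digits) from the product link
import re

# Raw string, so that "\d" is not treated as an invalid string escape
re.findall(r"\d{13}", category_maps[0]["products"][0])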

['6192013103510']

In [143]:
re.findall(r"\d{13}", "https://www.carrefour.tn/default/----------------------------------------.html")
# Not all products have the GTIN in the link

[]
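
Any run of 13 digits in a link matches the pattern above, whether or not it is a real GTIN. As an optional refinement, the sketch below also verifies the GTIN-13 check digit (weights 1 and 3 alternate over the first 12 digits). The helper names `gtin13_is_valid` and `extract_gtin` are hypothetical, and whether every code in Carrefour links passes this check is an assumption that should be tested on real data before rejecting anything.

```python
import re
from typing import Optional

# Exactly 13 digits, not embedded in a longer run of digits
GTIN13_PATTERN = re.compile(r"(?<!\d)\d{13}(?!\d)")


def gtin13_is_valid(code: str) -> bool:
    """Check the GTIN-13 check digit (weights 1 and 3 alternate from the left)."""
    if re.fullmatch(r"\d{13}", code) is None:
        return False
    digits = [int(c) for c in code]
    total = sum(d * (3 if i % 2 else 1) for i, d in enumerate(digits[:12]))
    return (10 - total % 10) % 10 == digits[12]


def extract_gtin(link: str) -> Optional[str]:
    """Return the first 13-digit code in the link with a valid check digit, or None."""
    for candidate in GTIN13_PATTERN.findall(link):
        if gtin13_is_valid(candidate):
            return candidate
    return None


with_gtin = extract_gtin(
    "https://www.carrefour.tn/default/deodorant-parfumant-instinct-barbare-200ml-6192013103510.html"
)
without_gtin = extract_gtin(
    "https://www.carrefour.tn/default/----------------------------------------.html"
)
print(with_gtin)     # 6192013103510
print(without_gtin)  # None
```

Returning `None` instead of an empty list fits the optional `uid` field of the data model directly.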

Another example, this time a product with a promotional price

In [144]:
product = safe_get(s, robots_carrefour, "https://www.carrefour.tn/default/ananas-2056560001443.html")
product = BeautifulSoup(product.text, 'html.parser')

In [145]:
# Name
product.find("h1", {"class": "page-title"}).get_text().strip()

'Ananas'

In [146]:
# Brand
product.find("a", {"class": "cr-brand-name"})  # Some products do not have a brand

In [147]:
# Regular price (before the promotion)
product.find("span", {"data-price-type": "oldPrice"}).get("data-price-amount")  # Our data model will make it a float for us

'13.8'

In [151]:
product.find("div", {"class": "product-info-stock-sku"}).get_text().strip()

'En stock'

We organize all the parsing in a function:

In [153]:
def scrape_individual_carrefour(category: dict, Item: BaseModel, s: requests.Session, robots: RobotFileParser, delay: float = 1) -> list:
    """Scrape all products of a category listed in the Carrefour sitemap.
    Parameters:
        category (dict): information on category, including category name and product links
        Item (BaseModel): class of the data object for the specific source
        s (requests.Session): Requests session with User-Agent properly set
        robots (RobotFileParser): Initialized robots.txt parsed object for the specific website
        delay (float): delay in seconds between calls to prevent overloading the source

    Returns:
        list of products with all information
    """
    results = []
    for link in category.get("products", []):
        time.sleep(delay)
        product = safe_get(s, robots, link)
        if product is None:
            continue
        product = BeautifulSoup(product.text, 'html.parser')
        parsed_product = {}
        parsed_product["name"] = product.find("h1", {"class": "page-title"}).get_text().strip()
        parsed_product["link"] = link
        parsed_product["category"] = category.get("category")
        if product.find("a", {"class": "cr-brand-name"}) is not None:
            parsed_product["brand"] = product.find("a", {"class": "cr-brand-name"}).get_text()
        try:
            parsed_product["price"] = product.find("meta", {"itemprop": "price"}).get("content")
        except AttributeError as e:
            # printing during debug is fine...
            # but those lines need to go away in production
            print(link)
            print(e) 
            continue
        if product.find("span", {"data-price-type": "oldPrice"}) is not None:
            parsed_product["regular_price"] = product.find("span", {"data-price-type": "oldPrice"}).get("data-price-amount")
        gtin_match = re.search(r"\d{13}", link)  # first 13-digit run in the link, if any
        if gtin_match is not None:
            parsed_product["uid"] = gtin_match.group(0)
        if product.find("div", {"class": "product-info-stock-sku"}) is not None:
            parsed_product["in_stock"] = product.find("div", {"class": "product-info-stock-sku"}).get_text().strip()
        results.append(Item(**parsed_product))
    
    return results
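
The comments in the function note that the `print` calls must go away in production. One possible replacement, sketched here with the standard-library `logging` module, logs the failing link and collects it for later inspection. The logger name, the helper `read_price` and the `failures` list are hypothetical and not part of the function above.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("carrefour_scraper")  # hypothetical logger name


def read_price(find_price, link: str, failures: list):
    """Run a price lookup; on AttributeError, log the link and remember it."""
    try:
        return find_price()
    except AttributeError as exc:
        logger.warning("No price found for %s (%s)", link, exc)
        failures.append(link)
        return None


def missing_tag_lookup():
    tag = None  # what BeautifulSoup's find() returns when nothing matches
    return tag.get("content")


failures = []
dead_link = "https://example.com/dead-link.html"  # placeholder URL
missing_price = read_price(missing_tag_lookup, dead_link, failures)
found_price = read_price(lambda: "6.29", "https://example.com/ok.html", failures)
print(missing_price, found_price, failures)
```

The collected `failures` can be written to a file at the end of the run, which makes it easier to review the dead links than scrolling through printed output.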

In [154]:
carrefour_data = []

for cat in category_maps:
    carrefour_data.extend(scrape_individual_carrefour(cat, Carrefour, s, robots_carrefour, 1))
# There are several dead links coming from the sitemap...

carrefour_catalog_df = pd.DataFrame([prod.dict(exclude_none=True) for prod in carrefour_data])
carrefour_catalog_df.to_csv(f"carrefour_catalog_{date.today().strftime('%Y-%m-%d')}.csv", index=False)

### Scrape from catalogue pages

In [155]:
sitemap = safe_get(s, robots_carrefour, "https://www.carrefour.tn/pub/sitemaps/categories_1.xml")
print(sitemap.text)

sitemap = fromstring(sitemap.text)
print(sitemap.tag)

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url><loc>https://www.carrefour.tn/default/entretien-de-la-maison.html</loc><lastmod>2023-03-03T08:08:58+00:00</lastmod><changefreq>daily</changefreq><priority>0.5</priority></url>
<url><loc>https://www.carrefour.tn/default/entretien-de-la-maison/produits-menagers.html</loc><lastmod>2022-02-22T09:21:12+00:00</lastmod><changefreq>daily</changefreq><priority>0.5</priority></url>
<url><loc>https://www.carrefour.tn/default/entretien-de-la-maison/produits-menagers/savons-multi-usages.html</loc><lastmod>2022-12-16T15:01:28+00:00</lastmod><changefreq>daily</changefreq><priority>0.5</priority></url>
<url><loc>https://www.carrefour.tn/default/entretien-de-la-maison/produits-menagers/vitres-et-meubles.html</loc><lastmod>2022-02-22T09:21:14+00:00</lastmod><changefreq>daily</changefreq><priority>0.5</priority></url>
<url><loc>https://www.carrefour.tn/default/entretien-de-la-maison/insecticides.html<

In [158]:
category_map_links = []
for elem in sitemap.findall("{http://www.sitemaps.org/schemas/sitemap/0.9}url"):
    new_cat = {}
    new_cat["link"] = elem.find("{http://www.sitemaps.org/schemas/sitemap/0.9}loc").text
    new_cat["lastmod"] = elem.find("{http://www.sitemaps.org/schemas/sitemap/0.9}lastmod").text
    new_cat["category"] = elem.find("{http://www.sitemaps.org/schemas/sitemap/0.9}loc").text.split("/")[-1].split(".")[0]
    category_map_links.append(new_cat)
    # The category extraction (last segment of the URL) is quite fragile... it will need improvement for production

In [159]:
category_map_links

[{'link': 'https://www.carrefour.tn/default/entretien-de-la-maison.html',
  'lastmod': '2023-03-03T08:08:58+00:00',
  'category': 'entretien-de-la-maison'},
 {'link': 'https://www.carrefour.tn/default/entretien-de-la-maison/produits-menagers.html',
  'lastmod': '2022-02-22T09:21:12+00:00',
  'category': 'produits-menagers'},
 {'link': 'https://www.carrefour.tn/default/entretien-de-la-maison/produits-menagers/savons-multi-usages.html',
  'lastmod': '2022-12-16T15:01:28+00:00',
  'category': 'savons-multi-usages'},
 {'link': 'https://www.carrefour.tn/default/entretien-de-la-maison/produits-menagers/vitres-et-meubles.html',
  'lastmod': '2022-02-22T09:21:14+00:00',
  'category': 'vitres-et-meubles'},
 {'link': 'https://www.carrefour.tn/default/entretien-de-la-maison/insecticides.html',
  'lastmod': '2021-08-26T13:12:58+00:00',
  'category': 'insecticides'},
 {'link': 'https://www.carrefour.tn/default/entretien-de-la-maison/lessives-et-soin-du-linge.html',
  'lastmod': '2022-08-17T10:51:59
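
Taking the last URL segment as the category discards the hierarchy visible in the links above. A possible improvement, sketched below with the standard library only, splits the URL path into the `category`, `subcategory` and `subsubcategory` fields of the data model. Unlike the cell above, `category` here is the top level rather than the last segment. The function name `category_levels` is hypothetical, and treating `default` as a prefix to drop is an assumption based on the links shown.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def category_levels(link: str, prefix: str = "default") -> dict:
    """Split a category URL path into category / subcategory / subsubcategory."""
    path = PurePosixPath(urlparse(link).path)
    if not path.name:
        return {}  # no path segments, e.g. the site root
    # Drop the file extension and the leading "/" part
    parts = [p for p in path.with_suffix("").parts if p != "/"]
    if parts and parts[0] == prefix:
        parts = parts[1:]  # drop the assumed store prefix
    keys = ("category", "subcategory", "subsubcategory")
    return dict(zip(keys, parts))  # levels deeper than three are ignored


top = category_levels("https://www.carrefour.tn/default/entretien-de-la-maison.html")
deep = category_levels(
    "https://www.carrefour.tn/default/entretien-de-la-maison/produits-menagers/savons-multi-usages.html"
)
print(top)
print(deep)
```

The returned dictionary can be merged into `new_cat` with `new_cat.update(...)`, so the levels reach the data model without further parsing.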

In [163]:
cat_page = safe_get(s, robots_carrefour, "https://www.carrefour.tn/default/entretien-de-la-maison.html")

In [164]:
cat_page = BeautifulSoup(cat_page.text, "html.parser")

In [165]:
cat_page.find("div", {"class":"ais-Hits"})

In [166]:
cat_page

<!DOCTYPE html>

<html lang="fr">
<head>
<script>
    var BASE_URL = 'https://www.carrefour.tn/default/';
    var require = {
        "baseUrl": "https://static.carrefour.tn/version1678213609/frontend/Apeiron/Carrefour/fr_FR"
    };
</script>
<meta charset="utf-8"/>
<meta content="Entretien de la Maison pas cher à prix Carrefour en Collect, Drive ou Livraison" name="title"/>
<meta content="Découvrez dans notre rayon Entretien de la Maison toutes nos offres et promotions ! Livraison rapide à domicile, en point relais ou en magasin" name="description"/>
<meta content="INDEX,FOLLOW" name="robots"/>
<meta content="width=device-width, initial-scale=1, maximum-scale=1" name="viewport"/>
<meta content="telephone=no" name="format-detection"/>
<title>Entretien de la Maison pas cher à prix Carrefour en Collect, Drive ou Livraison</title>
<link href="https://static.carrefour.tn/version1678213609/_cache/merged/24f85860049733f9faf86305e888368c.min.css" media="all" rel="stylesheet" type="text/css"/>

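The product container (`ais-Hits`) is missing from the HTML we received: the catalogue page is built via JavaScript in the browser, so it cannot be scraped using Requests alone.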
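
Before writing a parser for a page, it can be worth checking whether the container of interest exists at all in the HTML returned by Requests. The sketch below does this with the standard-library `html.parser`. The class `ContainerProbe`, the function `is_in_static_html` and both sample pages are hypothetical. With BeautifulSoup the equivalent test is simply whether `find(...)` returns `None`, as in the cell above.

```python
from html.parser import HTMLParser


class ContainerProbe(HTMLParser):
    """Look for any element carrying a given CSS class in static HTML."""

    def __init__(self, css_class: str):
        super().__init__()
        self.css_class = css_class
        self.found = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        classes = (dict(attrs).get("class") or "").split()
        if self.css_class in classes:
            self.found = True


def is_in_static_html(html: str, css_class: str) -> bool:
    """Return True when an element with css_class is present in the raw HTML."""
    probe = ContainerProbe(css_class)
    probe.feed(html)
    return probe.found


# Hypothetical pages: one as served to Requests, one as seen after JavaScript ran
static_page = '<html><body><div id="app"></div><script src="app.js"></script></body></html>'
rendered_page = '<html><body><div class="ais-Hits"><ol></ol></div></body></html>'

print(is_in_static_html(static_page, "ais-Hits"))
print(is_in_static_html(rendered_page, "ais-Hits"))
```

When the check fails for a page that shows products in the browser, the content is most likely rendered client-side, and either the product pages from the sitemap (as done above) or a browser-automation tool is needed.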