In [1]:

!pip install playwright nest_asyncio
!playwright install chromium
!apt-get install -y libatk1.0-0 libatk-bridge2.0-0 libatspi2.0-0 libxcomposite1


import asyncio, json, csv
from pathlib import Path
import nest_asyncio
nest_asyncio.apply()
from playwright.async_api import async_playwright



#  3. BASE URL
BASE_URL = "https://webscraper.io/test-sites/e-commerce/static/computers/laptops"

# 4. MAIN SCRAPING FUNCTION
async def scrape_ajax_site():

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)   # Launch browser
        ctx = await browser.new_context()                  # New browser session
        page = await ctx.new_page()                        # New tab

        rows = []                                         # Stores ALL products
        page_no = 1                                       #  Start from page 1

        # 5. PAGE NUMBER LOOP (page 1 → last page; stops when no products are found)
        while True:
            url = f"{BASE_URL}?page={page_no}"             #  Build page URL
            print(f"Scraping Page {page_no} → {url}")
            await page.goto(url, timeout=60000)            #  Open that page
            try:
                # Wait for product cards
                await page.wait_for_selector(".thumbnail", timeout=10000)
            except Exception:
                # Any failure here (usually a timeout past the last page) ends the loop
                print(" No more pages left. Stopping.")
                break                                     #  Stop when no products found
            cards = await page.query_selector_all(".thumbnail")

            if not cards:                                 # Safety stop
                print(" Last page reached.")
                break
            #  Extract products from the CURRENT page
            for card in cards:

                title_el = await card.query_selector(".title")
                title = (await title_el.text_content()).strip() if title_el else None

                url = await title_el.get_attribute("href") if title_el else None

                price_el = await card.query_selector(".price")
                price = (await price_el.text_content()).strip() if price_el else None

                stars = await card.query_selector_all(".ratings .glyphicon-star")
                rating = len(stars) if stars else 0

                img_el = await card.query_selector("img")
                img_src = await img_el.get_attribute("src") if img_el else None

                rows.append({
                    "title": title,
                    "price": price,
                    "rating_stars": rating,
                    "product_url": url,
                    "image_url": img_src,
                    "page_no": page_no
                })

            page_no += 1                                   # Go to next page number

        await browser.close()
        return rows

#  6. RUN SCRAPER

data = asyncio.get_event_loop().run_until_complete(scrape_ajax_site())
print(f" Collected {len(data)} total products")


# 7. SAVE OUTPUT FILES

Path("ioutput").mkdir(exist_ok=True)                       #  Create output folder

csv_path = Path("ioutput/products_all_ajax.csv")
json_path = Path("ioutput/products_all_ajax.json")

# Save CSV (rows are written only if something was scraped; data[0] fails on an empty list)
with open(csv_path, "w", newline="", encoding="utf-8") as f:
    if data:
        writer = csv.DictWriter(f, fieldnames=list(data[0].keys()))
        writer.writeheader()
        writer.writerows(data)

# Save JSON
with open(json_path, "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False, indent=2)

print(f"Saved CSV → {csv_path}")
print(f"Saved JSON → {json_path}")


Collecting playwright
  Downloading playwright-1.57.0-py3-none-manylinux1_x86_64.whl.metadata (3.5 kB)
Collecting pyee<14,>=13 (from playwright)
  Downloading pyee-13.0.0-py3-none-any.whl.metadata (2.9 kB)
Downloading playwright-1.57.0-py3-none-manylinux1_x86_64.whl (46.0 MB)
Downloading pyee-13.0.0-py3-none-any.whl (15 kB)
Installing collected packages: pyee, playwright
Successfully installed playwright-1.57.0 pyee-13.0.0
Downloading Chromium 143.0.7499.4 (playwright build v1200) from https://cdn.playwright.dev/dbazure/download/playwright/builds/chromium/1200/chromium-linux.zip

📌 Observations – Page-Based Pagination Web Scraping

🔹 1. Pagination Handling

The scraper uses URL-based pagination by incrementing the page query parameter (?page=1, ?page=2, …).

This works because the target website serves static, server-side paginated pages rather than relying on infinite scroll or AJAX-only loading.

Scraping stops automatically when:

No product cards (.thumbnail) are found, or

The selector wait times out.

Observation:
This method is reliable and deterministic because each page has a fixed URL and complete HTML content.
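
The URL construction described above can be sketched in isolation. This is a minimal illustration, not the notebook's code: the helper name build_page_url is hypothetical, and it uses urllib.parse instead of the f-string in the scraper so that any query parameters already present in the base URL are preserved.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

BASE_URL = "https://webscraper.io/test-sites/e-commerce/static/computers/laptops"


def build_page_url(base_url: str, page_no: int) -> str:
    """Return base_url with its `page` query parameter set to page_no.

    Hypothetical helper for illustration; the notebook builds the URL
    with an f-string instead.
    """
    if page_no < 1:
        raise ValueError("page_no must be >= 1")
    parts = urlsplit(base_url)
    # Keep existing query parameters, but overwrite (or add) `page`.
    query = dict(parse_qsl(parts.query))
    query["page"] = str(page_no)
    return urlunsplit(parts._replace(query=urlencode(query)))


# URLs for the first three listing pages.
first_three = [build_page_url(BASE_URL, n) for n in range(1, 4)]
for url in first_three:
    print(url)
```

Because every page has its own URL, a failed page can be retried or re-scraped on its own without replaying earlier pages.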


🔹 2. Scraping Strategy

The script opens one browser session and one page (tab).

The same page object is reused for all page numbers.

For each page:

All product cards are selected.

Product details are extracted from the listing page itself, without visiting individual product pages.

Observation:
This approach is efficient and avoids unnecessary navigation, reducing scraping time and browser overhead.

🔹 3. Extracted Data Fields

For every product, the following information is collected:

Title – Product name

Price – Displayed product price

Rating (Stars) – Derived by counting star icons

Product URL – Link to product detail page

Image URL – Product image source

Page Number – Page from which the product was scraped

Observation:
The dataset is well-structured and suitable for further analysis such as price comparison, rating analysis, or visualization.
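
As one example of the follow-up analysis mentioned above, the sketch below converts the scraped price strings to numbers and averages them per star rating. The sample rows are made up, and the assumed price format (a leading dollar sign with optional thousands separators) is an assumption about the site rather than something the notebook verifies.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows in the same shape the scraper produces (values are made up).
sample_rows = [
    {"title": "Laptop A", "price": "$295.99", "rating_stars": 3, "page_no": 1},
    {"title": "Laptop B", "price": "$1,101.83", "rating_stars": 3, "page_no": 1},
    {"title": "Laptop C", "price": None, "rating_stars": 4, "page_no": 2},
]


def parse_price(text):
    """Convert a string such as '$1,101.83' to a float; return None if unusable."""
    if not text:
        return None
    cleaned = text.replace("$", "").replace(",", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        return None


def mean_price_by_rating(rows):
    """Group parsed prices by star rating and average each group."""
    groups = defaultdict(list)
    for row in rows:
        price = parse_price(row["price"])
        if price is not None:  # skip rows whose price could not be parsed
            groups[row["rating_stars"]].append(price)
    return {rating: mean(prices) for rating, prices in groups.items()}


print(mean_price_by_rating(sample_rows))
```

Keeping the raw price string in the scraped rows and converting at analysis time means a format change on the site cannot silently corrupt the stored data.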

🔹 4. Rating Extraction Logic

Product ratings are calculated by counting the number of .glyphicon-star elements inside each card's .ratings block.

This avoids parsing text values and always yields an integer rating (0 when no star icons are found).

Observation:
Counting DOM elements is less error-prone than parsing text, as long as the site keeps rendering one .glyphicon-star element per star.
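
The counting idea can be tried offline on a small HTML snippet. The sketch below is an illustration only: it swaps Playwright's query_selector_all for the standard-library HTMLParser, the sample markup is a hypothetical card rather than HTML captured from the site, and for simplicity it does not check that the stars sit inside a .ratings element.

```python
from html.parser import HTMLParser


class StarCounter(HTMLParser):
    """Count start tags whose class list contains the token `glyphicon-star`."""

    def __init__(self):
        super().__init__()
        self.stars = 0

    def handle_starttag(self, tag, attrs):
        # Split the class attribute into tokens so that a class such as
        # `glyphicon-star-empty` is not mistaken for a filled star.
        classes = (dict(attrs).get("class") or "").split()
        if "glyphicon-star" in classes:
            self.stars += 1


def count_stars(card_html: str) -> int:
    """Return the number of filled-star elements found in card_html."""
    parser = StarCounter()
    parser.feed(card_html)
    return parser.stars


# Hypothetical product-card fragment with three filled stars.
SAMPLE_CARD = """
<div class="ratings">
  <p>
    <span class="glyphicon glyphicon-star"></span>
    <span class="glyphicon glyphicon-star"></span>
    <span class="glyphicon glyphicon-star"></span>
  </p>
</div>
"""

print(count_stars(SAMPLE_CARD))
```

Matching whole class tokens mirrors how the CSS selector .glyphicon-star behaves, which is why the count is not inflated by similarly named classes.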

🔹 5. Loop Termination Condition

The scraper safely exits when:

The selector .thumbnail is not found, or

An empty product list is returned.

Observation:
This prevents infinite loops and ensures graceful termination when the last page is reached.
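
The termination logic can be separated from the browser and exercised with a stub. In this sketch, collect_all_pages, fake_fetch and the max_pages cap are all hypothetical names added for illustration; the cap is an extra safeguard that the notebook's while True loop does not have.

```python
from typing import Callable, Dict, List


def collect_all_pages(
    fetch_page: Callable[[int], List[Dict]],
    max_pages: int = 100,
) -> List[Dict]:
    """Call fetch_page(1), fetch_page(2), ... until a page returns no items.

    max_pages is a hard upper bound, so the loop still ends if the site
    keeps returning products for every page number.
    """
    rows: List[Dict] = []
    for page_no in range(1, max_pages + 1):
        items = fetch_page(page_no)
        if not items:  # empty page: the last page has been passed
            break
        rows.extend(items)
    return rows


def fake_fetch(page_no: int) -> List[Dict]:
    """Stand-in for the real scraper: a made-up site with 3 pages of 2 items."""
    if page_no > 3:
        return []
    return [{"title": f"item-{page_no}-{i}", "page_no": page_no} for i in range(2)]


print(len(collect_all_pages(fake_fetch)))
```

With the real scraper, fetch_page would wrap the goto, wait_for_selector and extraction steps and return an empty list when the wait times out.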

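🔹 6. Performance Considerations

Page-based navigation is typically faster than AJAX “Load More” approaches, because each page load returns the complete product HTML with no clicks or waits for injected content.

No JavaScript event handling or network-idle polling is required.

Observation:
Run time grows roughly linearly with the number of pages, since pages are fetched one after another. This method scales predictably to large datasets and is a sound basis for production-grade scraping.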

🔹 7. Output Generation

Data is saved in both CSV and JSON formats.

The output directory is created with pathlib's Path.mkdir(exist_ok=True), so rerunning the cell does not fail if the folder already exists.

Observation:
Providing both formats increases usability for data analysis, machine learning, and reporting.
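
The saving step can be wrapped in a small function that also copes with an empty result. This is a sketch under assumptions: save_rows and its stem parameter are hypothetical names, and the empty-list handling (an empty CSV file, a JSON file containing []) is one possible choice rather than the notebook's behavior.

```python
import csv
import json
from pathlib import Path


def save_rows(rows, out_dir, stem="products"):
    """Write rows (a list of dicts) to <out_dir>/<stem>.csv and <stem>.json."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)  # no error if the folder exists
    csv_path = out / f"{stem}.csv"
    json_path = out / f"{stem}.json"

    # Column names come from the first row; an empty list yields an empty file.
    fieldnames = list(rows[0].keys()) if rows else []
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        if fieldnames:
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(rows)

    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(rows, f, ensure_ascii=False, indent=2)

    return csv_path, json_path
```

A call such as save_rows(data, "ioutput", stem="products_all_ajax") would reproduce the file names used in the notebook.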

🔹 8. Reliability & Maintainability

The scraper logic is simple and readable, with the whole scrape contained in a single async function.

It has fewer failure points than dynamic/AJAX scraping.

Observation:
This approach is easy to maintain and suitable for academic projects, internships, and interviews.

✅ Final Conclusion

This scraper demonstrates page-based pagination scraping, collecting structured product data from a static e-commerce website. The approach is reliable and fast, scales to larger page counts, and is preferable to dynamic “Load More” scraping when URL-based pagination is available.