# Kenya eCitizen Services Dataset - Scraping Exploration Notebook

- This notebook documents the exploratory work behind the scraper for the Kenya eCitizen Services Dataset. It contains tests and early code snippets written while the scraper was being developed.
- It is a record of that exploration, not a final or polished script. Expect incomplete code, tests, and notes used to understand the structure of the eCitizen platform and how to extract data from it.
- All scraped pages are saved as HTML files in the `pages/` subdirectory next to this notebook. The snippets here explore how to retrieve and parse those pages.

In [5]:
from playwright.async_api import async_playwright

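# Constants
# Adjacent string literals are joined with no separator,
# so each piece carries its own trailing space.
user_agent = (
	'kenya-ecitizen-scraper/1.0 '
	'(public research dataset; '
	'contact: alvinnjiiri@gmail.com)'
)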

## Checking the Structure of the `Agencies` Page

- The `Agencies` page on the eCitizen platform lists government agencies along with details for each.
- The agency data itself is loaded via JavaScript, so a plain HTTP request does not return it.
- To handle this, we use Playwright to render the page and extract the data once the JavaScript has run.
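One way to check that the agency links are absent from the static response is to fetch the page without a browser and look for a known link. The following is a minimal sketch using only the standard library. `fetch_static_html` and `contains_link` are hypothetical helper names introduced for illustration, not part of the scraper.

```python
from urllib.request import Request, urlopen


def fetch_static_html(url: str, user_agent: str, timeout: float = 30.0) -> str:
    """Fetch a page with a plain HTTP GET; no JavaScript is executed."""
    request = Request(url, headers={"User-Agent": user_agent})
    with urlopen(request, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def contains_link(html: str, href: str) -> bool:
    """Return True if the raw HTML mentions the given href anywhere."""
    return href in html
```

If the grid is rendered client-side as described, `contains_link(fetch_static_html("https://accounts.ecitizen.go.ke/en/agencies", "<your user agent>"), "https://adc.go.ke/shop")` would be expected to return `False`, whereas the same check on the Playwright-rendered HTML would return `True`.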

In [7]:
URL = "https://accounts.ecitizen.go.ke/en/agencies"
# A known agency link; its presence in the grid signals that the content has loaded.
SENTINEL = "https://adc.go.ke/shop"
OUT = "pages/agencies_grid.html"

async def dump_grid() -> None:
	async with async_playwright() as p:
		browser = await p.chromium.launch(
			headless=True
		)
		context = await browser.new_context(
			user_agent=user_agent,
		)
		page = await context.new_page()

		await page.goto(
			URL,
			wait_until="domcontentloaded",
		)

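		# Block until the sentinel link appears inside the grid,
		# i.e. until JavaScript has populated the agencies (30 s limit).
		await page.wait_for_selector(
			f"div.grid a[href='{SENTINEL}']",
			timeout=30_000,
		)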

		grid = page.locator("div.grid").first
		html = await grid.evaluate(
			"el => el.outerHTML"
		)

		with open(OUT, "w", encoding="utf-8") as f:
			f.write(html)

		await browser.close()

await dump_grid()
print(f"Wrote: {OUT}")

Wrote: pages/agencies_grid.html
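
As a possible next step, the saved grid can be parsed offline to list the links it contains. This is a minimal sketch using the standard-library `html.parser`. It assumes each agency appears as an `<a>` tag inside the grid, as the sentinel selector suggests. `LinkCollector` and `extract_links` are hypothetical names introduced here, not part of the scraper.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> tag that has an href."""

    def __init__(self) -> None:
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> currently being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that sits inside an <a> with an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace in the link text.
            text = " ".join("".join(self._text).split())
            self.links.append((self._href, text))
            self._href = None


def extract_links(html: str) -> list[tuple[str, str]]:
    """Return all (href, text) pairs found in an HTML fragment."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links
```

For example, reading `pages/agencies_grid.html` with `open(..., encoding="utf-8")` and passing its contents to `extract_links` should yield one pair per anchor in the grid, under the assumption above about the markup.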
