# Kenya eCitizen Services Dataset - Scraping Exploration Notebook

- This notebook records the exploration of the scraping process used to create the Kenya eCitizen Services Dataset. It contains tests and initial code snippets written during the development of the scraper.
- This file should not be considered a final or polished script, but rather a record of the exploratory work that went into building the scraper. It may contain incomplete code, tests, and notes that were used to understand the structure of the eCitizen platform and how to extract relevant data effectively.
- All scraped pages are saved as HTML files in the `pages/` subdirectory relative to this notebook, and the code snippets here are part of the process of understanding how to retrieve and parse those pages.

In [2]:
from playwright.async_api import async_playwright

# Constants
user_agent = (
	'kenya-ecitizen-scraper/1.0 '
	'(public research dataset; '
	'contact: alvinnjiiri@gmail.com)'
)

## Checking the Structure of the `Agencies` Page

- The `Agencies` page on the eCitizen platform lists various government agencies with accompanying details.
- The agency information itself is loaded via JavaScript, so a plain HTTP request will not retrieve it.
- To handle this, we will use Playwright to render the page and extract the relevant information after the JavaScript has executed.
- Note that we use one arbitrarily chosen link from the grid as a sentinel: once that link appears inside the grid, we treat the grid as loaded. This is a common technique for dynamic content, although the wait will time out if that particular link is ever removed from the page.

In [None]:
URL = 'https://accounts.ecitizen.go.ke/en/agencies'
SENTINEL = 'https://adc.go.ke/shop'
OUT = 'pages/agencies_grid.html'


async def dump_grid() -> None:
	async with async_playwright() as p:
		browser = await p.chromium.launch(headless=True)
		context = await browser.new_context(
			user_agent=user_agent,
		)
		page = await context.new_page()

		await page.goto(
			URL,
			wait_until='domcontentloaded',
		)

		# Wait for the sentinel link to appear,
		# indicating that the grid has loaded
		await page.wait_for_selector(
			f"div.grid a[href='{SENTINEL}']",
			timeout=30_000,
		)

		grid = page.locator('div.grid').first
		html = await grid.evaluate('el => el.outerHTML')

		with open(OUT, 'w', encoding='utf-8') as f:
			f.write(html)

		await browser.close()


await dump_grid()
print(f'Wrote: {OUT}')
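
Once the grid has been saved, its structure can be inspected offline. The following is a minimal sketch, not part of the original scraper, that lists every link in a saved HTML fragment using only the standard library. `LinkCollector`, `extract_links`, and `SAMPLE_GRID` are hypothetical names, and the sample markup is an assumption about what the grid might look like; only the sentinel URL comes from the cell above.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> element that has an href."""

    def __init__(self) -> None:
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._href = dict(attrs).get('href')
            self._text = []

    def handle_data(self, data):
        # Only keep text that sits inside an open <a href=...> element
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            # Collapse runs of whitespace in the link text
            text = ' '.join(''.join(self._text).split())
            self.links.append((self._href, text))
            self._href = None


def extract_links(html: str) -> list:
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical sample; the real grid markup may differ
SAMPLE_GRID = (
    '<div class="grid">'
    '<a href="https://adc.go.ke/shop"> Hypothetical  agency </a>'
    '<a href="https://example.org/other">Another hypothetical agency</a>'
    '</div>'
)

for href, text in extract_links(SAMPLE_GRID):
    print(href, '|', text)
```

To run it against the real output, pass the contents of `pages/agencies_grid.html` to `extract_links` instead of the sample string.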

## Checking the Structure of the `FAQ` Page

- Similar to the `Agencies` page, the `FAQ` page also loads its content dynamically using JavaScript.
- We will use Playwright to render the page and extract the FAQ entries after the JavaScript has executed.
- In this instance, we will wait for the presence of the FAQ list items to ensure that the page has fully loaded before we attempt to extract the data.


In [6]:
URL = 'https://accounts.ecitizen.go.ke/en/help-and-support'
OUT = 'pages/faq.html'


async def dump_faq() -> None:
	async with async_playwright() as p:
		browser = await p.chromium.launch(headless=True)
		context = await browser.new_context(
			user_agent=user_agent,
		)
		page = await context.new_page()

		await page.goto(
			URL,
			wait_until='domcontentloaded',
		)

		# Wait for the FAQ entries to load by checking
		# for the presence of list items within the FAQ
		# section
		await page.wait_for_function(
			"""
			() => {
				const ul = document.querySelector("div#FAQs ul");
				return ul && ul.children.length > 0;
			}
            """,
			timeout=30_000,
		)

		faq = page.locator('div#FAQs').first
		html = await faq.evaluate('el => el.outerHTML')

		with open(OUT, 'w', encoding='utf-8') as f:
			f.write(html)

		await browser.close()


await dump_faq()
print(f'Wrote: {OUT}')

Wrote: pages/faq.html
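
As a sanity check on the saved file, we can repeat the wait condition offline by counting the direct children of the first `ul` in the fragment. This is a rough sketch under the assumption that the markup is well formed, which should hold for `outerHTML` serialised by the browser. `UlChildCounter`, `count_ul_children`, and `SAMPLE_FAQ` are hypothetical names, and the sample markup is made up for illustration.

```python
from html.parser import HTMLParser

# Elements that never have an end tag, so they must not change the depth
VOID_TAGS = {
    'area', 'base', 'br', 'col', 'embed', 'hr', 'img',
    'input', 'link', 'meta', 'source', 'track', 'wbr',
}


class UlChildCounter(HTMLParser):
    """Count the direct element children of the first <ul> encountered."""

    def __init__(self) -> None:
        super().__init__()
        self.depth = None  # None until the first <ul> has been opened
        self.children = 0
        self.done = False

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth is None:
            if tag == 'ul':
                self.depth = 0
            return
        if self.depth == 0:
            self.children += 1
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if self.done or self.depth is None or tag in VOID_TAGS:
            return
        if self.depth == 0:
            self.done = True  # this is the closing tag of the first <ul>
        else:
            self.depth -= 1


def count_ul_children(html: str) -> int:
    parser = UlChildCounter()
    parser.feed(html)
    parser.close()
    return parser.children


# Hypothetical sample; the real FAQ markup may differ
SAMPLE_FAQ = (
    '<div id="FAQs"><ul>'
    '<li><h3>Q1</h3><p>A1<br>more</p></li>'
    '<li><h3>Q2</h3><ul><li>nested</li></ul></li>'
    '</ul></div>'
)

print(count_ul_children(SAMPLE_FAQ))
```

A count of zero for `pages/faq.html` would suggest that the fragment was captured before the entries had rendered.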


## Checking `Ministries` Page Source Links from `National` Navigation Menu


### Context and Motivation

- While inspecting the `Ministries` pages, we observed that the Ministry of Defence has a URL structure that differs from that of the other ministries. This raised the question of whether it is an isolated case or whether other ministries also have unique URL structures.
- Therefore, to get the URLs of all the ministries, we inspect the `National` navigation menu and extract the links to every ministry listed there. This is more stable than relying on the typical URL structure of the ministries, which may not be consistent across all entries.

### Actual Scraping

- Similarly, the page uses JavaScript to load the ministry links, so we will use Playwright to render the page and extract the links after the JavaScript has executed. We will wait for the presence of the ministry links in the navigation menu to ensure that the page has fully loaded before we attempt to extract the data.

In [11]:
URL = 'https://accounts.ecitizen.go.ke/en/home/national-ministries'
OUT = 'pages/ministries_links.html'


async def dump_ministries_links() -> None:
	async with async_playwright() as p:
		browser = await p.chromium.launch(headless=True)
		context = await browser.new_context(
			user_agent=user_agent,
		)
		page = await context.new_page()

		await page.goto(
			URL,
			wait_until='domcontentloaded',
		)

		# Wait for the ministry links to load by
		# checking for the presence of anchor tags
		# that link to ministry pages
		await page.wait_for_selector(
			"ul a[href^='/en/ministries/']",
			timeout=30_000,
		)

		ministries = page.locator('ul').first
		html = await ministries.evaluate(
			'el => el.outerHTML'
		)

		with open(OUT, 'w', encoding='utf-8') as f:
			f.write(html)

		await browser.close()


await dump_ministries_links()
print(f'Wrote: {OUT}')

Wrote: pages/ministries_links.html
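
The selector used above matches relative links that start with `/en/ministries/`, so they need to be resolved against the site before they can be loaded. The sketch below shows one way this could be done with `urllib.parse.urljoin`. `HrefCollector`, `ministry_urls`, and `SAMPLE_MENU` are hypothetical names, and `hypothetical-ministry` is a made-up slug used only for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

PAGE_URL = 'https://accounts.ecitizen.go.ke/en/home/national-ministries'
PREFIX = '/en/ministries/'


class HrefCollector(HTMLParser):
    """Collect the href of every <a> element."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:
                self.hrefs.append(href)


def ministry_urls(html: str) -> list:
    parser = HrefCollector()
    parser.feed(html)
    parser.close()
    urls = [
        urljoin(PAGE_URL, href)
        for href in parser.hrefs
        if href.startswith(PREFIX)
    ]
    # Drop duplicates while keeping the order of first appearance
    return list(dict.fromkeys(urls))


# Hypothetical sample; the real menu markup may differ
SAMPLE_MENU = (
    '<ul>'
    '<li><a href="/en/ministries/the-state-law-office">A</a></li>'
    '<li><a href="/en/ministries/hypothetical-ministry">B</a></li>'
    '<li><a href="/en/ministries/the-state-law-office">A again</a></li>'
    '<li><a href="/en/agencies">Agencies</a></li>'
    '</ul>'
)

for url in ministry_urls(SAMPLE_MENU):
    print(url)
```

Because the URLs are taken from the menu rather than built from a pattern, an entry with an unusual path is kept as long as it starts with the same prefix.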


## Checking Individual `Ministry` Pages for Service Links and Metadata

- The main ministry page contains metadata about the ministry: the total number of agencies, the total number of services, and an overview of the ministry. All of this can be extracted from the ministry page itself.
- However, the agencies are linked from dropdown lists on the left side of the page. Each agency link modifies the URL to include query parameters that specify the agency, and the page at that modified URL loads the services under that agency. To extract the services for each agency, we therefore need to navigate to each agency link in the dropdown lists and extract the services after the JavaScript has executed.
- Thankfully, the initial page load puts all the agency links in the DOM, so we can extract them (i.e., the URLs with the query parameters) without simulating clicks. We can then iterate over these links, load each one, and extract the services for each agency.
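
To make the URL structure concrete, the sketch below splits one agency URL into its parts with `urllib.parse`. The example URL is the one used as `AGENCY_SENTINEL` in the next cell. `agency_key` is a hypothetical helper, and treating the last path segment as the ministry slug is an assumption based on this single example.

```python
from urllib.parse import parse_qs, urlsplit

EXAMPLE_URL = (
    'https://accounts.ecitizen.go.ke/en/ministries/the-state-law-office'
    '?department=registrar-generals-department'
    '&agency=registrar-of-marriages'
)


def agency_key(url: str) -> dict:
    """Return the ministry slug and the department/agency query values."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    return {
        # Assumption: the last path segment identifies the ministry
        'ministry': parts.path.rstrip('/').rsplit('/', 1)[-1],
        # parse_qs returns a list per key; take the first value if present
        'department': query.get('department', [None])[0],
        'agency': query.get('agency', [None])[0],
    }


print(agency_key(EXAMPLE_URL))
```

A key like this could be used to label each agency's services once the individual pages are loaded.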

In [18]:
URL = 'https://accounts.ecitizen.go.ke/en/ministries/the-state-law-office'
AGENCY_SENTINEL = 'https://accounts.ecitizen.go.ke/en/ministries/the-state-law-office?department=registrar-generals-department&agency=registrar-of-marriages'
MINISTRY_OUT = 'pages/ministry/ministry_overview.html'
AGENCIES_OUT = 'pages/ministry/ministry_agencies.html'
SERVICES_OUT = 'pages/ministry/ministry_services.html'


async def dump_ministry_page() -> None:
	async with async_playwright() as p:
		browser = await p.chromium.launch(headless=True)
		context = await browser.new_context(
			user_agent=user_agent,
		)
		page = await context.new_page()

		await page.goto(
			URL,
			wait_until='domcontentloaded',
		)

		# Wait for content to load by checking for
		# the presence of the "Overview" section
		await page.wait_for_selector(
			"h2:has-text('Overview')",
			timeout=30_000,
		)

		# --- Ministry Overview ---

		# Select first div with class lg:grid, this
		# contains the ministry overview
		overview = page.locator('div.lg\\:grid').first
		overview_html = await overview.evaluate(
			'el => el.outerHTML'
		)
		with open(MINISTRY_OUT, 'w', encoding='utf-8') as f:
			f.write(overview_html)
		# Report only after the file has been closed
		print(f'Wrote: {MINISTRY_OUT}')

		# --- Ministry Agencies ---

		# Select the unordered list with role listbox, this
		# contains the agency links
		agencies_lists = page.locator(
			"ul[role='listbox']"
		)
		agencies_html = await agencies_lists.evaluate_all(
			'els => els.map(el => el.outerHTML).join("\\n")'
		)
		with open(AGENCIES_OUT, 'w', encoding='utf-8') as f:
			f.write(agencies_html)
		# Report only after the file has been closed
		print(f'Wrote: {AGENCIES_OUT}')

		# --- Ministry Services ---

		# Go to new page with agency sentinel query
		# parameter to load the services for a specific
		# agency
		await page.goto(
			AGENCY_SENTINEL,
			wait_until='domcontentloaded',
		)

		# Wait for the services to load by checking for
		# links in the div with class "space-y-3"
		await page.wait_for_selector(
			"div.space-y-3 a",
			timeout=30_000,
		)

		# Select the div with class "space-y-3", this
		# contains the service links
		services = page.locator('div.space-y-3').first
		services_html = await services.evaluate(
			'el => el.outerHTML'
		)
		with open(SERVICES_OUT, 'w', encoding='utf-8') as f:
			f.write(services_html)
		# Report only after the file has been closed
		print(f'Wrote: {SERVICES_OUT}')

		await browser.close()


await dump_ministry_page()

Wrote: pages/ministry/ministry_overview.html
Wrote: pages/ministry/ministry_agencies.html
Wrote: pages/ministry/ministry_services.html
