# Kenyan eCitizen Services Dataset - Processing Exploration Notebook

- This notebook explores how to process the data scraped from the Kenyan eCitizen services platform. The goal is to extract the relevant information from the raw HTML files saved during the scraping phase and to identify any quirks in the page structure that the processing pipeline must account for.
- It continues the [scraping exploration notebook](scraping_exploration.ipynb), where we identified the structure of the pages and how to retrieve their HTML. Here we focus on turning that HTML into structured data that is ready to use.
- All scraped pages are in the `pages/` directory, and all structured outputs are saved in the `data/` directory. Output files follow a consistent naming convention so that they are easy to identify and organize.
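
The cells in this notebook repeat the same "save to CSV, then to JSON" steps and assume the output folders already exist. As a minimal sketch of how that could be centralized, the hypothetical helper `save_records` below creates the target folder and writes both files. It uses the standard-library `csv` module instead of pandas to stay dependency-free, and its name and signature are assumptions, not part of the existing code.

```python
import csv
import json
import tempfile
from pathlib import Path


def save_records(records, out_dir, name):
    """Write a list of flat dicts to <out_dir>/<name>.csv and <name>.json."""
    out_path = Path(out_dir)
    # Create the output directory (and parents) if it does not exist yet
    out_path.mkdir(parents=True, exist_ok=True)

    csv_path = out_path / f'{name}.csv'
    json_path = out_path / f'{name}.json'

    # Column order follows the keys of the first record
    fieldnames = list(records[0].keys()) if records else []
    with open(csv_path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(records)

    with open(json_path, 'w', encoding='utf-8') as f:
        json.dump(records, f, indent=4)

    return csv_path, json_path


# Demo in a temporary directory so nothing is written to the real data/ folder
with tempfile.TemporaryDirectory() as tmp:
    sample = [{'id': 'faq_1', 'question': 'Q?', 'answer': 'A.'}]
    csv_path, json_path = save_records(sample, Path(tmp) / 'data' / 'faq', 'faq')
    print(csv_path.name, json_path.name)
```

In the notebook, a call such as `save_records(faqs, 'data/faq', 'faq')` would produce the same two paths that the FAQ cell writes to.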

**A note on processing URLs**:

- Some links are relative URLs, which we need to convert to absolute URLs by prefixing them with the base URL of the eCitizen platform. This ensures that the links resolve to the correct pages when we follow them for further data extraction.
- This can be done by checking whether the URL starts with a slash (`/`) and, if so, concatenating it with the base URL (e.g., `https://www.ecitizen.go.ke`) to form the complete URL.
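
The check described above can be sketched as follows. The helper name `to_absolute_url` and the example path `/en/faq` are hypothetical. The sketch uses `urllib.parse.urljoin` rather than plain string concatenation, which avoids doubled slashes and also resolves protocol-relative links (`//host/path`).

```python
from urllib.parse import urljoin

# Base URL taken from the example in the note above
BASE_URL = 'https://www.ecitizen.go.ke'


def to_absolute_url(url, base_url=BASE_URL):
    """Return an absolute URL, prefixing root-relative paths with the base URL."""
    if url.startswith('/'):
        # urljoin resolves the root-relative path against the base URL
        return urljoin(base_url, url)
    # Already absolute (or empty): leave unchanged
    return url


print(to_absolute_url('/en/faq'))  # hypothetical relative path
print(to_absolute_url('https://example.org/page'))
```

Applying this helper to the `url` and `logo_url` fields before saving would keep every stored link directly usable.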

In [3]:
import json
import re

import pandas as pd
from bs4 import BeautifulSoup

## Processing `FAQ` Page 

- Here we process the `FAQ` page, which contains the frequently asked questions and their answers. 
- To process this page, we will use BeautifulSoup to parse the HTML content and extract the questions and answers.
- Each question and answer pair is within an `li` item with an id of the form `faq_<number>`; the question is within a `button` tag, and the answer is within the neighboring `div`. To extract this information, we loop through all the `li` items and take the question from the `button` tag and the answer from the `div` tag.

In [7]:
FAQ_PATH = 'pages/faq.html'
FAQ_DATA_PATH_CSV = 'data/faq/faq.csv'
FAQ_DATA_PATH_JSON = 'data/faq/faq.json'

# Load the FAQ page
with open(FAQ_PATH) as f:
	faq_html = f.read()

soup = BeautifulSoup(faq_html, 'lxml')

faq_items = soup.find_all('li', id=re.compile(r'^faq_'))

faqs = []

for item in faq_items:
	btn = item.find('button')
	if not btn:
		continue

	question = btn.get_text(' ', strip=True)

	# Answer is usually the next sibling div
	answer_div = btn.find_next_sibling('div')

	# If the DOM is slightly different, fall back to
	# the first div found anywhere inside the li
	if not answer_div:
		divs = item.find_all('div')
		answer_div = divs[0] if divs else None

	answer = ''
	if answer_div:
		answer = answer_div.get_text(' ', strip=True)

	# Optional: skip empty answers
	if question and answer:
		faqs.append(
			{
				'id': item.get('id'),
				'question': question,
				'answer': answer,
			}
		)

# Save to CSV
df = pd.DataFrame(faqs)
df.to_csv(FAQ_DATA_PATH_CSV, index=False)

# Save to JSON
with open(FAQ_DATA_PATH_JSON, 'w') as f:
	json.dump(faqs, f, indent=4)

## Processing `Agencies` Page

- Here we process the `Agencies` page, which contains a list of government agencies and metadata about them.
- Each agency is represented as an `a` tag whose `href` attribute links to that agency's page.
- To extract the logo URL for each agency, we get the `src` attribute of the `img` tag within the `a` tag.
- The name of the agency is the text content of the `h4` tag within the `a` tag, and the description is the text content of the `p` tag that follows the `h4` tag within the same `a` tag.
- We loop through all the `a` tags and extract this information for each agency.

In [9]:
AGENCIES_PATH = 'pages/agencies_grid.html'
AGENCIES_DATA_PATH_CSV = 'data/agencies/agencies.csv'
AGENCIES_DATA_PATH_JSON = 'data/agencies/agencies.json'

# Load the Agencies page
with open(AGENCIES_PATH) as f:
	agencies_html = f.read()

soup = BeautifulSoup(agencies_html, 'lxml')

agency_links = soup.find_all('a')

agencies = []

for link in agency_links:
	# Extract the URL for the agency page
	url = link.get('href', '')

	# Extract the logo URL from the img tag within the link
	img_tag = link.find('img')
	logo_url = img_tag.get('src', '') if img_tag else ''

	# Extract the agency name from the h4 tag
	# within the link
	h4_tag = link.find('h4')
	name = (
		h4_tag.get_text(' ', strip=True) if h4_tag else ''
	)

	# Extract the description from the p tag within the link
	p_tag = link.find('p')
	description = (
		p_tag.get_text(' ', strip=True) if p_tag else ''
	)

	if name:
		agencies.append(
			{
				'name': name,
				'description': description,
				'logo_url': logo_url,
				'url': url,
			}
		)

# Save to CSV
df = pd.DataFrame(agencies)
df.to_csv(AGENCIES_DATA_PATH_CSV, index=False)

# Save to JSON
with open(AGENCIES_DATA_PATH_JSON, 'w') as f:
	json.dump(agencies, f, indent=4)

## Processing `Ministries` Pages

- Extracting information about the ministries is a bit more involved, so we split it into multiple steps to handle the complexity of navigating through the pages and extracting the relevant information.

### 1. Initial Navigation

- The names of the ministries and the URLs of their pages on the eCitizen platform come from the list of `a` tags in the `National` navigation menu: the URL of each ministry's page is in the `href` attribute, and the name of the ministry is the text content of the `a` tag.
- This step is straightforward: we loop through all the `a` tags in the `National` navigation menu and extract this information for each ministry.

> Note: we do not execute this step in this exploration notebook, but we will implement it in the processing pipeline.
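
Since this step is not executed here, the following is only a rough sketch of how it might look. The markup of the `National` menu was not captured in this notebook, so the sample HTML, the `nav#national` selector, the `/en/ministries/...` paths, and the helper name `extract_ministry_links` are all assumptions. The sketch uses Python's built-in `html.parser` backend so that it runs without `lxml`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the real menu structure is an assumption
SAMPLE_NAV_HTML = """
<nav id="national">
  <a href="/en/ministries/the-state-law-office">The State Law Office</a>
  <a href="/en/ministries/example-ministry">Example Ministry</a>
  <a href="#"></a>
</nav>
"""


def extract_ministry_links(html, menu_selector):
    """Collect the name and URL of every named link inside the menu."""
    soup = BeautifulSoup(html, 'html.parser')
    menu = soup.select_one(menu_selector)
    if menu is None:
        return []

    ministries = []
    for link in menu.find_all('a'):
        name = link.get_text(' ', strip=True)
        if name:  # skip empty or icon-only links
            ministries.append({'name': name, 'url': link.get('href', '')})
    return ministries


ministries = extract_ministry_links(SAMPLE_NAV_HTML, 'nav#national')
print(ministries)
```

The selector would need to be replaced with whatever actually identifies the `National` menu in the scraped page.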

### 2. Ministry Overview

- The ministry overview is processed separately from the service information for simplicity.
- The data is well structured. A `dl` tag contains the ministry's metadata: the `dt` tag with the text `Total Agencies` labels the number of agencies under the ministry, and the `dt` tag with the text `Total Services` labels the number of services; in each case the corresponding `dd` tag holds the actual number. This order (agencies first, then services) is consistent across all ministry overview pages, so we can rely on it to extract the values.
- The description of the ministry is in the sole `article` tag on the page, so we take the text content of that tag.
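
Relying on position works as long as the order holds. As a hedged alternative, the values can also be looked up by their `dt` label, which would keep working if the order ever changed. The sample fragment and the helper name `read_definition_list` below are hypothetical, and the numbers are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment mirroring the dt/dd structure described above
SAMPLE_OVERVIEW_HTML = """
<dl>
  <dt>Total Agencies</dt>
  <dd>3</dd>
  <dt>Total Services</dt>
  <dd>12</dd>
</dl>
"""


def read_definition_list(html):
    """Map each dt label to the text of the dd that follows it."""
    soup = BeautifulSoup(html, 'html.parser')
    values = {}
    for dt in soup.find_all('dt'):
        dd = dt.find_next_sibling('dd')
        if dd is not None:
            values[dt.get_text(' ', strip=True)] = dd.get_text(strip=True)
    return values


overview = read_definition_list(SAMPLE_OVERVIEW_HTML)
total_agencies = overview.get('Total Agencies')
total_services = overview.get('Total Services')
print(total_agencies, total_services)
```

The values are returned as strings, matching what `get_text` produces in the cell that follows; converting them to integers is left to the pipeline.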

In [7]:
MINISTRY_OVERVIEW_PATH = (
	'pages/ministry/ministry_overview.html'
)
MINISTRY_OVERVIEW_DATA_PATH_CSV = (
	'data/ministry/ministry_overview.csv'
)
MINISTRY_OVERVIEW_DATA_PATH_JSON = (
	'data/ministry/ministry_overview.json'
)

# Load the Ministry Overview page
with open(MINISTRY_OVERVIEW_PATH) as f:
	ministry_overview_html = f.read()

soup = BeautifulSoup(ministry_overview_html, 'lxml')
ministry_overview = []

dd_tags = soup.find_all('dd')

# Values are in the dd tags: agencies first, then services
total_agencies = (
	dd_tags[0].get_text(strip=True) if dd_tags else None
)
total_services = (
	dd_tags[1].get_text(strip=True)
	if len(dd_tags) > 1
	else None
)

# Ministry description is in article tag
description_tag = soup.find('article')
description = (
	description_tag.get_text(' ', strip=True)
	if description_tag
	else None
)

ministry_overview.append(
	{
		'ministry': 'the-state-law-office',
		'total_agencies': total_agencies,
		'total_services': total_services,
		'description': description,
	}
)

# Save to CSV
df = pd.DataFrame(ministry_overview)
df.to_csv(MINISTRY_OVERVIEW_DATA_PATH_CSV, index=False)

# Save to JSON
with open(MINISTRY_OVERVIEW_DATA_PATH_JSON, 'w') as f:
	json.dump(ministry_overview, f, indent=4)

### 3. Ministry Agencies

- To get the services under each agency within the ministry, we navigate to the ministry page with the appropriate query parameters to get the list of agencies under that ministry, and then extract the relevant information for each agency.
- Parsing this information is also straightforward: each agency is represented as an `a` tag whose `href` attribute links to that agency's page, and the name of the agency is the text content of the `a` tag.
- Following these links gives us the page listing the services under each agency, from which we can extract the relevant information for each service.
- For each URL, we can also extract the name of the agency and the department it belongs to within the ministry; both come from the query parameters in the URL.

> This step is also not executed in the processing exploration notebook, but we will implement this in the processing pipeline.
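
As a rough sketch of the query-parameter extraction described above, the standard library's `urllib.parse` can split the URL and read the values. The parameter names `department` and `agency`, the example path, and the helper name `agency_from_url` are assumptions; the actual names used by the eCitizen platform should be checked against the scraped links.

```python
from urllib.parse import parse_qs, urlparse


def agency_from_url(url):
    """Read the department and agency slugs from a URL's query string."""
    params = parse_qs(urlparse(url).query)
    # parse_qs returns a list per key; take the first value, or None if absent
    return {
        'department': params.get('department', [None])[0],
        'agency': params.get('agency', [None])[0],
    }


# Hypothetical URL: the path and the parameter names are assumptions
example_url = (
    '/en/ministries/the-state-law-office'
    '?department=registers-generals-department'
    '&agency=registrar-of-marriages'
)
info = agency_from_url(example_url)
print(info)
```

The resulting slugs could replace the hardcoded `department` and `agency` values used in the services cell.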

### 4. Ministry Services

- For each service under the agency within the ministry, we only need the `a` tag: the `href` attribute links to the service page, and the text content is the name of the service.

In [8]:
MINISTRY_SERVICES_PATH = (
    'pages/ministry/ministry_services.html'
)
MINISTRY_SERVICES_DATA_PATH_CSV = (
    'data/ministry/ministry_services.csv'
)
MINISTRY_SERVICES_DATA_PATH_JSON = (
    'data/ministry/ministry_services.json'
)

# Load the Ministry Services page
with open(MINISTRY_SERVICES_PATH) as f:
    ministry_services_html = f.read()

soup = BeautifulSoup(ministry_services_html, 'lxml')

service_links = soup.find_all('a')
services = []

for link in service_links:
    url = link.get('href', '')
    name = link.get_text(' ', strip=True)

    if name:
        services.append(
            {
                'ministry': 'the-state-law-office',
                'department': 'registers-generals-department',
                'agency': 'registrar-of-marriages',
                'name': name,
                'url': url,
            }
        )

# Save to CSV
df = pd.DataFrame(services)
df.to_csv(MINISTRY_SERVICES_DATA_PATH_CSV, index=False)

# Save to JSON
with open(MINISTRY_SERVICES_DATA_PATH_JSON, 'w') as f:
    json.dump(services, f, indent=4)