# Tanzania Primary Education Results (NECTA PSLE)

### 2a. Data Sourcing - NECTA
1. Beautiful Soup webscrape of NECTA data

#### Outputs:
* nation_necta_raw.csv (17935, 10)

#### Functions:
* `scrape_psle_page`

In [None]:
#Libraries: pre-installed in Anaconda
import requests
from bs4 import BeautifulSoup
import re
import numpy as np
import pandas as pd
#User-defined functions.py
import functions as fn

### 1. Beautiful Soup webscrape of NECTA data
**ELI5 Summary:**
*Get primary school examination results from each school's public web page*

**Steps:**
1. Get region links from national page of [PSLE results](https://onlinesys.necta.go.tz/results/2022/psle/psle.htm)
2. Get council (districts or municipalities) links from regional page
3. Get individual school links from council page
4. Call function <code>scrape_psle_page</code> to get **dict of school data**
5. Add region and council name to dict from looping code
6. Append THIS school's data to **list of dicts**
7. After three-level loop: create **Pandas DataFrame** from list of dicts and save to CSV
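
The steps above can be sketched offline as follows. This is a minimal illustration, not the notebook's production loop: the `PAGES` dict, the file names in it, and `scrape_school_stub` are all made up so the example runs without hitting the NECTA server (`PAGES[...]` stands in for `requests.get(...).content`, and the stub stands in for `scrape_psle_page`).

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical in-memory "site": URL -> HTML (stand-in for requests.get(...).content)
PAGES = {
    "psle.htm": '<a href="results/reg_01.htm">ARUSHA</a>',
    "results/reg_01.htm": '<a href="distr_0101.htm">ARUSHA CC</a>',
    "results/distr_0101.htm": (
        '<a href="shl_ps0101001.htm">SCHOOL A</a>'
        '<a href="shl_ps0101002.htm">SCHOOL B</a>'
    ),
}

def get_links(html):
    """Return (link text, href) for every <a> tag that carries an href."""
    soup = BeautifulSoup(html, "html.parser")
    return [(a.get_text(strip=True), a["href"]) for a in soup.find_all("a", href=True)]

def scrape_school_stub(school_url):
    """Hypothetical stand-in for scrape_psle_page: returns a dict of school data."""
    return {"school_url": school_url}

records = []  # list of dicts, one per school
for region_name, region_url in get_links(PAGES["psle.htm"]):
    for council_name, council_href in get_links(PAGES[region_url]):
        for school_name, school_href in get_links(PAGES["results/" + council_href]):
            row = scrape_school_stub(school_href)
            # Add context that only the outer loops know about
            row["region_name"] = region_name.capitalize()
            row["council_name"] = council_name
            records.append(row)

# Build the DataFrame once, after all three loops have finished
df = pd.DataFrame.from_records(records)
print(df)
```

Building the DataFrame once from `records` avoids copying the growing table on every school, which is what an iterative `pd.concat` would do.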

**DATA observations:**
* Example: [Jitegmee Primary School](https://onlinesys.necta.go.tz/results/2022/psle/results/shl_ps1104063.htm)    
* School level data is in text below headers and first table (genders/total x grades)
    * Grade (DARAJA) is based on school average out of 300 points (6 subjects x 50 points)
    * Per-student, per-subject data not used (for now)
* @nation: **17,935 school pages** scraped

**Corner cases:** (DEBUG CODE)
* 'SEMINARY' in school name (3) => `scrape_psle_page` regex
* typo ';' before NECTA exam id (1) => `scrape_psle_page` regex
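
A rough sketch of how one regex could absorb both corner cases. The real pattern lives in `scrape_psle_page` and is not shown here, so everything below is an assumption: the header format (`<NAME> PRIMARY SCHOOL - PS1104063`), the `HEADER_RE` pattern, the `parse_header` helper, and the sample strings are illustrative only.

```python
import re

# Assumed header format: "<NAME> PRIMARY SCHOOL - PS1104063" (not confirmed by the source)
HEADER_RE = re.compile(
    r"^(?P<name>.+?)\s+(?:PRIMARY SCHOOL|SEMINARY)"  # school name, then either suffix
    r"\W*"                                            # separator; tolerates stray punctuation such as ';'
    r"(?P<id>PS\d{7})\s*$"                            # exam id: 'PS' plus 7 digits
)

def parse_header(text):
    """Return school name and id from a header line, or None if it does not match."""
    m = HEADER_RE.match(text.strip())
    if m is None:
        return None
    return {"school_name": m["name"], "school_id": m["id"]}

# Made-up sample headers covering the normal case and both corner cases
headers = [
    "EXAMPLE PRIMARY SCHOOL - PS1104063",
    "EXAMPLE SEMINARY - PS0502150",
    "EXAMPLE PRIMARY SCHOOL - ;PS0505121",
    "SOMETHING UNEXPECTED",
]

# Collect mismatches in code rather than by eye, so the check scales to 17k pages
mismatches = [h for h in headers if parse_header(h) is None]
print(mismatches)
```

In this sketch the suffix ('PRIMARY SCHOOL' or 'SEMINARY') is dropped from the captured name; keeping it would only require moving it inside the `name` group.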

**Learnings:** (🧑🏻‍💻📚😎⚠️)
- 🧑🏻‍💻 Collect all web-scraped data into a **list of dicts** first, then create the DataFrame once at the end
    - ⚠️ Avoid iterative Pandas concat
- 📚 Scaling up to the full (national) dataset (n=17k) forced me to think differently: finding corner cases has to be scalable and automated (code), not specific and manual (eyeballing).
- 😎 HTML code is very structured and BeautifulSoup can easily "find" tags on a page (didn't need to "navigate" through HTML)
- 😎 Iterate down through **multi-level page hierarchy**: get page "soup", get links > get next-level page "soup"
- 😎 **Regular expressions** are a joy to use with this very helpful website to test: [regex101](https://regex101.com/)
    - ⚠️ Needed full dataset to catch all string matching cases, e.g., 'SEMINARY' instead of 'PRIMARY SCHOOL'
- ⚠️ Numerous requests (HTTP GET) to same server caused **"Max retries exceeded with url"**
    - 🧑🏻‍💻 File I/O for DEBUG: <code>with open</code> construct in overwrite ('w') and append ('a') modes
    - 🧑🏻‍💻 Exception handling for DEBUG: <code>try: ... except: ...</code> block
    - 🧑🏻‍💻 **SOLUTION: timeout and retries** in <code>scrape_psle_page</code>
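
One common way to get a timeout plus retries with `requests` is to mount an `HTTPAdapter` configured with urllib3's `Retry` class. The sketch below is an assumption about how that could look, not the actual body of `scrape_psle_page`: the helper names `make_session` and `fetch`, the backoff factor, and the list of retried status codes are all illustrative choices.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_session(retries=10, backoff_factor=0.5):
    """Build a Session that retries failed requests with exponential backoff."""
    retry = Retry(
        total=retries,                      # overall retry budget per request
        backoff_factor=backoff_factor,      # growing pause between attempts
        status_forcelist=(429, 500, 502, 503, 504),  # also retry on these HTTP codes
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session

def fetch(session, url, timeout=10):
    """GET a page; return its bytes on success, or None after logging the failure."""
    try:
        response = session.get(url, timeout=timeout)
        response.raise_for_status()
        return response.content
    except requests.exceptions.RequestException as err:
        print(f"{url}: {err}")
        return None

session = make_session(retries=10)
```

Reusing one session across the whole scrape also reuses the underlying connection, which may reduce load on the server compared with a fresh `requests.get` per page.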

In [None]:
%%time
#Webscrape down through page levels: region > council > school

#Base information
exam = 'psle'
year = '2022'
base_URL = f"https://onlinesys.necta.go.tz/results/{year}/{exam}/"
top_URL = f"{exam}.htm"
URL = base_URL + top_URL

#Needed for full dataset webscraping
timeout = 10 #requests.get timeout
retries = 10 #Retry class

#DEBUG CODE
with open('textout/http_responses.txt', 'w') as f:
    f.write("HTTP Responses\n")
with open('textout/regex_mismatches.txt', 'w') as f:
    f.write("Regex Mismatches\n")

#National level scrape
response = requests.get(URL)
soup = BeautifulSoup(response.content, "html.parser")

#Initialize list of schools data (dictionaries)
y = list()

regions = soup.find_all('a') #just find all hyperlink tags!
for r in regions:
    region_name = r.get_text(strip=True).capitalize() #match TAMISEMI
    region_URL = r['href']
        
    #Regional level scrape
    response = requests.get(base_URL + region_URL)
    soup = BeautifulSoup(response.content, "html.parser")
    councils = soup.find_all('a')

    for c in councils:    
        council_name = c.get_text(strip=True).capitalize() #match TAMISEMI
        m = re.search(r'(.*)\s+(\w{2})$', council_name) #e.g., handle 'MOROGORO MC' > 'Morogoro MC'; '$' limits the match to a trailing 2-letter code
        if m is not None: council_name = f"{m[1]} {m[2].upper()}"

        council_URL = "results/" + c['href']

        #Council (District/Municipality) level scrape
        response = requests.get(base_URL + council_URL)
        soup = BeautifulSoup(response.content, "html.parser")
        schools = soup.find_all('a')

        for s in schools:
            school_name = s.get_text(strip=True)
            school_URL = s['href']

            #School level scrape in function > returns dict of school data
            x = fn.scrape_psle_page(school_URL, timeout, retries)
            #Add supplemental info from higher pages
            x['region_name'] = region_name
            x['council_name'] = council_name
            #Add to list of schools
            y.append(x)

        #Leoson idea: add delay vs. server traffic limiting
        #sleep(60*1) #unit in seconds; needs "from time import sleep"

#One time creation of DataFrame-NECTA (df_n) from list of dicts
df_n = pd.DataFrame.from_records(y)

In [None]:
#DEBUG CODE
#Patching from regex_mismatches.txt => scrape_psle_page: regex update to:
# - include 'SEMINARY' after school name (3)
# - allow typos before NECTA exam id (1)

#check MISSING schools
df_n[df_n['school_name'].isna()]

#df_n[df_n['school_name'].isna()]
i = df_n[df_n['school_name'].isna()].index #2928, 3638, 11890, 15076
regex_mismatches_list = [(i[0], 'shl_ps0502150.htm'),
                         (i[1], 'shl_ps0505121.htm'),
                         (i[2], 'shl_ps1701079.htm'),
                         (i[3], 'shl_ps2101141.htm')]
#Re-scrape each missing school with the patched regex and fill in the blanks
for idx, school_URL in regex_mismatches_list:
    x = fn.scrape_psle_page(school_URL, 10, 10) #timeout=10, retries=10
    df_n.at[idx, 'school_name'] = x['school_name']
    df_n.at[idx, 'school_id'] = x['school_id']

In [None]:
#SPOT-CHECK CODE - handy, keep around!
df_n.shape
#df_n.head()

#Save to CSV #@nation: 102ms
#df_n.to_csv('dataout/2a/nation_necta_raw.csv')