# NECTA PSLE Dashboard

## 01-necta-webscrape
### Tasks
1. Beautiful Soup webscrape of NECTA PSLE data
2. Light data cleaning of webscraped data
3. Light feature extraction of webscraped data

#### Inputs:
- [PSLE results](https://onlinesys.necta.go.tz/results/2022/psle/psle.htm), example: [Jitegmee](https://onlinesys.necta.go.tz/results/2022/psle/results/shl_ps1104063.htm)

#### Outputs:
- 01-necta-webscrape_raw.csv (17935, 11)
- 01-necta-webscrape_features.csv (17900, 16)

In [None]:
#Libraries
#Data handling
import numpy as np
import pandas as pd

#Custom modules
from config import base_URL, top_URL, necta_raw_csv_path, necta_missing_csv_path, necta_features_csv_path
from webscraping import nation_scrape
from data_cleaning import convert_dtypes, count_duplicates, count_missing_rows, write_missing_rows, drop_missing_rows, drop_columns
from data_cleaning import convert_string_to_list, compare_list_total
from data_cleaning_special import capitalize_salaam, compare_grade
from feature_extraction import from_list_extract_total_multiple, extract_rate_multiple

### 1. Beautiful Soup webscrape of NECTA PSLE data
*Webscrape primary school examination results from each school's web page*

**Steps:**
1. Scrape hierarchically through four levels of links: nation, regions, councils, schools (the school pages hold the actual data)
2. Convert the resulting list of dicts into a pandas DataFrame and save it to CSV
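
The two steps above can be sketched as a recursive walk over the four link levels. This is only an illustration, not the actual `nation_scrape` implementation: the in-memory `FAKE_SITE`, the `fetch`, `child_links` and `scrape` helpers, and the `header` field are all hypothetical stand-ins, and a real run would fetch pages over HTTP instead.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical in-memory "site" standing in for the results pages (not real data).
FAKE_SITE = {
    "http://example.test/psle.htm": '<a href="reg_01.htm">REGION A</a>',
    "http://example.test/reg_01.htm": '<a href="distr_0101.htm">COUNCIL A1</a>',
    "http://example.test/distr_0101.htm": (
        '<a href="shl_ps0000001.htm">SCHOOL ONE</a>'
        '<a href="shl_ps0000002.htm">SCHOOL TWO</a>'
    ),
    "http://example.test/shl_ps0000001.htm": "<h3>SCHOOL ONE - PS0000001</h3>",
    "http://example.test/shl_ps0000002.htm": "<h3>SCHOOL TWO - PS0000002</h3>",
}


def fetch(url):
    """Return the HTML for a URL (here: a dictionary lookup, no network)."""
    return FAKE_SITE[url]


def child_links(url, html):
    """Collect absolute URLs of every <a href> on the page."""
    soup = BeautifulSoup(html, "html.parser")
    return [urljoin(url, a["href"]) for a in soup.find_all("a", href=True)]


def scrape(url, depth=0, max_depth=3):
    """Depth 0 = nation, 1 = region, 2 = council, 3 = school (data page)."""
    html = fetch(url)
    if depth == max_depth:
        soup = BeautifulSoup(html, "html.parser")
        # One dict per school page; a real scraper would parse more fields.
        return [{"results_url": url, "header": soup.get_text(strip=True)}]
    records = []
    for link in child_links(url, html):
        records.extend(scrape(link, depth + 1, max_depth))
    return records


records = scrape("http://example.test/psle.htm")
```

The resulting list of dicts is the shape that `pd.DataFrame.from_records` expects, which is how the notebook builds `df_necta`.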

**Observations:**
- **17,935** school pages scraped (wall time **~8.5 hours**)
- Corner cases handled in the regex: 'SEMINARY' in the school name (3 schools), a stray ';' typo before the NECTA PS# (1 school)

**Learnings:** (🧑🏻‍💻📚😎⚠️)
- ⚠️ Sending many HTTP GET requests to the same server triggered **"Max retries exceeded with url"** errors
    - 🧑🏻‍💻 **SOLUTION: reuse a "Session" with a timeout and retries**
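
A minimal sketch of the Session approach, assuming the `requests` library with urllib3's `Retry`. The helper name `make_session`, the retry count, the backoff factor, the status codes and the timeout values are illustrative assumptions, not the settings used in `webscraping.py`.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def make_session(total_retries=5, backoff_factor=1.0):
    """Build a Session that retries failed requests with increasing delays."""
    retry = Retry(
        total=total_retries,
        backoff_factor=backoff_factor,  # wait longer between successive retries
        status_forcelist=(429, 500, 502, 503, 504),  # also retry on these statuses
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session


session = make_session()
# Timeouts are passed per request, e.g. (connect, read) in seconds:
# response = session.get(url, timeout=(5, 30))
```

Reusing one Session for all pages also keeps the underlying connection open between requests, which reduces the load on the server compared with opening a new connection for each of the ~18,000 pages.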

In [None]:
%%time
#Wall time: 8h 23min 57s (real-world)
#Main code
URL = base_URL + top_URL
data = nation_scrape(URL)
df_necta = pd.DataFrame.from_records(data)

In [None]:
#Check shape and save to CSV
df_necta.shape
#df_necta.to_csv(necta_raw_csv_path)

### 2. Light data cleaning of webscraped data
*Check school examination data for obvious issues*

**Steps:**
1. Light data cleaning steps: (i) data types, (ii) values, (iii) duplicates, (iv) missing
2. Data integrity checks of totals and grades
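
As a rough illustration of these steps, the sketch below runs the same kinds of checks with plain pandas calls on toy rows. It is not the code in `data_cleaning.py`: the sample values are made up, and it assumes each `JUMLA` list holds five counts, one per grade A-E.

```python
import pandas as pd

# Hypothetical toy rows, not NECTA data.
df = pd.DataFrame({
    "school_id": ["PS0000001", "PS0000002", "PS0000003"],
    "num_sitters": [3.0, None, 5.0],
    "JUMLA": [[1, 1, 1, 0, 0], None, [2, 1, 1, 1, 0]],
})

# (i) Nullable integer dtype: keeps whole numbers while allowing pd.NA
df["num_sitters"] = df["num_sitters"].astype("Int64")

# (iii) Duplicates in a column that should be unique
n_dup = int(df.duplicated(subset="school_id").sum())

# (iv) Rows with missing results: count them, then drop them
missing_mask = df["num_sitters"].isna()
n_missing = int(missing_mask.sum())
df_clean = df[~missing_mask]

# Integrity check: the per-grade counts must add up to the reported total
list_totals = df_clean["JUMLA"].map(sum)
assert (list_totals == df_clean["num_sitters"]).all()
```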

**Observations:**
- **Dropped 35 schools** with missing school-level results data, after saving the "missing" rows to CSV

**Learnings:** (🧑🏻‍💻📚😎⚠️)
- ⚠️ Avoid storing lists in a single CSV column: they do not survive the round trip and are read back in as strings
    - 😎 `ast.literal_eval` saved me by converting the strings back into lists!
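
The round-trip problem can be reproduced with the standard library alone. This sketch uses a made-up row and the `csv` module rather than pandas, but the effect is the same: the list is written as its string representation and must be parsed on the way back in.

```python
import ast
import csv
import io

# Write one row whose second field is a Python list (hypothetical values).
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["school_id", "JUMLA"])
writer.writerow(["PS0000001", [2, 1, 1, 1, 0]])

# Read it back: the list comes back as plain text.
buffer.seek(0)
row = list(csv.DictReader(buffer))[0]
raw = row["JUMLA"]

# ast.literal_eval safely parses Python literals (lists, numbers, strings)
# without executing arbitrary code, unlike eval().
restored = ast.literal_eval(raw)
```

`ast.literal_eval` raises an exception on anything that is not a Python literal, so missing or malformed cells need to be handled before it is applied to a whole column.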

In [None]:
#Read from CSV
df_n = pd.read_csv(necta_raw_csv_path, index_col=0)

#(i) Convert to dtypes that support pd.NA (e.g. num_sitters as a nullable integer)
df_n = convert_dtypes(df_n)

#(ii) Convert data values from CSV read
df_n = convert_string_to_list(df_n, ['WASICHANA', 'WAVULANA', 'JUMLA'])

#(ii) String matching issue found during 02-tamisemi-merge
df_n = capitalize_salaam(df_n)

#(iii) Count duplicates in columns expected to be unique
count_duplicates(df_n, 'school_id') #returns 0
count_duplicates(df_n, 'results_url') #returns 0

#(iv) Check rows missing data, write to CSV, then drop from DF
count_missing_rows(df_n) #returns 35
#write_missing_rows(df_n, necta_missing_csv_path)
df_n = drop_missing_rows(df_n)

#Integrity checks: assert that list totals match num_sitters and that grades match average_300
compare_list_total(df_n, 'JUMLA', 'num_sitters')
compare_grade(df_n, 'grade', 'average_300')

df_n.shape #(17900, 11)

### 3. Light feature extraction from NECTA data
*Extract interesting features from NECTA raw data*

**Steps:**
1. Extract the number of sitters and the number passing from the lists (overall and by gender), then compute passing percentages (grades A-C)
2. Light data cleaning on newly extracted features
3. Drop unneeded raw columns, then save to CSV

**Observations:**
- Corner case, kept and noted: **28 single-gender schools** have `pct_passed_*` = 0/0 = NaN for the gender with no sitters
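
The extraction and the NaN corner case can be sketched on toy data as follows. This is not the code in `feature_extraction.py`: the `slice_total` helper and the sample counts are hypothetical, each list is assumed to hold five counts in grade order A-E, and the rate is computed as a plain fraction, which may differ from the scaling the notebook uses.

```python
import pandas as pd

# Hypothetical per-grade counts (A, B, C, D, E); the second school has no girls.
df = pd.DataFrame({
    "WASICHANA": [[1, 2, 1, 1, 0], [0, 0, 0, 0, 0]],  # girls
    "WAVULANA": [[0, 1, 2, 1, 1], [1, 1, 1, 0, 1]],   # boys
})


def slice_total(counts, start, stop):
    """Sum the counts for grades in positions start..stop-1."""
    return sum(counts[start:stop])


for source, suffix in [("WASICHANA", "girls"), ("WAVULANA", "boys")]:
    # All sitters: grades A-E (positions 0-4)
    df[f"num_sitters_{suffix}"] = df[source].map(lambda c: slice_total(c, 0, 5))
    # Passing: grades A-C (positions 0-2)
    df[f"num_passed_{suffix}"] = df[source].map(lambda c: slice_total(c, 0, 3))
    # Rate: 0 passed / 0 sitters yields NaN for single-gender schools
    df[f"pct_passed_{suffix}"] = (
        df[f"num_passed_{suffix}"] / df[f"num_sitters_{suffix}"]
    )
```

Keeping the NaN rather than filling it with 0 avoids reporting a 0% pass rate for a group that had no sitters at all.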

In [None]:
#Extract totals from lists: all sitters (A-E) and passing (A-C) students by gender
source_dest_pairs = [('WASICHANA', 'num_sitters_girls'), ('WAVULANA', 'num_sitters_boys')]
df_n = from_list_extract_total_multiple(df_n, source_dest_pairs, 0, 5)
source_dest_pairs = [('JUMLA', 'num_passed'), ('WASICHANA', 'num_passed_girls'), ('WAVULANA', 'num_passed_boys')]
df_n = from_list_extract_total_multiple(df_n, source_dest_pairs, 0, 3)

#Extract passing percentages (A-C)
rate_tuples = [('num_passed', 'num_sitters', 'pct_passed'),
               ('num_passed_girls', 'num_sitters_girls', 'pct_passed_girls'),
               ('num_passed_boys', 'num_sitters_boys', 'pct_passed_boys')]
df_n = extract_rate_multiple(df_n, rate_tuples)

#(iv) Check missing data
count_missing_rows(df_n) #returns 28: single-gender schools so pct_passed_* = 0/0 = NaN

#(v) Drop unneeded columns, save to CSV
df_n = drop_columns(df_n, ['WASICHANA', 'WAVULANA', 'JUMLA'])
#df_n.to_csv(necta_features_csv_path)

#Check
df_n.shape #(17900, 16)