# NECTA PSLE Dashboard

## 03-feature-extraction
### Tasks
NOTE: Features are currently fixed/extracted for **government schools only** (16361/17900), except `context`
1. Manual fixes to latitude and longitude (Google Maps, mWater), "fe1" Query population data (mWater)
2. "fe2" Extract distance to closest other school
3. "fe3" Extract distance to council headquarters
4. "fe4" Extract categorical variables (urban/rural context, PSLE results quantiles)

#### Inputs:
- 02-tamisemi-merge.csv (17900, 35)
- 03-mwater_latlon_fixes_population.csv (16339, 3)
- 03-mwater_council_hq_coords.csv (184, 2)

#### Outputs:
- 03-feature-extraction.csv (17900, 44)

In [None]:
#Data handling
import numpy as np
import pandas as pd

#Custom modules
from config import *
from data_cleaning import count_duplicates, set_index, is_diff_nans_equal, drop_columns, fillna_not_fixed
from data_cleaning_special import parse_mwater_gps_to_latlon
from feature_extraction import calc_d_km, find_closest_d_km, extract_context

### 1. Manual fixes to latitude and longitude, "fe1" Query population data
*Fix schools' latitude and longitude coordinates by comparing them on public maps against region boundaries and checking for other errors (in water, in a game reserve), then use the fixed coordinates to query Meta population data in mWater.*

**Technical Background:**
- Population data from Meta's [High Resolution Population Density Maps](https://dataforgood.facebook.com/dfg/tools/high-resolution-population-density-maps)
- [mWater](https://portal.mwater.co/#/) provides population data [query calculation](https://portal.mwater.co/#/resource_center/population_queries): from **GPS coordinates**, within **x metres**

**Steps:**
1. Read in output from mWater which includes:
    - **MANUAL DATA CLEANING:** coordinate fixing in mWater Portal Sites
    - "fe1" population data query within three kilometres (3km)
2. Combine manual data with main dataset, and flag `'LATLON is_fixed'`

**Observations:**
- **531 manual corrections** of coordinates (June 2023), followed by additional decimal place changes (Aug 2023)
    - Often number transcription error (wrong or missing)
    - Google Maps not always correct vs. WARD

**Learnings:** (🧑🏻‍💻📚😎⚠️)
- 😎 Really benefitting from **Pandas indexing** for: `pd.concat` and `fillna` (from another column) based on NECTA ID
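
The index-based pattern noted above can be sketched as follows. This is a minimal illustration, not the notebook's actual helpers: the toy frames, school IDs and column names are made up, and the `fillna` and comparison lines only approximate what `fillna_not_fixed` and `is_diff_nans_equal` might do, since those custom functions are not shown here.

```python
import pandas as pd

# Toy main dataset indexed by school ID (IDs and coordinates are made up)
df_main = pd.DataFrame(
    {"latitude": [-6.80, -3.37, -8.90], "longitude": [39.28, 36.68, 33.45]},
    index=pd.Index(["PS001", "PS002", "PS003"], name="school_id"),
)

# Toy manual fixes: PS003 is absent (e.g. a non-government school)
df_fix = pd.DataFrame(
    {"latitude_fix": [-6.80, -3.40], "longitude_fix": [39.28, 36.68]},
    index=pd.Index(["PS001", "PS002"], name="school_id"),
)

# Column-wise concat aligns rows on the shared index; unmatched rows get NaN
df = pd.concat([df_main, df_fix], axis=1)

# Fill missing fixes from the original column (alignment is again by index)
for col in ["latitude", "longitude"]:
    df[f"{col}_fix"] = df[f"{col}_fix"].fillna(df[col])

# Flag rows where either fixed coordinate differs from the original
df["latlon_is_fixed"] = (df["latitude"] != df["latitude_fix"]) | (
    df["longitude"] != df["longitude_fix"]
)
```

Because both frames share the same index name and ID values, no explicit merge key is needed; the row for `PS003` simply keeps its original coordinates and is flagged as not fixed.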

In [None]:
#Main code
#1. Read in mWater output
df_mw = pd.read_csv(fe1_mwater_path)
df_mw.shape #(16340, 4)

In [None]:
#Light Data cleaning
#(ii) Parse 'GPS Location' for "fix" latitude, longitude
df_mw = parse_mwater_gps_to_latlon(df_mw, lat_col_fix, lon_col_fix)

#(ii) Rename new population column to 'pop_3km'
df_mw = df_mw.rename(columns={'Population within 3000 meters of GPS Location': 'pop_3km'})

#(iii) Check duplicates
count_duplicates(df_mw, 'Description') #returns 0

#Set index to NECTA ID for pd.concat, equalize index name so not lost during concat
df_mw = set_index(df_mw, 'Description')
df_mw.index.name = 'school_id'

In [None]:
#2. Combine data
#Read in ts-merged data
dfe = pd.read_csv(necta_ts_merged_path, index_col='school_id')

#CONCATENATE column-wise
dfe1 = pd.concat([dfe, df_mw[fe1_mwater_cols_concat]], axis=1)
dfe1.shape

#Fill NonGov/NA with original coordinates
dfe1 = fillna_not_fixed(dfe1, [lat_col, lon_col])

#Flag fixed cases
dfe1 = is_diff_nans_equal(dfe1, lat_col, lat_col_fix, lat_col_is_fixed)
dfe1 = is_diff_nans_equal(dfe1, lon_col, lon_col_fix, lon_col_is_fixed)
dfe1['LATLON is_fixed'] = dfe1[lat_col_is_fixed] | dfe1[lon_col_is_fixed]
dfe1['LATLON is_fixed'].value_counts() #True 693

#Light data cleaning
#(v) Drop unneeded columns
dfe1 = drop_columns(dfe1, ['Name', lat_col_is_fixed, lon_col_is_fixed])

#Save fe1
#dfe1.to_csv(fe1_csv_path)

#Check
dfe1.shape #(17900, 39)

### 2. "fe2" Extract distance to closest other school
*Calculate distance in kilometres to the closest other school as a measure of remoteness.*

**Steps:**
1. Calculate distance to the closest other (government) school in the dataset

**Learnings:** (🧑🏻‍💻📚😎⚠️)
- 🧑🏻‍💻 Tricky to compare each element against all other elements, but an exhaustive row-wise `apply` (each row of the DataFrame vs. the whole Series) worked!
    - ⚠️ But O(n\*\*2) complexity: for n=16,361 => 4-5min is OK
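
For comparison, the all-pairs search can be sketched with NumPy broadcasting. This is an illustration under assumptions, not the notebook's method: `calc_d_km` and `find_closest_d_km` are not shown here, so the haversine formula, the function names `haversine_km` and `closest_other_km`, and the sample points are all hypothetical stand-ins.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; inputs in decimal degrees, broadcastable."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

def closest_other_km(lat, lon):
    """For each point, return (position of closest other point, distance in km)."""
    lat = np.asarray(lat, dtype=float)
    lon = np.asarray(lon, dtype=float)
    # n x n distance matrix via broadcasting: still O(n**2) work and memory
    d = haversine_km(lat[:, None], lon[:, None], lat[None, :], lon[None, :])
    np.fill_diagonal(d, np.inf)  # exclude each point's distance to itself
    idx = d.argmin(axis=1)
    return idx, d[np.arange(len(lat)), idx]

# Three made-up points (decimal degrees)
lat = [-6.80, -6.81, -3.37]
lon = [39.28, 39.28, 36.68]
idx, d_km = closest_other_km(lat, lon)
```

The complexity stays quadratic; broadcasting only moves the inner loop into NumPy. At n = 16,361 each n x n float64 array takes roughly 2 GB, so a real run would probably need to process the rows in chunks.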

In [None]:
%%time
#Wall time: 4min 32s
#1. Calculate nearest school distance for each school
#Government schools only
dfe2 = dfe1.copy()
dfe2g = dfe1[dfe1['SCHOOL OWNERSHIP'] == 'Government'] #Gov = 16361

#Setup coordinates data structures
s_coord = dfe2g.apply(lambda p: (p[lat_col_fix], p[lon_col_fix]), axis=1)
df_coord = s_coord.to_frame('lat_lon_tuple')

#Pass the full Series as an argument so each row is compared against every school
s_closest = df_coord.apply(find_closest_d_km, col='lat_lon_tuple', s_p=s_coord, axis=1)

In [None]:
#Save result back to main dataset
dfe2['d_closest'] = s_closest.apply(lambda x: x[1])
#dfe2['check_closest_id'] = s_closest.apply(lambda x: x[0]) #index to closest school

#Save fe2
#dfe2.to_csv(fe2_csv_path)

#Check
dfe2.shape #(17900, 40)

### 3. "fe3" Extract distance to council headquarters
*Calculate distance in kilometres to council headquarters as a measure of remoteness.*

**Steps:**
1. Read in **MANUAL DATA COLLECTION** list of council HQ coordinates from Google Maps, mWater
2. Combine council HQ coordinates with main dataset
3. Calculate distance of each school to its council HQ, light data clean, save to CSV

**Observations:**
- Tried to webscrape coordinates from TAMISEMI's region > council websites but challenges:
    - Coordinates were not directly accessible in 63/184 cases
    - Regular and irregular council name differences between dataset and websites
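
Step 2 above (attaching per-council values to each school) could also be sketched as a join on the council name rather than a per-row lookup. This is an alternative under assumptions, not the notebook's code: the council names, school IDs and coordinates below are made up, and the column names only mirror those used in this section.

```python
import pandas as pd

# Made-up council HQ coordinates, indexed by council name
df_hq = pd.DataFrame(
    {"latitude_hq": [-6.82, -3.37], "longitude_hq": [39.27, 36.69]},
    index=pd.Index(["Alpha MC", "Beta DC"], name="council_name"),
)

# Made-up schools, indexed by school ID
df_schools = pd.DataFrame(
    {
        "council_name": ["Alpha MC", "Beta DC", "Alpha MC"],
        "SCHOOL OWNERSHIP": ["Government", "Government", "Non-Government"],
    },
    index=pd.Index(["PS001", "PS002", "PS003"], name="school_id"),
)

# Left join: each school picks up its council's HQ coordinates,
# and the school_id index of the left frame is preserved
df = df_schools.join(df_hq, on="council_name")

# Keep HQ coordinates for government schools only
is_gov = df["SCHOOL OWNERSHIP"] == "Government"
df.loc[~is_gov, ["latitude_hq", "longitude_hq"]] = float("nan")
```

One difference worth noting: a join leaves NaN for a council name with no match, whereas a `.at` lookup raises a `KeyError`, so the join hides name mismatches unless they are checked separately.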

In [None]:
#Main code
#1. Read in mWater output
df_mw_hq = pd.read_csv(fe3_mwater_path)
df_mw_hq.shape #(184, 2)

In [None]:
#2. Combine with main dataset

#Light Data cleaning
#(ii) Parse 'GPS Location' for "fixed" latitude, longitude
df_mw_hq = parse_mwater_gps_to_latlon(df_mw_hq, 'latitude_hq', 'longitude_hq')

#(iii) Check duplicates
count_duplicates(df_mw_hq, 'Name') #returns 0

#Set index to council name to match main dataset
df_mw_hq = set_index(df_mw_hq, 'Name')
df_mw_hq.index.name = 'council_name'

#Create per-school columns with per-council values
dfe3 = dfe2.copy()
dfe3['council_hq_lat'] = dfe3[dfe3['SCHOOL OWNERSHIP'] == 'Government'].apply(lambda x: df_mw_hq.at[x['council_name'], 'latitude_hq'], axis=1)
dfe3['council_hq_lon'] = dfe3[dfe3['SCHOOL OWNERSHIP'] == 'Government'].apply(lambda x: df_mw_hq.at[x['council_name'], 'longitude_hq'], axis=1)

In [None]:
#3. Calculate distance to council HQ
dfe3['d_council_hq'] = dfe3[dfe3['SCHOOL OWNERSHIP'] == 'Government'].apply(lambda x: calc_d_km((x[lat_col_fix], x[lon_col_fix]), (x['council_hq_lat'], x['council_hq_lon'])), axis=1)

#(v) Drop unneeded columns
dfe3 = drop_columns(dfe3, ['council_hq_lat', 'council_hq_lon'])

#Save fe3
#dfe3.to_csv(fe3_csv_path)

#Check
dfe3.shape #(17900, 41)

### 4. "fe4" Extract categorical variables
*Extract categorical features and results (potential ML class labels) for analysis*

**Steps:**
1. Extract: `'context'` from `'council_name'` between urban (TC, MC, CC) and rural (all others)
2. Extract results categorizations by PSLE quantiles (y-cat)

**Learnings:** (🧑🏻‍💻📚😎⚠️)
- 🧑🏻‍💻 Pandas `qcut` is the star method here! 
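
Both steps can be sketched on toy data. The following is illustrative only: `extract_context` is not shown here, so `extract_context_sketch` is a hypothetical stand-in that assumes the council type (TC, MC, CC) appears as a suffix of the council name, and the council names and scores are made up.

```python
import pandas as pd

URBAN_SUFFIXES = ("TC", "MC", "CC")  # town, municipal, city councils

def extract_context_sketch(council_name):
    """Hypothetical stand-in for extract_context: classify by name suffix."""
    return "urban" if council_name.strip().endswith(URBAN_SUFFIXES) else "rural"

# Made-up council names
councils = pd.Series(["Alpha MC", "Beta DC", "Gamma TC", "Delta CC"])
context = councils.apply(extract_context_sketch)

# Made-up scores; qcut splits at quantiles so the bins are equally populated
scores = pd.Series([80, 120, 150, 170, 200, 230], name="average_300")
halves = pd.qcut(scores, 2, labels=["lower", "upper"])
```

Unlike `pd.cut`, which uses equal-width bins, `pd.qcut` places the bin edges at quantiles, which is why the value counts come out roughly even. When the result is computed on a subset and assigned back to the full DataFrame, pandas aligns on the index, so rows outside the subset are left as NaN.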

In [None]:
#1. Extract context
dfe4 = dfe3.copy()
dfe4['context'] = dfe4['council_name'].apply(extract_context)
dfe4['context'].value_counts()

In [None]:
#2. Extract y-cat (Gov-only)

#Separate Gov
dfe4g = dfe4[dfe4['SCHOOL OWNERSHIP'] == 'Government']

#Prepare labels
labels_2tile = ['lower', 'upper']
labels_5tile = ['lowest','second', 'middle', 'fourth', 'highest']

#All government schools
dfe4['average_2tile'] = pd.qcut(dfe4g['average_300'], 2, labels=labels_2tile)
dfe4['average_5tile'] = pd.qcut(dfe4g['average_300'], 5, labels=labels_5tile)

#Check distribution of values
#MANUAL CHECK: Excel Data-Filter=Gov > Data-Sort=average_300 > check average_*_tile
dfe4['average_2tile'].value_counts() #even ~16361/2
dfe4['average_5tile'].value_counts() #even ~16361/5

#Save fe4
#dfe4.to_csv(fe_csv_path)

#Check
dfe4.shape #(17900, 44)