# Tanzania Primary Education Results (NECTA PSLE)

### 2a. Data Sourcing - NECTA
2. Light data cleaning of NECTA data
3. Light feature extraction from NECTA data

#### Inputs:
* nation_necta_raw.csv (17935, 10)

#### Outputs:
* nation_necta_features.csv (17900, 12)
* nation_necta_missing.csv (35, 10) 

#### Functions:
* `assign_grade`
* `calc_approx_marks_SD`

In [1]:
#Libraries: pre-installed in Anaconda
import numpy as np
import pandas as pd
from ast import literal_eval
#User-defined functions.py
import functions as fn

In [2]:
#Read from CSV
df_n = pd.read_csv('dataout/2a/nation_necta_raw.csv', index_col=0)
df_n.shape

(17935, 10)

### 2. Light data cleaning of NECTA data
**ELI5 Summary:** *Check school examination data for obvious issues and corner cases*

**Steps:**
1. Check [Pandas data types](https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html#dtypes) with `info()`, and convert with `convert_dtypes()`, `astype()`, `literal_eval`
2. Check data value format and correctness with `describe()` and Excel Data-Filter
3. Check for unique `school_id` vs. duplicates
4. Check MISSING data values with `isna()`, drop using `dropna()`
5. Check data integrity: `num_sitters`, `grade`, `num_sitters_girls`, `num_sitters_boys`

**DATA observations:**
1. Pandas type conversions (object to category) reduce reported memory usage from 1.5+ MB to 1.2+ MB
2. Found: MISSING data below, repeated `school_name` ✅, `region_name` nunique=26 vs. [Regions_of_Tanzania](https://en.wikipedia.org/wiki/Regions_of_Tanzania) ✅ 
3. Unique `school_id` ✅
4. MISSING data: **DROP 35 cases** of missing average and grades data
5. Data integrity ✅
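
To illustrate observation 1, here is a minimal sketch of why converting a low-cardinality text column to `category` saves memory. The toy DataFrame `df_demo` and its values are made up for illustration; they are not the NECTA data.

```python
import pandas as pd

# Hypothetical toy data: a text column with few distinct values, like region_name
regions = ['Tanga', 'Arusha', 'Dodoma', 'Mwanza'] * 2500
df_demo = pd.DataFrame({'region_name': regions})

# deep=True counts the actual string contents, not just the pointers
bytes_text = df_demo['region_name'].memory_usage(deep=True)
bytes_category = df_demo['region_name'].astype('category').memory_usage(deep=True)

print(f'text:     {bytes_text:,} bytes')
print(f'category: {bytes_category:,} bytes')
```

A category column stores each distinct label once plus a small integer code per row, so the saving grows with the number of repeated values.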

**Learnings:** (🧑🏻‍💻📚😎⚠️)
- 🧑🏻‍💻 `info()` is a nice Pandas method to see column names, row counts (non-null vs. MISSING), and dtypes
- 🧑🏻‍💻⚠️ Learned how to deal with [SettingWithCopyWarning](https://www.dataquest.io/blog/settingwithcopywarning/): create `.copy()` when needing to change/add to a slice of a DataFrame
- 😎 Using **Excel Data-Filter** in parallel with Pandas is a quick way to sanity check data
- ⚠️ Avoid storing lists in a single CSV column: they are written out as text and read back in as strings, not lists
    - 😎 `ast.literal_eval` saved me!
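
The list-in-CSV problem from the last learning can be reproduced with a small sketch. The toy frame `df_demo` and its `school_id` values are hypothetical, and an in-memory buffer stands in for a file on disk.

```python
import io
from ast import literal_eval

import pandas as pd

# Hypothetical toy frame with a list-valued column, like JUMLA
df_demo = pd.DataFrame({'school_id': ['PS0000001', 'PS0000002'],
                        'JUMLA': [[1, 2, 3, 0, 0], [0, 4, 5, 1, 0]]})

# Write to an in-memory CSV and read it back
buffer = io.StringIO()
df_demo.to_csv(buffer, index=False)
buffer.seek(0)
df_back = pd.read_csv(buffer)

raw_value = df_back.loc[0, 'JUMLA']
print(type(raw_value))  # the list came back as text

# literal_eval parses the Python literal back into a real list
df_back['JUMLA'] = df_back['JUMLA'].apply(literal_eval)
print(df_back.loc[0, 'JUMLA'])
```

`literal_eval` only accepts Python literals (lists, numbers, strings and so on), which makes it a safer choice than `eval` for this job.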

In [3]:
#1. Data types
#df_n.info()

#1a. best possible
df_n2 = df_n.convert_dtypes() #returns a copy

#1b. object > category
categorical_list = ['grade', 'region_name', 'council_name']
df_n2[categorical_list] = df_n2[categorical_list].astype('category')

#1c. object: string > list
list_list = ['WASICHANA', 'WAVULANA', 'JUMLA']
for one_list in list_list:
    df_n2[one_list] = df_n2[one_list].apply(literal_eval)

df_n2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 17935 entries, 0 to 17934
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype   
---  ------        --------------  -----   
 0   school_name   17935 non-null  string  
 1   school_id     17935 non-null  string  
 2   num_students  17900 non-null  Int64   
 3   average_300   17900 non-null  Float64 
 4   grade         17900 non-null  category
 5   WASICHANA     17935 non-null  object  
 6   WAVULANA      17935 non-null  object  
 7   JUMLA         17935 non-null  object  
 8   region_name   17935 non-null  category
 9   council_name  17935 non-null  category
dtypes: Float64(1), Int64(1), category(3), object(3), string(2)
memory usage: 1.2+ MB


In [4]:
#2. Data values

#Rename column
df_n2.rename(columns={'num_students': 'num_sitters'}, inplace=True)
#Check Count, unique, min, max
df_n2.describe(include='all') #slow because of lists

| | school_name | school_id | num_sitters | average_300 | grade | WASICHANA | WAVULANA | JUMLA | region_name | council_name |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 17935 | 17935 | 17900.0 | 17900.0 | 17900 | 17935 | 17935 | 17935 | 17935 | 17935 |
| unique | 14673 | 17935 | | | 4 | 11493 | 11939 | 16214 | 26 | 184 |
| top | MUUNGANO | PS0101114 | | | C | [0, 0, 0, 0, 0] | [0, 0, 0, 0, 0] | [0, 0, 0, 0, 0] | Tanga | Moshi |
| freq | 67 | 1 | | | 13308 | 41 | 39 | 25 | 1049 | 252 |
| mean | | | 75.311229 | 157.340311 | | | | | | |
| std | | | 58.105393 | 34.422022 | | | | | | |
| min | | | 2.0 | 67.924 | | | | | | |
| 25% | | | 39.0 | 135.843575 | | | | | | |
| 50% | | | 62.0 | 151.2976 | | | | | | |
| 75% | | | 94.0 | 169.883625 | | | | | | |


In [5]:
#3. Duplicates - school_id
df_n2['school_id'].count() == df_n2['school_id'].nunique()

True
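
If the check above had returned `False`, the next step would be to see which rows collide. A minimal sketch on made-up data (the toy `df_demo` and its ids are hypothetical):

```python
import pandas as pd

# Hypothetical toy data with one repeated school_id
df_demo = pd.DataFrame({'school_id': ['PS01', 'PS02', 'PS02', 'PS03'],
                        'school_name': ['A', 'B', 'B2', 'C']})

# Same answer as count() == nunique() when no ids are missing
ids_unique = df_demo['school_id'].is_unique
print(ids_unique)

# keep=False flags every row in a duplicated group, not just the later repeats
dupes = df_demo[df_demo['school_id'].duplicated(keep=False)]
print(dupes)
```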

In [6]:
#4. Missing data

#Dropping MISSING 'average_300' etc. data before calculating NECTA features
df_n2_na = df_n2[df_n2.isna().any(axis=1)] #boolean indexing returns a copy; only used for export
#df_n2_na.to_csv('dataout/2a/nation_necta_missing.csv')

df_n3 = df_n2.dropna(axis=0, how='any').copy() #defaults: drop row if ANY NaN value
df_n3.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 17900 entries, 0 to 17934
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype   
---  ------        --------------  -----   
 0   school_name   17900 non-null  string  
 1   school_id     17900 non-null  string  
 2   num_sitters   17900 non-null  Int64   
 3   average_300   17900 non-null  Float64 
 4   grade         17900 non-null  category
 5   WASICHANA     17900 non-null  object  
 6   WAVULANA      17900 non-null  object  
 7   JUMLA         17900 non-null  object  
 8   region_name   17900 non-null  category
 9   council_name  17900 non-null  category
dtypes: Float64(1), Int64(1), category(3), object(3), string(2)
memory usage: 1.2+ MB


In [7]:
#5. Data integrity checks
#5a. num_sitters = JUMLA
df_n3['num_sitters_from_table'] = df_n3['JUMLA'].apply(np.sum) #apply to Series from DF column
assert (df_n3['num_sitters'] == df_n3['num_sitters_from_table']).all(), 'Found mismatch in number of exam sitters!'

#5b. grade = assign_grade(average_300)
df_n3['grade_from_average'] = df_n3['average_300'].apply(fn.assign_grade)
assert (df_n3['grade'] == df_n3['grade_from_average']).all(), 'Found mismatch in grade assignment!'

#5c. girls + boys = total
df_n3['num_sitters_girls'] = df_n3['WASICHANA'].apply(np.sum)
df_n3['num_sitters_boys'] = df_n3['WAVULANA'].apply(np.sum)
assert (df_n3['num_sitters_girls'] + df_n3['num_sitters_boys'] == df_n3['num_sitters']).all(), 'Found mismatch in number of exam sitters by gender!'

### 3. Light feature extraction from NECTA data
**ELI5 Summary:**
*Calculate interesting numbers from NECTA raw data that may be useful later*

**Steps:** (Extraction calculations)
1. gender parity ratio (sitters)
2. %passing rate overall (grades A-C, not provided on webpage)
3. approx. marks SD (/300)
4. Light data cleaning on newly extracted columns

**Corner cases:**
* `pct_passed`: 8 schools with 0% pass rate (`num_sitters` = 3-47)
* `ratio_sitters_girls_boys`: 28 one-gender schools (ratio of 0 or inf); leave out of EDA
* `approx_marks_SD_300`: 197 schools with 0 SD meaning all same grade (`num_sitters` = 2-27)
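
To see why a school with all sitters in one grade gets an SD of 0, here is a rough sketch of one way an SD can be approximated from grade counts. This is NOT the actual `fn.calc_approx_marks_SD` (its code is not shown in this notebook). The function name `approx_marks_sd_sketch` is hypothetical, and the band midpoints are an assumption: five equal-width bands on the 300-mark scale, which may not match the real NECTA grade boundaries.

```python
import numpy as np

# ASSUMPTION: five equal-width grade bands A-E on the 300-mark scale,
# each represented by its midpoint. Illustrative values only.
ASSUMED_MIDPOINTS = np.array([270.0, 210.0, 150.0, 90.0, 30.0])

def approx_marks_sd_sketch(counts, midpoints=ASSUMED_MIDPOINTS):
    """Population SD of marks if every sitter scored their band's midpoint."""
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    if total == 0:
        return np.nan  # no sitters, SD undefined
    mean = (counts * midpoints).sum() / total
    variance = (counts * (midpoints - mean) ** 2).sum() / total
    return float(np.sqrt(variance))

sd_one_grade = approx_marks_sd_sketch([0, 0, 12, 0, 0])   # everyone in one grade
sd_two_grades = approx_marks_sd_sketch([0, 5, 5, 0, 0])   # split across two bands
print(sd_one_grade, sd_two_grades)
```

With every sitter in the same band, each assumed mark equals the mean, so the spread is exactly 0 no matter which midpoints are chosen.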

**Learnings:** (🧑🏻‍💻📚😎⚠️)
- 🧑🏻‍💻 Note the difference between two ways of computing per-row values in Pandas:
    - in-line (vectorized) operations when combining columns with simple mathematical operators
    - `apply()` when calling a function on each cell, e.g. `np.sum()` on a list-valued cell or a user-defined function
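
The two styles side by side, as a small sketch on made-up data (the toy `df_demo` is hypothetical but uses the same column names as this notebook). It also shows one possible way to leave one-gender schools out later.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data in the same shape as the NECTA columns
df_demo = pd.DataFrame({'num_sitters_girls': [10, 0, 8],
                        'num_sitters_boys': [5, 4, 0],
                        'JUMLA': [[3, 4, 5, 2, 1], [0, 1, 1, 2, 0], [2, 2, 2, 1, 1]]})

# In-line: the operator acts on whole columns at once
df_demo['ratio_sitters_girls_boys'] = df_demo['num_sitters_girls'] / df_demo['num_sitters_boys']

# apply(): the function is called once per cell, here on each list
df_demo['num_passed'] = df_demo['JUMLA'].apply(lambda x: np.sum(x[0:3]))  # A-C

# One-gender schools give a ratio of 0 or inf; keep only finite, positive ratios
ratio = df_demo['ratio_sitters_girls_boys']
two_gender = df_demo[np.isfinite(ratio) & (ratio > 0)]

print(df_demo[['ratio_sitters_girls_boys', 'num_passed']])
print(len(two_gender))
```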

In [8]:
#1. gender parity ratio
df_n3['ratio_sitters_girls_boys'] = df_n3['num_sitters_girls'] / df_n3['num_sitters_boys']

#2. %passing rate overall
#df['num_students_calc'] = df.apply(lambda x : np.sum(x['JUMLA']), axis=1) #apply to entire DF
df_n3['num_passed'] = df_n3['JUMLA'].apply(lambda x : np.sum(x[0:3])) #A-C
df_n3['pct_passed'] = df_n3['num_passed'] / df_n3['num_sitters']

#3. marks SD
df_n3['approx_marks_SD_300'] = df_n3['JUMLA'].apply(fn.calc_approx_marks_SD)

In [9]:
#4. Light data cleaning on newly extracted columns

extracted = ['num_sitters_girls', 'num_sitters_boys', 'ratio_sitters_girls_boys', 'num_passed', 'pct_passed', 'approx_marks_SD_300']
df_n3e = df_n3[extracted]

#(1) Data types
df_n3e.info()

#(2) Data values
df_n3e.describe() #with Excel Data-Filter
#df_n3e['approx_marks_SD_300'].value_counts()

#(4) Missing data
df_n3[df_n3e.isna().any(axis=1)]

<class 'pandas.core.frame.DataFrame'>
Int64Index: 17900 entries, 0 to 17934
Data columns (total 6 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   num_sitters_girls         17900 non-null  int64  
 1   num_sitters_boys          17900 non-null  int64  
 2   ratio_sitters_girls_boys  17900 non-null  float64
 3   num_passed                17900 non-null  int64  
 4   pct_passed                17900 non-null  Float64
 5   approx_marks_SD_300       17900 non-null  float64
dtypes: Float64(1), float64(2), int64(3)
memory usage: 996.4 KB


| | school_name | school_id | num_sitters | average_300 | grade | WASICHANA | WAVULANA | JUMLA | region_name | council_name | num_sitters_from_table | grade_from_average | num_sitters_girls | num_sitters_boys | ratio_sitters_girls_boys | num_passed | pct_passed | approx_marks_SD_300 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|


In [10]:
#Drop unneeded columns
df_n4 = df_n3.drop(['WASICHANA', 'WAVULANA', 'JUMLA', 'num_sitters_from_table', 'grade_from_average', 'num_passed'], axis=1)

#Save to CSV
#df_n4.to_csv('dataout/2a/nation_necta_features.csv')

In [11]:
#SPOT-CHECK CODE - handy, keep around!
#df_n4.info()
df_n4.shape
#df_n4.describe(include='all')
#df_n4[df_n4['school_id'] == 'PS1104063'] #JITEGEMEE @Morogoro MC
#df_n4.head()
#df_n4._is_copy

(17900, 12)

In [12]:
#SAVED CODE - keep, may re-use later in project development

#Feature Extraction - %passing rate by gender
df_n3['num_girls_passed'] = df_n3['WASICHANA'].apply(lambda x : np.sum(x[0:3])) #A-C
df_n3['pct_girls_passed'] = df_n3['num_girls_passed'] / df_n3['num_sitters_girls']

df_n3['num_boys_passed'] = df_n3['WAVULANA'].apply(lambda x : np.sum(x[0:3])) #A-C
df_n3['pct_boys_passed'] = df_n3['num_boys_passed'] / df_n3['num_sitters_boys']

df_n3['ratio_pct_girls_boys_passed'] = df_n3['pct_girls_passed'] / df_n3['pct_boys_passed']