# Tanzania Primary Education Results (NECTA PSLE)

### 2b. Data Cleaning

**ELI5 Summary:**
*Check the sourced data to ensure it is usable and correct for modeling, and to understand where data is missing*

**Steps:**
1. Data types
2. Data values
3. Duplicate rows/indices
4. Unneeded columns
5. MISSING data

#### Inputs:
* **nation_merged.csv (17900, 28)**

#### Outputs:
* **nation_cleaned.csv (17900, 28)**
* nation_cleaned_missing.csv (1652, 28)

In [1]:
#Libraries: pre-installed in Anaconda
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', 50)
#User-defined functions.py
import functions as fn

### 1. Data types

**ELI5 Summary:**
*Are all column data types the correct ones to represent the data? Can unexpected data types uncover hidden issues?*

**Steps:**
1. Read in the CSV, then convert to the best possible Pandas dtypes with `convert_dtypes()`
2. Check `info()` and implement additional conversions

**DATA observations:**
- Two types of additional conversions: string > category, Float64 > Int64 (CG TZS)
    - `info()`: string > category reduces memory usage from 4.3+ MB to 3.9+ MB
    - `PQTR` not converted to Int64 because of `inf` handling (see below)
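The memory effect of the string > category conversion can be checked on a toy column. This is a minimal sketch that assumes made-up region values rather than the real `nation_merged.csv` data; `memory_usage(deep=True)` is used so that the string contents themselves are counted:

```python
import pandas as pd

# Hypothetical low-cardinality column: a few region names repeated many times
regions = pd.Series(['Tanga', 'Moshi', 'Morogoro', 'Tanga'] * 1000, dtype='string')

# category stores each distinct label once plus a small integer code per row
as_category = regions.astype('category')

bytes_string = regions.memory_usage(deep=True)
bytes_category = as_category.memory_usage(deep=True)
print(bytes_string, bytes_category)
```

The saving grows with the number of rows and shrinks as the number of distinct labels approaches the number of rows, so a near-unique column such as `school_name` would likely gain little.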

**Learnings:** (🧑🏻‍💻📚😎⚠️)
- 🧑🏻‍💻 Pandas dtype [Int64](https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html) supports `pd.NA` in an otherwise integer column (vs. needing to use float64)
- ⚠️ Unlike `pd.NA`, `inf` cannot be represented as Int64 => becomes -9223372036854775808!
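One way to avoid the `inf` problem is to turn `inf` into a missing value before casting. The sketch below uses hypothetical PQTR-like values (the index labels `S1` to `S4` are invented) and is only an illustration of the idea, not the notebook's own cell:

```python
import numpy as np
import pandas as pd

# Hypothetical values: two valid ratios, one inf (no qualified teachers), one missing
pqtr = pd.Series([39.0, 57.0, np.inf, np.nan], index=['S1', 'S2', 'S3', 'S4'])

# Replace inf with a missing value first, then cast to the nullable integer dtype
pqtr_int = pqtr.replace(np.inf, np.nan).astype('Int64')

print(pqtr_int.dtype)         # Int64
print(pqtr_int.isna().sum())  # 2
```

After the cast, both the original missing value and the former `inf` show up as `<NA>`, and the remaining values are true integers.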

In [2]:
#Read in output file from [2a] Data Sourcing
df_c1 = pd.read_csv('dataout/2a/nation_merged.csv', index_col='school_id')
df_c1.shape

(17900, 28)

In [3]:
#(1) Data types
#df_c1.info()

#(1.1) Best possible conversion
df_c11 = df_c1.convert_dtypes()
#df_c11.info()

#(1.2) Additional conversions
#string > category
categorical_list = ['grade', 'region_name', 'council_name', 'SCHOOL OWNERSHIP']
df_c11[categorical_list] = df_c11[categorical_list].astype('category') #reduces from 4.3+ MB to 3.9+ MB

#Simplify Capitation Grant amounts: round to integers as units are in TZS (smallest circulating coin is 50 TZS)
float_round_list = ['CG', 'CG_per_student']
df_c11[float_round_list] = df_c11[float_round_list].round().astype('Int64')
#df_c11['CG_per_student'].isna().sum()

#check vs. data dictionary
df_c11.info()

<class 'pandas.core.frame.DataFrame'>
Index: 17900 entries, PS0101114 to PS2001122
Data columns (total 28 columns):
 #   Column                    Non-Null Count  Dtype   
---  ------                    --------------  -----   
 0   school_name               17900 non-null  string  
 1   num_sitters               17900 non-null  Int64   
 2   average_300               17900 non-null  Float64 
 3   grade                     17900 non-null  category
 4   region_name               17900 non-null  category
 5   council_name              17900 non-null  category
 6   num_sitters_girls         17900 non-null  Int64   
 7   num_sitters_boys          17900 non-null  Int64   
 8   ratio_sitters_girls_boys  17900 non-null  Float64 
 9   pct_passed                17900 non-null  Float64 
 10  approx_marks_SD_300       17900 non-null  Float64 
 11  WARD                      17874 non-null  string  
 12  SCHOOL OWNERSHIP          17874 non-null  category
 13  SCHOOL REG. NUMBER        17874 non-nul

In [4]:
#Experiment: could PQTR with np.inf be converted to Int64? NO
df_c11[np.isinf(df_c11['PQTR'])] #2 rows: PS1702070, PS1802012
df_c11a = df_c11.copy()
df_c11a['PQTR'] = df_c11a['PQTR'].astype('Int64')
df_c11a.loc[['PS1702070', 'PS1802012']] #'PQTR' = -9223372036854775808
df_c11a['PQTR'].describe()

count    1.787400e+04
mean     6.197717e+01
std      9.756223e+16
min     -9.223372e+18
25%      3.900000e+01
50%      5.700000e+01
75%      8.000000e+01
max      8.680000e+02
Name: PQTR, dtype: float64

### 2. Data values

**ELI5 Summary:**
*Are the data values themselves in a correct format and in a reasonably expected range?*

**Steps:**
1. Pandas `describe()` to check numerical values // Excel Data-Filter
2. Pandas `describe(include=['string', 'category'])` to check string/categorical values // Excel Data-Filter
3. Make obvious fixes

**DATA observations:**
- Note ~MISSING:
    - 0/inf: `'ratio_sitters_girls_boys'` (28), `'RATIO GIRLS-BOYS'` (6)
    - inf: `'PQTR'` (2)
- Note potential OUTLIERS: `'num|TOTAL*'`, `'ratio|RATIO*'`, `'P(Q)TR'`, `'PBR_average'`, `'CG*'`
- FIX most obviously incorrect coordinates based on [Tanzania](https://www.openstreetmap.org/) extreme land borders
    - `'LATITUDE'` South < -11.76: **2 schools**
    - `'LONGITUDE'` West < 29.60: **6 schools**
    - `'LONGITUDE'` East > 40.42: **4 schools**
    - Others from detailed/local knowledge of Morogoro Region
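The border check above can be expressed as a simple bounding-box filter. This is a rough sketch using only the three thresholds quoted in the list; the helper name `flag_out_of_bounds` and the demo rows are hypothetical, and no northern limit is applied because none is given here:

```python
import pandas as pd

# Thresholds taken from the observations above (decimal degrees)
LAT_SOUTH = -11.76   # anything further south is outside the land border
LON_WEST = 29.60     # anything further west is outside the land border
LON_EAST = 40.42     # anything further east is outside the land border

def flag_out_of_bounds(df):
    """Return a boolean mask of rows whose coordinates fall outside the box."""
    lat, lon = df['LATITUDE'], df['LONGITUDE']
    return (lat < LAT_SOUTH) | (lon < LON_WEST) | (lon > LON_EAST)

# Hypothetical demo rows: A is plausible, B is too far south, C too far west, D too far east
demo = pd.DataFrame(
    {'LATITUDE': [-6.70, -12.50, -4.91, -5.00],
     'LONGITUDE': [37.64, 35.00, 28.90, 41.00]},
    index=['A', 'B', 'C', 'D'],
)
flags = flag_out_of_bounds(demo)
print(demo[flags])
```

A box like this only catches the most obvious errors: a wrong coordinate that still lands inside the box needs local knowledge to spot, and rows with missing coordinates need separate handling.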

**Learnings:** (🧑🏻‍💻📚😎⚠️)
- 📚 Note ~MISSING 0/inf/NaN or potential OUTLIER values but don't process now (focus on format, value correctness!)
- 😎 Excel Data-Filter is a fast "no code" way for quick data sanity checks

In [5]:
#(2) Data values
df_c2 = df_c11.copy()

#Excel Data-Filter
#df_c2.to_csv('dataout/2b/nation_cleaning2.csv')

#(2.1) Check numerical values, ranges
df_c2.describe() #numerical

#(2.2) Check string, categorical values, unique
df_c2.describe(include=['string', 'category'])
#df_c2['grade'].value_counts() #categorical

| | school_name | grade | region_name | council_name | WARD | SCHOOL OWNERSHIP | SCHOOL REG. NUMBER |
|---|---|---|---|---|---|---|---|
| count | 17900 | 17900 | 17900 | 17900 | 17874 | 17874 | 17874 |
| unique | 14651 | 4 | 26 | 184 | 3629 | 2 | 17874 |
| top | MUUNGANO | C | Tanga | Moshi | Majengo | Government | EM.13504 |
| freq | 66 | 13308 | 1049 | 252 | 49 | 16339 | 1 |


In [6]:
#(2.3) Make obvious fixes
df_c23 = df_c2.copy()

#National level CORRECTIONS (Tanzania land border extremes)
df_c23.at['PS1601101', 'LATITUDE'] = -11.103584 #MIMBUA
df_c23.at['PS1601101', 'LONGITUDE'] = 34.840564 #MIMBUA
df_c23.at['PS1606078', 'LATITUDE'] = -10.768047 #NDINGINE
df_c23.at['PS1606078', 'LONGITUDE'] = 34.727641 #NDINGINE
df_c23.at['PS0604052', 'LONGITUDE'] = 29.6452256 #SAUD AL AUJAN
df_c23.at['PS0604041', 'LATITUDE'] = -4.910021 #BENJAMINI MKAPA
df_c23.at['PS0604041', 'LONGITUDE'] = 29.605573 #BENJAMINI MKAPA
df_c23.at['PS2401087', 'LONGITUDE'] = 31.550985 #GENGENI
df_c23.at['PS0505236', 'LATITUDE'] = -2.624282 #ABDULAHAMAN BABU
df_c23.at['PS0505236', 'LONGITUDE'] = 34.044296 #ABDULAHAMAN BABU
df_c23.at['PS0604054', 'LATITUDE'] = -4.8841702 #CAMBRIDGESHIRE
df_c23.at['PS0604054', 'LONGITUDE'] = 29.670387 #CAMBRIDGESHIRE
df_c23.at['PS2011043', 'LATITUDE'] = -4.863825 #KWENDOGHOI
df_c23.at['PS2011043', 'LONGITUDE'] = 38.464221 #KWENDOGHOI
df_c23.at['PS1207011', 'LATITUDE'] = -10.696297 #MASASI
df_c23.at['PS1207011', 'LONGITUDE'] = 38.829797 #MASASI
df_c23.at['PS0304008', 'LATITUDE'] = -6.346100 #CHAZUNGWA
df_c23.at['PS0304008', 'LONGITUDE'] = 36.485880 #CHAZUNGWA
df_c23.at['PS0802044', 'LATITUDE'] = -10.259789 #MPEMBE
df_c23.at['PS0802044', 'LONGITUDE'] = 39.589261 #MPEMBE

#Morogoro Region CORRECTIONS
#GREEN CITY @Morogoro MC #LATITUDE/LONGITUDE in Kenya!
df_c23.at['PS1104096', 'LATITUDE'] = -6.6999783
df_c23.at['PS1104096', 'LONGITUDE'] = 37.6427091
#MBUYUNI @Ulanga
df_c23.at['PS1105041', 'LATITUDE'] = -8.2603824
#MDOKONYOLE @Kisaki, Morogoro
df_c23.at['PS1103141', 'LATITUDE'] = -7.45963
#MSASANI @Dakawa, Mvomero
df_c23.at['PS1106146', 'LATITUDE'] = -6.4307524
df_c23.at['PS1106146', 'LONGITUDE'] = 37.4738596
#IPERA @Njiwa, Malinyi
df_c23.at['PS1108004', 'LATITUDE'] = -8.6699422
#TAMBUU @Lundi, Morogoro
df_c23.at['PS1103106', 'LATITUDE'] = -7.0698951
df_c23.at['PS1103106', 'LONGITUDE'] = 37.7400534

df_c23.shape
#df_c23.to_csv('dataout/2b/nation_cleaning23.csv')

(17900, 28)

### 3. Duplicate rows/indices

**ELI5 Summary:**
*Are there any duplicate records to remove?*

**Steps:**
1. Check duplicated NECTA exam ID (now the index)
2. Check duplicated school registration ID

**DATA observations:**
- ALL issues fixed in [2a] Data Sourcing! 😃

**Learnings:** (🧑🏻‍💻📚😎⚠️)
- 📚 **Big lesson here:** *clean data as early as you can* to avoid more complicated issues such as duplicate rows/indices later (after merges and other data transformations)!
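One possible way to catch duplicates early is to let the merge itself fail when a key is not unique, using the `validate` argument of `pd.merge`. The merge performed in [2a] is not shown in this notebook, so the two frames below are hypothetical and only illustrate the idea:

```python
import pandas as pd
from pandas.errors import MergeError

# Hypothetical exam results: one row per school
results = pd.DataFrame({'school_id': ['PS0000001', 'PS0000002'],
                        'average_300': [150.0, 160.0]})

# Hypothetical school profiles with an accidental duplicate key
profiles = pd.DataFrame({'school_id': ['PS0000001', 'PS0000001'],
                         'PTR': [40, 45]})

try:
    # validate='one_to_one' checks that the merge key is unique on both sides
    pd.merge(results, profiles, on='school_id', how='left', validate='one_to_one')
    merge_ok = True
except MergeError:
    merge_ok = False

print(merge_ok)  # False
```

Failing at merge time points directly at the frame that needs de-duplicating, instead of leaving duplicated rows to be discovered several transformations later.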

In [7]:
#(3) Duplicate rows/indices
df_c3 = df_c23.copy()

#Check duplicated 'school_id' index column
df_c3.index.duplicated(keep=False).sum() #result: 0

#Check duplicated 'SCHOOL REG. NUMBER' other than NA
df_c3[df_c3['SCHOOL REG. NUMBER'].notna()].duplicated(subset='SCHOOL REG. NUMBER', keep=False).sum() #result: 0

0

### 4. Unneeded columns

**ELI5 Summary:**
*Are there any columns that have no potential use and can be dropped?*

**Steps:**
1. Review current columns with `info()`

**DATA observations:**
- All 28 current columns have some potential use

In [8]:
#(4) Unneeded columns
df_c3.info()

<class 'pandas.core.frame.DataFrame'>
Index: 17900 entries, PS0101114 to PS2001122
Data columns (total 28 columns):
 #   Column                    Non-Null Count  Dtype   
---  ------                    --------------  -----   
 0   school_name               17900 non-null  string  
 1   num_sitters               17900 non-null  Int64   
 2   average_300               17900 non-null  Float64 
 3   grade                     17900 non-null  category
 4   region_name               17900 non-null  category
 5   council_name              17900 non-null  category
 6   num_sitters_girls         17900 non-null  Int64   
 7   num_sitters_boys          17900 non-null  Int64   
 8   ratio_sitters_girls_boys  17900 non-null  Float64 
 9   pct_passed                17900 non-null  Float64 
 10  approx_marks_SD_300       17900 non-null  Float64 
 11  WARD                      17874 non-null  string  
 12  SCHOOL OWNERSHIP          17874 non-null  category
 13  SCHOOL REG. NUMBER        17874 non-nul

### 5. MISSING data

**ELI5 Summary:**
*Where are the missing data cases, why did they occur, and how do we handle them?*

**Steps:**
1. Check normal NA missing data
2. Check 0/inf "missing" data, i.e. not meaningful in below cases and should be converted to NA
3. Check for increased NA count and CLEAN describe() stats > save CSV (and "missing" CSV)

**DATA observations:**
1. MISSING data cases
    1. All TAMISEMI data not matched (26)
    2. No books recorded for `'PBR_average'` (+3)
    3. No grants data for `'CG*'` (1645)
2. ~MISSING data cases 0/inf => replace with `pd.NA`
    1. No qualified teachers for `'PQTR'` (+2) => convert dtype to `Int64`
    2. All-boys (0) / all-girls (inf) for
        - `'ratio_sitters_girls_boys'` (28)
        - `'RATIO GIRLS-BOYS'` (+6)
- Total of **1652 schools/rows** have missing data

**Learnings:** (🧑🏻‍💻📚😎⚠️)
- 🧑🏻‍💻 NumPy `np.isinf` is useful to find infinite values in a DataFrame column
- 📚 Sometimes `0` is also not meaningful and should be converted to NA, e.g., `'ratio|RATIO*'` here
- ⚠️ Pandas option `use_inf_as_na` didn't work as expected with NA methods here (it is also deprecated as of pandas 2.1), so replacing `inf` explicitly is the safer route
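The count-then-replace pattern behind these learnings can be tried on a toy frame. The sketch below assumes invented values and a hypothetical `ratio` column standing in for the two ratio columns; it counts 0/inf entries, converts them to missing values, and then extracts the rows with any missing value:

```python
import numpy as np
import pandas as pd

# Hypothetical data: S2 has a ratio of 0, S3 has inf in both columns
demo = pd.DataFrame(
    {'ratio': [1.2, 0.0, np.inf, 0.9],
     'PQTR': [39.0, 57.0, np.inf, 80.0]},
    index=['S1', 'S2', 'S3', 'S4'],
)

# Count the values that are not meaningful before touching them
n_inf_pqtr = int(np.isinf(demo['PQTR']).sum())
n_bad_ratio = int((np.isinf(demo['ratio']) | (demo['ratio'] == 0)).sum())

# Convert them to missing values
cleaned = demo.copy()
cleaned['PQTR'] = cleaned['PQTR'].replace(np.inf, np.nan)
cleaned['ratio'] = cleaned['ratio'].replace([0, np.inf], np.nan)

# Rows with at least one missing value, as written to the "missing" CSV
missing_rows = cleaned[cleaned.isna().any(axis=1)]
print(n_inf_pqtr, n_bad_ratio, list(missing_rows.index))
```

Comparing the counts taken before the replacement with the increase in `isna().sum()` afterwards gives a quick check that exactly the intended values were converted.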

In [9]:
#(5) MISSING data
df_c5 = df_c3.copy()

#(5.1) Check normal NA missing data
df_c5.isna().sum()

school_name                    0
num_sitters                    0
average_300                    0
grade                          0
region_name                    0
council_name                   0
num_sitters_girls              0
num_sitters_boys               0
ratio_sitters_girls_boys       0
pct_passed                     0
approx_marks_SD_300            0
WARD                          26
SCHOOL OWNERSHIP              26
SCHOOL REG. NUMBER            26
LATITUDE                      26
LONGITUDE                     26
TOTAL BOYS                    26
TOTAL GIRLS                   26
PTR                           26
PQTR                          26
TOTAL STUDENTS                26
RATIO GIRLS-BOYS              26
PBR_average                   29
CG                          1645
CG_per_student              1645
approx_ages_SD                26
approx_ages_mean              26
ages_median                   26
dtype: int64

In [10]:
#(5.2) Check 0/inf "missing" data, i.e. not meaningful in below cases and should be converted to NA

#pd.options.mode.use_inf_as_na = True #didn't work for this dataset!?

#np.isinf(df_c5['PQTR']).sum() #result: 2
#(np.isinf(df_c5['ratio_sitters_girls_boys']) | (df_c5['ratio_sitters_girls_boys'] == 0)).sum() #result: 28
#(np.isinf(df_c5['RATIO GIRLS-BOYS']) | (df_c5['RATIO GIRLS-BOYS'] == 0)).sum() #result: 6

df_c5['PQTR'] = df_c5['PQTR'].replace(np.inf, pd.NA).astype('Int64')
ratio_cols = ['ratio_sitters_girls_boys', 'RATIO GIRLS-BOYS']
df_c5[ratio_cols] = df_c5[ratio_cols].replace([0, np.inf], pd.NA)

#(5.3) Check for increased NA count and CLEAN describe() stats
df_c5.isna().sum()
df_c5.describe()

| | num_sitters | average_300 | num_sitters_girls | num_sitters_boys | ratio_sitters_girls_boys | pct_passed | approx_marks_SD_300 | LATITUDE | LONGITUDE | TOTAL BOYS | TOTAL GIRLS | PTR | PQTR | TOTAL STUDENTS | RATIO GIRLS-BOYS | PBR_average | CG | CG_per_student | approx_ages_SD | approx_ages_mean | ages_median |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 17900.0 | 17900.0 | 17900.0 | 17900.0 | 17872.0 | 17900.0 | 17900.0 | 17874.0 | 17874.0 | 17874.0 | 17874.0 | 17874.0 | 17872.0 | 17874.0 | 17868.0 | 17871.0 | 16255.0 | 16255.0 | 17874.0 | 17874.0 | 17874.0 |
| mean | 75.311229 | 157.340311 | 39.56257 | 35.748659 | 1.174122 | 0.790012 | 34.181011 | -5.617327 | 35.115083 | 305.58739 | 313.892581 | 60.674499 | 61.984109 | 619.479971 | 1.024973 | 4.037403 | 4066465.0 | 6409.354783 | 2.245848 | 10.040489 | 10.020057 |
| std | 58.105393 | 34.422022 | 31.176419 | 28.010165 | 0.520386 | 0.196365 | 7.539484 | 2.81659 | 2.705905 | 232.503023 | 241.665813 | 30.41365 | 35.744863 | 472.659832 | 0.141556 | 4.178673 | 2884566.0 | 1716.296793 | 0.222028 | 0.578691 | 0.796209 |
| min | 2.0 | 67.924 | 0.0 | 0.0 | 0.028571 | 0.0 | 0.0 | -11.952582 | 29.226433 | 0.0 | 0.0 | 1.0 | 2.0 | 13.0 | 0.24 | 0.241935 | 40717.0 | 552.0 | 0.469333 | 6.691083 | 6.0 |
| 25% | 39.0 | 135.843575 | 20.0 | 18.0 | 0.886364 | 0.674353 | 30.554686 | -7.89917 | 32.987519 | 157.0 | 158.0 | 39.0 | 39.0 | 317.0 | 0.94303 | 2.113054 | 2215315.0 | 6022.0 | 2.102325 | 9.708944 | 10.0 |
| 50% | 62.0 | 151.2976 | 32.0 | 29.0 | 1.090909 | 0.840909 | 34.362236 | -5.026778 | 34.913331 | 253.0 | 259.0 | 57.0 | 57.0 | 513.0 | 1.019704 | 2.904063 | 3418074.0 | 6368.0 | 2.252514 | 10.065048 | 10.0 |
| 75% | 94.0 | 169.883625 | 50.0 | 45.0 | 1.350162 | 0.953739 | 38.139917 | -3.321528 | 37.47827 | 380.0 | 394.0 | 79.0 | 80.0 | 773.0 | 1.10117 | 4.47614 | 5090749.0 | 6707.0 | 2.392925 | 10.409149 | 10.0 |
| max | 920.0 | 296.4286 | 488.0 | 432.0 | 15.0 | 1.0 | 82.680687 | -1.000001 | 40.974404 | 3606.0 | 3689.0 | 608.0 | 868.0 | 7295.0 | 4.346667 | 124.076565 | 48065620.0 | 202333.0 | 3.546118 | 12.972477 | 14.0 |


In [11]:
#Save to CSV
df_c5.shape
#df_c5[df_c5.isna().any(axis=1)].to_csv('dataout/2b/nation_cleaned_missing.csv')
#df_c5.to_csv('dataout/2b/nation_cleaned.csv')

(17900, 28)

In [12]:
#SPOT-CHECK CODE - handy, keep around!
df_c5.info()
#df_c5.info()
#df_c5.describe()
#df_x24[df_x24['school_id'] == 'PS1104063'] #JITEGEMEE @Morogoro MC
#df_x24.loc['PS1104063'] #JITEGEMEE @Morogoro MC
#df_c2.head(10)
#df_x13._is_copy

<class 'pandas.core.frame.DataFrame'>
Index: 17900 entries, PS0101114 to PS2001122
Data columns (total 28 columns):
 #   Column                    Non-Null Count  Dtype   
---  ------                    --------------  -----   
 0   school_name               17900 non-null  string  
 1   num_sitters               17900 non-null  Int64   
 2   average_300               17900 non-null  Float64 
 3   grade                     17900 non-null  category
 4   region_name               17900 non-null  category
 5   council_name              17900 non-null  category
 6   num_sitters_girls         17900 non-null  Int64   
 7   num_sitters_boys          17900 non-null  Int64   
 8   ratio_sitters_girls_boys  17872 non-null  Float64 
 9   pct_passed                17900 non-null  Float64 
 10  approx_marks_SD_300       17900 non-null  Float64 
 11  WARD                      17874 non-null  string  
 12  SCHOOL OWNERSHIP          17874 non-null  category
 13  SCHOOL REG. NUMBER        17874 non-nul