## Data Validation: Country-Level Entity Filtering

### Validation Approaches

**Approach 1 - Manual Validation:**  
Cross-reference country entities against World Bank official classifications to identify regional aggregates, income groups, and non-sovereign entities.

**Approach 2 - ISO Standard Validation:**  
Use the ISO 3166-1 alpha-3 standard, via the pycountry library, to validate country codes programmatically against internationally recognized codes.
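
A minimal sketch of the ISO check in Approach 2 might look like the following. It assumes pycountry is installed; the fallback set, the `invalid_codes` helper, and the sample codes are hypothetical and exist only so the example runs on its own. This is not the project's `Cleaner` implementation.

```python
# Build the set of valid ISO 3166-1 alpha-3 codes from pycountry when available.
# The fallback is a tiny illustrative subset, not a complete list (assumption).
try:
    import pycountry

    ISO_ALPHA3 = {country.alpha_3 for country in pycountry.countries}
except ImportError:
    ISO_ALPHA3 = {"ABW", "AFG", "AGO", "ALB", "AND"}


def invalid_codes(codes):
    """Return the codes that are not valid ISO alpha-3 codes, in input order."""
    return [code for code in codes if code not in ISO_ALPHA3]


# Hypothetical sample mixing country codes with World Bank aggregate codes.
sample = ["ABW", "AFG", "ARB", "EUU", "HIC"]
flagged = invalid_codes(sample)
print(flagged)
```

A strict ISO check flags every code that falls outside the standard, whatever the reason, so the flagged list is worth reviewing by hand before it is used as an exclusion list.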

### Filtering Methodology

1. **Generate an exclusion list** of non-country entities using the selected validation approach
2. **Apply the primary filter** to the EdStatsCountry dataset
3. **Propagate the filter** across all related datasets (EdStatsData, EdStatsCountry-Series, EdStatsFootNote, EdStatsSeries) to maintain referential integrity


### Technical Implementation
Identify the country identifier field in each dataset, apply the exclusion filter using boolean indexing with the `.isin()` method, and export the sanitized results with a consistent file structure.
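
The steps described above can be sketched roughly as follows. The candidate column names in `CODE_COLUMNS`, the `drop_excluded` helper, the toy frames, and the temporary output directory are all assumptions made for illustration. The project's `Cleaner` and `Saver` classes may work differently.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Assumed candidate names for the country identifier column (hypothetical).
CODE_COLUMNS = ("Country Code", "CountryCode")


def drop_excluded(df, excluded):
    """Return df without the rows whose country code appears in `excluded`."""
    code_col = next((col for col in CODE_COLUMNS if col in df.columns), None)
    if code_col is None:
        return df  # no country identifier column: leave the frame untouched
    # Boolean indexing: keep rows whose code is NOT in the exclusion list.
    return df[~df[code_col].isin(excluded)].reset_index(drop=True)


# Toy stand-ins for the loaded datasets.
frames = {
    "EdStatsCountry": pd.DataFrame(
        {
            "Country Code": ["ABW", "ARB", "AFG"],
            "Short Name": ["Aruba", "Arab World", "Afghanistan"],
        }
    ),
    "EdStatsFootNote": pd.DataFrame(
        {"CountryCode": ["ARB", "AFG", "EUU"], "Year": ["YR2010"] * 3}
    ),
}
excluded = ["ARB", "EUU"]

# Apply the same exclusion list to every dataset.
cleaned = {name: drop_excluded(df, excluded) for name, df in frames.items()}

# Export each cleaned frame under its original file name.
out_dir = Path(tempfile.mkdtemp())
for name, df in cleaned.items():
    df.to_csv(out_dir / f"{name}.csv", index=False)
```

Reusing one exclusion list for every frame keeps the datasets consistent with each other: a code removed from EdStatsCountry cannot survive in a related table.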

**Outcome:** Datasets contain only country-level observations, excluding supranational aggregates and non-country groupings.

---

In [1]:
# Standard library imports
from pathlib import Path

# Third-party imports - Jupyter/IPython
from IPython.display import display, HTML, Markdown

# Local application imports
from project2.data.Filesloader import Reader
from project2.data.Saver import Saver
from project2.utils.Cleaner import Cleaner
from project2.utils.Config import Config
from project2.utils.DataInspector import BasicInfo, VisualInspector

# Jupyter magic commands
%load_ext autoreload
%autoreload 2

In [2]:
# Initialize data pipeline components
data_loader = Reader()
data_inspector = BasicInfo(data_loader)
display_config = Config()
file_saver = Saver()
conf = Config()
cleaner = Cleaner()

### Loading the Semi-Processed Files

In [3]:
# Load the semi-processed (interim) data files
data_pattern = '../data/interim/*.csv'
file_names, data_dict = data_loader.load_raw_files_csv(data_pattern)

Unnamed: 0,Filesnames
0,EdStatsCountry
1,EdStatsCountry-Series
2,EdStatsData
3,EdStatsFootNote
4,EdStatsSeries


### Loading the EdStatsCountry Data and Filtering Out Non-Countries

In [4]:
# EdStatsCountry
name = 'EdStatsCountry'
df = data_dict[name].convert_dtypes()
conf.pdconfig(nrows=None, cols_width=None, precision=None)
df.head()

Pandas config display options set.

Unnamed: 0,Country Code,Short Name,Table Name,Long Name,2-alpha code,Currency Unit,Special Notes,Region,Income Group,WB-2 code,National accounts base year,SNA price valuation,Lending category,System of National Accounts,PPP survey year,Balance of Payments Manual in use,External debt Reporting status,System of trade,Government Accounting concept,IMF data dissemination standard,Latest population census,Latest household survey,Source of most recent Income and expenditure data,Vital registration complete,Latest agricultural census,Latest industrial data,Latest trade data,Latest water withdrawal data
0,ABW,Aruba,Aruba,Aruba,AW,Aruban florin,SNA data for 2000-2011 are updated from official government statistics; 1994-1999 from UN databases. Base year has changed from 1995 to 2000.,Latin America & Caribbean,High income: nonOECD,AW,2000,Value added at basic prices (VAB),,Country uses the 1993 System of National Accounts methodology.,,"IMF Balance of Payments Manual, 6th edition.",,Special trade system,,,2010,,,Yes,,,2012.0,
1,AFG,Afghanistan,Afghanistan,Islamic State of Afghanistan,AF,Afghan afghani,Fiscal year end: March 20; reporting period for national accounts data: FY (from 2013 are CY). National accounts data are sourced from the IMF and differ from the Central Statistics Organization numbers due to exclusion of the opium economy.,South Asia,Low income,AF,2002/03,Value added at basic prices (VAB),IDA,Country uses the 1993 System of National Accounts methodology.,,,Actual,General trade system,Consolidated central government,General Data Dissemination System (GDDS),1979,"Multiple Indicator Cluster Survey (MICS), 2010/11","Integrated household survey (IHS), 2008",,2013/14,,2012.0,2000.0
2,AGO,Angola,Angola,People's Republic of Angola,AO,Angolan kwanza,"April 2013 database update: Based on IMF data, national accounts data were revised for 2000 onward; the base year changed to 2002.",Sub-Saharan Africa,Upper middle income,AO,2002,Value added at producer prices (VAP),IBRD,Country uses the 1993 System of National Accounts methodology.,2005,"IMF Balance of Payments Manual, 6th edition.",Actual,Special trade system,Budgetary central government,General Data Dissemination System (GDDS),1970,"Malaria Indicator Survey (MIS), 2011","Integrated household survey (IHS), 2008",,2015,,,2005.0
3,ALB,Albania,Albania,Republic of Albania,AL,Albanian lek,,Europe & Central Asia,Upper middle income,AL,Original chained constant price data are rescaled.,Value added at basic prices (VAB),IBRD,Country uses the 1993 System of National Accounts methodology.,Rolling,"IMF Balance of Payments Manual, 6th edition.",Actual,General trade system,Budgetary central government,General Data Dissemination System (GDDS),2011,"Demographic and Health Survey (DHS), 2008/09","Living Standards Measurement Study Survey (LSMS), 2012",Yes,2012,2010.0,2012.0,2006.0
4,AND,Andorra,Andorra,Principality of Andorra,AD,Euro,,Europe & Central Asia,High income: nonOECD,AD,1990,,,Country uses the 1968 System of National Accounts methodology.,,,,Special trade system,,,2011. Population figures compiled from administrative registers.,,,Yes,,,2006.0,


### Identifying and Removing Non-Country Entities

In [5]:
# Identify and display the invalid (non-country) codes
cleaner.fk_country(name, df, return_code=False, disp=True)

# Remove the non-country entities and save the result
cleaner.removefkcountry(name, df)

For EdStatsCountry - 27 invalid country codes identified


Unnamed: 0,Country Code,Short Name
0,ARB,Arab World
1,CHI,Channel Islands
2,EAP,East Asia & Pacific (developing only)
3,EAS,East Asia & Pacific (all income levels)
4,ECA,Europe & Central Asia (developing only)
5,ECS,Europe & Central Asia (all income levels)
6,EMU,Euro area
7,EUU,European Union
8,HIC,High income
9,HPC,Heavily indebted poor countries (HIPC)


For EdStatsCountry - 27 invalid country codes identified
✓ File saved: /Users/hopedonglo/Documents/Projects/OpenClassRooms/OpenClassroom_Projects/project_2/data/processed/EdStatsCountry.csv


<project2.utils.Cleaner.Cleaner at 0x11a9bf770>