# Literacy Indicator Analysis and Exclusion Decision
## Bangladesh, India, and Pakistan Long-term Development Study

### Background
This notebook examines the availability and quality of literacy indicators for Bangladesh, India, and Pakistan in the context of our long-term development analysis. Although we consider literacy rates crucial for assessing the impact of microcredit, we first need to evaluate whether the available data can support robust statistical analysis.

### Objectives
1. Assess the availability of literacy indicators in our dataset
2. Evaluate data quality and completeness
3. Compare with other available indicators
4. Provide justification for inclusion/exclusion decisions
5. Suggest alternative approaches for future research


In [1]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# Set up plotting preferences
plt.style.use('default')
sns.set_palette("Set2")
plt.rcParams['figure.figsize'] = (14, 8)
plt.rcParams['font.size'] = 11

print("Libraries loaded successfully")


Libraries loaded successfully


In [5]:
!ls work


data		    Dockerfile	models	   README.md  src
docker-compose.yml  lib		notebooks  reports


In [6]:
# Load the filtered dataset
df = pd.read_csv('/home/jovyan/work/data/processed/filtered_data.csv')

print(f"Dataset shape: {df.shape}")
print(f"Countries: {df['Country Name'].unique()}")
print(f"Total indicators: {df['Indicator Name'].nunique()}")

# Identify year columns
year_columns = [col for col in df.columns if col.isdigit()]
year_columns = sorted([int(year) for year in year_columns])
print(f"Year range in dataset: {min(year_columns)} - {max(year_columns)}")
print(f"Total years available: {len(year_columns)} years")


Dataset shape: (4797, 64)
Countries: ['Bangladesh' 'India' 'Pakistan']
Total indicators: 1599
Year range in dataset: 1960 - 2018
Total years available: 59 years


## 1. Literacy Indicators Identification

First, let's identify all literacy-related indicators available in our dataset.


In [7]:
# Search for literacy indicators
literacy_indicators = df[df['Indicator Name'].str.contains('literacy', case=False, na=False)]

print(f"Found {literacy_indicators['Indicator Name'].nunique()} literacy indicators:")
print("\n=== Available Literacy Indicators ===")

unique_literacy = literacy_indicators[['Indicator Name', 'Indicator Code']].drop_duplicates()
for idx, row in unique_literacy.iterrows():
    print(f"• {row['Indicator Name']}")
    print(f"  Code: {row['Indicator Code']}")
    print()


Found 7 literacy indicators:

=== Available Literacy Indicators ===
• Literacy rate, adult female (% of females ages 15 and above)
  Code: SE.ADT.LITR.FE.ZS

• Literacy rate, adult male (% of males ages 15 and above)
  Code: SE.ADT.LITR.MA.ZS

• Literacy rate, adult total (% of people ages 15 and above)
  Code: SE.ADT.LITR.ZS

• Literacy rate, youth (ages 15-24), gender parity index (GPI)
  Code: SE.ADT.1524.LT.FM.ZS

• Literacy rate, youth female (% of females ages 15-24)
  Code: SE.ADT.1524.LT.FE.ZS

• Literacy rate, youth male (% of males ages 15-24)
  Code: SE.ADT.1524.LT.MA.ZS

• Literacy rate, youth total (% of people ages 15-24)
  Code: SE.ADT.1524.LT.ZS



In [8]:
def analyze_indicator_coverage(df, indicator_subset=None):
    """
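    Summarize data coverage (year span, density, country coverage) for each
    indicator in `indicator_subset`; defaults to every indicator in `df`.
    Relies on the notebook-level `year_columns` list defined above.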
    """
    if indicator_subset is None:
        indicator_subset = df
    
    results = []
    year_columns_str = [str(year) for year in year_columns]
    
    for indicator_name in indicator_subset['Indicator Name'].unique():
        indicator_data = indicator_subset[indicator_subset['Indicator Name'] == indicator_name]
        indicator_code = indicator_data['Indicator Code'].iloc[0]
        
        # Transform to long format
        indicator_long = indicator_data.melt(
            id_vars=['Country Name', 'Country Code', 'Indicator Name', 'Indicator Code'],
            value_vars=year_columns_str,
            var_name='Year',
            value_name='Value'
        )
        
        # Clean the data
        indicator_long['Year'] = indicator_long['Year'].astype(int)
        indicator_long['Value'] = pd.to_numeric(indicator_long['Value'], errors='coerce')
        indicator_clean = indicator_long.dropna(subset=['Value'])
        
        if len(indicator_clean) > 0:
            year_range = indicator_clean['Year'].max() - indicator_clean['Year'].min() + 1
            countries_with_data = indicator_clean['Country Name'].nunique()
            total_data_points = len(indicator_clean)
            
            # Check if all three countries have data
            countries_in_data = set(indicator_clean['Country Name'].unique())
            all_countries = {'Bangladesh', 'India', 'Pakistan'}
            has_all_countries = all_countries.issubset(countries_in_data)
            
            # Calculate data density
            possible_points = countries_with_data * year_range
            data_density = (total_data_points / possible_points) * 100 if possible_points > 0 else 0
            
            results.append({
                'Indicator_Name': indicator_name,
                'Indicator_Code': indicator_code,
                'Year_Range': year_range,
                'Min_Year': indicator_clean['Year'].min(),
                'Max_Year': indicator_clean['Year'].max(),
                'Countries_with_Data': countries_with_data,
                'Has_All_Countries': has_all_countries,
                'Total_Data_Points': total_data_points,
                'Data_Density': round(data_density, 1),
                'Possible_Points': possible_points
            })
    
    return pd.DataFrame(results)

# Analyze literacy indicators
literacy_analysis = analyze_indicator_coverage(df, literacy_indicators)

print("=== Literacy Indicators Data Coverage Analysis ===\n")
for idx, row in literacy_analysis.iterrows():
    print(f"📊 {row['Indicator_Name']}")
    print(f"   Code: {row['Indicator_Code']}")
    print(f"   Period: {row['Min_Year']}-{row['Max_Year']} ({row['Year_Range']} years)")
    print(f"   Data points: {row['Total_Data_Points']}/{row['Possible_Points']} ({row['Data_Density']}%)")
    print(f"   All countries covered: {'✅ Yes' if row['Has_All_Countries'] else '❌ No'}")
    print(f"   Missing data: {100 - row['Data_Density']:.1f}%")
    print()


=== Literacy Indicators Data Coverage Analysis ===

📊 Literacy rate, adult female (% of females ages 15 and above)
   Code: SE.ADT.LITR.FE.ZS
   Period: 1981-2017 (37 years)
   Data points: 27/111 (24.3%)
   All countries covered: ✅ Yes
   Missing data: 75.7%

📊 Literacy rate, adult male (% of males ages 15 and above)
   Code: SE.ADT.LITR.MA.ZS
   Period: 1981-2017 (37 years)
   Data points: 27/111 (24.3%)
   All countries covered: ✅ Yes
   Missing data: 75.7%

📊 Literacy rate, adult total (% of people ages 15 and above)
   Code: SE.ADT.LITR.ZS
   Period: 1981-2017 (37 years)
   Data points: 27/111 (24.3%)
   All countries covered: ✅ Yes
   Missing data: 75.7%

📊 Literacy rate, youth (ages 15-24), gender parity index (GPI)
   Code: SE.ADT.1524.LT.FM.ZS
   Period: 1981-2017 (37 years)
   Data points: 27/111 (24.3%)
   All countries covered: ✅ Yes
   Missing data: 75.7%

📊 Literacy rate, youth female (% of females ages 15-24)
   Code: SE.ADT.1524.LT.FE.ZS
   Period: 1981-2017 (37 years)


In [9]:
# Detailed analysis of data gaps
print("=== CRITICAL DATA QUALITY ISSUES ===\n")

# Calculate key statistics
avg_data_density = literacy_analysis['Data_Density'].mean()
avg_year_range = literacy_analysis['Year_Range'].mean()
min_year = literacy_analysis['Min_Year'].min()
max_year = literacy_analysis['Max_Year'].max()

print(f"📈 TEMPORAL COVERAGE:")
print(f"   • Earliest data: {min_year}")
print(f"   • Latest data: {max_year}")
print(f"   • Average span: {avg_year_range:.1f} years")
print(f"   • Required for long-term analysis: 40+ years")
print(f"   • Gap: {40 - avg_year_range:.1f} years SHORT")

print(f"\n📊 DATA COMPLETENESS:")
print(f"   • Average data density: {avg_data_density:.1f}%")
print(f"   • Average missing data: {100 - avg_data_density:.1f}%")
print(f"   • Minimum acceptable density: 80%")
print(f"   • Status: {'✅ ACCEPTABLE' if avg_data_density >= 80 else '❌ INSUFFICIENT'}")

print(f"\n🌍 COUNTRY COVERAGE:")
all_countries_covered = literacy_analysis['Has_All_Countries'].all()
print(f"   • All three countries (BD, IN, PK): {'✅ Yes' if all_countries_covered else '❌ No'}")

print(f"\n🔍 DETAILED BREAKDOWN:")
total_possible = 3 * avg_year_range  # 3 countries × years
total_actual = literacy_analysis['Total_Data_Points'].iloc[0]  # All indicators have same coverage
missing_count = total_possible - total_actual

print(f"   • Total possible data points: {total_possible:.0f}")
print(f"   • Actual data points: {total_actual}")
print(f"   • Missing data points: {missing_count:.0f}")
print(f"   • Years of data per country: ~{total_actual/3:.1f} out of {avg_year_range:.0f}")


=== CRITICAL DATA QUALITY ISSUES ===

📈 TEMPORAL COVERAGE:
   • Earliest data: 1981
   • Latest data: 2017
   • Average span: 37.0 years
   • Required for long-term analysis: 40+ years
   • Gap: 3.0 years SHORT

📊 DATA COMPLETENESS:
   • Average data density: 24.3%
   • Average missing data: 75.7%
   • Minimum acceptable density: 80%
   • Status: ❌ INSUFFICIENT

🌍 COUNTRY COVERAGE:
   • All three countries (BD, IN, PK): ✅ Yes

🔍 DETAILED BREAKDOWN:
   • Total possible data points: 111
   • Actual data points: 27
   • Missing data points: 84
   • Years of data per country: ~9.0 out of 37


In [10]:
# Analyze the selected indicator for comparison
selected_indicator_name = "Adolescent fertility rate (births per 1,000 women ages 15-19)"
selected_indicator = df[df['Indicator Name'] == selected_indicator_name]

if not selected_indicator.empty:
    selected_analysis = analyze_indicator_coverage(df, selected_indicator)
    
    print("=== COMPARISON: Literacy vs Selected Indicator ===\n")
    
    # Create comparison table
    comparison_data = {
        'Metric': [
            'Data Span (Years)',
            'Meets 40+ Year Requirement',
            'Data Completeness (%)',
            'Missing Data (%)',
            'All Countries Covered',
            'Suitable for Time Series Analysis'
        ],
        'Literacy Indicators': [
            f"{avg_year_range:.0f}",
            f"❌ No ({40 - avg_year_range:.0f} years short)",
            f"{avg_data_density:.1f}%",
            f"{100 - avg_data_density:.1f}%",
            "✅ Yes",
            "❌ No (too many gaps)"
        ],
        'Adolescent Fertility Rate': [
            f"{selected_analysis['Year_Range'].iloc[0]}",
            f"✅ Yes ({selected_analysis['Year_Range'].iloc[0] - 40} years beyond requirement)",
            f"{selected_analysis['Data_Density'].iloc[0]}%",
            f"{100 - selected_analysis['Data_Density'].iloc[0]:.1f}%",
            "✅ Yes",
            "✅ Yes (complete data)"
        ]
    }
    
    comparison_df = pd.DataFrame(comparison_data)
    print(comparison_df.to_string(index=False))
    
    print(f"\n🎯 KEY DIFFERENCES:")
    year_diff = selected_analysis['Year_Range'].iloc[0] - avg_year_range
    density_diff = selected_analysis['Data_Density'].iloc[0] - avg_data_density
    
    print(f"   • Time span difference: +{year_diff:.0f} years in favor of selected indicator")
    print(f"   • Data completeness difference: +{density_diff:.1f}% in favor of selected indicator")
    print(f"   • Reliability gap: Selected indicator has {density_diff/avg_data_density*100:.0f}% better data quality")

else:
    print("Selected indicator not found in dataset.")


=== COMPARISON: Literacy vs Selected Indicator ===

                           Metric  Literacy Indicators           Adolescent Fertility Rate
                Data Span (Years)                   37                                  58
       Meets 40+ Year Requirement ❌ No (3 years short) ✅ Yes (18 years beyond requirement)
            Data Completeness (%)                24.3%                              100.0%
                 Missing Data (%)                75.7%                                0.0%
            All Countries Covered                ✅ Yes                               ✅ Yes
Suitable for Time Series Analysis ❌ No (too many gaps)               ✅ Yes (complete data)

🎯 KEY DIFFERENCES:
   • Time span difference: +21 years in favor of selected indicator
   • Data completeness difference: +75.7% in favor of selected indicator
   • Reliability gap: Selected indicator has 312% better data quality
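
The "312% better data quality" figure above is easier to interpret as a relative gain in completeness than as a quality score. The following is a minimal sketch of the arithmetic, using the two densities reported in the comparison table; the variable names are illustrative and do not come from the notebook.

```python
# Densities reported in the comparison table (percent of country-year cells filled)
literacy_density = 24.3
selected_density = 100.0

# Absolute gap, measured in percentage points
absolute_gap = selected_density - literacy_density

# Relative gap: the absolute gap expressed as a percentage of the
# literacy indicators' own completeness
relative_gap = absolute_gap / literacy_density * 100

print(f"Absolute gap: {absolute_gap:.1f} percentage points")
print(f"Relative gap: {relative_gap:.0f}%")
```

Read this way, the selected indicator has roughly four times the completeness of the literacy series (100.0 / 24.3 ≈ 4.1). The number describes how many cells are filled and says nothing about measurement accuracy.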


In [12]:
print("=== STATISTICAL IMPLICATIONS OF MISSING DATA ===\n")

print("🔍 IMPACT ON ANALYSIS QUALITY:")
print("\n1. TIME SERIES ANALYSIS:")
print("   ❌ Irregular gaps prevent smooth trend detection")
print("   ❌ Missing data reduces statistical power")
print("   ❌ Interpolation may introduce bias")
print("   ❌ Multi-year cyclical patterns cannot be identified")

print("\n2. COMPARATIVE ANALYSIS:")
print("   ❌ Different countries have data for different years")
print("   ❌ Difficult to compare trends across countries")
print("   ❌ Policy impact assessment becomes unreliable")

print("\n3. STATISTICAL TESTING:")
print("   ❌ Reduced degrees of freedom")
print("   ❌ Lower confidence in significance tests")
print("   ❌ Potential for Type II errors (false negatives)")


=== STATISTICAL IMPLICATIONS OF MISSING DATA ===

🔍 IMPACT ON ANALYSIS QUALITY:

1. TIME SERIES ANALYSIS:
   ❌ Irregular gaps prevent smooth trend detection
   ❌ Missing data reduces statistical power
   ❌ Interpolation may introduce bias
   ❌ Multi-year cyclical patterns cannot be identified

2. COMPARATIVE ANALYSIS:
   ❌ Different countries have data for different years
   ❌ Difficult to compare trends across countries
   ❌ Policy impact assessment becomes unreliable

3. STATISTICAL TESTING:
   ❌ Reduced degrees of freedom
   ❌ Lower confidence in significance tests
   ❌ Potential for Type II errors (false negatives)
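
The point that different countries have data for different years can be checked directly rather than asserted. The following is a minimal sketch, assuming the same wide layout as `filtered_data.csv` (a `Country Name` column plus one column per year). The helper name `observed_years_by_country` is hypothetical, and the toy values are placeholders, not real literacy rates.

```python
import numpy as np
import pandas as pd

def observed_years_by_country(wide, year_cols):
    """Return {country: sorted list of years that have a non-missing value}."""
    long = wide.melt(id_vars=["Country Name"], value_vars=year_cols,
                     var_name="Year", value_name="Value")
    long["Year"] = long["Year"].astype(int)
    long["Value"] = pd.to_numeric(long["Value"], errors="coerce")
    observed = long.dropna(subset=["Value"])
    return {country: sorted(group["Year"].tolist())
            for country, group in observed.groupby("Country Name")}

# Toy frame with placeholder values (NOT real data), one row per country
toy = pd.DataFrame({
    "Country Name": ["Bangladesh", "India", "Pakistan"],
    "1981": [1.0, 1.0, 1.0],
    "1991": [2.0, 2.0, np.nan],
    "1998": [np.nan, np.nan, 3.0],
    "2001": [4.0, 4.0, np.nan],
})

years = observed_years_by_country(toy, ["1981", "1991", "1998", "2001"])

# Years in which every country has an observation
common_years = sorted(set.intersection(*(set(v) for v in years.values())))

for country, observed in years.items():
    print(f"{country}: {observed}")
print(f"Years shared by all countries: {common_years}")
```

Applied to one literacy indicator's rows in `df`, the size of `common_years` would show how many years support a like-for-like comparison across the three countries.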


In [13]:

print("\n=== ALTERNATIVE APPROACHES FOR FUTURE RESEARCH ===\n")

print("📈 RECOMMENDED SOLUTIONS:")
print("\n1. DATA TRIANGULATION:")
print("   • Combine World Bank, UNESCO, and national statistics")
print("   • Use multiple sources to fill gaps")
print("   • Cross-validate between sources")

print("\n2. PROXY INDICATORS:")
print("   • Primary school enrollment rates")
print("   • Secondary school completion rates")
print("   • Educational expenditure as % of GDP")

print("\n3. ALTERNATIVE TIME FRAMES:")
print("   • Focus on 1990-2017 period (better data availability)")
print("   • Analyze specific time points with complete data")
print("   • Use cross-sectional analysis for recent years")

print("\n4. ADVANCED STATISTICAL METHODS:")
print("   • Multiple imputation techniques")
print("   • Bayesian estimation with informative priors")
print("   • Machine learning-based prediction models")



=== ALTERNATIVE APPROACHES FOR FUTURE RESEARCH ===

📈 RECOMMENDED SOLUTIONS:

1. DATA TRIANGULATION:
   • Combine World Bank, UNESCO, and national statistics
   • Use multiple sources to fill gaps
   • Cross-validate between sources

2. PROXY INDICATORS:
   • Primary school enrollment rates
   • Secondary school completion rates
   • Educational expenditure as % of GDP

3. ALTERNATIVE TIME FRAMES:
   • Focus on 1990-2017 period (better data availability)
   • Analyze specific time points with complete data
   • Use cross-sectional analysis for recent years

4. ADVANCED STATISTICAL METHODS:
   • Multiple imputation techniques
   • Bayesian estimation with informative priors
   • Machine learning-based prediction models
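
Whether a narrower time frame such as 1990-2017 actually improves coverage depends on where the observations fall. The following is a minimal sketch of how density could be recomputed for a fixed window, assuming a long-format frame with `Country Name` and `Year` columns that lists only observed country-years. The function name `window_density` and the toy observations are hypothetical.

```python
import pandas as pd

def window_density(observed, start, end, countries):
    """Percent of country-year cells in [start, end] that have an observation."""
    in_window = observed[
        observed["Year"].between(start, end)
        & observed["Country Name"].isin(countries)
    ]
    # Count each country-year at most once
    cells = in_window.drop_duplicates(["Country Name", "Year"])
    possible = len(countries) * (end - start + 1)
    return 100 * len(cells) / possible

# Toy observations (placeholder years, NOT the real literacy coverage)
observed = pd.DataFrame({
    "Country Name": ["Bangladesh"] * 3 + ["India"] * 3 + ["Pakistan"] * 2,
    "Year": [1981, 1991, 2001, 1981, 1991, 2001, 1981, 1998],
})
countries = ["Bangladesh", "India", "Pakistan"]

full_density = window_density(observed, 1981, 2001, countries)
recent_density = window_density(observed, 1990, 2001, countries)

print(f"1981-2001 density: {full_density:.1f}%")
print(f"1990-2001 density: {recent_density:.1f}%")
```

Unlike `analyze_indicator_coverage`, which measures density between the first and last observed year, this sketch fixes the window in advance, so the result is comparable across indicators.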


## 5. Summary and Conclusions

### Key Findings

1. **Data Availability Issues**: Literacy indicators span only 37 years (1981-2017), falling 3 years short of our 40-year requirement for long-term analysis.

2. **Severe Data Incompleteness**: Only 27 of 111 possible country-year observations (24.3%) are present, roughly 9 per country across 37 years. The remaining 75.7% is missing, which makes robust statistical analysis difficult.

3. **Comparison with Selected Indicator**: The chosen Adolescent Fertility Rate has 100% data completeness over 58 years, providing a clear advantage for time series analysis.

4. **Statistical Implications**: Missing data would compromise trend detection, cross-country comparison, and the power of significance tests.

### Decision Rationale

The exclusion of literacy indicators from our main analysis is justified by:
- Insufficient temporal coverage (37 vs 40+ years required)
- Low data completeness (24.3% vs 80% minimum threshold)
- Risk of unreliable statistical conclusions
- Availability of superior alternatives
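
The two quantitative criteria above can be written as an explicit rule, which makes the decision easy to reapply to other candidate indicators. The following is a minimal sketch, assuming the 40-year and 80% thresholds stated in this notebook; the function name `check_inclusion` and the constant names are hypothetical.

```python
MIN_SPAN_YEARS = 40      # "40+ years" requirement stated in this notebook
MIN_DENSITY_PCT = 80.0   # "minimum acceptable density" stated in this notebook

def check_inclusion(year_range, data_density):
    """Return per-criterion results plus an overall include/exclude flag."""
    span_ok = year_range >= MIN_SPAN_YEARS
    density_ok = data_density >= MIN_DENSITY_PCT
    return {
        "span_ok": span_ok,
        "density_ok": density_ok,
        "include": span_ok and density_ok,
    }

# Values taken from the coverage analysis and comparison table
literacy = check_inclusion(year_range=37, data_density=24.3)
fertility = check_inclusion(year_range=58, data_density=100.0)

print("Literacy indicators:", literacy)
print("Adolescent fertility rate:", fertility)
```

The inputs map to the `Year_Range` and `Data_Density` columns returned by `analyze_indicator_coverage`, so the same check could be run row by row over any set of indicators.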

### Research Implications

While literacy is undoubtedly crucial for development analysis, the current dataset limitations necessitate this exclusion. Future research should:
- Seek enhanced datasets with better literacy coverage
- Consider alternative education indicators
- Acknowledge this limitation in current findings

This analysis demonstrates the importance of data quality assessment in indicator selection and provides a transparent rationale for research decisions.
