# Economic Resilience Predictor 

## Capstone Project Phase 1b: World Bank Data Collection
This notebook handles the second step of the data collection phase for the economic shock resilience project. The collection phase has two steps: the Maddison-focused collection (Phase 1a)
and the World Bank collection (Phase 1b), which is implemented in this notebook.


**Objectives:**
- Collect 25+ World Bank indicators for the country set validated in Phase 1a
- Use the same 38 countries that achieved full coverage in the Maddison collection
- Target period: 1990-2023
- Add economic, financial, and institutional indicators
- Maintain high data quality standards
- Integrate datasets for comprehensive analysis
- Calculate initial resilience metrics
- Prepare data for EDA phase

In [19]:

# World Bank Data Collection - Phase 1B
# ====================================================
# Building on the Maddison collection (Phase 1a), this notebook collects World Bank
# indicators for the same 38 countries. The additional indicators give us more inputs
# to calculate and measure economic resilience.
 
# Environment Setup
import sys
import warnings
from pathlib import Path
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import requests
import time
import json
from datetime import datetime
from typing import Dict, List, Optional

warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 15)

# Load configuration
sys.path.append("config")
from data_collection_config import *

print("WORLD BANK DATA COLLECTION - Phase 1B")
print("=" * 55)
print(f"Target countries: {len(FOCUS_COUNTRIES)}")
print(f"Target indicators: {len(WORLD_BANK_INDICATORS)}")
print(f"Time period: {START_YEAR}-{END_YEAR}")

WORLD BANK DATA COLLECTION - Phase 1B
Target countries: 38
Target indicators: 26
Time period: 1990-2023


In [20]:
# Load our Maddison data and country recommendations
# ===========================================================

print("LOADING MADDISON SUCCESS RESULTS")
print("=" * 40)

try:
    # Load the successful Maddison data
    maddison_data = pd.read_csv('data/maddison_focused_test.csv')
    print(f"Maddison data loaded: {maddison_data.shape}")
    
    # Load country recommendations
    with open('data/country_recommendations.json', 'r') as f:
        recommendations = json.load(f)
    
    high_quality_countries = recommendations['high_quality_countries']
    print(f"High-quality countries: {len(high_quality_countries)}")
    print(f"Countries: {', '.join(sorted(high_quality_countries))}")
    
    # Verify we have all expected countries
    print(f"\n COVERAGE VERIFICATION:")
    print(f"   Expected: {len(FOCUS_COUNTRIES)} countries")
    print(f"   Found: {len(high_quality_countries)} high-quality")
    print(f"   Success rate: {len(high_quality_countries)/len(FOCUS_COUNTRIES):.1%}")
    
    success_base = True
    
except Exception as e:
    print(f"  Error loading previous results: {e}")
    print("   Please ensure test_focused_collection.ipynb completed successfully")
    success_base = False

if success_base:
    print(f"Ready to proceed with World Bank collection!")

LOADING MADDISON SUCCESS RESULTS
Maddison data loaded: (1254, 6)
High-quality countries: 38
Countries: ARG, AUS, AUT, BEL, BRA, CAN, CHE, CHL, CHN, COL, CZE, DEU, DNK, ESP, FIN, FRA, GBR, HUN, IDN, IND, IRL, ITA, JPN, KOR, MEX, MYS, NLD, NOR, NZL, PHL, POL, PRT, RUS, SWE, THA, TUR, USA, ZAF

 COVERAGE VERIFICATION:
   Expected: 38 countries
   Found: 38 high-quality
   Success rate: 100.0%
Ready to proceed with World Bank collection!
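
The success rate above compares only list lengths, so it would still read 100% if one expected country were swapped for an unexpected one. A minimal sketch of a set-based check follows; the helper name `coverage_report` and the sample lists are hypothetical and not part of the notebook or its config.

```python
from typing import Dict, List


def coverage_report(expected: List[str], found: List[str]) -> Dict[str, object]:
    """Compare an expected country list with the list that was actually loaded."""
    expected_set, found_set = set(expected), set(found)
    matched = len(expected_set & found_set)
    return {
        # Countries we wanted but did not get
        "missing": sorted(expected_set - found_set),
        # Countries we got but did not ask for
        "unexpected": sorted(found_set - expected_set),
        # Share of expected countries that are really present
        "success_rate": matched / len(expected_set) if expected_set else 0.0,
    }


# Illustrative input: same length (3 vs 3), but only 2 countries overlap
report = coverage_report(["USA", "GBR", "DEU"], ["USA", "DEU", "FRA"])
print(report)
```

In the notebook this could be called as `coverage_report(FOCUS_COUNTRIES, high_quality_countries)`, assuming both hold ISO3 codes.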


In [21]:
# Define functions for World Bank data collection 
# ===========================================================

print("SETTING UP WORLD BANK COLLECTION FUNCTIONS")
print("=" * 50)

def collect_wb_indicator_with_progress(indicator_code: str, indicator_name: str, 
                                     target_countries: List[str]) -> Optional[pd.DataFrame]:
    """
    Collect a single World Bank indicator with detailed progress reporting.
    
    Parameters:
    -----------
    indicator_code : str
        World Bank API indicator code
    indicator_name : str  
        Human-readable indicator name
    target_countries : List[str]
        List of target country codes
        
    Returns:
    --------
    Optional[pd.DataFrame] : Collected data or None if failed
    """
    
    max_retries = 3
    retry_delay = 2
    
    print(f"Collecting: {indicator_name}")
    
    for retry in range(max_retries):
        try:
            # Construct API request
            url = f"https://api.worldbank.org/v2/country/all/indicator/{indicator_code}"
            params = {
                'format': 'json',
                'date': f'{START_YEAR}:{END_YEAR}',
                'per_page': 15000,  # Generous page size
                'page': 1
            }
            
            # Make request with timeout
            response = requests.get(url, params=params, timeout=30)
            response.raise_for_status()
            
            # Parse response
            data = response.json()
            
            if not isinstance(data, list) or len(data) < 2:
                print(f"      Invalid response structure")
                if retry < max_retries - 1:
                    time.sleep(retry_delay)
                continue
            
            # Extract records
            records_data = data[1]
            if not records_data:
                print(f"      No data returned")
                if retry < max_retries - 1:
                    time.sleep(retry_delay)
                continue
            
            # Process records
            valid_records = []
            
            for item in records_data:
                if not isinstance(item, dict):
                    continue
                
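                # Extract data. Note: country['id'] is a 2-character code (e.g. 'ZH'),
                # so the match against 3-letter target codes below never succeeds;
                # collect_wb_indicator_fixed uses 'countryiso3code' instead.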
                country_info = item.get('country', {})
                country_code = country_info.get('id')
                country_name = country_info.get('value')
                year_str = item.get('date')
                value = item.get('value')
                
                # Validate record
                if (country_code and year_str and value is not None and 
                    country_code in target_countries):
                    
                    try:
                        year = int(year_str)
                        value_float = float(value)
                        
                        valid_records.append({
                            'country_code': country_code,
                            'country_name': country_name,
                            'year': year,
                            'indicator_code': indicator_code,
                            'indicator_name': indicator_name,
                            'value': value_float
                        })
                    except (ValueError, TypeError):
                        continue
            
            if valid_records:
                df = pd.DataFrame(valid_records)
                countries_found = df['country_code'].nunique()
                year_range = f"{df['year'].min()}-{df['year'].max()}"
                
                print(f"      Success: {len(valid_records)} records, {countries_found} countries, {year_range}")
                return df
            else:
                print(f"      ⚠️ No valid records found")
                if retry < max_retries - 1:
                    time.sleep(retry_delay)
                
        except requests.exceptions.RequestException as e:
            print(f"      Request error (attempt {retry + 1}): {str(e)[:100]}")
            if retry < max_retries - 1:
                time.sleep(retry_delay * (retry + 1))
        except Exception as e:
            print(f"      Processing error (attempt {retry + 1}): {str(e)[:100]}")
            if retry < max_retries - 1:
                time.sleep(retry_delay)
    
    print(f"Failed after {max_retries} attempts")
    return None


def assess_indicator_quality(df: pd.DataFrame, target_countries: List[str]) -> Dict:
    """Assess the quality of collected indicator data."""
    
    if df is None or df.empty:
        # Return the same keys as the non-empty case so callers can rely on them
        return {'quality_score': 0, 'countries_covered': 0, 'country_coverage': 0,
                'completeness': 0, 'total_records': 0}
    
    countries_covered = df['country_code'].nunique()
    country_coverage = countries_covered / len(target_countries)
    
    # Calculate completeness by country as the share of target years with a value.
    # Null values are dropped during collection, so value.notna().mean() would always be 1.0.
    expected_years = END_YEAR - START_YEAR + 1
    country_completeness = (
        df.groupby('country_code')['year'].nunique() / expected_years
    ).clip(upper=1.0).mean()
    
    # Overall quality score
    quality_score = 0.6 * country_coverage + 0.4 * country_completeness
    
    return {
        'quality_score': quality_score,
        'countries_covered': countries_covered,
        'country_coverage': country_coverage,
        'completeness': country_completeness,
        'total_records': len(df)
    }

print("Collection functions ready")

SETTING UP WORLD BANK COLLECTION FUNCTIONS
Collection functions ready
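
The retry loop in `collect_wb_indicator_with_progress` waits `retry_delay * (retry + 1)` seconds after a request error, which is a linear backoff. The sketch below isolates that pattern so it can be exercised without network access. The names `call_with_retries`, `flaky_fetch`, and the injectable `sleep` argument are assumptions made for illustration; the notebook itself catches `requests.exceptions.RequestException` rather than the generic `Exception` used here.

```python
import time
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")


def call_with_retries(fetch: Callable[[], T],
                      max_retries: int = 3,
                      retry_delay: float = 2.0,
                      sleep: Callable[[float], None] = time.sleep) -> Optional[T]:
    """Call fetch() up to max_retries times, waiting longer after each failure."""
    for attempt in range(max_retries):
        try:
            return fetch()
        except Exception as exc:  # the notebook narrows this to request errors
            print(f"Attempt {attempt + 1} failed: {str(exc)[:100]}")
            if attempt < max_retries - 1:
                # Linear backoff: 1x, 2x, 3x ... the base delay
                sleep(retry_delay * (attempt + 1))
    return None


# Demo with a stand-in fetcher that fails twice and then succeeds
attempts = {"count": 0}


def flaky_fetch() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("simulated timeout")
    return "ok"


recorded_delays: List[float] = []
result = call_with_retries(flaky_fetch, sleep=recorded_delays.append)
print(result, recorded_delays)
```

Passing `sleep` as a parameter keeps the demo instantaneous; in real use the default `time.sleep` applies.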


In [22]:
# DEBUG: Investigate World Bank API Response

print("DEBUGGING WORLD BANK API COLLECTION")
print("=" * 50)

def debug_wb_api_call(indicator_code: str, sample_countries: Optional[List[str]] = None):
    """Debug the World Bank API call to understand the issue."""
    
    if sample_countries is None:
        sample_countries = ['USA', 'GBR', 'DEU']  # Simple test countries
    
    print(f"Testing indicator: {indicator_code}")
    print(f"Sample countries: {sample_countries}")
    
    try:
        # Construct URL
        url = f"https://api.worldbank.org/v2/country/all/indicator/{indicator_code}"
        params = {
            'format': 'json',
            'date': f'{START_YEAR}:{END_YEAR}',
            'per_page': 1000,  # Smaller for debugging
            'page': 1
        }
        
        print(f"API URL: {url}")
        print(f"Parameters: {params}")
        
        # Make request
        response = requests.get(url, params=params, timeout=30)
        print(f"Response status: {response.status_code}")
        
        if response.status_code != 200:
            print(f"HTTP Error: {response.status_code}")
            print(f"Response text: {response.text[:500]}")
            return None
        
        # Parse JSON
        data = response.json()
        print(f"Response type: {type(data)}")
        print(f"Response length: {len(data) if isinstance(data, list) else 'N/A'}")
        
        if isinstance(data, list) and len(data) >= 2:
            pagination_info = data[0]
            records_data = data[1]
            
            print(f"\nPAGINATION INFO:")
            print(f"   Total records: {pagination_info.get('total', 'N/A')}")
            print(f"   Pages: {pagination_info.get('pages', 'N/A')}")
            print(f"   Per page: {pagination_info.get('per_page', 'N/A')}")
            
            print(f"\nRECORDS DATA:")
            print(f"   Records type: {type(records_data)}")
            print(f"   Records count: {len(records_data) if records_data else 0}")
            
            if records_data and len(records_data) > 0:
                # Show first few records structure
                print(f"\nSAMPLE RECORD STRUCTURE:")
                sample_record = records_data[0]
                print(f"   Record type: {type(sample_record)}")
                if isinstance(sample_record, dict):
                    print(f"   Keys: {list(sample_record.keys())}")
                    
                    # Show country structure
                    country_info = sample_record.get('country', {})
                    print(f"   Country info: {country_info}")
                    
                    # Show other fields
                    print(f"   Date: {sample_record.get('date')}")
                    print(f"   Value: {sample_record.get('value')}")
                
                # Check how many records match our criteria
                matching_records = 0
                countries_found = set()
                
                for item in records_data[:100]:  # Check first 100
                    if isinstance(item, dict):
                        country_info = item.get('country', {})
                        country_code = country_info.get('id')
                        value = item.get('value')
                        
                        if country_code:
                            countries_found.add(country_code)
                            
                        if country_code in sample_countries and value is not None:
                            matching_records += 1
                
                print(f"\nMATCHING ANALYSIS (first 100 records):")
                print(f"   Records matching target countries: {matching_records}")
                print(f"   Unique countries found: {len(countries_found)}")
                print(f"   Sample countries found: {countries_found & set(sample_countries)}")
                print(f"   All countries sample: {sorted(list(countries_found))[:10]}")
                
                return records_data[:10]  # Return sample for further inspection
            else:
                print(f"   No records in response")
                return None
        else:
            print(f"Unexpected response structure")
            print(f"Raw response: {str(data)[:500]}")
            return None
            
    except Exception as e:
        print(f"Error during debugging: {e}")
        import traceback
        traceback.print_exc()
        return None

# Run the debug
if success_base:
    print("Starting API debug with known good countries...")
    sample_data = debug_wb_api_call('NY.GDP.PCAP.KD', ['USA', 'GBR', 'DEU', 'FRA', 'JPN'])
    
    if sample_data:
        print(f"\nDebug successful - got sample data")
        print("Sample records:")
        for i, record in enumerate(sample_data[:3]):
            print(f"Record {i+1}: {record}")
    else:
        print(f"\nDebug failed - no data returned")

DEBUGGING WORLD BANK API COLLECTION
Starting API debug with known good countries...
Testing indicator: NY.GDP.PCAP.KD
Sample countries: ['USA', 'GBR', 'DEU', 'FRA', 'JPN']
API URL: https://api.worldbank.org/v2/country/all/indicator/NY.GDP.PCAP.KD
Parameters: {'format': 'json', 'date': '1990:2023', 'per_page': 1000, 'page': 1}
Response status: 200
Response type: <class 'list'>
Response length: 2

PAGINATION INFO:
   Total records: 9044
   Pages: 10
   Per page: 1000

RECORDS DATA:
   Records type: <class 'list'>
   Records count: 1000

SAMPLE RECORD STRUCTURE:
   Record type: <class 'dict'>
   Keys: ['indicator', 'country', 'countryiso3code', 'date', 'value', 'unit', 'obs_status', 'decimal']
   Country info: {'id': 'ZH', 'value': 'Africa Eastern and Southern'}
   Date: 2023
   Value: 1418.3637366532

MATCHING ANALYSIS (first 100 records):
   Records matching target countries: 0
   Unique countries found: 3
   Sample countries found: set()
   All countries sample: ['1A', 'ZH', 'ZI']

Deb
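
The debug output points to two causes for the zero matches: the first page contains only regional aggregates, and `country['id']` holds a 2-character code while the target list uses 3-letter codes. A minimal sketch of a record filter keyed on `countryiso3code` is shown below. The helper name `extract_target_record` is hypothetical, and the two sample records are illustrative: their keys mirror those printed above, but the `countryiso3code` of the aggregate and the United States value are assumed for the demo.

```python
from typing import Dict, List, Optional


def extract_target_record(item: Dict, target_countries: List[str]) -> Optional[Dict]:
    """Return a cleaned record if it belongs to a target country, else None."""
    iso3 = item.get("countryiso3code")
    year_str = item.get("date")
    value = item.get("value")
    if not iso3 or iso3 not in target_countries or not year_str or value is None:
        return None
    try:
        return {"country_code": iso3, "year": int(year_str), "value": float(value)}
    except (ValueError, TypeError):
        return None


# Illustrative records (values assumed for the demo)
aggregate = {
    "country": {"id": "ZH", "value": "Africa Eastern and Southern"},
    "countryiso3code": "AFE",
    "date": "2023",
    "value": 1418.3637366532,
}
usa = {
    "country": {"id": "US", "value": "United States"},
    "countryiso3code": "USA",
    "date": "2023",
    "value": 65000.0,
}

targets = ["USA", "GBR", "DEU"]
print(extract_target_record(aggregate, targets))  # aggregate is filtered out
print(extract_target_record(usa, targets))
```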

In [23]:
# Fixed World Bank Data Collection with pagination
# ==========================================================

print(" FIXED COLLECTION FUNCTION WITH PAGINATION")
print("=" * 50)

def collect_wb_indicator_fixed(indicator_code: str, indicator_name: str, 
                             target_countries: List[str]) -> Optional[pd.DataFrame]:
    """
    Fixed World Bank collection that handles pagination properly.
    """
    
    print(f"   Collecting: {indicator_name}")
    
    all_records = []
    max_pages = 15  # Safety limit
    
    try:
        # First, get pagination info
        url = f"https://api.worldbank.org/v2/country/all/indicator/{indicator_code}"
        params = {
            'format': 'json',
            'date': f'{START_YEAR}:{END_YEAR}',
            'per_page': 1000,
            'page': 1
        }
        
        response = requests.get(url, params=params, timeout=30)
        response.raise_for_status()
        data = response.json()
        
        if not isinstance(data, list) or len(data) < 2:
            print(f"      ⚠️ Invalid response structure")
            return None
        
        pagination_info = data[0]
        total_pages = pagination_info.get('pages', 1)
        total_records = pagination_info.get('total', 0)
        
        print(f"Total records: {total_records}, Pages: {total_pages}")
        
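        # Collect from all pages (up to max_pages for safety)
        pages_to_fetch = min(total_pages, max_pages)
        if total_pages > max_pages:
            print(f"      ⚠️ Fetching only {max_pages} of {total_pages} pages; results will be incomplete")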
        
        for page in range(1, pages_to_fetch + 1):
            print(f"Fetching page {page}/{pages_to_fetch}", end="")
            
            params['page'] = page
            response = requests.get(url, params=params, timeout=30)
            response.raise_for_status()
            
            page_data = response.json()
            if isinstance(page_data, list) and len(page_data) >= 2:
                records_data = page_data[1]
                
                if records_data:
                    # Process records from this page
                    page_valid = 0
                    for item in records_data:
                        if not isinstance(item, dict):
                            continue
                        
                        # Use countryiso3code field (more reliable)
                        country_code = item.get('countryiso3code')
                        country_info = item.get('country', {})
                        country_name = country_info.get('value')
                        year_str = item.get('date')
                        value = item.get('value')
                        
                        # Check if this is one of our target countries
                        if (country_code and country_code in target_countries and 
                            year_str and value is not None):
                            
                            try:
                                year = int(year_str)
                                value_float = float(value)
                                
                                all_records.append({
                                    'country_code': country_code,
                                    'country_name': country_name,
                                    'year': year,
                                    'indicator_code': indicator_code,
                                    'indicator_name': indicator_name,
                                    'value': value_float
                                })
                                page_valid += 1
                                
                            except (ValueError, TypeError):
                                continue
                    
                    print(f" ({page_valid} valid records)")
                else:
                    print(f" (no records)")
            
            # Small delay between pages
            time.sleep(0.5)
        
        if all_records:
            df = pd.DataFrame(all_records)
            countries_found = df['country_code'].nunique()
            year_range = f"{df['year'].min()}-{df['year'].max()}"
            
            print(f"      Success: {len(all_records)} records, {countries_found} countries, {year_range}")
            return df
        else:
            print(f"      No valid records found across all pages")
            return None
            
    except Exception as e:
        print(f" Error: {str(e)[:100]}")
        return None

# Test the fixed function
if success_base:
    print("\n Testing fixed collection function...")
    fixed_test = collect_wb_indicator_fixed('NY.GDP.PCAP.KD', 'gdp_per_capita_constant', high_quality_countries)
    
    if fixed_test is not None:
        print(f"\n FIXED APPROACH WORKS!")
        print(f"   Total records: {len(fixed_test)}")
        print(f"   Countries found: {fixed_test['country_code'].nunique()}")
        print(f"   Year range: {fixed_test['year'].min()}-{fixed_test['year'].max()}")
        
        # Show countries found
        countries_found = sorted(fixed_test['country_code'].unique())
        countries_missing = [c for c in high_quality_countries if c not in countries_found]
        
        print(f"\n COUNTRY COVERAGE:")
        print(f"   Found: {len(countries_found)}/{len(high_quality_countries)} ({len(countries_found)/len(high_quality_countries):.1%})")
        print(f"   Countries found: {countries_found}")
        
        if countries_missing:
            print(f"   Missing: {countries_missing}")
        
        print(f"\n SAMPLE DATA:")
        display(fixed_test.head(10))
        
        # Test data quality
        country_coverage = fixed_test.groupby('country_code').size().sort_values(ascending=False)
        print(f"\n DATA POINTS PER COUNTRY (top 10):")
        print(dict(country_coverage.head(10)))
        
        collection_working = True
        
    else:
        print(f"\n Fixed approach still failed")
        collection_working = False

 FIXED COLLECTION FUNCTION WITH PAGINATION

 Testing fixed collection function...
   Collecting: gdp_per_capita_constant
Total records: 9044, Pages: 10
Fetching page 1/10 (0 valid records)
Fetching page 2/10 (34 valid records)
Fetching page 3/10 (170 valid records)
Fetching page 4/10 (226 valid records)
Fetching page 5/10 (218 valid records)
Fetching page 6/10 (116 valid records)
Fetching page 7/10 (184 valid records)
Fetching page 8/10 (140 valid records)
Fetching page 9/10 (204 valid records)
Fetching page 10/10 (0 valid records)
      Success: 1292 records, 38 countries, 1990-2023

 FIXED APPROACH WORKS!
   Total records: 1292
   Countries found: 38
   Year range: 1990-2023

 COUNTRY COVERAGE:
   Found: 38/38 (100.0%)
   Countries found: ['ARG', 'AUS', 'AUT', 'BEL', 'BRA', 'CAN', 'CHE', 'CHL', 'CHN', 'COL', 'CZE', 'DEU', 'DNK', 'ESP', 'FIN', 'FRA', 'GBR', 'HUN', 'IDN', 'IND', 'IRL', 'ITA', 'JPN', 'KOR', 'MEX', 'MYS', 'NLD', 'NOR', 'NZL', 'PHL', 'POL', 'PRT', 'RUS', 'SWE', 'THA', 'TU

| | country_code | country_name | year | indicator_code | indicator_name | value |
|---|---|---|---|---|---|---|
| 0 | ARG | Argentina | 2023 | NY.GDP.PCAP.KD | gdp_per_capita_constant | 12933.249734 |
| 1 | ARG | Argentina | 2022 | NY.GDP.PCAP.KD | gdp_per_capita_constant | 13182.793395 |
| 2 | ARG | Argentina | 2021 | NY.GDP.PCAP.KD | gdp_per_capita_constant | 12549.28117 |
| 3 | ARG | Argentina | 2020 | NY.GDP.PCAP.KD | gdp_per_capita_constant | 11393.050596 |
| 4 | ARG | Argentina | 2019 | NY.GDP.PCAP.KD | gdp_per_capita_constant | 12706.397811 |
| 5 | ARG | Argentina | 2018 | NY.GDP.PCAP.KD | gdp_per_capita_constant | 13058.328545 |
| 6 | ARG | Argentina | 2017 | NY.GDP.PCAP.KD | gdp_per_capita_constant | 13520.112985 |
| 7 | ARG | Argentina | 2016 | NY.GDP.PCAP.KD | gdp_per_capita_constant | 13265.886064 |
| 8 | ARG | Argentina | 2015 | NY.GDP.PCAP.KD | gdp_per_capita_constant | 13679.626498 |
| 9 | ARG | Argentina | 2014 | NY.GDP.PCAP.KD | gdp_per_capita_constant | 13456.131916 |



 DATA POINTS PER COUNTRY (top 10):
{'ARG': 34, 'NZL': 34, 'ITA': 34, 'JPN': 34, 'KOR': 34, 'MEX': 34, 'MYS': 34, 'NLD': 34, 'NOR': 34, 'PHL': 34}
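
`collect_wb_indicator_fixed` requests page 1 once to read the pagination metadata and then again inside the page loop. One possible way to avoid the duplicate request is to reuse the first response, as sketched below. The generator `iter_records`, the `fetch_page` callable, and the fake three-page dataset are all hypothetical stand-ins so the example runs offline; `fetch_page(n)` is assumed to return the `(data[0], data[1])` pair of the API response for page `n`.

```python
from typing import Callable, Dict, Iterator, List, Tuple

# (pagination metadata, records), mirroring data[0] and data[1] in the API response
Page = Tuple[Dict, List[Dict]]


def iter_records(fetch_page: Callable[[int], Page], max_pages: int = 15) -> Iterator[Dict]:
    """Yield records from every page, requesting each page exactly once."""
    meta, records = fetch_page(1)
    total_pages = int(meta.get("pages", 1))
    yield from records or []  # reuse page 1 instead of fetching it again
    for page in range(2, min(total_pages, max_pages) + 1):
        _, records = fetch_page(page)
        yield from records or []


# Fake paginated source standing in for the HTTP call
FAKE_PAGES = {
    1: [{"id": 1}, {"id": 2}],
    2: [{"id": 3}, {"id": 4}],
    3: [{"id": 5}],
}
requested_pages: List[int] = []


def fake_fetch(page: int) -> Page:
    requested_pages.append(page)
    return {"pages": 3, "total": 5}, FAKE_PAGES[page]


all_items = list(iter_records(fake_fetch))
print(len(all_items), requested_pages)
```

For the 10-page indicator above, this would save one request per indicator.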


In [24]:
# ALTERNATIVE: Country-Specific Requests
# ======================================================

print("ALTERNATIVE: COUNTRY-SPECIFIC COLLECTION")
print("=" * 50)

def collect_wb_by_country_batch(indicator_code: str, indicator_name: str, 
                               target_countries: List[str], batch_size: int = 10) -> Optional[pd.DataFrame]:
    """
    Collect World Bank data by requesting specific countries in batches.
    """
    
    print(f"   Collecting: {indicator_name}")
    print(f"   Processing {len(target_countries)} countries in batches of {batch_size}")
    
    all_records = []
    
    # Process countries in batches
    for i in range(0, len(target_countries), batch_size):
        batch = target_countries[i:i+batch_size]
        batch_str = ';'.join(batch)
        
        print(f"Batch {i//batch_size + 1}: {batch}", end="")
        
        try:
            url = f"https://api.worldbank.org/v2/country/{batch_str}/indicator/{indicator_code}"
            params = {
                'format': 'json',
                'date': f'{START_YEAR}:{END_YEAR}',
                'per_page': 5000
            }
            
            response = requests.get(url, params=params, timeout=30)
            response.raise_for_status()
            
            data = response.json()
            
            if isinstance(data, list) and len(data) >= 2:
                records_data = data[1]
                
                if records_data:
                    batch_valid = 0
                    for item in records_data:
                        if not isinstance(item, dict):
                            continue
                        
                        country_code = item.get('countryiso3code')
                        country_info = item.get('country', {})
                        country_name = country_info.get('value')
                        year_str = item.get('date')
                        value = item.get('value')
                        
                        if country_code and year_str and value is not None:
                            try:
                                year = int(year_str)
                                value_float = float(value)
                                
                                all_records.append({
                                    'country_code': country_code,
                                    'country_name': country_name,
                                    'year': year,
                                    'indicator_code': indicator_code,
                                    'indicator_name': indicator_name,
                                    'value': value_float
                                })
                                batch_valid += 1
                                
                            except (ValueError, TypeError):
                                continue
                    
                    print(f" → {batch_valid} records")
                else:
                    print(f" → no data")
            else:
                print(f" → invalid response")
                
        except Exception as e:
            print(f" → error: {str(e)[:50]}")
        
        # Small delay between batches
        time.sleep(1)
    
    if all_records:
        df = pd.DataFrame(all_records)
        countries_found = df['country_code'].nunique()
        print(f"      Total: {len(all_records)} records, {countries_found} countries")
        return df
    else:
        print(f"      No records collected")
        return None

# Test the batch approach
if success_base:
    print("\n Testing batch collection approach...")
    batch_test = collect_wb_by_country_batch('NY.GDP.PCAP.KD', 'gdp_per_capita_constant', 
                                           high_quality_countries[:10], batch_size=5)  # Test with first 10 countries
    
    if batch_test is not None:
        print(f"\n BATCH APPROACH WORKS!")
        print(f"   Records: {len(batch_test)}")
        print(f"   Countries: {batch_test['country_code'].nunique()}")
        print(f"   Sample data:")
        display(batch_test.head())
        
        batch_working = True
    else:
        print(f"\n Batch approach failed")
        batch_working = False

ALTERNATIVE: COUNTRY-SPECIFIC COLLECTION

 Testing batch collection approach...
   Collecting: gdp_per_capita_constant
   Processing 10 countries in batches of 5
Batch 1: ['ARG', 'NZL', 'ITA', 'JPN', 'KOR'] → 170 records
Batch 2: ['MEX', 'MYS', 'NLD', 'NOR', 'PHL'] → 170 records
      Total: 340 records, 10 countries

 BATCH APPROACH WORKS!
   Records: 340
   Countries: 10
   Sample data:


| | country_code | country_name | year | indicator_code | indicator_name | value |
|---|---|---|---|---|---|---|
| 0 | ARG | Argentina | 2023 | NY.GDP.PCAP.KD | gdp_per_capita_constant | 12933.249734 |
| 1 | ARG | Argentina | 2022 | NY.GDP.PCAP.KD | gdp_per_capita_constant | 13182.793395 |
| 2 | ARG | Argentina | 2021 | NY.GDP.PCAP.KD | gdp_per_capita_constant | 12549.28117 |
| 3 | ARG | Argentina | 2020 | NY.GDP.PCAP.KD | gdp_per_capita_constant | 11393.050596 |
| 4 | ARG | Argentina | 2019 | NY.GDP.PCAP.KD | gdp_per_capita_constant | 12706.397811 |
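
The batch approach works because the country segment of the URL accepts several codes joined by semicolons, so each request returns only target countries. With 5 countries and 34 years (1990-2023), a full batch yields 5 × 34 = 170 records, which matches the output above. The sketch below shows just the batching and URL construction; `batched` and `batch_url` are hypothetical helper names, and no request is sent.

```python
from typing import Iterator, List

BASE_URL = "https://api.worldbank.org/v2"


def batched(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Split a list into consecutive chunks of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def batch_url(batch: List[str], indicator_code: str) -> str:
    """Build the indicator URL for one semicolon-separated batch of countries."""
    return f"{BASE_URL}/country/{';'.join(batch)}/indicator/{indicator_code}"


countries = ["ARG", "NZL", "ITA", "JPN", "KOR", "MEX", "MYS"]
urls = [batch_url(b, "NY.GDP.PCAP.KD") for b in batched(countries, 5)]
for url in urls:
    print(url)
```

Keeping the batch small also keeps each response well under the `per_page` limit used in the notebook, so no pagination is needed.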


In [25]:

# FULL WORLD BANK COLLECTION - USING BATCH APPROACH
# ====================================================

print("FULL WORLD BANK DATA COLLECTION")
print("=" * 45)

def collect_all_wb_indicators_batch(target_countries: List[str], 
                                  indicators_dict: Dict[str, str],
                                  batch_size: int = 8) -> Dict:
    """
    Collect all World Bank indicators using efficient batch approach.
    
    Parameters:
    -----------
    target_countries : List[str]
        List of country codes to collect
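    indicators_dict : Dict[str, str] 
        Dictionary of indicator_code: indicator_name. Note: the collection loop
        below iterates over the hard-coded indicator_categories, not this argument.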
    batch_size : int
        Number of countries per batch
        
    Returns:
    --------
    Dict : Collection results and summary
    """
    
    # Organize indicators by category for better tracking
    indicator_categories = {
        'Core Economic': {
            'NY.GDP.PCAP.KD': 'gdp_per_capita_constant',
            'NY.GDP.MKTP.KD.ZG': 'gdp_growth_annual', 
            'NY.GDP.PCAP.KD.ZG': 'gdp_per_capita_growth',
            'NY.GDP.MKTP.PP.KD': 'gdp_ppp_constant'
        },
        'Investment & Savings': {
            'NE.GDI.TOTL.ZS': 'gross_investment_gdp',
            'NY.GNS.ICTR.ZS': 'gross_savings_gdp',
            'NE.GDI.FPRV.ZS': 'private_investment_gdp',
            'BX.KLT.DINV.WD.GD.ZS': 'fdi_net_inflows_gdp'
        },
        'Trade & Openness': {
            'NE.TRD.GNFS.ZS': 'trade_gdp',
            'NE.EXP.GNFS.ZS': 'exports_gdp',
            'NE.IMP.GNFS.ZS': 'imports_gdp',
            'TM.TAX.MRCH.WM.AR.ZS': 'tariff_rate_applied_weighted'
        },
        'Financial Development': {
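            # FS.AST.PRVT.GD.ZS is domestic credit to the private sector (% of GDP);
            # FS.AST.DOMS.GD.ZS is credit provided by the financial sector.
            'FS.AST.PRVT.GD.ZS': 'domestic_credit_private_gdp',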
            'FB.BNK.CAPA.ZS': 'bank_capital_assets_ratio',
            'CM.MKT.LCAP.GD.ZS': 'market_cap_gdp',
            'FR.INR.RINR': 'real_interest_rate'
        },
        'Government & Fiscal': {
            'GC.DOD.TOTL.GD.ZS': 'government_debt_gdp',
            'GC.TAX.TOTL.GD.ZS': 'tax_revenue_gdp',
            'GC.XPN.TOTL.GD.ZS': 'government_expenditure_gdp'
        },
        'Labor & Social': {
            'SL.UEM.TOTL.ZS': 'unemployment_total',
            'SE.TER.ENRR': 'tertiary_education_enrollment',
            'SP.POP.GROW': 'population_growth',
            'SP.URB.TOTL.IN.ZS': 'urban_population_pct'
        },
        'Innovation & Tech': {
            'GB.XPD.RSDV.GD.ZS': 'research_development_gdp',
            'IP.PAT.RESD': 'patent_applications_residents',
            'IT.NET.USER.ZS': 'internet_users_pct'
        }
    }
    
    # Collection tracking
    all_data = []
    collection_summary = []
    category_results = {}
    
    total_indicators = sum(len(cat_indicators) for cat_indicators in indicator_categories.values())
    collected_count = 0
    
    print(f"Collecting {total_indicators} indicators across {len(indicator_categories)} categories")
    print(f"Target: {len(target_countries)} countries, {START_YEAR}-{END_YEAR}")
    print(f"Using batch size: {batch_size} countries per request")
    
    collection_start = time.time()
    
    # Collect by category
    for category_name, category_indicators in indicator_categories.items():
        print(f"\nCATEGORY: {category_name}")
        print("-" * 50)
        
        category_data = []
        category_success = 0
        
        for indicator_code, indicator_name in category_indicators.items():
            collected_count += 1
            print(f"   [{collected_count:2d}/{total_indicators}] {indicator_name}")
            
            # Collect this indicator
            indicator_df = collect_wb_by_country_batch(
                indicator_code, indicator_name, target_countries, batch_size
            )
            
            if indicator_df is not None and len(indicator_df) > 0:
                # Quality assessment
                countries_covered = indicator_df['country_code'].nunique()
                coverage_rate = countries_covered / len(target_countries)
                avg_data_points = len(indicator_df) / countries_covered if countries_covered > 0 else 0
                
                # Quality score: 70% country coverage + 30% data density
                # (density saturates at 30 observations per country)
                quality_score = coverage_rate * 0.7 + min(avg_data_points / 30, 1.0) * 0.3
                
                if quality_score > 0.4:  # Acceptance threshold
                    all_data.append(indicator_df)
                    category_data.append(indicator_df)
                    category_success += 1
                    
                    print(f"Success: {len(indicator_df)} records, {countries_covered} countries, "
                          f"quality: {quality_score:.3f}")
                    
                    collection_summary.append({
                        'category': category_name,
                        'indicator_code': indicator_code,
                        'indicator_name': indicator_name,
                        'status': 'success',
                        'records': len(indicator_df),
                        'countries': countries_covered,
                        'coverage_rate': coverage_rate,
                        'quality_score': quality_score
                    })
                else:
                    print(f"      ⚠️ Low quality: {quality_score:.3f}, skipping")
                    collection_summary.append({
                        'category': category_name,
                        'indicator_code': indicator_code,
                        'indicator_name': indicator_name,
                        'status': 'low_quality',
                        'records': len(indicator_df),  # indicator_df is non-empty in this branch
                        'countries': countries_covered,
                        'coverage_rate': coverage_rate,
                        'quality_score': quality_score
                    })
            else:
                print("Failed: No data collected")
                collection_summary.append({
                    'category': category_name,
                    'indicator_code': indicator_code,
                    'indicator_name': indicator_name,
                    'status': 'failed',
                    'records': 0,
                    'countries': 0,
                    'coverage_rate': 0,
                    'quality_score': 0
                })
        
        # Category summary
        category_results[category_name] = {
            'attempted': len(category_indicators),
            'successful': category_success,
            'success_rate': category_success / len(category_indicators)
        }
        
        print(f"Category result: {category_success}/{len(category_indicators)} "
              f"({category_success/len(category_indicators):.1%} success)")
    
    collection_time = time.time() - collection_start
    
    # Overall summary
    successful_collections = [item for item in collection_summary if item['status'] == 'success']
    
    print(f"\n COLLECTION COMPLETE!")
    print("=" * 30)
    print(f"Total time: {collection_time/60:.1f} minutes")
    print(f"Successful indicators: {len(successful_collections)}/{total_indicators}")
    print(f"Overall success rate: {len(successful_collections)/total_indicators:.1%}")
    print(f"Total records collected: {sum(len(df) for df in all_data):,}")
    
    return {
        'data_frames': all_data,
        'summary': collection_summary,
        'category_results': category_results,
        'collection_time': collection_time,
        'successful_count': len(successful_collections)
    }

# Run the full collection
# Default to False so the next cell can test this flag even if collection is skipped
wb_collection_success = False

if success_base:
    print("Starting full World Bank collection...")
    wb_results = collect_all_wb_indicators_batch(
        high_quality_countries, 
        WORLD_BANK_INDICATORS,
        batch_size=8  # 38 countries -> 5 requests per indicator
    )
    
    wb_collection_success = wb_results['successful_count'] > 0
    if wb_collection_success:
        print("\nCollection successful!")
    else:
        print("\nCollection failed!")

FULL WORLD BANK DATA COLLECTION
Starting full World Bank collection...
Collecting 26 indicators across 7 categories
Target: 38 countries, 1990-2023
Using batch size: 8 countries per request

CATEGORY: Core Economic
--------------------------------------------------
   [ 1/26] gdp_per_capita_constant
   Collecting: gdp_per_capita_constant
   Processing 38 countries in batches of 8
Batch 1: ['ARG', 'NZL', 'ITA', 'JPN', 'KOR', 'MEX', 'MYS', 'NLD'] → 272 records
Batch 2: ['NOR', 'PHL', 'AUS', 'POL', 'PRT', 'RUS', 'SWE', 'THA'] → 272 records
Batch 3: ['TUR', 'USA', 'IRL', 'IND', 'IDN', 'HUN', 'AUT', 'BEL'] → 272 records
Batch 4: ['BRA', 'CAN', 'CHE', 'CHL', 'CHN', 'COL', 'CZE', 'DEU'] → 272 records
Batch 5: ['DNK', 'ESP', 'FIN', 'FRA', 'GBR', 'ZAF'] → 204 records
      Total: 1292 records, 38 countries
Success: 1292 records, 38 countries, quality: 1.000
   [ 2/26] gdp_growth_annual
   Collecting: gdp_growth_annual
   Processing 38 countries in batches of 8
Batch 1: ['ARG', 'NZL', 'ITA', 'JP
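
**Checking the log arithmetic.** The batch sizes, record counts, and quality score printed for `gdp_per_capita_constant` can be reproduced by hand. The sketch below is not the notebook's `collect_wb_by_country_batch` function; it only re-derives the numbers, assuming every country returns one record per year for 1990-2023. The helper names `chunk_countries` and `quality_score` and the placeholder country codes are hypothetical and used for illustration only.

```python
from typing import Iterator, List

N_YEARS = 2023 - 1990 + 1  # 34 annual observations per country


def chunk_countries(codes: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size country codes."""
    for start in range(0, len(codes), batch_size):
        yield codes[start:start + batch_size]


def quality_score(n_records: int, n_covered: int, n_target: int) -> float:
    """Same weighting as the collection loop: 70% coverage, 30% density."""
    if n_covered == 0:
        return 0.0
    coverage_rate = n_covered / n_target
    avg_data_points = n_records / n_covered
    return coverage_rate * 0.7 + min(avg_data_points / 30, 1.0) * 0.3


# Placeholder codes: only the count (38) matters here
codes = [f"C{i:02d}" for i in range(38)]
batch_records = [len(batch) * N_YEARS for batch in chunk_countries(codes, 8)]
total_records = sum(batch_records)

print(batch_records)                                  # records per batch
print(total_records)                                  # records for one indicator
print(f"{quality_score(total_records, 38, 38):.3f}")  # fully covered indicator
print(f"{quality_score(190, 19, 38):.3f}")            # invented sparse case
```

Under that assumption the four full batches give 272 records each, the last batch of 6 countries gives 204, and the total is 1292, which matches the log. The invented sparse case (half the countries, 10 observations each) scores 0.45, just above the 0.4 acceptance threshold, which shows how heavily the score leans on country coverage.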

In [28]:

# Process and Save World Bank Results 
# ======================================================

if 'wb_results' in locals() and wb_collection_success:
    print(" PROCESSING WORLD BANK RESULTS")
    print("=" * 40)
    
    # Combine all successful data frames
    print("   Combining all indicator datasets...")
    wb_combined = pd.concat(wb_results['data_frames'], ignore_index=True)
    
    print(f"   Combined dataset: {wb_combined.shape}")
    print(f"   Indicators: {wb_combined['indicator_name'].nunique()}")
    print(f"   Countries: {wb_combined['country_code'].nunique()}")
    print(f"   Years: {wb_combined['year'].min()}-{wb_combined['year'].max()}")
    
    # Create wide format
    print("   Creating wide format...")
    wb_wide = wb_combined.pivot_table(
        index=['country_code', 'country_name', 'year'],
        columns='indicator_name',
        values='value',
        aggfunc='first'
    ).reset_index()
    wb_wide.columns.name = None
    
    print(f"Wide format: {wb_wide.shape}")
    
    # Integration with Maddison
    print("   Integrating with Maddison data...")
    maddison_data = pd.read_csv('data/maddison_focused_test.csv')
    
    integrated_final = maddison_data.merge(
        wb_wide,
        on=['country_code', 'year'],
        how='outer',
        suffixes=('_maddison', '_wb')
    )
    
    # Handle duplicate country names
    if 'country_name_maddison' in integrated_final.columns:
        integrated_final['country_name'] = integrated_final['country_name_maddison'].fillna(
            integrated_final['country_name_wb']
        )
        integrated_final = integrated_final.drop(['country_name_maddison', 'country_name_wb'], axis=1)
    
    print(f"Final integrated dataset: {integrated_final.shape}")
    
    # Save all results
    print(f"\n SAVING RESULTS:")
    wb_wide.to_csv('data/worldbank_batch_collection.csv', index=False)
    # Main analysis dataset: Maddison series plus World Bank indicators
    integrated_final.to_csv('data/final_integrated_dataset.csv', index=False)
    
    # Save summary
    summary_df = pd.DataFrame(wb_results['summary'])
    summary_df.to_csv('data/worldbank_collection_summary.csv', index=False)
    
    print(f"   World Bank data: data/worldbank_batch_collection.csv")
    print(f"   Final dataset: data/final_integrated_dataset.csv") 
    print(f"   Collection summary: data/worldbank_collection_summary.csv")
    
    # Quality analysis
    successful_indicators = summary_df[summary_df['status'] == 'success']
    
    print(f"\nFINAL RESULTS SUMMARY:")
    print("=" * 30)
    print(f"Successful indicators: {len(successful_indicators)}")
    print(f"Countries in final dataset: {integrated_final['country_code'].nunique()}")
    print(f"Final dataset shape: {integrated_final.shape}")
    print(f"Time period: {integrated_final['year'].min()}-{integrated_final['year'].max()}")
    
    # Category performance
    print(f"\n SUCCESS BY CATEGORY:")
    for category, results in wb_results['category_results'].items():
        print(f"   {category}: {results['successful']}/{results['attempted']} "
              f"({results['success_rate']:.1%})")
    
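    # Coverage analysis
    # Note: numeric_cols also picks up 'year', which is always populated, so it
    # slightly inflates the coverage figure and is counted in the indicator total below.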
    numeric_cols = integrated_final.select_dtypes(include=[np.number]).columns
    overall_coverage = integrated_final[numeric_cols].notna().mean().mean()
    
    print(f"\n DATA QUALITY:")
    print(f"   Overall data coverage: {overall_coverage:.1%}")
    print(f"   Maddison + WB indicators: {len(numeric_cols)}")
    
    print(f"\n READY FOR PHASE 2: FEATURE ENGINEERING!")
    
    final_success = True
else:
    print(" Cannot process results - collection failed")
    final_success = False

 PROCESSING WORLD BANK RESULTS
   Combining all indicator datasets...
   Combined dataset: (27988, 6)
   Indicators: 26
   Countries: 38
   Years: 1990-2023
   Creating wide format...
Wide format: (1292, 29)
   Integrating with Maddison data...
Final integrated dataset: (1292, 32)

 SAVING RESULTS:
   World Bank data: data/worldbank_batch_collection.csv
   Final dataset: data/final_integrated_dataset.csv
   Collection summary: data/worldbank_collection_summary.csv

FINAL RESULTS SUMMARY:
Successful indicators: 26
Countries in final dataset: 38
Final dataset shape: (1292, 32)
Time period: 1990-2023

 SUCCESS BY CATEGORY:
   Core Economic: 4/4 (100.0%)
   Investment & Savings: 4/4 (100.0%)
   Trade & Openness: 4/4 (100.0%)
   Financial Development: 4/4 (100.0%)
   Government & Fiscal: 3/3 (100.0%)
   Labor & Social: 4/4 (100.0%)
   Innovation & Tech: 3/3 (100.0%)

 DATA QUALITY:
   Overall data coverage: 84.8%
   Maddison + WB indicators: 29

 READY FOR PHASE 2: FEATURE ENGINEERING!
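
**How the reshape and merge behave.** The processing cell pivots the long World Bank table to one row per country-year, then outer-merges it with Maddison and coalesces the two `country_name` columns. The toy example below is a minimal sketch of those three steps. All values, the country codes `AAA`/`BBB`/`CCC`, and the Maddison column name `gdppc` are invented for illustration and are not taken from the project data.

```python
import pandas as pd

# Long format: one row per country, year, and indicator (invented values)
long_df = pd.DataFrame({
    "country_code": ["AAA", "AAA", "BBB", "BBB"],
    "country_name": ["Alpha", "Alpha", "Beta", "Beta"],
    "year": [2000, 2000, 2000, 2000],
    "indicator_name": ["gdp_growth_annual", "unemployment_total"] * 2,
    "value": [2.5, 7.0, 1.0, 5.5],
})

# Wide format: one row per country-year, one column per indicator
wide = long_df.pivot_table(
    index=["country_code", "country_name", "year"],
    columns="indicator_name",
    values="value",
    aggfunc="first",
).reset_index()
wide.columns.name = None

# Hypothetical Maddison extract; CCC has no World Bank rows, BBB has no Maddison row
maddison = pd.DataFrame({
    "country_code": ["AAA", "CCC"],
    "country_name": ["Alpha", "Gamma"],
    "year": [2000, 2000],
    "gdppc": [10000.0, 8000.0],
})

# The outer merge keeps country-years present in either source
merged = maddison.merge(
    wide,
    on=["country_code", "year"],
    how="outer",
    suffixes=("_maddison", "_wb"),
)

# Prefer the Maddison name and fall back to the World Bank name
merged["country_name"] = merged["country_name_maddison"].fillna(
    merged["country_name_wb"]
)
merged = merged.drop(columns=["country_name_maddison", "country_name_wb"])

print(merged)
```

Because the merge is `how='outer'`, a country-year found in only one source still appears in the result, with missing values in the other source's columns. In the real run both sources cover the same 38 countries and 34 years, so the integrated dataset keeps the 1292 rows of the wide table.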
