# BA900 Exploratory Data Analysis

This notebook demonstrates how to load and explore the South African Reserve Bank (SARB) BA900 return data.  We focus on calculating non‑performing loan (NPL) ratios for banks, retrieving macroeconomic indicators and visualising their relationships.  The code relies on the reusable functions defined in the `ba900` package contained in this repository.

> **Note:** Running this notebook requires that you have previously scraped and cached BA900 data using the `ba900.scraper` module.  You should also ensure that the necessary Python dependencies are installed via `requirements.txt`.

In [10]:
# Standard imports
import pandas as pd
import matplotlib.pyplot as plt

# Add project root to Python path so we can import ba900 modules
import os
import sys
project_root = os.path.dirname(os.getcwd())  # Go up one level from notebooks to project root
if project_root not in sys.path:
    sys.path.insert(0, project_root)

# Import project modules
from ba900.scraper import load_cached_data
from ba900.macro_fetcher import get_world_bank_indicators
from ba900.modeling import aggregate_bank_data, prepare_regression_dataset, train_simple_model
from ba900.visualization import plot_npl_over_time, plot_npl_vs_macro

## 1. Load BA900 data

The BA900 returns are stored locally after scraping.  Use the `load_cached_data` function to read them into a dictionary of DataFrames keyed by `(period, institutionId)`, so that each entry corresponds to a single institution and reporting period.  Adjust the `periods` list to match the months you downloaded.

In [None]:
# Replace the empty list below with the months you scraped
periods = []  # e.g. ['2024-12-01', '2025-01-01']

# Load the cached BA900 data for the selected periods
if periods:
    bank_frames = load_cached_data(periods)
    # 'bank_frames' is a dictionary mapping (period, institutionId) to DataFrames
    print(f"Loaded {len(bank_frames)} institution returns")
else:
    print("Please specify periods before running this cell.")
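
Once the data are loaded, it can help to check how many institution returns each period contains.  The snippet below is a minimal sketch that uses a hand-built stand-in for `bank_frames`; the periods, institution IDs and values are made up for illustration, and it assumes only the `(period, institutionId)` key structure noted in the cell above.

```python
from collections import Counter

import pandas as pd

# Hypothetical stand-in for the dictionary returned by load_cached_data:
# keys are (period, institutionId) tuples, values are DataFrames.
example_frames = {
    ("2024-12-01", "inst1"): pd.DataFrame({"gross_loans": [100.0]}),
    ("2024-12-01", "inst2"): pd.DataFrame({"gross_loans": [250.0]}),
    ("2025-01-01", "inst1"): pd.DataFrame({"gross_loans": [110.0]}),
}

# Count how many institution returns were loaded for each period
returns_per_period = Counter(period for period, _inst_id in example_frames)
for period, count in sorted(returns_per_period.items()):
    print(f"{period}: {count} institution return(s)")
```

A period with far fewer returns than its neighbours is usually a sign that the scrape for that month was incomplete.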

## 2. Compute NPL ratios

To compute the non‑performing loan ratio, we need to identify the columns corresponding to non‑performing loans and gross loans in the BA900 return.  These column names vary by institution; consult the XML schema for specifics.  For the purposes of this example, we assume the columns are named `non_performing_loans` and `gross_loans`.

We then combine all institutions into a single panel DataFrame using `aggregate_bank_data`.
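
The ratio itself is non‑performing loans divided by gross loans.  As a minimal sketch of that calculation (not necessarily how `aggregate_bank_data` implements it), the example below uses made‑up figures and the assumed column names from above; the helper `add_npl_ratio` and the `npl_ratio` column name are hypothetical.

```python
import numpy as np
import pandas as pd


def add_npl_ratio(df, npl_field="non_performing_loans", loans_field="gross_loans"):
    """Return a copy of df with an 'npl_ratio' column (NPLs / gross loans)."""
    out = df.copy()
    # Treat zero gross loans as missing so the ratio is NaN rather than inf
    denominator = out[loans_field].replace(0, np.nan)
    out["npl_ratio"] = out[npl_field] / denominator
    return out


# Made-up figures for illustration only
toy = pd.DataFrame({
    "institution": ["inst1", "inst2", "inst3"],
    "non_performing_loans": [5.0, 12.0, 0.0],
    "gross_loans": [100.0, 300.0, 0.0],
})
toy_with_ratio = add_npl_ratio(toy)
print(toy_with_ratio[["institution", "npl_ratio"]])
```

Replacing a zero denominator with `NaN` keeps institutions with no reported loans from producing infinite ratios that would distort later plots or regressions.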

In [None]:
# Aggregate returns into a panel and compute NPL ratio
if periods:
    # Collect every institution's DataFrame, tagging each with its reporting date
    all_records = []
    for (period, inst_id), df in bank_frames.items():
        df = df.copy()  # avoid mutating the cached frame
        df['date'] = pd.to_datetime(period)
        all_records.append(df)
    panel = aggregate_bank_data(all_records, npl_field='non_performing_loans', loans_field='gross_loans')
    # A bare expression inside an if-block is not rendered, so display it explicitly
    display(panel.head())
else:
    print("No data loaded.")

## 3. Fetch macroeconomic indicators

Next we fetch macro variables from the World Bank API.  The function `get_world_bank_indicators` accepts a dictionary mapping friendly names to World Bank indicator codes.  Below we request annual GDP growth (`NY.GDP.MKTP.KD.ZG`) and consumer price inflation (`FP.CPI.TOTL.ZG`) for South Africa (`ZAF`) from 2000 to 2025.  The result is a DataFrame indexed by year.
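
For orientation, the sketch below shows how a request to the public World Bank API (v2) is typically formed and how its JSON response, a two‑element list of metadata and observations, can be reshaped into a year‑indexed series.  It is an illustration only: the helpers `build_indicator_url` and `payload_to_series` are hypothetical, the sample payload values are invented, and `get_world_bank_indicators` may work differently internally.

```python
import pandas as pd


def build_indicator_url(indicator, country="ZAF", start_year=2000, end_year=2025):
    """Build a World Bank API v2 URL for one indicator and country."""
    return (
        f"https://api.worldbank.org/v2/country/{country}/indicator/{indicator}"
        f"?format=json&date={start_year}:{end_year}&per_page=1000"
    )


def payload_to_series(payload, name):
    """Convert a [metadata, observations] payload into a year-indexed Series."""
    observations = payload[1] or []
    values = {int(obs["date"]): obs["value"] for obs in observations}
    return pd.Series(values, name=name, dtype="float64").sort_index()


# Placeholder payload in the API's two-element layout; the values are invented
sample_payload = [
    {"page": 1, "pages": 1, "total": 2},
    [
        {"date": "2001", "value": 2.5},
        {"date": "2000", "value": None},
    ],
]
print(build_indicator_url("NY.GDP.MKTP.KD.ZG"))
print(payload_to_series(sample_payload, "gdp_growth"))
```

Years for which the API reports no value arrive as `null` and become `NaN` in the series, so recent years may need to be dropped before modelling.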

In [None]:
# Download macro indicators (GDP growth and inflation)
indicators = {
    'gdp_growth': 'NY.GDP.MKTP.KD.ZG',
    'inflation': 'FP.CPI.TOTL.ZG'
}

macro_df = get_world_bank_indicators(indicators, start_year=2000, end_year=2025)
macro_df.head()

## 4. Merge and explore

Finally we merge the bank panel and macro data.  Because macro data are annual, we resample the bank panel to year‑end values.  The merged dataset can then be used for modelling or plotting.
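
The year‑end alignment can be sketched with plain pandas as follows.  This is an assumption about the general approach rather than the actual implementation of `prepare_regression_dataset`: the toy data are made up, the column names `institution`, `date` and `npl_ratio` are assumed, and "year‑end" is taken to mean each institution's last available observation in a calendar year.

```python
import pandas as pd

# Made-up monthly panel: one row per institution and month
toy_panel = pd.DataFrame({
    "institution": ["inst1", "inst1", "inst1", "inst2", "inst2"],
    "date": pd.to_datetime(
        ["2023-11-01", "2023-12-01", "2024-01-01", "2023-12-01", "2024-01-01"]
    ),
    "npl_ratio": [0.040, 0.045, 0.050, 0.030, 0.035],
})
# Made-up annual macro data indexed by year
toy_macro = pd.DataFrame(
    {"gdp_growth": [0.7, 0.6]}, index=pd.Index([2023, 2024], name="year")
)

# Keep each institution's last observation in every calendar year
yearly = (
    toy_panel.sort_values("date")
    .assign(year=lambda d: d["date"].dt.year)
    .groupby(["institution", "year"], as_index=False)
    .last()
)
# Attach the macro variables for the matching year
toy_merged = yearly.merge(toy_macro, left_on="year", right_index=True, how="left")
print(toy_merged[["institution", "year", "npl_ratio", "gdp_growth"]])
```

A left merge keeps bank observations for years in which the macro series has no value yet; those rows simply carry `NaN` in the macro columns.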

In [None]:
if periods:
    merged = prepare_regression_dataset(panel, macro_df, date_freq='A')
    # A bare expression inside an if-block is not rendered, so display it explicitly
    display(merged.head())
else:
    print("No data loaded.")

### Visualisation

We can now visualise how NPL ratios evolve over time or relate to macro variables.  Use the plotting functions from the `ba900.visualization` module.  For example, plot the NPL ratio over time for a couple of institutions (replace `inst1` and `inst2` with real identifiers), and examine the relationship between NPL ratio and GDP growth.
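
If you want to see what such a time‑series plot involves, or need one before the package functions are available, a rough equivalent can be written directly with matplotlib.  The function `sketch_npl_over_time` below is hypothetical and is not the implementation of `plot_npl_over_time`; it assumes a panel with `institution`, `date` and `npl_ratio` columns, and the data are made up.

```python
import matplotlib.pyplot as plt
import pandas as pd


def sketch_npl_over_time(panel, institutions):
    """Draw one NPL-ratio line per requested institution and return the Figure."""
    fig, ax = plt.subplots(figsize=(8, 4))
    for inst in institutions:
        subset = panel[panel["institution"] == inst].sort_values("date")
        ax.plot(subset["date"], subset["npl_ratio"], marker="o", label=inst)
    ax.set_xlabel("Reporting period")
    ax.set_ylabel("NPL ratio")
    ax.legend(title="Institution")
    fig.autofmt_xdate()  # tilt date labels so they do not overlap
    return fig


# Made-up data for illustration only
toy_panel = pd.DataFrame({
    "institution": ["inst1", "inst1", "inst2", "inst2"],
    "date": pd.to_datetime(["2024-12-01", "2025-01-01", "2024-12-01", "2025-01-01"]),
    "npl_ratio": [0.045, 0.050, 0.030, 0.035],
})
fig = sketch_npl_over_time(toy_panel, ["inst1", "inst2"])
```

Returning the `Figure` rather than calling `plt.show()` inside the function leaves the caller free to save or further annotate the plot.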

In [None]:
# Example plots -- replace 'inst1' and 'inst2' with actual institution codes
if periods:
    fig1 = plot_npl_over_time(panel, institutions=['inst1', 'inst2'])
    fig1.show()
    fig2 = plot_npl_vs_macro(merged, macro_var='gdp_growth', hue='institution')
    fig2.show()
else:
    print("No data loaded.")