# Fixed Income Funds Recommendation System

This notebook demonstrates the data ingestion, feature engineering, and scoring pipeline for a fixed-income fund recommendation system.

What you'll see here:

- Define data sources and feature/score configuration using a YAML manifest.
- Fetch and prepare datasets (partitioned by period).
- Compute fund-month features and derive scores using configurable YAML registries.
- Produce ranked fund profiles based on profile weights.

**Prerequisites:** Python packages: `pandas`, `requests`, `pyyaml`. Optional: `pyarrow` or `fastparquet` for Parquet I/O.

> Note: For quick demos this notebook may use local CSV fallbacks; call `fetch_manifest(...)` to run the full end-to-end pipeline against remote sources.
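
Before running the cells, it can help to confirm that the prerequisites are importable. The following is a minimal sketch using only the standard library; the helper name `missing_modules` is illustrative and not part of `fif_recsys`.

```python
from importlib.util import find_spec

REQUIRED = ["pandas", "requests", "yaml"]      # `pyyaml` is imported as `yaml`
PARQUET_ENGINES = ["pyarrow", "fastparquet"]   # optional, either one is enough


def missing_modules(names):
    """Return the top-level module names that cannot be found (hypothetical helper)."""
    return [name for name in names if find_spec(name) is None]


missing_required = missing_modules(REQUIRED)
# A Parquet engine is available if at least one of the candidates was found.
has_parquet_engine = len(missing_modules(PARQUET_ENGINES)) < len(PARQUET_ENGINES)

print("Missing required packages:", missing_required or "none")
print("Parquet engine available:", has_parquet_engine)
```

If no Parquet engine is found, the CSV fallback mentioned above is the path to expect.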

In [36]:
%load_ext autoreload
%autoreload 2



## Quick check

Run a quick sanity check to ensure the package is importable and that basic helpers (like `hello()`) work as expected. This is useful to confirm the development environment is set up correctly before running heavier pipeline steps.

In [37]:
from fif_recsys import hello

# This will print to the notebook output
hello()

## Configuration manifest (YAML)

The configuration dictionary (`config_d`) defines how data is fetched and how features and scores are computed.

- `fetch`: datasets to download. Each dataset includes `base_url`, `periods`, and `filename_template`.
- `feature`: registry of features to compute, including aggregation method and optional adjustments.
- `score`: scoring definitions (type, feature source, and adjustments like `invert`).
- `profile`: named profile weightings used to aggregate scores into a single ranking for each investor profile.

Edit these values to match your data sources and scoring preferences.
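
When editing profile weights, a quick consistency check is to confirm that each profile's weights sum to 1, which holds for the three profiles in this notebook's manifest. The sketch below is an illustration only: whether `fif_recsys` enforces this is not shown here, `weight_total` is a hypothetical helper, and the weights are copied into a plain dict so the block runs on its own.

```python
import math

# Weights copied from the `profile` section of this notebook's manifest.
profiles = {
    "conservative": {
        "description": "Risk-averse investors.",  # non-numeric entries are skipped
        "size_score": 0.25,
        "diversification_score": 0.20,
        "issuer_diversification_score": 0.20,
        "credit_risk_score": 0.15,
        "governance_risk_score": 0.10,
        "concentration_risk_score": 0.10,
    },
    "balanced": {
        "size_score": 0.20,
        "diversification_score": 0.15,
        "issuer_diversification_score": 0.15,
        "credit_risk_score": 0.20,
        "governance_risk_score": 0.15,
        "concentration_risk_score": 0.15,
    },
    "institutional": {
        "size_score": 0.30,
        "diversification_score": 0.20,
        "issuer_diversification_score": 0.20,
        "credit_risk_score": 0.10,
        "governance_risk_score": 0.10,
        "concentration_risk_score": 0.10,
    },
}


def weight_total(profile):
    """Sum the numeric entries of a profile, ignoring keys such as `description`."""
    return sum(v for v in profile.values() if isinstance(v, (int, float)))


for name, profile in profiles.items():
    total = weight_total(profile)
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        raise ValueError(f"Profile {name!r} weights sum to {total}, expected 1.0")
    print(f"{name}: weights sum to {total:.2f}")
```

Weights that do not sum to 1 would still produce a ranking, but profile scores would no longer be comparable across profiles.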

In [45]:
import yaml

config_d = yaml.safe_load("""
fetch:
    cda:
        base_url: "https://dados.cvm.gov.br/dados/FI/DOC/CDA/DADOS/"
        periods:
            - "202501"
            - "202502"
            - "202503"
            - "202504"
            - "202505"
            - "202506"
            - "202507"
            - "202508"
            - "202509"
            - "202510"
            - "202511"
            - "202512"
        filename_template: "cda_fi_{period}.zip"

    cotas:
        base_url: "https://dados.cvm.gov.br/dados/FI/DOC/INF_DIARIO/DADOS/"
        periods:
            - "202301"
            - "202302"
            - "202303"
            - "202304"
            - "202305"
            - "202306"
            - "202307"
            - "202308"
            - "202309"
            - "202310"
            - "202311"
            - "202312"
                          
            - "202401"
            - "202402"
            - "202403"
            - "202404"
            - "202405"
            - "202406"
            - "202407"
            - "202408"
            - "202409"
            - "202410"
            - "202411"
            - "202412"
                          
            - "202501"
            - "202502"
            - "202503"
            - "202504"
            - "202505"
            - "202506"
            - "202507"
            - "202508"
            - "202509"
            - "202510"
            - "202511"
            - "202512"
        filename_template: "inf_diario_fi_{period}.zip"
feature:
    group_keys:
        - CNPJ_FUNDO_CLASSE
        - DENOM_SOCIAL
        - reference_date
    feature_registry:
        cda:
            patrimonio_liq:
                description: "Maximum reported net asset value per fund-month."
                method: max
                args:
                    - VL_PATRIM_LIQ
                            
            log_aum:
                description: "Natural log of the maximum reported net asset value per fund-month."
                method: max
                args:
                    - VL_PATRIM_LIQ
                adjustment:
                    - log

            total_posicao:
                description: "Sum of final market value of all positions in the period."
                method: sum
                args:
                    - VL_MERC_POS_FINAL

            n_ativos:
                description: "Number of unique assets in the fund portfolio."
                method: nunique
                args:
                    - CD_ATIVO

            n_emissores:
                description: "Number of unique issuers in the fund portfolio."
                method: nunique
                args:
                    - CPF_CNPJ_EMISSOR

            credito_share:
                description: "Weighted share of credit-linked assets in the portfolio."
                method: credito_share_feature_fn
                args:
                    - ["Debêntures", "Cédula de Crédito", "CRI", "CRA", "Notas Promissórias"]
                adjustment:
                    - clip

            related_party_share:
                description: "Weighted share of related-party issuers."
                method: related_party_share_feature_fn
                adjustment:
                    - clip

            issuer_hhi:
                description: "Herfindahl-Hirschman index based on issuer weights."
                method: hhi_feature_fn
                adjustment:
                    - clip
                    - coalesce
        cotas:
            
score:
    size_score:
        type: zscore
        description: >
            Measures the relative size of the fund based on its assets under
            management. Larger funds typically exhibit greater operational
            stability, better liquidity access, and lower idiosyncratic risk.
            Computed using the z-score of the log-transformed AUM (log_aum).
        args:
            feature: log_aum

    diversification_score:
        type: zscore
        description: >
            Evaluates how diversified the fund's portfolio is in terms of
            the number of unique assets held. Higher values indicate broader
            asset diversification, reducing exposure to security-specific risks.
        args:
            feature: n_ativos

    issuer_diversification_score:
        type: zscore
        description: >
            Measures diversification across issuers by counting how many distinct
            counterparties the fund is exposed to. Funds with exposures distributed
            across more issuers typically have lower concentration and reduced
            issuer-specific credit risk.
        args:
            feature: n_emissores

    credit_risk_score:
        type: zscore
        description: >
            Quantifies the fund's exposure to credit-linked instruments such as
            debentures, CRIs/CRAs, and promissory notes. A higher credit share
            typically increases sensitivity to credit events. The score is inverted
            so that higher credit exposure corresponds to a lower (worse) score.
        args:
            feature: credito_share
        adjustment:
            - invert

    governance_risk_score:
        type: zscore
        description: >
            Captures exposure to related-party transactions, which may increase
            governance risk due to potential conflicts of interest and reduced
            market discipline. The score is inverted, so funds with higher
            related-party share receive a lower (worse) score.
        args:
            feature: related_party_share
        adjustment:
            - invert

    concentration_risk_score:
        type: zscore
        description: >
            Measures portfolio concentration using the Herfindahl-Hirschman Index
            (HHI) computed over issuer exposure weights. Higher HHI values indicate
            more concentrated portfolios and elevated idiosyncratic and liquidity
            risks. Score is inverted so higher concentration yields a lower score.
        args:
            feature: issuer_hhi
        adjustment:
            - invert
profile:
  conservative:
    description: >
      Designed for risk-averse investors prioritizing capital preservation and stability.
      Emphasizes fund size, diversification, and issuer spread to minimize volatility,
      while keeping exposure to credit and governance risks tightly controlled.
    size_score: 0.25
    diversification_score: 0.20
    issuer_diversification_score: 0.20
    credit_risk_score: 0.15
    governance_risk_score: 0.10
    concentration_risk_score: 0.10

  balanced:
    description: >
      Suitable for investors seeking a middle ground between safety and return.
      Balances diversification and issuer exposure with moderate tolerance for credit
      and concentration risks, aiming for a stable but growth-oriented allocation.
    size_score: 0.20
    diversification_score: 0.15
    issuer_diversification_score: 0.15
    credit_risk_score: 0.20
    governance_risk_score: 0.15
    concentration_risk_score: 0.15

  institutional:
    description: >
      Targeted at large professional allocators who value scale and diversification
      but can tolerate more concentrated or complex positions. Prioritizes fund size
      and issuer spread while placing relatively lower weight on credit and governance constraints.
    size_score: 0.30
    diversification_score: 0.20
    issuer_diversification_score: 0.20
    credit_risk_score: 0.10
    governance_risk_score: 0.10
    concentration_risk_score: 0.10

""")

## Fetch datasets

Use `fetch_manifest` to download and assemble datasets defined in the manifest. The function returns a `dict` mapping dataset names to `pandas.DataFrame` objects and writes partitioned files to `output_dir/<dataset>/period=<period>/data.parquet` when a Parquet engine is available (a CSV fallback is used otherwise).

The cell below demonstrates the programmatic fetch.
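
To make the manifest-to-disk mapping concrete, the sketch below pairs each period's source URL with the partitioned target path described above. It is an illustration under stated assumptions, not the implementation of `fetch_manifest`: joining `base_url` with the formatted `filename_template` is assumed, the CSV fallback filename `data.csv` is assumed, and `month_periods` and `plan_downloads` are hypothetical helpers.

```python
from pathlib import Path
from urllib.parse import urljoin


def month_periods(first_year, last_year):
    """Build "YYYYMM" strings for every month in the inclusive year range."""
    return [
        f"{year}{month:02d}"
        for year in range(first_year, last_year + 1)
        for month in range(1, 13)
    ]


def plan_downloads(dataset, spec, output_dir, parquet_available=True):
    """Pair each period's source URL with its partitioned target path."""
    suffix = "parquet" if parquet_available else "csv"  # CSV name is an assumption
    plan = []
    for period in spec["periods"]:
        filename = spec["filename_template"].format(period=period)
        url = urljoin(spec["base_url"], filename)
        target = Path(output_dir) / dataset / f"period={period}" / f"data.{suffix}"
        plan.append((url, target))
    return plan


# Same values as the `cotas` entry in the manifest, with periods generated
# instead of listed by hand.
cotas_spec = {
    "base_url": "https://dados.cvm.gov.br/dados/FI/DOC/INF_DIARIO/DADOS/",
    "periods": month_periods(2023, 2025),
    "filename_template": "inf_diario_fi_{period}.zip",
}

plan = plan_downloads("cotas", cotas_spec, "/tmp")
first_url, first_target = plan[0]
print(len(plan), "downloads planned")
print(first_url)
print(first_target)
```

Generating the period list this way also avoids the long hand-written lists in the YAML when a dataset covers whole calendar years.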

In [46]:
from pathlib import Path

from fif_recsys.commands.data import fetch_manifest


data_sources_d = fetch_manifest(config_d['fetch'], output_dir=Path("/tmp"))





## Compute features

Call `compute_all_features`, passing the data sources, the configuration, and `FEATURE_ENGINE`, to aggregate fund-month features according to your `feature_registry`. The result is a DataFrame with one row per fund-month and computed features ready for scoring.
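
Most registry entries are plain aggregations (`max`, `sum`, `nunique`), but `issuer_hhi` is described as a Herfindahl-Hirschman index over issuer weights. The sketch below shows that calculation in plain Python so the arithmetic is visible. It is not the package's `hhi_feature_fn`: weighting issuers by summed position market value is an assumption, and the sample positions are invented for illustration.

```python
from collections import defaultdict


def issuer_hhi(positions):
    """Sum of squared issuer weights for one fund-month.

    `positions` is an iterable of (issuer_id, market_value) pairs.
    Returns None when total exposure is not positive.
    """
    exposure = defaultdict(float)
    for issuer, value in positions:
        exposure[issuer] += value  # several positions may share an issuer
    total = sum(exposure.values())
    if total <= 0:
        return None
    return sum((value / total) ** 2 for value in exposure.values())


# Invented positions: issuer A holds 70% of exposure, issuer B holds 30%.
sample = [("A", 50.0), ("B", 30.0), ("A", 20.0)]
print(issuer_hhi(sample))                 # approximately 0.58
print(issuer_hhi([("A", 100.0)]))         # a single issuer gives 1.0
```

The index ranges from 1/N for N equally weighted issuers up to 1 for a single issuer, which is consistent with the fund in the first output row below having one issuer and an `issuer_hhi` of 1.0.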

In [48]:
from fif_recsys.commands.feature import compute_all_features, FEATURE_ENGINE


feature_df = compute_all_features(data_sources_d, config_d, FEATURE_ENGINE)

feature_df.head()

Skipping dataset 'cotas': no features defined in registry.


  


Unnamed: 0,CNPJ_FUNDO_CLASSE,DENOM_SOCIAL,reference_date,patrimonio_liq,log_aum,total_posicao,n_ativos,n_emissores,credito_share,related_party_share,issuer_hhi
0,06.323.688/0001-27,IT NOW PIBB IBRX-50 FUNDO DE ÍNDICE RESPONSABI...,2026-01-14,986347900.0,20.70952,5825532000.0,55,1,0.0,0.125814,1.0
1,09.260.031/0001-56,FUNDO DE INVESTIMENTO EM QUOTAS DE FUNDO DE IN...,2026-01-14,82364500.0,18.226665,503980600.0,0,8,0.0,0.479135,0.298536
2,10.292.322/0001-05,KONDOR KOBOLD FUNDO DE INVESTIMENTO EM COTAS D...,2026-01-14,528258100.0,20.085096,4007817000.0,0,4,0.0,0.999686,0.612586
3,10.406.511/0001-61,ISHARES IBOVESPA CLASSE DE ÍNDICE - RESPONSABI...,2026-01-14,14990920000.0,23.43071,102854400000.0,103,9,0.0,0.013466,0.364377
4,10.406.600/0001-08,ISHARES BM&FBOVESPA SMALL CAP CLASSE DE ÍNDIC...,2026-01-14,2112755000.0,21.471258,18136060000.0,131,11,0.0,0.035685,0.856891


## Compute scores

Convert features into normalized scores using `compute_scores_from_yaml`. The `score` section in the configuration defines score types (e.g., `zscore`) and optional adjustments (e.g., `invert`). The resulting DataFrame will contain the base features and the derived score columns.
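
As a rough illustration of what a `zscore` score with an `invert` adjustment does, the sketch below standardizes a list of feature values and optionally flips the sign. It is not the implementation of `compute_scores_from_yaml`: the use of the population standard deviation, the handling of constant features, and treating `invert` as a sign flip are all assumptions.

```python
import statistics


def zscore(values, invert=False):
    """Standardize values to mean 0 and unit variance; optionally flip the sign."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population std is an assumption
    if std == 0:
        # A constant feature carries no ranking information.
        return [0.0 for _ in values]
    sign = -1.0 if invert else 1.0
    return [sign * (value - mean) / std for value in values]


# Invented credit shares for three funds.
credito_share = [0.0, 0.2, 0.4]
print(zscore(credito_share))               # higher share -> higher raw z-score
print(zscore(credito_share, invert=True))  # higher share -> lower (worse) score
```

With `invert`, the fund with the largest credit share receives the lowest score, matching the intent stated in the `credit_risk_score` description.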

In [50]:
from fif_recsys.commands.model import compute_scores_from_yaml

score_df = compute_scores_from_yaml(feature_df, config_d)

score_df.head()


Unnamed: 0,CNPJ_FUNDO_CLASSE,DENOM_SOCIAL,reference_date,patrimonio_liq,log_aum,total_posicao,n_ativos,n_emissores,credito_share,related_party_share,issuer_hhi,size_score,diversification_score,issuer_diversification_score,credit_risk_score,governance_risk_score,concentration_risk_score
0,06.323.688/0001-27,IT NOW PIBB IBRX-50 FUNDO DE ÍNDICE RESPONSABI...,2026-01-14,986347900.0,20.70952,5825532000.0,55,1,0.0,0.125814,1.0,1.27038,2.366598,-0.537225,0.064385,0.500056,-1.078161
1,09.260.031/0001-56,FUNDO DE INVESTIMENTO EM QUOTAS DE FUNDO DE IN...,2026-01-14,82364500.0,18.226665,503980600.0,0,8,0.0,0.479135,0.298536,0.205049,-0.22705,-0.023439,0.064385,-0.37679,0.839722
2,10.292.322/0001-05,KONDOR KOBOLD FUNDO DE INVESTIMENTO EM COTAS D...,2026-01-14,528258100.0,20.085096,4007817000.0,0,4,0.0,0.999686,0.612586,1.002455,-0.22705,-0.317031,0.064385,-1.668657,-0.018927
3,10.406.511/0001-61,ISHARES IBOVESPA CLASSE DE ÍNDICE - RESPONSABI...,2026-01-14,14990920000.0,23.43071,102854400000.0,103,9,0.0,0.013466,0.364377,2.437974,4.630145,0.049959,0.064385,0.778873,0.659707
4,10.406.600/0001-08,ISHARES BM&FBOVESPA SMALL CAP CLASSE DE ÍNDIC...,2026-01-14,2112755000.0,21.471258,18136060000.0,131,11,0.0,0.035685,0.856891,1.597222,5.950548,0.196754,0.064385,0.723733,-0.686886


## Compute profile rankings

Use `compute_profile_scores_from_yaml` (from `fif_recsys.commands.policy`) to aggregate weighted scores into a single profile score and ranking for each fund. Profiles are defined in the `profile` section of the configuration (e.g., `conservative`, `balanced`, `institutional`).
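
The aggregation can be checked by hand. The sketch below takes the six scores of the first fund in the score table above, applies the `conservative` weights as a weighted sum, and then ranks a handful of profile scores in descending order. The helper names are hypothetical, and the tie-breaking rule (original order) is an assumption rather than the package's documented behavior.

```python
conservative_weights = {
    "size_score": 0.25,
    "diversification_score": 0.20,
    "issuer_diversification_score": 0.20,
    "credit_risk_score": 0.15,
    "governance_risk_score": 0.10,
    "concentration_risk_score": 0.10,
}

# Scores of the first fund (row 0) in the score table above.
fund_scores = {
    "size_score": 1.27038,
    "diversification_score": 2.366598,
    "issuer_diversification_score": -0.537225,
    "credit_risk_score": 0.064385,
    "governance_risk_score": 0.500056,
    "concentration_risk_score": -1.078161,
}


def profile_score(scores, weights):
    """Weighted sum of score columns for one fund."""
    return sum(weight * scores[name] for name, weight in weights.items())


def rank_descending(values):
    """Rank 1 goes to the highest value; ties keep their original order."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0] * len(values)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


score = profile_score(fund_scores, conservative_weights)
print(round(score, 6))  # 0.635317

# Ranking within five example scores only, not within the full fund universe.
print(rank_descending([0.635317, 0.057116, -0.017303, 1.69903, 1.642109]))
```

The weighted sum reproduces the `score_conservative` value reported for this fund in the ranking output, which suggests the profile score is a plain weighted sum of the score columns.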

In [51]:
from fif_recsys.commands.policy import compute_profile_scores_from_yaml

ranking_df = compute_profile_scores_from_yaml(score_df.fillna(0), config_d)

ranking_df.head()

Unnamed: 0,CNPJ_FUNDO_CLASSE,DENOM_SOCIAL,reference_date,patrimonio_liq,log_aum,total_posicao,n_ativos,n_emissores,credito_share,related_party_share,...,issuer_diversification_score,credit_risk_score,governance_risk_score,concentration_risk_score,score_conservative,rank_conservative,score_balanced,rank_balanced,score_institutional,rank_institutional
0,06.323.688/0001-27,IT NOW PIBB IBRX-50 FUNDO DE ÍNDICE RESPONSABI...,2026-01-14,986347900.0,20.70952,5825532000.0,55,1,0.0,0.125814,...,-0.537225,0.064385,0.500056,-1.078161,0.635317,84,0.454643,125,0.695616,74
1,09.260.031/0001-56,FUNDO DE INVESTIMENTO EM QUOTAS DE FUNDO DE IN...,2026-01-14,82364500.0,18.226665,503980600.0,0,8,0.0,0.479135,...,-0.023439,0.064385,-0.37679,0.839722,0.057116,363,0.085753,367,0.064149,367
2,10.292.322/0001-05,KONDOR KOBOLD FUNDO DE INVESTIMENTO EM COTAS D...,2026-01-14,528258100.0,20.085096,4007817000.0,0,4,0.0,0.999686,...,-0.317031,0.064385,-1.668657,-0.018927,-0.017303,442,-0.121382,598,0.0296,408
3,10.406.511/0001-61,ISHARES IBOVESPA CLASSE DE ÍNDICE - RESPONSABI...,2026-01-14,14990920000.0,23.43071,102854400000.0,103,9,0.0,0.013466,...,0.049959,0.064385,0.778873,0.659707,1.69903,7,1.418274,7,1.817709,6
4,10.406.600/0001-08,ISHARES BM&FBOVESPA SMALL CAP CLASSE DE ÍNDIC...,2026-01-14,2112755000.0,21.471258,18136060000.0,131,11,0.0,0.035685,...,0.196754,0.064385,0.723733,-0.686886,1.642109,8,1.259944,9,1.71875,8


## Next steps & CLI

- Run the full pipeline from the command line using the Typer-based CLI:
  - `fif-recsys data fetch` to download and prepare datasets
  - `fif-recsys feature build` to compute and write feature tables
  - `fif-recsys model score` to compute scores

- Tips:
  - Install `pyarrow` for faster Parquet I/O when running on large datasets.
  - For reproducible fetches, consider passing a deterministic `reference_date` to `fetch_manifest`.

Feel free to update this notebook with real data paths and run the pipeline end-to-end.

## Inspecting pipeline outputs

If you ran the Docker pipeline with a mounted output directory (e.g., `/tmp/fif_data` on the host → `/data` in the container), the pipeline writes the final profile-scored table to that directory as `features_profile_scored.parquet` or `features_profile_scored.csv`. The cells below first show the top five funds by conservative rank from the in-memory `ranking_df`, then load and preview the saved output; update `output_dir` if you mounted a different directory.

In [52]:
ranking_df.sort_values(by='rank_conservative', ascending=True)[:5]

Unnamed: 0,CNPJ_FUNDO_CLASSE,DENOM_SOCIAL,reference_date,patrimonio_liq,log_aum,total_posicao,n_ativos,n_emissores,credito_share,related_party_share,...,issuer_diversification_score,credit_risk_score,governance_risk_score,concentration_risk_score,score_conservative,rank_conservative,score_balanced,rank_balanced,score_institutional,rank_institutional
219,40.155.573/0001-09,TREND ETF IBOVESPA CLASSE DE ÍNDICE - RESPONS...,2026-01-14,1001094000.0,20.724359,9626491000.0,130,130,0.0,0.001832,...,8.931106,0.064385,0.807746,1.533615,3.52988,1,2.844605,1,3.590498,1
133,32.203.211/0001-18,FUNDO DE INVESTIMENTO DE ÍNDICE - CLASSE DE IN...,2026-01-14,1910818000.0,21.370797,11003930000.0,93,79,0.0,0.059015,...,5.187812,0.064385,0.665834,1.495788,2.483626,2,2.049902,2,2.558113,2
419,48.643.130/0001-79,FUNDO DE INVESTIMENTO DE ÍNDICE - CI B-INDEX M...,2026-01-14,77181100.0,18.161665,227935500.0,97,70,0.0,0.061258,...,4.527231,0.064385,0.660266,1.587268,2.053588,3,1.716604,3,2.059226,3
143,34.606.480/0001-50,BB ETF IBOVESPA FUNDO DE ÍNDICE RESPONSABILIDA...,2026-01-14,2199202000.0,21.51136,12528590000.0,95,45,0.0,0.036515,...,2.692283,0.064385,0.721672,0.820588,1.956525,4,1.608878,5,2.034027,4
725,57.848.980/0001-02,BB ETF ÍNDICE BOVESPA B3 BR+ FUNDO DE ÍNDICE R...,2026-01-14,26080170.0,17.076686,258004400.0,106,67,0.0,0.033134,...,4.307037,0.064385,0.730064,1.245778,1.950878,5,1.613376,4,1.93324,5


In [53]:
# Load and preview the profile-scored table
from pathlib import Path
import pandas as pd

# Update this path to the directory you mounted into the container (host path: /tmp/fif_data)
output_dir = Path("/tmp/fif_data")

pj = output_dir / "features_profile_scored.parquet"
pcsv = output_dir / "features_profile_scored.csv"

if pj.exists():
    df = pd.read_parquet(pj)
elif pcsv.exists():
    df = pd.read_csv(pcsv)
else:
    raise FileNotFoundError(f"No profile-scored output found at {pj} or {pcsv}. Make sure you mounted the output dir and ran the pipeline.")

# Quick preview
print("Path:", pj if pj.exists() else pcsv)
print("Rows:", len(df))
print("Columns:", list(df.columns))
df.head()


Path: /tmp/fif_data/features_profile_scored.csv
Rows: 5476
Columns: ['CNPJ_FUNDO_CLASSE', 'DENOM_SOCIAL', 'competencia', 'patrimonio_liq', 'log_aum', 'total_posicao', 'n_ativos', 'n_emissores', 'credito_share', 'related_party_share', 'issuer_hhi', 'size_score', 'diversification_score', 'issuer_diversification_score', 'credit_risk_score', 'governance_risk_score', 'concentration_risk_score', 'score_conservative', 'rank_conservative', 'score_balanced', 'rank_balanced', 'score_institutional', 'rank_institutional']


Unnamed: 0,CNPJ_FUNDO_CLASSE,DENOM_SOCIAL,competencia,patrimonio_liq,log_aum,total_posicao,n_ativos,n_emissores,credito_share,related_party_share,...,issuer_diversification_score,credit_risk_score,governance_risk_score,concentration_risk_score,score_conservative,rank_conservative,score_balanced,rank_balanced,score_institutional,rank_institutional
0,06.323.688/0001-27,IT NOW PIBB IBRX-50 FUNDO DE ÍNDICE RESPONSABI...,202506,963097100.0,20.685665,965935000.0,47,0,0.0,0.127924,...,-0.606022,0.081555,0.486351,,,0,,0,,0
1,06.323.688/0001-27,IT NOW PIBB IBRX-50 FUNDO DE ÍNDICE RESPONSABI...,202507,920648300.0,20.640589,922510000.0,46,1,0.0,0.127067,...,-0.538943,0.066368,0.469103,-0.97675,0.547754,533,0.394923,765,0.60993,485
2,06.323.688/0001-27,IT NOW PIBB IBRX-50 FUNDO DE ÍNDICE RESPONSABI...,202508,933380200.0,20.654323,1015659000.0,50,0,0.0,0.118529,...,-0.619399,0.066605,0.49858,,,0,,0,,0
3,06.323.688/0001-27,IT NOW PIBB IBRX-50 FUNDO DE ÍNDICE RESPONSABI...,202509,950239800.0,20.672225,964980300.0,49,0,0.0,0.127355,...,-0.600803,0.063425,0.398001,,,0,,0,,0
4,06.323.688/0001-27,IT NOW PIBB IBRX-50 FUNDO DE ÍNDICE RESPONSABI...,202510,965022200.0,20.687662,967214200.0,49,0,0.0,0.127783,...,-0.603854,0.073029,0.511019,,,0,,0,,0
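
In the preview above, rows with a missing `concentration_risk_score` also have missing profile scores and a rank of 0, which appears to mark them as unranked. The sketch below shows why: a single NaN propagates through a weighted sum unless it is filled first, as `score_df.fillna(0)` does in the in-memory run earlier in this notebook. Whether the Docker pipeline applies the same fill is not shown here; the score values are invented and `weighted_score` is a hypothetical helper.

```python
import math

weights = {
    "size_score": 0.25,
    "diversification_score": 0.20,
    "issuer_diversification_score": 0.20,
    "credit_risk_score": 0.15,
    "governance_risk_score": 0.10,
    "concentration_risk_score": 0.10,
}

# Invented scores for one fund-month with a missing concentration score.
scores = {
    "size_score": 1.0,
    "diversification_score": 0.5,
    "issuer_diversification_score": -0.6,
    "credit_risk_score": 0.08,
    "governance_risk_score": 0.49,
    "concentration_risk_score": float("nan"),
}


def weighted_score(scores, weights, fill_missing=None):
    """Weighted sum that optionally replaces NaN scores with `fill_missing`."""
    total = 0.0
    for name, weight in weights.items():
        value = scores[name]
        if math.isnan(value) and fill_missing is not None:
            value = fill_missing
        total += weight * value
    return total


unfilled = weighted_score(scores, weights)
filled = weighted_score(scores, weights, fill_missing=0.0)
print(math.isnan(unfilled))  # True: the NaN propagates
print(round(filled, 3))      # 0.291
```

Because z-scores have mean zero, filling with 0 treats the fund as average on the missing dimension, which is a modeling choice worth making explicitly.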
