# 🔍 Article Selection Analysis for TTD Newsletter

This notebook is an interactive replica of the article selection flow.

Use it to explore, test, and evaluate the flow with dummy data.

**Execute all steps in order to get correct results.**

**Inspect the flow state after each step** to understand how the flow works.

In [1]:
from datetime import datetime

from pydantic import BaseModel, ConfigDict, Field

from ttd.utils.print import safe_pretty_print


class FlowParametersSchema(BaseModel):
    """Schema for flow parameters."""

    # Allow steps to attach extra attributes (e.g. `config`) to the flow state.
    model_config = ConfigDict(extra='allow')

    articles_table: str = Field(
        'articles', description="Table to load articles from"
    )
    articles_limit: int = Field(
        2, description="Maximum number of articles to select"
    )
    # NOTE: Pydantic does not validate defaults, so the two RFC 2822 strings
    # below stay `str` and trigger the serializer warning shown in the output.
    date_threshold: datetime = Field(
        'Thu, 03 Apr 2025 18:00:00 +0000',
        description="Select articles published after this date"
    )
    cluster_date_threshold: datetime = Field(
        'Thu, 03 Apr 2025 18:00:00 +0000',
        description="Articles published after this date are used to compute cluster scores"
    )
    clean_tables: bool = Field(
        False, description="Clean tables (selections, selected_articles)"
    )


# Override a few defaults for this dummy-data run.
flow = FlowParametersSchema(
    articles_table='replicated_articles',
    articles_limit=20,
    clean_tables=True
)
safe_pretty_print(flow.model_dump())

  Expected `datetime` but got `str` with value `'Thu, 03 Apr 2025 18:00:00 +0000'` - serialized value may not be as expected
  Expected `datetime` but got `str` with value `'Thu, 03 Apr 2025 18:00:00 +0000'` - serialized value may not be as expected
  return self.__pydantic_serializer__.to_python(
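
The warning above appears because the threshold defaults are plain strings that are never converted to `datetime`. As a minimal sketch, assuming the thresholds stay in RFC 2822 format, such a string can be parsed with the standard library before it is handed to the schema. `parse_threshold` is a hypothetical helper, not part of `ttd`, and whether the downstream steps accept a `datetime` instead of a string is not shown in this notebook.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def parse_threshold(value: str) -> datetime:
    """Hypothetical helper: parse an RFC 2822 date string into an aware datetime."""
    return parsedate_to_datetime(value)


threshold = parse_threshold('Thu, 03 Apr 2025 18:00:00 +0000')
print(threshold.isoformat())  # 2025-04-03T18:00:00+00:00
```

Passing a value parsed this way (for example `date_threshold=threshold`) gives the model a real `datetime`, which is the type the serializer expects.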


### **1. Start step**
Initializes the flow: loads the configuration, resolves the parameters, and tracks versioning.

In [2]:
from ttd.flows.article_selection.steps.start import execute as start_step

start_step(flow)
safe_pretty_print(flow.model_dump())

2025-05-09 17:31:16,265 - ttd.flows.article_selection.steps.start - INFO - ✅ Database first connection established.
2025-05-09 17:31:16,384 - ttd.flows.article_selection.steps.start - INFO - ✅ Database cleaned.


Thu, 03 Apr 2025 18:00:00 +0000




### **2. Load articles step**

In [3]:
from ttd.flows.article_selection.steps.load_articles \
    import execute as load_articles_step

load_articles_step(flow)
safe_pretty_print(flow.model_dump())

2025-05-09 17:31:29,252 - ttd.flows.article_selection.steps.load_articles - INFO - Loading articles...


2025-05-09 17:31:29,296 - ttd.flows.article_selection.steps.load_articles - INFO - ✅ Loaded 100 articles from 'replicated_articles': len(articles)=100, date_threshold='2025-04-03 18:00:00+00:00'
2025-05-09 17:31:29,296 - ttd.flows.article_selection.steps.load_articles - INFO - ✅ Filtering out already replicated articles...
2025-05-09 17:31:29,327 - ttd.flows.article_selection.steps.load_articles - INFO - ✅ There are 0 already selected articles
2025-05-09 17:31:29,327 - ttd.flows.article_selection.steps.load_articles - INFO - ✅ Filtered out 0 already selected articles from 'selected_articles'
2025-05-09 17:31:29,327 - ttd.flows.article_selection.steps.load_articles - INFO - ✅ Loaded 100 articles...
2025-05-09 17:31:29,327 - ttd.flows.article_selection.steps.load_articles - INFO - ✅ Step load_articles done in 0.07s
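
The log above suggests two filters: keep articles published after `date_threshold`, then drop those already recorded in `selected_articles`. The following is a minimal sketch of that logic on plain dictionaries, not the actual `load_articles` implementation. The `published_at` field and the `filter_articles` function are assumptions made for illustration; `doc_id`, `original_doc_id`, and `original_table_name` are taken from the records shown in the sandbox section.

```python
from datetime import datetime, timezone


def filter_articles(articles, selected, date_threshold, table_name):
    """Keep articles newer than the threshold that are not already selected."""
    # IDs of articles from this table that were selected in a previous run.
    already_selected = {
        s["original_doc_id"]
        for s in selected
        if s["original_table_name"] == table_name
    }
    return [
        a for a in articles
        # `published_at` is a hypothetical field name.
        if a["published_at"] > date_threshold
        and a["doc_id"] not in already_selected
    ]


utc = timezone.utc
articles = [
    {"doc_id": "1", "published_at": datetime(2025, 4, 2, tzinfo=utc)},  # too old
    {"doc_id": "2", "published_at": datetime(2025, 4, 4, tzinfo=utc)},  # already selected
    {"doc_id": "3", "published_at": datetime(2025, 4, 5, tzinfo=utc)},
]
selected = [
    {"original_table_name": "replicated_articles", "original_doc_id": "2"},
]
threshold = datetime(2025, 4, 3, 18, tzinfo=utc)

fresh = filter_articles(articles, selected, threshold, "replicated_articles")
print([a["doc_id"] for a in fresh])  # ['3']
```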

### **3. Article Selection step**

Proposes multiple selections based on various unsupervised strategies.

In [4]:
from ttd.flows.article_selection.steps.select_articles \
    import execute as select_articles_step

select_articles_step(flow)
safe_pretty_print(flow.model_dump())

2025-05-09 17:32:53,924 - ttd.flows.article_selection.steps.select_articles - INFO - Selecting articles...
2025-05-09 17:32:53,939 - ttd.flows.article_selection.steps.select_articles - INFO - ✅ Scoring clusters...
2025-05-09 17:32:53,939 - ttd.flows.article_selection.steps.select_articles - INFO - ✅ Scoring articles...
2025-05-09 17:32:55,338 - ttd.flows.article_selection.steps.select_articles - INFO - ✅ Step select_articles done in 1.41s
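
The log shows two phases, cluster scoring followed by article scoring, but the notebook does not describe the strategies themselves. The sketch below illustrates one possible heuristic under stated assumptions and is not the flow's actual scoring: a cluster's score is the number of its articles published after `cluster_date_threshold`, each article inherits its cluster's score, and ties are broken by recency. The `cluster_id` and `published_at` fields and the `select_articles` function are hypothetical.

```python
from collections import Counter
from datetime import datetime, timezone


def select_articles(articles, cluster_date_threshold, articles_limit):
    """Rank articles by an assumed cluster score and keep the top ones."""
    # Phase 1: score clusters by how many recent articles they contain.
    cluster_scores = Counter(
        a["cluster_id"]
        for a in articles
        if a["published_at"] > cluster_date_threshold
    )
    # Phase 2: score articles by their cluster score, newest first on ties.
    ranked = sorted(
        articles,
        key=lambda a: (cluster_scores[a["cluster_id"]], a["published_at"]),
        reverse=True,
    )
    return ranked[:articles_limit]


utc = timezone.utc
threshold = datetime(2025, 4, 3, 18, tzinfo=utc)
articles = [
    {"doc_id": "1", "cluster_id": "a", "published_at": datetime(2025, 4, 4, tzinfo=utc)},
    {"doc_id": "2", "cluster_id": "a", "published_at": datetime(2025, 4, 5, tzinfo=utc)},
    {"doc_id": "3", "cluster_id": "b", "published_at": datetime(2025, 4, 6, tzinfo=utc)},
    {"doc_id": "4", "cluster_id": "c", "published_at": datetime(2025, 4, 1, tzinfo=utc)},
]

picked = select_articles(articles, threshold, articles_limit=2)
print([a["doc_id"] for a in picked])  # ['2', '1']
```

Here `articles_limit` plays the same role as the flow parameter of that name: it caps how many articles a selection may contain.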


### **4. Prepare Report**

Generates a detailed report from the replicated articles.

To retrieve the report, run:

```bash
python -m ttd.flows.article_selection.flow card get prepare_report article_selection_report_example.html
```

Here is an example: [article_selection_report_example.html](./images/article_selection_report_example.html)

## Sandbox to try stuff

In [12]:
flow.config.get("db_path")

'/Users/mathieucrilout/Repos/train_tune_deploy/data/ttd_tinydb.json'

In [6]:
from ttd.storage.ttd_storage import TTDStorage

storage = TTDStorage(flow.config.get("db_path"))
storage.get_all('selected_articles')

[{'original_table_name': 'replicated_articles',
  'original_doc_id': '23',
  'table_name': 'selected_articles',
  'created_at': '2025-05-09T15:32:53.942969',
  'doc_id': '1'},
 {'original_table_name': 'replicated_articles',
  'original_doc_id': '18',
  'table_name': 'selected_articles',
  'created_at': '2025-05-09T15:32:54.055135',
  'doc_id': '2'},
 {'original_table_name': 'replicated_articles',
  'original_doc_id': '5',
  'table_name': 'selected_articles',
  'created_at': '2025-05-09T15:32:54.109952',
  'doc_id': '3'},
 {'original_table_name': 'replicated_articles',
  'original_doc_id': '30',
  'table_name': 'selected_articles',
  'created_at': '2025-05-09T15:32:54.167469',
  'doc_id': '4'},
 {'original_table_name': 'replicated_articles',
  'original_doc_id': '2',
  'table_name': 'selected_articles',
  'created_at': '2025-05-09T15:32:54.236514',
  'doc_id': '5'},
 {'original_table_name': 'replicated_articles',
  'original_doc_id': '20',
  'table_name': 'selected_articles',
  'created