# 🔍 Article Selection Analysis for TTD Newsletter

This notebook is an interactive replica of the article selection flow using [Metaflow](https://metaflow.org/).

You can run the complete flow using the following command:

```bash
python -m ttd.flows.article_selection.flow
```

To learn more about how Metaflow works, see the [official documentation](https://docs.metaflow.org/).

The flow consists of five main steps:

<img src="images/article_selection_flow.png" width="400">


Use this notebook to **find, test, and evaluate** the flow with dummy data.

> ⚠️ **Make sure to execute all steps in order** to obtain valid results.  
> 🔎 **Inspect the flow state at each step** to better understand the process.

In [1]:
from metaflow import FlowSpec, step, Parameter
from pydantic import BaseModel, Field
from datetime import datetime

from ttd.utils.print import safe_pretty_print
from ttd.utils.date import parse_date

class FlowParametersSchema(BaseModel):
    """Schema for flow parameters."""
    articles_table: str = Field(
        'articles', description="Table to load articles from"
    )
    articles_limit: int = Field(
        2, description="Maximum number of articles to select"
    )
    date_threshold: datetime = Field(
        'Thu, 03 Apr 2025 18:00:00 +0000',
        description="Select articles published after this date"
    )
    cluster_date_threshold: datetime = Field(
        'Thu, 03 Apr 2025 18:00:00 +0000',
        description="Articles published after this date are used to compute cluster scores"
    )
    clean_tables: bool = Field(
        False, description="Clean tables (selections, selected_articles)"
    )
    selected_articles_table: str = Field(
        'selected_articles_dummy_table',
        description='Table to register selected articles'
    )
    # Allow extra attributes so that steps can attach state (e.g. `config`) to this object.
    model_config = {'extra': 'allow'}

flow = FlowParametersSchema(
    articles_table='replicated_articles',
    date_threshold=parse_date('Thu, 05 Jun 2025 12:00:00 +0000'),
    cluster_date_threshold=parse_date('Thu, 05 Jun 2025 12:00:00 +0000'),
    articles_limit=2,
    clean_tables=True
)
safe_pretty_print(flow.model_dump())
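
The threshold strings above use the RFC 2822 date format (for example `Thu, 05 Jun 2025 12:00:00 +0000`). The implementation of `ttd.utils.date.parse_date` is not shown in this notebook. As a minimal sketch, assuming it behaves like a standard RFC 2822 parser, the Python standard library can turn such a string into a timezone-aware `datetime`:

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Assumption: a stand-in for ttd.utils.date.parse_date, which may behave differently.
raw = "Thu, 05 Jun 2025 12:00:00 +0000"
threshold = parsedate_to_datetime(raw)

print(threshold)         # 2025-06-05 12:00:00+00:00
print(threshold.tzinfo)  # UTC

# Timezone-aware values can be compared safely with other aware datetimes.
print(threshold == datetime(2025, 6, 5, 12, 0, tzinfo=timezone.utc))  # True
```

The printed value has the same shape as the `date_threshold` reported in the load step logs.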

### **1. Start step**
Initializes the flow: configuration, parameters, and version tracking.

In [2]:
from ttd.flows.article_selection.steps.start import execute as start_step

start_step(flow)
safe_pretty_print(flow.model_dump())

2025-06-05 23:12:05,649 - ttd.flows.article_selection.steps.start - INFO - ✅ Database first connection established.
2025-06-05 23:12:05,787 - ttd.flows.article_selection.steps.start - INFO - ✅ Database cleaned.


### **2. Load articles step**

In [3]:
from ttd.flows.article_selection.steps.load_articles \
    import execute as load_articles_step

load_articles_step(flow)
safe_pretty_print(flow.model_dump())

2025-06-05 23:12:05,794 - ttd.flows.article_selection.steps.load_articles - INFO - Loading articles...
2025-06-05 23:12:05,966 - ttd.flows.article_selection.steps.load_articles - INFO - ✅ Loaded 5 articles from 'replicated_articles': len(articles)=5, date_threshold='2025-06-05 12:00:00+00:00'
2025-06-05 23:12:05,967 - ttd.flows.article_selection.steps.load_articles - INFO - ✅ Filtering out already replicated articles...
2025-06-05 23:12:06,021 - ttd.flows.article_selection.steps.load_articles - INFO - ✅ There are 0 already selected articles
2025-06-05 23:12:06,021 - ttd.flows.article_selection.steps.load_articles - INFO - ✅ Filtered out 0 already selected articles from 'selected_articles_dummy_table'
2025-06-05 23:12:06,021 - ttd.flows.article_selection.steps.load_articles - INFO - ✅ Loaded 5 articles...
2025-06-05 23:12:06,021 - ttd.flows.article_selection.steps.load_articles - INFO - ✅ Step load_articles done in 0.23s
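
The logs above suggest two filters: keep articles published after `date_threshold`, then drop those already recorded in the selected-articles table. The real logic lives in `load_articles` and is not shown here. The following is only a hypothetical sketch of that idea: the function name and the `published` and `doc_id` fields are assumptions, while `original_doc_id` is the field visible in the `selected_articles` records at the end of this notebook.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def filter_candidates(articles, selected, date_threshold):
    """Keep articles published after the threshold that were not selected before."""
    # IDs already present in the selection table.
    already_selected = {row["original_doc_id"] for row in selected}
    return [
        article
        for article in articles
        if parsedate_to_datetime(article["published"]) > date_threshold
        and article["doc_id"] not in already_selected
    ]


# Illustrative dummy records, not taken from the real database.
dummy_articles = [
    {"doc_id": "1", "published": "Thu, 05 Jun 2025 12:05:00 +0000"},
    {"doc_id": "2", "published": "Tue, 13 May 2025 13:31:57 +0000"},
    {"doc_id": "3", "published": "Thu, 05 Jun 2025 13:00:00 +0000"},
]
dummy_selected = [{"original_doc_id": "3"}]
dummy_threshold = datetime(2025, 6, 5, 12, 0, tzinfo=timezone.utc)

kept = filter_candidates(dummy_articles, dummy_selected, dummy_threshold)
print([article["doc_id"] for article in kept])  # ['1']
```

Article 2 is dropped because it predates the threshold, and article 3 because it was already selected.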


### **3. Article Selection step**

Proposes multiple selections based on various unsupervised strategies.

In [4]:
from ttd.flows.article_selection.steps.select_articles \
    import execute as select_articles_step

select_articles_step(flow)
safe_pretty_print(flow.model_dump())

2025-06-05 23:12:06,051 - ttd.flows.article_selection.steps.select_articles - INFO - Selecting articles...
2025-06-05 23:12:06,052 - ttd.flows.article_selection.steps.select_articles - INFO - ✅ Scoring clusters...
2025-06-05 23:12:06,574 - ttd.flows.article_selection.steps.select_articles - INFO - ✅ Step select_articles done in 0.52s


### **4. Prepare Report**

Generate a detailed report from the selected articles.

Once the flow has been executed, you can retrieve the report using the following command:

```bash
python -m ttd.flows.article_selection.flow card get prepare_report article_enrichment_report_example.html
```

Here is an example output:

<img src="images/article_selection_report_example.png" width="800">

## Quick Analysis

In [5]:
db_path = flow.config.get("db_path")
db_path = '/Users/mathieucrilout/Repos/train_tune_deploy/data/ttd_tinydb.json'

In [6]:
from ttd.flows.analysis import get_articles_with_no_error, get_oldest_and_latest_dates
from ttd.storage.ttd_storage import TTDStorage

# Get articles from first ingestion
storage = TTDStorage(db_path)
articles = storage.get_obj_in_range("replicated_articles", first_id=2000)
articles_with_no_error = get_articles_with_no_error(articles)
oldest_date, latest_date = get_oldest_and_latest_dates(articles_with_no_error)
if oldest_date and latest_date:
    print(f"Oldest date: {oldest_date}")
    print(f"Latest date: {latest_date}")

2025-06-05 23:12:06,847 - numexpr.utils - INFO - NumExpr defaulting to 10 threads.


Oldest date: Tue, 13 May 2025 13:31:57 +0000
Latest date: Thu, 05 Jun 2025 12:05:00 +0000


In [7]:
from ttd.flows.analysis import filter_articles_by_clusters
articles = filter_articles_by_clusters(articles_with_no_error, ["artificial intelligence", "large language models", "India"])

### Look at linear vs sub-exponential vs sup-exponential orders

In [8]:
from ttd.flows.article_selection.steps.select_articles import compute_cluster_scores
from ttd.flows.article_selection.steps.select_articles import linear_order_metric
from ttd.flows.article_selection.steps.select_articles import exponential_order_metric
linear_cluster_scores = compute_cluster_scores(
    articles,
    order_metric=linear_order_metric
)
sub_exponential_cluster_scores = compute_cluster_scores(
    articles,
    order_metric=exponential_order_metric(0.1)
)
sup_exponential_cluster_scores = compute_cluster_scores(
    articles,
    order_metric=exponential_order_metric(0.9)
)
linear_ordered_by_value = dict(sorted(linear_cluster_scores.items(), key=lambda x: x[1], reverse=True))
sub_exponential_ordered_by_value = dict(sorted(sub_exponential_cluster_scores.items(), key=lambda x: x[1], reverse=True))
sup_exponential_ordered_by_value = dict(sorted(sup_exponential_cluster_scores.items(), key=lambda x: x[1], reverse=True))
print("Linear cluster scores, Sub-exp cluster scores, Sup-exp cluster scores")
# Walk the three rankings side by side, one rank per printed row.
rows = zip(
    linear_ordered_by_value.items(),
    sub_exponential_ordered_by_value.items(),
    sup_exponential_ordered_by_value.items(),
)
for (key, value), (key_2, value_2), (key_3, value_3) in rows:
    print(f"{key}: {value:0.2f}, {key_2}: {value_2:0.2f}, {key_3}: {value_3:0.2f}")

Linear cluster scores, Sub-exp cluster scores, Sup-exp cluster scores
NVIDIA: 59.32, NVIDIA: 36.66, data science: 94.90
data science: 58.03, software development: 32.72, NVIDIA: 88.39
automation: 52.13, data science: 32.47, automation: 80.93
software development: 50.43, Cloud computing: 30.50, software development: 73.62
Cloud computing: 44.23, automation: 29.57, data privacy: 66.20
data privacy: 40.82, data privacy: 22.32, Cloud computing: 63.23
OpenAI: 27.32, OpenAI: 20.25, open source software: 40.19
AI training: 24.83, digital transformation: 18.81, search engines: 36.67
search engines: 24.60, AI training: 18.14, OpenAI: 35.89
digital transformation: 23.03, search engines: 16.00, AI training: 33.06
open source software: 20.07, Android: 15.61, data governance: 30.47
Android: 18.33, healthcare: 12.52, digital transformation: 28.12
data governance: 17.52, reinforcement learning: 11.56, data management: 28.12
data management: 17.48, drug discovery: 11.54, reliability: 26.84
reliability
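
The definitions of `linear_order_metric` and `exponential_order_metric` are not shown in this notebook. One plausible reading is that each cluster's contribution depends on its position in an article's cluster list. The sketch below illustrates that reading only: every function name and formula in it is hypothetical, and it does not attempt to reproduce the numbers printed above.

```python
from typing import Callable, Dict, List

WeightFn = Callable[[int, int], float]


def sketch_linear_weight(position: int, total: int) -> float:
    """Hypothetical: weight falls linearly from 1 (first cluster) to 1/total (last)."""
    return (total - position) / total


def sketch_exponential_weight(base: float) -> WeightFn:
    """Hypothetical: weight decays geometrically as base ** position."""
    def weight(position: int, total: int) -> float:
        return base ** position
    return weight


def sketch_cluster_scores(articles_clusters: List[List[str]], weight: WeightFn) -> Dict[str, float]:
    """Sum the position-dependent weight of every cluster over all articles."""
    scores: Dict[str, float] = {}
    for clusters in articles_clusters:
        for position, name in enumerate(clusters):
            scores[name] = scores.get(name, 0.0) + weight(position, len(clusters))
    return scores


# Two dummy articles, each with an ordered list of cluster names.
articles_clusters = [["NVIDIA", "automation"], ["data science", "NVIDIA"]]

for label, weight in [
    ("linear", sketch_linear_weight),
    ("base 0.1", sketch_exponential_weight(0.1)),
    ("base 0.9", sketch_exponential_weight(0.9)),
]:
    scores = sketch_cluster_scores(articles_clusters, weight)
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    print(label, [(name, round(score, 2)) for name, score in ranked])
```

Under these assumed formulas, a small base such as 0.1 makes clusters after the first nearly irrelevant, while a base close to 1 such as 0.9 lets later clusters contribute almost as much as the first.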

### Look at linear selection using linear vs sub-exp vs sup-exp order

In [9]:
from ttd.flows.article_selection.steps.select_articles import compute_article_cluster_scores
from ttd.flows.article_selection.steps.select_articles import get_top_n_articles

linearly_scored_articles = compute_article_cluster_scores(
        articles, linear_cluster_scores, order_metric=linear_order_metric
)
top_n_linearly_scored_articles = get_top_n_articles(linearly_scored_articles, n=20)
sub_scored_articles = compute_article_cluster_scores(
        articles, sub_exponential_cluster_scores, order_metric=linear_order_metric
)
top_n_sub_scored_articles = get_top_n_articles(sub_scored_articles, n=20)
sup_scored_articles = compute_article_cluster_scores(
        articles, sup_exponential_cluster_scores, order_metric=linear_order_metric
)
top_n_sup_scored_articles = get_top_n_articles(sup_scored_articles, n=20)
for article, article_2, article_3 in zip(top_n_linearly_scored_articles, top_n_sub_scored_articles, top_n_sup_scored_articles):
        print(f"{article['clusters_score']:0.2f} {article['title']}")
        print(article['clusters_names_in_order_added'])
        print(f"{article_2['clusters_score']:0.2f} {article_2['title']}")
        print(article_2['clusters_names_in_order_added'])
        print(f"{article_3['clusters_score']:0.2f} {article_3['title']}")
        print(article_3['clusters_names_in_order_added'])
        print("-"*100)

88.77 Asana shares drop as earnings results top estimates but revenue growth slows
['data science', 'Business growth', 'automation', 'software development']
52.58 NVIDIA and Microsoft Advance Development on RTX AI PCs
['NVIDIA', 'Microsoft', 'software development', 'microservices', 'RTX Remix']
141.23 Asana shares drop as earnings results top estimates but revenue growth slows
['data science', 'Business growth', 'automation', 'software development']
----------------------------------------------------------------------------------------------------
88.17 Advanced Optimization Strategies for LLM Training on NVIDIA Grace Hopper
['data science', 'NVIDIA', 'Grace Hopper', 'CPU offloading', 'unified memory', 'automatic mixed precision']
52.45 NVIDIA’s Bartley Richardson on How Teams of AI Agents Provide Next-Level Automation
['NVIDIA', 'automation', 'interoperability']
140.46 Advanced Optimization Strategies for LLM Training on NVIDIA Grace Hopper
['data science', 'NVIDIA', 'Grace Hopper', 

### Look at sub-exponential selection using linear vs sub-exp vs sup-exp order

In [10]:
from ttd.flows.article_selection.steps.select_articles import compute_article_cluster_scores
from ttd.flows.article_selection.steps.select_articles import get_top_n_articles

linearly_scored_articles = compute_article_cluster_scores(
        articles, linear_cluster_scores, order_metric=exponential_order_metric(0.1)
)
top_n_linearly_scored_articles = get_top_n_articles(linearly_scored_articles, n=20)
sub_scored_articles = compute_article_cluster_scores(
        articles, sub_exponential_cluster_scores, order_metric=exponential_order_metric(0.1)
)
top_n_sub_scored_articles = get_top_n_articles(sub_scored_articles, n=20)
sup_scored_articles = compute_article_cluster_scores(
        articles, sup_exponential_cluster_scores, order_metric=exponential_order_metric(0.1)
)
top_n_sup_scored_articles = get_top_n_articles(sup_scored_articles, n=20)
for article, article_2, article_3 in zip(top_n_linearly_scored_articles, top_n_sub_scored_articles, top_n_sup_scored_articles):
        print(f"{article['clusters_score']:0.2f} {article['title']}")
        print(article['clusters_names_in_order_added'])
        print(f"{article_2['clusters_score']:0.2f} {article_2['title']}")
        print(article_2['clusters_names_in_order_added'])
        print(f"{article_3['clusters_score']:0.2f} {article_3['title']}")
        print(article_3['clusters_names_in_order_added'])
        print("-"*100)

64.57 NVIDIA’s Bartley Richardson on How Teams of AI Agents Provide Next-Level Automation
['NVIDIA', 'automation', 'interoperability']
39.65 NVIDIA’s Bartley Richardson on How Teams of AI Agents Provide Next-Level Automation
['NVIDIA', 'automation', 'interoperability']
103.76 Advanced Optimization Strategies for LLM Training on NVIDIA Grace Hopper
['data science', 'NVIDIA', 'Grace Hopper', 'CPU offloading', 'unified memory', 'automatic mixed precision']
----------------------------------------------------------------------------------------------------
64.53 Semiconductor Industry Accelerates Design Manufacturing With NVIDIA Blackwell and CUDA-X
['NVIDIA', 'automation', 'computational lithography', 'electronic design automation', 'process control']
39.62 Semiconductor Industry Accelerates Design Manufacturing With NVIDIA Blackwell and CUDA-X
['NVIDIA', 'automation', 'computational lithography', 'electronic design automation', 'process control']
103.76 Advanced Optimization Strategies f

### Look at sup-exponential selection using linear vs sub-exp vs sup-exp order

In [11]:
from ttd.flows.article_selection.steps.select_articles import compute_article_cluster_scores
from ttd.flows.article_selection.steps.select_articles import get_top_n_articles

linearly_scored_articles = compute_article_cluster_scores(
        articles, linear_cluster_scores, order_metric=exponential_order_metric(0.9)
)
top_n_linearly_scored_articles = get_top_n_articles(linearly_scored_articles, n=20)
sub_scored_articles = compute_article_cluster_scores(
        articles, sub_exponential_cluster_scores, order_metric=exponential_order_metric(0.9)
)
top_n_sub_scored_articles = get_top_n_articles(sub_scored_articles, n=20)
sup_scored_articles = compute_article_cluster_scores(
        articles, sup_exponential_cluster_scores, order_metric=exponential_order_metric(0.9)
)
top_n_sup_scored_articles = get_top_n_articles(sup_scored_articles, n=20)
for article, article_2, article_3 in zip(top_n_linearly_scored_articles, top_n_sub_scored_articles, top_n_sup_scored_articles):
        print(f"{article['clusters_score']:0.2f} {article['title']}")
        print(article['clusters_names_in_order_added'])
        print(f"{article_2['clusters_score']:0.2f} {article_2['title']}")
        print(article_2['clusters_names_in_order_added'])
        print(f"{article_3['clusters_score']:0.2f} {article_3['title']}")
        print(article_3['clusters_names_in_order_added'])
        print("-"*100)

138.38 Asana shares drop as earnings results top estimates but revenue growth slows
['data science', 'Business growth', 'automation', 'software development']
81.27 Asana shares drop as earnings results top estimates but revenue growth slows
['data science', 'Business growth', 'automation', 'software development']
215.83 Asana shares drop as earnings results top estimates but revenue growth slows
['data science', 'Business growth', 'automation', 'software development']
----------------------------------------------------------------------------------------------------
125.46 How open systems drive AI performance
['Cloud computing', 'open source software', 'search engines', 'NVIDIA']
75.30 How open systems drive AI performance
['Cloud computing', 'open source software', 'search engines', 'NVIDIA']
193.55 How open systems drive AI performance
['Cloud computing', 'open source software', 'search engines', 'NVIDIA']
----------------------------------------------------------------------------

### Look at selection with diversity using linear vs sub-exp vs sup-exp order

In [12]:
from ttd.flows.article_selection.steps.select_articles import select_top_articles_with_diversity

top_n_linearly_scored_articles_with_diversity = select_top_articles_with_diversity(
    articles, articles,
    order_metric=linear_order_metric, n=20
)
top_n_sub_scored_articles_with_diversity = select_top_articles_with_diversity(
    articles, articles,
    order_metric=exponential_order_metric(0.1), n=20
)
top_n_sup_scored_articles_with_diversity = select_top_articles_with_diversity(
    articles, articles,
    order_metric=exponential_order_metric(0.9), n=20
)
for article, article_2, article_3 in zip(
    top_n_linearly_scored_articles_with_diversity,
    top_n_sub_scored_articles_with_diversity,
    top_n_sup_scored_articles_with_diversity,
):
    print(f"{article['clusters_score']:0.2f} {article['title']}")
    print(article['clusters_names_in_order_added'])
    print(f"{article_2['clusters_score']:0.2f} {article_2['title']}")
    print(article_2['clusters_names_in_order_added'])
    print(f"{article_3['clusters_score']:0.2f} {article_3['title']}")
    print(article_3['clusters_names_in_order_added'])
    print("-"*100)

88.77 Asana shares drop as earnings results top estimates but revenue growth slows
['data science', 'Business growth', 'automation', 'software development']
39.65 NVIDIA’s Bartley Richardson on How Teams of AI Agents Provide Next-Level Automation
['NVIDIA', 'automation', 'interoperability']
215.83 Asana shares drop as earnings results top estimates but revenue growth slows
['data science', 'Business growth', 'automation', 'software development']
----------------------------------------------------------------------------------------------------
86.67 NVIDIA’s Bartley Richardson on How Teams of AI Agents Provide Next-Level Automation
['NVIDIA', 'automation', 'interoperability']
35.83 AWS open-sources Strands Agents SDK to ease AI agent development
['software development', 'Cloud computing', 'open source software']
193.55 How open systems drive AI performance
['Cloud computing', 'open source software', 'search engines', 'NVIDIA']
----------------------------------------------------------
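
How `select_top_articles_with_diversity` enforces diversity is not shown in this notebook. A common way to get this effect is a greedy loop that stops rewarding clusters already covered by earlier picks. The sketch below illustrates that general technique under assumed names and data; it is not the project's implementation.

```python
from typing import Dict, List, Set


def sketch_select_with_diversity(
    articles: List[dict], cluster_scores: Dict[str, float], n: int
) -> List[dict]:
    """Greedy pick: clusters covered by earlier picks no longer add to an article's score."""
    remaining = list(articles)
    covered: Set[str] = set()
    selected: List[dict] = []
    while remaining and len(selected) < n:
        def score(article: dict) -> float:
            # Only clusters not yet covered contribute to the score.
            return sum(
                cluster_scores.get(name, 0.0)
                for name in article["clusters"]
                if name not in covered
            )
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
        covered.update(best["clusters"])
    return selected


# Dummy scores and articles, for illustration only.
dummy_scores = {"NVIDIA": 3.0, "automation": 2.0, "data science": 2.5, "Cloud computing": 1.0}
dummy_articles = [
    {"title": "A", "clusters": ["NVIDIA", "automation"]},
    {"title": "B", "clusters": ["NVIDIA", "data science"]},
    {"title": "C", "clusters": ["Cloud computing", "automation"]},
]

picked = sketch_select_with_diversity(dummy_articles, dummy_scores, n=2)
print([article["title"] for article in picked])  # ['B', 'C']
```

A plain top-2 by total score would return B and A, which both lean on the NVIDIA cluster. Once B is picked, A only scores on `automation`, so C is preferred.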

## Inspect result

In [13]:
flow.config.get("db_path")

'/Users/mathieucrilout/Repos/train_tune_deploy/data/ttd_tinydb.json'

In [14]:
from ttd.storage.ttd_storage import TTDStorage

storage = TTDStorage(flow.config.get("db_path"))
storage.get_all('selected_articles')

[{'original_table_name': 'replicated_articles',
  'original_doc_id': '810',
  'table_name': 'selected_articles',
  'created_at': '2025-05-13T15:57:46.846697',
  'doc_id': '1'},
 {'original_table_name': 'replicated_articles',
  'original_doc_id': '333',
  'table_name': 'selected_articles',
  'created_at': '2025-05-13T15:57:46.903167',
  'doc_id': '2'},
 {'original_table_name': 'replicated_articles',
  'original_doc_id': '473',
  'table_name': 'selected_articles',
  'created_at': '2025-05-13T15:57:46.939717',
  'doc_id': '3'},
 {'original_table_name': 'replicated_articles',
  'original_doc_id': '577',
  'table_name': 'selected_articles',
  'created_at': '2025-05-13T15:57:46.975671',
  'doc_id': '4'},
 {'original_table_name': 'replicated_articles',
  'original_doc_id': '590',
  'table_name': 'selected_articles',
  'created_at': '2025-05-13T15:57:47.019303',
  'doc_id': '5'},
 {'original_table_name': 'replicated_articles',
  'original_doc_id': '742',
  'table_name': 'selected_articles',
  