# 🔍 Article Enrichment Analysis for TTD Newsletter

This notebook is an interactive replica of the article enrichment main flow.

Use it to explore, test, and evaluate the flow with dummy data.

**You must execute every step, in order, to get correct results.**

**You should inspect the flow state at each step** to understand the flow.

In [1]:
from metaflow import FlowSpec, step, Parameter
from pydantic import BaseModel, Field
from datetime import datetime

from ttd.utils.print import safe_pretty_print

class FlowParametersSchema(BaseModel):
    """Schema for flow parameters."""
    articles_table: str = Field(
        'articles', description="Table to load articles from"
    )
    articles_limit: int = Field(
        2, description="Maximum number of articles to process"
    )
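    # NOTE: this default is an RFC 2822 string, not a datetime. Pydantic does not
    # validate defaults, which is why model_dump() emits the serializer warning.
    date_threshold: datetime = Field(
        'Thu, 03 Apr 2025 18:00:00 +0000',
        description="Process articles published after this date",
    )
    replicates_table: str = Field(
        'replicated_articles', description="Replicate articles to this table"
    )
    clean_tables: bool = Field(
        False, description="Clean tables (tags, tag_clusters, replicated_articles)"
    )

    class Config:
        # Accept attributes that are not declared above (e.g. flow.config).
        extra = 'allow'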

flow = FlowParametersSchema(
    articles_table='replicated_articles',
    replicates_table='dummy_table',
    articles_limit=2,
    clean_tables=True
)
safe_pretty_print(flow.model_dump())

  Expected `datetime` but got `str` with value `'Thu, 03 Apr 2025 18:00:00 +0000'` - serialized value may not be as expected
  return self.__pydantic_serializer__.to_python(
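
The warning above appears because the `date_threshold` default is a plain string and Pydantic does not validate field defaults unless asked to. A minimal sketch, using only the standard library, of how the RFC 2822 string could be converted to a timezone-aware `datetime` before it is handed to the schema. Whether the downstream steps accept a `datetime` here is an assumption to verify.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Same RFC 2822 string as the schema default.
raw_threshold = 'Thu, 03 Apr 2025 18:00:00 +0000'

# parsedate_to_datetime returns a timezone-aware datetime for a "+0000" offset.
date_threshold = parsedate_to_datetime(raw_threshold)

print(date_threshold.isoformat())  # 2025-04-03T18:00:00+00:00
```

The parsed value could then be passed explicitly, for example `FlowParametersSchema(date_threshold=date_threshold, ...)`.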


### **1. Start step**
Initializes the flow: loads the configuration and parameters, and sets up version tracking.

In [2]:
from ttd.flows.article_enrichment.steps.start import execute as start_step

start_step(flow)
safe_pretty_print(flow.model_dump())

2025-05-09 17:10:25,852 - ttd.flows.article_enrichment.steps.start - INFO - ✅ Database first connection established.
2025-05-09 17:10:25,971 - ttd.flows.article_enrichment.steps.start - INFO - ✅ Database cleaned.
  Expected `datetime` but got `str` with value `'Thu, 03 Apr 2025 18:00:00 +0000'` - serialized value may not be as expected
  return self.__pydantic_serializer__.to_python(


### **2. Load articles step**

In [3]:
from ttd.flows.article_enrichment.steps.load_articles \
    import execute as load_articles_step

load_articles_step(flow)
safe_pretty_print(flow.model_dump())

2025-05-09 17:10:25,978 - ttd.flows.article_enrichment.steps.load_articles - INFO - Loading articles...
2025-05-09 17:10:26,007 - ttd.flows.article_enrichment.steps.load_articles - INFO - ✅ Loaded 2 articles from 'replicated_articles': len(articles)=2, date_threshold='2025-04-03 18:00:00+00:00'
2025-05-09 17:10:26,008 - ttd.flows.article_enrichment.steps.load_articles - INFO - ✅ Filtering out already replicated articles...
2025-05-09 17:10:26,032 - ttd.flows.article_enrichment.steps.load_articles - INFO - ✅ There are 0 already replicated articles
2025-05-09 17:10:26,033 - ttd.flows.article_enrichment.steps.load_articles - INFO - ✅ Filtered out 0 already replicated articles from 'dummy_table'
2025-05-09 17:10:26,033 - ttd.flows.article_enrichment.steps.load_articles - INFO - ✅ Loaded 2 articles...
2025-05-09 17:10:26,033 - ttd.flows.article_enrichment.steps.load_articles - INFO - ✅ Step load_articles done in 0.05s
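
The log above suggests that this step loads articles published after `date_threshold`, caps them at `articles_limit`, and then drops those already present in the replicates table. A rough sketch of that selection logic on in-memory data follows. The function name and the `published_at` field are hypothetical, and the real step reads from the database rather than from a list.

```python
from datetime import datetime, timezone

def select_articles(articles, already_replicated_ids, date_threshold, limit):
    """Hypothetical helper: threshold filter, then limit, then de-duplication."""
    # Keep only articles published strictly after the threshold.
    recent = [a for a in articles if a["published_at"] > date_threshold]
    # Cap the number of candidates, mirroring `articles_limit`.
    limited = recent[:limit]
    # Drop articles that were already replicated.
    return [a for a in limited if a["doc_id"] not in already_replicated_ids]

utc = timezone.utc
articles = [
    {"doc_id": "1", "published_at": datetime(2025, 4, 1, tzinfo=utc)},
    {"doc_id": "2", "published_at": datetime(2025, 4, 4, 5, 15, 52, tzinfo=utc)},
    {"doc_id": "3", "published_at": datetime(2025, 4, 5, tzinfo=utc)},
    {"doc_id": "4", "published_at": datetime(2025, 4, 6, tzinfo=utc)},
]
threshold = datetime(2025, 4, 3, 18, tzinfo=utc)

selected = select_articles(articles, {"3"}, threshold, limit=2)
print([a["doc_id"] for a in selected])  # ['2']
```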


### **3. Is AI articles step**

Checks whether each article is about AI.

In [4]:
from ttd.flows.article_enrichment.steps.is_ai_articles \
    import execute as is_ai_articles_step

is_ai_articles_step(flow)
safe_pretty_print(flow.model_dump())

2025-05-09 17:10:26,635 - numexpr.utils - INFO - NumExpr defaulting to 10 threads.
2025-05-09 17:10:26,799 - ttd.flows.article_enrichment.steps.is_ai_articles - INFO - Classifying articles as AI-related...
2025-05-09 17:10:26,804 - ttd.flows.utils - INFO - ✅ Loading model spec: article_is_ai_classifier_spec


2025-05-09 17:10:26,807 - ttd.flows.utils - INFO - None
2025-05-09 17:10:26,807 - ttd.flows.utils - INFO - ✅ Provider 'openai'  Model 'meta-llama/llama-4-maverick:free' 
2025-05-09 17:10:26,808 - ttd.flows.utils - INFO - ✅ Predict 1/2 
2025-05-09 17:10:26,808 - ttd.flows.utils - INFO - ✅ Inputs:


2025-05-09 17:10:26,809 - ttd.flows.utils - INFO - None
2025-05-09 17:10:26,951 - httpx - INFO - HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"
2025-05-09 17:10:28,769 - ttd.flows.utils - INFO - ✅ Outputs:


2025-05-09 17:10:28,773 - ttd.flows.utils - INFO - None
2025-05-09 17:10:28,776 - ttd.flows.utils - INFO - ✅ Provider 'openai'  Model 'meta-llama/llama-4-maverick:free' 
2025-05-09 17:10:28,777 - ttd.flows.utils - INFO - ✅ Predict 2/2 
2025-05-09 17:10:28,777 - ttd.flows.utils - INFO - ✅ Inputs:


2025-05-09 17:10:28,779 - ttd.flows.utils - INFO - None
2025-05-09 17:10:28,870 - httpx - INFO - HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"
2025-05-09 17:10:30,019 - ttd.flows.utils - INFO - ✅ Outputs:


2025-05-09 17:10:30,023 - ttd.flows.utils - INFO - None
2025-05-09 17:10:30,024 - ttd.flows.article_enrichment.steps.is_ai_articles - INFO - ✅ Step is_ai_articles done in 3.22s
  Expected `datetime` but got `str` with value `'Thu, 03 Apr 2025 18:00:00 +0000'` - serialized value may not be as expected
  return self.__pydantic_serializer__.to_python(


### **4. Dense summarizer step**

Generates a dense summary for each AI-related article.

In [5]:
from ttd.flows.article_enrichment.steps.dense_summarizer \
    import execute as dense_summarizer_step

dense_summarizer_step(flow)
safe_pretty_print(flow.model_dump())

2025-05-09 17:10:30,053 - ttd.flows.article_enrichment.steps.dense_summarizer - INFO - Generating dense summaries for AI-related articles...
2025-05-09 17:10:30,061 - ttd.flows.utils - INFO - ✅ Loading model spec: dense_summarizer_spec


2025-05-09 17:10:30,065 - ttd.flows.utils - INFO - None
2025-05-09 17:10:30,065 - ttd.flows.utils - INFO - ✅ Provider 'openai'  Model 'meta-llama/llama-4-maverick:free' 
2025-05-09 17:10:30,066 - ttd.flows.utils - INFO - ✅ Predict 1/1 
2025-05-09 17:10:30,066 - ttd.flows.utils - INFO - ✅ Inputs:


2025-05-09 17:10:30,067 - ttd.flows.utils - INFO - None
2025-05-09 17:10:30,208 - httpx - INFO - HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"
2025-05-09 17:10:39,706 - ttd.flows.utils - INFO - ✅ Outputs:


2025-05-09 17:10:39,711 - ttd.flows.utils - INFO - None
2025-05-09 17:10:39,712 - ttd.flows.article_enrichment.steps.dense_summarizer - INFO - ✅ Step dense_summarizer done in 9.66s


### **5. Core line summarizer step**

Generates a core line summary from a dense summary.

In [6]:
from ttd.flows.article_enrichment.steps.core_line_summarizer \
    import execute as core_line_summarizer_step

core_line_summarizer_step(flow)
safe_pretty_print(flow.model_dump())

2025-05-09 17:10:39,742 - ttd.flows.article_enrichment.steps.core_line_summarizer - INFO - Generating core line summaries for AI-related articles...
2025-05-09 17:10:39,750 - ttd.flows.utils - INFO - ✅ Loading model spec: core_line_summarizer_spec


2025-05-09 17:10:39,754 - ttd.flows.utils - INFO - None
2025-05-09 17:10:39,754 - ttd.flows.utils - INFO - ✅ Provider 'openai'  Model 'meta-llama/llama-4-maverick:free' 
2025-05-09 17:10:39,755 - ttd.flows.utils - INFO - ✅ Predict 1/1 
2025-05-09 17:10:39,755 - ttd.flows.utils - INFO - ✅ Inputs:


2025-05-09 17:10:39,756 - ttd.flows.utils - INFO - None
2025-05-09 17:10:39,915 - httpx - INFO - HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"
2025-05-09 17:10:41,434 - ttd.flows.utils - INFO - ✅ Outputs:


2025-05-09 17:10:41,439 - ttd.flows.utils - INFO - None
2025-05-09 17:10:41,440 - ttd.flows.article_enrichment.steps.core_line_summarizer - INFO - ✅ Step core_line_summarizer done in 1.70s


### **6. Tagger step**

Predicts the main tags of each article from its dense summary.

In [7]:
from ttd.flows.article_enrichment.steps.tagger import execute as tagger_step

tagger_step(flow)
safe_pretty_print(flow.model_dump())

2025-05-09 17:10:41,475 - ttd.flows.article_enrichment.steps.tagger - INFO - Extracting tags from dense summaries...
2025-05-09 17:10:41,495 - ttd.flows.utils - INFO - ✅ Loading model spec: tagger_spec


2025-05-09 17:10:41,501 - ttd.flows.utils - INFO - None
2025-05-09 17:10:41,502 - ttd.flows.utils - INFO - ✅ Provider 'openai'  Model 'meta-llama/llama-4-maverick:free' 
2025-05-09 17:10:41,502 - ttd.flows.utils - INFO - ✅ Predict 1/1 
2025-05-09 17:10:41,503 - ttd.flows.utils - INFO - ✅ Inputs:


2025-05-09 17:10:41,504 - ttd.flows.utils - INFO - None
2025-05-09 17:10:41,630 - httpx - INFO - HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"
2025-05-09 17:10:43,100 - ttd.flows.utils - INFO - ✅ Outputs:


2025-05-09 17:10:43,106 - ttd.flows.utils - INFO - None
2025-05-09 17:10:43,107 - ttd.flows.article_enrichment.steps.tagger - INFO - ✅ Step tagger done in 1.63s


### **7. Merge same tags step**

Aggregates the tags that articles have in common. The merged tags are used to update the tag representations in the next step.
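
As an illustration only, the aggregation could look like the sketch below: each tag name is mapped to the publication dates of the articles that carry it. The function name and the `tags` / `published_at` fields are assumptions, not the project's actual data model.

```python
from collections import defaultdict

def merge_same_tags(articles):
    """Hypothetical helper: group publication dates by tag name."""
    merged = defaultdict(list)
    for article in articles:
        for tag in article["tags"]:
            # Tags are matched on their exact name after trimming whitespace.
            merged[tag.strip()].append(article["published_at"])
    return dict(merged)

articles = [
    {"doc_id": "1", "published_at": "Fri, 04 Apr 2025 05:15:52 +0000",
     "tags": ["computer vision", "deep learning"]},
    {"doc_id": "2", "published_at": "Sat, 05 Apr 2025 09:00:00 +0000",
     "tags": ["deep learning"]},
]

merged_tags = merge_same_tags(articles)
print(len(merged_tags["deep learning"]))  # 2
```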

In [8]:
from ttd.flows.article_enrichment.steps.merge_same_tags \
    import execute as merge_same_tags_step

merge_same_tags_step(flow)
safe_pretty_print(flow.model_dump())

2025-05-09 17:10:43,140 - ttd.flows.article_enrichment.steps.merge_same_tags - INFO - Merging extracted tags...
2025-05-09 17:10:43,141 - ttd.flows.article_enrichment.steps.merge_same_tags - INFO - ✅ Merge 2/2 
2025-05-09 17:10:43,141 - ttd.flows.article_enrichment.steps.merge_same_tags - INFO - ✅ Inputs:


2025-05-09 17:10:43,143 - ttd.flows.article_enrichment.steps.merge_same_tags - INFO - None
2025-05-09 17:10:43,143 - ttd.flows.article_enrichment.steps.merge_same_tags - INFO - ✅ multimodal large language models 2/6 - 1 Items
2025-05-09 17:10:43,143 - ttd.flows.article_enrichment.steps.merge_same_tags - INFO - ✅ Merged tags 1 - 1 Items


2025-05-09 17:10:43,145 - ttd.flows.article_enrichment.steps.merge_same_tags - INFO - None
2025-05-09 17:10:43,145 - ttd.flows.article_enrichment.steps.merge_same_tags - INFO - ✅ computer vision 2/6 - 2 Items
2025-05-09 17:10:43,145 - ttd.flows.article_enrichment.steps.merge_same_tags - INFO - ✅ Merged tags 2 - 2 Items


2025-05-09 17:10:43,147 - ttd.flows.article_enrichment.steps.merge_same_tags - INFO - None
2025-05-09 17:10:43,147 - ttd.flows.article_enrichment.steps.merge_same_tags - INFO - ✅ natural language processing 2/6 - 3 Items
2025-05-09 17:10:43,147 - ttd.flows.article_enrichment.steps.merge_same_tags - INFO - ✅ Merged tags 3 - 3 Items


2025-05-09 17:10:43,150 - ttd.flows.article_enrichment.steps.merge_same_tags - INFO - None
2025-05-09 17:10:43,150 - ttd.flows.article_enrichment.steps.merge_same_tags - INFO - ✅ deep learning 2/6 - 4 Items
2025-05-09 17:10:43,150 - ttd.flows.article_enrichment.steps.merge_same_tags - INFO - ✅ Merged tags 4 - 4 Items


2025-05-09 17:10:43,153 - ttd.flows.article_enrichment.steps.merge_same_tags - INFO - None
2025-05-09 17:10:43,154 - ttd.flows.article_enrichment.steps.merge_same_tags - INFO - ✅ AI training methods 2/6 - 5 Items
2025-05-09 17:10:43,154 - ttd.flows.article_enrichment.steps.merge_same_tags - INFO - ✅ Merged tags 5 - 5 Items


2025-05-09 17:10:43,157 - ttd.flows.article_enrichment.steps.merge_same_tags - INFO - None
2025-05-09 17:10:43,157 - ttd.flows.article_enrichment.steps.merge_same_tags - INFO - ✅ transfer learning 2/6 - 6 Items
2025-05-09 17:10:43,158 - ttd.flows.article_enrichment.steps.merge_same_tags - INFO - ✅ Merged tags 6 - 6 Items


2025-05-09 17:10:43,161 - ttd.flows.article_enrichment.steps.merge_same_tags - INFO - None
2025-05-09 17:10:43,161 - ttd.flows.article_enrichment.steps.merge_same_tags - INFO - ✅ Step merge_same_tags done in 0.02s


### **8. Update tags step**

Updates the tag histories, which are later used to choose the best representative for each tag cluster.
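
A minimal sketch of the idea, assuming a tag record holds a `name` and a `history` list of publication dates (as the `tags` records in the sandbox section suggest). The function name and the in-memory dictionary standing in for the database table are hypothetical.

```python
def update_tag_histories(tags_table, merged_tags):
    """Hypothetical helper: append new dates to known tags, create missing ones."""
    for name, dates in merged_tags.items():
        record = tags_table.get(name)
        if record is None:
            # Unknown tag: create it with its first history entries.
            tags_table[name] = {"name": name, "history": list(dates)}
        else:
            # Known tag: extend its history with the new occurrences.
            record["history"].extend(dates)
    return tags_table

tags_table = {
    "deep learning": {
        "name": "deep learning",
        "history": ["Fri, 04 Apr 2025 05:15:52 +0000"],
    },
}
new_occurrences = {
    "deep learning": ["Sat, 05 Apr 2025 09:00:00 +0000"],
    "transfer learning": ["Sat, 05 Apr 2025 09:00:00 +0000"],
}

update_tag_histories(tags_table, new_occurrences)
print(len(tags_table["deep learning"]["history"]))  # 2
```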

In [9]:
from ttd.flows.article_enrichment.steps.update_tags import execute as update_tags_step

update_tags_step(flow)
safe_pretty_print(flow.model_dump())

2025-05-09 17:10:43,184 - ttd.flows.article_enrichment.steps.update_tags - INFO - Saving merged tags to database...
2025-05-09 17:10:43,269 - ttd.flows.article_enrichment.steps.update_tags - INFO - ✅ Update 1/6 
2025-05-09 17:10:43,270 - ttd.flows.article_enrichment.steps.update_tags - INFO - ✅ Inputs:


2025-05-09 17:10:43,271 - ttd.flows.article_enrichment.steps.update_tags - INFO - None
2025-05-09 17:10:43,352 - ttd.flows.article_enrichment.steps.update_tags - INFO - ✅ Creating new tag: multimodal large language models
2025-05-09 17:10:43,485 - ttd.flows.article_enrichment.steps.update_tags - INFO - ✅ Update 2/6 
2025-05-09 17:10:43,485 - ttd.flows.article_enrichment.steps.update_tags - INFO - ✅ Inputs:


2025-05-09 17:10:43,486 - ttd.flows.article_enrichment.steps.update_tags - INFO - None
2025-05-09 17:10:43,542 - ttd.flows.article_enrichment.steps.update_tags - INFO - ✅ Creating new tag: computer vision
2025-05-09 17:10:43,567 - ttd.flows.article_enrichment.steps.update_tags - INFO - ✅ Update 3/6 
2025-05-09 17:10:43,568 - ttd.flows.article_enrichment.steps.update_tags - INFO - ✅ Inputs:


2025-05-09 17:10:43,569 - ttd.flows.article_enrichment.steps.update_tags - INFO - None
2025-05-09 17:10:43,623 - ttd.flows.article_enrichment.steps.update_tags - INFO - ✅ Creating new tag: natural language processing
2025-05-09 17:10:43,676 - ttd.flows.article_enrichment.steps.update_tags - INFO - ✅ Update 4/6 
2025-05-09 17:10:43,677 - ttd.flows.article_enrichment.steps.update_tags - INFO - ✅ Inputs:


2025-05-09 17:10:43,678 - ttd.flows.article_enrichment.steps.update_tags - INFO - None
2025-05-09 17:10:43,735 - ttd.flows.article_enrichment.steps.update_tags - INFO - ✅ Creating new tag: deep learning
2025-05-09 17:10:43,760 - ttd.flows.article_enrichment.steps.update_tags - INFO - ✅ Update 5/6 
2025-05-09 17:10:43,760 - ttd.flows.article_enrichment.steps.update_tags - INFO - ✅ Inputs:


2025-05-09 17:10:43,762 - ttd.flows.article_enrichment.steps.update_tags - INFO - None
2025-05-09 17:10:43,818 - ttd.flows.article_enrichment.steps.update_tags - INFO - ✅ Creating new tag: AI training methods
2025-05-09 17:10:43,843 - ttd.flows.article_enrichment.steps.update_tags - INFO - ✅ Update 6/6 
2025-05-09 17:10:43,843 - ttd.flows.article_enrichment.steps.update_tags - INFO - ✅ Inputs:


2025-05-09 17:10:43,844 - ttd.flows.article_enrichment.steps.update_tags - INFO - None
2025-05-09 17:10:43,930 - ttd.flows.article_enrichment.steps.update_tags - INFO - ✅ Creating new tag: transfer learning
2025-05-09 17:10:43,930 - ttd.flows.article_enrichment.steps.update_tags - INFO - ✅ Step update_tags done in 0.75s


### **9. Update clusters step**

Aggregates tags into clusters using embedding cosine similarity.

Picks the tag that has been most frequent recently as the cluster representative.
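
The two ideas can be sketched in plain Python as follows. This is an assumption-laden illustration, not the step's implementation: the function names, the `0.8` similarity threshold, the toy two-dimensional embeddings, and the "count occurrences since a cutoff date" rule are all hypothetical.

```python
import math
from datetime import datetime

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def assign_to_cluster(tag_embedding, clusters, threshold=0.8):
    """Return the index of the most similar cluster, or None if none is close enough."""
    best_index, best_score = None, threshold
    for index, cluster in enumerate(clusters):
        score = cosine_similarity(tag_embedding, cluster["embedding"])
        if score >= best_score:
            best_index, best_score = index, score
    return best_index

def pick_representative(histories, since):
    """Pick the tag with the most occurrences on or after `since`."""
    def recent_count(name):
        return sum(1 for date in histories[name] if date >= since)
    return max(histories, key=recent_count)

clusters = [
    {"name": "computer vision", "embedding": [1.0, 0.0]},
    {"name": "natural language processing", "embedding": [0.0, 1.0]},
]
print(assign_to_cluster([0.9, 0.1], clusters))  # 0
print(assign_to_cluster([0.7, 0.7], clusters))  # None

histories = {
    "LLM": [datetime(2025, 4, 4), datetime(2025, 4, 5)],
    "large language models": [
        datetime(2025, 3, 1), datetime(2025, 3, 2), datetime(2025, 4, 4),
    ],
}
print(pick_representative(histories, since=datetime(2025, 4, 1)))  # LLM
```

A tag that matches no existing cluster (the `None` case) would start a new cluster of its own, which is consistent with the one-tag clusters shown in the sandbox output.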

In [10]:
from ttd.flows.article_enrichment.steps.update_clusters \
    import execute as update_clusters_step

update_clusters_step(flow)
safe_pretty_print(flow.model_dump())

2025-05-09 17:10:43,956 - ttd.flows.article_enrichment.steps.update_clusters - INFO - Clustering tags...
2025-05-09 17:10:43,966 - ttd.flows.article_enrichment.steps.update_clusters - INFO - ✅ Update 1/6 
2025-05-09 17:10:44,265 - ttd.flows.article_enrichment.steps.update_clusters - INFO - ✅ Update 2/6 
2025-05-09 17:10:45,284 - httpx - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-05-09 17:10:45,720 - ttd.flows.article_enrichment.steps.update_clusters - INFO - ✅ Update 3/6 
2025-05-09 17:10:46,261 - httpx - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-05-09 17:10:46,631 - ttd.flows.article_enrichment.steps.update_clusters - INFO - ✅ Update 4/6 
2025-05-09 17:10:48,223 - httpx - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-05-09 17:10:48,657 - ttd.flows.article_enrichment.steps.update_clusters - INFO - ✅ Update 5/6 
2025-05-09 17:10:49,339 - httpx - INFO - HTTP Requ

### **10. Replicate articles step**

Replicates the original articles with the newly computed features.

Computes evaluations and adds them to the replicated articles.

In [11]:
from ttd.flows.article_enrichment.steps.replicate_articles \
    import execute as replicate_articles_step

replicate_articles_step(flow)
safe_pretty_print(flow.model_dump())

  from .autonotebook import tqdm as notebook_tqdm
2025-05-09 17:10:50,817 - datasets - INFO - PyTorch version 2.5.1 available.
2025-05-09 17:10:52,811 - ttd.flows.article_enrichment.steps.replicate_articles - INFO - Replicating articles with enrichment data...
2025-05-09 17:10:53,780 - ttd.flows.article_enrichment.steps.replicate_articles - INFO - ✅ Replicate 1/2 
2025-05-09 17:10:54,758 - absl - INFO - Using default tokenizer.
2025-05-09 17:10:54,835 - ttd.flows.article_enrichment.steps.replicate_articles - INFO - ✅ Replicate 2/2 
2025-05-09 17:10:54,835 - ttd.flows.article_enrichment.steps.replicate_articles - INFO - ✅ Step replicate_articles done in 2.02s
  Expected `datetime` but got `str` with value `'Thu, 03 Apr 2025 18:00:00 +0000'` - serialized value may not be as expected
  return self.__pydantic_serializer__.to_python(


### **11. Prepare Report**

Generate a detailed report from replicated articles.

To retrieve the report, use the following command:

```bash
python -m ttd.flows.article_enrichment.flow card get prepare_report report_card_10.html
```

Here is an example: [article_enrichment_report_example.html](./images/article_enrichment_report_example.html)

## Sandbox to try stuff

In [12]:
flow.config.get("db_path")

'/Users/mathieucrilout/Repos/train_tune_deploy/data/ttd_tinydb.json'

In [13]:
from ttd.storage.ttd_storage import TTDStorage

storage = TTDStorage(flow.config.get("db_path"))
storage.get_all('tagged_articles')

[{'original_table_name': 'replicated_articles',
  'original_doc_id': '2',
  'table_name': 'tagged_articles',
  'created_at': '2025-05-09T15:10:43.353151',
  'doc_id': '1'}]

In [14]:
tags = storage.get_all('tags')
tags, len(tags)

([{'table_name': 'tags',
   'name': 'multimodal large language models',
   'history': ['Fri, 04 Apr 2025 05:15:52 +0000'],
   'created_at': '2025-05-09T15:10:43.271524',
   'tag_cluster_id': '1',
   'last_updated': '2025-05-09T15:10:44.072607',
   'doc_id': '1'},
  {'table_name': 'tags',
   'name': 'computer vision',
   'history': ['Fri, 04 Apr 2025 05:15:52 +0000'],
   'created_at': '2025-05-09T15:10:43.486897',
   'tag_cluster_id': '2',
   'last_updated': '2025-05-09T15:10:45.531140',
   'doc_id': '2'},
  {'table_name': 'tags',
   'name': 'natural language processing',
   'history': ['Fri, 04 Apr 2025 05:15:52 +0000'],
   'created_at': '2025-05-09T15:10:43.569269',
   'tag_cluster_id': '3',
   'last_updated': '2025-05-09T15:10:46.439885',
   'doc_id': '3'},
  {'table_name': 'tags',
   'name': 'deep learning',
   'history': ['Fri, 04 Apr 2025 05:15:52 +0000'],
   'created_at': '2025-05-09T15:10:43.678390',
   'tag_cluster_id': '4',
   'last_updated': '2025-05-09T15:10:48.464522',
   '

In [15]:
tag_clusters = storage.get_all('tag_clusters')
tag_clusters, len(tag_clusters)

([{'table_name': 'tag_clusters',
   'name': 'multimodal large language models',
   'tag_synonyms': {'1': 'multimodal large language models'},
   'created_at': '2025-05-09T15:10:43.991883',
   'last_updated': '2025-05-09T15:10:44.185749',
   'doc_id': '1'},
  {'table_name': 'tag_clusters',
   'name': 'computer vision',
   'tag_synonyms': {'2': 'computer vision'},
   'created_at': '2025-05-09T15:10:45.443999',
   'last_updated': '2025-05-09T15:10:45.610390',
   'doc_id': '2'},
  {'table_name': 'tag_clusters',
   'name': 'natural language processing',
   'tag_synonyms': {'3': 'natural language processing'},
   'created_at': '2025-05-09T15:10:46.369008',
   'last_updated': '2025-05-09T15:10:46.521225',
   'doc_id': '3'},
  {'table_name': 'tag_clusters',
   'name': 'deep learning',
   'tag_synonyms': {'4': 'deep learning'},
   'created_at': '2025-05-09T15:10:48.383505',
   'last_updated': '2025-05-09T15:10:48.578411',
   'doc_id': '4'},
  {'table_name': 'tag_clusters',
   'name': 'AI traini

In [16]:
from ttd.models.loader import load_model_spec

model_spec_name = "tag_embedding_spec"
tag_embedding_spec = load_model_spec(model_spec_name)

In [17]:
safe_pretty_print(tag_embedding_spec)