# Project overview

## Title
Extraction and structured querying of information from restaurant menus

## Objective
Build an end-to-end pipeline to:
- extract text from menu documents,
- recognize and normalize structured entities (dishes, ingredients, techniques),
- create mappings that support semantic queries,
- evaluate the answers of an LLM-based agent against a ground truth.

## Data
- Main folder: Dataset
    - Knowledge_base/menu: menu documents (pages to be parsed)
    - ground_truth: reference files for mapping and evaluation (e.g. dish_mapping.json, ground_truth_mapped.csv)
- Artifacts are written to the project's `artifacts` folder (parsed_menus.json, extracted_menu_info.json, *_to_dishes.json, etc.)

## Architecture / Pipeline
1. Parsing and aggregation
     - parse_documents_in_directory(): reads the documents from the dataset
     - group_and_concatenate_documents(): groups the pages by document and concatenates them
     - output: `parsed_menus.json`

2. Structured extraction
     - extract_structured_info_from_menus(..., model_name="gpt-4.1-nano")
     - uses an LLM to identify fields such as `restaurant_name`, `dishes`, `ingredients`, `techniques`
     - output: `extracted_menu_info.json`

3. Mapping creation
     - create_mappings(extracted_info, dish_mapping)
     - builds the ingredient->dishes and technique->dishes mappings
     - output: `ingredient_to_dishes.json`, `technique_to_dishes.json`

4. Conversational agent / queries
     - get_agent(model_name="gpt-4.1")
     - query_dish_ids(question, agent): runs natural-language queries over the extracted data

5. Evaluation
     - evaluate_easy_questions(agent, question_path, ground_truth_path)
     - compares the agent's answers with the ground truth and produces `eval_df`
     - output: `easy_questions_evaluation_results.csv`

## Results
- JSON files with the parsed text and the extracted information
- Normalized mappings between ingredients/techniques and dishes
- LLM-based query mechanism for answering questions about the menus
- Evaluation DataFrame with accuracy metrics on the easy questions
- Accuracy on the easy questions: 87%

## How to use the notebook
- Run the cells in order to populate the global variables (dataset_file_path, artifacts_file_path, documents, extracted_info, agent, etc.)
- Check the files under `artifacts/` for intermediate results
- Call `evaluate_easy_questions(...)` to run the automatic evaluation

## Limitations and future work
- LLM sensitivity to prompt and data quality: improve prompt engineering and post-processing
- Entity normalization: extend the rules and fuzzy matching for a more robust mapping
- Scalability: parallelize parsing and extraction for larger datasets
- Evaluation: add finer-grained metrics and more complex test cases
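
The entity-normalization item above could be prototyped with the standard library's `difflib`. The sketch below is only illustrative: `normalize_name`, `match_entity`, and the `cutoff` value are hypothetical choices, not part of the project code, and the small vocabulary simply reuses ingredient names that appear in the evaluation questions.

```python
import difflib
import unicodedata
from typing import List, Optional


def normalize_name(name: str) -> str:
    """Lowercase, strip accents and collapse whitespace (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(without_accents.lower().split())


def match_entity(query: str, vocabulary: List[str], cutoff: float = 0.8) -> Optional[str]:
    """Return the vocabulary entry closest to `query`, or None if nothing is close enough."""
    normalized = {normalize_name(item): item for item in vocabulary}
    candidates = difflib.get_close_matches(
        normalize_name(query), list(normalized), n=1, cutoff=cutoff
    )
    return normalized[candidates[0]] if candidates else None


# Illustrative vocabulary; in the pipeline it would come from the extracted entities.
known_ingredients = ["Chocobo Wings", "Sashimi di Magikarp", "Ravioli al Vaporeon"]
print(match_entity("chocobo wing", known_ingredients))   # close variant of a known name
print(match_entity("Pane di Luce", known_ingredients))   # nothing similar enough
```

A stricter `cutoff` reduces false merges between similarly named entities at the cost of missing more spelling variants, so the threshold would need tuning against the ground truth.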

## Conclusion
The pipeline combines preprocessing, LLM-driven extraction, mapping, and evaluation to turn unstructured menus into a queryable knowledge base, providing a reusable foundation for search, recommendation, and analytics applications on menus.

# Setup

In [None]:
from pathlib import Path
import sys

# Resolve the project root, two levels above the notebook's working directory
cwd = Path.cwd().resolve()
project_dir = cwd.parent.parent

# Make the `src` package importable from the notebook
if str(project_dir) not in sys.path:
    sys.path.insert(0, str(project_dir))

dataset_file_path = project_dir / "Dataset"
artifacts_file_path = cwd / "artifacts" / "simple_rag"

# Preprocessing

## Parsing and aggregation

In [10]:
from src.preprocessing.menu_ingestion import group_and_concatenate_documents, parse_documents_in_directory
from src.utils import write_json


menus_path = dataset_file_path / "Knowledge_base" / "menu"
documents_pages = parse_documents_in_directory(document_path=menus_path)
documents = group_and_concatenate_documents(documents=documents_pages)
write_json(documents, artifacts_file_path / "parsed_menus.json")

for doc_name, doc_text in list(documents.items())[:5]:
    print(f"Document: {doc_name}\nContent Preview: {doc_text[:100]}...\n")




Ricomposti 30 documenti.
Document: Anima Cosmica.pdf
Content Preview: Ristorante "Anima Cosmica"
Chef Aurora Stellaris
Nel cuore pulsante di Pandora, dove le foreste biol...

Document: Armonia Universale.pdf
Content Preview: Ristorante "Armonia Universale"
Chef: Maestro Alessandro Stellanova
Su Pandora, tra le bioluminescen...

Document: Cosmica Essenza.pdf
Content Preview: Cosmica Essenza
Alla guida dello Chef Aurelio "Starweaver" Celestini su Tatooine
Nel vasto deserto d...

Document: Datapizza.pdf
Content Preview: Ristorante "L'Infinito Sapore"
Viaggio nel Tempo e nel Gusto su Pandora
Chef Alessandro-Pierpaolo-Ja...

Document: Eco di Pandora.pdf
Content Preview: Ristorante "L'Eco di Pandora"
Chef Alessandra Novastella
Nel cuore pulsante di Pandora, dove la natu...
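
The grouping step can be pictured as collecting page records per source file and joining their text in page order. The sketch below is an assumption about how such a step might work, not the actual body of `group_and_concatenate_documents`: the `group_pages` name and the `source`, `page`, and `text` keys are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def group_pages(pages: List[dict]) -> Dict[str, str]:
    """Group page records by source file and join their text in page order."""
    by_document = defaultdict(list)
    for page in pages:
        by_document[page["source"]].append((page["page"], page["text"]))
    return {
        name: "\n".join(text for _, text in sorted(chunks, key=lambda chunk: chunk[0]))
        for name, chunks in by_document.items()
    }


# Made-up page records, deliberately out of order.
pages = [
    {"source": "a.pdf", "page": 2, "text": "second"},
    {"source": "a.pdf", "page": 1, "text": "first"},
    {"source": "b.pdf", "page": 1, "text": "only"},
]
print(group_pages(pages))
```

Sorting by page number before joining keeps the concatenated text stable even if the parser returns pages in arbitrary order.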



## Structured extraction

In [None]:
from src.preprocessing.menu_extraction import extract_structured_info_from_menus

extracted_info = extract_structured_info_from_menus(documents=documents, model_name="gpt-4.1-nano")
write_json(extracted_info, artifacts_file_path / "extracted_menu_info.json")

for info in extracted_info[:5]:
    restaurant_name = info.get("restaurant_name", "Unknown")
    print(f"Document: {restaurant_name}\nExtracted Info: {info}\n")

2025-12-03 22:29:08,518 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Info from document: Anima Cosmica.pdf has been extracted.


2025-12-03 22:29:19,994 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Info from document: Armonia Universale.pdf has been extracted.


2025-12-03 22:29:31,051 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Info from document: Cosmica Essenza.pdf has been extracted.


2025-12-03 22:29:42,622 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Info from document: Datapizza.pdf has been extracted.


2025-12-03 22:29:54,807 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Info from document: Eco di Pandora.pdf has been extracted.


2025-12-03 22:30:09,169 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Info from document: Eredita Galattica.pdf has been extracted.


2025-12-03 22:30:26,551 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Info from document: Essenza dell Infinito.pdf has been extracted.


2025-12-03 22:30:42,264 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Info from document: Il Firmamento.pdf has been extracted.


2025-12-03 22:30:59,933 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Info from document: L Architetto dell Universo.pdf has been extracted.


2025-12-03 22:31:19,800 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Info from document: L Eco dei Sapori.pdf has been extracted.


2025-12-03 22:31:31,778 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Info from document: L Equilibrio Quantico.pdf has been extracted.


2025-12-03 22:31:45,401 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Info from document: L Essenza Cosmica.pdf has been extracted.


2025-12-03 22:32:08,536 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Info from document: L Essenza del Multiverso su Pandora.pdf has been extracted.


2025-12-03 22:32:23,795 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Info from document: L Essenza delle Dune.pdf has been extracted.


2025-12-03 22:32:39,557 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Info from document: L Essenza di Asgard.pdf has been extracted.


2025-12-03 22:32:51,342 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Info from document: L Etere del Gusto.pdf has been extracted.


2025-12-03 22:33:06,498 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Info from document: L infinito in un Boccone.pdf has been extracted.


2025-12-03 22:33:18,482 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Info from document: L Oasi delle Dune Stellari.pdf has been extracted.


2025-12-03 22:33:34,149 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Info from document: L Universo in Cucina.pdf has been extracted.


2025-12-03 22:33:49,601 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Info from document: Le Dimensioni del Gusto.pdf has been extracted.


2025-12-03 22:34:02,099 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Info from document: Le Stelle che Ballano.pdf has been extracted.


2025-12-03 22:34:15,314 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Info from document: Le Stelle Danzanti.pdf has been extracted.


2025-12-03 22:34:31,832 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Info from document: Ristorante delle Dune Stellari.pdf has been extracted.


2025-12-03 22:34:44,281 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Info from document: Ristorante Quantico.pdf has been extracted.


2025-12-03 22:34:55,349 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Info from document: Sala del Valhalla.pdf has been extracted.


2025-12-03 22:35:05,848 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Info from document: Sapore del Dune.pdf has been extracted.


2025-12-03 22:35:20,948 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Info from document: Stelle Astrofisiche.pdf has been extracted.


2025-12-03 22:35:32,513 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Info from document: Stelle dell Infinito Celestiale.pdf has been extracted.


2025-12-03 22:35:42,451 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Info from document: Tutti a TARSvola.pdf has been extracted.


2025-12-03 22:35:59,447 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Info from document: Universo Gastronomico di Namecc.pdf has been extracted.


AttributeError: 'list' object has no attribute 'items'
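
The `AttributeError` above is what Python raises when `.items()` is called on a list, which suggests that `extracted_info` is a list of records rather than a dict keyed by document name. A minimal sketch of a preview helper that tolerates both shapes follows; `preview` is a hypothetical name and the sample records are made up.

```python
from itertools import islice
from typing import Any, Iterator, Tuple


def preview(records: Any, limit: int = 5) -> Iterator[Tuple[str, Any]]:
    """Yield up to `limit` (label, record) pairs from a dict or a list of dicts."""
    if isinstance(records, dict):
        pairs = iter(records.items())
    else:
        # For a list, fall back to the record's own `restaurant_name` as the label.
        pairs = ((rec.get("restaurant_name", "Unknown"), rec) for rec in records)
    return islice(pairs, limit)


as_list = [{"restaurant_name": "A", "dishes": []}, {"dishes": []}]
as_dict = {"A.pdf": {"dishes": []}}
for label, record in preview(as_list):
    print(label, record)
for label, record in preview(as_dict):
    print(label, record)
```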

## Mapping creation

In [23]:
from src.preprocessing.menu_mapping import create_mappings
from src.utils import read_json

dish_mapping = read_json(dataset_file_path / "ground_truth" / "dish_mapping.json")
ingredient_to_dishes, technique_to_dishes = create_mappings(extracted_info=extracted_info, dish_mapping=dish_mapping)

write_json(technique_to_dishes, artifacts_file_path / 'technique_to_dishes.json')
write_json(ingredient_to_dishes, artifacts_file_path / 'ingredient_to_dishes.json')

Created mappings:
- Techniques: 284 unique techniques
- Ingredients: 171 unique ingredients
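
Conceptually, each mapping is an inverted index from an ingredient or technique name to the IDs of the dishes that use it. The sketch below shows one way to build such an index; it is not the body of `create_mappings`. The `build_inverted_index` name, the per-dish `name` key, and the shape of the inputs are assumptions, and the records are made up.

```python
from collections import defaultdict
from typing import Dict, List


def build_inverted_index(dishes: List[dict], dish_ids: Dict[str, int], field: str) -> Dict[str, List[int]]:
    """Map each normalized value of `field` to the sorted IDs of the dishes that list it."""
    index = defaultdict(set)
    for dish in dishes:
        dish_id = dish_ids.get(dish["name"])
        if dish_id is None:
            continue  # dish missing from the ID mapping: skip it
        for value in dish.get(field, []):
            index[value.strip().lower()].add(dish_id)
    return {key: sorted(ids) for key, ids in index.items()}


# Made-up records, for illustration only.
dishes = [
    {"name": "Dish A", "ingredients": ["Salt", "Pepper"]},
    {"name": "Dish B", "ingredients": ["salt "]},
    {"name": "Dish C", "ingredients": ["Salt"]},  # no ID available, so it is skipped
]
dish_ids = {"Dish A": 1, "Dish B": 2}
print(build_inverted_index(dishes, dish_ids, "ingredients"))
```

Normalizing keys before insertion merges trivially different spellings, which would otherwise inflate the count of unique entries.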


# Engine

## Conversational agent / queries

In [3]:
from src.ai.agents.engine import get_agent, query_dish_ids

agent = get_agent(model_name="gpt-4.1")
response = query_dish_ids(question="Quali sono i piatti che includono le Chocobo Wings come ingrediente?", agent=agent)
print(response)

MAPPINGS_DIR is set to: C:\Users\g.liturri\Desktop\data-pizza\test-tecnico-ai-engineer\src\experiments\artifacts


{78}
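
For questions that name more than one ingredient or technique, a deterministic lookup over the mapping files can serve as a sanity check on the agent's answer. The sketch below assumes each mapping is a dict from a name to a list of dish IDs; `dishes_matching_all` is a hypothetical helper and the mapping shown is made up.

```python
from typing import Dict, Iterable, List, Set


def dishes_matching_all(mapping: Dict[str, List[int]], keys: Iterable[str]) -> Set[int]:
    """Return the dish IDs listed under every key; an unknown key yields an empty set."""
    id_sets = [set(mapping.get(key, [])) for key in keys]
    return set.intersection(*id_sets) if id_sets else set()


# Made-up mapping for illustration only; real IDs would come from `technique_to_dishes.json`.
technique_to_dishes = {
    "technique a": [1, 2, 3],
    "technique b": [2, 3, 4],
}
print(dishes_matching_all(technique_to_dishes, ["technique a", "technique b"]))
print(dishes_matching_all(technique_to_dishes, ["technique a", "missing"]))
```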


## Evaluation

In [None]:
from src.evaluation.easy_questions_evaluation import evaluate_easy_questions


question_path = dataset_file_path / "domande.csv"
ground_truth_path = dataset_file_path / "ground_truth" / "ground_truth_mapped.csv"
eval_df = evaluate_easy_questions(agent=agent, question_path=question_path, ground_truth_path=ground_truth_path)
eval_df.to_csv(artifacts_file_path / "easy_questions_evaluation_results.csv", index=False)

  atteso:    [78]
  predetto:  [78]
[001] 1.00 - Quali sono i piatti che includono le Chocobo Wings come ingrediente?
--------------------



  atteso:    [225]
  predetto:  [225]
[002] 1.00 - Quali piatti dovrei scegliere per un banchetto a tema magico che includa le celebri Cioccorane?
--------------------



  atteso:    [156]
  predetto:  [156]
[003] 1.00 - Quali sono i piatti della galassia che contengono Latte+?
--------------------



  atteso:    [215]
  predetto:  [249]
[004] 0.00 - Quali piatti contengono i Ravioli al Vaporeon?
--------------------



  atteso:    [94]
  predetto:  [36, 43, 46, 55, 78, 90, 94, 127, 169, 180, 181, 201, 216, 262]
[005] 0.07 - Quali sono i piatti che includono i Sashimi di Magikarp?
--------------------



  atteso:    [179]
  predetto:  [179]
[006] 1.00 - Quali piatti sono accompagnati dai misteriosi Frutti del Diavolo, che donano poteri speciali a chi li consuma?
--------------------



  atteso:    [171, 189, 267]
  predetto:  [171, 189, 267]
[007] 1.00 - Quali piatti preparati con la tecnica Grigliatura a Energia Stellare DiV?
--------------------



  atteso:    [6, 13, 15, 51, 130, 209]
  predetto:  [130, 187]
[008] 0.14 - Quali piatti sono preparati utilizzando la tecnica della Sferificazione a Gravità Psionica Variabile?
--------------------



  atteso:    [76, 207]
  predetto:  [76]
[009] 0.50 - Quali piatti sono preparati sia con la Marinatura Temporale Sincronizzata che con il Congelamento Bio-Luminiscente Sincronico?
--------------------



KeyboardInterrupt: 
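
The per-question scores printed above are consistent with Jaccard similarity between the expected and predicted ID sets: question 005 shares one ID out of 14 distinct IDs (1/14 ≈ 0.07), and question 008 shares one out of 7 (≈ 0.14). The sketch below assumes that metric; `jaccard_score` is a hypothetical name and may not match what `evaluate_easy_questions` does internally.

```python
from typing import Set


def jaccard_score(expected: Set[int], predicted: Set[int]) -> float:
    """Size of the intersection divided by size of the union; two empty sets score 1.0."""
    if not expected and not predicted:
        return 1.0
    return len(expected & predicted) / len(expected | predicted)


# Inputs taken from questions 008 and 009 in the output above.
print(f"{jaccard_score({6, 13, 15, 51, 130, 209}, {130, 187}):.2f}")  # 0.14
print(f"{jaccard_score({76, 207}, {76}):.2f}")                        # 0.50
```

Under this metric, an answer that contains the right dish among many wrong ones (as in question 005) is penalized almost as heavily as a completely wrong answer.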

In [None]:
print("Evaluation Results:")
print(eval_df)
# The results table exposes a `score` column (there is no `correct` column)
print("Accuracy:", eval_df['score'].mean())

Unnamed: 0,row_id,expected,predicted,score
0,1,{78},{78},1.0
1,2,{225},{225},1.0
2,3,{156},{156},1.0
3,4,{215},{215},1.0
4,5,{94},{94},1.0
5,6,{179},{179},1.0
6,7,"{267, 171, 189}","{267, 171, 189}",1.0
7,8,"{130, 6, 13, 15, 209, 51}","{130, 6, 13, 15, 209, 51}",1.0
8,9,"{76, 207}","{76, 207}",1.0
9,10,"{184, 266, 115}","{184, 266, 115}",1.0
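
When the results are read back from CSV, the `expected` and `predicted` columns arrive as strings such as `"{76, 207}"`. A minimal sketch of restoring them to Python sets and recomputing the mean score with the standard library follows; the two sample rows are illustrative, and the column names are taken from the table above.

```python
import ast
import csv
import io

# Illustrative rows in the same layout as the results table above.
csv_text = '''row_id,expected,predicted,score
7,"{267, 171, 189}","{267, 171, 189}",1.0
9,"{76, 207}","{76}",0.5
'''

rows = list(csv.DictReader(io.StringIO(csv_text)))
for row in rows:
    # literal_eval safely parses set literals without executing arbitrary code
    row["expected"] = ast.literal_eval(row["expected"])
    row["predicted"] = ast.literal_eval(row["predicted"])
    row["score"] = float(row["score"])

mean_score = sum(row["score"] for row in rows) / len(rows)
print(rows[1]["expected"] - rows[1]["predicted"])  # IDs the agent missed
print(mean_score)
```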
