# Modeling Product Matching to ADEME Emission Factors

## Objective
Match each product to the most relevant ADEME emission factor.

**Two cases handled by the same pipeline:**
- **Case 1**: matching products to physical emission factors
- **Case 2**: matching products to monetary ratios

## Implemented models
1. **Model 1**: Jaccard similarity (lexical baseline)
2. **Model 2**: TF-IDF + cosine similarity
3. **Model 3**: Sentence embeddings (SBERT)
4. **Model 4**: Hybrid approach (TF-IDF + SBERT)

## Uncertainty quantification
- Emission factor uncertainty (provided by ADEME)
- Matching uncertainty (derived from the similarity score)

---
## 1. Configuration and Loading

In [55]:
import pandas as pd
import numpy as np
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Sklearn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# SBERT (install if needed: pip install sentence-transformers)
try:
    from sentence_transformers import SentenceTransformer
    SBERT_AVAILABLE = True
    print("sentence-transformers disponible")
except ImportError:
    SBERT_AVAILABLE = False
    print("sentence-transformers non disponible - installer avec : pip install sentence-transformers")

from collections import Counter
import time

pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 80)

sentence-transformers disponible


In [56]:
# CONFIGURATION - ADJUST TO YOUR ENVIRONMENT
DATA_PREPROCESSED_PATH = Path("./data/preprocessed")
DATA_OUTPUT_PATH = Path("./data/output")
DATA_OUTPUT_PATH.mkdir(parents=True, exist_ok=True)

# Preprocessed files
FILE_PRODUITS = DATA_PREPROCESSED_PATH / "produits_preprocessed.csv"
FILE_FE_PHYSIQUE = DATA_PREPROCESSED_PATH / "fe_physique_preprocessed.csv"
FILE_RATIO_MONETAIRE = DATA_PREPROCESSED_PATH / "ratio_monetaire_preprocessed.csv"

In [57]:
# Load the preprocessed data
df_produits = pd.read_csv(FILE_PRODUITS)
df_fe_physique = pd.read_csv(FILE_FE_PHYSIQUE)
df_ratio_monetaire = pd.read_csv(FILE_RATIO_MONETAIRE)

print("Donnees chargees :")
print(f"  - Produits : {len(df_produits)} lignes")
print(f"  - FE Physiques : {len(df_fe_physique)} lignes")
print(f"  - Ratios Monetaires : {len(df_ratio_monetaire)} lignes")

Donnees chargees :
  - Produits : 47187 lignes
  - FE Physiques : 14502 lignes
  - Ratios Monetaires : 56 lignes


In [58]:
# Check the columns
print("\nColonnes Produits :", list(df_produits.columns))
print("\nColonnes FE Physiques :", list(df_fe_physique.columns))
print("\nColonnes Ratios Monetaires :", list(df_ratio_monetaire.columns))


Colonnes Produits : ['PRODUIT.ID', 'DB.LIB', 'COMPTE.LIB', 'QTE.2022', 'QTE.2023', 'QTE.2024', 'QTE.Total', 'description_clean', 'description_normalized', 'keywords']

Colonnes FE Physiques : ['FE.ADEME.ID', 'FE.VAL', 'FE.UNITE', 'FE.Incertitude', 'FE.LIB1', 'FE.LIB2', 'FE.LIB3', 'FE.CAT1', 'FE.CAT2', 'description_raw', 'description_normalized', 'keywords']

Colonnes Ratios Monetaires : ['FE.RM.ID', 'FE.VAL', 'FE.UNITE', 'FE.Incertitude', 'FE.LIB2', 'FE.CAT1', 'description_raw', 'description_normalized', 'keywords']


In [59]:
# Replace NaN with empty strings in the text columns
for df in (df_produits, df_fe_physique, df_ratio_monetaire):
    for col in ('keywords', 'description_normalized'):
        df[col] = df[col].fillna('')

---
## 2. Matching Model Definitions

In [60]:
def jaccard_similarity(text1, text2):
    """
    Calcul de la similarite de Jaccard entre deux textes.
    Jaccard = |A inter B| / |A union B|
    """
    if not text1 or not text2:
        return 0.0
    
    set1 = set(text1.split())
    set2 = set(text2.split())
    
    if not set1 or not set2:
        return 0.0
    
    intersection = len(set1.intersection(set2))
    union = len(set1.union(set2))
    
    return intersection / union if union > 0 else 0.0
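
As a quick illustration of the function above, here is a minimal, self-contained sketch on two made-up keyword strings (they are hypothetical, not taken from the dataset). It re-implements the same set-based formula under the name `jaccard` so that it runs on its own.

```python
def jaccard(text1: str, text2: str) -> float:
    """Jaccard similarity on whitespace-separated tokens."""
    set1, set2 = set(text1.split()), set(text2.split())
    if not set1 or not set2:
        return 0.0
    return len(set1 & set2) / len(set1 | set2)


# Hypothetical keyword strings, for illustration only
product = "valise transport modele"
reference = "transport eau"

# Shared tokens: {"transport"} (1); union: 4 distinct tokens
score = jaccard(product, reference)
print(score)  # 0.25
```

Because both texts are reduced to sets, repeated tokens and word order have no effect on the score.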

In [61]:
class MatchingModel:
    """
    Classe generique pour les modeles de matching.
    Permet d'appliquer le meme pipeline aux deux cas (physique et monetaire).
    """
    
    def __init__(self, name):
        self.name = name
        self.is_fitted = False
    
    def fit(self, reference_texts):
        """Entrainement sur les textes de reference (FE ADEME)."""
        raise NotImplementedError
    
    def predict(self, query_texts, top_k=3):
        """Prediction des meilleurs matches pour chaque requete."""
        raise NotImplementedError
    
    def get_scores(self, query_text):
        """Retourne les scores de similarite pour une requete."""
        raise NotImplementedError

In [62]:
class JaccardModel(MatchingModel):
    """
    Modele 1 : Matching base sur la similarite de Jaccard.
    Approche lexicale simple (baseline).
    """
    
    def __init__(self):
        super().__init__("Jaccard")
        self.reference_texts = None
    
    def fit(self, reference_texts):
        self.reference_texts = list(reference_texts)
        self.is_fitted = True
        return self
    
    def get_scores(self, query_text):
        if not self.is_fitted:
            raise ValueError("Modele non entraine")
        scores = [jaccard_similarity(query_text, ref) for ref in self.reference_texts]
        return np.array(scores)
    
    def predict(self, query_texts, top_k=3):
        results = []
        for query in query_texts:
            scores = self.get_scores(query)
            top_indices = np.argsort(scores)[::-1][:top_k]
            top_scores = scores[top_indices]
            results.append({
                'indices': top_indices,
                'scores': top_scores
            })
        return results

In [63]:
class TfidfModel(MatchingModel):
    """
    Modele 2 : TF-IDF + Cosine Similarity.
    Prend en compte l'importance relative des termes.
    """
    
    def __init__(self):
        super().__init__("TF-IDF")
        self.vectorizer = TfidfVectorizer(
            analyzer='word',
            ngram_range=(1, 2),  # Unigrams and bigrams
            min_df=1,
            max_df=0.95,
            sublinear_tf=True
        )
        self.reference_vectors = None
    
    def fit(self, reference_texts):
        reference_texts = list(reference_texts)
        # Fit on the reference texts
        self.reference_vectors = self.vectorizer.fit_transform(reference_texts)
        self.is_fitted = True
        return self
    
    def get_scores(self, query_text):
        if not self.is_fitted:
            raise ValueError("Modele non entraine")
        query_vector = self.vectorizer.transform([query_text])
        scores = cosine_similarity(query_vector, self.reference_vectors).flatten()
        return scores
    
    def predict(self, query_texts, top_k=3):
        results = []
        query_vectors = self.vectorizer.transform(query_texts)
        all_scores = cosine_similarity(query_vectors, self.reference_vectors)
        
        for i, scores in enumerate(all_scores):
            top_indices = np.argsort(scores)[::-1][:top_k]
            top_scores = scores[top_indices]
            results.append({
                'indices': top_indices,
                'scores': top_scores
            })
        return results
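
To make the term weighting more concrete, here is a minimal pure-Python sketch of what a TF-IDF + cosine pipeline computes. It is an approximation under stated assumptions, not the scikit-learn implementation: unigrams only, sublinear term frequency `1 + ln(tf)`, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation, with `max_df` and bigrams ignored. The function names and the reference strings are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vectorize, reference_vectors) for unigram TF-IDF with L2 norm."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}

    def vectorize(text):
        # Terms never seen in the references are dropped
        counts = Counter(t for t in text.split() if t in idf)
        weights = {t: (1 + math.log(c)) * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        return {t: w / norm for t, w in weights.items()} if norm else {}

    return vectorize, [vectorize(doc) for doc in docs]


def cosine(u, v):
    """Dot product of two L2-normalised sparse vectors stored as dicts."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())


# Hypothetical reference labels, for illustration only
references = ["transport eau", "transport aerien", "produits chimiques"]
vectorize, ref_vectors = tfidf_vectors(references)

query = vectorize("transport eau douce")  # "douce" is out of vocabulary
scores = [cosine(query, ref) for ref in ref_vectors]
best = max(range(len(scores)), key=scores.__getitem__)
print(references[best])  # transport eau
```

Two points are worth noting. The rarer term ("eau") receives a higher weight than the more common one ("transport"), which is why the first reference outranks the second. And because out-of-vocabulary query terms are dropped, a score close to 1.0 does not guarantee that the product text and the reference text are identical.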

In [64]:
class SBERTModel(MatchingModel):
    """
    Modele 3 : Sentence Embeddings avec SBERT.
    Capture la semantique des phrases.
    Modele utilise : paraphrase-multilingual-MiniLM-L12-v2 (leger, multilingue)
    """
    
    def __init__(self, model_name='paraphrase-multilingual-MiniLM-L12-v2'):
        super().__init__("SBERT")
        self.model_name = model_name
        self.model = None
        self.reference_embeddings = None
    
    def fit(self, reference_texts):
        if not SBERT_AVAILABLE:
            raise ImportError("sentence-transformers non disponible")
        
        print(f"Chargement du modele SBERT : {self.model_name}")
        self.model = SentenceTransformer(self.model_name)
        
        print("Calcul des embeddings de reference...")
        reference_texts = list(reference_texts)
        self.reference_embeddings = self.model.encode(
            reference_texts, 
            show_progress_bar=True,
            convert_to_numpy=True
        )
        self.is_fitted = True
        print(f"Embeddings calcules : shape = {self.reference_embeddings.shape}")
        return self
    
    def get_scores(self, query_text):
        if not self.is_fitted:
            raise ValueError("Modele non entraine")
        query_embedding = self.model.encode([query_text], convert_to_numpy=True)
        scores = cosine_similarity(query_embedding, self.reference_embeddings).flatten()
        return scores
    
    def predict(self, query_texts, top_k=3):
        if not self.is_fitted:
            raise ValueError("Modele non entraine")
        
        print("Calcul des embeddings des requetes...")
        query_embeddings = self.model.encode(
            list(query_texts), 
            show_progress_bar=True,
            convert_to_numpy=True
        )
        
        all_scores = cosine_similarity(query_embeddings, self.reference_embeddings)
        
        results = []
        for scores in all_scores:
            top_indices = np.argsort(scores)[::-1][:top_k]
            top_scores = scores[top_indices]
            results.append({
                'indices': top_indices,
                'scores': top_scores
            })
        return results
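
The ranking step shared by the `predict` methods (cosine similarity against every reference, then keep the `top_k` best) can be sketched without any dependency. The 3-dimensional vectors below are hypothetical stand-ins for real embeddings, and `heapq.nlargest` is used here as one possible alternative to a full sort; it is an assumption of this sketch, not what the notebook does.

```python
import heapq
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_k(scores, k=3):
    """Return (index, score) pairs for the k highest scores, best first."""
    best = heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)
    return [(i, scores[i]) for i in best]


# Hypothetical embeddings, for illustration only
reference_embeddings = [
    [1.0, 0.0, 0.0],
    [0.6, 0.8, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.6, 0.8],
]
query_embedding = [1.0, 1.0, 0.0]

scores = [cosine(query_embedding, ref) for ref in reference_embeddings]
ranked = top_k(scores, k=2)
print([i for i, _ in ranked])  # [1, 0]
```

Selecting only the `k` largest scores avoids sorting the full score vector, which matters when each query is compared with thousands of references.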

In [65]:
class HybridModel(MatchingModel):
    """
    Modele 4 : Approche Hybride combinant TF-IDF et SBERT.
    Score = alpha * TF-IDF + (1-alpha) * SBERT
    """
    
    def __init__(self, alpha=0.5, sbert_model_name='paraphrase-multilingual-MiniLM-L12-v2'):
        super().__init__("Hybride")
        self.alpha = alpha
        self.tfidf_model = TfidfModel()
        self.sbert_model = SBERTModel(sbert_model_name)
    
    def fit(self, reference_texts):
        reference_texts = list(reference_texts)
        print("Entrainement TF-IDF...")
        self.tfidf_model.fit(reference_texts)
        print("Entrainement SBERT...")
        self.sbert_model.fit(reference_texts)
        self.is_fitted = True
        return self
    
    def get_scores(self, query_text):
        tfidf_scores = self.tfidf_model.get_scores(query_text)
        sbert_scores = self.sbert_model.get_scores(query_text)
        combined_scores = self.alpha * tfidf_scores + (1 - self.alpha) * sbert_scores
        return combined_scores
    
    def predict(self, query_texts, top_k=3):
        if not self.is_fitted:
            raise ValueError("Modele non entraine")
        query_texts = list(query_texts)
        
        # Vectorize and embed all queries once, in batch
        query_vectors = self.tfidf_model.vectorizer.transform(query_texts)
        query_embeddings = self.sbert_model.model.encode(
            query_texts,
            show_progress_bar=True,
            convert_to_numpy=True
        )
        
        # Score one query at a time to keep memory usage bounded
        results = []
        for i in range(len(query_texts)):
            tfidf_scores = cosine_similarity(
                query_vectors[i], self.tfidf_model.reference_vectors
            ).flatten()
            sbert_scores = cosine_similarity(
                query_embeddings[i:i + 1], self.sbert_model.reference_embeddings
            ).flatten()
            
            # Weighted combination of the two score vectors
            combined = self.alpha * tfidf_scores + (1 - self.alpha) * sbert_scores
            
            top_indices = np.argsort(combined)[::-1][:top_k]
            results.append({
                'indices': top_indices,
                'scores': combined[top_indices],
                'tfidf_scores': tfidf_scores[top_indices],
                'sbert_scores': sbert_scores[top_indices]
            })
        
        return results
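
The weighted combination used by the hybrid model can be checked by hand on a toy example. The sketch below uses `alpha=0.4`, the value passed to `HybridModel` later in the notebook; the score lists and the helper name `combine_scores` are hypothetical.

```python
def combine_scores(tfidf_scores, sbert_scores, alpha=0.4):
    """Weighted combination: alpha * TF-IDF + (1 - alpha) * SBERT."""
    if len(tfidf_scores) != len(sbert_scores):
        raise ValueError("Both score lists must cover the same references")
    return [alpha * t + (1 - alpha) * s for t, s in zip(tfidf_scores, sbert_scores)]


# Hypothetical scores of one product against three references
tfidf_scores = [0.0, 0.5, 0.0]
sbert_scores = [0.75, 0.5, 0.25]

combined = combine_scores(tfidf_scores, sbert_scores, alpha=0.4)
best = max(range(len(combined)), key=combined.__getitem__)
print(best)  # 1
```

In this example SBERT alone would pick reference 0, but the lexical evidence from TF-IDF moves reference 1 to the top. A larger `alpha` gives more weight to exact term overlap, a smaller one to semantic similarity.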

---
## 3. Generic Matching Function

In [66]:
def run_matching(df_products, df_references, model, 
                 product_text_col='keywords', 
                 reference_text_col='keywords',
                 reference_id_col='FE.ADEME.ID',
                 top_k=3):
    """
    Execute le matching entre produits et references.
    
    Parameters:
        df_products: DataFrame des produits
        df_references: DataFrame des references (FE ou RM)
        model: Instance du modele de matching
        product_text_col: Colonne texte des produits
        reference_text_col: Colonne texte des references
        reference_id_col: Colonne ID des references
        top_k: Nombre de meilleurs matches a retourner
    
    Returns:
        DataFrame avec les resultats du matching
    """
    print(f"\n{'='*60}")
    print(f"Matching avec le modele : {model.name}")
    print(f"{'='*60}")
    
    # Prepare the texts
    product_texts = df_products[product_text_col].fillna('').tolist()
    reference_texts = df_references[reference_text_col].fillna('').tolist()
    
    # Fit
    start_time = time.time()
    print(f"Entrainement sur {len(reference_texts)} references...")
    model.fit(reference_texts)
    fit_time = time.time() - start_time
    print(f"Temps d'entrainement : {fit_time:.2f}s")
    
    # Predict
    start_time = time.time()
    print(f"Prediction pour {len(product_texts)} produits...")
    predictions = model.predict(product_texts, top_k=top_k)
    predict_time = time.time() - start_time
    print(f"Temps de prediction : {predict_time:.2f}s")
    
    # Build the results DataFrame
    results = []
    for i, pred in enumerate(predictions):
        best_idx = pred['indices'][0]
        best_score = pred['scores'][0]
        
        result = {
            'PRODUIT.ID': df_products.iloc[i]['PRODUIT.ID'],
            'PRODUIT.LIB': df_products.iloc[i]['DB.LIB'],
            'PRODUIT.KEYWORDS': product_texts[i],
            f'{reference_id_col}_MATCH': df_references.iloc[best_idx][reference_id_col],
            'MATCH_DESCRIPTION': df_references.iloc[best_idx].get('description_raw', df_references.iloc[best_idx].get('FE.LIB2', '')),
            'MATCH_SCORE': best_score,
            'MATCH_RANK': 1
        }
        
        # Add the top-k matches
        for k in range(min(top_k, len(pred['indices']))):
            idx = pred['indices'][k]
            result[f'TOP{k+1}_ID'] = df_references.iloc[idx][reference_id_col]
            result[f'TOP{k+1}_SCORE'] = pred['scores'][k]
        
        results.append(result)
    
    df_results = pd.DataFrame(results)
    
    # Summary statistics
    scores = df_results['MATCH_SCORE']
    print(f"\nStatistiques du matching :")
    print(f"  - Score moyen : {scores.mean():.4f}")
    print(f"  - Score median : {scores.median():.4f}")
    print(f"  - Score min : {scores.min():.4f}")
    print(f"  - Score max : {scores.max():.4f}")
    for threshold in (0.8, 0.7, 0.6, 0.5, 0.3):
        above = scores > threshold
        print(f"  - Matches avec score > {threshold} : {above.sum()} ({above.mean()*100:.1f}%)")
    
    return df_results, {'fit_time': fit_time, 'predict_time': predict_time}

---
## 4. Case 1: Matching Products to Physical Emission Factors

In [67]:
print("="*80)
print("CAS 1 : MATCHING PRODUITS vs FACTEURS D'EMISSION PHYSIQUES")
print("="*80)
print(f"\nProduits : {len(df_produits)}")
print(f"FE Physiques : {len(df_fe_physique)}")

CAS 1 : MATCHING PRODUITS vs FACTEURS D'EMISSION PHYSIQUES

Produits : 47187
FE Physiques : 14502


### 4.1 Model 1: Jaccard

In [None]:
model_jaccard = JaccardModel()
results_jaccard_phys, times_jaccard_phys = run_matching(
    df_produits, df_fe_physique, model_jaccard,
    reference_id_col='FE.ADEME.ID'
)

In [None]:
print("\nExemples de matching Jaccard (CAS 1) :")
display(results_jaccard_phys[['PRODUIT.LIB', 'MATCH_DESCRIPTION', 'MATCH_SCORE']].head(10))

### 4.2 Model 2: TF-IDF

In [68]:
model_tfidf = TfidfModel()
results_tfidf_phys, times_tfidf_phys = run_matching(
    df_produits, df_fe_physique, model_tfidf,
    reference_id_col='FE.ADEME.ID'
)


Matching avec le modele : TF-IDF
Entrainement sur 14502 references...
Temps d'entrainement : 0.24s
Prediction pour 47187 produits...
Temps de prediction : 26.68s

Statistiques du matching :
  - Score moyen : 0.2392
  - Score median : 0.2733
  - Score min : 0.0000
  - Score max : 1.0000
  - Matches avec score > 0.8 : 633 (1.3%)
  - Matches avec score > 0.7 : 965 (2.0%)
  - Matches avec score > 0.6 : 1796 (3.8%)
  - Matches avec score > 0.5 : 4235 (9.0%)
  - Matches avec score > 0.3 : 20255 (42.9%)


In [69]:
print("\nExemples de matching TF-IDF (CAS 1) :")
display(results_tfidf_phys[['PRODUIT.LIB', 'MATCH_DESCRIPTION', 'MATCH_SCORE']].head(10))


Exemples de matching TF-IDF (CAS 1) :


Unnamed: 0,PRODUIT.LIB,MATCH_DESCRIPTION,MATCH_SCORE
0,RAI DEPISTAGE (B50) REF RAID2,"Métaux,zinc Zinc",0.0
1,SUPPLEMENT DIMANCHE (B20) REF 9004,"Métaux,zinc Zinc",0.0
2,RAI DEPISTAGE TIA,"Métaux,zinc Zinc",0.0
3,PHENOTYPE RHESUS KELL (B40) REF RHK,"Métaux,zinc Zinc",0.0
4,GROUPE SANGUIN ABO RH(D) (B35) REF ABOD,"équipement électrique,groupe électrogène Groupe électrogène",0.462241
5,DEVIS N°Q2387840 - PHANTOM-TOS,"Métaux,zinc Zinc",0.0
6,"MANDRIN DE JACOB DIAM. 6,35 MM DRILL","sport,escrime Escrime, un mètre de fil conducteur diam 1,1mm",0.265689
7,Modèle de palpation mammaire + valise de transport,"Laine,Agribalyse,Ovin,Agriculture Laine biologique (modèle type), après fève...",0.217521
8,"PASSE BROCHES 0,7 A 1.8 MM","Métaux,zinc Zinc",0.0
9,POWER SUPPLY 24V,"Métaux,zinc Zinc",0.0
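
In the table above, several products have a `MATCH_SCORE` of 0.0 and all point to the same reference. Since these cosine scores are non-negative, a best score of 0.0 means every reference scored 0.0, so the reported match is only the tie-break of the sort and carries no information. One way to make this explicit is to return no match below a minimum score. The sketch below is a hypothetical helper (`best_match`), and the `min_score=0.3` default is an assumption chosen for illustration, not a value validated on this dataset.

```python
from typing import Optional, Sequence, Tuple


def best_match(scores: Sequence[float], min_score: float = 0.3) -> Tuple[Optional[int], float]:
    """Return (index, score) of the best reference, or (None, score) if below min_score."""
    if not scores:
        return None, 0.0
    best_idx = max(range(len(scores)), key=scores.__getitem__)
    best_score = scores[best_idx]
    if best_score < min_score:
        # No reference is similar enough: report "unmatched" rather than an arbitrary index
        return None, best_score
    return best_idx, best_score


print(best_match([0.0, 0.0, 0.0]))   # (None, 0.0)
print(best_match([0.1, 0.46, 0.2]))  # (1, 0.46)
```

Products flagged this way could then be routed to manual review or to a fallback such as the monetary ratios of Case 2.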


### 4.3 Model 3: SBERT

In [70]:
if SBERT_AVAILABLE:
    model_sbert = SBERTModel()
    results_sbert_phys, times_sbert_phys = run_matching(
        df_produits, df_fe_physique, model_sbert,
        product_text_col='description_normalized',
        reference_text_col='description_normalized',
        reference_id_col='FE.ADEME.ID'
    )
else:
    print("SBERT non disponible - modele ignore")
    results_sbert_phys = None
    times_sbert_phys = None


Matching avec le modele : SBERT
Entrainement sur 14502 references...
Chargement du modele SBERT : paraphrase-multilingual-MiniLM-L12-v2
Calcul des embeddings de reference...


Batches: 100%|██████████| 454/454 [01:44<00:00,  4.36it/s]


Embeddings calcules : shape = (14502, 384)
Temps d'entrainement : 107.16s
Prediction pour 47187 produits...
Calcul des embeddings des requetes...


Batches: 100%|██████████| 1475/1475 [05:05<00:00,  4.83it/s]


Temps de prediction : 326.86s

Statistiques du matching :
  - Score moyen : 0.6039
  - Score median : 0.5990
  - Score min : 0.2860
  - Score max : 1.0000
  - Matches avec score > 0.8 : 1224 (2.6%)
  - Matches avec score > 0.7 : 6962 (14.8%)
  - Matches avec score > 0.6 : 23406 (49.6%)
  - Matches avec score > 0.5 : 41282 (87.5%)
  - Matches avec score > 0.3 : 47186 (100.0%)


In [71]:
if results_sbert_phys is not None:
    print("\nExemples de matching SBERT (CAS 1) :")
    display(results_sbert_phys[['PRODUIT.LIB', 'MATCH_DESCRIPTION', 'MATCH_SCORE']].head(10))


Exemples de matching SBERT (CAS 1) :


Unnamed: 0,PRODUIT.LIB,MATCH_DESCRIPTION,MATCH_SCORE
0,RAI DEPISTAGE (B50) REF RAID2,HFC-32,0.60787
1,SUPPLEMENT DIMANCHE (B20) REF 9004,PFC-31-10,0.570792
2,RAI DEPISTAGE TIA,Raie crue,0.746705
3,PHENOTYPE RHESUS KELL (B40) REF RHK,R401a,0.679245
4,GROUPE SANGUIN ABO RH(D) (B35) REF ABOD,R422a,0.666308
5,DEVIS N°Q2387840 - PHANTOM-TOS,Chou de Bruxelles cuit,0.605891
6,"MANDRIN DE JACOB DIAM. 6,35 MM DRILL","sport,escrime Escrime, un mètre de fil conducteur diam 1,1mm",0.511386
7,Modèle de palpation mammaire + valise de transport,"transport de personne ferré,RER RER",0.591252
8,"PASSE BROCHES 0,7 A 1.8 MM",1-bromopropane,0.545142
9,POWER SUPPLY 24V,Électricité fournisseur : Elsam,0.773117


### 4.4 Model 4: Hybrid

In [None]:
if SBERT_AVAILABLE:
    model_hybrid = HybridModel(alpha=0.4)  # 40% TF-IDF, 60% SBERT
    results_hybrid_phys, times_hybrid_phys = run_matching(
        df_produits, df_fe_physique, model_hybrid,
        product_text_col='description_normalized',
        reference_text_col='description_normalized',
        reference_id_col='FE.ADEME.ID'
    )
else:
    print("Modele Hybride non disponible (necessite SBERT)")
    results_hybrid_phys = None
    times_hybrid_phys = None

In [None]:
if results_hybrid_phys is not None:
    print("\nExemples de matching Hybride (CAS 1) :")
    display(results_hybrid_phys[['PRODUIT.LIB', 'MATCH_DESCRIPTION', 'MATCH_SCORE']].head(10))

---
## 5. Case 2: Matching Products to Monetary Ratios

In [72]:
print("="*80)
print("CAS 2 : MATCHING PRODUITS vs RATIOS MONETAIRES")
print("="*80)
print(f"\nProduits : {len(df_produits)}")
print(f"Ratios Monetaires : {len(df_ratio_monetaire)}")

CAS 2 : MATCHING PRODUITS vs RATIOS MONETAIRES

Produits : 47187
Ratios Monetaires : 56


### 5.1 Model 1: Jaccard

In [None]:
model_jaccard_rm = JaccardModel()
results_jaccard_rm, times_jaccard_rm = run_matching(
    df_produits, df_ratio_monetaire, model_jaccard_rm,
    reference_id_col='FE.RM.ID'
)

In [29]:
print("\nExemples de matching Jaccard (CAS 2) :")
display(results_jaccard_rm[['PRODUIT.LIB', 'MATCH_DESCRIPTION', 'MATCH_SCORE']].head(10))


Exemples de matching Jaccard (CAS 2) :


Unnamed: 0,PRODUIT.LIB,MATCH_DESCRIPTION,MATCH_SCORE
0,RAI DEPISTAGE (B50) REF RAID2,Travaux d'impression et de reproduction,0.0
1,SUPPLEMENT DIMANCHE (B20) REF 9004,Travaux d'impression et de reproduction,0.0
2,RAI DEPISTAGE TIA,Travaux d'impression et de reproduction,0.0
3,PHENOTYPE RHESUS KELL (B40) REF RHK,Travaux d'impression et de reproduction,0.0
4,GROUPE SANGUIN ABO RH(D) (B35) REF ABOD,Travaux d'impression et de reproduction,0.0
5,DEVIS N°Q2387840 - PHANTOM-TOS,Travaux d'impression et de reproduction,0.0
6,"MANDRIN DE JACOB DIAM. 6,35 MM DRILL",Travaux d'impression et de reproduction,0.0
7,Modèle de palpation mammaire + valise de transport,Transport par eau,0.166667
8,"PASSE BROCHES 0,7 A 1.8 MM",Travaux d'impression et de reproduction,0.0
9,POWER SUPPLY 24V,Travaux d'impression et de reproduction,0.0


### 5.2 Model 2: TF-IDF

In [74]:
model_tfidf_rm = TfidfModel()
results_tfidf_rm, times_tfidf_rm = run_matching(
    df_produits, df_ratio_monetaire, model_tfidf_rm,
    reference_id_col='FE.RM.ID'
)


Matching avec le modele : TF-IDF
Entrainement sur 56 references...
Temps d'entrainement : 0.00s
Prediction pour 47187 produits...
Temps de prediction : 0.63s

Statistiques du matching :
  - Score moyen : 0.0491
  - Score median : 0.0000
  - Score min : 0.0000
  - Score max : 1.0000
  - Matches avec score > 0.8 : 21 (0.0%)
  - Matches avec score > 0.7 : 252 (0.5%)
  - Matches avec score > 0.6 : 289 (0.6%)
  - Matches avec score > 0.5 : 1562 (3.3%)
  - Matches avec score > 0.3 : 3938 (8.3%)


In [75]:
print("\nExemples de matching TF-IDF (CAS 2) :")
display(results_tfidf_rm[['PRODUIT.LIB', 'MATCH_DESCRIPTION', 'MATCH_SCORE']].head(10))


Exemples de matching TF-IDF (CAS 2) :


Unnamed: 0,PRODUIT.LIB,MATCH_DESCRIPTION,MATCH_SCORE
0,RAI DEPISTAGE (B50) REF RAID2,Travaux d'impression et de reproduction,0.0
1,SUPPLEMENT DIMANCHE (B20) REF 9004,Travaux d'impression et de reproduction,0.0
2,RAI DEPISTAGE TIA,Travaux d'impression et de reproduction,0.0
3,PHENOTYPE RHESUS KELL (B40) REF RHK,Travaux d'impression et de reproduction,0.0
4,GROUPE SANGUIN ABO RH(D) (B35) REF ABOD,Travaux d'impression et de reproduction,0.0
5,DEVIS N°Q2387840 - PHANTOM-TOS,Travaux d'impression et de reproduction,0.0
6,"MANDRIN DE JACOB DIAM. 6,35 MM DRILL",Travaux d'impression et de reproduction,0.0
7,Modèle de palpation mammaire + valise de transport,Transport par eau,0.539768
8,"PASSE BROCHES 0,7 A 1.8 MM",Travaux d'impression et de reproduction,0.0
9,POWER SUPPLY 24V,Travaux d'impression et de reproduction,0.0


### 5.3 Model 3: SBERT

In [76]:
if SBERT_AVAILABLE:
    model_sbert_rm = SBERTModel()
    results_sbert_rm, times_sbert_rm = run_matching(
        df_produits, df_ratio_monetaire, model_sbert_rm,
        product_text_col='description_normalized',
        reference_text_col='description_normalized',
        reference_id_col='FE.RM.ID'
    )
else:
    results_sbert_rm = None
    times_sbert_rm = None


Matching avec le modele : SBERT
Entrainement sur 56 references...
Chargement du modele SBERT : paraphrase-multilingual-MiniLM-L12-v2
Calcul des embeddings de reference...


Batches: 100%|██████████| 2/2 [00:00<00:00,  5.69it/s]


Embeddings calcules : shape = (56, 384)
Temps d'entrainement : 3.70s
Prediction pour 47187 produits...
Calcul des embeddings des requetes...


Batches: 100%|██████████| 1475/1475 [04:51<00:00,  5.05it/s]


Temps de prediction : 292.95s

Statistiques du matching :
  - Score moyen : 0.3921
  - Score median : 0.3877
  - Score min : 0.0933
  - Score max : 0.9332
  - Matches avec score > 0.8 : 17 (0.0%)
  - Matches avec score > 0.7 : 221 (0.5%)
  - Matches avec score > 0.6 : 1082 (2.3%)
  - Matches avec score > 0.5 : 5756 (12.2%)
  - Matches avec score > 0.3 : 39821 (84.4%)


In [77]:
if results_sbert_rm is not None:
    print("\nExemples de matching SBERT (CAS 2) :")
    display(results_sbert_rm[['PRODUIT.LIB', 'MATCH_DESCRIPTION', 'MATCH_SCORE']].head(10))


Exemples de matching SBERT (CAS 2) :


Unnamed: 0,PRODUIT.LIB,MATCH_DESCRIPTION,MATCH_SCORE
0,RAI DEPISTAGE (B50) REF RAID2,Services de réparation et installation de machines et d'équipements,0.325406
1,SUPPLEMENT DIMANCHE (B20) REF 9004,Edition,0.377166
2,RAI DEPISTAGE TIA,Edition,0.43359
3,PHENOTYPE RHESUS KELL (B40) REF RHK,Produits chimiques,0.231094
4,GROUPE SANGUIN ABO RH(D) (B35) REF ABOD,Produits chimiques,0.225973
5,DEVIS N°Q2387840 - PHANTOM-TOS,Transports aériens,0.381995
6,"MANDRIN DE JACOB DIAM. 6,35 MM DRILL",Produits métallurgiques,0.433475
7,Modèle de palpation mammaire + valise de transport,Autres matériels de transport,0.587784
8,"PASSE BROCHES 0,7 A 1.8 MM",Transports aériens,0.277956
9,POWER SUPPLY 24V,Services de télécommunications,0.437628


### 5.4 Model 4: Hybrid

In [None]:
if SBERT_AVAILABLE:
    model_hybrid_rm = HybridModel(alpha=0.4)
    results_hybrid_rm, times_hybrid_rm = run_matching(
        df_produits, df_ratio_monetaire, model_hybrid_rm,
        product_text_col='description_normalized',
        reference_text_col='description_normalized',
        reference_id_col='FE.RM.ID'
    )
else:
    results_hybrid_rm = None
    times_hybrid_rm = None

In [None]:
if results_hybrid_rm is not None:
    print("\nExemples de matching Hybride (CAS 2) :")
    display(results_hybrid_rm[['PRODUIT.LIB', 'MATCH_DESCRIPTION', 'MATCH_SCORE']].head(10))

---
## 6. Model Comparison

In [79]:
def compare_models(results_dict, case_name):
    """
    Compare les performances des differents modeles.
    """
    print(f"\n{'='*60}")
    print(f"COMPARAISON DES MODELES - {case_name}")
    print(f"{'='*60}")
    
    comparison = []
    for model_name, results in results_dict.items():
        if results is not None:
            comparison.append({
                'Modele': model_name,
                'Score Moyen': results['MATCH_SCORE'].mean(),
                'Score Median': results['MATCH_SCORE'].median(),
                'Score Min': results['MATCH_SCORE'].min(),
                'Score Max': results['MATCH_SCORE'].max(),
                'Matches > 0.5': f"{(results['MATCH_SCORE'] > 0.5).sum()} ({(results['MATCH_SCORE'] > 0.5).mean()*100:.1f}%)",
                'Matches > 0.3': f"{(results['MATCH_SCORE'] > 0.3).sum()} ({(results['MATCH_SCORE'] > 0.3).mean()*100:.1f}%)",
            })
    
    df_comparison = pd.DataFrame(comparison)
    display(df_comparison)
    return df_comparison

In [80]:
# Comparison, Case 1
results_phys = {
    #'Jaccard': results_jaccard_phys,
    'TF-IDF': results_tfidf_phys,
    'SBERT': results_sbert_phys,
    #'Hybride': results_hybrid_phys
}
comparison_phys = compare_models(results_phys, "CAS 1 - FE Physiques")


COMPARAISON DES MODELES - CAS 1 - FE Physiques


Unnamed: 0,Modele,Score Moyen,Score Median,Score Min,Score Max,Matches > 0.5,Matches > 0.3
0,TF-IDF,0.23924,0.273303,0.0,1.0,4235 (9.0%),20255 (42.9%)
1,SBERT,0.603854,0.599028,0.285993,1.0,41282 (87.5%),47186 (100.0%)


In [81]:
# Comparison, Case 2
results_rm = {
    #'Jaccard': results_jaccard_rm,
    'TF-IDF': results_tfidf_rm,
    'SBERT': results_sbert_rm,
    #'Hybride': results_hybrid_rm
}
comparison_rm = compare_models(results_rm, "CAS 2 - Ratios Monetaires")


COMPARAISON DES MODELES - CAS 2 - Ratios Monetaires


Unnamed: 0,Modele,Score Moyen,Score Median,Score Min,Score Max,Matches > 0.5,Matches > 0.3
0,TF-IDF,0.049116,0.0,0.0,1.0,1562 (3.3%),3938 (8.3%)
1,SBERT,0.392123,0.387674,0.093345,0.933203,5756 (12.2%),39821 (84.4%)


---
## 7. Uncertainty Calculation

In [82]:
def calculate_uncertainty(df_results, df_references, reference_id_col, match_id_col):
    """
    Calcul de l'incertitude totale.
    
    Incertitude totale = sqrt(U_FE^2 + U_Match^2)
    
    Ou:
    - U_FE : Incertitude du facteur d'emission (fournie par ADEME)
    - U_Match : Incertitude du matching = 1 - score_similarite
    """
    df = df_results.copy()
    
    # Jointure pour recuperer l'incertitude ADEME
    ref_uncertainty = df_references[[reference_id_col, 'FE.Incertitude', 'FE.VAL', 'FE.UNITE']].copy()
    ref_uncertainty = ref_uncertainty.rename(columns={reference_id_col: match_id_col})
    
    df = df.merge(ref_uncertainty, on=match_id_col, how='left')
    
    # Incertitude du matching (basee sur le score de similarite)
    df['U_MATCH'] = 1 - df['MATCH_SCORE']
    
    # Incertitude du facteur d'emission
    #df['U_FE'] = 1- df['FE.Incertitude']
    df['U_FE'] = 0
    
    # Incertitude totale (propagation quadratique)
    df['U_TOTAL'] = np.sqrt(df['U_FE']**2 + df['U_MATCH']**2)
    
    # Classification de la qualite du match
    def classify_match(row):
        if row['MATCH_SCORE'] >= 0.7:
            return 'Excellent'
        elif row['MATCH_SCORE'] >= 0.5:
            return 'Bon'
        elif row['MATCH_SCORE'] >= 0.3:
            return 'Moyen'
        else:
            return 'Faible'
    
    df['MATCH_QUALITY'] = df.apply(classify_match, axis=1)
    
    return df
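
The quadrature formula in the docstring can be worked through on a single match. This is a minimal sketch with hypothetical values; in particular, it assumes `u_fe` is expressed as a fraction on the same 0 to 1 scale as the matching uncertainty, which the notebook does not establish for `FE.Incertitude`.

```python
import math


def total_uncertainty(u_fe: float, match_score: float) -> float:
    """Combine emission-factor and matching uncertainty in quadrature."""
    u_match = 1.0 - match_score
    return math.sqrt(u_fe ** 2 + u_match ** 2)


# With U_FE = 0 (the notebook's current placeholder), U_TOTAL reduces to U_MATCH
only_match = total_uncertainty(0.0, 0.6)

# Hypothetical U_FE of 0.3: sqrt(0.3^2 + 0.4^2), approximately 0.5
with_fe = total_uncertainty(0.3, 0.6)

print(round(only_match, 3), round(with_fe, 3))
```

The quadrature sum is dominated by the larger of the two terms, so a weak match (low similarity score) outweighs a small emission-factor uncertainty.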

In [83]:
# Select the best model for each case
# (adjust according to the comparison results)
# Caveat: the mean similarity score is only a proxy; scores produced by
# different models are not directly comparable.

# CASE 1: select the best model
mean_scores_phys = comparison_phys.set_index('Modele')['Score Moyen']
if results_sbert_phys is not None and mean_scores_phys['SBERT'] > mean_scores_phys['TF-IDF']:
    best_model_phys = 'SBERT'
    results_best_phys = results_sbert_phys
else:
    best_model_phys = 'TF-IDF'
    results_best_phys = results_tfidf_phys

print(f"Meilleur modele selectionne pour CAS 1 : {best_model_phys}")

Meilleur modele selectionne pour CAS 1 : SBERT


In [84]:
# CASE 2: select the best model (TF-IDF unless SBERT has the higher mean score)
best_model_rm = 'TF-IDF'
results_best_rm = results_tfidf_rm
if results_sbert_rm is not None:
    sbert_mean = comparison_rm[comparison_rm['Modele'] == 'SBERT']['Score Moyen'].values
    tfidf_mean = comparison_rm[comparison_rm['Modele'] == 'TF-IDF']['Score Moyen'].values
    if len(sbert_mean) > 0 and len(tfidf_mean) > 0 and sbert_mean[0] > tfidf_mean[0]:
        best_model_rm = 'SBERT'
        results_best_rm = results_sbert_rm

print(f"Meilleur modele selectionne pour CAS 2 : {best_model_rm}")

Meilleur modele selectionne pour CAS 2 : SBERT
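
Both selection cells apply the same rule, so it could be factored into one helper. The sketch below is an assumption-laden illustration, not part of the notebook: `pick_best_model`, the demo comparison table and its scores are hypothetical. Only the column names `Modele` and `Score Moyen` come from the cells above.

```python
import pandas as pd

def pick_best_model(comparison: pd.DataFrame, candidates: dict, fallback: str):
    """Return (name, results) for the available candidate with the highest mean score."""
    # Ignore models whose results are missing (e.g. SBERT not installed)
    usable = {name: res for name, res in candidates.items() if res is not None}
    scores = (
        comparison[comparison["Modele"].isin(list(usable))]
        .set_index("Modele")["Score Moyen"]
    )
    if scores.empty:
        # No comparable score: keep the fallback model
        return fallback, candidates[fallback]
    best = scores.idxmax()
    return best, usable[best]

# Hypothetical comparison table and results (placeholders, not real outputs)
comparison_demo = pd.DataFrame({
    "Modele": ["Jaccard", "TF-IDF", "SBERT"],
    "Score Moyen": [0.21, 0.35, 0.60],
})
name_all, _ = pick_best_model(
    comparison_demo, {"TF-IDF": "tfidf results", "SBERT": "sbert results"}, fallback="TF-IDF"
)
name_no_sbert, _ = pick_best_model(
    comparison_demo, {"TF-IDF": "tfidf results", "SBERT": None}, fallback="TF-IDF"
)
print(name_all, name_no_sbert)
```

Returning the name together with the results keeps the printed label and the selected DataFrame in sync, which is the property the selection cells need.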


In [85]:
# Compute the uncertainties
final_results_phys = calculate_uncertainty(
    results_best_phys, df_fe_physique, 
    'FE.ADEME.ID', 'FE.ADEME.ID_MATCH'
)

print("\nUncertainty statistics - CASE 1:")
print(f"  - Mean U_Match : {final_results_phys['U_MATCH'].mean():.3f}")
print(f"  - Mean U_FE : {final_results_phys['U_FE'].mean():.3f}")
print(f"  - Mean U_Total : {final_results_phys['U_TOTAL'].mean():.3f}")
print("\nMatch quality distribution:")
print(final_results_phys['MATCH_QUALITY'].value_counts())


Uncertainty statistics - CASE 1:
  - Mean U_Match : 0.396
  - Mean U_FE : 0.000
  - Mean U_Total : 0.396

Match quality distribution:
MATCH_QUALITY
Bon          34320
Excellent     6962
Moyen         5904
Faible           1
Name: count, dtype: int64


In [86]:
final_results_rm = calculate_uncertainty(
    results_best_rm, df_ratio_monetaire, 
    'FE.RM.ID', 'FE.RM.ID_MATCH'
)

print("\nUncertainty statistics - CASE 2:")
print(f"  - Mean U_Match : {final_results_rm['U_MATCH'].mean():.3f}")
print(f"  - Mean U_FE : {final_results_rm['U_FE'].mean():.3f}")
print(f"  - Mean U_Total : {final_results_rm['U_TOTAL'].mean():.3f}")
print("\nMatch quality distribution:")
print(final_results_rm['MATCH_QUALITY'].value_counts())


Uncertainty statistics - CASE 2:
  - Mean U_Match : 0.608
  - Mean U_FE : 0.000
  - Mean U_Total : 0.608

Match quality distribution:
MATCH_QUALITY
Moyen        34065
Faible        7366
Bon           5535
Excellent      221
Name: count, dtype: int64
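
The quality labels come from fixed score thresholds (0.3, 0.5, 0.7). As a hedged alternative to the row-wise `apply`, the same binning can be sketched with `pd.cut`. The demo scores are hypothetical. Note one behavioural difference: a missing score stays missing here, whereas the `if`/`else` chain would label it `Faible`.

```python
import numpy as np
import pandas as pd

# Hypothetical scores chosen to sit on and around the thresholds
scores = pd.Series([0.75, 0.7, 0.55, 0.5, 0.3, 0.29])

quality = pd.cut(
    scores,
    bins=[-np.inf, 0.3, 0.5, 0.7, np.inf],
    labels=["Faible", "Moyen", "Bon", "Excellent"],
    right=False,  # left-closed bins: [0.3, 0.5), [0.5, 0.7), [0.7, inf)
)
print(quality.tolist())
```

Left-closed bins reproduce the `>=` comparisons of `classify_match`, so a score of exactly 0.7 is `Excellent` and a score of exactly 0.3 is `Moyen`.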


---
## 8. Exporting the Final Results

In [87]:
# Prepare the final files

# CASE 1: physical emission-factor matching
cols_export_phys = [
    'PRODUIT.ID', 'PRODUIT.LIB',
    'FE.ADEME.ID_MATCH', 'MATCH_DESCRIPTION',
    'FE.VAL', 'FE.UNITE',
    'MATCH_SCORE', 'MATCH_QUALITY',
    'U_MATCH', 'U_FE', 'U_TOTAL',
    'TOP1_ID', 'TOP1_SCORE', 'TOP2_ID', 'TOP2_SCORE', 'TOP3_ID', 'TOP3_SCORE'
]
cols_export_phys = [c for c in cols_export_phys if c in final_results_phys.columns]
df_export_phys = final_results_phys[cols_export_phys].copy()

# CASE 2: monetary-ratio matching
cols_export_rm = [
    'PRODUIT.ID', 'PRODUIT.LIB',
    'FE.RM.ID_MATCH', 'MATCH_DESCRIPTION',
    'FE.VAL', 'FE.UNITE',
    'MATCH_SCORE', 'MATCH_QUALITY',
    'U_MATCH', 'U_FE', 'U_TOTAL',
    'TOP1_ID', 'TOP1_SCORE', 'TOP2_ID', 'TOP2_SCORE', 'TOP3_ID', 'TOP3_SCORE'
]
cols_export_rm = [c for c in cols_export_rm if c in final_results_rm.columns]
df_export_rm = final_results_rm[cols_export_rm].copy()
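
The list comprehensions above silently drop any requested column that is absent from the results. A minimal sketch of a variant that also reports what was dropped is given below. The helper `select_existing_columns` and the demo frame are hypothetical and only illustrate the idea.

```python
import pandas as pd

def select_existing_columns(df: pd.DataFrame, wanted: list):
    """Return the sub-frame of requested columns that exist, plus the missing names."""
    present = [c for c in wanted if c in df.columns]
    missing = [c for c in wanted if c not in df.columns]
    return df[present].copy(), missing

# Hypothetical frame lacking the TOP1_ID column
demo = pd.DataFrame({"PRODUIT.ID": [1], "MATCH_SCORE": [0.6]})
subset, missing = select_existing_columns(demo, ["PRODUIT.ID", "MATCH_SCORE", "TOP1_ID"])
print("Missing columns:", missing)
```

Printing the missing names makes it visible when an upstream model did not produce the `TOP*` columns, rather than discovering it in the exported CSV.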

In [88]:
print("\nPreview of the final results - CASE 1 (physical EFs):")
display(df_export_phys.head(10))


Preview of the final results - CASE 1 (physical EFs):


Unnamed: 0,PRODUIT.ID,PRODUIT.LIB,FE.ADEME.ID_MATCH,MATCH_DESCRIPTION,FE.VAL,FE.UNITE,MATCH_SCORE,MATCH_QUALITY,U_MATCH,U_FE,U_TOTAL,TOP1_ID,TOP1_SCORE,TOP2_ID,TOP2_SCORE,TOP3_ID,TOP3_SCORE
0,601540,RAI DEPISTAGE (B50) REF RAID2,43152,HFC-32,771.0,kgCO2e/kg,0.60787,Bon,0.39213,0,0.39213,43152,0.60787,43167,0.603897,43638,0.599626
1,601615,SUPPLEMENT DIMANCHE (B20) REF 9004,43238,PFC-31-10,10000.0,kgCO2e/kg,0.570792,Bon,0.429208,0,0.429208,43238,0.570792,43154,0.553622,43156,0.533646
2,246363,RAI DEPISTAGE TIA,46180,Raie crue,12.0,kgCO2e/kg de poids net,0.746705,Excellent,0.253295,0,0.253295,46180,0.746705,45924,0.725298,46694,0.69263
3,601541,PHENOTYPE RHESUS KELL (B40) REF RHK,43640,R401a,1263.0,kgCO2e/kg,0.679245,Bon,0.320755,0,0.320755,43640,0.679245,43634,0.643985,43628,0.639469
4,601583,GROUPE SANGUIN ABO RH(D) (B35) REF ABOD,43634,R422a,3355.0,kgCO2e/kg,0.666308,Bon,0.333692,0,0.333692,43634,0.666308,43636,0.66167,43089,0.649498
5,61651,DEVIS N°Q2387840 - PHANTOM-TOS,44794,Chou de Bruxelles cuit,1.37,kgCO2e/kg de poids net,0.605891,Bon,0.394109,0,0.394109,44794,0.605891,45088,0.603211,46410,0.59855
6,61651,"MANDRIN DE JACOB DIAM. 6,35 MM DRILL",20786,"sport,escrime Escrime, un mètre de fil conducteur diam 1,1mm",0.02,kgCO2e/appareil,0.511386,Bon,0.488614,0,0.488614,20786,0.511386,47456,0.500286,20904,0.497256
7,61651,Modèle de palpation mammaire + valise de transport,21715,"transport de personne ferré,RER RER",0.0057,kgCO2e/passager.km,0.591252,Bon,0.408748,0,0.408748,21715,0.591252,21055,0.520743,21072,0.520743
8,61651,"PASSE BROCHES 0,7 A 1.8 MM",43601,1-bromopropane,0.052,kgCO2e/kg,0.545142,Bon,0.454858,0,0.454858,43601,0.545142,43040,0.545142,20786,0.507874
9,61651,POWER SUPPLY 24V,15515,Électricité fournisseur : Elsam,0.436,kgCO2e/kWh,0.773117,Excellent,0.226883,0,0.226883,15515,0.773117,15537,0.76968,15540,0.751961


In [89]:
print("\nPreview of the final results - CASE 2 (monetary ratios):")
display(df_export_rm.head(10))


Preview of the final results - CASE 2 (monetary ratios):


Unnamed: 0,PRODUIT.ID,PRODUIT.LIB,FE.RM.ID_MATCH,MATCH_DESCRIPTION,FE.VAL,FE.UNITE,MATCH_SCORE,MATCH_QUALITY,U_MATCH,U_FE,U_TOTAL,TOP1_ID,TOP1_SCORE,TOP2_ID,TOP2_SCORE,TOP3_ID,TOP3_SCORE
0,601540,RAI DEPISTAGE (B50) REF RAID2,43450,Services de réparation et installation de machines et d'équipements,196,kgCO2e/keuro (2023) HT,0.325406,Moyen,0.674594,0,0.674594,43450,0.325406,43420,0.324773,43430,0.303234
1,601615,SUPPLEMENT DIMANCHE (B20) REF 9004,43345,Edition,96,kgCO2e/keuro (2023) HT,0.377166,Moyen,0.622834,0,0.622834,43345,0.377166,43390,0.294658,43440,0.293937
2,246363,RAI DEPISTAGE TIA,43345,Edition,96,kgCO2e/keuro (2023) HT,0.43359,Moyen,0.56641,0,0.56641,43345,0.43359,43460,0.387954,43360,0.375196
3,601541,PHENOTYPE RHESUS KELL (B40) REF RHK,43380,Produits chimiques,603,kgCO2e/keuro (2023) HT,0.231094,Faible,0.768906,0,0.768906,43380,0.231094,43495,0.208188,43315,0.182984
4,601583,GROUPE SANGUIN ABO RH(D) (B35) REF ABOD,43380,Produits chimiques,603,kgCO2e/keuro (2023) HT,0.225973,Faible,0.774027,0,0.774027,43380,0.225973,43430,0.16112,43435,0.154131
5,61651,DEVIS N°Q2387840 - PHANTOM-TOS,43565,Transports aériens,914,kgCO2e/keuro (2023) HT,0.381995,Moyen,0.618005,0,0.618005,43565,0.381995,43360,0.36586,43325,0.305232
6,61651,"MANDRIN DE JACOB DIAM. 6,35 MM DRILL",43430,Produits métallurgiques,990,kgCO2e/keuro (2023) HT,0.433475,Moyen,0.566525,0,0.566525,43430,0.433475,43425,0.385284,43325,0.349249
7,61651,Modèle de palpation mammaire + valise de transport,43300,Autres matériels de transport,239,kgCO2e/keuro (2023) HT,0.587784,Bon,0.412216,0,0.412216,43300,0.587784,43570,0.498544,43320,0.487494
8,61651,"PASSE BROCHES 0,7 A 1.8 MM",43565,Transports aériens,914,kgCO2e/keuro (2023) HT,0.277956,Faible,0.722044,0,0.722044,43565,0.277956,43325,0.212387,43560,0.206309
9,61651,POWER SUPPLY 24V,43510,Services de télécommunications,136,kgCO2e/keuro (2023) HT,0.437628,Moyen,0.562372,0,0.562372,43510,0.437628,43355,0.354122,43420,0.296488


In [90]:
# Export CSV
df_export_phys.to_csv(DATA_OUTPUT_PATH / 'matching_fe_physique_final.csv', index=False, encoding='utf-8')
df_export_rm.to_csv(DATA_OUTPUT_PATH / 'matching_ratio_monetaire_final.csv', index=False, encoding='utf-8')

print("\nExported files:")
print(f"  - {DATA_OUTPUT_PATH / 'matching_fe_physique_final.csv'}")
print(f"  - {DATA_OUTPUT_PATH / 'matching_ratio_monetaire_final.csv'}")


Exported files:
  - data/output/matching_fe_physique_final.csv
  - data/output/matching_ratio_monetaire_final.csv


In [91]:
# Export the model comparisons
comparison_phys.to_csv(DATA_OUTPUT_PATH / 'comparaison_modeles_fe_physique.csv', index=False)
comparison_rm.to_csv(DATA_OUTPUT_PATH / 'comparaison_modeles_ratio_monetaire.csv', index=False)

print(f"  - {DATA_OUTPUT_PATH / 'comparaison_modeles_fe_physique.csv'}")
print(f"  - {DATA_OUTPUT_PATH / 'comparaison_modeles_ratio_monetaire.csv'}")

  - data/output/comparaison_modeles_fe_physique.csv
  - data/output/comparaison_modeles_ratio_monetaire.csv


---
## 9. Summary and Conclusions

In [92]:
print("="*80)
print("MODELLING SUMMARY")
print("="*80)

print(f"""
MODELS TESTED:
--------------
1. Jaccard Similarity (lexical baseline)
2. TF-IDF + Cosine Similarity
3. SBERT (Sentence Embeddings) {'- AVAILABLE' if SBERT_AVAILABLE else '- NOT AVAILABLE'}
4. Hybrid (TF-IDF + SBERT) {'- AVAILABLE' if SBERT_AVAILABLE else '- NOT AVAILABLE'}

CASE 1 - PHYSICAL EMISSION FACTORS:
-----------------------------------
- Best model : {best_model_phys}
- Mean score : {final_results_phys['MATCH_SCORE'].mean():.4f}
- Mean total uncertainty : {final_results_phys['U_TOTAL'].mean():.4f}

CASE 2 - MONETARY RATIOS:
-------------------------
- Best model : {best_model_rm}
- Mean score : {final_results_rm['MATCH_SCORE'].mean():.4f}
- Mean total uncertainty : {final_results_rm['U_TOTAL'].mean():.4f}

GENERATED FILES:
----------------
- matching_fe_physique_final.csv : product -> physical EF associations
- matching_ratio_monetaire_final.csv : product -> monetary ratio associations
- comparaison_modeles_fe_physique.csv : model comparison (CASE 1)
- comparaison_modeles_ratio_monetaire.csv : model comparison (CASE 2)

UNCERTAINTY INTERPRETATION:
---------------------------
- U_FE : intrinsic uncertainty of the emission factor (ADEME data)
- U_Match : uncertainty due to matching (1 - similarity score)
- U_Total : total propagated uncertainty (square root of the sum of squares)

MATCH QUALITY:
--------------
- Excellent : score >= 0.7
- Bon (good) : 0.5 <= score < 0.7
- Moyen (average) : 0.3 <= score < 0.5
- Faible (weak) : score < 0.3 (manual verification recommended)
""")

MODELLING SUMMARY

MODELS TESTED:
--------------
1. Jaccard Similarity (lexical baseline)
2. TF-IDF + Cosine Similarity
3. SBERT (Sentence Embeddings) - AVAILABLE
4. Hybrid (TF-IDF + SBERT) - AVAILABLE

CASE 1 - PHYSICAL EMISSION FACTORS:
-----------------------------------
- Best model : SBERT
- Mean score : 0.6039
- Mean total uncertainty : 0.3961

CASE 2 - MONETARY RATIOS:
-------------------------
- Best model : SBERT
- Mean score : 0.3921
- Mean total uncertainty : 0.6079

GENERATED FILES:
----------------
- matching_fe_physique_final.csv : product -> physical EF associations
- matching_ratio_monetaire_final.csv : product -> monetary ratio associations
- comparaison_modeles_fe_physique.csv : model comparison (CASE 1)
- comparaison_modeles_ratio_monetaire.csv : model comparison (CASE 2)

INTERPRETATION DE L'INCERTITUDE :
---------------------------------
- U_FE : Incertitude intrinseque du facteur d'emission (donn