## 0. Environment Configuration & Dependency Installation
**Objective:** Prepare the runtime environment by installing necessary external tools and linking storage.

**Context:**
This project depends on components that a fresh Google Colab runtime does not provide out of the box:
1.  **spaCy model (`fr_core_news_sm`):** The French language model used to tokenize text, remove stop words, and lemmatize during the indexing phase.
2.  **Percollate:** A Node.js command-line tool used to convert HTML web pages into clean, readable Markdown files.
3.  **Google Drive:** Mounted so that our data files (`meta_data.json`, `index_inverse.json`) and the document corpus persist between sessions.
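
Before running the later cells, it can help to confirm that both tools are actually visible to the runtime. The following is only a sketch of such a check: `check_environment` is a hypothetical helper that is not part of the project, and it only tests for presence, not for a working installation.

```python
import importlib.util
import shutil


def check_environment():
    """Report whether the external dependencies are visible to this runtime."""
    return {
        # The spaCy model is installed as a regular Python package.
        "fr_core_news_sm": importlib.util.find_spec("fr_core_news_sm") is not None,
        # percollate is a CLI tool, so it has to be on the PATH.
        "percollate": shutil.which("percollate") is not None,
    }


status = check_environment()
for name, available in status.items():
    print(f"{name}: {'found' if available else 'MISSING'}")
```

If either entry is reported as missing, re-running the installation cell (and restarting the runtime, as the spaCy output suggests) is the first thing to try.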

In [11]:
# --- 1. SETUP ENVIRONMENT ---
# Install the necessary NLP model for French
!pip install -q spacy
!python -m spacy download fr_core_news_sm

# Install 'percollate', the Node.js CLI that the retrieval step (Section 2)
# uses to convert web pages to Markdown
!npm install -g percollate

# Mount Google Drive to access your project files
from google.colab import drive
import os

drive.mount('/content/drive')

# --- CONFIGURATION ---
# UPDATE THIS PATH to your actual project folder in Drive
PROJECT_PATH = "/content/drive/MyDrive/SRI-Medical-Search"

# Navigate to project folder
os.chdir(PROJECT_PATH)
print(f"Working Directory set to: {os.getcwd()}")

Collecting fr-core-news-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/fr_core_news_sm-3.8.0/fr_core_news_sm-3.8.0-py3-none-any.whl (16.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 16.3/16.3 MB 44.9 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('fr_core_news_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.

## 1. Preprocessing: ID Assignment
**Objective:** Assign a unique ID to each medication in our `meta_data.json` source file.

**Context:**
We start with a manually collected list of Name/URL pairs. This step adds a unique `"id": X` field to each entry, which acts as the **primary key** for our search engine.

**Why this is necessary:**
1.  **Linkage:** Establishes a robust link between the metadata (JSON) and the downloaded document files (e.g., `1.md`, `2.md`).
2.  **Consistency:** Avoids encoding issues associated with using complex file names (e.g., special characters in medication names).
3.  **Frontend Integration:** Facilitates efficient metadata retrieval for the User Interface. The frontend can instantly fetch details (like the official name or source URL) using this ID without parsing the full document text.
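
As a rough illustration of the linkage and lookup points above, the sketch below assigns IDs to two made-up entries and indexes them by ID. The `"name"` key and both values are assumptions for the example; only the `"id"` and `"url"` keys appear in the project code.

```python
# Hypothetical entries: the "name" key and both values are illustrative;
# only the "id" and "url" keys appear in the project code.
entries = [
    {"name": "Example A", "url": "https://example.org/a"},
    {"name": "Example B", "url": "https://example.org/b"},
]

# Assign sequential IDs starting at 1, with "id" as the first key.
with_ids = [{"id": i, **entry} for i, entry in enumerate(entries, start=1)]

# Index the list by ID so a lookup does not need to scan every entry.
by_id = {entry["id"]: entry for entry in with_ids}

print(by_id[2]["url"])          # https://example.org/b
print(f"{by_id[2]['id']}.md")   # 2.md
```

The same ID therefore resolves both the metadata record and the name of the document file, without depending on the medication name.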

In [12]:
import json
import os

# --- CONFIGURATION ---
JSON_FILE = "meta_data.json"

def assign_unique_ids(file_path):
    """
    Reads the metadata JSON file, assigns a unique sequential ID to each entry
    if it does not already exist, and overwrites the file in place.
    """
    if not os.path.exists(file_path):
        print(f"[ERROR] File not found: {file_path}")
        return

    try:
        # Load existing data
        with open(file_path, 'r', encoding='utf-8') as f:
            data = json.load(f)

        print(f"[INFO] Loaded {len(data)} entries.")

        updated = False

        # Iterate and assign IDs
        for index, entry in enumerate(data, start=1):
            if 'id' not in entry:
                # Assign an integer ID based on the entry's position in the list.
                # Rebuilding the dict (copy, clear, update) puts 'id' first in
                # the written JSON, since dicts preserve insertion order.
                original_data = entry.copy()
                entry.clear()
                entry['id'] = index
                entry.update(original_data)
                updated = True

        # Write back to file only if changes were made
        if updated:
            with open(file_path, 'w', encoding='utf-8') as f:
                json.dump(data, f, indent=4, ensure_ascii=False)
            print(f"[SUCCESS] IDs added and file updated: {file_path}")
        else:
            print("[INFO] All entries already have IDs. No changes needed.")

    except json.JSONDecodeError:
        print(f"[ERROR] Failed to decode JSON from {file_path}.")

# Execute the function
assign_unique_ids(JSON_FILE)

[INFO] Loaded 50 entries.
[SUCCESS] IDs added and file updated: meta_data.json


## 2. Data Acquisition: Automated Retrieval
**Objective:** Automatically fetch the medication leaflets (notices) and convert them to a standardized Markdown format.

**Context:**
This script iterates through the normalized `meta_data.json` file. For each entry, it extracts the unique `id` and source `url`. It then triggers the `percollate` tool to scrape the web page content and save it as a structured Markdown file (e.g., `med_md/1.md`).

**Why this is necessary:**
* **Standardization:** Converting HTML to Markdown removes web clutter (navbars, ads), leaving only the relevant text for indexing.
* **Data Integrity:** Using the `id` for filenames ensures a strict 1-to-1 mapping between our metadata and our document corpus.
* **Efficiency:** Automating this process allows us to scale the corpus easily without manual copy-pasting.
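
Because the 1-to-1 mapping depends on every download succeeding, a quick consistency check after retrieval can be useful. The sketch below is one possible way to do it; `find_missing_documents` is a hypothetical helper, and the demo runs on a temporary folder with made-up entries rather than on the real corpus.

```python
import os
import tempfile


def find_missing_documents(metadata, docs_folder):
    """Return the IDs whose '<id>.md' file is absent from docs_folder."""
    return [
        entry["id"]
        for entry in metadata
        if not os.path.exists(os.path.join(docs_folder, f"{entry['id']}.md"))
    ]


# Demo on a throwaway folder with made-up entries (only 1.md exists).
with tempfile.TemporaryDirectory() as folder:
    with open(os.path.join(folder, "1.md"), "w", encoding="utf-8") as f:
        f.write("# placeholder")
    demo_metadata = [{"id": 1}, {"id": 2}]
    missing = find_missing_documents(demo_metadata, folder)

print(missing)  # [2]
```

On the real data, the call would presumably take the list loaded from `meta_data.json` and the `med_md` folder; any IDs it returns can then be retried individually.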

In [13]:
import os
import subprocess
import time
import json

# --- CONFIGURATION ---
JSON_SOURCE_FILE = "meta_data.json"
OUTPUT_DIR = "med_md"  # Folder for markdown files

# Ensure output directory exists
os.makedirs(OUTPUT_DIR, exist_ok=True)

def generate_markdown(doc_id, url):
    """
    Generates a Markdown file from a URL using the 'percollate' tool.
    The filename is based on the document ID.
    """
    filename = f"{doc_id}.md"
    output_path = os.path.join(OUTPUT_DIR, filename)

    # Command to run percollate (outputting as markdown)
    # Note: 'percollate' must be installed in the environment
    cmd = ["percollate", "md", "--output", output_path, url]

    try:
        print(f"[PROCESSING] ID: {doc_id} -> {filename}...")

        # Execute the command
        result = subprocess.run(cmd, capture_output=True, text=True)

        if result.returncode == 0:
            print(f"    [SUCCESS] Saved to {output_path}")
        else:
            print(f"    [ERROR] Percollate failed: {result.stderr.strip()}")

        # Short pause to be polite to the server
        time.sleep(1)

    except FileNotFoundError:
        print("    [CRITICAL] 'percollate' tool not found. Please run '!npm install -g percollate' first.")

def run_retrieval_process(json_file):
    """
    Main function to iterate through the JSON and fetch data.
    """
    if not os.path.exists(json_file):
        print(f"[ERROR] JSON file not found: {json_file}")
        return

    with open(json_file, 'r', encoding='utf-8') as f:
        medicaments_list = json.load(f)

    print(f"[INFO] Starting retrieval for {len(medicaments_list)} items.\n")

    for item in medicaments_list:
        # Extract ID and URL directly from the JSON object
        doc_id = item.get("id")
        url = item.get("url")

        # Only proceed if both ID and URL are valid
        if doc_id and url:
            generate_markdown(doc_id, url)
        else:
            print(f"[WARNING] Skipped entry (Missing ID or URL): {item}")

    print(f"\n[DONE] Retrieval complete. Check the '{OUTPUT_DIR}' folder.")

# Execute retrieval
run_retrieval_process(JSON_SOURCE_FILE)

[INFO] Starting retrieval for 50 items.

[PROCESSING] ID: 1 -> 1.md...
    [SUCCESS] Saved to med_md/1.md
[PROCESSING] ID: 2 -> 2.md...
    [SUCCESS] Saved to med_md/2.md
[PROCESSING] ID: 3 -> 3.md...
    [SUCCESS] Saved to med_md/3.md
[PROCESSING] ID: 4 -> 4.md...
    [SUCCESS] Saved to med_md/4.md
[PROCESSING] ID: 5 -> 5.md...
    [SUCCESS] Saved to med_md/5.md
[PROCESSING] ID: 6 -> 6.md...
    [SUCCESS] Saved to med_md/6.md
[PROCESSING] ID: 7 -> 7.md...
    [SUCCESS] Saved to med_md/7.md
[PROCESSING] ID: 8 -> 8.md...
    [SUCCESS] Saved to med_md/8.md
[PROCESSING] ID: 9 -> 9.md...
    [SUCCESS] Saved to med_md/9.md
[PROCESSING] ID: 10 -> 10.md...
    [SUCCESS] Saved to med_md/10.md
[PROCESSING] ID: 11 -> 11.md...
    [SUCCESS] Saved to med_md/11.md
[PROCESSING] ID: 12 -> 12.md...
    [SUCCESS] Saved to med_md/12.md
[PROCESSING] ID: 13 -> 13.md...
    [SUCCESS] Saved to med_md/13.md
[PROCESSING] ID: 14 -> 14.md...
    [SUCCESS] Saved to med_md/14.md
[PROCESSING] ID: 15 -> 15.md...
  

## 3. Core Engine: Parsing & Inverted Index Construction
**Objective:** Parse the raw Markdown documents to extract structured data and build the Inverted Index for the search engine.

**Context:**
This phase transforms our collection of unstructured text files into a searchable data structure. We employ a **hybrid extraction strategy**:
1.  **Regex Extraction:** We use regular expressions to extract structural fields where precision is critical (e.g., *Name*, *Active Substance*).
2.  **NLP Pipeline:** We use spaCy to process the descriptive text (*Indications*, *Posology*). This involves tokenization, stop-word removal, and lemmatization to improve search recall.
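
To make the regex step concrete, here is a small sketch that applies the name and active-substance patterns from the indexing cell to a short excerpt. The excerpt is invented for the example (including the substance name "Exempline") and only imitates the layout of a cleaned leaflet.

```python
import re

# Made-up excerpt imitating the layout of a cleaned leaflet (not real data).
sample = (
    "Dénomination du médicament\n\n"
    "EXEMPLE 500 mg, comprimé\n\n"
    "La substance active est :\n\n"
    "Exempline.......500 mg\n"
    "Pour un comprimé.\n"
)

# Same patterns as in the indexing cell.
regex_nom = r'(?:1\.\s*)?Dénomination du médicament.*?\n+(.+)'
regex_molecule = (
    r'(?:La substance active est\s*:|Substance active\s*:)'
    r'\s*\n+(.+?)(?=\n|Pour un)'
)

nom = re.search(regex_nom, sample, re.IGNORECASE).group(1).strip()
raw_molecule = re.search(regex_molecule, sample, re.IGNORECASE).group(1).strip()
# Dotted leaders between the substance and its dose become a single space.
molecule = re.sub(r'\.{2,}', ' ', raw_molecule).strip()

print(nom)       # EXEMPLE 500 mg, comprimé
print(molecule)  # Exempline 500 mg
```

Real leaflets vary in layout, which is why the indexing cell also carries a fallback pattern for the name.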

**Output:**
The script generates `index_inverse.json`, a dictionary mapping every unique word to the list of documents containing it (including term frequency and positions). This is the fundamental data structure used by the search algorithm.
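
The layout described above can be sketched on a toy corpus as follows. The documents are already-tokenized lists of made-up tokens (in the notebook, tokens come from the spaCy pipeline), and the intermediate `postings` dictionary is an assumption of this sketch rather than the notebook's own approach.

```python
from collections import defaultdict

# Already-tokenized toy documents (made-up tokens); in the notebook the
# tokens come from the spaCy pipeline instead.
docs = {
    1: ["infection", "urinaire", "infection"],
    2: ["douleur", "infection"],
}

# token -> {doc_id -> posting}; the inner dict avoids scanning a list to
# find the posting of the current document.
postings = defaultdict(dict)
for doc_id, tokens in docs.items():
    for pos, token in enumerate(tokens):
        posting = postings[token].setdefault(
            doc_id, {"doc_id": doc_id, "tf": 0, "positions": []}
        )
        posting["tf"] += 1
        posting["positions"].append(pos)

# Flatten to the list-of-postings layout described above.
inverted_index = {token: list(by_doc.values()) for token, by_doc in postings.items()}

print(inverted_index["infection"])
# [{'doc_id': 1, 'tf': 2, 'positions': [0, 2]}, {'doc_id': 2, 'tf': 1, 'positions': [1]}]
```

Each posting records the document, how often the term occurs in it (`tf`), and where, which is the information a ranking or phrase-matching step would need.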

In [16]:
import os
import json
import re
import spacy

# --- CONFIGURATION ---
DOCS_FOLDER = "med_md"
INDEX_JSON_PATH = "index_inverse.json"
METADATA_JSON_PATH = "meta_data.json"

# --- LOAD NLP MODEL ---
try:
    nlp = spacy.load("fr_core_news_sm")
    # Define custom stop words specific to medical notices
    CUSTOM_STOP_WORDS = {"mg", "ml", "comprimé", "sachet", "gélule", "boîte", "notice", "médicament", "voir", "rubrique", "comprimer", "fois", "cas", "base", "substance", "active"}
    STOP_WORDS = nlp.Defaults.stop_words
    ALL_STOP_WORDS = STOP_WORDS.union(CUSTOM_STOP_WORDS)
    print("[INFO] SpaCy model loaded successfully.")
except Exception as e:
    print(f"[ERROR] SpaCy model loading failed: {e}")
    # Re-raise: every later step needs `nlp` and ALL_STOP_WORDS to be defined
    raise

def clean_markdown_syntax(text):
    """
    Cleans up Markdown syntax to simplify regex extraction.
    Removes links and bold markers.
    """
    # Remove empty anchors first: [](#top) -> ""
    text = re.sub(r'\[\]\(#.*?\)', '', text)
    # Replace links with their label: [Text](link) -> Text
    text = re.sub(r'\[(.*?)\]\(.*?\)', r'\1', text)
    # Remove bold markers: **Text** or __Text__ -> Text
    text = re.sub(r'[*_]{2,}(.*?)[*_]{2,}', r'\1', text)
    # Normalize newlines
    text = re.sub(r'\n{3,}', '\n\n', text)
    return text

def parse_file_for_indexing(content):
    """
    Parses a Markdown file and extracts key sections.
    Returns: Name, Molecule, Tech_Block, Desc_Block
    """
    # 1. Clean Markdown
    content = clean_markdown_syntax(content)

    # 2. Extract Name (Title)
    # Handles "1. DENOMINATION..." or simple titles
    regex_nom = r'(?:1\.\s*)?Dénomination du médicament.*?\n+(.+)'
    match_nom = re.search(regex_nom, content, re.IGNORECASE)

    if match_nom:
        nom = match_nom.group(1).strip()
    else:
        # Fallback: Look for uppercase title with dosage
        regex_fallback = r'\n([A-Z\s\-\(\)]+\d+[.,]?\d*\s?(?:mg|g|ml|%)[^\n]+)'
        match_nom = re.search(regex_fallback, content)
        nom = match_nom.group(1).strip() if match_nom else "Inconnu"

    nom = re.sub(r'[\[\]]', '', nom).strip()

    # 3. Extract Molecule
    molecule = ""
    regex_molecule = r'(?:La substance active est\s*:|Substance active\s*:)\s*\n+(.+?)(?=\n|Pour un)'
    match_molecule = re.search(regex_molecule, content, re.IGNORECASE)

    if match_molecule:
        raw_molecule = match_molecule.group(1).strip()
        molecule = re.sub(r'\.{2,}', ' ', raw_molecule).strip()

    # 4. Extract Description Sections
    regex_indic = r'(?:DANS QUELS CAS EST-IL UTILISE|Indications thérapeutiques)(.+?)(?=\n2\.|QUELLES SONT)'
    match_indic = re.search(regex_indic, content, re.IGNORECASE | re.DOTALL)
    indic = match_indic.group(1).strip() if match_indic else ""

    regex_poso = r'(?:Posologie|Mode d\'administration)(.+?)(?=\n(?:4\.|QUELS SONT|Si vous avez pris))'
    match_poso = re.search(regex_poso, content, re.IGNORECASE | re.DOTALL)
    poso = match_poso.group(1).strip() if match_poso else ""

    regex_effets = r'(?:QUELS SONT LES EFFETS|Effets indésirables)(.+?)(?=\n(?:5\.|COMMENT CONSERVER|Déclaration))'
    match_effets = re.search(regex_effets, content, re.IGNORECASE | re.DOTALL)
    effets = match_effets.group(1).strip() if match_effets else ""

    # Prepare blocks
    bloc_technique = f"{nom} {molecule}"
    bloc_descriptif = f"{indic} {poso} {effets}"
    bloc_descriptif = re.sub(r'\s+', ' ', bloc_descriptif).strip()

    return nom, molecule, bloc_technique, bloc_descriptif


def tokenize_and_process(text, use_lemmatization=True):
    """
    Tokenizes text, removes stop words, and optionally lemmatizes.
    """
    doc = nlp(text)
    tokens = []
    for token in doc:
        # Lemma if requested, else exact text
        term = token.lemma_.lower() if use_lemmatization else token.text.lower()
        if token.is_alpha and term not in ALL_STOP_WORDS:
            tokens.append(term)
    return tokens

def build_index():
    print(f"[START] Starting indexing process...")

    if not os.path.exists(METADATA_JSON_PATH):
        print(f"[ERROR] Metadata file missing: {METADATA_JSON_PATH}")
        return

    with open(METADATA_JSON_PATH, 'r', encoding='utf-8') as f:
        metadata_list = json.load(f)

    inverted_index = {}
    count_indexed = 0
    total_docs = len(metadata_list)

    for i, entry in enumerate(metadata_list, start=1):
        doc_id = entry.get('id')
        file_name = f"{doc_id}.md"
        file_path = os.path.join(DOCS_FOLDER, file_name)

        print(f"[{i}/{total_docs}] Processing ID: {doc_id} ...", end=" ")

        if not os.path.exists(file_path):
            print(f"[MISSING] ID {doc_id} (File {file_name} not found)")
            continue

        try:
            with open(file_path, 'r', encoding='utf-8') as f:
                content = f.read()

            nom, molecule, tech_block, desc_block = parse_file_for_indexing(content)

            # Update Metadata with extracted info
            entry['nom'] = nom
            entry['snippet'] = (desc_block[:200] + "...") if desc_block else ""

            # 1. Technical terms (name, active substance): keep exact word forms,
            #    since skipping lemmatization is usually better for proper nouns
            tokens_tech = tokenize_and_process(tech_block, use_lemmatization=False)
            # 2. Descriptive terms: lemmatize
            tokens_desc = tokenize_and_process(desc_block, use_lemmatization=True)

            all_tokens = tokens_tech + tokens_desc

            # Build Inverted Index
            for pos, token in enumerate(all_tokens):
                if token not in inverted_index:
                    inverted_index[token] = []

                # Check if doc exists in list
                doc_entry = next((item for item in inverted_index[token] if item['doc_id'] == doc_id), None)

                if doc_entry is None:
                    inverted_index[token].append({'doc_id': doc_id, 'tf': 1, 'positions': [pos]})
                else:
                    doc_entry['tf'] += 1
                    doc_entry['positions'].append(pos)

            count_indexed += 1
            print(f"-> [OK] Indexed '{nom}' ({len(all_tokens)} tokens)")

        except Exception as e:
            print(f"[ERROR] Failed to index ID {doc_id}: {e}")

    # Save Results
    print(f"\n[SAVING] Writing index to disk...")
    with open(INDEX_JSON_PATH, 'w', encoding='utf-8') as f:
        json.dump(inverted_index, f, indent=2, ensure_ascii=False)

    with open(METADATA_JSON_PATH, 'w', encoding='utf-8') as f:
        json.dump(metadata_list, f, indent=2, ensure_ascii=False)

    print(f"\n[SUCCESS] Indexed {count_indexed} documents.")
    print(f"[INFO] Index saved to {INDEX_JSON_PATH}")

# Run Indexing
build_index()

[INFO] SpaCy model loaded successfully.
[START] Starting indexing process...
[1/50] Processing ID: 1 ... -> [OK] Indexed 'AMOXICILLINE ACIDE CLAVULANIQUE ALMUS 100 mg/12,5 mg par mL ENFANTS, poudre pour suspension buvable (rapport amoxicilline/acide clavulanique : 8/1)' (555 tokens)
[2/50] Processing ID: 2 ... -> [OK] Indexed 'AUGMENTIN 1 g/125 mg, poudre pour suspension buvable en sachet-dose (rapport amoxicilline/acide clavulanique : 8/1)' (555 tokens)
[3/50] Processing ID: 3 ... -> [OK] Indexed 'CLAMOXYL 1 g, comprimé dispersible' (1347 tokens)
[4/50] Processing ID: 4 ... -> [OK] Indexed 'ORELOX 100 mg, comprimé pelliculé' (300 tokens)
[5/50] Processing ID: 5 ... -> [OK] Indexed 'PYOSTACINE 250 mg, comprimé pelliculé' (909 tokens)
[6/50] Processing ID: 6 ... -> [OK] Indexed 'ZITHROMAX 250 mg, comprimé pelliculé' (1001 tokens)
[7/50] Processing ID: 7 ... -> [OK] Indexed 'MONURIL 3 g, granulés pour solution buvable en sachet' (600 tokens)
[8/50] Processing ID: 8 ... -> [OK] Indexed 'F