# Reviewer Assistant Agent (RAA)

**Overview:**

This Python script implements a "Reviewer Assistant Agent" (RAA) that automates parts of the peer-review process for scientific manuscripts. Using the LangChain framework and an OpenAI language model (e.g., gpt-4.1-mini), the agent analyzes a submitted manuscript (in PDF format), extracts specific quantitative findings, searches online for similar articles, extracts comparable data from those articles, and presents a structured comparison. A LangChain AgentExecutor orchestrates the workflow through a suite of custom tools. The script gives reviewers an automated starting point for understanding a manuscript's key contributions and comparing them against relevant online literature.
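As a rough illustration of the "structured comparison" step, the sketch below lines up the manuscript's metrics against those of each article found online. It uses plain dictionaries instead of pandas so that it runs without extra dependencies. The function name `build_comparison_rows` and the sample values are hypothetical; only the three metric keys are taken from the notebook.

```python
# Metric keys as used by the notebook's extraction helpers.
METRIC_KEYS = ("corrosion_inhibition_%", "impedance_ohm_cm2", "adhesion_MPa")


def build_comparison_rows(manuscript_metrics, article_results):
    """Return one row per source, manuscript first, all with the same metric columns."""
    labelled = [("Manuscript", manuscript_metrics)]
    labelled += [(result["url"], result) for result in article_results]
    return [
        {"source": label, **{key: metrics.get(key, "Not Found") for key in METRIC_KEYS}}
        for label, metrics in labelled
    ]


# Placeholder values for illustration only; they are not results from any paper.
manuscript = {"corrosion_inhibition_%": "95.5", "impedance_ohm_cm2": "1.2E6"}
articles = [{"url": "https://example.org/article-1", "adhesion_MPa": "15.3"}]

for row in build_comparison_rows(manuscript, articles):
    print(row)
```

Each row has the same keys, so the list could be passed straight to `pandas.DataFrame` for display, which is the role pandas plays in the notebook.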

**Goal:**

The primary goal is to assist scientific reviewers by:

- Providing a concise summary of a manuscript's core findings.
- Automatically extracting predefined quantitative metrics (Corrosion Inhibition Efficiency %, Impedance ohm.cm², Adhesion Strength MPa) from the manuscript.
- Identifying potentially relevant existing literature via a web search based on the manuscript's title.
- Attempting to extract the same quantitative metrics from the online articles found (PDF or HTML) for quick comparison against the submitted manuscript's results.
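The metric extraction relies on the LLM answering in a fixed "Label: value" layout that is then parsed with regular expressions. The following is a simplified sketch of that parsing idea, not the notebook's own parser: the helper name `parse_metric_lines` is hypothetical, the sample reply is made up for illustration, and edge cases such as unit prefixes are ignored.

```python
import re

# Output keys and labels follow the notebook's extraction prompt.
LABELS = {
    "corrosion_inhibition_%": r"Corrosion Inhibition \(%\)",
    "impedance_ohm_cm2": r"Impedance \(ohm\.cm2\)",
    "adhesion_MPa": r"Adhesion \(MPa\)",
}
# Integer, decimal, or scientific notation such as 1.2E6.
NUMBER = re.compile(r"[-+]?\d*\.?\d+(?:[eE][-+]?\d+)?")


def parse_metric_lines(llm_output: str) -> dict:
    """Map each 'Label: value' line to its first number, or 'Not Found'."""
    parsed = {}
    for key, label in LABELS.items():
        line = re.search(label + r":[ \t]*(.*)", llm_output, re.IGNORECASE)
        number = NUMBER.search(line.group(1)) if line else None
        parsed[key] = number.group(0) if number else "Not Found"
    return parsed


# Illustrative LLM reply, not real data.
sample = (
    "Corrosion Inhibition (%): 95.5%\n"
    "Impedance (ohm.cm2): 1.2E6\n"
    "Adhesion (MPa): Not Found\n"
)
print(parse_metric_lines(sample))
# {'corrosion_inhibition_%': '95.5', 'impedance_ohm_cm2': '1.2E6', 'adhesion_MPa': 'Not Found'}
```

Keeping the values as strings avoids silently converting text such as "Not Found" and leaves numeric interpretation to the reviewer.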

**Core Technologies:**

- Orchestration: LangChain AgentExecutor with create_openai_functions_agent.
- LLM: OpenAI ChatOpenAI (model configurable, e.g., gpt-4.1-mini).
- Custom Tools: LangChain BaseTool subclasses for specific tasks (loading, summarizing, extracting, searching, processing URLs).
- Tool Schemas: Pydantic V1 (BaseModel, Field) for defining tool input arguments.
- Web Search: duckduckgo-search library.
- PDF Handling: PyPDFLoader (from langchain_community).
- Web Interaction: requests library for downloading, BeautifulSoup4 for HTML parsing.
- Text Processing: LangChain RecursiveCharacterTextSplitter, load_summarize_chain.
- Data Handling: pandas for displaying results.
- Execution: Standard Python 3, argparse for command-line arguments, logging.
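For the web interaction step, the script has to decide whether a URL points at a PDF, at an HTML page that may link to one, or at something it cannot use. The sketch below shows one way to make that decision from the URL and the response headers alone. The function names are hypothetical, and a plain `dict` stands in for the case-insensitive headers object that `requests` returns, so header names must match exactly here.

```python
from urllib.parse import urlparse


def looks_like_pdf(url: str, headers: dict) -> bool:
    """Return True if the URL path or the response headers suggest a PDF payload."""
    content_type = headers.get("Content-Type", "").lower()
    disposition = headers.get("Content-Disposition", "").lower()
    return (
        "application/pdf" in content_type
        or ".pdf" in urlparse(url).path.lower()
        or "pdf" in disposition
    )


def choose_text_source(url: str, headers: dict) -> str:
    """Pick the extraction route: 'pdf', 'html', or 'unsupported'."""
    if looks_like_pdf(url, headers):
        return "pdf"
    if "text/html" in headers.get("Content-Type", "").lower():
        return "html"
    return "unsupported"


print(choose_text_source("https://example.org/paper.pdf", {}))
print(choose_text_source("https://example.org/article/42",
                         {"Content-Type": "text/html; charset=utf-8"}))
```

Checking the PDF signals first means that a PDF served with a generic content type is still caught through its file extension; the HTML route then serves as the fallback.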

This script was created mainly with **Google AI Studio**.

In [1]:
!pip install --upgrade --no-cache-dir duckduckgo-search --quiet


In [2]:
!pip install --upgrade --no-cache-dir langchain langchain_openai openai pypdf pandas google-search-results beautifulsoup4 kaggle googlesearch-python langchain_community requests lxml --quiet

print("Libraries installed.")


In [7]:
# --- START OF IMPORT SECTION ---

import os
import re
import sys
import time
import json
import argparse
import logging
import traceback
from tempfile import NamedTemporaryFile, TemporaryDirectory
from urllib.parse import urljoin, urlparse

# --- Dependency Imports with Error Handling ---

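# Configure logging before the import checks; otherwise their INFO messages are dropped.
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
log = logging.getLogger(__name__)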

try:
    # Core Third-Party Libraries
    import pandas as pd
    import requests
    from bs4 import BeautifulSoup
    log.info("Successfully imported: pandas, requests, bs4")
except ImportError as e:
    log.critical(f"Error: Missing core third-party library: {e}")
    log.critical("Please install requirements: pip install pandas requests beautifulsoup4 lxml")
    traceback.print_exc()
    sys.exit(1)

try:
    # Google Search (Optional)
    from googlesearch import search
    log.info("Successfully imported: googlesearch")
except ImportError:
    log.warning("Could not import googlesearch. Google search functionality will be disabled.")
    log.warning("You can install it using: pip install googlesearch-python")
    search = None # Define search as None so the script can proceed without it

try:
    # OpenAI SDK
    import openai # Although langchain_openai wraps it, direct import check is good
    from langchain_openai import ChatOpenAI
    log.info("Successfully imported: openai, langchain_openai.ChatOpenAI")
except ImportError as e:
    log.critical(f"Error: Missing OpenAI or Langchain OpenAI integration: {e}")
    log.critical("Please install requirements: pip install openai langchain_openai")
    traceback.print_exc()
    sys.exit(1)

try:
    # LangChain Core Components
    from langchain_core.prompts import PromptTemplate, ChatPromptTemplate, MessagesPlaceholder
    from langchain_core.tools import BaseTool
    from langchain_core.documents import Document # Corrected from langchain.docstore.document
    from langchain_core.messages import AIMessage, HumanMessage
    # Pydantic v1 is often needed for Langchain schemas
    from langchain.pydantic_v1 import BaseModel, Field
    log.info("Successfully imported: LangChain core prompts, tools, documents, messages, pydantic_v1")

    # Ensure BaseModel and Field are recognized immediately after import
    assert BaseModel is not None
    assert Field is not None
    log.info("Confirmed BaseModel and Field are loaded.")

except ImportError as e:
    log.critical(f"Error: Missing core LangChain library: {e}")
    log.critical("This often indicates an incomplete or corrupted installation.")
    log.critical("Try reinstalling: pip install --upgrade --force-reinstall langchain langchain-core")
    traceback.print_exc()
    sys.exit(1)
except AssertionError:
    log.critical("Error: BaseModel or Field not recognized immediately after import. Check installation.")
    traceback.print_exc()
    sys.exit(1)


try:
    # LangChain Community Components (Loaders, Splitters)
    from langchain_community.document_loaders import PyPDFLoader
    from langchain.text_splitter import RecursiveCharacterTextSplitter # Often in top-level langchain
    log.info("Successfully imported: LangChain community loaders, text_splitter")
except ImportError as e:
    log.critical(f"Error: Missing LangChain community or text_splitter component: {e}")
    log.critical("Please install requirements: pip install langchain_community pypdf")
    traceback.print_exc()
    sys.exit(1)

try:
    from duckduckgo_search import DDGS
    log.info("Successfully imported: duckduckgo_search")
    ddgs_search = DDGS()
except ImportError:
    log.warning("Could not import duckduckgo_search. Install with 'pip install duckduckgo-search'")
    ddgs_search = None
    
try:
    # LangChain Chains and Agents
    from langchain.chains.summarize import load_summarize_chain
    from langchain.agents import AgentExecutor, create_openai_functions_agent
    log.info("Successfully imported: LangChain chains, agents")
except ImportError as e:
    # Check specifically for the BaseCache error if it occurs here
    if 'BaseCache' in str(e):
        log.critical(f"Error: Missing component likely related to LangChain caching/runnables: {e}")
        log.critical("This points to an issue within 'langchain-core'.")
        log.critical("Ensure 'langchain-core' is installed correctly and is compatible.")
        log.critical("Try reinstalling: pip install --upgrade --force-reinstall langchain-core langchain")
    else:
        log.critical(f"Error: Missing LangChain chains or agents component: {e}")
        log.critical("Please ensure langchain is fully installed.")
    traceback.print_exc()
    sys.exit(1)

# --- Logging Setup ---
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
log = logging.getLogger(__name__)

# --- Constants and Configuration ---
# These can be overridden by command-line arguments
DEFAULT_LLM_MODEL = "gpt-4.1-mini"
DEFAULT_MAX_SEARCH_RESULTS = 10
DEFAULT_MAX_SIMILAR_ARTICLES_TO_PROCESS = 5
DEFAULT_DOWNLOAD_TIMEOUT = 25
DEFAULT_MAX_CHARS_FOR_EXTRACTION = 25000
HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'}

# --- Helper Functions (mostly for tools) ---

def load_pdf_text_internal(pdf_path):
    """Loads text content from a PDF file. Returns (full_text, docs_list)
       Internal version for tools, raises errors on failure."""
    try:
        log.info(f"Loading PDF: {pdf_path}")
        if not os.path.exists(pdf_path):
            raise FileNotFoundError(f"PDF file not found at: {pdf_path}")
        loader = PyPDFLoader(pdf_path, extract_images=False)
        docs = loader.load()
        if not docs:
            log.warning(f"PyPDFLoader loaded 0 pages from {pdf_path}.")
            return None, []  # Return empty list for docs

        cleaned_pages = []
        for doc in docs:
            cleaned_content = doc.page_content.replace('\x00', '').strip()
            cleaned_pages.append(cleaned_content)

        full_text = "\n".join(filter(None, cleaned_pages))

        if not full_text:
            log.warning(f"PDF loaded but resulted in empty text content after cleaning: {pdf_path}")
            # Still return the loaded docs object if it exists, even if text is empty
            return None, docs

        log.info(f"Successfully loaded text from: {pdf_path} ({len(docs)} pages, {len(full_text)} chars)")
        return full_text, docs
    except Exception as e:
        log.error(f"Error loading PDF {pdf_path}: {type(e).__name__} - {e}", exc_info=False) # Log error without traceback for cleaner tool output
        raise # Re-raise the exception for the tool/agent to handle

def extract_text_from_html_internal(html_content):
    """Extracts plain text from HTML content."""
    try:
        soup = BeautifulSoup(html_content, 'lxml')
        for script_or_style in soup(["script", "style"]):
            script_or_style.decompose()
        text = soup.get_text(separator='\n', strip=True)
        text = re.sub(r'\n\s*\n', '\n\n', text)
        log.info(f"Extracted ~{len(text)} characters of text from HTML.")
        return text
    except Exception as e:
        log.error(f"Error extracting text from HTML: {e}", exc_info=False)
        return None # Return None on failure, tool should report this

def parse_llm_extraction_internal(llm_output_text):
    """Parses the LLM text output using regex for key data."""
    data = {
        'corrosion_inhibition_%': 'Not Found',
        'impedance_ohm_cm2': 'Not Found',
        'adhesion_MPa': 'Not Found'
    }
    patterns = {
        'corrosion_inhibition_%': r"Corrosion Inhibition \(%\):\s*([\d\.]+%?|Not Found|N/A|-)",
        'impedance_ohm_cm2': r"Impedance \((?:ohm\.cm2|ꭥ\.cm2)\):\s*(.*?)(?:\n|Adhesion|$)",
        'adhesion_MPa': r"Adhesion \(MPa\):\s*([\d\.]+|Not Found|N/A|-)"
    }

    for key, pattern in patterns.items():
        match = re.search(pattern, llm_output_text, re.IGNORECASE | re.MULTILINE | re.DOTALL)
        if match:
            value = match.group(1).strip()
            if value.lower() in ['not found', 'n/a', '-'] or not value:
                data[key] = 'Not Found'
            else:
                if key == 'impedance_ohm_cm2':
                    value = re.sub(r'\s+', ' ', value).strip('., ').strip()
                    if not value or value.lower() in ['not found', 'n/a', '-']:
                        data[key] = 'Not Found'
                    else:
                        num_match = re.search(r"([-+]?\d*\.?\d+(?:[eE][-+]?\d+)?)", value)
                        num_val = num_match.group(1) if num_match else 'N/A'
                        # Keep parsed num in value for clarity, but could simplify
                        data[key] = f"{value} (parsed num: {num_val})"
                else:
                    num_match = re.search(r"([-+]?\d*\.?\d+(?:[eE][-+]?\d+)?)", value)
                    if num_match:
                        data[key] = num_match.group(1).strip()
                    else:
                        data[key] = 'Not Found (parse failed)'
                        log.info(f"Regex matched for {key} but couldn't parse number from '{value}'")
        else:
            data[key] = 'Not Found (pattern mismatch)'

    log.info(f"Parsed LLM Extraction Output: {data}")
    return data

def download_pdf_internal(url, save_dir, timeout):
    """Attempts direct PDF download. Returns (pdf_path, response_object_or_None)"""
    os.makedirs(save_dir, exist_ok=True)
    pdf_path = None
    response = None
    try:
        log.info(f"Attempting direct PDF download from: {url}")
        response = requests.get(url, stream=True, timeout=timeout, headers=HEADERS, allow_redirects=True)
        response.raise_for_status()
        content_type = response.headers.get('Content-Type', '').lower()
        is_pdf_content = ('application/pdf' in content_type or
                          '.pdf' in urlparse(url).path.lower() or
                          'pdf' in response.headers.get('Content-Disposition', '').lower())

        if is_pdf_content and response.status_code == 200:
            content_length_str = response.headers.get('content-length')
            if content_length_str is not None and int(content_length_str) == 0:
                log.warning(f"Content-Length is 0 for {url}. Skipping download.")
                return None, response

            with NamedTemporaryFile(delete=False, suffix=".pdf", dir=save_dir) as temp_file:
                pdf_path = temp_file.name
                bytes_downloaded = 0
                for chunk in response.iter_content(chunk_size=8192):
                    temp_file.write(chunk)
                    bytes_downloaded += len(chunk)

            if bytes_downloaded < 1024: # Check if PDF is suspiciously small
                log.warning(f"Downloaded PDF {pdf_path} is small ({bytes_downloaded} bytes). May be invalid.")

            try:
                # Parse the whole file to confirm it is a readable PDF; load() raises if it is not.
                loader_check = PyPDFLoader(pdf_path, extract_images=False)
                loader_check.load()
                log.info(f"Successfully downloaded and verified PDF ({bytes_downloaded} bytes) to: {pdf_path}")
                return pdf_path, response
            except Exception as pdf_err:
                log.warning(f"Downloaded file {pdf_path} failed PDF verification: {pdf_err}. Removing.")
                if pdf_path and os.path.exists(pdf_path):
                    os.remove(pdf_path)
                return None, response  # Return None path, but keep response obj
        else:
            log.info(f"URL does not appear to be a direct PDF or has error. Status: {response.status_code}, Content-Type: {content_type}")
            return None, response

    except requests.exceptions.Timeout:
        log.warning(f"Timeout occurred downloading {url}")
        return None, None
    except requests.exceptions.HTTPError as http_err:
        log.warning(f"HTTP Error for {url}: {http_err}")
        return None, getattr(http_err, 'response', None)
    except requests.exceptions.RequestException as e:
        log.warning(f"Failed download/access for {url}: {e}")
        return None, None
    except Exception as e:
        log.error(f"Unexpected error during direct download {url}: {type(e).__name__} {e}", exc_info=False)
        return None, response

def find_and_download_pdf_from_html_internal(page_url, page_content, save_dir, timeout):
    """Parses HTML to find PDF links and attempts download."""
    log.info(f"Scanning HTML from {page_url} for PDF links.")
    soup = BeautifulSoup(page_content, 'lxml')
    pdf_links = []
    found_links = set()

    for a_tag in soup.find_all('a', href=True):
        href = a_tag['href'].strip()
        if not href or href.startswith('#') or href.lower().startswith('javascript:'):
            continue

        href_lower = href.lower()
        link_text = a_tag.get_text(strip=True).lower()

        is_likely_pdf = (
            href_lower.endswith('.pdf') or
            '/pdf/' in href_lower or any(kw in href_lower for kw in ['download=pdf', 'downloadpdf', 'articlepdf', 'fulltextpdf', 'viewpdf']) or
            link_text in ['pdf', 'full text pdf', 'download pdf', '[pdf]', 'article pdf', 'view pdf', 'get pdf'] or
            (a_tag.get('class') and any('pdf' in c.lower() for c in a_tag.get('class', []))) or
            (a_tag.get('id') and 'pdf' in a_tag.get('id','').lower())
        )

        if is_likely_pdf:
            try:
                pdf_url = urljoin(page_url, href)
                parsed_url = urlparse(pdf_url)
                if parsed_url.scheme in ['http', 'https'] and pdf_url != page_url and pdf_url not in found_links:
                    if '.pdf' in pdf_url.lower() or any(kw in pdf_url.lower() for kw in ['download', 'content', 'fulltext', 'article', 'view']):
                        log.info(f"Found potential PDF link: {pdf_url}")
                        pdf_links.append(pdf_url)
                        found_links.add(pdf_url)
            except Exception as parse_err:
                log.warning(f"Skipping invalid link '{href}': {parse_err}")

    # Try the most likely links first: direct .pdf URLs and explicit download links.
    pdf_links.sort(key=lambda x: x.lower().endswith('.pdf') or 'download' in x.lower(), reverse=True)

    if not pdf_links:
        log.info("No likely PDF links found in HTML.")
        return None

    log.info(f"Found {len(pdf_links)} potential links. Attempting download...")
    for pdf_url in pdf_links:
        time.sleep(1.0)
        log.info(f"Attempting download from HTML link: {pdf_url}")
        downloaded_path, _ = download_pdf_internal(pdf_url, save_dir, timeout)
        if downloaded_path:
            log.info(f"Successfully downloaded PDF from HTML link: {pdf_url}")
            return downloaded_path

    log.info("Tried all potential PDF links from HTML, none worked.")
    return None

# --- Custom LangChain Tools ---

class LoadManuscriptInput(BaseModel):
    pdf_path: str = Field(description="The file path to the manuscript PDF.")

class LoadManuscriptTool(BaseTool):
    name: str = "load_manuscript"
    description: str = "Loads the text content from a PDF and extracts page data. Returns a dictionary with 'full_text' and 'docs_data' (a list of dicts, each with 'page_content' and 'metadata'). Critical first step."
    args_schema: type[BaseModel] = LoadManuscriptInput

    def _run(self, pdf_path: str) -> dict:
        try:
            full_text, docs = load_pdf_text_internal(pdf_path)
            # Convert Document objects to simple dicts for serialization
            docs_data = []
            if docs:
                docs_data = [{"page_content": doc.page_content, "metadata": doc.metadata} for doc in docs]

            return {
                "full_text": full_text if full_text else "",
                "docs_data": docs_data,  # Plain dicts rather than Document objects
            }
        except Exception as e:
            log.error(f"LoadManuscriptTool failed: {e}", exc_info=False)
            return {"error": f"Failed to load manuscript '{os.path.basename(pdf_path)}': {e}"}

class SummarizeManuscriptInput(BaseModel):
    # Plain dicts (not Document objects) so the data can be passed between tools.
    docs_data: list[dict] = Field(description="A list of dictionaries, each with 'page_content' and 'metadata', obtained from 'load_manuscript'.")
    
class SummarizeManuscriptTool(BaseTool):
    name: str = "summarize_manuscript"
    description: str = "Generates a concise summary of the manuscript using its extracted page data."
    args_schema: type[BaseModel] = SummarizeManuscriptInput
    llm: ChatOpenAI

    def _run(self, docs_data: list[dict]) -> str:
        if not self.llm:
            log.error("SummarizeManuscriptTool: LLM not initialized.")
            return "Error: LLM not initialized for summarization."
        if not docs_data:
            log.error("SummarizeManuscriptTool: No document data provided.")
            return "Error: No document data provided for summarization."

        try:
            log.info("Reconstructing Document objects for summarization...")
            # Reconstruct Document objects from the dictionaries
            reconstructed_docs = [Document(page_content=d.get('page_content', ''), metadata=d.get('metadata', {})) for d in docs_data]

            if not reconstructed_docs:
                log.error("SummarizeManuscriptTool: Failed to reconstruct documents from data.")
                return "Error: Failed to reconstruct documents from provided data."

            log.info("Generating summary...")
            summary_chain = load_summarize_chain(self.llm, chain_type="map_reduce")
            text_splitter = RecursiveCharacterTextSplitter(chunk_size=4000, chunk_overlap=500)
            # Use the reconstructed documents
            split_docs = text_splitter.split_documents(reconstructed_docs)

            if not split_docs:
                log.error("SummarizeManuscriptTool: Document splitting resulted in no chunks.")
                return "Error: Document splitting resulted in no chunks for summary."

            # Pass the documents under the chain's "input_documents" key.
            summary_result = summary_chain.invoke({"input_documents": split_docs})
            summary = summary_result.get("output_text", "Summary generation failed.")
            log.info("Summary generated successfully.")
            return summary
        except Exception as e:
            log.error(f"Error during summarization: {e}", exc_info=False)
            return f"Error during summarization: {e}"
            

class ExtractMetricsInput(BaseModel):
    text_content: str = Field(description="The text content from which to extract data.")
    source_type: str = Field(description="Description of the source (e.g., 'Manuscript PDF', 'Web Article PDF', 'Web Article HTML').")
    max_chars: int = Field(default=DEFAULT_MAX_CHARS_FOR_EXTRACTION, description="Max characters of text_content to analyze.")

class ExtractMetricsTool(BaseTool):
    name: str = "extract_key_metrics"
    description: str = "Extracts specific quantitative data (Corrosion Inhibition %, Impedance ohm.cm2, Adhesion MPa) from the provided text using an LLM. Returns a dictionary of extracted values."
    args_schema: type[BaseModel] = ExtractMetricsInput
    llm: ChatOpenAI = Field(default=None)

    def _run(self, text_content: str, source_type: str, max_chars: int = DEFAULT_MAX_CHARS_FOR_EXTRACTION) -> dict:
        if not self.llm:
            log.error("ExtractMetricsTool: LLM not initialized.")
            return {"error": "LLM not initialized for extraction."}
        if not text_content:
            log.warning(f"ExtractMetricsTool: No text content provided for extraction from {source_type}.")
            return {"error": f"No text content provided for extraction from {source_type}."}

        extraction_prompt_template = """
        Analyze the following text from a scientific paper ({source_type}). Extract the following specific quantitative data points if present. Look carefully throughout the text, especially in results, discussion, abstract, and conclusion sections. If a value is explicitly mentioned, report it. If not found, state 'Not Found'. Provide only the numerical value (e.g., 95.5, 1.2E6, 15.3) or 'Not Found'. Ensure units are implicitly ohm.cm2 for impedance unless otherwise stated in the value itself.

        1.  **Corrosion Inhibition Efficiency (%):** Look for 'Corrosion Inhibition Efficiency', 'Inhibition Efficiency', or 'IE%'. Report the highest or most representative value. [Value or 'Not Found']
        2.  **Impedance Magnitude (ohm.cm2 or ꭥ.cm2):** Look for impedance values, often associated with EIS results (e.g., Z_real, |Z|, R_ct, R_p) reported in Ohm.cm² (or kΩ.cm², MΩ.cm² - include units if not base). Report a key representative value (e.g., low frequency, after exposure, highest resistance). [Value (include units if not base ohm.cm2) or 'Not Found']
        3.  **Adhesion Strength (MPa):** Look for results from pull-off, scratch, or similar adhesion tests reported in MPa. [Value or 'Not Found']

        Format the output strictly as follows, one item per line, using the exact keys:
        Corrosion Inhibition (%): [Value or 'Not Found']
        Impedance (ohm.cm2): [Value or 'Not Found']
        Adhesion (MPa): [Value or 'Not Found']

        Text Snippet (first {max_chars} chars):
        "{text}"

        Extracted Data:
        """
        extraction_prompt = PromptTemplate(
            template=extraction_prompt_template,
            input_variables=["text", "source_type", "max_chars"]
        )

        try:
            log.info(f"Requesting LLM extraction from {source_type}...")
            text_snippet = text_content[:max_chars]

            extraction_query = extraction_prompt.format(text=text_snippet, source_type=source_type, max_chars=max_chars)
            extraction_response = self.llm.invoke(extraction_query)
            extracted_data_text = extraction_response.content
            log.info(f"LLM extraction response received for {source_type}.")

            extracted_values = parse_llm_extraction_internal(extracted_data_text)
            # Judge success on the metric values alone. The 'status' key is added only
            # afterwards so that it cannot make the check trivially true.
            not_found_markers = ['Not Found', 'N/A', 'Extraction Error', 'No Text', 'LLM Error', 'Not Found (parse failed)', 'Not Found (pattern mismatch)']
            found_any = any(v is not None and v not in not_found_markers for v in extracted_values.values())
            extracted_values['status'] = 'Extraction Successful (Found Values)' if found_any else 'Extraction Attempted'
            return extracted_values

        except Exception as e:
            log.error(f"Error during LLM data extraction from {source_type}: {e}", exc_info=False)
            return {
                'corrosion_inhibition_%': 'LLM Error',
                'impedance_ohm_cm2': 'LLM Error',
                'adhesion_MPa': 'LLM Error',
                'status': f'LLM Error ({type(e).__name__})'
            }

class SearchSimilarArticlesInput(BaseModel):
    query: str = Field(description="The search query, typically the manuscript title.")
    num_results: int = Field(description="Maximum number of search results to return.")

class SearchSimilarArticlesTool(BaseTool):
    name: str = "search_similar_articles"
    description: str = "Performs a DuckDuckGo search for articles similar to the manuscript based on a query (usually the title). Returns a list of URLs."
    args_schema: type[BaseModel] = SearchSimilarArticlesInput

    def _run(self, query: str, num_results: int) -> dict:
        if not ddgs_search:
            log.error("DuckDuckGo search library not available.")
            return {"error": "DuckDuckGo search library not available."}

        urls = []
        effective_max = num_results if num_results > 0 else 10 # DDGS uses max_results

        log.info(f"Searching DuckDuckGo for: '{query}' (Max results: {effective_max})")
        try:
            # Use ddgs.text() which returns dictionaries with 'href'
            search_results = ddgs_search.text(
                query,
                region='wt-wt', # World-wide
                safesearch='off',
                max_results=effective_max
            )

            if search_results:
                urls = [result['href'] for result in search_results if 'href' in result]
            else:
                urls = [] # Explicitly empty if no results

            log.info(f"Found {len(urls)} potential URLs via DuckDuckGo.")
            return {"urls": urls}

        except Exception as e:
            # DDG errors might be different, keep generic for now
            error_message = f"Error during DuckDuckGo search: {type(e).__name__} - {e}"
            log.error(error_message, exc_info=False)
            return {"error": error_message, "urls": []}
            

class ProcessWebArticleInput(BaseModel):
    url: str = Field(description="The URL of the web article to process.")

class ProcessWebArticleTool(BaseTool):
    name: str = "process_web_article"
    description: str = "Processes a single web article URL. Attempts to download a PDF (directly or via HTML links), extracts text, and then extracts key metrics (Corrosion Inhibition %, Impedance ohm.cm2, Adhesion MPa). Returns a dictionary with the URL, processing status, and extracted data."
    args_schema: type[BaseModel] = ProcessWebArticleInput
    metrics_extractor: ExtractMetricsTool = Field(default=None)
    download_dir: str = Field(...) # Make required
    timeout: int = Field(...) # Make required

    def _run(self, url: str) -> dict:
        if not self.metrics_extractor:
            log.error("ProcessWebArticleTool: Metrics extractor tool not provided.")
            return {"url": url, "status": "Error: Metrics extractor tool not provided.", "data": {}}

        log.info(f"--- Processing URL: {url} ---")
        pdf_path = None
        response_obj = None
        article_text = None
        extracted_data = {}
        status = "Processing Started"
        source_type = "N/A"
        # Initialize result structure
        final_result = {
            'url': url,
            'status': status,
            'corrosion_inhibition_%': 'N/A',
            'impedance_ohm_cm2': 'N/A',
            'adhesion_MPa': 'N/A'
        }

        try:
            # Attempt 1: Direct Download PDF
            pdf_path, response_obj = download_pdf_internal(url, self.download_dir, self.timeout)

            if pdf_path:
                try:
                    article_text, _ = load_pdf_text_internal(pdf_path)
                    if article_text:
                        status = "PDF Downloaded & Text Loaded"
                        source_type = "PDF"
                    else:
                        status = "PDF Downloaded but Text Extraction Failed"
                except Exception as load_err:
                    status = f"PDF Downloaded but Text Loading Failed: {load_err}"
            else:
                # Attempt 2: Check if it was HTML
                if response_obj and response_obj.status_code == 200 and 'text/html' in response_obj.headers.get('Content-Type', '').lower():
                    html_content = response_obj.text
                    if len(html_content) > 200: # Basic check for valid content
                        log.info("Direct download failed or not PDF, content is HTML.")
                        status = "HTML Page Found"
                        # Attempt 2a: Find and download PDF from HTML
                        pdf_path_from_html = find_and_download_pdf_from_html_internal(url, html_content, self.download_dir, self.timeout)
                        if pdf_path_from_html:
                            pdf_path = pdf_path_from_html # Assign for cleanup
                            try:
                                article_text, _ = load_pdf_text_internal(pdf_path)
                                if article_text:
                                    status = "PDF Link in HTML Found, Downloaded & Text Loaded"
                                    source_type = "PDF (from HTML link)"
                                else:
                                    status = "PDF via HTML Downloaded but Text Extraction Failed"
                            except Exception as load_err:
                                status = f"PDF via HTML Downloaded but Text Loading Failed: {load_err}"
                        else:
                            # Attempt 2b: FALLBACK - Extract text directly from HTML
                            log.info("Could not download PDF from HTML links. Attempting extraction from HTML text.")
                            status = "HTML Scan Failed to Find PDF, Trying HTML Text Extraction"
                            article_text = extract_text_from_html_internal(html_content)
                            if article_text:
                                source_type = "HTML"
                            else:
                                status = "HTML Scan Failed, HTML Text Extraction Failed"
                    else:
                        status = "HTML Page Found, but Content too small"
                elif response_obj:
                    status = f"Direct Download/Access Failed (Status: {response_obj.status_code}, Type: {response_obj.headers.get('Content-Type', 'N/A')})"
                else:
                    status = "Direct Download/Access Failed (Network/Request Error or Timeout)"

            # If text obtained, extract metrics
            if article_text:
                log.info(f"Attempting metric extraction from {source_type} text...")
                extracted_data = self.metrics_extractor._run(text_content=article_text, source_type=source_type)
                extraction_status = extracted_data.pop('status', 'Unknown Extraction Status')
                status += f" - {extraction_status}"
            else:
                log.info("No text loaded, skipping metric extraction.")
                extracted_data = {key: 'No Text Loaded' for key in ['corrosion_inhibition_%', 'impedance_ohm_cm2', 'adhesion_MPa']}

        except Exception as process_err:
            log.error(f"Unexpected error processing URL {url}: {type(process_err).__name__} - {process_err}", exc_info=True) # Log full traceback here
            status = f"Unexpected Error During Processing: {type(process_err).__name__}"
            extracted_data = {key: 'Processing Error' for key in ['corrosion_inhibition_%', 'impedance_ohm_cm2', 'adhesion_MPa']}

        finally:
            final_result['status'] = status
            # Merge extracted data into the final result
            for key in ['corrosion_inhibition_%', 'impedance_ohm_cm2', 'adhesion_MPa']:
                final_result[key] = extracted_data.get(key, 'Error/Missing')

            # Clean up temporary PDF file
            if pdf_path and os.path.exists(pdf_path):
                try:
                    os.remove(pdf_path)
                    log.info(f"Removed temporary file: {pdf_path}")
                except OSError as e:
                    log.warning(f"Error removing temporary file {pdf_path}: {e}")

        return final_result

# --- Main Agent Function ---

def run_reviewer_agent(pdf_path, openai_api_key, output_dir, llm_model, max_search, max_process, timeout, max_extract_chars, agent_verbose=True):
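    """
    Sets up and runs the LangChain Reviewer Assistant Agent.

    Args:
        pdf_path: Path to the manuscript PDF to analyze.
        openai_api_key: OpenAI API key passed to ChatOpenAI.
        output_dir: Existing directory in which the temporary download folder is created.
        llm_model: OpenAI model name.
        max_search: Number of search results the agent is told to request.
        max_process: Maximum number of URLs the agent is told to process.
        timeout: Download timeout in seconds for web articles.
        max_extract_chars: Character limit passed to the metric-extraction step.
        agent_verbose: If True, the AgentExecutor prints its intermediate steps.

    Returns:
        The agent's raw output string (expected to contain a fenced JSON block),
        a JSON string with an 'error' key if execution raised an exception,
        or None if the LLM or the agent could not be initialized.
    """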
    log.info("--- Starting Reviewer Assistant Agent ---")

    # --- Initialize LLM ---
    try:
        llm = ChatOpenAI(
            temperature=0,
            model_name=llm_model,
            openai_api_key=openai_api_key,
            request_timeout=180 # Increased timeout for agent chains
        )
        log.info(f"LLM ({llm_model}) initialized.")
    except Exception as e:
        log.critical(f"Failed to initialize LLM: {e}. Please check API key and model name.", exc_info=True)
        return None # Indicate failure

    # --- Create Temporary Directory for Downloads ---
    # Use a context manager for automatic cleanup
    with TemporaryDirectory(prefix="raa_downloads_", dir=output_dir) as temp_download_dir:
        log.info(f"Using temporary download directory: {temp_download_dir}")

        # --- Instantiate Tools ---
        load_tool = LoadManuscriptTool()
        summarize_tool = SummarizeManuscriptTool(llm=llm)
        extract_tool = ExtractMetricsTool(llm=llm)
        search_tool = SearchSimilarArticlesTool()
        process_article_tool = ProcessWebArticleTool(
            metrics_extractor=extract_tool,
            download_dir=temp_download_dir, # Use temp dir
            timeout=timeout
        )

        tools = [load_tool, summarize_tool, extract_tool, search_tool, process_article_tool]

        # --- Define Agent Prompt ---
        prompt_template = ChatPromptTemplate.from_messages([
            ("system",
             "You are a highly efficient Reviewer Assistant Agent. Your task is to analyze a scientific manuscript about corrosion inhibition and coatings. "
             "Follow these steps precisely:\n"
             # Step 1 Change: Mention 'docs_data' output
             "1. Load the manuscript PDF using the 'load_manuscript' tool using the provided path. This tool returns 'full_text' and 'docs_data' (a list of dicts).\n"
             "2. If loading is successful, extract the manuscript's title. Infer this from the first few pages' text in the 'full_text' field, or use the filename as a fallback. State the title clearly.\n"
             # Step 3 Change: Mention using 'docs_data' as input
             "3. Summarize the manuscript's key findings using the 'summarize_manuscript' tool with the 'docs_data' field obtained from the load step.\n"
             "4. Extract the specific quantitative metrics (Corrosion Inhibition %, Impedance ohm.cm2, Adhesion MPa) from the 'full_text' of the manuscript using the 'extract_key_metrics' tool. Set source_type='Manuscript PDF' and use max_chars={max_extract_chars}.\n"
             "5. If a title was identified, search for similar articles online using 'search_similar_articles'. Use the manuscript title as the query and request {max_search_results} results.\n"
             "6. If search results (URLs) are found, process *up to* {max_process_limit} of the *most relevant-looking* URLs using the 'process_web_article' tool, one URL at a time. Prioritize URLs ending in .pdf or from known publishers (nature.com, sciencedirect.com, pubs.acs.org, mdpi.com, etc.). Skip obvious non-article links (e.g., search engine results pages).\n"
             "7. Compile ALL results into a single, final JSON object. The JSON object MUST have these top-level keys ONLY:\n"
             "   - 'manuscript_analysis': An object containing 'title', 'summary', and 'extracted_data' (the dict from step 4).\n"
             "   - 'similar_articles_analysis': A list of objects, where each object is the dictionary returned by 'process_web_article' for each processed URL (including 'url', 'status', 'corrosion_inhibition_%', 'impedance_ohm_cm2', 'adhesion_MPa').\n"
             "8. If any step fails (e.g., loading, search, processing a specific URL), report the error clearly in the 'status' or relevant field and proceed with the next steps if possible. The final JSON output is mandatory.\n"
             "9. Ensure your final response is ONLY the JSON object described in step 7, enclosed in ```json ... ```."
             .format(max_search_results=max_search, max_process_limit=max_process, max_extract_chars=max_extract_chars)
            ),
            MessagesPlaceholder(variable_name="chat_history"),
            ("human", "{input}"),
            MessagesPlaceholder(variable_name="agent_scratchpad"),
        ])

        # --- Create Agent and Executor ---
        try:
            agent = create_openai_functions_agent(llm, tools, prompt_template)
            agent_executor = AgentExecutor(
                agent=agent,
                tools=tools,
                verbose=agent_verbose,
                handle_parsing_errors=True, # Try to recover from LLM format errors
                max_iterations=25 # Increased slightly for multi-step processing
            )
            log.info("Agent setup complete.")
        except Exception as e:
            log.critical(f"Failed to create agent/executor: {e}", exc_info=True)
            return None # Indicate failure

        # --- Prepare Input and Run Agent ---
        initial_input = f"Please process the manuscript located at '{pdf_path}' according to the standard procedure."
        chat_history = []

        try:
            log.info("Invoking agent executor...")
            result = agent_executor.invoke({"input": initial_input, "chat_history": chat_history})
            log.info("Agent execution finished.")
            agent_output = result.get('output', '{}')
            return agent_output # Return the raw output string

        except Exception as e:
            log.critical(f"An error occurred during agent execution: {type(e).__name__} - {e}", exc_info=True)
            # Optionally return partial results or a specific error structure
            return json.dumps({"error": f"Agent execution failed: {e}"}) # Return error as JSON string

    # End of TemporaryDirectory context - temp_download_dir is cleaned up here

# --- Main Execution Block ---
if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Reviewer Assistant Agent: Summarizes a manuscript and compares metrics with similar online articles.")
    parser.add_argument("pdf_path", help="Path to the input manuscript PDF file.")
    parser.add_argument("-k", "--api-key", help="OpenAI API Key. If not provided, uses OPENAI_API_KEY environment variable.", default=None)
    parser.add_argument("-o", "--output-dir", help="Directory for temporary files (like downloads). Defaults to './raa_output'.", default="./raa_output")
    parser.add_argument("--model", help=f"OpenAI model name to use. Defaults to '{DEFAULT_LLM_MODEL}'.", default=DEFAULT_LLM_MODEL)
    parser.add_argument("--max-search", type=int, help=f"Max web search results (DuckDuckGo). Defaults to {DEFAULT_MAX_SEARCH_RESULTS}.", default=DEFAULT_MAX_SEARCH_RESULTS)
    parser.add_argument("--max-process", type=int, help=f"Max similar articles to fully process. Defaults to {DEFAULT_MAX_SIMILAR_ARTICLES_TO_PROCESS}.", default=DEFAULT_MAX_SIMILAR_ARTICLES_TO_PROCESS)
    parser.add_argument("--timeout", type=int, help=f"Download timeout in seconds. Defaults to {DEFAULT_DOWNLOAD_TIMEOUT}.", default=DEFAULT_DOWNLOAD_TIMEOUT)
    parser.add_argument("--max-extract-chars", type=int, help=f"Max characters for LLM extraction. Defaults to {DEFAULT_MAX_CHARS_FOR_EXTRACTION}.", default=DEFAULT_MAX_CHARS_FOR_EXTRACTION)
    parser.add_argument("--quiet", action="store_true", help="Run agent silently (sets verbose=False).")


    # ----- MODIFICATION FOR NOTEBOOK -----
    # Define arguments explicitly as a list of strings
    # Replace placeholders with your actual values for notebook execution
    pdf_file_to_process = "/kaggle/input/sample-manuscript/Sample-Manuscript.pdf" # <--- SET YOUR PDF PATH HERE
    output_directory = "/kaggle/working/raa_output" # <--- SET YOUR DESIRED OUTPUT DIR HERE
    # Add other arguments if needed, e.g., "--model", "gpt-4", "--max-process", "3"
    # For boolean flags like --quiet, just include the flag string: "--quiet"

    # Construct the argument list
    args_list = [
        pdf_file_to_process,
        "--output-dir", output_directory,
        "--max-search", "20",
        "--max-process", "10",
    ]

    # Parse the explicit list instead of sys.argv
    args = parser.parse_args(args_list)
    # ----- END OF MODIFICATION -----


    # --- Get API Key ---
    # Resolution order: --api-key argument, then OPENAI_API_KEY environment variable, then Kaggle Secrets
    openai_api_key = args.api_key or os.getenv("OPENAI_API_KEY")
    if not openai_api_key:
        # If running in Kaggle, try getting from secrets as a fallback
        try:
            from kaggle_secrets import UserSecretsClient
            user_secrets = UserSecretsClient()
            openai_api_key = user_secrets.get_secret("OpenAI_API_KEY")
            log.info("Retrieved OpenAI API Key from Kaggle Secrets.")
            os.environ["OPENAI_API_KEY"] = openai_api_key # Set env var too if needed later
        except Exception as secret_err:
            log.critical(f"Kaggle Secrets error: {secret_err}")
            log.critical("Error: OpenAI API Key not found. Provide via --api-key, env var, or Kaggle Secrets.")
            raise SystemExit(1) # Exit if key is definitely not found; works in scripts and notebooks alike

    if not openai_api_key: # Double check after trying secrets (e.g., the secret exists but is empty)
        log.critical("Error: OpenAI API Key not found. Provide via --api-key, env var, or Kaggle Secrets.")
        raise SystemExit(1)


    # --- Create Output Directory ---
    os.makedirs(args.output_dir, exist_ok=True)
    log.info(f"Using output directory: {args.output_dir}")

    # --- Run the Agent ---
    agent_raw_output = run_reviewer_agent(
        pdf_path=args.pdf_path, # Uses the parsed path from args_list
        openai_api_key=openai_api_key,
        output_dir=args.output_dir,
        llm_model=args.model,
        max_search=args.max_search,
        max_process=args.max_process,
        timeout=args.timeout,
        max_extract_chars=args.max_extract_chars,
        agent_verbose=(not args.quiet)
    )

    
    # --- Process and Display Results ---
    if agent_raw_output:
        print("\n" + "="*30 + " Agent Raw Output " + "="*30)
        print(agent_raw_output)
        print("="*76)

        final_data = None
        try:
            # Attempt to extract a fenced JSON block; the "json" tag and surrounding newlines are optional
            json_match = re.search(r"```(?:json)?\s*(.*?)\s*```", agent_raw_output, re.DOTALL)
            if json_match:
                json_string = json_match.group(1)
            else:
                # Fallback: assume the whole output might be JSON, clean it
                json_string = agent_raw_output.strip().strip('`').strip()

            final_data = json.loads(json_string)

            print("\n\n" + "*"*30 + " Final Structured Results " + "*"*30)

            # Display Manuscript Info
            manuscript_info = final_data.get('manuscript_analysis', {})
            print("\n--- Manuscript Analysis ---")
            print(f"Title: {manuscript_info.get('title', 'N/A')}")
            print("Summary:")
            print(manuscript_info.get('summary', 'N/A'))
            print("\nExtracted Data:")
            ms_data = manuscript_info.get('extracted_data', {})
            if 'error' in ms_data:
                 print(f"  Error extracting manuscript data: {ms_data['error']}")
            elif ms_data:
                 # Create DataFrame for nice printing
                 ms_df = pd.DataFrame([ms_data]) # Needs list of dicts
                 print(ms_df.to_string(index=False))
            else:
                 print("  No manuscript data extracted.")

            # Display Similar Articles Info
            similar_articles = final_data.get('similar_articles_analysis', [])
            print(f"\n--- Similar Articles Analysis ({len(similar_articles)} URLs processed) ---")
            if similar_articles:
                df_similar = pd.DataFrame(similar_articles)
                column_order = ['url', 'status', 'corrosion_inhibition_%', 'impedance_ohm_cm2', 'adhesion_MPa']
                # Ensure columns exist and are ordered correctly
                df_similar = df_similar.reindex(columns=column_order, fill_value='N/A')

                print("\nResults Table:")
                # Use to_markdown for better terminal display
                print(df_similar.to_markdown(index=False))

                # Option: Save to CSV
                csv_output_path = os.path.join(args.output_dir, "similar_articles_data.csv")
                try:
                    df_similar.to_csv(csv_output_path, index=False)
                    print(f"\nSimilar articles data saved to: {csv_output_path}")
                except Exception as csv_e:
                    print(f"\nWarning: Failed to save results to CSV: {csv_e}")

            else:
                print("No data collected from similar articles (or none were processed/found).")

        except json.JSONDecodeError as json_err:
            log.error(f"Failed to parse the agent's final JSON output: {json_err}")
            print("\n--- Error Parsing Agent's Final JSON Output ---")
            print("The agent likely failed to format its response correctly.")
            print("Check the 'Agent Raw Output' above for details.")
        except Exception as display_err:
            log.error(f"An error occurred while displaying results: {display_err}", exc_info=True)
            print(f"\n--- Error Displaying Results ---: {display_err}")
            if final_data: print("\nRaw parsed data:", final_data)

    else:
        print("\n--- Agent execution failed. Check logs for details. ---")
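    # Rich table rendering for notebooks. df_similar exists only if similar articles were
    # parsed, and display() exists only under IPython/Jupyter, so guard against both.
    try:
        display(df_similar)
    except NameError:
        pass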

    print("\n--- Script Execution Finished ---")

# --- END OF FILE reviewer_agent_script.py ---
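
**Sketch: tolerant JSON extraction from the agent's reply**

The result-parsing step in the script expects the agent's final answer to be a fenced JSON block, but LLM replies sometimes drop the fence or wrap the JSON in extra prose. The helper below is a minimal sketch of a more tolerant parser, not part of the original script. The name `extract_json_payload` is hypothetical, and the brace-span fallback is an assumption about how such replies tend to look.

```python
import json
import re
from typing import Any, Optional

# Build the fence marker programmatically so this snippet can itself live inside a fenced block.
FENCE = "`" * 3

# Matches a fenced block with or without the "json" tag; surrounding whitespace is optional.
_FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE, re.DOTALL | re.IGNORECASE)


def extract_json_payload(raw_output: str) -> Optional[Any]:
    """Return the first JSON value that parses from an LLM reply, or None."""
    candidates = []

    # Candidate 1: the contents of the first fenced block, if any.
    fenced = _FENCE_RE.search(raw_output)
    if fenced:
        candidates.append(fenced.group(1))

    # Candidate 2 (assumed fallback): everything from the first "{" to the last "}".
    start, end = raw_output.find("{"), raw_output.rfind("}")
    if start != -1 and end > start:
        candidates.append(raw_output[start:end + 1])

    for candidate in candidates:
        try:
            return json.loads(candidate)
        except json.JSONDecodeError:
            continue  # try the next candidate
    return None


if __name__ == "__main__":
    reply = "Here is the result: " + FENCE + 'json\n{"manuscript_analysis": {}}\n' + FENCE
    print(extract_json_payload(reply))
```

Returning `None` instead of raising lets the caller decide whether to print the raw output, retry the agent, or stop.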



> Entering new AgentExecutor chain...

Invoking: `load_manuscript` with `{'pdf_path': '/kaggle/input/sample-manuscript/Sample-Manuscript.pdf'}`


{'full_text': 'Scrutinizing corrosion inhibition in acid solution, and the effect of inhibitors on the \nperformance of polymer coatings on steel surface \nM. Mahdavian1 \nAbstract: \nThis a sample manuscript created as an instance to use as an input for an GenAI model. The \nGenAI model (gemini-1.5-pro-latest) is going to extract keywords based on corrosion \ninhibitors (%), impedance (ohm.cm2), and adhesion strength (MPa). The title is going to use \nin the top most relevant search results.  \nThe azole-based corrosion inhibitor  in hydrochloric acid solution provides corrosion  \ninhibition of 92% at concentration of 1000 ppm. The epoxy coated acid washed steel samples \nrevealed an impedance of 2×10 8 ohm.cm2 and adhesion strength of 5.2 MPa after 20 days \nsubjection to saline solution. \n \n \n \n

|   | url | status | corrosion_inhibition_% | impedance_ohm_cm2 | adhesion_MPa |
|---|-----|--------|------------------------|-------------------|--------------|
| 0 | https://link.springer.com/article/10.1007/s116... | HTML Scan Failed to Find PDF, Trying HTML Text... | Not Found | Not Found | Not Found |
| 1 | https://www.sciencedirect.com/science/article/... | Direct Download/Access Failed (Network/Request... | No Text Loaded | No Text Loaded | No Text Loaded |
| 2 | https://www.sciencedirect.com/science/article/... | Direct Download/Access Failed (Network/Request... | No Text Loaded | No Text Loaded | No Text Loaded |
| 3 | https://link.springer.com/article/10.1007/s119... | HTML Scan Failed to Find PDF, Trying HTML Text... | Not Found | Not Found | Not Found |
| 4 | https://pdfs.semanticscholar.org/bc32/93e4080d... | PDF Downloaded & Text Loaded - Extraction Succ... | 99.75 | Not Found | Not Found |
| 5 | https://www.sciencedirect.com/science/article/... | Direct Download/Access Failed (Network/Request... | No Text Loaded | No Text Loaded | No Text Loaded |
| 6 | https://chemistry-europe.onlinelibrary.wiley.c... | Direct Download/Access Failed (Network/Request... | No Text Loaded | No Text Loaded | No Text Loaded |
| 7 | https://www.nature.com/articles/s41598-025-934... | PDF Link in HTML Found, Downloaded & Text Load... | 94.9 | Not Found | Not Found |
| 8 | https://link.springer.com/article/10.1007/s108... | HTML Scan Failed to Find PDF, Trying HTML Text... | Not Found | 3 orders of magnitude higher than EP after 35 ... | Not Found |
| 9 | https://pubs.acs.org/doi/10.1021/acsomega.0c05476 | Direct Download/Access Failed (Network/Request... | No Text Loaded | No Text Loaded | No Text Loaded |



--- Script Execution Finished ---
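
**Sketch: the fallback chain behind the status column**

The status strings in the results come from a staged fallback in `ProcessWebArticleTool`: a direct PDF download, then a PDF linked from the HTML page, then plain HTML text. The sketch below shows that control flow in isolation. It is a simplification under stated assumptions rather than the script's implementation: the real tool reuses one HTTP response across stages, whereas here each stage is a hypothetical callable that takes a URL and returns text or `None`. The function name `first_successful_stage` and the stub fetchers are invented for illustration.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical stage shape: (source_type label, fetcher returning text or None).
Stage = Tuple[str, Callable[[str], Optional[str]]]


def first_successful_stage(url: str, stages: List[Stage]) -> Tuple[Optional[str], str, List[str]]:
    """Try each stage in order; return (text, source_type, attempt log)."""
    attempts: List[str] = []
    for source_type, fetch in stages:
        try:
            text = fetch(url)
        except Exception as err:
            # A failing stage is recorded and skipped; it must not abort the chain.
            attempts.append(f"{source_type}: error ({type(err).__name__})")
            continue
        if text:
            attempts.append(f"{source_type}: ok")
            return text, source_type, attempts
        attempts.append(f"{source_type}: no text")
    return None, "N/A", attempts


# Stub fetchers standing in for real downloads (illustrative only).
def direct_pdf(url: str) -> Optional[str]:
    return None  # simulates "not a PDF"


def pdf_from_html(url: str) -> Optional[str]:
    raise TimeoutError("simulated timeout")


def html_text(url: str) -> Optional[str]:
    return "placeholder article text"


if __name__ == "__main__":
    stages = [
        ("PDF", direct_pdf),
        ("PDF (from HTML link)", pdf_from_html),
        ("HTML", html_text),
    ]
    text, source_type, attempts = first_successful_stage("https://example.org/article", stages)
    print(source_type, attempts)
```

Keeping the stages in a list makes the order explicit and lets a new source, such as a publisher API, be added without deepening the nesting.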
