<a href="https://www.kaggle.com/code/mmahdavian/capstone-raa-project?scriptVersionId=233515412" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

**Reviewer Assistant Agent (RAA) NotebookThis notebook builds an agent to assist reviewers by summarizing manuscriptsand comparing their key findings (corrosion inhibition, impedance, adhesion)with similar open-access articles found via internet search.**

**This notebook implements a Reviewer Assistant Agent (RAA) using the LangChain framework and OpenAI's GPT-4-turbo model. The agent takes a manuscript in PDF format, extracts its text content, summarizes it focusing on specific quantitative results (corrosion inhibition %, impedance ohm.cm2, adhesion MPa), finds similar articles online, extracts the same metrics from those articles, and presents a comparative summary.**

**Goal: To automate parts of the peer-review process by (a) Providing a concise summary of a manuscript's key quantitative findings related to corrosion inhibition, impedance, and adhesion.b) Identifying and extracting comparable results from the top relevant web search results for similar articles, facilitating a quick comparison of the manuscript's contribution against existing literature.**

**Input: PDF manuscript file path, OpenAI API Key**

**Framework: LangChain, OpenAI**

**LLM: gpt-4-turbo, OpenAI**

**Tools: File handling and text extraction, web searching and downloading (googlesearch-python, requests),**

**Outputs: (1) Text summary of the input manuscript highlighting key metrics, (2) Pandas DataFrame containing links and extracted metrics from similar articles.**

In [1]:
# --- 1. Installation ---
# Install necessary libraries (run this cell first in Kaggle)
!pip install -q langchain langchain_openai openai pypdf pandas google-search-results beautifulsoup4 kaggle googlesearch-python langchain_community requests lxml
print("Libraries installed.")

  Preparing metadata (setup.py) ... [?25l[?25hdone
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m61.3/61.3 kB[0m [31m2.2 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m644.4/644.4 kB[0m [31m13.9 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.5/2.5 MB[0m [31m52.3 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.0/1.0 MB[0m [31m38.1 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m423.3/423.3 kB[0m [31m21.0 MB/s[0m eta [36m0:00:00[0m
[?25h  Building wheel for google-search-results (setup.py) ... [?25l[?25hdone
Libraries installed.


In [2]:
# --- 2. Setup and Imports ---
import os
import re
import requests
import time
import pandas as pd
from tempfile import NamedTemporaryFile
from urllib.parse import urljoin, urlparse

from bs4 import BeautifulSoup
from kaggle_secrets import UserSecretsClient

# LangChain
from langchain_openai import ChatOpenAI
from langchain.chains.summarize import load_summarize_chain
from langchain_community.document_loaders import PyPDFLoader
from langchain.prompts import PromptTemplate
from langchain.docstore.document import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Google Search
try:
    from googlesearch import search
except ImportError:
    print("Could not import googlesearch. Please ensure it's installed.")
    search = None

print("Imports successful.")

# --- 3. Configuration ---
try:
    user_secrets = UserSecretsClient()
    OPENAI_API_KEY = user_secrets.get_secret("OpenAI_API_KEY")
    os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
    print("OpenAI API Key retrieved successfully.")
except Exception as e:
    print(f"Error retrieving OpenAI API Key from Kaggle Secrets: {e}")
    OPENAI_API_KEY = None

# Constants
PDF_MANUSCRIPT_PATH = "/kaggle/input/sample-manuscript/Sample-Manuscript.pdf" # <-- Verify
LLM_MODEL = "gpt-4-turbo"
MAX_SEARCH_RESULTS = 20
DOWNLOAD_TIMEOUT = 25 # Slightly longer timeout
DOWNLOAD_DIR = "/kaggle/working/temp_pdfs"
HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'}
# Max characters to send to LLM for extraction task (balance context vs token limits)
MAX_CHARS_FOR_EXTRACTION = 25000

# --- 4. Initialize LLM ---
if OPENAI_API_KEY:
    llm = ChatOpenAI(temperature=0, model_name=LLM_MODEL, openai_api_key=OPENAI_API_KEY, request_timeout=120) # Increased LLM timeout
    print(f"LLM ({LLM_MODEL}) initialized.")
else:
    llm = None
    print("LLM could not be initialized.")

# --- 5. Helper Functions ---

def load_pdf_text(pdf_path):
    """Loads text content from a PDF file. Returns (full_text, docs_list)"""
    try:
        print(f"Loading PDF: {pdf_path}")
        loader = PyPDFLoader(pdf_path, extract_images=False) # Ignore images
        docs = loader.load()
        if not docs:
             print(f"Warning: PyPDFLoader loaded 0 pages from {pdf_path}.")
             return None, None
        # Clean potentially problematic characters before joining
        cleaned_pages = []
        for doc in docs:
             # Replace null bytes and normalize whitespace
             cleaned_content = doc.page_content.replace('\x00', '').strip()
             cleaned_pages.append(cleaned_content)

        full_text = "\n".join(filter(None, cleaned_pages)) # Join non-empty pages

        if not full_text:
            print(f"Warning: PDF loaded but resulted in empty text content after cleaning: {pdf_path}")
            return None, docs # Return docs even if text is empty

        print(f"Successfully loaded and cleaned text from: {pdf_path} ({len(docs)} pages, {len(full_text)} chars)")
        return full_text, docs
    except Exception as e:
        print(f"Error loading PDF {pdf_path}: {type(e).__name__} - {e}")
        return None, None

def extract_text_from_html(html_content):
    """Extracts plain text from HTML content."""
    try:
        soup = BeautifulSoup(html_content, 'lxml')
        # Remove script and style elements
        for script_or_style in soup(["script", "style"]):
            script_or_style.decompose()
        # Get text, separate paragraphs, and clean up whitespace
        text = soup.get_text(separator='\n', strip=True)
        # Optional: Further cleaning (e.g., removing excessive blank lines)
        text = re.sub(r'\n\s*\n', '\n\n', text)
        print(f"Extracted ~{len(text)} characters of text from HTML.")
        return text
    except Exception as e:
        print(f"Error extracting text from HTML: {e}")
        return None

def extract_manuscript_title(pdf_docs):
    """Extracts manuscript title using LLM or filename fallback."""
    fallback_title = "Unknown Manuscript Title"
    if PDF_MANUSCRIPT_PATH and os.path.exists(PDF_MANUSCRIPT_PATH):
        base_name = os.path.basename(PDF_MANUSCRIPT_PATH)
        title, _ = os.path.splitext(base_name)
        fallback_title = title.replace('-', ' ').replace('_', ' ') # Use filename as primary fallback
        print(f"Prepared fallback title from filename: {fallback_title}")

    if not llm or not pdf_docs:
        print("LLM not available or no document content for title extraction. Using fallback.")
        return fallback_title

    text_splitter = RecursiveCharacterTextSplitter(chunk_size=4000, chunk_overlap=200)
    split_docs = text_splitter.split_documents(pdf_docs[:5]) # Use first few pages

    prompt_template = """
    Based on the following text from the beginning of a scientific manuscript, what is its exact full title? Only return the title itself, nothing else.

    Text:
    "{text}"

    Title:"""
    prompt = PromptTemplate(template=prompt_template, input_variables=["text"])

    if split_docs:
        try:
            context_text = "\n".join([doc.page_content for doc in split_docs])[:4000]
            title_query = prompt.format(text=context_text)
            response = llm.invoke(title_query)
            title = response.content.strip().strip('"')
            if 5 < len(title) < 300:
                print(f"Extracted title via LLM: {title}")
                return title
            else:
                 print(f"LLM extraction yielded unusual title. Using fallback: {fallback_title}")
                 return fallback_title
        except Exception as e:
            print(f"Error during LLM title extraction: {e}. Using fallback: {fallback_title}")
            return fallback_title
    else:
         print("No text chunks available for LLM title extraction. Using fallback.")
         return fallback_title


def summarize_and_extract_manuscript_data(docs, full_text):
    """Summarizes the manuscript and extracts key data using LLM."""
    if not llm: return "LLM not available.", {}
    if not docs and not full_text: return "No content provided.", {}

    summary = "Summary generation failed."
    extracted_values = {}

    # --- Summarization ---
    if docs:
        try:
            print("Generating summary...")
            summary_chain = load_summarize_chain(llm, chain_type="map_reduce")
            text_splitter = RecursiveCharacterTextSplitter(chunk_size=10000, chunk_overlap=500)
            split_docs = text_splitter.split_documents(docs)
            if split_docs:
                summary_result = summary_chain.invoke(split_docs)
                summary = summary_result.get("output_text", summary)
                print("Summary generated.")
            else:
                 summary = "Document splitting failed for summary."
        except Exception as e:
            print(f"Error during summarization: {e}")
            summary = f"Error during summarization: {e}"
    else:
        summary = "Original document list not available for summarization."


    # --- Extraction ---
    if full_text:
        try:
            print("Extracting specific data points from manuscript text...")
            # Use the dedicated extraction function
            extracted_values = extract_data_from_text(full_text, source_type="Manuscript PDF")
        except Exception as e:
            print(f"Error during manuscript data extraction: {e}")
            # Ensure default values if extraction fails completely
            extracted_values = {
                'corrosion_inhibition_%': 'Extraction Error',
                'impedance_ohm_cm2': 'Extraction Error',
                'adhesion_MPa': 'Extraction Error'
                }
    else:
        print("No full text available for manuscript data extraction.")
        extracted_values = {key: 'No Text' for key in ['corrosion_inhibition_%', 'impedance_ohm_cm2', 'adhesion_MPa']}


    return summary, extracted_values


def extract_data_from_text(text_content, source_type="Unknown Source"):
    """Extracts key data points from provided text using LLM."""
    if not llm:
        print(f"LLM not available for data extraction from {source_type}.")
        return {}
    if not text_content:
        print(f"No text content provided for extraction from {source_type}.")
        return {}

    # --- Updated Extraction Prompt ---
    extraction_prompt_template = """
    Analyze the following text from a scientific paper ({source_type}). Extract the following specific quantitative data points if present. Look carefully throughout the text, especially in results, discussion, and conclusion sections. If a value is explicitly mentioned, report it. If not found, state 'Not Found'. Provide only the numerical value (e.g., 95.5, 1.2E6, 15.3) or 'Not Found'.

    1.  **Corrosion Inhibition Efficiency (%):** Look for 'Corrosion Inhibition Efficiency', 'Inhibition Efficiency', or 'IE%'. Report the highest or most representative value. [Value or 'Not Found']
    2.  **Impedance Magnitude (ohm.cm2 or ꭥ.cm2):** Look for impedance values, often associated with EIS results (e.g., Z_real, |Z|, R_ct, R_p) reported in Ohm.cm² (or kohm.cm², Mohm.cm² - convert if possible, otherwise state units). Report a key value (e.g., low frequency, after exposure). [Value (specify units if not ohm.cm2) or 'Not Found']
    3.  **Adhesion Strength (MPa):** Look for results from pull-off, scratch, or similar adhesion tests reported in MPa. [Value or 'Not Found']

    Format the output clearly, one item per line:
    Corrosion Inhibition (%): [Value or 'Not Found']
    Impedance (ohm.cm2): [Value or 'Not Found']
    Adhesion (MPa): [Value or 'Not Found']

    Text Snippet:
    "{text}"

    Extracted Data:
    """
    extraction_prompt = PromptTemplate(
        template=extraction_prompt_template,
        input_variables=["text", "source_type"]
    )

    try:
        print(f"Requesting LLM extraction from {source_type}...")
        # Limit text sent to LLM to manage token usage/cost
        text_snippet = text_content[:MAX_CHARS_FOR_EXTRACTION]

        extraction_query = extraction_prompt.format(text=text_snippet, source_type=source_type)
        extraction_response = llm.invoke(extraction_query)
        extracted_data_text = extraction_response.content
        print(f"Extraction response received from LLM for {source_type}.")

        # Parse the LLM's response
        extracted_values = parse_llm_extraction(extracted_data_text)
        return extracted_values

    except Exception as e:
        print(f"Error during LLM data extraction from {source_type}: {e}")
        # Return dict indicating error for this source
        return {
            'corrosion_inhibition_%': f'LLM Error ({source_type})',
            'impedance_ohm_cm2': f'LLM Error ({source_type})',
            'adhesion_MPa': f'LLM Error ({source_type})'
        }

def search_similar_articles(query, num_results):
    """Performs Google search (no date filter)."""
    if not search: return []
    urls = []
    print(f"Searching Google for: '{query}' (Top {num_results})")
    try:
        search_results = search(query, num_results=num_results, lang="en") # No pause or tbs
        urls = list(search_results)
        print(f"Found {len(urls)} potential URLs.")
    except Exception as e:
        print(f"Error during Google search: {e}")
        if "429" in str(e): print("Google search blocked (429).")
        urls = []
    return urls

def download_pdf(url, save_dir):
    """Attempts direct PDF download. Returns (pdf_path, response_object_or_None)"""
    os.makedirs(save_dir, exist_ok=True)
    pdf_path = None
    response = None
    try:
        print(f"Attempting direct PDF download from: {url}")
        response = requests.get(url, stream=True, timeout=DOWNLOAD_TIMEOUT, headers=HEADERS, allow_redirects=True)
        response.raise_for_status()
        content_type = response.headers.get('Content-Type', '').lower()
        is_pdf_content = 'application/pdf' in content_type or 'pdf' in urlparse(url).path.lower()

        if is_pdf_content and response.status_code == 200:
            if int(response.headers.get('content-length', 1)) == 0:
                 print(f"Warning: Content-Length is 0 for {url}. Skipping download.")
                 return None, response

            with NamedTemporaryFile(delete=False, suffix=".pdf", dir=save_dir) as temp_file:
                bytes_downloaded = 0
                for chunk in response.iter_content(chunk_size=8192):
                    temp_file.write(chunk)
                    bytes_downloaded += len(chunk)
                pdf_path = temp_file.name

            if bytes_downloaded < 1024:
                 print(f"Warning: Downloaded PDF {pdf_path} is small ({bytes_downloaded} bytes).")

            try:
                 loader_check = PyPDFLoader(pdf_path, extract_images=False)
                 loader_check.load() # Verify it's loadable
                 print(f"Successfully downloaded and verified PDF ({bytes_downloaded} bytes) to: {pdf_path}")
                 return pdf_path, response
            except Exception as pdf_err:
                 print(f"Downloaded file {pdf_path} failed verification: {pdf_err}. Removing.")
                 if os.path.exists(pdf_path): os.remove(pdf_path)
                 return None, response # Return None path, but keep response obj
        else:
            print(f"URL does not appear to be a direct PDF. Status: {response.status_code}, Content-Type: {content_type}")
            return None, response

    except requests.exceptions.Timeout:
        print(f"Timeout occurred downloading {url}")
    except requests.exceptions.RequestException as e:
        print(f"Failed download/access for {url}: {e}")
    except Exception as e:
        print(f"Unexpected error during direct download {url}: {e}")
    # Return None path, but the response object if we got one
    return None, response


def find_and_download_pdf_from_html(page_url, page_content, save_dir):
    """Parses HTML to find PDF links and attempts download."""
    print(f"Scanning HTML from {page_url} for PDF links.")
    soup = BeautifulSoup(page_content, 'lxml')
    pdf_links = []
    found_links = set()

    for a_tag in soup.find_all('a', href=True):
        href = a_tag['href'].strip()
        href_lower = href.lower()
        link_text = a_tag.get_text(strip=True).lower()

        is_potential_pdf_link = (
            href_lower.endswith('.pdf') or
            '/pdf' in href_lower or # More general path check
            'download=pdf' in href_lower or
            'downloadpdf' in href_lower or
            'epdf' in href_lower or # Common for ePDFs
            link_text in ['pdf', 'full text pdf', 'download pdf', '[pdf]', 'article pdf'] or
            (a_tag.get('class') and any('pdf' in c.lower() for c in a_tag.get('class')))
           )

        if is_potential_pdf_link:
            pdf_url = urljoin(page_url, href)
            parsed_url = urlparse(pdf_url)
            if parsed_url.scheme in ['http', 'https'] and not pdf_url.lower().startswith('javascript:'):
                if pdf_url != page_url and pdf_url not in found_links:
                    # Basic check to avoid linking to abstract/html pages named 'pdf'
                    if '.pdf' in pdf_url or 'download' in pdf_url.lower() or 'content' in pdf_url.lower():
                         print(f"Found potential PDF link: {pdf_url}")
                         pdf_links.append(pdf_url)
                         found_links.add(pdf_url)

    # Prioritize links ending directly in .pdf
    pdf_links.sort(key=lambda x: not x.lower().endswith('.pdf'))

    if not pdf_links:
        print("No likely PDF links found in HTML.")
        return None

    print(f"Found {len(pdf_links)} potential links. Attempting download...")
    for pdf_url in pdf_links:
        time.sleep(0.5)
        print(f"Attempting download from HTML link: {pdf_url}")
        downloaded_path, _ = download_pdf(pdf_url, save_dir)
        if downloaded_path:
            print(f"Successfully downloaded PDF from HTML link: {pdf_url}")
            return downloaded_path

    print("Tried all potential PDF links from HTML, none worked.")
    return None

def parse_llm_extraction(llm_output_text):
    """Parses the LLM text output using regex for key data."""
    data = {
        'corrosion_inhibition_%': 'Not Found',
        'impedance_ohm_cm2': 'Not Found',
        'adhesion_MPa': 'Not Found'
    }
    # Adjusted Regex to be slightly more flexible and capture units for impedance
    patterns = {
        'corrosion_inhibition_%': r"Corrosion Inhibition \(%\):\s*([\d\.]+%?|Not Found|N/A|-)",
        'impedance_ohm_cm2': r"Impedance \((?:ohm\.cm2|ꭥ\.cm2)\):\s*(.*?)(?:\n|Adhesion|$)", # Capture value potentially with units
        'adhesion_MPa': r"Adhesion \(MPa\):\s*([\d\.]+|Not Found|N/A|-)"
    }

    for key, pattern in patterns.items():
        match = re.search(pattern, llm_output_text, re.IGNORECASE | re.MULTILINE | re.DOTALL)
        if match:
            value = match.group(1).strip()
            if value.lower() in ['not found', 'n/a', '-']:
                data[key] = 'Not Found'
            else:
                # For impedance, keep the raw value as LLM might include units
                if key == 'impedance_ohm_cm2':
                     # Basic cleanup of impedance value
                     value = re.sub(r'\s+', ' ', value).strip('., ')
                     if not value: # If value becomes empty after cleanup
                          data[key] = 'Not Found'
                     else:
                          data[key] = value # Keep potentially complex value like "1.5 Mohm.cm2" or "15000"
                else: # For % and MPa, extract number
                    num_match = re.search(r"([-+]?\d*\.?\d+(?:[eE][-+]?\d+)?)", value)
                    if num_match:
                        data[key] = num_match.group(1).strip()
                    else:
                        data[key] = 'Not Found (parse failed)'
                        print(f"Info: Regex matched for {key} but couldn't parse number from '{value}'")

    print(f"Parsed LLM Output: {data}")
    return data


# --- 6. Main Execution ---

if __name__ == "__main__":
    if not llm:
        print("Execution cannot proceed: LLM not initialized.")
    else:
        # --- Part 1: Process the Manuscript ---
        print("\n--- Processing Manuscript ---")
        manuscript_full_text, manuscript_docs = load_pdf_text(PDF_MANUSCRIPT_PATH)
        manuscript_summary = "Summary N/A"
        manuscript_data = {}
        manuscript_title = "Title N/A"

        if manuscript_docs or manuscript_full_text:
            # Extract title (needs docs ideally, but pass None if only text exists)
            manuscript_title = extract_manuscript_title(manuscript_docs if manuscript_docs else None)
            # Summarize and extract data (pass both docs and full_text)
            manuscript_summary, manuscript_data = summarize_and_extract_manuscript_data(manuscript_docs, manuscript_full_text)
        else:
            print(f"Critical Failure: Could not load text or docs from manuscript: {PDF_MANUSCRIPT_PATH}")
            # Attempt title extraction from filename as last resort
            if PDF_MANUSCRIPT_PATH and os.path.exists(PDF_MANUSCRIPT_PATH):
                 base_name = os.path.basename(PDF_MANUSCRIPT_PATH)
                 title_f, _ = os.path.splitext(base_name)
                 manuscript_title = title_f.replace('-', ' ').replace('_', ' ')
                 print(f"Using filename as title due to load failure: {manuscript_title}")
            else:
                 manuscript_title = "Unknown Manuscript Title"


        print("\n--- Manuscript Summary ---")
        print(manuscript_summary)
        print("\n--- Manuscript Extracted Data ---")
        manuscript_df_data = {k: [v] for k, v in manuscript_data.items()} # Wrap values in lists for DataFrame
        print(pd.DataFrame(manuscript_df_data))

        # --- Part 2: Find and Process Similar Articles ---
        print(f"\n--- Searching for Similar Articles (Title: '{manuscript_title}') ---")
        if manuscript_title.startswith("Unknown") or manuscript_title == "Title N/A":
             print("Skipping search due to missing/invalid manuscript title.")
             similar_article_urls = []
        else:
             similar_article_urls = search_similar_articles(manuscript_title, MAX_SEARCH_RESULTS)


        similar_articles_data = []
        processed_url_count = 0
        successful_extractions = 0 # Count URLs where data extraction was attempted

        os.makedirs(DOWNLOAD_DIR, exist_ok=True)

        if similar_article_urls:
            for url in similar_article_urls:
                processed_url_count += 1
                print(f"\n--- Processing URL ({processed_url_count}/{len(similar_article_urls)}): {url} ---")

                pdf_path = None
                response_obj = None
                article_text = None
                extracted_data = {}
                status = "Processing Failed"
                source_type = "N/A" # Where did we get the text for extraction?

                try:
                    # Attempt 1: Direct Download PDF
                    pdf_path, response_obj = download_pdf(url, DOWNLOAD_DIR)

                    if pdf_path:
                        # Attempt 1a: Load text from downloaded PDF
                        article_text, _ = load_pdf_text(pdf_path)
                        if article_text:
                            status = "PDF Downloaded & Text Loaded"
                            source_type = "PDF"
                        else:
                             status = "PDF Downloaded but Text Extraction Failed"
                             # Keep pdf_path for cleanup, but article_text is None
                    else:
                        # Attempt 2: Check if it was HTML
                        if response_obj and 'text/html' in response_obj.headers.get('Content-Type', '').lower():
                            html_content = response_obj.text
                            if len(html_content) > 200: # Basic check for valid HTML page
                                print("Direct download failed, content is HTML.")
                                status = "HTML Page Found"
                                # Attempt 2a: Find and download PDF from HTML
                                pdf_path_from_html = find_and_download_pdf_from_html(url, html_content, DOWNLOAD_DIR)
                                if pdf_path_from_html:
                                    pdf_path = pdf_path_from_html # Assign to main pdf_path for cleanup
                                    article_text, _ = load_pdf_text(pdf_path)
                                    if article_text:
                                        status = "PDF Link in HTML Found, Downloaded & Text Loaded"
                                        source_type = "PDF (from HTML link)"
                                    else:
                                        status = "PDF via HTML Downloaded but Text Extraction Failed"
                                else:
                                    # Attempt 2b: FALLBACK - Extract text directly from HTML
                                    print("Could not download PDF from HTML links. Attempting extraction from HTML text.")
                                    status = "HTML Scan Failed to Find PDF, Trying HTML Text Extraction"
                                    article_text = extract_text_from_html(html_content)
                                    if article_text:
                                        source_type = "HTML"
                                    else:
                                        status = "HTML Scan Failed, HTML Text Extraction Failed"
                            else:
                                status = "HTML Page Found, but Content too small (Error Page?)"
                        elif response_obj:
                             status = f"Direct Download Failed (Status: {response_obj.status_code}, Type: {response_obj.headers.get('Content-Type', 'N/A')})"
                        else:
                             status = "Direct Download Failed (Network/Request Error)"


                    # If we managed to get text (from PDF or HTML), try extracting data
                    if article_text:
                        extracted_data = extract_data_from_text(article_text, source_type=source_type)
                        successful_extractions += 1
                        # Refine status based on extraction result
                        if any(v not in ['Not Found', 'N/A', 'Extraction Error', 'No Text', 'LLM Error'] for v in extracted_data.values()):
                             status += " - Data Extraction Successful"
                        else:
                             status += " - Data Extraction Attempted (No Values Found by LLM)"
                    else:
                         # If no text was extracted, create empty data structure
                         extracted_data = {key: 'No Text Loaded' for key in ['corrosion_inhibition_%', 'impedance_ohm_cm2', 'adhesion_MPa']}


                except Exception as process_err:
                     print(f"Unexpected error processing URL {url}: {process_err}")
                     status = "Unexpected Error During Processing"
                     extracted_data = {key: 'Processing Error' for key in ['corrosion_inhibition_%', 'impedance_ohm_cm2', 'adhesion_MPa']}

                finally:
                    # Record the outcome
                    result = {
                        'url': url,
                        'status': status,
                        # Merge extracted data, ensuring keys exist
                        **{key: extracted_data.get(key, 'N/A') for key in ['corrosion_inhibition_%', 'impedance_ohm_cm2', 'adhesion_MPa']}
                    }
                    similar_articles_data.append(result)

                    # Clean up downloaded temporary file (PDF) if it exists
                    if pdf_path and os.path.exists(pdf_path):
                        try:
                            os.remove(pdf_path)
                            print(f"Removed temporary file: {pdf_path}")
                        except OSError as e:
                            print(f"Error removing temporary file {pdf_path}: {e}")

                # Politeness delay
                time.sleep(2.0) # Slightly increased delay

        else:
            print("No similar article URLs found or search failed/skipped.")

        # --- Part 3: Display Results ---
        print("\n\n--- Final Results ---")
        print("\n--- Manuscript Summary & Data ---")
        print(f"Title: {manuscript_title}")
        print("Summary:")
        print(manuscript_summary)
        print("\nData:")
        print(pd.DataFrame({k: [v] for k, v in manuscript_data.items()}))

        print(f"\n--- Similar Articles Data ({successful_extractions} URLs processed for extraction out of {processed_url_count} URLs checked) ---")
        if similar_articles_data:
            df_similar_articles = pd.DataFrame(similar_articles_data)
            column_order = ['url', 'status', 'corrosion_inhibition_%', 'impedance_ohm_cm2', 'adhesion_MPa']
            df_similar_articles = df_similar_articles.reindex(columns=column_order, fill_value='N/A') # Ensure all columns exist & order

            print("Results Table (Markdown):")
            print(df_similar_articles.to_markdown(index=False))
            try:
                from IPython.display import display
                print("\nDataFrame View:")
                display(df_similar_articles)
            except ImportError: pass

            # Optional: Save to CSV
            # csv_output_path = "/kaggle/working/similar_articles_data.csv"
            # df_similar_articles.to_csv(csv_output_path, index=False)
            # print(f"\nSimilar articles data saved to {csv_output_path}")
        else:
            print("No data collected from similar articles.")

        # Optional: Cleanup Temp Dir
        # import shutil
        # if os.path.exists(DOWNLOAD_DIR): shutil.rmtree(DOWNLOAD_DIR)


print("\n--- Script Execution Finished ---")

Imports successful.
OpenAI API Key retrieved successfully.
LLM (gpt-4-turbo) initialized.

--- Processing Manuscript ---
Loading PDF: /kaggle/input/sample-manuscript/Sample-Manuscript.pdf
Successfully loaded and cleaned text from: /kaggle/input/sample-manuscript/Sample-Manuscript.pdf (1 pages, 881 chars)
Prepared fallback title from filename: Sample Manuscript
Error during LLM title extraction: Error code: 429 - {'error': {'message': 'You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors.', 'type': 'insufficient_quota', 'param': None, 'code': 'insufficient_quota'}}. Using fallback: Sample Manuscript
Generating summary...
Error during summarization: Error code: 429 - {'error': {'message': 'You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guid

Unnamed: 0,url,status,corrosion_inhibition_%,impedance_ohm_cm2,adhesion_MPa
0,/search?num=22,Direct Download Failed (Network/Request Error),No Text Loaded,No Text Loaded,No Text Loaded
1,https://ijcsrr.org/manuscript-template/,"HTML Scan Failed to Find PDF, Trying HTML Text...",LLM Error (HTML),LLM Error (HTML),LLM Error (HTML)
2,https://journals.lww.com/greenjournal/Document...,PDF Downloaded & Text Loaded - Data Extraction...,LLM Error (PDF),LLM Error (PDF),LLM Error (PDF)
3,https://www.medicinearticle.com/JMR_Manuscript...,"Direct Download Failed (Status: 200, Type: app...",No Text Loaded,No Text Loaded,No Text Loaded
4,http://web.mit.edu/kjb/mitconf/Sample_Manuscri...,PDF Downloaded & Text Loaded - Data Extraction...,LLM Error (PDF),LLM Error (PDF),LLM Error (PDF)
5,https://www.overleaf.com/latex/templates/assoc...,"PDF Link in HTML Found, Downloaded & Text Load...",LLM Error (PDF (from HTML link)),LLM Error (PDF (from HTML link)),LLM Error (PDF (from HTML link))
6,https://journal.naturalhistoryinstitute.org/wp...,Direct Download Failed (Network/Request Error),No Text Loaded,No Text Loaded,No Text Loaded
7,https://spie.org/documents/Publications/ProcSP...,PDF Downloaded & Text Loaded - Data Extraction...,LLM Error (PDF),LLM Error (PDF),LLM Error (PDF)
8,https://apastyle.apa.org/style-grammar-guideli...,"HTML Scan Failed, HTML Text Extraction Failed",No Text Loaded,No Text Loaded,No Text Loaded
9,https://www.ieee.org/conferences/publishing/te...,"PDF Link in HTML Found, Downloaded & Text Load...",LLM Error (PDF (from HTML link)),LLM Error (PDF (from HTML link)),LLM Error (PDF (from HTML link))



--- Script Execution Finished ---
