<a href="https://colab.research.google.com/github/punnoose-1620/masters-thesis-sensor-data/blob/main/LiteratureReviewHelper.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Perform relevance analysis for all papers related to an idea

### Expected File Structure
```
papers_folder
  |---subfolder
      |---paper1
      |---paper2
      |---notepad
```

## Imports and Installs

In [16]:
!pip install google-generativeai pdfplumber tqdm pydantic dotenv

Defaulting to user installation because normal site-packages is not writeable
Collecting dotenv
  Downloading dotenv-0.9.9-py2.py3-none-any.whl.metadata (279 bytes)
Collecting python-dotenv (from dotenv)
  Downloading python_dotenv-1.2.1-py3-none-any.whl.metadata (25 kB)
Downloading dotenv-0.9.9-py2.py3-none-any.whl (1.9 kB)
Downloading python_dotenv-1.2.1-py3-none-any.whl (21 kB)
Installing collected packages: python-dotenv, dotenv


Successfully installed dotenv-0.9.9 python-dotenv-1.2.1





In [17]:
from pydantic import BaseModel, ConfigDict
import google.generativeai as genai
# from google.colab import userdata
from typing import Type, Optional
from dotenv import load_dotenv

from tqdm import tqdm
import pdfplumber
import json
import os

## Declare classes for return types

In [7]:
class Status(BaseModel):
    status:str
    date:str

In [8]:
class ContextualSummary(BaseModel):
  summary: str
  abstract: str
  conclusion: str
  published: bool
  status_updates: list[Status]

In [9]:
class Author(BaseModel):
  name: str
  institution: str

In [None]:
class Paper(BaseModel):
  title: str
  methodology: str
  relevance: float
  relevant_pages: list
  conflict: str
  difference: str
  citation: str
  gap: str
  paperType: str
  authors: list[Author]

  model_config = ConfigDict(extra='allow')

## Declare Static Queries

In [None]:
CITATION_MAPS = """
Citations must follow one of the following patterns based on the current/latest status of the paper : 
1. Pre-Print: A. Author, "Title," arXiv, vol. abs/1234.567, 2026.
2. In-Press/Accepted : B. Author, "Title," Journal Name, to be published.
3. Submitted : C. Author, "Title," submitted for publication.
4. Published in Journal : J. K. Author, "Name of paper," Abbrev. Title of Periodical, vol. x, no. x, pp. xxx–xxx, Abbrev. Month, year, doi: xxx.
5. Published in Conference : J. K. Author, "Name of paper," in Abbrev. Name of Conf., City of Conf., Abbrev. State (if relevant), year, pp. xxx–xxx.
"""

In [None]:
CLASS_DETAILS = {
    "title": "Title of the Paper",
    "methodology": "What is done in the paper and how it is done, including relevant technical details. Use proper technical terms on what methods are used. Do not miss technical keywords or details.",
    "relevance": "Relevance score (0-1) for how relevant this paper is to our context.",
    "relevant_pages": "List of pages that have content relevant to our topic.",
    "conflict": "How this paper conflicts with our idea",
    "difference": "How this paper is different from our idea",
    "gap" : "The project/literature gap found by this paper in regards to this domain.",
    "citation": "String to cite this paper",
    "paperType": "What type of paper is this (qualitative/quantitative)?",
    "authors": [
        {
            "name": "Author Name",
            "institution": "Institution of Author"
        }
    ]
}
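Further down, `SYSTEM_QUERY_RELEVANCE` embeds this dictionary directly in an f-string. Python renders a dict placed in an f-string with its `repr`, so the prompt carries single-quoted keys instead of JSON. The sketch below contrasts that with `json.dumps`. The `class_details` name and its shortened contents are stand-ins for illustration, and whether the difference affects the model's answers has not been measured here.

```python
import json

# Shortened stand-in for CLASS_DETAILS (hypothetical, for illustration only)
class_details = {"title": "Title of the Paper", "authors": [{"name": "Author Name"}]}

# Direct interpolation uses the dict's repr: single quotes, not valid JSON
as_repr = f"Output format:\n{class_details}"

# json.dumps produces double-quoted, indented JSON text
as_json = f"Output format:\n{json.dumps(class_details, indent=2)}"

print(as_repr)
print(as_json)
```

Because the request already sets `response_schema`, this mainly affects how readable the instruction text is.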

In [12]:
SYSTEM_QUERY_SUMMARIZER = """
You are an academically profound individual well versed in the domain of the reference paper. 
Your only task is to summarise all papers given to you. 
Do not miss any technical terms that might be relevant to this domain. 
Isolate Abstract, Conclusion and Status from this paper.
"""

SYSTEM_QUERY_RELEVANCE = f"""
You are an academically profound individual well versed in the domain of both papers. 
Do not miss any technical terms that might be relevant to this domain. 
Output must be Strictly in this format :
{CLASS_DETAILS}
"""

## Declare Model Values

### Declare Model Names

In [13]:
SUMMARIZATION_MODEL = "gemini-2.5-flash-lite"
RELEVANCE_MODEL = "gemini-2.5-pro"
MODEL_API_KEY = None

### Populate Model API Key based on Colab/Local Environment

In [21]:
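load_dotenv()
try:
    # google.colab only exists inside Colab; the import fails elsewhere
    from google.colab import userdata
    MODEL_API_KEY = userdata.get("GOOGLE_API_KEY")
except Exception:
    pass  # Not running in Colab, or the secret is not set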

if not MODEL_API_KEY:
    MODEL_API_KEY = os.getenv("GOOGLE_API_KEY")
#print(f"Model Key : {MODEL_API_KEY}\nKey in Env : {os.getenv('GOOGLE_API_KEY')}")

## Configure API Key for LLM

In [22]:
if MODEL_API_KEY is not None:
    print('Model API Key has been initialized....')
else:
    print('Unable to initialize Model API Key')
    raise RuntimeError("GOOGLE_API_KEY not found in Colab userdata or environment variables")
genai.configure(api_key=MODEL_API_KEY)

Model API Key has been initialized....


## Declare Project Based Values

### Folder Path

In [23]:
FOLDER_PATH = "Literature_Review_Papers/"

### Project Context/Description

In [34]:
PROJECT_CONTEXT_MINIATURE = """
This project proposes a stakeholder-facing chatbot built on a hybrid LLM-deterministic architecture. The system is designed to answer user queries by combining probabilistic language understanding with controlled, rule-based execution. The primary goal is to maintain interpretability, predictability, and low hallucination risk while still benefiting from large language models.

A user submits a natural language query, which is first converted into a structured input representation. This representation includes the raw query text, metadata such as domain, geography, and time scope, and explicit indicators of ambiguity and confidence expectations. This structured form provides semantic grounding and supports comparison with intent-based and slot-filling dialogue systems.

Intent identification is handled by a large language model such as Gemini, GPT, or Claude. The model is used only to classify intent and related attributes, not to select tools or execute actions. Its output is a structured intent label with confidence and grouping information, enabling downstream deterministic processing.

The identified intent is passed to a rule-based function selector. This component maps intents to predefined backend functions using a static function map. Each function definition specifies required and optional inputs, domain constraints, and execution preconditions. If preconditions are not satisfied, the system triggers a fallback path that requests clarification from the user. This ensures that critical decisions remain deterministic and auditable.

Once a function is selected, a code executor retrieves and processes data from structured sources such as databases or domain-specific datasets. The executor applies business logic and returns raw, structured outputs. No natural language generation occurs at this stage, and all data access is explicitly defined.

The raw execution results are then passed to a second LLM acting as a data interpreter. This model translates structured outputs into human-readable responses tailored to the original user query. The interpreter has no control over execution and operates only on provided data and context, reducing the risk of fabricated information.

The final response is delivered to the end user or integrated into external platforms such as Microsoft Teams. The architecture enforces a clear separation between understanding, decision-making, execution, and explanation, making the system comparable to task-oriented dialogue systems, hybrid neuro-symbolic approaches, and modern LLM agent frameworks while remaining controlled and explainable.
"""

In [24]:
PROJECT_CONTEXT = """
1. Overall System Goal
The proposed system is a stakeholder-facing chatbot designed to answer user queries by combining:
- LLM-based intent understanding
- Deterministic function selection
- Structured data retrieval and execution
- LLM-based result interpretation
The core design principle is to separate reasoning from execution, ensuring:
- Predictability and controllability
- Reduced hallucination risk
- Easier comparison with traditional dialog systems and agent-based LLM workflows

2. High-Level Workflow Overview
- An end user (stakeholder) submits a natural language query.
- The query is first processed by an Intent Identifier (LLM).
- Identified intent is passed to a deterministic Function Selector.
- A Code Executor retrieves and processes relevant data from structured sources.
- Raw outputs are sent to a Data Interpreter (LLM).
- The interpreted response is returned to the user or optionally integrated with external platforms (e.g., Microsoft Teams).
Purpose
- Enables explicit semantic grounding
- Allows comparison with:
   - Slot-filling approaches
   - Semantic frame parsing
   - Ontology-driven dialog systems
- Supports ambiguity detection and confidence-aware handling

3. Intent Identification Layer (LLM)
Role
The Intent Identifier uses a Large Language Model (e.g., Gemini, GPT, Claude) to:
- Classify the user's intent
- Assign confidence levels
- Group intents into higher-level categories
Key Characteristics
- Used only for semantic understanding, not execution
- Model choice is configurable based on empirical benchmarking
- Output is structured, not free-form text
Research Comparison Angle
Comparable to:
- Neural intent classifiers
- Zero-shot / few-shot intent detection
- LLM-based semantic parsing (without tool execution)

4. Deterministic Function Selector
Role
The Function Selector is a non-LLM, rule-based component that:
- Maps detected intents to predefined backend functions
- Validates required inputs and preconditions
- Decides whether execution is possible or clarification is required

5. Code Executor and Data Layer
Code Executor
Responsible for:
- Executing the selected function
- Fetching data from structured sources
- Applying business logic or transformations
Data Sources
- Databases
- Predefined datasets
- Domain-specific structured data
These are represented as a Map of Available Data, ensuring:
- Explicit data provenance
- No implicit model assumptions
Output
- Produces raw, structured data
- No natural language generation occurs at this stage

6. Data Interpretation Layer (LLM)
Role
The Data Interpreter (LLM) converts raw execution results into:
- Human-readable explanations
- Summaries
- Contextual insights appropriate for stakeholders
Characteristics
- No access to execution logic
- Operates only on:
   - Raw data
   - Original user query
   - Optional metadata
Research Alignment
Comparable to:
- Neural natural language generation layers
- Explainable AI interfaces
- Post-hoc summarization systems

8. Response Delivery and Integration
- The final response is returned to the end user
- Optionally routed through external integrations (e.g., Microsoft Teams)
- Maintains separation between:
   - Internal system logic
   - Presentation layer

9. Key Architectural Principles
- Separation of Concerns
   - Understanding ≠ Decision ≠ Execution ≠ Explanation
- Controlled Use of LLMs
   - LLMs are used only where probabilistic reasoning is beneficial
- Determinism and Safety
   - Critical system decisions are rule-based
- Explainability and Comparability
   - Every stage is inspectable and replaceable

10. Positioning for Paper Comparison
This system can be directly compared with:
- Task-oriented dialogue systems (TOD)
- LLM agent frameworks (ReAct, Toolformer, AutoGPT)
- Hybrid neuro-symbolic architectures
- Enterprise conversational AI platforms
Key differentiators:
- Dual-LLM design (intent + interpretation)
- Deterministic execution core
- Explicit input and function schemas
"""

## Function to get paths for files and notepads from a folder

In [25]:
def getFilesAndNotepads(folderPath:str):
    file_paths = []
    notepad_paths = []

    for root, _, files in os.walk(folderPath):
        for file in files:
            file_path = os.path.join(root, file)
            if file.lower().endswith('.pdf'):
                file_paths.append(file_path)
            elif file.lower().endswith('.txt'):
                notepad_paths.append(file_path)

    return file_paths, notepad_paths
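
A quick way to check the walker is to run it against a throwaway directory. The sketch below is self-contained: it restates the function under the hypothetical name `get_files_and_notepads`, builds a temporary tree shaped like the expected file structure, and prints what is found. The file names and the `build_demo_tree` helper are made up for illustration.

```python
import os
import tempfile


def get_files_and_notepads(folder_path: str):
    """Same logic as getFilesAndNotepads, restated so this sketch runs alone."""
    file_paths, notepad_paths = [], []
    for root, _, files in os.walk(folder_path):
        for name in files:
            path = os.path.join(root, name)
            if name.lower().endswith(".pdf"):
                file_paths.append(path)
            elif name.lower().endswith(".txt"):
                notepad_paths.append(path)
    return file_paths, notepad_paths


def build_demo_tree(base: str) -> None:
    """Create base/subfolder with two PDFs, one notepad and one unrelated file."""
    sub = os.path.join(base, "subfolder")
    os.makedirs(sub)
    for name in ("paper1.pdf", "paper2.PDF", "notepad.txt", "ignored.docx"):
        with open(os.path.join(sub, name), "w", encoding="utf-8") as fh:
            fh.write("placeholder")


with tempfile.TemporaryDirectory() as tmp:
    build_demo_tree(tmp)
    pdfs, notes = get_files_and_notepads(tmp)
    print(sorted(os.path.basename(p) for p in pdfs))   # ['paper1.pdf', 'paper2.PDF']
    print(sorted(os.path.basename(p) for p in notes))  # ['notepad.txt']
```

The extension match is case-insensitive, and files with any other extension are skipped silently.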

## Functions to read content from Documents

In [26]:
def read_txt_file_content(file_path: str) -> str:
    try:
        with open(file_path, 'r', encoding='utf-8') as f:
            content = f.read()
        return content
    except FileNotFoundError:
        return f"Error: The file at {file_path} was not found."
    except Exception as e:
        return f"An error occurred while reading the file: {e}"

In [27]:
def read_pdf_contents(pdf_path, detect_columns=True):
    """
    Read all contents from a PDF file, handling both single and multi-column layouts.

    Args:
        pdf_path: Path to the PDF file
        detect_columns: Whether to automatically detect and handle multi-column layouts

    Returns:
        Extracted text as a string

    Example:
        text = read_pdf_contents("research_paper.pdf")
        print(text)
    """
    all_text = []

    with pdfplumber.open(pdf_path) as pdf:
        for page_num, page in enumerate(pdf.pages, 1):
            if detect_columns:
                # Get page dimensions
                page_width = page.width
                page_height = page.height
                words = page.extract_words()

                # Detect multi-column layout
                is_multi_column = False
                if len(words) >= 20:
                    midpoint = page_width / 2
                    left_words = sum(1 for w in words if (w['x0'] + w['x1']) / 2 < midpoint)
                    right_words = sum(1 for w in words if (w['x0'] + w['x1']) / 2 > midpoint)
                    total = len(words)
                    is_multi_column = (left_words / total >= 0.3 and right_words / total >= 0.3)

                # Extract based on column detection
                if is_multi_column:
                    split_point = page_width * 0.5
                    left_text = page.crop((0, 0, split_point, page_height)).extract_text() or ""
                    right_text = page.crop((split_point, 0, page_width, page_height)).extract_text() or ""
                    page_text = f"{left_text}\n\n{right_text}".strip()
                else:
                    page_text = page.extract_text()
            else:
                page_text = page.extract_text()

            if page_text:
                all_text.append(f"=== Page {page_num} ===\n{page_text}")

    return "\n\n".join(all_text)
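
The column check in `read_pdf_contents` only looks at which half of the page each word's center falls in. A single-column page whose lines span the full width also has a large share of words on each side, so it passes the same test and is then split at the midpoint. The sketch below isolates the rule as `looks_multi_column` and adds a hypothetical `has_gutter` check that requires almost no word to overlap a narrow band around the midpoint. The function names, the band width, the crossing threshold and the synthetic coordinates are all assumptions for illustration, not part of the notebook or of pdfplumber.

```python
def looks_multi_column(words, page_width, min_words=20, min_share=0.3):
    """The rule used in read_pdf_contents: both halves hold >= min_share of words."""
    if len(words) < min_words:
        return False
    mid = page_width / 2
    total = len(words)
    left = sum(1 for w in words if (w["x0"] + w["x1"]) / 2 < mid)
    right = sum(1 for w in words if (w["x0"] + w["x1"]) / 2 > mid)
    return left / total >= min_share and right / total >= min_share


def has_gutter(words, page_width, gutter_frac=0.02, max_crossing_share=0.01):
    """Hypothetical extra check: (almost) no word overlaps a band around the midpoint."""
    if not words:
        return False
    mid = page_width / 2
    half_band = page_width * gutter_frac / 2
    crossing = sum(
        1 for w in words
        if w["x0"] < mid + half_band and w["x1"] > mid - half_band
    )
    return crossing / len(words) <= max_crossing_share


PAGE_WIDTH = 600
# Synthetic single-column line: ten 45-unit words, 5 units apart, starting at x=50
single_line = [{"x0": 50 + 50 * i, "x1": 95 + 50 * i} for i in range(10)]
# Synthetic two-column line: four words per column, empty gutter from x=265 to x=335
left_col = [{"x0": 50 + 55 * i, "x1": 100 + 55 * i} for i in range(4)]
right_col = [{"x0": 335 + 55 * i, "x1": 385 + 55 * i} for i in range(4)]

single_column = single_line * 3          # 30 words
two_column = (left_col + right_col) * 3  # 24 words

print(looks_multi_column(single_column, PAGE_WIDTH), has_gutter(single_column, PAGE_WIDTH))  # True False
print(looks_multi_column(two_column, PAGE_WIDTH), has_gutter(two_column, PAGE_WIDTH))        # True True
```

Requiring both checks would treat the synthetic single-column page as single-column. Real papers contain figures, headers and full-width abstracts above two-column bodies, so the thresholds would need tuning against actual pages.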

## Function to invoke LLM

In [31]:
def invoke_gemini(
    query: str,
    responseClass: Type[BaseModel],
    modelName: str,
    system_query: Optional[str] = None
):
    """
    Invokes Gemini and parses the response into responseClass.
    The user query is passed EXACTLY as-is.
    """

    model = genai.GenerativeModel(
        model_name=modelName,
        system_instruction=system_query,

    )

    response = model.generate_content(
        query,  # <-- query is untouched
        generation_config={
            "response_mime_type": "application/json",
            "response_schema": responseClass
        }
    )

    # Gemini already validates against the schema
    #return response.parsed
    text = response.text
    return responseClass.model_validate_json(text)
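
The run recorded at the end of this notebook failed with error code 429 (too many requests) for all 28 papers, and the loop completed in 42 seconds, so the calls were issued back to back. One common mitigation is to retry with exponential backoff. The following is a minimal sketch using only the standard library. The helper name `call_with_backoff`, its parameters, and the assumption that rate-limit exceptions expose a `code` attribute equal to 429 are illustrative. The demo uses a fake exception class instead of calling Gemini.

```python
import random
import time


def call_with_backoff(fn, *args, max_attempts=5, base_delay=2.0, max_delay=60.0,
                      is_retryable=None, sleep=time.sleep, **kwargs):
    """Call fn(*args, **kwargs); on retryable errors wait and try again.

    The wait doubles on every attempt (capped at max_delay) and gets random
    jitter of up to half the delay added on top.
    """
    if is_retryable is None:
        # Assumption: rate-limit errors carry a `code` attribute equal to 429
        def is_retryable(exc):
            return getattr(exc, "code", None) == 429

    for attempt in range(1, max_attempts + 1):
        try:
            return fn(*args, **kwargs)
        except Exception as exc:
            # Give up on the last attempt or on errors that are not rate limits
            if attempt == max_attempts or not is_retryable(exc):
                raise
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay / 2))


class FakeRateLimitError(Exception):
    """Stand-in for a rate-limit error (hypothetical, for the demo only)."""
    code = 429


calls = {"n": 0}


def flaky():
    """Fails twice with a 429-style error, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise FakeRateLimitError("quota exceeded")
    return "ok"


waits = []  # record the delays instead of actually sleeping
result = call_with_backoff(flaky, base_delay=1.0, sleep=waits.append)
print(result, calls["n"], len(waits))  # ok 3 2
```

In the analysis loop, a call such as `invoke_gemini(paper_content, ContextualSummary, SUMMARIZATION_MODEL, SYSTEM_QUERY_SUMMARIZER)` could be wrapped as `call_with_backoff(invoke_gemini, paper_content, ContextualSummary, SUMMARIZATION_MODEL, SYSTEM_QUERY_SUMMARIZER)`. Backoff helps with short-window rate limits. It does not help once a longer-term quota is exhausted.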


## Start Analysis

In [44]:
papersWithRelevance = []
errorsList = []

In [45]:
# Summarize each paper
# Analyze each summary with reference to project idea
files, notepads = getFilesAndNotepads(FOLDER_PATH)
for paper in tqdm(files, desc="Analyzing reference papers...."):
  try:
    # Extract the text, then get summary, abstract and conclusion of the paper
    paper_content = read_pdf_contents(paper)
    summary = invoke_gemini(paper_content, ContextualSummary, SUMMARIZATION_MODEL, SYSTEM_QUERY_SUMMARIZER)
  except Exception as e:
    # Record the failure and move on to the next paper
    errorsList.append({paper: e})
    continue

  # Create Query for Relevance Calculation
  relevance_query = f"""
  Here is my current project idea:
  {PROJECT_CONTEXT_MINIATURE}

  Calculate the relevance of this paper in the context of my project. Here is the summary of the paper:
  {summary.summary}
  """

  try:
    # Calculate Relevance values for this Paper
    relevance = invoke_gemini(relevance_query, Paper, RELEVANCE_MODEL, SYSTEM_QUERY_RELEVANCE)
  except Exception as e:
    errorsList.append({paper: e})
    continue

  # Add Paper's local path, abstract and conclusion to result.
  # These attributes are not declared on Paper; they are accepted because of extra='allow'.
  relevance.paper_path = paper
  relevance.abstract = summary.abstract
  relevance.conclusion = summary.conclusion
  relevance.summary = summary.summary
  # Store plain dicts so the result stays JSON-serializable
  relevance.status = [s.model_dump() for s in summary.status_updates]
  relevance.published = summary.published
  papersWithRelevance.append(relevance.model_dump())

Analyzing reference papers....: 100%|██████████| 28/28 [00:42<00:00,  1.51s/it]


## Display Relevance Values

In [46]:
# Print all relevances
if len(papersWithRelevance)==0:
  print("Analysis for papers failed. Please check the below cell for list of errors....")
else:
  print("Here are the analysis Results : ")
  for paper in papersWithRelevance:
    print(json.dumps(paper, indent=2))

Analysis for papers failed. Please check the below cell for list of errors....


## Display Error Papers

In [47]:
print(f"{len(errorsList)} papers of {len(files)} faced the following errors when being analyzed : ")
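for paper in errorsList:
    path, err = next(iter(paper.items()))
    # Not every exception has a .code attribute; fall back to the exception type
    code = getattr(err, "code", type(err).__name__)
    print(f"\tPaper: {path}\n\tERROR: {code}")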

28 papers of 28 faced the following errors when being analyzed : 
	Paper: Literature_Review_Papers/(document question answering OR long document QA OR RAG)\Enhancing_Large_Model_Document_Question_Answering_Through_Retrieval_Augmentation.pdf
	ERROR: 429
	Paper: Literature_Review_Papers/(relevance scoring OR cross-encoder OR re-ranking)\A_distributed_search_engine_based_on_a_re-ranking_algorithm_model.pdf
	ERROR: 429
	Paper: Literature_Review_Papers/(relevance scoring OR cross-encoder OR re-ranking)\A_re-ranking_method_based_on_cloud_model.pdf
	ERROR: 429
	Paper: Literature_Review_Papers/(relevance scoring OR cross-encoder OR re-ranking)\Enhancing_Retrieval_and_Re-ranking_in_RAG_A_Case_Study_on_Tax_Law.pdf
	ERROR: 429
	Paper: Literature_Review_Papers/(relevance scoring OR cross-encoder OR re-ranking)\The_Research_on_Re-ranking_Algorithm_for_FAQ-based_Systems_in_the_Petroleum_Domain.pdf
	ERROR: 429
	Paper: Literature_Review_Papers/(semantic chunk OR section segmentation)\A_Sentence-Level_