<a href="https://colab.research.google.com/github/punnoose-1620/masters-thesis-sensor-data/blob/main/LiteratureReviewHelper.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Perform relevance analysis for all papers related to an idea

### Expected File Structure :
```
papers_folder
  |---subfolder
      |---paper1.pdf
      |---paper2.pdf
      |---notepad.txt
```

## Imports and Installs

In [None]:
!pip install google-generativeai
!pip install pdfplumber

In [None]:
from pydantic import BaseModel, ConfigDict
import google.generativeai as genai
from google.colab import userdata
from typing import Type, Optional

from tqdm import tqdm
import pdfplumber
import json
import os

## Declare class for return types

In [None]:
class ContextualSummary(BaseModel):
  summary: str

In [None]:
class Author(BaseModel):
  name: str
  institution: str

In [None]:
class Paper(BaseModel):
  title: str
  abstract: str
  methodology: str
  conclusion: str
  relevance: float
  relevant_pages: list
  citation: str
  paperType: str
  authors: list[Author]

  model_config = ConfigDict(extra='allow')
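
Because `Paper` sets `extra='allow'`, attributes that are not declared on the class can be attached after construction and still show up in `model_dump()`. A minimal sketch of that behaviour, assuming Pydantic v2; `PaperStub` and its values are hypothetical and only stand in for the full `Paper` class:

```python
from pydantic import BaseModel, ConfigDict


class PaperStub(BaseModel):
    """Hypothetical two-field stand-in for the Paper class above."""
    title: str
    relevance: float

    model_config = ConfigDict(extra='allow')


stub = PaperStub(title="Example title", relevance=0.5)

# paper_path is not a declared field; it is accepted only because extra='allow'
stub.paper_path = "papers_folder/subfolder/paper1.pdf"

# Extra attributes are included alongside the declared fields
dumped = stub.model_dump()
print(dumped)
```

Without `extra='allow'`, the assignment to `paper_path` would raise an error instead.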

## Declare Static Queries

In [None]:
CLASS_DETAILS = {
    "title": "Title of the Paper",
    "abstract": "Abstract section from the paper",
    "methodology": "What is done in the paper and how it is done, including relevant technical details?",
    "conclusion": "What were the results of this paper with regard to our context?",
    "relevance": "Relevance score (0-1) for how relevant this paper is to our context.",
    "relevant_pages": "List of pages that have content relevant to our topic.",
    "citation": "String to cite this paper",
    "paperType": "What type of paper is this (qualitative/quantitative)?",
    "authors": [
        {
            "name": "Author Name",
            "institution": "Institution of Author"
        }
    ]
}
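
When a dictionary such as `CLASS_DETAILS` is embedded in a prompt through an f-string, Python inserts its `repr`, which uses single quotes and is not valid JSON. The sketch below contrasts that with `json.dumps`; the shortened `details` dictionary is a hypothetical example, not the one used by the notebook:

```python
import json

# Hypothetical, shortened field description used only for illustration
details = {"title": "Title of the Paper", "authors": [{"name": "Author Name"}]}

as_repr = f"{details}"                   # Python repr: single-quoted keys
as_json = json.dumps(details, indent=2)  # valid JSON: double-quoted keys

print(as_repr)
print(as_json)

# Only the json.dumps variant round-trips through a JSON parser
assert json.loads(as_json) == details
```

Whether the difference matters in practice depends on how strictly the model is expected to mirror the format shown in the prompt.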

In [1]:
SYSTEM_QUERY_SUMMARIZER = """
You are an academically profound individual well versed in the domain of the reference paper. Do not miss any technical terms that might be relevant to this domain. Summarize this paper.
"""

SYSTEM_QUERY_RELEVANCE = f"""
You are an academically profound individual well versed in the domain of both papers. Do not miss any technical terms that might be relevant to this domain. Output must be strictly in this JSON format:
{json.dumps(CLASS_DETAILS, indent=2)}
"""

## Declare Models for each purpose

In [None]:
SUMMARIZATION_MODEL = "gemini-2.5-flash-lite"
RELEVANCE_MODEL = "gemini-2.5-pro"
MODEL_API_KEY = userdata.get('GOOGLE_API_KEY')

## Configure API Key for LLM

In [None]:
genai.configure(api_key=MODEL_API_KEY)

## Declare Folder Path and Project Context

In [None]:
FOLDER_PATH = ""
PROJECT_CONTEXT = """
High-level overview
The proposed system represents a hybrid stakeholder chatbot architecture that combines traditional machine-learning classifiers, deterministic control logic, and optional LLM-based components to handle user queries in a controlled, auditable, and extensible manner.
The workflow is divided into three major subsystems:
- Intent Identification and Function Selection
- Data Retrieval and Code Execution
- Data Interpretation and Response Generation
Each subsystem is modular, allowing selective replacement of components (e.g., replacing classical ML models with LLMs) without changing the overall control flow.

1. Input and Intent Classification

Input structure
User input is structured and enriched before processing. Each query may include:
- Raw query text
- Optional annotations
- Intent label and confidence
- Metadata such as domain, geography, time scope
- An ambiguity level indicator
This structured representation ensures traceability and supports downstream confidence handling.

Intent classifier pipeline
The Intent Identifier subsystem processes the user query in two stages:
- Intent Classification
  -> Determines the most likely intent class for the user query.
  -> Implemented using traditional ML models such as:
     > Logistic Regression
     > Linear Support Vector Machines
- Intent Confidence Calculation
  -> Computes a confidence score for the predicted intent.
  -> Confidence thresholds are used to determine whether the system proceeds automatically or requests clarification.
This design explicitly separates intent prediction from confidence assessment, improving interpretability and control.

Intent identifier responsibilities
The intent identifier outputs:
- An intent class
- A confidence score
These outputs are then passed to the Function Selector.

2. Function Selection and Code Execution

Function selector
The Function Selector acts as a deterministic decision layer. Based on:
- The identified intent
- Confidence score
- Predefined preconditions
It selects an appropriate function from a Function Map, which contains:
- Function name
- Description
- Domain
- Required and optional inputs
- Preconditions
- Fallback behavior if preconditions are unmet
If preconditions fail, a fallback function is selected, often leading to clarification prompts for the user.
This ensures:
- Predictable behavior
- Reduced hallucination risk
- Clear operational boundaries

Data and code execution
Once a function is selected:
- Required data sources are identified using a Map of Available Data
- The Code Executor:
  -> Retrieves relevant data (e.g., from databases or APIs)
  -> Executes deterministic business logic
- The output of this stage is raw data, not user-ready text
This separation ensures that:
- Business logic remains auditable
- AI components do not directly manipulate core data

3. Data Interpretation and Translation
Purpose of the data interpreter
Raw outputs from the code executor are not directly exposed to the user. Instead, they are passed to a Data Interpreter and Translator module, whose responsibility is to:
- Interpret raw or structured data
- Determine whether the data contains textual content
- Convert results into a human-readable response

Interpreter workflow
- Content type check
  -> If the data contains textual content, it proceeds to semantic analysis.
  -> Otherwise, it is interpreted directly.
- Semantic chunk classification
  -> Breaks text into meaningful sections.
  -> Uses ML models (e.g., XGBoost, Random Forest) to classify relevance.
- Section relevance scoring
  -> Scores each chunk to filter noise and prioritize key information.
- Confidence estimation
  -> Assigns confidence levels to interpreted outputs.
- Final data interpretation
  -> Produces a structured, processed response ready for user consumption.
This layered interpretation ensures explainability and confidence awareness in the final output.

4. Response Delivery and Integration
The processed response is:
- Returned to the end user
- Optionally integrated with external systems (e.g., Microsoft Teams)
This allows the chatbot to function both as a standalone interface and as part of enterprise workflows.

5. Design Philosophy and Extensibility
Hybrid AI approach
- Red components represent AI/ML models.
- Blue dashed components indicate areas where LLMs can replace classical models.
- Green components represent deterministic data sources or mappings.

This makes the architecture:
- LLM-optional, not LLM-dependent
- Easier to validate for compliance and correctness
- Suitable for enterprise or regulated environments

6. Relevance for paper-checking / evaluation algorithms
For a paper-checking or validation algorithm, this architecture provides:
- Clear separation of concerns
- Explicit confidence handling at multiple stages
- Traceable decision points (intent, function selection, interpretation)
- Deterministic execution paths with AI-assisted interpretation

This makes it particularly suitable for:
- Automated consistency checks
- Hallucination detection
- Confidence-aware evaluation
- Explainability analysis
"""

## Function to get paths for files and notepads from a folder

In [None]:
def getFilesAndNotepads(folderPath: str):
    """Walk folderPath recursively and return (PDF paths, .txt notepad paths)."""
    file_paths = []
    notepad_paths = []

    for root, _, files in os.walk(folderPath):
        for file in files:
            file_path = os.path.join(root, file)
            # Match extensions case-insensitively; any other file type is ignored
            if file.lower().endswith('.pdf'):
                file_paths.append(file_path)
            elif file.lower().endswith('.txt'):
                notepad_paths.append(file_path)

    return file_paths, notepad_paths
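
To check the folder walk without real papers, the expected file structure can be recreated in a temporary directory. This is a sketch only: `collect_paths` is a hypothetical stand-in that mirrors `getFilesAndNotepads`, and the file names are made up.

```python
import os
import tempfile


def collect_paths(folder_path: str):
    """Hypothetical stand-in mirroring getFilesAndNotepads above."""
    pdf_paths, txt_paths = [], []
    for root, _, files in os.walk(folder_path):
        for name in files:
            full_path = os.path.join(root, name)
            if name.lower().endswith(".pdf"):
                pdf_paths.append(full_path)
            elif name.lower().endswith(".txt"):
                txt_paths.append(full_path)
    # Sorted so the result does not depend on directory listing order
    return sorted(pdf_paths), sorted(txt_paths)


with tempfile.TemporaryDirectory() as papers_folder:
    subfolder = os.path.join(papers_folder, "subfolder")
    os.makedirs(subfolder)
    # Empty placeholder files; only the names matter for this check
    for name in ("paper1.pdf", "paper2.PDF", "notepad.txt", "figure.png"):
        open(os.path.join(subfolder, name), "w").close()

    pdfs, notes = collect_paths(papers_folder)
    pdf_names = [os.path.basename(p) for p in pdfs]
    note_names = [os.path.basename(p) for p in notes]

print(pdf_names)
print(note_names)
```

The `.png` file is skipped, and the upper-case `.PDF` extension is still picked up because the comparison is done on the lower-cased name.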

## Functions to read content from Documents

In [None]:
def read_txt_file_content(file_path: str) -> str:
    try:
        with open(file_path, 'r', encoding='utf-8') as f:
            content = f.read()
        return content
    except FileNotFoundError:
        return f"Error: The file at {file_path} was not found."
    except Exception as e:
        return f"An error occurred while reading the file: {e}"

In [None]:
def read_pdf_contents(pdf_path, detect_columns=True):
    """
    Read all contents from a PDF file, handling both single and multi-column layouts.

    Args:
        pdf_path: Path to the PDF file
        detect_columns: Whether to automatically detect and handle multi-column layouts

    Returns:
        Extracted text as a string

    Example:
        text = read_pdf_contents("research_paper.pdf")
        print(text)
    """
    all_text = []

    with pdfplumber.open(pdf_path) as pdf:
        for page_num, page in enumerate(pdf.pages, 1):
            if detect_columns:
                # Get page dimensions
                page_width = page.width
                page_height = page.height
                words = page.extract_words()

                # Detect multi-column layout
                is_multi_column = False
                if len(words) >= 20:
                    midpoint = page_width / 2
                    left_words = sum(1 for w in words if (w['x0'] + w['x1']) / 2 < midpoint)
                    right_words = sum(1 for w in words if (w['x0'] + w['x1']) / 2 > midpoint)
                    total = len(words)
                    is_multi_column = (left_words / total >= 0.3 and right_words / total >= 0.3)

                # Extract based on column detection
                if is_multi_column:
                    split_point = page_width * 0.5
                    left_text = page.crop((0, 0, split_point, page_height)).extract_text() or ""
                    right_text = page.crop((split_point, 0, page_width, page_height)).extract_text() or ""
                    page_text = f"{left_text}\n\n{right_text}".strip()
                else:
                    page_text = page.extract_text()
            else:
                page_text = page.extract_text()

            if page_text:
                all_text.append(f"=== Page {page_num} ===\n{page_text}")

    return "\n\n".join(all_text)
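
The column check above can be tried on synthetic word boxes without opening a PDF. The sketch below reproduces the share-of-words test with the same thresholds (at least 20 words, at least 30% of word centres on each side of the midpoint). One caveat worth checking on your own PDFs: a full-width single-column page also places roughly half of its words on each side, so it passes the same test. `gutter_test` and its `max_straddle_share` threshold are hypothetical additions, not part of the notebook; they illustrate one possible refinement that looks for words crossing the midpoint.

```python
def share_test(words, page_width, min_words=20, min_share=0.3):
    """Share-of-words test with the thresholds used in read_pdf_contents."""
    if len(words) < min_words:
        return False
    midpoint = page_width / 2
    total = len(words)
    left = sum(1 for w in words if (w["x0"] + w["x1"]) / 2 < midpoint)
    right = sum(1 for w in words if (w["x0"] + w["x1"]) / 2 > midpoint)
    return left / total >= min_share and right / total >= min_share


def gutter_test(words, page_width, max_straddle_share=0.05):
    """Hypothetical extra check: only a few words may cross the midpoint."""
    if not words:
        return False
    midpoint = page_width / 2
    straddling = sum(1 for w in words if w["x0"] < midpoint < w["x1"])
    return straddling / len(words) <= max_straddle_share


def make_line(x_starts, width):
    """Build word boxes shaped like pdfplumber's extract_words() output."""
    return [{"x0": x, "x1": x + width} for x in x_starts]


PAGE_WIDTH = 600

# Three lines, each with five words per column and an empty gutter in between
two_column = 3 * (make_line(range(40, 290, 50), 40) + make_line(range(320, 570, 50), 40))
# Three full-width lines of ten words; one word per line crosses x = 300
single_column = 3 * make_line(range(5, 555, 55), 45)

results = {
    "two_column": (share_test(two_column, PAGE_WIDTH), gutter_test(two_column, PAGE_WIDTH)),
    "single_column": (share_test(single_column, PAGE_WIDTH), gutter_test(single_column, PAGE_WIDTH)),
}
print(results)
```

On this synthetic data both layouts pass the share test, and only the two-column layout passes the gutter test. Real pages with centred titles or wide figures may need a different threshold.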

## Function to invoke LLM

In [None]:
def invoke_gemini(
    query: str,
    responseClass: Type[BaseModel],
    modelName: str,
    system_query: Optional[str] = None
):
    """
    Invokes Gemini and parses the response into responseClass.
    The user query is passed EXACTLY as-is.
    """

    model = genai.GenerativeModel(
        model_name=modelName,
        system_instruction=system_query,
    )

    response = model.generate_content(
        query,  # <-- query is untouched
        generation_config={
            "response_mime_type": "application/json",
            "response_schema": responseClass
        }
    )

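    # The google-generativeai response exposes the JSON payload as text
    # (it has no `.parsed` attribute), so validate it against the Pydantic class here
    return responseClass.model_validate_json(response.text)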
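
A JSON string, such as the text of a model response, can be turned into one of the Pydantic classes with `model_validate_json`. The following is a minimal offline sketch, assuming Pydantic v2; the stub classes and the JSON payloads are hypothetical and no API call is made.

```python
from pydantic import BaseModel, ValidationError


class AuthorStub(BaseModel):
    """Hypothetical stand-in for the Author class."""
    name: str
    institution: str


class PaperStub(BaseModel):
    """Hypothetical, shortened stand-in for the Paper class."""
    title: str
    relevance: float
    authors: list[AuthorStub]


# Made-up payload in the shape a JSON-mode response might take
raw_json = (
    '{"title": "Example title", "relevance": 0.8, '
    '"authors": [{"name": "A. Author", "institution": "Example University"}]}'
)

paper = PaperStub.model_validate_json(raw_json)
print(paper.title, paper.relevance, paper.authors[0].name)

# A payload with missing required fields is rejected instead of passing silently
try:
    PaperStub.model_validate_json('{"title": "Missing fields"}')
    rejected = False
except ValidationError:
    rejected = True
print("incomplete payload rejected:", rejected)
```

Validating explicitly means a malformed response surfaces as a `ValidationError` at the call site rather than as a missing attribute later in the loop.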


## Start Analysis

In [None]:
papersWithRelevance = []

In [None]:
# Summarize each paper
# Analyze each summary with reference to project idea
files, notepads = getFilesAndNotepads(FOLDER_PATH)
for paper in tqdm(files, desc="Analyzing reference papers...."):
  paper_content = read_pdf_contents(paper)

  try:
    summary = invoke_gemini(paper_content, ContextualSummary, SUMMARIZATION_MODEL, SYSTEM_QUERY_SUMMARIZER)
  except Exception as e:
    print("ERROR: Summary generation failed:", e)
    break

  relevance_query = f"""
  Here is my current project idea :
  {PROJECT_CONTEXT}

  Calculate relevance of this paper with context to my project. Here is the summary of the paper :
  {summary.summary}
  """

  try:
    relevance = invoke_gemini(relevance_query, Paper, RELEVANCE_MODEL, SYSTEM_QUERY_RELEVANCE)
  except Exception as e:
    print("ERROR: Relevance calculation failed:", e)
    break
  relevance.paper_path = paper
  papersWithRelevance.append(relevance.model_dump())

In [None]:
# Print all relevances
for paper in papersWithRelevance:
  print(json.dumps(paper, indent=2))
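
Once the results are collected, it may be convenient to rank them by relevance and save them for later use. This is a sketch under stated assumptions: the records below are made up and contain only three of the fields, and the 0.5 cut-off and the output file name are arbitrary choices, not part of the notebook.

```python
import json
import os
import tempfile

# Hypothetical records shaped like entries of papersWithRelevance
papers = [
    {"title": "Paper A", "relevance": 0.35, "paper_path": "papers_folder/subfolder/paper1.pdf"},
    {"title": "Paper B", "relevance": 0.90, "paper_path": "papers_folder/subfolder/paper2.pdf"},
    {"title": "Paper C", "relevance": 0.60, "paper_path": "papers_folder/subfolder/paper3.pdf"},
]

# Highest relevance first
ranked = sorted(papers, key=lambda p: p["relevance"], reverse=True)

# Arbitrary cut-off for a reading shortlist
shortlist = [p["title"] for p in ranked if p["relevance"] >= 0.5]

# Persist the ranked list so the analysis does not have to be rerun
out_path = os.path.join(tempfile.gettempdir(), "papers_with_relevance.json")
with open(out_path, "w", encoding="utf-8") as f:
    json.dump(ranked, f, indent=2)

# Read the file back to confirm it round-trips
with open(out_path, "r", encoding="utf-8") as f:
    reloaded = json.load(f)

print(shortlist)
```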