<a href="https://colab.research.google.com/github/jothiovia-2004/project/blob/main/Automated_Research_Paper_Reviewer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install tools

Collecting tools
  Downloading tools-1.0.5-py3-none-any.whl.metadata (1.3 kB)
Downloading tools-1.0.5-py3-none-any.whl (40 kB)
Installing collected packages: tools
Successfully installed tools-1.0.5


In [None]:
!pip install pymupdf


Collecting pymupdf
  Downloading pymupdf-1.26.4-cp39-abi3-manylinux_2_28_x86_64.whl.metadata (3.4 kB)
Downloading pymupdf-1.26.4-cp39-abi3-manylinux_2_28_x86_64.whl (24.1 MB)
Installing collected packages: pymupdf
Successfully installed pymupdf-1.26.4


In [None]:
import fitz  # from PyMuPDF


In [None]:
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True
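
The punkt data downloaded above backs `sent_tokenize`, which the reviewer script below uses to pack whole sentences into word-bounded chunks before summarization. The following is a minimal sketch of that packing step, not the script's own code: `naive_sentences` is a hypothetical regex splitter standing in for `sent_tokenize` so the example runs without NLTK, and `chunk_by_sentences` is an illustrative name.

```python
import re
from typing import List


def naive_sentences(text: str) -> List[str]:
    # Assumption: stand-in for nltk's sent_tokenize. It splits after '.', '!' or '?'
    # followed by whitespace; punkt handles abbreviations and decimals far better.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def chunk_by_sentences(text: str, max_words: int = 500) -> List[str]:
    chunks: List[str] = []
    cur: List[str] = []
    cur_words = 0
    for sent in naive_sentences(text):
        n = len(sent.split())
        # Close the current chunk when this sentence would push it past the budget.
        if cur and cur_words + n > max_words:
            chunks.append(" ".join(cur))
            cur, cur_words = [], 0
        cur.append(sent)
        cur_words += n
    if cur:
        chunks.append(" ".join(cur))
    return chunks


demo = "One two three. Four five six seven. Eight nine. Ten."
print(chunk_by_sentences(demo, max_words=5))
```

Sentences are never split, so a single sentence longer than `max_words` still ends up as its own oversized chunk; the summarizer then has to truncate it.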

In [None]:
"""
Automated Research Paper Reviewer (prototype)

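Usage (Colab):
    Set `dummy_paper_path` in the `__main__` block at the bottom of this cell, then run the cell.
    The argparse-based CLI was removed for Colab, so `--input` / `--device` flags are not parsed.

Optional (Streamlit UI, enabled only when the STREAMLIT_RUN environment variable is set):
    STREAMLIT_RUN=1 streamlit run paper_reviewer.py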

What it does (prototype):
 - Extract text from PDF (via PyMuPDF)
 - Chunk & summarize sections (transformers summarization)
 - Generate structured review items via a text-generation model (Flan-T5)
 - Provide simple novelty & clarity heuristics + candidate review lines

Notes:
 - This is a prototype. Replace models or improve heuristics for production.
 - Large models may require GPU. Use --device cuda when available.
"""

import argparse
import re
import os
from typing import List, Tuple, Dict
from dataclasses import dataclass
import math
from tqdm import tqdm

# PDF extraction
import fitz  # PyMuPDF

# NLP
import nltk
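nltk.download('punkt', quiet=True)
# Recent NLTK releases load sentence-tokenizer data from 'punkt_tab'
nltk.download('punkt_tab', quiet=True)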
from nltk.tokenize import sent_tokenize

from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM
from sentence_transformers import SentenceTransformer, util
import numpy as np

# Streamlit optional
try:
    import streamlit as st
    STREAMLIT_AVAILABLE = True
except Exception:
    STREAMLIT_AVAILABLE = False

# -----------------------------
# Utility: PDF -> text
# -----------------------------
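def extract_text_from_pdf(path: str) -> str:
    """Return the plain text of every page, with pages separated by blank lines."""
    texts = []
    # The context manager closes the document even if extraction fails
    with fitz.open(path) as doc:
        for page in doc:
            txt = page.get_text("text")
            if txt:
                texts.append(txt)
    return "\n\n".join(texts)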

# -----------------------------
# Heuristic: split into sections by common headings
# -----------------------------
SECTION_HEADINGS = [
    r'abstract', r'introduction', r'related work', r'background', r'methods?',
    r'methodology', r'approach', r'experimental', r'experiments', r'results?', r'discussion',
    r'conclusion', r'conclusions', r'future work', r'acknowledge?ments?', r'references', r'references and notes'
]

def split_into_sections(text: str) -> List[Tuple[str, str]]:
    """
    Returns list of (heading, content). If headings not found, returns single 'full_text'.
    """
    # Normalize newlines and unify heading lines
    lines = [l.strip() for l in text.splitlines()]
    joined = "\n".join(lines)
    # find heading positions
    pattern = r'(^|\n)\s*(%s)\s*(\n|$)' % "|".join(SECTION_HEADINGS)
    matches = list(re.finditer(pattern, joined, flags=re.IGNORECASE | re.MULTILINE))
    if not matches:
        return [("full_text", joined)]
    sections = []
    for i, m in enumerate(matches):
        start = m.end()
        heading = m.group(2).strip().title()
        end = matches[i+1].start() if i+1 < len(matches) else len(joined)
        content = joined[start:end].strip()
        sections.append((heading, content))
    # Merge small sections into neighbor if too short
    merged = []
    for h, c in sections:
        if len(c.split()) < 50 and merged:
            prev_h, prev_c = merged[-1]
            merged[-1] = (prev_h, prev_c + "\n\n" + c)
        else:
            merged.append((h, c))
    return merged

# -----------------------------
# Chunking helper
# -----------------------------
def chunk_text_by_sentences(text: str, max_words: int = 500) -> List[str]:
    sents = sent_tokenize(text)
    chunks = []
    cur = []
    cur_words = 0
    for s in sents:
        w = len(s.split())
        if cur_words + w > max_words and cur:
            chunks.append(" ".join(cur))
            cur = [s]
            cur_words = w
        else:
            cur.append(s)
            cur_words += w
    if cur:
        chunks.append(" ".join(cur))
    return chunks

# -----------------------------
# Reviewer class that holds models
# -----------------------------
@dataclass
class PaperReviewer:
    device: str = "cpu"
    # models will be loaded in __post_init__
    summarizer = None
    qg_model = None   # generation model (Flan-T5)
    embedder = None

    def __post_init__(self):
        # 1) Summarization pipeline
        print("Loading summarization model (facebook/bart-large-cnn)...")
        self.summarizer = pipeline("summarization", model="facebook/bart-large-cnn", device=0 if self.device != "cpu" and "cuda" in self.device else -1)
        # 2) Generation model for review-text generation (Flan-T5 base)
        print("Loading text generation model (google/flan-t5-base)...")
        self.qg_tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
        self.qg_model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")
        if self.device != "cpu" and "cuda" in self.device:
            self.qg_model = self.qg_model.to(self.device)
        # 3) Sentence-transformer embedder for simple heuristics
        print("Loading sentence-transformers embedder (all-MiniLM-L6-v2)...")
        self.embedder = SentenceTransformer("all-MiniLM-L6-v2", device=self.device if self.device != "cpu" else "cpu")

    # -------------------------
    # Summarize a (possibly long) text by chunking
    # -------------------------
    def summarize_long(self, text: str, max_chunk_words: int = 400) -> str:
        chunks = chunk_text_by_sentences(text, max_words=max_chunk_words)
        summaries = []
        for ch in tqdm(chunks, desc="Summarizing chunks"):
            try:
                out = self.summarizer(ch, max_length=150, min_length=40, do_sample=False)
                summaries.append(out[0]['summary_text'])
            except Exception as e:
                # fallback: keep the first 120 words of the chunk as a crude extractive summary
                summaries.append(" ".join(ch.split()[:120]))
        # if many chunks, summarize summaries again
        joined = " ".join(summaries)
        if len(summaries) > 2:
            try:
                out = self.summarizer(joined, max_length=160, min_length=60, do_sample=False)
                return out[0]['summary_text']
            except Exception:
                return joined
        return joined

    # -------------------------
    # Generate structured review items using prompt-based generation
    # -------------------------
    def generate_review_prompt(self, abstract: str, summary: str, section_summaries: Dict[str,str]) -> str:
        """
        Create a prompt for Flan-T5 to generate review sections.
        We'll instruct it to output structured text under a fixed set of headings.
        """
        prompt = [
            "You are an expert academic reviewer. Given the paper abstract and a concise summary of the paper, produce a structured review with headings: SUMMARY, STRENGTHS, WEAKNESSES, NOVELTY_ASSESSMENT, CLARITY, METHODOLOGY_ISSUES, SUGGESTIONS, RECOMMENDATION.",
            "Be concise. Use bullet points under STRENGTHS and WEAKNESSES. Use a final RECOMMENDATION of Accept / Revise / Reject with a short rationale.",
            "",
            "ABSTRACT:",
            abstract.strip()[:3000],
            "",
            "CONSOLIDATED_SUMMARY:",
            summary.strip()[:4000],
            "",
            "SECTION_SUMMARIES:"
        ]
        for k, v in section_summaries.items():
            prompt.append(f"{k.upper()}:\n{v.strip()[:1200]}\n")
        prompt.append("\nOutput:")
        return "\n".join(prompt)

    def generate_structured_review(self, abstract: str, summary: str, section_summaries: Dict[str,str], max_out_len:int=512) -> str:
        prompt = self.generate_review_prompt(abstract, summary, section_summaries)
        inputs = self.qg_tokenizer(prompt, return_tensors="pt", truncation=True, max_length=2048).to(self.qg_model.device)
        outs = self.qg_model.generate(**inputs, max_length=max_out_len, num_beams=4, early_stopping=True)
        text = self.qg_tokenizer.decode(outs[0], skip_special_tokens=True)
        return text

    # -------------------------
    # Heuristic novelty score using semantic similarity to "related work" (if available)
    # If a related-work section exists, compute the cosine similarity between the abstract
    # and related-work embeddings. Lower similarity -> potentially more novel.
    # Returns a neutral 0.5 when related work is missing or under 20 words. Heuristic only.
    # -------------------------
    def novelty_heuristic(self, abstract: str, related_work_text: str) -> float:
        if not related_work_text or len(related_work_text.split()) < 20:
            return 0.5  # unknown/neutral
        a_emb = self.embedder.encode(abstract, convert_to_tensor=True)
        r_emb = self.embedder.encode(related_work_text, convert_to_tensor=True)
        sim = util.cos_sim(a_emb, r_emb).item()
        # Map similarity to novelty score: lower sim -> higher novelty
        novelty = max(0.0, min(1.0, 1.0 - sim))  # crude
        return novelty

    # -------------------------
    # Find possibly unclear sentences: long sentences with many commas rank highest
    # -------------------------
    def unclear_sentences(self, text: str, top_k:int=5) -> List[str]:
        sents = sent_tokenize(text)
        # Score: word count and comma count only (hedges and adjectives are not detected)
        scores = []
        for s in sents:
            words = s.split()
            length = len(words)
            commas = s.count(',')
            # simple readability heuristic
            score = length * 0.6 + commas * 0.4
            scores.append((score, s))
        scores.sort(reverse=True)
        return [s for _, s in scores[:top_k]]

# -----------------------------
# Main pipeline
# -----------------------------
def review_paper(path: str, device: str = "cpu") -> Dict:
    if not os.path.exists(path):
        raise FileNotFoundError(path)
    ext = os.path.splitext(path)[1].lower()
    if ext in [".pdf"]:
        print("Extracting text from PDF...")
        text = extract_text_from_pdf(path)
    elif ext in [".txt", ".md"]:
        # Close the file handle as soon as the text is read
        with open(path, "r", encoding="utf-8") as f:
            text = f.read()
    else:
        raise ValueError("Unsupported file type. Provide a PDF, TXT, or MD file.")
    # split sections
    print("Splitting into sections...")
    sections = split_into_sections(text)
    sections_dict = {h: c for h, c in sections}
    # use the Abstract section if one was found, else fall back to the first 400 words
    abstract = sections_dict.get("Abstract", None)
    if not abstract:
        # fallback: first 400 words
        abstract = " ".join(text.split()[:400])

    # Summarize full paper
    reviewer = PaperReviewer(device=device)
    print("Summarizing the full paper (may take a while)...")
    full_summary = reviewer.summarize_long(text, max_chunk_words=450)

    # Summarize each section (short)
    section_summaries = {}
    for h, c in sections:
        if len(c.split()) < 40:
            section_summaries[h] = c.strip()
            continue
        section_summaries[h] = reviewer.summarize_long(c, max_chunk_words=250)

    # Generate structured review text
    print("Generating structured review text...")
    structured = reviewer.generate_structured_review(abstract, full_summary, section_summaries, max_out_len=512)

    # Heuristic novelty
    # split_into_sections title-cases headings, so "Related Work" is the only possible key
    related = sections_dict.get("Related Work", "")
    novelty_score = reviewer.novelty_heuristic(abstract, related)

    # Unclear sentence highlights
    unclear = reviewer.unclear_sentences(text, top_k=6)

    # Candidate "copy/paste" review lines: take first few weaknesses/strengths by simple extraction:
    candidate_lines = []
    for block in structured.split("\n"):
        if len(block.strip()) > 10 and len(block.split()) < 60:
            candidate_lines.append(block.strip())
    candidate_lines = candidate_lines[:10]

    result = {
        "structured_review_text": structured,
        "summary": full_summary,
        "section_summaries": section_summaries,
        "novelty_score_0_1": novelty_score,
        "unclear_sentences": unclear,
        "candidate_review_lines": candidate_lines
    }
    return result

# -----------------------------
# CLI or Streamlit UI
# -----------------------------
# The argparse CLI was removed for the Colab environment; the paper path is set in the else-branch below
if __name__ == "__main__":
    # Streamlit UI: used only when Streamlit is importable and the STREAMLIT_RUN
    # environment variable is set (e.g. STREAMLIT_RUN=1 streamlit run paper_reviewer.py)
    if STREAMLIT_AVAILABLE and os.getenv("STREAMLIT_RUN"):
        # Not recommended; this block is mostly for a direct streamlit run
        st.title("Automated Research Paper Reviewer (Prototype)")
        uploaded = st.file_uploader("Upload PDF or TXT", type=["pdf", "txt"])
        device = st.selectbox("Device", ["cpu", "cuda"])
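        if uploaded and st.button("Run Review"):
            # Keep the uploaded file's extension so review_paper picks the right loader
            upload_ext = os.path.splitext(uploaded.name)[1].lower() or ".pdf"
            upload_path = "uploaded_paper" + upload_ext
            with open(upload_path, "wb") as f:
                f.write(uploaded.getbuffer())
            st.info("Processing: models will load. This may take a minute.")
            out = review_paper(upload_path, device=device)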
            st.header("Summary")
            st.write(out["summary"])
            st.header("Structured Review")
            st.text(out["structured_review_text"])
            st.header("Novelty (heuristic)")
            st.write(out["novelty_score_0_1"])
            st.header("Unclear sentences")
            for s in out["unclear_sentences"]:
                st.write(s)
            st.header("Candidate lines (copy/paste)")
            for l in out["candidate_review_lines"]:
                st.write("-", l)
    else:
        # Example usage in Colab
        dummy_paper_path = "/content/Object_Distance_Estimation_from_a_Single_Moving_Camera_for_Advanced_Driver_Assistance_System.pdf"  # Replace with the actual path to your paper
        device_to_use = "cpu" # or "cuda" if you have a GPU
        if os.path.exists(dummy_paper_path):
            print(f"Reviewing {dummy_paper_path}")
            out = review_paper(dummy_paper_path, device=device_to_use)
            print("\n=== SHORT SUMMARY ===\n")
            print(out["summary"])
            print("\n=== STRUCTURED REVIEW ===\n")
            print(out["structured_review_text"])
            print("\n=== NOVELTY HEURISTIC ===\n")
            print(f"Novelty score (0 low - 1 high): {out['novelty_score_0_1']:.3f}")
            print("\n=== UNCLEAR SENTENCES (examples) ===\n")
            for s in out['unclear_sentences']:
                print("- ", s[:300].replace("\n", " "))
            print("\n=== CANDIDATE LINES (copy/paste) ===\n")
            for l in out['candidate_review_lines']:
                print("-", l)
        else:
            print(f"Error: The file {dummy_paper_path} was not found. Please replace with the actual path to your paper.")

Reviewing /content/Object_Distance_Estimation_from_a_Single_Moving_Camera_for_Advanced_Driver_Assistance_System.pdf
Extracting text from PDF...
Splitting into sections...
Loading summarization model (facebook/bart-large-cnn)...


Device set to use cpu


Loading text generation model (google/flan-t5-base)...
Loading sentence-transformers embedder (all-MiniLM-L6-v2)...
Summarizing the full paper (may take a while)...


Summarizing chunks: 100%|██████████| 8/8 [03:54<00:00, 29.35s/it]
Summarizing chunks: 100%|██████████| 2/2 [00:37<00:00, 18.54s/it]


Generating structured review text...

=== SHORT SUMMARY ===

Distance estimation is a crucial component of advanced driver assistance systems (ADAS) Researchers felt the need for multi-object detection in real-time using morepragmatic vision-based (monocular, stereo cameras) You Only Look Once (YOLO) is one of the state-of-the-art works.

=== STRUCTURED REVIEW ===

STRENGTHS, WEAKNESSES, NOVELTY_ASSESSMENT, CLARITY, METHODOLOGY_ISSUES, SUGGESTIONS, RECOMMENDATIONs

=== NOVELTY HEURISTIC ===

Novelty score (0 low - 1 high): 0.500
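
A score of exactly 0.500 is most likely the neutral fallback of `novelty_heuristic`, which is returned when no "Related Work" section was found or it has fewer than 20 words. The sketch below reproduces the mapping from cosine similarity to the novelty score in plain Python. The function names and the toy two-dimensional vectors are hypothetical stand-ins for the sentence-transformer embeddings; only the clamp `1 - sim` and the 0.5 fallback come from the script.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    # Plain cosine similarity; returns 0.0 for a zero vector to avoid division by zero.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)


def novelty_score(abstract_vec: Sequence[float],
                  related_vec: Sequence[float],
                  related_word_count: int) -> float:
    # Neutral value when the related-work text is missing or too short to judge.
    if related_word_count < 20:
        return 0.5
    sim = cosine(abstract_vec, related_vec)
    # Lower similarity maps to higher novelty, clamped to [0, 1].
    return max(0.0, min(1.0, 1.0 - sim))


# Toy vectors (assumptions), not real embeddings.
print(novelty_score([1.0, 0.0], [1.0, 0.0], related_word_count=5))
print(novelty_score([1.0, 0.0], [0.0, 1.0], related_word_count=200))
```

Because the section splitter only matches heading lines that consist of the bare heading text, a numbered heading would leave the related-work lookup empty and trigger the fallback.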

=== UNCLEAR SENTENCES (examples) ===

-  7: Residual plot Further, to check the model sensitivity at different posi- tions of the objects, a comparison of the output of the three TABLE I: Performance of the machine learning models for depth correction Model RMSE R2 Linear regression 12.906 0.947 Decision tree regression 15.526 0.936 Random
-  Object Distance Estimation from a Single Moving Camera for Advanced Driver Assistance System Anurag Thombre∗†, Avinash 
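
Both highlighted "sentences" above are extraction fragments (a flattened table and the title/author block) rather than unclear prose. That follows from the scoring rule in `unclear_sentences`, which rewards word count and commas only. A minimal sketch of that rule, using the same 0.6 and 0.4 weights; `clarity_score`, `rank_unclear`, and the sample strings are illustrative names and inputs, not part of the script.

```python
from typing import List, Tuple


def clarity_score(sentence: str) -> float:
    # Same weights as unclear_sentences: 0.6 per word plus 0.4 per comma.
    return len(sentence.split()) * 0.6 + sentence.count(",") * 0.4


def rank_unclear(sentences: List[str], top_k: int = 5) -> List[Tuple[float, str]]:
    # Highest score first; ties keep their input order because the sort is stable.
    scored = [(clarity_score(s), s) for s in sentences]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


samples = [
    "We propose a method.",
    "TABLE I Model RMSE R2 Linear regression 12.906 0.947 Decision tree regression 15.526 0.936",
    "The results, while preliminary, suggest improvement.",
]

for score, sent in rank_unclear(samples, top_k=3):
    print(f"{score:.1f}  {sent}")
```

The flattened table row ranks first simply because it is the longest string, which suggests filtering table and header residue before scoring if the highlights are meant to point at prose.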