<a href="https://colab.research.google.com/github/kartikdhyani817/AI-Research-Assistant/blob/main/Research_Agent.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install feedparser transformers sentence-transformers torch requests




In [2]:
import feedparser
from transformers import pipeline
from sentence_transformers import SentenceTransformer
import numpy as np
import datetime


In [3]:
from urllib.parse import quote_plus

def search_arxiv(query: str, max_results: int = 5):
    # URL-encode the query so spaces and special characters survive the request
    url = f"http://export.arxiv.org/api/query?search_query=all:{quote_plus(query)}&start=0&max_results={int(max_results)}"
    feed = feedparser.parse(url)
    papers = []
    for entry in feed.entries:
        papers.append({
            # " ".join(s.split()) collapses line breaks and runs of spaces
            "title": " ".join(entry.get("title", "").split()),
            "summary": " ".join(entry.get("summary", "").split()),
            "link": entry.get("link"),
            "authors": [a.get("name", "") for a in entry.get("authors", [])],
            "published": entry.get("published", "")
        })
    return papers
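
Building the query string by hand works for simple topics, but `urllib.parse.urlencode` escapes every parameter in one step. A minimal sketch, assuming a hypothetical helper named `build_arxiv_url` (not part of the notebook) and the same `search_query`, `start`, and `max_results` parameters used above:

```python
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"

def build_arxiv_url(query: str, start: int = 0, max_results: int = 5) -> str:
    """Return an arXiv API URL with every parameter percent-encoded."""
    params = {
        "search_query": f"all:{query}",  # same 'all:' field prefix as search_arxiv
        "start": start,
        "max_results": max_results,
    }
    return f"{ARXIV_API}?{urlencode(params)}"

print(build_arxiv_url("graph neural networks", max_results=3))
```

The returned string could be passed directly to `feedparser.parse`, as `search_arxiv` does with its own URL.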


In [4]:
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

def summarize_text(text: str, max_len=150, min_len=40):
    try:
        return summarizer(text, max_length=max_len, min_length=min_len, do_sample=False)[0]['summary_text']
    except Exception:
        return text[:500]  # fallback



Device set to use cpu
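
The summarization pipeline warns when `max_length` is larger than the input itself. One way to avoid that is to derive both bounds from the input size. A rough sketch, assuming a hypothetical helper `summary_length_bounds` and using the word count as a cheap stand-in for the tokenizer's token count:

```python
def summary_length_bounds(text: str, ratio: float = 0.5, floor: int = 10, cap: int = 150):
    """Pick (min_length, max_length) for a summary from the input's word count.

    Word count only approximates the number of tokens the model sees, so the
    result is a heuristic rather than an exact bound.
    """
    n_words = len(text.split())
    # Aim for about `ratio` of the input, clamped to [floor, cap].
    max_len = max(floor, min(cap, int(n_words * ratio)))
    # Keep min_length strictly below max_length.
    min_len = max(1, min(max_len - 1, max_len // 3))
    return min_len, max_len

min_len, max_len = summary_length_bounds("word " * 120)
print(min_len, max_len)
```

The two values could then be passed on as `summarize_text(text, max_len=max_len, min_len=min_len)`.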


In [5]:
review_model = pipeline("text2text-generation", model="google/flan-t5-base")

def generate_review(topic: str, summaries: list):
    prompt = f"You are an expert researcher. Topic: {topic}\n\n"
    for i, s in enumerate(summaries, 1):
        prompt += f"[{i}] {s}\n"
    prompt += "\nWrite a structured literature review:\n"
    prompt += "1) Overview\n2) Common methods\n3) Key findings\n4) Limitations\n5) Future directions\n"

    response = review_model(prompt, max_new_tokens=500, do_sample=False)
    return response[0]['generated_text']



Device set to use cpu
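
Small encoder-decoder models such as `google/flan-t5-base` are usually run with a bounded input length, so a long list of summaries can crowd out the instructions at the end of the prompt. A minimal sketch of one mitigation, assuming a hypothetical helper `build_review_prompt` and a character budget as a crude proxy for tokens: state the instructions first and give each summary an equal share of what is left.

```python
def build_review_prompt(topic: str, summaries: list, max_chars: int = 1500) -> str:
    """Assemble a review prompt that stays roughly within max_chars."""
    header = (
        f"You are an expert researcher. Topic: {topic}\n"
        "Write a structured literature review covering: overview, common methods, "
        "key findings, limitations, future directions.\n\n"
    )
    if not summaries:
        return header
    # Share the remaining budget equally between the summaries (at least 50 chars each).
    per_summary = max(50, (max_chars - len(header)) // len(summaries))
    lines = []
    for i, s in enumerate(summaries, 1):
        s = " ".join(s.split())  # collapse whitespace
        if len(s) > per_summary:
            s = s[:per_summary].rstrip() + "..."
        lines.append(f"[{i}] {s}")
    return header + "\n".join(lines)

print(build_review_prompt("graph neural networks", ["First summary.", "Second summary."]))
```

The budget is approximate because the `[i]` prefixes and the trailing `...` are not counted; an exact limit would need the model's tokenizer.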


In [6]:
def run_research_assistant(topic="graph neural networks", max_papers=3):
    print(f"🔍 Searching for papers on: {topic}")
    papers = search_arxiv(topic, max_papers)

    summaries = []
    for p in papers:
        print(f"\n📄 {p['title']}")
        s = summarize_text(p['summary'])
        print(f"   ➡️ Summary: {s}")
        summaries.append(s)

    print("\n📊 Generating literature review...")
    review = generate_review(topic, summaries)

    # datetime.utcnow() is deprecated; use a timezone-aware UTC timestamp instead
    now = datetime.datetime.now(datetime.timezone.utc).strftime("%Y-%m-%d_%H:%M:%S")
    report = f"# Literature Review on {topic}\n_Generated {now} UTC_\n\n"

    for i, p in enumerate(papers, 1):
        report += f"## {i}. {p['title']}\n"
        report += f"- Authors: {', '.join(p['authors'])}\n"
        report += f"- Published: {p['published']}\n"
        report += f"- Link: {p['link']}\n\n"
        report += f"**Auto-Summary:** {summaries[i-1]}\n\n---\n\n"

    report += "## Comparative Review\n\n" + review

    return report


In [7]:
report = run_research_assistant("renewable energy using AI", max_papers=3)

# Show in Colab
from IPython.display import Markdown, display
display(Markdown(report))

# Save to file
with open("literature_review.md", "w", encoding="utf-8") as f:
    f.write(report)


🔍 Searching for papers on: renewable energy using AI


Your max_length is set to 150, but your input_length is only 121. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=60)



📄 Energy Management for Renewable-Colocated Artificial Intelligence Data   Centers
   ➡️ Summary:  We develop an energy management system (EMS) for artificial intelligence (AI) data centers with colocated renewable generation . The EMS of renewable-colocated data center (RCDC) co-optimizes AI workload scheduling, on-site renewable utilization, and electricity market participation .

📄 Present and Future of AI in Renewable Energy Domain : A Comprehensive   Survey
   ➡️ Summary:  Artificial intelligence (AI) has become a crucial instrument for streamlining processes in various industries, including electrical power systems . Algorithms for artificial intelligence are data-driven models that are based on statistical learning theory and are used as a tool to take use of the data that the power system and its users generate .

📄 Load and Renewable Energy Forecasting Using Deep Learning for Grid   Stability
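   ➡️ Summary:  Grid operators face several challenges when integrating renewable energy sources with the grid . The most important challenge is to balance supply and demand because the solar and wind energy are highly unpredictable . When dealing with such uncertainty, trustworthy short-term load and renewable energy forecasting can help stabilize the grid, maximize energy storage, and guarantee the effective use of renewable resources .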



# Literature Review on renewable energy using AI
_Generated 2025-09-20_14:59:38 UTC_

## 1. Energy Management for Renewable-Colocated Artificial Intelligence Data Centers
- Authors: Siying Li, Lang Tong, Timothy D. Mount
- Published: 2025-07-04T18:25:42Z
- Link: http://arxiv.org/abs/2507.08011v1

**Auto-Summary:**  We develop an energy management system (EMS) for artificial intelligence (AI) data centers with colocated renewable generation . The EMS of renewable-colocated data center (RCDC) co-optimizes AI workload scheduling, on-site renewable utilization, and electricity market participation .

---

## 2. Present and Future of AI in Renewable Energy Domain : A Comprehensive Survey
- Authors: Abdur Rashid, Parag Biswas, Angona Biswas, MD Abdullah Al Nasim, Kishor Datta Gupta, Roy George
- Published: 2024-06-22T04:36:09Z
- Link: http://arxiv.org/abs/2406.16965v2

**Auto-Summary:**  Artificial intelligence (AI) has become a crucial instrument for streamlining processes in various industries, including electrical power systems . Algorithms for artificial intelligence are data-driven models that are based on statistical learning theory and are used as a tool to take use of the data that the power system and its users generate .

---

## 3. Load and Renewable Energy Forecasting Using Deep Learning for Grid Stability
- Authors: Kamal Sarkar
- Published: 2025-01-23T06:33:33Z
- Link: http://arxiv.org/abs/2501.13412v1

**Auto-Summary:**  Grid operators face several challenges when integrating renewable energy sources with the grid . The most important challenge is to balance supply and demand because the solar and wind energy are highly unpredictable . When dealing with such uncertainty, trustworthy short-term load and renewable energy forecasting can help stabilize the grid, maximize energy storage, and guarantee the effective use of renewable resources .

---

## Comparative Review

We develop an energy management system for artificial intelligence (AI) data centers with colocated renewable generation.

In [8]:
!pip install gradio PyPDF2 pdfplumber



In [9]:
import pdfplumber
from transformers import pipeline

summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

def extract_pdf_text(file):
    text = ""
    with pdfplumber.open(file) as pdf:
        for page in pdf.pages:
            # add a newline so the last word of one page does not fuse with the next page
            text += (page.extract_text() or "") + "\n"
    return text

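def summarize_pdf(pdf_file):
    text = extract_pdf_text(pdf_file)
    if not text.strip():
        return "No extractable text found in the PDF."
    # Map step: summarize fixed-size character chunks independently
    chunks = [text[i:i+900] for i in range(0, len(text), 900)]
    summaries = []
    for c in chunks:
        # truncation=True keeps over-long inputs within the model's input limit
        out = summarizer(c, max_length=150, min_length=50, do_sample=False, truncation=True)
        summaries.append(out[0]["summary_text"])
    if len(summaries) == 1:
        return summaries[0]
    # Reduce step: summarize the concatenated chunk summaries
    final_summary = summarizer(" ".join(summaries), max_length=200, min_length=80, do_sample=False, truncation=True)
    return final_summary[0]["summary_text"]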


Device set to use cpu
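
`summarize_pdf` follows a map-reduce pattern: summarize each chunk, then summarize the concatenated summaries. The sketch below isolates that pattern with the summarizer passed in as a function, so it can be tried without loading a model. The names `split_fixed`, `map_reduce_summarize`, and the `summarize_fn` parameter are assumptions made for illustration.

```python
from typing import Callable, List

def split_fixed(text: str, size: int = 900) -> List[str]:
    """Split text into consecutive chunks of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def map_reduce_summarize(text: str, summarize_fn: Callable[[str], str], size: int = 900) -> str:
    """Summarize each chunk (map), then summarize the joined summaries (reduce)."""
    chunks = split_fixed(text, size)
    if not chunks:
        return ""
    partial = [summarize_fn(c) for c in chunks]
    if len(partial) == 1:
        return partial[0]  # a single chunk needs no reduce step
    return summarize_fn(" ".join(partial))

# Stand-in summarizer for demonstration only: keep the first 20 characters.
demo = map_reduce_summarize("abc " * 500, lambda t: t[:20], size=900)
print(repr(demo))
```

With the notebook's pipeline loaded, `summarize_fn` could be `summarize_text` from In [4].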


In [10]:
# Build a simple UI with Gradio
'''import gradio as gr

def research_assistant(topic, pdf_file):
    if pdf_file is not None:
        # gr.File may pass a path string or an object with a .name attribute
        return summarize_pdf(getattr(pdf_file, "name", pdf_file))  # summarize uploaded PDF
    else:
        # run_research_assistant is defined earlier in this notebook (In [6])
        return run_research_assistant(topic, max_papers=3)

with gr.Blocks() as demo:
    gr.Markdown("# 📚 AI Research Assistant Agent")
    topic = gr.Textbox(label="Enter Research Topic (e.g. Graph Neural Networks)")
    pdf = gr.File(label="Or Upload a PDF to Summarize", file_types=[".pdf"])
    output = gr.Textbox(label="Literature Review / Summary", lines=20)
    btn = gr.Button("Generate")
    btn.click(fn=research_assistant, inputs=[topic, pdf], outputs=output)

demo.launch()
'''





In [12]:
# ### Colab cell ###
# Run this cell if you haven't loaded these already from Phase 1
!pip install -q faiss-cpu
import faiss
from sentence_transformers import SentenceTransformer
import numpy as np
from transformers import pipeline
import torch
import os

# Reuse DEVICE defined earlier if present; otherwise set it
try:
    DEVICE
except NameError:
    DEVICE = 0 if torch.cuda.is_available() else -1

EMBED_MODEL = "sentence-transformers/all-MiniLM-L6-v2"
embedder = SentenceTransformer(EMBED_MODEL)

# RAG reader model to answer questions using retrieved context
READER_MODEL = "google/flan-t5-base"
reader = pipeline("text2text-generation", model=READER_MODEL, device=DEVICE)
print("Phase 2 models ready. Device:", "GPU" if DEVICE == 0 else "CPU")




Device set to use cpu


Phase 2 models ready. Device: CPU


In [13]:
# ### Colab cell ###
# Build FAISS index and retrieval functions for a list of text chunks.

def build_faiss_index(text_chunks):
    """
    Returns (index, embeddings). The index stores L2-normalized vectors in an IndexFlatIP,
    so inner-product scores equal cosine similarity.
    """
    embs = embedder.encode(text_chunks, convert_to_numpy=True, show_progress_bar=False)
    # normalize
    faiss.normalize_L2(embs)
    dim = embs.shape[1]
    index = faiss.IndexFlatIP(dim)
    index.add(embs)
    return index, embs

def retrieve_top_k(query, index, text_chunks, embeddings, top_k=5):
    q_emb = embedder.encode([query], convert_to_numpy=True)
    faiss.normalize_L2(q_emb)
    D, I = index.search(q_emb, top_k)
    results = []
    for score, idx in zip(D[0].tolist(), I[0].tolist()):
        if idx == -1:
            continue
        results.append({"index": int(idx), "score": float(score), "text": text_chunks[idx]})
    return results
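
`build_faiss_index` relies on the fact that the inner product of two unit-length vectors equals their cosine similarity. A small pure-Python sketch of that identity and of top-k retrieval, using toy 2-D vectors (the helper names are illustrative and are not FAISS functions):

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)

def inner(a, b):
    """Inner (dot) product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def top_k(query, vectors, k=2):
    """Return (index, cosine score) pairs for the k most similar vectors."""
    q = normalize(query)
    scored = [(i, inner(q, normalize(v))) for i, v in enumerate(vectors)]
    return sorted(scored, key=lambda t: t[1], reverse=True)[:k]

docs = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
print(top_k([1.0, 1.0], docs, k=2))
```

The third vector points in the same direction as the query, so it ranks first with a score of about 1.0 regardless of its length.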


In [14]:
# ### Colab cell ###
def answer_question_with_rag(question, index, text_chunks, embeddings, top_k=4):
    retrieved = retrieve_top_k(question, index, text_chunks, embeddings, top_k=top_k)
    if not retrieved:
        return "No relevant content found."
    # Build context with indices
    context = "\n\n".join([f"[{r['index']}] {r['text']}" for r in retrieved])
    prompt = (
        "You are an expert answering using ONLY the CONTEXT below. "
        "If the answer is not present, reply 'Not found in the documents.'\n\n"
        f"CONTEXT:\n{context}\n\nQUESTION: {question}\n\nAnswer concisely and cite chunk indices used (e.g., [0], [2])."
    )
    out = reader(prompt, max_new_tokens=256, do_sample=False)
    return out[0].get("generated_text", out[0].get("text", "")).strip()
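
The prompt asks the model to cite chunk indices such as `[0]` or `[2]`. A small model may cite indices that were never retrieved, so it can help to check them. A sketch, assuming a hypothetical helper `cited_chunks` that takes the answer text and the list returned by `retrieve_top_k`:

```python
import re

def cited_chunks(answer: str, retrieved: list) -> dict:
    """Split the bracketed indices in `answer` into valid and unknown citations."""
    available = {r["index"] for r in retrieved}
    cited = [int(m) for m in re.findall(r"\[(\d+)\]", answer)]
    # dict.fromkeys removes duplicates while keeping first-seen order
    unique = list(dict.fromkeys(cited))
    return {
        "valid": [i for i in unique if i in available],
        "unknown": [i for i in unique if i not in available],
    }

retrieved = [
    {"index": 0, "score": 0.9, "text": "..."},
    {"index": 2, "score": 0.7, "text": "..."},
]
print(cited_chunks("The authors use a GNN [0], [2] and [7].", retrieved))
```

An answer with a non-empty `unknown` list could be flagged or shown without its citations.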


In [15]:
# ### Colab cell ###
def index_pdf_for_qa(pdf_path, chunk_chars=900):
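    # extract text with extract_pdf_text (defined in In [9])
    text = extract_pdf_text(pdf_path)
    # fixed-width character chunks, the same scheme summarize_pdf uses
    chunks = [text[i:i+chunk_chars] for i in range(0, len(text), chunk_chars)]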
    idx, embs = build_faiss_index(chunks)
    return {"index": idx, "embeddings": embs, "chunks": chunks}

# Example usage:
# state = index_pdf_for_qa("/path/to/your.pdf")
# print(answer_question_with_rag("What methodology did the authors use?", state['index'], state['chunks'], state['embeddings']))
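
Fixed-width character slicing can cut sentences in half, which tends to hurt both retrieval and summarization. A sentence-aware chunker is one alternative. The sketch below is a hypothetical `chunk_text` helper; treating `max_tokens_estimate` as a character budget is an assumption, as is the simple punctuation-based sentence split.

```python
import re

def chunk_text(text: str, max_tokens_estimate: int = 900) -> list:
    """Greedily pack whole sentences into chunks of at most the given size.

    The budget is counted in characters. A single sentence longer than the
    budget is hard-split so that no chunk exceeds it.
    """
    limit = max_tokens_estimate
    # Collapse whitespace, then split after sentence-ending punctuation.
    sentences = re.split(r"(?<=[.!?])\s+", " ".join(text.split()))
    chunks, current = [], ""
    for sent in sentences:
        while len(sent) > limit:  # oversized sentence: hard split
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sent[:limit])
            sent = sent[limit:]
        if not sent:
            continue
        candidate = f"{current} {sent}" if current else sent
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sent
    if current:
        chunks.append(current)
    return chunks

print(chunk_text("One. Two is longer. Three!", max_tokens_estimate=20))
```

Abbreviations such as "et al." will be treated as sentence ends by this split, so a proper sentence tokenizer may be preferable for real papers.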


In [16]:
# ### Colab cell ###
import re

def extract_references_section(text):
    """
    Try to find a References / Bibliography section and return the block.
    This uses a heuristic: finds 'References' or 'Bibliography' heading and returns remainder or until next heading.
    """
    # find possible headings; search the original text case-insensitively so that
    # match offsets always line up with `text`
    m = re.search(r'(references?|bibliography)\s*[:\n\r]+', text, flags=re.I)
    if not m:
        return ""
    start = m.start()
    # take rest of document after heading
    ref_block = text[start:]
    # optionally stop at 'appendix' or similar headings
    stop = re.search(r'\n\s*(appendix|acknowledg(e)?ments|supplementary)\b', ref_block, flags=re.I)
    if stop:
        ref_block = ref_block[:stop.start()]
    return ref_block

def parse_reference_lines(ref_block):
    """
    Split into likely reference lines and return as list.
    Filters out very short lines.
    """
    lines = [ln.strip() for ln in re.split(r'\n{1,}', ref_block) if ln.strip()]
    # Keep lines that look like references: long enough and containing a year, a DOI, or a comma
    parsed = []
    for ln in lines:
        if len(ln) < 30:
            continue
        if re.search(r'\b(19\d{2}|20\d{2})\b', ln) or 'doi' in ln.lower() or ',' in ln:
            parsed.append(ln)
    return parsed
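
The heading search in `extract_references_section` returns the first match, which can be a phrase in the body text rather than the section heading. One alternative heuristic is to accept only a line that consists of the heading alone and to prefer the last such line. A sketch with a hypothetical helper name, `find_references_start`:

```python
import re

# A heading line: optional section number, then "References" or "Bibliography",
# optionally followed by a colon, and nothing else on the line.
HEADING = re.compile(
    r"^[ \t]*(?:\d+\.?[ \t]*)?(?:references|bibliography)[ \t]*:?[ \t]*\r?$",
    re.I | re.M,
)

def find_references_start(text: str) -> int:
    """Return the offset just after the last standalone heading line, or -1."""
    matches = list(HEADING.finditer(text))
    return matches[-1].end() if matches else -1

sample = "Intro\nSee the references\nfor details.\n\nReferences\n[1] A. Author, Title, 2020.\n"
start = find_references_start(sample)
print(repr(sample[start:].strip()))
```

The in-text phrase "See the references" is skipped because the heading word is not alone on its line.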


In [17]:
# ### Colab cell ###
def extract_methods_and_findings(paper_text, top_k_chunks=5, chunk_chars=900):
    """
    Return a brief dict with 'methods' and 'findings' using RAG over the paper text.
    We'll index the paper and run two targeted questions.
    """
    # fixed-width character chunks, the same scheme summarize_pdf uses
    chunks = [paper_text[i:i+chunk_chars] for i in range(0, len(paper_text), chunk_chars)]
    idx, embs = build_faiss_index(chunks)
    # questions
    q_methods = "What method(s) or algorithm(s) are used in this paper? Provide concise bullet points."
    q_findings = "What are the key findings or results of this paper? Provide concise bullet points."
    methods = answer_question_with_rag(q_methods, idx, chunks, embs, top_k=top_k_chunks)
    findings = answer_question_with_rag(q_findings, idx, chunks, embs, top_k=top_k_chunks)
    return {"methods": methods, "findings": findings, "index": idx, "chunks": chunks, "embeddings": embs}


In [18]:
# ### Colab cell ###
def compare_multiple_papers(papers_texts, paper_titles=None):
    """
    papers_texts: list of strings (full text or abstracts)
    Returns a comparison summary: common methods, contrasts, gaps.
    """
    # Extract methods/findings for each
    brief_summaries = []
    details = []
    for i, text in enumerate(papers_texts):
        info = extract_methods_and_findings(text)
        details.append(info)
        header = paper_titles[i] if paper_titles else f"Paper {i+1}"
        brief_summaries.append(f"### {header}\nMethods:\n{info['methods']}\nFindings:\n{info['findings']}\n")

    # Build a prompt to compare
    combined = "\n\n".join(brief_summaries)
    comp_prompt = (
        "You are an expert researcher. Below are concise method/findings summaries for multiple papers:\n\n"
        f"{combined}\n\n"
        "Produce a structured comparison including:\n"
        "1) Common methods and themes (bullet points)\n"
        "2) Key differences across papers (bullet points)\n"
        "3) Open gaps or limitations (bullet points)\n"
        "4) Suggestions for future work inspired by comparing them (bullet points)\n"
        "Keep the answer concise and use paper titles to identify differences."
    )
    comp_out = reader(comp_prompt, max_new_tokens=512, do_sample=False)
    comparison = comp_out[0].get("generated_text", comp_out[0].get("text", "")).strip()
    return {"details": details, "comparison": comparison}


In [19]:
# ### Colab cell ###
!pip install -q gradio wordcloud matplotlib markdown2 reportlab



In [20]:
# ### Colab cell ###
from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas
import textwrap

def save_text_to_pdf(text, out_path, title="Report"):
    """
    Very simple PDF writer. Wraps text; not for heavy formatting.
    """
    c = canvas.Canvas(out_path, pagesize=letter)
    width, height = letter
    margin = 50
    y = height - margin
    c.setFont("Helvetica-Bold", 14)
    c.drawString(margin, y, title)
    y -= 25
    c.setFont("Helvetica", 10)
    wrapper = textwrap.TextWrapper(width=100)
    lines = []
    for paragraph in text.split("\n\n"):
        wrapped = wrapper.wrap(paragraph)
        if not wrapped:
            lines.append("")
        else:
            lines.extend(wrapped)
        lines.append("")  # blank line between paragraphs
    for line in lines:
        if y < margin + 20:
            c.showPage()
            y = height - margin
            c.setFont("Helvetica", 10)
        c.drawString(margin, y, line)
        y -= 12
    c.save()
    return out_path
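
`save_text_to_pdf` wraps whole paragraphs, and `textwrap` treats single newlines inside a paragraph as ordinary whitespace, so Markdown list items such as `- Authors: ...` and `- Published: ...` run together on one line. A sketch of a line-preserving alternative, with the hypothetical helper name `wrap_preserving_lines`:

```python
import textwrap

def wrap_preserving_lines(text: str, width: int = 100) -> list:
    """Wrap each source line separately so existing line breaks survive."""
    out = []
    for line in text.splitlines():
        if not line.strip():
            out.append("")  # keep blank lines as vertical spacing
            continue
        out.extend(textwrap.wrap(line, width=width))
    return out

report = "## 1. Title\n- Authors: A, B\n- Published: 2020\n\nSummary text."
for line in wrap_preserving_lines(report, width=40):
    print(line)
```

The returned list could replace the `lines` list that `save_text_to_pdf` builds before drawing.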


In [21]:
# ### Colab cell ###
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from collections import Counter
import io
from IPython.display import display, Image

def plot_wordcloud_from_text(text, max_words=100):
    """Return the word cloud as a PIL image, which gr.Image can display, or None if there are no words."""
    wc = WordCloud(width=800, height=400, background_color="white", max_words=max_words)
    try:
        wc.generate(text)
    except ValueError:
        # WordCloud raises ValueError when the text contains no usable words
        return None
    return wc.to_image()

def publication_timeline(papers_meta):
    """
    papers_meta: list of dicts with 'published' in ISO format, e.g. '2020-01-01'
    Returns a PIL image of a bar chart (papers per year), or None if no year is found.
    """
    from PIL import Image as PILImage  # alias avoids clashing with IPython.display.Image
    years = []
    for m in papers_meta:
        pub = m.get("published") or ""
        year_match = re.search(r'(\d{4})', pub)
        if year_match:
            years.append(int(year_match.group(1)))
    if not years:
        return None
    counts = Counter(years)
    xs = sorted(counts.keys())
    ys = [counts[x] for x in xs]
    buf = io.BytesIO()
    plt.figure(figsize=(6,3))
    plt.bar(xs, ys)
    plt.xticks(xs)  # one tick per year, shown as integers
    plt.xlabel("Year")
    plt.ylabel("Number of papers")
    plt.tight_layout()
    plt.savefig(buf, format="png")
    plt.close()
    buf.seek(0)
    img = PILImage.open(buf)
    img.load()  # read the pixel data now so the buffer is no longer needed
    return img


In [23]:
# ### Colab cell ###
import gradio as gr
import tempfile
import os

# Reuse functions: search_arxiv, summarize_text, generate_review, extract_pdf_text, summarize_pdf, index_pdf_for_qa, answer_question_with_rag, compare_multiple_papers

def process_topic(topic, max_papers=3):
    # the slider value may arrive as a float, so cast before building the query
    papers = search_arxiv(topic, max_results=int(max_papers))
    summaries = []
    meta = []
    for p in papers:
        summ = summarize_text(p.get("summary",""))
        summaries.append(summ)
        meta.append({
            "title": p.get("title"),
            "link": p.get("link"),
            "published": p.get("published"),
            "authors": p.get("authors")
        })
    review = generate_review(topic, summaries)  # from Phase 1 review_agent (if loaded)
    # create wordcloud from summaries
    wc_buf = plot_wordcloud_from_text(" ".join(summaries))
    # timeline
    tl_buf = publication_timeline(meta)
    # build report text
    report_text = f"# Literature review — {topic}\n\n"
    for i,m in enumerate(meta,1):
        report_text += f"## {i}. {m['title']}\n- authors: {', '.join(m['authors'])}\n- published: {m['published']}\n- link: {m['link']}\n\n"
        report_text += f"**Auto-summary:**\n{summaries[i-1]}\n\n---\n\n"
    report_text += "## Comparative review\n\n" + review
    # save pdf to temp
    tmp = tempfile.NamedTemporaryFile(suffix=".pdf", delete=False)
    save_text_to_pdf(report_text, tmp.name, title=f"Literature review — {topic}")
    return report_text, wc_buf, tl_buf, tmp.name

# For PDF upload and Q/A
def process_uploaded_pdfs(uploaded_files):
    """
    uploaded_files: list of File objects from gradio (temp files)
    We index each PDF and prepare combined metadata
    """
    pdf_states = []
    meta = []
    for f in uploaded_files:
        path = f.name if hasattr(f, 'name') else f
        text = extract_pdf_text(path)
        summary = summarize_pdf(path)
        state = index_pdf_for_qa(path)
        # extract references heuristically
        refs_block = extract_references_section(text)
        refs = parse_reference_lines(refs_block)
        pdf_states.append({"path": path, "text": text, "summary": summary, "state": state, "references": refs})
        # use the first line of the text (up to 80 chars) as a rough title
        title = text.split("\n")[0][:80]
        meta.append({"title": title, "published": ""})
    # combined comparison (Phase 3)
    texts = [s["text"] for s in pdf_states]
    titles = [m["title"] for m in meta]
    comp = compare_multiple_papers(texts, paper_titles=titles)
    # combined wordcloud and report
    summaries = [s["summary"] for s in pdf_states]
    report_text = "# Uploaded PDFs analysis\n\n"
    for i, s in enumerate(pdf_states,1):
        report_text += f"## {i}. {titles[i-1]}\n**Auto-summary:**\n{s['summary']}\n\n**Top References (extracted):**\n"
        for r in (s["references"][:5]):
            report_text += f"- {r}\n"
        report_text += "\n---\n\n"
    report_text += "## Multi-paper comparison\n\n" + comp["comparison"]
    # visualizations
    wc_buf = plot_wordcloud_from_text(" ".join(summaries))
    tl_buf = publication_timeline(meta)
    tmp = tempfile.NamedTemporaryFile(suffix=".pdf", delete=False)
    save_text_to_pdf(report_text, tmp.name, title="Uploaded PDFs analysis")
    # return exactly four values, matching the four Gradio outputs wired to this handler
    return report_text, wc_buf, tl_buf, tmp.name

# Build Gradio interface
with gr.Blocks() as demo:
    gr.Markdown("# 📚 AI Research Assistant")
    with gr.Tab("Topic search & review"):
        topic_in = gr.Textbox(label="Topic (arXiv search)")
        max_papers = gr.Slider(1,6, value=3, step=1, label="Max papers")
        run_btn = gr.Button("Fetch & Review")
        review_out = gr.Textbox(lines=20, label="Report (Markdown)")
        wc_img = gr.Image(label="Wordcloud", type="pil")
        tl_img = gr.Image(label="Publication timeline", type="pil")
        pdf_download = gr.File(label="Download PDF")
        run_btn.click(process_topic, inputs=[topic_in, max_papers], outputs=[review_out, wc_img, tl_img, pdf_download])

    with gr.Tab("Upload PDFs & Analyze"):
        pdfs = gr.File(label="Upload one or more PDFs", file_count="multiple", file_types=[".pdf"])
        analyze_btn = gr.Button("Analyze PDFs")
        analyze_report = gr.Textbox(lines=20, label="Report (Markdown)")
        pdf_wc = gr.Image(label="Wordcloud", type="pil")
        pdf_tl = gr.Image(label="Timeline", type="pil")
        pdf_dl = gr.File(label="Download PDF")
        analyze_btn.click(process_uploaded_pdfs, inputs=[pdfs], outputs=[analyze_report, pdf_wc, pdf_tl, pdf_dl])

    with gr.Tab("PDF Q&A (single doc)"):
        upload_single = gr.File(label="Upload single PDF", file_count="single", file_types=[".pdf"])
        qa_index_btn = gr.Button("Index PDF")
        question = gr.Textbox(label="Ask a question about the PDF")
        answer_out = gr.Textbox(lines=8, label="Answer")
        # state container
        hidden_state = gr.State()
        def index_and_return_state(uploaded):
            if not uploaded:
                return None
            # gr.File may pass a path string or an object with a .name attribute
            st = index_pdf_for_qa(getattr(uploaded, "name", uploaded))
            return st
        qa_index_btn.click(index_and_return_state, inputs=[upload_single], outputs=[hidden_state])
        def ask_q_and_answer(q, st):
            if not st:
                return "Please index a PDF first."
            return answer_question_with_rag(q, st['index'], st['chunks'], st['embeddings'], top_k=4)
        question.submit(ask_q_and_answer, inputs=[question, hidden_state], outputs=[answer_out])

demo.launch(share=True)




