# LitSearch Setup and Exploration

# 📓 LitSearch AI - MVP Development Plan

## Phase 1: DATA (Today - 3h)

### Step 1: Fetch 100 cs.AI papers
- Download 100 recent papers from arXiv
- Category: cs.AI
- Period: last 6 months
- **Deliverable:** List of 100 papers with metadata
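
The category and period constraints above could be expressed directly in the search query. The sketch below only builds the query string; it assumes the arXiv API's `cat:` field prefix and `submittedDate:[start TO end]` range syntax, and it approximates "6 months" as 180 days. The helper name `build_recent_query` is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def build_recent_query(category: str, months: int, now: datetime) -> str:
    """Build an arXiv query string for one category and a recent time window.

    Assumption: a month is approximated as 30 days.
    """
    start = now - timedelta(days=30 * months)
    fmt = "%Y%m%d%H%M"  # timestamp format used in submittedDate ranges
    return (
        f"cat:{category} AND "
        f"submittedDate:[{start.strftime(fmt)} TO {now.strftime(fmt)}]"
    )


# A fixed reference date keeps the example reproducible.
query = build_recent_query("cs.AI", 6, datetime(2025, 11, 1, tzinfo=timezone.utc))
print(query)
```

The resulting string could be passed as the `query` argument of `arxiv.Search`, with `max_results=100` to match the plan.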

### Step 2: Parse PDFs (raw text)
- Download the PDFs from arXiv
- Extract the raw text with pypdf
- Handle errors (corrupted PDFs, etc.)
- **Deliverable:** Full text for each paper

### Step 3: Validate data quality
- Check that the PDFs are readable
- Analyze the average text length
- Identify problematic papers
- **Deliverable:** Clean, validated dataset
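
One way to identify problematic papers is to flag extracted texts that are very short or that contain mostly non-letter characters, which often indicates a failed extraction. This is a minimal sketch: the function name `quality_flags` and both thresholds are assumptions chosen for illustration, not values taken from the notebook.

```python
from typing import List


def quality_flags(text: str, min_chars: int = 2000,
                  min_alpha_ratio: float = 0.6) -> List[str]:
    """Return a list of quality problems found in one extracted text."""
    flags = []
    if len(text) < min_chars:
        flags.append("too_short")
    if text:
        # Share of characters that are letters or whitespace.
        readable = sum(ch.isalpha() or ch.isspace() for ch in text)
        if readable / len(text) < min_alpha_ratio:
            flags.append("low_alpha_ratio")
    return flags


print(quality_flags("abc"))        # a very short text
print(quality_flags("#" * 3000))   # long, but no readable characters
```

Applied to the notebook's data, this could be called on each `paper['full_text']` to build the list of papers to inspect by hand.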

---

## Phase 2: RAG PIPELINE (Tonight + Tomorrow - 5h)

### Step 4: Smart chunking
- Strategy: Title+Abstract + Body chunks
- Chunk size: 1000 chars with an overlap of 200
- Preserve metadata (arXiv ID, section)
- **Deliverable:** Chunks with metadata

### Step 5: Embeddings + Vector store
- Create embeddings with OpenAI
- Store in ChromaDB
- Index with metadata
- **Deliverable:** Working vector store

### Step 6: Test retrieval (without LLM)
- Test similarity search
- Check relevance of results
- Tune parameters (k, threshold)
- **Deliverable:** Working retrieval
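
Tuning `k` and a threshold can be prototyped as a small post-filter on the search results. The sketch below assumes scores are distances (lower means more similar), which is how Chroma reports them through `similarity_search_with_score`. The function name and the sample pairs are hypothetical stand-ins for `(Document, score)` tuples.

```python
from typing import Any, List, Tuple


def filter_by_distance(results: List[Tuple[Any, float]],
                       max_distance: float, k: int) -> List[Tuple[Any, float]]:
    """Keep results within max_distance, closest first, at most k of them."""
    kept = [(item, dist) for item, dist in results if dist <= max_distance]
    kept.sort(key=lambda pair: pair[1])
    return kept[:k]


# Hypothetical (label, distance) pairs, for illustration only.
sample = [("chunk-a", 0.82), ("chunk-b", 0.41), ("chunk-c", 1.30), ("chunk-d", 0.67)]
top = filter_by_distance(sample, max_distance=1.0, k=2)
print(top)
```

A suitable `max_distance` depends on the embedding model and the distance metric, so it is best chosen by inspecting real scores for a few known-good and known-bad queries.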

### Step 7: Complete RAG chain (with LLM)
- Create a scientific prompt template
- Integrate the LLM (GPT-4)
- Chain retrieval + generation
- **Deliverable:** Working end-to-end RAG

### Step 8: Test with questions
- Prepare 10 test questions
- Evaluate answer quality
- Identify problems
- **Deliverable:** 5+ questions that work well
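
If each test question is paired with a paper that should be retrieved for it, the retrieval side can be scored with a simple hit rate before looking at the generated answers. This is one possible metric, not the evaluation used later in the notebook; the function name and the placeholder IDs are hypothetical.

```python
from typing import Dict, List


def hit_rate_at_k(retrieved: Dict[str, List[str]],
                  expected: Dict[str, str], k: int = 5) -> float:
    """Fraction of questions whose expected paper ID is in the top-k results."""
    if not expected:
        return 0.0
    hits = sum(
        1 for question, paper_id in expected.items()
        if paper_id in retrieved.get(question, [])[:k]
    )
    return hits / len(expected)


# Placeholder question keys and paper IDs, for illustration only.
expected = {"q1": "id-1", "q2": "id-2"}
retrieved = {"q1": ["id-9", "id-1"], "q2": ["id-3"]}
print(hit_rate_at_k(retrieved, expected, k=5))
```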

---

## Phase 3: POLISH (Tomorrow afternoon - 3h)

### Step 9: Optimize prompts
- Improve answer quality
- Reduce hallucinations
- Enforce systematic citations
- **Deliverable:** Higher-quality answers
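
Beyond prompt wording, citations can also be checked after generation. The sketch below extracts `[arXiv:ID]` markers from an answer and reports any ID that was not among the retrieved chunks, which is a cheap signal of a hallucinated source. The function name is hypothetical, and the second ID in the example is deliberately bogus.

```python
import re
from typing import Dict, Iterable

# Matches the [arXiv:ID] citation format requested in the system prompt.
CITATION_RE = re.compile(r"\[arXiv:([^\]\s]+)\]")


def check_citations(answer: str, retrieved_ids: Iterable[str]) -> Dict[str, object]:
    """Report which cited IDs appear in the answer and which were never retrieved."""
    cited = set(CITATION_RE.findall(answer))
    unknown = cited - set(retrieved_ids)
    return {
        "has_citation": bool(cited),
        "cited": sorted(cited),
        "unknown": sorted(unknown),
    }


report = check_citations(
    "Dropout of 0.2 helped [arXiv:2011.02840v2]; see also [arXiv:9999.00000v1].",
    retrieved_ids=["2011.02840v2"],
)
print(report)
```

An answer with `has_citation` set to `False`, or with a non-empty `unknown` list, could be regenerated or flagged for review.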

### Step 10: Improve citations
- Format: arXiv:ID, Section, Page
- Clear display of sources
- Relevance scores
- **Deliverable:** Professional citations
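
A small formatter can produce this citation layout from chunk metadata. In this sketch the `arxiv_id` and `section` keys match the metadata created in the chunking step, while the `page` key is an assumption: the notebook does not currently record page numbers, so it is treated as optional. The function name is hypothetical.

```python
from typing import Optional


def format_citation(metadata: dict, distance: Optional[float] = None) -> str:
    """Format one source as [arXiv:ID, Section: ..., p. N] (distance d)."""
    parts = [f"arXiv:{metadata['arxiv_id']}"]
    if metadata.get("section"):
        parts.append(f"Section: {metadata['section']}")
    # 'page' is a hypothetical key; it is skipped when absent.
    if metadata.get("page") is not None:
        parts.append(f"p. {metadata['page']}")
    citation = "[" + ", ".join(parts) + "]"
    if distance is not None:
        citation += f" (distance {distance:.3f})"
    return citation


print(format_citation({"arxiv_id": "2011.02840v2", "section": "body"}, 0.4123))
```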

### Step 11: Clean up the notebook
- Explanatory Markdown between cells
- Remove dead code
- Organize logically
- **Deliverable:** Presentable notebook

---

## Phase 4: PRESENTATION (Saturday morning - 2h)

### Step 12: README + documentation
- Architecture diagram
- Origin story (researcher friend)
- Link with INSPIRE AI
- **Deliverable:** Professional README

## Set up the environment and install the necessary packages

In [None]:
!pip install jupyter arxiv pypdf openai langchain langchain-core langchain-openai langchain-community langchain-text-splitters chromadb python-dotenv pandas requests

In [None]:
import os
import sys
from dotenv import load_dotenv

import arxiv

from pypdf import PdfReader
import io
import requests

from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma

import pandas as pd
from datetime import datetime, timedelta
from typing import List, Dict
import json
import numpy as np

In [None]:
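load_dotenv()

# The OpenAI clients read the key from the environment; fail early if it is missing.
openai_key = os.getenv("OPENAI_API_KEY")
if not openai_key:
    raise RuntimeError("OPENAI_API_KEY is not set; add it to your .env file")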

# Phase 1: Data extraction

## Step 1: Fetch papers from arXiv

In [39]:
search = arxiv.Search(
    query='brain tumor AND MRI AND deep learning AND (detection OR segmentation)',
    max_results=150,
    sort_by=arxiv.SortCriterion.Relevance)

# Search.results() is deprecated; fetch results through a Client instead.
client = arxiv.Client()

papers = []
for result in client.results(search):
    paper = {
        # entry_id is a URL; its last path segment is the versioned arXiv ID
        'article_id': result.entry_id.split('/')[-1],
        'title': result.title,
        'authors': [author.name for author in result.authors],
        'published': result.published,
        'summary': result.summary,
        'pdf_url': result.pdf_url
    }
    papers.append(paper)
    print(f"✓ {paper['title'][:60]}...")

print(f"\n{len(papers)} papers fetched")

✓ DR-Unet104 for Multimodal MRI brain tumor segmentation...
✓ Robust Semantic Segmentation of Brain Tumor Regions from 3D ...
✓ Reproducible Evaluation of Data Augmentation and Loss Functi...
✓ Brain Tumor Segmentation using 3D-CNNs with Uncertainty Esti...
✓ Optimizing Brain Tumor Classification: A Comprehensive Study...
✓ Brain Tumor Sequence Registration with Non-iterative Coarse-...
✓ Efficient Meningioma Tumor Segmentation Using Ensemble Learn...
✓ Brain tumor multi classification and segmentation in MRI ima...
✓ Brain Tumor Segmentation from MRI Images using Deep Learning...
✓ MRI Brain Tumor Detection with Computer Vision...
✓ Parameter-efficient Fine-tuning for improved Convolutional B...
✓ Novel Deep Learning Architectures for Classification and Seg...
✓ Analyzing Deep Learning Based Brain Tumor Segmentation with ...
✓ Brain Tumor Detection in MRI Based on Federated Learning wit...
✓ Deep Brain Net: An Optimized Deep Learning Model for Brain t...


## Step 2: Parse PDFs

In [None]:
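def parse_pdf(pdf_url):
    """Download a PDF and return its text, or None if extraction fails."""
    try:
        response = requests.get(pdf_url, timeout=30)
        response.raise_for_status()  # treat HTTP errors as failures
        pdf = PdfReader(io.BytesIO(response.content))
        # Guard against pages that yield no text
        text = "".join((page.extract_text() or "") for page in pdf.pages)
        # Very short output usually means a scanned or corrupted PDF
        return text if len(text) > 500 else None
    except Exception as exc:
        print(f"  Failed to parse {pdf_url}: {exc}")
        return None

print(f"\nParsing {len(papers)} PDFs...")

parsed_ok = 0
for i, paper in enumerate(papers):
    text = parse_pdf(paper['pdf_url'])
    
    if text:
        paper['full_text'] = text
        parsed_ok += 1
    else:
        # Fall back to title + abstract so the paper stays searchable
        paper['full_text'] = f"{paper['title']}\n\n{paper['summary']}"
    
    if i % 10 == 0:
        print(f"{i}/{len(papers)}")

# Count real parses; a long abstract fallback must not count as a success
print(f"\nDone: {parsed_ok}/{len(papers)} parsed")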

## Step 3: Data Quality Analysis

In [41]:
df_analysis = pd.DataFrame(papers)
df_analysis['text_length'] = df_analysis['full_text'].str.len()

print("Text Length Statistics:")
print(df_analysis['text_length'].describe())

print(f"\nShort papers (<2k chars): {(df_analysis['text_length'] < 2000).sum()}")

print("\n✓ Data validated")

Text Length Statistics:
count       150.000000
mean      36151.133333
std       17251.731701
min        1556.000000
25%       25077.750000
50%       31661.000000
75%       45653.750000
max      109250.000000
Name: text_length, dtype: float64

Short papers (<2k chars): 2

✓ Data validated


In [42]:
df_analysis.head(10)

| | article_id | title | authors | published | summary | pdf_url | full_text | text_length |
|---|---|---|---|---|---|---|---|---|
| 0 | 2011.02840v2 | DR-Unet104 for Multimodal MRI brain tumor segm... | [Jordan Colman, Lei Zhang, Wenting Duan, Xujio... | 2020-11-04 01:24:26+00:00 | In this paper we propose a 2D deep residual Un... | https://arxiv.org/pdf/2011.02840v2 | \n \n \nDR-Unet104 for Multimodal MRI brain t... | 25003 |
| 1 | 2001.02040v1 | Robust Semantic Segmentation of Brain Tumor Re... | [Andriy Myronenko, Ali Hatamizadeh] | 2020-01-06 07:47:42+00:00 | Multimodal brain tumor segmentation challenge ... | https://arxiv.org/pdf/2001.02040v1 | Robust Semantic Segmentation of Brain Tumor\nR... | 18633 |
| 2 | 2510.08617v1 | Reproducible Evaluation of Data Augmentation a... | [Saumya B] | 2025-10-08 06:15:28+00:00 | Brain tumor segmentation is crucial for diagno... | https://arxiv.org/pdf/2510.08617v1 | Reproducible Evaluation of Data Augmentation a... | 28235 |
| 3 | 2009.12188v1 | Brain Tumor Segmentation using 3D-CNNs with Un... | [Laura Mora Ballestar, Veronica Vilaplana] | 2020-09-24 10:50:12+00:00 | Automation of brain tumors in 3D magnetic reso... | https://arxiv.org/pdf/2009.12188v1 | Brain Tumor Segmentation using 3D-CNNs with\nU... | 24661 |
| 4 | 2308.06821v1 | Optimizing Brain Tumor Classification: A Compr... | [Raza Imam, Mohammed Talha Alam] | 2023-08-13 17:30:32+00:00 | Deep learning has emerged as a prominent field... | https://arxiv.org/pdf/2308.06821v1 | Optimizing Brain Tumor Classification: A Compr... | 38656 |
| 5 | 2211.07876v1 | Brain Tumor Sequence Registration with Non-ite... | [Mingyuan Meng, Lei Bi, Dagan Feng, Jinman Kim] | 2022-11-15 03:58:47+00:00 | In this study, we focus on brain tumor sequenc... | https://arxiv.org/pdf/2211.07876v1 | Brain Tumor Sequence Registration with Non-ite... | 25657 |
| 6 | 2510.21040v1 | Efficient Meningioma Tumor Segmentation Using ... | [Mohammad Mahdi Danesh Pajouh, Sara Saeedi] | 2025-10-23 22:51:22+00:00 | Meningiomas represent the most prevalent form ... | https://arxiv.org/pdf/2510.21040v1 | In loving memory of a wonderful grandma whose ... | 27877 |
| 7 | 2304.10039v2 | Brain tumor multi classification and segmentat... | [Belal Amin, Romario Sameh Samir, Youssef Tare... | 2023-04-20 01:32:55+00:00 | This study proposes a deep learning model for ... | https://arxiv.org/pdf/2304.10039v2 | BRAIN TUMOR MULTI CLASSIFICATION AND\nSEGMENTA... | 29201 |
| 8 | 2305.00257v1 | Brain Tumor Segmentation from MRI Images using... | [Ayan Gupta, Mayank Dixit, Vipul Kumar Mishra,... | 2023-04-29 13:33:21+00:00 | A brain tumor, whether benign or malignant, ca... | https://arxiv.org/pdf/2305.00257v1 | Brain Tumor Segmentation from MRI Images using... | 37212 |
| 9 | 2510.10250v1 | MRI Brain Tumor Detection with Computer Vision | [Jack Krolik, Jake Lynn, John Henry Rudden, Dm... | 2025-10-11 15:07:52+00:00 | This study explores the application of deep le... | https://arxiv.org/pdf/2510.10250v1 | MRI Brain Tumor Detection with Computer Vision... | 28734 |


# Phase 2: RAG pipeline

## Step 4: Chunking

In [None]:
def chunk_paper(paper):
    """Create chunks with metadata"""
    
    title_abstract = f"Title: {paper['title']}\n\nAbstract: {paper['summary']}"
    
    chunks = [Document(
        page_content=title_abstract,
        metadata={
            'arxiv_id': paper['article_id'],
            'title': paper['title'],
            'section': 'title_abstract',
            'authors': ', '.join(paper['authors'][:3])
        }
    )]
    
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200
    )
    
    body_chunks = splitter.create_documents(
        texts=[paper['full_text']],
        metadatas=[{
            'arxiv_id': paper['article_id'],
            'title': paper['title'],
            'section': 'body',
            'authors': ', '.join(paper['authors'][:3])
        }]
    )
    
    chunks.extend(body_chunks)
    return chunks

print("Chunking papers...")

all_chunks = []
for i, paper in enumerate(papers):
    paper_chunks = chunk_paper(paper)
    all_chunks.extend(paper_chunks)
    
    if i % 20 == 0:
        print(f"{i}/{len(papers)} - {len(all_chunks)} chunks so far")

print(f"\nDone: {len(all_chunks)} total chunks")
print(f"Avg chunks per paper: {len(all_chunks)/len(papers):.1f}")

In [44]:
all_chunks

[Document(metadata={'arxiv_id': '2011.02840v2', 'title': 'DR-Unet104 for Multimodal MRI brain tumor segmentation', 'section': 'title_abstract', 'authors': 'Jordan Colman, Lei Zhang, Wenting Duan'}, page_content="Title: DR-Unet104 for Multimodal MRI brain tumor segmentation\n\nAbstract: In this paper we propose a 2D deep residual Unet with 104 convolutional layers (DR-Unet104) for lesion segmentation in brain MRIs. We make multiple additions to the Unet architecture, including adding the 'bottleneck' residual block to the Unet encoder and adding dropout after each convolution block stack. We verified the effect of introducing the regularisation of dropout with small rate (e.g. 0.2) on the architecture, and found a dropout of 0.2 improved the overall performance compared to no dropout, or a dropout of 0.5. We evaluated the proposed architecture as part of the Multimodal Brain Tumor Segmentation (BraTS) 2020 Challenge and compared our method to DeepLabV3+ with a ResNet-V2-152 backbone. We

## Step 5: Embeddings and vector store

In [None]:
print("Creating embeddings...")
print(f"Chunks to process: {len(all_chunks)}")
print("This will take ~2-3 minutes\n")

embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small"
)

vectorstore = Chroma.from_documents(
    documents=all_chunks,
    embedding=embeddings,
    collection_name="litsearch_papers"
)

print(f"\nVector store created")
print(f"  Total chunks indexed: {len(all_chunks)}")
print(f"  Ready for retrieval")
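
With a large number of chunks, sending everything in one call gives no progress feedback and makes a mid-run failure costly. A possible alternative is to index in batches. The helper below is a generic sketch: the name `batched` and the batch size of 500 are assumptions, and the integers stand in for `Document` objects.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# Integers stand in for Document objects in this illustration.
sizes = [len(batch) for batch in batched(list(range(1050)), 500)]
print(sizes)
```

In the notebook, each batch could be passed to `vectorstore.add_documents(batch)` on a store created beforehand, printing progress after every batch.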

## Step 6: Test retrieval (without LLM)


In [None]:
def test_retrieval(query, k=5):
    """Run a semantic search and print the top-k chunks with their scores."""
    # With Chroma, the returned score is a distance: lower means a closer match.
    results = vectorstore.similarity_search_with_score(query, k=k)
    
    print(f"\nQuery: '{query}'")
    print(f"{'='*60}\n")
    
    for i, (doc, score) in enumerate(results):
        print(f"[{i+1}] Score (distance, lower is better): {score:.3f}")
        print(f"    ArXiv: {doc.metadata['arxiv_id']}")
        print(f"    Title: {doc.metadata['title'][:50]}...")
        print(f"    Section: {doc.metadata['section']}")
        print(f"    Text: {doc.page_content[:150]}...")
        print()

test_queries = [
      "What are the best performing segmentation techniques on BraTS dataset?",
      "What are the main challenges that remain unsolved in brain tumor segmentation?",
      "What are the most commonly used public datasets for brain tumor detection?"
]

for query in test_queries:
    test_retrieval(query, k=3)
    print("\n" + "─"*60 + "\n")

## Step 7: Build the complete RAG chain

In [33]:
llm = ChatOpenAI(model="gpt-4", temperature=0)

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful AI research assistant. Use the context to answer. Cite sources as [arXiv:ID].\n\nContext: {context}"),
    ("human", "{question}")
])

retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

def format_docs(docs):
    return "\n\n".join([f"[arXiv:{doc.metadata['arxiv_id']}]: {doc.page_content}" for doc in docs])

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

print("✓ RAG chain ready")

✓ RAG chain ready


In [None]:
test_queries = [
      "What are the best performing segmentation techniques on BraTS dataset?",
      "What are the main challenges that remain unsolved in brain tumor segmentation?",
      "What are the most commonly used public datasets for brain tumor detection?"
]

for query in test_queries:
    print(f"\n{'='*60}")
    print(f"Q: {query}")
    print(f"{'='*60}\n")
    
    response = rag_chain.invoke(query)
    print(response)
    print()

# RAG Evaluation

In [None]:
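from langchain_openai import ChatOpenAI
import json
import re

def evaluate_faithfulness(question, answer, sources):
    """Evaluate if answer is faithful to sources using LLM judge"""
    
    context = "\n\n".join([f"Source {i+1}: {s.page_content}" for i, s in enumerate(sources)])
    
    prompt = f"""Rate faithfulness (0.0-1.0):

Sources:
{context}

Question: {question}
Answer: {answer}

Is every claim supported by sources? Return only a number 0.0-1.0."""
    
    judge = ChatOpenAI(model="gpt-4", temperature=0)
    reply = judge.invoke(prompt).content.strip()
    
    # The judge may add words around the number; extract the first numeric token.
    match = re.search(r"\d+(?:\.\d+)?", reply)
    if match is None:
        raise ValueError(f"Could not parse a faithfulness score from: {reply!r}")
    
    # Clamp to the expected 0.0-1.0 range.
    return min(1.0, max(0.0, float(match.group())))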


test_queries = [
      "What are the best performing segmentation techniques on BraTS dataset?",
      "What are the main challenges that remain unsolved in brain tumor segmentation?",
      "What are the most commonly used public datasets for brain tumor detection?"
]

print("FAITHFULNESS EVALUATION")
print("="*60)

scores = []
for query in test_queries:
    answer = rag_chain.invoke(query)
    sources = vectorstore.similarity_search(query, k=5)
    score = evaluate_faithfulness(query, answer, sources)
    scores.append(score)
    
    print(f"{query[:45]}...")
    print(f"  Faithfulness: {score:.2f}\n")

print(f"Average Faithfulness: {np.mean(scores):.2f}")