# LitSearch AI

A RAG (Retrieval-Augmented Generation) pipeline for searching and querying scientific papers from arXiv.

## Pipeline Overview

1. **Data Collection**: Fetch papers and their metadata from the arXiv API
2. **PDF Parsing**: Extract text from the PDFs, falling back to title and abstract when parsing fails
3. **Data Validation**: Check the length of the extracted text
4. **Chunking**: Split documents into overlapping, searchable chunks with metadata
5. **Embeddings**: Create vector embeddings with OpenAI's `text-embedding-3-small`
6. **Vector Store**: Index chunks in ChromaDB for similarity search
7. **RAG Chain**: Combine retrieval with an LLM (GPT-4) for question answering
8. **Evaluation**: Measure faithfulness of generated answers with an LLM judge

## Setup

In [None]:
!pip install jupyter arxiv pypdf openai langchain langchain-core langchain-openai langchain-community langchain-text-splitters chromadb python-dotenv pandas requests

In [16]:
import os
import sys
from dotenv import load_dotenv

import arxiv

from pypdf import PdfReader
import io
import requests

from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma

import pandas as pd
from datetime import datetime, timedelta
from typing import List, Dict
import json
import numpy as np

In [17]:
load_dotenv()

openai_key = os.getenv("OPENAI_API_KEY")
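
`os.getenv` returns `None` when the variable is missing, so a misconfigured environment would only surface further down the notebook. A minimal sketch of a fail-fast check follows; `require_env` is a hypothetical helper introduced here for illustration, not part of the original notebook.

```python
import os

def require_env(name):
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file or environment")
    return value

# Intended use in this notebook (commented out so the sketch runs without a key):
# openai_key = require_env("OPENAI_API_KEY")
```

Raising at setup time keeps the failure next to its cause instead of in the embedding or chat cells.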

## 1. Data Collection

### Fetch Papers from arXiv

In [None]:
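search = arxiv.Search(
    query='brain tumor detection AND deep learning AND MRI images',
    max_results=150,
    sort_by=arxiv.SortCriterion.Relevance,
)

# Search.results() is deprecated in recent arxiv releases; iterate through a Client instead
client = arxiv.Client()

papers = []
for result in client.results(search):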
    paper = {
        'article_id': result.entry_id.split('/')[-1],
        'title': result.title,
        'authors': [author.name for author in result.authors],
        'published': result.published,
        'summary': result.summary,
        'pdf_url': result.pdf_url
    }
    papers.append(paper)
    print(f"✓ {paper['title'][:60]}...")

print(f"\n{len(papers)} papers fetched")

### Parse PDFs

In [20]:
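def parse_pdf(pdf_url):
    """Download a PDF and return its text, or None if it cannot be parsed."""
    try:
        response = requests.get(pdf_url, timeout=30)
        response.raise_for_status()  # treat HTTP errors as failures, not as PDF bytes
        pdf = PdfReader(io.BytesIO(response.content))
        text = "".join(page.extract_text() or "" for page in pdf.pages)
        # Remove invalid unicode surrogate characters
        text = text.encode('utf-8', errors='surrogatepass').decode('utf-8', errors='replace')
        # Very short results usually mean the PDF had no extractable text layer
        return text if len(text) > 500 else None
    except Exception:  # network or pypdf errors; KeyboardInterrupt still propagates
        return None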

print(f"\nParsing {len(papers)} PDFs...")

success = 0  # count real parses, not fallbacks that happen to be long
for i, paper in enumerate(papers):
    text = parse_pdf(paper['pdf_url'])
    
    if text:
        paper['full_text'] = text
        success += 1
    else:
        # Fall back to title + abstract when the PDF could not be parsed
        paper['full_text'] = f"{paper['title']}\n\n{paper['summary']}"
    
    if i % 10 == 0:
        print(f"{i}/{len(papers)}")

print(f"\nDone: {success}/{len(papers)} parsed")


Parsing 150 PDFs...
0/150
10/150
20/150
30/150
40/150
50/150
60/150
70/150
80/150
90/150
100/150
110/150
120/150
130/150
140/150

Done: 150/150 parsed


### Data Quality Analysis

In [21]:
df_analysis = pd.DataFrame(papers)
df_analysis['text_length'] = df_analysis['full_text'].str.len()

print("Text Length Statistics:")
print(df_analysis['text_length'].describe())

print(f"\nShort papers (<2k chars): {(df_analysis['text_length'] < 2000).sum()}")

print("\n✓ Data validated")

Text Length Statistics:
count       150.000000
mean      48011.713333
std       30693.738474
min       15618.000000
25%       27611.500000
50%       37797.000000
75%       62723.250000
max      235577.000000
Name: text_length, dtype: float64

Short papers (<2k chars): 0

✓ Data validated
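
The check above only looks at text length. As a hedged sketch of two further signals that could be added, the helper below reports the share of alphabetic characters and the number of U+FFFD replacement characters left behind by `decode(..., errors='replace')` in `parse_pdf`. The names `text_quality` and `looks_suspicious` and the 0.5 threshold are assumptions for illustration, not values taken from this notebook.

```python
def text_quality(text):
    """Return simple quality signals for extracted PDF text."""
    n = len(text)
    if n == 0:
        return {"length": 0, "alpha_ratio": 0.0, "replacement_chars": 0}
    letters = sum(ch.isalpha() for ch in text)
    return {
        "length": n,
        "alpha_ratio": letters / n,                  # share of alphabetic characters
        "replacement_chars": text.count("\ufffd"),   # U+FFFD markers from lossy decoding
    }

def looks_suspicious(text, min_alpha_ratio=0.5):
    """Flag text that is empty or mostly non-alphabetic (hypothetical threshold)."""
    return text_quality(text)["alpha_ratio"] < min_alpha_ratio

sample = "Brain tumor segmentation with deep learning on MRI."
print(text_quality(sample))
print(looks_suspicious(sample), looks_suspicious("0.995 0.949 | 12 34 56"))
```

Applied per paper, for example with `df_analysis['full_text'].map(looks_suspicious)`, this would flag documents that are long enough but consist mostly of tables or extraction debris.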


## 2. RAG Pipeline

### Chunking

In [23]:
def chunk_paper(paper):
    """Create chunks with metadata"""
    
    title_abstract = f"Title: {paper['title']}\n\nAbstract: {paper['summary']}"
    
    chunks = [Document(
        page_content=title_abstract,
        metadata={
            'arxiv_id': paper['article_id'],
            'title': paper['title'],
            'section': 'title_abstract',
            'authors': ', '.join(paper['authors'][:3])
        }
    )]
    
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200
    )
    
    body_chunks = splitter.create_documents(
        texts=[paper['full_text']],
        metadatas=[{
            'arxiv_id': paper['article_id'],
            'title': paper['title'],
            'section': 'body',
            'authors': ', '.join(paper['authors'][:3])
        }]
    )
    
    chunks.extend(body_chunks)
    return chunks

print("Chunking papers...")

all_chunks = []
for i, paper in enumerate(papers):
    paper_chunks = chunk_paper(paper)
    all_chunks.extend(paper_chunks)
    
    if i % 20 == 0:
        print(f"{i}/{len(papers)} - {len(all_chunks)} chunks so far")

print(f"\nDone: {len(all_chunks)} total chunks")
print(f"Avg chunks per paper: {len(all_chunks)/len(papers):.1f}")

Chunking papers...
0/150 - 78 chunks so far
20/150 - 1397 chunks so far
40/150 - 2908 chunks so far
60/150 - 4071 chunks so far
80/150 - 5291 chunks so far
100/150 - 6382 chunks so far
120/150 - 7525 chunks so far
140/150 - 8748 chunks so far

Done: 9283 total chunks
Avg chunks per paper: 61.9
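
To make the effect of `chunk_size=1000` and `chunk_overlap=200` concrete, here is a simplified fixed-width sliding window. This is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter prefers to break on separators such as paragraphs and line breaks, so its chunks vary in length); `sliding_chunks` and `estimate_chunks` are hypothetical helpers meant only to illustrate how overlap drives the chunk count.

```python
import math

def sliding_chunks(text, size=1000, overlap=200):
    """Split text into fixed-width windows where each window repeats the
    last `overlap` characters of the previous one."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    if not text:
        return []
    step = size - overlap
    last_start = max(len(text) - overlap, 1)
    return [text[i:i + size] for i in range(0, last_start, step)]

def estimate_chunks(n_chars, size=1000, overlap=200):
    """Number of windows sliding_chunks produces for a text of n_chars characters."""
    if n_chars <= 0:
        return 0
    return max(1, math.ceil((n_chars - overlap) / (size - overlap)))

text = "".join(str(i % 10) for i in range(2500))
chunks = sliding_chunks(text)
print(len(chunks), [len(c) for c in chunks])
print(chunks[0][-200:] == chunks[1][:200])  # consecutive chunks share 200 characters
print(estimate_chunks(48011))               # mean text length from the statistics above
```

For the mean text length of about 48,000 characters this estimate gives 60 body chunks, which is in the same range as the 61.9 chunks per paper (body chunks plus one title/abstract chunk) reported above.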


### Embeddings & Vector Store

In [25]:
print("Creating embeddings...")
print(f"Chunks to process: {len(all_chunks)}")
print("This will take ~2-3 minutes\n")

embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small"
)

vectorstore = Chroma.from_documents(
    documents=all_chunks,
    embedding=embeddings,
    collection_name="litsearch_papers"
)

print(f"\nVector store created")
print(f"  Total chunks indexed: {len(all_chunks)}")
print(f"  Ready for retrieval")

Creating embeddings...
Chunks to process: 9283
This will take ~2-3 minutes


Vector store created
  Total chunks indexed: 9283
  Ready for retrieval


### Test Retrieval

In [26]:
def test_retrieval(query, k=5):
    """Test semantic search"""
    # Chroma reports a distance as the "score": lower values mean closer matches
    results = vectorstore.similarity_search_with_score(query, k=k)
    
    print(f"\nQuery: '{query}'")
    print(f"{'='*60}\n")
    
    for i, (doc, score) in enumerate(results):
        print(f"[{i+1}] Score: {score:.3f}")
        print(f"    ArXiv: {doc.metadata['arxiv_id']}")
        print(f"    Title: {doc.metadata['title'][:50]}...")
        print(f"    Section: {doc.metadata['section']}")
        print(f"    Text: {doc.page_content[:150]}...")
        print()

test_queries = [
      "What are the best performing segmentation techniques on BraTS dataset?",
      "What are the main challenges that remain unsolved in brain tumor segmentation?",
      "What are the most commonly used public datasets for brain tumor detection?"
]

for query in test_queries:
    test_retrieval(query, k=3)
    print("\n" + "─"*60 + "\n")


Query: 'What are the best performing segmentation techniques on BraTS dataset?'

[1] Score: 0.765
    ArXiv: 2307.15872v1
    Title: Cross-dimensional transfer learning in medical ima...
    Section: body
    Text: These methods have all performed more or less e fficiently, some better than others on
the different challenge tasks. This shows the e ffectiveness of...

[2] Score: 0.777
    ArXiv: 2306.12510v2
    Title: Comparative Analysis of Segment Anything Model and...
    Section: body
    Text: • DL for Medical Image Segmentation is used. 
• Hardware Acceleration on SBCs (Google's Edge 
TPU)  
Scenario 1: 
BUSI: 0.995 
UDIAT: 0.949 
Scenario ...

[3] Score: 0.785
    ArXiv: 2102.04525v4
    Title: Unified Focal loss: Generalising Dice and cross en...
    Section: body
    Text: 25described in (Table 1). For 3D binary segmentation, we used the BraTS20
dataset. Here, images were pre-processed, with the skull stripped and images...


──────────────────────────────────────────────────
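
The scores above increase from the first hit to the third, which is consistent with a distance rather than a similarity. As a sketch under two assumptions that are not stated in this notebook (that the collection uses Chroma's default squared-L2 distance, and that the embedding vectors have unit length), a distance `d` maps to cosine similarity as `1 - d / 2`. The helper name `l2sq_to_cosine` is hypothetical.

```python
import numpy as np

def l2sq_to_cosine(distance):
    """Convert a squared-L2 distance between unit vectors to cosine similarity."""
    return 1.0 - distance / 2.0

# Check the identity ||a - b||^2 = 2 - 2 * cos(a, b) on random unit vectors
rng = np.random.default_rng(0)
a = rng.normal(size=8)
b = rng.normal(size=8)
a = a / np.linalg.norm(a)
b = b / np.linalg.norm(b)

squared_l2 = float(np.sum((a - b) ** 2))
cosine = float(a @ b)
print(abs(l2sq_to_cosine(squared_l2) - cosine) < 1e-9)

# Under the stated assumptions, the top hit above (0.765) would correspond to:
print(round(l2sq_to_cosine(0.765), 4))
```

If the collection were configured with a different distance metric, this conversion would not apply.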

### RAG Chain

In [27]:
llm = ChatOpenAI(model="gpt-4", temperature=0)

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful AI research assistant. Use the context to answer. Cite sources as [arXiv:ID].\n\nContext: {context}"),
    ("human", "{question}")
])

retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

def format_docs(docs):
    return "\n\n".join([f"[arXiv:{doc.metadata['arxiv_id']}]: {doc.page_content}" for doc in docs])

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

print("✓ RAG chain ready")

✓ RAG chain ready


In [None]:
test_queries = [
      "Can conformal predictions be used for unsupervised anomaly detection in images?",
      "Are there machine learning methods to detect brain lesions using ultrasound?",
      "Transformers for multimodal image segmentation?",
      "Cross-attention for image classification?"
]

for query in test_queries:
    print(f"\n{'='*60}")
    print(f"Q: {query}")
    print(f"{'='*60}\n")
    
    response = rag_chain.invoke(query)
    print(response)
    print()


Q: Can conformal predictions be used for unsupervised anomaly detection in images?

The context provided does not mention the use of conformal predictions for unsupervised anomaly detection in images. The discussed methods for unsupervised anomaly detection in images include the use of generative models to synthesize healthy samples from diseased images, the use of an encoder network to replace the time-consuming iterative restoration process, and the creation of synthetic anomalies to train a discriminative model. However, it does not mention the use of conformal predictions in this process.


Q: Are there machine learning methods to detect brain lesions using ultrasound?

Yes, there are machine learning methods to detect brain lesions using ultrasound. For instance, the paper by H. Chen et al. discusses the use of iterative multi-domain regularized deep learning for anatomical structure detection and segmentation from ultrasound images [4]. However, the application of deep learning 

## 3. Evaluation

In [None]:
import re

def evaluate_faithfulness(question, answer, sources):
    """Evaluate if answer is faithful to sources using LLM judge"""
    
    context = "\n\n".join([f"Source {i+1}: {s.page_content}" for i, s in enumerate(sources)])
    
    prompt = f"""Rate faithfulness (0.0-1.0):

Sources:
{context}

Question: {question}
Answer: {answer}

Is every claim supported by the sources? Return only a number 0.0-1.0."""
    
    judge = ChatOpenAI(model="gpt-4", temperature=0)
    reply = judge.invoke(prompt).content.strip()
    
    # The judge may wrap the number in extra text, so extract the first number
    match = re.search(r"\d+(?:\.\d+)?", reply)
    if match is None:
        raise ValueError(f"Judge returned no score: {reply!r}")
    
    # Clamp to the requested 0.0-1.0 range
    return min(1.0, max(0.0, float(match.group())))


test_queries = [
      "What are the best performing segmentation techniques on BraTS dataset?",
      "What are the main challenges that remain unsolved in brain tumor segmentation?",
      "What are the most commonly used public datasets for brain tumor detection?"
]

print("FAITHFULNESS EVALUATION")
print("="*60)

scores = []
for query in test_queries:
    answer = rag_chain.invoke(query)
    # Repeat the retriever's k=5 similarity search so the judge sees the chain's context
    sources = vectorstore.similarity_search(query, k=5)
    score = evaluate_faithfulness(query, answer, sources)
    scores.append(score)
    
    print(f"{query[:45]}...")
    print(f"  Faithfulness: {score:.2f}\n")

print(f"Average Faithfulness: {np.mean(scores):.2f}")
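
The LLM judge can be complemented with a cheap, deterministic check: the system prompt asks for citations as `[arXiv:ID]`, yet one of the answers shown earlier cites `[4]` instead. The sketch below extracts `[arXiv:ID]` citations from an answer and compares them with the IDs of the retrieved sources. `citation_report` is a hypothetical helper, and the example answer and the ID `9999.00000v1` are made up for illustration.

```python
import re

# Matches citations in the format requested by the system prompt, e.g. [arXiv:2102.04525v4]
CITATION_RE = re.compile(r"\[arXiv:([^\]\s]+)\]")

def citation_report(answer, source_ids):
    """Compare the arXiv IDs cited in `answer` with the IDs that were retrieved."""
    cited = set(CITATION_RE.findall(answer))
    allowed = set(source_ids)
    return {
        "cited": sorted(cited),
        "unsupported": sorted(cited - allowed),  # cited but never retrieved
        "has_citation": bool(cited),
    }

example_answer = (
    "Compound losses are evaluated on BraTS20 [arXiv:2102.04525v4]; "
    "see also [arXiv:9999.00000v1] and [4]."
)
example_sources = ["2102.04525v4", "2307.15872v1"]
print(citation_report(example_answer, example_sources))
```

In the evaluation loop, `source_ids` could be built as `[s.metadata['arxiv_id'] for s in sources]`. An answer with no citations, or with IDs outside the retrieved set, is worth inspecting even when the judge's score is high.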