# Scientific Paper Analysis System - RAG Implementation  
**Version:** 1.0  
**Author:** Lorena Melo  
**Last Updated:** 17 Feb 2025

---

## 📄 Overview  
This Jupyter Notebook provides an end-to-end Retrieval-Augmented Generation (RAG) system specifically designed for analyzing scientific papers. It enables users to:  

- Process PDF academic papers with specialized preprocessing  
- Create semantic search capabilities over documents  
- Ask natural language questions about paper content  
- Receive answers with citations from source material  

**Key Technologies Used**:  
- OpenAI GPT-4 & Embeddings API  
- ChromaDB vector database  
- Advanced text processing pipelines  

---

## ⚠️ Critical Requirements  
Before using this system, users **MUST**:  

1. **Obtain OpenAI API Key**  
   - Create account at [OpenAI Platform](https://platform.openai.com/)  
   - Generate API key in `API Keys` section  
   - **Important**: Add payment method - API usage incurs costs  

2. **Understand Costs**  
   | Service | Estimated cost (USD) |  
   |---------|----------------------|  
   | Embeddings | $0.00013 / 1k tokens |  
   | GPT-4 | $0.03 / 1k input tokens |  

   *Costs based on OpenAI pricing as of July 2024*  
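
The per-1k prices in the table translate into a dollar figure with one multiplication. The sketch below shows that arithmetic; the token counts are hypothetical round numbers chosen for illustration, not measurements from any paper.

```python
# Prices in USD per 1,000 tokens, copied from the table above (July 2024).
EMBEDDING_PRICE_PER_1K = 0.00013
GPT4_INPUT_PRICE_PER_1K = 0.03


def cost_usd(num_tokens: int, price_per_1k: float) -> float:
    """Return the cost in USD of processing num_tokens at a per-1k-token price."""
    return (num_tokens / 1000) * price_per_1k


# Hypothetical workload: a 10,000-token paper and a 5,000-token prompt.
embedding_cost = cost_usd(10_000, EMBEDDING_PRICE_PER_1K)
prompt_cost = cost_usd(5_000, GPT4_INPUT_PRICE_PER_1K)
print(f"Embedding the paper: ${embedding_cost:.4f}")
print(f"One GPT-4 prompt:    ${prompt_cost:.4f}")
```

Embedding is a one-time cost per paper, while the GPT-4 input cost recurs on every question, so the second figure usually dominates.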

---

## 🛠️ Features  

### 1. PDF Processing Engine  
- Header/footer removal  
- DOI detection and filtering  
- Academic structure preservation  
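
In the notebook's loader these filters are present but commented out. The sketch below shows one way they could be enabled as a standalone function; `clean_page_text` and `DOI_PATTERN` are hypothetical names, and the regular expression is an assumption about what a DOI link looks like.

```python
import re

# Assumed pattern: matches DOI URLs such as https://doi.org/10.1000/example.
DOI_PATTERN = re.compile(r"https?://(?:dx\.)?doi\.org/\S+")


def clean_page_text(text: str) -> str:
    """Drop lines that are only a page number or that contain a DOI link."""
    kept = []
    for line in text.split("\n"):
        stripped = line.strip()
        if stripped.isdigit():  # bare page number
            continue
        if DOI_PATTERN.search(stripped):  # DOI reference line
            continue
        kept.append(line)
    return "\n".join(kept)


sample = "Introduction\n12\nhttps://doi.org/10.1000/example\nDeep learning is costly."
print(clean_page_text(sample))
```

Dropping digit-only lines can also remove legitimate content, such as a table cell that sits on its own line, so it is worth checking the output on a sample page first.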

### 2. Vector Database  
- ChromaDB persistent storage  
- `text-embedding-3-large` embeddings (3072-dimension)  
- MMR search algorithm  
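
Maximal Marginal Relevance (MMR) picks chunks that are relevant to the query but not redundant with chunks already selected. The following is a minimal pure-Python sketch of the idea, not the code ChromaDB or LangChain runs internally; the vectors are toy 2-D examples and `lambda_mult=0.5` is an assumed trade-off value.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(query, candidates, k=2, lambda_mult=0.5):
    """Return the indices of k candidates chosen by maximal marginal relevance."""
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:

        def score(i):
            relevance = cosine(query, candidates[i])
            # Similarity to the closest already-selected candidate.
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 0.0]
# Candidates 0 and 1 are near-duplicates; candidate 2 points elsewhere.
candidates = [[0.9, 0.1], [0.9, 0.12], [0.8, -0.6]]
print(mmr(query, candidates, k=2))  # → [0, 2]
```

A plain top-2 similarity search would return the two near-duplicates (indices 0 and 1); MMR skips the second in favour of the less similar but more informative candidate.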

### 3. Question Answering  
- GPT-4 Turbo integration  
- Technical term handling  
- Page/section citation system  

---

## 🚀 Installation  

### Requirements  
- Python 3.10+  
- Libraries:  
```bash
pip install langchain langchain-community langchain-openai chromadb pypdf python-dotenv tiktoken
```

---

## ⚙️ Configuration  

1. **Environment Setup**  
   Create `.env` file:  
   ```ini
   OPENAI_API_KEY=sk-your-key-here
   ```

2. **Document Preparation**  
   - Place PDF in specified path:  
   ```python
   DOCUMENT_PATH = "/path/to/your/paper.pdf"
   ```

3. **System Settings** (Optional)  
   ```python
   class Config:
       CHUNK_SIZE = 1500  # Optimal for technical content
       MODEL_NAME = "gpt-4-1106-preview"  # Default model
   ```
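
Reading the key from the environment, rather than pasting it into the notebook, keeps it out of shared files. A minimal sketch of a fail-early lookup follows; `read_api_key` and the `DEMO_OPENAI_KEY` variable are hypothetical names used only for this illustration, and in the notebook `load_dotenv()` would populate the environment from `.env` first.

```python
import os


def read_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the API key from the environment, failing early if it is missing."""
    key = os.getenv(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"{env_var} is not set. Add it to your .env file or export it."
        )
    return key


# Demonstration only: a placeholder value under a hypothetical variable name.
os.environ["DEMO_OPENAI_KEY"] = "sk-your-key-here"
print(read_api_key("DEMO_OPENAI_KEY"))
```

Failing at startup gives a clearer error than an authentication failure raised later, in the middle of embedding a document.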

---

## 💻 Usage  

### Basic Workflow  
1. **First Run**  
   ```bash
   python scientific_rag.py
   ```  
   - System will:  
     - Process PDF (~2-10 mins depending on paper size)  
     - Create ChromaDB vector store  

2. **Query Interface**  
   ```bash
   Scientific Analysis System Ready. Ask your question about the article (in English):
   
   Question: [Enter your question]
   ```

### Example Session  
```bash
Question: What methodology did the authors use for evaluation?

Answer:
The authors employed a hybrid evaluation approach combining:
- Quantitative metrics (F1-score: 92.4%, Table 3, page 12)
- Human expert validation (Section 4.2, page 8)
- Comparative analysis against baseline models (Figure 5, page 14)
```

---

## 🔧 Technical Notes  

### Cost Management Tips  
1. Monitor usage at [OpenAI Usage Dashboard](https://platform.openai.com/usage)  
2. For large papers, estimate the embedding cost before building the vector store by running the cost-estimation cell that follows the document-loading function.



### Model Customization  
Available OpenAI Models:  
```python
# In Config class:
MODEL_NAME = "gpt-4-turbo"  # Alternative
# MODEL_NAME = "gpt-3.5-turbo"  # Cheaper but less accurate
```

---

## 📜 Support & Disclaimer  

### Support Includes  
- Code functionality  
- Configuration assistance  
- Basic usage guidance  

### Not Included  
- OpenAI account management  
- API cost optimization  
- Custom feature development  

### Legal Disclaimer  
```text
This product is dependent on third-party services (OpenAI).
The developer is not responsible for:
- API service interruptions
- Content generated by AI models
- Financial charges from API usage
```

---

## 📬 Contact  
For support: [lorenamelo.engr@gmail.com]  
Report issues: [GitHub Repository Link]  

---



In [5]:
# -*- coding: utf-8 -*-
"""RAG System for Scientific Papers.ipynb"""

# Install dependencies
!pip install -qU langchain langchain-community langchain-openai chromadb pypdf python-dotenv tiktoken



In [20]:
import os
from dotenv import load_dotenv
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import CharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_core.runnables import RunnablePassthrough
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Environment setup: load variables from the .env file
load_dotenv()

# User settings
class Config:
    OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")  # Read from the environment (.env)
    DOCUMENT_PATH = "/content/2007.03051v1.pdf"  # Change this to your PDF
    CHROMA_DIR = "./chroma_db_scientific"
    CHUNK_SIZE = 1500  # Works well for scientific papers
    CHUNK_OVERLAP = 300
    MODEL_NAME = "gpt-4-1106-preview"  # Better suited to scientific analysis

config = Config()

# Initial check: fail early if the PDF is missing
if not os.path.exists(config.DOCUMENT_PATH):
    raise FileNotFoundError(f"Document not found at: {config.DOCUMENT_PATH}")



In [21]:
def load_scientific_paper(file_path: str):
    loader = PyPDFLoader(file_path)
    documents = loader.load()

    # Debugging: print how many documents were loaded and a preview of the first document
    print(f"Loaded {len(documents)} documents.")
    if documents:
        print("Preview of first document:")
        print(documents[0].page_content[:500])

    # Optional preprocessing: the filters below are disabled by default.
    # Uncomment them to drop page-number lines and DOI links.
    for doc in documents:
        lines = doc.page_content.split("\n")
        filtered_lines = []
        for line in lines:
            # Remove lines that are only digits (likely page numbers)
            # if line.strip().isdigit():
            #     continue
            # Remove lines with DOI links if you consider them non-essential
            # if "https://doi.org" in line:
            #     continue
            filtered_lines.append(line)
        doc.page_content = "\n".join(filtered_lines)
    return documents





In [22]:
load_scientific_paper(config.DOCUMENT_PATH)

Loaded 11 documents.
Preview of first document:
Carbontracker: Tracking and Predicting the Carbon Footprint of Training
Deep Learning Models
Lasse F. Wolff Anthony∗1 Benjamin Kanding∗1 Raghavendra Selvan1
Abstract
Deep learning (DL) can achieve impressive
results across a wide variety of tasks, but
this often comes at the cost of training
models for extensive periods on specialized
hardware accelerators. This energy-intensive
workload has seen immense growth in recent
years. Machine learning (ML) may become
a signiﬁcant contributor to climate


[Document(metadata={'producer': 'pdfTeX-1.40.17', 'creator': 'LaTeX with hyperref package', 'creationdate': '2020-07-08T00:39:00+00:00', 'author': '', 'keywords': '', 'moddate': '2020-07-08T00:39:00+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.14159265-2.6-1.40.17 (TeX Live 2016) kpathsea version 6.2.2', 'subject': '', 'title': '', 'trapped': '/False', 'source': '/content/2007.03051v1.pdf', 'total_pages': 11, 'page': 0, 'page_label': '1'}, page_content='Carbontracker: Tracking and Predicting the Carbon Footprint of Training\nDeep Learning Models\nLasse F. Wolff Anthony∗1 Benjamin Kanding∗1 Raghavendra Selvan1\nAbstract\nDeep learning (DL) can achieve impressive\nresults across a wide variety of tasks, but\nthis often comes at the cost of training\nmodels for extensive periods on specialized\nhardware accelerators. This energy-intensive\nworkload has seen immense growth in recent\nyears. Machine learning (ML) may become\na signiﬁcant contributor to climate change\nif this expon

In [30]:
from tiktoken import get_encoding

def estimate_cost_from_documents(docs, cost_per_1k=0.0001):
    """
    Given a list of Document objects (each with a 'page_content' attribute),
    combine their text, count tokens using cl100k_base, and return the token count and cost.
    """
    # cl100k_base is the tokenizer used by GPT-3.5, GPT-4, and OpenAI embedding models
    encoder = get_encoding("cl100k_base")

    # Combine text from all documents into a single string
    combined_text = " ".join(doc.page_content for doc in docs)

    # Encode the text to count tokens
    tokens = encoder.encode(combined_text)
    num_tokens = len(tokens)

    # Calculate estimated cost (cost per 1K tokens)
    estimated_cost = (num_tokens / 1000) * cost_per_1k
    return num_tokens, estimated_cost

# Example usage:
# Assume 'load_scientific_paper' is your function that returns the list of documents
documents = load_scientific_paper(config.DOCUMENT_PATH)
num_tokens, cost = estimate_cost_from_documents(documents, cost_per_1k=0.0001)
print(f"Number of tokens: {num_tokens}")
print(f"Estimated cost: ${cost:.4f}")


Loaded 11 documents.
Number of tokens: 10430
Estimated cost: $0.0010


In [24]:
def split_scientific_text(documents):
    text_splitter = CharacterTextSplitter(
        chunk_size=config.CHUNK_SIZE,
        chunk_overlap=config.CHUNK_OVERLAP,
        separator="\n"
    )
    split_docs = text_splitter.split_documents(documents)
    # Debugging: print how many chunks were created
    print(f"Document split into {len(split_docs)} chunks.")
    return split_docs

split_scientific_text(load_scientific_paper(config.DOCUMENT_PATH))


Loaded 11 documents.
Document split into 36 chunks.
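
`CHUNK_OVERLAP` makes consecutive chunks share text, so a sentence cut at a chunk boundary still appears whole in one of them. The sketch below illustrates the idea with a simplified character-level sliding window. It is not `CharacterTextSplitter`'s algorithm, which splits on the separator and merges the pieces, and `sliding_chunks` is a hypothetical helper.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size windows that overlap by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks


print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
# → ['abcd', 'cdef', 'efgh', 'ghij']
```

A larger overlap improves continuity between chunks but increases the number of chunks, and therefore the number of tokens sent to the embeddings API.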



In [26]:
def create_vector_store():
    # Load and preprocess the document
    raw_docs = load_scientific_paper(config.DOCUMENT_PATH)
    if not raw_docs:
        raise ValueError("No valid content found in the document")

    # Split text into manageable chunks
    split_docs = split_scientific_text(raw_docs)

    # Create embeddings and initialize the vector store
    vector_store = Chroma.from_documents(
        documents=split_docs,
        embedding=OpenAIEmbeddings(
            openai_api_key=config.OPENAI_API_KEY,
            model="text-embedding-3-large",  # Best for scientific content
            dimensions=3072
        ),
        persist_directory=config.CHROMA_DIR
    )
    # No explicit persist() call: Chroma 0.4+ writes to persist_directory automatically
    print("Number of documents in Chroma DB:", vector_store._collection.count())

    return vector_store

create_vector_store()

Loaded 11 documents.
Document split into 36 chunks.
Number of documents in Chroma DB: 36




<langchain_community.vectorstores.chroma.Chroma at 0x7e32a06b80d0>

In [27]:
def initialize_rag_system():
    # Set up the embedding model
    embedding_model = OpenAIEmbeddings(
        openai_api_key=config.OPENAI_API_KEY,
        model="text-embedding-3-large",
        dimensions=3072
    )

    # Either load an existing vector store or create a new one
    if os.path.exists(config.CHROMA_DIR):
        vector_store = Chroma(
            persist_directory=config.CHROMA_DIR,
            embedding_function=embedding_model
        )
    else:
        vector_store = create_vector_store()

    # Configure the retriever: MMR search returning 10 chunks for richer context
    retriever = vector_store.as_retriever(
        search_type="mmr",  # Maximal Marginal Relevance for diversity
        search_kwargs={
            "k": 10,  # Number of chunks to return
            # "score_threshold" is omitted: it applies only to
            # search_type="similarity_score_threshold", not to MMR
        }
    )

    # Use an English prompt for consistency with your queries
    SCIENCE_PROMPT = ChatPromptTemplate.from_template(
        """You are a research assistant specialized in analyzing scientific papers.
Use the provided context strictly to answer the question.

Article context:
{context}

Question: {question}

Required format:
- Provide a concise answer (maximum 300 words)
- Use technical terms in English (if applicable) in parentheses
- Cite the section/page of the document if applicable
- If there is insufficient information, say "Insufficient information in the article"

Answer:"""
    )

    llm = ChatOpenAI(
        openai_api_key=config.OPENAI_API_KEY,
        model_name=config.MODEL_NAME,
        temperature=0.1  # Less creativity, more factual
    )

    # Build the RAG chain
    return (
        {"context": retriever, "question": RunnablePassthrough()}
        | SCIENCE_PROMPT
        | llm
        | StrOutputParser()
    )
initialize_rag_system()

{
  context: VectorStoreRetriever(tags=['Chroma', 'OpenAIEmbeddings'], vectorstore=<langchain_community.vectorstores.chroma.Chroma object at 0x7e329fa1b0d0>, search_type='mmr', search_kwargs={'k': 10}),
  question: RunnablePassthrough()
}
| ChatPromptTemplate(input_variables=['context', 'question'], input_types={}, partial_variables={}, messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context', 'question'], input_types={}, partial_variables={}, template='You are a research assistant specialized in analyzing scientific papers.\nUse the provided context strictly to answer the question.\n\nArticle context:\n{context}\n\nQuestion: {question}\n\nRequired format:\n- Provide a concise answer (maximum 300 words)\n- Use technical terms in English (if applicable) in parentheses\n- Cite the section/page of the document if applicable\n- If there is insufficient information, say "Insufficient information in the article"\n\nAnswer:'), additional_kwargs={})])
| ChatOpenAI(
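
The chain above passes the retriever's output straight into the `{context}` slot, so the prompt receives the raw list of `Document` objects, metadata included. One possible refinement is to join the chunks into plain text with a page label on each, which may also help the model cite pages. This is a sketch under assumptions: `format_docs` is a hypothetical helper, `Doc` is a stand-in for LangChain's `Document`, and the sample text is made up. The `page_label` key is the one shown in the loader output earlier.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs) -> str:
    """Join retrieved chunks into one context string, labelling each with its page."""
    parts = []
    for doc in docs:
        page = doc.metadata.get("page_label", "?")
        parts.append(f"[page {page}]\n{doc.page_content}")
    return "\n\n".join(parts)


docs = [
    Doc("First chunk text.", {"page_label": "1"}),
    Doc("Second chunk text.", {"page_label": "4"}),
]
print(format_docs(docs))
```

In the chain, such a helper could be wired in by replacing `"context": retriever` with `"context": retriever | format_docs`.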

In [28]:
def scientific_chat():
    rag_chain = initialize_rag_system()
    print("Scientific Analysis System Ready. Ask your question about the article (in English):")
    while True:
        try:
            query = input("\nQuestion: ")
            if query.lower() in ["exit", "quit"]:
                break
            response = rag_chain.invoke(query)
            print(f"\nAnswer:\n{response}")
        except Exception as e:
            print(f"Error: {str(e)}")
            print("Please rephrase your question or try again.")

scientific_chat()

Scientific Analysis System Ready. Ask your question about the article (in English):

Question: what is the main discovery in this paper?

Answer:
The main discovery in this paper is the development and implementation of "Carbontracker," an open-source tool written in Python designed to track and predict the energy consumption and carbon emissions associated with training deep learning (DL) models. The tool aims to raise awareness about the environmental impact of the increasing computational demands in DL by providing accurate reporting of energy and carbon footprints. Carbontracker operates as a multithreaded program, allowing it to collect power measurements and fetch real-time carbon intensity data without disrupting the main training process of the model.

The paper also discusses the effectiveness of power management techniques like dynamic voltage and frequency scaling (DVFS) in conserving energy consumption during the training and inference of different deep neural networks (DNN