# Prepare Paper Databases

## 1. Extract Papers from PDFs

### Workflow Steps

1. **Input Preparation**
   - Place all PDFs in a dedicated folder (e.g., `./inputs/papers_pdf`)
   - Ensure files have `.pdf` extensions
   - ⚠️ Note: Scanned or image-only PDFs (with no embedded text layer) will not be processed correctly

2. **Configuration**
   - Set the paths in the configuration cell:
   ```python
   PDF_DIR = "./inputs/papers_pdf"  # Input PDFs
   OUTPUT_DIR = "./inputs/papers_text"  # Clean text output
   ```
3. **Processing**
   - Run the processing cell to execute:
   ```python
   process_pdf_batch(PDF_DIR, OUTPUT_DIR)
   ```
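
Before running the batch, it can help to confirm that the input folder exists and to see which files would be ignored. The following is a minimal sketch using only the standard library; `list_input_pdfs` is a hypothetical helper, not part of `model_resources`, and it assumes the pipeline expects a lowercase `.pdf` suffix, so files named `.PDF` are reported as skipped.

```python
from pathlib import Path


def list_input_pdfs(pdf_dir):
    """Return (pdfs, skipped): files ending in '.pdf' and all other files in pdf_dir."""
    folder = Path(pdf_dir)
    if not folder.is_dir():
        raise FileNotFoundError(f"Input folder not found: {folder}")
    # Only look at regular files directly inside the folder (no recursion).
    files = sorted(p for p in folder.iterdir() if p.is_file())
    pdfs = [p for p in files if p.suffix == ".pdf"]
    skipped = [p for p in files if p.suffix != ".pdf"]
    return pdfs, skipped
```

Calling `list_input_pdfs("./inputs/papers_pdf")` returns the two lists, which can be printed or counted before starting the extraction. The check looks only at file names; it cannot tell whether a PDF is scanned.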
   


In [None]:
# Configuration
PDF_DIR = "./inputs/papers_pdf" 
OUTPUT_DIR = "./inputs/papers_text"

# Import processing function
from model_resources.functions.extraction import process_pdf_batch

# Execute pipeline
process_pdf_batch(PDF_DIR, OUTPUT_DIR)

## 2. Set Up API Keys for This Session

Enter your OpenAI API key between the quotes in the cell below. The cell stores the key in the `OPENAI_API_KEY` environment variable, where it remains available for the rest of this notebook session.
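
To avoid leaving the key in the saved notebook, one option is to prompt for it at run time. This is a minimal sketch under stated assumptions: `ensure_openai_key` is a hypothetical helper, not part of `model_resources`, and it relies only on the standard-library `getpass` and `os` modules.

```python
import os
from getpass import getpass


def ensure_openai_key(prompt=getpass):
    """Return the key in OPENAI_API_KEY, prompting once if it is unset or empty."""
    key = os.environ.get("OPENAI_API_KEY", "").strip()
    if not key:
        # getpass hides the typed characters, so the key is not echoed in the output.
        key = prompt("OpenAI API key: ").strip()
        if not key:
            raise ValueError("No API key entered")
        os.environ["OPENAI_API_KEY"] = key
    return key
```

Running `ensure_openai_key()` in a cell prompts only when the variable is not already set, so an existing key from the shell environment is reused.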



In [None]:
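import os

# Paste your OpenAI API key between the quotes.
# Clear it again before sharing or committing this notebook.
os.environ["OPENAI_API_KEY"] = ''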

## 3. Summarize Extracted Paper Texts Using GPT-4

This step processes all extracted paper text files and summarizes them using GPT-4 via the OpenAI API.  
Using the default prompt, the summarization framework extracts structured information relevant to deep eutectic electrolyte research, such as:
- Electrolyte composition  
- Cathode composition  
- Electrochemical performance metrics  
- Scientific rationale for composition choice

Summaries are saved as `.txt` files in a specified output folder.  
The prompt used for summarization is stored in `./model_resources/prompts/summarize.txt` for easy editing and customization.

> -------------------------------------          
> ⚠️ **Warning**:
>  
> Submitting large numbers of academic papers for summarization may incur **significant API costs** due to the high token count of full papers.
>  
> Please ensure:
> - You are comfortable sending these documents to OpenAI’s servers.
> - You understand the potential cost based on the number and length of the papers.
> - If needed, consider using your **own local language model** or an **API cost monitor** during large batch processing.
> ---------------------------------------         
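
A rough cost estimate can be made before submitting anything. The sketch below is an illustration, not part of `model_resources`: it assumes the extracted texts are `.txt` files, approximates tokens as four characters each (a common rule of thumb for English text, not a real tokenizer count), and takes the price per 1,000 input tokens as a parameter that you must fill in from the current pricing for your model. It ignores the prompt text and the output tokens, so the true cost will be higher.

```python
from pathlib import Path

CHARS_PER_TOKEN = 4  # rough heuristic for English text, not an exact tokenizer count


def estimate_input_tokens(text_dir):
    """Roughly estimate the total input tokens across all .txt files in text_dir."""
    total_chars = 0
    for path in sorted(Path(text_dir).glob("*.txt")):
        total_chars += len(path.read_text(encoding="utf-8", errors="ignore"))
    return total_chars // CHARS_PER_TOKEN


def estimate_input_cost(text_dir, usd_per_1k_input_tokens):
    """Estimate the input-token cost only; the rate is supplied by the caller."""
    return estimate_input_tokens(text_dir) / 1000 * usd_per_1k_input_tokens
```

For example, `estimate_input_tokens("./inputs/papers_text")` gives an order-of-magnitude token count that can be compared against your budget before running the summarization cell.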



In [None]:
from model_resources.functions.summarization import summarize_papers_with_gpt4

# Configuration
SOURCE_DIR = "./inputs/papers_text"
SUMMARY_OUTPUT_DIR = "./inputs/summaries"

# Run summarization
summarize_papers_with_gpt4(SOURCE_DIR, SUMMARY_OUTPUT_DIR)


## 4. Store Extracted Papers and Summaries as Vector Databases

The `create_vector_db` function converts the extracted documents into embeddings for downstream retrieval.

- The paper database is split into overlapping token-based chunks to preserve contextual granularity during retrieval.

- The summary database stores entire summaries without splitting, as they are already concise.

These vector stores are saved in the `model_databases/` directory and will later be used by the multi-agent system for information retrieval.
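
To illustrate what overlapping chunks look like, here is a minimal sketch of a sliding-window splitter. It is not the implementation inside `create_vector_db`: the function name, the default sizes, and the use of a plain Python list as the token sequence are all assumptions made for the example.

```python
def split_into_overlapping_chunks(tokens, chunk_size=200, overlap=50):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the sequence
    return chunks
```

With whitespace-separated words standing in for tokens, `split_into_overlapping_chunks(text.split(), chunk_size=200, overlap=50)` yields windows in which the last 50 words of one chunk reappear as the first 50 of the next, so a sentence cut at a chunk boundary is still seen whole in at least one chunk.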

In [None]:
from model_resources.functions.vector_db import create_vector_db

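# Persistence directories for the two vector stores
PAPER_PERSIST_DIR = "./model_databases/paperstore"
SUMMARY_PERSIST_DIR = "./model_databases/summarystore"

# Create the paper database (split into overlapping token-based chunks)
# OUTPUT_DIR is the extracted-text folder defined in step 1
create_vector_db(OUTPUT_DIR, PAPER_PERSIST_DIR, split_tokens=True)

# Create the summary database (whole summaries, no splitting)
# SUMMARY_OUTPUT_DIR is the summaries folder defined in step 3
create_vector_db(SUMMARY_OUTPUT_DIR, SUMMARY_PERSIST_DIR, split_tokens=False)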