# **Load, Split, Embed, Store using OpenAI and Chroma**

In [49]:
# bleu_score is not a PyPI package (pip reports "No matching distribution found for bleu_score"), so it is left out.
#!pip -q install langchain openai chromadb sentence_transformers evaluate rouge_score bert_score


## **OpenAI Authentication**
We use OpenAI to compute the embeddings. Make sure your OpenAI account has available credit, and create a personal secret key at https://platform.openai.com/api-keys.

In [23]:
import getpass
import os

os.environ["OPENAI_API_KEY"] = getpass.getpass()

········
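
The cell above prompts for the key on every run. A minimal sketch of a variant that only prompts when the variable is not already set is shown below; the helper name `ensure_api_key` and its `prompt` parameter are our own illustration, not part of OpenAI or LangChain.

```python
import getpass
import os


def ensure_api_key(var_name="OPENAI_API_KEY", prompt=getpass.getpass):
    """Return the key from the environment, prompting only if it is missing.

    `ensure_api_key` is a hypothetical helper written for this notebook.
    """
    key = os.environ.get(var_name)
    if not key:
        # Nothing set yet: ask the user and store the answer for later cells
        key = prompt(f"{var_name}: ")
        os.environ[var_name] = key
    return key


# In the notebook one would simply call:
# ensure_api_key()
```

Passing the prompt function as a parameter keeps the helper easy to try out without typing a real key.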


In [16]:
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA
# InstructorEmbedding
from InstructorEmbedding import INSTRUCTOR
from langchain.embeddings import HuggingFaceInstructEmbeddings



## **Load documents**
All our documents are retrieved by Scraper/PubMedScraper.py. Running this script generates a papers.json file.
We now load the JSON file and take a closer look.

In [17]:
import json

file_path = 'papers.json'

# Open and read the JSON file
with open(file_path, 'r') as file:
    data = json.load(file)

# Now 'data' contains the JSON data as a Python object
# For example, print an item to check
print(data[5])


{'title': {'full_text': 'probability estimation with machine learning methods for dichotomous and multicategory outcome: theory.'}, 'abstract': {'full_text': 'probability estimation for binary and multicategory outcome using logistic and multinomial logistic regression has a long-standing tradition in biostatistics. however, biases may occur if the model is misspecified. in contrast, outcome probabilities for individuals can be estimated consistently with machine learning approaches, including k-nearest neighbors (k-nn), bagged nearest neighbors (b-nn), random forests (rf), and support vector machines (svm). because machine learning methods are rarely used by applied biostatisticians, the primary goal of this paper is to explain the concept of probability estimation with these methods and to summarize recent theoretical findings. probability estimation in k-nn, b-nn, and rf can be embedded into the class of nonparametric regression learning machines; therefore, we start with the constr

## **Create LangChain Document objects**
A Document object contains page_content (str) and metadata (dict). This object will be useful later for splitting large documents into smaller chunks.

In [14]:
from langchain.schema import Document

transformed_docs = []
counter = 1  # Start the counter

for doc in data:
    title = doc['title']['full_text']
    abstract = doc['abstract']['full_text']
    unique_id = f"pubmed-{counter:07d}"  # Format the ID with leading zeros

    if doc['keywords'] and isinstance(doc['keywords'][0], list):
        keywords = doc['keywords'][0]
    else:
        keywords = []

    document = Document(
        page_content=abstract,
        metadata={
            'title': title,
            'keywords': keywords,
            'unique_id': unique_id  # Use the formatted unique ID
        }
    )
    transformed_docs.append(document)
    counter += 1  # Increment the counter for the next document


In [5]:
transformed_docs[0]

Document(page_content='', metadata={'title': 'efficient use of social media during the avian influenza a(h7n9) emergency response.', 'keywords': [], 'unique_id': 'pubmed-0000001'})

## **Split the documents**
We now split large documents into smaller pieces, and create a new Document object for each chunk.

In [19]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunked_texts = []

for doc in transformed_docs:
    # Use 'split_text' to split the document's page_content into chunks
    chunks = text_splitter.split_text(doc.page_content)

    for chunk in chunks:
        # Create a new Document for each chunk, preserving the original metadata
        chunked_doc = Document(
            page_content=chunk,
            metadata=dict(doc.metadata)  # Copy the metadata (including unique_id) so chunks do not share one dict
        )
        chunked_texts.append(chunked_doc)
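
To make the roles of chunk_size and chunk_overlap concrete, here is a simplified sketch of a fixed-window splitter. It is not the algorithm used by RecursiveCharacterTextSplitter, which prefers to split on separators such as paragraph and line breaks, so the real chunks will differ. The function name `sliding_window_chunks` is our own.

```python
def sliding_window_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size windows that overlap by chunk_overlap characters.

    Simplified illustration only; the LangChain splitter behaves differently.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# A synthetic 2500-character text: windows start at 0, 800, and 1600
sample_text = "".join(str(i % 10) for i in range(2500))
sample_chunks = sliding_window_chunks(sample_text)
```

In this sketch, consecutive chunks share their last and first 200 characters, and an empty text yields no chunks at all.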



## **Embedding and Storage using OpenAIEmbeddings and Chroma**

In [5]:
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings, GPT4AllEmbeddings, HuggingFaceBgeEmbeddings

In [22]:
import os

persist_directory = "./Chroma/chroma_openai"
# Create the directory if it does not exist
if not os.path.exists(persist_directory):
    os.makedirs(persist_directory)
    print(f"Directory '{persist_directory}' created.")
else:
    print(f"Directory '{persist_directory}' already exists.")


Directory './Chroma/chroma_openai' already exists.
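
The existence check above can also be written with pathlib, where `mkdir(parents=True, exist_ok=True)` creates missing parent folders and does not fail if the directory is already there. The sketch below uses a temporary directory so it does not touch the real ./Chroma folder; `ensure_directory` is a hypothetical helper name.

```python
import tempfile
from pathlib import Path


def ensure_directory(path):
    """Create the directory (and any parents) if needed.

    Returns True if it was newly created, False if it already existed.
    """
    directory = Path(path)
    existed = directory.is_dir()
    directory.mkdir(parents=True, exist_ok=True)
    return not existed


# Try it in a throwaway location instead of the real persist directory
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "Chroma" / "chroma_openai"
    created_first_time = ensure_directory(target)
    created_second_time = ensure_directory(target)
```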


### **Prepare batch system for embedding**
To avoid embedding all chunked documents at once, we process them in batches, which reduces the resource load.

In [11]:
def process_batch(docs_batch, embedding, persist_directory):
    #embedding.set_runtime("gpu")
    simplified_docs_batch = []
    for doc in docs_batch:
        # Convert list of keywords to a single string
        keywords_str = ', '.join(doc.metadata.get('keywords', []))

        # Create new metadata with simple data types
        simplified_metadata = {
            'title': doc.metadata.get('title', ''),
            'keywords': keywords_str
        }

        # Create a new Document with simplified metadata
        simplified_doc = Document(page_content=doc.page_content, metadata=simplified_metadata)
        simplified_docs_batch.append(simplified_doc)

    vectordb = Chroma.from_documents(
        documents=simplified_docs_batch, embedding=embedding, persist_directory=persist_directory
    )
    vectordb.persist()

from tqdm import tqdm

def batch_process_embeddings(docs, batch_size, embedding, persist_directory):
    # Embed and store the documents one batch at a time, with a progress bar
    for i in tqdm(range(0, len(docs), batch_size)):
        docs_batch = docs[i:i + batch_size]
        process_batch(docs_batch, embedding, persist_directory)

batch_size = 1000
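
process_batch flattens the keyword list because, to our knowledge, Chroma only accepts scalar metadata values (str, int, float, bool). It also keeps only title and keywords, so unique_id is not stored. A minimal sketch of a more general flattening step that keeps every key is given below; `simplify_metadata` is a hypothetical helper, and the sample values are made up for illustration.

```python
def simplify_metadata(metadata):
    """Return a copy of metadata whose values are all str, int, float, or bool.

    Hypothetical helper: lists become comma-separated strings, None becomes "",
    and any other type is converted with str().
    """
    simplified = {}
    for key, value in metadata.items():
        if isinstance(value, (list, tuple)):
            simplified[key] = ", ".join(str(item) for item in value)
        elif isinstance(value, (str, int, float, bool)):
            simplified[key] = value
        elif value is None:
            simplified[key] = ""
        else:
            simplified[key] = str(value)
    return simplified


# Made-up metadata in the shape produced by the Document-creation cell
example_metadata = simplify_metadata({
    "title": "a title",
    "keywords": ["svm", "random forests"],
    "unique_id": "pubmed-0000006",
})
```

Inside process_batch, such a helper could replace the hand-built simplified_metadata dict if the unique_id should be kept in the vector store.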

### **Execute embedding and store to Chroma.**

In [20]:
from langchain_community.embeddings import HuggingFaceEmbeddings

In [18]:
persist_directory = './Chroma/chroma_openai'
batch_process_embeddings(chunked_texts, batch_size, OpenAIEmbeddings(), persist_directory)

NameError: name 'chunked_texts' is not defined

In [24]:
query = "What is the proposed optical biosensor based on?"

persist_directory = './Chroma/chroma_openai'
db3 = Chroma(persist_directory=persist_directory, embedding_function=OpenAIEmbeddings())
# Call the similarity search method with the query and k
docs = db3.similarity_search_with_score(query, k=3)

for doc, score in docs:
    print(f"Document: {doc}\nScore: {score}\n")

Document: page_content='in this work, we theoretically propose an optical biosensor (consists of a bk7 glass, a metal film, and a graphene sheet) based on photonic spin hall effect (she). we establish a quantitative relationship between the spin-dependent shift in photonic she and the refractive index of sensing medium. it is found that, by considering the surface plasmon resonance effect, the refractive index variations owing to the adsorption of biomolecules in sensing medium can effectively change the spin-dependent displacements. remarkably, using the weak measurement method, this tiny spin-dependent shifts can be detected with a desirable accuracy so that the corresponding biomolecules concentration can be determined.' metadata={'keywords': '', 'seq_num': 38643, 'source/title': 'photonic spin hall effect enabled refractive index sensor using weak measurements.'}
Score: 0.24147269129753113

Document: page_content='over the last 30 years, optical biosensors based on nanostructured m