OCI OpenSearch Service sample notebook.

Copyright (c) 2024 Oracle, Inc. All rights reserved. Licensed under the [Universal Permissive License (UPL) v 1.0](https://oss.oracle.com/licenses/upl/).

### OCI Data Science - Useful Tips
<details>
<summary><font size="2">Check for Public Internet Access</font></summary>

```python
import requests
response = requests.get("https://oracle.com")
assert response.status_code==200, "Internet connection failed"
```
</details>
<details>
<summary><font size="2">Helpful Documentation </font></summary>
<ul><li><a href="https://docs.cloud.oracle.com/en-us/iaas/data-science/using/data-science.htm">Data Science Service Documentation</a></li>
<li><a href="https://docs.cloud.oracle.com/iaas/tools/ads-sdk/latest/index.html">ADS documentation</a></li>
</ul>
</details>
<details>
<summary><font size="2">Typical Cell Imports and Settings for ADS</font></summary>

```python
%load_ext autoreload
%autoreload 2
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

import logging
logging.basicConfig(format='%(levelname)s:%(message)s', level=logging.ERROR)

import ads
from ads.dataset.factory import DatasetFactory
from ads.automl.provider import OracleAutoMLProvider
from ads.automl.driver import AutoML
from ads.evaluations.evaluator import ADSEvaluator
from ads.common.data import ADSData
from ads.explanations.explainer import ADSExplainer
from ads.explanations.mlx_global_explainer import MLXGlobalExplainer
from ads.explanations.mlx_local_explainer import MLXLocalExplainer
from ads.catalog.model import ModelCatalog
from ads.common.model_artifact import ModelArtifact
```
</details>
<details>
<summary><font size="2">Useful Environment Variables</font></summary>

```python
import os
print(os.environ["NB_SESSION_COMPARTMENT_OCID"])
print(os.environ["PROJECT_OCID"])
print(os.environ["USER_OCID"])
print(os.environ["TENANCY_OCID"])
print(os.environ["NB_REGION"])
```
</details>

# Prereqs: Install/Upgrade LangChain and other necessary libraries
You can use **`pip`** to install all the required dependencies into your conda or Python environment. Recommended packages include:
- **langchain**: gives your environment access to the core LangChain libraries
- **langchain-community**: installs the community-maintained extensions and integrations
- **oracle_ads**: the Oracle Accelerated Data Science (ADS) SDK, which gives you access to the Oracle Data Science libraries
- **oci**: the OCI SDK
- **sentence-transformers**: lets you download and run sentence-transformers embedding models
- **opensearch-py**: the OpenSearch Python client, which lets you connect securely to OpenSearch clusters and perform operations on them
- **pypdf**: the PDF parsing library used by LangChain's PDF loader
- **langchain-huggingface**: lets you load any Hugging Face model through the LangChain integration by specifying its name


In [2]:
!pip install -U langchain langchain-community opensearch-py pypdf  sentence-transformers oci  langchain-huggingface oracle_ads
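
Because `-U` pulls the latest releases, it can help to record which versions were actually installed before running the rest of the notebook. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is made up for illustration and is not part of any of the packages above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return a dict mapping each package name to its version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Distribution names copied from the pip command above.
required = [
    "langchain", "langchain-community", "opensearch-py", "pypdf",
    "sentence-transformers", "oci", "langchain-huggingface", "oracle_ads",
]
for name, ver in installed_versions(required).items():
    print(f"{name}: {ver or 'NOT INSTALLED'}")
```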



# Configure necessary variables


In [3]:
# Put your compartment id
compartment_id = "<YOUR-COMPARTMENT-OCID>"
# opensearch_url
opensearch_url="<YOUR-OPENSEARCH-URL>:9200"

username="<YOUR-OPENSEARCH-USERNAME>"
password="<YOUR-OPENSEARCH-PWD>"
index_name = "<YOUR-INDEX-NAME>"

AUTH_TYPE="RESOURCE_PRINCIPAL"
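
If you would rather not keep credentials in the notebook itself, one option is to read them from environment variables. This is only a sketch: the variable names `OPENSEARCH_URL`, `OPENSEARCH_USERNAME`, and `OPENSEARCH_PASSWORD` and the helper `read_setting` are assumptions chosen for illustration, not names defined by OCI or OpenSearch.

```python
import os


def read_setting(name, default=None, required=False):
    """Read a setting from the environment, optionally failing if it is absent."""
    value = os.environ.get(name, default)
    if required and not value:
        raise KeyError(f"Environment variable {name} is not set")
    return value


# Hypothetical variable names; the placeholder defaults keep the cell runnable.
opensearch_url = read_setting("OPENSEARCH_URL", default="<YOUR-OPENSEARCH-URL>:9200")
username = read_setting("OPENSEARCH_USERNAME", default="<YOUR-OPENSEARCH-USERNAME>")
password = read_setting("OPENSEARCH_PASSWORD", default="<YOUR-OPENSEARCH-PWD>")
```

Passing `required=True` makes a missing value fail early, before any connection attempt.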

# Configure embedding model using LangChain Hugging Face integration


In [7]:
# Import the LangChain Hugging Face integration
# (replaces the older `from langchain.embeddings import HuggingFaceEmbeddings` import)
from langchain_huggingface import HuggingFaceEmbeddings
# Select the embedding model
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L12-v2")

# Create Connection to OpenSearch DB using LangChain


In [8]:
# Import the OpenSearch vector store from the LangChain community package
from langchain_community.vectorstores import OpenSearchVectorSearch
import oci

# Setup Resource Principal for authentication
auth_provider = oci.auth.signers.get_resource_principals_signer()
auth = (username, password)

# Initialize OpenSearch as the vector database
vector_db = OpenSearchVectorSearch(opensearch_url=opensearch_url, 
                            index_name=index_name, 
                            embedding_function=embedding_model,
                            signer=auth_provider,
                            auth_type=AUTH_TYPE,
                            http_auth=auth)

# Load, Preprocess, and Chunk Documents (In this case PDF) with LangChain


In [9]:
import os
from tqdm import tqdm
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter



def load_and_split_pdfs_with_chunks(directory_path, chunk_size=500, chunk_overlap=50):
    """
    Loads every PDF in a directory and splits the text into overlapping chunks.
    
    Args:
        directory_path (str): Path to the directory containing PDF files.
        chunk_size (int): Maximum size of each text chunk, in characters.
        chunk_overlap (int): Number of characters shared by consecutive chunks.
    
    Returns:
        list[str]: The text chunks from all the PDFs.
    """
    pdf_documents = []
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap
    )
    
    # Ensure the directory exists
    if not os.path.exists(directory_path):
        raise FileNotFoundError(f"Directory '{directory_path}' does not exist.")
    
    # List all files in the directory
    files = [f for f in os.listdir(directory_path) if f.endswith('.pdf')]
    print(f"Number of pdf documents to process {len(files)}")
    
    if not files:
        print("No PDF files found in the directory.")
        return pdf_documents
    
    # Load and split each PDF file
    for file in tqdm(files, desc="Processing PDFs"):
        file_path = os.path.join(directory_path, file)
        try:
            # Load and split the PDF using PyPDFLoader
            loader = PyPDFLoader(file_path)
            raw_documents = loader.load_and_split()
            print(f"Number of pages in pdf document {file_path} :  {len(raw_documents)}")
            prev_chunks_count=len(pdf_documents)
            
            # Further split the documents into overlapping chunks
            for doc in raw_documents:
                pdf_documents.extend(text_splitter.split_text(doc.page_content))
            print(f"Number of chunks processed for pdf document {file_path} :  {len(pdf_documents)- prev_chunks_count}")
        except Exception as e:
            print(f"Error processing '{file}': {e}")
        

    
    print()
    print(f"Successfully Processed :  {len(files)} Documents/files, split into a total of {len(pdf_documents)} overlaping Chunks.")
    return pdf_documents
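
To build intuition for how `chunk_size` and `chunk_overlap` interact, here is a deliberately simplified fixed-width sliding window. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraphs and newlines); the function name `sliding_window_chunks` is hypothetical and only illustrates the two parameters.

```python
def sliding_window_chunks(text, chunk_size=500, chunk_overlap=50):
    """Split text into fixed-width chunks where neighbours share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
        start += step
    return chunks


print(sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=1))
```

Each chunk starts `chunk_size - chunk_overlap` characters after the previous one, so the last `chunk_overlap` characters of one chunk reappear at the start of the next.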



In [10]:
pdf_data_path = "./data/pdf"
chunk_size = 500
chunk_overlap = 50
processed_pdf_document_chunks = load_and_split_pdfs_with_chunks(directory_path=pdf_data_path, chunk_size=chunk_size, chunk_overlap=chunk_overlap)

Number of pdf documents to process 5


Processing PDFs:  40%|████      | 2/5 [00:00<00:00,  5.27it/s]

Number of pages in pdf document ./data/pdf/evolution_of_ai_ml.pdf :  93
Number of chunks processed for pdf document ./data/pdf/evolution_of_ai_ml.pdf :  281
Number of pages in pdf document ./data/pdf/stock_market_today_america_long.pdf :  80
Number of chunks processed for pdf document ./data/pdf/stock_market_today_america_long.pdf :  238


Processing PDFs:  60%|██████    | 3/5 [00:00<00:00,  3.03it/s]

Number of pages in pdf document ./data/pdf/covid_pandemic_literature.pdf :  253
Number of chunks processed for pdf document ./data/pdf/covid_pandemic_literature.pdf :  758


Processing PDFs: 100%|██████████| 5/5 [00:01<00:00,  3.72it/s]

Number of pages in pdf document ./data/pdf/famous_painters_and_artists.pdf :  225
Number of chunks processed for pdf document ./data/pdf/famous_painters_and_artists.pdf :  690
Number of pages in pdf document ./data/pdf/stock_market_today_america.pdf :  4
Number of chunks processed for pdf document ./data/pdf/stock_market_today_america.pdf :  4

Successfully Processed :  5 Documents/files, split into a total of 1971 overlaping Chunks.





In [15]:
# Print the first few document chunks that were processed and are ready to be ingested
for i in range(5):
    print(f" Chunk {i}: \n {processed_pdf_document_chunks[i]} \n")

 Chunk 0: 
 Stock Market Update in America
The evolution of Artificial Intelligence (AI) and Machine Learning (ML) has been a fascinating
journey, 
marked by breakthroughs, setbacks, and transformative discoveries. The story of AI/ML begins in
the mid-20th century, 
when pioneering researchers began to explore the idea of machines that could simulate human
intelligence.
The 1950s and 1960s are often regarded as the "classical era" of AI. During this time, computer
scientists like 

 Chunk 1: 
 scientists like 
Alan Turing, John McCarthy, and Marvin Minsky laid the theoretical foundations of AI. Turing's
seminal work, 
the "Turing Test," proposed a framework to assess whether a machine could exhibit intelligent
behavior indistinguishable 
from that of a human. In 1956, the Dartmouth Conference, organized by McCarthy and others,
officially coined the term 
"Artificial Intelligence" and set the stage for decades of research. 

 Chunk 2: 
 In the early years, AI research was dominated by s

# Ingest Documents into OpenSearch Vector DB


In [12]:
from tqdm import tqdm

def ingest_documents_with_embeddings(document_chunks, index_name, vector_db, batch_size=100):
    """
    Ingest pre-processed text chunks, along with their auto-generated embeddings, into a new or
    existing OpenSearch index. Bulk ingestion is used to speed things up.
    
    Args:
        document_chunks (list): List of text chunks to be embedded and stored.
        index_name (str): Name of the OpenSearch index.
        vector_db (OpenSearchVectorSearch): LangChain OpenSearch vector store instance.
        batch_size (int): Maximum number of chunks to process in a single batch (default: 100).
    
    Returns:
        None
    """
    print(f"Index Name: {index_name}")
    
    # Ingest documents in batches
    for i in tqdm(range(0, len(document_chunks), batch_size), desc="Ingesting batches"):
        batch = document_chunks[i:i + batch_size]
        try:
            # vector_db.add_texts(batch)
            # vector_db.add_documents(batch)
            vector_db.add_texts(texts=batch, 
                     bulk_size=batch_size,
                     embedding=embedding_model, 
                     opensearch_url=opensearch_url, 
                     index_name=index_name,
                     http_auth=auth)
        except Exception as e:
            print(f"Error while adding texts to the OpenSearch index '{index_name}'. Error occurred in chunk batch {i + 1}-{i + batch_size}: {e}")
    
    #refresh index
    vector_db.client.indices.refresh(index=index_name)
    print(f"Index '{index_name}' refreshed!")
    print(f"Successfully ingested {len(document_chunks)} documents into the OpenSearch index '{index_name}'!")
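
Re-running an ingestion cell against the same index can store the same chunk more than once when document IDs are generated randomly. One possible mitigation, sketched below, is to derive a deterministic ID from each chunk's text. This assumes `add_texts` accepts an `ids` list, as it does in recent langchain-community releases; check your installed version. The helper name `content_ids` is made up for illustration.

```python
import hashlib


def content_ids(chunks):
    """Return one SHA-256 hex digest per chunk, usable as a stable document ID."""
    return [hashlib.sha256(chunk.encode("utf-8")).hexdigest() for chunk in chunks]


batch = ["first chunk of text", "second chunk of text"]
ids = content_ids(batch)
print(ids)

# Assumed usage inside the ingestion loop (not executed here):
# vector_db.add_texts(texts=batch, ids=content_ids(batch), bulk_size=batch_size)
```

With content-derived IDs, ingesting an identical chunk again overwrites the existing document instead of adding a copy. The trade-off is that two chunks with exactly the same text collapse into a single document.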


In [13]:
# Ingest the documents and generate embeddings
ingestion_batch_size=100
ingest_documents_with_embeddings(document_chunks=processed_pdf_document_chunks, index_name=index_name, vector_db=vector_db, batch_size=ingestion_batch_size)

Index Name: pdf-materials-demo


Ingesting batches: 100%|██████████| 20/20 [00:40<00:00,  2.00s/it]

Index 'pdf-materials-demo' refreshed!
Successfully ingested 1971 documents into the OpenSearch index 'pdf-materials-demo'!





## Validate that the index was created and inspect its mapping

In [16]:
# Check the index mapping
response = vector_db.client.indices.get_mapping(index=index_name)
print("Index Mapping:", response)

Index Mapping: {'pdf-materials-demo': {'mappings': {'properties': {'metadata': {'type': 'object'}, 'text': {'type': 'text', 'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}}}, 'vector_field': {'type': 'knn_vector', 'dimension': 384, 'method': {'engine': 'nmslib', 'space_type': 'l2', 'name': 'hnsw', 'parameters': {'ef_construction': 512, 'm': 16}}}}}}}
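
The mapping shows `space_type: l2`. For this space type, the OpenSearch k-NN plugin documents the relevance score as `1 / (1 + d)`, where `d` is the squared Euclidean distance between the query vector and the stored vector. The following pure-Python sketch reproduces that conversion on toy vectors; the function names are hypothetical.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_l2_score(a, b):
    """Score for the l2 space type: 1 / (1 + squared distance)."""
    return 1.0 / (1.0 + squared_l2(a, b))


print(knn_l2_score([0.0, 0.0], [0.0, 0.0]))  # identical vectors
print(knn_l2_score([0.0, 0.0], [1.0, 0.0]))  # squared distance of 1
```

Under this formula a score of 1.0 means the vectors are identical, and scores approach 0 as the vectors move apart. A score of 0.53234446, like the ones printed at the end of this notebook, would correspond to a squared distance of roughly 0.88.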


# Perform Semantic Search

In [17]:
import numpy as np

# Function to perform a semantic search using vector embeddings
def retrieve_documents_with_embeddings(query, top_k=5):
    # Generate the embedding for the query using your embedding function
    query_embedding = vector_db.embedding_function.embed_query(query)
    
    # Ensure the embedding is in the correct format (e.g., a list of floats)
    query_embedding = np.array(query_embedding).tolist()

    # Perform a knn search in OpenSearch
    search_results = vector_db.client.search(
        index=vector_db.index_name,
        body={
            "size": top_k,
            "query": {
                "knn": {
                    "vector_field": {
                        "vector": query_embedding,
                        "k": top_k
                    }
                }
            }
        }
    )

    documents_with_embeddings = []
    for hit in search_results['hits']['hits']:
        doc_content = hit['_source']['text']  # Adjust to the correct field name for document text
        embedding = hit['_source'].get('vector_field')  # Retrieve the embedding if needed
        documents_with_embeddings.append((doc_content, embedding))

    return documents_with_embeddings
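
The `knn` query above is answered by an HNSW index (see the `hnsw` method in the mapping), which returns approximate nearest neighbours. For comparison, an exact search simply measures the distance from the query to every stored vector and keeps the `k` closest. The sketch below does that in pure Python on toy data; `exact_top_k` is a hypothetical name and is only practical for small collections.

```python
import heapq


def exact_top_k(query_vector, vectors, k=5):
    """Return the indices of the k vectors closest to query_vector (squared L2)."""

    def squared_l2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Pair every distance with its position so ties resolve by index.
    scored = ((squared_l2(query_vector, v), i) for i, v in enumerate(vectors))
    return [index for _, index in heapq.nsmallest(k, scored)]


stored = [[0.0, 0.0], [1.0, 1.0], [0.1, 0.0], [5.0, 5.0]]
print(exact_top_k([0.0, 0.0], stored, k=2))
```

On a small sample, comparing the exact result with what the index returns is one way to sanity-check the approximate search.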

## Validate that embeddings are being generated

In [18]:
# Example usage
query = "can you summarize what transformers do?"
documents_with_embeddings = retrieve_documents_with_embeddings(query,2)

# Print the documents and their embeddings
print(f"Top {len(documents_with_embeddings)} documents and their embeddings for the query: \"{query}\"")
for idx, (content, embedding) in enumerate(documents_with_embeddings):
    print(f"\nDocument {idx + 1}:\n")
    print(f"Content: {content}\n")
    print(f"Embedding: {embedding}\n\n")

Top 2 documents and their embeddings for the query: "can you summarize what transformers do?"

Document 1:

Content: computer vision and 
natural language processing tasks, respectively. Breakthroughs like AlexNet in 2012 demonstrated
the potential of deep 
learning in image recognition, while models like GPT-3 and BERT showcased its power in
understanding and generating 
human language.
The last decade has seen AI/ML permeate nearly every aspect of modern life, from healthcare and
finance to entertainment 
and autonomous systems. Technologies like facial recognition, recommendation engines, and

Embedding: [0.0033366407733410597, -0.09587888419628143, 0.013144778087735176, 0.000498955138027668, -0.0023298643063753843, 0.03460882976651192, 0.013671289198100567, -0.044045694172382355, 0.017882492393255234, -0.03096061758697033, 0.0017778659239411354, 0.03852275013923645, -0.046889711171388626, -0.009664032608270645, 0.00836594682186842, 0.051842328161001205, 0.02841189317405224, 0.05635

## Validate semantic search

In [19]:
# Semantic Search Test Function
def semantic_search_test(query, top_k=5):
    # Perform a semantic search
    search_results = vector_db.similarity_search(query, k=top_k)
    
    # Display the top-k retrieved documents
    print(f"Top {top_k} results for the query: \"{query}\"")
    for idx, result in enumerate(search_results):
        print(f"\nResult {idx + 1}:")
        print(f"Document: {result.page_content}\n")

# Run a semantic search test
semantic_search_test("what period in history is considered the AI winter and why?", top_k=5)

Top 5 results for the query: "what period in history is considered the AI winter and why?"

Result 1:
Document: their ability to handle complexity and ambiguity.
The 1970s and 1980s brought about the first "AI winter," a period characterized by reduced funding
and enthusiasm for 
AI research. The limitations of symbolic AI became apparent as researchers struggled to scale these
systems to handle 
real-world problems. During this time, expert systems, which encoded domain-specific knowledge,
gained traction but 
eventually faced scalability issues.


Result 2:
Document: their ability to handle complexity and ambiguity.
The 1970s and 1980s brought about the first "AI winter," a period characterized by reduced funding
and enthusiasm for 
AI research. The limitations of symbolic AI became apparent as researchers struggled to scale these
systems to handle 
real-world problems. During this time, expert systems, which encoded domain-specific knowledge,
gained traction but 
eventually faced sc
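
Results 1 and 2 above contain the same text. The notebook does not say why; one plausible cause (an assumption) is that the same chunks were ingested into the index more than once. Until the index is cleaned up, exact duplicates can be filtered on the client side. The helper `dedupe_by_text` below is a hypothetical sketch that keeps the first occurrence of each text and preserves ranking order.

```python
def dedupe_by_text(texts):
    """Drop repeated texts, keeping the first occurrence and the original order."""
    seen = set()
    unique = []
    for text in texts:
        key = text.strip()  # ignore leading/trailing whitespace differences
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique


results = ["AI winter passage", "AI winter passage", "expert systems passage"]
print(dedupe_by_text(results))
```

With the search results from this notebook you would pass `[doc.page_content for doc in search_results]`. Because duplicates consume slots in the top-k list, you may want to request a larger `k` before filtering.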

## Retrieve top-k documents with similarity scores

In [20]:
# Generate topK documents with scores
query = "what period in history is considered the AI winter and why?"
query_embedding = embedding_model.embed_query(query)

# Perform a similarity search using the query embedding and retrieve scores
search_results = vector_db.similarity_search_with_score_by_vector(query_embedding, k=5)

# Iterate over the search results and print the text along with the scores
for document, score in search_results:
    print(f"Score: {score}")
    print(f"Document: {document.page_content}\n")

Score: 0.53234446
Document: their ability to handle complexity and ambiguity.
The 1970s and 1980s brought about the first "AI winter," a period characterized by reduced funding
and enthusiasm for 
AI research. The limitations of symbolic AI became apparent as researchers struggled to scale these
systems to handle 
real-world problems. During this time, expert systems, which encoded domain-specific knowledge,
gained traction but 
eventually faced scalability issues.

Score: 0.53234446
Document: their ability to handle complexity and ambiguity.
The 1970s and 1980s brought about the first "AI winter," a period characterized by reduced funding
and enthusiasm for 
AI research. The limitations of symbolic AI became apparent as researchers struggled to scale these
systems to handle 
real-world problems. During this time, expert systems, which encoded domain-specific knowledge,
gained traction but 
eventually faced scalability issues.

Score: 0.53234446
Document: their ability to handle comple