
## RAG Tutorials (Improving Search Accuracy using Deep Memory)

### Use Deep Memory to Improve the Accuracy of your Vector Search
*Deep Memory computes a transformation that converts your embeddings into an embedding space that is tailored for your use case, based on several examples for which the most relevant embedding is known. This can increase the accuracy of your Vector Search by up to 22%.*

*In this example, we'll use Deep Memory to improve the accuracy of Vector Search on the SciFact dataset, where the input prompt is a scientific claim, and the search result is the corresponding abstract.*
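To make the idea of an embedding-space transformation concrete, here is a toy sketch in plain Python. It is not Deep Memory's algorithm: the source does not describe how the transformation is computed, so the 2-dimensional vectors and the hand-picked matrix `toy_matrix` below are assumptions chosen only to show that transforming a query embedding can change which document ranks first under cosine similarity.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def transform(matrix, vector):
    """Apply a linear transformation (matrix-vector product)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def nearest(query, docs):
    """Return the name of the document most similar to the query."""
    return max(docs, key=lambda name: cosine(query, docs[name]))

# Hypothetical 2-D embeddings, for illustration only.
docs = {"abstract_a": [1.0, 0.0], "abstract_b": [0.0, 1.0]}
query = [0.6, 0.5]

# Hand-picked matrix standing in for a learned transformation (an assumption).
toy_matrix = [[0.5, 0.0], [0.0, 2.0]]

before = nearest(query, docs)
after = nearest(transform(toy_matrix, query), docs)
print(before, after)
```

In this toy setup the raw query is closest to `abstract_a`, while the transformed query is closest to `abstract_b`. A learned transformation aims for the same effect, using known query-to-document pairs to decide how the space should be reshaped.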

#### Downloading the Data

In [1]:
# !pip install datasets
# !pip install ipywidgets

In [2]:
from deeplake import VectorStore
import os
import datasets
import openai
from dotenv import load_dotenv

# Load OPENAI_API_KEY and ACTIVELOOP_TOKEN from a local .env file into the environment.
load_dotenv(override=True)
openai_api_key = os.getenv('OPENAI_API_KEY')
activeloop_token = os.getenv('ACTIVELOOP_TOKEN')
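Because both credentials come from environment variables, it can help to fail early when one is missing. The helper below is a minimal sketch. `missing_env_vars` is a hypothetical name, not part of any library used here, and the example passes a plain dictionary instead of the real environment so it runs without credentials.

```python
import os

def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given environment."""
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]

# Example with a stand-in environment (the key value is a placeholder).
missing = missing_env_vars(
    ["OPENAI_API_KEY", "ACTIVELOOP_TOKEN"],
    {"OPENAI_API_KEY": "sk-example"},
)
print(missing)
```

Calling `missing_env_vars(["OPENAI_API_KEY", "ACTIVELOOP_TOKEN"])` without the second argument checks the real process environment.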



In [5]:
# Download the SciFact corpus locally.
corpus = datasets.load_dataset("scifact", "corpus", trust_remote_code=True)

Downloading data:   0%|          | 0.00/3.12M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/5183 [00:00<?, ? examples/s]

#### Creating the Vector Store

In [6]:
# Define an embedding function for the text data and create a Deep Lake Vector Store in our Managed Database.
# Deep Memory is only available for Vector Stores in our Managed Database.

def embedding_function(texts, model="text-embedding-ada-002"):
    # Accept a single string as well as a list of strings.
    if isinstance(texts, str):
        texts = [texts]

    # Replace newlines with spaces before embedding.
    texts = [t.replace("\n", " ") for t in texts]
    # This call uses the openai>=1.0 client interface.
    return [data.embedding for data in openai.embeddings.create(input=texts, model=model).data]
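When testing the surrounding code without an OpenAI key, a deterministic stand-in with the same calling convention (a string or list of strings in, a list of vectors out) can be handy. The sketch below is an assumption for local testing only. `fake_embedding_function` is a hypothetical name, its vectors carry no semantic meaning, and it should not be used for a Vector Store you intend to search.

```python
import hashlib

def fake_embedding_function(texts, dim=8):
    """Map each text to a deterministic pseudo-embedding of length `dim`.

    `dim` must be at most 32 because a SHA-256 digest has 32 bytes.
    """
    if isinstance(texts, str):
        texts = [texts]

    vectors = []
    for text in texts:
        # Mirror the newline handling of the real embedding function.
        cleaned = text.replace("\n", " ")
        digest = hashlib.sha256(cleaned.encode("utf-8")).digest()
        vectors.append([byte / 255.0 for byte in digest[:dim]])
    return vectors

vectors = fake_embedding_function(["first claim", "second claim"])
print(len(vectors), len(vectors[0]))
```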

In [7]:
# SciFact dataset
# - https://huggingface.co/datasets/allenai/scifact

# path = 'hub://<org_id>/<vector_store_name>'
path = 'hub://pavelkloscz/ds-scifact'

In [8]:
vectorstore = VectorStore(
    path=path,
    embedding_function=embedding_function,
    runtime={"tensor_db": True},
)

Your Deep Lake dataset has been successfully created!


 

#### Adding data to the Vector Store

In [9]:
# Extract the data from the SciFact dataset and add it to our Vector Store. In this example, we embed the abstracts of the scientific papers.
# Normally, the id tensor is auto-populated, but in this case, we want to use the ids in the SciFact dataset, in order to use
#   the internal connection between ids, abstracts, and claims, that already exists in SciFact.

ids = [f"{id_}" for id_ in corpus["train"]["doc_id"]]
texts = [' '.join(text) for text in corpus["train"]["abstract"]]
metadata = [{"title": title} for title in corpus["train"]["title"]]

In [10]:
vectorstore.add(
    text=texts,
    id=ids,
    embedding_data=texts,
    embedding_function=embedding_function,
    metadata=metadata,
)

Creating 5183 embeddings in 11 batches of size 500:: 100%|█████████████████████████████████████████████████████| 11/11 [02:06<00:00, 11.46s/it]

Dataset(path='hub://pavelkloscz/ds-scifact', tensors=['text', 'metadata', 'embedding', 'id'])

  tensor      htype       shape       dtype  compression
  -------    -------     -------     -------  ------- 
   text       text      (5183, 1)      str     None   
 metadata     json      (5183, 1)      str     None   
 embedding  embedding  (5183, 1536)  float32   None   
    id        text      (5183, 1)      str     None   
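The progress message above reports 5183 embeddings created in 11 batches of size 500. The sketch below reproduces that arithmetic and shows one way to split a list into such batches. How Deep Lake batches internally is not described in the source, so the `batched` helper is an illustrative assumption.

```python
import math

def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

num_texts = 5183   # number of abstracts, from the output above
batch_size = 500   # batch size, from the progress message above

num_batches = math.ceil(num_texts / batch_size)
batch_lengths = [len(batch) for batch in batched(list(range(num_texts)), batch_size)]
print(num_batches, batch_lengths[-1])
```

Ten full batches of 500 cover 5000 texts, and the eleventh batch holds the remaining 183.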





#### Generating claims

In [11]:
# Create a relationship between the claims and their corresponding most relevant abstracts.
# This correspondence already exists in the SciFact dataset, and we extract that information using the helper function below.

def preprocess_scifact(claims_dataset, dataset_type="train"):

    # Map each unique claim to the list of (doc_id, relevance) pairs cited for it
    claims_dict = {}

    for item in claims_dataset[dataset_type]:
        claim = item['claim']  # The claim text, used as the query
        relevance = item['cited_doc_ids']  # Ids of the abstracts cited for this claim
        relevance = [(str(r), 1) for r in relevance]  # 1 marks a positive relevance

        if claim not in claims_dict:
            claims_dict[claim] = relevance
        else:
            # The claim appears more than once: append the additional relevances.
            # Repeated (doc_id, 1) pairs are kept, so duplicates can occur.
            claims_dict[claim].extend(relevance)

    # Split the dictionary into two lists: claims and relevances
    claims = list(claims_dict.keys())
    relevances = list(claims_dict.values())
    return claims, relevances

In [12]:
claims_dataset = datasets.load_dataset('scifact', 'claims', trust_remote_code=True)
claims, relevances = preprocess_scifact(claims_dataset, dataset_type="train")

Generating train split:   0%|          | 0/1261 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/300 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/450 [00:00<?, ? examples/s]

In [None]:
# Print the first 10 claims and their relevant abstracts.
# The relevances are a list with one entry per claim. Each entry is a list of tuples, where the id corresponds
#   to the id tensor value in the Abstracts Vector Store, and 1 indicates a positive relevance.

In [13]:
claims[:10]

['0-dimensional biomaterials lack inductive properties.',
 '1 in 5 million in UK have abnormal PrP positivity.',
 '1-1% of colorectal cancer patients are diagnosed with regional or distant metastases.',
 '10% of sudden infant death syndrome (SIDS) deaths happen in newborns aged less than 6 months.',
 '32% of liver transplantation programs required patients to discontinue methadone treatment in 2001.',
 '4-PBA treatment decreases endoplasmic reticulum stress in response to general endoplasmic reticulum stress markers.',
 '4-PBA treatment raises endoplasmic reticulum stress in response to general endoplasmic reticulum stress markers.',
 '40mg/day dosage of folic acid and 2mg/day dosage of vitamin B12 does not affect chronic kidney disease (CKD) progression.',
 "5'-nucleotidase metabolizes 6MP.",
 '50% of patients exposed to radiation have activated markers of mesenchymal stem cells.']

In [14]:
relevances[:10]

[[('31715818', 1)],
 [('13734012', 1)],
 [('22942787', 1)],
 [('2613775', 1)],
 [('44265107', 1)],
 [('32587939', 1)],
 [('32587939', 1)],
 [('33409100', 1), ('33409100', 1)],
 [('641786', 1)],
 [('22080671', 1)]]
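The eighth entry above contains the same `(id, 1)` pair twice because the claim appears more than once in the training split. The source does not say whether duplicates affect training, so the cleanup below is optional. `dedupe_relevance` is a hypothetical helper that removes repeated pairs while keeping their original order.

```python
def dedupe_relevance(relevance):
    """Remove repeated (doc_id, score) pairs, keeping first occurrences in order."""
    return list(dict.fromkeys(relevance))

example = [("33409100", 1), ("33409100", 1)]
deduped = dedupe_relevance(example)
print(deduped)
```

Applied to the whole list, this would be `[dedupe_relevance(r) for r in relevances]`.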

#### Running the Deep Memory Training

In [15]:
# Run a Deep Memory training, which runs asynchronously and executes on our managed service.

job_id = vectorstore.deep_memory.train(
    queries = claims,
    relevance = relevances,
    embedding_function = embedding_function,
)

DeepMemoryAccessError: Deep Memory is not available for organizations on Community plan.Please, consider upgrading or start a free trial at https://app.activeloop.ai/pricing.

In [None]:
# All of the Deep Memory training jobs for this Vector Store can be listed using the command below.
# The PROGRESS column shows the state of each training job, as well as the recall improvement on the data.

# recall@k is the percentage of rows for which the correct (most relevant) answer was returned in the top k vector search results.
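The recall@k definition can be sketched in a few lines of plain Python. This is an illustration, not Deep Lake's implementation. It assumes a query counts as a hit when any of its relevant ids appears in the top k results, and the ids below are made up.

```python
def recall_at_k(ranked_ids_per_query, relevant_ids_per_query, k):
    """Percentage of queries with at least one relevant id in the top k results."""
    hits = 0
    for ranked_ids, relevant_ids in zip(ranked_ids_per_query, relevant_ids_per_query):
        if set(ranked_ids[:k]) & set(relevant_ids):
            hits += 1
    return 100.0 * hits / len(ranked_ids_per_query)

# Hypothetical search results (best match first) and ground-truth ids.
ranked = [["d1", "d2", "d3"], ["d4", "d5", "d6"]]
relevant = [["d2"], ["d9"]]

print(recall_at_k(ranked, relevant, k=1))
print(recall_at_k(ranked, relevant, k=3))
```

For the first query the relevant id `d2` sits at rank 2, so it is missed at k=1 and found at k=3. The second query never retrieves `d9`.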

In [None]:
vectorstore.deep_memory.list_jobs()

# [OUTPUT]
# This dataset can be visualized in Jupyter Notebook by ds.visualize() or at https://app.activeloop.ai/activeloop-test/test-deepmemory-ivo
# ID                        STATUS     RESULTS                        PROGRESS       
# 6525a94bbfacbf7e75a08c76  completed  recall@10: 0.00% (+0.00%)      eta: 45.5 seconds
#                                                                     recall@10: 0.00% (+0.00%)
# 6538186bc1d2ffd8e8cd3b49  completed  recall@10: 85.81% (+21.78%)    eta: 1.9 seconds
#                                                                     recall@10: 85.81% (+21.78%)

#### Evaluating Deep Memory's Performance

In [None]:
# Evaluate the recall improvement for an evaluation dataset that was not used in the training process.
# Deep Memory inference, and by extension this evaluation process, runs on the client.

In [None]:
validation_claims, validation_relevances = preprocess_scifact(claims_dataset, dataset_type="validation")

In [None]:
recalls = vectorstore.deep_memory.evaluate(
    queries = validation_claims,
    relevance = validation_relevances,
    embedding_function = embedding_function,
)

# [OUTPUT]
# ---- Evaluating without Deep Memory ---- 
# Recall@1:	  44.2%
# Recall@3:	  56.9%
# Recall@5:	  61.3%
# Recall@10:	  67.3%
# Recall@50:	  77.2%
# Recall@100:	  79.9%
# ---- Evaluating with Deep Memory ---- 
# Recall@1:	  60.4%
# Recall@3:	  67.6%
# Recall@5:	  71.7%
# Recall@10:	  75.4%
# Recall@50:	  79.1%
# Recall@100:	  80.2%
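To read the two tables side by side, the sketch below copies the reported recall values into dictionaries and computes the gain in percentage points at each k. The numbers are taken from the output above. The dictionary layout is an assumption made for this example and is not meant to describe the structure returned by `evaluate`.

```python
# Recall values (in %) copied from the evaluation output above.
without_dm = {1: 44.2, 3: 56.9, 5: 61.3, 10: 67.3, 50: 77.2, 100: 79.9}
with_dm = {1: 60.4, 3: 67.6, 5: 71.7, 10: 75.4, 50: 79.1, 100: 80.2}

# Gain in percentage points at each k, rounded to one decimal place.
gains = {k: round(with_dm[k] - without_dm[k], 1) for k in without_dm}

for k, gain in gains.items():
    print(f"Recall@{k}: +{gain} points")
```

The gain is largest at k=1 (16.2 points) and shrinks as k grows (0.3 points at k=100), so on this validation set Deep Memory mainly helps by moving the relevant abstract toward the top of the results.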

#### Using Deep Memory in your Application

In [None]:
# To use Deep Memory in your applications, specify the deep_memory = True parameter during vector search.
# If you are using the LangChain integration, you may specify this parameter during Vector Store initialization.
# Let's try searching with a prompt, with and without Deep Memory.

In [None]:
prompt = "Which diseases are inflammation-related processes"

**Without Deep Memory**

In [None]:
results = vectorstore.search(embedding_data = prompt)

In [None]:
results['text']

# [OUTPUT]
# ['Inflammation is a fundamental protective response that sometimes goes awry and becomes a major cofactor in the pathogenesis of many chronic human diseases, including cancer.',
#  'Kidney diseases, including chronic kidney disease (CKD) and acute kidney injury (AKI), are associated with inflammation.',
#  'BACKGROUND Persistent inflammation has been proposed to contribute to various stages in the pathogenesis of cardiovascular disease.',
#  'Inflammation accompanies obesity and its comorbidities-type 2 diabetes, non-alcoholic fatty liver disease and atherosclerosis, among others-and may contribute to their pathogenesis.']

**With Deep Memory**

In [None]:
results_dm = vectorstore.search(embedding_data = prompt, deep_memory = True)

In [None]:
results_dm['text']

# [OUTPUT]
# ['Kidney diseases, including chronic kidney disease (CKD) and acute kidney injury (AKI), are associated with inflammation.',
#  'OBJECTIVES Calcific aortic valve (AV) disease is known to be an inflammation-related process.',
#  "Crohn's disease and ulcerative colitis, the two main types of chronic inflammatory bowel disease, are multifactorial conditions of unknown aetiology.",
#  'BACKGROUND Two inflammatory disorders, type 1 diabetes and celiac disease, cosegregate in populations, suggesting a common genetic origin.']

**We observe that the two search methods share one result (the kidney disease abstract), while the other three of the four answers (75%) differ.**
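A comparison like this can be computed rather than read off by eye. The helper below is a minimal sketch. `result_overlap` is a hypothetical name, and the short labels stand in for the result texts. It could be applied to `results['text']` and `results_dm['text']` from the cells above.

```python
def result_overlap(first, second):
    """Return the items of `first` also found in `second`, and the % of `first` that differs."""
    shared = [item for item in first if item in second]
    differing_pct = 100.0 * (len(first) - len(shared)) / len(first)
    return shared, differing_pct

# Toy result lists with made-up labels.
shared, differing_pct = result_overlap(
    ["a", "b", "c", "d", "e"],
    ["a", "x", "y", "z", "e"],
)
print(shared, differing_pct)
```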