# EXAMPLES (RAG)


## RAG Tutorials (Vector Store Basics)

### Downloading and Preprocessing the Data

In [1]:
from deeplake.core.vectorstore import VectorStore
import openai
import os
from dotenv import load_dotenv

# Load OPENAI_API_KEY and ACTIVELOOP_TOKEN from a local .env file
load_dotenv(override=True)
openai_api_key = os.getenv('OPENAI_API_KEY')
activeloop_token = os.getenv('ACTIVELOOP_TOKEN')



In [2]:
# !git clone https://github.com/twitter/the-algorithm

In [3]:
# vector_store_path = '/vector_store_getting_started'
# repo_path = '/the-algorithm'

vector_store_path = './vector_store_getting_started'
# repo_path = './the-algorithm'
repo_path = './the-algorithm/twml/twml/layers'

In [4]:
CHUNK_SIZE = 1000  # characters per chunk

chunked_text = []
metadata = []
for dirpath, dirnames, filenames in os.walk(repo_path):
    for file in filenames:
        try:
            full_path = os.path.join(dirpath, file)
            with open(full_path, 'r') as f:
                text = f.read()
            # Split the file into fixed-size character chunks
            new_chunked_text = [text[i:i + CHUNK_SIZE] for i in range(0, len(text), CHUNK_SIZE)]
            chunked_text += new_chunked_text
            # One metadata entry per chunk, pointing back to the source file
            metadata += [{'filepath': full_path} for _ in new_chunked_text]
            print(file)  # Log each file that was read and chunked
        except Exception as e:
            # Skip files that cannot be read as text and report why
            print(e)

batch_prediction_tensor_writer.py
batch_prediction_writer.py
data_record_tensor_writer.py
full_dense.py
full_sparse.py
isotonic.py
layer.py
mdl.py
partition.py
percentile_discretizer.py
sequential.py
sparse_max_norm.py
stitch.py
__init__.py
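
The chunking step in cell 4 can also be factored into a small reusable function. The sketch below is an illustration rather than part of the original notebook: the name `chunk_text` and the `overlap` parameter are assumptions added here. With `overlap=0` it reproduces the fixed-size slicing used above.

```python
def chunk_text(text, chunk_size=1000, overlap=0):
    """Split text into fixed-size character chunks.

    `overlap` is a hypothetical extension (not in the notebook): each chunk
    starts `chunk_size - overlap` characters after the previous one.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


# Small inputs make the slicing easy to inspect by hand.
chunks = chunk_text("abcdefghij", chunk_size=4)
overlapping = chunk_text("abcdefghij", chunk_size=4, overlap=2)
```

Overlapping chunks are one common way to avoid cutting a definition in half at a chunk boundary, at the cost of storing and embedding more text.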


In [5]:
print(type(chunked_text))
print(len(chunked_text))

<class 'list'>
76


In [6]:
# chunked_text[0]
# print(chunked_text[0])
print(chunked_text[0][:200])

# pylint: disable=no-member, invalid-name
"""
Implementing Writer Layer
"""
from .layer import Layer

import libtwml


class BatchPredictionTensorWriter(Layer):
  """
  A layer that packages keys and 


In [7]:
def embedding_function(texts, model="text-embedding-ada-002"):
   
   if isinstance(texts, str):
       texts = [texts]

   texts = [t.replace("\n", " ") for t in texts]
   
   return [data.embedding for data in openai.embeddings.create(input = texts, model=model).data]
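
The function above follows a simple contract: it accepts a string or a list of strings and returns one vector per input. For testing the rest of the pipeline without calling the OpenAI API, a stand-in with the same contract can be useful. The sketch below is hypothetical (the name `fake_embedding_function` and the `dim` parameter are assumptions), and its vectors carry no semantic meaning, so it is only suitable for checking plumbing, not search quality.

```python
import hashlib


def fake_embedding_function(texts, dim=8):
    """Hypothetical offline stand-in: hash each text to a fixed-length vector."""
    if not 1 <= dim <= 32:
        raise ValueError("dim must be between 1 and 32 (SHA-256 yields 32 bytes)")
    if isinstance(texts, str):
        texts = [texts]
    vectors = []
    for t in texts:
        # Same newline handling as the notebook's embedding_function
        digest = hashlib.sha256(t.replace("\n", " ").encode("utf-8")).digest()
        # Map the first `dim` bytes of the digest to floats in [0, 1]
        vectors.append([b / 255.0 for b in digest[:dim]])
    return vectors


single = fake_embedding_function("What is the MDL layer?")
batch = fake_embedding_function(["first chunk", "second chunk"])
```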

In [8]:
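# Note: if a store already exists at this path, it is loaded and add() appends to it.
# The summary below shows 152 rows (2 x 76), most likely because this cell was run
# against a store that already held the 76 chunks from an earlier run.
vector_store = VectorStore(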
    path = vector_store_path,
)

vector_store.add(text = chunked_text, 
                 embedding_function = embedding_function, 
                 embedding_data = chunked_text, 
                 metadata = metadata
)

Deep Lake Dataset in ./vector_store_getting_started already exists, loading from the storage


Creating 76 embeddings in 1 batches of size 76:: 100%|███████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.99s/it]

Dataset(path='./vector_store_getting_started', tensors=['embedding', 'id', 'metadata', 'text'])

  tensor      htype       shape      dtype  compression
  -------    -------     -------    -------  ------- 
 embedding  embedding  (152, 1536)  float32   None   
    id        text      (152, 1)      str     None   
 metadata     json      (152, 1)      str     None   
   text       text      (152, 1)      str     None   





In [9]:
vector_store.summary()

Dataset(path='./vector_store_getting_started', tensors=['embedding', 'id', 'metadata', 'text'])

  tensor      htype       shape      dtype  compression
  -------    -------     -------    -------  ------- 
 embedding  embedding  (152, 1536)  float32   None   
    id        text      (152, 1)      str     None   
 metadata     json      (152, 1)      str     None   
   text       text      (152, 1)      str     None   


### Performing Vector Search

In [10]:
# prompt = "What do trust and safety models do?"

# Check file [mdl.py] and its text 'MDL layer is constructed by MDLCalibrator'
# prompt = "MDL layer is constructed by?"
# prompt = "How is the MDL layer constructed?"
prompt = "What is the MDL layer?"

search_results = vector_store.search(embedding_data=prompt, embedding_function=embedding_function)

In [11]:
len(search_results['text'])

4

In [12]:
# search_results['text']
print(search_results['text'][0])

# pylint: disable=no-member, attribute-defined-outside-init, too-many-instance-attributes
"""
Implementing MDL Layer
"""


from .layer import Layer
from .partition import Partition
from .stitch import Stitch

import libtwml
import numpy as np
import tensorflow.compat.v1 as tf
import twml


class MDL(Layer):  # noqa: T000
  """
  MDL layer is constructed by MDLCalibrator after accumulating data
  and performing minimum description length (MDL) calibration.

  MDL takes sparse continuous features and converts then to sparse
  binary features. Each binary output feature is associated to an MDL bin.
  Each MDL input feature is converted to n_bin bins.
  Each MDL calibration tries to find bin delimiters such that the number of features values
  per bin is roughly equal (for each given MDL feature).
  Note that if an input feature is rarely used, so will its associated output bin/features.
  """

  def __init__(
          self,
          n_feature, n_bin, out_bits,
          bin_values=None,

In [13]:
search_results['metadata'][0]

{'filepath': './the-algorithm/twml/twml/layers\\mdl.py'}
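
The stored path mixes `/` and `\` because `os.path.join` was run on Windows with a forward-slash `repo_path`. If consistent separators matter (for example, when filtering on `filepath` later), the path can be normalized before it is written to the metadata. This is a minimal sketch; `normalize_filepath` is a hypothetical helper, and it assumes the paths were produced on Windows, since a backslash is a legal filename character on POSIX systems.

```python
from pathlib import PureWindowsPath


def normalize_filepath(path):
    """Return the path using forward slashes only."""
    # PureWindowsPath treats both '/' and '\\' as separators on any OS.
    return PureWindowsPath(path).as_posix()


normalized = normalize_filepath('./the-algorithm/twml/twml/layers\\mdl.py')
```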

### Customization of Vector Search

In [14]:
# Customize your vector search with simple parameters,
#   such as selecting the distance_metric and top k results

search_results = vector_store.search(embedding_data=prompt, 
                                     embedding_function=embedding_function, 
                                     k=10,
                                     distance_metric='l2')

In [15]:
len(search_results['text'])

10
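
To make the `distance_metric` choice concrete, the sketch below computes L2 distance and cosine similarity by hand and ranks a few toy vectors. It illustrates the math only and is not Deep Lake's implementation; the function names and sample vectors are made up for this example.

```python
import math


def l2_distance(a, b):
    """Euclidean distance: smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between a and b: larger means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_l2(query, vectors, k):
    """Indices of the k vectors closest to the query by L2 distance."""
    order = sorted(range(len(vectors)), key=lambda i: l2_distance(query, vectors[i]))
    return order[:k]


# Toy data (hypothetical): vectors[1] points the same way as the query but is longer.
query = [1.0, 0.0]
vectors = [[0.0, 1.0], [2.0, 0.0], [1.0, 0.1]]
nearest = top_k_l2(query, vectors, k=2)
```

L2 distance is sensitive to vector length while cosine similarity is not, so the two metrics can rank unnormalized vectors differently. For vectors normalized to unit length the two rankings coincide.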

In [16]:
# search_results['text'][0]
print(search_results['text'][0])

# pylint: disable=no-member, attribute-defined-outside-init, too-many-instance-attributes
"""
Implementing MDL Layer
"""


from .layer import Layer
from .partition import Partition
from .stitch import Stitch

import libtwml
import numpy as np
import tensorflow.compat.v1 as tf
import twml


class MDL(Layer):  # noqa: T000
  """
  MDL layer is constructed by MDLCalibrator after accumulating data
  and performing minimum description length (MDL) calibration.

  MDL takes sparse continuous features and converts then to sparse
  binary features. Each binary output feature is associated to an MDL bin.
  Each MDL input feature is converted to n_bin bins.
  Each MDL calibration tries to find bin delimiters such that the number of features values
  per bin is roughly equal (for each given MDL feature).
  Note that if an input feature is rarely used, so will its associated output bin/features.
  """

  def __init__(
          self,
          n_feature, n_bin, out_bits,
          bin_values=None,

### Full Customization of Vector Search

In [17]:
# Load a representative Vector Store that is already stored in the Deep Lake Tensor Database

vector_store = VectorStore(
    path = "hub://activeloop/twitter-algorithm",
    read_only=True
)

Deep Lake Dataset in hub://activeloop/twitter-algorithm already exists, loading from the storage


In [18]:
# Query should be constructed using the Tensor Query Language (TQL) syntax

prompt = "What do trust and safety models do?"

embedding = embedding_function(prompt)[0]

# Format the embedding as a comma-separated string so it can be inlined in the TQL query.
embedding_string = ",".join([str(item) for item in embedding])

tql_query = f"select * from (select text, cosine_similarity(embedding, ARRAY[{embedding_string}]) as score) order by score desc limit 5"
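
The query construction above can be wrapped in a helper so that the result limit is a parameter. This is a sketch under assumptions: `build_tql_query` is a hypothetical name, and the query text simply mirrors the one in this cell.

```python
def build_tql_query(embedding, k=5):
    """Build a cosine-similarity TQL query string (hypothetical helper)."""
    # Inline the embedding as a comma-separated list inside ARRAY[...]
    embedding_string = ",".join(str(x) for x in embedding)
    return (
        "select * from ("
        f"select text, cosine_similarity(embedding, ARRAY[{embedding_string}]) as score"
        f") order by score desc limit {int(k)}"
    )


# A three-element vector keeps the output readable; real embeddings here have 1536 values.
example_query = build_tql_query([0.1, 0.2, 0.3], k=3)
```

Only the columns named in the inner `select` are returned, so any other tensor needed in the results (such as `metadata`) would have to be added there.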

In [19]:
# Run the query. Execution happens in the Managed Tensor Database, not on the client.

search_results = vector_store.search(query=tql_query)

In [20]:
# search_results['text'][0]
print(search_results['text'][0])

Trust and Safety Models

We decided to open source the training code of the following models:
- pNSFWMedia: Model to detect tweets with NSFW images. This includes adult and porn content.
- pNSFWText: Model to detect tweets with NSFW text, adult/sexual topics.
- pToxicity: Model to detect toxic tweets. Toxicity includes marginal content like insults and certain types of harassment. Toxic content does not violate Twitter's terms of service.
- pAbuse: Model to detect abusive content. This includes violations of Twitter's terms of service, including hate speech, targeted harassment and abusive behavior.

We have several more models and rules that we are not going to open source at this time because of the adversarial nature of this area. The team is considering open sourcing more models going forward and will keep the community posted accordingly.


In [21]:
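# print(search_results['metadata'][0])

# Raises KeyError: 'metadata' because the TQL query above selects only `text` and `score`.
# Expected value if `metadata` were included in the select list:
# {'filepath': '/Users/istranic/ActiveloopCode/the-algorithm/trust_and_safety_models/README.md', 'extension': '.md'}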

In [22]:
list(search_results.keys())

['text', 'score']

In [23]:
print(search_results['score'][0])

0.8414647579193115
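
Because the results come back as parallel lists keyed by column name, it can be convenient to pair each text with its score. The sketch below assumes that structure (lists of equal length under `'text'` and `'score'`, as the keys above suggest); `rank_results` is a hypothetical helper and the sample dictionary is made-up stand-in data, not real query output.

```python
def rank_results(search_results, max_chars=80):
    """Pair each text with its score, assuming parallel 'text' and 'score' lists."""
    rows = []
    pairs = zip(search_results["text"], search_results["score"])
    for rank, (text, score) in enumerate(pairs, start=1):
        # Keep a short single-line preview of each chunk
        preview = text[:max_chars].replace("\n", " ")
        rows.append((rank, score, preview))
    return rows


# Hypothetical stand-in for the dictionary returned by vector_store.search(...)
sample = {"text": ["first chunk\nof text", "second chunk"], "score": [0.84, 0.79]}
rows = rank_results(sample)
```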


In [24]:
print(search_results)

