# EXAMPLES (RAG)
- [RAG](https://docs.activeloop.ai/examples/rag)
  - [RAG Quickstart](https://docs.activeloop.ai/examples/rag/quickstart)
  - [RAG Tutorials](https://docs.activeloop.ai/examples/rag/tutorials)
    - [Vector Store Basics](https://docs.activeloop.ai/examples/rag/tutorials/vector-store-basics)
    - [Vector Search Options](https://docs.activeloop.ai/examples/rag/tutorials/vector-search-options)
      - [LangChain API](https://docs.activeloop.ai/examples/rag/tutorials/vector-search-options/langchain-api)
      - [Deep Lake Vector Store API](https://docs.activeloop.ai/examples/rag/tutorials/vector-search-options/vector-store-api)
      - [Managed Database REST API](https://docs.activeloop.ai/examples/rag/tutorials/vector-search-options/rest-api)
    - [Customizing Your Vector Store](https://docs.activeloop.ai/examples/rag/tutorials/step-4-customizing-vector-stores)
    - [Image Similarity Search](https://docs.activeloop.ai/examples/rag/tutorials/image-similarity-search)
    - [Improving Search Accuracy using Deep Memory](https://docs.activeloop.ai/examples/rag/tutorials/deepmemory)
  - [LangChain Integration](https://docs.activeloop.ai/examples/rag/langchain-integration)
  - [LlamaIndex Integration](https://docs.activeloop.ai/examples/rag/llamaindex-integration)
  - [**Managed Tensor Database**](https://docs.activeloop.ai/examples/rag/managed-database)
    - [**REST API**](https://docs.activeloop.ai/examples/rag/managed-database/rest-api)
    - [**Migrating Datasets to the Tensor Database**](https://docs.activeloop.ai/examples/rag/managed-database/migrating-datasets-to-the-tensor-database)
  - [Deep Memory](https://docs.activeloop.ai/examples/rag/deep-memory)
    - [How it Works](https://docs.activeloop.ai/examples/rag/deep-memory/how-it-works)

## RAG (Managed Tensor Database)

### Overview of Deep Lake's Managed Tensor Database
- *Deep Lake offers a serverless Managed Tensor Database that eliminates the complexity of self-hosting and substantially lowers costs. Currently, it supports only dataset queries, including vector search, but additional features for creating and modifying data are being added in December 2023.*<br>
- *Comparison of Deep Lake as a Managed Database vs Embedded Database*<br>
[DeepLake (Embedded vs Managed)](https://docs.activeloop.ai/~gitbook/image?url=https%3A%2F%2Fcontent.gitbook.com%2Fcontent%2FWOs95B2h3lcO4dwXDRJ3%2Fblobs%2FK3tTkpXoP4wBU4GBgBsc%2FDeep_Lake_Embedded_vs_Managed.png&width=400&dpr=3&quality=100&sign=df2cb5c0&sv=2)

#### User Interfaces
**LangChain and LlamaIndex**

In [1]:
# To use the Managed Vector Database in LangChain or LlamaIndex, specify dataset_path and runtime during Vector Store creation.

# dataset_path = hub://org_id/dataset_name and runtime = {"tensor_db": True}
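As a minimal sketch of the settings above, the helper below assembles the two arguments that select the Managed Tensor Database. The function name `managed_store_kwargs` and the `my_org` / `my_dataset` values are hypothetical; only `dataset_path` and `runtime = {"tensor_db": True}` come from the notes.

```python
def managed_store_kwargs(org_id: str, dataset_name: str) -> dict:
    """Build keyword arguments that point a vector store at the Managed Tensor Database."""
    if not org_id or not dataset_name:
        raise ValueError("org_id and dataset_name must be non-empty")
    return {
        # Managed datasets live under the hub:// scheme.
        "dataset_path": f"hub://{org_id}/{dataset_name}",
        # This flag selects the Managed Tensor Database.
        "runtime": {"tensor_db": True},
    }


# Placeholder organization and dataset names, for illustration only.
kwargs = managed_store_kwargs("my_org", "my_dataset")
print(kwargs)
```

The resulting dictionary would be unpacked into the vector store constructor together with an embedding function, assuming the installed LangChain or LlamaIndex integration accepts a `runtime` argument.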

### REST API
*A standalone REST API is available for interacting with the Managed Database.*

#### Overview of the Managed Database REST API

In [2]:
# Deep Lake Tensor Database can be accessed via REST API.
# The datasets must be stored in the Tensor Database by specifying the deeplake_path

# deeplake_path = hub://org_id/dataset_name and runtime = {"tensor_db": True}

#### Querying via the REST API

In [3]:
# The primary input to the query API is a query string that contains all the necessary information for
#   executing the query, including the path to the Deep Lake data.

**Input**

In [4]:
# url = "https://app.activeloop.ai/api/query/v1"

# headers = {
#     "Authorization": f"Bearer {user_token}"
#     }

# # Format the embedding array or list as a string, so it can be passed in the REST API request.
# embedding_string = ",".join([str(item) for item in embedding])

# request = {
#     "query": f"select * from (select text, cosine_similarity(embedding, ARRAY[{embedding_string}]) as score from \"{dataset_path}\") order by score desc limit 5",
#     "as_list": True/False # Defaults to True.
#     }
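The request body above can be assembled with a small helper. This is a sketch under stated assumptions: the function name `build_query_payload`, the `limit` parameter, and the three-element embedding are illustrative, while the query shape and the `as_list` key follow the input shown above.

```python
import json


def build_query_payload(dataset_path, embedding, limit=5, as_list=True):
    """Return the JSON body for a cosine-similarity query against the dataset."""
    # The embedding is serialized as a comma-separated string inside ARRAY[...].
    embedding_string = ",".join(str(item) for item in embedding)
    query = (
        "select * from (select text, "
        f"cosine_similarity(embedding, ARRAY[{embedding_string}]) as score "
        f'from "{dataset_path}") order by score desc limit {int(limit)}'
    )
    return {"query": query, "as_list": as_list}


# Toy 3-dimensional embedding; a real embedding has many more values.
payload = build_query_payload("hub://org_id/dataset_name", [0.1, 0.2, 0.3], limit=2)
print(json.dumps(payload, indent=2))
```

Building the string in one place keeps the quoting of the dataset path consistent and makes it easy to switch between the two response layouts.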

**Response**<br>
**as_list = True** (default)<br>
Returns a list of JSON objects, one per row.

In [5]:
# {
#   "message": "Query successful.",
#   "tensors": [
#     "text",
#     "score"
#   ],
#   "data": [
#     {
#       "text": "# Twitter's Recommendation Algorithm\n\nTwitter's Recommendation Algorithm is a set of services and jobs that are responsible for constructing and serving the\nHome Timeline. For an introduction to how the algorithm works, please refer to our [engineering blog](https://blog.twitter.com/engineering/en_us/topics/open-source/2023/twitter-recommendation-algorithm). The\ndiagram below illustrates how major services and jobs interconnect.\n\n![](docs/system-diagram.png)\n\nThese are the main components of the Recommendation Algorithm included in this repository:",
#       "score": 22.59016227722168
#     },
#     {
#       "text": "![](docs/system-diagram.png)\n\nThese are the main components of the Recommendation Algorithm included in this repository:",
#       "score": 22.5976619720459
#     },
#     ...
#   ]
# }

**Response**<br>
**as_list = False**<br>
Returns a list of values per tensor.

In [6]:
# {
#   "message": "Query successful.",
#   "tensors": [
#     "text",
#     "score"
#   ],
#   "data": {
#     "text": [
#       "# Twitter's Recommendation Algorithm\n\nTwitter's Recommendation Algorithm is a set of services and jobs that are responsible for constructing and serving the\nHome Timeline. For an introduction to how the algorithm works, please refer to our [engineering blog](https://blog.twitter.com/engineering/en_us/topics/open-source/2023/twitter-recommendation-algorithm). The\ndiagram below illustrates how major services and jobs interconnect.\n\n![](docs/system-diagram.png)\n\nThese are the main components of the Recommendation Algorithm included in this repository:",
#       "![](docs/system-diagram.png)\n\nThese are the main components of the Recommendation Algorithm included in this repository:",
#       "| Type | Component | Description |\n|------------|------------|------------|\n| Feature | [SimClusters](src/scala/com/twitter/simclusters_v2/README.md) | Community detection and sparse embeddings into those communities. |\n|         | [TwHIN](https://github.com/twitter/the-algorithm-ml/blob/main/projects/twhin/README.md) | Dense knowledge graph embeddings for Users and Tweets. |\n|         | [trust-and-safety-models](trust_and_safety_models/README.md) | Models for detecting NSFW or abusive content. |\n|         | [real-graph](src/scala/com/twitter/interaction_graph/README.md) | Model to predict the likelihood of a Twitter User interacting with another User. |\n|         | [tweepcred](src/scala/com/twitter/graph/batch/job/tweepcred/README) | Page-Rank algorithm for calculating Twitter User reputation. |\n|         | [recos-injector](recos-injector/README.md) | Streaming event processor for building input streams for [GraphJet](https://github.com/twitter/GraphJet) based services. |\n|         | [graph-feature-service](graph-feature-service/README.md) | Serves graph features for a directed pair of Users (e.g. how many of User A's following liked Tweets from User B). |\n| Candidate Source | [search-index](src/java/com/twitter/search/README.md) | Find and rank In-Network Tweets. ~50% of Tweets come from this candidate source. |\n|                  | [cr-mixer](cr-mixer/README.md) | Coordination layer for fetching Out-of-Network tweet candidates from underlying compute services. |\n|                  | [user-tweet-entity-graph](src/scala/com/twitter/recos/user_tweet_entity_graph/README.md) (UTEG)| Maintains an in memory User to Tweet interaction graph, and finds candidates based on traversals of this graph. This is built on the [GraphJet](https://github.com/twitter/GraphJet) framework. Several other GraphJet based features and candidate sources are located [here](src/scala/com/twitter/recos). |\n|                  | [follow-recommendation-service](follow-recommendations-service/README.md) (FRS)| Provides Users with recommendations for accounts to follow, and Tweets from those accounts. |\n| Ranking | [light-ranker](src/python/twitter/deepbird/projects/timelines/scripts/models/earlybird/README.md) | Light Ranker model used by search index (Earlybird) to rank Tweets. |\n|         | [heavy-ranker](https://github.com/twitter/the-algorithm-ml/blob/main/projects/home/recap/README.md) | Neural network for ranking candidate tweets. One of the main signals used to select timeline Tweets post candidate sourcing. |\n| Tweet mixing & filtering | [home-mixer](home-mixer/README.md) | Main service used to construct and serve the Home Timeline. Built on [product-mixer](product-mixer/README.md). |\n|                          | [visibility-filters](visibilitylib/README.md) | Responsible for filtering Twitter content to support legal compliance, improve product quality, increase user trust, protect revenue through the use of hard-filtering, visible product treatments, and coarse-grained downranking. |\n|                          | [timelineranker](timelineranker/README.md) | Legacy service which provides relevance-scored tweets from the Earlybird Search Index and UTEG service. |\n| Software framework | [navi](navi/README.md) | High performance, machine learning model serving written in Rust. |\n|                    | [product-mixer](product-mixer/README.md) | Software framework for building feeds of content. |\n|                    | [twml](twml/README.md) | Legacy machine learning framework built on TensorFlow v1. |",
#       "We include Bazel BUILD files for most components, but not a top-level BUILD or WORKSPACE file.\n\n## Contributing",
#       "We include Bazel BUILD files for most components, but not a top-level BUILD or WORKSPACE file.\n\n## Contributing\n\nWe invite the community to submit GitHub issues and pull requests for suggestions on improving the recommendation algorithm. We are working on tools to manage these suggestions and sync changes to our internal repository. Any security concerns or issues should be routed to our official [bug bounty program](https://hackerone.com/twitter) through HackerOne. We hope to benefit from the collective intelligence and expertise of the global community in helping us identify issues and suggest improvements, ultimately leading to a better Twitter.\n\nRead our blog on the open source initiative [here](https://blog.twitter.com/en_us/topics/company/2023/a-new-era-of-transparency-for-twitter)."
#     ],
#     "score": [
#       22.843185424804688,
#       22.83962631225586,
#       22.835460662841797,
#       22.83342170715332,
#       22.832916259765625
#     ]
#   }
# }
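When `as_list = False`, the response is column-oriented, so it is often convenient to pivot it back into one record per row. The sketch below shows one way to do that; the function name `columns_to_rows` and the two-row sample response are made up for illustration, and only the `tensors` / `data` layout is taken from the response above.

```python
def columns_to_rows(response):
    """Convert a column-oriented query response into a list of per-row dicts."""
    tensors = response["tensors"]
    columns = response["data"]
    # All columns are expected to have the same length; use the first one.
    n_rows = len(columns[tensors[0]]) if tensors else 0
    return [{name: columns[name][i] for name in tensors} for i in range(n_rows)]


# Shortened placeholder response in the as_list = False layout.
example = {
    "message": "Query successful.",
    "tensors": ["text", "score"],
    "data": {"text": ["first chunk", "second chunk"], "score": [0.9, 0.8]},
}

rows = columns_to_rows(example)
print(rows[0])
```

The result has the same shape as the `as_list = True` payload, so downstream code can handle both layouts uniformly.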

### Migrating Datasets to the Tensor Database

#### Migrate existing Deep Lake datasets to the Tensor Database

In [7]:
# Datasets are created in the Tensor Database by specifying the dest and runtime during dataset creation.
# If datasets are currently stored locally, in your cloud, or in non-database Activeloop storage,
#   they can be migrated to the Tensor Database using deeplake.deepcopy.

# dest = "hub://<org_id>/<dataset_name>" and runtime = {"tensor_db": True}

In [8]:
# import deeplake

# ds_tensor_db = deeplake.deepcopy(src = <current_path>, 
#                                  dest = "hub://<org_id>/<dataset_name>", 
#                                  runtime = {"tensor_db": True}, 
#                                  src_creds = {<creds_dict>}, # Only necessary if src is in your cloud
#                                  )

## REST API (EXAMPLE)

In [9]:
import requests
import openai
import os
from dotenv import load_dotenv

load_dotenv(override = True)
openai_api_key = os.getenv('OPENAI_API_KEY')
activeloop_token = os.getenv('ACTIVELOOP_TOKEN')

MODEL_GPT = 'gpt-4o-mini'

In [10]:
# Tokens should be set in environment variables.
ACTIVELOOP_TOKEN = os.environ['ACTIVELOOP_TOKEN']
DATASET_PATH = 'hub://activeloop/twitter-algorithm'
ENDPOINT_URL = 'https://app.activeloop.ai/api/query/v1'
SEARCH_TERM = 'What do the trust and safety models do?'
# OPENAI_API_KEY must also be set as an environment variable.

In [11]:
# The headers contain the user token
headers = {
    "Authorization": f"Bearer {ACTIVELOOP_TOKEN}",
}

In [12]:
# Embed the search term

# Legacy call for openai<1.0: openai.Embedding.create(input=SEARCH_TERM, model="text-embedding-ada-002")["data"][0]["embedding"]
embedding = openai.embeddings.create(input=SEARCH_TERM, model="text-embedding-ada-002").data[0].embedding

In [13]:
# Format the embedding array or list as a string, so it can be passed in the REST API request.
embedding_string = ",".join([str(item) for item in embedding])

# Create the query using TQL
query = f"select * from (select text, cosine_similarity(embedding, ARRAY[{embedding_string}]) as score from \"{DATASET_PATH}\") order by score desc limit 5"
          
# Submit the request
response = requests.post(ENDPOINT_URL, json={"query": query}, headers=headers)

data = response.json()
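The request above does not check for failures before decoding the body. A hedged sketch of a wrapper that adds a timeout and a status check follows. It assumes the endpoint signals errors through HTTP status codes, which the notes do not confirm. The names `run_tql_query`, `FakeResponse`, and `fake_post` are hypothetical, and the stand-ins exist only so the sketch runs without network access.

```python
def run_tql_query(post, endpoint_url, token, query, as_list=True, timeout=30):
    """Send a TQL query and return the decoded JSON body.

    `post` is any callable with the signature of requests.post.
    """
    response = post(
        endpoint_url,
        json={"query": query, "as_list": as_list},
        headers={"Authorization": f"Bearer {token}"},
        timeout=timeout,
    )
    # With requests, this raises requests.HTTPError on 4xx/5xx responses.
    response.raise_for_status()
    return response.json()


# Offline stand-ins (hypothetical) so the sketch runs without network access.
class FakeResponse:
    def raise_for_status(self):
        pass  # pretend the request succeeded

    def json(self):
        return {"tensors": ["text", "score"], "data": []}


def fake_post(url, json=None, headers=None, timeout=None):
    return FakeResponse()


# "placeholder query" is a dummy string, not valid TQL.
result = run_tql_query(fake_post, "https://app.activeloop.ai/api/query/v1",
                       "dummy-token", "placeholder query")
print(result["tensors"])
```

In the notebook, the call would be `run_tql_query(requests.post, ENDPOINT_URL, ACTIVELOOP_TOKEN, query)`, which passes the real `requests.post` in place of the stand-in.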

In [14]:
# print(data)
print(data["description"])
print(data["tensors"])
# print(data["data"])

Query successful.
['text', 'score']


In [15]:
print(data["data"][0]["score"])

0.8365592956542969


In [16]:
print(data["data"][0]["text"])

Trust and Safety Models

We decided to open source the training code of the following models:
- pNSFWMedia: Model to detect tweets with NSFW images. This includes adult and porn content.
- pNSFWText: Model to detect tweets with NSFW text, adult/sexual topics.
- pToxicity: Model to detect toxic tweets. Toxicity includes marginal content like insults and certain types of harassment. Toxic content does not violate Twitter's terms of service.
- pAbuse: Model to detect abusive content. This includes violations of Twitter's terms of service, including hate speech, targeted harassment and abusive behavior.

We have several more models and rules that we are not going to open source at this time because of the adversarial nature of this area. The team is considering open sourcing more models going forward and will keep the community posted accordingly.


In [17]:
print(data)

