# Build a RAG pipeline with Anyscale, LlamaIndex, and Hugging Face
Tap into LLMs to query your own data. This step-by-step guide builds a retrieval-augmented generation (RAG) pipeline with Anyscale, LlamaIndex, and Hugging Face.

## Setup

For this RAG pipeline you'll be using the following components:
- **LLM**: [Anyscale's Llama 2 70B model](https://docs.endpoints.anyscale.com/supported-models/meta-llama-Llama-2-70b-chat-hf/) through their inference endpoint.
- **Vectorizer**: [`bge-small-en-v1.5` embeddings model](https://huggingface.co/BAAI/bge-small-en-v1.5) from Hugging Face. You can choose another model from the [embeddings leaderboard](https://huggingface.co/spaces/mteb/leaderboard).
- **Vector Store**: LlamaIndex's in-memory vector store.


In [7]:
import os
from llama_index.llms import Anyscale
from llama_index.embeddings import HuggingFaceEmbedding

from llama_index import ServiceContext, download_loader, VectorStoreIndex, Document

This tutorial requires an Anyscale API key. If you don't have one, you can obtain a key with 1 million free tokens by [registering with Anyscale](https://app.endpoints.anyscale.com/welcome).

In [8]:
ANYSCALE_API_KEY = os.environ["ANYSCALE_API_KEY"] # Your Anyscale API key (esecret_...)
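
Reading the key with `os.environ[...]` raises a bare `KeyError` when the variable is unset. A minimal sketch of a friendlier check follows. The helper name `require_env` and its error message are assumptions for illustration, not part of Anyscale's or LlamaIndex's API.

```python
import os


def require_env(name, environ=None):
    """Return an environment variable's value or raise a descriptive error.

    `require_env` is a hypothetical helper for this tutorial, not a library function.
    """
    environ = os.environ if environ is None else environ
    value = environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set. Export it before running the notebook, "
            f"for example: export {name}=<your key>"
        )
    return value


# Usage in the notebook (commented out so the sketch runs without a key):
# ANYSCALE_API_KEY = require_env("ANYSCALE_API_KEY")
```

Passing a plain dictionary as `environ` makes the helper easy to exercise without touching the real environment.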

## Load data

In this section, you'll load Meta's [Long-Context Llama 2 paper](https://ai.meta.com/research/publications/effective-long-context-scaling-of-foundation-models/) into a single LlamaIndex document.

In [9]:
LLAMA2_LONG_URL = "https://scontent.fpei3-1.fna.fbcdn.net/v/t39.2365-6/382490704_260884199667762_4629529713553101244_n.pdf?_nc_cat=104&ccb=1-7&_nc_sid=3c67a6&_nc_ohc=Bkh2eqWw__wAX9NLirW&_nc_ht=scontent.fpei3-1.fna&oh=00_AfA8SUU5fGbJy7ZkRsMcEmv57Y9I5kG2PzZNkBnWaCUQlg&oe=651B4045"
RemoteReader = download_loader("RemoteReader")
original_docs = RemoteReader().load_data(url=LLAMA2_LONG_URL)

Check that the loader correctly parsed the PDF file.

In [10]:
print(original_docs[0].text[:500])

Effective Long-Context Scaling of Foundation Models
Wenhan Xiong†∗, Jingyu Liu†, Igor Molybog,
Hejia Zhang, Prajjwal Bhargava, Rui Hou, Louis Martin, Rashi Rungta,
Karthik Abinav Sankararaman, Barlas Oguz, Madian Khabsa, Han Fang,
Yashar Mehdad, Sharan Narang, Kshitiz Malik, Angela Fan,
Shruti Bhosale, Sergey Edunov, Mike Lewis, Sinong Wang∗, Hao Ma∗
Meta
Abstract
We present a series of long-context LLMs that support effective context windows
of up to 32,768 tokens. Our model series are built th


Finally, create a single document from the parsed PDF.

In [11]:
docs_content = "\n\n".join(doc.get_content() for doc in original_docs)
docs = [Document(text=docs_content)]
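
Joining the pages yields one long string. Before embedding, an index typically splits such a document into smaller overlapping chunks so that each piece fits the embedding model's input. The sketch below illustrates that idea in plain Python. The function `split_into_chunks`, the word-based sizes, and the overlap value are all assumptions for illustration and do not reproduce LlamaIndex's own token-based splitter.

```python
def split_into_chunks(text, chunk_size=200, overlap=20):
    """Split text into overlapping chunks of at most `chunk_size` words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


# Tiny demonstration with ten placeholder words.
sample = " ".join(f"w{i}" for i in range(10))
for chunk in split_into_chunks(sample, chunk_size=4, overlap=1):
    print(chunk)
```

The overlap repeats a few words across neighbouring chunks, which reduces the chance that a sentence relevant to a query is cut in half at a chunk boundary.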

## Query engine
Now that you've created a document, you can set up a **query engine** to query the document.

In this section, you'll set up a query engine using [Hugging Face embeddings](https://gpt-index.readthedocs.io/en/stable/examples/embeddings/huggingface.html) and Anyscale's Llama 2 70B endpoint.

In [12]:
# Anyscale's Llama 2 70B model, feel free to use 13B and 7B models as well
llm = Anyscale("meta-llama/Llama-2-70b-chat-hf", api_key=ANYSCALE_API_KEY)

# Top-ranking lightweight embedding model
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

service_context = ServiceContext.from_defaults(llm=llm, embed_model=embed_model)
vector_index = VectorStoreIndex.from_documents(docs, service_context=service_context)

In [13]:
query_engine = vector_index.as_query_engine()
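
Conceptually, a query engine built on a vector index embeds the question, ranks the stored chunks by similarity to it, and passes the best matches to the LLM as context. The sketch below shows that ranking step with a toy bag-of-words embedding standing in for `bge-small-en-v1.5`. The names `embed`, `cosine`, and `top_k` and the sample sentences are hypothetical, and this is not how LlamaIndex implements retrieval internally.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (a stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query, chunks, k=2):
    """Return the k chunks most similar to the query, best first."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


# Made-up chunks for illustration only.
chunks = [
    "cats sleep most of the day",
    "the context window is 32,768 tokens",
    "bread needs flour and water",
]
best = top_k("what is the context window", chunks, k=1)
print(best[0])
```

A real embedding model replaces `embed` with dense vectors that capture meaning rather than exact word overlap, but the rank-by-similarity step has the same shape.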

In [14]:
response = query_engine.query("What's the maximum context window of LLama 2 long context?")

In [15]:
print(response.response)

 According to the provided context information, the maximum context window of LLama 2 long context is 32,768 tokens. This is mentioned in the passage as the longest sequence length used in the continual pretraining process.
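
The answer refers to "the provided context information" because the retrieved chunks are sent to the LLM together with the question. A rough sketch of how such a prompt might be assembled is shown below. The function `build_prompt` and the template wording are assumptions for illustration, not LlamaIndex's actual prompt.

```python
def build_prompt(question, context_chunks):
    """Combine retrieved chunks and the question into a single prompt string."""
    context = "\n---\n".join(context_chunks)
    return (
        "Context information is below.\n"
        f"{context}\n"
        "Answer the question using only the context above.\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Illustrative call with one made-up context chunk.
prompt = build_prompt(
    "What is the context window?",
    ["the context window is 32,768 tokens"],
)
print(prompt)
```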
