# Building a Streaming API for Llama 2: Real-time AI with Jina and DocArray

[Read the accompanying article on the Jina AI website.](https://jina.ai/news/building-a-streaming-api-for-llama-2-real-time-ai-with-jina-and-docarray)

⚠ Activate a GPU to run this notebook. (Click on **Runtime** > **Change runtime type**, and in the popup, select **GPU** under **Hardware accelerator**.)

First, we install the libraries we need.

In [None]:
!pip install docarray jina bitsandbytes accelerate transformers

⚠ Before starting, you will need permission from Meta to download Llama 2 models from Hugging Face.

- First, you will need a Hugging Face account. You can sign up for one for free at https://huggingface.co/join.
- Next, you will need to request permission from Meta. You can make the request at https://huggingface.co/meta-llama/Llama-2-7b-chat-hf when you are logged in with your Hugging Face credentials. This request may take **one to two days to process**.
- Finally, request a token from Hugging Face from your settings page at https://huggingface.co/settings/tokens. You will need this token to download the model.

**Your Hugging Face account email address MUST match the email you provide on the Meta website, or your request will not be approved.**

When prompted by `huggingface-cli` below, paste your token into the `Token:` field and press Enter.

In [None]:
!huggingface-cli login

# Set up the schema classes

First, we write the message schemas using DocArray.

In [None]:
from docarray import BaseDoc

class PromptDocument(BaseDoc):
    prompt: str
    max_tokens: int

class ModelOutputDocument(BaseDoc):
    token_id: int
    generated_text: str

# Construct an Executor

Now, construct the Executor, including the code to download and open the Llama-2-7b-chat-hf model and tokenizer.

In [None]:
from jina import Executor, requests
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "meta-llama/Llama-2-7b-chat-hf"

class TokenStreamingExecutor(Executor):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name,
            device_map='auto',
            load_in_8bit=True
        )

    def starts_with_space(self, token_id):
        token = self.tokenizer.convert_ids_to_tokens(token_id)
        return token.startswith('▁')

    @requests(on='/stream')
    async def task(self, doc: PromptDocument, **kwargs) -> ModelOutputDocument:
        # Tokenize the prompt and move the tensors to the model's device.
        model_inputs = self.tokenizer(doc.prompt, return_tensors='pt').to(self.model.device)

        # Generate one token per iteration so each can be streamed immediately.
        for output_length in range(doc.max_tokens):
            output = self.model.generate(**model_inputs, max_new_tokens=1)
            current_token_id = output[0][-1]
            if current_token_id == self.tokenizer.eos_token_id:
                break

            current_token = self.tokenizer.decode(current_token_id, skip_special_tokens=True)
            # Decoding one token at a time loses the leading space marked by '▁', so restore it.
            if self.starts_with_space(current_token_id.item()) and output_length > 1:
                current_token = ' ' + current_token
            yield ModelOutputDocument(
                token_id=current_token_id.item(),  # plain int, not a tensor
                generated_text=current_token,
            )

            # Feed the full sequence generated so far back in as the next input.
            model_inputs = {
                'input_ids': output,
                'attention_mask': torch.ones_like(output),
            }
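
The streaming pattern used in `task` can be seen in isolation with a toy stand-in for the model. The sketch below is illustrative only: `toy_next_token`, `EOS`, and the canned token list are hypothetical and take the place of `self.model.generate` and the Llama tokenizer. It shows an async generator that yields one piece of text per step, stops at the end-of-sequence marker or at `max_tokens`, and restores the space that a leading `▁` stands for. Unlike the Executor above, this toy restores the space for every token after the first.

```python
import asyncio

EOS = "</s>"  # hypothetical end-of-sequence marker for this toy example
# Canned "model output"; the last entry is never reached because EOS comes first.
CANNED = ["▁Hello", ",", "▁world", "!", EOS, "▁ignored"]


def toy_next_token(step: int) -> str:
    """Hypothetical stand-in for one call to model.generate(max_new_tokens=1)."""
    return CANNED[step] if step < len(CANNED) else EOS


async def stream_tokens(max_tokens: int):
    """Yield text pieces one at a time, in the same spirit as the Executor's task."""
    for step in range(max_tokens):
        token = toy_next_token(step)
        if token == EOS:
            break  # stop as soon as the toy model signals it is done
        text = token.lstrip("▁")
        # '▁' marks a word boundary; re-insert the space except at the very start.
        if token.startswith("▁") and step > 0:
            text = " " + text
        yield text


async def collect(max_tokens: int) -> str:
    """Consume the stream and join the pieces into the full response."""
    pieces = []
    async for piece in stream_tokens(max_tokens):
        pieces.append(piece)
    return "".join(pieces)


result = asyncio.run(collect(10))
print(result)  # Hello, world!
```

Inside a notebook, where an event loop is already running, call `await collect(10)` instead of `asyncio.run(collect(10))`.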


# Create a prompt for Llama 2

Next, we create the prompt, including a system instruction.

In [None]:
llama_prompt = PromptDocument(
    max_tokens=200,
    prompt="""<s>[INST] <<SYS>>
You are a helpful, respectful, and honest assistant. Always answer as helpfully
and thoroughly as possible, while being safe. Sometimes people might say things
jokingly so don't take everything too seriously.

If a question does not make any sense, or is not factually coherent, explain why
instead of answering something not correct. If you don't know the answer to a
question, please don't share false information.
<</SYS>>

If I punch myself in the face and it hurts, am I weak or strong?! [/INST]"""
)
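
The prompt above follows a fixed layout: the system instruction sits between `<<SYS>>` and `<</SYS>>`, and the whole turn is wrapped in `[INST]` and `[/INST]`. As a minimal sketch, that layout could be produced by a small helper. The name `build_llama2_prompt` is hypothetical and not part of any library; it simply mirrors the structure of the string shown above for a single user turn.

```python
def build_llama2_prompt(system: str, user: str) -> str:
    """Wrap a system instruction and one user message in the layout used above.

    Hypothetical helper for illustration; it handles a single turn only.
    """
    return f"<s>[INST] <<SYS>>\n{system.strip()}\n<</SYS>>\n\n{user.strip()} [/INST]"


prompt_text = build_llama2_prompt(
    system="You are a helpful assistant.",
    user="Why is the sky blue?",
)
print(prompt_text)
```

The resulting string can be passed as the `prompt` field of a `PromptDocument`.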

# Loading the model and executing requests to it

This `main()` function creates a client to talk to the deployed Executor and queries it with the prompt, printing the answer as it streams in.

In [None]:
from jina import Deployment, Client
import asyncio

async def main():
    client = Client(port=12345, protocol='grpc', asyncio=True)
    # llama_prompt is already a PromptDocument, so pass it directly as the input.
    async for doc in client.stream_doc(
        on='/stream',
        inputs=llama_prompt,
        return_type=ModelOutputDocument,
    ):
        # Print each token as soon as it arrives, without a trailing newline.
        print(doc.generated_text, end='', flush=True)


Last, we deploy the Executor and run the `main()` function from above.

In [None]:
async def deploy_and_run():
    with Deployment(uses=TokenStreamingExecutor, port=12345, protocol='grpc'):
        await main()

Be aware that the initial run may take 10-15 minutes due to the model download.

Ideally, you would load and deploy the model once and then send it many requests. For this notebook demo, we run all the steps together, which might slow things down.

In [None]:
await deploy_and_run()
