In [None]:
# @title
from IPython.display import HTML, display

def set_css():
  display(HTML('''
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  '''))
get_ipython().events.register('pre_run_cell', set_css)

## Running a Llama2 LLM in Google Colab

<a target="_blank" href="https://colab.research.google.com/github/gregmeldrum/localLLM/blob/main/colab/Llama2_Orca_Mini_3.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

This Colab notebook can run any of the Llama2 LLMs and their derivatives.

Due to the resource constraints of Colab, we'll use a 5-bit quantized 13B parameter model in GGML format. The [llama cpp](https://github.com/ggerganov/llama.cpp) project supports GGML formatted models and allows the model to run on a combination of CPU and GPU.
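
To see why quantization matters here, the following back-of-the-envelope sketch estimates the storage needed for the weights alone. The helper name `estimate_weight_bytes` is hypothetical, and the calculation assumes a flat number of bits per weight; real quantized files are typically somewhat larger because they also store per-block scale metadata, and running the model needs extra memory for the context.

```python
def estimate_weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate bytes needed to store the weights alone (assumption: flat bit width)."""
    return n_params * bits_per_weight / 8


n_params = 13e9  # 13B parameters

fp16_gb = estimate_weight_bytes(n_params, 16) / 1e9  # unquantized half precision
q5_gb = estimate_weight_bytes(n_params, 5) / 1e9     # idealized 5-bit quantization

print(f"fp16: ~{fp16_gb:.1f} GB, 5-bit: ~{q5_gb:.1f} GB")
```

Under these assumptions the 5-bit weights take roughly a third of the space of the half-precision weights, which is what makes a 13B model plausible on a Colab GPU.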

We'll use [langchain](https://python.langchain.com/) to run the model.

The model used in this colab is based on [Orca-Mini-v3-13B](https://huggingface.co/psmathur/orca_mini_v3_13b) which (at the time of writing this note) is one of the top rated 13B models on the [Hugging Face LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard). More specifically, we'll be using the [GGML quantized version](https://huggingface.co/TheBloke/orca_mini_v3_13B-GGML) created by [The Bloke](https://huggingface.co/TheBloke).

This model uses the following prompt template:

```
### System:
You are an AI assistant that follows instruction extremely well. Help as much as you can.

### User:
{prompt}

### Input:
{input}

### Response:
```

We won't be using `Input`, so we'll remove that from the template.
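
As a minimal sketch of what the trimmed template looks like once filled in, the snippet below uses plain `str.format` rather than langchain. The names `TEMPLATE` and `build_prompt` are hypothetical and only serve this illustration.

```python
# The model's template with the "### Input:" section removed.
TEMPLATE = (
    "### System:\n"
    "You are an AI assistant that follows instruction extremely well. "
    "Help as much as you can.\n\n"
    "### User:\n"
    "{prompt}\n\n"
    "### Response:"
)


def build_prompt(user_prompt: str) -> str:
    """Fill the user's text into the template (hypothetical helper)."""
    return TEMPLATE.format(prompt=user_prompt)


example = build_prompt("Is the meaning of life really 42?")
print(example)
```

The prompt ends with `### Response:` so that the model continues from that point with its answer.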

First we'll install the `llama-cpp-python` library, compiling it with cuBLAS (GPU) support, and then download the GGML model file from Hugging Face.

In [None]:
!pip uninstall -y llama-cpp-python
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install -q llama-cpp-python --no-cache-dir

!apt-get -y install -qq aria2
!aria2c --console-log-level=error -c -x 16 -s 16 -k 1M https://huggingface.co/TheBloke/orca_mini_v3_13B-GGML/resolve/main/orca_mini_v3_13b.ggmlv3.q5_K_M.bin -d /content/ -o orca_mini_v3_13b.ggmlv3.q5_K_M.bin


Next we'll install langchain and create the LLM and the LLM chain. For a description of the parameters used to configure the LlamaCpp LLM see the [API definition](https://api.python.langchain.com/en/latest/llms/langchain.llms.llamacpp.LlamaCpp.html#langchain.llms.llamacpp.LlamaCpp).
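
Before constructing the model it can help to sanity-check the settings. The sketch below is a hypothetical helper, `check_llama_settings`, that is not part of langchain or llama-cpp-python; the ranges it enforces are taken from the comments in this notebook's configuration cell (`n_batch` between 1 and `n_ctx`, `temperature` between 0 and 2), not from the library documentation.

```python
def check_llama_settings(n_ctx: int, n_batch: int, temperature: float) -> None:
    """Raise ValueError if the settings fall outside the ranges this notebook assumes."""
    if n_ctx < 1:
        raise ValueError(f"n_ctx must be at least 1, got {n_ctx}")
    if not 1 <= n_batch <= n_ctx:
        raise ValueError(f"n_batch must be between 1 and n_ctx={n_ctx}, got {n_batch}")
    if not 0 <= temperature <= 2:
        raise ValueError(f"temperature must be between 0 and 2, got {temperature}")


# The values used in this notebook pass the check silently.
check_llama_settings(n_ctx=2048, n_batch=2048, temperature=0.75)
```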

In [None]:
!pip install -q langchain

In [None]:
from langchain.llms import LlamaCpp
from langchain import PromptTemplate, LLMChain
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

template = """### System:
You are an AI assistant that follows instruction extremely well. Help as much as you can.

### User:
{question}

### Response:"""

prompt = PromptTemplate(template=template, input_variables=["question"])

# Callbacks support token-wise streaming
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])
# verbose=True (set on the LlamaCpp constructor below) is required for output to reach the callback manager

temperature = 0.75 # Use a value between 0 and 2. Lower = factual, higher = creative
n_gpu_layers = 43  # Change this value based on your model and your GPU VRAM pool.
n_batch = 2048  # Should be between 1 and n_ctx, consider the amount of VRAM in your GPU.

# Make sure the model path is correct for your system!
llm = LlamaCpp(
    model_path="/content/orca_mini_v3_13b.ggmlv3.q5_K_M.bin",
    temperature=temperature,
    max_tokens=2048,
    n_ctx=2048,
    top_p=0.95,
    n_gpu_layers=n_gpu_layers,
    n_batch=n_batch,
    callback_manager=callback_manager,
    verbose=True,
)

llm_chain = LLMChain(prompt=prompt, llm=llm)


Now let's test out the model.
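
The cells that follow each send one question to the chain. If you prefer to run several questions in one go, a loop along these lines could work. This is only a sketch: `EchoChain` is a hypothetical stand-in that exposes the same `.run(question)` method so the example runs without the model, and `ask_all` is an illustrative helper name.

```python
class EchoChain:
    """Hypothetical stand-in for the chain; returns a canned string instead of calling a model."""

    def run(self, question: str) -> str:
        return f"(stub answer to: {question})"


def ask_all(chain, questions):
    """Run each question through chain.run and collect the answers by question."""
    return {q: chain.run(q) for q in questions}


answers = ask_all(EchoChain(), ["Is the meaning of life really 42?"])
for q, a in answers.items():
    print(q, "->", a)
```

In the notebook itself you would pass `llm_chain` in place of `EchoChain()`.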

In [None]:
question = "Write the Python code to calculate a least squares regression line for a list of x, y coordinates. Output the slope and intercept."

output = llm_chain.run(question)

In [None]:
question = "Who was the prime minister of Canada on the day that Justin Bieber was born? Explain your reasoning."

output = llm_chain.run(question)

In [None]:
question = "Who would win in a fight, Batman or Superman?"

output = llm_chain.run(question)


In [None]:
question = "If Taylor is faster than Rahul and Rahul is faster than Juan, is Taylor faster than Juan? Explain your reasoning."

output = llm_chain.run(question)

In [None]:
question = "Write a short dystopian story about how in the near future, AI will become the master of the human race."

output = llm_chain.run(question)

In [None]:
question = "Is the meaning of life really 42?"

output = llm_chain.run(question)

In [None]:
print(output)