# Exllama

[GitHub:turboderp/exllama](https://github.com/turboderp/exllama) is a standalone Python/C++/CUDA implementation of Llama for use with 4-bit GPTQ weights, designed to be fast and memory-efficient on modern GPUs.
The Python module used here is installed from the [GitHub:jllllll/exllama](https://github.com/jllllll/exllama) repo.

This example goes over how to use LangChain to interact with `Exllama` models.

In [1]:
!python -m pip install git+https://github.com/jllllll/exllama

Note: you may need to restart the kernel to use updated packages.


### Import Exllama

In [1]:
from langchain import PromptTemplate, LLMChain
from langchain.llms import Exllama
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

### Set Up Question to pass to LLM

In [2]:
template = """Question: {question}

Answer: Let's think step by step."""

prompt = PromptTemplate(template=template, input_variables=["question"])
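To see the text the model will actually receive, the template can be filled in by hand. The following is a minimal sketch using plain `str.format` rather than LangChain; for a simple single-variable template like this one, it should produce the same text as `prompt.format(question=...)`. The sample question is the one used later in this example.

```python
# Plain-Python sketch: render the template without LangChain to inspect the prompt text.
template = """Question: {question}

Answer: Let's think step by step."""

question = "What NFL team won the Super Bowl in the year Justin Bieber was born?"

# Substitute the question into the {question} placeholder.
rendered = template.format(question=question)
print(rendered)
```

Because the template ends with "Let's think step by step.", the model's completion continues directly from that sentence.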

### Specify Model

To run locally, download a compatible 4-bit GPTQ model.

The [Exllama page](https://github.com/turboderp/exllama/blob/master/doc/model_compatibility.md) has a list of compatible models:

* Select a model of interest
* Download the model and move it to the `local_path` (noted below)

---

In [None]:
local_path = "./models/Llama-2-7b-Chat-GPTQ"  # replace with the path to your local model directory
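
Before loading the model, it can help to confirm that the directory looks like a GPTQ checkpoint. The sketch below assumes a layout of `config.json`, `tokenizer.model`, and at least one `*.safetensors` file; that layout is an assumption and may differ for your model. `missing_model_files` is a hypothetical helper written for this example, not part of LangChain or exllama.

```python
from pathlib import Path


def missing_model_files(model_dir):
    """Return the names of expected files that are absent from model_dir.

    Assumed layout (not verified against any specific exllama version):
    config.json, tokenizer.model, and at least one *.safetensors file.
    """
    root = Path(model_dir)
    # Fixed-name files that are expected to sit directly in the model directory.
    missing = [
        name
        for name in ("config.json", "tokenizer.model")
        if not (root / name).is_file()
    ]
    # The weights file name varies by model, so match on the extension only.
    if not (root.is_dir() and any(root.glob("*.safetensors"))):
        missing.append("*.safetensors")
    return missing


# An empty list means every expected file was found.
print(missing_model_files("./models/Llama-2-7b-Chat-GPTQ"))
```

If the list is not empty, check that the download finished and that `local_path` points at the model directory itself rather than its parent.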


In [None]:
# Callbacks support token-wise streaming
callbacks = [StreamingStdOutCallbackHandler()]

# verbose=True is required so that output is passed to the callback manager
llm = Exllama(model=local_path, callbacks=callbacks, verbose=True)


In [None]:
llm_chain = LLMChain(prompt=prompt, llm=llm)

In [None]:
question = "What NFL team won the Super Bowl in the year Justin Bieber was born?"

llm_chain.run(question)

Justin Bieber was born on March 1, 1994. In 1994, The Cowboys won Super Bowl XXVIII.