# CTranslate2

**CTranslate2** is a C++ and Python library for efficient inference with Transformer models.

The project implements a custom runtime that applies performance optimizations such as weight quantization, layer fusion, and batch reordering to speed up Transformer models and reduce their memory usage on CPU and GPU.

The full list of features and supported models and frameworks is available in the [project's repository](https://github.com/OpenNMT/CTranslate2). To get started, see the official [quickstart guide](https://opennmt.net/CTranslate2/quickstart.html).

To use it, you need the `ctranslate2` Python package installed.

In [None]:
#!pip install ctranslate2

As explained in the quickstart guide, a Hugging Face model must first be converted to the CTranslate2 format with the `ct2-transformers-converter` command before it can be used with CTranslate2:

In [None]:
#!ct2-transformers-converter --model meta-llama/Llama-2-7b-hf --quantization int8 --output_dir llama-2-7b-ct2 --force
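
The `--quantization int8` option in the conversion command stores the model weights as 8-bit integers. As a rough illustration of the general idea, here is a minimal sketch of symmetric per-tensor quantization in plain Python. It is not CTranslate2's actual implementation, and the helper names `quantize_int8` and `dequantize` are hypothetical.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(w) for w in weights)
    # Guard against an all-zero tensor, where any scale would do.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantized]


# Illustrative values only, not real model weights.
weights = [0.5, -1.27, 0.03, 1.0]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

print(quantized)
print(restored)
```

Each weight then needs one byte instead of four, at the cost of a rounding error of at most half the scale factor per value.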

In [None]:
from langchain.llms import CTranslate2

llm = CTranslate2(
    model_path="/mnt/ml-team/homes/eryk.mazus/llama-2-7b-ct2",
    tokenizer_name="meta-llama/Llama-2-7b-hf",
    device="cuda",
    # device_index can be either a single int or a list of ints,
    # indicating the ids of the GPUs to use for inference:
    device_index=[0, 1],
    compute_type="bfloat16"
)

## Single call

In [None]:
print(
    llm(
        "A rap battle between Stephen Colbert and John Oliver",
        max_length=256,
        sampling_topp=0.95,
        sampling_temperature=1.0,
        repetition_penalty=2,
    )
)
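
The `sampling_topp` argument in the call above restricts sampling to the most probable tokens. The sketch below shows the general top-p (nucleus) filtering idea on a toy distribution, assuming the usual definition: keep the smallest set of top-ranked tokens whose cumulative probability reaches the threshold, then renormalize. The function `top_p_filter` and the example probabilities are hypothetical, and CTranslate2's internal implementation may differ in detail.

```python
def top_p_filter(probs, top_p):
    """Keep the highest-probability tokens until their cumulative mass reaches top_p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept = {}
    cumulative = 0.0
    for token, p in ranked:
        kept[token] = p
        cumulative += p
        if cumulative >= top_p:
            break
    # Renormalize so the surviving probabilities sum to 1 again.
    total = sum(kept.values())
    return {token: p / total for token, p in kept.items()}


# Toy next-token distribution, for illustration only.
probs = {"the": 0.5, "a": 0.3, "an": 0.15, "this": 0.05}
filtered = top_p_filter(probs, top_p=0.9)

print(filtered)
```

With a threshold of 0.9, the least likely token is dropped and the next token is drawn from the remaining three.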

## Multiple calls

In [None]:
print(
    llm.generate(
        ["List of European capital cities:", "List of capital cities in Asia:"],
        max_length=128
    )
)

## Integrate the model in an LLMChain

In [None]:
from langchain import PromptTemplate, LLMChain

template = """Question: {question}

Answer: Let's think step by step."""
prompt = PromptTemplate(template=template, input_variables=["question"])

llm_chain = LLMChain(prompt=prompt, llm=llm)

question = "Who was the US president in the year the first Pokemon game was released?"

print(llm_chain.run(question))
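
To see what text the chain sends to the model, the template can be rendered by hand. This sketch uses plain `str.format`, assuming that `PromptTemplate`'s default formatting behaves the same way for a simple single-variable template like this one; it does not call LangChain or the model.

```python
template = """Question: {question}

Answer: Let's think step by step."""

question = "Who was the US president in the year the first Pokemon game was released?"

# Substitute the question into the template, as the chain does before calling the LLM.
rendered = template.format(question=question)

print(rendered)
```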