# CTranslate2

## Overview

CTranslate2 is a C++ and Python library for efficient inference with Transformer models. The project implements a custom runtime that applies performance optimizations such as weight quantization, layer fusion, and batch reordering to speed up Transformer models and reduce their memory usage on CPU and GPU.

Here I try it with Flan-T5 models, following the sample code at https://opennmt.net/CTranslate2/guides/transformers.html#t5

## Setup


In [1]:
%pip install -qU ctranslate2 transformers[torch] sentencepiece

Note: you may need to restart the kernel to use updated packages.


## Convert Model
### Using Python


In [2]:
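# Python equivalent of the command-line conversion; kept commented out because the command is used instead.
#import ctranslate2
#model_id = "google/flan-t5-small"
#ct = ctranslate2.converters.TransformersConverter(model_name_or_path=model_id)
#ct.convert(output_dir="google/flan-t5-small-ct2", quantization="int8", force=True)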

### Using the command line

In [3]:
!ct2-transformers-converter --model google/flan-t5-small --output_dir google/flan-t5-small-ct2 --quantization int8 --force
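The `--quantization int8` option stores the model weights as 8-bit integers instead of floats. As a rough illustration of the idea, here is a minimal pure-Python sketch of symmetric int8 quantization with a single scale per tensor. The helper names `quantize_int8` and `dequantize_int8` are hypothetical and this is not CTranslate2's actual implementation, which may differ in its details.

```python
def quantize_int8(weights):
    """Map floats to integer codes in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    # Avoid division by zero when every weight is 0.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [round(w / scale) for w in weights]
    return codes, scale


def dequantize_int8(codes, scale):
    """Recover approximate float values from the integer codes."""
    return [c * scale for c in codes]


# Illustrative values only, not real model weights.
weights = [0.5, -1.27, 0.003, 1.0]
codes, scale = quantize_int8(weights)
restored = dequantize_int8(codes, scale)

# The rounding error per weight is at most about half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, max_error)
```

Each weight then takes one byte instead of four, at the cost of a small rounding error bounded by roughly half the scale.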


## Sample Code


In [6]:
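import ctranslate2
import transformers

# Load the converted model.
translator = ctranslate2.Translator("google/flan-t5-small-ct2")
# Load the tokenizer from the original model: the converted directory only
# contains tokenizer files if they were copied explicitly (e.g. with --copy_files).
tokenizer = transformers.AutoTokenizer.from_pretrained("google/flan-t5-small")

input_text = "translate English to German: The house is wonderful."
# CTranslate2 expects string tokens rather than token IDs.
input_tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(input_text))

results = translator.translate_batch([input_tokens])

# Take the best hypothesis for the first (and only) input and convert it back to text.
output_tokens = results[0].hypotheses[0]
output_text = tokenizer.decode(tokenizer.convert_tokens_to_ids(output_tokens))

print(output_text)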


Das Haus ist schön.
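Since `translate_batch` takes a list of token sequences, the same steps can be wrapped to handle several prompts in one call. The sketch below assumes a `translator` and `tokenizer` created as in the cell above; the helper name `translate_texts` is hypothetical and not part of CTranslate2.

```python
def translate_texts(translator, tokenizer, texts):
    """Run several prompts through one translate_batch call and return decoded strings."""
    # Convert every prompt to the string tokens that CTranslate2 expects.
    batch = [
        tokenizer.convert_ids_to_tokens(tokenizer.encode(text)) for text in texts
    ]
    results = translator.translate_batch(batch)

    outputs = []
    for result in results:
        # hypotheses[0] is the best hypothesis for this input.
        token_ids = tokenizer.convert_tokens_to_ids(result.hypotheses[0])
        outputs.append(tokenizer.decode(token_ids))
    return outputs
```

For example, `translate_texts(translator, tokenizer, ["translate English to German: The house is wonderful.", "translate English to German: Thank you."])` would return one output string per prompt, in the same order as the inputs.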
