# Translate a LaTeX book into English


## 1. Read in the data

In [75]:
import yaml
import openai

# Read the YAML file
with open('config.yaml', 'r') as yaml_file:
    data = yaml.safe_load(yaml_file)

# Read the configuration
access_token = data['access_token_list'][1]
BASE_URL = data['BASE_URL'][1]

# openai.api_key = "put the access token here, not the API key"
openai.api_base = BASE_URL
openai.api_key = access_token

In [76]:
import openai
from transformers import GPT2Tokenizer

# The OpenAI GPT-2 tokenizer is the same as the GPT-3 tokenizer.
# We use it to count the number of tokens in the text
# (gpt-3.5-turbo uses a different encoding, so the counts are approximate for it).
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

with open("data/geometry_slovenian.tex", "r") as f:
    text = f.read()

### 1.1 Count the tokens in each chunk

In [77]:
# Shorten the text here for testing
chunks = text[:20000].split('\n\n')
ntokens = []
for chunk in chunks:
    ntokens.append(len(tokenizer.encode(chunk)))
max(ntokens)

847

A double newline turns out to be a good separator in this case, because it splits the text without breaking its flow. The largest chunk in this excerpt is 847 tokens, well under 1500. The text was written for text-davinci-002, which has a limit of 4096 tokens for prompt and completion combined (the code below calls gpt-3.5-turbo through get_completion instead), so we don't need to worry about breaking the chunks down further.
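Because the limit covers the prompt and the completion together, a rough budget check can be sketched as below. The helper name `fits_context`, the 50-token prompt overhead, and the assumption that a translation is about as long as its source are illustrative guesses, not measurements from this notebook.

```python
def fits_context(chunk_tokens, context_limit=4096, prompt_overhead=50, expansion=1.0):
    """Return True if the prompt plus the expected completion fit in the context window.

    prompt_overhead and expansion are assumed values for illustration only.
    """
    prompt_tokens = chunk_tokens + prompt_overhead
    completion_tokens = int(chunk_tokens * expansion)
    return prompt_tokens + completion_tokens <= context_limit


print(fits_context(847))   # largest chunk in the test excerpt
print(fits_context(3000))  # a chunk at the hard_max_len default used later
```

Under these assumptions a chunk of 847 tokens fits comfortably, while a chunk close to 3000 tokens would leave too little room for its translation.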

We will group the shorter chunks into chunks of around 1000 tokens, to increase the coherence of the text, and decrease the frequency of breaks within the text.

In [78]:
def group_chunks(chunks, ntokens, max_len=1000, hard_max_len=3000):
    """
    Group very short chunks, to form approximately page long chunks.
    """
    batches = []
    cur_batch = ""
    cur_tokens = 0
    
    # iterate over chunks, and group the short ones together
    for chunk, ntoken in zip(chunks, ntokens):
        # discard chunks that exceed hard max length
        if ntoken > hard_max_len:
            print(f"Warning: Chunk discarded for being too long ({ntoken} tokens > {hard_max_len} token limit). Preview: '{chunk[:50]}...'")
            continue

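        # start a new batch if the current one is empty; this avoids a leading
        # "\n\n" and an empty batch when the first chunk alone exceeds max_len
        if not cur_batch:
            cur_batch = chunk
            cur_tokens = ntoken
        # if room in current batch, add new chunk
        elif cur_tokens + 1 + ntoken <= max_len:
            cur_batch += "\n\n" + chunk
            cur_tokens += 1 + ntoken  # adds 1 token for the two newlines
        # otherwise, record the batch and start a new one
        else:
            batches.append(cur_batch)
            cur_batch = chunk
            cur_tokens = ntoken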
            
    if cur_batch:  # add the last batch if it's not empty
        batches.append(cur_batch)
        
    return batches


chunks = group_chunks(chunks, ntokens)
len(chunks)

10
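To see the greedy grouping rule on numbers alone, here is a minimal sketch that packs token counts rather than text. The function `pack_counts` and the sample counts are hypothetical; it only mirrors the idea of group_chunks and is not a replacement for it.

```python
def pack_counts(ntokens, max_len=1000):
    """Greedily pack token counts into batches of at most max_len tokens."""
    batches, current, total = [], [], 0
    for n in ntokens:
        # one extra token accounts for the separator between two chunks
        cost = n if not current else n + 1
        if current and total + cost > max_len:
            # no room left: close the current batch and start a new one
            batches.append(current)
            current, total = [n], n
        else:
            current.append(n)
            total += cost
    if current:
        batches.append(current)
    return batches


print(pack_counts([300, 400, 350, 847, 100, 50]))
# [[300, 400], [350], [847, 100, 50]]
```

The third count (350) does not fit next to the first two, so it starts its own batch; the last three fit together at 999 tokens including separators.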

Including a sample command in both its untranslated and translated form, where only the chapter name needs to be translated, helps to get more consistent results.

The format of the prompt sent to the model consists of:
1. A high-level instruction to translate only the text, not the LaTeX commands, into the desired language
2. A sample untranslated command, where only the content of the chapter name needs to be translated
3. The chunk of text to be translated
4. The translated sample command from 2, which shows the model the beginning of the translation process

The expected output is the translated chunk of text.
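The four parts can be assembled as in the sketch below. The helper `build_prompt` and the one-line sample chunk are hypothetical; the instruction text and the sample commands are taken from this notebook.

```python
def build_prompt(chunk, dest_language, sample_source, sample_target):
    """Assemble the four prompt parts in the order listed above."""
    fence = '"' * 3  # triple double quotes that delimit the source text
    instruction = (
        f"Translate only the text from the following LaTeX document into "
        f"{dest_language}. Leave all LaTeX commands unchanged"
    )
    # parts 2 and 3: the untranslated sample command, then the chunk itself
    quoted_source = f"{fence}\n{sample_source}\n{chunk}{fence}"
    # part 4: the translated sample command starts the model's answer
    return f"{instruction}\n\n{quoted_source}\n\n{sample_target}\n"


sample_source = r"\poglavje{Osnove Geometrije} \label{osn9Geom}"
sample_target = r"\poglavje{The basics of Geometry} \label{osn9Geom}"
chunk = "Besedilo poglavja."  # hypothetical placeholder for a real chunk

prompt = build_prompt(chunk, "English", sample_source, sample_target)
print(prompt)
```

Ending the prompt with the translated sample command nudges the model to continue from that point, in the same format.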

In [79]:
def get_completion(prompt, model="gpt-3.5-turbo"):
    messages = [{"role": "user", "content": prompt}]
    response = openai.ChatCompletion.create(
        model=model,
        messages=messages,
        temperature=0, # this is the degree of randomness of the model's output
    )
    return response.choices[0].message["content"]

In [80]:
def translate_chunk(chunk, engine='text-davinci-002',  # engine is unused while get_completion is called
                    dest_language='English',
                    # raw strings keep the LaTeX backslashes literal
                    sample_translation=(r"\poglavje{Osnove Geometrije} \label{osn9Geom}", r"\poglavje{The basics of Geometry} \label{osn9Geom}")
                    ):
    prompt = f'''Translate only the text from the following LaTeX document into {dest_language}. Leave all LaTeX commands unchanged
    
"""
{sample_translation[0]}
{chunk}"""

{sample_translation[1]}
'''
    # response = openai.Completion.create(
    #     prompt=prompt,
    #     engine=engine,
    #     temperature=0,
    #     top_p=1,
    #     max_tokens=1500,
    # )
    # result = response['choices'][0]['text'].strip()
    result = get_completion(prompt)
    result = result.replace('"""', '')  # remove the triple quotes, as we used them to surround the text
    return result


print(translate_chunk(chunks[2], engine='text-davinci-002', dest_language='English'))

Chapter 1: Basics of Geometry
The first two chapters deal with the history and axiomatic design of geometry. The consequences of the axioms of incidence, congruence, and parallelism are discussed in detail, while in the other two groups (axioms of order and continuity), the consequences are mostly not proven. Chapters three and four deal with the relation of the congruence of figures, the use of the triangle congruence theorems, and a circle. In the fifth chapter, vectors are defined. Thales's theorem of proportion is proven. Chapter six deals with isometries and their use. Their classification has been performed. Chapters 7 and 8 deal with similarity transformations, figure similarity relation, and area of figures. The ninth chapter presents the inversion. At the end of each chapter (except the introductory one) are exercises. Solutions and instructions can be found in the last, tenth chapter.

The book contains 341 theorems, 247 examples, and 418 solved problems (28 of them from the 

We can see that, for this chunk, the model translates only the text and leaves the LaTeX commands intact.
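One way to check this mechanically is to compare the LaTeX commands in the source with those in the translation. This is a minimal sketch: the helper names are hypothetical, and the regular expression only matches control words (a backslash followed by letters), so it ignores control symbols and does not inspect command arguments.

```python
import re
from collections import Counter

# a backslash followed by one or more letters
COMMAND_RE = re.compile(r"\\[A-Za-z]+")


def latex_commands(text):
    """Count the LaTeX control words that appear in text."""
    return Counter(COMMAND_RE.findall(text))


def commands_preserved(source, translation):
    """Return True if both texts contain the same commands the same number of times."""
    return latex_commands(source) == latex_commands(translation)


source = r"\poglavje{Osnove Geometrije} \label{osn9Geom}"
good = r"\poglavje{The basics of Geometry} \label{osn9Geom}"
bad = "Chapter 1: The basics of Geometry"

print(commands_preserved(source, good))  # True
print(commands_preserved(source, bad))   # False
```

A failed check does not say what went wrong, but it flags chunks that deserve a manual look before the result is saved.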

Let's now translate all the chunks. For the full book this takes 2-3 hours, because requests are processed sequentially; the shortened test excerpt used here has only 10 chunks and finishes much sooner.
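If the sequential run is too slow, the requests could be issued concurrently. The sketch below uses a thread pool from the standard library and a stand-in function in place of translate_chunk, so it runs without an API key. The names `translate_all` and `fake_translate` and the worker count are assumptions, and a real run would also have to respect the API's rate limits.

```python
from concurrent.futures import ThreadPoolExecutor


def translate_all(chunks, translate, max_workers=4):
    """Translate chunks concurrently and return the results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of its input iterable
        return list(pool.map(translate, chunks))


def fake_translate(chunk):
    """Stand-in for translate_chunk, used only to make the sketch runnable."""
    return chunk.upper()


print(translate_all(["prvi", "drugi", "tretji"], fake_translate))
# ['PRVI', 'DRUGI', 'TRETJI']
```

Because the results keep the input order, they can be joined with double newlines exactly as in the sequential loop.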

In [81]:
dest_language = "English"

translated_chunks = []
for i, chunk in enumerate(chunks):
    print(f"{i + 1} / {len(chunks)}")
    # translate each chunk
    translated_chunks.append(translate_chunk(chunk, engine='text-davinci-002', dest_language=dest_language))

# join the chunks together
result = '\n\n'.join(translated_chunks)

# save the final result
with open(f"data/geometry_{dest_language}.tex", "w") as f:
    f.write(result)

1 / 10
2 / 10
3 / 10
4 / 10
5 / 10
6 / 10
7 / 10
8 / 10
9 / 10
10 / 10
