<a href="https://colab.research.google.com/gist/c-bata/67d72199474d3b0e3e7f9ecac6c6b11d/cuda-toolkit-colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

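## Translating an English PDF with Python + BERT while preserving its layout
This notebook translates an English PDF with Python + BERT while keeping the original layout.<br>
It runs as a notebook on Google Colab, with Google Drive mounted for storage.<br>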

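> Google Colaboratory (hereafter Google Colab) is a development environment for machine learning education and research that Google provides free of charge. Its interface resembles Jupyter Notebook, and the major Python libraries come preinstalled.<br>
Source: [How to use Google Colab](https://interface.cqpub.co.jp/ail01/)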

## File layout on Google Colab
The project directory is pdf_translate_bert. It is often omitted from paths in the rest of this notebook.
```init
pdf_translate
├── /pdf_translate_bert
│   ├── analyze_pdf_text_bert.py <- analyzes the PDF
│   ├── translate_csv_bert.py <- writes source text and translation to the CSV
│   ├── embed_annots.py <- adds annotations to the PDF
│   ├── pdf_translate_bert.sh <- run script
│   ├── AttentionGAN.pdf <- before translation
│   ├── JA_BERT_AttentionGAN.pdf <- after translation
│   ├── AttentionGAN.pdf.bert.csv <- CSV of source text and translation
│   ├── AttentionGAN.pdf.bert.json  <- JSON with the PDF block information
│   └── (omitted)
└── pdf_translate_bert.ipynb <- notebook used to run everything
```

## Connecting Google Drive
Mount Google Drive in Google Colab and create the working directory.<br>

In [1]:
# Mount Google Drive in Google Colab
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
%%bash
# Move to the parent directory
cd /content/drive/My\ Drive/PDF_transrate
# Create the project directory
mkdir pdf_translate_bert
ls

EasyNMT
EasyNMT.ipynb
PDF_tanslate.ipynb
PDF_tanslate_NMT.ipynb
pdf_translate
pdf_translate_bert


In [5]:
# Move to the working directory
%cd /content/drive/My\ Drive/PDF_transrate/pdf_translate_bert
!ls

/content/drive/My Drive/PDF_transrate/pdf_translate_bert
42_Russia		  AttentionGAN.pdf.bert.csv   JA_BERT_AttentionGAN.pdf
42_Russia_pdfs.csv	  AttentionGAN.pdf.bert.json  pdf_translate_bert.sh
analyze_pdf_text_bert.py  book_pdf.pdf		      translate_csv_bert.py
AttentionGAN.pdf	  embed_annots.py


## Installing the modules
Install the required modules with the commands below.

In [None]:
!pip install -U easynmt
# The fitz module is provided by PyMuPDF; the separate PyPI package named "fitz" is an unrelated project
!pip install PyMuPDF==1.16.14
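
After installing, it can help to confirm which versions are actually present, because the scripts below rely on the API names of the pinned PyMuPDF release. The following is a minimal sketch, assuming Python 3.8 or later for `importlib.metadata`; the helper name `installed_version` is made up for illustration.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# PyMuPDF is the distribution that ships the `fitz` module used by the scripts.
for name in ("PyMuPDF", "easynmt"):
    print(name, installed_version(name))
```

A `None` for PyMuPDF would mean `import fitz` in the scripts is not backed by the pinned release.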

## Running the translation

### Creating the scripts
I also want to keep the translation results as a CSV file, so I create the scripts below.

In [None]:
# Write analyze_pdf_text_bert.py
%%writefile analyze_pdf_text_bert.py

import sys
import fitz
import csv
import json
from tqdm import tqdm


def main(argvs):
    """Extract text blocks from a PDF and write the .bert.csv and .bert.json files."""
    if len(argvs) != 2:
        print("Usage: python %s [PDF path]" % argvs[0])
        sys.exit(1)
    input_path = argvs[1]
    doc = fitz.open(input_path)
    list_string = []  # unique block texts, in order of first appearance
    result = []       # one record per text block, with its position
    for pno, page in enumerate(tqdm(doc)):
        # getText is the camelCase name used by the pinned PyMuPDF 1.16.14
        blocklist = page.getText('blocks')
        for block in blocklist:
            # Join the lines of a block into a single string
            txt = block[4].replace('\n', '').replace('\r', '')
            result.append({
                'page' : pno,
                'x0' : block[0],
                'y0' : block[1],
                'x1' : block[2],
                'y1' : block[3],
                'text' : txt,
                'block_type' : block[5],
                'block_no' : block[6],
            })
            if txt not in list_string:
                list_string.append(txt)

    # CSV of source strings; the second column is a placeholder for the translation
    string_path = '{}.bert.csv'.format(input_path)
    with open(string_path, 'w', encoding='utf8', newline="") as fp:
        writer = csv.writer(fp)
        for s in list_string:
            writer.writerow([s, 'todo'])
    # JSON with the block positions, read later by embed_annots.py
    data = {
        "input_path" : input_path,
        "string_path" : string_path,
        "text_block" : result
    }
    with open('{}.bert.json'.format(input_path), mode='w', encoding='utf8') as fp:
        json.dump(data, fp, sort_keys=True, indent=4, ensure_ascii=False)


if __name__ == '__main__':
    main(sys.argv)

Writing analyze_pdf_text_bert.py
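
The core of the analysis step is turning per-page block tuples into position records plus a de-duplicated list of strings to translate. The sketch below reproduces only that logic on made-up sample tuples, so it runs without PyMuPDF; `collect_blocks` and `sample_pages` are hypothetical names, and only the first five tuple items (x0, y0, x1, y1, text) are assumed.

```python
def collect_blocks(pages):
    """Turn per-page block tuples into records plus a de-duplicated text list.

    `pages` is a list of pages, each a list of tuples whose first five items
    are (x0, y0, x1, y1, text). Made-up sample data stands in for PyMuPDF.
    """
    records = []
    unique_texts = []
    seen = set()
    for page_no, blocks in enumerate(pages):
        for x0, y0, x1, y1, text, *_ in blocks:
            # Same flattening as the script: drop line breaks inside a block
            flat = text.replace('\n', '').replace('\r', '')
            records.append({'page': page_no, 'x0': x0, 'y0': y0,
                            'x1': x1, 'y1': y1, 'text': flat})
            if flat not in seen:
                seen.add(flat)
                unique_texts.append(flat)
    return records, unique_texts


# Hypothetical blocks: "Abstract" appears on both pages
sample_pages = [
    [(72, 72, 300, 90, 'AttentionGAN\n'), (72, 100, 300, 120, 'Abstract\n')],
    [(72, 72, 300, 90, 'Abstract\n')],
]
records, unique_texts = collect_blocks(sample_pages)
print(len(records), unique_texts)
```

Every block keeps its own record, while a repeated string is listed once, so it only needs to be translated once.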


In [None]:
# Write translate_csv_bert.py
%%writefile translate_csv_bert.py

import sys
import pandas as pd
from easynmt import EasyNMT

# Load the model
model = EasyNMT('mbart50_m2m')


def main(argvs):
    """Translate the first CSV column into Japanese and store it in the second."""
    if len(argvs) != 2:
        print("Usage: python %s [CSV path]" % argvs[0])
        sys.exit(1)
    input_path = argvs[1]
    df = pd.read_csv(input_path, names=('org_text', 'convert_text'))

    for i in range(len(df)):
        # Insert a space after every period before translating
        org_text = df.loc[i, 'org_text'].replace('.', '. ')
        # Use .loc to assign, which avoids pandas chained-assignment problems
        df.loc[i, 'convert_text'] = model.translate(org_text, target_lang='ja')

    # Overwrite the CSV in place with the translations filled in
    df.to_csv(input_path, header=False, index=False)


if __name__ == '__main__':
    main(sys.argv)

Overwriting translate_csv_bert.py
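
One thing to watch in the translation loop is that a CSV cell may be empty, in which case pandas yields a non-string value and `.replace` fails. The sketch below shows one way to guard against that. It is not the script above: `translate_cells` and `fake_translate` are hypothetical names, and the model call is replaced by a stand-in function so the example runs without EasyNMT.

```python
def translate_cells(cells, translate_fn):
    """Translate string cells and return '' for empty or non-string cells."""
    results = []
    for cell in cells:
        if not isinstance(cell, str) or not cell.strip():
            results.append('')  # nothing to translate (empty cell or NaN)
            continue
        # Same preprocessing as the script: a space after every period
        results.append(translate_fn(cell.replace('.', '. ')))
    return results


def fake_translate(text):
    """Stand-in for a real model call; it only tags the text."""
    return '[ja] ' + text


cells = ['First.Second', float('nan'), '']
translated = translate_cells(cells, fake_translate)
print(translated)
```

In the real script, `translate_fn` would be a small wrapper around `model.translate(text, target_lang='ja')`.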


In [None]:
%%writefile embed_annots.py
import sys
import fitz
import csv
import json
from tqdm import tqdm


def main(argvs):
    """Main routine: add each translated string to the PDF as a text annotation."""
    if len(argvs) != 3:
        print("Usage: python %s [JSON path] [output path]" % argvs[0])
        sys.exit(1)
    json_path = argvs[1]
    out_path = argvs[2]
    with open(json_path, mode='r', encoding='utf8') as fp:
        json_info = json.load(fp)

    # Build a lookup from source text to translated text
    string_path = json_info['string_path']
    string_info = {}
    # The csv module expects files opened with newline=""
    with open(string_path, 'r', encoding='utf8', newline="") as fp:
        reader = csv.reader(fp)
        for row in reader:
            string_info[row[0].strip()] = row[1]
    doc = fitz.open(json_info['input_path'])

    for text_block in tqdm(json_info["text_block"]):
        txt = text_block['text'].strip()
        if txt in string_info:
            txt = string_info[txt]
            page = doc[text_block['page']]
            # Place the annotation near the bottom-right corner of the block
            page.addTextAnnot(
                (text_block['x1'] - 20, text_block['y1'] - 20),
                txt
            )
        else:
            print("Translation failed:", text_block['text'])

    doc.save(out_path, garbage=1, clean=1, deflate=1)


if __name__ == '__main__':
    main(sys.argv)


Overwriting embed_annots.py
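
The annotation step boils down to two operations: looking up a block's stripped text in the CSV, and placing the note 20 points up and left of the block's bottom-right corner. The sketch below isolates that logic with the standard library only; `load_translations`, `annotation_for`, and the sample CSV text are hypothetical, and no PDF is touched.

```python
import csv
import io


def load_translations(csv_text):
    """Map stripped source text to its translation from two-column CSV text."""
    reader = csv.reader(io.StringIO(csv_text, newline=''))
    return {row[0].strip(): row[1] for row in reader if len(row) >= 2}


def annotation_for(block, translations, offset=20):
    """Return ((x, y), translated_text) for a block, or None when no translation exists."""
    key = block['text'].strip()
    if key not in translations:
        return None
    return (block['x1'] - offset, block['y1'] - offset), translations[key]


# Made-up CSV content in the same two-column shape the scripts use
csv_text = 'Abstract,概要\r\n"Hello, world",こんにちは\r\n'
translations = load_translations(csv_text)
block = {'text': ' Abstract ', 'x1': 300, 'y1': 120}
print(annotation_for(block, translations))
```

A `None` result corresponds to the "translation failed" branch of the script, where the block text has no matching row in the CSV.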


### Creating the shell script
Create the shell script with the command below.

In [8]:
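# Write pdf_translate_bert.sh
%%writefile pdf_translate_bert.sh
#!/bin/sh
# Stop at the first failing step
set -e
# Put the JA_BERT_ prefix on the file name only, so inputs inside subdirectories work
dir=$(dirname "$1")
base=$(basename "$1")
python ./analyze_pdf_text_bert.py "$1"
python ./translate_csv_bert.py "$1.bert.csv"
python ./embed_annots.py "$1.bert.json" "$dir/JA_BERT_$base"

rm "$1.bert.csv" "$1.bert.json"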

Overwriting pdf_translate_bert.sh
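
When the input is a path with directories, as in the batch run later in this notebook, the `JA_BERT_` prefix belongs on the file name rather than on the whole path. Here is a minimal Python sketch of that naming rule; `translated_path` is a hypothetical helper, not part of the scripts.

```python
from pathlib import PurePosixPath


def translated_path(pdf_path, prefix='JA_BERT_'):
    """Put the prefix on the file name only, keeping the directory part intact."""
    path = PurePosixPath(pdf_path)
    return str(path.with_name(prefix + path.name))


print(translated_path('assembly64.pdf'))
print(translated_path('./42_Russia/00_Projects/init.pdf'))
```

Prefixing the whole string instead would produce a path starting with `JA_BERT_./`, which points into a directory that does not exist.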


### Running the translation
Run the translation with the command below.

In [9]:
!ls 

42_Russia		  AttentionGAN.pdf.bert.csv   embed_annots.py
42_Russia_pdfs.csv	  AttentionGAN.pdf.bert.json  JA_BERT_AttentionGAN.pdf
analyze_pdf_text_bert.py  book_pdf.pdf		      JA_BERT_book_pdf.pdf
assembly64.pdf		  book_pdf.pdf.bert.csv       pdf_translate_bert.sh
AttentionGAN.pdf	  book_pdf.pdf.bert.json      translate_csv_bert.py


In [10]:
!sh pdf_translate_bert.sh assembly64.pdf

100% 357/357 [00:00<00:00, 621.91it/s]
To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor'). (Triggered internally at  /pytorch/aten/src/ATen/native/BinaryOps.cpp:467.)
  return torch.floor_divide(self, other)
 28% 1900/6707 [00:00<00:00, 18994.99it/s]Translation failed: (gdb) x/5cb &strNum0x600104: 49 '1' 52 '4' 57 '9' 56 '8' 0 ' '
100% 6707/6707 [00:00<00:00, 19519.79it/s]


In [None]:
# 42_Russia_translate.py
import subprocess
from glob import glob
import pandas as pd

ERROR_list = []
pdf_files = glob("./42_Russia/00_Projects/*/*/*.pdf")

df = pd.DataFrame(pdf_files, columns=['path'])
df.to_csv('42_Russia_pdfs.csv', index=False)

In [None]:
df = pd.read_csv('42_Russia_pdfs.csv')
pdf_files =df['path'].to_list()
pdf_files

In [None]:
# 42_Russia_translate.py
import subprocess
import pandas as pd

ERROR_list = []
# pdf_files = glob("./42_Russia/00_Projects/*/*/*.pdf")
df = pd.read_csv('42_Russia_pdfs.csv')
pdf_files = df['path'].to_list()

# pdf_file = '../minishell_(new).pdf'

for pdf_file in pdf_files:
    try:
        # Run the command; check=True raises when the script exits with an error
        subprocess.run(["sh", "pdf_translate_bert.sh", pdf_file], check=True)
        print("END-", pdf_file)
    except subprocess.CalledProcessError:
        ERROR_list.append(pdf_file)
        print("ERROR-", pdf_file)

END- ./42_Russia/00_Projects/05_Infrastructure_and_Admin/00____init/init - scale for the project.pdf
END- ./42_Russia/00_Projects/05_Infrastructure_and_Admin/00____init/init.pdf
END- ./42_Russia/00_Projects/05_Infrastructure_and_Admin/ft_services/ft_services_Oct_08_20.pdf
END- ./42_Russia/00_Projects/05_Infrastructure_and_Admin/04____roger-skyline-2/roger-skyline-2.pdf
END- ./42_Russia/00_Projects/05_Infrastructure_and_Admin/03____roger-skyline-1/roger-skyline-1.pdf
END- ./42_Russia/00_Projects/05_Infrastructure_and_Admin/ft_server/ft_server_03_17_2020_19:46.pdf
END- ./42_Russia/00_Projects/05_Infrastructure_and_Admin/02____cloud-1/cloud-1_(new).pdf
END- ./42_Russia/00_Projects/05_Infrastructure_and_Admin/02____cloud-1/cloud-1_(old).pdf
END- ./42_Russia/00_Projects/05_Infrastructure_and_Admin/01____docker-1/docker-1.pdf
END- ./42_Russia/00_Projects/09_OCaml/01____ft_turing/ft_turing.pdf
END- ./42_Russia/00_Projects/09_OCaml/00____piscine_ocaml/d08.en.pdf
END- ./42_Russia/00_Projects/09
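
Note that `subprocess.run` raises an exception for a failing command only when `check=True` is passed; without it, a failed run still falls through to the success branch. The sketch below demonstrates this with two trivial commands run through the current Python interpreter; `run_all` is a hypothetical helper and the commands stand in for the shell script.

```python
import subprocess
import sys


def run_all(commands):
    """Run each command and return the ones that exited with a non-zero status."""
    failed = []
    for command in commands:
        try:
            subprocess.run(command, check=True)
        except (subprocess.CalledProcessError, OSError):
            failed.append(command)
    return failed


# Stand-in commands: one succeeds, one exits with status 3
ok = [sys.executable, '-c', 'pass']
bad = [sys.executable, '-c', 'import sys; sys.exit(3)']
failed = run_all([ok, bad])
print(len(failed))
```

Applied to the batch loop, the returned list plays the role of `ERROR_list` and can be retried or inspected afterwards.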

In [None]:
len(pdf_files)

232

END- ./42_Russia/00_Projects/07_CPP/06____nibbler/nibbler.pdf

## Summary
Note: view the translated PDF in Adobe Acrobat Reader. When it is opened in a browser, the annotation text appears garbled.

## References
[mima3/pdf_translate](https://github.com/mima3/pdf_translate)

[UKPLab/EasyNMT](https://github.com/UKPLab/EasyNMT)

[Translating a PDF with Python while preserving its layout](https://qiita.com/mima_ita/items/3f698050196d4af3a46d)

[yta-git/pdf_translate](https://github.com/yta-git/pdf_translate)

[[python] Working around googletrans "AttributeError: 'NoneType' object has no attribute 'group'"](https://qiita.com/_yushuu/items/83c51e29771530646659)



## Errors encountered
