## Automatically Translating an English Paper PDF with Python

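***Techniques covered:***
1. Reading PDF text with Python
2. pandas: reading CSV, merging multiple datasets, and writing Excel
3. English word tokenization with Python regular expressions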

### 1. Read the PDF text content

In [1]:
!pip install -i https://pypi.tuna.tsinghua.edu.cn/simple pdfplumber

Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple


In [48]:
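import pdfplumber

def read_pdf(pdf_fpath):
    """Return the text of every page in the PDF, joined by spaces."""
    page_conts = []
    # The context manager closes the file even if extraction raises
    with pdfplumber.open(pdf_fpath) as pdf:
        for page in pdf.pages:
            # extract_text() may return None for a page without text
            page_conts.append(page.extract_text() or "")
    return " ".join(page_conts)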

In [49]:
pdf_fpath = "D:/tmp/Wide & Deep Learning for Recommender Systems.pdf"
pdf_cont = read_pdf(pdf_fpath)

In [50]:
print(pdf_cont[:2000])

Wide & Deep Learning for Recommender Systems
Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra,
Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, Rohan Anil,
Zakaria Haque, Lichan Hong, Vihan Jain, Xiaobing Liu, Hemal Shah
∗
GoogleInc.
ABSTRACT
have never or rarely occurred in the past. Recommenda-
6
tions based on memorization are usually more topical and
1 Generalized linear models with nonlinear feature transfor-
directly relevant to the items on which users have already
0 mations are widely used for large-scale regression and clas-
performed actions. Compared with memorization, general-
2 siﬁcationproblemswithsparseinputs. Memorizationoffea-
ization tends to improve the diversity of the recommended
  tureinteractionsthroughawide setofcross-productfeature
n items. Inthispaper,wefocusontheappsrecommendation
transformationsareeﬀectiveandinterpretable,whilegener-
u problemfortheGooglePlaystore,buttheapproachshould
alizationrequiresmorefeaturee
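
The extracted text above breaks words at line ends (for example `Recommenda-` with `tions` on a later line), and the tokenizer in section 3 splits on both `-` and newlines, so such fragments will not match dictionary entries. Below is a minimal sketch of rejoining them before tokenizing. `join_hyphenated` is a hypothetical helper, not part of pdfplumber. It assumes the two halves sit on adjacent lines, which the interleaved two-column output above does not always guarantee, and it will also merge genuine hyphenated compounds that happen to wrap across lines.

```python
import re

def join_hyphenated(text):
    """Rejoin words split by a hyphen at a line end, e.g. 'general-\\nization'."""
    # Match: hyphen preceded by a letter, then a newline, optional
    # whitespace, and a lowercase letter starting the next line.
    return re.sub(r"(?<=[A-Za-z])-\n\s*(?=[a-z])", "", text)

sample = "Compared with memorization, general-\nization tends to improve diversity."
print(join_hyphenated(sample))
```

It would be applied once, as `pdf_cont = join_hyphenated(pdf_cont)`, before the `re.split` call.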

### 2. Load the English-Chinese dictionary file

The dictionary file comes from: https://github.com/skywind3000/ECDICT

Steps:
1. Download the repository archive: https://github.com/skywind3000/ECDICT/archive/master.zip
2. Unzip master.zip, then extract the stardict.csv file inside it

In [8]:
import pandas as pd

In [9]:
# Note: replace the stardict.csv path with the location of your own copy
df_dict = pd.read_csv("D:/tmp/ECDICT-master/stardict.csv")



In [10]:
df_dict.shape

(3402564, 13)

In [11]:
df_dict.sample(10).head()

| | word | phonetic | definition | translation | pos | collins | oxford | tag | bnc | frq | exchange | detail | audio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 801655 | design height | | | 设计高度 | | | | | | | | | |
| 2739800 | shibu | | | [网络] 方回春堂；喊吧 | | | | | | | | | |
| 1232187 | genus Testudo | | | [网络] Testudo属 | | | | | 0.0 | 0.0 | s:genus testudoes | | |
| 2403094 | profit-and-loss statements | | | [会计] 损益表 | | | | | | | 0:profit-and-loss statement/1:s | | |
| 1197174 | gain limited sensitivity | | | 极限增益灵敏度 | | | | | | | | | |


In [12]:
# Keep only the word and translation columns; drop everything else
df_dict = df_dict[["word", "translation"]]
df_dict.head()

| | word | translation |
|---|---|---|
| 0 | 'a | na. 一\nn. 英文字母表的第一字母；【乐】A音\nart. 冠以不定冠词主要表示类别\... |
| 1 | 'A' game | [网络] 游戏；一个游戏；一局 |
| 2 | 'Abbāsīyah | [地名] 阿巴西耶 ( 埃 ) |
| 3 | 'Abd al Kūrī | [地名] 阿卜杜勒库里岛 ( 也门 ) |
| 4 | 'Abd al Mājid | [地名] 阿卜杜勒马吉德 ( 苏丹 ) |
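
Since only two columns are kept, the selection can also happen at read time. Below is a minimal sketch using the `usecols` parameter of `pd.read_csv`. The in-memory CSV is made-up stand-in data, not the real `stardict.csv`.

```python
import io
import pandas as pd

# Made-up stand-in for stardict.csv, with only a few of its columns.
csv_text = (
    "word,phonetic,definition,translation,pos\n"
    "wide,,,a. 宽的,\n"
    "deep,,,a. 深的,\n"
)

# usecols parses only the named columns, which saves memory on a large file.
df_small = pd.read_csv(io.StringIO(csv_text), usecols=["word", "translation"])
print(df_small.columns.tolist())
```

With the real file, the same argument would be passed alongside the path used in the cell above.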


### 3. English tokenization and data cleaning

In [13]:
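# Tokenize: split on spaces, newlines, and common punctuation
import re
# Raw string so that \( \- \$ etc. reach the regex engine without invalid-escape warnings
word_list = re.split(r"""[ ,.\(\)/\n|\-:=\$\["']""", pdf_cont)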
word_list[:10]

['Wide',
 '&',
 'Deep',
 'Learning',
 'for',
 'Recommender',
 'Systems',
 'Heng',
 'Tze',
 'Cheng']
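
Splitting on a fixed list of separators leaves tokens such as `&` in the result, along with any punctuation not in the list. An alternative sketch extracts runs of letters instead. This is a different technique from the `re.split` call above: it discards digits and apostrophes entirely, and because it is ASCII-only, a ligature character such as `ﬁ` also acts as a separator.

```python
import re

def tokenize_letters(text):
    """Return every maximal run of ASCII letters in text."""
    return re.findall(r"[A-Za-z]+", text)

tokens = tokenize_letters("Wide & Deep Learning for Recommender Systems\nHeng-Tze Cheng")
print(tokens)
```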

In [14]:
# Data cleaning
word_list_clean = []
for word in word_list:
    word = word.lower().strip()
    # Skip empty tokens, pure numbers, and single-character tokens
    if not word or word.isnumeric() or len(word) <= 1:
        continue
    word_list_clean.append(word)
word_list_clean[:20]

['wide',
 'deep',
 'learning',
 'for',
 'recommender',
 'systems',
 'heng',
 'tze',
 'cheng',
 'levent',
 'koc',
 'jeremiah',
 'harmsen',
 'tal',
 'shaked',
 'tushar',
 'chandra',
 'hrishi',
 'aradhye',
 'glen']
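
The cleaned list still contains function words such as `for`. If those are unwanted in the final vocabulary table, they can be filtered against a stopword set. Below is a minimal sketch in which `STOPWORDS` is a small hypothetical set chosen for illustration, not a standard list.

```python
# Hypothetical stopword set for illustration only; extend as needed.
STOPWORDS = {"the", "and", "of", "in", "is", "for", "to", "a", "an"}

def remove_stopwords(words):
    """Return the words that are not in STOPWORDS, preserving order."""
    return [w for w in words if w not in STOPWORDS]

filtered = remove_stopwords(["wide", "deep", "learning", "for", "recommender", "systems"])
print(filtered)
```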

### 4. Build a DataFrame from the tokens

In [15]:
df_words = pd.DataFrame({
    "word": word_list_clean
})
df_words.head()

| | word |
|---|---|
| 0 | wide |
| 1 | deep |
| 2 | learning |
| 3 | for |
| 4 | recommender |


In [16]:
df_words.shape

(2322, 1)

In [17]:
# Count how often each word occurs
df_words = (
    df_words
    .groupby("word")["word"]
    .agg(count="size")
    .reset_index()
    .sort_values(by="count", ascending=False)
)
df_words.head(10)

| | word | count |
|---|---|---|
| 804 | the | 128 |
| 57 | and | 67 |
| 546 | of | 46 |
| 503 | model | 41 |
| 939 | wide | 36 |
| 374 | in | 36 |
| 203 | deep | 35 |
| 405 | is | 31 |
| 286 | for | 30 |
| 845 | to | 29 |
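
The same kind of frequency table can also be produced with `Series.value_counts`. Below is a sketch on a toy frame that assumes a `word` column like the one in `df_words`.

```python
import pandas as pd

toy_words = pd.DataFrame({"word": ["the", "model", "the", "wide", "model", "the"]})

toy_counts = (
    toy_words["word"]
    .value_counts()             # occurrences per distinct word, sorted descending
    .rename_axis("word")        # name the index so reset_index yields a "word" column
    .reset_index(name="count")  # turn the counts into a "count" column
)
print(toy_counts)
```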


### 5. Merge with the dictionary

In [21]:
df_merge = pd.merge(
    left = df_dict,
    right = df_words,
    left_on = "word",
    right_on = "word"
)

In [32]:
df_merge.sample(10)

| | word | translation | count |
|---|---|---|---|
| 1 | account | n. 报告, 解释, 估价, 理由, 利润, 算账, 帐目\nvi. 报帐, 解释, 导致,... | 1 |
| 380 | prediction | n. 预言, 预报\n[化] 预测 | 2 |
| 185 | generalization | n. 一般化, 普遍化, 概括\n[化] 推广; 普适化 | 4 |
| 56 | burget | [人名] 伯吉特 | 1 |
| 372 | pipeline | n. 管道, 传递途径\n[化] 管路; 管线 | 1 |
| 237 | include | vt. 包括, 把...算入, 包住\n[计] DOS内部命令:在CONFIG.SYS文件的... | 2 |
| 524 | threads | n. 线；相关串连；线程（thread的复数） | 2 |
| 208 | heng | n. 恒; 珩 | 1 |
| 62 | capacity | n. 容量, 能力, 才能, 资格\n[计] 容量 | 1 |
| 228 | important | a. 重要的, 有地位的, 大量的, 显要的, 自负的\n[计] 要点 | 2 |


In [33]:
df_merge.shape

(607, 3)
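
`pd.merge` defaults to an inner join, so words with no dictionary entry are dropped from `df_merge` without notice. Below is a sketch of listing the unmatched words with a left join and `indicator=True`. The toy frames stand in for `df_words` and `df_dict`, and their contents are made up.

```python
import pandas as pd

# Toy stand-ins for df_words and df_dict.
toy_words = pd.DataFrame({"word": ["wide", "deep", "tze"], "count": [36, 35, 1]})
toy_dict = pd.DataFrame({"word": ["wide", "deep"], "translation": ["a. 宽的", "a. 深的"]})

# A left join keeps every word; the _merge column marks rows found only on the left.
checked = pd.merge(toy_words, toy_dict, on="word", how="left", indicator=True)
missing = checked.loc[checked["_merge"] == "left_only", "word"].tolist()
print(missing)
```

Inspecting the unmatched list is a quick way to spot tokens damaged during PDF extraction.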

### 6. Save to Excel

In [34]:
df_merge.to_excel("./39. pdf_chinese_english.xlsx", index=False)
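
Because the dictionary is the left frame of the merge, `df_merge` does not keep the descending frequency order from section 4. If the spreadsheet should list the most frequent words first, the frame can be sorted just before export. Below is a minimal sketch in which `sort_by_frequency` and the output file name are hypothetical.

```python
import pandas as pd

def sort_by_frequency(df):
    """Most frequent words first; ties are broken alphabetically."""
    return (
        df.sort_values(by=["count", "word"], ascending=[False, True])
        .reset_index(drop=True)
    )

toy = pd.DataFrame({
    "word": ["account", "generalization", "prediction"],
    "translation": ["n. 报告", "n. 概括", "n. 预测"],
    "count": [1, 4, 2],
})
toy_sorted = sort_by_frequency(toy)
print(toy_sorted["word"].tolist())
# toy_sorted.to_excel("toy_output.xlsx", index=False)  # hypothetical file name
```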