<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Load-and-clean-the-data" data-toc-modified-id="Load-and-clean-the-data-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Load and clean the data</a></span></li><li><span><a href="#Run-the-algorithm" data-toc-modified-id="Run-the-algorithm-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Run the algorithm</a></span><ul class="toc-item"><li><span><a href="#Find-the-lines-with-'immortal'" data-toc-modified-id="Find-the-lines-with-'immortal'-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Find the lines with 'immortal'</a></span></li><li><span><a href="#Run-the-matches" data-toc-modified-id="Run-the-matches-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Run the matches</a></span><ul class="toc-item"><li><span><a href="#75%" data-toc-modified-id="75%-2.2.1"><span class="toc-item-num">2.2.1&nbsp;&nbsp;</span>75%</a></span></li><li><span><a href="#60%" data-toc-modified-id="60%-2.2.2"><span class="toc-item-num">2.2.2&nbsp;&nbsp;</span>60%</a></span></li><li><span><a href="#70%" data-toc-modified-id="70%-2.2.3"><span class="toc-item-num">2.2.3&nbsp;&nbsp;</span>70%</a></span></li><li><span><a href="#65%" data-toc-modified-id="65%-2.2.4"><span class="toc-item-num">2.2.4&nbsp;&nbsp;</span>65%</a></span></li></ul></li></ul></li><li><span><a href="#Find-the-lines-where-'immortal'-is-substituted" data-toc-modified-id="Find-the-lines-where-'immortal'-is-substituted-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Find the lines where 'immortal' is substituted</a></span><ul class="toc-item"><li><span><a href="#75%-similarity" data-toc-modified-id="75%-similarity-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>75% similarity</a></span></li><li><span><a href="#70%" data-toc-modified-id="70%-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>70%</a></span></li><li><span><a href="#65%" data-toc-modified-id="65%-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>65%</a></span></li><li><span><a 
href="#60%-similarity" data-toc-modified-id="60%-similarity-3.4"><span class="toc-item-num">3.4&nbsp;&nbsp;</span>60% similarity</a></span></li></ul></li></ul></div>

In [3]:
import pandas as pd
import re
from fuzzywuzzy import fuzz
import string
from pandas import json_normalize
import json
from tqdm.notebook import tqdm 
from joblib import Parallel, delayed

# Load and clean the data

In [33]:
with open('../data/QTS_JSON_CTEXT_clean_punc_no_comm_no_addnames.json', encoding='utf-8') as f:
    QTS = json_normalize(json.load(f))
QTS['sequence'] = QTS['sequence'].astype('int')
print (f"Number of poems: {len(QTS)}")

Number of poems: 42864


In [34]:
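def clean(content, ids):
    """Split a poem into lines and pair each line with the poem id."""
    poem = ' '.join(content)
    # Split on the full-width comma and full stop that end each line.
    lines = re.split("[，。]", poem)
    # Drop the empty strings left by the split and remove the joining spaces.
    clean_lines = [[ids, re.sub(' ', '', line)] for line in lines if line != '']
    return clean_lines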
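
To see what `clean` returns, here is a minimal sketch on a made-up two-line "poem". The input list and the id `'0_0'` are hypothetical; the real `content` field is only assumed to be a list of strings shaped like this one.

```python
import re

def clean(content, ids):
    poem = ' '.join(content)
    lines = re.split("[，。]", poem)
    return [[ids, re.sub(' ', '', line)] for line in lines if line != '']

# Hypothetical input: two strings standing in for one poem's `content` list.
toy_content = ['仙人何處在，', '美人何處在。']
toy_lines = clean(content=toy_content, ids='0_0')
print(toy_lines)
# [['0_0', '仙人何處在'], ['0_0', '美人何處在']]
```

Each element is an `[id, line]` pair, which is the shape every later step relies on (`line[1]` is the text, `line[0]` the poem id).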

In [35]:
data = QTS.copy()

In [36]:
data['lines'] = data.apply(
    lambda row: clean(content=row['content'], ids=f"{row['volume']}_{row['sequence']}"), axis=1)
all_lines_list = data['lines'].tolist()

In [37]:
all_lines = [item for sublist in all_lines_list for item in sublist]

# Run the algorithm

## Find the lines with 'immortal'

First, select every line that contains the character 仙 ('immortal'), keeping the poem id alongside each line.

In [50]:
new_lines = all_lines.copy()
source_lines = [line for line in all_lines if '仙' in line[1]]

In [41]:
source_lines[1]

['1_18', '仙氣凝三嶺']

In [49]:
print (f"There are {len(source_lines)} lines that contain the word 'immortal'.")
print (f"In {len([i for i in source_lines if i[1].count('仙') > 1])} cases the word appears more than once in a line. \
So the count of '仙' is larger than the count of the lines found.")

There are 3475 lines that contain the word 'immortal'.
In 25 cases the word appears more than once in a line. So the count of '仙' is larger than the count of the lines found.


## Run the matches

In [119]:
def find_matches(source_line, al, similarity):
    """Return every [id, line] pair in `al` whose text scores above `similarity` against `source_line`.

    The source line itself is in `al`, so it is always part of the result.
    """
    matches = []
    for target_line in al:
        if fuzz.ratio(source_line[1], target_line[1]) > similarity:
            matches.append(target_line)
    return matches
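
To get a feel for the thresholds used below, the following sketch scores a few line pairs taken from the results in this notebook. It uses `difflib.SequenceMatcher` from the standard library as a stand-in for `fuzz.ratio`. As far as I understand, `fuzz.ratio` reports the same kind of 0 to 100 score, but the exact values may differ slightly depending on the installed backend, so treat `ratio_sketch` (a hypothetical helper) as an approximation rather than the library's implementation.

```python
from difflib import SequenceMatcher

def ratio_sketch(a, b):
    # 100 * 2M / T, where M is the number of matched characters and
    # T is the combined length of both strings.
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))

pairs = [
    ('叢簧發仙弄', '叢簧發天弄'),          # one character replaced in a 5-character line
    ('東風吹雪舞仙家', '東風吹雪舞山家'),  # one character replaced in a 7-character line
    ('仙人不可見', '人不見'),              # shorter line embedded in a longer one
]
scores = [ratio_sketch(a, b) for a, b in pairs]
print(scores)
# [80, 86, 75]
```

Under this measure a single substitution in a five-character line scores 80 and clears the 75 cut-off. The last pair scores exactly 75, so with the strict `>` in `find_matches` it only counts as a match once the threshold drops below 75.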

### 75%

In [53]:
all_matches = Parallel(n_jobs=4)(delayed(find_matches)(line, new_lines[:], 75) for line in tqdm(source_lines[:]))





Clean and record results

In [55]:
with open('immortals_id_matches.txt', 'w', encoding='utf-8') as out:
    for i in all_matches:
        out.write(f"{i}\n")

In [57]:
len(all_matches)

3475

Remove lines that didn't match any others (only the source line is recorded).

In [59]:
matched_lines = [i for i in all_matches if len(i)>1]
print (f"There are {len(matched_lines)} lines that matched something else.")

There are 352 lines that matched something else.


Remove duplicates: if line A is matched to line B and line B is matched to line A, the second instance is a duplicate and is removed.

In [60]:
no_duplicates = []
for i in tqdm(matched_lines):
    line = sorted(i)
    if line in no_duplicates:
        continue
    else:
        no_duplicates.append(line)





In [64]:
print (f"There are {len(no_duplicates)} lines that uniquely match to others.")
print (f"Out of those only {len([i for i in no_duplicates if len(i) >2])} \
have a line that has similar ones across more than two poems.")

There are 190 lines that uniquely match to others.
Out of those only 22 have a line that has similar ones across more than two poems.
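
The duplicate check above compares each sorted group against a growing list, which gets slower as the list grows. A possible alternative, sketched here under the assumption that every group is a list of `[id, line]` pairs, turns each sorted group into a tuple and tracks it in a set. The function name `drop_duplicate_groups` and the poem ids in the toy data are hypothetical.

```python
def drop_duplicate_groups(groups):
    """Keep the first occurrence of each group, ignoring the order of its members."""
    seen = set()
    unique = []
    for group in groups:
        # Lists are not hashable, so build a sorted tuple of tuples as the set key.
        key = tuple(sorted(tuple(item) for item in group))
        if key not in seen:
            seen.add(key)
            unique.append(sorted(group))
    return unique

# Toy data: the first two groups hold the same pair in a different order.
toy_groups = [
    [['1_1', '叢簧發仙弄'], ['2_2', '叢簧發天弄']],
    [['2_2', '叢簧發天弄'], ['1_1', '叢簧發仙弄']],
    [['3_3', '仙人何處在'], ['4_4', '美人何處在']],
]
toy_unique = drop_duplicate_groups(toy_groups)
print(len(toy_unique))
# 2
```

The result keeps the first-seen order and the sorted form of each group, so it should be interchangeable with `no_duplicates` in the cells that follow.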


In [66]:
b = sorted(no_duplicates, reverse=True, key=len)
with open('immortals_unique_matches_with_id.txt', 'w', encoding='utf-8') as out2:
    for i in b:
        out2.write(f"{i}\n")


### 60%

In [68]:
all_matches2 = Parallel(n_jobs=4)(delayed(find_matches)(line, new_lines[:],60) for line in tqdm(source_lines[:]))





Clean and record

In [69]:
with open('immortals_id_matches60.txt', 'w', encoding='utf-8') as out:
    for i in all_matches2:
        out.write(f"{i}\n")

In [70]:
len(all_matches2)

3475

In [76]:
#Remove those that didn't match and duplicates

matched_lines2 = [i for i in all_matches2 if len(i)>1]
print (f"There are {len(matched_lines2)} lines that are more than 60% similar to something else.\n\
This is {len(matched_lines2) - len(matched_lines)} more than for 75% similarity.")

There are 611 lines that are more than 60% similar to something else.
This is 259 more than for 75% similarity.


In [83]:
no_duplicates2 = []
for i in tqdm(matched_lines2):
    line = sorted(i)
    if line in no_duplicates2:
        continue
    else:
        no_duplicates2.append(line)





In [86]:
print (f"There are {len(no_duplicates2)} lines that uniquely match to others.")
print (f"Out of those {len([i for i in no_duplicates2 if len(i) >2])} \
have a line that has similar ones across more than two poems.")

There are 414 lines that uniquely match to others.
Out of those 119 have a line that has similar ones across more than two poems.


In [87]:
b = sorted(no_duplicates2, reverse=True, key=len)
with open('immortals_unique_matches_with_id60.txt', 'w', encoding='utf-8') as out2:
    for i in b:
        out2.write(f"{i}\n")

### 70%

In [120]:
all_matches70 = Parallel(n_jobs=4)(delayed(find_matches)(line, new_lines[:],70) for line in tqdm(source_lines[:]))







In [121]:
with open('immortals_id_matches70.txt', 'w', encoding='utf-8') as out:
    for i in all_matches70:
        out.write(f"{i}\n")
len(all_matches70)

3475

In [139]:
#Remove those that didn't match and duplicates

matched_lines70 = [i for i in all_matches70 if len(i)>1]
print (f"There are {len(matched_lines70)} lines that are more than 70% similar to something else.\n\
This is {len(matched_lines70) - len(matched_lines)} more than for 75% similarity.")

There are 396 lines that are more than 70% similar to something else.
This is 44 more than for 75% similarity.


In [123]:
no_duplicates70 = []
for i in tqdm(matched_lines70):
    line = sorted(i)
    if line in no_duplicates70:
        continue
    else:
        no_duplicates70.append(line)





In [124]:
print (f"There are {len(no_duplicates70)} lines that uniquely match to others.")
print (f"Out of those {len([i for i in no_duplicates70 if len(i) >2])} \
have a line that has similar ones across more than two poems.")

There are 224 lines that uniquely match to others.
Out of those 40 have a line that has similar ones across more than two poems.


In [125]:
b = sorted(no_duplicates70, reverse=True, key=len)
with open('immortals_unique_matches_with_id70.txt', 'w', encoding='utf-8') as out2:
    for i in b:
        out2.write(f"{i}\n")

### 65%

In [126]:
all_matches65 = Parallel(n_jobs=4)(delayed(find_matches)(line, new_lines[:],65) for line in tqdm(source_lines[:]))

with open('immortals_id_matches65.txt', 'w', encoding='utf-8') as out:
    for i in all_matches65:
        out.write(f"{i}\n")
len(all_matches65)


In [140]:

#Remove those that didn't match and duplicates

matched_lines65 = [i for i in all_matches65 if len(i)>1]
print (f"There are {len(matched_lines65)} lines that are more than 65% similar to something else.\n\
This is {len(matched_lines65) - len(matched_lines)} more than for 75% similarity.")

no_duplicates65 = []
for i in tqdm(matched_lines65):
    line = sorted(i)
    if line in no_duplicates65:
        continue
    else:
        no_duplicates65.append(line)

print (f"There are {len(no_duplicates65)} lines that uniquely match to others.")
print (f"Out of those {len([i for i in no_duplicates65 if len(i) >2])} \
have a line that has similar ones across more than two poems.")

b = sorted(no_duplicates65, reverse=True, key=len)
with open('immortals_unique_matches_with_id65.txt', 'w', encoding='utf-8') as out2:
    for i in b:
        out2.write(f"{i}\n")

There are 603 lines that are more than 65% similar to something else.
This is 251 more than for 75% similarity.


There are 405 lines that uniquely match to others.
Out of those 116 have a line that has similar ones across more than two poems.
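
The four runs above repeat the full comparison once per threshold. Because every match at a higher threshold is also a match at a lower one, a possible shortcut is to run the comparison once at the lowest threshold, keep the score with each match, and filter afterwards. The sketch below assumes this and uses `difflib` as a stand-in scorer, so it runs without `fuzzywuzzy`. The names `find_scored_matches`, `at_threshold`, and `score`, and the toy ids, are hypothetical; passing `scorer=fuzz.ratio` should reproduce the notebook's scoring.

```python
from difflib import SequenceMatcher

def score(a, b):
    # Stand-in scorer on a 0-100 scale; swap in fuzz.ratio for the real runs.
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))

def find_scored_matches(source_line, all_lines, min_similarity, scorer=score):
    """Like find_matches, but keep the score next to each matched [id, line] pair."""
    scored = []
    for target_line in all_lines:
        s = scorer(source_line[1], target_line[1])
        if s > min_similarity:
            scored.append((s, target_line))
    return scored

def at_threshold(scored_matches, similarity):
    """Filter stored matches down to those scoring above `similarity`."""
    return [target for s, target in scored_matches if s > similarity]

# Toy corpus with hypothetical ids.
toy_all = [['0_1', '仙人不可見'], ['0_2', '故人不可見'],
           ['0_3', '人不見'], ['0_4', '叢簧發天弄']]
toy_scored = find_scored_matches(toy_all[0], toy_all, 60)
print(at_threshold(toy_scored, 75))
print(at_threshold(toy_scored, 70))
```

Only the matches above the lowest threshold are stored, so memory use stays close to that of the 60% run.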


# Find the lines where 'immortal' is substituted

## 75% similarity

In [134]:
xian_substituted75 = []
for group in tqdm(no_duplicates):
    if len(group) > len([i for i in group if '仙' in i[1]]):
        xian_substituted75.append(group)





In [135]:
print (f"There are {len(xian_substituted75)} cases in 75% similar lines where xian is substituted.")

There are 19 cases in 75% similar lines where xian is substituted.


In [136]:
for i in sorted(xian_substituted75, key=len, reverse=True):
    group = [n[1] for n in i]
    print(group, '\n')

['故人不可見', '故人不可見', '故人不可見', '故人不可見', '仙人不見我', '此人不可見', '古人不可見', '上仙不可見', '故人不可見', '故人不可見', '遊人不可見', '故人不可見', '仙人不可見', '仙舟不可見', '神仙不可見'] 

['暗從何處來', '問從何處來', '子從何處來', '又從何處來', '三從何處來', '仙從何處來'] 

['通籍在金閨', '通籍在金閨', '已通仙籍在金閨'] 

['叢簧發仙弄', '叢簧發天弄'] 

['仙人何處在', '美人何處在'] 

['雲歸帝鄉遠', '雲歸仙帝鄉'] 

['玉醴浮金菊', '玉醴浮仙菊'] 

['東風吹雪舞山家', '東風吹雪舞仙家'] 

['五雲抱仙殿', '五雲抱山殿'] 

['我思仙人乃在碧海之東隅', '乃在碧海之東隅'] 

['雞犬逐人靜', '仙犬逐人靜'] 

['郎官能賦許依投', '仙郎能賦許依投'] 

['眾香天上梵仙宮', '眾香天上梵王宮'] 

['仙人居其中', '人乃居其中'] 

['門前便是嵩山路', '門前便是仙山路'] 

['曾向此中游', '神仙曾向此中游'] 

['嘗聞古老言', '嘗聞仙老言'] 

['仿佛列山群', '仿佛列仙群'] 

['看雲忽見仙', '看雲忽見山'] 
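
The groups above show where 仙 was swapped, but the replacing character still has to be read off by eye. As a rough sketch, the hypothetical helper below pairs lines of equal length that differ in exactly one position and reports the character standing in for 仙. It takes plain line strings (as printed above), not `[id, line]` pairs, and it deliberately ignores insertions and deletions.

```python
def xian_substitutes(group, char='仙'):
    """Return (line_with_char, other_line, substitute) for one-character swaps of `char`."""
    subs = []
    for a in group:
        if char not in a:
            continue
        for b in group:
            if len(a) != len(b) or a == b:
                continue
            # Positions where the two lines disagree.
            diff = [(x, y) for x, y in zip(a, b) if x != y]
            if len(diff) == 1 and diff[0][0] == char:
                subs.append((a, b, diff[0][1]))
    return subs

swap = xian_substitutes(['叢簧發仙弄', '叢簧發天弄'])
print(swap)
# [('叢簧發仙弄', '叢簧發天弄', '天')]

# A reordering is not a one-character swap, so nothing is reported.
no_swap = xian_substitutes(['雲歸帝鄉遠', '雲歸仙帝鄉'])
print(no_swap)
# []
```

Counting the third element across all groups would give a frequency table of the characters that alternate with 仙, for example 天 and 山 in the groups printed above.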



## 70%

In [133]:
xian_substituted70 = []
for group in tqdm(no_duplicates70):
    if len(group) > len([i for i in group if '仙' in i[1]]):
        xian_substituted70.append(group)

print (f"There are {len(xian_substituted70)} cases in 70% similar lines where xian is substituted.\n")

for i in sorted(xian_substituted70, key=len, reverse=True):
    group = [n[1] for n in i]
    print(group, '\n')



There are 40 cases in 70% similar lines where xian is substituted.

['故人不可見', '故人不可見', '故人不可見', '故人不可見', '仙人不見我', '此人不可見', '古人不可見', '上仙不可見', '故人不可見', '故人不可見', '遊人不可見', '故人不可見', '仙人不可見', '仙舟不可見', '人不見', '神仙不可見'] 

['歌宛轉', '歌宛轉', '歌宛轉', '歌宛轉', '歌宛轉', '歌宛轉', '仙歌宛轉聽', '仙歌宛轉聽', '歌宛轉', '歌宛轉', '歌宛轉', '歌宛轉'] 

['不知精爽歸何處', '不知仙駕歸何處', '不知此地歸何處', '不知雲雨歸何處', '不知雲雨歸何處', '不知功滿歸何處', '不知何處'] 

['可中得似紅兒貌', '得似紅兒今日貌', '神仙得似紅兒貌', '稍教得似紅兒貌', '若教得似紅兒貌', '若教得似紅兒貌', '阿嬌得似紅兒貌'] 

['暗從何處來', '問從何處來', '子從何處來', '又從何處來', '三從何處來', '仙從何處來'] 

['不可得', '神仙不可求', '神仙不可得', '神仙不可學', '神仙不可見'] 

['壺中別有日月天', '應是壺中別有家', '壺中別有仙家日', '應是壺中別有家'] 

['卻向人間作酒徒', '且向人間作酒仙', '罰向人間作酒狂', '謫向人間作酒狂'] 

['仙雲在何處', '仙雲在何處', '在何處'] 

['仙人不見我', '仙人不可見', '人不見'] 

['門前便是家山道', '門前便是嵩山路', '門前便是仙山路'] 

['先皇曾向此中游', '曾向此中游', '神仙曾向此中游'] 

['通籍在金閨', '通籍在金閨', '已通仙籍在金閨'] 

['三千功滿仙升去', '三千功滿好歸去', '三千功滿去升天'] 

['不知何處是天真', '不知何處偶真仙', '不知何處'] 

['叢簧發仙弄', '叢簧發天弄'] 

['仙人何處在', '美人何處在'] 

['雲歸帝鄉遠', '雲歸仙帝鄉'] 

['玉醴浮金菊', '玉醴浮仙菊'] 

['東風吹雪舞山家', '東風吹雪舞仙家'] 

['仙人樓

## 65%

In [138]:
xian_substituted65 = []
for group in tqdm(no_duplicates65):
    if len(group) > len([i for i in group if '仙' in i[1]]):
        xian_substituted65.append(group)

print (f"There are {len(xian_substituted65)} cases in 65% similar lines where xian is substituted.\n")

for i in sorted(xian_substituted65, key=len, reverse=True):
    group = [n[1] for n in i]
    print(group, '\n')



There are 123 cases in 65% similar lines where xian is substituted.

['故人不可見', '故人不可見', '故人不可見', '西憶故人不可見', '故人不可見', '仙人不見我', '此人不可見', '京師故人不可見', '古人不可見', '上仙不可見', '故人不可見', '故人不可見', '遊人不可見', '故人不可見', '人間想望不可見', '海上仙遊不可見', '仙人不可見', '仙舟不可見', '人不見', '神仙不可見'] 

['不知何處火', '不知何處恨', '不知何處去', '不知何處葬', '不知此何處', '不知何處歸', '不知精爽歸何處', '不知仙駕歸何處', '不知此地歸何處', '不知何處笛', '不知雲雨歸何處', '不知雲雨歸何處', '仙馭歸何處', '不知何處峰', '不知功滿歸何處', '不知何處', '不知何處火'] 

['歌宛轉', '歌宛轉', '歌宛轉', '歌宛轉', '歌宛轉', '歌宛轉', '仙歌宛轉聽', '仙歌宛轉聽', '歌宛轉', '歌宛轉', '歌宛轉', '歌宛轉'] 

['不知何處火', '仙源不知處', '不知何處恨', '不知何處去', '不知何處葬', '不知此何處', '不知何處歸', '仙吏不知何處隱', '不知何處笛', '不知何處峰', '不知何處', '不知何處火'] 

['不知何處火', '不知何處恨', '不知何處去', '不知何處葬', '不知此何處', '不知何處歸', '不知何處笛', '不知何處峰', '不知何處是天真', '不知何處偶真仙', '不知何處', '不知何處火'] 

['神仙中人不易得', '神仙望見不得到', '不可得', '神仙不可求', '神仙不可得', '神仙不可學', '聞道神仙不可接', '神仙不可見'] 

['笑問客從何處來', '暗從何處來', '問從何處來', '子從何處來', '又從何處來', '幽鳥晚從何處來', '三從何處來', '仙從何處來'] 

['仙雲在何處', '仙雲在何處', '在何處', '欲問神仙在何處', '兩漢真仙在何處', '孤雲自在知何處', '雲中有寺在何處'] 

['自言神訣不可求', '神仙不可求', '神仙不可得

## 60% similarity

In [129]:
xian_substituted60 = []
for group in tqdm(no_duplicates2):
    if len(group) > len([i for i in group if '仙' in i[1]]):
        xian_substituted60.append(group)





In [130]:
print (f"There are {len(xian_substituted60)} cases in 60% similar lines where xian is substituted.")

There are 131 cases in 60% similar lines where xian is substituted.


In [131]:
for i in sorted(xian_substituted60, key=len, reverse=True):
    group = [n[1] for n in i]
    print(group,'\n')

['故人不可見', '故人不可見', '故人不可見', '西憶故人不可見', '故人不可見', '仙人不見我', '此人不可見', '京師故人不可見', '古人不可見', '上仙不可見', '故人不可見', '故人不可見', '遊人不可見', '故人不可見', '人間想望不可見', '海上仙遊不可見', '仙人不可見', '仙舟不可見', '人不見', '神仙不可見'] 

['不知何處火', '不知何處恨', '不知何處去', '不知何處葬', '不知此何處', '不知何處歸', '不知精爽歸何處', '不知仙駕歸何處', '不知此地歸何處', '不知何處笛', '不知雲雨歸何處', '不知雲雨歸何處', '仙馭歸何處', '不知何處峰', '不知何處相尋', '不知功滿歸何處', '不知何處', '去去不知何處', '不知何處火'] 

['不知何處火', '仙源不知處', '不知何處恨', '不知何處去', '不知何處葬', '不知此何處', '不知何處歸', '仙吏不知何處隱', '不知何處笛', '不知何處峰', '不知何處相尋', '不知何處', '去去不知何處', '不知何處火'] 

['不知何處火', '不知何處恨', '不知何處去', '不知何處葬', '不知此何處', '不知何處歸', '不知何處笛', '不知何處峰', '不知何處相尋', '不知何處是天真', '不知何處偶真仙', '不知何處', '去去不知何處', '不知何處火'] 

['歌宛轉', '歌宛轉', '歌宛轉', '歌宛轉', '歌宛轉', '歌宛轉', '仙歌宛轉聽', '仙歌宛轉聽', '歌宛轉', '歌宛轉', '歌宛轉', '歌宛轉'] 

['神仙中人不易得', '神仙望見不得到', '不可得', '神仙不可求', '神仙不可得', '神仙不可學', '聞道神仙不可接', '神仙不可見'] 

['笑問客從何處來', '暗從何處來', '問從何處來', '子從何處來', '又從何處來', '幽鳥晚從何處來', '三從何處來', '仙從何處來'] 

['仙雲在何處', '仙雲在何處', '在何處', '欲問神仙在何處', '兩漢真仙在何處', '孤雲自在知何處', '雲中有寺在何處'] 

['自言神訣不可求', '神仙不可求', '神仙不可得', '神仙不可學'