<a href="https://colab.research.google.com/github/michelle-shih/2022CALISE/blob/main/111_CALISE_Project_Train_CDA111004.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

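# 2022 (ROC year 111) CALISE (Chinese Association of Library & Information Science Education) Big Data Competition

## Task: Subject Term Prediction for Bibliographic Records
1. Task type: multi-label
2. Description: Library cataloging is currently done manually by librarians. To make their cataloging work more efficient, the goal of this competition is to design a subject term prediction method that librarians can consult while cataloging. The design should follow current cataloging practice, in which one bibliographic record corresponds to one or more subject terms.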

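### Training data
* Overview of the bibliographic data fields: [download here](https://docs.google.com/document/d/1odEYMcL34SI3WIGteMvqP3lbDVDlS9PR/edit?usp=sharing&ouid=103782426576759505585&rtpof=true&sd=true)
* Bibliographic training dataset (subject terms already split): [Train DataSet](https://drive.google.com/file/d/1Cp6UJbhesOvZPNZ5l6lO401OoFF_Uom9/view?usp=sharing) (280,000 records)
  * The training dataset is provided for participants to train their models
* Bibliographic test dataset (subject terms already split): [Test DataSet](https://drive.google.com/file/d/1-CjrC-oyzmk79F-hpGogoaT_W0THkmqO/view?usp=sharing) (10,000 records)
  * The test dataset is provided for participants to test their models after training
* Bibliographic training dataset (subject terms not split): [Train DataSet](https://drive.google.com/file/d/1sd9i5AumQI4A_WO7o9l3RmyRM2RJ1wm5/view?usp=sharing) (280,000 records)
  * The training dataset is provided for participants to train their models
* Bibliographic test dataset (subject terms not split): [Test DataSet](https://drive.google.com/file/d/1n3YCi_AY__nPd_Awek7GYHM_-tOdcKnW/view?usp=sharing) (10,000 records)
  * The test dataset is provided for participants to test their models after training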
---
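### Notes
1. Build your project following the template below. The template is the minimum requirement: you may add to it but must not remove anything from it. Submissions that cannot be evaluated as a result will not be scored.

2. To submit, rename this project file directly. The naming rule is: 111_CALISE_Project_YourTeamNumber.ipynb.

3. Scoring will be run directly on Colab. The evaluation data has the same number of records as the test data, so **design the project so that the whole notebook runs to completion on that number of records within the resources and time Colab provides**.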


---

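# 0. Colab Environment Configuration

* Describe the Colab environment configuration you used here, such as hardware accelerator settings, so that the judges can set it up for evaluation.


We made no particular changes to the Colab environment configuration. Files are loaded by mounting Google Drive.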

In [None]:
#Load packages
import pandas as pd
import os
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import regex as re # Package for text matching, substitution, and similar tasks
from IPython import display
import matplotlib.dates as mdates
from ipywidgets import Text, IntText
from datetime import datetime # Package for handling date and time data
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

In [None]:
# Download the font archive
!wget "https://noto-website-2.storage.googleapis.com/pkgs/NotoSansCJKtc-hinted.zip"
# Unzip it
!unzip "NotoSansCJKtc-hinted.zip"
# Move the font file into the font directory
!mv NotoSansCJKtc-Regular.otf /usr/share/fonts/truetype/

# Clear the cell output
display.clear_output()

# Font management for figures
import matplotlib.font_manager as fm
# Path to the font file
ch_font_path = '/usr/share/fonts/truetype/NotoSansCJKtc-Regular.otf'

# Register the font file and set the font family and size
fm.fontManager.addfont(ch_font_path)
plt.rcParams['font.sans-serif'] = 'Noto Sans CJK TC'
plt.rcParams['font.size']=12
# Fix the rendering of the minus sign
plt.rcParams['axes.unicode_minus']=False

In [None]:
#Read the files
#Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

##An authorization prompt appears here

os.chdir('/content/drive/My Drive/111_CALISE_CDA111004') #Switch to the project directory
#https://drive.google.com/drive/folders/1zoSxsPrDdX_7yepTgpBBELRfRGNV0f_5?usp=sharing
os.listdir() #Check the directory contents

Mounted at /content/drive


['NotoSansCJKtc-Black.otf',
 'NotoSansCJKtc-Light.otf',
 'NotoSansCJKtc-DemiLight.otf',
 'NotoSansCJKtc-Bold.otf',
 'NotoSansCJKtc-Thin.otf',
 'NotoSansCJKtc-Medium.otf',
 'NotoSansMonoCJKtc-Bold.otf',
 'NotoSansMonoCJKtc-Regular.otf',
 'README',
 'LICENSE_OFL.txt',
 'NotoSansCJKtc-hinted.zip',
 'data',
 'variables',
 'assets',
 'ckiptagger_model',
 'keras_metadata.pb',
 'saved_model.pb',
 'our_model.h5',
 '111_CALISE_Project_Train_CDA111004.ipynb',
 '111_CALISE_Project_Test_CDA111004.ipynb',
 'subject_CBOW.wv',
 'subjects.npy',
 'y_train_numpy_array.npy',
 'class_num.npy',
 'language_code.npy',
 '111_CALISE_Project_Train_CDA111004砍掉重練.ipynb',
 '111_CALISE_Project_Test_CDA111004砍掉重練.ipynb',
 'title_CBOW.wv',
 'title.npy',
 'test_class_num.npy',
 'test_subject_CBOW.wv',
 'test_subjects.npy',
 'test_y_train_numpy_array.npy',
 'test_language_code.npy',
 'test_title_CBOW.wv',
 'test_title.npy',
 'pus_model.h5']

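# 1. Data Processing

* If the data you plan to use has gone through any crawling-based supplementation, conversion, merging, or other processing, write it in this section and **print the first 10 processed records** so the judges can assess it.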



In [None]:
#Load the training dataset
books_df = pd.read_csv("/content/drive/My Drive/111_CALISE_CDA111004/data/merge_train_dataset.csv")
books_df.head(10)

Unnamed: 0,Permanent Call Number,MMS Id,Publication Date,Resource Type,Title,Edition,Author,Author (contributor),ISBN,Place of Publication - Country,Publisher,Language Code,Material Type,Subjects
0,011.18 8455,991019356069705721,2002[民91],Book - Physical,禁書 : 100部曾被禁的世界經典作品 /,初版,"卡羅里德斯 (Karolides, Nicholas J.)","Sova, Dawn B.; Bald, Margaret.; Karolides, Nic...",9574551326; 9789574551323,China,知己總經銷; 晨星發行,chi,Book,United States; History
1,011.92 503,991014836579705721,[民88],Book - Physical,華麗的探險 : 西方經典的當代閱讀 /,初版,"敦比 著 (Denby, David)","Denby, David.; 嚴韻 譯",9577087388; 9789577087386; 9577087396; 9789577...,China,城邦文化發行; 麥田出版,chi,Book,United States; History
2,011.92 865,991014836579705721,[民88],Book - Physical,華麗的探險 : 西方經典的當代閱讀 /,初版,"敦比 著 (Denby, David)","Denby, David.; 嚴韻 譯",9577087388; 9789577087386; 9577087396; 9789577...,China,城邦文化發行; 麥田出版,chi,Book,United States; History
3,011.92 865 2004,991017426349705721,2004[民93],Book - Physical,華麗的探險 : 西方經典的當代閱讀 /,初版,"敦比 (Denby, David)","Denby, David.; 嚴韻",9577087388; 9789577087386; 9577087396; 9789577...,China,城邦文化發行; 麥田出版,chi,Book,United States; History
4,012.4 875,991012415289705721,2007[民96],Book - Physical,書痴指南 : 如何在你鎖定的主題讀到對的書 /,初版,"皮而 (Pearl, Nancy)","Pearl, Nancy.; 柯惠琮",9789868287624; 9868287626,China,時報文化總經銷; 閱讀地球文化出版,chi,Book,United States
5,020 168,991010620629705721,2006[民95],Book - Physical,圖書館這一行 /,第1版,"克恩 著 (Kane, Laura Townsend)","鳳儀知識產業公司編譯組 編; Kane, Laura Townsend.",9868210208; 9789868210202,China,鳳儀知識產業,chi,Book,United States
6,020 854,991010620629705721,2006[民95],Book - Physical,圖書館這一行 /,第1版,"克恩 著 (Kane, Laura Townsend)","鳳儀知識產業公司編譯組 編; Kane, Laura Townsend.",9868210208; 9789868210202,China,鳳儀知識產業,chi,Book,United States
7,024.4 834,991001817299705721,2003[民92],Book - Physical,調適性科技的網際網路 /,一版,"馬茲 著 (Mates, Barbara T.)","Mates, Barbara T.; 程鈺雄 譯",9571134244; 9789571134246,China,五南,chi,Book,United States
8,027 365,991001817299705721,2003[民92],Book - Physical,調適性科技的網際網路 /,一版,"馬茲 著 (Mates, Barbara T.)","Mates, Barbara T.; 程鈺雄 譯",9571134244; 9789571134246,China,五南,chi,Book,United States
9,027 8653,991015096929705721,2000[民89],Book - Physical,檔案教學 /,初版,"丹尼爾遜 作 (Danielson, Charlotte)","Abrutyn, Leslye.; Danielson, Charlotte.; 蔡佩玲 譯...",9577024041; 9789577024046,China,心理,chi,Book,United States


In [None]:
#Check the original number of records
len(books_df)

211028

In [None]:
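#Separate the Chinese and English books
#.copy() makes each subset an independent frame, so columns can be added to it later
chi_books_df= books_df[(books_df['Language Code']=='chi')].copy()
eng_books_df= books_df[(books_df['Language Code']=='eng')].copy()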
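
The split by language code can be tried on a small frame first. The sketch below is a minimal illustration, assuming a hypothetical `toy_df` with only two columns; only the column name `Language Code` is taken from this notebook. It shows that a boolean mask followed by `.copy()` yields subsets that can receive new columns without affecting the parent frame.

```python
import pandas as pd

# Hypothetical toy frame standing in for books_df.
toy_df = pd.DataFrame({
    "Title": ["圖書館這一行 /", "Emerson on the scholar /", "檔案教學 /"],
    "Language Code": ["chi", "eng", "chi"],
})

# Boolean-mask selection; .copy() detaches each subset from toy_df.
chi_part = toy_df[toy_df["Language Code"] == "chi"].copy()
eng_part = toy_df[toy_df["Language Code"] == "eng"].copy()

# Adding a column to a subset leaves the parent frame unchanged.
chi_part["flag"] = 0

print(len(chi_part), len(eng_part))
print("flag" in toy_df.columns)
```

The original row labels are kept in each subset, which is why the English rows later in this notebook start at index 1178 rather than 0.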

## Classification number

In [None]:
#Handle the classification number first
#Text processing: keep the first three consecutive digits of the call number
import re
def get_codenum(s):
  if isinstance(s, str):
    m = re.search(r"\d{3}",s)
    if m is not None:
      return m.group(0)
    else:
      return "x"
  else:
    return "not_str"
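
To see what the function returns before applying it to a whole column, it can be called on a few call numbers. The sketch below restates the function so that it runs on its own; the sample strings are copied from the tables in this notebook, and the `nan` entry is an assumed stand-in for a missing call number.

```python
import re

def get_codenum(s):
    """Return the first run of three digits in a call number, else a marker."""
    if not isinstance(s, str):
        return "not_str"          # e.g. a missing value read as float nan
    m = re.search(r"\d{3}", s)
    return m.group(0) if m is not None else "x"

samples = ["011.18 8455", "001.2028546 N277", "15.70", "CCU", "Unknown", float("nan")]
codes = [get_codenum(s) for s in samples]
print(codes)
```

Call numbers such as `15.70` or `CCU` contain no run of three digits, so they fall into the `"x"` bucket that the following cells inspect.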

In [None]:
#Apply the function defined above to the call numbers
eng_books_df['cut_eng_num']=eng_books_df['Permanent Call Number'].apply(get_codenum)
eng_books_df.head(10)

Unnamed: 0,Permanent Call Number,MMS Id,Publication Date,Resource Type,Title,Edition,Author,Author (contributor),ISBN,Place of Publication - Country,Publisher,Language Code,Material Type,Subjects,cut_eng_num
1178,001 M184,991017404239705721,c1980.,Book - Physical,"Knowledge, its creation, distribution, and eco...",,"Machlup, Fritz, 1902-",,0691042268; 9780691042268,United States,Princeton University Press,eng,Book,United States,1
1179,001.0922 B676,991003464039705721,1999-,Book - Physical,Bohm-Biederman correspondence /,,"Bohm, David.","Pylkkänen, P. (Paavo); Biederman, Charles Jose...",0415162254; 9780415162258,United Kingdom,Routledge,eng,Book,United States,1
1180,001.1 D317,991016550269705721,1985.,Book - Physical,Degeneration : the dark side of progress /,,,"Gilman, Sander L.; Chamberlin, J. Edward, 1943-",0231051964; 9780231051965,United States,Columbia University Press,eng,Book,United States; Europe,1
1181,001.1 D346,991017050959705721,c1995.,Book - Physical,The delegated intellect : Emersonian essays on...,,,"Morse, Donald E., 1936-; Gifford, Don.",0820426059; 9780820426051,United States,P Lang,eng,Book,United States; Philosophy,1
1182,001.10973 B458,991012717469705721,c1993.,Book - Physical,Intellect and public life : essays on the soci...,,"Bender, Thomas.",,0801844339; 9780801844331,United States,Johns Hopkins University Press,eng,Book,United States; History,1
1183,001.2 E24,991006599949705721,c2010.,Book - Physical,Educating scholars : doctoral education in the...,,,"Ehrenberg, Ronald G.",9780691142661; 0691142661,United States,Princeton University Press,eng,Book,United States,1
1184,001.2 S438,991002565009705721,c1992.,Book - Physical,Emerson on the scholar /,,"Sealts, Merton M.",,0826208312; 9780826208316,United States,University of Missouri Press,eng,Book,United States,1
1185,001.2028546 N277,991013460849705721,[1991],Book - Physical,The National Research and Education Network (N...,,,"McClure, Charles R.",089391813X; 9780893918132,United States,Ablex Pub Corp,eng,Book,United States; Research,1
1186,001.3 C734,991016550419705721,c1980.,Book - Physical,The humanities in American life : report of th...,,Commission on the Humanities (1978- ),,0520041836; 9780520041837; 0520042085; 9780520...,United States,University of California Press,eng,Book,United States,1
1187,001.3 C934,991020241789705721,2003.,Book - Physical,Critical cultural policy studies : a reader /,,,"Miller, Toby.; Lewis, Justin, 1958-",0631222995; 9780631222996; 0631223002; 9780631...,United Kingdom,Blackwell Pub,eng,Book,United States,1


In [None]:
#See which records are exceptions
eng_books_df[eng_books_df['cut_eng_num']=="x"]

Unnamed: 0,Permanent Call Number,MMS Id,Publication Date,Resource Type,Title,Edition,Author,Author (contributor),ISBN,Place of Publication - Country,Publisher,Language Code,Material Type,Subjects,cut_eng_num
2692,15.70,991021223479805721,2006.,Book - Physical,Western intellectuals and the Soviet Union : 1...,,"Stern, Ludmila.",,9780415360050; 0415360056; 9780203008140; 0203...,United States,Routledge Taylor & Francis Group,eng,Book,United States; History; Europe,x
2693,15.75,991021169947705721,[2015],Book - Physical,Nixon's nuclear specter : the secret alert of ...,,"Burr, William, author.","Kimball, Jeffrey P., author.",0700620826; 9780700620821; 9780700620838; 0700...,United States,University Press of Kansas,eng,Book,United States; Foreign relations,x
39423,CCU,991005480569705721,c1991.,Book - Physical,The new meaning of educational change /,2nd ed.,FullanMichaele.,StiegelbauerSuzanne M.,0304324221; 9780304324224,United Kingdom,Cassell Educational Limited,eng,Book,United States,x
39424,CCU,991006013909705721,c1994.,Book - Physical,Authentic Assessment : a handbook for educators /,,HartDiane.,,0201818647; 9780201818642,United Kingdom,Addison-Wesley Pub Co,eng,Book,United States,x
39425,CCU,991006462449705721,1991.,Book - Physical,Restructuring schools /,,"MurphyJoseph, 1949-",,0304327344; 9780304327348,United Kingdom,Cassell,eng,Book,United States,x
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
204570,E.88.XVII.5,991015904309705721,1988.,Book - Physical,Handbook of national accounting : public secto...,,,United Nations. Statistical Office.; United Na...,9211612926; 9789211612929,United States,United Nations,eng,Book,Finance,x
204630,Unknown,991002725659705721,,Book - Physical,How to create your own real-estate fortune usi...,,NielsenJens E.,,914306375,United States,International Wealth Success,eng,Book,Finance,x
204631,Unknown,991021186717405721,2005.,Book - Physical,Don't eat the marshmallow-- yet! : the secret ...,1st ed.,"Posada, Joachim de.","Singer, Ellen, 1957-",9780425205457; 0425205452,United States,Berkley Books,eng,Book,Finance,x
204632,Unknown,991021220275205721,[2022],Book - Physical,Modern computational finance : scripting for d...,,"Andreasen, Jesper, author.","Savine, Antoine, 1970- author.",111954078X; 9781119540786; 9781119540793; 1119...,United States,John Wiley & Sons Inc,eng,Book,Finance,x


In [None]:
#Take out the extracted classification numbers on their own
eng_num=eng_books_df['cut_eng_num']
eng_num

1178      001
1179      001
1180      001
1181      001
1182      001
         ... 
207588    811
207589    811
211025    507
211026    643
211027    848
Name: cut_eng_num, Length: 156046, dtype: object

In [None]:
#Do the same for the Chinese classification numbers
chi_books_df['cut_chi_num']=chi_books_df['Permanent Call Number'].apply(get_codenum)
chi_books_df.head(10)

Unnamed: 0,Permanent Call Number,MMS Id,Publication Date,Resource Type,Title,Edition,Author,Author (contributor),ISBN,Place of Publication - Country,Publisher,Language Code,Material Type,Subjects,cut_chi_num
0,011.18 8455,991019356069705721,2002[民91],Book - Physical,禁書 : 100部曾被禁的世界經典作品 /,初版,"卡羅里德斯 (Karolides, Nicholas J.)","Sova, Dawn B.; Bald, Margaret.; Karolides, Nic...",9574551326; 9789574551323,China,知己總經銷; 晨星發行,chi,Book,United States; History,11
1,011.92 503,991014836579705721,[民88],Book - Physical,華麗的探險 : 西方經典的當代閱讀 /,初版,"敦比 著 (Denby, David)","Denby, David.; 嚴韻 譯",9577087388; 9789577087386; 9577087396; 9789577...,China,城邦文化發行; 麥田出版,chi,Book,United States; History,11
2,011.92 865,991014836579705721,[民88],Book - Physical,華麗的探險 : 西方經典的當代閱讀 /,初版,"敦比 著 (Denby, David)","Denby, David.; 嚴韻 譯",9577087388; 9789577087386; 9577087396; 9789577...,China,城邦文化發行; 麥田出版,chi,Book,United States; History,11
3,011.92 865 2004,991017426349705721,2004[民93],Book - Physical,華麗的探險 : 西方經典的當代閱讀 /,初版,"敦比 (Denby, David)","Denby, David.; 嚴韻",9577087388; 9789577087386; 9577087396; 9789577...,China,城邦文化發行; 麥田出版,chi,Book,United States; History,11
4,012.4 875,991012415289705721,2007[民96],Book - Physical,書痴指南 : 如何在你鎖定的主題讀到對的書 /,初版,"皮而 (Pearl, Nancy)","Pearl, Nancy.; 柯惠琮",9789868287624; 9868287626,China,時報文化總經銷; 閱讀地球文化出版,chi,Book,United States,12
5,020 168,991010620629705721,2006[民95],Book - Physical,圖書館這一行 /,第1版,"克恩 著 (Kane, Laura Townsend)","鳳儀知識產業公司編譯組 編; Kane, Laura Townsend.",9868210208; 9789868210202,China,鳳儀知識產業,chi,Book,United States,20
6,020 854,991010620629705721,2006[民95],Book - Physical,圖書館這一行 /,第1版,"克恩 著 (Kane, Laura Townsend)","鳳儀知識產業公司編譯組 編; Kane, Laura Townsend.",9868210208; 9789868210202,China,鳳儀知識產業,chi,Book,United States,20
7,024.4 834,991001817299705721,2003[民92],Book - Physical,調適性科技的網際網路 /,一版,"馬茲 著 (Mates, Barbara T.)","Mates, Barbara T.; 程鈺雄 譯",9571134244; 9789571134246,China,五南,chi,Book,United States,24
8,027 365,991001817299705721,2003[民92],Book - Physical,調適性科技的網際網路 /,一版,"馬茲 著 (Mates, Barbara T.)","Mates, Barbara T.; 程鈺雄 譯",9571134244; 9789571134246,China,五南,chi,Book,United States,27
9,027 8653,991015096929705721,2000[民89],Book - Physical,檔案教學 /,初版,"丹尼爾遜 作 (Danielson, Charlotte)","Abrutyn, Leslye.; Danielson, Charlotte.; 蔡佩玲 譯...",9577024041; 9789577024046,China,心理,chi,Book,United States,27


In [None]:
#See which records got x
chi_books_df[chi_books_df['cut_chi_num']=="x"]

Unnamed: 0,Permanent Call Number,MMS Id,Publication Date,Resource Type,Title,Edition,Author,Author (contributor),ISBN,Place of Publication - Country,Publisher,Language Code,Material Type,Subjects,cut_chi_num
1177,Unknown,991021176663005721,2021,Book - Physical,"Ci jiang er jie : ""Menluo zhu yi"" yu jin dai k...",Beijing di 1 ban,"Zhang, Yongle, author",,9787108070760; 7108070766,China,Sheng huo du shu xin zhi san lian shu dian,chi,Book,United States; China; Foreign relations,x
40836,D2-0,991021065068005721,2016.,Book - Physical,Hu Jintao wen xuan /,Di 1 ban.,"Hu, Jintao, 1942- author.",,7010167273; 9787010167275; 7010167214; 9787010...,China,Ren min chu ban she,chi,Book,History; Politics and government; China,x
40837,D2-0,991021065068205721,2016.,Book - Physical,Hu Jintao wen xuan /,Di 1 ban.,"Hu, Jintao, 1942- author.",,7010167273; 9787010167275; 7010167214; 9787010...,China,Ren min chu ban she,chi,Book,History; Politics and government; China,x
40847,Unknown,991021071392105721,2019,Book - Physical,Guang ze qing liu : Xiongnu gu du Tongwan Chen...,Di 1 ban,"Shi, Xiaolong, author","统万城文物管理所, editor; Tongwan Cheng wen wu guan li...",7501060827; 9787501060825,China,Wen wu chu ban she,chi,Book,History; China,x
40848,Unknown,991021096998805721,2019,Book - Physical,Gu ji zhi wei wen wu /,Beijing di 1 ban,"Li, Kaisheng, 1982- author.",,7101142451; 9787101142457,China,Zhonghua shu ju,chi,Book,History; China,x
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
211017,Unknown,991021214673505721,2021.11,Book - Physical,從生命書寫到藝術越界 : 性別族群認同.視覺文化再現 = From life writing...,,,,9789869956239; 9869956238,China,典藏文創有限公司,chi,Book,文集,x
211018,Unknown,991021216747005721,2022[民111],Book - Physical,漢學論衡初集 /,,鄭吉雄 著,,9789863505785; 9863505781,China,臺灣大學發行; 臺灣大學出版中心出版,chi,Book,文集,x
211019,Unknown,991021225963705721,2019.,Book - Physical,文白之爭 : 語文、教育、國族的百年戰場 /,初版,,"王, 嘉弘, 〔中國文學〕",9789577635457; 9577635458,China,五南出版,chi,Book,文集,x
211020,Unknown,991021225977305721,2011.,Book - Physical,葉長海曲論自選集 /,初版,"葉長海, 1944-",,9789573612926; 9573612925,China,國家,chi,Book,文集,x


In [None]:
#Take out the extracted classification numbers on their own
chi_num=chi_books_df['cut_chi_num']
chi_num

0         011
1         011
2         011
3         011
4         012
         ... 
211020      x
211021      x
211022    520
211023    548
211024    827
Name: cut_chi_num, Length: 54982, dtype: object

In [None]:
#Classification number encoding
from gensim.models import Word2Vec
from numpy import array
from keras.preprocessing.sequence import pad_sequences
#Convert the Series into lists
eng_num=list(eng_num)
chi_num=list(chi_num)

In [None]:
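#Use the dictionary to look up the subject terms attached to a classification number; averaging their vectors gives the vector for that classification number
from gensim.models import KeyedVectors
wv = KeyedVectors.load('/content/drive/My Drive/111_CALISE_CDA111004/subject_CBOW.wv')
words = wv.vocab.keys() #.vocab is the gensim 3.x vocabulary attribute
cbow_class_dict = {word:wv[word] for word in words}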

In [None]:
#Take out the subject terms
books_subjects = books_df['Subjects']
books_subjects.head(10)

0    United States; History
1    United States; History
2    United States; History
3    United States; History
4             United States
5             United States
6             United States
7             United States
8             United States
9             United States
Name: Subjects, dtype: object

In [None]:
#Turn it into a list
books_subjects1=list(books_subjects)
books_subjects1

['United States; History',
 'United States; History',
 'United States; History',
 'United States; History',
 'United States',
 'United States',
 'United States',
 'United States',
 'United States',
 'United States',
 'United States',
 'United States',
 'United States',
 'United States',
 'United States',
 'United States; Philosophy; 哲學',
 'United States; Philosophy; 哲學',
 'United States; Psychology; Social conditions',
 'United States; Psychology',
 'United States',
 'United States',
 'United States',
 'United States; Psychology',
 'United States',
 'United States',
 'United States; Education',
 'United States; Education',
 'United States; Psychology',
 'United States; Psychology',
 'United States; Biography; Education',
 'United States',
 'United States; Psychology; Social conditions',
 'United States; History; Social aspects; Europe; Psychology',
 'United States',
 'United States; Psychology',
 'United States; Psychology',
 'United States; Psychology',
 'United States; Biography',
 '

In [None]:
#Split each string cleanly so that the individual subject terms are separated
books_subjects2 = []
for i in range(len(books_subjects1)):
  single = books_subjects1[i].split('; ')
  books_subjects2.append(single)
books_subjects2

[['United States', 'History'],
 ['United States', 'History'],
 ['United States', 'History'],
 ['United States', 'History'],
 ['United States'],
 ['United States'],
 ['United States'],
 ['United States'],
 ['United States'],
 ['United States'],
 ['United States'],
 ['United States'],
 ['United States'],
 ['United States'],
 ['United States'],
 ['United States', 'Philosophy', '哲學'],
 ['United States', 'Philosophy', '哲學'],
 ['United States', 'Psychology', 'Social conditions'],
 ['United States', 'Psychology'],
 ['United States'],
 ['United States'],
 ['United States'],
 ['United States', 'Psychology'],
 ['United States'],
 ['United States'],
 ['United States', 'Education'],
 ['United States', 'Education'],
 ['United States', 'Psychology'],
 ['United States', 'Psychology'],
 ['United States', 'Biography', 'Education'],
 ['United States'],
 ['United States', 'Psychology', 'Social conditions'],
 ['United States', 'History', 'Social aspects', 'Europe', 'Psychology'],
 ['United States'],
 ['Un
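
The split on `'; '` assumes every `Subjects` cell is a string. A minimal sketch of the same step with a guard for non-string cells is given below; the helper name `split_subjects` and the `nan` sample are hypothetical, and returning an empty list for a missing cell is an assumption rather than something this notebook does.

```python
def split_subjects(cell, sep="; "):
    """Split one 'Subjects' cell into a list of terms; non-strings give []."""
    if not isinstance(cell, str):
        return []
    # Drop empty pieces and surrounding whitespace left over from the split.
    return [term.strip() for term in cell.split(sep) if term.strip()]

raw = ["United States; History", "United States; Philosophy; 哲學", float("nan")]
split = [split_subjects(cell) for cell in raw]
print(split)
```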

In [None]:
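#Pair each subject list with a classification number
#zip pairs items by position and stops at the shorter input: eng_num holds the English rows only, books_subjects2 holds every row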
d=zip (eng_num,books_subjects2)
temp=list(d)
temp

[('001', ['United States', 'History']),
 ('001', ['United States', 'History']),
 ('001', ['United States', 'History']),
 ('001', ['United States', 'History']),
 ('001', ['United States']),
 ('001', ['United States']),
 ('001', ['United States']),
 ('001', ['United States']),
 ('001', ['United States']),
 ('001', ['United States']),
 ('001', ['United States']),
 ('001', ['United States']),
 ('001', ['United States']),
 ('001', ['United States']),
 ('001', ['United States']),
 ('001', ['United States', 'Philosophy', '哲學']),
 ('001', ['United States', 'Philosophy', '哲學']),
 ('001', ['United States', 'Psychology', 'Social conditions']),
 ('001', ['United States', 'Psychology']),
 ('001', ['United States']),
 ('001', ['United States']),
 ('001', ['United States']),
 ('001', ['United States', 'Psychology']),
 ('001', ['United States']),
 ('001', ['United States']),
 ('001', ['United States', 'Education']),
 ('001', ['United States', 'Education']),
 ('001', ['United States', 'Psychology']),
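
Because `zip` pairs by position, it is worth checking that both inputs describe the same rows in the same order. The sketch below uses small hypothetical lists (`codes_eng_only`, `subjects_all`, `is_english`) to contrast positional pairing with pairing after the subject lists have been filtered by the same language mask; it illustrates the check and is not the method used in this notebook.

```python
# Hypothetical data: four records, of which the 2nd and 4th are English.
codes_eng_only = ["001", "001"]                       # codes of English rows only
subjects_all = [["History"], ["哲學"], ["Finance"], ["Education"]]  # all rows
is_english = [False, True, False, True]

# Positional pairing: zip stops at the shorter input and ignores the mask.
positional = list(zip(codes_eng_only, subjects_all))

# Mask-aware pairing: keep the subject lists of English rows, then zip.
subjects_eng = [s for s, eng in zip(subjects_all, is_english) if eng]
aligned = list(zip(codes_eng_only, subjects_eng))

print(positional)
print(aligned)
```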
 

In [None]:
#Average the subject-term vectors for the English classification numbers (plain, unweighted mean)
class_vec1=[]
for n in range(0,len(eng_num)):
  class_vec=[]
  for word in temp[n][1]:
    if word in cbow_class_dict:
      class_vec.append(cbow_class_dict[word])
  #Mean over the term axis; an entry with no known term yields nan
  arr1=np.array(class_vec).mean(axis=0)
  class_vec1.append(arr1)
class_vec1

[array([-0.6378303 ,  0.49136826,  0.28620297,  0.15261763, -0.2448442 ,
        -0.21742885,  0.53674775, -0.19683124,  0.2124317 , -0.68649244,
         0.4692445 , -1.2805687 ,  0.02308443,  0.09686872,  0.4037317 ,
        -0.47502074,  0.59054774, -0.48070747, -0.745545  , -0.36540422],
       dtype=float32),
 array([-0.6378303 ,  0.49136826,  0.28620297,  0.15261763, -0.2448442 ,
        -0.21742885,  0.53674775, -0.19683124,  0.2124317 , -0.68649244,
         0.4692445 , -1.2805687 ,  0.02308443,  0.09686872,  0.4037317 ,
        -0.47502074,  0.59054774, -0.48070747, -0.745545  , -0.36540422],
       dtype=float32),
 array([-0.6378303 ,  0.49136826,  0.28620297,  0.15261763, -0.2448442 ,
        -0.21742885,  0.53674775, -0.19683124,  0.2124317 , -0.68649244,
         0.4692445 , -1.2805687 ,  0.02308443,  0.09686872,  0.4037317 ,
        -0.47502074,  0.59054774, -0.48070747, -0.745545  , -0.36540422],
       dtype=float32),
 array([-0.6378303 ,  0.49136826,  0.28620297,  0.15

In [None]:
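#Average the subject-term vectors for the Chinese classification numbers (plain, unweighted mean)
#temp is read by position, so this loop covers its first len(chi_num) entries
class_vec2=[]
for n in range(0,len(chi_num)):
  class_vec=[]
  for word in temp[n][1]:
    if word in cbow_class_dict:
      class_vec.append(cbow_class_dict[word])
  #Mean over the term axis; an entry with no known term yields nan
  arr1=np.array(class_vec).mean(axis=0)
  class_vec2.append(arr1)
class_vec2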

[array([-0.6378303 ,  0.49136826,  0.28620297,  0.15261763, -0.2448442 ,
        -0.21742885,  0.53674775, -0.19683124,  0.2124317 , -0.68649244,
         0.4692445 , -1.2805687 ,  0.02308443,  0.09686872,  0.4037317 ,
        -0.47502074,  0.59054774, -0.48070747, -0.745545  , -0.36540422],
       dtype=float32),
 array([-0.6378303 ,  0.49136826,  0.28620297,  0.15261763, -0.2448442 ,
        -0.21742885,  0.53674775, -0.19683124,  0.2124317 , -0.68649244,
         0.4692445 , -1.2805687 ,  0.02308443,  0.09686872,  0.4037317 ,
        -0.47502074,  0.59054774, -0.48070747, -0.745545  , -0.36540422],
       dtype=float32),
 array([-0.6378303 ,  0.49136826,  0.28620297,  0.15261763, -0.2448442 ,
        -0.21742885,  0.53674775, -0.19683124,  0.2124317 , -0.68649244,
         0.4692445 , -1.2805687 ,  0.02308443,  0.09686872,  0.4037317 ,
        -0.47502074,  0.59054774, -0.48070747, -0.745545  , -0.36540422],
       dtype=float32),
 array([-0.6378303 ,  0.49136826,  0.28620297,  0.15

In [None]:
#Concatenate the English and Chinese classification number embeddings
class_vec=class_vec1+class_vec2
class_vec

[array([-0.6378303 ,  0.49136826,  0.28620297,  0.15261763, -0.2448442 ,
        -0.21742885,  0.53674775, -0.19683124,  0.2124317 , -0.68649244,
         0.4692445 , -1.2805687 ,  0.02308443,  0.09686872,  0.4037317 ,
        -0.47502074,  0.59054774, -0.48070747, -0.745545  , -0.36540422],
       dtype=float32),
 array([-0.6378303 ,  0.49136826,  0.28620297,  0.15261763, -0.2448442 ,
        -0.21742885,  0.53674775, -0.19683124,  0.2124317 , -0.68649244,
         0.4692445 , -1.2805687 ,  0.02308443,  0.09686872,  0.4037317 ,
        -0.47502074,  0.59054774, -0.48070747, -0.745545  , -0.36540422],
       dtype=float32),
 array([-0.6378303 ,  0.49136826,  0.28620297,  0.15261763, -0.2448442 ,
        -0.21742885,  0.53674775, -0.19683124,  0.2124317 , -0.68649244,
         0.4692445 , -1.2805687 ,  0.02308443,  0.09686872,  0.4037317 ,
        -0.47502074,  0.59054774, -0.48070747, -0.745545  , -0.36540422],
       dtype=float32),
 array([-0.6378303 ,  0.49136826,  0.28620297,  0.15

In [None]:
#Check the length
len(class_vec)

211028

In [None]:
#Save to file
np.save('class_num.npy',class_vec)
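
The averaging step can also be written as a small helper that always returns a vector of the same length. The sketch below is one possible variant under stated assumptions: the helper name `mean_term_vector`, the 2-dimensional toy vectors, and the zero-vector fallback for records without any known term are all hypothetical (the loops above produce `nan` in that case).

```python
import numpy as np

def mean_term_vector(terms, term_vectors, dim):
    """Unweighted mean of the vectors of known terms; zeros if none is known."""
    known = [term_vectors[t] for t in terms if t in term_vectors]
    if not known:
        return np.zeros(dim, dtype=np.float32)
    return np.mean(known, axis=0).astype(np.float32)

# Hypothetical 2-dimensional vectors standing in for the 20-dimensional ones.
toy_vectors = {
    "History": np.array([1.0, 3.0], dtype=np.float32),
    "Finance": np.array([3.0, 1.0], dtype=np.float32),
}
rows = [["History", "Finance"], ["History"], ["Unseen term"]]

# One row per record, in the same order as the input list.
features = np.stack([mean_term_vector(r, toy_vectors, dim=2) for r in rows])
print(features.shape)
```

Stacking into a single 2-D array keeps one row per record in input order, which makes it easy to compare the row count with `len(books_df)` before saving.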

## Subject terms

In [None]:
#Next, handle the subject terms
#Take out the subject terms
books_subjects = books_df['Subjects']
books_subjects.head(10)

0    United States; History
1    United States; History
2    United States; History
3    United States; History
4             United States
5             United States
6             United States
7             United States
8             United States
9             United States
Name: Subjects, dtype: object

In [None]:
#Put them in a list, then split them
books_subjects1=list(books_subjects)
books_subjects2 = []
for i in range(len(books_subjects1)):
  single = books_subjects1[i].split('; ')
  books_subjects2.append(single)
books_subjects2

[['United States', 'History'],
 ['United States', 'History'],
 ['United States', 'History'],
 ['United States', 'History'],
 ['United States'],
 ['United States'],
 ['United States'],
 ['United States'],
 ['United States'],
 ['United States'],
 ['United States'],
 ['United States'],
 ['United States'],
 ['United States'],
 ['United States'],
 ['United States', 'Philosophy', '哲學'],
 ['United States', 'Philosophy', '哲學'],
 ['United States', 'Psychology', 'Social conditions'],
 ['United States', 'Psychology'],
 ['United States'],
 ['United States'],
 ['United States'],
 ['United States', 'Psychology'],
 ['United States'],
 ['United States'],
 ['United States', 'Education'],
 ['United States', 'Education'],
 ['United States', 'Psychology'],
 ['United States', 'Psychology'],
 ['United States', 'Biography', 'Education'],
 ['United States'],
 ['United States', 'Psychology', 'Social conditions'],
 ['United States', 'History', 'Social aspects', 'Europe', 'Psychology'],
 ['United States'],
 ['Un

In [None]:
# Load the word2vec module and the KeyedVectors class
from gensim.models import word2vec, KeyedVectors
# Settings
seed = 7777  # Seed for the random number generator.
sg = 0    # 1 for skip-gram; otherwise CBOW.
window_size = 99 # Maximum distance between the current and predicted word within a sentence (context window)
vector_size = 20 # Dimensionality of the word vectors
min_count = 5 # Ignores all words with total frequency lower than this
workers = 4 # Use these many worker threads to train the model (=faster training with multicore machines)
epochs = 10 # Number of iterations (epochs) over the corpus
batch_words = 10000 # Target size (in words) for batches of examples passed to worker threads

model_cbow = word2vec.Word2Vec(
    books_subjects2,
    min_count=min_count,
    size=vector_size,
    workers=workers,
    iter=epochs,
    window=window_size,
    sg=sg,
    seed=seed,
    batch_words=batch_words
)

cbow_word_vectors = model_cbow.wv # This object essentially contains the mapping between words and embeddings.

In [None]:
#Inspect the result
cbow_word_vectors.most_similar("Economic policy", topn=30)

[('Social conditions', 0.8336923718452454),
 ('Finance', 0.7734158635139465),
 ('Europe', 0.7724349498748779),
 ('Foreign relations', 0.7570346593856812),
 ('Law and legislation', 0.7417837381362915),
 ('Management', 0.7024695873260498),
 ('Congresses', 0.6917930245399475),
 ('China', 0.6598811149597168),
 ('Case studies', 0.6025389432907104),
 ('Education', 0.558008074760437),
 ('Politics and government', 0.5278199315071106),
 ('Economic conditions', 0.5002659559249878),
 ('History and criticism', 0.4638478457927704),
 ('Social aspects', 0.43836236000061035),
 ('Great Britain', 0.4378286898136139),
 ('History', 0.43076810240745544),
 ('United States', 0.41847673058509827),
 ('Biography', 0.3653489649295807),
 ('Research', 0.3464658260345459),
 ('Social sciences', 0.30389437079429626),
 ('Philosophy', 0.22965067625045776),
 ('Psychology', 0.20504364371299744),
 ('傳記', 0.1536179631948471),
 ('中國', 0.0795997753739357),
 ('臺灣', 0.05701030418276787),
 ('論文', 0.05378364771604538),
 ('文集', 0
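
The scores returned by `most_similar` are cosine similarities between word vectors. The sketch below reproduces that ranking with NumPy on a handful of hypothetical 2-dimensional vectors; the helper names and the toy values are assumptions made for illustration and are unrelated to the trained model.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def rank_by_similarity(query, vectors, topn=3):
    """Rank every other word by cosine similarity to the query word."""
    q = vectors[query]
    scores = [(w, cosine_similarity(q, v)) for w, v in vectors.items() if w != query]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical vectors, chosen so the ranking is easy to verify by hand.
toy_vectors = {
    "Economic policy": np.array([1.0, 0.0]),
    "Finance": np.array([1.0, 1.0]),
    "Biography": np.array([0.0, 1.0]),
    "哲學": np.array([-1.0, 0.0]),
}
ranking = rank_by_similarity("Economic policy", toy_vectors)
print(ranking)
```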

In [None]:
#Save it first
#cbow_word_vectors is already the KeyedVectors object, so it can be saved directly
cbow_word_vectors.save('subject_CBOW.wv')

In [None]:
#Build a dictionary from each subject term to its vector
words = model_cbow.wv.vocab.keys()
cbow_dict_s = {word:model_cbow.wv[word] for word in words}
cbow_dict_s

{'United States': array([-0.6077292 , -0.23355266,  0.77556354, -0.0431797 , -0.53473186,
         0.30111805, -0.87165165,  0.27131724, -0.3805656 ,  0.84078497,
         0.48305535, -1.2931877 ,  0.04601353, -0.32210666,  1.1426463 ,
        -0.05633279, -0.09591821,  0.42916793, -0.83319986,  0.5311736 ],
       dtype=float32),
 'History': array([-0.41319665,  0.13074027,  0.29924604,  0.2308393 , -0.335659  ,
         0.76354027,  0.8991989 , -0.47209856, -0.4264657 ,  0.42181852,
         0.16342236, -0.885888  , -0.54295164, -1.2375339 , -0.28560457,
        -1.6871419 , -0.4804814 ,  0.06175535,  0.09135851, -0.11595544],
       dtype=float32),
 'Philosophy': array([-0.8110841 ,  0.08888862,  0.4941035 , -0.8738172 , -0.2278348 ,
         0.06116626, -0.10190937, -0.5650267 , -0.7790672 ,  0.17435503,
         0.19788983, -0.17158006,  0.67498845,  0.06803577,  0.21540649,
        -0.16411811, -0.06403619,  0.28220794, -0.68905467,  0.3057863 ],
       dtype=float32),
 '哲學': arr

In [None]:
#Save as npy (a dict is stored as a pickled object, so load it back with allow_pickle=True)
np.save('subjects.npy',cbow_dict_s)
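
Saving a Python dict with `np.save` wraps it in a zero-dimensional object array, so reading it back takes two extra steps. The round trip below is a minimal sketch using a temporary directory and a hypothetical one-entry dictionary rather than the real `subjects.npy`.

```python
import os
import tempfile

import numpy as np

# Hypothetical stand-in for the term-to-vector dictionary.
toy_dict = {"History": np.array([0.1, 0.2], dtype=np.float32)}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_subjects.npy")
    np.save(path, toy_dict)                     # stored as a 0-d object array
    loaded = np.load(path, allow_pickle=True)   # pickled objects must be allowed
    restored = loaded.item()                    # unwrap back into a dict

print(type(restored), list(restored))
```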

In [None]:
#Subject term dictionary (label index to subject term)
subdict = {0:'United States',1:'History',2:'中國',3:'Politics and government',4:'Great Britain',5:'Congresses',
           6:'Philosophy',7:'China',8:'傳記',9:'History and criticism',10:'歷史',11:'Social aspects',
           12:'Biography',13:'Foreign relations',14:'臺灣',15:'Europe',16:'Economic conditions',
           17:'Psychology',18:'論文',19:'Social conditions',20:'Management',21:'Case studies',
           22:'Education',23:'Economic policy',24:'Law and legislation',25:'Social sciences',
           26:'Research',27:'Finance',28:'哲學',29:'文集'}

In [None]:
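#Build one multi-hot label vector per record: position j is 1 if subdict[j] is among the record's subject terms
ytrainlist = []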
for i in range(len(books_subjects2)):
  temlist = []
  for j in range(len(subdict)):
    if subdict[j] in books_subjects2[i]:
      temlist.append(1)
    else:
      temlist.append(0)
  ytrainlist.append(temlist)
ytrainlist

[[1,
  1,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0],
 [1,
  1,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0],
 [1,
  1,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0],
 [1,
  1,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0],
 [1,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0],
 [1,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0],
 [1,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,

In [None]:
#Store it as a NumPy array
y_train_numpy_array = np.array(ytrainlist)
np.save('y_train_numpy_array.npy',y_train_numpy_array )
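
The label matrix built above is a multi-hot encoding: one row per record and one column per subject term. The sketch below shows the same idea with a term-to-column lookup so that each term is resolved in a single dictionary access; the helper name `multi_hot` is hypothetical, and the three-term vocabulary is only the beginning of `subdict`.

```python
import numpy as np

def multi_hot(subject_lists, vocabulary):
    """Return an (n_records, n_terms) 0/1 matrix; unknown terms are ignored."""
    index = {term: j for j, term in enumerate(vocabulary)}
    labels = np.zeros((len(subject_lists), len(vocabulary)), dtype=np.int64)
    for i, terms in enumerate(subject_lists):
        for term in terms:
            j = index.get(term)
            if j is not None:
                labels[i, j] = 1
    return labels

vocab = ["United States", "History", "中國"]   # first three entries of subdict
rows = [["United States", "History"], ["United States"], ["文集"]]
y = multi_hot(rows, vocab)
print(y)
```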

## Language

In [None]:
#Next, handle the language
#Take out the language codes
language_code = books_df['Language Code']
language_code.head(10)

0    chi
1    chi
2    chi
3    chi
4    chi
5    chi
6    chi
7    chi
8    chi
9    chi
Name: Language Code, dtype: object

In [None]:
# Encode Chinese ('chi') as 0 and every other language code as 1
language_code1=list(language_code)
language_code2 = []
for i in range(len(language_code1)):
  if language_code1[i] == 'chi':
    language_code2.append(0)
  else:
    language_code2.append(1)
language_code2

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...


In [None]:
np.save('language_code.npy',language_code2)
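
The loop above assigns 0 to `'chi'` and 1 to every other code, so any non-Chinese language is grouped together with English. A compact sketch of the same rule follows, using a made-up list `toy_codes` in place of `language_code1`; the helper name `encode_language` is hypothetical.

```python
toy_codes = ['chi', 'eng', 'chi', 'jpn']  # hypothetical sample codes

def encode_language(codes):
    """Map 'chi' to 0 and any other language code to 1."""
    return [0 if code == 'chi' else 1 for code in codes]

toy_encoded = encode_language(toy_codes)
print(toy_encoded)
```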

## Title

In [None]:
# Finally, process the titles
# Extract the Chinese and English titles
chi_titles=chi_books_df['Title']
eng_titles=eng_books_df['Title']

In [None]:
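# English word segmentation (tokenization)
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize

# Convert the Series of English titles to a list
eng_titles1=list(eng_titles)

from nltk.tokenize import sent_tokenize
nltk.download('averaged_perceptron_tagger')
nltk.download('omw-1.4')

from gensim.parsing.preprocessing import remove_stopwords

def listtostring(s):
    return ''.join(s)

# Tokenization
eng_tokens3=[]
for title in eng_titles1:
  # The stop-word-filtered string is overwritten on the next line,
  # so stop words are effectively kept in the tokens
  a=remove_stopwords(listtostring(title))
  a=listtostring(title)
  a= re.sub(r'[^\w\s]', '', a)  # drop punctuation
  a=a.strip('/ ')
  # Caution: str.strip('the') removes any leading/trailing 't', 'h', 'e'
  # characters, not the word "the" (e.g. 'significance' -> 'significanc')
  a=a.strip('the')
  a=a.strip('The')
  a=nltk.word_tokenize(a)
  eng_tokens3.append(a)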

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
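
`str.strip` takes a set of characters rather than a word, which is why `a.strip('the')` can trim letters such as the final `e` of a title. If the intent is to drop a leading article, one possible alternative (an assumption about the intent, not the notebook's method) is a word-boundary pattern. The helper name `drop_leading_article` is hypothetical.

```python
import re

def drop_leading_article(title):
    """Remove a leading 'the' (any case) only when it is a whole word."""
    return re.sub(r'^\s*the\b\s*', '', title, flags=re.IGNORECASE)

sample = 'economic significance'
stripped_chars = sample.strip('the')  # character-set stripping at both ends
print(stripped_chars)

kept = drop_leading_article('The dark side of progress')
print(kept)
print(drop_leading_article('Theory of art'))  # 'Theory' is left intact
```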


In [None]:
#lemmatization
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
# Create WordNetLemmatizer object
wnl = WordNetLemmatizer()

def lemmatize(word):
    lemma = wnl.lemmatize(word)
    return lemma

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [None]:
eng_lem=[]

for i in range(0,len(eng_tokens3)):
  end=len(eng_tokens3[i])
  book_lem=[]
  for k in range(0,end):
    b=lemmatize(eng_tokens3[i][k])
    b=b.casefold()
    book_lem.append(b)
  eng_lem.append(book_lem)
eng_lem

[['knowledge',
  'it',
  'creation',
  'distribution',
  'and',
  'economic',
  'significanc'],
 ['bohmbiederman', 'correspondenc'],
 ['degeneration', 'the', 'dark', 'side', 'of', 'progress'],
 ['delegated',
  'intellect',
  'emersonian',
  'essay',
  'on',
  'literature',
  'science',
  'and',
  'art',
  'in',
  'honor',
  'of',
  'don',
  'gifford'],
 ['intellect',
  'and',
  'public',
  'life',
  'essay',
  'on',
  'the',
  'social',
  'history',
  'of',
  'academic',
  'intellectual',
  'in',
  'the',
  'united',
  'states'],
 ['educating', 'scholar', 'doctoral', 'education', 'in', 'the', 'humanity'],
 ['emerson', 'on', 'the', 'scholar'],
 ['national',
  'research',
  'and',
  'education',
  'network',
  'nren',
  'research',
  'and',
  'policy',
  'perspective'],
 ['humanity',
  'in',
  'american',
  'life',
  'report',
  'of',
  'the',
  'commission',
  'on',
  'the',
  'humanities'],
 ['critical', 'cultural', 'policy', 'study', 'a', 'reader'],
 ['divided', 'knowledge', 'across',

In [None]:
# Refine the lemmatization: tag each token with its part of speech
eng_lem1 = []
for i in range(len(eng_lem)):
  llist = nltk.pos_tag(eng_lem[i])
  eng_lem1.append(llist)

In [None]:
from nltk.corpus import wordnet as wn

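# Map Penn Treebank tags from nltk.pos_tag to WordNet POS constants.
# Treebank verb tags start with 'VB' and adverb tags with 'RB', so the
# 'VV' and 'RR' branches below never match; only adjectives and nouns are mapped.
def nltk_tag_to_wordnet_tag(nltk_tag):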
    if nltk_tag.startswith('JJ'):
        return wn.ADJ
    elif nltk_tag.startswith('VV'):
        return wn.VERB
    elif nltk_tag.startswith('NN'):
        return wn.NOUN
    elif nltk_tag.startswith('RR'):
        return wn.ADV
    else:
        return None
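
`nltk.pos_tag` returns Penn Treebank tags, in which verb tags begin with `VB` (for example `VBG`) and adverb tags begin with `RB`. A hedged sketch of a mapping keyed on those prefixes follows. It uses the single-letter values `'a'`, `'v'`, `'n'` and `'r'` that NLTK's `wordnet.ADJ`, `VERB`, `NOUN` and `ADV` constants hold, so it runs without NLTK installed. The function name `treebank_to_wordnet` is hypothetical.

```python
# Single-letter POS codes; assumed equal to nltk.corpus.wordnet.ADJ/VERB/NOUN/ADV.
ADJ, VERB, NOUN, ADV = 'a', 'v', 'n', 'r'

_PREFIX_TO_POS = {'JJ': ADJ, 'VB': VERB, 'NN': NOUN, 'RB': ADV}

def treebank_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS code, or None if unmapped."""
    return _PREFIX_TO_POS.get(tag[:2])

for tag in ['JJ', 'VBG', 'NNS', 'RB', 'IN']:
    print(tag, treebank_to_wordnet(tag))
```

With such a mapping, verb forms would be lemmatized as verbs, so re-running the pipeline would likely produce different tokens from the ones recorded in this notebook.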

In [None]:
# English titles: re-lemmatize each token, passing its WordNet POS when one is mapped
e_title = []
for i in range(len(eng_lem1)):
  lllist = []
  for j in range(len(eng_lem1[i])):
    if nltk_tag_to_wordnet_tag(eng_lem1[i][j][1]) is not None:
      lllist.append(wnl.lemmatize(eng_lem1[i][j][0],pos=nltk_tag_to_wordnet_tag(eng_lem1[i][j][1])))
    else:
      lllist.append(wnl.lemmatize(eng_lem1[i][j][0]))
  e_title.append(lllist)
e_title

[['knowledge',
  'it',
  'creation',
  'distribution',
  'and',
  'economic',
  'significanc'],
 ['bohmbiederman', 'correspondenc'],
 ['degeneration', 'the', 'dark', 'side', 'of', 'progress'],
 ['delegated',
  'intellect',
  'emersonian',
  'essay',
  'on',
  'literature',
  'science',
  'and',
  'art',
  'in',
  'honor',
  'of',
  'don',
  'gifford'],
 ['intellect',
  'and',
  'public',
  'life',
  'essay',
  'on',
  'the',
  'social',
  'history',
  'of',
  'academic',
  'intellectual',
  'in',
  'the',
  'united',
  'state'],
 ['educating', 'scholar', 'doctoral', 'education', 'in', 'the', 'humanity'],
 ['emerson', 'on', 'the', 'scholar'],
 ['national',
  'research',
  'and',
  'education',
  'network',
  'nren',
  'research',
  'and',
  'policy',
  'perspective'],
 ['humanity',
  'in',
  'american',
  'life',
  'report',
  'of',
  'the',
  'commission',
  'on',
  'the',
  'humanity'],
 ['critical', 'cultural', 'policy', 'study', 'a', 'reader'],
 ['divided', 'knowledge', 'across', 'd

In [None]:
# Chinese word segmentation
chi_titles=chi_books_df['Title']

In [None]:
# Convert the Series of Chinese titles to a list
chi_titles1=list(chi_titles)

In [None]:
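# Install the word segmentation system from the CKIP (Chinese Knowledge and Information Processing) group, Institute of Information Science, Academia Sinica
!pip install -U ckiptagger[tf,gdown]

# Clear the cell output
display.clear_output()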

In [None]:
# Import the required components from the ckiptagger package
from ckiptagger import data_utils, construct_dictionary, WS, POS, NER

In [None]:
# Check that the segmentation model archive was downloaded and extracted successfully
os.listdir('/content/drive/My Drive/111_CALISE_CDA111004/ckiptagger_model/data')

['LICENSE',
 'embedding_character',
 'model_ws',
 'model_pos',
 'model_ner',
 'embedding_word']

In [None]:
# Set the model used by the word segmentation (ws) and part-of-speech tagging (pos) systems
ws = WS("/content/drive/My Drive/111_CALISE_CDA111004/ckiptagger_model/data")
#pos = POS("/content/drive/My Drive/Colab Notebooks/ckiptagger_model/data")

In [None]:
# Set the data directory
data_path = "/content/drive/My Drive/111_CALISE_CDA111004/data/書名斷詞"

In [None]:
# Prepare the output folder for word segmentation
#rname = f'{data_path}/chi_titles01' # name of the folder to read from
#print(f'Reading from folder {os.path.abspath(rname)}')

wname = f'{data_path}/chi_titles_seg' # name of the folder to write to
print(f'Writing to folder {os.path.abspath(wname)}')

if os.path.exists(wname):
  print(f'Folder {os.path.abspath(wname)} already exists')
else:
  print(f'Folder {os.path.abspath(wname)} does not exist; creating it')
  os.mkdir(wname)

Writing to folder /content/drive/My Drive/111_CALISE_CDA111004/data/書名斷詞/chi_titles_seg
Folder /content/drive/My Drive/111_CALISE_CDA111004/data/書名斷詞/chi_titles_seg already exists


In [None]:
# Remove punctuation and surrounding whitespace
def listtostring(s):
    return ''.join(s)

chi_titles2=[]
for title in chi_titles1:
  a=listtostring(title)
  a= re.sub(r'[^\w\s]','', a)  # drop punctuation
  a=a.strip('/ ')
  chi_titles2.append(a)

In [None]:
# Word segmentation
words_list = ws(chi_titles2)
# Clean up the segmented tokens: trim whitespace and drop empty strings
chi_tokens = []
for words in words_list:
  chi_tokens1=[]
  for word in words:
    a=word.strip()
    if len(a)>0:
      chi_tokens1.append(a)
  chi_tokens.append(chi_tokens1)
chi_tokens

[['禁書', '100', '部', '曾', '被', '禁', '的', '世界', '經典', '作品'],
 ['華麗', '的', '探險', '西方', '經典', '的', '當代', '閱讀'],
 ['華麗', '的', '探險', '西方', '經典', '的', '當代', '閱讀'],
 ['華麗', '的', '探險', '西方', '經典', '的', '當代', '閱讀'],
 ['書痴', '指南', '如何', '在', '你', '鎖定', '的', '主題', '讀到', '對', '的', '書'],
 ['圖書館', '這', '一', '行'],
 ['圖書館', '這', '一', '行'],
 ['調適性', '科技', '的', '網際網路'],
 ['調適性', '科技', '的', '網際網路'],
 ['檔案', '教學'],
 ['愛', '書', '狂賊'],
 ['我', '的', '大英', '百科', '狂想曲'],
 ['零', '障礙', '博物館'],
 ['如何', '為', '民眾', '規劃', '博物館', '的', '展覽'],
 ['哈佛', '學者'],
 ['愛戀', '智慧', '哲學家', '的', '愛', '智', '之', '路'],
 ['愛上', '哲學', '尋找', '蘇菲', '之', '路', '的', '故事'],
 ['高齡', '的', '魅力', '培養', '積極', '的', '老年', '人生觀'],
 ['長', '不', '大', '的', '男人', '不', '成熟', '成人', '的', '小飛俠', '併發症'],
 ['成功',
  '的',
  '專業',
  '女性',
  '女性',
  '在',
  '男性',
  '專業',
  '世界',
  '中',
  '的',
  '難題',
  '與',
  '適應'],
 ['如何', '教養', '獨生子', '迎接', '獨子', '時代', '的', '來臨'],
 ['生活', '悠遊術'],
 ['沈默', '之', '子', '擺脫', '童年', '情緒', '創傷', '重建', '男性', '心靈', '健康'],
 ['服務', '的', '呼喚', 

In [None]:
len(chi_tokens)

54982

In [None]:
# Merge the English and Chinese title token lists (English first, then Chinese)
final_list= [*e_title,*chi_tokens]

In [None]:
# Import the word2vec module and the KeyedVectors class
from gensim.models import word2vec, KeyedVectors

In [None]:
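# Train a CBOW Word2Vec model on the combined title tokens
# (`size` and `iter` below are the gensim 3.x keyword names)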
# Settings
seed = 60  # Seed for the random number generator.
sg = 0    # 1 for skip-gram; otherwise CBOW.
window_size = 99 # Maximum distance between the current and predicted word within a sentence (context window)
vector_size = 20 # Dimensionality of the word vectors
min_count = 5 # Ignores all words with total frequency lower than this
workers = 4 # Use these many worker threads to train the model (=faster training with multicore machines)
epochs = 10 # Number of iterations (epochs) over the corpus
batch_words = 10000 # Target size (in words) for batches of examples passed to worker threads

model_cbow = word2vec.Word2Vec(
    final_list,
    min_count=min_count,
    size=vector_size,
    workers=workers,
    iter=epochs,
    window=window_size,
    sg=sg,
    seed=seed,
    batch_words=batch_words
)

cbow_word_vectors = model_cbow.wv  # This object essentially contains the mapping between words and embeddings.

In [None]:
print(model_cbow)

Word2Vec(vocab=18932, size=20, alpha=0.025)
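
The printed `size=20` and the `size`/`iter` keywords correspond to the gensim 3.x API. In gensim 4.x these keywords are named `vector_size` and `epochs`, and `wv.vocab` is replaced by `wv.key_to_index`. Below is a small sketch of choosing the keyword names from the installed major version. `word2vec_kwargs` is a hypothetical helper, and the major version is passed in explicitly so the snippet runs without gensim.

```python
def word2vec_kwargs(gensim_major, vector_size, epochs):
    """Return dimension/epoch keyword arguments for the given gensim major version."""
    if gensim_major >= 4:
        return {'vector_size': vector_size, 'epochs': epochs}
    return {'size': vector_size, 'iter': epochs}

legacy = word2vec_kwargs(3, vector_size=20, epochs=10)
modern = word2vec_kwargs(4, vector_size=20, epochs=10)
print(legacy)
print(modern)
```

The returned dictionary could be unpacked with `**` into the `Word2Vec` constructor alongside the other settings.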


In [None]:
# Save the word vectors
cbow_word_vectors.save('title_CBOW.wv')

In [None]:
# Build a dictionary mapping each vocabulary word to its CBOW vector
words = model_cbow.wv.vocab.keys()
cbow_dict = {word:model_cbow.wv[word] for word in words}

In [None]:
# Represent each title by the average of its word vectors
title_vec1=[]
# Fallback vector (0.5 in all 20 dimensions) for titles with no in-vocabulary word
error1 = np.array([0.5]*20)
for n in range(0,len(final_list)):
  title_vec=[]
  for word in final_list[n]:
    if word in cbow_dict.keys():
      title_vec.append(cbow_dict[word])
  if len(title_vec)==0:
    arr1=error1
  else:
    arr1=np.array(title_vec).mean(axis=0)
  title_vec1.append(arr1)
title_vec1

[array([ 0.640484  , -0.48638922, -0.157919  ,  0.05001006, -0.8374806 ,
        -0.75681067,  1.134833  , -1.6748451 ,  1.4312432 ,  0.23210667,
        -2.7424986 , -1.3168203 ,  0.6137576 , -0.95362186, -0.45549864,
        -0.20428239, -0.08746865,  0.78232807, -0.7377572 ,  0.873143  ],
       dtype=float32),
 array([ 4.80789058e-02,  1.00489527e-01,  1.83181375e-01,  6.37190118e-02,
         1.97434932e-01, -3.35282460e-02, -3.53979431e-02, -4.67569113e-01,
        -5.00916183e-01, -5.01638293e-01, -1.46310642e-01,  1.16918795e-01,
         5.06352842e-01, -3.04309558e-02, -1.24059163e-01, -3.53086025e-01,
        -3.57695040e-04, -8.07032958e-02,  2.17895895e-01,  6.50230646e-01],
       dtype=float32),
 array([ 0.47468093, -0.2882378 ,  0.11256421, -0.46706244, -0.06394293,
         0.88285464,  1.3038822 , -1.7297288 ,  0.07886371,  0.48435792,
        -1.4824834 , -1.4450856 ,  0.9642842 , -1.6285263 , -0.3795494 ,
        -0.6370452 , -0.2428642 , -0.4085113 ,  0.5039104 ,  

In [None]:
np.shape(title_vec1)

(211028, 20)

In [None]:
np.save('title.npy',title_vec1)
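
Each title is represented by the mean of its in-vocabulary word vectors, with a constant 0.5 vector as the fallback when no token is in the vocabulary. A minimal sketch on toy data follows. `toy_vectors` and its 3-dimensional values are invented for illustration, and `average_title` is a hypothetical helper.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings (the notebook uses 20 dimensions).
toy_vectors = {
    'history': np.array([1.0, 0.0, 2.0], dtype=np.float32),
    'china': np.array([3.0, 2.0, 0.0], dtype=np.float32),
}
fallback = np.full(3, 0.5)

def average_title(tokens, vectors, default):
    """Mean of the known tokens' vectors, or `default` if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return default
    return np.mean(known, axis=0)

avg = average_title(['history', 'of', 'china'], toy_vectors, fallback)
empty = average_title(['unknown'], toy_vectors, fallback)
print(avg, empty)
```

Out-of-vocabulary tokens such as `'of'` here are skipped rather than counted as zeros, so they do not pull the average toward the origin.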

# 2. Model
* Define and build the model you use here; print it if needed so that the judges can review it.

In [None]:
#environment setup
import os
import statsmodels as sm
import sklearn as skl
import scipy as sp
import matplotlib

#deep learning libraries
import tensorflow as tf
import keras
from tensorflow.keras.layers import Embedding, Input, Flatten
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Concatenate, Dense
from tensorflow.keras.layers import LSTM

from gensim.models import word2vec, KeyedVectors
#import theano
#padsequence

In [None]:
# Load and prepare the input data
x_class=np.load('class_num.npy')
x_subjects=np.load('y_train_numpy_array.npy')
x_language=np.load('language_code.npy')
x_titles=np.load('title.npy')

In [None]:
np.shape(x_class)

(211028, 20)

In [None]:
# 1 if the book has the subject term, 0 otherwise (built earlier)
x_subjects

array([[1, 1, 0, ..., 0, 0, 0],
       [1, 1, 0, ..., 0, 0, 0],
       [1, 1, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 0, 0, 1]])

In [None]:
#reshape numpy to 2d
x_language=x_language.reshape(-1,1)

In [None]:
#define x_train,y_train
x_train=np.hstack((x_titles,x_language,x_class))
y_train=x_subjects

In [None]:
#check shape
np.shape(x_train)

(211028, 41)

In [None]:
#check shape
np.shape(y_train)

(211028, 30)
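
The 41 feature columns are the 20 title dimensions, 1 language flag and 20 class-number dimensions stacked side by side. A shape-only sketch with zero-filled placeholder arrays (5 rows instead of 211028) is shown below; all `toy_` names are hypothetical.

```python
import numpy as np

n_rows = 5  # placeholder row count
toy_titles = np.zeros((n_rows, 20))
toy_language = np.zeros(n_rows).reshape(-1, 1)  # 1-D array -> column vector
toy_class = np.zeros((n_rows, 20))

toy_x = np.hstack((toy_titles, toy_language, toy_class))
print(toy_x.shape)
```

The `reshape(-1, 1)` step matters because `np.hstack` needs every input to be 2-D with the same number of rows.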

In [None]:
# Guard against overfitting: stop early once the training loss has not improved for 3 consecutive epochs
callback = tf.keras.callbacks.EarlyStopping(monitor='loss', patience=3)
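
With `monitor='loss'` and `patience=3`, training stops once the training loss has gone three consecutive epochs without improving on its best value. The pure-Python sketch below mimics that rule on an invented loss sequence. It illustrates the idea only and is not Keras's implementation; `stop_epoch` is a hypothetical helper.

```python
def stop_epoch(losses, patience=3):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float('inf')
    wait = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

toy_losses = [0.9, 0.7, 0.6, 0.61, 0.62, 0.60, 0.59]
print(stop_epoch(toy_losses))
```

Monitoring `val_loss` together with a `validation_split` in `fit` is a common way to make the stopping criterion reflect generalization rather than fit to the training data.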

In [None]:
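# Model architecture
def baseline_model():
  model=keras.Sequential()
  model.add(Dense(600, input_dim=41, activation='relu'))
  # This Dropout layer is created but never passed to model.add(), so it is not part of the model
  keras.layers.Dropout(0.1, noise_shape=None, seed=None)
  model.add(Dense(800, activation='relu'))
  # Same here: the layer is created but not added
  keras.layers.Dropout(0.1, noise_shape=None, seed=None)
  # 30 output units, one per subject term in subdict; softmax makes the 30 outputs sum to 1
  model.add(Dense(30, activation='softmax'))
  #compile model
  model.compile(optimizer='adam',loss='binary_crossentropy',metrics='accuracy')
  return model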

In [None]:
inputs=Input(shape=(41,))
model=baseline_model()
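
Because a book can carry several subject terms, it helps to see how the output activation shapes the scores. Softmax normalizes the outputs to sum to 1, so scores compete across labels, whereas a per-label sigmoid scores each label independently, which is the usual pairing with `binary_crossentropy` for multi-label targets. A dependency-free sketch on three invented logits:

```python
import math

def softmax(logits):
    """Scores that sum to 1 across all labels."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(logits):
    """Independent score in (0, 1) for each label."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]

toy_logits = [3.0, 3.0, -2.0]  # two labels strongly indicated
soft = softmax(toy_logits)
sig = sigmoid(toy_logits)
print([round(p, 3) for p in soft])
print([round(p, 3) for p in sig])
```

With a 0.5 threshold, softmax flags neither of the two indicated labels in this toy case, while sigmoid flags both.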

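# 3. Training and Saving the Model
* Set up and run your training here; print the training results if needed so that the judges can review them.
* Remember to store the model at a downloadable link. If the judges cannot download and run it during review because of link permissions or other issues, the entry will not be scored.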

In [None]:
model.fit(x_train, y_train, batch_size=50000, epochs=30, callbacks=[callback])

Epoch 1/30
...
Epoch 30/30


<keras.callbacks.History at 0x7f94e065d5d0>

In [None]:
model.summary()

Model: "sequential_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_12 (Dense)            (None, 600)               25200     
                                                                 
 dense_13 (Dense)            (None, 800)               480800    
                                                                 
 dense_14 (Dense)            (None, 30)                24030     
                                                                 
Total params: 530,030
Trainable params: 530,030
Non-trainable params: 0
_________________________________________________________________


In [None]:
model.save("nice_model.h5")