KDD CUP 2013 - Track 2
======================

Copyright 2013 Cheng-Xia Chang, Wei-Cheng Chang, Wei-Sheng Chin, Kuan-Hao
Huang, Yu-Chin Juan, Tzu-Ming Kuo, Chun-Liang Li, Chih-Jen Lin, Hsuan-Tien Lin,
Shan-Wei Lin, Shou-De Lin, Ting-Wei Lin, Young-San Lin, Yu-Chen Lu, Yu-Chuan
Su, Cheng-Hao Tsai, Hsiao-Yu Tung, Jui-Pin Wang, Cheng-Kuang Wei, Felix Wu,
Chun-Pai Yang, Tu-Chun Yin, Tong Yu, and Yong Zhuang.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

This package is developed at National Taiwan University.

Our approach includes three different author-matching algorithms, 'main1',
'main2' and 'typo'. The outputs of these three algorithms are merged using the
post-processing scripts in 'merge'. To simplify usage, we integrate all
necessary steps into a Makefile in the top directory, so users can simply type
'make' to run the whole process. In addition, 'buff/main1.csv' and
'buff/main2.csv' hold the results of 'main1' and 'main2', respectively.

Because our algorithms rely on some other open source packages, please read the
following statements before you get started. The details of these methods will
be described in a paper to be published in mid-July.

A. Package organization

   'main1'  (One author-matching algorithm)
   'main2'  (Another author-matching algorithm)
   'typo'   (A matching algorithm for detecting duplicates that cannot be
             found by 'main1' and 'main2' because of typos)
   'merge'  (Scripts for merging the results of the above methods)

B. Performance in KDD-Cup 2013 (F1 score):

                | main1 | main2 | merge
   public(20%)  |0.99186|0.99071|0.99195
   private(80%) |0.99198|0.99083|0.99202

C. Make a prediction step-by-step:

   1. Requirements and dependencies:
      Our package runs under Ubuntu 10.04 and requires the following:
      1-1. Python2 (tested with version 2.6.5)
      1-2. Python3 (tested with version 3.3.1)
      1-3. Perl5 (tested with version 5.10)
      1-4. Perl module Text::CSV
      1-5. Raw data (dataRev2.zip). The zipped file should be stored in the
           top directory of this package.

   2. Run:
      Type 'make'.

   3. Result:
      The result file is 'final.csv'. Please note that the algorithms may
      take more than 2 hours to generate the result file.

D. Public resources used. They are included in this package.

   1. Chinese information for 'main1'.
      This information is used for Eastern and Western name identification.

      1-1. Chinese family name
           We have two lists of Chinese family names. The smaller one,
           TW.raw, is the official romanization of the 100 most common
           Chinese family names in Taiwan. The larger one, CN.raw, which
           includes 506 common Chinese family names and their romanizations,
           is downloaded from Wikipedia.
           Links:
              "http://tc.wangchao.net.cn/xinxi/detail_1855256.html" and
              romanization in "http://www.boca.gov.tw/mp?mp=1"
              https://zh.wikipedia.org/wiki/中文姓氏羅馬字標注
              http://www.greatchinese.com/surname/surname.htm

      1-2. Korean family name
           KR.raw contains 20 common Korean family names and their
           romanizations.
           Link:
              http://mirror.enha.kr/wiki/한국인%20이름의%20로마자%20표기

      1-3. Common romanization of Chinese tokens
           Links:
              http://www.pinyin.info/romanization/compare/gwoyeu_romatzyh.html
              http://en.wikipedia.org/wiki/Comparison_of_Chinese_romanization_systems
           From these tokens, we manually selected 45 tokens that frequently
           appear in both English and Chinese.

   2. Chinese information for 'main2'.

      2-1. Chinese family name
           Link:
              http://www.chineseinla.com/lastname/key_ng.html

      2-2. Common romanization of Chinese tokens
           Link:
              http://irw.ncut.edu.tw/general/chen813/羅馬拼音/中文羅馬拼音對照表.htm

   3. Nick names.
      We substitute all nicknames before we do any matching.
      Links for 'main1':
         http://www.cc.kyoto-su.ac.jp/~trobb/nicklist.html
         http://mentalfloss.com/article/24761/origins-10-nicknames
      Link for 'main2':
         https://code.google.com/p/author-dedupe/

   4. List of stop words used in the merge step
      Link:
         http://nlp.stanford.edu/software/tmt/tmt-0.4/
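
As an illustration of the nickname substitution mentioned in D.3, the following is a minimal sketch of replacing nicknames with canonical given names before any matching. It is not the code used in 'main1' or 'main2'. The table NICKNAMES, the function substitute_nicknames, and the tokenization rule (lowercase alphabetic tokens) are all assumptions made for this example; the actual lists come from the links above.

```python
import re

# Hypothetical nickname table for illustration only; the real lists used by
# 'main1' and 'main2' are taken from the resources linked in section D.3.
NICKNAMES = {
    "bill": "william",
    "bob": "robert",
    "liz": "elizabeth",
}


def substitute_nicknames(name, table=NICKNAMES):
    """Lowercase a name and replace every token that is a known nickname.

    Assumption: names are split into runs of ASCII letters, so punctuation
    such as the period in "J." is dropped.
    """
    tokens = re.findall(r"[a-z]+", name.lower())
    return " ".join(table.get(token, token) for token in tokens)


if __name__ == "__main__":
    # Both spellings normalize to the same string, so a later exact-match
    # step would treat them as the same author name.
    print(substitute_nicknames("Bill Gates"))
    print(substitute_nicknames("William Gates"))
```

Normalizing first keeps the later matching step simple: it only has to compare canonical strings and does not need to know about nicknames at all.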