ntcirtemporalia/TemporaliaChTagger
###################################################
# Temporalia-Style Tagging for Chinese Collection #
###################################################

###################
# 1. Introduction #
###################

This script tags the named entities (e.g., Person, Organization, Location) and temporal expressions (e.g., June 27, 2012) found in Chinese documents. It combines two off-the-shelf open-source tools, Stanford CoreNLP and HeidelTime. These tools are copyrighted by their respective owners.

Stanford CoreNLP performs the named entity recognition. For detailed information, please refer to
http://nlp.stanford.edu/software/corenlp.shtml

HeidelTime performs the temporal expression annotation. For detailed information, please refer to
https://code.google.com/p/heideltime/

All rights reserved. This program and the accompanying materials are made available under the terms of the GNU General Public License (with exceptions for 3rd party software & data).

######################
# 2. Getting started #
######################

Tip-1: This script is used to tag the SogouCA (2012) collection for the dry run. Please download the collection yourself from
http://www.sogou.com/labs/dl/ca.html

Tip-2: Make sure that the Java runtime environment is installed correctly!

Tip-3: Because of size limits, the required jar files are not uploaded with the code. Please get them from
https://www.dropbox.com/sh/iuef1faawybwltl/AAByThnWNHlgJbPDpKgrZqpia?dl=0
Note: the file "stanford-chinese-corenlp-2015-01-30-models.jar" should be added first, followed by "stanford-corenlp-3.5.1.jar".

The folder "exampleFiles" contains three example files: the raw document, the pre-processed document, and the tagged document.

The following is an example of using org.archive.sogou.TemChTagger to generate the tagged documents in the Eclipse IDE.

Required setting:

(1) Right-click the TemporaliaChTagger project and choose Build Path -> Configure Build Path.
    Choose Libraries -> click the "Add Variable..." button -> click the "Configure Variables..." button.
    Click the "New..." button, then add a new variable entry, say
        Name: PathOfResource
        Path: the path of the folder "resources" under the project TemporaliaChTagger (use the "Folder..." button)

(2) Make sure all the .jar files under the folder "lib" are properly set.

Running Steps:

## Step-1 ##
Given the raw data file, the first step is to pre-process it by running org.archive.sogou.TemChTagger.
The "program arguments" (in Run Configurations) should be set as:

    -p oriFile outputDir

    @param oriFile      the original raw data file, e.g., news_tensite_xml.smarty.dat or news_tensite_xml.dat
    @param outputDir    the output directory

for example,

    -p collectionTest/news_tensite_xml.smarty.dat collectionTest/NoTagVersion/

## Step-2 ##
Given the files generated by Step-1, the second step is to perform the desired tagging by running org.archive.sogou.TemChTagger again.
The "program arguments" should be set as:

    -t NoTagFileDir TagFileDir

    @param NoTagFileDir the directory of the files generated by Step-1
    @param TagFileDir   the output directory, i.e., the directory to store the tagged files

for example,

    -t collectionTest/NoTagVersion/ collectionTest/TagVersion/
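
The two steps above are described for Eclipse only. For readers who prefer the command line, the following is a minimal sketch of how the same two invocations might be assembled. It is not part of TemporaliaChTagger: the class TaggerCommands, the "bin" output folder, and putting the "resources" folder on the classpath in place of the PathOfResource variable are all assumptions made for illustration. Only the main class name, the -p and -t flags, the two jar names, and the example paths come from this README.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper (not part of TemporaliaChTagger) that builds the two command lines. */
final class TaggerCommands {
    // Main class named in this README.
    static final String MAIN_CLASS = "org.archive.sogou.TemChTagger";

    /** Builds: java -cp <classpath> org.archive.sogou.TemChTagger <flag> <input> <output> */
    static List<String> build(List<String> classpath, String flag, String input, String output) {
        // The README documents exactly two modes: -p (pre-process) and -t (tag).
        if (!"-p".equals(flag) && !"-t".equals(flag)) {
            throw new IllegalArgumentException("flag must be -p or -t, got: " + flag);
        }
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-cp");
        // Entries are joined in the given order, so the caller controls jar precedence.
        cmd.add(String.join(File.pathSeparator, classpath));
        cmd.add(MAIN_CLASS);
        cmd.add(flag);
        cmd.add(input);
        cmd.add(output);
        return cmd;
    }

    public static void main(String[] args) {
        // Assumed layout: compiled classes in "bin", plus the "resources" folder.
        // The models jar is listed before the CoreNLP jar, as the note in Tip-3 asks.
        // Any other jars from "lib" would need to be appended to this list.
        List<String> classpath = Arrays.asList(
                "bin",
                "resources",
                "lib/stanford-chinese-corenlp-2015-01-30-models.jar",
                "lib/stanford-corenlp-3.5.1.jar");

        // Step-1: pre-process the raw data file.
        List<String> step1 = build(classpath, "-p",
                "collectionTest/news_tensite_xml.smarty.dat", "collectionTest/NoTagVersion/");
        // Step-2: tag the files generated by Step-1.
        List<String> step2 = build(classpath, "-t",
                "collectionTest/NoTagVersion/", "collectionTest/TagVersion/");

        System.out.println(String.join(" ", step1));
        System.out.println(String.join(" ", step2));
    }
}
```

The sketch only prints the commands. To run them, each list could be passed to java.lang.ProcessBuilder, with Step-2 started only after Step-1 has finished, since Step-2 reads the output directory of Step-1.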