ntcirtemporalia/TemporaliaChTagger

###################################################
# Temporalia-Style Tagging for Chinese Collection #
###################################################

###################
# 1. Introduction #
###################


This script tags Named Entities (e.g., Person, Organization, Location, etc.) and temporal expressions (e.g., June 27, 2012) in Chinese documents.
It combines two off-the-shelf open-source tools, Stanford CoreNLP and HeidelTime.
Both tools are copyrighted by their respective owners.

Stanford CoreNLP performs the Named Entity recognition. For details, please refer to http://nlp.stanford.edu/software/corenlp.shtml
HeidelTime performs the temporal expression annotation. For details, please refer to https://code.google.com/p/heideltime/

All rights reserved. This program and the accompanying materials are made available under the terms of the GNU General Public License
(with exceptions for third-party software and data).

######################
# 2. Getting started #
######################

Tip-1:
This script was used to tag the SogouCA (2012) collection for the dry run. Please download the collection yourself from http://www.sogou.com/labs/dl/ca.html
Tip-2:
Make sure that the Java running environment is installed correctly!
Tip-3:
Due to size limitations, the required jar files are not uploaded with the code.
They are available at https://www.dropbox.com/sh/iuef1faawybwltl/AAByThnWNHlgJbPDpKgrZqpia?dl=0
Note: the file "stanford-chinese-corenlp-2015-01-30-models.jar" should be added first, and "stanford-corenlp-3.5.1.jar" should be added subsequently.
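
The ordering in the note above can be sketched in Java for anyone who assembles a classpath by hand instead of using the Eclipse build path. This is only an illustration under assumptions: the class ClasspathOrder, its methods, and the placeholder "other.jar" are hypothetical and not part of this project; only the two Stanford jar names and the "lib" folder come from this README.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of TemporaliaChTagger.
class ClasspathOrder {
    static final String MODELS_JAR = "stanford-chinese-corenlp-2015-01-30-models.jar";
    static final String CORENLP_JAR = "stanford-corenlp-3.5.1.jar";

    // Puts the Chinese models jar first and the CoreNLP jar second;
    // all remaining jars keep the order in which they were given.
    static List<String> ordered(List<String> jarNames) {
        List<String> result = new ArrayList<>();
        if (jarNames.contains(MODELS_JAR)) {
            result.add(MODELS_JAR);
        }
        if (jarNames.contains(CORENLP_JAR)) {
            result.add(CORENLP_JAR);
        }
        for (String name : jarNames) {
            if (!result.contains(name)) {
                result.add(name);
            }
        }
        return result;
    }

    // Joins the ordered jars into a classpath string for the current platform.
    static String classpath(String libDir, List<String> jarNames) {
        List<String> entries = new ArrayList<>();
        for (String name : ordered(jarNames)) {
            entries.add(libDir + "/" + name);
        }
        return String.join(File.pathSeparator, entries);
    }

    public static void main(String[] args) {
        // "other.jar" is a placeholder for any further jar in the "lib" folder.
        List<String> jars = Arrays.asList("other.jar", CORENLP_JAR, MODELS_JAR);
        System.out.println(classpath("lib", jars));
    }
}
```

Whether the order matters outside Eclipse depends on how the jars overlap; the sketch simply preserves the order this README asks for.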

There are three example files in the "exampleFiles" folder. They correspond to the raw document, the pre-processed document, and the tagged document, respectively.

The following example shows how to use org.archive.sogou.TemChTagger to generate the tagged documents in the Eclipse IDE.

Required setting:
(1)
Right-click the TemporaliaChTagger project and choose Build Path -> Configure Build Path.
Choose Libraries -> click the "Add Variable..." button -> click the "Configure Variables..." button.
Click the "New..." button, then add a new variable entry as follows:

Name:	PathOfResource
Path:	the path of the "resources" folder under the TemporaliaChTagger project (use the "Folder..." button to select it)

(2)
Make sure all the .jar files under the "lib" folder are added to the build path.

Running Steps:

##	Step-1	##

Given the raw data file, the first step is to pre-process it by running org.archive.sogou.TemChTagger.

The "program arguments" (set in Run Configurations) should be: -p oriFile outputDir

@param oriFile the original raw data file, e.g., news_tensite_xml.smarty.dat or news_tensite_xml.dat
@param outputDir the output directory

For example: -p collectionTest/news_tensite_xml.smarty.dat collectionTest/NoTagVersion/

##	Step-2	##

Given the files generated by Step-1, the second step is to perform the tagging by running org.archive.sogou.TemChTagger again.

The "program arguments" should be: -t NoTagFileDir TagFileDir

@param NoTagFileDir the directory of the files generated by Step-1
@param TagFileDir the output directory, i.e., the directory in which to store the tagged files

For example: -t collectionTest/NoTagVersion/ collectionTest/TagVersion/
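
The two steps are linked: the output directory of Step-1 is the input directory of Step-2. A minimal Java sketch of how the two argument lists fit together is given below. The class TaggerArguments and its methods are hypothetical and are not part of org.archive.sogou.TemChTagger; only the -p and -t flags, the argument order, and the example paths are taken from the steps above.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of TemporaliaChTagger.
class TaggerArguments {
    // Step-1: "-p" selects pre-processing of the raw data file.
    static List<String> preprocessArgs(String oriFile, String outputDir) {
        return Arrays.asList("-p", oriFile, outputDir);
    }

    // Step-2: "-t" selects tagging of the files produced by Step-1.
    static List<String> taggingArgs(String noTagFileDir, String tagFileDir) {
        return Arrays.asList("-t", noTagFileDir, tagFileDir);
    }

    public static void main(String[] args) {
        // The same directory is the output of Step-1 and the input of Step-2.
        String noTagDir = "collectionTest/NoTagVersion/";

        List<String> step1 = preprocessArgs("collectionTest/news_tensite_xml.smarty.dat", noTagDir);
        List<String> step2 = taggingArgs(noTagDir, "collectionTest/TagVersion/");

        // Each printed line can be pasted into the "program arguments" field.
        System.out.println(String.join(" ", step1));
        System.out.println(String.join(" ", step2));
    }
}
```

Keeping the intermediate directory in a single variable avoids the most likely mistake, which is pointing Step-2 at a directory other than the one Step-1 wrote to.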

