Precursor-induced CRF for named entity recognition

reference:

Lee, W., & Choi, J. (2019). Precursor-induced conditional random fields: connecting separate entities by induction for improved clinical named entity recognition. BMC Medical Informatics and Decision Making. 2019. 19:132 (https://doi.org/10.1186/s12911-019-0865-1)
contact: jinsamdol at snu ac kr

Model-jar file description

the conventional first-order CRF

trainer: original_crf_trainer_o1_dev.jar
labeler: original_crf_tester_o1_dev.jar

the conventional second-order CRF

trainer: original_crf_trainer_o2_dev.jar
labeler: original_crf_tester_o2_dev.jar

the precursor-induced CRF (pi-CRF)

trainer: pre-induced_crf_basic_trainer_dev.jar
labeler: pre-induced_crf_basic_tester_dev.jar

Usage

training

java -jar pre-induced_crf_basic_trainer_dev.jar --train_text="YOUR_TRAINING_DATA_ABSOLUTE_PATH" --model="YOUR_MODEL_ABSOLUTE_PATH" --report="YOUR_REPORT_FILE_ABSOLUTE_PATH" --valid_fraction=1.0 --sigma=10.0 --start_label="A*START" --null_feature="null"

all paths are absolute path
--valid_fraction=DECIMAL : the decimal indicates how much you adjust the portion of your training data out of your input data. In the case the value is 1.0, you use whole data as the training data and you have no validation data. In the case the value is 0.8, you use the 80% of the input data for the training data and 20% for the validation data. default is 1.0
--sigma=DECIMAL : this is the hyperparameter for sigma in the L2-regularization function. default is 10.0
--start_label="A*START" : setting the label for the starting label.
--null_feature="null" : many words may have no feature for certain feature dimensions for the features, for the case, you can indicate the null feature as a special tag and the default setting for the tag is 'null'.
--iter=INT : you can constrain the maximum iteration number.

labeling

java -jar pre-induced_crf_opaque_tester_dev.jar "YOUR_TESTING_DATA_ABSOLUTE_PATH (DIRECTORY)" "SUFFIX_OF_YOUR_TRAINING_DATA" "YOUR_MODEL_ABSOLUTE_PATH" "OUTPUT_DATA_ABSOLUTE_PATH" "EVALUATION_DATA_ABSOLUTE_PATH"

for instance, you can use the tester like this
java -jar pre-induced_crf_opaque_tester_dev.jar "D:\Dropbox\crf test" ".cntx.test.txt" "D:\Dropbox\crf test\my_pi_crf.crf" "D:\Dropbox\crf test\my_pi_crf_output.txt" "D:\Dropbox\crf test\my_pi_crf_eval.txt"
We would like to note the last one is for evaluation result, however, the current evaluation module is unstable. Thus, please use your own evaluation code.

Important: chunk scheme: IOB2

B for the first token of an entity expanse.
I for the other tokens of an entity expanse.
O for the outside.
You have to use O (uppercase o) for the outside token for your data.

Data format

single line is for one timestep.
'space' is delimiter for features
single blank line indicates the end of an instance.
the first column is for the word, the last column is for the label, and the others are for features.

For instance, we have two word sequences "The patient also reports blood-streaked diarrhea for two days ." and "The patient was exacerbated" and we use the word itself, next word and the previous word as features. Then it can be abstracted as a feature map as below

The t1:patient null O
patient t1:also t-1:The O
also t1:reports t-1:patient O
reports t1:blood-streaked t-1:also O
blood-streaked t1:diarrhea t-1:reports B-PROBLEM
diarrhea t1:for t-1:blood-streaked I-PROBLEM
for t1:two t-1:diarrhea O
two t1:days t-1:for O
days t1:. t-1:two O
. null t-1:days O

The t1:patient null O
patient t1:was t-1:The O
was t1:exacerbated t-1:patient O
exacerbated null t-1:was B-PROBLEM

the prefixes t1:... and t-1:... are used for distinguishing features each other. You may freely use arbitrary features, and you may freely use arbitrary prefixes for features, as well as, you may not use prefixes. However, I prefer to use prefixes in order to make distinction between different features because the CRF uses features in the form of a bag-of-features.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

.gitattributes

.gitattributes

LICENSE

LICENSE

README.md

README.md

original_crf_tester_o1_dev.jar

original_crf_tester_o1_dev.jar

original_crf_tester_o2_dev.jar

original_crf_tester_o2_dev.jar

original_crf_trainer_o1_dev.jar

original_crf_trainer_o1_dev.jar

original_crf_trainer_o2_dev.jar

original_crf_trainer_o2_dev.jar

pre-induced_crf_basic_tester_dev.jar

pre-induced_crf_basic_tester_dev.jar

pre-induced_crf_basic_trainer_dev.jar

pre-induced_crf_basic_trainer_dev.jar

Repository files navigation

Precursor-induced CRF for named entity recognition

reference:

Model-jar file description

Usage

training

labeling

Important: chunk scheme: IOB2

Data format

About

Releases

Packages

Name		Name	Last commit message	Last commit date
Latest commit History 14 Commits
.gitattributes		.gitattributes
LICENSE		LICENSE
README.md		README.md
original_crf_tester_o1_dev.jar		original_crf_tester_o1_dev.jar
original_crf_tester_o2_dev.jar		original_crf_tester_o2_dev.jar
original_crf_trainer_o1_dev.jar		original_crf_trainer_o1_dev.jar
original_crf_trainer_o2_dev.jar		original_crf_trainer_o2_dev.jar
pre-induced_crf_basic_tester_dev.jar		pre-induced_crf_basic_tester_dev.jar
pre-induced_crf_basic_trainer_dev.jar		pre-induced_crf_basic_trainer_dev.jar

License

jinsamdol/precursor-induced_CRF

Folders and files

Latest commit

History

Repository files navigation

Precursor-induced CRF for named entity recognition

reference:

Model-jar file description

Usage

training

labeling

Important: chunk scheme: IOB2

Data format

About

Resources

License

Stars

Watchers

Forks