CHV terms recommendation

Overview

This project focuses on recommending consumer health vocabulary (CHV) terms from social media, based on the UMLS dataset and the Apache Spark platform. The conceptual idea is shown in the following figure.

First, we extract n-grams from the Yahoo! Answers textual dataset (any other textual dataset can be used instead).
Second, we identify the CHV terms by fuzzy matching against the UMLS database. We call these terms seed terms. The remaining terms, which do not match any CHV term, are considered candidate terms.
Third, we use the K-means algorithm to train a model from the identified CHV terms, which gives K cluster centers.
Fourth, we calculate the distance from each candidate term to all K centers and take the shortest distance as the score that decides whether to recommend the term as a CHV term. The smaller the score, the more likely the candidate term should be considered a CHV term.
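The four steps above can be sketched in plain Python. This is only an illustration, not the project's Spark implementation: the function names, the similarity threshold, and the two-dimensional feature vectors are hypothetical, and `difflib` stands in for the UMLS index-based fuzzy matching that the project actually uses.

```python
import math
import random
from difflib import SequenceMatcher


def ngrams(text, max_n=2):
    """Step 1: return all word n-grams of length 1..max_n."""
    words = text.lower().split()
    return [" ".join(words[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(words) - n + 1)]


def is_seed(term, chv_terms, threshold=0.9):
    """Step 2 (stand-in): fuzzy match a term against known CHV terms."""
    return any(SequenceMatcher(None, term, t).ratio() >= threshold
               for t in chv_terms)


def kmeans(points, k, iterations=20, seed=0):
    """Step 3: plain Lloyd's algorithm; returns k cluster centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centers[j]))
            clusters[nearest].append(p)
        # Recompute each center as the mean of its cluster;
        # keep the old center if the cluster is empty.
        centers = [
            tuple(sum(xs) / len(c) for xs in zip(*c)) if c else centers[j]
            for j, c in enumerate(clusters)
        ]
    return centers


def score(vector, centers):
    """Step 4: shortest distance to any center (smaller is more CHV-like)."""
    return min(math.dist(vector, c) for c in centers)
```

A usage example with made-up feature vectors (how the project turns a term into a vector is not described here, so these numbers are placeholders):

```python
terms = ngrams("high blood pressure pills")
seeds = [t for t in terms if is_seed(t, ["blood pressure"])]
centers = kmeans([(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)], k=2)
# A candidate close to a center gets a smaller score than one far from both.
print(score((9.0, 10.0), centers) < score((5.0, 5.0), centers))
```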

The workflow of the project is shown in the following figure.

For more details on the methodology, please read our paper.

Prepare to run

Download this project with: git clone https://github.com/henryhezhe2003/simiTerm.git. I recommend using an IDE such as IDEA. It is a Maven project, so it should be easy to build.
Add the dependency packages in the libs directory to the project (see docs/dependency-package.jpg), and set the environment variable CTM_ROOT_PATH to the root directory of the project. The program looks for the configuration file and other resource files (e.g. stopwords.txt) relative to this root directory.
Prepare the UMLS data for testing (this step may take a long time). You can follow Sujit's post Understanding UMLS or the UMLS documentation. At the end, you will have imported the UMLS data into MySQL.
Build the index database for fuzzy matching. Run the test function com.votors.umls.UmlsTagger2Test.testBuildIndex2db, or run java -cp Clinical-Text-Mining-0.0.1-SNAPSHOT-jar-with-dependencies.jar:/data/ra/stanford-corenlp-3.6.0-models.jar com.votors.umls.BuildTargetTerm in a terminal. This creates an index table from the UMLS database.
Good luck and enjoy it!
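
To illustrate how resources are located relative to CTM_ROOT_PATH, here is a minimal Python sketch. The helper name `resolve_resource` is hypothetical and the project's own lookup code may behave differently; the sketch only shows the idea of joining a relative path onto the configured root and failing early when the variable is missing.

```python
import os
from pathlib import Path


def resolve_resource(relative_path, env=None):
    """Resolve a path relative to CTM_ROOT_PATH (hypothetical helper)."""
    env = os.environ if env is None else env
    root = env.get("CTM_ROOT_PATH")
    if not root:
        # Fail loudly so a missing variable is noticed before a run starts.
        raise RuntimeError("CTM_ROOT_PATH is not set")
    return Path(root) / relative_path
```

For example, with CTM_ROOT_PATH set to /data/ra/Clinical-Text-Mining, `resolve_resource("conf/default.properties")` points at the conf/default.properties file inside that directory.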

Run for CHV paper

You can run the class com.votors.ml.Clustering directly in IDEA. You will probably need to raise the maximum memory (e.g. -Xmx5000m). Alternatively, submit it to a Spark cluster:

spark-submit --master spark://127.0.0.1:7077 --deploy-mode cluster --num-executors 9 --driver-memory 10g --executor-memory 2048m \
 --executor-cores 1 --class com.votors.ml.Clustering \
 --conf 'spark.executor.extraJavaOptions=-DCTM_ROOT_PATH=/data/ra/Clinical-Text-Mining -cp /data/ra/stanford-corenlp-3.6.0-models.jar:/usr/bin/spark/spark-run/conf/:/usr/bin/spark/spark-run/lib/spark-assembly-1.6.0-hadoop2.6.0.jar:/usr/bin/spark/spark-run/lib/datanucleus-core-3.2.10.jar:/usr/bin/spark/spark-run/lib/datanucleus-rdbms-3.2.9.jar:/usr/bin/spark/spark-run/lib/datanucleus-api-jdo-3.2.6.jar' \
 --driver-java-options=-DCTM_ROOT_PATH=/data/ra/Clinical-Text-Mining \
 --files /data/ra/Clinical-Text-Mining/conf/default.properties,/data/ra/Clinical-Text-Mining/conf/current.properties \
 /data/ra/Clinical-Text-Mining/target/Clinical-Text-Mining-0.0.1-SNAPSHOT-jar-with-dependencies.jar > result_yahoo_rank.txt 2>spark.log

