Word embeddings capture latent knowledge discovery

Setup

# create a virtual environment
python3 -m venv venv
# activate it
source venv/bin/activate
# install the requirements
pip3 install --ignore-installed -r requirements.txt

Change the search strings in `search_strings.txt` (optional)

Run crawler or download results

mkdir results
python3 crawler.py

Merge abstract files

This generates results_file.txt from the abstracts in the results folder.

python3 merge_txt.py
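
As a rough illustration of what this merge step amounts to, the sketch below concatenates every .txt file in a folder into a single output file. It is not the code of merge_txt.py: the function name `merge_abstracts`, the one-abstract-per-line layout, and the default paths are assumptions made for the example.

```python
from pathlib import Path


def merge_abstracts(results_dir="results", output_file="results_file.txt"):
    """Concatenate every .txt file in results_dir into output_file.

    Hypothetical helper: the real merge_txt.py may order or filter
    the files differently.
    """
    results_path = Path(results_dir)
    output_path = Path(output_file)
    merged = 0
    with output_path.open("w", encoding="utf-8") as out:
        # Sort the file names so the output order is reproducible.
        for txt_file in sorted(results_path.glob("*.txt")):
            if txt_file.resolve() == output_path.resolve():
                continue  # skip the output file if it lives in the same folder
            text = txt_file.read_text(encoding="utf-8").strip()
            if text:
                # Collapse internal whitespace so each abstract is one line.
                out.write(" ".join(text.split()) + "\n")
                merged += 1
    return merged
```

Calling `merge_abstracts()` from the repository root would, under these assumptions, return the number of non-empty abstracts written.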

Clean abstract files

This will generate results_file_clean.txt

python3 clean_text.py
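
The exact cleaning rules live in clean_text.py and are not described here. As a minimal sketch of a typical cleaning pass for English abstracts, the example below lowercases the text, strips punctuation, and drops stop words and purely numeric tokens. The function name `clean_abstract`, the regular expression, and the stop-word handling are all assumptions.

```python
import re


def clean_abstract(text, stop_words=frozenset()):
    """Lowercase, strip punctuation, and drop stop words and bare numbers.

    Hypothetical helper: these rules are illustrative and are not taken
    from clean_text.py. Non-ASCII letters are discarded by the pattern.
    """
    text = text.lower()
    # Keep letters, digits, and hyphens; turn everything else into spaces.
    text = re.sub(r"[^a-z0-9\-]+", " ", text)
    # Trim stray hyphens left at token boundaries (e.g. "-induced").
    tokens = [tok.strip("-") for tok in text.split()]
    return " ".join(
        tok for tok in tokens
        if tok and not tok.isdigit() and tok not in stop_words
    )
```

Hyphens are kept inside tokens so that compound drug names such as "ara-c" survive as a single word, which matters when each token later gets its own vector.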

Word embeddings

Word2Vec

Training the model

cd word2vec
python3 train.py

Loading the model

from gensim.models import Word2Vec
model = Word2Vec.load('model.bin')

Access vector for one word

model.wv['cytarabin']

GloVe

git clone http://github.com/stanfordnlp/glove
cd glove && make

Make the following changes to demo.sh:

Remove the block from `if` to `fi` that immediately follows `make`
Set the CORPUS variable to "../results_file_clean.txt"
In the line `if [ "$CORPUS" = 'text8' ]; then`, replace text8 with "../results_file_clean.txt"

Training the model

./demo.sh

The word vectors will be written to vectors.txt
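
To inspect the GloVe output without extra dependencies, a small loader such as the sketch below may help. It assumes the usual GloVe text layout (one token per line followed by space-separated floats); the helper names `load_glove_vectors` and `cosine_similarity` are made up for this example and are not part of the repository.

```python
import math


def load_glove_vectors(path="vectors.txt"):
    """Read a GloVe text file into a dict mapping word -> list of floats."""
    vectors = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            parts = line.rstrip().split(" ")
            if len(parts) < 2:
                continue  # skip blank or malformed lines
            vectors[parts[0]] = [float(value) for value in parts[1:]]
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as having no similarity
    return dot / (norm_a * norm_b)
```

After `vectors = load_glove_vectors()`, two tokens can be compared with `cosine_similarity(vectors[w1], vectors[w2])`, provided both are in the vocabulary.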

Create inputs for Projector inside tensorboard_inputs folder

After training, choose which model (glove or word2vec) to convert to TensorBoard format by passing its name as the first argument. As the second argument, pass the number of common words that will not be plotted. This number must be between 1 and 333,333; the default is 10,000.

python3 to_tensorboard_format.py glove <your number>

or

python3 to_tensorboard_format.py word2vec <your number>
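
The Embedding Projector reads a tab-separated vectors file plus a metadata file with one label per line, in the same order. The sketch below shows one way such a pair could be written while skipping a set of common words. It is not the code of to_tensorboard_format.py; the function name `write_projector_files`, its parameters, and the output paths are assumptions.

```python
def write_projector_files(vectors, skip_words, vectors_path, metadata_path):
    """Write a vectors TSV and a matching one-label-per-line metadata file.

    Hypothetical helper. `vectors` maps word -> sequence of floats and
    `skip_words` holds the common words to leave out of the plot.
    """
    skip = set(skip_words)
    written = 0
    with open(vectors_path, "w", encoding="utf-8") as vec_out, \
            open(metadata_path, "w", encoding="utf-8") as meta_out:
        for word, vector in vectors.items():
            if word in skip:
                continue
            # Row i of the vectors file must line up with row i of the metadata.
            vec_out.write("\t".join(f"{value:.6f}" for value in vector) + "\n")
            meta_out.write(word + "\n")
            written += 1
    return written
```

A single-column metadata file carries no header row, which is why the labels are written directly.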

priscilaportela/WE4LKD-leukemia