**Experiment 1**

This notebook contains experiments that test whether a smaller but more focused dataset improves the word2vec model's ability to extract terms related to the ASDPTO and HPO ontologies.
The focused autism dataset is compared with a dataset expanded with an unrelated disease (asthma) and with one expanded with a related disease (dyslexia).

In [8]:
%load_ext autoreload
%autoreload 2

import word2vec
import gensim
import sys
sys.path.append("../../")
import pm_parser
import evaluation

def train_model(dataset="Training", model='vectors-phrase.bin', th1=100, th2=50, cbow=0, size=200, min_count=1, window=10, sample="1e-5", negative=25, threads=20, iters=15):
    # Map typographic apostrophes to ASCII, then replace every character outside A-Za-z ' _ space newline with a space.
    !sed -e "s/’/'/g" -e "s/′/'/g" -e "s/''/ /g" < {dataset} | tr -c "A-Za-z'_ \n" " " > Training-norm0
    print('Finished preprocessing 1')
    # Two word2phrase passes: the second can extend bigrams found by the first.
    word2vec.word2phrase('Training-norm0', 'Training-norm0-phrase0', threshold=th1, verbose=True)
    print('Finished word2phrase 1')
    word2vec.word2phrase('Training-norm0-phrase0', 'Training-norm0-phrase1', threshold=th2, verbose=True)
    print('Finished word2phrase 2')
    # Lowercase the phrase-merged corpus.
    !tr A-Z a-z < Training-norm0-phrase1 > Training-norm1-phrase1
    print('Finished preprocessing 2')
    # Train the embeddings (cbow=0 selects skip-gram).
    word2vec.word2vec('Training-norm1-phrase1', model, cbow=cbow, size=size, window=window, min_count=min_count, sample=sample, negative=negative, threads=threads, iter_=iters, verbose=True)
    print('Finished model training')
    # Delete the intermediate files.
    !rm Training-norm0 Training-norm0-phrase0 Training-norm0-phrase1 Training-norm1-phrase1
    print('Removed redundant files')


The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
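
The shell normalisation step inside `train_model` can be hard to read. The sketch below is an approximate pure-Python equivalent of the `sed | tr` pipeline; the function name `normalise` is hypothetical and not part of the notebook. One known difference: `tr` typically works on bytes, so a multi-byte character may become several spaces, whereas this sketch emits one space per character.

```python
import re

# Everything outside the set kept by `tr -c "A-Za-z'_ \n" " "` becomes a space.
_NOT_ALLOWED = re.compile(r"[^A-Za-z'_ \n]")


def normalise(text):
    """Approximate the sed | tr normalisation used in train_model (hypothetical helper)."""
    # sed: map the right single quotation mark (U+2019) and the prime (U+2032) to an ASCII apostrophe.
    text = text.replace("\u2019", "'").replace("\u2032", "'")
    # sed: a doubled apostrophe becomes a single space.
    text = text.replace("''", " ")
    # tr -c: replace every remaining disallowed character with a space.
    return _NOT_ALLOWED.sub(" ", text)
```

Digits and punctuation are therefore blanked out before phrase detection, while underscores survive so that phrase tokens joined by `_` stay intact.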


In [18]:
def train_phrase(dataset="Training", out="Training-norm1-phrase1", th1=100, th2=50):
    # Same normalisation and phrase detection as train_model, without training embeddings.
    !sed -e "s/’/'/g" -e "s/′/'/g" -e "s/''/ /g" < {dataset} | tr -c "A-Za-z'_ \n" " " > Training-norm0
    print('Finished preprocessing 1')
    word2vec.word2phrase('Training-norm0', 'Training-norm0-phrase0', threshold=th1, verbose=True)
    print('Finished word2phrase 1')
    word2vec.word2phrase('Training-norm0-phrase0', 'Training-norm0-phrase1', threshold=th2, verbose=True)
    print('Finished word2phrase 2')
    # Lowercase the phrase-merged corpus and write it to `out`.
    !tr A-Z a-z < Training-norm0-phrase1 > {out}
    print('Finished preprocessing 2')
    # Delete the intermediate files.
    !rm Training-norm0 Training-norm0-phrase0 Training-norm0-phrase1
    print('Removed redundant files')
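
Both functions run phrase detection twice, with thresholds `th1=100` and `th2=50`. The sketch below illustrates what one such pass does, assuming the scoring rule of the reference word2phrase tool, `(count(a b) - min_count) / (count(a) * count(b)) * total_words`. The function `merge_phrases` is hypothetical, ignores line boundaries, and is not the code the `word2vec` package runs.

```python
from collections import Counter


def merge_phrases(tokens, threshold=100.0, min_count=5):
    """One phrase-detection pass: join adjacent tokens whose score exceeds `threshold`."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens):
            a, b = tokens[i], tokens[i + 1]
            # Frequent pairs of individually rare words score highest.
            score = (bigrams[(a, b)] - min_count) / (unigrams[a] * unigrams[b]) * total
            if score > threshold:
                out.append(a + "_" + b)
                i += 2  # the merged pair is consumed; the next token starts fresh
                continue
        out.append(tokens[i])
        i += 1
    return out


# Toy corpus: "autism spectrum" recurs, every filler word occurs once.
tokens = []
for k in range(10):
    tokens += ["autism", "spectrum", "w%d" % k]
merged = merge_phrases(tokens, threshold=1.0)
```

Running a second pass over the output of the first lets an already merged token such as `autism_spectrum` combine with a neighbour, which is how phrases longer than two words can arise.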

*1a) Autism dataset*

In this experiment the word2vec model is trained on abstracts annotated with **only** autism-related MeSH terms.


In [2]:
dataset = '/afs/inf.ed.ac.uk/group/project/biomednlp/msc2019/Shibo/word2vec/examples/Training'

In [3]:
train_model('../../../Shibo/word2vec/examples/Training', 'autism_model.bin')

Finished preprocessing 1
Starting training using file Training-norm0
Words processed: 4400K     Vocab size: 646K  
Vocab size (unigrams + bigrams): 402225
Words in train file: 4475947
Finished word2phrase 1
Starting training using file Training-norm0-phrase0
Words processed: 4000K     Vocab size: 728K  
Vocab size (unigrams + bigrams): 444758
Words in train file: 4091230
Finished word2phrase 2
Finished preprocessing 2
Starting training using file Training-norm1-phrase1
Vocab size: 57901
Words in train file: 3843255
Alpha: 0.000002  Progress: 100.04%  Words/thread/sec: 15.62k  Finished model training
Removed reduntant files
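
The models are trained with `sample="1e-5"`, which controls how aggressively frequent words are down-sampled. As a rough illustration, the sketch below computes the probability that a single occurrence of a word is kept, assuming the rule used by the reference word2vec C tool. The helper `keep_probability` is hypothetical, and the corpus size is the "Words in train file" value reported above.

```python
import math


def keep_probability(count, total_words, sample=1e-5):
    """Probability that one occurrence of a word survives subsampling (assumed rule)."""
    if sample <= 0:
        return 1.0  # subsampling disabled
    t = sample * total_words
    return min(1.0, (math.sqrt(count / t) + 1) * t / count)


TOTAL_WORDS = 3843255  # autism corpus after phrase merging and lowercasing

rare = keep_probability(38, TOTAL_WORDS)        # a word seen 38 times
frequent = keep_probability(384326, TOTAL_WORDS)  # a word making up about 10% of the corpus
```

Under this assumption rare words are always kept, while very frequent words are dropped most of the time, which shifts the training signal toward domain-specific terms.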


In [None]:
#evaluate here

*1b) Autism + Dyslexia dataset*

In this experiment the word2vec model is trained on abstracts annotated with MeSH terms for either of two related diseases (autism or dyslexia).

In [4]:
train_model('../../../Shibo/dataset/dysiexia_Training.txt', 'dyslexia_model.bin')

Finished preprocessing 1
Starting training using file Training-norm0
Words processed: 1500K     Vocab size: 251K  
Vocab size (unigrams + bigrams): 155301
Words in train file: 1528798
Finished word2phrase 1
Starting training using file Training-norm0-phrase0
Words processed: 1400K     Vocab size: 273K  
Vocab size (unigrams + bigrams): 166372
Words in train file: 1425128
Finished word2phrase 2
Finished preprocessing 2
Starting training using file Training-norm1-phrase1
Vocab size: 23749
Words in train file: 1346446
Alpha: 0.000002  Progress: 100.07%  Words/thread/sec: 18.40k  Finished model training
Removed reduntant files


In [None]:
#evaluate here

*1c) Autism + Asthma dataset*

In this experiment the word2vec model is trained on abstracts annotated with MeSH terms for either of two weakly related diseases (autism or asthma).

In [5]:
train_model('../../../Shibo/dataset/Asthma_Training.txt', 'asthma_model.bin')

Finished preprocessing 1
Starting training using file Training-norm0
Words processed: 1400K     Vocab size: 256K  
Vocab size (unigrams + bigrams): 156248
Words in train file: 1430491
Finished word2phrase 1
Starting training using file Training-norm0-phrase0
Words processed: 1300K     Vocab size: 278K  
Vocab size (unigrams + bigrams): 167174
Words in train file: 1320260
Finished word2phrase 2
Finished preprocessing 2
Starting training using file Training-norm1-phrase1
Vocab size: 25472
Words in train file: 1254264
Alpha: 0.000002  Progress: 100.09%  Words/thread/sec: 17.66k  Finished model training
Removed reduntant files


In [None]:
#evaluate here

*1d) Mixed dataset*

In this experiment the word2vec model is trained on a mixed dataset of abstracts annotated with autism, asthma, or dyslexia MeSH terms.

In [4]:
word2vec.word2vec('../../../Shibo/dataset/Training-norm1-phrase1', 'mixed_model.bin', cbow=0, size=150, window=10, min_count=1, sample="1e-5", negative=25, threads=20, iter_=15, verbose=True)
print('Finished model training')

Starting training using file ../../../Shibo/dataset/Training-norm1-phrase1
Vocab size: 57583
Words in train file: 3850548
Alpha: 0.000002  Progress: 100.05%  Words/thread/sec: 16.54k  Finished model training


In [None]:
#evaluate here

**FULL FOLDER EVALUATION**

In [6]:
evaluation.evaluate_folder('/', 'experiment1_gold_results.csv', ext='.bin',annotated_testset_file="../../annotated_GOLD_testset.json", phenotypes_pickle="../../onto_tokens.pickle", mwe_tokens_pickle="../../mwe_tokens.pickle")


Number of models to evaluate: 3



****finished****
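
The evaluation calls in this notebook pass values such as `ext='.bin'`, `ext='gold.bin'` and `ext='mixed_model.bin'`, and the reported model counts are consistent with `ext` being matched as a file-name suffix. The implementation of `evaluation.evaluate_folder` is not shown here, so the sketch below is only an assumption about that selection step; `find_models` is a hypothetical helper.

```python
from pathlib import Path


def find_models(folder, ext=".bin"):
    """Return the sorted names of files in `folder` whose name ends with `ext`."""
    return sorted(
        p.name
        for p in Path(folder).iterdir()
        if p.is_file() and p.name.endswith(ext)
    )
```

With this reading, a longer suffix narrows the selection: `ext='mixed_model.bin'` would pick a single model from a folder that also holds the other `.bin` files.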





In [7]:
evaluation.evaluate_folder('/', 'experiment1_ncbo_results.csv', ext='.bin',annotated_testset_file="../../annotated_testset.json", phenotypes_pickle="../../onto_tokens.pickle", mwe_tokens_pickle="../../mwe_tokens.pickle")


Number of models to evaluate: 3




****finished****





In [6]:
evaluation.evaluate_folder('/', 'experiment1_gold_results.csv', ext='mixed_model.bin',annotated_testset_file="../../annotated_GOLD_testset.json", phenotypes_pickle="../../onto_tokens.pickle", mwe_tokens_pickle="../../mwe_tokens.pickle")


Number of models to evaluate: 1



****finished****





In [9]:
evaluation.evaluate_folder('/', 'experiment1_ncbo_results.csv', ext='mixed_model.bin',annotated_testset_file="../annotated_mixed_testset.json", phenotypes_pickle="../../onto_tokens.pickle", mwe_tokens_pickle="../../mwe_tokens.pickle")


Number of models to evaluate: 1




****finished****





**GOLD**

In [9]:
train_model('../../../Shibo/word2vec/examples/Training', 'autism_model_gold.bin', window=20, min_count=0, size=150)

Finished preprocessing 1
Starting training using file Training-norm0
Words processed: 4400K     Vocab size: 646K  
Vocab size (unigrams + bigrams): 402225
Words in train file: 4475947
Finished word2phrase 1
Starting training using file Training-norm0-phrase0
Words processed: 4000K     Vocab size: 728K  
Vocab size (unigrams + bigrams): 444758
Words in train file: 4091230
Finished word2phrase 2
Finished preprocessing 2
Starting training using file Training-norm1-phrase1
Vocab size: 57901
Words in train file: 3843255
Alpha: 0.000002  Progress: 100.04%  Words/thread/sec: 11.23k  Finished model training
Removed reduntant files


In [10]:
train_model('../../../Shibo/dataset/dysiexia_Training.txt', 'dyslexia_model_gold.bin', window=20, min_count=0, size=150)

Finished preprocessing 1
Starting training using file Training-norm0
Words processed: 1500K     Vocab size: 251K  
Vocab size (unigrams + bigrams): 155301
Words in train file: 1528798
Finished word2phrase 1
Starting training using file Training-norm0-phrase0
Words processed: 1400K     Vocab size: 273K  
Vocab size (unigrams + bigrams): 166372
Words in train file: 1425128
Finished word2phrase 2
Finished preprocessing 2
Starting training using file Training-norm1-phrase1
Vocab size: 23749
Words in train file: 1346446
Alpha: 0.000005  Progress: 100.03%  Words/thread/sec: 13.56k  Finished model training
Removed reduntant files


In [12]:
train_model('../../../Shibo/dataset/Asthma_Training.txt', 'asthma_model_gold.bin', window=20, min_count=0, size=150)

Finished preprocessing 1
Starting training using file Training-norm0
Words processed: 1400K     Vocab size: 256K  
Vocab size (unigrams + bigrams): 156248
Words in train file: 1430491
Finished word2phrase 1
Starting training using file Training-norm0-phrase0
Words processed: 1300K     Vocab size: 278K  
Vocab size (unigrams + bigrams): 167174
Words in train file: 1320260
Finished word2phrase 2
Finished preprocessing 2
Starting training using file Training-norm1-phrase1
Vocab size: 25472
Words in train file: 1254264
Alpha: 0.000002  Progress: 100.09%  Words/thread/sec: 13.30k  Finished model training
Removed reduntant files


In [14]:
# Note: window=200 here, whereas the other gold models above use window=20.
word2vec.word2vec('../../../Shibo/dataset/Training-norm1-phrase1', 'mixed_model_gold.bin', cbow=0, size=150, window=200, min_count=0, sample="1e-5", negative=25, threads=20, iter_=15, verbose=True)
print('Finished model training')

Starting training using file ../../../Shibo/dataset/Training-norm1-phrase1
Vocab size: 57583
Words in train file: 3850548
Alpha: 0.000002  Progress: 100.05%  Words/thread/sec: 4.06k  Finished model training


In [13]:
evaluation.evaluate_folder('/', 'experiment1_gold_results_gfix.csv', ext='gold.bin',annotated_testset_file="../../annotated_GOLD_testset.json", phenotypes_pickle="../../onto_tokens.pickle", mwe_tokens_pickle="../../mwe_tokens.pickle")


Number of models to evaluate: 3



****finished****





In [15]:
evaluation.evaluate_folder('/', 'experiment1_gold_results_gfix.csv', ext='mixed_model_gold.bin',annotated_testset_file="../../annotated_GOLD_testset.json", phenotypes_pickle="../../onto_tokens.pickle", mwe_tokens_pickle="../../mwe_tokens.pickle")


Number of models to evaluate: 1



****finished****





**PHRASES FOR GLOVE**

In [20]:
train_phrase('../../../Shibo/word2vec/examples/Training', out='autism-phrase1')

Finished preprocessing 1
Starting training using file Training-norm0
Words processed: 4400K     Vocab size: 646K  
Vocab size (unigrams + bigrams): 402225
Words in train file: 4475947
Finished word2phrase 1
Starting training using file Training-norm0-phrase0
Words processed: 4000K     Vocab size: 728K  
Vocab size (unigrams + bigrams): 444758
Words in train file: 4091230
Finished word2phrase 2
Finished preprocessing 2
Removed reduntant files


In [21]:
train_phrase('../../../Shibo/dataset/dysiexia_Training.txt', out='dyslexia-phrase1')

Finished preprocessing 1
Starting training using file Training-norm0
Words processed: 1500K     Vocab size: 251K  
Vocab size (unigrams + bigrams): 155301
Words in train file: 1528798
Finished word2phrase 1
Starting training using file Training-norm0-phrase0
Words processed: 1400K     Vocab size: 273K  
Vocab size (unigrams + bigrams): 166372
Words in train file: 1425128
Finished word2phrase 2
Finished preprocessing 2
Removed reduntant files


In [22]:
train_phrase('../../../Shibo/dataset/Asthma_Training.txt', out='asthma-phrase1')

Finished preprocessing 1
Starting training using file Training-norm0
Words processed: 1400K     Vocab size: 256K  
Vocab size (unigrams + bigrams): 156248
Words in train file: 1430491
Finished word2phrase 1
Starting training using file Training-norm0-phrase0
Words processed: 1300K     Vocab size: 278K  
Vocab size (unigrams + bigrams): 167174
Words in train file: 1320260
Finished word2phrase 2
Finished preprocessing 2
Removed reduntant files
