# Imports

In [1]:
import numpy as np
import pandas as pd
import sys
libraries = (('Numpy', np), ('Pandas', pd))

print("Python Version:", sys.version, '\n')
for lib in libraries:
    print('{0} Version: {1}'.format(lib[0], lib[1].__version__))

Python Version: 3.6.5 |Anaconda, Inc.| (default, Apr 29 2018, 16:14:56) 
[GCC 7.2.0] 

Numpy Version: 1.16.4
Pandas Version: 0.23.0


In [2]:
import CollectData as cd
import GenerateFeatures as gf
import VWTxt as vt

sys.path.append('../MyModules/')
import UnZipper as uz

# Unzip Data Files
Data for this project come from the [Genia version 3.02 corpus](http://www.nactem.ac.uk/tsujii/GENIA/ERtask/report.html), which contains abstracts found on Medline. The designated training set contains 2000 abstracts, and the designated test set contains 404 abstracts.

In [3]:
# uz.untar('Genia4ERtraining.tar.gz', 'data')

In [4]:
# uz.untar('Genia4ERtest.tar.gz', 'data')

# Generate Train & Test txt files
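I will approach this project in two ways:
1. One model covering all 5 entities, which gives 11 classes (a "B-" and an "I-" tag for each entity, plus "O").
2. Five separate models, one per entity, which gives 3 classes per model (the entity's "B-" tag, its "I-" tag, and everything else).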
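
As a quick check on the class counts in the list above, the label sets can be enumerated directly. This is only an illustrative sketch: the names `entities`, `all_labels`, and `per_model_labels` are hypothetical and are not part of the project modules.

```python
# The five entity types that appear in the IOB tags of this dataset.
entities = ["DNA", "protein", "cell_type", "cell_line", "RNA"]

# Approach 1: a single "O" label plus one "B-" and one "I-" label per entity.
all_labels = ["O"] + [
    f"{prefix}-{entity}" for entity in entities for prefix in ("B", "I")
]
print(len(all_labels))  # 1 + 5 * 2 labels

# Approach 2: each model only separates "everything else" from the
# beginning and inside tags of its own target entity.
per_model_labels = {
    entity: ["other", f"B-{entity}", f"I-{entity}"] for entity in entities
}
print({entity: len(labels) for entity, labels in per_model_labels.items()})
```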

The original dataset includes Medline abstract IDs and rows of words with their [IOB tags](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)). To generate features for each word, the words must first be grouped back into paragraph structure, preserving the sentence boundaries. The texts were then tokenized with spaCy. Each spaCy token is a rich object containing the original text, lemma, POS tag, entity type, and more. This spaCy-generated information for each word serves as the features for the Vowpal Wabbit models.
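
The regrouping step can be sketched roughly as follows. This is not the code in `CollectData`; `group_iob_lines` is a hypothetical helper, and it assumes a file layout in which each abstract starts with a `###MEDLINE:<id>` line, each token row is `word<TAB>tag`, and a blank line ends a sentence. If the real files differ, the parsing would need to change accordingly.

```python
def group_iob_lines(lines):
    """Group raw IOB2 rows into {abstract_id: [[(word, tag), ...], ...]}.

    Assumed layout (not confirmed by the notebook): "###MEDLINE:<id>" opens
    an abstract, "word<TAB>tag" is one token, a blank line ends a sentence.
    """
    abstracts = {}
    current_id = None
    sentence = []

    def flush():
        # Close the sentence in progress, if there is one.
        if sentence:
            abstracts.setdefault(current_id, []).append(list(sentence))
            sentence.clear()

    for raw in lines:
        line = raw.strip()
        if line.startswith("###MEDLINE:"):
            flush()
            current_id = line.split(":", 1)[1]
            abstracts.setdefault(current_id, [])
        elif not line:
            flush()
        else:
            word, tag = line.rsplit("\t", 1)
            sentence.append((word, tag))
    flush()
    return abstracts


# Tiny made-up sample in the assumed layout.
sample = [
    "###MEDLINE:1\n", "\n",
    "IL-2\tB-DNA\n", "gene\tI-DNA\n", "expression\tO\n", "\n",
    "It\tO\n", "works\tO\n", "\n",
]
grouped = group_iob_lines(sample)
print(grouped)
```

Keeping sentences as separate lists means the text can later be rejoined sentence by sentence before tokenization, without losing the original word-to-tag pairing.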

## Approach 1: All 5 entities

To generate a train txt with all 5 entities, a dictionary mapping IOB tags to number labels does not need to be provided; the function creates one automatically as it generates the train txt. The test txt and train txt must use the same mapping of IOB tags to number labels, however. So, when generating the test txt, the IOB dictionary created during train txt generation must be passed to the test txt generation.
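
The idea of building the mapping on the training data and reusing it unchanged on the test data can be sketched like this. The function `encode_tags` is hypothetical and is not how `vt.make_vw_txt` is necessarily implemented; it only illustrates the contract described above, with labels starting at 1 as in the printed dictionary.

```python
def encode_tags(tags, tag_to_label=None):
    """Map IOB tags to integer labels.

    If no mapping is given, one is built on the fly (training case).
    If a mapping is given, it is treated as fixed (test case).
    """
    frozen = tag_to_label is not None
    mapping = dict(tag_to_label) if frozen else {}
    labels = []
    for tag in tags:
        if tag not in mapping:
            if frozen:
                raise KeyError(f"tag {tag!r} was not seen during training")
            mapping[tag] = len(mapping) + 1  # labels start at 1
        labels.append(mapping[tag])
    return labels, mapping


# Training: the mapping is created as tags are encountered.
train_labels, train_map = encode_tags(["O", "B-DNA", "I-DNA", "O"])

# Testing: the training mapping is passed in, so labels stay consistent.
test_labels, _ = encode_tags(["B-DNA", "O"], train_map)
print(train_map, train_labels, test_labels)
```

Raising on an unseen tag is one possible design choice; it makes a train/test label mismatch fail loudly rather than silently assigning a new class.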

### 1.1 Train Txt

In [5]:
txtlines = cd.collect_lines_from_file('data/Genia4ERtask2.iob2')
data = cd.grab_words_tags_from_lines(txtlines)
train_data = gf.split_text_to_match_tokens(data)

train_IOB_dict = vt.make_vw_txt('data/vw_train.txt', list(train_data.keys()), train_data)
print(train_IOB_dict)

2000
{'O': 1, 'B-DNA': 2, 'I-DNA': 3, 'B-protein': 4, 'I-protein': 5, 'B-cell_type': 6, 'I-cell_type': 7, 'B-cell_line': 8, 'I-cell_line': 9, 'B-RNA': 10, 'I-RNA': 11}


### 1.2 Test Txt

In [6]:
txtlines = cd.collect_lines_from_file('data/Genia4EReval2.iob2')
data = cd.grab_words_tags_from_lines(txtlines)
test_data = gf.split_text_to_match_tokens(data)

test_IOB_dict = vt.make_vw_txt('data/vw_test.txt', list(test_data.keys()), test_data, train_IOB_dict)

404


## Approach 2: 1 entity at a time

A pair of train & test txts must be generated for each model that predicts only 1 entity at a time. Both txts must use the same dictionary of IOB tags. Within that dictionary, the "B-" tag (beginning of the targeted entity) must be labeled "2" and the "I-" tag (inside of the targeted entity) must be labeled "3". All other tags must be labeled "1".
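
The labeling rule above is the same for every entity, so the dictionaries written out by hand in the following cells could also be produced by a small helper. This is an optional sketch; `single_entity_iob_dict` and `ENTITIES` are hypothetical names, not part of `VWTxt`.

```python
ENTITIES = ("protein", "cell_line", "cell_type", "RNA", "DNA")


def single_entity_iob_dict(target, entities=ENTITIES):
    """Build an IOB-tag-to-label dict for a model that predicts one entity.

    The target's "B-" tag maps to 2, its "I-" tag maps to 3, and every
    other tag (including "O") maps to 1.
    """
    if target not in entities:
        raise ValueError(f"unknown entity: {target!r}")
    iob_dict = {"O": 1}
    for entity in entities:
        is_target = entity == target
        iob_dict[f"B-{entity}"] = 2 if is_target else 1
        iob_dict[f"I-{entity}"] = 3 if is_target else 1
    return iob_dict


print(single_entity_iob_dict("protein"))
```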

### 2.1 Protein

In [7]:
# Protein is the target entity; every other tag collapses to label 1.
IOB_dict = {'O': 1, 'B-protein': 2, 'I-protein': 3,
            'B-cell_line': 1, 'I-cell_line': 1,
            'B-cell_type': 1, 'I-cell_type': 1,
            'B-RNA': 1, 'I-RNA': 1,
            'B-DNA': 1, 'I-DNA': 1}
train_address = 'data/protein_train.txt'
test_address = 'data/protein_test.txt'

_ = vt.make_vw_txt(train_address, list(train_data.keys()), train_data, IOB_dict)
_ = vt.make_vw_txt(test_address, list(test_data.keys()), test_data, IOB_dict)

### 2.2 Cell line

In [8]:
# Cell line is the target entity; every other tag collapses to label 1.
IOB_dict = {'O': 1, 'B-protein': 1, 'I-protein': 1,
            'B-cell_line': 2, 'I-cell_line': 3,
            'B-cell_type': 1, 'I-cell_type': 1,
            'B-RNA': 1, 'I-RNA': 1,
            'B-DNA': 1, 'I-DNA': 1}
train_address = 'data/cellline_train.txt'
test_address = 'data/cellline_test.txt'

_ = vt.make_vw_txt(train_address, list(train_data.keys()), train_data, IOB_dict)
_ = vt.make_vw_txt(test_address, list(test_data.keys()), test_data, IOB_dict)

### 2.3 Cell type

In [9]:
# Cell type is the target entity; every other tag collapses to label 1.
IOB_dict = {'O': 1, 'B-protein': 1, 'I-protein': 1,
            'B-cell_line': 1, 'I-cell_line': 1,
            'B-cell_type': 2, 'I-cell_type': 3,
            'B-RNA': 1, 'I-RNA': 1,
            'B-DNA': 1, 'I-DNA': 1}
train_address = 'data/celltype_train.txt'
test_address = 'data/celltype_test.txt'

_ = vt.make_vw_txt(train_address, list(train_data.keys()), train_data, IOB_dict)
_ = vt.make_vw_txt(test_address, list(test_data.keys()), test_data, IOB_dict)

### 2.4 RNA

In [10]:
# RNA is the target entity; every other tag collapses to label 1.
IOB_dict = {'O': 1, 'B-protein': 1, 'I-protein': 1,
            'B-cell_line': 1, 'I-cell_line': 1,
            'B-cell_type': 1, 'I-cell_type': 1,
            'B-RNA': 2, 'I-RNA': 3,
            'B-DNA': 1, 'I-DNA': 1}
train_address = 'data/RNA_train.txt'
test_address = 'data/RNA_test.txt'

_ = vt.make_vw_txt(train_address, list(train_data.keys()), train_data, IOB_dict)
_ = vt.make_vw_txt(test_address, list(test_data.keys()), test_data, IOB_dict)

### 2.5 DNA

In [11]:
# DNA is the target entity; every other tag collapses to label 1.
IOB_dict = {'O': 1, 'B-protein': 1, 'I-protein': 1,
            'B-cell_line': 1, 'I-cell_line': 1,
            'B-cell_type': 1, 'I-cell_type': 1,
            'B-RNA': 1, 'I-RNA': 1,
            'B-DNA': 2, 'I-DNA': 3}
train_address = 'data/DNA_train.txt'
test_address = 'data/DNA_test.txt'

_ = vt.make_vw_txt(train_address, list(train_data.keys()), train_data, IOB_dict)
_ = vt.make_vw_txt(test_address, list(test_data.keys()), test_data, IOB_dict)