# Split the Annotated Dataset into Training and Testing Sets

This notebook demonstrates how to make the split. Because of space constraints, we did not include all the discharge summaries. We do include all the positive documents (those with a family history of either breast cancer or colon cancer), plus a random sample of the other documents. Since this is for demonstration purposes only, let's assume that this is the full corpus.

This is what the corpus zip file looks like (bc: breast cancer, cc: colon cancer):
![inside corpus zip file](../img/snapshot4.png)

Our goal is to split the dataset and create 4 zip files like these:
![split zip files](../img/snapshot5.png)

In [21]:
# import libraries
import csv
import random
import os
import shutil
import re
from zipfile import ZipFile, ZIP_DEFLATED
# because visual.py is in the parent directory, we need to add that directory to the path before importing it
import sys
sys.path.append("../")

from visual import scrollPrint

#### Set up the parameters here

In [22]:
# location of the annotated data
corpus_zip='../data/FHI.zip'
# percentage of the documents that goes into the training set
train_percentage=60

First, let's see how ZipFile lists the files inside 'corpus_zip':

In [23]:
with ZipFile(corpus_zip, 'r') as myzip:
    # scrollPrint can take a list of strings or a single string,
    # and prints it in a scrollable DIV (a type of HTML element)
    scrollPrint(myzip.namelist(),250)
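
If `scrollPrint` is not available, the same listing can be explored with plain Python. The following is a minimal sketch, assuming the archive stores its members under `bc/` and `cc/` folders as in the screenshot above. The in-memory archive and the document names (`doc1`, `doc2`) are made up for illustration and are not part of the real corpus.

```python
import io
from collections import Counter
from zipfile import ZipFile, ZIP_DEFLATED

# Build a tiny in-memory archive that mimics the assumed 'bc/' and 'cc/' layout.
# The member names below are hypothetical.
buffer = io.BytesIO()
with ZipFile(buffer, mode='w', compression=ZIP_DEFLATED) as demo_zip:
    for name in ['bc/doc1.txt', 'bc/doc1.ann', 'cc/doc2.txt', 'cc/doc2.ann']:
        demo_zip.writestr(name, 'placeholder content')

# namelist() returns the path of every member as a string.
with ZipFile(buffer, mode='r') as demo_zip:
    member_names = demo_zip.namelist()

# Count members per two-letter prefix (the first two characters of the path).
files_per_prefix = Counter(name[:2] for name in member_names)
print(member_names)
print(files_per_prefix)
```

Grouping on the first two characters of each path is one simple way to tell the breast cancer and colon cancer documents apart without unzipping anything.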


In general, it is much easier to unzip the file and operate directly on the unzipped files. However, in the Jupyter notebook GUI (without using code), it is very cumbersome to display or move these files around. Thus, we operate on the corpus inside the zip files. Zip files are also much easier to download if you want to.

(Another option would be to import the data into a database, which is common in practice.)

In [32]:
def splitter(corpus_zip, train_percent):
    # you are welcome to include all the documents; here we cap the number of documents per cancer type to limit the workload
    size = 100
    train_percent = 0.01 * train_percent
    corpus = {'bc': {1: [], 0: []},
              'cc': {1: [], 0: []}}
    with ZipFile(corpus_zip, 'r') as myzip:
        # Read the document-level annotations to tell whether each document is positive or negative,
        # because we want to split randomly within the positive docs and within the negative docs, not over the whole set (why?)
        for filename in myzip.namelist():
            prefix = filename[:2]
            if filename.endswith('.ann'):
                with myzip.open(filename) as annfile:
                    doc_anno_line = annfile.readline().decode("utf-8")
                    conclusion = 0 if doc_anno_line.split('\t')[1].startswith('NE') else 1
                    corpus[prefix][conclusion].append(filename[3:-4])

        for prefix in corpus.keys():
            train_zip = ZipFile(prefix + '_train.zip', mode='w', compression=ZIP_DEFLATED)
            test_zip = ZipFile(prefix + '_test.zip', mode='w', compression=ZIP_DEFLATED)
            try:
                subcorpus = corpus[prefix]
                # if you don't want to limit the size, skip the truncation of subcorpus[0] below
                random.shuffle(subcorpus[1])
                random.shuffle(subcorpus[0])
                # max(0, ...) avoids a negative slice bound when there are more than `size` positive documents
                subcorpus[0] = subcorpus[0][:max(0, size - len(subcorpus[1]))]
                # split among positive documents
                split_train_test(myzip, train_zip, test_zip, prefix, subcorpus[1], train_percent)
                # split among negative documents
                split_train_test(myzip, train_zip, test_zip, prefix, subcorpus[0], train_percent)

            finally:
                train_zip.close()
                test_zip.close()


def split_train_test(corpuszip, train_zip, test_zip, prefix, subcorpus, train_percent):
    splice = round(len(subcorpus) * train_percent)
    # add sampled training set from the documents
    add_files(corpuszip, train_zip, prefix, subcorpus[:splice])
    # add the rest of the documents into testing set
    add_files(corpuszip, test_zip, prefix, subcorpus[splice:])


def add_files(corpuszip, targetzip, prefix, filenames):
    print('write ' + str(len(filenames)) + ' files into file: ' + targetzip.filename)
    for doc_name in filenames:
        annfile = doc_name + '.ann'
        txtfile = doc_name + '.txt'
        targetzip.writestr(annfile, corpuszip.open(prefix + '/' + annfile).read())
        targetzip.writestr(txtfile, corpuszip.open(prefix + '/' + txtfile).read())
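
One possible answer to the "(why?)" in the comment above: splitting within each class keeps the share of positive documents roughly equal in the training and testing sets. The sketch below illustrates this with made-up document names and counts (10 positive, 90 negative). `stratified_split` is a hypothetical helper written for this demonstration, not a function from the notebook, although it follows the same shuffle-then-slice approach.

```python
import random


def stratified_split(positives, negatives, train_fraction, rng):
    """Shuffle and split each class separately, then merge the parts."""
    train, test = [], []
    for group in (positives, negatives):
        shuffled = list(group)
        rng.shuffle(shuffled)
        cut = round(len(shuffled) * train_fraction)
        train.extend(shuffled[:cut])
        test.extend(shuffled[cut:])
    return train, test


rng = random.Random(0)  # fixed seed, only so that the demo is repeatable
positives = ['pos%d' % i for i in range(10)]  # hypothetical positive documents
negatives = ['neg%d' % i for i in range(90)]  # hypothetical negative documents

train, test = stratified_split(positives, negatives, 0.6, rng)
train_pos = sum(name.startswith('pos') for name in train)
test_pos = sum(name.startswith('pos') for name in test)
print(train_pos, len(train), test_pos, len(test))
```

With these counts, both sets end up with 10% positive documents (6 of 60 and 4 of 40). A single shuffle over all 100 documents gives no such guarantee, and with a rare positive class the testing set could receive very few positives by chance.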

In [33]:
splitter(corpus_zip, train_percentage)

write 32 files into file: bc_train.zip
write 22 files into file: bc_test.zip
write 28 files into file: bc_train.zip
write 18 files into file: bc_test.zip
write 25 files into file: cc_train.zip
write 17 files into file: cc_test.zip
write 35 files into file: cc_train.zip
write 23 files into file: cc_test.zip
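
Because `splitter` shuffles without a fixed seed, the documents that land in each zip file can differ from run to run. If a repeatable split is needed, seeding the random number generator is one option. The sketch below uses a hypothetical helper, `shuffled_copy`, and made-up document names to show that the same seed reproduces the same order. In the notebook itself, calling `random.seed(...)` with a value of your choice before `splitter` should have a similar effect.

```python
import random


def shuffled_copy(names, seed):
    """Return a shuffled copy of names, using a generator with a fixed seed."""
    rng = random.Random(seed)
    result = list(names)
    rng.shuffle(result)
    return result


doc_names = ['doc%d' % i for i in range(20)]  # hypothetical document names
first = shuffled_copy(doc_names, seed=2017)
second = shuffled_copy(doc_names, seed=2017)

# The same seed yields the same order, so the train/test slices would match too.
print(first == second)
```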


<br/><hr/>This material is presented as part of the Foundermental Health Informatics Course, 2017 Fall, BMI, University of Utah. It's revised from the <a href="https://github.com/UUDeCART/decart_rule_based_nlp">material</a> of the DeCART Summer Program (Data, exploration, Computation, and Analytics Real-world Training for the Health Sciences) at the University of Utah in 2017. <br/><br/>Original presenters: Dr. Wendy Chapman, Jianlin Shi and Kelly Peterson.<br/>
Revised by: Jianlin Shi and Dr. Wendy Chapman<br/>
<img align="left" src="https://wiki.creativecommons.org/images/1/10/Cc.org_cc_by_license.jpg" alt="Except where otherwise noted, this website is licensed under a Creative Commons Attribution 3.0 Unported License.">