### Text Pre-Processing

This notebook preprocesses the text metadata of the Indiana University Chest X-ray Collection for use in the multimodal model. A preprocessed copy of the text data, with the ground-truth labels as an additional column, is already available in the data directory of this repo.

The preprocessing is standard and involves no special operations.

Besides tokenization and stopword removal, the tokens are converted to padded sequences so that they can be ingested by the 1-D CNN.
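To make the idea of a padded sequence concrete, here is a minimal pure-Python sketch of the two steps: mapping words to integer indices by frequency, then padding each sequence at the end to a fixed length. The helpers `build_word_index` and `pad_post` are hypothetical names used only for illustration. They approximate what the Keras `Tokenizer` and `pad_sequences(..., padding='post')` do and are not the code used in this notebook.

```python
from collections import Counter

def build_word_index(docs):
    """Rank words by frequency, starting at 1 (0 is left free for padding)."""
    counts = Counter(word for doc in docs for word in doc.split())
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def pad_post(seq, seq_len, pad_value=0):
    """Pad at the end; sequences longer than seq_len keep their last seq_len items."""
    seq = seq[-seq_len:]
    return seq + [pad_value] * (seq_len - len(seq))

# toy documents, assumed to be already cleaned
docs = ["normal chest heart size normal", "lungs clear normal"]
word_index = build_word_index(docs)
sequences = [[word_index[w] for w in doc.split()] for doc in docs]
padded = [pad_post(s, seq_len=6) for s in sequences]
print(padded)  # [[1, 2, 3, 4, 1, 0], [5, 6, 1, 0, 0, 0]]
```

Every document ends up with the same length, which is what a 1-D CNN with a fixed input shape needs. The index 0 never refers to a real word, so it can serve as the padding value.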

In [10]:
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer
from nltk.corpus import stopwords
import pandas as pd
import pickle
import json
import string
import re

In [24]:
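def tokenize(doc):
    # split on whitespace and drop English stopwords
    # (requires a one-time nltk.download('stopwords'))
    tokens = doc.split()
    stop_words = set(stopwords.words('english'))
    tokens = [w for w in tokens if w not in stop_words]
    return ' '.join(tokens)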

def clean_medical(text_list):
    text_list = [single_string.lower().strip() for single_string in text_list] # lowercase and strip surrounding whitespace
    text_list = [re.sub(r'\d+', '', single_string) for single_string in text_list] # remove digits
    text_list = [single_string.translate(str.maketrans("","",string.punctuation)) for single_string in text_list] # remove punctuation
    text_list = [tokenize(single_string) for single_string in text_list]
    return text_list

def list_to_seq(text_list, num_words, seq_len):
    # fit the vocabulary; only the num_words most frequent words are kept in the sequences
    tokenizer = Tokenizer(num_words=num_words)
    tokenizer.fit_on_texts(text_list)
    sequences = tokenizer.texts_to_sequences(text_list)
    # pad with zeros at the end of each sequence up to seq_len
    padded_sequences = pad_sequences(sequences, maxlen=seq_len, padding='post')
    return padded_sequences, tokenizer.word_index

def make_processed_text(in_path, filename):
    # read the raw reports; the CSV is expected to have 'ID' and 'Text' columns
    data = pd.read_csv(in_path, index_col=0)
    clean_data = clean_medical(list(data.Text))
    data['Text'] = clean_data
    print('example document: {}'.format(clean_data[0]))
    # documents average about 40 words, so seq_len is set above that
    seq_data, vocab = list_to_seq(text_list=clean_data, num_words=15000, seq_len=140)
    print('corresponding padded sequence of the example document: {}'.format(seq_data[0]))
    # map each report ID to its padded sequence
    data = dict(zip(list(data['ID']), seq_data))
    with open('../data/' + filename + '.pkl', 'wb') as f:
        pickle.dump(data, f, pickle.HIGHEST_PROTOCOL)
    # save the word-to-index vocabulary alongside the sequences
    with open('../data/vocab.json', 'w') as fp:
        json.dump(vocab, fp)

In [25]:
make_processed_text('../data/ids_raw_texts_labels.csv', 'text_processed')

example document: xxxx normal chest heart size normal lungs clear xxxx normal pneumonia effusions edema pneumothorax adenopathy nodules masses
corresponding padded sequence of the example document: [  1   2  14   6   9   2   8  10   1   2  70  44  59   5 179  96 120   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0]
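As a quick sanity check, a padded sequence can be decoded back into words by inverting the vocabulary. The sketch below is an illustration, not part of the original notebook: `decode_sequence` is a hypothetical helper, and the small `word_index` is a subset inferred from the example output above (in practice it would be loaded from `../data/vocab.json` with `json.load`).

```python
def decode_sequence(padded_seq, word_index, pad_value=0):
    """Map indices back to words, skipping the padding value."""
    index_word = {idx: word for word, idx in word_index.items()}
    return " ".join(index_word.get(i, "<unk>")
                    for i in padded_seq if i != pad_value)

# subset of the vocabulary, inferred from the example document and its sequence
word_index = {"xxxx": 1, "normal": 2, "heart": 6, "lungs": 8,
              "size": 9, "clear": 10, "chest": 14}
padded = [1, 2, 14, 6, 9, 2, 8, 10, 0, 0, 0, 0]
print(decode_sequence(padded, word_index))
# xxxx normal chest heart size normal lungs clear
```

If the decoded text matches the cleaned document, the IDs, sequences, and vocabulary were saved consistently.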
