#Project objectives
##To identify keywords (and synonyms) of overactive bladder (OAB) using prediction-based word vectors

###Input Format
The raw nursing assessments from the SOAP notes cannot be fed to the model directly. Instead, we clean them and convert everything to plain-text (.txt) files.

###Input Files
The result is five documents:

test-neg.txt: 4 non-OAB nursing assessments from the test data.
test-pos.txt: 1 OAB nursing assessment from the test data.
train-neg.txt: 8 non-OAB nursing assessments from the training data.
train-pos.txt: 3 OAB nursing assessments from the training data.
train-unsup.txt: 11 unlabelled assessments.

###Methods

####I. Subgroups
The rule I used to label an assessment as positive or negative:
  If the "Assessment" row contains the keyword "Overactive Bladder", I put the text from the "Subjective" row into the "positive" group.
  If the "Assessment" row does not contain the keyword "Overactive Bladder", I put the text from the "Subjective" row into the "negative" group.
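
The labeling rule above can be sketched as follows. This is only a minimal sketch: it assumes each SOAP note has already been parsed into a dict with "Subjective" and "Assessment" keys, and it matches the keyword case-insensitively. The function name `split_by_oab` and the sample notes are hypothetical and are not taken from the project data.

```python
def split_by_oab(notes, keyword="overactive bladder"):
    """Split notes into (positive, negative) lists of "Subjective" text.

    Assumes each note is a dict with "Subjective" and "Assessment" keys
    (a hypothetical structure; the real SOAP export may differ).
    """
    positive, negative = [], []
    for note in notes:
        # Case-insensitive keyword match on the "Assessment" row (an assumption)
        if keyword in note["Assessment"].lower():
            positive.append(note["Subjective"])
        else:
            negative.append(note["Subjective"])
    return positive, negative


# Made-up notes for illustration only
example_notes = [
    {"Subjective": "pt reports urgency and frequency", "Assessment": "Overactive Bladder"},
    {"Subjective": "pt here for routine f/u", "Assessment": "Stable, no new complaints"},
]
positive, negative = split_by_oab(example_notes)
```

Each returned list could then be written to the matching `-pos.txt` or `-neg.txt` file, one assessment per line.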

####II. Text Cleaning
The steps I used to clean the text:
  1. Convert all words to lowercase.
  2. Remove symbols, including ".", ",", "/", "\", ";", ":", "(", ")" and quotation marks. I kept hyphens.
  3. Put the text from each PDF on one line, so that text from different PDFs is on different lines.
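
The three cleaning steps can be sketched roughly as below. The function name `clean_text` is hypothetical, and two details are assumptions rather than the documented behaviour: each symbol is replaced by a space (so neighbouring words do not merge), and both straight single and double quotes count as quotation marks.

```python
# Symbols from step 2 plus straight quotes; hyphens are deliberately left out
SYMBOLS = '.,/\\;:()"\''
SYMBOL_TABLE = str.maketrans({ch: " " for ch in SYMBOLS})


def clean_text(raw_text):
    """Apply the three cleaning steps to the text extracted from one PDF."""
    text = raw_text.lower()               # step 1: lowercase
    text = text.translate(SYMBOL_TABLE)   # step 2: replace listed symbols with spaces
    return " ".join(text.split())         # step 3: collapse newlines and spaces into one line


cleaned = clean_text('Pt here for f/u;\nRe-check in "2 weeks".')
```

Note that "f/u" becomes "f u" under this rule, so any abbreviation handling would have to run before symbol removal.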

####III. NLP Models
I used Doc2Vec, the document-level extension of word2vec implemented in gensim, to generate embeddings from the text.

####IV. Modules

In [0]:

# gensim modules
from gensim import utils
from gensim.models.doc2vec import TaggedDocument  # replaces the deprecated LabeledSentence
from gensim.models import Doc2Vec

# numpy
import numpy

# classifier
from sklearn.linear_model import LogisticRegression

# random
import random

# file reading
import smart_open

In [0]:
class LabeledLineSentence(object):
    """Stream one tagged document per line from each source file."""

    def __init__(self, sources):
        # sources maps a file path to a tag prefix, e.g. {'train-pos.txt': 'TRAIN_POS'}
        self.sources = sources

        # make sure that prefixes are unique, otherwise document tags would collide
        if len(set(sources.values())) != len(sources):
            raise Exception('Non-unique prefix encountered')

    def __iter__(self):
        for source, prefix in self.sources.items():
            # read through the smart_open package directly;
            # gensim.utils.smart_open is missing from newer gensim releases
            with smart_open.open(source, 'rb') as fin:
                for item_no, line in enumerate(fin):
                    # tag each line as PREFIX_<line number>, e.g. TRAIN_POS_0
                    yield TaggedDocument(utils.to_unicode(line).split(), ['%s_%s' % (prefix, item_no)])

    def to_array(self):
        # materialize all documents so they can be reshuffled between epochs
        self.sentences = list(self)
        return self.sentences

    def sentences_perm(self):
        # return a shuffled copy, leaving self.sentences in its original order
        shuffled = list(self.sentences)
        random.shuffle(shuffled)
        return shuffled

In [0]:
sources = {'test-neg.txt':'TEST_NEG', 'test-pos.txt':'TEST_POS', 'train-neg.txt':'TRAIN_NEG', 'train-pos.txt':'TRAIN_POS', 'train-unsup.txt':'TRAIN_UNS'}



In [0]:
sentences = LabeledLineSentence(sources)

In [0]:
# vector_size is the current name of the embedding-dimension parameter (formerly size)
model = Doc2Vec(window=5, min_count=1, vector_size=50, sample=1e-5, negative=5, workers=1)


In [0]:
model.build_vocab(sentences.to_array())

####V. Training Doc2Vec

In [0]:
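for epoch in range(10):
    # reshuffle the documents on every pass; newer gensim versions
    # require total_examples and epochs to be passed explicitly
    model.train(sentences.sentences_perm(), total_examples=model.corpus_count, epochs=1)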

In [0]:
# word vectors live on model.wv
model.wv.most_similar('overactive')
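
`most_similar` ranks vocabulary words by the cosine similarity between their vectors and the query word's vector. The measure itself can be sketched in plain Python; the helper name `cosine_similarity` and the 3-dimensional toy vectors are made up for illustration and are not output from the trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors, not taken from the trained model
vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 4.0, 0.0]   # same direction as vec_a
vec_c = [0.0, 0.0, 3.0]   # orthogonal to vec_a

same_direction = cosine_similarity(vec_a, vec_b)
orthogonal = cosine_similarity(vec_a, vec_c)
```

Vectors pointing in the same direction score close to 1 and orthogonal vectors score close to 0, which is why words used in similar contexts end up near the top of the ranking.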

In [0]:
model['TRAIN_NEG_0']

In [0]:
import os

# Python does not expand '~' automatically, so resolve the home directory explicitly
model_path = os.path.expanduser('~/Google Drive/2019 Surgical Outcome/OAB Definition/nlp/txt/soap.d2v')
model.save(model_path)
model = Doc2Vec.load(model_path)

####VI. Classifying assessments

In [0]:
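n_train_pos, n_train_neg = 3, 8  # document counts in train-pos.txt and train-neg.txt

# one row per labelled training document, one column per embedding dimension
train_arrays = numpy.zeros((n_train_pos + n_train_neg, model.vector_size))
train_labels = numpy.zeros(n_train_pos + n_train_neg)

# OAB (positive) documents fill the first rows, label 1
for i in range(n_train_pos):
    train_arrays[i] = model['TRAIN_POS_' + str(i)]
    train_labels[i] = 1

# non-OAB (negative) documents fill the remaining rows, label 0
for i in range(n_train_neg):
    train_arrays[n_train_pos + i] = model['TRAIN_NEG_' + str(i)]
    train_labels[n_train_pos + i] = 0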

In [0]:
print(train_arrays)

In [0]:
print(train_labels)

In [0]:

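n_test_pos, n_test_neg = 1, 4  # document counts in test-pos.txt and test-neg.txt

# one row per test document, one column per embedding dimension
test_arrays = numpy.zeros((n_test_pos + n_test_neg, model.vector_size))
test_labels = numpy.zeros(n_test_pos + n_test_neg)

# OAB (positive) documents fill the first rows, label 1
for i in range(n_test_pos):
    test_arrays[i] = model['TEST_POS_' + str(i)]
    test_labels[i] = 1

# non-OAB (negative) documents fill the remaining rows, label 0
for i in range(n_test_neg):
    test_arrays[n_test_pos + i] = model['TEST_NEG_' + str(i)]
    test_labels[n_test_pos + i] = 0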

In [0]:
classifier = LogisticRegression()
classifier.fit(train_arrays, train_labels)

In [0]:
classifier.score(test_arrays, test_labels)
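
`classifier.score` returns mean accuracy. With the 1 OAB and 4 non-OAB test assessments listed under Input Files, accuracy alone can look good even when the single OAB assessment is missed, so it may help to also count true and false positives. The sketch below is illustrative only: `confusion_counts` is a hypothetical helper, and `predicted` is a made-up all-negative prediction, not actual model output.

```python
def confusion_counts(true_labels, predicted_labels):
    """Return (tp, fp, tn, fn) for binary labels where 1 = OAB and 0 = non-OAB."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if pred == 1 and truth == 1:
            tp += 1
        elif pred == 1 and truth == 0:
            fp += 1
        elif pred == 0 and truth == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


true_labels = [1, 0, 0, 0, 0]   # 1 OAB and 4 non-OAB test assessments
predicted = [0, 0, 0, 0, 0]     # hypothetical classifier that always predicts non-OAB

tp, fp, tn, fn = confusion_counts(true_labels, predicted)
accuracy = (tp + tn) / len(true_labels)
recall = tp / (tp + fn)
```

In this made-up case accuracy is 0.8 while recall for the OAB class is 0, so the accuracy figure should be read together with the per-class counts.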

####VII. Current Problems
1. The code raises an error with smart_open.
2. Some nurses use abbreviations (e.g. "f/u", "appmnt") in assessments and some do not. A list mapping each abbreviation to its meaning would help.
3. Typos are not corrected in the current version.
4. The process of retrieving SOAP notes and dividing them into OAB and non-OAB groups is not automated.

####VIII. The advantages of the current method
1. Words are converted to embeddings, so they are faster to compute with.
2. The model recognizes not only keyword importance but also the joint probabilities of two, three, four or more words appearing close to each other.
3. Databricks supports multiple clusters for parallel computing, so a huge dataset won't be a big problem.
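
For the abbreviation problem listed under Current Problems, one possible approach is a lookup table applied before symbol removal (otherwise "f/u" would lose its slash). This is a minimal sketch: the `ABBREVIATIONS` mapping and the function name `expand_abbreviations` are hypothetical, and the expansions are guesses that a clinician would need to confirm.

```python
# Hypothetical mapping; the expansions are guesses and need clinical review
ABBREVIATIONS = {
    "f/u": "follow-up",
    "appmnt": "appointment",
}


def expand_abbreviations(text, mapping=ABBREVIATIONS):
    """Replace whitespace-separated tokens that exactly match a known abbreviation."""
    tokens = text.lower().split()
    return " ".join(mapping.get(token, token) for token in tokens)


expanded = expand_abbreviations("Pt needs f/u appmnt next week")
```

Exact token matching keeps the sketch simple, but it would miss abbreviations attached to punctuation such as "f/u," or "appmnt.".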