## WEC Analysis - Initial Processing

*Created 22/03/2019*

This notebook contains code that is meant to be run just once, at the beginning of the analysis.

In [2]:
from nltk import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.parse import CoreNLPParser
from nltk.stem import WordNetLemmatizer
#from nltk import pos_tag #alternative POS tagger
import pandas as pd
#from stanfordcorenlp import StanfordCoreNLP

In [5]:
# Map Penn Treebank POS tags to WordNet POS tags
from nltk.corpus import wordnet as wn

# WordNet POS tags are: NOUN = 'n', VERB = 'v', ADJ = 'a', ADJ_SAT = 's', ADV = 'r'
# Descriptions (c) https://web.stanford.edu/~jurafsky/slp3/10.pdf
tag_map = {
        'CD':wn.NOUN, # cardinal number (one, two)                           
        'EX':wn.ADV, # existential ‘there’ (there)            
        'IN':wn.ADV, # preposition/sub-conj (of, in, by)   
        'JJ':wn.ADJ, # adjective
        'JJR':wn.ADJ, # adjective, comparative
        'JJS':wn.ADJ, # adjective, superlative
        'NN':wn.NOUN, # noun, sing. or mass (llama)          
        'NNS':wn.NOUN, # noun, plural (llamas)                  
        'NNP':wn.NOUN, # proper noun, sing. (IBM)              
        'NNPS':wn.NOUN, # proper noun, plural (Carolinas)
        'PDT':wn.ADJ, # predeterminer (all, both)            
        'RB':wn.ADV, # adverb (quickly, never)            
        'RBR':wn.ADV, # adverb, comparative (faster)        
        'RBS':wn.ADV, # adverb, superlative (fastest)     
        'RP':wn.ADJ, # particle (up, off)
        'VB':wn.VERB, # verb base form (eat)
        'VBD':wn.VERB, # verb past tense (ate)
#        'VBG':wn.VERB, # verb gerund (eating)
        'VBN':wn.VERB, # verb past participle (eaten)
        'VBP':wn.VERB, # verb non-3sg pres (eat)
        'VBZ':wn.VERB # verb 3sg pres (eats)
    }

### Pre-processing
Read the Excel spreadsheet into a pandas DataFrame and reshape it from one column per question into long format: Duration, Id, Question, Answer. The answers had already been spell-checked and corrected. We also replaced the variable names A and B with VarA and VarB, but left the article "A" at the beginning of sentences untouched.
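
To make the reshape concrete, here is a plain-Python sketch of the wide-to-long transformation that pd.melt followed by dropna performs. The toy rows and the melt_rows helper are hypothetical and only illustrate the resulting shape; the notebook itself uses pandas on the real spreadsheet.

```python
# Hypothetical toy rows standing in for the spreadsheet: one row per respondent,
# one column per question. None marks an unanswered question.
wide_rows = [
    {"Duration (in seconds)": 310, "ResponseId": "R_1", "C8P": "VarA rises with VarB", "C0": None},
    {"Duration (in seconds)": 275, "ResponseId": "R_2", "C8P": "strong link", "C0": "no relation"},
]

def melt_rows(rows, id_vars, var_name="Question", value_name="Answer"):
    """Return one long-format record per (question, respondent) that has an answer."""
    value_columns = [column for column in rows[0] if column not in id_vars]
    long_rows = []
    # Question columns form the outer loop, so records are grouped by question.
    for column in value_columns:
        for row in rows:
            if row[column] is None:
                continue  # drop missing answers, like dropna()
            record = {key: row[key] for key in id_vars}
            record[var_name] = column
            record[value_name] = row[column]
            long_rows.append(record)
    return long_rows

long_rows = melt_rows(wide_rows, id_vars=["Duration (in seconds)", "ResponseId"])
for record in long_rows:
    print(record)
```

Each respondent contributes one record per answered question, which is the shape the later tokenizing and tagging steps operate on.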

The pre-processing includes a POS tagging stage that requires a Stanford CoreNLP server (https://stanfordnlp.github.io/CoreNLP/) running in the background; the code below expects it at http://localhost:9500. Alternatively, tagging can be done with NLTK's built-in tagger: replace pos_tagger.tag below with pos_tag and uncomment the line from nltk import pos_tag in the first block.
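
One way to make the tagger easy to swap is to pass it in as a callable. The sketch below assumes only that the tagger takes a list of tokens and returns (token, tag) pairs, which is the shape both pos_tagger.tag and nltk.pos_tag use. The tag_answer helper and the dummy tagger are hypothetical and exist only so the example runs without a server or NLTK data.

```python
def tag_answer(tokens, tag_fn):
    """Lower-case the tokens and tag them with any tagger callable.

    tag_fn must accept a list of tokens and return (token, tag) pairs.
    """
    return list(tag_fn([token.lower() for token in tokens]))

# With the CoreNLP server running:  tag_answer(tokens, pos_tagger.tag)
# Without a server:                 tag_answer(tokens, pos_tag)  # from nltk import pos_tag

# Stand-in tagger (assumption, for illustration only): tags everything as a noun.
def dummy_tagger(tokens):
    return [(token, "NN") for token in tokens]

print(tag_answer(["VarA", "increases"], dummy_tagger))
```

Note that nltk.pos_tag needs its tagger model downloaded through nltk.download before first use.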

We also force the POS tags of vara and varb to proper noun, and those of variable and variables to noun, because the tagger otherwise occasionally tags them as adjectives.

In [3]:
tokenizer = RegexpTokenizer(r'\b[\w-]+\b',gaps=False)
excs = set([])
en_stopws = stopwords.words('english') #Load common stopwords
#sw_exceptions = set(('X','B'))
#stopws = set(en_stopws) - sw_exceptions
#excs = set(['increases'])
pos_tagger = CoreNLPParser(url='http://localhost:9500', tagtype='pos')
lemmatizer = WordNetLemmatizer()
custom_tags = [('vara','NNP'),('varb','NNP'),('variable','NN'),('variables','NNS') ]
Qs = ["C8P","C6P","C4P","C2P","C0","C2N","C4N","C6N","C8N"]
stop_tags = ["NN","JJ","NNS","RB"]

In [12]:
raw = pd.read_excel("WEC-160.xlsx")
data = pd.melt(raw,id_vars=["Duration (in seconds)","ResponseId"], var_name='Question', value_name='Answer')
#data = data.replace(np.nan, '', regex=True)
data = data.dropna()
# Disabled exploratory check: flag answers that mention 'correlation' or 'relationship'
if False:
    data['Correlation'] = 'n'
    data['Relationship'] = 'n'
    for idx in data.index:
        tokens = tokenizer.tokenize(data.at[idx,'Answer'])
        if 'correlation' in tokens:
            data.at[idx,'Correlation'] = 'y'
        if 'relationship' in tokens:
            data.at[idx,'Relationship'] = 'y'
    data[(data['Relationship'] == 'n') & (data['Correlation'] == 'n')]
#for row in data[(data['Relationship'] == 'n') & (data['Correlation'] == 'n')].itertuples(index=True, name=None):
#    print(row[4])

In [13]:
def replace_tag(original,custom_tags):
    # original is a (token, tag) pair; custom_tags is a list of (token, tag) overrides
    overrides = dict(custom_tags)
    token = original[0]
    if token in overrides:
        return (token,overrides[token]) # use the custom tag
    return original # keep the tagger's tag
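
As a quick illustration of the override, the sketch below applies the same lookup to a whole tagged sentence at once. The override_tags helper and the sample tagger output are hypothetical, written only to show the effect; they are not part of the notebook.

```python
custom_tags = [('vara', 'NNP'), ('varb', 'NNP'), ('variable', 'NN'), ('variables', 'NNS')]
overrides = dict(custom_tags)  # build the lookup once instead of once per token

def override_tags(tagged, overrides):
    """Swap in the custom tag for any token listed in overrides."""
    return [(token, overrides.get(token, tag)) for token, tag in tagged]

# Hypothetical tagger output, invented for illustration
tagged = [('vara', 'NN'), ('is', 'VBZ'), ('a', 'DT'), ('variable', 'JJ')]
print(override_tags(tagged, overrides))
```

Tokens that are not in the override list keep whatever tag the tagger assigned.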

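Steps:  
1) Tokenize the answers with a regular expression that preserves hyphenated words  
2) Lower-case the tokens and tag them using Stanford CoreNLP  
2.1) Replace the tags of custom words (e.g. VarA and VarB)  
3) Apply the WordNet lemmatizer, using tag_map to convert Treebank tags to WordNet POS  
4) Join the lemmas whose tag is in stop_tags into a single string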
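
The tokenizing step can be tried on its own with the standard re module. This sketch assumes that RegexpTokenizer with gaps=False returns the same matches as re.findall for this pattern; the tokenize helper and the sample sentence are made up for illustration.

```python
import re

# Same pattern as the notebook's RegexpTokenizer
TOKEN_PATTERN = re.compile(r'\b[\w-]+\b')

def tokenize(text):
    """Return runs of word characters and hyphens, dropping punctuation."""
    return TOKEN_PATTERN.findall(text)

print(tokenize("VarA is well-known; VarB isn't."))
# ['VarA', 'is', 'well-known', 'VarB', 'isn', 't']
```

Hyphenated words survive as single tokens, while an apostrophe splits a contraction into two tokens.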

In [27]:
data['tokenized'] = data['Answer'].apply(lambda x: tokenizer.tokenize(x))
data['tagged'] = data['tokenized'].apply(lambda x: pos_tagger.tag([token.lower() for token in x]))
data['tagged'] = data['tagged'].apply(lambda x: [replace_tag((t,p),custom_tags) for t,p in x])
#for i,(t,p) in enumerate(snt_tags):
#snt_tags[i] = replace_tag((t,p),custom_tags)
data['lemmatized'] = data['tagged'].apply(lambda x: [(lemmatizer.lemmatize(word,pos=tag_map[pos]),pos) if pos in tag_map else (word,pos) for word,pos in x])
data['stop_tags'] = data['lemmatized'].apply(lambda x: ' '.join([word for word,tag in x if tag in stop_tags]))
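
The lemmatization lambda is dense, so here is the same logic written as a named function. It is a sketch: the lemmatize_tagged helper is hypothetical, the WordNet constants are written as string literals, and a toy lemmatizer stands in for lemmatizer.lemmatize so the example runs without NLTK data.

```python
# WordNet POS constants written as literals so the sketch runs without NLTK data.
NOUN, VERB = 'n', 'v'
tag_map = {'NNS': NOUN, 'VBZ': VERB}  # small subset of the notebook's map

def lemmatize_tagged(tagged, lemmatize, tag_map):
    """Lemmatize tokens whose Treebank tag is mapped; pass the rest through."""
    result = []
    for word, pos in tagged:
        if pos in tag_map:
            result.append((lemmatize(word, pos=tag_map[pos]), pos))
        else:
            result.append((word, pos))  # unmapped tags (e.g. DT) are left as-is
    return result

# Toy lemmatizer (assumption): strips a trailing 's'. In the notebook this
# role is played by lemmatizer.lemmatize from WordNetLemmatizer.
def toy_lemmatize(word, pos):
    return word[:-1] if word.endswith('s') else word

tagged = [('variables', 'NNS'), ('increases', 'VBZ'), ('this', 'DT')]
print(lemmatize_tagged(tagged, toy_lemmatize, tag_map))
```

The original Treebank tag is kept next to each lemma, which is what the later stop_tags filter relies on.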

In [None]:
#Save data to a pickle file
data.to_pickle("WECdf.pkl")

## OpenIE -- OPTIONAL

We did not need it in the end, but open information extraction (OpenIE) can be a convenient way to get relation structures (relation, subject, object) out of utterances.
The code below is very slow and not optimized.

In [None]:
import json
from stanfordcorenlp import StanfordCoreNLP

def get_relation(sentence):
    # Return the first OpenIE triple of the first sentence as (relation, subject, object)
    nlp = StanfordCoreNLP('http://localhost',9500) # connect to the running CoreNLP server
    res = nlp.annotate(sentence,
                   properties={
                       'annotators': 'openie',
                       'outputFormat': 'json',
                       'timeout': 1000,
                   })
    res = json.loads(res)
    try:
        triple = res['sentences'][0]['openie'][0]
        relation = triple['relation']
        subject = triple['subject']
        obj = triple['object']
    except IndexError:
        # no sentence, or no triple extracted for the first sentence
        relation = ''
        subject = ''
        obj = ''
    
    return (relation,subject,obj)
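
get_relation keeps only the first triple of the first sentence. If all triples are wanted, the parsing step could be sketched as follows. The extract_triples helper is hypothetical, and the sample payload is hand-written to contain just the keys that get_relation reads; it is not real server output.

```python
import json

def extract_triples(response_text):
    """Return every (relation, subject, object) triple in an OpenIE JSON reply."""
    parsed = json.loads(response_text)
    triples = []
    for sentence in parsed.get('sentences', []):
        for triple in sentence.get('openie', []):
            triples.append((triple['relation'], triple['subject'], triple['object']))
    return triples

# Hand-written sample payload (assumption), trimmed to the fields used here.
sample = json.dumps({
    'sentences': [
        {'openie': [{'subject': 'VarA', 'relation': 'increases with', 'object': 'VarB'}]},
        {'openie': []},
    ]
})
print(extract_triples(sample))
print(extract_triples(json.dumps({'sentences': []})))
```

An answer with no extractable relation then yields an empty list rather than a tuple of empty strings.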

In [188]:
#data['openie'] = data['Answer'].apply(lambda x: get_relation(x))

In [226]:
#data['openie_replaced'] = data['openie_replaced'].apply(lambda x : none_to_na(x)) 

In [225]:
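import numpy as np

def empty_to_none(t):
    # Turn an empty triple ('','','') into (None,None,None)
    if (t[0] == ''):
        return (None,None,None)
    return t
def none_to_na(t):
    # Turn a (None,None,None) triple into NaN so pandas treats it as missing
    if (t[0] is None):
        return np.nan
    return t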