# Clause Parser Algorithm with Custom Metrics

## Introduction
This is the first of four notebooks, which are meant to be run in sequence, to qualify and quantify clauses. This notebook does the following:
1. Takes its input from input.csv, which has two required columns: "prompt" and "response". Joined together, these two columns form a coherent sentence.
2. Preprocesses the sentence to remove non-alphanumeric characters.
3. Splits the sentence into clauses such that each clause contains a verb.

Load the spaCy model, which is used to identify the verbs and, based on the rules elaborated further down, to determine the voice of each clause. This notebook uses 'en_core_web_md'. If you want better tokenization of words, use 'en_core_web_lg'.

In [1]:
!python --version
!python -m spacy download en_core_web_md
print("Downloaded")
#TODO: use en_core_web_lg in a better machine. lg is running out of space in binder. 

Python 3.6.7
Collecting https://github.com/explosion/spacy-models/releases/download/en_core_web_md-2.0.0/en_core_web_md-2.0.0.tar.gz
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-2.0.0/en_core_web_md-2.0.0.tar.gz (120.8MB)
You are using pip version 18.1, however version 19.0.3 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.

    Linking successful
    /srv/conda/lib/python

In [2]:
import spacy
import html
from spacy import displacy

nlp = spacy.load('en_core_web_md')
print("Loaded models")

Loaded models


Read the input file from the current directory.

In [3]:
from io import StringIO
import pandas as pd, numpy as np

df = pd.read_csv("./input.csv")
print(df.columns)
df.sample(frac=1).head()

Index(['UID', 'survey_id', 'prompt_number', 'prompt_id', 'prompt', 'response',
       'score', 'selectionTag', 'AnalystComments'],
      dtype='object')


Unnamed: 0,UID,survey_id,prompt_number,prompt_id,prompt,response,score,selectionTag,AnalystComments
350,2546.04,2546,4,92,"These days, school",are fun,2.0,25,
340,2542.33,2542,33,33,When I am nervous,I normally go away,2.5,31,
524,3284.17,3284,17,17,When they avoided me,I wondered what might be going on and what the...,3.5,22,
392,2648.11,2648,11,39,What I like to do best is,be active.,2.0,42,
167,2178.12,2178,12,12,A good boss,identifies with every employee as an expressio...,6.0,31,


Get the actual sentence by joining the prompt and response.

In [4]:
if "prompt" in df.columns: #Original dataset
    df['sentence'] = df.apply(lambda row : "{} {}".format(row['prompt'], row['response']), axis = 1)

df.sample(frac=1).head()

Unnamed: 0,UID,survey_id,prompt_number,prompt_id,prompt,response,score,selectionTag,AnalystComments,sentence
457,2900.11,2900,11,39,What I like to do best is,play.,2.0,38,,What I like to do best is play.
408,2721.16,2721,16,16,I feel sorry,for the pain that all humans and all beings on...,4.5,3,,I feel sorry for the pain that all humans and ...
473,3122.21,3122,21,21,I just can\'t stand people who,are offensive,3.0,50,,I just can\'t stand people who are offensive
220,2338.01,2338,1,1,Raising a family,"Captivates my whole being, just as every windo...",6.0,36,,"Raising a family Captivates my whole being, ju..."
40,1805.04,1805,4,37,"These days, work",Is a rainbow of consciousness manifesting thro...,6.0,1,,"These days, work Is a rainbow of consciousness..."


Preprocess each sentence to remove non-alphanumeric characters, then tokenize it using spaCy.

In [5]:
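import re, html
# Raw strings keep "\s" from being treated as an invalid string escape.
PATTERN = r"[^a-zA-Z0-9\s]+"
rgx = re.compile(PATTERN, re.IGNORECASE)

# Unescape HTML entities, replace runs of punctuation with a space, then collapse whitespace.
df['preprocessed_sentence'] = df['sentence'].apply(lambda ip : re.sub(r'\s+', ' ', rgx.sub(' ', html.unescape(ip))))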
print(df.columns, df.shape)
df.sample(frac=1).head()

Index(['UID', 'survey_id', 'prompt_number', 'prompt_id', 'prompt', 'response',
       'score', 'selectionTag', 'AnalystComments', 'sentence',
       'preprocessed_sentence'],
      dtype='object') (539, 11)


Unnamed: 0,UID,survey_id,prompt_number,prompt_id,prompt,response,score,selectionTag,AnalystComments,sentence,preprocessed_sentence
511,3185.34,3185,34,47,Technology,is exciting,3.0,48,,Technology is exciting,Technology is exciting
12,1668.34,1668,34,47,Technology,has been one of the most significant disruptor...,5.0,37,,Technology has been one of the most significan...,Technology has been one of the most significan...
116,1989.2,1989,20,44,Business and society,Are inter-connected and part of the bigger wor...,4.5,11,,Business and society Are inter-connected and p...,Business and society Are inter connected and p...
171,2178.26,2178,26,26,When I get mad,the anger is both a collective and personal ph...,6.0,15,,When I get mad the anger is both a collective ...,When I get mad the anger is both a collective ...
295,2507.01,2507,1,90,My family,has a cat...I don't have a cat though I have 2...,1.5,5,,My family has a cat...I don't have a cat thoug...,My family has a cat I don t have a cat though ...
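
To make the effect of this preprocessing step concrete, here is a minimal standalone sketch that applies the same three operations (HTML unescape, punctuation removal, whitespace collapse) to a few strings shaped like the rows above. The helper name `preprocess` is hypothetical and not part of the notebook; the regular expressions are meant to mirror the cell above.

```python
import html
import re

# Mirrors the notebook's pattern: anything that is not a letter, digit, or whitespace.
NON_ALNUM = re.compile(r"[^a-zA-Z0-9\s]+")
MULTI_SPACE = re.compile(r"\s+")


def preprocess(sentence):
    """Hypothetical helper: unescape entities, blank out punctuation, collapse whitespace."""
    unescaped = html.unescape(sentence)       # "don&#039;t" -> "don't"
    no_punct = NON_ALNUM.sub(" ", unescaped)  # "don't" -> "don t"
    return MULTI_SPACE.sub(" ", no_punct)


examples = [
    "When people are helpless They often don&#039;t know it",
    "Business and society Are inter-connected",
    "Bullying could be stopped if by saying NO!!",
]
for s in examples:
    print(repr(preprocess(s)))
```

Two side effects are worth keeping in mind when reading later outputs. Apostrophes become spaces, so contractions are split into separate words ("don t"), which is why fragments such as "I don" and "t know it" appear among the clauses. Trailing punctuation leaves a trailing space, because the result is not stripped.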


In [6]:
df['nlp_doc'] = df['preprocessed_sentence'].apply(lambda ip : nlp(ip))
print(df.columns)
df.sample(frac=1).head()

Index(['UID', 'survey_id', 'prompt_number', 'prompt_id', 'prompt', 'response',
       'score', 'selectionTag', 'AnalystComments', 'sentence',
       'preprocessed_sentence', 'nlp_doc'],
      dtype='object')


Unnamed: 0,UID,survey_id,prompt_number,prompt_id,prompt,response,score,selectionTag,AnalystComments,sentence,preprocessed_sentence,nlp_doc
394,2654.01,2654,1,1,Raising a family,is a delight,3.0,40,,Raising a family is a delight,Raising a family is a delight,"(Raising, a, family, is, a, delight)"
289,2504.19,2504,19,96,Bullying could be stopped if,by saying NO!!,1.5,27,,Bullying could be stopped if by saying NO!!,Bullying could be stopped if by saying NO,"(Bullying, could, be, stopped, if, by, saying,..."
479,3137.26,3137,26,26,When I get mad,i lash out in a controlled way but then eventu...,3.5,45,,When I get mad i lash out in a controlled way ...,When I get mad i lash out in a controlled way ...,"(When, I, get, mad, i, lash, out, in, a, contr..."
523,3267.19,3267,19,19,Crime and delinquency could be halted if,There was a lot more love in our world. We al...,4.5,45,,Crime and delinquency could be halted if There...,Crime and delinquency could be halted if There...,"(Crime, and, delinquency, could, be, halted, i..."
328,2538.18,2538,18,42,Rules,no running on the concrete.,1.5,16,,Rules no running on the concrete.,Rules no running on the concrete,"(Rules, no, running, on, the, concrete)"


### Actual splitting of clauses
#### Metrics

* Fraction of sentences with correct reconstructions from an existing dataset = 0.9061. The true figure is actually greater than 91%, since when a complex first clause is followed by a conjunction, the conjunction is placed with the parent clause in the first clause.
* Response expected = actual verbatim : 

#### Algorithm
NOTE: See http://universaldependencies.org/ to understand the grammatical dependencies. To visualize each sentence, look in the html folder. The files there contain the parse, which can be used to determine direct parents and sub-sentences, i.e. clauses.
1. Each doc contains clauses, each of which has a main verb.
2. These verbs are connected together to make up the entire document in spaCy.
3. We use a recursive method 'get_children' to determine whether a child verb links two clauses. Child verbs that do not link two clauses (auxiliary verbs (aux) or clausal complements (xcomp)) are part of the same clause.
4. This gives an array of clauses, where each clause is an array of spaCy tokens.
5. This 2D array might contain one or more clauses that are sub-clauses of another clause in the same 2D array. These are removed in the postprocessing.
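
The traversal described in these steps can be tried without a spaCy model. The sketch below uses a hypothetical stand-in class `Tok` and two hand-built parses (assumed for illustration, not produced by spaCy); `collect` is intended to follow the same logic as `get_children` in the cell below.

```python
class Tok:
    """Hypothetical stand-in for a spaCy token (text, pos_, dep_, lefts, rights)."""

    def __init__(self, text, pos, dep, lefts=(), rights=()):
        self.text, self.pos_, self.dep_ = text, pos, dep
        self.lefts, self.rights = list(lefts), list(rights)

    @property
    def children(self):
        return self.lefts + self.rights


def collect(tok):
    """Gather a subtree in sentence order, stopping at clause-linking verbs."""
    if not tok.children:
        return [tok]  # a leaf is always kept
    if tok.pos_ == "VERB" and tok.dep_ not in ("xcomp", "aux"):
        return []     # this verb heads a clause of its own
    out = []
    for child in tok.lefts:
        out.extend(collect(child))
    out.append(tok)
    for child in tok.rights:
        out.extend(collect(child))
    return out


def clause_for(verb):
    """Tokens of the clause headed by `verb`."""
    out = []
    for child in verb.lefts:
        out.extend(collect(child))
    out.append(verb)
    for child in verb.rights:
        out.extend(collect(child))
    return out


def text(tokens):
    return " ".join(t.text for t in tokens)


# "They left when I arrived": 'arrived' is an adverbial clause (advcl) of 'left'.
arrived = Tok("arrived", "VERB", "advcl",
              lefts=[Tok("when", "ADV", "advmod"), Tok("I", "PRON", "nsubj")])
left_verb = Tok("left", "VERB", "ROOT",
                lefts=[Tok("They", "PRON", "nsubj")], rights=[arrived])

# "I try to find ways": 'find' is a clausal complement (xcomp) of 'try'.
find = Tok("find", "VERB", "xcomp",
           lefts=[Tok("to", "PART", "aux")], rights=[Tok("ways", "NOUN", "dobj")])
tried = Tok("try", "VERB", "ROOT",
            lefts=[Tok("I", "PRON", "nsubj")], rights=[find])

for verb in (left_verb, arrived, tried, find):
    print(text(clause_for(verb)))
```

In the first sentence the adverbial clause is cut off from its parent, giving two separate clauses. In the second, the xcomp verb stays inside its parent's clause, but because every verb also starts a clause of its own, "to find ways" is produced as well. That duplicate is a strict subset of "I try to find ways" and is the kind of sub-clause that step 5 removes.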

In [7]:
def flatten_list(l):
    flat_list = [item for sublist in l for item in sublist]
    return flat_list

def get_children(doc):
    if len([x for x in doc.children]) == 0:
        return [doc]
    if doc.pos_ == "VERB" and doc.dep_ not in ["xcomp", "aux"]:
        return []

    op = flatten_list([get_children(l) for l in doc.lefts]) + [doc] + flatten_list([get_children(r) for r in doc.rights])
    return op

def postprocess(tokens_arr):
    if len(tokens_arr) == 1 and ( tokens_arr[0].dep_ in ["aux", "auxpass"] or tokens_arr[0].tag_ in ["VBG"]): 
        return []
    return tokens_arr

def get_text_from_tokens(tokens_arr):
    op = ' '.join([x.text for x in tokens_arr])
    op = op.replace(" nt", "nt").replace(" '", "'")
    return op

def clause_split_by_verbs(doc):
    op = []
    for token in doc:
        if token.pos_ == "VERB":
            arr = flatten_list([get_children(l) for l in token.lefts]) + [token] + flatten_list([get_children(r) for r in token.rights])
            arr = postprocess(arr)
            op.append(arr)
    if len(op)==0:
        op.append(doc)
    return op

df['split_by_verbs_arr'] = df['nlp_doc'].apply(clause_split_by_verbs)
df.sample(frac = 1).head()

Unnamed: 0,UID,survey_id,prompt_number,prompt_id,prompt,response,score,selectionTag,AnalystComments,sentence,preprocessed_sentence,nlp_doc,split_by_verbs_arr
379,2612.02,2612,2,2,When I am criticized,I can have a range of responses depending upon...,4.5,30,,When I am criticized I can have a range of res...,When I am criticized I can have a range of res...,"(When, I, am, criticized, I, can, have, a, ran...","[[], [When, I, am, criticized], [], [I, can, h..."
153,2155.35,2155,35,35,My conscience bothers me if,I\'m mean to someone.,2.0,18,,My conscience bothers me if I\'m mean to someone.,My conscience bothers me if I m mean to someone,"(My, conscience, bothers, me, if, I, m, mean, ...","[[My, conscience, bothers, me], [m], [if, I, m..."
21,1701.06,1701,6,6,The thing I like about myself is,how radiant and resplendent the blessing of cr...,5.5,26,,The thing I like about myself is how radiant a...,The thing I like about myself is how radiant a...,"(The, thing, I, like, about, myself, is, how, ...","[[I, like, about, myself], [The, thing, is], [..."
432,2812.06,2812,6,6,The thing I like about myself is,Im versatile,3.0,16,,The thing I like about myself is Im versatile,The thing I like about myself is Im versatile,"(The, thing, I, like, about, myself, is, I, m,...","[[I, like, about, myself], [The, thing, is, I]..."
228,2338.32,2338,32,32,If I can\'t get what I want,The ultimate question of life and mind! Really...,6.0,29,,If I can\'t get what I want The ultimate quest...,If I can t get what I want The ultimate questi...,"(If, I, can, t, get, what, I, want, The, ultim...","[[], [If, I, can, t, get], [what, I, want, The..."


Postprocess the dataframe and delimit the clauses.

In [8]:
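def remove_prompts(row):
    # Drop the tokens that belong to the prompt so only the response clauses remain.
    # Assumes the prompt tokenizes to the same leading token indices as in nlp_doc.
    prompt, tokens_arr = row.prompt, row.split_by_verbs_arr
    pdoc = nlp(prompt)
    ignore_indices = {x.i for x in pdoc}
    new_arr = []
    for clause in tokens_arr:
        new_clause = [t for t in clause if t.i not in ignore_indices]
        # Keep only clauses that still have tokens once the prompt is removed.
        if len(new_clause) > 0:
            new_arr.append(new_clause)
    return new_arr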

def filter_valid_text_df(clauses_arr):
    new_arr = []
    # first pass
    first_pass = []
    tok_arr = [[ tok.i for tok in clause] for clause in clauses_arr]

    for i in range(len(tok_arr)):
        x = tok_arr[i]
        if len(x) ==  0:
            continue
        is_subset = False
        for y in tok_arr:
            if set(x).issubset(y) and not set(x) == set(y):
                is_subset = True
        if not is_subset:
            first_pass.append(i)
    new_arr = [idx for idx in first_pass if len(clauses_arr[idx]) > 0]
    return new_arr

def get_valid_text_df(row):
    clauses_arr = row["clauses_doc_final"]
    valid_indices = row["valid_indices_per_doc"]
    filtered_clauses = [get_text_from_tokens(clauses_arr[x]) for x in valid_indices]
    return filtered_clauses

def process_verbs_df(clauses_arr):
    new_arr = []
    # first pass
    first_pass = []
    tok_arr = [[ tok.i for tok in clause] for clause in clauses_arr]

    for i in range(len(tok_arr)):
        x = tok_arr[i]
        if len(x) ==  0:
            continue
        is_subset = False
        for y in tok_arr:
            if set(x).issubset(y) and not set(x) == set(y):
                is_subset = True
        if not is_subset:
            first_pass.append(clauses_arr[i])
    
    for clauses in first_pass:
        if len(clauses) == 0:
            continue
        txt = get_text_from_tokens(clauses)
        new_arr.append(txt)
    
    return new_arr
        
df['clauses_doc_final'] = df[['prompt', 'split_by_verbs_arr']].apply(remove_prompts, axis = 1) 
df["valid_indices_per_doc"] = df['clauses_doc_final'].apply(filter_valid_text_df)
df['clauses_text_final'] = df.apply(lambda row: get_valid_text_df(row), axis = 1)
df['split_by_verbs_arr_cleaned'] = df['split_by_verbs_arr'].apply(process_verbs_df)
df.sample(frac = 1).head(20)

Unnamed: 0,UID,survey_id,prompt_number,prompt_id,prompt,response,score,selectionTag,AnalystComments,sentence,preprocessed_sentence,nlp_doc,split_by_verbs_arr,clauses_doc_final,valid_indices_per_doc,clauses_text_final,split_by_verbs_arr_cleaned
119,1997.21,1997,21,21,I just can\'t stand people who,judge and complain.,3.5,39,,I just can\'t stand people who judge and compl...,I just can t stand people who judge and complain,"(I, just, can, t, stand, people, who, judge, a...","[[], [I, just, can, t, stand, people], [who, j...","[[who, judge, and, complain], [complain]]",[0],[who judge and complain],"[I just can t stand people, who judge and comp..."
224,2338.23,2338,23,23,I am,no thing as apparently I see this whole panopl...,6.5,13,,I am no thing as apparently I see this whole p...,I am no thing as apparently I see this whole p...,"(I, am, no, thing, as, apparently, I, see, thi...","[[I, am, no, thing], [as, apparently, I, see],...","[[no, thing], [as, apparently, I, see], [this,...","[0, 1, 2, 3, 4, 5]","[no thing, as apparently I see, this whole pan...","[I am no thing, as apparently I see, this whol..."
390,2642.22,2642,22,43,At times I worry about,the hours I work and the impact on my family l...,3.5,40,,At times I worry about the hours I work and th...,At times I worry about the hours I work and th...,"(At, times, I, worry, about, the, hours, I, wo...","[[At, times, I, worry, about, the, hours, and,...","[[the, hours, and, the, impact, on, my, family...","[0, 1, 2, 4]","[the hours and the impact on my family life, I...",[At times I worry about the hours and the impa...
230,2341.17,2341,17,17,When they avoided me,I wondered is this mine or theirs? and feeling...,4.5,33,,When they avoided me I wondered is this mine o...,When they avoided me I wondered is this mine o...,"(When, they, avoided, me, I, wondered, is, thi...","[[When, they, avoided, me], [I, wondered], [is...","[[I, wondered], [is, this, mine, or, theirs, a...","[0, 1, 2, 3, 4, 6]","[I wondered, is this mine or theirs and, feeli...","[When they avoided me, I wondered, is this min..."
511,3185.34,3185,34,47,Technology,is exciting,3.0,48,,Technology is exciting,Technology is exciting,"(Technology, is, exciting)","[[Technology, is, exciting]]","[[is, exciting]]",[0],[is exciting],[Technology is exciting]
492,3151.09,3151,9,9,Education,is selfing and then un-selfing into unity and ...,6.0,37,,Education is selfing and then un-selfing into ...,Education is selfing and then un selfing into ...,"(Education, is, selfing, and, then, un, selfin...",[[]],[],[],[],[]
462,2952.25,2952,25,25,My main problem is,getting over decades spent proving myself.,4.5,2,,My main problem is getting over decades spent ...,My main problem is getting over decades spent ...,"(My, main, problem, is, getting, over, decades...","[[], [My, main, problem, is, getting, over, de...","[[getting, over, decades], [spent, proving, my...","[0, 1]","[getting over decades, spent proving myself]","[My main problem is getting over decades, spen..."
356,2548.23,2548,23,23,I am,"happy when something good happens, like when I...",2.0,47,,"I am happy when something good happens, like w...",I am happy when something good happens like wh...,"(I, am, happy, when, something, good, happens,...","[[I, am, happy], [when, something, good, happe...","[[happy], [when, something, good, happens, lik...","[0, 1, 3, 4, 5, 6]","[happy, when something good happens like, when...","[I am happy, when something good happens like,..."
399,2690.28,2690,28,83,A teacher has the right to,I don't mean to be oppositional. . but this se...,5.0,29,,A teacher has the right to I don't mean to be ...,A teacher has the right to I don t mean to be ...,"(A, teacher, has, the, right, to, I, don, t, m...","[[A, teacher, has, the, right], [to, I, don], ...","[[I, don], [t, mean, to, be, oppositional, but...","[0, 1, 3, 4, 5, 6, 7, 8, 9, 10]","[I don, t mean to be oppositional but, this se...","[A teacher has the right, to I don, t mean to ..."
270,2469.02,2469,2,2,When I am criticized,"I can feel hurt but have learned that ""critici...",5.0,21,,When I am criticized I can feel hurt but have ...,When I am criticized I can feel hurt but have ...,"(When, I, am, criticized, I, can, feel, hurt, ...","[[], [When, I, am, criticized], [], [I, can, f...","[[I, can, feel, hurt, but], [hurt], [have, lea...","[0, 2, 3, 4, 5, 6]","[I can feel hurt but, have learned, that criti...","[When I am criticized, I can feel hurt but, ha..."
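
The sub-clause removal performed by `filter_valid_text_df` can be seen in isolation on plain lists of token indices. The following is a minimal sketch, assuming each clause is represented only by the positions of its tokens in the sentence; `keep_maximal` is a hypothetical name, not a function from the notebook.

```python
def keep_maximal(index_lists):
    """Return positions of clauses that are non-empty and not a strict subset of another clause."""
    sets = [set(ix) for ix in index_lists]
    kept = []
    for i, s in enumerate(sets):
        if not s:
            continue  # empty clauses are dropped
        if any(s < other for other in sets):
            continue  # strict subset of another clause: drop it
        kept.append(i)
    return kept


# "who judge and complain" (tokens 6-9) and the nested "complain" (token 9)
print(keep_maximal([[6, 7, 8, 9], [9]]))  # [0]

# Overlapping but not nested clauses are both kept, as are exact duplicates.
print(keep_maximal([[1, 2], [2, 3], []]))  # [0, 1]
print(keep_maximal([[1, 2], [1, 2]]))      # [0, 1]
```

The first call corresponds to the row for "I just can't stand people who judge and complain" above, where `valid_indices_per_doc` is `[0]`. Note that the comparison is a strict subset test, so two identical clauses would both survive.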


In [9]:
# These rows have fewer filtered clauses than raw clauses; the voice lists are aligned to them later using valid_indices
df[df["clauses_text_final"].apply(len) != df["clauses_doc_final"].apply(len)]

Unnamed: 0,UID,survey_id,prompt_number,prompt_id,prompt,response,score,selectionTag,AnalystComments,sentence,preprocessed_sentence,nlp_doc,split_by_verbs_arr,clauses_doc_final,valid_indices_per_doc,clauses_text_final,split_by_verbs_arr_cleaned
0,1357.14,1357,14,14,The past,"Winds through us, both from our lives and cult...",5.5,15,,"The past Winds through us, both from our lives...",The past Winds through us both from our lives ...,"(The, past, Winds, through, us, both, from, ou...","[[The, past, Winds, through, us, both, from, o...","[[Winds, through, us, both, from, our, lives, ...","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 12, 13, 14,...",[Winds through us both from our lives and cult...,[The past Winds through us both from our lives...
3,1522.10,1522,10,10,When people are helpless,They often don&#039;t know it so they flak aro...,2.0,6,,When people are helpless They often don&#039;t...,When people are helpless They often don t know...,"(When, people, are, helpless, They, often, don...","[[When, people, are, helpless], [They, often, ...","[[They, often, don], [t, know, it], [so, they,...","[0, 1, 2]","[They often don, t know it, so they flak aroun...","[When people are helpless, They often don, t k..."
4,1522.15,1522,15,41,Privacy,is a sense of hiding from others that which yo...,1.0,1,,Privacy is a sense of hiding from others that ...,Privacy is a sense of hiding from others that ...,"(Privacy, is, a, sense, of, hiding, from, othe...","[[Privacy, is, a, sense, of], [hiding, from, o...","[[is, a, sense, of], [hiding, from, others, th...","[0, 1, 3, 4, 5]","[is a sense of, hiding from others that, which...","[Privacy is a sense of, hiding from others tha..."
7,1529.10,1529,10,10,When people are helpless,At times I try to find other ways of doing thi...,3.5,12,,When people are helpless At times I try to fin...,When people are helpless At times I try to fin...,"(When, people, are, helpless, At, times, I, tr...","[[When, people, are, helpless], [At, times, I,...","[[At, times, I, try, to, find, other, ways, of...","[0, 2, 3, 4, 5]","[At times I try to find other ways of, doing t...","[When people are helpless, At times I try to f..."
11,1668.27,1668,27,45,People who step out of line,change the line and provide others the opportu...,5.5,14,,People who step out of line change the line an...,People who step out of line change the line an...,"(People, who, step, out, of, line, change, the...","[[who, step, out, of, line], [People, change, ...","[[change, the, line, and, that, where, not, po...","[0, 1, 2, 3, 4]",[change the line and that where not possible u...,"[who step out of line, People change the line ..."
12,1668.34,1668,34,47,Technology,has been one of the most significant disruptor...,5.0,37,,Technology has been one of the most significan...,Technology has been one of the most significan...,"(Technology, has, been, one, of, the, most, si...","[[], [Technology, has, been, one, of, the, mos...","[[has, been, one, of, the, most, significant, ...","[0, 1, 2, 4, 5, 6, 7, 8, 9, 10, 11, 12]",[has been one of the most significant disrupto...,[Technology has been one of the most significa...
16,1681.31,1681,31,31,My father,was a hard worker and always tried to do his b...,2.5,48,,My father was a hard worker and always tried t...,My father was a hard worker and always tried t...,"(My, father, was, a, hard, worker, and, always...","[[My, father, was, a, hard, worker, and], [alw...","[[was, a, hard, worker, and], [always, tried, ...","[0, 1, 3]","[was a hard worker and, always tried to do his...","[My father was a hard worker and, always tried..."
20,1698.25,1698,25,25,My main problem is,I haven&#039;t figured out how to warp the tem...,4.5,27,,My main problem is I haven&#039;t figured out ...,My main problem is I haven t figured out how t...,"(My, main, problem, is, I, haven, t, figured, ...","[[My, main, problem, is], [I, haven, t, figure...","[[I, haven, t, figured, out, how, to, warp, th...","[0, 2, 3, 4, 6, 7]",[I haven t figured out how to warp the tempora...,"[My main problem is, I haven t figured out how..."
25,1717.13,1717,13,40,We could make the world a better place if,we would truly listen to each other enough to ...,5.5,22,,We could make the world a better place if we w...,We could make the world a better place if we w...,"(We, could, make, the, world, a, better, place...","[[], [We, could, make, the, world, a, better, ...","[[we, would, truly, listen, to, each, other, e...","[0, 1, 2, 4, 5]","[we would truly listen to each other enough, t...","[We could make the world a better place, if we..."
27,1731.11,1731,11,39,What I like to do best is,"be here in this moment with all parts of me, h...",5.5,30,,What I like to do best is be here in this mome...,What I like to do best is be here in this mome...,"(What, I, like, to, do, best, is, be, here, in...","[[I, like, What, to, do, best], [What, to, do,...","[[be, here, in, this, moment, with, all, parts...","[0, 1, 3]",[be here in this moment with all parts of me h...,"[I like What to do best, is be here in this mo..."


The voice of each clause in the actual sentence is determined using the rules below.

In [10]:
a_poss, p_yn, p_beverb, p_get, a_def, undef = "A_pron_x", "P_yn", "P_bevb_x", "P_get_x", "A_def", "Undefined"

def voice_rule_engine(clause):
    if True not in [x.pos_ == "VERB" for x in clause]:
        return undef
    
    for x in clause:
        if x.dep_ == "poss":
            return a_poss
        
    ct = 0
    for x in clause:
        if x.text.lower().strip() in ['yes', 'no']:
            ct += 1
    if ct >= len(clause)/2:
        return p_yn

    BEING_VERBS = ['be', 'am', 'is', 'isn', 'are', 'aren', \
                   'was', 'were', 'wasn', 'weren', 'been', 'being', \
                   'have', 'haven', 'has', 'hasn', 'could', 'couldn', \
                   'should', 'shouldn', 'would', 'wouldn', 'may', 'might', 'mightn', \
                   'must','mustn', 'shall', 'can', 'will', \
                   'do', 'don', 'did', 'didn', 'does', 'doesn', 'having']
    for x in clause:
        if x.text.lower().strip() in BEING_VERBS and x.pos_ == "VERB":
            return p_beverb

    for x in clause:
        if x.dep_ in ["advcl", "ROOT"] and x.text in ["get", "seem", "feel", "gets", "seems", "feels", "got", "seemed", "felt"]:
            return p_get
    
    return a_def
    
def clauses_voice(arr_of_clauses):
    op = []
    for clause in arr_of_clauses:
        voice = voice_rule_engine(clause)
        op.append(voice)         
    return op

df['voice'] = df.clauses_doc_final.apply(clauses_voice)
df["voice_filtered"] = df.apply(lambda row: [row["voice"][i] for i in range(len(row["voice"])) if i in row["valid_indices_per_doc"]], axis = 1)
df["voice"] = df["voice_filtered"]
df[['sentence', 'clauses_doc_final', 'voice', "voice_filtered"]].sample(frac = 1).head()

Unnamed: 0,sentence,clauses_doc_final,voice,voice_filtered
481,If I were in charge I would surround myself in...,"[[I, would, surround, myself, in, confident, c...","[P_bevb_x, A_pron_x]","[P_bevb_x, A_pron_x]"
432,The thing I like about myself is Im versatile,"[[I], [m, versatile]]","[Undefined, A_def]","[Undefined, A_def]"
112,Change is constant -- as in every millisecond ...,"[[constant, as, in, every, millisecond, of, th...","[Undefined, A_def, P_bevb_x, P_bevb_x, A_def, ...","[Undefined, A_def, P_bevb_x, P_bevb_x, A_def, ..."
247,My co-workers and I Work very well together as...,"[[very, well, together], [as, we, have, a, lon...","[Undefined, P_bevb_x, A_def, A_def]","[Undefined, P_bevb_x, A_def, A_def]"
205,The thing I like about myself is These days I\...,"[[These, days], [I, m, happy], [that, there, s...","[Undefined, A_def, A_pron_x, A_def, A_def, P_b...","[Undefined, A_def, A_pron_x, A_def, A_def, P_b..."
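
Because the rules in `voice_rule_engine` are checked in a fixed order, the first matching rule wins. The sketch below reproduces that ordering on hand-made tokens so the precedence can be tested without a model. `Tok` is a hypothetical stand-in for a spaCy token, the tags on the example tokens are assumed rather than produced by a parser, and `AUX_WORDS` is an abbreviated version of the notebook's `BEING_VERBS` list.

```python
from collections import namedtuple

Tok = namedtuple("Tok", "text pos_ dep_")  # hypothetical stand-in for a spaCy token

AUX_WORDS = {"be", "am", "is", "are", "was", "were", "have", "has", "can", "would"}  # abbreviated
GET_WORDS = {"get", "gets", "got", "seem", "seems", "seemed", "feel", "feels", "felt"}


def classify(clause):
    """Apply the voice rules in the same order as the notebook's rule engine."""
    if not any(t.pos_ == "VERB" for t in clause):
        return "Undefined"
    if any(t.dep_ == "poss" for t in clause):
        return "A_pron_x"
    yes_no = sum(t.text.lower() in ("yes", "no") for t in clause)
    if yes_no >= len(clause) / 2:
        return "P_yn"
    if any(t.text.lower() in AUX_WORDS and t.pos_ == "VERB" for t in clause):
        return "P_bevb_x"
    if any(t.dep_ in ("advcl", "ROOT") and t.text in GET_WORDS for t in clause):
        return "P_get_x"
    return "A_def"


no_verb = [Tok("very", "ADV", "advmod"), Tok("well", "ADV", "advmod")]
with_aux = [Tok("I", "PRON", "nsubj"), Tok("would", "VERB", "aux"),
            Tok("surround", "VERB", "ROOT"), Tok("myself", "PRON", "dobj")]
# The possessive rule fires before the auxiliary rule, even though "have" is present.
poss_first = [Tok("I", "PRON", "nsubj"), Tok("have", "VERB", "ROOT"),
              Tok("my", "PRON", "poss"), Tok("doubts", "NOUN", "dobj")]
get_like = [Tok("I", "PRON", "nsubj"), Tok("felt", "VERB", "ROOT"), Tok("hurt", "ADJ", "acomp")]
plain = [Tok("I", "PRON", "nsubj"), Tok("lash", "VERB", "ROOT"), Tok("out", "PART", "prt")]

for clause in (no_verb, with_aux, poss_first, get_like, plain):
    print(" ".join(t.text for t in clause), "->", classify(clause))
```

The auxiliary rule depends on auxiliaries being tagged `VERB`. Newer spaCy models may tag them `AUX` instead, in which case the `pos_` checks would need revisiting.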


In [11]:
df[df["clauses_text_final"].apply(len) != df["voice"].apply(len)].shape[0] # assert 0

0

This is the visualization of each sentence's parse tree. The output for each sentence in the input dataframe is in the /html folder. 

In [12]:
!pip install "msgpack-numpy<0.4.4.0"

Collecting msgpack-numpy<0.4.4.0
  Downloading https://files.pythonhosted.org/packages/ad/45/464be6da85b5ca893cfcbd5de3b31a6710f636ccb8521b17bd4110a08d94/msgpack_numpy-0.4.3.2-py2.py3-none-any.whl
Installing collected packages: msgpack-numpy
  Found existing installation: msgpack-numpy 0.4.4.2
    Uninstalling msgpack-numpy-0.4.4.2:
      Successfully uninstalled msgpack-numpy-0.4.4.2
Successfully installed msgpack-numpy-0.4.3.2
You are using pip version 18.1, however version 19.0.3 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.


In [13]:
def htmlise(row):
    html_fs = """
    <html>
        <head>
            <title>{}</title>
        </head>
        <body>
            <div>{}</div>
            <div>{}</div>
            <div>{}</div>
        </body>
    </html>"""
    op = spacy.displacy.render(row.nlp_doc, style='dep')
    with open("./html/file_{}.html".format(row.UID), "w") as f:
        f.write(html_fs.format(row.prompt, row.response, row.clauses_text_final, op))
    return
        
df['idx'] = df.index
#df.apply(htmlise, axis = 1)
#print("HTML processing done")
spacy.displacy.render([df.iloc[0].nlp_doc], style='dep')

TypeError: __init__() got an unexpected keyword argument 'encoding'
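
Independently of the rendering error above, the file-writing part of `htmlise` can be exercised on its own. The following is a minimal sketch, assuming the parse-tree markup is already available as a string; `write_page` and its parameters are hypothetical names, and the `<svg></svg>` value is only a placeholder for real displacy output. Unlike the cell above, it creates the output folder if it is missing and escapes the text fields before embedding them in the page.

```python
import html
import os
import tempfile

PAGE = """<html>
    <head>
        <title>{title}</title>
    </head>
    <body>
        <div>{response}</div>
        <div>{clauses}</div>
        <div>{tree}</div>
    </body>
</html>"""


def write_page(out_dir, uid, prompt, response, clauses, tree_markup):
    """Write one HTML page per sentence and return its path."""
    os.makedirs(out_dir, exist_ok=True)  # the folder may not exist yet
    path = os.path.join(out_dir, "file_{}.html".format(uid))
    page = PAGE.format(
        title=html.escape(prompt),
        response=html.escape(response),
        clauses=html.escape(str(clauses)),
        tree=tree_markup,  # already markup, so it is inserted unescaped
    )
    with open(path, "w", encoding="utf-8") as f:
        f.write(page)
    return path


demo_path = write_page(tempfile.mkdtemp(), "1357.14", "A <good> boss",
                       "listens & learns", ["listens learns"], "<svg></svg>")
with open(demo_path, encoding="utf-8") as f:
    demo_content = f.read()
print(os.path.basename(demo_path))
```

This sketch does not address the `TypeError` itself, which is raised before any file is written.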

The output is the split clauses, stored in voice_classified.csv. This file is the input to the second notebook.

In [14]:
df[['UID', 'survey_id', 'prompt_number', 'prompt_id', 'prompt', 'response', 'clauses_text_final', 'voice', 'idx']].to_csv("./voice_classified.csv", index = False)
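
One thing to be aware of when consuming this file: the `clauses_text_final` and `voice` columns hold Python lists, and a CSV stores them as their string representation. The sketch below illustrates the round trip with the standard `csv` module and made-up values. How the second notebook actually parses these columns is not shown here, so `ast.literal_eval` is only one possible approach.

```python
import ast
import csv
import io

# Write one illustrative row; the list values are stringified on the way out.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["UID", "clauses_text_final", "voice"])
writer.writerow(["0000.00", ["first clause", "second clause"], ["A_def", "P_bevb_x"]])

# Read it back: the list columns arrive as plain strings.
buf.seek(0)
row = next(csv.DictReader(buf))
raw = row["clauses_text_final"]
print(type(raw).__name__, raw)

# literal_eval safely turns the string back into a list of strings.
clauses = ast.literal_eval(raw)
voices = ast.literal_eval(row["voice"])
print(list(zip(clauses, voices)))
```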