# Clause Parser Algorithm with Custom Metrics

## Introduction
This is the first of four notebooks, which are run in sequence to qualify and quantify clauses. This notebook does the following:
1. Takes its input from input.csv, which has two required columns: "prompt" and "response". Joined together, these two columns form a coherent sentence.
2. Preprocesses the sentence to remove non-alphanumeric characters.
3. Splits the sentence into clauses such that each clause contains a verb.

Load the spaCy model, which is used to identify the verbs and to determine the voice of each clause based on the rules described later in this notebook. This notebook uses 'en_core_web_md'. If you want better tokenization of words, use 'en_core_web_lg'.

In [1]:
!python --version
!python -m spacy download en_core_web_md
print("Downloaded")
# TODO: use en_core_web_lg on a larger machine; lg runs out of space on Binder.

Python 3.6.7
Collecting https://github.com/explosion/spacy-models/releases/download/en_core_web_md-2.0.0/en_core_web_md-2.0.0.tar.gz
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-2.0.0/en_core_web_md-2.0.0.tar.gz (120.8MB)
    100% |████████████████████████████████| 120.9MB 114.9MB/s eta 0:00:01

    Linking successful
    /srv/conda/lib/python3.6/site-packages/en_core_web_md -->
    /srv/conda/lib/python3.6/site-packages/spacy/data/en_core_web_md

    You can now load the model via spacy.load('en_core_web_md')

Downloaded


In [2]:
import spacy
import html
from spacy import displacy

nlp = spacy.load('en_core_web_md')
print("Loaded models")

Loaded models


Read the input file from the current directory.

In [3]:
from io import StringIO
import pandas as pd, numpy as np

df = pd.read_csv("./input.csv")
print(df.columns)
df.sample(frac=1).head()

Index(['UID', 'survey_id', 'prompt_number', 'prompt_id', 'prompt', 'response',
       'score', 'selectionTag', 'AnalystComments'],
      dtype='object')


Unnamed: 0,UID,survey_id,prompt_number,prompt_id,prompt,response,score,selectionTag,AnalystComments
525,3286.06,3286,6,6,The thing I like about myself is,I am continually curious about life and how it...,5.5,37,
484,3146.03,3146,3,3,Change is,the shittiest fucking thing ever; I so wish re...,5.0,47,
312,2512.21,2512,21,21,I just can\'t stand people who,cry,1.5,2,
37,1791.34,1791,34,47,Technology,might one day be the next step of evolution. I...,6.0,35,
460,2941.17,2941,17,17,When they avoided me,I wondered what was their motive to do so?,3.0,45,


Get the actual sentence by joining the prompt and response.

In [4]:
if "prompt" in df.columns: #Original dataset
    df['sentence'] = df.apply(lambda row : "{} {}".format(row['prompt'], row['response']), axis = 1)

df.sample(frac=1).head()

Unnamed: 0,UID,survey_id,prompt_number,prompt_id,prompt,response,score,selectionTag,AnalystComments,sentence
365,2552.1,2552,10,10,When people are helpless,I try to make them not helpless,2.5,44,,When people are helpless I try to make them no...
184,2187.26,2187,26,26,When I get mad,I usually see it as a projection that I need t...,4.5,13,,When I get mad I usually see it as a projectio...
158,2166.15,2166,15,41,Privacy,is to be respected and not invaded. Different ...,3.5,41,,Privacy is to be respected and not invaded. Di...
245,2387.09,2387,9,9,Education,can be many things to many different peoples.,3.0,14,,Education can be many things to many different...
41,1806.23,1806,23,23,I am,amazed at how quickly the world gives way to t...,5.0,44,,I am amazed at how quickly the world gives way...


Preprocess each sentence to remove non-alphanumeric characters, then tokenize it using spaCy.

In [5]:
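import re, html
# Raw strings avoid invalid-escape warnings for \s; the character class already covers both cases.
PATTERN = r"[^a-zA-Z0-9\s]+"
rgx = re.compile(PATTERN)

# Unescape HTML entities, replace non-alphanumeric runs with a space, then collapse whitespace.
df['preprocessed_sentence'] = df['sentence'].apply(lambda ip : re.sub(r'\s+', ' ', rgx.sub(' ', html.unescape(ip))))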
print(df.columns, df.shape)
df.sample(frac=1).head()

Index(['UID', 'survey_id', 'prompt_number', 'prompt_id', 'prompt', 'response',
       'score', 'selectionTag', 'AnalystComments', 'sentence',
       'preprocessed_sentence'],
      dtype='object') (539, 11)


Unnamed: 0,UID,survey_id,prompt_number,prompt_id,prompt,response,score,selectionTag,AnalystComments,sentence,preprocessed_sentence
153,2155.35,2155,35,35,My conscience bothers me if,I\'m mean to someone.,2.0,18,,My conscience bothers me if I\'m mean to someone.,My conscience bothers me if I m mean to someone
211,2314.3,2314,30,30,If I were in charge,"of another organization, I would focus on bein...",3.5,16,,"If I were in charge of another organization, I...",If I were in charge of another organization I ...
338,2542.19,2542,19,96,Bullying could be stopped if,there would be a rule,2.0,48,,Bullying could be stopped if there would be a ...,Bullying could be stopped if there would be a ...
305,2509.33,2509,33,33,When I am nervous,"I would get out of bed, or house",1.5,45,,"When I am nervous I would get out of bed, or h...",When I am nervous I would get out of bed or house
536,3352.29,3352,29,29,If my mother,was more like me - I imagine we would be close...,4.0,43,,If my mother was more like me - I imagine we w...,If my mother was more like me I imagine we wou...
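
To make the preprocessing step concrete, here is a minimal standalone sketch of the same three operations (unescape HTML entities, replace non-alphanumeric runs with a space, collapse whitespace) applied to a single string. The helper name `preprocess` and the sample sentence are illustrative and not part of the notebook.

```python
import html
import re

# Same character class as the notebook cell: anything that is not a letter, digit, or whitespace.
NON_ALNUM = re.compile(r"[^a-zA-Z0-9\s]+")

def preprocess(sentence):
    """Unescape HTML entities, blank out non-alphanumerics, collapse whitespace."""
    unescaped = html.unescape(sentence)      # "haven&#039;t" -> "haven't"
    spaced = NON_ALNUM.sub(" ", unescaped)   # "haven't" -> "haven t"
    return re.sub(r"\s+", " ", spaced)       # runs of whitespace -> single space

sample = "My main problem is I haven&#039;t figured it out."
result = preprocess(sample)
print(repr(result))  # 'My main problem is I haven t figured it out '
```

Note that the final period becomes a trailing space, and that contractions are split ("haven t"), which is why the later rules match fragments such as "haven" and "don".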


In [6]:
df['nlp_doc'] = df['preprocessed_sentence'].apply(lambda ip : nlp(ip))
print(df.columns)
df.sample(frac=1).head()

Index(['UID', 'survey_id', 'prompt_number', 'prompt_id', 'prompt', 'response',
       'score', 'selectionTag', 'AnalystComments', 'sentence',
       'preprocessed_sentence', 'nlp_doc'],
      dtype='object')


Unnamed: 0,UID,survey_id,prompt_number,prompt_id,prompt,response,score,selectionTag,AnalystComments,sentence,preprocessed_sentence,nlp_doc
473,3122.21,3122,21,21,I just can\'t stand people who,are offensive,3.0,50,,I just can\'t stand people who are offensive,I just can t stand people who are offensive,"(I, just, can, t, stand, people, who, are, off..."
38,1801.01,1801,1,1,Raising a family,", my biggest unfulfilled dream, triggers some ...",5.0,26,,"Raising a family , my biggest unfulfilled drea...",Raising a family my biggest unfulfilled dream ...,"(Raising, a, family, my, biggest, unfulfilled,..."
146,2102.25,2102,25,25,My main problem is,only a main problem if I think it is.,3.5,11,,My main problem is only a main problem if I th...,My main problem is only a main problem if I th...,"(My, main, problem, is, only, a, main, problem..."
62,1837.05,1837,5,5,Being with other people,can be rewarding.,3.0,2,,Being with other people can be rewarding.,Being with other people can be rewarding,"(Being, with, other, people, can, be, rewarding)"
450,2891.07,2891,7,38,My co-workers and I,"care deeply about our work, enjoy each others ...",3.5,13,,My co-workers and I care deeply about our work...,My co workers and I care deeply about our work...,"(My, co, workers, and, I, care, deeply, about,..."


### Actual splitting of clauses
#### Metrics

* Fraction of sentences with correct reconstructions on an existing dataset = 0.9061. The true figure is higher, above 91%, because when a complex first clause is followed by a conjunction, the conjunction is kept with the parent clause in the first clause.
* Response expected = actual verbatim : 

#### Algorithm
NOTE: See http://universaldependencies.org/ for an explanation of the grammatical dependencies. To visualize a sentence, look in the html folder; the files there contain the parse, which can be used to determine direct parents and sub-sentences (clauses).
1. Each doc contains clauses, each of which has a main verb.
2. In spaCy, these verbs are connected to one another to form the parse of the entire document.
3. We use a recursive method, 'get_children', to determine whether a child verb links two clauses. Child verbs that do not link two clauses (auxiliary verbs (aux) or open clausal complements (xcomp)) are part of the same clause.
4. This gives an array of clauses, where each clause is an array of spaCy tokens.
5. This 2D array might contain one or more clauses that are sub-clauses of another clause in the same array. These are removed in the postprocessing.
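
The traversal in steps 1 to 4 can be sketched without spaCy on a hand-built parse. `ToyToken` is a hypothetical stand-in that carries only the attributes the recursion reads, and the dependency labels for the sample sentence are assigned by hand rather than produced by a parser, so treat this as an illustration of the recursion and not as the notebook's actual output.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ToyToken:
    """Hypothetical stand-in for a spaCy token: only the fields the traversal reads."""
    text: str
    pos_: str
    dep_: str
    lefts: List["ToyToken"] = field(default_factory=list)
    rights: List["ToyToken"] = field(default_factory=list)

def get_children(tok):
    # A token without children is always kept.
    if not tok.lefts and not tok.rights:
        return [tok]
    # A verb that is neither xcomp nor aux heads its own clause, so stop here.
    if tok.pos_ == "VERB" and tok.dep_ not in ("xcomp", "aux"):
        return []
    left = [t for child in tok.lefts for t in get_children(child)]
    right = [t for child in tok.rights for t in get_children(child)]
    return left + [tok] + right

def clause_for(verb):
    """Tokens to the left of the verb, the verb itself, then tokens to its right."""
    left = [t for child in verb.lefts for t in get_children(child)]
    right = [t for child in verb.rights for t in get_children(child)]
    return [t.text for t in left + [verb] + right]

# Hand-built parse of "I want to leave because she cried".
subj = ToyToken("I", "PRON", "nsubj")
to = ToyToken("to", "PART", "aux")
leave = ToyToken("leave", "VERB", "xcomp", lefts=[to])
because = ToyToken("because", "SCONJ", "mark")
she = ToyToken("she", "PRON", "nsubj")
cried = ToyToken("cried", "VERB", "advcl", lefts=[because, she])
want = ToyToken("want", "VERB", "ROOT", lefts=[subj], rights=[leave, cried])

doc = [subj, want, to, leave, because, she, cried]
clauses = [clause_for(t) for t in doc if t.pos_ == "VERB"]
print(clauses)
# [['I', 'want', 'to', 'leave'], ['to', 'leave'], ['because', 'she', 'cried']]
```

The xcomp verb "leave" stays inside the clause headed by "want", while the advcl verb "cried" starts a clause of its own. The second entry, "to leave", is a sub-clause of the first, which is the kind of duplicate that step 5 removes.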

In [7]:
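def flatten_list(l):
    """Flatten a list of lists by one level."""
    flat_list = [item for sublist in l for item in sublist]
    return flat_list

def get_children(doc):
    """Return, in sentence order, the tokens under `doc` that stay in the current clause."""
    # A token without children is always kept.
    if len([x for x in doc.children]) == 0:
        return [doc]
    # A verb with children that is neither xcomp nor aux heads its own clause: skip its subtree.
    if doc.pos_ == "VERB" and doc.dep_ not in ["xcomp", "aux"]:
        return []

    op = flatten_list([get_children(l) for l in doc.lefts]) + [doc] + flatten_list([get_children(r) for r in doc.rights])
    return op

def postprocess(tokens_arr):
    """Discard a one-token clause that is only an auxiliary (aux/auxpass) or a VBG form."""
    if len(tokens_arr) == 1 and ( tokens_arr[0].dep_ in ["aux", "auxpass"] or tokens_arr[0].tag_ in ["VBG"]): 
        return []
    return tokens_arr

def get_text_from_tokens(tokens_arr):
    """Join token texts with spaces, then remove the space before "nt" and before an apostrophe."""
    op = ' '.join([x.text for x in tokens_arr])
    op = op.replace(" nt", "nt").replace(" '", "'")
    return op

def clause_split_by_verbs(doc):
    """Build one candidate clause per verb; fall back to the whole doc if there is no verb."""
    op = []
    for token in doc:
        if token.pos_ == "VERB":
            arr = flatten_list([get_children(l) for l in token.lefts]) + [token] + flatten_list([get_children(r) for r in token.rights])
            arr = postprocess(arr)
            op.append(arr)
    if len(op)==0:
        op.append(doc)
    return op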

df['split_by_verbs_arr'] = df['nlp_doc'].apply(clause_split_by_verbs)
df.sample(frac = 1).head()

Unnamed: 0,UID,survey_id,prompt_number,prompt_id,prompt,response,score,selectionTag,AnalystComments,sentence,preprocessed_sentence,nlp_doc,split_by_verbs_arr
115,1987.04,1987,4,37,"These days, work",is a Devine expression of the universe moving ...,6.0,42,,"These days, work is a Devine expression of the...",These days work is a Devine expression of the ...,"(These, days, work, is, a, Devine, expression,...","[[These, days, work, is, a, Devine, expression..."
487,3151.01,3151,1,1,Raising a family,"is love swimming in all of creation, a living ...",6.0,18,,Raising a family is love swimming in all of cr...,Raising a family is love swimming in all of cr...,"(Raising, a, family, is, love, swimming, in, a...","[[Raising, a, family], [is], [love, swimming, ..."
118,1996.22,1996,22,43,At times I worry about,how differently I see and process things from ...,4.5,18,,At times I worry about how differently I see a...,At times I worry about how differently I see a...,"(At, times, I, worry, about, how, differently,...","[[At, times, I, worry, about], [how, different..."
60,1831.11,1831,11,39,What I like to do best is,be in a complete experience of effortless flow...,5.0,25,,What I like to do best is be in a complete exp...,What I like to do best is be in a complete exp...,"(What, I, like, to, do, best, is, be, in, a, c...","[[I, like, What, to, do, best], [What, to, do,..."
228,2338.32,2338,32,32,If I can\'t get what I want,The ultimate question of life and mind! Really...,6.0,29,,If I can\'t get what I want The ultimate quest...,If I can t get what I want The ultimate questi...,"(If, I, can, t, get, what, I, want, The, ultim...","[[], [If, I, can, t, get], [what, I, want, The..."


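Postprocess the DataFrame: remove the prompt tokens from each clause, then drop every clause that is a proper subset of another clause in the same sentence.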

In [8]:
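def remove_prompts(row):
    """Drop the prompt's tokens from every clause, keeping only non-empty clauses."""
    prompt, tokens_arr = row.prompt, row.split_by_verbs_arr
    pdoc = nlp(prompt)
    # Token positions 0..len(pdoc)-1 are treated as the prompt. This lines up with nlp_doc
    # only when the raw prompt and its preprocessed form tokenize to the same length.
    ignore_indices = {x.i for x in pdoc}
    new_arr = []
    for clause in tokens_arr:
        new_clause = [t for t in clause if t.i not in ignore_indices]
        if len(new_clause) > 0:
            new_arr.append(new_clause)
    return new_arr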

def filter_valid_text_df(clauses_arr):
    """Return the positions of clauses that are not a proper subset of another clause."""
    tok_arr = [[tok.i for tok in clause] for clause in clauses_arr]
    first_pass = []
    for i, x in enumerate(tok_arr):
        if len(x) == 0:
            continue
        # set(x) < set(y) is the proper-subset test: x lies inside y and is not equal to it.
        is_subset = any(set(x) < set(y) for y in tok_arr)
        if not is_subset:
            first_pass.append(i)
    return first_pass

def get_valid_text_df(row):
    clauses_arr = row["clauses_doc_final"]
    valid_indices = row["valid_indices_per_doc"]
    filtered_clauses = [get_text_from_tokens(clauses_arr[x]) for x in valid_indices]
    return filtered_clauses

def process_verbs_df(clauses_arr):
    """Return the text of every clause that is not a proper subset of another clause."""
    tok_arr = [[tok.i for tok in clause] for clause in clauses_arr]
    new_arr = []
    for i, x in enumerate(tok_arr):
        if len(x) == 0:
            continue
        # Same proper-subset filter as filter_valid_text_df, returning text instead of positions.
        if any(set(x) < set(y) for y in tok_arr):
            continue
        new_arr.append(get_text_from_tokens(clauses_arr[i]))
    return new_arr
        
df['clauses_doc_final'] = df[['prompt', 'split_by_verbs_arr']].apply(remove_prompts, axis = 1) 
df["valid_indices_per_doc"] = df['clauses_doc_final'].apply(filter_valid_text_df)
df['clauses_text_final'] = df.apply(lambda row: get_valid_text_df(row), axis = 1)
df['split_by_verbs_arr_cleaned'] = df['split_by_verbs_arr'].apply(process_verbs_df)
df.sample(frac = 1).head(20)

Unnamed: 0,UID,survey_id,prompt_number,prompt_id,prompt,response,score,selectionTag,AnalystComments,sentence,preprocessed_sentence,nlp_doc,split_by_verbs_arr,clauses_doc_final,valid_indices_per_doc,clauses_text_final,split_by_verbs_arr_cleaned
74,1878.09,1878,9,9,Education,"Yes, in love. Yes let\&#039;s, for everyone! C...",5.0,13,,"Education Yes, in love. Yes let\&#039;s, for e...",Education Yes in love Yes let s for everyone C...,"(Education, Yes, in, love, Yes, let, s, for, e...","[[Yes, let], [], [s, for, everyone, Could, edu...","[[Yes, let], [s, for, everyone, Could, educati...","[0, 1, 3, 4, 5, 6, 7, 10]","[Yes let, s for everyone Could education simpl...","[Yes let, s for everyone Could education simpl..."
57,1825.03,1825,3,3,Change is,the ever evolving universe.,5.0,6,,Change is the ever evolving universe.,Change is the ever evolving universe,"(Change, is, the, ever, evolving, universe)","[[Change, is, the, universe], [ever, evolving]]","[[the, universe], [ever, evolving]]","[0, 1]","[the universe, ever evolving]","[Change is the universe, ever evolving]"
436,2837.29,2837,29,29,If my mother,needs my help I want to be able to take care o...,2.5,38,,If my mother needs my help I want to be able t...,If my mother needs my help I want to be able t...,"(If, my, mother, needs, my, help, I, want, to,...","[[If, my, mother, needs, my, help], [I, want, ...","[[needs, my, help], [I, want, to, be, able, to...","[0, 1]","[needs my help, I want to be able to take care...","[If my mother needs my help, I want to be able..."
134,2047.25,2047,25,25,My main problem is,I get excited and don\&#039;t think things thr...,4.0,14,,My main problem is I get excited and don\&#039...,My main problem is I get excited and don t thi...,"(My, main, problem, is, I, get, excited, and, ...","[[My, main, problem, is, excited, and], [I, ge...","[[excited, and], [I, get], [don], [don, t, thi...","[0, 1, 3, 4, 5, 7, 8, 9, 10]","[excited and, I get, don t think things throug...","[My main problem is excited and, I get, don t ..."
20,1698.25,1698,25,25,My main problem is,I haven&#039;t figured out how to warp the tem...,4.5,27,,My main problem is I haven&#039;t figured out ...,My main problem is I haven t figured out how t...,"(My, main, problem, is, I, haven, t, figured, ...","[[My, main, problem, is], [I, haven, t, figure...","[[I, haven, t, figured, out, how, to, warp, th...","[0, 2, 3, 4, 6, 7]",[I haven t figured out how to warp the tempora...,"[My main problem is, I haven t figured out how..."
199,2245.1,2245,10,10,When people are helpless,"sometimes I want to help them, sometimes I wan...",4.5,24,,When people are helpless sometimes I want to h...,When people are helpless sometimes I want to h...,"(When, people, are, helpless, sometimes, I, wa...","[[When, people, are, helpless, sometimes], [I,...","[[sometimes], [I, want, to, help, them], [to, ...","[0, 1, 3, 5, 6, 8, 9, 10, 11, 12]","[sometimes, I want to help them, sometimes I w...","[When people are helpless sometimes, I want to..."
93,1889.3,1889,30,30,If I were in charge,/if I were following...together are a unity wh...,6.5,15,,If I were in charge /if I were following...tog...,If I were in charge if I were following togeth...,"(If, I, were, in, charge, if, I, were, followi...","[[If, I, were, in, charge], [], [if, I, were, ...","[[if, I, were, following, together], [are, a, ...","[0, 1, 2, 3, 4]","[if I were following together, are a unity NOW...","[If I were in charge, if I were following toge..."
224,2338.23,2338,23,23,I am,no thing as apparently I see this whole panopl...,6.5,13,,I am no thing as apparently I see this whole p...,I am no thing as apparently I see this whole p...,"(I, am, no, thing, as, apparently, I, see, thi...","[[I, am, no, thing], [as, apparently, I, see],...","[[no, thing], [as, apparently, I, see], [this,...","[0, 1, 2, 3, 4, 5]","[no thing, as apparently I see, this whole pan...","[I am no thing, as apparently I see, this whol..."
118,1996.22,1996,22,43,At times I worry about,how differently I see and process things from ...,4.5,18,,At times I worry about how differently I see a...,At times I worry about how differently I see a...,"(At, times, I, worry, about, how, differently,...","[[At, times, I, worry, about], [how, different...","[[how, differently, I, see, and, and], [proces...","[0, 1, 2, 3]","[how differently I see and and, process things...","[At times I worry about, how differently I see..."
497,3158.02,3158,2,2,When I am criticized,"I realise that my being or expression, my cons...",5.5,18,,When I am criticized I realise that my being o...,When I am criticized I realise that my being o...,"(When, I, am, criticized, I, realise, that, my...","[[], [When, I, am, criticized], [I, realise], ...","[[I, realise], [that, my, being, or, expressio...","[0, 1, 2, 3, 4, 5, 6]","[I realise, that my being or expression my con...","[When I am criticized, I realise, that my bein..."
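
The sub-clause removal used in this cell can be illustrated on plain lists of token positions. This is a small sketch of the proper-subset test only; `non_subset_indices` and the sample index lists are made up for the example.

```python
def non_subset_indices(index_lists):
    """Return positions of clauses that are not a proper subset of another clause."""
    sets = [set(ixs) for ixs in index_lists]
    keep = []
    for pos, current in enumerate(sets):
        if not current:
            continue  # empty clauses are never kept
        # `a < b` on sets is the proper-subset test (subset and not equal).
        if any(current < other for other in sets):
            continue
        keep.append(pos)
    return keep

# Each inner list holds the token positions covered by one candidate clause.
candidate_clauses = [[0, 1, 2, 3], [2, 3], [4, 5, 6], [], [4, 5, 6]]
kept = non_subset_indices(candidate_clauses)
print(kept)  # [0, 2, 4]
```

Clause 1 is dropped because it sits entirely inside clause 0. Clauses 2 and 4 cover the same tokens, so neither is a proper subset of the other and both survive; the filter removes nested clauses, not exact duplicates.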


In [9]:
# Rows where the number of clause texts differs from the number of clause docs.
# The voice list is aligned with the clause texts later using valid_indices_per_doc.
df[df["clauses_text_final"].apply(len) != df["clauses_doc_final"].apply(len)]

Unnamed: 0,UID,survey_id,prompt_number,prompt_id,prompt,response,score,selectionTag,AnalystComments,sentence,preprocessed_sentence,nlp_doc,split_by_verbs_arr,clauses_doc_final,valid_indices_per_doc,clauses_text_final,split_by_verbs_arr_cleaned
0,1357.14,1357,14,14,The past,"Winds through us, both from our lives and cult...",5.5,15,,"The past Winds through us, both from our lives...",The past Winds through us both from our lives ...,"(The, past, Winds, through, us, both, from, ou...","[[The, past, Winds, through, us, both, from, o...","[[Winds, through, us, both, from, our, lives, ...","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 12, 13, 14,...",[Winds through us both from our lives and cult...,[The past Winds through us both from our lives...
3,1522.10,1522,10,10,When people are helpless,They often don&#039;t know it so they flak aro...,2.0,6,,When people are helpless They often don&#039;t...,When people are helpless They often don t know...,"(When, people, are, helpless, They, often, don...","[[When, people, are, helpless], [They, often, ...","[[They, often, don], [t, know, it], [so, they,...","[0, 1, 2]","[They often don, t know it, so they flak aroun...","[When people are helpless, They often don, t k..."
4,1522.15,1522,15,41,Privacy,is a sense of hiding from others that which yo...,1.0,1,,Privacy is a sense of hiding from others that ...,Privacy is a sense of hiding from others that ...,"(Privacy, is, a, sense, of, hiding, from, othe...","[[Privacy, is, a, sense, of], [hiding, from, o...","[[is, a, sense, of], [hiding, from, others, th...","[0, 1, 3, 4, 5]","[is a sense of, hiding from others that, which...","[Privacy is a sense of, hiding from others tha..."
7,1529.10,1529,10,10,When people are helpless,At times I try to find other ways of doing thi...,3.5,12,,When people are helpless At times I try to fin...,When people are helpless At times I try to fin...,"(When, people, are, helpless, At, times, I, tr...","[[When, people, are, helpless], [At, times, I,...","[[At, times, I, try, to, find, other, ways, of...","[0, 2, 3, 4, 5]","[At times I try to find other ways of, doing t...","[When people are helpless, At times I try to f..."
11,1668.27,1668,27,45,People who step out of line,change the line and provide others the opportu...,5.5,14,,People who step out of line change the line an...,People who step out of line change the line an...,"(People, who, step, out, of, line, change, the...","[[who, step, out, of, line], [People, change, ...","[[change, the, line, and, that, where, not, po...","[0, 1, 2, 3, 4]",[change the line and that where not possible u...,"[who step out of line, People change the line ..."
12,1668.34,1668,34,47,Technology,has been one of the most significant disruptor...,5.0,37,,Technology has been one of the most significan...,Technology has been one of the most significan...,"(Technology, has, been, one, of, the, most, si...","[[], [Technology, has, been, one, of, the, mos...","[[has, been, one, of, the, most, significant, ...","[0, 1, 2, 4, 5, 6, 7, 8, 9, 10, 11, 12]",[has been one of the most significant disrupto...,[Technology has been one of the most significa...
16,1681.31,1681,31,31,My father,was a hard worker and always tried to do his b...,2.5,48,,My father was a hard worker and always tried t...,My father was a hard worker and always tried t...,"(My, father, was, a, hard, worker, and, always...","[[My, father, was, a, hard, worker, and], [alw...","[[was, a, hard, worker, and], [always, tried, ...","[0, 1, 3]","[was a hard worker and, always tried to do his...","[My father was a hard worker and, always tried..."
20,1698.25,1698,25,25,My main problem is,I haven&#039;t figured out how to warp the tem...,4.5,27,,My main problem is I haven&#039;t figured out ...,My main problem is I haven t figured out how t...,"(My, main, problem, is, I, haven, t, figured, ...","[[My, main, problem, is], [I, haven, t, figure...","[[I, haven, t, figured, out, how, to, warp, th...","[0, 2, 3, 4, 6, 7]",[I haven t figured out how to warp the tempora...,"[My main problem is, I haven t figured out how..."
25,1717.13,1717,13,40,We could make the world a better place if,we would truly listen to each other enough to ...,5.5,22,,We could make the world a better place if we w...,We could make the world a better place if we w...,"(We, could, make, the, world, a, better, place...","[[], [We, could, make, the, world, a, better, ...","[[we, would, truly, listen, to, each, other, e...","[0, 1, 2, 4, 5]","[we would truly listen to each other enough, t...","[We could make the world a better place, if we..."
27,1731.11,1731,11,39,What I like to do best is,"be here in this moment with all parts of me, h...",5.5,30,,What I like to do best is be here in this mome...,What I like to do best is be here in this mome...,"(What, I, like, to, do, best, is, be, here, in...","[[I, like, What, to, do, best], [What, to, do,...","[[be, here, in, this, moment, with, all, parts...","[0, 1, 3]",[be here in this moment with all parts of me h...,"[I like What to do best, is be here in this mo..."


The voice of each clause in the actual sentence is determined using the rules below.

In [10]:
a_poss, p_yn, p_beverb, p_get, a_def, undef = "A_pron_x", "P_yn", "P_bevb_x", "P_get_x", "A_def", "Undefined"

def voice_rule_engine(clause):
    """Label a clause with the first rule that matches; the order of the checks matters."""
    if not any(x.pos_ == "VERB" for x in clause):
        return undef
    
    for x in clause:
        if x.dep_ == "poss":
            return a_poss
        
    ct = 0
    for x in clause:
        if x.text.lower().strip() in ['yes', 'no']:
            ct += 1
    if ct >= len(clause)/2:
        return p_yn

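    # Besides forms of 'be', this list also holds 'have'/'do' forms and modals. Entries ending
    # in 'n' (e.g. 'isn') are what remains of contractions after preprocessing.
    BEING_VERBS = ['be', 'am', 'is', 'isn', 'are', 'aren', \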
                   'was', 'were', 'wasn', 'weren', 'been', 'being', \
                   'have', 'haven', 'has', 'hasn', 'could', 'couldn', \
                   'should', 'shouldn', 'would', 'wouldn', 'may', 'might', 'mightn', \
                   'must','mustn', 'shall', 'can', 'will', \
                   'do', 'don', 'did', 'didn', 'does', 'doesn', 'having']
    for x in clause:
        if x.text.lower().strip() in BEING_VERBS and x.pos_ == "VERB":
            return p_beverb

    for x in clause:
        if x.dep_ == "acomp":
            return p_get
    
    return a_def
    
def clauses_voice(arr_of_clauses):
    op = []
    for clause in arr_of_clauses:
        voice = voice_rule_engine(clause)
        op.append(voice)         
    return op

df['voice'] = df.clauses_doc_final.apply(clauses_voice)
df["voice_filtered"] = df.apply(lambda row: [row["voice"][i] for i in range(len(row["voice"])) if i in row["valid_indices_per_doc"]], axis = 1)
df["voice"] = df["voice_filtered"]
df[['sentence', 'clauses_doc_final', 'voice', "voice_filtered"]].sample(frac = 1).head()

Unnamed: 0,sentence,clauses_doc_final,voice,voice_filtered
283,Children and parents are lucky when when they ...,"[[when, they, go, to, special, places]]",[A_def],[A_def]
49,If I were in charge then i would be the I of all.,"[[then], [i, would, be, the, I, of, all]]","[Undefined, P_bevb_x]","[Undefined, P_bevb_x]"
349,What gets me into trouble is going through the...,"[[going, through, the, black, gates, but], [I,...","[A_def, P_bevb_x, P_bevb_x]","[A_def, P_bevb_x, P_bevb_x]"
321,Bullying could be stopped if we tell the polic...,"[[we, tell, the, police, to, stop, them], [to,...",[A_def],[A_def]
367,What I like to do best is read,[[read]],[A_def],[A_def]
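
The rule cascade can be tried out without a parser by feeding it hand-tagged tokens. In this sketch `Tok` is a hypothetical stand-in for a spaCy token, `AUX_LIKE` is an abbreviated version of the notebook's verb list, and the part-of-speech and dependency tags are assigned by hand, so the labels show how the rule order behaves and are not a reproduction of the model's output.

```python
from collections import namedtuple

# Hypothetical stand-in for a spaCy token with only the attributes the rules read.
Tok = namedtuple("Tok", "text pos_ dep_")

# Abbreviated version of the notebook's verb list, for illustration only.
AUX_LIKE = {"be", "am", "is", "are", "was", "were", "would", "have", "do"}

def voice_of(clause):
    """Return the label of the first matching rule, checked in the notebook's order."""
    if not any(t.pos_ == "VERB" for t in clause):
        return "Undefined"
    if any(t.dep_ == "poss" for t in clause):
        return "A_pron_x"
    yes_no = sum(t.text.lower() in ("yes", "no") for t in clause)
    if yes_no >= len(clause) / 2:
        return "P_yn"
    if any(t.text.lower() in AUX_LIKE and t.pos_ == "VERB" for t in clause):
        return "P_bevb_x"
    if any(t.dep_ == "acomp" for t in clause):
        return "P_get_x"
    return "A_def"

no_verb = [Tok("then", "ADV", "advmod")]
with_aux = [Tok("i", "PRON", "nsubj"), Tok("would", "VERB", "aux"), Tok("be", "VERB", "ROOT")]
plain = [Tok("read", "VERB", "ROOT")]
possessive = [Tok("needs", "VERB", "ROOT"), Tok("my", "DET", "poss"), Tok("help", "NOUN", "dobj")]

labels = [voice_of(c) for c in (no_verb, with_aux, plain, possessive)]
print(labels)  # ['Undefined', 'P_bevb_x', 'A_def', 'A_pron_x']
```

Because the possessive check runs before the verb-list check, a clause containing both a possessive and an auxiliary is labelled `A_pron_x`.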


In [11]:
df[df["clauses_text_final"].apply(len) != df["voice"].apply(len)].shape[0] # assert 0

0

This is the visualization of each sentence's parse tree. The output for each sentence in the input dataframe is in the /html folder. 

In [12]:
!pip install "msgpack-numpy<0.4.4.0"



In [15]:
def htmlise(row):
    html_fs = """
    <html>
        <head>
            <title>{}</title>
        </head>
        <body>
            <div>{}</div>
            <div>{}</div>
            <div>{}</div>
        </body>
    </html>"""
    op = spacy.displacy.render(row.nlp_doc, style='dep')
    with open("./html/file_{}.html".format(row.UID), "w") as f:
        f.write(html_fs.format(row.prompt, row.response, row.clauses_text_final, op))
    return
        
df['idx'] = df.index
#df.apply(htmlise, axis = 1)
#print("HTML processing done")
spacy.displacy.render([df.iloc[0].nlp_doc], style='dep')

ValueError: buffer source array is read-only
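
One detail worth considering when writing these pages is that prompts and responses are free text and may contain characters such as `<` or `&`. The following is a hedged sketch of assembling a page with the text fields escaped; `build_page` and the `PAGE` template are hypothetical, and the `<svg>` string is only a placeholder for whatever markup the dependency renderer returns.

```python
import html

PAGE = """<html>
  <head><title>{title}</title></head>
  <body>
    <div>{response}</div>
    <div>{clauses}</div>
    <div>{parse_markup}</div>
  </body>
</html>"""

def build_page(prompt, response, clauses, parse_markup):
    """Escape the free-text fields; pass the pre-rendered parse markup through unchanged."""
    return PAGE.format(
        title=html.escape(prompt),
        response=html.escape(response),
        clauses=html.escape(", ".join(clauses)),
        parse_markup=parse_markup,
    )

page = build_page(
    "If my mother",
    "was more like me <3 & we talked",
    ["was more like me", "we talked"],
    "<svg></svg>",  # placeholder for the rendered dependency parse
)
print(page)
```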

The output is the split clauses together with their voice labels, stored in voice_classified.csv. This file is the input to the second notebook.

In [16]:
df[['UID', 'survey_id', 'prompt_number', 'prompt_id', 'prompt', 'response', 'clauses_text_final', 'voice', 'idx']].to_csv("./voice_classified.csv", index = False)
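
Because `clauses_text_final` and `voice` hold Python lists, they are written to the CSV as the string form of those lists. A reader of the file therefore gets strings back and has to parse them. The sketch below imitates that round trip with the standard library only; the row values (including the UID) are made up for illustration, and it assumes list cells are serialized with `str()`.

```python
import ast
import csv
import io

# Illustrative row; the UID is invented for the example.
row = {
    "UID": "0000.00",
    "clauses_text_final": ["then", "i would be the I of all"],
    "voice": ["Undefined", "P_bevb_x"],
}

# Write: every cell, including the lists, is stored as its string form.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(row))
writer.writeheader()
writer.writerow({key: str(value) for key, value in row.items()})

# Read: list columns come back as strings and need to be parsed.
buffer.seek(0)
loaded = next(csv.DictReader(buffer))
clauses = ast.literal_eval(loaded["clauses_text_final"])
voices = ast.literal_eval(loaded["voice"])

pairs = list(zip(clauses, voices))
print(pairs)  # [('then', 'Undefined'), ('i would be the I of all', 'P_bevb_x')]
```

`ast.literal_eval` only accepts Python literals, which makes it a safer choice than `eval` for turning these cells back into lists.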