# Clause Parser Algorithm with Custom Metrics

## Introduction
This is the first of four notebooks, which are meant to be run in sequence to qualify and quantify clauses. This notebook does the following:
1. Takes its input from input.csv, which has two required columns: "prompt" and "response". Joined together, these two columns form a coherent sentence.
2. Preprocesses the sentence to remove non-alphanumeric characters.
3. Splits the sentence into clauses such that each clause contains a verb.

Load the spaCy model, which is used to identify the verbs and, later, to determine the voice of each clause based on the rules described further down. The notebook originally used 'en_core_web_md'; the cell below loads the larger 'en_core_web_lg', which is suggested for better tokenization but ran out of space on Binder.

In [1]:
import sys  # tjm
print(sys.version) # tjm
# !python --version  -- does not work for tjm
# tjm done once already in terminal   :
# !python -m spacy download en_core_web_lg # tjm changed _md to _lg
# print("Downloaded")
#TODO: use en_core_web_lg in a better machine. lg is running out of space in binder. 

3.7.2 (default, Dec 29 2018, 00:00:04) 
[Clang 4.0.1 (tags/RELEASE_401/final)]


In [2]:
import spacy
import html
from spacy import displacy

nlp = spacy.load('en_core_web_lg') # tjm _md to _lg
print("Loaded models")

Loaded models


Get the input file from the current directory

In [3]:
from io import StringIO
import pandas as pd, numpy as np

df = pd.read_csv("./input.csv")
print(df.columns)
df.sample(frac=1).head()

Index(['prompt', 'response', 'score', 'PassAct', 'survOgive', 'survey_id',
       'prompt_number', 'prompt_id', 'UID'],
      dtype='object')


Unnamed: 0,prompt,response,score,PassAct,survOgive,survey_id,prompt_number,prompt_id,UID
33,Rules,are meant to be broken.,2.0,p,3.5,1864,18,42,01864.018.
194,What I like to do best is,employ useless concentration while dancing wit...,6.0,p,5.5,3287,11,39,03287.011.
52,If I were in charge,the first thing i would do is buying house to ...,2.5,a,4.0,2012,30,30,02012.030.
17,Sometimes I wish that,I had more money! LOL.,1.5,a,6.5,2338,36,48,02338.036.
0,We could make the world a better place if,we could stop smoking!,1.5,a,1.5,2500,13,40,02500.013.


Get the actual sentence by joining the prompt and response.

In [4]:
if "prompt" in df.columns: #Original dataset
    df['sentence'] = df.apply(lambda row : "{} {}".format(row['prompt'], row['response']), axis = 1)

df.sample(frac=1).head()

Unnamed: 0,prompt,response,score,PassAct,survOgive,survey_id,prompt_number,prompt_id,UID,sentence
133,The thing I like about myself is,above all else I love to learn by connecting d...,4.5,a,5.0,1894,6,6,01894.006.,The thing I like about myself is above all els...
94,My co-workers and I,do not have much contact. As an adjunct in the...,3.5,a,4.5,3423,7,38,03423.007.,My co-workers and I do not have much contact. ...
57,At times I worry about,"my children, my home and now also my business",2.5,a,3.5,1526,22,43,01526.022.,"At times I worry about my children, my home an..."
73,My mother and I,were both different and similar in some ways.,3.0,p,3.5,1666,7,7,01666.007.,My mother and I were both different and simila...
104,I am,hungry and excited to make a fresh and hot pum...,4.0,p,5.5,3339,23,23,03339.023.,I am hungry and excited to make a fresh and ho...


Preprocess the sentence to remove non-alphanumeric characters, then tokenize it with spaCy.

In [5]:
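import re, html
# Raw strings so that \s is read by the regex engine, not as a string escape.
PATTERN = r"[^a-zA-Z0-9\s]+"
rgx = re.compile(PATTERN)  # the character class already covers both cases

df['preprocessed_sentence'] = df['sentence'].apply(lambda ip : re.sub(r'\s+', ' ', rgx.sub(' ', html.unescape(ip))))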
print(df.columns, df.shape)
df.sample(frac=1).head()

Index(['prompt', 'response', 'score', 'PassAct', 'survOgive', 'survey_id',
       'prompt_number', 'prompt_id', 'UID', 'sentence',
       'preprocessed_sentence'],
      dtype='object') (220, 11)


Unnamed: 0,prompt,response,score,PassAct,survOgive,survey_id,prompt_number,prompt_id,UID,sentence,preprocessed_sentence
62,If I had more money,I would feel less constrained to travel to see...,3.0,p,4.0,2688,24,24,02688.024.,If I had more money I would feel less constrai...,If I had more money I would feel less constrai...
184,I am,...here. And lost inside of everyone and every...,6.0,p,5.5,3227,23,23,03227.023.,I am ...here. And lost inside of everyone and ...,I am here And lost inside of everyone and ever...
167,Sometimes I wish that,that every being could see what already is th...,5.5,a,5.5,3223,36,48,03223.036.,Sometimes I wish that that every being could s...,Sometimes I wish that that every being could s...
181,Raising a family,"is love swimming in all of creation, a living ...",6.0,p,6.5,3151,1,1,03151.001.,Raising a family is love swimming in all of cr...,Raising a family is love swimming in all of cr...
190,My father,was a thread of eternity through the instant o...,6.0,p,6.0,2182,31,31,02182.031.,My father was a thread of eternity through the...,My father was a thread of eternity through the...
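
To see what the cleaning step in cell 5 does to a single string, here is a minimal standalone sketch that uses only the standard library. The helper name `clean_sentence` and the sample strings are illustrative and not part of the notebook.

```python
import html
import re

NON_ALNUM = re.compile(r"[^a-zA-Z0-9\s]+")
WHITESPACE = re.compile(r"\s+")

def clean_sentence(text):
    """Hypothetical helper mirroring the lambda used in cell 5."""
    unescaped = html.unescape(text)          # "&amp;" becomes "&"
    spaced = NON_ALNUM.sub(" ", unescaped)   # punctuation becomes a space
    return WHITESPACE.sub(" ", spaced)       # collapse runs of whitespace

print(repr(clean_sentence("I don't know &amp; care.")))
print(repr(clean_sentence("Rules are meant to be broken.")))
```

Two side effects are worth keeping in mind: contractions are split ("don't" becomes "don t"), which is why fragments such as "I don" and "t know" show up in later clause output, and a trailing space is left behind because the result is not stripped.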


In [6]:
df['nlp_doc'] = df['preprocessed_sentence'].apply(lambda ip : nlp(ip))
print(df.columns)
df.sample(frac=1).head()

Index(['prompt', 'response', 'score', 'PassAct', 'survOgive', 'survey_id',
       'prompt_number', 'prompt_id', 'UID', 'sentence',
       'preprocessed_sentence', 'nlp_doc'],
      dtype='object')


Unnamed: 0,prompt,response,score,PassAct,survOgive,survey_id,prompt_number,prompt_id,UID,sentence,preprocessed_sentence,nlp_doc
122,Sometimes I wish that,making the world better would be easier and ev...,4.5,a,4.5,3379,36,48,03379.036.,Sometimes I wish that making the world better ...,Sometimes I wish that making the world better ...,"(Sometimes, I, wish, that, making, the, world,..."
217,Raising a family,Love enlivens reality through Kosmic expressio...,6.5,a,6.0,2721,1,1,02721.001.,Raising a family Love enlivens reality through...,Raising a family Love enlivens reality through...,"(Raising, a, family, Love, enlivens, reality, ..."
145,If I can't get what I want,"i'm frustrated, and sometimes feel sorry for m...",5.0,p,5.0,1904,32,32,01904.032.,"If I can't get what I want i'm frustrated, and...",If I can t get what I want i m frustrated and ...,"(If, I, can, t, get, what, I, want, i, m, frus..."
175,Being with other people,is attunement of a cacophony of frequencies th...,5.5,a,5.0,3409,5,5,03409.005.,Being with other people is attunement of a cac...,Being with other people is attunement of a cac...,"(Being, with, other, people, is, attunement, o..."
64,When I get mad,I am completely ineffective and irrational,3.0,p,4.5,3380,26,26,03380.026.,When I get mad I am completely ineffective and...,When I get mad I am completely ineffective and...,"(When, I, get, mad, I, am, completely, ineffec..."


### Actual splitting of clauses
#### Metrics

* Fraction of sentences with correct reconstructions from an existing dataset = 0.9061. The true rate is actually above 91%, because some reconstructions are counted as wrong only because a conjunction that follows a complex first clause is attached to that first (parent) clause.
* Response expected = actual verbatim : 
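
The notebook does not show how the reconstruction figure was computed. As one possible reading, and purely as an assumption, the sketch below counts a sentence as correctly reconstructed when its clauses, joined in order, reproduce the sentence. The function names and the sample data are hypothetical.

```python
def normalize(text):
    """Lowercase and collapse whitespace so spacing differences are ignored."""
    return " ".join(text.lower().split())

def reconstruction_rate(examples):
    """examples: list of (sentence, clauses) pairs.

    Returns the fraction of sentences whose clauses, joined in order,
    reproduce the sentence (an assumed definition, not the notebook's).
    """
    if not examples:
        return 0.0
    hits = sum(
        normalize(" ".join(clauses)) == normalize(sentence)
        for sentence, clauses in examples
    )
    return hits / len(examples)

examples = [
    ("When I am nervous I remain calm", ["When I am nervous", "I remain calm"]),
    ("My father is unknown to me", ["My father is unknown to me"]),
    ("The past you have to be careful", ["you have to be careful"]),
]
print(reconstruction_rate(examples))
```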

#### Algorithm
NOTE: See http://universaldependencies.org/ to understand the grammatical dependencies. To visualize each sentence, look in the html folder. The files there contain the parses, which can be used to determine direct parents and sub-sentences (clauses).
1. Each doc contains clauses such that they have a main verb. 
2. These verbs are connected together to make the entire document in Spacy.
3. We use a recursive method 'get_children' to determine whether a child verb links two clauses. If it does not (that is, it is an auxiliary verb (aux) or an open clausal complement (xcomp)), it is part of the same clause.
4. This gives an array of clauses, where each clause is an array of spaCy tokens.
5. This 2D array might contain one or more clauses that are sub-clauses of another clause in the same array. These are removed in postprocessing.
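
The steps above can be tried without a spaCy model. The sketch below is a simplified illustration: `Tok` is a toy stand-in for a spaCy token, the parse of the example sentence is built by hand (the labels are assumed, not model output), and the single-token postprocessing from the notebook is left out.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Tok:
    """Toy stand-in for a spaCy token (hypothetical, for illustration only)."""
    i: int
    text: str
    pos: str
    dep: str
    lefts: List["Tok"] = field(default_factory=list)
    rights: List["Tok"] = field(default_factory=list)

def get_children(tok):
    """Collect the tokens under `tok` that stay in the parent's clause."""
    if not tok.lefts and not tok.rights:
        return [tok]                       # leaf: always kept
    if tok.pos == "VERB" and tok.dep not in ("xcomp", "aux"):
        return []                          # a verb that starts its own clause
    out = []
    for left in tok.lefts:
        out.extend(get_children(left))
    out.append(tok)
    for right in tok.rights:
        out.extend(get_children(right))
    return out

def clause_split_by_verbs(tokens):
    """Build one candidate clause per verb, in sentence order."""
    clauses = []
    for tok in tokens:
        if tok.pos == "VERB":
            clause = []
            for left in tok.lefts:
                clause.extend(get_children(left))
            clause.append(tok)
            for right in tok.rights:
                clause.extend(get_children(right))
            clauses.append([t.text for t in clause])
    return clauses

# Hand-built parse of "I want to sleep because it rains".
i_ = Tok(0, "I", "PRON", "nsubj")
to = Tok(2, "to", "PART", "aux")
sleep = Tok(3, "sleep", "VERB", "xcomp", lefts=[to])
because = Tok(4, "because", "SCONJ", "mark")
it = Tok(5, "it", "PRON", "nsubj")
rains = Tok(6, "rains", "VERB", "advcl", lefts=[because, it])
want = Tok(1, "want", "VERB", "ROOT", lefts=[i_], rights=[sleep, rains])

clauses = clause_split_by_verbs([i_, want, to, sleep, because, it, rains])
print(clauses)
```

In this toy parse the xcomp verb "sleep" stays inside the clause headed by "want", while the advcl verb "rains" starts a clause of its own. The verb "sleep" also yields the candidate `["to", "sleep"]`, which is the kind of sub-clause that step 5 removes.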

In [7]:
def flatten_list(l):
    flat_list = [item for sublist in l for item in sublist]
    return flat_list

def get_children(doc):
    if len([x for x in doc.children]) == 0:
        return [doc]
    if doc.pos_ == "VERB" and doc.dep_ not in ["xcomp", "aux"]:
        return []

    op = flatten_list([get_children(l) for l in doc.lefts]) + [doc] + flatten_list([get_children(r) for r in doc.rights])
    return op

def postprocess(tokens_arr):
    if len(tokens_arr) == 1 and ( tokens_arr[0].dep_ in ["aux", "auxpass"] or tokens_arr[0].tag_ in ["VBG"]): 
        return []
    return tokens_arr

def get_text_from_tokens(tokens_arr):
    op = ' '.join([x.text for x in tokens_arr])
    op = op.replace(" nt", "nt").replace(" '", "'")
    return op

def clause_split_by_verbs(doc):
    op = []
    for token in doc:
        if token.pos_ == "VERB":
            arr = flatten_list([get_children(l) for l in token.lefts]) + [token] + flatten_list([get_children(r) for r in token.rights])
            arr = postprocess(arr)
            op.append(arr)
    if len(op)==0:
        op.append(doc)
    return op

df['split_by_verbs_arr'] = df['nlp_doc'].apply(clause_split_by_verbs)
df.sample(frac = 1).head()

Unnamed: 0,prompt,response,score,PassAct,survOgive,survey_id,prompt_number,prompt_id,UID,sentence,preprocessed_sentence,nlp_doc,split_by_verbs_arr
126,My co-workers and I,have established mutual respect and a two-way ...,4.5,a,4.5,2292,7,38,02292.007.,My co-workers and I have established mutual re...,My co workers and I have established mutual re...,"(My, co, workers, and, I, have, established, m...","[[], [My, co, workers, and, I, have, establish..."
153,What I like to do best is,be in a complete experience of effortless flow...,5.0,p,4.5,1831,11,39,01831.011.,What I like to do best is be in a complete exp...,What I like to do best is be in a complete exp...,"(What, I, like, to, do, best, is, be, in, a, c...","[[I, like, What, to, do, best], [What, to, do,..."
168,My conscience bothers me if,I don't clean up the messes I make as an imper...,5.5,a,5.0,1765,35,35,01765.035.,My conscience bothers me if I don't clean up t...,My conscience bothers me if I don t clean up t...,"(My, conscience, bothers, me, if, I, don, t, c...","[[My, conscience, bothers, me], [if, I, don, t..."
131,Change is,often quite hard for me if what is changing ha...,4.5,a,4.5,3303,3,3,03303.003.,Change is often quite hard for me if what is c...,Change is often quite hard for me if what is c...,"(Change, is, often, quite, hard, for, me, if, ...","[[Change, is, often, quite, hard, for, me], []..."
190,My father,was a thread of eternity through the instant o...,6.0,p,6.0,2182,31,31,02182.031.,My father was a thread of eternity through the...,My father was a thread of eternity through the...,"(My, father, was, a, thread, of, eternity, thr...","[[My, father, was, a, thread, of, eternity, th..."


Postprocess the DataFrame: remove the prompt tokens from each clause and drop every clause that is a strict subset of another clause.

In [8]:
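def remove_prompts(df):
    # NOTE: despite its name, `df` is a single row (the function is applied with axis=1).
    prompt, tokens_arr = df.prompt, df.split_by_verbs_arr
    pdoc = nlp(prompt)
    # Token positions occupied by the prompt. This assumes the raw prompt tokenizes
    # to the same positions as the start of the preprocessed sentence.
    ignore_indices = {x.i for x in pdoc}
    new_arr = []
    for clause in tokens_arr:
        new_clause = [t for t in clause if t.i not in ignore_indices]
        if len(new_clause) > 0:  # was `>= 0`, which is always true
            new_arr.append(new_clause)
    return new_arr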

def filter_valid_text_df(clauses_arr):
    new_arr = []
    # first pass
    first_pass = []
    tok_arr = [[ tok.i for tok in clause] for clause in clauses_arr]

    for i in range(len(tok_arr)):
        x = tok_arr[i]
        if len(x) ==  0:
            continue
        is_subset = False
        for y in tok_arr:
            if set(x).issubset(y) and not set(x) == set(y):
                is_subset = True
        if not is_subset:
            first_pass.append(i)
    new_arr = [idx for idx in first_pass if len(clauses_arr[idx]) > 0]
    return new_arr

def get_valid_text_df(row):
    clauses_arr = row["clauses_doc_final"]
    valid_indices = row["valid_indices_per_doc"]
    filtered_clauses = [get_text_from_tokens(clauses_arr[x]) for x in valid_indices]
    return filtered_clauses

def process_verbs_df(clauses_arr):
    new_arr = []
    # first pass
    first_pass = []
    tok_arr = [[ tok.i for tok in clause] for clause in clauses_arr]

    for i in range(len(tok_arr)):
        x = tok_arr[i]
        if len(x) ==  0:
            continue
        is_subset = False
        for y in tok_arr:
            if set(x).issubset(y) and not set(x) == set(y):
                is_subset = True
        if not is_subset:
            first_pass.append(clauses_arr[i])
    
    for clauses in first_pass:
        if len(clauses) == 0:
            continue
        txt = get_text_from_tokens(clauses)
        new_arr.append(txt)
    
    return new_arr
        
df['clauses_doc_final'] = df[['prompt', 'split_by_verbs_arr']].apply(remove_prompts, axis = 1) 
df["valid_indices_per_doc"] = df['clauses_doc_final'].apply(filter_valid_text_df)
df['clauses_text_final'] = df.apply(lambda row: get_valid_text_df(row), axis = 1)
df['split_by_verbs_arr_cleaned'] = df['split_by_verbs_arr'].apply(process_verbs_df)
df.sample(frac = 1).head(20)

Unnamed: 0,prompt,response,score,PassAct,survOgive,survey_id,prompt_number,prompt_id,UID,sentence,preprocessed_sentence,nlp_doc,split_by_verbs_arr,clauses_doc_final,valid_indices_per_doc,clauses_text_final,split_by_verbs_arr_cleaned
63,Education,is opportunity for people to grow,3.0,p,4.5,2559,9,9,02559.009.,Education is opportunity for people to grow,Education is opportunity for people to grow,"(Education, is, opportunity, for, people, to, ...","[[Education, is, opportunity], [for, people, t...","[[is, opportunity], [for, people, to, grow]]","[0, 1]","[is opportunity, for people to grow]","[Education is opportunity, for people to grow]"
31,When I am nervous,I get my tummy gets sick kind of.,2.0,p,2.0,2546,33,33,02546.033.,When I am nervous I get my tummy gets sick kin...,When I am nervous I get my tummy gets sick kin...,"(When, I, am, nervous, I, get, my, tummy, gets...","[[When, I, am, nervous], [I, get], [my, tummy,...","[[I, get], [my, tummy, gets, sick, kind, of]]","[0, 1]","[I get, my tummy gets sick kind of]","[When I am nervous, I get, my tummy gets sick ..."
134,Rules,", their necessity and their impact, are differ...",4.5,a,4.5,3106,18,42,03106.018.,"Rules , their necessity and their impact, are ...",Rules their necessity and their impact are dif...,"(Rules, their, necessity, and, their, impact, ...","[[their, impact, are, different, for, differen...","[[their, impact, are, different, for, differen...","[0, 1]",[their impact are different for different pers...,[their impact are different for different pers...
22,My father,is unknown to me,2.0,p,2.0,2783,31,31,02783.031.,My father is unknown to me,My father is unknown to me,"(My, father, is, unknown, to, me)","[[My, father, is, unknown, to, me]]","[[is, unknown, to, me]]",[0],[is unknown to me],[My father is unknown to me]
158,The past,Simultaneously non-existant a construct of mem...,5.0,p,4.5,2394,14,14,02394.014.,The past Simultaneously non-existant a constru...,The past Simultaneously non existant a constru...,"(The, past, Simultaneously, non, existant, a, ...","[[], [so, we, don], [t, repeat, the, same, bla...","[[so, we, don], [t, repeat, the, same, blasted...","[0, 1]","[so we don, t repeat the same blasted mistakes]","[so we don, t repeat the same blasted mistakes]"
62,If I had more money,I would feel less constrained to travel to see...,3.0,p,4.0,2688,24,24,02688.024.,If I had more money I would feel less constrai...,If I had more money I would feel less constrai...,"(If, I, had, more, money, I, would, feel, less...","[[If, I, had, more, money], [], [I, would, fee...","[[I, would, feel, less, constrained, to, trave...",[0],[I would feel less constrained to travel to se...,"[If I had more money, I would feel less constr..."
53,If I were in charge,I would just give the opportunity to one of m...,2.5,a,2.0,2546,30,30,02546.030.,If I were in charge I would just give the opp...,If I were in charge I would just give the oppo...,"(If, I, were, in, charge, I, would, just, give...","[[If, I, were, in, charge], [], [I, would, jus...","[[I, would, just, give, the, opportunity, to, ...","[0, 1]",[I would just give the opportunity to one of m...,"[If I were in charge, I would just give the op..."
136,A good boss,creates an environment in which you can succee...,4.5,a,4.5,1725,12,12,01725.012.,A good boss creates an environment in which yo...,A good boss creates an environment in which yo...,"(A, good, boss, creates, an, environment, in, ...","[[A, good, boss, creates, an, environment], []...","[[creates, an, environment], [in, which, you, ...","[0, 1, 2, 3, 4]","[creates an environment, in which you can succ...","[A good boss creates an environment, in which ..."
4,Children who step out of line,I don't know what that means (when children do...,1.5,a,2.0,2508,27,98,02508.027.,Children who step out of line I don't know wha...,Children who step out of line I don t know wha...,"(Children, who, step, out, of, line, I, don, t...","[[who, step, out, of, line], [I, don], [t, kno...","[[I, don], [t, know], [what, that, means], [wh...","[0, 1, 2, 3, 4, 5]","[I don, t know, what that means, when children...","[who step out of line, I don, t know, what tha..."
217,Raising a family,Love enlivens reality through Kosmic expressio...,6.5,a,6.0,2721,1,1,02721.001.,Raising a family Love enlivens reality through...,Raising a family Love enlivens reality through...,"(Raising, a, family, Love, enlivens, reality, ...","[[Raising, a, family], [Love, enlivens, realit...","[[Love, enlivens, reality, through, Kosmic, ex...","[0, 1]",[Love enlivens reality through Kosmic expressi...,"[Raising a family, Love enlivens reality throu..."
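
The subset filtering in cell 8 can be illustrated on plain lists of token positions. This is a minimal sketch; `valid_clause_indices` is a hypothetical name, and the input stands for the `tok.i` values of each clause.

```python
def valid_clause_indices(clauses):
    """clauses: list of lists of token positions.

    Returns the positions of the clauses that are non-empty and are not a
    strict subset of any other clause in the same list.
    """
    sets = [set(c) for c in clauses]
    keep = []
    for idx, current in enumerate(sets):
        if not current:
            continue
        if any(current < other for other in sets):   # `<` is strict subset
            continue
        keep.append(idx)
    return keep

# "The past you have to be careful": [you have to be careful], [to be careful]
print(valid_clause_indices([[2, 3, 4, 5, 6], [4, 5, 6]]))
# Identical clauses are not strict subsets of each other, so both are kept.
print(valid_clause_indices([[0, 1], [], [2, 3], [2, 3]]))
```

Returning indices rather than the clauses themselves lets the same selection be applied later to any list that is parallel to `clauses_doc_final`, such as the per-clause voice labels.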


In [9]:
#We will solve the inconsistency in voice length  using valid_indices
df[df["clauses_text_final"].apply(len) != df["clauses_doc_final"].apply(len)]

Unnamed: 0,prompt,response,score,PassAct,survOgive,survey_id,prompt_number,prompt_id,UID,sentence,preprocessed_sentence,nlp_doc,split_by_verbs_arr,clauses_doc_final,valid_indices_per_doc,clauses_text_final,split_by_verbs_arr_cleaned
6,The past,you have to be careful,1.5,a,1.5,2498,14,14,02498.014.,The past you have to be careful,The past you have to be careful,"(The, past, you, have, to, be, careful)","[[you, have, to, be, careful], [to, be, careful]]","[[you, have, to, be, careful], [to, be, careful]]",[0],[you have to be careful],[you have to be careful]
9,If I had more money,I would get to see the Star Wars movie actors ...,1.5,a,2.0,2532,24,24,02532.024.,If I had more money I would get to see the Sta...,If I had more money I would get to see the Sta...,"(If, I, had, more, money, I, would, get, to, s...","[[If, I, had, more, money], [], [I, would, get...","[[I, would, get, to, see, the, Star, Wars, mov...",[0],[I would get to see the Star Wars movie actors...,"[If I had more money, I would get to see the S..."
10,We could make the world a better place if,if we don't run on the concrete,1.5,a,1.5,2533,13,40,02533.013.,We could make the world a better place if if w...,We could make the world a better place if if w...,"(We, could, make, the, world, a, better, place...","[[], [We, could, make, the, world, a, better, ...","[[if, we, don, t, run, on, the, concrete], [ru...",[0],[if we don t run on the concrete],"[We could make the world a better place, if if..."
11,If my mother,liked watching TV we would probably watch it e...,1.5,a,2.0,2535,29,29,02535.029.,If my mother liked watching TV we would probab...,If my mother liked watching TV we would probab...,"(If, my, mother, liked, watching, TV, we, woul...","[[If, my, mother, liked, watching, TV], [watch...","[[liked, watching, TV], [watching, TV], [we, w...","[0, 2, 3]","[liked watching TV, we would probably watch it...","[If my mother liked watching TV, we would prob..."
21,If I had more money,I would be who I'm suppost to be.,2.0,p,3.0,2806,24,24,02806.024.,If I had more money I would be who I'm suppost...,If I had more money I would be who I m suppost...,"(If, I, had, more, money, I, would, be, who, I...","[[If, I, had, more, money], [], [I, would, be,...","[[I, would, be, who], [I, m], [suppost, to, be...","[0, 1, 2]","[I would be who, I m, suppost to be]","[If I had more money, I would be who, I m, sup..."
23,Children who step out of line,if it's to push in front of someone I wouldn't...,2.0,p,2.0,2545,27,98,02545.027.,Children who step out of line if it's to push ...,Children who step out of line if it s to push ...,"(Children, who, step, out, of, line, if, it, s...","[[who, step, out, of, line], [if, it, s, to, p...","[[if, it, s, to, push, in, front, of, someone]...","[0, 2]","[if it s to push in front of someone, I wouldn]","[who step out of line, if it s to push in fron..."
27,My father,gives you treats,2.0,p,1.5,2504,31,31,02504.031.,My father gives you treats,My father gives you treats,"(My, father, gives, you, treats)","[[My, father, gives, you, treats], [treats]]","[[gives, you, treats], [treats]]",[0],[gives you treats],[My father gives you treats]
33,Rules,are meant to be broken.,2.0,p,3.5,1864,18,42,01864.018.,Rules are meant to be broken.,Rules are meant to be broken,"(Rules, are, meant, to, be, broken)","[[], [Rules, are, meant, to, be, broken], [], ...","[[are, meant, to, be, broken], [to, be, broken]]",[0],[are meant to be broken],[Rules are meant to be broken]
41,When I am nervous,I remain calm and don't show it.,2.5,a,3.5,3178,33,33,03178.033.,When I am nervous I remain calm and don't show...,When I am nervous I remain calm and don t show...,"(When, I, am, nervous, I, remain, calm, and, d...","[[When, I, am, nervous], [I, remain, calm, and...","[[I, remain, calm, and], [don], [don, t, show,...","[0, 2]","[I remain calm and, don t show it]","[When I am nervous, I remain calm and, don t s..."
43,When I am nervous,I tend to clean and organize.,2.5,a,4.0,2601,33,33,02601.033.,When I am nervous I tend to clean and organize.,When I am nervous I tend to clean and organize,"(When, I, am, nervous, I, tend, to, clean, and...","[[When, I, am, nervous], [I, tend, to, clean, ...","[[I, tend, to, clean, and, organize], [to, cle...",[0],[I tend to clean and organize],"[When I am nervous, I tend to clean and organize]"


The voice of each clause in the actual sentence is determined using the rules below.

In [10]:
a_poss, p_yn, p_beverb, p_get, a_def, undef = "A_pron_x", "P_yn", "P_bevb_x", "P_get_x", "A_def", "Undefined"

def voice_rule_engine(clause):
    if True not in [x.pos_ == "VERB" for x in clause]:
        return undef
    
    for x in clause:
        if x.dep_ == "poss":
            return a_poss
        
    ct = 0
    for x in clause:
        if x.text.lower().strip() in ['yes', 'no']:
            ct += 1
    if ct >= len(clause)/2:
        return p_yn

    BEING_VERBS = ['be', 'am', 'is', 'isn', 'are', 'aren', \
                   'was', 'were', 'wasn', 'weren', 'been', 'being', \
                   'have', 'haven', 'has', 'hasn', 'could', 'couldn', \
                   'should', 'shouldn', 'would', 'wouldn', 'may', 'might', 'mightn', \
                   'must','mustn', 'shall', 'can', 'will', \
                   'do', 'don', 'did', 'didn', 'does', 'doesn', 'having']
    for x in clause:
        if x.text.lower().strip() in BEING_VERBS and x.pos_ == "VERB":
            return p_beverb

    for x in clause:
        if x.dep_ in ["advcl", "ROOT"] and x.text in ["get", "seem", "feel", "gets", "seems", "feels", "got", "seemed", "felt"]:
            return p_get
    
    return a_def
    
def clauses_voice(arr_of_clauses):
    op = []
    for clause in arr_of_clauses:
        voice = voice_rule_engine(clause)
        op.append(voice)         
    return op

df['voice'] = df.clauses_doc_final.apply(clauses_voice)
df["voice_filtered"] = df.apply(lambda row: [row["voice"][i] for i in range(len(row["voice"])) if i in row["valid_indices_per_doc"]], axis = 1)
df["voice"] = df["voice_filtered"]
df[['sentence', 'clauses_doc_final', 'voice', "voice_filtered"]].sample(frac = 1).head()

Unnamed: 0,sentence,clauses_doc_final,voice,voice_filtered
56,Crime and delinquency could be halted if peopl...,"[[people, were, loved, as, kids]]",[P_bevb_x],[P_bevb_x]
118,Raising a family is the most incredible advent...,"[[is, the, most, incredible, adventure], [I, v...","[P_bevb_x, P_bevb_x, P_bevb_x, P_bevb_x, A_def]","[P_bevb_x, P_bevb_x, P_bevb_x, P_bevb_x, A_def]"
194,What I like to do best is employ useless conce...,"[[employ, useless, concentration], [while, dan...","[A_def, A_def]","[A_def, A_def]"
126,My co-workers and I have established mutual re...,"[[established, mutual, respect, and, a, two, w...",[A_def],[A_def]
57,"At times I worry about my children, my home an...","[[my, children, my, home, and, now, also, my, ...",[Undefined],[Undefined]
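
The order of the rules in cell 10 matters, since the first matching rule wins. The sketch below replays that cascade on hand-labelled toy tokens. `Tok` is a stand-in for a spaCy token, the part-of-speech and dependency labels are assumed rather than produced by a model, and the verb list is abbreviated.

```python
from collections import namedtuple

Tok = namedtuple("Tok", "text pos dep")  # toy stand-in for a spaCy token

# Abbreviated for illustration; cell 10 uses a longer list.
BEING_VERBS = {"be", "am", "is", "are", "was", "were", "have", "has", "would", "do"}
GET_VERBS = {"get", "gets", "got", "seem", "seems", "seemed", "feel", "feels", "felt"}

def voice_of(clause):
    """Apply the rules in the same order as voice_rule_engine."""
    if not any(t.pos == "VERB" for t in clause):
        return "Undefined"
    if any(t.dep == "poss" for t in clause):
        return "A_pron_x"
    yes_no = sum(t.text.lower() in ("yes", "no") for t in clause)
    if yes_no >= len(clause) / 2:
        return "P_yn"
    if any(t.text.lower() in BEING_VERBS and t.pos == "VERB" for t in clause):
        return "P_bevb_x"
    if any(t.dep in ("advcl", "ROOT") and t.text in GET_VERBS for t in clause):
        return "P_get_x"
    return "A_def"

be_clause = [Tok("is", "VERB", "ROOT"), Tok("unknown", "ADJ", "acomp"),
             Tok("to", "ADP", "prep"), Tok("me", "PRON", "pobj")]
get_clause = [Tok("I", "PRON", "nsubj"), Tok("get", "VERB", "ROOT"),
              Tok("sick", "ADJ", "acomp")]
poss_clause = [Tok("my", "DET", "poss"), Tok("tummy", "NOUN", "nsubj"),
               Tok("gets", "VERB", "ROOT"), Tok("sick", "ADJ", "acomp")]
active_clause = [Tok("employ", "VERB", "ROOT"), Tok("useless", "ADJ", "amod"),
                 Tok("concentration", "NOUN", "dobj")]
no_verb = [Tok("my", "DET", "poss"), Tok("children", "NOUN", "pobj")]

for clause in (be_clause, get_clause, poss_clause, active_clause, no_verb):
    print(" ".join(t.text for t in clause), "->", voice_of(clause))
```

Note that the possessive rule fires before the "get" rule, so "my tummy gets sick" is labelled `A_pron_x` even though it contains "gets". The rules also rely on auxiliaries being tagged `VERB`, as in the parse shown in this notebook; with a model that tags them differently the checks may need adjusting.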


In [11]:
df[df["clauses_text_final"].apply(len) != df["voice"].apply(len)].shape[0] # assert 0

0

This is the visualization of each sentence's parse tree. The output for each sentence in the input dataframe is in the /html folder. 

In [12]:
!pip install "msgpack-numpy<0.4.4.0"  # tjm removed, not needed?

  utils.DeprecatedIn23,
DEPRECATION: Python 2.7 will reach the end of its life on January 1st, 2020. Please upgrade your Python as Python 2.7 won't be maintained after that date. A future version of pip will drop support for Python 2.7.
Collecting msgpack>=0.3.0 (from msgpack-numpy<0.4.4.0)
  Downloading https://files.pythonhosted.org/packages/81/9c/0036c66234482044070836cc622266839e2412f8108849ab0bfdeaab8578/msgpack-0.6.1.tar.gz (118kB)
    100% |████████████████████████████████| 122kB 210kB/s ta 0:00:01
Building wheels for collected packages: msgpack
  Building wheel for msgpack (setup.py) ... done
  Stored in directory: /Users/tmurray/Library/Caches/pip/wheels/e0/eb/73/79c4057260fcb51c5f12cee027dda5cf79b92b618a82529c74
Successfully built msgpack
Installing collected packages: msgpack
Successfully installed msgpack-0.6.1
You are using pip version 19.0.2, however version 19.0.3 is available.
You should consider upgrading via the 'pip install --upgrad

In [13]:
def htmlise(row):
    html_fs = """
    <html>
        <head>
            <title>{}</title>
        </head>
        <body>
            <div>{}</div>
            <div>{}</div>
            <div>{}</div>
        </body>
    </html>"""
    op = spacy.displacy.render(row.nlp_doc, style='dep')
    with open("./html/file_{}.html".format(row.UID), "w") as f:
        f.write(html_fs.format(row.prompt, row.response, row.clauses_text_final, op))
    return
        
df['idx'] = df.index
#df.apply(htmlise, axis = 1)
#print("HTML processing done")
spacy.displacy.render([df.iloc[0].nlp_doc], style='dep')

'<svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" id="0" class="displacy" width="2325" height="574.5" style="max-width: none; height: 574.5px; color: #000000; background: #ffffff; font-family: Arial">\n<text class="displacy-token" fill="currentColor" text-anchor="middle" y="484.5">\n    <tspan class="displacy-word" fill="currentColor" x="50">We</tspan>\n    <tspan class="displacy-tag" dy="2em" fill="currentColor" x="50">PRON</tspan>\n</text>\n\n<text class="displacy-token" fill="currentColor" text-anchor="middle" y="484.5">\n    <tspan class="displacy-word" fill="currentColor" x="225">could</tspan>\n    <tspan class="displacy-tag" dy="2em" fill="currentColor" x="225">VERB</tspan>\n</text>\n\n<text class="displacy-token" fill="currentColor" text-anchor="middle" y="484.5">\n    <tspan class="displacy-word" fill="currentColor" x="400">make</tspan>\n    <tspan class="displacy-tag" dy="2em" fill="currentColor" x="400">VERB</tspan>\n</text>\n\n<text class="d
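
Because `htmlise` places free-text prompts and responses directly into the page, characters such as `<` or `&` in a response could break the markup. The sketch below is a hypothetical variant that escapes those fields with the standard library. `build_page`, the sample strings, and the temporary output folder are assumptions for illustration; the SVG argument stands in for the string returned by `displacy.render`.

```python
import html
import tempfile
from pathlib import Path

PAGE = """<html>
    <head><title>{title}</title></head>
    <body>
        <div>{response}</div>
        <div>{clauses}</div>
        <div>{svg}</div>
    </body>
</html>"""

def build_page(prompt, response, clauses, svg):
    """Hypothetical variant of htmlise that escapes the free-text fields."""
    return PAGE.format(
        title=html.escape(prompt),
        response=html.escape(response),
        clauses=html.escape(str(clauses)),
        svg=svg,  # already markup, so it is inserted unescaped
    )

page = build_page("My father", "is <unknown> to me", ["is unknown to me"], "<svg></svg>")

out_dir = Path(tempfile.mkdtemp())            # stand-in for the ./html folder
out_file = out_dir / "file_02783.031..html"   # same naming scheme as htmlise
out_file.write_text(page, encoding="utf-8")
print(out_file.read_text(encoding="utf-8"))
```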

The output is the set of split clauses together with their voice labels. It is stored in voice_classified.csv, which is the input to the second notebook.

In [14]:
#df[['UID', 'survey_id', 'prompt_number', 'prompt_id', 'prompt', 'response', 'clauses_text_final', 'voice', 'idx']].to_csv("./voice_classified.csv", index = False)
# tjm: split above into two below:
df_out = df[['UID', 'survey_id', 'prompt_number', 'prompt_id', 'prompt', 'response', 'clauses_text_final', 'voice', 'score','PassAct','idx']]
df_out.to_csv("./voice_classified.csv", index = False)
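
One thing to watch when reading voice_classified.csv back: list-valued cells such as `clauses_text_final` and `voice` are written as their string representation, so a reader gets text rather than lists. The sketch below shows the round trip with the standard `csv` module and `ast.literal_eval`. The row values are illustrative, not copied from the notebook output, and how the second notebook actually loads the file is not shown here.

```python
import ast
import csv
import io

rows = [{
    "UID": "03178.033.",
    "clauses_text_final": ["I remain calm and", "don t show it"],
    "voice": ["A_def", "P_bevb_x"],   # illustrative labels
}]

# Write to an in-memory buffer instead of ./voice_classified.csv.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(rows[0]))
writer.writeheader()
writer.writerows(rows)                # list values are written via str()

buf.seek(0)
loaded = list(csv.DictReader(buf))
print(type(loaded[0]["voice"]))       # the cell comes back as a string

LIST_COLUMNS = ("clauses_text_final", "voice")
for row in loaded:
    for col in LIST_COLUMNS:
        row[col] = ast.literal_eval(row[col])   # safely parse the list literal
print(loaded[0]["voice"])
```

With pandas, the same parsing can be done at load time by passing `converters={"voice": ast.literal_eval, "clauses_text_final": ast.literal_eval}` to `pd.read_csv`.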