# Clause Parser Algorithm with Custom Metrics

## Introduction
This is the first of four notebooks that are to be run in sequence to qualify and quantify clauses. This notebook does the following:
1. Takes its input from input.csv, which has two required columns: "prompt" and "response". Joined together, these two columns form a coherent sentence.
2. Preprocesses the sentence to remove non-alphanumeric characters.
3. Splits the sentence into clauses such that each clause contains a verb.

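Load the spaCy model that is used to identify the verbs and, later, to determine the voice of each clause according to the rules elaborated below. The notebook originally used 'en_core_web_md'; the cell below loads the larger 'en_core_web_lg', which the original author recommends for better tokenization of words. Fall back to 'en_core_web_md' if the machine runs out of space.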

In [1]:
import sys  # tjm
print(sys.version) # tjm
# !python --version  -- does not work for tjm
# tjm done once already in terminal   :
# !python -m spacy download en_core_web_lg # tjm changed _md to _lg
# print("Downloaded")
#TODO: use en_core_web_lg in a better machine. lg is running out of space in binder. 

3.7.2 (default, Dec 29 2018, 00:00:04) 
[Clang 4.0.1 (tags/RELEASE_401/final)]


In [2]:
import spacy
import html
from spacy import displacy

nlp = spacy.load('en_core_web_lg') # tjm _md to _lg
print("Loaded models")

Loaded models
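If the larger model is not installed on a given machine, one option is to try model names in order and keep the first that loads. The sketch below uses a hypothetical helper, `load_first_available`, which is not part of the original notebook. It relies on `spacy.load` raising `OSError` when a model package cannot be found; the loader is passed in as an argument so the helper can be tried without spaCy.

```python
def load_first_available(names, loader):
    """Return (name, model) for the first name that `loader` can load.

    `loader` is any callable that raises OSError for a missing model,
    which is how spacy.load reports a model package that is not installed.
    """
    for name in names:
        try:
            return name, loader(name)
        except OSError:
            continue  # try the next (smaller) model
    raise OSError("none of the models could be loaded: {}".format(list(names)))

# Intended use in this notebook (requires spaCy and at least one model):
#   import spacy
#   model_name, nlp = load_first_available(
#       ["en_core_web_lg", "en_core_web_md"], spacy.load)
```

This only covers a missing model package; it does not catch a model that is installed but too large for the available memory.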


Get the input file from the current directory

In [3]:
from io import StringIO
import pandas as pd, numpy as np

df = pd.read_csv("./input.csv")
print(df.columns)
df.sample(frac=1).head()

Index(['prompt', 'response', 'score', 'PassAct', 'survOgive', 'survey_id',
       'prompt_number', 'prompt_id', 'UID'],
      dtype='object')


Unnamed: 0,prompt,response,score,PassAct,survOgive,survey_id,prompt_number,prompt_id,UID
80,My father,"is someone i would like to understand more, be...",3.5,a,3.5,2402,31,31,02402.031.
95,When I am nervous,I typically do better.,3.5,a,3.5,2155,33,33,02155.033.
7,If I had more money,you could buy lots of things,1.5,a,1.5,2550,24,24,02550.024.
18,My father,always smacks me,1.5,a,1.5,2503,31,31,02503.031.
48,My conscience bothers me if,when I accidentally hurt Chattie.,2.5,a,1.5,2505,35,35,02505.035.


Get the actual sentence by joining the prompt and response.

In [4]:
if "prompt" in df.columns: #Original dataset
    df['sentence'] = df.apply(lambda row : "{} {}".format(row['prompt'], row['response']), axis = 1)

df.sample(frac=1).head()

Unnamed: 0,prompt,response,score,PassAct,survOgive,survey_id,prompt_number,prompt_id,UID,sentence
160,Being with other people,is like watching life itself unfold through th...,5.5,a,6.0,1823,5,5,01823.005.,Being with other people is like watching life ...
39,I just can't stand people who,are very annoying and do silly stuff in the cl...,2.0,p,2.0,2513,21,21,02513.021.,I just can't stand people who are very annoyin...
141,The past,is a reflection that often gives us the impres...,5.0,p,5.0,3184,14,14,03184.014.,The past is a reflection that often gives us t...
140,A healthy person,"takes care of their body, mind and spirit and ...",5.0,p,4.5,2283,12,52,02283.012.,"A healthy person takes care of their body, min..."
98,Crime and delinquency could be halted if,If we defined no crime and no delinquency.,3.5,a,5.5,1806,19,19,01806.019.,Crime and delinquency could be halted if If we...


Preprocess each sentence to remove non-alphanumeric characters, then tokenize it with spaCy.

In [5]:
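import re, html
# Raw strings stop "\s" from being read as an (invalid) string escape.
PATTERN = r"[^a-zA-Z0-9\s]+"
rgx = re.compile(PATTERN, re.IGNORECASE)

# Unescape HTML entities, replace runs of non-alphanumerics with a space, then collapse whitespace.
df['preprocessed_sentence'] = df['sentence'].apply(lambda ip : re.sub(r'\s+', ' ', rgx.sub(' ', html.unescape(ip))))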
print(df.columns, df.shape)
df.sample(frac=1).head()

Index(['prompt', 'response', 'score', 'PassAct', 'survOgive', 'survey_id',
       'prompt_number', 'prompt_id', 'UID', 'sentence',
       'preprocessed_sentence'],
      dtype='object') (220, 11)


Unnamed: 0,prompt,response,score,PassAct,survOgive,survey_id,prompt_number,prompt_id,UID,sentence,preprocessed_sentence
59,When people are helpless,I feel bad. I hate when I feel like,2.5,a,3.5,2670,10,10,02670.010.,When people are helpless I feel bad. I hate wh...,When people are helpless I feel bad I hate whe...
173,My co-workers and I,move in a way that allows humanity-the univers...,5.5,a,5.5,1804,7,38,01804.007.,My co-workers and I move in a way that allows ...,My co workers and I move in a way that allows ...
96,When they avoided me,I watched my embodied response. And they kept ...,3.5,a,5.5,2845,17,17,02845.017.,When they avoided me I watched my embodied res...,When they avoided me I watched my embodied res...
87,The thing I like about myself is,my ability to adapt to any situation.,3.5,a,3.5,1864,6,6,01864.006.,The thing I like about myself is my ability to...,The thing I like about myself is my ability to...
32,Privacy,when I am in the shower I like privacy.,2.0,p,2.0,2532,15,41,02532.015.,Privacy when I am in the shower I like privacy.,Privacy when I am in the shower I like privacy
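To make the effect of the preprocessing step concrete, here is a minimal standalone sketch of the same three operations applied to a made-up sentence. The function name `preprocess` and the sample text are illustrative and not part of the notebook.

```python
import html
import re

NON_ALNUM = re.compile(r"[^a-zA-Z0-9\s]+")

def preprocess(text):
    """Unescape HTML entities, blank out non-alphanumerics, collapse whitespace."""
    unescaped = html.unescape(text)          # "&amp;" becomes "&"
    spaced = NON_ALNUM.sub(" ", unescaped)   # punctuation becomes a space
    return re.sub(r"\s+", " ", spaced)       # runs of whitespace become one space

result = preprocess("My co-workers &amp; I don't agree.")
print(repr(result))
```

Two side effects are worth keeping in mind: contractions and hyphenated words are split ("don t", "co workers"), as the output rows above show, and a sentence that ends in punctuation keeps a single trailing space.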


In [6]:
df['nlp_doc'] = df['preprocessed_sentence'].apply(lambda ip : nlp(ip))
print(df.columns)
df.sample(frac=1).head()

Index(['prompt', 'response', 'score', 'PassAct', 'survOgive', 'survey_id',
       'prompt_number', 'prompt_id', 'UID', 'sentence',
       'preprocessed_sentence', 'nlp_doc'],
      dtype='object')


Unnamed: 0,prompt,response,score,PassAct,survOgive,survey_id,prompt_number,prompt_id,UID,sentence,preprocessed_sentence,nlp_doc
110,When I am nervous,My biology unconsciously and consciously overw...,4.0,p,5.0,2854,33,33,02854.033.,When I am nervous My biology unconsciously and...,When I am nervous My biology unconsciously and...,"(When, I, am, nervous, My, biology, unconsciou..."
192,My main problem is,"No problem at all, just that this Life is a Bl...",6.0,p,5.5,3212,25,25,03212.025.,"My main problem is No problem at all, just tha...",My main problem is No problem at all just that...,"(My, main, problem, is, No, problem, at, all, ..."
107,If I can't get what I want,I let go and be with what is unfolding.,4.0,p,4.5,1876,32,32,01876.032.,If I can't get what I want I let go and be wit...,If I can t get what I want I let go and be wit...,"(If, I, can, t, get, what, I, want, I, let, go..."
174,My conscience bothers me if,I don't clean up the messes I make as an imper...,5.5,a,5.0,1765,35,35,01765.035.,My conscience bothers me if I don't clean up t...,My conscience bothers me if I don t clean up t...,"(My, conscience, bothers, me, if, I, don, t, c..."
106,The past,might be considered a richly textured tapestry...,4.0,p,4.0,1884,14,14,01884.014.,The past might be considered a richly textured...,The past might be considered a richly textured...,"(The, past, might, be, considered, a, richly, ..."


### Actual splitting of clauses
#### Metrics

* Total % of sentences with correct reconstructions from an existing dataset = 0.9061. The actual figure is greater than 91%, because when a complex first clause is followed by a conjunction, the conjunction is placed with the parent clause in the first clause.
* Response expected = actual verbatim : 

#### Algorithm
NOTE: See http://universaldependencies.org/ for an explanation of the grammatical dependencies. To visualize each sentence, look in the html folder. The files there contain the dependency parses, which can be used to determine direct parents and sub-sentences (clauses).
1. Each doc contains clauses, each of which has a main verb.
2. spaCy connects these verbs in a dependency tree that spans the entire document.
3. The recursive function 'get_children' determines whether a child verb links two clauses. A child verb that does not link two clauses, that is, an auxiliary verb (aux) or an open clausal complement (xcomp), is part of the same clause.
4. This gives an array of clauses, where each clause is an array of spaCy tokens.
5. This 2D array may contain clauses that are sub-clauses of another clause in the same array. These are removed in postprocessing.
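The steps above can be sketched without spaCy on a hand-built dependency tree. `Tok` is a hypothetical stand-in for a spaCy token, and the POS and dependency labels below are assumed for illustration rather than taken from a real parse; the traversal mirrors the recursion described in step 3.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Tok:
    """Hypothetical stand-in for a spaCy token."""
    text: str
    pos: str
    dep: str
    lefts: List["Tok"] = field(default_factory=list)
    rights: List["Tok"] = field(default_factory=list)

def collect(tok):
    """Tokens under `tok` that stay in the current clause."""
    if not tok.lefts and not tok.rights:
        return [tok]  # leaf: always kept
    if tok.pos == "VERB" and tok.dep not in ("xcomp", "aux"):
        return []     # verb heading its own clause: stop here
    out = []
    for child in tok.lefts:
        out.extend(collect(child))
    out.append(tok)
    for child in tok.rights:
        out.extend(collect(child))
    return out

def clause_for(verb):
    """One clause: the verb plus everything collected from its children."""
    out = []
    for child in verb.lefts:
        out.extend(collect(child))
    out.append(verb)
    for child in verb.rights:
        out.extend(collect(child))
    return out

# Hand-built parse of "When they avoided me I questioned myself".
avoided = Tok("avoided", "VERB", "advcl",
              lefts=[Tok("When", "ADV", "advmod"), Tok("they", "PRON", "nsubj")],
              rights=[Tok("me", "PRON", "dobj")])
questioned = Tok("questioned", "VERB", "ROOT",
                 lefts=[avoided, Tok("I", "PRON", "nsubj")],
                 rights=[Tok("myself", "PRON", "dobj")])

clauses = [[t.text for t in clause_for(v)] for v in (avoided, questioned)]
print(clauses)
```

Because "avoided" is a verb whose relation is neither aux nor xcomp, the traversal from "questioned" stops at it, so the two verbs end up in separate clauses.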

In [7]:
def flatten_list(l):
    flat_list = [item for sublist in l for item in sublist]
    return flat_list

def get_children(doc):
    if len([x for x in doc.children]) == 0:
        return [doc]
    if doc.pos_ == "VERB" and doc.dep_ not in ["xcomp", "aux"]:
        return []

    op = flatten_list([get_children(l) for l in doc.lefts]) + [doc] + flatten_list([get_children(r) for r in doc.rights])
    return op

def postprocess(tokens_arr):
    if len(tokens_arr) == 1 and ( tokens_arr[0].dep_ in ["aux", "auxpass"] or tokens_arr[0].tag_ in ["VBG"]): 
        return []
    return tokens_arr

def get_text_from_tokens(tokens_arr):
    op = ' '.join([x.text for x in tokens_arr])
    op = op.replace(" nt", "nt").replace(" '", "'")
    return op

def clause_split_by_verbs(doc):
    op = []
    for token in doc:
        if token.pos_ == "VERB":
            arr = flatten_list([get_children(l) for l in token.lefts]) + [token] + flatten_list([get_children(r) for r in token.rights])
            arr = postprocess(arr)
            op.append(arr)
    if len(op)==0:
        op.append(doc)
    return op

df['split_by_verbs_arr'] = df['nlp_doc'].apply(clause_split_by_verbs)
df.sample(frac = 1).head()

Unnamed: 0,prompt,response,score,PassAct,survOgive,survey_id,prompt_number,prompt_id,UID,sentence,preprocessed_sentence,nlp_doc,split_by_verbs_arr
95,When I am nervous,I typically do better.,3.5,a,3.5,2155,33,33,02155.033.,When I am nervous I typically do better.,When I am nervous I typically do better,"(When, I, am, nervous, I, typically, do, better)","[[When, I, am, nervous], [I, typically, do, be..."
51,When they avoided me,I questioned myself.,2.5,a,3.5,2807,17,17,02807.017.,When they avoided me I questioned myself.,When they avoided me I questioned myself,"(When, they, avoided, me, I, questioned, myself)","[[When, they, avoided, me], [I, questioned, my..."
186,Sometimes I wish that,I could see what's ahead.. but when I close my...,6.0,p,6.0,1823,36,48,01823.036.,Sometimes I wish that I could see what's ahead...,Sometimes I wish that I could see what s ahead...,"(Sometimes, I, wish, that, I, could, see, what...","[[Sometimes, I, wish], [], [that, I, could, se..."
85,People who step out of line,should be given the chance to get back to norm...,3.5,a,4.0,2281,27,45,02281.027.,People who step out of line should be given th...,People who step out of line should be given th...,"(People, who, step, out, of, line, should, be,...","[[who, step, out, of, line], [], [], [People, ..."
28,When I am nervous,I don't really get nervous so I don't really know,2.0,p,1.5,2550,33,33,02550.033.,When I am nervous I don't really get nervous s...,When I am nervous I don t really get nervous s...,"(When, I, am, nervous, I, don, t, really, get,...","[[When, I, am, nervous], [I, don], [t, really,..."


Post-process the DataFrame: remove the prompt tokens from each clause, then drop clauses that are sub-clauses of another clause.

In [8]:
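def remove_prompts(row):
    prompt, tokens_arr = row.prompt, row.split_by_verbs_arr
    # Token positions covered by the prompt. NOTE: the raw prompt is tokenized here,
    # so this assumes it yields as many tokens as it does in the preprocessed sentence.
    pdoc = nlp(prompt)
    ignore_indices = {x.i for x in pdoc}
    new_arr = []
    for clause in tokens_arr:
        new_clause = [t for t in clause if t.i not in ignore_indices]
        if len(new_clause) > 0:  # keep only clauses that still have tokens
            new_arr.append(new_clause)
    return new_arr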

def filter_valid_text_df(clauses_arr):
    new_arr = []
    # first pass
    first_pass = []
    tok_arr = [[ tok.i for tok in clause] for clause in clauses_arr]

    for i in range(len(tok_arr)):
        x = tok_arr[i]
        if len(x) ==  0:
            continue
        is_subset = False
        for y in tok_arr:
            if set(x).issubset(y) and not set(x) == set(y):
                is_subset = True
        if not is_subset:
            first_pass.append(i)
    new_arr = [idx for idx in first_pass if len(clauses_arr[idx]) > 0]
    return new_arr

def get_valid_text_df(row):
    clauses_arr = row["clauses_doc_final"]
    valid_indices = row["valid_indices_per_doc"]
    filtered_clauses = [get_text_from_tokens(clauses_arr[x]) for x in valid_indices]
    return filtered_clauses

def process_verbs_df(clauses_arr):
    new_arr = []
    # first pass
    first_pass = []
    tok_arr = [[ tok.i for tok in clause] for clause in clauses_arr]

    for i in range(len(tok_arr)):
        x = tok_arr[i]
        if len(x) ==  0:
            continue
        is_subset = False
        for y in tok_arr:
            if set(x).issubset(y) and not set(x) == set(y):
                is_subset = True
        if not is_subset:
            first_pass.append(clauses_arr[i])
    
    for clauses in first_pass:
        if len(clauses) == 0:
            continue
        txt = get_text_from_tokens(clauses)
        new_arr.append(txt)
    
    return new_arr
        
df['clauses_doc_final'] = df[['prompt', 'split_by_verbs_arr']].apply(remove_prompts, axis = 1) 
df["valid_indices_per_doc"] = df['clauses_doc_final'].apply(filter_valid_text_df)
df['clauses_text_final'] = df.apply(lambda row: get_valid_text_df(row), axis = 1)
df['split_by_verbs_arr_cleaned'] = df['split_by_verbs_arr'].apply(process_verbs_df)
df.sample(frac = 1).head(20)

Unnamed: 0,prompt,response,score,PassAct,survOgive,survey_id,prompt_number,prompt_id,UID,sentence,preprocessed_sentence,nlp_doc,split_by_verbs_arr,clauses_doc_final,valid_indices_per_doc,clauses_text_final,split_by_verbs_arr_cleaned
90,Privacy,is both and valid and useful right and also ho...,3.5,a,4.0,2833,15,41,02833.015.,Privacy is both and valid and useful right and...,Privacy is both and valid and useful right and...,"(Privacy, is, both, and, valid, and, useful, r...","[[Privacy, is, both, and, valid, and, useful, ...","[[is, both, and, valid, and, useful, right, an...","[0, 1]","[is both and valid and useful right and, also ...",[Privacy is both and valid and useful right an...
121,When they avoided me,"I observed that I was being avoided, and noted...",4.5,a,4.5,3239,17,17,03239.017.,When they avoided me I observed that I was bei...,When they avoided me I observed that I was bei...,"(When, they, avoided, me, I, observed, that, I...","[[When, they, avoided, me], [I, observed], [],...","[[I, observed], [that, I, was, being, avoided,...","[0, 1, 2, 4, 6, 8, 9]","[I observed, that I was being avoided and, not...","[When they avoided me, I observed, that I was ..."
173,My co-workers and I,move in a way that allows humanity-the univers...,5.5,a,5.5,1804,7,38,01804.007.,My co-workers and I move in a way that allows ...,My co workers and I move in a way that allows ...,"(My, co, workers, and, I, move, in, a, way, th...","[[My, co, workers, and, I, move, in, a, way], ...","[[in, a, way], [that, allows], [humanity, the,...","[0, 1, 2]","[in a way, that allows, humanity the universe ...","[My co workers and I move in a way, that allow..."
51,When they avoided me,I questioned myself.,2.5,a,3.5,2807,17,17,02807.017.,When they avoided me I questioned myself.,When they avoided me I questioned myself,"(When, they, avoided, me, I, questioned, myself)","[[When, they, avoided, me], [I, questioned, my...","[[I, questioned, myself]]",[0],[I questioned myself],"[When they avoided me, I questioned myself]"
180,Being with other people,Is a mutually mutating meeting of universes,6.0,p,5.5,1665,5,5,01665.005.,Being with other people Is a mutually mutating...,Being with other people Is a mutually mutating...,"(Being, with, other, people, Is, a, mutually, ...","[[Being, with, other, people], [Is, a, meeting...","[[Is, a, meeting, of, universes], [mutually, m...","[0, 1]","[Is a meeting of universes, mutually mutating]","[Being with other people, Is a meeting of univ..."
57,At times I worry about,"my children, my home and now also my business",2.5,a,3.5,1526,22,43,01526.022.,"At times I worry about my children, my home an...",At times I worry about my children my home and...,"(At, times, I, worry, about, my, children, my,...","[[At, times, I, worry, about, my, children, my...","[[my, children, my, home, and, now, also, my, ...",[0],[my children my home and now also my business],[At times I worry about my children my home an...
6,The past,you have to be careful,1.5,a,1.5,2498,14,14,02498.014.,The past you have to be careful,The past you have to be careful,"(The, past, you, have, to, be, careful)","[[you, have, to, be, careful], [to, be, careful]]","[[you, have, to, be, careful], [to, be, careful]]",[0],[you have to be careful],[you have to be careful]
127,At times I worry about,not seeing and embodying the truth.,4.5,a,4.5,3425,22,43,03425.022.,At times I worry about not seeing and embodyin...,At times I worry about not seeing and embodyin...,"(At, times, I, worry, about, not, seeing, and,...","[[At, times, I, worry, about], [not, seeing, a...","[[not, seeing, and], [embodying, the, truth]]","[0, 1]","[not seeing and, embodying the truth]","[At times I worry about, not seeing and, embod..."
110,When I am nervous,My biology unconsciously and consciously overw...,4.0,p,5.0,2854,33,33,02854.033.,When I am nervous My biology unconsciously and...,When I am nervous My biology unconsciously and...,"(When, I, am, nervous, My, biology, unconsciou...","[[When, I, am, nervous, My, biology, unconscio...","[[My, biology, unconsciously, and], [conscious...","[0, 1, 2]","[My biology unconsciously and, consciously ove...",[When I am nervous My biology unconsciously an...
58,My conscience bothers me if,I have done something wrong to someone else.,2.5,a,3.5,1663,35,35,01663.035.,My conscience bothers me if I have done someth...,My conscience bothers me if I have done someth...,"(My, conscience, bothers, me, if, I, have, don...","[[My, conscience, bothers, me], [], [if, I, ha...","[[I, have, done, something, wrong, to, someone...",[0],[I have done something wrong to someone else],"[My conscience bothers me, if I have done some..."
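The sub-clause filter can be illustrated on plain lists of token positions. This is a minimal sketch of the same strict-subset test; the function name `maximal_clause_indices` is hypothetical, and the positions are worked out by hand from the "The past you have to be careful" row above.

```python
def maximal_clause_indices(clauses):
    """Indices of non-empty clauses that are not a strict subset of another clause.

    Each clause is a list of token positions within the sentence.
    """
    sets = [set(c) for c in clauses]
    keep = []
    for i, current in enumerate(sets):
        if not current:
            continue
        # `<` on sets is the strict-subset test, so identical clauses are both kept.
        if any(current < other for other in sets):
            continue
        keep.append(i)
    return keep

# After prompt removal the clauses cover positions 2..6 ("you have to be careful")
# and 4..6 ("to be careful"); only the first one survives.
kept = maximal_clause_indices([[2, 3, 4, 5, 6], [4, 5, 6]])
print(kept)
```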


In [9]:
#We will solve the inconsistency in voice length  using valid_indices
df[df["clauses_text_final"].apply(len) != df["clauses_doc_final"].apply(len)]

Unnamed: 0,prompt,response,score,PassAct,survOgive,survey_id,prompt_number,prompt_id,UID,sentence,preprocessed_sentence,nlp_doc,split_by_verbs_arr,clauses_doc_final,valid_indices_per_doc,clauses_text_final,split_by_verbs_arr_cleaned
6,The past,you have to be careful,1.5,a,1.5,2498,14,14,02498.014.,The past you have to be careful,The past you have to be careful,"(The, past, you, have, to, be, careful)","[[you, have, to, be, careful], [to, be, careful]]","[[you, have, to, be, careful], [to, be, careful]]",[0],[you have to be careful],[you have to be careful]
9,If I had more money,I would get to see the Star Wars movie actors ...,1.5,a,2.0,2532,24,24,02532.024.,If I had more money I would get to see the Sta...,If I had more money I would get to see the Sta...,"(If, I, had, more, money, I, would, get, to, s...","[[If, I, had, more, money], [], [I, would, get...","[[I, would, get, to, see, the, Star, Wars, mov...",[0],[I would get to see the Star Wars movie actors...,"[If I had more money, I would get to see the S..."
10,We could make the world a better place if,if we don't run on the concrete,1.5,a,1.5,2533,13,40,02533.013.,We could make the world a better place if if w...,We could make the world a better place if if w...,"(We, could, make, the, world, a, better, place...","[[], [We, could, make, the, world, a, better, ...","[[if, we, don, t, run, on, the, concrete], [ru...",[0],[if we don t run on the concrete],"[We could make the world a better place, if if..."
11,If my mother,liked watching TV we would probably watch it e...,1.5,a,2.0,2535,29,29,02535.029.,If my mother liked watching TV we would probab...,If my mother liked watching TV we would probab...,"(If, my, mother, liked, watching, TV, we, woul...","[[If, my, mother, liked, watching, TV], [watch...","[[liked, watching, TV], [watching, TV], [we, w...","[0, 2, 3]","[liked watching TV, we would probably watch it...","[If my mother liked watching TV, we would prob..."
21,If I had more money,I would be who I'm suppost to be.,2.0,p,3.0,2806,24,24,02806.024.,If I had more money I would be who I'm suppost...,If I had more money I would be who I m suppost...,"(If, I, had, more, money, I, would, be, who, I...","[[If, I, had, more, money], [], [I, would, be,...","[[I, would, be, who], [I, m], [suppost, to, be...","[0, 1, 2]","[I would be who, I m, suppost to be]","[If I had more money, I would be who, I m, sup..."
23,Children who step out of line,if it's to push in front of someone I wouldn't...,2.0,p,2.0,2545,27,98,02545.027.,Children who step out of line if it's to push ...,Children who step out of line if it s to push ...,"(Children, who, step, out, of, line, if, it, s...","[[who, step, out, of, line], [if, it, s, to, p...","[[if, it, s, to, push, in, front, of, someone]...","[0, 2]","[if it s to push in front of someone, I wouldn]","[who step out of line, if it s to push in fron..."
27,My father,gives you treats,2.0,p,1.5,2504,31,31,02504.031.,My father gives you treats,My father gives you treats,"(My, father, gives, you, treats)","[[My, father, gives, you, treats], [treats]]","[[gives, you, treats], [treats]]",[0],[gives you treats],[My father gives you treats]
33,Rules,are meant to be broken.,2.0,p,3.5,1864,18,42,01864.018.,Rules are meant to be broken.,Rules are meant to be broken,"(Rules, are, meant, to, be, broken)","[[], [Rules, are, meant, to, be, broken], [], ...","[[are, meant, to, be, broken], [to, be, broken]]",[0],[are meant to be broken],[Rules are meant to be broken]
41,When I am nervous,I remain calm and don't show it.,2.5,a,3.5,3178,33,33,03178.033.,When I am nervous I remain calm and don't show...,When I am nervous I remain calm and don t show...,"(When, I, am, nervous, I, remain, calm, and, d...","[[When, I, am, nervous], [I, remain, calm, and...","[[I, remain, calm, and], [don], [don, t, show,...","[0, 2]","[I remain calm and, don t show it]","[When I am nervous, I remain calm and, don t s..."
43,When I am nervous,I tend to clean and organize.,2.5,a,4.0,2601,33,33,02601.033.,When I am nervous I tend to clean and organize.,When I am nervous I tend to clean and organize,"(When, I, am, nervous, I, tend, to, clean, and...","[[When, I, am, nervous], [I, tend, to, clean, ...","[[I, tend, to, clean, and, organize], [to, cle...",[0],[I tend to clean and organize],"[When I am nervous, I tend to clean and organize]"


The voice of each clause in the actual sentence is determined using the rules below.

In [10]:
a_poss, p_yn, p_beverb, p_get, a_def, undef = "A_pron_x", "P_yn", "P_bevb_x", "P_get_x", "A_def", "Undefined"

def voice_rule_engine(clause):
    if True not in [x.pos_ == "VERB" for x in clause]:
        return undef
    
    for x in clause:
        if x.dep_ == "poss":
            return a_poss
        
    ct = 0
    for x in clause:
        if x.text.lower().strip() in ['yes', 'no']:
            ct += 1
    if ct >= len(clause)/2:
        return p_yn

    BEING_VERBS = ['be', 'am', 'is', 'isn', 'are', 'aren', \
                   'was', 'were', 'wasn', 'weren', 'been', 'being', \
                   'have', 'haven', 'has', 'hasn', 'could', 'couldn', \
                   'should', 'shouldn', 'would', 'wouldn', 'may', 'might', 'mightn', \
                   'must','mustn', 'shall', 'can', 'will', \
                   'do', 'don', 'did', 'didn', 'does', 'doesn', 'having']
    for x in clause:
        if x.text.lower().strip() in BEING_VERBS and x.pos_ == "VERB":
            return p_beverb

    for x in clause:
        if x.dep_ in ["advcl", "ROOT"] and x.text in ["get", "seem", "feel", "gets", "seems", "feels", "got", "seemed", "felt"]:
            return p_get
    
    return a_def
    
def clauses_voice(arr_of_clauses):
    op = []
    for clause in arr_of_clauses:
        voice = voice_rule_engine(clause)
        op.append(voice)         
    return op

df['voice'] = df.clauses_doc_final.apply(clauses_voice)
df["voice_filtered"] = df.apply(lambda row: [row["voice"][i] for i in range(len(row["voice"])) if i in row["valid_indices_per_doc"]], axis = 1)
df["voice"] = df["voice_filtered"]
df[['sentence', 'clauses_doc_final', 'voice', "voice_filtered"]].sample(frac = 1).head()

Unnamed: 0,sentence,clauses_doc_final,voice,voice_filtered
121,When they avoided me I observed that I was bei...,"[[I, observed], [that, I, was, being, avoided,...","[A_def, P_bevb_x, A_def, A_def, A_pron_x, A_de...","[A_def, P_bevb_x, A_def, A_def, A_pron_x, A_de..."
195,Raising a family has helped me grow into an aw...,"[[has, helped], [me, grow, into, an, awareness...","[P_bevb_x, A_def]","[P_bevb_x, A_def]"
130,People who step out of line are doing so for a...,"[[are, doing, so, for, a, reason], [I, want, t...","[P_bevb_x, A_def, A_def, P_bevb_x, A_def]","[P_bevb_x, A_def, A_def, P_bevb_x, A_def]"
135,Crime and delinquency could be halted if human...,"[[humankind, could, overcome, its, cancerous, ...",[A_pron_x],[A_pron_x]
77,I feel sorry for the true victims of this world.,"[[for, the, true, victims, of, this, world]]",[Undefined],[Undefined]
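The order in which the rules fire matters, so here is a minimal sketch of the same precedence on plain `(text, pos, dep)` triples instead of spaCy tokens. `classify` is a hypothetical name, the verb set is a shortened subset of the notebook's `BEING_VERBS`, and the tags in the examples are assumed rather than produced by a parser.

```python
# Shortened subset of the notebook's BEING_VERBS list, for illustration only.
BEING_VERBS = {"be", "am", "is", "are", "was", "were", "been", "being",
               "have", "has", "could", "should", "would", "do", "did", "does"}
GET_VERBS = {"get", "seem", "feel", "gets", "seems", "feels", "got", "seemed", "felt"}

def classify(clause):
    """Apply the voice rules in order to a list of (text, pos, dep) triples."""
    if not any(pos == "VERB" for _, pos, _ in clause):
        return "Undefined"
    if any(dep == "poss" for _, _, dep in clause):
        return "A_pron_x"
    yes_no = sum(1 for text, _, _ in clause if text.lower().strip() in ("yes", "no"))
    if yes_no >= len(clause) / 2:
        return "P_yn"
    if any(text.lower().strip() in BEING_VERBS and pos == "VERB" for text, pos, _ in clause):
        return "P_bevb_x"
    if any(dep in ("advcl", "ROOT") and text in GET_VERBS for text, _, dep in clause):
        return "P_get_x"
    return "A_def"

examples = {
    "has helped": [("has", "VERB", "aux"), ("helped", "VERB", "ROOT")],
    "I questioned myself": [("I", "PRON", "nsubj"), ("questioned", "VERB", "ROOT"),
                            ("myself", "PRON", "dobj")],
    "She could find her keys": [("She", "PRON", "nsubj"), ("could", "VERB", "aux"),
                                ("find", "VERB", "ROOT"), ("her", "DET", "poss"),
                                ("keys", "NOUN", "dobj")],
    "for the victims": [("for", "ADP", "prep"), ("the", "DET", "det"),
                        ("victims", "NOUN", "pobj")],
}
labels = {sentence: classify(triples) for sentence, triples in examples.items()}
print(labels)
```

The third example shows the precedence: a possessive anywhere in the clause yields `A_pron_x` even though "could" would otherwise trigger `P_bevb_x`. The being-verb rule also depends on auxiliaries being tagged `VERB`; newer spaCy models may tag them `AUX` instead, which would likely change which rule fires.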


In [11]:
df[df["clauses_text_final"].apply(len) != df["voice"].apply(len)].shape[0] # assert 0

0

This is the visualization of each sentence's parse tree. The output for each sentence in the input dataframe is in the /html folder. 

In [12]:
# !pip install "msgpack-numpy<0.4.4.0"  # tjm removed, not needed?

In [13]:
import os

def htmlise(row):
    html_fs = """
    <html>
        <head>
            <title>{}</title>
        </head>
        <body>
            <div>{}</div>
            <div>{}</div>
            <div>{}</div>
        </body>
    </html>"""
    op = spacy.displacy.render(row.nlp_doc, style='dep')
    os.makedirs("./html", exist_ok=True)  # make sure the output folder exists
    with open("./html/file_{}.html".format(row.UID), "w") as f:
        f.write(html_fs.format(row.prompt, row.response, row.clauses_text_final, op))
    return
        
df['idx'] = df.index
#df.apply(htmlise, axis = 1)
#print("HTML processing done")
spacy.displacy.render([df.iloc[0].nlp_doc], style='dep')

[Output: displaCy dependency-parse SVG markup for the first row of the DataFrame (tokens "We", "could", "make", ...); truncated.]

The output is the split clauses, stored in voice_classified.csv. This file is the input to the second notebook.

In [14]:
#df[['UID', 'survey_id', 'prompt_number', 'prompt_id', 'prompt', 'response', 'clauses_text_final', 'voice', 'idx']].to_csv("./voice_classified.csv", index = False)
# tjm: split above into two below:
df_out = df[['UID', 'survey_id', 'prompt_number', 'prompt_id', 'prompt', 'response', 'clauses_text_final', 'voice', 'score','PassAct','idx']]
df_out.to_csv("./voice_classified.csv", index = False)
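Because `clauses_text_final` and `voice` hold Python lists, they are written to the CSV as their string representations, so a downstream reader will probably need to parse them back into lists. The second notebook is not shown here, so how it actually reads the file is an assumption. The sketch below uses only the standard library and an illustrative row to show the round trip with `ast.literal_eval`.

```python
import ast
import csv
import io

# Write one illustrative row the way a list-valued column ends up in a CSV:
# as the string form of the list.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["UID", "clauses_text_final", "voice"])
writer.writerow(["02807.017.", str(["I questioned myself"]), str(["A_def"])])

# Read it back and turn the stringified lists into real lists again.
buffer.seek(0)
rows = list(csv.DictReader(buffer))
for row in rows:
    row["clauses_text_final"] = ast.literal_eval(row["clauses_text_final"])
    row["voice"] = ast.literal_eval(row["voice"])

print(rows[0]["clauses_text_final"], rows[0]["voice"])
```

With pandas, the same parsing can be done at load time by passing `converters={"clauses_text_final": ast.literal_eval, "voice": ast.literal_eval}` to `pd.read_csv`.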