# Clause Parser Algorithm with Custom Metrics

## Introduction
This is the first of four notebooks, which are meant to be run in sequence to qualify and quantify clauses. This notebook does the following:
1. Takes its input from input.csv, which has two required columns: "prompt" and "response". Joined together, these two columns form a coherent sentence.
2. Preprocesses the sentence to remove non-alphanumeric characters.
3. Splits the sentence into clauses such that each clause contains a verb.

Load the spaCy model that is used to identify the verbs and, based on the rules described later, to determine the voice of each clause. This notebook uses 'en_core_web_md'. If you want better tokenization of words, use 'en_core_web_lg'.

In [3]:
!python --version
!python -m spacy download en_core_web_md
print("Downloaded")
# TODO: use en_core_web_lg on a larger machine; lg runs out of space on Binder.

Python 3.6.5
Collecting https://github.com/explosion/spacy-models/releases/download/en_core_web_md-2.0.0/en_core_web_md-2.0.0.tar.gz
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-2.0.0/en_core_web_md-2.0.0.tar.gz (120.8MB)
    100% |████████████████████████████████| 120.9MB 24.0MB/s
Installing collected packages: en-core-web-md
  Running setup.py install for en-core-web-md ... done
Successfully installed en-core-web-md-2.0.0
You are using pip version 9.0.3, however version 10.0.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.

    Linking successful
    /srv/conda/lib/python3.6/site-p

In [4]:
import spacy
import html
from spacy import displacy

nlp = spacy.load('en_core_web_md')
print("Loaded models")

Loaded models


Read the input file from the current directory.

In [6]:
from io import StringIO
import pandas as pd, numpy as np

df = pd.read_csv("./input.csv")
print(df.columns)
df.sample(frac=1).head()

Index(['UID', 'survey_id', 'prompt_number', 'prompt_id', 'prompt', 'response',
       'score', 'selectionTag', 'AnalystComments'],
      dtype='object')


Unnamed: 0,UID,survey_id,prompt_number,prompt_id,prompt,response,score,selectionTag,AnalystComments
492,3151.09,3151,9,9,Education,is selfing and then un-selfing into unity and ...,6.0,37,
171,2178.26,2178,26,26,When I get mad,the anger is both a collective and personal ph...,6.0,15,
500,3165.03,3165,3,3,Change is,inevitable.,3.0,10,
107,1964.29,1964,29,29,If my mother,"was alive, I would love to talk with her, shar...",4.5,46,
405,2704.06,2704,6,6,The thing I like about myself is,"My ability of open mindedness, realizing that ...",4.5,21,


Get the actual sentence by joining the prompt and response.

In [7]:
if "prompt" in df.columns: #Original dataset
    df['sentence'] = df.apply(lambda row : "{} {}".format(row['prompt'], row['response']), axis = 1)

df.sample(frac=1).head()

Unnamed: 0,UID,survey_id,prompt_number,prompt_id,prompt,response,score,selectionTag,AnalystComments,sentence
335,2541.23,2541,23,23,I am,I can play,1.5,24,,I am I can play
26,1731.09,1731,9,9,Education,"on a need to know basis, it will be there wait...",5.0,38,,"Education on a need to know basis, it will be ..."
533,3321.05,3321,5,5,Being with other people,is most often a pleasure and fascinating exper...,5.0,30,,Being with other people is most often a pleasu...
190,2211.15,2211,15,41,Privacy,is a gift,3.0,28,,Privacy is a gift
119,1997.21,1997,21,21,I just can\'t stand people who,judge and complain.,3.5,39,,I just can\'t stand people who judge and compl...


Preprocess each sentence to remove non-alphanumeric characters, then tokenize it with spaCy.

In [10]:
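import re, html
# Raw strings so that \s reaches the regex engine unchanged.
PATTERN = r"[^a-zA-Z0-9\s]+"
rgx = re.compile(PATTERN)  # the character class already covers both cases

# Unescape HTML entities, replace runs of other characters with a space, then collapse whitespace.
df['preprocessed_sentence'] = df['sentence'].apply(lambda ip : re.sub(r'\s+', ' ', rgx.sub(' ', html.unescape(ip))))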
print(df.columns, df.shape)
df.sample(frac=1).head()

Index(['UID', 'survey_id', 'prompt_number', 'prompt_id', 'prompt', 'response',
       'score', 'selectionTag', 'AnalystComments', 'sentence',
       'preprocessed_sentence', 'nlp_doc'],
      dtype='object') (539, 12)


Unnamed: 0,UID,survey_id,prompt_number,prompt_id,prompt,response,score,selectionTag,AnalystComments,sentence,preprocessed_sentence,nlp_doc
248,2395.15,2395,15,41,Privacy,is something which is required at times and so...,5.5,46,,Privacy is something which is required at time...,Privacy is something which is required at time...,"(Privacy, is, something, which, is, required, ..."
285,2503.1,2503,10,10,When people are helpless,people doesn't spank people,1.5,4,,When people are helpless people doesn't spank ...,When people are helpless people doesn t spank ...,"(When, people, are, helpless, people, doesn, t..."
286,2503.23,2503,23,23,I am,Lucy,1.0,10,,I am Lucy,I am Lucy,"(I, am, Lucy)"
149,2128.27,2128,27,45,People who step out of line,are sometimes judged unfairly by others becaus...,5.0,28,,People who step out of line are sometimes judg...,People who step out of line are sometimes judg...,"(People, who, step, out, of, line, are, someti..."
174,2182.02,2182,2,2,When I am criticized,it is one face of divinity noticing separation...,5.5,2,,When I am criticized it is one face of divinit...,When I am criticized it is one face of divinit...,"(When, I, am, criticized, it, is, one, face, o..."
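
The preprocessing step can be tried on its own, without pandas or spaCy. The following is a minimal sketch that applies the same regular expression to two sentences taken from the sample rows; the function name `preprocess` is an assumption introduced for illustration and is not part of the notebook.

```python
import html
import re

# Same character class as the notebook: anything that is not a letter, digit or whitespace.
NON_ALNUM = re.compile(r"[^a-zA-Z0-9\s]+")

def preprocess(sentence):
    """Unescape HTML entities, blank out other characters, collapse whitespace.

    `preprocess` is a hypothetical helper name; the notebook does this in a lambda.
    """
    unescaped = html.unescape(sentence)
    no_punct = NON_ALNUM.sub(" ", unescaped)
    return re.sub(r"\s+", " ", no_punct)

print(preprocess("My child(ren) and I not applicable"))
# My child ren and I not applicable
print(preprocess("When people are helpless people doesn't spank people"))
# When people are helpless people doesn t spank people
```

Contractions are split at the apostrophe ("doesn t"), and because the result is not stripped, a sentence that ends in punctuation keeps a single trailing space.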


In [9]:
df['nlp_doc'] = df['preprocessed_sentence'].apply(lambda ip : nlp(ip))
print(df.columns)
df.sample(frac=1).head()

Index(['UID', 'survey_id', 'prompt_number', 'prompt_id', 'prompt', 'response',
       'score', 'selectionTag', 'AnalystComments', 'sentence',
       'preprocessed_sentence', 'nlp_doc'],
      dtype='object')


Unnamed: 0,UID,survey_id,prompt_number,prompt_id,prompt,response,score,selectionTag,AnalystComments,sentence,preprocessed_sentence,nlp_doc
359,2550.31,2550,31,31,My father,goes to work,1.5,18,,My father goes to work,My father goes to work,"(My, father, goes, to, work)"
89,1889.22,1889,22,43,At times I worry about,the fragility of this vessel. Will it hold muc...,6.5,27,,At times I worry about the fragility of this v...,At times I worry about the fragility of this v...,"(At, times, I, worry, about, the, fragility, o..."
231,2343.29,2343,29,29,If my mother,I feel a pull to look back at my own upbringin...,5.5,40,,If my mother I feel a pull to look back at my ...,If my mother I feel a pull to look back at my ...,"(If, my, mother, I, feel, a, pull, to, look, b..."
471,3119.03,3119,3,3,Change is,"fascinating, unavoidable, somehow always to be...",5.0,50,,"Change is fascinating, unavoidable, somehow al...",Change is fascinating unavoidable somehow alwa...,"(Change, is, fascinating, unavoidable, somehow..."
250,2400.07,2400,7,55,My child(ren) and I,not applicable,3.0,32,,My child(ren) and I not applicable,My child ren and I not applicable,"(My, child, ren, and, I, not, applicable)"


### Actual splitting of clauses
#### Metrics

* Share of sentences reconstructed correctly on an existing dataset = 0.9061. The true figure is probably greater than 91%, because when a complex first clause is followed by a conjunction, the conjunction is placed with the first (parent) clause.
* Response expected = actual verbatim : 

#### Algorithm
NOTE: See http://universaldependencies.org/ to understand the grammatical dependencies. To visualize each sentence, look in the html folder. The files there contain the parse, which can be used to determine direct parents and sub-sentences (clauses).
1. Each doc contains clauses, and each clause has a main verb.
2. In spaCy, these verbs are connected to one another to make up the entire document.
3. We use a recursive function, 'get_children', to determine whether a child verb links two clauses. If it does not (that is, it is an auxiliary verb (aux) or an open clausal complement (xcomp)), it is part of the same clause.
4. This gives an array of clauses, where each clause is an array of spaCy tokens.
5. This 2D array might contain one or more clauses that are sub-clauses of another clause in the same array. These are removed in the postprocessing.
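
The recursion in steps 3 to 5 can be followed on a small hand-built tree, without loading spaCy. The sketch below is illustrative only: `Tok` is a hypothetical stand-in for a spaCy token, `clause_for` is a hypothetical helper name, and the part-of-speech and dependency labels are assigned by hand rather than produced by a model. The single-token postprocessing step is left out.

```python
class Tok:
    """Hypothetical stand-in for a spaCy token (only the attributes used here)."""
    def __init__(self, text, pos_, dep_, lefts=None, rights=None):
        self.text, self.pos_, self.dep_ = text, pos_, dep_
        self.lefts = lefts or []
        self.rights = rights or []

    @property
    def children(self):
        return self.lefts + self.rights

def get_children(tok):
    # A leaf token always belongs to the clause that reaches it.
    if not tok.children:
        return [tok]
    # A verb that is neither xcomp nor aux heads its own clause: stop descending.
    if tok.pos_ == "VERB" and tok.dep_ not in ("xcomp", "aux"):
        return []
    left = [t for l in tok.lefts for t in get_children(l)]
    right = [t for r in tok.rights for t in get_children(r)]
    return left + [tok] + right

def clause_for(verb):
    # Tokens of the clause headed by `verb`, in sentence order.
    left = [t for l in verb.lefts for t in get_children(l)]
    right = [t for r in verb.rights for t in get_children(r)]
    return left + [verb] + right

# Hand-built parse of "I want to sleep when it rains" (labels chosen by hand).
i = Tok("I", "PRON", "nsubj")
to = Tok("to", "PART", "aux")
sleep = Tok("sleep", "VERB", "xcomp", lefts=[to])
when = Tok("when", "ADV", "advmod")
it = Tok("it", "PRON", "nsubj")
rains = Tok("rains", "VERB", "advcl", lefts=[when, it])
want = Tok("want", "VERB", "ROOT", lefts=[i], rights=[sleep, rains])

clauses = [" ".join(t.text for t in clause_for(v)) for v in (want, sleep, rains)]
print(clauses)
# ['I want to sleep', 'to sleep', 'when it rains']
```

The xcomp verb "sleep" stays inside the clause headed by "want", while the advcl verb "rains" is cut off into a clause of its own. "to sleep" also appears as a separate entry because every verb is visited; this is the kind of sub-clause that step 5 removes.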

In [11]:
def flatten_list(l):
    flat_list = [item for sublist in l for item in sublist]
    return flat_list

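def get_children(doc):
    # `doc` is a single token here. A token without children is returned as-is.
    if len([x for x in doc.children]) == 0:
        return [doc]
    # A verb that is not an xcomp or aux heads its own clause, so stop descending.
    if doc.pos_ == "VERB" and doc.dep_ not in ["xcomp", "aux"]:
        return []

    # Otherwise collect the subtree in sentence order: lefts, the token itself, rights.
    op = flatten_list([get_children(l) for l in doc.lefts]) + [doc] + flatten_list([get_children(r) for r in doc.rights])
    return op

def postprocess(tokens_arr):
    # Drop single-token "clauses" that are only an auxiliary or a lone VBG form.
    if len(tokens_arr) == 1 and ( tokens_arr[0].dep_ in ["aux", "auxpass"] or tokens_arr[0].tag_ in ["VBG"]): 
        return []
    return tokens_arr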

def get_text_from_tokens(tokens_arr):
    op = ' '.join([x.text for x in tokens_arr])
    op = op.replace(" nt", "nt").replace(" '", "'")
    return op

def clause_split_by_verbs(doc):
    op = []
    for token in doc:
        if token.pos_ == "VERB":
            arr = flatten_list([get_children(l) for l in token.lefts]) + [token] + flatten_list([get_children(r) for r in token.rights])
            arr = postprocess(arr)
            op.append(arr)
    if len(op)==0:
        op.append(doc)
    return op

df['split_by_verbs_arr'] = df['nlp_doc'].apply(clause_split_by_verbs)
df.sample(frac = 1).head()

Unnamed: 0,UID,survey_id,prompt_number,prompt_id,prompt,response,score,selectionTag,AnalystComments,sentence,preprocessed_sentence,nlp_doc,split_by_verbs_arr
525,3286.06,3286,6,6,The thing I like about myself is,I am continually curious about life and how it...,5.5,37,,The thing I like about myself is I am continua...,The thing I like about myself is I am continua...,"(The, thing, I, like, about, myself, is, I, am...","[[I, like, about, myself], [The, thing, is], [..."
245,2387.09,2387,9,9,Education,can be many things to many different peoples.,3.0,14,,Education can be many things to many different...,Education can be many things to many different...,"(Education, can, be, many, things, to, many, d...","[[], [Education, can, be, many, things, to, ma..."
409,2721.25,2721,25,25,My main problem is,This stem is irrelevant. I have no problem. ...,6.0,32,,My main problem is This stem is irrelevant. I...,My main problem is This stem is irrelevant I h...,"(My, main, problem, is, This, stem, is, irrele...","[[My, main, problem, is], [This, stem, is, irr..."
12,1668.34,1668,34,47,Technology,has been one of the most significant disruptor...,5.0,37,,Technology has been one of the most significan...,Technology has been one of the most significan...,"(Technology, has, been, one, of, the, most, si...","[[], [Technology, has, been, one, of, the, mos..."
117,1993.23,1993,23,23,I am,is the only statement worth exploring and know...,5.0,5,,I am is the only statement worth exploring and...,I am is the only statement worth exploring and...,"(I, am, is, the, only, statement, worth, explo...","[[I, am], [is, the, only, statement, worth, ex..."


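Postprocess the dataframe: remove the prompt tokens from each clause, drop clauses that are sub-clauses of another clause, and convert the remaining clauses to text.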

In [12]:
def remove_prompts(df):
    prompt, tokens_arr = df.prompt, df.split_by_verbs_arr
    pdoc = nlp(prompt)
    ignore_indices = [x.i for x in pdoc]
    new_arr = []
    for clause in tokens_arr:
        new_clause = [t for t in clause if t.i not in ignore_indices]
        if len(new_clause) > 0:  # keep only clauses that still have tokens once the prompt is removed
            new_arr.append(new_clause)
    return new_arr

def process_text_df(clauses_arr):
    new_arr = []
    # first pass
    first_pass = []
    tok_arr = [[ tok.i for tok in clause] for clause in clauses_arr]

    for i in range(len(tok_arr)):
        x = tok_arr[i]
        if len(x) ==  0:
            continue
        is_subset = False
        for y in tok_arr:
            if set(x).issubset(y) and not set(x) == set(y):
                is_subset = True
        if not is_subset:
            first_pass.append(clauses_arr[i])
    
    for clauses in first_pass:
        if len(clauses) == 0:
            continue
        txt = get_text_from_tokens(clauses)
        new_arr.append(txt)
    
    return new_arr
        
df['clauses_doc_final'] = df[['prompt', 'split_by_verbs_arr']].apply(remove_prompts, axis = 1) 
df['clauses_text_final'] = df['clauses_doc_final'].apply(process_text_df)
df['split_by_verbs_arr_cleaned'] = df['split_by_verbs_arr'].apply(process_text_df)
df.sample(frac = 1).head(20)

Unnamed: 0,UID,survey_id,prompt_number,prompt_id,prompt,response,score,selectionTag,AnalystComments,sentence,preprocessed_sentence,nlp_doc,split_by_verbs_arr,clauses_doc_final,clauses_text_final,split_by_verbs_arr_cleaned
131,2041.11,2041,11,39,What I like to do best is,"To Train, give the direction and give support ...",4.0,30,,"What I like to do best is To Train, give the d...",What I like to do best is To Train give the di...,"(What, I, like, to, do, best, is, To, Train, g...","[[I, like, What, to, do, best], [What, to, do,...","[[To, Train, give, the, direction, and], [To, ...","[To Train give the direction and, To Train giv...","[I like What to do best, is To Train give the ..."
467,3105.12,3105,12,12,A good boss,"is okay, a great leader is even better.",3.0,25,,"A good boss is okay, a great leader is even be...",A good boss is okay a great leader is even bet...,"(A, good, boss, is, okay, a, great, leader, is...","[[A, good, boss, is, okay], [a, great, leader,...","[[is, okay], [a, great, leader, is, even, bett...","[is okay, a great leader is even better]","[A good boss is okay, a great leader is even b..."
433,2812.1,2812,10,10,When people are helpless,I am helpful,2.0,32,,When people are helpless I am helpful,When people are helpless I am helpful,"(When, people, are, helpless, I, am, helpful)","[[When, people, are, helpless], [I, am, helpful]]","[[I, am, helpful]]",[I am helpful],"[When people are helpless, I am helpful]"
525,3286.06,3286,6,6,The thing I like about myself is,I am continually curious about life and how it...,5.5,37,,The thing I like about myself is I am continua...,The thing I like about myself is I am continua...,"(The, thing, I, like, about, myself, is, I, am...","[[I, like, about, myself], [The, thing, is], [...","[[I, am, continually, curious, about, life, an...","[I am continually curious about life and, how ...","[I like about myself, The thing is, I am conti..."
64,1838.21,1838,21,21,I just can\'t stand people who,Worry too much,2.5,46,,I just can\'t stand people who Worry too much,I just can t stand people who Worry too much,"(I, just, can, t, stand, people, who, Worry, t...","[[], [I, just, can, t, stand, people], [who, W...","[[who, Worry, too, much]]",[who Worry too much],"[I just can t stand people, who Worry too much]"
278,2500.04,2500,4,92,"These days, school",really hard.,2.0,31,,"These days, school really hard.",These days school really hard,"(These, days, school, really, hard)","[(These, days, school, really, hard)]",[[hard]],[hard],[These days school really hard]
522,3247.01,3247,1,1,Raising a family,Has been one of the most transformative and li...,4.0,44,,Raising a family Has been one of the most tran...,Raising a family Has been one of the most tran...,"(Raising, a, family, Has, been, one, of, the, ...","[[Raising, a, family], [], [Has, been, one, of...","[[Has, been, one, of, the, most, transformativ...",[Has been one of the most transformative and l...,"[Raising a family, Has been one of the most tr..."
196,2233.31,2233,31,31,My father,"I did not experienced him more, because he die...",3.5,32,,"My father I did not experienced him more, beca...",My father I did not experienced him more becau...,"(My, father, I, did, not, experienced, him, mo...","[[], [My, father, I, did, not, experienced, hi...","[[I, did, not, experienced, him, more], [becau...","[I did not experienced him more, because he di...","[My father I did not experienced him more, bec..."
203,2282.2,2282,20,44,Business and society,can\'t have one without the other.,2.5,29,,Business and society can\'t have one without t...,Business and society can t have one without th...,"(Business, and, society, can, t, have, one, wi...","[[], [Business, and, society, can, t, have, on...","[[can, t, have, one, without, the, other]]",[can t have one without the other],[Business and society can t have one without t...
121,2003.33,2003,33,33,When I am nervous,I sleep,1.5,35,,When I am nervous I sleep,When I am nervous I sleep,"(When, I, am, nervous, I, sleep)","[[When, I, am, nervous], [I, sleep]]","[[I, sleep]]",[I sleep],"[When I am nervous, I sleep]"
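
Both postprocessing steps work on token positions, so they can be illustrated with plain lists of indices. This is a minimal sketch under that assumption; `strip_prompt` and `drop_subclauses` are hypothetical names, and the notebook's own functions operate on spaCy tokens rather than integers.

```python
def strip_prompt(index_lists, n_prompt_tokens):
    """Remove the first `n_prompt_tokens` positions from every clause, then drop empty clauses."""
    stripped = [[i for i in clause if i >= n_prompt_tokens] for clause in index_lists]
    return [clause for clause in stripped if clause]

def drop_subclauses(index_lists):
    """Keep clauses whose index set is not a proper subset of another clause's index set."""
    kept = []
    for current in index_lists:
        if not current:
            continue
        if any(set(current) < set(other) for other in index_lists):
            continue  # fully contained in a larger clause
        kept.append(current)
    return kept

# "When I am nervous I sleep": the prompt covers positions 0 to 3.
print(strip_prompt([[0, 1, 2, 3], [4, 5]], 4))
# [[4, 5]]

# [2, 3] is contained in [0, 1, 2, 3], so it is removed; the empty clause is skipped.
print(drop_subclauses([[0, 1, 2, 3], [2, 3], [4, 5, 6], []]))
# [[0, 1, 2, 3], [4, 5, 6]]
```

Because only proper subsets are removed, two clauses with identical token positions are both kept. The prompt removal also relies on the prompt tokenizing to the same number of tokens as the start of the preprocessed sentence.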


The voice of each clause in the actual sentence is determined using the rules below.

In [13]:
a_poss, p_yn, p_beverb, p_get, a_def, undef = "A_pron_x", "P_yn", "P_bevb_x", "P_get_x", "A_def", "Undefined"

def voice_rule_engine(clause):
    if True not in [x.pos_ == "VERB" for x in clause]:
        return undef
    
    for x in clause:
        if x.dep_ == "poss":
            return a_poss
        
    ct = 0
    for x in clause:
        if x.text.lower().strip() in ['yes', 'no']:
            ct += 1
    if ct >= len(clause)/2:
        return p_yn

    BEING_VERBS = ['be', 'am', 'is', 'isn', 'are', 'aren', \
                   'was', 'were', 'wasn', 'weren', 'been', 'being', \
                   'have', 'haven', 'has', 'hasn', 'could', 'couldn', \
                   'should', 'shouldn', 'would', 'wouldn', 'may', 'might', 'mightn', \
                   'must','mustn', 'shall', 'can', 'will', \
                   'do', 'don', 'did', 'didn', 'does', 'doesn', 'having']
    for x in clause:
        if x.text.lower().strip() in BEING_VERBS and x.pos_ == "VERB":
            return p_beverb

    for x in clause:
        if x.dep_ == "acomp":
            return p_get
    
    return a_def
    
def clauses_voice(arr_of_clauses):
    op = []
    for clause in arr_of_clauses:
        voice = voice_rule_engine(clause)
        op.append(voice)         
    return op

df['voice'] = df.clauses_doc_final.apply(clauses_voice)
df[['sentence', 'clauses_doc_final', 'voice']].sample(frac = 1).head()

Unnamed: 0,sentence,clauses_doc_final,voice
259,"Love changes when First, when there is a conne...","[[when, there, is, a, connection, oneness], [t...","[P_bevb_x, A_def, A_def, P_bevb_x, A_def, A_de..."
271,"These days, work Is something I continue to en...","[[something], [I, continue, to, enjoy, and], [...","[Undefined, A_def, A_def, A_def, A_def, A_pron..."
303,We could make the world a better place if we h...,"[[we, had, more, people, there]]",[A_def]
262,Women are lucky because Women are lucky becaus...,"[[Women, are, lucky], [because, they, seem, to...","[P_bevb_x, A_def, A_def, A_def]"
277,When I am nervous I've never been nervous...,"[[I, ve, never, been, nervous]]",[P_bevb_x]
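
The rules are applied in a fixed order and the first match wins. The sketch below restates that precedence on hand-labelled tokens so it can be run without spaCy. `Tok` and `voice` are hypothetical names, the verb list is abbreviated, and the tags are assigned by hand; whether spaCy tags a given auxiliary as VERB depends on the model version.

```python
from collections import namedtuple

# Hypothetical stand-in for a spaCy token; labels below are assigned by hand.
Tok = namedtuple("Tok", "text pos_ dep_")

# Abbreviated version of the notebook's BEING_VERBS list.
BEING_VERBS = {"be", "am", "is", "are", "was", "were", "been", "being",
               "have", "has", "do", "did", "does"}

def voice(clause):
    if not any(t.pos_ == "VERB" for t in clause):
        return "Undefined"            # no verb at all
    if any(t.dep_ == "poss" for t in clause):
        return "A_pron_x"             # possessive modifier present
    yes_no = sum(t.text.lower().strip() in ("yes", "no") for t in clause)
    if yes_no >= len(clause) / 2:
        return "P_yn"                 # at least half the tokens are yes/no
    if any(t.text.lower().strip() in BEING_VERBS and t.pos_ == "VERB" for t in clause):
        return "P_bevb_x"             # a listed be/have/do/modal verb
    if any(t.dep_ == "acomp" for t in clause):
        return "P_get_x"              # adjectival complement
    return "A_def"                    # default

examples = {
    "something": [Tok("something", "NOUN", "attr")],
    "My father goes to work": [Tok("My", "DET", "poss"), Tok("father", "NOUN", "nsubj"),
                               Tok("goes", "VERB", "ROOT"), Tok("to", "ADP", "prep"),
                               Tok("work", "NOUN", "pobj")],
    "Women are lucky": [Tok("Women", "NOUN", "nsubj"), Tok("are", "VERB", "ROOT"),
                        Tok("lucky", "ADJ", "acomp")],
    "I feel good": [Tok("I", "PRON", "nsubj"), Tok("feel", "VERB", "ROOT"),
                    Tok("good", "ADJ", "acomp")],
    "I sleep": [Tok("I", "PRON", "nsubj"), Tok("sleep", "VERB", "ROOT")],
}
labels = {text: voice(tokens) for text, tokens in examples.items()}
print(labels)
```

"Women are lucky" contains an acomp, but the be-verb rule is checked first, so it is labelled P_bevb_x rather than P_get_x.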


Visualize the parse tree of each sentence. The output for each sentence in the input dataframe is written to the ./html folder.

In [14]:
def htmlise(df):
    html_fs = """
    <html>
        <head>
            <title>{}</title>
        </head>
        <body>
            <div>{}</div>
            <div>{}</div>
            <div>{}</div>
        </body>
    </html>"""
    op = spacy.displacy.render(df.nlp_doc, style='dep')
    with open("./html/file_{}.html".format(df.idx), "w") as f:
        f.write(html_fs.format(df.prompt, df.response, df.clauses_text_final, op))
    return
        
import os
os.makedirs("./html", exist_ok=True)  # make sure the output folder exists before writing
df['idx'] = df.index
df.apply(htmlise, axis = 1)
print("HTML processing done")

HTML processing done


The output, the split clauses together with their voice labels, is stored in voice_classified.csv, which is the input to the second notebook.

In [15]:
df[['prompt', 'response', 'clauses_text_final', 'voice', 'idx']].to_csv("./voice_classified.csv", index = False)
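
One thing to keep in mind when reading this file back: the 'clauses_text_final' and 'voice' columns hold Python lists, and a CSV file stores them as their string representation. The sketch below shows the round trip with the standard library's csv module standing in for pandas; the row is made up for illustration, and parsing with `ast.literal_eval` is one possible approach, not necessarily what the second notebook does.

```python
import ast
import csv
import io

# Illustrative row; the values are not taken from the notebook's output file.
rows = [{"prompt": "When I am nervous", "response": "I sleep",
         "clauses_text_final": ["I sleep"], "voice": ["A_def"], "idx": 121}]

# Write to an in-memory buffer instead of ./voice_classified.csv.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
writer.writeheader()
writer.writerows(rows)

# Read the row back: every field is now a plain string.
buffer.seek(0)
loaded = next(csv.DictReader(buffer))
print(repr(loaded["voice"]))
# "['A_def']"

# Turn the string representations back into lists.
clauses = ast.literal_eval(loaded["clauses_text_final"])
voices = ast.literal_eval(loaded["voice"])
print(clauses, voices)
# ['I sleep'] ['A_def']
```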