# Absctraction Scorer

## Introduction
This is the second notebook which scores each clause in each sentence for its abstraction. Its input is "voice_classified.csv".

In [1]:
import nltk
# tjm: commented these out because they are all "already up-to-date" and give error when offline
# nltk.download("punkt")
# nltk.download("stopwords")
# nltk.download("wordnet")

In [2]:
import pandas as pd

df = pd.read_csv("./voice_classified.csv")
df.voice = df.voice.apply(eval)
df.clauses_text_final = df.clauses_text_final.apply(eval)
df

Unnamed: 0,UID,survey_id,prompt_number,prompt_id,prompt,response,clauses_text_final,voice,score,PassAct,idx
0,02500.013.,2500,13,40,We could make the world a better place if,we could stop smoking!,[we could stop smoking],[P_bevb_x],1.5,a,0
1,02544.024.,2544,24,24,If I had more money,I'll buy a mansion,[I ll buy a mansion],[A_def],1.5,a,1
2,02543.013.,2543,13,40,We could make the world a better place if,you could do whatever you want,"[you could do, whatever you want]","[P_bevb_x, A_def]",1.5,a,2
3,02821.017.,2821,17,17,When they avoided me,I avoid them back,[I avoid them back],[A_def],1.5,a,3
4,02508.027.,2508,27,98,Children who step out of line,I don't know what that means (when children do...,"[I don, t know, what that means, when children...","[P_bevb_x, A_def, A_def, P_bevb_x, A_def, P_be...",1.5,a,4
5,02784.008.,2784,8,8,What gets me into trouble is,Eating The HARVEST,[],[],1.5,a,5
6,02498.014.,2498,14,14,The past,you have to be careful,[you have to be careful],[P_bevb_x],1.5,a,6
7,02550.024.,2550,24,24,If I had more money,you could buy lots of things,[you could buy lots of things],[P_bevb_x],1.5,a,7
8,02507.001.,2507,1,90,My family,has a cat...I don't have a cat though I have 2...,"[has a cat, I don t, have a cat, though I have...","[P_bevb_x, P_bevb_x, P_bevb_x, P_bevb_x]",1.5,a,8
9,02532.024.,2532,24,24,If I had more money,I would get to see the Star Wars movie actors ...,[I would get to see the Star Wars movie actors...,[P_bevb_x],1.5,a,9


This part is vital to determine the actual abstraction score recursively. It consists of 2 classes that represent a tree datastructure with multiple children (https://en.wikipedia.org/wiki/Tree_structure):
* The Hypernym Tree: Its attributes are:
    * The input token (text)
    * The input token's hypernyms as detected by NLTK. Each hypernym token is another object of HypernymTree
* The Hyponym Tree
    * The input token (text)
    * The input token's hyponyms as detected by NLTK. Each hyponym token is another object of HyponymTree
    
Both classes also have two methods:
* Get Max Depth: This gives the total depth of a node in the tree.
* Print Tree: A debug method to print a node with its children indented appropriately.

In [3]:
from nltk.corpus import wordnet as wn, stopwords
ignore_words = list(set(stopwords.words('english')))
ignore_words = ignore_words + ['keep']

DEBUG_ABSTRACTION_HIERARCHY = False
class GenericTree:
    def __init__(self, nltk_word, print_pad = ""):
        self.word = nltk_word
        self.print_pad = print_pad

    def get_max_depth(self):
        if len(self.children) == 0:
            return 0
        return max([tree.get_max_depth() for tree in self.children ])+1
    
    def print_tree(self):
        print("{}>{}".format(self.print_pad, self.word.lemma_names()[0]))
        for tree in self.children:
            tree.print_tree()
            
class HypernymTree(GenericTree):
    def __init__(self, nltk_word, pos, print_pad = ""):
        super().__init__(nltk_word, print_pad)
        if len(nltk_word.hypernyms()) == 0:
            self.children = [] 
        else:
            hyper = nltk_word.hypernyms() 
            hyper = [x for x in hyper if x.pos() == pos]
            self.children = [HypernymTree(x, pos, print_pad = "{}==".format(self.print_pad)) for x in hyper]
        
class HyponymTree(GenericTree):
    def __init__(self, nltk_word, pos, print_pad = ""): 
        super().__init__(nltk_word, print_pad)
        if len(nltk_word.hyponyms()) == 0:
            self.children = [] 
        else:
            hypo = nltk_word.hyponyms()
            hypo = [x for x in hypo if x.pos() == pos]
            self.children = [HyponymTree(x, pos, print_pad = "{}==".format(self.print_pad)) for x in hypo]


For each word in each clause, 
1. The max depth of the hypernym tree is calculated (Let's call it M)
2. The max depth of the hyponym tree is calculated (Let's call it N).
3. The more hyponyms there are to the input token, the more its abstraction score. 
4. So the abstraction score for this word will be the Maximum Depth of Hyponym Tree/Sum of the depths of hypernym and hyponym trees ( N/(M + N)). 

In [4]:
from nltk.tokenize import word_tokenize

def hypernym_hierarchy(toks, pos):
    hyper_trees = [HypernymTree(tok, pos) for tok in toks]
    hypers = [tree.get_max_depth() for tree in hyper_trees]
    hypo_trees = [HyponymTree(tok, pos) for tok in toks]
    hypos = [tree.get_max_depth() for tree in hypo_trees]
    if DEBUG_ABSTRACTION_HIERARCHY:
        for tree in hyper_trees:
            print("#####################################################3")
            tree.print_tree()
        for tree in hypo_trees:
            print("#####################################################3")
            tree.print_tree()

    assert len(hypers) == len(hypos)
    op = [hypos[i]*1./(hypers[i] + hypos[i]) if hypers[i]>0 or hypos[i]>0 else 0 for i in range(len(hypers))]
    #print(op)
    return max(op)

def hypernym_score(word):
    toks = wn.synsets(word)
    if len(toks) == 0 or word in ignore_words:
        return 0

    op = hypernym_hierarchy(toks, toks[0].pos())
    return round(op, 2)
    
def score_abstraction(clauses):
    op = []
    for clause in clauses:
        op.append(max([hypernym_score(x) for x in clause]))
    return op

df['abstraction_score'] = df.clauses_text_final.apply(score_abstraction)
df.sample(frac=1).head(20)

Unnamed: 0,UID,survey_id,prompt_number,prompt_id,prompt,response,clauses_text_final,voice,score,PassAct,idx,abstraction_score
16,02506.033.,2506,33,33,When I am nervous,this one is another tough one: I got to a grownup,"[this one is another tough one, I got to a gro...","[P_bevb_x, A_def]",1.5,a,16,"[0.14, 0.22]"
2,02543.013.,2543,13,40,We could make the world a better place if,you could do whatever you want,"[you could do, whatever you want]","[P_bevb_x, A_def]",1.5,a,2,"[0.25, 0.14]"
6,02498.014.,2498,14,14,The past,you have to be careful,[you have to be careful],[P_bevb_x],1.5,a,6,[0.25]
174,01765.035.,1765,35,35,My conscience bothers me if,I don't clean up the messes I make as an imper...,"[I don t clean up the messes, I make as an imp...","[P_bevb_x, A_def, A_pron_x]",5.5,a,174,"[0.25, 0.25, 0.14]"
32,02532.015.,2532,15,41,Privacy,when I am in the shower I like privacy.,[when I am in the shower I like privacy],[P_bevb_x],2.0,p,32,[0.25]
152,01804.017.,1804,17,17,When they avoided me,"my guess is that it is me avoiding them, just ...","[my guess is, that it is, me avoiding them jus...","[A_pron_x, P_bevb_x, A_def]",5.0,p,152,"[0.12, 0.14, 0.25]"
183,03287.011.,3287,11,39,What I like to do best is,employ useless concentration while dancing wit...,"[employ useless concentration, while dancing w...","[A_def, A_def]",6.0,p,183,"[0.25, 0.25]"
149,02250.022.,2250,22,43,At times I worry about,I don't tend to worry much. What arises is par...,"[and, I don, t tend to worry much, What arises...","[Undefined, P_bevb_x, A_def, A_def, A_pron_x, ...",5.0,p,149,"[0.14, 0.22, 0.25, 0.14, 0.25, 0.25, 0.14]"
36,02536.003.,2536,3,91,Grandparents,make me feel happy,"[make, me feel happy]","[A_def, A_def]",2.0,p,36,"[0.12, 0.14]"
191,01959.009.,1959,9,9,Education,"isn't education if it's isn't whole, and refle...",[if it s isn t whole and reflective of creatio...,[A_pron_x],6.0,p,191,[0.25]


The abstraction score is already betwee 0 and 1. But it is normalized to determine the valid metric value in later notebooks

In [5]:
def normalize(row, x_max, x_min, reverse_arr = False):
    if not reverse_arr:
        return [round((x - x_min)/(x_max - x_min), 2) for x in row]
    return [round((-1*x - x_min)/(x_max - x_min), 2) for x in row]

abstraction_score = df['abstraction_score'].tolist()
abstraction_score = [j for i in abstraction_score for j in i]
x_max, x_min = max(abstraction_score), min(abstraction_score)
df['abstraction_score_normalized'] = df['abstraction_score'].apply(lambda arr : normalize(arr, x_max, x_min))
df

Unnamed: 0,UID,survey_id,prompt_number,prompt_id,prompt,response,clauses_text_final,voice,score,PassAct,idx,abstraction_score,abstraction_score_normalized
0,02500.013.,2500,13,40,We could make the world a better place if,we could stop smoking!,[we could stop smoking],[P_bevb_x],1.5,a,0,[0.25],[1.0]
1,02544.024.,2544,24,24,If I had more money,I'll buy a mansion,[I ll buy a mansion],[A_def],1.5,a,1,[0.22],[0.88]
2,02543.013.,2543,13,40,We could make the world a better place if,you could do whatever you want,"[you could do, whatever you want]","[P_bevb_x, A_def]",1.5,a,2,"[0.25, 0.14]","[1.0, 0.56]"
3,02821.017.,2821,17,17,When they avoided me,I avoid them back,[I avoid them back],[A_def],1.5,a,3,[0.25],[1.0]
4,02508.027.,2508,27,98,Children who step out of line,I don't know what that means (when children do...,"[I don, t know, what that means, when children...","[P_bevb_x, A_def, A_def, P_bevb_x, A_def, P_be...",1.5,a,4,"[0.22, 0.14, 0.14, 0.25, 0.14, 0.25]","[0.88, 0.56, 0.56, 1.0, 0.56, 1.0]"
5,02784.008.,2784,8,8,What gets me into trouble is,Eating The HARVEST,[],[],1.5,a,5,[],[]
6,02498.014.,2498,14,14,The past,you have to be careful,[you have to be careful],[P_bevb_x],1.5,a,6,[0.25],[1.0]
7,02550.024.,2550,24,24,If I had more money,you could buy lots of things,[you could buy lots of things],[P_bevb_x],1.5,a,7,[0.25],[1.0]
8,02507.001.,2507,1,90,My family,has a cat...I don't have a cat though I have 2...,"[has a cat, I don t, have a cat, though I have...","[P_bevb_x, P_bevb_x, P_bevb_x, P_bevb_x]",1.5,a,8,"[0.25, 0.22, 0.25, 0.22]","[1.0, 0.88, 1.0, 0.88]"
9,02532.024.,2532,24,24,If I had more money,I would get to see the Star Wars movie actors ...,[I would get to see the Star Wars movie actors...,[P_bevb_x],1.5,a,9,[0.25],[1.0]


In [6]:
df.to_csv("./abstraction_scored.csv", index = False)