# Absctraction Scorer

## Introduction
This is the second notebook which scores each clause in each sentence for its abstraction. Its input is "voice_classified.csv".

In [1]:
import nltk
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/jovyan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [2]:
import pandas as pd

df = pd.read_csv("./voice_classified.csv")
df.voice = df.voice.apply(eval)
df.clauses_text_final = df.clauses_text_final.apply(eval)
df

Unnamed: 0,UID,survey_id,prompt_number,prompt_id,prompt,response,clauses_text_final,voice,idx
0,1357.14,1357,14,14,The past,"Winds through us, both from our lives and cult...",[Winds through us both from our lives and cult...,"[A_pron_x, A_pron_x, P_bevb_x, P_bevb_x, P_bev...",0
1,1357.22,1357,22,43,At times I worry about,Insufficient care and attention for his egocen...,[Insufficient care and attention for his egoce...,[A_pron_x],1
2,1522.08,1522,8,8,What gets me into trouble is,not considering others possibilities.,[not considering others possibilities],[A_def],2
3,1522.10,1522,10,10,When people are helpless,They often don&#039;t know it so they flak aro...,"[They often don, t know it, so they flak aroun...","[P_bevb_x, A_def, A_def]",3
4,1522.15,1522,15,41,Privacy,is a sense of hiding from others that which yo...,"[is a sense of, hiding from others that, which...","[P_bevb_x, A_def, P_bevb_x, A_def, A_def]",4
5,1522.32,1522,32,32,If I can\'t get what I want,I really don&#039;t want that much any more.,"[want, I really don, t want that much any more]","[A_def, P_bevb_x, A_def]",5
6,1524.03,1524,3,3,Change is,hj;oh,[hj],[Undefined],6
7,1529.10,1529,10,10,When people are helpless,At times I try to find other ways of doing thi...,"[At times I try to find other ways of, doing t...","[A_def, A_def, P_bevb_x, P_get_x, A_def]",7
8,1665.02,1665,2,2,When I am criticized,I often find myself running through a microgen...,"[I often find, myself running through a microg...","[A_def, A_def, A_def, A_def, A_def, A_def]",8
9,1665.05,1665,5,5,Being with other people,Is a mutually mutating meeting of universes,"[Is a meeting of universes, mutually mutating]","[P_bevb_x, A_def]",9


This part is vital to determine the actual abstraction score recursively. It consists of 2 classes that represent a tree datastructure with multiple children (https://en.wikipedia.org/wiki/Tree_structure):
* The Hypernym Tree: Its attributes are:
    * The input token (text)
    * The input token's hypernyms as detected by NLTK. Each hypernym token is another object of HypernymTree
* The Hyponym Tree
    * The input token (text)
    * The input token's hyponyms as detected by NLTK. Each hyponym token is another object of HyponymTree
    
Both classes also have two methods:
* Get Max Depth: This gives the total depth of a node in the tree.
* Print Tree: A debug method to print a node with its children indented appropriately.

In [3]:
from nltk.corpus import wordnet as wn, stopwords
ignore_words = list(set(stopwords.words('english')))
ignore_words = ignore_words + ['keep']

DEBUG_ABSTRACTION_HIERARCHY = False
class GenericTree:
    def __init__(self, nltk_word, print_pad = ""):
        self.word = nltk_word
        self.print_pad = print_pad

    def get_max_depth(self):
        if len(self.children) == 0:
            return 0
        return max([tree.get_max_depth() for tree in self.children ])+1
    
    def print_tree(self):
        print("{}>{}".format(self.print_pad, self.word.lemma_names()[0]))
        for tree in self.children:
            tree.print_tree()
            
class HypernymTree(GenericTree):
    def __init__(self, nltk_word, pos, print_pad = ""):
        super().__init__(nltk_word, print_pad)
        if len(nltk_word.hypernyms()) == 0:
            self.children = [] 
        else:
            hyper = nltk_word.hypernyms() 
            hyper = [x for x in hyper if x.pos() == pos]
            self.children = [HypernymTree(x, pos, print_pad = "{}==".format(self.print_pad)) for x in hyper]
        
class HyponymTree(GenericTree):
    def __init__(self, nltk_word, pos, print_pad = ""): 
        super().__init__(nltk_word, print_pad)
        if len(nltk_word.hyponyms()) == 0:
            self.children = [] 
        else:
            hypo = nltk_word.hyponyms()
            hypo = [x for x in hypo if x.pos() == pos]
            self.children = [HyponymTree(x, pos, print_pad = "{}==".format(self.print_pad)) for x in hypo]


For each word in each clause, 
1. The max depth of the hypernym tree is calculated (Let's call it M)
2. The max depth of the hyponym tree is calculated (Let's call it N).
3. The more hyponyms there are to the input token, the more its abstraction score. 
4. So the abstraction score for this word will be the Maximum Depth of Hyponym Tree/Sum of the depths of hypernym and hyponym trees ( N/(M + N)). 

In [4]:
from nltk.tokenize import word_tokenize

def hypernym_hierarchy(toks, pos):
    hyper_trees = [HypernymTree(tok, pos) for tok in toks]
    hypers = [tree.get_max_depth() for tree in hyper_trees]
    hypo_trees = [HyponymTree(tok, pos) for tok in toks]
    hypos = [tree.get_max_depth() for tree in hypo_trees]
    if DEBUG_ABSTRACTION_HIERARCHY:
        for tree in hyper_trees:
            print("#####################################################3")
            tree.print_tree()
        for tree in hypo_trees:
            print("#####################################################3")
            tree.print_tree()

    assert len(hypers) == len(hypos)
    op = [hypos[i]*1./(hypers[i] + hypos[i]) if hypers[i]>0 or hypos[i]>0 else 0 for i in range(len(hypers))]
    #print(op)
    return max(op)

def hypernym_score(word):
    toks = wn.synsets(word)
    if len(toks) == 0 or word in ignore_words:
        return 0

    op = hypernym_hierarchy(toks, toks[0].pos())
    return round(op, 2)
    
def score_abstraction(clauses):
    op = []
    for clause in clauses:
        op.append(max([hypernym_score(x) for x in clause]))
    return op

df['abstraction_score'] = df.clauses_text_final.apply(score_abstraction)
df.sample(frac=1).head(20)

Unnamed: 0,UID,survey_id,prompt_number,prompt_id,prompt,response,clauses_text_final,voice,idx,abstraction_score
33,1766.3,1766,30,30,If I were in charge,Of anything right now it would be a challenge ...,"[Of anything right now, it would be a challeng...","[Undefined, P_bevb_x, P_bevb_x, P_bevb_x, P_be...",33,"[0.14, 0.25, 0.25, 0.25, 0.22]"
207,2301.03,2301,3,3,Change is,good.,[good],[Undefined],207,[0.12]
399,2690.28,2690,28,83,A teacher has the right to,I don't mean to be oppositional. . but this se...,"[I don, t mean to be oppositional but, this se...","[P_bevb_x, P_bevb_x, A_def, P_bevb_x, P_bevb_x...",399,"[0.22, 0.14, 0.25, 0.14, 0.25, 0.25, 0.22, 0.2..."
42,1810.13,1810,13,40,We could make the world a better place if,..if individuals and systems are given the spa...,[if individuals and systems are given the spac...,"[P_bevb_x, P_bevb_x, A_def, P_bevb_x, A_def, A...",42,"[0.25, 0.14, 0.14, 0.22, 0.25, 0.14, 0.25, 0.2..."
529,3287.24,3287,24,24,If I had more money,I'd live with different opportunities that wou...,"[I d live with different opportunities, that w...","[A_def, P_bevb_x, P_bevb_x, P_bevb_x, P_get_x,...",529,"[0.22, 0.14, 0.14, 0, 0.25, 0.25]"
246,2390.17,2390,17,17,When they avoided me,"I felt safe, knowing that I could receive a wa...","[I felt safe knowing, that I could receive a w...","[A_def, A_pron_x]",246,"[0.22, 0.25]"
270,2469.02,2469,2,2,When I am criticized,"I can feel hurt but have learned that ""critici...","[I can feel hurt but, have learned, that criti...","[P_bevb_x, P_bevb_x, A_pron_x, P_bevb_x, A_def...",270,"[0.25, 0.14, 0.25, 0.25, 0.25, 0.25]"
494,3151.11,3151,11,39,What I like to do best is,The universe 'does me' as it sweetly caresses ...,"[The universe does me, as it sweetly caresses ...","[P_bevb_x, A_def, P_bevb_x, P_bevb_x, A_def, P...",494,"[0.14, 0.25, 0.14, 0.25, 0.25, 0.25, 0.14, 0.25]"
410,2721.26,2721,26,26,When I get mad,i feel adrenaline through my human body and it...,"[i feel adrenaline through my human body and, ...","[A_pron_x, P_bevb_x, A_def, A_def, P_bevb_x, A...",410,"[0.14, 0.25, 0.11, 0.25, 0.2, 0.25]"
373,2559.15,2559,15,41,Privacy,should be respected,[should be respected],[P_bevb_x],373,[0.25]


The abstraction score is already betwee 0 and 1. But it is normalized to determine the valid metric value in later notebooks

In [5]:
def normalize(row, x_max, x_min, reverse_arr = False):
    if not reverse_arr:
        return [round((x - x_min)/(x_max - x_min), 2) for x in row]
    return [round((-1*x - x_min)/(x_max - x_min), 2) for x in row]

abstraction_score = df['abstraction_score'].tolist()
abstraction_score = [j for i in abstraction_score for j in i]
x_max, x_min = max(abstraction_score), min(abstraction_score)
df['abstraction_score_normalized'] = df['abstraction_score'].apply(lambda arr : normalize(arr, x_max, x_min))
df

Unnamed: 0,UID,survey_id,prompt_number,prompt_id,prompt,response,clauses_text_final,voice,idx,abstraction_score,abstraction_score_normalized
0,1357.14,1357,14,14,The past,"Winds through us, both from our lives and cult...",[Winds through us both from our lives and cult...,"[A_pron_x, A_pron_x, P_bevb_x, P_bevb_x, P_bev...",0,"[0.25, 0.25, 0.25, 0.25, 0.14, 0.25, 0.12, 0.1...","[1.0, 1.0, 1.0, 1.0, 0.56, 1.0, 0.48, 0.56, 0...."
1,1357.22,1357,22,43,At times I worry about,Insufficient care and attention for his egocen...,[Insufficient care and attention for his egoce...,[A_pron_x],1,[0.25],[1.0]
2,1522.08,1522,8,8,What gets me into trouble is,not considering others possibilities.,[not considering others possibilities],[A_def],2,[0.25],[1.0]
3,1522.10,1522,10,10,When people are helpless,They often don&#039;t know it so they flak aro...,"[They often don, t know it, so they flak aroun...","[P_bevb_x, A_def, A_def]",3,"[0.14, 0.14, 0.14]","[0.56, 0.56, 0.56]"
4,1522.15,1522,15,41,Privacy,is a sense of hiding from others that which yo...,"[is a sense of, hiding from others that, which...","[P_bevb_x, A_def, P_bevb_x, A_def, A_def]",4,"[0.14, 0.14, 0.25, 0.14, 0.14]","[0.56, 0.56, 1.0, 0.56, 0.56]"
5,1522.32,1522,32,32,If I can\'t get what I want,I really don&#039;t want that much any more.,"[want, I really don, t want that much any more]","[A_def, P_bevb_x, A_def]",5,"[0.14, 0.22, 0.25]","[0.56, 0.88, 1.0]"
6,1524.03,1524,3,3,Change is,hj;oh,[hj],[Undefined],6,[0.14],[0.56]
7,1529.10,1529,10,10,When people are helpless,At times I try to find other ways of doing thi...,"[At times I try to find other ways of, doing t...","[A_def, A_def, P_bevb_x, P_get_x, A_def]",7,"[0.22, 0.14, 0.22, 0.14, 0.12]","[0.88, 0.56, 0.88, 0.56, 0.48]"
8,1665.02,1665,2,2,When I am criticized,I often find myself running through a microgen...,"[I often find, myself running through a microg...","[A_def, A_def, A_def, A_def, A_def, A_def]",8,"[0.22, 0.25, 0.14, 0.25, 0.12, 0.25]","[0.88, 1.0, 0.56, 1.0, 0.48, 1.0]"
9,1665.05,1665,5,5,Being with other people,Is a mutually mutating meeting of universes,"[Is a meeting of universes, mutually mutating]","[P_bevb_x, A_def]",9,"[0.22, 0.14]","[0.88, 0.56]"


In [6]:
df.to_csv("./abstraction_scored.csv", index = False)