# Sentence Quality Scorer

This is the third notebook. It computes three metrics for each clause:
* The grade level of the clause
* The reading ease of the clause
* The sentence quality of the clause

### Introduction to Grade Level and Reading Ease Scoring
Grade level and reading ease are computed with the Flesch-Kincaid readability tests, described here: https://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readability_tests
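
For reference, the two Flesch-Kincaid formulas can be sketched in plain Python as below. This is a minimal illustration, not Textacy's implementation: the function names and the word, sentence, and syllable counts passed in are assumptions made for the example, and syllable counting itself (which depends on the tool) is left out.

```python
def flesch_reading_ease(n_words, n_sents, n_syllables):
    """Flesch reading ease: higher means easier to read."""
    return 206.835 - 1.015 * (n_words / n_sents) - 84.6 * (n_syllables / n_words)


def flesch_kincaid_grade_level(n_words, n_sents, n_syllables):
    """Flesch-Kincaid grade level: higher means harder to read."""
    return 0.39 * (n_words / n_sents) + 11.8 * (n_syllables / n_words) - 15.59


# Hypothetical counts: one sentence of 3 words and 3 syllables.
print(round(flesch_reading_ease(3, 1, 3), 2))         # 119.19
print(round(flesch_kincaid_grade_level(3, 1, 3), 2))  # -2.62
```

Short clauses push both formulas toward their extremes, which is why negative grade levels and reading-ease values above 100 appear in the outputs of this notebook.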

In [1]:
!pip install textacy
!python3 -m spacy download en
import textacy
import pandas as pd
from textacy.text_stats import TextStats
print("Loaded")

Collecting https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz (37.4MB)

    Linking successful
    /srv/conda/lib/python3.6/site-packages/en_core_web_sm -->
    /srv/conda/lib/python3.6/site-packages/spacy/data/en

    You can now load the model via spacy.load('en')

Loaded


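The input for this notebook is "abstraction_scored.csv". Its list-valued columns are stored as strings, so they are parsed back into Python lists after loading.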

In [2]:
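import ast

df = pd.read_csv("./abstraction_scored.csv")
# Parse the stringified lists back into Python lists.
# ast.literal_eval only accepts literals, so it is safer than eval.
df.clauses_text_final = df.clauses_text_final.apply(ast.literal_eval)
df.voice = df.voice.apply(ast.literal_eval)
df.abstraction_score = df.abstraction_score.apply(ast.literal_eval)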
df.sample(frac = 1).head(10)

Unnamed: 0,UID,survey_id,prompt_number,prompt_id,prompt,response,clauses_text_final,voice,idx,abstraction_score,abstraction_score_normalized
301,2508.27,2508,27,98,Children who step out of line,I don't know what that means (when children do...,"[I don, t know, what that means, when children...","[P_bevb_x, A_def, A_def, P_bevb_x, A_def, P_be...",301,"[0.22, 0.14, 0.14, 0.25, 0.14, 0.25]","[0.88, 0.56, 0.56, 1.0, 0.56, 1.0]"
213,2319.03,2319,3,3,Change is,Everyday in everything and always. Change is a...,"[Everyday in everything and, always Change, is...","[Undefined, A_def, P_bevb_x]",213,"[0.14, 0.25, 0.25]","[0.56, 1.0, 1.0]"
186,2197.17,2197,17,17,When they avoided me,"I saw it as an expression of form only, which ...","[I saw it as an expression of form, only which...","[A_def, P_bevb_x, P_bevb_x]",186,"[0.22, 0.25, 0.25]","[0.88, 1.0, 1.0]"
380,2612.25,2612,25,25,My main problem is,being in my head rather than fully present and...,[being in my head rather than fully present an...,"[A_pron_x, A_def, P_bevb_x, A_def, A_def, P_be...",380,"[0.14, 0.14, 0.25, 0.22, 0.14, 0.14, 0.25, 0.2...","[0.56, 0.56, 1.0, 0.88, 0.56, 0.56, 1.0, 1.0, ..."
470,3105.25,3105,25,25,My main problem is,saying no... because I want to be helpful... b...,"[saying no, because I want to be helpful but, ...","[P_yn, P_bevb_x, P_bevb_x, P_get_x, A_def, A_def]",470,"[0.14, 0.25, 0.22, 0.14, 0.14, 0.14]","[0.56, 1.0, 0.88, 0.56, 0.56, 0.56]"
161,2170.02,2170,2,2,When I am criticized,"if I am depleted, I tend to take the criticism...","[if I am depleted, I tend to take the criticis...","[P_bevb_x, A_def, P_bevb_x, A_def, A_def, A_pr...",161,"[0.22, 0.25, 0.22, 0.22, 0.25, 0.22, 0.22, 0.2...","[0.88, 1.0, 0.88, 0.88, 1.0, 0.88, 0.88, 1.0, ..."
366,2552.24,2552,24,24,If I had more money,I would spend it on cool stuff.,[I would spend it on cool stuff],[P_bevb_x],366,[0.25],[1.0]
530,3288.16,3288,16,16,I feel sorry,when I hurt others.,[when I hurt others],[A_def],530,[0.22],[0.88]
464,3099.22,3099,22,43,At times I worry about,"big, big existential questions, and little mic...",[big big existential questions and little micr...,[Undefined],464,[0.25],[1.0]
274,2499.11,2499,11,39,What I like to do best is,play and eat sugar.,"[play and, eat sugar]","[A_def, A_def]",274,"[0.14, 0.12]","[0.56, 0.48]"


The raw reading-ease and grade-level values are computed for each clause using Textacy's TextStats.

In [3]:
def score_readability(text):
    doc = textacy.Doc(text, lang = "en")
    ts = TextStats(doc)
    return ts.readability_stats

df['readability_attributes_score'] = df.clauses_text_final.apply(lambda arr: [score_readability(x) for x in arr])
df['grading_level'] = df.readability_attributes_score.apply(lambda dct_arr: [round(dct['flesch_kincaid_grade_level'], 2) for dct in dct_arr])
df['reading_ease'] = df.readability_attributes_score.apply(lambda dct_arr: [round(dct['flesch_reading_ease'], 2) for dct in dct_arr])
# Sample value of readability_attributes_score for a one-clause response (kept for reference only)
_ = """[{'flesch_kincaid_grade_level': 0.6257142857142846, 'flesch_reading_ease': 103.04428571428573, 'smog_index': 3.1291, 'gunning_fog_index': 2.8000000000000003, 'coleman_liau_index': 2.6518669999999993, 'automated_readability_index': 0.23714285714285666, 'lix': 7.0, 'gulpease_index': 93.28571428571428, 'wiener_sachtextformel': -2.5074571428571426}]"""
del df['readability_attributes_score']
df.sample(frac = 1).head(10)

Unnamed: 0,UID,survey_id,prompt_number,prompt_id,prompt,response,clauses_text_final,voice,idx,abstraction_score,abstraction_score_normalized,grading_level,reading_ease
399,2690.28,2690,28,83,A teacher has the right to,I don't mean to be oppositional. . but this se...,"[I don, t mean to be oppositional but, this se...","[P_bevb_x, P_bevb_x, A_def, P_bevb_x, P_bevb_x...",399,"[0.22, 0.14, 0.25, 0.14, 0.25, 0.25, 0.22, 0.2...","[0.88, 0.56, 1.0, 0.56, 1.0, 1.0, 0.88, 0.88, ...","[-3.01, 6.42, 0.52, 0.72, 7.37, 2.88, 1.31, -1...","[120.21, 59.75, 102.05, 97.03, 54.7, 83.32, 90..."
508,3181.23,3181,23,23,I am,"the voice at the edge of hearing, the movement...","[the voice at the edge of, hearing the movemen...","[Undefined, A_def, A_def, A_def, P_bevb_x, P_b...",508,"[0.25, 0.25, 0.14, 0.14, 0.25, 0.25, 0.22, 0.1...","[1.0, 1.0, 0.56, 0.56, 1.0, 1.0, 0.88, 0.56, 0...","[-1.45, 5.86, -3.01, 4.66, 3.84, 3.72, 0.52, 2...","[116.15, 72.62, 120.21, 90.13, 88.91, 88.0, 10..."
439,2849.22,2849,22,43,At times I worry about,the fact that I can be inadvertently overbearing,"[the fact, that I can be inadvertently overbea...","[Undefined, P_bevb_x]",439,"[0.25, 0.25]","[1.0, 1.0]","[-3.01, 10.35]","[120.21, 31.55]"
41,1806.23,1806,23,23,I am,amazed at how quickly the world gives way to t...,"[amazed at, how quickly the world gives way to...","[Undefined, A_def]",41,"[0.11, 0.25]","[0.44, 1.0]","[-3.01, 1.03]","[120.21, 103.7]"
240,2360.32,2360,32,32,If I can\'t get what I want,it\'s not for me.,"[want, it s not for me]","[A_def, A_def]",240,"[0.14, 0.14]","[0.56, 0.56]","[-3.4, -1.84]","[121.22, 117.16]"
161,2170.02,2170,2,2,When I am criticized,"if I am depleted, I tend to take the criticism...","[if I am depleted, I tend to take the criticis...","[P_bevb_x, A_def, P_bevb_x, A_def, A_def, A_pr...",161,"[0.22, 0.25, 0.22, 0.22, 0.25, 0.22, 0.22, 0.2...","[0.88, 1.0, 0.88, 0.88, 1.0, 0.88, 0.88, 1.0, ...","[3.67, 3.65, -2.23, -3.01, 6.71, 1.03, 0.52, 7...","[75.88, 84.9, 118.18, 120.21, 61.24, 103.7, 10..."
318,2532.03,2532,3,91,Grandparents,are very nice.,[are very nice],[P_bevb_x],318,[0.25],[1.0],[-2.62],[119.19]
388,2628.14,2628,14,14,The past,"seems like remembering a movie I once watched,...","[seems like, remembering a movie, I once watch...","[A_def, A_def, A_def, P_bevb_x, A_def, A_def, ...",388,"[0.12, 0.14, 0.25, 0.25, 0.14, 0.14, 0.14]","[0.48, 0.56, 1.0, 1.0, 0.56, 0.56, 0.56]","[-3.01, 9.18, -2.23, 0.52, 5.24, 0.72, 2.88]","[120.21, 34.59, 118.18, 102.05, 66.4, 97.03, 8..."
470,3105.25,3105,25,25,My main problem is,saying no... because I want to be helpful... b...,"[saying no, because I want to be helpful but, ...","[P_yn, P_bevb_x, P_bevb_x, P_get_x, A_def, A_def]",470,"[0.14, 0.25, 0.22, 0.14, 0.14, 0.14]","[0.56, 1.0, 0.88, 0.56, 0.56, 0.56]","[2.89, 2.31, 0.72, 13.11, -2.23, 0.52]","[77.91, 90.96, 97.03, 6.39, 118.18, 102.05]"
79,1880.36,1880,36,48,Sometimes I wish that,I could be younger and have done things differ...,"[I could be younger and, have done things diff...","[P_bevb_x, P_bevb_x]",79,"[0.25, 0.14]","[1.0, 0.56]","[-1.84, 6.62]","[117.16, 54.73]"


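The raw metrics are min-max normalized to a 0-1 scale, using the minimum and maximum taken over all clauses. Reading ease is inversely related to the quality of the sentence (a higher reading-ease score means simpler text), so the reading-ease values are negated before normalization. The easiest clause then maps to 0 and the hardest to 1.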
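
The normalization described above can be sketched on a toy input as follows. This is a minimal standalone illustration, assuming a hypothetical helper `min_max_nested` and made-up scores; it is not the notebook's own `normalize` function, though it follows the same idea of taking the minimum and maximum over all clauses.

```python
def min_max_nested(rows, invert=False):
    """Min-max normalize a list of lists using the global min and max.

    With invert=True the values are negated first, so the largest raw
    value maps to 0.0 and the smallest to 1.0.
    """
    sign = -1 if invert else 1
    flat = [sign * x for row in rows for x in row]
    lo, hi = min(flat), max(flat)
    return [[round((sign * x - lo) / (hi - lo), 2) for x in row] for row in rows]


# Made-up reading-ease scores for two responses (three clauses in total).
toy_reading_ease = [[120.0, 70.0], [20.0]]
print(min_max_nested(toy_reading_ease, invert=True))  # [[0.0, 0.5], [1.0]]
print(min_max_nested(toy_reading_ease))               # [[1.0, 0.5], [0.0]]
```

The sketch assumes at least two distinct values; if every score were equal, `hi - lo` would be zero and the division would fail.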

In [4]:
def normalize(row, x_max, x_min, reverse_arr = False):
    if not reverse_arr:
        return [round((x - x_min)/(x_max - x_min), 2) for x in row]
    return [round((-1*x - x_min)/(x_max - x_min), 2) for x in row]

reading_ease = df['reading_ease'].tolist()
reading_ease = [j for i in reading_ease for j in i]
reading_ease = [-1*x for x in reading_ease]
x_max, x_min = max(reading_ease), min(reading_ease)
df['reading_ease_normalized'] = df['reading_ease'].apply(lambda arr : normalize(arr, x_max, x_min, reverse_arr = True))

grading_levels = df['grading_level'].tolist()
grading_levels = [j for i in grading_levels for j in i]
x_max, x_min = max(grading_levels), min(grading_levels)
df['grading_level_normalized'] = df['grading_level'].apply(lambda arr : normalize(arr, x_max, x_min, reverse_arr = False))
df.sample(frac = 1).head(10)

Unnamed: 0,UID,survey_id,prompt_number,prompt_id,prompt,response,clauses_text_final,voice,idx,abstraction_score,abstraction_score_normalized,grading_level,reading_ease,reading_ease_normalized,grading_level_normalized
462,2952.25,2952,25,25,My main problem is,getting over decades spent proving myself.,"[getting over decades, spent proving myself]","[A_def, A_def]",462,"[0.25, 0.14]","[1.0, 0.56]","[1.31, 5.25]","[90.99, 62.79]","[0.09, 0.17]","[0.1, 0.18]"
206,2296.34,2296,34,47,Technology,is simply a tool.,[is simply a tool],[P_bevb_x],206,[0],[0.0],[0.72],[97.03],[0.07],[0.09]
200,2249.24,2249,24,24,If I had more money,"I would save and invest, hoard and spend frivo...","[I would save and, invest hoard and, spend fri...","[P_bevb_x, A_def, A_def, P_bevb_x, P_bevb_x, P...",200,"[0.22, 0.14, 0.14, 0.22, 0.22, 0.25, 0.25, 0.25]","[0.88, 0.56, 0.56, 0.88, 0.88, 1.0, 1.0, 1.0]","[-2.23, 1.31, 2.89, 0.52, 0.52, 10.21, 6.71, 1...","[118.18, 90.99, 77.91, 100.24, 100.24, 37.9, 6...","[0.01, 0.09, 0.13, 0.06, 0.06, 0.25, 0.18, 0.33]","[0.02, 0.1, 0.13, 0.08, 0.08, 0.29, 0.21, 0.37]"
382,2618.12,2618,12,12,A good boss,"fosters the creativity of the team, both indiv...",[fosters the creativity of the team both indiv...,[A_def],382,[0.25],[1.0],[14.27],[10.57],[0.33],[0.37]
8,1665.02,1665,2,2,When I am criticized,I often find myself running through a microgen...,"[I often find, myself running through a microg...","[A_def, A_def, A_def, A_def, A_def, A_def]",8,"[0.22, 0.25, 0.14, 0.25, 0.12, 0.25]","[0.88, 1.0, 0.56, 1.0, 0.48, 1.0]","[1.31, 8.9, 10.35, 14.65, -3.4, 7.63]","[90.99, 47.3, 31.55, 16.77, 121.22, 63.49]","[0.09, 0.22, 0.26, 0.31, 0.0, 0.17]","[0.1, 0.26, 0.29, 0.38, 0.0, 0.23]"
127,2025.22,2025,22,43,At times I worry about,I did not take fast action.,[I did not take fast action],[P_bevb_x],127,[0.25],[1.0],[0.52],[102.05],[0.06],[0.08]
60,1831.11,1831,11,39,What I like to do best is,be in a complete experience of effortless flow...,[be in a complete experience of effortless flo...,[P_bevb_x],60,[0.25],[1.0],[10.56],[47.83],[0.22],[0.3]
216,2333.25,2333,25,25,My main problem is,I worry too much.,[I worry too much],[A_def],216,[0.25],[1.0],[0.72],[97.03],[0.07],[0.09]
132,2045.06,2045,6,6,The thing I like about myself is,"I am adaptable and loving. I see people, natur...","[I am adaptable and, I see, people nature the ...","[P_bevb_x, A_def, A_def, P_bevb_x, A_def]",132,"[0.22, 0.22, 0.14, 0.22, 0.22]","[0.88, 0.88, 0.56, 0.88, 0.88]","[0.72, -3.01, 4.0, 6.42, -1.06]","[97.03, 120.21, 78.87, 59.75, 115.13]","[0.07, 0.0, 0.13, 0.18, 0.02]","[0.09, 0.01, 0.16, 0.21, 0.05]"
103,1951.02,1951,2,2,When I am criticized,$% Meh &#x29;&#x28;*^ Bah IO} Ouch @#$ HA! #*$...,"[Dare, I judge, He split, differentiate Self, ...","[A_def, A_def, A_def, A_def, A_def, P_bevb_x, ...",103,"[0.12, 0.22, 0.14, 0.2, 0.14, 0.22, 0.25]","[0.48, 0.88, 0.56, 0.8, 0.56, 0.88, 1.0]","[-3.4, -3.01, -3.01, 20.59, 3.76, -2.62, -3.01]","[121.22, 120.21, 120.21, -48.99, 82.39, 119.19...","[0.0, 0.0, 0.0, 0.5, 0.11, 0.01, 0.0]","[0.0, 0.01, 0.01, 0.51, 0.15, 0.02, 0.01]"


In [5]:
# Sanity check: both normalized metrics should span exactly 0.0 to 1.0
reading_ease = df['reading_ease_normalized'].tolist()
reading_ease = [j for i in reading_ease for j in i]
grade = df['grading_level_normalized'].tolist()
grade = [j for i in grade for j in i]
print(max(reading_ease), min(reading_ease), max(grade), min(grade))
df[['UID', 'survey_id', 'prompt_number', 'prompt_id', "prompt", "response", "clauses_text_final", "voice", "idx", "abstraction_score_normalized", "reading_ease_normalized", "grading_level_normalized"]].to_csv("readability_scored.csv", index = False)

1.0 0.0 1.0 0.0


### Introduction to Computing the clause's overall quality
This part estimates how much each clause contributes to the overall intent of the sentence. To do this, we extract key terms from the original sentence (the prompt followed by the response) using SGRank, an unsupervised keyword extraction technique (described here: http://www.aclweb.org/anthology/S15-1013). The key terms are n-grams, since an n-gram usually adds more value than an individual token. Each clause that contains an n-gram is credited with that n-gram's SGRank score. The quality metric per clause is then the mean of those scores: sum(SGRank scores of the n-grams in the clause) / number of n-grams in the clause. A clause that contains none of the extracted n-grams gets a score of 0.

This output is stored in "keyterm_scored.csv", which will be used to evaluate the final scores and voices.
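
The per-clause quality metric can be illustrated without Textacy. The sketch below is a simplified stand-in under stated assumptions: `keyterm_scores` is a made-up list of (n-gram, score) pairs, and `clause_quality` is a hypothetical helper that averages the scores of the n-grams found in each clause by substring match.

```python
def clause_quality(clauses, keyterm_scores):
    """Mean score of the key terms contained in each clause (0.0 if none)."""
    scores = dict(keyterm_scores)
    result = []
    for clause in clauses:
        # Collect the score of every key term that occurs in the clause text.
        matched = [score for term, score in scores.items() if term in clause]
        result.append(round(sum(matched) / len(matched), 2) if matched else 0.0)
    return result


# Made-up key terms and scores, for illustration only.
keyterm_scores = [("cool stuff", 0.6), ("money", 0.3), ("stuff", 0.2)]
clauses = ["if I had more money", "I would spend it on cool stuff", "right away"]
print(clause_quality(clauses, keyterm_scores))  # [0.3, 0.4, 0.0]
```

Because matching is by substring, overlapping n-grams such as "cool stuff" and "stuff" both count toward the same clause.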

In [6]:
from textacy.keyterms import sgrank
df['nlp_doc'] = df.apply(lambda row : textacy.Doc(row['prompt'] + " " + row['response'], lang = "en"), axis = 1)
df['sgrank'] = df['nlp_doc'].apply(lambda doc : sgrank(doc, n_keyterms = len(doc)))

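def get_normalized_importance(row):
    """Return, for each clause in the row, the mean SGRank score of the key terms it contains."""
    clauses = row["clauses_text_final"]
    rank_tuples = dict(row['sgrank'])  # key term -> SGRank score
    ngram_keys = rank_tuples.keys()
    op = []
    for clause in clauses:
        str_clause = "".join(clause)
        denominator = 0
        numerator = 0
        for x in ngram_keys:
            # Substring match: the key term must appear verbatim in the clause text
            if x in str_clause:
                numerator += rank_tuples[x]
                denominator += 1
        # Clauses that contain no key term score 0.0
        op.append(round(numerator / denominator, 2) if denominator > 0 else 0.0)
    return op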
    
df["sgrank_normalized"] = df.apply(get_normalized_importance, axis = 1)
df[['UID', 'survey_id', 'prompt_number', 'prompt_id', "prompt", "response", "clauses_text_final", "voice", "idx", "abstraction_score_normalized", "reading_ease_normalized", "grading_level_normalized", "sgrank_normalized"]].to_csv("keyterm_scored.csv", index = False)