<a href="https://colab.research.google.com/github/kleczekr/tolkenizer/blob/master/defining_uniqueness_metric.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import spacy
from tabulate import tabulate
from collections import Counter
import pandas as pd

In [None]:
# import nltk, vader, and set up vader sentiment analyzer

import nltk
nltk.download('vader_lexicon')
nltk.download('punkt')

from nltk.sentiment.vader import SentimentIntensityAnalyzer

sid = SentimentIntensityAnalyzer()

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.




In [None]:
nlp = spacy.load("en_core_web_sm")

In [None]:
# This cell is meant to accommodate the Google Colab way of dealing with reading 
# files from Google Drive; feel free to ignore it if you are running the notebook
# on your local machine
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
with open('/content/drive/My Drive/book_txt/tbl_gbr.txt', 'r') as f:
  tbl = f.read()

In [None]:
# To analyse paragraphs chapter by chapter, we first split the book into
# chapters. This is easy because every chapter heading is a bare numeral on
# its own line (preceded and followed by a line break).
ch_lst = list()
# extract text for chapters 1-72
for i in range(1,73):
  ch_lst.append(tbl.split('\n'+str(i)+'\n')[1].split('\n'+str(i+1)+'\n')[0])
 
# extract text for chapter 73
ch_lst.append(tbl.split('\n73\n')[1].split('\nACKNOWLEDGEMENTS\n')[0])
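The cell above searches the whole text for each heading, so it always uses the first line that holds a given numeral, even if that line comes earlier in the book than the real heading. A minimal sketch of an alternative, assuming the headings appear in ascending order, walks through the text once with `str.partition`. The function name `split_chapters` and the toy text are hypothetical and not part of the notebook.

```python
def split_chapters(text, last_chapter, end_marker):
    """Return the bodies of chapters 1..last_chapter, in order.

    Assumes each heading is a bare numeral on its own line and that the
    final chapter is followed by a line containing only end_marker.
    """
    chapters = []
    rest = text
    for number in range(1, last_chapter + 1):
        # Discard everything up to and including this chapter's heading.
        _, _, rest = rest.partition(f"\n{number}\n")
        # The chapter ends at the next heading, or at the end marker.
        if number < last_chapter:
            closing = f"\n{number + 1}\n"
        else:
            closing = f"\n{end_marker}\n"
        body, _, _ = rest.partition(closing)
        chapters.append(body)
    return chapters


toy_book = (
    "Title\n\n1\n\nFirst paragraph.\n\n2\n\nSecond paragraph.\n\n"
    "ACKNOWLEDGEMENTS\n\nThanks."
)
toy_chapters = split_chapters(toy_book, 2, "ACKNOWLEDGEMENTS")
print(toy_chapters)
```

Because each search starts after the previous heading, a stray numeral line in an earlier chapter is not mistaken for a later heading.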

In [None]:
# retrieve a list of paragraphs for each chapter; start with an empty list...
tb_paras = list()
# create an empty list for each chapter; later we will populate them with paragraphs
for i in range(73):
  tb_paras.append([])
# and finally populate the lists with paragraphs. Splitting on line breaks leaves
# many non-paragraphs (headings, empty lines) in the list, so the code below
# filters by length and rejects anything of 50 characters or fewer. As the code
# takes a while to run, it prints its progress (you can also see how many
# paragraphs are being rejected).
for i in range(73):
  for element in ch_lst[i].splitlines():
    if len(element) > 50:
      tb_paras[i].append(element)
      print(f'added paragraph of length {len(element)}')
    else:
      print(f'rejected paragraph of length {len(element)}')
  print(f'\nfinished processing chapter {i+1}\n')

Streaming output truncated to the last 5000 lines.
rejected paragraph of length 10
rejected paragraph of length 0
added paragraph of length 355
rejected paragraph of length 0
added paragraph of length 105
rejected paragraph of length 0
added paragraph of length 69
rejected paragraph of length 0
added paragraph of length 157
rejected paragraph of length 0
added paragraph of length 91
rejected paragraph of length 0
added paragraph of length 197
rejected paragraph of length 0
added paragraph of length 302
rejected paragraph of length 0
added paragraph of length 298
rejected paragraph of length 0
added paragraph of length 252
rejected paragraph of length 0
added paragraph of length 349
rejected paragraph of length 0
added paragraph of length 508
rejected paragraph of length 0
rejected paragraph of length 24
rejected paragraph of length 0
added paragraph of length 308
rejected paragraph of length 0
added paragraph of length 239
rejected paragraph of length 0
rejected paragraph
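The same filter can be written as a helper that returns the number of rejected lines instead of printing one message per line, which keeps the output short enough not to be truncated. This is only a sketch: `keep_long_lines` and `MIN_PARAGRAPH_CHARS` are hypothetical names, and the cutoff of 50 mirrors the cell above.

```python
MIN_PARAGRAPH_CHARS = 50  # same cutoff as the filtering cell above


def keep_long_lines(chapter_text, min_chars=MIN_PARAGRAPH_CHARS):
    """Return (kept_lines, rejected_count) for one chapter's text."""
    lines = chapter_text.splitlines()
    # Keep only lines strictly longer than the cutoff, as in the notebook.
    kept = [line for line in lines if len(line) > min_chars]
    return kept, len(lines) - len(kept)


toy_chapter = "short\n\n" + "x" * 51 + "\n" + "y" * 50
kept, rejected = keep_long_lines(toy_chapter)
print(f"kept {len(kept)} paragraph(s), rejected {rejected} line(s)")
```

Note that a line of exactly 50 characters is rejected, because the comparison is `>` rather than `>=`.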

In [None]:
# we repeat the process: this list will hold the paragraphs converted
# into spaCy Doc objects
tb_spc = list()
# create empty lists for each chapter, again...
for i in range(73):
  tb_spc.append([])
# now populate it with spaCy objects, in the same manner as above. This
# takes a while!
for i in range(73):
  for another_i in range(len(tb_paras[i])):
    tb_spc[i].append(nlp(tb_paras[i][another_i]))
    print(f'finished processing paragraph {another_i+1} in chapter {i+1}')
  print(f'\nfinished processing chapter {i+1}\n')

Streaming output truncated to the last 5000 lines.
finished processing paragraph 14 in chapter 30
finished processing paragraph 15 in chapter 30
finished processing paragraph 16 in chapter 30
finished processing paragraph 17 in chapter 30
finished processing paragraph 18 in chapter 30
finished processing paragraph 19 in chapter 30
finished processing paragraph 20 in chapter 30
finished processing paragraph 21 in chapter 30
finished processing paragraph 22 in chapter 30
finished processing paragraph 23 in chapter 30
finished processing paragraph 24 in chapter 30
finished processing paragraph 25 in chapter 30
finished processing paragraph 26 in chapter 30
finished processing paragraph 27 in chapter 30
finished processing paragraph 28 in chapter 30
finished processing paragraph 29 in chapter 30
finished processing paragraph 30 in chapter 30
finished processing paragraph 31 in chapter 30
finished processing paragraph 32 in chapter 30
finished processing paragraph 33 in chapte

In [9]:
# The crux of the task is counting part-of-speech (POS) occurrences in the
# paragraphs we have just run through spaCy. This is easy with a Counter().
list_pos = list()
chapter_count = 1
# retrieve POS counts, sentence counts and token counts for all the chapters
for chapter in tb_spc:
  for paragraph in chapter:
    poscount = Counter()
    poscount['chapter'] = chapter_count
    poscount['text'] = paragraph.text
    poscount['sentiment'] = sid.polarity_scores(paragraph.text)['compound']
    for sent in paragraph.sents:
      poscount['sentence_count'] += 1
      for tok in sent:
        # the line below counts parts of speech
        poscount[tok.pos_] += 1
        poscount['token_count'] += 1
    # you store the Counter in a list of counters
    list_pos.append(poscount)
  chapter_count += 1
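To see what each Counter ends up holding, here is a toy version of the inner loop that needs no spaCy model. The hand-tagged `(word, tag)` pairs and the helper `count_tags` are made up for illustration; in the notebook the tags come from `tok.pos_`.

```python
from collections import Counter


def count_tags(tagged_sentences):
    """Tally tags, sentences and tokens for one paragraph.

    tagged_sentences is a list of sentences, each a list of (word, tag) pairs.
    """
    counts = Counter()
    for sentence in tagged_sentences:
        counts["sentence_count"] += 1
        for _word, tag in sentence:
            counts[tag] += 1            # one more token with this tag
            counts["token_count"] += 1  # one more token overall
    return counts


toy_paragraph = [
    [("Dragons", "NOUN"), ("sleep", "VERB"), (".", "PUNCT")],
    [("Old", "ADJ"), ("dragons", "NOUN"), ("snore", "VERB"),
     ("loudly", "ADV"), (".", "PUNCT")],
]
toy_counts = count_tags(toy_paragraph)
print(toy_counts)
```

A Counter answers 0 when asked for a tag it has never seen, but that tag is still missing from its keys.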

In [10]:
# It's difficult to work with a list of counters. Convert it to a DataFrame
# for quick and easy analysis. Remember to fill NaN values: a POS tag that
# never occurs in a paragraph is missing from that paragraph's Counter
pos_df = pd.DataFrame(list_pos).fillna(0)
# change name of the index
pos_df.index.names = ['paragraph']
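Why `fillna(0)` matters here: a tag that never occurs in a paragraph has no key in that paragraph's Counter, so pandas leaves the corresponding cell empty (NaN). A small sketch with two hand-written dictionaries standing in for the Counters (the counts are invented):

```python
import pandas as pd

rows = [
    {"NOUN": 2, "VERB": 1},  # no INTJ in this paragraph
    {"NOUN": 1, "INTJ": 1},  # no VERB in this paragraph
]

raw = pd.DataFrame(rows)  # keys missing from a row become NaN
filled = raw.fillna(0)    # treat "tag absent" as a count of zero

print(raw.isna().sum().sum())  # number of empty cells before filling
print(filled)
```

Without the fill, any proportion computed from an empty cell would itself be NaN.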

In [11]:
# Convert raw counts to proportions of each paragraph's tokens; raw counts
# mostly reflect how long a paragraph is
pos_df['adj_proportion'] = pos_df.ADJ/pos_df.token_count
pos_df['verb_proportion'] = pos_df.VERB/pos_df.token_count
pos_df['noun_proportion'] = pos_df.NOUN/pos_df.token_count
pos_df['adv_proportion'] = pos_df.ADV/pos_df.token_count
pos_df['propn_proportion'] = pos_df.PROPN/pos_df.token_count
pos_df['avg_sentence_len'] = pos_df.token_count/pos_df.sentence_count
# coordinating conjunction, e.g. and, or, but
pos_df['coord_conj_proportion'] = pos_df.CCONJ/pos_df.token_count
pos_df['punct_proportion'] = pos_df.PUNCT/pos_df.token_count
# adposition---in, to, during
pos_df['adposition_proportion'] = pos_df.ADP/pos_df.token_count
# auxiliary---is, has (done), should (do)
pos_df['auxiliary_proportion'] = pos_df.AUX/pos_df.token_count
# subordinating conjunction---if, while, that
pos_df['sub_conj_proportion'] = pos_df.SCONJ/pos_df.token_count
# determiner---a, an, the
pos_df['det_proportion'] = pos_df.DET/pos_df.token_count

In [12]:
# raw numbers can be dropped now
pos_df = pos_df.drop(columns=['PROPN', 'CCONJ', 'PUNCT', 'ADP', 'NOUN', 'NUM', 'AUX', 'ADJ',
                     'PART', 'VERB', 'SCONJ', 'PRON', 'ADV', 'DET', 'INTJ', 'X', 'SYM'])

In [13]:
# add several more columns to the dataframe: z-scores, i.e. the number of standard
# deviations above/below the mean, for the proportion of adjectives (std_adj), verbs (std_v),
# nouns (std_n), adverbs (std_adv) and proper nouns (std_pn)
pos_df['std_adj'] = (pos_df.adj_proportion - pos_df.adj_proportion.mean(axis=0)) / pos_df.adj_proportion.std(axis=0)
pos_df['std_v'] = (pos_df.verb_proportion - pos_df.verb_proportion.mean(axis=0)) / pos_df.verb_proportion.std(axis=0)
pos_df['std_n'] = (pos_df.noun_proportion - pos_df.noun_proportion.mean(axis=0)) / pos_df.noun_proportion.std(axis=0)
pos_df['std_adv'] = (pos_df.adv_proportion - pos_df.adv_proportion.mean(axis=0)) / pos_df.adv_proportion.std(axis=0)
pos_df['std_pn'] = (pos_df.propn_proportion - pos_df.propn_proportion.mean(axis=0)) / pos_df.propn_proportion.std(axis=0)

# add additional calculations: std_coord, std_punct, std_adp, std_aux, std_sub, std_det
pos_df['std_coord'] = (pos_df.coord_conj_proportion - pos_df.coord_conj_proportion.mean(axis=0)) / pos_df.coord_conj_proportion.std(axis=0)
pos_df['std_punct'] = (pos_df.punct_proportion - pos_df.punct_proportion.mean(axis=0)) / pos_df.punct_proportion.std(axis=0)
pos_df['std_adp'] = (pos_df.adposition_proportion - pos_df.adposition_proportion.mean(axis=0)) / pos_df.adposition_proportion.std(axis=0)
pos_df['std_aux'] = (pos_df.auxiliary_proportion - pos_df.auxiliary_proportion.mean(axis=0)) / pos_df.auxiliary_proportion.std(axis=0)
pos_df['std_sub'] = (pos_df.sub_conj_proportion - pos_df.sub_conj_proportion.mean(axis=0)) / pos_df.sub_conj_proportion.std(axis=0)
pos_df['std_det'] = (pos_df.det_proportion - pos_df.det_proportion.mean(axis=0)) / pos_df.det_proportion.std(axis=0)
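Each `std_*` column is a z-score: the value minus the column mean, divided by the column standard deviation. By default pandas' `Series.std` computes the sample standard deviation (`ddof=1`), which is also what `statistics.stdev` in the standard library computes, so the arithmetic can be checked without pandas. The three proportions below are invented, and `z_scores` is a hypothetical helper.

```python
from statistics import mean, stdev


def z_scores(values):
    """Return (value - mean) / sample standard deviation for each value."""
    centre = mean(values)
    spread = stdev(values)  # sample standard deviation, like Series.std()
    return [(value - centre) / spread for value in values]


adj_proportions = [0.10, 0.20, 0.30]
scores = z_scores(adj_proportions)
print(scores)
```

Here the mean is 0.2 and the sample standard deviation is 0.1, so the scores come out at roughly -1, 0 and 1 (up to floating-point rounding).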

In [14]:
# calculate the basic uniqueness of a paragraph: the sum of the absolute values
# of the five standard-deviation scores (std_adj, std_v, std_n, std_adv, std_pn)
# calculated above.
pos_df['basic_uniqueness'] = abs(pos_df.std_adj) + abs(pos_df.std_v) + \
abs(pos_df.std_n) + abs(pos_df.std_adv) + abs(pos_df.std_pn)

In [15]:
# updated uniqueness metric: more POS tags are taken into consideration, with
# punctuation and determiners weighted by 1/2 and auxiliaries by 2
pos_df['updated_uniqueness'] = abs(pos_df.std_adj) + abs(pos_df.std_v) + abs(pos_df.std_n) + \
abs(pos_df.std_adv) + abs(pos_df.std_pn) + abs(pos_df.std_coord) + abs(pos_df.std_punct)/2 + \
abs(pos_df.std_adp) + abs(pos_df.std_aux)*2 + abs(pos_df.std_sub) + abs(pos_df.std_det)/2
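Both metrics can be read as weighted sums of absolute standard-deviation scores: `basic_uniqueness` gives five scores a weight of 1, while `updated_uniqueness` adds six more, weighting punctuation and determiners by 1/2 and auxiliaries by 2. A sketch of that reading for a single paragraph, using plain dictionaries; `weighted_uniqueness`, the two weight tables and the sample scores are hypothetical.

```python
BASIC_WEIGHTS = {
    "std_adj": 1, "std_v": 1, "std_n": 1, "std_adv": 1, "std_pn": 1,
}
UPDATED_WEIGHTS = {
    **BASIC_WEIGHTS,
    "std_coord": 1, "std_punct": 0.5, "std_adp": 1,
    "std_aux": 2, "std_sub": 1, "std_det": 0.5,
}


def weighted_uniqueness(scores, weights):
    """Sum of weight * |score| over the columns named in weights."""
    return sum(weight * abs(scores[name]) for name, weight in weights.items())


sample_scores = {
    "std_adj": -1.0, "std_v": 0.5, "std_n": 0.0, "std_adv": 2.0,
    "std_pn": -0.5, "std_coord": 0.0, "std_punct": 1.0, "std_adp": 0.0,
    "std_aux": -1.5, "std_sub": 0.0, "std_det": 2.0,
}
basic = weighted_uniqueness(sample_scores, BASIC_WEIGHTS)
updated = weighted_uniqueness(sample_scores, UPDATED_WEIGHTS)
print(basic, updated)
```

Keeping the weights in one table makes it easier to see, and to change, how much each part of speech contributes.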

In [16]:
# for display purposes, truncate the text to its first 30 characters
pos_df.text = pos_df.text.str.slice(0,30)+'...'

In [17]:
# display some of the results; let's see the paragraphs with basic uniqueness
# above 11:
print(tabulate(pos_df[pos_df.basic_uniqueness>11], headers='keys'))

  paragraph    chapter  text                                 sentiment    sentence_count    token_count    adj_proportion    verb_proportion    noun_proportion    adv_proportion    propn_proportion    avg_sentence_len    coord_conj_proportion    punct_proportion    adposition_proportion    auxiliary_proportion    sub_conj_proportion    det_proportion    std_adj      std_v      std_n    std_adv    std_pn    std_coord    std_punct     std_adp    std_aux    std_sub     std_det    basic_uniqueness    updated_uniqueness
-----------  ---------  ---------------------------------  -----------  ----------------  -------------  ----------------  -----------------  -----------------  ----------------  ------------------  ------------------  -----------------------  ------------------  -----------------------  ----------------------  ---------------------  ----------------  ---------  ---------  ---------  ---------  --------  -----------  -----------  ----------  ---------  ---------  ---------- 

In [18]:
# Let's see the updated uniqueness metric---does it display more interesting results?
print(tabulate(pos_df[pos_df.updated_uniqueness>19], headers='keys'))

  paragraph    chapter  text                                 sentiment    sentence_count    token_count    adj_proportion    verb_proportion    noun_proportion    adv_proportion    propn_proportion    avg_sentence_len    coord_conj_proportion    punct_proportion    adposition_proportion    auxiliary_proportion    sub_conj_proportion    det_proportion    std_adj      std_v       std_n    std_adv      std_pn    std_coord    std_punct    std_adp    std_aux    std_sub     std_det    basic_uniqueness    updated_uniqueness
-----------  ---------  ---------------------------------  -----------  ----------------  -------------  ----------------  -----------------  -----------------  ----------------  ------------------  ------------------  -----------------------  ------------------  -----------------------  ----------------------  ---------------------  ----------------  ---------  ---------  ----------  ---------  ----------  -----------  -----------  ---------  ---------  ---------  -------

In [19]:
# For comparison, let's see paragraphs with extremely high sentiment
# (VADER compound score above 0.97):
print(tabulate(pos_df[pos_df.sentiment>0.97], headers='keys'))

  paragraph    chapter  text                                 sentiment    sentence_count    token_count    adj_proportion    verb_proportion    noun_proportion    adv_proportion    propn_proportion    avg_sentence_len    coord_conj_proportion    punct_proportion    adposition_proportion    auxiliary_proportion    sub_conj_proportion    det_proportion     std_adj      std_v       std_n     std_adv      std_pn    std_coord    std_punct     std_adp    std_aux    std_sub     std_det    basic_uniqueness    updated_uniqueness
-----------  ---------  ---------------------------------  -----------  ----------------  -------------  ----------------  -----------------  -----------------  ----------------  ------------------  ------------------  -----------------------  ------------------  -----------------------  ----------------------  ---------------------  ----------------  ----------  ---------  ----------  ----------  ----------  -----------  -----------  ----------  ---------  ---------  -

In [20]:
# And extremely low sentiment (compound score below -0.97):
print(tabulate(pos_df[pos_df.sentiment<-0.97], headers='keys'))

  paragraph    chapter  text                                 sentiment    sentence_count    token_count    adj_proportion    verb_proportion    noun_proportion    adv_proportion    propn_proportion    avg_sentence_len    coord_conj_proportion    punct_proportion    adposition_proportion    auxiliary_proportion    sub_conj_proportion    det_proportion    std_adj        std_v       std_n     std_adv     std_pn    std_coord    std_punct     std_adp     std_aux     std_sub     std_det    basic_uniqueness    updated_uniqueness
-----------  ---------  ---------------------------------  -----------  ----------------  -------------  ----------------  -----------------  -----------------  ----------------  ------------------  ------------------  -----------------------  ------------------  -----------------------  ----------------------  ---------------------  ----------------  ---------  -----------  ----------  ----------  ---------  -----------  -----------  ----------  ----------  ---------