## Quantifying Linguistic Degeneration

Once the data is collected, the next step is to decide how to quantify linguistic degeneration. I use the following metrics:
<br>

***Lexical Quality***

1. Type-token ratio (TTR): measures vocabulary diversity
2. Zipf curve tail heaviness: measures use of rare words
3. Average word length: measures use of larger words
4. Age of acquisition: measures word difficulty

***Syntactic Quality***

5. Parse tree depth: measures sentence complexity
6. Fragment ratio: measures informality
7. Index of Syntactic Complexity: measures overall syntactic quality

***Orthographical Quality***

8. Levenshtein distance to dictionary: spelling errors
9. Punctuation frequency: attention to grammar
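
One way metric 8 could be computed is sketched below. This is a minimal sketch only: the toy `DICTIONARY` set and the function names `levenshtein` and `distance_to_dictionary` are assumptions made for illustration, and a real analysis would load a full word list instead.

```python
def levenshtein(a, b):
    """Edit distance between a and b (insertions, deletions, substitutions cost 1)."""
    # previous[j] is the distance between the prefix of a processed so far and b[:j]
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


# hypothetical toy dictionary, for illustration only
DICTIONARY = {"great", "is", "nicholas", "opinion"}


def distance_to_dictionary(word, dictionary=DICTIONARY):
    """Distance from a word to its closest dictionary entry (0 if spelled correctly)."""
    word = word.lower()
    if word in dictionary:
        return 0
    return min(levenshtein(word, entry) for entry in dictionary)


print(distance_to_dictionary("great"))  # correctly spelled
print(distance_to_dictionary("graet"))  # two letters swapped
```

Averaging this distance over the tokens of a post would give a per-post spelling score, with 0 meaning every token was found in the dictionary.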

***Substantive Quality***

10. Abstract concepts
11. Figurative language

Note that I do not use other traditional measures of lexical complexity, such as ambiguity, vague quantifier frequency, orthographic neighborhood size, and terminology inconsistency. Word ambiguity almost exclusively measures the complexity of the text from the perspective of the reader. Because the purpose of this analysis is to examine language degeneration over time as a result of modern phenomena, I am principally interested in the intellect required to produce the text; that is, in true substantive complexity as opposed to interpretive complexity. Vague quantifier frequency, orthographic neighborhood size, and terminology inconsistency likewise measure complexity of interpretation, so they are excluded as well. It is also worth noting that I categorize abstract concepts and figurative language under substantive quality rather than lexical quality, as they bear more directly on the meaning of the words.

Likewise, I exclude certain measures of syntactic complexity, such as sentence length, information overload, passive voice, and negation. Sentence length is excluded because it can signal either high-quality writing (genuine complexity) or low-quality writing (run-on sentences). Additionally, sentence complexity can be better measured via the Index of Syntactic Complexity (ISC). Information overload, passive voice, and negation are excluded on the basis of the same interpretability-versus-quality distinction drawn in the paragraph above.

## Computing Lexical Quality Metrics With Python

In [9]:
import re
import numpy as np
import pandas as pd
from wordfreq import zipf_frequency

In [10]:
# sample post from the Cornell subreddit for testing tokenizers
cornell_example = "Based on 2017 numbers:    \n\n3375 entering freshman, 56.6% yield, meaning 5962 were accepted out of 47039. That gives a 12.7% acceptance rate.    \n\nIf Cornell were to enroll 900 more students, that'd be 225 additional students per year.  That works out to (3375+225)/0.566 = 6360 accepted students, giving a theoretical acceptance rate of 13.5% if Cornell had implemented this change in 2017.    \n\nKeep in mind that this is not an accurate projection for 2021 because we get ~2000 more applicants each year, so acceptance rates will actually continue to fall.   \n\nRegardless, a 0.8% rise in acceptance doesn't seem too bad.  As long as the faculty can handle the moderate increase in class sizes and the quality of education stays the same, I don't see a reason to reject more people than we have to.    \n\nSource: http://irp.dpb.cornell.edu/tableau_visual/admissions"

In [11]:
def simple_tokenize(text):
    '''Helper function to tokenize a string of text, removing non-alphabetic characters'''
    
    text = text.lower()
    processed = re.sub(r'[^a-zA-Z\s]', '', text)
    tokens = processed.split()

    return tokens

In [12]:
example = "Nicholas $%#! @@is# $@! 123 gre4at.👍"
print(example)
print(simple_tokenize(example))
print('The result should only contain words\n')
print(simple_tokenize(cornell_example))

Nicholas $%#! @@is# $@! 123 gre4at.👍
['nicholas', 'is', 'great']
The result should only contain words

['based', 'on', 'numbers', 'entering', 'freshman', 'yield', 'meaning', 'were', 'accepted', 'out', 'of', 'that', 'gives', 'a', 'acceptance', 'rate', 'if', 'cornell', 'were', 'to', 'enroll', 'more', 'students', 'thatd', 'be', 'additional', 'students', 'per', 'year', 'that', 'works', 'out', 'to', 'accepted', 'students', 'giving', 'a', 'theoretical', 'acceptance', 'rate', 'of', 'if', 'cornell', 'had', 'implemented', 'this', 'change', 'in', 'keep', 'in', 'mind', 'that', 'this', 'is', 'not', 'an', 'accurate', 'projection', 'for', 'because', 'we', 'get', 'more', 'applicants', 'each', 'year', 'so', 'acceptance', 'rates', 'will', 'actually', 'continue', 'to', 'fall', 'regardless', 'a', 'rise', 'in', 'acceptance', 'doesnt', 'seem', 'too', 'bad', 'as', 'long', 'as', 'the', 'faculty', 'can', 'handle', 'the', 'moderate', 'increase', 'in', 'class', 'sizes', 'and', 'the', 'quality', 'of', 'educat

Notice that the simple tokenizer cannot adequately handle links: the URL at the end of the post collapses into the single token "httpirpdpbcornelledutableauvisualadmissions".
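
One possible workaround, shown here only as a sketch, is to strip URLs before removing non-alphabetic characters. The pattern `URL_PATTERN` and the function name `simple_tokenize_no_urls` are assumptions for illustration; the pattern only covers links that start with `http://`, `https://`, or `www.`.

```python
import re

# assumed URL pattern: anything starting with http(s):// or www. up to the next whitespace
URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')


def simple_tokenize_no_urls(text):
    """Variant of the simple tokenizer that drops URLs before cleaning."""
    text = URL_PATTERN.sub(' ', text.lower())
    processed = re.sub(r'[^a-z\s]', '', text)
    return processed.split()


sample = "Source: http://irp.dpb.cornell.edu/tableau_visual/admissions"
print(simple_tokenize_no_urls(sample))  # ['source']
```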

In [13]:
# set up nltk tokenizer
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize, TweetTokenizer
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/nickvick/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [14]:
print('Now try the advanced tokenizer')
another_example = "Nicholas is great. Hopefully this works. I now have hope. 👍."
print(another_example)
print(word_tokenize(another_example))

Now try the advanced tokenizer
Nicholas is great. Hopefully this works. I now have hope. 👍.
['Nicholas', 'is', 'great', '.', 'Hopefully', 'this', 'works', '.', 'I', 'now', 'have', 'hope', '.', '👍', '.']


In [15]:
def tokenize(text):
    '''Helper function to tokenize social media text. Note that the TweetTokenizer 
    preserves mentions and contractions.'''
    tokenizer = TweetTokenizer()
    tokens = tokenizer.tokenize(text)

    return tokens

In [16]:
print("Let's try this on a line from the actual Cornell subreddit")
tokenize(cornell_example)

Let's try this on a line from the actual Cornell subreddit


['Based',
 'on',
 '2017',
 'numbers',
 ':',
 '3375',
 'entering',
 'freshman',
 ',',
 '56.6',
 '%',
 'yield',
 ',',
 'meaning',
 '5962',
 'were',
 'accepted',
 'out',
 'of',
 '47039',
 '.',
 'That',
 'gives',
 'a',
 '12.7',
 '%',
 'acceptance',
 'rate',
 '.',
 'If',
 'Cornell',
 'were',
 'to',
 'enroll',
 '900',
 'more',
 'students',
 ',',
 "that'd",
 'be',
 '225',
 'additional',
 'students',
 'per',
 'year',
 '.',
 'That',
 'works',
 'out',
 'to',
 '(',
 '3375',
 '+',
 '225',
 ')',
 '/',
 '0.566',
 '=',
 '6360',
 'accepted',
 'students',
 ',',
 'giving',
 'a',
 'theoretical',
 'acceptance',
 'rate',
 'of',
 '13.5',
 '%',
 'if',
 'Cornell',
 'had',
 'implemented',
 'this',
 'change',
 'in',
 '2017',
 '.',
 'Keep',
 'in',
 'mind',
 'that',
 'this',
 'is',
 'not',
 'an',
 'accurate',
 'projection',
 'for',
 '2021',
 'because',
 'we',
 'get',
 '~',
 '2000',
 'more',
 'applicants',
 'each',
 'year',
 ',',
 'so',
 'acceptance',
 'rates',
 'will',
 'actually',
 'continue',
 'to',
 'fall',
 '

In [17]:
def clean_lexical_tokens(tokens):
    '''Helper function to clean tokens by removing punctuation, numbers, and emojis
    for purely lexical analysis.'''

    cleaned = []

    for tok in tokens:
        # skip over punctuation-only tokens
        if re.match(r'^\W+$', tok):
            continue
        # only keep alphabetic tokens; this also drops numbers, emojis,
        # and tokens containing apostrophes such as contractions
        if tok.isalpha():
            cleaned.append(tok.lower())

    return cleaned

In [18]:
clean_lexical_tokens(tokenize(cornell_example))

['based',
 'on',
 'numbers',
 'entering',
 'freshman',
 'yield',
 'meaning',
 'were',
 'accepted',
 'out',
 'of',
 'that',
 'gives',
 'a',
 'acceptance',
 'rate',
 'if',
 'cornell',
 'were',
 'to',
 'enroll',
 'more',
 'students',
 'be',
 'additional',
 'students',
 'per',
 'year',
 'that',
 'works',
 'out',
 'to',
 'accepted',
 'students',
 'giving',
 'a',
 'theoretical',
 'acceptance',
 'rate',
 'of',
 'if',
 'cornell',
 'had',
 'implemented',
 'this',
 'change',
 'in',
 'keep',
 'in',
 'mind',
 'that',
 'this',
 'is',
 'not',
 'an',
 'accurate',
 'projection',
 'for',
 'because',
 'we',
 'get',
 'more',
 'applicants',
 'each',
 'year',
 'so',
 'acceptance',
 'rates',
 'will',
 'actually',
 'continue',
 'to',
 'fall',
 'regardless',
 'a',
 'rise',
 'in',
 'acceptance',
 'seem',
 'too',
 'bad',
 'as',
 'long',
 'as',
 'the',
 'faculty',
 'can',
 'handle',
 'the',
 'moderate',
 'increase',
 'in',
 'class',
 'sizes',
 'and',
 'the',
 'quality',
 'of',
 'education',
 'stays',
 'the',
 's
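
Note that the cleaned output above no longer contains "that'd" or "doesn't", because `str.isalpha()` is false for any token with an apostrophe. If contractions should count as words, one option is sketched below. The name `clean_lexical_tokens_keep_contractions` and the pattern `WORD_PATTERN` are assumptions for illustration, and unlike `isalpha()` the pattern only accepts ASCII letters.

```python
import re

# assumed pattern: ASCII letters, optionally followed by apostrophe + letters ("doesn't", "that'd")
WORD_PATTERN = re.compile(r"^[a-z]+(?:'[a-z]+)*$")


def clean_lexical_tokens_keep_contractions(tokens):
    """Keep alphabetic tokens and contractions; drop punctuation, numbers, and emojis."""
    cleaned = []
    for tok in tokens:
        tok = tok.lower().replace("\u2019", "'")  # normalize a curly apostrophe
        if WORD_PATTERN.match(tok):
            cleaned.append(tok)
    return cleaned


tokens = ["Regardless", ",", "a", "0.8", "%", "rise", "doesn't", "seem", "too", "bad", "."]
print(clean_lexical_tokens_keep_contractions(tokens))
# ['regardless', 'a', 'rise', "doesn't", 'seem', 'too', 'bad']
```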

In [19]:
def ttr(text):
    '''Function that returns the type-token ratio'''

    tokens = tokenize(text)
    tokens = clean_lexical_tokens(tokens)

    # error handling for when there are no tokens
    if len(tokens) == 0:
        return 0.0
    
    # recall that TTR is number of unique words / number of words
    num_types = len(set(tokens))
    num_tokens = len(tokens)
    ttr = num_types / num_tokens

    return ttr

In [12]:
example2 = "Nicholas is Nicholas"
print(ttr(example2))
print('The result should be approximately 0.67')

0.6666666666666666
The result should be approximately 0.67


In [20]:
def avg_word_length(text):
    '''Function that determines the average word length of a given text'''
    
    tokens = tokenize(text)
    tokens = clean_lexical_tokens(tokens)

    # error handling for when there are no tokens
    if len(tokens) == 0:
        return 0.0
    
    average_length = np.mean([len(word) for word in tokens])

    return average_length

In [21]:
example6 = "\nNicholas is great 👍 &*&()"
print(example6)
print(tokenize(example6))
print(clean_lexical_tokens(tokenize(example6)))

print(avg_word_length(example6))
print('The result should be 5')


Nicholas is great 👍 &*&()
['Nicholas', 'is', 'great', '👍', '&', '*', '&', '(', ')']
['nicholas', 'is', 'great']
5.0
The result should be 5


In [23]:
# build aoa_dict: word -> average age of acquisition
aoa_df = pd.read_csv("../Data/KupermanAoAData.csv")
aoa_dict = dict(zip(aoa_df["word"], aoa_df["rating_mean"]))

def aoa_score(text, aoa_dict):
    '''Returns the average age of acquisition score for a given text'''
    
    tokens = tokenize(text)
    tokens = clean_lexical_tokens(tokens)
    aoa_values = [aoa_dict[word] for word in tokens if word in aoa_dict]

    # if no token has an AoA rating, return NaN
    if len(aoa_values) == 0:
        return np.nan
    
    aoa_score = np.mean(aoa_values)

    return aoa_score

In [16]:
example3 = "because I am cool"
print(aoa_score(example3, aoa_dict))
print('This result should be lower than:')
example4 = "sophisticated technical jargon"
print(aoa_score(example4, aoa_dict))

4.51
This result should be lower than:
11.163333333333334
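
Because `aoa_score` silently skips tokens that have no rating, two posts with the same score can rest on very different numbers of rated words. A small coverage check, sketched here under assumptions, makes that visible. The function name `aoa_coverage` and the ratings in `toy_aoa` are hypothetical and are not taken from the Kuperman data.

```python
def aoa_coverage(tokens, aoa_dict):
    """Share of tokens that have an age-of-acquisition rating."""
    if not tokens:
        return float("nan")
    rated = [tok for tok in tokens if tok in aoa_dict]
    return len(rated) / len(tokens)


# hypothetical ratings, for illustration only
toy_aoa = {"because": 5.0, "cool": 4.0}
print(aoa_coverage(["because", "i", "am", "cool"], toy_aoa))  # 0.5
```

Reporting coverage next to the score would show when an average is based on only a handful of words.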


In [24]:
def zipf_score(text):
    '''Returns a frequency score (higher -> more frequent) based on the Zipf scale'''
    
    tokens = tokenize(text)
    tokens = clean_lexical_tokens(tokens)
    
    zipf_values = [zipf_frequency(word, 'en') for word in tokens]

    # if there are no words, return NaN
    if len(zipf_values) == 0:
        return np.nan

    zipf_score = np.mean(zipf_values)

    return zipf_score

In [18]:
print(zipf_score(example3))
print('This result should be higher than:')
print(zipf_score(example4))
print()

example5 = "word"
print(zipf_score(example5))
print('This result should be 5.26')

6.012500000000001
This result should be higher than:
4.013333333333334

5.26
This result should be 5.26
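
`zipf_score` reports the mean frequency, which is a proxy for the "tail heaviness" named in the metric list. A more direct reading of tail heaviness is the share of tokens that fall below some rarity cutoff. The sketch below is an assumption-laden illustration: the name `rare_word_share`, the cutoff of 3.0, and the values in `TOY_ZIPF` are all hypothetical. In the notebook, the lookup could instead wrap `zipf_frequency(word, 'en')`.

```python
def rare_word_share(tokens, freq_lookup, threshold=3.0):
    """Fraction of tokens whose Zipf frequency is below the threshold."""
    if not tokens:
        return float("nan")
    rare = sum(1 for tok in tokens if freq_lookup(tok) < threshold)
    return rare / len(tokens)


# hypothetical Zipf-scale values, for illustration only
TOY_ZIPF = {"the": 7.7, "cat": 5.0, "perambulated": 1.5}


def toy_lookup(word):
    # unknown words are treated as frequency 0.0 in this toy lookup
    return TOY_ZIPF.get(word, 0.0)


print(rare_word_share(["the", "cat", "perambulated", "the"], toy_lookup))  # 0.25
```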


## Computing Syntactic Quality Metrics With Python

Note that the spaCy English model must be installed before the following code cell can run:
<br>
**python3 -m spacy download en_core_web_sm**

In [25]:
from nltk import pos_tag
from nltk.corpus import treebank
from nltk.tree import Tree, TreePrettyPrinter
nltk.download('averaged_perceptron_tagger_eng', quiet=True)
nltk.download('treebank')

import spacy, benepar
nlp = spacy.load("en_core_web_sm") # pre-trained English model

[nltk_data] Downloading package treebank to
[nltk_data]     /Users/nickvick/nltk_data...
[nltk_data]   Package treebank is already up-to-date!


In [26]:
import stanza
stanza.download("en")
stanza_parser = stanza.Pipeline("en", processors="tokenize,pos,constituency")

  from .autonotebook import tqdm as notebook_tqdm
Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.11.0.json: 436kB [00:00, 186MB/s]                     
2026-01-27 09:47:07 INFO: Downloaded file to /Users/nickvick/stanza_resources/resources.json
2026-01-27 09:47:07 INFO: Downloading default packages for language: en (English) ...
2026-01-27 09:47:08 INFO: File exists: /Users/nickvick/stanza_resources/en/default.zip
2026-01-27 09:47:09 INFO: Finished downloading models and saved to /Users/nickvick/stanza_resources
2026-01-27 09:47:09 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES
Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.11.0.json: 436kB [00:00, 42.2MB/s]                    
2026-01-27 09:47:09 INFO: Downloaded file to /Users/nickvick/stanza_resource

In [27]:
def split_sentences(text):
    '''Helper function to split a given post into separate sentences'''

    sentence_tokens = sent_tokenize(text)

    return sentence_tokens

In [28]:
print(cornell_example)
print('-----------------------')

split_result = split_sentences(cornell_example)
print(split_result)
print()
for sent in split_result:
    print(sent)

Based on 2017 numbers:    

3375 entering freshman, 56.6% yield, meaning 5962 were accepted out of 47039. That gives a 12.7% acceptance rate.    

If Cornell were to enroll 900 more students, that'd be 225 additional students per year.  That works out to (3375+225)/0.566 = 6360 accepted students, giving a theoretical acceptance rate of 13.5% if Cornell had implemented this change in 2017.    

Keep in mind that this is not an accurate projection for 2021 because we get ~2000 more applicants each year, so acceptance rates will actually continue to fall.   

Regardless, a 0.8% rise in acceptance doesn't seem too bad.  As long as the faculty can handle the moderate increase in class sizes and the quality of education stays the same, I don't see a reason to reject more people than we have to.    

Source: http://irp.dpb.cornell.edu/tableau_visual/admissions
-----------------------
['Based on 2017 numbers:    \n\n3375 entering freshman, 56.6% yield, meaning 5962 were accepted out of 47039.'
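
The first item in the output above still contains the blank line between "Based on 2017 numbers:" and the following paragraph, so the header and the next sentence are treated as one sentence. One possible pre-processing step, sketched under assumptions, is to split the post on blank lines first and then pass each block to `split_sentences`. The name `split_paragraphs` is hypothetical.

```python
import re


def split_paragraphs(text):
    """Split on blank lines and collapse runs of whitespace inside each block."""
    blocks = re.split(r'\s*\n\s*\n\s*', text.strip())
    return [re.sub(r'\s+', ' ', block) for block in blocks if block]


post = "Based on 2017 numbers:    \n\n3375 entering freshman, 56.6% yield."
print(split_paragraphs(post))
# ['Based on 2017 numbers:', '3375 entering freshman, 56.6% yield.']
```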

In [156]:
# define relevant sets of tags and words
# note: VB (base form) and VBN (past participle) are not strictly finite in the
# Penn Treebank tagset, so this set is a permissive approximation
FINITE_VERB_TAGS = {"VB", "VBD", "VBN", "VBP", "VBZ"}
SUBJECT_TAGS = {"NN", "NNS", "NNP", "NNPS", "PRP"}
SUBORDINATING_CONJ_TAGS = {"IN"} # IN covers prepositions and subordinating conjunctions
COORDINATING_CONJ_TAGS = {"CC"} # tag for coordinating conjunction
SUBORDINATE_CLAUSE_MARKERS = {"that", "which", "who", "whom", "whose"}

PUNCT = '?!.,:;({[]})-–—"\''
CLOSING_PUNCT = '.!?…'
TRAILING_CLOSERS = set(['"', "'", ')', ']', '}', '”', '’'])

# normalize curly quotes and fancy punctuation
FANCY_TO_ASCII = {
                '“': '"', '”': '"',
                '‘': "'", '’': "'",
                '—': '-', '–': '-',
                '…': '...'
                }

In [153]:
def is_complete_sentence(sentence):
    '''Helper function to determine whether a sentence is complete. Recall that a complete sentence follows these rules:
    -contains at least one subject 
    -contains at least one finite verb
    -ends with appropriate punctuation (.?!) 
    -if it begins with a subordinator, has an independent clause after
    -does not end with a conjunction
    '''

    cleaned = sentence.strip() # removing trailing/leading whitespace
    # account for differences in straight vs. smart quotes
    for f, a in FANCY_TO_ASCII.items():
        cleaned = cleaned.replace(f, a)
    # remove leading/trailing quotes
    cleaned = cleaned.strip('\"')
    cleaned = cleaned.strip('\'')

    # empty string
    if not cleaned:
        return False
    
    # tokenize sentence and tag tokens
    tokens = tokenize(cleaned)
    tags = pos_tag(tokens)

    # ensure length is appropriate
    if len(tokens) < 2:
        return False

    # first letter should be capital
    j = 0
    while j < len(cleaned) and cleaned[j] in PUNCT:
        j += 1
    if j >= len(cleaned):
        return False
    if not cleaned[j].isalpha() or not cleaned[j].isupper():
        return False
        
    # last relevant char must end with proper punctuation
    i = len(cleaned) - 1
    while i > 0 and cleaned[i] in TRAILING_CLOSERS:
        i -= 1
    if i <= 0 or cleaned[i] not in CLOSING_PUNCT:
        return False
    
    # find the first word's tag
    first_word = None
    first_tag = None
    for word, tag in tags:
        if word.isalpha():
            first_word = word
            first_tag = tag
            break
    # if first word is subordinating conjunction (including "when"), need independent clause after
    if first_tag in SUBORDINATING_CONJ_TAGS or first_word == "When":
        if ',' in tokens: # independent clause will start after a comma
            comma_index = tokens.index(',')
            post_sub_tags = tags[comma_index+1:]
            # check if independent clause is a complete thought
            has_finite_verb_post_sub = any(tag in FINITE_VERB_TAGS for _, tag in post_sub_tags)
            has_subject_post_sub = any(tag in SUBJECT_TAGS for _, tag in post_sub_tags)
            if not (has_finite_verb_post_sub and has_subject_post_sub):
                return False
        # if no comma separating clauses
        else:
            noun_count = sum(1 for _, tag in tags if tag in SUBJECT_TAGS)
            verb_count = sum(1 for _, tag in tags if tag in FINITE_VERB_TAGS)
            # edge case for when first word is if
            if first_word == "If" and verb_count < 2:
                return False
            # check for two nouns, if not assume fragment
            if noun_count < 2:
                return False

    # find the last word's tag
    last_tag = None
    for word, tag in reversed(tags):
        if word.isalpha():
            last_tag = tag
            break
    # last word cannot be conjunction
    if last_tag in COORDINATING_CONJ_TAGS:
        return False

    # check if it has finite verb and subject
    has_finite_verb = any(tag in FINITE_VERB_TAGS for _, tag in tags)
    has_subject = any(tag in SUBJECT_TAGS for _, tag in tags)

    return has_finite_verb and has_subject


In [37]:
complete_tests = [
    "Where did he go?",
    "The quick brown fox jumps over the lazy dog.",
    "Although she was tired, she finished her homework.",
    "He asked if I was coming.",
    "\"I was there.\"",
    "'I was there.'",
    "“I don’t know what you mean,” she said.",
    "The sign read “No parking after 6 PM.”",
    "She said, “After the storm ended.”",
    "(After a long day,) he went straight to bed.",
    "Despite the rain, the game continued.",
    "Wait! Are you sure this is the right address?",
    "Yes, I understand the instructions.",
    "John, who had been waiting for hours, finally boarded the train.",
    "The committee approved the proposal—after much debate.",
    "In the end, everything turned out well.",
    "They arrived at 7 PM; the meeting began shortly after.",
    "“Stop!” the officer shouted.",
    "Because it was late, they decided to head home.",
    "If you want my opinion, that was the right choice.",
    "No one knew where the noise came from.",
    "In the end, he thought it was cool, but it was not.",
    "In the end it was cool.",
    "Because she was exhausted, she went to bed early.",
    "Although the results were surprising, the conclusion was clear.",
    "The main reason was clear: no one had prepared.",
    "One important issue remained: funding was insufficient.",
    "She was tired; she still finished the assignment.",
    "The weather was terrible; the game continued anyway."
]

incomplete_tests = [
    "In the end talk.",
    "Running down the street.",
    "Because she was tired.",
    "Although it was raining.",
    "If you need anything.",
    "Yes!",
    "Maybe.",
    "So we went.",
    "Or maybe not.",
    "\"After the meeting.\"",
    "“Before the storm.”",
    "When he arrived.",
    "If possible.",
    "Such as this example.",
    "To the store.",
    "Under the old bridge.",
    "The big red barn.",
    "(Before the show.)",
    "While waiting for the bus.",
    "Because she was exhausted,",
    "Although the results were surprising,",
    "The main reason:",
    "One important issue:",
    "Because she was tired;",
    "Although it seemed unlikely;",
    "Because she was tired, after working all night",
    "Although the results were promising, according to the report",
    "The main issue was: a lack of preparation",
    "One thing became clear: during the final review",
    "Because she was exhausted; after working all night",
    "Although it seemed reasonable; given the circumstances"
]
    
print("The following should result in true")
for sentence in complete_tests:
    result_true = is_complete_sentence(sentence)
    print(f"{result_true!s:5} | {sentence}")


print("\nThe following should result in false")
for sentence in incomplete_tests:
    result_false = is_complete_sentence(sentence)
    print(f"{result_false!s:5} | {sentence}")

The following should result in true
True  | Where did he go?
True  | The quick brown fox jumps over the lazy dog.
True  | Although she was tired, she finished her homework.
True  | He asked if I was coming.
True  | "I was there."
True  | 'I was there.'
True  | “I don’t know what you mean,” she said.
True  | The sign read “No parking after 6 PM.”
True  | She said, “After the storm ended.”
True  | (After a long day,) he went straight to bed.
True  | Despite the rain, the game continued.
True  | Wait! Are you sure this is the right address?
True  | Yes, I understand the instructions.
True  | John, who had been waiting for hours, finally boarded the train.
True  | The committee approved the proposal—after much debate.
True  | In the end, everything turned out well.
True  | They arrived at 7 PM; the meeting began shortly after.
True  | “Stop!” the officer shouted.
True  | Because it was late, they decided to head home.
True  | If you want my opinion, that was the right c

In [77]:
def fragment_ratio(text):
    '''Function to determine the ratio of fragments to sentences in a given text'''

    sentences = split_sentences(text)
    total = len(sentences)
    if total == 0:
        return None

    # add complete sentences to a list
    is_complete = []
    for sent in sentences:
        if is_complete_sentence(sent):
            is_complete.append(sent)

    num_fragment = total - len(is_complete)

    fragment_ratio = num_fragment/total

    return fragment_ratio

In [28]:
print(fragment_ratio(cornell_example))
fragment_example = "I just got back from hiking. Weather bad. We ate chicken. In order to drive. (This is an interesting example.) If you want. He said, \"Try some of this.\" Okay!"
print("The following should print 0.5")
print(fragment_ratio(fragment_example))

0.125
The following should print 0.5
0.5
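
The surface checks inside `is_complete_sentence` (a capitalized first letter and closing punctuation, ignoring surrounding quotes and brackets) can be looked at in isolation. The sketch below is a simplified, standalone version written for illustration: the names `has_sentence_shape`, `SHAPE_OPENERS`, `SHAPE_CLOSERS`, and `SHAPE_END_PUNCT` are assumptions, and it handles ASCII quotes only.

```python
# assumed character sets for this standalone sketch (ASCII only)
SHAPE_OPENERS = {'"', "'", '(', '[', '{'}
SHAPE_CLOSERS = {'"', "'", ')', ']', '}'}
SHAPE_END_PUNCT = '.!?'


def has_sentence_shape(sentence):
    """True if the text starts with a capital letter and ends with . ! or ?"""
    s = sentence.strip()
    # skip leading quotes and opening brackets
    start = 0
    while start < len(s) and s[start] in SHAPE_OPENERS:
        start += 1
    # skip trailing quotes and closing brackets
    end = len(s) - 1
    while end >= 0 and s[end] in SHAPE_CLOSERS:
        end -= 1
    if start >= len(s) or end < start:
        return False
    return s[start].isupper() and s[end] in SHAPE_END_PUNCT


print(has_sentence_shape("Weather bad."))       # True: shape alone cannot detect this fragment
print(has_sentence_shape("The main reason:"))   # False: no closing punctuation
```

As the first call shows, passing the shape test is necessary but not sufficient, which is why the full function also inspects part-of-speech tags.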


In [234]:
def create_nltk_tree(sentence):
    '''Helper function to parse a sentence (or fragment) into an NLTK constituency tree'''

    # completeness check disabled (fragments are parsed too):
    # if not is_complete_sentence(sentence):
    #     raise ValueError("Sentence is not complete")

    doc = stanza_parser(sentence)
    stanza_tree = doc.sentences[0].constituency
    nltk_tree = Tree.fromstring(str(stanza_tree))
    
    return nltk_tree

In [60]:
example7 = "In my honest opinion."
tree = create_nltk_tree(example7)
print(tree)
print(TreePrettyPrinter(tree).text())

example8 = "In my honest opinion, I do not have time for this play."
tree = create_nltk_tree(example8)
print(tree)
print(TreePrettyPrinter(tree).text())

(ROOT
  (FRAG (PP (IN In) (NP (PRP$ my) (JJ honest) (NN opinion))) (. .)))
              ROOT             
               |                
              FRAG             
           ____|_____________   
          PP                 | 
  ________|____              |  
 |             NP            | 
 |    _________|_______      |  
 IN PRP$       JJ      NN    . 
 |   |         |       |     |  
 In  my      honest opinion  . 

(ROOT
  (S
    (PP (IN In) (NP (PRP$ my) (JJ honest) (NN opinion)))
    (, ,)
    (NP (PRP I))
    (VP
      (VBP do)
      (RB not)
      (VP
        (VB have)
        (NP (NP (NN time)) (PP (IN for) (NP (DT this) (NN play))))))
    (. .)))
                                    ROOT                                        
                                     |                                           
                                     S                                          
           __________________________|________________________________________   

In [76]:
def avg_tree_depth(text):
    '''Function to compute the average depth of the parse tree representing each sentence.
    Note that .height() accounts for the leaf level, which is not a true extra layer.'''

    sentences = split_sentences(text)

    # for each sentence/fragment, compute the tree height/depth
    depths = []
    for sent in sentences:
        tree = create_nltk_tree(sent)
        depths.append(tree.height()-1)

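    # np.mean of an empty list is NaN with a warning, so handle it explicitly
    if not depths:
        return np.nan

    avg_tree_depth = np.mean(depths)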
    
    return avg_tree_depth

In [62]:
print("The following should print 6")
print(avg_tree_depth(example7))
print("The following should print 9")
print(avg_tree_depth(example8))
print()

example9 = example7 + " " + example8
print(example9)
print(split_sentences(example9))
print("The following should print 7.5")
print(avg_tree_depth(example9))

The following should print 6
6.0
The following should print 9
9.0

In my honest opinion. In my honest opinion, I do not have time for this play.
['In my honest opinion.', 'In my honest opinion, I do not have time for this play.']
The following should print 7.5
7.5


In [63]:
example10 = "Risa bought a blue watch."
Risa = create_nltk_tree(example10)
print(TreePrettyPrinter(Risa).text())

example11 = "He went to the market yesterday."
market = create_nltk_tree(example11)
print(TreePrettyPrinter(market).text())

            ROOT                   
             |                      
             S                     
  ___________|___________________   
 |                VP             | 
 |      __________|___           |  
 NP    |              NP         | 
 |     |      ________|_____     |  
NNP   VBD    DT       JJ    NN   . 
 |     |     |        |     |    |  
Risa bought  a       blue watch  . 

             ROOT                         
              |                            
              S                           
  ____________|_________________________   
 |            VP                        | 
 |    ________|__________________       |  
 |   |        PP                 |      | 
 |   |     ___|____              |      |  
 NP  |    |        NP            NP     | 
 |   |    |    ____|____         |      |  
PRP VBD   IN  DT        NN       NN     . 
 |   |    |   |         |        |      |  
 He went  to the      market yesterday  . 
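
The depth that `tree.height() - 1` measures is the number of nested labelled levels above a word, which can also be read directly off a bracketed parse string by tracking parenthesis nesting. The sketch below illustrates this without any parser. The function name `bracket_depth` is hypothetical, the bracketed string is transcribed by hand from the first tree drawn above, and the approach assumes that literal parentheses in the text are escaped in the parse string.

```python
def bracket_depth(parse):
    """Maximum nesting of parentheses in a bracketed parse string."""
    depth = 0
    deepest = 0
    for ch in parse:
        if ch == '(':
            depth += 1
            deepest = max(deepest, depth)
        elif ch == ')':
            depth -= 1
    return deepest


# hand-transcribed from the tree drawn for "Risa bought a blue watch."
risa = "(ROOT (S (NP (NNP Risa)) (VP (VBD bought) (NP (DT a) (JJ blue) (NN watch))) (. .)))"
print(bracket_depth(risa))  # 5: ROOT > S > VP > NP > NN
```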



In [None]:
SUBORDINATE_RELS = {"advcl", "ccomp", "xcomp", "acl", "relcl"}
CLAUSE_RELS = {"advcl", "ccomp", "xcomp", "acl", "relcl", "csubj", "csubj:pass"}

In [None]:
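def count_subordinate_clauses(text):
    '''Helper function to compute the number of subordinate clauses present among the complete sentences of a given text.'''

    # extract complete sentences only
    candidate_sentences = split_sentences(text)
    complete_sentences = [sent for sent in candidate_sentences
                          if is_complete_sentence(sent)]

    # assumption: a subordinate clause is identified by a token whose spaCy
    # dependency label is in SUBORDINATE_RELS
    count = 0
    for sent in complete_sentences:
        count += sum(1 for token in nlp(sent) if token.dep_ in SUBORDINATE_RELS)

    return count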

In [231]:
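def count_t_units(sentence):
    '''Heuristic, POS-tag-based count of the T-units (a main clause plus any
    subordinate clauses attached to it) in a sentence. Fragments count as zero.'''
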
    # filter out fragments
    if not is_complete_sentence(sentence):
        return 0
    
    # there must be at least one independent clause if sentence is complete
    t_units = 1

    # tokenize and tag sentence
    tokens = tokenize(sentence)
    tags = pos_tag(tokens)

    # initialize booleans
    in_subordinate = False

    for i, (word, tag) in enumerate(tags):
        print(f"current word:{word}")

        # enter subordinate clause
        if tag in SUBORDINATING_CONJ_TAGS and word not in {"so"}:
            print("enter subordinate")
            in_subordinate = True
            continue
        if word in SUBORDINATE_CLAUSE_MARKERS:
            print("enter subordinate")
            in_subordinate = True
            continue

        # semicolons always separate independent clauses
        if word == ";":
            for _, next_tag in tags[i+1:]:
                if next_tag in FINITE_VERB_TAGS:
                    t_units += 1
                    break
            continue
        
        # exit subordinate clause at punctuation
        if word in PUNCT:
            print("exit at punct")
            in_subordinate = False
            continue
            

        # coordinating conjunctions often introduce new t-units
        if not in_subordinate:
            if tag in COORDINATING_CONJ_TAGS or word == "so":
                print("coordinating conj and not subordinate")

                # initialize booleans
                found_subject = False

                # look ahead for a subject and finite verb
                for next_word, next_tag in tags[i+1:]:

                    # clause boundary reached so stop searching
                    if next_word in {",", ";"}:
                        print("clause boundary reached")
                        break
                    
                    # search for subject
                    if not found_subject:
                        if next_tag in SUBJECT_TAGS:
                            print("found subject")
                            found_subject = True
                        continue

                    # check if followed by finite verb
                    if next_tag in FINITE_VERB_TAGS:
                        print("found full t-unit")
                        t_units += 1
                        break
        
        # commas can sometimes introduce independent clauses
        if word == "," and not in_subordinate:
            print("word is comma")

            # initialize booleans
            found_subject = False

            for next_word, next_tag in tags[i+1:]:
                # clause boundary reached so stop searching
                if next_word in {",", ";"}:
                    break
                
                # search for subject
                if not found_subject:
                    if next_tag in SUBJECT_TAGS:
                        found_subject = True
                    continue

                # check if followed by finite verb
                if next_tag in FINITE_VERB_TAGS:
                    t_units += 1
                    break

    return t_units

In [76]:
T_UNIT_TESTS = [
    # single T-unit (simple / subordinate only)
    ("I left.", 1),
    ("I left because it was late.", 1),
    ("When it was late, I left.", 1),
    ("I left when it was late.", 1),
    ("The man who lives here is nice.", 1),
    ("I think that he knows.", 1),

    # two T-units (coordinated independent clauses)
    ("I left, and I went home.", 2),
    ("I left, but I went home later.", 2),
    ("I left and I went home.", 2),
    ("I studied, so I passed.", 2),

    # coordination + subordination
    ("I left, and I went home because it was late.", 2),
    ("When it was late, I left, and I went home.", 2),
    ("I left because it was late, and I went home.", 2),

    # three T-units
    ("I came, I saw, and I conquered.", 3),
    ("I left, and I went home, but I forgot my keys.", 3),

    # relative clauses should NOT increase T-units
    ("The man who lives here and works nearby is nice.", 1),
    ("I saw the book that you mentioned, and I bought it.", 2),

    # semicolon-separated independent clauses
    ("I left; I went home.", 2),
    ("I left; I went home; I slept.", 3),

    # edge cases
    ("I left and went home.", 1),  # shared subject
    ("I left, and then I went home.", 2),
]

In [318]:
from nltk.tree import ParentedTree

In [359]:
def t_unit_counter(sentence):
    '''Count the T-units in a sentence from its constituency parse. Fragments count as zero.'''

    t_unit_count = 0

    # if a fragment, there are no t-units
    if not is_complete_sentence(sentence):
        return 0
    
    # create a constituency tree with parent pointers
    tree = create_nltk_tree(sentence)
    ptree = ParentedTree.convert(tree)

    # iterate through all subtrees
    for subtree in ptree.subtrees():
        # look for clause (S) nodes
        if subtree.label() == "S":
            # a clause directly under SBAR is subordinate, so ignore it
            if subtree.parent().label() == "SBAR":
                continue
            # otherwise increment
            else:
                t_unit_count += 1
    
    # if more than one clause was counted, discount the top-level S
    # that wraps the coordinated clauses
    if t_unit_count > 1:
        t_unit_count -= 1

    return t_unit_count

In [358]:
t_unit_counter("I left, and I went home because it was late.")


2

In [313]:
def test_t_unit_counter():
    failures = []

    for sent, expected in T_UNIT_TESTS:
        actual = t_unit_counter(sent)
        if actual != expected:
            failures.append((sent, expected, actual))

    if not failures:
        print("All T-unit tests passed.")
    else:
        print("Failures detected:\n")
        for sent, exp, act in failures:
            print(f"Sentence: {sent}")
            print(f"Expected: {exp}, Got: {act}\n")

In [360]:
test_t_unit_counter()

All T-unit tests passed.
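
The counting rule used by `t_unit_counter` (count every `S` node that is not directly under `SBAR`, then discount the wrapping top-level `S` when more than one was found) can be traced on bracketed strings without running a parser. The sketch below is for illustration only: the function names are hypothetical, the two bracketed trees are written by hand rather than produced by Stanza, and the completeness check is omitted.

```python
def parse_brackets(text):
    """Parse a bracketed tree into nested (label, children) tuples; leaves are strings."""
    tokens = text.replace('(', ' ( ').replace(')', ' ) ').split()
    pos = 0

    def read_node():
        nonlocal pos
        pos += 1                      # consume '('
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ')':
            if tokens[pos] == '(':
                children.append(read_node())
            else:
                children.append(tokens[pos])  # leaf word
                pos += 1
        pos += 1                      # consume ')'
        return (label, children)

    return read_node()


def count_clause_nodes(node, parent_label=None):
    """Count S nodes whose direct parent is not SBAR."""
    if isinstance(node, str):
        return 0
    label, children = node
    own = 1 if label == "S" and parent_label != "SBAR" else 0
    return own + sum(count_clause_nodes(child, label) for child in children)


def t_units_from_brackets(text):
    """Apply the same discount as t_unit_counter when several S nodes are found."""
    count = count_clause_nodes(parse_brackets(text))
    return count - 1 if count > 1 else count


# hand-written trees (assumed shapes, not actual parser output)
coordinated = ("(ROOT (S (S (NP (PRP I)) (VP (VBD left))) (, ,) (CC and) "
               "(S (NP (PRP I)) (VP (VBD went) (NP (NN home)))) (. .)))")
subordinate = ("(ROOT (S (NP (PRP I)) (VP (VBD left) (SBAR (IN because) "
               "(S (NP (PRP it)) (VP (VBD was) (ADJP (JJ late)))))) (. .)))")

print(t_units_from_brackets(coordinated))  # 2
print(t_units_from_brackets(subordinate))  # 1
```

In the coordinated tree three `S` nodes are counted and one is discounted, while in the subordinate tree the inner `S` sits under `SBAR` and is skipped.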
