# Notebook Structure

1. Import necessary dependencies
2. Define the Utility function to analyze text
3. Create the text corpus and execute the utility function


# 1. Import necessary dependencies

In [1]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from collections import Counter
import re
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

# 2. Define the Utility function to analyze text

### A. TTR Definition

TTR (Type-Token Ratio) is a measure of lexical diversity in a text. Here's what the terms mean:

* Token: Every single word in the text. If a word appears multiple times, each appearance is counted as a separate token.
* Type: The unique words in the text. Repeated words are counted only once.

TTR is calculated by dividing the number of unique words (types) by the total number of words (tokens):

TTR = (Number of Types) / (Number of Tokens)
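
As a quick worked example of the formula, the sketch below computes TTR for a two-sentence string. It uses a hypothetical `simple_tokens` helper (a lowercase regex split, an assumption made to keep the example dependency-free) rather than the `word_tokenize` call used in this notebook, so punctuation is ignored and case is folded.

```python
import re

def simple_tokens(text):
    # Hypothetical stand-in for word_tokenize: lowercase alphabetic words only.
    return re.findall(r"[a-z]+", text.lower())

tokens = simple_tokens("The cat sat on the mat. The dog sat too.")
types = set(tokens)
ttr = len(types) / len(tokens)

print(len(tokens))  # 10 tokens
print(len(types))   # 7 types: the, cat, sat, on, mat, dog, too
print(ttr)          # 0.7
```

Seven distinct words over ten tokens gives a TTR of 0.7. A tokenizer that keeps punctuation or case, as `word_tokenize` does, would give a different value for the same string.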

How TTR is Useful

TTR helps us understand the variety of words used in a text. A higher TTR indicates that a text uses a greater variety of words, suggesting richer lexical diversity. Here's how it's useful:

* Comparing Texts: TTR can be used to compare the lexical diversity of different texts. For example, you could compare the diversity of vocabulary in different essays by the same author, or compare the vocabulary in different genres of writing.
* Developmental Analysis: In language acquisition research, TTR can be used to study how a person's vocabulary develops over time. A child's TTR is expected to increase as they learn more words.
* Text Complexity: TTR can provide insights into the complexity of a text. A text with a high TTR might be considered more complex because it uses a wider range of vocabulary.
* Authorship Attribution: TTR can be one of the metrics used in authorship attribution studies, where researchers try to identify the author of a text based on their writing style.
* Simplification Assessment: TTR can be used to assess the effect of text simplification.  A simplified text might have a lower TTR than the original text.

Limitations of TTR

TTR is sensitive to text length. As a text gets longer, the TTR tends to decrease because there's a higher chance of words being repeated. A very long text will likely have a lower TTR than a very short text, even if both texts have the same level of diversity. Because of this, TTR is most useful when comparing texts of similar length.
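
The length effect can be seen with a small sketch. It assumes a plain whitespace split instead of `word_tokenize`, and the `ttr` helper and the sample sentence are made up for illustration: repeating a passage four times leaves its vocabulary unchanged but quadruples the token count.

```python
def ttr(tokens):
    # Type-token ratio of a list of tokens; 0.0 for an empty list.
    return len(set(tokens)) / len(tokens) if tokens else 0.0

base = "the sun rose over the quiet hills and the river ran on".split()

short_ttr = ttr(base)      # 10 types / 12 tokens
long_ttr = ttr(base * 4)   # same 10 types / 48 tokens

print(round(short_ttr, 3))  # 0.833
print(round(long_ttr, 3))   # 0.208
```

The vocabulary is identical in both cases, yet the longer text scores far lower, which is why raw TTR values are only comparable across texts of similar length.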

In [2]:
def calculate_ttr(text):
    """
    Calculates the Type-Token Ratio (TTR) of a given text.

    Args:
        text (str): The input text.

    Returns:
        float: The Type-Token Ratio (TTR). Returns 0 if no tokens are found.
    """
    if not text:
        return 0.0  # Handle empty text case
    tokens = word_tokenize(text)  # tokens include punctuation and are case-sensitive
    types = set(tokens)  # distinct tokens
    return len(types) / len(tokens) if tokens else 0.0

### B. MATTR Definition

MATTR: Moving Average Type-Token Ratio

MATTR stands for Moving Average Type-Token Ratio. It's a measure of lexical diversity that addresses the limitations of the basic Type-Token Ratio (TTR).  Remember that TTR is affected by text length, generally decreasing as the text gets longer.  MATTR aims to provide a more stable measure of lexical diversity, regardless of text length.

Here's how MATTR is calculated:

1. Choose a Window Size: A common window size is 50 words, but this can be adjusted.
2. Slide the Window: A "window" of the chosen size is moved across the text, one word at a time.
3. Calculate TTR for Each Window: For each position of the window, the TTR is calculated within that window.
4. Average the TTRs: The average of all the TTR values calculated in step 3 is the MATTR score.
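
The steps above can be traced by hand on a tiny input. The sketch below is illustrative only: it uses a window of 3 tokens so that every window can be checked manually, and `window_ttrs` is a hypothetical helper, not a function defined in this notebook.

```python
def window_ttrs(tokens, window_size):
    # TTR of every run of `window_size` consecutive tokens.
    return [
        len(set(tokens[i:i + window_size])) / window_size
        for i in range(len(tokens) - window_size + 1)
    ]

tokens = ["a", "b", "a", "c", "c", "d"]
ttrs = window_ttrs(tokens, window_size=3)

# Windows: [a, b, a], [b, a, c], [a, c, c], [c, c, d]
# Types:       2          3          2          2
mattr = sum(ttrs) / len(ttrs)

print(len(ttrs))        # 4 windows
print(round(mattr, 3))  # 0.75
```

Six tokens with a window of 3 yield four windows, and the mean of their TTRs (2/3, 1, 2/3, 2/3) is 0.75.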

How MATTR is Useful

MATTR provides a more reliable measure of lexical diversity than TTR, especially when comparing texts of different lengths. Here's why it's useful:

* Comparison of Texts of Different Lengths: MATTR allows for more meaningful comparisons of lexical diversity across texts of varying lengths.  Because it averages TTR over smaller, fixed-size windows, it's less sensitive to the overall length of the text.
* More Stable Measure: MATTR is considered a more stable measure of lexical diversity.  It reduces the impact of text length on the measurement.
* Language Development Research: In studies of language acquisition, MATTR can be used to track changes in lexical diversity in a child's speech or writing over time, even if the length of their utterances or writing samples varies.
* Authorship Attribution: Like TTR, MATTR can be used as a feature in authorship attribution studies.
* Text Complexity Analysis: MATTR can provide a more accurate reflection of the lexical richness of a text, contributing to a better understanding of its complexity.
* Assessment of Writing Quality: MATTR can be used to assess the lexical variety in someone's writing.

In summary, MATTR is a valuable tool for analyzing lexical diversity because it provides a more consistent and less length-sensitive measure than the traditional Type-Token Ratio.

In [3]:
def calculate_mattr(text, window_size=50):
    """
    Calculates the Moving Average Type-Token Ratio (MATTR) of a given text.

    Args:
        text (str): The input text.
        window_size (int, optional): The size of the moving window. Defaults to 50.

    Returns:
        float: The Moving Average Type-Token Ratio (MATTR). Returns 0 if text
               is shorter than the window size.
    """
    if not text:
        return 0.0
    tokens = word_tokenize(text)
    if len(tokens) < window_size:
        return 0.0  # Handle text shorter than window size
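    ttr_values = []
    for i in range(len(tokens) - window_size + 1):
        window = tokens[i:i + window_size]
        # Compute the TTR on the tokens directly; re-joining and re-tokenizing
        # the window could split or merge tokens and change its size.
        ttr_values.append(len(set(window)) / window_size)
    return sum(ttr_values) / len(ttr_values) if ttr_values else 0.0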

### C. Flesch-Kincaid Reading Ease Definition


The Flesch Reading Ease test (called Flesch-Kincaid Reading Ease in this notebook, after the family of Flesch-Kincaid readability tests it belongs to) estimates how easy a piece of text is to understand. The formula is based on two factors:

* Average sentence length: Longer sentences are generally harder to read than shorter ones.
* Average number of syllables per word: Words with more syllables are usually more complex.

Scores typically fall between 0 and 100, and higher scores indicate that the text is easier to read. The formula is not bounded, so very dense text can score below 0.

How it Works

The Flesch-Kincaid formula is as follows:

Reading Ease = 206.835 - 1.015 * (average sentence length) - 84.6 * (average syllables per word)
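
To make the arithmetic concrete, here is a minimal sketch that plugs two made-up sets of averages into the formula. The `reading_ease` helper and the input values are illustrative assumptions, not measurements from any text in this notebook.

```python
def reading_ease(avg_sentence_length, avg_syllables_per_word):
    # Direct transcription of the formula above.
    return 206.835 - 1.015 * avg_sentence_length - 84.6 * avg_syllables_per_word

# Short sentences, short words: 206.835 - 10.15 - 101.52
easy = reading_ease(10, 1.2)

# Long sentences, long words: 206.835 - 30.45 - 169.2
hard = reading_ease(30, 2.0)

print(round(easy, 3))  # 95.165
print(round(hard, 3))  # 7.185
```

The syllable term dominates: moving from 1.2 to 2.0 syllables per word costs about 68 points, while tripling the sentence length costs about 20.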

How to Interpret the Score

Here's a general guideline for interpreting Flesch-Kincaid Reading Ease scores:

* 90-100: Very easy to read. Suitable for elementary school students.
* 80-89: Easy to read. Conversational English for consumers.
* 70-79: Fairly easy to read.
* 60-69: Standard reading difficulty.
* 50-59: Fairly difficult to read.
* 30-49: Difficult to read. Suitable for college students.
* Below 30: Very difficult to read. Best understood by university graduates.
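
The guideline above can be turned into a small lookup. This is a sketch under one assumption: each band is treated as a lower bound on a continuous scale (so 89.5 falls in the 80-89 band), and `reading_ease_band` is a hypothetical helper name.

```python
def reading_ease_band(score):
    # (lower bound, label) pairs, checked from the easiest band downward.
    bands = [
        (90, "Very easy to read"),
        (80, "Easy to read"),
        (70, "Fairly easy to read"),
        (60, "Standard reading difficulty"),
        (50, "Fairly difficult to read"),
        (30, "Difficult to read"),
    ]
    for lower_bound, label in bands:
        if score >= lower_bound:
            return label
    # Anything below 30, including negative scores.
    return "Very difficult to read"

print(reading_ease_band(85.0))   # Easy to read
print(reading_ease_band(42.5))   # Difficult to read
print(reading_ease_band(-5.0))   # Very difficult to read
```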

How is it Useful?

The Flesch-Kincaid Reading Ease test is widely used in various fields to ensure that written materials are accessible to the intended audience. Here are some key applications:

* Education: Teachers use it to select reading materials that are appropriate for their students' reading levels.
* Business: Companies use it to write clear and concise documents, such as user manuals, reports, and marketing materials.
* Government: Agencies use it to create documents that are easy for the public to understand.
* Journalism: Journalists use it to write articles that are accessible to a broad audience.
* Technical Writing: Technical writers use it to create documentation that is easy for users to follow.
* Healthcare: Medical professionals use it to create patient education materials that are easy to understand.

In [4]:
def calculate_flesch_kincaid_reading_ease(text):
    """
    Calculates the Flesch-Kincaid Reading Ease score of a given text.
    Higher scores indicate easier readability.

    Args:
        text (str): The input text.

    Returns:
        float: The Flesch-Kincaid Reading Ease score. Returns 0 if there are
                no sentences or words.
    """
    if not text:
        return 0.0

    sentences = nltk.sent_tokenize(text)
    words = word_tokenize(text)  # note: punctuation tokens are counted as words
    num_sentences = len(sentences)
    num_words = len(words)

    if num_sentences == 0 or num_words == 0:
        return 0.0

    num_syllables = 0
    for word in words:
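        # Basic syllable-counting heuristic: count each run of consecutive
        # vowels as one syllable, drop a trailing 'e', and assign at least
        # one syllable to every token.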
        vowels = 'aeiouy'
        word = word.lower()
        count = 0
        if word[0] in vowels:
            count += 1
        for i in range(1, len(word)):
            if word[i] in vowels and word[i-1] not in vowels:
                count += 1
        if word.endswith('e'):
            count -= 1
        if count == 0:
            count = 1
        num_syllables += count

    average_sentence_length = num_words / num_sentences
    average_syllables_per_word = num_syllables / num_words

    # Formula for Flesch-Kincaid Reading Ease
    score = 206.835 - 1.015 * average_sentence_length - 84.6 * average_syllables_per_word
    return score
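
The syllable heuristic inside `calculate_flesch_kincaid_reading_ease` is easier to judge when tried on individual tokens. The sketch below pulls the loop body into a hypothetical `count_syllables` helper (the name is an assumption; the logic is copied from the function above).

```python
def count_syllables(word):
    # Same heuristic as in calculate_flesch_kincaid_reading_ease:
    # one syllable per run of vowels, minus one for a trailing 'e',
    # and never fewer than one.
    vowels = "aeiouy"
    word = word.lower()
    count = 0
    if word[0] in vowels:
        count += 1
    for i in range(1, len(word)):
        if word[i] in vowels and word[i - 1] not in vowels:
            count += 1
    if word.endswith("e"):
        count -= 1
    if count == 0:
        count = 1
    return count

for token in ["reading", "philosophical", "simple", "."]:
    print(token, count_syllables(token))
```

Under this heuristic `reading` gets 2 syllables and `philosophical` gets 5, but `simple` gets 1 because the trailing `e` is dropped even though it is voiced, and a punctuation token such as `.` is counted as one syllable. The resulting scores are therefore approximate.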

### D. Dale-Chall Readability Definition

The Dale-Chall Readability Formula is another popular readability test used to assess the difficulty of a piece of text.  Unlike some other tests, it focuses on the use of familiar words. It compares the words in a text to a list of approximately 3,000 commonly used English words.

How it Works

The Dale-Chall formula calculates readability based on two factors:

* Percentage of unfamiliar words: The percentage of words in the text that are not on the Dale-Chall list of familiar words.
* Average sentence length: Similar to other readability tests, this formula considers that shorter sentences are generally easier to understand.

The formula is as follows:

Raw Score = 0.1579 * (Percentage of unfamiliar words) + 0.0496 * (Average sentence length)

If the percentage of unfamiliar words is 5% or less, the raw score is the final score. If it is greater than 5%, a constant is added:

Adjusted Score = Raw Score + 3.6365
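
A minimal numeric sketch of the score, mirroring the logic of `calculate_dale_chall_readability_score` in this notebook: the constant 3.6365 is added only when more than 5% of the words are unfamiliar. The `dale_chall_score` helper and the input values are illustrative assumptions.

```python
def dale_chall_score(pct_unfamiliar, avg_sentence_length):
    # Raw score from the two factors.
    raw = 0.1579 * pct_unfamiliar + 0.0496 * avg_sentence_length
    # The adjustment applies only above 5% unfamiliar words.
    if pct_unfamiliar > 5:
        return raw + 3.6365
    return raw

# 4% unfamiliar words, 12 words per sentence: 0.6316 + 0.5952
low = dale_chall_score(4.0, 12.0)

# 20% unfamiliar words, 18 words per sentence: 3.158 + 0.8928 + 3.6365
high = dale_chall_score(20.0, 18.0)

print(round(low, 4))   # 1.2268
print(round(high, 4))  # 7.6873
```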

How to Interpret the Score

The Dale-Chall score maps to U.S. grade levels as follows:

* 10.0 and above: College graduate
* 9.0-9.9: College (13th to 15th grade)
* 8.0-8.9: 11th to 12th grade
* 7.0-7.9: 9th to 10th grade
* 6.0-6.9: 7th to 8th grade
* 5.0-5.9: 5th to 6th grade
* 4.9 and below: 4th grade and below

How is it Useful?

The Dale-Chall Readability Formula is useful in many situations where it's important to ensure that text is easy to understand:

* Education: Teachers use it to select appropriate reading materials for students at different grade levels.
* Publishing: Editors use it to assess the readability of books and articles.
* Business: Companies use it to create clear and concise documents for customers and employees.
* Government: Government agencies use it to write documents that are accessible to the general public.
* Healthcare: Healthcare providers use it to create patient education materials that are easy to understand.
* Legal: Legal professionals can use it to assess the readability of contracts and other legal documents.


In [5]:
def calculate_dale_chall_readability_score(text):
    """
    Calculates the Dale-Chall Readability Score of a given text.
    Lower scores indicate easier readability.

    Args:
        text (str): The input text.

    Returns:
        float: The Dale-Chall Readability Score. Returns 0 if no words found.
    """
    if not text:
        return 0.0

    sentences = nltk.sent_tokenize(text)
    words = word_tokenize(text)
    num_sentences = len(sentences)
    num_words = len(words)

    if num_words == 0:
        return 0.0

    # Dale-Chall list (simplified for demonstration)
    dale_chall_easy_words = set([
        'a', 'about', 'above', 'after', 'again', 'all', 'always', 'and', 'an',
        'another', 'any', 'are', 'as', 'at', 'be', 'because', 'been', 'before',
        'below', 'between', 'big', 'but', 'by', 'call', 'can', 'come', 'day',
        'do', 'down', 'eat', 'find', 'first', 'for', 'from', 'get', 'give', 'go',
        'good', 'have', 'he', 'her', 'here', 'him', 'his', 'how', 'i', 'if', 'in',
        'into', 'is', 'it', 'its', 'just', 'know', 'like', 'little', 'look', 'make',
        'many', 'me', 'more', 'my', 'no', 'not', 'now', 'of', 'on', 'one', 'or',
        'other', 'out', 'see', 'she', 'so', 'some', 'that', 'the', 'their', 'them',
        'then', 'there', 'they', 'this', 'to', 'up', 'was', 'we', 'what', 'when',
        'where', 'who', 'will', 'with', 'you', 'your'
    ])  # A very simplified list.  The actual Dale-Chall list has thousands.

    difficult_word_count = 0
    for word in words:
        word = word.lower()
        if word not in dale_chall_easy_words:
            difficult_word_count += 1

    percentage_difficult_words = (difficult_word_count / num_words) * 100
    average_sentence_length = num_words / num_sentences

    # Dale-Chall formula
    score = 0.1579 * percentage_difficult_words + 0.0496 * average_sentence_length
    if percentage_difficult_words > 5:  # adjustment for more than 5% unfamiliar words
        score += 3.6365
    return score

In [6]:
def analyze_text_lexical_diversity_and_complexity(text):
    """
    Performs a univariate analysis of text lexical diversity and complexity.

    Args:
        text (str): The input text.

    Returns:
        dict: A dictionary containing the calculated metrics:
            - TTR: Type-Token Ratio
            - MATTR: Moving Average Type-Token Ratio
            - Flesch-Kincaid Reading Ease
            - Dale-Chall Readability Score
    """
    analysis = {}
    analysis['TTR'] = calculate_ttr(text)
    analysis['MATTR'] = calculate_mattr(text)
    analysis['Flesch-Kincaid Reading Ease'] = calculate_flesch_kincaid_reading_ease(text)
    analysis['Dale-Chall Readability Score'] = calculate_dale_chall_readability_score(text)
    return analysis

# 3. Create the text corpus and execute the utility function

In [7]:
# Example usage with a sample text
sample_text = """
The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog.
The quick brown fox jumps over the lazy dog. Is this sentence complex?
This is a simple sentence.
A very long sentence with many, many words, and lots of commas, that just keeps going and going, and going.
"""

In [9]:
analysis_results = analyze_text_lexical_diversity_and_complexity(sample_text)

In [10]:
analysis_results

{'TTR': 0.5151515151515151,
 'MATTR': 0.6235294117647058,
 'Flesch-Kincaid Reading Ease': 93.12454545454547,
 'Dale-Chall Readability Score': 14.94800909090909}

### Interpretation of analysis_results

TTR (Type-Token Ratio): 0.515

* Interpretation: A TTR of 0.515 means that the text has 34 distinct tokens (types) out of 66 tokens in total. The tokenizer counts punctuation marks as tokens and treats "The" and "the" as different types.
* In simpler terms: The text has a moderate level of lexical diversity. A higher TTR would suggest more variety in word choice, while a lower TTR would suggest more repetition.

MATTR (Moving Average Type-Token Ratio): 0.624

* Interpretation: An MATTR of 0.624 indicates that, on average, within a set window of words, 62.4% of the words are unique.
* In simpler terms: This metric reduces the effect of text length on TTR. It is higher than the raw TTR because each 50-token window is shorter than the full 66-token text, so fewer words repeat within a window.

Flesch-Kincaid Reading Ease: 93.12

* Interpretation: A score of 93.12 suggests that the text is very easy to read.
* In simpler terms: The text is likely understandable by a wide audience, including those with an elementary school education level. Higher scores indicate easier readability.

Dale-Chall Readability Score: 14.95

* Interpretation: Taken at face value, a Dale-Chall score of 14.95 suggests the text is quite difficult to read.
* In simpler terms: This score would correspond to text for college graduates, but here it is an artifact of the simplified word list: 45 of the 66 tokens (68.2%) are flagged as unfamiliar, including punctuation marks and everyday words such as "quick", "fox", and "dog". With the full Dale-Chall list the score would likely be much lower.

In [11]:
sample_text = """
The sun dipped below the horizon, painting the sky with hues of orange, purple, and gold.
A gentle breeze rustled through the leaves of the ancient oak tree, carrying the sweet scent of wildflowers.
In the distance, the faint sound of a river flowed, a soothing melody to accompany the tranquil evening.
As twilight descended, the first stars began to twinkle, their light reflecting in the still waters of the nearby lake.
A lone owl hooted softly, its call echoing through the peaceful forest.  The world seemed to pause, suspended in a moment of quiet beauty."
"""

In [12]:
analysis_results = analyze_text_lexical_diversity_and_complexity(sample_text)
analysis_results

{'TTR': 0.6637168141592921,
 'MATTR': 0.7378124999999996,
 'Flesch-Kincaid Reading Ease': 70.92624631268438,
 'Dale-Chall Readability Score': 14.910987315634218}

### Interpretation of analysis_results

TTR (Type-Token Ratio): 0.664

* Interpretation: A TTR of 0.664 means that 66.4% of the words in your text are unique.
* In simpler terms: The text has a moderately high level of lexical diversity. This suggests a good variety in word choice, indicating that the author uses a relatively rich vocabulary and avoids excessive repetition. Compared to the previous text (TTR of 0.515), this text demonstrates greater lexical variety.

MATTR (Moving Average Type-Token Ratio): 0.738

* Interpretation: An MATTR of 0.738 indicates that, on average, within a set window of words, 73.8% of the words are unique.
* In simpler terms: This metric reduces the effect of text length on TTR. The MATTR is higher than the TTR because each 50-token window is shorter than the full text and so contains fewer repetitions. This text also has a higher MATTR than the previous text (0.624), reinforcing the observation of higher lexical diversity.

Flesch-Kincaid Reading Ease: 70.93

* Interpretation: A score of 70.93 suggests that the text is fairly easy to read.
* In simpler terms: The text is likely understandable by a general audience, including those with a 7th to 8th-grade reading level. This text is more challenging than the previous one (93.12), but still on the easy side of the scale.

Dale-Chall Readability Score: 14.91

* Interpretation: Taken at face value, a Dale-Chall score of 14.91 suggests that the text is quite difficult to read.
* In simpler terms: The score falls in the range for college graduates. As with the previous text (14.95), it mostly reflects the simplified word list used in this notebook, which flags most tokens as unfamiliar, rather than the actual difficulty of the vocabulary.

In [13]:
sample_text = """
The pervasive juxtaposition of ephemeral phenomena with the immutable constancy of existence forms the crux of philosophical discourse.
Epistemological frameworks, predicated upon the acquisition of knowledge through empirical observation and rational introspection, often grapple with the inherent limitations of human perception.
Furthermore, the ontological implications of subjective experience, particularly in the context of consciousness and self-awareness, remain a subject of intense debate among cognitive scientists and philosophers alike.
The intricate interplay between linguistic structures and cognitive processes, as elucidated by advancements in psycholinguistics, reveals the profound influence of language on shaping our understanding of reality.
Consequently, a holistic comprehension of the human condition necessitates a multidisciplinary approach, integrating insights from diverse fields such as neuroscience, anthropology, and sociology.
"""

In [14]:
analysis_results = analyze_text_lexical_diversity_and_complexity(sample_text)
analysis_results

{'TTR': 0.6791044776119403,
 'MATTR': 0.7814117647058824,
 'Flesch-Kincaid Reading Ease': -9.138641791044762,
 'Dale-Chall Readability Score': 16.160182985074627}

### Interpretation of analysis_results

TTR (Type-Token Ratio): 0.679

* Interpretation: A TTR of 0.679 means that 67.9% of the words in the text are unique.
* In simpler terms: The text exhibits a moderately high level of lexical diversity. This suggests a good variety in word choice, indicating that the author uses a relatively rich vocabulary and avoids excessive repetition.

MATTR (Moving Average Type-Token Ratio): 0.781

* Interpretation: An MATTR of 0.781 indicates that, on average, within a set window of words, 78.1% of the words are unique.
* In simpler terms: This metric reduces the effect of text length on TTR. The MATTR is higher than the TTR because each 50-token window is shorter than the full text and so contains fewer repetitions.

Flesch-Kincaid Reading Ease: -9.14

* Interpretation: A score of -9.14 suggests that the text is extremely difficult to read.
* In simpler terms: The text is very challenging to understand, likely requiring a college graduate level of education. Negative scores indicate very complex text.

Dale-Chall Readability Score: 16.16

* Interpretation: A Dale-Chall score of 16.16 suggests that the text is very difficult to read.
* In simpler terms: The score falls in the range for college graduates. It is only slightly higher than the scores for the two simpler texts (14.95 and 14.91), which shows that with the simplified word list used in this notebook the formula barely distinguishes easy vocabulary from difficult vocabulary.