Common measures of textual complexity are derived from simple counts of words, sentences and syllables.  In this homework, you'll implement two of them: type-token ratio (a measure of vocabulary richness) and the [Flesch-Kincaid Grade Level](https://en.wikipedia.org/wiki/Flesch–Kincaid_readability_tests#Flesch–Kincaid_grade_level).

In [1]:
import nltk

In [2]:
# If you haven't downloaded the sentence segmentation model before, do so here
nltk.download("punkt")

[nltk_data] Downloading package punkt to /Users/liuhan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

Q1: Find two different texts you'd like to compare (from any source).  For potential sources, see [The American Presidency Project](https://www.presidency.ucsb.edu/documents/presidential-documents-archive-guidebook/annual-messages-congress-the-state-the-union) for all State of the Union addresses and [Project Gutenberg](https://www.gutenberg.org) for books in the public domain.  Paste them into the `text1` and `text2` strings below.  Ensure that each text is at least 500 words long.

In [None]:
text1=""" """

In [None]:
text2=""" """

Q2: Use the `nltk.word_tokenize` method to implement the type-token ratio:

$$
TTR = {\textrm{number of distinct word types} \over \textrm{number of word tokens}}
$$

TTR depends on text length (intuitively, the longer a text is, the greater the chance that a word type repeats), so this number is only comparable between documents of identical length.  Calculate this measure for the first 500 words of each of your two documents and report the results here. Exclude tokens that are exclusively punctuation from all counts.
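
To see the length effect concretely, here is a minimal sketch on a toy text. It uses a crude regular-expression tokenizer as a stand-in for `nltk.word_tokenize`; the helper names `toy_tokens` and `toy_ttr`, the lowercasing, and the sample sentence are all assumptions made for illustration, not part of the assignment.

```python
import re

def toy_tokens(text):
    # Crude stand-in for nltk.word_tokenize: keep runs of letters, digits and
    # apostrophes, so punctuation-only tokens never appear in the output.
    # Lowercasing is an assumption; the assignment does not specify case folding.
    return re.findall(r"[a-z0-9']+", text.lower())

def toy_ttr(text, num_words=None):
    tokens = toy_tokens(text)
    if num_words is not None:
        tokens = tokens[:num_words]  # truncate so text lengths are comparable
    return len(set(tokens)) / len(tokens)

sample = "The cat sat on the mat. The dog sat on the log."
short_ttr = toy_ttr(sample, num_words=6)  # first 6 tokens: 5 distinct types
full_ttr = toy_ttr(sample)                # all 12 tokens: 7 distinct types
print(short_ttr, full_ttr)
```

The ratio drops from 5/6 on the first six tokens to 7/12 on the whole sample, even though the style of the text has not changed. That is why the comparison is fixed at the first 500 words.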

In [None]:
def type_token_ratio(text, num_words=500):
    # your answer here
    pass  # placeholder so the cell parses; replace with your implementation

In [None]:
type_token_ratio(text1)

In [None]:
type_token_ratio(text2)

****

Now we'll implement the [Flesch-Kincaid Grade Level](https://en.wikipedia.org/wiki/Flesch–Kincaid_readability_tests#Flesch–Kincaid_grade_level), which has the following formula:

$$
0.39 \left ( \frac{\mbox{total words}}{\mbox{total sentences}} \right ) + 11.8 \left ( \frac{\mbox{total syllables}}{\mbox{total words}} \right ) - 15.59
$$

Use `nltk.sent_tokenize` or spaCy's `Doc.sents` attribute to count sentences, any word tokenization method we've covered to count words, and the `get_syllable_count` function below to count the syllables in a word.  Exclude tokens that are exclusively punctuation from the word and syllable counts.
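
One possible way to drop punctuation-only tokens is sketched below. The helper name `is_punctuation_only` and the rule itself (a token counts as a word if it contains at least one letter or digit) are assumptions, and the token list is hard-coded in the style that `nltk.word_tokenize` tends to produce, so nothing needs to be downloaded to run it.

```python
def is_punctuation_only(token):
    # Assumption: a token is a "word" if it has at least one letter or digit.
    # Under this rule symbols such as "$" are also treated as punctuation.
    return not any(ch.isalnum() for ch in token)

# Hard-coded tokens resembling tokenizer output for: "Don't panic," she said -- twice.
tokens = ["``", "Do", "n't", "panic", ",", "''", "she", "said", "--", "twice", "."]

words = [tok for tok in tokens if not is_punctuation_only(tok)]
print(words)
```

Note that the clitic `n't` survives as its own word under this rule, which raises the word count relative to a whitespace split. This is exactly the kind of assumption Q3 asks you to state.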

To count syllables, we'll use two resources: the [CMU pronunciation dictionary](https://github.com/cmusphinx/cmudict), which gives ARPABET pronunciations for a large vocabulary of English words, and [g2p](https://github.com/Kyubyong/g2p), a neural model trained to predict the pronunciation of a word (which we can use for words not in the CMU dictionary).

In [None]:
nltk.download("cmudict")  # fetch the CMU dictionary if it isn't installed yet
arpabet = nltk.corpus.cmudict.dict()

In [None]:
!pip install g2p_en

In [None]:
from g2p_en import G2p
g2p = G2p()

In [None]:
import re

def get_pronunciation(word):
    # cmudict keys are lowercase; anything not found falls through to g2p
    if word in arpabet:
        # pick the first pronunciation
        return arpabet[word][0]

    else:
        return g2p(word)

def get_syllable_count(word):
    pronunciation = get_pronunciation(word)
    sylls = 0
    for phon in pronunciation:
        # vowels in arpabet end in digits (indicating stress)
        if re.search(r"\d$", phon) is not None:
            sylls += 1
    return sylls
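
The counting rule can be checked on its own, without loading the dictionary or the g2p model. In the sketch below the two phone lists are hard-coded to illustrate the ARPABET format (verify them against `arpabet` before relying on them), and `count_vowel_phones` is a hypothetical helper name.

```python
import re

def count_vowel_phones(pronunciation):
    # ARPABET vowels carry a trailing stress digit (0, 1 or 2); consonants do not,
    # so counting phones that end in a digit counts syllable nuclei.
    return sum(1 for phone in pronunciation if re.search(r"\d$", phone))

# Illustrative transcriptions, written out by hand for this example
hello = ["HH", "AH0", "L", "OW1"]
language = ["L", "AE1", "NG", "G", "W", "AH0", "JH"]

print(count_vowel_phones(hello), count_vowel_phones(language))
```

Both words have two vowel phones, so both count as two syllables, regardless of how many consonants surround them.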

In [None]:
get_syllable_count("Bamman")

In [None]:
get_syllable_count("27")

Q3: Implement Flesch-Kincaid Grade Level and report its results for your two texts.  Flesch-Kincaid relies on an implicit definition of a "word" and a "sentence", and different definitions will yield different grade level estimates. (In the problem definition above, we've already ruled out punctuation as constituting stand-alone words, and other assumptions lurk in every tokenization method.) State the assumptions behind the definition of "word" you have implemented and explain why they are reasonable.

In [None]:
def flesch_kincaid_grade_level(text):
    # your answer here
    pass  # placeholder so the cell parses; replace with your implementation

In [None]:
# Should be 11.265
flesch_kincaid_grade_level("The Australian platypus is seemingly a hybrid of a mammal and reptilian creature".lower())
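
As a sanity check on the expected value, the arithmetic can be reproduced from raw counts. The sentence has 13 words and 1 sentence; the syllable total of 24 used below is an assumed tally (it is the only whole number consistent with 11.265), and `fk_from_counts` is a hypothetical helper that only evaluates the formula, not a solution to Q3.

```python
def fk_from_counts(num_words, num_sentences, num_syllables):
    # Direct evaluation of the Flesch-Kincaid Grade Level formula from counts
    return (0.39 * (num_words / num_sentences)
            + 11.8 * (num_syllables / num_words)
            - 15.59)

# Counts for the platypus sentence: 13 words, 1 sentence, 24 syllables (assumed)
grade = fk_from_counts(num_words=13, num_sentences=1, num_syllables=24)
print(round(grade, 3))
```

If your implementation returns a different number for this sentence, comparing your word and syllable counts against these three values is a quick way to find where the definitions diverge.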

In [None]:
flesch_kincaid_grade_level(text1.lower())

In [None]:
flesch_kincaid_grade_level(text2.lower())

**Q3 "word" assumptions:**