## DS 7337 - Natural Language Processing

### Author: Brandon Croom

### Homework: 2

In [1]:
# import nltk and other items
import nltk
from nltk.corpus import PlaintextCorpusReader
from nltk.book import *
from nltk.corpus import words
from sklearn.preprocessing import minmax_scale
import numpy as np

# load the same corpus used in HW1
corpus_root = "C://RAI//DS7337-NLP//HW1//corpus"
file_pattern = r".*/*.*\.txt"
ptb = PlaintextCorpusReader(corpus_root,file_pattern)
ptb.fileids()

# define a method that returns the number of entries in the NLTK English word list
def english_lang_words():
    word_list = words.words()
    return len(word_list)

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


### 1.	In Python, create a method for scoring the vocabulary size of a text, and normalize the score from 0 to 1. It does not matter what method you use for normalization as long as you explain it in a short paragraph. (Various methods will be discussed in the live session.)

In [2]:
# Define the single-text vocabulary size method. Keep only alphabetic tokens and lower-case them before counting unique words.
# We normalize with the standard min-max formula, (x - min(x)) / (max(x) - min(x)).
# Assuming min(x) = 0 (the fewest words possible), this reduces to x / max(x), where x = the vocabulary size of the text after cleansing and max(x) = the number of entries in the NLTK English word list.
def s_vocab_size(text):
    size = len(set(word.lower() for word in text if word.isalpha()))
    return size/english_lang_words()

s_vocab_size(text1)

0.07159029467423628
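
As a rough illustration of the normalization described in the comments above, the sketch below applies the min-max formula with the minimum fixed at 0 and shows that it gives the same value as dividing by the maximum. The token list and `reference_size` are made-up values for illustration; the notebook itself uses the size of the NLTK word list as the maximum.

```python
def min_max(x, lo, hi):
    """Standard min-max normalization: (x - lo) / (hi - lo)."""
    return (x - lo) / (hi - lo)


def vocab_size(tokens):
    """Count unique lower-cased tokens, keeping alphabetic tokens only."""
    return len({t.lower() for t in tokens if t.isalpha()})


# Toy token list (hypothetical data, not from the corpus).
tokens = ["The", "cat", "sat", "on", "the", "mat", ".", "42"]
size = vocab_size(tokens)  # unique words: the, cat, sat, on, mat

# Hypothetical reference vocabulary size standing in for the word list length.
reference_size = 20

# With lo = 0 the formula collapses to x / hi.
score = min_max(size, 0, reference_size)
print(size, score, size / reference_size)
```

Because the denominator is fixed, a score computed this way does not change when other texts are added to the comparison.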

In [3]:
# Define the multi-text vocabulary size method. As above, keep only alphabetic tokens and lower-case them. This function is similar to the example provided in class and lets us
# analyze multiple texts at once. We normalize the results with the scikit-learn minmax_scale function, which maps the smallest vocabulary size to 0 and the largest to 1,
# so each score is relative to the other texts passed in.
def n_vocab_size(*arg):
    vocab_size = np.array([])
    
    #### Getting the Vocab Size
    for text in arg:
        vocab_size = np.append(vocab_size,len(set(word.lower() for word in text if word.isalpha())))
    
    #### Normalizing using sklearn preprocessing 
    vocab_size_norm_sklearn = minmax_scale(vocab_size, feature_range=(0,1), axis=0)
    
    return(vocab_size_norm_sklearn)

vocab_size = n_vocab_size(text1,text2,text3,text4,text5,text6,text7,text8,text9)
print(vocab_size)

[1.         0.34073067 0.113989   0.51548495 0.24108302 0.06354701
 0.51542313 0.         0.34178154]


### 2.	After consulting section 3.2 in chapter 1 of Bird-Klein, create a method for scoring the long-word vocabulary size of a text, and likewise normalize (and explain) the scoring as in step 1 above.

In [4]:
# Define the single-text long-word score function. The user can set the word length and the number of times a word must occur in the text before it is counted. We normalize by
# dividing the number of long words by the number of entries in the NLTK English word list. Using the standard min-max formula, (x - min(x)) / (max(x) - min(x)), and assuming min(x) = 0 (the
# fewest long words possible), this reduces to x / max(x), where x = the number of qualifying long words and max(x) = the size of the English word list.
def s_long_word_score(text, word_len=7, word_freq=7):
    fdist = FreqDist(text)
    size = len(sorted(word for word in set(text) if len(word) > word_len and fdist[word] > word_freq))
    return size/english_lang_words()

s_long_word_score(text5)

8.025817788591511e-05

In [5]:
# Define the multi-text long-word score function. The user can set the word length and the number of times a word must occur in the text before it is counted. As with the n_vocab_size
# method, this method takes multiple texts at once and uses the minmax_scale function from scikit-learn to do the scaling.
def n_long_word_score(*arg, word_len=7, word_freq=7):
    long_word_score = np.array([])
    
    #### Getting the long-word count
    for text in arg:
        fdist = FreqDist(text)
        size = len(sorted(word for word in set(text) if len(word) > word_len and fdist[word] > word_freq))
        long_word_score = np.append(long_word_score,size)
    
    #### Normalizing using sklearn preprocessing 
    long_word_score_norm_sklearn = minmax_scale(long_word_score, feature_range=(0,1), axis=0)
    return(long_word_score_norm_sklearn)

word_score = n_long_word_score(text1,text2,text3,text4,text5,text6,text7,text8,text9)
print(word_score)

[0.82779456 0.59214502 0.06193353 1.         0.01812689 0.00755287
 0.54682779 0.         0.14199396]
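
The thresholds in the long-word functions are strict: a word must be longer than `word_len` characters and occur more than `word_freq` times. A minimal sketch of that filter on a toy token list follows. It uses `collections.Counter` as a stand-in for NLTK's `FreqDist` so that it runs without the NLTK data; `long_word_count` and the tokens are hypothetical names and data, not part of the notebook.

```python
from collections import Counter


def long_word_count(tokens, word_len=7, word_freq=7):
    """Count distinct words longer than word_len that occur more than word_freq times."""
    counts = Counter(tokens)
    return sum(
        1
        for word, n in counts.items()
        if len(word) > word_len and n > word_freq
    )


# Toy data (made up for illustration):
#   "information": 11 chars, 8 occurrences  -> counted
#   "football":     8 chars, 3 occurrences  -> too rare
#   "seven":        5 chars, 10 occurrences -> too short
#   "elephant":     8 chars, 7 occurrences  -> 7 is not more than 7
tokens = (
    ["information"] * 8
    + ["football"] * 3
    + ["seven"] * 10
    + ["elephant"] * 7
)

default_count = long_word_count(tokens)
relaxed_count = long_word_count(tokens, word_freq=6)
print(default_count, relaxed_count)
```

Lowering `word_freq` to 6 admits "elephant", which shows how sensitive the count can be to the exact cutoffs on short texts.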


### 3.	Now create a “text difficulty score” by combining the lexical diversity score from homework 1, and your normalized score of vocabulary size and long-word vocabulary size, in equal weighting. Explain what you see when this score is applied to same graded texts you used in homework 1.

In [6]:
# HW 1 lexical diversity function

# define the text word count
def text_word_count(text_data):
    return len(text_data)

# define the text vocabulary size
def text_vocab_size(text_data):
    return len(set(text_data))

# define lexical diversity as it was defined in HW1
def lexical_diversity(text_data):
    word_count = text_word_count(text_data)
    c_vocab_size = text_vocab_size(text_data)
    diversity_score = c_vocab_size / word_count
    return diversity_score

# text difficulty score function. Create a weighted average function. Allow the user to input specific weights for
# lexical diversity, word score, and vocabulary size. Defaults for all are 1 to keep the weights equal
def text_diff_score(text, lex_div_weight=1,word_score_weight=1,vocab_size_weight=1):
    lex_div = lexical_diversity(text)*lex_div_weight
    word_score = s_long_word_score(text) * word_score_weight
    size_vocab = s_vocab_size(text) * vocab_size_weight

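    # divide by the total weight so the result is a true weighted average (same as /3 with the default weights)
    return (lex_div + word_score + size_vocab)/(lex_div_weight + word_score_weight + vocab_size_weight)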
    
# print out the new text difficulty scores
print("Text Difficulty Score The Ontario Readers: Third Book: ",text_diff_score(ptb.words(ptb.fileids()[0])))
print("Text Difficulty Score The Ontario Readers: Fourth Book: ",text_diff_score(ptb.words(ptb.fileids()[1])))
print("Text Difficulty Score The Ontario Readers: The High School Book: ",text_diff_score(ptb.words(ptb.fileids()[2])))

Text Difficulty Score The Ontario Readers: Third Book:  0.04969884364250884
Text Difficulty Score The Ontario Readers: Fourth Book:  0.05454543644267229
Text Difficulty Score The Ontario Readers: The High School Book:  0.05382365014314133
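
For reference, a weighted average in the usual sense divides by the sum of the weights, which coincides with dividing by 3 when all three weights are 1. The following is a small standalone sketch; `weighted_average` and the component scores are illustrative and are not taken from the notebook's results.

```python
def weighted_average(scores, weights):
    """Return sum(score * weight) / sum(weights)."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(s * w for s, w in zip(scores, weights)) / total_weight


# Hypothetical component scores: lexical diversity, long-word score, vocabulary size.
scores = [0.2, 0.5, 0.8]

equal = weighted_average(scores, [1, 1, 1])    # plain mean of the three
skewed = weighted_average(scores, [2, 1, 1])   # lexical diversity counts double
print(equal, skewed)
```

Dividing by the total weight keeps the combined score within the range of its components, whatever weights are chosen.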


In [7]:
# text difficulty score function. Create a weighted average function. Allow the user to input specific weights for
# lexical diversity, word score, and vocabulary size. Defaults for all are 1 to keep the weights equal
def n_text_diff_score(*args, lex_div_weight=1,word_score_weight=1,vocab_size_weight=1):

    lex_div_score = np.array([])

    #### Getting the lexical diversity
    for text in args:
        lex_div_score = np.append(lex_div_score,lexical_diversity(text))

    word_score = n_long_word_score(*args)
    vocab_size = n_vocab_size(*args)

    lex_div_score = lex_div_score*lex_div_weight
    word_score = word_score* word_score_weight
    vocab_size = vocab_size * vocab_size_weight

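    # divide by the total weight so the result is a true weighted average (same as / 3 with the default weights)
    text_diff_score = (lex_div_score + word_score + vocab_size) / (lex_div_weight + word_score_weight + vocab_size_weight)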

    return text_diff_score

tdiff_score = n_text_diff_score(ptb.words(ptb.fileids()[0]),ptb.words(ptb.fileids()[1]),ptb.words(ptb.fileids()[2]))

# print out the new text difficulty scores
print("Text Difficulty Score The Ontario Readers: Third Book: ",tdiff_score[0])
print("Text Difficulty Score The Ontario Readers: Fourth Book: ",tdiff_score[1])
print("Text Difficulty Score The Ontario Readers: The High School Book: ",tdiff_score[2])

Text Difficulty Score The Ontario Readers: Third Book:  0.038337383337922006
Text Difficulty Score The Ontario Readers: Fourth Book:  0.19243566947386515
Text Difficulty Score The Ontario Readers: The High School Book:  0.7013226926771597


#### Analysis Notes: 
Methods for this exercise were written in two ways: one for analyzing a single text and one for analyzing multiple texts at once. The single-text methods scale each value by the number of entries in the NLTK English word list. This was done because a single value cannot be min-max scaled to a meaningful number between 0 and 1 (with only one value, the minimum and maximum coincide), so a percentage-based method is used instead. The multi-text approach uses the scikit-learn minmax_scale function to perform the scaling. Although the two approaches differ, their results share some similarities, as noted below.
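
To illustrate why a single value cannot be min-max scaled, the sketch below compares the two strategies on invented vocabulary counts. `scale_by_reference`, `scale_min_max`, and all numbers are assumptions for illustration. The hand-written `scale_min_max` follows the (x - min) / (max - min) formula, and returning 0.0 when the range is zero is a choice made for this sketch.

```python
def scale_by_reference(count, reference_size):
    """Scale one count against a fixed reference size (percentage-style)."""
    return count / reference_size


def scale_min_max(counts):
    """Min-max scale a list of counts to [0, 1], relative to each other."""
    lo, hi = min(counts), max(counts)
    if hi == lo:
        # Zero range (for example a single value): nothing to compare against.
        return [0.0 for _ in counts]
    return [(c - lo) / (hi - lo) for c in counts]


# Hypothetical vocabulary counts for three texts and a hypothetical reference size.
counts = [2000, 3000, 6000]
reference_size = 200000

fixed = [scale_by_reference(c, reference_size) for c in counts]
relative = scale_min_max(counts)
single = scale_min_max([2000])

print(fixed)
print(relative)
print(single)
```

The fixed-reference scores stay small and close together, while the min-max scores always span the full 0 to 1 range for the texts compared.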

Using the Single Text Methods:
Using the text difficulty score above for the single-text methods, we see that the score follows a pattern similar to that of lexical diversity alone, as in HW1. It is a bit counterintuitive. The text designed for high school has a score of 0.0538, which is lower than the fourth grade book's score of 0.0545. We would expect the high school book to have the highest score when compared to books for third and fourth graders. The text difficulty scores for the third and fourth grade books (0.0497 and 0.0545) show they are close in difficulty, but they do have a separation, which would be indicative of the grade change.

Using the Multi-Text Methods:
The text designed for high school has a score of 0.701, which indicates a higher difficulty than the other two books. This is intuitive, as we would expect high school books to have more difficult text than third and fourth grade books. We also see a difference between the third and fourth grade books, with scores of 0.038 and 0.192 respectively. Again, this seems intuitive given the expectation that text difficulty should increase with grade level. Note that min-max scaling assigns 0 to the smallest and 1 to the largest value among the texts compared, so these scores are relative to this set of three texts rather than absolute. The overall progression of the scores feels much more intuitive in this approach than with the single-text approach.