# Homework 2
### DS7337 - 404

Jason Rupp

In [22]:
import pandas as pd
import nltk
import re
from urllib import request
from IPython.display import display
from sklearn.preprocessing import minmax_scale, MinMaxScaler

#### 1. In Python, create a method for scoring the vocabulary size of a text, and normalize the score from 0 to 1. It does not matter what method you use for normalization as long as you explain it in a short paragraph. (Various methods will be discussed in the live session.)

In [23]:
def scaled_vocab_size(text):
    # extract runs of word characters and normalize case
    sv_words = re.findall(r'\w+', text.lower())
    # unique words
    sv_uniq_words = set(sv_words)
    # create a single-column df with the length of each unique word
    sv_word_lengths = pd.DataFrame([len(w) for w in sv_uniq_words])

    # a single-column df is already 2D, the shape MinMaxScaler expects,
    # so no reshape is needed
    # define scaler
    scaler = MinMaxScaler()
    # fit data to scaler
    scaler.fit(sv_word_lengths)
    # transform data
    df_rtn = scaler.transform(sv_word_lengths)

    # return the scaled word lengths as an (n, 1) NumPy array
    return df_rtn

The above function accepts the raw text from the book and extracts the words using regex. The case is normalized by putting all words in lower case, then the unique words are identified and collected in a set. This set of unique words is iterated over to find the length of each word, creating a new list composed of the unique word lengths.
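As a small illustration of the tokenizing step described above, the sketch below runs the same regex and set logic on a made-up sentence. The sample string and variable names are assumptions for illustration only and are not part of the homework data.

```python
import re

# Made-up sample sentence (not from the graded texts)
sample = "The cat saw the Cat and the caterpillar"

# Extract runs of word characters after lower-casing
words = re.findall(r"\w+", sample.lower())

# Keep each word once, then measure its length
unique_words = set(words)
word_lengths = sorted(len(w) for w in unique_words)

print(sorted(unique_words))
print(word_lengths)
```

Because case is normalized first, "The", "the", and "Cat", "cat" each collapse to a single vocabulary entry.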

The list with the word lengths for the unique words is scaled with min/max scaling using `MinMaxScaler()`. This class is from the scikit-learn package and is what was suggested for this exercise; the reference guide can be found <a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html#sklearn.preprocessing.MinMaxScaler">here.</a> The method by which the scaling is accomplished is shown below:


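$$X_{std} = \frac{X - min(X)}{max(X) - min(X)}$$

***

$$X_{scaled} = X_{std} * (max - min) + min$$

In the second equation, min and max are the bounds of the target range (`feature_range`, 0 and 1 by default), so with the defaults the scaled value equals the first expression.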
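To make the two-step scaling concrete, here is a minimal NumPy sketch of min/max scaling. The helper name `min_max_scale` and the toy word lengths are assumptions for illustration; this is not the scikit-learn implementation, only the arithmetic it documents.

```python
import numpy as np

def min_max_scale(values, feature_range=(0.0, 1.0)):
    """Scale values linearly so the smallest maps to the lower bound
    and the largest maps to the upper bound of feature_range.
    Assumes the values are not all identical (no zero range)."""
    values = np.asarray(values, dtype=float)
    lo, hi = feature_range
    # Step 1: rescale to the 0..1 interval
    std = (values - values.min()) / (values.max() - values.min())
    # Step 2: stretch and shift to the target range
    return std * (hi - lo) + lo

# Toy word lengths: the shortest maps to 0, the longest to 1
print(min_max_scale([3, 5, 11]))
```

With the default range of 0 to 1, the second step leaves the values unchanged, which is the case used in this homework.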



#### 2. After consulting section 3.2 in chapter 1 of Bird-Klein, create a method for scoring the long-word vocabulary size of a text, and likewise normalize (and explain) the scoring as in step 1 above.

In [24]:
def scaled_long_vocab(text):
    # extract runs of word characters and normalize case
    slv_words = re.findall(r'\w+', text.lower())
    # unique words
    slv_uniq_words = set(slv_words)
    # create a list with each unique word longer than 13 characters
    slv_long_word = [w for w in slv_uniq_words if len(w) > 13]
    # count the length of each long word and put in a new list
    slv_long_vSize = [len(i) for i in slv_long_word]
    # scale the list of long word lengths to the range 0 to 1
    longScaler = minmax_scale(slv_long_vSize)

    # return the scaled long word lengths as a 1D NumPy array
    return longScaler

The above function operates in a very similar manner to the first scaling function. Again the method is passed the text, extracts the words using regex, normalizes case, identifies the unique words, and subsets to words longer than 13 characters. This list of long unique words is then iterated over to find the length of each word, creating a new list of long-word lengths. The value of 13 was chosen by elimination of other candidates: the book example used 15 characters, but thresholds of 14 and 15 returned very few words.
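The threshold comparison described above can be sketched as a count of unique long words at each candidate cutoff. The helper `count_long_words` and the sample sentence are hypothetical and only show the shape of the check, not the counts obtained on the graded texts.

```python
import re

def count_long_words(text, min_length):
    """Count unique lower-cased words strictly longer than min_length."""
    unique_words = set(re.findall(r"\w+", text.lower()))
    return sum(1 for w in unique_words if len(w) > min_length)

# Made-up sample; on a real book, pass the decoded raw text instead
sample = "An uncharacteristically straightforward misunderstanding"

counts = {threshold: count_long_words(sample, threshold)
          for threshold in (13, 14, 15)}
print(counts)
```

Running the same loop on each book shows how quickly the long-word list shrinks as the cutoff rises.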

The list of long-word lengths is scaled with min/max scaling using a slightly different but equivalent scikit-learn function, `minmax_scale()`. This is very similar to what was suggested for this exercise; the reference guide can be found <a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.minmax_scale.html">here.</a> The scaling method is the same as the formula above, in question 1; however, `minmax_scale()` can accommodate one-dimensional data without the need to reshape.
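As a quick check of that equivalence, the sketch below scales the same toy list both ways. The values are made up for illustration; only `minmax_scale` and `MinMaxScaler` come from scikit-learn.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, minmax_scale

# Toy long-word lengths (assumed values, not from the books)
long_word_lengths = [14, 15, 18]

# Function form: accepts a 1D list directly
one_d = minmax_scale(long_word_lengths)

# Class form: needs a 2D column, so reshape first
column = np.array(long_word_lengths).reshape(-1, 1)
two_d = MinMaxScaler().fit_transform(column)

# Both give the same scaled values
print(one_d)
print(np.allclose(one_d, two_d.ravel()))
```

The only practical difference here is the shape of the input and output: a flat array from the function, a single-column array from the class.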

#### 3. Now create a “text difficulty score” by combining the lexical diversity score from homework 1, and your normalized score of vocabulary size and long-word vocabulary size, in equal weighting. Explain what you see when this score is applied to the same graded texts you used in homework 1.

In [40]:
def lexical_diversity(text):
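    # extract runs of word characters and normalize case
    ld_words = re.findall(r'\w+', text.lower())
    ld_uniq_words = set(ld_words)
    # ratio of unique words to total words, rounded to 4 places
    return round(len(ld_uniq_words) / len(ld_words), 4)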

In [67]:
urls = ["https://www.gutenberg.org/files/34605/34605-0.txt",
       "https://www.gutenberg.org/cache/epub/34728/pg34728.txt",
       "https://www.gutenberg.org/files/44804/44804-0.txt"]

gradeLevels = ["9th Grade", "10th Grade", "11th Grade"]

bookTitles = ['Betty Lee, Freshman, by Harriet Pyne Grove',
             'Betty Lee, Sophomore, by Harriet Pyne Grove',
             'Betty Lee, Junior, by Harriet Pyne Grove']

normalizedWordCount = []
normalizedLongWordCount = []
lexDevs = []
text_diff = []

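for url in urls:
    # download the book and decode it to text
    response = request.urlopen(url)
    raw = response.read().decode('utf8')
    # scaled word lengths and lexical diversity for this text
    nwc = scaled_vocab_size(raw)
    nlwc = scaled_long_vocab(raw)
    ld = lexical_diversity(raw)
    # sum each array of scaled lengths to get one number per text
    nwc_sum = nwc.sum()
    nlwc_sum = nlwc.sum()
    normalizedWordCount.append(nwc_sum)
    normalizedLongWordCount.append(nlwc_sum)
    lexDevs.append(ld)
    # text difficulty is the product of the three measures
    text_diff.append(nwc_sum * nlwc_sum * ld)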
    
colNames = ["Grade Level", 
            "Book Title", 
            "Normalized Word Sum", 
            "Normalized Long Word Sum", 
            "Lexical Diversity",
            "Text Difficulty Score"
           ]

allDat = pd.DataFrame(
    list(zip(gradeLevels, bookTitles, normalizedWordCount, normalizedLongWordCount, lexDevs, text_diff)),
    columns = colNames
)




In [76]:
display(round(allDat, 3))

| | Grade Level | Book Title | Normalized Word Sum | Normalized Long Word Sum | Lexical Diversity | Text Difficulty Score |
|---|---|---|---|---|---|---|
| 0 | 9th Grade | Betty Lee, Freshman, by Harriet Pyne Grove | 1815.6 | 3.5 | 0.09 | 574.456 |
| 1 | 10th Grade | Betty Lee, Sophomore, by Harriet Pyne Grove | 1716.667 | 6.0 | 0.088 | 902.28 |
| 2 | 11th Grade | Betty Lee, Junior, by Harriet Pyne Grove | 1799.6 | 5.0 | 0.086 | 770.229 |


The above table shows the results from the text difficulty scoring exercise, with 3 of the texts used in homework 1, along with each measure that was used in the calculation. The formula used to calculate the text difficulty is shown below.

$$td\ = \Sigma{(nwc)}\ *\ \Sigma{(nlwc)}\ * ld$$

This was a simple multiplication of the three values. For both lists of normalized word lengths, shown in the formula as nwc and nlwc, the entire list was summed prior to multiplication.
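The formula can be read as a small function, sketched below on toy inputs. The function name `text_difficulty` and the input values are assumptions for illustration and are not results from the graded texts.

```python
import numpy as np

def text_difficulty(nwc, nlwc, ld):
    """Product of the summed scaled word lengths, the summed scaled
    long-word lengths, and the lexical diversity."""
    return float(np.sum(nwc) * np.sum(nlwc) * ld)

# Toy inputs shaped like the outputs of the two scaling functions
nwc = np.array([[0.0], [0.5], [1.0]])   # (n, 1) array, sums to 1.5
nlwc = np.array([0.0, 1.0])             # 1D array, sums to 1.0
score = text_difficulty(nwc, nlwc, 0.1)
print(round(score, 4))
```

Since the three factors are multiplied, a low value in any one of them lowers the whole score proportionally.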

Calculating the text difficulty score on these three texts from the first homework brought some other pieces of data to light. The normalized word sum takes the lengths of the unique words into account. The first homework asked a very similar question, whether the vocabulary count gives insight into difficulty, and one of the main issues with that method was that the unique words could be short and very simple. This method takes word length into account.

The score generated by the text difficulty function shown above does seem to correlate better with the idea that reading and vocabulary become more difficult and advanced as one progresses through the grades. The scores do not increase steadily: the 10th grade book scores highest (902.28), above the 11th grade book (770.229), but both are clearly above the 9th grade book (574.456). What is useful about the text difficulty score is that it gives much greater resolution between the levels. Lexical diversity alone yields numbers that are very small and very close together (0.09, 0.088, 0.086). The 9th grade text, for instance, has the highest normalized word sum and lexical diversity, but its long word score is the lowest of the three, which pulls its overall score down. The text difficulty score therefore seems to provide a possibly better measure than what was previously used.

