# a4 - Data Analysis
Fill in the code cells below as specified. Note that cells may use variables and functions defined in previous cells; we should be able to use the `Kernel > Restart & Clear All` menu item followed by `Cell > Run All` to execute your entire notebook and see the correct output.

## Part 1. Numbers
For this part of the assignment, you will analyze some numeric data (counts of library holdings) to investigate how the distribution of numbers in natural data sets obeys the counter-intuitive [Benford's Law](https://plus.maths.org/content/os/issue9/features/benford/index).

<small>(This exercise was adapted from Steve Wolfman).</small>

Create a variable **`holdings_data`** which is a **list** of the contents of the **`data/libraryholdings.txt`** file included in the repository (each line in the file should be a single element in the list). You will need to open up the file and read its contents into a list. You should specify a _local path_ to the file (from this notebook's location).

In [1]:
with open('data/libraryholdings.txt') as my_file:
    holding_data = []
    for line in my_file:
        holding_data.append(line)

Print out the first **ten** items from the `holdings_data` list, each on its own line. (Note that there may be extra line breaks that are included in the data items themselves).

In [2]:
for item in holding_data[0:10]:
    # each item already ends with its own line break, so suppress print's newline
    print(item, end='')

(* Library holdings (# of books in each library), *)
(* collected by Christian Ayotte.                 *)
(* Labels not available.                          *)

12201
600778
14926
37863
14866
9896


Use the **slice operator (`:`)** to remove the "heading" and blank elements from the beginning of the data list, leaving only the list of numbers. The remaining values should continue to be stored (re-stored) in the `holdings_data` variable. Output the new first element in `holdings_data` to demonstrate that it is the first number in the data set.
- The values in the list _should_ be strings rather than integers

In [3]:
holding_data = holding_data[4:]
print(holding_data[0])

12201

Create a variable **`lead_digit_counts`** that is a dictionary whose keys are _strings_ of each digit (`"0"`, `"1"`, `"2"`, etc.), and whose values are all the number `0`. You can do this directly or with a loop. Print out the variable after you create it.

In [4]:
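keys = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
values = [0] * len(keys)  # one zero per digit key
lead_digit_counts = dict(zip(keys, values))
print(lead_digit_counts)

{'0': 0, '1': 0, '2': 0, '3': 0, '4': 0, '5': 0, '6': 0, '7': 0, '8': 0, '9': 0}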

Calculate the number of times each digit appears as the _first digit_ in a value of the `holdings_data` list, storing those counts in the `lead_digit_counts` dictionary.

In [5]:
for num in holding_data:
    # look up the current count for this value's leading digit (0 if not seen yet)
    current_count = lead_digit_counts.get(str(num)[0], 0)
    # store the incremented count
    lead_digit_counts[str(num)[0]] = current_count + 1

Use a loop to print out each count in `lead_digit_counts` with the format:
```
X values have a leading digit of digit Y
```

In [6]:
for first_digit in sorted(lead_digit_counts):
    print(str(lead_digit_counts[first_digit]) + ' values have a leading digit of digit ' + first_digit)

0 values have a leading digit of digit 0
3056 values have a leading digit of digit 1
1606 values have a leading digit of digit 2
1018 values have a leading digit of digit 3
801 values have a leading digit of digit 4
640 values have a leading digit of digit 5
560 values have a leading digit of digit 6
502 values have a leading digit of digit 7
503 values have a leading digit of digit 8
452 values have a leading digit of digit 9


Print the _percentage_ of values in the library holdings data set that have a leading digit **`1`** (round to 2 decimal places). Is this value as predicted by Benford's law?

In [7]:
# Percentage of values in the library holdings data set that have a leading digit of 1
total_count = sum(lead_digit_counts.values())
print(round(100 * lead_digit_counts['1'] / total_count, 2))

# To check Benford's law further, also compute the percentage for a leading digit of 2
print(round(100 * lead_digit_counts['2'] / total_count, 2))

print("According to Benford's law, about 30% of values begin with 1 and about 18% with 2. Our results (33.44% and 17.57%) show a similar pattern, so the values are roughly as Benford's law predicts.")

33.44
17.57
According to Benford's law, about 30% of values begin with 1 and about 18% with 2. Our results (33.44% and 17.57%) show a similar pattern, so the values are roughly as Benford's law predicts.
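
For reference, Benford's law predicts that a leading digit `d` occurs with probability `log10(1 + 1/d)`. The sketch below compares that prediction with the leading-digit counts printed earlier; the `observed_counts` dictionary is copied from that output rather than recomputed, and `benford_expected()` is a hypothetical helper name chosen for illustration.

```python
import math

# Leading-digit counts copied from the earlier output (not recomputed here).
observed_counts = {'1': 3056, '2': 1606, '3': 1018, '4': 801, '5': 640,
                   '6': 560, '7': 502, '8': 503, '9': 452}

def benford_expected(digit):
    """Return the percentage Benford's law predicts for a leading digit (1-9)."""
    return 100 * math.log10(1 + 1 / digit)

total = sum(observed_counts.values())
for digit in sorted(observed_counts):
    observed = 100 * observed_counts[digit] / total
    expected = benford_expected(int(digit))
    # print: digit, observed percentage, predicted percentage
    print(digit, round(observed, 2), round(expected, 2))
```

Printing both columns side by side makes it easier to judge how closely every digit, not just `1` and `2`, follows the prediction.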


***Extra credit challenge:*** Create a single variable `digit_position_counts` that contains the number of times that each digit 0 through 9 appears in _each_ position in the data set. E.g., a `1` appears in the 1st position 3056 times and in the second position 1005 times; a `2` appears in the 1st position 1606 times and in the second position 1044 times.

Use this variable to print a "table" of the percentage of the time each position contains each digit (e.g., the 1st digit is a `1` 33.44% of the time, a `2` 17.57% of the time, etc).

Note that for this extra challenge it is up to you to determine an appropriate data structure (e.g., how to combine dictionaries and lists and tuples) for representing this table. Be sure to include comments explaining your work.

Only attempt this problem once you have completed everything else!

{'0': 0, '1': 0, '2': 0, '3': 0, '4': 0, '5': 0, '6': 0, '7': 0, '8': 0, '9': 0}
{'0': 0, '1': 3056, '2': 1606, '3': 1018, '4': 801, '5': 640, '6': 560, '7': 502, '8': 503, '9': 452}
{'0': 1162, '1': 4061, '2': 2650, '3': 1989, '4': 1680, '5': 1529, '6': 1419, '7': 1297, '8': 1265, '9': 1224}
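
One possible structure for the extra-credit table, offered as a sketch rather than the expected answer: a list with one dictionary per digit position, so that `digit_position_counts[0]['1']` is the number of values whose first digit is `1`. The `sample_values` list and the `count_digit_positions()` name are made up for illustration, and the sketch assumes every value contains only digits plus an optional trailing newline.

```python
def count_digit_positions(values):
    """Return a list of dicts: element i maps each digit to its count at position i."""
    position_counts = []
    for value in values:
        digits = value.strip()  # drop the trailing newline, if any
        for position, digit in enumerate(digits):
            # grow the list the first time a longer number is seen
            if position == len(position_counts):
                position_counts.append(dict.fromkeys('0123456789', 0))
            position_counts[position][digit] += 1
    return position_counts

# Hypothetical sample data standing in for the cleaned holdings list
sample_values = ['12201\n', '600778\n', '14926\n', '991\n']
digit_position_counts = count_digit_positions(sample_values)

# Print one row per position: the percentage of values with each digit 0-9 there
for position, counts in enumerate(digit_position_counts, start=1):
    total = sum(counts.values())
    row = [str(round(100 * counts[d] / total, 2)) for d in '0123456789']
    print('position ' + str(position) + ': ' + ' '.join(row))
```

Each row is normalized by the number of values that actually have a digit in that position, so shorter numbers do not dilute the later columns.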


## Part 2. Life Expectancy
For this part of the assignment, you'll work with data about the life expectancy (in years) for each country in the world in the years 1960 and 2013. Note that this can be really [fun](http://www.ted.com/talks/hans_rosling_shows_the_best_stats_you_ve_ever_seen.html) data!

The data is found in a [.csv](https://en.wikipedia.org/wiki/Comma-separated_values) file: a plain-text data format where each line represents a record (row) of data and where each feature (column) is separated by a comma.

Read in the contents of the **`data/life_expectancy.csv`** data file, and use it to construct a **list** called **`life_expectancy_list`**. Each element in this list should be a **dictionary** (one for each row in the `csv` file) with the following keys and values:

- a key `'country'` whose value is the name of the country (as a string)
- a key `'le_1960'` whose value is the life expectancy in 1960 (as a float)
- a key `'le_2013'` whose value is the life expectancy in 2013 (as a float)

Thus the first record should look like:
```
{'country': 'Aruba', 'le_1960': 65.56936585, 'le_2013': 75.33217073}
```

You should use the **`csv`** module to read this file and break up each row into different values. See [the documentation](https://docs.python.org/3/library/csv.html) for an example of how to do this. Print out the _first row_ of your list as a demonstration that you've processed the data correctly.

In [9]:
import csv

with open('data/life_expectancy.csv', newline='') as csvfile:
    life_expectancy_list = []
    csvreader = csv.DictReader(csvfile)
    for line in csvreader:
        # build a fresh dictionary for each row, converting the numbers to floats
        life_expectancy_list.append({
            'country': line['country'],
            'le_1960': float(line['le_1960']),
            'le_2013': float(line['le_2013']),
        })

print (life_expectancy_list[0])

{'country': 'Aruba', 'le_1960': 65.56936585, 'le_2013': 75.33217073}


Add another item to each dictionary in the `life_expectancy_list` whose **key** is `change` and whose **value** is the change in life expectancy from 1960 to 2013.

In [10]:
for line in life_expectancy_list:
    line['change'] = line['le_2013']-line['le_1960']

Create a variable **`num_small_gain`** that stores the **number of countries** whose life expectancy did not improve by 5 years or more between 1960 and 2013. This will include countries whose life expectancy has worsened. Print out this variable.

In [11]:
num_small_gain = 0
for line in life_expectancy_list:
    if line['change'] < 5:
        num_small_gain += 1
                
print('The number of countries whose life expectancy did not improve by 5 years or more between 1960 and 2013 is', num_small_gain)

The number of countries whose life expectancy did not improve by 5 years or more between 1960 and 2013 is 7


Create a variable **`most_improved`** that is the **name of the country** with the largest gain in life expectancy (between 1960 and 2013). Print out this variable.

In [12]:
max_num = 0

for line in life_expectancy_list:
    if line['change'] > max_num:
        max_num = line['change']
        most_improved = line['country']
        
print ('The name of the country with the largest gain in life expectancy between 1960 and 2013 is', most_improved)

The name of the country with the largest gain in life expectancy between 1960 and 2013 is Maldives
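
As an alternative sketch, the built-in `max()` accepts a `key` function, which avoids seeding the search with `0` (a seed that would leave `most_improved` undefined if every change were negative). The records below are hypothetical stand-ins with made-up numbers, not values from the data file.

```python
# Hypothetical records with made-up numbers, shaped like life_expectancy_list
sample_records = [
    {'country': 'A', 'change': 4.5},
    {'country': 'B', 'change': 21.0},
    {'country': 'C', 'change': -1.2},
]

# max() returns the whole record with the largest 'change' value
best_record = max(sample_records, key=lambda record: record['change'])
most_improved_sample = best_record['country']
print(most_improved_sample)
```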


Define a function **`compare_country_le()`** that takes in the names of _two_ countries, and returns a **tuple** containing the following information:
- the name of the country with the greater life expectancy,
- the life expectancy in 2013 of that country
- the difference between the life expectancies in 2013

Use your function to print the comparison between the life expectancies of the _United States_ and _Cuba_.  

In [13]:
def compare_country_le (country_1, country_2):
    for line in life_expectancy_list:
        if line['country'] == country_1:
            country_1_name = line['country']
            country_1_le = line['le_2013']
            country_1_diff = line['change']
        if line['country'] == country_2:
            country_2_name = line['country']
            country_2_le = line['le_2013']
            country_2_diff = line['change']
            
    if country_1_diff > country_2_diff:
        print ('The name of the country with the greater life expectancy:', country_1_name)
        print ('The life expectancy in 2013 of that country:', str(country_1_le))
        print ('The difference between the life expectancies in 2013:', str(country_1_diff))
    elif country_1_diff < country_2_diff:
        print ('The name of the country with the greater life expectancy:', country_2_name)
        print ('The life expectancy in 2013 of that country:', str(country_2_le))
        print ('The difference between the life expectancies in 2013:', str(country_2_diff))

compare_country_le('United States', 'Cuba')

The name of the country with the greater life expectancy: Cuba
The life expectancy in 2013 of that country: 79.23926829
The difference between the life expectancies in 2013: 15.334609749999998
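
Note that the prompt asks for a returned tuple comparing 2013 life expectancies, whereas the cell above prints its results and compares the 1960 to 2013 change. A minimal sketch of the tuple-returning shape follows; the function name `compare_le_sketch()`, its extra `records` parameter, and the sample numbers are all assumptions for illustration, not values from the data file.

```python
def compare_le_sketch(records, country_1, country_2):
    """Return (name, 2013 life expectancy, difference) for the longer-lived country."""
    # index the records by country name for direct lookup
    le_by_country = {record['country']: record['le_2013'] for record in records}
    le_1 = le_by_country[country_1]
    le_2 = le_by_country[country_2]
    if le_1 >= le_2:
        return (country_1, le_1, le_1 - le_2)
    return (country_2, le_2, le_2 - le_1)

# Hypothetical records with made-up numbers
sample_records = [
    {'country': 'X', 'le_2013': 80.0},
    {'country': 'Y', 'le_2013': 77.5},
]
print(compare_le_sketch(sample_records, 'Y', 'X'))
```

Returning the tuple, rather than printing inside the function, lets the caller decide how to format or reuse the comparison.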


## Part 3. Readability
For this part of the assignment, you will calculate the [readability](https://en.wikipedia.org/wiki/Readability) of a text document using the [Dale-Chall Readability Formula](http://www.readabilityformulas.com/new-dale-chall-readability-formula.php). This method determines how "easy" it is to read a particular (English) document by considering the length of sentences and how many of the words used are "easy" to understand (based on a pre-defined list of "easy" words).

Splitting real-world text documents into words and sentences is non-trivial (English is hard!). To make this easier, you should use the [Natural Language Toolkit (nltk)](http://www.nltk.org/index.html) module. This module is included with Anaconda, but does require some additional data source files to be installed on your computer. You _should_ be able to do this by running the cell below (you only need to run it once).

In [14]:
from nltk import download
download('punkt')
download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Loaner\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Loaner\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

You will also need to load the list of "easy" words into memory. This list can be found in the **`data/dale-chall.txt`** file. Open this file and read its entire contents into a **list** variable (e.g., `easy_words_list`), where each element in the list is a single line (word) in the file.

In [25]:
easy_words_list = []
with open('data/dale-chall.txt') as words:   
    #for line in words:
        #easy_words_list.append(line.rstrip())
    easy_words_list = words.read().splitlines()

In order to "look up" easy words, convert the easy words list into a **dictionary** (e.g., `easy_words_dict`), where each **key** is a word, and each **value** is `True` (that the word is in the list).
- Make sure you do not include newline characters in your keys!
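
A compact sketch of this conversion, assuming the lines may still carry trailing newlines: `dict.fromkeys()` builds the lookup in one call. The `sample_lines` list is a hypothetical stand-in for the file contents.

```python
# Hypothetical stand-in for lines read from the word list file
sample_lines = ['a\n', 'able\n', 'aboard\n']

# strip whitespace (including newlines) from each word, then map every word to True
sample_easy_words = dict.fromkeys((line.strip() for line in sample_lines), True)

print(sample_easy_words.get('able'))   # present in the sample
print(sample_easy_words.get('zebra'))  # absent, so .get() returns None
```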

In [27]:
easy_words_dict = {}
for line in easy_words_list:
    easy_words_dict[line]=True
    #easy_words_dict = dict(zip(easy_words_list,[True for x in range(0,len(easy_words_list))]))


{'a': True,
 'able': True,
 'aboard': True,
 'about': True,
 'above': True,
 'absent': True,
 'accept': True,
 'accident': True,
 'account': True,
 'ache': True,
 'aching': True,
 'acorn': True,
 'acre': True,
 'across': True,
 'act': True,
 'acts': True,
 'add': True,
 'address': True,
 'admire': True,
 'adventure': True,
 'afar': True,
 'afraid': True,
 'after': True,
 'afternoon': True,
 'afterward': True,
 'afterwards': True,
 'again': True,
 'against': True,
 'age': True,
 'aged': True,
 'ago': True,
 'agree': True,
 'ah': True,
 'ahead': True,
 'aid': True,
 'aim': True,
 'air': True,
 'airfield': True,
 'airplane': True,
 'airport': True,
 'airship': True,
 'airy': True,
 'alarm': True,
 'alike': True,
 'alive': True,
 'all': True,
 'alley': True,
 'alligator': True,
 'allow': True,
 'almost': True,
 'alone': True,
 'along': True,
 'aloud': True,
 'already': True,
 'also': True,
 'always': True,
 'am': True,
 'america': True,
 'american': True,
 'among': True,
 'amount': True,
 

Additionally, define a dictionary **`readability_grade_dict`** to use for looking up the "grade level" associated with a readability score (see [this table](https://en.wikipedia.org/wiki/Dale%E2%80%93Chall_readability_formula#Formula)). This dictionary should have **keys** that are ___tuples___ giving the range of score for a particular grade (e.g., `(5.0, 5.9)`), and **values** that are ___strings___ representing the grade (e.g., `"5th or 6th grade"`). 

In [30]:
readability_grade_dict = {(0.0, 4.9): '4th-grade student or lower', (5.0, 5.9): '5th or 6th-grade student', (6.0, 6.9): '7th or 8th-grade student',
     (7.0, 7.9): '9th or 10th-grade student', (8.0, 8.9): '11th or 12th-grade student', (9.0, 9.9): '13th to 15th-grade student',
     (10.0, 100.0): '16th-grade student or above'}

Define a function **`print_grade()`** that takes in a readability score (a number greater than or equal to 0), and **prints** a string representing the grade associated with that score (from your `readability_grade_dict` dictionary).
- _Hint:_ loop through the items in the dictionary and determine which "tuple" key has elements that the score falls between. Be sure to round the score to one decimal place.

In [33]:
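def print_grade(readability_score):
    # round once to one decimal place so the score lines up with the range boundaries
    rounded_score = round(readability_score, 1)
    for low, high in readability_grade_dict:
        # chained comparison: true when the score falls within this grade's range
        if low <= rounded_score <= high:
            print(readability_grade_dict[(low, high)])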

Now to calculate the readability scores! Define a function **`count_sentences()`** that counts the number of sentences in a string. Use the [sent_tokenize()](http://www.nltk.org/api/nltk.tokenize.html#nltk.tokenize.sent_tokenize) function from the `nltk.tokenize` module to break up a string into sentences (this is like the string `split()` function, but it splits into sentences rather than dividing by spaces).
- For help and an example, see [this guide](http://textminingonline.com/dive-into-nltk-part-ii-sentence-tokenize-and-word-tokenize).
- You do not need to do any extra processing beyond that provided by the `sent_tokenize()` function.
- Test your function on a simple pair or trio of sentences!

In [34]:
from nltk.tokenize import sent_tokenize

def count_sentences (sentences):
    sent_tokenize_list = sent_tokenize(sentences)
    num = len(sent_tokenize_list)
    return (num)

Define a function **`extract_words()`** that takes in a string and returns a _list_ of all of the words in that string. Use the [word_tokenize()](http://www.nltk.org/api/nltk.tokenize.html#nltk.tokenize.word_tokenize) function from the `nltk.tokenize` module to break up the string into words.
- The `nltk` tokenizer includes each punctuation character (e.g., commas, periods) as individual "words". Your list should not include these items. You can use a string method to determine whether or not the word starts with a punctuation symbol, and if so exclude it. _Hint_ think about keeping good words, rather than throwing away the bad! Note that you do not need to do any special consideration for contractions or other words that include their own punctuation.
- Test your function on a simple sentence (with punctuation!).

In [35]:
##This part is pretty tricky, so I'm going to give you a break and just share the function
##Take a minute and try to understand what it's doing
##Then uncomment the test cases and test it out with different sentences

from nltk.tokenize import word_tokenize
import string

def extract_words(text):
    raw_words = word_tokenize(text)
    #print(raw_words)
    words = []
    alphabet = string.ascii_letters
    for word in raw_words:
        if(word[0].isalpha()):
            words.append(word)
    return words

##SOME TEST CASES
##TRY IT WITH YOUR OWN SAMPLE SENTENCES!

# text = "this’s a sent tokenize test. this is sent two. is this sent three? Now it’s your turn." ##Answer 19
# print(len(extract_words(text))) #Test: PASS

# print(extract_words("This's a test"))

Define a function **`count_easy_words()`** that takes in a _list_ of words as an argument and returns the number of words that are "easy".

- Your function should look up each word in the `easy_words_dict` you defined earlier. _Do not look up words in the list_ (the dictionary is much faster!). Be careful to look up lowercase versions of the word.

- Your function should handle detecting different parts of speech (e.g., plurals, different verb conjugations, etc.). You can do this by using the **`WordNetLemmatizer()`** function from the `nltk.stem.wordnet` module&mdash;which produces a "lemmatizer" object. You can call the **`lemmatize()`** method on this object to reduce a word to its "base" form. See [this example](https://pythonprogramming.net/lemmatizing-nltk-tutorial/). Note that you should reduce words to both their basic noun AND verb forms (you will need to call the function twice: once with `'n'` (noun) and once with `'v'` (verb) as the second argument!)

- You can test your function on the word list: `['My','words','spoken','have','consequences']`, which should have 4 of the 5 words considered easy (not "consequences").

In [36]:
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

def count_easy_words(words_list):
    num = 0
    for word in words_list:
        n_words = wordnet_lemmatizer.lemmatize(word.lower(), pos = 'n')
        v_words = wordnet_lemmatizer.lemmatize(word.lower(), pos = 'v')
        if easy_words_dict.get(n_words) or easy_words_dict.get(v_words):
            num += 1
    return(num)

Define a function **`calc_readability_score()`** that takes in a string of text and returns a readability "score" for the text based on the [Dale-Chall readability formula](https://en.wikipedia.org/wiki/Dale%E2%80%93Chall_readability_formula#Formula). Call your previous functions to calculate the number of sentences, total words, and number of difficult (not easy) words.
- Don't forget to adjust the score if the text is more than 5% difficult words!

In [37]:
def calc_readability_score(text_string):
    # tokenize once and reuse the word list for both counts
    word_list = extract_words(text_string)
    words = len(word_list)
    difficult_words = words - count_easy_words(word_list)
    sentences = count_sentences(text_string)
    dc_readability_score = (0.1579 * ((difficult_words / words) * 100)) + (0.0496 * (words / sentences))
    # add the adjustment when more than 5% of the words are difficult
    if difficult_words / words > 0.05:
        adjusted_dc_readability_score = 3.6365 + dc_readability_score
    else:
        adjusted_dc_readability_score = dc_readability_score
    return adjusted_dc_readability_score

Read in the text of the `data/alice.txt` file (the full text of Alice in Wonderland) _as a single string_. 

In [23]:
with open('data/alice.txt') as words:  
    alice_story = words.read()

str(alice_story)



Calculate the readability score for the `alice.txt` file and print it out. Then print out the reading grade associated with that score. Use your previously-defined functions!
- For testing, note that my calculations show `alice.txt` has 977 sentences and 27198 words, of which 3610 are difficult. This leads to a readability score of ~7.113.
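
The arithmetic behind that check can be sketched directly from the counts quoted above, without tokenizing anything. The function name `dale_chall_from_counts()` is an assumption for illustration; the constants are the ones used in the Dale-Chall formula linked earlier.

```python
def dale_chall_from_counts(num_words, num_difficult, num_sentences):
    """Compute the Dale-Chall score from pre-computed counts."""
    percent_difficult = 100 * num_difficult / num_words
    score = 0.1579 * percent_difficult + 0.0496 * (num_words / num_sentences)
    # the formula adds 3.6365 when more than 5% of the words are difficult
    if percent_difficult > 5:
        score += 3.6365
    return score

# Counts quoted in the note above for alice.txt
score = dale_chall_from_counts(27198, 3610, 977)
print(round(score, 3))
```

If your own tokenization yields slightly different counts, the score should still land close to the quoted value of about 7.113.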

In [38]:
readability_sco = calc_readability_score(alice_story)

print(readability_sco)
print_grade(int(readability_sco))

7.113645151364204
9th or 10th-grade student


_Note that this result may not be an especially accurate model of a text's readability&mdash;after all, it's just based on a simple estimation!_