# a4 - Data Analysis
Fill in the code cells below as specified. Note that cells may use variables and functions defined in previous cells; we should be able to use the `Kernel > Restart & Clear All` menu item followed by `Cell > Run All` to execute your entire notebook and see the correct output.

## Part 1. Numbers
For this part of the assignment, you will analyze some numeric data (counts of library holdings) to investigate how the distribution of numbers in natural data sets obeys the counter-intuitive [Benford's Law](https://plus.maths.org/content/os/issue9/features/benford/index).

<small>(This exercise was adapted from Steve Wolfman).</small>

Create a variable **`holdings_data`** which is a **list** of the contents of the **`data/libraryholdings.txt`** file included in the repository (each line in the file should be a single element in the list). You will need to open up the file and read its contents into a list. You should specify a _local path_ to the file (from this notebook's location).

In [1]:
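# Open the file with a context manager so it is closed once the lines are read
with open("data/libraryholdings.txt") as holdings_file:
    holdings_data = holdings_file.readlines()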

Print out the first **ten** items from the `holdings_data` list, each on its own line. (Note that there may be extra line breaks that are included in the data items themselves).

In [2]:
print(*holdings_data[:10], sep="\n")

(* Library holdings (# of books in each library), *)

(* collected by Christian Ayotte.                 *)

(* Labels not available.                          *)



12201

600778

14926

37863

14866

9896



Use the **slice operator (`:`)** to remove the "heading" and blank elements from the beginning of the data list, leaving only the list of numbers. The remaining values should continue to be stored (re-stored) in the `holdings_data` variable. Output the new first element in `holdings_data` to demonstrate that it is the first number in the data set.
- The values in the list _should_ be strings rather than integers

In [3]:
# Skip the three header lines and the blank line that precede the numbers
holdings_data = holdings_data[4:]
print(holdings_data[0])

12201



Create a variable **`lead_digit_counts`** that is a dictionary whose keys are _strings_ of each digit (`"0"`, `"1"`, `"2"`, etc.), and whose values are all the number `0`. You can do this directly or with a loop. Print out the variable after you create it.

In [4]:
lead_digit_counts={"0":0,"1":0,"2":0,"3":0,"4":0,"5":0,"6":0,"7":0,"8":0,"9":0}
print(lead_digit_counts)

{'0': 0, '1': 0, '2': 0, '3': 0, '4': 0, '5': 0, '6': 0, '7': 0, '8': 0, '9': 0}
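
The same dictionary can also be built with a loop instead of being typed out by hand. A minimal sketch, where `loop_digit_counts` is a hypothetical name used only for this illustration:

```python
# Build the zeroed digit dictionary with a loop instead of writing out every key.
# `loop_digit_counts` is a hypothetical name chosen for this illustration.
loop_digit_counts = {}
for digit in range(10):
    loop_digit_counts[str(digit)] = 0  # keys are the strings "0" through "9"

print(loop_digit_counts)
```

Converting each number with `str()` keeps the keys as strings, which matters later because the first character of each data value is also a string.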


Calculate the number of times each digit appears as the _first digit_ in a value of the `holdings_data` list, storing those counts in the `lead_digit_counts` dictionary.

In [5]:
import collections

# Count the first character of every value; Counter keeps keys in first-seen order
counter = collections.Counter(value[0] for value in holdings_data)
lead_digit_counts = dict(counter)
# Make sure the "0" key exists even when no value starts with 0
lead_digit_counts.setdefault("0", 0)
print(lead_digit_counts)

{'1': 3056, '6': 560, '3': 1018, '9': 452, '8': 503, '2': 1606, '7': 502, '5': 640, '4': 801, '0': 0}


Use a loop to print out each count in `lead_digit_counts` with the format:
```
X values have a leading digit of digit Y
```

In [6]:
for key in sorted(lead_digit_counts):
    print(lead_digit_counts[key], 'values have a leading digit of digit', key)

0 values have a leading digit of digit 0
3056 values have a leading digit of digit 1
1606 values have a leading digit of digit 2
1018 values have a leading digit of digit 3
801 values have a leading digit of digit 4
640 values have a leading digit of digit 5
560 values have a leading digit of digit 6
502 values have a leading digit of digit 7
503 values have a leading digit of digit 8
452 values have a leading digit of digit 9


Print the _percentage_ of values in the library holdings data set that have a leading digit **`1`** (round to 2 decimal places). Is this value as predicted by Benford's law?

In [7]:
percentage1 = round(lead_digit_counts['1'] / sum(lead_digit_counts.values()) * 100, 2)
print(str(percentage1) + '%')
print("The percentage of values with a leading digit of 1 is close to the roughly 30% predicted by Benford's law.")

33.44%
The percentage of values with a leading digit of 1 is close to the roughly 30% predicted by Benford's law.
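
To check more than one digit, Benford's law gives the expected share of values with leading digit d as log10(1 + 1/d), which is about 30.1% for d = 1. The sketch below compares that prediction with the counts printed earlier in this notebook; `observed_counts` and `benford_percentage` are hypothetical names introduced for this illustration, and the counts are copied by hand from the output above.

```python
import math

# Leading-digit counts copied from the output printed earlier in this notebook.
observed_counts = {"1": 3056, "2": 1606, "3": 1018, "4": 801, "5": 640,
                   "6": 560, "7": 502, "8": 503, "9": 452}
total = sum(observed_counts.values())


def benford_percentage(digit):
    """Percentage of values Benford's law predicts will start with `digit` (1-9)."""
    return math.log10(1 + 1 / digit) * 100


# Print the observed and expected percentage side by side for each digit
for digit in range(1, 10):
    observed = observed_counts[str(digit)] / total * 100
    expected = benford_percentage(digit)
    print(f"{digit}: observed {observed:5.2f}%  expected {expected:5.2f}%")
```

The digit `0` is left out because Benford's law only describes non-zero leading digits.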


***Extra credit challenge:*** Create a single variable `digit_position_counts` that contains the number of times that each digit 0 through 9 appears in _each_ position in the data set. E.g., a `1` appears in the 1st position 3056 times and in the second position 1005 times; a `2` appears in the 1st position 1606 times and in the second position 1044 times.

Use this variable to print a "table" of the percentage of the time each position contains each digit (e.g., the 1st digit is a `1` 33.44% of the time, a `2` 17.57% of the time, etc).

Note that for this extra challenge it is up to you to determine an appropriate data structure (e.g., how to combine dictionaries and lists and tuples) for representing this table. Be sure to include comments explaining your work.

Only attempt this problem once you have completed everything else!
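
One possible shape for `digit_position_counts` is a list with one dictionary per digit position. The sketch below is only an illustration of that choice, not the expected solution: it runs on the first six values shown earlier (`sample_values` is a hypothetical stand-in for `holdings_data`), so its percentages will not match the full data set.

```python
# The first six values from the data set, with trailing newlines as read from the file.
sample_values = ["12201\n", "600778\n", "14926\n", "37863\n", "14866\n", "9896\n"]

# digit_position_counts[i] maps each digit string to how often it occurs at position i
digit_position_counts = []
for value in sample_values:
    digits = value.strip()
    for position, digit in enumerate(digits):
        # Grow the list the first time a longer number is seen
        if position == len(digit_position_counts):
            digit_position_counts.append({str(d): 0 for d in range(10)})
        digit_position_counts[position][digit] += 1

# Print one row per position: percentage of values with each digit in that position
for position, counts in enumerate(digit_position_counts, start=1):
    position_total = sum(counts.values())
    row = "  ".join(f"{d}:{counts[d] / position_total * 100:5.1f}%" for d in sorted(counts))
    print(f"position {position}  {row}")
```

Each row is divided by its own total because shorter numbers do not have a digit in the later positions.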

## Part 2. Life Expectancy
For this part of the assignment, you'll work with data about the life expectancy (in years) for each country in the world in the years 1960 and 2013. Note that this can be really [fun](http://www.ted.com/talks/hans_rosling_shows_the_best_stats_you_ve_ever_seen.html) data!

The data is found in a [.csv](https://en.wikipedia.org/wiki/Comma-separated_values) file: a plain-text data format where each line represents a record (row) of data and where each feature (column) is separated by a comma.

Read in the contents of the **`data/life_expectancy.csv`** data file, and use it to construct a **list** called **`life_expectancy_list`**. Each element in this list should be a **dictionary** (one for each row in the `csv` file) with the following keys and values:

- a key `'country'` whose value is the name of the country (as a string)
- a key `'le_1960'` whose value is the life expectancy in 1960 (as a float)
- a key `'le_2013'` whose value is the life expectancy in 2013 (as a float)

Thus the first record should look like:
```
{'country': 'Aruba', 'le_1960': 65.56936585, 'le_2013': 75.33217073}
```

You should use the **`csv`** module to read this file and break up each row into different values. See [the documentation](https://docs.python.org/3/library/csv.html) for an example of how to do this. Print out the _first row_ of your list as a demonstration that you've processed the data correctly.
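
As a small illustration of how the `csv` module splits rows, the sketch below parses an in-memory stand-in for the file with `csv.reader`. The three-column layout of `sample_csv` is an assumption (the real file may contain other columns), and `sample_records` is a hypothetical name.

```python
import csv
import io

# A two-row stand-in for the data file; the column layout is an assumption.
sample_csv = io.StringIO(
    "country,le_1960,le_2013\n"
    "Aruba,65.56936585,75.33217073\n"
    "Afghanistan,31.58004878,60.93141463\n"
)

reader = csv.reader(sample_csv)
header = next(reader)  # the first row holds the column names

sample_records = []
for country, le_1960, le_2013 in reader:
    # csv always yields strings, so convert the numeric columns explicitly
    sample_records.append({
        "country": country,
        "le_1960": float(le_1960),
        "le_2013": float(le_2013),
    })

print(header)
print(sample_records[0])
```

`csv.DictReader` does the header handling for you by returning each row as a dictionary keyed by column name, but the values still arrive as strings.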

In [8]:
import csv

life_expectancy_list = []
with open('data/life_expectancy.csv', 'r', newline='') as csv_file:
    for row in csv.DictReader(csv_file):
        # The csv module returns every value as a string, so convert the numbers to floats
        life_expectancy_list.append({
            'country': row['country'],
            'le_1960': float(row['le_1960']),
            'le_2013': float(row['le_2013'])
        })

print(life_expectancy_list[0])

{'country': 'Aruba', 'le_1960': 65.56936585, 'le_2013': 75.33217073}


Add another item to each dictionary in the `life_expectancy_list` whose **key** is `change` and whose **value** is the change in life expectancy from 1960 to 2013.

In [9]:
for item in life_expectancy_list:
    item['change'] = item['le_2013'] - item['le_1960']
print(life_expectancy_list)

[{'country': 'Aruba', 'le_1960': 65.56936585, 'le_2013': 75.33217073, 'change': 9.762804880000004}, {'country': 'Afghanistan', 'le_1960': 31.58004878, 'le_2013': 60.93141463, 'change': 29.35136585}, {'country': 'Angola', 'le_1960': 32.98482927, 'le_2013': 51.86617073, 'change': 18.88134146}, {'country': 'Albania', 'le_1960': 62.25436585, 'le_2013': 77.5372439, 'change': 15.282878050000008}, {'country': 'United Arab Emirates', 'le_1960': 52.24321951, 'le_2013': 77.13129268, 'change': 24.88807317}, {'country': 'Argentina', 'le_1960': 65.21553659, 'le_2013': 76.18729268, 'change': 10.97175609}, {'country': 'Armenia', 'le_1960': 65.86346341, 'le_2013': 74.5407561, 'change': 8.677292690000002}, {'country': 'Antigua and Barbuda', 'le_1960': 61.78273171, 'le_2013': 75.82929268, 'change': 14.046560969999994}, {'country': 'Australia', 'le_1960': 70.81707317, 'le_2013': 82.19756098, 'change': 11.380487810000005}, {'country': 'Austria', 'le_1960': 68.58560976

Create a variable **`num_small_gain`** that stores the **number of countries** whose life expectancy did not improve by 5 years or more between 1960 and 2013. This will include countries whose life expectancy has worsened. Print out this variable.

In [10]:
num_small_gain = 0
for item in life_expectancy_list:
    # "Did not improve by 5 years or more" means a change strictly below 5
    if item['change'] < 5:
        num_small_gain += 1
        print('Name of countries whose life expectancy did not improve by 5 years or more between 1960 and 2013 ' + item['country'])
print(num_small_gain)

Name of countries whose life expectancy did not improve by 5 years or more between 1960 and 2013 Belarus
Name of countries whose life expectancy did not improve by 5 years or more between 1960 and 2013 Botswana
Name of countries whose life expectancy did not improve by 5 years or more between 1960 and 2013 Lesotho
Name of countries whose life expectancy did not improve by 5 years or more between 1960 and 2013 Lithuania
Name of countries whose life expectancy did not improve by 5 years or more between 1960 and 2013 Latvia
Name of countries whose life expectancy did not improve by 5 years or more between 1960 and 2013 Swaziland
Name of countries whose life expectancy did not improve by 5 years or more between 1960 and 2013 Ukraine
7


Create a variable **`most_improved`** that is the **name of the country** with the largest gain in life expectancy (between 1960 and 2013). Print out this variable.

In [11]:
# Pick the record with the largest change; max() returns the first one on ties
largest_gain = max(life_expectancy_list, key=lambda item: item['change'])
max1 = largest_gain['change']
most_improved = largest_gain['country']
print(max1)
print(most_improved+' is the name of the country with the largest gain in life expectancy (between 1960 and 2013)')

42.07575609
Maldives is the name of the country with the largest gain in life expectancy (between 1960 and 2013)


Define a function **`compare_country_le()`** that takes in the names of _two_ countries, and returns a **tuple** containing the following information:
- the name of the country with the greater life expectancy,
- the life expectancy in 2013 of that country
- the difference between the life expectancies in 2013

Use your function to print the comparison between the life expectancies of the _United States_ and _Cuba_.  

In [12]:
def compare_country_le(country1, country2):
    # Look up the 2013 life expectancy of each country, converted to a number
    # so that the comparison below is numeric rather than alphabetical
    le_2013 = {}
    for item in life_expectancy_list:
        if item['country'] in (country1, country2):
            le_2013[item['country']] = float(item['le_2013'])

    # Order the two countries so that `higher` has the greater life expectancy
    if le_2013[country1] > le_2013[country2]:
        higher, lower = country1, country2
    else:
        higher, lower = country2, country1

    # Difference between the two values after rounding each to 2 decimal places
    difference = round(le_2013[higher], 2) - round(le_2013[lower], 2)
    result = (higher, str(le_2013[higher]), str(difference))
    print('Name of the country with the greater life expectancy:', result[0],'\nLife expectancy in 2013:', result[1],'\nDifference between the life expectancies in 2013:', result[2])
    return result

compare_country_le('United States','Cuba')

Name of the country with the greater life expectancy: Cuba 
Life expectancy in 2013: 79.23926829 
Difference between the life expectancies in 2013: 0.3999999999999915


('Cuba', '79.23926829', '0.3999999999999915')

## Part 3. Readability
For this part of the assignment, you will calculate the [readability](https://en.wikipedia.org/wiki/Readability) of a text document using the [Dale-Chall Readability Formula](http://www.readabilityformulas.com/new-dale-chall-readability-formula.php). This method determines how "easy" it is to read a particular (English) document by considering the length of sentences and how many of the words used are "easy" to understand (based on a pre-defined list of "easy" words).

Splitting real-world text documents into words and sentences is non-trivial (English is hard!). To make this easier, you should use the [Natural Language Toolkit (nltk)](http://www.nltk.org/index.html) module. This module is included with Anaconda, but does require some additional data source files to be installed on your computer. You _should_ be able to do this by running the below cell (you only need to run it once).

In [13]:
from nltk import download
download('punkt')
download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\MAYURESH\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\MAYURESH\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

You will also need to load the list of "easy" words into memory. This list can be found in the **`data/dale-chall.txt`** file. Open this file and read its entire contents into a **list** variable (e.g., `easy_words_list`), where each element in the list is a single line (word) in the file.

In [14]:
# Read one word per line, dropping the trailing newline from each
with open('data/dale-chall.txt', 'r') as easy_words_file:
    easy_words_list = [line.strip('\n') for line in easy_words_file]
print(easy_words_list)

['a', 'able', 'aboard', 'about', 'above', 'absent', 'accept', 'accident', 'account', 'ache', 'aching', 'acorn', 'acre', 'across', 'act', 'acts', 'add', 'address', 'admire', 'adventure', 'afar', 'afraid', 'after', 'afternoon', 'afterward', 'afterwards', 'again', 'against', 'age', 'aged', 'ago', 'agree', 'ah', 'ahead', 'aid', 'aim', 'air', 'airfield', 'airplane', 'airport', 'airship', 'airy', 'alarm', 'alike', 'alive', 'all', 'alley', 'alligator', 'allow', 'almost', 'alone', 'along', 'aloud', 'already', 'also', 'always', 'am', 'america', 'american', 'among', 'amount', 'an', 'and', 'angel', 'anger', 'angry', 'animal', 'another', 'answer', 'ant', 'any', 'anybody', 'anyhow', 'anyone', 'anything', 'anyway', 'anywhere', 'apart', 'apartment', 'ape', 'apiece', 'appear', 'apple', 'april', 'apron', 'are', "aren't", 'arise', 'arithmetic', 'arm', 'armful', 'army', 'arose', 'around', 'arrange', 'arrive', 'arrived', 'arrow', 'art', 'artist', 'as', 'ash', 'ashes', 'aside', 'ask', 'asleep', 'at', 'ate'

In order to "look up" easy words, convert the easy words list into a **dictionary** (e.g., `easy_words_dict`), where each **key** is a word, and each **value** is `True` (that the word is in the list).
- Make sure you do not include newline characters in your keys!
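
A dictionary is used here because looking up a key is a hash lookup rather than a scan through every element of a list. The sketch below shows the conversion and two ways to query it on a few hypothetical lines; `sample_lines` and `sample_words_dict` are names made up for this illustration.

```python
# Hypothetical lines as they would come back from readlines(), newlines included.
sample_lines = ["a\n", "able\n", "aboard\n", "about\n"]

# Strip the newline from each line, then map every word to True.
sample_words_dict = {line.rstrip("\n"): True for line in sample_lines}

print(sample_words_dict)
print("able" in sample_words_dict)      # membership test on the keys
print(sample_words_dict.get("zebra"))   # .get() returns None for a missing key
```

Using `.get()` avoids a `KeyError` for words that are not in the dictionary, which is convenient when most looked-up words may be missing.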

In [15]:
easy_words_dict={}
for i in easy_words_list:
    easy_words_dict[i]=True
easy_words_dict

{'a': True,
 'able': True,
 'aboard': True,
 'about': True,
 'above': True,
 'absent': True,
 'accept': True,
 'accident': True,
 'account': True,
 'ache': True,
 'aching': True,
 'acorn': True,
 'acre': True,
 'across': True,
 'act': True,
 'acts': True,
 'add': True,
 'address': True,
 'admire': True,
 'adventure': True,
 'afar': True,
 'afraid': True,
 'after': True,
 'afternoon': True,
 'afterward': True,
 'afterwards': True,
 'again': True,
 'against': True,
 'age': True,
 'aged': True,
 'ago': True,
 'agree': True,
 'ah': True,
 'ahead': True,
 'aid': True,
 'aim': True,
 'air': True,
 'airfield': True,
 'airplane': True,
 'airport': True,
 'airship': True,
 'airy': True,
 'alarm': True,
 'alike': True,
 'alive': True,
 'all': True,
 'alley': True,
 'alligator': True,
 'allow': True,
 'almost': True,
 'alone': True,
 'along': True,
 'aloud': True,
 'already': True,
 'also': True,
 'always': True,
 'am': True,
 'america': True,
 'american': True,
 'among': True,
 'amount': True,
 

Additionally, define a dictionary **`readability_grade_dict`** to use for looking up the "grade level" associated with a readability score (see [this table](https://en.wikipedia.org/wiki/Dale%E2%80%93Chall_readability_formula#Formula)). This dictionary should have **keys** that are ___tuples___ giving the range of score for a particular grade (e.g., `(5.0, 5.9)`), and **values** that are ___strings___ representing the grade (e.g., `"5th or 6th grade"`). 

In [16]:
# Each key is a (low, high) tuple giving the range of scores for that grade
readability_grade_dict = {
    (5.0, 5.9): "5th or 6th grade",
    (6.0, 6.9): "6th or 7th grade",
    (7.0, 7.9): "7th or 8th grade",
    (8.0, 8.9): "8th or 9th grade",
    (9.0, 9.9): "9th or 10th grade"
}

readability_grade_dict

{(5.0, 5.9): '5th or 6th grade',
 (6.0, 6.9): '6th or 7th grade',
 (7.0, 7.9): '7th or 8th grade',
 (8.0, 8.9): '8th or 9th grade',
 (9.0, 9.9): '9th or 10th grade'}

Define a function **`print_grade()`** that takes in a readability score (a number greater than or equal to 0), and **prints** a string representing the grade associated with that score (from your `readability_grade_dict` dictionary).
- _Hint:_ loop through the items in the dictionary and determine which "tuple" key has elements that the score falls between. Be sure to round to the nearest decimal.

In [17]:
def print_grade(readability_score):
    score_value = round(readability_score, 1)
    # Find the (low, high) range that contains the rounded score
    for (low, high), grade in readability_grade_dict.items():
        if low <= score_value <= high:
            return grade
    # No range in the dictionary covers this score
    return "This score is not valid"

test = print_grade(7.8)
test

'7th or 8th grade'

Now to calculate the readability scores! Define a function **`count_sentences()`** that counts the number of sentences in a string. Use the [sent_tokenize()](http://www.nltk.org/api/nltk.tokenize.html#nltk.tokenize.sent_tokenize) function from the `nltk.tokenize` module to break up a string into sentences (this is like the string `split()` function, but it splits into sentences rather than dividing by spaces).
- For help and an example, see [this guide](http://textminingonline.com/dive-into-nltk-part-ii-sentence-tokenize-and-word-tokenize).
- You do not need to do any extra processing beyond that provided by the `sent_tokenize()` function.
- Test your function on a simple pair or trio of sentences!

In [18]:
from nltk.tokenize import sent_tokenize
def count_sentences(text):
    sentence= sent_tokenize(text, language='english')
    return(len(sentence))
sentences= count_sentences('My name is Optimus Prime. I am an Autobot. We are autonomous species from the planet Cybertron. I am the good guy')
print(sentences)

4


Define a function **`extract_words()`** that takes in a string and returns a _list_ of all of the words in that string. Use the [word_tokenize()](http://www.nltk.org/api/nltk.tokenize.html#nltk.tokenize.word_tokenize) function from the `nltk.tokenize` module to break up the string into words.
- The `nltk` tokenizer includes each punctuation character (e.g., commas, periods) as individual "words". Your list should not include these items. You can use a string method to determine whether or not the word starts with a punctuation symbol, and if so exclude it. _Hint_ think about keeping good words, rather than throwing away the bad! Note that you do not need to do any special consideration for contractions or other words that include their own punctuation.
- Test your function on a simple sentence (with punctuation!).
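
The "keep the good words" idea can be tried without `nltk` by filtering a hand-written token list. This is only a sketch: `sample_tokens` is a hypothetical list written to resemble tokenizer output with punctuation split off, not actual output from `word_tokenize()`.

```python
# Tokens as a tokenizer that splits off punctuation might return them
# (hand-written here so the example runs without nltk).
sample_tokens = ["I", "love", "deadlines", ".", "I", "like", "the", "sound", "!", "!"]

# Keep a token only when its first character is a letter.
kept_words = [token for token in sample_tokens if token[0].isalpha()]
print(kept_words)
```

Testing the first character with `isalpha()` keeps words and drops the stand-alone punctuation tokens, with no need to list every punctuation symbol.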

In [19]:
##This part is pretty tricky, so I'm going to give you a break and just share the function
##Take a minute and try to understand what it's doing
##Then uncomment the test cases and test it out with different sentences

from nltk.tokenize import word_tokenize

def extract_words(text):
    raw_words = word_tokenize(text)
    words = []
    for word in raw_words:
        # Keep only tokens that start with a letter, which drops punctuation tokens
        if word[0].isalpha():
            words.append(word)
    return words

##SOME TEST CASES
##TRY IT WITH YOUR OWN SAMPLE SENTENCES!

# text = "this’s a sent tokenize test. this is sent two. is this sent three? Now it’s your turn." ##Answer 19
# print(len(extract_words(text))) #Test: PASS

print(extract_words("I love deadlines. I like the whooshing sound they make as they fly by!!"))

['I', 'love', 'deadlines', 'I', 'like', 'the', 'whooshing', 'sound', 'they', 'make', 'as', 'they', 'fly', 'by']


Define a function **`count_easy_words()`** that takes in a _list_ of words as an argument and returns the number of words that are "easy".

- Your function should look up each word in the `easy_words_dict` you defined earlier. _Do not look up words in the list_ (the dictionary is much faster!). Be careful to look up lowercase versions of the word.

- Your function should handle detecting different parts of speech (e.g., plurals, different verb conjugations, etc.). You can do this by using the **`WordNetLemmatizer()`** function from the `nltk.stem.wordnet` module&mdash;which produces a "lemmatizer" object. You can call the **`lemmatize()`** method on this object to reduce a word to its "base" form. See [this example](https://pythonprogramming.net/lemmatizing-nltk-tutorial/). Note that you should reduce words to both their basic noun AND verb forms (you will need to call the function twice: once with `'n'` (noun) and once with `'v'` (verb) as the second argument!)

- You can test your function on the word list: `['My','words','spoken','have','consequences']`, which should have 4 of the 5 words considered easy (not "consequences").

In [20]:
from nltk.stem.wordnet import WordNetLemmatizer

def count_easy_words(word_list):
    easy_words_found=[]
    lemmatizer = WordNetLemmatizer()
    
    for i in word_list:
        word=i.lower()
        wordverb=lemmatizer.lemmatize(word,'v')
        wordnoun=lemmatizer.lemmatize(word,'n')
        
        if(easy_words_dict.get(wordverb) or easy_words_dict.get(wordnoun)):
            easy_words_found.append(i)

    return(len(easy_words_found))
wordlist=['My','words','spoken','have','consequences']
count=count_easy_words(wordlist)
print(count)

4


Define a function **`calc_readability_score()`** that takes in a string of text and returns a readability "score" for the text based on the [Dale-Chall readability formula](https://en.wikipedia.org/wiki/Dale%E2%80%93Chall_readability_formula#Formula). Call your previous functions to calculate the number of sentences, total words, and number of difficult (not easy) words.
- Don't forget to adjust the score if the text is more than 5% difficult words!

In [21]:
def calc_readability_score(text):
    sentences = count_sentences(text)
    word_list = extract_words(text)  # tokenize once and reuse the list
    words = len(word_list)
    difficult_words = words - count_easy_words(word_list)
    difficult_word_percentage = difficult_words * 100 / words
    score = 0.1579 * difficult_word_percentage + 0.0496 * (words / sentences)
    # Adjust the score when more than 5% of the words are difficult
    if difficult_word_percentage > 5:
        score += 3.6365
    return score

Read in the text of the `data/alice.txt` file (the full text of Alice in Wonderland) _as a single string_. 

In [22]:
with open('data/alice.txt',encoding='utf-8') as txtfile:
    alice = txtfile.read()
alice



Calculate the readability score for the `alice.txt` file and print it out. Then print out the reading grade associated with that score. Use your previously-defined functions!
- For testing, note that my calculations show `alice.txt` has 977 sentences and 27198 words, of which 3610 are difficult. This leads to a readability score of ~7.113.
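
The arithmetic behind that figure can be checked by hand from the three quoted counts. The sketch below plugs them into the Dale-Chall formula directly, without any of the notebook's functions; the variable names are hypothetical.

```python
# Counts quoted in the testing note: sentences, words, and difficult words.
num_sentences = 977
num_words = 27198
num_difficult = 3610

percent_difficult = num_difficult / num_words * 100
average_sentence_length = num_words / num_sentences

raw_score = 0.1579 * percent_difficult + 0.0496 * average_sentence_length
# The formula adds 3.6365 when more than 5% of the words are difficult.
adjusted_score = raw_score + 3.6365 if percent_difficult > 5 else raw_score

print(round(adjusted_score, 3))
```

Roughly 13% of the words are difficult, so the adjustment applies, and the result agrees with the ~7.113 quoted above.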

In [23]:
score=calc_readability_score(alice)
print(score)
# Store the result under a new name so the print_grade function is not overwritten
grade = print_grade(score)
grade

7.113090902410715


'7th or 8th grade'

_Note that this result may not be an especially accurate model of a text's readability&mdash;after all, it's just based on a simple estimation!_