**Federalist papers**

Alexander Hamilton, James Madison, or John Jay?  For more than 150 years, historians argued over the authorship of the 12 essays in _The Federalist Papers_. It wasn't until 1963 that the mystery was solved by Frederick Mosteller of Harvard University and David Wallace of the University of Chicago. [Nabokov's Favorite Word Is _Mauve_ by Ben Blatt]

Full text of _The Federalist Papers_ is available at http://www.gutenberg.org/ebooks/1404

In [1]:
# Path to our data file (source file)
source_file_name = 'federalist_papers.txt'
all_text = ''  # stays empty if the file cannot be read
# Error handling in case the file cannot be opened
try:
    # The with-statement closes the file automatically
    with open(source_file_name, 'r') as fed_papers_file:
        # We can read all text at once
        all_text = fed_papers_file.read()
except IOError:
    # cannot open file
    print("Cannot open the source file")
except Exception:
    # anything else that went wrong
    print("An unexpected error has occurred")
# print(all_text)



In [26]:
# There are a couple of ways we could find frequencies of the words "while" and "whilst".  
# For now, let's convert our chunk of text into a list of words

word_list = all_text.split(" ")

In [27]:
# Will this work?  Are words always separated by spaces?
# There are better methods for parsing text (for example, the NLTK toolkit),
# but for now we'll take care of things in a quick and dirty way

punctuation_marks = ['!','.', ',', ':', ';', '?', '-', '\n', '(', ')', '"']
for pm in punctuation_marks:
    all_text = all_text.replace(pm, ' ')
                     
# print(all_text)
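
As an alternative to replacing punctuation marks one at a time, a regular expression can pull the words out directly. The following is only a minimal sketch, not the method used in this notebook: the function name `tokenize` and the sample sentence are made up for illustration, and the pattern assumes plain English text without accented letters or digits.

```python
import re

def tokenize(text):
    """Return the lower-case words in text, ignoring punctuation and whitespace."""
    # [a-z]+ matches a run of letters; the optional group keeps an internal
    # apostrophe, so "people's" stays a single token
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())

# Hypothetical sample text, not taken from the data file
sample = "Whilst we wait, others act.\nWhile (it is said) the PEOPLE'S voice..."
print(tokenize(sample))
```

Because the pattern only matches letters, newlines, parentheses, and repeated spaces never produce empty strings in the result.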

In [28]:
# It would be a good idea to convert everything to lower case before we do anything else
all_text = all_text.lower()

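# Now let's build a list of words
# split() with no argument splits on any run of whitespace, so the extra
# spaces left by the punctuation step do not produce empty strings
word_list = all_text.split()
# print(word_list)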

In [29]:
# Now, let's find the frequencies of "while" and "whilst"

freq_while = 0
freq_whilst = 0
for word in word_list:
    if word == "while":
        freq_while = freq_while + 1
    if word == "whilst":
        freq_whilst = freq_whilst + 1
        
print("The frequency of 'while' is: " + str(freq_while))
print("The frequency of 'whilst' is: " + str(freq_whilst))

The frequency of 'while' is: 39
The frequency of 'whilst' is: 24
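
The same counts can also be obtained with `collections.Counter` from the standard library, which tallies every word in one pass. This is a sketch under stated assumptions: the helper name `count_words` and the sample sentence are hypothetical, and the list of words is assumed to be already lower-cased and stripped of punctuation.

```python
from collections import Counter

def count_words(words, targets):
    """Return {target: count} for each target word in the list of words."""
    counts = Counter(words)  # one pass over the list tallies every word
    # A Counter returns 0 for a word it has never seen, so no KeyError here
    return {target: counts[target] for target in targets}

# Hypothetical sample, not taken from the data file
sample_words = "while some waited whilst others acted while they could".split()
print(count_words(sample_words, ["while", "whilst", "upon"]))
# prints {'while': 2, 'whilst': 1, 'upon': 0}
```

With the notebook's own data, the call would look like `count_words(word_list, ["while", "whilst"])`.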


**Question**: Why do we care about word frequencies?  Can you give me a use case with EMR data where this would be useful?
**Answer**: If a patient has had a heart attack or stroke, we could count how many times the event appears in his or her history to judge whether it is a recurring, possibly hereditary, problem or a first occurrence, and treat the patient accordingly.
We could also count how many times the patient was prescribed specific medications that may contribute to such an event.


In [31]:
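# Ask the user for a keyword
# Error handling in case the user presses the interrupt key (Ctrl-C)
keyword = ''  # stays empty if the user cancels
try:
    # Lower-case the keyword because the text was lower-cased earlier
    keyword = input('Enter keyword to search word frequency: ').strip().lower()
except KeyboardInterrupt:
    print('You cancelled the operation.')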

Enter keyword to search word frequency: which


In [32]:
# Function to convert a list into a dictionary of {item: frequency}
def list_to_dict(data_list):
    data_dict = {}  # create a blank dictionary

    for item in data_list:  # iterate through the list (passed as a parameter)
        if item in data_dict:
            data_dict[item] += 1  # seen before: increment its count
        else:
            data_dict[item] = 1  # first occurrence: add key/value pair

    return data_dict  # return the resulting dictionary after the loop ends

In [33]:
# Read the stop words from the provided text file and return its lines as a Python list
# Returns None if the file cannot be read
def read_stop_words(file):
    data = None
    try:
        # best case scenario: the with-statement closes the file for us
        with open(file, "r") as f:
            data = f.readlines()
    except IOError:
        # cannot open file
        print("Cannot open the stop words file")
    except Exception:
        # anything else that went wrong
        print("An unexpected error has occurred")
    return data

In [56]:
# Create a list to store the stop words
stopwords = []
data = read_stop_words("common-english-words.txt")

# Each line of the file holds comma-separated stop words; collect all of them
if data is not None:
    for item in data:
        stopwords.extend(w.strip() for w in item.split(",") if w.strip())

# Remove stop words from the body of the Federalist Papers.
# Build a new list rather than calling word_list.remove() inside the loop:
# removing items from a list while iterating over it skips elements,
# which is why a single pass did not remove every stop word.
stopword_set = set(stopwords)  # set membership tests are fast
word_list = [word for word in word_list if word not in stopword_set]
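
Removing items from a list while looping over that same list can skip elements, which would explain why one pass over `word_list` left some stop words behind. The sketch below illustrates the effect on a tiny made-up list; the function names and the `demo_` variables are hypothetical and exist only for this demonstration.

```python
def remove_in_place(words, stopwords):
    """Buggy approach: removes from the list it is looping over."""
    words = list(words)  # copy so the caller's list is untouched
    for word in words:
        if word in stopwords:
            # remove() shifts later items left, so the loop skips the next one
            words.remove(word)
    return words

def remove_by_filtering(words, stopwords):
    """Safer approach: build a new list of the words to keep."""
    return [word for word in words if word not in stopwords]

# Hypothetical sample data
demo_words = ["the", "the", "people", "of", "of", "america"]
demo_stopwords = {"the", "of"}

print(remove_in_place(demo_words, demo_stopwords))
# prints ['the', 'people', 'of', 'america']
print(remove_by_filtering(demo_words, demo_stopwords))
# prints ['people', 'america']
```

Filtering into a new list needs only one pass, so there is no need to repeat the loop several times.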

In [57]:
# Build a dictionary with a key for each word that appears in the papers and its frequency as the value
converted_data = list_to_dict(word_list)

In [58]:
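# Show the result for the keyword the user entered
# A KeyError means the keyword is not in the dictionary, either because it never
# appears in the text or because it was removed as a stop word
try:
    print("The word " + keyword + " appears in The Federalist Papers " + str(converted_data[keyword]) + " times.")
except KeyError:
    print("No words are found in the federalist paper")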

No words are found in the federalist paper
