# COMS W1002 Computing in Context: Computing in the Humanities  
## Created by Dennis Tenen
## Counting words

__Problem 1:__ Convert 5-10 papers you have written into plain text. Write a function in Python that counts the frequency of a specified word (the number of times it appears) in a text file.

In [None]:
# returns an int, the number of times the given word is found in the given file
def how_often(file_name, key_word):
    
    from string import punctuation
    from collections import Counter
    
    # open the file, get all the lines
    with open(file_name, 'r', encoding='utf8') as f:
        lines = f.read().splitlines()
    
    # all of the words in the text
    keys = []
    
    # isolate all of the words in the text and add them to keys
    for line in lines:
        for word in line.split():
            word = word.strip(punctuation).lower()
            keys.append(word)
            
    # count how often every word appears
    num_keys = Counter(keys)
    
    # look up the given word (lowercased to match the keys); a Counter
    # returns 0 for a word that never appears
    return num_keys[key_word.lower()]
        

In [None]:
how_often('essay1.txt', 'goods')
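
To see what the normalization step inside `how_often` does to individual tokens, here is a minimal sketch that works on an in-memory string instead of a file. The sample sentence and the helper name `normalize_words` are made up for illustration and are not part of the assignment. Unlike the function above, the sketch also skips tokens that become empty after stripping.

```python
from collections import Counter
from string import punctuation


def normalize_words(text):
    """Split on whitespace, strip surrounding punctuation, and lowercase."""
    words = []
    for raw in text.split():
        word = raw.strip(punctuation).lower()
        if word:  # skip tokens that were pure punctuation, such as "--"
            words.append(word)
    return words


# Made-up sample text (an assumption, not taken from any essay).
sample = 'Goods, "goods," and GOODS! -- but not good-s.'
counts = Counter(normalize_words(sample))

print(counts['goods'])   # the three spellings collapse into one key
print(counts['absent'])  # a Counter reports 0 for a word it never saw
```

Note that `strip` only removes punctuation at the two ends of a token, so the internal hyphen in `good-s` survives and that token is counted as a separate word.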

__Problem 2:__ Write a function in Python that returns a list of the `n` most frequently appearing words of length at least `l` in a text file. Your function should have 3 parameters, `n`, `l`, and `file_name`.

In [None]:
# returns a list of the n most frequent words of length at least l in a single text file
def most_frequent(n, l, file_name):
    
    from string import punctuation
    from collections import Counter
    
    # get all the lines in the file
    with open(file_name, 'r', encoding = 'utf8') as f:
        lines = f.read().splitlines()
        
    # a list of all qualifying words
    keys = []
    
    # if a word in the text has length at least l, add it to the keys
    for line in lines:
        for word in line.split():
            word = word.strip(punctuation).lower()
            # check length of word before adding it
            if len(word) >= l:
                keys.append(word)
                
    # count how often every word appears            
    num_keys = Counter(keys)
    
    # get the n most common words as (word, count) pairs
    common = num_keys.most_common(n)
    
    # keep just the words, not their counts
    common_words = [word for word, count in common]
    
    return common_words

In [None]:
most_frequent(10, 4, 'essay1.txt')
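
The filter-then-count logic of `most_frequent` can be tried on a small list without opening a file. The sketch below is only an illustration: the word list and the values of `min_length` and `n` are made up, and `min_length` plays the role of the parameter `l`.

```python
from collections import Counter

# Made-up word list standing in for the cleaned words of an essay.
words = ['the', 'wealth', 'of', 'goods', 'and', 'the', 'wealth',
         'of', 'happiness', 'the', 'wealth', 'goods']

min_length = 5  # plays the role of the parameter l
n = 2

# Keep only words with at least min_length characters, then count them.
counts = Counter(word for word in words if len(word) >= min_length)

# most_common(n) returns (word, count) pairs, highest count first.
pairs = counts.most_common(n)
print(pairs)

# Drop the counts to get just the words.
top_words = [word for word, _ in pairs]
print(top_words)
```

Although `'the'` appears three times, it never reaches the counter because it is shorter than `min_length`.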

__Problem 3:__ Modify your function above so that its last parameter is variable-length: you may pass it as many file names as you like, and it will return the `n` most popular words of length at least `l` in the entire corpus.

In [None]:
# returns a list of the n most frequent words of length at least l in all of the files given
def most_frequent2(n, l, *files):
    
    from string import punctuation
    from collections import Counter
    
    # if no file names are given, there is nothing to count
    # (files is a tuple, so it never compares equal to the list [])
    if not files:
        return []
    
    # add all the text into one big string to analyze, then split by lines
    text = ''
    for file_name in files:
        with open(file_name, 'r', encoding='utf8') as f:
            # the newline keeps the last word of one file from fusing
            # with the first word of the next
            text += f.read() + '\n'
    lines = text.splitlines()
    
    # a list of all the qualifying words
    keys = []
    
    # if a word in the text has length at least l, add it to keys
    for line in lines:
        for word in line.split():
            word = word.strip(punctuation).lower()
            # check length
            if len(word) >= l:
                keys.append(word)
                
    # count how often every word appears         
    num_keys = Counter(keys)
    
    # get n most common words across all texts
    common = num_keys.most_common(n)
    common_words = []
    
    # isolate the words from their frequencies
    for element in common:
        common_words.append(element[0])
        
    return common_words
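
As a rough check on how a `*files` parameter behaves, the sketch below writes two throwaway files and passes them to a variadic function. It is an illustration under stated assumptions, not the assignment's required solution: `top_words_in_corpus` is a hypothetical stand-in for `most_frequent2` that counts file by file instead of building one large string, and the file contents are made up.

```python
import os
import tempfile
from collections import Counter
from string import punctuation


def top_words_in_corpus(n, min_length, *files):
    """Hypothetical stand-in for most_frequent2, counting file by file."""
    counts = Counter()
    for file_name in files:  # files is a tuple of every extra argument
        with open(file_name, 'r', encoding='utf8') as f:
            for word in f.read().split():
                word = word.strip(punctuation).lower()
                if len(word) >= min_length:
                    counts[word] += 1
    return [word for word, _ in counts.most_common(n)]


# Two throwaway files with made-up contents, so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as folder:
    paths = []
    texts = ['Wealth and goods, wealth.', 'Goods, goods, happiness!']
    for i, text in enumerate(texts):
        path = os.path.join(folder, f'sample{i}.txt')
        with open(path, 'w', encoding='utf8') as f:
            f.write(text)
        paths.append(path)

    result = top_words_in_corpus(2, 5, *paths)  # unpack a list into *files
    empty = top_words_in_corpus(2, 5)           # no files: files is ()

print(result)
print(empty)
```

Inside the function, `files` is always a tuple, so an empty call is detected with `not files` or simply by letting the loop run zero times, as this sketch does.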

__Problem 4:__ Use the functions you just wrote to analyze your own papers both individually and as a corpus. What are the 20 most popular words of length at least 5? Do they vary much from paper to paper? Why do you think this is? Write about what you discover.

### Answer ###
The 20 most popular words of length at least 5 across all of my papers were *'which', 'because', 'would', 'erica', 'aasmi', 'their', 'looked', 'through', 'happiness', 'never', 'aristotle', 'there', 'goods', 'really', 'about', 'wealth', 'thought', 'people', 'every',* and *'little'*.  
Compared with the results for each individual paper, these results seem to be dominated by the two longest papers, __essay1__ and __essay6__. *'erica'* and *'aasmi'* are the two main characters in the creative writing paper I used as __essay6__. *'aristotle'* was the main author I analyzed in my CC paper, which served as __essay1__ and in which I wrote a lot about *'goods'*, *'wealth'*, and *'happiness'*.  
Each paper's list differed from the others, probably because each paper came from an entirely different context. My papers include two creative writing assignments, one CC essay, one ArtHum analysis, and one LitHum essay. Each list therefore revolves around the topic I chose for that paper, with the most popular words relating to that topic, as the examples in the previous paragraph show. However, some words remained consistently popular across most of my papers: *'which', 'because', 'really'*, and *'would'*. I conclude that these are the words I use most often in my writing, regardless of context.

In [None]:
print(most_frequent2(20, 5, 'essay1.txt', 'essay2.txt', 'essay3.txt', 'essay4.txt', 'essay5.txt', 'essay6.txt'),'\n')
print(most_frequent(20,5,'essay1.txt'),'\n')
print(most_frequent(20,5,'essay2.txt'),'\n')
print(most_frequent(20,5,'essay3.txt'),'\n')
print(most_frequent(20,5,'essay4.txt'),'\n')
print(most_frequent(20,5,'essay5.txt'),'\n')
print(most_frequent(20,5,'essay6.txt'),'\n')
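
The comparison described in the answer, separating words that recur across papers from words tied to one topic, could also be done in code instead of by eye. The sketch below is one possible approach under stated assumptions: the three lists in `top_lists` are made-up placeholders, not the actual output for any essay, and the names `papers_per_word`, `in_every_paper`, and `in_one_paper` are hypothetical.

```python
from collections import Counter

# Made-up top-word lists standing in for the output of most_frequent
# on three different papers (not the real results).
top_lists = {
    'paper_a': ['which', 'because', 'goods', 'wealth'],
    'paper_b': ['which', 'would', 'erica', 'because'],
    'paper_c': ['which', 'because', 'would', 'looked'],
}

# Count in how many papers each word reached the top list.
papers_per_word = Counter()
for words in top_lists.values():
    papers_per_word.update(set(words))  # set() so each paper counts once

# Words found in every list are used regardless of topic; words found in
# only one list are more likely tied to that paper's subject.
in_every_paper = sorted(
    word for word, count in papers_per_word.items() if count == len(top_lists)
)
in_one_paper = sorted(
    word for word, count in papers_per_word.items() if count == 1
)

print(in_every_paper)
print(in_one_paper)
```

With real data, each list in `top_lists` would come from a call such as `most_frequent(20, 5, 'essay1.txt')`.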