# SIADS 516: Homework 1
Version 1.0.20200221.1
### Dr. Chris Teplovs, School of Information, University of Michigan
<small><a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a>This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.</small>

### Our first mrjob script

Recall the following example from the lectures:

Note the use of the magic command ```%%file```.  You can use this to write the contents of a cell out to a file, which is what we need to do to use mrjob:

In [1]:
%%file word_count.py
from mrjob.job import MRJob
import re

class MRWordFrequencyCount(MRJob):

  ### input: self, in_key, in_value
  def mapper(self, _, line):
    yield "chars", len(line)
    yield "words", len(line.split())
    yield "lines", 1

  ### input: self, in_key from mapper, in_value from mapper
  def reducer(self, key, values):
    yield key, sum(values)

if __name__ == "__main__":
    MRWordFrequencyCount.run()

Overwriting word_count.py


#### Testing out the output code

In [2]:
#code testing out the output of mapper:
def read_file(filename):
    """Read the text file (filename) and return a list of the lines from it"""
    # use a context manager so the file is closed after reading
    with open(filename) as fp:
        return fp.readlines()

def mapper_try(filename, line_num):
    line_list = read_file(filename)
    yield "chars", len(line_list[line_num])
    yield "words", len(line_list[line_num].split())
    yield "lines", 1

display(list(mapper_try('data/gutenberg/short.t1.txt', 0)), 
        list(mapper_try('data/gutenberg/short.t1.txt', 1))
       )

[('chars', 70), ('words', 12), ('lines', 1)]

[('chars', 1), ('words', 0), ('lines', 1)]

### trying out logger

In [3]:
%%file word_count_logger.py
from mrjob.job import MRJob
import re

import logging

class MRWordFrequencyCount(MRJob):

  ### input: self, in_key, in_value
  def mapper(self, _, line):
    logging.info(f'  line length: {len(line)}')
    yield "chars", len(line)
    logging.info(f'  splt length: {len(line.split())}')
    yield "words", len(line.split())
    yield "lines", 1

  ### input: self, in_key from mapper, in_value from mapper
  def reducer(self, key, values):
    logging.info(f'  word total: {key, sum(values)}')
    yield key, sum(values)

if __name__ == "__main__":
    logging.basicConfig(filename='wordfreq.log', level=logging.INFO)
    logging.info('Started')
    MRWordFrequencyCount.run()
    logging.info('Done')

Overwriting word_count_logger.py


In [4]:
!python word_count_logger.py data/gutenberg/short.t1.txt

No configs found; falling back on auto-configuration
No configs specified for inline runner
Creating temp directory /tmp/word_count_logger.jovyan.20200707.184948.251923
Running step 1 of 1...
job output is in /tmp/word_count_logger.jovyan.20200707.184948.251923/output
Streaming final output from /tmp/word_count_logger.jovyan.20200707.184948.251923/output...
"chars"	0
"words"	0
"lines"	0
Removing temp directory /tmp/word_count_logger.jovyan.20200707.184948.251923...
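
The zero totals in the output above are what one would expect if the `values` argument passed to the reducer is a one-shot iterator (an assumption about how mrjob delivers values, not something verified here). The reducer calls `sum(values)` once inside `logging.info` and again in the `yield`, so the second call would find nothing left to add. A minimal sketch of that behaviour, using a made-up generator `values_for_key` in place of the real values:

```python
def values_for_key():
    # Hypothetical stand-in for the iterator of values a reducer receives
    yield from [69, 0, 55]

values = values_for_key()
logged_total = sum(values)    # first pass consumes every value
yielded_total = sum(values)   # second pass sees an exhausted iterator

print(logged_total, yielded_total)

# One way to avoid this: compute the sum once and reuse the result
values = values_for_key()
total = sum(values)
print(total, total)
```

If this is indeed the cause, storing the sum in a variable before logging and yielding it should restore the expected totals.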


### <font color="magenta">Q1: Explain what each of the yield statements in the above script does.  Provide a list of what the first few iterations through the mapper() step would yield if the script were run against the ```data/gutenberg/short.t1.txt``` file.</font>

> The mapper is called once for each line of the input text. ```yield "chars", len(line)``` emits the number of characters in the line (including spaces).  ```yield "words", len(line.split())``` splits the line on whitespace and emits the number of resulting tokens (words).  ```yield "lines", 1``` always emits the key "lines" with the value 1, so summing these values counts the lines.

> Each call to the mapper produces these three key-value pairs in order, and the mapper is then called again for the next line.  I would expect the first few iterations to yield the following:

> #this is for the first text line as determined by __logging.info__ (the first three iterations) <br>
    >("chars", 69) <br>
    >("words", 12) <br>
    >("lines", 1)  <br>
> #this is for the second line -- which is empty as determined by __logging.info__ (the next three iterations) <br>
    >("chars", 0) <br>
    >("words", 0) <br>
    >("lines", 1) <br>
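
To make the flow from mapper to reducer concrete, here is a small plain-Python sketch that does not use mrjob at all. The two input lines are invented for illustration, and the grouping step is only a rough imitation of what the framework does between the mapper and the reducer.

```python
from collections import defaultdict

def mapper(line):
    # Same three yields as the mapper in word_count.py
    yield "chars", len(line)
    yield "words", len(line.split())
    yield "lines", 1

# Two made-up input lines (not taken from the data file)
lines = ["the quick brown fox", ""]

grouped = defaultdict(list)
for line in lines:
    for key, value in mapper(line):
        grouped[key].append(value)   # collect all values that share a key

# The reducer's job: sum the values for each key
totals = {key: sum(values) for key, values in grouped.items()}
print(totals)
```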
    


Now let's look at the output of running the script against that same file:

In [5]:
!python word_count.py data/gutenberg/short.t1.txt

No configs found; falling back on auto-configuration
No configs specified for inline runner
Creating temp directory /tmp/word_count.jovyan.20200707.184948.761482
Running step 1 of 1...
job output is in /tmp/word_count.jovyan.20200707.184948.761482/output
Streaming final output from /tmp/word_count.jovyan.20200707.184948.761482/output...
"chars"	10653
"words"	1822
"lines"	200
Removing temp directory /tmp/word_count.jovyan.20200707.184948.761482...


### <font color="magenta">Q2.  Repeat the above cell using the works of William Shakespeare text file (data/gutenberg/t8.shakespeare.txt).  Provide an interpretation of the output (don't overthink this -- just demonstrate that you can find the relevant information in the output).</font>

In [6]:
!python word_count.py data/gutenberg/t8.shakespeare.txt

No configs found; falling back on auto-configuration
No configs specified for inline runner
Creating temp directory /tmp/word_count.jovyan.20200707.184949.236212
Running step 1 of 1...
job output is in /tmp/word_count.jovyan.20200707.184949.236212/output
Streaming final output from /tmp/word_count.jovyan.20200707.184949.236212/output...
"chars"	5333743
"words"	901325
"lines"	124456
Removing temp directory /tmp/word_count.jovyan.20200707.184949.236212...


> A temporary directory with a randomly generated name is created (and is removed after commands are executed).  
> __The output we want is:__<br>
> ```"chars" 5333743
> "words" 901325
> "lines" 124456```<br>
> The "chars" value represents the number of characters present in the whole Shakespeare textfile, and the "words" and "lines" represent the number of words and lines in the whole textfile, respectively.
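
The result lines can also be read programmatically. The sketch below assumes each output row is a JSON-encoded key, a tab, and a JSON-encoded value, which matches what is shown above; the string `raw_output` is simply the three result lines copied by hand.

```python
import json

# The three result lines from the run above, copied by hand
raw_output = '"chars"\t5333743\n"words"\t901325\n"lines"\t124456\n'

counts = {}
for row in raw_output.splitlines():
    key_json, value_json = row.split("\t")
    counts[json.loads(key_json)] = json.loads(value_json)

print(counts)
# A derived figure: average number of words per line
print(round(counts["words"] / counts["lines"], 2))
```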

### Now let's look at a slightly more complicated example:

In [7]:
%%file most_used_word.py
from mrjob.job import MRJob
from mrjob.step import MRStep
import re

WORD_RE = re.compile(r"[\w']+") # one or more word characters or apostrophes; used to extract words from each line below


class MRMostUsedWord(MRJob):
    STOPWORDS = {'i', 'we', 'ourselves', 'hers', 'between', 'yourself', 'but', 'again', 'there', 'about', 'once', 'during', 'out', 'very', 'having', 'with', 'they', 'own', 'an', 'be', 'some', 'for', 'do', 'its', 'yours', 'such', 'into', 'of', 'most', 'itself', 'other', 'off', 'is', 's', 'am', 'or', 'who', 'as', 'from', 'him', 'each', 'the', 'themselves', 'until', 'below', 'are', 'we', 'these', 'your', 'his', 'through', 'don', 'nor', 'me', 'were', 'her', 'more', 'himself', 'this', 'down', 'should', 'our', 'their', 'while', 'above', 'both', 'up', 'to', 'ours', 'had', 'she', 'all', 'no', 'when', 'at', 'any', 'before', 'them', 'same', 'and', 'been', 'have', 'in', 'will', 'on', 'does', 'yourselves', 'then', 'that', 'because', 'what', 'over', 'why', 'so', 'can', 'did', 'not', 'now', 'under', 'he', 'you', 'herself', 'has', 'just', 'where', 'too', 'only', 'myself', 'which', 'those', 'i', 'after', 'few', 'whom', 't', 'being', 'if', 'theirs', 'my', 'against', 'a', 'by', 'doing', 'it', 'how', 'further', 'was', 'here', 'than'}
    def steps(self):
        return [
            MRStep(mapper=self.mapper_get_words,
                   combiner=self.combiner_count_words,
                   reducer=self.reducer_count_words),
            MRStep(reducer=self.reducer_find_max_word)
        ]

    def mapper_get_words(self, _, line):
        # yield each word in the line
        for word in WORD_RE.findall(line):
            if word.lower() not in self.STOPWORDS:
                yield (word.lower(), 1)

    def combiner_count_words(self, word, counts):
        # optimization: sum the words we've seen so far
        yield (word, sum(counts))

    def reducer_count_words(self, word, counts):
        # send all (num_occurrences, word) pairs to the same reducer.
        # num_occurrences is used so we can easily use Python's max() function.
        yield None, (sum(counts), word)

    # discard the key; it is just None
    def reducer_find_max_word(self, _, word_count_pairs):
        # each item of word_count_pairs is (count, word),
        # so yielding one results in key=counts, value=word
        yield max(word_count_pairs)



if __name__ == '__main__':
    import time
    start = time.time()
    MRMostUsedWord.run()
    end = time.time()
    print(end - start)

Overwriting most_used_word.py


In [25]:
#this yields the output of the mapper...couldn't get to work with combiner or reducer...

# !python most_used_word.py --step-num=0 --mapper < data/gutenberg/short.t1.txt

### working on logging

In [9]:
%%file most_used_word_logger.py
from mrjob.job import MRJob
from mrjob.step import MRStep
import re

import logging

WORD_RE = re.compile(r"[\w']+") # one or more word characters or apostrophes; used to extract words from each line below


class MRMostUsedWord(MRJob):
    STOPWORDS = {'i', 'we', 'ourselves', 'hers', 'between', 'yourself', 'but', 'again', 'there', 'about', 'once', 'during', 'out', 'very', 'having', 'with', 'they', 'own', 'an', 'be', 'some', 'for', 'do', 'its', 'yours', 'such', 'into', 'of', 'most', 'itself', 'other', 'off', 'is', 's', 'am', 'or', 'who', 'as', 'from', 'him', 'each', 'the', 'themselves', 'until', 'below', 'are', 'we', 'these', 'your', 'his', 'through', 'don', 'nor', 'me', 'were', 'her', 'more', 'himself', 'this', 'down', 'should', 'our', 'their', 'while', 'above', 'both', 'up', 'to', 'ours', 'had', 'she', 'all', 'no', 'when', 'at', 'any', 'before', 'them', 'same', 'and', 'been', 'have', 'in', 'will', 'on', 'does', 'yourselves', 'then', 'that', 'because', 'what', 'over', 'why', 'so', 'can', 'did', 'not', 'now', 'under', 'he', 'you', 'herself', 'has', 'just', 'where', 'too', 'only', 'myself', 'which', 'those', 'i', 'after', 'few', 'whom', 't', 'being', 'if', 'theirs', 'my', 'against', 'a', 'by', 'doing', 'it', 'how', 'further', 'was', 'here', 'than'}
    def steps(self):
        return [
            MRStep(mapper=self.mapper_get_words,
                   combiner=self.combiner_count_words,
                   reducer=self.reducer_count_words),
            MRStep(reducer=self.reducer_find_max_word)
        ]

    def mapper_get_words(self, _, line):
        # yield each word in the line
        for word in WORD_RE.findall(line):
            if word.lower() not in self.STOPWORDS:
                logging.info(f'  words: {(word.lower(), 1)}')
                yield (word.lower(), 1)

    def combiner_count_words(self, word, counts):
        # optimization: sum the words we've seen so far
        logging.info(f'  combiner word_counts: {(word, sum(counts))}')
        yield (word, sum(counts))

    def reducer_count_words(self, word, counts):
        # send all (num_occurrences, word) pairs to the same reducer.
        # num_occurrences is used so we can easily use Python's max() function.
        logging.info(f'  reducer1 word_counts: {(sum(counts), word)}')
        yield None, (sum(counts), word)

    # discard the key; it is just None
    def reducer_find_max_word(self, _, word_count_pairs):
        # each item of word_count_pairs is (count, word),
        # so yielding one results in key=counts, value=word
        output = max(word_count_pairs)
        logging.info(f'  reducer2 most_used_word: {output}')
        yield output



if __name__ == '__main__':
    logging.basicConfig(filename='mostusedwords.log', level=logging.INFO)
    logging.info('Started')
    MRMostUsedWord.run()
    logging.info('Done')

Overwriting most_used_word_logger.py


In [10]:
!python most_used_word_logger.py data/gutenberg/short.t1.txt

No configs found; falling back on auto-configuration
No configs specified for inline runner
Creating temp directory /tmp/most_used_word_logger.jovyan.20200707.184951.906882
Running step 1 of 2...
Running step 2 of 2...
job output is in /tmp/most_used_word_logger.jovyan.20200707.184951.906882/output
Streaming final output from /tmp/most_used_word_logger.jovyan.20200707.184951.906882/output...
0	"yesterday"
Removing temp directory /tmp/most_used_word_logger.jovyan.20200707.184951.906882...
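
A count of 0 in this output suggests that the logging version is not summing correctly. One plausible explanation, stated as an assumption, is that `counts` is a one-shot iterator: the `sum(counts)` inside each `logging.info` call would consume it, leaving 0 for the `sum(counts)` in the `yield`. If every pair reaching the last reducer has a count of 0, `max()` falls back to comparing the words themselves. The pairs below are made up to show that effect.

```python
# Hypothetical (count, word) pairs in which every count is 0
pairs = [(0, "project"), (0, "yesterday"), (0, "verne")]

# With equal counts, tuple comparison falls through to the word itself,
# so the alphabetically last word wins
print(max(pairs))
```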


#### Testing some of the output code (outside MRJob)

In [11]:
import re
STOPWORDS = ['i', 'we', 'ourselves', 'hers', 'between', 'yourself', 
             'but', 'again', 'there', 'about', 'once', 'during', 'out', 
             'very', 'having', 'with', 'they', 'own', 'an', 'be', 'some', 
             'for', 'do', 'its', 'yours', 'such', 'into', 'of', 'most', 'itself', 
             'other', 'off', 'is', 's', 'am', 'or', 'who', 'as', 'from', 'him', 'each', 
             'the', 'themselves', 'until', 'below', 'are', 'we', 'these', 'your', 'his', 
             'through', 'don', 'nor', 'me', 'were', 'her', 'more', 'himself', 'this', 'down', 
             'should', 'our', 'their', 'while', 'above', 'both', 'up', 'to', 'ours', 'had', 'she', 
             'all', 'no', 'when', 'at', 'any', 'before', 'them', 'same', 'and', 'been', 'have', 'in', 
             'will', 'on', 'does', 'yourselves', 'then', 'that', 'because', 'what', 'over', 'why', 'so', 
             'can', 'did', 'not', 'now', 'under', 'he', 'you', 'herself', 'has', 'just', 'where', 'too', 
             'only', 'myself', 'which', 'those', 'i', 'after', 'few', 'whom', 't', 'being', 'if', 'theirs', 
             'my', 'against', 'a', 'by', 'doing', 'it', 'how', 'further', 'was', 'here', 'than'
            ]

def mapper_get_words_try(filename, line_num):
    # yield each word in the line

    WORD_RE = re.compile(r"[\w']+") # one or more word characters or apostrophes; used to extract words from each line below
    line_list = read_file(filename)
    for word in WORD_RE.findall(line_list[line_num]):
        if word.lower() not in STOPWORDS:
            yield (word.lower(), 1)

In [12]:
list(mapper_get_words_try('data/gutenberg/short.t1.txt', 0))

[('project', 1),
 ("gutenberg's", 1),
 ('year', 1),
 ('2889', 1),
 ('jules', 1),
 ('verne', 1),
 ('michel', 1),
 ('verne', 1)]
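
As a quick check of what the pattern `[\w']+` actually matches, the sketch below applies it to a sentence invented for illustration. It matches runs of word characters and apostrophes, so punctuation and spaces act as separators while contractions stay intact.

```python
import re

WORD_RE = re.compile(r"[\w']+")

# Sample sentence invented for illustration
tokens = WORD_RE.findall("Don't panic: it's only 2 o'clock!")
print(tokens)
```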

### <font color="magenta">Q3: Explain what the yield statements in the above script do.  Provide a list of what the first few iterations through the steps would yield.</font>

> - __mapper_get_words__ extracts each word from a line of text and yields a tuple consisting of the lowercased word and the count 1, skipping stopwords.  If a word is repeated, it yields one tuple per occurrence rather than combining them.<br>
>   - _below are the results from the first several iterations as determined by __logging.info__:_ <br>
         > ('project', 1)<br>
         > ("gutenberg's", 1)<br>
         > ('year', 1)<br>
         > ('2889', 1)<br>
         > ('jules', 1)<br>
         > ('verne', 1)<br>
         > ('michel', 1) <br>
         > ('verne', 1)<br>
         > ('ebook', 1)<br>
         > ('use', 1)<br>
         > ('anyone', 1)<br>
         > ('anywhere', 1)<br>
         > ('cost', 1)<br>
         > ('almost', 1)<br>
         > ('restrictions', 1)<br>
         > ('whatsoever', 1)<br>
         > ('may', 1)<br>
         > ('copy', 1)<br>
         > ('give', 1)<br>
         > ('away', 1)<br>
<br>
> - __combiner_count_words__ sums the counts for each word in the mapper's output before they reach the reducer so, for example, if a word is repeated (as in line 10) it will yield the word and its total count up to that point, e.g. ('verne', 2).
>   - _below are the results from the first several iterations as determined by __logging.info__:_ <br>
         > ('2889', 1)<br>
         > ('almost', 1)<br>
         > ('anyone', 1)<br> 
         > ('anywhere', 1)<br> 
         > ('away', 1)<br>
         > ('copy', 1)<br>
         > ('cost', 1)<br>
         > ('ebook', 1)<br>
         > ('give', 1)<br>
         > ("gutenberg's", 1)<br>
         > ('jules', 1)<br>
         > ('may', 1)<br>
         > ('michel', 1)<br>
         > ('project', 1)<br>
         > ('restrictions', 1)<br>
         > ('use', 1)<br>
         > ('verne', 2)<br>
         > ('whatsoever', 1)<br> 
         > ('year', 1)<br>
<br>         
> - __reducer_count_words__ performs the same sum but reorders the tuple yielded by the prior step, putting the numerical occurrence first, and emits it under the key `None` so that every pair reaches the same second-step reducer.
>   - _below are the results from the first several iterations as determined by __logging.info__:_ <br>
         > (7, '000')<br>
         > (2, '10')<br>
         > (2, '1000')<br>
         > (1, '1100')<br>
         > (1, '1889')<br>
         > (1, '19362')<br>
         > (1, '2')<br>
         > (1, '200')<br>
         > (1, '2006')<br>
         > (1, '2007')<br>
         > (1, '23')<br>
         > (1, '250')<br>
         > (1, '253d')<br>
         > (1, '25th')<br>
         > (1, '262')<br>
         > (1, '2792')<br>
         > (6, '2889')<br>
         > (1, '2889_')<br>
         > (1, '3')<br>
<br>         
> - __reducer_find_max_word__ will yield a tuple with the word that is used the most times in a text and the count of the number of times it was used (with the numerical occurrence being the first element of the tuple).
>   - _below are the results as determined by __logging.info__:_ <br>
         > (11, 'day')<br>
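
The whole two-step pipeline can be imitated in a few lines of plain Python. This is only a rough sketch with made-up mapper output, not a reproduction of how mrjob schedules the steps, but it shows why putting the count first in the tuple lets `max()` pick the most used word.

```python
from collections import Counter

# Hypothetical (word, 1) pairs as the mapper might emit them
mapper_output = [("verne", 1), ("year", 1), ("verne", 1),
                 ("day", 1), ("day", 1), ("day", 1)]

# Step 1: the combiner and the first reducer both amount to summing counts per word
counts = Counter()
for word, n in mapper_output:
    counts[word] += n

# reducer_count_words emits (count, word) pairs under the single key None
word_count_pairs = [(n, word) for word, n in counts.items()]

# Step 2: max() compares tuples element by element, so the count decides first
print(max(word_count_pairs))
```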
    

Now run the file against data/gutenberg/short.t1.txt.

In [13]:
!python most_used_word.py data/gutenberg/short.t1.txt

No configs found; falling back on auto-configuration
No configs specified for inline runner
Creating temp directory /tmp/most_used_word.jovyan.20200707.184952.966224
Running step 1 of 2...
Running step 2 of 2...
job output is in /tmp/most_used_word.jovyan.20200707.184952.966224/output
Streaming final output from /tmp/most_used_word.jovyan.20200707.184952.966224/output...
11	"day"
Removing temp directory /tmp/most_used_word.jovyan.20200707.184952.966224...
0.7241528034210205


In [14]:
!python most_used_word.py data/gutenberg/t8.shakespeare.txt

No configs found; falling back on auto-configuration
No configs specified for inline runner
Creating temp directory /tmp/most_used_word.jovyan.20200707.184953.915087
Running step 1 of 2...
Running step 2 of 2...
job output is in /tmp/most_used_word.jovyan.20200707.184953.915087/output
Streaming final output from /tmp/most_used_word.jovyan.20200707.184953.915087/output...
5479	"thou"
Removing temp directory /tmp/most_used_word.jovyan.20200707.184953.915087...
4.554636001586914


### <font color="magenta">Q4: Run the above script on the Shakespeare text file.  What answer do you get?</font>

> The output indicates that the most common non-stopword in the Shakespeare text is "thou", used 5479 times.  After removing the temporary directory, the script also prints how long __MRMostUsedWord__ took to run (about 4.55 seconds here), measured by recording the time immediately before and immediately after the job.

### Checking impact of removing the combiner

In [15]:
%%file most_used_word2.py
from mrjob.job import MRJob
from mrjob.step import MRStep
import re

WORD_RE = re.compile(r"[\w']+") # one or more word characters or apostrophes; used to extract words from each line below


class MRMostUsedWord(MRJob):
    STOPWORDS = {'i', 'we', 'ourselves', 'hers', 'between', 'yourself', 'but', 'again', 'there', 'about', 'once', 'during', 'out', 'very', 'having', 'with', 'they', 'own', 'an', 'be', 'some', 'for', 'do', 'its', 'yours', 'such', 'into', 'of', 'most', 'itself', 'other', 'off', 'is', 's', 'am', 'or', 'who', 'as', 'from', 'him', 'each', 'the', 'themselves', 'until', 'below', 'are', 'we', 'these', 'your', 'his', 'through', 'don', 'nor', 'me', 'were', 'her', 'more', 'himself', 'this', 'down', 'should', 'our', 'their', 'while', 'above', 'both', 'up', 'to', 'ours', 'had', 'she', 'all', 'no', 'when', 'at', 'any', 'before', 'them', 'same', 'and', 'been', 'have', 'in', 'will', 'on', 'does', 'yourselves', 'then', 'that', 'because', 'what', 'over', 'why', 'so', 'can', 'did', 'not', 'now', 'under', 'he', 'you', 'herself', 'has', 'just', 'where', 'too', 'only', 'myself', 'which', 'those', 'i', 'after', 'few', 'whom', 't', 'being', 'if', 'theirs', 'my', 'against', 'a', 'by', 'doing', 'it', 'how', 'further', 'was', 'here', 'than'}
    def steps(self):
        return [
            MRStep(mapper=self.mapper_get_words,
#                    combiner=self.combiner_count_words,
                   reducer=self.reducer_count_words),
            MRStep(reducer=self.reducer_find_max_word)
        ]

    def mapper_get_words(self, _, line):
        # yield each word in the line
        for word in WORD_RE.findall(line):
            if word.lower() not in self.STOPWORDS:
                yield (word.lower(), 1)
                
#     def combiner_count_words(self, word, counts):
#         # optimization: sum the words we've seen so far
#         yield (word, sum(counts))

    def reducer_count_words(self, word, counts):
        # send all (num_occurrences, word) pairs to the same reducer.
        # num_occurrences is used so we can easily use Python's max() function.
        yield None, (sum(counts), word)

    # discard the key; it is just None
    def reducer_find_max_word(self, _, word_count_pairs):
        # each item of word_count_pairs is (count, word),
        # so yielding one results in key=counts, value=word
        yield max(word_count_pairs)



if __name__ == '__main__':
    import time
    start = time.time()
    MRMostUsedWord.run()
    end = time.time()
    print(end - start)

Overwriting most_used_word2.py


In [16]:
!python most_used_word2.py data/gutenberg/t8.shakespeare.txt

No configs found; falling back on auto-configuration
No configs specified for inline runner
Creating temp directory /tmp/most_used_word2.jovyan.20200707.184958.757899
Running step 1 of 2...
Running step 2 of 2...
job output is in /tmp/most_used_word2.jovyan.20200707.184958.757899/output
Streaming final output from /tmp/most_used_word2.jovyan.20200707.184958.757899/output...
5479	"thou"
Removing temp directory /tmp/most_used_word2.jovyan.20200707.184958.757899...
3.7420153617858887


### <font color="magenta">Q5: What is the impact of removing the combiner from the above code in terms of efficiency?  What does that suggest?</font>

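> In these single runs, the code ran faster without the combiner (about 3.74 seconds versus 4.55 seconds).  The reducer already computes __sum(counts)__, so the combiner is not needed for correctness; it is an optimization meant to reduce the amount of data passed from the mappers to the reducers.  With one local text file and the inline runner, that saving appears to be smaller than the overhead of the extra step, which suggests that a combiner pays off mainly when a job is distributed and moving data between machines is expensive.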
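
What a combiner saves is the number of key-value pairs handed on to the reducer. The sketch below counts those pairs with and without combining for one made-up chunk of mapper output; it says nothing about running time, which depends on the runner and the data.

```python
from collections import Counter

# Hypothetical mapper output for one chunk of input
mapper_pairs = [("thou", 1), ("art", 1), ("thou", 1),
                ("thou", 1), ("king", 1), ("art", 1)]

# A combiner sums counts per word before anything is sent on to the reducer
combined = Counter()
for word, n in mapper_pairs:
    combined[word] += n
combiner_pairs = list(combined.items())

print(len(mapper_pairs), "pairs without a combiner")
print(len(combiner_pairs), "pairs with a combiner")
```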

### <font color="magenta">Q6: Write an mrjob script that finds the 10 words that have the most syllables from the t5.churchill.txt file.  Interpret your results.</font>

### installs & imports

In [17]:
pip install syllapy

Note: you may need to restart the kernel to use updated packages.


In [18]:
import syllapy
    #to count syllables

import sys
    # sys.exit() allows us to quit (if we can't read a file)

from functools import lru_cache
    #to use the @lru_cache decorator
    
from operator import itemgetter
    #to help with more efficient sorting
    
import re
    #to apply regular expressions to extract words from text (in my read_file function)

from collections import defaultdict
    #for better dictionary functionality (in my read_file function)

### First: creating code/functions that returns the words with the most syllables

In [19]:
#got this from a SIADS 515 assignment of mine (HW4)

@lru_cache(128)
def read_file(filename):
    """Read the text file (filename) and return a dictionary (defaultdict) with unique words and their frequency.
    
    Function uses regex to extract all of the words in each line of filename after converting words to lower-case.
    It then stores the words as keys and counts their frequency in the defaultdict.
    """
    try:
        wordfreq = defaultdict(int)
        with open(filename, 'r') as file:
            for line in file:
                for word in re.findall(r'[A-Za-z0-9]+', line.lower()):
                    wordfreq[word]+=1
        
    except IOError as excObj:
        print(str(excObj))
        print("Error opening or reading input file: " + filename)
        sys.exit()
        
    return wordfreq

In [20]:
@lru_cache(128)
def top_n_syllables(filename, n):
    
    word_dict = read_file(filename)  
    
    #retrieve just the keys (the unique words in the textfile)
    word_list = list(word_dict.keys())
    
    word_list.sort(key=len, reverse=True)
    
    def syllables_list(word_list):
        #EXCEPTIONS: words that contain the parts in the lists below are frequently miscounted by syllapy
        undercount_word_part = ['bio', 'ia', 'ism']
        overcount_word_part = ['ate', 'ive', 'ize']
        for word in word_list:
            count = syllapy.count(word)
            #for undercounted words
            if any(part in word for part in undercount_word_part):
                yield (word, count + 1)
            #for overcounted words
            elif any(part in word for part in overcount_word_part):
                yield (word, count - 1)
            #for the rest!
            else:
                yield (word, count)
    
    #converts generator object produced into a list
    syllables_list = list(syllables_list(word_list))
    
    #first, sort alphabetically so it is retained within next sort
    syllables_list.sort()
    
    #sort the list by the second value within the tuple, the syllable number
    syllables_list.sort(key=itemgetter(1), reverse=True)
    
    return syllables_list[:n]

In [22]:
%timeit top_n_syllables('data/gutenberg/t5.churchill.txt', 10)

108 ns ± 0.242 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)


#### Interesting: much faster than MRJob, but because of `@lru_cache` the 108 ns measured by `%timeit` is mostly the time to return a cached result, not to recompute it
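
A caveat when timing a function decorated with `@lru_cache`: after the first call with a given set of arguments, later calls return the stored result without redoing the work, so repeated timing loops mostly measure cache lookups. The sketch below uses a made-up function, `slow_square`, as a stand-in for an expensive computation.

```python
import time
from functools import lru_cache

@lru_cache(maxsize=128)
def slow_square(x):
    # Hypothetical stand-in for an expensive function
    time.sleep(0.05)
    return x * x

start = time.perf_counter()
slow_square(12)                      # first call does the work
first_call = time.perf_counter() - start

start = time.perf_counter()
slow_square(12)                      # same argument: answered from the cache
second_call = time.perf_counter() - start

print(f"first: {first_call:.4f}s, cached: {second_call:.6f}s")
print(slow_square.cache_info())
```

For a fairer comparison with the MRJob run, one option is to call `top_n_syllables.cache_clear()` and `read_file.cache_clear()` before timing a single call.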

In [21]:
top_n_syllables('data/gutenberg/t5.churchill.txt', 10)

[('incommunicability', 8),
 ('materialistically', 8),
 ('overcapitalization', 8),
 ('apologetically', 7),
 ('appreciatively', 7),
 ('artificiality', 7),
 ('autobiographical', 7),
 ('characteristically', 7),
 ('cosmopolitanism', 7),
 ('enthusiastically', 7)]

### Using MRJob

In [23]:
%%file syllables_count.py
from mrjob.job import MRJob
from mrjob.step import MRStep
import re
import syllapy
from operator import itemgetter
    #to help with more efficient sorting

WORD_RE = re.compile(r"[\w']+") # one or more word characters or apostrophes; used to extract words from each line below

class MRWordSyllableCounter(MRJob):

    def steps(self):
        return [
            MRStep(mapper=self.mapper_get_words,
                   combiner=self.combiner_count_words,
                   reducer=self.reducer_count_syllables),
            MRStep(reducer=self.reducer_syllable_sorter)
               ]
    
    ### input: self, in_key, in_value
    def mapper_get_words(self, _, line):
        # yield each word in the line
        for word in WORD_RE.findall(line):
            yield (word.lower(), 1)
            
    def combiner_count_words(self, word, counts):
        # optimization: sum the words we've seen so far
        yield (word, sum(counts))

    def reducer_count_syllables(self, word, _):
        syllables = syllapy.count(word)
        #EXCEPTIONS: words that contain these parts are frequently miscounted by syllapy
        undercount_word_part = ['bio', 'ia', 'ism']
        overcount_word_part = ['ate', 'ive', 'ize']
        if any(w in word for w in undercount_word_part):
            syllables += 1
        elif any(w in word for w in overcount_word_part):
            syllables -= 1
        yield None, (word, syllables)
    
    def reducer_syllable_sorter(self, _, syllable_word_pair):
        #sorts the word, syllable list alphabetically, then it sorts the tuples by the second element
        #which is the syllable count for the word, returning the top ten.
        for word, syllables in sorted(sorted(syllable_word_pair), key=itemgetter(1), reverse=True)[:10]:
                yield (word, syllables)
        
if __name__ == "__main__":
    import time
    start = time.time()
    MRWordSyllableCounter.run()
    end = time.time()
    print(end - start)

Overwriting syllables_count.py


In [24]:
!python syllables_count.py data/gutenberg/t5.churchill.txt

No configs found; falling back on auto-configuration
No configs specified for inline runner
Creating temp directory /tmp/syllables_count.jovyan.20200707.185013.850822
Running step 1 of 2...
Running step 2 of 2...
job output is in /tmp/syllables_count.jovyan.20200707.185013.850822/output
Streaming final output from /tmp/syllables_count.jovyan.20200707.185013.850822/output...
"incommunicability"	8
"materialistically"	8
"overcapitalization"	8
"apologetically"	7
"appreciatively"	7
"artificiality"	7
"autobiographical"	7
"characteristically"	7
"cosmopolitanism"	7
"enthusiastically"	7
Removing temp directory /tmp/syllables_count.jovyan.20200707.185013.850822...
10.054217338562012


> Again, temporary directories are created for processing.<br>
> __The output is:__<br>
> ```"incommunicability"	8
> "materialistically"	8
> "overcapitalization"	8
> "apologetically"	7
> "appreciatively"	7     <- this one is miscounted
> "artificiality"	7
> "autobiographical"	7
> "characteristically"	7
> "cosmopolitanism"	7
> "enthusiastically"	7```<br>
> Each of the above is a tuple with a word followed by the number of syllables in the word.  Within a given syllable count, words are ordered alphabetically.  After the temporary directory is removed, the running time of the job is displayed in seconds (10.05 seconds for my last run).
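
The ordering described here (most syllables first, alphabetical within a syllable count) comes from two stable sorts in `reducer_syllable_sorter`. As a sketch of an alternative, the same ordering can be expressed with a single sort and a compound key. The pairs below are a hand-picked subset used only for illustration.

```python
from operator import itemgetter

# Hypothetical (word, syllables) pairs
pairs = [("materialistically", 8), ("apologetically", 7),
         ("incommunicability", 8), ("artificiality", 7)]

# Approach used in the script: alphabetical sort, then a stable sort by syllable count
two_sorts = sorted(sorted(pairs), key=itemgetter(1), reverse=True)

# Alternative: one sort, descending by count and ascending by word
one_sort = sorted(pairs, key=lambda pair: (-pair[1], pair[0]))

print(one_sort == two_sorts)
print(one_sort)
```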