# SI 330: Data Manipulation 
## 18 - Big Data I

### Dr. Chris Teplovs, School of Information, University of Michigan
<small><a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a>This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.</small>

## Introduction to Big Data

### **NOTE: conda install -c conda-forge mrjob**

[Distributed Computing I](assets/distcomp1.pdf)

In [1]:
!conda install -y -c conda-forge mrjob

Collecting package metadata: ...working... done
Solving environment: ...working... done

# All requested packages already installed.



### Generators

In [2]:
def f():
    return 1

In [3]:
def f123():
    yield 1
    yield 2
    yield 3

In [4]:
def f123x(x):
    for i in range(0,x):
        yield i

In [5]:
f()

1

In [6]:
x = f()

In [7]:
x

1

In [8]:
f123()

<generator object f123 at 0x0000026D57506E60>

In [9]:
x = f123()

In [10]:
x

<generator object f123 at 0x0000026D5754A1A8>

In [11]:
for thing in x:
    print(thing)

1
2
3


In [12]:
x = f123x(10)

In [13]:
x

<generator object f123x at 0x0000026D5754A410>

In [14]:
for thing in x:
    print(thing)

0
1
2
3
4
5
6
7
8
9


In [15]:
x = f123x(10)

In [16]:
for thing in x:
    print(thing)
    break

0


In [17]:
for thing in x:
    print(thing)
    break

1


In [18]:
next(x)

2
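
The cells above show that a generator object remembers where it stopped: the second `for` loop over the same `x` resumed at 1 instead of restarting at 0, and `next(x)` then returned 2. A minimal sketch of the same behaviour follows. The name `count_up_to` is hypothetical (it mirrors `f123x`) and is not part of the original notebook.

```python
def count_up_to(n):
    """Yield 0, 1, ..., n-1 lazily (hypothetical stand-in for f123x)."""
    for i in range(n):
        yield i


gen = count_up_to(3)

first = next(gen)      # runs the body up to the first yield
rest = list(gen)       # consumes whatever values are left
leftover = list(gen)   # the generator is exhausted now, so this is empty

try:
    next(gen)
    exhausted = False
except StopIteration:
    # next() on an exhausted generator raises StopIteration
    exhausted = True

print(first, rest, leftover, exhausted)
```

Because values are produced one at a time, a generator never needs to hold the whole sequence in memory, which is why mrjob mappers and reducers are written with `yield`.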

### Our first mrjob script

Note the use of the magic command ```%%file```.  You can use this to write the contents of a cell out to a file, which is what we need to do to use mrjob:

In [19]:
%%file word_count.py
from mrjob.job import MRJob


class MRWordFrequencyCount(MRJob):

    # input: self, in_key (unused), in_value (one line of text)
    def mapper(self, _, line):
        yield "chars", len(line)
        yield "words", len(line.split())
        yield "lines", 1

    # input: self, in_key from mapper, in_values from mapper (all values for that key)
    def reducer(self, key, values):
        yield key, sum(values)


if __name__ == "__main__":
    MRWordFrequencyCount.run()

Writing word_count.py


In [20]:
!python word_count.py data/simpletext.txt

"chars"	41
"lines"	3
"words"	9


No configs found; falling back on auto-configuration
No configs specified for inline runner
Running step 1 of 1...
Creating temp directory C:\Users\jodiy\AppData\Local\Temp\word_count.jodiy.20190404.225552.644830
job output is in C:\Users\jodiy\AppData\Local\Temp\word_count.jodiy.20190404.225552.644830\output
Streaming final output from C:\Users\jodiy\AppData\Local\Temp\word_count.jodiy.20190404.225552.644830\output...
Removing temp directory C:\Users\jodiy\AppData\Local\Temp\word_count.jodiy.20190404.225552.644830...


### <font color="magenta">Q1: Explain what the yield statements in the above script do. Provide a list of what the first few iterations through the mapper() step would yield.</font>

Now we can run the script from the "command line", supplying a file to
analyze as the first argument.  We have supplied data/simpletext.txt, which contains the text from the slides, in today's zip file.

In [26]:
!python word_count.py data/simpletext.txt

"chars"	41
"lines"	3
"words"	9


No configs found; falling back on auto-configuration
No configs specified for inline runner
Running step 1 of 1...
Creating temp directory C:\Users\jodiy\AppData\Local\Temp\word_count.jodiy.20190406.194921.990248
job output is in C:\Users\jodiy\AppData\Local\Temp\word_count.jodiy.20190406.194921.990248\output
Streaming final output from C:\Users\jodiy\AppData\Local\Temp\word_count.jodiy.20190406.194921.990248\output...
Removing temp directory C:\Users\jodiy\AppData\Local\Temp\word_count.jodiy.20190406.194921.990248...


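The mapper runs once per input line and yields three key-value pairs each time. The first yield statement gives a tuple of 'chars' and the number of characters in that line. The second yield statement gives a tuple of 'words' and the number of words in that line. The third yield statement gives a tuple of 'lines' and the number 1. The reducer then receives all of the values for each key and yields the key with their sum, which produces the totals for the whole text.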
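
To see what the mapper emits, the sketch below replays the mapper and reducer logic from `word_count.py` in plain Python, without mrjob. The three sample lines are an assumption made for illustration (they are not read from `data/simpletext.txt`), and `run_job` is a hypothetical helper that imitates the grouping a MapReduce framework performs between the map and reduce steps.

```python
from collections import defaultdict

# Hypothetical input; an assumption for illustration only.
SAMPLE_LINES = ["Deer Bear River", "Car Car River", "Deer Car Bear"]


def mapper(line):
    # Same three yields as MRWordFrequencyCount.mapper, once per input line.
    yield "chars", len(line)
    yield "words", len(line.split())
    yield "lines", 1


def reducer(key, values):
    # Same logic as MRWordFrequencyCount.reducer.
    yield key, sum(values)


def run_job(lines):
    # Map: collect every (key, value) pair the mapper yields.
    mapped = [pair for line in lines for pair in mapper(line)]
    # Shuffle: group the values by key before reducing.
    grouped = defaultdict(list)
    for key, value in mapped:
        grouped[key].append(value)
    # Reduce: one reducer call per key, in sorted key order.
    reduced = [pair for key in sorted(grouped) for pair in reducer(key, grouped[key])]
    return mapped, reduced


mapped, reduced = run_job(SAMPLE_LINES)
print(mapped[:3])   # the three pairs produced from the first line
print(reduced)      # one total per key
```

For the first sample line the mapper yields `('chars', 15)`, `('words', 3)`, and `('lines', 1)`; the reducer only ever sees a key together with all of the values collected for it.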

### <font color="magenta">Q2: Repeat the above cell using the Tale of Two Cities text file (data/totc.txt). Report your findings below.</font>

In [27]:
!python word_count.py data/totc.txt

"chars"	760429
"lines"	16273
"words"	138883


No configs found; falling back on auto-configuration
No configs specified for inline runner
Running step 1 of 1...
Creating temp directory C:\Users\jodiy\AppData\Local\Temp\word_count.jodiy.20190406.194929.054638
job output is in C:\Users\jodiy\AppData\Local\Temp\word_count.jodiy.20190406.194929.054638\output
Streaming final output from C:\Users\jodiy\AppData\Local\Temp\word_count.jodiy.20190406.194929.054638\output...
Removing temp directory C:\Users\jodiy\AppData\Local\Temp\word_count.jodiy.20190406.194929.054638...


There are 760,429 characters, 16,273 lines, and 138,883 words in the Tale of Two Cities text file.

### Now let's look at a slightly more complicated example:

In [23]:
%%file most_used_word.py
from mrjob.job import MRJob
from mrjob.step import MRStep
import re

WORD_RE = re.compile(r"[\w']+")


class MRMostUsedWord(MRJob):
    STOPWORDS = {'i', 'we', 'ourselves', 'hers', 'between', 'yourself', 'but', 'again', 'there', 'about', 'once', 'during', 'out', 'very', 'having', 'with', 'they', 'own', 'an', 'be', 'some', 'for', 'do', 'its', 'yours', 'such', 'into', 'of', 'most', 'itself', 'other', 'off', 'is', 's', 'am', 'or', 'who', 'as', 'from', 'him', 'each', 'the', 'themselves', 'until', 'below', 'are', 'we', 'these', 'your', 'his', 'through', 'don', 'nor', 'me', 'were', 'her', 'more', 'himself', 'this', 'down', 'should', 'our', 'their', 'while', 'above', 'both', 'up', 'to', 'ours', 'had', 'she', 'all', 'no', 'when', 'at', 'any', 'before', 'them', 'same', 'and', 'been', 'have', 'in', 'will', 'on', 'does', 'yourselves', 'then', 'that', 'because', 'what', 'over', 'why', 'so', 'can', 'did', 'not', 'now', 'under', 'he', 'you', 'herself', 'has', 'just', 'where', 'too', 'only', 'myself', 'which', 'those', 'i', 'after', 'few', 'whom', 't', 'being', 'if', 'theirs', 'my', 'against', 'a', 'by', 'doing', 'it', 'how', 'further', 'was', 'here', 'than'}
    def steps(self):
        return [
            MRStep(mapper=self.mapper_get_words,
                   combiner=self.combiner_count_words,
                   reducer=self.reducer_count_words),
            MRStep(reducer=self.reducer_find_max_word)
        ]

    def mapper_get_words(self, _, line):
        # yield each word in the line
        for word in WORD_RE.findall(line):
            if word.lower() not in self.STOPWORDS:
                yield (word.lower(), 1)

    def combiner_count_words(self, word, counts):
        # optimization: sum the words we've seen so far
        yield (word, sum(counts))

    def reducer_count_words(self, word, counts):
        # send all (num_occurrences, word) pairs to the same reducer.
        # num_occurrences is so we can easily use Python's max() function.
        yield None, (sum(counts), word)

    # discard the key; it is just None
    def reducer_find_max_word(self, _, word_count_pairs):
        # each item of word_count_pairs is (count, word),
        # so yielding one results in key=counts, value=word
        yield max(word_count_pairs)


if __name__ == '__main__':
    MRMostUsedWord.run()

Writing most_used_word.py


### <font color="magenta">Q3: Explain what the yield statements in the above script do. Provide a list of what the first few iterations through the mapper() step would yield.</font>

In [24]:
!python most_used_word.py data/simpletext.txt

3	"car"


No configs found; falling back on auto-configuration
No configs specified for inline runner
Running step 1 of 2...
Creating temp directory C:\Users\jodiy\AppData\Local\Temp\most_used_word.jodiy.20190404.225619.830234
Running step 2 of 2...
job output is in C:\Users\jodiy\AppData\Local\Temp\most_used_word.jodiy.20190404.225619.830234\output
Streaming final output from C:\Users\jodiy\AppData\Local\Temp\most_used_word.jodiy.20190404.225619.830234\output...
Removing temp directory C:\Users\jodiy\AppData\Local\Temp\most_used_word.jodiy.20190404.225619.830234...


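The mapper yields a key-value pair for each word that is not a stopword: the word in lower case and the number 1, for example ('deer', 1). The first few pairs from the mapper are:

('deer', 1), ('bear', 1), ('river', 1), ('car', 1)

The combiner yields a pair of the word and the partial sum of its counts. The first reducer sums the counts for each word, then yields a pair whose key is None for every word and whose value flips the combined pair to (count, word), so that all pairs go to the same reducer in the second step. The second reducer yields the largest (count, word) pair, which becomes the final key (the count) and value (the word).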
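
The two steps can be sketched in plain Python, without mrjob. This is an illustration under stated assumptions: the sample lines and the shortened stopword set are hypothetical, `group_by_key` is a hypothetical helper standing in for the framework's shuffle, and the combiner is left out because it is only an optimization that does not change the result.

```python
import re
from collections import defaultdict

WORD_RE = re.compile(r"[\w']+")
# Shortened stopword set and sample lines: assumptions for illustration only.
STOPWORDS = {"the", "a", "and"}
SAMPLE_LINES = ["Deer Bear River", "Car Car River", "Deer Car Bear"]


def mapper_get_words(line):
    # Same logic as MRMostUsedWord.mapper_get_words.
    for word in WORD_RE.findall(line):
        if word.lower() not in STOPWORDS:
            yield word.lower(), 1


def group_by_key(pairs):
    # Collect all values that share a key, as happens before each reduce step.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


# Step 1: map, then sum the counts per word and re-key every result with None.
mapped = [pair for line in SAMPLE_LINES for pair in mapper_get_words(line)]
step1 = [(None, (sum(counts), word)) for word, counts in group_by_key(mapped).items()]

# Step 2: every pair shares the key None, so a single reducer call sees them all.
word_count_pairs = group_by_key(step1)[None]
best = max(word_count_pairs)   # tuples compare by count first, then by word

print(mapped[:4])
print(best)
```

Putting the count first in the tuple is what lets `max()` pick the most frequent word; if two words tie on count, the alphabetically later word wins.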

### <font color="magenta">Q4: Run the above script on the Tale of Two Cities text file (data/totc.txt).  What answer do you get?</font>

In [28]:
!python most_used_word.py data/totc.txt

661	"said"


No configs found; falling back on auto-configuration
No configs specified for inline runner
Running step 1 of 2...
Creating temp directory C:\Users\jodiy\AppData\Local\Temp\most_used_word.jodiy.20190406.195002.703899
Running step 2 of 2...
job output is in C:\Users\jodiy\AppData\Local\Temp\most_used_word.jodiy.20190406.195002.703899\output
Streaming final output from C:\Users\jodiy\AppData\Local\Temp\most_used_word.jodiy.20190406.195002.703899\output...
Removing temp directory C:\Users\jodiy\AppData\Local\Temp\most_used_word.jodiy.20190406.195002.703899...


The most used word in totc.txt is "said". It is used 661 times.
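
One way to sanity-check a result like this is to recount in a single process with `collections.Counter`. The sketch below is an assumption-laden cross-check, not part of the original assignment: `most_used_word` is a hypothetical helper, and the two sample lines and four stopwords are made up for illustration.

```python
import re
from collections import Counter

WORD_RE = re.compile(r"[\w']+")


def most_used_word(lines, stopwords):
    """Return (count, word) for the most frequent non-stopword (hypothetical helper)."""
    counts = Counter(
        word.lower()
        for line in lines
        for word in WORD_RE.findall(line)
        if word.lower() not in stopwords
    )
    # Mirror the job's max() over (count, word) tuples, including its tie-breaking.
    return max((count, word) for word, count in counts.items())


# Hypothetical input; any iterable of text lines works, including an open file.
sample = ["It was the best of times", "it was the worst of times"]
result = most_used_word(sample, {"it", "was", "the", "of"})
print(result)
```

This approach holds every distinct word in memory on one machine, which is fine for a single novel; the MapReduce version is the one that scales to inputs that do not fit on one machine.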