# Hadoop MapReduce Exercise

In [1]:
! pip install mrjob

Defaulting to user installation because normal site-packages is not writeable
Collecting mrjob
  Downloading mrjob-0.7.4-py2.py3-none-any.whl (439 kB)
Installing collected packages: mrjob
Successfully installed mrjob-0.7.4


#### Create a separate Python file for the MapReduce job:

In [2]:
%%file wordcount.py
# %%file is an IPython cell magic that saves the contents of this cell to a file

from mrjob.job import MRJob # import the mrjob library

class MRSongCount(MRJob):
    '''
    (https://mrjob.readthedocs.io/en/latest/guides/quickstart.html)
    The MRJob class defines the 'steps' that MapReduce should follow: "mapper", "combiner", and "reducer".
    All of these steps are optional, but you must define at least one.

    mapper() = takes a key and a value as arguments and yields zero or more key-value tuples
    reducer() = takes a key and an iterator of values and yields zero or more key-value tuples (e.g., word counts).

    !! Important !!
    The code below must ALWAYS be at the end of the job file. These lines hand control of the command-line arguments to mrjob.

    if __name__ == "__main__":
        MRSongCount.run()
    '''
    
    # the map step: mrjob reads the txt file line by line and passes each line in as a (key, value) pair
    # for plain text input the key is None and the value is the line itself
    # naming the key parameter _ signals that it is ignored
    def mapper(self, _, song):
        # output each line as a tuple of (song_name, 1)
        yield (song, 1)

    # the reduce step: combine all tuples with the same key
    # in this case, the key is the song name
    # then sum all the values of the tuple, which will give the total song plays
    def reducer(self, key, values):
        # Note: 'yield' makes this method a Python generator (https://stackoverflow.com/questions/231767/what-does-the-yield-keyword-do)
        yield (key, sum(values))
        
if __name__ == "__main__":
    MRSongCount.run()

Writing wordcount.py


In [4]:
# run the code as a terminal command
! python3 wordcount.py songplays.txt

No configs found; falling back on auto-configuration
No configs specified for inline runner
Creating temp directory /tmp/wordcount.rambino.20220825.180315.875123
Running step 1 of 1...
job output is in /tmp/wordcount.rambino.20220825.180315.875123/output
Streaming final output from /tmp/wordcount.rambino.20220825.180315.875123/output...
"Deep Dreams"	1131
"Broken Networks"	510
"Data House Rock"	828
Removing temp directory /tmp/wordcount.rambino.20220825.180315.875123...
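
Each result line above is a key and a value separated by a tab, and both are JSON-encoded, which is why the song names appear in double quotes. As a minimal sketch (the helper name `parse_output_line` is made up for illustration, and it assumes the job keeps mrjob's default output format), the lines can be turned back into Python values like this:

```python
import json

def parse_output_line(line):
    """Split one 'key<TAB>value' line and JSON-decode both halves."""
    raw_key, raw_value = line.rstrip("\n").split("\t", 1)
    return json.loads(raw_key), json.loads(raw_value)

# Result lines copied from the run above
sample_output = [
    '"Deep Dreams"\t1131',
    '"Broken Networks"\t510',
    '"Data House Rock"\t828',
]

play_counts = dict(parse_output_line(line) for line in sample_output)

# Print the songs from most played to least played
for song, plays in sorted(play_counts.items(), key=lambda item: item[1], reverse=True):
    print(f"{song}: {plays}")
```

Sorting is done here in plain Python because the job itself does not guarantee any particular order of the output keys.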


# Summary of what happens in the code (copied from Udacity course)

There is a list of songs in songplays.txt that looks like the following:

Deep Dreams
Data House Rock
Deep Dreams
Data House Rock
Broken Networks
Data House Rock
etc.....

During the map step, the code reads in the txt file one line at a time. The map step outputs a set of tuples that look like this:

(Deep Dreams, 1)  
(Data House Rock, 1)  
(Deep Dreams, 1)  
(Data House Rock, 1)  
(Broken Networks, 1)  
(Data House Rock, 1)  
etc.....
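
The map step can be imitated in plain Python without mrjob. This is only a sketch of the idea, not how mrjob runs it internally; the function name `map_step` and the six-line sample are assumptions taken from the listing above:

```python
def map_step(lines):
    """Yield one (song_name, 1) tuple per input line."""
    for line in lines:
        # Drop the trailing newline so the key is just the song name
        yield (line.rstrip("\n"), 1)

# The first six lines of songplays.txt as shown above
sample_lines = [
    "Deep Dreams\n",
    "Data House Rock\n",
    "Deep Dreams\n",
    "Data House Rock\n",
    "Broken Networks\n",
    "Data House Rock\n",
]

mapped = list(map_step(sample_lines))
for pair in mapped:
    print(pair)
```

Every line produces exactly one tuple, so the mapper output has as many entries as the input file has lines.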

Finally, all values are grouped by key and handed to the reduce step:  

(Deep Dreams, \[1, 1, 1, 1, 1, 1, ... \])  
(Data House Rock, \[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...\])  
(Broken Networks, \[1, 1, 1, ...\])  

The reduce step sums the values for each key, which gives the output:

(Deep Dreams, 1131)  
(Data House Rock, 828)  
(Broken Networks, 510)  
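
The grouping and summing can be sketched in plain Python. This is an illustration only, under the assumption that the mapped tuples for the first six lines of the file are already available; the names `group_by_key` and `reduce_step` are hypothetical and not part of mrjob:

```python
from collections import defaultdict

def group_by_key(pairs):
    """Collect all values that share a key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_step(key, values):
    """Sum the values for one key, mirroring the reducer in wordcount.py."""
    yield (key, sum(values))

# Mapper output for the first six lines of songplays.txt
mapped = [
    ("Deep Dreams", 1),
    ("Data House Rock", 1),
    ("Deep Dreams", 1),
    ("Data House Rock", 1),
    ("Broken Networks", 1),
    ("Data House Rock", 1),
]

grouped = group_by_key(mapped)

counts = []
for key, values in grouped.items():
    counts.extend(reduce_step(key, values))

for song, plays in counts:
    print(song, plays)
```

On the six sample lines this gives much smaller counts than the real run, since the totals reported earlier come from the full songplays.txt file.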