# <center> Introduction to Hadoop MapReduce </center>

## 2. Debugging Hadoop MapReduce Jobs

**Data: Movie Ratings and Recommendation**

An independent movie company is looking to invest in a new movie project. With limited finances, the company wants to 
analyze audience reactions, particularly toward various movie genres, in order to identify the most promising 
movie project to focus on. The company relies on data collected from a publicly available recommendation service 
by [MovieLens](http://dl.acm.org/citation.cfm?id=2827872). This 
[dataset](http://files.grouplens.org/datasets/movielens/ml-10m-README.html) contains **24404096** ratings and **668953** 
tag applications across **40110** movies. These data were created by **247753** users between January 09, 1995 and January 29, 2016. This dataset was generated on October 17, 2016. 

From this dataset, several analyses are possible, including the following:
1.   Find movies which have the highest average ratings over the years and identify the corresponding genre.
2.   Find genres which have the highest average ratings over the years.
3.   Find users who rate movies most frequently in order to contact them for in-depth marketing analysis.

These types of analyses, which are somewhat ambiguous, demand the ability to process large amounts of data in a 
relatively short amount of time for decision support purposes. In these situations, the size of the data typically 
makes analysis on a single machine impossible and analysis using a remote storage system impractical. For the 
remainder of the lessons, we will learn how HDFS provides the basis for storing massive amounts of data and enables 
the programming approach used to analyze these data.

In [1]:
!hdfs dfs -ls -h /repository/movielens

Found 7 items
-rw-r--r--   2 lngo hdfs-user      9.3 K 2017-03-15 09:49 /repository/movielens/README.txt
-rw-r--r--   2 lngo hdfs-user    317.9 M 2017-03-15 09:49 /repository/movielens/genome-scores.csv
-rw-r--r--   2 lngo hdfs-user     17.7 K 2017-03-15 09:49 /repository/movielens/genome-tags.csv
-rw-r--r--   2 lngo hdfs-user    839.2 K 2017-03-15 09:49 /repository/movielens/links.csv
-rw-r--r--   2 lngo hdfs-user      1.9 M 2017-03-15 09:49 /repository/movielens/movies.csv
-rw-r--r--   2 lngo hdfs-user    632.7 M 2017-03-15 09:49 /repository/movielens/ratings.csv
-rw-r--r--   2 lngo hdfs-user     22.9 M 2017-03-15 09:49 /repository/movielens/tags.csv


### Find movies which have the highest average ratings over the years and report their ratings and genres

- Find the average ratings of all movies over the years
- Sort the average ratings from highest to lowest
- Report the results, augmented by genres

In [None]:
!hdfs dfs -ls /repository/movielens

In [None]:
!hdfs dfs -cat /repository/movielens/README.txt

In [2]:
!hdfs dfs -cat /repository/movielens/links.csv \
    2>/dev/null | head -n 5

movieId,imdbId,tmdbId
1,0114709,862
2,0113497,8844
3,0113228,15602
4,0114885,31357


In [3]:
!hdfs dfs -cat /repository/movielens/movies.csv \
    2>/dev/null | head -n 5

movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance
4,Waiting to Exhale (1995),Comedy|Drama|Romance


In [4]:
!hdfs dfs -cat /repository/movielens/ratings.csv \
    2>/dev/null | head -n 5

userId,movieId,rating,timestamp
1,122,2.0,945544824
1,172,1.0,945544871
1,1221,5.0,945544788
1,1441,4.0,945544871


In [5]:
!hdfs dfs -cat /repository/movielens/tags.csv \
    2>/dev/null | head -n 5

userId,movieId,tag,timestamp
28,63062,angelina jolie,1263047558
40,4973,Poetic,1436439070
40,117533,privacy,1436439140
57,356,life positive,1291771526


### Note:

To write a MapReduce program, you first have to identify the (Key, Value) pairs needed to produce the required result; this determines the reducing phase. Working backward from that (Key, Value) format, you can then develop the mapping phase that emits those pairs.
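
For the average-rating analysis, one natural choice is key = movie and value = rating. The following is a minimal local sketch of that idea, not the scripts used later in this lesson: the function names `map_ratings` and `reduce_average` are illustrative, and the third sample row is made up so that one movie has two ratings. Sorting the mapper output stands in for the shuffle step that Hadoop performs between the two phases.

```python
from itertools import groupby
from operator import itemgetter


def map_ratings(lines):
    """Map phase: emit (movieId, rating) for every well-formed ratings.csv line."""
    for line in lines:
        fields = line.strip().split(",")
        try:
            yield fields[1], float(fields[2])
        except (IndexError, ValueError):
            continue  # header row or malformed line


def reduce_average(pairs):
    """Reduce phase: pairs arrive sorted by key; average the values of each key."""
    for movie_id, group in groupby(pairs, key=itemgetter(0)):
        ratings = [rating for _, rating in group]
        yield movie_id, sum(ratings) / len(ratings)


sample = [
    "userId,movieId,rating,timestamp",
    "1,122,2.0,945544824",
    "1,172,1.0,945544871",
    "2,122,4.0,945544900",  # made-up row for illustration
]

# Sorting by key plays the role of the shuffle between map and reduce.
shuffled = sorted(map_ratings(sample), key=itemgetter(0))
averages = dict(reduce_average(shuffled))
print(averages)
```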

In [6]:
%%writefile codes/avgRatingMapper01.py
#!/usr/bin/env python

import sys

for oneMovie in sys.stdin:
    oneMovie = oneMovie.strip()
    ratingInfo = oneMovie.split(",")
    movieID = ratingInfo[1]
    rating = ratingInfo[2]
    print ("%s\t%s" % (movieID, rating)) 

Writing codes/avgRatingMapper01.py


In [7]:
!hdfs dfs -cat /repository/movielens/ratings.csv \
    2>/dev/null | head -n 5 | python ./codes/avgRatingMapper01.py

movieId	rating
122	2.0
172	1.0
1221	5.0
1441	4.0


#### *Do we really need the headers?*

In [8]:
%%writefile codes/avgRatingMapper02.py
#!/usr/bin/env python

import sys

for oneMovie in sys.stdin:
    oneMovie = oneMovie.strip()
    ratingInfo = oneMovie.split(",")
    try:
        movieID = ratingInfo[1]
        rating = float(ratingInfo[2])
        print ("%s\t%s" % (movieID, rating))
    except ValueError:
        continue

Writing codes/avgRatingMapper02.py


In [9]:
!hdfs dfs -cat /repository/movielens/ratings.csv \
    2>/dev/null | head -n 5 | python ./codes/avgRatingMapper02.py

122	2.0
172	1.0
1221	5.0
1441	4.0


#### *The outcome is correct. Is it useful?*

The ratings file only contains movie IDs. To report titles and genres, we need an additional file, *movies.csv*:

In [10]:
!mkdir movielens
!hdfs dfs -get /repository/movielens/movies.csv movielens/movies.csv

In [11]:
%%writefile codes/avgRatingMapper03.py
#!/usr/bin/env python

import sys
import csv

movieFile = "./movielens/movies.csv"
movieList = {}

with open(movieFile, mode = 'r') as infile:
    reader = csv.reader(infile)
    for row in reader:
        movieList[row[0]] = {}
        movieList[row[0]]["title"] = row[1]
        movieList[row[0]]["genre"] = row[2]

for oneMovie in sys.stdin:
    oneMovie = oneMovie.strip()
    ratingInfo = oneMovie.split(",")
    try:
        movieTitle = movieList[ratingInfo[1]]["title"]
        movieGenre = movieList[ratingInfo[1]]["genre"]
        rating = float(ratingInfo[2])
        print ("%s\t%s\t%s" % (movieTitle, rating, movieGenre))
    except ValueError:
        continue

Writing codes/avgRatingMapper03.py


In [12]:
!hdfs dfs -cat /repository/movielens/ratings.csv \
    2>/dev/null | head -n 5 | python ./codes/avgRatingMapper03.py

Boomerang (1992)	2.0	Comedy|Romance
Johnny Mnemonic (1995)	1.0	Action|Sci-Fi|Thriller
Godfather: Part II, The (1974)	5.0	Crime|Drama
Benny & Joon (1993)	4.0	Comedy|Romance


#### *Test reducer:*

In [13]:
%%writefile codes/avgRatingReducer01.py
#!/usr/bin/env python
import sys

current_movie = None
current_rating_sum = 0
current_rating_count = 0

for line in sys.stdin:
    line = line.strip()
    movie, rating, genre = line.split("\t", 2)
    try:
        rating = float(rating)
    except ValueError:
        continue

    if current_movie == movie:
        current_rating_sum += rating
        current_rating_count += 1
    else:
        if current_movie:
            rating_average = current_rating_sum / current_rating_count
            print ("%s\t%s\t%s" % (current_movie, rating_average, genre))    
        current_movie = movie
        current_rating_sum = rating
        current_rating_count = 1

if current_movie == movie:
    rating_average = current_rating_sum / current_rating_count
    print ("%s\t%s\t%s" % (current_movie, rating_average, genre))


Writing codes/avgRatingReducer01.py


In [14]:
!hdfs dfs -cat /repository/movielens/ratings.csv 2>/dev/null \
    | head -n 5 \
    | python ./codes/avgRatingMapper03.py \
    | sort \
    | python ./codes/avgRatingReducer01.py

Benny & Joon (1993)	4.0	Comedy|Romance
Boomerang (1992)	2.0	Crime|Drama
Godfather: Part II, The (1974)	5.0	Action|Sci-Fi|Thriller
Johnny Mnemonic (1995)	1.0	Action|Sci-Fi|Thriller
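
Compare the genres above with the mapper output from cell 12: the mapper emitted *Boomerang (1992)* with `Comedy|Romance`, but the reducer reports `Crime|Drama`. When the key changes, *avgRatingReducer01.py* prints the `genre` of the line it has just read rather than the genre of the movie it is finishing. The sketch below shows one possible fix, which is to remember the genre together with the current movie. It is written as a function so it can be tested without Hadoop; the names `average_by_movie` and `current_genre` are illustrative and not part of the original scripts.

```python
def average_by_movie(lines):
    """Yield (movie, average rating, genre) from tab-separated lines sorted by movie."""
    current_movie = None
    current_genre = None
    rating_sum = 0.0
    rating_count = 0
    for line in lines:
        parts = line.rstrip("\n").split("\t", 2)
        if len(parts) != 3:
            continue  # skip malformed lines
        movie, rating, genre = parts
        try:
            rating = float(rating)
        except ValueError:
            continue
        if movie == current_movie:
            rating_sum += rating
            rating_count += 1
        else:
            if current_movie is not None:
                # Report the genre stored for the finished movie,
                # not the genre of the line that triggered the change.
                yield current_movie, rating_sum / rating_count, current_genre
            current_movie, current_genre = movie, genre
            rating_sum, rating_count = rating, 1
    if current_movie is not None:
        yield current_movie, rating_sum / rating_count, current_genre


# Sorted mapper output taken from the five-line test above.
sorted_mapper_output = [
    "Benny & Joon (1993)\t4.0\tComedy|Romance",
    "Boomerang (1992)\t2.0\tComedy|Romance",
    "Godfather: Part II, The (1974)\t5.0\tCrime|Drama",
    "Johnny Mnemonic (1995)\t1.0\tAction|Sci-Fi|Thriller",
]
results = list(average_by_movie(sorted_mapper_output))
for movie, average, genre in results:
    print("%s\t%s\t%s" % (movie, average, genre))
```

In a streaming reducer the same function could be fed `sys.stdin` instead of the sample list. Checking `current_movie is not None` at the end also avoids a `NameError` when the input is empty.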


#### Non-HDFS correctness test

In [15]:
!hdfs dfs -cat /repository/movielens/ratings.csv 2>/dev/null \
    | head -n 2000 \
    | python ./codes/avgRatingMapper03.py \
    | grep Matrix

Matrix Reloaded, The (2003)	4.0	Action|Adventure|Sci-Fi|Thriller|IMAX
Matrix, The (1999)	3.5	Action|Sci-Fi|Thriller
Matrix, The (1999)	3.5	Action|Sci-Fi|Thriller
Matrix, The (1999)	4.5	Action|Sci-Fi|Thriller
Matrix Reloaded, The (2003)	1.0	Action|Adventure|Sci-Fi|Thriller|IMAX
Matrix, The (1999)	5.0	Action|Sci-Fi|Thriller
Matrix Reloaded, The (2003)	5.0	Action|Adventure|Sci-Fi|Thriller|IMAX
Matrix Revolutions, The (2003)	2.5	Action|Adventure|Sci-Fi|Thriller|IMAX
Matrix, The (1999)	3.5	Action|Sci-Fi|Thriller


In [16]:
!hdfs dfs -cat /repository/movielens/ratings.csv 2>/dev/null \
    | head -n 2000 \
    | python ./codes/avgRatingMapper03.py \
    | grep Matrix \
    | sort \
    | python ./codes/avgRatingReducer01.py

Matrix Reloaded, The (2003)	3.3333333333333335	Action|Adventure|Sci-Fi|Thriller|IMAX
Matrix Revolutions, The (2003)	2.5	Action|Sci-Fi|Thriller
Matrix, The (1999)	4.0	Action|Sci-Fi|Thriller


In [19]:
# Manual calculation check via python
(4.0+1.0+5.0)/3

3.3333333333333335

#### Full execution on HDFS

In [20]:
!yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar \
    -input /repository/movielens/ratings.csv \
    -output intro-to-hadoop/output-movielens-01 \
    -file ./codes/avgRatingMapper03.py \
    -mapper avgRatingMapper03.py \
    -file ./codes/avgRatingReducer01.py \
    -reducer avgRatingReducer01.py

17/10/06 12:46:01 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [./codes/avgRatingMapper03.py, ./codes/avgRatingReducer01.py] [/usr/hdp/2.6.0.3-8/hadoop-mapreduce/hadoop-streaming-2.7.3.2.6.0.3-8.jar] /hadoop_java_io_tmpdir/streamjob9165299252749628929.jar tmpDir=null
17/10/06 12:46:03 INFO client.AHSProxy: Connecting to Application History server at dscim003.palmetto.clemson.edu/10.125.8.215:10200
17/10/06 12:46:03 INFO client.AHSProxy: Connecting to Application History server at dscim003.palmetto.clemson.edu/10.125.8.215:10200
17/10/06 12:46:03 INFO hdfs.DFSClient: Created HDFS_DELEGATION_TOKEN token 14322 for lngo on ha-hdfs:dsci
17/10/06 12:46:03 INFO security.TokenCache: Got dt for hdfs://dsci; Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:dsci, Ident: (HDFS_DELEGATION_TOKEN token 14322 for lngo)
17/10/06 12:46:04 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library
17/10/06 12:46:04 INFO lzo.LzoCodec: Successfull

Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
	at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:322)
	at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:535)
	at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
	at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
	at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
	at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:170)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:164)

17/10/06 12:46:28 INFO mapreduce.Job: Task Id : attempt_1505269880969_005

17/10/06 12:46:39 INFO mapreduce.Job: Counters: 18
	Job Counters 
		Failed map tasks=16
		Killed map tasks=4
		Killed reduce tasks=1
		Launched map tasks=20
		Other local map tasks=15
		Data-local map tasks=3
		Rack-local map tasks=2
		Total time spent by all maps in occupied slots (ms)=259143
		Total time spent by all reduces in occupied slots (ms)=0
		Total time spent by all map tasks (ms)=86381
		Total time spent by all reduce tasks (ms)=0
		Total vcore-milliseconds taken by all map tasks=86381
		Total vcore-milliseconds taken by all reduce tasks=0
		Total megabyte-milliseconds taken by all map tasks=1113623852
		Total megabyte-milliseconds taken by all reduce tasks=0
	Map-Reduce Framework
		CPU time spent (ms)=0
		Physical memory (bytes) snapshot=0
		Virtual memory (bytes) snapshot=0
17/10/06 12:46:39 ERROR streaming.StreamJob: Job not successful!
Streaming Command Failed!


#### 2.1.1 First Error!!!

Go back to the first few lines of the previous job output and look for the INFO line **Submitted application application_xxxx_xxxx**. Running the `yarn logs` command with that application ID is a straightforward way to access all available log information for the application. The syntax for viewing the YARN log is:

```
! yarn logs -applicationId APPLICATION_ID
```

In [None]:
# Run the yarn view log command here
# Do not run this command in a notebook browser, it will likely crash the browser
#!yarn logs -applicationId application_1476193845089_0123

However, this information is often massive, as it contains the aggregated logs from all tasks (map and reduce) of the job, which can number in the hundreds. The example below demonstrates this problem by displaying all the available information for a single MapReduce job.

In this example, the log of a container has three types of log (LogType): 
- stderr: Error messages from the actual task execution
- stdout: Print out messages if the task includes them
- syslog: Logging messages from the Hadoop MapReduce operation
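
Because each section of the aggregated log is labelled with its container and LogType, the sections can also be separated programmatically once the log has been saved to a file. The sketch below keeps only the `stderr` sections. It assumes the layout visible in the excerpts of this lesson (`Container:` header lines, then `LogType:` and `Log Contents:` markers); other Hadoop versions may format the log differently. The function name `stderr_sections` and the sample text, including its error message, are made up for illustration.

```python
def stderr_sections(log_text):
    """Return {container_id: stderr text} from saved `yarn logs` output."""
    sections = {}
    container = None
    log_type = None
    in_contents = False
    for line in log_text.splitlines():
        if line.startswith("Container: "):
            container = line.split()[1]
            log_type = None
            in_contents = False
        elif line.startswith("LogType:"):
            log_type = line[len("LogType:"):].strip()
            in_contents = False
        elif line.startswith("Log Contents:"):
            in_contents = True
        elif in_contents and log_type == "stderr" and container is not None:
            sections.setdefault(container, []).append(line)
    return {cid: "\n".join(lines) for cid, lines in sections.items()}


# Made-up sample that mimics the assumed layout.
sample_log = "\n".join([
    "Container: container_X on nodeA_45454_1",
    "LogAggregationType: AGGREGATED",
    "LogType:directory.info",
    "LogLength:10",
    "Log Contents:",
    "ls -l:",
    "LogType:stderr",
    "LogLength:20",
    "Log Contents:",
    "Traceback (most recent call last):",
    "IOError: hypothetical failure",
    "LogType:stdout",
    "LogLength:0",
    "Log Contents:",
])
errors = stderr_sections(sample_log)
print(errors)
```

To try this on a real job, the output of `yarn logs -applicationId APPLICATION_ID` could first be redirected to a file and then read into `log_text`.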

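One approach to reducing the amount of output is to filter out all non-essential lines (lines containing **INFO**). The few INFO lines that still appear at the top of the output most likely come from the `yarn` client itself, which writes them to stderr, so they bypass the pipe to `grep`.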

In [21]:
!yarn logs -applicationId application_1505269880969_0056 | grep -v INFO

17/10/06 12:49:36 INFO client.AHSProxy: Connecting to Application History server at dscim003.palmetto.clemson.edu/10.125.8.215:10200
17/10/06 12:49:38 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
17/10/06 12:49:38 INFO compress.CodecPool: Got brand-new decompressor [.deflate]
Container: container_e30_1505269880969_0056_01_000021 on dsci017.palmetto.clemson.edu_45454_1507308404811
LogAggregationType: AGGREGATED
LogType:directory.info
Log Upload Time:Fri Oct 06 12:46:44 -0400 2017
LogLength:15156
Log Contents:
ls -l:
total 36
lrwxrwxrwx 1 lngo hadoop   74 Oct  6 12:46 avgRatingMapper03.py -> /data08/hadoop/yarn/local/usercache/lngo/filecache/32/avgRatingMapper03.py
lrwxrwxrwx 1 lngo hadoop   75 Oct  6 12:46 avgRatingReducer01.py -> /data09/hadoop/yarn/local/usercache/lngo/filecache/33/avgRatingReducer01.py
-rw------- 1 lngo hadoop  368 Oct  6 12:46 container_tokens
lrwxrwxrwx 1 lngo hadoop  101 Oct  6 12:46 job.jar -> /data10/hadoop/yarn/local/usercache/ln

54657611    8 -r-xr-xr-x   1 yarn     hadoop       6546 Apr  1  2017 ./mr-framework/hadoop/sbin/hadoop-daemon.sh
54657599    4 -r-xr-xr-x   1 yarn     hadoop       1455 Apr  1  2017 ./mr-framework/hadoop/sbin/stop-dfs.cmd
54657600    4 -r-xr-xr-x   1 yarn     hadoop       1642 Apr  1  2017 ./mr-framework/hadoop/sbin/stop-yarn.cmd
54657603    4 -r-xr-xr-x   1 yarn     hadoop       2752 Apr  1  2017 ./mr-framework/hadoop/sbin/distribute-exclude.sh
54657591    4 -r-xr-xr-x   1 yarn     hadoop       1421 Apr  1  2017 ./mr-framework/hadoop/sbin/stop-yarn.sh
54657590    4 -r-xr-xr-x   1 yarn     hadoop       1552 Apr  1  2017 ./mr-framework/hadoop/sbin/start-all.sh
54657594    4 -r-xr-xr-x   1 yarn     hadoop       1770 Apr  1  2017 ./mr-framework/hadoop/sbin/stop-all.cmd
54657609    4 -r-xr-xr-x   1 yarn     hadoop       1421 Apr  1  2017 ./mr-framework/hadoop/sbin/stop-secure-dns.sh
54657592    4 -r-xr-xr-x   1 yarn     hadoop       1779 Apr  1  2017 ./mr-framework/hadoop/sbin/star

Container: container_e30_1505269880969_0056_01_000009 on dsci019.palmetto.clemson.edu_45454_1507308404894
LogAggregationType: AGGREGATED
LogType:directory.info
Log Upload Time:Fri Oct 06 12:46:44 -0400 2017
LogLength:15274
Log Contents:
ls -l:
total 36
lrwxrwxrwx 1 lngo hadoop   74 Oct  6 12:46 avgRatingMapper03.py -> /data05/hadoop/yarn/local/usercache/lngo/filecache/20/avgRatingMapper03.py
lrwxrwxrwx 1 lngo hadoop   75 Oct  6 12:46 avgRatingReducer01.py -> /data06/hadoop/yarn/local/usercache/lngo/filecache/21/avgRatingReducer01.py
-rw------- 1 lngo hadoop  368 Oct  6 12:46 container_tokens
lrwxrwxrwx 1 lngo hadoop  101 Oct  6 12:46 job.jar -> /data07/hadoop/yarn/local/usercache/lngo/appcache/application_1505269880969_0056/filecache/10/job.jar
lrwxrwxrwx 1 lngo hadoop  101 Oct  6 12:46 job.xml -> /data08/hadoop/yarn/local/usercache/lngo/appcache/application_1505269880969_0056/filecache/11/job.xml
-rwx------ 1 lngo hadoop 8342 Oct  6 12:46 launch_container.sh
lrwxrwxrwx 1

Container: container_e30_1505269880969_0056_01_000006 on dsci024.palmetto.clemson.edu_45454_1507308404632
LogAggregationType: AGGREGATED
LogType:directory.info
Log Upload Time:Fri Oct 06 12:46:44 -0400 2017
LogLength:15158
Log Contents:
ls -l:
total 36
lrwxrwxrwx 1 lngo hadoop   74 Oct  6 12:46 avgRatingMapper03.py -> /data09/hadoop/yarn/local/usercache/lngo/filecache/20/avgRatingMapper03.py
lrwxrwxrwx 1 lngo hadoop   75 Oct  6 12:46 avgRatingReducer01.py -> /data10/hadoop/yarn/local/usercache/lngo/filecache/21/avgRatingReducer01.py
-rw------- 1 lngo hadoop  368 Oct  6 12:46 container_tokens
lrwxrwxrwx 1 lngo hadoop  101 Oct  6 12:46 job.jar -> /data11/hadoop/yarn/local/usercache/lngo/appcache/application_1505269880969_0056/filecache/10/job.jar
lrwxrwxrwx 1 lngo hadoop  101 Oct  6 12:46 job.xml -> /data12/hadoop/yarn/local/usercache/lngo/appcache/application_1505269880969_0056/filecache/11/job.xml
-rwx------ 1 lngo hadoop 8342 Oct  6 12:46 launch_container.sh
lrwxrwxrwx 1

Container: container_e30_1505269880969_0056_01_000008 on dsci029.palmetto.clemson.edu_45454_1507308404724
LogAggregationType: AGGREGATED
LogType:directory.info
Log Upload Time:Fri Oct 06 12:46:44 -0400 2017
LogLength:15157
Log Contents:
ls -l:
total 36
lrwxrwxrwx 1 lngo hadoop   74 Oct  6 12:46 avgRatingMapper03.py -> /data09/hadoop/yarn/local/usercache/lngo/filecache/23/avgRatingMapper03.py
lrwxrwxrwx 1 lngo hadoop   75 Oct  6 12:46 avgRatingReducer01.py -> /data10/hadoop/yarn/local/usercache/lngo/filecache/24/avgRatingReducer01.py
-rw------- 1 lngo hadoop  368 Oct  6 12:46 container_tokens
lrwxrwxrwx 1 lngo hadoop  101 Oct  6 12:46 job.jar -> /data11/hadoop/yarn/local/usercache/lngo/appcache/application_1505269880969_0056/filecache/10/job.jar
lrwxrwxrwx 1 lngo hadoop  101 Oct  6 12:46 job.xml -> /data12/hadoop/yarn/local/usercache/lngo/appcache/application_1505269880969_0056/filecache/11/job.xml
-rwx------ 1 lngo hadoop 8342 Oct  6 12:46 launch_container.sh
lrwxrwxrwx 1

export LOG_DIRS="/data01/hadoop/yarn/log/application_1505269880969_0056/container_e30_1505269880969_0056_01_000016,/data02/hadoop/yarn/log/application_1505269880969_0056/container_e30_1505269880969_0056_01_000016,/data03/hadoop/yarn/log/application_1505269880969_0056/container_e30_1505269880969_0056_01_000016,/data04/hadoop/yarn/log/application_1505269880969_0056/container_e30_1505269880969_0056_01_000016,/data05/hadoop/yarn/log/application_1505269880969_0056/container_e30_1505269880969_0056_01_000016,/data06/hadoop/yarn/log/application_1505269880969_0056/container_e30_1505269880969_0056_01_000016,/data07/hadoop/yarn/log/application_1505269880969_0056/container_e30_1505269880969_0056_01_000016,/data08/hadoop/yarn/log/application_1505269880969_0056/container_e30_1505269880969_0056_01_000016,/data09/hadoop/yarn/log/application_1505269880969_0056/container_e30_1505269880969_0056_01_000016,/data10/hadoop/yarn/log/application_1505269880969_0056/container_e30_1505269880969_0056_01_000016,/da

Container: container_e30_1505269880969_0056_01_000020 on dsci034.palmetto.clemson.edu_45454_1507308404402
LogAggregationType: AGGREGATED
LogType:directory.info
Log Upload Time:Fri Oct 06 12:46:44 -0400 2017
LogLength:15162
Log Contents:
ls -l:
total 36
lrwxrwxrwx 1 lngo hadoop   74 Oct  6 12:46 avgRatingMapper03.py -> /data09/hadoop/yarn/local/usercache/lngo/filecache/16/avgRatingMapper03.py
lrwxrwxrwx 1 lngo hadoop   75 Oct  6 12:46 avgRatingReducer01.py -> /data10/hadoop/yarn/local/usercache/lngo/filecache/17/avgRatingReducer01.py
-rw------- 1 lngo hadoop  368 Oct  6 12:46 container_tokens
lrwxrwxrwx 1 lngo hadoop  101 Oct  6 12:46 job.jar -> /data11/hadoop/yarn/local/usercache/lngo/appcache/application_1505269880969_0056/filecache/10/job.jar
lrwxrwxrwx 1 lngo hadoop  101 Oct  6 12:46 job.xml -> /data12/hadoop/yarn/local/usercache/lngo/appcache/application_1505269880969_0056/filecache/11/job.xml
-rwx------ 1 lngo hadoop 8342 Oct  6 12:46 launch_container.sh
lrwxrwxrwx 1

Container: container_e30_1505269880969_0056_01_000017 on dsci037.palmetto.clemson.edu_45454_1507308404735
LogAggregationType: AGGREGATED
LogType:directory.info
Log Upload Time:Fri Oct 06 12:46:44 -0400 2017
LogLength:15152
Log Contents:
ls -l:
total 36
lrwxrwxrwx 1 lngo hadoop   74 Oct  6 12:46 avgRatingMapper03.py -> /data02/hadoop/yarn/local/usercache/lngo/filecache/19/avgRatingMapper03.py
lrwxrwxrwx 1 lngo hadoop   75 Oct  6 12:46 avgRatingReducer01.py -> /data03/hadoop/yarn/local/usercache/lngo/filecache/20/avgRatingReducer01.py
-rw------- 1 lngo hadoop  368 Oct  6 12:46 container_tokens
lrwxrwxrwx 1 lngo hadoop  101 Oct  6 12:46 job.jar -> /data04/hadoop/yarn/local/usercache/lngo/appcache/application_1505269880969_0056/filecache/10/job.jar
lrwxrwxrwx 1 lngo hadoop  101 Oct  6 12:46 job.xml -> /data05/hadoop/yarn/local/usercache/lngo/appcache/application_1505269880969_0056/filecache/11/job.xml
-rwx------ 1 lngo hadoop 8342 Oct  6 12:46 launch_container.sh
lrwxrwxrwx 1

Container: container_e30_1505269880969_0056_01_000019 on dsci039.palmetto.clemson.edu_45454_1507308404706
LogAggregationType: AGGREGATED
LogType:directory.info
Log Upload Time:Fri Oct 06 12:46:44 -0400 2017
LogLength:15160
Log Contents:
ls -l:
total 36
lrwxrwxrwx 1 lngo hadoop   74 Oct  6 12:46 avgRatingMapper03.py -> /data06/hadoop/yarn/local/usercache/lngo/filecache/26/avgRatingMapper03.py
lrwxrwxrwx 1 lngo hadoop   75 Oct  6 12:46 avgRatingReducer01.py -> /data07/hadoop/yarn/local/usercache/lngo/filecache/27/avgRatingReducer01.py
-rw------- 1 lngo hadoop  368 Oct  6 12:46 container_tokens
lrwxrwxrwx 1 lngo hadoop  101 Oct  6 12:46 job.jar -> /data08/hadoop/yarn/local/usercache/lngo/appcache/application_1505269880969_0056/filecache/10/job.jar
lrwxrwxrwx 1 lngo hadoop  101 Oct  6 12:46 job.xml -> /data09/hadoop/yarn/local/usercache/lngo/appcache/application_1505269880969_0056/filecache/11/job.xml
-rwx------ 1 lngo hadoop 8342 Oct  6 12:46 launch_container.sh
lrwxrwxrwx 1

Can we refine the information further?
- In a MapReduce setting, containers (often) execute the same task.
- Can we extract only the messages listing the Container IDs?

~~~
!yarn logs -applicationId APPLICATION_ID | grep '^Container:'
~~~

In [22]:
!yarn logs -applicationId application_1505269880969_0056 | grep '^Container:'

17/10/06 12:50:39 INFO client.AHSProxy: Connecting to Application History server at dscim003.palmetto.clemson.edu/10.125.8.215:10200
17/10/06 12:50:41 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
17/10/06 12:50:41 INFO compress.CodecPool: Got brand-new decompressor [.deflate]
Container: container_e30_1505269880969_0056_01_000021 on dsci017.palmetto.clemson.edu_45454_1507308404811
Container: container_e30_1505269880969_0056_01_000013 on dsci017.palmetto.clemson.edu_45454_1507308404811
Container: container_e30_1505269880969_0056_01_000007 on dsci018.palmetto.clemson.edu_45454_1507308404709
Container: container_e30_1505269880969_0056_01_000011 on dsci018.palmetto.clemson.edu_45454_1507308404709
Container: container_e30_1505269880969_0056_01_000009 on dsci019.palmetto.clemson.edu_45454_1507308404894
Container: container_e30_1505269880969_0056_01_000018 on dsci020.palmetto.clemson.edu_45454_1507308403987
Container: container_e30_1505269880969_0056_01_000001 o

Looking at the previous report, we can further identify container information:

```
Container: container_XXXXXX on  YYYY.palmetto.clemson.edu_ZZZZZ
```

- Container ID: container_XXXXXX
- Address of node where container is placed: YYYY.palmetto.clemson.edu
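
The two fields can also be pulled out of each `Container:` line programmatically. This is a minimal sketch, assuming the line layout shown above, where the node address is followed by underscore-separated numeric fields (their exact meaning is an assumption here). The function name `parse_container_line` is illustrative.

```python
def parse_container_line(line):
    """Return (container_id, node_address) for a 'Container:' line, else None."""
    prefix = "Container: "
    if not line.startswith(prefix):
        return None
    container_id, _, node = line[len(prefix):].strip().partition(" on ")
    # Host names do not contain underscores, so everything before the
    # first underscore is taken to be the node address.
    node_address = node.strip().split("_")[0]
    return container_id, node_address


example = ("Container: container_e30_1505269880969_0056_01_000021 "
           "on dsci017.palmetto.clemson.edu_45454_1507308404811")
print(parse_container_line(example))
```

The resulting pair corresponds to the `CONTAINER_ID` and `NODE_ADDRESS` placeholders of a container-level `yarn logs` call.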

To request yarn to provide a more detailed log at container level, we run:
```
!yarn logs -applicationId APPLICATION_ID -containerId CONTAINER_ID --nodeAddress NODE_ADDRESS \
    | grep -v INFO
```

In [25]:
!yarn logs -applicationId application_1505269880969_0056 \
    -containerId container_e30_1505269880969_0056_01_000012 \
    --nodeAddress dsci035.palmetto.clemson.edu \
    | grep -v INFO

17/10/06 12:53:32 INFO client.AHSProxy: Connecting to Application History server at dscim003.palmetto.clemson.edu/10.125.8.215:10200
17/10/06 12:53:33 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
17/10/06 12:53:33 INFO compress.CodecPool: Got brand-new decompressor [.deflate]
Container: container_e30_1505269880969_0056_01_000012 on dsci035.palmetto.clemson.edu_45454_1507308404608
LogAggregationType: AGGREGATED
17/10/06 12:53:33 INFO compress.CodecPool: Got brand-new decompressor [.deflate]
LogType:directory.info
Log Upload Time:Fri Oct 06 12:46:44 -0400 2017
LogLength:15279
Log Contents:
ls -l:
total 36
lrwxrwxrwx 1 lngo hadoop   74 Oct  6 12:46 avgRatingMapper03.py -> /data01/hadoop/yarn/local/usercache/lngo/filecache/28/avgRatingMapper03.py
lrwxrwxrwx 1 lngo hadoop   75 Oct  6 12:46 avgRatingReducer01.py -> /data02/hadoop/yarn/local/usercache/lngo/filecache/29/avgRatingReducer01.py
-rw------- 1 lngo hadoop  368 Oct  6 12:46 container_tokens
lrwxrwxrwx 

This error message gives us some insights into the mechanism of Hadoop MapReduce. 
- Where are the map and reduce python scripts located?
- Where would the *movies.csv* file be, if the *-file* flag is used to upload this file?
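
The `ls -l` listing in the log shows the shipped scripts as symbolic links in the container's working directory, which suggests that a file shipped the same way can be opened by its bare name. The sketch below is a hedged illustration of loading such a side file defensively: if the file is missing, it writes the contents of the working directory to stderr, which ends up in the task's stderr log. The function name `load_movies` and the sample rows are illustrative; the temporary file only stands in for *movies.csv* so the example runs anywhere.

```python
import csv
import os
import sys
import tempfile


def load_movies(path="./movies.csv"):
    """Read movies.csv into {movieId: (title, genres)}; explain failures on stderr."""
    if not os.path.exists(path):
        # Messages written to stderr appear in the task's stderr log.
        sys.stderr.write("cannot find %s; working directory contains: %s\n"
                         % (path, sorted(os.listdir("."))))
        raise SystemExit(1)
    movies = {}
    with open(path, mode="r") as infile:
        for row in csv.reader(infile):
            if len(row) >= 3:
                movies[row[0]] = (row[1], row[2])
    return movies


# Stand-in for movies.csv with sample rows; the csv module handles the quoted comma.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write('movieId,title,genres\n'
              '1,Toy Story (1995),Adventure|Animation\n'
              '11,"American President, The (1995)",Comedy|Drama|Romance\n')
sample_movies = load_movies(tmp.name)
os.remove(tmp.name)
print(sample_movies["11"])
```

With a check like this, a missing file stops the task with a readable message in its stderr log instead of only the generic `subprocess failed with code 1` reported by the streaming job.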

In [26]:
%%writefile codes/avgRatingMapper04.py
#!/usr/bin/env python

import sys
import csv

movieFile = "./movies.csv"
movieList = {}

with open(movieFile, mode = 'r') as infile:
    reader = csv.reader(infile)
    for row in reader:
        movieList[row[0]] = {}
        movieList[row[0]]["title"] = row[1]
        movieList[row[0]]["genre"] = row[2]

for oneMovie in sys.stdin:
    oneMovie = oneMovie.strip()
    ratingInfo = oneMovie.split(",")
    try:
        movieTitle = movieList[ratingInfo[1]]["title"]
        movieGenre = movieList[ratingInfo[1]]["genre"]
        rating = float(ratingInfo[2])
        print ("%s\t%s\t%s" % (movieTitle, rating, movieGenre))
    except ValueError:
        continue

Writing codes/avgRatingMapper04.py


In [27]:
!yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar \
    -input /repository/movielens/ratings.csv \
    -output intro-to-hadoop/output-movielens-01 \
    -file ./codes/avgRatingMapper04.py \
    -mapper avgRatingMapper04.py \
    -file ./codes/avgRatingReducer01.py \
    -reducer avgRatingReducer01.py \
    -file ./movielens/movies.csv

17/10/06 12:56:09 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [./codes/avgRatingMapper04.py, ./codes/avgRatingReducer01.py, ./movielens/movies.csv] [/usr/hdp/2.6.0.3-8/hadoop-mapreduce/hadoop-streaming-2.7.3.2.6.0.3-8.jar] /hadoop_java_io_tmpdir/streamjob512904205931449469.jar tmpDir=null
17/10/06 12:56:11 INFO client.AHSProxy: Connecting to Application History server at dscim003.palmetto.clemson.edu/10.125.8.215:10200
17/10/06 12:56:11 INFO client.AHSProxy: Connecting to Application History server at dscim003.palmetto.clemson.edu/10.125.8.215:10200
17/10/06 12:56:12 INFO hdfs.DFSClient: Created HDFS_DELEGATION_TOKEN token 14333 for lngo on ha-hdfs:dsci
17/10/06 12:56:12 INFO security.TokenCache: Got dt for hdfs://dsci; Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:dsci, Ident: (HDFS_DELEGATION_TOKEN token 14333 for lngo)
17/10/06 12:56:12 ERROR streaming.StreamJob: Error Launching job : Output directory hdfs://dsci/use

#### 2.1.2 Second Error!!!

- Files in HDFS are write-once, and MapReduce refuses to overwrite an existing output directory. Therefore, the output directory must not exist prior to job submission.
- This can be resolved either by specifying a new output directory or by deleting the existing output directory.
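
For the second option, the existing directory can be removed with `hdfs dfs -rm -r`. The sketch below wraps that command in Python so it can be reused from a script; the helper names `hdfs_rm_command` and `remove_output_dir` are illustrative. It assumes the `hdfs` client is on the `PATH`, and the removal is only executed when `remove_output_dir` is called.

```python
import subprocess


def hdfs_rm_command(output_dir):
    """Build the command that recursively removes an HDFS directory."""
    return ["hdfs", "dfs", "-rm", "-r", output_dir]


def remove_output_dir(output_dir):
    """Run the removal; returns the exit code of the hdfs client."""
    return subprocess.call(hdfs_rm_command(output_dir))


command = hdfs_rm_command("intro-to-hadoop/output-movielens-01")
print(" ".join(command))
```

In a notebook cell, the equivalent is `!hdfs dfs -rm -r intro-to-hadoop/output-movielens-01`. Deleting is irreversible for practical purposes, so double-check the path before running it.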

In [28]:
!yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar \
    -input /repository/movielens/ratings.csv \
    -output intro-to-hadoop/output-movielens-02 \
    -file ./codes/avgRatingMapper04.py \
    -mapper avgRatingMapper04.py \
    -file ./codes/avgRatingReducer01.py \
    -reducer avgRatingReducer01.py \
    -file ./movielens/movies.csv

17/10/06 12:56:52 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [./codes/avgRatingMapper04.py, ./codes/avgRatingReducer01.py, ./movielens/movies.csv] [/usr/hdp/2.6.0.3-8/hadoop-mapreduce/hadoop-streaming-2.7.3.2.6.0.3-8.jar] /hadoop_java_io_tmpdir/streamjob5078678851774131449.jar tmpDir=null
17/10/06 12:56:54 INFO client.AHSProxy: Connecting to Application History server at dscim003.palmetto.clemson.edu/10.125.8.215:10200
17/10/06 12:56:55 INFO client.AHSProxy: Connecting to Application History server at dscim003.palmetto.clemson.edu/10.125.8.215:10200
17/10/06 12:56:55 INFO hdfs.DFSClient: Created HDFS_DELEGATION_TOKEN token 14338 for lngo on ha-hdfs:dsci
17/10/06 12:56:55 INFO security.TokenCache: Got dt for hdfs://dsci; Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:dsci, Ident: (HDFS_DELEGATION_TOKEN token 14338 for lngo)
17/10/06 12:56:56 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library
17/10/06 12:56:56 INFO l

In [29]:
!hdfs dfs -ls intro-to-hadoop/output-movielens-02

Found 2 items
-rw-r--r--   2 lngo hdfs          0 2017-10-06 12:58 intro-to-hadoop/output-movielens-02/_SUCCESS
-rw-r--r--   2 lngo hdfs    2045080 2017-10-06 12:58 intro-to-hadoop/output-movielens-02/part-00000


In [30]:
!hdfs dfs -cat intro-to-hadoop/output-movielens-02/part-00000 \
    2>/dev/null | head -n 20

"Great Performances" Cats (1998)	2.78199052133	Comedy|Drama
#1 Cheerleader Camp (2010)	2.75	Drama|Horror|Mystery|Thriller
#Horror (2015)	2.22222222222	Documentary
#chicagoGirl: The Social Network Takes on a Dictator (2013)	3.66666666667	Comedy|Crime|Drama
$ (Dollars) (1971)	2.75	Western
$1,000 on the Black (1966)	3.0	Drama|Western
$100,000 for Ringo (1965)	2.5	Comedy|Drama
$5 a Day (2008)	2.97169811321	Drama
$50K and a Call Girl: A Love Story (2014)	3.75	Animation
$9.99 (2008)	3.13846153846	Documentary
$ellebrity (Sellebrity) (2012)	2.25	Comedy|Western
'49-'17 (1917)	2.5	Action|Drama|Thriller|War
'71 (2014)	3.69689119171	Action|Adventure|Comedy|Documentary|Fantasy
'Hellboy': The Seeds of Creation (2004)	3.05909090909	Drama|Thriller
'Human' Factor, The (Human Factor, The) (1975)	2.25	Drama
'Master Harold'... and the Boys (1985)	3.5	Western
'Neath the Arizona Skies (1934)	2.29166666667	Action
'Pimpernel' Smith (1941)	3.0	Crime|Drama
'R Xmas (2001)	2.75	Drama|Musical
'R

### Challenge:

1. Modify *avgRatingReducer02.py* so that only movies with average ratings higher than 3.75 are collected.
2. Further enhance your modification so that the collected movies not only have average ratings higher than 3.75 but have also been rated at least 5000 times.

In [None]:
%%writefile codes/avgRatingMapper04challenge.py
#!/usr/bin/env python

import sys
import csv

movieFile = "./movies.csv"
movieList = {}


with open(movieFile, mode = 'r') as infile:
    reader = csv.reader(infile)
    for row in reader:
        movieList[row[0]] = {}
        movieList[row[0]]["title"] = row[1]
        movieList[row[0]]["genre"] = row[2]

for oneMovie in sys.stdin:
    oneMovie = oneMovie.strip()
    ratingInfo = oneMovie.split(",")
    try:
        movieTitle = movieList[ratingInfo[1]]["title"]
        movieGenre = movieList[ratingInfo[1]]["genre"]
        rating = float(ratingInfo[2])
        if _________:
            print ("%s\t%s\t%s" % (movieTitle, rating, movieGenre))
    except ValueError:
        continue

In [None]:
!yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar \
    -input /repository/movielens/ratings.csv \
    -output intro-to-hadoop/output-movielens-challenge \
    -file ____________ \
    -mapper ___________ \
    -file ./codes/avgRatingReducer01.py \
    -reducer avgRatingReducer01.py \
    -file ./movielens/movies.csv