# Exercise: UNIX, HDFS and Map-Reduce

The following are ideas to test our newly gained knowledge:

## Warm-up
1. Create a folder structure under your user directory on the HDFS, and load some data files (e.g. Yelp) from the local file system. Split the data files into smaller chunks, and `gzip` them.
2. Extract CSV tables from Yelp or Twitter JSON files using `jq` or `python`.
3. Perform aggregations and frequency counts using UNIX command line tools and pipes.
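
As a starting point for the CSV extraction in item 2, here is a minimal sketch in Python. It assumes Python 3 (the notebook cells further down use Python 2 `print` statements), one JSON object per input line, and a hypothetical helper name `json_lines_to_csv`. The two sample records are made up for illustration and are not taken from the Yelp data.

```python
import csv
import io
import json


def json_lines_to_csv(lines, fields, out):
    """Write one CSV row per valid JSON line, keeping only the named fields."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(fields)              # header row
    rows = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue                     # ignore blank lines
        try:
            record = json.loads(line)
        except ValueError:
            continue                     # skip malformed JSON
        writer.writerow([record.get(name, "") for name in fields])
        rows += 1
    return rows


# Made-up records shaped like the user file (illustration only).
sample = [
    '{"user_id": "u1", "name": "Ann", "friends": ["u2"]}',
    '{"user_id": "u2", "name": "Bob", "friends": []}',
]
buf = io.StringIO()
count = json_lines_to_csv(sample, ["user_id", "name"], buf)
print(buf.getvalue())
```

For a real file, `lines` could be an open file handle or `sys.stdin`, and `out` could be `sys.stdout`.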


## Walk
1. Write a Map-Reduce mapper to filter and convert rows from a single data file, and produce a (small) CSV table. Run Map-Reduce with data on the HDFS, then move the resulting CSV table onto the local file system.
2. Write a mapper and reducer to count words in Yelp reviews. Produce a list with the 200 most frequent words.
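
One possible shape for the word-count exercise is sketched below, assuming Python 3 and review records with a `text` field. The function names `mapper` and `reducer` and the simple `[a-z]+` tokenizer are illustrative choices, and the sample reviews are made up. The pipeline is simulated locally with `sorted()`, which stands in for Hadoop's shuffle-and-sort step.

```python
import json
import re
from itertools import groupby
from operator import itemgetter

WORD = re.compile(r"[a-z]+")             # naive tokenizer (an assumption)


def mapper(lines):
    """Emit 'word<TAB>1' for every word in the 'text' field of each JSON line."""
    for line in lines:
        try:
            text = json.loads(line)["text"]
        except (ValueError, KeyError):
            continue                     # skip malformed or non-review lines
        for word in WORD.findall(text.lower()):
            yield "%s\t1" % word


def reducer(sorted_lines):
    """Sum the counts per word; the input must already be sorted by word."""
    pairs = (line.split("\t") for line in sorted_lines)
    for word, group in groupby(pairs, key=itemgetter(0)):
        yield word, sum(int(count) for _, count in group)


# Made-up reviews (illustration only).
sample = [
    '{"text": "Good food, good service"}',
    '{"text": "Service was slow"}',
]
counts = list(reducer(sorted(mapper(sample))))
# Most frequent first; ties broken alphabetically. Use [:200] for the exercise.
top = sorted(counts, key=lambda kv: (-kv[1], kv[0]))[:2]
print(top)
```

In a streaming job the two functions would live in separate scripts that read `sys.stdin` and print tab-separated lines.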

## Run
1. Join tables using Map-Reduce (use Yelp ... and incorporate JSON extraction)
2. Write a sentiment analysis script in Python, and use Map-Reduce to apply to the Yelp reviews. **Run faster:** Produce a table with 'Sentiment' and 'Rating' from Yelp reviews.
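
For the join in item 1, a common approach is a reduce-side join: the mapper tags each record with its source table, and the reducer combines the records that share a key. The sketch below assumes Python 3, assumes that review records carry `'type': 'review'` and that everything else is a user record, and uses made-up sample data. It sorts the full `(key, tag, value)` tuple locally; in a real streaming job the order of tags within a key is not guaranteed unless a secondary sort is configured.

```python
import json
from itertools import groupby


def tag_record(line):
    """Mapper step: return (user_id, tag, value); 'A' = user row, 'B' = review row."""
    r = json.loads(line)
    if r.get("type") == "review":        # assumption: reviews are marked this way
        return (r["user_id"], "B", str(r["stars"]))
    return (r["user_id"], "A", r["name"])


def join_reducer(sorted_records):
    """Inner join: emit (user_id, name, stars) for reviews whose user is known."""
    for user_id, group in groupby(sorted_records, key=lambda rec: rec[0]):
        name = None
        for _, tag, value in group:
            if tag == "A":
                name = value             # the user row sorts before the reviews
            elif name is not None:
                yield (user_id, name, value)


# Made-up records (illustration only).
sample = [
    '{"type": "user", "user_id": "u1", "name": "Ann"}',
    '{"type": "user", "user_id": "u2", "name": "Bob"}',
    '{"type": "review", "user_id": "u1", "stars": 4}',
    '{"type": "review", "user_id": "u1", "stars": 5}',
    '{"type": "review", "user_id": "u3", "stars": 2}',
]
joined = list(join_reducer(sorted(tag_record(line) for line in sample)))
print(joined)
```

The review by `u3` is dropped because no matching user record exists; `u2` produces no output because that user has no reviews.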


In [1]:
%%sh
ls -l 

total 8
-rw-rw-r-- 1 pmolnar pmolnar 1058 Jan 14 09:01 README.md
-rw-r--r-- 1 pmolnar pmolnar 2196 Jan 14 09:06 Untitled.ipynb


## JSON Decoding

In [2]:
import json
import os, sys

In [None]:
uf = '/home/data/yelp/yelp_dataset_challenge_academic_dataset/yelp_academic_dataset_user.json'

In [9]:
datadir = '/home/data/yelp/yelp_dataset_challenge_academic_dataset/'
with open(datadir+'yelp_academic_dataset_user.json') as f:
    c = 0
    for lin in f:   # iterate over the opened file, one JSON record per line
        r = json.loads(lin)
        print r['user_id'], r['name'], len(r['friends'])
        c += 1
        if c>3:
            break
        

18kPq7GPye-YQ3LyKyAZPw Russel 200
rpOyqD_893cqmDAtJLbdog Jeremy 1939
4U9kSBLuBDU391x6bxU-YA Michael 422
fHtTaujcyKvXglE33Z5yIw Ken 4


In [19]:
"""
{
    'type': 'review',
    'business_id': (encrypted business id),
    'user_id': (encrypted user id),
    'stars': (star rating, rounded to half-stars),
    'text': (review text),
    'date': (date, formatted like '2012-03-14'),
    'votes': {(vote type): (count)},
}
"""
datadir = '/home/data/yelp/yelp_dataset_challenge_academic_dataset/'
with open(datadir+'yelp_academic_dataset_review.json') as f:
    c = 0
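    for lin in f:   # stream the file instead of loading it all with readlines()
        try:
            r = json.loads(lin.strip())
            print 'valid record', r['business_id'], r['user_id'], r['stars'], r['text'][:20]
        except (ValueError, KeyError):   # malformed JSON or a missing field
            print 'invalid record'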
        c += 1
        if c>3:
            break
                

5UmKMjUEUNdYWqANhGckJw PUFPaY9KxDAcGqfsorJp3Q 4 Mr Hoagie is an inst
5UmKMjUEUNdYWqANhGckJw Iu6AxdBYGR4A0wspR9BYHA 5 Excellent food. Supe
5UmKMjUEUNdYWqANhGckJw auESFwWvW42h6alXgFxAXQ 5 Yes this place is a 
5UmKMjUEUNdYWqANhGckJw qiczib2fO_1VBG8IoCGvVg 3 PROS: Italian hoagie


Let's write a mapper that pulls out users and their friends.
First, we'll try the logic directly on the file, limited to a few users and friends.

In [24]:
datadir = '/home/data/yelp/yelp_dataset_challenge_academic_dataset/'
with open(datadir+'yelp_academic_dataset_user.json') as stream:
    counter = 0
    for line in stream:
        r = json.loads(line)
        friends_counter = 0
        for f in r['friends']:
            ## print a tab-separated line from a list of values
            print '\t'.join([r['name'], r['user_id'], f])
            friends_counter +=1
            if friends_counter>10:
                break
        counter += 1
        if counter>10:
            break
        


Russel	18kPq7GPye-YQ3LyKyAZPw	rpOyqD_893cqmDAtJLbdog
Russel	18kPq7GPye-YQ3LyKyAZPw	4U9kSBLuBDU391x6bxU-YA
Russel	18kPq7GPye-YQ3LyKyAZPw	fHtTaujcyKvXglE33Z5yIw
Russel	18kPq7GPye-YQ3LyKyAZPw	8J4IIYcqBlFch8T90N923A
Russel	18kPq7GPye-YQ3LyKyAZPw	wy6l_zUo7SN0qrvNRWgySw
Russel	18kPq7GPye-YQ3LyKyAZPw	HDQixQ-WZEV0LVPJlIGQeQ
Russel	18kPq7GPye-YQ3LyKyAZPw	T4kuUr_iJiywOPdyM7gTHQ
Russel	18kPq7GPye-YQ3LyKyAZPw	z_5D4XEIlGAPjG3Os9ix5A
Russel	18kPq7GPye-YQ3LyKyAZPw	i63u3SdbrLsP4FxiSKP0Zw
Russel	18kPq7GPye-YQ3LyKyAZPw	pnrGw4ciBXJ6U5QB2m0F5g
Russel	18kPq7GPye-YQ3LyKyAZPw	ytjCBxosVSqCOQ62c4KAxg
Jeremy	rpOyqD_893cqmDAtJLbdog	18kPq7GPye-YQ3LyKyAZPw
Jeremy	rpOyqD_893cqmDAtJLbdog	4U9kSBLuBDU391x6bxU-YA
Jeremy	rpOyqD_893cqmDAtJLbdog	fHtTaujcyKvXglE33Z5yIw
Jeremy	rpOyqD_893cqmDAtJLbdog	SIBCL7HBkrP4llolm4SC2A
Jeremy	rpOyqD_893cqmDAtJLbdog	8J4IIYcqBlFch8T90N923A
Jeremy	rpOyqD_893cqmDAtJLbdog	ysYmC-ufbdmVEX9yAv-VEQ
Jeremy	rpOyqD_893cqmDAtJLbdog	UTS9XcT14H2ZscRIf0MYHQ
Jeremy	rpOyqD_893cqmDAtJLbdog	1blidZhgxDVSBuJ_

This is what the corresponding mapper script looks like. We save it in a separate file, `friends_mapper.py`:

In [None]:
#!/usr/bin/env python

import sys
import json
# input comes from STDIN (standard input)
for line in sys.stdin:
    r = json.loads(line)
    for f in r['friends']:
        ## print a tab-separated line from a list of values
        print '\t'.join([r['name'], r['user_id'], f])

In [27]:
%%sh
ls -l

total 28
-rw-r--r-- 1 pmolnar pmolnar 17625 Jan 14 10:18 Exercises.ipynb
-rw-r--r-- 1 pmolnar pmolnar   278 Jan 14 10:21 friends_mapper.py
-rw-rw-r-- 1 pmolnar pmolnar  1058 Jan 14 09:01 README.md


In [28]:
%%sh
chmod a+x friends_mapper.py

In [29]:
%%sh
ls -l 

total 28
-rw-r--r-- 1 pmolnar pmolnar 17625 Jan 14 10:18 Exercises.ipynb
-rwxr-xr-x 1 pmolnar pmolnar   278 Jan 14 10:21 friends_mapper.py
-rw-rw-r-- 1 pmolnar pmolnar  1058 Jan 14 09:01 README.md


In [7]:
%%sh
export UF=/home/data/yelp/yelp_dataset_challenge_academic_dataset/yelp_academic_dataset_user.json
head -100 $UF | ./friends_mapper.py | head -20

Russel	18kPq7GPye-YQ3LyKyAZPw	rpOyqD_893cqmDAtJLbdog
Russel	18kPq7GPye-YQ3LyKyAZPw	4U9kSBLuBDU391x6bxU-YA
Russel	18kPq7GPye-YQ3LyKyAZPw	fHtTaujcyKvXglE33Z5yIw
Russel	18kPq7GPye-YQ3LyKyAZPw	8J4IIYcqBlFch8T90N923A
Russel	18kPq7GPye-YQ3LyKyAZPw	wy6l_zUo7SN0qrvNRWgySw
Russel	18kPq7GPye-YQ3LyKyAZPw	HDQixQ-WZEV0LVPJlIGQeQ
Russel	18kPq7GPye-YQ3LyKyAZPw	T4kuUr_iJiywOPdyM7gTHQ
Russel	18kPq7GPye-YQ3LyKyAZPw	z_5D4XEIlGAPjG3Os9ix5A
Russel	18kPq7GPye-YQ3LyKyAZPw	i63u3SdbrLsP4FxiSKP0Zw
Russel	18kPq7GPye-YQ3LyKyAZPw	pnrGw4ciBXJ6U5QB2m0F5g
Russel	18kPq7GPye-YQ3LyKyAZPw	ytjCBxosVSqCOQ62c4KAxg
Russel	18kPq7GPye-YQ3LyKyAZPw	r5uiIxwJ-I-oHBkNY2Ha3Q
Russel	18kPq7GPye-YQ3LyKyAZPw	niWoSKswEbooJC_M7HMbGw
Russel	18kPq7GPye-YQ3LyKyAZPw	kwoxiKMyoYjB1wTCYAjYRg
Russel	18kPq7GPye-YQ3LyKyAZPw	9A8OuP6XwLwnNb9ov3_Ncw
Russel	18kPq7GPye-YQ3LyKyAZPw	27MmRg8LfbZXNEHkEnKSdA
Russel	18kPq7GPye-YQ3LyKyAZPw	uguXfIEpI65jSCH5MgUDgA
Russel	18kPq7GPye-YQ3LyKyAZPw	6VZNGc2h2Bn-uyuEXgOt5g
Russel	18kPq7GPye-YQ3LyKyAZPw	AZ8CTtwr-4sGM2kZ

Traceback (most recent call last):
  File "./friends_mapper.py", line 10, in <module>
    print '\t'.join([r['name'], r['user_id'], f])
IOError: [Errno 32] Broken pipe
head: write error: Broken pipe
head: write error
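
The traceback above appears because `head -20` exits after 20 lines and closes the pipe while the mapper is still writing. The output itself is fine. One way to make the script stop quietly is to catch the broken-pipe error, as in the sketch below. It assumes Python 3, where `IOError` is an alias of `OSError`; the function name `emit_friends` and the `ClosedPipe` stand-in class are hypothetical and exist only to demonstrate the behavior without a real pipe.

```python
import errno
import json


def emit_friends(lines, out):
    """Write name, user_id, friend_id rows; stop quietly if the reader went away."""
    written = 0
    try:
        for line in lines:
            r = json.loads(line)
            for f in r["friends"]:
                out.write("\t".join([r["name"], r["user_id"], f]) + "\n")
                written += 1
        out.flush()
    except IOError as e:
        if e.errno != errno.EPIPE:
            raise                        # only swallow "Broken pipe"
    return written


class ClosedPipe(object):
    """Hypothetical stand-in for a pipe whose reader exits after `limit` writes."""

    def __init__(self, limit):
        self.limit = limit
        self.lines = []

    def write(self, text):
        if len(self.lines) >= self.limit:
            raise IOError(errno.EPIPE, "Broken pipe")
        self.lines.append(text)

    def flush(self):
        pass


# Made-up user record (illustration only).
sample = ['{"name": "Ann", "user_id": "u1", "friends": ["u2", "u3", "u4"]}']
pipe = ClosedPipe(limit=2)
written = emit_friends(sample, pipe)
print(written, pipe.lines)
```

In the actual script the call would be `emit_friends(sys.stdin, sys.stdout)`. Depending on the Python version, the interpreter may still report an error when it flushes `sys.stdout` at exit, so treat this as a sketch rather than a complete fix.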


In [10]:
%%sh
hdfs dfs -mkdir -p /user/pmolnar/data/yelp

In [11]:
%%sh
hdfs dfs -put /home/data/yelp/yelp_dataset_challenge_academic_dataset/yelp_academic_dataset_*.json /user/pmolnar/data/yelp


In [18]:
%%sh
hdfs dfs -cat /user/pmolnar/data/yelp/yelp_academic_dataset_review.json | ./friends_mapper.py | head -200

Traceback (most recent call last):
  File "./friends_mapper.py", line 8, in <module>
    for f in r['friends']:
KeyError: 'friends'
cat: Unable to write to output stream.
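
The `KeyError` above occurs because the review file was piped into a mapper that expects user records, and review records have no `friends` field. A more defensive mapper could skip records it does not understand, as sketched below. This assumes Python 3; the function name `friend_rows` is hypothetical and the sample records are made up.

```python
import json


def friend_rows(lines):
    """Yield (name, user_id, friend_id); ignore lines that are not user records."""
    for line in lines:
        try:
            r = json.loads(line)
        except ValueError:
            continue                     # not valid JSON
        if not isinstance(r, dict) or "friends" not in r:
            continue                     # e.g. a review record
        for f in r["friends"]:
            yield (r.get("name", ""), r.get("user_id", ""), f)


# Made-up input: one user record, one review record, one garbage line.
sample = [
    '{"name": "Ann", "user_id": "u1", "friends": ["u2"]}',
    '{"type": "review", "user_id": "u1", "stars": 4, "text": "ok"}',
    'not json',
]
rows = list(friend_rows(sample))
print(rows)
```

Silently skipping records hides mistakes such as pointing the job at the wrong file, so counting the skipped lines and reporting the count on `stderr` may be a worthwhile addition.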


In [19]:
%%sh
head -200 /home/data/yelp/yelp_dataset_challenge_academic_dataset/yelp_academic_dataset_user.json | ./friends_mapper.py 


Russel	18kPq7GPye-YQ3LyKyAZPw	rpOyqD_893cqmDAtJLbdog
Russel	18kPq7GPye-YQ3LyKyAZPw	4U9kSBLuBDU391x6bxU-YA
Russel	18kPq7GPye-YQ3LyKyAZPw	fHtTaujcyKvXglE33Z5yIw
Russel	18kPq7GPye-YQ3LyKyAZPw	8J4IIYcqBlFch8T90N923A
Russel	18kPq7GPye-YQ3LyKyAZPw	wy6l_zUo7SN0qrvNRWgySw
Russel	18kPq7GPye-YQ3LyKyAZPw	HDQixQ-WZEV0LVPJlIGQeQ
Russel	18kPq7GPye-YQ3LyKyAZPw	T4kuUr_iJiywOPdyM7gTHQ
Russel	18kPq7GPye-YQ3LyKyAZPw	z_5D4XEIlGAPjG3Os9ix5A
Russel	18kPq7GPye-YQ3LyKyAZPw	i63u3SdbrLsP4FxiSKP0Zw
Russel	18kPq7GPye-YQ3LyKyAZPw	pnrGw4ciBXJ6U5QB2m0F5g
Russel	18kPq7GPye-YQ3LyKyAZPw	ytjCBxosVSqCOQ62c4KAxg
Russel	18kPq7GPye-YQ3LyKyAZPw	r5uiIxwJ-I-oHBkNY2Ha3Q
Russel	18kPq7GPye-YQ3LyKyAZPw	niWoSKswEbooJC_M7HMbGw
Russel	18kPq7GPye-YQ3LyKyAZPw	kwoxiKMyoYjB1wTCYAjYRg
Russel	18kPq7GPye-YQ3LyKyAZPw	9A8OuP6XwLwnNb9ov3_Ncw
Russel	18kPq7GPye-YQ3LyKyAZPw	27MmRg8LfbZXNEHkEnKSdA
Russel	18kPq7GPye-YQ3LyKyAZPw	uguXfIEpI65jSCH5MgUDgA
Russel	18kPq7GPye-YQ3LyKyAZPw	6VZNGc2h2Bn-uyuEXgOt5g
Russel	18kPq7GPye-YQ3LyKyAZPw	AZ8CTtwr-4sGM2kZ

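Now let's run the mapper as a Map-Reduce job on the cluster. The job is map-only: the number of reduce tasks is set to 0.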

In [14]:
%%sh
OUTPUT=/user/$USER/output/yelp_friends
echo $OUTPUT

/user/pmolnar/output/yelp_friends


In [16]:
%%sh
OUTPUT=/user/$USER/output/yelp_friends
hdfs dfs -rm -r -f -skipTrash $OUTPUT

INPUT=/user/$USER/data/yelp/yelp_academic_dataset_user.json
yarn \
    jar /usr/hdp/2.4.2.0-258/hadoop-mapreduce/hadoop-streaming-2.7.1.2.4.2.0-258.jar \
    -D mapred.reduce.tasks=0 \
    -D mapred.min.split.size=32 \
    -mapper "$PWD/friends_mapper.py" \
    -input $INPUT \
    -output $OUTPUT

packageJobJar: [] [/usr/hdp/2.4.2.0-258/hadoop-mapreduce/hadoop-streaming-2.7.1.2.4.2.0-258.jar] /var/lib/ambari-agent/tmp/hadoop_java_io_tmpdir/streamjob5574289518952680215.jar tmpDir=null


17/01/14 11:18:16 INFO impl.TimelineClientImpl: Timeline service address: http://backend-0-2.insight.gsu.edu:8188/ws/v1/timeline/
17/01/14 11:18:16 INFO client.RMProxy: Connecting to ResourceManager at backend-0-1.insight.gsu.edu/192.168.1.253:8050
17/01/14 11:18:17 INFO impl.TimelineClientImpl: Timeline service address: http://backend-0-2.insight.gsu.edu:8188/ws/v1/timeline/
17/01/14 11:18:17 INFO client.RMProxy: Connecting to ResourceManager at backend-0-1.insight.gsu.edu/192.168.1.253:8050
17/01/14 11:18:17 INFO mapred.FileInputFormat: Total input paths to process : 1
17/01/14 11:18:17 INFO mapreduce.JobSubmitter: number of splits:3
17/01/14 11:18:17 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1483712992215_0003
17/01/14 11:18:18 INFO impl.YarnClientImpl: Submitted application application_1483712992215_0003
17/01/14 11:18:18 INFO mapreduce.Job: The url to track the job: http://backend-0-1.insight.gsu.edu:8088/proxy/application_1483712992215_0003/
17/01/14 11:18:18 IN