# Notebook to apply MapReduce and move data to HDFS 

- Hadoop runs in pseudo-distributed mode.
- This tutorial works with the <a href="https://grouplens.org/datasets/movielens/">MovieLens</a> dataset.
- The datasets are already included in this repository:
    - ml-100k.zip;
    - log_server.log.zip;
    - OrgulhoePreconceito.txt;
    - amigos_facebook.csv.
- Download the datasets and move them to HDFS.

## TASK 1: count movie ratings by rating value

### 1. Start HDFS and verify it is running

In [None]:
print("Initializing HDFS...")
!start-dfs.sh
print("Verifying that HDFS started:")
!jps
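
The `jps` listing can also be checked programmatically. The sketch below assumes that `start-dfs.sh` in pseudo-distributed mode launches the NameNode, DataNode, and SecondaryNameNode daemons; the helper name `missing_hdfs_daemons` and the sample output are hypothetical, for illustration only.

```python
# Daemons that start-dfs.sh is expected to launch in pseudo-distributed mode.
EXPECTED_HDFS_DAEMONS = {"NameNode", "DataNode", "SecondaryNameNode"}


def missing_hdfs_daemons(jps_output):
    """Return the expected HDFS daemons that do not appear in `jps` output."""
    running = set()
    for line in jps_output.splitlines():
        parts = line.split()
        # Each jps line looks like "<pid> <ClassName>"; skip anything else.
        if len(parts) >= 2:
            running.add(parts[1])
    return EXPECTED_HDFS_DAEMONS - running


# Hypothetical jps output, not captured from a real run.
sample = """4821 NameNode
4967 DataNode
5310 Jps"""
print(sorted(missing_hdfs_daemons(sample)))  # ['SecondaryNameNode']
```

An empty result means every expected daemon was found in the listing.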

### 2. Start YARN and verify it is running

In [None]:
print("Initializing YARN to manage jobs and resources...")
!start-yarn.sh

### 3. Inspect the HDFS root directory

In [None]:
!hdfs dfs -ls /

### 4. Create a directory for the MapReduce input data

In [None]:
!hdfs dfs -mkdir /mapred

- Check that the directory was created

In [None]:
!hdfs dfs -ls /

### 5. Unzip the file "ml-100k.zip" and import the file "u.data"
- The first column is the user ID;
- The second column is the movie ID;
- The third column is the rating;
- The fourth column is the timestamp.
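
As a minimal sketch of this layout, one tab-separated line can be parsed into named fields. The `Rating` tuple, its field names, and the sample record are illustrative assumptions; `u.data` itself has no header row.

```python
from collections import namedtuple

# Field names are illustrative; u.data has no header row.
Rating = namedtuple("Rating", ["user_id", "movie_id", "rating", "timestamp"])


def parse_rating(line):
    """Split one tab-separated line into its four integer fields."""
    user_id, movie_id, rating, timestamp = line.rstrip("\n").split("\t")
    return Rating(int(user_id), int(movie_id), int(rating), int(timestamp))


# Made-up record in the same four-column layout.
record = parse_rating("1\t50\t4\t880000000\n")
print(record.rating)  # 4
```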

In [None]:
!unzip ml-100k.zip
# Preview the first lines instead of printing all records, then rename the file
!cd ml-100k && head -n 5 u.data && mv u.data udata

### 6. Put the file in HDFS

In [None]:
!cd ml-100k && hdfs dfs -put udata /mapred
!hdfs dfs -ls /mapred
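
Besides reading the `-ls` listing, the upload can be checked from Python. This sketch assumes the `hdfs` command is on the `PATH` and relies on `hdfs dfs -test -e`, which exits with code 0 when the path exists; the helper names are hypothetical.

```python
import subprocess


def hdfs_test_command(path):
    """Build the `hdfs dfs -test -e` command that checks whether a path exists."""
    return ["hdfs", "dfs", "-test", "-e", path]


def hdfs_path_exists(path):
    """Return True if the path exists in HDFS (exit code 0), False otherwise."""
    result = subprocess.run(hdfs_test_command(path))
    return result.returncode == 0


# Show the command that would be run for the uploaded file.
print(" ".join(hdfs_test_command("/mapred/udata")))
```

On a machine with Hadoop running, `hdfs_path_exists("/mapred/udata")` should return `True` after the `-put` succeeds.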

### 7. Apply MapReduce to count ratings by rating value

- Generate a Python file with the job code and run it on the Hadoop cluster

In [None]:
# If the mrjob package is not installed, install it with the command below:
# !pip install mrjob
# If needed, the mrjob configuration file is '/home/hadoop/.mrjob.conf'

In [None]:
%%writefile MovieEvaluateMR.py
from mrjob.job import MRJob

class MovieEvaluateMR(MRJob):
    def mapper(self, key, line):
        # Each line is tab-separated: user ID, movie ID, rating, timestamp
        (ID, ID_MOVIE, rating, Timestamp) = line.split('\t')
        # Emit one count per rating value
        yield rating, 1

    def reducer(self, rating, occurrences):
        # Sum the counts emitted for each rating value
        yield rating, sum(occurrences)

if __name__ == '__main__':
    MovieEvaluateMR.run()

- Move the file to the filePy directory, creating it if needed

In [9]:
!mkdir -p filePy && mv MovieEvaluateMR.py ./filePy/

- Run the MapReduce job

In [None]:
!python ./filePy/MovieEvaluateMR.py hdfs:///mapred/udata -r hadoop
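
To see what the job computes without a cluster, the map, shuffle, and reduce phases can be imitated in plain Python. This is only a rough sketch of the idea, not how mrjob or Hadoop execute the job; `run_locally` and the sample lines are made up for illustration.

```python
from collections import defaultdict


def map_rating(line):
    # Same logic as the mapper: emit (rating, 1) for each input line.
    _user, _movie, rating, _timestamp = line.split("\t")
    yield rating, 1


def run_locally(lines):
    # Shuffle phase: group the mapper values by key.
    grouped = defaultdict(list)
    for line in lines:
        for key, value in map_rating(line):
            grouped[key].append(value)
    # Reduce phase: sum the values for each key.
    return {key: sum(values) for key, values in sorted(grouped.items())}


# Made-up records in the four-column layout.
sample_lines = [
    "1\t10\t5\t880000001",
    "2\t10\t3\t880000002",
    "3\t11\t5\t880000003",
]
print(run_locally(sample_lines))  # {'3': 1, '5': 2}
```

The keys are strings because the mapper yields the rating exactly as it appears in the input line.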

## TASK 2: average number of friends by age

### 1. Put the file in HDFS and check it

In [None]:
!hdfs dfs -put amigos_facebook.csv /mapred
!hdfs dfs -ls /mapred
!jps

### 2. Apply MapReduce to compute the average number of friends by age

In [5]:
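%%writefile FriendsAgeMR.py
from mrjob.job import MRJob

class FriendsAgeMR(MRJob):
    def mapper(self, key, line):
        # Each line is comma-separated: ID, name, age, number of friends
        (ID, name, age, nFriends) = line.split(',')
        yield age, float(nFriends)

    def reducer(self, age, nFriends):
        # nFriends is a one-pass iterator, so count and sum in a single loop
        count = 0
        total = 0
        for x in nFriends:
            count += 1
            total += x

        # Average number of friends for this age
        yield age, (total / count)

if __name__ == '__main__':
    FriendsAgeMR.run()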

Overwriting FriendsAgeMR.py


- Move the file to the filePy directory, creating it if needed

In [8]:
!mkdir -p filePy && mv FriendsAgeMR.py ./filePy/


- Run the MapReduce job

In [None]:
!python ./filePy/FriendsAgeMR.py hdfs:///mapred/amigos_facebook.csv -r hadoop
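
The expected result can be checked on a few rows with a plain-Python sketch of the same calculation. It assumes the four-column layout used by the mapper (ID, name, age, number of friends) and no header row; the function name and sample rows are made up for illustration.

```python
from collections import defaultdict


def average_friends_by_age(lines):
    """Group friend counts by age and return the mean for each age."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for line in lines:
        _id, _name, age, n_friends = line.split(",")
        totals[age] += float(n_friends)
        counts[age] += 1
    return {age: totals[age] / counts[age] for age in sorted(totals)}


# Made-up rows in the assumed layout.
sample_rows = [
    "0,Ana,30,100",
    "1,Bruno,30,200",
    "2,Carla,25,50",
]
print(average_friends_by_age(sample_rows))  # {'25': 50.0, '30': 150.0}
```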