<DIV ALIGN=CENTER>

# Introduction to Pig
## Professor Robert J. Brunner
  
</DIV>  
-----
-----

## Introduction


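In this IPython Notebook, we introduce Apache Pig, a platform for
analyzing data with scripts written in the Pig Latin language. We first
run a word count script in local mode and on Hadoop, and then use Pig to
load, join, and summarize the MovieLens ratings data.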

-----

-----

## Pig

-----

In [1]:
!pig -help


Apache Pig version 0.15.0 (r1682971) 
compiled Jun 01 2015, 11:44:35

USAGE: Pig [options] [-] : Run interactively in grunt shell.
       Pig [options] -e[xecute] cmd [cmd ...] : Run cmd(s).
       Pig [options] [-f[ile]] file : Run cmds found in file.
  options include:
    -4, -log4jconf - Log4j configuration file, overrides log conf
    -b, -brief - Brief logging (no timestamps)
    -c, -check - Syntax check
    -d, -debug - Debug level, INFO is default
    -e, -execute - Commands to execute (within quotes)
    -f, -file - Path to the script to execute
    -g, -embedded - ScriptEngine classname or keyword for the ScriptEngine
    -h, -help - Display this message. You can specify topic to get help for that topic.
        properties is the only topic currently supported: -h properties.
    -i, -version - Display version information
    -l, -logfile - Path to client side log file; default is current working directory.
    -m, -param_file - Path to the parameter file

-----

### Local Pig

-----

In [2]:
%%writefile /home/data_scientist/hadoop/wordcount-local.pig

Lines = LOAD 'book.txt' AS (Line:chararray) ;
Words = FOREACH Lines GENERATE FLATTEN (TOKENIZE (Line)) AS Word ;
Groups = GROUP Words BY Word ;
Counts = FOREACH Groups GENERATE group, COUNT (Words) ;
Results = ORDER Counts BY $1 DESC ;
Top_Results = LIMIT Results 10 ;
STORE Results INTO 'top_words' ;
DUMP Top_Results ;

Writing /home/data_scientist/hadoop/wordcount-local.pig


In [3]:
%%bash

cd $HOME/hadoop

# Run in local mode and discard the Pig/Hadoop log messages (stderr)
pig -x local -f wordcount-local.pig 2> /dev/null

(the,13626)
(of,8133)
(and,6681)
(a,5869)
(to,4817)
(in,4651)
(his,3051)
(he,2792)
(I,2455)
(with,2401)


-----

For comparison, the output of our Python-based map-reduce was as follows:

| Word | Count |
| :----: | ----: |
| the | 13600 |
| of | 8127 |
| and | 6542 |
| a | 5842 |
| to | 4787 |
| in | 4606 |
| his | 3035 |
| he | 2712 |
| I | 2432 |
| with | 2391 |

-----
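
The stages of the Pig word count script (load, tokenize, group, count,
order, limit) can be sketched in plain Python. This is only an
illustration under stated assumptions: the sample text is invented, and
the `tokenize` helper is a hypothetical approximation of Pig's
`TOKENIZE`, assuming it splits on spaces, double quotes, commas,
parentheses, and stars.

```python
import re
from collections import Counter

# Invented sample text standing in for book.txt
text = """the cat and the hat
the cat (and a dog)"""

# Assumed approximation of TOKENIZE's default delimiters
DELIMS = re.compile(r'[ ",()*]+')


def tokenize(line):
    """Split a line on the assumed delimiters and drop empty tokens."""
    return [tok for tok in DELIMS.split(line) if tok]


counts = Counter()
for line in text.splitlines():       # Lines = LOAD ...
    counts.update(tokenize(line))    # FLATTEN(TOKENIZE(Line)), GROUP, COUNT

top_results = counts.most_common(3)  # ORDER ... DESC, then LIMIT
print(top_results)
```

Because the counts depend on how each line is split into words, two
word count implementations that tokenize differently can report
different totals for the same text.

-----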

In [4]:
%%bash

cd $HOME/hadoop

# Show the contents of the local output
ls -la top_words

# Now display top words from output
echo
echo 'Top Words'
head -10 top_words/part-r-00000

total 468
drwxr-xr-x 2 data_scientist users   4096 Apr  8 02:09 .
drwxr-xr-x 3 data_scientist users   4096 Apr  8 02:09 ..
-rw-r--r-- 1 data_scientist users 461372 Apr  8 02:09 part-r-00000
-rw-r--r-- 1 data_scientist users   3616 Apr  8 02:09 .part-r-00000.crc
-rw-r--r-- 1 data_scientist users      0 Apr  8 02:09 _SUCCESS
-rw-r--r-- 1 data_scientist users      8 Apr  8 02:09 ._SUCCESS.crc

Top Words
the	13626
of	8133
and	6681
a	5869
to	4817
in	4651
his	3051
he	2792
I	2455
with	2401


-----

### Hadoop Pig

-----

In [5]:
%%writefile /home/data_scientist/hadoop/wordcount.pig

Lines = LOAD 'wc/in/book.txt' AS (Line:chararray) ;
Words = FOREACH Lines GENERATE FLATTEN (TOKENIZE (Line)) AS Word ;
Groups = GROUP Words BY Word ;
Counts = FOREACH Groups GENERATE group, COUNT (Words) ;
Results = ORDER Counts BY $1 DESC ;
Top_Results = LIMIT Results 10 ;
STORE Results INTO 'wc/out/top_words' ;
DUMP Top_Results ;

Writing /home/data_scientist/hadoop/wordcount.pig


In [6]:
%%bash

# We remove old output if it exists and create output directory
$HADOOP_PREFIX/bin/hdfs dfs -rm -r -f wc/out 2> /dev/null
$HADOOP_PREFIX/bin/hdfs dfs -mkdir wc/out 2> /dev/null

cd $HOME/hadoop

# Run on the Hadoop cluster and discard the Pig/Hadoop log messages (stderr)
pig -f wordcount.pig 2> /dev/null

Deleted wc/out
(the,13626)
(of,8133)
(and,6681)
(a,5869)
(to,4817)
(in,4651)
(his,3051)
(he,2792)
(I,2455)
(with,2401)


In [7]:
%%bash

cd $HADOOP_PREFIX
# Display directory contents
bin/hdfs dfs -ls wc/in
bin/hdfs dfs -ls wc/out/top_words

# Display the top words from the HDFS output

echo
echo 'Top Words'

bin/hdfs dfs -cat wc/out/top_words/part-r-00000 | head -10

Found 1 items
-rw-r--r--   1 data_scientist supergroup    1573151 2016-04-07 23:37 wc/in/book.txt
Found 2 items
-rw-r--r--   1 data_scientist supergroup          0 2016-04-08 02:10 wc/out/top_words/_SUCCESS
-rw-r--r--   1 data_scientist supergroup     461372 2016-04-08 02:10 wc/out/top_words/part-r-00000
the	13626
of	8133
and	6681
a	5869
to	4817
in	4651
his	3051
he	2792
I	2455
with	2401


cat: Unable to write to output stream.


-----

### MovieLens Data Analysis


-----

In [8]:
# Name of the directory holding the Small MovieLens data
data_dir = '/home/data_scientist/hadoop'

In [9]:
# Download and unpack the small MovieLens data set
!wget --output-document=$data_dir/ml-latest-small.zip \
    http://files.grouplens.org/datasets/movielens/ml-latest-small.zip

!unzip -o $data_dir/ml-latest-small.zip -d $data_dir

--2016-04-08 02:12:50--  http://files.grouplens.org/datasets/movielens/ml-latest-small.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.34.146
Connecting to files.grouplens.org (files.grouplens.org)|128.101.34.146|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1040425 (1016K) [application/zip]
Saving to: ‘/home/data_scientist/hadoop/ml-latest-small.zip’


2016-04-08 02:12:51 (2.37 MB/s) - ‘/home/data_scientist/hadoop/ml-latest-small.zip’ saved [1040425/1040425]

Archive:  /home/data_scientist/hadoop/ml-latest-small.zip
   creating: /home/data_scientist/hadoop/ml-latest-small/
  inflating: /home/data_scientist/hadoop/ml-latest-small/links.csv  
  inflating: /home/data_scientist/hadoop/ml-latest-small/movies.csv  
  inflating: /home/data_scientist/hadoop/ml-latest-small/ratings.csv  
  inflating: /home/data_scientist/hadoop/ml-latest-small/README.txt  
  inflating: /home/data_scientist/hadoop/ml-latest-small/tags.csv  


In [10]:
!head -10 $data_dir/ml-latest-small/ratings.csv

userId,movieId,rating,timestamp
1,16,4.0,1217897793
1,24,1.5,1217895807
1,32,4.0,1217896246
1,47,4.0,1217896556
1,50,4.0,1217896523
1,110,4.0,1217896150
1,150,3.0,1217895940
1,161,4.0,1217897864
1,165,3.0,1217897135


In [11]:
%%writefile /home/data_scientist/hadoop/head.pig

ratings = LOAD 'ml-latest-small/ratings.csv' USING PigStorage(',') ;
tr = STREAM ratings THROUGH `head -10` AS (userID, movieID, rating, timestamp) ;
DUMP tr ;

Writing /home/data_scientist/hadoop/head.pig


In [12]:
%%bash

cd $HOME/hadoop

pig -x local -b -f head.pig 2> /dev/null

(userId,movieId,rating,timestamp)
(1,16,4.0,1217897793)
(1,24,1.5,1217895807)
(1,32,4.0,1217896246)
(1,47,4.0,1217896556)
(1,50,4.0,1217896523)
(1,110,4.0,1217896150)
(1,150,3.0,1217895940)
(1,161,4.0,1217897864)
(1,165,3.0,1217897135)


In [13]:
%%bash

cd $HOME/hadoop

# Copy original files to new name
cp ml-latest-small/ratings.csv ml-latest-small/original-ratings.csv
cp ml-latest-small/movies.csv ml-latest-small/original-movies.csv

# GNU sed allows in-place editing; here we delete the header line from each file
sed -i '1d' ml-latest-small/ratings.csv
sed -i '1d' ml-latest-small/movies.csv

# List CSV files
ls -la ml-latest-small/*.csv

echo
echo '***** Ratings File *****'
head -2 ml-latest-small/ratings.csv

echo
echo '***** Movies File *****'
head -2 ml-latest-small/movies.csv

-rw-r--r-- 1 data_scientist users  207997 Jan 11 10:55 ml-latest-small/links.csv
-rw-r--r-- 1 data_scientist users  515678 Apr  8 02:12 ml-latest-small/movies.csv
-rw-r--r-- 1 data_scientist users  515700 Apr  8 02:12 ml-latest-small/original-movies.csv
-rw-r--r-- 1 data_scientist users 2580392 Apr  8 02:12 ml-latest-small/original-ratings.csv
-rw-r--r-- 1 data_scientist users 2580359 Apr  8 02:12 ml-latest-small/ratings.csv
-rw-r--r-- 1 data_scientist users  199073 Jan 11 10:54 ml-latest-small/tags.csv

***** Ratings File *****
1,16,4.0,1217897793
1,24,1.5,1217895807

***** Movies File *****
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy


In [14]:
%%writefile /home/data_scientist/hadoop/ratings.pig

ratings = LOAD 'ml-latest-small/ratings.csv' USING PigStorage(',')
    AS (userID:int, mnovieID:int, rating:double, timestamp:int) ;
DESCRIBE ratings ;
ILLUSTRATE ratings ;
top_rows = LIMIT ratings 10 ;
DUMP top_rows ;

Writing /home/data_scientist/hadoop/ratings.pig


In [15]:
%%bash

cd $HOME/hadoop

pig -x local -f ratings.pig 2> /dev/null

ratings: {userID: int,mnovieID: int,rating: double,timestamp: int}
(6,6711,4.0,1348881409)
-----------------------------------------------------------------------------------
| ratings     | userID:int   | mnovieID:int   | rating:double   | timestamp:int   | 
-----------------------------------------------------------------------------------
|             | 6            | 6711           | 4.0             | 1348881409      | 
-----------------------------------------------------------------------------------

(1,16,4.0,1217897793)
(1,24,1.5,1217895807)
(1,32,4.0,1217896246)
(1,47,4.0,1217896556)
(1,50,4.0,1217896523)
(1,110,4.0,1217896150)
(1,150,3.0,1217895940)
(1,161,4.0,1217897864)
(1,165,3.0,1217897135)
(1,204,0.5,1217895786)


In [16]:
%%writefile /home/data_scientist/hadoop/join.pig

ratings = LOAD 'ml-latest-small/ratings.csv' USING PigStorage(',')
    AS (userID:int, movieID:int, rating:double, timestamp:int) ;

movies = LOAD 'ml-latest-small/movies.csv' USING PigStorage(',')
    AS (movieID:int, title:chararray, genre:chararray) ;

movie_ratings = JOIN ratings by movieID, movies by movieID ;

DESCRIBE movie_ratings ;
top_rows = LIMIT movie_ratings 10 ;
DUMP top_rows ;

Writing /home/data_scientist/hadoop/join.pig


In [17]:
%%bash

cd $HOME/hadoop

pig -x local -b -f join.pig 2> /dev/null

movie_ratings: {ratings::userID: int,ratings::movieID: int,ratings::rating: double,ratings::timestamp: int,movies::movieID: int,movies::title: chararray,movies::genre: chararray}
(151,1,5.0,864684243,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy)
(176,1,4.0,965402628,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy)
(215,1,3.5,1433873781,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy)
(218,1,3.5,1255817134,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy)
(347,1,5.0,1274980200,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy)
(450,1,3.0,835226407,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy)
(650,1,5.0,965433049,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy)
(661,1,4.0,866409965,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy)
(29,1,4.0,846942580,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy)
(122,1,5.0,1024806364,1,Toy Story (1995),Adventure|Ani
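
-----

The `JOIN` above pairs every rating with the movie that has the same
`movieID` and concatenates the two rows, which is why each output tuple
carries seven fields. A minimal Python sketch of that behavior follows.
The sample rows are illustrative (the rating with `movieID` 999 is
invented to show an unmatched key), and the dictionary lookup is just
one simple way to express an inner join, not a description of how Pig
executes it.

```python
from collections import defaultdict

# Illustrative rows in the layout of ratings.csv and movies.csv
ratings = [
    (151, 1, 5.0, 864684243),
    (176, 1, 4.0, 965402628),
    (7, 999, 3.0, 1000000000),  # invented row with no matching movie
]
movies = [
    (1, "Toy Story (1995)", "Adventure|Animation|Children|Comedy|Fantasy"),
    (2, "Jumanji (1995)", "Adventure|Children|Fantasy"),
]

# Index the movies by movieID (first field)
movies_by_id = defaultdict(list)
for movie in movies:
    movies_by_id[movie[0]].append(movie)

# Inner join: keep only ratings whose movieID has a matching movie,
# and concatenate the rating fields with the movie fields
movie_ratings = [
    rating + movie
    for rating in ratings
    for movie in movies_by_id.get(rating[1], [])
]

for row in movie_ratings:
    print(row)
```

Rows without a partner on the other side (the rating for `movieID` 999
and the unrated Jumanji entry) do not appear in the result.

-----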

In [18]:
%%writefile /home/data_scientist/hadoop/join-group.pig

ratings = LOAD 'ml-latest-small/ratings.csv' USING PigStorage(',')
    AS (userID:int, movieID:int, rating:double, timestamp:int) ;

high_ratings = FILTER ratings BY rating > 3 ;

hr_group = GROUP high_ratings BY movieID ;

hr_count = FOREACH hr_group GENERATE group AS mvID, COUNT(high_ratings) AS cnt ;

movies = LOAD 'ml-latest-small/movies.csv' USING PigStorage(',')
    AS (movieID:int, title:chararray, genre:chararray) ;

movie_ratings = JOIN hr_count by mvID, movies by movieID ;

ordered_movies = ORDER movie_ratings BY cnt DESC ;

top_movies = LIMIT ordered_movies 10 ;

DESCRIBE top_movies ;

DUMP top_movies ;

Writing /home/data_scientist/hadoop/join-group.pig
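
-----

The `join-group.pig` script chains several operators: `FILTER`, `GROUP`
with `COUNT`, `JOIN`, `ORDER`, and `LIMIT`. The Python sketch below
mirrors those steps on a handful of invented rows so the data flow is
easier to follow. The ratings and the placeholder titles are
hypothetical, and timestamps are omitted for brevity.

```python
from collections import Counter

# Invented (userID, movieID, rating) rows
ratings = [
    (1, 318, 5.0), (2, 318, 4.5), (3, 318, 2.0),
    (1, 296, 4.0), (2, 296, 3.0),
    (3, 50, 3.5),
]
# Placeholder titles keyed by movieID
titles = {318: "Movie A", 296: "Movie B", 50: "Movie C"}

# FILTER ratings BY rating > 3, then GROUP BY movieID and COUNT
hr_count = Counter(
    movie_id for _, movie_id, rating in ratings if rating > 3
)

# JOIN hr_count with the titles on movieID
joined = [
    (movie_id, cnt, titles[movie_id])
    for movie_id, cnt in hr_count.items()
    if movie_id in titles
]

# ORDER BY cnt DESC, then LIMIT 2
ordered_movies = sorted(joined, key=lambda row: row[1], reverse=True)
top_movies = ordered_movies[:2]
print(top_movies)
```

Note that the filter is a strict comparison, so a rating of exactly 3.0
does not count as a high rating.

-----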


In [19]:
%%bash

cd $HOME/hadoop

pig -x local -b -f join-group.pig 2> /dev/null

top_movies: {hr_count::mvID: int,hr_count::cnt: long,movies::movieID: int,movies::title: chararray,movies::genre: chararray}
(318,282,318,"Shawshank Redemption, The (1994)")
(296,268,296,Pulp Fiction (1994),Comedy|Crime|Drama|Thriller)
(356,258,356,Forrest Gump (1994),Comedy|Drama|Romance|War)
(593,253,593,"Silence of the Lambs, The (1991)")
(2571,231,2571,"Matrix, The (1999)")
(260,230,260,Star Wars: Episode IV - A New Hope (1977),Action|Adventure|Sci-Fi)
(527,220,527,Schindler's List (1993),Drama|War)
(110,204,110,Braveheart (1995),Action|Drama|War)
(50,202,50,"Usual Suspects, The (1995)")
(589,198,589,Terminator 2: Judgment Day (1991),Action|Sci-Fi)
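
-----

In the output above, the rows whose titles are wrapped in double quotes
show no genre field. This appears consistent with `PigStorage(',')`
splitting on every comma, including the comma inside a quoted title, so
that the two title fragments fill the `title` and `genre` columns and
the real genre is dropped. The Python sketch below illustrates the
difference between a naive comma split and a CSV-aware parser. The
genre string in the sample line is illustrative rather than copied from
the data set.

```python
import csv
import io

# A line in the style of movies.csv whose title contains a comma
line = '318,"Shawshank Redemption, The (1994)",Crime|Drama'

# Naive split on every comma: the quoted title is broken in two
naive = line.split(',')
print(len(naive), naive)

# CSV-aware parsing keeps the quoted title as a single field
parsed = next(csv.reader(io.StringIO(line)))
print(len(parsed), parsed)
```

Within Pig, one option worth exploring is a quote-aware loader such as
`CSVExcelStorage` from the piggybank library, which is intended for
CSV files with quoted fields.

-----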


In [20]:
%%bash

# Clean up the working directory (skip this cell to keep the data for later analysis)
cd $HOME/hadoop

# Remove pig log files
rm -f pig*.log

# Remove our pig scripts
rm -f *.pig

# Remove the MovieLens data
rm -f ml-latest-small.zip
rm -rf ml-latest-small

# Remove our output directory
rm -rf top_words

# Display cleaned directory contents
ls -la

total 1556
drwxr-xr-x  2 data_scientist users    4096 Apr  8 02:13 .
drwxr-xr-x 18 data_scientist users    4096 Apr  7 21:26 ..
-rw-r--r--  1 data_scientist users 1573151 Apr  7 18:17 book.txt
-rwxr--r--  1 data_scientist users     694 Apr  7 20:07 mapper.py
-rwxr--r--  1 data_scientist users    1481 Apr  7 20:07 reducer.py


-----

### Student Activity

In the preceding cells, we introduced Apache Pig. Now that you have run
the Notebook, go back and make the following changes to explore how Pig
scripts behave.

1. Change the number of rows kept by the `LIMIT` operator in the word
count script (by default the top 10 words are displayed). How does the
output change?
2. Try changing the `FILTER` condition in `join-group.pig` (for example,
`rating > 4`). How does the list of top movies change?
3. Compare the Pig word count with the Python map-reduce word count
introduced earlier in the course. What are the benefits of the different
techniques?
4. Does the delimiter passed to `PigStorage` affect how movie titles
that contain commas are loaded? Look at the titles in the
`join-group.pig` output and explain what you see.

-----