# Pig Command Note

**Outline**
* [Introduction](#intro)
* [Syntax](#syntax)
* [Reference](#refer)


# <a id='intro'>Introduction</a>

* Pig, like Hive, is a layer on top of Java MapReduce jobs
* Pig provides a high-level language known as **Pig Latin**
* To analyze data using Apache Pig, programmers write scripts in the **Pig Latin** language. All these scripts are internally converted to Map and Reduce tasks. Apache Pig has a component known as the **Pig Engine** that accepts Pig Latin scripts as input and converts them into MapReduce jobs.
* The interactive shell of Pig is called **Grunt**
* Pig is a dataflow language, meaning that
    * To perform a particular task, programmers write a Pig script in the Pig Latin language and execute it using any of the execution mechanisms (Grunt shell, script, embedded). During execution, the script goes through a series of transformations applied by the Pig framework to produce the desired output.
* Pig was originally developed at Yahoo; Hive was originally developed at Facebook.


Most of the information below is copied from [here](https://www.tutorialspoint.com/apache_pig/apache_pig_overview.htm).

> **Pig vs SQL**

<img src="pic/pigsql.png" style="width: 400px;height: 210px;"/>

> **Pig vs Hive**

Both Apache Pig and Hive are used to create MapReduce jobs, and in some cases Hive operates on HDFS in much the same way Apache Pig does. The following table lists a few significant points that set Apache Pig apart from Hive.

<img src="pic/pighive.png" style="width: 400px;height: 210px;"/>


> **Apache Pig Execution Mechanisms**

* **Interactive Mode (Grunt shell)** − You can run Apache Pig in interactive mode using the Grunt shell. In this shell, you can enter the Pig Latin statements and get the output (using Dump operator).
* **Batch Mode (Script)** − You can run Apache Pig in Batch mode by writing the Pig Latin script in a single file with .pig extension.
* **Embedded Mode (UDF)** − Apache Pig lets us define our own functions (User Defined Functions) in programming languages such as Java and use them in our scripts.


---

# <a id='syntax'>Syntax</a>

Short summary of the dataflow 

* load data into grunt
* do some manipulation of the loaded data and save as a new variable
* use dump operator to see the result

> **getting into the Grunt shell**

In [None]:
pig

> **check which pig mode in grunt shell**

[stackoverflow](https://stackoverflow.com/questions/33390099/how-to-know-pig-mode-in-grunt-shell)

In [None]:
# the Grunt shell has two modes: local mode and mapreduce mode
# ls lists the current path; if the path starts with hdfs://, the shell is in mapreduce mode, otherwise (file:/) it is in local mode
ls

> **execute pig script from command line**

In [None]:
# typically used when the input data source is in HDFS (mapreduce mode is the default)
pig -f myscript.pig

In [None]:
# -x local runs the script in local mode; probably only needed when loading data from the local file system
pig -x local Sample_script.pig

> **pass parameter into pig script from command line**

In [None]:
# when running the script from the command line, pass each parameter as -param key=value
pig -f filename.pig -param start_date=20170201 -param end_date=20170209 -param output=lmtmpx

In [None]:
# in the pig script, refer to the parameter using $variable
# here is an example
STORE table_name INTO '/user/jochiu/$output';
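
A minimal sketch of how the parameters passed above might be used inside the script. The relation names, the schema, and the input path are hypothetical. `%default` is Pig's preprocessor statement for giving a parameter a fallback value when no `-param` is supplied.

```pig
-- filename.pig (illustrative; the input path and schema are assumptions)
-- fall back to this value when -param output=... is not given
%default output 'lmtmpx'

records = LOAD '/user/jochiu/input' USING PigStorage(',')
    AS (user:int, day:int, value:int);

-- $start_date and $end_date are substituted as text before the script runs
in_range = FILTER records BY day >= $start_date AND day <= $end_date;

STORE in_range INTO '/user/jochiu/$output';
```

To inspect the result of the substitution without running the job, Pig's `-dryrun` option writes out the script with parameters filled in.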

> **load data into pig in grunt shell**

In [None]:
# from a local file
# PigStorage takes the field delimiter of the file as its argument
customers = LOAD 'customers.txt' USING PigStorage(',');

# from an HDFS file (full URI: hdfs://<namenode host>:<port>/<path>)
customers = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',');

In [None]:
customers = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',')
   as (id:int, name:chararray, age:int, address:chararray, salary:int);

See a list of data types [here](https://www.tutorialspoint.com/apache_pig/pig_latin_basics.htm).

> **store / save data into hdfs**

In [None]:
# remember to specify a new folder, otherwise the job fails with an 'output directory already exists' error
grunt> STORE user_244_top10 INTO 'hdfs://wolf.xxxxx.edu:8000/user/jchiu/pig/user244' USING PigStorage (',');

# to merge the output files and copy them back to the local file system from the grunt shell
grunt> fs -getmerge hdfs://wolf.iems.northwestern.edu:8020/user/jchiu/pig/user244 /home/jchiu/pig/user_244_top10.txt

> **execute pig script in grunt shell**

In [None]:
grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student.txt' USING PigStorage(',') 
                as (id:int,name:chararray,city:chararray);
grunt> Dump student;

# Now, let us execute the above script from the Grunt shell using the exec command as shown below.
grunt> exec /sample_script.pig

We can also use `run` instead. The difference between `exec` and the `run` command is that if we use `run`, the statements from the script are available in the command history.

The Dump operator runs the Pig Latin statements and displays the results on the screen. It is generally used for debugging purposes.

In [None]:
# You can see the contents of a relation using the Dump operator, which takes the relation name.
grunt> Dump student;

> **invoke bash or hdfs dfs command within grunt shell**

In [None]:
# this works the same as using ls in cmd
grunt> sh ls

In [None]:
# this works the same as using hdfs dfs -ls in cmd
grunt> fs -ls

> **quit grunt shell**

In [None]:
ctrl+D

In [None]:
grunt> quit

> **see the schema of the data**

The describe operator is used to view the schema of a relation.

In [None]:
grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt' USING PigStorage(',')
   as ( id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray );

grunt> describe student;    

> **view the top n / top 5 rows**

In [None]:
limit_data = LIMIT student_details 4; 

DUMP limit_data;
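
LIMIT on its own does not guarantee which rows are returned, so to get the top n by some column the relation is usually sorted first. A short sketch, assuming the `student_details` relation with a `gpa` column as loaded in the group-by example in these notes; the alias names are made up for illustration.

```pig
-- sort by gpa, highest first, then keep the first 5 rows
sorted_students = ORDER student_details BY gpa DESC;
top5 = LIMIT sorted_students 5;
DUMP top5;
```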

> **filter**

In [None]:
user_196 = FILTER movies BY user == 196;

[Official Doc about dealing with Null](https://pig.apache.org/docs/r0.15.0/basic.html#nulls)

In Pig Latin, nulls are implemented using the SQL definition of null as unknown or non-existent. Nulls can occur naturally in data or can be the result of an operation.

In [None]:
# keep only the rows where col is null
records_null = FILTER records BY col is null;
# keep only the rows where col is not null
records_not_null = FILTER records BY col is not null;
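
Instead of dropping rows, nulls can also be replaced with a default value. A minimal sketch using Pig's bincond operator (`condition ? a : b`), assuming `records` has an int column named `col`; the alias `records_filled` and the default of 0 are illustrative.

```pig
-- assumes col is an int; replace null values with 0
records_filled = FOREACH records GENERATE (col IS NULL ? 0 : col) AS col;
```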

> **groupby and calculate avg**

* [link](https://www.tutorialspoint.com/apache_pig/apache_pig_avg.htm)
* [A list of function can be used after group by](https://www.tutorialspoint.com/apache_pig/apache_pig_eval_functions.htm)
* [Blog: GROUP operator in Apache Pig](https://squarecog.wordpress.com/2010/05/11/group-operator-in-apache-pig/)
* [Stackoverflow: counting elements for each group](https://stackoverflow.com/questions/25012396/counting-elements-for-each-group-using-pig)

In [None]:
# Goal: get the average rating per movie, ordered by avg_rating in descending order

# input
describe movies;
# -> movies: {user: int,movie: int,rating: int,time: long}
    
# group by    
movie_group = GROUP movies by movie;    
describe movie_group;
# -> movie_group: {group: int,movies: {(user: int,movie: int,rating: int,time: long)}}
# note that after GROUP ... BY, the first column is named group

movie_rating = FOREACH movie_group GENERATE group, AVG(movies.rating) AS avg_rating;
describe movie_rating;
# -> movie_rating: {group: int,avg_rating: double}
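
The cell above computes the averages but does not sort them. One way the goal could be completed, as a sketch: it reuses `movie_group` from above and adds a rating count per movie. The aliases `movie_stats`, `n_ratings` and `movie_stats_sorted` are names chosen here for illustration.

```pig
-- count the ratings and average them per movie
movie_stats = FOREACH movie_group GENERATE
    group AS movie,
    COUNT(movies) AS n_ratings,
    AVG(movies.rating) AS avg_rating;

-- sort by average rating, highest first
movie_stats_sorted = ORDER movie_stats BY avg_rating DESC;
DUMP movie_stats_sorted;
```

Renaming `group` with `AS movie` keeps the later steps readable, since `group` is only the default name given by the GROUP operator.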

In [None]:
grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',')
   as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray, gpa:int);

grunt> student_group_all = Group student_details All;

grunt> Dump student_group_all;
   
(all,{(8,Bharathi,Nambiayar,24,9848022333,Chennai,72),
(7,Komal,Nayak,24,9848022334,trivendram,83),
(6,Archana,Mishra,23,9848022335,Chennai,87),
(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar,75),
(4,Preethi,Agarwal,21,9848022330,Pune,93),
(3,Rajesh,Khanna,22,9848022339,Delhi,90),
(2,siddarth,Battacharya,22,9848022338,Kolkata,78),
(1,Rajiv,Reddy,21,9848022337,Hyderabad,89)})

grunt> student_gpa_avg = foreach student_group_all Generate
   (student_details.firstname, student_details.gpa), AVG(student_details.gpa); 
    
grunt> Dump student_gpa_avg; 

(({(Bharathi),(Komal),(Archana),(Trupthi),(Preethi),(Rajesh),(siddarth),(Rajiv) }, 
  {   (72)   ,  (83) ,   (87)  ,   (75)  ,   (93)  ,  (90)  ,   (78)   ,  (89)  }),83.375)

> **join**

In [None]:
# Inner join
result = JOIN relation1 BY columnname, relation2 BY columnname;

# Left outer join
result = JOIN relation1 BY id LEFT OUTER, relation2 BY customer_id;

# Right outer join
result = JOIN relation1 BY id RIGHT OUTER, relation2 BY customer_id;

# Full outer join
result = JOIN relation1 BY id FULL OUTER, relation2 BY customer_id;


> **flatten**

[Something can be useful: Multiple ORDER by on Desc in pig](https://stackoverflow.com/questions/32643195/multiple-order-by-on-desc-in-pig)
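
For the flatten heading above, a minimal sketch of what FLATTEN does, assuming the `movies` relation from the group-by example. After GROUP, each row holds a bag of tuples; FLATTEN un-nests that bag so that each tuple becomes its own row again. The aliases `movie_ratings_flat` and `movie_id` are illustrative.

```pig
movie_group = GROUP movies BY movie;
-- movie_group: {group: int, movies: {(user, movie, rating, time)}}

-- one output row per rating in each movie's bag
movie_ratings_flat = FOREACH movie_group GENERATE
    group AS movie_id,
    FLATTEN(movies.rating) AS rating;

DESCRIBE movie_ratings_flat;
```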

---

# <a id='refer'>Reference</a>

* [Tutorialspoint: Pig Overview](https://www.tutorialspoint.com/apache_pig/apache_pig_overview.htm)
* [Useful Pig Syntax Summary](http://timepasstechies.com/pig-tutorial-3-flatten-group-cogroup-cross-distinct-filter-foreach-limit-load-order-sample-split-store-stream-union-operators/)