# Prelude: A crash course in Zeppelin Notebook

Like a Jupyter Notebook, a Zeppelin Notebook has at least two kinds of cells.

1. Markdown cell - starts with `%md`
2. Code cell - starts with `%pyspark`

To run a cell, press the "play" button on its right, or press Shift+Enter.



In [1]:
%pyspark

print("This is a code block")



 

# Exercise 1

In this exercise, we perform some data transformation using PySpark and RDDs.

For parts marked with **[CODE CHANGE REQUIRED]**, you need to modify or complete the code before executing it.
For parts without **[CODE CHANGE REQUIRED]**, you can just run the given code.

## Input

The input data are stored in a text file `data/ex1/input.txt` which is a list of 2D coordinates in the following format.

```text
<label> 0:<x-value> 1:<y-value> 
...
<label> 0:<x-value> 1:<y-value>
```
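
As a rough illustration of this format, the following plain-Python sketch splits one input line into its label and its two coordinate values. The function name `parse_line` is hypothetical and not part of the lab code, and the sketch assumes that fields are separated by whitespace.

```python
def parse_line(line):
    """Split '<label> 0:<x-value> 1:<y-value>' into (label, x, y) strings.

    parse_line is a hypothetical helper, not part of the lab code.
    """
    label, x_field, y_field = line.split()  # split on any run of whitespace
    x = x_field.split(":")[1]  # drop the "0:" index prefix
    y = y_field.split(":")[1]  # drop the "1:" index prefix
    return label, x, y


print(parse_line("1 0:102 1:230"))
```

It runs without Spark, so it can be tried in any Python interpreter before moving on to the RDD version.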

## Output

The expected output of the transformation is two separate TSV outputs, `ones` and `zeros`, both in the following format.

```text
<x-value> <y-value>
...
<x-value> <y-value>
```


For example, given the following input file

```text
1 0:102 1:230
0 0:123 1:56
0 0:22 1:2
1 0:74 1:102
```
the output in `ones` would be

```text
102 230
74 102
```

and the output in `zeros` would be

```text
123 56
22 2
```
where the numbers on each line are separated by a tab.
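
To make the example concrete, here is a minimal plain-Python sketch that reproduces it on an ordinary list, without Spark. The names `example_lines`, `to_tsv` and `split_by_label` are made up for illustration, and the RDD-based script in this exercise is organised differently.

```python
# The four example input lines from above.
example_lines = [
    "1 0:102 1:230",
    "0 0:123 1:56",
    "0 0:22 1:2",
    "1 0:74 1:102",
]


def to_tsv(tokens):
    """Turn ['<label>', '0:<x>', '1:<y>'] into '<x><TAB><y>'."""
    x = tokens[1].split(":")[1]
    y = tokens[2].split(":")[1]
    return "\t".join([x, y])


def split_by_label(lines):
    """Return (ones, zeros): TSV rows grouped by the leading label."""
    tokenized = [line.split() for line in lines]
    ones = [to_tsv(t) for t in tokenized if t[0] == "1"]
    zeros = [to_tsv(t) for t in tokenized if t[0] == "0"]
    return ones, zeros


ones, zeros = split_by_label(example_lines)
print("\n".join(ones))
print("\n".join(zeros))
```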

**[CODE CHANGE REQUIRED]** 
Modify and run the following bash cell to upload the input data to HDFS. Set `namenode` to the host name of your own name node.

In [5]:
%sh
export PATH=$PATH:/home/ec2-user/hadoop/bin/

namenode=ip-172-31-89-172 # TODO: replace with the host name of your own name node

# -f suppresses the error when the directory does not exist yet
hdfs dfs -rm -r -f hdfs://$namenode:9000/lab12/ex1/input/
hdfs dfs -mkdir -p hdfs://$namenode:9000/lab12/ex1/input/

hdfs dfs -put /home/ec2-user/git/50043-labs/lab12/data/ex1/input.txt  hdfs://$namenode:9000/lab12/ex1/input/


**[CODE CHANGE REQUIRED]** 
Complete the following PySpark script so that it performs the above-mentioned transformation.


<style>
    div.hidecode + pre {display: none}
</style>
<script>
doclick=function(e) {
    e.nextSibling.nextSibling.style.display="block";
}
</script>

<div class="hidecode" onclick="doclick(this);">[Show Hint]</div>

```text
Let r be an RDD. r.map(f) applies f to every element of r and returns a new RDD.
r.filter(p) returns a new RDD containing only the elements e of r for which p(e) is true.
r.saveAsTextFile("hdfs://...") saves an RDD to HDFS.
```
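
Python's built-in `map` and `filter` behave analogously on ordinary lists, which may help build intuition before working with RDDs. The sketch below runs without Spark; the list `numbers` and the lambdas are made up for illustration. The RDD versions differ in that they run distributed across the cluster and are only evaluated once an action such as `saveAsTextFile` is called.

```python
numbers = [1, 2, 3, 4, 5, 6]

# map applies a function to every element.
squares = list(map(lambda n: n * n, numbers))

# filter keeps only the elements for which the predicate is true.
evens = list(filter(lambda n: n % 2 == 0, numbers))

print(squares)  # [1, 4, 9, 16, 25, 36]
print(evens)    # [2, 4, 6]
```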

In [7]:
%pyspark

from pyspark.sql import SparkSession
sparkSession = SparkSession.builder.appName("Transformation notebook").getOrCreate()
sc = sparkSession.sparkContext

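hdfs_nn = "ip-172-31-89-172"  # TODO: replace with the host name of your own name node

def join(tokenized):
    # tokenized looks like ["<label>", "0:<x-value>", "1:<y-value>"]
    x = (tokenized[1].split(":"))[1]  # keep the value after "0:"
    y = (tokenized[2].split(":"))[1]  # keep the value after "1:"
    return "\t".join([x,y])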

sc.appName = "ETL (Transform) Example"

input = sc.textFile("hdfs://%s:9000/lab12/ex1/input/" % hdfs_nn) 

tokenizeds = input.map(lambda line : line.split(" ")) 
tokenizeds.persist()

ones = tokenizeds.filter(lambda tokenized : tokenized[0] == "1").map(join)
ones.saveAsTextFile("hdfs://%s:9000/lab12/ex1/output/ones" % hdfs_nn)

zeros = None # TODO: fix me
zeros.saveAsTextFile("hdfs://%s:9000/lab12/ex1/output/zeros" % hdfs_nn)

sc.stop()



 
### Sample answer

<div class="hidecode" onclick="doclick(this);">[Show Answer]</div>


```python

zeros = tokenizeds.filter(lambda tokenized : tokenized[0] == "0").map(join)

```


## Test case

Run the following bash cells to check the results.

The output should look something like the following

```text
20/11/12 18:51:34 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
124	253
145	255
5	63
1	168
121	254
166	222
178	255
7	176
68	45
...
```

and 

```text
124	253
145	255
5	63
1	168
121	254
166	222
178	255
7	176
68	45
64	191
...
```
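
Judging from the sample output, each data line should hold exactly two numbers separated by a single tab. The following plain-Python sketch checks that shape; `is_valid_row` is a hypothetical helper, and it assumes the coordinates are non-negative integers, as they are in the samples shown here.

```python
def is_valid_row(row):
    """Return True if row is two non-negative integers separated by one tab.

    is_valid_row is a hypothetical helper, not part of the lab code.
    """
    fields = row.split("\t")
    return len(fields) == 2 and all(f.isdigit() for f in fields)


# A few rows copied from the sample output.
sample_rows = ["124\t253", "145\t255", "5\t63"]
print(all(is_valid_row(r) for r in sample_rows))
```

Log lines such as the Hadoop `WARN` message fail this check, so filter them out before validating real output.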


In [10]:
%sh
export PATH=$PATH:/home/ec2-user/hadoop/bin/

namenode=ip-172-31-89-172

hdfs dfs -cat hdfs://$namenode:9000/lab12/ex1/output/ones/* 


In [11]:
%sh
export PATH=$PATH:/home/ec2-user/hadoop/bin/

namenode=ip-172-31-89-172

hdfs dfs -cat hdfs://$namenode:9000/lab12/ex1/output/zeros/* 


## Cleaning up

Modify and run the following bash cell to clean up HDFS.

In [13]:
%sh
export PATH=$PATH:/home/ec2-user/hadoop/bin/

namenode=ip-172-31-89-172

hdfs dfs -rm -r hdfs://$namenode:9000/lab12/ex1

 

## End of Exercise 1
