<link rel='stylesheet' href='../assets/css/main.css'/>

# 3.1 H: Loading Data (RDD) from HDFS

- [Standalone version](3.1-rdd-basics.md) 
- [Hadoop version](3.1H-rdd-hadoop.md)

### Overview
* Learn basic operations such as filter / map / count
* Work with larger RDDs
* Load multiple files into a single RDD
* Save computed RDDs

### Depends On
None

### Run time
30-40 mins


In [6]:
# initialize Spark Session
import os
import sys
top_dir = os.path.abspath(os.path.join(os.getcwd(), "../"))
if top_dir not in sys.path:
    sys.path.append(top_dir)

from init_spark import init_spark
spark = init_spark()
sc = spark.sparkContext
spark

Initializing Spark...
Spark found in :  /home/ubuntu/spark
Spark config:
	 executor.memory=2g
	some_property=some_value
	spark.app.name=TestApp
	spark.master=local[*]
	spark.sql.warehouse.dir=/tmp/tmp8duaf0lb
	spark.submit.deployMode=client
	spark.ui.showConsoleProgress=true
Spark UI running on port 4047


## STEP 2: Load a simple text file

In [7]:
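f = spark.read.text("/data/text/twinkle/sample.txt")  # DataFrame API: returns a DataFrame, not an RDD

In [15]:
f = sc.textFile("/data/text/twinkle/sample.txt")  # RDD API: returns an RDD of lines (used in the rest of this lab)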

**=> What is the 'type' of `f`?**  
Hint: just type `f` in the shell.  
Here is a possible output:

In [9]:
f

/data/twinkle/sample.txt MapPartitionsRDD[1] at textFile at NativeMethodAccessorImpl.java:0

## STEP 3: Filter
Let's find how many lines contain the word 'twinkle'.  
We will use the `filter` function.

In [16]:
filtered = f.filter(lambda line: "twinkle" in line)

After entering the above in Spark-shell...  
**=> Go to the Spark shell UI**  
**=> Inspect the 'Stages' section in the UI.**  
**=> How is the filter executed? Can you explain the behavior?**  

**=> Count how many lines contain the word 'twinkle'**  
Hint: apply `count()` to the `filtered` variable

Here is a sample output

```
15/03/31 23:19:30 INFO DAGScheduler: Stage 0 (count at <console>:17) finished in 0.074 s
15/03/31 23:19:30 INFO DAGScheduler: Job 0 finished: count at <console>:17, took 0.141676 s
2  <--- this is the result of count()
```

**=> Check the Stages in the UI. What do you see?**  
**=> How long did the job take?**  
**=> Print out all the lines containing the word 'twinkle'**  
Hint: `collect()`  



In [17]:
#TODO: Print out all the lines containing the word 'twinkle'
filtered = f.filter(lambda line: "twinkle" in line)
# print(filtered.collect())

In [18]:
print(filtered.collect())

['twinkle twinkle little star', 'twinkle twinkle little star']

Here is a sample output
```
twinkle twinkle little star, twinkle twinkle little star
```

**=> Check out the 'DAG' visualization**

<img src="../assets/images/3.1c.png" style="border: 5px solid grey; max-width:100%;"/>

**=> Quit the PySpark shell using `exit()` or by pressing Control+D**


## STEP 4:  Large data sets
**=> Quit the previous shell session, if you haven't done so yet: `Control + D`**  

We have some large 'twinkle' data sets generated in the `/data/text/twinkle` directory.

<img src="../assets/images/3.1a.png" style="border: 5px solid grey; max-width:100%;"/>

## STEP 5:  Start shell With More Memory

```bash
$   pyspark  --executor-memory 1G  --driver-memory 1G
```

## STEP 6: Process a large file
**=> In the PySpark shell, load `/data/text/twinkle/100M.data`**

In [None]:
f = sc.textFile("/data/text/twinkle/100M.data")

**=> Count the number of lines that contain the word "diamond"**  
Hint: `filter` and `count`

**=> How many 'tasks' are used in the above calculation?**  
Hint: check the Spark shell UI

<img src="../assets/images/3.1b.png" style="border: 5px solid grey; max-width:100%;" />

**=> Can you explain the number of tasks?**  
Hint: check the number of partitions in the RDD using `getNumPartitions()` (`partitions.length` in the Scala shell)

In [None]:
f.getNumPartitions()

**=> Count the number of lines that do NOT contain the word 'diamond'**  


**=> Verify both counts add up to the total line count**




**=> Notice the time taken for each operation**

**=> Try the above with larger data files : 500M.data  ... 1G.data**
  - note the times taken
  - how many tasks?

## STEP 7: Loading multiple files
**=> Load all `*.data` files under the `/data/text/twinkle` directory**

In [None]:
f = sc.textFile("/data/text/twinkle/*.data")

**=> Do a `count()` on the RDD.**  
Notice the partition count and the time taken to execute.  
Verify the partition count from the Spark shell UI.

## STEP 8:  Saving the RDD
Continuing with the big RDD `f` created in the steps above...

**=> Create a new RDD by filtering first RDD for word 'diamond'**

In [None]:
filtered = f.filter(lambda line: "diamond" in line)

**=> Save the new RDD into a directory**

In [None]:
filtered.saveAsTextFile("MY_NAME_out")  # replace MY_NAME with your own name

**=> Inspect the output directory using the HDFS file browser**

**=> What do you see as output?**


## Bonus Lab: Merging partitions into a single one
When we saved the data in the section above, multiple files were created in the output directory. Can you create just one output file?  
Hint: see the API for `coalesce` or `repartition`