# Lab - Basic RDD Operations

This lab introduces you to working with Spark and RDDs, using a Jupyter Notebook and PySpark to interact with Spark.

There are many methods that can be used with RDDs. See [this great cheat sheet by the DataCamp team](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/PySpark_Cheat_Sheet_Python.pdf). A copy is also in this repository.

There is also an [accompanying reference notebook](reference/reference-rdds.ipynb) that shows many RDD transformations and actions. It comes from the book [Learning PySpark by Tomasz Drabas and Denny Lee](https://learning.oreilly.com/library/view/learning-pyspark/9781786463708/).

In [1]:
import findspark
findspark.init()

In [2]:
from pyspark import SparkContext

In [3]:
sc = SparkContext()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/10/05 03:03:03 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
22/10/05 03:03:14 WARN JettyUtils: GET /jobs/ failed: org.apache.spark.SparkException: Failed to get the application information. If you are starting up Spark, please wait a while until it's ready.
org.apache.spark.SparkException: Failed to get the application information. If you are starting up Spark, please wait a while until it's ready.
	at org.apache.spark.status.AppStatusStore.applicationInfo(AppStatusStore.scala:44)
	at org.apache.spark.ui.jobs.AllJobsPage.render(AllJobsPage.scala:266)
	at org.apache.spark.ui.WebUI.$anonfun$attachPage$1(WebUI.scala:89)
	at org.apache.spark.ui.JettyUtils$$anon$1.doGet(JettyUtils.scala:80)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:687)
	at javax.servlet.http.HttpSer

In [4]:
sc

Create an RDD called `A` that reads the following text file: `s3://bigdatateaching/shakespeare/100-0.txt`, the complete works of William Shakespeare.

In [5]:
A = sc.textFile("s3://bigdatateaching/shakespeare/100-0.txt")

Type `A` to see what it is: a pointer to the file in S3, not the file's contents.

In [6]:
A

s3://bigdatateaching/shakespeare/100-0.txt MapPartitionsRDD[1] at textFile at NativeMethodAccessorImpl.java:0

Display the first 5 elements of `A` by using the `take` command.

In [7]:
A.take(5)

                                                                                

['',
 'Project Gutenberg’s The Complete Works of William Shakespeare, by',
 'William Shakespeare',
 '',
 'This eBook is for the use of anyone anywhere in the United States and']

Now, store the first 5 elements of `A` in a local Python object called `a`.

In [8]:
a = A.take(5)

What kind of object is `a`? Remember, this is a local object within your Python session.

In [9]:
type(a)

list

Display the contents of `a`.

In [10]:
a

['',
 'Project Gutenberg’s The Complete Works of William Shakespeare, by',
 'William Shakespeare',
 '',
 'This eBook is for the use of anyone anywhere in the United States and']

You can index into `a` using standard Python code. What is the second element in `a`?

In [11]:
a[1]

'Project Gutenberg’s The Complete Works of William Shakespeare, by'

Now try indexing into the RDD `A`. It won't work.

In [12]:
A[1]

TypeError: 'RDD' object is not subscriptable

How many elements does `A` have?

In [14]:
A.count()

147838

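We talked about keeping data in memory to reuse later. To do that, use the `cache` method on an RDD. Note that `cache` only marks the RDD for caching; the data is stored in memory the first time an action computes it.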

In [15]:
A.cache()

s3://bigdatateaching/shakespeare/100-0.txt MapPartitionsRDD[1] at textFile at NativeMethodAccessorImpl.java:0

The following Python function checks whether the word "Hamlet" exists in a line.

In [16]:
def hasHamlet( s ):
    return "Hamlet" in s

Create a new RDD called `b` that uses the Python `hasHamlet` function to keep only the lines of `A` that contain "Hamlet".

In [17]:
b = A.filter(hasHamlet)
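
To check the predicate without touching the cluster, you can apply the same function to a few local strings with Python's built-in `filter`. This is only a plain-Python sketch of what `filter` does to each element; the first two sample lines are copied from outputs in this lab, and the third is a made-up line added for illustration.

```python
def hasHamlet(s):
    return "Hamlet" in s

sample_lines = [
    "HORATIO, Friend to Hamlet.",
    "William Shakespeare",
    "HAMLET, Prince of Denmark.",  # hypothetical line, for illustration only
]

# The built-in filter keeps the items for which the predicate is True,
# mirroring what an RDD filter does for each element of the dataset
matches = list(filter(hasHamlet, sample_lines))
print(matches)
```

The `in` test is case-sensitive, so the upper-case "HAMLET" line is not kept.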

What is `b`?

In [18]:
b

PythonRDD[6] at RDD at PythonRDD.scala:53

How many elements does `b` have?

In [20]:
b.count()

106

That took a few seconds, didn't it? Now count `A` again; it should be much quicker than before because `A` is now cached.

In [21]:
A.count()

147838

You can also use the `first` method to get only the first element of an RDD.

In [23]:
b.first()

'CLAUDIUS, King of Denmark, Hamlet’s uncle.'

The `first` method takes no argument. To get more than one record, such as the first 10, use `take`.

In [24]:
b.take(10)

['CLAUDIUS, King of Denmark, Hamlet’s uncle.',
 'The GHOST of the late king, Hamlet’s father.',
 'GERTRUDE, the Queen, Hamlet’s mother, now wife of Claudius.',
 'HORATIO, Friend to Hamlet.',
 'Dar’d to the combat; in which our valiant Hamlet,',
 'His fell to Hamlet. Now, sir, young Fortinbras,',
 'Unto young Hamlet; for upon my life,',
 ' Enter Claudius King of Denmark, Gertrude the Queen, Hamlet, Polonius,',
 'Though yet of Hamlet our dear brother’s death',
 'But now, my cousin Hamlet, and my son—']

How many partitions does the RDD `A` have? Use the `getNumPartitions` method to find out.

In [25]:
A.getNumPartitions()

2

You can also sample records from an RDD using the `takeSample` method. Sample 10 records from `b` with replacement.

In [26]:
b.takeSample(True, 10) 

['I have nothing with this answer, Hamlet; these words are not mine.',
 'The very cause of Hamlet’s lunacy.',
 'So please you, something touching the Lord Hamlet.',
 'Unto young Hamlet; for upon my life,',
 'You need not tell us what Lord Hamlet said,',
 ' [_Exit Hamlet dragging out Polonius._]',
 ' [_Exeunt all but Hamlet._]',
 'No, no, the drink, the drink! O my dear Hamlet!',
 ' Enter Hamlet and certain Players.',
 'Though yet of Hamlet our dear brother’s death']

Now we will re-do one of the first assignment problems with the `quazyilx` dataset. First, create an RDD called `quazyilx` from the `s3://bigdatateaching/quazyilx/quazyilx1.txt` file (the ~5GB file).

In [28]:
quazyilx = sc.textFile("s3://bigdatateaching/quazyilx/quazyilx1.txt")

See how many partitions the RDD has. This is analogous to the number of blocks the file occupies on disk.

In [29]:
quazyilx.getNumPartitions()

79

Create an RDD called `badrec` that uses a filter statement to find the bad records; you will cache it in the next step. Remember that each record is a whole line of text.

In [30]:
badrec = quazyilx.filter(lambda bad:"fnard:-1 fnok:-1 cark:-1 gnuck:-1" in bad)

Cache `badrec` so that later actions can reuse it.

In [34]:
badrec.cache()

PythonRDD[18] at RDD at PythonRDD.scala:53

How many bad records were there? The `count` action returns only the number. If you want all the records of an RDD, use the `collect` method instead. Be careful, though: `collect` brings every element back to your Python session, so with a large dataset it could overflow the session's memory.

In [35]:
badrec.count()

190

Take a look at the first 10 elements of `badrec`.

In [36]:
badrec.take(10)

['2000-01-28 03:07:44 fnard:-1 fnok:-1 cark:-1 gnuck:-1',
 '2000-02-21 19:21:07 fnard:-1 fnok:-1 cark:-1 gnuck:-1',
 '2000-03-01 17:31:22 fnard:-1 fnok:-1 cark:-1 gnuck:-1',
 '2000-04-29 03:37:34 fnard:-1 fnok:-1 cark:-1 gnuck:-1',
 '2000-07-08 21:27:33 fnard:-1 fnok:-1 cark:-1 gnuck:-1',
 '2000-08-15 19:21:29 fnard:-1 fnok:-1 cark:-1 gnuck:-1',
 '2000-09-09 19:25:47 fnard:-1 fnok:-1 cark:-1 gnuck:-1',
 '2000-10-02 15:06:39 fnard:-1 fnok:-1 cark:-1 gnuck:-1',
 '2000-11-14 06:08:56 fnard:-1 fnok:-1 cark:-1 gnuck:-1',
 '2000-11-15 01:31:47 fnard:-1 fnok:-1 cark:-1 gnuck:-1']
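
The filter above matches bad records by an exact substring. As an alternative sketch (not the method used in this lab), a record can be split into its fields and tested field by field. The helper names `parse_record` and `is_bad` are hypothetical, the "bad" line is copied from the output above, and the "good" line uses made-up values.

```python
def parse_record(line):
    # Split into date, time, and the remaining "name:value" fields
    date, time, *fields = line.split()
    values = {}
    for field in fields:
        name, value = field.split(":")
        values[name] = int(value)
    return date, time, values

def is_bad(line):
    # Treat a record as bad when it has readings and all of them equal -1
    _, _, values = parse_record(line)
    return bool(values) and all(v == -1 for v in values.values())

bad_line = "2000-01-28 03:07:44 fnard:-1 fnok:-1 cark:-1 gnuck:-1"
good_line = "2000-01-28 03:07:45 fnard:7 fnok:3 cark:5 gnuck:2"  # hypothetical values

print(is_bad(bad_line), is_bad(good_line))
```

A function like `is_bad` could be passed to `filter` in place of the lambda, at the cost of parsing every line.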

What type is `badrec`?

In [38]:
type(badrec)

pyspark.rdd.PipelinedRDD

Now we will work with the ForensicsWiki logs dataset and use RDD methods to do the same analysis we did in previous homeworks.

First, create an RDD called `forensicswiki` pointing to the ForensicsWiki dataset at `s3://bigdatateaching/forensicswiki/2012_logs.txt`.

In [39]:
forensicswiki = sc.textFile("s3://bigdatateaching/forensicswiki/2012_logs.txt")

The following two cells have Python code that will be run on the RDD.

In [40]:
import re
import datetime
# Matches dates such as 05/Oct/2012; the raw string keeps \d from being read as a string escape
date_re = re.compile(r"(\d\d/[a-zA-Z]+/\d\d\d\d)")

In [41]:
def extract(line):
    # Return the month of a log line as "YYYY-MM", or None if the line has no date
    m = date_re.search(line)
    if m:
        d = datetime.datetime.strptime(m.group(1),"%d/%b/%Y")
        return "{:04}-{:02}".format(d.year,d.month)
    return None
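
Before running `extract` on the whole dataset, it can help to try it on a single string. The sketch below repeats the two cells above so that it runs on its own. The sample log line is hypothetical and assumes an Apache-style access-log layout, which may differ from the real ForensicsWiki file.

```python
import re
import datetime

date_re = re.compile(r"(\d\d/[a-zA-Z]+/\d\d\d\d)")

def extract(line):
    m = date_re.search(line)
    if m:
        d = datetime.datetime.strptime(m.group(1), "%d/%b/%Y")
        return "{:04}-{:02}".format(d.year, d.month)

# Hypothetical log line, for illustration only
sample = '192.0.2.1 - - [05/Mar/2012:10:15:32 -0800] "GET /wiki/Main_Page HTTP/1.1" 200 5120'

month = extract(sample)                     # the date 05/Mar/2012 becomes a month key
no_date = extract("a line without a date")  # no match, so the function returns None
print(month, no_date)
```

Lines without a date produce `None`, so any such lines would show up under a `None` key after aggregation.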

Create a new RDD called `dates` that runs the `extract` function on every element of the `forensicswiki` RDD and pairs each result with the value 1, giving `[month, 1]` key-value pairs.

In [42]:
dates = forensicswiki.map(lambda line:[extract(line), 1])

Cache the `dates` RDD, since it is reused below, and look at what the call returns.

In [44]:
dates.cache()

PythonRDD[25] at RDD at PythonRDD.scala:53

Aggregate the `dates` dataset by month, effectively conducting the **reducer** step. This can be done with `dates.countByKey()`, which returns a local dictionary, or with a manual reduction step that returns a new RDD:

```
from operator import add
add_by_date = dates.reduceByKey(add)
```

In [47]:
dates.countByKey()

                                                                                

defaultdict(int,
            {'2012-01': 1544100,
             '2012-02': 1325030,
             '2012-03': 1274061,
             '2012-04': 1016456,
             '2012-05': 1173380,
             '2012-06': 1300250,
             '2012-07': 1287187,
             '2012-08': 1450426,
             '2012-09': 1284945,
             '2012-10': 1498895,
             '2012-11': 1397343,
             '2012-12': 1396198,
             '2013-01': 1283})
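
To see what the two aggregation options compute, here is a plain-Python sketch on a handful of made-up `[month, 1]` pairs. It runs locally without Spark and only imitates the per-key logic, not the distributed execution.

```python
from collections import defaultdict
from functools import reduce
from operator import add

# Made-up pairs in the same [month, 1] shape as the dates RDD
pairs = [["2012-01", 1], ["2012-02", 1], ["2012-01", 1], ["2012-03", 1], ["2012-01", 1]]

# Like countByKey: count how many pairs carry each key, ignoring the values
counts = defaultdict(int)
for key, _ in pairs:
    counts[key] += 1

# Like reduceByKey(add): collect the values for each key, then fold them with add
grouped = defaultdict(list)
for key, value in pairs:
    grouped[key].append(value)
sums = {key: reduce(add, values) for key, values in grouped.items()}

print(dict(counts))
print(sums)
```

The two results agree here only because every value is 1; with other values, counting pairs and summing values would differ.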

## **Save to your git repo as soln.json:**
1. the first 100 rows of the `dates` dataset. This involves the `take` command.
2. the first 10 month key-value pairs in the form [('2012-10', 1234567), ...] ordered by date ascending (earliest date first)

In [74]:
df_first100_result = dates.take(100)

In [75]:
from operator import add
add_by_date = dates.reduceByKey(add)
sorted_date = add_by_date.sortBy(lambda x: x[0])
first10_months = sorted_date.take(10)
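
Sorting on the key works because the zero-padded `YYYY-MM` strings sort in the same order as the dates they represent. A small local sketch with made-up counts shows the idea using Python's `sorted`:

```python
# Made-up (month, count) pairs, deliberately out of order
monthly = [("2012-10", 4), ("2013-01", 1), ("2012-02", 7), ("2012-11", 2)]

# Sort on the key, the same key function passed to sortBy above
ascending = sorted(monthly, key=lambda x: x[0])

# Slicing plays the role of take(2) on the sorted result
earliest_two = ascending[:2]
print(earliest_two)
```

Without the zero padding, a key such as "2012-2" would sort after "2012-10", which is why `extract` formats the month with two digits.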

                                                                                

In [76]:
import json
# Use a context manager so the file is flushed and closed after writing
with open('soln.json', 'w') as fp:
    json.dump({'dates_df' : df_first100_result,
               'first10' : first10_months},
              fp)
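
One detail worth knowing when you inspect the saved file: JSON has no tuple type, so pairs written as tuples come back as lists. The sketch below demonstrates this with made-up data and a temporary file; the file name `soln_example.json` is hypothetical, chosen so it does not overwrite your real `soln.json`.

```python
import json
import os
import tempfile

# Made-up result in the same shape as first10_months
result = {"first10": [("2012-01", 3), ("2012-02", 1)]}

# Write to a temporary directory and read the file back
path = os.path.join(tempfile.mkdtemp(), "soln_example.json")
with open(path, "w") as fp:
    json.dump(result, fp)
with open(path) as fp:
    loaded = json.load(fp)

print(loaded)
```

The pairs are preserved in order and value; only their container type changes from tuple to list.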

## **Git add, commit, and push all your files to GitHub!! You can use the built-in Git module in Jupyter Lab!**

Before you close the Jupyter Notebook, it is best to close the connection to the Spark cluster by calling `sc.stop()`. If you don't, you may leave an "orphan" connection that eats up resources.