# Problem 2 - Working with SparkSQL

This is an interactive PySpark session. Remember that when you open this notebook the `SparkContext` and `SparkSession` are already created, and they are in the `sc` and `spark` variables, respectively. You can run the following two cells to make sure that the Kernel is active.

**Do not insert any cells beyond the ones that are provided.**

In [1]:
import findspark
findspark.init()
from pyspark import SparkContext
from pyspark.sql import SparkSession
sc = SparkContext()
spark = SparkSession.builder.appName("problem2").getOrCreate()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/10/18 03:51:21 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
22/10/18 03:51:28 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to request executors before the AM has registered!


## Quazyilx again!

Yes, you remember it. As a reminder, here is the description of the files.

The quazyilx has been malfunctioning, and occasionally generates output with a `-1` for all four measurements, like this:

    2015-12-10T08:40:10Z fnard:-1 fnok:-1 cark:-1 gnuck:-1

There are four different versions of the _quazyilx_ file, each of a different size. As you can see in the output below, the file sizes are 50MB (1,000,000 rows), 4.8GB (100,000,000 rows), 18GB (369,865,098 rows) and 36.7GB (752,981,134 rows). The files differ only in the number of records; the file structure is the same.

```
[hadoop@ip-172-31-1-240 ~]$ hadoop fs -ls s3://bigdatateaching/quazyilx/
Found 4 items
-rw-rw-rw-   1 hadoop hadoop    52443735 2018-01-25 15:37 s3://bigdatateaching/quazyilx/quazyilx0.txt
-rw-rw-rw-   1 hadoop hadoop  5244417004 2018-01-25 15:37 s3://bigdatateaching/quazyilx/quazyilx1.txt
-rw-rw-rw-   1 hadoop hadoop 19397230888 2018-01-25 15:38 s3://bigdatateaching/quazyilx/quazyilx2.txt
-rw-rw-rw-   1 hadoop hadoop 39489364082 2018-01-25 15:41 s3://bigdatateaching/quazyilx/quazyilx3.txt
```
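
As a quick sanity check on the sizes quoted above, the byte counts from the `hadoop fs -ls` listing can be converted to binary units and divided by the row counts. This is only a sketch: the names `files`, `GIB` and `bytes_per_row` are illustrative, and the numbers are copied from the listing and the preceding paragraph.

```python
# Byte counts come from the `hadoop fs -ls` listing; row counts from the text.
files = {
    "quazyilx0.txt": (52_443_735, 1_000_000),
    "quazyilx1.txt": (5_244_417_004, 100_000_000),
    "quazyilx2.txt": (19_397_230_888, 369_865_098),
    "quazyilx3.txt": (39_489_364_082, 752_981_134),
}

GIB = 1024 ** 3

bytes_per_row = {}
for name, (size, rows) in files.items():
    bytes_per_row[name] = size / rows
    print(f"{name}: {size / GIB:6.2f} GiB, {bytes_per_row[name]:.1f} bytes per row")
```

Every file works out to a little over 52 bytes per row, which is consistent with the files sharing one structure and differing only in length.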

You will use Spark to create a Spark RDD and then run some analysis on the files using custom functions and Spark RDDs.

Start off by copying the quazyilx1.txt file from the central bucket to your personal bucket.

In [2]:
!aws s3 cp s3://bigdatateaching/quazyilx/quazyilx1.txt s3://anly502-fall-2022-yl1353/

copy: s3://bigdatateaching/quazyilx/quazyilx1.txt to s3://anly502-fall-2022-yl1353/quazyilx1.txt


In the following cell, create an RDD called `quazyilx` that reads the `quazyilx1.txt` file from S3.

In [3]:
quazyilx = sc.textFile("s3://anly502-fall-2022-yl1353/quazyilx1.txt")

In the next cell, evaluate `quazyilx.take(100)` to make sure that everything is working correctly. This should take a few seconds.

In [4]:
quazyilx.take(100)

                                                                                

['2000-01-01 00:00:03 fnard:7 fnok:8 cark:19 gnuck:25',
 '2000-01-01 00:00:08 fnard:14 fnok:19 cark:16 gnuck:37',
 '2000-01-01 00:00:17 fnard:12 fnok:11 cark:12 gnuck:8',
 '2000-01-01 00:00:22 fnard:18 fnok:16 cark:3 gnuck:8',
 '2000-01-01 00:00:32 fnard:7 fnok:16 cark:7 gnuck:37',
 '2000-01-01 00:00:40 fnard:6 fnok:14 cark:3 gnuck:30',
 '2000-01-01 00:00:47 fnard:11 fnok:10 cark:17 gnuck:7',
 '2000-01-01 00:00:55 fnard:9 fnok:14 cark:13 gnuck:30',
 '2000-01-01 00:00:56 fnard:10 fnok:1 cark:7 gnuck:6',
 '2000-01-01 00:00:59 fnard:11 fnok:11 cark:12 gnuck:18',
 '2000-01-01 00:01:03 fnard:9 fnok:13 cark:14 gnuck:49',
 '2000-01-01 00:01:06 fnard:12 fnok:10 cark:19 gnuck:30',
 '2000-01-01 00:01:16 fnard:0 fnok:12 cark:19 gnuck:26',
 '2000-01-01 00:01:26 fnard:10 fnok:11 cark:10 gnuck:49',
 '2000-01-01 00:01:30 fnard:9 fnok:5 cark:16 gnuck:13',
 '2000-01-01 00:01:38 fnard:11 fnok:10 cark:7 gnuck:47',
 '2000-01-01 00:01:43 fnard:2 fnok:2 cark:20 gnuck:35',
 '2000-01-01 00:01:53 fnard:12 fnok

We now need to work with the RDD to turn it into a more structured format. In the following cell, modify the code to create a **function or class** called `quazyilx_class` that processes a line and returns it as a dictionary-like structure, with attributes `.time`, `.fnard`, `.fnok`, `.cark` and `.gnuck`.

You will need to define the Regular Expression and complete the class. The scaffolding has been provided for you. Use the helpful website [https://regex101.com/](https://regex101.com/) to build your regex to extract all the variables from the line.
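
One way to develop the pattern before running it on the cluster is to test it locally on a single line copied from the `take(100)` output. The sketch below uses named groups, which is a design choice of this example rather than a requirement of the assignment; the names `SAMPLE`, `pattern` and `fields` are hypothetical.

```python
import re

# A line copied from the `quazyilx.take(100)` output above.
SAMPLE = "2000-01-01 00:00:03 fnard:7 fnok:8 cark:19 gnuck:25"

# Raw strings so that `\d` and `\s` reach the regex engine unchanged.
# `-?` lets each measurement also match the malfunction value -1.
pattern = re.compile(
    r"(?P<time>\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2})\s"
    r"fnard:(?P<fnard>-?\d+)\s"
    r"fnok:(?P<fnok>-?\d+)\s"
    r"cark:(?P<cark>-?\d+)\s"
    r"gnuck:(?P<gnuck>-?\d+)"
)

match = pattern.search(SAMPLE)
fields = match.groupdict()   # every value is still a string at this point
print(fields)
```

Named groups keep the extraction readable, but numbered groups (as in the scaffolding) work equally well.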

In [5]:
import datetime
import re

# Raw string so the regex escapes are not interpreted by Python first.
QUAZYILX_RE = r"(\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2})\s(fnard:-?\d+)\s(fnok:-?\d+)\s(cark:-?\d+)\s(gnuck:-?\d+)"
quazyilx_re = re.compile(QUAZYILX_RE)

class quazyilx_class():
    def __init__(self, line):
        # Group 1 is the timestamp; groups 2-5 are "name:value" pairs.
        temp_m = quazyilx_re.search(line)
        self.time = datetime.datetime.strptime(temp_m.group(1), "%Y-%m-%d %H:%M:%S")
        # Measurements are kept as strings, e.g. '7' or '-1'.
        self.fnard = temp_m.group(2).split(':')[-1]
        self.fnok = temp_m.group(3).split(':')[-1]
        self.cark = temp_m.group(4).split(':')[-1]
        self.gnuck = temp_m.group(5).split(':')[-1]
        

You will then need to turn each element of the quazyilx RDD into a `Row()` object. A `Row` is somewhat similar to a dictionary, and this format lets you query different "variables" from your RDD at scale. You can make this structure with a lambda function, like this:

```python
lambda q: Row(datetime=q.time.isoformat(), fnard=q.fnard, fnok=q.fnok, cark=q.cark, gnuck=q.gnuck)
```

Alternatively, you can add a new method to `quazyilx_class` called `.Row()` that returns a Row. All of these ways are more or less equivalent. You just need to pick one of them. You may find it useful to look at [this documentation](http://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection).
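
Another variant, sketched here under the assumption that some lines may be malformed, is a plain function that returns a dictionary, or `None` when the line does not match. On the cluster, such dictionaries could be turned into rows with `Row(**d)`. The names `LINE_RE` and `parse_quazyilx`, and the choice to convert the measurements to `int`, are illustrative and not part of the provided scaffolding.

```python
import datetime
import re

LINE_RE = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"fnard:(-?\d+) fnok:(-?\d+) cark:(-?\d+) gnuck:(-?\d+)"
)

def parse_quazyilx(line):
    """Return a dict of fields, or None if the line does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    when = datetime.datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
    return {
        "datetime": when.isoformat(),
        "fnard": int(m.group(2)),
        "fnok": int(m.group(3)),
        "cark": int(m.group(4)),
        "gnuck": int(m.group(5)),
    }

good = parse_quazyilx("2000-01-01 00:00:08 fnard:14 fnok:19 cark:16 gnuck:37")
bad = parse_quazyilx("2000-01-01 00:01:53 fnard:12 fnok")  # truncated line
print(good)
print(bad)

# On the cluster this could be wired up roughly as:
#   line = (quazyilx.map(parse_quazyilx)
#                   .filter(lambda d: d is not None)
#                   .map(lambda d: Row(**d)))
```

Returning `None` instead of raising lets a single bad line be filtered out rather than failing the whole job; whether that is appropriate depends on how clean the input is.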

In the next cell, create an RDD called `line` that converts each element of the `quazyilx` RDD into a `Row()` object using `quazyilx_class`, and **cache** the result.

**Remember, start with a smaller set of data before moving to the entire large dataset by using the `.sample()` method.**

In [6]:
from pyspark.sql import Row

In [7]:
#small_qua = quazyilx.sample(fraction=0.5, withReplacement=True)

In [8]:
# Parse each line once, then build the Row from the parsed object.
line = (quazyilx.map(quazyilx_class)
                .map(lambda q: Row(datetime=q.time.isoformat(),
                                   fnard=q.fnard,
                                   fnok=q.fnok,
                                   cark=q.cark,
                                   gnuck=q.gnuck)))

In [9]:
quazyilx_class('2000-01-01 00:08:59 fnard:13 fnok:0 cark:17 gnuck:12')

<__main__.quazyilx_class at 0x7f4d090a9f10>

In [10]:
line.cache()

PythonRDD[3] at RDD at PythonRDD.scala:53

Look at the first 10 rows to make sure everything is working.

In [11]:
line.take(10)

                                                                                

[Row(datetime='2000-01-01T00:00:03', fnard='7', fnok='8', cark='19', gnuck='25'),
 Row(datetime='2000-01-01T00:00:08', fnard='14', fnok='19', cark='16', gnuck='37'),
 Row(datetime='2000-01-01T00:00:17', fnard='12', fnok='11', cark='12', gnuck='8'),
 Row(datetime='2000-01-01T00:00:22', fnard='18', fnok='16', cark='3', gnuck='8'),
 Row(datetime='2000-01-01T00:00:32', fnard='7', fnok='16', cark='7', gnuck='37'),
 Row(datetime='2000-01-01T00:00:40', fnard='6', fnok='14', cark='3', gnuck='30'),
 Row(datetime='2000-01-01T00:00:47', fnard='11', fnok='10', cark='17', gnuck='7'),
 Row(datetime='2000-01-01T00:00:55', fnard='9', fnok='14', cark='13', gnuck='30'),
 Row(datetime='2000-01-01T00:00:56', fnard='10', fnok='1', cark='7', gnuck='6'),
 Row(datetime='2000-01-01T00:00:59', fnard='11', fnok='11', cark='12', gnuck='18')]

You will calculate the following using Spark RDDs and save each result in a dictionary, as shown in the scaffolding code:

1. The number of rows in the dataset
1. The number of lines that have -1 for `fnard`, `fnok`, `cark` and `gnuck`.
1. The number of lines that have -1 for `fnard` but have `fnok > 5` and `cark > 5`
1. The first (earliest/smallest) datetime in the dataset
1. The first (earliest/smallest) datetime that has -1 for all of the values
1. The last (latest/largest) datetime in the dataset
1. The last (latest/largest) datetime that has a -1 for all of the values

Place each query into each of the following seven cells and run it to get the results. Remember, running the query statement itself will not give you the results you want. You need to do something else to "get" the result.
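
The filtering and ordering logic behind these queries can be checked in plain Python before it is applied to the RDD. The sketch below uses a few made-up rows (they are not taken from the dataset) and hypothetical helper names. The same predicates could be passed to `RDD.filter`. Because ISO-8601 timestamps of equal width sort lexicographically in chronological order, taking the minimum and maximum of the strings gives the earliest and latest times.

```python
# Made-up rows for illustration only; measurements are ints here.
rows = [
    {"datetime": "2000-01-01T00:00:03", "fnard": 7, "fnok": 8, "cark": 19, "gnuck": 25},
    {"datetime": "2000-01-01T00:05:10", "fnard": -1, "fnok": -1, "cark": -1, "gnuck": -1},
    {"datetime": "2000-01-01T00:07:42", "fnard": -1, "fnok": 9, "cark": 12, "gnuck": 3},
    {"datetime": "2000-01-02T13:00:00", "fnard": -1, "fnok": -1, "cark": -1, "gnuck": -1},
]

def all_minus_one(r):
    """True when all four measurements are -1."""
    return all(r[k] == -1 for k in ("fnard", "fnok", "cark", "gnuck"))

def fnard_bad_others_high(r):
    """True when fnard is -1 but fnok and cark are both above 5."""
    return r["fnard"] == -1 and r["fnok"] > 5 and r["cark"] > 5

bad_rows = [r for r in rows if all_minus_one(r)]

summary = {
    "rows": len(rows),
    "all_minus_one": len(bad_rows),
    "fnard_bad_others_high": sum(fnard_bad_others_high(r) for r in rows),
    "first": min(r["datetime"] for r in rows),
    "first_bad": min(r["datetime"] for r in bad_rows),
    "last": max(r["datetime"] for r in rows),
    "last_bad": max(r["datetime"] for r in bad_rows),
}
print(summary)
```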

**Note: in development testing, the first query may take approximately 10-15 minutes to run with the cluster configuration for this assignment (1 master, 4 task nodes of m5.xlarge). If you cache() correctly, all subsequent queries should take no more than 5 seconds.**


In [12]:
# Store all answers in this dictionary with keys 'q1','q2','q3',...
dict_answers = {}

In [15]:
dict_answers['q1'] = line.cache().count()

                                                                                

In [49]:
df = line.toDF(['datetime','fnard','fnok','cark','gnuck'])
df_n1 = df[df['fnard'] == "-1"]
df_n2 = df_n1[df_n1['fnok'] == "-1"]
df_n3 = df_n2[df_n2['cark'] == "-1"]
df_n4 = df_n3[df_n3['gnuck'] == "-1"].cache()
dict_answers['q2'] = df_n4.count()

                                                                                

In [51]:
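# The measurement columns hold strings, so cast explicitly before comparing with a number.
fn = df_n1[df_n1['fnok'].cast('int') > 5]
final = fn[fn['cark'].cast('int') > 5]
dict_answers['q3'] = final.count()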

                                                                                

In [61]:
# `line` is already cached. ISO timestamps of equal width sort chronologically as strings.
sorted_date = line.sortBy(lambda r: r.datetime)
dict_answers['q4'] = sorted_date.first()

Exception in thread "serve RDD 141 with partitions 0" java.net.SocketTimeoutException: Accept timed out
	at java.net.PlainSocketImpl.socketAccept(Native Method)
	at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:409)
	at java.net.ServerSocket.implAccept(ServerSocket.java:560)
	at java.net.ServerSocket.accept(ServerSocket.java:528)
	at org.apache.spark.security.SocketAuthServer$$anon$1.run(SocketAuthServer.scala:58)
                                                                                

In [57]:
sorted_date_bad = df_n4.sort(['datetime'],ascending=True)                          
dict_answers['q5'] = sorted_date_bad.first()

In [62]:
# top() returns a list, so take its single element; no prior sort is needed.
dict_answers['q6'] = line.top(1, key=lambda r: r.datetime)[0]

                                                                                

In [63]:
sorted_date_bad_1 = df_n4.sort(['datetime'],ascending=False)   
dict_answers['q7'] = sorted_date_bad_1.first()

### **Run the following cell to export your final dictionary results into a json file**

In [64]:
import json
# Use a context manager so the file is closed once the dump finishes.
with open('problem-2-soln.json', 'w') as fp:
    json.dump(str(dict_answers), fp)
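
Note that the export cell passes `str(dict_answers)` to `json.dump`, so the file holds a single JSON string containing the dictionary's Python representation rather than a JSON object. The sketch below, using made-up answer values and a temporary file, shows what reading such a file back looks like. `ast.literal_eval` only works here because the example values are plain literals; the names `example_answers`, `loaded` and `recovered` are hypothetical.

```python
import ast
import json
import os
import tempfile

# Made-up values for illustration; real answers come from the queries above.
example_answers = {"q1": 100, "q4": "2000-01-01T00:00:03"}

path = os.path.join(tempfile.mkdtemp(), "example-soln.json")
with open(path, "w") as fp:
    json.dump(str(example_answers), fp)   # same call shape as the export cell

with open(path) as fp:
    loaded = json.load(fp)                # a str, not a dict

recovered = ast.literal_eval(loaded)      # parse the Python repr back into a dict
print(type(loaded).__name__, recovered)
```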

When you finish this problem, click File -> 'Save and Checkpoint' in the menu bar to make sure that the latest version of the notebook file is saved. Before you close this notebook and move on, make sure you disconnect your SparkContext; otherwise you will not be able to re-allocate resources. Remember to commit the .ipynb file to the repository for submission (from the master node terminal).