## From Local Mode to Cluster Mode
Spark provides three options for running on a cluster.
<img src="images/spark_modes.png">

Mesos and YARN are meant for sharing a Spark cluster between teams, so we will stick with Standalone mode.
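
The cluster manager is chosen through the master URL that Spark is given. As a rough sketch, the modes map to URLs as shown here. The helper name `master_url` and the host name `master-node` are made up for illustration; the URL formats themselves are the ones Spark documents.

```python
def master_url(mode, host=None, port=None):
    """Return the Spark master URL for a deployment mode.

    `master_url` is a hypothetical helper used only for illustration.
    """
    if mode == "local":
        return "local[*]"  # run on this machine, one worker thread per core
    if mode == "standalone":
        return f"spark://{host}:{port or 7077}"  # Spark's own cluster manager
    if mode == "mesos":
        return f"mesos://{host}:{port or 5050}"
    if mode == "yarn":
        return "yarn"  # cluster location comes from the Hadoop configuration
    raise ValueError(f"unknown mode: {mode}")


print(master_url("local"))
print(master_url("standalone", host="master-node"))
```

The resulting string is what would be passed to `SparkSession.builder.master(...)` or to the `--master` flag of `spark-submit`.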

With big data, the data is too big to fit on a single computer, so it is kept on a cluster. As a data scientist, you'll run Spark jobs on data stored in an external database or in third-party storage rented from a cloud computing provider such as Amazon.

<img src="images/spark_bigdata.png">

To build a Spark cluster you have two options:
1. Buy computers and build the cluster yourself.
2. Use a cloud platform such as Amazon Web Services to rent a cluster of machines, expanding or shrinking the cluster as you need. You can log in from anywhere to use the cluster.

Our setup will look like this:

<img src="images/rented_spark_cluster.png">

The data will be stored on S3, and the machines for Spark will be rented through the EC2 service of AWS. We'll then log in to the Spark cluster remotely and submit jobs to it.

## Setup Instructions AWS
If you want to create a Spark cluster manually, you can follow this [guide](https://blog.insightdatascience.com/spinning-up-a-spark-cluster-on-spot-instances-step-by-step-e8ed14ebb3b). However, it's quite tedious: you have to perform the same steps on every machine, and any update has to be repeated several times. Fortunately, AWS offers an easier option called Elastic MapReduce, or EMR for short. EMR provides EC2 instances with big data technologies already installed.
The instructions to set up an EMR cluster are as follows:
1. Create an SSH key pair to connect securely to the cluster. To do this, go to the EC2 service, select **`Key pairs`** under **`NETWORK & SECURITY`**, and create a key pair. Name it `spark-cluster`; a `pem` file will be downloaded for you.
2. Go to the EMR service and create a cluster, naming it and configuring it according to your requirements. Make sure to select `Spark` in the **`Software configuration`** section, and select the instance type that fits your needs.
To compare EC2 instance types, go either [here](https://aws.amazon.com/ec2/instance-types/) or [here](https://ec2instances.info/). Select the EC2 key pair and create the cluster.
For more detail, follow this [tutorial](https://www.youtube.com/watch?v=ZVdAEMGDFdo) on YouTube.
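
The key pair from step 1 can also be created from code rather than the console. The following is a minimal sketch, assuming `ec2_client` is a boto3 EC2 client (for example `boto3.client('ec2')`). The function name `save_key_pair` and its parameters are hypothetical; `create_key_pair` and the `KeyMaterial` field are part of the boto3 EC2 API.

```python
import os


def save_key_pair(ec2_client, key_name, directory="."):
    """Create an EC2 key pair and save the private key next to the notebook.

    `save_key_pair` is an illustrative helper, not part of boto3.
    """
    response = ec2_client.create_key_pair(KeyName=key_name)
    path = os.path.join(directory, f"{key_name}.pem")
    # Create the file with mode 0o600 so only the owner can read the key.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as pem_file:
        pem_file.write(response["KeyMaterial"])
    return path
```

Calling `save_key_pair(client, 'spark-cluster')` would then leave a `spark-cluster.pem` file in the current directory, assuming a key pair with that name does not already exist in the account.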

### Creating EMR Cluster Using Boto3 

In [1]:
import boto3
import pandas as pd

In [2]:
credentials = pd.read_csv('credentials/credentials.csv')
ACCESS_KEY = credentials['Access key ID'][0]
SECRET_ACCESS_KEY = credentials['Secret access key'][0]

In [3]:
emr = boto3.client('emr',
                   region_name = 'us-west-2',
                   aws_access_key_id = ACCESS_KEY,
                   aws_secret_access_key = SECRET_ACCESS_KEY)

ec2 = boto3.resource('ec2', 
                     region_name = 'us-west-2',
                     aws_access_key_id = ACCESS_KEY,
                     aws_secret_access_key = SECRET_ACCESS_KEY)
client = boto3.client('ec2',
                      region_name = 'us-west-2',
                      aws_access_key_id = ACCESS_KEY,
                      aws_secret_access_key = SECRET_ACCESS_KEY)

In [18]:
VPC = client.describe_vpcs()['Vpcs'][0]['VpcId']
subnet = client.create_subnet(CidrBlock='172.31.0.0/16',VpcId=VPC,AvailabilityZone='us-west-2a')
SUBNET = subnet['Subnet']['SubnetId']

In [21]:
cluster_id = emr.run_job_flow(
    Name='spark-cluster',
    LogUri='s3://naqeeb-emr-test/logs',
    ReleaseLabel='emr-5.28.0',
    Applications=[
        {
            'Name': 'Spark'
        },
    ],
    Instances={
        'InstanceGroups': [
            {
                'Name': "Master nodes",
                'Market': 'ON_DEMAND',
                'InstanceRole': 'MASTER',
                'InstanceType': 'm5.xlarge',
                'InstanceCount': 1,
            },
            {
                'Name': "Slave nodes",
                'Market': 'ON_DEMAND',
                'InstanceRole': 'CORE',
                'InstanceType': 'm5.xlarge',
                'InstanceCount': 3,
            }
        ],
        'Ec2KeyName': 'spark-cluster',
        'KeepJobFlowAliveWhenNoSteps': True,
        'TerminationProtected': False,
        'Ec2SubnetId': SUBNET,
    },
    VisibleToAllUsers=True,
    JobFlowRole='EMR_EC2_DefaultRole',
    ServiceRole='EMR_DefaultRole'
)

print ('cluster created with the step...', cluster_id['JobFlowId'])

cluster created with the step... j-22N0Z19JS1VSW
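
The cluster takes a few minutes to start, so it can help to wait for it before connecting. Below is a minimal polling sketch, assuming `emr_client` is a boto3 EMR client such as the `emr` object created earlier. The function name `wait_until_ready` and its default intervals are illustrative choices, not part of boto3.

```python
import time


def wait_until_ready(emr_client, cluster_id, poll_seconds=30, max_polls=60):
    """Poll an EMR cluster until it is usable or has shut down.

    `wait_until_ready` is a hypothetical helper for illustration.
    """
    for _ in range(max_polls):
        description = emr_client.describe_cluster(ClusterId=cluster_id)
        state = description["Cluster"]["Status"]["State"]
        if state in ("WAITING", "RUNNING"):
            return state  # cluster is up and accepting work
        if state in ("TERMINATING", "TERMINATED", "TERMINATED_WITH_ERRORS"):
            raise RuntimeError(f"cluster {cluster_id} ended in state {state}")
        time.sleep(poll_seconds)  # still STARTING or BOOTSTRAPPING
    raise TimeoutError(f"cluster {cluster_id} not ready after {max_polls} polls")
```

It would be called as `wait_until_ready(emr, cluster_id['JobFlowId'])`. boto3 also ships a built-in waiter, `emr.get_waiter('cluster_running')`, which may be the simpler choice when no custom handling is needed.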


### Deleting EMR Cluster Using Boto3

In [20]:
emr.terminate_job_flows(JobFlowIds=[
        cluster_id['JobFlowId'],
    ])

{'ResponseMetadata': {'RequestId': '677070d1-844b-4019-a0b2-5baf0dbfcd6f',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '677070d1-844b-4019-a0b2-5baf0dbfcd6f',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '0',
   'date': 'Thu, 09 Jan 2020 10:43:48 GMT'},
  'RetryAttempts': 0}}

In [17]:
client.delete_subnet(SubnetId=SUBNET)

{'ResponseMetadata': {'RequestId': 'd5f27889-1087-4813-9720-b2307f7b8ee6',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'content-type': 'text/xml;charset=UTF-8',
   'content-length': '225',
   'date': 'Thu, 09 Jan 2020 10:33:04 GMT',
   'server': 'AmazonEC2'},
  'RetryAttempts': 0}}

## Using Notebooks on your Cluster
Once the cluster is in the `Waiting` state, connect to it. Amazon offers multiple ways to connect to the cluster. We'll use `Notebook`, found in the menu on the left side. Click on `create notebook`, name your notebook, attach a cluster to it, leave the rest at the defaults, and create the notebook. For more detail, follow this [tutorial](https://www.youtube.com/watch?v=EcIYPkCkehY) from **Udacity** on YouTube.

Note: When you are using a notebook, change the kernel to match the environment of your choice (in this case, PySpark).

## Spark Scripts
Up until now, Jupyter notebooks were used. They have the following advantages:
1. Good for prototyping
2. Exploring and visualizing the data
3. Easily share the results with others

But Jupyter notebooks are not good for automating workflows. That's where scripts come into play. For more detail, follow this [tutorial](https://www.youtube.com/watch?v=bfOocPv54EI) from **Udacity** on YouTube.

## Submitting Spark Scripts

In [2]:
%%writefile lower_scripts.py
from pyspark.sql import SparkSession

if __name__ == '__main__':
    """
        example program to show how to submit applications
    """
    spark = SparkSession\
            .builder\
            .appName('LowerSongTitles')\
            .getOrCreate()

    log_of_songs = [
        "Despacito",
        "Nice for what",
        "No tears left to cry",
        "Despacito",
        "Havana",
        "In my feelings",
        "Nice for what",
        "Despacito",
        "All the stars"
    ]
    
    distributed_song_log = spark.sparkContext.parallelize(log_of_songs)
    
    print(distributed_song_log.map(lambda x:x.lower()).collect())
    
    spark.stop()

Overwriting lower_scripts.py


To connect to the cluster using the pem file, first change the permissions of the pem file with the following Linux command:

    chmod 600 your_pem_file.pem

Connect to the master node according to the instructions on the AWS console, and use the following command to run the Spark job:

    spark-submit --master yarn ./lower_scripts.py
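
When submissions are automated, the same command line can be assembled in Python. This is a sketch only: the helper name `spark_submit_command` and the example setting `spark.executor.memory=2g` are illustrative assumptions, while `--master`, `--deploy-mode` and `--conf` are standard `spark-submit` options.

```python
import shlex


def spark_submit_command(script, master="yarn", deploy_mode=None, conf=None):
    """Assemble a spark-submit command line as a list of arguments."""
    command = ["spark-submit", "--master", master]
    if deploy_mode is not None:
        command += ["--deploy-mode", deploy_mode]  # "client" or "cluster"
    for key, value in (conf or {}).items():
        command += ["--conf", f"{key}={value}"]
    command.append(script)
    return command


command = spark_submit_command("./lower_scripts.py",
                               conf={"spark.executor.memory": "2g"})
# Quote each argument so the printed line is safe to paste into a shell.
print(" ".join(shlex.quote(part) for part in command))
```

The list form can be passed directly to `subprocess.run(command)` on the master node, which avoids shell-quoting problems.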

## Storing and Retrieving Data on the cloud
We'll be using Amazon Simple Storage Service, or S3 for short. It is:
* Safe
* Easy to use
* Cheap

### Using Spark to read from and write data to an S3 bucket
Spark can read from and write to an S3 bucket simply by using the bucket's path, like the following:

    spark.read.json('s3n://bucket_name/file.json')

## Introduction to HDFS
When you use S3, you are separating data storage from the Spark cluster. One downside is that you have to download the data across the network, which can be a bottleneck. Another solution is to store the data on the Spark cluster itself with HDFS. But HDFS comes with a trade-off: you have to maintain and fix the system yourself. S3 is easier since we don't need to maintain the storage. Also, if you rent a cluster from AWS, the data usually doesn't have to travel far across the network, since the cluster hardware and the S3 hardware are both in Amazon's data centers. Finally, Spark is smart enough to download a small chunk of data and process that chunk while waiting for the rest to download.

## Reading and Writing to HDFS
Similar to how we read from and write to S3, we can read from and write to HDFS. The only difference is the path, e.g.

    spark.read.json('hdfs:///user/sparkify_data/data.json')
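
To see how the two kinds of path differ, the standard library can split them into their parts. This small sketch uses only `urllib.parse` and the two example paths from this notebook; it does not touch S3 or HDFS.

```python
from urllib.parse import urlparse

for uri in ("s3n://bucket_name/file.json",
            "hdfs:///user/sparkify_data/data.json"):
    parts = urlparse(uri)
    # scheme selects the file system; netloc is the bucket for S3, and it is
    # empty for the HDFS path, which means the cluster's default namenode.
    print(parts.scheme, repr(parts.netloc), parts.path)
```

The scheme is what tells Spark which storage system to talk to; the rest of the reading and writing code stays the same.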

## Debugging is Hard
When you are in local mode, errors show up directly. But once you are running on a cluster of machines, errors in the code can be very hard to diagnose.

## Syntax Errors
Suppose you have written your Spark program but there's a bug in your code. The code may seem to work just fine, but remember Spark's lazy evaluation: Spark waits as long as it can before running your code on the data, so some errors only surface once an action forces the computation.
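
Lazy evaluation can be illustrated without a cluster. The following is a plain-Python analogy using a generator, not Spark itself; the function `to_lower` and the sample list are invented for the illustration. Defining the "transformation" succeeds, and the bug only shows up when the values are actually requested.

```python
def to_lower(titles):
    # A generator is lazy: this body does not run until values are requested.
    for title in titles:
        yield title.lower()


songs = ["Despacito", None, "Havana"]   # None stands in for a bad record

lowered = to_lower(songs)               # like a transformation: no error yet

try:
    result = list(lowered)              # like an action: the bug surfaces here
except AttributeError as error:
    result = f"failed only when evaluated: {error}"

print(result)
```

In Spark the pattern is the same: a chain of transformations can be defined without complaint, and the failure appears at the first action such as `collect()` or `show()`.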

## Code Errors

In [1]:
from pyspark.sql import SparkSession

spark = SparkSession\
            .builder\
            .config('spark.ui.port',3000)\
            .getOrCreate()

In [2]:
logs = spark.read.json('data/sparkify_log_small.json')

In [3]:
logs.take(1)

[Row(artist='Showaddywaddy', auth='Logged In', firstName='Kenneth', gender='M', itemInSession=112, lastName='Matthews', length=232.93342, level='paid', location='Charlotte-Concord-Gastonia, NC-SC', method='PUT', page='NextSong', registration=1509380319284, sessionId=5132, song='Christmas Tears Will Fall', status=200, ts=1513720872284, userAgent='"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36"', userId='1046')]

In [4]:
log = logs.select(['userId','firstname','page','song'])\
        .wher(logs.userId == '1046')

AttributeError: 'DataFrame' object has no attribute 'wher'

This is a kind of syntax error, and it is easy to debug: the message points to a typo in the code. If we replace `wher` with `where`, the code works fine.

In [7]:
log = logs.select(['userId','firstname','page','song'])\
        .where(logs.userId == '1046')

In [8]:
userId = log.groupby('userId').count()

In [9]:
userId.collect()

[Row(userId='1046', count=30)]

In [10]:
logs2 = logs.withColumn('artist',logs.artist + 'x')

In [11]:
# logs2.crossJoin(logs).collect()

Collecting the full cross join (the commented-out line above) runs out of memory, so we'll just take 5 rows.

In [12]:
logs2.crossJoin(logs).take(5)

[Row(artist=None, auth='Logged In', firstName='Kenneth', gender='M', itemInSession=112, lastName='Matthews', length=232.93342, level='paid', location='Charlotte-Concord-Gastonia, NC-SC', method='PUT', page='NextSong', registration=1509380319284, sessionId=5132, song='Christmas Tears Will Fall', status=200, ts=1513720872284, userAgent='"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36"', userId='1046', artist='Showaddywaddy', auth='Logged In', firstName='Kenneth', gender='M', itemInSession=112, lastName='Matthews', length=232.93342, level='paid', location='Charlotte-Concord-Gastonia, NC-SC', method='PUT', page='NextSong', registration=1509380319284, sessionId=5132, song='Christmas Tears Will Fall', status=200, ts=1513720872284, userAgent='"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36"', userId='1046'),
 Row(artist=None, auth='Logged In', firstName='Elizabeth', gender='F', item

In [13]:
songs = logs.where(logs.page == 'NextSong')

In [14]:
songs.head()

Row(artist='Showaddywaddy', auth='Logged In', firstName='Kenneth', gender='M', itemInSession=112, lastName='Matthews', length=232.93342, level='paid', location='Charlotte-Concord-Gastonia, NC-SC', method='PUT', page='NextSong', registration=1509380319284, sessionId=5132, song='Christmas Tears Will Fall', status=200, ts=1513720872284, userAgent='"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36"', userId='1046')

In [15]:
import pyspark.sql.functions as fn

In [16]:
songs.groupby('userId').agg(fn.sum(songs.length)).show()

+------+------------------+
|userId|       sum(length)|
+------+------------------+
|  2904|         348.57751|
|   691|         808.98476|
|  2294|13926.819139999998|
|  2162|        8289.81315|
|  1436|         633.39011|
|  2088|3310.0480000000002|
|  2275|         1172.1913|
|  2756|1076.6344800000002|
|   800|         517.17134|
|  1394| 5989.630679999999|
|   451|         433.44889|
|   926|1087.8414400000001|
|  2696|         200.95955|
|   870|         463.51583|
|     7| 533.9419499999999|
|  1903|        1058.81895|
|   591|         219.79383|
|   613|         419.26439|
|   574|        1286.55491|
|   307|         281.28608|
+------+------------------+
only showing top 20 rows



## Data Errors
Even if your code is perfectly written, your data might have errors such as missing values or unexpected Unicode characters. These types of issues shouldn't crash your program. Suppose you run your code on a subset of the data and it works fine, but when you run the same code on the full data set it throws an error.

## Demo: Data Errors

In [55]:
from pyspark.sql import SparkSession
from pyspark import SparkContext

spark = SparkSession\
            .builder\
            .config('spark.ui.port',3000)\
            .getOrCreate()

In [56]:
logs3 = spark.read.json('data/sparkify_log_small_errors.json')

This file has some records that don't conform to the schema. Let's print the schema.

In [57]:
logs3.printSchema()

root
 |-- _corrupt_record: string (nullable = true)
 |-- artist: string (nullable = true)
 |-- auth: string (nullable = true)
 |-- firstName: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- itemInSession: long (nullable = true)
 |-- lastName: string (nullable = true)
 |-- length: double (nullable = true)
 |-- level: string (nullable = true)
 |-- location: string (nullable = true)
 |-- method: string (nullable = true)
 |-- page: string (nullable = true)
 |-- registration: long (nullable = true)
 |-- sessionId: long (nullable = true)
 |-- song: string (nullable = true)
 |-- status: long (nullable = true)
 |-- ts: long (nullable = true)
 |-- userAgent: string (nullable = true)
 |-- userId: string (nullable = true)



As can be seen, there's a new field, `_corrupt_record`. Let's look at the corrupt records.

In [58]:
logs3.where(logs3['_corrupt_record'].isNotNull()).collect()

[Row(_corrupt_record='{"ts":a,"userId":"1035","sessionId":5698,"page":"NextSong","auth":"Logged In","method":"PUT","status":200,"level":"paid","itemInSession":24,"location":"Santa Cruz-Watsonville, CA","userAgent":"\\"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.143 Safari/537.36\\"","lastName":"Gillespie","firstName":"Connor","registration":1506639389284,"gender":"M","artist":"Spoon","song":"Black Like Me","length":205.94893}', artist=None, auth=None, firstName=None, gender=None, itemInSession=None, lastName=None, length=None, level=None, location=None, method=None, page=None, registration=None, sessionId=None, song=None, status=None, ts=None, userAgent=None, userId=None),
 Row(_corrupt_record='{"ts":b,"userId":"2373","sessionId":2372,"page":"NextSong","auth":"Logged In","method":"PUT","status":200,"level":"paid","itemInSession":13,"location":"San Luis Obispo-Paso Robles-Arroyo Grande, CA","userAgent":"\\"Mozilla/5.0 (Windows NT 6.1; WOW64) Appl

As can be seen, the corrupt records have a `ts` value that is an unquoted letter (such as `a` or `b`) rather than a number, so those lines are not valid JSON.
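
What the JSON reader did here can be mimicked in plain Python. The sketch below is an analogy, not Spark's implementation: `parse_permissive` is a hypothetical helper, and the two input lines are shortened stand-ins for the records shown above. Lines that fail to parse are kept as raw text in a `_corrupt_record` field instead of crashing the program.

```python
import json

lines = [
    '{"ts": 1513720872284, "userId": "1046", "page": "NextSong"}',
    '{"ts": a, "userId": "1035", "page": "NextSong"}',   # not valid JSON
]


def parse_permissive(line):
    """Parse one JSON line; keep the raw text if it cannot be parsed."""
    try:
        record = json.loads(line)
        record["_corrupt_record"] = None
        return record
    except json.JSONDecodeError:
        return {"_corrupt_record": line}


records = [parse_permissive(line) for line in lines]
bad = [r for r in records if r["_corrupt_record"] is not None]
print(len(bad), "corrupt record(s)")
```

Keeping the bad lines around, rather than dropping them silently, is what makes it possible to inspect them afterwards as we did with `isNotNull()`.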

## Debugging your Code
If you are writing a traditional Python script, you might use print statements to output the values held by variables. A typical example would be printing values inside a nested for-loop indexed by `i` and `j`:
    
    for i in range(500):
        for j in range(300):
            print(x[i][j])
            
These print statements can be helpful when debugging your code. But this won't work on Spark; instead we need to use a special variable.
But why can't we use print statements on a cluster? We have a driver node coordinating the tasks of various worker nodes, so print statements inside those tasks run only on the worker nodes. You can't see their output because you are not directly connected to them. Also, Spark makes a copy of the input data every time you call a function, so the original debugging variables that you created won't actually get loaded onto the worker nodes. Instead, each worker has its own copy of these variables and only these copies get modified. The original variables stored on the driver remain unchanged, making them useless for debugging.

<img src="images/debugging.png">

To get around this limitation, Spark gives you a special kind of variable called an accumulator. Accumulators act as global variables for your entire cluster.
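
The copy problem can be simulated in plain Python with no Spark involved. In this sketch `run_on_worker` and the sample partitions are invented for illustration, and `pickle` stands in for the way variables are serialized and shipped to workers. Each "worker" updates its own copy, the driver's variable never changes, and the count is only recovered by merging what the workers send back, which is the idea behind an accumulator.

```python
import pickle

counter = {"bad_records": 0}            # debugging variable on the "driver"


def run_on_worker(serialized_counter, records):
    """Simulate a worker: it receives a copy of the driver's variable."""
    local_counter = pickle.loads(serialized_counter)
    for record in records:
        if record is None:
            local_counter["bad_records"] += 1
    return local_counter["bad_records"]


partitions = [[1, None, 3], [None, None, 6]]

# Ship a serialized copy of the counter to each "worker".
per_worker = [run_on_worker(pickle.dumps(counter), part) for part in partitions]

print(counter["bad_records"])           # the driver's copy is untouched
total = sum(per_worker)                 # merge the workers' updates
print(total)
```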

## How to Use Accumulators

First step is to create an accumulator

In [59]:
# Create the accumulator from the active SparkContext, starting at 0.
incorrect_records = spark.sparkContext.accumulator(0)

In [60]:
incorrect_records.value

0

2nd step is to create a function that will increment the accumulator

In [61]:
def add_incorrect_record():
    global incorrect_records
    incorrect_records += 1

Now we need to just define a udf

In [62]:
from pyspark.sql.functions import udf

In [63]:
correct_ts = udf(lambda x: 1 if x is not None else add_incorrect_record())

In [64]:
logs3 = logs3\
            .withColumn('ts_digit',correct_ts(logs3.ts))

In [65]:
incorrect_records.value

0

The value is still 0. Why is that? Either we don't have any corrupt records, or, since Spark uses lazy evaluation, nothing has run yet and we have to apply an action first.

In [66]:
logs3.where(logs3['_corrupt_record'].isNotNull()).show()

+--------------------+------+----+---------+------+-------------+--------+------+-----+--------+------+----+------------+---------+----+------+----+---------+------+--------+
|     _corrupt_record|artist|auth|firstName|gender|itemInSession|lastName|length|level|location|method|page|registration|sessionId|song|status|  ts|userAgent|userId|ts_digit|
+--------------------+------+----+---------+------+-------------+--------+------+-----+--------+------+----+------------+---------+----+------+----+---------+------+--------+
|{"ts":a,"userId":...|  null|null|     null|  null|         null|    null|  null| null|    null|  null|null|        null|     null|null|  null|null|     null|  null|    null|
|{"ts":b,"userId":...|  null|null|     null|  null|         null|    null|  null| null|    null|  null|null|        null|     null|null|  null|null|     null|  null|    null|
|{"ts":c,"userId":...|  null|null|     null|  null|         null|    null|  null| null|    null|  null|null|        null|    

In [69]:
incorrect_records.value

4

In [71]:
logs3.cache()

DataFrame[_corrupt_record: string, artist: string, auth: string, firstName: string, gender: string, itemInSession: bigint, lastName: string, length: double, level: string, location: string, method: string, page: string, registration: bigint, sessionId: bigint, song: string, status: bigint, ts: bigint, userAgent: string, userId: string, ts_digit: string]

In [74]:
logs3.take(1)

[Row(_corrupt_record=None, artist='Showaddywaddy', auth='Logged In', firstName='Kenneth', gender='M', itemInSession=112, lastName='Matthews', length=232.93342, level='paid', location='Charlotte-Concord-Gastonia, NC-SC', method='PUT', page='NextSong', registration=1509380319284, sessionId=5132, song='Christmas Tears Will Fall', status=200, ts=1513720872284, userAgent='"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36"', userId='1046', ts_digit='1')]

## Spark WebUI
Since debugging on a cluster is hard, Spark has a built-in user interface that you can access from your web browser. This interface, known as the web UI, helps you understand what's going on in your cluster without looking at individual workers. Spark's UI is like an EKG machine that helps you measure the health of your Spark jobs. It's a very useful tool for diagnosing issues in your code and your cluster, but it's just a tool. You still need to know how to interpret the output and where to investigate further.

When a doctor measures a patient's heart rate with an EKG, they need to understand not only how the heart works, but also how the heart relates to other parts of the human anatomy. In the same way, understanding Spark internals like shuffling, DAGs and stages, which we discussed earlier, helps you interpret what the web UI shows for the Spark cluster.

So, what does the web UI actually show? The web UI provides the current configuration for the cluster, which can be useful for double-checking that your desired settings went into effect. The web UI also shows you the DAG, the recipe of steps for your program that we went through earlier. You'll see the DAG broken up into stages, and within each stage there are individual tasks. Tasks are the steps that the individual worker nodes are assigned. In each stage, the worker node divides up the input data and runs the task for that stage.

<img src="images/spark-web-ui.png">

The web UI only shows the pages related to current jobs that are running. For example, you won't see any pages related to other libraries like Spark Streaming unless you are also running a streaming job.

## Connecting to the Spark Web UI
Connecting to Spark's web UI is a lot like docking boats. When boats come to a seaport, it's useful to have various types of ports so that each type of boat can dock efficiently and safely. For a cruise ship, you'll need a port for passenger loading, while a cargo ship needs a port equipped with cranes, and a personal speedboat only requires a small dock or pier. As a result, seaports have different ports with different procedures for docking.

<img src="images/boat-ports.png">

In the same way, it's useful to have several ways to connect data with a machine. When you transfer private data through a secure shell, known as SSH, you follow a different protocol than when transferring public HTML data for a webpage using HTTP. For this reason, we use different ports, port `22` for SSH and port `80` for HTTP, to indicate that we're transferring data using different network protocols. These commonly numbered ports follow a convention that engineers agreed upon. It's like having an agreed-upon layout for each seaport in the world.

<img src="images/ports-machine.png">

Spark uses several agreed-upon ports for sharing information. Some ports are for machines to communicate with each other and aren't intended for users. For example, the Spark master uses port `7077` to communicate with the worker nodes, but we'll never use it directly. There are a few common ports that we'll use from time to time. For example, we've already seen that we usually connect to Jupyter notebooks on port `8888`. Another important port is `4040`, which shows active Spark jobs. But the most useful port to memorize is `8080`, the web UI for the master node. The web UI on `8080` shows the status of your cluster, its configuration, and the status of any recent jobs.
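
A quick way to find out which of these services is reachable is to try a TCP connection to each port. This is a minimal sketch using only the standard library; the helper `port_is_open` and the choice of `localhost` are assumptions for illustration, and on a remote cluster the host would be the master node's address (possibly through an SSH tunnel).

```python
import socket

# Ports mentioned in the text above.
SPARK_PORTS = {
    "master (worker communication)": 7077,
    "Jupyter notebook": 8888,
    "active Spark jobs": 4040,
    "master web UI": 8080,
}


def port_is_open(host, port, timeout=0.5):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


for name, port in SPARK_PORTS.items():
    print(f"{name:32} {port}  open={port_is_open('localhost', port)}")
```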

<img src="images/spark-ports.png">

## Getting Familiar with the Spark UI
Since we've configured the UI port to `3000`, we can connect to the Spark UI by going to http://localhost:3000. Under the **`Environment`** tab, we can see the different parameters of our application: the Java version, the Scala version, the name of the application, and so forth.

<img src="images/web-ui-environment.png">

The **`Executors`** tab gives you information about the executors: what resources they have and how many tasks they have run successfully. In this particular case, we're looking at a Spark local executor, so there is only one of them. It has run `241` tasks, of which `237` have completed.

<img src="images/web-ui-executors.png">

The **`Storage`** tab is currently empty, but if you have cached RDDs in your application, you can find that information here.

<img src="images/web-ui-storage.png">

A Spark application consists of as many jobs as there are actions in the code. An action can be saving a data frame to a database or taking some records back to the driver for inspection. For example, after loading the data frame, calling `head` is an action that triggers a job. We can see the jobs that were run here. Jobs are further broken down into stages. If we click on a job, we get access to the stages it consists of.

<img src= "images/web-ui-jobs.png">

Stages are units of work that depend on one another and can be further parallelized. For example, before joining two DataFrames, we need to finish transforming them both, and only after that can we perform the actual join. The smallest unit within a stage is a task.

<img src="images/web-ui-jobs2.png">

Tasks are a series of Spark transformations that can run in parallel on the different partitions of our DataFrame. So when we have 10 partitions, we run 10 of the same task to complete a stage. In this particular case we had 200 partitions, so we ended up with 200 tasks.
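
The relation between partitions, tasks and available cores is simple arithmetic. The sketch below assumes each core runs one task at a time, which is Spark's default, and a hypothetical cluster with 8 cores; the helper name `task_waves` is made up for the illustration.

```python
import math


def task_waves(num_partitions, total_cores):
    """Rounds of tasks needed when each core runs one task at a time."""
    return math.ceil(num_partitions / total_cores)


waves = task_waves(200, 8)   # 200 partitions means 200 tasks in the stage
print(waves)
```

With 200 tasks and 8 cores the stage needs 25 rounds of tasks, so the number of partitions relative to the number of cores has a direct effect on how long a stage takes.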

<img src="images/web-ui-stage.png">

For further understanding, check this [tutorial](https://www.youtube.com/watch?v=88JQIalP84M) by Udacity on YouTube.

## Review of the Log Data
Using log files is a bit more difficult when we run Spark on a cluster than when we run it locally. The log files, like everything else, are spread across the different nodes. Fortunately, the Spark UI provides a convenient way to look them up, so we don't need to access the various workers directly via SSH. If we had a cluster rather than local Spark running here, we would have another column called logs, with two links to the standard output and standard error files for each of the workers.

<img src="images/web-ui-executors.png">

In this case, since this is a local Spark application, we only have a Thread Dump for the driver. Spark uses Log4j, a standard JVM library for logging. We can configure the logging level in two different ways:
1. Edit the Log4j properties file in the `conf` directory
2. Set it in the Spark context

So if we set the Spark log level to `ERROR`, we'll only see error messages in the log files. The code for doing this is the following:
    
    spark.sparkContext.setLogLevel('ERROR')

If we would like more verbose logging, we can set this level to `INFO` and we'll get more information about the application.
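
Because a misspelled level name is easy to miss, a small wrapper can validate the name before handing it to Spark. This is a sketch: `set_log_level` is a hypothetical helper, and the set of names is the one listed in the PySpark documentation for `setLogLevel`.

```python
VALID_LEVELS = {"ALL", "DEBUG", "ERROR", "FATAL", "INFO", "OFF", "TRACE", "WARN"}


def set_log_level(spark, level):
    """Validate a level name, then apply it through the Spark context."""
    level = level.upper()
    if level not in VALID_LEVELS:
        raise ValueError(f"unknown log level: {level}")
    spark.sparkContext.setLogLevel(level)
    return level
```

With an active session this would be called as `set_log_level(spark, 'info')`.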