## From Local Mode to Cluster Mode
Spark provides three options for running on a cluster:
<img src="images/spark_modes.png">

Mesos and YARN are designed for sharing a Spark cluster between teams, so we will stick with Standalone mode.
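
In practice, the mode is selected through the master URL passed to `spark-submit --master` or to `SparkSession.builder.master()`. The sketch below shows how that URL differs per mode. It assumes the default ports (7077 for Standalone, 5050 for Mesos); the helper `master_url` and the host name `master-node` are hypothetical names used only for illustration.

```python
def master_url(mode, host=None, port=None):
    """Return the value to pass to `--master` for a given cluster mode."""
    if mode == "local":
        return "local[*]"  # run on all cores of the current machine
    if mode == "yarn":
        return "yarn"  # cluster location is read from the Hadoop configuration
    # Modes that are addressed by an explicit master host and port.
    defaults = {"standalone": ("spark", 7077), "mesos": ("mesos", 5050)}
    if mode not in defaults:
        raise ValueError(f"unknown mode: {mode!r}")
    if host is None:
        raise ValueError(f"{mode} mode needs the master host name")
    scheme, default_port = defaults[mode]
    return f"{scheme}://{host}:{port or default_port}"


print(master_url("local"))                           # local[*]
print(master_url("standalone", host="master-node"))  # spark://master-node:7077
print(master_url("yarn"))                            # yarn
```

The returned string could then be used as `SparkSession.builder.master(master_url("local"))` when experimenting locally.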

With big data, the data is too large to fit on a single computer, so it is kept on a cluster. As a data scientist, you'll run Spark jobs on data stored in an external database or in third-party storage rented from a cloud computing provider such as Amazon.

<img src="images/spark_bigdata.png">

To build a Spark cluster you have two options:
1. Buy computers and build the cluster yourself.
2. Use a cloud platform such as Amazon Web Services to rent a cluster of machines, expanding or shrinking the cluster as you need. You can log in from anywhere to use the cluster.

Our setup will look like this:

<img src="images/rented_spark_cluster.png">

The data will be stored in S3, and the machines for Spark will be rented through the EC2 service of AWS. We'll then log in to the Spark cluster remotely and submit jobs to it.

## Setup Instructions AWS
If you want to create a Spark cluster manually, you can follow this [guide](https://blog.insightdatascience.com/spinning-up-a-spark-cluster-on-spot-instances-step-by-step-e8ed14ebb3b). However, it's quite tedious: you have to perform the same steps on every machine, and any update has to be repeated several times. Fortunately, AWS offers an easier option called Elastic MapReduce, or EMR for short. EMR provides EC2 instances with big data technologies preinstalled.

The instructions for setting up an EMR cluster are as follows:
1. Create an SSH key pair to connect to the cluster securely. To do this, go to the EC2 service, select **`Key pairs`** under **`NETWORK & SECURITY`**, and create a key pair. Name it `spark-cluster`; a `pem` file will be downloaded for you.
2. Go to the EMR service and create a cluster, naming it and configuring it according to your requirements. Make sure to select `Spark` in the **`Software configuration`** section, and choose an instance type that fits your needs. To compare EC2 instance types, look [here](https://aws.amazon.com/ec2/instance-types/) or [here](https://ec2instances.info/). Finally, select your EC2 key pair and create the cluster.

For more details, follow this [tutorial](https://www.youtube.com/watch?v=ZVdAEMGDFdo) on YouTube.
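
The same cluster can also be requested programmatically rather than through the console. The following is only a sketch of what such a request might look like for boto3's `run_job_flow` call. The cluster name, instance type, instance count, and release label are assumptions chosen for illustration, and the default EMR roles are assumed to already exist in the account; `emr_cluster_request` is a hypothetical helper.

```python
def emr_cluster_request(name, key_name, instance_type="m5.xlarge",
                        instance_count=3, release_label="emr-5.36.0"):
    """Build keyword arguments for boto3's EMR `run_job_flow` call.

    The defaults are illustrative assumptions, not recommendations.
    """
    return {
        "Name": name,
        "ReleaseLabel": release_label,
        "Applications": [{"Name": "Spark"}],  # the Spark choice under Software configuration
        "Instances": {
            "MasterInstanceType": instance_type,
            "SlaveInstanceType": instance_type,
            "InstanceCount": instance_count,
            "Ec2KeyName": key_name,  # the key pair created in step 1
            "KeepJobFlowAliveWhenNoSteps": True,  # keep the cluster up while idle
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


request = emr_cluster_request("spark-demo", key_name="spark-cluster")
print(request["Applications"])

# With boto3 installed and AWS credentials configured, the request could be sent as:
#   import boto3
#   boto3.client("emr").run_job_flow(**request)
```

Sending the request starts billable EC2 instances, so the call itself is left as a comment here.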

## Using Notebooks on your Cluster
Once the cluster is in the `Waiting` state, you can connect to it. Amazon offers multiple ways to connect; we'll use `Notebook`, found in the menu on the left. Click `create notebook`, name your notebook, attach a cluster to it, leave the remaining settings at their defaults, and create the notebook. For more details, follow this [tutorial](https://www.youtube.com/watch?v=EcIYPkCkehY) from **Udacity** on YouTube.

## Spark Scripts
Up until now, we have used Jupyter notebooks. They have the following advantages:
1. Good for prototyping
2. Exploring and visualizing the data
3. Easily share the results with others

But Jupyter notebooks are not well suited to automating workflows. That's where scripts come into play. For more details, follow this [tutorial](https://www.youtube.com/watch?v=bfOocPv54EI) from **Udacity** on YouTube.

## Submitting Spark Scripts

The following notebook cell writes a small example script to `lower_scripts.py`:

    %%writefile lower_scripts.py
    from pyspark.sql import SparkSession

    if __name__ == '__main__':
        # Example program to show how to submit applications.
        spark = SparkSession\
                .builder\
                .appName('LowerSongTitles')\
                .getOrCreate()

        log_of_songs = [
            "Despacito",
            "Nice for what",
            "No tears left to cry",
            "Despacito",
            "Havana",
            "In my feelings",
            "Nice for what",
            "Despacito",
            "All the stars"
        ]

        # Distribute the list across the cluster as an RDD.
        distributed_song_log = spark.sparkContext.parallelize(log_of_songs)

        # Lowercase every title and bring the results back to the driver.
        print(distributed_song_log.map(lambda x: x.lower()).collect())

        spark.stop()



Change the permissions of the `pem` file with the following Linux command:

    chmod 600 spark-cluster.pem

Then connect to the master node and run the Spark job with the following command:

    spark-submit --master yarn ./lower_scripts.py
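
The steps above can be sketched as a small helper that assembles the command lines for copying the script, connecting, and submitting. It assumes the master node accepts SSH logins as the `hadoop` user (the usual EMR default) and that `MASTER_PUBLIC_DNS` is replaced with the address shown in the EMR console; `cluster_commands` is a hypothetical name, and the helper only builds strings without running anything.

```python
import shlex


def cluster_commands(pem_file, master_dns, script, user="hadoop"):
    """Return the shell commands for copying, connecting, and submitting."""
    target = f"{user}@{master_dns}"
    return {
        # Copy the script into the remote user's home directory.
        "copy": shlex.join(["scp", "-i", pem_file, script, f"{target}:"]),
        # Open a shell on the master node.
        "connect": shlex.join(["ssh", "-i", pem_file, target]),
        # Run on the master node after connecting.
        "submit": shlex.join(["spark-submit", "--master", "yarn", f"./{script}"]),
    }


commands = cluster_commands("spark-cluster.pem", "MASTER_PUBLIC_DNS", "lower_scripts.py")
for step, command in commands.items():
    print(f"{step}: {command}")
```

Building the commands with `shlex.join` keeps file names containing spaces or other special characters correctly quoted.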

## Reading and Writing to Amazon S3
