### Introducing Apache Spark

Spark started in 2009 as a research project in the UC Berkeley RAD Lab, which later became the AMPLab. The researchers in the lab had previously been working on Hadoop MapReduce (MR) and observed that it was inefficient for iterative and interactive computing jobs. Thus, from the beginning, Spark was designed to be fast for interactive queries and iterative algorithms, bringing in ideas such as support for in-memory storage and efficient fault recovery.

Spark was first open sourced in March 2010. In 2011, the AMPLab started to develop higher-level components on Spark, such as Shark and Spark Streaming; these components are sometimes referred to as the Berkeley Data Analytics Stack (BDAS). Spark was transferred to the Apache Software Foundation in June 2013.

In February 2014, Spark became a top-level project at the Apache Software Foundation. It has since become one of the largest open source communities in Big Data. At the time the source was written, over 250 contributors from more than 50 organizations were contributing to Spark development. The user base has grown to range from small companies to Fortune 500 companies.

### What is Apache Spark?
Let's understand what Apache Spark is and what makes it a force to reckon with in Big Data analytics:
* Apache Spark is a fast enterprise-grade large-scale data processing engine, which is interoperable with Apache Hadoop.
* It is written in Scala, which is both an object-oriented and functional programming language that runs in a JVM.
* Spark enables applications to distribute data reliably in memory during processing. This is the key to Spark's performance, as it allows applications to avoid expensive disk access and perform computations at memory speeds.
* It is suitable for iterative algorithms by having every iteration access data through memory.
* Spark's project page reports that programs can run up to 100 times faster than MR in memory, or up to 10 times faster on disk (http://spark.apache.org/).
* It provides native support for Java, Scala, Python, and R languages with interactive shells for Scala, Python, and R. Applications can be developed easily, and often 2 to 10 times less code is needed.
* Spark powers a stack of libraries, including Spark SQL and DataFrames for interactive analytics, MLlib for machine learning, GraphX for graph processing, and Spark Streaming for real-time analytics. You can combine these features seamlessly in the same application.
* Spark runs on Hadoop, Mesos, standalone cluster managers, on-premise hardware, or in the cloud.

### What Apache Spark is not
Hadoop provides HDFS for storage and MR for compute. Spark, by contrast, does not provide any specific storage medium. Spark is mainly a compute engine, but you can hold data in memory or on Tachyon (since renamed Alluxio) to process it.

Spark can create distributed datasets from any file stored in HDFS or in other storage systems supported by Hadoop APIs (including your local filesystem, Amazon S3, Cassandra, Hive, HBase, Elasticsearch, and others).

It's important to note that Spark is not Hadoop and does not require Hadoop to run. It simply supports storage systems that implement Hadoop APIs. Spark supports text files, sequence files, Avro, Parquet, and any other Hadoop InputFormat.

*Source: Big Data Analytics - Venkat Ankam*

### Load SparkContext

A SparkContext sets up internal services and establishes a connection to a Spark execution environment.

Once a SparkContext is created, you can use it to create RDDs, accumulators, and broadcast variables, access Spark services, and run jobs (until the SparkContext is stopped).

*Source: Mastering Apache Spark 2 (https://www.gitbook.com/book/jaceklaskowski/mastering-apache-spark/details)*

In [5]:
from pyspark import SparkContext, SparkConf

In [6]:
if 'sc' not in globals():  # Create the SparkContext only if one does not already exist
    conf = SparkConf().setMaster('local[*]')
    sc = SparkContext(conf=conf)

In [7]:
# print out spark version
sc.version

#### Resilient Distributed Dataset
RDDs are the fundamental unit of data in Spark, and Spark programming revolves around creating RDDs and performing operations on them. They are immutable collections partitioned across the nodes of a cluster that can be rebuilt (recomputed) if a partition is lost. They are created by transforming data in stable storage using data flow operators (map, filter, group-by) and can be cached in memory across parallel operations:

* Resilient: If data in memory is lost, it can be recreated (or recomputed)
* Distributed: Data is partitioned across the nodes of a cluster
* Dataset: Initial data can come from a file or be created programmatically

There are two common ways to create an RDD: parallelize an existing in-memory collection, or read from external storage such as a file:

In [9]:
# First, fill in the credentials and connect to the S3 bucket where the data is stored
# Replace with your values
# NOTE: Set the access to this notebook appropriately to protect the security of your keys.
# Or you can delete this cell after you run the mount command below once successfully.

ACCESS_KEY = "none"
SECRET_KEY = "none"
ENCODED_SECRET_KEY = SECRET_KEY.replace("/", "%2F")
AWS_BUCKET_NAME = "none"
MOUNT_NAME = "none"

In [10]:
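# Mount the bucket once; a repeated mount call fails because the mount point is already in use
try:
  dbutils.fs.mount("s3a://%s:%s@%s" % (ACCESS_KEY, ENCODED_SECRET_KEY, AWS_BUCKET_NAME), "/mnt/%s" % MOUNT_NAME)
except Exception as e:
  # Report the failure instead of hiding it, so credential errors stay visible
  print("Mount skipped: %s" % e)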
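
The credentials cell above escapes only the "/" character in the secret key. If a key contains other characters that are reserved in URIs, a more general option is to percent-encode the whole string with the standard library. This is a sketch under that assumption; `encode_secret` and `example_key` are hypothetical names, and the key shown is made up.

```python
from urllib.parse import quote

def encode_secret(secret_key: str) -> str:
    """Percent-encode every reserved character so the key is safe inside a URI."""
    # safe="" tells quote() to escape "/" as well, which it leaves alone by default.
    return quote(secret_key, safe="")

example_key = "abc/def+ghi"  # made-up value, not a real credential
encoded = encode_secret(example_key)
print(encoded)
```

For a key whose only reserved character is "/", this gives the same result as the `replace("/", "%2F")` call in the cell above.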

#### The mounted bucket holds the 2015 Flight Delays and Cancellations dataset (https://www.kaggle.com/usdot/flight-delays)
**Context**: The U.S. Department of Transportation's (DOT) Bureau of Transportation Statistics tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled, and diverted flights is published in DOT's monthly Air Travel Consumer Report and in this dataset of 2015 flight delays and cancellations.

**Acknowledgements**: The flight delay and cancellation data was collected and published by the DOT's Bureau of Transportation Statistics.

Released Under CC0: Public Domain License

To understand the data please visit: https://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236&DB_Short_Name=On-Time

In [12]:
# For now we will just read data from a text file
input_rdd = sc.textFile("/mnt/%s/flights.csv" % MOUNT_NAME)
# show the first 10 lines of the file (the header plus nine records)
input_rdd.take(10)

In [13]:
# count the number of rows
input_rdd.count()

In [14]:
header = input_rdd.take(1)

In [15]:
for i in header:  # take(1) returns a list holding the first line, so iterate over it
  print(i)
  print('\n')
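
Because `take(1)` returns a list rather than a single string, the header cell above loops over a one-element list. To work with individual column names, the line can be split with the standard `csv` module. The sketch below uses a stand-in header whose column names are illustrative, not copied from `flights.csv`.

```python
import csv

# Stand-in for the result of input_rdd.take(1): a list holding one string.
# These column names are illustrative placeholders.
header = ["YEAR,MONTH,DAY,AIRLINE"]

header_line = header[0]                         # the raw first line of the file
column_names = next(csv.reader([header_line]))  # split on commas, honoring quoted fields

for position, name in enumerate(column_names):
    print(position, name)
```

Calling `input_rdd.first()` instead of `take(1)` returns the first line as a string directly, which avoids the indexing step.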

### There are two types of RDD operations - transformations and actions.
**Transformations** define new RDDs based on the current RDD and are evaluated lazily. **Actions** trigger the computation and return values from RDDs to the driver program.