# Why Spark?

The Spark project has claimed that programs can run up to 100x faster than Hadoop MapReduce when the data is held in memory, or up to 10x faster on disk.

Apache Spark has an advanced DAG (Directed Acyclic Graph) execution engine that supports acyclic data flow and in-memory
computing.

Spark runs on Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources including
HDFS, Cassandra, HBase, and S3.



# How to start with PySpark?

PySpark is an interface for Apache Spark in Python. It not only allows you to write Spark applications using Python APIs, but also provides the PySpark shell for interactively analyzing your data in a distributed environment.

In [1]:
import pyspark

In [2]:
from pyspark.sql import SparkSession

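# Create a session, or reuse the one that is already running.
# "spark.config" / "value" is a placeholder key/value pair, not a real Spark option.
spark = SparkSession.builder.appName("LinearRegression") \
                            .config("spark.config", "value") \
                            .getOrCreate()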

In [15]:
# Read the CSV, treating the first row as a header and letting Spark infer column types.
df = spark.read.options(header=True, inferSchema=True, delimiter=",") \
    .csv("iris.csv")

In [16]:
df.show(5)

+------------+-----------+------------+-----------+-------+
|sepal.length|sepal.width|petal.length|petal.width|variety|
+------------+-----------+------------+-----------+-------+
|         5.1|        3.5|         1.4|        0.2| Setosa|
|         4.9|        3.0|         1.4|        0.2| Setosa|
|         4.7|        3.2|         1.3|        0.2| Setosa|
|         4.6|        3.1|         1.5|        0.2| Setosa|
|         5.0|        3.6|         1.4|        0.2| Setosa|
+------------+-----------+------------+-----------+-------+
only showing top 5 rows



# Core Concepts

• Job: A unit of work that reads some input (from HDFS or the local filesystem), performs some computation on the data, and writes or returns some output. Spark launches a job each time an action is called.

• Stages: Jobs are divided into stages. Stages can be thought of as map or reduce stages (this is easier to follow if you have worked with Hadoop and want to draw the parallel). Stages are split at computational boundaries, typically where data has to be shuffled between partitions: not all computations (operators) can be executed in a single stage, so a job runs over several stages.

• Tasks: Each stage consists of tasks, one task per partition. Each task processes one partition of the data on one executor.

• DAG: DAG stands for Directed Acyclic Graph; in the present context it is a DAG of operators.

• Executor: The process responsible for executing a task.

• Master: The machine on which the Driver program runs.

• Slave: The machine on which the Executor program runs (called a worker in current Spark documentation).
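
To make the relationship between jobs, stages, tasks, and partitions concrete, here is a pure-Python toy model of a two-stage job that counts records per key. It is not Spark's API and it runs everything sequentially in one process. All function names and the sample input are made up for illustration. It only mirrors the ideas above: the input is split into partitions, each stage runs one task per partition, and the shuffle between the map and reduce stages is the stage boundary.

```python
from collections import defaultdict


def split_into_partitions(records, num_partitions):
    """Split the input round-robin into num_partitions lists."""
    return [records[i::num_partitions] for i in range(num_partitions)]


def map_task(partition):
    """One task of the map stage: processes exactly one partition."""
    return [(key, 1) for key in partition]


def shuffle(map_outputs, num_partitions):
    """Stage boundary: regroup pairs so every key lands in a single partition."""
    shuffled = [defaultdict(list) for _ in range(num_partitions)]
    for output in map_outputs:
        for key, value in output:
            shuffled[hash(key) % num_partitions][key].append(value)
    return shuffled


def reduce_task(partition):
    """One task of the reduce stage: sums the values for each key it owns."""
    return {key: sum(values) for key, values in partition.items()}


def run_job(records, num_partitions=3):
    """Toy job: map stage, shuffle, reduce stage, then collect the result."""
    partitions = split_into_partitions(records, num_partitions)

    # Stage 1 (map): one task per input partition.
    map_outputs = [map_task(p) for p in partitions]

    # The shuffle cannot happen inside a task, so it ends the first stage.
    shuffled = shuffle(map_outputs, num_partitions)

    # Stage 2 (reduce): one task per shuffled partition.
    reduce_outputs = [reduce_task(p) for p in shuffled]

    # Collect the per-partition results into one dictionary.
    result = {}
    for output in reduce_outputs:
        result.update(output)
    tasks_per_stage = {"map": len(map_outputs), "reduce": len(reduce_outputs)}
    return result, tasks_per_stage


# Made-up input for illustration.
varieties = ["Setosa", "Versicolor", "Setosa", "Virginica", "Versicolor", "Setosa"]
counts, tasks_per_stage = run_job(varieties, num_partitions=3)
print(counts)
print(tasks_per_stage)
```

In real Spark the tasks of a stage run in parallel on executors, and the scheduler derives the stages from the DAG of operators instead of having them hard-coded as they are here.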
