## PySpark Theory

#### Definition

Apache Spark is a data processing framework that can quickly perform processing tasks on very large data sets, and can also distribute data processing tasks across multiple computers, either on its own or in tandem with other distributed computing tools. These two qualities are key to the worlds of big data and machine learning, which require the marshalling of massive computing power to crunch through large data stores.

#### Importance

The first question I asked myself before dedicating time to learning Apache Spark was: why is this important to data science after learning Python, R, SQL, statistics, Tableau, and the like?

The exponential increase in data collection in the last decade meant that a new programming framework was needed to handle data processing efficiently on a large scale. In addition, the speed limits of individual processor designs meant that further gains had to come from running work on many CPU cores in parallel. Cluster computing and parallel processing were the answers, and this is why the Apache Spark framework exists today. Before jumping into the world of Apache Spark and parallel processing, I believe it is beneficial to understand some vital terms and to link them to similar concepts from other programming languages for quicker understanding.

#### Terms

* Apache Spark is the framework that allows users to efficiently process large datasets distributed across multiple connected computers using RDDs. The framework was originally written in Scala but can also be used from other languages such as Python and R.

* Resilient Distributed Datasets (RDDs) are the core data structure of PySpark, used to perform parallel processing across a cluster or a computer's processors. RDDs are low-level objects that are highly efficient and immutable (they cannot be altered after creation). They are also fault-tolerant: Spark records the lineage of transformations that produced each RDD, so partitions lost in a failure can be recomputed.

* PySpark is the Python API for Apache Spark. It lets Python programs create and work with RDDs and other Spark objects.

* PySpark SQL is a module in Spark that integrates relational processing with Spark's functional programming API by establishing a connection between RDDs and relational tables. It allows data to be extracted using SQL queries.

* PySpark DataFrames are distributed collections of data, spread across multiple machines, that organize data into named columns. They can be created from external databases, structured data files, or existing resilient distributed datasets (RDDs).

* Machine Learning Library (MLlib), as the name suggests, is the library in Spark that makes machine learning tasks easy and scalable. It provides common ML algorithms such as classification, regression, and clustering. MLlib also provides a platform for constructing, evaluating, and tuning ML Pipelines.

* SparkContext is the primary entry point for Spark functionality. It represents the connection to a Spark cluster and is used to create RDDs and broadcast variables on that cluster.

* SparkSession was introduced in Apache Spark 2.0 to do for DataFrames what the SparkContext does for RDDs. A SparkSession acts as the entry point for working with PySpark DataFrames and SQL.


#### Spark vs Hadoop

Hadoop MapReduce is a programming model for processing big data sets with a parallel, distributed algorithm. Developers can write massively parallelized operators without having to worry about work distribution or fault tolerance. However, a drawback of MapReduce is the sequential, multi-step process it takes to run a job. With each step, MapReduce reads data from the cluster, performs operations, and writes the results back to HDFS. Because each step requires a disk read and write, MapReduce jobs are slower due to disk I/O latency.

Spark was created to address the limitations of MapReduce by processing in memory, reducing the number of steps in a job, and reusing data across multiple parallel operations. With Spark, only one step is needed: data is read into memory, operations are performed, and the results are written back, resulting in much faster execution. Spark also reuses data through an in-memory cache, which greatly speeds up machine learning algorithms that repeatedly call a function on the same dataset.
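
The difference in I/O pattern can be illustrated with a toy model in plain Python. This is not Hadoop or Spark code: the `CountingDisk` class and the `run_step_by_step`, `run_in_memory`, and `measure` helpers are invented for this sketch, and a temporary JSON file stands in for HDFS.

```python
import json
import os
import tempfile


class CountingDisk:
    """A temporary file that counts reads and writes (a stand-in for HDFS)."""

    def __init__(self, directory):
        self.path = os.path.join(directory, "stage.json")
        self.reads = 0
        self.writes = 0

    def write(self, records):
        with open(self.path, "w") as f:
            json.dump(records, f)
        self.writes += 1

    def read(self):
        with open(self.path) as f:
            records = json.load(f)
        self.reads += 1
        return records


def run_step_by_step(disk, steps):
    # MapReduce-style: every step reads its input from disk and writes
    # its output back before the next step can start.
    for step in steps:
        records = disk.read()
        disk.write([step(x) for x in records])


def run_in_memory(disk, steps):
    # Spark-style: read once, chain all steps in memory, write once.
    records = disk.read()
    for step in steps:
        records = [step(x) for x in records]
    disk.write(records)


steps = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]


def measure(runner):
    with tempfile.TemporaryDirectory() as tmp:
        disk = CountingDisk(tmp)
        disk.write([1, 2, 3])          # load the input data set
        disk.reads = disk.writes = 0   # count only the job's own I/O
        runner(disk, steps)
        reads, writes = disk.reads, disk.writes
        return disk.read(), reads, writes


mapreduce_style = measure(run_step_by_step)
spark_style = measure(run_in_memory)
print(mapreduce_style)  # ([1, 3, 5], 3, 3)
print(spark_style)      # ([1, 3, 5], 1, 1)
```

Both runs produce the same result, but the step-by-step version touches the disk once per step in each direction, while the in-memory version reads and writes only once. Real clusters add network transfer and serialization costs on top of this, which the sketch ignores.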

#### Comparing Spark and Hadoop

Let’s take a closer look at the key differences between Hadoop and Spark in six critical contexts:

**Performance**: Spark is faster because it uses random access memory (RAM) instead of reading and writing intermediate data to disks. Hadoop stores data on multiple sources and processes it in batches via MapReduce.

**Cost**: Hadoop runs at a lower cost since it relies on any disk storage type for data processing. Spark runs at a higher cost because it relies on in-memory computations for real-time data processing, which requires it to use high quantities of RAM to spin up nodes.

**Processing**: Though both platforms process data in a distributed environment, Hadoop is ideal for batch processing and linear data processing. Spark is ideal for real-time processing and processing live unstructured data streams.

**Scalability**: When data volume grows rapidly, Hadoop quickly scales to accommodate the demand via the Hadoop Distributed File System (HDFS). In turn, Spark relies on the fault-tolerant HDFS for large volumes of data.

**Security**: Spark enhances security with authentication via shared secret or event logging, whereas Hadoop uses multiple authentication and access control methods. Although Hadoop is more secure overall, Spark can integrate with Hadoop to reach a higher security level.

**Machine learning (ML)**: Spark is the superior platform in this category because it includes MLlib, which performs iterative in-memory ML computations. It also includes tools that perform regression, classification, persistence, pipeline construction, evaluation, etc.

![pythonSpark.jpg](attachment:9077e607-a315-400b-b316-965cb859630e.jpg)