## Chapter 1 

### Introduction to Spark

### Big Data Processing Challenges:
- Big data processing is a set of techniques or programming models for accessing large-scale data and extracting useful information to support decision making.
- Big data is processed using two approaches.
   - Batch Processing:
     - Batch-based data processing involves collecting a series of data, storing it until a given quantity has been collected, then processing all of that data as a group.
     - With older technologies, it was typically more efficient to process information in batches rather than working with each small piece of data individually.
     - Doing so reduces the number of discrete I/O events that need to take place.
     - Batch processing works well in situations where we don’t need real-time analytics results, and when it is more important to process large volumes of data to get more detailed insights than it is to get fast analytics results. 
   - Real-time processing:
     - Real-time processing means "Respond before you lose the customer", and this is achieved by streaming data processing, i.e. online processing.
     - It is needed to provide the expected value within a strictly specified time window.
     - It involves small amounts of data, uses random reads/writes (from/to persistent storage), and responds with low latency.
- The first framework that enabled the processing of large-scale datasets was MapReduce (in 2003).
  - MapReduce is based on batch processing.
  - This revolutionary tool was intended to process and generate huge datasets in an automatic and distributed way.
  - In spite of its great popularity, MapReduce (and Hadoop) is not designed to scale well when dealing with iterative and online processes, which are typical in machine learning and stream analytics.
- The limitations of batch processing gave birth to Spark, which is capable of iterative and online processing.
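
The difference between the two approaches can be sketched in a few lines of plain Python. This is only an illustration, not Spark or MapReduce code: the function names `batch_total` and `streaming_total` and the sample readings are assumptions made up for this example. The batch version waits until a group of records has been collected, while the streaming version produces an updated result for every record.

```python
from typing import Iterable, Iterator, List


def batch_total(readings: Iterable[float], batch_size: int) -> List[float]:
    """Collect `batch_size` readings, then process each group at once."""
    totals: List[float] = []
    buffer: List[float] = []
    for value in readings:
        buffer.append(value)
        if len(buffer) == batch_size:
            totals.append(sum(buffer))  # one processing step per batch
            buffer = []
    if buffer:  # flush the final, partially filled batch
        totals.append(sum(buffer))
    return totals


def streaming_total(readings: Iterable[float]) -> Iterator[float]:
    """Update and emit the running total as soon as each reading arrives."""
    running = 0.0
    for value in readings:
        running += value
        yield running


readings = [1.0, 2.0, 3.0, 4.0, 5.0]  # hypothetical sensor values
batch_result = batch_total(readings, batch_size=2)
stream_result = list(streaming_total(readings))
print(batch_result)   # one result per batch
print(stream_result)  # one result per event
```

The batch version performs fewer processing steps, which is why it suits large volumes; the streaming version answers after every event, which is why it suits low-latency use cases.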


   Challenges                                                      |            Solution                                                                                                      
:----------------------------------------------------:|:---------------------------------------------------:
 Single central storage                             |       Distributed storage                                 
 Serial processing                                  |       Parallel processing                                  
 One input                                          |          Multiple inputs                                      
 One processor                                      |   Multiple processors                                    
 One output                                         |       One output                                                
Lack of ability to process unstructured data |      Ability to process every type of data      


### What Is Spark?

- Apache Spark is a unified analytics engine for large-scale data processing.
- Unified data analytics is the process of using technology to make sense of the data that organizations collect across channels.
- It is an open-source cluster computing engine.
   - Very fast: in-memory operations are up to 100x faster than MapReduce (MR), and on-disk operations up to 10x faster
   - General purpose: MR-style batch jobs, SQL, streaming, machine learning, analytics
   - Compatible: clusters run on Hadoop/YARN, Mesos, or standalone, and Spark works with many data stores: HDFS, S3, Cassandra, HBase, Hive, …
   - Easier to code: word count in 2 lines
    

![Screen-Shot-2020-02-16-at-8.48.29-PM.png](attachment:Screen-Shot-2020-02-16-at-8.48.29-PM.png)

- It provides high-level APIs in **Java, Scala, Python, and R**, and an optimized engine that supports general execution graphs.
- It supports a rich set of higher-level tools including **Spark SQL** for SQL and structured data processing, **pandas API on Spark** for pandas workloads, **MLlib** for machine learning, **GraphX** for graph processing, and **Structured Streaming** for incremental computation and stream processing.
- It is a lightning-fast real-time processing framework.
- Apache Spark was introduced because it can perform stream processing in real time and can also take care of batch processing.
- Apache Spark is a framework aimed at performing fast distributed computing on big data by using in-memory primitives.
- It allows user programs to load data into memory and query it repeatedly, making it a well-suited tool for online and iterative processing (especially for ML algorithms).
- It was motivated by the limitations of the MapReduce/Hadoop paradigm, which forces programs to follow a linear dataflow that makes intensive use of disk.


### Features of Spark:

![apache-spark-features.png](attachment:apache-spark-features.png)

### Spark Ecosystem:
- The Apache Spark ecosystem is an open-source distributed cluster-computing framework. Spark is a data processing engine developed to provide faster and easier analytics than Hadoop MapReduce.
- The figure below illustrates all the Spark components.

![1_bAFUX3X-oXLHXp4iNMx8Ew.png](attachment:1_bAFUX3X-oXLHXp4iNMx8Ew.png)

#### 1. Spark Core:
   - All the functionality provided by Apache Spark is built on top of Spark Core. It delivers speed by providing in-memory computation. Spark Core is thus the foundation for parallel and distributed processing of huge datasets.
   - It overcomes the main drawback of MapReduce by using in-memory computation.
   - Spark Core is built around a special collection called the RDD (resilient distributed dataset).
   - An RDD handles partitioning data across all the nodes in a cluster. It holds the partitions in the memory pool of the cluster as a single unit.
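
The idea of one logical dataset split into partitions can be sketched in plain Python. This is a simplified illustration, not the Spark API: the helper `partition_records` and the modulo rule are assumptions for this example, and Spark's own partitioners work differently in detail.

```python
from typing import Dict, List


def partition_records(records: List[int], num_partitions: int) -> Dict[int, List[int]]:
    """Assign each record to a partition by taking it modulo the partition count."""
    partitions: Dict[int, List[int]] = {i: [] for i in range(num_partitions)}
    for record in records:
        partitions[record % num_partitions].append(record)
    return partitions


records = list(range(10))
partitions = partition_records(records, num_partitions=3)

# Each partition could live on a different node; each node computes a partial
# result, and the partial results are combined into one answer.
partial_sums = {pid: sum(part) for pid, part in partitions.items()}
total = sum(partial_sums.values())
print(partitions)
print(total)
```

Although the records are spread over three partitions, the caller still treats them as a single dataset, which mirrors how an RDD presents distributed data as one unit.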

#### 2. Spark SQL + DataFrames:
   - The Spark SQL component is a distributed framework for structured data processing.
   - Through Spark SQL, Spark gets more information about the structure of both the data and the computation.
   - It also enables powerful, interactive, analytical applications across both streaming and historical data. Spark SQL is Spark's module for structured data processing, so it acts as a distributed SQL query engine.

#### 3. Spark Streaming:
   - Spark Streaming is a lightweight API that allows developers to perform batch processing and streaming of data with ease, in the same application.
   - It represents a continuous stream of input data as a Discretized Stream (DStream), a series of RDDs, to process data in real time.
   - Data in Spark Streaming is ingested from various data sources and live streams such as Twitter, Apache Kafka, Akka Actors, IoT sensors, Amazon Kinesis, and Apache Flume, in event-driven, fault-tolerant, and type-safe applications.
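
Discretizing a stream means cutting it into a sequence of small batches and processing each one with ordinary batch logic. The sketch below shows that idea in plain Python; it is not the Spark Streaming API. The helper `micro_batches` and the event names are hypothetical, and the sketch groups by record count for simplicity, whereas Spark Streaming groups by time interval.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def micro_batches(stream: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Cut a continuous stream into a sequence of small batches."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


events = ["click", "view", "click", "buy", "view"]  # hypothetical event stream
# The same batch-style computation is applied to every micro-batch in turn.
counts_per_batch = [batch.count("click") for batch in micro_batches(events, batch_size=2)]
print(counts_per_batch)
```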

#### 4. Spark MLlib (Machine Learning):
  - MLlib is a low-level machine learning library that can be called from the Scala, Python, and Java programming languages.
  - MLlib is simple to use, scalable, compatible with various programming languages, and can be easily integrated with other tools.
  - MLlib eases the development and deployment of scalable machine learning pipelines.
  - Data scientists can iterate through data problems up to 100 times faster than with Hadoop MapReduce, helping them solve machine learning problems at large scale in an interactive fashion.

#### 5. GraphX:
  - GraphX was introduced in the Spark framework to tackle the challenges of graph construction and transformation caused by complex joins.
  - Spark GraphX introduces the Resilient Distributed Graph (RDG), an abstraction built on Spark RDDs.
  - RDGs associate records with the vertices and edges in a graph, and help data scientists perform several graph operations through expressive computational primitives.
  - The GraphX component of Spark supports multiple use cases such as social network analysis, recommendation, and fraud detection.

#### 6. Language Support:
  - Apache Spark has built-in support for Scala, Java, R, and Python, with third-party support for the .NET CLR, Julia, and more.

### Apache Spark versions:
- Spark was started by Matei Zaharia at UC Berkeley's AMPLab in 2009 and open sourced in 2010 under a BSD license.
- Version 0, 2012 (0.5 to 0.9)
  - RDD and YARN support
- Version 1, 2014 (1.1 to 1.6, the stable release)
  - Spark SQL
- Version 2, 2016 (2.1 to 2.4, the stable release)
  - Unified Dataset and DataFrame API
  - SparkSession
  - Structured Streaming
  - Partition pruning
  - Built-in support for Hive features
  - More optimizations
  - Tool and solution for ETL pipelines, analytics in a data lake, and an engine for distributed machine learning training and serving
- Version 3, 2020 to 2022 (3.0 to 3.3), the latest
  - Python pandas and PySpark support
  - Optimization techniques (auto broadcast join)
  - Adaptive execution of Spark SQL
  - Dynamic Partition Pruning (DPP)
  - Better Kubernetes integration
  - GraphX
  - ACID transactions with Delta Lake
  - Binary files data source, etc.

### Architecture of Spark/Spark Structure:
- Apache Spark uses a master-slave (master-worker) architecture.
- A single master process coordinates the work, and multiple worker processes controlled by that dedicated master carry it out.


![spark_Architecture.png](attachment:spark_Architecture.png)

- Below are the high-level components of the architecture of the Apache Spark application:
  - The Spark driver
  - The Spark executors
  - The cluster manager (an external service for acquiring resources on the cluster, e.g. standalone manager, Mesos, YARN, Kubernetes)
- Spark follows a master-slave architecture with one central coordinator and multiple distributed worker nodes. The central coordinator is called the Spark Driver, and it communicates with all the workers.
- Each worker node hosts one or more executors, which are responsible for running tasks. Executors register themselves with the driver, so the driver has information about the executors at all times.
- This working combination of driver and workers is known as a Spark application.
- The Spark application is launched with the help of the cluster manager, which can be any one of the following:
  - Spark Standalone Mode
  - YARN
  - Mesos
  - Kubernetes
- **Spark Driver: Master Node of a Spark Application**
  - It is the central point and the entry point of the Spark Shell (Scala, Python, and R).
  - The driver program runs the main() function of the application and is the place where the SparkContext and RDDs are created, and also where transformations and actions are performed.
  - The Spark Driver contains various components (DAGScheduler, TaskScheduler, SchedulerBackend, and BlockManager) responsible for translating Spark user code into actual Spark jobs executed on the cluster.
  - The Spark Driver performs two main tasks: converting user programs into tasks, and planning the execution of those tasks by executors.
  - A detailed description of its tasks is as follows:
    - The driver program, which runs on the master node of the Spark cluster, schedules the job execution and negotiates with the cluster manager.
    - It translates the RDDs into an execution graph and splits the graph into multiple stages.
    - The driver stores the metadata about all the Resilient Distributed Datasets and their partitions.
    - Cockpit of job and task execution: the driver program converts a user application into smaller execution units known as tasks. Tasks are then executed by the executors, i.e. the worker processes that run individual tasks.
    - After the tasks have been completed, all the executors submit their results to the driver.
    - The driver exposes information about the running Spark application through a web UI at port 4040.
- **Spark Executor**:
   - An executor is a distributed agent responsible for the execution of tasks.
   - Every Spark application has its own executor processes.
   - Executors usually run for the entire lifetime of a Spark application; this is known as "static allocation of executors". However, users can also opt for dynamic allocation, in which Spark executors are added or removed dynamically to match the overall workload.
   - The executor performs all the data processing and returns the results to the driver.
   - It reads data from and writes data to external sources.
   - It stores computation results in memory, in cache, or on hard disk drives.
   - It interacts with the storage systems.
   - It provides in-memory storage for RDDs that are cached by user programs, via a utility called the Block Manager that resides within each executor. Because RDDs are cached directly inside executors, tasks can run in parallel alongside the cached data.
- **Cluster Manager**:
  - An external service responsible for acquiring resources on the Spark cluster and allocating them to a Spark job.
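
The division of labour between driver and executors can be imitated on one machine with a thread pool. The sketch below is an analogy in plain Python, not how Spark is implemented: `driver`, `run_task`, and the sum-of-squares job are hypothetical names chosen for illustration, and the thread pool merely stands in for executor processes on worker nodes.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def run_task(partition: List[int]) -> int:
    """Work done by one task on an 'executor': sum the squares in its partition."""
    return sum(x * x for x in partition)


def driver(data: List[int], num_tasks: int, num_executors: int) -> int:
    """Split the job into tasks, hand them to workers, and combine the results."""
    # One task per partition (round-robin split of the input).
    partitions = [data[i::num_tasks] for i in range(num_tasks)]
    with ThreadPoolExecutor(max_workers=num_executors) as pool:
        partial_results = list(pool.map(run_task, partitions))
    return sum(partial_results)  # results flow back to the driver


result = driver(list(range(1, 11)), num_tasks=4, num_executors=2)
print(result)
```

Note that there are more tasks than executors here: as in Spark, the number of tasks follows from how the data is partitioned, while the number of executors follows from the resources granted to the application.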
    





![cluster-overview.png](attachment:cluster-overview.png)

### Apache Spark Use Cases:
1.  Streaming Data
    - Apache Spark’s key use case is its ability to process streaming data. 
    - With so much data being processed on a daily basis, it has become essential for companies to be able to stream and analyze it all in real-time. 
    - And Spark Streaming has the capability to handle this extra workload.
    - Spark Streaming unifies disparate data processing capabilities, allowing developers to use a single framework to accommodate all their processing needs.
    - Among the general ways that Spark Streaming is being used by businesses today are:
       - Streaming ETL: Data is continually cleaned and aggregated before it is pushed into data stores.
       - Data Enrichment: This Spark Streaming capability enriches live data by combining it with static data, thus allowing organizations to conduct more complete real-time data analysis.
       - Trigger Event Detection: Spark Streaming allows organizations to detect and respond quickly to rare or unusual behaviors ("trigger events") that could indicate a potentially serious problem within the system.
       - Complex Session Analysis: Using Spark Streaming, events related to live sessions (such as user activity after logging into a website or application) can be grouped together and quickly analyzed. Companies such as Netflix use this functionality to gain immediate insights into how users are engaging on their site and to provide more real-time movie recommendations.
2.  Machine Learning
    - Spark comes with an integrated framework for performing advanced analytics that helps users run repeated queries on sets of data—which essentially amounts to processing machine learning algorithms.
    - The MLlib can work in areas such as clustering, classification, and dimensionality reduction, among many others.
    - All this enables Spark to be used for some very common big data functions, like predictive intelligence, customer segmentation for marketing purposes, and sentiment analysis. 
    - Network security is a good business case for Spark’s machine learning capabilities.
    - Utilizing various components of the Spark stack, security providers can conduct real-time inspections of data packets for traces of malicious activity.
3. Interactive Analysis
    - Apache Spark is fast enough to perform exploratory queries without sampling. Spark also interfaces with a number of development languages including SQL, R, and Python. By combining Spark with visualization tools, complex data sets can be processed and visualized interactively.
4. Fog Computing
   - Fog computing decentralizes data processing and storage, performing those functions at the edge of the network instead of in a central location.
   - Fog computing brings new complexities to processing decentralized data, because it increasingly requires low latency, massively parallel processing of machine learning, and extremely complex graph analytics algorithms.
   - Fortunately, with key stack components such as Spark Streaming, an interactive real-time query tool (Shark), a machine learning library (MLlib), and a graph analysis engine (GraphX), Spark more than qualifies as a fog computing solution.

![apche-spark-use-cases.png](attachment:apche-spark-use-cases.png)

#### Spark Use Case Examples
1. Finance Industry
   - Banks are using Spark, the Hadoop alternative, to access and analyse social media profiles, call recordings, complaint logs, emails, forum discussions, etc. to gain insights that help them make the right business decisions for credit risk assessment, targeted advertising, and customer segmentation.
   - A multinational financial institution has implemented a real-time monitoring application that runs on Apache Spark and the MongoDB NoSQL database.
2. E-commerce Industry
   - Alibaba: It runs some of the largest Apache Spark jobs in the world in order to analyse hundreds of petabytes of data on its e-commerce platform.
   - eBay: It uses Spark to provide targeted offers, enhance customer experience, and optimize overall performance. eBay Spark users leverage Hadoop clusters in the range of 2000 nodes, 20,000 cores, and 100 TB of RAM through YARN.
3. Healthcare: As healthcare providers look for novel ways to enhance the quality of healthcare, Apache Spark is slowly becoming the heartbeat of many healthcare applications.
   - MyFitnessPal: MyFitnessPal uses apache spark to clean the data entered by users with the end goal of identifying high quality food items.
4. Media & Entertainment Industry
   - Yahoo for News Personalization
   - Conviva
   - Facebook
   - Netflix
   - Pinterest
5. Travel Industry
   - TripAdvisor
   - OpenTable: An online real-time reservation service with about 31,000 restaurants and 15 million diners a month. It uses Spark for training its recommendation algorithms and for NLP of restaurant reviews to generate new topic models.
6. Gaming Industry
   - Riot Games: Spark improves the gaming experience of users and helps in processing different game skins, game characters, in-game points, and much more. It helps with performance improvement, offers, and efficiency.
   - Tencent: Tencent uses Spark for its in-memory computing feature, which boosts real-time data processing performance in a big data context while also assuring fault tolerance and scalability.
7. Software & Information Service Industry: Spark use cases in Computer Software and in Information Technology and Services account for about 32% and 14% of the global market, respectively.
   - Databricks
   - Hearst
   - FINRA

### Spark in Big Data Ecosystem
- **Where does Spark stand?**

- **Faster performance smaller code size**
- **Comparison with Hadoop/MapReduce**
- **Better Fit for Iterative Workload**
- **More generic programming model**
- **Spark for Lambda Architecture**

### **Popularity of Spark**

![spark_3.png](attachment:spark_3.png)

### Spark vs Hadoop
#### What is Apache Hadoop? 
- Apache Hadoop is a collection of open-source software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation.
- It provides a software framework for distributed storage and processing of big data using the MapReduce programming model.
- The base Apache Hadoop framework is composed of the following modules:
  1. Hadoop Common – contains libraries and utilities needed by other Hadoop modules
  2. Hadoop Distributed File System (HDFS) – a distributed file-system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster
  3. Hadoop YARN – (introduced in 2012) a platform responsible for managing computing resources in clusters and using them for scheduling users' applications
  4. Hadoop MapReduce – an implementation of the MapReduce programming model for large-scale data processing
  5. Hadoop Ozone – (introduced in 2020) An object store for Hadoop



![The-Core-Components-of-Hadoop.png](attachment:The-Core-Components-of-Hadoop.png)

#### Advantages of Hadoop for Big Data:
 - Speed: Hadoop’s concurrent processing, MapReduce model, and HDFS let users run complex queries in just a few seconds.
 - Diversity: Hadoop’s HDFS can store different data formats, like structured, semi-structured, and unstructured.
 - Cost-Effective: Hadoop is an open-source data framework.
 - Resilient: Data stored in a node is replicated in other cluster nodes, ensuring fault tolerance.
 - Scalable: Since Hadoop functions in a distributed environment, you can easily add more servers.

#### Limitations of Hadoop:
- Problem with small files
  - Hadoop works efficiently with a small number of large files. It stores files as blocks of 128 MB (the default) to 256 MB. It performs poorly when it has to access a large number of small files, because so many small files overload the NameNode and make it difficult to work.
- Vulnerability
  - Hadoop is written in Java, one of the most widely used programming languages, which makes it a common target that cyber-criminals can exploit.
- Low performance in small data surroundings
  - Hadoop is mainly designed for dealing with large datasets, so it can be used efficiently by organizations that generate a massive volume of data. Its efficiency decreases in small data surroundings.
- Lack of security
  - Hadoop uses Kerberos for security, which is not easy to manage. Kerberos does not provide storage or network encryption, which is a further concern.
- High up processing
  - Read/write operations in Hadoop are expensive because the data involved is in the TB or PB range. Hadoop reads and writes data from disk, which makes in-memory calculation difficult and leads to processing overhead ("high up processing").
- Supports only batch processing
  - Batch processes run in the background without any interaction with the user. The engines used for these processes inside the Hadoop core are not very efficient, so producing output with low latency is not possible.
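
The small files problem comes down to simple arithmetic: the NameNode keeps metadata for every file and every block, and even a tiny file occupies its own block entry. The plain-Python sketch below compares the two cases under the 128 MB default block size quoted above; the file sizes and the helper `blocks_needed` are assumptions chosen for illustration.

```python
import math

BLOCK_SIZE_MB = 128  # default HDFS block size quoted above


def blocks_needed(file_size_mb: float) -> int:
    """Every file occupies at least one block entry, however small it is."""
    return max(1, math.ceil(file_size_mb / BLOCK_SIZE_MB))


# The same 1 GB of data, stored two different ways (hypothetical sizes).
one_large_file = blocks_needed(1024)                           # a single 1 GB file
many_small_files = sum(blocks_needed(1) for _ in range(1024))  # 1024 files of 1 MB
print(one_large_file, many_small_files)
```

The amount of data is identical, but the second layout asks the NameNode to track over a hundred times more block entries, plus one file entry per small file.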

![Hadoop-Cons.png](attachment:Hadoop-Cons.png)

### Spark Vs Hadoop:
#### Feature Comparison:
    


 

Features | Apache Spark | Apache Hadoop
:-------------------:|:-----------------------------:|:------------------------:
Batch Processing | Yes | Yes
Streaming | Yes | No
Easy to Use | Yes | No
Caching | Yes | No
                                     

#### Head To Head Comparison:


Basis                 | Spark                              |                 Hadoop
:--------------------:|:----------------------------------:|:----------------------------------------------:
Usage|Spark is designed to handle real-time data efficiently.|Hadoop is designed to handle batch processing efficiently.
Processing Speed|Spark reduces the number of read/write cycles to disk and stores intermediate data in memory, hence its faster processing speed.|Hadoop’s MapReduce model reads from and writes to disk, which slows down processing.
Latency|A low-latency computing framework that can process data interactively.|A high-latency computing framework that does not have an interactive mode.
Data|Processes real-time data from real-time event sources such as Twitter and Facebook.|With Hadoop MapReduce, a developer can process data in batch mode only.
Cost|Spark requires a lot of RAM to run in memory, which increases cluster size and hence cost.|Hadoop is the cheaper option in terms of cost.
Algorithm Used|Machine learning algorithms, GraphX|MapReduce, PageRank algorithm
Fault Tolerance|Spark provides fault tolerance through the RDD, which is the building block of Apache Spark.|Each file is split into blocks and replicated, ensuring it can be rebuilt even when a machine is down.
Security|Not secure by itself; it relies on integration with Hadoop to achieve the necessary security level.|Provides more authentication.
Machine Learning|Much faster, as it uses MLlib for computations and has in-memory processing.|Data fragments in Hadoop can be too large and can create bottlenecks, so it is slower than Spark.
Performance|Fast performance with reduced disk reading and writing operations.|Slower performance, as it uses disk for storage and depends on disk read and write operations.
Scalability|It is quite difficult to scale as it relies on RAM for computations. It supports thousands of nodes in a cluster.|Hadoop is easily scalable by adding nodes and disks for storage. It supports tens of thousands of nodes.
Language support|Java, R, Scala, Python, or Spark SQL for the APIs|Java or Python for MapReduce apps
User-friendliness|More user-friendly|Less user-friendly
Resource Management|It has a built-in standalone cluster manager and can also run on Hadoop YARN or Apache Mesos.|YARN is the most common option for resource management.

#### Spark's unified stack supports almost all needs:
- The following comparison clarifies how Spark fulfills almost all needs.


Use Cases | Hadoop | Spark
:---------:|:-------:|:-------:
Storage|HDFS|HDFS / S3 / Cassandra and more, Tachyon (memory-centric distributed storage)
Cluster Manager|YARN|StandAlone cluster manager/YARN/Mesos
Batch Processing|MapReduce|Spark MR
SQL Querying|HIVE|Spark SQL(can also query Hive/HQL)
Stream Processing / Real Time processing |Storm|Spark Streaming
Machine Learning|Mahout| Spark MLlib

#### MapReduce Concept:
- MapReduce is a programming model or pattern within the Hadoop framework that is used to access big data stored in the Hadoop Distributed File System (HDFS). It is a core component, integral to the functioning of the Hadoop framework.
- It facilitates concurrent processing by splitting petabytes of data into smaller chunks, and processing them in parallel on Hadoop commodity servers. In the end, it aggregates all the data from multiple servers to return a consolidated output back to the application.
- The MapReduce paradigm consists of two sequential tasks: Map and Reduce (hence the name). Map filters and sorts data while converting it into key-value pairs. Reduce then takes this input and reduces its size by performing some kind of summary operation over the dataset.
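
The Map and Reduce steps described above can be traced with a classic word count, written here in plain single-machine Python rather than with the Hadoop API. The function names `map_phase`, `shuffle`, and `reduce_phase` and the sample lines are assumptions for this sketch; in a real cluster, the map and reduce calls would run in parallel on different servers and the framework would perform the shuffle.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(line: str) -> List[Tuple[str, int]]:
    """Map: turn one input line into (word, 1) key-value pairs."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all values that share the same key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: summarize the values of one key, here by summing them."""
    return key, sum(values)


lines = ["spark is fast", "hadoop is reliable", "spark is popular"]
mapped = [pair for line in lines for pair in map_phase(line)]
word_counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(word_counts)
```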

![seo-what-is-mapreduce_gj9ehi.webp](attachment:seo-what-is-mapreduce_gj9ehi.webp)

#### Spark Vs MapReduce:

MapReduce | Spark
:---------:|:---------:
It is an open-source framework used for writing data into the Hadoop Distributed File System.|It is an open-source framework used for faster data processing.
It is very slow compared to Apache Spark.|It is much faster than MapReduce.
It is unable to handle real-time processing.|It can deal with real-time processing.
It is difficult to program, as you need to write code for every process.|It is easy to program.
It supports more security projects.|Its security is not as good as MapReduce's, and its security issues are still being worked on.
It is unable to cache data in memory while performing a task.|It can cache data in memory while processing its tasks.
Its scalability is good, as you can add up to n different nodes.|It has lower scalability compared to MapReduce.
It needs external query tools to perform the task.|It has Spark SQL as its own query language.



#### Apache Spark Officially Sets a New Record in Large-Scale Sorting:
- Spark won the Daytona GraySort contest.
- Using Spark on 206 EC2 machines, Databricks sorted 100 TB of data on disk in 23 minutes.
- In comparison, the previous world record set by Hadoop MapReduce used 2100 machines and took 72 minutes.
- This means that Apache Spark sorted the same data 3X faster using 10X fewer machines.
- All the sorting took place on disk (HDFS), without using Spark’s in-memory cache.
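
The "3X faster using 10X fewer machines" summary follows directly from the figures quoted above, as this short calculation shows. The variable names are chosen for this sketch; the numbers are the ones stated in the list.

```python
# Figures quoted above for the two sort records.
spark_machines, spark_minutes = 206, 23
hadoop_machines, hadoop_minutes = 2100, 72

speedup = hadoop_minutes / spark_minutes          # how many times faster
machine_ratio = hadoop_machines / spark_machines  # how many times fewer machines

print(f"{speedup:.1f}x faster, {machine_ratio:.1f}x fewer machines")
```

Both ratios round to the headline numbers: roughly 3.1 for time and roughly 10.2 for machine count.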

![1520235260253.jpeg](attachment:1520235260253.jpeg)

#### Spark is the best fit for iterative workloads:
- Spark allows users and applications to explicitly cache a dataset by calling the cache() operation. This means that your applications can access data from RAM instead of disk, which can dramatically improve the performance of iterative algorithms that access the same dataset repeatedly.
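
Why caching helps iterative algorithms can be shown with a toy model in plain Python. This is not the Spark API: the `Dataset` class, its `loads` counter, and the `iterate` function are hypothetical, and the counter simply stands in for expensive disk reads. One further simplification is that this toy `cache()` reads the data immediately, whereas Spark's `cache()` is lazy and only materializes the data when the first action runs.

```python
from typing import Callable, List, Optional


class Dataset:
    """Toy dataset that counts how often its source is read."""

    def __init__(self, loader: Callable[[], List[float]]) -> None:
        self._loader = loader
        self._cached: Optional[List[float]] = None
        self.loads = 0

    def cache(self) -> "Dataset":
        """Read the source once and keep the result in memory."""
        self._cached = self.read()
        return self

    def read(self) -> List[float]:
        if self._cached is not None:
            return self._cached  # served from memory
        self.loads += 1          # stands in for an expensive disk read
        return self._loader()


def iterate(dataset: Dataset, iterations: int) -> float:
    """Each iteration scans the full dataset, as iterative algorithms do."""
    estimate = 0.0
    for _ in range(iterations):
        values = dataset.read()
        estimate = (estimate + sum(values) / len(values)) / 2
    return estimate


uncached = Dataset(lambda: [1.0, 2.0, 3.0])
iterate(uncached, iterations=5)

cached = Dataset(lambda: [1.0, 2.0, 3.0]).cache()
iterate(cached, iterations=5)

print(uncached.loads, cached.loads)
```

Without caching, the source is read once per iteration; with caching, it is read once in total, no matter how many iterations follow.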

##### Programming model for Spark and Hadoop:

![Map-Side-Shuffle-phase-differences-Hadoop-vs-Spark-Fig-5-Reduce-side-Shuffle-phase.png](attachment:Map-Side-Shuffle-phase-differences-Hadoop-vs-Spark-Fig-5-Reduce-side-Shuffle-phase.png)

##### Iterative workload handling in Hadoop and Spark 

![Comparison-between-Hadoop-and-Spark-in-dealing-with-memory-and-disks-Hadoop-is-slower.png](attachment:Comparison-between-Hadoop-and-Spark-in-dealing-with-memory-and-disks-Hadoop-is-slower.png)

### Spark For Lambda Architecture:
#### What is Lambda Architecture?
 - Nathan Marz came up with the term Lambda Architecture for a generic, scalable, and fault-tolerant data processing architecture.
 - Lambda architecture is a way of processing massive quantities of data (i.e. "Big Data") that provides access to batch-processing and stream-processing methods with a hybrid approach.
 - The Lambda Architecture (LA) enables developers to build large-scale, distributed data processing systems in a flexible and extensible manner, with fault tolerance against both hardware failures and human mistakes.
- The lambda architecture itself is composed of 3 layers:
  1. Batch Layer:
     - Manages master data set
     - Immutable, append-only, raw data
     - Also pre-computes batch views
  2. Speed Layer:
     - Accommodates requests needing low latency
     - Recent data only using fast and incremental algorithms
  3. Serving Layer:
     - Indexes batch views for low-latency queries
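
How the three layers cooperate can be sketched with a toy page-view counter in plain Python. This is a minimal illustration under assumed names (`LambdaPipeline`, `ingest`, `run_batch`, `query`), not a production design or a Spark program: the batch layer keeps the append-only raw data, the speed layer keeps an incremental count of recent events, and queries merge the two views.

```python
from collections import Counter
from typing import List


class LambdaPipeline:
    """Toy three-layer pipeline that counts views per page."""

    def __init__(self) -> None:
        self.master_dataset: List[str] = []    # batch layer: append-only raw data
        self.batch_view: Counter = Counter()   # serving layer: precomputed batch view
        self.speed_view: Counter = Counter()   # speed layer: recent data only

    def ingest(self, page: str) -> None:
        """Store the raw event and update the low-latency view incrementally."""
        self.master_dataset.append(page)
        self.speed_view[page] += 1

    def run_batch(self) -> None:
        """Recompute the batch view from all raw data, then reset the speed layer."""
        self.batch_view = Counter(self.master_dataset)
        self.speed_view.clear()

    def query(self, page: str) -> int:
        """Answer a query by merging the batch view with the speed view."""
        return self.batch_view[page] + self.speed_view[page]


pipeline = LambdaPipeline()
for page in ["home", "cart", "home"]:
    pipeline.ingest(page)
pipeline.run_batch()     # batch view now covers the first three events
pipeline.ingest("home")  # arrives after the batch run, so only the speed layer sees it
home_views = pipeline.query("home")
print(home_views)
```

Because the batch view is always recomputed from the immutable raw data, a bug in the incremental logic is corrected by the next batch run, which is the tolerance against human mistakes mentioned above.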

![lambda.png](attachment:lambda.png)

- Apache Spark can be considered an integrated solution for processing on all Lambda Architecture layers.
- It contains Spark Core, which includes a high-level API and an optimized engine that supports general execution graphs; Spark SQL for SQL and structured data processing; and Spark Streaming, which enables scalable, high-throughput, fault-tolerant stream processing of live data streams.
- Batch processing with Spark can be quite expensive and might not fit all scenarios and data volumes, but otherwise Spark is a decent match for a Lambda Architecture implementation.

![pipeline-2.png](attachment:pipeline-2.png)

### References:

- https://getthematic.com/insights/unified-data-analytics/#:~:text=Unified%20data%20analytics%20is%20the,that%20organizations%20collect%20across%20channels.
- https://blog.k2datascience.com/batch-processing-apache-spark-a67016008167
- https://fulmanski.pl/tutorials/computer-science/big-data/processing-concepts-for-big-data/
- https://en.wikipedia.org/wiki/Apache_Spark
- https://blog.knoldus.com/understanding-the-working-of-spark-driver-and-executor/
- https://www.projectpro.io/article/top-5-apache-spark-use-cases/271
- https://data-flair.training/blogs/wp-content/uploads/sites/2/2017/05/features-of-spark.jpg
- https://www.geeksforgeeks.org/hadoop-pros-and-cons/?ref=gcse
- https://www.researchgate.net/publication/342689672_Performance_Analysis_of_Distributed_Computing_Frameworks_for_Big_Data_Analytics_Hadoop_Vs_Spark
- https://www.databricks.com/glossary/lambda-architecture
- https://dzone.com/articles/lambda-architecture-with-apache-spark#:~:text=It%20contains%20Spark%20Core%20that,processing%20of%20live%20data%20streams.