In this notebook, we will cover the following topics:

# Table of Contents
1. [Introduction to Big Data](#Introduction-to-Big-Data)
    1. [What is Big Data?](#What-is-Big-Data?)
    2. [Big Data Use Cases](#Big-Data-Use-Cases)
    3. [The V's of Big Data](#The-V's-of-Big-Data)
    4. [Big Data Challenges](#Big-Data-Challenges)

2. [Introduction to Spark](#Introduction-to-Spark)
    1. [What is Spark?](#What-is-Spark?)
    2. [Spark VS Hadoop MapReduce](#Spark-VS-Hadoop-MapReduce)
    3. [How does Spark work?](#How-does-Spark-work?)
    4. [Spark Session](#Spark-Session)

# Introduction to Big Data

The concept of **data has been around for centuries**, but the way it is collected and stored has changed dramatically over the years. Early data collection was limited to pen and paper, but with the advent of computers it became more efficient and scalable.

In the 1970s, relational databases were introduced, allowing organizations to store and manage large amounts of data. However, as the volume of data continued to grow, traditional databases became less efficient and more expensive to maintain.

The term **Big Data** gained popularity in the early 2000s, and it has since become a buzzword in the tech industry. But what exactly is Big Data?

## What is Big Data?

Big Data refers to the massive volume of structured and unstructured data that is generated by organizations on a daily basis. This data comes from a variety of sources, including social media, sensors, mobile devices, and other digital platforms.

Big Data involves two key steps:

1. **Data Collection**: This involves collecting huge amounts of data from various sources, such as social media, sensors, and other digital platforms, sometimes in real time.

2. **Data Analysis**: This involves analyzing the data to extract valuable insights and make informed decisions. The analysis can be descriptive, diagnostic, predictive, or prescriptive, which aim to answer the questions "What happened?", "Why did it happen?", "What will happen?", and "What should we do?" respectively.

## Big Data Use Cases

Big Data has a wide range of use cases across various industries, including:

1. **Healthcare**: Big Data is used to analyze patient data, predict disease outbreaks, and improve patient care.
2. **Retail**: Big Data is used to analyze customer data, predict buying patterns, and optimize inventory management.
3. **Finance**: Big Data is used to detect fraud, predict market trends, and optimize trading strategies.
4. **Manufacturing**: Big Data is used to monitor equipment performance, predict maintenance needs, and optimize production processes.
5. **Transportation**: Big Data is used to optimize traffic flow, predict maintenance needs, and improve safety.

and many more!

## The V's of Big Data

Big Data is often characterized by V's:

1. **Volume**: This refers to the sheer amount of data that is generated on a daily basis. Big Data is typically measured in terabytes, petabytes, or even exabytes.

2. **Velocity**: This refers to the speed at which data is generated and collected. Big Data is often generated in real time, which requires organizations to process and analyze the data quickly.

3. **Variety**: This refers to the different types of data that are generated, including structured data (e.g., databases) and unstructured data (e.g., social media posts).

4. **Veracity**: This refers to the quality and accuracy of the data. Big Data is often messy and incomplete, which can make it difficult to analyze.

5. **Value**: This refers to the insights and value that can be derived from the data. Big Data is only valuable if organizations can extract meaningful insights and make informed decisions.

Some authors add further V's to this list.

![The V's of Big Data](https://miro.medium.com/v2/resize:fit:1358/0*qmjW4B1riDDWDQcW)

## Big Data Challenges

Despite the many benefits of Big Data, organizations face several challenges when working with it. These challenges arise from the V's of Big Data:

1. **Volume**: Storing and managing large amounts of data can be expensive and time-consuming. Organizations need to invest in scalable storage solutions to handle the volume of data.

2. **Velocity**: Processing and analyzing data in real time can be challenging, especially when dealing with large volumes of data. Organizations need to invest in high-performance computing solutions to handle the velocity of data.

3. **Variety**: Analyzing different types of data can be complex, especially when dealing with structured and unstructured data. Organizations need to invest in data integration solutions to handle the variety of data.

4. **Veracity**: Ensuring the quality and accuracy of data can be difficult, especially when dealing with messy and incomplete data. Organizations need to invest in data quality solutions to handle the veracity of data.

5. **Value**: Extracting meaningful insights and value from data can be challenging, especially when dealing with large volumes of data. Organizations need to invest in data analytics solutions to extract value from data.

Any further V's bring analogous challenges.

# Introduction to Spark

[Apache Spark](https://spark.apache.org/docs/latest/index.html) is a **distributed computing framework** designed for Big Data processing. In other words, Spark is a powerful tool for **processing large volumes of data quickly and efficiently**, much like a high-performance engine built for very large data tasks.

It provides **development APIs** in Java, Scala, Python, and R, and supports a wide range of data processing tasks, including batch processing, real-time processing, machine learning, and graph processing.

The goal of Spark is to create a framework that is fast, easy to use, and scalable. It achieves this by leveraging in-memory processing, fault tolerance, and parallel processing.

The main Spark modules are: Spark Core, Spark SQL, Spark Streaming, MLlib, and GraphX.

![Spark Stack](https://www.researchgate.net/publication/337169833/figure/fig11/AS:963443322535989@1606714196995/Apache-Spark-Stack-and-libraries.png)

Spark addresses the main challenges of Big Data:

- **Volume:** Spark can distribute data across multiple nodes in a cluster and process it in parallel, allowing it to scale horizontally to handle datasets of virtually any size.

- **Velocity:** Spark supports real-time streaming data processing through its streaming APIs. It allows users to process and analyze streaming data in real time, enabling timely insights and actions.

- **Variety:** Spark is versatile and supports processing various types of data, including structured data stored in databases, semi-structured data like JSON or XML, and unstructured data such as text or images. It provides APIs for handling different data formats and structures efficiently.

- **Veracity:** While Spark does not directly address data quality issues, it provides tools and libraries for data cleansing, transformation, and quality assessment. Users can leverage Spark's capabilities to preprocess and clean their data before analysis, improving its quality and reliability.

- **Value:** Spark enables users to extract valuable insights from their data through advanced analytics, machine learning, and graph processing capabilities. By leveraging Spark's rich set of libraries and APIs, users can derive actionable insights and drive business value from their data.

## Spark VS Hadoop MapReduce

### Hadoop

Apache Hadoop is another popular framework for Big Data processing, and it is often compared to Spark.

Hadoop is a distributed computing framework that has:
- Hadoop Distributed File System (HDFS) as storage
- YARN as a way of managing computing resources used by different applications
- An implementation of the MapReduce programming model as an execution engine.

It emerged as a solution for storing and processing big data and popularized **distributed file storage and processing**, mainly in the form of batch processing, allowing organizations to efficiently manage and analyze large datasets across clusters of computers.

### MapReduce

MapReduce is a core component of the Hadoop ecosystem.

It enables parallel processing of large datasets by **breaking them into smaller chunks and distributing them across nodes in a cluster**.

It is like having a team of workers (mappers and reducers) who each take a piece of data, do something with it (like counting words or finding averages), and then put the results together. 
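To make the mapper and reducer roles concrete, here is a minimal single-machine sketch of a word count written in the MapReduce style. It uses plain Python rather than Hadoop, and the function names (`map_phase`, `shuffle_phase`, `reduce_phase`) and the sample `chunks` are illustrative assumptions, not part of any Hadoop API.

```python
from collections import defaultdict


def map_phase(chunk):
    """Mapper: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle_phase(mapped_chunks):
    """Shuffle: group all emitted values by key (the word)."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for word, count in pairs:
            grouped[word].append(count)
    return grouped


def reduce_phase(grouped):
    """Reducer: combine the values of each key into a single total."""
    return {word: sum(counts) for word, counts in grouped.items()}


# Hypothetical input, already split into chunks as if stored on different nodes.
chunks = ["big data is big", "spark is fast", "data is everywhere"]

# On a real cluster each chunk would be mapped in parallel on a different node.
mapped = [map_phase(chunk) for chunk in chunks]
word_counts = reduce_phase(shuffle_phase(mapped))
print(word_counts)
```

In a real cluster the shuffle step moves data between machines, which is one reason it tends to be the expensive part of a job.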

However, a drawback of MapReduce is the sequential multi-step process it takes to run a job. With each step, MapReduce reads data from the cluster, performs operations, and writes the results back to HDFS. Because each step requires a disk read and a disk write, MapReduce jobs are slowed down by the latency of disk I/O.

### Spark

**Apache Spark came onto the scene as a successor to MapReduce**, aiming to address its shortcomings.

Spark introduced in-memory computing, allowing data to be **cached in memory** rather than written to disk between processing steps. This dramatically improved processing speeds for iterative algorithms and interactive data analysis compared to MapReduce, which relied heavily on disk-based storage and retrieval. ***Note:*** If you're curious, [here](https://www.youtube.com/watch?v=myVWLaUdJac&t=290s) is a video explaining this concept (see from minute 4:00).

Spark also reuses data through DataFrames, an abstraction built on top of the Resilient Distributed Dataset (RDD), which is a distributed collection of objects that can be cached in memory and reused across multiple Spark operations. This greatly lowers latency, making Spark several times faster than MapReduce, especially for machine learning and interactive analytics.
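The difference between writing to disk after every step and keeping intermediate results in memory can be illustrated with a toy comparison. This is only an analogy in plain Python, not Spark or Hadoop code: the functions `run_with_disk` and `run_in_memory`, the JSON files, and the two processing steps are all assumptions made for the example.

```python
import json
import os
import tempfile


def double(values):
    return [v * 2 for v in values]


def add_one(values):
    return [v + 1 for v in values]


def write_input(data, workdir):
    """Place the input dataset on disk, where both approaches start from."""
    path = os.path.join(workdir, "input.json")
    with open(path, "w") as f:
        json.dump(data, f)
    return path


def run_with_disk(data, steps, workdir):
    """MapReduce-style: each step reads its input from disk and writes its output back."""
    path = write_input(data, workdir)
    values, disk_ops = data, 0
    for i, step in enumerate(steps):
        with open(path) as f:
            values = json.load(f)      # read the previous step's output
        disk_ops += 1
        values = step(values)
        path = os.path.join(workdir, f"step_{i}.json")
        with open(path, "w") as f:
            json.dump(values, f)       # write this step's output
        disk_ops += 1
    return values, disk_ops


def run_in_memory(data, steps, workdir):
    """Spark-style: read the input once, then keep intermediate results in memory."""
    path = write_input(data, workdir)
    with open(path) as f:
        values = json.load(f)
    disk_ops = 1
    for step in steps:
        values = step(values)
    return values, disk_ops


steps = [double, add_one]
with tempfile.TemporaryDirectory() as workdir:
    disk_result, disk_ops = run_with_disk([1, 2, 3, 4], steps, workdir)
    memory_result, memory_ops = run_in_memory([1, 2, 3, 4], steps, workdir)

print(disk_result, disk_ops)      # same result, one read and one write per step
print(memory_result, memory_ops)  # same result, a single read
```

Both versions compute the same values; the disk-based one simply pays for extra I/O at every step, and that cost grows with the number of steps in the pipeline.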

In summary, both frameworks are designed for distributed processing of large datasets but they have some key differences:

1. **Speed**: Spark is faster than MapReduce due to its in-memory processing capabilities. Spark can cache data in memory and reuse it across multiple operations, reducing the need for disk I/O and improving processing speeds.

2. **Ease of Use**: Spark provides a more user-friendly API and supports multiple programming languages, making it easier for developers to write and run complex data processing tasks.

3. **Ecosystem**: Spark has a rich ecosystem of libraries and tools, including Spark SQL, Spark Streaming, MLlib, and GraphX. This makes it easier for users to build end-to-end data processing pipelines using Spark.

4. **Versatility**: Spark supports a wide range of data processing tasks, including batch processing, real-time processing, machine learning, and graph processing. It provides a rich set of libraries and APIs for different use cases.

5. **Scalability**: Both Spark and MapReduce are designed to scale horizontally across clusters of computers. However, Spark's in-memory processing makes it better suited to iterative algorithms and interactive analytics at scale.


![Spark VS Hadoop MapReduce](https://data-flair.training/blogs/wp-content/uploads/sites/2/2016/09/Hadoop-MapReduce-vs-Apache-Spark.jpg)

The image was taken from [this article](https://data-flair.training/blogs/spark-vs-hadoop-mapreduce/).

## How does Spark work?

Spark applications run as independent processes on a cluster, coordinated by the **SparkContext** object in your main program (called the driver program).

A cluster consists of multiple nodes (individual machines) that work on your tasks in parallel. Each Spark application has one driver, which delegates tasks and oversees the execution of your workload.

To run on a cluster, the SparkContext can connect to various cluster managers (like Spark’s standalone manager, Mesos, YARN, or Kubernetes), which distribute resources across applications.

Once connected, Spark acquires executors on nodes in the cluster. These executors are responsible for running computations and storing data for your application. The SparkContext then dispatches tasks to these executors for execution.

![Spark Clusters](https://spark.apache.org/docs/latest/img/cluster-overview.png)

**Some concepts**

- Driver program: The process running the main function of the application and creating the SparkContext
- Cluster manager: An external service for acquiring resources on the cluster
- Worker node: Any node that can run application code in the cluster. Workers typically have a set of system resources (CPU, memory, disk) allocated to them and are responsible for executing tasks and storing data.
- Executor: A process launched for an application on a worker node that runs tasks and keeps data in memory or on disk across them. Each application has its own executors.
- Task: A unit of work that will be sent to one executor
- Job: A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g. save, collect)
- Stage: Each job gets divided into smaller sets of tasks called stages that depend on each other
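The relationship between driver, tasks, and executors can be sketched with a small analogy in plain Python. This is not Spark code: threads in a single process stand in for executors on worker nodes, and the names `split_into_partitions`, `task`, and `run_job` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Driver side: split the dataset into roughly equal partitions."""
    return [data[i::num_partitions] for i in range(num_partitions)]


def task(partition):
    """One task: the unit of work applied to a single partition."""
    return sum(x * x for x in partition)


def run_job(data, num_executors=3):
    """A 'job' with a single stage: one task per partition, combined by the driver."""
    partitions = split_into_partitions(data, num_executors)
    # The pool plays the role of the executors that the driver dispatches tasks to.
    with ThreadPoolExecutor(max_workers=num_executors) as executors:
        partial_results = list(executors.map(task, partitions))
    # The driver collects the partial results and combines them.
    return sum(partial_results)


total = run_job(list(range(1, 11)))
print(total)  # sum of the squares of 1..10
```

In Spark the same pattern is handled for you: the driver builds the plan, the cluster manager provides the executors, and each stage is executed as one task per partition.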

For more information, see the [cluster mode overview](https://spark.apache.org/docs/latest/cluster-overview.html) in the Spark documentation.

## Spark Session

As we've seen, the **SparkContext** is the main entry point for Spark functionality. However, in Spark 2.0, a new entry point called **SparkSession** was introduced to simplify the use of Spark features. 

![Spark Session VS Spark Context](https://abhishekbaranwal10.files.wordpress.com/2018/09/introduction-to-apache-spark-20-12-638.jpg?resize=638%2C479)

Here are the main differences between them:

**SparkContext**

- SparkContext is the older, more foundational component of Apache Spark. It was the main entry point to Spark functionality in versions before 2.0.

- It primarily deals with low-level functionality like creating RDDs (Resilient Distributed Datasets) and managing distributed computing tasks.

- It does not have built-in support for structured data like DataFrames or SQL queries.

**SparkSession**


- SparkSession was introduced in version 2.0 and is the unified entry point for Spark functionality in newer versions.

- It provides a higher-level interface for working with structured data, such as DataFrames, Datasets, and Spark SQL.

- It simplifies application development and management by consolidating and abstracting away lower-level APIs like SparkContext, SQLContext, StreamingContext, and HiveContext.
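As a minimal sketch of how the two entry points relate in PySpark, the function below starts a local SparkSession, reaches the underlying SparkContext through it, and uses each one once. It assumes PySpark and a Java runtime are installed; the function name `describe_entry_points`, the application name, and the `local[2]` master setting are illustrative choices rather than requirements.

```python
def describe_entry_points(app_name="intro-to-spark"):
    """Start a local SparkSession and use both entry points once."""
    # Imported inside the function so PySpark is only needed when it is called.
    from pyspark.sql import SparkSession

    # SparkSession is the unified entry point (Spark 2.0 and later).
    spark = (
        SparkSession.builder
        .master("local[2]")   # assumption: run locally with 2 threads
        .appName(app_name)    # hypothetical application name
        .getOrCreate()
    )
    try:
        # The underlying SparkContext is still available for low-level RDD work.
        sc = spark.sparkContext
        rdd_sum = sc.parallelize([1, 2, 3]).sum()

        # Structured data (DataFrames, SQL) goes through the session itself.
        df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])
        df_rows = df.count()
    finally:
        spark.stop()  # release the local resources
    return {"rdd_sum": rdd_sum, "df_rows": df_rows}
```

Calling `describe_entry_points()` in a notebook cell should return the sum computed through the RDD API and the row count computed through the DataFrame API, both obtained from the same session.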

[Here](https://www.ksolves.com/blog/big-data/spark/sparksession-vs-sparkcontext-what-are-the-differences) is an article explaining the differences between them in more detail.