# Spark and Hadoop Comparison and Concepts

## 1. What is the difference between Hadoop and Spark?
- **Hadoop**:
  - Processes data on `disk` using MapReduce.
  - `Batch processing` model (no real-time stream processing).
  - `Slower performance` due to frequent disk I/O between processing stages.
  - Typically uses `HDFS for storage`; HDFS achieves fault tolerance by replicating each data block (three copies by default), which costs extra storage.
  - Has its `own data storage (HDFS)`.
  
- **Spark**:
  - `In-memory (RAM)` processing for faster performance (often cited as up to 100x faster than MapReduce for in-memory workloads).
  - Supports `batch` and `real-time` processing `(Spark Streaming)`.
  - Processes data using `RDDs, DataFrames, and Datasets`.
  - `Fault-tolerant` and scales easily across clusters.

    Spark achieves fault tolerance through lineage: it records the transformations that produced each partition and recomputes a lost partition instead of replicating the data (as HDFS does), which makes Spark `more memory-efficient`.
  - Spark doesn't have its own storage layer. It uses `external storage` such as HDFS, S3, or MySQL.

---

## 2. Why do we need Spark? Can't we just easily read files/databases directly?
While it's possible to read files and databases directly, Spark is needed for:
- **Scalability**: Can process large datasets across multiple nodes.
- **Fault tolerance**: Data recovery in case of failures via lineage.
- **Performance**: In-memory computation speeds up iterative algorithms.
- **Distributed computing**: Efficient processing across a cluster.

---

## 3. What is a Spark Context?
A **SparkContext** is the `entry point` for Spark applications. It connects the application to the cluster and facilitates the creation of RDDs, manages job execution, and provides access to Spark's capabilities for parallel processing. It's the main point of interaction with the underlying Spark infrastructure.

---

## 4. What is the difference between a Session and a Context?
- **SparkContext**: the entry point in the `old` API (Spark 1.x). It is responsible for:
  - `job execution`,
  - creating `RDDs`,
  - and managing distributed resources.
- **SparkSession**: the entry point in the `new` API (Spark 2.x and later).
  - A unified entry point for working with `structured data (DataFrames, SQL)`.
  - It `includes SparkContext` and supports `higher-level APIs` such as DataFrame operations, SQL queries, and machine learning tasks.

---

## 5. What is the purpose of a Spark Cluster?
A **Spark Cluster** is a collection of machines (nodes) working together to process large-scale data:
- **Distributed data**: Partitions of a dataset are spread across multiple machines.
- **Parallel processing**: Tasks are distributed across nodes for faster computation.
- **Fault tolerance**: Provides high availability and data recovery in case of node failure.
- **Scalability**: From a few nodes to thousands, allowing processing of massive datasets.

---

## 6. For each of the following modules/classes, explain its purpose and its advantages:

#### **RDD**:
  - **Purpose**:
    * A distributed collection of `data processed in parallel`.
    * Suited to `unstructured data`.
    * Provides a `low-level API` such as `map` and `reduce`.

  - **Advantages**: `Fault tolerance`, `parallel processing`, and `immutability`.

#### **DataFrame and SQL**:
  - Purpose: Structured data representation (like a table in a database). DataFrame provides a higher-level abstraction for working with distributed data.
  - Advantages: Optimized query execution, supports SQL queries, and integrates with Spark's ecosystem for machine learning and analytics.

#### **Streaming**:
  - Purpose: Real-time data processing (Spark Streaming).
  - Advantages: Continuous data processing, easy integration with other systems, and support for complex event processing.

#### **MLlib**:
  - Purpose: Machine learning library for scalable algorithms (e.g., classification, regression).
  - Advantages: Distributed machine learning, supports multiple algorithms, and integrates with Spark's ecosystem.

#### **GraphFrames**:
  - Purpose: A library for working with graphs and performing graph processing.
  - Advantages: Provides optimized graph algorithms and integrates seamlessly with DataFrames.

#### **Resource**:
  - Purpose: Manages and allocates resources (e.g., memory, CPU) in a Spark cluster, typically through a cluster manager such as Spark Standalone, YARN, or Kubernetes.
  - Advantages: Efficient resource utilization, fault tolerance, and scaling.

---

## 7. What is the difference between a Spark DataFrame and a Pandas DataFrame?
- **Spark DataFrame**:
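  - `Distributed` across a cluster, capable of handling large-scale data.
  - `Lazy execution`: transformations run only when an action is called.
  - `SQL integration`
  - `Parallel execution`
  - `Real-time` analytics: supports `data streaming`.

- **Pandas DataFrame**:
  - Runs in memory on a `single machine`, so data size is limited by that machine's RAM.
  - `Eager execution`: each operation runs immediately.
  - `High-level` data manipulation functions for Python.
  - `Index-based` operations: uses row and column indexing.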

---

## 8. What are the Spark data sources?
Spark supports various data sources:
- **HDFS**: Hadoop Distributed File System.
- **S3**: Amazon's cloud storage.
- **JDBC**: For reading from relational databases.
- **Parquet, ORC, Avro**: Binary storage formats with an embedded schema. Parquet and ORC are columnar; Avro is row-based.
- **JSON, CSV, Text**: File-based data sources.

---

## 9. What is the difference between a transformation and an action?
- **Transformation**:
  - `Lazy operations` that return a new RDD/DataFrame (e.g., `map`, `filter`).

- **Action**:
  - `Eager operations` that trigger execution and return or write a result (e.g., `collect`, `count`, `saveAsTextFile`).

---

## 10. What are the advantages of laziness?
- `Optimized execution`: Spark can combine all pending transformations into a single optimized execution plan.
- `Fault tolerance`: Spark can recompute a lost partition from the source data by replaying the recorded transformations (lineage).
- Spark only executes when necessary, `reducing memory` and CPU usage.

---

## 11. When is a shuffle operation needed?
A **shuffle** occurs when:
- `Data needs to be redistributed` across partitions.
- This happens in operations like **groupBy**, **reduceByKey**, or **join**.
- `Shuffling is expensive` and can impact performance due to data transfer and disk I/O.

---

## 12. Explain `explain`.
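`explain()` shows how Spark plans to execute a query. With `explain(True)` it prints:
- `Logical plan`, i.e. the high-level operations (filter, groupBy, etc.), first as parsed and then as analyzed.
- `Optimized Logical Plan`, with operations rearranged to
  * **reduce data shuffling**
  * **push down filters and projections**
  * **choose efficient join strategies**
- `Physical plan`, i.e. the low-level operations that actually run (sort, shuffle).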

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, avg, count, desc

# Initialize Spark
spark = SparkSession.builder.appName("ExplainExample").getOrCreate()

# Create DataFrame
data = [
    (1, "Alice", "HR", 2000),
    (2, "Bob", "IT", 2500),
    (3, "Charlie", "IT", 3000),
    (4, "David", "HR", 2200),
    (5, "Eve", "Finance", 2700),
]
columns = ["id", "name", "department", "salary"]
df = spark.createDataFrame(data, columns)

# 🔹 Transformations (Lazy)
df_filtered = df.filter(col("salary") > 2200)  # Filter rows
df_grouped = df_filtered.groupBy("department").agg(avg("salary").alias("avg_salary"))  # Aggregate
df_sorted = df_grouped.orderBy(desc("avg_salary"))  # Sort results

# 🔹 Action (Triggers execution)
df_sorted.show()

df_sorted.explain(True)
```

```
+----------+----------+
|department|avg_salary|
+----------+----------+
|        IT|    2750.0|
|   Finance|    2700.0|
+----------+----------+

== Parsed Logical Plan ==
'Sort ['avg_salary DESC NULLS LAST], true
+- Aggregate [department#14], [department#14, avg(salary#15L) AS avg_salary#25]
   +- Filter (salary#15L > cast(2200 as bigint))
      +- LogicalRDD [id#12L, name#13, department#14, salary#15L], false

== Analyzed Logical Plan ==
department: string, avg_salary: double
Sort [avg_salary#25 DESC NULLS LAST], true
+- Aggregate [department#14], [department#14, avg(salary#15L) AS avg_salary#25]
   +- Filter (salary#15L > cast(2200 as bigint))
      +- LogicalRDD [id#12L, name#13, department#14, salary#15L], false

== Optimized Logical Plan ==
Sort [avg_salary#25 DESC NULLS LAST], true
+- Aggregate [department#14], [department#14, avg(salary#15L) AS avg_salary#25]
   +- Project [department#14, salary#15L]
      +- Filter (isnotnull(salary#15L) AND (salary#15L > 2200))
         +- Log
```

---

## 13. What is the importance of repartition?
Repartitioning redistributes data across partitions, which can:
- **Improve performance**: By balancing the data across nodes, reducing skew.
- **Optimize parallelism**: Ensures a better workload distribution.
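- **Reduce later shuffling**: `repartition` itself triggers a full shuffle, but partitioning by the right key up front can prevent repeated shuffles in later stages. To only lower the number of partitions, `coalesce` avoids a full shuffle.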

---

## 14. Describe a use case for map and another for mapPartitions.
- **map**: Use `map` when you need to apply a function to each element in an RDD/DataFrame independently. Example: Squaring each number in a dataset.
- **mapPartitions**: Use `mapPartitions` when you want to apply a function once per partition, receiving an iterator over that partition's elements. It's more efficient when each call needs expensive setup that can be shared across the partition (e.g., opening one database connection per partition instead of one per element).

---

## 15. Is there a parallel for SQL constraints in Spark? What about indexes? If yes - what is it? If no - why?
- **SQL Constraints**: Spark SQL does not support SQL constraints like `PRIMARY KEY` or `FOREIGN KEY`. It focuses on distributed processing and does not enforce data integrity constraints.
- **Indexes**: Spark does not support traditional indexes like in relational databases. However, performance can be optimized with partitioning, bucketing, and using the appropriate storage formats (e.g., Parquet).

---

## 16. Why and when are `lit` and `col` useful?
- **lit**: Used to create a column of constant values (e.g., `lit(10)` to create a column with the value `10`).
- **col**: Used to refer to a column in DataFrame operations (e.g., `col('age')` to reference the "age" column).
- These functions are useful in DataFrame transformations and operations.

---

## 17. What is the difference between parquet files and CSV files?
- **Parquet**:
  - Columnar format, optimized for storage and querying.
  - Supports compression and faster read/writes.
  - Better for big data due to its efficient storage and performance.

- **CSV**:
  - Row-based text format, simple but inefficient for large datasets.
  - No embedded schema or column types, slower reads/writes, and no built-in compression.

---

## 18. Can we read data directly from a JSON file using Spark? How? Why would we do that?
Yes, Spark can read data from a JSON file using the `spark.read.json()` method:
```python
df = spark.read.json("path_to_json_file")
