Apache Spark and PySpark are closely related: Spark is the distributed data processing engine, and PySpark is the Python API for that engine.

Here's a comparison to help understand their differences and use cases:

### Apache Spark
- **Definition**: Apache Spark is an open-source distributed computing system designed for speed and ease of use in processing large-scale data.
- **Languages**: Spark itself is written mainly in Scala, and it provides APIs for Scala, Java, Python, and R.
- **Components**: Spark consists of several key components:
  - **Spark Core**: The underlying engine responsible for scheduling, distributing, and monitoring applications.
  - **Spark SQL**: Module for structured data processing, allowing SQL queries.
  - **Spark Streaming**: Enables processing of real-time data streams.
  - **MLlib**: Library for machine learning.
  - **GraphX**: API for graphs and graph-parallel computation.

### PySpark
- **Definition**: PySpark is the Python API for Apache Spark, allowing users to interact with Spark through Python.
- **Usage**: PySpark lets Python programmers write Spark applications without switching to Scala or Java.
- **Components**: PySpark exposes Spark Core (RDDs), Spark SQL and DataFrames, streaming, and MLlib through Python. GraphX is the exception: it has no Python API, so graph workloads in Python typically rely on a separate package such as GraphFrames.

### Key Differences
- **Language**: Spark provides APIs in multiple languages (Scala, Java, Python, R), while PySpark specifically caters to Python.

### Use Cases
- **Apache Spark**: Suitable for applications where you need the flexibility of choosing among multiple languages or need to leverage the full power of the Spark ecosystem for large-scale distributed computing.

In summary, PySpark is essentially a Python wrapper for Apache Spark, enabling Python developers to leverage Spark's distributed computing capabilities. Apache Spark provides the core functionality and infrastructure, while PySpark provides a user-friendly interface for Python users.

**RDD (Resilient Distributed Dataset)** is a fundamental data structure of Apache Spark. 

- **Definition**: An RDD is a fault-tolerant collection of elements that can be operated on in parallel across a distributed cluster.
- **Properties**:
  - **Immutable**: Once created, RDDs cannot be changed.
  - **Distributed**: Data is distributed across multiple nodes in a cluster.
  - **Fault-Tolerant**: If a node fails, data can be recomputed using lineage information (the sequence of transformations that created the RDD).
- **Creation**: RDDs can be created from:
  - **Data in storage**: Such as HDFS, S3, or local file systems.
  - **Parallelizing**: An existing collection in the driver program (e.g., a list or set).
- **Operations**:
  - **Transformations**: Lazy operations that define a new RDD (e.g., `map`, `filter`). These are not executed immediately but are recorded in the RDD's lineage.
  - **Actions**: Operations that trigger computation and return a result (e.g., `collect`, `count`). They execute the transformations to produce an output.

RDDs enable distributed processing and are the backbone of Spark’s high performance and scalability.
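
To make lazy transformations, lineage, and immutability concrete, here is a minimal single-machine sketch in plain Python. `ToyRDD` and `times_ten` are hypothetical names invented for this illustration; this is not how Spark is implemented, and nothing here is distributed. It only shows that transformations record steps, while actions replay them.

```python
class ToyRDD:
    """A single-machine stand-in for an RDD (hypothetical, for illustration only)."""

    def __init__(self, source, lineage=()):
        self._source = source            # the original input data, never modified
        self._lineage = tuple(lineage)   # recorded (kind, function) steps

    # Transformations: record a step and return a NEW ToyRDD; nothing runs yet.
    def map(self, fn):
        return ToyRDD(self._source, self._lineage + (("map", fn),))

    def filter(self, fn):
        return ToyRDD(self._source, self._lineage + (("filter", fn),))

    # Actions: replay the recorded lineage over the source and return a result.
    def collect(self):
        items = list(self._source)
        for kind, fn in self._lineage:
            if kind == "map":
                items = [fn(x) for x in items]
            else:
                items = [x for x in items if fn(x)]
        return items

    def count(self):
        return len(self.collect())


calls = []  # records every time the mapped function actually runs


def times_ten(x):
    calls.append(x)
    return x * 10


base = ToyRDD([1, 2, 3, 4])
mapped = base.map(times_ten).filter(lambda x: x > 10)

print(len(calls))        # 0: the transformations were only recorded
print(mapped.collect())  # [20, 30, 40]: the action replays the lineage
print(base.collect())    # [1, 2, 3, 4]: the original is unchanged
```

Because a result can always be rebuilt by replaying the lineage over the source, lost data can be recomputed instead of restored from a copy, which is the idea behind the fault-tolerance property described above.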

In [0]:
# Create an RDD from a Python list
data = [1, 2, 3, 4, 5, 6, 7]
rdd = sc.parallelize(data)

# Perform a simple transformation (e.g., multiply each element by 20)
transformed_rdd = rdd.map(lambda x: x * 20)

# Collect the results
result = transformed_rdd.collect()
print(result)

[20, 40, 60, 80, 100, 120, 140]


In [0]:
# Keep only the even numbers
even_rdd = rdd.filter(lambda x: x % 2 == 0)

# Perform a reduction to sum the numbers
sum_even = even_rdd.reduce(lambda a, b: a + b)
print(sum_even)  # 2 + 4 + 6

12
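
The function passed to `reduce` should be associative and commutative, because Spark reduces each partition locally and then combines the partial results. The plain-Python sketch below imitates that with two made-up partition layouts of the same even numbers; `partitioned_reduce`, `layout_a`, and `layout_b` are hypothetical names for illustration, not Spark APIs.

```python
from functools import reduce


def partitioned_reduce(partitions, fn):
    """Reduce each non-empty partition on its own, then combine the partial results."""
    partials = [reduce(fn, part) for part in partitions if part]
    return reduce(fn, partials)


# Two hypothetical ways the same data (2, 4, 6) could be split across partitions
layout_a = [[2], [4, 6]]
layout_b = [[2, 4], [6]]


def add(a, b):
    return a + b


def sub(a, b):
    return a - b


# Addition gives the same answer regardless of the layout
print(partitioned_reduce(layout_a, add), partitioned_reduce(layout_b, add))  # 12 12

# Subtraction is neither associative nor commutative, so the answer depends on the layout
print(partitioned_reduce(layout_a, sub), partitioned_reduce(layout_b, sub))  # 4 -8
```

Addition is safe for the sum above; an operation like subtraction would give results that change with how the data happens to be partitioned.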


### Text File

In Gotham's night, a shadow flies,
With cape unfurled, beneath dark skies.
A guardian fierce, with silent might,
He guards the weak throughout the night.

With heart of gold and soul of steel,
He fights for justice, true and real.
The Batman soars, a darkened knight,
Our city's hope, our guiding light.

In [0]:
# Read a text file into an RDD
file_rdd = sc.textFile("dbfs:/FileStore/poem.txt")

# Split lines into words (a blank line yields one empty-string token, which is counted like any other word)
words_rdd = file_rdd.flatMap(lambda line: line.split(" "))

# Map each word to a tuple (word, 1)
word_pairs_rdd = words_rdd.map(lambda word: (word, 1))

# Reduce by key (word) to count occurrences
word_counts_rdd = word_pairs_rdd.reduceByKey(lambda a, b: a + b)

# Collect and print the results
word_counts = word_counts_rdd.collect()
for word, count in word_counts:
    print(f"{word}: {count}")

shadow: 1
flies,: 1
unfurled,: 1
beneath: 1
dark: 1
fierce,: 1
might,: 1
night.: 1
: 1
heart: 1
of: 2
gold: 1
steel,: 1
true: 1
The: 1
darkened: 1
knight,: 1
city's: 1
In: 1
Gotham's: 1
night,: 1
a: 2
With: 2
cape: 1
skies.: 1
A: 1
guardian: 1
with: 1
silent: 1
He: 2
guards: 1
the: 2
weak: 1
throughout: 1
and: 2
soul: 1
fights: 1
for: 1
justice,: 1
real.: 1
Batman: 1
soars,: 1
Our: 1
hope,: 1
our: 1
guiding: 1
light.: 1
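
The counts above treat `With` and `with`, or `night,` and `night.`, as different words, and the blank line between stanzas shows up as an empty word. One possible way to normalize tokens is sketched below in plain Python. The `tokenize` helper and its definition of a word (a run of letters that may contain apostrophes) are assumptions made for this example, not part of the original notebook.

```python
import re

# Assumption: a "word" is a run of letters, optionally containing apostrophes.
_WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)*")


def tokenize(line):
    """Return the lowercase words in a line, without punctuation or empty tokens."""
    return _WORD_RE.findall(line.lower())


print(tokenize("In Gotham's night, a shadow flies,"))
# ['in', "gotham's", 'night', 'a', 'shadow', 'flies']

print(tokenize(""))
# []
```

Assuming the `file_rdd` from the cell above, passing this helper to `flatMap` (that is, `file_rdd.flatMap(tokenize)`) in place of the `split(" ")` lambda should merge those variants and drop the empty token, while the rest of the word-count pipeline stays the same.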


In [0]:
# Create a list of tuples
data = [
    ("krish", 33, "Rajkot", "Gifted"),
    ("bruce", 41, "Rajkot", "Gifted"),
    ("king", 19, "Gandhinagar", "Gifted"),
    ("steve", 25, "Rajkot", None),
    ("ravi", 28, "Gandhinagar", None),
]

# Define the column names (Spark infers the column types from the data)
columns = ["Name", "Age", "City", "Category"]

# Create a DataFrame
df = spark.createDataFrame(data, schema=columns)

# Read a file into a DataFrame
# df = spark.read.csv("data.csv", header=True, inferSchema=True)
# df = spark.read.json("data.json")

# Show the DataFrame
df.show()

+-----+---+-----------+--------+
| Name|Age|       City|Category|
+-----+---+-----------+--------+
|krish| 33|     Rajkot|  Gifted|
|bruce| 41|     Rajkot|  Gifted|
| king| 19|Gandhinagar|  Gifted|
|steve| 25|     Rajkot|    null|
| ravi| 28|Gandhinagar|    null|
+-----+---+-----------+--------+



### Performing DataFrame Operations

In [0]:
# Select specific columns (names resolve case-insensitively by default, so "age" matches the "Age" column)
df.select("Name", "age").show()

# Filter rows based on a condition
df.filter(df.Age > 30).show()

# Group by a column and perform aggregation
df.groupBy("City").count().show()

+-----+---+
| Name|age|
+-----+---+
|krish| 33|
|bruce| 41|
| king| 19|
|steve| 25|
| ravi| 28|
+-----+---+

+-----+---+------+--------+
| Name|Age|  City|Category|
+-----+---+------+--------+
|krish| 33|Rajkot|  Gifted|
|bruce| 41|Rajkot|  Gifted|
+-----+---+------+--------+

+-----------+-----+
|       City|count|
+-----------+-----+
|     Rajkot|    3|
|Gandhinagar|    2|
+-----------+-----+

