In [2]:
import os

# Set SPARK_HOME and JAVA_HOME (these are Homebrew paths on macOS; adjust them for your installation)
os.environ['SPARK_HOME'] = '/usr/local/Cellar/apache-spark/3.5.1/libexec'
os.environ['JAVA_HOME'] = '/usr/local/opt/openjdk/libexec/openjdk.jdk/Contents/Home'
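
The two paths above are specific to one machine, so it can help to confirm that they exist before starting Spark. The following is a minimal sketch using only the standard library; `missing_env_paths` is a hypothetical helper name, not part of PySpark.

```python
import os

def missing_env_paths(names=("SPARK_HOME", "JAVA_HOME")):
    """Return the variables that are unset or do not point to an existing directory."""
    missing = []
    for name in names:
        path = os.environ.get(name)
        if not path or not os.path.isdir(path):
            missing.append(name)
    return missing

# Report any problems before a SparkSession is created
for name in missing_env_paths():
    print(f"{name} is unset or does not point to a directory")
```

If nothing is printed, both variables point to existing directories. This does not prove that the directories contain a working Spark or Java installation.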

## Basics of Spark

Apache Spark is an open-source unified analytics engine for large-scale data processing. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.

### Key Features of Spark:
- **Speed**: Spark keeps intermediate data in memory where possible, which makes it faster than engines that write intermediate results to disk between steps, such as Hadoop MapReduce.
- **Ease of Use**: Spark offers high-level APIs in Java, Scala, Python, and R, and a rich set of libraries including Spark SQL, MLlib (for machine learning), GraphX (for graph processing), and Spark Streaming.
- **Generality**: SQL queries, streaming, and complex analytics can be combined in a single application.
- **Fault Tolerance**: Spark tracks how each dataset was derived (its lineage), so lost partitions can be recomputed after a failure.

## Spark Architecture

The architecture of Spark comprises the following components:

- **Driver**: The process that runs the `main()` function of the application and creates the `SparkContext`.
- **Cluster Manager**: The external service for acquiring resources on the cluster (e.g., YARN, Mesos, Kubernetes, Standalone).
- **Workers**: The nodes that execute the tasks.
- **Executors**: Run on worker nodes, executing the tasks and keeping data in memory.
- **Tasks**: Units of work sent to executors by the driver.
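
In practice, the cluster manager is selected by the master URL passed to `SparkSession.builder.master(...)` or to `spark-submit --master`. The sketch below maps the common URL forms to the managers listed above. `cluster_manager_for` is a hypothetical helper written for illustration; Spark does this parsing internally and exposes no such function.

```python
def cluster_manager_for(master: str) -> str:
    """Name the cluster manager implied by a Spark master URL (illustrative only)."""
    if master == "local" or master.startswith("local["):
        # e.g. "local", "local[4]", "local[*]": everything runs on one machine
        return "local mode (no cluster manager)"
    if master.startswith("spark://"):
        return "Standalone"
    if master.startswith("mesos://"):
        return "Mesos"
    if master == "yarn":
        return "YARN"
    if master.startswith("k8s://"):
        return "Kubernetes"
    raise ValueError(f"Unrecognized master URL: {master!r}")

for url in ["local[*]", "spark://host:7077", "yarn", "k8s://https://host:6443"]:
    print(url, "->", cluster_manager_for(url))
```

The host names and ports in the loop are placeholders, not real endpoints.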

## Spark Execution Flow:

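1. **Job Submission**: The application calls an action (such as `count()` or `show()`), which submits a job to the driver.
2. **Task Scheduling**: The driver splits the job into stages, and each stage into tasks, one per data partition.
3. **Task Execution**: Tasks are sent to executors, which run them in parallel.
4. **Result Collection**: Results are collected and returned to the driver.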
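
The four steps above can be imitated in plain Python to make the division of labor concrete. This is a toy model, not Spark's API: a thread pool stands in for the executors, and all function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_partitions(rows, num_partitions):
    """Driver side: divide the input into roughly equal partitions."""
    return [rows[i::num_partitions] for i in range(num_partitions)]

def count_task(partition):
    """Executor side: one task processes one partition."""
    return len(partition)

def run_count_job(rows, num_partitions=2):
    # Task scheduling: one task per partition
    partitions = split_into_partitions(rows, num_partitions)
    # Task execution: the pool plays the role of the executors
    with ThreadPoolExecutor(max_workers=num_partitions) as executors:
        partial_counts = list(executors.map(count_task, partitions))
    # Result collection: the driver combines the partial results
    return sum(partial_counts)

# Job submission: the user asks for a row count
rows = [("John", 28), ("Doe", 35), ("Alice", 22), ("Bob", 29)]
print(run_count_job(rows))  # prints 4
```

Real Spark adds much that this sketch leaves out, including stages, shuffles, data locality, and retrying failed tasks.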


In [3]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Create a SparkSession
spark = SparkSession.builder \
    .appName("Sample Spark Program") \
    .getOrCreate()

# Sample data as a list of tuples. createDataFrame also accepts:
#   - a list of dictionaries,  e.g. [{"Name": "Alice", "Age": 22}]
#   - a list of Row objects,   e.g. [Row(Name="Bob", Age=29)]
#   - a list of namedtuples,   e.g. [Person(Name="Bob", Age=29)]
#   - a list of plain objects, e.g. [Person("Bob", 29)]
data = [
    ("John", 28),
    ("Doe", 35),
    ("Alice", 22),
    ("Bob", 29),
]

# Create a DataFrame
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)

# Show the DataFrame
print("Original DataFrame:")
df.show()

# Apply a transformation: add a column with each person's age in five years
df_transformed = df.withColumn("AgeAfter5Years", col("Age") + 5)

# Show the transformed DataFrame
print("Transformed DataFrame:")
df_transformed.show()

# Perform an action: Count the number of rows
count = df_transformed.count()
print(f"Number of rows in the DataFrame: {count}")

# Stop the SparkSession
spark.stop()


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/07/31 08:44:22 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


Original DataFrame:
+-----+---+
| Name|Age|
+-----+---+
| John| 28|
|  Doe| 35|
|Alice| 22|
|  Bob| 29|
+-----+---+

Transformed DataFrame:
+-----+---+--------------+
| Name|Age|AgeAfter5Years|
+-----+---+--------------+
| John| 28|            33|
|  Doe| 35|            40|
|Alice| 22|            27|
|  Bob| 29|            34|
+-----+---+--------------+

Number of rows in the DataFrame: 4
