Basics of Spark
Spark is a unified analytics engine for large-scale data processing. It provides high-level APIs in Java, Scala, Python, and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and DataFrames, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.

Spark Architecture
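Spark follows a master-worker (historically called "master-slave") architecture. The key components are:

Driver Program: Runs the application's main function, creates the SparkSession, and schedules tasks on the cluster.
Cluster Manager: Allocates resources across the cluster (for example Standalone, YARN, or Kubernetes).
Workers: Host the executor processes that run the tasks scheduled by the driver program.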
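
The division of labour between these components can be sketched with a toy model in plain Python. This is only an analogy and uses no Spark API: the thread pool stands in for the workers, the `num_workers` argument stands in for what a cluster manager would decide, and the helper names (`split_into_partitions`, `run_task`, `run_job`) are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(records, num_partitions):
    """Driver side: split the data into roughly equal partitions."""
    return [records[i::num_partitions] for i in range(num_partitions)]


def run_task(partition):
    """Worker side: one task processes exactly one partition."""
    return [(name, age + 5) for name, age in partition]


def run_job(records, num_workers=2):
    """Driver side: plan the tasks, hand them to workers, gather results."""
    partitions = split_into_partitions(records, num_workers)
    # The pool plays the role of the workers. Resource allocation, which a
    # real cluster manager handles, is reduced to the num_workers argument.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        per_partition = list(pool.map(run_task, partitions))
    # Flatten the per-partition results back into a single list.
    return [row for part in per_partition for row in part]


people = [("John", 28), ("Doe", 35), ("Alice", 22), ("Bob", 29)]
result = run_job(people)
print(sorted(result))
```

In real Spark the driver ships serialized tasks to executor processes on other machines, so row order across partitions is not guaranteed; the sketch sorts the result for that reason.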

In [1]:
import os

# Set SPARK_HOME and JAVA_HOME environment variables.
# These paths match a Homebrew install on macOS; adjust them for your machine.
os.environ['SPARK_HOME'] = '/usr/local/Cellar/apache-spark/3.5.1/libexec'
os.environ['JAVA_HOME'] = '/usr/local/opt/openjdk/libexec/openjdk.jdk/Contents/Home'

In [2]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Create a SparkSession
spark = SparkSession.builder \
    .appName("Sample Spark Program") \
    .getOrCreate()

# Sample data as a list of tuples. createDataFrame also accepts:
#   a list of dictionaries,  e.g. [{"Name": "Alice", "Age": 22}]
#   a list of Row objects,   e.g. [Row(Name="Bob", Age=29)]
#   a list of namedtuples,   e.g. [Person(Name="Bob", Age=29)]
#   a list of plain objects, e.g. [Person("Bob", 29)]
data = [
    ("John", 28),
    ("Doe", 35),
    ("Alice", 22),
    ("Bob", 29),
]

# Create a DataFrame
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)

# Show the DataFrame
print("Original DataFrame:")
df.show()

# Perform a transformation: add a column with each person's age in five years
df_transformed = df.withColumn("AgeAfter5Years", col("Age") + 5)

# Show the transformed DataFrame
print("Transformed DataFrame:")
df_transformed.show()

# Perform an action: Count the number of rows
count = df_transformed.count()
print(f"Number of rows in the DataFrame: {count}")

# Stop the SparkSession
spark.stop()


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/07/31 00:57:26 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


Original DataFrame:
+-----+---+
| Name|Age|
+-----+---+
| John| 28|
|  Doe| 35|
|Alice| 22|
|  Bob| 29|
+-----+---+

Transformed DataFrame:
+-----+---+--------------+
| Name|Age|AgeAfter5Years|
+-----+---+--------------+
| John| 28|            33|
|  Doe| 35|            40|
|Alice| 22|            27|
|  Bob| 29|            34|
+-----+---+--------------+

Number of rows in the DataFrame: 4
