Step 2: Simple Spark Jobs
Example 1: Reading a CSV File
Upload a CSV file to Databricks:

Go to the Databricks workspace.
Click on Data > Add Data > Upload File.
Upload a sample CSV file (e.g., people.csv).
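
If no sample file is at hand, a small one can be generated locally before uploading. The sketch below is only an illustration: the rows are made-up placeholder data, and the `name` and `age` columns are chosen to match the columns selected in Example 2.

```python
import csv

# Hypothetical sample rows; replace with your own data.
rows = [
    {"name": "Alice", "age": 34},
    {"name": "Bob", "age": 45},
    {"name": "Carol", "age": 29},
]

# Write a header row followed by one line per person.
with open("people.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "age"])
    writer.writeheader()
    writer.writerows(rows)
```
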
Create a Notebook and Read the CSV File:

In [None]:
from pyspark.sql import SparkSession

# Create a Spark session (in a Databricks notebook, getOrCreate() returns the existing `spark` session)
spark = SparkSession.builder.appName("SimpleExample").getOrCreate()

# Read the CSV file
df = spark.read.option("header", "true").csv("/FileStore/tables/people.csv")

# Show the DataFrame
df.show()
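
Without a schema, every CSV column is read as a string. One way to get typed columns is to pass the schema as a DDL string, as in the sketch below. It assumes the file has exactly the `name` and `age` columns used in Example 2; the helper name `read_people` is hypothetical.

```python
# Assumed layout of people.csv; adjust to match the real file.
PEOPLE_SCHEMA = "name STRING, age INT"


def read_people(spark, path="/FileStore/tables/people.csv"):
    """Read the CSV with an explicit schema so age is an integer, not a string."""
    return (
        spark.read
        .option("header", "true")
        .schema(PEOPLE_SCHEMA)
        .csv(path)
    )
```

In a notebook this could be called as `typed_df = read_people(spark)`, followed by `typed_df.printSchema()` to check the column types.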

Example 2: Writing Data to Parquet Format
Transform and Write Data:

In [None]:
# Select specific columns
selected_df = df.select("name", "age")

# Write the DataFrame to Parquet format; mode("overwrite") replaces any existing output so the cell can be rerun
selected_df.write.mode("overwrite").parquet("/FileStore/tables/people.parquet")

Step 3: Intermediate Spark Jobs
Example 3: Reading Data from a JDBC Source
Set Up the JDBC Connection:

In [None]:
jdbc_url = "jdbc:postgresql://your_host:your_port/your_database"
connection_properties = {
    "user": "your_username",
    "password": "your_password",
    "driver": "org.postgresql.Driver"
}

# Read data from PostgreSQL database
jdbc_df = spark.read.jdbc(url=jdbc_url, table="your_table", properties=connection_properties)

# Show the DataFrame
jdbc_df.show()
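
Hard-coding credentials in a notebook makes them easy to leak. A minimal sketch of one alternative is to read them from environment variables. The variable names (`PG_HOST`, `PG_PORT`, `PG_DATABASE`, `PG_USER`, `PG_PASSWORD`) and the helper `build_jdbc_config` are hypothetical, and on Databricks a secret scope may be a better fit.

```python
import os


def build_jdbc_config(env=os.environ):
    """Build the JDBC URL and connection properties from environment variables."""
    host = env["PG_HOST"]
    port = env.get("PG_PORT", "5432")  # 5432 is PostgreSQL's default port
    database = env["PG_DATABASE"]
    url = f"jdbc:postgresql://{host}:{port}/{database}"
    properties = {
        "user": env["PG_USER"],
        "password": env["PG_PASSWORD"],
        "driver": "org.postgresql.Driver",
    }
    return url, properties
```

The result can replace the literals above: `jdbc_url, connection_properties = build_jdbc_config()`, followed by the same `spark.read.jdbc` call.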

Example 4: Performing Aggregations
Aggregate Data:

In [None]:
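# col is used in Example 5; spark_sum avoids shadowing Python's built-in sum
from pyspark.sql.functions import col, sum as spark_sum

# Group by column1 and sum column2, giving the result column a readable name
agg_df = jdbc_df.groupBy("column1").agg(spark_sum("column2").alias("column2_sum"))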

# Show the aggregated DataFrame
agg_df.show()

Step 4: Advanced Spark Jobs
Example 5: Handling JSON Data
Upload a JSON file to Databricks:

Go to the Databricks workspace.
Click on Data > Add Data > Upload File.
Upload a sample JSON file (e.g., data.json).
Read the JSON File:

In [None]:
# Read the JSON file
json_df = spark.read.json("/FileStore/tables/data.json")

# Show the DataFrame
json_df.show()
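
By default, `spark.read.json` expects JSON Lines input, that is, one complete JSON object per line, not a single pretty-printed array. The sketch below writes a small file in that layout. The records are made-up placeholder data, with `field1` and `nested.field2` chosen to match the fields selected in the transformation step of this example.

```python
import json

# Hypothetical records; "nested" becomes a struct column when Spark reads it.
records = [
    {"field1": "a", "nested": {"field2": 1}},
    {"field1": "b", "nested": {"field2": 2}},
]

# JSON Lines: serialize each record onto its own line.
with open("data.json", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
```

For a file that holds one multi-line JSON document, the reader's `multiLine` option can be set to `true` instead.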


Transform JSON Data:

In [None]:
from pyspark.sql.functions import col

# Select a top-level field and a field nested inside the "nested" struct
transformed_df = json_df.select(col("field1"), col("nested.field2"))

# Show the transformed DataFrame
transformed_df.show()

Production-Level Considerations

Partitioning Data:

In [None]:
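# Write data partitioned by age (one subdirectory per distinct age value);
# mode("overwrite") replaces any existing output so the cell can be rerun
selected_df.write.mode("overwrite").partitionBy("age").parquet("/FileStore/tables/partitioned_people.parquet")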
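
`partitionBy` creates one output directory per distinct value, so partitioning on a raw `age` column can produce many small directories. A possible alternative, sketched below, is to partition on a coarser derived column. The `age_decade` column, the helper name, and any output path passed to it are hypothetical.

```python
def write_by_age_decade(df, path):
    """Write df partitioned by decade of age (20, 30, ...) instead of exact age."""
    with_decade = df.selectExpr(
        "name",
        "age",
        # Cast first because CSV columns are strings unless a schema was supplied.
        "CAST(FLOOR(CAST(age AS INT) / 10) * 10 AS INT) AS age_decade",
    )
    with_decade.write.mode("overwrite").partitionBy("age_decade").parquet(path)
```

It could be called as `write_by_age_decade(selected_df, "/FileStore/tables/people_by_decade.parquet")`, where the path is only an example.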


In [None]:
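# Repartition into 100 partitions so the write runs in parallel
# (repartition triggers a full shuffle of the data)
large_df = spark.read.option("header", "true").csv("/FileStore/tables/large_file.csv")
partitioned_large_df = large_df.repartition(100)

# Write the large DataFrame (up to one output file per partition);
# mode("overwrite") replaces any existing output so the cell can be rerun
partitioned_large_df.write.mode("overwrite").parquet("/FileStore/tables/large_file.parquet")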
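
The partition count of 100 above is a fixed value. A common rule of thumb, used here as an assumption and not a Spark requirement, is to aim for partitions of roughly 128 MB. The sketch below derives a count from the input size; the helper name and the target size are hypothetical choices to tune for your cluster.

```python
import math


def estimate_partitions(total_bytes, target_bytes=128 * 1024 * 1024):
    """Return how many partitions of about target_bytes cover total_bytes."""
    if total_bytes <= 0:
        return 1
    return max(1, math.ceil(total_bytes / target_bytes))
```

For example, `estimate_partitions(10 * 1024**3)` gives 80 for a 10 GiB input, and the result can be passed to `repartition`.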
