# Anatomy of PySpark Code


This notebook walks through examples of how PySpark code is structured. 
It covers basic syntax, the SparkSession, method chaining, and when imports are necessary.
    

## SparkSession and Built-in DataFrame Operations


The `spark` object (a SparkSession) is the entry point: `spark.read` loads data into a DataFrame. Filtering and writing are then methods of that DataFrame, not of the SparkSession itself. 
None of these operations require additional imports.
    

In [0]:

# In Databricks, the SparkSession is automatically available as 'spark'.
# No need to import DataFrame methods.
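# Assumes the first line of the file holds column names; without the header option
# the columns would be named _c0, _c1, ... and the 'Age' filter below would fail.
df = spark.read.format("csv").option("header", "true").load("/path/to/file.csv")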
df_filtered = df.filter("Age > 30")
df_filtered.write.format("csv").save("/path/to/directory/")
    

## When You Need to Import Modules

In [0]:

# Column functions such as col() live in 'pyspark.sql.functions' and must be imported.
from pyspark.sql import functions as F

# Example: Using the 'col' function to reference a column
df = df.withColumn("AgePlusOne", F.col("Age") + 1)
    

In [0]:

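# To define a schema, import the type classes you need from 'pyspark.sql.types'.
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Example of defining a schema: each StructField takes a name, a type, and a nullable flag.
schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True)  # numeric, so Age can be compared and added to as an integer
])

# Pass the schema to the reader; otherwise every CSV column is read as a string.
df = spark.read.format("csv").option("header", "true").schema(schema).load("/path/to/file.csv")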
    

## Methods and Dot Notation

In [0]:
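# Read a CSV file into a DataFrame using the SparkSession.
# The format is specified as 'csv', the header option takes column names from the
# first line (assumed to be present), and the file path is provided.
df = spark.read.format("csv").option("header", "true").load("/path/to/file.csv")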

In [0]:
# Filter the DataFrame to include only rows where the 'Age' column is greater than 30.
# The result is stored in a new DataFrame called df_filtered.
df_filtered = df.filter("Age > 30")

In [0]:
# Write the filtered DataFrame to a specified directory in CSV format.
# The directory path is provided as the save location.
df_filtered.write.format("csv").save("/path/to/directory/")


In this example, we performed operations step-by-step, storing results in separate variables.
    

## Chaining Methods


In the next cell, the same steps are chained with dot notation. Each method is called on the object returned by the previous one, so the order matters.
    

In [0]:

# Methods can also be chained together for conciseness.
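# The header option assumes the first line of the file holds column names such as 'Age'.
spark.read.format("csv").option("header", "true").load("/path/to/file.csv").filter("Age > 30").write.format("csv").save("/path/to/directory/")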
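
Chaining works because each call returns an object that has the next method. The sketch below mimics that pattern in plain Python so it can run without a cluster. `TinyFrame` and its sample rows are hypothetical and are not part of PySpark; they only illustrate the mechanics.

```python
# TinyFrame is a hypothetical stand-in for a DataFrame, used only to
# illustrate how method chaining works. It is not part of PySpark.
class TinyFrame:
    def __init__(self, rows):
        self.rows = list(rows)

    def filter(self, predicate):
        # Return a NEW TinyFrame, leaving the original untouched.
        return TinyFrame(r for r in self.rows if predicate(r))

    def select(self, *names):
        # Keep only the requested keys in each row.
        return TinyFrame({n: r[n] for n in names} for r in self.rows)


people = TinyFrame([
    {"Name": "Ana", "Age": 25},
    {"Name": "Raj", "Age": 41},
])

# Step by step, storing each intermediate result...
older = people.filter(lambda r: r["Age"] > 30)
names_stepwise = older.select("Name")

# ...or chained: each method is called on the result of the previous one.
names_chained = people.filter(lambda r: r["Age"] > 30).select("Name")

print(names_chained.rows)  # [{'Name': 'Raj'}]
```

In this toy class, swapping the two calls would raise a `KeyError`, because `select("Name")` drops `Age` before the filter can read it. That is the sense in which the order of a chain matters.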
    

## Backslash (\)

In [0]:

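# For readability, you can use a backslash to split long chained operations across multiple lines.
# The backslash must be the very last character on its line: a trailing space causes a SyntaxError.
# The header option assumes the first line of the file holds column names such as 'Age'.
spark.\
    read.format("csv").option("header", "true").load("/path/to/file.csv").\
    filter("Age > 30").\
    write.format("csv").save("/path/to/directory/")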
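
A common alternative to backslashes is to wrap the whole expression in parentheses, which PEP 8 prefers for long lines. The sketch below compares the two styles on a plain Python string (a stand-in chosen so the example runs anywhere); the same line-continuation rules apply to a PySpark chain.

```python
# A plain Python string stands in for a longer PySpark chain so that
# this runs anywhere; the line-continuation rules are the same.
raw = "  Name,Age  "

# Backslash continuation: nothing, not even a space, may follow the backslash.
with_backslash = raw.\
    strip().\
    lower().\
    split(",")

# Parentheses: line breaks are allowed anywhere inside the brackets,
# and each step can carry its own comment.
with_parens = (
    raw
    .strip()      # remove surrounding whitespace
    .lower()      # normalize case
    .split(",")   # break into fields
)

print(with_backslash == with_parens)  # True
```

Both forms build the same expression. The parenthesized form tends to be easier to maintain because steps can be commented or reordered without touching line endings.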
    