- Estimating the size of a DataFrame

In [None]:
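# Convert the Spark DataFrame to a pandas DataFrame.
# Note: toPandas() collects every row onto the driver, so use it only
# when the data fits in driver memory.
pandas_df = df.toPandas()

# Estimate the pandas in-memory size; deep=True also counts string/object contents
memory_usage_mb = pandas_df.memory_usage(deep=True).sum() / (1024 * 1024)  # bytes -> MB
print(f"Estimated memory usage: {memory_usage_mb:.2f} MB")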


In [None]:
# Print the physical plan Spark will execute for this DataFrame
df.explain()


Partitioning
Purpose: Partitioning divides data into logical partitions based on one or more columns, typically improving query performance by reducing the amount of data to scan.

Mechanism: Spark organizes data into directories based on partition columns, making it easy to filter and retrieve data based on partition values.



In [None]:
df.write.partitionBy("year", "month").parquet("path/to/table")
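
To make the directory mechanism concrete, here is a plain-Python sketch (no Spark required) of the Hive-style `column=value` layout that a partitioned write like the one above produces. The helper name `partition_path` and the sample rows are made up for illustration, and the data file names inside each directory are chosen by Spark and not modelled here.

```python
def partition_path(base, row, partition_cols):
    """Build the Hive-style directory for one row: base/col1=val1/col2=val2."""
    parts = [f"{c}={row[c]}" for c in partition_cols]
    return "/".join([base] + parts)


# Hypothetical sample rows; only the partition columns matter for the layout
rows = [
    {"year": 2023, "month": 1, "amount": 10.0},
    {"year": 2023, "month": 2, "amount": 7.5},
    {"year": 2024, "month": 1, "amount": 3.2},
]

# Each distinct (year, month) pair maps to exactly one directory
paths = sorted({partition_path("path/to/table", r, ["year", "month"]) for r in rows})
for p in paths:
    print(p)
```

A query that filters on `year == 2024` only needs to read the directories under `year=2024`, which is where the reduction in scanned data comes from.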


Bucketing
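Purpose: Bucketing distributes rows into a fixed number of buckets by applying a hash function to one or more columns. Rows with the same key always land in the same bucket, which typically improves join and aggregation performance on those columns. The distribution is only even when the key values themselves are well spread.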

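Mechanism: Spark assigns each row to one of a fixed number of buckets based on the hash value of the bucketing columns and writes each bucket's rows to its own file or files. Because each writing task can emit a file per bucket, the total number of files can exceed the number of buckets.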

In [None]:
df.write.bucketBy(10, "customer_id").sortBy("transaction_date").saveAsTable("bucketed_table")
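
The hash-then-modulo idea behind bucketing can be sketched in plain Python. This is only an illustration of the principle: `bucket_for` is a hypothetical helper, the customer IDs are invented, and `zlib.crc32` is a stand-in for Spark's own internal hash function, so the bucket numbers will not match what Spark assigns.

```python
import zlib


def bucket_for(key, num_buckets):
    """Map a key to a bucket index in [0, num_buckets) using hash modulo."""
    # zlib.crc32 is a deterministic stand-in, NOT the hash Spark uses
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets


customer_ids = [101, 102, 103, 101, 104, 102]  # hypothetical keys

buckets = {}
for cid in customer_ids:
    buckets.setdefault(bucket_for(cid, 10), []).append(cid)

# Repeated keys always end up in the same bucket
for b in sorted(buckets):
    print(b, buckets[b])
```

Because equal keys always hash to the same bucket, two tables bucketed the same way on the join key can be joined bucket by bucket.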


Columnar Storage: Parquet is a columnar storage format, and Redshift also stores table data by column. Because the two layouts align, loading Parquet files into Redshift is generally more efficient than loading a row-oriented format.

In [None]:
# DataFrame.coalesce(1) reduces the output to one partition, so a single Parquet file is written
(
    df.coalesce(1)
    .write
    .format("parquet")
    .option("compression", "snappy")  # Choose your preferred compression codec
    .mode("overwrite")
    .save("s3://your-bucket/path/to/save/location")
)

The explode() function in PySpark is used to transform an array or map column into multiple rows, with one row for each element of the array or key-value pair of the map.

In [None]:
from pyspark.sql.functions import explode
exploded_df = df.select("name", explode("fruits").alias("fruit"))
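
The row-multiplying behaviour of explode() can be mimicked in plain Python to show what happens to each input row. This is a sketch of the semantics only, not how Spark implements it; `explode_rows` and the sample data are hypothetical. Note that rows whose array is empty or null produce no output rows.

```python
def explode_rows(rows, array_col, out_col):
    """Yield one output row per element of each row's array column."""
    for row in rows:
        values = row.get(array_col)
        if not values:  # None or empty array: no row is emitted
            continue
        for v in values:
            out = {k: val for k, val in row.items() if k != array_col}
            out[out_col] = v
            yield out


# Hypothetical input with a normal, an empty, and a null array
rows = [
    {"name": "Alice", "fruits": ["apple", "banana"]},
    {"name": "Bob", "fruits": []},
    {"name": "Cathy", "fruits": None},
    {"name": "Dave", "fruits": ["cherry"]},
]

exploded = list(explode_rows(rows, "fruits", "fruit"))
for r in exploded:
    print(r)
```

If rows with empty or null arrays need to be kept, PySpark provides explode_outer(), which emits a single row with a null element for them.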

The coalesce() function in PySpark (pyspark.sql.functions.coalesce) returns the first non-null value from a set of columns or expressions. It takes a variable number of arguments and evaluates them left to right; if all arguments are null, it returns null. It is unrelated to DataFrame.coalesce(n), which reduces the number of partitions.

In [None]:
from pyspark.sql.functions import coalesce, col
first_non_null_age = df.select(coalesce(col("age1"), col("age2")).alias("first_non_null_age")).first()[0]
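
For a single row, the rule coalesce() applies can be written as a few lines of plain Python. This is a sketch of the semantics, with `first_non_null` as a hypothetical helper and Python's `None` standing in for SQL null.

```python
def first_non_null(*values):
    """Return the first argument that is not None, or None if all are."""
    for v in values:
        if v is not None:
            return v
    return None


print(first_non_null(None, 30))    # second value is used
print(first_non_null(25, 30))      # first value wins
print(first_non_null(None, None))  # nothing to fall back on
print(first_non_null(0, 5))        # 0 is a real value, not null
```

The last call is worth noting: only null is skipped, so falsy values such as 0 or an empty string are returned as-is.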


In [None]:
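from pyspark.sql.functions import coalesce, col, when

# Rows matching none of the conditions get null, because there is no otherwise()
first_non_null_age = df.select(
    "name",
    when(col("country") == "US", coalesce(col("age1"), col("age2")))
    .when(col("country") == "UK", col("age2"))
    .when(col("country").isNull(), coalesce(col("age1"), col("age2")))
    .alias("first_non_null_age"),
)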


Window functions in PySpark allow you to perform computations across rows of a DataFrame related to the current row, similar to SQL window functions.

Partitioning: Window functions are typically applied within partitions of a DataFrame. You can partition the data by one or more columns using the partitionBy() method.

Ordering: Within each partition, rows are usually ordered based on one or more columns. You can specify the ordering using the orderBy() method.



In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

# Initialize Spark session
spark = SparkSession.builder \
    .appName("Window Functions Example") \
    .getOrCreate()

# Sample DataFrame
data = [("Alice", 25),
        ("Bob", 30),
        ("Cathy", 28),
        ("Dave", 35),
        ("Emily", 27)]

df = spark.createDataFrame(data, ["name", "age"])

# Define a window specification partitioned by no columns and ordered by age in descending order
window_spec = Window.orderBy(df["age"].desc())

# Add a new column "rank" using the row_number() window function
df_with_rank = df.withColumn("rank", row_number().over(window_spec))

# Show the DataFrame with ranks
df_with_rank.show()

# Stop Spark session
spark.stop()
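
The example above orders the whole DataFrame without partitioning it. To show what partitioning adds, here is a plain-Python sketch of row_number() restarting within each partition. It illustrates the semantics only; `row_number_by`, the `dept` column, and its values are assumptions added for the illustration.

```python
from itertools import groupby
from operator import itemgetter


def row_number_by(rows, partition_key, order_key, descending=False):
    """Assign 1-based row numbers within each partition, ordered by order_key."""
    # Sort by the ordering column first, then stable-sort by partition
    # so the ordering is preserved inside each partition
    ordered = sorted(rows, key=itemgetter(order_key), reverse=descending)
    ordered.sort(key=itemgetter(partition_key))
    result = []
    for _, group in groupby(ordered, key=itemgetter(partition_key)):
        for n, row in enumerate(group, start=1):
            result.append({**row, "rank": n})
    return result


# Same people as above, plus a hypothetical "dept" column to partition by
people = [
    {"name": "Alice", "dept": "eng", "age": 25},
    {"name": "Bob", "dept": "eng", "age": 30},
    {"name": "Cathy", "dept": "ops", "age": 28},
    {"name": "Dave", "dept": "ops", "age": 35},
    {"name": "Emily", "dept": "eng", "age": 27},
]

ranked = row_number_by(people, "dept", "age", descending=True)
for r in ranked:
    print(r["dept"], r["name"], r["age"], r["rank"])
```

In PySpark the corresponding window specification would be `Window.partitionBy("dept").orderBy(df["age"].desc())`, assuming the DataFrame had such a `dept` column.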
