Notebook demonstrating SQL and Python user-defined functions (UDFs) in Databricks

In [0]:
%sql
CREATE OR REPLACE FUNCTION sql_multiply_numbers(a DOUBLE, b DOUBLE)
RETURNS DOUBLE
RETURN a * b;


In [0]:
%sql
select sql_multiply_numbers(4,4) as result;

In [0]:
# pandas is overkill for multiplying two numbers, but this is a compact example of defining a pandas UDF and calling it from SQL

from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType
import pandas as pd

# Define the pandas UDF
@pandas_udf(DoubleType())
def pandas_multiply_numbers(a: pd.Series, b: pd.Series) -> pd.Series:
    return a * b


# Register the UDF so it can be used in SQL
spark.udf.register("pandas_multiply_numbers", pandas_multiply_numbers)



In [0]:
%sql
select pandas_multiply_numbers(4,4) as result;

Why use pd.Series instead of a numeric type?
A pandas UDF (user-defined function) in PySpark does not operate on single values. It operates on batches of rows from each input column, which is why its arguments and return value are typed as pd.Series:

Vectorized Operations
A pandas UDF receives a batch of rows for each input column, not individual values.
Each batch arrives as a pandas.Series, so the function body can use vectorized operations, which are typically much faster than looping over individual values in Python.
Parallel Processing
PySpark splits your data into partitions and processes them in parallel.
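Within each partition, rows are transferred through Apache Arrow in batches (sized by spark.sql.execution.arrow.maxRecordsPerBatch), and each batch is passed to your UDF as a pd.Series, so pandas operates on a whole chunk at once.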
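
The batch behaviour described above can be tried outside Spark. The following is a minimal sketch using plain pandas only: `multiply_batch` is a hypothetical stand-in with the same body as `pandas_multiply_numbers` but without the `@pandas_udf` decorator, and `batch_a` / `batch_b` are made-up values standing in for one batch of each column. It is meant to show what the function body sees, not how Spark schedules the work.

```python
import pandas as pd


def multiply_batch(a: pd.Series, b: pd.Series) -> pd.Series:
    # Same body as pandas_multiply_numbers, minus the @pandas_udf decorator
    # (hypothetical helper, for illustration only).
    return a * b


# Made-up batch: Spark would pass slices of the input columns like these.
batch_a = pd.Series([2.0, 4.5, 6.0])
batch_b = pd.Series([3.0, 5.5, 7.0])

# One call handles the whole batch with an element-wise multiplication.
vectorized = multiply_batch(batch_a, batch_b)

# Row-at-a-time equivalent: one Python-level multiplication per row,
# which is how a plain Python UDF is invoked.
row_by_row = pd.Series([x * y for x, y in zip(batch_a, batch_b)])

print(vectorized.tolist())
print(vectorized.equals(row_by_row))
```

Both approaches give the same values. The difference is that the vectorized version makes a single call per batch, while the row-at-a-time version pays Python call overhead for every row.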


In [0]:
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

def multiply_numbers(a, b):
    return a * b

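# Register the function so it can be used in SQL.
# The declared return type should match what the function actually returns:
# integer arguments produce an int, so IntegerType fits the query below.
# With DOUBLE arguments the result would be a float, which a plain Python UDF
# typically returns as NULL under IntegerType.
spark.udf.register("pyspark_multiply_numbers", multiply_numbers, IntegerType())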

In [0]:
%sql

select pyspark_multiply_numbers(4,4) as result;

In [0]:
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

# Define the UDF
def udf_multiply_numbers(a, b):
    return a * b

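# Wrap the function as a UDF for the DataFrame API.
# This does not register it for SQL, and it rebinds the name to the wrapped UDF.
udf_multiply_numbers = udf(udf_multiply_numbers, DoubleType())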


In [0]:
# `spark` is predefined in Databricks notebooks, so no SparkSession import is needed
from pyspark.sql.functions import col

# Sample DataFrame
df = spark.createDataFrame([
    (2.0, 3.0),
    (4.5, 5.5),
    (6.0, 7.0)
], ["a", "b"])

# Apply the UDF
df_with_product = df.withColumn("product", udf_multiply_numbers(col("a"), col("b")))
df_with_product.show()
