# UDF/UDTF Examples

This notebook contains different examples of how to create UDFs and UDTFs using the Snowpark API.

In [None]:
# Make sure we do not get line breaks when doing show on wide dataframes
from IPython.core.display import HTML
display(HTML("<style>pre { white-space: pre !important; }</style>"))

# Snowpark imports 
import snowflake.snowpark as S
from snowflake.snowpark import Session
from snowflake.snowpark import functions as F
from snowflake.snowpark import types as T

# Used for UDF examples
import cachetools
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Used for UDF/UDTF examples
import joblib
import sys
import os
import pandas as pd
import numpy as np

# Used for the UDTF examples
from collections import Counter
from typing import Iterable, Tuple

# Print the version of Snowpark we are using
print(f"Using Snowpark: {S.__version__}")

## Connect to Snowflake

This example uses the connections.toml file to connect to Snowflake. You can read more about how to set it up at https://docs.snowflake.com/en/developer-guide/python-connector/python-connector-connect#connecting-using-the-connections-toml-file

In [None]:
CONNECTION_NAME = 'mstellwall_aws_us_west3'
DATABASE_NAME = 'SNOWPARK_DEMO_DB' # Database that has the source files
DATABASE_SCHEMA = 'SOURCE_DATA' # Name of schema that has the source files
FULLY_QUALIFIED_NAME = f"{DATABASE_NAME}.{DATABASE_SCHEMA}"

snf_session = Session.builder.config("connection_name", CONNECTION_NAME).create()
snf_session.use_schema(FULLY_QUALIFIED_NAME)
print("Current role: " + snf_session.get_current_role() + ", Current schema: " + snf_session.get_fully_qualified_current_schema() + ", Current WH: " + snf_session.get_current_warehouse())

# User Defined Functions (UDF)

There are two different types of user-defined functions:
* UDF (Scalar User Defined Function)
    * Is a scalar function that returns one output row for each input row. 
    * The returned row consists of a single column/value.
    * The Python UDF batch API enables defining UDFs that receive batches of input rows as pandas DataFrames and return batches of results as pandas arrays or Series.
* UDTF (User Defined Table Function)
    * A tabular function, also called a table function, returns zero, one, or multiple rows for each input row.
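
The difference between the two shapes can be illustrated without Snowflake at all. The sketch below uses plain Python functions as stand-ins; `scalar_hello` and `table_words` are hypothetical names chosen for illustration and are not Snowpark APIs.

```python
from typing import Iterable, List, Tuple


def scalar_hello(name: str) -> str:
    """Scalar shape: exactly one value out for each value in."""
    return f"Hello {name}!"


def table_words(text: str) -> Iterable[Tuple[str]]:
    """Tabular shape: zero, one, or many rows out for each row in."""
    for word in text.split():
        yield (word,)


rows: List[str] = ["Mats Pia", "", "Snowpark"]

scalar_out = [scalar_hello(r) for r in rows]
table_out = [t for r in rows for t in table_words(r)]

print(len(scalar_out))  # one value per input row
print(table_out)        # the empty string contributes no rows
```

The scalar function always yields as many values as there are input rows, while the table function can shrink or grow the row count.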

## UDF
A UDF can be created using the **@udf** decorator, the **udf** function or the **udf.register** method of the session object. It can be permanent or temporary.

Start by creating a UDF that returns a string. Setting *is_permanent=False* makes the UDF temporary: it is only available to our user, and only until the active Snowflake session is closed. The function is called once for each input row, i.e. it does not use the batch API. Calling **session.clear_imports()** and **session.clear_packages()** first makes sure that imports and packages added earlier are not included when the UDF is created.

In [None]:
snf_session.clear_imports()
snf_session.clear_packages()

@F.udf(name="hello_udf", is_permanent=False, replace=True, session=snf_session)
def hello_udf(name: str) -> str:
    return f'Hello {name}!'

Create a DataFrame and test the function.

In [None]:
test_name_df = snf_session.create_dataframe([['Mats'], ['Pia']], schema=["name"])
test_name_df.select(F.call_function("hello_udf", F.col("name"))).show()

A NULL value can be passed to a UDF. It is converted into a *None* value for the Python function.

In [None]:
test_name_with_null_df = snf_session.create_dataframe([['Mats'],[None], ['Pia']], schema=["name"])
test_name_with_null_df.show()

In [None]:
test_name_with_null_df.select(F.call_function("hello_udf", F.col("name"))).show()
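
Because NULL arrives as *None*, an f-string formats it as the text "None". If that is not wanted, the handler can check for it explicitly. The following is a minimal local sketch; `hello_null_safe` is a hypothetical name, and the assumption that returning *None* from the handler results in a SQL NULL should be verified against the Snowflake documentation.

```python
from typing import Optional


def hello_null_safe(name: Optional[str]) -> Optional[str]:
    # Pass a missing name through instead of formatting the text "None"
    if name is None:
        return None
    return f"Hello {name}!"


print([hello_null_safe(n) for n in ["Mats", None, "Pia"]])
```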

Create the same function again using the Python UDF batch API. This is done by changing the parameter type to **PandasDataFrame** or **PandasSeries** and the return type to **PandasSeries**. The benefit of the batch API is that the function is not called for each input row, but for batches of rows instead.

In [None]:
@F.udf(name="hello_batch_udf", is_permanent=False, replace=True, session=snf_session)
def hello_batch_udf(ds: T.PandasSeries[str]) -> T.PandasSeries[str]:
    n = len(ds)
    return ds.apply(lambda x: f'Hello {x}, we got {n} rows')

In [None]:
test_name_df.select(F.call_function("hello_batch_udf", F.col("name"))).show()
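
To make the per-batch behaviour concrete, here is a small local simulation. It is only a sketch: plain lists stand in for pandas Series, `run_in_batches` is a hypothetical driver, and the batch size of 2 is arbitrary (Snowflake decides the real batch sizes).

```python
from typing import Callable, List


def run_in_batches(rows: List[str], batch_size: int,
                   handler: Callable[[List[str]], List[str]]) -> List[str]:
    """Call handler once per batch and concatenate the results."""
    out: List[str] = []
    for start in range(0, len(rows), batch_size):
        out.extend(handler(rows[start:start + batch_size]))
    return out


def hello_batch(batch: List[str]) -> List[str]:
    # Mirrors hello_batch_udf: every row reports the size of its batch
    n = len(batch)
    return [f"Hello {x}, we got {n} rows" for x in batch]


result = run_in_batches(["Mats", "Pia", "Ann"], batch_size=2, handler=hello_batch)
for line in result:
    print(line)
```

The handler runs twice for three rows, and the last row sees a batch of one.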

Use a larger dataset for testing.

In [None]:
customers_df = snf_session.table("snowflake_sample_data.tpcds_sf100tcl.customer")
print(f"Nbr of customers: {customers_df.count():,}")
customers_df.show(5)

If we test this using **show**, we will see that the function only receives 15 rows, since that is the limit we set in the call.

In [None]:
customers_df.select(F.col("C_FIRST_NAME")).select(F.call_function("hello_batch_udf", F.col("C_FIRST_NAME"))).show(15)

By using **cache_result** we can temporarily store the result of the query generated by the DataFrame, and then see that each call to the function does receive more rows.

In [None]:
batch_udf_df = customers_df.select(F.call_function("hello_batch_udf", F.col("C_FIRST_NAME"))).cache_result()
batch_udf_df.show(15)

We can also create a UDF based on a Python file. Start by creating a directory to store the files in.

In [None]:
!mkdir -p ../py_scripts/udf_examples

In [None]:
%%writefile ../py_scripts/udf_examples/udf_from_file.py
def hello_udf(name: str) -> str:
    return f'Hello {name}!, this is a function in a Python file'

Create a UDF from the file we just wrote. The file we point to is uploaded to Snowflake when the UDF is created.

In [None]:
local_file_udf = snf_session.udf.register_from_file(name="udf_local_file", file_path="../py_scripts/udf_examples/udf_from_file.py", func_name="hello_udf"
                                                , replace=True, is_permanent=False)

In [None]:
test_name_df.select(F.call_function("udf_local_file", F.col("name"))).show()

We can also create a UDF based on a file that is on a Snowflake stage. That enables us to update the file without having to recreate the UDF.

In [None]:
%%writefile ../py_scripts/udf_examples/udf_from_stage.py
def hello_udf(name: str) -> str:
    return f'Hello {name}!, this is a function in a Python file that is on a stage'

We also need a Snowflake stage to store the file. We can use either an external stage (AWS S3, Azure Blob Storage, Google Cloud Storage) or an internal stage (managed by Snowflake). In this example we are using a Snowflake internal stage.

In [None]:
snf_session.sql("create stage if not exists python_files").collect()

To add the file to the stage we can use **file.put** if the stage is a Snowflake internal stage. If the stage is external, we need to use the cloud provider's tools to upload it.

In [None]:
snf_session.file.put('../py_scripts/udf_examples/udf_from_stage.py', '@python_files/udf_examples/', auto_compress=False, overwrite=True)

When creating a Python UDF from a file on a stage we need to provide what data types the arguments and return value have through the **return_type** and **input_types** parameters.

In [None]:
local_file_udf = snf_session.udf.register_from_file(name="udf_stage_file", file_path="@python_files/udf_examples/udf_from_stage.py", func_name="hello_udf"
                                                , input_types=[T.StringType()], return_type=T.StringType()
                                                , replace=True, is_permanent=False)

In [None]:
test_name_df.select(F.call_function("udf_stage_file", F.col("name"))).show(max_width=150)

If we update the file...

In [None]:
%%writefile ../py_scripts/udf_examples/udf_from_stage.py
def hello_udf(name: str) -> str:
    return f'Hello {name}!, this is a function in a Python file that is on a stage and is now updated'

And upload it to our stage, overwriting the existing file

In [None]:
snf_session.file.put('../py_scripts/udf_examples/udf_from_stage.py', '@python_files/udf_examples/', auto_compress=False, overwrite=True)

And when we now call the UDF we are using the updated file

In [None]:
test_name_df.select(F.call_function("udf_stage_file", F.col("name"))).show(max_width=150)

### Reading files with UDFs
There are two ways to read files from a UDxF: using **add_import**, where the file can be either local or on a Snowflake stage, or using **SnowflakeFile**, where the file needs to be on a stage that has a directory table enabled.

#### Using add_import
If we do not need to update the file, we can refer to a local file, and that file will be uploaded to Snowflake when the UDF is created. If we need to use a newer version of the file, we would need to recreate the UDF.

By using cachetools we can make sure that the file is only loaded once: cachetools caches the return value of the function in memory and returns it again whenever the function is called with the same parameter.
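
The caching idea can be tried locally. This sketch uses `functools.lru_cache` from the standard library as a stand-in for `cachetools.cached` (an assumption made so the example has no third-party dependency); `load_text` and `load_calls` are hypothetical names.

```python
from functools import lru_cache

load_calls = []  # records every real load so we can count them


@lru_cache(maxsize=None)
def load_text(filename: str) -> str:
    # Stand-in for an expensive read from the import directory
    load_calls.append(filename)
    return f"contents of {filename}"


first = load_text("text_file.txt")
second = load_text("text_file.txt")   # served from the cache
other = load_text("other_file.txt")   # new argument, so a real load

print(len(load_calls))
```

In plain Python the cached value lives as long as the process does, so a changed file is not re-read until the cache is cleared or a new process starts.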

Start by setting where the local files are and the name of the stage we will create later.

In [None]:
data_path = "../data/"
udf_stage_name = "UDF_DEMO_STAGE"

Create a text file

In [None]:
%%writefile {data_path}text_file.txt
Hello this is the first version!

Define a function that reads a file the UDF has access to, i.e. the file needs to be added using the *imports* parameter.

In [None]:
@cachetools.cached(cache={})
def read_file_cached(filename):
    import sys
    import os

    import_dir = sys._xoptions.get("snowflake_import_directory")
    if import_dir:
        with open(os.path.join(import_dir, filename), "r") as f:
            return f.read()
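
Outside Snowflake the `snowflake_import_directory` option is not set, so a function like the one above returns *None* when run locally. One way to make such a reader testable on a laptop is to add a fallback directory. This is only a sketch; `read_import_file` and its `fallback_dir` parameter are hypothetical and not part of Snowpark.

```python
import os
import sys
import tempfile
from typing import Optional


def read_import_file(filename: str, fallback_dir: Optional[str] = None) -> str:
    """Read a file from the UDF import directory, or from fallback_dir locally."""
    # The option is only set inside Snowflake; getattr guards non-CPython runtimes
    import_dir = getattr(sys, "_xoptions", {}).get("snowflake_import_directory")
    base_dir = import_dir or fallback_dir
    if base_dir is None:
        raise FileNotFoundError(f"No import directory available for {filename}")
    with open(os.path.join(base_dir, filename), "r") as f:
        return f.read()


# Local usage: write a throwaway file and read it through the fallback path
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "text_file.txt"), "w") as f:
        f.write("Hello this is the first version!")
    local_text = read_import_file("text_file.txt", fallback_dir=tmp)

print(local_text)
```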


Create a UDF where the *imports* parameter refers to the local file. Since we are using the **cachetools** library, we also need to add it to the *packages* parameter. Because we point to a local file in the *imports* parameter, it will be uploaded automatically, and if we need to change the file we have to re-create the UDF.

In [None]:
@F.udf(name="read_file_static_udf", is_permanent=False, replace=True, packages=["cachetools"], imports=[f"{data_path}text_file.txt"], session=snf_session)
def read_file_static() -> str:
    return read_file_cached('text_file.txt')


Test the function. Since it does not require an input value, we can use the **generator** method to generate a DataFrame with one row that holds the result of the function call.

In [None]:
snf_session.generator(F.call_function("read_file_static_udf"), rowcount=1).show()

If we want to be able to update the file without recreating the UDF, we need to store it in a Snowflake stage. The stage can be either internal or external.

Create an internal Snowflake stage

In [None]:
snf_session.sql(f"create stage if not exists {udf_stage_name}").collect()

Upload a local file to the new stage. If it is an external stage, you need to use the tools provided by the cloud provider.

In [None]:
snf_session.file.put(f"{data_path}text_file.txt", f"@{udf_stage_name}", auto_compress=False, overwrite=True)

Check that the file is there.

In [None]:
snf_session.sql(f"ls @{udf_stage_name}").show()

Create a UDF that has access to the file in the stage, using the *imports* parameter.

In [None]:
@F.udf(name="read_file_stage_udf", is_permanent=False, replace=True, packages=["cachetools"], imports=[f"@{udf_stage_name}/text_file.txt"] ,session=snf_session)
def read_file_stage() -> str:
    return read_file_cached('text_file.txt')


In [None]:
snf_session.generator(F.call_function("read_file_stage_udf"), rowcount=1).show()

If we change text_file.txt (in the data folder) and upload it

In [None]:
%%writefile {data_path}text_file.txt
Hello this is the second version!

In [None]:
snf_session.file.put(f"{data_path}text_file.txt", f"@{udf_stage_name}", auto_compress=False, overwrite=True)
snf_session.sql(f"ls @{udf_stage_name}").show()

Rerun the call to the UDF

In [None]:
snf_session.generator(F.call_function("read_file_stage_udf"), rowcount=1).show()

Create a UDF that uses a saved Python object, in this case a fitted scikit-learn pipeline.

Create and fit a pipeline using the Titanic data (use **00_Load_demo_data.ipynb** to load the data).

In [None]:
cat_cols = ["EMBARKED", "SEX", "PCLASS"]
num_cols = ["AGE", "FARE"]

train_df = snf_session.table("titanic").select(*cat_cols, *num_cols, "SURVIVED")

train_pd = train_df.to_pandas()

X = train_pd[[*cat_cols, *num_cols]]
y = train_pd["SURVIVED"]

# Imputer and OneHotEncoder for categorical columns
cat_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
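    # Note: scikit-learn 1.2 renamed `sparse` to `sparse_output`; adjust this if your version requires it
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse=False))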
])
# Imputer and Scaler for numerical columns
num_transformer = Pipeline(steps=[
    ('imputer', KNNImputer(n_neighbors=5)),
    ('scaler', StandardScaler())
])
preprocessor = ColumnTransformer(
  [
        ('num', num_transformer, num_cols),
        ('cat', cat_transformer, cat_cols)
    ],  verbose_feature_names_out=False,
)

pipe = Pipeline(steps=[('preprocessor', preprocessor),
                       ('classifier', RandomForestClassifier())])

rc_pipeline = pipe.fit(X, y)
rc_pipeline

Save the fitted pipeline as a local file using joblib

In [None]:
joblib.dump(rc_pipeline, "rc_pipeline.joblib")

Upload the file to the Snowflake stage

In [None]:
snf_session.file.put("rc_pipeline.joblib", f"@{udf_stage_name}", auto_compress=False, overwrite=True)

In [None]:
snf_session.sql(f"ls @{udf_stage_name}").show()

Create a function that loads the file using joblib. Use cachetools so that the read from the stage is only done once.

In [None]:
@cachetools.cached(cache={})
def load_joblib_file(filename):
    import joblib
    import sys
    import os

    import_dir = sys._xoptions.get("snowflake_import_directory")
    if import_dir:
        with open(os.path.join(import_dir, filename), 'rb') as file:
            m = joblib.load(file)
            return m


Create the UDF. It is important that the *imports* parameter refers to both the stage and the file. Only the file name is needed for the *load_joblib_file* function.

Since the function depends on **pandas**, **scikit-learn** and **cachetools**, we need to add them to the *packages* parameter.

We also make sure that the scikit-learn and pandas versions used by the UDF match the local ones.

In [None]:
from sklearn import __version__ as sk_version
sk_version

In [None]:
from pandas import __version__ as pd_version
pd_version
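
The version pinning can also be wrapped in a small helper that builds the list for the *packages* parameter. This is a sketch using `importlib.metadata` from the standard library; `pinned` and `build_packages` are hypothetical names, and whether a locally installed version is also available inside Snowflake is not checked here.

```python
from importlib import metadata
from typing import List


def pinned(package: str) -> str:
    """Return 'name==version' for an installed package, else the bare name."""
    try:
        return f"{package}=={metadata.version(package)}"
    except metadata.PackageNotFoundError:
        # Not installed locally: leave it unpinned (an assumption of this sketch)
        return package


def build_packages(names: List[str]) -> List[str]:
    return [pinned(name) for name in names]


print(build_packages(["pandas", "scikit-learn", "cachetools"]))
```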

In [None]:
@F.udf(name = "predict_survive_udf", is_permanent = False, imports = [f"@{udf_stage_name}/rc_pipeline.joblib"]
       , packages = [f'pandas=={pd_version}', f'scikit-learn=={sk_version}', 'cachetools'], replace = True, session = snf_session)
def predict_survive(pd_df: T.PandasDataFrame[str, str, str, float, float]) -> T.PandasSeries[int]:
    
    pd_df.columns = [*cat_cols, *num_cols]
    model = load_joblib_file('rc_pipeline.joblib') # Only call with the file name!

    return model.predict(pd_df)    

Test that the UDF works.

In [None]:
input_cols = [F.col(col) for col in [*cat_cols, *num_cols]]
train_df.with_column("PREDICTION", F.call_function("predict_survive_udf",  *input_cols)).show()

A batch UDF can also be called by providing the inputs as an array, for example

In [None]:

@F.udf(name = "predict_survive_array_udf", is_permanent = False, imports = [f"@{udf_stage_name}/rc_pipeline.joblib"]
       , packages = [f'pandas=={pd_version}', f'scikit-learn=={sk_version}', 'cachetools'], replace = True, session = snf_session)
def predict_survive_array(pd_s: T.PandasSeries[list]) -> T.PandasSeries[int]:
    pd_df = pd.DataFrame.from_dict(dict(zip(pd_s.index, pd_s.values))).T
    pd_df.columns = [*cat_cols, *num_cols]
    model = load_joblib_file('rc_pipeline.joblib') # Only call with the file name!

    return model.predict(pd_df)


In [None]:
train_df.with_column("PREDICTION", F.call_function("predict_survive_array_udf", F.array_construct(*input_cols))).show()

In some cases we want to provide a dict as input and that works as well.

In [None]:
@F.udf(name = "predict_survive_dict_udf", is_permanent = False, imports = [f"@{udf_stage_name}/rc_pipeline.joblib"]
       , packages = [f'pandas=={pd_version}', f'scikit-learn=={sk_version}', 'cachetools'], replace = True, session = snf_session)
def predict_survive_dict(pd_s: T.PandasSeries[dict]) -> T.PandasSeries[int]:
    pd_df = pd.json_normalize(pd_s)[["EMBARKED", "SEX", "PCLASS", "AGE", "FARE"]]
    model = load_joblib_file('rc_pipeline.joblib') # Only call with the file name!

    return model.predict(pd_df)

Using **object_construct** with '*' allows us to create a dict for each row, with the column names as keys and the column values as values.

In [None]:
train_df.with_column("PREDICTION", F.call_function("predict_survive_dict_udf", F.object_construct('*'))).show()
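
When rows arrive as dicts, the key order is not something the handler should rely on, which is why the UDF above selects the feature columns by name. The same idea in plain Python, as a sketch: `rows_to_matrix` is a hypothetical helper and the two sample rows are made-up values, not rows from the titanic table.

```python
from typing import Any, Dict, List

FEATURES = ["EMBARKED", "SEX", "PCLASS", "AGE", "FARE"]


def rows_to_matrix(rows: List[Dict[str, Any]], columns: List[str]) -> List[List[Any]]:
    """Pick the named keys from each dict, in a fixed column order."""
    # dict.get returns None for a key that is absent from a row
    return [[row.get(col) for col in columns] for row in rows]


sample = [
    # Keys arrive in arbitrary order and may include extra columns
    {"SURVIVED": 1, "FARE": 71.28, "SEX": "female", "AGE": 38.0, "PCLASS": "1", "EMBARKED": "C"},
    {"AGE": 22.0, "EMBARKED": "S", "SEX": "male", "PCLASS": "3", "FARE": 7.25, "SURVIVED": 0},
]

print(rows_to_matrix(sample, FEATURES))
```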

To return multiple values from a UDF, a list or dict is needed. The UDF below returns a list/array with the probabilities for each class and the predicted class.

In [None]:
@F.udf(name = "predict_survive_array_return_udf", is_permanent = False, imports = [f"@{udf_stage_name}/rc_pipeline.joblib"]
       , packages = [f'pandas=={pd_version}', f'scikit-learn=={sk_version}', 'cachetools'], replace = True, session = snf_session)
def predict_survive_array_return(pd_df: T.PandasDataFrame[str, str, str, float, float]) -> T.PandasSeries[list]:
    
    pd_df.columns = [*cat_cols, *num_cols]
    model = load_joblib_file('rc_pipeline.joblib') # Only call with the file name!
    prediction_proba = model.predict_proba(pd_df)
        
    # Get the label for the highest probability
    predicted_classes_idx = np.argmax(prediction_proba, axis=1)
    classes = model.classes_
    predicted_classes = classes[predicted_classes_idx]

    # Create a list with return values
    return_array = np.column_stack((prediction_proba, predicted_classes))

    return return_array



In [None]:
input_cols = [F.col(col) for col in [*cat_cols, *num_cols]]
train_df.with_column("RETURN_ARRAY", F.call_function("predict_survive_array_return_udf", *input_cols)).show()
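
The shape of the returned value is easier to see with plain lists. The sketch below rebuilds the "probabilities plus predicted class" row without NumPy; `proba_with_label` is a hypothetical helper and the probabilities are made-up numbers.

```python
from typing import List, Sequence


def proba_with_label(probabilities: Sequence[Sequence[float]],
                     classes: Sequence[int]) -> List[list]:
    """Append the most likely class label to each row of probabilities."""
    out = []
    for row in probabilities:
        # Index of the largest probability (first one wins on ties)
        best_idx = max(range(len(row)), key=lambda i: row[i])
        out.append([*row, classes[best_idx]])
    return out


print(proba_with_label([[0.8, 0.2], [0.35, 0.65]], classes=[0, 1]))
```

Each output row holds one probability per class followed by the predicted label, which is the layout the caller has to unpack on the Snowflake side.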

#### Using SnowflakeFile
**SnowflakeFile** allows us to read files from a Snowflake stage without using **add_import**. Instead, we pass a reference to the file as an input to the UDxF.

See the end of the UDTF section for an example.

## UDTF

A User Defined Table Function (UDTF) is a function that returns zero, one, or multiple rows for each input row.

When creating a UDTF, a Python class has to be used as the handler.

A UDTF handler class implements the following, which Snowflake invokes at run time:
* An **__init__** method. Optional. Invoked to initialize stateful processing of input partitions.
* A **process** method. Required. Invoked for each input row. The method returns a tabular value as tuples.
* An **end_partition** method. Optional. Invoked to finalize processing of input partitions.
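
The call order of these three methods can be simulated locally. This is a sketch only: `run_udtf` is a hypothetical driver that imitates how partitions are processed, and `WordCountHandler` is a small illustrative handler; neither is part of Snowpark.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


class WordCountHandler:
    def __init__(self) -> None:
        self.total = 0  # reset for every partition

    def process(self, text: str) -> Iterable[Tuple[str, int]]:
        words = text.split()
        self.total += len(words)
        yield from Counter(words).items()

    def end_partition(self) -> Iterable[Tuple[str, int]]:
        yield ("partition_total", self.total)


def run_udtf(handler_cls, rows: List[Tuple[str, str]]) -> Dict[str, list]:
    """Group (key, text) rows by key and drive one handler per partition."""
    partitions = defaultdict(list)
    for key, text in rows:
        partitions[key].append(text)

    results = {}
    for key, texts in partitions.items():
        handler = handler_cls()                  # __init__: once per partition
        out = []
        for text in texts:
            out.extend(handler.process(text))    # process: once per row
        out.extend(handler.end_partition())      # end_partition: once at the end
        results[key] = out
    return results


udtf_result = run_udtf(WordCountHandler, [("1", "w1 w2 w2"), ("1", "w3"), ("2", "w4 w4")])
print(udtf_result)
```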

A UDTF can be created using the **@udtf** decorator, the **udtf** function or the **udtf.register** method of the session object. It can be permanent or temporary.

Start with a simple UDTF that splits a string into words and, for each unique word, returns a row with the word and the number of its occurrences in the string. We need to provide the output schema, i.e. the columns of the returned rows. If only names are provided, the data types are taken from the type hints of the **process** method.

In [None]:
@F.udtf(name="word_count_udtf", output_schema=["word", "count"], is_permanent=False, replace=True, session=snf_session)
class MyWordCount:
    # Called once for each partition
    def __init__(self):
        self._total_per_partition = 0
    
    # Called for each row in a partition
    def process(self, s1: str) -> Iterable[Tuple[str, int]]:
        words = s1.split()
        self._total_per_partition += len(words)
        # Counter returns a dict with the unique words as keys and the number of occurrences as values
        counter = Counter(words) 
        yield from counter.items()
    
    # Called after the last row in a partition has been processed
    def end_partition(self):
        yield ("partition_total", self._total_per_partition)

Test the UDTF. By using **session.table_function** we get a new DataFrame with the data generated by the UDTF.

In [None]:
df_udtf = snf_session.table_function("word_count_udtf", F.lit("w1 w2 w2 w3 w3 w3"))
df_udtf.show()

We can also use it with a DataFrame, using call_table_function

In [None]:
df_udtf_data = snf_session.create_dataframe([["w1 w2 w2 w3 w3 w3"]], schema=["text"])
df_udtf_data.show()

In [None]:
df_udtf_data.select(F.call_table_function("word_count_udtf", F.col("TEXT"))).show()

If we want to do the split/count by a column, the partition_by parameter can be used.

In [None]:
df_udtf_part_data = snf_session.create_dataframe([["1", "w1 w2 w2 w3 w3 w3"], ["2", "w4 w4 w4 w4 w1"]], schema=["partition","text"])
df_udtf_part_data.show()

In [None]:
df_udtf_part_data.select("partition", F.call_table_function("word_count_udtf", F.col("TEXT")).over(partition_by="partition")).show()

Another example is a UDTF that generates a list of the previous rows' values, including the current one.

In [None]:
@F.udtf(name="collect_list", is_permanent=False, replace=True, packages=["typing"], output_schema=T.StructType([T.StructField("list", T.ArrayType())]), session=snf_session)
class collect_list_handler:
    def __init__(self) -> None:
        self.list = []
    def process(self, element: float) -> Iterable[Tuple[list]]:
        self.list.append(element)
        yield (self.list,)


In [None]:
train_df.with_column("collect_list", F.call_table_function("collect_list", F.col("FARE"))).show(5)
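
The expected shape of the output is a growing list per row. The local sketch below reproduces it in plain Python. `CollectListLocal` is a hypothetical stand-in for the handler, the fares are made-up values, and the copy is a local precaution: in plain Python, yielding the same list object would make every collected row point at the final list. Whether that matters inside Snowflake is not covered by this notebook.

```python
from typing import Iterable, List, Tuple


class CollectListLocal:
    def __init__(self) -> None:
        self.items: List[float] = []

    def process(self, element: float) -> Iterable[Tuple[List[float]]]:
        self.items.append(element)
        # Yield a copy so rows collected later do not alias the growing list
        yield (list(self.items),)


handler = CollectListLocal()
running = [row[0] for fare in [7.25, 71.28, 7.92] for row in handler.process(fare)]
print(running)
```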

We can also use a UDTF for scoring, for example if we want to return multiple columns. Keep in mind that this will be row-by-row execution.

The example below uses the scikit-learn pipeline we trained earlier to return the probabilities for 0 and 1 and the predicted class.

In [None]:
@F.udtf(name="predict_survive_udtf", is_permanent=False, replace=True, packages=['typing', f'pandas=={pd_version}', 'numpy', 'joblib', f'scikit-learn=={sk_version}']
        , imports = [f"@{udf_stage_name}/rc_pipeline.joblib"]
        , output_schema=T.StructType([T.StructField("prob_0", T.FloatType()), T.StructField("prob_1", T.FloatType()), T.StructField("prediction", T.StringType())]), session=snf_session)
class predict_survive_handler:
    # We load the model from stage at the start of each partition
    def __init__(self) -> None:
        import_dir = sys._xoptions.get("snowflake_import_directory")
        with open(os.path.join(import_dir, 'rc_pipeline.joblib'), 'rb') as file:
            self.model = joblib.load(file)
        self.classes = self.model.classes_
        
    # Score each input row
    def process(self, embarked: str, sex: str, pclass: str, age: float, fare: float) -> Iterable[Tuple[float, float, str]]:
        # Create a Pandas DataFrame of the input values
        pd_df = pd.DataFrame([[embarked, sex, pclass, age, fare]], columns=["EMBARKED","SEX", "PCLASS", "AGE", "FARE"])
        
        # Get the probabilities for 0/1
        prediction_proba = self.model.predict_proba(pd_df)[0]
        
        # Get the label for the highest probability
        predicted_class_idx = np.argmax(prediction_proba)
        predicted_class = self.classes[predicted_class_idx]
        
        # Create a list with return values
        return_list = prediction_proba.tolist()
        return_list.append(predicted_class)
        
        # Return the list as a tuple
        yield tuple(return_list)


In [None]:
train_df = train_df.with_column("PCLASS", F.to_varchar(F.col("PCLASS")))
train_df.select( *input_cols, F.call_table_function("predict_survive_udtf", *input_cols)).show()

A UDTF can also be vectorized, just like a UDF. In the case above, where we do not need to process each row individually, a vectorized UDTF makes more sense and is often more performant.

In [None]:
@F.udtf(name="predict_survive_batch_2_udtf", is_permanent=True, replace=True, stage_location=udf_stage_name
        , packages=['typing', f'pandas=={pd_version}', 'numpy', 'joblib', f'scikit-learn=={sk_version}']
        , imports = [f"@{udf_stage_name}/rc_pipeline.joblib"]
        , input_types=[T.PandasDataFrameType([T.StringType(), T.StringType(), T.StringType(), T.FloatType(), T.FloatType()])]
        , output_schema=T.PandasDataFrameType([T.FloatType(), T.FloatType(), T.StringType()], ["PROB_0", "PROB_1", "PREDICTION"])
        #, output_schema=T.PandasDataFrameType([T.StringType()], ["info"])
        , session=snf_session)
class predict_survive_batch_handler:
    # We load the model from stage at the start of each partition
    def __init__(self) -> None:
        import_dir = sys._xoptions.get("snowflake_import_directory")
        with open(os.path.join(import_dir, 'rc_pipeline.joblib'), 'rb') as file:
            self.model = joblib.load(file)
        self.classes = self.model.classes_
        
    # Score all input rows
    def end_partition(self, pd_df):
        # Set the column name of the provided Pandas DataFrame
        pd_df.columns=["EMBARKED","SEX", "PCLASS", "AGE", "FARE"]
        
        # Get the probabilities for 0/1
        prediction_proba = self.model.predict_proba(pd_df)
        
        # Get the label for the highest probability
        predicted_class_idx = prediction_proba.argmax(axis=1)
        predicted_class = self.classes[predicted_class_idx]
        predicted_class = np.expand_dims(predicted_class, axis=0)
        
        # Combine the probabilities and the predicted class into one array
        return_list = np.concatenate((prediction_proba, predicted_class.T), axis=1)
        
        # Return the array as a pandas DataFrame
        yield pd.DataFrame(return_list)

Test the vectorized UDTF. The input columns will all have NULL values in the returned DataFrame.

In [None]:
input_df = train_df.with_column("PCLASS", F.to_varchar(F.col("PCLASS"))).with_column("P_ID", F.lit(1)).filter(F.col("EMBARKED").is_not_null())
input_df.select(*input_cols,  F.col("P_ID") 
                , F.call_table_function("predict_survive_batch_2_udtf", *input_cols).over(partition_by=["P_ID"])).show()

#### Using SnowflakeFile with UDTF
**SnowflakeFile** allows us to read files from a Snowflake stage without using **add_import**. Instead, we pass a reference to the file as an input to the UDxF.

A good use case is extracting data from files in formats that are not supported out of the box by Snowflake. In this example we will get data from a fixed-width file.

Start by creating a fixed width file.

In [None]:
data1 = (
    "id8141    360.242940   149.910199   11950.7\n"
    "id1594    444.953632   166.985655   11788.4\n"
    "id1849    364.136849   183.628767   11806.2\n"
    "id1230    413.836124   184.375703   11916.8\n"
    "id1948    502.953953   173.237159   12468.3"
)
with open(f"{data_path}fixed_width.txt", "w") as f:
    f.write(data1)

Upload the file to a stage and enable a directory table on it, see https://docs.snowflake.com/en/user-guide/data-load-dirtables

In [None]:
snf_session.file.put(f"{data_path}fixed_width.txt", f"@{udf_stage_name}/fixed_width/", auto_compress=False, overwrite=True)
# We need to enable directory table on the stage and then refresh it so the files are visible
snf_session.sql("ALTER STAGE " + udf_stage_name + " SET DIRECTORY = (ENABLE = TRUE)").collect()
snf_session.sql(f"ALTER STAGE {udf_stage_name} REFRESH").collect()

Create a UDTF that opens the file, reads it into a pandas DataFrame based on the **colspecs** argument, and then returns it as rows and columns.

In [None]:
from snowflake.snowpark.files import SnowflakeFile

@F.udtf(name="extract_fixed_width_udtf", output_schema=["col1", "col2", "col3", "col4"], is_permanent=False, replace=True, session=snf_session, packages=['snowflake-snowpark-python'])
class ExtractFixedWidth:
    # Called for each row
    def process(self, file_path: str, colspecs: list) -> Iterable[Tuple[str, float, float, float]]:
        # Open the file from the stage in binary mode
        with SnowflakeFile.open(file_path, 'rb', require_scoped_url=False) as f:
            return_pd =  pd.read_fwf(f, colspecs=colspecs, header=None) 
        
        yield from list(return_pd.itertuples(index=False, name=None))
    

Get the path to the file we uploaded

In [None]:
input_file = snf_session.sql(f"ls @{udf_stage_name}/fixed_width/").collect()[0][0]
input_file

Extract the rows and columns from the file. **colspecs** holds the start and end position of each column.

In [None]:
colspecs = [(0, 6), (8, 20), (21, 33), (34, 43)]
df_udtf = snf_session.table_function("extract_fixed_width_udtf", F.lit(f"@{input_file}"), F.lit(colspecs))
df_udtf.show()
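
How the column specification maps onto a line can be checked locally without pandas. In this sketch each pair is treated as a half-open `[start, end)` slice, which is how `pd.read_fwf` interprets `colspecs`; `parse_fixed_width` is a hypothetical helper that handles a single line of the file created above.

```python
from typing import List, Tuple


def parse_fixed_width(line: str, colspecs: List[Tuple[int, int]]) -> Tuple[str, float, float, float]:
    """Slice one line by half-open [start, end) offsets and convert the fields."""
    fields = [line[start:end].strip() for start, end in colspecs]
    return (fields[0], float(fields[1]), float(fields[2]), float(fields[3]))


colspecs = [(0, 6), (8, 20), (21, 33), (34, 43)]
line = "id8141    360.242940   149.910199   11950.7"

print(parse_fixed_width(line, colspecs))
```

Padding inside a slice is stripped, so the offsets only need to enclose each field, not match it exactly.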

# Using UDF in other languages

A UDF can be created using Python, Java, Scala, JavaScript or SQL, and can be called from any of the supported languages.

Start by creating a Scala UDF using SQL:

In [None]:
scala_udf_sql = """CREATE OR REPLACE TEMP FUNCTION scala_double_it(x INTEGER)
RETURNS INTEGER
LANGUAGE SCALA
CALLED ON NULL INPUT
RUNTIME_VERSION = 2.12
HANDLER='Double.doubleIt'
AS
$$
class Double {
  def doubleIt(x : Integer): Integer = {
    return 2*x;
  }
}
$$;
"""

snf_session.sql(scala_udf_sql).collect()

Create a Python UDF

In [None]:
@F.udf(name="python_echo_varchar", is_permanent=False, replace=True, session=snf_session)
def python_echo_varchar(ds: T.PandasSeries[str]) -> T.PandasSeries[str]:
    return ds.apply(lambda x: f"Hello {x}, I'm a Python UDF!")

Create a DataFrame with test data

In [None]:
test_udf_df = snf_session.create_dataframe([['Mats', 2], ['Pia', 3]], schema=["name", "value"])
test_udf_df.show()

Use both the Scala and Python UDFs

In [None]:
test_udf_df.select(F.call_function("scala_double_it", F.col("value")).as_("SCALA_UDF_RES"), 
                   F.call_function("python_echo_varchar", F.col("name")).as_("PYTHON_UDF_RES")).show()

In [None]:
# Closing the session will drop all temporary objects created
snf_session.close()