[Connect to Spark Connect server](https://spark.apache.org/docs/latest/api/python/getting_started/quickstart_connect.html)

In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

In [2]:
spark.version

'3.5.2'

[Quickstart: DataFrame](https://spark.apache.org/docs/latest/api/python/getting_started/quickstart_df.html)

Use [Faker](https://faker.readthedocs.io/en/master/) to generate fake data.

In [3]:
from faker import Faker
fake = Faker()

In [4]:
from pyspark.sql.functions import lit

names_df = spark.range(5).withColumn('name', lit(fake.name()))
names_df.show()

+---+--------------+
| id|          name|
+---+--------------+
|  0|Christina Dean|
|  1|Christina Dean|
|  2|Christina Dean|
|  3|Christina Dean|
|  4|Christina Dean|
+---+--------------+
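
Every row gets the same name because `fake.name()` is an ordinary Python call that runs once, on the driver, before `lit` ever sees it; `lit` then wraps that single string as a constant column. The plain-Python sketch below mimics this without Spark. `SAMPLE_NAMES`, `make_name` and `with_constant_column` are hypothetical stand-ins for Faker's data, `fake.name` and `withColumn(..., lit(...))`; they are not part of any library.

```python
import random

# Hypothetical sample data standing in for Faker's name provider.
SAMPLE_NAMES = ["Ada Lovelace", "Grace Hopper", "Alan Turing", "Edsger Dijkstra"]


def make_name():
    """Stand-in for fake.name(): returns a random name on each call."""
    return random.choice(SAMPLE_NAMES)


def with_constant_column(ids, value):
    """Stand-in for withColumn('name', lit(value)): pairs every id with the same value."""
    return [(row_id, value) for row_id in ids]


# make_name() is evaluated once, right here, before with_constant_column is called.
rows = with_constant_column(range(5), make_name())
for row in rows:
    print(row)
```

Whichever name is drawn, all five rows carry it, which matches the table above.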



In [5]:
names = [fake.name() for _ in range(5)]
print(names)

['Daniel Mcdonald', 'April Kelly', 'Mark Conway', 'Rebecca Johnson', 'Brian Clay MD']


In [6]:
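# One 1-tuple per row, so each name becomes a row with a single 'name' column
names_df = spark.createDataFrame([(name,) for name in names], ['name'])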
names_df.show()

+---------------+
|           name|
+---------------+
|Daniel Mcdonald|
|    April Kelly|
|    Mark Conway|
|Rebecca Johnson|
|  Brian Clay MD|
+---------------+



In [7]:
from pyspark.sql.functions import udf

@udf
def generate_name() -> str:
    return fake.name()

In [8]:
names_df = spark.range(5).withColumn('name', generate_name())
names_df.show()

+---+---------------+
| id|           name|
+---+---------------+
|  0|William Mcguire|
|  1|William Mcguire|
|  2|William Mcguire|
|  3|William Mcguire|
|  4|William Mcguire|
+---+---------------+



The cell above is likely to fail when the driver and the executors run different Python versions (PySpark expects the same minor version on both sides).

Use the `PYSPARK_PYTHON` environment variable to align the Python versions.

Here, `poetry` manages the Python version on the driver.

```console
$ poetry env info
Virtualenv
Python:         3.12.5
Implementation: CPython
Path:           /Users/jacek/Library/Caches/pypoetry/virtualenvs/jupyter-spark-CV1FMzPj-py3.12
Executable:     /Users/jacek/Library/Caches/pypoetry/virtualenvs/jupyter-spark-CV1FMzPj-py3.12/bin/python
Valid:          True

System
Platform:   darwin
OS:         posix
Python:     3.12.5
Path:       /usr/local/opt/python@3.12/Frameworks/Python.framework/Versions/3.12
Executable: /usr/local/opt/python@3.12/Frameworks/Python.framework/Versions/3.12/bin/python3.12
```

Given the output above, `PYSPARK_PYTHON` should point to the `Executable` path from the `Virtualenv` section.
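
The same path can usually be read from inside the notebook, assuming the Jupyter kernel runs in that Poetry virtualenv (an assumption about this setup, not something the output above guarantees). A minimal sketch using only the standard library; `driver_python` is a hypothetical helper name:

```python
import sys


def driver_python():
    """Return (executable path, 'major.minor') of the interpreter running this code."""
    version = f"{sys.version_info.major}.{sys.version_info.minor}"
    return sys.executable, version


executable, version = driver_python()
# Print a line that can be pasted into the shell that starts the Spark Connect server.
print(f"export PYSPARK_PYTHON={executable}  # Python {version}")
```

If the printed path differs from the `Executable` reported by `poetry env info`, the kernel is probably running a different interpreter than expected.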

Stop the Spark Connect server.

```bash
./sbin/stop-connect-server.sh
```

Set the `PYSPARK_PYTHON` environment variable.

```bash
export PYSPARK_PYTHON=/Users/jacek/Library/Caches/pypoetry/virtualenvs/jupyter-spark-CV1FMzPj-py3.12/bin/python
```

Start the Spark Connect server again.

```bash
./sbin/start-connect-server.sh
```

Re-run the cell above and it should work fine.

QUESTION: Why does the output contain the same name in every row?

Read **Notes** in [pyspark.sql.functions.udf](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.udf.html):

> The user-defined functions are considered deterministic by default.
> Due to optimization, duplicate invocations may be eliminated or the function may even be invoked more times than it is present in the query.
> If your function is not deterministic, call `asNondeterministic` on the user defined function.

In [14]:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

generate_name_udf = udf(lambda: fake.name(), StringType()).asNondeterministic()

In [15]:
names_df = spark.range(5).withColumn('name', generate_name_udf())
names_df.show()

+---+---------------+
| id|           name|
+---+---------------+
|  0|William Mcguire|
|  1|William Mcguire|
|  2|William Mcguire|
|  3|William Mcguire|
|  4|William Mcguire|
+---+---------------+



It does not work either! 😢
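
One plausible explanation (an assumption, not verified against Faker or Spark internals here): the `fake` object created on the driver is serialized together with the UDF, including the state of its random number generator. Every task that deserializes it then starts from the same state and draws the same first name, and `asNondeterministic` cannot change that because it only affects how the query optimizer treats the function. The standard-library sketch below mimics the effect with `pickle` and `random.Random`; `SAMPLE_NAMES` and `first_name_in_task` are hypothetical names used for illustration only.

```python
import pickle
import random

# Hypothetical sample data standing in for Faker's name provider.
SAMPLE_NAMES = ["Ada Lovelace", "Grace Hopper", "Alan Turing", "Edsger Dijkstra"]

# "Driver": a generator whose state is serialized once.
driver_rng = random.Random(2024)
payload = pickle.dumps(driver_rng)


def first_name_in_task(serialized_rng):
    """Simulate one task: restore the shipped generator and draw its first name."""
    task_rng = pickle.loads(serialized_rng)
    return task_rng.choice(SAMPLE_NAMES)


# Five "tasks", each restoring the same pickled state: all draws are identical.
shipped = [first_name_in_task(payload) for _ in range(5)]
print(shipped)

# Five "tasks", each creating its own generator seeded by the OS: draws usually differ.
fresh = [random.Random().choice(SAMPLE_NAMES) for _ in range(5)]
print(fresh)
```

If this is what happens, creating the generator on the worker side, rather than shipping one from the driver, should give each row its own name.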

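Creating the generator inside the UDF with `faker.Factory` seems to be the answer: each call then builds its generator on the Python worker instead of reusing the `fake` object captured on the driver.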

In [16]:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
from faker import Factory

@udf(returnType=StringType())
def fake_name():
    # Create the generator inside the UDF, i.e. on the Python worker
    faker = Factory.create()
    return faker.name()

In [17]:
names_df = spark.range(5).withColumn('name', fake_name())
names_df.show()

+---+----------------+
| id|            name|
+---+----------------+
|  0|       John Cook|
|  1|     Tanya Crane|
|  2|Brooke Villa DDS|
|  3|    Julian Velez|
|  4| Angelica Malone|
+---+----------------+
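
Calling `Factory.create()` for every row works, but it builds a new generator each time. A possible variation, offered only as a sketch that was not run against this setup, is a pandas UDF that creates one generator per batch of rows. It assumes `pandas` and `pyarrow` are installed on both the client and the server; `make_fake_name_udf` and `fake_name_batch` are hypothetical names.

```python
def make_fake_name_udf():
    """Build a pandas UDF that creates one Faker generator per batch (sketch)."""
    # Imports are kept local so that defining this helper does not require Spark.
    import pandas as pd
    from faker import Factory
    from pyspark.sql.functions import pandas_udf

    @pandas_udf("string")
    def fake_name_batch(ids: pd.Series) -> pd.Series:
        # One generator per batch, created on the Python worker.
        faker = Factory.create()
        # The input column is only used to know how many names to produce.
        return pd.Series([faker.name() for _ in range(len(ids))])

    return fake_name_batch
```

It would be used as `spark.range(5).withColumn('name', make_fake_name_udf()('id'))`; the `id` column is passed only because a pandas UDF needs an input column to size its batches.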

