# Column to list performance

In PySpark, there are often several ways to accomplish the same task. A common example is collecting a column's values into a Python list.

Given a starting dataset containing two columns, `mvv` and `index`, here are five methods that produce an identical list of `mvv` values:



In [1]:
from pyspark.sql import SparkSession
from pathlib import Path
import os
import warnings

warnings.filterwarnings('ignore')

os.chdir("../..")
from benchmarks.visualize_benchmarks import parse_results, show_boxplot, show_line_plot

## Implementations

In [2]:
spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("docs/data/mvv_xsmall").limit(5)



In [3]:
# 1. toPandas()
list(df.select('mvv').toPandas()['mvv'])

[0, 1, 2, 3, 4]

In [4]:
# 2. flatMap
df.select('mvv').rdd.flatMap(lambda x: x).collect()

[0, 1, 2, 3, 4]

In [5]:
# 3. map
df.select('mvv').rdd.map(lambda row: row[0]).collect()

[0, 1, 2, 3, 4]

In [6]:
# 4. collect() list comprehension
[row[0] for row in df.select('mvv').collect()]

[0, 1, 2, 3, 4]

In [7]:
# 5. toLocalIterator() list comprehension
[row[0] for row in df.select('mvv').toLocalIterator()]

[0, 1, 2, 3, 4]

***

Although the resulting lists are equal, the time it takes to create them is not. The difference grows dramatically for larger datasets.
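For a quick local check of this difference, a small timing helper can be enough. The sketch below is not the benchmark harness used to produce the results in this notebook; `average_runtime` is a hypothetical helper that simply averages wall-clock time over a few runs of any zero-argument callable.

```python
import time
from statistics import mean


def average_runtime(func, repeats=3):
    """Return the mean wall-clock time in seconds over `repeats` calls of `func`.

    Hypothetical helper for ad hoc comparisons; not part of quinn or PySpark.
    """
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()  # run the operation being measured
        durations.append(time.perf_counter() - start)
    return mean(durations)


# Example usage, assuming `df` is the DataFrame loaded earlier:
# average_runtime(lambda: df.select('mvv').rdd.flatMap(lambda x: x).collect())
# average_runtime(lambda: [row[0] for row in df.select('mvv').collect()])
```

Timings from a helper like this include Spark job scheduling overhead and vary between runs, so they are best read as rough comparisons rather than precise measurements.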

In [8]:
result_df, average_df = parse_results(spark)

In [9]:
show_boxplot(result_df)

In [10]:
show_line_plot(average_df)

***

The line plot shows runtime in seconds against dataset size on a logarithmic scale. All implementations perform similarly at 1K and 100K rows. Beyond 100K rows, `toPandas()` exhibits a roughly linear increase in runtime, while the other methods increase more rapidly.

`toPandas()` is consistently the fastest method across all tested dataset sizes. However, pyarrow and pandas are not required dependencies of `quinn`, so this method only works when those packages are installed. For typical Spark workloads, the `flatMap` implementation is the next best option to use by default.
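The recommendation above can be sketched as a helper that prefers `toPandas()` when the optional packages are importable and falls back to `flatMap` otherwise. This is a minimal illustration, not necessarily how `quinn` implements it: the function name `column_values` is hypothetical, and the check only tests whether pandas and pyarrow can be found, without verifying versions or Spark's Arrow configuration.

```python
import importlib.util


def column_values(df, col_name):
    """Collect one column of a Spark DataFrame into a Python list.

    Hypothetical helper: uses toPandas() when pandas and pyarrow are both
    importable, and the RDD flatMap approach otherwise.
    """
    has_pandas = importlib.util.find_spec("pandas") is not None
    has_pyarrow = importlib.util.find_spec("pyarrow") is not None

    if has_pandas and has_pyarrow:
        # Fastest option in the benchmarks above (method 1)
        return list(df.select(col_name).toPandas()[col_name])

    # Default fallback with no optional dependencies (method 2)
    return df.select(col_name).rdd.flatMap(lambda x: x).collect()


# Example usage, assuming `df` is the DataFrame loaded earlier:
# column_values(df, 'mvv')
```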