# PySpark - collect()
## PySpark collect() – Retrieve data from DataFrame
PySpark RDD/DataFrame collect() is an action that retrieves all the elements of the dataset (from all nodes) to the driver node. Use collect() on smaller datasets, usually after operations such as filter() or a groupBy() aggregation have reduced the data. Retrieving a larger dataset can cause an OutOfMemory error on the driver.

In [1]:
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("SparkByExamples.com").getOrCreate()

In [2]:
dept = [
    ("Finance", 10),
    ("Marketing", 20),
    ("Sales", 30),
    ("IT", 40),
]
deptColumns = ["dept_name", "dept_id"]
deptDF = spark.createDataFrame(data=dept, schema=deptColumns)
deptDF.show(truncate=False)

+---------+-------+
|dept_name|dept_id|
+---------+-------+
|Finance  |10     |
|Marketing|20     |
|Sales    |30     |
|IT       |40     |
+---------+-------+



Let's use collect() to retrieve the data.<br>
deptDF.collect() retrieves all the elements of the DataFrame to the driver node as a list of Row objects. Printing the resulting list yields the output below.

In [3]:
dataCollect = deptDF.collect()
print(dataCollect)

[Row(dept_name='Finance', dept_id=10), Row(dept_name='Marketing', dept_id=20), Row(dept_name='Sales', dept_id=30), Row(dept_name='IT', dept_id=40)]


Note that collect() is an action, so it does not return a DataFrame; instead, it returns the data to the driver as a list. Once the data is in a list, you can process it further with a Python for loop.

In [5]:
for row in dataCollect:
    print(row['dept_name'] + "," +str(row['dept_id']))

Finance,10
Marketing,20
Sales,30
IT,40


In [6]:
# Returns the value of the first row, first column, which is "Finance"
deptDF.collect()[0][0]

'Finance'

* deptDF.collect() returns a list of Row objects.
* deptDF.collect()[0] returns the first element of the list (the first row).
* deptDF.collect()[0][0] returns the value in the first row and first column.

To return only certain columns of a DataFrame, call the PySpark select() transformation first.

In [7]:
dataCollect = deptDF.select("dept_name").collect()
print(dataCollect)

[Row(dept_name='Finance'), Row(dept_name='Marketing'), Row(dept_name='Sales'), Row(dept_name='IT')]


### When to avoid collect()
collect() is usually used to retrieve the output of an action when the result set is very small. Calling collect() on an RDD/DataFrame with a larger result set can cause an out-of-memory error, because it returns the entire dataset (from all workers) to the driver. For that reason, avoid calling collect() on a larger dataset.

### collect() vs select()
select() is a transformation that returns a new DataFrame containing only the selected columns, whereas collect() is an action that returns the entire dataset to the driver as a list of Row objects.