In [2]:
# Use the findspark library to locate the Spark installation on the local machine
import findspark
findspark.init(r'C:\spark\spark-3.5.0-bin-hadoop3')
import pyspark # only run this after findspark.init()

from pyspark.sql import SparkSession

In [3]:
spark = SparkSession.builder.master("local[1]") \
                    .appName('SparkByExamples.com') \
                    .getOrCreate()

In [5]:
data = [("James", "Smith", "USA", "CA"), ("Michael", "Rose", "USA", "NY"),
        ("Robert", "Williams", "USA", "CA"), ("Maria", "Jones", "USA", "FL")]
columns=["firstname","lastname","country","state"]
df=spark.createDataFrame(data=data,schema=columns)
df.show()
print(df.collect())

+---------+--------+-------+-----+
|firstname|lastname|country|state|
+---------+--------+-------+-----+
|    James|   Smith|    USA|   CA|
|  Michael|    Rose|    USA|   NY|
|   Robert|Williams|    USA|   CA|
|    Maria|   Jones|    USA|   FL|
+---------+--------+-------+-----+

[Row(firstname='James', lastname='Smith', country='USA', state='CA'), Row(firstname='Michael', lastname='Rose', country='USA', state='NY'), Row(firstname='Robert', lastname='Williams', country='USA', state='CA'), Row(firstname='Maria', lastname='Jones', country='USA', state='FL')]


### Explanation

df.rdd: Here df is a PySpark DataFrame. The rdd attribute exposes it as an RDD (Resilient Distributed Dataset) whose elements are Row objects. The RDD is Spark's fundamental data structure: a distributed collection of data that can be processed in parallel across a cluster of machines.

.map(lambda x: x[3]): This applies a map transformation to the RDD. map takes a function (here an anonymous lambda) and applies it to every element of the RDD. The lambda x: x[3] receives each row x and returns the value at index 3, which is the state column. This works because PySpark Row objects support access by position as well as by column name.

.collect(): Finally, the collect action is called on the RDD. It retrieves every element produced by the map transformation and returns them to the driver as a local Python list. Here, all the values of the fourth column (index 3 with 0-based indexing) are collected into a Python list named states1.

Once the line in the next cell has run, states1 holds every value of the fourth column as a list in the local Python environment. Avoid collect() on large RDDs or DataFrames: it moves all the data from the distributed Spark cluster into the driver's memory, which can be slow and can exhaust that memory. Where possible, filter, aggregate, or deduplicate with Spark's distributed operations first and collect only the reduced result.

In [7]:
states1=df.rdd.map(lambda x: x[3]).collect()
print(states1)

#['CA', 'NY', 'CA', 'FL']

['CA', 'NY', 'CA', 'FL']


In [8]:
from collections import OrderedDict

# Drop duplicate states while keeping the order in which they first appear
res = list(OrderedDict.fromkeys(states1))
print(res)

['CA', 'NY', 'FL']
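
On Python 3.7 and later, a plain dict also preserves insertion order, so the same order-preserving deduplication can be sketched without importing OrderedDict. The list is repeated here only so the snippet runs on its own.

```python
# Same values as states1 from the earlier cell
states1 = ['CA', 'NY', 'CA', 'FL']

# dict.fromkeys keeps the first occurrence of each key, in insertion order
res = list(dict.fromkeys(states1))
print(res)  # ['CA', 'NY', 'FL']
```

By contrast, list(set(states1)) would also remove duplicates but would not guarantee the original order.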
