In [1]:
# Use the findspark library to locate Spark on the local machine
import findspark
findspark.init(r'C:\spark\spark-3.5.0-bin-hadoop3')
import pyspark # only run this after findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()

states = {"NY":"New York", "CA":"California", "FL":"Florida"}
broadcastStates = spark.sparkContext.broadcast(states)

data = [("James","Smith","USA","CA"),
    ("Michael","Rose","USA","NY"),
    ("Robert","Williams","USA","CA"),
    ("Maria","Jones","USA","FL")
  ]

rdd = spark.sparkContext.parallelize(data)

broadcastStates is a broadcast variable, created above with spark.sparkContext.broadcast(states). A broadcast variable shares read-only data with every task in a distributed computation: Spark ships the value to each executor once and caches it there, rather than sending a copy along with every task.

broadcastStates.value returns the broadcast data on whichever node the code runs. Here that is the states dictionary, which maps two-letter state codes to full state names.

code is the argument passed to the state_convert function defined in the next cell. It is expected to be a two-letter state code such as "CA" for California.

Inside the function, broadcastStates.value[code] looks up the given code in the broadcast dictionary and returns the matching state name.

For example, state_convert("CA") returns "California".
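One thing to keep in mind is that broadcastStates.value[code] raises a KeyError when a row carries a code that is not in the dictionary. Below is a minimal sketch of a more forgiving lookup, assuming that an unknown code should simply be passed through unchanged (that policy is an assumption, not something the notebook specifies). So that the sketch runs without a Spark installation, FakeBroadcast is a hypothetical stand-in that exposes only a .value attribute, and state_convert_safe is a hypothetical name; in the notebook you would use the real broadcastStates.

```python
# Hypothetical stand-in for a Spark broadcast variable: it exposes only `.value`.
class FakeBroadcast:
    def __init__(self, value):
        self.value = value

states = {"NY": "New York", "CA": "California", "FL": "Florida"}

# In the notebook this would be spark.sparkContext.broadcast(states).
broadcastStates = FakeBroadcast(states)

def state_convert_safe(code):
    # dict.get returns the fallback (here, the code itself) instead of
    # raising KeyError when the code is missing from the dictionary.
    return broadcastStates.value.get(code, code)

print(state_convert_safe("CA"))  # known code
print(state_convert_safe("TX"))  # unknown code, passed through unchanged
```

Whether to pass the code through, substitute a placeholder, or let the error surface depends on how strict the pipeline should be about bad input.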

This approach is often used in Spark applications to efficiently share reference data (like state codes and their corresponding names) across tasks running on different worker nodes in a distributed environment.

In [2]:
rdd = spark.sparkContext.parallelize(data)

def state_convert(code):
    return broadcastStates.value[code]

result = rdd.map(lambda x: (x[0],x[1],x[2],state_convert(x[3]))).collect()
print(result)


[('James', 'Smith', 'USA', 'California'), ('Michael', 'Rose', 'USA', 'New York'), ('Robert', 'Williams', 'USA', 'California'), ('Maria', 'Jones', 'USA', 'Florida')]
