### Transformations 2
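##### join Function
- The join function in PySpark combines two key-value RDDs by key.
- It returns one `(key, (value1, value2))` element for each pair of values whose key appears in both RDDs; keys found in only one RDD are dropped.
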
In [0]:
# 01 Example: Join two RDDs by key
fruits = sc.parallelize([(1, "apple"), (2, "banana")])
colors = sc.parallelize([(1, "red"), (2, "yellow")])
fruits_color_join = fruits.join(colors).collect()
print("01 join example (join two RDDs):", fruits_color_join)

01 join example (join two RDDs): [(1, ('apple', 'red')), (2, ('banana', 'yellow'))]


In [0]:
# 02 Example: Join employee data with department data
employees = sc.parallelize([(1, "John"), (2, "Jane"), (3, "Joe")])
departments = sc.parallelize([(1, "HR"), (2, "Finance")])
employees_department_join = employees.join(departments).collect()
print("02 join example (employee-department join):", employees_department_join)

02 join example (employee-department join): [(1, ('John', 'HR')), (2, ('Jane', 'Finance'))]


##### cogroup Function
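- The cogroup function in PySpark groups the data from two RDDs by key.
- For every key that appears in either RDD, it returns a tuple of two iterables: the values from the first RDD and the values from the second. A key missing from one RDD gets an empty iterable on that side.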
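To make the shape of the result concrete, here is a minimal plain-Python sketch of what cogroup computes on two small lists of `(key, value)` pairs. It only models the semantics and is not how Spark implements it; `cogroup_local` is a hypothetical helper name, and no Spark session is needed to run it.

```python
from collections import defaultdict

def cogroup_local(left, right):
    """Plain-Python model of cogroup on two lists of (key, value) pairs."""
    # Each key maps to a pair of lists: (values from left, values from right)
    groups = defaultdict(lambda: ([], []))
    for key, value in left:
        groups[key][0].append(value)
    for key, value in right:
        groups[key][1].append(value)
    return dict(groups)

fruits = [(1, "apple"), (2, "banana"), (3, "orange")]
colors = [(1, "red"), (2, "yellow")]
result = cogroup_local(fruits, colors)
print(result)
# {1: (['apple'], ['red']), 2: (['banana'], ['yellow']), 3: (['orange'], [])}
```

Key 3 exists only in `fruits`, so it is kept with an empty list on the `colors` side.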

In [0]:
# 01 Example: Cogroup two RDDs
fruits_rdd = sc.parallelize([(1, "apple"), (2, "banana"), (3, "orange")])
colors_rdd = sc.parallelize([(1, "red"), (2, "yellow")])
cogrouped_fruits_colors = fruits_rdd.cogroup(colors_rdd).mapValues(lambda x: (list(x[0]), list(x[1]))).collect()
print("01 cogroup example (group two RDDs):", cogrouped_fruits_colors)

01 cogroup example (group two RDDs): [(1, (['apple'], ['red'])), (2, (['banana'], ['yellow'])), (3, (['orange'], []))]


In [0]:
# 02 Example: Cogroup sales data with target data
sales_rdd = sc.parallelize([("store1", 100), ("store2", 200)])
targets_rdd = sc.parallelize([("store1", 150), ("store3", 250)])
cogrouped_sales_targets = sales_rdd.cogroup(targets_rdd).mapValues(lambda x: (list(x[0]), list(x[1]))).collect()
print("02 cogroup example (sales-targets cogroup):", cogrouped_sales_targets)

02 cogroup example (sales-targets cogroup): [('store2', ([200], [])), ('store3', ([], [250])), ('store1', ([100], [150]))]


#### Main Differences between `cogroup` and `join` in RDDs

##### Result

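- **Cogroup**: Each key appears once, paired with a tuple of iterables (one per RDD) that hold every value for that key.
- **Join**: Each result is a `(key, (value1, value2))` tuple, with one element for every combination of matching values, so duplicate keys produce multiple elements.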

##### Keys

- **Cogroup**: Keeps every key, even one that exists in only one RDD.
- **Join**: Keeps only the keys present in both RDDs.
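The result and key differences can be seen side by side in a small plain-Python sketch. It models an inner join as a cogroup followed by a per-key cross product, which is a convenient way to think about the relationship between the two operations. The helper names `cogroup_local` and `join_local` and the sample data are assumptions made for illustration, and the sketch runs without Spark.

```python
from collections import defaultdict
from itertools import product

def cogroup_local(left, right):
    """Model of cogroup: key -> (values from left, values from right)."""
    groups = defaultdict(lambda: ([], []))
    for key, value in left:
        groups[key][0].append(value)
    for key, value in right:
        groups[key][1].append(value)
    return dict(groups)

def join_local(left, right):
    """Model of an inner join: cogroup, then pair up values key by key."""
    joined = []
    for key, (left_values, right_values) in cogroup_local(left, right).items():
        # product() yields nothing when either side is empty, so keys
        # present in only one input drop out of the join
        for pair in product(left_values, right_values):
            joined.append((key, pair))
    return joined

sales = [("store1", 100), ("store1", 120), ("store2", 200)]
targets = [("store1", 150), ("store3", 250)]

grouped = cogroup_local(sales, targets)
joined = join_local(sales, targets)
print(sorted(grouped))  # ['store1', 'store2', 'store3']
print(joined)           # [('store1', (100, 150)), ('store1', (120, 150))]
```

The cogroup model keeps all three stores, while the join model keeps only `store1` and emits it twice because it has two sales values.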

##### Usage

- **Cogroup**: Suitable when you want to collect all the values associated with a key from more than one dataset.
- **Join**: Suitable when you want to combine two datasets on their common keys only.

##### Practical Uses

- **Cogroup**: For example, gathering the sales records for the same product from several different sources.
- **Join**: For example, obtaining the price and date only for the products that were actually sold.


##### distinct Function
- Returns a new RDD containing the distinct elements of this RDD. The order of the result is not guaranteed.
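As a rough mental model, distinct can be thought of as keying each element by itself and keeping one element per key. The plain-Python sketch below follows that idea; `distinct_local` is a hypothetical helper, it assumes the elements are hashable, and it illustrates the semantics rather than Spark's distributed execution.

```python
def distinct_local(items):
    """Plain-Python model of distinct: keep one occurrence of each element."""
    seen = {}
    for item in items:
        # setdefault keeps the first occurrence and ignores repeats
        seen.setdefault(item, None)
    return list(seen)

words = ["cat", "dog", "cat", "elephant", "dog"]
unique_words = distinct_local(words)
# Sort before printing, since Spark itself makes no promise about order
print(sorted(unique_words))  # ['cat', 'dog', 'elephant']
```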

In [0]:
# Example: Unique words from a list of words
words = ["cat", "dog", "cat", "elephant", "dog"]
words_rdd = sc.parallelize(words)
distinct_words = words_rdd.distinct().collect()
print("distinct example (unique words):", distinct_words)

distinct example (unique words): ['elephant', 'dog', 'cat']
