In [2]:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .getOrCreate()

# Import data
books = spark.read.csv("books.csv", header=True, inferSchema=True).rdd

# Pair RDD 
- A pair RDD is an RDD whose elements are key-value pairs (2-tuples in Python)
- Creating a pair RDD lets you perform aggregations by key

In [13]:
# RDD to Pair RDD
booksPair = books.map(lambda b: (b.type, 1))
print(type(booksPair))
result = booksPair.collect()
print(type(result))
result

<class 'pyspark.rdd.PipelinedRDD'>
<class 'list'>


[('Children', 1), ('Children', 1), ('Adult NF', 1)]

## Transformations
**Reduce By Key** 
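- Result is an RDD of (K, V) pairs, one per distinct key
- Merges the values of each key with the given function (which must be associative and commutative) and returns a single value per key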

In [12]:
result = booksPair.reduceByKey(lambda v1, v2: v1 + v2).collect()
print(type(result))
result

<class 'list'>


[('Children', 2), ('Adult NF', 1)]
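
As a rough illustration of why the reduce function has to be associative and commutative, here is a plain-Python sketch (no Spark involved) that reduces two pretend partitions separately and then merges the partial results. The helper `reduce_by_key` and the way the data is split are assumptions made for this example; they are not part of the PySpark API.

```python
# Same pairs as the collected booksPair output above
pairs = [("Children", 1), ("Children", 1), ("Adult NF", 1)]

def reduce_by_key(pairs, func):
    """Hypothetical helper: fold the values of each key with func."""
    merged = {}
    for key, value in pairs:
        # First value for a key is kept as is; later ones are merged in
        merged[key] = func(merged[key], value) if key in merged else value
    return list(merged.items())

add = lambda v1, v2: v1 + v2

# Pretend the data lives in two partitions (an arbitrary split for illustration)
partition_a = pairs[:1]
partition_b = pairs[1:]

# Step 1: reduce inside each partition
partial = reduce_by_key(partition_a, add) + reduce_by_key(partition_b, add)

# Step 2: merge the partial results across partitions
totals = reduce_by_key(partial, add)
print(totals)
```

Because addition gives the same answer regardless of how the values are grouped or ordered, the two-step result matches a single pass over all the pairs.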

**Group By Key**
- Result is an RDD of (K, Iterable<V>) pairs. Each value is a collection of all the values that share the same key
- If the purpose of grouping is to perform an aggregation, use reduceByKey() instead

In [10]:
booksPair.groupByKey().collect()

[('Children', <pyspark.resultiterable.ResultIterable at 0x7f41c7937080>),
 ('Adult NF', <pyspark.resultiterable.ResultIterable at 0x7f41c7937400>)]
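
The `ResultIterable` objects in this output are opaque when printed. The plain-Python sketch below (no Spark involved) shows what each group would contain for the three sample pairs, and how an aggregation built on top of grouping first has to materialize every value. The variable names are illustrative only.

```python
from collections import defaultdict

# Same pairs as the collected booksPair output shown earlier
pairs = [("Children", 1), ("Children", 1), ("Adult NF", 1)]

# Collect every value under its key, which is what grouping by key means
groups = defaultdict(list)
for key, value in pairs:
    groups[key].append(value)

grouped = list(groups.items())
print(grouped)

# Aggregating after grouping: every value was stored before being summed
counts = [(key, sum(values)) for key, values in grouped]
print(counts)
```

In PySpark itself, applying `mapValues(list)` to the grouped RDD before `collect()` is a common way to make the groups printable.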

**Get all keys**

- Returns an RDD containing the key of each tuple/pair
- Every value may have a unique key, so it may not be possible to collect all keys on one node. Hence keys() is a transformation and returns an RDD

Example:

In [16]:
booksPair.keys().distinct().count()

2
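
To make the transformation/action distinction concrete, here is a loose plain-Python analogy (not Spark code): a lazy `map` stands in for the `keys()` transformation, and consuming it stands in for the `count()` action. The `key_of` function and the `seen` list are hypothetical names used only to record when evaluation happens.

```python
# Same pairs as the collected booksPair output shown earlier
pairs = [("Children", 1), ("Children", 1), ("Adult NF", 1)]

seen = []  # records which pairs have actually been processed

def key_of(pair):
    seen.append(pair)
    return pair[0]

# Analogous to a transformation: this only describes the work
lazy_keys = map(key_of, pairs)
evaluated_before = len(seen)

# Analogous to an action: consuming the iterator triggers the work
distinct_count = len(set(lazy_keys))
evaluated_after = len(seen)

print(evaluated_before, distinct_count, evaluated_after)
```

Nothing is evaluated until the result is consumed, which mirrors how Spark defers transformations until an action runs.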

## Actions
**Count by Key**
- Returns a dictionary containing the number of elements for each key

In [17]:
booksPair.countByKey()

defaultdict(int, {'Children': 2, 'Adult NF': 1})
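
Counting by key counts elements and ignores the values, so it only coincides with a sum by key when every value is 1, as in `booksPair`. The plain-Python sketch below (no Spark involved) uses made-up values to show the difference; the numbers 5, 7 and 3 are hypothetical and do not come from `books.csv`.

```python
from collections import defaultdict

# Hypothetical pairs whose values are not all 1
pairs = [("Children", 5), ("Children", 7), ("Adult NF", 3)]

counts = defaultdict(int)  # number of elements per key
sums = defaultdict(int)    # sum of values per key, for comparison

for key, value in pairs:
    counts[key] += 1
    sums[key] += value

print(dict(counts))
print(dict(sums))
```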