<a href="https://colab.research.google.com/github/mosesyhc/de300-wn2024-notes/blob/main/examples/ex_rdd.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Mounting Google Drive for a permanent venv

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Retrieving Spark and `findspark`, and pointing Python to the Java installation

In [None]:
!wget -q http://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz
!tar xf spark-3.1.1-bin-hadoop3.2.tgz

In [None]:
!pip install -q findspark

In [None]:
# Spark setup: point to the Java installation and the unpacked Spark directory
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.1.1-bin-hadoop3.2"

In [None]:
# findspark uses SPARK_HOME to locate Spark and makes pyspark importable
import findspark
findspark.init()

## RDD example

In [None]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

collection = [1, "two", 3.0, ("four", 4), {"five": 5}]  # generic list

sc = spark.sparkContext

collection_rdd = sc.parallelize(collection)  # list promoted to RDD

print(collection_rdd)

In [None]:
collection_rdd.collect()

## `map` example

In [None]:
from py4j.protocol import Py4JJavaError

def add_one(value):
    return value + 1

collection_rdd_p1 = collection_rdd.map(add_one)

In [None]:
try:
    print(collection_rdd_p1.collect())
except Py4JJavaError as e:
    print(e)

# You'll get one of the following:
# TypeError: can only concatenate str (not "int") to str
# TypeError: unsupported operand type(s) for +: 'dict' and 'int'
# TypeError: can only concatenate tuple (not "int") to tuple
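The `try`/`except` above wraps `collect()` rather than `map()` because RDD transformations are lazy: `map` only records the function, and `add_one` runs once an action such as `collect()` asks for the results. Python's built-in `map` is lazy in a similar way, so the plain-Python sketch below (an analogy only, with no Spark involved) shows the same pattern of a deferred error. The names `lazy_result`, `materialized`, and `error_message` are illustrative.

```python
def add_one(value):
    return value + 1

collection = [1, "two", 3.0, ("four", 4), {"five": 5}]

# Building the lazy map does not call add_one, so nothing fails here.
lazy_result = map(add_one, collection)

# The error only surfaces once the values are actually requested.
try:
    materialized = list(lazy_result)
    error_message = None
except TypeError as e:
    materialized = None
    error_message = str(e)

print(error_message)
```

In Spark the Python `TypeError` is raised on a worker and reaches the driver wrapped in a `Py4JJavaError`, which is why the notebook catches that exception type instead.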

### A potential fix

In [None]:
def safer_add_one(value):
    try:
        return value + 1
    except TypeError:
        return value

collection_rdd_p1_again = collection_rdd.map(safer_add_one)


In [None]:
print(collection_rdd_p1_again.collect())

## `filter` example

In [None]:
collection_rdd_filter = collection_rdd.filter(
    lambda elem: isinstance(elem, (float, int))
)

In [None]:
print(collection_rdd_filter.collect())
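One caveat with this predicate: in Python, `bool` is a subclass of `int`, so `True` and `False` would also pass the `isinstance(elem, (float, int))` check. The sketch below uses the built-in `filter` on a plain list (not an RDD) to illustrate this. The extra `True` element and the `is_strict_number` helper are hypothetical additions for illustration; the same predicate functions could be passed to `collection_rdd.filter`.

```python
# The notebook's collection, with True appended for illustration only.
collection = [1, "two", 3.0, ("four", 4), {"five": 5}, True]

def is_number(elem):
    # Same check as the lambda in the notebook.
    return isinstance(elem, (float, int))

def is_strict_number(elem):
    # Hypothetical stricter variant: exclude booleans explicitly.
    return isinstance(elem, (float, int)) and not isinstance(elem, bool)

kept = list(filter(is_number, collection))
kept_strict = list(filter(is_strict_number, collection))

print(kept)         # the boolean is kept alongside 1 and 3.0
print(kept_strict)  # only 1 and 3.0 remain
```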

## `reduce` example

In [None]:
from operator import add, sub, mul

collection_rdd2 = sc.parallelize([4, 7, 9.2, 5.6, -20])

In [None]:
collection_rdd2.reduce(add)

In [None]:
collection_rdd2.reduce(
    lambda a, b: a + b
)
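Both calls above compute the same sum. Spark's `reduce` expects an operator that is commutative and associative, because each partition is reduced on its own and the partial results are then combined. The sketch below simulates that with plain Python lists and `functools.reduce` (no Spark involved). The helper `partitioned_reduce` and the two partition layouts are hypothetical and only illustrate why `add` is safe while `sub` is not.

```python
from functools import reduce
from operator import add, sub

def partitioned_reduce(op, partitions):
    """Reduce each partition locally, then combine the partial results."""
    partials = [reduce(op, part) for part in partitions]
    return reduce(op, partials)

# Two hypothetical ways the same five values could be split across partitions.
split_a = [[4, 7], [9.2, 5.6, -20]]
split_b = [[4], [7, 9.2], [5.6, -20]]

sum_a = partitioned_reduce(add, split_a)
sum_b = partitioned_reduce(add, split_b)
diff_a = partitioned_reduce(sub, split_a)
diff_b = partitioned_reduce(sub, split_b)

# Addition agrees across layouts (up to floating-point rounding).
print(sum_a, sum_b)
# Subtraction depends on how the data happened to be partitioned.
print(diff_a, diff_b)
```

With a non-associative operator such as `sub`, the result of `reduce` on a real RDD can likewise change with the number of partitions, so it should not be relied on.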

## RDD and DataFrame

In [None]:
df = spark.createDataFrame([[1], [2], [3]], schema=["column"])

print(df.rdd)

In [None]:
print(df.rdd.collect())

## Exercise
- `collection_rdd.count()` returns the number of elements in the RDD.
- Reproduce `.count()` using `map`, `filter`, and `reduce`.

## Exercise
- Reproduce our word count example through `map`, `filter`, and `reduce`.

*As a reminder, this was the word count code using the DataFrame API*:

```python
import pyspark.sql.functions as F

counts = (
    spark.read.text(file_path)
     .select(F.split(F.col('value'), ' ').alias('line'))
     .select(F.explode(F.col('line')).alias('word'))
     .select(F.lower(F.col('word')).alias('word'))
     .select(F.regexp_extract(F.col('word'), r"(\W+)?([a-z]+)", 2).alias('word'))
     .where(F.col('word') != "")
     .groupby('word')
     .count()
)
```