# RDD Programming
https://spark.apache.org/docs/3.0.2/rdd-programming-guide.html

- Overview ✅
- Linking with Spark ✅
- Initializing Spark ✅
    - Using the Shell ✅
- Resilient Distributed Datasets (RDDs) ✅
    - Parallelized Collections ✅
    - External Datasets ✅ (_mostly a waste of time_)
    - RDD Operations ✅ (_pretty good_)
        - Basics ✅ (_really good_)
        - Passing Functions to Spark ✅ (_kind of sucked_)
        - Understanding closures ✅
            - Example ✅
            - Local vs. cluster modes ✅ (_REALLY REALLY important to understand_)
            - Printing elements of an RDD ✅ (_good to know_)
        - Working with Key-Value Pairs ✅ (_good_)
        - Transformations
        - Actions
        - Shuffle operations
            - Background
            - Performance Impact
    - RDD Persistence
        - Which Storage Level to Choose?
        - Removing Data
- Shared Variables
    - Broadcast Variables
    - Accumulators
- Deploying to a Cluster
- Launching Spark jobs from Java / Scala
- Unit Testing
- Where to Go from Here


In [2]:
from pyspark.sql import SparkSession
from pyspark import SparkContext, SparkConf

spark = SparkSession.builder.appName("RDD_Programming").getOrCreate()

In [3]:
data = [1, 2, 3, 4, 5]
dist_data = spark.sparkContext.parallelize(data)

In [4]:
dist_data

ParallelCollectionRDD[0] at readRDDFromFile at PythonRDD.scala:274

In [5]:
# From the guide: all of Spark’s file-based input methods, including textFile,
# support running on directories, compressed files, and wildcards as well.
# For example, you can use textFile("/my/directory"),
# textFile("/my/directory/*.txt"), and textFile("/my/directory/*.gz").
dist_file_data = spark.sparkContext.textFile("data/mango.txt")
dist_file_data

data/mango.txt MapPartitionsRDD[2] at textFile at NativeMethodAccessorImpl.java:0

## External Datasets

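The guide mentions two methods for reading text files:
- .wholeTextFiles(path): reads a directory of small text files and returns one (filename, content) pair per file
- .textFile(path): returns one record per line of the file(s)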
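
To make the difference in record shape concrete, here is a plain-Python sketch that does not use Spark at all. It only mimics the shape of what each method returns, using two throwaway files (`a.txt` and `b.txt` are made-up names). One simplification to keep in mind: Spark's wholeTextFiles keys each record by the full file path, whereas this sketch uses the bare file name.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Two small hypothetical input files.
    (root / "a.txt").write_text("cat dog\nant mic\n")
    (root / "b.txt").write_text("mango\n")

    # Stand-in for a wildcard path such as "/my/directory/*.txt".
    files = sorted(root.glob("*.txt"))

    # Shape of textFile(...): one record per line, across all matched files.
    line_records = [line for f in files for line in f.read_text().splitlines()]

    # Shape of wholeTextFiles(...): one (name, content) pair per file.
    file_records = [(f.name, f.read_text()) for f in files]

print(line_records)
print(file_records)
```

The per-line shape suits line-oriented processing such as the length count in these notes; the per-file shape is useful when each file has to be handled as a unit.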

In [7]:
dist_file_data_list = dist_file_data.collect()
dist_file_data_list

['this mango company animal',
 'cat dog ant mic laptop mango',
 'chair switch mobile am charger cover',
 'amanda mango mango any alarm ant']

In [8]:
# Replace each line with its length (note: this rebinds dist_file_data to the mapped RDD)
dist_file_data = dist_file_data.map(len)
dist_file_data.collect()

[25, 28, 36, 32]

In [11]:
# Optionally persist before the action so the computed line lengths stay
# cached in memory across the cluster for reuse by later actions:
# dist_file_data.persist()
total_len = dist_file_data.reduce(lambda a, b: a + b)
total_len

121
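
The map-then-reduce steps above can be sketched in plain Python, as a loose analogy only. Python's built-in `map` is lazy in roughly the way an RDD transformation is: nothing is computed until something consumes it, which is the role the `reduce` action plays in Spark. The `seen` list and `line_length` helper are made up for this illustration. The analogy breaks down on reuse: a Python iterator is exhausted after one pass, whereas Spark may recompute an RDD on each action unless it has been persisted.

```python
from functools import reduce

# The four lines of data/mango.txt, copied from the collect() output above.
lines = [
    "this mango company animal",
    "cat dog ant mic laptop mango",
    "chair switch mobile am charger cover",
    "amanda mango mango any alarm ant",
]

seen = []  # records every line the mapping function is actually applied to


def line_length(line):
    seen.append(line)
    return len(line)


lengths = map(line_length, lines)  # lazy: no line has been measured yet
measured_before = len(seen)

# Consuming the iterator is what triggers the work.
total_len = reduce(lambda a, b: a + b, lengths)
measured_after = len(seen)

print(measured_before, measured_after, total_len)  # prints: 0 4 121
```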

In [13]:
# From the "Working with Key-Value Pairs" section: count how often each line occurs
lines = spark.sparkContext.textFile("data/words_sm.txt")
display(lines.collect())
pairs = lines.map(lambda s: (s, 1))  # pair each line with an initial count of 1
display(pairs.collect())
counts = pairs.reduceByKey(lambda a, b: a + b)  # sum the counts for each key
display(counts.collect())

['Apple', 'Mic', 'Mic', 'Apple', 'Laptop', 'Mic']

[('Apple', 1), ('Mic', 1), ('Mic', 1), ('Apple', 1), ('Laptop', 1), ('Mic', 1)]

[('Apple', 2), ('Mic', 3), ('Laptop', 1)]
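
As a rough mental model of what reduceByKey computes, here is a single-machine sketch in plain Python. The `reduce_by_key` helper is hypothetical and is not how Spark implements the operation, since Spark merges values within each partition and then shuffles by key. It only shows the result: values that share a key are folded together with the supplied function.

```python
def reduce_by_key(pairs, func):
    """Fold together the values of pairs that share a key (single-machine model)."""
    merged = {}
    for key, value in pairs:
        # First value for a key is kept as-is; later ones are combined with func.
        merged[key] = func(merged[key], value) if key in merged else value
    return list(merged.items())


# Same input as the words_sm.txt output above.
words = ["Apple", "Mic", "Mic", "Apple", "Laptop", "Mic"]
pairs = [(w, 1) for w in words]
counts = reduce_by_key(pairs, lambda a, b: a + b)

print(counts)  # prints: [('Apple', 2), ('Mic', 3), ('Laptop', 1)]
```

The order here follows first appearance in the input; Spark makes no such ordering guarantee for reduceByKey output, so compare results as a dict or sort them when order matters.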