# **Lab Session: Algorithms and Programming with Spark RDDs using PySpark**  

## Introduction
This lab session introduces you to foundational concepts of distributed data processing using Spark's Resilient Distributed Datasets (RDDs). You'll leverage Python and the PySpark library within the Colab environment to build, execute, and analyze various Spark programs. The session covers:

- **Setting up PySpark in Colab**: Learn to configure and initialize the SparkContext.
- **Practical Exercises**:
  1. **Word Count Problem**: Implement a Spark algorithm to analyze word frequencies in a text file.
  2. **Data Aggregation**: Compute the average quantities from a sample dataset while minimizing shuffle operations.
  3. **Join Operations**: Explore algorithms to perform equi-joins and right-outer joins on RDDs without the direct `join()` transformation.
  4. **SQL Query Encoding in Spark**: Encode and test SQL-like queries using Python MapReduce transformations.

Each exercise is designed to deepen your understanding of Spark's capabilities, RDD transformations, and actions. You'll also develop skills in optimizing performance and implementing complex operations.

**Tools Required**:
- **Python**: For Spark programming.
- **Colab Notebook**: To run your Spark scripts.

By the end of this session, you will have hands-on experience writing RDD transformations and actions for counting, aggregation, joins, and SQL-style queries.

## **Exercise 1: Word Count Problem**
**Objective**: Design and implement a Spark algorithm to compute word frequencies in a text file.

1. **Setup**:  
   - Ensure your Colab environment is ready with Spark installed. Run the following commands:

In [23]:
!pip install pyspark
from pyspark import SparkConf, SparkContext
sc = SparkContext.getOrCreate(SparkConf().setMaster("local[*]"))




   - Download or use an existing text file (e.g., `shake.txt`):

In [24]:
!wget https://www.dropbox.com/s/7ae58iydjloajvt/shake.txt


2. **Create the RDD**:
   - Load the text file into an RDD:

In [25]:
document = sc.textFile("shake.txt")

3. **Transform and Process**:
   - Tokenize the lines into words:

In [26]:
words = None

   - Map each word to a `(word, 1)` key-value pair:

In [27]:
word_pairs = None

   - Reduce by key to sum the occurrences of each word:

In [28]:
word_counts = None

4. **View Results**:
   - Display the word counts:

In [29]:
print(word_counts.collect() if word_counts else 'Not computed yet')

Not computed yet
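
One possible way to fill in the three placeholders above is sketched below. It is not the only valid solution. The `tokenize` helper and its lowercase, letters-and-apostrophes normalization are assumptions made for illustration; the lab does not prescribe how words should be split.

```python
import re
from operator import add

def tokenize(line):
    """Split one line into lowercase words.

    Assumption: a word is a run of letters or apostrophes; punctuation and
    digits are dropped. Adjust the pattern if the lab expects otherwise.
    """
    return re.findall(r"[a-z']+", line.lower())

def word_count(lines):
    """Count words in an RDD of strings (e.g. the `document` RDD)."""
    return (lines.flatMap(tokenize)        # one record per word
                 .map(lambda w: (w, 1))    # (word, 1) pairs
                 .reduceByKey(add))        # sum the 1s for each word

# In the notebook, assuming `document` was created with sc.textFile:
# word_counts = word_count(document)
# print(word_counts.takeOrdered(10, key=lambda kv: -kv[1]))
```

`takeOrdered` is used in the commented usage instead of `collect()` so that only the ten most frequent words are brought back to the driver.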


## **Exercise 2: Data Aggregation**
**Objective**: Compute the average quantity of each pet from a dataset and analyze shuffle operations.

1. **Setup**:  
   - Create an RDD for the dataset. For example:

In [30]:
data = [("dog", 3), ("cat", 4), ("dog", 5), ("cat", 6)]
pets = sc.parallelize(data)

2. **Aggregate Data**:
   - Calculate the total quantity and count for each pet:

In [31]:
totals = None

   - Compute the average:

In [32]:
averages = None


3. **Optimize Shuffle**:
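   - Consider how to reduce the amount of data shuffled: which transformations combine values within each partition before the shuffle, and which ship every record across the network?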

4. **View Results**:
   - Print the averages:

In [33]:
print(averages.collect() if averages else 'Not computed yet')

Not computed yet
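
A minimal sketch of one shuffle-friendly approach follows. It keeps a `(sum, count)` accumulator per key and uses `aggregateByKey`, which combines values inside each partition before shuffling, so at most one accumulator per key per partition crosses the network. With `groupByKey`, every record would be shuffled instead. The helper names `add_value`, `merge_accs`, and `average_by_key` are hypothetical.

```python
def add_value(acc, qty):
    """Fold one quantity into a (sum, count) accumulator within a partition."""
    return (acc[0] + qty, acc[1] + 1)

def merge_accs(a, b):
    """Merge two (sum, count) accumulators from different partitions."""
    return (a[0] + b[0], a[1] + b[1])

def average_by_key(pairs):
    """Average the values per key in an RDD of (key, number) pairs."""
    totals = pairs.aggregateByKey((0, 0), add_value, merge_accs)  # single shuffle
    # mapValues keeps the partitioning, so no further shuffle is triggered.
    return totals.mapValues(lambda acc: acc[0] / acc[1])

# In the notebook, assuming the `pets` RDD defined above:
# averages = average_by_key(pets)
# print(averages.collect())
```

For the sample data, the per-key accumulators are `(8, 2)` for `dog` and `(10, 2)` for `cat`, which give averages of 4.0 and 5.0.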



## **Exercise 3: Join Operations**
**Objective**: Perform equi-joins and right-outer joins without the `join()` transformation.

1. **Setup**:  
   - Use two RDDs representing key-value datasets:

In [34]:
rdd1 = sc.parallelize([("A", 1), ("B", 2), ("C", 3)])
rdd2 = sc.parallelize([("A", 4), ("B", 5), ("D", 6)])

2. **Equi-Join Implementation**:
   - Take the Cartesian product of the two RDDs and keep only the pairs whose keys match:

In [35]:
equi_join = None

3. **Right-Outer Join**:
   - Extend the equi-join logic to include keys exclusive to `rdd2`.

4. **Discuss Results**:
   - Compare performance with the standard `join()` transformation.
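
The steps of this exercise could be sketched as follows. The helper names are hypothetical, and the right-outer join assumes that `subtractByKey` is permitted, since only `join()` is ruled out by the objective. The output shape `(key, (left_value, right_value))`, with `None` for a missing left value, mirrors what Spark's built-in joins return.

```python
def same_key(pair):
    """pair is ((k1, v), (k2, w)) from a Cartesian product; keep it if k1 == k2."""
    (k1, _), (k2, _) = pair
    return k1 == k2

def to_joined(pair):
    """Reshape ((k, v), (k, w)) into (k, (v, w))."""
    (k, v), (_, w) = pair
    return (k, (v, w))

def equi_join(left, right):
    """Equi-join two pair RDDs via cartesian + filter."""
    return left.cartesian(right).filter(same_key).map(to_joined)

def right_outer_join(left, right):
    """Equi-join plus the records of `right` whose key is absent from `left`."""
    matched = equi_join(left, right)
    unmatched = right.subtractByKey(left).mapValues(lambda w: (None, w))
    return matched.union(unmatched)

# In the notebook, assuming rdd1 and rdd2 from the setup cell:
# equi_join = equi_join(rdd1, rdd2)   # note: rebinding the name shadows the function
# print(right_outer_join(rdd1, rdd2).collect())
```

On performance, the Cartesian product materializes every pair of records (|rdd1| × |rdd2|) before filtering, whereas `join()` partitions both RDDs by key and only pairs records that share a key. The gap grows quickly with input size.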


## **Exercise 4: Encoding SQL Queries in Spark**
**Objective**: Encode SQL-like queries using Python MapReduce and test them.

1. **Setup**:  
   - Download the files:

In [36]:
!wget https://www.dropbox.com/s/tmt6u80mkrwfjkv/Customer.txt
!wget https://www.dropbox.com/s/8n5cbmufqhzs4r3/Order.txt


   - Load the data into RDDs:

In [37]:
customer_rdd = sc.textFile("Customer.txt").map(lambda line: line.split(","))
order_rdd = sc.textFile("Order.txt").map(lambda line: line.split(","))

2. **Query 1: Customers with a Start Date in July**:
   - Filter customers by the month of `startDate`:  
   `SELECT name FROM Customer WHERE month(startDate)=7`

In [38]:
from datetime import datetime
july_customers = None
print(july_customers.collect() if july_customers else 'Not computed yet')

Not computed yet
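
A hedged sketch of how this query might be encoded is shown below. The lab does not state the column order or the date format of `Customer.txt`, so the three constants are assumptions. Inspect `customer_rdd.take(3)` and adjust them before relying on the result.

```python
from datetime import datetime

# ASSUMPTIONS (not given in the lab): column positions and date format.
NAME_COL = 2
START_DATE_COL = 1
DATE_FORMAT = "%Y/%m/%d"

def start_month(fields):
    """Return the month (1-12) of the startDate column of one split row."""
    return datetime.strptime(fields[START_DATE_COL].strip(), DATE_FORMAT).month

def july_names(customers):
    """SELECT name FROM Customer WHERE month(startDate)=7 on an RDD of rows."""
    return (customers.filter(lambda f: start_month(f) == 7)   # WHERE clause
                     .map(lambda f: f[NAME_COL].strip()))     # SELECT name

# In the notebook, assuming `customer_rdd` from the loading cell:
# july_customers = july_names(customer_rdd)
```

Chaining `.distinct()` onto the returned RDD gives the `SELECT DISTINCT` variant.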


3. **Query 2: Distinct Names**:
   - Use `distinct()`:  
   `SELECT DISTINCT name FROM Customer WHERE month(startDate)=7`

In [39]:
distinct_names = None
print(distinct_names.collect() if distinct_names else 'Not computed yet')

Not computed yet


4. **Query 3: Aggregated Orders**:
   - Perform grouping and aggregation:  
   `SELECT O.cid, SUM(total), COUNT(DISTINCT total) FROM Order O GROUP BY O.cid`

In [40]:
grouped_orders = None
aggregated = None
print(aggregated.collect() if aggregated else 'Not computed yet')

Not computed yet
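
One way to encode this grouping is to carry a `(running_sum, set_of_distinct_totals)` accumulator per customer, as in the sketch below. The column positions are assumptions, because the layout of `Order.txt` is not given in the lab, and the helper names are hypothetical.

```python
# ASSUMPTIONS (not given in the lab): column positions in Order.txt.
CID_COL = 0
TOTAL_COL = 1

def to_cid_total(fields):
    """Turn one split row into a (cid, total) pair."""
    return (fields[CID_COL].strip(), float(fields[TOTAL_COL]))

def add_total(acc, total):
    """Fold one total into a (sum, distinct_totals) accumulator."""
    return (acc[0] + total, acc[1] | {total})

def merge_totals(a, b):
    """Merge two accumulators from different partitions."""
    return (a[0] + b[0], a[1] | b[1])

def aggregate_orders(orders):
    """SELECT cid, SUM(total), COUNT(DISTINCT total) ... GROUP BY cid."""
    return (orders.map(to_cid_total)
                  .aggregateByKey((0.0, frozenset()), add_total, merge_totals)
                  .mapValues(lambda acc: (acc[0], len(acc[1]))))

# In the notebook, assuming `order_rdd` from the loading cell:
# aggregated = aggregate_orders(order_rdd)
```

An immutable `frozenset` is used as the zero value so that the accumulator is never modified in place. The set of distinct totals can grow large for customers with many different order amounts, which is the cost of `COUNT(DISTINCT ...)`.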


5. **Query 4: Join Customers and Orders**:
   - Use a key-based join:  
   `SELECT C.cid, O.total FROM Customer C, Order O WHERE C.cid=O.cid`

In [41]:
join_result = None
print(join_result.collect() if join_result else 'Not computed yet')

Not computed yet
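
A sketch of the key-based join is given below. Both RDDs are first keyed by `cid`, then joined with the built-in `join()`, and finally projected onto `(cid, total)`. The column positions are assumptions, since the file layouts are not stated in the lab, and the helper names are hypothetical.

```python
# ASSUMPTIONS (not given in the lab): column positions in the two files.
CUSTOMER_CID_COL = 0
ORDER_CID_COL = 0
ORDER_TOTAL_COL = 1

def customer_key(fields):
    """Key a customer row by cid; no other customer column is selected."""
    return (fields[CUSTOMER_CID_COL].strip(), None)

def order_pair(fields):
    """Turn an order row into a (cid, total) pair."""
    return (fields[ORDER_CID_COL].strip(), float(fields[ORDER_TOTAL_COL]))

def customer_order_totals(customers, orders):
    """SELECT C.cid, O.total FROM Customer C, Order O WHERE C.cid=O.cid."""
    joined = customers.map(customer_key).join(orders.map(order_pair))
    # join() yields (cid, (None, total)); keep only cid and total.
    return joined.map(lambda kv: (kv[0], kv[1][1]))

# In the notebook, assuming the RDDs from the loading cell:
# join_result = customer_order_totals(customer_rdd, order_rdd)
```

Mapping the customer side to `(cid, None)` before the join keeps the shuffled records small, because only the key is needed from that table.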
