# 1. Installing PySpark in Google Colab

In [None]:
!sudo apt update
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
#Check this site for the latest download link https://www.apache.org/dyn/closer.lua/spark/spark-3.2.1/spark-3.2.1-bin-hadoop3.2.tgz
!wget -q https://dlcdn.apache.org/spark/spark-3.2.1/spark-3.2.1-bin-hadoop3.2.tgz
!tar xf spark-3.2.1-bin-hadoop3.2.tgz
!pip install -q findspark
!pip install pyspark
!pip install py4j

import os
import sys
# os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
# os.environ["SPARK_HOME"] = "/content/spark-3.2.1-bin-hadoop3.2"


import findspark
findspark.init()
findspark.find()

import pyspark

from pyspark.sql import DataFrame, SparkSession
from typing import List
import pyspark.sql.types as T
import pyspark.sql.functions as F

spark= SparkSession \
       .builder \
       .appName("Our First Spark Example") \
       .getOrCreate()

spark

[33m0% [Working][0m            Hit:1 http://archive.ubuntu.com/ubuntu jammy InRelease
[33m0% [Waiting for headers] [Connecting to security.ubuntu.com (185.125.190.82)] [[0m                                                                               Get:2 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [128 kB]

# Create an RDD

In PySpark, there are several ways to create an RDD (Resilient Distributed Dataset), each suited to a different scenario.

1. parallelize Method

Description: The parallelize method creates an RDD from an existing Python collection (such as a list or set). It is the simplest way to create an RDD from in-memory data.

Use Case: When you have a small dataset in memory that you want to distribute across a cluster.

Parallelize the data:

rdd = sparkContext.parallelize(data)

2. textFile Method

Description: The textFile method is used to create an RDD from a text file stored in a distributed file system like HDFS, S3, or even local file systems.
Use Case: When you want to create an RDD from data stored in a text file, with each line of the file becoming an element in the RDD.

rdd1=sparkContext.textFile("/content/random_words_large.txt")

3. range Method

Description: This method creates an RDD from a specified range of integers.
Use Case: When you need to generate a sequence of numbers as an RDD.

rdd = sparkContext.range(start=0, end=100, step=1, numSlices=5)
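
As a plain-Python point of reference (an analogy, not Spark's implementation), `sparkContext.range` follows the same conventions as Python's built-in `range`: `start` is included, `end` is excluded, and `step` sets the stride. The sketch below uses `itertools.islice` to stand in for the `take(n)` action; the helper name `take` is hypothetical.

```python
from itertools import islice

def take(iterable, n):
    """Return the first n elements, similar in spirit to the RDD action take(n)."""
    return list(islice(iterable, n))

# Same arguments as sparkContext.range(start=5, end=70, step=2)
values = range(5, 70, 2)

first_twenty = take(values, 20)
print(first_twenty)

# `end` is exclusive, so the last element is the largest odd value below 70.
last_value = values[-1]
print(last_value)
```

Only the first `n` elements are returned by `take`, which is why it is the safer choice over `collect()` when the RDD may be large.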


In [None]:
from pyspark.sql import SparkSession

# Initialize a Spark session
spark = SparkSession.builder.appName("SimpleApp").getOrCreate()

# Get the Spark context from the session
sparkContext = spark.sparkContext

# Your data
data = [1,2,3,4,5,6,7,8,9,10]

# Parallelize the data

demo = sparkContext.parallelize(data) #
demo.collect()


[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

In [None]:
# Create an RDD from a text file
rdd1 = sparkContext.textFile("/content/pro.txt")
# Show the first 20 lines; take() needs an explicit count
rdd1.take(20)



['one floder -> ',
 '',
 'raw/year/month/file.txt',
 '',
 'raw/sub',
 '',
 '',
 'Create an interactive presentation with the LEVELUP design system using these specifications:',
 '',
 '**BRANDING & LOGO:**',
 '- LEVELUP logo in top-left corner with pulsing animation',
 '- Font: Bold, 2em size, color: #00d4ff',
 '- Animation: Pulse effect with glow (scale 1 to 1.1, text-shadow glow)',
 '',
 '**COLOR PALETTE:**',
 'Primary Colors:',
 '- Background: Linear gradient (135deg, #667eea 0%, #764ba2 100%)',
 '- Accent Blue: #00d4ff (cyan/turquoise)',
 '- Accent Pink: #ff6b9d (coral pink)',
 '- Accent Yellow: #ffff00 (bright yellow)']

In [None]:
# Create an RDD from a range

rdd2 = sparkContext.range(start=5, end=70, step=2)
rdd2.take(20)



[5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35, 37, 39, 41, 43]

# Map and FlatMap Transformation


What is the map operation?

The map operation in Spark is like a factory that takes something in and produces something else. Imagine you have a bunch of raw materials (data), and you want to process each one in a certain way to create a new product.

Example:

Imagine you have a list of numbers, and you want to create a new list where each number is doubled.

Original List:

[1, 2, 3, 4, 5]

What you want to do:

Double each number, so 1 becomes 2, 2 becomes 4, and so on.

How map works:

numbers = [1, 2, 3, 4, 5]
       
numrdd = sparkContext.parallelize(numbers)
doubled_numbers = numrdd.map(lambda x: x * 2)
doubled_numbers.take(2)


In [None]:
numbers =[1,2,3,4,5]
numberrdd= sparkContext.parallelize(numbers)
maprdd = numberrdd.map(lambda x: x*5)
maprdd.take(2)



[5, 10]

In [None]:
numbers = [1, 2, 3, 4, 5]
numrdd = sparkContext.parallelize(numbers)
doubled_numbers = numrdd.map(lambda x: x * 2)
doubled_numbers.take(2)

Example 2: Adding a Suffix to Words

Scenario:
Imagine you have a list of words, and you want to add the suffix “-ly” to each word to create an adverb.

Original List:

["quick", "bright", "silent"]

What you want to do:

Add “-ly” to the end of each word.

How map works:

You use map to append the suffix “-ly” to each word.

In [None]:
# Example 2: append the suffix "ly" to each word
words = ["quick", "bright", "silent"]
wordrdd = sparkContext.parallelize(words)
rdd2 = wordrdd.map(lambda x: x + "ly")
rdd2.collect()

# Resulting list:
# ["quickly", "brightly", "silently"]


['quickly', 'brightly', 'silently']

# FlatMap Operation



What is flatMap?

The flatMap operation is like map, but with a twist. Instead of transforming each item into exactly one new item, it can turn each item into zero, one, or many items, and then flattens all of those items into a single list.

Example:

Imagine you have a list of sentences, and you want to create a list of all the words in those sentences.

Original List:

["Hello world", "Spark is great", "Map and flatMap"]

What you want to do:

Break each sentence into words and make a single list of all the words.

How flatMap works:

You tell flatMap to split each sentence into words, and then it combines all the words into a single list.
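
To make the difference from map concrete, here is a plain-Python sketch (an analogy, not Spark code) that applies the same split function in both styles. The list comprehension plays the role of `map`, and `itertools.chain.from_iterable` plays the role of the flattening step in `flatMap`.

```python
from itertools import chain

sentences = ["Hello world", "Spark is great", "Map and flatMap"]

# map-style: exactly one output per input, so each sentence becomes one list.
mapped = [s.split(" ") for s in sentences]

# flatMap-style: the per-sentence lists are concatenated into one sequence.
flat_mapped = list(chain.from_iterable(s.split(" ") for s in sentences))

print(mapped)       # a list of three lists
print(flat_mapped)  # a single flat list of eight words
```

With an RDD, the corresponding calls would be `rdd.map(lambda s: s.split(" "))` and `rdd.flatMap(lambda s: s.split(" "))`.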

In [None]:
sentences = ["Hello world", "Spark is great", "Map and flatMap"]

rdd4 = sparkContext.parallelize(sentences)
# Split each sentence on spaces; flatMap flattens the per-sentence lists into one RDD of words
words_rdd = rdd4.flatMap(lambda x: x.split(" "))
words_rdd.collect()






['Hello', 'world', 'Spark', 'is', 'great', 'Map', 'and', 'flatMap']

Example 2: Extracting Individual Characters from Words

Scenario:
You have a list of words, and you want to break each word into its individual characters. The goal is to create a list that contains all the characters from all the words.

Original List of Words:

["apple", "banana", "grape"]

How flatMap works:

You apply the flatMap function to break each word into its individual characters, and then flatten all the characters into a single list.

In [None]:
words = ["apple", "banana", "grape"]
rddwords = sparkContext.parallelize(words)
# list(x) turns a string into a list of its characters
characters = rddwords.flatMap(lambda x: list(x))

characters.collect()

['a',
 'p',
 'p',
 'l',
 'e',
 'b',
 'a',
 'n',
 'a',
 'n',
 'a',
 'g',
 'r',
 'a',
 'p',
 'e']

In [None]:
rdd = sparkContext.range(start=0, end=100, step=2, numSlices=5)
rdd.collect()

## ReduceByKey

The reduceByKey transformation in PySpark is used to aggregate values by key in a key-value pair RDD (also known as a Pair RDD). It combines values with the same key using a specified associative and commutative function (such as addition, multiplication, etc.).



1) List of sales records (product, amount)

sales = [("apple", 100), ("banana", 200), ("apple", 150), ("orange", 300), ("banana", 50)]

2) Parallelize the data to create an RDD

rdd = sc.parallelize(sales)

3) Use reduceByKey to sum the sales amounts by product

total_sales_rdd = rdd.reduceByKey(lambda a, b: a + b)



Think of reduceByKey as a process where you group items by category and then apply some operation to combine or summarize the values in each group. For instance, if you have bags of different types of fruits, reduceByKey would allow you to count how many apples, bananas, and oranges you have in total by going through each bag and adding the counts for each fruit.
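
The requirement that the function be associative and commutative follows from how the work is distributed: values are combined within each partition first, and the partial results are merged afterwards, so the grouping and order of the combinations must not change the answer. The plain-Python sketch below models that two-step process; the helper `reduce_by_key` and the two-way split into partitions are assumptions made for illustration, not Spark's implementation.

```python
sales = [("apple", 100), ("banana", 200), ("apple", 150),
         ("orange", 300), ("banana", 50)]

def reduce_by_key(pairs, func):
    """Combine the values of each key with func (hypothetical helper)."""
    result = {}
    for key, value in pairs:
        result[key] = func(result[key], value) if key in result else value
    return result

def add(a, b):
    return a + b

# Pretend the data is split across two partitions.
partitions = [sales[:3], sales[3:]]

# Step 1: combine values within each partition.
partials = [reduce_by_key(p, add) for p in partitions]

# Step 2: merge the partial results across partitions.
merged = reduce_by_key(
    [pair for partial in partials for pair in partial.items()], add
)

print(sorted(merged.items()))
```

Because addition is associative and commutative, the merged result matches what a single pass over all the records would produce.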

In [None]:
# example-01

from pyspark.sql import SparkSession

# Initialize a Spark session
spark = SparkSession.builder.appName("ReduceByKeyExample2").getOrCreate()

# Get the Spark context from the session
sc = spark.sparkContext

# List of sales records (product, amount)
sales = [("apple", 100), ("banana", 200), ("apple", 150), ("orange", 300), ("banana", 50)]

# Parallelize the data to create an RDD
rdd = sc.parallelize(sales)

# Use reduceByKey to sum the sales amounts by product
total_sales_rdd = rdd.reduceByKey(lambda a, b: a + b)

# Collect the results
total_sales = total_sales_rdd.collect()

print(total_sales)




[('apple', 250), ('banana', 250), ('orange', 300)]


In [None]:
# example-02


from pyspark.sql import SparkSession

# Initialize a Spark session
spark = SparkSession.builder.appName("ReduceByKeyWordCountExample").getOrCreate()

# Get the Spark context from the session
sc = spark.sparkContext

# List of words
words = ["apple", "banana", "apple", "orange", "banana", "apple"]

# Parallelize the data to create an RDD
rdd = sc.parallelize(words)

# Map each word to a key-value pair (word, 1)
pairs_rdd = rdd.map(lambda word: (word, 1))

# Use reduceByKey to count the occurrences of each word
word_counts_rdd = pairs_rdd.reduceByKey(lambda a, b: a + b)

# Collect the results
word_counts = word_counts_rdd.collect()

print(word_counts)

# Stop the Spark session
spark.stop()


# SortByKey

sortByKey Transformation in PySpark

Description:

The sortByKey transformation in PySpark is used to sort an RDD of key-value pairs by the key. It returns a new RDD with the elements sorted according to the keys in either ascending or descending order. This transformation is particularly useful when you want to organize data based on the keys.

Key-Value Pair RDD: sortByKey only works on RDDs where each element is a key-value pair (i.e., a tuple (key, value)).

Ascending/Descending Order: By default, sortByKey sorts the keys in ascending order. Pass ascending=False to sort in descending order.

Ordering of Equal Keys: Spark's documentation does not promise a stable sort, so do not rely on the relative order of elements that share the same key.
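
As a plain-Python model of the ordering (an analogy, not Spark's implementation), sorting by key means only the first element of each tuple is compared; the values simply travel along with their keys. The helper name `sort_by_key` is hypothetical.

```python
student_scores = [("John", 88), ("Alice", 95), ("Bob", 78), ("Diana", 85)]

def sort_by_key(pairs, ascending=True):
    """Order (key, value) pairs by key only (hypothetical helper)."""
    return sorted(pairs, key=lambda kv: kv[0], reverse=not ascending)

ascending_result = sort_by_key(student_scores)
descending_result = sort_by_key(student_scores, ascending=False)

print(ascending_result)
print(descending_result)
```

Note that the scores play no part in the ordering: Alice has the highest score but comes last when the names are sorted in descending order.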

In [None]:
#  example-01

from pyspark.sql import SparkSession

# Initialize a Spark session
spark = SparkSession.builder.appName("SortByKeyExample").getOrCreate()

# Get the Spark context from the session
sc = spark.sparkContext

# List of student names and scores
student_scores = [("John", 88), ("Alice", 95), ("Bob", 78), ("Diana", 85)]

# Parallelize the data to create an RDD
rdd = sc.parallelize(student_scores)

# Sort by key (the student name) in descending order
sort_rdd = rdd.sortByKey(ascending=False)


# Collect the results
sorted_scores = sort_rdd.collect()

print(sorted_scores)






[('John', 88), ('Diana', 85), ('Bob', 78), ('Alice', 95)]


In [None]:
# example -02

from pyspark.sql import SparkSession

# Initialize a Spark session
spark = SparkSession.builder.appName("SortByKeyDescendingExample").getOrCreate()

# Get the Spark context from the session
sc = spark.sparkContext

# List of product names and sales
sales_data = [("apple", 150), ("banana", 200), ("orange", 300), ("grape", 100)]

# Parallelize the data to create an RDD
rdd = sc.parallelize(sales_data)

# Use sortByKey to sort the RDD by product names in descending order
sorted_rdd = rdd.sortByKey(ascending=False)

# Collect the results
sorted_sales = sorted_rdd.collect()

print(sorted_sales)


[('orange', 300), ('grape', 100), ('banana', 200), ('apple', 150)]


# SortBy

The sortBy transformation in PySpark is used to sort an RDD based on the values derived from each element by applying a given function. Unlike sortByKey, which only sorts based on the key in key-value pair RDDs, sortBy provides more flexibility as it allows you to specify any function that extracts a value to sort by. This function can be applied to each element of the RDD, and the RDD will be sorted based on the result of this function.

Flexible Sorting: You can sort by any derived value, not just by keys.
Ascending/Descending Order: The sorting can be done in either ascending or descending order by specifying the ascending parameter.
Number of Partitions: You can also specify the number of partitions in the resulting RDD using the numPartitions parameter.
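
One way to see how sortBy relates to sortByKey: sortBy can be thought of as pairing each element with the value the function returns, sorting by that value, and then dropping it again. The plain-Python sketch below models those three steps; the helper `sort_by` is hypothetical and only illustrates the idea.

```python
words = ["banana", "apple", "grape", "pineapple", "orange"]

def sort_by(elements, keyfunc, ascending=True):
    """Decorate-sort-undecorate model of sortBy (hypothetical helper)."""
    # Pair each element with its derived key, like keyBy(keyfunc).
    keyed = [(keyfunc(x), x) for x in elements]
    # Sort on the derived key only, like sortByKey.
    keyed.sort(key=lambda kv: kv[0], reverse=not ascending)
    # Drop the derived key again, like values().
    return [x for _, x in keyed]

by_length = sort_by(words, len, ascending=False)
print(by_length)
```

In this sketch, ties such as "banana" and "orange" keep their input order because Python's sort is stable; on a distributed RDD the relative order of ties should not be relied on.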

In [None]:
#example-01

from pyspark.sql import SparkSession

# Initialize a Spark session
spark = SparkSession.builder.appName("SortByExample1").getOrCreate()

# Get the Spark context from the session
sc = spark.sparkContext

# List of student names and scores
student_scores = [("John", 88), ("Alice", 95), ("Bob", 78), ("Diana", 85)]


# Parallelize the data to create an RDD
rdd = sc.parallelize(student_scores)

# Use sortBy to sort the RDD by scores (second element of the tuple)
sorted_rdd = rdd.sortBy(lambda x: x[1], ascending=False)


# Collect the results
sorted_scores = sorted_rdd.collect()

print(sorted_scores)







[('Alice', 95), ('John', 88), ('Diana', 85), ('Bob', 78)]


In [None]:
# Example-02
from pyspark.sql import SparkSession

# Initialize a Spark session
spark = SparkSession.builder.appName("SortByExample2").getOrCreate()

# Get the Spark context from the session
sc = spark.sparkContext

# List of words
words = ["banana", "apple", "grape", "pineapple", "orange"]

# Parallelize the data to create an RDD
rdd = sc.parallelize(words)

# Use sortBy to sort the RDD by word length in descending order
sorted_rdd = rdd.sortBy(lambda x: len(x), ascending=False)

# Collect the results
sorted_words = sorted_rdd.collect()

print(sorted_words)




['pineapple', 'banana', 'orange', 'apple', 'grape']


How to create an RDD
map transformation
flatMap transformation
sortByKey
sortBy
Difference between sortBy and sortByKey

Actions:
take
collect

In [None]:
# Dataset

# Sample data: (customer_id, product_category, quantity, price_per_unit)
sales_data = [
    ("C001", "Electronics", 2, 599.99),
    ("C002", "Books", 5, 15.50),
    ("C001", "Clothing", 1, 79.99),
    ("C003", "Electronics", 1, 1299.99),
    ("C002", "Electronics", 3, 299.99),
    ("C004", "Books", 2, 25.00),
    ("C003", "Clothing", 4, 45.99),
    ("C001", "Books", 8, 12.99),
    ("C004", "Electronics", 1, 899.99),
    ("C002", "Clothing", 2, 89.99)
]

# Create RDD
sales_rdd = sc.parallelize(sales_data)


# PySpark RDD Transformations Assignment

## Objective
Practice PySpark RDD transformations: `map`, `flatMap`, `sortBy`, and `sortByKey` using real-world scenarios.

---

## Exercise 1: E-commerce Sales Analytics

### Scenario
You work for an e-commerce company and need to analyze customer purchase data to generate insights for the marketing team.

### Dataset
```python
# Sample data: (customer_id, product_category, quantity, price_per_unit)
sales_data = [
    ("C001", "Electronics", 2, 599.99),
    ("C002", "Books", 5, 15.50),
    ("C001", "Clothing", 1, 79.99),
    ("C003", "Electronics", 1, 1299.99),
    ("C002", "Electronics", 3, 299.99),
    ("C004", "Books", 2, 25.00),
    ("C003", "Clothing", 4, 45.99),
    ("C001", "Books", 8, 12.99),
    ("C004", "Electronics", 1, 899.99),
    ("C002", "Clothing", 2, 89.99)
]

# Create RDD
sales_rdd = sc.parallelize(sales_data)
```

### Tasks

#### Task 1.1: Using `map` transformation
Calculate the total amount spent per transaction.
- **Input**: `(customer_id, product_category, quantity, price_per_unit)`
- **Output**: `(customer_id, product_category, total_amount)`
- **Formula**: `total_amount = quantity * price_per_unit`
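
As a quick sanity check of the formula on a single record, here is a plain-Python sketch (no Spark involved). Rounding to two decimals is an assumption added here, not part of the task statement: binary floating point cannot represent every decimal price exactly, so unrounded products may show tiny artifacts.

```python
record = ("C002", "Electronics", 3, 299.99)

customer_id, category, quantity, price_per_unit = record

# total_amount = quantity * price_per_unit, rounded to cents.
total_amount = round(quantity * price_per_unit, 2)

print((customer_id, category, total_amount))
```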

```python
# Your code here
# Expected output format: ("C001", "Electronics", 1199.98)
```

#### Task 1.2: Using `flatMap` transformation
Extract all unique words from product categories and create a flattened list.
- **Input**: RDD with product categories
- **Output**: Flattened RDD of individual words
- **Note**: Split category names by spaces if any exist, convert to lowercase

```python
# Your code here
# Expected output: ["electronics", "books", "clothing", ...]
```

#### Task 1.3: Using `sortBy` transformation
Sort customers by their total spending across all transactions (descending order).
- First calculate total spending per customer
- Then sort by total amount in descending order

```python
# Your code here
# Expected output format: [("C003", total_amount), ("C001", total_amount), ...]
```

#### Task 1.4: Using `sortByKey` transformation
Create category-wise sales summary and sort by category name.
- **Input**: Calculate total sales amount per category
- **Output**: `(category, total_sales)` sorted by category name
- **Use**: `sortByKey()` for alphabetical sorting

```python
# Your code here
# Expected output format: [("Books", total_amount), ("Clothing", total_amount), ("Electronics", total_amount)]
```

---

## Exercise 2: Social Media Analytics

### Scenario
You're analyzing social media posts for a digital marketing agency to understand engagement patterns and trending topics.

### Dataset
```python
# Sample data: (user_id, post_content, likes, shares, timestamp)
social_posts = [
    ("U001", "Amazing sunset photography tips #photography #nature", 45, 12, "2024-01-15"),
    ("U002", "Best coding practices for beginners #coding #python #java", 78, 23, "2024-01-16"),
    ("U003", "Delicious homemade pizza recipe #food #cooking", 134, 45, "2024-01-15"),
    ("U001", "Mountain hiking adventure #nature #hiking #adventure", 89, 18, "2024-01-17"),
    ("U004", "AI and machine learning trends #AI #ML #technology", 156, 67, "2024-01-16"),
    ("U002", "Web development frameworks comparison #webdev #react #angular", 92, 34, "2024-01-18"),
    ("U003", "Healthy breakfast ideas #food #health #nutrition", 67, 15, "2024-01-17"),
    ("U005", "Travel photography in Europe #photography #travel", 203, 89, "2024-01-18"),
    ("U004", "Data science tools and techniques #datascience #python", 98, 41, "2024-01-19"),
    ("U001", "Sunset time-lapse creation #photography #timelapse", 112, 28, "2024-01-19")
]

# Create RDD
posts_rdd = sc.parallelize(social_posts)
```

### Tasks

#### Task 2.1: Using `map` transformation
Calculate engagement score for each post.
- **Formula**: `engagement_score = (likes * 1) + (shares * 2)`
- **Output**: `(user_id, post_content, engagement_score)`

```python
# Your code here
# Expected output format: ("U001", "Amazing sunset photography tips #photography #nature", 69)
```

#### Task 2.2: Using `flatMap` transformation
Extract all hashtags from posts to analyze trending topics.
- **Input**: Post content containing hashtags
- **Output**: Flattened RDD of individual hashtags (without # symbol)
- **Note**: Extract words starting with '#', remove the '#' symbol, convert to lowercase

```python
# Your code here
# Expected output: ["photography", "nature", "coding", "python", ...]
```

#### Task 2.3: Using `sortBy` transformation
Rank users by their average engagement score (descending order).
- Calculate average engagement score per user
- Sort users by average engagement in descending order

```python
# Your code here
# Expected output format: [("U005", avg_engagement), ("U004", avg_engagement), ...]
```

#### Task 2.4: Using `sortByKey` transformation
Create daily engagement summary sorted by date.
- **Input**: Calculate total engagement per day
- **Output**: `(date, total_daily_engagement)` sorted by date
- **Use**: `sortByKey()` for chronological sorting

```python
# Your code here
# Expected output format: [("2024-01-15", total_engagement), ("2024-01-16", total_engagement), ...]
```

---

## Bonus Challenge

Combine multiple transformations to solve this advanced problem:

### Challenge: Top Trending Hashtags by Day
Using the social media dataset, find the top 3 most frequently used hashtags for each day, sorted by date.

**Requirements:**
1. Use `flatMap` to extract hashtags with their dates
2. Use `map` to count hashtag frequencies per day
3. Use `sortByKey` to sort by date
4. Use `sortBy` to rank hashtags within each day

```python
# Your code here
# Expected output format:
# [("2024-01-15", [("photography", count), ("nature", count), ("food", count)]),
#  ("2024-01-16", [("coding", count), ("python", count), ("AI", count)]), ...]
```

---

## Submission Guidelines

1. **Code Quality**: Write clean, well-commented code
2. **Output**: Include sample outputs for each task
3. **Explanation**: Briefly explain your approach for each transformation
4. **Testing**: Test your code with the provided sample data
5. **Performance**: Consider the efficiency of your transformations

## Evaluation Criteria

- **Correctness**: Solutions produce expected outputs
- **Proper Use**: Correct application of specified transformations
- **Code Style**: Clean, readable, and well-documented code
- **Understanding**: Clear explanation of transformation logic

---

## Additional Notes

- Focus on the specified transformations: `map`, `flatMap`, `sortBy`, `sortByKey`; tasks that need per-key totals or averages also require an aggregation step such as `reduceByKey`
- Do not use actions like `collect()` or `count()` in transformation chains
- Focus on the transformation logic rather than Spark configuration
- Remember that RDD transformations are lazy and only executed when an action is called
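
The last point, lazy evaluation, can be illustrated with a plain-Python analogy: the built-in `map` also returns a lazy object and only calls the function once the result is consumed. This is a sketch of the idea, not of how Spark schedules work; the `log` list is only there to record when the function actually runs.

```python
log = []

def double(x):
    log.append(x)  # record that the function actually ran
    return x * 2

numbers = [1, 2, 3, 4, 5]

# Building the pipeline runs nothing yet, like calling rdd.map(...).
pipeline = map(double, numbers)
calls_before = len(log)

# Consuming the pipeline triggers the work, like an action such as collect().
result = list(pipeline)
calls_after = len(log)

print(calls_before, calls_after, result)
```

The same reasoning explains why a chain of RDD transformations returns immediately and why errors inside a lambda often surface only when an action such as `collect()` or `take()` runs.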

flatMap operations

Problem Statement 1 – Split Sentences into Words

Question:
You are given an RDD containing sentences:
["I love learning PySpark", "PySpark is powerful", "Big data is amazing"]
Use flatMap() to split each sentence into words and print all individual words.

In [None]:
sentences = ["I|love|learning|PySpark", "PySpark|is|powerful", "Big|data|is|amazing"]