# PySpark Installation

In [None]:
!sudo apt update
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
#Check this site for the latest download link https://www.apache.org/dyn/closer.lua/spark/spark-3.2.1/spark-3.2.1-bin-hadoop3.2.tgz
!wget -q https://dlcdn.apache.org/spark/spark-3.2.1/spark-3.2.1-bin-hadoop3.2.tgz
!tar xf spark-3.2.1-bin-hadoop3.2.tgz
!pip install -q findspark
!pip install pyspark
!pip install py4j

import os
import sys
# os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
# os.environ["SPARK_HOME"] = "/content/spark-3.2.1-bin-hadoop3.2"


import findspark
findspark.init()
findspark.find()

import pyspark

from pyspark.sql import DataFrame, SparkSession
from typing import List
import pyspark.sql.types as T
import pyspark.sql.functions as F

spark = SparkSession \
        .builder \
        .appName("Our First Spark Example") \
        .getOrCreate()

spark



# Agenda


  ▶ saveAsTextFile

  ▶ coalesce

  ▶ repartition

  ▶ foreach

  ▶ pipe

  ▶ join

  ▶ union

  ▶ mapPartitions

# saveAsTextFile

The saveAsTextFile function saves the contents of an RDD (Resilient Distributed Dataset) as text files in a specified directory. Each element in the RDD is written as a separate line, using its string representation. This is useful for exporting data after processing so that it can be used by other systems or kept for storage.

Key Points:

File Format: Saves each element of the RDD as a line in a text file.

Directory Structure: It creates a directory with part files inside it (e.g., part-00000), representing different partitions of the RDD.

In [None]:
# Creating a simple RDD
sc = spark.sparkContext
rdd = sc.parallelize(["hello", "world", "how", "are", "you", "this", "is", "pyspark"])
# Reduce to at most 3 partitions so that at most 3 part files are written
rdd1 = rdd.coalesce(3)

# Saving the RDD as text; the target directory must not already exist
rdd1.saveAsTextFile("outputfile126")

# Result: A directory named 'outputfile126' is created with part files.


# coalesce

The coalesce function is used to reduce the number of partitions in an RDD. This is particularly useful for optimizing performance when writing data to disk or when you know the dataset is small enough to fit in fewer partitions. By default it does not perform a full shuffle of the data; it merges existing partitions, which makes it more efficient than repartition for reducing the number of partitions.

Key Points:

Reduces Partitions: Fewer partitions mean fewer output files and less per-task overhead when writing data.

Efficient: Does not involve a full data shuffle, making it faster for reducing partitions.

# repartition

The repartition function in PySpark is a powerful tool used to change the number of partitions in an RDD or DataFrame. Partitions are the logical divisions of data in PySpark, and they play a critical role in how data is distributed across the cluster, affecting performance, parallelism, and resource management.

Concept:

repartition allows you to increase or decrease the number of partitions in an RDD or DataFrame.

When you increase the number of partitions, repartition distributes the data across more partitions to improve parallelism.

When you decrease the number of partitions, repartition still performs a full shuffle, which spreads the data roughly evenly across the new partitions. If an even distribution does not matter, coalesce is usually the cheaper way to reduce the partition count.

Full Data Shuffle:

Unlike coalesce, which minimizes shuffling, repartition always performs a full shuffle of the data across the nodes in the cluster.

This is essential when you need to ensure that the data is balanced evenly across partitions, particularly when increasing the number of partitions.

When to Use repartition:

Increase Parallelism: When your computation is slow because there are too few partitions to keep all available cores busy.

Balance Workload: When you need to distribute data more evenly across the cluster to avoid "skewness."

Optimize Join Operations: When preparing data for join operations where balanced partitions can improve performance.

Prepare Data for Output: When writing data out to disk, repartitioning can help manage the number of output files.

In [None]:
rdd = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 8, 9], 5)
# print("par", rdd.getNumPartitions())
rdd2 = rdd.repartition(3)
print("par", rdd2.getNumPartitions())

par 3


# foreach

The foreach function in PySpark is used to apply a function to each element of an RDD or DataFrame without returning any value. It is particularly useful when you want to perform side effects, such as writing data to an external system, logging, or updating a database. Unlike map, which transforms data and returns a new RDD or DataFrame, foreach is intended for actions that do not produce a result.

What is foreach?

Concept:

The foreach function iterates over each element in the RDD or DataFrame and applies a specified function to it. The function runs on the executors, not on the driver.

It is often used for operations that have side effects, like writing data to a database, printing to the console, or updating some external state.

Since foreach triggers execution and does not return a new RDD, it is categorized as an action in PySpark (as opposed to a transformation like map).

When to Use foreach:

Logging: When you want to log information about each element in the RDD for debugging or monitoring.

Database Operations: When you need to insert or update records in a database based on the elements in the RDD.

External API Calls: When you want to send data to an external service or API for each element.

In [None]:
rdd = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 8, 9], 5)
# print runs on the executors, so on a cluster the output lands in the executor logs
rdd.foreach(lambda x: print(x))


In [None]:
def save_to_external_system(data):
    print(f"Saving {data} to an external system...")

rdd = sc.parallelize([("user1", 100), ("user2", 200), ("user3", 150)])
rdd.foreach(save_to_external_system)
# foreach processes each element for its side effect and returns None; no new RDD is created

# pipe

The pipe function in PySpark allows you to send the elements of an RDD through a shell command or an external program, and then collect the processed results back into a new RDD. This feature is particularly useful when you need to integrate PySpark with existing command-line tools or external programs that can process text-based data.

What is pipe?


The pipe function sends the data from each partition of an RDD to a shell command or external program. Each element is written to the command's standard input as one line, and each line the command prints to standard output becomes a string element of the new RDD.

The command is started as a separate process for each partition, meaning that it operates independently on each partition.

Use Cases:

Integrating with Shell Commands: When you have existing scripts or command-line tools that perform specific processing tasks.

Interfacing with External Programs: When you need to leverage tools outside of PySpark for tasks such as data formatting, statistical analysis, or other specialized processing.

Complex Data Processing: When certain operations are easier or more efficient to implement in a shell command than in PySpark.

In [None]:
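rdd = sc.parallelize(["hello world", "how are you", "this is pyspark"])
rdd_pipe1 = rdd.pipe("wc -l")       # counts the lines once per partition
rdd_pipe = rdd.pipe("tr a-z A-Z")   # upper-cases every element
rdd_pipe.collect()
# Only the last expression of a cell is displayed: one line count per partition
rdd_pipe1.collect()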

['1', '2']

# Join

The join operation in PySpark is used to combine two RDDs or DataFrames based on a common key. This is similar to SQL joins, where you combine tables based on a related column. join is particularly useful when working with distributed datasets where you need to merge data from different sources or tables based on shared keys.

What is join?


The join operation merges two pair RDDs, that is, RDDs of (key, value) tuples, or two DataFrames that share a key column. For RDDs, the result is a new RDD where each key is associated with a tuple containing the values from both datasets.

Keys do not need to be unique. If a key occurs several times on either side, the result contains one record for every combination of matching values.

Types of Joins:

Inner Join: The default join type. Only the keys present in both RDDs or DataFrames are included in the result.

Left Outer Join: Includes all keys from the left RDD/DataFrame and their associated values. If the key is not present in the right RDD/DataFrame, the result includes None for those values.

Right Outer Join: Includes all keys from the right RDD/DataFrame and their associated values. If the key is not present in the left RDD/DataFrame, the result includes None for those values.

Full Outer Join: Includes all keys from both RDDs/DataFrames. Where keys do not match, None is used for missing values.

Use Cases:

Combining Related Data: When you have related datasets (e.g., customer information and order data) that you want to combine based on a common key (e.g., customer ID).

Data Integration: Merging data from different sources, such as joining sales records with product details.

Data Enrichment: Adding additional information to a dataset by joining it with another dataset that contains more detailed or related information.

In [None]:
rdd_names = sc.parallelize([("user1", "hari"), ("user2", "hari3"), ("user3", "hari2"),
                            ("user4", "hari1"), ("user5", "ari")])
rdd_purchases = sc.parallelize([("user1", 100), ("user2", 200), ("user3", 150)])
rdd_names.join(rdd_purchases).collect()

[('user2', ('hari3', 200)),
 ('user3', ('hari2', 150)),
 ('user1', ('hari', 100))]

In [None]:
rdd_names.leftOuterJoin(rdd_purchases).collect()

[('user2', ('hari3', 200)),
 ('user3', ('hari2', 150)),
 ('user4', ('hari1', None)),
 ('user1', ('hari', 100)),
 ('user5', ('ari', None))]

In [None]:
rdd_names.rightOuterJoin(rdd_purchases).collect()

[('user2', ('hari3', 200)),
 ('user3', ('hari2', 150)),
 ('user1', ('hari', 100))]

In [None]:
rdd_names.fullOuterJoin(rdd_purchases).collect()

[('user2', ('hari3', 200)),
 ('user3', ('hari2', 150)),
 ('user4', ('hari1', None)),
 ('user1', ('hari', 100)),
 ('user5', ('ari', None))]

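# union

The union operation returns a new RDD that contains all elements of both RDDs. Duplicates are kept; unlike SQL UNION, no deduplication takes place.

In [None]:
rdd = sc.parallelize([1, 2, 3, 4, 5])
rdd1 = sc.parallelize([6, 7, 8, 3, 2, 9, 10])
# Combine both RDDs; elements of rdd come first, and the duplicates 2 and 3 are kept
rdd_union = rdd.union(rdd1)
rdd_union.collect()

[1, 2, 3, 4, 5, 6, 7, 8, 3, 2, 9, 10]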

# mapPartitions

What is mapPartitions in PySpark?

mapPartitions is a transformation on PySpark RDDs that lets you process an entire partition at once, instead of processing one record at a time as map does.

It passes your function an iterator over the whole partition, and the function returns another iterator.

In [None]:
rdd = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 8, 9], 5)

def partition_sum(iterator):
    # Called once per partition; yields a single value, the partition's sum
    yield sum(iterator)

rdd2 = rdd.mapPartitions(partition_sum)
rdd2.collect()


[3, 7, 11, 15, 17]