<a href="https://colab.research.google.com/github/imtheguna/PySpark-Learning/blob/GoogleColab/11_PySpark_Joins.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!apt-get update # Refresh the apt package index.
!apt-get install openjdk-8-jdk-headless -qq > /dev/null # Install Java 8.
!wget -q http://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz # Download Apache Spark 3.1.1.
!tar xf spark-3.1.1-bin-hadoop3.2.tgz # Extract the tgz archive.
!pip install -q findspark # Install findspark, which adds PySpark to sys.path at runtime.

# Set environment variables
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.1.1-bin-hadoop3.2"

!ls

# Initialize findspark
import findspark
findspark.init()
!pip install pyspark


In [16]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *

# Create (or reuse) the Spark session used by every join example below.
spark = SparkSession \
      .builder \
      .appName('PySparkJoins').getOrCreate()


df1 = spark.createDataFrame([(1,'Guna'),(2,'Arun'),(3,'Pavi')],['id','name'])

df2 = spark.createDataFrame([(1,23),(4,23),(3,30)],['id','age'])

df1.show()
df2.show()


+---+----+
| id|name|
+---+----+
|  1|Guna|
|  2|Arun|
|  3|Pavi|
+---+----+

+---+---+
| id|age|
+---+---+
|  1| 23|
|  4| 23|
|  3| 30|
+---+---+



In [17]:
## Inner Join
# An inner join returns only the rows whose key appears in both DataFrames.
# Here ids 1 and 3 exist in both df1 and df2, so only those two rows are kept.

df_inner = df1.join(df2,on=['id'],how='inner')

df_inner.show(5)

+---+----+---+
| id|name|age|
+---+----+---+
|  1|Guna| 23|
|  3|Pavi| 30|
+---+----+---+
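
To make the matching rule concrete, here is a minimal plain-Python sketch of inner-join semantics on the same sample rows. It only illustrates which rows survive; it is not how Spark executes the join, and the names `left`, `right`, and `age_by_id` are illustrative.

```python
# Plain-Python sketch of inner-join semantics (not Spark's implementation).
left = [(1, 'Guna'), (2, 'Arun'), (3, 'Pavi')]   # (id, name), same rows as df1
right = [(1, 23), (4, 23), (3, 30)]              # (id, age), same rows as df2

# Index the right side by key; ids are unique in this sample.
age_by_id = dict(right)

# Keep a left row only when its id also exists on the right.
inner_rows = [(i, name, age_by_id[i]) for i, name in left if i in age_by_id]

print(inner_rows)
```

Ids 2 and 4 each appear on only one side, so neither reaches the result.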



In [18]:
## Outer Join
# An outer join, also known as a full join, returns all rows from both DataFrames.
# Where a key is present in only one DataFrame, the columns from the other are filled with nulls.

df_outer = df1.join(df2,on=['id'],how='outer')

df_outer.show(5)

+---+----+----+
| id|name| age|
+---+----+----+
|  1|Guna|  23|
|  3|Pavi|  30|
|  2|Arun|null|
|  4|null|  23|
+---+----+----+
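
The null-filling behaviour can be sketched in plain Python by taking the union of the keys and looking each one up on both sides. This is an illustration under the assumption that ids are unique on each side; `None` stands in for Spark's null, and the variable names are illustrative.

```python
# Plain-Python sketch of full-outer-join semantics (not Spark's implementation).
name_by_id = dict([(1, 'Guna'), (2, 'Arun'), (3, 'Pavi')])   # same rows as df1
age_by_id = dict([(1, 23), (4, 23), (3, 30)])                # same rows as df2

# Every id from either side appears exactly once.
all_ids = sorted(set(name_by_id) | set(age_by_id))

# dict.get returns None when the id is missing on that side.
outer_rows = [(i, name_by_id.get(i), age_by_id.get(i)) for i in all_ids]

print(outer_rows)
```

The sketch sorts the ids for readability; Spark does not guarantee any particular row order in the join result.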



In [19]:
## Left Join
# A left join returns all rows from the left DataFrame and the matching rows from the right DataFrame.
# Where a left key has no match in the right DataFrame, the right-hand columns are null.

df_left = df1.join(df2,on=['id'],how='left')

df_left.show(5)

+---+----+----+
| id|name| age|
+---+----+----+
|  1|Guna|  23|
|  3|Pavi|  30|
|  2|Arun|null|
+---+----+----+
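
As a rough illustration, a left join can be pictured as a loop over the left rows with an optional lookup on the right. The helper `left_join` below is hypothetical and assumes unique ids on the right side.

```python
# Plain-Python sketch of left-join semantics (not Spark's implementation).
def left_join(left_rows, right_rows):
    """Keep every left row; attach the right value, or None when there is no match."""
    lookup = dict(right_rows)
    return [(key, value, lookup.get(key)) for key, value in left_rows]

names = [(1, 'Guna'), (2, 'Arun'), (3, 'Pavi')]   # same rows as df1
ages = [(1, 23), (4, 23), (3, 30)]                # same rows as df2

left_rows = left_join(names, ages)
print(left_rows)
```

Id 4 exists only on the right, so it never appears; id 2 exists only on the left, so it is kept with a `None` age.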



In [20]:
## Right Join
# A right join returns all rows from the right DataFrame and the matching rows from the left DataFrame.
# Where a right key has no match in the left DataFrame, the left-hand columns are null.

df_right = df1.join(df2,on=['id'],how='right')

df_right.show(5)

+---+----+---+
| id|name|age|
+---+----+---+
|  1|Guna| 23|
|  3|Pavi| 30|
|  4|null| 23|
+---+----+---+
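
A right join mirrors the left join: the loop runs over the right rows and the lookup happens on the left. The sketch below assumes unique ids on the left side and uses illustrative variable names.

```python
# Plain-Python sketch of right-join semantics (not Spark's implementation).
names = [(1, 'Guna'), (2, 'Arun'), (3, 'Pavi')]   # same rows as df1
ages = [(1, 23), (4, 23), (3, 30)]                # same rows as df2

name_by_id = dict(names)

# Iterate over the right side so that every right row survives.
right_rows = [(i, name_by_id.get(i), age) for i, age in ages]

print(right_rows)
```

Id 2 is dropped because it has no row on the right, while id 4 is kept with a `None` name.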



In [21]:
## Left Semi Join
# A left semi join returns the rows of the left DataFrame whose key has a match in the right DataFrame.
# It resembles an inner join, but only the left DataFrame's columns are returned.

df_semi = df1.join(df2,on=['id'],how='left_semi')

df_semi.show(5)

+---+----+
| id|name|
+---+----+
|  1|Guna|
|  3|Pavi|
+---+----+
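
One difference from an inner join is easy to miss: a semi join acts as a filter, so a left row is returned at most once even when several right rows match it. The sketch below shows this in plain Python; the right-hand rows are hypothetical and repeat id 1 on purpose.

```python
# Plain-Python sketch contrasting left-semi and inner semantics (not Spark's implementation).
names = [(1, 'Guna'), (2, 'Arun'), (3, 'Pavi')]   # same rows as df1
ages = [(1, 23), (1, 24), (3, 30)]                # hypothetical: id 1 appears twice

# Semi join: filter the left rows by membership in the set of right keys.
right_ids = {i for i, _ in ages}
semi_rows = [row for row in names if row[0] in right_ids]

# Inner join: one output row per matching (left, right) pair.
inner_rows = [(i, name, age) for i, name in names for j, age in ages if i == j]

print(semi_rows)
print(inner_rows)
```

The semi join yields two rows, while the inner join yields three because id 1 matches twice.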



In [22]:
## Left Anti Join
# A left anti join returns the rows of the left DataFrame whose key has no match in the right DataFrame.
# It is the complement of a left semi join.

df_anti = df1.join(df2,on=['id'],how='left_anti')

df_anti.show(5)

+---+----+
| id|name|
+---+----+
|  2|Arun|
+---+----+
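
Because the anti join keeps exactly the left rows that the semi join drops, the two results together partition the left DataFrame. A small plain-Python check of that relationship, with illustrative variable names:

```python
# Plain-Python sketch: left-semi and left-anti results partition the left rows.
names = [(1, 'Guna'), (2, 'Arun'), (3, 'Pavi')]   # same rows as df1
ages = [(1, 23), (4, 23), (3, 30)]                # same rows as df2

right_ids = {i for i, _ in ages}

semi_rows = [row for row in names if row[0] in right_ids]       # key matched
anti_rows = [row for row in names if row[0] not in right_ids]   # key not matched

print(anti_rows)
# Every left row lands in exactly one of the two results.
print(sorted(semi_rows + anti_rows) == sorted(names))
```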



In [24]:
## Cross Join
# A cross join, also known as a cartesian join, returns the cartesian product of both DataFrames.
# It pairs each row of the left DataFrame with each row of the right DataFrame,
# so 3 x 3 = 9 rows here. No key is used, so both id columns are kept.

df_cross = df1.crossJoin(df2)

df_cross.show()

+---+----+---+---+
| id|name| id|age|
+---+----+---+---+
|  1|Guna|  1| 23|
|  1|Guna|  4| 23|
|  1|Guna|  3| 30|
|  2|Arun|  1| 23|
|  3|Pavi|  1| 23|
|  2|Arun|  4| 23|
|  2|Arun|  3| 30|
|  3|Pavi|  4| 23|
|  3|Pavi|  3| 30|
+---+----+---+---+
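
The cartesian product can be reproduced in plain Python with `itertools.product`, which makes the row count easy to verify. This is only a sketch of the semantics with illustrative variable names.

```python
# Plain-Python sketch of cross-join semantics (not Spark's implementation).
from itertools import product

names = [(1, 'Guna'), (2, 'Arun'), (3, 'Pavi')]   # same rows as df1
ages = [(1, 23), (4, 23), (3, 30)]                # same rows as df2

# Pair every left row with every right row and concatenate the tuples.
cross_rows = [n + a for n, a in product(names, ages)]

print(len(cross_rows))
print(cross_rows[0])
```

The result size is the product of the two input sizes, so a cross join grows quickly on large inputs.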

