
# Apache Spark DataFrame Joins 


## Step 1: Create Spark Session

In [None]:

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("SparkJoinExamples").getOrCreate()


## Step 2: Create Sample DataFrames

In [None]:

employees = [
    (1, "Ravi", "IT", 50000),
    (2, "Priya", "HR", 45000),
    (3, "Ankit", "IT", 55000),
    (4, "Lakshmi", "Finance", 60000),
    (5, "John", None, 40000)  # no department: exercises unmatched-row handling
]

departments = [
    ("IT", "Bengaluru"),
    ("HR", "Hyderabad"),
    ("Finance", "Chennai"),
    ("Marketing", "Pune")  # no employees: exercises unmatched-row handling
]

df_emp = spark.createDataFrame(employees, ["emp_id", "name", "dept_name", "salary"])
df_dept = spark.createDataFrame(departments, ["dept_name", "location"])

df_emp.show()
df_dept.show()



## Inner Join  
**What it does:** Returns only the rows whose `dept_name` appears in both DataFrames.  
**Use case:** To fetch employee details along with their department location.  
**Expected output:** Only employees whose department exists in `df_dept`. John (`null` department) and the Marketing department (no employees) are both dropped.


In [None]:

inner_join_df = df_emp.join(df_dept, on="dept_name", how="inner")
inner_join_df.show()



## Left Join (Left Outer Join)  
**What it does:** Returns all rows from the left DataFrame (`df_emp`) and matching rows from the right (`df_dept`).  
**Use case:** To list all employees, even if their department doesn't exist in the department list.  
**Expected output:** Every employee, with `location` as `null` where no department matches (here, John).


In [None]:

left_join_df = df_emp.join(df_dept, on="dept_name", how="left")
left_join_df.show()



## Right Join (Right Outer Join)  
**What it does:** Returns all rows from the right DataFrame (`df_dept`) and matching rows from the left (`df_emp`).  
**Use case:** To find all departments, even those that have no employees.  
**Expected output:** Every department, with `null` in the employee columns for departments that have no employees (here, Marketing). John is dropped because his `dept_name` is `null`.


In [None]:

right_join_df = df_emp.join(df_dept, on="dept_name", how="right")
right_join_df.show()



## Full Outer Join  
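**What it does:** Returns all records from both DataFrames, matching where possible and filling the missing side with `null`.  
**Use case:** To perform a full comparison between employees and departments.  
**Expected output:** All departments and employees, matched or unmatched: the four matched employees, John with a `null` `location`, and Marketing with `null` employee columns.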


In [None]:

full_outer_df = df_emp.join(df_dept, on="dept_name", how="outer")
full_outer_df.show()



## Left Semi Join  
**What it does:** Returns only employees whose department exists in the department DataFrame, keeping the columns of `df_emp` only.  
**Use case:** To keep only employees with a valid department reference, without adding any department columns.  
**Expected output:** Employees belonging to valid departments only, with no `location` column.


In [None]:

semi_join_df = df_emp.join(df_dept, on="dept_name", how="left_semi")
semi_join_df.show()



## Left Anti Join  
**What it does:** Returns employees whose department does **not** exist in the department DataFrame.  
**Use case:** To find mismatched or orphaned records.  
**Expected output:** Employees whose `dept_name` has no match in `df_dept`, including those whose `dept_name` is `null` (here, John).


In [None]:

anti_join_df = df_emp.join(df_dept, on="dept_name", how="left_anti")
anti_join_df.show()



## Cross Join (Cartesian Product)  
**What it does:** Returns all possible combinations of rows between both DataFrames.  
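**Use case:** Simulation, testing, or generating combinations. Rare in production because the result grows as the product of both row counts.  
**Expected output:** Every employee combined with every department: 5 × 4 = 20 rows, of which `show(10)` prints the first 10.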


In [None]:

cross_join_df = df_emp.crossJoin(df_dept)
cross_join_df.show(10)



## Join with Multiple Conditions  
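**What it does:** Joins on more than one condition, combined with `&`.  
**Use case:** When the relationship depends on multiple columns (like `dept_name` and `country`).  
**Expected output:** Employees matched with departments in India having the same `dept_name`. Because the join uses an expression rather than a column name, the result contains both `dept_name` columns.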


In [None]:

dept_extended = [
    ("IT", "Bengaluru", "India"),
    ("HR", "Hyderabad", "India"),
    ("Finance", "Chennai", "India"),
    ("Marketing", "Pune", "India")
]
df_dept2 = spark.createDataFrame(dept_extended, ["dept_name", "location", "country"])

multi_cond_df = df_emp.join(
    df_dept2,
    (df_emp.dept_name == df_dept2.dept_name) & (df_dept2.country == "India"),
    "inner"
)
multi_cond_df.show()



## Joins Using Aliases  
**What it does:** Simplifies references to column names when both DataFrames have overlapping names.  
**Use case:** Improves readability in joins where multiple columns share the same name.  
**Expected output:** Selected columns with clear prefixes or labels.


In [None]:

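from pyspark.sql.functions import col

# Alias each DataFrame so its columns can be referenced by a qualified name
emp_alias = df_emp.alias("e")
dept_alias = df_dept.alias("d")

alias_join_df = emp_alias.join(
    dept_alias,
    col("e.dept_name") == col("d.dept_name"),  # qualified names remove ambiguity
    "inner"
).select("e.emp_id", "e.name", "d.location")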
alias_join_df.show()



## Handling Duplicate Columns After Join  
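**What it does:** Avoids duplicate join columns and keeps only the fields you need.  
**Use case:** When both DataFrames share a column name. Passing the join key as a string (`"dept_name"`) makes Spark keep a single copy of it, and `select` then picks and orders the remaining columns.  
**Expected output:** One `dept_name` column alongside `emp_id`, `name`, `location`, and `salary`.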


In [None]:

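# Joining on the column name (a string) keeps a single dept_name column
join_with_duplicates = df_emp.join(df_dept, "dept_name", "inner")
# Select only the required fields, in the desired order
clean_df = join_with_duplicates.select("emp_id", "name", "dept_name", "location", "salary")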
clean_df.show()



## Validate Join Results  
**What it does:** Performs post-join validation to check data consistency.  
**Use case:** Ensures all employees have a valid department and identifies mismatches.  
**Expected output:** The number of employees with no matching department. With the sample data this is 1 (John, whose `dept_name` is `null`).


In [None]:

missing_dept_employees = df_emp.join(df_dept, "dept_name", "left_anti").count()
print(f"Employees with missing department: {missing_dept_employees}")



## Summary of Join Types

| Join Type | Syntax Example | Rows Returned |
|------------|----------------|----------------|
| Inner | `df1.join(df2, on, "inner")` | Matching rows only |
| Left Outer | `df1.join(df2, on, "left")` | All left + matching right |
| Right Outer | `df1.join(df2, on, "right")` | All right + matching left |
| Full Outer | `df1.join(df2, on, "outer")` | All rows from both sides |
| Left Semi | `df1.join(df2, on, "left_semi")` | Left rows with a match (left columns only) |
| Left Anti | `df1.join(df2, on, "left_anti")` | Left rows without a match (left columns only) |
| Cross Join | `df1.crossJoin(df2)` | Cartesian product |
| Multi-condition | `df1.join(df2, cond1 & cond2)` | Matches on multiple conditions |
| Alias Join | `df1.alias("a").join(df2.alias("b"), ...)` | Cleaner joins using aliases |
