# PySpark Data Joins - Simple Example

## Goal:
Join **Transactions** with **Cards**, then add merchant category descriptions, to create enriched transaction data.

## Files we'll use:
- `transactions_data.csv` - Transaction records
- `cards_data.csv` - Card information
- `mcc.json` - Merchant category code (MCC) descriptions

Let's see how to join these datasets!

In [8]:
# Simple imports for joining data
from pyspark.sql import SparkSession
from pyspark.sql.functions import col  # only col() is used; avoids shadowing builtins such as sum/max

# Initialize Spark session
spark = SparkSession.builder \
    .appName("SimpleDataJoin") \
    .master("local[2]") \
    .config("spark.driver.bindAddress", "127.0.0.1") \
    .getOrCreate()

spark.sparkContext.setLogLevel("WARN")
print(f"✓ Spark {spark.version} initialized")

✓ Spark 3.5.3 initialized


## 1. Load the Two Datasets

Load transactions and cards data to understand what we're working with:

In [9]:
# Load the two datasets we want to join
print("Loading datasets...")

# 1. Transaction data
transactions_df = spark.read.csv("../data/transactions_data.csv", header=True, inferSchema=True)
print(f"✓ Transactions: {transactions_df.count():,} rows, {len(transactions_df.columns)} columns")

# 2. Cards data
cards_df = spark.read.csv("../data/cards_data.csv", header=True, inferSchema=True)  
print(f"✓ Cards: {cards_df.count():,} rows, {len(cards_df.columns)} columns")

# Look at the structure
print("\nTransactions schema:")
transactions_df.printSchema()

print("\nCards schema:")
cards_df.printSchema()

Loading datasets...
✓ Transactions: 13,305,915 rows, 12 columns
✓ Cards: 6,146 rows, 13 columns

Transactions schema:
root
 |-- id: integer (nullable = true)
 |-- date: timestamp (nullable = true)
 |-- client_id: integer (nullable = true)
 |-- card_id: integer (nullable = true)
 |-- amount: string (nullable = true)
 |-- use_chip: string (nullable = true)
 |-- merchant_id: integer (nullable = true)
 |-- merchant_city: string (nullable = true)
 |-- merchant_state: string (nullable = true)
 |-- zip: double (nullable = true)
 |-- mcc: integer (nullable = true)
 |-- errors: string (nullable = true)


Cards schema:
root
 |-- id: integer (nullable = true)
 |-- client_id: integer (nullable = true)
 |-- card_brand: string (nullable = true)
 |-- card_type: string (nullable = true)
 |-- card_number: long (nullable = true)
 |-- expires: string (nullable = true)
 |-- cvv: integer (nullable = true)
 |-- has_chip: string (nullable = true)
 |-- num_cards_issued: integer (nullable = true)
 |-- credit_limit: string (nullab

In [10]:
# Explore sample data to understand the relationship
print("Sample data from each dataset:")
print("="*40)

print("TRANSACTIONS (first 5 rows):")
transactions_df.show(5)

print("CARDS (first 5 rows):")
cards_df.show(5)

# Find the common column for joining
tx_cols = set(transactions_df.columns)
cards_cols = set(cards_df.columns)
common_cols = tx_cols.intersection(cards_cols)

print(f"Common columns for joining: {common_cols}")

# Check join key statistics
if 'client_id' in common_cols:
    tx_clients = transactions_df.select("client_id").distinct().count()
    card_clients = cards_df.select("client_id").distinct().count()
    print(f"Unique clients in transactions: {tx_clients:,}")
    print(f"Unique clients in cards: {card_clients:,}")

Sample data from each dataset:
TRANSACTIONS (first 5 rows):


+-------+-------------------+---------+-------+-------+-----------------+-----------+-------------+--------------+-------+----+------+
|     id|               date|client_id|card_id| amount|         use_chip|merchant_id|merchant_city|merchant_state|    zip| mcc|errors|
+-------+-------------------+---------+-------+-------+-----------------+-----------+-------------+--------------+-------+----+------+
|7475327|2010-01-01 00:01:00|     1556|   2972|$-77.00|Swipe Transaction|      59935|       Beulah|            ND|58523.0|5499|  NULL|
|7475328|2010-01-01 00:02:00|      561|   4575| $14.57|Swipe Transaction|      67570|   Bettendorf|            IA|52722.0|5311|  NULL|
|7475329|2010-01-01 00:02:00|     1129|    102| $80.00|Swipe Transaction|      27092|        Vista|            CA|92084.0|4829|  NULL|
|7475331|2010-01-01 00:05:00|      430|   2860|$200.00|Swipe Transaction|      27092|  Crown Point|            IN|46307.0|4829|  NULL|
|7475332|2010-01-01 00:06:00|      848|   3915| $46.41|



Unique clients in transactions: 1,219
Unique clients in cards: 2,000

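The key statistics above already hint at what a join on `client_id` will do: 6,146 cards spread over 2,000 clients means `client_id` repeats in the cards table, so a transaction can match several cards. As a minimal plain-Python sketch (no Spark needed; the function name and the toy keys are made up for illustration), the row count of a left join can be predicted from how often each key occurs on the right side:

```python
from collections import Counter


def left_join_row_count(left_keys, right_keys):
    """Predict how many rows a LEFT JOIN on a single key would return.

    Each left row yields one output row per matching right row,
    or exactly one row (padded with NULLs) when nothing matches.
    """
    right_counts = Counter(right_keys)
    return sum(max(right_counts[key], 1) for key in left_keys)


# Toy data, not taken from the real files: client 1 owns three cards,
# client 2 owns one card, client 3 has no card on record.
tx_client_ids = [1, 1, 2, 3]
card_client_ids = [1, 1, 1, 2]

print(left_join_row_count(tx_client_ids, card_client_ids))  # 3 + 3 + 1 + 1 = 8
```

When the right-side key is unique, the function returns `len(left_keys)`, which is usually what an enrichment join is expected to do.
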
## 2. Join the Datasets

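Join transactions with cards on `client_id` to get enriched transaction data. Note that `client_id` is not unique in the cards table, so each transaction is matched to every card held by that client: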

In [11]:
# Perform the join - LEFT JOIN to keep all transactions
joined_data = transactions_df.alias("tx") \
    .join(cards_df.alias("cards"), 
          col("tx.client_id") == col("cards.client_id"), 
          "left")

print(f"✓ Join completed: {joined_data.count():,} records")

# Show sample joined data
print("\nSample joined data:")
joined_data.select("tx.client_id", "tx.amount", "tx.mcc", "cards.card_type").show(5)



✓ Join completed: 51,115,337 records

Sample joined data:
+---------+-------+----+---------------+
|client_id| amount| mcc|      card_type|
+---------+-------+----+---------------+
|     1556|$-77.00|5499|         Credit|
|     1556|$-77.00|5499|Debit (Prepaid)|
|     1556|$-77.00|5499|         Credit|
|     1556|$-77.00|5499|          Debit|
|      561| $14.57|5311|         Credit|
+---------+-------+----+---------------+
only showing top 5 rows

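In the output above each transaction appears once per card of its client (the `$-77.00` purchase of client 1556 is listed four times), which is how 13,305,915 transactions become 51,115,337 rows. If only the card actually used should be attached, a join on the card identifier keeps one row per transaction. The sketch below assumes that `card_id` in the transactions schema refers to `id` in the cards schema (the schemas suggest this, but the source does not state it); the function name and the `card_client_id` column name are hypothetical.

```python
def join_transactions_to_cards(transactions_df, cards_df):
    """Left-join each transaction to the single card it was made with.

    Assumes transactions.card_id references cards.id (not confirmed by the source).
    """
    cards = (
        cards_df
        # Rename the card primary key so both sides share the join column name.
        .withColumnRenamed("id", "card_id")
        # Avoid a second, ambiguous client_id column in the result.
        .withColumnRenamed("client_id", "card_client_id")
    )
    # Joining by column name keeps a single card_id column in the output.
    return transactions_df.join(cards, on="card_id", how="left")
```

With the notebook's DataFrames this would be called as `join_transactions_to_cards(transactions_df, cards_df)`. If `id` is unique in the cards table, the result should have the same number of rows as `transactions_df`.
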
## 3. Join with JSON Data (MCC Codes)

Now let's add merchant category descriptions, keyed by MCC code, from the JSON file:

In [15]:
# Load MCC codes from JSON. The file is a single object of code -> description
# pairs, so parse it with the json module and build the DataFrame from the result
import json

with open("../data/mcc.json", "r") as f:
    mcc_dict = json.load(f)
print("✓ MCC JSON loaded")

# Convert to list of tuples for Spark DataFrame
mcc_data = [(int(mcc_code), description) for mcc_code, description in mcc_dict.items()]

# Create DataFrame with proper schema
mcc_df = spark.createDataFrame(mcc_data, ["mcc", "category_description"])
print(f"✓ MCC codes processed: {mcc_df.count():,} records")

# Show MCC structure
print("\nMCC data sample:")
mcc_df.show(5)

# Join transactions with MCC codes
# Select only the columns you need from joined_data to avoid ambiguity
joined_selected = joined_data.select(
    col("tx.client_id").alias("client_id"),
    col("tx.amount").alias("amount"),
    col("tx.mcc").alias("mcc"),
    col("cards.card_type").alias("card_type")
)

# Joining by column name keeps a single, unambiguous mcc column in the result
final_data = joined_selected.join(mcc_df, on="mcc", how="left")

print(f"✓ Final dataset with MCC: {final_data.count():,} records")

# Show enriched data
print("\nEnriched transaction data:")
final_data.select(
    "client_id", 
    "amount", 
    "card_type",
    "category_description"
).show(10)

✓ MCC JSON loaded
✓ MCC codes processed: 109 records

MCC data sample:
+----+--------------------+
| mcc|category_description|
+----+--------------------+
|5812|Eating Places and...|
|5541|    Service Stations|
|7996|Amusement Parks, ...|
|5411|Grocery Stores, S...|
|4784|Tolls and Bridge ...|
+----+--------------------+
only showing top 5 rows

✓ Final dataset with MCC: 51,115,337 records

Enriched transaction data:
+---------+-------+---------------+--------------------+
|client_id| amount|      card_type|category_description|
+---------+-------+---------------+--------------------+
|     1556|$-77.00|         Credit|Miscellaneous Foo...|
|     1556|$-77.00|Debit (Prepaid)|Miscellaneous Foo...|
|     1556|$-77.00|         Credit|Miscellaneous Foo...|
|     1556|$-77.00|          Debit|Miscellaneous Foo...|
|      561| $14.57|         Credit|   Department Stores|
|      561| $14.57|          Debit|   Department Stores|
|      561| $14.57|Debit (Prepaid)|   Department Stores|
|      561| $14.57|          Debit|   Department Stores|
|      561| $14.57|          Debit|   Department Stores|
|     1129| $80.00|         Credit|      Money Transfer|
+---------+-------+---------------+--------------------+
only showing top 10 rows

## Summary

### What we accomplished:

1. **Loaded CSV data**: Transactions and cards datasets
2. **Performed LEFT JOIN**: Combined transactions with card information
3. **Joined JSON data**: Added MCC category descriptions
4. **Created enriched dataset**: Three-way join for complete transaction context

### Key join concepts:
- **LEFT JOIN**: Keep all records from left table, add matching from right
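- **Join keys**: Must match between datasets (`client_id`, `mcc`); if a key repeats on the right side, rows multiply, as they did with `client_id` here (13,305,915 to 51,115,337 rows)
- **Multi-format joins**: CSV + JSON in same workflow
- **Table aliases**: `tx`, `cards` for cleaner syntax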

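To make the LEFT JOIN rule above concrete, here is a small plain-Python sketch (not Spark code; `left_join` and the sample rows are invented for illustration) that mimics the MCC lookup: every left row is kept, and rows without a match get `None` for the right-side column.

```python
def left_join(left_rows, lookup, key, new_field):
    """Return left_rows enriched from a {key: value} lookup, LEFT JOIN style.

    Rows whose key is missing from the lookup are kept with None.
    """
    return [{**row, new_field: lookup.get(row[key])} for row in left_rows]


# Invented sample rows; MCC 9999 is deliberately absent from the lookup.
transactions = [
    {"client_id": 561, "mcc": 5311},
    {"client_id": 1129, "mcc": 4829},
    {"client_id": 430, "mcc": 9999},
]
mcc_lookup = {5311: "Department Stores", 4829: "Money Transfer"}

for row in left_join(transactions, mcc_lookup, "mcc", "category_description"):
    print(row)
```

Because each key occurs only once in the lookup, the output has exactly as many rows as the input, which matches the unchanged 51,115,337 count after the MCC join.
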
🎉 **Simple joins completed successfully!**

In [None]:
# Clean up
spark.stop()
print("✅ Spark session terminated")

Cleaning up Spark resources...
✅ Spark session terminated successfully!
🎊 Data joins and conversions workshop completed!
