# **Data Aggregation and Joins**

## **1. Group data by a column and calculate average, max, and sum**

In [2]:
from pyspark.sql import SparkSession
# Import the functions module under an alias so that Spark's max and sum
# do not shadow Python's built-in max() and sum()
from pyspark.sql import functions as F

# Create Spark session
spark = SparkSession.builder.appName("GroupByProduct").getOrCreate()

# Sample data
data = [
    ("Laptop", "Electronics", 70000),
    ("Headphones", "Electronics", 2000),
    ("Shoes", "Footwear", 3000),
    ("Sandals", "Footwear", 1500),
    ("T-Shirt", "Clothing", 1200),
    ("Jeans", "Clothing", 2200)
]

columns = ["Product", "Category", "Sales"]

# Create DataFrame
df = spark.createDataFrame(data, columns)

# Group by Category and calculate average, max, and total sales
result_df = df.groupBy("Category").agg(
    F.avg("Sales").alias("Average_Sales"),
    F.max("Sales").alias("Max_Sales"),
    F.sum("Sales").alias("Total_Sales")
)

# Show result
result_df.show()


+-----------+-------------+---------+-----------+
|   Category|Average_Sales|Max_Sales|Total_Sales|
+-----------+-------------+---------+-----------+
|Electronics|      36000.0|    70000|      72000|
|   Footwear|       2250.0|     3000|       4500|
|   Clothing|       1700.0|     2200|       3400|
+-----------+-------------+---------+-----------+
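
The figures in this table can be cross-checked without Spark. The following is a minimal plain-Python sketch, assuming the same six sample rows; the helper name `summarize_by_category` is hypothetical and is not part of PySpark.

```python
from collections import defaultdict

# Same sample rows as the Spark example: (Product, Category, Sales)
data = [
    ("Laptop", "Electronics", 70000),
    ("Headphones", "Electronics", 2000),
    ("Shoes", "Footwear", 3000),
    ("Sandals", "Footwear", 1500),
    ("T-Shirt", "Clothing", 1200),
    ("Jeans", "Clothing", 2200),
]


def summarize_by_category(rows):
    """Hypothetical helper: return {category: (average, max, total)}."""
    sales_by_category = defaultdict(list)
    for _product, category, sales in rows:
        sales_by_category[category].append(sales)
    return {
        category: (sum(values) / len(values), max(values), sum(values))
        for category, values in sales_by_category.items()
    }


summary = summarize_by_category(data)
for category, (average, highest, total) in summary.items():
    print(category, average, highest, total)
```

Here `max` and `sum` are Python's built-ins operating on local lists; the Spark functions of the same names build column expressions that are evaluated by the cluster instead.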



## **2. Sort employees by salary in ascending and descending order**

In [3]:
from pyspark.sql import SparkSession

# Create Spark session
spark = SparkSession.builder.appName("SortEmployees").getOrCreate()

# Sample employee data
data = [
    ("Nikita", "HR", 50000),
    ("Rahul", "IT", 65000),
    ("Anjali", "Finance", 48000),
    ("Ravi", "IT", 70000),
    ("Priya", "HR", 52000)
]

columns = ["Name", "Department", "Salary"]

# Create DataFrame
df = spark.createDataFrame(data, columns)

# 1. Sort by Salary in Ascending Order
print("Ascending Order:")
df.orderBy("Salary").show()

# 2. Sort by Salary in Descending Order
print("Descending Order:")
df.orderBy(df.Salary.desc()).show()


Ascending Order:
+------+----------+------+
|  Name|Department|Salary|
+------+----------+------+
|Anjali|   Finance| 48000|
|Nikita|        HR| 50000|
| Priya|        HR| 52000|
| Rahul|        IT| 65000|
|  Ravi|        IT| 70000|
+------+----------+------+

Descending Order:
+------+----------+------+
|  Name|Department|Salary|
+------+----------+------+
|  Ravi|        IT| 70000|
| Rahul|        IT| 65000|
| Priya|        HR| 52000|
|Nikita|        HR| 50000|
|Anjali|   Finance| 48000|
+------+----------+------+
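
As a sanity check, the same two orderings can be reproduced in plain Python. This is only a sketch, assuming the five sample rows above; it uses the built-in `sorted` rather than Spark, and the variable names are illustrative.

```python
# Same sample rows as the Spark example: (Name, Department, Salary)
employees = [
    ("Nikita", "HR", 50000),
    ("Rahul", "IT", 65000),
    ("Anjali", "Finance", 48000),
    ("Ravi", "IT", 70000),
    ("Priya", "HR", 52000),
]

# Sort on the third field (Salary); reverse=True gives descending order
ascending = sorted(employees, key=lambda row: row[2])
descending = sorted(employees, key=lambda row: row[2], reverse=True)

print([name for name, _dept, _salary in ascending])
print([name for name, _dept, _salary in descending])
```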



## **3. Perform an inner join between two DataFrames on "customer_id", using a different example**

In [4]:
from pyspark.sql import SparkSession

# Create Spark session
spark = SparkSession.builder.appName("InnerJoinExample").getOrCreate()

# Customer DataFrame
customer_data = [
    (101, "Nikita"),
    (102, "Rahul"),
    (103, "Anjali"),
    (104, "Ravi")
]
customer_columns = ["customer_id", "customer_name"]
df_customers = spark.createDataFrame(customer_data, customer_columns)

# Orders DataFrame
order_data = [
    (201, 101, "Laptop"),
    (202, 102, "Mobile"),
    (203, 105, "Headphones"),  # 105 does not exist in customers
    (204, 103, "Keyboard")
]
order_columns = ["order_id", "customer_id", "product"]
df_orders = spark.createDataFrame(order_data, order_columns)

# Perform INNER JOIN on 'customer_id'
joined_df = df_customers.join(df_orders, on="customer_id", how="inner")

# Show the result
joined_df.show()


+-----------+-------------+--------+--------+
|customer_id|customer_name|order_id| product|
+-----------+-------------+--------+--------+
|        101|       Nikita|     201|  Laptop|
|        102|        Rahul|     202|  Mobile|
|        103|       Anjali|     204|Keyboard|
+-----------+-------------+--------+--------+
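
Order 203 (customer 105) and customer 104 (Ravi) are both absent from this result because an inner join keeps only the keys present on both sides. The rule can be sketched in plain Python as below. This is an illustration under the assumption that `customer_id` is unique in the customer data; the function name `inner_join` is hypothetical, and the sketch says nothing about how Spark actually executes joins.

```python
# Same sample rows as the Spark example
customers = [
    (101, "Nikita"),
    (102, "Rahul"),
    (103, "Anjali"),
    (104, "Ravi"),
]
orders = [
    (201, 101, "Laptop"),
    (202, 102, "Mobile"),
    (203, 105, "Headphones"),  # 105 does not exist in customers
    (204, 103, "Keyboard"),
]


def inner_join(customers, orders):
    """Hypothetical helper: keep only orders whose customer_id has a match."""
    name_by_id = dict(customers)  # assumes customer_id is unique
    return [
        (customer_id, name_by_id[customer_id], order_id, product)
        for order_id, customer_id, product in orders
        if customer_id in name_by_id
    ]


joined = inner_join(customers, orders)
for row in joined:
    print(row)
```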

