
# **Complex Queries and Aggregations in MongoDB**

In [1]:
# Install PyMongo
!pip install pymongo


Collecting pymongo
  Downloading pymongo-4.11-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (22 kB)
Collecting dnspython<3.0.0,>=1.16.0 (from pymongo)
  Downloading dnspython-2.7.0-py3-none-any.whl.metadata (5.8 kB)
Downloading pymongo-4.11-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.4 MB)
Downloading dnspython-2.7.0-py3-none-any.whl (313 kB)
Installing collected packages: dnspython, pymongo
Successfully installed dnspython-2.7.0 pymongo-4.11


In [2]:
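import os

from pymongo import MongoClient

# Read the connection string from an environment variable instead of hardcoding
# credentials in the notebook. MONGO_URI is an assumed variable name; set it to
# your own connection string before running this cell.
mongo_uri = os.environ["MONGO_URI"]
client = MongoClient(mongo_uri)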

# Connect to the 'ecommerce' database
db = client["ecommerce"]

# Create or access the 'orders' collection
orders = db["orders"]

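# MongoClient connects lazily, so ping the server to confirm the connection works
client.admin.command("ping")
print("Connected to MongoDB successfully!")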


Connected to MongoDB successfully!


In [3]:
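# Insert sample orders into the 'orders' collection.
# Note: re-running this cell inserts the same orders again under new _id values,
# unless a unique index on order_id exists.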
orders.insert_many([
    {
        "order_id": "001",
        "customer": {"name": "John Doe", "email": "john.doe@example.com"},
        "items": [
            {"product": "Laptop", "quantity": 1, "price": 1000},
            {"product": "Mouse", "quantity": 2, "price": 50}
        ],
        "order_date": "2024-01-15",
        "status": "shipped",
        "total": 1100
    },
    {
        "order_id": "002",
        "customer": {"name": "Jane Smith", "email": "jane.smith@example.com"},
        "items": [
            {"product": "Monitor", "quantity": 1, "price": 300}
        ],
        "order_date": "2024-01-20",
        "status": "delivered",
        "total": 300
    },
    {
        "order_id": "003",
        "customer": {"name": "John Doe", "email": "john.doe@example.com"},
        "items": [
            {"product": "Keyboard", "quantity": 1, "price": 100},
            {"product": "Mouse", "quantity": 1, "price": 50}
        ],
        "order_date": "2024-01-22",
        "status": "pending",
        "total": 150
    },
    {
        "order_id": "004",
        "customer": {"name": "Alice Brown", "email": "alice.brown@example.com"},
        "items": [
            {"product": "Laptop", "quantity": 2, "price": 1000},
            {"product": "Mouse", "quantity": 3, "price": 50}
        ],
        "order_date": "2024-01-25",
        "status": "shipped",
        "total": 2150
    }
])

print("Sample data inserted successfully!")


Sample data inserted successfully!


# **Task 1: Filtering Data**

In [4]:
# Query to find orders placed by "John Doe" with a total greater than $500
query = {
    "customer.name": "John Doe",
    "total": {"$gt": 500}
}

# Projection to show only 'order_id' and 'total'
projection = {"_id": 0, "order_id": 1, "total": 1}

# Execute the query
result = orders.find(query, projection)

# Print the results
print("Orders by John Doe with total > $500:")
for order in result:
    print(order)


Orders by John Doe with total > $500:
{'order_id': '001', 'total': 1100}


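**Filter Criteria:**

Match orders where customer.name is "John Doe" and total is greater than 500. Listing both conditions in one query document combines them with an implicit AND, so a document must satisfy both to be returned.


**Projection:**

Show only order_id and total in the output. The _id field is included by default in every result, so it is excluded explicitly with "_id": 0.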
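
To make the implicit AND and the projection concrete, here is a plain-Python sketch of what the find call above selects. It does not talk to MongoDB; sample_orders, matches, and project are hypothetical names used only for illustration, and the data mirrors the relevant fields of the four sample orders.

```python
# Plain-Python illustration of the filter and projection (not PyMongo code).
# sample_orders mirrors the relevant fields of the sample documents.
sample_orders = [
    {"order_id": "001", "customer": {"name": "John Doe"}, "total": 1100},
    {"order_id": "002", "customer": {"name": "Jane Smith"}, "total": 300},
    {"order_id": "003", "customer": {"name": "John Doe"}, "total": 150},
    {"order_id": "004", "customer": {"name": "Alice Brown"}, "total": 2150},
]


def matches(order):
    # Both conditions must hold, like the implicit AND in the query document.
    return order["customer"]["name"] == "John Doe" and order["total"] > 500


def project(order):
    # Keep only the fields marked with 1 in the projection.
    return {"order_id": order["order_id"], "total": order["total"]}


filtered = [project(o) for o in sample_orders if matches(o)]
print(filtered)
```

Order 003 also belongs to John Doe, but its total of 150 fails the second condition, which is why only order 001 is returned.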

# **Task 2: Sorting Data**


In [5]:
# Query to retrieve all orders and sort by order date (descending) and total (ascending)
sorted_orders = orders.find({}, {"_id": 0, "order_id": 1, "order_date": 1, "total": 1}).sort([
    ("order_date", -1),
    ("total", 1)
])

# Print sorted results
print("Sorted Orders:")
for order in sorted_orders:
    print(order)


Sorted Orders:
{'order_id': '004', 'order_date': '2024-01-25', 'total': 2150}
{'order_id': '003', 'order_date': '2024-01-22', 'total': 150}
{'order_id': '002', 'order_date': '2024-01-20', 'total': 300}
{'order_id': '001', 'order_date': '2024-01-15', 'total': 1100}


**Sorting Criteria:**

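First, sort by order date in descending order (-1), so the latest orders appear first. Then, if two orders have the same date, sort by total in ascending order (1). Because order_date is stored as a string in YYYY-MM-DD format, its string sort order matches chronological order. In this sample every order has a different date, so the second key never comes into play.

**Projection (Fields to Show):**

Only display order_id, order_date, and total. Exclude _id from the output.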
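
Since the sample data has no two orders on the same date, the tie-breaking key is easier to see in a small plain-Python sketch. This is only an illustration of the ordering rule, not of how MongoDB sorts internally. The rows list and the order with id "X01" are hypothetical: "X01" is not part of the notebook's data and is added only to create a date tie.

```python
# Plain-Python illustration of a two-key sort (not PyMongo code).
rows = [
    {"order_id": "001", "order_date": "2024-01-15", "total": 1100},
    {"order_id": "002", "order_date": "2024-01-20", "total": 300},
    {"order_id": "003", "order_date": "2024-01-22", "total": 150},
    {"order_id": "004", "order_date": "2024-01-25", "total": 2150},
    # Hypothetical extra order, not in the notebook's data, sharing a date with 004.
    {"order_id": "X01", "order_date": "2024-01-25", "total": 500},
]

# Python's sort is stable, so sort by the secondary key first, then by the primary key.
rows.sort(key=lambda r: r["total"])                     # total ascending
rows.sort(key=lambda r: r["order_date"], reverse=True)  # order_date descending

sorted_ids = [r["order_id"] for r in rows]
print(sorted_ids)
```

The two orders dated 2024-01-25 come first, and among them the hypothetical X01 precedes 004 because its total is lower.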


# **Task 3: Aggregation - Total Sales per Product**


In [6]:
# Aggregation query to calculate total sales per product
total_sales = orders.aggregate([
    {"$unwind": "$items"},
    {"$group": {
        "_id": "$items.product",
        "totalSales": {"$sum": {"$multiply": ["$items.price", "$items.quantity"]}}
    }},
    {"$sort": {"totalSales": -1}}
])

# Print results
print("Total Sales per Product:")
for product in total_sales:
    print(product)


Total Sales per Product:
{'_id': 'Laptop', 'totalSales': 3000}
{'_id': 'Monitor', 'totalSales': 300}
{'_id': 'Mouse', 'totalSales': 300}
{'_id': 'Keyboard', 'totalSales': 100}


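Step 1: $unwind the items array

An order can contain several products, so $unwind emits one document per element of items, each carrying a single item alongside the order's other fields.

Step 2: $group by product name

Setting _id to $items.product collects all of a product's unwound documents into one group.

Step 3: Compute total revenue

Within each group, $multiply computes price * quantity for each line item and $sum adds the results to give the product's total sales.

Step 4: $sort in descending order

Products with the highest sales appear first. Monitor and Mouse tie at 300, and MongoDB does not guarantee the relative order of tied documents unless the sort also includes a tie-breaking field.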
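
The stages above can be traced by hand with a plain-Python sketch. It only mirrors what the pipeline computes on the sample data and says nothing about how MongoDB executes it; the names sample_orders, unwound, total_sales, and ranked are hypothetical.

```python
from collections import defaultdict

# Items of the four sample orders (other fields omitted for brevity).
sample_orders = [
    {"order_id": "001", "items": [
        {"product": "Laptop", "quantity": 1, "price": 1000},
        {"product": "Mouse", "quantity": 2, "price": 50}]},
    {"order_id": "002", "items": [
        {"product": "Monitor", "quantity": 1, "price": 300}]},
    {"order_id": "003", "items": [
        {"product": "Keyboard", "quantity": 1, "price": 100},
        {"product": "Mouse", "quantity": 1, "price": 50}]},
    {"order_id": "004", "items": [
        {"product": "Laptop", "quantity": 2, "price": 1000},
        {"product": "Mouse", "quantity": 3, "price": 50}]},
]

# Like $unwind: one row per (order, item) pair.
unwound = [
    {"order_id": order["order_id"], "item": item}
    for order in sample_orders
    for item in order["items"]
]

# Like $group with $sum over $multiply: add price * quantity per product.
total_sales = defaultdict(int)
for row in unwound:
    item = row["item"]
    total_sales[item["product"]] += item["price"] * item["quantity"]

# Like $sort with -1: highest total first.
ranked = sorted(total_sales.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)
```

For example, Mouse appears in three orders, so its total is 2 * 50 + 1 * 50 + 3 * 50 = 300, matching the pipeline output.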

# **Task 4: Aggregation - Average Order Value per Customer**

In [8]:
# Aggregation query to calculate average order value per customer
average_order_value = orders.aggregate([
    {"$group": {
        "_id": "$customer.name",
        "averageOrderValue": {"$avg": "$total"}
    }},
    {"$sort": {"averageOrderValue": -1}}
])

# Print results
print("Average Order Value per Customer:")
for customer in average_order_value:
    print(customer)

Average Order Value per Customer:
{'_id': 'Alice Brown', 'averageOrderValue': 2150.0}
{'_id': 'John Doe', 'averageOrderValue': 625.0}
{'_id': 'Jane Smith', 'averageOrderValue': 300.0}


Step 1: $group by customer.name

Each customer's orders are grouped together.

Step 2: Compute averageOrderValue using $avg

The average is calculated across all orders placed by each customer.

Step 3: $sort in descending order

Customers with the highest average order value appear first.
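
As a cross-check on the numbers, the same grouping and averaging can be sketched in plain Python. This is an illustration of the arithmetic only, not PyMongo code; sample_orders, totals_by_customer, averages, and ranked are hypothetical names.

```python
from collections import defaultdict

# Customer name and total of the four sample orders.
sample_orders = [
    {"customer": {"name": "John Doe"}, "total": 1100},
    {"customer": {"name": "Jane Smith"}, "total": 300},
    {"customer": {"name": "John Doe"}, "total": 150},
    {"customer": {"name": "Alice Brown"}, "total": 2150},
]

# Like $group on customer.name: collect each customer's order totals.
totals_by_customer = defaultdict(list)
for order in sample_orders:
    totals_by_customer[order["customer"]["name"]].append(order["total"])

# Like $avg: mean of the totals in each group.
averages = {
    name: sum(totals) / len(totals)
    for name, totals in totals_by_customer.items()
}

# Like $sort with -1: highest average first.
ranked = sorted(averages.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)
```

John Doe has two orders, so his average is (1100 + 150) / 2 = 625.0, while the other two customers have one order each and their average equals that order's total.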


# **Task 5: Advanced Aggregation - Top 5 Products by Quantity Sold**

In [9]:
# Aggregation query to find the top 5 products by quantity sold
top_products = orders.aggregate([
    {"$unwind": "$items"},
    {"$group": {
        "_id": "$items.product",
        "quantitySold": {"$sum": "$items.quantity"}
    }},
    {"$sort": {"quantitySold": -1}},
    {"$limit": 5}
])

# Print results
print("Top 5 Products by Quantity Sold:")
for product in top_products:
    print(product)


Top 5 Products by Quantity Sold:
{'_id': 'Mouse', 'quantitySold': 6}
{'_id': 'Laptop', 'quantitySold': 3}
{'_id': 'Monitor', 'quantitySold': 1}
{'_id': 'Keyboard', 'quantitySold': 1}


Step 1: $unwind the items array

This ensures each product is processed separately.

Step 2: $group by product name

Each product's total quantity sold is calculated.

Step 3: Compute quantitySold using $sum

Summing the quantity field across all orders gives the total units sold per product.

Step 4: $sort in descending order

The most sold products appear first.

Step 5: $limit to only the top 5

We restrict the results to the five highest-selling products. The sample data contains only four distinct products, so the output above has four rows.
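
A plain-Python sketch of the same count, using collections.Counter, shows why fewer than five rows come back. It is an illustration under the assumption that the collection holds only the four sample orders; sample_orders, quantity_sold, and top_five are hypothetical names.

```python
from collections import Counter

# Product and quantity for every item in the four sample orders.
sample_orders = [
    {"items": [{"product": "Laptop", "quantity": 1},
               {"product": "Mouse", "quantity": 2}]},
    {"items": [{"product": "Monitor", "quantity": 1}]},
    {"items": [{"product": "Keyboard", "quantity": 1},
               {"product": "Mouse", "quantity": 1}]},
    {"items": [{"product": "Laptop", "quantity": 2},
               {"product": "Mouse", "quantity": 3}]},
]

# Like $unwind followed by $group with $sum: add up quantity per product.
quantity_sold = Counter()
for order in sample_orders:
    for item in order["items"]:
        quantity_sold[item["product"]] += item["quantity"]

# Like $sort with -1 followed by $limit 5: at most five entries, highest first.
top_five = quantity_sold.most_common(5)
print(top_five)
```

As with $limit, asking for five entries is an upper bound: with four distinct products, only four entries are returned.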