# Theoretical Questions on MongoDB and Databases

## 1. What are the key differences between SQL and NoSQL databases?
SQL (Structured Query Language) databases and NoSQL (Not Only SQL) databases have fundamental differences:

- **Structure:** SQL databases store data in tables with fixed columns, whereas NoSQL databases use other data models such as document, key-value, column-family, or graph.
- **Scalability:** SQL databases typically scale vertically (adding more power to a single machine), while NoSQL databases scale horizontally (distributing data across multiple machines).
- **Schema:** SQL databases require a predefined schema, whereas NoSQL databases have dynamic schemas.
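- **ACID Compliance:** SQL databases provide ACID (Atomicity, Consistency, Isolation, Durability) transactions as a core feature, whereas many NoSQL databases relax strong consistency in favor of high availability and partition tolerance. Some, including MongoDB since version 4.0, also support multi-document ACID transactions.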
- **Use Cases:** SQL is best for complex queries and structured data, while NoSQL is ideal for large-scale, unstructured, or semi-structured data applications.

## 2. What makes MongoDB a good choice for modern applications?
MongoDB is widely used in modern applications because of:

- **Flexible Schema:** Allows storage of varied and evolving data structures.
- **Scalability:** Uses horizontal scaling (sharding) to handle large data volumes efficiently.
- **High Performance:** Optimized for read and write operations using indexes and the WiredTiger storage engine's in-memory cache.
- **Replication & High Availability:** Supports automatic failover and redundancy through replica sets.
- **JSON-Like Documents:** Stores data in BSON format, which is easy to work with in web applications.
- **Powerful Aggregation Framework:** Enables complex queries, transformations, and real-time analytics.

## 3. Explain the concept of collections in MongoDB.
In MongoDB, a **collection** is a group of related documents, analogous to a table in a SQL database. Unlike tables, collections do not enforce a schema by default, so documents within a collection can have different fields and structures.

Example:
```json
{
  "_id": 1,
  "name": "John Doe",
  "email": "johndoe@example.com",
  "age": 30
}
```

## 4. How does MongoDB ensure high availability using replication?

MongoDB ensures high availability using **replica sets**. A **replica set** is a group of MongoDB servers that maintain the same dataset.

### Components of a Replica Set:
- **Primary Node:** Handles all write operations.
- **Secondary Nodes:** Maintain copies of the primary node's data.
- **Automatic Failover:** If the primary node fails, a secondary node is elected as the new primary.
- **Read Scalability:** Secondary nodes can serve read queries when the client sets a read preference. Because replication is asynchronous, such reads may return slightly stale data.

### Example:
A replica set typically consists of three nodes:
1. **Primary:** Accepts writes and propagates changes.
2. **Secondary 1:** Synchronizes data from the primary.
3. **Secondary 2:** Also synchronizes data from the primary and can take over if the primary fails.

Replication ensures data redundancy, system reliability, and automatic recovery from failures.

---

## 5. What are the main benefits of MongoDB Atlas?

MongoDB Atlas is a cloud-based, fully managed database service that simplifies database deployment and management.

### Key Benefits:
- **Automatic Scaling:** Adjusts database resources based on workload demand.
- **Global Distribution:** Deploys data across multiple regions for redundancy and availability.
- **Automated Backups:** Ensures data safety with periodic snapshots.
- **Security Features:** Provides built-in encryption, access controls, and authentication.
- **Monitoring & Performance Optimization:** Offers real-time monitoring tools and performance insights.

MongoDB Atlas eliminates operational overhead and allows developers to focus on building applications rather than managing databases.

---

## 6. What is the role of indexes in MongoDB, and how do they improve performance?

Indexes in MongoDB enhance query performance by allowing the database to quickly locate documents without scanning the entire collection.

### Benefits of Indexes:
- **Faster Queries:** Reduces query execution time.
- **Efficient Sorting:** Improves sorting operations.
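- **Reduced Resource Usage:** Queries examine fewer documents, which lowers CPU and I/O load. The trade-off is that each index consumes memory and disk space and adds work to every write.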

### Types of Indexes:
1. **Single-field Indexes:** Optimized for queries on one field.
2. **Compound Indexes:** Indexes multiple fields to optimize complex queries.
3. **Multikey Indexes:** Indexes array fields.
4. **Text Indexes:** Used for full-text search.
5. **Geospatial Indexes:** Supports location-based queries.
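
As a rough sketch of how these index types might be declared from Python, the helper below calls PyMongo's `create_index` once per type. The collection and every field name (`email`, `last_name`, `age`, `tags`, `bio`, `location`) are hypothetical. The function only needs an object with a `create_index` method, so defining it does not require a running server.

```python
def create_example_indexes(collection):
    """Create one index of each type on a (hypothetical) users collection.

    `collection` is expected to be a pymongo Collection; all field names
    below are illustrative assumptions, not part of any real schema.
    """
    names = []
    # Single-field index, ascending (1), also enforcing unique e-mail values.
    names.append(collection.create_index([("email", 1)], unique=True))
    # Compound index: last_name ascending, then age descending (-1).
    names.append(collection.create_index([("last_name", 1), ("age", -1)]))
    # Multikey index: declared like a single-field index; MongoDB treats it
    # as multikey automatically when `tags` holds arrays.
    names.append(collection.create_index([("tags", 1)]))
    # Text index for full-text search on a string field.
    names.append(collection.create_index([("bio", "text")]))
    # 2dsphere index for queries on GeoJSON locations.
    names.append(collection.create_index([("location", "2dsphere")]))
    return names
```

With a real connection this could be called as `create_example_indexes(db["users"])`; each call returns the name MongoDB assigned to the index.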

### Example:
Creating an index on the `email` field to speed up searches:
```javascript
db.users.createIndex({ "email": 1 });
```

## 7. Describe the stages of the MongoDB aggregation pipeline.

The **aggregation pipeline** is a framework for data processing and transformation in MongoDB. It consists of multiple stages that allow complex data manipulation.

### **Key Stages:**
1. **$match:** Filters documents based on specified criteria.
2. **$group:** Groups documents and performs aggregations like sum, count, average, etc.
3. **$project:** Reshapes documents by including, excluding, or modifying fields.
4. **$sort:** Sorts documents based on one or more fields.
5. **$limit:** Restricts the number of documents returned.
6. **$lookup:** Performs a join operation with another collection.
7. **$unwind:** Deconstructs an array field into separate documents.
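
To make the stage semantics concrete, here is a pure-Python imitation of a `$match`, `$group`, `$sort`, `$limit` sequence over a list of dictionaries. It is only a teaching sketch of what each stage does to the stream of documents, not a description of how MongoDB executes pipelines, and the sample orders are invented.

```python
from collections import defaultdict

# Invented sample documents standing in for an `orders` collection.
orders = [
    {"customerId": "A", "status": "delivered", "amount": 120},
    {"customerId": "B", "status": "delivered", "amount": 80},
    {"customerId": "A", "status": "pending", "amount": 500},
    {"customerId": "B", "status": "delivered", "amount": 70},
    {"customerId": "C", "status": "delivered", "amount": 40},
]

# $match: keep only documents that satisfy the filter.
matched = [o for o in orders if o["status"] == "delivered"]

# $group: one output document per distinct key, with a $sum accumulator.
totals = defaultdict(int)
for o in matched:
    totals[o["customerId"]] += o["amount"]
grouped = [{"_id": key, "totalSpent": total} for key, total in totals.items()]

# $sort: order by totalSpent, descending (-1).
ranked = sorted(grouped, key=lambda doc: doc["totalSpent"], reverse=True)

# $limit: pass through only the first N documents.
top_two = ranked[:2]
print(top_two)
```

Each stage consumes the output of the previous one, which is why filtering early with `$match` reduces the work done by every later stage.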

### **Example Usage:**
Retrieving the top customers based on their total spending:
```javascript
db.orders.aggregate([
  { $match: { status: "delivered" } },
  { $group: { _id: "$customerId", totalSpent: { $sum: "$amount" } } },
  { $sort: { totalSpent: -1 } },
  { $limit: 5 }
]);
```


## 8. What is sharding in MongoDB? How does it differ from replication?

### **Sharding:**
Sharding is a technique used in MongoDB for **horizontal scaling**, where data is distributed across multiple servers (shards). It helps improve performance and handles large datasets efficiently.

### **Difference Between Sharding and Replication:**
| Feature       | Sharding | Replication |
|--------------|----------|------------|
| Purpose      | Improves performance and scalability | Provides data redundancy and availability |
| Data Storage | Data is partitioned across multiple servers | Each node holds the same data as the primary node |
| Write Scaling | Supports write scalability | No write scalability |
| Read Scaling | Queries are distributed across shards | Reads can be performed from secondary nodes |

### **Components of Sharding:**
1. **Shard Servers:** Store distributed portions of the dataset.
2. **Config Servers:** Maintain metadata and mapping of shards.
3. **Query Router (mongos):** Routes client queries to the appropriate shard.
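
Conceptually, a hashed shard key spreads documents across shards by hashing the key value. The toy router below illustrates that idea with `hashlib`. Real MongoDB maps ranges of hash values to chunks tracked by the config servers rather than using a simple modulo, so treat this purely as an illustration; the shard names are invented.

```python
import hashlib

SHARDS = ["shard0", "shard1", "shard2"]  # hypothetical shard names

def route(shard_key_value):
    """Pick a shard for a key by hashing it (toy stand-in for mongos routing)."""
    digest = hashlib.sha256(str(shard_key_value).encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# Distribute 1000 hypothetical customer ids and record where each one lands.
placement = {}
for customer_id in range(1000):
    placement.setdefault(route(customer_id), []).append(customer_id)

# Each shard holds a disjoint subset; together they hold every document once.
for shard in SHARDS:
    print(shard, len(placement.get(shard, [])))
```

The same key always routes to the same shard, which is what lets a query that includes the shard key be sent to one shard instead of all of them.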

### **Example of Enabling Sharding:**
```javascript
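sh.enableSharding("myDatabase")
sh.shardCollection("myDatabase.orders", { customerId: "hashed" })  // hypothetical collection and shard key
```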

## 9. What is PyMongo, and why is it used?

### **What is PyMongo?**
**PyMongo** is the official Python driver for MongoDB, allowing Python applications to interact with MongoDB databases.

### **Why Use PyMongo?**
- Provides an easy interface for performing database operations like insert, update, delete, and query.
- Supports indexing, aggregation, and transactions.
- Integrates with Python web frameworks like Flask and Django.
- Enables developers to work with MongoDB directly using Python.

### **Basic Usage Example:**
Connecting to MongoDB and inserting a document using PyMongo:
```python
from pymongo import MongoClient

# Connect to MongoDB
client = MongoClient("mongodb://localhost:27017/")

# Select database and collection
db = client["mydatabase"]
collection = db["users"]

# Insert a document
collection.insert_one({"name": "Alice", "age": 25})

# Find a document
user = collection.find_one({"name": "Alice"})
print(user)
```


## 10. What are the ACID properties in the context of MongoDB transactions?

### **What are ACID Properties?**
ACID stands for **Atomicity, Consistency, Isolation, and Durability**. Together, these properties make database transactions reliable even when operations fail or run concurrently.

### **ACID Properties in MongoDB:**
1. **Atomicity:**  
   - Ensures that a transaction is **all or nothing**—if one operation fails, the entire transaction is rolled back.
   - Example: Updating multiple documents within a transaction will either succeed completely or fail entirely.

2. **Consistency:**  
   - Guarantees that the database remains in a valid state before and after a transaction.
   - Ensures that constraints (such as unique fields) are maintained.

3. **Isolation:**  
   - Transactions run independently without interfering with each other.
   - Prevents issues like dirty reads and non-repeatable reads.

4. **Durability:**  
   - Once a transaction is committed, the changes are permanently stored in the database, even in case of a system failure.

### **MongoDB Transactions:**
- Introduced in MongoDB **4.0** for replica sets and **4.2** for sharded clusters; they are not available on standalone servers.
- Support multi-document transactions.
- Run inside a client session: in PyMongo, call `client.start_session()` and then `session.start_transaction()`.
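
A minimal sketch of a multi-document transaction with PyMongo might look like the following. It assumes `client` is a `MongoClient` connected to a replica set or sharded cluster; the `bank` database, the `accounts` collection, and the `balance` field are hypothetical names chosen for illustration.

```python
def transfer(client, from_id, to_id, amount):
    """Move `amount` between two account documents atomically.

    Assumes `client` is a pymongo MongoClient connected to a replica set or
    sharded cluster; database, collection, and field names are hypothetical.
    """
    accounts = client["bank"]["accounts"]
    with client.start_session() as session:
        # The transaction commits when this block exits normally and is
        # aborted if an exception is raised inside it.
        with session.start_transaction():
            accounts.update_one(
                {"_id": from_id}, {"$inc": {"balance": -amount}}, session=session
            )
            accounts.update_one(
                {"_id": to_id}, {"$inc": {"balance": amount}}, session=session
            )
```

Passing `session=session` to each operation is what ties it to the transaction; operations issued without the session run outside it.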


## 11. What is the purpose of MongoDB’s `explain()` function?

### **What is `explain()`?**
The `explain()` function in MongoDB provides detailed execution statistics for a query, helping developers **analyze and optimize query performance**.

### **Why Use `explain()`?**
- Helps understand how MongoDB executes a query.
- Identifies whether an index is being used.
- Analyzes query execution time and resource usage.
- Helps detect performance bottlenecks.

### **Modes of `explain()` Output:**
MongoDB provides different execution modes for `explain()`:

1. **"queryPlanner"** (default mode)  
   - Shows the execution plan without running the query.
2. **"executionStats"**  
   - Executes the query and provides runtime statistics.
3. **"allPlansExecution"**  
   - Provides detailed statistics for all considered query plans.
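
In PyMongo, `collection.find(query).explain()` returns the plan as a dictionary. One way to check it for index usage is to walk the winning plan and look for an `IXSCAN` stage, as sketched below. The sample dictionary is hand-written and heavily trimmed for illustration; it is not captured server output, and real plans contain many more fields.

```python
def plan_stages(plan):
    """Collect every `stage` name found anywhere inside an explain() plan."""
    stages = []
    if isinstance(plan, dict):
        if "stage" in plan:
            stages.append(plan["stage"])
        for value in plan.values():
            stages.extend(plan_stages(value))
    elif isinstance(plan, list):
        for item in plan:
            stages.extend(plan_stages(item))
    return stages

def uses_index(explain_output):
    """Return True if the winning plan contains an index scan (IXSCAN)."""
    winning = explain_output.get("queryPlanner", {}).get("winningPlan", {})
    return "IXSCAN" in plan_stages(winning)

# Hand-written sample shaped like explain() output (illustrative only).
sample = {
    "queryPlanner": {
        "winningPlan": {
            "stage": "FETCH",
            "inputStage": {"stage": "IXSCAN", "indexName": "email_1"},
        }
    }
}
print(uses_index(sample))
```

A plan whose only stage is `COLLSCAN` means the query scanned the whole collection, which is usually the signal to add or adjust an index.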

## 12. How does MongoDB handle schema validation?

MongoDB allows schema validation using **JSON Schema** to enforce rules on document structure within a collection.

### **Key Features of Schema Validation:**
- Ensures data integrity by defining required fields, data types, and constraints.
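- Is defined per collection through a `validator`, either when the collection is created or later with the `collMod` command.
- Uses the `$jsonSchema` operator, whose keywords include `bsonType`, `required`, `enum`, `minimum`, and `pattern`; a validator can also use ordinary query operators such as `$type`.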
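
As an illustration, a validator for a hypothetical `users` collection could be written as a plain dictionary and passed to PyMongo's `create_collection`. The field names and rules below are assumptions made up for this sketch.

```python
# Hypothetical schema: name and email are required strings, age is a
# non-negative integer, and role must be one of two allowed values.
USER_VALIDATOR = {
    "$jsonSchema": {
        "bsonType": "object",
        "required": ["name", "email"],
        "properties": {
            "name": {"bsonType": "string"},
            "email": {"bsonType": "string", "pattern": "^.+@.+$"},
            "age": {"bsonType": "int", "minimum": 0},
            "role": {"enum": ["admin", "member"]},
        },
    }
}

def create_users_collection(db):
    """Create a `users` collection whose writes are checked against the schema.

    `db` is expected to be a pymongo Database object.
    """
    return db.create_collection("users", validator=USER_VALIDATOR)
```

With the default validation settings, an insert that omits `email` or supplies a negative `age` would be rejected by the server.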


## 13. What is the difference between a primary and a secondary node in a replica set?

### **Primary Node:**
- The primary node in a replica set is responsible for handling all **write operations**.
- It also handles **read operations** by default, unless configured otherwise.
- The primary node replicates its data to the secondary nodes to ensure data redundancy and availability.
- If the primary node fails, one of the secondary nodes is elected as the new primary.

### **Secondary Node:**
- Secondary nodes are copies of the primary node and hold the same data.
- They **replicate** data from the primary node but do not handle write operations.
- Secondary nodes can be configured to handle **read operations** (using `readPreference`).
- They are used to provide **fault tolerance** and **read scalability** by distributing read requests across the secondary nodes.

### **Key Differences:**
| Feature          | Primary Node               | Secondary Node           |
|------------------|----------------------------|--------------------------|
| **Write Operations** | Allowed                   | Not allowed              |
| **Read Operations**  | Serves reads by default | Serves reads only when the client sets a `readPreference` |
| **Data Replication** | Records writes in the oplog for secondaries to copy | Copies and applies oplog entries asynchronously |
| **Role in Failover**  | Replaced through an election if it fails | Can be elected primary if the current primary fails |
| **Data Updates**    | Writes data               | Syncs data from primary |
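
Read preference is usually set on the connection. The sketch below builds a replica-set connection string with the standard `replicaSet` and `readPreference` URI options; the host names and the replica set name `rs0` are placeholders.

```python
from urllib.parse import urlencode

def replica_set_uri(hosts, replica_set, read_preference="primary"):
    """Build a mongodb:// URI listing several members of one replica set."""
    options = urlencode(
        {"replicaSet": replica_set, "readPreference": read_preference}
    )
    return "mongodb://" + ",".join(hosts) + "/?" + options

# Placeholder hosts; secondaryPreferred sends reads to a secondary when one
# is available and falls back to the primary otherwise.
uri = replica_set_uri(
    ["db1.example.com:27017", "db2.example.com:27017", "db3.example.com:27017"],
    replica_set="rs0",
    read_preference="secondaryPreferred",
)
print(uri)
```

The resulting string can be passed to `MongoClient(uri)`. Writes still go to the primary regardless of the read preference.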


## 14. What security mechanisms does MongoDB provide for data protection?

MongoDB offers several security mechanisms to ensure data protection and secure database operations:

### **1. Authentication & Authorization**
- **Authentication** verifies the identity of a client, while **authorization** determines what an authenticated user is allowed to do.
- MongoDB uses **role-based access control (RBAC)** to define user roles and permissions.
- MongoDB supports **SCRAM** (Salted Challenge Response Authentication Mechanism) as the default authentication method.

### **2. Transport Layer Security (TLS/SSL)**
- MongoDB supports **TLS/SSL encryption** to secure data transmission between clients and servers.
- Ensures that data is encrypted while in transit, preventing man-in-the-middle attacks.

### **3. Data Encryption**
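- **Encryption at Rest** protects data files on disk so that they are unreadable without the encryption keys. The WiredTiger encrypted storage engine is a MongoDB Enterprise feature, and MongoDB Atlas also encrypts stored data.
- **Client-Side Field Level Encryption** encrypts sensitive fields in the application before they are sent to the server. Automatic encryption requires MongoDB Enterprise or Atlas, while explicit (manual) encryption through the drivers also works with the Community Edition.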

### **4. Network Security**
- **IP access lists** (often called IP whitelisting) restrict connections to specific IP addresses. On self-managed deployments the same effect comes from the `net.bindIp` setting combined with firewall rules.
- MongoDB can be configured to restrict database access based on specific network rules.

### **5. Auditing**
- MongoDB Enterprise and MongoDB Atlas support **auditing** to track database operations for compliance, regulatory, and monitoring purposes.
- The auditing feature logs important operations such as login attempts, access to sensitive data, and changes to database settings.

These mechanisms ensure that MongoDB is protected from unauthorized access, data breaches, and other security threats, providing robust data protection in both on-premise and cloud environments.

## 15. Explain the concept of embedded documents and when they should be used.

### **What Are Embedded Documents?**
Embedded documents, also known as **nested documents**, are documents that are stored inside another document. MongoDB allows for the storage of these embedded documents within a parent document, which can help reduce the need for joins.

### **Advantages of Embedded Documents:**
- **Faster Reads:** Retrieving all related data in a single query without needing to perform joins or multiple queries.
- **Atomic Updates:** Ensures consistency when updating related data within the same document.
- **Simplified Data Structure:** Avoids the need for complex relationships and referencing data across collections.

### **When to Use Embedded Documents:**
- **One-to-few relationships:** Use embedded documents when the relationship between data is one-to-few (e.g., a user’s addresses or phone numbers).
- **Data that doesn’t change independently:** If the embedded data is unlikely to change or is mostly accessed together with the parent data.
- **When performance is a priority:** Embedded documents reduce the need for joins and make read operations faster by keeping related data together in one place.

### **When Not to Use Embedded Documents:**
- **Large or growing data:** Avoid embedding arrays that can grow without bound, such as a customer’s full order history. Large documents are slower to read and update, and a single BSON document cannot exceed 16 MB.
- **Frequent independent updates:** If embedded data is frequently updated independently, it might be better to use **referencing** rather than embedding to avoid complications with large documents.
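
The two designs can be compared side by side with plain Python dictionaries. All names and values below are invented for illustration; in a real application the second design would use two collections and a query on `user_id`.

```python
# Embedded design: the addresses travel with the user in a single document.
user_embedded = {
    "_id": 1,
    "name": "John Doe",
    "addresses": [
        {"type": "home", "city": "Henderson"},
        {"type": "work", "city": "Los Angeles"},
    ],
}

# Referenced design: orders live in their own collection and point back
# to the user through a `user_id` field.
user_referenced = {"_id": 1, "name": "John Doe"}
orders = [
    {"_id": 101, "user_id": 1, "total": 261.96},
    {"_id": 102, "user_id": 1, "total": 731.94},
    {"_id": 103, "user_id": 2, "total": 14.62},
]

def orders_for(user, all_orders):
    """Application-side stand-in for querying the orders collection by user_id."""
    return [order for order in all_orders if order["user_id"] == user["_id"]]

# Embedded data arrives with the parent; referenced data needs a second lookup.
print(len(user_embedded["addresses"]))
print([order["_id"] for order in orders_for(user_referenced, orders)])
```

The embedded form suits the small, bounded address list, while the referenced form keeps the user document small no matter how many orders accumulate.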


## 16. What is the purpose of MongoDB’s `$lookup` stage in aggregation?

### **What is `$lookup`?**
The `$lookup` stage in MongoDB’s **aggregation pipeline** performs a **left outer join** from the input collection to another collection. Each input document gains an array field containing the matching documents from the joined collection, similar to a SQL `LEFT OUTER JOIN`.

### **Purpose of `$lookup`:**
- **Joins Collections:** Allows you to combine documents from two collections based on a common field.
- **Optimizes Queries:** Helps in situations where you need to retrieve related data that exists in a different collection, reducing the need for multiple queries.
- **Improves Data Retrieval:** Simplifies data retrieval and enables complex data analysis by combining data from different sources.

### **Common Use Cases for `$lookup`:**
- Retrieving related data from multiple collections.
- Performing operations like aggregating information across different documents.
- Simplifying complex data queries by eliminating the need for application-side joins.

The `$lookup` stage is particularly useful when working with **relational** data or when multiple collections need to be combined to get a complete dataset.
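
A small sketch of a `$lookup` pipeline in PyMongo follows. The `orders` and `customers` collections and the `customerId` field are hypothetical; only the four `$lookup` keys (`from`, `localField`, `foreignField`, `as`) are part of MongoDB's syntax.

```python
# Join each order to its customer document (hypothetical collections/fields).
LOOKUP_PIPELINE = [
    {
        "$lookup": {
            "from": "customers",          # collection to join with
            "localField": "customerId",   # field on the orders documents
            "foreignField": "_id",        # field on the customers documents
            "as": "customer",             # output array of matches
        }
    },
    # Flatten the one-element array into an embedded document. Orders with
    # no matching customer are dropped by this stage unless
    # preserveNullAndEmptyArrays is set.
    {"$unwind": "$customer"},
]

def orders_with_customers(db):
    """Run the join on a pymongo Database and return the results as a list."""
    return list(db["orders"].aggregate(LOOKUP_PIPELINE))
```

Because `$lookup` runs per input document, it generally helps to index the `foreignField` and to place a `$match` stage before the join when only some orders are needed.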

## 17. What are some common use cases for MongoDB?

MongoDB is a popular NoSQL database and is used in various scenarios, particularly when dealing with large volumes of unstructured or semi-structured data. Some common use cases include:

- **Content Management Systems (CMS):** Flexible data models for storing articles, blog posts, and media content.
- **Real-Time Analytics:** Processing large volumes of data and generating real-time insights.
- **Mobile Applications:** Storing user data, including interactions and preferences, which can evolve over time.
- **IoT (Internet of Things):** Handling large streams of data from connected devices, often in real-time.
- **E-commerce Platforms:** Managing product catalogs, customer information, and transactions.
- **Social Media Applications:** Storing user profiles, posts, and social interactions like comments and likes.
- **Big Data and Data Warehousing:** MongoDB’s ability to scale horizontally makes it suitable for big data processing and storage.

## 18. What are the advantages of using MongoDB for horizontal scaling?

MongoDB provides several advantages when it comes to **horizontal scaling** (distributing data across multiple servers):

- **Sharding:** MongoDB uses **sharding** to split data across multiple servers (shards), allowing it to scale out beyond the CPU, memory, and storage limits of a single machine.
- **Automatic Balancing:** A background balancer migrates data between shards to keep the distribution even. Avoiding hotspots still depends on choosing a shard key that spreads reads and writes well.
- **High Availability:** MongoDB ensures high availability using **replica sets** that replicate data across multiple nodes, providing redundancy and failover.
- **Elastic Scalability:** MongoDB can scale horizontally by adding more nodes to the cluster as data grows, without requiring downtime.
- **Fault Tolerance:** If one server fails, data is still available from other shards and replica sets, ensuring continuous access to data.

## 19. How do MongoDB transactions differ from SQL transactions?

MongoDB transactions provide atomicity and consistency across multiple documents, similar to SQL transactions, but with key differences:

- **Multi-Document Transactions:** 
  - In MongoDB, an operation on a single document is always atomic; **multi-document transactions** extend that guarantee to several documents, which may live in different collections.
  - SQL transactions routinely span multiple rows and tables and are the usual way to group related writes. In MongoDB, embedding related data in one document often removes the need for a transaction.
  
- **ACID Compliance:** 
  - Both MongoDB and SQL transactions follow the **ACID** (Atomicity, Consistency, Isolation, Durability) properties. MongoDB added multi-document ACID transactions in version **4.0** for replica sets and **4.2** for sharded clusters.
  
- **Default Isolation Level:**
  - Most SQL databases default to **read committed** (for example PostgreSQL, Oracle, and SQL Server) or **repeatable read** (MySQL InnoDB); **serializable** is available but is rarely the default.
  - MongoDB transactions offer **snapshot** isolation when run with read concern `"snapshot"`; weaker read concerns relax that guarantee.

- **Transaction Scope:**
  - SQL transactions can involve many tables, usually within a single database; spanning several databases or servers generally requires distributed transactions.
  - MongoDB transactions can span multiple collections and databases within a deployment, but they are meant to be short: by default a transaction is aborted after 60 seconds.

## 20. What are the main differences between capped collections and regular collections?

Capped collections are special collections in MongoDB with a fixed size and an insertion order guarantee. The main differences between capped and regular collections are:

- **Size Limitation:**
  - **Capped Collections:** Have a maximum size and automatically remove the oldest documents when the size limit is reached.
  - **Regular Collections:** Do not have a size limit and can grow indefinitely.

- **Document Order:**
  - **Capped Collections:** Preserve the insertion order and are optimized for **high-throughput** insert operations.
  - **Regular Collections:** Do not guarantee any specific order for documents.

- **Data Retention:**
  - **Capped Collections:** The oldest documents are overwritten when the collection reaches its size limit.
  - **Regular Collections:** Data is retained indefinitely unless explicitly deleted.

- **Indexes:**
  - **Capped Collections:** Have an index on `_id` by default and accept additional indexes, but cannot be sharded.
  - **Regular Collections:** Have the same default `_id` index, accept additional indexes, and can be sharded.

- **Use Cases:**
  - **Capped Collections:** Best for logs, recent-event buffers, and other rolling data where only the newest entries matter. For time-series data, MongoDB 5.0 and later also offer dedicated time series collections.
  - **Regular Collections:** Used for general-purpose data storage.
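
The eviction behaviour can be pictured with a bounded `deque`, which discards its oldest entry once full. This is only an analogy: a real capped collection is limited primarily by size in bytes, with an optional document count. The helper shows how such a collection might be created with PyMongo; the name `app_logs` and the limits are arbitrary choices for this sketch.

```python
from collections import deque

def create_log_collection(db):
    """Create a capped collection on a pymongo Database (hypothetical name).

    `size` is the maximum size in bytes and is required for capped
    collections; `max` optionally limits the number of documents.
    """
    return db.create_collection(
        "app_logs", capped=True, size=1024 * 1024, max=1000
    )

# Analogy: a buffer that keeps only the three most recent entries.
recent = deque(maxlen=3)
for seq in range(1, 6):
    recent.append({"seq": seq})

# Entries 1 and 2 were evicted in insertion order as 4 and 5 arrived.
print([doc["seq"] for doc in recent])  # [3, 4, 5]
```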

## 21. What is the purpose of the $match stage in MongoDB’s aggregation pipeline?

The **`$match`** stage in MongoDB’s **aggregation pipeline** filters documents based on specified conditions. It is used to:

- **Filter Data Early:** Helps reduce the dataset by limiting the number of documents that are passed to subsequent stages in the pipeline, improving performance.
- **Apply Query Conditions:** It uses the same syntax as the `find()` method to filter documents.
- **Match Specific Criteria:** It can filter documents based on field values, ranges, logical operators, and regular expressions.

### Common Use Cases for `$match`:
- Filtering documents based on specific field values (e.g., find all documents where `status` is "active").
- Combining `$match` with other stages like `$group` to refine the data before performing aggregation operations.

The `$match` stage acts as a filter and is often one of the first stages in an aggregation pipeline to optimize query performance by reducing the number of documents that need to be processed.


## 22. How can you secure access to a MongoDB database?

Securing access to a MongoDB database is crucial to protect data from unauthorized access and potential breaches. MongoDB provides several mechanisms to secure access:

### **1. Authentication**
- **Enable Authentication:** A self-managed `mongod` started without access control accepts unauthenticated connections. To secure it, create an administrative user and enable access control (the `security.authorization` setting or the `--auth` flag).
- **SCRAM (Salted Challenge Response Authentication Mechanism):** MongoDB uses SCRAM for authentication by default, which hashes passwords to prevent password leakage.

### **2. Role-Based Access Control (RBAC)**
- **User Roles:** MongoDB uses RBAC to assign roles to users based on what actions they can perform (e.g., read, write, admin).
- **Custom Roles:** You can create custom roles tailored to the specific needs of users.

### **3. Authorization**
- Use **built-in roles** or create **custom roles** that grant specific privileges (e.g., read-only access, read-write access).
  
### **4. Transport Layer Security (TLS/SSL)**
- **Encryption in Transit:** Enable TLS/SSL encryption to secure communications between MongoDB clients and servers, preventing man-in-the-middle attacks.

### **5. Encryption**
- **Encryption at Rest:** Ensure that data stored in the database is encrypted by using **MongoDB Enterprise Edition** to provide encryption at rest for sensitive data.
- **Field-Level Encryption:** Use field-level encryption to encrypt sensitive data directly within the database.

### **6. IP Whitelisting and Firewall Configuration**
- **IP Whitelisting:** Restrict database access to trusted IP addresses and networks.
- **Firewall Rules:** Configure firewalls to limit access to MongoDB instances to specific IPs or ranges.

### **7. Auditing**
- MongoDB supports **auditing** features, enabling tracking of sensitive database operations, such as authentication attempts and data access.

By using these security measures, you can ensure that your MongoDB instance is protected from unauthorized access and vulnerabilities.
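
On the client side, several of these measures come down to connection options. The sketch below collects credentials from environment-style variables and enables TLS. The variable names (`MONGO_HOST`, `MONGO_USER`, `MONGO_PASSWORD`, `MONGO_CA_FILE`) are invented for this example, while the option names are standard `MongoClient` keyword arguments.

```python
def secure_client_options(env):
    """Build MongoClient keyword arguments from a mapping such as os.environ.

    The environment variable names are assumptions made for this sketch.
    """
    options = {
        "host": env.get("MONGO_HOST", "localhost"),
        "username": env["MONGO_USER"],       # never hard-code credentials
        "password": env["MONGO_PASSWORD"],
        "authSource": "admin",               # database that stores the user
        "tls": True,                         # encrypt traffic in transit
    }
    ca_file = env.get("MONGO_CA_FILE")
    if ca_file:
        # Optional CA bundle used to verify the server certificate.
        options["tlsCAFile"] = ca_file
    return options

# Possible usage with a real deployment:
#   import os
#   from pymongo import MongoClient
#   client = MongoClient(**secure_client_options(os.environ))
```

Keeping credentials in the environment rather than in source code also keeps them out of notebooks and version control.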

---

## 23. What is MongoDB’s WiredTiger storage engine, and why is it important?

### **What is WiredTiger?**
**WiredTiger** is the default **storage engine** in MongoDB since version 3.2. It is designed to provide high performance and efficient use of resources. WiredTiger uses **document-level concurrency control** and is optimized for **high-throughput, low-latency performance**.

### **Key Features of WiredTiger:**
1. **Document-Level Concurrency Control:** 
   - Allows multiple operations to occur simultaneously on different documents without locking the entire collection, improving write performance and reducing contention.
   
2. **Compression:**
   - WiredTiger provides built-in compression to reduce the storage footprint of data. It uses **snappy** compression for collections by default and offers **zlib** and, from MongoDB 4.2, **zstd** as alternatives.
   - Compression reduces disk usage at the cost of some additional CPU work.

3. **Write-Ahead Logging (WAL):**
   - WiredTiger uses write-ahead logging, which provides durability and helps ensure that data is not lost in the event of a crash.
   
4. **Support for Multi-Version Concurrency Control (MVCC):**
   - It uses MVCC to allow read and write operations to occur simultaneously without blocking each other, enhancing database performance under heavy load.
   
5. **High-Performance Data Access:**
   - WiredTiger supports fast in-memory data access and efficient storage of large data sets.
   
6. **Transactions Support:**
   - Starting with MongoDB 4.0, WiredTiger supports **multi-document transactions**, providing ACID guarantees across multiple documents, collections, and databases.

### **Why is WiredTiger Important?**
- **Improved Performance:** It is optimized for modern hardware, enabling faster read/write operations, especially under heavy workloads.
- **Better Resource Utilization:** It reduces disk space usage through compression and allows more efficient use of memory and CPU.
- **Supports Modern Features:** WiredTiger supports critical features such as multi-document ACID transactions and document-level concurrency control, making it ideal for mission-critical applications.
- **Scalability:** Its performance characteristics allow MongoDB to scale more effectively as data volume and application demands grow.

WiredTiger is an essential component of MongoDB: it is the layer that lets the database handle concurrent, write-heavy workloads with good performance and efficient storage. High availability comes from replica sets rather than from the storage engine itself.


# Practical Questions

```python
import pandas as pd

file_path = r"C:\Users\hetvi\Downloads\superstore.csv"

superstore_df = pd.read_csv(file_path, encoding='ISO-8859-1')

superstore_df.head()
```

```text
Unnamed: 0,Row ID,Order ID,Order Date,Ship Date,Ship Mode,Customer ID,Customer Name,Segment,Country,City,...,Postal Code,Region,Product ID,Category,Sub-Category,Product Name,Sales,Quantity,Discount,Profit
0,1,CA-2016-152156,11/8/2016,11/11/2016,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,...,42420,South,FUR-BO-10001798,Furniture,Bookcases,Bush Somerset Collection Bookcase,261.96,2,0.0,41.9136
1,2,CA-2016-152156,11/8/2016,11/11/2016,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,...,42420,South,FUR-CH-10000454,Furniture,Chairs,"Hon Deluxe Fabric Upholstered Stacking Chairs,...",731.94,3,0.0,219.582
2,3,CA-2016-138688,6/12/2016,6/16/2016,Second Class,DV-13045,Darrin Van Huff,Corporate,United States,Los Angeles,...,90036,West,OFF-LA-10000240,Office Supplies,Labels,Self-Adhesive Address Labels for Typewriters b...,14.62,2,0.0,6.8714
3,4,US-2015-108966,10/11/2015,10/18/2015,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,...,33311,South,FUR-TA-10000577,Furniture,Tables,Bretford CR4500 Series Slim Rectangular Table,957.5775,5,0.45,-383.031
4,5,US-2015-108966,10/11/2015,10/18/2015,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,...,33311,South,OFF-ST-10000760,Office Supplies,Storage,Eldon Fold 'N Roll Cart System,22.368,2,0.2,2.5164
```


```python
import pandas as pd
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db = client["superstore"]
orders_collection = db["orders"]
```


## 1) Load the Superstore dataset from a CSV file into MongoDB

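```python
file_path = r"C:\Users\hetvi\Downloads\superstore.csv"
superstore_df = pd.read_csv(file_path, encoding='ISO-8859-1')

# Convert each row to a dictionary so it can be stored as a document.
data = superstore_df.to_dict(orient="records")
# Note: running this cell again inserts the same records a second time.
orders_collection.insert_many(data)

print("Data inserted into MongoDB.")
```

```text
Data inserted into MongoDB.
```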


## 2) Retrieve and print all documents from the Orders collection

```python
all_orders = orders_collection.find()

for order in all_orders:
    print(order)
```

```text
{'_id': ObjectId('67abb13f3bb89c39c0c8ef1d'), 'Row ID': 1, 'Order ID': 'CA-2016-152156', 'Order Date': '11/8/2016', 'Ship Date': '11/11/2016', 'Ship Mode': 'Second Class', 'Customer ID': 'CG-12520', 'Customer Name': 'Claire Gute', 'Segment': 'Consumer', 'Country': 'United States', 'City': 'Henderson', 'State': 'Kentucky', 'Postal Code': 42420, 'Region': 'South', 'Product ID': 'FUR-BO-10001798', 'Category': 'Furniture', 'Sub-Category': 'Bookcases', 'Product Name': 'Bush Somerset Collection Bookcase', 'Sales': 261.96, 'Quantity': 2, 'Discount': 0.0, 'Profit': 41.9136}
{'_id': ObjectId('67abb13f3bb89c39c0c8ef1e'), 'Row ID': 2, 'Order ID': 'CA-2016-152156', 'Order Date': '11/8/2016', 'Ship Date': '11/11/2016', 'Ship Mode': 'Second Class', 'Customer ID': 'CG-12520', 'Customer Name': 'Claire Gute', 'Segment': 'Consumer', 'Country': 'United States', 'City': 'Henderson', 'State': 'Kentucky', 'Postal Code': 42420, 'Region': 'South', 'Product ID': 'FUR-CH-10000454', 'Category': 'Furniture', 'Sub
[Output truncated by Jupyter: IOPub data rate exceeded.]
```



## 3) Count and display the total number of documents in the Orders collection

```python
total_orders = orders_collection.count_documents({})
print(f"Total number of orders in the collection: {total_orders}")
```

```text
Total number of orders in the collection: 19988
```
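
Later output shows the same `Row ID` under two different `_id` values, which suggests the load cell was run more than once and the records were inserted twice. One way to make the load repeatable is to clear the collection before inserting, as in this sketch. The helper name `reload_orders` is made up here, and it assumes that discarding the existing documents is acceptable.

```python
def reload_orders(collection, records):
    """Replace the collection's contents with `records`; return the new count.

    `collection` is expected to be a pymongo Collection.
    """
    # Remove documents left behind by earlier runs of the notebook.
    collection.delete_many({})
    # insert_many rejects an empty list, so skip it when there is no data.
    if records:
        collection.insert_many(records)
    return collection.count_documents({})
```

In the notebook this could be called as `reload_orders(orders_collection, data)` in place of the bare `insert_many` call.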


## 4) Query to fetch all orders from the "West" region

```python
west_orders = orders_collection.find({"Region": "West"}).limit(10)

print("\nOrders from the West region (first 10):")
for order in west_orders:
    print(order)
```

```text
Orders from the West region (first 10):
{'_id': ObjectId('67abb13f3bb89c39c0c8ef1f'), 'Row ID': 3, 'Order ID': 'CA-2016-138688', 'Order Date': '6/12/2016', 'Ship Date': '6/16/2016', 'Ship Mode': 'Second Class', 'Customer ID': 'DV-13045', 'Customer Name': 'Darrin Van Huff', 'Segment': 'Corporate', 'Country': 'United States', 'City': 'Los Angeles', 'State': 'California', 'Postal Code': 90036, 'Region': 'West', 'Product ID': 'OFF-LA-10000240', 'Category': 'Office Supplies', 'Sub-Category': 'Labels', 'Product Name': 'Self-Adhesive Address Labels for Typewriters by Universal', 'Sales': 14.62, 'Quantity': 2, 'Discount': 0.0, 'Profit': 6.8714}
{'_id': ObjectId('67abb13f3bb89c39c0c8ef22'), 'Row ID': 6, 'Order ID': 'CA-2014-115812', 'Order Date': '6/9/2014', 'Ship Date': '6/14/2014', 'Ship Mode': 'Standard Class', 'Customer ID': 'BH-11710', 'Customer Name': 'Brosina Hoffman', 'Segment': 'Consumer', 'Country': 'United States', 'City': 'Los Angeles', 'State': 'California', 'Postal Code': 90032, 
```

## 5) Query to find orders where Sales is greater than 500

```python
high_sales_orders = orders_collection.find({"Sales": {"$gt": 500}})

print("\nOrders where Sales is greater than 500:")
for order in high_sales_orders:
    print(order)
```

```text
Orders where Sales is greater than 500:
{'_id': ObjectId('67abb13f3bb89c39c0c8ef1e'), 'Row ID': 2, 'Order ID': 'CA-2016-152156', 'Order Date': '11/8/2016', 'Ship Date': '11/11/2016', 'Ship Mode': 'Second Class', 'Customer ID': 'CG-12520', 'Customer Name': 'Claire Gute', 'Segment': 'Consumer', 'Country': 'United States', 'City': 'Henderson', 'State': 'Kentucky', 'Postal Code': 42420, 'Region': 'South', 'Product ID': 'FUR-CH-10000454', 'Category': 'Furniture', 'Sub-Category': 'Chairs', 'Product Name': 'Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back', 'Sales': 731.94, 'Quantity': 3, 'Discount': 0.0, 'Profit': 219.582}
{'_id': ObjectId('67abb13f3bb89c39c0c8ef20'), 'Row ID': 4, 'Order ID': 'US-2015-108966', 'Order Date': '10/11/2015', 'Ship Date': '10/18/2015', 'Ship Mode': 'Standard Class', 'Customer ID': 'SO-20335', 'Customer Name': "Sean O'Donnell", 'Segment': 'Consumer', 'Country': 'United States', 'City': 'Fort Lauderdale', 'State': 'Florida', 'Postal Code': 33311, 'Regio
```

## 6) Fetch the top 3 orders with the highest Profit

```python
top_profit_orders = orders_collection.find().sort("Profit", -1).limit(3)

print("Top 3 Orders with Highest Profit:")
for order in top_profit_orders:
    print(order)
```

```text
Top 3 Orders with Highest Profit:
{'_id': ObjectId('67abb1923bb89c39c0c930d3'), 'Row ID': 6827, 'Order ID': 'CA-2016-118689', 'Order Date': '10/2/2016', 'Ship Date': '10/9/2016', 'Ship Mode': 'Standard Class', 'Customer ID': 'TC-20980', 'Customer Name': 'Tamara Chand', 'Segment': 'Corporate', 'Country': 'United States', 'City': 'Lafayette', 'State': 'Indiana', 'Postal Code': 47905, 'Region': 'Central', 'Product ID': 'TEC-CO-10004722', 'Category': 'Technology', 'Sub-Category': 'Copiers', 'Product Name': 'Canon imageCLASS 2200 Advanced Copier', 'Sales': 17499.95, 'Quantity': 5, 'Discount': 0.0, 'Profit': 8399.976}
{'_id': ObjectId('67abb13f3bb89c39c0c909c7'), 'Row ID': 6827, 'Order ID': 'CA-2016-118689', 'Order Date': '10/2/2016', 'Ship Date': '10/9/2016', 'Ship Mode': 'Standard Class', 'Customer ID': 'TC-20980', 'Customer Name': 'Tamara Chand', 'Segment': 'Corporate', 'Country': 'United States', 'City': 'Lafayette', 'State': 'Indiana', 'Postal Code': 47905, 'Region': 'Central', 'Product
```

## 7) Update all orders with Ship Mode as "First Class" to "Premium Class"

```python
update_result = orders_collection.update_many(
    {"Ship Mode": "First Class"},
    {"$set": {"Ship Mode": "Premium Class"}}
)

print(f"\nUpdated {update_result.modified_count} orders where Ship Mode was 'First Class'.")
```

```text
Updated 3076 orders where Ship Mode was 'First Class'.
```


## 8) Delete all orders where Sales is less than 50

```python
delete_result = orders_collection.delete_many({"Sales": {"$lt": 50}})

print(f"\nDeleted {delete_result.deleted_count} orders with Sales less than 50.")
```

```text
Deleted 9698 orders with Sales less than 50.
```


## 9) Use aggregation to group orders by Region and calculate total sales per region

```python
region_sales = orders_collection.aggregate([
    {"$group": {"_id": "$Region", "total_sales": {"$sum": "$Sales"}}}
])

print("\nTotal Sales per Region:")
for region in region_sales:
    print(region)
```

```text
Total Sales per Region:
{'_id': 'East', 'total_sales': 1302275.41}
{'_id': 'Central', 'total_sales': 959223.6916}
{'_id': 'South', 'total_sales': 752046.624}
{'_id': 'West', 'total_sales': 1389373.239}
```


## 10) Fetch all distinct values for Ship Mode from the collection

```python
distinct_ship_modes = orders_collection.distinct("Ship Mode")

print("\nDistinct Ship Modes:")
for ship_mode in distinct_ship_modes:
    print(ship_mode)
```

```text
Distinct Ship Modes:
Premium Class
Same Day
Second Class
Standard Class
```


## 11) Count the number of orders for each category (assuming Category field exists)

```python
category_count = orders_collection.aggregate([
    {"$group": {"_id": "$Category", "order_count": {"$sum": 1}}}
])

print("\nNumber of Orders per Category:")
for category in category_count:
    print(category)
```

```text
Number of Orders per Category:
{'_id': 'Furniture', 'order_count': 3146}
{'_id': 'Technology', 'order_count': 2992}
{'_id': 'Office Supplies', 'order_count': 4152}
```
