Q1. What are the key differences between SQL and NoSQL databases?

  ->>
- Data model → SQL uses structured, tabular data; NoSQL uses flexible formats like JSON, key-value, graph, or column-family.

- Schema → SQL has a fixed, predefined schema; NoSQL allows dynamic and flexible schemas.

- Relationships → SQL supports joins and relations; NoSQL relies on denormalization or embedded documents.

- Query language → SQL uses Structured Query Language (SQL); NoSQL uses varied query syntaxes based on the database.

- Scalability → SQL typically scales vertically (stronger hardware); NoSQL scales horizontally (distributed architecture).

- Consistency → SQL follows ACID properties (strong consistency); NoSQL often follows BASE (eventual consistency).

- Transactions → SQL excels at complex multi-step transactions; NoSQL may support limited or eventual transactional models.

- Performance → SQL is optimized for complex queries; NoSQL shines with large volumes of fast-changing data.

- Examples → SQL: MySQL, PostgreSQL, Oracle; NoSQL: MongoDB, Cassandra, Redis.

- Best for → SQL is best for structured, relational data; NoSQL suits flexible, large-scale or hierarchical data.

Q2. What makes MongoDB a good choice for modern applications?

  ->> MongoDB is a popular choice for modern applications because it is designed to handle the speed, scale, and flexibility that today’s software demands.

->> Key Strengths of MongoDB for Modern Development

- Document-Oriented Model: Stores data in flexible, JSON-like documents (BSON), which map naturally to objects in most programming languages. This makes development faster and more intuitive.

- Schema Flexibility: You don’t need to define a rigid schema up front. You can evolve your data model as your application grows, which suits agile teams and startups.

- Horizontal Scalability: MongoDB supports sharding, allowing data to be distributed across multiple servers. This enables applications to scale out and handle large workloads.

- High Availability & Fault Tolerance: Built-in replication and automatic failover keep your app online even if a server goes down.

- Real-Time Performance: Optimized for fast reads and writes, MongoDB supports real-time analytics, live dashboards, and responsive user experiences.

- Rich Query Capabilities: Supports powerful queries, indexing, and aggregation pipelines, so complex data processing often needs no external tools.

- Cloud-Native Integration: Works with cloud platforms, containers (Docker), orchestration tools (Kubernetes), and serverless environments.

- Developer-Friendly Ecosystem: Offers drivers for all major programming languages and integrates well with modern frameworks, making it easy to build and deploy apps quickly.

Q3. Explain the concept of collections in MongoDB

  ->>
- A collection stores multiple documents, where each document is a JSON-like object (called BSON internally).

- Collections do not require a fixed schema, so documents within the same collection can have different fields and data types.

- Collections are created inside a database and can be created explicitly or automatically when you insert data.

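Q4. How does MongoDB ensure high availability using replication?

  ->> MongoDB ensures high availability through replica sets: groups of mongod instances that hold the same data. One member is the primary and accepts writes; the secondaries replicate its oplog. If the primary becomes unavailable, the remaining members elect a new primary automatically, which provides redundancy, fault tolerance, and automatic failover.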

Q5. What are the main benefits of MongoDB Atlas?

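  ->> MongoDB Atlas is MongoDB’s fully managed cloud database service. Its main benefits are automated provisioning, patching, and backups; built-in replication with scaling options; availability on AWS, Azure, and Google Cloud; and integrated monitoring and security controls.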

Q6. What is the role of indexes in MongoDB, and how do they improve performance?

  ->> Indexes in MongoDB let a query locate matching documents without scanning the whole collection. Think of them like the index in a book: instead of flipping through every page, you jump straight to the information you need. Without a suitable index, MongoDB falls back to a collection scan and examines every document.

->> How Indexes Work Internally

- MongoDB uses B-tree data structures to store indexes.

- These trees keep data sorted and balanced, allowing logarithmic-time lookups.

- When a document is inserted, updated, or deleted, MongoDB updates all relevant indexes, which is why too many indexes can slow down writes.
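
The read/write trade-off described above can be illustrated without a database. The sketch below is not MongoDB's B-tree implementation: it assumes a plain sorted list searched with `bisect` as a stand-in for an ordered index, and the `SortedIndex` class and the sample documents are hypothetical.

```python
import bisect


class SortedIndex:
    """Toy ordered index: sorted (key, position) pairs standing in for a B-tree."""

    def __init__(self, docs, field):
        self.field = field
        self.entries = sorted((doc[field], pos) for pos, doc in enumerate(docs))

    def find(self, value):
        """Binary-search to the first entry for value, then walk the matches."""
        lo = bisect.bisect_left(self.entries, (value, -1))
        hits = []
        while lo < len(self.entries) and self.entries[lo][0] == value:
            hits.append(self.entries[lo][1])
            lo += 1
        return hits

    def insert(self, value, pos):
        """Each write must also update the index: the write-side cost."""
        bisect.insort(self.entries, (value, pos))


def collection_scan(docs, field, value):
    """Without an index, every document has to be examined."""
    return [pos for pos, doc in enumerate(docs) if doc[field] == value]


# Hypothetical sample documents.
docs = [
    {"order_id": 3, "region": "West"},
    {"order_id": 1, "region": "East"},
    {"order_id": 2, "region": "West"},
]
index = SortedIndex(docs, "region")
west_positions = index.find("West")
```

Both lookups return the same positions; the indexed version avoids touching documents that cannot match, while every insert pays for an extra `insort` per index.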

Q7. Describe the stages of the MongoDB aggregation pipeline

  ->> Common Aggregation Pipeline Stages
- $match: Filters documents based on specified criteria, similar to a SQL WHERE clause. Example: Select orders where status = "shipped".

- $project: Reshapes documents by including, excluding, or computing new fields. Example: Show only name and total, and hide _id.

- $group: Groups documents by a field and performs aggregation operations like sum, avg, max, etc. Example: Group sales by region and calculate total revenue.

- $sort: Sorts documents by one or more fields in ascending (1) or descending (-1) order. Example: Sort products by price descending.

- $limit: Restricts the number of documents passed to the next stage. Example: Return only the top 5 results.

- $skip: Skips a specified number of documents. Example: Skip the first 10 documents for pagination.

- $unwind: Deconstructs an array field into multiple documents, one per array element. Example: Break down a tags array into individual tag documents.
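
Several of these stages are typically chained. The following is a minimal sketch, assuming a collection whose documents carry hypothetical `status`, `region`, and `total` fields; `orders` is assumed to be a PyMongo collection object, and the helper names are illustrative only.

```python
def top_regions_pipeline(limit=5):
    """Build a pipeline: filter, group, sort, then limit."""
    return [
        {"$match": {"status": "shipped"}},                       # keep shipped orders only
        {"$group": {"_id": "$region", "revenue": {"$sum": "$total"}}},  # total per region
        {"$sort": {"revenue": -1}},                              # highest revenue first
        {"$limit": limit},                                       # keep the top N groups
    ]


def top_regions(orders, limit=5):
    """Run the pipeline; `orders` is assumed to be a pymongo Collection."""
    return list(orders.aggregate(top_regions_pipeline(limit)))
```

Placing `$match` first is a common choice because later stages then process fewer documents.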

Q8. What is sharding in MongoDB? How does it differ from replication?

  ->> Sharding is MongoDB’s method of horizontal scaling. It partitions a large dataset by a shard key and distributes the resulting pieces across multiple servers, called shards.

->> Differences

- Purpose → Sharding enables horizontal scaling; replication ensures high availability and fault tolerance.

- Data distribution → Sharding splits data across multiple servers; replication duplicates the same data across servers.

- Write operations → Sharding distributes writes across shards; replication writes only to the primary node.

- Read operations → Sharding routes reads based on shard key; replication allows reads from secondary nodes.

- Failure handling → Sharding by itself provides no failover (each shard is normally deployed as a replica set for that); replication promotes a secondary to primary if the primary fails.
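
The contrast in data distribution can be sketched with plain lists. This is a toy model, not how MongoDB routes operations: it assumes a hashed key with a fixed shard count, copies each replicated write to every member directly (real secondaries apply the primary's oplog), and uses hypothetical function and field names.

```python
import hashlib


def shard_for(shard_key, num_shards):
    """Map a shard key to a shard number with a stable hash."""
    digest = hashlib.md5(str(shard_key).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def write_sharded(shards, doc, key_field):
    """Sharding: each document is stored on exactly one shard."""
    shards[shard_for(doc[key_field], len(shards))].append(doc)


def write_replicated(replicas, doc):
    """Replication: every member ends up with a copy of each document."""
    for replica in replicas:
        replica.append(doc)


shards = [[], [], []]
replicas = [[], [], []]
for i in range(30):
    doc = {"customer_id": f"C{i}"}  # hypothetical documents
    write_sharded(shards, doc, "customer_id")
    write_replicated(replicas, doc)

total_sharded = sum(len(shard) for shard in shards)  # 30 documents split across shards
copies_per_replica = [len(replica) for replica in replicas]  # 30 documents on each member
```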

Q9. What is PyMongo, and why is it used?

  ->> PyMongo is the official Python driver for MongoDB. It lets Python applications connect to MongoDB and run database operations.

->> Why

- Simplicity: Easy to learn and use, especially for Python developers.

- Performance: Efficient for synchronous operations and optimized for production use.

- Compatibility: Works with all major MongoDB versions and integrates well with MongoDB Atlas.

- Control: Offers low-level access to MongoDB features, unlike ORMs that abstract away database logic.

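Q10. What are the ACID properties in the context of MongoDB transactions?

  ->> In MongoDB, multi-document transactions provide the four ACID guarantees: atomicity (all operations in the transaction apply or none do), consistency (the data moves from one valid state to another), isolation (concurrent transactions do not see each other's uncommitted changes), and durability (committed changes survive failures). Writes to a single document are always atomic, even outside a transaction.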

Q11. What is the purpose of MongoDB’s explain() function?

  ->> The explain() function in MongoDB is used to analyze and understand how a query is executed. It provides detailed insights into the query planner’s decisions, execution statistics, and performance metrics, which makes it an essential tool for debugging and optimizing queries.
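
One common use of the explain output is checking whether a query used an index. The sketch below assumes the classic output shape, where `queryPlanner.winningPlan` is a chain of stages linked by `inputStage`; it does not handle plans with several `inputStages`, and the sample plans are hand-written for illustration rather than captured from a server.

```python
def plan_stages(plan):
    """Collect stage names from a winning plan, outermost first."""
    stages = []
    while plan is not None:
        stages.append(plan.get("stage"))
        plan = plan.get("inputStage")
    return stages


def uses_index(explain_output):
    """True if the winning plan contains an index scan (IXSCAN)."""
    winning = explain_output["queryPlanner"]["winningPlan"]
    return "IXSCAN" in plan_stages(winning)


# Hand-written sample outputs (hypothetical index name).
indexed_plan = {
    "queryPlanner": {
        "winningPlan": {
            "stage": "FETCH",
            "inputStage": {"stage": "IXSCAN", "indexName": "Region_1"},
        }
    }
}
scan_plan = {"queryPlanner": {"winningPlan": {"stage": "COLLSCAN"}}}
```

With PyMongo, a real plan could be obtained with something like `orders.find({"Region": "West"}).explain()` and passed to `uses_index`.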

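Q12. How does MongoDB handle schema validation?

  ->> MongoDB handles schema validation through validation rules attached to a collection, usually written with the $jsonSchema operator. The rules can require fields, restrict types, and constrain values; the validationLevel and validationAction options control which writes are checked and whether a violation is rejected or only logged.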
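
A validator might look like the sketch below. The collection name and the `order_id` and `sales` fields are hypothetical, and `db` is assumed to be a PyMongo database object; `create_collection` forwards the `validator` and `validationAction` options to the server.

```python
# Validation rules: both fields are required, sales must be a non-negative number.
order_validator = {
    "$jsonSchema": {
        "bsonType": "object",
        "required": ["order_id", "sales"],
        "properties": {
            "order_id": {"bsonType": "string"},
            "sales": {"bsonType": ["double", "int"], "minimum": 0},
        },
    }
}


def create_validated_collection(db, name="orders_validated"):
    """Create a collection that rejects documents violating the rules."""
    return db.create_collection(
        name,
        validator=order_validator,
        validationAction="error",  # reject invalid writes instead of only logging them
    )
```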

Q13. What is the difference between a primary and a secondary node in a replica set?

  ->>
- Role → Primary handles writes; secondary replicates data from the primary.

- Write operations → Only the primary accepts writes; secondaries do not.

- Read operations → Primary serves reads by default; secondaries can serve reads if configured.

- Data sync → Primary logs changes in the oplog; secondaries apply those changes asynchronously.

- Failover → If the primary fails, a secondary is elected to become the new primary.

- Election participation → Both can participate in elections; only one node is primary at a time.

- Data consistency → Primary is the source of truth; secondaries mirror its state.

- Availability → Primary is critical for writes; secondaries ensure redundancy and read scalability.

- Recovery → A failed secondary rejoins and resyncs; a failed primary triggers an election.

Q14. What security mechanisms does MongoDB provide for data protection?

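  ->> MongoDB provides several security mechanisms: authentication (SCRAM and x.509 certificates, plus LDAP or Kerberos in Enterprise), role-based access control, TLS/SSL encryption in transit, and encryption at rest and auditing in Enterprise and Atlas deployments.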

Q15. Explain the concept of embedded documents and when they should be used

  ->> 📦 What Are Embedded Documents?
- An embedded document is a field whose value is itself a document.

- This structure mimics real-world relationships — for example, a user document might contain an embedded address document.

✅ When to Use Embedded Documents
- One-to-One or One-to-Few Relationships: If the related data is tightly coupled and doesn’t grow unbounded, like a user’s profile or a product’s manufacturer details.

- Frequent Joint Access: When you often retrieve the parent and child data together, embedding reduces the need for multiple queries.

- Atomic Updates: Embedded documents allow updates to both parent and child data in a single atomic operation.

- Rare Updates to Subdocuments: If the embedded data doesn’t change often, embedding avoids the overhead of syncing separate collections.
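
The difference between embedding and referencing can be seen in the document shapes themselves. The documents below are hypothetical and held in plain Python structures; in MongoDB the referenced variant would live in two collections.

```python
# Embedded: the address lives inside the user document.
user_embedded = {
    "_id": 1,
    "name": "Claire Gute",
    "address": {"city": "Henderson", "country": "United States"},
}

# Referenced: the address is a separate document linked by user_id.
user_referenced = {"_id": 1, "name": "Claire Gute"}
addresses = [
    {"_id": 10, "user_id": 1, "city": "Henderson", "country": "United States"},
]


def city_embedded(user):
    """One read: the city comes with the user document."""
    return user["address"]["city"]


def city_referenced(user, address_docs):
    """Two reads: fetch the user, then look up the matching address."""
    for address in address_docs:
        if address["user_id"] == user["_id"]:
            return address["city"]
    return None
```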

Q16. What is the purpose of MongoDB’s $lookup stage in aggregation?

  ->> The $lookup stage in MongoDB’s aggregation pipeline is used to perform a left outer join between documents in different collections. This allows you to combine related data from multiple sources, similar to SQL joins, but within MongoDB’s flexible document model.
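
A sketch of the stage and of its left-outer-join behaviour follows. The `customers` collection and the `customer_id` field are hypothetical, and the `lookup` function is a pure-Python imitation of the semantics for illustration, not MongoDB's implementation.

```python
# The stage as it could be passed to collection.aggregate([...]).
lookup_stage = {
    "$lookup": {
        "from": "customers",          # collection to join with
        "localField": "customer_id",  # field in the input documents
        "foreignField": "_id",        # field in the "from" collection
        "as": "customer",             # output array field
    }
}


def lookup(left_docs, right_docs, local_field, foreign_field, as_field):
    """Imitate $lookup: every left document is kept, matches go into an array."""
    joined = []
    for left in left_docs:
        matches = [
            right for right in right_docs
            if right.get(foreign_field) == left.get(local_field)
        ]
        joined.append({**left, as_field: matches})
    return joined


orders = [{"_id": 1, "customer_id": "A"}, {"_id": 2, "customer_id": "Z"}]
customers = [{"_id": "A", "name": "Claire Gute"}]
joined = lookup(orders, customers, "customer_id", "_id", "customer")
```

The order with no matching customer is still returned, with an empty `customer` array, which is what makes the join "left outer".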

Q17. What are some common use cases for MongoDB?

  ->> 🛍️ E-Commerce Platforms
- Product catalogs can vary widely in structure, and MongoDB’s flexible schema handles that gracefully.

- Allows quick updates to product listings, customer profiles, and shopping carts.

📱 Mobile & Social Apps
- Stores user-generated content, profiles, and relationships efficiently.

- Supports real-time notifications, messaging, and offline-first features using MongoDB Atlas with Realm.

📊 Real-Time Analytics & Dashboards
- Processes large volumes of data quickly with the aggregation pipeline.

- Often used for monitoring systems, business intelligence tools, and interactive dashboards.

Q18. What are the advantages of using MongoDB for horizontal scaling?

  ->> Key Advantages of Horizontal Scaling in MongoDB
1. Scale-Out Capacity
- You can keep adding more shards (servers) as your data grows.

- No need to upgrade to expensive high-end machines—just scale out with commodity hardware.

2. Improved Performance
- Distributes read and write operations across multiple shards.

- Reduces bottlenecks and improves throughput for high-traffic applications.

3. High Availability
- Each shard is typically part of a replica set, ensuring redundancy.

- If one node fails, others can take over without downtime.

4. Cost Efficiency
- More predictable costs compared to vertical scaling, which often requires pricey hardware upgrades.

Q19. How do MongoDB transactions differ from SQL transactions?

  ->>
🧮 Core Differences Between MongoDB and SQL Transactions
1. Data Model
- SQL: Relational model with normalized tables and strict schemas.

- MongoDB: Document-oriented model with flexible, schema-less documents.

2. ACID Compliance
- SQL: Strong ACID guarantees are built-in and optimized for multi-row, multi-table operations.

- MongoDB: Supports multi-document ACID transactions since version 4.0 (single replica set) and distributed transactions since 4.2 (sharded clusters).

3. Transaction Scope
- SQL: Transactions commonly span multiple tables and rows.

- MongoDB: Initially focused on single-document atomicity; now supports multi-document and cross-collection transactions, but with more limitations (e.g., can't create collections across shards in a transaction).

4. Performance
- SQL: Optimized for complex transactional workloads.

- MongoDB: Transactions can be slower and more resource-intensive, especially in distributed setups.

5. Concurrency & Isolation
- SQL: Uses isolation levels (e.g., READ COMMITTED, SERIALIZABLE) to manage concurrency.

- MongoDB: Uses snapshot isolation for transactions, but with some restrictions on operations like $graphLookup in sharded collections.

6. Implementation Complexity
- SQL: Mature tooling and predictable behavior.

- MongoDB: Requires careful session management and understanding of replica sets or sharded clusters for distributed transactions.

Q20. What are the main differences between capped collections and regular collections?

  ->>
- Capped collections: fixed maximum size, preserve insertion order, and overwrite the oldest documents when full.

- Regular collections: no fixed size cap, and never overwrite documents automatically.
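
The overwrite-oldest behaviour resembles a bounded queue. The sketch below uses `collections.deque` as an analogy only: a capped collection is limited by bytes (`size`) and optionally by document count (`max`), whereas the deque is limited by count. The `create_capped` helper and its defaults are hypothetical, and `db` is assumed to be a PyMongo database object.

```python
from collections import deque

# Analogy: a bounded queue drops its oldest entry once it is full.
recent_events = deque(maxlen=3)
for n in range(1, 6):
    recent_events.append({"event": n})

kept = [entry["event"] for entry in recent_events]  # only the three newest remain


def create_capped(db, name="events", size_bytes=1024 * 1024, max_docs=1000):
    """Create a capped collection limited by size and document count."""
    return db.create_collection(name, capped=True, size=size_bytes, max=max_docs)
```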

Q21. What is the purpose of the $match stage in MongoDB’s aggregation pipeline?

  ->> 🎯 Purpose of $match in Aggregation
- Filters documents based on specified conditions, similar to a find() query.

- Ensures that only matching documents proceed to the next stage of the pipeline.

- Helps optimize performance by reducing the number of documents processed downstream.

Q22. How can you secure access to a MongoDB database?

  ->> Securing access starts with enabling authentication and creating users with least-privilege roles (role-based access control). Beyond that, encrypt connections with TLS, bind the server to trusted network interfaces or restrict access with firewalls and IP allow lists, and avoid exposing the default port to the public internet.
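
On the client side, credentials usually travel in the connection string and must be percent-encoded. This is a minimal sketch, assuming a single host, the `admin` authentication database, and TLS enabled; the `build_uri` helper and the sample credentials are hypothetical.

```python
from urllib.parse import quote_plus


def build_uri(user, password, host="localhost", port=27017, auth_db="admin"):
    """Build an authenticated MongoDB URI with escaped credentials."""
    return (
        f"mongodb://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/?authSource={auth_db}&tls=true"
    )


uri = build_uri("app_user", "p@ss/word")
```

The resulting string could then be passed to `MongoClient(uri)`; in practice the password would come from an environment variable or a secrets manager rather than source code.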

Q23. What is MongoDB’s WiredTiger storage engine, and why is it important?

  ->> WiredTiger is a high-performance, concurrent, and durable storage engine designed to handle large-scale data operations efficiently. It replaced the older MMAPv1 engine and became the default starting with MongoDB 3.2.

->> Why

- Performance Boost: WiredTiger’s architecture supports high-speed reads and writes, especially in write-heavy applications.

- Scalability: Efficient memory and disk usage make it suitable for large datasets and high-traffic environments.

- Reliability: Its journaling and checkpointing mechanisms help ensure data integrity even in the event of system failures.

- Storage Efficiency: Compression features reduce storage costs and improve I/O performance.

In [None]:
!pip install pymongo


Collecting pymongo
  Downloading pymongo-4.14.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (22 kB)
Collecting dnspython<3.0.0,>=1.16.0 (from pymongo)
  Downloading dnspython-2.7.0-py3-none-any.whl.metadata (5.8 kB)
Downloading pymongo-4.14.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.7 MB)
Downloading dnspython-2.7.0-py3-none-any.whl (313 kB)
Installing collected packages: dnspython, pymongo
Successfully installed dnspython-2.7.0 pymongo-4.14.1


In [17]:
df = pd.read_csv("superstore.csv", encoding="ISO-8859-1")
df.shape

(9994, 21)

In [12]:
df.dtypes

Row ID            int64
Order ID         object
Order Date       object
Ship Date        object
Ship Mode        object
Customer ID      object
Customer Name    object
Segment          object
Country          object
City             object


In [None]:
#1. Write a Python script to load the Superstore dataset from a CSV file into MongoDB?

import pandas as pd
from pymongo import MongoClient

# 1. Load Superstore CSV file using pandas
df = pd.read_csv("superstore.csv", encoding="ISO-8859-1")

# 2. Connect to MongoDB (local server)
client = MongoClient("mongodb://localhost:27017/")

# 3. Create database and collection
db = client["SuperstoreDB"]
orders_collection = db["Orders"]

# 4. Convert dataframe to dictionary and insert into MongoDB
data = df.to_dict(orient="records")
orders_collection.insert_many(data)



In [25]:
#2. Retrieve and print all documents from the Orders collection


print("1. All Documents (first 5):")
print(df.head(), "\n")


1. All Documents (first 5):
   Row ID        Order ID  Order Date   Ship Date       Ship Mode Customer ID  \
0       1  CA-2016-152156   11/8/2016  11/11/2016    Second Class    CG-12520   
1       2  CA-2016-152156   11/8/2016  11/11/2016    Second Class    CG-12520   
2       3  CA-2016-138688   6/12/2016   6/16/2016    Second Class    DV-13045   
3       4  US-2015-108966  10/11/2015  10/18/2015  Standard Class    SO-20335   
4       5  US-2015-108966  10/11/2015  10/18/2015  Standard Class    SO-20335   

     Customer Name    Segment        Country             City  ...  \
0      Claire Gute   Consumer  United States        Henderson  ...   
1      Claire Gute   Consumer  United States        Henderson  ...   
2  Darrin Van Huff  Corporate  United States      Los Angeles  ...   
3   Sean O'Donnell   Consumer  United States  Fort Lauderdale  ...   
4   Sean O'Donnell   Consumer  United States  Fort Lauderdale  ...   

  Postal Code  Region       Product ID         Category Sub-Cate

In [24]:
#3. Count and display the total number of documents in the Orders collection

print("2. Total number of documents (orders):")
print(len(df), "\n")

2. Total number of documents (orders):
9994 



In [23]:
#4. Write a query to fetch all orders from the "West" region

print("3. Orders from 'West' region (sample):")
print(df[df["Region"] == "West"].head(), "\n")

3. Orders from 'West' region (sample):
   Row ID        Order ID Order Date  Ship Date       Ship Mode Customer ID  \
2       3  CA-2016-138688  6/12/2016  6/16/2016    Second Class    DV-13045   
5       6  CA-2014-115812   6/9/2014  6/14/2014  Standard Class    BH-11710   
6       7  CA-2014-115812   6/9/2014  6/14/2014  Standard Class    BH-11710   
7       8  CA-2014-115812   6/9/2014  6/14/2014  Standard Class    BH-11710   
8       9  CA-2014-115812   6/9/2014  6/14/2014  Standard Class    BH-11710   

     Customer Name    Segment        Country         City  ... Postal Code  \
2  Darrin Van Huff  Corporate  United States  Los Angeles  ...       90036   
5  Brosina Hoffman   Consumer  United States  Los Angeles  ...       90032   
6  Brosina Hoffman   Consumer  United States  Los Angeles  ...       90032   
7  Brosina Hoffman   Consumer  United States  Los Angeles  ...       90032   
8  Brosina Hoffman   Consumer  United States  Los Angeles  ...       90032   

   Region       P

In [26]:
#5. Write a query to find orders where Sales is greater than 500

print("4. Orders with Sales > 500 (sample):")
print(df[df["Sales"] > 500].head(), "\n")

4. Orders with Sales > 500 (sample):
    Row ID        Order ID  Order Date   Ship Date       Ship Mode  \
1        2  CA-2016-152156   11/8/2016  11/11/2016    Second Class   
3        4  US-2015-108966  10/11/2015  10/18/2015  Standard Class   
7        8  CA-2014-115812    6/9/2014   6/14/2014  Standard Class   
10      11  CA-2014-115812    6/9/2014   6/14/2014  Standard Class   
11      12  CA-2014-115812    6/9/2014   6/14/2014  Standard Class   

   Customer ID    Customer Name   Segment        Country             City  \
1     CG-12520      Claire Gute  Consumer  United States        Henderson   
3     SO-20335   Sean O'Donnell  Consumer  United States  Fort Lauderdale   
7     BH-11710  Brosina Hoffman  Consumer  United States      Los Angeles   
10    BH-11710  Brosina Hoffman  Consumer  United States      Los Angeles   
11    BH-11710  Brosina Hoffman  Consumer  United States      Los Angeles   

    ... Postal Code  Region       Product ID    Category Sub-Category  \
1   ..

In [27]:
#6.Fetch the top 3 orders with the highest Profit

print("5. Top 3 orders with highest Profit:")
print(df.nlargest(3, "Profit"), "\n")

5. Top 3 orders with highest Profit:
      Row ID        Order ID  Order Date   Ship Date       Ship Mode  \
6826    6827  CA-2016-118689   10/2/2016   10/9/2016  Standard Class   
8153    8154  CA-2017-140151   3/23/2017   3/25/2017     First Class   
4190    4191  CA-2017-166709  11/17/2017  11/22/2017  Standard Class   

     Customer ID Customer Name    Segment        Country       City  ...  \
6826    TC-20980  Tamara Chand  Corporate  United States  Lafayette  ...   
8153    RB-19360  Raymond Buch   Consumer  United States    Seattle  ...   
4190    HL-15040  Hunter Lopez   Consumer  United States     Newark  ...   

     Postal Code   Region       Product ID    Category Sub-Category  \
6826       47905  Central  TEC-CO-10004722  Technology      Copiers   
8153       98115     West  TEC-CO-10004722  Technology      Copiers   
4190       19711     East  TEC-CO-10004722  Technology      Copiers   

                               Product Name     Sales  Quantity  Discount  \
6826  C

In [28]:
#7. Update all orders with Ship Mode as "First Class" to "Premium Class"

df_updated = df.copy()
df_updated.loc[df_updated["Ship Mode"] == "First Class", "Ship Mode"] = "Premium Class"
print("6. Updated Ship Mode (sample):")
print(df_updated.head(), "\n")

6. Updated Ship Mode (sample):
   Row ID        Order ID  Order Date   Ship Date       Ship Mode Customer ID  \
0       1  CA-2016-152156   11/8/2016  11/11/2016    Second Class    CG-12520   
1       2  CA-2016-152156   11/8/2016  11/11/2016    Second Class    CG-12520   
2       3  CA-2016-138688   6/12/2016   6/16/2016    Second Class    DV-13045   
3       4  US-2015-108966  10/11/2015  10/18/2015  Standard Class    SO-20335   
4       5  US-2015-108966  10/11/2015  10/18/2015  Standard Class    SO-20335   

     Customer Name    Segment        Country             City  ...  \
0      Claire Gute   Consumer  United States        Henderson  ...   
1      Claire Gute   Consumer  United States        Henderson  ...   
2  Darrin Van Huff  Corporate  United States      Los Angeles  ...   
3   Sean O'Donnell   Consumer  United States  Fort Lauderdale  ...   
4   Sean O'Donnell   Consumer  United States  Fort Lauderdale  ...   

  Postal Code  Region       Product ID         Category Sub-C

In [29]:
#8. Delete all orders where Sales is less than 50

df_filtered = df[df["Sales"] >= 50]
print("7. Orders after deleting Sales < 50 (sample):")
print(df_filtered.head(), "\n")


7. Orders after deleting Sales < 50 (sample):
   Row ID        Order ID  Order Date   Ship Date       Ship Mode Customer ID  \
0       1  CA-2016-152156   11/8/2016  11/11/2016    Second Class    CG-12520   
1       2  CA-2016-152156   11/8/2016  11/11/2016    Second Class    CG-12520   
3       4  US-2015-108966  10/11/2015  10/18/2015  Standard Class    SO-20335   
7       8  CA-2014-115812    6/9/2014   6/14/2014  Standard Class    BH-11710   
9      10  CA-2014-115812    6/9/2014   6/14/2014  Standard Class    BH-11710   

     Customer Name   Segment        Country             City  ... Postal Code  \
0      Claire Gute  Consumer  United States        Henderson  ...       42420   
1      Claire Gute  Consumer  United States        Henderson  ...       42420   
3   Sean O'Donnell  Consumer  United States  Fort Lauderdale  ...       33311   
7  Brosina Hoffman  Consumer  United States      Los Angeles  ...       90032   
9  Brosina Hoffman  Consumer  United States      Los Angeles  

In [30]:
#9.Use aggregation to group orders by Region and calculate total sales per region

print("8. Total Sales by Region:")
print(df.groupby("Region")["Sales"].sum().reset_index(), "\n")


8. Total Sales by Region:
    Region        Sales
0  Central  501239.8908
1     East  678781.2400
2    South  391721.9050
3     West  725457.8245 



In [31]:
#10. Fetch all distinct values for Ship Mode from the collection

print("9. Distinct Ship Modes:")
print(df["Ship Mode"].unique(), "\n")


9. Distinct Ship Modes:
['Second Class' 'Standard Class' 'First Class' 'Same Day'] 



In [32]:
#11. Count the number of orders for each category

print("10. Number of Orders per Category:")
print(df["Category"].value_counts().reset_index().rename(
    columns={"count": "OrderCount"}
))


10. Number of Orders per Category:
          Category  OrderCount
0  Office Supplies        6026
1        Furniture        2121
2       Technology        1847
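
Tasks 2 to 11 above are answered with pandas on the in-memory DataFrame rather than against the Orders collection. A hedged sketch of how the same tasks might be expressed with PyMongo follows. It assumes `orders` is the `Orders` collection loaded in task 1 and that the field names match the CSV column headers; the constant names and the `run_tasks` helper are hypothetical. Unlike the pandas cells, which work on copies, `update_many` and `delete_many` change the stored data.

```python
# Query filters and pipelines, one per task.
WEST = {"Region": "West"}                                   # task 4
SALES_OVER_500 = {"Sales": {"$gt": 500}}                    # task 5
FIRST_CLASS = {"Ship Mode": "First Class"}                  # task 7 filter
TO_PREMIUM = {"$set": {"Ship Mode": "Premium Class"}}       # task 7 update
SALES_UNDER_50 = {"Sales": {"$lt": 50}}                     # task 8
SALES_BY_REGION = [                                         # task 9
    {"$group": {"_id": "$Region", "total_sales": {"$sum": "$Sales"}}},
]
ORDERS_PER_CATEGORY = [                                     # task 11
    {"$group": {"_id": "$Category", "order_count": {"$sum": 1}}},
    {"$sort": {"order_count": -1}},
]


def run_tasks(orders):
    """Run tasks 2-11 against a pymongo Collection and return the results."""
    results = {}
    results["all_documents"] = list(orders.find({}))                   # task 2
    results["count"] = orders.count_documents({})                      # task 3
    results["west"] = list(orders.find(WEST))                          # task 4
    results["high_sales"] = list(orders.find(SALES_OVER_500))          # task 5
    results["top_profit"] = list(
        orders.find({}).sort("Profit", -1).limit(3)                    # task 6
    )
    results["sales_by_region"] = list(orders.aggregate(SALES_BY_REGION))    # task 9
    results["ship_modes"] = orders.distinct("Ship Mode")               # task 10
    results["per_category"] = list(orders.aggregate(ORDERS_PER_CATEGORY))   # task 11
    # The two writes run last so the reads above see the data as loaded.
    results["updated"] = orders.update_many(FIRST_CLASS, TO_PREMIUM).modified_count  # task 7
    results["deleted"] = orders.delete_many(SALES_UNDER_50).deleted_count            # task 8
    return results
```

A call such as `run_tasks(db["Orders"])` would need the MongoDB server from task 1 to be running.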
