# MongoDB Theory

---

### 1. What are the key differences between SQL and NoSQL databases?

* **SQL**: Relational; data lives in tables with a predefined schema, and related rows are combined with JOINs.
* **NoSQL** (like MongoDB): Non-relational; MongoDB is document-based, has a flexible schema, scales horizontally, and stores data in JSON-like documents.

---

### 2. What makes MongoDB a good choice for modern applications?

* Schema flexibility
* JSON-style documents (stored as BSON, a binary form of JSON)
* High scalability and performance
* Real-time analytics support
* Easy cloud deployment (e.g., MongoDB Atlas)

---

### 3. Explain the concept of collections in MongoDB.

* Collections are the MongoDB counterpart of tables in SQL, but they do not enforce a fixed schema by default.
* They hold **documents** (records), which are JSON-like.
* Example:

```json
{
  "name": "Alice",
  "age": 25,
  "skills": ["Python", "MongoDB"]
}
```

---

### 4. How does MongoDB ensure high availability using replication?

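* Uses **Replica Sets**: a group of nodes with one `primary` and one or more `secondary` members that hold copies of the same data.
* If the primary fails, the remaining members hold an election and promote a secondary to primary automatically.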

---

### 5. What are the main benefits of MongoDB Atlas?

* Fully managed cloud database.
* Automated backups, scaling, monitoring.
* Global clusters, serverless options, and security features.

---

### 6. What is the role of indexes in MongoDB, and how do they improve performance?

* Indexes support fast query execution.
* Example:

```python
# Assumes `db` is a PyMongo Database handle.
db.users.create_index("email")
```

* Without an index, MongoDB performs a collection scan: it examines every document to find matches.
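
To see why an index helps, the toy sketch below contrasts a full scan with a lookup through a sorted list of keys, which loosely mimics the ordered structure of a B-tree index. It runs purely in memory and is an analogy, not how MongoDB is implemented; `users`, `collection_scan`, and `index_lookup` are illustrative names.

```python
import bisect

# Hypothetical in-memory "collection" of 1,000 user documents.
users = [{"_id": i, "email": f"user{i:04d}@example.com"} for i in range(1000)]

def collection_scan(docs, email):
    """Check every document, like a query with no usable index."""
    examined = 0
    match = None
    for doc in docs:
        examined += 1
        if doc["email"] == email:
            match = doc
    return match, examined

# A sorted list of (key, position) pairs stands in for an index on "email".
email_index = sorted((doc["email"], pos) for pos, doc in enumerate(users))
index_keys = [key for key, _ in email_index]

def index_lookup(docs, email):
    """Binary-search the sorted keys, then fetch a single document."""
    i = bisect.bisect_left(index_keys, email)
    if i < len(index_keys) and index_keys[i] == email:
        return docs[email_index[i][1]], 1
    return None, 0

scan_doc, scan_examined = collection_scan(users, "user0742@example.com")
idx_doc, idx_examined = index_lookup(users, "user0742@example.com")
print(scan_examined, idx_examined)  # prints: 1000 1
```

Both functions return the same document; only the number of documents examined differs.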

---

### 7. Describe the stages of the MongoDB aggregation pipeline.

* `$match`: Filter data
* `$group`: Group by fields
* `$project`: Select/transform fields
* `$sort`: Order documents
* `$limit`, `$skip`: Pagination
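
The stages above can be chained into one pipeline. The sketch below builds such a pipeline as a plain Python list; the `Sales` and `Region` field names are borrowed from the Superstore exercise later in these notes, and the function name and its parameters are illustrative assumptions.

```python
def sales_by_region_pipeline(min_sales, page, page_size):
    """Build an aggregation pipeline as a plain list of stage documents."""
    return [
        # Filter first so later stages process fewer documents.
        {"$match": {"Sales": {"$gte": min_sales}}},
        # One output document per region, with the summed sales.
        {"$group": {"_id": "$Region", "total_sales": {"$sum": "$Sales"}}},
        # Rename _id to a friendlier field name.
        {"$project": {"_id": 0, "region": "$_id", "total_sales": 1}},
        {"$sort": {"total_sales": -1}},
        # Pagination: skip the earlier pages, then cap the page size.
        {"$skip": page * page_size},
        {"$limit": page_size},
    ]

pipeline = sales_by_region_pipeline(min_sales=100, page=0, page_size=5)
# With a PyMongo collection handle this would run as:
#     results = list(orders_col.aggregate(pipeline))
```

Stage order matters: `$skip` must come before `$limit` for pagination, and both must come after `$sort` for the pages to be stable.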

---

### 8. What is sharding in MongoDB? How does it differ from replication?

* **Sharding**: Splits data across shards (horizontal scaling).
* **Replication**: Duplicates data for fault tolerance.
* They **can be used together**.
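
As a rough illustration of horizontal partitioning, the toy model below assigns each document to one shard by hashing its shard key. This is a simplification: MongoDB actually partitions the (optionally hashed) key space into ranges and routes queries through `mongos`; the modulo scheme and all names here are assumptions made for the example.

```python
import hashlib

def shard_for(shard_key_value, shard_count):
    """Map a shard-key value to a shard number with a stable hash (toy model)."""
    digest = hashlib.md5(str(shard_key_value).encode("utf-8")).hexdigest()
    return int(digest, 16) % shard_count

def distribute(docs, shard_key, shard_count):
    """Place every document on exactly one shard, chosen by its shard key."""
    shards = {n: [] for n in range(shard_count)}
    for doc in docs:
        shards[shard_for(doc[shard_key], shard_count)].append(doc)
    return shards

docs = [{"user_id": i} for i in range(12)]
shards = distribute(docs, "user_id", shard_count=3)
for number, members in shards.items():
    print(number, [d["user_id"] for d in members])
```

Each shard holds a different subset of the data, whereas every member of a replica set holds a full copy of its shard's data.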

---

### 9. What is PyMongo, and why is it used?

* PyMongo is the **official Python driver** for MongoDB.
* Used to connect and interact with MongoDB from Python.

```python
  from pymongo import MongoClient
  client = MongoClient("mongodb://localhost:27017/")
```

---

### 10. What are the ACID properties in the context of MongoDB transactions?

* **Atomicity, Consistency, Isolation, Durability**
* Supported in multi-document transactions (replica sets since v4.0, sharded clusters since v4.2); writes to a single document have always been atomic.
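
A minimal sketch of a multi-document transaction with PyMongo follows. It assumes a deployment that supports transactions (a replica set or sharded cluster) and a hypothetical `bank.accounts` collection with a numeric `balance` field; the `transfer` helper is illustrative.

```python
def transfer(client, from_id, to_id, amount):
    """Move `amount` between two account documents in one transaction."""
    accounts = client["bank"]["accounts"]  # hypothetical database and collection
    with client.start_session() as session:
        # The block commits on normal exit and aborts if an exception is raised.
        with session.start_transaction():
            accounts.update_one(
                {"_id": from_id}, {"$inc": {"balance": -amount}}, session=session
            )
            accounts.update_one(
                {"_id": to_id}, {"$inc": {"balance": amount}}, session=session
            )

# Usage, given a connected MongoClient named `client`:
#     transfer(client, "alice", "bob", 50)
```

Passing `session=session` to each operation is what ties it to the transaction; an operation without it runs outside the transaction.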

---

### 11. What is the purpose of MongoDB’s `explain()` function?

* Analyzes and outputs query execution plans.

```python
# Returns the query plan as a dict instead of the matching documents.
db.collection.find({"name": "Alice"}).explain()
```
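
The returned plan can be long, so a small helper that pulls out the key facts may be useful. The sketch below assumes the classic layout (`queryPlanner.winningPlan` with nested `inputStage` entries, plus `executionStats`); the exact shape varies between server versions, plans with several `inputStages` are not handled, and the sample dict is hand-written rather than real server output.

```python
def summarize_plan(explain_output):
    """Collect the stage names of the winning plan, outermost first."""
    stages = []
    node = explain_output.get("queryPlanner", {}).get("winningPlan", {})
    while node:
        stages.append(node.get("stage"))
        node = node.get("inputStage")
    stats = explain_output.get("executionStats", {})
    return {
        "stages": stages,
        "docs_examined": stats.get("totalDocsExamined"),
        "returned": stats.get("nReturned"),
        "full_scan": "COLLSCAN" in stages,  # True when no index was used
    }

# Hand-written sample shaped like explain output, not a real server response.
sample = {
    "queryPlanner": {
        "winningPlan": {"stage": "FETCH", "inputStage": {"stage": "IXSCAN"}}
    },
    "executionStats": {"nReturned": 1, "totalDocsExamined": 1},
}
summary = summarize_plan(sample)
print(summary)
```

A `COLLSCAN` stage, or a `docs_examined` value far above `returned`, usually suggests that an index is missing or unused.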

---

### 12. How does MongoDB handle schema validation?

* Define rules with a `validator` (typically a `$jsonSchema` document) when creating the collection; writes that violate the rules are rejected by default.

```python
# Reject documents that are not objects or that lack "name" or "age".
db.create_collection("users", validator={
    "$jsonSchema": {
        "bsonType": "object",
        "required": ["name", "age"]
    }
})
```
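
A validator can also be attached to a collection that already exists, using the `collMod` database command. The sketch below assumes a PyMongo `Database` handle and a `users` collection; the helper name and the per-field rules are illustrative choices, not part of the original notes.

```python
def add_user_validator(db):
    """Attach a $jsonSchema validator to an existing `users` collection."""
    schema = {
        "bsonType": "object",
        "required": ["name", "age"],
        "properties": {
            "name": {"bsonType": "string"},
            "age": {"bsonType": "int", "minimum": 0},
        },
    }
    return db.command(
        "collMod",
        "users",
        validator={"$jsonSchema": schema},
        # "moderate": updates to documents that were already invalid are not checked.
        validationLevel="moderate",
        # "error": reject violating writes instead of only logging a warning.
        validationAction="error",
    )

# Usage, given a PyMongo database handle named `db`:
#     add_user_validator(db)
```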

---

### 13. What is the difference between a primary and a secondary node in a replica set?

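* **Primary**: The only member that accepts writes; reads also go to it by default.
* **Secondary**: Replicates the primary's oplog; can serve reads if the client sets a read preference, and can be elected primary.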

---

### 14. What security mechanisms does MongoDB provide for data protection?

* Role-Based Access Control (RBAC)
* TLS/SSL encryption
* Authentication (SCRAM, x.509)
* IP whitelisting
* Encryption-at-rest

---

### 15. Explain the concept of embedded documents and when they should be used.

* Store related data inside the same document.
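* Ideal for **one-to-few** relationships where the related data is read together with its parent; large or unbounded lists are better kept in a separate collection, since a single document cannot exceed 16 MB.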

```json
{
  "name": "Alice",
  "address": { "city": "Delhi", "zip": "110001" }
}
```
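
For contrast, the sketch below puts an embedded document next to the referencing alternative, using plain Python dicts. The order data and the `orders_for` helper are made up for illustration.

```python
# Embedded: the address lives inside the user document, so one read returns both.
embedded_user = {
    "_id": 1,
    "name": "Alice",
    "address": {"city": "Delhi", "zip": "110001"},
}

# Referenced: orders are separate documents that point back to the user.
referenced_user = {"_id": 1, "name": "Alice"}
orders = [
    {"_id": 101, "user_id": 1, "total": 250},
    {"_id": 102, "user_id": 1, "total": 90},
    {"_id": 103, "user_id": 2, "total": 40},
]

def orders_for(user, all_orders):
    """Resolve the reference with a second lookup, as an application would."""
    return [o for o in all_orders if o["user_id"] == user["_id"]]

print(embedded_user["address"]["city"])                         # prints: Delhi
print([o["_id"] for o in orders_for(referenced_user, orders)])  # prints: [101, 102]
```

Embedding saves the second lookup; referencing keeps the parent document small when the related list can grow without bound.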

---

### 16. What is the purpose of MongoDB’s `$lookup` stage in aggregation?

* Performs **JOIN-like** operations (a left outer join) between collections.

```json
{ "$lookup": {
  "from": "orders",
  "localField": "user_id",
  "foreignField": "user_id",
  "as": "order_info"
}}
```

---

### 17. What are some common use cases for MongoDB?

* Content management systems
* IoT applications
* Real-time analytics
* Catalogs and inventories
* Mobile apps

---

### 18. What are the advantages of using MongoDB for horizontal scaling?

* Shards can be added to a running cluster.
* The balancer redistributes data across shards automatically.
* Read and write capacity grows with the number of shards.

---

### 19. How do MongoDB transactions differ from SQL transactions?

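* MongoDB added multi-document transactions in v4.0 (replica sets) and v4.2 (sharded clusters), whereas they are a core, long-established feature of relational databases.
* Single-document writes are always atomic, so well-modeled (embedded) data often needs no transaction; when it does, multi-document transactions provide ACID guarantees at some performance cost.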

---

### 20. What are the main differences between capped collections and regular collections?

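* **Capped collections**: Fixed size, keep insertion order, and overwrite the oldest documents automatically once full.
* **Regular collections**: Grow as needed; documents stay until deleted explicitly (or expired by a TTL index).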
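
A capped collection has to be created explicitly. The sketch below wraps PyMongo's `create_collection` with the `capped`, `size`, and `max` options; the helper name, the `event_log` collection name, and the default limits are assumptions for illustration.

```python
def create_capped_log(db, name="event_log", size_bytes=1_048_576, max_docs=1000):
    """Create a capped collection limited by size and by document count."""
    # `size` is the maximum size in bytes; `max` caps the number of documents.
    return db.create_collection(name, capped=True, size=size_bytes, max=max_docs)

# Usage, given a PyMongo database handle named `db`:
#     log_col = create_capped_log(db)
#     log_col.insert_one({"event": "login", "user": "alice"})
```

Once either limit is reached, each new insert displaces the oldest document, which suits logs and other rolling buffers.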

---

### 21. What is the purpose of the `$match` stage in MongoDB’s aggregation pipeline?

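* Filters documents with the same query syntax as `find()`; place it as early as possible (ideally as the **first stage**) so it can use indexes and shrink the data passed to later stages.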

```json
{ "$match": { "status": "active" } }
```

---

### 22. How can you secure access to a MongoDB database?

* Enable authentication
* Use TLS/SSL
* Define user roles
* Enable auditing and IP whitelisting

---

### 23. What is MongoDB’s WiredTiger storage engine, and why is it important?

* Default storage engine (since v3.2).
* Supports compression, document-level concurrency control, and checkpoints.
* Performs well under high-throughput workloads.


# Practical Questions

In [1]:
# pip install pandas pymongo




In [2]:
import pandas as pd
from pymongo import MongoClient

In [None]:
# MongoDB connection setup
client = MongoClient("mongodb://localhost:27017/")
db = client["superstore"]
orders_col = db["orders"]

In [None]:
# 1. Write a Python script to load the Superstore dataset from a CSV file into MongoDB.
df = pd.read_csv("superstore.csv")
orders_col.delete_many({})
orders_col.insert_many(df.to_dict(orient="records"))
print("Data loaded into MongoDB")
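
For larger CSV files, inserting in batches keeps each request small. The sketch below is one possible approach, not part of the original exercise: `chunked` and `load_in_batches` are hypothetical helpers, and `collection` is assumed to be a PyMongo collection such as `orders_col`.

```python
def chunked(records, batch_size):
    """Yield successive slices of `records` with at most `batch_size` items."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]

def load_in_batches(collection, records, batch_size=1000):
    """Insert records batch by batch and return how many were sent."""
    inserted = 0
    for batch in chunked(records, batch_size):
        # ordered=False lets the server continue past a failing document.
        collection.insert_many(batch, ordered=False)
        inserted += len(batch)
    return inserted

# Usage with the objects defined in the cells above:
#     load_in_batches(orders_col, df.to_dict(orient="records"))
```

An empty record list results in no `insert_many` call at all, which avoids the error PyMongo raises for an empty document list.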

In [None]:
# 2. Retrieve and print all documents from the Orders collection
print("\n All Documents:")
for doc in orders_col.find():
    print(doc)

In [None]:
# 3. Count and display the total number of documents in the Orders collection

total_docs = orders_col.count_documents({})
print(f"\n Total Documents: {total_docs}")

In [None]:
# 4. Write a query to fetch all orders from the "West" region.
print("\n Orders from West region:")
for doc in orders_col.find({"Region": "West"}):
    print(doc)

In [None]:
# 5. Write a query to find orders where Sales is greater than 500.
print("\n Orders with Sales > 500:")
for doc in orders_col.find({"Sales": {"$gt": 500}}):
    print(doc)

In [None]:
# 6. Fetch the top 3 orders with the highest Profit
print("\n Top 3 orders by Profit:")
for doc in orders_col.find().sort("Profit", -1).limit(3):
    print(doc)

In [None]:
# 7. Update all orders with Ship Mode "First Class" to "Premium Class".
update_result = orders_col.update_many(
    {"Ship Mode": "First Class"},
    {"$set": {"Ship Mode": "Premium Class"}}
)
print(f"\n Updated {update_result.modified_count} documents from 'First Class' to 'Premium Class'.")

In [None]:
# 8. Delete all orders where Sales is less than 50.
delete_result = orders_col.delete_many({"Sales": {"$lt": 50}})
print(f"\n Deleted {delete_result.deleted_count} documents where Sales < 50.")

In [None]:
# 9. Use aggregation to group orders by Region and calculate total sales per region.
print("\n Total Sales per Region:")
pipeline = [
    {"$group": {"_id": "$Region", "total_sales": {"$sum": "$Sales"}}}
]
for result in orders_col.aggregate(pipeline):
    print(result)

In [None]:
# 10. Fetch all distinct values for Ship Mode from the collection.
distinct_modes = orders_col.distinct("Ship Mode")
print(f"\n Distinct Ship Modes: {distinct_modes}")

In [None]:
# 11. Count the number of orders for each category.
print("\n📦 Orders count per Category:")
pipeline = [
    {"$group": {"_id": "$Category", "order_count": {"$sum": 1}}}
]
for result in orders_col.aggregate(pipeline):
    print(result)