Theory questions

1. What are the key differences between SQL and NoSQL databases?
- The key differences between SQL (relational) and NoSQL (non-relational) databases lie in their data models, scalability, schema flexibility, and use cases:

Data Model:

- SQL Databases: Employ a relational model, organizing data into structured tables with predefined schemas, where relationships between tables are established using primary and foreign keys.

- NoSQL Databases: Utilize various non-relational data models, such as document (e.g., MongoDB), key-value (e.g., Redis), wide-column (e.g., Cassandra), or graph (e.g., Neo4j). They offer more flexibility in data structure.

Scalability:

- SQL Databases: Primarily scale vertically, meaning performance is enhanced by increasing the resources (CPU, RAM) of a single server.
- NoSQL Databases: Are designed for horizontal scaling, distributing data across multiple servers or nodes, which allows for easier scaling out to handle large volumes of data and traffic.

Schema Flexibility:

- SQL Databases: Enforce a strict, predefined schema, requiring data to conform to specific table structures and data types. Changes to the schema can be complex.
- NoSQL Databases: Offer schema flexibility, often referred to as "schemaless" or "schema-on-read," allowing for dynamic changes to data structures without requiring a predefined schema. This is well-suited for unstructured or semi-structured data.

Consistency and Transactions:

- SQL Databases: Typically adhere to ACID (Atomicity, Consistency, Isolation, Durability) properties, ensuring strong data consistency and reliable transactions, making them suitable for applications requiring strict data integrity (e.g., financial systems).
- NoSQL Databases: Often prioritize availability and partition tolerance over strict consistency (following BASE – Basically Available, Soft state, Eventually consistent – principles), making them suitable for applications where high availability and performance are paramount, even if it means eventual consistency.

Use Cases:

- SQL Databases: Ideal for applications requiring complex queries, strong data integrity, and well-defined relationships, such as enterprise resource planning (ERP) systems, financial applications, and traditional web applications.
- NoSQL Databases: Well-suited for handling large volumes of unstructured or rapidly changing data, real-time web applications, big data analytics, content management systems, and applications requiring high scalability and performance.

2. What makes MongoDB a good choice for modern applications?
- MongoDB's suitability for modern applications stems from several key features:

- Flexible Schema (Schema-less Design):

Unlike traditional relational databases that require a predefined schema, MongoDB's document-oriented model allows for dynamic and flexible schemas. This means developers can store documents with varying structures within the same collection, simplifying data modeling and enabling rapid iteration in agile development environments.

- Scalability:

MongoDB is designed for horizontal scalability through sharding, which distributes data across multiple servers. This allows applications to handle massive amounts of data and high traffic loads, making it ideal for large-scale, data-intensive applications.

- High Performance:

MongoDB's ability to index documents, perform fast read/write operations, and utilize a binary JSON (BSON) format for efficient data storage contributes to its high performance, crucial for real-time applications and those requiring quick data access.

3. Explain the concept of collections in MongoDB.
- In MongoDB, a collection is a grouping of related documents. It serves a similar purpose to a table in a relational database, but with a key difference: collections in MongoDB are schema-less by default.
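
As a small illustration of the schema-less default, the sketch below places two differently shaped documents in one collection. The field names and values are made up for the example, and `collection` is assumed to be a PyMongo `Collection` object obtained elsewhere.

```python
# Two documents with different fields can live in the same collection.
# Field names and values are illustrative only.
documents = [
    {"name": "Desk lamp", "price": 24.99},
    {
        "name": "Office chair",
        "price": 139.0,
        "dimensions": {"height_cm": 110},
        "tags": ["furniture"],
    },
]


def insert_mixed_documents(collection):
    """Insert both documents; `collection` is assumed to be a PyMongo Collection."""
    result = collection.insert_many(documents)
    return result.inserted_ids
```

A relational table would need nullable columns or a separate table for `dimensions` and `tags`; here each document simply carries the fields it has.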

4.  How does MongoDB ensure high availability using replication?
- MongoDB ensures high availability through replica sets, which are groups of mongod instances that maintain the same data set. This setup provides redundancy and automatic failover, eliminating single points of failure.
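
From the client side, a replica set is usually addressed by listing several members in the connection string. The sketch below only builds such a string; the host names and the set name `rs0` are hypothetical.

```python
from urllib.parse import urlencode


def replica_set_uri(hosts, replica_set, read_preference="primary"):
    """Build a MongoDB connection string that lists several replica set members."""
    options = urlencode({"replicaSet": replica_set, "readPreference": read_preference})
    return f"mongodb://{','.join(hosts)}/?{options}"


# Hypothetical hosts and set name, for illustration only.
uri = replica_set_uri(
    ["db1.example.net:27017", "db2.example.net:27017", "db3.example.net:27017"],
    replica_set="rs0",
)
print(uri)
```

Passing a string like this to `pymongo.MongoClient` lets the driver discover the current primary and reconnect to the new one after a failover.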

5. What are the main benefits of MongoDB Atlas?
- Key benefits of MongoDB Atlas:
Fully Managed – Automated backups, monitoring, patching, and scaling.

Multi-Cloud & Global – Runs on AWS, Azure, or GCP with global data distribution.

Secure – Built-in encryption, access controls, and compliance (SOC 2, HIPAA, GDPR).

Scalable – Auto-scaling and sharding for growing workloads.

Developer-Friendly – Flexible schema, built-in APIs (REST, GraphQL), and analytics tools.

Cost-Efficient – Pay-as-you-go with a free tier for testing and small apps.

6. What is the role of indexes in MongoDB, and how do they improve performance?
- In MongoDB, indexes are crucial for enhancing query performance by enabling faster data retrieval. They act as a roadmap for the database, allowing it to quickly locate documents based on indexed fields, thus avoiding full collection scans. Instead of sifting through every document, MongoDB can use the index to pinpoint the relevant data directly, significantly reducing query execution time.
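
A minimal sketch of creating indexes with PyMongo follows. The field names (`Customer ID`, `Category`, `Sales`) are assumed column names for an orders collection, and `collection` is assumed to be a PyMongo `Collection`.

```python
def ensure_order_indexes(collection):
    """Create two example indexes and return their names."""
    # Single-field index: speeds up equality filters on "Customer ID".
    by_customer = collection.create_index("Customer ID")
    # Compound index: supports filtering by Category and sorting by Sales, descending.
    by_category_sales = collection.create_index([("Category", 1), ("Sales", -1)])
    return [by_customer, by_category_sales]
```

Indexes are not free: each one takes storage and must be updated on every write, so it usually pays to index only the fields that queries actually filter or sort on.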

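7. Describe the stages of the MongoDB aggregation pipeline.
- The MongoDB aggregation pipeline consists of multiple stages that process documents in a sequence, with the output of one stage serving as the input for the next. Each stage performs a specific operation to transform, filter, or analyze the data. Common stages include $match (filter documents), $group (aggregate by key), $sort (order results), $project (reshape fields), $limit (cap the number of results), $unwind (flatten arrays), and $lookup (join with another collection).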
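
In PyMongo a pipeline is just a list of stage dictionaries. The sketch below totals sales per category for one region and keeps the top three; the field names (`Region`, `Category`, `Sales`) are assumptions about the data, and `collection` is assumed to be a PyMongo `Collection`.

```python
# Each dictionary is one stage; documents flow through them in order.
pipeline = [
    {"$match": {"Region": "West"}},  # keep only matching documents
    {"$group": {"_id": "$Category", "total_sales": {"$sum": "$Sales"}}},  # one result per category
    {"$sort": {"total_sales": -1}},  # largest total first
    {"$limit": 3},  # keep the top three
]


def top_categories(collection):
    """Run the pipeline and return the results as a list of dictionaries."""
    return list(collection.aggregate(pipeline))
```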

8. What is sharding in MongoDB? How does it differ from replication?
- In MongoDB, sharding distributes a large dataset across multiple machines (shards) to handle large volumes of data and high throughput, while replication creates copies of the data on multiple servers for redundancy and high availability. Sharding is for scaling horizontally, while replication is for redundancy and fault tolerance.
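
The difference can be shown with a toy model in plain Python. This is only a conceptual sketch, not MongoDB's actual routing or replication logic: sharding places each document on exactly one partition, while replication gives every member a full copy.

```python
import hashlib


def shard_for(key, num_shards):
    """Pick a shard from a hash of the shard key (toy routing rule)."""
    digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def shard(documents, num_shards, key_field):
    """Sharding: each document is stored on exactly one shard."""
    shards = [[] for _ in range(num_shards)]
    for doc in documents:
        shards[shard_for(doc[key_field], num_shards)].append(doc)
    return shards


def replicate(documents, num_copies):
    """Replication: every member holds a full copy of the data."""
    return [list(documents) for _ in range(num_copies)]


docs = [{"order_id": i} for i in range(10)]
shards = shard(docs, 3, "order_id")
replicas = replicate(docs, 3)
print(sum(len(s) for s in shards))  # 10: each document counted once across shards
print([len(r) for r in replicas])   # [10, 10, 10]: each replica holds everything
```

In a production cluster the two are combined: each shard is itself a replica set.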

9. What is PyMongo, and why is it used?
- PyMongo is the official MongoDB driver for synchronous Python applications. It is used to connect to a MongoDB deployment from Python and to perform CRUD (Create, Read, Update, Delete) operations, run aggregation pipelines, and manage indexes, with documents represented as ordinary Python dictionaries.
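
A minimal sketch of the four CRUD calls follows. The document contents are invented for the example, and `collection` is assumed to be a PyMongo `Collection`, for instance `pymongo.MongoClient("mongodb://localhost:27017/")["demo_db"]["people"]` with hypothetical database and collection names.

```python
def crud_demo(collection):
    """Run one create, read, update, and delete against `collection`."""
    # Create: insert_one returns a result carrying the new document's _id.
    inserted_id = collection.insert_one({"name": "Ada", "role": "engineer"}).inserted_id
    # Read: find_one returns a single matching document (or None).
    found = collection.find_one({"_id": inserted_id})
    # Update: $set changes only the listed fields.
    collection.update_one({"_id": inserted_id}, {"$set": {"role": "manager"}})
    # Delete: deleted_count reports how many documents were removed.
    deleted = collection.delete_one({"_id": inserted_id}).deleted_count
    return found, deleted
```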

10. What are the ACID properties in the context of MongoDB transactions?
- In the context of MongoDB transactions, ACID stands for Atomicity, Consistency, Isolation, and Durability, a set of properties that guarantee the reliability and integrity of database transactions. MongoDB supports multi-document ACID transactions, enabling operations that span multiple collections and documents to adhere to these properties.
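
A multi-document transaction in PyMongo can be sketched as below. The account schema (`_id`, `balance`) is hypothetical; `client` is assumed to be a `pymongo.MongoClient` connected to a replica set or sharded cluster (transactions are not available on a standalone server), and `accounts` a `Collection` from that client.

```python
def transfer(client, accounts, from_id, to_id, amount):
    """Move `amount` between two account documents as one atomic unit."""
    with client.start_session() as session:
        # The transaction commits on normal exit and aborts if an exception escapes.
        with session.start_transaction():
            accounts.update_one(
                {"_id": from_id}, {"$inc": {"balance": -amount}}, session=session
            )
            accounts.update_one(
                {"_id": to_id}, {"$inc": {"balance": amount}}, session=session
            )
```

Either both balances change or neither does, which is the atomicity guarantee applied across two documents.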

11. What is the purpose of MongoDB’s explain() function?
- MongoDB's explain() function provides detailed insights into how MongoDB executes a query or aggregation pipeline. Its primary purpose is to help users understand and optimize the performance of their operations by revealing the query plan chosen by the optimizer.
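
One common use of the output is checking whether a query falls back to a full collection scan. The helper below is a sketch that walks an explain result looking for a `COLLSCAN` stage; the sample dictionary is hand-written to resemble real output and is not captured from a server.

```python
def plan_stages(plan):
    """Collect every "stage" name found anywhere in a (nested) query plan."""
    stages = []
    if isinstance(plan, dict):
        if "stage" in plan:
            stages.append(plan["stage"])
        for value in plan.values():
            stages.extend(plan_stages(value))
    elif isinstance(plan, list):
        for item in plan:
            stages.extend(plan_stages(item))
    return stages


def uses_collection_scan(explain_output):
    """True if the winning plan contains a COLLSCAN (full collection scan) stage."""
    winning = explain_output["queryPlanner"]["winningPlan"]
    return "COLLSCAN" in plan_stages(winning)


# Hand-written sample shaped like explain() output, for illustration only.
sample = {
    "queryPlanner": {
        "winningPlan": {
            "stage": "FETCH",
            "inputStage": {"stage": "IXSCAN", "indexName": "Category_1"},
        }
    }
}
print(uses_collection_scan(sample))  # False: the plan uses an index scan
```

With PyMongo, real input for this helper would come from something like `collection.find({"Category": "Furniture"}).explain()`, where the filter is again just an example.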

12. How does MongoDB handle schema validation?
- MongoDB offers schema validation to enforce a specific structure and data types for documents within a collection, despite its flexible schema model. This is achieved through the use of JSON Schema.
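
A validator is attached when the collection is created (or later with the `collMod` command). The sketch below uses the `$jsonSchema` operator; the collection name and the two fields are assumptions for illustration, and `db` is assumed to be a PyMongo `Database`.

```python
# Require an "Order ID" string and a non-negative numeric "Sales" field.
order_validator = {
    "$jsonSchema": {
        "bsonType": "object",
        "required": ["Order ID", "Sales"],
        "properties": {
            "Order ID": {"bsonType": "string"},
            "Sales": {"bsonType": ["int", "long", "double"], "minimum": 0},
        },
    }
}


def create_validated_collection(db, name="validated_orders"):
    """Create a collection whose inserts and updates are checked against the schema."""
    return db.create_collection(name, validator=order_validator)
```

Documents that violate the schema are rejected on insert or update; fields not mentioned in the schema are still allowed, so the collection stays flexible where no rule applies.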

13. What is the difference between a primary and a secondary node in a replica set?
- In a MongoDB replica set, the primary node is the only member that accepts write operations, while secondary nodes replicate data from the primary and can serve reads when the client's read preference allows it. The primary acts as the source of truth, and the secondaries maintain copies of the data, ensuring high availability and data redundancy. If the primary becomes unavailable, the remaining members elect one of the secondaries as the new primary.

14. What security mechanisms does MongoDB provide for data protection?
- Encryption

At Rest: AES-256 encryption (default in Atlas).

In Transit: TLS/SSL protects data between client and server.

- Access Control

Role-Based Access Control (RBAC).

Supports SCRAM, LDAP, x.509, Kerberos, AWS IAM (Atlas).

- Network Security

IP whitelisting, VPC peering, private endpoints.

- Field-Level Encryption

Encrypt specific fields client-side with your own keys.

- Auditing & Monitoring

Logs for access and changes; alerts for suspicious activity.

15. Explain the concept of embedded documents and when they should be used.
- Embedded documents are documents nested inside other documents. They let you store related data together in a single document, similar to a JSON object within another.

When to Use Embedded Documents

Use them when:

Data is closely related (e.g., user profile + address).

You frequently read the data together.

The embedded data doesn’t grow unbounded.
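
The criteria above can be illustrated with plain dictionaries. All names and values are invented for the example: an address is small and read together with the user, so it is embedded, while orders can grow without limit, so they are kept in a separate collection and reference the user.

```python
# Embedded: the address lives inside the user document.
user_embedded = {
    "_id": "u1",
    "name": "Ada",
    "address": {"street": "12 Main St", "city": "Springfield"},
}

# Referenced: orders are separate documents that point back to the user by id.
user_referenced = {"_id": "u1", "name": "Ada"}
orders = [
    {"_id": "o1", "user_id": "u1", "total": 40.0},
    {"_id": "o2", "user_id": "u1", "total": 15.5},
]

# Dot notation reaches into embedded documents in a query filter,
# e.g. collection.find(city_filter) on a users collection.
city_filter = {"address.city": "Springfield"}
```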

16. What is the purpose of MongoDB’s $lookup stage in aggregation?
- The purpose of MongoDB's $lookup aggregation stage is to perform a left outer join between two collections within the same database. It allows for the combination of documents from one collection (the "input" collection) with related documents from another collection (the "joined" collection) based on a specified condition, typically a shared field.
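
A sketch of a `$lookup` stage in PyMongo follows. The collection name `customers` and the field names are hypothetical, and `orders_collection` is assumed to be a PyMongo `Collection` holding the input documents.

```python
lookup_pipeline = [
    {
        "$lookup": {
            "from": "customers",          # collection to join with
            "localField": "customer_id",  # field in the input (orders) documents
            "foreignField": "_id",        # field in the customers documents
            "as": "customer",             # output array field holding the matches
        }
    },
]


def orders_with_customers(orders_collection):
    """Return each order with its matching customer documents attached."""
    return list(orders_collection.aggregate(lookup_pipeline))
```

Because the join is a left outer join, an order with no matching customer still appears in the output, with `customer` set to an empty array.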

17. What are some common use cases for MongoDB?
- MongoDB is a versatile database used in a wide range of applications. Common use cases include: Content Management Systems (CMS), e-commerce platforms, real-time analytics, IoT applications, mobile backends, social media platforms, and gaming applications. It's also frequently employed for data warehousing, log management, and as a data hub or single view for various applications.

18. What are the advantages of using MongoDB for horizontal scaling?
- MongoDB offers significant advantages for horizontal scaling, primarily achieved through its sharding architecture. These advantages include:

- Increased Capacity:

Sharding distributes data and load across multiple servers (shards), allowing the database to handle datasets exceeding the capacity of a single machine and preventing bottlenecks.

- High Availability:

By replicating data across different servers within replica sets and distributing it across multiple shards, MongoDB enhances fault tolerance. If one server or shard fails, other replicas or shards can take over, ensuring continuous operation and high availability.

- Improved Performance:

Horizontal scaling distributes read and write operations across multiple servers, enabling parallel processing of queries and improving overall read and write throughput, particularly for large datasets and high-traffic applications.

- Cost Efficiency:

Sharding allows for the use of commodity hardware to manage large datasets and high traffic, which can be more cost-effective than scaling vertically by upgrading to more powerful, expensive single servers.

- Flexibility and Adaptability:

MongoDB's flexible document schema complements horizontal scaling, as it allows for easier adaptation to evolving data structures and application requirements without complex schema migrations across multiple shards.

19. How do MongoDB transactions differ from SQL transactions?
- SQL Transactions:

ACID by default.

Multi-statement, multi-table transactions are standard.

Strong consistency and rollback support across tables.

- MongoDB Transactions:

Multi-document ACID transactions supported since v4.0 on replica sets and v4.2 on sharded clusters.

Best for multi-document operations, but slower than single-document writes.

Typically not needed for many use cases due to MongoDB’s document model.

Use multi-document transactions only when necessary; single-document writes are already atomic.

20. What are the main differences between capped collections and regular collections?
- Capped collections and regular collections differ in behavior and intended use. Capped collections are fixed-size and preserve insertion order, automatically overwriting the oldest documents when full, while regular collections grow dynamically and have no size limit or guaranteed insertion order.
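
The overwrite behavior resembles a fixed-length queue, which the first part of the sketch below shows with a plain `deque` as an analogy only. The second part creates an actual capped collection; the name and limits are arbitrary examples, and `db` is assumed to be a PyMongo `Database`.

```python
from collections import deque

# Analogy: a bounded deque drops the oldest items once it is full.
recent = deque(maxlen=3)
for event_number in range(5):
    recent.append(event_number)
print(list(recent))  # [2, 3, 4]


def create_capped_log(db, name="recent_events", size_bytes=1024 * 1024, max_docs=1000):
    """Create a capped collection limited by size in bytes and by document count."""
    return db.create_collection(name, capped=True, size=size_bytes, max=max_docs)
```

The `size` limit in bytes is required for a capped collection; `max` is an optional additional cap on the number of documents.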

21. What is the purpose of the $match stage in MongoDB’s aggregation pipeline?
- The $match stage in MongoDB's aggregation pipeline is used to filter documents based on a specified query, effectively selecting only the documents that meet certain criteria. It acts like a "find" operation within the pipeline, allowing you to narrow down the dataset before further processing. This is crucial for performance, as it reduces the number of documents that subsequent stages need to handle.
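
Since `$match` takes the same filter syntax as `find`, one filter document can serve both. The sketch below counts matching documents both ways; the field names and threshold are assumptions, and `collection` is assumed to be a PyMongo `Collection`.

```python
# One filter document, usable in find-style calls and in $match.
query = {"Region": "West", "Sales": {"$gte": 500}}


def count_with_find(collection):
    """Count matches using the regular query API."""
    return collection.count_documents(query)


def count_with_match(collection):
    """Count matches using an aggregation pipeline that starts with $match."""
    result = list(collection.aggregate([{"$match": query}, {"$count": "n"}]))
    # $count emits no document at all when nothing matched.
    return result[0]["n"] if result else 0
```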

22. How can you secure access to a MongoDB database?
- Ways to secure MongoDB access:
Enable Authentication – Use strong credentials and roles.

Restrict Network Access – Allow trusted IPs only.

Use Encryption – TLS for transit, disk encryption at rest.

Least Privilege – Grant minimal permissions to users.

Enable Auditing – Track access and changes.
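
On the client side, the authentication and encryption points above translate into connection options. The sketch below gathers them from environment variables so that credentials stay out of the source; the variable names (`MONGO_HOST`, `MONGO_USER`, `MONGO_PASSWORD`) are hypothetical.

```python
import os


def secure_client_options():
    """Collect MongoClient keyword arguments for an authenticated TLS connection."""
    return {
        "host": os.environ.get("MONGO_HOST", "localhost"),
        "username": os.environ["MONGO_USER"],      # fails loudly if not set
        "password": os.environ["MONGO_PASSWORD"],  # fails loudly if not set
        "authSource": "admin",  # database that stores the user's credentials
        "tls": True,            # encrypt traffic between client and server
    }
```

The resulting dictionary can be unpacked into the driver, as in `pymongo.MongoClient(**secure_client_options())`.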

23. What is MongoDB’s WiredTiger storage engine, and why is it important?
- MongoDB's WiredTiger storage engine is a high-performance, scalable storage engine that has been the default for new MongoDB deployments since version 3.2. It's important because it offers document-level concurrency control, data compression, and (in MongoDB Enterprise and Atlas) encryption at rest, none of which were available with the previous MMAPv1 storage engine. WiredTiger's B-tree based architecture, together with multiversion concurrency control (MVCC), which lets readers and writers work without blocking each other, makes it well-suited to both read-intensive and write-heavy workloads and to efficient management of larger datasets.


Practical questions

In [None]:
!pip install pymongo



In [None]:
# 1.  Write a Python script to load the Superstore dataset from a CSV file into MongoDB


import csv
import pymongo
import os

# === Configuration ===
CSV_FILE_PATH = 'superstore.csv'  # Path to your Superstore CSV
MONGO_URI = 'mongodb://localhost:27017/'  # Change if using remote MongoDB
DB_NAME = 'superstore_db'
COLLECTION_NAME = 'orders'

def connect_to_mongodb(uri, db_name, collection_name):
    client = pymongo.MongoClient(uri)
    db = client[db_name]
    collection = db[collection_name]
    return collection

def load_csv_to_dicts(file_path):
    with open(file_path, mode='r', encoding='utf-8-sig') as file:
        reader = csv.DictReader(file)
        data = [row for row in reader]
    return data

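def clean_data(data):
    # Convert numeric strings to int or float, including negative values
    for record in data:
        for key, value in record.items():
            if not isinstance(value, str):
                continue  # DictReader can yield None for missing fields
            text = value.strip()
            digits = text[1:] if text.startswith('-') else text
            if digits.isdecimal():
                record[key] = int(text)
            elif digits.replace('.', '', 1).isdecimal():
                record[key] = float(text)
    return data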

def insert_data(collection, data):
    if data:
        result = collection.insert_many(data)
        print(f'Inserted {len(result.inserted_ids)} documents into MongoDB.')
    else:
        print('No data to insert.')

def main():
    if not os.path.exists(CSV_FILE_PATH):
        print(f"CSV file not found at path: {CSV_FILE_PATH}")
        return

    print('Loading CSV data...')
    data = load_csv_to_dicts(CSV_FILE_PATH)

    print('Cleaning data...')
    data = clean_data(data)

    print('Connecting to MongoDB...')
    collection = connect_to_mongodb(MONGO_URI, DB_NAME, COLLECTION_NAME)

    print('Inserting data into MongoDB...')
    insert_data(collection, data)

    print('Done.')

if __name__ == '__main__':
    main()



CSV file not found at path: superstore.csv
