Q1. Explain the difference between AWS Regions, Availability Zones, and Edge Locations. Why is this important for data analysis and latency-sensitive applications?

Ans1. Difference Between AWS Regions, Availability Zones, and Edge Locations
1. AWS Regions
Definition:
A Region is a geographically distinct location where AWS has data centers. Each Region contains multiple Availability Zones.

Examples: us-east-1 (N. Virginia), ap-south-1 (Mumbai), eu-west-1 (Ireland).

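2. Availability Zones (AZs)
Definition:
An Availability Zone is one or more discrete data centers within a Region, each with its own power, cooling, and networking. The AZs of a Region are linked by low-latency connections, so a workload can be spread across them and survive the loss of a single AZ.

3. Edge Locations
Definition:
An Edge Location is a site, separate from the Regions, that services such as Amazon CloudFront use to cache and serve content close to end users.

Why This Matters for Data Analysis and Latency-Sensitive Applications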
| Aspect                          | Explanation                                                           |
| ------------------------------- | --------------------------------------------------------------------- |
| **Latency**                     | Edge Locations reduce latency by serving data closer to users.        |
| **High Availability**           | Multiple AZs provide fault tolerance by isolating failures.           |
| **Data Residency & Compliance** | Regions allow you to keep data within specific geographic boundaries. |
| **Performance**                 | Selecting the right Region and AZ optimizes speed and reliability.    |
| **Disaster Recovery**           | AZs and Regions allow for backup and failover strategies.             |



Q2. Using the AWS CLI, list all available AWS regions. Share the command used and the output

Ans2.
aws ec2 describe-regions --query "Regions[].RegionName" --output text
Explanation:
aws ec2 describe-regions: Lists the Regions enabled for your account (add --all-regions to include Regions that are not enabled).
--query "Regions[].RegionName": Filters the output to show only the region names.
--output text: Displays the output as plain text for easier reading.

Output (sample; the exact list depends on which Regions are enabled for the account):
us-east-1 us-east-2 us-west-1 us-west-2 af-south-1 ap-east-1 ap-south-1 ap-northeast-1 ap-northeast-2 ap-northeast-3 ap-southeast-1 ap-southeast-2 ca-central-1 eu-central-1 eu-west-1 eu-west-2 eu-west-3 eu-north-1 eu-south-1 me-south-1 sa-east-1
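If the region list is needed inside a script, the text output can simply be split on whitespace. The following is a minimal sketch, assuming the CLI output was captured as a string; `parse_regions` and `group_by_geography` are hypothetical helper names, and grouping by the prefix before the first hyphen is only an illustration.

```python
from collections import defaultdict

# A shortened sample of the command output shown above
SAMPLE_OUTPUT = "us-east-1 us-east-2 us-west-1 ap-south-1 eu-west-1 eu-north-1 sa-east-1"


def parse_regions(text_output):
    """Split the whitespace- or tab-separated CLI text output into region names."""
    return text_output.split()


def group_by_geography(regions):
    """Group region names by the prefix before the first hyphen (us, eu, ap, ...)."""
    groups = defaultdict(list)
    for name in regions:
        groups[name.split("-", 1)[0]].append(name)
    return dict(groups)


if __name__ == "__main__":
    grouped = group_by_geography(parse_regions(SAMPLE_OUTPUT))
    for prefix, names in sorted(grouped.items()):
        print(prefix, names)
```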


Q3.  Create a new IAM user with least privilege access to Amazon S3. Share your attached policies (JSON or

Ans3. To create a new IAM user with least privilege access to Amazon S3, grant only the permissions needed for the specific S3 actions the user must perform, on the specific buckets involved. For example, read-only access to one bucket, or limited write access to certain buckets.

Example: IAM Policy for Least Privilege Access to S3 (Read-Only)
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::example-bucket",
                "arn:aws:s3:::example-bucket/*"
            ]
        }
    ]
}
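The policy above lists both actions against both resources. Since `s3:ListBucket` applies to the bucket ARN and `s3:GetObject` to the object ARNs, a slightly tighter variant pairs each action with its own resource. A minimal sketch of generating such a document, assuming a single bucket; `build_read_only_policy` and the `Sid` values are hypothetical names chosen for illustration.

```python
import json


def build_read_only_policy(bucket_name):
    """Return a read-only S3 policy document scoped to a single bucket."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Bucket-level action: listing the keys in the bucket
                "Sid": "ListBucket",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [bucket_arn],
            },
            {
                # Object-level action: reading the objects themselves
                "Sid": "ReadObjects",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"{bucket_arn}/*"],
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_read_only_policy("example-bucket"), indent=4))
```

The printed JSON can be saved to a file and attached to the user as an inline or managed policy.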


Q4. Compare different Amazon S3 storage classes (Standard, Intelligent-Tiering, Glacier). When should each be used in data analytics workflows?


Ans4.

| Storage Class              | Description                                                                                    | Use Case in Data Analytics Workflows                                                                                                                          | Cost & Access Characteristics                                      |
| -------------------------- | ---------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------ |
| **S3 Standard**            | High durability, availability, and low latency. Designed for frequently accessed data.         | Use for active datasets, real-time analytics, and frequently queried data. Ideal for storing raw data, intermediate results, or data that needs quick access. | Higher cost, immediate access, no retrieval delay.                 |
| **S3 Intelligent-Tiering** | Automatically moves data between frequent and infrequent access tiers based on usage patterns. | Best when access patterns are unknown or unpredictable. Useful for datasets with varying or changing access frequency in analytics pipelines.                 | Frequent-access tier priced like Standard, plus a small per-object monitoring fee; saves cost by auto-tiering. |
| **S3 Glacier**             | Low-cost archival storage with retrieval times from minutes to hours.                          | Use for long-term data archiving, historical data, or backups that are rarely accessed but must be retained for compliance or future analysis.                | Very low cost, but retrieval has latency and possible extra fees.  |



Q5.  Create an S3 bucket and upload a sample dataset (CSV or JSON). Enable versioning and show at least two

Ans5. Here’s a step-by-step guide to create an S3 bucket, upload a sample dataset with versioning enabled, and demonstrate at least two versions of a file using AWS CLI.

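Step 1: Create the S3 Bucket
aws s3api create-bucket --bucket my-sample-bucket-12345 --region us-east-1

Step 2: Enable Versioning
aws s3api put-bucket-versioning --bucket my-sample-bucket-12345 --versioning-configuration Status=Enabled

Step 3: Prepare a Sample Dataset (e.g., data.csv)
id,name,age
1,Alice,30
2,Bob,25

Step 4: Upload the File, Edit It, and Upload It Again
aws s3 cp data.csv s3://my-sample-bucket-12345/data.csv
(edit data.csv locally, for example by adding a row, then upload it again under the same key)
aws s3 cp data.csv s3://my-sample-bucket-12345/data.csv

Step 5: List the Object Versions
aws s3api list-object-versions --bucket my-sample-bucket-12345 --prefix data.csv

Sample output (version IDs, ETags, sizes, and timestamps are placeholders):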
{
    "Versions": [
        {
            "ETag": "\"etagvalue2\"",
            "VersionId": "version-id-2",
            "IsLatest": true,
            "Key": "data.csv",
            "LastModified": "2025-05-28T10:00:00.000Z",
            "Size": 56,
            "StorageClass": "STANDARD",
            "Owner": { ... }
        },
        {
            "ETag": "\"etagvalue1\"",
            "VersionId": "version-id-1",
            "IsLatest": false,
            "Key": "data.csv",
            "LastModified": "2025-05-28T09:30:00.000Z",
            "Size": 45,
            "StorageClass": "STANDARD",
            "Owner": { ... }
        }
    ]
}
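A version listing like the one above can also be inspected programmatically. The following is a minimal sketch, assuming the JSON has already been loaded into a Python dict; it reuses the placeholder values from the listing (with the elided `Owner` field left out), and `version_history` and `latest_version_id` are hypothetical helper names.

```python
from datetime import datetime

# Placeholder values copied from the sample listing above
LISTING = {
    "Versions": [
        {"VersionId": "version-id-2", "IsLatest": True, "Key": "data.csv",
         "LastModified": "2025-05-28T10:00:00.000Z", "Size": 56},
        {"VersionId": "version-id-1", "IsLatest": False, "Key": "data.csv",
         "LastModified": "2025-05-28T09:30:00.000Z", "Size": 45},
    ]
}


def parse_time(stamp):
    """Parse the timestamp format used in the listing."""
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%fZ")


def version_history(listing, key):
    """Return all versions of `key`, oldest first."""
    versions = [v for v in listing.get("Versions", []) if v["Key"] == key]
    return sorted(versions, key=lambda v: parse_time(v["LastModified"]))


def latest_version_id(listing, key):
    """Return the VersionId flagged as latest for `key`, or None if absent."""
    for version in listing.get("Versions", []):
        if version["Key"] == key and version["IsLatest"]:
            return version["VersionId"]
    return None


if __name__ == "__main__":
    for v in version_history(LISTING, "data.csv"):
        print(v["VersionId"], v["LastModified"], v["Size"])
```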



Q6.  Write and apply a lifecycle policy to move files to Glacier after 30 days and delete them after 90. Share the

Ans6. Here is a sample S3 lifecycle policy in JSON that moves objects to Glacier after 30 days and deletes them after 90 days:

{
  "Rules": [
    {
      "ID": "MoveToGlacierAndDelete",
      "Status": "Enabled",
      "Filter": {
        "Prefix": ""
      },
      "Transitions": [
        {
          "Days": 30,
          "StorageClass": "GLACIER"
        }
      ],
      "Expiration": {
        "Days": 90
      },
      "NoncurrentVersionExpiration": {
        "NoncurrentDays": 90
      }
    }
  ]
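}

To apply the policy, save it as lifecycle.json and run:
aws s3api put-bucket-lifecycle-configuration --bucket my-sample-bucket-12345 --lifecycle-configuration file://lifecycle.json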
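To reason about what this rule does to an object of a given age, the transition and expiration thresholds can be modelled in a few lines. This is only a sketch of the rule's intent, not of how S3 evaluates it (S3 runs lifecycle actions asynchronously, so real transitions can lag behind the thresholds); `validate_rule` and `state_for_age` are hypothetical helpers.

```python
# The two thresholds from the lifecycle rule above
RULE = {
    "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
    "Expiration": {"Days": 90},
}


def validate_rule(rule):
    """Sanity check: every transition must be scheduled before the expiration."""
    expire_after = rule["Expiration"]["Days"]
    for transition in rule["Transitions"]:
        if transition["Days"] >= expire_after:
            raise ValueError("transition must come before expiration")


def state_for_age(rule, age_days, initial_class="STANDARD"):
    """Return the storage class an object of this age should be in, or 'EXPIRED'."""
    if age_days >= rule["Expiration"]["Days"]:
        return "EXPIRED"
    current = initial_class
    for transition in sorted(rule["Transitions"], key=lambda t: t["Days"]):
        if age_days >= transition["Days"]:
            current = transition["StorageClass"]
    return current


if __name__ == "__main__":
    validate_rule(RULE)
    for age in (0, 29, 30, 89, 90):
        print(age, state_for_age(RULE, age))
```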


Q7.  Compare RDS, DynamoDB, and Redshift for use in different stages of a data pipeline. Give one use case for each

Ans7. Here’s a concise comparison of Amazon RDS, DynamoDB, and Redshift focused on their roles in different stages of a data pipeline, along with a use case for each:

| Service        | Type                          | Best Used For in Data Pipeline                                                                                                        | Key Features                                                                             | Example Use Case                                                                          |
| -------------- | ----------------------------- | ------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------- |
| **Amazon RDS** | Relational Database (SQL)     | **Transactional and Operational Data Storage** (OLTP) — Storing structured data, running complex queries, and supporting applications | Managed SQL databases (MySQL, PostgreSQL, etc.), ACID compliance, supports complex joins | Store user profiles and transactional data for real-time app usage                        |
| **DynamoDB**   | NoSQL Key-Value / Document DB | **High-Throughput, Low-Latency Data Ingestion & Serving** — Fast reads/writes at scale for semi-structured data                       | Serverless, fully managed, single-digit millisecond latency, scales automatically        | Collect and serve real-time clickstream or IoT data                                       |
| **Redshift**   | Data Warehouse (Analytical)   | **Batch Analytics and Reporting** — Large-scale analytical queries on aggregated and historical data                                  | Columnar storage, massively parallel processing, SQL querying optimized for analytics    | Run complex queries and generate business intelligence reports from aggregated sales data |


Q8.  Create a DynamoDB table and insert 3 records manually. Then write a Lambda function that adds records when triggered by S3 uploads

Ans8.
Here’s a step-by-step guide to:

1. Create a DynamoDB table
2. Insert 3 records manually
3. Write an AWS Lambda function that triggers on S3 uploads to add records to the DynamoDB table.

aws dynamodb create-table \
  --table-name MyDataTable \
  --attribute-definitions AttributeName=id,AttributeType=S \
  --key-schema AttributeName=id,KeyType=HASH \
  --provisioned-throughput ReadCapacityUnits=5,WriteCapacityUnits=5
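The command above covers only the table creation. The remaining two steps could be sketched as follows, assuming the table `MyDataTable` with the string key `id` defined above. The three sample records, the attribute names `size_bytes` and `uploaded_at`, and the choice of `bucket/key` as the `id` are all illustrative assumptions; the `client` parameter is injectable so the handler can be exercised without AWS access.

```python
from urllib.parse import unquote_plus

TABLE_NAME = "MyDataTable"  # table created by the CLI command above

# Three hypothetical records in DynamoDB's attribute-value format
MANUAL_ITEMS = [
    {"id": {"S": "1"}, "name": {"S": "Alice"}, "age": {"N": "30"}},
    {"id": {"S": "2"}, "name": {"S": "Bob"}, "age": {"N": "25"}},
    {"id": {"S": "3"}, "name": {"S": "Carol"}, "age": {"N": "41"}},
]


def item_from_s3_record(record):
    """Build a DynamoDB item from one record of an S3 ObjectCreated event."""
    s3_info = record["s3"]
    # Object keys arrive URL-encoded in S3 events (spaces become '+')
    key = unquote_plus(s3_info["object"]["key"])
    return {
        "id": {"S": f'{s3_info["bucket"]["name"]}/{key}'},
        "size_bytes": {"N": str(s3_info["object"]["size"])},
        "uploaded_at": {"S": record["eventTime"]},
    }


def put_items(client, items):
    """Write each item with the low-level put_item call; return the count."""
    for item in items:
        client.put_item(TableName=TABLE_NAME, Item=item)
    return len(items)


def lambda_handler(event, context, client=None):
    """Add one table record per uploaded object in the S3 event."""
    if client is None:
        import boto3  # available in the AWS Lambda Python runtime
        client = boto3.client("dynamodb")
    items = [item_from_s3_record(record) for record in event["Records"]]
    written = put_items(client, items)
    return {"statusCode": 200, "body": f"{written} record(s) written"}
```

Calling `put_items(boto3.client("dynamodb"), MANUAL_ITEMS)` once from a local session would perform the manual insert of three records; the Lambda's execution role would need `dynamodb:PutItem` on the table.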


Q9. What is serverless computing? Discuss pros and cons of using AWS Lambda for data pipelines.


Ans9. Serverless computing is a cloud computing model where the cloud provider automatically manages the infrastructure, including server provisioning, scaling, and maintenance. Developers focus solely on writing and deploying code without worrying about the underlying servers.

In this model, you pay only for the actual compute time your code consumes, rather than for pre-allocated resources.

AWS Lambda Overview
AWS Lambda is a popular serverless compute service that runs your code in response to events (e.g., file uploads, API calls) and automatically manages the compute resources.


| Pros                           | Cons                                |
| ------------------------------ | ----------------------------------- |
| No server management           | Max 15-minute execution timeout     |
| Automatic, seamless scaling    | Cold start latency                  |
| Cost-effective pay-per-use     | Limited memory (up to 10 GB) and ephemeral storage |
| Event-driven integration       | Stateless (need external state)     |
| Quick development & deployment | Complexity grows with pipeline size |


Q10. Create a Lambda function triggered by S3 uploads that logs file name, size, and timestamp to CloudWatch. Share code and a log screenshot

Ans10.
import json
import logging
from urllib.parse import unquote_plus

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def lambda_handler(event, context):
    # An S3 notification can carry several records; log each one
    for record in event['Records']:
        s3_info = record['s3']
        bucket = s3_info['bucket']['name']
        # Object keys arrive URL-encoded (e.g. spaces as '+')
        key = unquote_plus(s3_info['object']['key'])
        size = s3_info['object']['size']
        # eventTime sits on the record itself, not inside the 's3' block
        timestamp = record['eventTime']

        # Log details to CloudWatch
        logger.info(f"File uploaded: {key}")
        logger.info(f"Bucket: {bucket}")
        logger.info(f"Size (bytes): {size}")
        logger.info(f"Upload time: {timestamp}")

    return {
        'statusCode': 200,
        'body': json.dumps('Log recorded successfully')
    }



Q11. Use AWS Glue to crawl your S3 dataset, create a Data Catalog table, and run a Glue job to convert CSV data to parquet. Share job code and output location

Ans11.
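Step 1: Create an AWS Glue Crawler
Point a crawler at the S3 path that holds the CSV data and run it, so that it creates a table in the Data Catalog (here, database my_glue_db and table my_csv_table).

Step 2: Glue Job Script (CSV to Parquet)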

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Load data from Glue catalog table (created by crawler)
datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database="my_glue_db",
    table_name="my_csv_table",
    transformation_ctx="datasource0"
)

# Convert data to parquet format
datasink = glueContext.write_dynamic_frame.from_options(
    frame=datasource0,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/output-data/"},
    format="parquet",
    transformation_ctx="datasink"
)

job.commit()

Output location: s3://my-bucket/output-data/


Q12. Explain the difference between Kinesis Data Streams, Kinesis Firehose, and Kinesis Data Analytics. Provide a real-world example of how each would be used

Ans12.
Here’s a clear comparison of Amazon Kinesis Data Streams, Kinesis Data Firehose, and Kinesis Data Analytics, along with real-world examples for each:

| Service                    | Purpose                                          | Processing Type                     | Real-World Use Case                            |
| -------------------------- | ------------------------------------------------ | ----------------------------------- | ---------------------------------------------- |
| **Kinesis Data Streams**   | Real-time data capture and custom processing     | Developer-managed stream processing | Real-time clickstream or fraud detection       |
| **Kinesis Data Firehose**  | Data ingestion and delivery to storage/analytics | Fully managed data delivery         | Log aggregation pipeline to S3 & Redshift      |
| **Kinesis Data Analytics** | SQL analytics on streaming data                  | Serverless SQL querying             | Real-time IoT monitoring and anomaly detection |



Q13. What is columnar storage and how does it benefit Redshift performance for analytics workloads?

Ans13.
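Columnar storage is a method of organizing data in a database where data is stored column-by-column instead of row-by-row. Analytical queries usually read a few columns across many rows, so Redshift can skip the blocks of any column a query does not reference, and the values of a single column, being of one type, compress well. Both effects reduce disk I/O.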

| Aspect            | Row Storage          | Columnar Storage          |
| ----------------- | -------------------- | ------------------------- |
| Data Organization | Row-by-row           | Column-by-column          |
| Compression       | Less efficient       | Highly efficient          |
| Query Performance | Reads all columns    | Reads only needed columns |
| Ideal Use Case    | OLTP (transactional) | OLAP (analytics)          |
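The two rows on compression and query performance can be illustrated with plain Python lists. This is a toy model under stated assumptions, not Redshift's actual storage format: the sample table is invented, and run-length encoding merely stands in for the column encodings a warehouse might apply.

```python
from itertools import groupby

# A small invented table in row-oriented form
ROWS = [
    {"order_id": 1, "region": "EU", "amount": 120},
    {"order_id": 2, "region": "EU", "amount": 80},
    {"order_id": 3, "region": "US", "amount": 200},
    {"order_id": 4, "region": "US", "amount": 50},
]


def to_columns(rows):
    """Pivot a row-oriented table into one list per column."""
    return {name: [row[name] for row in rows] for name in rows[0]}


def run_length_encode(values):
    """Collapse runs of equal adjacent values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def total_amount_row_store(rows):
    # Every full row is visited even though only one field is needed
    return sum(row["amount"] for row in rows)


def total_amount_column_store(columns):
    # Only the 'amount' column is touched; other columns are never read
    return sum(columns["amount"])


if __name__ == "__main__":
    columns = to_columns(ROWS)
    print(total_amount_row_store(ROWS), total_amount_column_store(columns))
    print(run_length_encode(columns["region"]))
```

Both totals agree, but the columnar version never reads `order_id` or `region`, and the repetitive `region` column shrinks to two pairs when encoded.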









