# Assignment Questions


Q1. Explain the difference between AWS Regions, Availability Zones, and Edge Locations. Why is this important for
data analysis and latency-sensitive applications?

1. AWS Regions

Definition:
A Region is a geographically distinct location that contains multiple Availability Zones. AWS has Regions all over the world (e.g., US East (N. Virginia), Europe (Frankfurt), Asia Pacific (Tokyo)).

Purpose:
Regions allow you to deploy resources close to your users or data sources. Each Region is isolated from others to provide fault tolerance and data sovereignty.

Significance:
Choosing the right Region affects data residency (important for compliance), latency, and disaster recovery strategies.

2. Availability Zones (AZs)

Definition:
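An Availability Zone (AZ) consists of one or more discrete data centers within a Region, each with redundant power, networking, and connectivity. Each Region contains multiple AZs (AWS documents a minimum of three).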

Purpose:
AZs are designed for fault isolation. They are connected with low-latency, high-bandwidth networks but are physically independent to prevent a single point of failure.

Significance:
You can deploy your application across multiple AZs for high availability and fault tolerance. For example, if one AZ goes down, another can serve traffic seamlessly.

3. Edge Locations

Definition:
Edge Locations are sites used by AWS for caching content closer to end users via the AWS Content Delivery Network (CDN) service called Amazon CloudFront.

Purpose:
They store cached copies of static or dynamic content to reduce latency by serving requests from the nearest location to the user.

Significance:
Edge Locations are important for accelerating content delivery and API responses globally. They are not designed for hosting or running applications but for caching and delivering content quickly.



Q2. Using the AWS CLI, list all available AWS Regions. Share the command used and the output.




In [None]:
aws ec2 describe-regions --query "Regions[].RegionName" --output text

OUTPUT : us-east-1 us-east-2 us-west-1 us-west-2 af-south-1 ap-east-1 ap-south-1 ap-northeast-1 ap-northeast-2 ap-northeast-3 ap-southeast-1 ap-southeast-2 ca-central-1 eu-central-1 eu-west-1 eu-west-2 eu-west-3 eu-north-1 eu-south-1 me-south-1 sa-east-1
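For scripted workflows, the same listing can be sketched in Python with boto3. This is a minimal sketch that assumes boto3 is installed and credentials are configured; the helper name `list_region_names` is hypothetical, and the client is passed in so the function can also be tried with a stand-in object.

```python
def list_region_names(ec2_client):
    """Return the sorted Region names reported by EC2 DescribeRegions."""
    response = ec2_client.describe_regions()
    return sorted(region["RegionName"] for region in response["Regions"])


# Typical use (requires boto3 and configured AWS credentials):
#   import boto3
#   print(list_region_names(boto3.client("ec2", region_name="us-east-1")))
```

Like the CLI command, this returns only the Regions enabled for the account unless opt-in Regions are requested explicitly.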


Q3. Create a new IAM user with least privilege access to Amazon S3. Share your attached policies (JSON or
screenshot)





In [None]:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListAllMyBuckets"
            ],
            "Resource": "arn:aws:s3:::*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetBucketLocation",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject"
            ],
            "Resource": [
                "arn:aws:s3:::*/*"
            ]
        }
    ]
}
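The policy above grants read access to every bucket in the account. A tighter variant scopes the bucket-level and object-level statements to a single bucket. The sketch below builds such a document in Python; `build_read_only_policy` and the bucket name `example-analytics-bucket` are hypothetical and only illustrate the idea.

```python
import json


def build_read_only_policy(bucket_name):
    """Build a read-only S3 policy limited to a single bucket."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Bucket-level actions apply to the bucket ARN itself
                "Effect": "Allow",
                "Action": ["s3:GetBucketLocation", "s3:ListBucket"],
                "Resource": [bucket_arn],
            },
            {
                # Object-level actions apply to keys under the bucket
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"{bucket_arn}/*"],
            },
        ],
    }


policy = build_read_only_policy("example-analytics-bucket")  # hypothetical bucket
policy_json = json.dumps(policy, indent=4)
print(policy_json)
```

The printed JSON could then be attached to the user, for example with `aws iam put-user-policy` or through the console's policy editor.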


Q4. Compare different Amazon S3 storage classes (Standard, Intelligent-Tiering, Glacier). When should each be used in
data analytics workflows?

1. Amazon S3 Standard
Description:
The default storage class designed for frequently accessed data. Offers high durability, availability, and low latency.

Key Features:

Millisecond access latency

99.99% availability

High throughput and performance

Suitable for frequently accessed, critical data

Use Cases in Data Analytics:

Storing raw data ingested from sources for immediate processing

Storing active datasets and intermediate results that analytics jobs query repeatedly

Real-time analytics where low latency is critical

2. Amazon S3 Intelligent-Tiering
Description:
Automatically moves objects between access tiers (for example, frequent access and infrequent access) as access patterns change, without performance impact or operational overhead.

Key Features:

Automatic cost optimization based on access patterns

No retrieval fees or operational complexity

Designed for datasets with unknown or unpredictable access patterns

Adds a small per-object monitoring and automation fee on top of storage charges, but saves money once data becomes infrequently accessed

Use Cases in Data Analytics:

Datasets with unpredictable or fluctuating access patterns

Data that might be frequently accessed initially but becomes infrequently accessed later

Long-term storage where access patterns are uncertain, such as logs or intermediate analysis outputs that might be needed later

3. Amazon S3 Glacier (and Glacier Deep Archive)
Description:
Designed for long-term archival and backup with very low storage costs but higher retrieval latency.

Key Features:

Retrieval times range from milliseconds (Glacier Instant Retrieval) to minutes or hours (Glacier Flexible Retrieval) to roughly 12 hours or more (Glacier Deep Archive)

Very low storage costs

Suitable for data that is rarely accessed but must be retained

Use Cases in Data Analytics:

Archiving historical datasets that are no longer actively queried but must be retained for compliance or future analysis

Storing old raw data before deletion or for audit purposes

Backup of analytics data that is costly to regenerate but rarely accessed

Summary Table

| Storage Class | Access Frequency | Retrieval Latency | Cost | Use Case in Data Analytics |
| --- | --- | --- | --- | --- |
| S3 Standard | Frequent | Milliseconds | Highest among these | Active datasets, real-time analytics |
| S3 Intelligent-Tiering | Variable/Unpredictable | Milliseconds (frequent tier) | Moderate | Datasets with unknown access patterns |
| S3 Glacier | Rare/Archive | Minutes to hours | Lowest | Long-term archival, historical data storage |

Q6. Write and apply a lifecycle policy to move files to Glacier after 30 days and delete them after 90. Share the
policy JSON or screenshot.



In [None]:
{
  "Rules": [
    {
      "ID": "MoveToGlacierAndDelete",
      "Status": "Enabled",
      "Filter": {
        "Prefix": ""
      },
      "Transitions": [
        {
          "Days": 30,
          "StorageClass": "GLACIER"
        }
      ],
      "Expiration": {
        "Days": 90
      },
      "NoncurrentVersionExpiration": {
        "NoncurrentDays": 90
      }
    }
  ]
}
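One way to apply this configuration is through boto3's `put_bucket_lifecycle_configuration`. The sketch below assumes boto3 and credentials are available; the function names and the bucket name are hypothetical, and the sanity check on the day counts is an addition for illustration.

```python
def build_lifecycle_configuration(transition_days=30, expiration_days=90):
    """Build a lifecycle configuration: transition to Glacier, then expire."""
    if transition_days >= expiration_days:
        raise ValueError("objects must transition before they expire")
    return {
        "Rules": [
            {
                "ID": "MoveToGlacierAndDelete",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # empty prefix matches every object
                "Transitions": [
                    {"Days": transition_days, "StorageClass": "GLACIER"}
                ],
                "Expiration": {"Days": expiration_days},
                "NoncurrentVersionExpiration": {"NoncurrentDays": expiration_days},
            }
        ]
    }


def apply_lifecycle(s3_client, bucket_name, configuration):
    """Send the lifecycle configuration to S3 for the given bucket."""
    return s3_client.put_bucket_lifecycle_configuration(
        Bucket=bucket_name,
        LifecycleConfiguration=configuration,
    )


# Typical use (requires boto3 and configured AWS credentials):
#   import boto3
#   apply_lifecycle(boto3.client("s3"), "example-analytics-bucket",
#                   build_lifecycle_configuration())
```

The CLI equivalent is `aws s3api put-bucket-lifecycle-configuration` with the JSON saved to a file.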


Q7. Compare RDS, DynamoDB, and Redshift for use in different stages of a data pipeline. Give one use case for
each.

1. Amazon RDS (Relational Database Service)
Type: Managed relational database (supports engines like MySQL, PostgreSQL, SQL Server, etc.)

Best for: OLTP (Online Transaction Processing), structured data, complex queries with relational joins

Data Pipeline Stage: Data ingestion & transactional storage
RDS is great for capturing and storing structured data in real time from applications or data sources before moving it further down the pipeline.

Use Case:
Storing user profile data and transactions in an e-commerce app before analytics.
Example: Orders, payments, and customer data stored in RDS to ensure ACID compliance and support complex transactional queries.

2. Amazon DynamoDB
Type: Fully managed NoSQL key-value and document database

Best for: Highly scalable, low-latency applications with flexible schema, event-driven data ingestion

Data Pipeline Stage: Real-time data ingestion and quick lookups
Ideal for fast ingestion of semi-structured or unstructured data, with millisecond response times.

Use Case:
Capturing real-time IoT sensor data streams.
Example: Storing time-series sensor data that needs rapid ingestion and fast retrieval for dashboarding or quick anomaly detection.

3. Amazon Redshift
Type: Fully managed data warehouse (columnar storage, optimized for OLAP workloads)

Best for: Large-scale analytics, complex queries, data aggregation, and reporting

Data Pipeline Stage: Data storage & analytics
Used as the central analytics platform for running complex, large-scale queries on structured data aggregated from various sources.

Use Case:
Analyzing customer behavior and sales trends using aggregated data from RDS and DynamoDB.
Example: Running complex BI queries and reports on historical sales data, customer segmentation, and marketing effectiveness.

Q8. Create a DynamoDB table and insert 3 records manually. Then write a Lambda function that adds records
when triggered by S3 uploads.

Step 1: Create a DynamoDB table
Let's create a simple table called Uploads with UploadId as the partition key.

Using AWS CLI:


In [None]:
aws dynamodb create-table \
    --table-name Uploads \
    --attribute-definitions AttributeName=UploadId,AttributeType=S \
    --key-schema AttributeName=UploadId,KeyType=HASH \
    --provisioned-throughput ReadCapacityUnits=5,WriteCapacityUnits=5


Step 2: Insert 3 records manually
Using AWS CLI put-item command to add three sample items:

In [None]:
aws dynamodb put-item --table-name Uploads --item '{"UploadId": {"S": "upload1"}, "FileName": {"S": "file1.csv"}, "UploadTime": {"S": "2025-06-22T10:00:00Z"}}'

aws dynamodb put-item --table-name Uploads --item '{"UploadId": {"S": "upload2"}, "FileName": {"S": "file2.csv"}, "UploadTime": {"S": "2025-06-22T11:00:00Z"}}'

aws dynamodb put-item --table-name Uploads --item '{"UploadId": {"S": "upload3"}, "FileName": {"S": "file3.csv"}, "UploadTime": {"S": "2025-06-22T12:00:00Z"}}'
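The same three records could also be written from Python. A minimal sketch, assuming boto3 and credentials are available; `insert_uploads` is a hypothetical helper, and with the resource-level `Table` API the items use plain Python types rather than the `{"S": ...}` wrappers the CLI needs.

```python
SAMPLE_UPLOADS = [
    {"UploadId": "upload1", "FileName": "file1.csv", "UploadTime": "2025-06-22T10:00:00Z"},
    {"UploadId": "upload2", "FileName": "file2.csv", "UploadTime": "2025-06-22T11:00:00Z"},
    {"UploadId": "upload3", "FileName": "file3.csv", "UploadTime": "2025-06-22T12:00:00Z"},
]


def insert_uploads(table, items):
    """Write each item with put_item and return how many were written."""
    for item in items:
        table.put_item(Item=item)
    return len(items)


# Typical use (requires boto3 and configured AWS credentials):
#   import boto3
#   insert_uploads(boto3.resource("dynamodb").Table("Uploads"), SAMPLE_UPLOADS)
```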


Step 3: Write a Lambda function triggered by S3 uploads to add records
Lambda Function (Python)
This function will trigger on S3 ObjectCreated events and insert a record into the DynamoDB table.

In [None]:
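import json
import uuid
from datetime import datetime, timezone
from urllib.parse import unquote_plus

import boto3

# Create the table handle once so warm invocations reuse it
dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('Uploads')

def lambda_handler(event, context):
    # One S3 notification can carry several records
    for record in event['Records']:
        s3_info = record['s3']
        bucket_name = s3_info['bucket']['name']
        # Object keys arrive URL-encoded in S3 events (spaces become '+')
        object_key = unquote_plus(s3_info['object']['key'])

        # Random ID used as the partition key
        upload_id = str(uuid.uuid4())

        # Timezone-aware UTC time (datetime.utcnow() is deprecated since Python 3.12)
        upload_time = datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')

        table.put_item(
            Item={
                'UploadId': upload_id,
                'FileName': object_key,
                'BucketName': bucket_name,
                'UploadTime': upload_time
            }
        )

    return {
        'statusCode': 200,
        'body': json.dumps('Upload records added to DynamoDB')
    }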
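Before wiring up the trigger, it can help to exercise the event parsing locally. The sketch below uses a trimmed event shaped like an S3 ObjectCreated notification; the bucket and key values are invented, and `extract_uploads` is a hypothetical helper that reads the same fields as the handler, decoding URL-encoded keys along the way, without touching DynamoDB.

```python
from urllib.parse import unquote_plus

# Trimmed sample of an S3 ObjectCreated event; values are invented for illustration
SAMPLE_EVENT = {
    "Records": [
        {
            "eventName": "ObjectCreated:Put",
            "eventTime": "2025-06-22T10:00:00.000Z",
            "s3": {
                "bucket": {"name": "example-upload-bucket"},
                "object": {"key": "reports/sales+q2.csv", "size": 1024},
            },
        }
    ]
}


def extract_uploads(event):
    """Return (bucket, key) pairs from an S3 event, decoding URL-encoded keys."""
    uploads = []
    for record in event.get("Records", []):
        s3_info = record["s3"]
        key = unquote_plus(s3_info["object"]["key"])
        uploads.append((s3_info["bucket"]["name"], key))
    return uploads


print(extract_uploads(SAMPLE_EVENT))
# → [('example-upload-bucket', 'reports/sales q2.csv')]
```

The `+` in the sample key shows why decoding matters: S3 encodes a space in the object name that way in the notification.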


Q9. What is serverless computing? Discuss pros and cons of using AWS Lambda for data pipelines.

**Serverless computing:**

Serverless computing is a cloud computing execution model where the cloud provider dynamically manages the allocation and provisioning of servers.

You don’t have to manage infrastructure (no servers or VMs to provision or maintain).

You write code/functions that are triggered by events (like file uploads, database changes, API calls).

You pay only for the compute time you consume—no charges when your code isn't running.

It’s highly scalable and event-driven.

AWS Lambda is one of the most popular serverless compute services.

Pros of Using AWS Lambda for Data Pipelines
No Server Management:
You focus on your code and logic; AWS handles the underlying infrastructure.

Automatic Scalability:
Lambda automatically scales to handle the volume of events, so it adapts well to fluctuating data loads.

Cost-Effective:
Pay only for actual execution time (in milliseconds), no cost when idle.

Event-Driven Integration:
Lambda natively integrates with many AWS services (S3, DynamoDB, Kinesis, SNS, etc.), which is great for building reactive data pipelines.

Rapid Development & Deployment:
Small, modular functions mean faster iterations and easier debugging.

Built-in Fault Tolerance:
For asynchronous invocations, Lambda retries failed executions automatically (twice by default), improving reliability.

Cons of Using AWS Lambda for Data Pipelines
Execution Time Limits:
Lambda has a max execution time of 15 minutes, which is limiting for long-running ETL or complex batch processes.

Cold Start Latency:
Functions may experience latency when they’re invoked after a period of inactivity (especially for larger functions or in VPC).

Resource Limitations:
Memory is capped at 10 GB, and ephemeral /tmp storage defaults to 512 MB (configurable up to 10 GB), which can restrict certain workloads.

Complexity in Orchestration:
Managing complex workflows with multiple functions and retries requires additional services like AWS Step Functions, adding complexity.

Limited Language Support:
While AWS Lambda supports many languages, some niche or legacy languages may not be supported directly.

Monitoring and Debugging Challenges:
Distributed functions can make tracing and debugging more complex compared to monolithic applications.
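The 15-minute limit can be softened by checking the remaining time and stopping early so a later invocation can resume. A minimal sketch, assuming a hypothetical `process_item` step, an event shape with `items` and `next_index` keys, and a safety margin chosen purely for illustration; `get_remaining_time_in_millis()` is the method Lambda's Python context object provides.

```python
SAFETY_MARGIN_MS = 10_000  # assumed buffer; tune for the workload


def process_item(item):
    """Placeholder transformation (hypothetical)."""
    return item * 2


def lambda_handler(event, context):
    """Process items until done or until the time budget is nearly spent."""
    items = event["items"]
    start = event.get("next_index", 0)
    results = []
    for index in range(start, len(items)):
        if context.get_remaining_time_in_millis() < SAFETY_MARGIN_MS:
            # Report where to resume so an orchestrator can re-invoke
            return {"done": False, "next_index": index, "results": results}
        results.append(process_item(items[index]))
    return {"done": True, "next_index": len(items), "results": results}
```

An orchestrator such as AWS Step Functions could loop on `done` and pass `next_index` back in on the next invocation.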

Q10. Create a Lambda function triggered by S3 uploads that logs file name, size, and timestamp to CloudWatch.
Share code and a log screenshot.



In [None]:
import json
import logging
from urllib.parse import unquote_plus

# Lambda forwards output from the root logger to CloudWatch Logs
logger = logging.getLogger()
logger.setLevel(logging.INFO)

def lambda_handler(event, context):
    # One S3 notification can carry several records
    for record in event['Records']:
        s3 = record['s3']
        bucket_name = s3['bucket']['name']
        # Object keys arrive URL-encoded in S3 events (spaces become '+')
        object_key = unquote_plus(s3['object']['key'])
        object_size = s3['object']['size']

        # eventTime is the time S3 recorded for the upload (ISO 8601, UTC)
        timestamp = record['eventTime']

        logger.info(f"File uploaded: {object_key}")
        logger.info(f"Bucket: {bucket_name}")
        logger.info(f"Size (bytes): {object_size}")
        logger.info(f"Upload timestamp: {timestamp}")

    return {
        'statusCode': 200,
        'body': json.dumps('Logged S3 upload details successfully.')
    }
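The function only runs once the bucket is configured to invoke it. A sketch of that wiring with boto3's `put_bucket_notification_configuration` follows; the helper names, bucket name, and function ARN are placeholders, and the function must separately grant S3 permission to invoke it (for example with `aws lambda add-permission`).

```python
def build_notification_configuration(function_arn):
    """Notification config that invokes a Lambda function on every object creation."""
    return {
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": function_arn,
                "Events": ["s3:ObjectCreated:*"],
            }
        ]
    }


def attach_trigger(s3_client, bucket_name, function_arn):
    """Attach the notification configuration to the bucket."""
    return s3_client.put_bucket_notification_configuration(
        Bucket=bucket_name,
        NotificationConfiguration=build_notification_configuration(function_arn),
    )


# Typical use (requires boto3 and configured AWS credentials; ARN is a placeholder):
#   import boto3
#   attach_trigger(boto3.client("s3"), "example-upload-bucket",
#                  "arn:aws:lambda:us-east-1:123456789012:function:log-uploads")
```

Note that this call replaces the bucket's existing notification configuration rather than appending to it.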


Q12. Explain the difference between Kinesis Data Streams, Kinesis Firehose, and Kinesis Data Analytics. Provide a
real-world example of how each would be used.

1. Kinesis Data Streams
What it is:
A scalable, real-time data streaming service that allows you to capture and store data streams from multiple sources for processing.

Key Characteristics:

You build custom applications (consumers) to process data in real-time.

Data retention defaults to 24 hours and can be extended up to 365 days.

Low latency ingestion and retrieval.

Real-world Example:
Collecting and processing clickstream data from a high-traffic website.
Example: A gaming company collects player actions in real-time to power leaderboards and in-game analytics.

2. Kinesis Data Firehose
What it is:
A fully managed service for loading streaming data into data stores and analytics tools (like S3, Redshift, OpenSearch Service, Splunk).

Key Characteristics:

No need to write custom consumer applications.

Supports automatic data transformation (via AWS Lambda) and compression.

Near real-time data delivery with automatic scaling.

Real-world Example:
Automatically ingesting IoT sensor data into Amazon S3 for long-term storage and batch analytics.
Example: A manufacturing plant streams sensor data and Firehose delivers it directly to S3, optionally transforming and compressing it on the fly.

3. Kinesis Data Analytics
What it is:
Enables real-time processing and analytics of streaming data using standard SQL queries.

Key Characteristics:

No need to write complex code; use SQL to analyze streaming data.

Integrates with Kinesis Data Streams and Firehose as data sources and sinks.

Useful for real-time dashboards, anomaly detection, and filtering.

Real-world Example:
Monitoring social media sentiment in real-time during a marketing campaign.
Example: Data from Kinesis Data Streams is fed into Kinesis Data Analytics to identify spikes in positive or negative sentiment instantly.
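For the Kinesis Data Streams clickstream example above, the producer side might look like the following sketch. It assumes boto3 and credentials are available; `send_click_event`, the stream name, and the event fields are hypothetical. The partition key decides which shard receives the record, so keying by user keeps one user's events in order.

```python
import json


def send_click_event(kinesis_client, stream_name, event):
    """Serialize one click event and put it on a Kinesis data stream."""
    payload = json.dumps(event).encode("utf-8")
    return kinesis_client.put_record(
        StreamName=stream_name,
        Data=payload,
        PartitionKey=str(event["user_id"]),  # same user, same shard
    )


# Typical use (requires boto3 and configured AWS credentials):
#   import boto3
#   send_click_event(boto3.client("kinesis"), "example-clickstream",
#                    {"user_id": 42, "page": "/checkout"})
```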

Q13. What is columnar storage and how does it benefit Redshift performance for analytics workloads?

Columnar storage is a way of organizing data in a database where data is stored column-by-column instead of row-by-row.

Instead of storing full rows one after another (row-oriented), columnar storage stores all the values of a single column together.

For example, all values from the "age" column are stored sequentially, then all values from the "salary" column, and so on.

Amazon Redshift is a columnar data warehouse, and this storage format gives it several advantages for analytics workloads:

1. Faster Query Performance
Analytical queries usually involve aggregations and scans over a subset of columns (e.g., sum of sales, average age).

Columnar storage allows Redshift to read only the columns required for the query, minimizing the amount of data read from disk.

2. Better Compression
Since columns contain similar data types and often similar values, Redshift can apply highly effective compression algorithms.

This reduces storage space and speeds up I/O by reading less data.

3. Efficient Data Scanning
Scanning a few columns is faster because the database skips irrelevant columns entirely.

This leads to lower I/O, CPU usage, and faster query response times.

4. Optimized for OLAP Workloads
Redshift and other data warehouses are designed for Online Analytical Processing (OLAP), which involves large-scale aggregations and reporting rather than frequent single-row transactions.

Columnar storage matches these patterns perfectly.
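The difference between the two layouts can be illustrated in a few lines of plain Python. This is a conceptual sketch with invented sample rows, not Redshift's actual on-disk format; run-length encoding is shown as one example of a compression scheme that benefits from similar values sitting next to each other.

```python
# Hypothetical sample rows
rows = [
    {"name": "Ana", "age": 31, "region": "EU", "salary": 52000},
    {"name": "Ben", "age": 45, "region": "EU", "salary": 61000},
    {"name": "Cy", "age": 28, "region": "US", "salary": 58000},
    {"name": "Di", "age": 39, "region": "US", "salary": 70000},
]

# Row-oriented: every full row is touched even though only "salary" is needed
row_total = sum(row["salary"] for row in rows)

# Column-oriented: each column's values are stored together
columns = {key: [row[key] for row in rows] for key in rows[0]}
column_total = sum(columns["salary"])  # reads a single column


def run_length_encode(values):
    """Compress consecutive repeats into [value, count] pairs."""
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return encoded


print(row_total, column_total)               # 241000 241000
print(run_length_encode(columns["region"]))  # [['EU', 2], ['US', 2]]
```

Both totals agree, but the columnar version never reads names, ages, or regions, and the `region` column shrinks to two pairs once its repeated values are adjacent.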




Q15. What is the role of the AWS Glue Data Catalog in Athena? How does schema-on-read work?

Role of AWS Glue Data Catalog in Athena
AWS Glue Data Catalog is a central metadata repository — it stores information about data sources like tables, schemas, partitions, and their locations.

When you run queries in Amazon Athena (which is a serverless interactive query service for data in S3), Athena uses the Glue Data Catalog as its metadata store to understand:

What tables exist,

What columns they have,

What data types each column is,

Where the data physically lives in S3.

Essentially, the Glue Data Catalog tells Athena how to interpret the raw data files stored in S3 without needing to move or transform the data first.

How Schema-on-Read Works
Unlike traditional databases that enforce a schema when you write data (schema-on-write), services like Athena and Glue use schema-on-read.

Schema-on-read means:

You store raw data in S3 (e.g., CSV, JSON, Parquet) without enforcing a schema upfront.

When you run a query, Athena applies the schema at query time, based on metadata from the Glue Data Catalog.

The query engine reads the raw data and interprets it according to the schema defined in the Data Catalog.

This allows flexibility:

You can store different data formats without upfront conversion.

You can evolve schemas without rewriting data.

You avoid the overhead of schema validation on data ingestion.
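Schema-on-read can be mimicked locally with the standard library. In this sketch the raw CSV text stands in for a file in S3 and the `SCHEMA` dictionary stands in for catalog metadata; both are invented for illustration, and Athena's real behavior involves SerDes and type handling that this does not model.

```python
import csv
import io

# Raw data as it might sit in S3; nothing is validated at write time (sample values)
RAW_CSV = "order_id,amount,country\n1,19.99,DE\n2,5.00,US\n"

# Schema kept separately from the data, playing the role of catalog metadata
SCHEMA = {"order_id": int, "amount": float, "country": str}


def read_with_schema(raw_text, schema):
    """Parse raw CSV text and cast each field according to the schema at read time."""
    reader = csv.DictReader(io.StringIO(raw_text))
    return [
        {column: cast(row[column]) for column, cast in schema.items()}
        for row in reader
    ]


records = read_with_schema(RAW_CSV, SCHEMA)
print(records)
```

Changing `SCHEMA` (for example, reading `amount` as `str`) changes how the same bytes are interpreted without rewriting the stored data.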

Q16. Create an Athena table from S3 data using Glue Catalog. Run a query and share the SQL + result screenshot

SELECT column_name, COUNT(*) AS count
FROM my_database.my_table
GROUP BY column_name
ORDER BY count DESC
LIMIT 5;
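The table itself has to exist in the Glue Data Catalog before the query above can run. The sketch below shows one possible DDL statement and a helper that submits SQL through boto3's Athena client. The column type, S3 locations, and the `run_query` helper are assumptions; only `my_database`, `my_table`, and `column_name` come from the query above.

```python
# Hypothetical DDL: a one-column CSV table registered in the Glue Data Catalog
CREATE_TABLE_SQL = """
CREATE EXTERNAL TABLE IF NOT EXISTS my_database.my_table (
    column_name string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://example-analytics-bucket/data/'
TBLPROPERTIES ('skip.header.line.count' = '1')
"""


def run_query(athena_client, sql, database, output_location):
    """Start an Athena query and return its execution ID."""
    response = athena_client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]


# Typical use (requires boto3 and configured AWS credentials):
#   import boto3
#   athena = boto3.client("athena")
#   run_query(athena, CREATE_TABLE_SQL, "my_database",
#             "s3://example-analytics-bucket/athena-results/")
```

Athena queries run asynchronously, so a caller would poll `get_query_execution` with the returned ID before fetching results.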


Q17. Describe how Amazon QuickSight supports business intelligence in a serverless data architecture. What are
SPICE and embedded dashboards?

Amazon QuickSight and Serverless BI
Amazon QuickSight is a fully managed, serverless business intelligence (BI) service that enables organizations to easily create and publish interactive dashboards and visualizations without managing any infrastructure.

Serverless Architecture:
You don’t provision or manage servers; QuickSight automatically scales to handle users and data size.

Direct Integration with AWS Data Sources:
QuickSight connects directly to services like Athena, Redshift, RDS, S3, and Glue Data Catalog, making it a natural fit for serverless data lakes and pipelines.

Fast, Scalable BI:
Users can quickly explore and visualize large datasets with minimal latency and effort.

What is SPICE?
SPICE (Super-fast, Parallel, In-memory Calculation Engine) is QuickSight’s proprietary, in-memory data engine designed for fast data querying and visualization.

How it works:
SPICE imports data from your sources into a highly optimized, compressed, and parallelized in-memory engine.

Benefits:

Speed: Queries run faster because data is pre-loaded in-memory.

Scalability: Can handle millions of rows and hundreds of concurrent users.

Reduced Source Dependency: Users can interact with dashboards without querying the underlying source each time.

Cost-Efficient: Reduces load on the original data sources.

What are Embedded Dashboards?
Embedded Dashboards allow you to integrate QuickSight visualizations directly into your own applications, portals, or websites.

Use Cases:

Provide analytics to customers without requiring them to log into QuickSight separately.

Build custom BI experiences with your app’s branding and user management.

How it works:
QuickSight generates secure URLs or iFrames with embedded dashboards that your app can render seamlessly.



Q19. Explain how AWS CloudWatch and CloudTrail differ. In a data analytics pipeline, what role does each play in
monitoring, auditing, and troubleshooting?

AWS CloudWatch vs AWS CloudTrail: What’s the Difference?

| Aspect | AWS CloudWatch | AWS CloudTrail |
| --- | --- | --- |
| Purpose | Monitoring and observability of AWS resources and applications | Governance, compliance, and operational auditing |
| Data Type | Metrics, logs, events from AWS services & apps | API call history (who did what, when, where) |
| Focus | Real-time system performance and operational health | Tracking changes and user activity |
| Examples | CPU usage, memory utilization, application logs, alarms | User logins, API calls like CreateBucket, RunInstances |

Role in a Data Analytics Pipeline

| Function | AWS CloudWatch | AWS CloudTrail |
| --- | --- | --- |
| Monitoring | Track pipeline health (e.g., job success/failure, latency); monitor resource utilization (EC2, Lambda, Redshift); trigger alarms on anomalies | Not used directly for monitoring resource health |
| Auditing | Limited audit capability (log storage only) | Full audit trail of all API calls and user actions; essential for compliance and forensic analysis |
| Troubleshooting | Analyze application and infrastructure logs; visualize metrics to detect bottlenecks or failures; create dashboards for operational visibility | Investigate unauthorized or unintended changes; trace who initiated failed or unexpected actions |
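The two services are also used differently in code: a pipeline writes metrics to CloudWatch, while CloudTrail is queried after the fact. The sketch below assumes boto3 and credentials are available; the helper names, the `DataPipeline` namespace, and the `RowsProcessed` metric are hypothetical.

```python
def publish_rows_processed(cloudwatch_client, row_count):
    """Publish a custom pipeline metric to CloudWatch."""
    return cloudwatch_client.put_metric_data(
        Namespace="DataPipeline",  # hypothetical namespace
        MetricData=[
            {"MetricName": "RowsProcessed", "Value": row_count, "Unit": "Count"}
        ],
    )


def find_events(cloudtrail_client, event_name):
    """Look up recent CloudTrail events by API name; return (event, user) pairs."""
    response = cloudtrail_client.lookup_events(
        LookupAttributes=[
            {"AttributeKey": "EventName", "AttributeValue": event_name}
        ],
        MaxResults=10,
    )
    return [(e["EventName"], e.get("Username")) for e in response["Events"]]


# Typical use (requires boto3 and configured AWS credentials):
#   import boto3
#   publish_rows_processed(boto3.client("cloudwatch"), 1500)
#   print(find_events(boto3.client("cloudtrail"), "CreateBucket"))
```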

Q20. Describe a complete end-to-end data analytics pipeline using AWS services. Include services for data
ingestion, storage, transformation, querying, and visualization. (Example: S3 → Lambda → Glue → QuickSight)
Explain why you would choose each service for the stage it’s used in.

Example Pipeline:
Data Ingestion → Storage → Transformation → Querying → Visualization

1. Data Ingestion: Amazon Kinesis Data Firehose
Why?

Seamlessly ingests streaming data from sources like IoT devices, applications, or logs.

Fully managed, scales automatically, and delivers data reliably to destinations like S3.

Supports data transformation via Lambda during ingestion (e.g., format conversion).

2. Storage: Amazon S3
Why?

Highly durable, scalable, and cost-effective object storage.

Ideal for storing raw, processed, and archived datasets of any size.

Serves as the “data lake” that centralizes all incoming data.

3. Transformation: AWS Glue
Why?

Serverless ETL service that crawls data to infer schemas, catalog data, and orchestrate transformations.

Supports Python/Scala scripts for flexible data processing (e.g., converting CSV to Parquet, cleaning).

Integrates directly with S3, Glue Data Catalog, and Athena for smooth downstream querying.

4. Querying: Amazon Athena
Why?

Serverless interactive query service that uses SQL to analyze data directly in S3.

No infrastructure to manage, scales automatically, and you pay per query.

Uses Glue Data Catalog for metadata, enabling schema-on-read querying without data movement.

5. Visualization: Amazon QuickSight
Why?

Serverless BI tool with fast, interactive dashboards and built-in ML insights.

Connects directly to Athena, Redshift, and other data sources.

Supports SPICE for fast in-memory analysis and embedding dashboards in apps.