In [None]:
# Explain the difference between AWS Regions, Availability Zones, and Edge Locations. Why is this important for data analysis and latency-sensitive applications?

   - AWS Regions
      - Definition: A Region is a geographically distinct area that contains multiple data centers called Availability Zones (AZs).
      - Purpose: Regions allow customers to deploy resources close to their users or to meet regulatory and compliance requirements.
   - Availability Zones (AZs)
      - Definition: An Availability Zone is one or more physically separate data centers within a region, each with independent power, cooling, and networking.
      - Purpose: AZs offer high availability and fault tolerance. Services like EC2, RDS, and EBS can be deployed across AZs to ensure uptime during failures.
   - Edge Locations
      - Definition: Edge Locations are endpoints of the AWS global network used for content delivery and low-latency access, primarily used by Amazon CloudFront, Route 53, and AWS Global Accelerator.
      - Purpose: Serve cached/static content to users as quickly as possible, reducing latency.
   - Why This Matters
      - Data analysis: Running compute in the same Region as the data avoids cross-Region transfer time and cost, and keeps data inside a required jurisdiction.
      - Latency-sensitive applications: Deploying across AZs keeps the service up when one data center fails, and edge locations answer requests close to the user.






In [None]:
# Using the AWS CLI, list all available AWS regions. Share the command used and the output

aws ec2 describe-regions --query "Regions[*].RegionName" --output table
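
The `--query` expression above is a JMESPath projection over the JSON that `describe-regions` returns. As a rough illustration of what that projection does, the sketch below applies the same logic in plain Python to a hand-written, abbreviated response. The three regions are only a sample and not real command output, and `region_names` is a hypothetical helper.

```python
import json

# Abbreviated, hand-written stand-in for the JSON returned by
# `aws ec2 describe-regions` (the real output lists every enabled region).
SAMPLE_RESPONSE = """
{
  "Regions": [
    {"RegionName": "us-east-1", "Endpoint": "ec2.us-east-1.amazonaws.com"},
    {"RegionName": "eu-west-1", "Endpoint": "ec2.eu-west-1.amazonaws.com"},
    {"RegionName": "ap-south-1", "Endpoint": "ec2.ap-south-1.amazonaws.com"}
  ]
}
"""

def region_names(response_json):
    """Plain-Python equivalent of the projection Regions[*].RegionName."""
    response = json.loads(response_json)
    return [region["RegionName"] for region in response["Regions"]]

print(region_names(SAMPLE_RESPONSE))
```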


In [None]:
# Create a new IAM user with least privilege access to Amazon S3. Share your attached policies (JSON or screenshot)

     1. Create the IAM User (AWS CLI)
          aws iam create-user --user-name s3-least-priv-user

     2. Define the Inline Policy (save as s3-least-privilege-policy.json)
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowMinimalS3Access",
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::your-bucket-name"
      ]
    },
    {
      "Sid": "AllowLimitedObjectAccess",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": [
        "arn:aws:s3:::your-bucket-name/*"
      ]
    }
  ]
}
     3. Attach the Policy to the User
aws iam put-user-policy \
  --user-name s3-least-priv-user \
  --policy-name S3LeastPrivilegePolicy \
  --policy-document file://s3-least-privilege-policy.json
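
The policy above hard-codes `your-bucket-name` in two places. One way to avoid a mismatch between the bucket-level and object-level ARNs is to generate the document from a single bucket name. The following is a minimal sketch under that assumption; `build_s3_policy` is a hypothetical helper, not part of any AWS SDK.

```python
import json

def build_s3_policy(bucket_name):
    """Return a least-privilege policy: list one bucket, read/write its objects."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowMinimalS3Access",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                # ListBucket is a bucket-level action, so it targets the bucket ARN
                "Resource": [bucket_arn],
            },
            {
                "Sid": "AllowLimitedObjectAccess",
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                # Object-level actions target the keys inside the bucket
                "Resource": [f"{bucket_arn}/*"],
            },
        ],
    }

policy = build_s3_policy("your-bucket-name")
print(json.dumps(policy, indent=2))
```

The printed JSON can be saved as `s3-least-privilege-policy.json` and passed to `put-user-policy` as in step 3.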



#  Compare different Amazon S3 storage classes (Standard, Intelligent-Tiering, Glacier). When should each be used in data analytics workflows?

    - 1. S3 Standard
       - Use Case: Frequently accessed data (hot data).
       - Durability: 99.999999999% (11 nines).
       - Availability: 99.99%
       - Latency: Milliseconds
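    - 2. S3 Intelligent-Tiering
       - Use Case: Data with unknown or unpredictable access patterns; objects move between access tiers automatically as usage changes.
       - Durability: 99.999999999%
       - Availability: 99.9%
       - Latency: Milliseconds (for frequent + infrequent tiers)
    - 3. S3 Glacier
       - Use Case: Archived data that is rarely read, such as raw historical datasets kept for compliance or possible reprocessing.
       - Durability: 99.999999999%
       - Latency: Minutes to hours for Glacier Flexible Retrieval, so objects must be restored before analysis; milliseconds for Glacier Instant Retrieval.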




In [None]:
# Create an S3 bucket and upload a sample dataset (CSV or JSON). Enable versioning and show at least two versions of one file

     - Step 1: Create an S3 Bucket
aws s3api create-bucket \
  --bucket your-unique-bucket-name \
  --region us-east-1

     - Step 2: Enable Versioning
aws s3api put-bucket-versioning \
  --bucket your-unique-bucket-name \
  --versioning-configuration Status=Enabled


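     - Step 3: Create a Sample CSV File and Upload Two Versions
echo "id,name,value" > sample.csv
echo "1,Alice,100" >> sample.csv
aws s3 cp sample.csv s3://your-unique-bucket-name/sample.csv
echo "2,Bob,200" >> sample.csv
aws s3 cp sample.csv s3://your-unique-bucket-name/sample.csv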

     - Step 4: List File Versions
aws s3api list-object-versions --bucket your-unique-bucket-name --prefix sample.csv


     - Sample Output (simplified):

{
  "Versions": [
    {
      "VersionId": "A1B2C3D4E5F6G7",
      "Key": "sample.csv",
      "LastModified": "2025-05-29T12:00:00.000Z",
      "IsLatest": true
    },
    {
      "VersionId": "X7Y8Z9W0V1U2T3",
      "Key": "sample.csv",
      "LastModified": "2025-05-29T11:58:00.000Z",
      "IsLatest": false
    }
  ]
}
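
To confirm that at least two versions exist, the listing can be checked programmatically instead of by eye. The sketch below parses the simplified sample output shown above; `summarize_versions` is a hypothetical helper, and the version IDs are the placeholder values from the sample, not real ones.

```python
import json

# The simplified sample output from above, embedded as a string.
LISTING = """
{
  "Versions": [
    {"VersionId": "A1B2C3D4E5F6G7", "Key": "sample.csv",
     "LastModified": "2025-05-29T12:00:00.000Z", "IsLatest": true},
    {"VersionId": "X7Y8Z9W0V1U2T3", "Key": "sample.csv",
     "LastModified": "2025-05-29T11:58:00.000Z", "IsLatest": false}
  ]
}
"""

def summarize_versions(listing_json, key):
    """Count the versions of one key and report which one is current."""
    versions = [v for v in json.loads(listing_json).get("Versions", [])
                if v["Key"] == key]
    # ISO 8601 timestamps in the same format sort chronologically as strings
    versions.sort(key=lambda v: v["LastModified"], reverse=True)
    latest = next(v["VersionId"] for v in versions if v["IsLatest"])
    return {
        "count": len(versions),
        "latest": latest,
        "history": [v["VersionId"] for v in versions],
    }

summary = summarize_versions(LISTING, "sample.csv")
print(summary)
```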


In [None]:
# Write and apply a lifecycle policy to move files to Glacier after 30 days and delete them after 90. Share the policy JSON

    - Step 1: Lifecycle Policy JSON (save as s3-lifecycle-policy.json)

{
  "Rules": [
    {
      "ID": "MoveToGlacierAndDelete",
      "Filter": {
        "Prefix": ""
      },
      "Status": "Enabled",
      "Transitions": [
        {
          "Days": 30,
          "StorageClass": "GLACIER"
        }
      ],
      "Expiration": {
        "Days": 90
      }
    }
  ]
}
    - Step 2: Apply the Lifecycle Policy

aws s3api put-bucket-lifecycle-configuration \
  --bucket your-unique-bucket-name \
  --lifecycle-configuration file://s3-lifecycle-policy.json

    - Step 3: Verify the Lifecycle Policy

aws s3api get-bucket-lifecycle-configuration \
  --bucket your-unique-bucket-name
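
The effect of the rule can be reasoned about with simple date arithmetic. The sketch below is a simplified model, assuming objects are uploaded to the Standard class and that the rule applies exactly at the configured day counts; in practice S3 evaluates lifecycle rules asynchronously, so the actual transition or deletion can lag. `lifecycle_stage` is a hypothetical helper.

```python
from datetime import date

TRANSITION_DAYS = 30   # "Days" under "Transitions" in the policy
EXPIRATION_DAYS = 90   # "Days" under "Expiration" in the policy

def lifecycle_stage(created, today):
    """Return where an object is expected to be, based only on its age."""
    age_days = (today - created).days
    if age_days >= EXPIRATION_DAYS:
        return "EXPIRED"
    if age_days >= TRANSITION_DAYS:
        return "GLACIER"
    return "STANDARD"

uploaded = date(2025, 1, 1)
for check_day in (date(2025, 1, 15), date(2025, 1, 31), date(2025, 4, 1)):
    print(check_day, lifecycle_stage(uploaded, check_day))
```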



# Compare RDS, DynamoDB, and Redshift for use in different stages of a data pipeline. Give one use case for each

   - RDS
       - Type: Relational database (SQL)
       - Use Case: Transactional data (OLTP)
       - Query Style: Structured SQL
       - Scaling: Vertical scaling
   - DynamoDB
       - Type: NoSQL (key-value / JSON documents)
       - Use Case: High-speed, low-latency apps
       - Query Style: Key-based queries
       - Scaling: Auto-scaling
   - Redshift
       - Type: Columnar data warehouse
       - Use Case: Large-scale analytics (OLAP)
       - Query Style: Complex joins and aggregations
       - Scaling: Horizontal scaling

In [None]:
# Create a DynamoDB table and insert 3 records manually. Then write a Lambda function that adds records when triggered by S3 uploads

   - Step 1: Create a DynamoDB Table

aws dynamodb create-table \
  --table-name UserActivity \
  --attribute-definitions AttributeName=UserId,AttributeType=S \
  --key-schema AttributeName=UserId,KeyType=HASH \
  --provisioned-throughput ReadCapacityUnits=5,WriteCapacityUnits=5


   - Step 2: Insert 3 Records Manually

aws dynamodb put-item \
  --table-name UserActivity \
  --item '{"UserId": {"S": "user1"}, "Activity": {"S": "Login"}, "Timestamp": {"S": "2025-05-29T10:00:00Z"}}'

aws dynamodb put-item \
  --table-name UserActivity \
  --item '{"UserId": {"S": "user2"}, "Activity": {"S": "ViewPage"}, "Timestamp": {"S": "2025-05-29T10:02:00Z"}}'

aws dynamodb put-item \
  --table-name UserActivity \
  --item '{"UserId": {"S": "user3"}, "Activity": {"S": "Purchase"}, "Timestamp": {"S": "2025-05-29T10:05:00Z"}}'
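
The `--item` argument uses DynamoDB's typed JSON, where every value is wrapped in a type tag such as `S` (string) or `N` (number). The sketch below shows one way to produce that shape from a plain dictionary. `to_attribute_value` and `to_item` are hypothetical helpers covering only three types; boto3 ships its own serializer for real use.

```python
import json

def to_attribute_value(value):
    """Wrap a Python value in DynamoDB's typed JSON form."""
    if isinstance(value, bool):      # check bool first: bool is a subclass of int
        return {"BOOL": value}
    if isinstance(value, str):
        return {"S": value}
    if isinstance(value, (int, float)):
        return {"N": str(value)}     # numbers are sent as strings
    raise TypeError(f"unsupported type: {type(value).__name__}")

def to_item(record):
    """Convert a flat dict into the structure accepted by put-item."""
    return {name: to_attribute_value(value) for name, value in record.items()}

item = to_item({
    "UserId": "user1",
    "Activity": "Login",
    "Timestamp": "2025-05-29T10:00:00Z",
})
print(json.dumps(item))
```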


   - Step 3: Lambda Function to Add Records on S3 Upload

import json
from datetime import datetime, timezone
from urllib.parse import unquote_plus

import boto3

# Created once per execution environment and reused across invocations
dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('UserActivity')

def lambda_handler(event, context):
    for record in event['Records']:
        # Object keys arrive URL-encoded in S3 event notifications
        key = unquote_plus(record['s3']['object']['key'])
        size = record['s3']['object']['size']

        # Generate a fake UserId for demo purposes
        user_id = f"user_{key.split('.')[0]}"

        # Put item in DynamoDB
        table.put_item(
            Item={
                'UserId': user_id,
                'Activity': f"Uploaded {key}",
                'Timestamp': datetime.now(timezone.utc).isoformat(),
                'FileSize': str(size)
            }
        )

    return {
        'statusCode': 200,
        'body': json.dumps('DynamoDB record created from S3 upload')
    }
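
The record-to-item mapping in the handler can be checked locally without AWS access by pulling it into a pure function. The sketch below does that with a hand-written event record. `build_item` and the sample record are hypothetical, and the key is URL-decoded because S3 event notifications encode object keys (a space arrives as `+`).

```python
from datetime import datetime, timezone
from urllib.parse import unquote_plus

def build_item(record, now):
    """Map one S3 event record to the DynamoDB item the handler would write."""
    key = unquote_plus(record["s3"]["object"]["key"])
    size = record["s3"]["object"]["size"]
    return {
        "UserId": f"user_{key.split('.')[0]}",   # demo-only UserId
        "Activity": f"Uploaded {key}",
        "Timestamp": now.isoformat(),
        "FileSize": str(size),
    }

# Hand-written record containing only the fields the function reads
sample_record = {
    "s3": {
        "bucket": {"name": "your-unique-bucket-name"},
        "object": {"key": "daily+report.csv", "size": 1024},
    }
}

item = build_item(sample_record, datetime(2025, 5, 29, 10, 0, tzinfo=timezone.utc))
print(item)
```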



#  What is serverless computing? Discuss pros and cons of using AWS Lambda for data pipelines

    - What is Serverless Computing?
       - Serverless computing is a cloud computing execution model where the cloud provider dynamically manages the allocation and provisioning of servers.
       - Instead of managing servers or infrastructure, developers write and deploy code, and the cloud automatically handles scaling, maintenance, and resource allocation.
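    - Pros of Using AWS Lambda for Data Pipelines
       - No Server Management: AWS provisions, patches, and operates the underlying compute.
       - Automatic Scaling: Concurrent executions grow and shrink with the volume of incoming events.
       - Cost Efficiency: Billing is per request and per unit of execution time, with no charge for idle time.
       - Event-driven Architecture: Functions can be triggered directly by S3 uploads, Kinesis streams, DynamoDB Streams, and other AWS events.
       - Fast Deployment: Code ships as a small package, with no servers to build or configure.
    - Cons of Using AWS Lambda for Data Pipelines
       - Execution Limits: A single invocation can run for at most 15 minutes, so long-running ETL jobs fit better in Glue or EMR.
       - Cold Starts: The first invocation after a period of inactivity incurs extra startup latency.
       - Statelessness: Any state shared between invocations must live in an external store such as S3 or DynamoDB.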





In [None]:
# Create a Lambda function triggered by S3 uploads that logs file name, size, and timestamp to CloudWatch. Share code and a log screenshot

   - Step 1: Lambda Function Code (Python)

import json
import logging
from urllib.parse import unquote_plus

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def lambda_handler(event, context):
    # The event contains info about the S3 object(s)
    for record in event['Records']:
        bucket_name = record['s3']['bucket']['name']
        # Object keys arrive URL-encoded in S3 event notifications
        object_key = unquote_plus(record['s3']['object']['key'])
        object_size = record['s3']['object']['size']
        event_time = record['eventTime']  # ISO 8601 string

        # Log the details; Lambda forwards logger output to CloudWatch Logs
        logger.info(f"File uploaded - Bucket: {bucket_name}, Key: {object_key}, Size: {object_size} bytes, Time: {event_time}")

    return {
        'statusCode': 200,
        'body': json.dumps('Logged S3 upload details.')
    }
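
Before deploying, a handler like this can be smoke-tested locally by feeding it a hand-built event and capturing what it logs. The sketch below is a local stand-in, not CloudWatch output: `make_s3_event` and `ListHandler` are hypothetical helpers, the logger name is arbitrary, and the event holds only the fields the handler reads.

```python
import json
import logging
from urllib.parse import unquote_plus

logger = logging.getLogger("s3_upload_logger")
logger.setLevel(logging.INFO)

def lambda_handler(event, context):
    for record in event["Records"]:
        bucket_name = record["s3"]["bucket"]["name"]
        object_key = unquote_plus(record["s3"]["object"]["key"])
        object_size = record["s3"]["object"]["size"]
        event_time = record["eventTime"]
        logger.info(f"File uploaded - Bucket: {bucket_name}, Key: {object_key}, "
                    f"Size: {object_size} bytes, Time: {event_time}")
    return {"statusCode": 200, "body": json.dumps("Logged S3 upload details.")}

class ListHandler(logging.Handler):
    """Collects formatted log messages in memory for inspection."""
    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())

def make_s3_event(bucket, key, size, event_time):
    """Build a minimal S3 notification event with one record."""
    return {"Records": [{
        "eventTime": event_time,
        "s3": {"bucket": {"name": bucket}, "object": {"key": key, "size": size}},
    }]}

capture = ListHandler()
logger.addHandler(capture)
response = lambda_handler(
    make_s3_event("your-unique-bucket-name", "sample.csv", 34, "2025-05-29T12:00:00.000Z"),
    None,
)
print(capture.messages[0])
```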




In [None]:
#  Use AWS Glue to crawl your S3 dataset, create a Data Catalog table, and run a Glue job to convert CSV data to Parquet. Share job code and output location


import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

# Parameters passed when starting the job
args = getResolvedOptions(sys.argv, ['JOB_NAME', 'source_database', 'source_table', 'target_path'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Read the data from Glue Catalog (CSV source)
datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database=args['source_database'],
    table_name=args['source_table'],
    transformation_ctx="datasource0"
)

# Convert to Parquet and write to S3
glueContext.write_dynamic_frame.from_options(
    frame=datasource0,
    connection_type="s3",
    connection_options={"path": args['target_path']},
    format="parquet",
    transformation_ctx="datasink0"
)

job.commit()
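
The job reads `source_database`, `source_table`, and `target_path` through `getResolvedOptions`, so they must be supplied when the job run starts, with each name prefixed by `--`. The sketch below builds that argument mapping. `glue_job_arguments` is a hypothetical helper, and the database, table, job name, and output path are placeholder values, not the ones used for this exercise.

```python
def glue_job_arguments(source_database, source_table, target_path):
    """Build the Arguments mapping for a Glue job run ('--' prefix on each key)."""
    if not target_path.startswith("s3://"):
        raise ValueError("target_path must be an s3:// URI")
    return {
        "--source_database": source_database,
        "--source_table": source_table,
        "--target_path": target_path,
    }

# Placeholder names for illustration only
arguments = glue_job_arguments("sales_db", "sales_csv", "s3://my-bucket/parquet/sales/")
print(arguments)

# With boto3 installed and credentials configured, the mapping could be passed as:
#   boto3.client("glue").start_job_run(JobName="csv-to-parquet", Arguments=arguments)
```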



# Explain the difference between Kinesis Data Streams, Kinesis Firehose, and Kinesis Data Analytics. Provide a real-world example of how each would be used

   - Kinesis Data Streams
      - What it is: A scalable, low-latency data streaming service for collecting and processing large streams of data records in real time. It lets you build custom applications that consume and process streaming data with fine-grained control.
      - When to use: You need real-time, custom processing and buffering of streaming data, with the ability to replay data.
      - Example: A stock trading platform that collects real-time price tick data from exchanges and feeds it into a custom application for live fraud detection or trend analysis.
   - Kinesis Data Firehose
      - When to use: You want a simple, managed way to capture and store streaming data without managing the underlying infrastructure or building custom applications.
      - Example: Streaming website clickstream logs directly into Amazon S3 for batch analytics and storage, without building a custom streaming pipeline.
   - Kinesis Data Analytics
      - When to use: You want to perform real-time analytics or monitoring with SQL queries on streaming data.
      - Example: A real-time dashboard that monitors user activity and detects unusual patterns by querying clickstream data with SQL as it flows in.
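
The clickstream example for Kinesis Data Analytics amounts to grouping events into fixed time windows and flagging groups that exceed a threshold. The sketch below expresses that idea in plain Python rather than streaming SQL, and it does not use any Kinesis API. The function names, the 60-second window, the threshold, and the click data are all assumptions made for illustration.

```python
from collections import Counter

def tumbling_window_counts(events, window_seconds=60):
    """Count events per (window_start, user) over fixed, non-overlapping windows."""
    counts = Counter()
    for timestamp, user_id in events:
        window_start = timestamp - (timestamp % window_seconds)
        counts[(window_start, user_id)] += 1
    return counts

def unusual_activity(counts, threshold):
    """Return the (window_start, user) pairs with more events than the threshold."""
    return sorted(key for key, n in counts.items() if n > threshold)

# (seconds since start, user id): made-up clickstream events
clicks = [(0, "u1"), (5, "u1"), (12, "u2"), (30, "u1"),
          (59, "u1"), (61, "u1"), (75, "u2")]

counts = tumbling_window_counts(clicks, window_seconds=60)
print(unusual_activity(counts, threshold=3))
```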




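# What is columnar storage and how does it benefit Redshift performance for analytics workloads?

    - What is Columnar Storage?
        - Columnar storage is a method of storing data in a database where data is stored column-by-column instead of row-by-row.
        - Instead of storing complete rows together, each column’s data is stored contiguously on disk.
    - How Does Columnar Storage Benefit Amazon Redshift Performance for Analytics?
        - Efficient I/O for Analytics Queries: Only the columns named in a query are read from disk.
        - Better Compression: Values within one column share a data type and are often similar, so they compress well.
        - Faster Aggregations and Scans: SUM, AVG, and COUNT run over one contiguous column instead of stepping through whole rows.
        - Improved CPU Efficiency: Less data has to be decompressed and processed per query.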
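
The I/O and compression points can be illustrated with a toy in-memory model. The sketch below pivots a few made-up rows into columns, sums one column, and run-length encodes another. It is a simplified picture of the idea, not how Redshift lays out blocks internally, and the rows and helper names are hypothetical.

```python
# Made-up sales rows in a row-oriented layout
rows = [
    {"sale_id": 1, "customer_id": 10, "region": "EU", "amount": 120.0},
    {"sale_id": 2, "customer_id": 11, "region": "EU", "amount": 80.0},
    {"sale_id": 3, "customer_id": 10, "region": "US", "amount": 200.0},
    {"sale_id": 4, "customer_id": 12, "region": "US", "amount": 50.0},
]

def to_columns(rows):
    """Pivot row dicts into one list per column."""
    return {name: [row[name] for row in rows] for name in rows[0]}

def run_length_encode(values):
    """Collapse consecutive repeats into [value, count] pairs."""
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return encoded

columns = to_columns(rows)

# SUM(amount): a row store reads every field of every row,
# a column store reads only the "amount" column.
total = sum(columns["amount"])
values_read_row_store = sum(len(row) for row in rows)
values_read_column_store = len(columns["amount"])

print(total, values_read_row_store, values_read_column_store)
print(run_length_encode(columns["region"]))
```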



In [None]:
#  Load a CSV file from S3 into Redshift using the COPY command. Share table schema, command used, and sample output from a query

   - Here's a Redshift table schema matching the CSV:


CREATE TABLE sales (
    sale_id INT,
    sale_date DATE,
    customer_id INT,
    amount DECIMAL(10, 2)
);

   - COPY Command to Load CSV from S3

COPY sales
FROM 's3://my-bucket/data/sales.csv'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
CSV
IGNOREHEADER 1
DELIMITER ','
DATEFORMAT 'auto';



#  What is the role of the AWS Glue Data Catalog in Athena? How does schema-on-read work

   - Role of AWS Glue Data Catalog in Amazon Athena
      - The AWS Glue Data Catalog is a centralized metadata repository that stores table definitions, schemas, and location information for your data.
      - Athena uses the Glue Data Catalog as its metadata store to understand the structure and S3 location of each table.
      - When you run a query in Athena, it refers to the Glue Data Catalog to parse and interpret the underlying raw data.
      - Glue Data Catalog can also be used by other AWS analytics services (like Redshift Spectrum, Glue ETL, EMR), enabling consistent metadata management across AWS.

   - How Schema-on-Read Works
      - Schema-on-read means the schema is applied when you query the data, not when you write or store it.
      - Unlike traditional databases (schema-on-write) where data must conform to a schema before storing, schema-on-read lets you store raw data as-is (like CSV, JSON, Parquet) in S3.
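
Schema-on-read can be mimicked locally: the raw text is stored untouched, and a schema is only applied while reading it. The sketch below is a rough analogy using the standard library, not Athena itself. The sample CSV, `SCHEMA`, and `read_with_schema` are hypothetical.

```python
import csv
import io
from datetime import date

# Raw data stored as-is; nothing validated it when it was written
RAW_CSV = (
    "sale_id,sale_date,customer_id,amount\n"
    "1,2025-05-01,10,120.50\n"
    "2,2025-05-02,11,80.00\n"
)

# The "table definition": column name -> function that interprets the raw text
SCHEMA = {
    "sale_id": int,
    "sale_date": date.fromisoformat,
    "customer_id": int,
    "amount": float,
}

def read_with_schema(raw_text, schema):
    """Apply a schema to raw CSV text at read time."""
    reader = csv.DictReader(io.StringIO(raw_text))
    return [{name: cast(row[name]) for name, cast in schema.items()}
            for row in reader]

typed = read_with_schema(RAW_CSV, SCHEMA)

# A second definition over the same untouched bytes: fewer columns, all strings
as_strings = read_with_schema(RAW_CSV, {"sale_id": str, "amount": str})

print(typed[0])
print(as_strings[1])
```

Both reads work against the same stored text, which is the point: changing the table definition does not require rewriting the data.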

In [None]:
# Create an Athena table from S3 data using Glue Catalog. Run a query and share the SQL + result screenshot

   Step 1: Create Glue Catalog Table (using Athena SQL)
CREATE EXTERNAL TABLE IF NOT EXISTS sales (
  sale_id INT,
  sale_date STRING,
  customer_id INT,
  amount DOUBLE
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = ",",
  "quoteChar"     = "\"",
  "escapeChar"    = "\\"
)
LOCATION 's3://my-bucket/data/'
TBLPROPERTIES ('has_encrypted_data'='false', 'skip.header.line.count'='1');

   Step 2: Run a Sample Query in Athena

SELECT sale_id, sale_date, amount
FROM sales
WHERE amount > 100
ORDER BY sale_date DESC
LIMIT 5;
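
The filter and ordering logic of the query can be sanity-checked locally before running it in Athena. The sketch below runs the same SELECT against an in-memory SQLite table as a stand-in. The four rows are made up for illustration, and the printed rows are not Athena results.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE sales (sale_id INT, sale_date TEXT, customer_id INT, amount REAL)"
)
# Made-up rows; the real data lives in S3
connection.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        (1, "2025-05-01", 10, 120.50),
        (2, "2025-05-02", 11, 80.00),
        (3, "2025-05-03", 10, 200.00),
        (4, "2025-05-04", 12, 100.00),
    ],
)

result = connection.execute(
    "SELECT sale_id, sale_date, amount "
    "FROM sales "
    "WHERE amount > 100 "
    "ORDER BY sale_date DESC "
    "LIMIT 5"
).fetchall()

for row in result:
    print(row)
```

Because `sale_date` is stored as an ISO 8601 string in both the Athena table and this stand-in, ordering by it as text gives chronological order.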



# Describe how Amazon QuickSight supports business intelligence in a serverless data architecture. What are SPICE and embedded dashboards?

   - How Amazon QuickSight Supports BI in a Serverless Data Architecture
      - Amazon QuickSight is a fully managed, cloud-native business intelligence (BI) service that lets you visualize and analyze data without managing any infrastructure (hence, serverless).
      - It connects directly to AWS data sources such as S3, Redshift, Athena, and RDS, with no servers to provision and no scaling to manage.
      - QuickSight scales automatically with the number of users and the volume of data, and pricing is usage-based.
      - It simplifies BI by enabling users to create interactive dashboards and visualizations quickly, with minimal setup.
   - What is SPICE?
      - SPICE (Super-fast, Parallel, In-memory Calculation Engine) is QuickSight's in-memory data store.
      - Datasets imported into SPICE are queried from memory, so dashboards respond quickly and do not re-query the source (for example Athena or Redshift) on every view.
      - SPICE datasets can be refreshed on a schedule to pick up new data.
   - What are Embedded Dashboards?
      - Embedded Dashboards let you integrate QuickSight dashboards into your own web applications or portals.
      - This provides your customers or internal users with interactive BI reports without them needing a separate QuickSight login.
      - You can embed dashboards securely, control access, and customize the look and feel.
      - Ideal for SaaS vendors or internal apps wanting to provide BI as part of the user experience.


# Explain how AWS CloudWatch and CloudTrail differ. In a data analytics pipeline, what role does each play in monitoring, auditing, and troubleshooting?

   - Amazon CloudWatch
     - Purpose: Monitoring and observability of AWS resources
     - Tracks: Metrics (CPU, memory, network), logs, alarms
     - Data Type: Operational data (performance metrics and logs)
     - Real-time Use: Yes, for alerting, dashboards, and auto-scaling
   - AWS CloudTrail
     - Purpose: Governance, auditing, and compliance tracking
     - Tracks: API calls made to AWS services (who did what)
     - Data Type: Security and operational audit trails
     - Real-time Use: Mostly post-facto analysis and auditing


# Describe a complete end-to-end data analytics pipeline using AWS services. (Example: S3 → Lambda → Glue → QuickSight) Explain why you would choose each service for the stage it's used in

   1. Data Ingestion: Amazon Kinesis Data Firehose
      - Kinesis Firehose provides a fully managed, scalable, and easy-to-configure data ingestion service that automatically captures streaming data and delivers it to storage destinations like S3.
      - It supports data buffering, compression, and transformation (via Lambda), allowing near real-time data ingestion with minimal operational overhead.
      - Use case: Collect streaming data such as website clickstreams, IoT sensor data, or application logs.
   2. Data Storage: Amazon S3
      - S3 is a highly durable, cost-effective, and scalable data lake storage service. It can store raw data of any format (CSV, JSON, Parquet) at virtually unlimited scale.
      - Use case: Store raw streaming data ingested from Firehose, as well as transformed and partitioned datasets for downstream analytics.
   3. Data Catalog & Transformation: AWS Glue
      - Glue automatically crawls your S3 data to create and maintain a centralized Data Catalog, capturing table schemas and partitions.
      - Use case: Clean and transform raw data to optimized, query-efficient formats; keep metadata updated for querying services.
   4. Querying: Amazon Athena
      - Athena is a serverless interactive query service that allows you to run standard SQL queries directly on data stored in S3 using the Glue Data Catalog metadata.
      - Use case: Analysts and data scientists can explore and analyze datasets ad hoc without needing to move or load data elsewhere.
   5. Visualization: Amazon QuickSight
      - QuickSight is a fully managed BI and visualization service that integrates with Athena and the Glue Data Catalog.
      - Use case: Business users and stakeholders visualize analytics results with rich, shareable dashboards embedded in applications or portals.
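
The buffering mentioned for the ingestion stage follows a simple rule: records accumulate until either a size limit or a time limit is reached, and the batch is then delivered as one object. The sketch below is a toy model of that rule, not Firehose's implementation. `BufferedDelivery`, its thresholds, and the records are hypothetical, and unlike a real service this model only checks the time limit when a new record arrives.

```python
class BufferedDelivery:
    """Toy model of size- or time-based buffering before delivery to storage."""

    def __init__(self, max_bytes, max_seconds):
        self.max_bytes = max_bytes
        self.max_seconds = max_seconds
        self.buffer = []
        self.buffered_bytes = 0
        self.window_start = None
        self.delivered = []   # each entry stands in for one object written to S3

    def put(self, record, now):
        """Add one record (bytes) at time `now` (seconds) and flush if a limit is hit."""
        if self.window_start is None:
            self.window_start = now
        self.buffer.append(record)
        self.buffered_bytes += len(record)
        size_reached = self.buffered_bytes >= self.max_bytes
        time_reached = now - self.window_start >= self.max_seconds
        if size_reached or time_reached:
            self.flush()

    def flush(self):
        """Deliver the current batch as one object and reset the buffer."""
        if self.buffer:
            self.delivered.append(b"".join(self.buffer))
        self.buffer, self.buffered_bytes, self.window_start = [], 0, None

delivery = BufferedDelivery(max_bytes=10, max_seconds=60)
delivery.put(b"aaaa\n", now=0)   # 5 bytes buffered
delivery.put(b"bbbb\n", now=1)   # 10 bytes: size limit reached, batch delivered
delivery.put(b"cccc\n", now=2)   # new batch starts at t=2
delivery.put(b"d\n", now=70)     # 68 s since batch start: time limit reached
print(delivery.delivered)
```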