1. Difference between AWS Regions, Availability Zones, and Edge Locations
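Regions: Separate geographic areas (e.g., us-east-1, ap-south-1) that are isolated from one another. Each region contains multiple Availability Zones.


Availability Zones (AZs): One or more discrete data centers within a region, each with independent power and networking. Spreading resources across AZs provides high availability and fault tolerance.


Edge Locations: Points of presence outside the regions, used by CloudFront to cache content near end users and deliver it with low latency.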


Importance for Data Analysis & Latency-Sensitive Applications:


Choose a region close to your users (and to your data) for low latency.


Deploy across multiple AZs for high availability.


Use edge locations for fast content delivery.



2. List all AWS regions using AWS CLI
aws ec2 describe-regions --all-regions --query "Regions[*].RegionName" --output json

Sample Output (truncated):
[
  "af-south-1",
  "ap-east-1",
  "ap-south-1",
  "ap-northeast-1",
  "eu-west-1",
  ...
]
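
The same listing can be sketched in Python with boto3, assuming boto3 is installed and credentials are configured. The helper names `region_names` and `list_all_regions` are illustrative, not part of any AWS API; only `describe_regions(AllRegions=True)` is the real call.

```python
def region_names(response):
    """Extract the region names from a describe_regions response, sorted."""
    return sorted(region["RegionName"] for region in response["Regions"])


def list_all_regions():
    """Ask EC2 for every region, including ones not enabled for the account."""
    import boto3  # imported lazily so region_names() works without boto3

    # Any enabled region can serve this call; us-east-1 is an assumption.
    ec2 = boto3.client("ec2", region_name="us-east-1")
    return region_names(ec2.describe_regions(AllRegions=True))
```

Calling `list_all_regions()` needs network access and valid credentials; `region_names` can be checked locally against a saved response.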


3. Create IAM User with Least Privilege for S3
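Policy JSON (s3:ListBucket applies to the bucket itself; s3:GetObject and s3:PutObject apply to the objects inside it):
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ListBucket",
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::your-bucket-name"
    },
    {
      "Sid": "ReadWriteObjects",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": "arn:aws:s3:::your-bucket-name/*"
    }
  ]
}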
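
Creating the user and attaching the policy could look roughly like this with boto3. This is a sketch: the function names, the inline policy name, and the choice of an inline policy (rather than a managed one) are assumptions; `create_user` and `put_user_policy` are the real IAM calls.

```python
import json


def s3_least_privilege_policy(bucket):
    """Build a policy that lists one bucket and reads/writes its objects."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {   # ListBucket is a bucket-level action
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {   # GetObject/PutObject are object-level actions
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
        ],
    }


def create_user_with_policy(user_name, bucket):
    """Create an IAM user and attach the policy inline (assumed approach)."""
    import boto3  # imported lazily so the builder works without boto3

    iam = boto3.client("iam")
    iam.create_user(UserName=user_name)
    iam.put_user_policy(
        UserName=user_name,
        PolicyName=f"{bucket}-read-write",  # hypothetical policy name
        PolicyDocument=json.dumps(s3_least_privilege_policy(bucket)),
    )
```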


4. Compare S3 Storage Classes
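| Storage Class | Use Case |
|---|---|
| S3 Standard | Frequently accessed data. |
| S3 Intelligent-Tiering | Data with unknown or changing access patterns; objects move automatically between access tiers. |
| S3 Glacier | Archival. Use when data is rarely accessed and retrieval delays are acceptable. |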


5. Create S3 Bucket, Upload File, Enable Versioning
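# Create the bucket (regions other than us-east-1 require a LocationConstraint)
aws s3api create-bucket --bucket my-dataset-bucket --region ap-south-1 --create-bucket-configuration LocationConstraint=ap-south-1
# Enable versioning before uploading
aws s3api put-bucket-versioning --bucket my-dataset-bucket --versioning-configuration Status=Enabled
# Upload a file, then overwrite it to create a second version
aws s3 cp data.csv s3://my-dataset-bucket/data.csv
aws s3 cp updated_data.csv s3://my-dataset-bucket/data.csv
# List the stored versions of data.csv
aws s3api list-object-versions --bucket my-dataset-bucket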
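
A boto3 version of the same steps might look like the sketch below. The function names are illustrative. The one detail worth noting is that `create_bucket` needs a `CreateBucketConfiguration` for every region except us-east-1, where it must be omitted.

```python
def create_bucket_kwargs(bucket, region):
    """Build create_bucket arguments; us-east-1 takes no LocationConstraint."""
    kwargs = {"Bucket": bucket}
    if region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs


def upload_two_versions(bucket, region, first_path, second_path, key):
    """Create a versioned bucket, upload twice, and return the version IDs."""
    import boto3  # imported lazily so the helper above works without boto3

    s3 = boto3.client("s3", region_name=region)
    s3.create_bucket(**create_bucket_kwargs(bucket, region))
    s3.put_bucket_versioning(
        Bucket=bucket, VersioningConfiguration={"Status": "Enabled"}
    )
    s3.upload_file(first_path, bucket, key)   # first version
    s3.upload_file(second_path, bucket, key)  # second version of the same key
    response = s3.list_object_versions(Bucket=bucket, Prefix=key)
    return [version["VersionId"] for version in response.get("Versions", [])]
```

For example, `upload_two_versions("my-dataset-bucket", "ap-south-1", "data.csv", "updated_data.csv", "data.csv")` should return two version IDs if the bucket name is available.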


6. Lifecycle Policy: Move to Glacier after 30 days, Delete after 90
{
  "Rules": [
    {
      "ID": "MoveToGlacierThenDelete",
      "Status": "Enabled",
      "Filter": {"Prefix": ""},
      "Transitions": [
        {
          "Days": 30,
          "StorageClass": "GLACIER"
        }
      ],
      "Expiration": {
        "Days": 90
      }
    }
  ]
}
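
The rule can also be built and applied from Python. The sketch below assumes boto3 and uses illustrative function names; it expresses the scope as a `Filter` element and checks that the expiration comes after the transition. On a versioned bucket, `Expiration` affects current versions only, so older versions would need a separate noncurrent-version rule.

```python
def lifecycle_config(transition_days=30, expiration_days=90, prefix=""):
    """Build a lifecycle configuration: Glacier after N days, delete after M."""
    if expiration_days <= transition_days:
        raise ValueError("expiration must come after the Glacier transition")
    return {
        "Rules": [
            {
                "ID": "MoveToGlacierThenDelete",
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},  # empty prefix = whole bucket
                "Transitions": [
                    {"Days": transition_days, "StorageClass": "GLACIER"}
                ],
                "Expiration": {"Days": expiration_days},
            }
        ]
    }


def apply_lifecycle(bucket, config):
    """Attach the lifecycle configuration to the bucket."""
    import boto3  # imported lazily so the builder works without boto3

    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=config
    )
```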


7. RDS vs DynamoDB vs Redshift
| Service | Use Case |
|---|---|
| RDS | Transactional data (e.g., orders, billing) |
| DynamoDB | Serverless NoSQL (e.g., user sessions) |
| Redshift | OLAP, large-scale analytics queries |


8. DynamoDB + Lambda for S3 Trigger
Create Table: aws dynamodb create-table ...


Insert Records:


aws dynamodb put-item --table-name myTable --item '{"ID":{"S":"1"}, "Name":{"S":"Alice"}}'

Lambda Function (Python):


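import boto3
from urllib.parse import unquote_plus

# Create the client once so warm invocations reuse it
dynamodb = boto3.client('dynamodb')

def lambda_handler(event, context):
    # A single S3 event can carry several records
    for record in event['Records']:
        # Object keys arrive URL-encoded (e.g., spaces become '+')
        filename = unquote_plus(record['s3']['object']['key'])
        dynamodb.put_item(
            TableName='myTable',
            Item={
                'ID': {'S': filename},
                'Status': {'S': 'Uploaded'}
            }
        )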


9. Serverless Computing and AWS Lambda
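Definition: Serverless computing lets you run code without provisioning or managing servers. AWS Lambda runs a function in response to events and bills per request and execution duration.


Pros: Automatic scaling, pay-per-use pricing, easy deployment.


Cons: Cold-start latency, a 15-minute maximum execution time, harder debugging.


For Pipelines: Lambda is ideal for lightweight, event-driven processing; long-running or heavy transformations belong in a service such as Glue.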



10. Lambda Logging to CloudWatch on S3 Upload
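import logging
from urllib.parse import unquote_plus

# Lambda sends output from the root logger to CloudWatch Logs
logger = logging.getLogger()
logger.setLevel(logging.INFO)

def lambda_handler(event, context):
    for record in event['Records']:
        obj = record['s3']['object']
        # Keys arrive URL-encoded; eventTime is the upload time reported by S3
        key = unquote_plus(obj['key'])
        logger.info(f"File: {key}, Size: {obj['size']} bytes, Time: {record['eventTime']}")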
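
To try handlers like these without deploying, the relevant fields can be pulled out of a hand-written event. This is a sketch: `uploaded_objects` is an illustrative helper, and `sample_event` is a trimmed, hypothetical S3 notification that keeps only the fields the handlers read.

```python
from urllib.parse import unquote_plus


def uploaded_objects(event):
    """Return bucket, decoded key, size, and time for each S3 record."""
    return [
        {
            "bucket": record["s3"]["bucket"]["name"],
            # S3 URL-encodes keys in notifications ('+' stands for a space)
            "key": unquote_plus(record["s3"]["object"]["key"]),
            "size": record["s3"]["object"].get("size"),
            "time": record["eventTime"],
        }
        for record in event.get("Records", [])
    ]


# Hypothetical event, trimmed to the fields used above
sample_event = {
    "Records": [
        {
            "eventTime": "2024-01-01T12:00:00.000Z",
            "s3": {
                "bucket": {"name": "my-dataset-bucket"},
                "object": {"key": "reports/my+file%281%29.csv", "size": 2048},
            },
        }
    ]
}

print(uploaded_objects(sample_event))
```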


11. AWS Glue: Crawl, Catalog, Convert CSV to Parquet
Steps:


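Create a crawler that points at the S3 path, then run it so the table is registered in the Data Catalog.


Create a Glue job that reads the cataloged table and writes Parquet: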


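import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Standard Glue job setup: resolve the job name and initialize the job
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the table that the crawler registered in the Data Catalog
datasource = glueContext.create_dynamic_frame.from_catalog(database="mydb", table_name="mytable")
# Write the same records back to S3 as Parquet
glueContext.write_dynamic_frame.from_options(frame=datasource, connection_type="s3", format="parquet", connection_options={"path": "s3://output-path/"})
job.commit()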
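
The crawler step can be scripted as well. A minimal sketch with boto3 follows; the helper names are illustrative, and the crawler name, role ARN, and S3 path in the usage note are hypothetical values to replace with your own.

```python
def crawler_kwargs(name, role_arn, database, s3_path):
    """Build create_crawler arguments for a single S3 target."""
    return {
        "Name": name,
        "Role": role_arn,          # IAM role the crawler assumes
        "DatabaseName": database,  # Data Catalog database to write tables to
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def create_and_start_crawler(name, role_arn, database, s3_path):
    """Create the crawler, then start a run that catalogs the S3 path."""
    import boto3  # imported lazily so the builder works without boto3

    glue = boto3.client("glue")
    glue.create_crawler(**crawler_kwargs(name, role_arn, database, s3_path))
    glue.start_crawler(Name=name)
```

For example, `create_and_start_crawler("sales-crawler", "<role-arn>", "mydb", "s3://<input-path>/")` would register the table that the job then reads.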


12. Kinesis Data Family Comparison
| Service | Use Case Example |
|---|---|
| Kinesis Data Streams | Real-time clickstream ingestion. |
| Kinesis Data Firehose (now Amazon Data Firehose) | Stream logs to S3/Redshift/OpenSearch (formerly Elasticsearch). |
| Kinesis Data Analytics | SQL queries on streaming data. |
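
As an illustration of the clickstream case, a producer could write events to a data stream roughly as follows. This is a sketch: the event fields and function names are assumptions, while `put_record` with `StreamName`, `Data`, and `PartitionKey` is the real Kinesis call.

```python
import json


def click_record(stream, user_id, page):
    """Build put_record arguments for one click event."""
    return {
        "StreamName": stream,
        # Data must be bytes; JSON is an assumed encoding
        "Data": json.dumps({"user_id": user_id, "page": page}).encode("utf-8"),
        # Records with the same partition key go to the same shard
        "PartitionKey": user_id,
    }


def send_click(stream, user_id, page):
    """Send one click event to Kinesis Data Streams."""
    import boto3  # imported lazily so the builder works without boto3

    return boto3.client("kinesis").put_record(**click_record(stream, user_id, page))
```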


13. Columnar Storage in Redshift
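Redshift stores each column's values together on disk instead of storing whole rows together.


Benefits: Queries read only the columns they reference (less I/O), similar values stored together compress well, and aggregations over a few columns run faster, which makes the layout ideal for analytics.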
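
A toy illustration in plain Python shows the idea. It is not how Redshift is implemented; it only shows that a column layout lets an aggregate touch one list and that repeated values in a column compress well. The sample data is made up.

```python
# Row layout: each record keeps all of its fields together
rows = [
    {"id": 1, "product": "pen", "amount": 2.5},
    {"id": 2, "product": "pen", "amount": 2.5},
    {"id": 3, "product": "book", "amount": 12.0},
]


def to_columns(records):
    """Pivot a list of row dicts into one list per column."""
    return {name: [record[name] for record in records] for name in records[0]}


def run_length_encode(values):
    """Compress consecutive repeats into [value, count] pairs."""
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return encoded


columns = to_columns(rows)

# Row layout: the sum walks every record to pick out one field
total_from_rows = sum(row["amount"] for row in rows)
# Column layout: the sum reads only the "amount" list
total_from_columns = sum(columns["amount"])

print(total_from_rows, total_from_columns)
print(run_length_encode(columns["product"]))
```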



14. Load CSV into Redshift
Table Schema:
-- DECIMAL without precision defaults to DECIMAL(18,0), which drops the fractional part
CREATE TABLE sales(id INT, product VARCHAR(256), amount DECIMAL(12,2));

COPY Command:
COPY sales FROM 's3://mybucket/sales.csv'
IAM_ROLE 'arn:aws:iam::xxxx:role/myRedshiftRole'
CSV IGNOREHEADER 1;

Verification Query:
SELECT * FROM sales LIMIT 5;
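
The load can also be submitted from Python through the Redshift Data API. The sketch below assumes a provisioned cluster and a database user; the function names are illustrative. It builds the statement with the `IAM_ROLE` clause, which Redshift accepts in place of a `CREDENTIALS` string.

```python
def copy_csv_sql(table, s3_uri, role_arn):
    """Build a COPY statement for a CSV file with one header row.

    Inputs are assumed to be trusted; they are interpolated as-is.
    """
    return (
        f"COPY {table} FROM '{s3_uri}' "
        f"IAM_ROLE '{role_arn}' "
        "CSV IGNOREHEADER 1;"
    )


def run_statement(cluster_id, database, db_user, sql):
    """Submit SQL through the Redshift Data API and return the statement ID."""
    import boto3  # imported lazily so the builder works without boto3

    client = boto3.client("redshift-data")
    response = client.execute_statement(
        ClusterIdentifier=cluster_id,
        Database=database,
        DbUser=db_user,
        Sql=sql,
    )
    return response["Id"]
```

`execute_statement` is asynchronous, so the returned ID identifies a statement that may still be running.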


15. Glue Data Catalog in Athena
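Role: The Glue Data Catalog is the central metadata repository. Athena looks up table definitions (columns, types, S3 location, file format) there.


Schema-on-Read: Athena applies the table schema to the files at query time, so data is queried in place in S3. No data movement or transformation is needed beforehand.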



16. Create Athena Table from S3 + Query
CREATE EXTERNAL TABLE sales (
  id INT,
  product STRING,
  amount DOUBLE
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ('field.delim' = ',')
LOCATION 's3://mybucket/sales/';

SELECT * FROM sales WHERE amount > 100;
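
Running the query from code might look like this sketch with boto3. The function names and the one-second polling interval are assumptions; Athena also needs an S3 location for query results, passed here as `output_location`.

```python
def athena_query_kwargs(sql, database, output_location):
    """Build start_query_execution arguments."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_query(sql, database, output_location, poll_seconds=1.0):
    """Start a query, wait for it to finish, and return the first page of rows."""
    import time

    import boto3  # imported lazily so the builder works without boto3

    athena = boto3.client("athena")
    query_id = athena.start_query_execution(
        **athena_query_kwargs(sql, database, output_location)
    )["QueryExecutionId"]

    # Athena runs queries asynchronously, so poll until a final state
    while True:
        execution = athena.get_query_execution(QueryExecutionId=query_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)

    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query ended in state {state}")
    return athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
```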


17. Amazon QuickSight for BI
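QuickSight is a serverless BI service: there is no infrastructure to manage for dashboards and analyses.


SPICE (Super-fast, Parallel, In-memory Calculation Engine): In-memory storage for faster queries; imported datasets are served from memory instead of querying the source each time.


Embedded Dashboards: Dashboards can be integrated into custom web apps.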



18. QuickSight Dashboard from Athena/Redshift
Connect to Athena or Redshift as the data source.


Create a calculated field (e.g., Profit = Revenue - Cost).


Add filters (e.g., Region = 'North').




19. CloudWatch vs CloudTrail
| Feature | CloudWatch | CloudTrail |
|---|---|---|
| Purpose | Logs, metrics, monitoring | API activity auditing |
| Pipeline Role | Monitor Glue, Lambda, etc. | Track user/API interactions |


20. End-to-End AWS Data Pipeline
Pipeline:
Ingestion: S3 (Raw data uploaded via APIs or directly).


Trigger: Lambda (Triggers Glue job on file upload).


Transformation: AWS Glue (ETL from CSV to Parquet).


Querying: Athena (Run SQL queries over S3).


Visualization: QuickSight (Dashboards built on Athena tables).
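
The trigger step of this pipeline could be sketched as a Lambda function that starts a Glue job for each uploaded file. The job name `csv-to-parquet` and the `--input_path` argument are hypothetical; the Glue job would need to read that argument itself (for example with `getResolvedOptions`). The Glue client is passed in so the logic can be exercised without AWS.

```python
from urllib.parse import unquote_plus

GLUE_JOB_NAME = "csv-to-parquet"  # hypothetical job name


def start_glue_runs(event, glue_client, job_name=GLUE_JOB_NAME):
    """Start one Glue job run per uploaded object and return the run IDs."""
    run_ids = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Keys arrive URL-encoded in S3 notifications
        key = unquote_plus(record["s3"]["object"]["key"])
        response = glue_client.start_job_run(
            JobName=job_name,
            # Hypothetical job argument carrying the uploaded file's location
            Arguments={"--input_path": f"s3://{bucket}/{key}"},
        )
        run_ids.append(response["JobRunId"])
    return run_ids


def lambda_handler(event, context):
    """Lambda entry point: wire the real Glue client into the logic above."""
    import boto3  # available in the Lambda Python runtime

    return {"job_runs": start_glue_runs(event, boto3.client("glue"))}
```

The Lambda execution role would need permission to call `glue:StartJobRun` on the job.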


