### Glacier Operations
- Restore links have an expiry date
- 3 retrieval options: Expedited, Standard, and Bulk


#### Glacier - Vault Policies & Vault Lock
- A vault is a collection of archives
- Each Vault has:
    - one vault access policy
    - one vault lock policy
- Vault Policies are written in JSON
- Vault Access Policy is similar to bucket policy (restrict user / account permissions)
- Vault Lock Policy is a policy you lock, for regulatory and compliance requirements
    - once locked, the policy is immutable; it can never be changed
    - Example 1: forbid deleting an archive if less than 1 year old
    - Example 2: implement WORM policy (write once read many)
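
Example 1 can be sketched as a policy document. The snippet below builds it as a Python dict and prints the JSON. The account ID, region, and vault name in the ARN are placeholders, and the statement ID is an arbitrary label. It follows the deny-by-archive-age pattern (a `Deny` on `glacier:DeleteArchive` conditioned on `glacier:ArchiveAgeInDays`), so treat it as a starting point to adapt rather than a policy to lock as-is.

```python
import json

def build_vault_lock_policy(vault_arn, min_age_days=365):
    """Deny archive deletion until the archive is at least min_age_days old."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "deny-delete-young-archives",  # arbitrary label
                "Principal": "*",
                "Effect": "Deny",
                "Action": "glacier:DeleteArchive",
                "Resource": [vault_arn],
                "Condition": {
                    "NumericLessThan": {"glacier:ArchiveAgeInDays": str(min_age_days)}
                },
            }
        ],
    }

# Placeholder ARN: region, account ID and vault name are made up.
EXAMPLE_VAULT_ARN = "arn:aws:glacier:us-west-2:123456789012:vaults/examplevault"
policy = build_vault_lock_policy(EXAMPLE_VAULT_ARN)
print(json.dumps(policy, indent=2))
```

Because a locked policy can never be changed, it is worth printing and reviewing the JSON like this before locking it.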

#### DynamoDB in Big Data
- Common use cases
    - Mobile apps
    - Gaming
    - Digital ad serving
    - Live voting
    - Audience interaction for live events
    - Sensor networks
    - Log ingestion
    - Access control for web-based content
    - Metadata storage for Amazon S3 objects
    - E-commerce shopping carts
    - Web session management

- Anti-patterns
    - Prewritten application tied to a traditional relational database: use RDS instead
    - Joins or complex transactions
    - Binary Large Object (BLOB) data: store it in S3 and keep a pointer in DynamoDB
    - Large data with low I/O rate: use S3 instead

# Exercise

```
# Install the Kinesis agent
sudo yum install -y aws-kinesis-agent

# Download and unpack the log generator
wget http://media.sundog-soft.com/AWSBigData/LogGenerator.zip
unzip LogGenerator.zip
chmod a+x LogGenerator.py

# Create the log directory used in this exercise
sudo mkdir /var/log/cadabra

# Configure the agent, then start it and enable it at boot
cd /etc/aws-kinesis/
sudo vim agent.json
sudo service aws-kinesis-agent start
sudo chkconfig aws-kinesis-agent on

# Generate log lines and watch the agent log
cd ~
sudo ./LogGenerator.py 500000
tail -f /var/log/aws-kinesis-agent/aws-kinesis-agent.log

# Swap in a revised agent configuration and restart the agent
cd /etc/aws-kinesis/
sudo vim agent.json.r
sudo rm agent.json
sudo mv agent.json.r agent.json
sudo service aws-kinesis-agent restart

# Generate more log lines and check the output directories
cd ~
sudo ./LogGenerator.py
tail -f /var/log/aws-kinesis-agent/aws-kinesis-agent.log
ls /var/log/aws-kinesis-agent/
ls /var/log/cadabra/

# Install boto3 and set up AWS credentials for the consumer
sudo pip install boto3
mkdir .aws
cd .aws/
vim credentials
vim config

# Download and run the consumer
cd ~
wget http://media.sundog-soft.com/AWSBigData/Consumer.py
chmod a+x Consumer.py
./Consumer.py
``` 
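
The commands above edit `/etc/aws-kinesis/agent.json` without showing its contents. As a rough sketch of the shape that file can take for a Firehose delivery, the snippet below builds a minimal agent configuration in Python and prints it as JSON. The delivery stream name `ExampleDeliveryStream` and the `*.log` file pattern are assumptions for illustration, not values taken from the exercise.

```python
import json

def build_agent_config(file_pattern, delivery_stream):
    """Return a minimal Kinesis agent config with one file-to-Firehose flow."""
    return {
        "cloudwatch.emitMetrics": True,
        "flows": [
            # Each flow tails files matching filePattern and ships new lines
            # to the named Firehose delivery stream.
            {"filePattern": file_pattern, "deliveryStream": delivery_stream}
        ],
    }

# Both arguments are hypothetical values chosen for this sketch.
config = build_agent_config("/var/log/cadabra/*.log", "ExampleDeliveryStream")
print(json.dumps(config, indent=2))
```

Generating the file from a script rather than editing it by hand avoids the kind of JSON syntax slips that force a second round of edits and an agent restart.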

# ElastiCache

- ElastiCache is to managed Redis or Memcached what RDS is to managed relational databases
- Caches are in-memory databases with very high performance and low latency
- Helps take load off databases for read-intensive workloads
- Helps make your application stateless
- Write scaling using sharding
- Read scaling using read replicas
- Multi-AZ with failover capability
- AWS takes care of OS maintenance / patching, optimizations, setup, configuration, monitoring, failure recovery and backups
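
To make the "take load off the database" point concrete, here is a minimal cache-aside sketch. A small in-process `TTLCache` class stands in for Redis or Memcached, and `query_database` is a hypothetical slow lookup; neither is an ElastiCache API. The pattern is what matters: check the cache first, and only hit the database on a miss.

```python
import time

class TTLCache:
    """Tiny in-memory cache with per-key expiry (stand-in for Redis/Memcached)."""

    def __init__(self, ttl_seconds=60.0):
        self.ttl_seconds = ttl_seconds
        self._entries = {}  # key -> (expires_at, value)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() >= expires_at:
            del self._entries[key]  # expired: treat as a miss
            return None
        return value

    def set(self, key, value):
        self._entries[key] = (time.monotonic() + self.ttl_seconds, value)


db_reads = 0  # counts how often the "database" is actually hit

def query_database(user_id):
    """Hypothetical slow database lookup."""
    global db_reads
    db_reads += 1
    return {"user_id": user_id, "name": f"user-{user_id}"}

def get_user(cache, user_id):
    """Cache-aside read: try the cache first, fall back to the database."""
    cached = cache.get(user_id)
    if cached is not None:
        return cached
    row = query_database(user_id)
    cache.set(user_id, row)
    return row

cache = TTLCache(ttl_seconds=30)
first = get_user(cache, 42)   # miss: reads the database
second = get_user(cache, 42)  # hit: served from the cache
print(db_reads)
```

The second call never reaches `query_database`, which is the read-scaling benefit in miniature. The TTL bounds how stale a cached row can get.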

#### Redis
- In-memory key-value store
- Super low latency
- Cache survives reboots by default
- Great to host:
    - user sessions
    - leaderboards
    - distributed states
- Relieves pressure on databases
- Pub / sub capability for messaging
- Multi-AZ with automatic failover for disaster recovery if you don't want to lose your cache data
- Support for read replicas

#### Memcached
- in-memory object store
- cache doesn't survive reboots
- use cases:
    - quick retrieval of objects from memory
    - cache often accessed objects
- Overall, Redis has grown considerably in popularity and has a richer feature set than Memcached

# Lambda Integration
#### Why not just run a server?
- Server management
- Servers can be cheap, but scaling gets expensive really fast
- With Lambda, you don't pay for processing time you don't use

![image.png](attachment:image.png)

# Lambda Costs, Promises, and Anti Patterns

- Timeout: 900 seconds (15 minutes) maximum

#### Anti Patterns
- Long running applications
- Dynamic websites
- Stateful applications
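
By contrast, short, stateless, event-driven work is where Lambda fits. The sketch below is a minimal handler for a batch of Kinesis records, where each record carries a base64-encoded payload under `record["kinesis"]["data"]`. The JSON payload fields (`order_id`, `amount`) and the hand-built test event are made up for illustration.

```python
import base64
import json

def lambda_handler(event, context):
    """Decode each Kinesis record in the batch and return the JSON payloads."""
    decoded = []
    for record in event.get("Records", []):
        payload = base64.b64decode(record["kinesis"]["data"])
        decoded.append(json.loads(payload))
    # No state is kept between invocations: everything needed is in the event.
    return {"processed": len(decoded), "items": decoded}

# Local smoke test with a hand-built event shaped like a Kinesis batch.
sample = {"order_id": 1, "amount": 9.99}  # hypothetical payload
fake_event = {
    "Records": [
        {"kinesis": {"data": base64.b64encode(json.dumps(sample).encode()).decode()}}
    ]
}
result = lambda_handler(fake_event, None)
print(result)
```

Each invocation finishes well inside the timeout and holds no state, so it avoids all three anti-patterns listed above.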

# Glue

- Serverless discovery and definition of table definitions and schema
    - S3 data lakes
    - RDS
    - Redshift
    - Most other SQL databases
- Custom ETL jobs
    - Trigger-driven, on a schedule, or on demand
    

#### Glue and S3 Partitions
- Glue crawler will extract partitions based on how your S3 data is organized
- Think up front about how you will be querying your data lake in S3
- Example : devices send sensor data every hour
    - Do you query primarily by time ranges?
        - if so, organize your buckets as yyyy/mm/dd/device
    - Do you query primarily by device?
        - if so, organize your buckets as device/yyyy/mm/dd
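
The two layouts can be sketched as a small helper that builds the S3 key prefix for one reading. The function name, the `layout` labels, and the device ID `sensor-17` are illustrative choices, not part of Glue.

```python
from datetime import datetime

def partition_prefix(device_id, timestamp, layout="time_first"):
    """Build an S3 key prefix for one reading under the chosen layout."""
    date_part = timestamp.strftime("%Y/%m/%d")
    if layout == "time_first":
        return f"{date_part}/{device_id}/"    # yyyy/mm/dd/device
    if layout == "device_first":
        return f"{device_id}/{date_part}/"    # device/yyyy/mm/dd
    raise ValueError(f"unknown layout: {layout}")

reading_time = datetime(2019, 3, 7, 14, 0)
print(partition_prefix("sensor-17", reading_time))                  # 2019/03/07/sensor-17/
print(partition_prefix("sensor-17", reading_time, "device_first"))  # sensor-17/2019/03/07/
```

Whichever field comes first in the prefix is the one a query can prune on most cheaply, which is why the choice depends on the dominant query pattern.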

# Glue, Hive, and ETL

#### Glue ETL
- Automatic code generation
- Scala or Python (using Spark on the backend)
- Encryption
    - Server-side (at rest)
    - SSL (in transit)

- Can be event-driven
- Can provision additional "DPUs" (data processing units) to increase performance of the underlying Spark jobs
- Errors reported to CloudWatch

#### Glue cost model
- Billed by the minute for crawler and ETL jobs
- First million objects stored and accessed are free for the Glue Data Catalog
- Development endpoints for developing ETL code charged by the minute

#### Glue Anti-Patterns
- Streaming data (Glue is batch oriented, minimum 5 minute intervals)
- Multiple ETL engines
- NoSQL databases

# EMR


- Elastic MapReduce
- Managed Hadoop framework on EC2 instances
- Includes Spark, HBase, Presto, Flink, Hive & more


- Transient vs long-running clusters
    - Can spin up task nodes using spot instances for temporary capacity
    - Can use reserved instances on long-running clusters to save money

- Connect directly to master to run jobs
- Submit ordered steps via the console

# EMR, AWS Integration

![image.png](attachment:image.png)

#### EMR Storage
- HDFS
- EMRFS: access S3 as if it were HDFS
    - EMRFS consistent view: optional, for S3 consistency
    - Uses DynamoDB to track consistency
- Local file system
- EBS for HDFS

# EMR Promises

- EMR charges by the hour
    - Plus EC2 charges
- Provisions new nodes if a core node fails
- Can add and remove task nodes on the fly
- Can resize a running cluster's core nodes

- What's Hadoop?
    - MapReduce
    - YARN
    - HDFS
# Intro to Apache Spark

# Spark Integration with Kinesis and Redshift

- Spark Streaming + Kinesis

# Hive on EMR

### Hive metastore
- Hive maintains a metastore that imposes a structure you define on the unstructured data stored on HDFS etc.

#### External Hive Metastores
- Metastore is stored in MySQL on the master node by default
- External metastores offer better resiliency / integration
    - AWS Glue Data Catalog
    - Amazon RDS

![image.png](attachment:image.png)

#### Other Hive / AWS integration points
- Load table partitions from S3
- Write tables in S3
- Load scripts from S3
- DynamoDB as an external table

#### Apache Pig
- Writing mappers and reducers by hand takes a long time
- Pig introduces Pig Latin, a scripting language that lets you use SQL-like syntax to define your map and reduce steps
- Highly extensible with user-defined functions (UDFs)

#### HBase
- Non relational petabyte-scale database
- Based on Google's BigTable, on top of HDFS
- in memory
- hive integration

#### Sounds a lot like DynamoDB
- Both are NoSQL databases intended for the same sorts of things
- But if you're all-in with AWS anyhow, DynamoDB has advantages
    - Fully managed (auto-scaling)
    - More integration with other AWS services
    - Glue integration
- HBase has some advantages though:
    - Efficient storage of sparse data
    - Appropriate for high frequency counters
    - High write & update throughput
    - More integration with Hadoop

# Presto on EMR

- it can connect to many different big data databases and data stores at once, and query across them
- Interactive queries at petabyte scale
- Familiar SQL syntax
- Optimized for OLAP analytical queries, data warehousing
- Developed, and still partially maintained by Facebook
- This is what Amazon Athena uses under the hood
- Exposes JDBC, command-line and Tableau interfaces

- Presto connectors:
    - HDFS
    - S3
    - Redshift
    - SQL
    - HBase
    - Teradata
    - MongoDB
    - Cassandra


# Hue, Splunk, Flume

#### Flume
- Another way to stream data into your cluster
- Made from the start with Hadoop in mind
    - built in sinks for HDFS and HBase
- Originally made to handle log aggregation

![image.png](attachment:image.png)

# S3DistCP

- Tool for copying large amounts of data
    - From S3 into HDFS
    - From HDFS into S3
- Uses MapReduce to copy in a distributed manner
- Suitable for parallel copying of large numbers of objects
    - Across buckets, across accounts

# EMR Security and Instance Types

- IAM policies
- Kerberos
- SSH 
- IAM roles

#### Choosing Instance Types
- Master node
    - m4.large if < 50 nodes, m4.xlarge if > 50 nodes
- Core & task nodes
    - m4.large by default
    - If the cluster waits a lot on external dependencies: t2.medium
    - Improved performance: m4.xlarge
    - Computation-intensive applications: high CPU instances
    - Database, memory-caching applications: high memory instances
    - Network / CPU-intensive (NLP, ML): cluster compute instances

- Spot instances
    - Good choice for task nodes
    - Only use on core & master if you're testing or very cost sensitive; you're risking partial data loss
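
The master-node rule of thumb above is simple enough to write down as a function. This is only a sketch of the guideline in these notes, not an AWS sizing API. The notes do not say what to pick at exactly 50 nodes, so the code assumes the smaller instance there.

```python
def master_instance_type(node_count):
    """Pick a master node type from the cluster size (rule of thumb from the notes)."""
    if node_count < 0:
        raise ValueError("node_count must be non-negative")
    # Assumption: exactly 50 nodes is unspecified in the notes; treated as "small" here.
    return "m4.large" if node_count <= 50 else "m4.xlarge"

for size in (10, 50, 200):
    print(size, master_instance_type(size))
```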
        

# ML

# Data Pipeline

- Features
    - Destinations include S3, RDS, DynamoDB, Redshift and EMR
    - Manages task dependencies
    - Retries and notifies on failures
    - cross region pipelines
    - Precondition checks
    - Data sources may be on premises
    - Highly available

- Activities
    - EMR
    - Hive
    - Copy
    - SQL
    - Scripts

# Kinesis Analytics

- Streaming ETL
- Continuous metric generation
- Responsive analytics

- pay only for resources consumed
- serverless; scales automatically
- use IAM permissions to access streaming source and destination
- schema discovery

- RANDOM\_CUT\_FOREST
    - SQL function used for anomaly detection on numeric columns in a stream
    - They're especially proud of this because they published a paper on it
    - it's a novel way to identify outliers in a data set so you can handle them however you need to
    - example : detect anomalous subway ridership during the NYC marathon

# Amazon Elasticsearch Service

- the elastic stack
- a search engine
- an analysis tool
- a visualization tool
- a data pipeline (Beats / Logstash)
    - you can use kinesis too
- horizontally scalable

- Usage
    - full text search
    - log analytics
    - application monitoring
    - security analytics
    - clickstream

- Concepts
    - documents
    - types
    - indices
    
- Shard
- Redundancy


# Amazon ES
- fully managed
- scale up or down without downtime
    - but this isn't automatic
- pay for what you use
    - instance hours, storage, data transfer
- network isolation
- AWS integration
    - S3 buckets (via lambda to kinesis)
    - kinesis data streams
    - dynamoDB
    - CloudWatch / CloudTrail
    - Zone awareness

- Dedicated master node(s)
    - choice of count and instance types
- domains
- snapshots to S3
- Zone Awareness

- Security
    - Resource based policies
    - identity based policies
    - ip based policies
    - request signing
    - VPC
    - cognito

#### Securing Kibana
- Cognito
- Getting inside a VPC from outside is hard
    - nginx reverse proxy on EC2 forwarding to ES domain
    - SSH tunnel for port 5601
    - VPC direct connect
    - VPN

#### Amazon ES anti patterns
- OLTP
    - No transactions
    - RDS or DynamoDB is better
- Ad hoc data querying
    - athena is better
    
- remember amazon es is primarily for search & analytics

# Amazon Athena

- Interactive query service for S3
    - no need to load data, it stays in S3
- Presto under the hood
- Serverless

- Supports many data formats
    - CSV
    - JSON
    - ORC (columnar, splittable)
    - Parquet(columnar, splittable)
    - Avro (splittable)
    
- Unstructured, semi structured, or structured

- Use cases
    - Ad-hoc queries of web logs
    - Querying staging data before loading to Redshift
    - Analyze CloudTrail / CloudFront / VPC / ELB etc logs in S3
    - Integration with Jupyter, Zeppelin, RStudio notebooks
    - Integration with QuickSight
    - Integration via ODBC / JDBC with other visualization tools

# Athena + Glue

- Pay as you go
    - $5 per TB scanned
    - successful or cancelled queries count, failed queries do not
    - no charge for DDL
- Save LOTS of money by using columnar formats
    - ORC, Parquet
    - save 30 - 90%, and get better performance
- Glue and S3 have their own charges
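
A quick back-of-the-envelope calculation shows why columnar formats matter at $5 per TB scanned. The sketch below assumes a TB is 1024^4 bytes and ignores any per-query minimums or rounding. The 2 TB query and the 75% reduction are made-up numbers chosen from within the 30 - 90% range above.

```python
PRICE_PER_TB_USD = 5.0  # per-TB price quoted in the notes

def athena_query_cost(bytes_scanned, price_per_tb=PRICE_PER_TB_USD):
    """Estimate the cost of one query from the bytes it scans."""
    terabytes = bytes_scanned / (1024 ** 4)
    return terabytes * price_per_tb

raw_scan = 2 * 1024 ** 4            # hypothetical query scanning 2 TB of CSV
columnar_scan = raw_scan * 0.25     # assumed: columnar format scans 75% less
print(f"raw:      ${athena_query_cost(raw_scan):.2f}")       # raw:      $10.00
print(f"columnar: ${athena_query_cost(columnar_scan):.2f}")  # columnar: $2.50
```

The saving comes from scanning less data: a columnar file lets the engine read only the columns a query references instead of every byte of every row.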

![image.png](attachment:image.png)

- Athena Anti pattern
    - Highly formatted reports / visualization
    - ETL
        - use glue instead

# Amazon Redshift