# AWS Documentation

#### Contents

## Amazon S3

Amazon S3 lets you store objects (files) in "buckets" (directories).

Bucket names must be **globally** unique.

**s3://aws-machine-learning-furkan**/my_file.csv

Objects (files) have a key. The key is the FULL path that follows the bucket name: <br>
&nbsp;&nbsp;&nbsp; &lt;my_bucket&gt;/my_file.txt<br>
&nbsp;&nbsp;&nbsp; &lt;my_bucket&gt;/my_folder1/another_myfolder/my_file.txt
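
As a rough illustration of the bucket/key split, the sketch below separates an `s3://` URI into its two parts. The helper name `split_s3_uri` is hypothetical, and it assumes the key is everything after the bucket name with the leading slash removed.

```python
from typing import Tuple
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Split an s3:// URI into (bucket, key). Hypothetical helper."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    # The bucket is the host part; the key is the rest of the path
    # without its leading slash.
    return parsed.netloc, parsed.path.lstrip("/")


bucket, key = split_s3_uri(
    "s3://my_bucket/my_folder1/another_myfolder/my_file.txt"
)
print(bucket)
print(key)
```

S3 has no real directories: `my_folder1/another_myfolder/` is just a prefix inside the key.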

It is recommended to organize your files by **year/month/day/hour**.

/2020/07/14/11/my_file.json
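
A minimal sketch of building such a date-partitioned key from a timestamp. The function name `partitioned_key` is an assumption for illustration, and the key is written without a leading slash, as is usual for S3 keys.

```python
from datetime import datetime, timezone


def partitioned_key(ts: datetime, filename: str) -> str:
    """Build an object key of the form year/month/day/hour/filename."""
    return f"{ts:%Y/%m/%d/%H}/{filename}"


key = partitioned_key(
    datetime(2020, 7, 14, 11, tzinfo=timezone.utc), "my_file.json"
)
print(key)  # 2020/07/14/11/my_file.json
```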

Will be added:
* AWS S3 for machine learning
* AWS S3 Data Partitioning


#### S3 Lifecycle Rules <br>
A set of rules that moves data between storage tiers to save on storage cost<br>
General Purpose => Infrequent Access => Glacier<br>
###### Transition Actions: objects are transitioned to another storage class.<br>
Move objects to Standard IA class 60 days after creation.<br>
Move to Glacier for archiving after 6 months.

###### Expiration actions: S3 deletes expired objects on our behalf<br>
Access log files can be set to delete after a specified period of time

<font color='green'>ServisSoft should delete heat data after 6 months to save on storage costs.</font>
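
The rules above can be sketched as a lifecycle configuration. The dictionary below follows the shape that boto3's `put_bucket_lifecycle_configuration` accepts for its `LifecycleConfiguration` argument. The rule ID, the `logs/` prefix, the 365-day expiration, and treating "6 months" as 180 days are all assumptions made for illustration.

```python
import json


def lifecycle_configuration(prefix: str, ia_days: int = 60,
                            glacier_days: int = 180,
                            expire_days: int = 365) -> dict:
    """Build one rule with two transition actions and one expiration action."""
    return {
        "Rules": [
            {
                "ID": "tier-then-expire",  # hypothetical rule name
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                # Transition actions: move to cheaper storage classes.
                "Transitions": [
                    {"Days": ia_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_days, "StorageClass": "GLACIER"},
                ],
                # Expiration action: S3 deletes the objects on our behalf.
                "Expiration": {"Days": expire_days},
            }
        ]
    }


config = lifecycle_configuration("logs/")
print(json.dumps(config, indent=2))
```

With boto3 installed and credentials configured, this could be applied with `s3.put_bucket_lifecycle_configuration(Bucket=..., LifecycleConfiguration=config)`.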

#### Amazon S3 Security <br>
###### S3 Encryption for Objects<br>
There are 4 methods of encrypting objects in S3<br>
* SSE-S3: encrypts S3 objects using keys handled and managed by AWS
* SSE-KMS: uses AWS Key Management Service to manage encryption keys
* SSE-C: when you want to manage your own encryption keys
* Client-Side Encryption

<font color='green'>For machine learning, SSE-S3 and SSE-KMS are the methods most likely to be used.</font>
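
A small sketch of how the two common methods might be requested at upload time. `ServerSideEncryption` and `SSEKMSKeyId` are real request parameters of boto3's `put_object` (and are accepted in `upload_file`'s `ExtraArgs`). The helper `encryption_args` and the key alias are hypothetical.

```python
from typing import Optional


def encryption_args(method: str, kms_key_id: Optional[str] = None) -> dict:
    """Return upload arguments for SSE-S3 or SSE-KMS. Hypothetical helper."""
    if method == "SSE-S3":
        # Keys are handled and managed by AWS.
        return {"ServerSideEncryption": "AES256"}
    if method == "SSE-KMS":
        args = {"ServerSideEncryption": "aws:kms"}
        # Without an explicit key ID, the AWS-managed KMS key for S3 is used.
        if kms_key_id is not None:
            args["SSEKMSKeyId"] = kms_key_id
        return args
    raise ValueError(f"unsupported method: {method!r}")


print(encryption_args("SSE-S3"))
print(encryption_args("SSE-KMS", kms_key_id="alias/my-ml-key"))
```

These arguments would be passed along with the upload, for example `s3.put_object(Bucket=..., Key=..., Body=..., **encryption_args("SSE-S3"))`.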

###### SSE-S3

![SSE-S3.png](attachment:SSE-S3.png)

###### SSE-KMS

![SSE-KMS.png](attachment:SSE-KMS.png)

##### User Based Security<br>
* IAM policies - which API calls should be allowed for a specific user <br>

##### Resource Based <br>
* Bucket Policies - bucket-wide rules set from the S3 console - allow cross-account access
* Object Access Control List
* Bucket Access Control List

##### S3 Bucket Policies<br>

###### JSON based policies<br>
* Resources: buckets and objects
* Actions: Set of API actions to Allow or Deny
* Effect: Allow/Deny
* Principal: The account or user to apply the policy to

###### Use an S3 bucket policy to
* Grant public access to the bucket
* Force objects to be encrypted at upload
* Grant access to another account
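
As one example of these use cases, the sketch below builds a bucket policy that forces objects to be encrypted at upload by denying any `PutObject` request that carries no server-side-encryption header. The function name and `Sid` are assumptions. The bucket name is the one used earlier in these notes.

```python
import json


def deny_unencrypted_uploads_policy(bucket: str) -> dict:
    """Bucket policy that rejects PutObject requests lacking an SSE header."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyUnencryptedUploads",  # hypothetical statement ID
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": f"arn:aws:s3:::{bucket}/*",
                # "Null": true matches requests where the header is absent.
                "Condition": {
                    "Null": {"s3:x-amz-server-side-encryption": "true"}
                },
            }
        ],
    }


policy = deny_unencrypted_uploads_policy("aws-machine-learning-furkan")
print(json.dumps(policy, indent=2))
```

The four elements listed above (Resource, Action, Effect, Principal) all appear in the single statement. The JSON string could then be attached with boto3's `put_bucket_policy(Bucket=..., Policy=json.dumps(policy))`.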

<font color='red'>Make sure AWS SageMaker can access S3</font>

## AWS Kinesis

* Kinesis is a managed alternative to Apache Kafka
* Great for application logs, metrics, __IoT__, clickstreams
* Great for __real time big data__

##### Kinesis Streams: low latency streaming ingest at scale <br>
##### Kinesis Analytics: perform real-time analytics on streams using SQL<br>
##### Kinesis Firehose: load streams into S3, Redshift, Elasticsearch & Splunk<br>
##### Kinesis Video Streams: meant for streaming video in real-time

<font color = "orange">Kinesis Data Analytics is an expensive service.</font>

<font color='red'>
* Kinesis Data Streams can be used to create real-time machine learning applications <br>
* Kinesis Data Analytics is able to create real-time ETL/ML algorithms on streams
</font>
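
A minimal sketch of preparing events for ingestion into a Kinesis data stream. The entries follow the `Records` shape that boto3's `put_records` accepts (`Data` as bytes plus a `PartitionKey`). The helper name and the sample clickstream fields are hypothetical.

```python
import json
from typing import Dict, Iterable, List


def to_kinesis_records(events: Iterable[Dict], key_field: str) -> List[Dict]:
    """Encode events as JSON bytes and pick a partition key for each one."""
    return [
        {
            "Data": json.dumps(event).encode("utf-8"),
            # Records sharing a partition key are routed to the same shard.
            "PartitionKey": str(event[key_field]),
        }
        for event in events
    ]


clicks = [
    {"user_id": 42, "page": "/home"},
    {"user_id": 7, "page": "/pricing"},
]
records = to_kinesis_records(clicks, key_field="user_id")
print(records[0]["PartitionKey"], records[0]["Data"])
```

With boto3 available, the list could be sent with `kinesis.put_records(StreamName=..., Records=records)`.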

### AWS Glue<br>
AWS Glue is a fully managed ETL (extract, transform, and load) service that makes it simple and cost-effective to categorize data, clean it, enrich it, and move it reliably between various data stores.

![aws-glue.png](attachment:aws-glue.png)

#### Glue Data Catalog

* Metadata repository for all your tables.
* Integrates with Athena or Redshift Spectrum
<font color = "green">Glue Crawlers can help build the Glue Data Catalog.</font>

![aws-glue_catalog.png](attachment:aws-glue_catalog.png)

<font color = "green">Amazon QuickSight is a cloud-powered business intelligence service that makes it easy to deliver insights to everyone in your organization. As a fully managed service, QuickSight lets you easily create and publish interactive dashboards that include ML Insights. </font>

### Glue ETL<br>
* Transform data, Clean data, Enrich data (before analysis)
* Generate ETL code in Python or Scala, can modify the code
* Can provide your own Spark or PySpark scripts
* Target can be S3, JDBC, or a table in the Glue Data Catalog
* Fully managed and cost effective: you pay only for the resources consumed
* Jobs are run on a serverless Spark platform
* Glue Scheduler to schedule the jobs
* Glue Triggers to automate job runs based on "events"
<br>
<font color = "green">This part is very useful for preparing the dataset. </font>

##### Glue ETL advantages

* DropFields, DropNullFields
* Filter
* Join
* Map
* Format Conversions
<br>
<font color ="green">You can shape your data here. This is also where you should get your data ready for machine learning. </font><br>
<font color = "red">This is the data pre-processing part. </font>
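
To make the idea of these transforms concrete, here is a plain-Python stand-in that works on a list of dictionaries. It does not use the `awsglue` library or its API. The function names, the "null in every row" reading of DropNullFields, and the sample sensor data are assumptions for illustration only.

```python
from typing import Callable, Dict, List, Set

Row = Dict[str, object]


def drop_fields(rows: List[Row], fields: Set[str]) -> List[Row]:
    """Remove the named columns from every row."""
    return [{k: v for k, v in row.items() if k not in fields} for row in rows]


def drop_null_fields(rows: List[Row]) -> List[Row]:
    """Remove columns whose value is None in every row."""
    keys = {k for row in rows for k in row}
    all_null = {k for k in keys if all(row.get(k) is None for row in rows)}
    return drop_fields(rows, all_null)


def filter_rows(rows: List[Row], predicate: Callable[[Row], bool]) -> List[Row]:
    """Keep only the rows for which the predicate holds."""
    return [row for row in rows if predicate(row)]


def map_rows(rows: List[Row], fn: Callable[[Row], Row]) -> List[Row]:
    """Apply a function to a copy of each row."""
    return [fn(dict(row)) for row in rows]


raw = [
    {"id": 1, "temp_c": 21.5, "debug": "x", "unused": None},
    {"id": 2, "temp_c": None, "debug": "y", "unused": None},
    {"id": 3, "temp_c": 19.0, "debug": "z", "unused": None},
]

rows = drop_fields(raw, {"debug"})
rows = drop_null_fields(rows)
rows = filter_rows(rows, lambda r: r["temp_c"] is not None)
rows = map_rows(rows, lambda r: {**r, "temp_f": r["temp_c"] * 9 / 5 + 32})
print(rows)
```

In a real Glue job the same steps would run on a distributed dataset on the serverless Spark platform. The point here is only the order of operations: drop, filter, then derive new fields.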

#### AWS Data Stores for Machine Learning


##### Redshift
* Data warehousing, SQL analytics
* Load data from S3 to Redshift
* Use Redshift Spectrum to query data directly

##### RDS, Aurora
* Relational Store, SQL
* Must provision servers in advance

##### DynamoDB
* NoSQL data store, serverless, provision read/write capacity
* Useful to store a machine learning model served by your application

##### S3
* Object Storage
* Serverless, infinite storage
* Integration with most AWS Services


##### ElasticSearch
* Indexing of data
* Search amongst data points
* Clickstream Analytics

##### ElastiCache
* Caching mechanism
* Not really used for Machine Learning

Pipeline Example

![pipeline-example.png](attachment:pipeline-example.png)

#### Glue vs AWS Data Pipeline

###### Glue
* Run Apache Spark code, Scala or Python based, focus on the ETL
* Do not worry about configuring or managing the resources
* Data Catalog to make the data available to Athena or Redshift Spectrum

###### Data Pipeline
* Orchestration service
* More control over the environment, compute resources that run code
* Allows access to EC2 or EMR instances (creates resources in your own account)

#### AWS Batch
* Run batch jobs as Docker images
* Dynamic provisioning of the instances
* Optimal quantity and type based on volume and requirements
* No need to manage clusters, fully serverless
* Schedule Batch jobs using CloudWatch Events
* Orchestrate Batch Jobs using AWS Step Functions

<font color ="green">For any non-ETL related work, Batch is probably better</font>

#### Database Migration Service (DMS)
* Quickly and securely migrate databases to AWS, resilient, self healing
* The source database remains available during the migration
* Continuous Data Replication using CDC
* You must create an EC2 instance to perform the replication tasks

![DMS.png](attachment:DMS.png)

### Amazon Athena

* Interactive query service for S3
* Serverless
* Supports CSV, JSON, ORC, Parquet, Avro
* Unstructured, semi-structured or structured
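
A hedged sketch of what submitting an Athena query might look like. The dictionary uses the parameter names of boto3's `start_query_execution` (`QueryString`, `QueryExecutionContext`, `ResultConfiguration`). The database `ml_data`, the table `events`, its `year`/`month` partition columns, and the results prefix are all hypothetical.

```python
def athena_query_request(sql: str, database: str, output_location: str) -> dict:
    """Build keyword arguments for an Athena query submission."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        # Athena writes query results to this S3 location.
        "ResultConfiguration": {"OutputLocation": output_location},
    }


request = athena_query_request(
    sql="SELECT COUNT(*) FROM events WHERE year = '2020' AND month = '07'",
    database="ml_data",
    output_location="s3://aws-machine-learning-furkan/athena-results/",
)
print(request)
```

With boto3 installed, the request could be sent with `athena.start_query_execution(**request)`. Filtering on partition columns, where the table has them, limits how much S3 data the query scans.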

![athena.png](attachment:athena.png)