# **AWS Storage Service - Simple Storage Service (S3)**

## **Learning Outcomes**


---
*After going through this notebook, the student should be able to explain:*


1.   *An overview of Simple Storage Service (S3).*
2.   *The criteria for creating a DNS-compliant bucket name.*
3.   *Bucket permissions such as ACLs and bucket policies.*
4.   *How to work with bucket properties (versioning, static website hosting, etc.) using the AWS console.*
5.   *The storage classes (S3 Standard, Standard-IA, etc.) available to users.*
6.   *The concept of cross-region replication.*


---






### **S3 Overview**


---
After several discussions with your architect and product engineer, the team decided to build a **data lake** on **AWS**. As the data engineer, you evaluated the options and concluded that AWS is the best fit, based on the following **capabilities**:


  * S3 is the AWS object storage service. Objects are stored in a bucket: the bucket is the **container** and the object is the **entity**.

  * S3 is a **regional service**. It keeps **copies** of each object in multiple **Availability Zones** within the region, which makes the storage durable.

  * A bucket can hold an unlimited number of objects, but a single object cannot exceed **5 TB**.

  * In a free tier account, 5 GB of storage is free per month.

  * By default, an account can create up to **100 buckets** (this is a service quota, so it can be raised on request and AWS may change it).

  * **S3 API** calls can be made from the **AWS console**, through an **SDK** from a programming language, or with the **CLI** from a terminal.







### **DNS Compliant Bucket Name**


---


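*   Unique: no other bucket may already use the name.

*   Lowercase letters only.

*   Alphanumeric: letters and digits.

*   The special characters hyphen (-) and period (.) are allowed only in the middle of the name, not at the start or end.

*   Length: between 3 and 63 characters.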
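
The rules above can be checked locally before a bucket is created. The following is a minimal sketch, not an official validator: the function name `is_dns_compliant` is hypothetical, it adds one rule that is not in the list above (two adjacent periods are rejected), and it cannot check uniqueness, which only S3 can confirm at creation time.

```python
import re

# First and last characters must be a lowercase letter or a digit;
# the 1 to 61 characters in between may also be '-' or '.'.
# The total length is therefore 3 to 63 characters.
_BUCKET_NAME_RE = re.compile(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]")


def is_dns_compliant(name: str) -> bool:
    """Return True if `name` satisfies the naming rules listed above."""
    if _BUCKET_NAME_RE.fullmatch(name) is None:
        return False
    # Assumption beyond the list above: two adjacent periods are rejected.
    return ".." not in name


# Hypothetical candidate names, one valid and four invalid.
for candidate in ["my-data.lake-01", "My-Bucket", "ab", "-logs", "logs..2024"]:
    print(candidate, is_dns_compliant(candidate))
```

Only the first candidate passes; the others fail on uppercase letters, length, a leading hyphen, and adjacent periods respectively.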







### **Bucket creation**


---



*   When a bucket is created, the **default permission** is that all **requests** from the **public** to this bucket are **blocked**. Keep this in mind: the permissions have to be changed before anyone else can be given access.

*   Once the permissions are changed, objects put into the bucket can be made accessible to the public. Permissions can be changed with an **ACL** or a **Bucket Policy**.






<font size="3" color="red"><b>This notebook discusses how to work with S3 from the AWS console</b></font>

## **S3 from Console**

### **Permissions**

#### **Bucket Permissions**


---


*   **Access control list (ACL)** - controls access to the bucket. It can share the bucket with other accounts, open the bucket for public access, and enable server access logging.

*   **Bucket policy** - a JSON document that controls which actions are allowed on the bucket and on the objects within it.
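
To make the idea of a JSON bucket policy concrete, here is a minimal sketch that builds a policy allowing anyone to read the objects in a bucket. The helper name `public_read_policy` and the bucket name `example-datalake-bucket` are hypothetical; the resulting JSON is the kind of text that would be pasted into the bucket policy editor in the console.

```python
import json


def public_read_policy(bucket_name: str) -> str:
    """Build a bucket policy that lets anyone read objects in the bucket."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "PublicReadGetObject",
                "Effect": "Allow",
                "Principal": "*",          # anyone
                "Action": "s3:GetObject",  # read objects only
                # The trailing /* targets the objects, not the bucket itself.
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
            }
        ],
    }
    return json.dumps(policy, indent=2)


# Hypothetical bucket name, used only for illustration.
print(public_read_policy("example-datalake-bucket"))
```

Such a policy only takes effect if the bucket's default block on public access has been relaxed first.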



### **Properties**

#### **Versioning**


---


* Versioning is **disabled** by **default**. It can be enabled, but **once** it is **enabled** it can only be **suspended** (not disabled).

* Versioning keeps **track** of every version of an **object**.

* The benefit of versioning is that we can **retrieve any version** of the object.

* However, each version takes up its own space, so the storage cost goes up.
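
The cost effect of keeping several versions can be estimated with a little arithmetic. The sketch below is an illustration under stated assumptions, not AWS's billing code: the function name `gb_months` is hypothetical, 1 GB is taken as 2^30 bytes, storage is averaged over the days of the month, and the example sizes are made up.

```python
def gb_months(versions, days_in_month):
    """Average storage over the month, in GB, for a list of object versions.

    `versions` is a list of (size_in_bytes, days_stored) pairs.
    """
    bytes_per_gb = 2 ** 30  # assumption: 1 GB = 2^30 bytes
    byte_days = sum(size * days for size, days in versions)
    return byte_days / (bytes_per_gb * days_in_month)


GB = 2 ** 30
# Version 1 (1 GB) exists for the whole 30-day month.
# Version 2 (2 GB) is written on day 16, so it is stored for 15 days.
# With versioning enabled, version 1 is kept after the overwrite.
versions = [(1 * GB, 30), (2 * GB, 15)]
print(gb_months(versions, days_in_month=30))  # 2.0
```

Without versioning, the overwrite would have replaced the 1 GB object, and the month would average 1.5 GB instead of 2.0 GB.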




#### **Static Webpage Hosting**


---


* We specify an index document and an error document, for example index.html as the index document and error.html as the error document.

* Then we upload the documents into the bucket.

* Finally, we make the objects public.
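
The same index and error documents can be expressed as a configuration dictionary. This is a minimal sketch: the helper name `website_configuration` and the bucket name in the comment are hypothetical, and the dictionary follows the shape that boto3's `put_bucket_website` call accepts as `WebsiteConfiguration`. The boto3 call itself is shown only as a comment, since it needs boto3 installed and AWS credentials configured.

```python
def website_configuration(index_doc="index.html", error_doc="error.html"):
    """Return a static website configuration for an S3 bucket."""
    return {
        "IndexDocument": {"Suffix": index_doc},  # served for directory-style requests
        "ErrorDocument": {"Key": error_doc},     # served when a request fails
    }


config = website_configuration()
print(config)

# Applying it programmatically (not run here; bucket name is hypothetical):
#   import boto3
#   boto3.client("s3").put_bucket_website(
#       Bucket="example-site-bucket",
#       WebsiteConfiguration=config,
#   )
```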





#### **Encryption**


---


* We have two types of encryption: **AES-256** and **AWS-KMS**.

* Both encrypt data at **rest**. With the former, the keys are **managed by AWS**. With the **latter**, the keys are created and **managed by the user** in AWS KMS.



#### **Tags**


---


* Tags are **key-value pairs** used for documentation and billing purposes.

* For example, if the resources belonging to a particular project are tagged, we can check the cost of that project's resources in the billing service.



#### **Transfer Acceleration**


---


* Uses **edge locations** to increase the upload speed to the S3 bucket.

* For example, if the bucket is hosted in the Mumbai region and users in other regions such as Europe or the Middle East upload gigabytes of data into it, the upload takes time. To accelerate the upload, Transfer Acceleration uses the edge location to find the shortest possible route for the data.

* It can increase the transfer speed by up to **300%**.

* It is not a free service.





### **Management**

#### **Storage Classes**


---





**S3 Standard**


---


*   Highly available

*   Highly durable

*   Easy and fast access to data

*   Can access data frequently

**S3 Standard-IA**


---

Same as S3 Standard, with the following differences:

* The minimum billable object size is 128 KB.

* Data has to stay in Standard for 30 days before it can be transitioned to Standard-IA.

* There is a retrieval fee for every GB of data retrieved.

* Suited to data that is accessed infrequently, e.g. archival data.

**Intelligent-tiering**


---


* Data in the frequent access tier is automatically moved to the infrequent access tier if it is not accessed for 30 days, which saves cost.

* Useful when the frequency of data retrieval cannot be anticipated.

**One Zone-IA**


---


* Data is stored redundantly, but only within a single AZ.

* Used for secondary backups (a backup of a backup) and non-critical data.

**Glacier**


---


* Cheapest storage option, but retrieval takes hours.

* For example, you can store compliance data that will not be accessed for months or even years.













#### **Lifecycle Management**


---

* The lifecycle of the current and previous versions of an object is configured here. This reduces storage cost and makes the objects in the bucket easier to manage. **Eg:** When you configure lifecycle management rules, objects are moved from one storage tier to another after a set period and are ultimately deleted, which results in cost savings.
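
A lifecycle rule of the kind described above can be sketched as a configuration dictionary plus a small local model of its effect. Everything here is illustrative: the rule ID, the day thresholds, and the helper `storage_class_at` are hypothetical. The dictionary follows the shape boto3's `put_bucket_lifecycle_configuration` accepts, but nothing is sent to AWS, and the real service applies rules asynchronously rather than at the exact day boundary.

```python
# Hypothetical rule: Standard-IA after 30 days, Glacier after 90 days,
# deletion after 365 days.
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "archive-then-expire",
            "Status": "Enabled",
            "Filter": {"Prefix": ""},  # empty prefix: applies to every object
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},
            ],
            "Expiration": {"Days": 365},
        }
    ]
}


def storage_class_at(age_days, rule):
    """Local model: where an object of the given age sits under `rule`."""
    if age_days >= rule["Expiration"]["Days"]:
        return "DELETED"
    current = "STANDARD"
    # Apply transitions in order of age; the last one reached wins.
    for transition in sorted(rule["Transitions"], key=lambda t: t["Days"]):
        if age_days >= transition["Days"]:
            current = transition["StorageClass"]
    return current


rule = lifecycle_configuration["Rules"][0]
for age in (10, 45, 200, 400):
    print(age, storage_class_at(age, rule))
```

The loop prints `STANDARD`, `STANDARD_IA`, `GLACIER`, and `DELETED` for the four ages, which mirrors the tier-by-tier movement and final deletion described in the bullet above.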

#### **Cross Region Replication**


---

* Two buckets in different regions are linked, so that the destination bucket receives a copy of every object newly added to the source bucket.


* For this to work, versioning has to be enabled on both buckets.
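
The two preconditions above (different regions, versioning on both buckets) can be written as a small check. This is a local sketch only: the `Bucket` dataclass and the `replication_problems` helper are hypothetical and do not call any AWS API, and the bucket names and regions are made up.

```python
from dataclasses import dataclass


@dataclass
class Bucket:
    """Hypothetical local description of a bucket (not an AWS type)."""
    name: str
    region: str
    versioning_enabled: bool


def replication_problems(source: Bucket, destination: Bucket) -> list:
    """List the reasons cross-region replication cannot be set up yet."""
    problems = []
    if source.region == destination.region:
        problems.append("buckets are in the same region")
    for bucket in (source, destination):
        if not bucket.versioning_enabled:
            problems.append(f"versioning is not enabled on {bucket.name}")
    return problems


source = Bucket("example-source", "ap-south-1", versioning_enabled=True)
destination = Bucket("example-replica", "us-west-2", versioning_enabled=False)
print(replication_problems(source, destination))
```

An empty list means both preconditions from the bullets are met.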




## **Reference**

**<font size="3" color="green"><b>Special thanks to Rohan Aurora for creating this beautiful resource</b></font>**



In [None]:
#@markdown **1. Video on EC2**
from IPython.display import YouTubeVideo
YouTubeVideo('q5kSzwx7x1U', width=900, height=500)

## **It's Practice Time**

**Answer the following questions**


---



1. If versioning is enabled for your S3 bucket, how will you be billed in the following scenario? At the start of the month you have a 3.2 GB file (3,294,967,296 bytes), and at the start of the 13th day of the month the same file is overwritten, resulting in a file size of 7.3 GB (7,368,709,120 bytes). Assume the month has 31 days, the account is a free tier account, and the bucket uses Standard storage in the Mumbai region.


---


2. What is the significance of durability in an S3 storage class, and what is the durability of S3 One Zone-IA?

---

3. When should you use an ACL, and when should you use a bucket policy?

---

4. What is the best way to upload a 20GB file onto S3?

---

5. What is a good option to improve S3 performance when we have a high number of GET requests?

---

6. Santhosh is teaching how to generate data using Kinesis streams, and he wants to store this data in S3. He needs the last 30 days of data for big data analysis. Data older than 30 days is kept as a backup so that he can share it with any of his students on request (although he feels this data is not mission-critical and would not mind losing it). Data older than one year has to be deleted. Can you propose a solution that reduces his monthly bill?

---

7. Santhosh, who is located in India, wants to do big data analysis on Databricks. Since he is using the Databricks Community Edition, it is only available in us-west-2. He wants the data to be stored in AWS S3 for ease of availability and access, but he needs a suggestion on which region the bucket should be created in. As a data engineer, what would you suggest, and why?

---

8. Santhosh wants to develop an application that uploads images, storing the images in one location and the metadata of each image in another location. As a data professional, what would you suggest?