<!--
Where are we?
-----

[Let's look at the map](http://insightdataengineering.com/blog/pipeline_map.html)
-->

The Cloud & AWS
===

![Living in the Cloud](https://s3-us-west-2.amazonaws.com/dsci/6007/assets/living_in_the_cloud.jpg)

By the end of this session you will be able to:
----

- Explain what Cloud Computing is
- Provide reasons why Amazon leads in the cloud space 
- Provide pros and cons of cloud vs on-prem data systems
- Explain what AWS, S3, EC2 are
- Explain how cloud computing requires a different worldview

What is the cloud?
----

In the simplest terms, people running computers on your behalf that you can reach over the internet.

What is software-as-a-service (SaaS) as part of the cloud?
> A company gives you an API over the internet and then handles everything else for you.

<details><summary>
Q: What is a "no cloud" solution?
</summary>
The proverbial computer under your desk. 

You provide the internet connection and the electricity (and maintenance).
</details>

Why is The Cloud so popular?
---

Cloud vs On-Prem
----------------

What are the pros and cons of cloud vs on-premises hosting or *on-prem*?

It is like taking an Uber versus owning your own car.

Feature          |Cloud                    |On-Prem
-------          |-----                    |-------
Cost             |Higher variable cost     |Higher fixed cost
Capacity         |Elastic                  |Fixed
Performance      |Moderate                 |Can be better if within
Security         |Provider secures         |Company secures
Office Politics  |Teams get own resources  |Teams compete for fixed resources
Time to Setup    |Fast!                    |Slow!
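
The cost row can be made concrete with a little arithmetic. This is a minimal sketch, not real pricing: `RATE_PER_HOUR` and `FIXED_MONTHLY_COST` are hypothetical numbers chosen only to show where the two cost models cross.

```python
# Hypothetical prices, for illustration only.
RATE_PER_HOUR = 0.25        # cloud: dollars per server-hour (variable cost)
FIXED_MONTHLY_COST = 100.0  # on-prem: dollars per month, used or not (fixed cost)

def cloud_cost(hours_used, rate_per_hour=RATE_PER_HOUR):
    """Cloud cost grows with usage."""
    return hours_used * rate_per_hour

def break_even_hours(fixed_monthly_cost=FIXED_MONTHLY_COST,
                     rate_per_hour=RATE_PER_HOUR):
    """Hours per month at which cloud and on-prem cost the same."""
    return fixed_monthly_cost / rate_per_hour

print(break_even_hours())   # hours of use where the two options cost the same
print(cloud_cost(100))      # light use: cloud is cheaper than the fixed cost
print(cloud_cost(720))      # a server running all month (30 days * 24 hours)
```

Under these made-up numbers, a machine that is busy only occasionally is cheaper in the cloud, while one that runs around the clock is cheaper on-prem.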

---


![Evolution of the Cloud](https://s3-us-west-2.amazonaws.com/dsci/6007/assets/evolution_of_the_cloud.jpg)

Who are the major cloud providers?
----

![as a Service](https://s3-us-west-2.amazonaws.com/dsci/6007/assets/covering_your_.jpg)


---
Steve Yegge and Decoupled Design
--------------------------------

<img src="https://s3-us-west-2.amazonaws.com/dsci/6007/assets/yegge.jpg">

Who is Steve Yegge?

- Steve Yegge is a developer who has worked at both Amazon and Google.

- Steve blogged a long [rant][yegge-rant] about Amazon's APIs vs
  Google's APIs.

[yegge-rant]: https://plus.google.com/+RipRowan/posts/eVeouesvaVX

What is the difference between Amazon's and Google's APIs?

- At Amazon, developers have to use Amazon's public APIs for their
  internal dependencies.
- At Google, developers can use private APIs for dependencies.
- The forced dogfooding makes Amazon's APIs more decoupled.


---
Why AWS?
---

![](http://www.datacenterknowledge.com/wp-content/uploads/2015/05/Screen-Shot-2015-05-28-at-10.23.03-AM-e1432833116144.png)
[Source](http://www.datacenterknowledge.com/archives/2015/05/28/gartner-aws-pulls-further-ahead-in-iaas-cloud-market/)

----
[Explore console page](https://us-west-2.console.aws.amazon.com/console/home?region=us-west-2#)

----
What are the primary services that Amazon AWS offers?
-----

| Name   | Full Name | Should have been called | Service | Use this to |
|:------:|:---------:|:-----------------------:|:-------:|:-----------:|
| S3     | Simple Storage Service     | Unlimited FTP Server  | Storage | Store images and other assets for websites. Keep backups and share files between services. Host static websites. Also, many of the other AWS services write and read from S3. |
| EC2    | Elastic Compute Cloud      | Virtual Servers | Execution | Host the bits of things you think of as a computer  |


----
What is AWS S3?
----

![S3](https://s3-us-west-2.amazonaws.com/dsci/6007/assets/s3.png)

Amazon S3 is a simple key-value store designed to store as many objects as you want.

You store these objects in one or more buckets. 

__This is your 1st NoSQL datastore!__

Buckets and Files
-----------------

What is a bucket?

- A bucket is a container for files.

- Think of a bucket as a logical grouping of files like a sub-domain.

- A bucket can contain an arbitrary number of files.

How large can a file in a bucket be?

- A file in a bucket can be up to 5 TB.

Bucket Names
------------

What are best practices on naming buckets?

- Bucket names should be DNS-compliant.

- They must be at least 3 and no more than 63 characters long.

- They must be a series of one or more labels, separated by a single
  period. 
  
- Bucket names can contain lowercase letters, numbers, and hyphens. 

- Each label must start and end with a lowercase letter or a number.

- Bucket names must not be formatted as an IP address (e.g., 192.168.5.4).

What are some examples of valid bucket names?

- `myawsbucket`

- `my.aws.bucket`

- `myawsbucket.1`

What are some examples of invalid bucket names? 

- `.myawsbucket`

- `myawsbucket.`

- `my..examplebucket`

Check for understanding
--------

<details><summary>
Q: Why are these bucket names invalid?
</summary>
Bucket names cannot start or end with a period, and they cannot have
multiple periods next to each other.
</details>
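
The naming rules listed above can be checked in code before calling AWS. The following is a minimal sketch that encodes only the rules from this section; `is_valid_bucket_name` is a hypothetical helper, not part of Boto, and AWS may enforce further restrictions.

```python
import re

# One label: starts and ends with a lowercase letter or digit,
# with lowercase letters, digits, or hyphens in between.
LABEL = re.compile(r'\A[a-z0-9]([a-z0-9-]*[a-z0-9])?\Z')

# Four dot-separated groups of digits, e.g. 192.168.5.4.
IP_LIKE = re.compile(r'\A\d{1,3}(\.\d{1,3}){3}\Z')

def is_valid_bucket_name(name):
    """Return True if name satisfies the rules listed in this section."""
    if not 3 <= len(name) <= 63:
        return False
    if IP_LIKE.match(name):
        return False
    # Splitting on '.' turns a leading, trailing, or doubled period
    # into an empty label, which LABEL rejects.
    return all(LABEL.match(label) for label in name.split('.'))

for candidate in ['myawsbucket', 'my.aws.bucket', 'myawsbucket.1',
                  '.myawsbucket', 'myawsbucket.', 'my..examplebucket']:
    print(candidate, is_valid_bucket_name(candidate))
```

Running the check locally gives quicker feedback than waiting for a rejected request.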

Creating Buckets
----------------

Q: How can I create a bucket?

- Get your access key and secret key from the `rootkey.csv` that you
  downloaded from Amazon AWS.
  
- In the following snippet replace `/dev/null` with `~/.aws/credentials` 
  (on Linux/Mac) or `%USERPROFILE%\.aws\credentials` (on Windows), and 
  replace `ACCESS_KEY` and `SECRET_KEY` with the keys from `rootkey.csv`.
  
        %%writefile /dev/null      
        [default]
        aws_access_key_id = ACCESS_KEY
        aws_secret_access_key = SECRET_KEY

- Create a connection to S3.
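
If you are not working in a notebook, the same credentials file can be written from plain Python. This is a sketch for Python 3 using the standard-library `configparser`; `write_credentials` is a hypothetical helper, and it writes to a temporary directory here so that it cannot overwrite a real `~/.aws/credentials`.

```python
import configparser
import os
import tempfile

def write_credentials(path, access_key, secret_key):
    """Write a [default] profile in the format shown above."""
    parser = configparser.RawConfigParser()
    parser.add_section('default')
    parser.set('default', 'aws_access_key_id', access_key)
    parser.set('default', 'aws_secret_access_key', secret_key)
    with open(path, 'w') as handle:
        parser.write(handle)
    # Restrict the file to the current user, since it holds a secret.
    os.chmod(path, 0o600)

# Placeholder values; substitute the keys from rootkey.csv.
demo_path = os.path.join(tempfile.mkdtemp(), 'credentials')
write_credentials(demo_path, 'ACCESS_KEY', 'SECRET_KEY')

# Read the file back to confirm it parses.
check = configparser.RawConfigParser()
check.read(demo_path)
print(check.get('default', 'aws_access_key_id'))
```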

---
Start using AWS
---

1. Sign up for the Free Tier (while we are waiting on Activate authorization)
2. Create a [Security Credential](https://console.aws.amazon.com/iam/home?#security_credential)
3. Install boto (i.e., the AWS API for Python)
4. [Configure Boto Credentials](http://boto.cloudhackers.com/en/latest/getting_started.html)

Then the following code should run:

```python
import boto
conn = boto.connect_s3()
print(conn)
```

```
S3Connection:s3.amazonaws.com
```


- List all the buckets.

Upgrading Boto
--------------

Q: Boto is not able to find the credentials. How can I fix this?

- Older versions of Boto were not able to read the credentials file.

- You might run into this problem on the EC2 instance.

- Here is how to upgrade Boto to the latest version.

```
!conda update boto -y
```

```
Fetching package metadata .......
Solving package specifications: ..........

Package plan for installation in environment /Users/alessandro/anaconda/envs/dsci6007:

The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    boto-2.45.0                |           py27_0         1.4 MB

The following packages will be UPDATED:

    boto: 2.43.0-py27_0 --> 2.45.0-py27_0

Fetching packages ...
boto-2.45.0-py 100% |################################| Time: 0:00:00   5.39 MB/s
Extracting packages ...
[      COMPLETE      ]|###################################################| 100%
Unlinking packages ...
[      COMPLETE      ]|###################################################| 100%
Linking packages ...
[      COMPLETE      ]|###################################################| 100%
```


Check for understanding
--------

<details><summary>
Q: What is latency?
</summary>
Latency is the time it takes between making a request and the start of a response.
</details>

<details><summary>
Q: Which is better? Higher latency or lower?
</summary>
Lower is better.
</details>

<details><summary>
Q: Why is S3 latency higher than EBS?
</summary>
One reason is that an EBS volume is in the same availability zone as the EC2 instance it is attached to.
</details>
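
Latency, as defined above, can be measured separately from the total time of a request. The sketch below is written for Python 3 and does not touch AWS: `slow_response` is a hypothetical stand-in for a remote service that waits before sending its first chunk.

```python
import time

def slow_response(delay_seconds, chunks):
    """Hypothetical service: waits, then streams the chunks one by one."""
    time.sleep(delay_seconds)
    for chunk in chunks:
        yield chunk

def measure(response):
    """Return (latency, total_time, data) for an iterable response."""
    start = time.perf_counter()
    first = next(response)                   # blocks until the response starts
    latency = time.perf_counter() - start    # time to the start of the response
    rest = list(response)                    # drain the remaining chunks
    total = time.perf_counter() - start      # time to the end of the response
    return latency, total, [first] + rest

latency, total, data = measure(slow_response(0.05, ['hello', 'world']))
print('latency: %.3f s, total: %.3f s' % (latency, total))
```

The same pattern applies to a real request: start the clock, wait for the first byte, and record that time separately from the time to finish the download.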

```python
conn.get_all_buckets()
```

```
[<Bucket: aws-logs-608193005321-us-east-1>,
 <Bucket: aws-logs-608193005321-us-west-2>,
 <Bucket: dsci>,
 <Bucket: dsci6007lab>,
 <Bucket: ill-instructor>,
 <Bucket: isaac1>,
 <Bucket: kinesis-lab5>,
 <Bucket: mrjob-f99dcdcfee39923f>,
 <Bucket: seattle-dsi>]
```

- Create new bucket.

```python
import os
import random

# Build a bucket name from the login name plus a random number.
user = os.environ['USER']
bucket_name = user + str(int(random.random()*1000))
bucket_name = bucket_name.lower()
print(bucket_name)
bucket = conn.create_bucket(bucket_name)
print(bucket)
```

```
alessandro502
<Bucket: alessandro502>
```
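
Bucket names are shared by all S3 users, so a random number below 1000 can easily collide with a name that is already taken. A minimal alternative, assuming the prefix already follows the naming rules, is to append part of a UUID; `unique_bucket_name` is a hypothetical helper, not a Boto function.

```python
import uuid

def unique_bucket_name(prefix):
    """Return prefix plus a random 12-character hex suffix, at most 63 characters."""
    suffix = uuid.uuid4().hex[:12]           # lowercase hex digits only
    # Trim the prefix, not the suffix, to stay within the 63-character limit.
    return '%s-%s' % (prefix.lower()[:50], suffix)

print(unique_bucket_name('alessandro'))
```

The result can be passed to `conn.create_bucket` in place of `bucket_name`.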


Adding Files
------------

Q: How can I add a file to a bucket?

- List files.

```python
bucket.get_all_keys()
```

```
[]
```

- Add file.

```python
file_key = bucket.new_key('file.txt')
print(file_key)
file_key.set_contents_from_string('hello world!!')
```

```
<Key: alessandro502,file.txt>
13
```

- List files again. New file should appear.

```python
bucket.get_all_keys()
```

```
[<Key: alessandro502,file.txt>]
```

Q: How can I get a file from a bucket?

- Get file.

```python
f = bucket.get_key('file.txt')
print(f.get_contents_as_string())
```

```
hello world!!
```



Creating Buckets With Periods
-----------------------------

Q: How can I create a bucket in Boto with a period in the name?

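- There is a bug in Boto that causes `create_bucket` to fail if the
  bucket name has a period in it.

- To get around this, run this code snippet. Note that it turns off
  HTTPS certificate verification for every connection that uses
  Python's default SSL context, so use it only for this exercise.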

```python
import ssl
if hasattr(ssl, '_create_unverified_context'):
    ssl._create_default_https_context = ssl._create_unverified_context
```

- Now try creating the bucket with a period in its name and it should work.

```python
bucket_name_with_period = bucket_name + ".1.2.3"
bucket_with_period = conn.create_bucket(bucket_name_with_period)
bucket_with_period.delete()
```

- For more details see <https://github.com/boto/boto/issues/2836>.


Access Control
--------------

Q: I want to access my S3 file from a web browser without giving my
access and secret keys. How can I open up access to the file to
anyone?

- You can set up Access Control Lists (ACLs) at the level of the
  bucket or at the level of the individual objects in the bucket
  (folders, files).

Q: What are the different ACL policies?

ACL Policy           |Meaning
----------           |-------
`private`            |No one else besides owner has any access rights.
`public-read`        |Everyone has read access.
`public-read-write`  |Everyone has read/write access.
`authenticated-read` |Registered Amazon S3 users have read access.

Q: What do `read` and `write` mean for buckets and files?

- Read access to a file lets you read the file.

- Read access to a bucket or folder lets you see the names of the
  files inside it.

Pop Quiz
--------

<details><summary>
Q: If a bucket is `private` and a file inside it is `public-read` can
I view it through a web browser?
</summary>
Yes. Access to the file is only determined by its ACL policy.
</details>

<details><summary>
Q: If a bucket is `public-read` and a file inside it is `private` can
I view the file through a web browser?
</summary>
No, you cannot. However, if you access the URL for the bucket you will see the file listed.
</details>
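
The two quiz answers can be summarized as a small truth function. This is a simplified model of the canned ACLs in the table above for an anonymous web visitor; the function names are hypothetical, and real S3 access can also depend on settings that this section does not cover.

```python
# Canned ACLs from the table above that grant read access to everyone.
PUBLIC_READ_POLICIES = ('public-read', 'public-read-write')

def anonymous_can_read_file(bucket_acl, file_acl):
    """Reading a file depends only on the file's own ACL."""
    # bucket_acl is accepted to make the point that it is ignored.
    return file_acl in PUBLIC_READ_POLICIES

def anonymous_can_list_bucket(bucket_acl):
    """Read access on a bucket means seeing the names of its files."""
    return bucket_acl in PUBLIC_READ_POLICIES

print(anonymous_can_read_file('private', 'public-read'))   # first quiz question
print(anonymous_can_read_file('public-read', 'private'))   # second quiz question
print(anonymous_can_list_bucket('public-read'))            # listing the bucket
```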

Applying Access Control
-----------------------

Q: How can I make a file available on the web so anyone can read it?

- Create a file with a specific ACL.

```python
file2 = bucket.new_key('file2.txt')
file2.set_contents_from_string('hello world!!!', policy='private')
```

```
14
```

- Try reading the file.

```python
file2_url = 'http://s3.amazonaws.com/' + bucket_name + '/file2.txt'
print(file2_url)
!curl $file2_url
```

```
http://s3.amazonaws.com/alessandro502/file2.txt
<?xml version="1.0" encoding="UTF-8"?>
<Error><Code>AccessDenied</Code><Message>Access Denied</Message><RequestId>D3BAF9F08D6046AC</RequestId><HostId>RNhB+aLJktFhpyFqIo9isDv4YuXxdpIIzgqbicnUsiUSw9ML7e4cxr6npjmrXzSW/DIn3PB8P1c=</HostId></Error>
```

- Now change its ACL.

```python
file2.set_acl('public-read')
!curl $file2_url
```

```
hello world!!!
```

- You can also try accessing the file through a browser.

- If you do not specify the ACL for a file when you set its contents,
  the file is `private` by default.

S3 Files to URLs
----------------

Q: How can I figure out the URL of my S3 file?

- As above, you can compose the URL using the region, bucket, and file name. 

- For N. Virginia the general template for the URL is `http://s3.amazonaws.com/BUCKET/FILE`.

- You can also find the URL by looking at the file on the AWS web console.
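
The template can be turned into a small helper. This is a sketch for Python 3; `s3_url` is a hypothetical function, and the regional host pattern `s3-REGION.amazonaws.com` is an assumption modeled on the image URLs used elsewhere in these notes, so confirm the exact URL in the AWS web console.

```python
from urllib.parse import quote

def s3_url(bucket, key, region=None):
    """Compose an http URL for a file; region=None means N. Virginia."""
    if region in (None, 'us-east-1'):
        host = 's3.amazonaws.com'
    else:
        host = 's3-%s.amazonaws.com' % region   # assumed regional pattern
    # quote() percent-encodes spaces and similar characters but keeps '/'.
    return 'http://%s/%s/%s' % (host, bucket, quote(key))

print(s3_url('alessandro502', 'file2.txt'))
print(s3_url('dsci', '6007/assets/s3.png', region='us-west-2'))
```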

Deleting Buckets
----------------

Q: How can I delete a bucket?

- Try deleting a bucket containing files. What happens?

```python
print(conn.get_all_buckets())
bucket.delete()
```

```
[<Bucket: alessandro502>, <Bucket: aws-logs-608193005321-us-east-1>, <Bucket: aws-logs-608193005321-us-west-2>, <Bucket: dsci>, <Bucket: dsci6007lab>, <Bucket: ill-instructor>, <Bucket: isaac1>, <Bucket: kinesis-lab5>, <Bucket: mrjob-f99dcdcfee39923f>, <Bucket: seattle-dsi>]

S3ResponseError: S3ResponseError: 409 Conflict
<?xml version="1.0" encoding="UTF-8"?>
<Error><Code>BucketNotEmpty</Code><Message>The bucket you tried to delete is not empty</Message><BucketName>alessandro502</BucketName><RequestId>88DD24CF89EAF26D</RequestId><HostId>1dZqsDB0BoLexHXfPAiID3b8VC7ZbCii35rtojaDtoQYNbvJXLR+vRw+MIrrIwSiyHrQqUAWNFk=</HostId></Error>
```

- To delete the bucket first delete all the files in it.

```python
for key in bucket.get_all_keys():
    key.delete()
```

- Then delete the bucket.

```python
print(conn.get_all_buckets())
bucket.delete()
print(conn.get_all_buckets())
```

```
[<Bucket: alessandro502>, <Bucket: aws-logs-608193005321-us-east-1>, <Bucket: aws-logs-608193005321-us-west-2>, <Bucket: dsci>, <Bucket: dsci6007lab>, <Bucket: ill-instructor>, <Bucket: isaac1>, <Bucket: kinesis-lab5>, <Bucket: mrjob-f99dcdcfee39923f>, <Bucket: seattle-dsi>]
[<Bucket: aws-logs-608193005321-us-east-1>, <Bucket: aws-logs-608193005321-us-west-2>, <Bucket: dsci>, <Bucket: dsci6007lab>, <Bucket: ill-instructor>, <Bucket: isaac1>, <Bucket: kinesis-lab5>, <Bucket: mrjob-f99dcdcfee39923f>, <Bucket: seattle-dsi>]
```
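
The two steps, emptying the bucket and then deleting it, can be wrapped in one helper. In this sketch `empty_and_delete` is a hypothetical function that relies only on the `get_all_keys`, `key.delete`, and `bucket.delete` calls used above; `FakeBucket` and `FakeKey` are in-memory stand-ins, also hypothetical, so that the example runs without an AWS account.

```python
class FakeKey(object):
    """Stand-in for a Boto key: deleting it removes it from its bucket."""
    def __init__(self, bucket, name):
        self.bucket = bucket
        self.name = name

    def delete(self):
        del self.bucket.keys[self.name]

class FakeBucket(object):
    """Stand-in for a Boto bucket that refuses deletion while it holds files."""
    def __init__(self, names):
        self.keys = dict((name, FakeKey(self, name)) for name in names)
        self.deleted = False

    def get_all_keys(self):
        return list(self.keys.values())

    def delete(self):
        if self.keys:
            raise RuntimeError('BucketNotEmpty')
        self.deleted = True

def empty_and_delete(bucket):
    """Delete every file in the bucket, then the bucket itself."""
    for key in bucket.get_all_keys():
        key.delete()
    bucket.delete()

demo = FakeBucket(['file.txt', 'file2.txt'])
empty_and_delete(demo)
print(demo.deleted)
```

With a real connection, the same helper would be called as `empty_and_delete(bucket)`.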
