# Static Website on S3 with boto3

In [1]:
import boto3
from botocore.exceptions import ClientError

import logging
import os

## Why this project

As a Data Scientist, I care that my projects are deployed and running in production, hence my interest in Data Engineering and DevOps. I also care that business users have access to the insights and advanced visualizations that I produce.

That is why I make interactive visualizations with Jupyter notebooks. Each notebook can easily be converted to an HTML file (or an HTML file plus images), which in turn can serve as a static website. I host those websites.

AWS S3 is a simple and quick way to host them. Boto3 is the AWS SDK for Python. I am just starting with boto3 and I love it: no mouse-clicking, just a Python script.

## Purpose

Here I host a static website in an AWS S3 bucket using a Python script with the boto3 library. No manual interaction with the AWS console is needed. The website consists of many files, all of which are stored locally in one _nested_ directory.

## Get access to AWS

First of all, I need access to my AWS account.

Many tutorials explain at this point how to use your 'Access key ID' and your 'Secret access key' without exposing them in the code.

But we are on the Bertelsmann Challenge Cloud, aren't we? We have learned a better way: creating a programmatic user in AWS. We already created such a user during the course, so let us use it!

With a programmatic user, the credentials are saved on your computer in the `~/.aws/credentials` file.
Choose one of the profiles in this file and use its name (here 'default') to create a boto3 session.

In [2]:
session = boto3.Session(profile_name='default')

You also do not need to enter your region manually; it is available from the session.

In [3]:
current_region = session.region_name
print(current_region)

us-east-2


## Client and Resource

Boto3 offers _client_-level and _resource_-level access to AWS. A resource is an object-oriented interface to AWS and a higher-level abstraction than a client. See [a short explanation here](https://stackoverflow.com/questions/42809096/difference-in-boto3-between-resource-client-and-session).

My goal is to use the _resource_ level only. I use the [documentation here](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#service-resource).

For those interested in the _client_ level, I recommend the [GitHub of Balazs Kocsis](https://github.com/bkocis/bertelsmann-dsml-group-projects/blob/master/Project-boto/AWS-SDK-for-python-boto3.ipynb).

First I create the _resource_.


In [4]:
# Create the resource from the session so that it uses the selected profile
s3_resource = session.resource('s3')

## MIME types

Before starting with the bucket and the file upload, let me introduce the `mimetypes` Python module.
Have a look at its dictionary, which maps filename extensions to MIME types.

In [5]:
import mimetypes
for fileext in ['.html', '.png']:
    print('For file extension {:s} --> the MIME type is {:s}'.format(fileext, mimetypes.types_map[fileext]))

For file extension .html --> the MIME type is text/html
For file extension .png --> the MIME type is image/png


To handle unknown types, I convert it to a _defaultdict_ whose default MIME type is 'application/octet-stream'.

In [6]:
from collections import defaultdict
content_type = defaultdict(lambda: 'application/octet-stream', mimetypes.types_map)
for fileext in ['.html', '.png', '.coffee']:
    print('For file extension {:s} --> the MIME type is {:s}'.format(fileext, content_type[fileext]))

For file extension .html --> the MIME type is text/html
For file extension .png --> the MIME type is image/png
For file extension .coffee --> the MIME type is application/octet-stream


Why care about the MIME type? The browser needs it to display the HTML files of your website. When I uploaded files without specifying the MIME type explicitly, they were not displayed; instead, the browser offered to download them.
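
As an alternative to indexing `types_map` by extension, the standard library's `mimetypes.guess_type` takes the whole filename. A minimal sketch, where the helper name `guess_content_type` is my own and not part of the notebook above:

```python
import mimetypes

def guess_content_type(filename, default='application/octet-stream'):
    """Return the MIME type guessed from the filename, or `default` if unknown."""
    # guess_type returns a (type, encoding) tuple; type is None when nothing matches
    mime_type, _encoding = mimetypes.guess_type(filename)
    return mime_type or default

for name in ['index.html', 'me_assets/plot.png', '__packages']:
    print(name, '-->', guess_content_type(name))
```

It also handles names without a dot, such as `__packages`, which fall back to the default type.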

The content of my website is stored in a nested directory, and I want to upload all files from it. The AWS CLI has the `--recursive` flag for this purpose. But how to do this with boto3?
I only found the option to iterate over all the files.
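
The iteration can be sketched as a generator that pairs each local file with the object key it would get in the bucket. This is only an illustration: the name `iter_upload_pairs` is hypothetical, and it assumes keys should be relative to the top directory and use forward slashes.

```python
import os

def iter_upload_pairs(target_dir):
    """Yield (local_path, s3_key) for every file below target_dir."""
    for subdir, _dirs, files in os.walk(target_dir):
        for name in files:
            local_path = os.path.join(subdir, name)
            # Key relative to target_dir; S3 keys use '/' regardless of the local OS
            key = os.path.relpath(local_path, target_dir).replace(os.sep, '/')
            yield local_path, key
```

For example, `sorted(key for _, key in iter_upload_pairs('wholesale-more-frozen-products'))` would list the keys before anything is uploaded.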

Now I am ready to create the bucket and upload the content of `target_dir`.

## Create bucket and upload files

In [7]:
target_dir = 'wholesale-more-frozen-products'
bucket_name = target_dir

In [8]:
# ----- test structure of nested directory
# target_dir = bucket_name
# for subdir, dirs, files in os.walk(target_dir):
#     print(subdir)
#     #print(dirs)
#     for file in files:
#         print("\t", os.path.join(subdir, file)) #.replace(target_dir+'/', '')

In [9]:
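def create_bucket(s3_resource, aws_region, bucket_name):
    # us-east-1 is the default region and must not be passed as a LocationConstraint
    kwargs = {}
    if aws_region != 'us-east-1':
        kwargs['CreateBucketConfiguration'] = {'LocationConstraint': aws_region}
    try:
        s3_resource.create_bucket(Bucket=bucket_name, **kwargs)
    except ClientError as err:
        print(err)

    return None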

In [10]:
def upload_nested_directory(s3_resource, aws_region, bucket_name, target_dir):
    if not os.path.isdir(target_dir):
        raise ValueError('target_dir %r not found.' % target_dir)

    for subdir, dirs, files in os.walk(target_dir):
        for file in files:
            filename = os.path.join(subdir, file)
            # object key relative to target_dir, always with forward slashes
            s3_path = os.path.relpath(filename, target_dir).replace(os.sep, '/')
            fileext = os.path.splitext(file)[1].lower()  # file extension, '' if there is none
            with open(filename, 'rb') as body:  # the file is closed after the upload
                s3_resource.Object(bucket_name, s3_path).put(Body=body,
                                                              ACL='public-read',
                                                              ContentType=content_type[fileext])

            print('File uploaded to https://s3.%s.amazonaws.com/%s/%s' % (
                                                    aws_region, bucket_name, s3_path))
    return None
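
A caveat on `ACL='public-read'`: buckets created since April 2023 have ACLs disabled and Block Public Access enabled by default, so the `put` call above may be rejected on a fresh bucket. One alternative is a bucket policy that allows anonymous reads. The sketch below only builds the policy document; the helper name `public_read_policy` is hypothetical, and it assumes Block Public Access has already been relaxed for the bucket.

```python
import json

def public_read_policy(bucket_name):
    """Return a JSON bucket policy that lets anyone read every object in the bucket."""
    policy = {
        'Version': '2012-10-17',
        'Statement': [{
            'Sid': 'PublicReadGetObject',
            'Effect': 'Allow',
            'Principal': '*',
            'Action': 's3:GetObject',
            # The trailing /* applies the statement to all objects in the bucket
            'Resource': 'arn:aws:s3:::%s/*' % bucket_name,
        }],
    }
    return json.dumps(policy)

print(public_read_policy('wholesale-more-frozen-products'))
```

The resulting string could then be applied with `s3_resource.BucketPolicy(bucket_name).put(Policy=...)`.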

In [11]:
create_bucket(s3_resource=s3_resource, aws_region=current_region, bucket_name=bucket_name)

In [12]:
upload_nested_directory(s3_resource=s3_resource, aws_region=current_region, 
                        bucket_name=bucket_name, target_dir=target_dir)

File uploaded to https://s3.us-east-2.amazonaws.com/wholesale-more-frozen-products/frozen.html
File uploaded to https://s3.us-east-2.amazonaws.com/wholesale-more-frozen-products/frozen3.html
File uploaded to https://s3.us-east-2.amazonaws.com/wholesale-more-frozen-products/wholesale-moreFrozenProducts.html
File uploaded to https://s3.us-east-2.amazonaws.com/wholesale-more-frozen-products/me_assets/me.coffee
File uploaded to https://s3.us-east-2.amazonaws.com/wholesale-more-frozen-products/me_assets/chart.css
File uploaded to https://s3.us-east-2.amazonaws.com/wholesale-more-frozen-products/me_assets/d3_tip.css
File uploaded to https://s3.us-east-2.amazonaws.com/wholesale-more-frozen-products/me_assets/d3.v3.4.3.js
File uploaded to https://s3.us-east-2.amazonaws.com/wholesale-more-frozen-products/me_assets/me.js
File uploaded to https://s3.us-east-2.amazonaws.com/wholesale-more-frozen-products/me_assets/me.css
File uploaded to https://s3.us-east-2.amazonaws.com/wholesale-more-frozen-pro

Let us list all files in the bucket to ensure that everything was uploaded properly.

## Bucket sub-resource

I create a bucket sub-resource, from which I can access all objects and their properties in an object-oriented manner.

In [13]:
bucket = s3_resource.Bucket(bucket_name)
for obj in bucket.objects.all():
    print(obj.key, obj.last_modified)

frozen.html 2020-02-23 18:25:20+00:00
frozen3.html 2020-02-23 18:25:20+00:00
me_assets/Points/plot.coffee 2020-02-23 18:25:23+00:00
me_assets/Points/plot.js 2020-02-23 18:25:23+00:00
me_assets/chart.css 2020-02-23 18:25:21+00:00
me_assets/d3.v3.4.3.js 2020-02-23 18:25:22+00:00
me_assets/d3_tip.css 2020-02-23 18:25:22+00:00
me_assets/d3_tip.js 2020-02-23 18:25:23+00:00
me_assets/me.coffee 2020-02-23 18:25:21+00:00
me_assets/me.css 2020-02-23 18:25:23+00:00
me_assets/me.js 2020-02-23 18:25:22+00:00
wholesale-moreFrozenProducts.html 2020-02-23 18:25:21+00:00
wholesale-moreFrozen_cache/html/__packages 2020-02-23 18:25:24+00:00
wholesale-moreFrozen_cache/html/set-options_18066191f72c912163595330ff610c0c.RData 2020-02-23 18:25:25+00:00
wholesale-moreFrozen_cache/html/set-options_18066191f72c912163595330ff610c0c.rdb 2020-02-23 18:25:25+00:00
wholesale-moreFrozen_cache/html/set-options_18066191f72c912163595330ff610c0c.rdx 2020-02-23 18:25:24+00:00
wholesale-moreFrozen_cache/html/unnamed-chunk-

Now I can access my files in the bucket and view their content directly, without downloading them. My index document

In [14]:
index_document = 'wholesale-moreFrozenProducts.html'

can be accessed via

In [15]:
print("http://{}.s3.amazonaws.com/{}".format(bucket_name, index_document))

http://wholesale-more-frozen-products.s3.amazonaws.com/wholesale-moreFrozenProducts.html


Try to use the link above. It works for me. Does it for you?

Are we done? Is this my website? The point of calling it a 'website' is having an endpoint that works without specifying the index page explicitly. In other words, the website should know its own index page. So I need to set the website configuration first.

## Error and Index documents for website

Here I create the BucketWebsite sub-resource and specify the _index_ and _error_ documents of my website.

In [16]:
bucket_website = s3_resource.BucketWebsite(bucket_name)

In [17]:
website_configuration = {
    'ErrorDocument': {'Key': 'error.html'},
    'IndexDocument': {'Suffix': index_document},
}

In [18]:
bucket_website.put(WebsiteConfiguration=website_configuration)

{'ResponseMetadata': {'RequestId': '6462A19EE2B93AA4',
  'HostId': 'gD2SYhtWmITQt4aE3AF0HtXcpjRewKx7RvvBuA76BjEH7yiB1M8VdhUKdf488RmWSrtoIEUMJsU=',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amz-id-2': 'gD2SYhtWmITQt4aE3AF0HtXcpjRewKx7RvvBuA76BjEH7yiB1M8VdhUKdf488RmWSrtoIEUMJsU=',
   'x-amz-request-id': '6462A19EE2B93AA4',
   'date': 'Sun, 23 Feb 2020 18:27:57 GMT',
   'content-length': '0',
   'server': 'AmazonS3'},
  'RetryAttempts': 0}}

To retrieve the website configuration use

In [19]:
print(bucket_website.error_document)
print(bucket_website.index_document)

{'Key': 'error.html'}
{'Suffix': 'wholesale-moreFrozenProducts.html'}
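
Note that the configuration names `error.html`, while the bucket listing above starts at `frozen.html`, so an `error.html` object does not seem to have been uploaded. A small check, sketched here with a hypothetical helper `missing_website_documents`, compares the configured documents against the keys in the bucket:

```python
def missing_website_documents(keys, index_document, error_document):
    """Return the configured website documents that are absent from `keys`."""
    present = set(keys)
    return [doc for doc in (index_document, error_document) if doc not in present]

# A few keys copied from the listing above, for illustration only
keys = ['frozen.html', 'frozen3.html', 'wholesale-moreFrozenProducts.html']
print(missing_website_documents(keys, 'wholesale-moreFrozenProducts.html', 'error.html'))
# prints ['error.html']
```

With the live bucket, `keys` could come from `obj.key for obj in bucket.objects.all()`.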


Unfortunately, I did not find a way to get the website endpoint programmatically from the `BucketWebsite` sub-resource. Is there a way? For now, let us construct it manually.

In [20]:
print("http://{}.s3-website.{}.amazonaws.com/".format(bucket_name, current_region))

http://wholesale-more-frozen-products.s3-website.us-east-2.amazonaws.com/


Now I get access to the website without specifying the index document.
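
One thing to watch when building the endpoint by hand: some older regions use a dash after `s3-website` (for example `s3-website-us-east-1`), while newer ones such as us-east-2 use a dot. A sketch of a helper that covers both; the function name and the `DASH_REGIONS` set are my own, and the set is written from memory of the AWS endpoint table, so check it against the current documentation:

```python
# Regions assumed to use the dash style; verify against the AWS website endpoint table
DASH_REGIONS = {
    'us-east-1', 'us-west-1', 'us-west-2', 'eu-west-1', 'sa-east-1',
    'ap-southeast-1', 'ap-southeast-2', 'ap-northeast-1',
}

def website_endpoint(bucket_name, region):
    """Build the S3 static-website URL for a bucket in the given region."""
    separator = '-' if region in DASH_REGIONS else '.'
    return 'http://{}.s3-website{}{}.amazonaws.com/'.format(bucket_name, separator, region)

print(website_endpoint('wholesale-more-frozen-products', 'us-east-2'))
# prints http://wholesale-more-frozen-products.s3-website.us-east-2.amazonaws.com/
```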

## Delete website configuration

To save AWS resources, I can delete the website configuration (but won't do it here).

In [None]:
bucket_website.delete()

Since my files are still in the bucket, I can access them directly by name

In [21]:
print("http://{}.s3.amazonaws.com/{}".format(bucket_name, index_document))

http://wholesale-more-frozen-products.s3.amazonaws.com/wholesale-moreFrozenProducts.html


## Delete files in the bucket

Now I can delete all files in the bucket

In [None]:
for obj in bucket.objects.all():
    obj.delete()

## Delete Bucket

As soon as the bucket is empty, I can delete it

In [None]:
response = bucket.delete()

By the way, in the AWS CLI I found a way to delete a _nonempty_ bucket with the `--force` flag:

    aws s3 rb s3://my-example-bucket --force

In boto3 I found nothing similar. Did you?
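
One candidate: resource collections in boto3 support batch actions, so `bucket.objects.all().delete()` removes the objects in batched requests instead of one call per object. Combined with `bucket.delete()`, this gets close to `--force`. A sketch, with a hypothetical helper name and assuming an unversioned bucket (a versioned one would also need its object versions removed):

```python
def force_delete_bucket(bucket):
    """Empty the bucket, then delete it. `bucket` is an s3_resource.Bucket(...)."""
    # Batch action on the collection: deletes all current objects
    bucket.objects.all().delete()
    # The bucket can only be deleted once it is empty
    bucket.delete()
```

Calling `force_delete_bucket(s3_resource.Bucket(bucket_name))` is irreversible, so I would double-check the bucket name first.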