# Boto3 and S3 Cheatsheet

This notebook collects short examples of working with Amazon S3 from Python using the boto3 SDK.

In [1]:
import boto3
import json
import os
import uuid
import pandas as pd

Create a session that uses a named profile instead of the default one:

In [2]:
session = boto3.session.Session(profile_name='boto3user')

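A session offers two interfaces to S3: the low-level client, which mirrors the S3 API and returns dictionaries, and the higher-level resource, which wraps the same calls in Python objects: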

s3_client = session.client('s3')

s3_resource = session.resource('s3')

## Creating a bucket

In [3]:
# Bucket names must be globally unique, so append a random UUID to the prefix
def create_bucket_name(bucket_prefix):
    return ''.join([bucket_prefix, str(uuid.uuid4())])
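
A UUID4 string adds 36 characters, and S3 bucket names are limited to 3 to 63 characters made of lowercase letters, digits, dots and hyphens, so the prefix has to stay short and lowercase. The following is a minimal sketch of a pre-flight check. The names is_valid_bucket_name and create_checked_bucket_name are hypothetical, and the check covers only the basic length and character rules, not every S3 naming restriction.

```python
import re
import uuid

# Basic S3 naming rules only: 3-63 characters, lowercase letters, digits,
# dots and hyphens, starting and ending with a letter or digit.
_BUCKET_NAME_RE = re.compile(r'[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]')


def is_valid_bucket_name(name):
    """Hypothetical helper: check a candidate name against the basic rules."""
    return _BUCKET_NAME_RE.fullmatch(name) is not None


def create_checked_bucket_name(bucket_prefix):
    """Same approach as create_bucket_name, but fail early on an invalid name."""
    name = ''.join([bucket_prefix, str(uuid.uuid4())])
    if not is_valid_bucket_name(name):
        raise ValueError(f'Invalid bucket name: {name!r}')
    return name
```

With this check, a prefix longer than 27 characters or one containing uppercase letters raises a ValueError locally instead of failing in the create_bucket call.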

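If the session's region is anything other than us-east-1, the region must be passed as a LocationConstraint when creating the bucket. For us-east-1 the constraint must be left out, because S3 rejects it as an explicit value.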

In [4]:
def create_bucket(bucket_prefix, s3_connection, session=None):
    """
    Create a new bucket whose name is derived from a prefix.
    :param bucket_prefix: prefix for the generated bucket name
    :param s3_connection: S3 client or resource
    :param session: session that supplies the region (optional; defaults to the default session)
    """

    # Fall back to the default session if none was given
    if session is None:
        session = boto3.session.Session()

    bucket_name = create_bucket_name(bucket_prefix=bucket_prefix)
    kwargs = {'Bucket': bucket_name}
    # us-east-1 is the default location and is rejected as an explicit LocationConstraint
    if session.region_name not in (None, 'us-east-1'):
        kwargs['CreateBucketConfiguration'] = {
            'LocationConstraint': session.region_name
        }
    response = s3_connection.create_bucket(**kwargs)
    return bucket_name, response

Both the client and the resource can create buckets, but they return different responses:

In [5]:
# Using resource
s3_resource = session.resource('s3')
first_bucket_name, first_response = create_bucket('firstbucket', s3_resource, session)

In [6]:
first_response

s3.Bucket(name='firstbucket486e40b7-83db-461f-9c34-d6708ef46496')

In [7]:
first_response.name

'firstbucket486e40b7-83db-461f-9c34-d6708ef46496'

In [8]:
# Using client
s3_client = session.client('s3')
second_bucket_name, second_response = create_bucket('secondbucket', s3_client, session)

In [9]:
second_response

{'ResponseMetadata': {'RequestId': '7AC15A6E80D0E17A',
  'HostId': 'vaOi41YOvyFBlUXeDmZmUNqS8Nbu79c6UzZt4XJVS0s715bK5J17K015R1qn7gAyw4jonBzSzkY=',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amz-id-2': 'vaOi41YOvyFBlUXeDmZmUNqS8Nbu79c6UzZt4XJVS0s715bK5J17K015R1qn7gAyw4jonBzSzkY=',
   'x-amz-request-id': '7AC15A6E80D0E17A',
   'date': 'Fri, 08 Jan 2021 20:24:38 GMT',
   'location': 'http://secondbucket7b67ed25-2dc5-4b6a-82d8-78921e8ef4a8.s3.amazonaws.com/',
   'content-length': '0',
   'server': 'AmazonS3'},
  'RetryAttempts': 0},
 'Location': 'http://secondbucket7b67ed25-2dc5-4b6a-82d8-78921e8ef4a8.s3.amazonaws.com/'}

The client returns the parsed response as a dictionary, which can be navigated by key:

In [10]:
second_response['Location']

'http://secondbucket7b67ed25-2dc5-4b6a-82d8-78921e8ef4a8.s3.amazonaws.com/'
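
Since the client response is a plain dictionary, nested fields such as the HTTP status code can be read with ordinary key access. Here is a small sketch that uses an abridged copy of the response shown above; the helper name request_succeeded is hypothetical.

```python
def request_succeeded(response):
    """Hypothetical helper: True if the HTTP status code is in the 2xx range."""
    status = response.get('ResponseMetadata', {}).get('HTTPStatusCode')
    return status is not None and 200 <= status < 300


# Abridged copy of the create_bucket response shown above
sample_response = {
    'ResponseMetadata': {'RequestId': '7AC15A6E80D0E17A',
                         'HTTPStatusCode': 200,
                         'RetryAttempts': 0},
    'Location': 'http://secondbucket7b67ed25-2dc5-4b6a-82d8-78921e8ef4a8.s3.amazonaws.com/',
}

print(request_succeeded(sample_response))
print(sample_response['ResponseMetadata']['RequestId'])
```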

## Adding data to my buckets

Create artificial files to test with:

In [11]:
def create_temp_file(size, file_name, file_content):
    """Create artificial files with a predetermined size and content"""
    
    random_file_name = ''.join([str(uuid.uuid4().hex[:6]), file_name])
    with open(random_file_name, 'w') as f:
        f.write(str(file_content) * size)
    return random_file_name

In [12]:
first_file_name = create_temp_file(300, 'firstfile.txt', 'f') 

In [13]:
# Instantiate bucket and object
first_bucket = s3_resource.Bucket(name=first_bucket_name)
first_object = s3_resource.Object(bucket_name=first_bucket_name, key=first_file_name)

In [14]:
# We can create objects as subresources
first_object_again = first_bucket.Object(first_file_name)

In [15]:
# Or buckets as sub-resources
first_bucket_again = first_object.Bucket()

Either one can be obtained from the other as a sub-resource. Files can be uploaded to S3 in three ways:

- using an object instance
- using a bucket instance
- using a client

In [16]:
# Using object instance
first_object.upload_file(first_file_name)
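
The other two upload routes take the same local file but differ in how the bucket and key are supplied. The sketch below assumes the objects from the earlier cells (s3_resource, s3_client, first_bucket_name, first_file_name); the wrapper functions are hypothetical and exist only to put the three call shapes side by side.

```python
def upload_with_object(s3_resource, bucket_name, file_name):
    # Object.upload_file(Filename): bucket and key are fixed by the Object itself
    s3_resource.Object(bucket_name, file_name).upload_file(Filename=file_name)


def upload_with_bucket(s3_resource, bucket_name, file_name):
    # Bucket.upload_file(Filename, Key): the key must be given explicitly
    s3_resource.Bucket(bucket_name).upload_file(Filename=file_name, Key=file_name)


def upload_with_client(s3_client, bucket_name, file_name):
    # client.upload_file(Filename, Bucket, Key): everything is passed explicitly
    s3_client.upload_file(Filename=file_name, Bucket=bucket_name, Key=file_name)
```

For example, upload_with_bucket(s3_resource, first_bucket_name, first_file_name) would upload the same file under the same key as the cell above.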

Now download the file that was just uploaded:

In [21]:
# Similar approach can be used to download using Bucket instance
s3_resource.Object(first_bucket_name, first_file_name).download_file(os.path.join(os.getcwd(), 'downloaded_file.txt'))

In [22]:
%ls

bc29f2firstfile.txt  downloaded_file.txt  s3_boto_guide.ipynb
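
Downloads follow the same pattern as uploads: a Bucket instance or the client can fetch the file instead of an Object instance. The sketch below assumes the objects from the earlier cells; the function names and the target_dir parameter are hypothetical.

```python
import os


def download_with_bucket(s3_resource, bucket_name, key, target_dir):
    # Bucket.download_file(Key, Filename)
    target = os.path.join(target_dir, key.split('/')[-1])
    s3_resource.Bucket(bucket_name).download_file(Key=key, Filename=target)
    return target


def download_with_client(s3_client, bucket_name, key, target_dir):
    # client.download_file(Bucket, Key, Filename)
    target = os.path.join(target_dir, key.split('/')[-1])
    s3_client.download_file(Bucket=bucket_name, Key=key, Filename=target)
    return target
```

Both helpers return the local path they wrote to, for example download_with_client(s3_client, first_bucket_name, first_file_name, os.getcwd()).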


Copying objects between buckets

In [23]:
def copy_to_bucket(bucket_from, bucket_to, file_name):
    copy_source = {
        'Bucket': bucket_from,
        'Key': file_name
    }
    s3_resource.Object(bucket_to, file_name).copy(copy_source)

In [24]:
copy_to_bucket(first_bucket_name, second_bucket_name, first_file_name)

## Deleting objects

In [25]:
s3_resource.Object(second_bucket_name, first_file_name).delete()

{'ResponseMetadata': {'RequestId': 'B43F8A6822E36C5B',
  'HostId': 'Uf0pCKsRw9t6dcPqNp9z6xIYFgf9k5F7ND1Rme7Zoqbb89qlgpKaCXTEWbAoIfhkAV2bROIszyU=',
  'HTTPStatusCode': 204,
  'HTTPHeaders': {'x-amz-id-2': 'Uf0pCKsRw9t6dcPqNp9z6xIYFgf9k5F7ND1Rme7Zoqbb89qlgpKaCXTEWbAoIfhkAV2bROIszyU=',
   'x-amz-request-id': 'B43F8A6822E36C5B',
   'date': 'Fri, 08 Jan 2021 20:28:28 GMT',
   'server': 'AmazonS3'},
  'RetryAttempts': 0}}

## ACL (access control list)

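Use an ACL to make an object publicly readable. Buckets created since April 2023 have ACLs disabled and public access blocked by default, so the upload below fails on them unless those settings are changed.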

In [26]:
second_file_name = create_temp_file(400, 'secondfile.txt', 's')

In [27]:
second_object = s3_resource.Object(first_bucket.name, second_file_name)  # Create the object
second_object.upload_file(second_file_name, ExtraArgs={'ACL': 'public-read'})  # ExtraArgs sets the ACL on upload

In [28]:
second_object.Acl()

s3.ObjectAcl(bucket_name='firstbucket486e40b7-83db-461f-9c34-d6708ef46496', object_key='5e0a14secondfile.txt')

In [29]:
second_object.Acl().grants

[{'Grantee': {'ID': 'e5f6bd524241279692f95063288f3ca5f7238610b72e16e9a5999e7d839bb2c5',
   'Type': 'CanonicalUser'},
  'Permission': 'FULL_CONTROL'},
 {'Grantee': {'Type': 'Group',
   'URI': 'http://acs.amazonaws.com/groups/global/AllUsers'},
  'Permission': 'READ'}]

In [30]:
# Convert object to private again
response = second_object.Acl().put(ACL='private')

In [31]:
second_object.Acl().grants

[{'Grantee': {'ID': 'e5f6bd524241279692f95063288f3ca5f7238610b72e16e9a5999e7d839bb2c5',
   'Type': 'CanonicalUser'},
  'Permission': 'FULL_CONTROL'}]
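
The grants list can also be checked programmatically rather than by eye. The sketch below looks for a grant to the AllUsers group, using the two grant lists shown above as sample data; the helper name is_publicly_readable is hypothetical.

```python
ALL_USERS_URI = 'http://acs.amazonaws.com/groups/global/AllUsers'


def is_publicly_readable(grants):
    """Hypothetical helper: True if the AllUsers group has READ or FULL_CONTROL."""
    for grant in grants:
        grantee = grant.get('Grantee', {})
        if (grantee.get('Type') == 'Group'
                and grantee.get('URI') == ALL_USERS_URI
                and grant.get('Permission') in ('READ', 'FULL_CONTROL')):
            return True
    return False


# Sample data copied from the outputs above
owner_grant = {'Grantee': {'ID': 'e5f6bd524241279692f95063288f3ca5f7238610b72e16e9a5999e7d839bb2c5',
                           'Type': 'CanonicalUser'},
               'Permission': 'FULL_CONTROL'}
public_grant = {'Grantee': {'Type': 'Group', 'URI': ALL_USERS_URI},
                'Permission': 'READ'}

print(is_publicly_readable([owner_grant, public_grant]))
print(is_publicly_readable([owner_grant]))
```

In the notebook, the same check could be applied as is_publicly_readable(second_object.Acl().grants).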

## Encryption

Request server-side encryption (AES-256) when uploading a file:

In [32]:
third_file_name = create_temp_file(300, 'thirdfile.txt', 't')

In [33]:
third_object = s3_resource.Object(first_bucket_name, third_file_name)
third_object.upload_file(third_file_name, ExtraArgs={'ServerSideEncryption': 'AES256'}) # pass extra parameters

In [34]:
third_object.server_side_encryption

'AES256'

## Traversals

Bucket traversal:

In [35]:
for bucket in s3_resource.buckets.all():
    print(bucket.name)

firstbucket486e40b7-83db-461f-9c34-d6708ef46496
secondbucket7b67ed25-2dc5-4b6a-82d8-78921e8ef4a8


Object traversal:

In [36]:
for obj in first_bucket.objects.all():
    print(obj.key)

18afaathirdfile.txt
5e0a14secondfile.txt
bc29f2firstfile.txt


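Iterating over a bucket yields ObjectSummary instances rather than full Object instances. A summary carries only the fields returned by the listing; to get the full object, call Object() on the summary or look the key up through the bucket.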
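
A short sketch of getting from summaries to full objects follows. It assumes first_bucket from the earlier cells, and the helper name summaries_to_objects is hypothetical.

```python
def summaries_to_objects(bucket):
    """Hypothetical helper: turn each ObjectSummary in a bucket into a full Object."""
    # ObjectSummary.Object() returns the Object sub-resource for the same bucket and key
    return [summary.Object() for summary in bucket.objects.all()]


# Example use in the notebook (needs AWS access, so it is left commented out):
# for obj in summaries_to_objects(first_bucket):
#     print(obj.key, obj.content_type)
```

Attributes that are not part of the listing, such as content_type, are loaded lazily with an extra request per object, so this is best kept for small buckets.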

## Deleting resources

A bucket can only be deleted once it is empty, so every object (and, on versioned buckets, every object version) must be deleted first.

In [37]:
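# All the versions must be deleted
def delete_all_objects(bucket_name):
    bucket = s3_resource.Bucket(bucket_name)
    res = [
        {'Key': obj_v.object_key, 'VersionId': obj_v.id}
        for obj_v in bucket.object_versions.all()
    ]
    print(res)
    # delete_objects takes at most 1000 keys per request and rejects an empty list,
    # so send the keys in batches and skip the call when there is nothing to delete
    for start in range(0, len(res), 1000):
        bucket.delete_objects(Delete={'Objects': res[start:start + 1000]})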

In [38]:
delete_all_objects(first_bucket_name)

[{'Key': '18afaathirdfile.txt', 'VersionId': 'null'}, {'Key': '5e0a14secondfile.txt', 'VersionId': 'null'}, {'Key': 'bc29f2firstfile.txt', 'VersionId': 'null'}]


In [39]:
# Check that everything was deleted
for obj in first_bucket.objects.all():
    print(obj.key)

Upload a file to the empty second bucket and test the function again:

In [40]:
second_bucket = s3_resource.Bucket(second_bucket_name)
second_bucket.Object(first_file_name).upload_file(first_file_name)

In [41]:
delete_all_objects(second_bucket_name)

[{'Key': 'bc29f2firstfile.txt', 'VersionId': 'null'}]


In [42]:
# Check that everything was deleted
for obj in s3_resource.Bucket(second_bucket_name).objects.all():
    print(obj.key)

Now delete the buckets:

In [43]:
first_bucket.delete()

{'ResponseMetadata': {'RequestId': 'BHEM9Z0JBT9SDS7J',
  'HostId': 'wcqvDZDqt7IKg+7BsOS2SjNV1+f0L1mkaCkPrby9aM20+izOHGNyiFI66xIXgvRl++9EKdr/yMY=',
  'HTTPStatusCode': 204,
  'HTTPHeaders': {'x-amz-id-2': 'wcqvDZDqt7IKg+7BsOS2SjNV1+f0L1mkaCkPrby9aM20+izOHGNyiFI66xIXgvRl++9EKdr/yMY=',
   'x-amz-request-id': 'BHEM9Z0JBT9SDS7J',
   'date': 'Fri, 08 Jan 2021 20:31:41 GMT',
   'server': 'AmazonS3'},
  'RetryAttempts': 0}}

In [44]:
second_bucket.delete()

{'ResponseMetadata': {'RequestId': '4EAE0C8B972A6D87',
  'HostId': 'FHuikCt5jiYmeE804QylXq7McfBajUOcBxdl33grPyNuE8l1BkqRIVvGTB/BRhVSuWWhssi1dmo=',
  'HTTPStatusCode': 204,
  'HTTPHeaders': {'x-amz-id-2': 'FHuikCt5jiYmeE804QylXq7McfBajUOcBxdl33grPyNuE8l1BkqRIVvGTB/BRhVSuWWhssi1dmo=',
   'x-amz-request-id': '4EAE0C8B972A6D87',
   'date': 'Fri, 08 Jan 2021 20:31:42 GMT',
   'server': 'AmazonS3'},
  'RetryAttempts': 0}}

## Exploring Udacity data

In [45]:
udacity_bucket_name = 'udacity-dend'

In [46]:
udacity_bucket = s3_resource.Bucket(udacity_bucket_name)

List the first few song files using the resource:

In [47]:
for i, obj in enumerate(udacity_bucket.objects.filter(Prefix='song-data')):
    print(obj.key)
    if i>9:
        break

song-data/
song-data/A/A/A/TRAAAAK128F9318786.json
song-data/A/A/A/TRAAAAV128F421A322.json
song-data/A/A/A/TRAAABD128F429CF47.json
song-data/A/A/A/TRAAACN128F9355673.json
song-data/A/A/A/TRAAAEA128F935A30D.json
song-data/A/A/A/TRAAAED128E0783FAB.json
song-data/A/A/A/TRAAAEM128F93347B9.json
song-data/A/A/A/TRAAAEW128F42930C0.json
song-data/A/A/A/TRAAAFD128F92F423A.json
song-data/A/A/A/TRAAAGR128F425B14B.json


Download one file:

In [48]:
json_file_name = 'song-data/A/A/A/TRAAAAK128F9318786.json'

In [49]:
# Store in current working directory with the same name
udacity_bucket.Object(key=json_file_name).download_file(
    f"{os.getcwd()}/{json_file_name.split('/')[-1]}"
)

In [50]:
%ls

18afaathirdfile.txt      TRAAAAK128F9318786.json  downloaded_file.txt
5e0a14secondfile.txt     bc29f2firstfile.txt      s3_boto_guide.ipynb


Now list objects through the S3 client and read each file's contents directly from S3, without saving it locally:

In [51]:
files = s3_client.list_objects(
    Bucket=udacity_bucket_name,
    MaxKeys=10,
    Prefix='song-data/A'
)['Contents']
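
A single list call returns at most 1,000 keys (capped at 10 here by MaxKeys), so listing a whole prefix requires pagination. The sketch below uses the client's list_objects_v2 paginator; the helper name iter_keys is hypothetical.

```python
def iter_keys(s3_client, bucket_name, prefix=''):
    """Hypothetical helper: yield every key under a prefix, page by page."""
    paginator = s3_client.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix):
        # 'Contents' is absent from a page when nothing matches the prefix
        for entry in page.get('Contents', []):
            yield entry['Key']


# Example use in the notebook (needs AWS access, so it is left commented out):
# from itertools import islice
# first_keys = list(islice(iter_keys(s3_client, udacity_bucket_name, 'song-data/A'), 10))
```

Because iter_keys is a generator, wrapping it in islice stops requesting pages as soon as enough keys have been collected.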

In [53]:
songs_sample = []
for file in files:
    obj = s3_client.get_object(Bucket=udacity_bucket_name, Key=file['Key'])
    obj_json = json.loads(obj['Body'].read())
    songs_sample.append(obj_json)    
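
get_object returns the payload as a file-like streaming body, which is why the loop above calls read() before parsing. The sketch below isolates that step. It uses io.BytesIO as a stand-in for the S3 body and a made-up record, so it runs without AWS access; parse_json_body is a hypothetical name.

```python
import io
import json


def parse_json_body(body):
    """Hypothetical helper: read a file-like body fully and parse it as JSON."""
    return json.loads(body.read())


# io.BytesIO stands in for obj['Body']; the record is made up for illustration
fake_body = io.BytesIO(b'{"song_id": "S0", "title": "Example", "year": 2000}')
record = parse_json_body(fake_body)
print(record['title'])
```

A stream like this can be consumed only once, so the parsed result should be kept if it is needed again.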

In [54]:
pd.DataFrame(songs_sample)

| | song_id | num_songs | title | artist_name | artist_latitude | year | duration | artist_id | artist_longitude | artist_location |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | SOBLFFE12AF72AA5BA | 1 | Scream | Adelitas Way | | 2009 | 213.9424 | ARJNIUY12298900C91 | | |
| 1 | SOQPWCR12A6D4FB2A3 | 1 | A Poor Recipe For Civic Cohesion | Western Addiction | 37.77916 | 2005 | 118.07302 | AR73AIO1187B9AD57B | -122.42005 | San Francisco, CA |
| 2 | SOCIWDW12A8C13D406 | 1 | Soul Deep | The Box Tops | 35.14968 | 1969 | 148.03546 | ARMJAGH1187FB546F3 | -90.04892 | Memphis, TN |
| 3 | SOFRDWL12A58A7CEF7 | 1 | Hit Da Scene | Quest_ Pup_ Kevo | | 0 | 252.94322 | AR9Q9YC1187FB5609B | | New Jersey |
| 4 | SOEKAZG12AB018837E | 1 | I'll Slap Your Face (Entertainment USA Theme) | Jonathan King | 51.50632 | 2001 | 129.85424 | ARSVTNL1187B992A91 | -0.12714 | London, England |
| 5 | SOXZYWX12A6310ED0C | 1 | It's About Time | Jamie Cullum | | 0 | 246.9873 | ARC1IHZ1187FB4E920 | | |
| 6 | SOIGICF12A8C141BC5 | 1 | Game & Watch | Son Kite | | 2004 | 580.54485 | AREWD471187FB49873 | | |
| 7 | SODZYPO12A8C13A91E | 1 | Burn My Body (Album Version) | Broken Spindles | | 0 | 177.99791 | AR1C2IX1187B99BF74 | | |
| 8 | SOFSOCN12A8C143F5D | 1 | Face the Ashes | Gob | | 2007 | 209.60608 | ARXR32B1187FB57099 | | |
| 9 | SONRWUU12AF72A4283 | 1 | Into The Nightlife | Cyndi Lauper | | 2008 | 240.63955 | ARGE7G11187FB37E05 | | Brooklyn, NY |


## Resources

- Working with S3 through the client: [Towards Data Science article](https://towardsdatascience.com/working-with-amazon-s3-buckets-with-boto3-785252ea22e0)
- Working with S3 through resources: [Real Python tutorial](https://realpython.com/python-boto3-aws-s3/)