# Accessing S3 Bucket
This notebook shows how to access the shared S3 bucket using a small custom class that wraps common operations.

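S3 is a key-value object store: each object is identified by a key, which is its full path-like name (for example, `data/cpi.csv`). S3 has no real directories; a "directory" is just a prefix shared by several keys. Anytime you see "key" below, it means the path of a file or the prefix of a directory.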
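To make the idea of keys and prefixes concrete, here is a minimal sketch in plain Python (no AWS calls) of how a flat list of keys can be viewed as files and subdirectories one level below a prefix. The function name `list_children` and the sample keys are hypothetical and only serve as illustration; the grouping mirrors what S3 does when a listing is requested with a delimiter.

```python
def list_children(keys, prefix="", delimiter="/"):
    """Split flat keys under `prefix` into files and subdirectory prefixes."""
    files, subdirs = [], set()
    for key in keys:
        if not key.startswith(prefix):
            continue  # key lives outside the requested prefix
        rest = key[len(prefix):]
        head, sep, _ = rest.partition(delimiter)
        if sep:
            # Something follows a delimiter, so this key is inside a subdirectory
            subdirs.add(prefix + head + delimiter)
        elif rest:
            # No further delimiter: the key is a file directly under the prefix
            files.append(key)
    return sorted(files), sorted(subdirs)


# Hypothetical keys, for illustration only
keys = ["cpi.csv", "data/cpi.csv", "data/lyrics/song1.txt"]
files, subdirs = list_children(keys, prefix="data/")
print(files)    # keys directly under data/
print(subdirs)  # "subdirectories" under data/
```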


## [Accessing S3 File System](https://dev.to/mariazentsova/how-to-load-data-from-s3-to-aws-sagemaker-mea)

In [None]:
import boto3
from io import BytesIO
import json
import os
import pickle
import pandas as pd
import sagemaker
from zipfile import ZipFile

class S3:
    def __init__(self, bucket_name='lyricgen'):
        self.role = sagemaker.get_execution_role()
        self.bucket_name = bucket_name
        self.resource = boto3.Session().resource('s3')
        self.bucket = self.resource.Bucket(bucket_name)

    def get_role(self):
        return self.role

    def get_bucket_name(self):
        return self.bucket_name

    def get_bucket(self):
        return self.bucket

    def get_resource(self):
        return self.resource

    def get_client(self):
        return self.get_resource().meta.client

    def list_files(self):
        return [file.key for file in self.bucket.objects.all()]

    def request(self, key):
        return self.bucket.Object(key).get()['Body']

    def get_pickled(self, key):
        return pickle.loads(self.request(key).read())

    def get_df(self, key):
        return pd.read_csv(self.request(key))

    def get_json(self, key):
        return json.loads(self.request(key).read().decode('utf-8'))

    def get_zip(self, key):
        # Read the whole object into memory and open it as a zip archive
        return ZipFile(BytesIO(self.request(key).read()))

    def download_file(self, key, filepath=''):
        return self.resource.meta.client.download_file(self.bucket_name, key, filepath)

    def download_dir(self, key, filepath=''):
        paginator = self.get_client().get_paginator('list_objects')
        page_results = paginator.paginate(Bucket=self.get_bucket_name(), Delimiter='/', Prefix=key)
        for result in page_results:
            if result.get('CommonPrefixes') is not None:
                for subdir in result.get('CommonPrefixes'):
                    self.download_dir(subdir.get('Prefix'), filepath)
            for file in result.get('Contents', []):
                destination = os.path.join(filepath, file.get('Key'))
                # Create the local parent directory if it does not exist yet
                parent = os.path.dirname(destination)
                if parent:
                    os.makedirs(parent, exist_ok=True)
                # Keys ending in '/' are directory markers, not files
                if not file.get('Key').endswith('/'):
                    self.download_file(file.get('Key'), destination)

    def write_file(self, filepath, key):
        with open(filepath, 'rb') as f: # Read in binary mode
            return self.bucket.Object(key).upload_fileobj(f)


## Check Access

In [None]:
# create s3bucket instance
s3bucket = S3()

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml


In [None]:
# load df from csv
cpi_example = s3bucket.get_df('cpi.csv')
cpi_example.head(3)

Unnamed: 0,cpi,year
0,0.27107,1947
1,0.291867,1948
2,0.289005,1949


In [None]:
s3bucket.download_dir('data')

## Usage
Most of the operations below take a "key" parameter that identifies the object to read or write.

### Create S3 Bucket Instance

In [None]:
# create s3bucket instance
s3bucket = S3('lyricgen')

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml


### List all files in S3 Bucket
This can be useful if you're unsure of a file's key.
- Note: This takes about 30 seconds because every object is listed, including the contents of the lyrics folder.

In [None]:
s3bucket.list_files()
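If listing the whole bucket is too slow, one option is to list only the keys under a given prefix. The sketch below is not part of the `S3` class; `list_files_under` is a hypothetical helper that relies on boto3's `bucket.objects.filter(Prefix=...)` collection method, and the `'data/'` prefix in the usage comment is an assumption about the bucket layout.

```python
def list_files_under(bucket, prefix):
    """Return the keys of all objects whose key starts with `prefix`.

    `bucket` is expected to be a boto3 Bucket resource, such as the one
    returned by s3bucket.get_bucket().
    """
    return [obj.key for obj in bucket.objects.filter(Prefix=prefix)]


# Usage (requires AWS access, so it is left commented out here):
# list_files_under(s3bucket.get_bucket(), 'data/')
```

Filtering by prefix should avoid walking the lyrics folder entirely when you only need another part of the bucket.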

### Load a Text/CSV File into Pandas
To access a nested file, include its prefix (the folder path) in the key.

In [None]:
# load df from csv
cpi_nested_example = s3bucket.get_df('data/cpi.csv')
cpi_nested_example.head()

Unnamed: 0,cpi,year
0,0.27107,1947
1,0.291867,1948
2,0.289005,1949
3,0.29208,1950
4,0.315274,1951


### Download Files (to SageMaker EFS)
As before, include the prefix in the key to access nested files. The second argument is the local destination path; to download into a specific folder, that folder must already exist.

In [None]:
# download key to filepath
s3bucket.download_file('data/cpi.csv', 'cpi.csv')
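Because `download_file` does not create missing local folders, it can help to create the destination's parent directory first. The following is a minimal sketch using only the standard library; `ensure_parent_dir` is a hypothetical helper, not part of the `S3` class, and the `downloads/` folder in the usage comment is an assumed example path.

```python
import os


def ensure_parent_dir(filepath):
    """Create the parent directory of `filepath` if needed, then return `filepath`."""
    parent = os.path.dirname(filepath)
    if parent:
        # exist_ok=True makes this safe to call when the folder already exists
        os.makedirs(parent, exist_ok=True)
    return filepath


# Usage (requires AWS access, so it is left commented out here):
# s3bucket.download_file('data/cpi.csv', ensure_parent_dir('downloads/cpi.csv'))
```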

### Download a Directory (to SageMaker EFS)
Similarly, you can download an entire directory.
- To place the data in a new local directory, pass its name as the second parameter; the directory is created if it does not exist.

In [None]:
# download a directory into the current working directory
s3bucket.download_dir('data')

In [None]:
# download a directory into a new folder
s3bucket.download_dir('data', 's3')
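To see where files end up locally, the sketch below reproduces the path logic of `download_dir` without touching S3: each key is joined onto the target folder, and keys ending in `/` (directory markers) are skipped. `plan_downloads` is a hypothetical helper and the sample keys are made up for illustration.

```python
import os


def plan_downloads(keys, filepath=""):
    """Return (key, local_destination) pairs the way download_dir builds them."""
    plan = []
    for key in keys:
        if key.endswith("/"):
            continue  # directory marker, nothing to download
        plan.append((key, os.path.join(filepath, key)))
    return plan


# Hypothetical keys, for illustration only
keys = ["data/", "data/cpi.csv", "data/lyrics/song1.txt"]
for key, destination in plan_downloads(keys, "s3"):
    print(key, "->", destination)
```

Note that the full key, including its `data/` prefix, is kept under the target folder, so `download_dir('data', 's3')` places files under `s3/data/`.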

### Writing Files to S3
You can write files to S3 and specify subdirectories where you'd like to store the object. Below we're writing the cpi.csv file we downloaded back into the data folder on S3.

In [None]:
s3bucket.write_file('cpi.csv', 'data/cpi.csv')
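When uploading several files, it can be convenient to derive the key from the local file name and a target prefix instead of typing it by hand. This is a small sketch under that assumption; `make_key` is a hypothetical helper, not part of the `S3` class.

```python
import os
import posixpath


def make_key(filepath, prefix=""):
    """Build an S3 key from a local file's name and an optional prefix."""
    filename = os.path.basename(filepath)
    # S3 keys use '/' as the separator regardless of the local operating system
    return posixpath.join(prefix, filename) if prefix else filename


print(make_key("cpi.csv", "data"))

# Usage (requires AWS access, so it is left commented out here):
# s3bucket.write_file('cpi.csv', make_key('cpi.csv', 'data'))
```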

# That's it!
To use this, just copy the cell containing the S3 class into your dev environment and get rolling!