### Module 7: Using S3 for data retrieval and storage



We have already used the boto3 library to download files from, and upload files to, the S3 service:
- client.download_file(bucket, key, local_filename)
- client.upload_file(local_filename, bucket, key)
    
In this activity, we will use two other boto3 functions, get_object() and put_object(), to read objects from and write objects to an S3 bucket without going through a local file.

In [17]:
import boto3
import pandas as pd

### 1. AWS SDK Review
Boto3 is the AWS SDK library for Python.

Common workflow:
1. Create a new AWS session
2. Using that session, create a client for the AWS service you want to use
3. Call the correct client function to accomplish what you want
4. Parse the response to verify the action was successful

#### 1A. Create a new session using your credentials

In [18]:
# Create a new session and store it in the variable called 'sess' 
sess = boto3.session.Session()
# This object uses the IAM role attached to your SageMaker session. I have configured this role
# to allow you certain permissions in our AWS account.
type(sess)

boto3.session.Session

#### 1B. Use the session to create a client of an AWS service we want to use.
For example, let's use the AWS service called "Security Token Service" (STS). This will allow us to verify our account and credentials.

In [19]:
# Create an STS client object (Security Token Service)
sts = sess.client('sts')

#### 1C. Call the correct client function to accomplish what you want
In our example, we want to verify that we have the correct credentials to use our AWS account.

Here is the STS client documentation:
- https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sts.html

In [20]:
# The Python function 'dir()' will give you all attributes and functions from an object.
dir(sts)

['_PY_TO_OP_NAME',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattr__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_cache',
 '_client_config',
 '_convert_to_request_dict',
 '_emit_api_params',
 '_endpoint',
 '_exceptions',
 '_exceptions_factory',
 '_get_waiter_config',
 '_load_exceptions',
 '_loader',
 '_make_api_call',
 '_make_request',
 '_register_handlers',
 '_request_signer',
 '_response_parser',
 '_serializer',
 '_service_model',
 'assume_role',
 'assume_role_with_saml',
 'assume_role_with_web_identity',
 'can_paginate',
 'close',
 'decode_authorization_message',
 'exceptions',
 'generate_presigned_url',
 'get_access_key_info',
 'get_caller_identity',
 'get_federation_token',
 'get_paginator',


In [21]:
# Use the 'sts' client to call the 'get_caller_identity()' function. 
# Store the result in a variable called 'response'
response = sts.get_caller_identity()
response

{'UserId': 'AROAWWVMBM7EHU6KP3QO3:SageMaker',
 'Account': '460996044744',
 'Arn': 'arn:aws:sts::460996044744:assumed-role/AmazonSageMaker-ExecutionRole-20220222T110265/SageMaker',
 'ResponseMetadata': {'RequestId': '3c8fe132-6391-4369-a0f4-a28be30ea3cd',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '3c8fe132-6391-4369-a0f4-a28be30ea3cd',
   'content-type': 'text/xml',
   'content-length': '470',
   'date': 'Thu, 04 Aug 2022 18:29:46 GMT'},
  'RetryAttempts': 0}}

#### 1D. Parse the response to verify action was taken
The response is a Python dictionary.

In this example, we just want to display a few details from the response.

In [22]:
# Get a little tricky parsing the response dictionary:
#
if response['ResponseMetadata']['HTTPStatusCode'] == 200: # Make sure it was a successful response
    print('Username:',response['UserId'].split(':')[-1]) # Extracting a detail from the dictionary
    print('Account Number:',response['Arn'].split(':')[4])
    print('IAM Role:',response['Arn'].split(':')[5].split('/')[-2])
else: # Handle a bad response HTTPStatusCode
    print("Something went wrong, we didn't get the right status code.")

Username: SageMaker
Account Number: 460996044744
IAM Role: AmazonSageMaker-ExecutionRole-20220222T110265
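
The chained split() calls above rely on the ARN layout arn:partition:service:region:account-id:resource. As a minimal sketch, the same parsing can be wrapped in a small function that names each part. The name parse_assumed_role_arn is hypothetical (it is not a boto3 function), and the sketch assumes an assumed-role ARN like the one returned above.

```python
def parse_assumed_role_arn(arn):
    """Split an STS assumed-role ARN into named parts.

    Hypothetical helper, not part of boto3. Assumes the layout
    arn:partition:service:region:account-id:assumed-role/role-name/session-name
    """
    # maxsplit=5 keeps the resource part in one piece
    prefix, partition, service, region, account, resource = arn.split(':', 5)
    if prefix != 'arn':
        raise ValueError(f'Not an ARN: {arn!r}')

    resource_type, role_name, session_name = resource.split('/', 2)
    if resource_type != 'assumed-role':
        raise ValueError(f'Not an assumed-role ARN: {arn!r}')

    return {
        'partition': partition,
        'service': service,
        'region': region,          # empty for STS, which is a global service
        'account': account,
        'role_name': role_name,
        'session_name': session_name,
    }


# The ARN from the get_caller_identity() response shown earlier
arn = 'arn:aws:sts::460996044744:assumed-role/AmazonSageMaker-ExecutionRole-20220222T110265/SageMaker'
parts = parse_assumed_role_arn(arn)
print('Account Number:', parts['account'])
print('IAM Role:', parts['role_name'])
print('Username:', parts['session_name'])
```

Raising ValueError on an unexpected layout is a design choice for this sketch; it makes a wrong assumption about the ARN visible instead of silently printing the wrong field.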


### 2. Load csv data from AWS S3 directly into a pandas dataframe
A clean workflow is to not store data locally on your SageMaker instance. Rather, we will load it from an S3 bucket directly into memory.

Let's load a CSV file from an S3 bucket directly into a pandas DataFrame.

In [23]:
# Create S3 Client
s3 = sess.client('s3') 
#
# Define the bucket & file you want to load
source_bucket = 'machinelearning-read-only'
source_key = 'data/boston.csv'
#
# Make the call to the 'get_object' function
#      https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.get_object
response = s3.get_object(Bucket=source_bucket, Key=source_key)
#
# Load the 'data' part of the response directly into a dataframe
df = pd.read_csv(response.get("Body")) # 'Body' is a file-like StreamingBody, so read_csv can consume it directly
#
# There is no local storage of the dataframe. It is only in memory.
df.head(5)

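|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 | 36.2 |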
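
The 'Body' returned by get_object() is a stream that can be consumed only once; after it has been read, a second read returns no data. If the same bytes are needed more than once (for example, to retry read_csv with different options), one option is to copy them into a BytesIO buffer first. The sketch below assumes s3_client is a boto3 S3 client like the s3 object above; fetch_to_buffer is a hypothetical helper name.

```python
from io import BytesIO


def fetch_to_buffer(s3_client, bucket, key):
    """Download one S3 object and return its bytes as a rewindable in-memory buffer.

    Hypothetical helper. 's3_client' is assumed to be a boto3 S3 client.
    The whole object is held in memory, so this suits small to medium files.
    """
    response = s3_client.get_object(Bucket=bucket, Key=key)
    # read() drains the one-shot stream; BytesIO lets us seek back and reread
    return BytesIO(response['Body'].read())
```

For example, buffer = fetch_to_buffer(s3, source_bucket, source_key) followed by pd.read_csv(buffer) loads the DataFrame; call buffer.seek(0) before passing the buffer to read_csv a second time.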


### 3. Save data from a pandas DataFrame into a datafile stored in an S3 bucket.

Imagine you have performed your work and now want to store your data back into an S3 bucket.

Let's store it in a different bucket with a different key:
- bucket = 'machinelearning-shared'
- key = 'data/\<your username\>/\<your file name\>'
  
    

In [24]:
# Create a new dataframe from the describe() method
summary_df = df.describe() # This function returns a dataframe
summary_df

|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 |
| mean | 3.613524 | 11.363636 | 11.136779 | 0.06917 | 0.554695 | 6.284634 | 68.574901 | 3.795043 | 9.549407 | 408.237154 | 18.455534 | 356.674032 | 12.653063 | 22.532806 |
| std | 8.601545 | 23.322453 | 6.860353 | 0.253994 | 0.115878 | 0.702617 | 28.148861 | 2.10571 | 8.707259 | 168.537116 | 2.164946 | 91.294864 | 7.141062 | 9.197104 |
| min | 0.00632 | 0.0 | 0.46 | 0.0 | 0.385 | 3.561 | 2.9 | 1.1296 | 1.0 | 187.0 | 12.6 | 0.32 | 1.73 | 5.0 |
| 25% | 0.082045 | 0.0 | 5.19 | 0.0 | 0.449 | 5.8855 | 45.025 | 2.100175 | 4.0 | 279.0 | 17.4 | 375.3775 | 6.95 | 17.025 |
| 50% | 0.25651 | 0.0 | 9.69 | 0.0 | 0.538 | 6.2085 | 77.5 | 3.20745 | 5.0 | 330.0 | 19.05 | 391.44 | 11.36 | 21.2 |
| 75% | 3.677083 | 12.5 | 18.1 | 0.0 | 0.624 | 6.6235 | 94.075 | 5.188425 | 24.0 | 666.0 | 20.2 | 396.225 | 16.955 | 25.0 |
| max | 88.9762 | 100.0 | 27.74 | 1.0 | 0.871 | 8.78 | 100.0 | 12.1265 | 24.0 | 711.0 | 22.0 | 396.9 | 37.97 | 50.0 |


In [31]:
# Define variables for the destination location
#
# This is a bucket to which you have full access
destination_bucket = 'machinelearning-shared'
#
# define your custom key
destination_key = 'data/kcolvin/boston_summary.csv' # Use your own folder and file names here, no leading '/'
#
print('Location to save to:',destination_bucket + '  ' + destination_key)

Location to save to: machinelearning-shared  data/kcolvin/boston_summary.csv


In [32]:
# Store the DataFrame to a CSV file in S3:
#
# We need to convert our DataFrame into a csv-like data structure.
from io import StringIO # Import the StringIO object from the io library
#
csv_buffer = StringIO() # create an empty StringIO object to store the csv data
#
# Write the existing DataFrame into that empty csv_buffer
# Note: for describe() output the row index holds the statistic names (count, mean, ...), so index=False drops them.
summary_df.to_csv(csv_buffer, header=True, index=False) # Include the header (columns), but not the row index.
#
csv_buffer.seek(0) # Rewind the stream to the start. Only required if you read() from the buffer; getvalue() returns everything regardless of position
#
# Use put_object() to upload the contents of csv_buffer to a csv file in the S3 location
response = s3.put_object(Bucket=destination_bucket, Body=csv_buffer.getvalue(), Key=destination_key)
response

{'ResponseMetadata': {'RequestId': 'NYE93KQHQ7KT9FB5',
  'HostId': 'bx2eGChPHfQ8INTx+Ehm5lRU4Afc6mPYyVHT3IDq21/hM+zierds3GVfF/W3WDb8pWWppplLFo0pW1rET+WOrQ==',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amz-id-2': 'bx2eGChPHfQ8INTx+Ehm5lRU4Afc6mPYyVHT3IDq21/hM+zierds3GVfF/W3WDb8pWWppplLFo0pW1rET+WOrQ==',
   'x-amz-request-id': 'NYE93KQHQ7KT9FB5',
   'date': 'Thu, 04 Aug 2022 18:31:44 GMT',
   'etag': '"14b500b642d6c64f276eaf6fd881445d"',
   'server': 'AmazonS3',
   'content-length': '0'},
  'RetryAttempts': 0},
 'ETag': '"14b500b642d6c64f276eaf6fd881445d"'}
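
The response includes an 'ETag'. For an object uploaded with a single put_object() call and stored either unencrypted or with SSE-S3 encryption, S3 documents the ETag as the MD5 digest of the object data, so it can be compared with a hash computed locally. The following is a sketch under that assumption (it does not hold for multipart uploads or SSE-KMS encryption); etag_matches is a hypothetical helper name.

```python
import hashlib


def etag_matches(response, payload):
    """Return True if the ETag in a put_object() response equals the MD5 of 'payload'.

    Hypothetical helper. Assumes a single-part upload stored unencrypted or
    with SSE-S3, where the ETag is the MD5 hex digest of the object data.
    """
    if isinstance(payload, str):
        # Assumption: a str Body is sent to S3 as UTF-8 encoded bytes
        payload = payload.encode('utf-8')
    local_md5 = hashlib.md5(payload).hexdigest()
    # S3 returns the ETag wrapped in double quotes
    remote_etag = response['ETag'].strip('"')
    return local_md5 == remote_etag
```

For the upload above, the check would be etag_matches(response, csv_buffer.getvalue()).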

### 4. Check your file on the S3 bucket
You have to be a little careful with the put_object() function. An HTTPStatusCode of 200 only tells you that S3 accepted the request; it does not confirm that the object ended up under the key you intended.

Let's check all files in the S3 bucket by using the list_objects() function.

In [33]:
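response = s3.list_objects(Bucket=destination_bucket)
# Parse through the response ('Contents' holds one dictionary per object in the bucket)
for obj in response['Contents']: # 'obj' avoids shadowing the built-in name 'object'
    print(obj['Key'])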

data/dkraker-data.csv
data/kcolvin/boston_summary.csv
data/kcolvin/boston_summary.pkl
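
A single list_objects() call returns at most 1,000 keys, and the response has no 'Contents' entry when nothing matches. The sketch below handles both cases by using the list_objects_v2 paginator together with a key prefix. The name list_keys is a hypothetical helper, and s3_client is assumed to be a boto3 S3 client such as the s3 object above.

```python
def list_keys(s3_client, bucket, prefix=''):
    """Return every key in 'bucket' that starts with 'prefix'.

    Hypothetical helper. 's3_client' is assumed to be a boto3 S3 client.
    """
    paginator = s3_client.get_paginator('list_objects_v2')
    keys = []
    # Each page is one list_objects_v2 response (up to 1,000 objects)
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # 'Contents' is missing from a page that has no matching objects
        for obj in page.get('Contents', []):
            keys.append(obj['Key'])
    return keys
```

For example, list_keys(s3, destination_bucket, prefix='data/kcolvin/') would list only the files in one user's folder.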


### 5. Store other types of data in S3
- Let's practice with a file type called a 'pickle'
- Saving data to Pickle files is often called 'serializing the data'

Pickle files are often used to store trained machine learning models for future use.

But in this example, let's just save a pandas DataFrame to a pickle file.

In [34]:
# We can just save it to a local file directly with the to_pickle() function from pandas.
# This is easy, but we usually don't want to store data on the local file system.
summary_df.to_pickle('./boston_summary.pkl')

This time, let's store the data in an S3 bucket.

In [35]:
# We use the BytesIO object instead of the StringIO object for a pickle file, because pickle data is binary
from io import BytesIO

destination_bucket = 'machinelearning-shared' # Same as before
destination_key = 'data/kcolvin/boston_summary.pkl' # Customize for your use
#
# Create a new, empty BytesIO object
pickle_buffer = BytesIO()
#
# Serialize the DataFrame into the pickle buffer
summary_df.to_pickle(pickle_buffer)
#
pickle_buffer.seek(0) # Rewind the stream to the start. Only required if you read() from the buffer; getvalue() returns everything regardless of position
#
# Use put_object() to upload the contents of pickle_buffer to a pickle file in the S3 location
response = s3.put_object(Bucket=destination_bucket, Body=pickle_buffer.getvalue(), Key=destination_key)
response

{'ResponseMetadata': {'RequestId': '4RDMP00CFD1RAMJQ',
  'HostId': 'G2MOc7ArBDSseKmvCxw/v2krcl03YgHHdYv9ql/N/1+7nI7dQeCvFsHK6MhUePGGrlsPBhawIuI=',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amz-id-2': 'G2MOc7ArBDSseKmvCxw/v2krcl03YgHHdYv9ql/N/1+7nI7dQeCvFsHK6MhUePGGrlsPBhawIuI=',
   'x-amz-request-id': '4RDMP00CFD1RAMJQ',
   'date': 'Thu, 04 Aug 2022 18:34:15 GMT',
   'etag': '"a76ae1527069cd20f6660c27189fcf31"',
   'server': 'AmazonS3',
   'content-length': '0'},
  'RetryAttempts': 0},
 'ETag': '"a76ae1527069cd20f6660c27189fcf31"'}

### 6. Load the saved Pickle file back into a new DataFrame
Just for demonstration purposes

In [30]:
import pickle
#
source_bucket = 'machinelearning-shared' 
source_key = 'data/kcolvin/boston_summary.pkl' # Customize for your use
#
# Get the file from S3 
response = s3.get_object(Bucket = source_bucket, Key = source_key)
#
# Read the 'Body' part of the response into a variable. This is where the DataFrame data exists in the response.
body = response['Body'].read()
#
# Create a new pandas DataFrame using the pickle.loads() function
new_df = pickle.loads(body)
new_df

|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 |
| mean | 3.613524 | 11.363636 | 11.136779 | 0.06917 | 0.554695 | 6.284634 | 68.574901 | 3.795043 | 9.549407 | 408.237154 | 18.455534 | 356.674032 | 12.653063 | 22.532806 |
| std | 8.601545 | 23.322453 | 6.860353 | 0.253994 | 0.115878 | 0.702617 | 28.148861 | 2.10571 | 8.707259 | 168.537116 | 2.164946 | 91.294864 | 7.141062 | 9.197104 |
| min | 0.00632 | 0.0 | 0.46 | 0.0 | 0.385 | 3.561 | 2.9 | 1.1296 | 1.0 | 187.0 | 12.6 | 0.32 | 1.73 | 5.0 |
| 25% | 0.082045 | 0.0 | 5.19 | 0.0 | 0.449 | 5.8855 | 45.025 | 2.100175 | 4.0 | 279.0 | 17.4 | 375.3775 | 6.95 | 17.025 |
| 50% | 0.25651 | 0.0 | 9.69 | 0.0 | 0.538 | 6.2085 | 77.5 | 3.20745 | 5.0 | 330.0 | 19.05 | 391.44 | 11.36 | 21.2 |
| 75% | 3.677083 | 12.5 | 18.1 | 0.0 | 0.624 | 6.6235 | 94.075 | 5.188425 | 24.0 | 666.0 | 20.2 | 396.225 | 16.955 | 25.0 |
| max | 88.9762 | 100.0 | 27.74 | 1.0 | 0.871 | 8.78 | 100.0 | 12.1265 | 24.0 | 711.0 | 22.0 | 396.9 | 37.97 | 50.0 |
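
Because pickle.dumps() already returns bytes, any picklable Python object (a DataFrame, a trained model) can be passed to put_object() without an intermediate buffer. The following is a minimal sketch with the hypothetical helper names save_pickle and load_pickle; s3_client is assumed to be a boto3 S3 client such as the s3 object above. Only unpickle objects from sources you trust, since loading a pickle can execute arbitrary code.

```python
import pickle


def save_pickle(s3_client, obj, bucket, key):
    """Serialize 'obj' with pickle and upload the bytes to S3.

    Hypothetical helper. 's3_client' is assumed to be a boto3 S3 client.
    """
    return s3_client.put_object(Bucket=bucket, Key=key, Body=pickle.dumps(obj))


def load_pickle(s3_client, bucket, key):
    """Download a pickle file from S3 and rebuild the original object.

    Only use this on objects you wrote yourself or otherwise trust.
    """
    response = s3_client.get_object(Bucket=bucket, Key=key)
    return pickle.loads(response['Body'].read())
```

For example, save_pickle(s3, summary_df, destination_bucket, destination_key) followed by load_pickle(s3, destination_bucket, destination_key) would round-trip the summary DataFrame.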


Summary: In this notebook, we demonstrated loading pandas DataFrames from S3 buckets and saving them back to S3, using both CSV and pickle files.