## Introduction to using the AWS Software Development Kit (SDK) with Python
Last week, we used the AWS CLI to work with files in S3. This week, we are going to perform very similar operations, but we are going to use Python instead of the CLI.


### Prep: Install boto3 module

In [1]:
# First, import the 'sys' module. It provides the interpreter path and prefix used by the install commands below.
import sys
#
# Some troubleshooting code below, if needed. Uncomment to use:
# List all installed modules
#!{sys.executable} -m pip list
# Check Python version
#!python --version

##### If you don't have boto3 installed, install it now. You must 'install' a package before you can 'import' it. Boto3 is the AWS SDK for Python. You only need to do this once.

In [2]:
# This should work for almost all installations of Jupyter Notebook
#
# Call 'pip install boto3'. Uncomment this next line:
#!{sys.executable} -m pip install boto3
# 
# If pip fails, try 'conda install'. If necessary, uncomment the next line.
#!conda install --yes --prefix {sys.prefix} boto3

### Once boto3 is installed, start here:

In [3]:
# If boto3 is installed, then we can import it. If this cell runs without an error, we are ready to go.
import boto3

In [4]:
# Import a couple of other useful packages. Hopefully, you have already installed pandas.
# pandas has useful data types and methods.
import pandas as pd
# The io module provides Python’s main facilities for dealing with various types of I/O.
import io

In [5]:
# Create a boto3 session.  This creates an object which allows us to interact with AWS.
#
session = boto3.Session()

In [6]:
# First, let's make sure we have authorization to use our AWS account.
# boto3 reads the same credentials that we configured for the AWS CLI last week.
# Create an STS (Security Token Service) client object
sts = session.client('sts')
#
# Ask the client to report my identity (or credentials)
# You should see a line with:
#  'Arn': 'arn:aws:iam::460996044744:user/<your username>'
response = sts.get_caller_identity()
response

{'UserId': 'AIDAWWVMBM7EGCKXQSL4Y',
 'Account': '460996044744',
 'Arn': 'arn:aws:iam::460996044744:user/kcolvin',
 'ResponseMetadata': {'RequestId': 'aed27e2a-27d5-4dfa-85e8-b1d436917e55',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'aed27e2a-27d5-4dfa-85e8-b1d436917e55',
   'content-type': 'text/xml',
   'content-length': '404',
   'date': 'Wed, 02 Feb 2022 23:05:15 GMT'},
  'RetryAttempts': 0}}

In [7]:
# From the response, find your username
# Get the Arn key, then split the string at the '/' character
# Load the 2nd element [1] from that split string.
my_username = response['Arn'].split('/')[1]
print(my_username)

kcolvin
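The `split('/')[1]` approach works for the ARN shown above. As a hedged aside, IAM user ARNs can also contain a path (for example `user/division/name`), in which case element `[1]` is the first path segment rather than the username. The sketch below uses a hypothetical helper, `username_from_arn`, and made-up ARNs with a placeholder account ID. It makes no AWS call.

```python
def username_from_arn(arn: str) -> str:
    """Return the text after the last '/' in an IAM user ARN.

    Hypothetical helper for illustration; it is not part of boto3.
    """
    # rsplit with maxsplit=1 splits only at the LAST '/', so any
    # optional path segments before the username are ignored.
    return arn.rsplit('/', 1)[-1]


# Made-up example ARNs (the account ID 123456789012 is a placeholder)
simple_arn = 'arn:aws:iam::123456789012:user/kcolvin'
arn_with_path = 'arn:aws:iam::123456789012:user/division/kcolvin'

print(username_from_arn(simple_arn))
print(username_from_arn(arn_with_path))
# The original approach applied to an ARN that contains a path
print(arn_with_path.split('/')[1])
```

For ARNs without a path, both approaches give the same result, so the cell above is fine for this class.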


In [8]:
# If the above cell worked, then we can continue. If it failed, then you have to troubleshoot before continuing.
#
# Create an S3 client. This object lets us interact with the S3 service on AWS.
s3c = session.client('s3')

#### Using AWS S3

##### 'list buckets'

In [9]:
# Here is the 'list_buckets' function
#
# Call the function and store the results in the variable 'response'
response = s3c.list_buckets()  # Call list_buckets() on the S3 client
#
# boto3 returns the response as a Python 'dictionary' ('dict'), which looks much like a JSON object.
# To work with AWS in Python, you need to be very familiar with dictionaries.
print('response type is: ', type(response),'\n')
#
# Keep only the value stored under the 'Buckets' key (a list with one dict per bucket)
bucket_names = response['Buckets']
#
# Loop through all the buckets and pull out just the 'Name'
print('Name of the buckets:\n')
for name in bucket_names:
    print(name['Name'])

response type is:  <class 'dict'> 

Name of the buckets:

athena-ime312-data
athena-ime312-results
cf-templates-1557tsq4h6qye-us-west-2
ftp-dlink-cam1
gse580
gse580-read-only
kcolvintemp
msba-rekognition
temp-235612


In [10]:
# Let's not filter the response above. Let's look at the whole dict returned by boto3.
response

{'ResponseMetadata': {'RequestId': '5NHMH6GZJEN5VWD1',
  'HostId': 'aDAhUpacp3AJqxZ4NE+CtkkcesLBZgNHSACIjA09GW5VYYXef7EQ4Sbe+2i1ggkQm6q4MhWyK/Y=',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amz-id-2': 'aDAhUpacp3AJqxZ4NE+CtkkcesLBZgNHSACIjA09GW5VYYXef7EQ4Sbe+2i1ggkQm6q4MhWyK/Y=',
   'x-amz-request-id': '5NHMH6GZJEN5VWD1',
   'date': 'Fri, 21 Jan 2022 21:37:00 GMT',
   'content-type': 'application/xml',
   'transfer-encoding': 'chunked',
   'server': 'AmazonS3'},
  'RetryAttempts': 0},
 'Buckets': [{'Name': 'athena-ime312-data',
   'CreationDate': datetime.datetime(2021, 2, 18, 17, 46, 59, tzinfo=tzutc())},
  {'Name': 'athena-ime312-results',
   'CreationDate': datetime.datetime(2021, 2, 18, 17, 47, 20, tzinfo=tzutc())},
  {'Name': 'cf-templates-1557tsq4h6qye-us-west-2',
   'CreationDate': datetime.datetime(2020, 2, 18, 18, 43, 41, tzinfo=tzutc())},
  {'Name': 'ftp-dlink-cam1',
   'CreationDate': datetime.datetime(2020, 8, 12, 4, 21, 54, tzinfo=tzutc())},
  {'Name': 'gse580',
   'Cre

#### Quick refresher on Python dictionaries

In [11]:
# A dictionary is a collection of 'keys' and 'values'. For each key, there is a related value.
# define a dict
my_car ={'brand': 'Toyota',
         'model': 'Sienna',
         'year': 1999}
# Show the variable type
print('The type of the variable my_car is: ',type(my_car))
#
# If you know a 'key', get the value using square brackets
print('The model of my_car is: ',my_car['model'])
# Show all the keys
print('All the keys: ',my_car.keys())
# Show all the values
print('All the values: ',my_car.values())
#
# Just show the dictionary:
my_car

The type of the variable my_car is:  <class 'dict'>
The model of my_car is:  Sienna
All the keys:  dict_keys(['brand', 'model', 'year'])
All the values:  dict_values(['Toyota', 'Sienna', 1999])


{'brand': 'Toyota', 'model': 'Sienna', 'year': 1999}
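A few more dictionary tools are worth knowing before working with AWS responses: `.get()` with a default, the `in` test, and `.items()`. The sketch below reuses `my_car` from the cell above. The `nested` dict is a made-up, shortened stand-in for an AWS response, used only to show the chained `.get()` pattern that appears later in this notebook.

```python
my_car = {'brand': 'Toyota', 'model': 'Sienna', 'year': 1999}

# Square brackets raise a KeyError when the key is missing;
# .get() returns a default instead (None unless you supply one).
color = my_car.get('color', 'unknown')
print('Color:', color)

# Test whether a key exists before using it
print("Has 'year'?", 'year' in my_car)

# Loop over the key/value pairs together
for key, value in my_car.items():
    print(key, '->', value)

# Dictionaries can be nested, which is how AWS responses are shaped.
# Chaining .get() with an empty-dict default avoids a KeyError
# when one of the levels is missing.
nested = {'ResponseMetadata': {'HTTPStatusCode': 200}}
status = nested.get('ResponseMetadata', {}).get('HTTPStatusCode')
print('Status:', status)
```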

In [12]:
# When AWS responds, it returns a dict. You have to deal with those dicts:
# Repeating code from above:
response = s3c.list_buckets()
# Show everything included in the response
print('\nThe full contents of response:\n')
response


The full contents of response:



{'ResponseMetadata': {'RequestId': 'HBKCX8E7XS629T9X',
  'HostId': '0Wbxh3Ryt67u/FeYC7Fu3B0zIhLIuyGvd/FCw8oUusVQIMgEg7I6A86K4ciRyXGdisPKgLJivhU=',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amz-id-2': '0Wbxh3Ryt67u/FeYC7Fu3B0zIhLIuyGvd/FCw8oUusVQIMgEg7I6A86K4ciRyXGdisPKgLJivhU=',
   'x-amz-request-id': 'HBKCX8E7XS629T9X',
   'date': 'Fri, 21 Jan 2022 21:43:12 GMT',
   'content-type': 'application/xml',
   'transfer-encoding': 'chunked',
   'server': 'AmazonS3'},
  'RetryAttempts': 0},
 'Buckets': [{'Name': 'athena-ime312-data',
   'CreationDate': datetime.datetime(2021, 2, 18, 17, 46, 59, tzinfo=tzutc())},
  {'Name': 'athena-ime312-results',
   'CreationDate': datetime.datetime(2021, 2, 18, 17, 47, 20, tzinfo=tzutc())},
  {'Name': 'cf-templates-1557tsq4h6qye-us-west-2',
   'CreationDate': datetime.datetime(2020, 2, 18, 18, 43, 41, tzinfo=tzutc())},
  {'Name': 'ftp-dlink-cam1',
   'CreationDate': datetime.datetime(2020, 8, 12, 4, 21, 54, tzinfo=tzutc())},
  {'Name': 'gse580',
   'Cre

In [13]:
# Show only the name of the first bucket
response['Buckets'][0]['Name']

'athena-ime312-data'
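The expression `response['Buckets'][0]['Name']` reads as: take the `'Buckets'` list, pick element 0, then look up its `'Name'`. The same structure supports list comprehensions and functions like `max()`. The sketch below runs offline on `sample_response`, a hand-copied and shortened stand-in for the response shown above, with `timezone.utc` standing in for `tzutc()`.

```python
from datetime import datetime, timezone

# A hand-copied, shortened stand-in for the list_buckets() response above
sample_response = {
    'Buckets': [
        {'Name': 'athena-ime312-data',
         'CreationDate': datetime(2021, 2, 18, 17, 46, 59, tzinfo=timezone.utc)},
        {'Name': 'athena-ime312-results',
         'CreationDate': datetime(2021, 2, 18, 17, 47, 20, tzinfo=timezone.utc)},
        {'Name': 'cf-templates-1557tsq4h6qye-us-west-2',
         'CreationDate': datetime(2020, 2, 18, 18, 43, 41, tzinfo=timezone.utc)},
    ]
}

# response['Buckets'] is a list; each element is a dict describing one bucket
bucket_names = [b['Name'] for b in sample_response['Buckets']]
print(bucket_names)

# Find the most recently created bucket by comparing the datetime values
newest = max(sample_response['Buckets'], key=lambda b: b['CreationDate'])
print('Newest bucket:', newest['Name'])
```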

##### 'list objects'

In [14]:
# From AWS S3, here is the 'list_objects' function:
#
# Let's select a bucket and list the contents of that bucket
bucket = 'gse580'
#
# Call the 'list_objects' function
response = s3c.list_objects(Bucket=bucket)
#
# Just get the 'Contents' from the response
all_objects = response['Contents']
#
# Print out the names of the files
for obj in all_objects:
    print(obj['Key'])

ayagar/
ayagar/IMG_7149.JPG
ayagar/data/data.csv
ayagar/index.html
bbschnei/
bbschnei/data/clean_data.csv
bbschnei/data/data.csv
bbschnei/index.html
benba/data/data.csv
bzwarg/
bzwarg/data/
bzwarg/data/clean_data.csv
bzwarg/data/data.csv
bzwarg/index.html
cehayden/
cehayden/IMG_5414.jpeg
cehayden/data/data.csv
cehayden/index.html
chartman/
chartman/chartman.html
chartman/data/
chartman/data/data.csv
clopez/
clopez/data/data.csv
clopez/index.html
cnilsson/
cnilsson/data/
cnilsson/data/data.csv
cnilsson/index.html
disaaved/
disaaved/Nala Butt 3.jpg
disaaved/data/
disaaved/data/data.csv
disaaved/index.html
disaaved/index.html.html
dlov/
dlov/data/data.csv
dlov/images/macy.jpg
dlov/index.html
dlov/macy.jpg
dlov/sync/
ebohnenb/
ebohnenb/data/data.csv
ebohnenb/index.html
ebohnenb/turkeys.jpg
error.html
jtmetz/
jtmetz/castle.JPG
jtmetz/data/data.csv
jtmetz/index.html
jtmetz/index_redo.html
jtmetz/index_redo.txt
kcolvin/
kcolvin/data/
kcolvin/data/clean_data.csv
kcolvin/data/data.csv
kcolvin/i
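A single listing call returns at most 1,000 keys, so a larger bucket needs several requests. boto3 provides paginators for this. The sketch below shows one way to use `get_paginator('list_objects_v2')`. The helper `iter_keys` and the two `Fake...` classes are hypothetical. The fakes exist only so the example runs without AWS credentials, and the page contents are made up.

```python
def iter_keys(client, bucket):
    """Yield every object key in a bucket, one page of results at a time."""
    paginator = client.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=bucket):
        # 'Contents' is absent from a page that holds no objects
        for obj in page.get('Contents', []):
            yield obj['Key']


# --- Offline stand-ins (hypothetical, not part of boto3) ---
class FakePaginator:
    def __init__(self, pages):
        self._pages = pages

    def paginate(self, **kwargs):
        # A real paginator requests each page from AWS; this one replays a list
        return iter(self._pages)


class FakeS3Client:
    """Stub that mimics only get_paginator()."""
    def __init__(self, pages):
        self._pages = pages

    def get_paginator(self, operation_name):
        return FakePaginator(self._pages)


fake_pages = [
    {'Contents': [{'Key': 'kcolvin/'}, {'Key': 'kcolvin/data/data.csv'}]},
    {'Contents': [{'Key': 'kcolvin/index.html'}]},
    {},  # a page with no objects
]
keys = list(iter_keys(FakeS3Client(fake_pages), 'gse580'))
print(keys)

# With real credentials, the same function would be called as:
#   keys = list(iter_keys(s3c, 'gse580'))
```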

In [15]:
# Look for a specific pattern in the keys
for obj in all_objects:
    if 'kcolvin' in obj['Key']:
        print(obj['Key'])
#
# Look for a specific key
print('\nDoes kcolvin/data/data.csv exist?\n')
for obj in all_objects:
    if 'kcolvin/data/data.csv' in obj['Key']:
        print('\t',obj['Key'])

kcolvin/
kcolvin/data/
kcolvin/data/clean_data.csv
kcolvin/data/data.csv
kcolvin/images/
kcolvin/images/macy.jpg
kcolvin/index.html
kcolvin/newfolder/macy.jpg

Does kcolvin/data/data.csv exist?

	 kcolvin/data/data.csv


In [16]:
# Just for fun, let's grade last week's assignment:
#
# Print the objects whose key contains the pattern: 'data/data.csv'
print('Full credit for:\n')
for obj in all_objects:
    if 'data/data.csv' in obj['Key']:
        print(obj['Key'])

Full credit for:

ayagar/data/data.csv
bbschnei/data/data.csv
benba/data/data.csv
bzwarg/data/data.csv
cehayden/data/data.csv
chartman/data/data.csv
clopez/data/data.csv
cnilsson/data/data.csv
disaaved/data/data.csv
dlov/data/data.csv
ebohnenb/data/data.csv
jtmetz/data/data.csv
kcolvin/data/data.csv
kcorders/data/data.csv
mgeverce/data/data.csv
mkost/data/data.csv
nhughe01/data/data.csv
nomeyer/data/data.csv
oalamu/data/data.csv
pvavouli/data/data.csv
sohl/data/data.csv
stragess/data/data.csv
zkrieger/data/data.csv
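The `in` operator used in the cells above matches the pattern anywhere inside a key. That is convenient, but it can match more than intended. The sketch below compares `in` with `startswith()`, `endswith()`, and an exact match. It uses a few keys copied from the listing plus one made-up key, `notkcolvin/data/data.csv`, which is included only to trigger a false match.

```python
# A few keys copied from the listing above, plus one made-up key
sample_keys = [
    'kcolvin/data/clean_data.csv',
    'kcolvin/data/data.csv',
    'kcolvin/index.html',
    'notkcolvin/data/data.csv',   # hypothetical key, not in the real bucket
]

# 'in' on a string matches the text anywhere in the key
substring_hits = [k for k in sample_keys if 'kcolvin/' in k]

# startswith() matches only keys inside the kcolvin/ folder
prefix_hits = [k for k in sample_keys if k.startswith('kcolvin/')]

# endswith() matches only keys that end with /data/data.csv
suffix_hits = [k for k in sample_keys if k.endswith('/data/data.csv')]

# 'in' on a list tests for one exact key
exact_hit = 'kcolvin/data/data.csv' in sample_keys

print('substring:', len(substring_hits), 'keys')
print('prefix:   ', len(prefix_hits), 'keys')
print('suffix:   ', suffix_hits)
print('exact:    ', exact_hit)
```

The listing call also accepts a `Prefix` argument, for example `s3c.list_objects(Bucket='gse580', Prefix='kcolvin/')`, which lets S3 do the prefix filtering before the response is sent.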


#### Download, Upload and Delete objects

In [17]:
# Download S3 object 'gse580-read-only/data/data.csv' into our local Jupyter notebook folder
s3c.download_file(Bucket = 'gse580-read-only', Key = 'data/data.csv', Filename = 'data.csv') # Returns None
#
# Upload the local file to S3
s3c.upload_file(Filename = 'data.csv', Bucket = 'gse580', Key = 'temp/data.csv') # Returns None
#
# Delete the object we just uploaded (the key must match the upload key exactly)
response = s3c.delete_object(Bucket = 'gse580', Key = 'temp/data.csv')
if response['ResponseMetadata']['HTTPStatusCode'] == 204:
    print('Response value was 204, successful delete_object()')
else:
    print('Something went wrong')

Response value was 204, successful delete_object()
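A 204 status from `delete_object` is weaker evidence than it looks: S3 generally answers 204 even when the key never existed, so a misspelled key also reports success. One way to double-check is to list the bucket with the key as the `Prefix` and look for an exact match. The sketch below assumes that approach. `key_exists` and `FakeS3Client` are hypothetical, and the fake keeps its keys in a set so the example runs without AWS credentials.

```python
def key_exists(client, bucket, key):
    """Return True if an object with exactly this key is listed in the bucket."""
    response = client.list_objects_v2(Bucket=bucket, Prefix=key)
    # 'Contents' is missing when nothing matches the prefix
    return any(obj['Key'] == key for obj in response.get('Contents', []))


class FakeS3Client:
    """Hypothetical stub: stores keys in a set and mimics two S3 calls."""
    def __init__(self, keys):
        self._keys = set(keys)

    def list_objects_v2(self, Bucket, Prefix=''):
        matches = sorted(k for k in self._keys if k.startswith(Prefix))
        return {'Contents': [{'Key': k} for k in matches]} if matches else {}

    def delete_object(self, Bucket, Key):
        self._keys.discard(Key)  # no error if the key is absent
        return {'ResponseMetadata': {'HTTPStatusCode': 204}}


fake = FakeS3Client(['temp/data.csv'])

# Deleting a misspelled key still reports 204, but the real object remains
fake.delete_object(Bucket='gse580', Key='temp/data/csv')
still_there = key_exists(fake, 'gse580', 'temp/data.csv')
print('Still there after misspelled delete:', still_there)

# Deleting the correct key removes the object
fake.delete_object(Bucket='gse580', Key='temp/data.csv')
gone = not key_exists(fake, 'gse580', 'temp/data.csv')
print('Gone after correct delete:', gone)
```

With real credentials, `key_exists(s3c, 'gse580', 'temp/data.csv')` would run the same check against S3.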


## Your assignment this week:

### Let's work with S3 data WITHOUT downloading a file to our local computer.
Here, we are going to work with S3 and pandas to create a DataFrame directly from S3.

In [18]:
# Load a .csv file from S3 straight into a pandas df
bucket = 'gse580-read-only'
key = 'data/data.csv'
#
# Call the 'get_object' function from boto3. Unlike download_file() above, it returns the object's contents in the response instead of writing a local file.
response = s3c.get_object(Bucket=bucket, Key=key)
#
# Get the HTTPStatusCode from the response
status = response.get("ResponseMetadata", {}).get("HTTPStatusCode")

if status == 200:
    # If all OK, then create the DataFrame
    print(f"Successful S3 get_object response. Status - {status}")
    df = pd.read_csv(response.get('Body'))
else:
    # See what the response is and troubleshoot
    print(f"Unsuccessful S3 get_object response. Status - {status}")
#
# Assuming it worked, show the df.head()
df.head()

Successful S3 get_object response. Status - 200


,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020
0,Aruba,ABW,"Population, total",SP.POP.TOTL,54208.0,55434.0,56234.0,56699.0,57029.0,57357.0,...,102050.0,102565.0,103165.0,103776.0,104339.0,104865.0,105361.0,105846.0,106310.0,106766.0
1,Africa Eastern and Southern,AFE,"Population, total",SP.POP.TOTL,130836765.0,134159786.0,137614644.0,141202036.0,144920186.0,148769974.0,...,532760424.0,547482863.0,562601578.0,578075373.0,593871847.0,609978946.0,626392880.0,643090131.0,660046272.0,677243299.0
2,Afghanistan,AFG,"Population, total",SP.POP.TOTL,8996967.0,9169406.0,9351442.0,9543200.0,9744772.0,9956318.0,...,30117411.0,31161378.0,32269592.0,33370804.0,34413603.0,35383028.0,36296111.0,37171922.0,38041757.0,38928341.0
3,Africa Western and Central,AFW,"Population, total",SP.POP.TOTL,96396419.0,98407221.0,100506960.0,102691339.0,104953470.0,107289875.0,...,360285439.0,370243017.0,380437896.0,390882979.0,401586651.0,412551299.0,423769930.0,435229381.0,446911598.0,458803476.0
4,Angola,AGO,"Population, total",SP.POP.TOTL,5454938.0,5531451.0,5608499.0,5679409.0,5734995.0,5770573.0,...,24220660.0,25107925.0,26015786.0,26941773.0,27884380.0,28842482.0,29816769.0,30809787.0,31825299.0,32866268.0
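`pd.read_csv` can read `response['Body']` directly because the body is a file-like stream of bytes, not a string. The offline sketch below illustrates that idea with the standard library only. `io.BytesIO` is an assumed stand-in for the real body, holding two values copied from the table above, and `csv.DictReader` is swapped in for pandas so the example has no third-party dependencies.

```python
import csv
import io

# Stand-in for response['Body']: a byte stream holding a tiny CSV
fake_body = io.BytesIO(b"Country Name,1960\nAruba,54208.0\nAngola,5454938.0\n")

# Wrap the byte stream so it can be read as text, then parse it as CSV
text_stream = io.TextIOWrapper(fake_body, encoding='utf-8', newline='')
rows = list(csv.DictReader(text_stream))
print(rows[0])

# A stream is consumed as it is read: a second read returns nothing
leftover = fake_body.read()
print(repr(leftover))
```

The practical consequence is that a body should be read once. If you need the data again, keep the DataFrame or call `get_object` again.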


### Simulated analysis of data
In reality, we will just clean the data up a bit.

In [19]:
# Modify the dataframe:

# Clean up the columns
clean_df = df.drop(['Country Code','Indicator Name','Indicator Code'],axis=1)
# Set the index to the 'Country Name' column
clean_df.set_index('Country Name',inplace = True)
# Have a look
clean_df.head()

Country Name,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,...,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020
Aruba,54208.0,55434.0,56234.0,56699.0,57029.0,57357.0,57702.0,58044.0,58377.0,58734.0,...,102050.0,102565.0,103165.0,103776.0,104339.0,104865.0,105361.0,105846.0,106310.0,106766.0
Africa Eastern and Southern,130836765.0,134159786.0,137614644.0,141202036.0,144920186.0,148769974.0,152752671.0,156876454.0,161156430.0,165611760.0,...,532760424.0,547482863.0,562601578.0,578075373.0,593871847.0,609978946.0,626392880.0,643090131.0,660046272.0,677243299.0
Afghanistan,8996967.0,9169406.0,9351442.0,9543200.0,9744772.0,9956318.0,10174840.0,10399936.0,10637064.0,10893772.0,...,30117411.0,31161378.0,32269592.0,33370804.0,34413603.0,35383028.0,36296111.0,37171922.0,38041757.0,38928341.0
Africa Western and Central,96396419.0,98407221.0,100506960.0,102691339.0,104953470.0,107289875.0,109701811.0,112195950.0,114781116.0,117468741.0,...,360285439.0,370243017.0,380437896.0,390882979.0,401586651.0,412551299.0,423769930.0,435229381.0,446911598.0,458803476.0
Angola,5454938.0,5531451.0,5608499.0,5679409.0,5734995.0,5770573.0,5781305.0,5774440.0,5771973.0,5803677.0,...,24220660.0,25107925.0,26015786.0,26941773.0,27884380.0,28842482.0,29816769.0,30809787.0,31825299.0,32866268.0


In [20]:
# Now save the dataframe back to a new file in S3
#
bucket  = 'gse580'
# Modify to your folder
# **** YOU MUST MODIFY ****
key = 'kcolvin/data/clean_data.csv'  # Name of new file on S3
# **** YOU MUST MODIFY ****
##
# Code to use the put_object function to save clean_df as a .csv file in S3
with io.StringIO() as csv_buffer:
    # Use the pandas to_csv function
    clean_df.to_csv(csv_buffer, index=True)
    #
    # Here is the put_object function
    response = s3c.put_object(Bucket=bucket, Key=key, Body=csv_buffer.getvalue())
    #
    status = response.get("ResponseMetadata", {}).get("HTTPStatusCode")
    #
    if status == 200:
        print(f"Successful S3 put_object response. Status - {status}")
    else:
        print(f"Unsuccessful S3 put_object response. Status - {status}")

Successful S3 put_object response. Status - 200
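The cell above never writes a local file. `io.StringIO` acts as an in-memory text file, `to_csv` writes into it, and `getvalue()` hands back the whole CSV as one string for `put_object`. The sketch below shows the same buffer pattern offline. It assumes the standard-library `csv` module in place of pandas, and the two rows are copied from the table above.

```python
import csv
import io

# Write two rows into an in-memory text buffer instead of a file on disk
with io.StringIO() as csv_buffer:
    writer = csv.writer(csv_buffer, lineterminator='\n')
    writer.writerow(['Country Name', '1960'])
    writer.writerow(['Aruba', 54208.0])
    # getvalue() returns everything written so far as one string.
    # Call it inside the 'with' block, before the buffer is closed.
    csv_text = csv_buffer.getvalue()

print(csv_text)

# Encoding the string gives the bytes that would be uploaded
csv_bytes = csv_text.encode('utf-8')
print(len(csv_bytes), 'bytes')
```

This is why `put_object` is called inside the `with` block in the cell above: once the block ends, the buffer is closed and `getvalue()` raises an error.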


In [21]:
# Let's verify our new file is in your folder on S3 under the location: <your_username>/data/clean_data.csv
bucket = 'gse580'
#
# **** YOU MUST MODIFY ****
your_username = 'kcolvin/'
# **** YOU MUST MODIFY ****
#
print('This is your full key to your S3 file:',your_username + 'data/clean_data.csv \n')
#
# Call list_objects
response = s3c.list_objects(Bucket=bucket)
# Just get the objects
all_objects = response['Contents']
#
# Your assignment is to save your clean_data.csv to your folder on S3 under 'data/clean_data.csv'
# Verify it exists:
for obj in all_objects:
    # Build a string to match <your_username>/data/clean_data.csv
    if your_username + 'data/clean_data.csv' in obj['Key']:
        print('Do you have a file in location <your_username>/data/clean_data.csv?')
        print(obj['Key'])

This is your full key to your S3 file: kcolvin/data/clean_data.csv 

Do you have a file in location <your_username>/data/clean_data.csv?
kcolvin/data/clean_data.csv


#### Not part of the assignment, but might be useful in the future.

In [22]:
# Extra bit of code for fun:
# Create a Security Token Service (STS) client. This is another service available through the boto3 module.
sts = session.client('sts')
# Call the get_caller_identity() function
response = sts.get_caller_identity()
# From the response, find your username
my_username = response['Arn'].split('/')[1]
print(my_username)

kcolvin
