# Accessing S3 resources in SageMaker

The first step is to make sure the Python kernel is using your current AWS credentials. The following two lines of code retrieve the IAM execution role attached to your notebook.

In [1]:
from sagemaker import get_execution_role
role = get_execution_role()

Specify the data to access in your new S3 bucket by saving two variables: the bucket name and the location of the file within the bucket (its key). Enter that information in the following cell.

In [2]:
my_bucket = 'pm1178-labdata'
my_file = 'NCBirths2004.csv'

#Example
#my_bucket = 'bigdatateaching'
#my_file = 'marvel/character/part-00000-48ef7a8c-4747-4e40-9662-f9593a2c4655-c000.csv'
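
The bucket name and the file location combine into a single S3 URI of the form `s3://bucket/key`, which is the form used later in this notebook. A minimal sketch of that assembly follows. The helper names `make_s3_uri` and `split_s3_uri` are hypothetical and not part of any AWS library.

```python
from typing import Tuple
from urllib.parse import urlparse


def make_s3_uri(bucket: str, key: str) -> str:
    """Join a bucket name and object key into an s3:// URI (hypothetical helper)."""
    # Strip stray slashes so bucket and key are joined by exactly one separator.
    return f"s3://{bucket.strip('/')}/{key.lstrip('/')}"


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Split an s3:// URI back into (bucket, key) (hypothetical helper)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


uri = make_s3_uri("pm1178-labdata", "NCBirths2004.csv")
print(uri)
print(split_s3_uri(uri))
```

Building the URI with plain string formatting keeps the double slash after `s3:` intact regardless of the operating system.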

## Using the boto3 package to access S3 files

In [3]:
import boto3

s3client = boto3.client('s3')
response = s3client.get_object(Bucket=my_bucket, Key=my_file)
body = response['Body']

test = body.read()
test[:500]

b'ID,MothersAge,Tobacco,Alcohol,Gender,Weight,Gestation,Smoker\r\n1,30-34,No,No,Male,3827,40,No\r\n2,30-34,No,No,Male,3629,38,No\r\n3,35-39,No,No,Female,3062,37,No\r\n4,20-24,No,No,Female,3430,39,No\r\n5,25-29,No,No,Male,3827,38,No\r\n6,35-39,No,No,Female,3119,39,No\r\n7,20-24,No,No,Female,3260,40,No\r\n8,20-24,No,No,Male,3969,40,No\r\n9,20-24,No,No,Male,3175,39,No\r\n10,25-29,No,No,Female,3005,39,No\r\n11,25-29,No,No,Male,4054,41,No\r\n12,20-24,Yes,No,Male,3204,39,Yes\r\n13,30-34,No,No,Female,2892,38,No\r\n14,25-29,No,No,Fe'
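
`body.read()` returns raw bytes, which is why the output above starts with `b'` and shows literal `\r\n` line endings. Here is a small sketch of turning such bytes into rows using only the standard library. The `sample` value is copied from the first lines of the output above rather than fetched from S3, `parse_csv_bytes` is a hypothetical helper name, and UTF-8 encoding is an assumption.

```python
import csv
import io

# First three lines of the output above, standing in for body.read().
sample = (
    b"ID,MothersAge,Tobacco,Alcohol,Gender,Weight,Gestation,Smoker\r\n"
    b"1,30-34,No,No,Male,3827,40,No\r\n"
    b"2,30-34,No,No,Male,3629,38,No\r\n"
)


def parse_csv_bytes(raw: bytes, encoding: str = "utf-8") -> list:
    """Decode CSV bytes and return one dict per data row (hypothetical helper)."""
    # newline="" lets the csv module handle the \r\n line endings itself.
    text = io.StringIO(raw.decode(encoding), newline="")
    return list(csv.DictReader(text))


rows = parse_csv_bytes(sample)
print(rows[0]["MothersAge"], rows[0]["Weight"])
```

Every value comes back as a string here; converting `Weight` or `Gestation` to numbers is left to the caller.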

## Using Pandas to read in data

Pandas has built-in support for reading from S3 buckets (it relies on the s3fs package behind the scenes). All you have to do is make sure your AWS credentials are loaded so that you have permission to access the bucket.

In [4]:
import pandas as pd

# build the S3 URI explicitly so it keeps the s3:// prefix
df_pd = pd.read_csv(f's3://{my_bucket}/{my_file}')
df_pd

Unnamed: 0,ID,MothersAge,Tobacco,Alcohol,Gender,Weight,Gestation,Smoker
0,1,30-34,No,No,Male,3827,40,No
1,2,30-34,No,No,Male,3629,38,No
2,3,35-39,No,No,Female,3062,37,No
3,4,20-24,No,No,Female,3430,39,No
4,5,25-29,No,No,Male,3827,38,No
...,...,...,...,...,...,...,...,...
1004,1005,35-39,No,No,Male,3799,39,No
1005,1006,20-24,No,No,Male,2835,39,No
1006,1007,15-19,No,No,Male,3260,38,No
1007,1008,20-24,No,No,Female,2637,41,No


Oh no, did you get an error instead of the table above? Why might this have happened?

As it turns out, there is a dependency conflict between the packages botocore and s3fs. We need these packages to work together in order to interact with S3 files effectively. You can read some discussion of the issue at https://github.com/iterative/dvc/issues/7053. Conflicts like this can happen when packages are released at a quick pace. Both the issue and its solution appeared in the final months of 2021.

Let's resolve this problem by installing the appropriate package version. 

1. Run the following cell to install the package. Note that we are installing the s3fs package, using a "==" to ask for a specific version, and adding a "force" flag so that pip reinstalls s3fs and all of its related packages.
2. Restart the kernel by going to the menu bar and clicking `Kernel`, then `Restart Kernel...`, then confirm by clicking the red `Restart` button.
3. Run all your cells again from the top of the notebook, and it should now work!

In [8]:
!pip install s3fs==2021.11.1 --force-reinstall

Collecting s3fs==2021.11.1
  Downloading s3fs-2021.11.1-py3-none-any.whl (25 kB)
Collecting aiobotocore~=2.0.1
  Downloading aiobotocore-2.0.1.tar.gz (54 kB)
  Preparing metadata (setup.py) ... done
Collecting fsspec==2021.11.1
  Downloading fsspec-2021.11.1-py3-none-any.whl (132 kB)
Collecting aiohttp<=4
  Downloading aiohttp-3.8.1-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.1 MB)
Collecting botocore<1.22.9,>=1.22.8
  Downloading botocore-1.22.8-py3-none-any.whl (8.1 MB)
Collecting wrapt>=1.10.10
  Downloading wrapt-1.13.3-cp37-cp

**Don't forget to restart your kernel AND rerun the cells from the top of the notebook (skipping the pip install cell)!**
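
After the restart, it can be worth confirming which versions are actually installed. Here is a minimal sketch using the standard library's `importlib.metadata`, which is available from Python 3.8. On a Python 3.7 kernel, like the one in the install log above, the `importlib_metadata` backport would be needed instead. `installed_version` is a hypothetical helper, and the package names are taken from the install log.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Packages named in the install log above.
for name in ("s3fs", "fsspec", "aiobotocore", "botocore"):
    print(f"{name}: {installed_version(name)}")
```

If s3fs still reports a version other than the pinned one, the kernel was probably not restarted after the install.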

## Using the s3fs package to read in data

The s3fs package lets you open files from S3 the same way you would open files on your local file system. This is especially useful when you want to use other Python packages that might not play nicely with S3.

In [5]:
import s3fs
fs = s3fs.S3FileSystem()

# list the contents of your bucket using an f-string and the previously defined my_bucket variable
fs.ls(f"s3://{my_bucket}/")

['vk297-labdata/NCBirths2004.csv', 'vk297-labdata/StateNames.csv']

In [6]:
# open the file directly
# pass the file handle f into any function that accepts one, such as pandas read_csv
with fs.open(f's3://{my_bucket}/{my_file}') as f:
    df_fs = pd.read_csv(f)

df_fs

Unnamed: 0,ID,MothersAge,Tobacco,Alcohol,Gender,Weight,Gestation,Smoker
0,1,30-34,No,No,Male,3827,40,No
1,2,30-34,No,No,Male,3629,38,No
2,3,35-39,No,No,Female,3062,37,No
3,4,20-24,No,No,Female,3430,39,No
4,5,25-29,No,No,Male,3827,38,No
...,...,...,...,...,...,...,...,...
1004,1005,35-39,No,No,Male,3799,39,No
1005,1006,20-24,No,No,Male,2835,39,No
1006,1007,15-19,No,No,Male,3260,38,No
1007,1008,20-24,No,No,Female,2637,41,No
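
Because `fs.open` hands back an ordinary file-like object, any function written against the file interface can consume it, not just pandas. The sketch below shows such a function, exercised with an in-memory `io.BytesIO` stand-in instead of a real S3 object. `count_data_rows` is a hypothetical helper, and it assumes the handle is opened in binary mode and the file is UTF-8 encoded.

```python
import csv
import io


def count_data_rows(handle, encoding: str = "utf-8") -> int:
    """Count CSV rows after the header, given a binary file-like object."""
    # Wrap the binary handle so the csv module receives text.
    text = io.TextIOWrapper(handle, encoding=encoding, newline="")
    reader = csv.reader(text)
    next(reader, None)  # skip the header row, if there is one
    return sum(1 for _ in reader)


# In-memory stand-in for the handle produced by `with fs.open(...) as f:`
fake_file = io.BytesIO(b"ID,Weight\r\n1,3827\r\n2,3629\r\n3,3062\r\n")
print(count_data_rows(fake_file))
```

With a real bucket, the same function could be called as `count_data_rows(f)` inside the `with fs.open(...)` block.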
