In [4]:
import boto3
import os


# 1. S3 Select API introduction
When frameworks such as Spark or Arrow read objects from S3, they typically retrieve each object in its entirety. For example, if Spark reads a 10 GiB Parquet file from S3, the full 10 GiB is transferred from S3 to the Spark cluster, even if the computation needs only some of the columns and rows. Much of the transferred data is never used, which inflates the I/O cost of the operation.

To avoid this, the S3 Select API lets us retrieve a subset of the data with simple SQL expressions. The filtering happens on the S3 server, and only the data the application needs is returned. This can drastically improve the performance of the operation.

S3 Select is therefore designed only for filtering columns and rows. It is not designed to run complex analytical queries and return their results.

# 2. Limitations of S3 Select
- The SQL expression can be at most 256 KB long.
- A record in the input or in the result can be at most 1 MB long.
- Only a small subset of SQL clauses is supported: SELECT, FROM, WHERE, LIMIT, etc.
- It is not suited to complex analytical queries or joins.
- Only three object formats are currently supported: CSV, JSON, and Apache Parquet.
- A query runs against a single object (file) in the bucket at a time.
- Supported CompressionType values: NONE, GZIP, BZIP2. Default value: NONE.
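
Some of the limits listed above can be checked on the client before a request is sent. The following is a minimal sketch, not part of boto3: `check_select_request` and the three constants are hypothetical names, and it assumes that "256 KB" means 256 * 1024 bytes of UTF-8 encoded SQL.

```python
# Hypothetical pre-flight check for the limits listed above.
# Assumption: "256 KB" is interpreted as 256 * 1024 bytes of UTF-8 text.
MAX_EXPRESSION_BYTES = 256 * 1024
SUPPORTED_FORMATS = {"CSV", "JSON", "Parquet"}
SUPPORTED_COMPRESSION = {"NONE", "GZIP", "BZIP2"}


def check_select_request(expression: str, input_format: str, compression: str = "NONE") -> list:
    """Return a list of human-readable problems; an empty list means no problem was found."""
    problems = []
    size = len(expression.encode("utf-8"))
    if size > MAX_EXPRESSION_BYTES:
        problems.append(f"expression is {size} bytes, limit is {MAX_EXPRESSION_BYTES}")
    if input_format not in SUPPORTED_FORMATS:
        problems.append(f"unsupported input format: {input_format}")
    if compression not in SUPPORTED_COMPRESSION:
        problems.append(f"unsupported compression type: {compression}")
    return problems
```

A check like this only covers the limits that are visible on the client side. The record size limit and the supported SQL clauses are still enforced by the server.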


The first four limitations are expected, because S3 Select is not designed for handling complex d

In [14]:
key_id=os.getenv("AWS_ACCESS_KEY_ID")
secret=os.getenv("AWS_SECRET_ACCESS_KEY")
session_token=os.getenv("AWS_SESSION_TOKEN")
endpoint=os.getenv("AWS_S3_ENDPOINT")

# print(f"key id: {key_id}")
# print(f"key secret: {secret}")
# print(f"session token: {session_token}")

In [15]:
s3_client = boto3.client(
    's3',
    endpoint_url=f'https://{endpoint}',
    aws_access_key_id=key_id,
    aws_secret_access_key=secret,
    aws_session_token=session_token
)

response = s3_client.list_buckets()

# Output the bucket names
print('Existing buckets:')
for bucket in response['Buckets']:
    print(f'  {bucket["Name"]}')

Existing buckets:
  donnees-insee
  pengfei
  projet-relevanc
  projet-spark-lab


In [None]:
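def fetch_csv(bucket: str, path: str, query: str):
    """Run an S3 Select SQL query against an uncompressed CSV object that has a header row."""
    # FileHeaderInfo='Use' lets the query refer to columns by their header names.
    response = s3_client.select_object_content(
        Bucket=bucket,
        Key=path,
        ExpressionType='SQL',
        Expression=query,
        InputSerialization={'CSV': {'FileHeaderInfo': 'Use'}, 'CompressionType': 'NONE'},
        OutputSerialization={'CSV': {}},
    )
    return response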


In [34]:
# read csv with s3 select.
bucket="pengfei"
path="diffusion/data_format/netflix.csv"
q1="SELECT * FROM s3object s where s.\"rating\" = '5' limit 5"


In [33]:
resp1 = fetch_csv(bucket, path, q1)

for event in resp1['Payload']:
    if 'Records' in event:
        records = event['Records']['Payload'].decode('utf-8')
        print(records)
    elif 'Stats' in event:
        statsDetails = event['Stats']['Details']
        print("Stats details bytesScanned: ")
        print(statsDetails['BytesScanned'])
        print("Stats details bytesProcessed: ")
        print(statsDetails['BytesProcessed'])
        print("Stats details bytesReturned: ")
        print(statsDetails['BytesReturned'])

822109,5,2005-05-13
2207774,5,2005-06-06
372233,5,2005-11-23
814701,5,2005-09-29
662870,5,2005-08-24

Stats details bytesScanned: 
4194304
Stats details bytesProcessed: 
4194304
Stats details bytesReturned: 
101
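
The event loop above can be wrapped in a small helper that returns the records and the stats instead of printing them. This is a sketch under assumptions: `collect_select_result` and `returned_fraction` are hypothetical names, and the helper only relies on the `Records` and `Stats` event shapes used in the loop above. It joins the raw payload bytes before decoding, because a multi-byte UTF-8 character may be split across two `Records` events.

```python
def collect_select_result(event_stream):
    """Gather all record payloads and the final stats from a select_object_content event stream."""
    chunks = []
    stats = {}
    for event in event_stream:
        if "Records" in event:
            # Keep raw bytes; decoding each chunk separately could break a split character.
            chunks.append(event["Records"]["Payload"])
        elif "Stats" in event:
            stats = dict(event["Stats"]["Details"])
    text = b"".join(chunks).decode("utf-8")
    return text, stats


def returned_fraction(stats):
    """Fraction of scanned bytes that were actually returned to the client."""
    scanned = stats.get("BytesScanned", 0)
    if not scanned:
        return 0.0
    return stats.get("BytesReturned", 0) / scanned
```

It would be called as `text, stats = collect_select_result(resp1['Payload'])`. With the stats printed above, 101 of the 4194304 scanned bytes were returned, which is about 0.0024% of the scanned data.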


In [29]:
# read parquet with s3 select.
data_path="diffusion/data_format/sf_fire/parquet/arrow_sf_fire_none/f402f99cb6d9459696314909b6f6e0a3.parquet"

resp = s3_client.select_object_content(
    Bucket='pengfei',
    Key=data_path,
    ExpressionType='SQL',
    Expression="SELECT * FROM s3object limit 5",
    InputSerialization = {'Parquet': {}, 'CompressionType': 'NONE'},
    OutputSerialization = {'CSV': {}},
)

ClientError: An error occurred (InternalError) when calling the SelectObjectContent operation (reached max retries: 4): We encountered an internal error, please try again.: cause(parquet format parsing not enabled on server)
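
The failure above is raised as a `ClientError`, so a notebook can catch it and inspect the error code and message instead of stopping with a traceback. The sketch below is one possible way to do that: `try_select` is a hypothetical helper, and it assumes the client exposes `exceptions.ClientError` and that the exception carries a `response` dictionary with an `Error` entry, as boto3 clients do.

```python
def try_select(client, **kwargs):
    """Call select_object_content; return (response, None) on success or (None, error_info) on failure."""
    try:
        return client.select_object_content(**kwargs), None
    except client.exceptions.ClientError as exc:
        # The error details live under the 'Error' key of the response dictionary.
        err = exc.response.get("Error", {})
        return None, {"code": err.get("Code"), "message": err.get("Message")}
```

For the Parquet request above, this would be called with the same keyword arguments, for example `resp, err = try_select(s3_client, Bucket='pengfei', Key=data_path, ...)`. A non-empty `err` whose message mentions that Parquet parsing is not enabled indicates a server-side configuration issue rather than a problem with the query itself.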