* Upload data folder to AWS s3
* Overview of Pagination
* Review Marker and MaxKeys
* Develop Pagination using Marker and MaxKeys
* Overview of AWS s3 Paginator
* Develop Pagination using Paginator
* Exercise and Solution - Paginate AWS s3 Objects

* Upload data folder to AWS s3

Upload the local `data` folder to the s3 bucket using the AWS CLI.

```shell
aws s3 cp data/ \
    s3://itvawsdata/ \
    --recursive \
    --profile itvdev1
```

* Overview of Pagination

A single `list_objects` call returns at most 1,000 objects, so a larger bucket has to be read one page at a time. `MaxKeys` lowers that page size, which makes pagination easy to observe on a small bucket.

In [None]:
bucket = 'itvawsdata'

In [None]:
import boto3

In [None]:
s3_client = boto3.client('s3')

In [None]:
s3_objects = s3_client.list_objects(Bucket=bucket)

In [None]:
len(s3_objects['Contents'])

In [None]:
s3_objects = s3_client.list_objects(
    Bucket=bucket,
    MaxKeys=10
)

In [None]:
len(s3_objects['Contents'])

* Review Marker and MaxKeys

In [None]:
import boto3

In [None]:
s3_client = boto3.client('s3')

In [None]:
s3_objects = s3_client.list_objects(
    Bucket=bucket,
    MaxKeys=10
)

In [None]:
s3_objects['Contents']

In [None]:
s3_objects['Marker']

In [None]:
s3_objects['MaxKeys']

* Develop Pagination using Marker and MaxKeys

In [None]:
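all_objects = []
marker = ''  # An empty marker starts the listing from the first key

while True:
    print(marker)
    # Fetch the next page of up to 10 keys that sort after the marker
    s3_objects = s3_client.list_objects(
        Bucket=bucket,
        MaxKeys=10,
        Marker=marker
    )
    # A response without 'Contents' means there are no more keys
    if s3_objects.get('Contents') is None:
        break
    all_objects.extend(s3_objects['Contents'])
    # The last key seen becomes the marker for the next request
    marker = all_objects[-1]['Key']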

In [None]:
len(all_objects)

In [None]:
all_objects[:10]
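
The loop above only stops after one extra request comes back without `Contents`. As a minimal sketch, assuming the `IsTruncated` flag in the response is used as the stop condition instead, the same logic can be wrapped in a function. The name `list_all_objects` and its `page_size` parameter are illustrative and not part of boto3.

```python
def list_all_objects(s3_client, bucket, page_size=10):
    """Collect every object in a bucket by passing Marker manually."""
    all_objects = []
    marker = ''  # An empty marker starts from the first key
    while True:
        response = s3_client.list_objects(
            Bucket=bucket,
            MaxKeys=page_size,
            Marker=marker
        )
        contents = response.get('Contents', [])
        all_objects.extend(contents)
        # IsTruncated stays True while more keys remain after this page
        if not response.get('IsTruncated') or not contents:
            break
        # Without a Delimiter, the last key of the page is the next Marker
        marker = contents[-1]['Key']
    return all_objects
```

With the client created earlier, `len(list_all_objects(s3_client, bucket))` should match the count from the loop above while making one request fewer.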

* Overview of AWS s3 Paginator

1. There is no need to pass `Marker` yourself; the paginator carries it from one request to the next.
2. There is no need to worry about the number of keys or objects in s3.
3. We can specify a bucket and a prefix to paginate using the AWS s3 paginator.
4. The code is cleaner with the AWS s3 paginator than with Marker and MaxKeys.

In [None]:
paginator = s3_client.get_paginator('list_objects')
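
The points above can be sketched as a small generator. This is a hedged example: `iter_keys` and its parameters are hypothetical names, while `get_paginator`, `paginate`, and the `PageSize` entry of `PaginationConfig` are the boto3 features it relies on.

```python
def iter_keys(s3_client, bucket, prefix='', page_size=100):
    """Yield every object key under a prefix, one page at a time."""
    paginator = s3_client.get_paginator('list_objects')
    pages = paginator.paginate(
        Bucket=bucket,
        Prefix=prefix,
        # PageSize controls how many keys each underlying request asks for
        PaginationConfig={'PageSize': page_size}
    )
    for page in pages:
        # A page has no 'Contents' when nothing matches the prefix
        for obj in page.get('Contents', []):
            yield obj['Key']
```

Because it is a generator, `list(iter_keys(s3_client, bucket))` materializes all keys, while a plain `for` loop processes them without holding the full list in memory.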

* Develop Pagination using Paginator

1. Create a response iterator.
2. Iterate over the response iterator to read the object details on each page.

Note: Remove all the objects from s3://itvawsdata so that you can work through the exercise without any issues.

```shell
aws s3 rm s3://itvawsdata/ --recursive --profile itvdev1
```

In [None]:
import boto3

In [None]:
import os

In [None]:
os.environ.setdefault('AWS_PROFILE', 'itvdev1')

In [None]:
s3_client = boto3.client('s3')

In [None]:
bucket = input('Enter a bucket name: ')

In [None]:
paginator = s3_client.get_paginator('list_objects')

In [None]:
all_objects = []

In [None]:
response_iterator = paginator.paginate(
    Bucket=bucket
)

In [None]:
for response in response_iterator:
    # 'Contents' is missing when the bucket is empty, so fall back to an empty list
    all_objects.extend([item['Key'] for item in response.get('Contents', [])])

In [None]:
len(all_objects)

In [None]:
all_objects[:10]

* Exercise - Paginate AWS s3 Objects

1. Set up the NYSE data with one file per day in s3 (run the provided code as demonstrated).
  * s3 location: `s3://itvawsdata/nyse_all/nyse_data/`
2. Use the s3 Paginator to get the total number of files as well as the total size, in the form of a tuple.
3. Read the data for January 10, 2007 into a Pandas DataFrame - `s3://itvawsdata/nyse_all/nyse_data/NYSE_2007010.csv`

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv(
    'data/nyse_all/nyse_data/NYSE_2007.txt.gz', 
    compression='gzip',
    header=None,
    names=['ticker', 'trade_date', 'open', 'high', 'low', 'close', 'volume']
)

In [None]:
df

In [None]:
# Plain sequential code to upload all the data to s3, one file per trade date.
# We will use the app, which relies on multiprocessing, to speed up the upload.
# Go to the apps folder and run the app using "python app.py".
# Make sure to set the environment variables
# BUCKET_NAME and NYSE_DATA_DIR first.

import pandas as pd
import os
import boto3

# AWS Setup
s3 = boto3.client('s3')
bucket = 'itvawsdata'  # Replace with your bucket name

# Assuming you have a list of files in the format 'NYSE_YYYY.txt.gz'
files = os.listdir('data/nyse_all/nyse_data')
files = [f for f in files if f.startswith('NYSE_') and f.endswith('.txt.gz')]

for file in files:
    # Read the gzipped CSV into a DataFrame
    df = pd.read_csv(
        os.path.join('data/nyse_all/nyse_data', file),
        compression='gzip',
        header=None,
        names=['ticker', 'trade_date', 'open', 'high', 'low', 'close', 'volume']
    )

    # Assuming trade_date is in the format YYYY-MM-DD
    # Extract unique dates in that file
    unique_dates = df['trade_date'].unique()

    # Save to separate files based on the unique date
    for trade_date in unique_dates:
        date_df = df[df['trade_date'] == trade_date]

        # Convert the DataFrame to CSV content
        csv_content = date_df.to_csv(index=False)

        # Define the key (path) in the S3 bucket
        key = f'nyse_all/nyse_data/{trade_date}.csv'

        # Upload the CSV content to S3
        s3.put_object(Body=csv_content, Bucket=bucket, Key=key, ContentType='text/csv')
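
The code above uploads one file at a time. As a rough sketch of how the uploads could run concurrently, the function below uses a thread pool from the standard library. It is not the course's `app.py`, and it swaps threads in for multiprocessing; `upload_all`, `csv_by_key`, and `max_workers` are hypothetical names chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def upload_all(s3_client, bucket, csv_by_key, max_workers=8):
    """Upload each {key: csv_content} pair concurrently; return the uploaded keys."""
    def upload(item):
        key, body = item
        s3_client.put_object(
            Body=body,
            Bucket=bucket,
            Key=key,
            ContentType='text/csv'
        )
        return key

    # pool.map returns results in the same order as the input items
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(upload, csv_by_key.items()))
```

One way to use it is to build `csv_by_key` inside the loop over `unique_dates` (key to CSV content) and call `upload_all(s3, bucket, csv_by_key)` once per input file.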


* Solution - Paginate AWS s3 Objects

1. Set up the NYSE data with one file per day in s3 (run the provided code as demonstrated).
  * s3 location: `s3://itvawsdata/nyse_all/nyse_data/`
2. Use the s3 Paginator to get the total number of files as well as the total size in MB.
3. Read the data for January 10, 2007 into a Pandas DataFrame - `s3://itvawsdata/nyse_all/nyse_data/NYSE_2007010.csv`

In [None]:
import boto3

In [None]:
s3_client = boto3.client('s3')

In [None]:
paginator = s3_client.get_paginator('list_objects')

In [None]:
all_objects = []

In [None]:
response_iterator = paginator.paginate(
    Bucket=bucket,
    Prefix='nyse_all/nyse_data/'
)

In [None]:
for response in response_iterator:
    # A page without 'Contents' has no objects, so add nothing for it
    all_objects.extend(response.get('Contents', []))

In [None]:
file_count = len(all_objects)
total_size = sum(obj['Size'] for obj in all_objects)


In [None]:
file_count

In [None]:
# Total size in MB (Size is reported in bytes)
total_size / 1024 / 1024
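
The exercise asks for the file count and total size as a tuple. A minimal sketch of that, assuming the counts are accumulated page by page rather than by storing every object first, could look like this. The function name `get_count_and_size` is hypothetical.

```python
def get_count_and_size(s3_client, bucket, prefix=''):
    """Return (file_count, total_size_in_bytes) for all objects under a prefix."""
    paginator = s3_client.get_paginator('list_objects')
    file_count = 0
    total_size = 0
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # Pages without 'Contents' contribute nothing to the totals
        for obj in page.get('Contents', []):
            file_count += 1
            total_size += obj['Size']
    return file_count, total_size
```

For example, `file_count, total_size = get_count_and_size(s3_client, bucket, 'nyse_all/nyse_data/')` followed by `total_size / 1024 / 1024` gives the size in MB.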