# Download the data to your S3 buckets

**Be sure to subscribe to the [product](https://aws.amazon.com/marketplace/pp/prodview-v3o7zrt6okwmo) before running this notebook.**

The main dataset is on the Amazon Web Services Data Exchange (ADX), and it contains:

- 60,000+ CS:GO matches
- 2,000,000+ files
- 150+ days of data
- 2TB+ of data

Each day's worth of data is published as a separate revision on the ADX, so you can control the download volume one day at a time. *On average*, each day is **13GB** and contains **350 matches**.

The number of matches in a revision ranges from 200 to 1500, so if a date range yields too few matches you can widen it and download more.
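
As a rough guide for choosing a date range, the averages above can be turned into a back-of-the-envelope estimate. The sketch below assumes one revision per day and uses only the stated averages, so the real volume may differ noticeably; `estimate_download` is a hypothetical helper, not part of the dataset tooling.

```python
from datetime import date

AVG_GB_PER_DAY = 13         # average revision size quoted above
AVG_MATCHES_PER_DAY = 350   # average matches per revision quoted above


def estimate_download(begin: date, end: date) -> dict:
    """Rough estimate for an inclusive date range, assuming one revision per day."""
    days = (end - begin).days + 1  # +1 because both endpoints are included
    return {
        "days": days,
        "approx_gb": days * AVG_GB_PER_DAY,
        "approx_matches": days * AVG_MATCHES_PER_DAY,
    }


# Example: the month of April 2022
estimate = estimate_download(date(2022, 4, 1), date(2022, 4, 30))
print(estimate)
```

Because individual revisions range from 200 to 1500 matches, treat the result as an order-of-magnitude figure rather than a guarantee.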

_**Edit the following in the cell below**_
- **Bucket name** for where you want the data
- **Date range** to choose how many days' worth of existing data to download
- **Flag** for whether you want to download new data automatically (there is a new revision every day)

In [None]:
dest_bucket = 'my-bucket'
begin_date = '2021-12-01T00:00:00.000Z'  # inclusive
end_date = '2022-05-31T00:00:00.000Z'  # inclusive
auto_download_new_data = True

## One month's worth of data
# begin_date = '2022-04-01T00:00:00.000Z'  # inclusive
# end_date = '2022-04-30T00:00:00.000Z'  # inclusive

In [None]:
from pureskillgg_makenew_pyskill.notebook import setup_notebook

In [None]:
setup_notebook()

In [None]:
import boto3
import dateutil.parser

In [None]:
begin_date_dt = dateutil.parser.isoparse(begin_date)
end_date_dt = dateutil.parser.isoparse(end_date)
data_set_id = 'f49be2ef387af522a7b6f000158113e0'

In [None]:
client = boto3.client('dataexchange')

In [None]:
response = client.list_data_set_revisions(DataSetId=data_set_id)
revisions = response['Revisions']
next_token = response.get('NextToken')
# Results are paginated: keep requesting pages until no NextToken is returned
while next_token is not None:
    response = client.list_data_set_revisions(DataSetId=data_set_id, NextToken=next_token)
    revisions.extend(response.get("Revisions", []))
    next_token = response.get('NextToken')
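
The token loop above can be wrapped in a small function, which makes it easy to dry-run without AWS credentials. This is a minimal sketch: `list_all_revisions` and `FakeClient` are hypothetical names introduced for illustration, and the fake only imitates the `Revisions` and `NextToken` fields that the cell above relies on.

```python
def list_all_revisions(client, data_set_id):
    """Collect every revision by following NextToken until it is absent."""
    revisions = []
    kwargs = {"DataSetId": data_set_id}
    while True:
        page = client.list_data_set_revisions(**kwargs)
        revisions.extend(page.get("Revisions", []))
        token = page.get("NextToken")
        if token is None:
            return revisions
        kwargs["NextToken"] = token  # request the next page


class FakeClient:
    """Hypothetical stand-in that serves two pages, for a dry run without AWS."""

    def list_data_set_revisions(self, DataSetId, NextToken=None):
        if NextToken is None:
            return {"Revisions": [{"Id": "rev-1"}], "NextToken": "page-2"}
        return {"Revisions": [{"Id": "rev-2"}]}


all_revisions = list_all_revisions(FakeClient(), "example-data-set-id")
print([r["Id"] for r in all_revisions])
```

With a real client, pass the `boto3.client('dataexchange')` object instead of the fake. boto3 may also offer a built-in paginator for this call; `client.can_paginate('list_data_set_revisions')` reports whether one is available.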

In [None]:
def good_revision(revision):
    """Keep finalized revisions dated within [begin_date_dt, end_date_dt]."""
    # Skip revisions that have not been finalized yet
    if not revision['Finalized']:
        return False
    # The revision comment is parsed as an ISO 8601 timestamp
    rev_dt = dateutil.parser.isoparse(revision['Comment'])
    return begin_date_dt <= rev_dt <= end_date_dt

revision_ids = [revision['Id'] for revision in revisions if good_revision(revision)]
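
To see how the filter treats the boundaries, it can be tried on a few made-up revisions. The sketch below assumes, as the cell above does, that each revision's `Comment` holds an ISO 8601 timestamp. It swaps `dateutil` for the standard library's `strptime` so it runs without extra packages, and both `in_date_range` and the sample data are hypothetical.

```python
from datetime import datetime

ISO_FORMAT = "%Y-%m-%dT%H:%M:%S.%f%z"  # matches e.g. 2022-04-01T00:00:00.000Z


def in_date_range(revision, begin_dt, end_dt):
    """Keep finalized revisions whose Comment timestamp lies in [begin_dt, end_dt]."""
    if not revision["Finalized"]:
        return False
    rev_dt = datetime.strptime(revision["Comment"], ISO_FORMAT)
    return begin_dt <= rev_dt <= end_dt


begin_dt = datetime.strptime("2022-04-01T00:00:00.000Z", ISO_FORMAT)
end_dt = datetime.strptime("2022-04-30T00:00:00.000Z", ISO_FORMAT)

# Made-up revisions covering each case the filter distinguishes
sample_revisions = [
    {"Id": "a", "Finalized": True, "Comment": "2022-03-31T00:00:00.000Z"},   # before the range
    {"Id": "b", "Finalized": True, "Comment": "2022-04-01T00:00:00.000Z"},   # on the lower bound
    {"Id": "c", "Finalized": False, "Comment": "2022-04-15T00:00:00.000Z"},  # not finalized
    {"Id": "d", "Finalized": True, "Comment": "2022-04-30T00:00:00.000Z"},   # on the upper bound
]

kept = [r["Id"] for r in sample_revisions if in_date_range(r, begin_dt, end_dt)]
print(kept)
```

Both bounds are inclusive, so revisions stamped exactly at `begin_date` or `end_date` are kept.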

# Transfer Existing Data to S3

In [None]:
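for revision_id in revision_ids:
    # Create one export job per revision; Type is required and must match the Details key
    response = client.create_job(
        Type='EXPORT_REVISIONS_TO_S3',
        Details={
            'ExportRevisionsToS3': {
                'DataSetId': data_set_id,
                'RevisionDestinations': [
                    {
                        'Bucket': dest_bucket,
                        'KeyPattern': '${Asset.Name}',
                        'RevisionId': revision_id
                    },
                ]
            }
        }
    )
    # Accept any 2xx status, since a successful create may not return exactly 200
    status_code = response['ResponseMetadata']['HTTPStatusCode']
    if not 200 <= status_code < 300:
        raise Exception(f"Query raised http error {status_code}")
    # Jobs are created in the WAITING state, so start each one explicitly
    client.start_job(JobId=response['Id'])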
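
Export jobs run asynchronously, so the loop above finishes before the files reach the bucket. One way to confirm that a job has ended is to poll its state with `get_job`. The following is a sketch under assumptions: `wait_for_job` and `FakeJobClient` are hypothetical helpers, the polling interval is arbitrary, and the job ID would come from the `Id` field of the `create_job` response.

```python
import time

# States after which a Data Exchange job will not change any further
TERMINAL_STATES = {"COMPLETED", "ERROR", "CANCELLED", "TIMED_OUT"}


def wait_for_job(client, job_id, poll_seconds=15, max_polls=240):
    """Poll get_job until the job reaches a terminal state, then return that state."""
    for _ in range(max_polls):
        state = client.get_job(JobId=job_id)["State"]
        if state in TERMINAL_STATES:
            return state
        time.sleep(poll_seconds)
    raise TimeoutError(f"Job {job_id} did not finish after {max_polls} polls")


class FakeJobClient:
    """Hypothetical stand-in that walks through a typical state sequence."""

    def __init__(self):
        self._states = iter(["WAITING", "IN_PROGRESS", "COMPLETED"])

    def get_job(self, JobId):
        return {"Id": JobId, "State": next(self._states)}


# Dry run without AWS; poll_seconds=0 avoids sleeping in the example
final_state = wait_for_job(FakeJobClient(), "example-job-id", poll_seconds=0)
print(final_state)
```

With a real client, any final state other than `COMPLETED` means the revision was not fully exported and the job details are worth inspecting.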

# Automatically Transfer New Data to S3

In [None]:
if auto_download_new_data:
    response = client.create_event_action(
        Action={
            'ExportRevisionToS3': {
                'RevisionDestination': {
                    'Bucket': dest_bucket,
                    'KeyPattern': '${Asset.Name}'
                }
            }
        },
        Event={
            'RevisionPublished': {
                'DataSetId': data_set_id
            }
        }
    )
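
Re-running the cell above may register a second event action for the same data set. A check along the following lines could guard against that by listing the existing actions first. It is a sketch, not a definitive implementation: `has_export_action` and `FakeEventClient` are hypothetical, and the response fields are assumed to mirror the request structure used in `create_event_action`.

```python
def has_export_action(client, data_set_id, bucket):
    """Return True if an auto-export action to `bucket` already exists for the data set."""
    kwargs = {"EventSourceId": data_set_id}
    while True:
        page = client.list_event_actions(**kwargs)
        for event_action in page.get("EventActions", []):
            destination = (
                event_action.get("Action", {})
                .get("ExportRevisionToS3", {})
                .get("RevisionDestination", {})
            )
            if destination.get("Bucket") == bucket:
                return True
        token = page.get("NextToken")
        if token is None:
            return False
        kwargs["NextToken"] = token  # request the next page


class FakeEventClient:
    """Hypothetical stand-in returning one existing action, for a dry run without AWS."""

    def list_event_actions(self, EventSourceId, NextToken=None):
        action = {
            "ExportRevisionToS3": {
                "RevisionDestination": {"Bucket": "my-bucket", "KeyPattern": "${Asset.Name}"}
            }
        }
        return {"EventActions": [{"Id": "example-action-id", "Action": action}]}


already_exists = has_export_action(FakeEventClient(), "example-data-set-id", "my-bucket")
other_bucket = has_export_action(FakeEventClient(), "example-data-set-id", "another-bucket")
print(already_exists, other_bucket)
```

In the notebook, the `create_event_action` call could then be skipped when `has_export_action(client, data_set_id, dest_bucket)` returns `True`.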