In [1]:
! pip install awswrangler

Collecting awswrangler
  Downloading https://files.pythonhosted.org/packages/b0/9d/ab160a0857e80ab143f4a81abb5fa28b1a325ec8f660fd2a0ac455924247/awswrangler-0.0.25.tar.gz (44kB)
Collecting numpy~=1.17.4 (from awswrangler)
  Downloading https://files.pythonhosted.org/packages/d2/ab/43e678759326f728de861edbef34b8e2ad1b1490505f20e0d1f0716c3bf4/numpy-1.17.4-cp36-cp36m-manylinux1_x86_64.whl (20.0MB)
Collecting pandas~=0.25.3 (from awswrangler)
  Downloading https://files.pythonhosted.org/packages/52/3f/f6a428599e0d4497e1595030965b5ba455fd8ade6e977e3c819973c4b41d/pandas-0.25.3-cp36-cp36m-manylinux1_x86_64.whl (10.4MB)
Collecting pyarrow~=0.15.1 (from awswrangler)
  Downloading https://files.pythonhosted.org/packages/dd/77/5865b367a6792

## 1. Upload data to S3

First, create a bucket for this experiment. Then copy the data from the public location used in the download step to your own S3 bucket.

You can create a bucket from the <a href='https://s3.console.aws.amazon.com/s3/home?region=us-east-1'>S3 console</a>.
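
If you would rather create the bucket from the notebook than from the console, the following is a minimal sketch with boto3. It assumes the credentials in use are allowed to create buckets. The helper names `bucket_kwargs` and `create_bucket` and the default region are illustrative and not part of the original lab.

```python
def bucket_kwargs(bucket_name, region="us-east-1"):
    """Build the keyword arguments for the S3 create_bucket call.

    us-east-1 is the default location and must not be passed as a
    LocationConstraint; any other region has to be named explicitly.
    """
    kwargs = {"Bucket": bucket_name}
    if region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs


def create_bucket(bucket_name, region="us-east-1"):
    """Create the bucket (hypothetical helper; needs AWS credentials)."""
    import boto3  # imported here so bucket_kwargs stays usable offline

    s3 = boto3.client("s3", region_name=region)
    return s3.create_bucket(**bucket_kwargs(bucket_name, region))
```

In the notebook you would call `create_bucket(your_bucket)` once, before the upload cell.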

To make the crawler's work easier, we will use two different prefixes (folders): one for the billing information and one for the reseller information.
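
The prefix layout can be sketched as a small mapping from file name to S3 key. The `DATASETS` dictionary and the helpers `s3_key` and `s3_uri` are hypothetical names introduced only for illustration. `posixpath` is used because S3 keys always use `/` as the separator, whatever the local operating system.

```python
import posixpath

# Hypothetical mapping of local file name -> S3 prefix, mirroring the
# two-folder layout described above.
DATASETS = {
    "billing_sm.csv": "billing",
    "reseller_sm.csv": "reseller",
}


def s3_key(filename):
    """Return the object key '<prefix>/<filename>' for a dataset file."""
    return posixpath.join(DATASETS[filename], filename)


def s3_uri(bucket, filename):
    """Return the full s3:// URI of a dataset file inside the given bucket."""
    return f"s3://{bucket}/{s3_key(filename)}"
```

With this layout, each dataset sits alone under its own prefix, which is what lets the crawler treat billing and reseller as separate tables.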



### Download the data

In [2]:
# your bucket name
your_bucket = 'zoomagri-maxi-bucket-sagemaker'

In [3]:
!wget https://ml-lab-mggaska.s3.amazonaws.com/billing_sm.csv
!wget https://ml-lab-mggaska.s3.amazonaws.com/reseller_sm.csv
!wget https://ml-lab-mggaska.s3.amazonaws.com/awswrangler-0.0b2-py3.6.egg

--2019-12-10 18:38:31--  https://ml-lab-mggaska.s3.amazonaws.com/billing_sm.csv
Resolving ml-lab-mggaska.s3.amazonaws.com (ml-lab-mggaska.s3.amazonaws.com)... 52.216.234.51
Connecting to ml-lab-mggaska.s3.amazonaws.com (ml-lab-mggaska.s3.amazonaws.com)|52.216.234.51|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 15803443 (15M) [binary/octet-stream]
Saving to: ‘billing_sm.csv’


2019-12-10 18:38:31 (99.5 MB/s) - ‘billing_sm.csv’ saved [15803443/15803443]

--2019-12-10 18:38:31--  https://ml-lab-mggaska.s3.amazonaws.com/reseller_sm.csv
Resolving ml-lab-mggaska.s3.amazonaws.com (ml-lab-mggaska.s3.amazonaws.com)... 52.216.234.51
Connecting to ml-lab-mggaska.s3.amazonaws.com (ml-lab-mggaska.s3.amazonaws.com)|52.216.234.51|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 210111 (205K) [binary/octet-stream]
Saving to: ‘reseller_sm.csv’


2019-12-10 18:38:31 (31.1 MB/s) - ‘reseller_sm.csv’ saved [210111/210111]

--2019-12-10 18:38:31--  https

In [4]:
import boto3, os
import awswrangler

In [5]:
# Upload each file under its own prefix; S3 keys always use "/" as the separator.
bucket = boto3.Session().resource('s3').Bucket(your_bucket)
bucket.Object('billing/billing_sm.csv').upload_file('billing_sm.csv')
bucket.Object('reseller/reseller_sm.csv').upload_file('reseller_sm.csv')
bucket.Object('python/awswrangler-0.0b2-py3.6.egg').upload_file('awswrangler-0.0b2-py3.6.egg')


## 2. Add Athena full access permissions to SageMaker

In [6]:
from sagemaker import get_execution_role

role = get_execution_role()
print(role)

arn:aws:iam::856165527696:role/service-role/AmazonSageMaker-ExecutionRole-20191210T152860


Go to the <a href='https://console.aws.amazon.com/iam/home?region=us-east-1#/roles'>IAM roles console</a> and attach the Amazon Athena full access policy (AmazonAthenaFullAccess) to this role.
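
The same attachment can be sketched with boto3. This assumes the code runs with credentials that are allowed to modify IAM roles, which the SageMaker execution role itself may not be. The helper names `role_name_from_arn` and `attach_athena_policy` are hypothetical.

```python
# ARN of the AWS managed policy that grants full access to Athena.
ATHENA_POLICY_ARN = "arn:aws:iam::aws:policy/AmazonAthenaFullAccess"


def role_name_from_arn(role_arn):
    """Extract the role name: it is the last '/'-separated segment of the ARN."""
    return role_arn.rsplit("/", 1)[-1]


def attach_athena_policy(role_arn):
    """Attach the Athena policy to the role (hypothetical helper; needs IAM rights)."""
    import boto3  # imported here so role_name_from_arn stays usable offline

    iam = boto3.client("iam")
    iam.attach_role_policy(
        RoleName=role_name_from_arn(role_arn),
        PolicyArn=ATHENA_POLICY_ARN,
    )
```

In the notebook, `attach_athena_policy(role)` would use the ARN printed by the previous cell.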

## 2. Create a Crawler

To use this CSV data in a Glue ETL job, we first have to create a Glue crawler that points to the location of each file. The crawler infers the data type of each column.


1. On the <a href='https://console.aws.amazon.com/iam/home?region=us-east-1#/roles'>IAM roles console</a>, create an IAM role named GlueCrawlerRole with the policies AWSGlueServiceRole and AmazonS3FullAccess.

2. Go to the <a href='https://console.aws.amazon.com/glue/home?region=us-east-1#catalog:tab=crawlers'>Glue crawlers console</a>.

3. Add a crawler with one data store for each S3 location (one for billing and one for reseller):

    3.1 Enter a crawler name, choose S3 as the data store, navigate to your bucket and the /billing folder, and click "Next".
    
    3.2 Select "Yes" to add another data store, navigate to your bucket and the /reseller folder, and click "Next". Select "No" when asked whether to add more data stores, choose the existing IAM role created in step 1, add the database "implementationdb", then click "Next" and "Finish".
    
    3.3 After the crawler is created, select "Run it now".
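
The console steps above can also be sketched with the boto3 Glue client. The crawler name `billing-reseller-crawler` and the helper names are hypothetical. The role name GlueCrawlerRole and the database name implementationdb are taken from the steps above. Treat this as a sketch under those assumptions, not the lab's official method.

```python
def crawler_config(bucket, role="GlueCrawlerRole", database="implementationdb",
                   name="billing-reseller-crawler"):
    """Build the create_crawler arguments: one crawler, two S3 data stores."""
    return {
        "Name": name,          # hypothetical crawler name
        "Role": role,          # IAM role the crawler assumes
        "DatabaseName": database,
        "Targets": {
            "S3Targets": [
                {"Path": f"s3://{bucket}/billing/"},
                {"Path": f"s3://{bucket}/reseller/"},
            ]
        },
    }


def create_and_run_crawler(bucket):
    """Create the crawler and start it (hypothetical helper; needs AWS credentials)."""
    import boto3  # imported here so crawler_config stays usable offline

    glue = boto3.client("glue")
    config = crawler_config(bucket)
    glue.create_crawler(**config)
    glue.start_crawler(Name=config["Name"])
```

Calling `create_and_run_crawler(your_bucket)` corresponds to steps 3.1 through 3.3. The crawler runs asynchronously, so the tables appear in the catalog only after the run finishes.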
    

## 3. Configure Athena query destination

Go to the <a href='https://console.aws.amazon.com/athena/home?force&region=us-east-1#query'>Athena console</a>.

Under Settings in the top right corner set the query results location to s3://YOUR-BUCKET-NAME/athena-results/.

To verify that the crawler created the tables correctly, you can run the following query:
    
    select * from billing limit 3; 
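
The same check can be run from the notebook with the boto3 Athena client. This is a minimal sketch that assumes the results location configured above (`athena-results/` under your bucket) and the database name implementationdb. The helper names `athena_output_location` and `run_athena_query` are hypothetical.

```python
import time


def athena_output_location(bucket):
    """Return the S3 location where Athena writes query results."""
    return f"s3://{bucket}/athena-results/"


def run_athena_query(sql, bucket, database="implementationdb", poll_seconds=1.0):
    """Run a query, wait for it to finish, and return the raw result rows."""
    import boto3  # imported here so athena_output_location stays usable offline

    athena = boto3.client("athena")
    query_id = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": athena_output_location(bucket)},
    )["QueryExecutionId"]

    # Athena queries are asynchronous: poll until a terminal state is reached.
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)

    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query {query_id} ended in state {state}")
    return athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
```

For example, `run_athena_query("select * from billing limit 3", your_bucket)` should return a header row followed by up to three data rows if the billing table exists.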


## 4. Execute a query to create a sample View in Athena

In [7]:

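session = awswrangler.Session()
# Adjacent string literals are concatenated, so each piece needs its own trailing space.
query = ('CREATE VIEW resellers_sample AS SELECT * '
         'FROM billing WHERE id_reseller '
         'IN (SELECT DISTINCT id_reseller FROM reseller TABLESAMPLE BERNOULLI(10))')

# With max_result_size set, read_sql_athena returns a generator rather than a DataFrame.
df = session.pandas.read_sql_athena(
    sql=query,
    database="implementationdb",
    max_result_size=1
)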

In [8]:
df

<generator object Pandas._apply_dates_to_generator at 0x7fb1f6097048>
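
The output above is a generator, not a DataFrame, so nothing is displayed until it is iterated. A minimal sketch for pulling the first chunk from any lazy iterator follows. The helper name `first_chunk` is hypothetical, and whether a `CREATE VIEW` statement yields any chunks at all is not something the output above shows, so the sketch returns `None` when the iterator is empty.

```python
def first_chunk(chunks):
    """Return the first item of a lazy iterator, or None if it yields nothing."""
    return next(iter(chunks), None)
```

In the notebook, `first_chunk(df)` would consume the first chunk of the generator. To inspect the sampled data itself, you would more likely run a separate `SELECT` against the `resellers_sample` view.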