**ATHENA WITH PYTHON3**
===

In [None]:
import io
import os

import boto3
import pandas

s3 = boto3.client('s3', region_name='eu-west-1')
athena = boto3.client('athena', region_name='eu-west-1')
bucket = os.getenv('USER')+"-"+os.getenv('WORKSHOP')+"-aws-bigdata-workshop"
table = "tweets-hillary-"+os.getenv('USER')+os.getenv('WORKSHOP')
print("Table NAME : "+table)

def do_query(query):
    response = athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={
            'Database': 'default'
        },
        ResultConfiguration={
            'OutputLocation': 's3://'+bucket+'/athena/',
        }
    )
    print("QueryExecutionId "+response['QueryExecutionId'])
    return response['QueryExecutionId']

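def get_result(rid):
    try:
        # This call fails while the query is still running, so it doubles as a readiness check
        athena.get_query_results(
            QueryExecutionId=rid
        )
        # Athena writes the result set to <OutputLocation>/<QueryExecutionId>.csv
        key = 'athena/'+rid+'.csv'
        obj = s3.get_object(Bucket=bucket, Key=key)
        df = pandas.read_csv(io.BytesIO(obj['Body'].read()), encoding='utf8', lineterminator='\n')
        return df
    except Exception as e:
        print("error or Query has not yet finished: "+str(e))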
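
`get_result` only works once the query has finished, so re-running the cell by hand is the default way to wait. A minimal sketch of an alternative, assuming you prefer to poll the query state with `get_query_execution`; the helper name `wait_for_query` and its `poll_seconds` / `max_polls` parameters are hypothetical and not part of the workshop code:

```python
import time

def wait_for_query(athena_client, query_id, poll_seconds=1.0, max_polls=60):
    """Poll Athena until the query leaves the QUEUED/RUNNING states.

    athena_client is expected to be a boto3 Athena client (or anything
    exposing get_query_execution with the same response shape).
    """
    for _ in range(max_polls):
        execution = athena_client.get_query_execution(QueryExecutionId=query_id)
        state = execution['QueryExecution']['Status']['State']
        if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
            return state
        time.sleep(poll_seconds)
    raise TimeoutError("query " + query_id + " did not finish in time")
```

With the clients defined above, this could be used as `wait_for_query(athena, r)` before calling `get_result(r)`.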


In [None]:
create = " \
    CREATE EXTERNAL TABLE IF NOT EXISTS default."+table+" ( \
      `id` bigint, \
      `name` string, \
      `message` string, \
      `ts` bigint, \
      `isodate` timestamp, \
      `date` string \
    ) \
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' \
    WITH SERDEPROPERTIES ( \
      'serialization.format' = ',', \
      'field.delim' = '|' \
    ) LOCATION 's3://aws-potus-eu-west-1/split/Hillary/' \
    TBLPROPERTIES ('has_encrypted_data'='false'); \
    "
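
The backslash continuations above work, but they are easy to break when editing. As a sketch only, the same statement could be built with a triple-quoted f-string; `build_create_table` is a hypothetical helper name, and the column list and SerDe properties are copied from the cell above:

```python
def build_create_table(table, location, database='default'):
    """Return the CREATE EXTERNAL TABLE statement for the pipe-delimited tweets."""
    return f"""
    CREATE EXTERNAL TABLE IF NOT EXISTS {database}.{table} (
      `id` bigint,
      `name` string,
      `message` string,
      `ts` bigint,
      `isodate` timestamp,
      `date` string
    )
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
    WITH SERDEPROPERTIES (
      'serialization.format' = ',',
      'field.delim' = '|'
    ) LOCATION '{location}'
    TBLPROPERTIES ('has_encrypted_data'='false');
    """
```

For this notebook the call would be `build_create_table(table, 's3://aws-potus-eu-west-1/split/Hillary/')`.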

In [None]:
r = do_query(create)

In [None]:
r = do_query("select count(distinct id) from "+table+" where message like '%trump%';")

In [None]:
df = get_result(r)
df.head()

In [None]:
r = do_query("select * from "+table+" where message like '%trump%';")

In [None]:
df = get_result(r)
df.head()

In [None]:
r = do_query("select count(distinct id) from "+table+" ;")

In [None]:
df = get_result(r)
df.head()
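
The cells above read the CSV that Athena writes to S3. For small result sets, the rows can also be read from the `get_query_results` response itself. A hedged sketch, assuming a SELECT query (where the first returned row holds the column names); `rows_from_pages` is a hypothetical helper, not part of the workshop code:

```python
def rows_from_pages(pages):
    """Flatten get_query_results pages into (header, rows).

    pages is any iterable of response dicts, for example the iterator
    returned by a boto3 paginator for get_query_results.
    """
    header, rows = None, []
    for page in pages:
        for row in page['ResultSet']['Rows']:
            # NULL cells have no VarCharValue key, so .get() yields None for them
            values = [cell.get('VarCharValue') for cell in row['Data']]
            if header is None:
                header = values  # assumption: first row is the header (SELECT queries)
            else:
                rows.append(values)
    return header, rows
```

A possible usage with the client defined earlier: `rows_from_pages(athena.get_paginator('get_query_results').paginate(QueryExecutionId=r))`. All values come back as strings, so the pandas route remains more convenient for analysis.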

**ALTERNATIVE : ATHENA WITH AWS CONSOLE**
===

Steps :

    Go to https://console.aws.amazon.com/athena/home?region=us-east-1#
    In the query editor
    Left click "Add table"
    In the dropdown choose "Create a new table"
    Choose the database name {{user}}
    Choose a name for your first table "clinton"
    Choose the path as s3://aws-potus/split/Hillary/
    On the data format tab choose "Text File with Custom Delimiters"
        then "Other" as "Field delimiter" and | as value
    On the columns tab define
        id : bigint / user : string / message : string / ts : bigint / date : timestamp
        no changes for partition
    To finish, click "Create table"

Requests :

    select count(distinct id) from clinton where message like '%obama%';
    select * from clinton limit 100;
    select count(distinct id) from clinton;
    select count(distinct id) from clinton where message like '%obama%';



**ALTERNATIVES : HIVE and DEMO on EMR**
===

Init
 - `aws emr create-default-roles --region eu-west-1`
 - `aws emr create-cluster --release-label emr-5.4.0 --name "EMR cluster" --applications Name=Hue Name=Hive Name=Presto --instance-type m3.xlarge --use-default-roles --instance-count 3 --region eu-west-1 --ec2-attributes KeyName=YOUR-KEY-PAIR`
 - Get your public IP with https://www.google.fr/search?hl=en&safe=off&q=what+is+my+ip
 - Modify the security group of your master node to authorize your IP for SSH (port 22)
 - `ssh -i YOUR-PEM hadoop@IP`
 - `hive`
 - ``CREATE EXTERNAL TABLE clinton (`id` bigint,`user` string,`message` string,`ts` bigint,`date` timestamp) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n' LOCATION 's3://aws-potus-eu-west-1/split/Hillary/';``
 - `exit;`
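
For readers who stay in Python, the `aws emr create-cluster` command above roughly maps to boto3's `run_job_flow`. The sketch below only builds the request; `cluster_request` is a hypothetical helper, and the role names assume the defaults created by `aws emr create-default-roles`:

```python
def cluster_request(key_name, name="EMR cluster"):
    """Build keyword arguments for the EMR run_job_flow call."""
    return {
        'Name': name,
        'ReleaseLabel': 'emr-5.4.0',
        'Applications': [{'Name': app} for app in ('Hue', 'Hive', 'Presto')],
        'Instances': {
            'MasterInstanceType': 'm3.xlarge',
            'SlaveInstanceType': 'm3.xlarge',
            'InstanceCount': 3,
            'Ec2KeyName': key_name,
            # keep the cluster up for interactive Hive / Presto sessions
            'KeepJobFlowAliveWhenNoSteps': True,
        },
        # assumption: default roles from `aws emr create-default-roles`
        'JobFlowRole': 'EMR_EC2_DefaultRole',
        'ServiceRole': 'EMR_DefaultRole',
    }
```

It could then be sent with `boto3.client('emr', region_name='eu-west-1').run_job_flow(**cluster_request('YOUR-KEY-PAIR'))`, whose response contains the `JobFlowId` of the new cluster.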

Hive
 - `hive`
 - `select count(distinct id) from clinton where message like '%obama%';`
 - `select * from clinton limit 100;`
 - `select count(distinct id) from clinton;`
 - `select count(distinct id) from clinton where message like '%obama%';`
 - `exit;`

Presto
 - `presto-cli --catalog hive --schema default`
 - `select count(distinct id) from clinton where message like '%obama%';`
 - `select * from clinton limit 100;`
 - `select count(distinct id) from clinton;`
 - `select count(distinct id) from clinton where message like '%obama%';`
 - `exit`
 
HUE
 - Open the master node's security group to your IP on port 8888
 - Go to http://EMR-MASTER-IP:8888
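
The security group changes (port 22 for SSH, port 8888 for HUE) can also be scripted. A minimal sketch, assuming you already know the master node's security group ID; `ingress_rule` is a hypothetical helper that only builds the arguments for EC2's `authorize_security_group_ingress`:

```python
import ipaddress

def ingress_rule(group_id, my_ip, port):
    """Build kwargs that open one TCP port to a single IPv4 address."""
    ipaddress.IPv4Address(my_ip)  # raises ValueError on a malformed address
    return {
        'GroupId': group_id,
        'IpProtocol': 'tcp',
        'FromPort': port,
        'ToPort': port,
        'CidrIp': my_ip + '/32',  # /32 restricts the rule to this one address
    }
```

Usage would look like `boto3.client('ec2', region_name='eu-west-1').authorize_security_group_ingress(**ingress_rule(GROUP_ID, YOUR_IP, 8888))`, where `GROUP_ID` and `YOUR_IP` are placeholders.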
 
 **Time to delete your EMR Cluster :)**
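
Deleting the cluster can be done from the console or, as sketched here, with boto3. `terminate_cluster` is a hypothetical helper; it expects a boto3 EMR client and the cluster ID (the `j-...` value returned when the cluster was created):

```python
def terminate_cluster(emr_client, cluster_id):
    """Request termination and return the cluster state reported right after."""
    emr_client.terminate_job_flows(JobFlowIds=[cluster_id])
    description = emr_client.describe_cluster(ClusterId=cluster_id)
    return description['Cluster']['Status']['State']
```

For example: `terminate_cluster(boto3.client('emr', region_name='eu-west-1'), CLUSTER_ID)`, with `CLUSTER_ID` as a placeholder for your own cluster.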
