# AWS SETUP

We need credentials to access AWS resources.

<img src="img1.png" width="400px">

Click on **Access Keys (Access Key ID and Secret Access Key)** and generate a new key (**REMEMBER** to save the credentials: the secret key is shown only when it is created!).

<img src="img2.png">



Running on EMR
------------------

By default MrJob runs the job on your own computer using the ``multiprocessing`` module.
To run the processes on Amazon EMR instead, we need to write an **"mrjob.conf"** configuration file and let MrJob load it.

A typical MrJob configuration file for EMR looks like:

``` 
runners:
  emr:
      region: eu-west-1
      aws_access_key_id: KEY
      aws_secret_access_key: SECRET_KEY
      num_core_instances: 4
      instance_type: c1.medium
      ec2_key_pair: EMR
      ec2_key_pair_file: ~/.ssh/EMR.pem 
      ssh_tunnel: true
```
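
The configuration above enables `ssh_tunnel` and points at a key pair file, so it can be worth checking that file before launching a cluster. The following is a minimal sketch, assuming a POSIX system; `check_key_pair_file` is a hypothetical helper written for this example, not part of MrJob.

```python
import os
import stat


def check_key_pair_file(path):
    """Return a list of problems found with an SSH private key file."""
    expanded = os.path.expanduser(path)
    if not os.path.isfile(expanded):
        return ["key file not found: %s" % expanded]
    problems = []
    mode = stat.S_IMODE(os.stat(expanded).st_mode)
    # ssh refuses private keys that group or other users can access.
    if mode & (stat.S_IRWXG | stat.S_IRWXO):
        problems.append("permissions %o are too open, try chmod 600" % mode)
    return problems


if __name__ == "__main__":
    # Path taken from the ec2_key_pair_file entry in the config above.
    for problem in check_key_pair_file("~/.ssh/EMR.pem"):
        print(problem)
```

An empty list means the file exists and is readable only by its owner.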

To **run** the wordcount on EMR we need to tell MrJob to use EMR and which configuration file to load:

```
$ MRJOB_CONF=./mrjob.conf python wordcount.py -r emr lorem.txt
```

### EMR Startup Steps ###

The first step MrJob performs is loading your configuration file:

```
using configs in ./mrjob.conf
```

Then it creates an S3 bucket and uploads *wordcount.py* and the data (lorem.txt) to it:

```
(envCourses) MBP-di-Alex:examples alexcomu$ MRJOB_CONF=./mrjob.conf python firstletter_count.py -r emr lorem.txt 
Using configs in ./mrjob.conf
Auto-created temp S3 bucket mrjob-adfa692a01d6eae0
Using s3://mrjob-adfa692a01d6eae0/tmp/ as our temp dir on S3
Creating temp directory /var/folders/_x/g5brlyv963vclshf_kffdm440000gn/T/solution_03_ex1.alexcomu.20160628.212525.574013
Copying local files to s3://mrjob-adfa692a01d6eae0/tmp/solution_03_ex1.alexcomu.20160628.212525.574013/files/...
Created new cluster j-1WM54P4L1AF4F
Waiting for step 1 of 1 (s-1T3DC5OWYK99Z) to complete...

```

Now that the data is available, it starts the EMR job and the required EC2 machines:

```
PENDING (cluster is STARTING)
PENDING (cluster is STARTING)
...
PENDING (cluster is STARTING)
PENDING (cluster is STARTING: Configuring cluster software)
PENDING (cluster is BOOTSTRAPPING: Running bootstrap actions)
PENDING (cluster is BOOTSTRAPPING: Running bootstrap actions)
RUNNING for 13.7s
RUNNING for 44.1s
...

Attempting to fetch counters from logs...
Waiting for cluster (j-1WM54P4L1AF4F) to terminate...
  TERMINATING: Steps completed
  TERMINATED: Steps completed
Looking for step log in s3://mrjob-adfa692a01d6eae0/tmp/logs/j-1WM54P4L1AF4F/steps/s-1T3DC5OWYK99Z...
  Parsing step log: s3://mrjob-adfa692a01d6eae0/tmp/logs/j-1WM54P4L1AF4F/steps/s-1T3DC5OWYK99Z/syslog.gz  
Removing s3 temp directory s3://mrjob-adfa692a01d6eae0/tmp/solution_03_ex1.alexcomu.20160628.212525.574013/...
Removing temp directory /var/folders/_x/g5brlyv963vclshf_kffdm440000gn/T/solution_03_ex1.alexcomu.20160628.212525.574013...
Removing log files in s3://mrjob-adfa692a01d6eae0/tmp/logs/j-1WM54P4L1AF4F/...
Terminating cluster: j-1WM54P4L1AF4F
```

After ~5 minutes the job completes (locally it took ~10s, which is why we will test most jobs locally) and MrJob downloads the output from S3:

```
Streaming final output from s3://mrjob-adfa692a01d6eae0/tmp/solution_03_ex1.alexcomu.20160628.212525.574013/output/...
"a"	20
"b"	1
"c"	6
"d"	2
"e"	11
"f"	5
"h"	1
"i"	9
"l"	11
"m"	8
"n"	13
"o"	1
"p"	12
"q"	5
"r"	1
"s"	10
"t"	6
"u"	7
"v"	11

```




# Running on BIGDATA

When testing on EMR we let MrJob configure an EMR cluster for us and upload the data files.
While this is fine for one-off tries, it is not feasible when working on big data: the job needs multiple EMR machines that would have to be configured on every run, and big data files that would have to be copied to S3 every time.

To avoid this issue, MrJob can run the code against an existing EMR cluster and S3 bucket.

## Running on existing data


To run on data already on S3, you can pass one or more ``s3://`` URLs to use as data sources.
For example, to run a job against the Twitter Dataset provided for the team work exercise we can use:

```
$ MRJOB_CONF=./mrjob.conf python myscript.py -r emr s3://alexcomu/berlinale_aggregated.csv
```

**When using S3 pay attention to the AWS region**: the S3 bucket must be in the same region that is written inside your **mrjob.conf**. In this case we must ensure that ``region: eu-west-1`` is specified inside the configuration file, as in the example configuration above, or access to the dataset will fail.

**NOTE:** Hadoop can read GZIP-compressed input files, so nothing special is required to work with *2011-02-11.json.XY.gz* files: Hadoop decompresses them automatically and you get the contained JSON as the input.
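
Hadoop handles decompression on the cluster. To peek at such a file on your own machine before launching a job, something like the following can help. It is a minimal standard-library sketch that assumes one JSON object per line; the function name and that layout are assumptions for illustration, not something the dataset description confirms.

```python
import gzip
import json


def iter_json_records(path):
    """Yield one decoded JSON object per non-empty line of a gzip file."""
    # "rt" opens the compressed file in text mode, decompressing on the fly.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)
```

Because it is a generator, records are read one at a time and the whole file never has to fit in memory.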

# Jobs on EMR from WebApps and Runners

We have already seen that runners must be used when running jobs from web apps or other Python scripts. So far we have only used runners locally, providing the input ourselves.
To run the job on EMR and load data from an S3 bucket, we pass the runner the same options we would pass on the command line.
Specifically, if we want to run a word count job on the Twitter dataset, the runner might look like the one in the example below.

## Example of application

We want to analyze the file https://s3-eu-west-1.amazonaws.com/alexcomu/berlinale_aggregated.csv, whose rows look like this (a sample row, followed by the field names):
```
berlinale1;1454331524;995823735;213.61.32.110;10;1580;0;wowza02;instance1;wowza_app;
app-name;timestamp;session-id;client-IP;seconds;kbytes;client-type;server-name;instance;stream-name;
``` 
Let's try to sum the kbytes transmitted per app-name.

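The job, saved as `berlinale.py`:

```
from mrjob.job import MRJob


class MRBerlinaleKBytes(MRJob):

    def mapper(self, _, line):
        # Rows may end with a trailing ';' (as in the sample above), which
        # leaves an extra empty field after split(); keep the first ten.
        (app_name, timestamp_start, session_id,
         client_ip, length_stream, kbyte_transf,
         client_type, server_name, wowza_instance,
         stream_name) = line.split(";")[:10]
        yield app_name, int(kbyte_transf)

    def reducer(self, app_name, kbytes):
        # Add up the kbytes of all rows that share the same app-name.
        yield app_name, sum(kbytes)


if __name__ == '__main__':
    # Needed so the file can run as a script, both locally and on EMR.
    MRBerlinaleKBytes.run()
```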

The runner, saved as `runner.py`:

```
import json
import logging

from berlinale import MRBerlinaleKBytes

logging.basicConfig(level=logging.INFO)

INPUT_FILE = 's3://alexcomu/berlinale_aggregated.csv'

# Same options we would pass on the command line.
mr_job = MRBerlinaleKBytes(args=['-r', 'emr',
                                 '--conf-path', 'mrjob.conf',
                                 INPUT_FILE])

output = {}
with mr_job.make_runner() as runner:
    runner.run()
    # Collect each (app-name, total kbytes) pair emitted by the job.
    for line in runner.stream_output():
        key, value = mr_job.parse_output_line(line)
        output[key] = value

print(json.dumps(output))
```


To run the application:

     python runner.py 

Here is the output:

```
(envCourses) MBP-di-Alex:00_example alexcomu$ python runner.py 
INFO:mrjob.emr:Using s3://mrjob-adfa692a01d6eae0/tmp/ as our temp dir on S3
INFO:mrjob.runner:Creating temp directory /var/folders/_x/g5brlyv963vclshf_kffdm440000gn/T/berlinale.alexcomu.20160628.215720.741936
INFO:mrjob.emr:Copying local files to s3://mrjob-adfa692a01d6eae0/tmp/berlinale.alexcomu.20160628.215720.741936/files/...
INFO:mrjob.emr:Created new cluster j-69OH7QX323LS
INFO:mrjob.emr:Waiting for step 1 of 1 (s-37CO88BBCWNZT) to complete...
INFO:mrjob.emr:  PENDING (cluster is STARTING)
INFO:mrjob.emr:  PENDING (cluster is STARTING)
...
INFO:mrjob.emr:  PENDING (cluster is STARTING)
INFO:mrjob.emr:  PENDING (cluster is BOOTSTRAPPING: Running bootstrap actions)
...
INFO:mrjob.emr:  RUNNING for 7.0s
INFO:mrjob.emr:  RUNNING for 38.1s
...
INFO:mrjob.emr:  RUNNING for 174.5s
INFO:mrjob.emr:  COMPLETED
INFO:mrjob.logs.mixin:Attempting to fetch counters from logs...
INFO:mrjob.emr:Waiting for cluster (j-2SL7Q218ZQ2XW) to terminate...
INFO:mrjob.emr:  TERMINATING: Steps completed
INFO:mrjob.emr:  TERMINATED: Steps completed
INFO:mrjob.emr:Looking for step log in s3://mrjob-adfa692a01d6eae0/tmp/logs/j-2SL7Q218ZQ2XW/steps/s-35VOMK863GN4Z...
INFO:mrjob.emr:  Parsing step log: s3://mrjob-adfa692a01d6eae0/tmp/logs/j-2SL7Q218ZQ2XW/steps/s-35VOMK863GN4Z/syslog.gz
...
...
...
```

The result is:
```
{"berlinale_tc1": 72637117, "berlinale_ext_pke": 400595652, "berlinale_film": 1411678707, "berlinale_rtd": 8916633097, "bal_berlinale_rt_low": 6394, "berlinale_rte": 8654319443, "berlinale_pke": 10233396499, "berlinale_pkd": 9994392877, "berlinale_prt_pkd": 5175364, "berlinale_prt_pke": 6498380, "berlinale_enc": 847054727, "berlinale_ext_pkd": 573904214, "berlinale_pk_low": 15049, "berlinale_tc2": 13644927, "berlinale_rt": 4783701218}

```
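
Once the totals are back as a dictionary, they can be post-processed like any other Python data. As a sketch of one possible follow-up, the snippet below lists the apps with the most traffic. `top_apps` is a hypothetical helper, and only a few entries from the result above are copied in to keep the example short.

```python
import json


def top_apps(totals, count=3):
    """Return the `count` (app_name, kbytes) pairs with the largest totals."""
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:count]


if __name__ == "__main__":
    # A few entries copied from the result above; json.loads stands in for
    # reading the runner's printed output.
    printed = (
        '{"berlinale_pke": 10233396499, "berlinale_pkd": 9994392877, '
        '"berlinale_rtd": 8916633097, "berlinale_rte": 8654319443, '
        '"berlinale_pk_low": 15049}'
    )
    for app_name, kbytes in top_apps(json.loads(printed)):
        print("%-20s %d" % (app_name, kbytes))
```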


So, to run the script on AWS and write the output to S3, we can simply run:

    $ MRJOB_CONF=./mrjob.conf python berlinale.py -r emr -v s3://alexcomu/berlinale_aggregated.csv  --output-dir=s3://YOUR-BUCKET/DESTINATION-FOLDER

