# Your Task

__Task 2: Prepare the data__
__FROM__: Michael Ortiz <br>
__Subject:__ Next Steps: Collect and Prepare Data

Hello,

Now you’re ready to use EMR to collect the web data you will analyze to provide the general sentiment toward the smart phones. The data you will need to collect is in the web database compiled by Common Crawl.

The work you will do in EMR involves running scripts that will access and compile web pages from the Common Crawl that are relevant to smart phones (the most helpful web pages are those that contain a smart phone review), count the number of positive, negative, or uncertain sentiments expressed, and then store this data in an S3 repository folder. As you can imagine, accessing and compiling all this data will take quite a while; but the good news is that there’s no need to write the scripts to do it—the previous analyst that was working on this project, Amy Gorman, has already written and tested the three Python programs that will help you accomplish your task (I’ve attached them to this eMail). You should take a look at these programs to see how they were written, but here is what they do: the “mapper” script examines and counts data from portions of the Common Crawl data, the “reducer” script accumulates the analysis from the individual mapper jobs, and the “aggregation” script helps stitch together the raw output from the multiple job flows you will need to initiate to analyze all the necessary data.

That’s the high-level view of what the scripts do. In detail, the mapper script scans the Common Crawl data for a smart phone name (e.g., iPhone, Samsung Galaxy, Nokia Lumia, etc.) and also scans for the words review, critique, looks at, in depth, analysis, evaluate, evaluation, and assess. If the mapper finds any of these words and a reference to one of the phones we are interested in, then the number of instances is recorded for the relevant device. Then the mapper scans and records instances of words related to the features of the phones, such as camera, display, or performance, that also have a positive, negative, or uncertain word within a window of 5 to 14 words. When this sentiment toward a specific phone feature is found, the mapper emits a count for each of the instances it observes. The mapper sends this information to the reducer, which runs on a master node. The reducer accumulates the information it receives from the mappers and writes it to the output file on S3. The result of all this can be seen in the sample data matrix that Amy Gorman constructed from the data she was able to access. This matrix was attached to my previous eMail.

__Your Job__

What I’d like you to do is run the script on 400 to 500 Common Crawl archive files and collect as many web pages as you can that contain references to smart phones. Because this process requires a lot of time and computing resources, avoiding errors is essential. That’s why I’d like you to start by running the script on just one Common Crawl archive file using the EMR console.

Then, once you get the hang of the process, I’d like you to ramp up your efforts by running the script from a command line, as this will be more efficient for starting multiple "steps" that scan large ranges of archive files. As I said, your goal should be to compile data from 400 to 500 archive files. This should give you around 20,000 web pages that contain references to smart phones. When you get more than 20,000 results it is permissible to keep moving forward in the plan of attack or you may choose to map more files and add more to the results.

After you've collected this data (the script stores this output in S3), I'd like you to consolidate it into a Large Matrix similar to the Sample Matrix Amy created. To do this, you will need the concatenate.py file that Amy used to compile her Sample Matrix results. I've attached it to this eMail. 

Michael

Attachments:<br>
[task2scripts2018](https://s3.amazonaws.com/gbstool/emails/2905/task2scripts2018.zip?AWSAccessKeyId=AKIAJBIZLMJQ2O6DKIAA&Expires=1561021200&Signature=U4wjXssVFXuyGOZEEt0R0o%2BHl34%3D)

# Plan of Attack

## Introduction

__Your Task__
You have been asked by Michael Ortiz, VP of Alert Analytics, to use EMR to collect the web data you will analyze to provide the general sentiment toward the smart phones and compile it into a csv file called the Large Matrix. This involves first running the mapper and reducer scripts on one WET file of Common Crawl data in EMR using the EMR console. Then you will run the scripts in EMR on 400 to 500 Common Crawl WET files using the EMR Command Line Interface. Lastly you will use the concatenating script to compile the Common Crawl data you have collected and to create the Large Matrix and submit your file. 

This task requires you to prepare two deliverables:

* Preliminary Output Generated from the EMR Console— a csv file generated after running the Python Mapper and Reducer programs on a single Common Crawl WET file
* Large Data Matrix — a csv file generated by running the Python Mapper and Reducer programs on 400 to 500 Common Crawl WET files (approximately a billion pages) and meets the following requirements:
     * Is initiated by the command line interface using JSON job files that either you have created or that have been provided to you. It is OK to expand upon JSON files provided to crawl more segments of the Common Crawl.
     * Combines the results of all the streaming Hadoop jobs into a single csv file.
     * Contains a minimum of 20,000 instances, but should contain as many as you are reasonably able to gather. This number can be increased by expanding the number of common crawl WET file addresses your JSON file specifies for processing.

## Get Started

1. Read the email to ensure you understand the details of the task.
2. Download and save the Python mapper, reducer and aggregation scripts. These are the programs Amy developed for your use. Here is a description of each:
     * Mapper.py: You will use this mapper program to examine and count data from portions of the Common Crawl data.
     * Reducer.py: You will use this reducer program to accumulate the analysis from the individual mapper jobs.
     * Concatenatepv3.py: You will use this Python program to aggregate the results of your streaming jobs.
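
To make the division of labour between the mapper and the reducer concrete, here is a minimal sketch of the idea described in the email: count sentiment words that appear near a feature word on pages that mention a phone, then sum the counts. It is not Amy's Mapper.py or Reducer.py. The word lists, the `WINDOW` size, the column names such as `camera_pos`, and the functions `map_page` and `reduce_counts` are all hypothetical, and the real scripts read from standard input under Hadoop Streaming rather than taking Python arguments.

```python
from collections import defaultdict

# Hypothetical, heavily shortened word lists; the real Mapper.py has its own.
PHONES = {'iphone', 'galaxy', 'lumia'}
FEATURES = {'camera', 'display', 'performance'}
POSITIVE = {'great', 'excellent', 'sharp'}
NEGATIVE = {'poor', 'terrible', 'slow'}
WINDOW = 5  # assumed distance (in words) searched on each side of a feature word


def map_page(url, text):
    """Emit ((url, column), 1) for each sentiment word found near a feature word."""
    words = text.lower().split()
    if not PHONES.intersection(words):
        return  # the page never mentions a phone of interest
    for i, word in enumerate(words):
        if word not in FEATURES:
            continue
        for neighbour in words[max(0, i - WINDOW):i + WINDOW + 1]:
            if neighbour in POSITIVE:
                yield (url, word + '_pos'), 1
            elif neighbour in NEGATIVE:
                yield (url, word + '_neg'), 1


def reduce_counts(pairs):
    """Sum the counts emitted by any number of mapper calls."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


page = 'The iPhone camera is great but the display looks poor'
totals = reduce_counts(map_page('http://example.com/review', page))
print(totals)
```

Reading the attached scripts alongside this sketch should make it easier to see which parts do the scanning and which parts only add numbers together.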
     
__Note:__ It is very important that you verify you are running your clusters in the same region in which you set up the CLI in task one. If not, cross-region data transfer charges can be very expensive. When you set up the CLI you were asked to specify a region (us-east-1); this same region needs to be reflected in the upper right-hand corner of the AWS web console. It doesn't have to be N. Virginia, but it must match your CLI default region. N. Virginia is highly recommended because the Common Crawl data is stored in the N. Virginia region.

## Identify the data source

Common Crawl is a non-profit organization that crawls and archives the entire readable Internet once per month. The archived files are stored on Amazon Web Services S3 in the N. Virginia region. Any individual or organization can access these files.

The January 2018 crawl contains more than 3.4 billion web pages and 270 tebibytes of information when uncompressed. Storing and accessing all of these pages requires organization. The crawl is split into thousands of roughly similar-sized files, which are saved in the WARC file format (WARC stands for Web ARChive) and gzipped. Each of these files has its own specific address, and we use these addresses as input with Amazon Web Services.

Because we are interested in sentiment mining, we will focus on a subset of the archive files that contain only text: the WET files. As a first step to getting our input addresses, visit the Common Crawl Blog (http://commoncrawl.org/connect/blog/) and download the WET paths file for last month. It should be named something like “all WET files (CC-MAIN-2016-50/wet.paths.gz)”. The .gz file is compressed and cannot be opened directly by Sublime or any other text editor; it needs to be uncompressed first. Inside it, you'll find a file called wet.paths.

NOTE: If you can’t open a gzip file on your computer, search online for a free gzip opener. Alternatively, ask your mentor to email you the latest uncompressed path file.

Open your wet.paths file with Sublime or any text editor. You will find tens of thousands of addresses similar to:

```crawl-data/CC-MAIN-2016-40/segments/1474738659496.36/wet/CC-MAIN-20160924173739-00001-ip-10-143-35-109.ec2.internal.warc.wet.gz ```

In [1]:
import gzip
import shutil
with gzip.open('wet.paths.gz', 'rb') as f_in:
    with open('wet.paths.txt', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)

In [2]:
import pandas as pd
df = pd.read_csv('wet.paths.txt', names = ['address'])
display(df.head(),df.shape)

Unnamed: 0,address
0,crawl-data/CC-MAIN-2019-22/segments/1558232254...
1,crawl-data/CC-MAIN-2019-22/segments/1558232254...
2,crawl-data/CC-MAIN-2019-22/segments/1558232254...
3,crawl-data/CC-MAIN-2019-22/segments/1558232254...
4,crawl-data/CC-MAIN-2019-22/segments/1558232254...


(56000, 1)

You will need to add “s3://commoncrawl/” to the beginning of any file address you intend to use as input. For now, just create one address as seen below.

```s3://commoncrawl/crawl-data/CC-MAIN-2016-40/segments/1474738659496.36/wet/CC-MAIN-20160924173739-00001-ip-10-143-35-109.ec2.internal.warc.wet.gz```

In [3]:
df['address'] = 's3://commoncrawl/'+df['address']
display(df.head(),df.shape)
df_addresses = df

Unnamed: 0,address
0,s3://commoncrawl/crawl-data/CC-MAIN-2019-22/se...
1,s3://commoncrawl/crawl-data/CC-MAIN-2019-22/se...
2,s3://commoncrawl/crawl-data/CC-MAIN-2019-22/se...
3,s3://commoncrawl/crawl-data/CC-MAIN-2019-22/se...
4,s3://commoncrawl/crawl-data/CC-MAIN-2019-22/se...


(56000, 1)
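
Each address encodes which crawl, segment and file it belongs to, which is handy when you want to name output folders or check which files you have already processed. The sketch below pulls those parts out with regular expressions. The function name `describe_wet_address` and the `label` format are hypothetical; the label mirrors the folder names that appear in the output listings later in this notebook, but that CreateJsonFiles.py builds them this way is an assumption.

```python
import re


def describe_wet_address(address):
    """Split a Common Crawl WET address into crawl id, segment id and file number."""
    crawl = re.search(r'crawl-data/([^/]+)/', address)
    segment = re.search(r'/segments/([^/]+)/wet/', address)
    filename = address.rsplit('/', 1)[-1]
    # The file number is the five-digit group in the file name that is
    # followed by '-' or '.'; the longer timestamp groups do not match.
    number = re.search(r'-(\d{5})(?=[-.])', filename)
    if not (crawl and segment and number):
        raise ValueError('not a recognisable WET address: ' + address)
    return {
        'crawl': crawl.group(1),
        'segment': segment.group(1),
        'file_number': number.group(1),
        # Hypothetical label, e.g. for an output folder name
        'label': segment.group(1) + '_' + number.group(1),
    }


example = ('s3://commoncrawl/crawl-data/CC-MAIN-2019-22/segments/1558232254253.31/wet/'
           'CC-MAIN-20190519061520-20190519083520-00000.warc.wet.gz')
parts = describe_wet_address(example)
print(parts)
```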

## Run an EMR job flow using the EMR Console

__Preliminary Note:__ You must set your clusters to 'Auto-Terminate' when your steps complete or your cluster will continue to run. A running cluster will continue to accrue usage charges by AWS.

This is your first step in learning 'how' to obtain data from the Common Crawl. Note: this step uses the web console and EMR, but only serves to test your job flow and process on a sample of Common Crawl files. You will have to use the Command Line Interface in the next step to perform your final data preparation, but can only do so after ensuring that your job flow is correct (this will be done in this step).

Collecting all the data you will need will require that you run several clusters. It is wise to begin cautiously in order to reduce your chances of running into time-consuming and costly errors.

To that end, use the EMR console to set up and run one cluster that processes one Common Crawl WET file.

1. Identify an input address for a Common Crawl WET file: If you haven’t yet followed the data source guide in the preceding section, do so now. You should end up with an input address similar to:


In [4]:
df.iloc[0, 0]

's3://commoncrawl/crawl-data/CC-MAIN-2019-22/segments/1558232254253.31/wet/CC-MAIN-20190519061520-20190519083520-00000.warc.wet.gz'

2. Set up three S3 buckets: In this step you will go to S3 via the web console and create three buckets: a bucket that you will use for mapper and reducer scripts, a bucket for your output, and a bucket for debugging logs. 

   * There are rules for creating buckets on S3; you can refer to Amazon for more information. 
   * IMPORTANT: Make sure that the S3 buckets you set up belong to the same Amazon Web Services account that you will be running EMR jobs from (the one that you created in 1. Set up computing environment > Plan of Attack > Setup Amazon Web Services Platform and Integrate Services). 
   * Optional - S3 Buckets can also be created from CLI. Visit the Optional Resources to challenge yourself with this process. 
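
Bucket creation fails if the name breaks S3's naming rules, so it can save a round trip to check a name locally first. The sketch below covers only a subset of the general-purpose bucket rules (length, allowed characters, first and last character, adjacent dots, IP-address look-alikes); Amazon's documentation remains the authority. The function name `bucket_name_problems` is hypothetical.

```python
import re
import string

EDGE_CHARS = string.ascii_lowercase + string.digits


def bucket_name_problems(name):
    """Return a list of rule violations; an empty list means no problem was found."""
    problems = []
    if not 3 <= len(name) <= 63:
        problems.append('must be 3 to 63 characters long')
    if not re.fullmatch(r'[a-z0-9.-]*', name):
        problems.append('may contain only lowercase letters, digits, dots and hyphens')
    if name and not (name[0] in EDGE_CHARS and name[-1] in EDGE_CHARS):
        problems.append('must begin and end with a letter or digit')
    if '..' in name:
        problems.append('must not contain two adjacent dots')
    if re.fullmatch(r'\d{1,3}(\.\d{1,3}){3}', name):
        problems.append('must not be formatted like an IP address')
    return problems


for candidate in ['jl-mapper', 'JL_Mapper', 'ab', '192.168.1.1']:
    print(candidate, bucket_name_problems(candidate))
```

A name that passes this check can still be rejected, for example because another account already owns it.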

In [5]:
!aws s3api create-bucket --bucket jl-mapper --region us-east-1

{
    "Location": "/jl-mapper"
}


In [6]:
!aws s3api create-bucket --bucket jl-output --region us-east-1

{
    "Location": "/jl-output"
}


In [7]:
!aws s3 ls

2019-06-19 20:17:31 aws-logs-983651120851-us-east-1
2019-06-24 19:50:44 jl-map-reduce
2019-06-18 20:38:34 jl-mapper
2019-06-18 20:39:14 jl-output
2019-06-18 20:38:45 jl-reducer


3. Upload the mapper and reducer scripts to an S3 bucket you created in step 2: Here you will upload the Mapper.py and Reducer.py programs into your script location bucket. These files are attached to the Task 2 eMail.


In [8]:
!ls task2scripts2018/

Mapper.py         Reducer.py        concatenatepv3.py


In [9]:
!aws s3 cp ./task2scripts2018/Mapper.py s3://jl-mapper/

Completed 10.3 KiB/10.3 KiB (9.2 KiB/s) with 1 file(s) remaining
upload: task2scripts2018/Mapper.py to s3://jl-mapper/Mapper.py


In [10]:
!aws s3 cp ./task2scripts2018/reducer.py s3://jl-reducer/

Completed 4.5 KiB/4.5 KiB (8.3 KiB/s) with 1 file(s) remaining
upload: task2scripts2018/reducer.py to s3://jl-reducer/reducer.py


__Instructions for creating the step for your cluster - Every cluster must have a step where the instructions for the cluster are implemented.__

1. __Create an EMR cluster:__ In this step you will access the EMR console, click on “Create New Cluster”, and then click on “Go to Advanced Options”. This new screen shows that there are 4 steps you will need to complete to start your cluster. 
2. __Step 1: Software and Steps__
    1. Software Configuration - All options under software configuration can be left at their default settings 
    2. Define the Cluster: Next you will complete the fields that will define the job flow:
        * Under 'Add Steps'
        * Auto-Terminate: Ensure you check Auto Terminate. If you do not set your clusters to 'Auto-Terminate' they will continue to run and you will continue to be charged.
        * Step type: Select “Streaming program”. Here you will specify that you are running your own application and that it is a streaming job type. After selecting the stream job type click on "Configure".
    3. After clicking 'Configure', a new pop-up window will open and allow you to enter the specific parameters for the step you'll be adding.
    4. Specify parameters: In this step you will fill in the following information:
       * Mapper location: You previously loaded the mapper script to an S3 bucket. It is from this bucket that EMR runs your script. Here you are telling EMR the location of that bucket. Use the dark Folder Symbol to the right of the form field and navigate to your mapper.
       * Reducer location: You previously loaded the reducer script to an S3 bucket. It is from this bucket that EMR runs your script. Here you are telling EMR the location of that bucket. Use the dark Folder Symbol to navigate to your reducer.
       * Input location: Common Crawl keeps their web segment data in an S3 bucket on AWS. This field is asking for the URL of the Common Crawl WET file you want to access. Cut and paste the address you previously created.
       * Output location: The output location will be the unique S3 bucket you created for the output of this script run. Use the dark File Symbol. NOTE: after navigating to your output bucket, you will need to add the name of a folder to the end of the address. This is the folder you want EMR to create when the step executes; it must not already exist, because EMR will not overwrite folders that currently exist in the S3 output location.
       * Extra arguments: This field does not require any input, but here is an explanation of what it is used for: it gives you the option to introduce extra arguments into the job. For some Hadoop Streaming jobs (not for this course) you may need to specify that the input files are in sequence file format, a file format used by the Common Crawl to store text data in a compressed manner.
       * Action on failure: This parameter tells EMR what to do if this step fails. "Continue" would simply move on to the next step in your streaming program if you had one. "Cancel and wait" keeps the cluster running, but does not move on to a new step. "Terminate Cluster" will do exactly that: shut down the cluster. "Continue" and "Terminate" are appropriate settings for your current project. 
       * Click on 'Add Step' and the step will be added to the cluster you are creating 
    5. Click 'next' to move to Hardware Configuration



3. __Step 2: Hardware Configuration__ - Most of these options can be left at their default settings, but here are some descriptions of each below:
    * Master/core/task instance type: These settings define the type of EC2 instances to be used in your cluster. Examples of instance types include: General Purpose, Compute Optimized, Memory Optimized, Storage Optimized. In the case of this task, we suggest you stick with default instances only. 
    * Request spot instance: Checking this box requests Spot Instances, which are spare EC2 capacity that AWS sells at a discount and can reclaim at short notice. Though the job will be cheaper, it can take longer because interrupted work has to be rerun. As long as you start with plenty of time before the due date (at least a week), you can use this feature to save money. We recommend you do not use it during real-world development, as it can increase job run times and slow your productivity.
    * Core/task instance count: The number you input here defines the size of your Hadoop cluster. Increase or decrease the number of core instances depending on the amount of resources you anticipate you will need for a particular job. We recommend using one master node and two core nodes for each job. When you are ready to process large numbers of segments (such as twenty or more), you should start multiple parallel jobs with this configuration. This can be accomplished conveniently by using the command line interface and JSON job files (examples provided).
    * Click next

4. __Step 3: General Cluster Settings__
    * Cluster name: Give your cluster a name that makes logical sense to you and will make it easy to identify in a list of clusters
    * Logging: Ensure that this is checked and a folder has been specified
    * Termination Protection: Ensure that this is checked
    * No Tags or Bootstrap actions are required
    * Click next
5. __Step 4: Security__
    * EC2 key pair: Proceed without an EC2 key pair
    * Ensure that the cluster is visible to all IAM users in account
    * Check the default permissions and leave both the EC2 Security Groups and Encryption options at their default settings 
6. __Run job:__
    * At this point, when you click “Create Cluster” a screen will appear that tells you that your cluster has been created and that it may take a few minutes to launch. You can access or check the status of the Hadoop Streaming job by returning to the EMR console and clicking “Refresh.” The “Status” column will first display “Starting,” then, if all steps were completed correctly, this column will display “Running” and, when the job is done it will display “Terminated, All steps complete.”

1. __Review output:__ Once the job has finished running, navigate to the S3 bucket you set up and search the file folders for the output data. Download the output to your desktop through AWS S3 console, CyberDuck or via AWS CLI. (See resources for download with CLI instructions)

In [11]:
!aws s3 ls s3://jl-output/TestRun/

2019-06-19 22:08:02          0 _SUCCESS
2019-06-19 22:07:57       1908 part-00000
2019-06-19 22:07:56       2561 part-00001
2019-06-19 22:08:00       2279 part-00002
2019-06-19 22:07:57       3197 part-00003
2019-06-19 22:08:01       2521 part-00004
2019-06-19 22:07:58       2436 part-00005
2019-06-19 22:08:02       3383 part-00006


In [12]:
import sys, os
import boto3

In [16]:
s3 = boto3.resource('s3')
bucket = s3.Bucket('jl-output')

In [35]:
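def download_bucket_files(s3, bucket_name):
    """Download every object in the bucket into ./<bucket_name>/, mirroring its key layout."""
    bucket = s3.Bucket(bucket_name)
    for bucket_object in bucket.objects.all():
        filename = bucket_object.key
        print(filename)
        if filename.endswith('/'):
            # Keys ending in '/' are folder placeholders with no file content to download
            continue
        local_path = os.path.join(bucket_name, filename)
        # makedirs also creates missing parent folders and tolerates existing ones
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        bucket.download_file(filename, local_path)

download_bucket_files(s3, 'jl-output')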

FlowRun/1558232254253.31_00001/_SUCCESS
FlowRun/1558232254253.31_00001/part-00000
FlowRun/1558232254253.31_00001/part-00001
FlowRun/1558232254253.31_00002/_SUCCESS
FlowRun/1558232254253.31_00002/part-00000
FlowRun/1558232254253.31_00002/part-00001
FlowRun/1558232254253.31_00003/_SUCCESS
FlowRun/1558232254253.31_00003/part-00000
FlowRun/1558232254253.31_00003/part-00001
FlowRun/1558232254253.31_00004/_SUCCESS
FlowRun/1558232254253.31_00004/part-00000
FlowRun/1558232254253.31_00004/part-00001
FlowRun/1558232254253.31_00005/_SUCCESS
FlowRun/1558232254253.31_00005/part-00000
FlowRun/1558232254253.31_00005/part-00001
FlowRun/1558232254253.31_00006/_SUCCESS
FlowRun/1558232254253.31_00006/part-00000
FlowRun/1558232254253.31_00006/part-00001
FlowRun/1558232254253.31_00007/_SUCCESS
FlowRun/1558232254253.31_00007/part-00000
FlowRun/1558232254253.31_00007/part-00001
FlowRun/1558232254253.31_00008/_SUCCESS
FlowRun/1558232254253.31_00008/part-00000
FlowRun/1558232254253.31_00008/part-00001
FlowRun/

FlowRun/1558232254253.31_00067/part-00000
FlowRun/1558232254253.31_00067/part-00001
FlowRun/1558232254253.31_00068/_SUCCESS
FlowRun/1558232254253.31_00068/part-00000
FlowRun/1558232254253.31_00068/part-00001
FlowRun/1558232254253.31_00069/_SUCCESS
FlowRun/1558232254253.31_00069/part-00000
FlowRun/1558232254253.31_00069/part-00001
FlowRun/1558232254253.31_00070/_SUCCESS
FlowRun/1558232254253.31_00070/part-00000
FlowRun/1558232254253.31_00070/part-00001
FlowRun/1558232254253.31_00071/_SUCCESS
FlowRun/1558232254253.31_00071/part-00000
FlowRun/1558232254253.31_00071/part-00001
FlowRun/1558232254253.31_00072/_SUCCESS
FlowRun/1558232254253.31_00072/part-00000
FlowRun/1558232254253.31_00072/part-00001
FlowRun/1558232254253.31_00073/_SUCCESS
FlowRun/1558232254253.31_00073/part-00000
FlowRun/1558232254253.31_00073/part-00001
FlowRun/1558232254253.31_00074/_SUCCESS
FlowRun/1558232254253.31_00074/part-00000
FlowRun/1558232254253.31_00074/part-00001
FlowRun/1558232254253.31_00075/_SUCCESS
FlowRun/

In [36]:
import pandas as pd

In [37]:
pd.read_csv('jl-output/TestRun/part-00000',header=None)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,49,50,51,52,53,54,55,56,57,58
0,http://markbunting.com/lenovo-b750-the-allinon...,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,https://lambdalv.com/lambda/sponsors/alaskan-h...,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,http://fiat-fuel-pu.wiring-diagram.love-craft....,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,http://liquidstate.co.uk/photos/.well-known/pk...,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
4,http://www.sheaky.com/2009/12/safety-and-good-...,2,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,http://theblogofmj.com/cum-my-for-free-porn-co...,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,http://boonecountymissouri.cf/United_Kingdom-S...,0,0,0,0,0,14,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [38]:
folder_path = 'jl-output/TestRun/'
# Read every part file in the folder, then concatenate them in a single call
parts = [pd.read_csv(os.path.join(folder_path, filename), header=None)
         for filename in os.listdir(folder_path) if 'part' in filename]
df = pd.concat(parts, ignore_index=True)
df.to_csv('jl-output/output.csv', index=False)
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,49,50,51,52,53,54,55,56,57,58
0,http://markbunting.com/lenovo-b750-the-allinon...,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,http://www.sheaky.com/2009/12/safety-and-good-...,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,http://www.techydimension.com/tag/google/,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
3,https://www.multisoftvirtualacademy.com/micros...,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,https://axiang.cc/archives/tag/facebook,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0


3. Submit Preliminary Output. Submit the CSV file you generated from the web-based interface (EMR console).

## Run the Remaining Job Flows Using AWS CLI

The EMR console provides an easy-to-use graphical interface for launching and monitoring your job flows directly from a web browser, but you may find that it is not efficient enough when running large jobs. In this case, you will need to use the command line interface you installed back in task one. This interface will be accessed from the 'terminal' in Mac OSX and 'Command Prompt' on a Windows machine.

The AWS Command Line Interface (CLI) provides the ability to programmatically launch and monitor the progress of running job flows, and to create additional custom functionality around job flows (such as sequences with multiple processing steps, scheduling, workflow, or monitoring). As mentioned, you should have already installed the CLI in task one. 

Below are the Amazon Command Line Interface steps for running 100 WET files. You can run up to 255 WET files at a time, but we recommend that your first cluster from the CLI contain only 100. 

The steps below will cover 3 types of files:

1. BDF File – This file contains s3 addresses for searching WET files
2. CreateJson Python File – Running this file uses the s3 addresses from the BDF file and creates JSON files
3. JSON File – This file’s information will be “uploaded” through CLI to create the EMR cluster steps for each of the s3 addresses from your BDF File.

Prior to running this job, ensure that you have the AWS CLI installed and that credentials are configured correctly. Check with your mentor for assistance.
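
Because one cluster handles at most 255 WET files and the goal is 400 to 500 files, the address list has to be split across several BDF files. The sketch below writes one .bdf file per batch. It assumes a BDF file is plain text with one S3 address per line, and the function name `write_bdf_batches`, the `addresses_batch` file prefix and the default batch size of 100 are hypothetical choices.

```python
import os

MAX_FILES_PER_CLUSTER = 255  # limit quoted in this guide


def write_bdf_batches(addresses, out_dir, batch_size=100, prefix='addresses_batch'):
    """Write the addresses to numbered .bdf files, batch_size addresses per file."""
    if not 1 <= batch_size <= MAX_FILES_PER_CLUSTER:
        raise ValueError('batch_size must be between 1 and 255')
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for start in range(0, len(addresses), batch_size):
        batch = addresses[start:start + batch_size]
        path = os.path.join(out_dir, f'{prefix}_{len(paths) + 1:02d}.bdf')
        with open(path, 'w') as bdf_file:
            bdf_file.write('\n'.join(batch) + '\n')
        paths.append(path)
    return paths


# Example with the DataFrame built earlier (df_addresses has one 'address' column):
# write_bdf_batches(df_addresses['address'].iloc[:500].tolist(), 'bdf_files')
```

Keeping the batches in separate files also makes it easy to see which address ranges have already been submitted.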

1. Select 100 WET file addresses. Open the WET paths file with a text editor and choose 100 addresses. The WET files are not indexed by topic so any 100 addresses can be used.
   1. Copy your 100 addresses and paste them into a new tab of your text editor
   2. Save this new tab as a .bdf file
   3. Add the S3 address snippet to the beginning of each of your addresses (see step two if you need a reminder): s3://commoncrawl/
   4. See the sample .bdf file in the Resources if you need help

In [61]:
df_addresses.iloc[355:610].to_csv('addresses_subset.bdf', index=False, header=False)

2. Copy the CreateJsonFiles.py Python script into the folder with the .bdf file and personalize the script. This script, which can be found in the “Resources” tab under the “JSON Files” heading, will generate a .json file from your .bdf file. You must make updates to this python script with a text editor like Sublime before you run it.

    1. Update the python script to have the correct S3 locations for your Mapper.py and Reducer.py files – see files parameter on line 17
    
    2. Update the python script to have the correct S3 address for your output bucket – see output parameter on line 18
    3. Review all lines of the script to ensure that the mapper and reducer file names are correct
    4. Save the file

In [52]:
# first, need to create single folder with mapper and reducer
s3 = boto3.client('s3')

#create bucket
bucket_name = 'jl-map-reduce'
s3.create_bucket(Bucket=bucket_name)

#upload map reduce files
for file in ['Mapper.py', 'Reducer.py']:
    s3.upload_file('task2scripts2018/'+file, bucket_name, file)

In [53]:
s3 = boto3.resource('s3')
for my_bucket_object in s3.Bucket(bucket_name).objects.all():
    print(my_bucket_object)

s3.ObjectSummary(bucket_name='jl-map-reduce', key='Mapper.py')
s3.ObjectSummary(bucket_name='jl-map-reduce', key='Reducer.py')


3. Run the CreateJsonFiles.py Python file from the command shell to generate your .json file. Open the ‘terminal’ (OSX) or ‘Command Prompt’ (Windows) command shell. In your command shell, use the change directory command to point to the folder containing your BDF and CreateJSON files. If you don’t know how to change directories, spend a few minutes searching Google or YouTube for instructions. Now run the CreateJSON file from the command line. The command format is below. You will be asked for the input .bdf file’s name and a name for the output file.

At this point, you should have a .json file containing all of the appropriate markup needed by AWS EMR. The .json files will contain one step per WET address from the .bdf file.
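
To show what such a file can look like, here is a minimal sketch that builds one Hadoop Streaming step per address in the JSON shape that `aws emr create-cluster --steps file://...` accepts for `STREAMING` steps. It is not the course's CreateJsonFiles.py, whose output may differ. The function name `build_streaming_steps`, the step names and the `step_00001` style output folders are hypothetical.

```python
import json


def build_streaming_steps(addresses, script_bucket, output_prefix):
    """Return a list of EMR streaming-step dicts, one per WET address."""
    steps = []
    for i, address in enumerate(addresses, start=1):
        steps.append({
            'Name': f'wet-{i:05d}',          # hypothetical naming scheme
            'Type': 'STREAMING',
            'ActionOnFailure': 'CONTINUE',   # keep going if a single file fails
            'Args': [
                '-files', f's3://{script_bucket}/Mapper.py,s3://{script_bucket}/Reducer.py',
                '-mapper', 'Mapper.py',
                '-reducer', 'Reducer.py',
                '-input', address,
                # Each step needs its own output folder that does not exist yet
                '-output', f'{output_prefix}/step_{i:05d}/',
            ],
        })
    return steps


demo_steps = build_streaming_steps(
    ['s3://commoncrawl/crawl-data/example-a.warc.wet.gz',
     's3://commoncrawl/crawl-data/example-b.warc.wet.gz'],
    script_bucket='jl-map-reduce',
    output_prefix='s3://jl-output/FlowRun',
)
print(json.dumps(demo_steps, indent=2))
```

Comparing this structure with the file the course script produced for you is a quick way to confirm that the mapper, reducer, input and output arguments all point where you expect.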

4. Check the validity of your .json file - For the CLI to process a .json file correctly, the file has to be well-formed JSON. Thankfully there are numerous online validation tools that we can use to check the validity of JSON files.

    1. In order to check the structure of your .json file open it in Sublime and copy/paste the text into JSONLint. If your file structure is valid the site will inform you of such; if not, you might need to modify the structure of your .json file so the CLI will process it correctly.
    2. After you make any necessary modifications to your file structure copy the text out of JSONLint and paste it back into Sublime and save it as a .json file - you can either overwrite your original file or save it as a new one - just remember which one is valid!
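
If you prefer not to paste your file into a website, the same well-formedness check can be done locally with Python's standard `json` module. This is a minimal sketch: the function name `check_json_text` is hypothetical, and the extra check that the top level is a list is an assumption about how the steps file is laid out.

```python
import json


def check_json_text(text):
    """Return None if text is well-formed JSON holding a list, else a short message."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as err:
        return f'invalid JSON at line {err.lineno}, column {err.colno}: {err.msg}'
    if not isinstance(data, list):
        return 'valid JSON, but the top level is not a list of steps'
    return None


print(check_json_text('[{"Name": "step-1", "Type": "STREAMING"}]'))
print(check_json_text('[{"Name": "step-1",}]'))

# To check a file on disk:
# with open('FlowJob.json') as json_file:
#     print(check_json_text(json_file.read()))
```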
    

5. Run the .json file from the CLI to create an EMR cluster. Running a .json file from the CLI will initiate the creation of a new AWS EMR cluster that will search for the desired sentiment data. You can monitor the progress of your cluster from the EMR Console. You may run only one cluster at a time. 

Below is an AWS command to run the JSON file (run it from the Terminal (Mac) or Command Prompt (Windows)); change the following areas (shown as XXXX):

```aws emr create-cluster --name "XXXXXXXXXXXX" --ec2-attributes SubnetId=subnet-XXXXXXXX --release-label emr-5.4.0 --auto-terminate --log-uri s3://XXXXXXXXXXX/ --use-default-roles --enable-debugging --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=c4.large InstanceGroupType=CORE,InstanceCount=3,InstanceType=c4.large --steps file://xxxxxx.json```
* Cluster name
* SubnetId: The SubnetId can be found by visiting one of your recent, successful clusters using Console. Look under Network and Hardware. You should see something with a structure similar to: subnet-61e40522
* log-uri: This s3 address should point to your debugging bucket
* json file name

Note: The AWS CLI script can be different for Windows users: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/add-steps-cli-console.html 

In [62]:
!aws emr create-cluster --name "FlowJob" --ec2-attributes SubnetId=subnet-51a1b81b --release-label emr-5.4.0 --auto-terminate --log-uri s3://aws-logs-FlowJob/ --use-default-roles --enable-debugging --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=c4.large InstanceGroupType=CORE,InstanceCount=3,InstanceType=c4.large --steps file://FlowJob.json

{
    "ClusterId": "j-2KF80BD8UQZG9"
}


6. Monitor your cluster from the AWS Console. Running 100 WET files can be a lengthy process. From the EMR console, open the cluster you created and click on the Steps tab; you should see a step for each of the WET addresses. You may have an occasional failed step; this is expected and OK. If all of your steps are failing, you should terminate the cluster and check the log files to understand the failures. 
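
With a hundred or more steps, counting states by eye in the console gets tedious. The sketch below tallies step states from the step descriptions that boto3's EMR `list_steps` call returns. The helper names `fetch_steps` and `summarize_step_states` are hypothetical, and the sample data at the bottom is made up so the counting logic can be tried without an AWS connection.

```python
from collections import Counter


def fetch_steps(emr_client, cluster_id):
    """Collect all step descriptions for a cluster, following pagination."""
    steps = []
    paginator = emr_client.get_paginator('list_steps')
    for page in paginator.paginate(ClusterId=cluster_id):
        steps.extend(page['Steps'])
    return steps


def summarize_step_states(steps):
    """Count steps by state, e.g. PENDING, RUNNING, COMPLETED, FAILED."""
    return Counter(step['Status']['State'] for step in steps)


# Made-up sample in the same shape as the list_steps response entries
sample_steps = [
    {'Name': 'wet-00001', 'Status': {'State': 'COMPLETED'}},
    {'Name': 'wet-00002', 'Status': {'State': 'COMPLETED'}},
    {'Name': 'wet-00003', 'Status': {'State': 'FAILED'}},
    {'Name': 'wet-00004', 'Status': {'State': 'RUNNING'}},
]
summary = summarize_step_states(sample_steps)
print(summary)

# Against a real cluster (requires configured AWS credentials):
# import boto3
# emr = boto3.client('emr', region_name='us-east-1')
# print(summarize_step_states(fetch_steps(emr, 'j-2KF80BD8UQZG9')))
```

A summary dominated by `FAILED` is the signal, described above, to terminate the cluster and read the logs.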

## Consolidate The Results of the Jobs

You will need to aggregate the results of the streaming jobs you set up. This involves the following steps:

1. __Download your EMR output.__ Using Console, open and inspect your S3 output bucket. You should find one folder per step from your EMR job (minus any failed steps). With CyberDuck or CLI, download all of the individual output folders to a single folder location on your local machine. Keep the folders for each EMR job step, you don't have to pull the Part files out of them.

In [64]:
download_bucket_files(boto3.resource('s3'), 'jl-output')

FlowRun/1558232254253.31_00001/_SUCCESS
FlowRun/1558232254253.31_00001/part-00000
FlowRun/1558232254253.31_00001/part-00001
FlowRun/1558232254253.31_00002/_SUCCESS
FlowRun/1558232254253.31_00002/part-00000
FlowRun/1558232254253.31_00002/part-00001
FlowRun/1558232254253.31_00003/_SUCCESS
FlowRun/1558232254253.31_00003/part-00000
FlowRun/1558232254253.31_00003/part-00001
FlowRun/1558232254253.31_00004/_SUCCESS
FlowRun/1558232254253.31_00004/part-00000
FlowRun/1558232254253.31_00004/part-00001
FlowRun/1558232254253.31_00005/_SUCCESS
FlowRun/1558232254253.31_00005/part-00000
FlowRun/1558232254253.31_00005/part-00001
FlowRun/1558232254253.31_00006/_SUCCESS
FlowRun/1558232254253.31_00006/part-00000
FlowRun/1558232254253.31_00006/part-00001
FlowRun/1558232254253.31_00007/_SUCCESS
FlowRun/1558232254253.31_00007/part-00000
FlowRun/1558232254253.31_00007/part-00001
FlowRun/1558232254253.31_00008/_SUCCESS
FlowRun/1558232254253.31_00008/part-00000
FlowRun/1558232254253.31_00008/part-00001
FlowRun/

FlowRun/1558232254253.31_00067/part-00000
FlowRun/1558232254253.31_00067/part-00001
FlowRun/1558232254253.31_00068/_SUCCESS
FlowRun/1558232254253.31_00068/part-00000
FlowRun/1558232254253.31_00068/part-00001
FlowRun/1558232254253.31_00069/_SUCCESS
FlowRun/1558232254253.31_00069/part-00000
FlowRun/1558232254253.31_00069/part-00001
FlowRun/1558232254253.31_00070/_SUCCESS
FlowRun/1558232254253.31_00070/part-00000
FlowRun/1558232254253.31_00070/part-00001
FlowRun/1558232254253.31_00071/_SUCCESS
FlowRun/1558232254253.31_00071/part-00000
FlowRun/1558232254253.31_00071/part-00001
FlowRun/1558232254253.31_00072/_SUCCESS
FlowRun/1558232254253.31_00072/part-00000
FlowRun/1558232254253.31_00072/part-00001
FlowRun/1558232254253.31_00073/_SUCCESS
FlowRun/1558232254253.31_00073/part-00000
FlowRun/1558232254253.31_00073/part-00001
FlowRun/1558232254253.31_00074/_SUCCESS
FlowRun/1558232254253.31_00074/part-00000
FlowRun/1558232254253.31_00074/part-00001
FlowRun/1558232254253.31_00075/_SUCCESS
FlowRun/

FlowRun/1558232254253.31_00134/part-00001
FlowRun/1558232254253.31_00135/_SUCCESS
FlowRun/1558232254253.31_00135/part-00000
FlowRun/1558232254253.31_00135/part-00001
FlowRun/1558232254253.31_00136/_SUCCESS
FlowRun/1558232254253.31_00136/part-00000
FlowRun/1558232254253.31_00136/part-00001
FlowRun/1558232254253.31_00137/_SUCCESS
FlowRun/1558232254253.31_00137/part-00000
FlowRun/1558232254253.31_00137/part-00001
FlowRun/1558232254253.31_00138/_SUCCESS
FlowRun/1558232254253.31_00138/part-00000
FlowRun/1558232254253.31_00138/part-00001
FlowRun/1558232254253.31_00139/_SUCCESS
FlowRun/1558232254253.31_00139/part-00000
FlowRun/1558232254253.31_00139/part-00001
FlowRun/1558232254253.31_00140/_SUCCESS
FlowRun/1558232254253.31_00140/part-00000
FlowRun/1558232254253.31_00140/part-00001
FlowRun/1558232254253.31_00141/_SUCCESS
FlowRun/1558232254253.31_00141/part-00000
FlowRun/1558232254253.31_00141/part-00001
FlowRun/1558232254253.31_00142/_SUCCESS
FlowRun/1558232254253.31_00142/part-00000
FlowRun/

FlowRun/1558232254253.31_00202/part-00000
FlowRun/1558232254253.31_00202/part-00001
FlowRun/1558232254253.31_00203/_SUCCESS
FlowRun/1558232254253.31_00203/part-00000
FlowRun/1558232254253.31_00203/part-00001
FlowRun/1558232254253.31_00204/_SUCCESS
FlowRun/1558232254253.31_00204/part-00000
FlowRun/1558232254253.31_00204/part-00001
FlowRun/1558232254253.31_00205/_SUCCESS
FlowRun/1558232254253.31_00205/part-00000
FlowRun/1558232254253.31_00205/part-00001
FlowRun/1558232254253.31_00206/_SUCCESS
FlowRun/1558232254253.31_00206/part-00000
FlowRun/1558232254253.31_00206/part-00001
FlowRun/1558232254253.31_00207/_SUCCESS
FlowRun/1558232254253.31_00207/part-00000
FlowRun/1558232254253.31_00207/part-00001
FlowRun/1558232254253.31_00208/_SUCCESS
FlowRun/1558232254253.31_00208/part-00000
FlowRun/1558232254253.31_00208/part-00001
FlowRun/1558232254253.31_00209/_SUCCESS
FlowRun/1558232254253.31_00209/part-00000
FlowRun/1558232254253.31_00209/part-00001
FlowRun/1558232254253.31_00210/_SUCCESS
FlowRun/

FlowRun/1558232254253.31_00268/part-00001
FlowRun/1558232254253.31_00269/_SUCCESS
FlowRun/1558232254253.31_00269/part-00000
FlowRun/1558232254253.31_00269/part-00001
FlowRun/1558232254253.31_00270/_SUCCESS
FlowRun/1558232254253.31_00270/part-00000
FlowRun/1558232254253.31_00270/part-00001
FlowRun/1558232254253.31_00271/_SUCCESS
FlowRun/1558232254253.31_00271/part-00000
FlowRun/1558232254253.31_00271/part-00001
FlowRun/1558232254253.31_00272/_SUCCESS
FlowRun/1558232254253.31_00272/part-00000
FlowRun/1558232254253.31_00272/part-00001
FlowRun/1558232254253.31_00273/_SUCCESS
FlowRun/1558232254253.31_00273/part-00000
FlowRun/1558232254253.31_00273/part-00001
FlowRun/1558232254253.31_00274/_SUCCESS
FlowRun/1558232254253.31_00274/part-00000
FlowRun/1558232254253.31_00274/part-00001
FlowRun/1558232254253.31_00275/_SUCCESS
FlowRun/1558232254253.31_00275/part-00000
FlowRun/1558232254253.31_00275/part-00001
FlowRun/1558232254253.31_00276/_SUCCESS
FlowRun/1558232254253.31_00276/part-00000
FlowRun/

FlowRun/1558232254253.31_00335/_SUCCESS
FlowRun/1558232254253.31_00335/part-00000
FlowRun/1558232254253.31_00335/part-00001
FlowRun/1558232254253.31_00336/_SUCCESS
FlowRun/1558232254253.31_00336/part-00000
FlowRun/1558232254253.31_00336/part-00001
FlowRun/1558232254253.31_00337/_SUCCESS
FlowRun/1558232254253.31_00337/part-00000
FlowRun/1558232254253.31_00337/part-00001
FlowRun/1558232254253.31_00338/_SUCCESS
FlowRun/1558232254253.31_00338/part-00000
FlowRun/1558232254253.31_00338/part-00001
FlowRun/1558232254253.31_00339/_SUCCESS
FlowRun/1558232254253.31_00339/part-00000
FlowRun/1558232254253.31_00339/part-00001
FlowRun/1558232254253.31_00340/_SUCCESS
FlowRun/1558232254253.31_00340/part-00000
FlowRun/1558232254253.31_00340/part-00001
FlowRun/1558232254253.31_00341/_SUCCESS
FlowRun/1558232254253.31_00341/part-00000
FlowRun/1558232254253.31_00341/part-00001
FlowRun/1558232254253.31_00342/_SUCCESS
FlowRun/1558232254253.31_00342/part-00000
FlowRun/1558232254253.31_00342/part-00001
FlowRun/

FlowRun/1558232254253.31_00402/part-00001
FlowRun/1558232254253.31_00403/_SUCCESS
FlowRun/1558232254253.31_00403/part-00000
FlowRun/1558232254253.31_00403/part-00001
FlowRun/1558232254253.31_00404/_SUCCESS
FlowRun/1558232254253.31_00404/part-00000
FlowRun/1558232254253.31_00404/part-00001
FlowRun/1558232254253.31_00405/_SUCCESS
FlowRun/1558232254253.31_00405/part-00000
FlowRun/1558232254253.31_00405/part-00001
FlowRun/1558232254253.31_00406/_SUCCESS
FlowRun/1558232254253.31_00406/part-00000
FlowRun/1558232254253.31_00406/part-00001
FlowRun/1558232254253.31_00407/_SUCCESS
FlowRun/1558232254253.31_00407/part-00000
FlowRun/1558232254253.31_00407/part-00001
FlowRun/1558232254253.31_00408/_SUCCESS
FlowRun/1558232254253.31_00408/part-00000
FlowRun/1558232254253.31_00408/part-00001
FlowRun/1558232254253.31_00409/_SUCCESS
FlowRun/1558232254253.31_00409/part-00000
FlowRun/1558232254253.31_00409/part-00001
FlowRun/1558232254253.31_00410/_SUCCESS
FlowRun/1558232254253.31_00410/part-00000
FlowRun/

FlowRun/1558232254253.31_00469/_SUCCESS
FlowRun/1558232254253.31_00469/part-00000
FlowRun/1558232254253.31_00469/part-00001
FlowRun/1558232254253.31_00470/_SUCCESS
FlowRun/1558232254253.31_00470/part-00000
FlowRun/1558232254253.31_00470/part-00001
FlowRun/1558232254253.31_00471/_SUCCESS
FlowRun/1558232254253.31_00471/part-00000
FlowRun/1558232254253.31_00471/part-00001
FlowRun/1558232254253.31_00472/_SUCCESS
FlowRun/1558232254253.31_00472/part-00000
FlowRun/1558232254253.31_00472/part-00001
FlowRun/1558232254253.31_00473/_SUCCESS
FlowRun/1558232254253.31_00473/part-00000
FlowRun/1558232254253.31_00473/part-00001
FlowRun/1558232254253.31_00474/_SUCCESS
FlowRun/1558232254253.31_00474/part-00000
FlowRun/1558232254253.31_00474/part-00001
FlowRun/1558232254253.31_00475/_SUCCESS
FlowRun/1558232254253.31_00475/part-00000
FlowRun/1558232254253.31_00475/part-00001
FlowRun/1558232254253.31_00476/_SUCCESS
FlowRun/1558232254253.31_00476/part-00000
FlowRun/1558232254253.31_00476/part-00001
FlowRun/

FlowRun/1558232254253.31_00535/part-00001
FlowRun/1558232254253.31_00536/_SUCCESS
FlowRun/1558232254253.31_00536/part-00000
FlowRun/1558232254253.31_00536/part-00001
FlowRun/1558232254253.31_00537/_SUCCESS
FlowRun/1558232254253.31_00537/part-00000
FlowRun/1558232254253.31_00537/part-00001
FlowRun/1558232254253.31_00538/_SUCCESS
FlowRun/1558232254253.31_00538/part-00000
FlowRun/1558232254253.31_00538/part-00001
FlowRun/1558232254253.31_00539/_SUCCESS
FlowRun/1558232254253.31_00539/part-00000
FlowRun/1558232254253.31_00539/part-00001
FlowRun/1558232254253.31_00540/_SUCCESS
FlowRun/1558232254253.31_00540/part-00000
FlowRun/1558232254253.31_00540/part-00001
FlowRun/1558232254253.31_00541/_SUCCESS
FlowRun/1558232254253.31_00541/part-00000
FlowRun/1558232254253.31_00541/part-00001
FlowRun/1558232254253.31_00542/_SUCCESS
FlowRun/1558232254253.31_00542/part-00000
FlowRun/1558232254253.31_00542/part-00001
FlowRun/1558232254253.31_00543/_SUCCESS
FlowRun/1558232254253.31_00543/part-00000
FlowRun/

FlowRun/1558232254731.5_00043/_SUCCESS
FlowRun/1558232254731.5_00043/part-00000
FlowRun/1558232254731.5_00043/part-00001
FlowRun/1558232254731.5_00044/_SUCCESS
FlowRun/1558232254731.5_00044/part-00000
FlowRun/1558232254731.5_00044/part-00001
FlowRun/1558232254731.5_00045/_SUCCESS
FlowRun/1558232254731.5_00045/part-00000
FlowRun/1558232254731.5_00045/part-00001
FlowRun/1558232254731.5_00046/_SUCCESS
FlowRun/1558232254731.5_00046/part-00000
FlowRun/1558232254731.5_00046/part-00001
FlowRun/1558232254731.5_00047/_SUCCESS
FlowRun/1558232254731.5_00047/part-00000
FlowRun/1558232254731.5_00047/part-00001
FlowRun/1558232254731.5_00048/_SUCCESS
FlowRun/1558232254731.5_00048/part-00000
FlowRun/1558232254731.5_00048/part-00001
FlowRun/1558232254731.5_00049/_SUCCESS
FlowRun/1558232254731.5_00049/part-00000
FlowRun/1558232254731.5_00049/part-00001
TestRun/_SUCCESS
TestRun/part-00000
TestRun/part-00001
TestRun/part-00002
TestRun/part-00003
TestRun/part-00004
TestRun/part-00005
TestRun/part-00006
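
The definition of `download_bucket_files` is not shown in this section. As a rough idea of what such a helper might look like, here is a minimal sketch under stated assumptions: the name `download_bucket_files_sketch` is hypothetical, it mirrors the bucket's key layout under a local folder named after the bucket, and it relies only on the boto3 calls `Bucket(...)`, `objects.all()`, and `download_file(...)`.

```python
import os


def download_bucket_files_sketch(s3_resource, bucket_name, local_root=None):
    """Download every object in a bucket, mirroring its key layout locally.

    `s3_resource` is expected to be a boto3 S3 resource, e.g.
    boto3.resource('s3'). Returns the list of local file paths written.
    """
    local_root = local_root or bucket_name
    bucket = s3_resource.Bucket(bucket_name)
    downloaded = []
    for obj in bucket.objects.all():
        if obj.key.endswith("/"):
            # Zero-byte "folder" placeholder keys have no content to save.
            continue
        target = os.path.join(local_root, *obj.key.split("/"))
        os.makedirs(os.path.dirname(target), exist_ok=True)
        bucket.download_file(obj.key, target)
        print(obj.key)
        downloaded.append(target)
    return downloaded


# Example call (requires boto3 and AWS credentials, so it is left commented out):
# import boto3
# download_bucket_files_sketch(boto3.resource('s3'), 'jl-output')
```

Because each step's folder structure is preserved, the part files stay in their separate folders, which is what the aggregation script in the next step expects.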


2. __Download the concatenatepv3.py file.__ Put this Python file in the same folder where you saved all your EMR output folders. concatenatepv3.py will open each of your EMR output folders and aggregate all of your part files into two .csv files.

In [65]:
!ls task2scripts2018/

Mapper.py         Reducer.py        concatenatepv3.py


3. __Open a command prompt and run the concatenatepv3.py script.__ In the command prompt, change directory to the folder that contains the EMR output folders. Note: Your output files must still be in their separate folders, and the concatenatepv3.py file must be outside of these folders in order to run properly. To run the script, enter the following into the command prompt:

```python concatenatepv3.py```

The script will ask you for the root location from which to start the walk of directories. If nothing is entered, it will begin walking from the current working directory. The output will be two files: 'concatenated_websites.csv' and 'concatenated_factors.csv'.
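
Before running the script, it can help to confirm that the directory walk will find what you expect. The sketch below is not part of concatenatepv3.py; it is a hypothetical pre-check that walks a root folder the same way (with `os.walk`), counts the part files in each step folder, and reports whether the `_SUCCESS` marker is present. The demo builds a throwaway folder tree so it runs without any downloaded data.

```python
import os
import tempfile


def inspect_step_folders(root="."):
    """Map each folder holding part files to (part_count, has_success_marker)."""
    report = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        parts = [name for name in filenames if name.startswith("part-")]
        if parts:
            report[dirpath] = (len(parts), "_SUCCESS" in filenames)
    return report


# Demo on a temporary tree; the folder names are made up for illustration.
with tempfile.TemporaryDirectory() as demo_root:
    complete = os.path.join(demo_root, "step_complete")
    incomplete = os.path.join(demo_root, "step_incomplete")
    os.makedirs(complete)
    os.makedirs(incomplete)
    for name in ("_SUCCESS", "part-00000", "part-00001"):
        open(os.path.join(complete, name), "w").close()
    open(os.path.join(incomplete, "part-00000"), "w").close()

    for folder, (part_count, ok) in sorted(inspect_step_folders(demo_root).items()):
        status = "ok" if ok else "missing _SUCCESS"
        print(os.path.basename(folder), part_count, status)
```

On your own data, call `inspect_step_folders()` from the folder that holds the EMR output folders; a folder without `_SUCCESS` may belong to a step that did not finish.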

4. __Check the number of instances.__ Open concatenated_factors.csv and inspect the contents. You need to have at least 20,000 instances. If you are short of 20,000 instances, you may need to search additional WET files using the steps above. You can run up to 255 steps (WET files) at a time. If you have at least 20,000 instances, move on to the next step.

In [66]:
df = pd.read_csv('jl-output/FlowRun/concatenated_factors.csv')
display(df.head(),df.shape)

Unnamed: 0,id,iphone,samsunggalaxy,sonyxperia,nokialumina,htcphone,ios,googleandroid,iphonecampos,samsungcampos,...,samsungperunc,sonyperunc,nokiaperunc,htcperunc,iosperpos,googleperpos,iosperneg,googleperneg,iosperunc,googleperunc
0,0,2,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,1,0,1,0,0
2,2,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,3,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,4,1,0,0,0,0,1,1,0,0,...,0,0,0,0,0,0,0,0,0,0


(33794, 59)
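
The first value of the shape is the number of instances, so the output above (33,794 rows) clears the 20,000 threshold. If you want the check without loading the file into pandas, the following sketch counts data rows with the standard `csv` module and reports any shortfall. The function names are illustrative assumptions, and the demo uses a tiny temporary file rather than the real matrix.

```python
import csv
import os
import tempfile

REQUIRED_INSTANCES = 20_000  # threshold stated in step 4


def count_instances(csv_path):
    """Count data rows (excluding the header row) in a CSV file."""
    with open(csv_path, newline="") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header row
        return sum(1 for _ in reader)


def shortfall(instance_count, required=REQUIRED_INSTANCES):
    """Return how many more instances are needed (0 if there are enough)."""
    return max(0, required - instance_count)


# Demo with a tiny made-up file: one header row plus three data rows.
with tempfile.TemporaryDirectory() as demo_dir:
    demo_path = os.path.join(demo_dir, "demo_factors.csv")
    with open(demo_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["id", "iphone", "samsunggalaxy"])
        writer.writerows([[0, 2, 0], [1, 0, 0], [2, 1, 0]])
    demo_count = count_instances(demo_path)

print(demo_count, "instances; still needed:", shortfall(demo_count))
```

A non-zero shortfall means you should run additional WET files before moving on.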

5. __Rename 'concatenated_factors.csv' to LargeMatrix.csv.__ Zip and submit your LargeMatrix file. 

In [67]:
df.to_csv('LargeMatrix.csv', index=False)