# Your Task

## Intro. Email

__FROM:__ Michael Ortiz <br>
__Subject:__ Welcome to Alert! <br>

Hello,

As founding partner and SVP, I would like to personally welcome you to Alert Analytics. I’m excited to have you join the team. You will be working on a challenging project to analyze sentiment on the web toward a number of smart phones for Helio, a smart phone and tablet app developer. You will be taking over this project from Amy Gorman, one of our most experienced analysts. She has made significant progress, but there is still a lot left to do. Before we dig into the details of what I will need you to do, let me give you some background on our client, what they are trying to accomplish, our general approach to the project, and what Amy has accomplished so far.

__Project Background__

Helio is working with a government health agency to create a suite of smart phone medical apps for use by aid workers in developing countries. This suite of apps will enable the aid workers to manage local health conditions by facilitating communication with medical professionals located elsewhere (one of the apps, for example, enables specialists in communicable diseases to diagnose conditions by examining images and other patient data uploaded by local aid workers). The government agency requires that the app suite be bundled with one model of smart phone. Helio is in the process of evaluating potential handset models to determine which one to bundle their software with. After completing an initial investigation, Helio has created a short list of five devices that are all capable of executing the app suite’s functions. To help Helio narrow their list down to one device, they have asked us to examine the prevalence of positive and negative attitudes toward these devices on the web. The goal of this project is to provide our client with a report that contains an analysis of sentiment toward the target devices, as well as a description of the methods and processes we used to arrive at our conclusions.

__Our Approach to the Project__

Although there are a number of ways to capture sentiment from text documents, our general approach to this project is to count words associated with sentiment toward these devices within relevant documents on the web. We then leverage this data and machine learning methods to look for patterns in the documents that enable us to label each of these documents with a value that represents the level of positive or negative sentiment toward each of these devices. We then analyze and compare the frequency and distribution of the sentiment for each of these devices.

In order to really gauge the sentiment toward these devices, we must do this on a very large scale. To that end, we use the cloud computing platform provided by Amazon Web Services (AWS) to conduct the analysis. The data sets we analyze will come from Common Crawl. Common Crawl is an open repository of web crawl data (over 5 billion pages so far) that is stored on Amazon’s Public Data Sets.

__Progress to Date__

The first thing Amy did was to figure out what data to collect from each document in Common Crawl that would enable us to determine if the document was relevant to our analysis. As you can imagine, only a very small fraction of the billions of webpages are going to be expressing useful sentiment about the devices we are interested in. Then she determined what data to collect from each webpage to enable us to assess if the review was strongly positive, strongly negative, or somewhere in between. Next, she wrote Python mapper, reducer, and output aggregator programs to efficiently collect and compile this data across the billions of documents on the Common Crawl. She then put this data into a matrix that contains approximately 12,000 entries (we call it the small data matrix).

I have forwarded an email from Amy that includes the small data matrix she developed. Please look it over. We are in the process of manually labeling this subset of documents with a sentiment rating that you will use later in the process to develop machine learning models capable of determining web page sentiment automatically. You’ll then use those models to review a much larger set of web pages and build a data matrix at least an order of magnitude larger for the complete analysis.

__Your Job on the Project__

Your job during the course of this project is to collect and develop a data matrix in the range of 20 thousand instances (called the large data matrix) of relevant web documents from the Common Crawl. Using Amy’s labeled small matrices, you will create models in R that understand the sentiment patterns within the data. Then you will apply your models to the large matrices you collected to understand their sentiment scores. Lastly you will analyze this large labeled data matrix and report descriptive statistics to the client on the level of sentiment toward the handsets.

__Immediate Next Steps__

You will need to quickly set up and become familiar with Amazon Web Services. AWS is a reliable, scalable, and inexpensive platform for the use of cloud applications and services and it will give you easy access to web data through one management console. The specific AWS services that you will be employing include Elastic Compute Cloud (EC2), Elastic MapReduce (EMR), and Simple Storage Service (S3). I’ve outlined these services briefly below:

* The Amazon Elastic Compute Cloud (EC2) is a web-based service that enables you to run application programs in the Amazon computing environment. EC2 can serve as a practically unlimited set of virtual machines. 
* Amazon Elastic MapReduce (EMR) is a web service that enables you to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3). This is where you will process the Common Crawl data to develop the large data matrix.
* Amazon Simple Storage Service (S3) is storage designed to make web-scale computing easier for developers and provides a simple web services interface that can be used to store and retrieve any amount of data, at any time, from anywhere on the web. This is a general platform that stores all the data (input and output) for the EC2 and EMR services.

As I mentioned earlier, the data sets you will analyze will come from Common Crawl. Common Crawl is stored on Amazon’s Public Data Sets, so you will be able to access it directly for MapReduce processing.

I would like you to sign up for an Amazon Web Services account and become familiar with the services mentioned above (EC2, EMR, and S3). Once you’ve done this, I’d like you to run a sample task to ensure that you have set up the AWS services correctly and that you are ready to start conducting the analysis. This task involves pairing AWS, S3, and EMR with Hadoop Streaming (a utility that allows you to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer). This sample job sets up a basic streaming EMR job and uses two Python programs (Mapper.py and Reducer.py) to count the number of times each word occurs within a wet.gz file. The primary purpose of this is just to give you experience and confidence using AWS before moving on to using the Command Line Interface.
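
The word-count idea behind a Hadoop Streaming job can be sketched in a few lines of Python. This is only an illustration, not the Mapper.py and Reducer.py supplied for the sample task: the function names `map_lines`, `reduce_records`, and `run_locally` are hypothetical, and the tokenizing rule (lowercase letters, digits, and apostrophes) is an assumption. In a real streaming job the mapper and reducer would read from standard input and write to standard output; here the sort step that Hadoop performs between them is simulated locally.

```python
import itertools
import re

# Assumed tokenizer: lowercase runs of letters, digits, and apostrophes.
WORD_RE = re.compile(r"[a-z0-9']+")


def map_lines(lines):
    """Mapper step: emit one 'word<TAB>1' record per word occurrence."""
    for line in lines:
        for word in WORD_RE.findall(line.lower()):
            yield f"{word}\t1"


def reduce_records(sorted_records):
    """Reducer step: sum the counts of consecutive records sharing a key.

    Hadoop Streaming hands the reducer its input sorted by key, so grouping
    consecutive records is enough to total each word.
    """
    parsed = (record.rstrip("\n").split("\t", 1) for record in sorted_records)
    for word, group in itertools.groupby(parsed, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_locally(lines):
    """Simulate map -> sort -> reduce on one machine for a quick check."""
    return list(reduce_records(sorted(map_lines(lines))))


if __name__ == "__main__":
    sample = ["The camera is great", "the display is great too"]
    for record in run_locally(sample):
        print(record)
```

Running the same logic locally on a few lines of text is a cheap way to check a mapper and reducer before paying for an EMR cluster.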

Glad to have you onboard!

Michael

__Michael Ortiz__<br>
Senior Vice-President<br>
Alert! Analytics<br>

## Helio Sentiment Analysis Overview

__FROM:__ Michael Ortiz <br>
__Subject:__ FWD: Small data matrix for Helio Sentiment Analysis

Hi Mike,

Here is the final run of the small data matrix that I developed for the Helio sentiment analysis project (Small Matrix unlabeled 8d). I have also attached a detailed explanation of each column in the matrix (Helio Sentiment Analysis Matrix Detail). I have explained my basic approach for the new analyst below. I will send the three Python programs I developed to construct the small data matrix shortly.

All the Best,

Amy

__Amy Gorman__<br>
Data Analyst<br>
Alert! Analytics
 

__General Approach__

My approach is to scan the Common Crawl for documents that are relevant to the text analysis (documents that express meaningful sentiment about one of the phones) and then, for each relevant document, collect information about the sentiment toward key features of the phone (camera, display, performance, and operating system). I believe that patterns in the sentiment toward these features can help us develop a model to predict the overall sentiment toward each phone.

To gather information about the relevancy of the document and the sentiment toward phone features, I basically look for and count relevant words. To assess relevancy, I look for references to the device itself and references to words that indicate the document contains a meaningful view of the device. To gather sentiment toward each feature, I look for references to the feature itself and references to positive, negative, and neutral words. Documents are retained only if they have at least one mention of a phone or phone OS term (rows 5-12, columns B-H), and at least one of the following terms is present in the document: review; critique; looks at; in depth; analysis; evaluate; evaluation; assess.
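
The retention rule described above can be sketched as follows. The review-type terms are the ones listed in this email; the device and OS terms are placeholders (the actual list is in the matrix detail attachment), and whole-word, case-insensitive matching is an assumption about how terms are counted. The names `count_term` and `is_relevant` are hypothetical.

```python
import re

# Review-type terms listed in the email.
REVIEW_TERMS = ["review", "critique", "looks at", "in depth", "analysis",
                "evaluate", "evaluation", "assess"]

# Placeholder device / OS terms (assumption); see the matrix detail file
# for the terms actually used.
DEVICE_TERMS = ["iphone", "samsung galaxy", "ios", "android"]


def count_term(term, text):
    """Count whole-word, case-insensitive occurrences of a term or phrase."""
    pattern = r"\b" + re.escape(term) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


def is_relevant(text, device_terms=DEVICE_TERMS, review_terms=REVIEW_TERMS):
    """Keep a document only if it mentions a device/OS term AND a review term."""
    has_device = any(count_term(term, text) > 0 for term in device_terms)
    has_review = any(count_term(term, text) > 0 for term in review_terms)
    return has_device and has_review


if __name__ == "__main__":
    print(is_relevant("An in depth review of the Samsung Galaxy camera"))
    print(is_relevant("Trending: iPhone iPhone iPhone"))
```

Under the whole-word assumption, "reviews" would not count as a match for "review"; a production script would need to decide how to treat plurals and other variants.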

I know this seems simple, but my mantra is to use the simplest, most efficient approach that provides meaningful results. I have found that more complex approaches to text analysis can become untenable as they are scaled up across billions of web pages. For example, I’ve found that more is not necessarily better when it comes to features. If you create a set of features that is too large, it seems to actually hurt the performance of classifiers because they will be exposed to too much ‘noise’, or irrelevant data. In principle, this could likely be mitigated with more sophisticated and non-linear machine learning algorithms, but those algorithms will likely become very computationally expensive quickly.

To keep things manageable, I’ve found that it’s best to use the simplest set of features possible, with the simplest classifier possible, to achieve the best results. Another hazard I’ve noticed (with regard to making things too complex) is that I may not always make the right interpretation for why certain results were obtained. In my experience, getting a good feature set that is both compact and useful for prediction is not always an easy goal to reach and can require extensive experimentation to achieve, but trying to keep it as simple as you can seems to be a good guiding principle.

__Structure of Small Data Matrix__

* Each row in the csv file is a webpage from the Common Crawl that is relevant to the analysis. 
* Columns A-BF contain information I collected from each web page to assess the sentiment toward each phone.

__It may be helpful to think of the small data matrix as being organized into the following five sections:__

1. Attributes that collect information about the relevancy of the webpage toward each device (columns A-E).
     * These attributes basically answer the question: “Does this webpage express meaningful sentiment about a device we care about?”
     * I use the values in these columns to include the webpages I want to collect further information from and throw out those that aren’t relevant.
2. Attributes that collect information about the sentiment toward the operating system used on the phone (columns F-G).
     * These two columns record the number of mentions of terms for each operating system.
     * These columns help us determine whether the operating system is discussed in the webpage; the number of positive, negative and neutral words about the operating systems is recorded in columns BA-BF.
3. Attributes that collect information about the sentiment toward a phone’s camera (columns H-V).
     * These columns record the number of positive, negative and neutral words that are within reasonable proximity to the word camera or alternative words for camera (e.g., iSight).
4. Attributes that collect information about the sentiment toward a phone’s display (columns W-AK).
     * These columns record the number of positive, negative and neutral words that are in reasonable proximity to the word display or alternative words for display.
5. Attributes that collect information about the sentiment toward a phone’s performance (columns AL-BF).
     * In the first set of columns (AL-AZ) I collect information about the sentiment towards the performance of a phone’s hardware. These columns record the number of positive, negative and neutral words that are within reasonable proximity to the word performance or alternative words for performance.
     * In the next set of columns (BA-BF), I record the number of positive, negative and neutral words that are in reasonable proximity to words related to the operating system (e.g., iOS, Android operating system etc.)
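
When the matrix is loaded in code rather than in a spreadsheet, the lettered column ranges above have to be turned into positional indexes. The sketch below does that conversion. The column ranges come from the list above; the section names (`relevancy`, `os_mentions`, and so on) and the helper names `column_index` and `section_slice` are hypothetical labels chosen for illustration.

```python
def column_index(letters):
    """Convert a spreadsheet column label (A, Z, AA, BF) to a 0-based index."""
    index = 0
    for char in letters.upper():
        index = index * 26 + (ord(char) - ord("A") + 1)
    return index - 1


# Column ranges as described in the list above; the keys are illustrative names.
SECTIONS = {
    "relevancy": ("A", "E"),
    "os_mentions": ("F", "G"),
    "camera": ("H", "V"),
    "display": ("W", "AK"),
    "performance": ("AL", "AZ"),
    "os_sentiment": ("BA", "BF"),
}


def section_slice(name):
    """Return a slice covering the columns of the named section."""
    first, last = SECTIONS[name]
    return slice(column_index(first), column_index(last) + 1)


if __name__ == "__main__":
    for name in SECTIONS:
        cols = section_slice(name)
        print(name, cols.start, cols.stop, cols.stop - cols.start)
```

With the csv loaded into a pandas DataFrame named `df` (an assumption), `df.iloc[:, section_slice("camera")]` would select the camera columns.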
     
__Lessons Learned__

I thought it might be helpful to the new analyst if I explained three key discoveries I made during the development of the small data matrix.

1. Counting mentions of the device name produces too many irrelevant results.
     * To determine if a webpage is relevant to our analysis, I originally wrote the script to count the number of references to a given device (e.g., Samsung Galaxy) and accept a document as relevant if the number of references met a certain threshold. This approach let in an unacceptable number of irrelevant documents. For example, web pages that contain trending search topics mention the name of a handset multiple times, but don’t include any discussion of the handset itself. The method also let in lengthy and rambling forums, in which the handset is mentioned multiple times, but no coherent, meaningful sentiment is expressed.
     * This lesson is important because a key challenge in a typical large-scale sentiment analysis is to find the fractionally small number of documents that are relevant to the analysis across billions of webpages.
     * Current approach: The script now checks for both the name of the device and words that indicate that the document includes a meaningful assessment of the phone (e.g., review etc.). This method eliminates most of the irrelevant documents from the small data matrix.
2. Looking for positive, negative, neutral words that are adjacent to a “feature” word produced too few results.
     * I set the script to look for positive, negative, or neutral words immediately adjacent to terms that refer to camera, display, performance, and operating system. A test returned almost no results, as very few webpages state opinion about a feature in such a literal way.
     * This lesson is important because, in most cases, hundreds of thousands of webpages need to be analyzed to get a meaningful assessment of sentiment.
     * Current approach: The script looks for positive, negative, and neutral words that are within five words of a feature word (camera, display, performance) or an alternate of a feature word (e.g., iSight).
3. Apache Pig has limited application to text analysis.
     * I originally explored using the Pig language to create the small data matrix. At first glance, it seemed to be the perfect way to script the functions that the mapper and reducer Python programs currently execute. However, it is likely not the best choice in our case for the following reason: The Pig language does not support regular expressions when counting word occurrences. This means that only single words or word phrases that are immediately adjacent to each other can be counted. This does not allow the flexibility to capture many different uses of language when describing sentiment toward a phone or one of the phone’s features. This limitation can be overcome by building a ‘user-defined function’ (UDF) in another language (such as Python) and implementing the regular expressions in that UDF. However, I found that the UDF is not efficiently mapped across the distributed computational power of the EMR platform, creating a significant bottleneck in processing power. It seems to me that this inefficiency has the potential to increase the time required to process a batch of one billion pages from about two days to over two weeks, depending on the number of instances being used.
     * This lesson is important because using Apache Pig requires less effort and time than developing Python programs so it seems very attractive to leverage, but it will not deliver even a relatively simple text analysis in an efficient manner, which is critically important when analyzing text in scale.
     * Current approach: I wrote programs in Python to run several map and reduce jobs on Common Crawl data using EMR. These EMR jobs run multiple instances of the mapper Python script (Mapper.py) on multiple servers (nodes). Each of the mapper instances processes a subset of the overall set of Common Crawl files for the job. When a mapper completes, it sends its results to a reducer process (run by the Reducer.py script). The reducer process combines the results from some number of mapper processes, then writes the combined results out to the S3 location. When a mapper or reducer completes, another process starts up in its place if there are still files to process. The output of the job flows is in comma-separated value format (.csv) and can be opened in Excel or a text editor. However, since the results are split into one folder per step and one file per reducer, I created a Python script (ConcatenateFiles.py) and a .csv file (MatrixHeader.csv) that combine everything and add a header row.
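
The proximity rule from lesson 2 (sentiment words within five words of a feature word) can be sketched as below. This is an illustration under stated assumptions, not the logic in Mapper.py: "within five words" is read as a token distance of at most five, only single-word terms are handled, each sentiment word is counted once even if it is near several feature mentions, and the function name `count_nearby` and the example word lists are hypothetical.

```python
import re

# Assumed tokenizer: lowercase runs of letters, digits, and apostrophes.
TOKEN_RE = re.compile(r"[a-z0-9']+")


def count_nearby(text, feature_words, sentiment_words, window=5):
    """Count sentiment tokens lying within `window` tokens of a feature token."""
    tokens = TOKEN_RE.findall(text.lower())
    feature_positions = [i for i, tok in enumerate(tokens) if tok in feature_words]
    count = 0
    for i, tok in enumerate(tokens):
        if tok in sentiment_words and any(
            abs(i - pos) <= window for pos in feature_positions
        ):
            count += 1
    return count


if __name__ == "__main__":
    text = ("The camera is really great, but after a long day of heavy use "
            "the battery life is terrible")
    camera_words = {"camera", "isight"}
    print(count_nearby(text, camera_words, {"great"}))            # near "camera"
    print(count_nearby(text, camera_words, {"terrible"}))         # too far away
    print(count_nearby(text, camera_words, {"great"}, window=1))  # adjacency only
```

Setting `window=1` reproduces the adjacency-only rule that returned almost no results, which makes the effect of widening the window easy to compare on sample text.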

## Attachments:

[Helio_Sentiment_Analysis_Matrix_Detail](https://s3.amazonaws.com/gbstool/emails/2082/Helio_Sentiment_Analysis_Matrix_Detail.xlsx?AWSAccessKeyId=AKIAJBIZLMJQ2O6DKIAA&Expires=1559811600&Signature=TVhh9kjCc%2F1RFv63yZZVU0nUtUg%3D)

[SampleMatrix_unlabeled](https://s3.amazonaws.com/gbstool/emails/2403/SampleMatrix_unlabeled.csv?AWSAccessKeyId=AKIAJBIZLMJQ2O6DKIAA&Expires=1559811600&Signature=i5y0%2FOyjq9wDPbe9UokFxCG5IeY%3D)

# Plan of Attack

## Introduction

You have been asked by Michael Ortiz, Senior Vice President of Alert Analytics, to review the small data matrix and the forwarded email from the previous analyst that explains how she created it. You will then set up your Amazon Web Services computing environment and contact your mentor to confirm that you have completed this task.



## Install Python Related Libraries

In order to work with much of the data you'll be mapping from Common Crawl you'll need to work with a programming language called Python. From the Python website:

"Python is a programming language. It's used for many different applications. It's used in some high schools and colleges as an introductory programming language because Python is easy to learn, but it's also used by professional software developers at places such as Google, NASA, and Lucasfilm Ltd."

There are numerous versions of Python available for you to use, but we'll be using Python 3.6 for this exercise. NOTE: If you use Python 2.7 for this course, some of the provided Python files will not run and you'll need to rewrite them for your version of Python.

We'll also be using a few libraries for Python, namely Numpy and Pandas; here is a brief word on each:

__NumPy (www.numpy.org)__

NumPy is the fundamental package for scientific computing with Python. It contains among other things:

* a powerful N-dimensional array object
* sophisticated (broadcasting) functions
* tools for integrating C/C++ and Fortran code
* useful linear algebra, Fourier transform, and random number capabilities

Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data. Arbitrary data-types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases.

NumPy is licensed under the BSD license, enabling reuse with few restrictions.

__Pandas (Python Data Analysis Library)(pandas.pydata.org)__

pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

pandas is a NumFOCUS sponsored project. This will help ensure the success of development of pandas as a world-class open-source project.

You may choose to install Python, NumPy and pandas individually on your own. This is only recommended if you already have prior programming experience and/or working knowledge of Python.

Thankfully, we also have another option that will install Python, Numpy and Pandas together; it is called the Anaconda Data Science Platform.

Per the Anaconda website: Anaconda is the leading open data science platform powered by Python. The open source version of Anaconda is a high-performance distribution of Python and R and includes over 100 of the most popular Python, R and Scala packages for data science. Additionally, you'll have access to over 720 packages that can easily be installed with conda, our renowned package, dependency and environment manager, that is included in Anaconda.

You can download and install Anaconda at the link below. Be sure to select the link for the Anaconda version that uses Python 3.6. https://www.anaconda.com/download
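
After installing, a quick check along these lines can confirm that the interpreter version and the two libraries are available. It is a minimal sketch: the function name `check_environment` is hypothetical, and it only tests that the packages can be found, not that they work correctly.

```python
import importlib.util
import sys


def check_environment(required=("numpy", "pandas"), minimum=(3, 6)):
    """Report whether the interpreter and required packages look usable."""
    report = {"python_ok": sys.version_info[:2] >= minimum}
    for name in required:
        # find_spec returns None when a top-level package is not installed.
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    print(sys.version.split()[0], check_environment())
```

Any `False` value in the report points to the piece of the installation that needs attention.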

## Get Started - Setup AWS Platform and Integrate Services

1. Review the email from the previous analyst and the attachments to her email to get a handle on what has been done to date:

* SampleMatrix_unlabeled
* Helio_Sentiment_Analysis_Matrix_Detail

2. Create an AWS account. This account will give you access to all the tools you'll need to complete the upcoming tasks in this course. Below is a PDF that will help you set up your account, create a user and provision your user. 

* Account Set Up Guide:  [AWS_Account_Setup_JUNE17](https://s3.amazonaws.com/gbstool/courses/908/docs/AWS_Account_Setup_JUNE17.pdf?AWSAccessKeyId=AKIAJBIZLMJQ2O6DKIAA&Expires=1559811600&Signature=jaa1IcNr42vtLm5LL58wv8Ny%2FU4%3D)
* Amazon Web Services: http://aws.amazon.com/

This next section outlines setting up the AWS Command Line Interface; we will refer to this as the CLI in the weekly meetings.

3. Install the AWS CLI. This allows you to programmatically launch and monitor progress of running job flows, and to create additional custom functionality around job flows (such as sequences with multiple processing steps, scheduling, workflow, or monitoring).


* For Windows http://docs.aws.amazon.com/cli/latest/userguide/installing.html#install-msi-on-windows
* For Linux or OS X http://docs.aws.amazon.com/cli/latest/userguide/installing.html#install-with-pip
 

From a notebook cell (drop the leading `!` in a regular command shell):

```
!pip install awscli --upgrade --user
```

```
Collecting awscli
  Downloading https://files.pythonhosted.org/packages/ba/4b/d879c45273a6635c67a646c1d41023d3b06f41b4a142abc887fcd11f00ec/awscli-1.16.172-py2.py3-none-any.whl (1.6MB)
Collecting s3transfer<0.3.0,>=0.2.0 (from awscli)
  Downloading https://files.pythonhosted.org/packages/16/8a/1fc3dba0c4923c2a76e1ff0d52b305c44606da63f718d14d3231e21c51b0/s3transfer-0.2.1-py2.py3-none-any.whl (70kB)
Collecting PyYAML<=3.13,>=3.10 (from awscli)
Collecting botocore==1.12.162 (from awscli)
  Downloading https://files.pythonhosted.org/packages/9c/70/519c9fce0af131989042906d3584f0b105588de211cf6a060ed018c98b10/botocore-1.12.162-py2.py3-none-any.whl (5.5MB)
Collecting colorama<=0.3.9,>=0.2.5 (from awscli)
  Using cached https://files.pythonhosted.org/p
```

__Test the Install and Configure CLI__

1. In the command shell run “aws help”. The AWS CLI help topics should be displayed. If you receive a warning that "aws" is not a recognized command, you will need to follow the instructions for adding the CLI to your PATH. Visit the AWS CLI install pages linked above for the instructions. 

```
!aws help
```

```
AWS()                                                                    AWS()

NAME
       aws -

DESCRIPTION
       The AWS Command Line Interface is a unified tool to manage your AWS
       services.

SYNOPSIS
          aws [options] <command> <subcommand> [parameters]

       Use aws command help for information on a specific command. Use aws
       help topics to view a list of available help topics. The synopsis for
       each command shows its parameters and their usage. Optional parameters
       are shown in square brackets.

OPTIONS
       --debug (boolean)

       Turn on debug logging.

       --endpoint-url (string)

       Override command's default URL with the given URL.

       --no-verify-ssl (boolean)

       By default, the AWS CLI uses SS
```

2. Access Keys. If you don’t already have access keys, get them by following the instructions here: http://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-set-up.html#cli-signup. Make sure to note the region that you set up the Access Keys on for the next step.
3. Set up the AWS credentials and config files.  Your credentials are used to calculate the signature value for every request you make. In the command shell, run “aws configure” as shown here: http://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-started.html#cli-quick-configuration.
(Note: This creates a “config” file and “credentials” file in C:\Users\[USERNAME]\.aws on Windows and ~/.aws on Linux and OS X.)

__You will need to enter four pieces of information after running aws configure:__
* Access Key (you downloaded this while setting up your account)
* Secret Access Key (you downloaded this while setting up your account)
* Default Region Name: us-east-1
* Default Output Format: json
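
To confirm what `aws configure` wrote, the config file can be read back with Python's standard `configparser`, since the file uses an INI-style layout with a `[default]` section containing `region` and `output` keys. The sketch below reads only the config file, not the credentials file, so no keys are displayed. The function name `read_default_settings` is hypothetical.

```python
import configparser
from pathlib import Path


def read_default_settings(config_path=None):
    """Return the [default] section of an AWS CLI config file as a dict."""
    path = Path(config_path) if config_path else Path.home() / ".aws" / "config"
    parser = configparser.ConfigParser()
    parser.read(str(path))  # a missing file simply yields no sections
    if not parser.has_section("default"):
        return {}
    return dict(parser["default"])


if __name__ == "__main__":
    settings = read_default_settings()
    print("region:", settings.get("region"))
    print("output:", settings.get("output"))
```

If the values entered above were saved, the region should read `us-east-1` and the output format `json`.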

 

__Additional Tools__

Install Cyberduck (check the required resources) and connect to S3. Cyberduck is a special type of browser that allows you to view and access files stored on Amazon S3. This will enable you to easily explore S3 buckets and upload/download multiple files at a time.

Install a text editor. It is highly recommended that you install a third-party text editor. The text editors that come preinstalled on OS X and Windows machines are not appropriate for this level of work. We recommend Sublime (https://www.sublimetext.com/). It's free to download and has a long introductory period. 
 

## Submit Your Work

Your mentors want to make sure that you have installed and configured the CLI correctly before proceeding to the next task. To do this, you will need to take a screen shot of two things: 

1. The top right-hand corner of your AWS Console web page. Specifically, your mentor needs to make sure that your region is Northern Virginia (us-east-1). 
2. Your configured CLI. To do this, run "aws configure" again in your command shell. This will redact the majority of your access and secret access keys, which will help keep your account secure. 

Your screen shot(s) could look similar to the image below. Upload your screen shot(s) on the Submit Your Work tab. 