# MapReduce Using `MRJob`: Part 1

## Job Posting Dataset

The sample dataset used throughout this tutorial (`data/job-data/job-data-2018-09-*.txt`) contains job postings from a US job search website. Each row of a file is one JSON document representing a single job posting record.

The example below shows a sample job posting from the data file. The record has been reformatted with 4-space indentation for readability; in the real file, each record is stored as a JSON document on a single line.

*Example: JSON document of a job posting record*

```
{
    "industry": "Information Technology", 
    "datePosted": "2018-09-07", 
    "salaryCurrency": "USD", 
    "validThrough": "2018-10-07", 
    "empId": 671932, 
    "jobLocation": {
        "geo": {
            "latitude": "37.7623", 
            "@type": "GeoCoordinates", 
            "longitude": "-122.4145"
        }, 
        "@type": "Place", 
        "address": {
            "postalCode": "94110-2042", 
            "addressLocality": "San Francisco", 
            "@type": "PostalAddress", 
            "addressRegion": "CA", 
            "addressCountry": {
                "@type": "Country", 
                "name": "US"
            }
        }
    }, 
    "estimatedSalary": {
        "@type": "MonetaryAmount", 
        "currency": "USD", 
        "value": {
            "maxValue": "202000", 
            "@type": "QuantitativeValue", 
            "unitText": "YEAR", 
            "minValue": "146000"
        }
    }, 
    "description": "<div><em>Generate insights and impact from data</em><em>.</em></div>\n<br/>\n<div>\n<div>We're looking for data scientists to join the Analytics team who are excited about applying their analytical skills to understand our users and influence decision making. If you are naturally data curious, excited about deriving insights from data, and motivated by having impact on the business, we want to hear from you.</div><br/>\n\n<div><strong>You will:</strong></div><div>\n\n\n<ul>\n<li>Work closely with product and business teams to identify important questions and answer them with data.</li>\n</ul>\n\n</div><br/>\n\n<div>\n\n\n<ul>\n<li>Apply statistical and econometric models on large datasets to: i) measure results and outcomes, ii) identify causal impact and attribution, iii) predict future performance of users or products.</li>\n</ul>\n\n</div><br/>\n\n<div>\n\n\n<ul>\n<li>Design, analyze, and interpret the results of experiments.</li>\n</ul>\n\n</div><br/>\n\n<div>\n\n\n<ul>\n<li>Drive the collection of new data and the refinement of existing data sources.</li>\n</ul>\n\n</div><br/>\n\n<div>\n\n\n<ul>\n<li>Create analyses that tell a \"story\" focused on insights, not just data.</li>\n</ul>\n\n</div><br/>\n\n<div><strong>We're looking for someone with:</strong></div><div>\n\n\n<ul>\n<li>3+ years experience working with and analyzing large data sets to solve problems.</li>\n</ul>\n\n</div><br/>\n\n<div>\n\n\n<ul>\n<li>A PhD or MS in a quantitative field (e.g., Economics, Statistics, Eng, Natural Sciences, CS).</li>\n</ul>\n\n</div><br/>\n\n<div>\n\n\n<ul>\n<li>Expert knowledge of a scientific computing language (such as R or Python) and SQL.</li>\n</ul>\n\n</div><br/>\n\n<div>\n\n\n<ul>\n<li>Strong knowledge of statistics and experimental design.</li>\n</ul>\n\n</div><br/>\n\n<div>\n\n\n<ul>\n<li>Ability to communicate results clearly and a focus on driving impact.</li>\n</ul>\n\n</div><br/>\n\n<div><strong>Nice to 
haves:</strong></div><div>\n\n\n<ul>\n<li>Prior experience with data-distributed tools (Scalding, Hadoop, Pig, etc).</li>\n</ul>\n\n</div><br/>\n\n<div><strong>You should include these in your application:</strong></div><div>\n\n\n<ul>\n<li>Resume and LinkedIn profile.</li>\n</ul>\n\n</div><br/>\n\n<div>\n\n\n<ul>\n<li>Description of the most interesting data analysis you've done, key findings, and its impact.</li>\n</ul>\n\n</div><br/>\n\n<div>\n\n\n<ul>\n<li>Link to or attachment of code you've written related to data analysis.</li>\n</ul>\n\n</div>\n</div>\n<br/>", 
    "hiringOrganization": {
        "@type": "Organization", 
        "sameAs": "www.stripe.com", 
        "name": "Stripe"
    },
    "@type": "JobPosting", 
    "jobId": 2280174543, 
    "@context": "http://schema.org", 
    "employmentType": "FULL_TIME", 
    "occupationalCategory": [
        "15-1111.00", 
        "Computer and Information Research Scientists"
    ], 
    "title": "Data Scientist"
}
```

Copy input data to HDFS:

In [1]:
!hdfs dfs -mkdir job-data/

mkdir: `job-data': File exists


In [2]:
!hdfs dfs -put ../data/job-data/* job-data/

put: `job-data/job-data-2018-09-08.txt': File exists
put: `job-data/job-data-2018-09-09.txt': File exists


## 1. Protocols For Input & Output

`mrjob` assumes that all data is newline-delimited bytes. Each job has an *input protocol*, an *output protocol*, and an *internal protocol*. These protocols can be changed by overriding the class attributes `INPUT_PROTOCOL`, `OUTPUT_PROTOCOL`, and `INTERNAL_PROTOCOL`, respectively.

The default *input* protocol is `RawValueProtocol`, which reads each line as a `str` value with no key.
The default *output* and *internal* protocols are both `JSONProtocol`, which reads and writes each key-value pair as a JSON-encoded key and a JSON-encoded value separated by a tab character.

`JSONValueProtocol` encodes each value as JSON and discards the key (on input, the key is read as `None`). To load the job posting dataset, we can set `INPUT_PROTOCOL = JSONValueProtocol`, which automatically loads each input line as a Python `dict`.

For more information, see [Protocols](https://pythonhosted.org/mrjob/guides/writing-mrjobs.html#job-protocols).
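
To make the line formats concrete, the sketch below imitates the two JSON protocols with the standard `json` module. It is not `mrjob`'s own code: the function names (`read_json_value`, `write_json_value`, `write_json_pair`) are made up for illustration, and the real protocols work on bytes rather than `str`.

```python
import json

# Illustrative stand-ins only; these are not mrjob's implementations.


def read_json_value(line):
    """Imitate reading with JSONValueProtocol: no key, value is the decoded JSON."""
    return None, json.loads(line)


def write_json_value(key, value):
    """Imitate writing with JSONValueProtocol: the key is dropped."""
    return json.dumps(value)


def write_json_pair(key, value):
    """Imitate writing with JSONProtocol: JSON key, a tab, then JSON value."""
    return json.dumps(key) + "\t" + json.dumps(value)


line = '{"jobId": 2280174543, "title": "Data Scientist"}'
key, record = read_json_value(line)
print(key, record["title"])
print(write_json_pair(record["jobId"], record["title"]))
print(write_json_value(key, record))
```

The second `print` shows why the default output has two tab-separated columns, while the third shows the single-column form produced when the key is discarded.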

### **Example**: Simple JSON Parser

The script below passes each record to `MRTest.mapper` as a Python `dict` and emits key-value pairs in which the key is `jobId` and the value is `jobLocation`; these pairs are then written to the output files as JSON documents. Note that no `MRTest.reducer` is provided; jobs of this type are called *map-only* jobs.

- *Data flow*:

  - Input:`record`
  - $\quad\downarrow$
  - mapper:`<_, record> -> <jobId, jobLocation>`
  - $\quad\downarrow$
  - Output:`jobId jobLocation`
  
- *Features and highlights*:
  
  `INPUT_PROTOCOL = JSONValueProtocol` allows MRJob to parse each JSON document into a Python `dict`.

In [3]:
%%file mr-jobs/1_protocols.py
from mrjob.job import MRJob
from mrjob.protocol import JSONValueProtocol


class MRTest(MRJob):
    
    INPUT_PROTOCOL = JSONValueProtocol

    def mapper(self, _, value):
        yield value.get('jobId', None), value.get('jobLocation', None)

        
if __name__ == '__main__':
    MRTest.run()

Writing mr-jobs/1_protocols.py


- Test locally:

In [4]:
!python3 mr-jobs/1_protocols.py ../data/job-data/* --output-dir mr-output

No configs found; falling back on auto-configuration
No configs specified for inline runner
Running step 1 of 1...
Creating temp directory /tmp/1_protocols.hadoop.20180926.222546.611543
job output is in mr-output
Removing temp directory /tmp/1_protocols.hadoop.20180926.222546.611543...
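
To check the result of a local run, the output directory can be read back in Python. The sketch below assumes that the runner wrote one or more `part-*` files into `mr-output` and that each line holds a JSON key and a JSON value separated by a tab (the default `JSONProtocol` output). The helper names `parse_pair` and `read_output` are hypothetical.

```python
import glob
import json
import os


def parse_pair(line):
    """Split one output line into its JSON-decoded key and value."""
    key, value = line.rstrip("\n").split("\t", 1)
    return json.loads(key), json.loads(value)


def read_output(output_dir):
    """Yield (key, value) pairs from every part-* file in output_dir.

    Assumes the directory layout described above; adjust the pattern
    if your runner names its output files differently.
    """
    for path in sorted(glob.glob(os.path.join(output_dir, "part-*"))):
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                if line.strip():
                    yield parse_pair(line)


if os.path.isdir("mr-output"):
    for i, (job_id, location) in enumerate(read_output("mr-output")):
        print(job_id, location)
        if i >= 2:  # only preview the first three pairs
            break
```

Splitting on the first tab is enough here because JSON encoding escapes any tab characters inside the key or value.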


- Run on your Hadoop cluster:

In [5]:
!hdfs dfs -rm -r hdfs:///user/hadoop/mr-output

Deleted hdfs:///user/hadoop/mr-output


In [6]:
!python3 mr-jobs/1_protocols.py -r hadoop \
hdfs:///user/hadoop/job-data/ \
--output-dir mr-output/

No configs found; falling back on auto-configuration
No configs specified for hadoop runner
Looking for hadoop binary in /usr/local/hadoop-2.8.4/bin...
Found hadoop binary: /usr/local/hadoop-2.8.4/bin/hadoop
Using Hadoop version 2.8.4
Looking for Hadoop streaming jar in /usr/local/hadoop-2.8.4...
Found Hadoop streaming jar: /usr/local/hadoop-2.8.4/share/hadoop/tools/lib/hadoop-streaming-2.8.4.jar
Creating temp directory /tmp/1_protocols.hadoop.20180926.222550.384342
Copying local files to hdfs:///user/hadoop/tmp/mrjob/1_protocols.hadoop.20180926.222550.384342/files/...
Running step 1 of 1...
  packageJobJar: [/tmp/hadoop-unjar4154328930656747625/] [] /tmp/streamjob4721268306871442407.jar tmpDir=null
  Connecting to ResourceManager at /0.0.0.0:8032
  Connecting to ResourceManager at /0.0.0.0:8032
  Total input files to process : 2
  number of splits:2
  Submitting tokens for job: job_1537993323748_0013
  Submitted application application_1537993323748_0013
  The url to track the job: ht

### **Example**: Simple Event Counter

With the help of MRJob, simple MapReduce jobs are easy to write. For example, to count the number of jobs available in each region, the data flow becomes:

- *Data flow*:

  - Input:`record`
  - $\quad\downarrow$
  - mapper:`<_, record> -> <addressRegion, 1>`
  - $\quad\downarrow$
  - reducer:`<addressRegion, [1]> -> <addressRegion, count>`
  - $\quad\downarrow$
  - Output:`addressRegion count`


In [7]:
%%file mr-jobs/1_region_counter.py
from mrjob.job import MRJob
from mrjob.step import MRStep
from mrjob.protocol import JSONValueProtocol


class MRCounter(MRJob):
    
    INPUT_PROTOCOL = JSONValueProtocol

    def mapper(self, _, value):
        try:
            region = value['jobLocation']['address']['addressRegion']
        except KeyError:
            yield 'NA', 1
        else:
            yield region, 1
    
    def reducer(self, key, values):
        yield key, sum(values)
    
    def steps(self):
        return [MRStep(mapper=self.mapper,
                       combiner=self.reducer,
                       reducer=self.reducer)]
    
        
if __name__ == '__main__':
    MRCounter.run()

Writing mr-jobs/1_region_counter.py
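
The job above reuses `reducer` as the combiner. This is safe because addition is associative and commutative: partial sums computed on each mapper's output can be summed again by the reducer. The sketch below simulates that flow in plain Python, without `mrjob`; the two input "splits" and the helper names `sum_by_key` and `make_record` are invented for illustration.

```python
from collections import defaultdict


def mapper(record):
    """Emit (region, 1), or ('NA', 1) when the region is missing."""
    try:
        region = record["jobLocation"]["address"]["addressRegion"]
    except KeyError:
        region = "NA"
    yield region, 1


def sum_by_key(pairs):
    """Group pairs by key and sum the values (used as combiner and reducer)."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


def make_record(region):
    """Build a minimal, made-up record that carries only a region."""
    return {"jobLocation": {"address": {"addressRegion": region}}}


# Two hypothetical input splits, each handled by its own mapper task.
splits = [
    [make_record("CA"), make_record("NY"), make_record("CA")],
    [make_record("CA"), {"title": "no location"}],
]

# Combiner: pre-aggregate each split's mapper output locally.
combined = [
    sum_by_key(pair for record in split for pair in mapper(record))
    for split in splits
]
# Reducer: merge the partial counts from all splits.
result = sum_by_key(pair for partial in combined for pair in partial)
print(combined)
print(result)
```

With the combiner, the first split sends two pairs to the reducer instead of three; on real data the reduction in shuffled pairs is usually much larger.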


- Test locally:

In [8]:
!python3 mr-jobs/1_region_counter.py ../data/job-data/* --output-dir mr-output

No configs found; falling back on auto-configuration
No configs specified for inline runner
Running step 1 of 1...
Creating temp directory /tmp/1_region_counter.hadoop.20180926.222629.230984
job output is in mr-output
Removing temp directory /tmp/1_region_counter.hadoop.20180926.222629.230984...


- Run on your Hadoop cluster:

In [9]:
!hdfs dfs -rm -r hdfs:///user/hadoop/mr-output

Deleted hdfs:///user/hadoop/mr-output


In [10]:
!python3 mr-jobs/1_region_counter.py -r hadoop \
hdfs:///user/hadoop/job-data/ \
--output-dir mr-output/

No configs found; falling back on auto-configuration
No configs specified for hadoop runner
Looking for hadoop binary in /usr/local/hadoop-2.8.4/bin...
Found hadoop binary: /usr/local/hadoop-2.8.4/bin/hadoop
Using Hadoop version 2.8.4
Looking for Hadoop streaming jar in /usr/local/hadoop-2.8.4...
Found Hadoop streaming jar: /usr/local/hadoop-2.8.4/share/hadoop/tools/lib/hadoop-streaming-2.8.4.jar
Creating temp directory /tmp/1_region_counter.hadoop.20180926.222632.974850
Copying local files to hdfs:///user/hadoop/tmp/mrjob/1_region_counter.hadoop.20180926.222632.974850/files/...
Running step 1 of 1...
  packageJobJar: [/tmp/hadoop-unjar2848612009597891020/] [] /tmp/streamjob6857587264638180417.jar tmpDir=null
  Connecting to ResourceManager at /0.0.0.0:8032
  Connecting to ResourceManager at /0.0.0.0:8032
  Total input files to process : 2
  number of splits:2
  Submitting tokens for job: job_1537993323748_0014
  Submitted application application_1537993323748_0014
  The url to track t

### Exercise 1

In this exercise we want to find the number of job postings in each industry.

- Create a MapReduce script and define your MRJob class to count the number of job postings in each industry. 
- When you finish, first test it locally and check the output.
- Run it on the Hadoop cluster and check the HDFS output directory.

## 2. Filtering

Key points:

- The filtering pattern selects a subset of the data and usually does not change the actual records.
  - We can set `OUTPUT_PROTOCOL = JSONValueProtocol` to omit the key field from each output record.
- Filtering usually does not need a reducer, because each record is evaluated individually and the decision does not depend on other records.
- Filtering often serves as a building block for other patterns.

Applications:

- Data cleaning
- Event tracking
- Record matching
- Random sampling
- Dataset splitting

### 2.1 Simple Filtering

Simple filtering is often used for tasks such as data cleaning, event tracking, and outlier removal.

### **Example**: Find all jobs with titles relevant to *Data Scientist*.


- *Data flow*:

  - Input:`record`
  - $\quad\downarrow$
  - mapper:`<_, record> [if keyword in title -> <None, record>]`
  - $\quad\downarrow$
  - Output:`record`
  
- *Features and highlights*:
  
  `OUTPUT_PROTOCOL = JSONValueProtocol` ignores the key field for each record in the output

In [11]:
%%file mr-jobs/2.1_simple_filtering.py
from mrjob.job import MRJob
from mrjob.protocol import JSONValueProtocol


class MRSimpleFiltering(MRJob):
    
    INPUT_PROTOCOL = JSONValueProtocol
    OUTPUT_PROTOCOL = JSONValueProtocol
    
    def mapper(self, _, value):
        # Case-insensitive match; a missing or null title is treated as empty.
        title = (value.get('title') or '').lower()
        if 'data scientist' in title:
            yield _, value


if __name__ == '__main__':
    MRSimpleFiltering.run()

Writing mr-jobs/2.1_simple_filtering.py


Test locally:

In [12]:
!python3 mr-jobs/2.1_simple_filtering.py ../data/job-data/* --output-dir mr-output

No configs found; falling back on auto-configuration
No configs specified for inline runner
Running step 1 of 1...
Creating temp directory /tmp/2.hadoop.20180926.222712.099042
job output is in mr-output
Removing temp directory /tmp/2.hadoop.20180926.222712.099042...


Run on your Hadoop cluster:

In [13]:
!hdfs dfs -rm -r hdfs:///user/hadoop/mr-output

Deleted hdfs:///user/hadoop/mr-output


In [14]:
!python3 mr-jobs/2.1_simple_filtering.py \
-r hadoop hdfs:///user/hadoop/job-data/ \
--output-dir mr-output/

No configs found; falling back on auto-configuration
No configs specified for hadoop runner
Looking for hadoop binary in /usr/local/hadoop-2.8.4/bin...
Found hadoop binary: /usr/local/hadoop-2.8.4/bin/hadoop
Using Hadoop version 2.8.4
Looking for Hadoop streaming jar in /usr/local/hadoop-2.8.4...
Found Hadoop streaming jar: /usr/local/hadoop-2.8.4/share/hadoop/tools/lib/hadoop-streaming-2.8.4.jar
Creating temp directory /tmp/2.hadoop.20180926.222715.775770
Copying local files to hdfs:///user/hadoop/tmp/mrjob/2.hadoop.20180926.222715.775770/files/...
Running step 1 of 1...
  packageJobJar: [/tmp/hadoop-unjar2506588348652920310/] [] /tmp/streamjob2871258137806635665.jar tmpDir=null
  Connecting to ResourceManager at /0.0.0.0:8032
  Connecting to ResourceManager at /0.0.0.0:8032
  Total input files to process : 2
  number of splits:2
  Submitting tokens for job: job_1537993323748_0015
  Submitted application application_1537993323748_0015
  The url to track the job: http://c8d937eb6693:80

### 2.2 Random Sampling

The random sampling pattern lets us create a (usually much smaller) subset of a large dataset for quick exploration. For the subset to be representative, each record should have an equal probability of being selected.

If reproducibility is not required, we can simply use a random number generator, e.g. `random.uniform(a, b)` in Python.
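
A minimal sketch of this kind of sampling outside MapReduce, assuming records are processed one at a time: each record is kept independently with probability `fraction`, so the sample size is only approximately `fraction * N` and changes from run to run unless the generator is seeded. The function name `bernoulli_sample` and the seed value are illustrative.

```python
import random


def bernoulli_sample(records, fraction, rng=random):
    """Keep each record independently with probability `fraction`."""
    if not 0 <= fraction <= 1:
        raise ValueError("Invalid fraction value")
    # rng.random() is in [0, 1), so fraction=0 keeps nothing and fraction=1 keeps all.
    return [r for r in records if rng.random() < fraction]


records = list(range(10_000))
# A seeded generator is used here only so the sketch behaves the same on every run.
sample = bernoulli_sample(records, 0.1, rng=random.Random(42))
print(len(sample))  # close to 1000, but not exactly 1000 in general
```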

### **Example**: Create a random subset with 10% of the full dataset.

We want to pass an argument `fraction` to our `MRJob` script. We can do this by using `MRJob.configure_args()` and `MRJob.add_passthru_arg()` together.

- *Data flow*:

  - Input:`record`
  - $\quad\downarrow$
  - mapper:`<_, record> [Prob=0.1 -> <None, record>]`
  - $\quad\downarrow$
  - Output:`record`
  
- *Features and highlights*:
  
  - `MRJob.configure_args()` lets the user define command-line arguments for the script.
  - `MRJob.add_passthru_arg('--fraction', **kwargs)` defines a command-line argument named `fraction`.
  - To pass a value to `fraction` on the command line: `--fraction <value>`
  - To use `fraction` inside the job: `self.options.fraction`
  - `MRJob.mapper_init()` validates the value of `fraction` before the `mapper` processes any input.
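
In recent `mrjob` releases, `add_passthru_arg()` is documented as taking the same arguments as `argparse.ArgumentParser.add_argument()`, so the parsing and validation of `--fraction` can be tried out with the standard library alone. The sketch below is a stand-in for illustration, not `mrjob` code; `parse_fraction` is a hypothetical helper, and the check for a missing value is an addition that the job script does not perform.

```python
import argparse


def parse_fraction(argv):
    """Parse --fraction and apply the same range check as mapper_init()."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--fraction", type=float)
    options = parser.parse_args(argv)
    # Added for this sketch: without a default, an omitted option is None.
    if options.fraction is None:
        raise ValueError("--fraction is required")
    if options.fraction > 1 or options.fraction < 0:
        raise ValueError("Invalid fraction value")
    return options.fraction


print(parse_fraction(["--fraction", ".1"]))
```

Because `type=float` converts the string before the job sees it, `self.options.fraction` can be compared with numbers directly.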

In [15]:
%%file mr-jobs/2.2_random_sampling.py
from mrjob.job import MRJob
from mrjob.protocol import JSONValueProtocol

import random


class MRRandomSampling(MRJob):
    
    INPUT_PROTOCOL = JSONValueProtocol
    OUTPUT_PROTOCOL = JSONValueProtocol
    
    def configure_args(self):
        super().configure_args()
        self.add_passthru_arg('--fraction', type=float)
        
    def mapper_init(self):
        if self.options.fraction > 1 or self.options.fraction < 0:
            raise ValueError('Invalid fraction value')
        
    def mapper(self, _, value):
        if random.uniform(0, 1) < self.options.fraction:
            yield _, value


if __name__ == '__main__':
    MRRandomSampling.run()

Writing mr-jobs/2.2_random_sampling.py


- Test locally:

In [16]:
!python3 mr-jobs/2.2_random_sampling.py ../data/job-data/* --output-dir mr-output/ --fraction .1

No configs found; falling back on auto-configuration
No configs specified for inline runner
Running step 1 of 1...
Creating temp directory /tmp/2.hadoop.20180926.222749.971923
job output is in mr-output/
Removing temp directory /tmp/2.hadoop.20180926.222749.971923...


- Run on your Hadoop cluster:

In [17]:
!hdfs dfs -rm -r hdfs:///user/hadoop/mr-output

Deleted hdfs:///user/hadoop/mr-output


In [18]:
!python3 mr-jobs/2.2_random_sampling.py \
-r hadoop hdfs:///user/hadoop/job-data/ \
--output-dir mr-output/ --fraction .1

No configs found; falling back on auto-configuration
No configs specified for hadoop runner
Looking for hadoop binary in /usr/local/hadoop-2.8.4/bin...
Found hadoop binary: /usr/local/hadoop-2.8.4/bin/hadoop
Using Hadoop version 2.8.4
Looking for Hadoop streaming jar in /usr/local/hadoop-2.8.4...
Found Hadoop streaming jar: /usr/local/hadoop-2.8.4/share/hadoop/tools/lib/hadoop-streaming-2.8.4.jar
Creating temp directory /tmp/2.hadoop.20180926.222753.492256
Copying local files to hdfs:///user/hadoop/tmp/mrjob/2.hadoop.20180926.222753.492256/files/...
Running step 1 of 1...
  packageJobJar: [/tmp/hadoop-unjar7073738273186572016/] [] /tmp/streamjob773106399113576008.jar tmpDir=null
  Connecting to ResourceManager at /0.0.0.0:8032
  Connecting to ResourceManager at /0.0.0.0:8032
  Total input files to process : 2
  number of splits:2
  Submitting tokens for job: job_1537993323748_0016
  Submitted application application_1537993323748_0016
  The url to track the job: http://c8d937eb6693:808

### 2.3 Data Splitting

For machine learning modeling, we usually divide the dataset into two non-overlapping subsets:
- training set: a subset used to train a model.
- test set: a subset used to test the trained model.

If the goal is to split the dataset into two such subsets, we need to make sure that:
- each record is assigned to exactly one of the two subsets
- the sampling is reproducible

The `sample` function below returns `True` or `False` based on the hash of the key and the fraction:
1. Split the fraction into a *numerator* and a *denominator*, e.g. 0.125 $\rightarrow$ 125/1000.
2. Calculate the hash value of the key. Here we use MD5, a widely used hash function that produces a 128-bit hash value.
3. Calculate the hash value modulo the *denominator*; if the result is less than the *numerator*, return `True`, otherwise return `False`.

Note: if you just want to randomly sample the dataset, a simple random number generator will do.

In [19]:
import decimal
import hashlib

def sample(key, fraction):
    if fraction > 1 or fraction < 0:
        raise ValueError('Invalid fraction value')
    # calculate numerator and denominator
    frac = decimal.Decimal(str(fraction)).as_tuple()
    numer = sum([v*10**i for i, v in enumerate(frac.digits[::-1])])
    denom = 10**(-frac.exponent)
    # calculate hash value using md5
    hash_val = hashlib.md5(str(key).encode()).hexdigest()
    return (int(hash_val, 16) % denom) < numer

In [20]:
# test the function with the code below
N = 1000
print(sum([sample(i, fraction=0.25) for i in range(N)]))

259
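
To see the three steps separately, the sketch below splits the computation into two hypothetical helpers, `to_ratio` and `bucket`, and checks that the decision for a given key never changes between calls. It mirrors the `sample` function above rather than replacing it; the numerator is built by joining the decimal digits, which is equivalent to the power-of-ten sum used there.

```python
import decimal
import hashlib


def to_ratio(fraction):
    """Step 1: write a decimal fraction as numerator / power-of-ten denominator."""
    _sign, digits, exponent = decimal.Decimal(str(fraction)).as_tuple()
    numer = int("".join(map(str, digits)))
    denom = 10 ** (-exponent)
    return numer, denom


def bucket(key, denom):
    """Step 2 and the modulo in step 3: MD5-hash the key, reduce modulo denom."""
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % denom


numer, denom = to_ratio(0.125)
print(numer, denom)  # 125 1000

key = 2280174543  # jobId of the sample record shown earlier
selected = bucket(key, denom) < numer
# The hash depends only on the key, so repeating the call gives the same answer.
assert selected == (bucket(key, denom) < numer)
```

Because the bucket is a function of the key alone, rerunning the job on the same data selects the same records, which is what makes the split reproducible.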


### **Example**: Creating a reproducible train/test split.
 
- *Features and highlights*:
    
  - `MRJob.add_passthru_arg('--split')` defines a command-line argument named `split`, which takes the value `train` or `test`.
    - If `split=train`, the job outputs the training subset; otherwise it outputs the test subset.
  - `MRJob.add_passthru_arg('--test_size')` defines a command-line argument named `test_size`.
    - The value should be between 0.0 and 1.0 and represents the proportion of the dataset to include in the test split.
  - To create a train/test split, we run the script twice, once with `split=train` and once with `split=test`.

In [21]:
%%file mr-jobs/2.3_train_test_splitting.py
from mrjob.job import MRJob
from mrjob.protocol import JSONValueProtocol

import decimal
import hashlib


class MRTrainTestSplit(MRJob):
    
    INPUT_PROTOCOL = JSONValueProtocol
    OUTPUT_PROTOCOL = JSONValueProtocol
    
    def configure_args(self):
        super().configure_args()
        self.add_passthru_arg('--split')
        self.add_passthru_arg('--test_size', type=float, default=0.3)
        
    def mapper_init(self):
        if self.options.split not in ('train', 'test'):
            raise ValueError('Invalid split value')
        if self.options.test_size > 1 or self.options.test_size < 0:
            raise ValueError('Invalid test size')
        
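    def mapper(self, _, value):
        key = value.get('jobId', 0)
        # include is True when the record falls in the test subset
        include = self._sample(key=key, fraction=self.options.test_size)
        # XOR flips the selection when the train subset is requested
        if include ^ (self.options.split == 'train'):
            yield _, value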
    
    def _sample(self, key, fraction=1):
        frac = decimal.Decimal(str(fraction)).as_tuple()
        numer = sum([v*10**i for i, v in enumerate(frac.digits[::-1])])
        denom = 10**(-frac.exponent)
        hash_val = hashlib.md5(str(key).encode()).hexdigest()
        return (int(hash_val, 16) % denom) < numer
    
        
if __name__ == '__main__':
    MRTrainTestSplit.run()

Writing mr-jobs/2.3_train_test_splitting.py
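
The mapper's selection rule, `include ^ (self.options.split == 'train')`, is easy to check in isolation: `include` marks records that fall in the test subset, and the XOR flips that flag when the training subset is requested. In the sketch below, `in_test` stands in for the result of `_sample` and `emit` is a hypothetical helper.

```python
def emit(in_test, split):
    """Return True when a record should be written for the requested split."""
    return in_test ^ (split == "train")


# Print the full truth table of the rule.
for in_test in (True, False):
    for split in ("train", "test"):
        print(in_test, split, emit(in_test, split))

# Every record is written by exactly one of the two runs.
assert all(emit(flag, "train") != emit(flag, "test") for flag in (True, False))
```

This is why running the script once per split yields two subsets that do not overlap and together cover the whole dataset.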


- Test locally:

In [22]:
!python3 mr-jobs/2.3_train_test_splitting.py ../data/job-data/* \
--output-dir mr-output/train \
--test_size 0.3 \
--split train \
&& python3 mr-jobs/2.3_train_test_splitting.py ../data/job-data/* \
--output-dir mr-output/test \
--test_size 0.3 \
--split test

No configs found; falling back on auto-configuration
No configs specified for inline runner
Running step 1 of 1...
Creating temp directory /tmp/2.hadoop.20180926.222826.701787
job output is in mr-output/train
Removing temp directory /tmp/2.hadoop.20180926.222826.701787...
No configs found; falling back on auto-configuration
No configs specified for inline runner
Running step 1 of 1...
Creating temp directory /tmp/2.hadoop.20180926.222827.538116
job output is in mr-output/test
Removing temp directory /tmp/2.hadoop.20180926.222827.538116...


- Run on your Hadoop cluster:

In [23]:
!hdfs dfs -rm -r hdfs:///user/hadoop/mr-output

Deleted hdfs:///user/hadoop/mr-output


In [24]:
!python3 mr-jobs/2.3_train_test_splitting.py \
-r hadoop hdfs:///user/hadoop/job-data/ \
    --output-dir mr-output/train \
    --test_size 0.3 \
    --split train \
&& python3 mr-jobs/2.3_train_test_splitting.py \
-r hadoop hdfs:///user/hadoop/job-data/ \
    --output-dir mr-output/test \
    --test_size 0.3 \
    --split test

No configs found; falling back on auto-configuration
No configs specified for hadoop runner
Looking for hadoop binary in /usr/local/hadoop-2.8.4/bin...
Found hadoop binary: /usr/local/hadoop-2.8.4/bin/hadoop
Using Hadoop version 2.8.4
Looking for Hadoop streaming jar in /usr/local/hadoop-2.8.4...
Found Hadoop streaming jar: /usr/local/hadoop-2.8.4/share/hadoop/tools/lib/hadoop-streaming-2.8.4.jar
Creating temp directory /tmp/2.hadoop.20180926.222831.229682
Copying local files to hdfs:///user/hadoop/tmp/mrjob/2.hadoop.20180926.222831.229682/files/...
Running step 1 of 1...
  packageJobJar: [/tmp/hadoop-unjar6197470053724491520/] [] /tmp/streamjob7920787548911979230.jar tmpDir=null
  Connecting to ResourceManager at /0.0.0.0:8032
  Connecting to ResourceManager at /0.0.0.0:8032
  Total input files to process : 2
  number of splits:2
  Submitting tokens for job: job_1537993323748_0017
  Submitted application application_1537993323748_0017
  The url to track the job: http://c8d937eb6693:80

### **Exercise 2**

You may have already noticed from the output of Exercise 1 that the majority of jobs belong to the *Information Technology* industry.

- Now create another script to produce a 30% sample from all jobs whose industry is `"Information Technology"`.
- When you finish, test it locally and then on the Hadoop cluster.