# MapReduce Kata

A code kata is a programming exercise that helps a programmer hone their skills through practice and repetition. The term was probably coined by Dave Thomas, co-author of the book "The Pragmatic Programmer", in a bow to the Japanese concept of kata in the martial arts.

This repository can be found on my GitHub account: https://github.com/matchilling/kata-mapreduce

## Basic setup & usage

The project uses Docker and GNU Make to run MapReduce jobs on AWS EMR. To get started, run the following steps in your terminal.

```bash
# 1.1 Create an .env file in the project root (see .env.dist)
# 1.2 Create an mrjob.conf file in the project root (see mrjob.conf.dist)
# 1.3 Build the project image from the Dockerfile
$ make build

# 2.1 Run bash interactively in the container
$ make shell
```

Check the self-documented [makefile](https://github.com/matchilling/kata-mapreduce/blob/master/makefile) for further details.

## Bigram

[MRBigram.py](https://github.com/matchilling/kata-mapreduce/blob/master/bin/MRBigram.py) calculates the conditional probability that a word occurs immediately after the word "my", using the entire collection of over 200,000 short jokes from [Kaggle](https://www.kaggle.com/abhinavmoudgil95/short-jokes).
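For reference, the quantity in question is the bigram conditional probability P(w | my) = count(my, w) / count(my, any word): the number of times `w` directly follows "my", divided by the number of times "my" is followed by any word. A minimal plain-Python sketch of that calculation, without MapReduce, might look as follows. The function name `next_word_probabilities` and the sample sentences are made up for illustration; the tokenizer mirrors the regular expression used in the job.

```python
import re
from collections import Counter

WORD_RE = re.compile(r"[\w']+")


def next_word_probabilities(texts, target):
    """Return {word: P(word | target)} over all bigrams starting with target."""
    followers = Counter()
    for text in texts:
        words = [w.lower() for w in WORD_RE.findall(text)]
        words = [w for w in words if not w.isdigit()]
        # Pair each word with its successor and keep pairs that start with target.
        for previous_word, word in zip(words, words[1:]):
            if previous_word == target:
                followers[word] += 1
    total = sum(followers.values())
    return {word: count / total for word, count in followers.items()}


# Hypothetical sample input, not taken from the dataset.
sample = [
    "My wife says I never listen",
    "I told my wife a joke",
    "My dog ate my homework",
]
probabilities = next_word_probabilities(sample, "my")
```

In this toy input "my" is followed by a word four times, twice by "wife", so `probabilities["wife"]` is 0.5. The job itself reports the same ratio multiplied by 100.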

Which 10 words are most likely to be said immediately after the word "my", i.e. with the highest conditional probability?

```bash
# Run MRBigram with a small dataset
$ make shell
$ ./bin/MRBigram.py data/shortjokes_small.csv

# Run MRBigram on AWS EMR with the complete dataset
$ make shell
$ ./bin/MRBigram.py data/shortjokes_complete.csv -r emr -c mrjob.conf

# After downloading the output from S3, show the 10 highest conditional probabilities (in percent)
$ sort -rn output/MRBigram/result | head -n 10
# 5.5085675287006115 "wife"
# 4.0071172642406220 "girlfriend"
# 2.9517147244497526 "friend"
# 1.7476296097691764 "women"
# 1.6208838082238526 "dad"
# 1.2723328539742120 "coffee"
# 1.2162722109830110 "life"
# 1.2089599532015503 "favorite"
# 1.1212128598240183 "mom"
# 1.1041509250006094 "son"
```

The complete result can be found here: [output/MRBigram/result](https://github.com/matchilling/kata-mapreduce/blob/master/output/MRBigram/result)
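When ranking the result, the scores have to be compared as numbers, not as text. A small sketch of the same top-10 selection in Python, assuming each output line has the form `<score><TAB><JSON-encoded word>` (the layout written by mrjob's default JSON output protocol); the helper name `top_words` is hypothetical.

```python
import heapq
import json


def top_words(lines, n=10):
    """Return the n (score, word) pairs with the highest score."""
    pairs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Assumed layout: JSON-encoded score, a tab, JSON-encoded word.
        raw_score, raw_word = line.split("\t", 1)
        pairs.append((float(json.loads(raw_score)), json.loads(raw_word)))
    return heapq.nlargest(n, pairs)


# Hypothetical lines; 10.5 would rank below 5.5 in a plain text comparison.
example = ['5.5\t"wife"', '10.5\t"example"', '1.25\t"dad"']
top_two = top_words(example, n=2)
```

To apply it to the real output, pass an open file handle for `output/MRBigram/result` instead of the example list.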

![Screenshot AWS EMR web console](https://raw.githubusercontent.com/matchilling/kata-mapreduce/master/output/MRBigram/screenshot_aws_emr_web_gui.png "Screenshot AWS EMR web console")

The source of the job:

```python
#!/usr/bin/env python

import re

from mrjob.job import MRJob
from mrjob.step import MRStep

# Matches words, keeping apostrophes (e.g. "don't").
WORD_RE = re.compile(r"[\w']+")


class MRBigram(MRJob):

    # Sort the values of each key so that the 'total_01' record reaches the
    # final reducer before any 'total_02' record.
    SORT_VALUES = True

    def combine_counts(self, key, counts):
        yield key, sum(counts)

    def mapper(self, _, line):
        def normalize(text):
            words = [word.lower() for word in WORD_RE.findall(text)]
            return [word for word in words if not word.isdigit()]

        # Each line is "<id>,<joke>". Note that splitting on every comma keeps
        # only the part of the joke before its first comma.
        data = line.split(',')
        text = data[1]
        corpus = normalize(
            text[1:-1] if (text.startswith('"') and text.endswith('"')) else text
        )

        previous_word = None
        for word in corpus:
            if previous_word is not None:
                # '##SELF##' counts how often previous_word starts a bigram.
                yield (previous_word, '##SELF##'), 1
                yield (previous_word, word), 1

            previous_word = word

    def summarise_counts(self, key, counts):
        # Re-key by previous_word and tag totals and pair counts.
        previous_word, word = key
        count = sum(counts)

        if '##SELF##' == word:
            yield previous_word, ('total_01', count)
        else:
            yield previous_word, ('total_02', (word, count))

    def reducer(self, previous_word, values):
        occurrence = 0.0

        for tag, data in values:
            if 'total_01' == tag:
                occurrence = float(data)
            elif 'total_02' == tag and 'my' == previous_word:
                # Conditional probability in percent, followed by the word.
                yield (100.0 * float(data[1]) / occurrence, data[0])

    def steps(self):
        return [
            MRStep(mapper=self.mapper, combiner=self.combine_counts, reducer=self.summarise_counts),
            MRStep(reducer=self.reducer)
        ]


if __name__ == '__main__':
    MRBigram.run()
```
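To make the data flow of the two steps easier to follow, here is a rough plain-Python simulation that does not use mrjob. It reuses the `##SELF##` marker and the `total_01`/`total_02` tags from the job. The function names `step_one` and `step_two` and the sample bigrams are invented for illustration, and sorting the tagged values stands in for what `SORT_VALUES = True` provides.

```python
from collections import defaultdict

SELF = "##SELF##"


def step_one(bigrams):
    """Count each (previous_word, word) pair plus a per-prefix total."""
    counts = defaultdict(int)
    for previous_word, word in bigrams:
        counts[(previous_word, SELF)] += 1  # total for previous_word
        counts[(previous_word, word)] += 1  # count for this exact pair
    # Re-key by previous_word and tag each value, like summarise_counts.
    for (previous_word, word), count in counts.items():
        if word == SELF:
            yield previous_word, ("total_01", count)
        else:
            yield previous_word, ("total_02", (word, count))


def step_two(tagged, target):
    """Turn tagged counts into (percentage, word) pairs for target."""
    grouped = defaultdict(list)
    for previous_word, value in tagged:
        grouped[previous_word].append(value)
    results = []
    # Sorting by tag puts the "total_01" entry first, so the total is known
    # before any pair count is divided by it.
    values = sorted(grouped[target], key=lambda value: value[0])
    total = 0.0
    for tag, data in values:
        if tag == "total_01":
            total = float(data)
        else:
            word, count = data
            results.append((100.0 * count / total, word))
    return results


# Hypothetical bigrams, not taken from the dataset.
bigrams = [("my", "wife"), ("my", "dog"), ("my", "wife"), ("a", "joke")]
result = sorted(step_two(step_one(bigrams), "my"), reverse=True)
```

Here "my" starts three bigrams, two of them ending in "wife", so "wife" ranks first with roughly 66.7 percent and "dog" second with roughly 33.3 percent.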
