## Python implementation of MapReduce

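The map and reduce functions are assumed to have the following signatures:

`map([x1, x2, ...]) -> [(k1, v1), (k2, v2), ...]`

`reduce([v1, v2, ...]) -> y`

where the inputs, values, and results can be any data structure. Keys must be mutually comparable, because the shuffle phase sorts the mapped pairs by key.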

In [None]:
from itertools import groupby
from operator import itemgetter

def mapreduce(data, map_func, reduce_func):

    # 1. Map phase
    mapped_data = map_func(data)
    
    # 2. Shuffle phase
    mapped_data.sort(key=itemgetter(0))  # Sort mapped pairs by key; groupby only merges adjacent items
    grouped_data = groupby(mapped_data, key=itemgetter(0))  # Group pairs by key

    # 3. Reduce phase
    results = [
        (key, reduce_func([item[1] for item in group])) 
        for key, group in grouped_data
    ]

    return results
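
The shuffle phase depends on `itertools.groupby`, which only merges consecutive items with equal keys; that is why the pairs are sorted first. A minimal sketch of the difference, using made-up pairs (the `pairs` list and the `group_values` helper are illustrative and not part of the notebook):

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical mapper output: (key, value) pairs in arbitrary order.
pairs = [("b", 1), ("a", 1), ("b", 1), ("a", 1)]

def group_values(pairs):
    """Collect values per key, merging only runs of consecutive equal keys."""
    return [(key, [value for _, value in group])
            for key, group in groupby(pairs, key=itemgetter(0))]

unsorted_groups = group_values(pairs)
sorted_groups = group_values(sorted(pairs, key=itemgetter(0)))

print(unsorted_groups)  # keys repeat because equal keys were not adjacent
print(sorted_groups)    # one group per key
```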

## 1. Count characters in words

In [None]:
data = ['dasdsagf', 'mike', 'george', 'gertretr123', 'dsadsajortriojtiow']

def map_func(data):
    result = []
    for string in data:
        for char in string:
            result.append((char, 1))
    return result

def reduce_func(data):
    return sum(data)

In [None]:
mapreduce(data, map_func, reduce_func)
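
One way to sanity-check the character counts is to compute them independently with `collections.Counter` and compare. This is only a cross-check sketch; the name `expected` is chosen here for illustration.

```python
from collections import Counter

data = ['dasdsagf', 'mike', 'george', 'gertretr123', 'dsadsajortriojtiow']

# Count every character across all words in one pass,
# then sort by character to match the key order of the shuffle phase.
expected = sorted(Counter("".join(data)).items())

print(expected)
```

The resulting list of `(character, count)` pairs should match what `mapreduce` returns for the same data.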

## 2. Word Co-occurrence

In this scenario, we want to count how many times each pair of words co-occurs in the same sentence.

In [None]:
data = ["the quick brown fox jumps over the lazy dog", 
        "the quick blue cat sleeps on the lazy chair", 
        "the quick green bird flies under the lazy cloud"]

def map_func(data):
    result = []
    for sentence in data:
        words = sentence.split()
        for i, word_i in enumerate(words):
            for word_j in words[i+1:]:  # Pair word_i with every later word in the sentence
                result.append(
                    ((word_i, word_j), 1)
                )
    return result

def reduce_func(data):
    return sum(data)

In [None]:
mapreduce(data, map_func, reduce_func)
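
The mapper emits pairs in sentence order, so `("the", "quick")` and `("quick", "the")` end up as separate keys. If word order should not matter, one option is to sort each pair before counting it. The sketch below uses `itertools.combinations` and `collections.Counter` in place of the map, shuffle, and reduce steps; the helper name `unordered_cooccurrence` is made up for this example.

```python
from collections import Counter
from itertools import combinations

def unordered_cooccurrence(sentences):
    """Count co-occurring word pairs per sentence, ignoring word order."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        # combinations() yields each pair of positions exactly once (i < j).
        for word_i, word_j in combinations(words, 2):
            counts[tuple(sorted((word_i, word_j)))] += 1
    return counts

counts = unordered_cooccurrence(["the quick brown fox jumps over the lazy dog"])
print(counts[("quick", "the")])
```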

## 3. Reverse Web-link Graph

This case assumes we have the graph of web pages where each page links to a list of pages. The goal is to create the reverse graph, where for each page we list all pages linking to it.

In [None]:
data = [("page_A", ["page_B", "page_C", "page_D"]),
        ("page_B", ["page_A", "page_E"]),
        ("page_C", ["page_F", "page_A"])]

def map_func(data):
    result = []
    for page, links in data:
        for link in links:
            result.append((link, page))
    return result

def reduce_func(data):
    return list(data)

In [None]:
mapreduce(data, map_func, reduce_func)
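
For comparison, the same reversal can be written without the map/reduce split, which gives a dictionary that is convenient for lookups. This is a sketch only; `reverse_graph` and `incoming` are names introduced here.

```python
from collections import defaultdict

data = [("page_A", ["page_B", "page_C", "page_D"]),
        ("page_B", ["page_A", "page_E"]),
        ("page_C", ["page_F", "page_A"])]

def reverse_graph(graph):
    """Return {target: [sources]} for a list of (source, [targets]) entries."""
    reversed_links = defaultdict(list)
    for page, links in graph:
        for link in links:
            reversed_links[link].append(page)
    return dict(reversed_links)

incoming = reverse_graph(data)
print(incoming["page_A"])  # pages that link to page_A
```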

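## 4. Document Similarity

In this scenario, the goal is to measure the similarity between pairs of documents with the Jaccard similarity, which is the size of the intersection of two sets divided by the size of their union. The map and reduce functions below do not compute that ratio themselves: they count how many times each ordered pair of distinct words appears together in a document.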

In [None]:
data = [("doc1", ["the", "quick", "brown", "fox"]),
        ("doc2", ["the", "lazy", "dog"]),
        ("doc3", ["the", "quick", "blue", "cat"])]

def map_func(data):
    result = []
    for document, words in data:
        for word1 in words:
            for word2 in words:
                if word1 != word2:
                    result.append(((word1, word2), 1))
    return result

def reduce_func(data):
    return sum(data)

In [None]:
mapreduce(data, map_func, reduce_func)
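
For reference, the Jaccard similarity of two word sets A and B is |A ∩ B| / |A ∪ B|, and it can be computed directly for each pair of documents. The following is a sketch outside the map/reduce framing; `jaccard` and `similarities` are names chosen here, and returning 0.0 for two empty documents is an assumed convention.

```python
from itertools import combinations

data = [("doc1", ["the", "quick", "brown", "fox"]),
        ("doc2", ["the", "lazy", "dog"]),
        ("doc3", ["the", "quick", "blue", "cat"])]

def jaccard(words_a, words_b):
    """Jaccard similarity of two word lists, treated as sets."""
    set_a, set_b = set(words_a), set(words_b)
    union = set_a | set_b
    # Assumed convention: two empty documents have similarity 0.
    return len(set_a & set_b) / len(union) if union else 0.0

# One entry per unordered pair of documents.
similarities = {
    (doc_a, doc_b): jaccard(words_a, words_b)
    for (doc_a, words_a), (doc_b, words_b) in combinations(data, 2)
}
print(similarities)
```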

## 5. TF-IDF (Term Frequency-Inverse Document Frequency)

In [None]:
data = [
    ("doc1", "the quick brown fox"),
    ("doc2", "the lazy dog"),
    ("doc3", "the quick blue cat")
]

def map_func(data):
    result = []
    for doc_id, text in data:
        words = text.split()
        term_freq = {}
        for word in words:
            term_freq[word] = term_freq.get(word, 0) + 1
        result.extend([(word, (doc_id, freq)) for word, freq in term_freq.items()])
    return result

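def reduce_func(data):
    # data holds one (doc_id, freq) pair per document containing the word
    doc_counts = len(data)
    result = []
    for doc_id, freq in data:
        # Simplified weighting: term frequency times 1 / document frequency
        result.append((doc_id, freq * (1 / doc_counts)))
    return result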

In [None]:
mapreduce(data, map_func, reduce_func)
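
The reducer weights each term frequency by the reciprocal of the number of documents containing the word. A common variant uses a logarithmic inverse document frequency, log(N / df), where N is the total number of documents. Since a reducer only sees the values for one key, N has to be supplied from outside; the sketch below does that with a factory function. `make_reduce_func` and `total_docs` are hypothetical names, not part of the notebook.

```python
import math

def make_reduce_func(total_docs):
    """Build a reducer that scores (doc_id, freq) pairs with tf * log(N / df)."""
    def reduce_func(values):
        doc_freq = len(values)  # number of documents containing the word
        idf = math.log(total_docs / doc_freq)
        return [(doc_id, freq * idf) for doc_id, freq in values]
    return reduce_func

reduce_func = make_reduce_func(total_docs=3)

in_two_docs = reduce_func([("doc1", 1), ("doc3", 1)])
in_all_docs = reduce_func([("doc1", 1), ("doc2", 1), ("doc3", 1)])

print(in_two_docs)
print(in_all_docs)  # a word present in every document gets weight 0
```

With this weighting, a word such as "the" that appears in every document contributes nothing, which the reciprocal form does not achieve.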

## 6. Distributed Sort

In [None]:
data = [ 
    "record_10", "record_2", "record_3", "record_13",
    "record_1", "record_32", "record_14", "record_18", 
    "record_8", "record_4", "record_12", "record_19",
    "record_13", "record_33", "record_5", "record_9",
]

def map_func(data):
    result = []
    
    for record in data:
        _, key = record.split("_")
        result.append((int(key), record))
    
    return result

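def reduce_func(data):
    # Keep the first record per key; records that share a key
    # (such as the two "record_13" entries) collapse into one
    return data[0]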

In [None]:
mapreduce(data, map_func, reduce_func)
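
The mapper converts the numeric suffix to an `int` because plain string comparison would order the records lexicographically. A small sketch of the difference, using a made-up subset of records:

```python
records = ["record_10", "record_2", "record_1"]

# Plain string comparison is lexicographic: "record_10" sorts before "record_2".
lexicographic = sorted(records)

# Sorting on the integer suffix gives the numeric order.
numeric = sorted(records, key=lambda record: int(record.split("_")[1]))

print(lexicographic)
print(numeric)
```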