In [1]:
import sys
sys.path.insert(0, '..')

# MLMR sklearn transformer integration

We have also added the possibility to use MapReduce functionality as an sklearn transformer. This is just a thin wrapper around the core functionality in the `mlmr.function` module. If you haven't read the `Functional tutorial` yet, I'd recommend checking it out before reading this one.

As a starting point, let's assume we are solving a text classification problem and need to integrate a parallelized text lemmatization function into our sklearn pipeline. Let's take the text lemmatization code from the `Functional tutorial`.

In [2]:
import string
import spacy 
import en_core_web_sm

import numpy as np
import pandas as pd

from nltk.tokenize import word_tokenize


texts = pd.Series([
    """
    Lorem ipsum dolor sit amet, consectetur adipiscing elit. 
    Ut maximus consequat turpis et condimentum. 
    Duis ullamcorper dictum posuere.
    Curabitur auctor quis sapien congue aliquet. 
    Aliquam dignissim suscipit rhoncus. 
    Fusce vitae cursus dui, eu aliquam dui. 
    Nulla et ultrices lacus, at iaculis arcu. 
    Sed fermentum metus libero, sed egestas libero ultrices sed. 
    Duis erat leo, ultricies quis dapibus non, lacinia ut tellus.
    """
]*10000)


nlp = en_core_web_sm.load()

def preprocess_texts_df(df): # our transform(map) function
    return df.apply(preprocess_text)

def remove_punct(doc):
    return [t for t in doc if t.text not in string.punctuation]

def remove_stop_words(doc):
    return [t for t in doc if not t.is_stop]

def lemmatize(doc):
    return ' '.join([t.lemma_ for t in doc])

def preprocess_text(text):
    doc = nlp(text)
    removed_punct = remove_punct(doc)
    removed_stop_words = remove_stop_words(removed_punct)
    return lemmatize(removed_stop_words)

We have prepared two kinds of transformers you can use:

`mlmr.transformers.BaseMapReduceTransformer`, which is an sklearn wrapper base class for the `mlmr.function.transform_concat` function. To refresh your memory, here is the `mlmr.function.transform_concat` description:

>Function for performing parallel data transformations on data (pd.DataFrame, pd.Series). From n_jobs argument, number of processes to run in parallel is calculated. Data is evenly divided into number of processes slices. Then transform_func is applied on each slice in parallel. After calculation is complete all transformation results are flattened. Flattened result is returned. Data preserves initial ordering.
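
The behaviour described above can be illustrated with a minimal pure-Python sketch. It is not mlmr's implementation: `split_evenly`, `transform_concat_sketch` and `upper_all` are hypothetical names, plain lists stand in for `pd.Series`, and threads are used instead of processes only to keep the example self-contained.

```python
from concurrent.futures import ThreadPoolExecutor


def split_evenly(items, n_slices):
    """Split a list into n_slices contiguous parts whose sizes differ by at most one."""
    base, extra = divmod(len(items), n_slices)
    slices, start = [], 0
    for i in range(n_slices):
        end = start + base + (1 if i < extra else 0)
        slices.append(items[start:end])
        start = end
    return slices


def transform_concat_sketch(data, transform_func, n_jobs=2):
    """Hypothetical stand-in for the described behaviour, not mlmr's code."""
    slices = split_evenly(data, n_jobs)
    # Assumption: threads instead of processes, to keep the sketch simple.
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        # pool.map returns results in input order, so ordering is preserved.
        parts = list(pool.map(transform_func, slices))
    # Flatten the per-slice results back into a single list.
    return [item for part in parts for item in part]


def upper_all(chunk):
    """Toy transform function applied to one slice."""
    return [text.upper() for text in chunk]


result = transform_concat_sketch(["a", "b", "c", "d", "e"], upper_all, n_jobs=2)
```

Here the five items are split into slices of three and two, each slice is transformed independently, and the flattened result keeps the original order.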

You have to create a child class that implements the `transform_part` method. This method plays the role of `transform_func` in the functional interface and of the **Map** step in MapReduce. So your implementation could be as follows:

In [3]:
from mlmr.transformers import BaseMapReduceTransformer
    
class TextPreprocessor(BaseMapReduceTransformer):
    
    def transform_part(self, X):
        return preprocess_texts_df(X)
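
To see why overriding a single method is enough, here is a rough sketch of how such a base class could be structured. This is an assumption for illustration only: `MapReduceTransformerSketch` and `Lowercaser` are hypothetical, the real `BaseMapReduceTransformer` may be organised differently, and the sketch works on plain lists with threads rather than on pandas objects with processes.

```python
import math
from concurrent.futures import ThreadPoolExecutor


class MapReduceTransformerSketch:
    """Hypothetical base class; mlmr's BaseMapReduceTransformer may differ."""

    def __init__(self, n_jobs=1):
        self.n_jobs = n_jobs

    def fit(self, X, y=None):
        # Stateless transformer: there is nothing to learn from the data.
        return self

    def transform_part(self, X):
        # Child classes provide the Map step here.
        raise NotImplementedError

    def transform(self, X):
        # Cut X into at most n_jobs contiguous parts.
        size = max(1, math.ceil(len(X) / self.n_jobs))
        parts = [X[i:i + size] for i in range(0, len(X), size)]
        with ThreadPoolExecutor(max_workers=self.n_jobs) as pool:
            results = list(pool.map(self.transform_part, parts))
        # Concatenate the transformed parts in their original order.
        return [item for part in results for item in part]


class Lowercaser(MapReduceTransformerSketch):
    """Toy child class that only implements the Map step."""

    def transform_part(self, X):
        return [text.lower() for text in X]


words = ["Lorem", "IPSUM", "Dolor"]
lowered = Lowercaser(n_jobs=2).fit(words).transform(words)
```

The splitting, parallel execution and concatenation live in the base class, so a child class only describes what happens to one part of the data.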

Now you can easily integrate `TextPreprocessor` into your pipeline.

In [4]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier

n_jobs = 4

text_classification_pipeline = Pipeline([
     ('text_preprocessor', TextPreprocessor(n_jobs=n_jobs)),
     ('vectorizer', TfidfVectorizer(analyzer = "word", max_features=10000)),
     ('classifier', RandomForestClassifier(n_estimators=100, n_jobs=n_jobs))
])

We have also prepared a transformer similar to `sklearn.preprocessing.FunctionTransformer`:
`mlmr.transformers.FunctionMapReduceTransformer`, which is an sklearn wrapper class for the `mlmr.function.map_reduce` function. To refresh your memory, here is the `mlmr.function.map_reduce` description:

>Base function for performing parallel MapReduce on data. Firstly data are splitted into data splits using `data_split_func` function. From `n_jobs` argument, number of processes to run in parallel is calculated. Then `map_func` is applied on each data split in parallel. After calculation is complete `reduce_func` is sequentially applied on list of `map_func` results. `reduce_func` result is returned. Data preserves initial ordering.
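
The three roles in this description (split, map, reduce) can be sketched in pure Python with a word-count example. As before, this is a hypothetical stand-in and not mlmr's code: `map_reduce_sketch` and the helper functions are made up for illustration, and threads replace processes to keep the block self-contained. The only thing taken from the description is that `reduce_func` receives the list of `map_func` results.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def map_reduce_sketch(data, data_split_func, map_func, reduce_func, n_jobs=2):
    """Hypothetical stand-in for the described behaviour, not mlmr's code."""
    splits = data_split_func(data)
    # Assumption: threads instead of processes, to keep the sketch simple.
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        mapped = list(pool.map(map_func, splits))
    # reduce_func is applied once, to the whole list of map results.
    return reduce_func(mapped)


def split_in_two(lines):
    """Split step: cut the list of lines in half."""
    middle = len(lines) // 2
    return [lines[:middle], lines[middle:]]


def count_words(lines):
    """Map step: count the words within one split."""
    return Counter(word for line in lines for word in line.split())


def merge_counts(counters):
    """Reduce step: merge the per-split counters into one."""
    total = Counter()
    for counter in counters:
        total.update(counter)
    return total


lines = ["lorem ipsum", "ipsum dolor", "dolor sit", "sit amet"]
word_counts = map_reduce_sketch(lines, split_in_two, count_words, merge_counts)
```

Unlike the concatenating variant, the reduce step here changes the shape of the result: a list of lines goes in and a single counter comes out.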

So basically it is a base function with which you can formulate any MapReduce problem. Usage of this transformer could look as follows (this pipeline is equivalent to the previous example):


In [5]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from mlmr.transformers import FunctionMapReduceTransformer

def get_split_data_func(n_slices): # wrapper function of split data function
    def split_data(data):
        return np.array_split(data, n_slices)
    return split_data

n_jobs = 4

text_classification_pipeline = Pipeline([
     ('text_preprocessor', FunctionMapReduceTransformer(
         map_func=preprocess_texts_df,
         reduce_func=pd.concat,
         data_split_func=get_split_data_func(n_jobs),
         n_jobs=n_jobs
     )),
     ('vectorizer', TfidfVectorizer(analyzer = "word", max_features=10000)),
     ('classifier', RandomForestClassifier(n_estimators=100, n_jobs=n_jobs))
])

I hope this helps you parallelize your ML projects! If you find an error or have an idea for an improvement, post it in GitHub issues and we'll work on it.