# Building an NLP Pipeline Class

For the pair problem today, you're going to build a class that vectorizes an arbitrary list of documents. The goal is to build something that takes in a bunch of raw text, cleans it, and returns the cleaned text as a matrix. I'll get you started with a template.

In [1]:
class NLPPipe:
    
    def __init__(self, vectorizer, cleaning_function, tokenizer, stemmer):
        '''
        Create a pipeline that vectorizes an arbitrary list of documents.
        '''
        self.vectorizer = vectorizer
        self.cleaning_function = cleaning_function
        self.tokenizer = tokenizer
        self.stemmer = stemmer
    
    def fit(self, text):
        pass
    
    def transform(self, text):
        pass

## Passing Functions (Example)
As a quick note, if you want to pass a function into a class you can do so like this:

In [5]:
def print_the_word_bob_three_times():
    for i in range(3):
        print('bob')
        
# Notice the "CamelCase" (CapWords) style used in class names
class ThisIsAnExample:
    
    def __init__(self, function_input):
        # Here, we save an arbitrary function, `function_input`, as the attribute `function_to_run`
        self.function_to_run = function_input
        
    def do_the_thing(self):
        self.function_to_run()  # Notice the parentheses, to actually call the function

In [6]:
example = ThisIsAnExample(print_the_word_bob_three_times)

Note, above, that when we put the function in, we **do not invoke it with the parentheses**!

In [7]:
example.do_the_thing()

bob
bob
bob
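
The same pattern works for functions that take arguments and return values, which is what the pipeline needs for its `cleaning_function`. Here is a minimal sketch, where `shout` and `Applier` are made-up names used only for illustration:

```python
def shout(text):
    # An ordinary function that takes an argument and returns a value
    return text.upper() + '!'


class Applier:

    def __init__(self, function_input):
        # Store the function itself (no parentheses), just like above
        self.function_to_run = function_input

    def apply_to_all(self, items):
        # Call the stored function once per item and collect the results
        return [self.function_to_run(item) for item in items]


applier = Applier(shout)
result = applier.apply_to_all(['bob', 'builder'])
print(result)  # ['BOB!', 'BUILDER!']
```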


## Order of Operations

Both the `.fit` and `.transform` methods should take in *raw* `text` (a *list* of text documents), clean it, and then vectorize it. So, in your `cleaning_function`, you should:

1. Loop through each document in `text` ... and,
2. Pick out the individual words using your `tokenizer`
3. Capture only the "meaningful" portion of each of these words using your `stemmer`
4. Join the clean words (stemmed tokens) together, back into each document
5. ... Output all the text as another list of (clean) documents, to give to the `vectorizer`

`.fit` and `.transform` each run the `cleaning_function` on the given `text` first, and then fit or transform (respectively) the class's `vectorizer` with the cleaned result.
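
To make the five steps concrete, here is a minimal sketch that traces them with standard-library stand-ins. `toy_tokenizer` and `ToyStemmer` are hypothetical placeholders (not NLTK classes), used only so the example runs without extra installs; a real pipeline would receive a proper tokenizer and stemmer.

```python
def toy_tokenizer(document):
    # Stand-in tokenizer (assumption): split on whitespace only
    return document.split()


class ToyStemmer:
    # Stand-in stemmer (assumption): lowercase and strip one trailing "s"
    def stem(self, word):
        word = word.lower()
        return word[:-1] if word.endswith('s') and len(word) > 3 else word


def clean(text, tokenizer, stemmer):
    clean_documents = []
    for document in text:                                # step 1: loop over documents
        tokens = tokenizer(document)                     # step 2: pick out the words
        stems = [stemmer.stem(tok) for tok in tokens]    # step 3: stem each word
        clean_documents.append(' '.join(stems))          # step 4: re-join the document
    return clean_documents                               # step 5: list of clean documents


cleaned = clean(['He builds THINGS', 'caRtoon type thing'], toy_tokenizer, ToyStemmer())
print(cleaned)  # ['he build thing', 'cartoon type thing']
```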

## What We Want

So what I want is the ability to do something like:

```python
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import TreebankWordTokenizer
from nltk.stem import PorterStemmer

train_corpus = ['BOB the builder', 'He is a strange thing', 'caRtoon type thing', 'Yes, he can fix things']
test_corpus = ['BOB the builder', 'can he fix it?', 'yes he can!']  # Note the punctuation ...

nlp = NLPPipe(CountVectorizer(), simple_cleaning_function_i_made, TreebankWordTokenizer(), PorterStemmer())
nlp.fit(train_corpus)
nlp.transform(test_corpus)
```
This should return the test corpus in vectorized form.
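
The returned matrix has one row per document and one column per word learned during `fit`. As a rough illustration of why `fit` and `transform` are separate steps, the sketch below uses `TinyCountVectorizer`, a made-up standard-library stand-in (not scikit-learn's implementation): the vocabulary is fixed at `fit` time, so words that only show up at `transform` time are ignored.

```python
class TinyCountVectorizer:
    # Hypothetical stand-in for illustration only; the real CountVectorizer
    # also handles tokenization, lowercasing, and sparse output.

    def fit(self, documents):
        # Learn the vocabulary: each distinct word gets a column index
        words = sorted({word for doc in documents for word in doc.split()})
        self.vocabulary_ = {word: index for index, word in enumerate(words)}
        return self

    def transform(self, documents):
        # Count known words per document; unknown words are skipped
        rows = []
        for doc in documents:
            row = [0] * len(self.vocabulary_)
            for word in doc.split():
                if word in self.vocabulary_:
                    row[self.vocabulary_[word]] += 1
            rows.append(row)
        return rows


vec = TinyCountVectorizer().fit(['bob the builder', 'he can fix thing'])
matrix = vec.transform(['can he fix it', 'bob bob'])
print(vec.vocabulary_)
print(matrix)  # [[0, 0, 1, 1, 1, 0, 0], [2, 0, 0, 0, 0, 0, 0]]
```

Here `it` never appeared in the training documents, so it contributes nothing to the first row.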

# Solution!

In [1]:
from sklearn.feature_extraction.text import CountVectorizer
import pickle


class NLPPipe:
   
    def __init__(self, vectorizer=None, tokenizer=None, cleaning_function=None, 
                 stemmer=None, model=None):
        """
        A class for pipelining our data in NLP problems. The user provides a series of 
        tools, and this class manages all of the training, transforming, and modification
        of the text data.
        ---
        Inputs:
        vectorizer: the model to use for vectorization of text data; if None,
            defaults to a new CountVectorizer
        tokenizer: the tokenizer function to use; if None, defaults to splitting on spaces
        cleaning_function: how to clean the data; if None, defaults to the built-in
            clean_text method
        stemmer: optional stemmer with a .stem method; if None, no stemming is done
        model: optional model to store alongside the pipeline (not used by fit/transform)
        """
        if vectorizer is None:
            # Create the default here so that instances do not share one vectorizer
            vectorizer = CountVectorizer()
        if not tokenizer:
            tokenizer = self.splitter
        if not cleaning_function:
            cleaning_function = self.clean_text
        self.stemmer = stemmer
        self.tokenizer = tokenizer
        self.model = model
        self.cleaning_function = cleaning_function
        self.vectorizer = vectorizer
        self._is_fit = False
        
    def splitter(self, text):
        """
        Default tokenizer that splits on spaces naively
        """
        return text.split(' ')
        
    def clean_text(self, text, tokenizer, stemmer):
        """
        A naive function to lowercase all words and clean them quickly.
        This is the default behavior if no other cleaning function is specified
        """
        cleaned_text = []
        for post in text:
            cleaned_words = []
            for word in tokenizer(post):
                low_word = word.lower()
                if stemmer:
                    low_word = stemmer.stem(low_word)
                cleaned_words.append(low_word)
            cleaned_text.append(' '.join(cleaned_words))
        return cleaned_text
    
    def fit(self, text):
        """
        Cleans the data and then fits the vectorizer with
        the user provided text
        """
        clean_text = self.cleaning_function(text, self.tokenizer, self.stemmer)
        self.vectorizer.fit(clean_text)
        self._is_fit = True
        
    def transform(self, text):
        """
        Cleans any provided data and then transforms the data into
        a vectorized format based on the fit function. Returns the
        vectorized form of the data.
        """
        if not self._is_fit:
            raise ValueError("Must fit the models before transforming!")
        clean_text = self.cleaning_function(text, self.tokenizer, self.stemmer)
        return self.vectorizer.transform(clean_text)
    
    def save_pipe(self, filename):
        """
        Writes the attributes of the pipeline to a file
        allowing a pipeline to be loaded later with the
        pre-trained pieces in place.
        """
        if not isinstance(filename, str):
            raise TypeError("filename must be a string")
        if not filename.endswith('.mdl'):
            filename += '.mdl'
        with open(filename, 'wb') as f:
            pickle.dump(self.__dict__, f)
        
    def load_pipe(self, filename):
        """
        Loads the attributes of a previously saved pipeline
        from a file, restoring the pre-trained pieces in place.
        """
        if not isinstance(filename, str):
            raise TypeError("filename must be a string")
        if not filename.endswith('.mdl'):
            filename += '.mdl'
        with open(filename, 'rb') as f:
            self.__dict__ = pickle.load(f)

In [2]:
train_corpus = ['BOB the builder', 'He is a strange thing', 'caRtoon type thing', 'Yes, he can fix things']
test_corpus = ['BOB the builder', 'can he fix it?', 'yes he can!']  # Note the punctuation ...

In [3]:
def simple_cleaning_function_i_made(text, tokenizer, stemmer):
    cleaned_text = []
    for post in text:
        cleaned_words = []
        for word in tokenizer(post):
            low_word = word.lower()
            if stemmer:
                low_word = stemmer.stem(low_word)
            cleaned_words.append(low_word)
        cleaned_text.append(' '.join(cleaned_words))
    return cleaned_text

In [4]:
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import TreebankWordTokenizer
from nltk.stem import PorterStemmer

nlp = NLPPipe(vectorizer=CountVectorizer(), 
              cleaning_function=simple_cleaning_function_i_made, 
              tokenizer=TreebankWordTokenizer().tokenize, 
              stemmer=PorterStemmer())

nlp.fit(train_corpus)
nlp.transform(test_corpus).toarray()

array([[1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
       [0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1]])

In [5]:
nlp.vectorizer.vocabulary_

{'bob': 0,
 'the': 8,
 'builder': 1,
 'he': 5,
 'is': 6,
 'strang': 7,
 'thing': 9,
 'cartoon': 3,
 'type': 10,
 'ye': 11,
 'can': 2,
 'fix': 4}
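
Each column of the array corresponds to a vocabulary index, so the two outputs above can be read together. Here is a small sketch that inverts the vocabulary mapping and decodes each row (the values are copied from the outputs above; `index_to_word` and `decoded` are just illustrative names):

```python
vocabulary = {'bob': 0, 'the': 8, 'builder': 1, 'he': 5, 'is': 6, 'strang': 7,
              'thing': 9, 'cartoon': 3, 'type': 10, 'ye': 11, 'can': 2, 'fix': 4}
rows = [[1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
        [0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0],
        [0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1]]

# Invert the mapping: column index -> word
index_to_word = {index: word for word, index in vocabulary.items()}

# For each row, list the words whose count is non-zero (in column order)
decoded = [[index_to_word[i] for i, count in enumerate(row) if count]
           for row in rows]
for words in decoded:
    print(words)
# ['bob', 'builder', 'the']
# ['can', 'fix', 'he']
# ['can', 'he', 'ye']
```

Note that `it` and the punctuation from the test corpus do not appear: `it` was never seen during `fit`, and `CountVectorizer`'s default token pattern only keeps tokens of two or more word characters, so punctuation is dropped. Word order is also lost, since the columns follow the vocabulary indices.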

In [6]:
nlp.__dict__.keys()

dict_keys(['stemmer', 'tokenizer', 'model', 'cleaning_function', 'vectorizer', '_is_fit'])
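
These keys are exactly what `save_pipe` writes to disk, since it pickles `self.__dict__`. Here is a minimal sketch of the same save/load round trip on a toy class (`TinyModel` is a made-up name, and a temporary directory is used so nothing is left behind):

```python
import os
import pickle
import tempfile


class TinyModel:
    # Hypothetical toy class standing in for the pipeline

    def __init__(self):
        self.vocabulary = {}
        self._is_fit = False

    def fit(self, words):
        self.vocabulary = {word: i for i, word in enumerate(sorted(set(words)))}
        self._is_fit = True

    def save(self, filename):
        # Pickle only the attribute dictionary, as save_pipe does
        with open(filename, 'wb') as f:
            pickle.dump(self.__dict__, f)

    def load(self, filename):
        # Replace this object's attributes with the saved ones
        with open(filename, 'rb') as f:
            self.__dict__ = pickle.load(f)


trained = TinyModel()
trained.fit(['bob', 'builder', 'bob'])

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'tiny.mdl')
    trained.save(path)
    restored = TinyModel()  # starts out unfitted
    restored.load(path)

print(restored._is_fit, restored.vocabulary)  # True {'bob': 0, 'builder': 1}
```

Only load pickle files you created yourself or trust, because unpickling can execute arbitrary code.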

## A Few Recommendations
1. **Your model class should not save data** for many reasons. For one, saving data with your model makes dump/load time very slow, and it also makes it difficult to generalize your model to different data sets.

2. Include print statements (or, even better, [logging](https://docs.python.org/3.8/library/logging.html) outputs) wherever you can to make things easier to debug when things go wrong.

3. **Keep functions small!!** I can't stress this enough. The longer a function is (i.e., the more lines it has), the harder it is to debug and to keep track of what is going on.
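
As a sketch of recommendations 2 and 3 together, here is one small function that reports what it is doing through the standard `logging` module. The function name `lowercase_documents`, the logger name, and the log messages are illustrative choices, not part of the solution above.

```python
import logging

# Show INFO and above; switch to logging.DEBUG for more detail while debugging
logging.basicConfig(level=logging.INFO, format='%(levelname)s:%(name)s:%(message)s')
logger = logging.getLogger('nlp_pipe')


def lowercase_documents(text):
    # One small job per function: lowercase every document
    logger.info('Lowercasing %d documents', len(text))
    cleaned = [document.lower() for document in text]
    logger.debug('First cleaned document: %r', cleaned[:1])
    return cleaned


lowered = lowercase_documents(['BOB the builder', 'Yes he can!'])
print(lowered)  # ['bob the builder', 'yes he can!']
```

Unlike print statements, the log calls can stay in the code: raising the level to `logging.WARNING` silences them without deleting anything.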