# Profiling NLP: optimizing code with spaCy

## Objective

To compare the speed and efficiency of two Python natural-language processing libraries, NLTK and spaCy.

## Architecture

NLTK is very popular but not optimized for speed. spaCy is built with Cython, which gives it a large speed advantage, but it is often used in ways that slow it back down to plain-Python speeds. We'll look at ways to accomplish common NLP tasks while avoiding that overhead.

## Tasks

The processing pipeline(s) include:

    NLTK version:
      - tokenize: split texts into individual tokens
      - lowercase: normalize the vocabulary by case
      - stopword removal: remove tokens if they appear in a specified list
      - tag: tag part of speech (for lemmatization)
      - lemmatize: normalize the vocabulary to the base form of each token
      - join (optional): return the list of tokens in one joined string
      
The standard spaCy pipeline includes all of this, plus dependency parsing and NER:
      
    spaCy version:
      - tokenize
      - tag: tag part of speech
      - parse: perform dependency parsing
      - named entity recognition: extract named entities according to statistical model
      - lemmatize
      - join (optional)
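
To make the shared stages concrete, here is a minimal sketch of tokenize, lowercase, stopword removal, and join using only the standard library. The names `TOY_STOP_WORDS` and `toy_clean` are assumptions made up for illustration; tagging and lemmatization are left out because they need NLTK or spaCy models.

```python
import re

# Hypothetical, deliberately tiny stop list; the real NLTK and spaCy lists are much longer.
TOY_STOP_WORDS = {"the", "of", "a", "is", "it", "s"}

def toy_clean(text, list_tokens=False):
    """Tokenize on word characters, lowercase, drop stop words, optionally join."""
    tokens = re.findall(r"\w+", text)                          # tokenize
    tokens = [t.lower() for t in tokens]                       # lowercase
    tokens = [t for t in tokens if t not in TOY_STOP_WORDS]    # stopword removal
    return tokens if list_tokens else " ".join(tokens)         # join (optional)

print(toy_clean("The making of the film"))  # making film
```

Without a lemmatizer, "making" stays as it is here; the real pipelines below differ mainly in how they handle that last step.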


In [1]:
import pandas as pd
import numpy as np
import spacy
import nltk
# import gensim
from itertools import dropwhile
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet as wn
from nltk import word_tokenize, pos_tag
from collections import defaultdict

In [2]:
%load_ext line_profiler

In [5]:
stop_words = stopwords.words('english')
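
`stopwords.words('english')` returns a list, so every `token not in stop_words` check scans it linearly. A rough sketch of the difference a set makes, using a made-up stop list (`stop_list` is a stand-in, not the NLTK data) so it runs without any downloads:

```python
import timeit

# Hypothetical stand-in for the NLTK stop word list.
stop_list = [f"word{i}" for i in range(200)]
stop_set = set(stop_list)            # build once, reuse for every token

tokens = ["word199", "banana", "word0", "spacy"] * 1000

def filter_with(container):
    """Drop every token found in the given container."""
    return [t for t in tokens if t not in container]

# Both containers give identical results; only the lookup cost differs.
assert filter_with(stop_list) == filter_with(stop_set)

list_time = timeit.timeit(lambda: filter_with(stop_list), number=5)
set_time = timeit.timeit(lambda: filter_with(stop_set), number=5)
print(f"list: {list_time:.4f}s  set: {set_time:.4f}s")
```

The timings will vary by machine, so no numbers are quoted here. The benchmark results later in this notebook were recorded with the plain list.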

In [91]:
class InputError(AttributeError):
    '''
    if parameter spacy == True, corpus input must be of type spacy.tokens.doc.Doc.
    e.g.: clean_text(nlp("You keep using that word. I do not think it means what you think it means."))
    '''
    

In [96]:
def clean_text(doc, spacy=True, printed=False, list_tokens=False):
    '''
    define a simple preprocessing pipeline for general NLP tasks.
    non-spaCy version:
      - tokenize: split texts into individual tokens
      - lowercase: normalize the vocabulary by case
      - stopword removal: remove tokens if they appear in a specified list
      - tag: tag part of speech (for lemmatization)
      - lemmatize: normalize the vocabulary to the base form of each token
      - join (optional): return the list of tokens in one joined string
      
      the spaCy pipeline includes all of this, plus dependency parsing and NER:
      
    spaCy version: (we will disable some of these pipes in testing.)
      - tokenize
      - lemmatize
      - stopword removal
      - tag: tag part of speech
      - parse: perform dependency parsing
      - named entity recognition: extract named entities according to statistical model
      - join (optional)
    '''  

    if spacy:
        try: 
            doc = [token.lemma_ for token in doc if not token.is_stop and token.pos_ not in ['PRON', 'PUNCT']]
            #dropwhile(lambda x: not (x.is_stop and x.pos_ in ['PRON', 'PUNCT']), tokens)
            #if not token.is_stop and not token.pos_ in set(['PRON', 'PUNCT'])
            if not list_tokens:
                return nlp.make_doc(' '.join([token for token in doc]))
            else:
                return doc
        except AttributeError as ae:
            print(f'''ERROR! if parameter spacy == True, corpus input must be of type spacy.tokens.doc.Doc, not {doc.__class__}!\ne.g.: clean_text(nlp("You keep using that word. I do not think it means what you think it means."))''')
            raise

    else:
        '''
        spaCy's default pipeline includes tokenizer + lemmatizer + POS-tagging
        we've added stopword removal to both processes
        '''
        def pos_tag_nltk(token, printed=False):
            tag_map = defaultdict(lambda : wn.NOUN)
            tag_map['J'] = wn.ADJ
            tag_map['V'] = wn.VERB
            tag_map['R'] = wn.ADV
        
            nonlocal lemmatizer
        
            token, tag = zip(*pos_tag([token]))
            lemma = lemmatizer.lemmatize(token[0], tag_map[tag[0][0]])
            if printed:
                print(token[0], "=>", lemma)
            return lemma

        new_doc = []
        tokenizer = RegexpTokenizer(r'\w+')
        lemmatizer = WordNetLemmatizer()
        doc = tokenizer.tokenize(doc)
        for token in doc:
            if token.lower() not in stop_words:
                new_doc.append(pos_tag_nltk(token.lower(), printed))
        if not list_tokens:
            return ' '.join([token for token in new_doc])
        else:
            return new_doc

In [11]:
doc_example = '''Let\'s try some NER: Barack Obama, Germany, $5 million, ten o\'clock. the parts of speech get lemmatized differently: it's fun making things, the making of the film, making fun of you'''

a = '''This is a Doc. I have made one, and it\'s clean now! Pretty cool, huh? It\'s fun making docs in spaCy. 
               Let\'s try for some NER: Barack Obama, Germany, $5 million, Quakers, ten o\'clock.
               Did you notice that made and make are stop words in spaCy but making isn't?
               the parts of speech get lemmatized differently: it's fun making things, the making of the film, making fun of you
               WordNet treats lemmas differently: repeat repeating repeats repeated repetition repetitive
               in spaCy, clocks are different from o\'clock '''

'''
Note the differences in spaCy's stopwords list
and lemmatization vs. NLTK's. Also note that
spaCy won't eliminate newline characters for you.
'''

nlp = spacy.load('en_core_web_lg')
nltk_ex = clean_text(doc_example, spacy=False)
spacy_ex = clean_text(nlp(doc_example), spacy=True)

print(f'NLTK: {nltk_ex}')
print('')
print(f'spaCy: {spacy_ex}')

<class 'spacy.tokens.doc.Doc'>
NLTK: let try ner barack obama germany 5 million ten clock part speech get lemmatized differently fun make thing make film make fun

spaCy: let try ner Barack Obama Germany $ 5 million o'clock part speech lemmatize differently fun make thing making film make fun


In [71]:
spacy_ex = clean_text(nlp(doc_example))
print(spacy_ex)

let try ner Barack Obama Germany $ 5 million o'clock part speech lemmatize differently fun make thing making film make fun


In [6]:
'''
However, spaCy's conservative lemmatization 
lets it keep the entities that it recognizes together,
unlike NLTK, which does not keep them in one unit.

spaCy also takes part of speech into account when lemmatizing.
if syntactic nuance is important to your use case, this
can be a valuable disambiguating tool.

NLTK: ten, clock
spaCy: ten o'clock

NLTK:  making of + N (nominalized V) -> make
       making + N (V, progressive aspect) -> make
spaCy: making of + N -> making (retains form of nominalization)
       making + N -> make 

NB: it's sensitive enough to know the difference between 
"making of + N" and "making N of"

'''
spacy_doc = nlp(doc_example)
spacy_doc.ents

print(clean_text(nlp('The making of the film')))
print(clean_text(nlp('making a banana cream pie')))
print(clean_text(nlp('making fun of your dumb photos')))
print(clean_text(nlp('making light of the situation')))

making film
make banana cream pie
make fun dumb photo
make light situation


In [99]:
'''
note that the order in which components are used counts.
this ordering recognizes two kinds of Barack Obama,
because he has been lowercased.
'''

nlp = spacy.load('en_core_web_lg')
print(nlp.pipeline)

docs = ['doc one', 'barack obama', 'fifty billion', 'Barack Obama']

def nltk_clean_nlp(docs, spacy=False):
    new_docs = []
    for doc in docs:
        # collect the cleaned text returned by the NLTK branch of clean_text
        new_docs.append(clean_text(doc, spacy=False))
        
        # if spacy == True:
        #     doc = nlp(doc)
        #     ents = doc.ents
        #     print(doc, ents)
    
    return new_docs

l = nltk_clean_nlp(docs, spacy = True)

[('tagger', <spacy.pipeline.pipes.Tagger object at 0x1004a0a50>), ('parser', <spacy.pipeline.pipes.DependencyParser object at 0x1f9058050>), ('ner', <spacy.pipeline.pipes.EntityRecognizer object at 0x21bec89f0>)]


In [21]:
import requests, zipfile, io

In [24]:
def get_sample_docs(url='http://archives.textfiles.com/media.zip'):
    '''
    download a zip archive, extract it into the working directory,
    and return the text of every file that can be read.
    '''
    r = requests.get(url)
    z = zipfile.ZipFile(io.BytesIO(r.content))
    z.extractall()
    prof_docs = []
    for file in z.namelist():
        try:
            with open(file, 'r') as f:
                prof_docs.append(f.read())
        except Exception as e:
            # directories and files that fail to decode are reported and skipped
            print(f'could not read {file}: {e}')
    return prof_docs

In [25]:
prof_docs = get_sample_docs()
len(prof_docs)

could not read media/EPISODES/: [Errno 21] Is a directory: 'media/EPISODES/'
could not read media/EPISODES/warnrlis.txt: 'utf-8' codec can't decode byte 0x8e in position 2443: invalid start byte
could not read media/SCRIPTS/: [Errno 21] Is a directory: 'media/SCRIPTS/'
could not read media/christop.int: 'utf-8' codec can't decode byte 0x97 in position 7661: invalid start byte
could not read media/earp: 'utf-8' codec can't decode byte 0xda in position 8: invalid continuation byte
could not read media/oliver.txt: 'utf-8' codec can't decode byte 0xb0 in position 1: invalid start byte
could not read media/oliver02.txt: 'utf-8' codec can't decode byte 0xb0 in position 1: invalid start byte
could not read media/widows: 'utf-8' codec can't decode byte 0xda in position 8: invalid continuation byte


161
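
The messages above fall into two groups: directory entries and files that are not valid UTF-8. One possible way to handle both, sketched under assumptions: `read_zip_texts` is a hypothetical helper, and falling back to `latin-1` is a guess about the encoding (it never fails to decode, but it can produce wrong characters if the file was written in something else).

```python
import io
import zipfile

def read_zip_texts(zip_bytes, encodings=("utf-8", "latin-1")):
    """Return decoded text for every regular file in a zip archive held in memory."""
    texts = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as z:
        for info in z.infolist():
            if info.is_dir():          # skip entries such as 'media/EPISODES/'
                continue
            raw = z.read(info)         # read bytes without extracting to disk
            for enc in encodings:
                try:
                    texts.append(raw.decode(enc))
                    break
                except UnicodeDecodeError:
                    continue
    return texts

# Build a small archive in memory to exercise the function.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("media/", "")                 # directory entry
    z.writestr("media/a.txt", "plain ascii")
    z.writestr("media/b.txt", b"caf\xe9")    # not valid UTF-8
print(read_zip_texts(buf.getvalue()))
```

With a fallback like this the corpus would be larger than the 161 documents used below, so the timings in this notebook would not carry over directly.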

In [41]:
reddit = pd.read_csv('reddit-comments-2015-08.csv')
blog_docs = reddit.body.tolist()

# Profiling

## Round 1: NLTK

### Total time: 1897.58 s (that's 31.6 minutes)
### 60.4%: clean_text
### 39.1%: nlp(doc)

In [100]:
%lprun -f nltk_clean_nlp -f clean_text -T nltk_reddit_2.txt nltk_clean_nlp(blog_docs)


*** Profile printout saved to text file 'nltk_reddit_2.txt'. 
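
`%lprun` comes from the third-party line_profiler extension and reports time per line. If it is not available, the standard library's `cProfile` gives a coarser, per-function view. A minimal sketch, where `slow_clean` is a hypothetical stand-in for the function being profiled:

```python
import cProfile
import io
import pstats

def slow_clean(docs):
    """Hypothetical stand-in for a preprocessing function worth profiling."""
    return [" ".join(sorted(doc.lower().split())) for doc in docs]

docs = ["Profiling NLP code with spaCy"] * 10_000

profiler = cProfile.Profile()
profiler.enable()
result = slow_clean(docs)
profiler.disable()

# Write the stats to a string, sorted by cumulative time, top 5 rows only.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```

This is not what produced the numbers in this notebook; those all come from `%lprun`.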


## Round 2: spaCy, the wrong way

### (Converting a fast C struct to a slow Python list)

### Total time: 871.419 s (14.5 minutes, still less than half the time of NLTK)

### 98.2%: list(nlp.pipe(docs))

In [101]:
nlp = spacy.load('en_core_web_lg')
print(nlp.pipeline)

def list_pipe_docs(docs): 
    '''
    For an apples-to-apples comparison, let's disable the
    dependency parse and named entity recognition pipes, 
    since NLTK isn't doing those things.
    '''
    doc_convert = []
    for doc in list(nlp.pipe(docs, disable=['parser', 'ner'])):
        new_doc = clean_text(doc)
        # for token in doc:
        #     if not token.is_stop or token.pos_ in ['PRON', 'PUNCT']:
        #         new_doc.append(token)
        doc_convert.append(new_doc)
    return doc_convert
                
        #print(doc.ents)

[('tagger', <spacy.pipeline.pipes.Tagger object at 0x20e313dd0>), ('parser', <spacy.pipeline.pipes.DependencyParser object at 0x202c43ad0>), ('ner', <spacy.pipeline.pipes.EntityRecognizer object at 0x20e2a0de0>)]


In [102]:
%lprun -f list_pipe_docs -f clean_text -T spacy_list_reddit_2.txt list_pipe_docs(blog_docs) 


*** Profile printout saved to text file 'spacy_list_reddit_2.txt'. 
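
Wrapping a generator in `list()` forces every item to be produced and held in memory before the loop body sees the first one. The sketch below shows only that ordering difference with a toy generator; `fake_pipe` and `consume` are hypothetical names, and the example does not reproduce spaCy's internals or the timings above.

```python
def fake_pipe(texts, log):
    """Hypothetical stand-in for a lazy pipeline such as nlp.pipe()."""
    for text in texts:
        log.append(f"produce {text}")
        yield text.upper()

def consume(docs, log):
    """Record each document as the loop body receives it."""
    for doc in docs:
        log.append(f"consume {doc}")

texts = ["a", "b"]

eager_log = []
consume(list(fake_pipe(texts, eager_log)), eager_log)   # materialize first

lazy_log = []
consume(fake_pipe(texts, lazy_log), lazy_log)           # stream one at a time

print(eager_log)  # ['produce a', 'produce b', 'consume A', 'consume B']
print(lazy_log)   # ['produce a', 'consume A', 'produce b', 'consume B']
```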


## Round 3: spaCy, using (some) Cython optimization

### Total time: about 11 minutes

In [105]:
nlp_pipeline = spacy.load('en_core_web_lg')
nlp_pipeline.add_pipe(clean_text, 'clean_text')
print(nlp_pipeline.pipeline)

def pipe_docs(docs):
    for doc in nlp_pipeline.pipe(docs, disable=['parser', 'ner']):
        # _ = doc.text
        yield doc

def get_piped_docs(docs):
    gen_docs = []
    for doc in pipe_docs(docs):
        gen_docs.append(doc)
    return gen_docs

[('tagger', <spacy.pipeline.pipes.Tagger object at 0x20e30dbd0>), ('parser', <spacy.pipeline.pipes.DependencyParser object at 0x1a8ba5210>), ('ner', <spacy.pipeline.pipes.EntityRecognizer object at 0x1a8ba59f0>), ('clean_text', <function clean_text at 0x21c719cb0>)]


In [106]:
%lprun -f get_piped_docs -f pipe_docs -f clean_text -T spacy_pipe_reddit_2.txt get_piped_docs(blog_docs)

'''
clean_text is much faster in the pipe_docs function
the slowest part is the string join to return a spacy doc at the end
'''


*** Profile printout saved to text file 'spacy_pipe_reddit_2.txt'. 



# The Winner: spaCy pipeline

* spaCy with Cython: 11 minutes
* spaCy with Python lists: 14.5 minutes
* NLTK: 31.6 minutes
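
For a quick sense of scale, the totals above can be turned into speedups relative to NLTK. This only restates the reported minutes; the dictionary keys are informal labels.

```python
# Wall-clock totals reported above, in minutes.
minutes = {"NLTK": 31.6, "spaCy + Python lists": 14.5, "spaCy pipeline": 11.0}

baseline = minutes["NLTK"]
speedups = {name: round(baseline / t, 2) for name, t in minutes.items()}
print(speedups)
# {'NLTK': 1.0, 'spaCy + Python lists': 2.18, 'spaCy pipeline': 2.87}
```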

The slowest parts of the pipeline are still the Python parts inside clean_text: the list comprehension (23.6% of the function's profiled time, 2821.8 ms) and the string join with its list comprehension (76.3%, 9118.1 ms). Both could be optimized further.
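
One small candidate for that optimization: `' '.join([token for token in doc])` copies a list that `str.join` could consume directly. A sketch of the comparison on made-up tokens; `join_with_copy` and `join_direct` are hypothetical names, and this has not been measured on the Reddit corpus.

```python
import timeit

tokens = ["make", "fun", "dumb", "photo"] * 5_000

def join_with_copy(toks):
    return " ".join([t for t in toks])   # builds a throwaway list first

def join_direct(toks):
    return " ".join(toks)                # str.join accepts the list as-is

# Same output either way; only the extra copy differs.
assert join_with_copy(tokens) == join_direct(tokens)

copy_time = timeit.timeit(lambda: join_with_copy(tokens), number=200)
direct_time = timeit.timeit(lambda: join_direct(tokens), number=200)
print(f"with copy: {copy_time:.4f}s  direct: {direct_time:.4f}s")
```

The profiled line also calls `nlp.make_doc`, which tokenizes the joined string again, so the join itself may be only part of that line's cost.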


### Takeaways

* NLTK was the slowest way to preprocess this corpus: 31.6 minutes, more than twice as long as either spaCy run
* Using spaCy's nlp in a for-loop is still better than NLTK, but not as good as nlp.pipe()
* However, converting the output of nlp.pipe() to a Python list gave back about 3.5 minutes of the time saved by Cython (14.5 vs. 11 minutes)