# NLP Toolkit Selection

## Load Cleansed Wine Reviews

See [data preparation](wine_review-data_preparation.ipynb) for details on the prepared dataset.

**Libraries**

In [18]:
import pandas as pd
import numpy as np

from IPython.display import Markdown, display
from tqdm import tqdm
tqdm.pandas()

import matplotlib.pyplot as plt

In [19]:
wine_df = pd.read_parquet('files/wine_review.parquet.gzip')
wine_df[['title', 'winery', 'year', 'variety', 'description']].head()

| | title | winery | year | variety | description |
|---|---|---|---|---|---|
| 0 | Nicosia 2013 Vulkà Bianco (Etna) | Nicosia | 2013 | White Blend | Aromas include tropical fruit, broom, brimston... |
| 1 | Quinta dos Avidagos 2011 Avidagos Red (Douro) | Quinta dos Avidagos | 2011 | Portuguese Red | This is ripe and fruity, a wine that is smooth... |
| 2 | Rainstorm 2013 Pinot Gris (Willamette Valley) | Rainstorm | 2013 | Pinot Gris | Tart and snappy, the flavors of lime flesh and... |
| 3 | St. Julian 2013 Reserve Late Harvest Riesling ... | St. Julian | 2013 | Riesling | Pineapple rind, lemon pith and orange blossom ... |
| 4 | Sweet Cheeks 2012 Vintner's Reserve Wild Child... | Sweet Cheeks | 2012 | Pinot Noir | Much like the regular bottling from 2012, this... |


**Pick a review**

In [20]:
# pick a review at random
review = next(wine_df[['title','description']].sample(1, random_state=28).itertuples())
print(review.title)
display(Markdown(review.description))

Wente 2008 Shorthorn Canyon Syrah (Livermore Valley)


Savory on the nose with enough sweet blueberry to make it varietally enticing, this light purplish Syrah is well-balanced, pleasing and well-made, ready to drink now with a range of foods, its alcohol in proportion and acidity right on target. Wente blends in small percentages of Counoise, Petite Sirah, Mourvèdre and Tempranillo into this great-value Syrah. Delicious.

**Define a corpus for performance testing**

In [21]:
corpus = wine_df[['title', 'winery', 'year', 'variety', 'description']].sample(100, random_state=42).description
corpus.head()

12616    87—89 Barrel sample. Soft fruit, followed by h...
49741    Rock solid, clean and composed. For the past s...
35244    This uber-informal wine has faint aromas that ...
52642    Those mountain tannins are here in spades, and...
83201    The aromas are light initially, with notes of ...
Name: description, dtype: object

## NLTK

**Load NLP toolkit**

In [22]:
import nltk

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

[nltk_data] Downloading package punkt to /Users/patrick/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/patrick/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package words to /Users/patrick/nltk_data...
[nltk_data]   Package words is already up-to-date!


True

**Split review into sentences**

In [23]:
sentences = nltk.sent_tokenize(review.description)
sentences

['Savory on the nose with enough sweet blueberry to make it varietally enticing, this light purplish Syrah is well-balanced, pleasing and well-made, ready to drink now with a range of foods, its alcohol in proportion and acidity right on target.',
 'Wente blends in small percentages of Counoise, Petite Sirah, Mourvèdre and Tempranillo into this great-value Syrah.',
 'Delicious.']

**Tokenize the words in each sentence**

In [24]:
import string
from nltk.corpus import stopwords

swords = stopwords.words('english')
swords.extend(['make', 'drink', 'wine', '%'])

tokenized_sentences = [
    [w for w in nltk.word_tokenize(sentence)
     if w not in string.punctuation and w.lower() not in swords]  # drop punctuation and stop words
    for sentence in sentences
]
tokenized_sentences[0][:5]

['Savory', 'nose', 'enough', 'sweet', 'blueberry']
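
One subtlety in the filter above: `word not in string.punctuation` is a substring test against a single string, so multi-character tokens such as `...` or `--` are not recognized as punctuation and pass through. A minimal sketch of a stricter check follows, assuming a token counts as punctuation when every character in it is ASCII punctuation. The names `STOP_WORDS`, `is_punctuation` and `filter_tokens` are hypothetical, and the tiny stop-word set stands in for NLTK's English list.

```python
import string

# Hypothetical stop-word set for illustration; the notebook uses NLTK's English list.
STOP_WORDS = {"on", "the", "with", "to", "it", "make", "drink", "wine"}


def is_punctuation(token: str) -> bool:
    """Return True when the token is non-empty and made only of ASCII punctuation."""
    return bool(token) and all(ch in string.punctuation for ch in token)


def filter_tokens(tokens):
    """Drop punctuation-only tokens and stop words (case-insensitive)."""
    return [
        token
        for token in tokens
        if not is_punctuation(token) and token.lower() not in STOP_WORDS
    ]


tokens = ["Savory", "on", "the", "nose", ",", "--", "well-balanced", "..."]
print(filter_tokens(tokens))  # ['Savory', 'nose', 'well-balanced']
```

Using a set for the stop words also makes each membership test constant-time, which can matter on a large corpus.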

**Tag each word in sentence for parts of speech**

In [25]:
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
tagged_sentences[0][:5]

[('Savory', 'NNP'),
 ('nose', 'RB'),
 ('enough', 'RB'),
 ('sweet', 'JJ'),
 ('blueberry', 'NNS')]

**Named entity recognition**

In [26]:
chunked_sentences = [tree for tree in nltk.ne_chunk_sents(tagged_sentences)]
chunked_sentences[0][:5]

[Tree('GPE', [('Savory', 'NNP')]),
 ('nose', 'RB'),
 ('enough', 'RB'),
 ('sweet', 'JJ'),
 ('blueberry', 'NNS')]

**Remove named entities from each sentence**

In [27]:
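def extract_entity_names(t):
    """Recursively collect the words inside GPE and PERSON chunks of an NLTK tree."""
    entity_names = []
    if hasattr(t, 'label'):
        # 'PROPN' is a part-of-speech tag, not a label that ne_chunk emits, so it never matches
        if t.label() in ['GPE', 'PERSON', 'PROPN']:
            entity_names.extend([child[0] for child in t])
        else:
            for child in t:
                entity_names.extend(extract_entity_names(child))
    return entity_names


# collect each sentence's entity names once, then drop those words from its tokens
entities_per_sentence = [set(extract_entity_names(tree)) for tree in chunked_sentences]
preprocessed_description = ' '.join([
    word
    for tokens, entities in zip(tokenized_sentences, entities_per_sentence)
    for word in tokens
    if word not in entities
])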

display(Markdown('**Description**: ' + review.description))
display(Markdown('**After preprocessing**: ' + preprocessed_description))

**Description**: Savory on the nose with enough sweet blueberry to make it varietally enticing, this light purplish Syrah is well-balanced, pleasing and well-made, ready to drink now with a range of foods, its alcohol in proportion and acidity right on target. Wente blends in small percentages of Counoise, Petite Sirah, Mourvèdre and Tempranillo into this great-value Syrah. Delicious.

**After preprocessing**: nose enough sweet blueberry varietally enticing light purplish Syrah well-balanced pleasing well-made ready range foods alcohol proportion acidity right target blends small percentages great-value Syrah Delicious

**Performance Evaluation**

In [28]:
def nltk_preprocess(doc):
  sentences = nltk.sent_tokenize(doc)
  tokenized_sentences = [list(filter(lambda word: (word not in string.punctuation) and (word.lower() not in swords), nltk.word_tokenize(sentence))) for sentence in sentences]
  tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
  chunked_sentences = nltk.ne_chunk_sents(tagged_sentences)

  def extract_entity_names(t):
    entity_names = []
    if hasattr(t, 'label'):
      if t.label() in ['GPE', 'PERSON', 'PROPN']:
        entity_names.extend([child[0] for child in t])
      else:
        for child in t:
          entity_names.extend(extract_entity_names(child))
    return entity_names


  return ' '.join([
    word    
    for i, tree in enumerate(chunked_sentences)
    for word in list(filter(lambda word: word not in extract_entity_names(tree), tokenized_sentences[i]))
  ])

In [29]:
%%time
corpus.progress_apply(nltk_preprocess)

100%|██████████| 100/100 [00:30<00:00,  3.30it/s]

CPU times: user 28.7 s, sys: 1.44 s, total: 30.1 s
Wall time: 30.3 s





12616     87—89 sample fruit followed high acids layer d...
49741     solid clean composed past several years Malbec...
35244     uber-informal faint aromas recall pressed wild...
52642     mountain tannins spades even though 're ripe s...
83201     aromas light initially notes cocoa herb barrel...
                                ...                        
66961     predominantly Merlot 83 along Sauvignon 9 Syra...
95467     blend 60 40 del opulent dessert opens intense ...
102544    soft broadly fruity Chardonnay near-tropical r...
39464     Crafted vines old Jean-Louis Grippat estate wo...
70310     best new releases massive sourced outstanding ...
Name: description, Length: 100, dtype: object
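
The `%%time` magic only works inside IPython or Jupyter. As a rough sketch of how the same throughput measurement could be made in a plain script, the helper below times any preprocessing function over a list of documents. `docs_per_second` is a hypothetical name, and `str.lower` is only a stand-in for `nltk_preprocess`.

```python
import time


def docs_per_second(preprocess, docs):
    """Apply `preprocess` to every document and return (throughput, results)."""
    start = time.perf_counter()
    results = [preprocess(doc) for doc in docs]
    # guard against a zero reading on very fast runs
    elapsed = max(time.perf_counter() - start, 1e-9)
    return len(results) / elapsed, results


# Stand-in documents and preprocessing function, for illustration only.
sample_docs = ["Rock solid, clean and composed.", "Soft fruit, followed by high acids."]
rate, cleaned = docs_per_second(str.lower, sample_docs)
print(f"{rate:.1f} docs/s")
print(cleaned[0])  # rock solid, clean and composed.
```

Passing the function in as an argument lets the same harness time both toolkits on an identical sample.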

## spaCy

**Load NLP toolkit**

In [30]:
import spacy
nlp = spacy.load("en_core_web_sm")

**Preprocess Text**

In [31]:
# keep a (lemma, POS) pair unless it is punctuation, a proper noun, a number, or a stop word
keep = lambda token: (token[1] not in ['PUNCT','PROPN', 'NUM']) and (token[0].lower() not in swords)
# lemmatize with spaCy and join the lemmas that pass the filter
spacy_preprocess = lambda doc: ' '.join(map(lambda token: token[0], filter(keep, [(w.lemma_, w.pos_) for w in nlp(doc)])))

display(Markdown('**Description**: ' + review.description))
# apply the spaCy pipeline to the sample review
display(Markdown('**After preprocessing**: ' + spacy_preprocess(review.description)))

**Performance Evaluation**

In [32]:
%%time
corpus.progress_apply(spacy_preprocess)

100%|██████████| 100/100 [00:01<00:00, 93.23it/s] 

CPU times: user 976 ms, sys: 52.8 ms, total: 1.03 s
Wall time: 1.07 s





12616     sample soft fruit follow high acid layer dry t...
49741     rock solid clean compose past several year mer...
35244     uber informal faint aroma recall press wildflo...
52642     mountain tannin spade even though ripe sweet g...
83201     aroma light initially note cocoa herb barrel s...
                                ...                        
66961     predominantly along pleasing aroma suggest blu...
95467     blend opulent dessert open intense aroma dry b...
102544    soft broadly fruity near tropical ripeness pap...
39464     craft vine old estate wonderfully complex yet ...
70310     good new release massive source outstanding pa...
Name: description, Length: 100, dtype: object

## Conclusion

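[spaCy](https://spacy.io/) is the better choice for this project. Its preprocessing fits in two one-line functions, whereas the NLTK version needs separate sentence-splitting, tokenizing, tagging and chunking steps. It also does a better job of recognizing proper nouns: in the sample output above, NLTK leaves names such as Chardonnay and Jean-Louis Grippat in place, while spaCy removes them. Finally, it is much faster, processing the 100-review sample in 1.07 s of wall time against 30.3 s for NLTK, roughly 28 times faster. Given the size of the corpus we want to analyze, we will select spaCy as our NLP toolkit.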
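
To make the speed difference concrete, the wall times measured above can be turned into a speedup factor and a rough projection. This is a back-of-the-envelope sketch: `assumed_corpus_size` is a hypothetical figure, not the actual number of reviews, and the projection assumes throughput stays constant as the corpus grows.

```python
# Wall times measured on the 100-review sample in this notebook.
nltk_seconds = 30.3
spacy_seconds = 1.07
sample_size = 100

# Hypothetical corpus size, chosen only to illustrate the scaling.
assumed_corpus_size = 100_000


def projected_minutes(seconds_for_sample: float) -> float:
    """Linearly scale the sample wall time to the assumed corpus size."""
    return seconds_for_sample / sample_size * assumed_corpus_size / 60


speedup = nltk_seconds / spacy_seconds
print(f"speedup: {speedup:.1f}x")                              # speedup: 28.3x
print(f"NLTK:  {projected_minutes(nltk_seconds):.0f} min")     # NLTK:  505 min
print(f"spaCy: {projected_minutes(spacy_seconds):.0f} min")    # spaCy: 18 min
```

Under these assumptions the gap is between minutes and hours, which is what drives the choice.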

## Next Step
- [Preprocess reviews with spaCy](wine_reviews-spacy_preprocess.ipynb)