# Introduction to spaCy

This code accompanies Kim Fessel's post on the ODSC blog: ["Level Up: spaCy NLP for the Win,"](https://opendatascience.com/level-up-spacy-nlp-for-the-win/) published February 2020.

## spaCy Installation

Install spaCy with pip:

`pip install spacy`

You will also need to download a language model. For learning purposes, the small English model is enough:

`python -m spacy download en_core_web_sm`
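
Before loading the model in a notebook, it can help to confirm that the download succeeded. The sketch below is one way to do that, assuming `spacy.util.is_package` is available in the installed spaCy version; the helper name `model_is_installed` is made up for this example.

```python
import spacy
from spacy.util import is_package

MODEL_NAME = "en_core_web_sm"  # the small English model downloaded above


def model_is_installed(name: str = MODEL_NAME) -> bool:
    """Return True if spaCy can find an installed package with this name."""
    # model_is_installed is a hypothetical helper, not part of spaCy
    return is_package(name)


if model_is_installed():
    nlp = spacy.load(MODEL_NAME)
    print(f"Loaded {MODEL_NAME} with pipeline: {nlp.pipe_names}")
else:
    print(f"{MODEL_NAME} not found; run: python -m spacy download {MODEL_NAME}")
```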

## spaCy Basics

### Tokenization

In [1]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [2]:
review = "I'm so happy I went to this awesome Vegas buffet!"

In [3]:
doc = nlp(review)

> "The resulting spaCy document is a rich collection of tokens that have been annotated with many attributes... To see this in action, loop over each token in the document and print out the part of speech, lemma, and whether or not this token is a so-called stop word."

In [4]:
for token in doc:
    print(token.text, token.pos_, token.lemma_, token.is_stop)

I PRON -PRON- True
'm VERB be True
so ADV so True
happy ADJ happy False
I PRON -PRON- True
went VERB go False
to ADP to True
this DET this True
awesome ADJ awesome False
Vegas PROPN Vegas False
buffet NOUN buffet False
! PUNCT ! False
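
The token attributes printed above can also be used as filters. As a minimal sketch, the block below keeps only tokens that are neither stop words nor punctuation. It uses `spacy.blank("en")` as a stand-in so that it runs without a downloaded model; that choice, and the names `nlp_blank`, `blank_doc`, and `content_words`, are assumptions made for this example. A blank pipeline has no tagger or parser, so `pos_` and `dep_` would be empty here.

```python
import spacy

# spacy.blank("en") provides the English tokenizer and lexical attributes
# (is_stop, is_punct) without a trained model.
nlp_blank = spacy.blank("en")
blank_doc = nlp_blank("I'm so happy I went to this awesome Vegas buffet!")

# Keep tokens that are neither stop words nor punctuation
content_words = [
    token.text
    for token in blank_doc
    if not token.is_stop and not token.is_punct
]
print(content_words)
```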


> "... spaCy tokenizes text in an entirely nondestructive manner... The underlying text does not change... spaCy does not explicitly break the original text into a list, but tokens can be accessed by index span."

In [5]:
doc.text

"I'm so happy I went to this awesome Vegas buffet!"

In [6]:
doc[:5]

I'm so happy I

In [7]:
doc[-5:-1]

this awesome Vegas buffet
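
One way to check the nondestructive claim is to rebuild the original string from the tokens. This sketch again uses a blank English pipeline (an assumption, so no model download is needed) and relies on each token's `text_with_ws` and `idx` attributes. The variable names are illustrative.

```python
import spacy

nlp_blank = spacy.blank("en")  # tokenizer only, no trained model needed
text = "I'm so happy I went to this awesome Vegas buffet!"
blank_doc = nlp_blank(text)

# Each token keeps its trailing whitespace, so joining the pieces
# reproduces the input string exactly.
rebuilt = "".join(token.text_with_ws for token in blank_doc)
print(rebuilt == text)

# Slicing returns a Span, which points back into the Doc and
# knows its character offsets in the original text.
span = blank_doc[:5]
print(type(span).__name__, repr(span.text), span.start_char, span.end_char)

# token.i is the token index; token.idx is the character offset
for token in blank_doc[:3]:
    print(token.i, token.idx, repr(token.text), repr(token.whitespace_))
```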

> "spaCy also performs automatic sentence detection.  Iterating over the generator `doc.sents` yields each recognized sentence."

In [8]:
type(doc.sents)

generator

In [10]:
for sent in doc.sents:
    print(sent)

I'm so happy I went to this awesome Vegas buffet!
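
Because `doc.sents` is a generator, it has no length and cannot be indexed directly; wrapping it in `list()` makes the sentences easier to inspect. The sketch below shows this on a two-sentence text. Note the substitution: it uses spaCy's rule-based `sentencizer` component on a blank pipeline rather than the parser-based detection in `en_core_web_sm`, and the `add_pipe("sentencizer")` call assumes spaCy v3 syntax.

```python
import spacy

nlp_rules = spacy.blank("en")
# Rule-based sentence splitting on punctuation (spaCy v3 syntax, an assumption)
nlp_rules.add_pipe("sentencizer")

two_sentences = nlp_rules("The buffet was awesome. I would go back!")

# Materialize the generator so the sentences can be counted and indexed
sentences = list(two_sentences.sents)
print(len(sentences))

for sent in sentences:
    # start and end are token indices into the parent Doc
    print(sent.start, sent.end, sent.text)
```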


### Dependencies

> "... spaCy provides syntactic parsing to show word usage, thus creating a dependency tree..."

In [11]:
for token in doc:
    print(token.text, token.dep_)

I nsubj
'm ROOT
so advmod
happy acomp
I nsubj
went ccomp
to prep
this det
awesome amod
Vegas compound
buffet pobj
! punct


> "... visualizing these relationships reveals an even more comprehensive story.  First load a submodule called displaCy to help with the visualization... ask displaCy to render the dependency tree..."

In [12]:
from spacy import displacy

In [18]:
displacy.render(doc, style='dep', options={'distance': 80}) 

> "You can even traverse this parse tree... spaCy accurately labels 'awesome' as an adjectival modifier (amod) and also detects its relationship to 'buffet':"

In [19]:
from spacy.symbols import amod

In [20]:
for token in doc:
    if token.dep_ == 'amod':
        print(f"ADJ MODIFIER: {token.text} --> NOUN: {token.head}")

ADJ MODIFIER: awesome --> NOUN: buffet


In [21]:
spacy.explain("amod")

'adjectival modifier'

### Named Entity Recognition

> "To see which tokens spaCy identifies as named entities... simply cycle through `doc.ents`"

In [23]:
for ent in doc.ents:
    print(ent.text, ent.label_)

Vegas GPE


In [24]:
spacy.explain("GPE")

'Countries, cities, states'

In [25]:
displacy.render(doc, style='ent', jupyter=True)

> "Consider this more complicated example with four different kinds of entities; displaCy provides unique colors to each."

In [26]:
document = nlp(
    "One year ago, I visited the Eiffel Tower with Jeff in Paris, France."
    )

In [27]:
displacy.render(document, style='ent', jupyter=True)

In [28]:
spacy.explain("FAC")

'Buildings, airports, highways, bridges, etc.'

## Case Study: Restaurant Reviews

> "We will examine [this Kaggle dataset](https://www.kaggle.com/vigneshwarsofficial/reviews), consisting of 1,000 [restaurant] reviews labeled by sentiment."

In [29]:
import pandas as pd

pd.set_option('display.max_colwidth', 100)

In [30]:
url = 'http://bit.ly/375FDrO'  # Kaggle dataset

df = pd.read_csv(url, sep='\t')

In [31]:
df.shape

(1000, 2)

In [32]:
df.columns = ['text', 'rating']

In [33]:
df.head()

| | text | rating |
|---|---|---|
| 0 | Wow... Loved this place. | 1 |
| 1 | Crust is not good. | 0 |
| 2 | Not tasty and the texture was just nasty. | 0 |
| 3 | Stopped by during the late May bank holiday off Rick Steve recommendation and loved it. | 1 |
| 4 | The selection on the menu was great and so were the prices. | 1 |


### Pipelines

> "We will now use spaCy's `pipe` method in order to process multiple documents in one go."

In [34]:
df['spacy_doc'] = list(nlp.pipe(df.text))

In [35]:
df.head()

| | text | rating | spacy_doc |
|---|---|---|---|
| 0 | Wow... Loved this place. | 1 | (Wow, ..., Loved, this, place, .) |
| 1 | Crust is not good. | 0 | (Crust, is, not, good, .) |
| 2 | Not tasty and the texture was just nasty. | 0 | (Not, tasty, and, the, texture, was, just, nasty, .) |
| 3 | Stopped by during the late May bank holiday off Rick Steve recommendation and loved it. | 1 | (Stopped, by, during, the, late, May, bank, holiday, off, Rick, Steve, recommendation, and, love... |
| 4 | The selection on the menu was great and so were the prices. | 1 | (The, selection, on, the, menu, was, great, and, so, were, the, prices, .) |
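
When each text has a label that should stay attached to its processed document, `pipe` can also take `(text, context)` pairs. The following is a small sketch of that option using the first two reviews from the table above. The blank pipeline is a stand-in so the block runs without a model, and `batch_size=50` is an arbitrary illustrative value.

```python
import spacy

nlp_blank = spacy.blank("en")  # stand-in for the loaded model in this sketch

# First two (text, rating) rows copied from the table above
rows = [
    ("Wow... Loved this place.", 1),
    ("Crust is not good.", 0),
]

# With as_tuples=True, pipe() accepts (text, context) pairs
# and yields (doc, context) pairs in the same order.
docs_with_ratings = list(nlp_blank.pipe(rows, as_tuples=True, batch_size=50))

for review_doc, rating in docs_with_ratings:
    print(rating, len(review_doc), [token.text for token in review_doc])
```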


### Parts of Speech by Sentiment

> "Splitting the information by sentiment..."

In [36]:
positive_reviews = df[df.rating==1]
negative_reviews = df[df.rating==0]

> "What are the most common adjectives used in positive versus negative reviews?... Let's [also] check the nouns..."

In [37]:
pos_adj = [token.text.lower() for doc in positive_reviews.spacy_doc for token in doc if token.pos_=='ADJ']
neg_adj = [token.text.lower() for doc in negative_reviews.spacy_doc for token in doc if token.pos_=='ADJ']

pos_noun = [token.text.lower() for doc in positive_reviews.spacy_doc for token in doc if token.pos_=='NOUN']
neg_noun = [token.text.lower() for doc in negative_reviews.spacy_doc for token in doc if token.pos_=='NOUN']

In [38]:
from collections import Counter

In [39]:
Counter(pos_adj).most_common(10)

[('good', 72),
 ('great', 68),
 ('friendly', 24),
 ('amazing', 22),
 ('nice', 21),
 ('delicious', 20),
 ('best', 18),
 ('awesome', 12),
 ('first', 11),
 ('fantastic', 11)]

In [40]:
Counter(neg_adj).most_common(10)

[('good', 22),
 ('bad', 17),
 ('worst', 13),
 ('disappointed', 13),
 ('slow', 11),
 ('more', 10),
 ('other', 10),
 ('better', 10),
 ('terrible', 10),
 ('bland', 9)]

In [41]:
Counter(pos_noun).most_common(10)

[('food', 59),
 ('place', 56),
 ('service', 46),
 ('time', 22),
 ('restaurant', 17),
 ('staff', 15),
 ('menu', 11),
 ('experience', 11),
 ('steak', 10),
 ('prices', 9)]

In [42]:
Counter(neg_noun).most_common(10)

[('food', 66),
 ('place', 48),
 ('service', 37),
 ('time', 19),
 ('minutes', 19),
 ('flavor', 10),
 ('times', 9),
 ('salad', 8),
 ('restaurant', 8),
 ('quality', 8)]
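
Several words appear in both the positive and the negative top-10 lists, which limits what raw counts can say about sentiment. The sketch below makes the overlap explicit using set operations on the counts printed above. The dictionaries are copied from that output; the helper `compare_top` is a hypothetical name introduced here.

```python
# Top-10 counts copied from the Counter output above
pos_adj_top = {"good": 72, "great": 68, "friendly": 24, "amazing": 22, "nice": 21,
               "delicious": 20, "best": 18, "awesome": 12, "first": 11, "fantastic": 11}
neg_adj_top = {"good": 22, "bad": 17, "worst": 13, "disappointed": 13, "slow": 11,
               "more": 10, "other": 10, "better": 10, "terrible": 10, "bland": 9}
pos_noun_top = {"food": 59, "place": 56, "service": 46, "time": 22, "restaurant": 17,
                "staff": 15, "menu": 11, "experience": 11, "steak": 10, "prices": 9}
neg_noun_top = {"food": 66, "place": 48, "service": 37, "time": 19, "minutes": 19,
                "flavor": 10, "times": 9, "salad": 8, "restaurant": 8, "quality": 8}


def compare_top(pos, neg):
    """Split two top-N dicts into shared, positive-only, and negative-only words."""
    shared = sorted(pos.keys() & neg.keys())
    only_pos = sorted(pos.keys() - neg.keys())
    only_neg = sorted(neg.keys() - pos.keys())
    return shared, only_pos, only_neg


shared_adj, only_pos_adj, only_neg_adj = compare_top(pos_adj_top, neg_adj_top)
shared_noun, only_pos_noun, only_neg_noun = compare_top(pos_noun_top, neg_noun_top)

print("Adjectives in both lists:", shared_adj)
print("Nouns in both lists:", shared_noun)
print("Negative-only nouns:", only_neg_noun)
```

Half of the top nouns are shared between the two classes, so the nouns alone say little about sentiment; the words attached to them carry more of the signal.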

### Dependency Parsing

> "For a given noun of interest, extract each of the adjectival modifiers that are among its children tokens..."

In [43]:
from spacy.symbols import amod
from pprint import pprint

In [44]:
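def get_amods(noun, ser):
    """Collect lowercased adjectival modifiers of `noun` across a Series of spaCy docs."""
    amod_list = []
    for doc in ser:
        for token in doc:
            # Exact, case-sensitive match on the token text
            if token.text == noun:
                for child in token.children:
                    # child.dep is an integer ID, compared against the amod symbol
                    if child.dep == amod:
                        amod_list.append(child.text.lower())
    return sorted(amod_list)

def amods_by_sentiment(noun):
    """Print the adjectival modifiers of `noun` for positive and negative reviews."""
    print(f"Adjectives describing {noun.upper()}:\n")

    print("POSITIVE:")
    pprint(get_amods(noun, positive_reviews.spacy_doc))

    print("\nNEGATIVE:")
    pprint(get_amods(noun, negative_reviews.spacy_doc))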

In [45]:
amods_by_sentiment("food")

Adjectives describing FOOD:

POSITIVE:
['amazing',
 'authentic',
 'authentic',
 'delicious',
 'fantastic',
 'good',
 'good',
 'good',
 'good',
 'great',
 'great',
 'great',
 'great',
 'great',
 'great',
 'great',
 'healthy',
 'impeccable',
 'mexican',
 'phenomenal',
 'tasty',
 'typical']

NEGATIVE:
['authentic',
 'bad',
 'bad',
 'bad',
 'better',
 'bland',
 'blandest',
 'familiar',
 'good',
 'good',
 'good',
 'mediocre',
 'mediocre',
 'mediocre',
 'such']


In [46]:
amods_by_sentiment("service")

Adjectives describing SERVICE:

POSITIVE:
['awesome',
 'awesome',
 'best',
 'clean',
 'excellent',
 'excellent',
 'fantastic',
 'fantastic',
 'friendly',
 'friendly',
 'good',
 'good',
 'great',
 'great',
 'great',
 'great',
 'great',
 'great',
 'perfect',
 'speedy']

NEGATIVE:
['atrocious',
 'awful',
 'bad',
 'better',
 'customer',
 'downright',
 'little',
 'poor',
 'poor',
 'rude',
 'slow',
 'spotty',
 'terrible',
 'terrible',
 'worst',
 'worst']
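
The sorted lists above repeat an adjective once per occurrence, which gets hard to scan as they grow. One option, sketched here, is to collapse the duplicates with `collections.Counter`. The list is copied from the positive FOOD output above; the variable names are illustrative.

```python
from collections import Counter

# Positive adjectives for "food", copied from the output above
food_positive = (
    ["amazing", "authentic", "authentic", "delicious", "fantastic"]
    + ["good"] * 4
    + ["great"] * 7
    + ["healthy", "impeccable", "mexican", "phenomenal", "tasty", "typical"]
)

# Collapse repeated adjectives into (adjective, count) pairs
food_positive_counts = Counter(food_positive)

for adjective, count in food_positive_counts.most_common(3):
    print(f"{adjective}: {count}")
```

The same call could wrap the return value of `get_amods` directly, for example `Counter(get_amods("food", positive_reviews.spacy_doc))`.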
