# Introductory Natural Language Processing with spaCy

# 1. Visualizing spaCy's Architecture

![alt](https://d33wubrfki0l68.cloudfront.net/27b15441d6fb921213fcca5addc324d3b00a9d15/2c15d/architecture-bcdfffe5c0b9f221a2f6607f96ca0e4a.svg)

For this workshop, these are the only points we need to take away:

- The **Document** (`Doc`) is the container that holds everything else.
- The **Token** is mostly what you would expect: a word, a punctuation mark, or a run of whitespace.
- The **Span** is a sequence of tokens sliced from a Document.
- The Document is produced by the **Tokenizer** class.

## Installing the required packages

`pip3 install -U spacy`  
`python -m spacy download en_core_web_sm`

In [None]:
import spacy

In [None]:
# Load the small English model by its full package name
nlp = spacy.load('en_core_web_sm')

In [None]:
document = nlp('Hello    World!')

## Basic word tokenization

Our `document` variable is a sequence of tokens, so we can iterate through it and print each token. spaCy's tokenization is also index-preserving: every token, including whitespace tokens, records its character offset in the original text (`token.idx`). By contrast, NLTK's `word_tokenize` returns plain strings without offsets.

In [None]:
# Iterate through the tokens parsed by the document
for token in document:
    print("\"%d: %s\"" % (token.idx, token.text))
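To make "index-preserving" concrete, here is a minimal pure-Python sketch that uses the standard `re` module instead of spaCy. The `simple_tokenize` helper and its regular expression are assumptions made for illustration and do not reproduce spaCy's tokenizer, so the exact offsets may differ from the cell above. The point is only that storing each token's character offset lets us slice the token back out of the original string.

```python
import re

def simple_tokenize(text):
    """Return (offset, token) pairs for words, whitespace runs, and punctuation marks.

    Hypothetical helper for illustration; spaCy's tokenizer is far more elaborate.
    """
    # \w+ matches a word, \s+ a whitespace run, [^\w\s] a single punctuation mark
    return [(m.start(), m.group()) for m in re.finditer(r"\w+|\s+|[^\w\s]", text)]

text = "Hello    World!"
tokens = simple_tokenize(text)
for offset, token in tokens:
    # The stored offset recovers the token from the original text
    assert text[offset:offset + len(token)] == token
    print("%d: %r" % (offset, token))
```

Because nothing is thrown away, joining the tokens in order reproduces the input string exactly.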

## Exploring the attributes of the token class

Let us have a look at the word-level attributes of the `Token` class. They reveal a wealth of information about the words in the sentence.

In [None]:
document = nlp("Next week I'll   be in Madrid.")
# token.pos_ is the coarse part-of-speech tag, so the column is labelled POS
print("WORD\tINDEX\tLEMMA\tPUNC\tSPACE\tSHAPE\tPOS\tTAG")
for token in document:
    print("{0}\t{1}\t{2}\t{3}\t{4}\t{5}\t{6}\t{7}".format(
        token.text,
        token.idx,
        token.lemma_,
        token.is_punct,
        token.is_space,
        token.shape_,
        token.pos_,
        token.tag_
    ))
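The SHAPE column is the least self-explanatory of these attributes. As a rough sketch of the idea, the hypothetical `word_shape` function below maps uppercase letters to `X`, lowercase letters to `x`, and digits to `d`, and caps runs of the same symbol at four characters. It is an approximation written for this workshop, not spaCy's own code, so treat any difference from `token.shape_` as expected.

```python
def word_shape(text, max_run=4):
    """Approximate a word's orthographic shape (illustrative, not spaCy's implementation)."""
    shape = []
    last, run = "", 0
    for char in text:
        if char.isalpha():
            symbol = "X" if char.isupper() else "x"
        elif char.isdigit():
            symbol = "d"
        else:
            symbol = char  # punctuation and other characters are kept as-is
        # Count how many times in a row the same symbol has appeared
        run = run + 1 if symbol == last else 1
        last = symbol
        if run <= max_run:
            shape.append(symbol)
    return "".join(shape)

for word in ["Next", "Madrid", "2019", "I'll"]:
    print(word, word_shape(word))
```

Shapes like these let a model generalize over words it has never seen, for example by recognizing that a capitalized word may be a proper noun.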

# 2. Exploring spaCy's Toolbox

Remember the parts of the example NLP pipeline we looked at? Let's see the features that spaCy offers to streamline that process!

The processes we will look at are:

1. Sentence segmentation
2. Part-of-speech tagging ([View the codes](https://spacy.io/usage/linguistic-features))
3. Named entity recognition ([View entity codes](https://spacy.io/usage/linguistic-features#entity-types))
4. Dependency parsing
5. Visualizations using displaCy
6. Similarity based on word embeddings

In [None]:
# Sentence segmentation
document = nlp("Three French Hens.    Two turtle doves.     And a Partridge in a pear tree.")

for sentence in document.sents:
    print(sentence)

In [None]:
# Parts of speech tagging
document = nlp("Next week I'll be in Madrid. ")

print([(token.text, token.tag_) for token in document])

In [None]:
# Named entity recognition
document = nlp("Madrid is the capital of Spain. It is home to the sporting giant Real Madrid.")

for entity in document.ents:
    print("%s: %s" % (entity.text, entity.label_))

In [None]:
# Exploring the variety of entities.
document = nlp("I just bought 2 shares at 9 a.m. because the stock went up 30% in just 2 days according to the WSJ")
print("ENTITY\tLABEL")
for ent in document.ents:
    print("{0}\t{1}".format(
        ent.text, 
        ent.label_
    ))

In [None]:
# Visualizing entity recognition using displacy
from spacy import displacy

In [None]:
document = nlp('I just bought 2 shares at 9 a.m. because the stock went up 30% in just 2 days according to the WSJ')
displacy.render(document, style='ent', jupyter=True)

In [None]:
# Noun phrases and chunking
document = nlp("Wall Street Journal just published an interesting piece on crypto currencies")
print("TEXT\t\t\tLABEL\tROOT")
for chunk in document.noun_chunks:
    print("%s\t%s\t%s" % (chunk.text, chunk.label_, chunk.root.text))

In [None]:
# Dependency Parsing
document = nlp('Wall Street Journal just published an interesting piece on crypto currencies')

for token in document:
    print("{0}/{1} <--{2}-- {3}/{4}".format(
        token.text, 
        token.tag_, 
        token.dep_, 
        token.head.text, 
        token.head.tag_
    ))
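The loop above relies on two facts: every token points to exactly one head (`token.head`), and the relation to that head has a label (`token.dep_`). That is enough to encode a whole tree. The sketch below shows the same idea with plain Python lists. The sentence, head indices, and labels are written by hand for illustration and are not output from a spaCy model; the helper names `path_to_root` and `children` are likewise hypothetical. The root is represented as a word whose head is itself.

```python
# Hand-written toy parse of "She ate green apples" (not model output)
words = ["She", "ate", "green", "apples"]
heads = [1, 1, 3, 1]  # index of each word's head; the root points to itself
deps = ["nsubj", "ROOT", "amod", "dobj"]

def path_to_root(index):
    """Follow head pointers from a word up to the root of the tree."""
    path = [words[index]]
    while heads[index] != index:
        index = heads[index]
        path.append(words[index])
    return path

def children(index):
    """Return the words whose head is the word at `index`."""
    return [words[i] for i, head in enumerate(heads) if head == index and i != index]

for i, word in enumerate(words):
    print("{0} <--{1}-- {2}".format(word, deps[i], words[heads[i]]))

print(path_to_root(2))
print(children(1))
```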

In [None]:
# Visualizing the dependency parsing
displacy.render(document, style='dep', jupyter=True, options={'distance': 90})

## Word Vectors and Similarity

Explaining the theory behind word vectors (also known as word embeddings) is beyond the scope of this workshop. Here are a few properties that word vectors have:

- If two words are similar, they appear in similar contexts
- Word vectors are computed taking into account the context (surrounding words)
- Given the two previous observations, similar words should have similar word vectors
- Using vectors we can derive relationships between words

Additionally, we will need a larger English model for this. Download it using the following command:

`python -m spacy download en_core_web_lg`

In [None]:
# Word vectors
# The small model (en_core_web_sm) does not ship with word vectors, so load the large one
nlp = spacy.load('en_core_web_lg')
mango = nlp('mango')
print("Shape: %d" % mango.vector.shape[0])
print(mango.vector)

In [None]:
# Computing similarity
banana = nlp('banana')
dog = nlp('dog')
fruit = nlp('fruit')
animal = nlp('animal')
 
print(dog.similarity(animal), dog.similarity(fruit))
print(banana.similarity(fruit), banana.similarity(animal))

# Putting it all together: Language Processing Pipelines

Fundamentally, a pipeline is a list of functions called on a Doc in order. The pipeline can be set by a model, and modified by the user. A pipeline component can be a complex class that holds state, or a very simple Python function that adds something to a Doc and returns it.

![alt](https://user-images.githubusercontent.com/13643239/55229632-dbff9480-521d-11e9-8499-efb2a9c948db.png)

In [None]:
nlp = spacy.load('en_core_web_sm')
print("NLP Pipe: ", nlp.pipe_names)
print("Pipeline: \n", nlp.pipeline)
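The "list of functions called in order" idea can be sketched without spaCy at all. In the toy version below, a plain dictionary stands in for the `Doc`, `str.split` stands in for the tokenizer, and the component names `lowercase` and `count_tokens` are hypothetical. It only mirrors the general shape of `nlp.pipeline` (an ordered list of name and component pairs), not spaCy's internals.

```python
def lowercase(doc):
    """Component that normalizes every token to lowercase."""
    doc["tokens"] = [token.lower() for token in doc["tokens"]]
    return doc

def count_tokens(doc):
    """Component that adds a token count to the document."""
    doc["n_tokens"] = len(doc["tokens"])
    return doc

# Ordered list of (name, component) pairs
pipeline = [("lowercase", lowercase), ("count_tokens", count_tokens)]

def run_pipeline(text, pipeline):
    doc = {"text": text, "tokens": text.split()}  # stand-in for the tokenizer
    for name, component in pipeline:
        # Each component receives the document and must return it
        doc = component(doc)
    return doc

result = run_pipeline("Three French Hens", pipeline)
print([name for name, _ in pipeline])
print(result)
```

Because every component takes a document and returns a document, components can be added, removed, or reordered without changing the others.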

## Custom Pipeline Components

Sometimes we want to modify the pipeline by adding our own custom components to perform specific tasks. spaCy supports this as follows:

In [None]:
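def custom_component(doc):
    # Report the size of the Doc, then pass it on unchanged
    print("The doc is {} characters long and has {} tokens.".format(len(doc.text), len(doc)))
    return doc

# This is the spaCy v2 call signature. In spaCy v3, decorate the function with
# @Language.component('print_length') (from spacy.language import Language)
# and call nlp.add_pipe('print_length', last=True) instead.
nlp.add_pipe(custom_component, name='print_length', last=True)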

In [None]:
nlp.pipeline

In [None]:
doc = nlp("This is a sentence")

# Resources and Additional Readings
- [Spacy Documentation](https://spacy.io/api/)
- [A Formal Introduction to NLP](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Solving NLP Problems using Pipelines](https://medium.com/@ageitgey/natural-language-processing-is-fun-9a0bff37854e)
- [Classifying Text using Spacy](https://www.dataquest.io/blog/tutorial-text-classification-in-python-using-spacy/)
- [Introducing Custom Components in Spacy](https://explosion.ai/blog/spacy-v2-pipelines-extensions)