<a href="https://colab.research.google.com/github/mustafabozkaya/NLP_Notebooks/blob/master/spacy/Spacy_Tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# An Introduction to Natural Language in Python using spaCy

## Introduction

This tutorial provides a brief introduction to working with natural language (sometimes called "text analytics") in Python, using [spaCy](https://spacy.io/) and related libraries.
Data science teams in industry must work with lots of text, one of the top four categories of data used in machine learning.
Usually that's human-generated text, although not always.

Think about it: how does the "operating system" for business work? Typically, there are contracts (sales contracts, work agreements, partnerships), there are invoices, there are insurance policies, there are regulations and other laws, and so on.
All of those are represented as text.

You may run across a few acronyms: _natural language processing_ (NLP), _natural language understanding_ (NLU), _natural language generation_ (NLG) — which are roughly speaking "read text", "understand meaning", "write text" respectively.
Increasingly these tasks overlap and it becomes difficult to categorize any given feature.

The _spaCy_ framework, along with a growing range of plug-ins and other integrations, provides features for a wide range of natural language tasks.
It has become one of the most widely used natural language libraries in Python for industry use cases and has quite a large community, which brings much support for commercializing research advances as this area continues to evolve rapidly.

## Getting Started

Check out the excellent _spaCy_ [installation notes](https://spacy.io/usage) for a "configurator" which generates installation commands based on which platforms and natural languages you need to support.

Some people tend to use `pip` while others use `conda`, and there are instructions for both.  For example, to get started with _spaCy_ working with text in English and installed via `conda` on a Linux system:
```
conda install -c conda-forge spacy
python -m spacy download en_core_web_sm
```

BTW, the second line above downloads language resources (models, etc.), and the `_sm` at the end of the download's name indicates a "small" model. There are also "medium" and "large" models, albeit those are quite large. Some of the more advanced features depend on the latter, although we won't quite be diving to the bottom of that ocean in this (brief) tutorial.

Now let's load _spaCy_ and run some code:

In [None]:
import spacy

nlp = spacy.load("en_core_web_sm")

That `nlp` variable is now your gateway to all things _spaCy_ and loaded with the `en_core_web_sm` small model for English.
Next, let's run a small "document" through the natural language parser:

In [None]:
text = "The rain in Spain falls mainly on the plain."
doc = nlp(text)

for token in doc:
    print(token.text, token.lemma_, token.pos_, token.is_stop)

First we created a [doc](https://spacy.io/api/doc) from the text, which is a container for a document and all of its annotations. Then we iterated through the document to see what _spaCy_ had parsed.

Good, but it's a lot of info and a bit difficult to read. Let's reformat the _spaCy_ parse of that sentence as a [pandas](https://pandas.pydata.org/) dataframe:

In [None]:
import pandas as pd

cols = ("text", "lemma", "POS", "explain", "stopword")
rows = []

for t in doc:
    row = [t.text, t.lemma_, t.pos_, spacy.explain(t.pos_), t.is_stop]
    rows.append(row)

df = pd.DataFrame(rows, columns=cols)

df

Much more readable!
In this simple case, the entire document is merely one short sentence.
For each word in that sentence _spaCy_ has created a [token](https://spacy.io/api/token), and we accessed fields in each token to show:

 - raw text
 - [lemma](https://en.wikipedia.org/wiki/Lemma_(morphology)) – a root form of the word
 - [part of speech](https://en.wikipedia.org/wiki/Part_of_speech)
 - a flag for whether the word is a _stopword_ – i.e., a common word that may be filtered out

Next let's use the [displaCy](https://ines.io/blog/developing-displacy) library to visualize the parse tree for that sentence:

In [None]:
from spacy import displacy

displacy.render(doc, style="dep", jupyter=True)

Does that bring back memories of grade school? Frankly, for those of us coming from more of a computational linguistics background, that diagram sparks joy.

But let's back up for a moment. How do you handle multiple sentences?

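There are features for _sentence boundary detection_ (SBD) – also known as _sentence segmentation_. With a trained pipeline such as `en_core_web_sm`, the boundaries come from the dependency parser by default, and a rule-based [sentencizer](https://spacy.io/api/sentencizer) is available as a simpler alternative: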

In [None]:
text = "We were all out at the zoo one day, I was doing some acting, walking on the railing of the gorilla exhibit. I fell in. Everyone screamed and Tommy jumped in after me, forgetting that he had blueberries in his front pocket. The gorillas just went wild."

doc = nlp(text)

for sent in doc.sents:
    print(">", sent)

When _spaCy_ creates a document, it uses a principle of _non-destructive tokenization_, meaning that the tokens, sentences, etc., are simply indexes into a long array. In other words, _spaCy_ doesn't carve the text stream into little pieces. So each sentence is a [span](https://spacy.io/api/span) with a _start_ and an _end_ index into the document array:

In [None]:
for sent in doc.sents:
    print(">", sent.start, sent.end)

We can index into the document array to pull out the tokens for one sentence:

In [None]:
doc[48:54]

Or simply index into a specific token, such as the verb `went` in the last sentence:

In [None]:
token = doc[51]
print(token.text, token.lemma_, token.pos_)
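
To make the idea of non-destructive tokenization concrete, here is a toy sketch in plain Python. It is not how _spaCy_ is implemented: the regular expression, the naive "split after a period" rule, and the names `token_offsets`, `sentence_spans` and `span_text` are all assumptions made up for illustration. The point is only that tokens and sentences can be stored as index pairs while the original string stays untouched:

```python
import re

text = "I fell in. The gorillas just went wild."

# Each token is a (start_char, end_char) pair into `text`; nothing is copied or altered.
token_offsets = [m.span() for m in re.finditer(r"\w+|[^\w\s]", text)]

# Each sentence is a (start, end) pair of token indexes, split naively after ".".
sentence_spans = []
first = 0
for i, (a, b) in enumerate(token_offsets):
    if text[a:b] == ".":
        sentence_spans.append((first, i + 1))
        first = i + 1

def span_text(start, end):
    """Recover the exact original text covered by tokens start..end-1."""
    return text[token_offsets[start][0]:token_offsets[end - 1][1]]

for start, end in sentence_spans:
    print(start, end, repr(span_text(start, end)))
```

Because every span is just a pair of indexes, slicing a sentence out of the document never loses whitespace or punctuation from the source text.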

At this point we can parse a document, segment that document into sentences, then look at annotations about the tokens in each sentence. That's a good start.

## Acquiring Text

Now that we can parse texts, where do we get texts?
One quick source is to leverage the interwebs.
Of course when we download web pages we'll get HTML, and then need to extract text from them.
[Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) is a popular package for that.

First, a little housekeeping:

In [None]:
import sys
import warnings

warnings.filterwarnings("ignore")

### Character Encoding

The following shows examples of how to use [codecs](https://docs.python.org/3/library/codecs.html) and [normalize unicode](https://docs.python.org/3/library/unicodedata.html#unicodedata.normalize). NB: the example text comes from the article "[Metal umlaut](https://en.wikipedia.org/wiki/Metal_umlaut)".

In [None]:
x = "Rinôçérôse screams ﬂow not unlike an encyclopædia, \
'TECHNICIÄNS ÖF SPÅCE SHIP EÅRTH THIS IS YÖÜR CÄPTÅIN SPEÄKING YÖÜR ØÅPTÅIN IS DEA̋D' to Spın̈al Tap."

type(x)

The variable `x` is a *string* in Python:

In [None]:
repr(x)

Its translation into [ASCII](http://www.asciitable.com/) is unusable by parsers:

In [None]:
ascii(x)

Encoding as [UTF-8](http://unicode.org/faq/utf_bom.html) doesn't help much:

In [None]:
x.encode('utf8')

Ignoring difficult characters is perhaps an even worse strategy:

In [None]:
x.encode('ascii', 'ignore')

However, one can *normalize* text, then encode…

In [None]:
import unicodedata

unicodedata.normalize('NFKD', x).encode('ascii','ignore')
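
Why `NFKD` in particular? A small sketch may help: it runs one made-up word through the four standard normalization forms and shows what survives an ASCII encode with `ignore`. The sample word and the `results` dictionary are just illustrative choices:

```python
import unicodedata

# "flambé" written with the single-character ligature U+FB02 and a precomposed é (U+00E9)
word = "\ufb02amb\u00e9"

results = {}
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    normalized = unicodedata.normalize(form, word)
    # count code points, then see what survives an ASCII encode that ignores errors
    results[form] = (len(normalized), normalized.encode("ascii", "ignore"))
    print(form, results[form])
```

Only the compatibility decomposition (`NFKD`) both splits the ligature into `f` + `l` and separates the accent from its base letter, so the ASCII encode keeps every base character and drops just the combining mark.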

Even before this normalization and encoding, you may need to convert some characters explicitly **before** parsing. For example:

In [None]:
x = "The sky “above” the port … was the color of ‘cable television’ – tuned to the Weather Channel®"

ascii(x)

Consider the results for that line:

In [None]:
unicodedata.normalize('NFKD', x).encode('ascii', 'ignore')

...which still drops characters that may be important for parsing a sentence.

So a more advanced approach could be:

In [None]:
x = x.replace('“', '"').replace('”', '"')
x = x.replace("‘", "'").replace("’", "'")
x = x.replace('…', '...').replace('–', '-')

x = unicodedata.normalize('NFKD', x).encode('ascii', 'ignore').decode('utf-8')
print(x)

### Parsing HTML

In the following function `get_text()` we'll parse the HTML to find all of the `<p>` tags, then extract the text for those:

In [None]:
from bs4 import BeautifulSoup
import requests
import traceback

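def get_text(url):
    """Download a page and return the text of its <p> elements, one per line."""
    try:
        response = requests.get(url, timeout=30)
        response.raise_for_status()  # fail loudly on HTTP errors such as 404
        soup = BeautifulSoup(response.text, "html.parser")

        # collect the text of every paragraph element
        return "\n".join(p.get_text() for p in soup.find_all("p"))
    except Exception:
        print(traceback.format_exc())
        sys.exit(-1)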
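
If you want to sanity-check the paragraph-extraction idea without a network connection, here is a minimal sketch on an inline HTML snippet. Note that it swaps Beautiful Soup for Python's standard-library `html.parser`, so it is an alternative illustration rather than what `get_text()` does internally. The class name `ParagraphText` and the sample markup are made up for this example:

```python
from html.parser import HTMLParser

class ParagraphText(HTMLParser):
    """Collect the text found inside <p> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # how many <p> elements we are currently inside
        self.paragraphs = []  # finished paragraph strings
        self._buf = []        # text pieces of the paragraph being read

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.paragraphs.append("".join(self._buf))
                self._buf = []

    def handle_data(self, data):
        # keep text only while inside a paragraph (including nested inline tags)
        if self.depth > 0:
            self._buf.append(data)

sample_html = (
    "<html><body><h1>Title</h1>"
    "<p>First <b>bold</b> paragraph.</p>"
    "<div>skip me</div>"
    "<p>Second.</p></body></html>"
)

parser = ParagraphText()
parser.feed(sample_html)
parser.close()

paragraph_text = "\n".join(parser.paragraphs)
print(paragraph_text)
```

The heading and the `<div>` are skipped, while text inside inline tags such as `<b>` stays part of its paragraph.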

Now let's grab some text from online sources.
We can compare open source licenses hosted on the [Open Source Initiative](https://opensource.org/licenses/) site:

In [None]:
lic = {}
lic["mit"] = nlp(get_text("https://opensource.org/licenses/MIT"))
lic["asl"] = nlp(get_text("https://opensource.org/licenses/Apache-2.0"))
lic["bsd"] = nlp(get_text("https://opensource.org/licenses/BSD-3-Clause"))

for sent in lic["bsd"].sents:
    print(">", sent)

One common use case for natural language work is to compare texts. For example, with those open source licenses we can download their text, parse, then compare [similarity](https://spacy.io/api/doc#similarity) metrics among them:

In [None]:
pairs = [
    ["mit", "asl"],
    ["asl", "bsd"],
    ["bsd", "mit"]
]

for a, b in pairs:
    print(a, b, lic[a].similarity(lic[b]))

That is interesting, since the [BSD](https://opensource.org/licenses/BSD-3-Clause) and [MIT](https://opensource.org/licenses/MIT) licenses appear to be the most similar documents.
In fact they are closely related.

Admittedly, there was some extra text included in each document due to the OSI disclaimer in the footer – but this provides a reasonable approximation for comparing the licenses.
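
For intuition about what the score means: by default, `Doc.similarity` estimates similarity as the cosine between document vectors, where a document vector is an average of token vectors. The sketch below reproduces that arithmetic in plain Python. The three-dimensional vectors are made-up placeholders, not real embeddings, and `mean_vector` and `cosine` are hypothetical helper names:

```python
import math

def mean_vector(vectors):
    """Average a list of equal-length vectors component by component."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

def cosine(u, v):
    """Cosine of the angle between u and v; values near 1.0 mean similar direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# made-up token vectors for two tiny "documents"
doc_a = [[1.0, 0.0, 1.0], [1.0, 2.0, 1.0]]
doc_b = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]

vec_a = mean_vector(doc_a)  # [1.0, 1.0, 1.0]
vec_b = mean_vector(doc_b)  # [1.0, 0.0, 0.0]

score = cosine(vec_a, vec_b)
print(score)
```

Since averaging washes out word order, two documents with similar vocabulary can score high even when they say different things, which is worth keeping in mind when reading the license comparison.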

## Natural Language Understanding

Now let's dive into some of the _spaCy_ features for NLU.
Given that we have a parse of a document, from a purely grammatical standpoint we can pull the [noun chunks](https://spacy.io/usage/linguistic-features#noun-chunks), i.e., each of the noun phrases:

In [None]:
text = "Steve Jobs and Steve Wozniak incorporated Apple Computer on January 3, 1977, in Cupertino, California."
doc = nlp(text)

for chunk in doc.noun_chunks:
    print(chunk.text)

Not bad. The noun phrases in a sentence generally carry much of its information content, so extracting them works as a simple filter for reducing a long document into a more "distilled" representation.

We can take this approach further and identify [named entities](https://spacy.io/usage/linguistic-features#named-entities) within the text, i.e., the proper nouns:

In [None]:
for ent in doc.ents:
    print(ent.text, ent.label_)

The _displaCy_ library provides an excellent way to visualize named entities:

In [None]:
displacy.render(doc, style="ent", jupyter=True)

If you're working with [knowledge graph](https://www.akbc.ws/) applications and other [linked data](http://linkeddata.org/), your challenge is to construct links between the named entities in a document and other related information for the entities – which is called [entity linking](http://nlpprogress.com/english/entity_linking.html).
Identifying the named entities in a document is the first step in this particular kind of AI work.
For example, given the text above, one might link the `Steve Wozniak` named entity to a [lookup in DBpedia](http://dbpedia.org/page/Steve_Wozniak).
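
As a deliberately naive illustration of that last step, the sketch below turns an entity's text into a candidate DBpedia page URL by following the pattern of the link above. The helper name `dbpedia_candidate_url` is hypothetical, the `(text, label)` pairs are hand-written stand-ins for recognizer output, and real entity linking also has to disambiguate between entities that share a name, which this ignores:

```python
from urllib.parse import quote

DBPEDIA_PAGE = "http://dbpedia.org/page/"

def dbpedia_candidate_url(entity_text):
    """Guess a DBpedia page URL from an entity's text (no lookup, no disambiguation)."""
    resource = "_".join(entity_text.split())  # join the words with underscores
    return DBPEDIA_PAGE + quote(resource)

# hand-written (text, label) pairs standing in for named entity output
entities = [("Steve Jobs", "PERSON"), ("Steve Wozniak", "PERSON"), ("Cupertino", "GPE")]

for ent_text, ent_label in entities:
    print(ent_label, ent_text, dbpedia_candidate_url(ent_text))
```

A guessed URL like this is only a candidate; whether the page exists and refers to the intended entity still has to be checked.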

In more general terms, one can also link _lemmas_ to resources that describe their meanings.
For example, in an earlier section we parsed the sentence `The gorillas just went wild` and were able to show that the lemma for the word `went` is the verb `go`. At this point we can use a venerable project called [WordNet](https://wordnet.princeton.edu/) which provides a lexical database for English – in other words, it's a computable thesaurus.

There's a _spaCy_ integration for WordNet called
[spacy-wordnet](https://github.com/recognai/spacy-wordnet) by [Daniel Vila Suero](https://twitter.com/dvilasuero), an expert in natural language and knowledge graph work.

Then we'll load the WordNet data via NLTK (these things happen):

In [None]:
import nltk

nltk.download("wordnet")

Note that _spaCy_ runs as a "pipeline" and provides ways to customize the parts of the pipeline in use.
That's excellent for supporting really interesting workflow integrations in data science work.
Here we'll add the `WordnetAnnotator` from the _spacy-wordnet_ project:

In [None]:
!pip install spacy-wordnet

In [None]:
from spacy_wordnet.wordnet_annotator import WordnetAnnotator

print("before", nlp.pipe_names)

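# spaCy 3.x adds components by their registered name; this assumes a
# spacy-wordnet release whose import above registers "spacy_wordnet"
# (on spaCy 2.x: nlp.add_pipe(WordnetAnnotator(nlp.lang), after="tagger"))
if "spacy_wordnet" not in nlp.pipe_names:
    nlp.add_pipe("spacy_wordnet", after="tagger")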

print("after", nlp.pipe_names)

Within the English language, some words are infamous for having many possible meanings. For example, click through the results online in a [WordNet](http://wordnetweb.princeton.edu/perl/webwn?s=star&sub=Search+WordNet&o2=&o0=1&o8=1&o1=1&o7=&o5=&o9=&o6=&o3=&o4=&h=) search to find the meanings related to the word `withdraw`.

Now let's use _spaCy_ to perform that lookup automatically:

In [None]:
token = nlp("withdraw")[0]
token._.wordnet.synsets()

In [None]:
token._.wordnet.lemmas()

In [None]:
token._.wordnet.wordnet_domains()

Again, if you're working with knowledge graphs, those "word sense" links from WordNet could be used along with graph algorithms to help identify the meanings for a particular word. That can also be used to develop summaries for larger sections of text through a technique called _summarization_. It's beyond the scope of this tutorial, but it is currently an interesting application for natural language in industry.

Going in the other direction, if you know _a priori_ that a document is about a particular domain or set of topics, then you can constrain the meanings returned from _WordNet_. In the following example, we want to consider NLU results that are within Finance and Banking:

In [None]:
domains = ["finance", "banking"]
sentence = nlp("I want to withdraw 5,000 euros.")

enriched_sent = []

for token in sentence:
    # get synsets within the desired domains
    synsets = token._.wordnet.wordnet_synsets_for_domain(domains)

    if synsets:
        lemmas_for_synset = []

        for s in synsets:
            # collect the lemma names (variants) for every matching synset
            lemmas_for_synset.extend(s.lemma_names())

        # add the de-duplicated variants once per token
        enriched_sent.append("({})".format("|".join(set(lemmas_for_synset))))
    else:
        enriched_sent.append(token.text)

print(" ".join(enriched_sent))

That example may look simple but, if you play with the `domains` list, you'll find that the results have a kind of combinatorial explosion when run without reasonable constraints.
Imagine having a knowledge graph with millions of elements: you'd want to constrain searches where possible to avoid having every query take days/weeks/months/years to compute.
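
A rough way to see the explosion: if each word in a sentence is free to take any of its senses, the number of candidate readings is the product of the per-word sense counts. The counts below are invented for illustration and are not taken from WordNet, and `candidate_readings` is a hypothetical helper:

```python
import math

# hypothetical number of senses per word: unconstrained vs. restricted to a few domains
unconstrained = {"I": 3, "want": 9, "withdraw": 12, "euros": 2}
constrained = {"I": 1, "want": 1, "withdraw": 2, "euros": 1}

def candidate_readings(sense_counts):
    """Number of ways to pick exactly one sense for every word."""
    return math.prod(sense_counts.values())

print(candidate_readings(unconstrained))  # 3 * 9 * 12 * 2
print(candidate_readings(constrained))    # 1 * 1 * 2 * 1
```

Even for a four-word sentence the product grows quickly, which is why constraining the domains up front pays off.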

Sometimes the problems encountered when trying to understand a text – or better yet when trying to understand a _corpus_ (a dataset with many related texts) – become so complex that you need to visualize it first.
Here's an interactive visualization for understanding texts: [scattertext](https://spacy.io/universe/project/scattertext), a product of the genius of [Jason Kessler](https://twitter.com/jasonkessler).
To install:

```
conda install -c conda-forge scattertext
```

Let's analyze text data from the party conventions during the 2012 US Presidential elections. It may take a minute or two to run, but the results from all that number crunching are worth the wait.

In [None]:
!pip install scattertext

In [None]:
import scattertext as st

# spaCy 3.x adds built-in components by name
# (on spaCy 2.x: nlp.add_pipe(nlp.create_pipe("merge_entities")))
if "merge_entities" not in nlp.pipe_names:
    nlp.add_pipe("merge_entities")

if "merge_noun_chunks" not in nlp.pipe_names:
    nlp.add_pipe("merge_noun_chunks")

convention_df = st.SampleCorpora.ConventionData2012.get_data()
corpus = st.CorpusFromPandas(convention_df,
                             category_col="party",
                             text_col="text",
                             nlp=nlp).build()

Once you have the `corpus` ready, generate an interactive visualization in HTML:

In [None]:
html = st.produce_scattertext_explorer(
    corpus,
    category="democrat",
    category_name="Democratic",
    not_category_name="Republican",
    width_in_pixels=1000,
    metadata=convention_df["speaker"]
)

Now we'll render the HTML. Give it a minute or two to load; it's worth the wait...

In [None]:
# display and HTML live in IPython.display; IPython.core.display is deprecated for this
from IPython.display import IFrame, display, HTML
import sys

IN_COLAB = "google.colab" in sys.modules
print(IN_COLAB)

**NB: use the following cell on Google Colab:**

In [None]:
if IN_COLAB:
    display(HTML("<style>.container { width:98% !important; }</style>"))
    display(HTML(html))

**NB: use the following cell instead on Jupyter in general:**

In [None]:
file_name = "foo.html"

with open(file_name, "wb") as f:
    f.write(html.encode("utf-8"))

IFrame(src=file_name, width = 1200, height=700)

Imagine if you had text from the past three years of customer support for a particular product in your organization. Suppose your team needed to understand how customers have been talking about the product. This _scattertext_ library might come in quite handy! You could cluster (k=2) on _NPS scores_ (a customer evaluation metric), then replace the Democrat/Republican dimension with the top two components from the clustering.

## Summary

Five years ago, if you’d asked about open source in Python for natural language, a default answer from many people working in data science would've been [NLTK](https://www.nltk.org/).
That project includes just about everything but the kitchen sink and has components which are relatively academic.
Another popular natural language project is [CoreNLP](https://stanfordnlp.github.io/CoreNLP/) from Stanford.
It is also quite academic, and although powerful, _CoreNLP_ can be challenging to integrate with other software for production use.

Then a few years ago everything in this natural language corner of the world began to change.
The two principal authors for _spaCy_ -- [Matthew Honnibal](https://twitter.com/honnibal) and [Ines Montani](https://twitter.com/_inesmontani) -- launched the project in 2015 and industry adoption was rapid.
They focused on an _opinionated_ approach (do what's needed, do it well, no more, no less) which provided simple, rapid integration into data science workflows in Python, as well as faster execution and better accuracy than the alternatives.
Based on those priorities, _spaCy_ became sort of the opposite of _NLTK_.
Since 2015, _spaCy_ has consistently focused on being an open source project (i.e., depending on its community for directions, integrations, etc.) and being commercial-grade software (not academic research).
That said, _spaCy_ has been quick to incorporate the SOTA advances in machine learning, effectively becoming a conduit for moving research into industry.

It's important to note that machine learning for natural language got a big boost during the mid-2000s as Google began to win international language translation competitions.
Another big change occurred during 2017-2018 when, following the many successes of _deep learning_, those approaches began to out-perform previous machine learning models.
For example, see the [ELMo](https://arxiv.org/abs/1802.05365) work on _language embedding_ by Allen AI, followed by [BERT](https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html) from Google, and more recently [ERNIE](https://medium.com/syncedreview/baidus-ernie-tops-google-s-bert-in-chinese-nlp-tasks-d6a42b49223d) by Baidu -- in other words, the search engine giants of the world have gifted the rest of us with a Sesame Street repertoire of open source embedded language models based on deep learning, which is now _state of the art_ (SOTA).
Speaking of which, to keep track of SOTA for natural language, keep an eye on [NLP-Progress](http://nlpprogress.com/) and [Papers with Code](https://paperswithcode.com/sota).

The use cases for natural language have shifted dramatically over the past two years, after deep learning techniques came to the fore.
Circa 2014, a natural language tutorial in Python might have shown _word count_ or _keyword search_ or _sentiment detection_ where the target use cases were relatively underwhelming.
Circa 2019 we're talking about analyzing thousands of documents for vendor contracts in an industrial supply chain optimization ... or hundreds of millions of documents for policy holders of an insurance company, or gazillions of documents regarding financial disclosures.
More contemporary natural language work tends to be in NLU, often to support construction of _knowledge graphs,_ and increasingly in NLG where large numbers of similar documents can be summarized at human scale.

The [spaCy Universe](https://spacy.io/universe) is a great place to check for deep-dives into particular use cases, and to see how this field is evolving. Some selections from this "universe" include:

 - [Blackstone](https://spacy.io/universe/project/blackstone) – parsing unstructured legal texts
 - [Kindred](https://spacy.io/universe/project/kindred) – extracting entities from biomedical texts (e.g., Pharma)
 - [mordecai](https://spacy.io/universe/project/mordecai) – parsing geographic information
 - [Prodigy](https://spacy.io/universe/project/prodigy) – human-in-the-loop annotation to label datasets
 - [spacy-raspberry](https://spacy.io/universe/project/spacy-raspberry) – Raspberry Pi image for running _spaCy_ and deep learning on edge devices
 - [Rasa NLU](https://spacy.io/universe/project/rasa) – Rasa integration for voice apps

Also, a couple of super new items to mention:

  - [spacy-pytorch-transformers](https://explosion.ai/blog/spacy-pytorch-transformers) to fine-tune (i.e., use _transfer learning_ with) the Sesame Street characters and friends: BERT, GPT-2, XLNet, etc.
  - [spaCy IRL 2019](https://irl.spacy.io/2019/) conference – check out videos from the talks!

There's so much more that can be done with _spaCy_ – hopefully this tutorial provides an introduction. We wish you all the best in your natural language work.