# Natural Language Processing with Python
## ... and spaCy

This notebook is an exercise-based introductory demo of how to use Python for Natural Language Processing (NLP). It uses open data and the package spaCy, which comes with a lot of functionality for interacting with text data. Later on, scikit-learn is used for the machine learning parts.

Let's start by importing the packages that are used:

In [5]:
import sys

import spacy
import numpy as np
import pandas as pd
import sklearn

# Print which versions are used
print("This notebook uses the following packages (and versions):")
print("---------------------------------------------------------")
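# Take the version number from sys.version rather than a fixed-width slice,
# which would cut off versions such as 3.10.12
print("python", sys.version.split()[0])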
print('\n'.join(f'{m.__name__} {m.__version__}' for m in globals().values() if getattr(m, '__version__', None)))

This notebook uses the following packages (and versions):
---------------------------------------------------------
python 3.9.7 
spacy 3.1.3
numpy 1.21.2
pandas 1.3.4
sklearn 1.0


## Text data 
Much of what's here is adapted from the [spaCy documentation](https://spacy.io/).

Text is unstructured data, which means that we don't have something like a nice set of features (e.g. columns in a pandas DataFrame) for a set of observations (e.g. rows in that same DataFrame). The information is enclosed in human-readable text, but needs to be made quantitative in order for machine learning methods to be able to handle it. That process of getting quantitative information out of text is called NLP. spaCy will help us out.
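
To make "quantitative" concrete: the simplest possible representation counts how often each word occurs in a text (a "bag of words"). The sketch below is not part of the spaCy workflow. It uses only the standard library, and the two example texts and the naive whitespace tokenization are assumptions made for illustration.

```python
from collections import Counter

# Two toy "documents" (made up for this illustration)
texts = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# Naive tokenization: lowercase and split on whitespace
counts = [Counter(text.lower().split()) for text in texts]

# The vocabulary is the sorted set of all words seen in any text
vocabulary = sorted(set(word for c in counts for word in c))

# One row of word counts per text, one column per vocabulary word
matrix = [[c[word] for word in vocabulary] for c in counts]

print(vocabulary)
for row in matrix:
    print(row)
```

Each text is now a row of numbers, which is the kind of input a machine learning method can work with. Word order and context are lost in this representation, which is one reason to look at richer tools.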

There are many complications. In most applications, you will be after something like *the meaning*, *the context* or *the intent* of text. These can be hard to extract, and we will look at the quantification of text in steps.

Let's start with a simple example sentence:

In [29]:
sentence = "This is an example sentence by Marcel with a somewhat obvouis spelling mistake."

spaCy provides [pre-trained language models](https://spacy.io/usage/models) for a number of languages, which enable you to process "documents" (a document can be just that example sentence, or a whole collection of books). The examples below show what you can do with such "NLP models".
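
Pre-trained models are distributed separately from the spaCy package itself, so `spacy.load` raises an `OSError` when the requested model has not been installed yet. Below is a minimal sketch of a defensive load, assuming the small English model `en_core_web_sm` used in this notebook. The fallback to a blank English pipeline is a choice made for this illustration; it only tokenizes and does no tagging or parsing.

```python
import spacy

MODEL_NAME = "en_core_web_sm"

try:
    nlp = spacy.load(MODEL_NAME)
except OSError:
    # Model not installed. It can be downloaded from the command line with:
    #   python -m spacy download en_core_web_sm
    # Fall back to a blank pipeline (tokenizer only) so the code still runs.
    nlp = spacy.blank("en")

print("pipeline components:", nlp.pipe_names)
doc = nlp("A short test sentence.")
print([token.text for token in doc])
```

With the blank fallback, `nlp.pipe_names` is empty and attributes such as `token.pos_` stay unset, so the examples in this notebook need the real model.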

In [30]:
nlp = spacy.load('en_core_web_sm')

doc = nlp(sentence)
for token in doc:
    print(f"{token.text:14s} {token.pos_:6s} {token.dep_}")

This           DET    nsubj
is             AUX    ROOT
an             DET    det
example        NOUN   compound
sentence       NOUN   attr
by             ADP    prep
Marcel         PROPN  pobj
with           ADP    prep
a              DET    det
somewhat       ADV    advmod
obvouis        ADJ    amod
spelling       NOUN   compound
mistake        NOUN   pobj
.              PUNCT  punct


If you need to know what any of those abbreviations mean, you can ask `spacy.explain`:

In [31]:
spacy.explain("ADJ")

'adjective'

The token table above therefore shows that even the misspelled word "obvouis" is correctly tagged as an adjective. The interplay of words within a sentence is also known to the `doc` object:

In [32]:
from spacy import displacy
displacy.render(doc, style='dep')

spaCy recognizes that my name is a "named entity" and tries to figure out what kind of entity I am (replace my name with "Steve" and it does a little better):

In [34]:
for ent in doc.ents:
    print(f"{ent} is a {ent.label_} and appears in the sentence at position {ent.start_char}")

Marcel is a PRODUCT and appears in the sentence at position 31
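
Entity labels can be looked up with `spacy.explain` in the same way as the part-of-speech tags. A small sketch: the list of labels is chosen for illustration, with `PRODUCT` taken from the output above and the others being common labels in spaCy's English models.

```python
import spacy

labels = ["PRODUCT", "PERSON", "ORG", "DATE"]

for label in labels:
    # spacy.explain returns a short description, or None for unknown labels
    print(f"{label:8s} {spacy.explain(label)}")
```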


In [39]:
# Pass the Doc itself, so the entities are highlighted within the full sentence
displacy.render(nlp("Steve worked for Apple until January 2011"), style='ent')