<img src="https://spacy.io/static/social_default-1d3b50b1eba4c2b06244425ff0c49570.jpg" align='right' width=200>

# Natural Language Processing with Python
## ... and spaCy

This notebook is an exercise-based introductory demo of how to use Python for Natural Language Processing (NLP). It uses open data and the package spaCy, which comes with a lot of functionality for interacting with text data. Similar things can be done with packages like `nltk`. At some point, some machine learning will be done, for which scikit-learn is used. 

Let's start by importing the packages that are used:

In [1]:
# General imports
import sys, os
import numpy as np
import pandas as pd

# NLP related
import string
import regex as re
import spacy

# Machine learning
import sklearn
import gensim

# Visualisation
import matplotlib
import matplotlib.pyplot as plt

# Print which versions are used
print("This notebook uses the following packages (and versions):")
print("---------------------------------------------------------")
print("python", sys.version[:6])
print('\n'.join(f'{m.__name__} {m.__version__}' for m in globals().values() if getattr(m, '__version__', None)))

This notebook uses the following packages (and versions):
---------------------------------------------------------
python 3.9.7 
numpy 1.21.2
pandas 1.3.4
regex 2.5.103
spacy 3.1.3
sklearn 1.0
gensim 4.1.2
matplotlib 3.4.3


## Text data

Text is unstructured data, which means that we don't have something like a nice set of features (e.g. columns in a pandas DataFrame) for a set of observations (e.g. rows in that same DataFrame). The information is enclosed in human-readable text, but needs to be made quantitative before machine learning methods can handle it. Getting quantitative information out of text is a core task of NLP. SpaCy will help us out.


## Simple string operations

The first step is often to use Python's rich collection of string operations. For example, making everything lower case, removing punctuation or splitting a document into its consecutive sentences are operations for which we need nothing more than core Python:

In [2]:
my_text = "This workshop is about language, and Python. Let's Go!"
sentences = my_text.split('. ')
print(sentences)

# Removing punctuation with regular expressions
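def remove_punctuation(text):
    # Escape the punctuation so that characters like ], \ and ^ are literal inside the character class
    pattern = "[" + re.escape(string.punctuation) + "]+"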
    result = re.sub(pattern," ",text)
    return result

print(remove_punctuation("text!!!text??"))


for s in sentences:
    print(remove_punctuation(s.lower()))

['This workshop is about language, and Python', "Let's Go!"]
text text 
this workshop is about language  and python
let s go 


After simple operations like that, your results will no longer be case sensitive (but if uppercase is used to find names later on, be careful!). Note that for more complex string operations, it may be very useful to get familiar with [regular expressions](https://regex101.com/).
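As a small illustration of a slightly more involved string operation: the punctuation removal above leaves double and trailing spaces behind (see `language  and python` in the output). The following is a minimal sketch of cleaning those up, assuming the standard-library `re` module (the notebook itself imports `regex as re`, which accepts the same pattern). The helper name `normalize_whitespace` is made up for this example.

```python
import re

def normalize_whitespace(text):
    # Collapse every run of whitespace into a single space, then trim both ends
    return re.sub(r"\s+", " ", text).strip()

print(normalize_whitespace("this workshop is about language  and python"))
print(normalize_whitespace("let s go "))
```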

## SpaCy language models

Much of what's here is adapted from the [spaCy documentation](https://spacy.io/).

There are many complications. In most applications, you will be after something like *the meaning*, *the context* or *the intent* of text. These can be hard to extract, and we will look at the quantification of text in steps.

From spaCy you can import [pre-trained language models](https://spacy.io/usage/models) in a number of languages. These enable you to digest "documents" (a document can be a single example sentence, or a whole collection of books). The examples below show what you can do with such "NLP models".

### Part-of-Speech Tagging
POS tagging can be helpful for understanding the build-up of the text you're dealing with. See below for an example.

Let's start with a simple example sentence:

In [3]:
sentence = "This is an example sentence by Marcel with a somewhat obvouis spelling mistake."

nlp = spacy.load('en_core_web_sm')

doc = nlp(sentence)
for token in doc:
    print(f"{token.text:14s} {token.pos_:6s} {token.dep_}")

This           DET    nsubj
is             AUX    ROOT
an             DET    det
example        NOUN   compound
sentence       NOUN   attr
by             ADP    prep
Marcel         PROPN  pobj
with           ADP    prep
a              DET    det
somewhat       ADV    advmod
obvouis        ADJ    amod
spelling       NOUN   compound
mistake        NOUN   pobj
.              PUNCT  punct


And if you need to know what any of those abbreviations mean, you can invoke `spacy.explain`:

In [4]:
spacy.explain("ADJ")

'adjective'

The misspelled word "obvouis" is still tagged as `ADJ`, which shows that even a spelling mistake gets interpreted correctly. The interplay of words within a sentence is also known to the `doc` object:

In [5]:
spacy.displacy.render(doc, style='dep')

### Named entity recognition

SpaCy understands that my name is a "named entity" and it can try to figure out what kind of an entity I am:

In [6]:
for ent in doc.ents: print(f"{ent} is a {ent.label_} and appears in the sentence at position {ent.start_char}")

Marcel is a PRODUCT and appears in the sentence at position 31


--- 
#### Exercise
Just to get familiar with this type of exercise and solution loading: 

As you can see, my name isn't totally obvious for spaCy. Try with "Steve" and see if it gets better. Also, use the displacy renderer with `style='ent'` to see what it recognizes in the sentence "Steve worked for Apple until January 2011".

In [11]:
# If you want solutions, uncomment and run the next two lines.
# to_include = os.path.join('solutions', 'NER.py')
# %load $to_include

In many real-world applications, paying special attention to pre-defined entities is very valuable!

### Stopwords

In many cases, there is little to no information in super common words like *the* or *is*. Note that **this depends on your use case!** In general, the most common words in a language don't add information because they appear all over the place, but their actual meaning might be important in your context. SpaCy comes with lists of stopwords that are useful for most use cases:

In [4]:
stopwords = spacy.lang.en.stop_words.STOP_WORDS
print(f"I know {len(stopwords)} stopwords.")

I know 326 stopwords.
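
Filtering a text against this collection takes a single list comprehension. The sketch below assumes plain whitespace tokenization of a made-up sentence, purely for illustration; in a real pipeline you would let spaCy do the tokenization.

```python
from spacy.lang.en.stop_words import STOP_WORDS

# Made-up example sentence, split on whitespace
tokens = "the rocket is on the launch pad".split()

# Keep only the tokens that are not in spaCy's English stopword collection
content_words = [t for t in tokens if t.lower() not in STOP_WORDS]
print(content_words)
```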


---

#### Exercise
`stopwords` is a set. Can you think of a reason why?

What is the longest stopword in English included in spaCy?

Add a few more stopwords: "and", "market" and "people". How many of them were already in the collection? 

In [None]:
# If you want solutions, uncomment and run the next two lines.
# to_include = os.path.join('solutions', 'stopwords.py')
# %load $to_include

## Text Normalization: Stemming and Lemmatization

Often, the information content of a text does not depend on verb conjugations, singular vs. plural, etc. In such cases we want to use text normalization in order to tell our future model that "is" and "was" are both representations of "be". In the same vein, you could also map synonyms to the same word. This is less common, for reasons you can probably think of. This type of word normalization can be done in a variety of ways.

Stemming means cutting off parts of the word (typically a suffix) to get back to the root of the word (e.g. reading -> read, played -> play etc.). This is a simple procedure.
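
To make the idea concrete, here is a deliberately naive suffix stripper. It is a toy sketch and not one of the established stemming algorithms (such as the Porter stemmer); the function name, the suffix list and the minimum stem length of 3 are all assumptions made for this illustration.

```python
def naive_stem(word, suffixes=("ing", "ed", "s")):
    # Strip the first matching suffix, but keep at least 3 characters of stem
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["reading", "played", "visas", "running", "was"]:
    print(w, "->", naive_stem(w))
```

Note how crude this is: "running" becomes "runn", and "was" is left untouched rather than mapped to "be". Irregular forms like that are where lemmatization helps.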

Lemmatization, on the other hand, is an organized and step-by-step procedure of obtaining the root form of the word. It makes use of vocabulary (dictionary importance of words) and morphological analysis (word structure and grammar relations). It typically results in more useful features for our future predictive models. It can be done with spaCy like this:

In [12]:
from_the_news = "Belarus has been accused of taking revenge for EU sanctions by offering migrants tourist visas, and helping them across its border. The BBC has tracked one group trying to reach Germany."

In [14]:
doc = nlp(from_the_news)

lemma_word1 = [] 
for token in doc:
    lemma_word1.append(token.lemma_)
' '.join(lemma_word1)

'Belarus have be accuse of take revenge for EU sanction by offer migrant tourist visa , and help they across its border . the BBC have track one group try to reach Germany .'

## Preprocessing pipelines

Before we jump to learning from our text data, let's create a pipeline for preprocessing the data in the way that we want. We will use pandas pipes to combine the functions into a pipeline. They work on a DataFrame (or Series) containing the data, so let's first create a simple data set.

In [46]:
df = pd.DataFrame({'text':['My first text ingredient.', 'More text. In the DataFrame.']})
df

                           text
0     My first text ingredient.
1  More text. In the DataFrame.


In [47]:
def remove_period(text):
    return text.str.replace(".", "", regex=False)

def to_lower(text):
    return text.str.lower()

processed = (df.text.pipe(remove_period)
                    .pipe(to_lower)
            )

processed

0      my first text ingredient
1    more text in the dataframe
Name: text, dtype: object
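
Functions in such a pipeline can also take extra arguments, because `Series.pipe` forwards additional positional and keyword arguments to the function it calls. A minimal sketch on made-up data; `remove_chars` is a hypothetical helper, not part of the workshop code:

```python
import pandas as pd

def remove_chars(text, chars):
    # Remove each of the given characters from every string in the Series
    for ch in chars:
        text = text.str.replace(ch, "", regex=False)
    return text

def to_lower(text):
    return text.str.lower()

toy = pd.Series(["Hello, world!", "Pipes: nice."])

# The keyword argument chars is passed through by pipe to remove_chars
cleaned = toy.pipe(remove_chars, chars=",.!:").pipe(to_lower)
print(cleaned.tolist())
```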

For the cases below, we will be using a subset of the "20 newsgroups" dataset that comes with scikit-learn. Newsgroups are a kind of discussion forum on which people get questions answered. We will load the subset here and quickly look at it:

In [5]:
from sklearn.datasets import fetch_20newsgroups

# We will load only 4 of the categories
cats = ['sci.space', 'sci.med', 'rec.autos', 'alt.atheism']
data = fetch_20newsgroups(categories=cats, 
                          remove=('headers', 'footers'))

print(data.target.shape)
print(len(data.data))

(2261,)
2261


In [72]:
# Get a random one
random_index = np.random.randint(0, high=len(data.data))
print(data.target_names[data.target[random_index]])
print()
print(data.data[random_index])

sci.med

Does anyone on this newsgroup happen to know WHY morphine was
first isolated from opium?  If you know why, or have an idea for where I
could look to find this info, please mail me.
	CSH
any suggestionas would be greatly appreciated

--
 "Kilimanjaro is a pretty tricky climb. Most of it's up, until you reach
the very, very top, and then it tends to slope away rather sharply."
					Sir George Head, OBE (JC)


---

#### Exercise

Create a pre-processing pipeline that cleans the data of the newsgroups. You can think of your own steps and order (order matters!), or you can take these steps (these might well be sub-optimal!):
1. Transform to lower case
2. Remove punctuation
3. Lemmatize
4. Remove stop words -- Look at the stop words list: is this lemmatized?

Are you ready? Or do you need to remove more?

In [101]:
# If you want solutions, uncomment and run the next two lines.
# to_include = os.path.join('solutions', 'preprocessor.py')
# %load $to_include

## Vectorize the preprocessed text data

In order for machine learning methods to deal with this cleaned-up text data, we are going to build a "bag of words" matrix out of it. This is a huge feature space where every observation is a document, and every single word that is in at least one of the documents is a feature. This will be a very sparse matrix.
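
To see what such a matrix looks like, here is a minimal sketch that builds one by hand with only the standard library, on a made-up three-document corpus. It is meant to show the idea; it stores the matrix densely and is not how the scikit-learn vectorizers are implemented.

```python
from collections import Counter

# Made-up mini corpus, already lower-cased and without punctuation
docs = ["the cat sat", "the dog sat", "the cat saw the dog"]

# The vocabulary is every word that occurs in at least one document
vocab = sorted({word for doc in docs for word in doc.split()})

# One row per document, one column per vocabulary word, entries are word counts
matrix = [[Counter(doc.split())[word] for word in vocab] for doc in docs]

print(vocab)
for row in matrix:
    print(row)
```

Even in this tiny example there are zeros for words that a document does not contain. With thousands of documents and a vocabulary of tens of thousands of words, almost all entries are zero, which is why a sparse representation is used in practice.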

We can use either the `CountVectorizer` or the `TfidfVectorizer` from scikit-learn for this. 

---

#### Exercise

Give that a go, look at the various "hyperparameters" of the vectorizers and play with them a bit. Down below, we will use these in supervised and unsupervised learning.

Note that these vectorizers can take preprocessor functions as well! This will need to be done just slightly differently than above.

In [120]:
# If you want solutions, uncomment and run the next two lines.
# to_include = os.path.join('solutions', 'vectorizer.py')
# %load $to_include

In [None]:
# Another example, calling pre-processors from the vectorizer.

# If you want solutions, uncomment and run the next two lines.
# to_include = os.path.join('solutions', 'vectorizer_preprocessor.py')
# %load $to_include

Just like with more common data sets, we can do both supervised and unsupervised machine learning with text data, after the vectorization described above. After all, we created numeric features (based on the occurrence of words) for all documents, which serve as the observations of our model. Hence, we can use our data set to train a machine learning model like we are used to.

Below is a supervised learning example, in which the label (the category in the 20 newsgroups data set) is predicted based on the bags-of-words. We can also pretend that we do not yet know these labels, or even that there are 4 of them, and do an unsupervised, clustering-like analysis. In this case, that is known as topic modeling and is described after the supervised example.

The two examples will be followed by a brief discussion of more elaborate machine learning techniques based on the *context* of words, rather than the words themselves, through word vectors.

## Supervised learning: text classification

We have a feature matrix (the result of the Count- or TfIdf-Vectorizer above) as well as a label (the category the text came from) for a subset of the 20 newsgroups data set. Building a predictive model that determines, based on the occurrence of words, which of the 4 labels fits best is completely analogous to how we would do this with a feature matrix of any other origin.

---
#### Exercise

If you follow these steps, your predictive model will be built:
- Split your feature matrix and target vector in a train and a test set (e.g. a random 20% of your data can go in the test set) using `sklearn.model_selection.train_test_split`
- Instantiate a supervised classification algorithm. For example, use `sklearn.naive_bayes.MultinomialNB` with the default settings
- Train on the train set and evaluate the predictions on the independent test set, using a visualization of the confusion matrix (`sklearn.metrics.confusion_matrix`)
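
The mechanics of these steps can be sketched on a tiny made-up corpus with two classes. This is an illustration under stated assumptions and not the workshop solution: the texts, the labels, the 25% test size and the `random_state` are all invented for the example, and with only 12 documents the resulting numbers mean very little.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Made-up mini corpus with two classes: 0 = space, 1 = cars
texts = [
    "the rocket reached orbit", "the satellite orbits the moon",
    "nasa launched a new probe", "the shuttle docked at the station",
    "astronauts repaired the telescope", "the probe landed on mars",
    "the engine needs new oil", "my car has a flat tire",
    "the brakes on this truck squeak", "she bought a used sedan",
    "the dealer sold me a car", "change the oil and the tires",
]
labels = [0] * 6 + [1] * 6

# Bag-of-words feature matrix (sparse), one row per document
X = CountVectorizer().fit_transform(texts)

# Split features and labels; stratify keeps both classes in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, stratify=labels, random_state=0
)

# Naive Bayes with default settings, trained on the train set only
clf = MultinomialNB().fit(X_train, y_train)

# Rows are the true classes, columns the predicted classes
cm = confusion_matrix(y_test, clf.predict(X_test), labels=[0, 1])
print(cm)
```

For a graphical version of the confusion matrix, `sklearn.metrics.ConfusionMatrixDisplay` is one option.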

In [None]:
# If you want solutions, uncomment and run the next two lines.
# to_include = os.path.join('solutions', 'naive_bayes.py')
# %load $to_include

The supervised example above is deliberately simple, and so is the unsupervised one below. By no means do I pretend that this is all there is to machine learning! I do hope that it shows you how to use a bag-of-words to do machine learning on text data.

## Unsupervised learning: topic modeling with LDA

In the unsupervised setting we look for structure in data for which we do not have a "target variable". We do not know the correct answer, if that even means something.

In this particular example, we would hope that 4 clusters are present, which in reality are described by the 4 different labels that we predicted above. Here we have a look at the data and try to find 4 topics, described by a form of soft clustering through [Latent Dirichlet Allocation](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation).

The procedure is very similar to the supervised learning example above. It doesn't make much sense to split off a test set, as there is nothing to test.

---
#### Exercise
Run an LDA clustering with 4 components (`sklearn.decomposition.LatentDirichletAllocation`).


In [1]:
# If you want solutions, uncomment and run the next two lines.
# to_include = os.path.join('solutions', 'lda.py')
# %load $to_include

We can easily visualize the results with `pyLDAvis.sklearn` to investigate our topics:

In [135]:
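# Note: in pyLDAvis 3.4 and later, this module is named pyLDAvis.lda_model
import pyLDAvis.sklearn

pyLDAvis.enable_notebook()
# lda, bow and vectorizer are the fitted model, the bag-of-words matrix
# and the fitted vectorizer from the LDA exercise above
dash = pyLDAvis.sklearn.prepare(lda, bow, vectorizer)
dash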



---
#### Exercise
Play with the number of topics and see if you understand what happens!

## Word vectors

In order to capture the 'meaning' or 'context' of a word, people often use word vectors. These are an $N$-dimensional representation of a word in an abstract space, in which words with a similar meaning are supposed to be near each other.

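The `nlp` object defined above comes with 96-dimensional vectors. Note that the small spaCy models do not ship with static word vectors; the `.vector` attribute falls back to context-sensitive tensors here: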

In [34]:
mango = nlp('mango')
mango.vector.shape

(96,)

The numbers by themselves hardly mean anything, but proximity in this high-dimensional space does. 

---
#### Exercise

Get the vectors for "mango", "strawberry" and "brick" and verify that the fruits are indeed the closest pair.

Use the `similarity` method of tokens as well, to get a measure of all pairwise similarities.

In [None]:
# If you want solutions, uncomment and run the next two lines.
# to_include = os.path.join('solutions', 'vector_dist.py')
# %load $to_include

With the larger language models (e.g. `en_core_web_md` or `en_core_web_lg`), the word vectors are 300-dimensional.

Word vectors can be incredibly powerful. You can use the pre-trained models in spaCy, or you can train your own, with e.g. `word2vec`, `GenSim` or `FastText`. It can also be useful to take existing word embeddings and "re-train" them, which is supposed to make the existing embeddings more relevant for your domain of application, while you can still use the versatile pre-trained models, which are typically trained on massive amounts of data (more than you're likely to have at hand).

With the vectors representing words, you can also do machine learning. In that case you do not need the bag-of-words methods any longer, which is nice for several reasons, e.g.:
- Bag-of-words methods are unaware of the contexts of words
- Word vectors are less sensitive to the use of synonyms and are more versatile in large diverse corpora of text
- Word vectors trivially combine into document vectors (through averaging), allowing you to treat the documents in much the same way
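
The averaging mentioned in the last point can be sketched with numpy. The 3-dimensional vectors below are invented for illustration and do not come from any real model; `document_vector` is a hypothetical helper name.

```python
import numpy as np

# Made-up 3-dimensional "word vectors", for illustration only
word_vectors = {
    "sweet": np.array([0.9, 0.1, 0.0]),
    "mango": np.array([0.8, 0.2, 0.1]),
    "red": np.array([0.1, 0.9, 0.3]),
    "brick": np.array([0.0, 0.7, 0.9]),
}

def document_vector(tokens):
    # Average the vectors of all tokens that we have a vector for
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return np.zeros(3)  # no known words: fall back to the zero vector
    return np.mean(known, axis=0)

print(document_vector(["sweet", "mango"]))
print(document_vector(["red", "brick"]))
```

The result has the same dimensionality as the word vectors, whatever the length of the document.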

When you create a `Doc` object out of a document by calling `nlp()`, using one of the "larger" language models (see above), it is possible to assess the similarity of two documents using the `.similarity()` method. This uses the `.vector` attribute and calculates the similarity based on the [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity), a well-known similarity measure that is insensitive to the length (L2 norm) of the vectors and depends only on their "direction".
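
For two vectors $u$ and $v$, the cosine similarity is $\frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}$. A minimal numpy sketch on made-up vectors, showing that rescaling a vector does not change the result (the function below is written out for illustration and is not spaCy's implementation):

```python
import numpy as np

def cosine_similarity(u, v):
    # Dot product divided by the product of the L2 norms
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1.0, 2.0, 3.0])

# Same direction, ten times the length
print(cosine_similarity(a, 10 * a))

# Orthogonal vectors
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```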

SpaCy has lots of functionality for word vectors, transformers and other sophisticated tooling; see e.g. [their documentation](https://spacy.io/). The key difference between word/document vectors and contextual language models such as transformers is that word vectors model lexical types, rather than tokens. If you have a list of terms with no context around them, a transformer model like BERT can't really help you. BERT is designed to understand language in context, which isn't what you have. A word vectors table will be a much better fit for your task. However, if you do have words in context (whole sentences or paragraphs of running text), word vectors will only provide a very rough approximation of what the text is about.

Transformer models are usually trained with PyTorch, and training is greatly helped by the use of GPUs. These are beyond the scope of this workshop.

