<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/c/c3/Python-logo-notext.svg/1869px-Python-logo-notext.svg.png" align='right' width=200>

# Text analysis in Python
## An introduction from scratch


Marcel Haas, Hielke Muizelaar (LUMC)

In [None]:
# General imports
import sys, os
import numpy as np
import pandas as pd

# NLP related
import string
import regex as re
import spacy

# Machine learning
import sklearn

# Visualization
import matplotlib.pyplot as plt

In [None]:
# Print which versions are used
print("This notebook uses the following packages (and versions):")
print("---------------------------------------------------------")
print("python", sys.version.split()[0])  # slicing with [:6] cuts off e.g. "3.10.12"
print('\n'.join(f'{m.__name__} {m.__version__}' for m in globals().values() if getattr(m, '__version__', None)))

# The <i>very</i> basics
## String operations in Python

In [None]:
my_text = "This workshop is about language, and Python. Let's Go!"
sentences = my_text.split('. ')
print(sentences)

In [None]:
# Removing punctuation with regular expressions
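def remove_punctuation(text):
    # Escape the punctuation so that characters such as "]", "\" and "^"
    # are taken literally inside the character class
    pattern = "[" + re.escape(string.punctuation) + "]+"
    result = re.sub(pattern, " ", text)
    return result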

print(remove_punctuation("text!!!text??"))


for s in sentences:
    print(remove_punctuation(s.lower()))

# SpaCy language models

Much of what's here is adapted from the [spaCy documentation](https://spacy.io/).

There are many complications. In most applications, you will be after something like *the meaning*, *the context* or *the intent* of text. These can be hard to extract, and we will look at the quantification of text in steps.

From spaCy you can import [pre-trained language models](https://spacy.io/usage/models) for a number of languages. These models let you process "documents", which can be anything from a single example sentence to a whole collection of books. The examples below show what you can do with such "NLP models".

## This is the first step from <i>text</i> to <i>quantitative data</i>.

## Part-of-Speech Tagging
POS tagging can be helpful for understanding the build-up of the text you're dealing with. See below for an example.

Let's start with a simple example sentence:

In [None]:
sentence = "This is an example sentence by Marcel with an obvouis spelling mistake."

nlp = spacy.load('en_core_web_sm')  # load a language model
doc = nlp(sentence)                 # Process the sentence with it into a "document"
for token in doc:                   # Show some added attributes
    print(f"{token.text:14s} {token.pos_:6s} {token.dep_}")

Note: If the language model doesn't load:
>python -m spacy download en_core_web_sm

If you need to know what any of those abbreviations mean, you can invoke

In [None]:
spacy.explain("ADJ")

The interplay of words within a sentence is also known to the `doc` object:

In [None]:
from spacy import displacy  # explicit import, as used in the spaCy documentation

displacy.render(doc, style='dep')

## Named entity recognition

Named entity recognition is also part of the language model.
SpaCy understands that my name is a "named entity" and it can try to figure out what kind of entity I am:

In [None]:
for ent in doc.ents:
    print(f"{ent} is a {ent.label_} and appears in the sentence at position {ent.start_char}")

## Stop words

What a stop word is <i><b>should</b></i> depend on your use case!
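To make that concrete, here is a small sketch of adapting a stop word list to a task. The base list, the kept negations and the extra domain words are all made up for illustration; spaCy's own list (loaded in the next cell) is a much larger set of the same kind.

```python
# Hypothetical starting list; a real one is much longer.
base_stopwords = {"the", "a", "is", "not", "no", "and", "of", "this"}

# For sentiment-like tasks negations carry meaning, so we keep them as real words.
keep = {"not", "no"}
# In a corpus of clinical notes, very frequent domain words add little information.
extra = {"patient", "hospital"}

custom_stopwords = (base_stopwords - keep) | extra

text = "the patient is not in the hospital"
filtered = [word for word in text.split() if word not in custom_stopwords]
print(filtered)
```

Whether "not" should survive depends entirely on the question you are asking of the text, which is the point of the remark above.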

In [None]:
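# Import the set explicitly, so this cell does not rely on an English model being loaded first
from spacy.lang.en.stop_words import STOP_WORDS

stopwords = STOP_WORDS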
print(f"I know {len(stopwords)} stopwords.")

# Text normalization: 
## Stemming and lemmatization



In [None]:
from_the_news = ["Belarus has been accused of taking revenge for EU sanctions by offering migrants " \
"tourist visas, and helping them across its border. The BBC has tracked one group trying to reach Germany.", 
                "Biden and senators on verge of striking immigration deal aimed at clamping down on illegal border crossings",
                "Tea with salt? American scientist's \"outrageous proposal\" leaves U.S.-U.K. relations in \"hot water,\" embassy says",
                "Apple's Stolen Device Protection feature is now live. Here's how it can help protect your iPhone.",
                "Doomsday clock time for 2024 remains at 90 seconds to midnight. Here's what that means.",
                "Microsoft Teams services are down, as thousands of users report issues",
                "Funeral homes warned after FTC's first undercover phone sweep reveals misleading pricing",
                "How to watch today's Kansas City Chiefs vs. Baltimore Ravens game: AFC Championship Game livestream options"
                ]

In [None]:
doc = nlp(from_the_news[0])

lemma_word1 = [] 
for token in doc:
    lemma_word1.append(token.lemma_)
' '.join(lemma_word1)

# Creating features for Machine Learning

![features](Figures/features.png)

![bag](Figures/bag.png)
![bag](Figures/bagwords.png)
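The idea in the figures can be sketched by hand in a few lines. This is only an illustration on two made-up sentences, with plain whitespace splitting standing in for real tokenization:

```python
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# The vocabulary is the sorted set of all words; each word gets one column.
vocabulary = sorted({word for doc in docs for word in doc.split()})

# Each document becomes a row of counts, one entry per vocabulary word.
rows = []
for doc in docs:
    counts = Counter(doc.split())
    rows.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in rows:
    print(row)
```

Word order is lost in this representation: only the counts per document remain.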


## Python wouldn't be python if....

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

vectorizer = CountVectorizer(min_df=2,
                             max_df=.9,
                             ngram_range=(1,2))

bow = vectorizer.fit_transform(from_the_news)

bow.shape

# A Text Classification example

- 20 Newsgroups has short texts about specific topics. 

- We load 4 of them, and then try to predict the topic based on the text

## Getting the data

In [None]:
from sklearn.datasets import fetch_20newsgroups

# We will load only 4 of the categories
cats = ['sci.space', 'sci.med', 'rec.autos', 'alt.atheism']
data = fetch_20newsgroups(categories=cats, 
                          remove=('headers', 'footers'))

print(data.target.shape)
print(len(data.data))

In [None]:
# Get a random one
random_index = np.random.randint(0, high=len(data.data))
print(data.target_names[data.target[random_index]])
print()
print(data.data[random_index])

### Let's do some pre-processing

This is 1337 for "transform the data into usable form".
(trust me on this one)

In [None]:
def to_lower(text):
    return text.str.lower()

def remove_punctuation(text):
    # re.escape makes sure characters such as "]" and "\" are taken literally
    pattern = "[" + re.escape(string.punctuation) + "]+"
    text = text.str.replace(pattern, " ", regex=True)
    text = text.str.replace("\n", " ", regex=False)
    return text

def lemmatize_stopwords(s):
    # I combine lemmatization and stopword removal to
    # have them both use nlp()
    # This is the slow function! Can you do something about it?
    doc = nlp(s)
    lemma = [] 
    for token in doc:
        if token.text not in stopwords:
            lemma.append(token.lemma_)
    return ' '.join(lemma)

def lsw(text):
    return text.apply(lemmatize_stopwords)


In [None]:
df = pd.DataFrame({'text':data.data, 'target':data.target})

processed = (df.text.pipe(to_lower)
                    .pipe(remove_punctuation)
                    .pipe(lsw)
            )

## Vectorize the cleaned data

Vectorization: transforming the unstructured data into rows of numbers.
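The `min_df` and `max_df` arguments used in the cell below filter the vocabulary by document frequency, that is, by the number of documents a term occurs in. A rough pure-Python sketch of that rule follows. It is not scikit-learn's implementation: the documents are made up, and for simplicity `min_df` is always read as an absolute count and `max_df` as a fraction of the documents.

```python
docs = [
    "space shuttle launch",
    "space station crew",
    "car engine repair",
    "car space parking",
]

def document_frequency(docs):
    """Count in how many documents each word occurs (not how often in total)."""
    df = {}
    for doc in docs:
        for word in set(doc.split()):
            df[word] = df.get(word, 0) + 1
    return df

def filter_vocabulary(docs, min_df=1, max_df=1.0):
    """Keep words with min_df <= document count <= max_df * number of documents."""
    df = document_frequency(docs)
    upper = max_df * len(docs)
    return sorted(word for word, n in df.items() if min_df <= n <= upper)

print(filter_vocabulary(docs, min_df=2, max_df=0.5))
```

Here "space" is dropped for being too common (3 of 4 documents) and most other words for being too rare, which is the kind of pruning that keeps the feature matrix manageable.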

In [None]:
vectorizer = CountVectorizer(min_df=4,
                             max_df=.5,
                             ngram_range=(1,2))
bow = vectorizer.fit_transform(processed)
bow.shape

## We have a feature matrix for a prediction model!


# Predictive modeling


Let's do prediction. This will be a lot of code again, but don't you worry...

1. Import the things we need
2. Split into a train and test set, so we can honestly assess the predictive power
3. Train a Naive Bayes model on the training data
4. Inspect the confusion matrix to see how we do
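
Before running the real thing, it may help to see what step 4 computes. The sketch below builds a confusion matrix by hand for made-up labels of three classes, using the same layout as scikit-learn (rows are true labels, columns are predicted labels).

```python
def confusion_counts(y_true, y_pred, n_classes):
    """Rows are true labels, columns are predicted labels."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for true, pred in zip(y_true, y_pred):
        matrix[true][pred] += 1
    return matrix

# Made-up labels for three classes, for illustration only.
y_true_toy = [0, 0, 1, 1, 2, 2, 2]
y_pred_toy = [0, 1, 1, 1, 2, 0, 2]

cm_toy = confusion_counts(y_true_toy, y_pred_toy, n_classes=3)
for row in cm_toy:
    print(row)

# Correct predictions sit on the diagonal.
accuracy_toy = sum(cm_toy[i][i] for i in range(3)) / len(y_true_toy)
print(accuracy_toy)
```

Everything off the diagonal is a mistake, and its position tells you which class was confused with which.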

In [None]:
# 1. Imports
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix


In [None]:
# 2. Train-test-split
X_train, X_test, y_train, y_test = train_test_split(bow, data.target, test_size=0.2, random_state=42)


In [None]:
# 3. Instantiate and train model
NB_model = MultinomialNB()
NB_model.fit(X_train, y_train)


In [None]:
# 4. Predict test data and assess!
y_pred = NB_model.predict(X_test)

# Calculate the confusion matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)

In [None]:
plt.figure()
# I use a logarithmic color scheme, so the few mismatches are visible!
plt.imshow(np.log(cm+1), interpolation='nearest', cmap=plt.cm.Blues)
tick_marks = np.arange(len(data.target_names))
plt.xticks(tick_marks, data.target_names, rotation=45, ha="right")
plt.yticks(tick_marks, data.target_names)
plt.xlabel('Predicted label')
plt.ylabel('True label')
cb = plt.colorbar()
cb.set_ticks([]);

# Bags-of-Words are unaware of the context and meaning of words

Back to SpaCy, because text representations don't have to be unaware of context and meaning!

Word vectors embed each word, based on its "surroundings" in the training text, into a high-dimensional space.
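
How close two words are in such a space can be measured in several ways. Two common choices are the squared Euclidean distance (small means similar) and the cosine similarity (close to 1 means similar), which, according to the spaCy documentation, is the default behind its `similarity` method. The sketch below uses made-up 3-dimensional vectors; real word vectors have tens to hundreds of dimensions.

```python
import math

def squared_distance(u, v):
    """Sum of squared coordinate differences."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up vectors, chosen so that the two fruits point in a similar direction.
toy_vectors = {
    "mango": [0.9, 0.8, 0.1],
    "strawberry": [0.8, 0.9, 0.2],
    "brick": [0.1, 0.2, 0.9],
}

for first, second in [("mango", "strawberry"), ("mango", "brick")]:
    u, v = toy_vectors[first], toy_vectors[second]
    print(first, second, squared_distance(u, v), cosine_similarity(u, v))
```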

## Word vectors from very simple language models

In [None]:
mango = nlp('mango')
mango.vector.shape

In [None]:
mango = nlp('mango').vector
strawberry = nlp('strawberry').vector
brick = nlp('brick').vector

print(((mango - strawberry)**2).sum())
print(((mango - brick)**2).sum())
print(((strawberry - brick)**2).sum())

## Similarity is also baked in, but only for "decent" language models

In [None]:
# If you do not download the larger language model like below,
# spaCy will complain about not having word vectors for the similarity
# measures. Download the larger model to your computer using 
# $ python -m spacy download en_core_web_lg
# Note that this takes about 800 MB of disk space
nlp = spacy.load('en_core_web_lg')
tokens = nlp('mango strawberry brick')
for i, token1 in enumerate(tokens):
    # similarities are symmetric!
    for token2 in tokens[i+1:]:
        print(token1.text, token2.text, token1.similarity(token2))

# On to Transformer-based language models!

<img src="https://huggingface.co/datasets/huggingface/brand-assets/resolve/main/hf-logo.png" align='right' width=200>

## Hugging Face
https://www.huggingface.co
### A platform for machine learning applications, with a large focus on (large) language models!

Best known for its "transformers" library, which provides thousands of pretrained models to perform tasks on several modalities, including text. 

The library is named after the transformer neural network architecture ↓

![transformers](Figures/transformers.png)

### If this looks complex to you, you are absolutely right!


First described in a 2017 paper by researchers at Google, the transformer is a sequence-to-sequence model that has since seen widespread use in a plethora of machine learning tasks. The architecture was originally proposed for neural machine translation, but the same approach applies to any task that transforms an input sequence into an output sequence.

A large part of what sets transformers apart from previously used neural network architectures is their reliance on attention mechanisms. Rather than automatically giving the most recent words the highest importance, attention assigns larger weights to the words that are most relevant for the word being processed, wherever they occur in the sentence. This is loosely similar to how we focus on the most influential words when reading.
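
A minimal sketch of the attention computation itself (scaled dot-product attention, as in the 2017 paper) may make this less abstract. The vectors are made up, and the same vectors serve as queries, keys and values; a real transformer first passes them through learned projections and uses several attention heads in parallel.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention for small lists of vectors."""
    d = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query with every key, scaled by sqrt(dimension).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # The output is the weighted average of the value vectors.
        output = [sum(w * v[j] for w, v in zip(weights, values))
                  for j in range(len(values[0]))]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights

# Made-up 2-dimensional vectors for a three-word sentence.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(x, x, x)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of weights says how much one word "looks at" every word in the sentence, regardless of how far apart they are.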

Classic transformer models contain a stack of encoders and a stack of decoders. The encoders (shown left in the image above) take in an input sequence, text in our case, and transform it into context vectors. For text, this means that every token is first mapped to a numerical vector. These vectors are then passed through a series of layers that mix in information from the other tokens, so that each vector also captures how its token relates to the rest of the sentence. This is called "encoding" the input sequence. The decoder is then responsible for generating the output sequence (for translation: the sentence in the target language) one token at a time, using attention over the encoded representation to decide which parts of the input matter at each step.

There is much more to transformer models than we can mention here. If you are interested in a deeper dive into the workings of the model, there are some great resources available online:

https://arxiv.org/abs/1706.03762 

https://blog.research.google/2017/08/transformer-novel-neural-network.html

https://www.turing.com/kb/brief-introduction-to-transformers-and-their-power

## A popular natural language processing transformer model is BERT (Bidirectional Encoder Representations from Transformers)

![bert](Figures/bert.png)

BERT only includes the encoder part of the transformer model. This means that, on its own, it can only transform text into numerical representations, NOT generate text. BERT was among the first widely used models to look at the words on both sides of a position at the same time, resulting in a deeper contextual understanding of the text. During training, BERT masks a set percentage of the words in a sequence (15% in the original paper) and attempts to predict each masked word from the words around it, rather than from only the words before or only the words after it, which was standard practice at the time. 
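
The masking step can be sketched in a few lines. This is a simplification under stated assumptions: a mask fraction of 0.15, whitespace tokens instead of subwords, and every selected token replaced by `[MASK]` (the original recipe sometimes substitutes a random token or leaves the token unchanged). The function name and the example sentence are made up.

```python
import random

def mask_tokens(tokens, mask_fraction=0.15, seed=0):
    """Replace a random fraction of tokens with [MASK]; return inputs and targets."""
    rng = random.Random(seed)
    n_to_mask = max(1, round(len(tokens) * mask_fraction))
    positions = set(rng.sample(range(len(tokens)), n_to_mask))
    masked = ["[MASK]" if i in positions else tok for i, tok in enumerate(tokens)]
    # The model is only scored on the masked positions.
    targets = {i: tokens[i] for i in sorted(positions)}
    return masked, targets

tokens = "the quick brown fox jumps over the lazy dog near the river bank today".split()
masked, targets = mask_tokens(tokens)
print(" ".join(masked))
print(targets)
```

During pre-training the model sees the masked sequence and is asked to recover the words stored in `targets`, using context from both sides.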

BERT can be fine-tuned for different types of tasks, such as text classification. This is done by adding a classification layer on top. This approach has seen widespread success and has made BERT-style models a strong, widely used choice for text classification. 

### Let's explore how we can use BERT ourselves using the aforementioned Hugging Face resources!

## Import the necessary libraries

In [None]:
from transformers import AutoTokenizer, pipeline, DataCollatorWithPadding, AutoModelForSequenceClassification, TrainingArguments, Trainer
from huggingface_hub import notebook_login
import evaluate
from datasets import Dataset

## We recommend creating a Hugging Face account, with which you can obtain access tokens and host your own models on the site!

In [None]:
# Enter token
from huggingface_hub import notebook_login

notebook_login()

## Load the dataset

For this example we use the same 20 newsgroups dataset as earlier, but this time with all 20 categories. We use a Hugging Face 'Dataset' object, as these work well within the Hugging Face ecosystem.

In [None]:
data = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))
id2label = {idx: label for idx, label in enumerate(data["target_names"])}
d = {"text": data["data"], "label": data["target"]}
dset = Dataset.from_dict(d)
dset = dset.map(lambda x: {"label_text": id2label[x["label"]]})

## Initialize tokenizer

Tokenization is an essential part of natural language processing. It describes the process of breaking text into smaller parts, making machine analysis feasible. For BERT, the WordPiece algorithm is used, which creates a vocabulary of subwords based on frequently occurring symbol pairs in the training set.
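
Once such a vocabulary exists, a word is split greedily: take the longest piece from the vocabulary that matches the start of the word, then continue with the remainder, marking continuation pieces with `##`. A minimal sketch of that splitting step, assuming a tiny made-up vocabulary (real tokenizers also handle casing, punctuation and special tokens):

```python
def wordpiece_tokenize(word, vocab, unknown="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first and shrink until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a piece that continues a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unknown]  # no piece fits, so the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical vocabulary, for illustration only.
toy_vocab = {"token", "##ize", "##r", "##s", "play", "##ing"}
print(wordpiece_tokenize("tokenizers", toy_vocab))
print(wordpiece_tokenize("playing", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

Because rare words fall apart into known pieces, the model rarely has to deal with a completely unknown word.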

In this example we use DistilBERT, a distilled version of BERT that has 40% fewer parameters than regular BERT, resulting in significantly faster load and training times, while preserving roughly 95% of its performance. Here, we load its tokenizer from Hugging Face.

In [None]:
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

## Tokenize the dataset

We now apply the tokenizer to the dataset to obtain a tokenized version. We first create a function that does this and truncates each text to the maximum input size of DistilBERT, as the model cannot handle more than 512 tokens at once. 

In [None]:
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)

tokenized_groups = dset.map(preprocess_function, batched=True)

## Split dataset to training and test set


In machine learning it is standard practice to use a training and a test set. The training set is used only to train the model, while the model does not see the test set until after it is fully trained. By evaluating the model on unseen data we can check that it is not merely reproducing what it has memorized, but is able to apply what it learned to new instances.

In this example we use 70% of our total set for training and the remaining 30% for testing/evaluation. 

In [None]:
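# Keyword arguments make the split explicit; the seed (an arbitrary choice) makes it reproducible
groups_dset_split = tokenized_groups.train_test_split(test_size=0.3, seed=42)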

In [None]:
groups_dset_split

## Initialize data collator

Data collators assemble individual examples into batches, allowing for batched training, which speeds up training and affects how well the model generalizes. Here, we use a padding data collator, which dynamically pads the sequences in each batch to the length of the longest sequence in that batch. Sequences of equal length are required for batch training.
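
What dynamic padding does to a single batch can be sketched as follows. The token ids are made up, the function name is hypothetical, and a padding id of 0 is an assumption for this illustration; the real collator takes the padding token from the tokenizer and returns tensors rather than lists.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the longest one in this batch and build an attention mask."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding that the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Made-up token ids for three texts of different length.
batch = pad_batch([[101, 7592, 102], [101, 2023, 2003, 1037, 102], [101, 102]])
for ids, mask in zip(batch["input_ids"], batch["attention_mask"]):
    print(ids, mask)
```

Padding per batch, rather than padding everything to 512 tokens, avoids spending computation on long runs of padding.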

In [None]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

## Load model


As we are fine-tuning an existing model for classification, we have to load said model from Hugging Face. We specify that we use the model for a text classification task via the use of "AutoModelForSequenceClassification". We set the number of labels to 20 as we have 20 different news groups in our dataset.

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=20
)

## Initialize performance metrics

Naturally, we want to record the performance of our classification model. For this instance, we use "accuracy", which comes down to the percentage of total instances our model classified correctly. We make sure the predictions are matched with the actual labels (classes) linked to each text.
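
As a worked example of that matching step: the model returns one score (logit) per class for every text, the predicted class is the one with the highest score, and accuracy is the fraction of predictions that equal the true label. The numbers below are made up.

```python
# Made-up model outputs (logits): one row per text, one column per class.
toy_logits = [
    [2.1, 0.3, -1.0],
    [0.2, 0.1, 1.7],
    [-0.5, 1.2, 0.9],
    [1.5, 1.6, 0.0],
]
toy_labels = [0, 2, 2, 1]

# The predicted class is the index of the largest logit in each row.
toy_predictions = [row.index(max(row)) for row in toy_logits]

toy_accuracy = sum(p == t for p, t in zip(toy_predictions, toy_labels)) / len(toy_labels)
print(toy_predictions, toy_accuracy)
```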

In [None]:
accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)

## Fine-tune/train the model

Finally, we move on to training (in this case fine-tuning) DistilBERT on our data. We specify an output directory for the finished model, which can be either a local path or a name for a Hugging Face repository. The other training arguments are chosen arbitrarily here, for illustrative purposes. It is recommended to try several configurations to explore what works on the data at hand. For our trainer object we specify which dataset to use for training and which for evaluation, and we pass our data collator and metric function from before. 

In [None]:
training_args = TrainingArguments(
    output_dir="Hielke/20newsgroups_healthRI",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    evaluation_strategy="epoch",  # named eval_strategy in recent transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
)

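trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=groups_dset_split["train"],
    eval_dataset=groups_dset_split["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

# Start fine-tuning; without this call the model is never trained
trainer.train()

# Write the final model (and the tokenizer passed to the Trainer) to output_dir
trainer.save_model()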

## Classify new text using the model!

With the model now trained, we can use it for classification purposes on whatever we deem worthy for it! For this we create a pipeline object, which handles tokenization and other processing steps of the given input text.

In [None]:
from transformers import pipeline

# Path to the directory that holds the fine-tuned model; adjust it to your own output directory
classifier = pipeline('text-classification', model='C:/users/muize/Hielke/20newsgroups_healthRI')

classifier("Is man merely a mistake of God's? Or God merely a mistake of man? -Friedrich Nietzsche")