<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/c/c3/Python-logo-notext.svg/1869px-Python-logo-notext.svg.png" align='right' width=200>

# Text analysis in Python
## An introduction from scratch


Marcel Haas, Hielke Muizalaar (LUMC)

In [5]:
# General imports
import sys, os
import numpy as np
import pandas as pd

# NLP related
import string
import regex as re
import spacy

# Machine learning
import sklearn

# Visualization
import matplotlib.pyplot as plt

In [4]:
# Print which versions are used
print("This notebook uses the following packages (and versions):")
print("---------------------------------------------------------")
print("python", sys.version.split()[0])  # first field is the version number, whatever its length
print('\n'.join(f'{m.__name__} {m.__version__}' for m in globals().values() if getattr(m, '__version__', None)))

This notebook uses the following packages (and versions):
---------------------------------------------------------
python 3.11.7
spacy 3.7.2
numpy 1.26.3
pandas 2.2.0
regex 2.5.140
sklearn 1.4.0


# The <i>very</i> basics
## String operations in Python

In [None]:
my_text = "This workshop is about language, and Python. Let's Go!"
sentences = my_text.split('. ')
print(sentences)

In [None]:
# Removing punctuation with regular expressions
def remove_punctuation(text):
    # Escape the punctuation so that characters like \ ] ^ - are matched literally
    pattern = "[" + re.escape(string.punctuation) + "]+"
    result = re.sub(pattern, " ", text)
    return result

print(remove_punctuation("text!!!text??"))


for s in sentences:
    print(remove_punctuation(s.lower()))
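
A character class built from `string.punctuation` is only reliable when its special characters (such as `\`, `]`, `^` and `-`) are treated literally. The sketch below shows one way to do that with the standard-library `re` module; the notebook imports the third-party `regex` package under the same name, and this sketch assumes the two behave alike for this simple case. `strip_punctuation` is a hypothetical helper name, not part of the notebook.

```python
import re
import string

# string.punctuation contains characters that are special inside [...],
# for example the backslash, ']' and '-'. re.escape makes them all literal.
PUNCT_PATTERN = re.compile("[" + re.escape(string.punctuation) + "]+")

def strip_punctuation(text):
    """Replace every run of ASCII punctuation with a single space."""
    return PUNCT_PATTERN.sub(" ", text)

print(strip_punctuation("text!!!text??"))
print(strip_punctuation(r"back\slash and [brackets]"))
```

Compiling the pattern once is a small design choice: it avoids rebuilding the same expression for every sentence.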

# spaCy language models

Much of what's here is adapted from the [spaCy documentation](https://spacy.io/).

Working with text comes with many complications. In most applications, you will be after something like *the meaning*, *the context* or *the intent* of a text. These are hard to extract directly, so we will look at the quantification of text in steps.

From spaCy you can import [pre-trained language models](https://spacy.io/usage/models) in a number of languages. These enable you to digest "documents" (a document can be a single example sentence, or a whole collection of books). The examples below show what you can do with such "NLP models".

## This is the first step from <i>text</i> to <i>quantitative data</i>.

## Part-of-Speech Tagging
Part-of-speech (POS) tagging can be helpful for understanding the grammatical build-up of the text you're dealing with.

Let's start with a simple example sentence:

In [10]:
sentence = "This is an example sentence by Marcel with an obvouis spelling mistake."

nlp = spacy.load('en_core_web_sm')  # load a language model
doc = nlp(sentence)                 # Process the sentence with it into a "document"
for token in doc:                   # Show some added attributes
    print(f"{token.text:14s} {token.pos_:6s} {token.dep_}")

Note: if the language model doesn't load, download it first by running `python -m spacy download en_core_web_sm` in a terminal.

If you need to know what any of those abbreviations mean, you can ask spaCy:

In [None]:
spacy.explain("ADJ")

The interplay of words within a sentence is also known to the `doc` object:

In [None]:
from spacy import displacy  # explicit import, so this cell does not rely on import side effects

displacy.render(doc, style='dep')

## Named entity recognition

Named entity recognition is also part of the language model.
spaCy understands that my name is a "named entity", and it can try to figure out what kind of entity I am:

In [None]:
for ent in doc.ents:
    print(f"{ent} is a {ent.label_} and appears in the sentence at position {ent.start_char}")

## Stop words

What a stop word is <i><b>should</b></i> depend on your use case!

In [None]:
from spacy.lang.en.stop_words import STOP_WORDS

stopwords = STOP_WORDS
print(f"I know {len(stopwords)} stopwords.")
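
To illustrate why the stop-word list should follow the use case, here is a minimal sketch with a tiny, made-up stop-word set (spaCy's own set is much larger). It assumes a sentiment-style task in which negations matter; `remove_stopwords` is a hypothetical helper that splits on whitespace only.

```python
# A tiny, hypothetical stop-word list, for illustration only
base_stopwords = {"the", "is", "a", "not", "this", "and"}

# For sentiment analysis, negations carry meaning, so we keep them
sentiment_stopwords = base_stopwords - {"not"}

def remove_stopwords(text, stop_words):
    """Drop every whitespace-separated token that is in stop_words."""
    return " ".join(w for w in text.lower().split() if w not in stop_words)

review = "This movie is not good"
print(remove_stopwords(review, base_stopwords))       # the negation is lost
print(remove_stopwords(review, sentiment_stopwords))  # the negation survives
```

Because the stop words are an ordinary Python `set`, adapting them is a matter of set arithmetic before the text is processed.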

# Text normalization: 
## Stemming and lemmatization



In [None]:
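# The parentheses join both string literals into a single string;
# without them, the second line would be a separate, unused expression.
from_the_news = ("Belarus has been accused of taking revenge for EU sanctions by offering migrants "
                 "tourist visas, and helping them across its border. The BBC has tracked one group trying to reach Germany.")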

In [None]:
doc = nlp(from_the_news)

lemma_word1 = [] 
for token in doc:
    lemma_word1.append(token.lemma_)
' '.join(lemma_word1)
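
The cell above shows lemmatization. For contrast, the sketch below is a deliberately crude suffix-stripping stemmer. It is a toy written for this comparison (not the Porter algorithm and not part of spaCy), and `naive_stem` with its suffix list is a hypothetical helper.

```python
def naive_stem(word, suffixes=("ing", "ed", "es", "s")):
    """Strip the first matching suffix, keeping at least 3 characters of stem."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for word in ["accused", "taking", "sanctions", "visas", "has", "was"]:
    print(f"{word:10s} -> {naive_stem(word)}")
```

A stemmer like this only chops characters, so it can produce strings that are not words and it leaves irregular forms untouched. A lemmatizer is designed to map each token to a dictionary form instead, which is why it needs a language model.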

# Creating features for Machine Learning

![features](Figures/features.png)

![bag](Figures/bag.png)
![bag](Figures/bagwords.png)


## Python wouldn't be Python if...

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

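vectorizer = CountVectorizer(ngram_range=(1, 2))  # unigrams and bigrams

# fit_transform expects an iterable of documents rather than one string,
# so each sentence of the news snippet is treated as a separate document.
# With so few documents, min_df/max_df pruning would remove every term,
# which is why those options are left at their defaults here.
documents = [sent.text for sent in nlp(from_the_news).sents]
bow = vectorizer.fit_transform(documents)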

bow.shape
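
To make the bag-of-words idea concrete, here is a minimal pure-Python sketch of the kind of matrix a count vectorizer builds, using two made-up documents. The helper `ngrams` is hypothetical, and the tokenization (splitting on whitespace) is a simplifying assumption; `CountVectorizer` applies its own tokenization rules.

```python
from collections import Counter

docs = ["the cat sat", "the cat ate the fish"]

def ngrams(tokens, n_min=1, n_max=2):
    """Return all word n-grams with n_min <= n <= n_max, joined by spaces."""
    return [" ".join(tokens[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(tokens) - n + 1)]

# One Counter per document: the "bag" of unigrams and bigrams
bags = [Counter(ngrams(doc.split())) for doc in docs]

# The sorted vocabulary fixes the column order of the matrix
vocabulary = sorted(set().union(*bags))

# Rows are documents, columns are vocabulary terms, entries are counts
matrix = [[bag[term] for term in vocabulary] for bag in bags]

print(vocabulary)
for row in matrix:
    print(row)
```

Word order inside a document is lost, apart from what the bigrams retain. That is the sense in which this is a "bag".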

# A Text Classification example

- The 20 Newsgroups dataset contains short texts, each posted to one of 20 topical newsgroups.

- We load 4 of these topics, and then try to predict the topic based on the text.

## Getting the data

In [None]:
from sklearn.datasets import fetch_20newsgroups

# We will load only 4 of the categories
cats = ['sci.space', 'sci.med', 'rec.autos', 'alt.atheism']
data = fetch_20newsgroups(categories=cats, 
                          remove=('headers', 'footers'))

print(data.target.shape)
print(len(data.data))

In [None]:
# Get a random one
random_index = np.random.randint(0, high=len(data.data))
print(data.target_names[data.target[random_index]])
print()
print(data.data[random_index])

### Let's do some pre-processing

(trust me on this one)

In [None]:
def to_lower(text):
    return text.str.lower()

def remove_punctuation(text):
    # Escape the punctuation so that characters like \ ] ^ - are matched literally
    pattern = "[" + re.escape(string.punctuation) + "]+"
    text = text.str.replace(pattern, " ", regex=True)
    text = text.str.replace("\n", " ", regex=False)
    return text

def lemmatize_stopwords(s):
    # I combine lemmatization and stopword removal to
    # have them both use nlp()
    # This is the slow function! Can you do something about it?
    doc = nlp(s)
    lemma = [] 
    for token in doc:
        if token.text not in stopwords:
            lemma.append(token.lemma_)
    return ' '.join(lemma)

def lsw(text):
    return text.apply(lemmatize_stopwords)
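
One possible answer to the question in the comment above: instead of calling `nlp()` once per text, spaCy pipelines offer `nlp.pipe`, which processes texts in batches. The sketch below is a minimal version under that assumption. `lemmatize_many` is a hypothetical helper, and the pipeline and stop words are passed in as arguments so the function does not depend on notebook globals.

```python
def lemmatize_many(texts, nlp, stop_words, batch_size=256):
    """Lemmatize an iterable of texts in batches and drop stop words.

    `nlp` is assumed to be a loaded spaCy pipeline; `nlp.pipe` yields one
    processed document per input text, in the same order.
    """
    results = []
    for doc in nlp.pipe(texts, batch_size=batch_size):
        lemmas = [token.lemma_ for token in doc if token.text not in stop_words]
        results.append(" ".join(lemmas))
    return results

# Hypothetical usage inside this notebook (needs spaCy and the model installed):
# fast_nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
# processed = pd.Series(lemmatize_many(df.text, fast_nlp, stopwords), index=df.index)
```

Loading the model with `disable=["parser", "ner"]` skips components that this step does not use, on the assumption that the lemmas do not depend on them. How much time this saves depends on the data and the machine.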


In [None]:
df = pd.DataFrame({'text':data.data, 'target':data.target})

processed = (df.text.pipe(to_lower)
                    .pipe(remove_punctuation)
                    .pipe(lsw)
            )

## Vectorize the cleaned data

In [None]:
# ngram_range is passed as a tuple: (1, 2) means unigrams and bigrams
vectorizer = CountVectorizer(min_df=4,
                             max_df=.5,
                             ngram_range=(1, 2))
bow = vectorizer.fit_transform(processed)
bow.shape
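
What `min_df` and `max_df` do can be sketched in plain Python. An integer is read as an absolute number of documents, and a float as a proportion of documents. The toy corpus and thresholds below are made up for illustration; the notebook itself uses `min_df=4` and `max_df=.5` on the real data.

```python
from collections import Counter

docs = [
    "rocket launch orbit",
    "rocket engine test",
    "car engine repair",
    "rocket car race",
]

# Document frequency: in how many documents does each term occur?
doc_freq = Counter(term for doc in docs for term in set(doc.split()))

min_df = 2      # absolute count: keep terms found in at least 2 documents
max_df = 0.5    # proportion: drop terms found in more than 50% of documents

n_docs = len(docs)
kept = sorted(term for term, freq in doc_freq.items()
              if freq >= min_df and freq / n_docs <= max_df)

print(dict(doc_freq))
print(kept)
```

Rare terms (seen in a single document here) and very common terms (such as `rocket`, which appears in three of the four documents) are both pruned, which keeps the feature matrix smaller.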

## We have a feature matrix for a prediction model!


# Predictive modeling


Let's do prediction. This will be a lot of code again, but don't you worry...

1. Import the things we need
2. Split into a train and test set, so we can honestly assess the predictive power
3. Train a Naive Bayes model on the training data
4. Inspect the confusion matrix to see how we do

In [None]:
# 1. Imports
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix


In [None]:
# 2. Train-test-split
X_train, X_test, y_train, y_test = train_test_split(bow, data.target, test_size=0.2, random_state=42)


In [None]:
# 3. Instantiate and train model
NB_model = MultinomialNB()
NB_model.fit(X_train, y_train)
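
In the cells above, the vectorizer is fit on all documents before the split, so the vocabulary has already seen the test texts. One possible alternative is to chain the vectorizer and the classifier in a scikit-learn pipeline and fit that on the training texts only. The sketch below uses a tiny made-up corpus in place of the newsgroup data; the texts and labels are illustrative assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy, made-up training data standing in for the newsgroup posts
train_texts = [
    "the rocket reached orbit",
    "nasa launched a new satellite",
    "the engine of my car broke",
    "new tires for the car",
]
train_labels = ["space", "space", "autos", "autos"]

# The pipeline learns the vocabulary and the classifier from training texts only
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

# Unseen texts are vectorized with the training vocabulary before prediction
predictions = model.predict(["satellite in orbit", "car engine"])
print(list(predictions))
```

With the real data, the same pattern would mean splitting the raw texts first and then fitting the pipeline on the training part.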


In [None]:
# 4. Predict test data and assess!
y_pred = NB_model.predict(X_test)

# Calculate the confusion matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)
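
A confusion matrix also gives summary scores directly. In scikit-learn's convention, rows are true labels and columns are predicted labels, so the diagonal holds the correct predictions. The sketch below uses a made-up 3-class matrix (`toy_cm`), not results from this notebook.

```python
import numpy as np

# A made-up confusion matrix: rows are true labels, columns are predictions
toy_cm = np.array([[50,  2,  3],
                   [ 4, 40,  6],
                   [ 1,  5, 44]])

accuracy = np.trace(toy_cm) / toy_cm.sum()          # correct / all predictions
recall = np.diag(toy_cm) / toy_cm.sum(axis=1)       # per true class (row sums)
precision = np.diag(toy_cm) / toy_cm.sum(axis=0)    # per predicted class (column sums)

print(f"accuracy:  {accuracy:.3f}")
print("recall:   ", np.round(recall, 3))
print("precision:", np.round(precision, 3))
```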

In [None]:
plt.figure()
# I use a logarithmic color scheme, so the few mismatches are visible!
plt.imshow(np.log(cm+1), interpolation='nearest', cmap=plt.cm.Blues)
tick_marks = np.arange(len(data.target_names))
plt.xticks(tick_marks, data.target_names, rotation=45, ha="right")
plt.yticks(tick_marks, data.target_names)
plt.xlabel('Predicted label')
plt.ylabel('True label')
cb = plt.colorbar()
cb.set_ticks([]);

# Bags-of-Words are unaware of the context and meaning of words

Back to spaCy, because text features don't have to be unaware of context and meaning!

Word vectors embed a word into a high-dimensional space, based on its "surroundings".

## Word vectors from very simple language models

In [None]:
mango = nlp('mango')
mango.vector.shape

In [None]:
mango = nlp('mango').vector
strawberry = nlp('strawberry').vector
brick = nlp('brick').vector

# Squared Euclidean distance between each pair of vectors (smaller means closer)
print(((mango - strawberry)**2).sum())
print(((mango - brick)**2).sum())
print(((strawberry - brick)**2).sum())

## Similarity is also baked in, but only for "decent" language models

In [None]:
# If you do not download the larger language model like below,
# spaCy will complain about not having word vectors for the similarity
# measures. Download the larger model to your computer using 
# $ python -m spacy download en_core_web_lg
# Note that this takes about 800 MB of disk space
nlp = spacy.load('en_core_web_lg')
tokens = nlp('mango strawberry brick')
for i, token1 in enumerate(tokens):
    # Similarities are symmetric, so each pair only needs to be computed once
    for token2 in tokens[i+1:]:
        print(token1.text, token2.text, token1.similarity(token2))
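
By default, spaCy's `similarity` is a cosine similarity between vectors. The sketch below computes that measure by hand with NumPy. The three 3-dimensional vectors are made up purely for illustration (real word vectors have hundreds of dimensions), and `cosine_similarity` is a hypothetical helper name.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Made-up "word vectors", purely for illustration
mango_vec = np.array([0.9, 0.8, 0.1])
strawberry_vec = np.array([0.8, 0.9, 0.2])
brick_vec = np.array([0.1, 0.2, 0.9])

print("mango / strawberry:", round(cosine_similarity(mango_vec, strawberry_vec), 3))
print("mango / brick:     ", round(cosine_similarity(mango_vec, brick_vec), 3))
```

Unlike the squared distances used earlier, cosine similarity ignores the length of the vectors and only compares their direction.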

# On to Transformer-based language models!