# Advanced Natural Language Processing (NLP)

***Summary***
- [Load Data](#load-data) <br>
- [Preprocessing](#preprocessing) <br>
- [Feature Extraction](#feature-extraction) <br>
- [Clustering](#clustering) <br>

In this Jupyter Notebook you will apply state-of-the-art NLP methods to 357 media releases published by the municipality of St. Gallen.
The corresponding dataset can be found [here](https://daten.stadt.sg.ch/explore/dataset/newsfeed-stadtverwaltung-stgallen/table/?sort=published).
Currently, the dataset comprises over 300 HTML files which are neither structured nor assigned to consistent categories.<br><br>

Our aim is to group the texts into clusters which are related in content.
For this purpose we will clean the raw data and extract embedding vectors for each news release, utilizing a cutting-edge pretrained neural network (transformer).
These embedding vectors are then reduced in dimensionality by applying a modern manifold learning technique.
Finally, we will cluster the reduced vectors and analyse the quality of the resulting clusters.

In [None]:
# Install required packages
!pip install --upgrade pip
!pip install --user sentence-transformers==2.2.2
!pip install --user protobuf==3.19.4
!pip install --user umap-learn==0.5.3

In [None]:
# Import libraries
%matplotlib inline

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

<a id='load-data'></a>
## I. Load Data
We load the data into a pandas dataframe.
Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, which is often used in machine learning to preprocess raw data.
After loading the data, we sort the samples by publication date and print the first five samples (`head()`).<br><br>
The relevant information is in the columns `Title` and `Text`.
However, the `Text` is provided in raw form and contains many HTML tokens, as shown below.

In [None]:
path = '../data/newsfeed-stadtverwaltung-stgallen.csv'
df = pd.read_csv(path, sep=';')
df.sort_values(by=['Veröffentlicht'], inplace=True)
df.reset_index(drop=True, inplace=True)
df.head()

In [None]:
df.loc[0,'Text']

<a id='preprocessing'></a>
## II. Preprocessing
In a first step, we will get rid of the HTML tokens and superfluous characters (line breaks and non-breaking spaces) by applying `BeautifulSoup` and a regular expression.
If you want more information on how to use `BeautifulSoup`, see [here](https://stackabuse.com/guide-to-parsing-html-with-beautifulsoup-in-python/).
The corresponding documentation can be found [here](https://beautiful-soup-4.readthedocs.io/en/latest/).<br>

In a second step, we will create a new dataframe `df_subset_1` that contains only the `Title` and `Text` columns of the original dataframe.

In [None]:
from bs4 import BeautifulSoup
import re

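# Try the cleaning steps on a single release first:
# strip the HTML markup, drop line breaks and remove non-breaking spaces (\xa0)
soup = BeautifulSoup(df.loc[0,'Text'], 'html.parser')
re.sub(r'\xa0+', '', soup.text.replace('\n', ''), flags=re.MULTILINE)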

In [None]:
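# Strip the HTML markup from every release and drop line breaks
df['Text'] = df['Text'].apply(lambda x: BeautifulSoup(x, 'html.parser').get_text().replace('\n', ''))
# Remove non-breaking spaces (\xa0) left over from the HTML
df['Text'] = df['Text'].apply(lambda x: re.sub(r'\xa0+', '', x, flags=re.MULTILINE))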
df.head()

In [None]:
df_subset_1 = df.drop(['Link','Veröffentlicht','bild_url','bild'], axis=1)
df_subset_1.head()

Next, we concatenate the `Title` and `Text` into a single column, which then forms the input to our machine learning algorithm.

In [None]:
df_subset_2 = pd.Series(df_subset_1['Title'] + '. ' + df_subset_1['Text'], name='Text').to_frame()
df_subset_2.head()

<a id='feature-extraction'></a>
## III. Feature Extraction
In a next step, we will use a sentence transformer which was pretrained on a large corpus.
A sentence transformer is a neural network which was trained to calculate an embedding vector from a sentence / text.
These embedding vectors represent the content of a text.
In other words, texts which have a similar meaning result in similar embedding vectors, whereas texts which are different in content result in different embedding vectors.<br><br>
For more information on sentence transformers, see [here](https://arxiv.org/abs/1908.10084).<br>
For more information on embedding vectors, see [here](https://towardsdatascience.com/word-embeddings-for-nlp-5b72991e01d4).<br>
For more information on transformers, see [here](https://towardsdatascience.com/transformers-89034557de14).
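
To make "similar embedding vectors" more concrete, here is a minimal sketch that compares vectors by their cosine similarity. The three 4-dimensional vectors are made-up toy values, not real model outputs (real embeddings have far more dimensions). The last lines illustrate that, for vectors scaled to unit length, Euclidean distance and cosine similarity carry the same information.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1 = same direction)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical toy "embeddings", chosen by hand for illustration only
text_a = np.array([0.9, 0.1, 0.0, 0.2])
text_b = np.array([0.8, 0.2, 0.1, 0.3])  # points in a similar direction as text_a
text_c = np.array([0.0, 0.9, 0.8, 0.0])  # points in a very different direction

sim_ab = cosine_similarity(text_a, text_b)
sim_ac = cosine_similarity(text_a, text_c)
print(f'similarity(a, b) = {sim_ab:.3f}')
print(f'similarity(a, c) = {sim_ac:.3f}')

# For unit-length vectors: squared Euclidean distance = 2 - 2 * cosine similarity
unit_a = text_a / np.linalg.norm(text_a)
unit_b = text_b / np.linalg.norm(text_b)
squared_distance = float(np.sum((unit_a - unit_b) ** 2))
print(np.isclose(squared_distance, 2 - 2 * sim_ab))
```

In this toy setup, `text_a` and `text_b` play the role of two releases on a related topic, while `text_c` stands for an unrelated one.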

Fortunately, there exist neural sentence transformers that have been trained on German corpora.
Thus, we can skip the tedious work of German-English translation and feed the German texts directly into the model (no further preprocessing needed).<br>

The transformer is downloaded from [huggingface](https://huggingface.co/T-Systems-onsite/cross-en-de-roberta-sentence-transformer).

In [None]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('T-Systems-onsite/cross-en-de-roberta-sentence-transformer')

In [None]:
# Encode all texts; normalize_embeddings=True scales each vector to unit length
corpus_embeddings = model.encode(df_subset_2['Text'].tolist(), normalize_embeddings=True)
corpus_embeddings.shape

As you can see above, each news release is now represented by a 768-dimensional embedding vector.
It is generally not recommended to work with vectors of such high dimension (remember the "curse of dimensionality" from lecture 4), especially if the distances between the vectors are relevant for the clustering algorithm.
Thus, we will apply a dimensionality reduction method to reduce the number of dimensions to 50.<br><br>
For this purpose we will use a state-of-the-art manifold learning technique called `Uniform Manifold Approximation and Projection` (UMAP).
UMAP arranges the samples in the 50-dimensional space in such a way that their arrangement in the 768-dimensional space (distances and densities) is approximated.
This is achieved by an iterative minimization of a cost function.<br><br>
[Here](https://towardsdatascience.com/how-exactly-umap-works-13e3040e1668) you can find a more detailed explanation of how UMAP works.<br>
[Here](https://arxiv.org/abs/1802.03426) you can find the UMAP paper.
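
The "curse of dimensionality" mentioned above can be made concrete with a small experiment. The sketch below uses uniformly random points (an assumption made purely for illustration, not the news embeddings) and measures the relative contrast between the farthest and the nearest neighbour of a random query point. The helper `relative_contrast` is a hypothetical name introduced here. As the number of dimensions grows, the contrast shrinks, which means that distances become less informative.

```python
import numpy as np

def relative_contrast(n_dims, n_points=500, seed=0):
    """Return (d_max - d_min) / d_min for the distances from one random query point."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_dims))  # uniform samples in the unit cube
    query = rng.random(n_dims)
    distances = np.linalg.norm(points - query, axis=1)
    return (distances.max() - distances.min()) / distances.min()

for n_dims in (2, 50, 768):
    contrast = relative_contrast(n_dims)
    print(f'{n_dims:>4} dimensions: relative contrast = {contrast:.2f}')
```

A low relative contrast means that the nearest and the farthest neighbour are almost equally far away, which is one reason to reduce the dimensionality before running a distance-based clustering algorithm.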

In [None]:
import umap

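n_components = 50

# Reduce the 768-dimensional embeddings to 50 dimensions for clustering
reducer = umap.UMAP(n_components=n_components)
embedding_reduced = reducer.fit_transform(corpus_embeddings)
# Separate 2-dimensional projection, used only for plotting
embedding_plot = umap.UMAP(n_components=2).fit_transform(corpus_embeddings)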

<a id='clustering'></a>
## IV. Clustering
As a final processing step, we will group the texts (more accurately, the extracted embedding vectors) into different clusters.
Texts (embedding vectors) that are in the same group should have similar properties, while texts (embedding vectors) in different groups should have highly dissimilar properties.
Clustering is an unsupervised machine learning task: there are no ground-truth labels, so the results are difficult to evaluate.
There exist many different clustering methods (see [here](https://towardsdatascience.com/the-5-clustering-algorithms-data-scientists-need-to-know-a36d136ef68) for a theoretical explanation):
- [KMeans](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html)
- [AgglomerativeClustering](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html)
- [DBSCAN](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html)

In this example we will apply KMeans to the embedding vectors, to cluster the texts into four groups (the number of groups was chosen arbitrarily).
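
Because the number of clusters is chosen arbitrarily here, it can be useful to compare several candidate values. One common heuristic is to fit KMeans for a range of cluster counts and keep the one with the highest silhouette score (the score is explained further below). The following sketch illustrates the idea on synthetic data generated with `make_blobs`; the toy data, the candidate range and the variable names are assumptions for illustration, and the outcome on the news embeddings may look very different.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the reduced embeddings: three well-separated blobs
X, _ = make_blobs(
    n_samples=300,
    centers=[(0, 0), (10, 0), (0, 10)],
    cluster_std=1.0,
    random_state=0,
)

# Fit KMeans for several candidate cluster counts and record the silhouette score
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels, metric='euclidean')
    print(f'k = {k}: silhouette score = {scores[k]:.3f}')

best_k = max(scores, key=scores.get)
print('Best k by silhouette score:', best_k)
```

To try this on the notebook's data, `X` would be replaced by `embedding_reduced`. Real text data rarely shows a peak as clear as that of the toy blobs.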

In [None]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

n_clusters = 4

km = KMeans(n_clusters=n_clusters, random_state=10)
cluster_assignment = km.fit_predict(embedding_reduced)

silhouette_avg = silhouette_score(embedding_reduced, cluster_assignment, metric='euclidean')
print('Silhouette Coefficient: {:0.3f}'.format(silhouette_avg))

The following illustration shows the four clusters, with the embedding vectors reduced to two dimensions (just for the sake of this plot).
Note that the clustering algorithm was applied to the 50-dimensional data.
That is why the clusters appear to overlap in some places in the two-dimensional plot.

In [None]:
import matplotlib.cm as cm
fig, ax = plt.subplots(1,1, figsize=(10,10))
ax.scatter(embedding_plot[:,0],embedding_plot[:,1], s=50, c=cm.nipy_spectral(np.float64(km.labels_) / n_clusters))

It is generally very difficult to assess the performance of an unsupervised learning procedure because, by definition, the data do not include any ground truth to which the prediction could be compared.
However, there are some scores which can indicate the cluster quality.
One of them is the silhouette plot / silhouette score.<br><br>
With the silhouette plot we first calculate a silhouette value for each data sample, which is a measure of how similar a sample is to its own cluster (cohesion) compared to other clusters (separation).
It ranges from -1 to +1, with -1 representing poor cohesion / separation and +1 representing good cohesion / separation.
If all these silhouette values are sorted by value, visualized as bar plot and colorized according to the cluster assignment, we get the silhouette plot (see below).
If all the bars have about the same length and are clearly positive, then the clustering algorithm was able to find distinct clusters.
The average over all silhouette values corresponds to the silhouette score.<br><br>
More information on the silhouette method can be found [here](https://en.wikipedia.org/wiki/Silhouette_(clustering)).<br>
The following code was taken from [here](https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html).
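
To see what a single silhouette value measures, it can be computed by hand. For a sample `i`, let `a` be the mean distance to the other samples of its own cluster (cohesion) and `b` the smallest mean distance to the samples of any other cluster (separation); the silhouette value is then `(b - a) / max(a, b)`. The sketch below is a plain NumPy version of this definition, applied to four made-up points. The function name `silhouette_values` and the toy data are assumptions for illustration; in the notebook itself, `silhouette_samples` from scikit-learn does this job.

```python
import numpy as np

def silhouette_values(X, labels):
    """Silhouette value (b - a) / max(a, b) for every sample (needs >= 2 clusters)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all samples
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    values = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            continue  # convention: a cluster with a single sample gets the value 0
        # a: mean distance to the other members of the own cluster (dist[i, i] is 0)
        a = dist[i, own].sum() / (n_own - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(dist[i, labels == other].mean()
                for other in np.unique(labels) if other != labels[i])
        values[i] = (b - a) / max(a, b)
    return values

# Toy data: two tight groups that lie far apart
X_toy = [[0, 0], [0, 1], [10, 0], [10, 1]]
labels_toy = [0, 0, 1, 1]
toy_values = silhouette_values(X_toy, labels_toy)
print(toy_values)
print('Silhouette score (mean):', toy_values.mean())
```

Since the two groups are compact and well separated, all four values are close to +1. Swapping the labels of two points from different groups would push their values towards -1.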

In [None]:
import matplotlib.cm as cm

fig, ax = plt.subplots(1,1)
fig.set_size_inches(18, 7)

ax.set_xlim([-0.1, 1])
ax.set_ylim([0, len(embedding_reduced) + (n_clusters + 1) * 10])

# Compute the silhouette scores for each sample
sample_silhouette_values = silhouette_samples(embedding_reduced, cluster_assignment, metric='euclidean')

y_lower = 10
for i in range(n_clusters):
    # Aggregate the silhouette scores for samples belonging to
    # cluster i, and sort them
    ith_cluster_silhouette_values = \
        sample_silhouette_values[cluster_assignment == i]

    ith_cluster_silhouette_values.sort()

    size_cluster_i = ith_cluster_silhouette_values.shape[0]
    y_upper = y_lower + size_cluster_i

    color = cm.nipy_spectral(float(i) / n_clusters)
    ax.fill_betweenx(np.arange(y_lower, y_upper),
                        0, ith_cluster_silhouette_values,
                        facecolor=color, edgecolor=color, alpha=0.7)

    # Label the silhouette plots with their cluster numbers at the middle
    ax.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))

    # Compute the new y_lower for next plot
    y_lower = y_upper + 10  # 10 for the 0 samples

ax.set_title("The silhouette plot for the various clusters.")
ax.set_xlabel("The silhouette coefficient values")
ax.set_ylabel("Cluster label")

# The vertical line for average silhouette score of all the values
ax.axvline(x=silhouette_avg, color="red", linestyle="--")

ax.set_yticks([])  # Clear the yaxis labels / ticks
ax.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])

plt.show()

As part of the evaluation, we print the titles of the four news releases closest to each cluster center.
Can you find the similarities / dissimilarities?

In [None]:
from sklearn.metrics.pairwise import euclidean_distances

doc_class = {}

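for class_label in np.unique(km.labels_):
    # Select the embeddings and dataframe rows that belong to this cluster
    embedding_subset = embedding_reduced[km.labels_==class_label]
    df_subset = df_subset_1[km.labels_==class_label]

    # Cluster center as an array of shape (1, n_components)
    cluster_center = km.cluster_centers_[np.newaxis,class_label]

    # Sort the samples by their distance to the cluster center (closest first)
    metrics = euclidean_distances(embedding_subset, cluster_center)
    idx_sample = np.argsort(metrics[:,0])

    # Store the entries of the first column, ordered by distance
    doc_class[class_label] = df_subset.iloc[idx_sample, 0].tolist()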

In [None]:
for class_label, texts in doc_class.items():
    print('Class {:d}'.format(class_label))
    for text in texts[:4]:
        print(text)
    print()