# Introduction to Natural Language Processing (NLP)

***Summary***
- [Load Data](#load-data) <br>
- [Preprocessing](#preprocessing) <br>
- [Feature Extraction](#feature-extraction) <br>
- [Clustering](#clustering) <br>

In this Jupyter Notebook you will apply conventional NLP methods to 323 media releases of the St. Gallen City Police.
The corresponding dataset can be found [here](https://daten.stadt.sg.ch/explore/dataset/newsfeed-stadtpolizei-stgallen-medienmitteilungen/table/?sort=published).
Currently, the dataset comprises over 300 HTML files that are neither structured nor assigned to consistent categories.<br><br>

Our aim is to group the texts into clusters which are related in content.
To this end, we will first clean the raw data, translate it into English, and reduce the texts to expressive words in their root form.<br>
Next, a feature vector is extracted for each document which should represent the content of the document.
Finally, we will cluster these vectors and analyse the quality of the resulting clusters.<br><br>

Parts of this Jupyter Notebook were copied from [this tutorial](https://scikit-learn.org/stable/auto_examples/text/plot_document_clustering.html).

In [None]:
# Install required packages
!pip install --upgrade pip
!pip install --user deep-translator==1.8.3
!pip install --user nltk
!pip install --user contractions==0.1.72

In [None]:
# Import libraries
%matplotlib inline

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

<a id='load-data'></a>
## I. Load Data
We load the data into a pandas dataframe.
Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, which is often used in machine learning to preprocess raw data.<br><br>
The relevant information is in the columns `Title` and `Text`.
However, the `Text` column is provided in raw form and contains many HTML tags, as shown below.

In [None]:
path = '../data/newsfeed-stadtpolizei-stgallen-medienmitteilungen.csv'
df = pd.read_csv(path, sep=';')
df.sort_values(by=['Veröffentlicht'], inplace=True)
df.reset_index(drop=True, inplace=True)
df.head()

In [None]:
df.loc[0,'Text']

<a id='preprocessing'></a>
## II. Preprocessing
In a first step, we will get rid of the HTML tags and useless characters by applying `BeautifulSoup`.
If you want more information on how to use `BeautifulSoup`, see [here](https://stackabuse.com/guide-to-parsing-html-with-beautifulsoup-in-python/).
The corresponding documentation can be found [here](https://beautiful-soup-4.readthedocs.io/en/latest/).<br>

In a second step, we will create a new dataframe `df_subset_1`, which contains only the `Title` and `Text` columns of the original dataframe.

In [None]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(df.loc[0,'Text'], 'html.parser')
soup.text.replace('\n', '')

In [None]:
df['Text'] = df['Text'].apply(lambda x: BeautifulSoup(x, 'html.parser').get_text().replace('\n', ''))
df.head()

In [None]:
df_subset_1 = df.drop(['Link','Veröffentlicht','Bild URL','Bild'], axis=1)
df_subset_1.head()

Since many existing NLP tools, including the ones used here, are designed for English text, we first need to translate the German texts.
For this purpose we use the Google Translate service, which is accessed through the Python package `deep-translator`. <br><br>
Since Google blocks us if the translation service is called too often, the texts are translated in batches and we have to wait 2 seconds after each translation step.
Thus, the translation of the entire dataset takes quite a long time (~30 min).
For this reason I have provided the translated dataframe as `df_subset_1_en.csv`, so you can skip the translation cell and load the file directly with the cell that follows it.
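
The batching-with-pause idea described above can be sketched as follows. This is only an illustration under stated assumptions: `translate_fn` is a hypothetical stand-in for the real translation call, the batch size of 10 is arbitrary, and the toy `fake_translate` function merely upper-cases its input so that no network access is needed.

```python
import time
from typing import Callable, List


def translate_in_batches(
    texts: List[str],
    translate_fn: Callable[[List[str]], List[str]],  # hypothetical stand-in for the service call
    batch_size: int = 10,                            # assumed value, not from the notebook
    pause_seconds: float = 2.0,
) -> List[str]:
    """Translate texts batch by batch, pausing between consecutive batches."""
    translated: List[str] = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        translated.extend(translate_fn(batch))
        # Pause before the next request so the service is not called too often.
        if start + batch_size < len(texts):
            time.sleep(pause_seconds)
    return translated


def fake_translate(batch: List[str]) -> List[str]:
    """Toy replacement for a translator: it only upper-cases the text."""
    return [text.upper() for text in batch]


result = translate_in_batches(["eins", "zwei", "drei"], fake_translate,
                              batch_size=2, pause_seconds=0.0)
print(result)
```

With roughly 300 documents, the pauses alone add up quickly, which is one reason why the pre-translated file is convenient.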

In [None]:
from deep_translator import GoogleTranslator

gtr = GoogleTranslator(source='de', target='en')

translated_title = gtr.translate_batch(df_subset_1['Title'].tolist())
translated_text  = gtr.translate_batch(df_subset_1['Text'].tolist())

df_subset_1['Title'] = translated_title
df_subset_1['Text']  = translated_text

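# Save the translated dataframe, then display its last rows
# (only the last expression of a cell is displayed)
df_subset_1.to_csv('../data/df_subset_1_en.csv', sep=';')
df_subset_1.tail()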

In [None]:
df_subset_1_en = pd.read_csv('../data/df_subset_1_en.csv', sep=';', index_col=0)
df_subset_1_en.head()

Next, we concatenate the `Title` and `Text` into a single column, which then forms the input to our machine learning algorithm.

In [None]:
df_subset_2 = pd.Series(df_subset_1_en['Title'] + ' ' + df_subset_1_en['Text'], name='Text').to_frame()
df_subset_2.head()

There are some common preprocessing steps that are applied before training a machine learning model.
Their purpose is to standardize the documents and reduce the number of words.

- Tokenize, i.e. split texts into words ([I love NLP] → [I, love, NLP])
- Expand contractions (I'm → I am)
- Lowercase all words
- Remove stopwords (common words such as "the", "a" and "and", which carry little meaning on their own)
- Keep specific word forms only (e.g. all nouns and adjectives)
- Lemmatize, i.e. reduce inflected words to their dictionary form, which is always a proper word (am → be, was → be, were → be)
- Stem, i.e. reduce inflected words to their stem (root or base form), even if the stem itself is not a valid word (happy → happi)

In our case we will apply the following preprocessing steps:
- Tokenize
- Expand contractions
- Keep alphabetic tokens only
- Remove English stopwords
- Keep only nouns and adjectives (matching the `pos_tags` setting in the code below)
- Lemmatize words

We achieve this by using the class `TextPreprocessor` in the module `textPreprocessing`.<br>
Further information on text preprocessing can be found [here](https://towardsdatascience.com/text-preprocessing-steps-and-universal-pipeline-94233cb6725a).
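
To illustrate some of the steps listed above, here is a minimal sketch in plain Python. It is not the `TextPreprocessor` used in this notebook: the tiny `CONTRACTIONS` and `STOPWORDS` collections are made-up toy examples, and part-of-speech filtering and lemmatization are left out because they need NLTK resources.

```python
import re

# Toy resources for illustration only; the notebook itself relies on NLTK.
CONTRACTIONS = {"i'm": "i am", "don't": "do not", "it's": "it is"}
STOPWORDS = {"i", "am", "the", "a", "and", "do", "not", "it", "is", "in", "on"}


def simple_preprocess(text: str) -> list:
    # Lowercase first so the contraction lookup is case-insensitive.
    text = text.lower()
    # Expand contractions (i'm -> i am).
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # Tokenize and keep alphabetic tokens only (numbers and punctuation are dropped).
    tokens = re.findall(r"[a-z]+", text)
    # Remove stopwords.
    return [tok for tok in tokens if tok not in STOPWORDS]


print(simple_preprocess("I'm reporting 2 burglaries in the city, and it's urgent!"))
# prints ['reporting', 'burglaries', 'city', 'urgent']
```

A lemmatizer would additionally map `burglaries` to `burglary`, which is why the real pipeline ends up with fewer distinct words than this sketch.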

In [None]:
import nltk
from nltk.corpus import stopwords, wordnet
from textPreprocessing import TextPreprocessor

nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt')


language = 'english'
stop_words = set(stopwords.words(language))
# Also drop quote characters and the possessive marker left over from tokenization.
# Extend the stop_word list if appropriate.
stop_words.update({'"', "'", "''", '`', '``', "'s"})

processor = TextPreprocessor(
    language=language,
    pos_tags={wordnet.ADJ, wordnet.NOUN},
    stopwords=stop_words,
    n_jobs=6,
    alpha_only=True,
)

df_subset_2['Processed'] = processor.transform(df_subset_2['Text'])

In [None]:
df_subset_2.loc[0,'Processed']

<a id='feature-extraction'></a>
## III. Feature Extraction
As part of the feature extraction process we determine a feature vector for each standardized, cleaned text.
This feature vector should represent the relevant content of the text in numbers.
For this purpose, we use the so-called `Term Frequency - Inverse Document Frequency` (TF-IDF) method.<br><br>
The TF-IDF represents a document by a vector which has an entry for each word in the corpus.
It assigns each vector entry the number of occurrences of the corresponding word in the document (Term Frequency), down-weighted for words that occur in many documents of the corpus (Inverse Document Frequency).
For more information on TF-IDF, see [here](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) or [here](https://towardsdatascience.com/how-tf-idf-works-3dbf35e568f0).<br><br>
[Scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) provides an implementation of the TF-IDF method, which we will use here.
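
To make the weighting concrete, the following sketch computes TF-IDF vectors by hand for three made-up mini documents. It assumes the smoothed IDF variant `ln((1 + n) / (1 + df)) + 1` followed by scaling each vector to unit length, which is how the scikit-learn documentation describes the default settings; the sketch is a simplified illustration, not the library's implementation.

```python
import math
from collections import Counter

# Three made-up, already preprocessed mini documents.
docs = [
    "police stop car",
    "police report burglary",
    "police stop bicycle thief",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
vocabulary = sorted(doc_freq)

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1 for term in vocabulary}


def tfidf_vector(tokens):
    counts = Counter(tokens)
    raw = [counts[term] * idf[term] for term in vocabulary]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]  # scale to unit length


vectors = [tfidf_vector(tokens) for tokens in tokenized]
for term, weight in zip(vocabulary, vectors[0]):
    print(f"{term:10s} {weight:.3f}")
```

In the first document, `police` receives the smallest non-zero weight because it occurs in every document, while `car` receives the largest because it occurs only there.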

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_df=0.5, min_df=2)
X_tfidf = vectorizer.fit_transform(df_subset_2['Processed'])

print('n_samples: {:d}, n_features: {:d}'.format(*X_tfidf.shape))

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() instead
list(vectorizer.get_feature_names_out())[:10]

As you can see above, the feature vector which results from applying the TfidfVectorizer to this corpus is of dimension 1457.
It is generally not recommended to work with vectors of such high dimension (remember the "curse of dimensionality" from lecture 4), especially if the distances between the vectors are relevant for the clustering algorithm.
Thus, we will apply a dimensionality reduction method to reduce the number of dimensions to 20.<br><br>
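
One way to see why distances become less informative in high dimensions is the small experiment below. It is a sketch on uniformly random points (an assumption; the real TF-IDF vectors are sparse and not uniform) that compares the nearest and the farthest distance from a reference point for different numbers of dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)


def distance_contrast(n_dims: int, n_points: int = 200) -> float:
    """Ratio of the nearest to the farthest distance from one reference point."""
    points = rng.random((n_points, n_dims))
    distances = np.linalg.norm(points[1:] - points[0], axis=1)
    return distances.min() / distances.max()


for n_dims in (2, 20, 1457):
    print(n_dims, round(distance_contrast(n_dims), 3))
```

The closer the ratio gets to 1, the less the nearest and the farthest neighbour differ, which makes distance-based clustering harder.<br><br>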
Truncated Singular Value Decomposition (TruncatedSVD) is a linear dimensionality reduction method which is often used for sparse data (data with many zero entries).
It is basically the same as Principal Component Analysis (PCA), but without prior subtraction of the mean vector (which would turn a sparse vector into a dense vector).
If you want more information on TruncatedSVD, see [here](https://towardsdatascience.com/recommender-system-singular-value-decomposition-svd-truncated-svd-97096338f361).
For a comparison between PCA and TruncatedSVD, see [here](https://stats.stackexchange.com/a/342072).<br><br>
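
The idea of a truncated SVD can be sketched with NumPy on a made-up toy matrix. Note the assumption: this sketch computes a full SVD and then discards the small singular values, which is fine for tiny data but is not how scikit-learn's `TruncatedSVD` works internally on large sparse matrices.

```python
import numpy as np

# Toy document-term matrix: 4 documents (rows) x 5 terms (columns), mostly zeros.
X = np.array([
    [1.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 0.0],
])
k = 2  # number of components to keep

# SVD of the uncentered matrix (no mean subtraction, unlike PCA).
U, s, Vt = np.linalg.svd(X, full_matrices=False)

X_reduced = U[:, :k] * s[:k]   # document coordinates in the k-dimensional space
X_approx = X_reduced @ Vt[:k]  # rank-k approximation of the original matrix

print(X_reduced.shape)
print(np.round(X_approx, 2))
```

Each document is now described by `k` numbers instead of one number per term.<br><br>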
After reducing the dimensions, each feature vector is normalized to unit length. For unit vectors, the Euclidean distance used by KMeans ranks document pairs in the same order as cosine similarity, which usually improves the clustering of text data.
Both steps (dimensionality reduction and normalization) are combined into a single step, using sklearn's [pipeline idea](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html).
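
A common motivation for the normalization step is the relationship between Euclidean distance and cosine similarity on unit vectors. The sketch below checks this relationship numerically on random toy vectors (made-up data, not the notebook's features).

```python
import numpy as np

rng = np.random.default_rng(42)
vectors = rng.normal(size=(5, 20))

# Scale every row to unit Euclidean length (L2 normalization).
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

a, b = unit[0], unit[1]
squared_euclidean = np.sum((a - b) ** 2)
cosine_similarity = a @ b

# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
print(np.isclose(squared_euclidean, 2 - 2 * cosine_similarity))
```

Because of this identity, a small Euclidean distance between normalized vectors always corresponds to a high cosine similarity.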

In [None]:
from sklearn.preprocessing import Normalizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline

n_components = 20

svd = TruncatedSVD(n_components)
normalizer = Normalizer(copy=False)
lsa = make_pipeline(svd, normalizer)

X_svd = lsa.fit_transform(X_tfidf)

explained_variance = svd.explained_variance_ratio_.sum()
print('Explained variance of the SVD step: {:d}%'.format(int(explained_variance * 100)))

In [None]:
X_svd.shape

<a id='clustering'></a>
## IV. Clustering
As a final processing step, we will group the texts (more accurately, the extracted feature vectors) into different clusters.
Texts (feature vectors) that are in the same group should have similar properties, while texts (feature vectors) in different groups should have highly dissimilar properties.
Clustering belongs to the category of unsupervised machine learning and is therefore very difficult to evaluate.
There exist many different clustering methods (see [here](https://towardsdatascience.com/the-5-clustering-algorithms-data-scientists-need-to-know-a36d136ef68) for a theoretical explanation):
- [KMeans](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html)
- [AgglomerativeClustering](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html)
- [DBSCAN](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html)

In this example we will apply KMeans to the feature vectors, to cluster the texts into five groups (the number of groups was chosen arbitrarily).
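
To show what KMeans does conceptually, here is a minimal sketch of the classic alternating procedure (assign each point to its nearest centroid, then move each centroid to the mean of its points). It is an illustration on made-up 2D data, not scikit-learn's implementation, which uses a smarter initialization and several restarts.

```python
import numpy as np


def kmeans_sketch(X, n_clusters, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; not the scikit-learn implementation."""
    rng = np.random.default_rng(seed)
    # Start from randomly chosen data points as initial centroids.
    centroids = X[rng.choice(len(X), size=n_clusters, replace=False)].copy()
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its points.
        new_centroids = centroids.copy()
        for j in range(n_clusters):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break  # centroids no longer move
        centroids = new_centroids
    return labels, centroids


# Two well-separated toy groups in 2D.
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])
labels, centroids = kmeans_sketch(X_toy, n_clusters=2)
print(labels)
```

The cluster numbers themselves carry no meaning; only the grouping of the points matters.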

In [None]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

n_clusters = 5

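# n_init is set explicitly because its default value changed in newer scikit-learn versions
km = KMeans(n_clusters=n_clusters, n_init=10, random_state=10)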
cluster_labels = km.fit_predict(X_svd)

silhouette_avg = silhouette_score(X_svd, cluster_labels, metric='euclidean')
print('Silhouette Coefficient: {:0.3f}'.format(silhouette_avg))

It is generally very difficult to assess the performance of an unsupervised learning procedure because, by definition, the data do not include any ground truth to which the prediction could be compared.
However, there are some scores which can indicate the cluster quality.
One of them is the silhouette plot / silhouette score.<br><br>
For the silhouette plot, we first calculate a silhouette value for each data sample, which measures how similar the sample is to its own cluster (cohesion) compared to other clusters (separation).
It ranges from -1 to +1: values near +1 indicate that a sample fits its own cluster well and lies far from the others, values near 0 indicate that it lies between two clusters, and negative values suggest that it may have been assigned to the wrong cluster.
If the silhouette values are grouped by cluster, sorted within each cluster, and drawn as horizontal bars colored by cluster assignment, we get the silhouette plot (see below).
If the bars of all clusters are positive and of roughly similar length, the clustering algorithm was able to find reasonably distinct clusters.
The average over all silhouette values corresponds to the silhouette score.<br><br>
More information on the silhouette method can be found [here](https://en.wikipedia.org/wiki/Silhouette_(clustering)).<br>
The following code was taken from [here](https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html).
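
The definition of the silhouette value can also be written out directly. The sketch below is a simplified illustration on made-up 1D data with Euclidean distances; in the notebook itself we use scikit-learn's `silhouette_samples`.

```python
import numpy as np


def silhouette_values(X, labels):
    """Silhouette value per sample, using Euclidean distances."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    distances = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    values = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # convention: a single-sample cluster gets the value 0
        # a: mean distance to the other samples of the own cluster (cohesion)
        a = distances[i, same].sum() / (n_same - 1)
        # b: smallest mean distance to the samples of any other cluster (separation)
        b = min(distances[i, labels == other].mean()
                for other in np.unique(labels) if other != labels[i])
        values[i] = (b - a) / max(a, b)
    return values


X_toy = np.array([[0.0], [1.0], [10.0], [11.0]])
labels_toy = np.array([0, 0, 1, 1])
values = silhouette_values(X_toy, labels_toy)
print(values.round(3), values.mean().round(3))
```

For the first sample, `a = 1` and `b = 10.5`, so its silhouette value is `(10.5 - 1) / 10.5 ≈ 0.905`; the mean over all samples is the silhouette score.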

In [None]:
import matplotlib.cm as cm

fig, ax = plt.subplots(1,1)
fig.set_size_inches(18, 7)

ax.set_xlim([-0.5, 1])
ax.set_ylim([0, len(X_svd) + (n_clusters + 1) * 10])

# Compute the silhouette scores for each sample
# sample_silhouette_values = silhouette_samples(X_svd, cluster_labels, metric='cosine')
sample_silhouette_values = silhouette_samples(X_svd, cluster_labels, metric='euclidean')

y_lower = 10
for i in range(n_clusters):
    # Aggregate the silhouette scores for samples belonging to
    # cluster i, and sort them
    ith_cluster_silhouette_values = \
        sample_silhouette_values[cluster_labels == i]

    ith_cluster_silhouette_values.sort()

    size_cluster_i = ith_cluster_silhouette_values.shape[0]
    y_upper = y_lower + size_cluster_i

    color = cm.nipy_spectral(float(i) / n_clusters)
    ax.fill_betweenx(np.arange(y_lower, y_upper),
                        0, ith_cluster_silhouette_values,
                        facecolor=color, edgecolor=color, alpha=0.7)

    # Label the silhouette plots with their cluster numbers at the middle
    ax.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))

    # Compute the new y_lower for next plot
    y_lower = y_upper + 10  # 10 for the 0 samples

ax.set_title("The silhouette plot for the various clusters.")
ax.set_xlabel("The silhouette coefficient values")
ax.set_ylabel("Cluster label")

# The vertical line for average silhouette score of all the values
ax.axvline(x=silhouette_avg, color="red", linestyle="--")

ax.set_yticks([])  # Clear the yaxis labels / ticks
ax.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])

plt.show()

Next, we want to take a closer look at some results.
The following code snippet outputs, for each cluster, the ten terms with the largest weights in the cluster centroid after it has been mapped back to the TF-IDF space.<br><br>
Moreover, in the last cell we print the titles of the four documents closest to each cluster centroid.

In [None]:
print("Top terms per cluster:")

original_space_centroids = svd.inverse_transform(km.cluster_centers_)
order_centroids = original_space_centroids.argsort()[:, ::-1]

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() instead
terms = vectorizer.get_feature_names_out()
for i in range(n_clusters):
    print('Cluster {:d}:'.format(i), end='')
    for ind in order_centroids[i, :10]:
        print(' {:s}'.format(terms[ind]), end='')
    print()

In [None]:
# Rank all documents by their Euclidean distance to each cluster centroid (one column per cluster)
idx_sample = np.argsort(np.linalg.norm(X_svd[...,np.newaxis] - km.cluster_centers_.T[np.newaxis], axis=1), axis=0)
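
The broadcasting in the expression above is compact but not easy to read. The following sketch unpacks it step by step on made-up toy values (`X_toy` and `centers` are hypothetical stand-ins for `X_svd` and `km.cluster_centers_`).

```python
import numpy as np

# Toy data: 4 samples with 2 features, and 2 centroids (hypothetical values).
X_toy = np.array([[0.0, 0.0], [1.0, 0.0], [9.0, 9.0], [10.0, 10.0]])
centers = np.array([[0.0, 0.0], [10.0, 10.0]])

# (4, 2, 1) - (1, 2, 2) broadcasts to (4, 2, 2): sample x feature x centroid.
diff = X_toy[..., np.newaxis] - centers.T[np.newaxis]
# The norm over the feature axis gives a (4, 2) sample-to-centroid distance matrix.
dist = np.linalg.norm(diff, axis=1)
# Sorting each column ranks the samples by their distance to that centroid.
ranking = np.argsort(dist, axis=0)

print(dist.shape)
print(ranking[:2, 0], ranking[:2, 1])
```

The first rows of each column of `ranking` hold the indices of the samples nearest to the corresponding centroid, here samples 0 and 1 for the first centroid and samples 3 and 2 for the second.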

In [None]:
for i in range(n_clusters):
    print('Examples cluster {:d}: '.format(i))
    for idx in idx_sample[:4,i]:
        print(df.loc[idx,'Title'])
    print()