## 💻 UnpackAI DL201 Bootcamp - Week 2 - Skills: NLP

### 📕 Learning Objectives

* Reinforce the need for data processing not only for NLP but for most machine learning tasks.
* Review common data processing steps for NLP tasks.

### 📖 Concepts map
* Text formatting
* Tokenization
* Stemming
* Lemmatization
* Stopwords

As in most machine learning tasks, data preprocessing is a key step in training a model, since faulty or poor-quality data results in poor performance. Text preprocessing in NLP is the set of techniques that format and correct the structure of the text, remove unwanted characters and words, simplify and highlight the semantic meaning of the text, and transform the text into a form that a machine learning algorithm can use.

Preprocessing tasks are more standardized than the ones used for Computer Vision and tabular data analysis. Although there are differences (not all tasks require the same level of preprocessing), some steps are reused, often in the same order. Below is a brief description of some of these tasks.

These tasks are part of the **morphological and lexical analysis** of the text, which sits at the bottom of the NLP pipeline (text matching).

- Text Integration: Combining text from different sources into a single corpus.
- Text Formatting: Cleaning and formatting text.
    - Removal of punctuation.
    - Lowercasing.
    - Removal of stopwords.
    - Removal of numbers (or replacement with their word form, e.g. "3" becomes "three").
    - Removal of special characters (e.g. HTML tags, URLs, string patterns, etc.)
    - Removal of short words (e.g. words with fewer than 3 characters).
    - Removal of repeated words.
    - Removal of rare words (e.g. words that appear only once or only in a few documents).
- Text segmentation: Splitting text into sentences.
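
The steps above can be sketched with the standard library alone. This is a minimal illustration, not a production pipeline: the stopword set, the sample documents, and the helper names (`integrate`, `split_sentences`, `format_text`) are assumptions made for this example, and the sentence splitter is deliberately naive.

```python
import re

# Hypothetical, tiny stopword set used only for this illustration
STOPWORDS = {"the", "a", "an", "is", "in", "of", "and", "to", "it"}

def integrate(*sources):
    """Text integration: merge several lists of documents into one corpus."""
    return [doc for source in sources for doc in source]

def split_sentences(text):
    """Text segmentation: naive split on whitespace that follows '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def format_text(text, min_len=3):
    """Text formatting: apply several of the cleaning steps listed above."""
    text = text.lower()                        # lowercasing
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs
    text = re.sub(r"<[^>]+>", " ", text)       # remove HTML tags
    text = re.sub(r"\d+", " ", text)           # remove numbers
    text = re.sub(r"[^\w\s]", " ", text)       # remove punctuation
    # remove stopwords and words shorter than min_len characters
    words = [w for w in text.split() if w not in STOPWORDS and len(w) >= min_len]
    return " ".join(words)

# Two made-up sources standing in for real data
web_pages = ["<p>The Genie lives in a lamp. Visit https://example.com now!</p>"]
books = ["Aladdin found 3 rings. The rings were magical."]

corpus = integrate(web_pages, books)
sentences = [s for doc in corpus for s in split_sentences(doc)]
cleaned = [format_text(s) for s in sentences]
print(cleaned)
```

The order of the steps matters: URLs and HTML tags are removed before punctuation, otherwise stripping the punctuation first would leave fragments such as `https` and `p` behind.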

Other processing steps operate at the **semantic level**, which affects the meaning of the text (often rule-based).

- Spell checking: Correcting misspelled words.
- Grammar checking: Correcting grammatical errors.
- Stemming: Reducing words to a stem by stripping suffixes; the result is not always a real word.
- Lemmatization: Reducing words to their dictionary form (lemma) by using a dictionary of known words and roots.

Example of Stemming. From: https://i0.wp.com/trevorfox.com/wp-content/uploads/2018/07/stemming-example.png

![](https://i0.wp.com/trevorfox.com/wp-content/uploads/2018/07/stemming-example.png?fit=500%2C605&ssl=1)

Comparison with lemmatization. From: https://medium.com/swlh/introduction-to-stemming-vs-lemmatization-nlp-8c69eb43ecfe

![](https://tse3-mm.cn.bing.net/th/id/OIP-C.2K4VxxRtewNw4iP-Kh5Z7QHaEH?pid=ImgDet&rs=1)
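
To make the difference concrete, here is a toy sketch that contrasts suffix stripping with dictionary lookup. The suffix list, the lemma dictionary, and both functions are invented for this illustration; real stemmers and lemmatizers (such as NLTK's `PorterStemmer` and `WordNetLemmatizer`, used later in this notebook) apply far richer rules and vocabularies.

```python
# Toy suffix list and lemma dictionary: hypothetical, for illustration only
SUFFIXES = ("ing", "ies", "ed", "es", "s")
LEMMAS = {"studies": "study", "studying": "study", "better": "good",
          "was": "be", "mice": "mouse"}

def toy_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in the dictionary; unknown words are returned unchanged."""
    return LEMMAS.get(word, word)

for word in ["studies", "studying", "better", "mice"]:
    print(f"{word:10s} stem={toy_stem(word):8s} lemma={toy_lemmatize(word)}")
```

With this toy setup, "studies" and "studying" get different stems ("stud" and "study") but share the lemma "study", and irregular forms such as "better" or "mice" are only handled by the dictionary lookup.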

Additionally, text enrichment can be applied to add semantic information that the original text did not carry explicitly (machine-learning based).
- POS Tag: Part of speech tagging.
- Entity Recognition: Recognizing named entities.
- Entity relation extraction: Extracting relations between named entities.
- Dependency parsing: Parsing the sentence into a tree structure.

POS Tagging. From: https://www.researchgate.net/publication/337460636_Unpacking_the_Smart_Mobility_Concept_in_the_Dutch_Context_Based_on_a_Text_Mining_Approach

![](https://www.researchgate.net/publication/337460636/figure/download/fig1/AS:828223747284992@1574475337385/Example-of-part-of-speech-POS-tagging-and-lemmatization-for-two-example-sentences-The.ppm)

Entity recognition and dependency parsing. From: https://stanfordnlp.github.io/CoreNLP/

![](https://stanfordnlp.github.io/CoreNLP/assets/images/ner.png)

Entity relation-extraction. From: https://www.mdpi.com/2079-9292/9/10/1637

![](https://www.mdpi.com/electronics/electronics-09-01637/article_deploy/html/images/electronics-09-01637-g001.png)

Then we have text vectorization, which is the process of converting the text into a vector representation. This step is required to train a machine learning model.

Vectorized representations of text are usually obtained via:
- Bag of words: A vector representation of the text is obtained by counting the number of times each word appears in the text.
- TF-IDF: The count of each word in a document (term frequency) is down-weighted by the number of documents in which the word appears (inverse document frequency), so words that occur in almost every document contribute little.
- Word embeddings: A vector representation of the text is obtained by using a word embedding model to represent the text.
    - Word2Vec.
    - BERT.
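
The first two representations are simple enough to compute by hand. The sketch below uses a made-up three-document corpus and the unsmoothed variant `tf * log(N / df)`; this is only one common definition of TF-IDF, and libraries often use smoothed or normalized variants, so the exact numbers will differ from theirs.

```python
import math
from collections import Counter

# Made-up corpus for illustration
docs = ["the genie lives in the lamp",
        "the lamp is old",
        "aladdin rubs the lamp"]

tokenized = [doc.split() for doc in docs]
vocab = sorted({word for tokens in tokenized for word in tokens})

def bag_of_words(tokens):
    """Count how many times each vocabulary word appears in one document."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]

# Document frequency: in how many documents does each word appear?
n_docs = len(tokenized)
doc_freq = {word: sum(word in tokens for tokens in tokenized) for word in vocab}

def tf_idf(tokens):
    """Weight each count by log(N / df), an unsmoothed inverse document frequency."""
    counts = Counter(tokens)
    return [counts[word] * math.log(n_docs / doc_freq[word]) for word in vocab]

bow = [bag_of_words(tokens) for tokens in tokenized]
tfidf = [tf_idf(tokens) for tokens in tokenized]

print(vocab)
print(bow[0])
print([round(value, 3) for value in tfidf[0]])
```

Words such as "the" and "lamp" appear in every document, so their TF-IDF weight is zero here even though "the" has the highest raw count in the first document.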

From here, the vectorized representation becomes the input for the machine learning algorithm. The input is always a vector representation of the text; the output depends on the task:
- Classification: a class label.
- Regression: a real number.
- Clustering: a cluster label.
- Recommendation: a list of recommendations.
- Sentiment analysis: a sentiment label or score.
- Topic modeling: a list of topics.
- Text summarization: a shorter text, such as a list of selected sentences.
- Text translation: the text in the target language.

### Revisit the previous example

Implement a few of the preprocessing steps mentioned above.

In [None]:
# Import libraries
import numpy as np
import pandas as pd
import torch
import requests
from transformers import BertTokenizer, BertModel
import nltk
from nltk import FreqDist
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

In [None]:
# Download dependencies
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')    # Extra WordNet data that some NLTK versions need for lemmatization
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = BertModel.from_pretrained('bert-base-uncased')

In [None]:
# Load a sample text from the provided URL into a dataframe
response = requests.get('http://www.textfiles.com/stories/alad10.txt')
response.raise_for_status()                                # Fail early if the download did not succeed
sample_text = response.text
sentences = sample_text.splitlines()                       # Split text into lines (a rough stand-in for sentences)
df = pd.DataFrame(sentences, columns=['sentence'])

In [None]:
# Text cleaning (morphological changes)
df['sentence'] = df['sentence'].str.lower()                              # Lowercase
df = df[df['sentence'].str.split().str.len() > 3].copy()                 # Keep sentences with more than 3 words
df['sentence'] = df['sentence'].str.replace(r'[^\w\s]', '', regex=True)  # Remove punctuation
max_len = df['sentence'].str.len().max()                                 # Length of the longest sentence, in characters
df['sentence'].head(5)

In [None]:
# Remove stopwords
eng_stopwords = stopwords.words('english')
print(eng_stopwords[-10:])
stopword_set = set(eng_stopwords)                           # Set membership checks are faster than list lookups
df['sentence'] = df['sentence'].apply(lambda x: ' '.join([word for word in x.split() if word not in stopword_set]))
df['sentence'].head(5)

In [None]:
# Apply lemmatization
lemmatizer = WordNetLemmatizer()
df['sentence'] = df['sentence'].apply(lambda x: ' '.join([lemmatizer.lemmatize(word) for word in x.split()]))
df['sentence'].head(5) 

In [None]:
# Apply stemming
stemmer = PorterStemmer()
df['sentence'] = df['sentence'].apply(lambda x: ' '.join([stemmer.stem(word) for word in x.split()]))
df['sentence'].head(5)

In [None]:
# Tokenize the sentences and map each token to its id
tokens_df = df.copy()
tokens_df['tokenized_sentence'] = tokens_df['sentence'].apply(bert_tokenizer.tokenize)
tokens_df['numericalized_sentence'] = tokens_df['tokenized_sentence'].apply(bert_tokenizer.convert_tokens_to_ids)
tokens_df.sample(10)

In [None]:
# Add the [CLS] and [SEP] special tokens, then pad every sentence to the same length
tokens_df['numericalized_sentence'] = tokens_df['numericalized_sentence'].apply(lambda x: [bert_tokenizer.cls_token_id] + x + [bert_tokenizer.sep_token_id])
max_len = tokens_df['numericalized_sentence'].apply(len).max()   # Longest sentence in tokens, special tokens included
tokens_df['numericalized_sentence'] = tokens_df['numericalized_sentence'].apply(lambda x: x + [bert_tokenizer.pad_token_id] * (max_len - len(x)))
tokens_df['numericalized_sentence'].sample(10)
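
The special-token and padding step can be looked at in isolation on plain Python lists. In the sketch below the helper name and the word-piece ids in `batch` are made up for illustration; the ids 101, 102 and 0 are the `[CLS]`, `[SEP]` and `[PAD]` ids of `bert-base-uncased`, but in real code they should be read from the tokenizer (`cls_token_id`, `sep_token_id`, `pad_token_id`) rather than hard-coded.

```python
# Special token ids, assumed here to match bert-base-uncased
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0

def add_special_tokens_and_pad(batch, pad_id=PAD_ID):
    """Wrap each id list in [CLS] ... [SEP], pad to the longest one, and build a mask."""
    wrapped = [[CLS_ID] + ids + [SEP_ID] for ids in batch]
    max_len = max(len(ids) for ids in wrapped)
    padded = [ids + [pad_id] * (max_len - len(ids)) for ids in wrapped]
    # 1 marks a real token, 0 marks padding
    mask = [[int(tok != pad_id) for tok in ids] for ids in padded]
    return padded, mask

# Arbitrary word-piece ids standing in for two tokenized sentences
batch = [[7592, 2088], [2023, 2003, 1037, 3231]]
padded, mask = add_special_tokens_and_pad(batch)
print(padded)
print(mask)
```

The mask records which positions hold real tokens, which is the information a model needs in order to ignore padding.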

In [None]:
# Extract encoded value to a Tensor
numericalized_sentences = tokens_df['numericalized_sentence'].values
numericalized_sentences = [list(x) for x in numericalized_sentences]
numericalized_sentences = np.array(numericalized_sentences)
numericalized_sentences = torch.from_numpy(numericalized_sentences)
print(numericalized_sentences.shape)

In [None]:
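# Encode the numericalized sentences using BERT
attention_mask = (numericalized_sentences != bert_tokenizer.pad_token_id).long()   # 1 for real tokens, 0 for padding
with torch.no_grad():                                                              # Inference only, no gradients needed
    encoded_sentences = bert_model(numericalized_sentences, attention_mask=attention_mask)[0]
encoded_sentences = encoded_sentences.numpy()
print(encoded_sentences.shape)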

In [None]:
# Sum the token embeddings of each sentence
encoded_sentences = np.sum(encoded_sentences, axis=1)
print(encoded_sentences.shape)
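
Summing along the token axis, as done above, also adds up the vectors at padding positions. A possible alternative, shown here only as a sketch and not as what this notebook does, is to average over the real tokens. The vectors, the mask and both helper functions below are made up for illustration and use plain lists instead of arrays.

```python
def sum_pool(token_vectors):
    """Sum all token vectors, padding positions included."""
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in token_vectors) for d in range(dim)]

def masked_mean_pool(token_vectors, mask):
    """Average only the token vectors whose mask value is 1."""
    kept = [vec for vec, keep in zip(token_vectors, mask) if keep]
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]

# Three 2-dimensional "token embeddings"; the last row stands in for a padding position
token_vectors = [[1.0, 2.0], [3.0, 4.0], [10.0, 10.0]]
mask = [1, 1, 0]

print(sum_pool(token_vectors))
print(masked_mean_pool(token_vectors, mask))
```

Averaging over real tokens keeps sentence vectors comparable regardless of sentence length or the amount of padding, at the cost of carrying the mask along.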

In [None]:
# Use PCA to reduce the embedding dimensionality to 3
pca = PCA(n_components=3)
pca.fit(encoded_sentences)
reduced_embeddings = pca.transform(encoded_sentences)
print(reduced_embeddings.shape)

In [None]:
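# Plot the 3D embeddings
fig = plt.figure()
ax = fig.add_subplot(projection='3d')    # Axes3D(fig) no longer attaches the axes to the figure in recent matplotlib
ax.scatter(reduced_embeddings[:, 0], reduced_embeddings[:, 1], reduced_embeddings[:, 2])
plt.show()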

### Exercise: experiment!

* Combine text from at least two different sources.
* Try different NLP libraries.
* Perform an expanded NLP pipeline (check spelling, POS tagging, entity recognition, dependency parsing, etc.)
