<a href="https://colab.research.google.com/github/unpackAI/DL201/blob/Cohort_7/Week-2/2_6_NLP_Preprocessing_Book.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## 💻 UnpackAI DL201 Bootcamp - Week 2 - Skills: NLP

### 📕 Learning Objectives

* Reinforce the need for data processing not only for NLP but for most machine learning tasks.
* Review common data processing steps for NLP tasks.

### 📖 Concepts map
* Text formatting
* Tokenization
* Stemming
* Lemmatization
* Stopwords

As in most machine learning tasks, data preprocessing is a key step in training a model, because faulty or poor-quality data results in poor performance. Text preprocessing in NLP is the set of techniques that format and correct the structure of the text, remove unwanted characters and words, simplify and highlight its semantic meaning, and transform it into a form that a machine learning algorithm can use.

Preprocessing tasks for text are more standardized than those used for computer vision and tabular data analysis. Although there are differences (not all tasks require the same level of preprocessing), many steps are reused, often in the same order. Below is a brief description of some of these tasks.

These tasks are part of the **morphological and lexical analysis** of the text, which sits at the bottom of the NLP pipeline (text matching).

- Text Integration: Combining text from different sources into a single corpus. (sample image from: https://www.opinosis-analytics.com)

<img src=https://i1.wp.com/www.opinosis-analytics.com/wp-content/uploads/2020/01/customer-support-enquiry-sources.png alt="alt text" title="image Title" height="300"/>

- Text Formatting: Cleaning and formatting text.
    - Removal of punctuation.
    - Lowercasing.
    - Removal of stopwords. (sample image source: https://lionbridge.ai/wp-content/uploads/2019/10/lm_02.png)
    - Removal of numbers (or replacing them with their spelled-out form).
    - Removal of special characters (e.g. HTML tags, URLs, string patterns, etc.).
    - Removal of short words (e.g. words with fewer than 3 characters).
    - Removal of repeated words.
    - Removal of rare words (e.g. words that appear only once or only in a few documents).

<img src=https://lionbridge.ai/wp-content/uploads/2019/10/lm_02.png alt="alt text" title="image Title" height="300"/>

- Text segmentation: Splitting text into sentences.
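
The formatting and segmentation steps listed above can be sketched with the standard library alone. This is a minimal illustration, not a production cleaner: the `STOPWORDS` set, the regular expressions, and the helper names `clean_text` and `split_sentences` are assumptions made for this example.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists such as NLTK's are much longer).
STOPWORDS = {"the", "a", "an", "is", "at", "of", "and", "to", "in"}


def split_sentences(text: str) -> list[str]:
    """Naive segmentation: split after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def clean_text(text: str, min_word_len: int = 3) -> str:
    """Apply the formatting steps in a fixed order and return the cleaned text."""
    text = text.lower()                         # lowercasing
    text = re.sub(r"<[^>]+>", " ", text)        # remove HTML tags
    text = re.sub(r"https?://\S+", " ", text)   # remove URLs
    text = re.sub(r"\d+", " ", text)            # remove numbers
    text = re.sub(r"[^\w\s]", " ", text)        # remove punctuation
    # Drop stop words and words shorter than min_word_len characters.
    words = [w for w in text.split()
             if w not in STOPWORDS and len(w) >= min_word_len]
    return " ".join(words)


raw = "<p>The shop opens at 9 AM! Visit https://example.com for details.</p>"
sentences = split_sentences("The shop opens at 9 AM! Visit us for details.")
cleaned = clean_text(raw)
print(sentences)  # ['The shop opens at 9 AM!', 'Visit us for details.']
print(cleaned)    # shop opens visit for details
```

The order of the steps matters: URLs and HTML tags are removed before punctuation, because stripping punctuation first would break them into fragments that the later patterns no longer match.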

Other processing steps operate at the **semantic level**, which affects the meaning of the text (often rule-based).

- Spell checking: Correcting misspelled words.
- Grammar checking: Correcting grammatical errors.
- Stemming: Removing suffixes from words.
- Lemmatization: Simplifying words by using a dictionary of known words and roots.

Example of Stemming. From: https://i0.wp.com/trevorfox.com/wp-content/uploads/2018/07/stemming-example.png

![](https://i0.wp.com/trevorfox.com/wp-content/uploads/2018/07/stemming-example.png?fit=500%2C605&ssl=1)

Comparison with lemmatization. From: https://medium.com/swlh/introduction-to-stemming-vs-lemmatization-nlp-8c69eb43ecfe

![](https://tse3-mm.cn.bing.net/th/id/OIP-C.2K4VxxRtewNw4iP-Kh5Z7QHaEH?pid=ImgDet&rs=1)
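
To make the difference concrete, here is a toy sketch that contrasts the two ideas. It is deliberately simplified: the suffix rules and the lemma dictionary are hypothetical stand-ins for what a real stemmer (such as Porter's) and a real lexical database (such as WordNet) provide.

```python
# Toy suffix rules, checked in order (an assumption; Porter's algorithm has many more rules).
SUFFIXES = ("ing", "ies", "ed", "es", "s")

# Tiny hand-written lemma dictionary (hypothetical; WordNet plays this role in NLTK).
LEMMAS = {"studies": "study", "studying": "study", "better": "good", "mice": "mouse", "ran": "run"}


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word: str) -> str:
    """Look the word up in the dictionary; unknown words are returned unchanged."""
    return LEMMAS.get(word, word)


for word in ["studies", "studying", "better", "mice"]:
    print(f"{word:10} stem={toy_stem(word):8} lemma={toy_lemmatize(word)}")
```

In this sketch the stemmer turns "studies" into "stud" and "studying" into "study", so two forms of the same verb end up with different stems, and one of them is not a real word. The dictionary lookup maps both to "study" and also handles irregular forms such as "better" and "mice", at the cost of needing the dictionary in the first place.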

Additionally, text enrichment can add semantics to the original text with data that we didn't have before (machine-learning based).
- POS Tag: Part of speech tagging.
- Entity Recognition: Recognizing named entities.
- Entity relation extraction: Extracting relations between named entities.
- Dependency parsing: Parsing the sentence into a tree structure.

POS Tagging. From: https://www.researchgate.net/publication/337460636_Unpacking_the_Smart_Mobility_Concept_in_the_Dutch_Context_Based_on_a_Text_Mining_Approach

![](https://www.researchgate.net/publication/337460636/figure/download/fig1/AS:828223747284992@1574475337385/Example-of-part-of-speech-POS-tagging-and-lemmatization-for-two-example-sentences-The.ppm)

Entity recognition and dependency parsing. From: https://stanfordnlp.github.io/CoreNLP/

![](https://stanfordnlp.github.io/CoreNLP/assets/images/ner.png)

Entity relation-extraction. From: https://www.mdpi.com/2079-9292/9/10/1637

![](https://www.mdpi.com/electronics/electronics-09-01637/article_deploy/html/images/electronics-09-01637-g001.png)

Then we have text vectorization, which is the process of converting the text into a vector representation. This step is required to train a machine learning model.

Vectorized representations of text are usually obtained via:
- Bag of words: A vector representation of the text is obtained by counting the number of times each word appears in the text.
- TF-IDF (Term Frequency / Inverse Document Frequency): A vector representation of the text is obtained by counting the number of times each word appears in the text and then down-weighting the counts of words that appear in many documents.<br>
(A visual example of these two techniques can be found at: https://dataaspirant.com/word-embedding-techniques-nlp/)<br>
- Word embeddings: A vector representation of the text is obtained by using a word embedding model to represent the text.
    - Word2Vec.
    - BERT.
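
The first two techniques are simple enough to compute by hand. The sketch below builds both representations for a three-document toy corpus using only the standard library. The corpus is made up, and the weighting `count * log(N / df)` is one common TF-IDF variant chosen for illustration; libraries such as scikit-learn use smoothed variants, so their numbers differ.

```python
import math
from collections import Counter

# Made-up toy corpus, one string per document.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]
tokenized = [doc.split() for doc in docs]
vocab = sorted({word for doc in tokenized for word in doc})


def bag_of_words(tokens: list[str]) -> list[int]:
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]


# Document frequency: number of documents that contain each word.
doc_freq = {word: sum(word in doc for doc in tokenized) for word in vocab}


def tf_idf(tokens: list[str]) -> list[float]:
    """Weight each count by log(N / df), where N is the number of documents."""
    n_docs = len(tokenized)
    return [count * math.log(n_docs / doc_freq[word])
            for word, count in zip(vocab, bag_of_words(tokens))]


bow = bag_of_words(tokenized[0])
weights = tf_idf(tokenized[0])
print(vocab)
print(bow)
print([round(w, 3) for w in weights])
```

In the first document "the" occurs twice and "cat" once, yet "cat" receives the larger TF-IDF weight because it appears in only one of the three documents, while "the" appears in two.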



From here, the vectorized representation becomes the input for the machine learning algorithm. Depending on the task, the input and output are:
- Classification: The input is a vector representation of the text and the output is a class label.
- Regression: The input is a vector representation of the text and the output is a real number.
- Clustering: The input is a vector representation of the text and the output is a cluster label.
- Recommendation: The input is a vector representation of the text and the output is a list of recommendations.
- Sentiment analysis: The input is a vector representation of the text and the output is a real number.
- Topic modeling: The input is a vector representation of the text and the output is a list of topics.
- Text summarization: The input is a vector representation of the text and the output is a list of sentences.
- Text translation: The input is a vector representation of the text and the output is a list of translations.

### Revisit the previous example

Implement a few of the preprocessing steps mentioned above.

In [9]:
# Import libraries
import numpy as np
import pandas as pd
import torch
import requests
import nltk
from nltk import FreqDist
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

In [10]:
%pip install -Uqq transformers

In [11]:
from transformers import BertTokenizer, BertModel

In [12]:
# Download dependencies
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = BertModel.from_pretrained('bert-base-uncased')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a Be

[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Unzipping corpora/omw-1.4.zip.


True

In [16]:
# Load a sample text from the provided URL into a dataframe
response = requests.get('https://www.gutenberg.org/ebooks/8655.txt.utf-8')
response.raise_for_status()                                # Fail early if the download did not succeed
sample_text = response.text
sentences = sample_text.split('\n')                        # Split text into lines; each line is treated as a "sentence"
df = pd.DataFrame(sentences, columns=['sentence'])

In [17]:
# Text cleaning (morphological changes)
df['sentence'] = df['sentence'].str.lower()                              # Lowercase
df = df[df['sentence'].str.split().str.len() > 3].copy()                 # Keep lines with more than 3 words; copy() avoids SettingWithCopyWarning
df['sentence'] = df['sentence'].str.replace(r'[^\w\s]', '', regex=True)  # Remove punctuation (recent pandas needs regex=True)
max_len = df['sentence'].str.len().max()                                 # Length in characters of the longest line, reused later as the padding length
df['sentence'].head(5)


0    the project gutenberg ebook of the book of the...
1                        night volume i by anonymous\r
3    this ebook is for the use of anyone anywhere i...
4    other parts of the world at no cost and with a...
5    whatsoever  you may copy it give it away or re...
Name: sentence, dtype: object

In [18]:
# Remove Stopwords
eng_stopwords = stopwords.words('english')
print(eng_stopwords[-10:])
df['sentence'] = df['sentence'].apply(lambda x: ' '.join([word for word in x.split() if word not in eng_stopwords]))
df['sentence'].head(5) 

['shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]


0    project gutenberg ebook book thousand nights one
1                              night volume anonymous
3             ebook use anyone anywhere united states
4                parts world cost almost restrictions
5           whatsoever may copy give away reuse terms
Name: sentence, dtype: object

In [21]:
# Apply lemmatization
lemmatizer = WordNetLemmatizer()
df['sentence'] = df['sentence'].apply(lambda x: ' '.join([lemmatizer.lemmatize(word) for word in x.split()]))
df['sentence'].head(5) 

0    project gutenberg ebook book thousand night one
1                             night volume anonymous
3             ebook use anyone anywhere united state
4                 part world cost almost restriction
5           whatsoever may copy give away reuse term
Name: sentence, dtype: object

In [22]:
# Apply stemming
stemmer = PorterStemmer()
df['sentence'] = df['sentence'].apply(lambda x: ' '.join([stemmer.stem(word) for word in x.split()]))
df['sentence'].head(5)

0    project gutenberg ebook book thousand night one
1                                 night volum anonym
3                 ebook use anyon anywher unit state
4                    part world cost almost restrict
5              whatsoev may copi give away reus term
Name: sentence, dtype: object

In [23]:
# Tokenize the sentences, add tokens ids
tokens_df = df.copy()
tokens_df['tokenized_sentence'] = tokens_df['sentence'].apply(bert_tokenizer.tokenize)
tokens_df['numericalized_sentence'] = tokens_df['tokenized_sentence'].apply(bert_tokenizer.convert_tokens_to_ids)
tokens_df.sample(10)

Unnamed: 0,sentence,tokenized_sentence,numericalized_sentence
1365,thee one destroy thee answer,"[thee, one, destroy, thee, answer]","[14992, 2028, 6033, 14992, 3437]"
2230,vilest white heard saw pass,"[vile, ##st, white, heard, saw, pass]","[25047, 3367, 2317, 2657, 2387, 3413]"
10746,befel one day king seat throne,"[be, ##fe, ##l, one, day, king, seat, throne]","[2022, 7959, 2140, 2028, 2154, 2332, 2835, 6106]"
238,idiom otherwis retain strict letter,"[id, ##iom, other, ##wi, ##s, retain, strict, ...","[8909, 18994, 2060, 9148, 2015, 9279, 9384, 3661]"
10639,way christian broker stori,"[way, christian, broker, st, ##ori]","[2126, 3017, 20138, 2358, 10050]"
4847,hundredth door bade farewel depart leav,"[hundred, ##th, door, bad, ##e, fare, ##we, ##...","[3634, 2705, 2341, 2919, 2063, 13258, 8545, 21..."
10261,lord thi wife thi handmaid stand thee deign,"[lord, th, ##i, wife, th, ##i, hand, ##maid, s...","[2935, 16215, 2072, 2564, 16215, 2072, 2192, 2..."
5340,move piti said hear obey god,"[move, pit, ##i, said, hear, obey, god]","[2693, 6770, 2072, 2056, 2963, 15470, 2643]"
3369,news wept till swoon away heart,"[news, wept, till, sw, ##oon, away, heart]","[2739, 24966, 6229, 25430, 7828, 2185, 2540]"
12139,ill befallen make end,"[ill, be, ##fall, ##en, make, end]","[5665, 2022, 13976, 2368, 2191, 2203]"
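
The `##` pieces in the table above come from WordPiece tokenization, which splits a word that is not in the vocabulary into the longest pieces that are. The sketch below reproduces that greedy longest-match-first idea in plain Python. The `VOCAB` set is a tiny hypothetical vocabulary and `wordpiece` is an illustrative helper, not the library's implementation.

```python
# Tiny hypothetical vocabulary; the real bert-base-uncased vocabulary has about 30,000 entries.
VOCAB = {"vile", "##st", "white", "be", "##fall", "##en", "heard", "[UNK]"}


def wordpiece(word: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match-first split of a single word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate   # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1                           # try a shorter piece
        if match is None:
            return ["[UNK]"]                   # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


for word in ["vilest", "befallen", "white", "xyz"]:
    print(word, "->", wordpiece(word, VOCAB))
```

With this toy vocabulary, "vilest" becomes `['vile', '##st']` and "befallen" becomes `['be', '##fall', '##en']`, matching the rows shown in the table. It also shows a side effect of stemming before tokenizing: stems such as "stori" or "thi" are not real words, so they are broken into short pieces that carry little meaning.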


In [24]:
# Add the [CLS] and [SEP] special tokens and padding to the numericalized sentences on the dataframe
tokens_df['numericalized_sentence'] = tokens_df['numericalized_sentence'].apply(lambda x: [bert_tokenizer.cls_token_id] + x + [bert_tokenizer.sep_token_id])
tokens_df['numericalized_sentence'] = tokens_df['numericalized_sentence'].apply(lambda x: x + [bert_tokenizer.pad_token_id] * (max_len - len(x)))
tokens_df['numericalized_sentence'].sample(10)

10211    [101, 15876, 4859, 2099, 16101, 29122, 2165, 1...
877      [101, 2034, 2214, 2158, 2358, 10050, 102, 0, 0...
5287     [101, 2158, 19549, 2360, 5520, 2048, 102, 0, 0...
850      [101, 2214, 2158, 4012, 26569, 2556, 2019, 145...
4517     [101, 2067, 22864, 8202, 2604, 2546, 2078, 236...
9129     [101, 9765, 4017, 2269, 9099, 20051, 21442, 68...
13919    [101, 1057, 4983, 9385, 2025, 2594, 4297, 7630...
11351    [101, 2052, 2507, 14992, 2051, 2202, 3272, 102...
8834     [101, 2237, 2191, 2705, 2540, 3239, 2095, 2078...
11942    [101, 14412, 6305, 4848, 2548, 15547, 3672, 30...
Name: numericalized_sentence, dtype: object
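
The same wrapping and padding step can be sketched in plain Python on a small batch. The version below also truncates sequences that would exceed the target length and builds an attention mask, which BERT-style models accept so that padding positions are ignored. The helper name `pad_batch` and the target length of 6 are assumptions for this example.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0   # special-token ids of bert-base-uncased, as seen in the output above


def pad_batch(batch: list[list[int]], max_len: int) -> tuple[list[list[int]], list[list[int]]]:
    """Wrap each sequence in [CLS]/[SEP], then truncate or pad it to max_len."""
    input_ids, attention_mask = [], []
    for ids in batch:
        ids = [CLS_ID] + ids[: max_len - 2] + [SEP_ID]        # truncate so the result never exceeds max_len
        n_pad = max_len - len(ids)
        input_ids.append(ids + [PAD_ID] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)   # 1 = real token, 0 = padding
    return input_ids, attention_mask


batch = [[14992, 2028, 6033], [2126, 3017, 20138, 2358, 10050]]
input_ids, attention_mask = pad_batch(batch, max_len=6)
print(input_ids)       # [[101, 14992, 2028, 6033, 102, 0], [101, 2126, 3017, 20138, 2358, 102]]
print(attention_mask)  # [[1, 1, 1, 1, 1, 0], [1, 1, 1, 1, 1, 1]]
```

Truncation matters because every row must have the same length before the batch can be converted to a tensor. In the second sequence the last token id is dropped to leave room for `[SEP]`.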

In [25]:
# Extract encoded value to a Tensor
numericalized_sentences = tokens_df['numericalized_sentence'].values
numericalized_sentences = [list(x) for x in numericalized_sentences]
numericalized_sentences = np.array(numericalized_sentences)
numericalized_sentences = torch.from_numpy(numericalized_sentences)
print(numericalized_sentences.shape)

torch.Size([12690, 79])


In [26]:
# Use a fourth of the sentences to reduce memory usage
numericalized_sentences = numericalized_sentences[:len(numericalized_sentences)//4, :]
print(numericalized_sentences.shape)

torch.Size([3172, 79])


In [None]:
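# Encode the numericalized sentences using BERT
attention_mask = (numericalized_sentences != bert_tokenizer.pad_token_id).long()  # 1 for real tokens, 0 for padding
with torch.no_grad():                                                             # Inference only: skip gradient tracking to save memory
    encoded_sentences = bert_model(numericalized_sentences, attention_mask=attention_mask)[0]
encoded_sentences = encoded_sentences.numpy()
print(encoded_sentences.shape)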

In [None]:
# Sum the token embeddings of each sentence to get one vector per sentence
encoded_sentences = np.sum(encoded_sentences, axis=1)
print(encoded_sentences.shape)
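
The sum over `axis=1` above adds up every token position, padding included, so the result depends on sentence length and on the padding vectors. A common alternative is masked mean pooling, sketched below with NumPy on a made-up toy batch. The array values and the names `token_embeddings` and `attention_mask` are assumptions for this example.

```python
import numpy as np

# Hypothetical toy batch: 2 sentences, 4 token positions, 3-dimensional embeddings.
token_embeddings = np.array([
    [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0], [9.0, 9.0, 9.0], [9.0, 9.0, 9.0]],  # last 2 positions are padding
    [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0], [2.0, 2.0, 2.0], [1.0, 1.0, 1.0]],  # no padding
])
attention_mask = np.array([[1, 1, 0, 0],
                           [1, 1, 1, 1]])

mask = attention_mask[:, :, None]                 # shape (2, 4, 1), broadcasts over the embedding axis
summed = (token_embeddings * mask).sum(axis=1)    # sum over real tokens only
mean_pooled = summed / mask.sum(axis=1)           # divide by the number of real tokens

print(mean_pooled)
```

The padding rows of the first sentence (filled with 9.0 here to make them visible) do not affect its pooled vector, which comes out as `[2, 2, 2]`. The second sentence averages all four positions and gives `[1, 1, 1]`.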

In [None]:
# Use PCA to reduce the embedding dimensionality to 3
pca = PCA(n_components=3)
pca.fit(encoded_sentences)
reduced_embeddings = pca.transform(encoded_sentences)
print(reduced_embeddings.shape)
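
To see what the PCA step does, the sketch below performs the same kind of projection with NumPy's SVD on random data. The random matrix `X` is a hypothetical stand-in for the sentence embeddings. Component signs may differ from those produced by scikit-learn, which does not change the geometry of the plot.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))                    # hypothetical stand-in for sentence embeddings

X_centered = X - X.mean(axis=0)                 # PCA works on mean-centered data
_, singular_values, components = np.linalg.svd(X_centered, full_matrices=False)
reduced = X_centered @ components[:3].T         # project onto the top 3 principal directions

# Share of the total variance captured by each component.
explained = singular_values**2 / np.sum(singular_values**2)
print(reduced.shape)                            # (50, 3)
print(explained[:3].sum())                      # fraction of variance kept by 3 components
```

Checking the fraction of variance kept is a quick way to judge how much of the structure in the embeddings survives the reduction to three dimensions before reading too much into the plot.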

In [None]:
# Plot the 3D embeddings
fig = plt.figure()
ax = fig.add_subplot(projection='3d')   # Axes3D(fig) no longer attaches itself to the figure in recent Matplotlib
ax.scatter(reduced_embeddings[:, 0], reduced_embeddings[:, 1], reduced_embeddings[:, 2])
ax.grid(True)


# Add a label to each data point
for i in range(len(reduced_embeddings)):
    ax.text(reduced_embeddings[i, 0], reduced_embeddings[i, 1], reduced_embeddings[i, 2], f'{i}')
    

plt.show()

In [None]:
# Print sentences that appear distinct in the embeddings
print(f"- Sentence 76: {df['sentence'].iloc[76]}")
print(f"- Sentence 220: {df['sentence'].iloc[220]}")

### Exercise: experiment!

* Combine text from at least two different sources.
* Try different NLP libraries.
* Perform an expanded NLP pipeline (check spelling, POS tagging, entity recognition, dependency parsing, etc.)
