**TEAM**: Wissem Boujlida, Majdi Bel Hadj Youssef, Aymen Rebhi, Med Ali Moualhi, Amin Bouhamed, Salma Jdidi, Brahim Lasmer

In this lab, we will explore how to preprocess unstructured text for building a conceptual-graph-based recommendation system.<br>
Text preprocessing (normalization) is an essential step in natural language processing (NLP) that cleans and transforms unstructured text data to prepare it for an NLP task. It includes tokenization, stemming, lemmatization, punctuation and stop-word removal, part-of-speech tagging, and coreference resolution.

## **Tokenization**

Tokenization is the process of breaking text down into individual tokens. A token can be a paragraph, a sentence, a word, a subword, or even a character. In this same step, we will also convert each token in the text to lower case.

In [None]:
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')

[nltk_data] Downloading package punkt to /home/wissem/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [None]:
def word_tokenizer(text):
    tokens_list = word_tokenize(text)
    tokens_list = [token.lower() for token in tokens_list]
    return tokens_list

In [None]:
text = "It would be unfair to demand that people cease pirating files when those same people aren't paid for their participation in very lucrative network schemes. Ordinary people are relentlessly spied on, and not compensated for information taken from them. While I'd like to see everyone eventually pay for music and the like, I'd not ask for it until there's reciprocity."

In [None]:
tokens_list = word_tokenizer(text)
tokens_list

['it',
 'would',
 'be',
 'unfair',
 'to',
 'demand',
 'that',
 'people',
 'cease',
 'pirating',
 'files',
 'when',
 'those',
 'same',
 'people',
 'are',
 "n't",
 'paid',
 'for',
 'their',
 'participation',
 'in',
 'very',
 'lucrative',
 'network',
 'schemes',
 '.',
 'ordinary',
 'people',
 'are',
 'relentlessly',
 'spied',
 'on',
 ',',
 'and',
 'not',
 'compensated',
 'for',
 'information',
 'taken',
 'from',
 'them',
 '.',
 'while',
 'i',
 "'d",
 'like',
 'to',
 'see',
 'everyone',
 'eventually',
 'pay',
 'for',
 'music',
 'and',
 'the',
 'like',
 ',',
 'i',
 "'d",
 'not',
 'ask',
 'for',
 'it',
 'until',
 'there',
 "'s",
 'reciprocity',
 '.']
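To make the idea of a token more concrete, here is a minimal dependency-free sketch of a word tokenizer built on a regular expression. It is not what `word_tokenize` does internally: the pattern `TOKEN_PATTERN` and the function `simple_word_tokenizer` are hypothetical names introduced for illustration. Unlike the NLTK output above, this sketch keeps contractions such as "aren't" as a single token instead of splitting them into 'are' and "n't".

```python
import re

# Hypothetical pattern: a run of word characters, optionally followed by an
# apostrophe and more word characters (e.g. "aren't"), or else any single
# character that is neither a word character nor whitespace (punctuation).
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)?|[^\w\s]")

def simple_word_tokenizer(text):
    """Lower-case the text, then split it into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text.lower())

sample = "Ordinary people aren't paid, and I'd not ask."
print(simple_word_tokenizer(sample))
```

Which behavior is preferable depends on the later steps: split contractions expose the negation as its own token, while whole contractions are easier to match against a stop-word list.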

We can also operate at the level of sentences, using the sentence tokenizer to split a text into individual sentences as follows:

In [None]:
from nltk.tokenize import sent_tokenize

In [None]:
def sentence_tokenizer(text):
    sentences_list = sent_tokenize(text)
    return sentences_list

In [None]:
text = """Project Risk Management includes the processes of conducting risk management planning, identification, analysis,
response planning, response implementation, and monitoring risk on a project. The objectives of project risk management
are to increase the probability and/or impact of positive risks and to decrease the probability and/or impact of negative
risks, in order to optimize the chances of project success."""

In [None]:
sentences_list = sentence_tokenizer(text)
print(sentences_list)

['Project Risk Management includes the processes of conducting risk management planning, identification, analysis,\nresponse planning, response implementation, and monitoring risk on a project.', 'The objectives of project risk management\nare to increase the probability and/or impact of positive risks and to decrease the probability and/or impact of negative\nrisks, in order to optimize the chances of project success.']
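The sentences above still contain the line breaks (`\n`) of the original multi-line string. If later steps expect single-line sentences, one possible cleanup is to collapse every run of whitespace into a single space. The helper name `normalize_whitespace` and the shortened sample strings are assumptions made for this sketch.

```python
def normalize_whitespace(sentences):
    """Collapse runs of whitespace (spaces, tabs, newlines) into one space."""
    # str.split() with no argument splits on any whitespace and drops empties.
    return [" ".join(sentence.split()) for sentence in sentences]

# Shortened, illustrative sentences with embedded newlines.
raw_sentences = [
    "planning, identification, analysis,\nresponse planning.",
    "The objectives of project risk management\nare to increase   success.",
]
print(normalize_whitespace(raw_sentences))
```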


## **Remove Punctuation**

Punctuation marks indicate how a text should be read and, consequently, understood. Whether they should be removed or retained depends entirely on the NLP task you are performing and on the context.

In [None]:
import string                              # for string operations

In [None]:
print('\nPunctuation\n')
print(string.punctuation)


Punctuation

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [None]:
def remove_punctuation(tokens_list):
    tokens_clean = []
    for token in tokens_list: # Go through every token in your tokens list
        if (token not in string.punctuation):  # remove punctuation
            tokens_clean.append(token)
    return tokens_clean

In [None]:
text = "It would be unfair to demand that people cease pirating files when those same people aren't paid for their participation in very lucrative network schemes. Ordinary people are relentlessly spied on, and not compensated for information taken from them. While I'd like to see everyone eventually pay for music and the like, I'd not ask for it until there's reciprocity."

In [None]:
tokens_list = word_tokenizer(text)
tokens_clean = remove_punctuation(tokens_list)
tokens_clean

['it',
 'would',
 'be',
 'unfair',
 'to',
 'demand',
 'that',
 'people',
 'cease',
 'pirating',
 'files',
 'when',
 'those',
 'same',
 'people',
 'are',
 "n't",
 'paid',
 'for',
 'their',
 'participation',
 'in',
 'very',
 'lucrative',
 'network',
 'schemes',
 'ordinary',
 'people',
 'are',
 'relentlessly',
 'spied',
 'on',
 'and',
 'not',
 'compensated',
 'for',
 'information',
 'taken',
 'from',
 'them',
 'while',
 'i',
 "'d",
 'like',
 'to',
 'see',
 'everyone',
 'eventually',
 'pay',
 'for',
 'music',
 'and',
 'the',
 'like',
 'i',
 "'d",
 'not',
 'ask',
 'for',
 'it',
 'until',
 'there',
 "'s",
 'reciprocity']
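One detail worth keeping in mind: because `string.punctuation` is a plain string, the test `token not in string.punctuation` is a substring test. Multi-character punctuation tokens such as `''`, ` `` ` (which NLTK's tokenizer typically emits for double quotes) or `...` are not contiguous substrings of it and would therefore be kept. The following sketch shows a stricter, character-by-character check; the names `is_punctuation` and `remove_punctuation_strict` are hypothetical and not part of the lab code.

```python
import string

PUNCTUATION = set(string.punctuation)

def is_punctuation(token):
    """True only when the token is non-empty and every character is punctuation."""
    return bool(token) and all(char in PUNCTUATION for char in token)

def remove_punctuation_strict(tokens_list):
    """Drop tokens made up exclusively of punctuation characters."""
    return [token for token in tokens_list if not is_punctuation(token)]

tokens = ["people", ",", "``", "''", "...", "n't", "'s", "."]
print(remove_punctuation_strict(tokens))
```

Contraction fragments such as "n't" and "'s" contain letters, so they survive both versions of the filter.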

## **Remove Stop Words**

Stop words are a set of commonly used words that don't add significant meaning to the text. Whether they should be removed or retained depends entirely on the NLP task you are performing and on the context.

In [None]:
from nltk.corpus import stopwords          # module for stop words that come with NLTK
# download the stopwords from NLTK
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /home/wissem/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [None]:
#Import the english stop words list from NLTK
stopwords_english = stopwords.words('english')

print('Stop words\n')
print(stopwords_english)

Stop words

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so

In [None]:
def remove_stopwords(tokens_list):
    #Import the english stop words list from NLTK
    stopwords_english = stopwords.words('english')
    tokens_clean = []
    for token in tokens_list: # Go through every token in your tokens list
        if (token not in stopwords_english):  # remove stopwords
            tokens_clean.append(token)
    return tokens_clean

In [None]:
text = "It would be unfair to demand that people cease pirating files when those same people aren't paid for their participation in very lucrative network schemes. Ordinary people are relentlessly spied on, and not compensated for information taken from them. While I'd like to see everyone eventually pay for music and the like, I'd not ask for it until there's reciprocity."

In [None]:
tokens_list = word_tokenizer(text)
tokens_clean = remove_stopwords(tokens_list)
tokens_clean

['would',
 'unfair',
 'demand',
 'people',
 'cease',
 'pirating',
 'files',
 'people',
 "n't",
 'paid',
 'participation',
 'lucrative',
 'network',
 'schemes',
 '.',
 'ordinary',
 'people',
 'relentlessly',
 'spied',
 ',',
 'compensated',
 'information',
 'taken',
 '.',
 "'d",
 'like',
 'see',
 'everyone',
 'eventually',
 'pay',
 'music',
 'like',
 ',',
 "'d",
 'ask',
 "'s",
 'reciprocity',
 '.']
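The output above still contains punctuation marks and contraction fragments ("n't", "'d", "'s"), because only stop words were filtered. A possible way to combine the filters in a single pass is sketched below. The tiny `STOP_WORDS` set and the `CONTRACTION_FRAGMENTS` set are hand-written assumptions for illustration; in the lab the stop words would come from `stopwords.words('english')`. Using sets also makes each membership test constant-time.

```python
import string

# Hypothetical, deliberately tiny stop-word set (illustration only).
STOP_WORDS = {"it", "be", "to", "that", "are", "for", "their", "in", "very"}
# Assumed list of contraction fragments left over by the tokenizer.
CONTRACTION_FRAGMENTS = {"n't", "'d", "'s"}
PUNCTUATION = set(string.punctuation)

def remove_noise(tokens_list, stop_words=STOP_WORDS,
                 fragments=CONTRACTION_FRAGMENTS):
    """Drop stop words, contraction fragments and punctuation-only tokens."""
    tokens_clean = []
    for token in tokens_list:
        if token in stop_words or token in fragments:
            continue
        if all(char in PUNCTUATION for char in token):
            continue  # token consists only of punctuation (or is empty)
        tokens_clean.append(token)
    return tokens_clean

tokens = ["people", "are", "n't", "paid", "for", "their", "participation", "."]
print(remove_noise(tokens))
```

Note that dropping "n't" discards the negation, so whether this filter is appropriate depends on the task.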

## **Stemming**

Stemming is the process of reducing a word to its most general form, or stem (root). This helps in reducing the size of our vocabulary.

Consider the words:

    learn
    learning
    learned
    learnt

All of these words share the common root learn. However, in some cases, the stemming process produces stems that are not correctly spelled words. Take happy and sunny, for example. We can look at the set of words that comprises the different forms of happy:

    happy
    happiness
    happier

We can see that the prefix happi is more commonly used. We cannot choose happ because it is the stem of unrelated words like happen. So, we choose the most common stem for related words.
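The reasoning above can be illustrated with a toy suffix stripper. This is only a sketch to show why a stem such as happi can emerge; it is not the Porter algorithm, and the function name `toy_stem` and its short suffix list are assumptions made for the example.

```python
def toy_stem(word):
    """Strip one common suffix, then map a trailing 'y' to 'i'."""
    for suffix in ("ness", "ing", "ed", "er", "s"):
        # Only strip when at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[:-len(suffix)]
            break
    if word.endswith("y"):
        word = word[:-1] + "i"
    return word

for word in ["happy", "happiness", "happier", "learn", "learning", "learned"]:
    print(word, "->", toy_stem(word))
```

With these rules the three forms of happy collapse to happi, which is not a dictionary word but still groups the related forms together.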

NLTK offers several stemmers. We will use the PorterStemmer class, which implements the Porter stemming algorithm. We can also use the Snowball stemmer, an improvement on the Porter stemmer that often produces more accurate stems.

In [None]:
from nltk.stem import PorterStemmer        # module for stemming

In [None]:
def stemming(tokens_list):
    # Instantiate stemming class
    stemmer = PorterStemmer()
    stems_list = []
    for token in tokens_list: # Go through every token in your tokens list
        stem_token = stemmer.stem(token)  # stemming token
        stems_list.append(stem_token)  # append to the list of stems
    return stems_list

In [None]:
text = "It would be unfair to demand that people cease pirating files when those same people aren't paid for their participation in very lucrative network schemes. Ordinary people are relentlessly spied on, and not compensated for information taken from them. While I'd like to see everyone eventually pay for music and the like, I'd not ask for it until there's reciprocity."

In [None]:
tokens_list = word_tokenizer(text)
stems_list = stemming(tokens_list)
stems_list

['it',
 'would',
 'be',
 'unfair',
 'to',
 'demand',
 'that',
 'peopl',
 'ceas',
 'pirat',
 'file',
 'when',
 'those',
 'same',
 'peopl',
 'are',
 "n't",
 'paid',
 'for',
 'their',
 'particip',
 'in',
 'veri',
 'lucr',
 'network',
 'scheme',
 '.',
 'ordinari',
 'peopl',
 'are',
 'relentlessli',
 'spi',
 'on',
 ',',
 'and',
 'not',
 'compens',
 'for',
 'inform',
 'taken',
 'from',
 'them',
 '.',
 'while',
 'i',
 "'d",
 'like',
 'to',
 'see',
 'everyon',
 'eventu',
 'pay',
 'for',
 'music',
 'and',
 'the',
 'like',
 ',',
 'i',
 "'d",
 'not',
 'ask',
 'for',
 'it',
 'until',
 'there',
 "'s",
 'reciproc',
 '.']

In [None]:
from nltk.stem import SnowballStemmer

In [None]:
def stemming2(tokens_list):
    # Instantiate stemming class
    stemmer = SnowballStemmer("english")
    stems_list = []
    for token in tokens_list: # Go through every token in your tokens list
        stem_token = stemmer.stem(token)  # stemming token
        stems_list.append(stem_token)  # append to the list of stems
    return stems_list

In [None]:
tokens_list = word_tokenizer(text)
stems_list = stemming2(tokens_list)
stems_list

['it',
 'would',
 'be',
 'unfair',
 'to',
 'demand',
 'that',
 'peopl',
 'ceas',
 'pirat',
 'file',
 'when',
 'those',
 'same',
 'peopl',
 'are',
 "n't",
 'paid',
 'for',
 'their',
 'particip',
 'in',
 'veri',
 'lucrat',
 'network',
 'scheme',
 '.',
 'ordinari',
 'peopl',
 'are',
 'relentless',
 'spi',
 'on',
 ',',
 'and',
 'not',
 'compens',
 'for',
 'inform',
 'taken',
 'from',
 'them',
 '.',
 'while',
 'i',
 "'d",
 'like',
 'to',
 'see',
 'everyon',
 'eventu',
 'pay',
 'for',
 'music',
 'and',
 'the',
 'like',
 ',',
 'i',
 "'d",
 'not',
 'ask',
 'for',
 'it',
 'until',
 'there',
 "'s",
 'reciproc',
 '.']
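The two outputs are long and nearly identical, so the differences are easy to miss. A small helper can list only the positions where the two stemmers disagree. The function name `stem_differences` is hypothetical; the sample values below are copied from the Porter and Snowball outputs shown in this section.

```python
def stem_differences(tokens, stems_a, stems_b):
    """Return (token, stem_a, stem_b) wherever the two stem lists differ."""
    return [
        (token, stem_a, stem_b)
        for token, stem_a, stem_b in zip(tokens, stems_a, stems_b)
        if stem_a != stem_b
    ]

# Excerpt of the tokens and stems printed above.
tokens = ["very", "lucrative", "network", "relentlessly", "spied"]
porter = ["veri", "lucr", "network", "relentlessli", "spi"]
snowball = ["veri", "lucrat", "network", "relentless", "spi"]

for token, porter_stem, snowball_stem in stem_differences(tokens, porter, snowball):
    print(f"{token}: porter={porter_stem} snowball={snowball_stem}")
```

On the full example text, only 'lucrative' and 'relentlessly' are stemmed differently by the two stemmers.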

## **Part-of-speech tagging (POST)**

Part-of-speech tagging (POST) is the process of labeling each word in a text (corpus) with its corresponding part of speech (noun, verb, adjective, adverb, etc.) based on both its definition and its context.
We want the computer to understand the meaning of text correctly, and POST is part of that.
POST has many applications:

    Text modeling
    Autocomplete (figuring out the next word)
    Word ambiguity resolution (taking a word like watch and figuring out whether it is used as a noun or as a verb)

The nltk.tag.AveragedPerceptronTagger is the default tagger as of NLTK version 3.1. This model was trained on Sections 00-18 of the Wall Street Journal sections of OntoNotes 5. The original implementation comes from Matthew Honnibal, and it outperforms the predecessor maximum entropy POS model in NLTK.
You can check this link for the tag set: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

In [None]:
from nltk.tag import pos_tag
nltk.download('averaged_perceptron_tagger')
nltk.download('universal_tagset')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/wissem/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package universal_tagset to
[nltk_data]     /home/wissem/nltk_data...
[nltk_data]   Package universal_tagset is already up-to-date!


True

In [None]:
def POST(tokens_list):
    pos_tags = pos_tag(tokens_list, lang='eng')
    return pos_tags

In [None]:
text = "It would be unfair to demand that people cease pirating files when those same people aren't paid for their participation in very lucrative network schemes. Ordinary people are relentlessly spied on, and not compensated for information taken from them. While I'd like to see everyone eventually pay for music and the like, I'd not ask for it until there's reciprocity."

In [None]:
tokens_list = word_tokenizer(text)
pos_tags = POST(tokens_list)
print(pos_tags)

[('it', 'PRP'), ('would', 'MD'), ('be', 'VB'), ('unfair', 'JJ'), ('to', 'TO'), ('demand', 'VB'), ('that', 'IN'), ('people', 'NNS'), ('cease', 'VBP'), ('pirating', 'VBG'), ('files', 'NNS'), ('when', 'WRB'), ('those', 'DT'), ('same', 'JJ'), ('people', 'NNS'), ('are', 'VBP'), ("n't", 'RB'), ('paid', 'VBN'), ('for', 'IN'), ('their', 'PRP$'), ('participation', 'NN'), ('in', 'IN'), ('very', 'RB'), ('lucrative', 'JJ'), ('network', 'NN'), ('schemes', 'NNS'), ('.', '.'), ('ordinary', 'JJ'), ('people', 'NNS'), ('are', 'VBP'), ('relentlessly', 'RB'), ('spied', 'VBN'), ('on', 'IN'), (',', ','), ('and', 'CC'), ('not', 'RB'), ('compensated', 'VBN'), ('for', 'IN'), ('information', 'NN'), ('taken', 'VBN'), ('from', 'IN'), ('them', 'PRP'), ('.', '.'), ('while', 'IN'), ('i', 'JJ'), ("'d", 'MD'), ('like', 'VB'), ('to', 'TO'), ('see', 'VB'), ('everyone', 'NN'), ('eventually', 'RB'), ('pay', 'VB'), ('for', 'IN'), ('music', 'NN'), ('and', 'CC'), ('the', 'DT'), ('like', 'JJ'), (',', ','), ('i', 'JJ'), ("'d",
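Once tokens are tagged, a common next step is to group them by part of speech, for example to pick out nouns as candidate concepts. The sketch below groups tokens by the first two characters of their Penn Treebank tag, so that NN and NNS end up together, as do VB, VBP and VBG. The function name `group_by_tag_prefix` is hypothetical, and the sample pairs are the first entries of the output above.

```python
from collections import defaultdict

def group_by_tag_prefix(pos_tags, prefix_length=2):
    """Group tokens by the first characters of their POS tag."""
    groups = defaultdict(list)
    for token, tag in pos_tags:
        groups[tag[:prefix_length]].append(token)
    return dict(groups)

# First tagged tokens from the output above.
pos_tags = [('it', 'PRP'), ('would', 'MD'), ('be', 'VB'), ('unfair', 'JJ'),
            ('to', 'TO'), ('demand', 'VB'), ('that', 'IN'), ('people', 'NNS'),
            ('cease', 'VBP'), ('pirating', 'VBG'), ('files', 'NNS')]

groups = group_by_tag_prefix(pos_tags)
print("nouns:", groups["NN"])
print("verbs:", groups["VB"])
```

The output above also tags the lower-cased 'i' as JJ, which suggests that lower-casing before tagging can mislead the tagger; tagging the original casing may be worth trying.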

## **Lemmatization**

Lemmatization is also a process of reducing a word to its base form, or lemma, which likewise helps in reducing the size of our vocabulary.
Unlike stemming, lemmatization ensures that the resulting base form is a valid word of the language. It’s usually more sophisticated than stemming: a stemmer works on an individual word without knowledge of the context, whereas a lemmatizer uses linguistic knowledge and the context to derive a properly spelled and grammatically correct base form.

Just like for stemming, there are different lemmatizers. For this example, we’ll use the WordNet lemmatizer.

In [None]:
from nltk.stem.wordnet import WordNetLemmatizer
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /home/wissem/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [None]:
def lemmatization(tokens_list):
    # Instantiate lemmatization class
    lemmatizer = WordNetLemmatizer()
    lemmas_list = []
    for token in tokens_list: # Go through every token in your tokens list
        lemma_token = lemmatizer.lemmatize(token)  # lemmatize token (treated as a noun by default)
        lemmas_list.append(lemma_token)  # append to the list of lemmas
    return lemmas_list

In [None]:
text = "the cat is sitting with the bats on the striped mat under many badly flying geese"

In [None]:
tokens_list = word_tokenizer(text)
lemmas_list = lemmatization(tokens_list)
lemmas_list

['the',
 'cat',
 'is',
 'sitting',
 'with',
 'the',
 'bat',
 'on',
 'the',
 'striped',
 'mat',
 'under',
 'many',
 'badly',
 'flying',
 'goose']

POS tagging can improve lemmatization accuracy. For example, the word ‘leaves’ without a POS tag would get lemmatized to the word ‘leaf’, but with a verb tag, its lemma would become ‘leave’.
To get the best results, feed the POS tags to the lemmatizer; otherwise it might not reduce all the words to the lemmas you want.

In [None]:
from nltk.corpus import wordnet

In [None]:
# Define function to lemmatize each token with its POS tag

def lemmatization2(tokens_list):
    lemmatizer = WordNetLemmatizer()
    nltk_pos_tags = POST(tokens_list)
    lemmas_list = []
    # Unpack each (token, tag) pair directly; the name "tag" avoids
    # shadowing the imported pos_tag function.
    for token, tag in nltk_pos_tags:
        if tag.startswith('J'):
            lemma = lemmatizer.lemmatize(token, wordnet.ADJ)
            lemmas_list.append(lemma)
        elif tag.startswith('V'):
            lemma = lemmatizer.lemmatize(token, wordnet.VERB)
            lemmas_list.append(lemma)
        elif tag.startswith('N'):
            lemma = lemmatizer.lemmatize(token, wordnet.NOUN)
            lemmas_list.append(lemma)
        elif tag.startswith('R'):
            lemma = lemmatizer.lemmatize(token, wordnet.ADV)
            lemmas_list.append(lemma)
        else:
            lemma = lemmatizer.lemmatize(token)
            lemmas_list.append(lemma)
    return lemmas_list

In [None]:
text = 'the cat is sitting with the bats on the striped mat under many badly flying geese'

In [None]:
tokens_list = word_tokenizer(text)
lemmas_list = lemmatization2(tokens_list)
lemmas_list

['the',
 'cat',
 'be',
 'sit',
 'with',
 'the',
 'bat',
 'on',
 'the',
 'striped',
 'mat',
 'under',
 'many',
 'badly',
 'fly',
 'geese']
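The branch-per-tag logic used for POS-aware lemmatization can also be expressed as a lookup table. The sketch below maps the first letter of a Penn Treebank tag to a WordNet part-of-speech code. It assumes that the single-letter codes 'a', 'v', 'n' and 'r' match the values of `wordnet.ADJ`, `wordnet.VERB`, `wordnet.NOUN` and `wordnet.ADV` in your NLTK version, and that falling back to the noun code mirrors the lemmatizer's default. The names `PENN_TO_WORDNET` and `penn_to_wordnet` are hypothetical.

```python
# Assumed WordNet POS codes: 'a' adjective, 'v' verb, 'n' noun, 'r' adverb.
PENN_TO_WORDNET = {"J": "a", "V": "v", "N": "n", "R": "r"}

def penn_to_wordnet(tag, default="n"):
    """Map a Penn Treebank tag to a WordNet POS code via its first letter."""
    return PENN_TO_WORDNET.get(tag[:1], default)

for tag in ["JJ", "VBG", "NNS", "RB", "DT"]:
    print(tag, "->", penn_to_wordnet(tag))
```

With such a helper, the lemmatization loop could reduce to a single call of the form `lemmatizer.lemmatize(token, penn_to_wordnet(tag))`.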

## **Coreference resolution**

In linguistics, coreference occurs when two or more expressions refer to the same entity or thing.
Coreference resolution replaces those expressions with the entity they refer to. Most often, the coreferring expressions are pronouns (he, she, it, my, his, ...).<br>
For example, given the text “John went to the store. He bought some groceries.”, a coreference resolution model would identify that “John” and “He” both refer to the same entity, "John", and replace the pronoun "He" with "John". The resolved text is therefore: "John went to the store. John bought some groceries."
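To show what the replacement step amounts to, here is a minimal sketch that assumes the coreference clusters are already known. The clusters are hand-annotated token spans, not the output of any model, and the function name `resolve_with_clusters` and the span format (start index, exclusive end index) are assumptions made for this example. Finding the clusters is the hard part that a trained model takes care of.

```python
def resolve_with_clusters(tokens, clusters):
    """Replace each later mention in a cluster with the cluster's first mention."""
    replacements = {}
    for cluster in clusters:
        (head_start, head_end), *mentions = cluster
        head = tokens[head_start:head_end]
        for start, end in mentions:
            replacements[start] = (end, head)

    resolved, i = [], 0
    while i < len(tokens):
        if i in replacements:
            end, head = replacements[i]
            resolved.extend(head)   # write the referred entity instead
            i = end                 # skip over the replaced mention
        else:
            resolved.append(tokens[i])
            i += 1
    return resolved

tokens = ["John", "went", "to", "the", "store", ".",
          "He", "bought", "some", "groceries", "."]
# One hand-annotated cluster: "John" (tokens 0-1) and "He" (tokens 6-7).
clusters = [[(0, 1), (6, 7)]]
print(resolve_with_clusters(tokens, clusters))
```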


We will be using the Crosslingual Coreference model, contributed by David Berenstein to the spaCy Universe.

In [None]:
import spacy


In [None]:
def coreference_resolution(text):
    # use any model that has internal spacy embeddings
    DEVICE = -1 # GPU index, or -1 to run on the CPU
    coreference_resolution_model = spacy.load('en_core_web_sm')
    coreference_resolution_model.add_pipe(
        "xx_coref", config={"chunk_size": 2500, "chunk_overlap": 2, "device": DEVICE})
    resolved_text = coreference_resolution_model(text)._.resolved_text
    return resolved_text

In [None]:
text = """
    Do not forget about Momofuku Ando!
    He created instant noodles in Osaka.
    At that location, Nissin was founded.
    Many students survived by eating these noodles, but they don't even know him."""

In [None]:
resolved_text = coreference_resolution(text)
print(resolved_text)



    Do not forget about Momofuku Ando!
    Momofuku Ando created instant noodles in Osaka.
    At Osaka, Nissin was founded.
    Many students survived by eating instant noodles, but Many students don't even know Nissin.
