# Preprocessing tasks

In this notebook, we will study some of the basic NLP tasks that serve as building blocks for more ambitious NLP applications.

In particular, we will learn about:


1.   Basic tasks such as tokenization, sentence splitting, PoS tagging, lemmatization and stemming.
2.   Stopwords
3.   Vectorization (from words to vectors)
        * Bag of words (BoW) model
        * Tf-idf model
        



## 1. Basic tasks: tokenization, sentence splitting, PoS tagging, lemmatization, and stemming

* sentence splitting: the task of splitting an input text into sentences.
* tokenization: the task of segmenting an input text into words (tokens).
* PoS tagging: the task of assigning each word its part-of-speech (PoS) tag, such as noun or verb.
* lemmatization: given a word, returns its lemma (its dictionary form).
* stemming: given a word, returns its stem, obtained by stripping suffixes; the stem need not be a real word.
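
To make the first two definitions concrete, here is a deliberately naive sketch that uses only the standard library. The function names `split_sentences` and `tokenize` are hypothetical, and the regular expressions are simplifying assumptions; this is not how an NLP library does it internally.

```python
import re

def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

def tokenize(sentence):
    """Naive tokenizer: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)

sample = "Police responded quickly. Most victims were elderly."
for sentence in split_sentences(sample):
    print(sentence, '->', tokenize(sentence))
```

Rules this simple break easily: an abbreviation such as "Dr. Smith" would be split into two sentences, and a value such as "49.6C" would be cut into three tokens. Handling such cases well is one reason to rely on a library.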


There are several NLP libraries that already perform these tasks for us. We will use spaCy in this notebook.

In [1]:
!python3 -m spacy download en_core_web_sm

import spacy

nlp = spacy.load('en_core_web_sm')           # load model package "en_core_web_sm"

print('spacy.en loaded')




✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
spacy.en loaded


In [7]:
text = """Dozens of people have died in Canada amid an unprecedented heatwave that has smashed temperature records. Police in the Vancouver area have responded to more than 130 sudden deaths since Friday. Most were elderly or had underlying health conditions, with heat often a contributing factor. Canada broke its temperature record for a third straight day on Tuesday - 49.6C (121.3F) in Lytton, British Columbia. The US north-west has also seen record highs - and a number of fatalities. Experts say climate change is expected to increase the frequency of extreme weather events, such as heatwaves. However, linking any single event to global warming is complicated. US President Joe Biden said the heatwave was tied to climate change in a speech on Tuesday as he pitched a plan to update the country's infrastructure network. On Wednesday he is meeting with governors of western US states and fire officials, as the annual North American wildfire season begins. The heat over western parts of Canada and the US has been caused by a dome of static high-pressure hot air stretching from California to the Arctic territories. Temperatures have been easing in coastal areas but there is not much respite for inland regions. Before Sunday, temperatures in Canada had never passed 45C."""

document = nlp(text)

print("Sentences: ")
for i,s in enumerate(document.sents):
    print(i,s)
    #for token in s:
    #  print('\t',token.orth_, token.pos_)

Sentences: 
0 Dozens of people have died in Canada amid an unprecedented heatwave that has smashed temperature records.
1 Police in the Vancouver area have responded to more than 130 sudden deaths since Friday.
2 Most were elderly or had underlying health conditions, with heat often a contributing factor.
3 Canada broke its temperature record for a third straight day on Tuesday - 49.6C (121.3F) in Lytton, British Columbia.
4 The US north-west has also seen record highs - and a number of fatalities.
5 Experts say climate change is expected to increase the frequency of extreme weather events, such as heatwaves.
6 However, linking any single event to global warming is complicated.
7 US President Joe Biden said the heatwave was tied to climate change in a speech on Tuesday as he pitched a plan to update the country's infrastructure network.
8 On Wednesday he is meeting with governors of western US states and fire officials, as the annual North American wildfire season begins.
9 The heat ov

## Tokenization and PoS tagging

In [8]:
for i, s in enumerate(document.sents):
    print(i, s)
    # Print each token of the sentence with its PoS tag
    for token in s:
        print('\t', token.orth_, token.pos_)
    # Only show the first sentence
    break

0 Dozens of people have died in Canada amid an unprecedented heatwave that has smashed temperature records.
	 Dozens NOUN
	 of ADP
	 people NOUN
	 have AUX
	 died VERB
	 in ADP
	 Canada PROPN
	 amid ADP
	 an DET
	 unprecedented ADJ
	 heatwave NOUN
	 that DET
	 has AUX
	 smashed VERB
	 temperature NOUN
	 records NOUN
	 . PUNCT


spaCy also exposes other useful attributes on each token, such as its shape, lemma, prefix and suffix:

In [9]:
for i, token in enumerate(document):
    print("original:", token.orth_)
    print("shape:", token.shape_)
    print("PoS tag:", token.pos_)
    #print("lowercased:", token.lower_)
    print("lemma:", token.lemma_)
    print("prefix:", token.prefix_)
    print("suffix:", token.suffix_)
    print("----------------------------------------")
    # only show the first seven tokens
    if i > 5:
        break


original: Dozens
shape: Xxxxx
PoS tag: NOUN
lemma: dozen
prefix: D
suffix: ens
----------------------------------------
original: of
shape: xx
PoS tag: ADP
lemma: of
prefix: o
suffix: of
----------------------------------------
original: people
shape: xxxx
PoS tag: NOUN
lemma: people
prefix: p
suffix: ple
----------------------------------------
original: have
shape: xxxx
PoS tag: AUX
lemma: have
prefix: h
suffix: ave
----------------------------------------
original: died
shape: xxxx
PoS tag: VERB
lemma: die
prefix: d
suffix: ied
----------------------------------------
original: in
shape: xx
PoS tag: ADP
lemma: in
prefix: i
suffix: in
----------------------------------------
original: Canada
shape: Xxxxx
PoS tag: PROPN
lemma: Canada
prefix: C
suffix: ada
----------------------------------------
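
The `shape` values above follow a simple pattern: uppercase letters become `X`, lowercase letters become `x`, and long runs of the same symbol are cut short. The sketch below reproduces the shapes printed above under that reading. The function `word_shape`, the digit-to-`d` mapping and the `max_run` parameter are assumptions made for illustration; spaCy's own implementation may differ in its details.

```python
def word_shape(text, max_run=4):
    """Map characters to X (uppercase), x (lowercase), d (digit); keep others as-is.
    Runs of the same shape character are cut off after max_run repeats."""
    shape = []
    last = ""
    run = 0
    for char in text:
        if char.isalpha():
            shape_char = "X" if char.isupper() else "x"
        elif char.isdigit():
            shape_char = "d"
        else:
            shape_char = char
        # Count how many times in a row this shape character has appeared
        run = run + 1 if shape_char == last else 1
        last = shape_char
        if run <= max_run:
            shape.append(shape_char)
    return "".join(shape)

for word in ["Dozens", "of", "people", "Canada"]:
    print(word, word_shape(word))
```

Because long runs are truncated, "people" and "have" share the shape `xxxx`, which makes the shape a compact feature that generalizes across word lengths.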


## Lemmatization and stemming


In [10]:
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

tokens = ['studies', 'studied', 'studying', 'student']
text = ' '.join(tokens)

# Compare spaCy's lemma with the Porter stem of each word
for word in nlp(text):
    print('word: ' + word.text + '\tlemma:' + word.lemma_ + '\tstem:' + stemmer.stem(word.text))


word: studies	lemma:study	stem:studi
word: studied	lemma:study	stem:studi
word: studying	lemma:study	stem:studi
word: student	lemma:student	stem:student
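
The output shows the key difference: a lemma is a dictionary form (`study`), while a stem is whatever remains after rule-based suffix stripping (`studi`). The toy stemmer below illustrates the idea. It is NOT the Porter algorithm: the function `toy_stem` and its four suffix rules are assumptions chosen only to reproduce the four stems above, whereas the real Porter stemmer applies many more rules and conditions.

```python
def toy_stem(word):
    """Toy suffix stripper (not the Porter algorithm): remove one common
    suffix, then turn a final 'y' into 'i'."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[:-len(suffix)]
            break
    if word.endswith("y"):
        word = word[:-1] + "i"
    return word

for word in ['studies', 'studied', 'studying', 'student']:
    print(word, '->', toy_stem(word))
```

Since the rules never consult a dictionary, the result is fast to compute but is not guaranteed to be a real word.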


## Removing stopwords



In [14]:
from spacy.lang.en import English
from spacy.lang.en.stop_words import STOP_WORDS

print(STOP_WORDS)
# Create a blank English pipeline (tokenizer only; no tagger, parser or NER).
# Note that this replaces the en_core_web_sm pipeline loaded earlier.
nlp = English()

text = 'There are 25 children, who were playing, while their parents were chatting.'
text = text.lower()

my_doc = nlp(text)

# List of all word tokens
token_list = []
# List of word tokens after removing stopwords
filtered = []

for token in my_doc:
    token_list.append(token.text)
    # is_stop is True when the token is in spaCy's stopword list
    if not token.is_stop:
        filtered.append(token.text)

print("Input Sentence: \t{}".format(" ".join(token_list)))
s = " ".join(filtered)
print("Text without stopwords: \t{}".format(s))


{'some', 'same', 'wherever', '’ll', 'only', 'nothing', 'that', 'am', 'around', 'which', 'is', 'sixty', 'been', 'therefore', 'rather', 'third', 'across', 'amongst', 'seems', 'during', 'can', 'may', 'although', 'keep', 'fifteen', 'yourself', 'thereby', 'latter', 'mostly', 'myself', '‘s', "'re", 'could', 'hereby', 'on', 'various', 'him', 'whatever', 'call', 'done', 'quite', 'would', 'again', 'nor', 'see', 'whither', 'to', 'with', '’re', 'must', 'below', 'should', 'besides', 'sometimes', 'until', 'get', 'seemed', 'in', 'whereupon', 'those', 'towards', 'whose', 'whom', 'hers', 'whereafter', 'down', 'thereupon', 'first', "'m", 'under', "'ve", '’m', 'four', 'name', 'however', 'thence', 'it', 'nine', 'just', 'several', 'this', 'not', 'there', 'using', 'five', 'once', 'but', 'yours', 'of', '‘d', '’d', 'meanwhile', 'both', 'least', 'or', 'almost', 'toward', 'though', 'therein', 'most', 'did', 'nevertheless', 'part', 'how', 'much', 'ours', 'who', 'anywhere', 'beside', 'anything', 'each', 'next', 
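
Stripped of the library, stopword removal is just a set-membership test on each token. The sketch below shows this with the standard library only. The set `SMALL_STOP_WORDS` and the function `remove_stopwords` are hypothetical, and the set is hand-picked for this one sentence; it is far smaller than spaCy's `STOP_WORDS`.

```python
# Hypothetical, hand-picked stopword list (much smaller than spaCy's STOP_WORDS)
SMALL_STOP_WORDS = {"there", "are", "who", "were", "while", "their"}

def remove_stopwords(tokens, stop_words=SMALL_STOP_WORDS):
    """Keep only the tokens whose lowercased form is not in stop_words."""
    return [t for t in tokens if t.lower() not in stop_words]

tokens = ["There", "are", "25", "children", ",", "who", "were", "playing", ",",
          "while", "their", "parents", "were", "chatting", "."]
print(" ".join(remove_stopwords(tokens)))
```

Passing the stopword set as a parameter makes it easy to swap in a different list, for example when a word such as "not" must be kept for sentiment analysis.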


## Removing punctuation, special characters and numbers

More examples: 
https://github.com/isegura/BasicNLP/tree/master/RegEx

In [15]:
import re

print('input:', s)
# Remove everything that is neither a word character nor whitespace
clean = re.sub(r'[^\w\s]+', '', s)
print("after removing punctuation, special characters: ", clean)
# Remove digits
clean = re.sub(r'\d+', '', clean)
print("after removing numbers: ", clean)


input: 25 children , playing , parents chatting .
after removing punctuation, special characters:  25 children  playing  parents chatting 
after removing numbers:   children  playing  parents chatting 
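
The cleaned string still carries the gaps left behind by the removed characters: a leading space, a trailing space and double spaces between words. A possible follow-up step, sketched here with a hypothetical helper `normalize_whitespace`, collapses them:

```python
import re

def normalize_whitespace(text):
    """Collapse runs of whitespace into one space and trim both ends."""
    return re.sub(r'\s+', ' ', text).strip()

# The string produced by the two substitutions above
clean = ' children  playing  parents chatting '
print(repr(normalize_whitespace(clean)))
```

This matters when the text is later split on single spaces, since leftover double spaces would otherwise produce empty tokens.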


# More... 

* Named Entity Recognition with spaCy: https://github.com/isegura/BasicNLP/blob/master/NER/IntroNER_spacy.ipynb
* Noun chunker and parsing: https://github.com/isegura/BasicNLP/blob/master/TextProccesing/SpaCy_NLP.ipynb
* Word embeddings with spaCy: https://github.com/isegura/BasicNLP/blob/master/TextSimilarity/Word_Embeddings_By_Spacy.ipynb

