# Text analysis and KNN

The basic elements of an NLP pipeline:
- Word tokenization;
- Identifying stop words;
- Stemming;
- Lemmatization;
- POS tagging;
- NER tagging;
- Sentence segmentation;
- The Bag of Words (BOW) representation;
- Text classification.

### 1. Text processing basics

In [1]:
import spacy

nlp = spacy.load('en_core_web_sm')

words = [t.text for t in nlp.vocab]

doc = nlp("Let's go to N.Y.!")  # processing the text also adds new words to nlp.vocab
print(doc)

words2 = [t.text for t in nlp.vocab]
print(set(words2) - set(words))  # print the words that were added during processing

Let's go to N.Y.!
{'go', 'to', '!'}


### 1.1 Tokenization

In [2]:
for i,token in enumerate(doc):
    print(f'Token {i+1} -> {token}')

Token 1 -> Let
Token 2 -> 's
Token 3 -> go
Token 4 -> to
Token 5 -> N.Y.
Token 6 -> !
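The split of `Let's` into `Let` and `'s`, and the fact that `N.Y.` stays in one piece, come from tokenizer rules for contractions and abbreviations. The following is a minimal regex sketch of that idea, not spaCy's actual algorithm; the `toy_tokenize` function and its pattern are assumptions made for illustration only.

```python
import re

# Hypothetical toy pattern; the alternatives are tried from top to bottom.
TOKEN_PATTERN = re.compile(
    r"""
    (?:[A-Za-z]\.){2,}   # abbreviations such as N.Y. or U.S.A.
    | '[a-z]+            # clitics such as 's or 'll
    | [A-Za-z]+          # plain words
    | [^\sA-Za-z]        # any other single non-space character
    """,
    re.VERBOSE,
)

def toy_tokenize(text):
    # findall returns the full matches because the pattern has no capturing groups.
    return TOKEN_PATTERN.findall(text)

print(toy_tokenize("Let's go to N.Y.!"))
```

On this sentence the sketch yields the same six tokens as the cell above. It will diverge from spaCy on other inputs (for example `don't`), since spaCy relies on a much richer set of exceptions.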


### 1.2 Stemming

In [3]:
import nltk
from nltk.stem.porter import PorterStemmer

p_stemmer = PorterStemmer()
words = ['go', 'goes', 'went', 'wish', 'wishes', 'wished', 'runner', 'ran', 'running']

for w in words:
    print(f"{w} -> {p_stemmer.stem(w)}")

go -> go
goes -> goe
went -> went
wish -> wish
wishes -> wish
wished -> wish
runner -> runner
ran -> ran
running -> run
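The result `goes -> goe` looks odd, but it follows from how the Porter algorithm works: it strips suffixes by rule and does not consult a dictionary, which is also why irregular forms such as `went` and `ran` are left unchanged. As a rough illustration, the sketch below implements only Step 1a (plural suffixes) of the original Porter algorithm. The function name `porter_step_1a` is an assumption, and NLTK's `PorterStemmer` applies several further steps plus some extensions of its own.

```python
def porter_step_1a(word):
    """Apply only Step 1a of the original Porter algorithm (plural suffixes)."""
    if word.endswith("sses"):
        return word[:-2]   # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]   # ponies -> poni
    if word.endswith("ss"):
        return word        # caress -> caress
    if word.endswith("s"):
        return word[:-1]   # goes -> goe
    return word

for w in ["goes", "wishes", "caresses", "ponies", "went"]:
    print(f"{w} -> {porter_step_1a(w)}")
```

Here `wishes` becomes `wishe`; in the full algorithm a later step removes the trailing `e`, which gives the `wish` seen in the output above, while for `goe` the conditions of that later rule are not met.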


### 1.3 Lemmatization

In [4]:
doc = nlp("I will meet you in the meeting after meeting the runner when running.")
for token in doc:
    print(f"{token.text} -> {token.lemma_}")

I -> I
will -> will
meet -> meet
you -> you
in -> in
the -> the
meeting -> meeting
after -> after
meeting -> meet
the -> the
runner -> runner
when -> when
running -> run
. -> .
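Note that the two occurrences of `meeting` receive different lemmas: the noun stays `meeting`, while the verb form becomes `meet`. A lemmatizer therefore needs the part of speech as well as the surface form. The sketch below illustrates this with a hand-written lookup table; `TOY_LEMMAS` and `toy_lemmatize` are hypothetical names, and the approach is far simpler than what spaCy does internally.

```python
# Hypothetical lookup table keyed by (lowercased word, coarse POS tag).
TOY_LEMMAS = {
    ("meeting", "NOUN"): "meeting",
    ("meeting", "VERB"): "meet",
    ("running", "VERB"): "run",
}

def toy_lemmatize(word, pos):
    # Fall back to the word itself when the (word, POS) pair is not in the table.
    return TOY_LEMMAS.get((word.lower(), pos), word)

print(toy_lemmatize("meeting", "NOUN"))  # meeting
print(toy_lemmatize("meeting", "VERB"))  # meet
```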


### 1.4 Stop Words

In [5]:
print(list(nlp.Defaults.stop_words)[:10], end="\n\n") 

for t in doc:
    print(f"{t.text} -> {t.is_stop}")
print("\n")

['keep', 'sometime', 'after', "'s", 'indeed', 'me', 'my', 'still', 'namely', 'various']

I -> True
will -> True
meet -> False
you -> True
in -> True
the -> True
meeting -> False
after -> True
meeting -> False
the -> True
runner -> False
when -> True
running -> False
. -> False




In [6]:
nlp.Defaults.stop_words.remove('go')  # remove 'go' from the stop words
nlp.vocab['go'].is_stop = False
outcome = nlp.vocab['go'].is_stop
print(f"Is 'go' a stop word now? {outcome}")

Is 'go' a stop word now? False


In [7]:
nlp.Defaults.stop_words.add('!')  # add '!' to the stop words
nlp.vocab['!'].is_stop = True
outcome = nlp.vocab['!'].is_stop
print(f"Is '!' a stop word now? {outcome}")

Is '!' a stop word now? True
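Stop-word flags are typically used to filter tokens before building features. The following is a minimal sketch of that filtering step with a plain Python set, assuming a small hand-picked stop list; `TOY_STOP_WORDS` and `remove_stop_words` are hypothetical and do not reproduce spaCy's default list.

```python
# Hypothetical, hand-picked stop list; spaCy's default list is much larger.
TOY_STOP_WORDS = {"i", "will", "you", "in", "the", "after", "when"}

def remove_stop_words(tokens, stop_words=TOY_STOP_WORDS):
    # Compare in lowercase so that "I" and "The" are matched as well.
    return [t for t in tokens if t.lower() not in stop_words]

tokens = ["I", "will", "meet", "you", "in", "the", "meeting", "after",
          "meeting", "the", "runner", "when", "running", "."]
print(remove_stop_words(tokens))

# Customizing the list mirrors the two cells above: add or remove entries.
custom = (TOY_STOP_WORDS | {"."}) - {"will"}
print(remove_stop_words(tokens, custom))
```

With the default toy list only the content words and the final period survive; with the customized list the period is dropped and `will` is kept.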


### 1.5 Part of Speech (POS) Tagging

In [12]:
for t in nlp("\"Let's go to N.Y.!\""): 
    print(f"{t.text} -> {t.pos_}")

" -> PUNCT
Let -> VERB
's -> PRON
go -> VERB
to -> ADP
N.Y. -> PROPN
! -> PUNCT
" -> PUNCT


In [13]:
for t in nlp("\"Let's go to N.Y.!\""):  # get more detailed tag descriptions
    print(f"{t.text} -> {spacy.explain(t.tag_)}")

" -> opening quotation mark
Let -> verb, base form
's -> pronoun, personal
go -> verb, base form
to -> conjunction, subordinating or preposition
N.Y. -> noun, proper singular
! -> punctuation mark, sentence closer
" -> closing quotation mark

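`t.pos_` holds the coarse part-of-speech tag, while `t.tag_` holds the fine-grained tag that `spacy.explain` describes. The sketch below relates the two for the example sentence using a hand-written partial table. The fine-grained tag strings are inferred from the descriptions printed above and are an assumption, as is the name `TOY_TAG_TO_POS`; the table covers only the tags seen in this example.

```python
# Hypothetical partial table: fine-grained tag -> coarse tag for this example.
TOY_TAG_TO_POS = {
    "``": "PUNCT",   # opening quotation mark
    "VB": "VERB",    # verb, base form
    "PRP": "PRON",   # pronoun, personal
    "IN": "ADP",     # preposition here (can also be a subordinating conjunction)
    "NNP": "PROPN",  # noun, proper singular
    ".": "PUNCT",    # punctuation mark, sentence closer
    "''": "PUNCT",   # closing quotation mark
}

# Assumed fine-grained tags for: " Let 's go to N.Y. ! "
fine_tags = ["``", "VB", "PRP", "VB", "IN", "NNP", ".", "''"]
coarse_tags = [TOY_TAG_TO_POS[tag] for tag in fine_tags]
print(coarse_tags)
```

Several fine-grained tags collapse into the same coarse tag (all three punctuation tags become `PUNCT`), which is why the second cell gives more detail than the first.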