# spaCy library

The spaCy library is a powerful and efficient library for natural language processing (NLP) in Python. It provides a range of tools and features for processing and analyzing textual data. 

In [1]:
import spacy
print(spacy.__version__)

3.7.5


`en_core_web_sm` is a small statistical model for English.

Loading it with `spacy.load` returns the pipeline object, conventionally named `nlp`.

Calling `nlp` on a text returns a `Doc` object, a container that stores the processed text.

In [2]:
nlp = spacy.load("en_core_web_sm")

In [3]:
# check the type of nlp object
type(nlp)

spacy.lang.en.English

## Tokenization
A `Token` is defined as the smallest meaningful part of the text. 

Tokenization: the process of dividing a text into a list of meaningful tokens. 
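
To see why tokenization takes more than splitting on spaces, here is a minimal sketch in plain Python. It compares `str.split` with a simple regular expression. The pattern is an illustrative assumption and is not how spaCy tokenizes.

```python
import re

text = "Jack didn't want to pay Annie 100$ for this book."

# Naive approach: split on whitespace. Punctuation stays glued to words.
whitespace_tokens = text.split()

# Runs of word characters, or single punctuation marks.
# Illustrative pattern only; it breaks "didn't" into three pieces.
regex_tokens = re.findall(r"\w+|[^\w\s]", text)

print(whitespace_tokens)
print(regex_tokens)
```

Whitespace splitting leaves `100$` and `book.` as single tokens, while the regular expression splits `didn't` into `didn`, `'`, and `t`. Neither split is linguistically sensible, which is the gap a dedicated tokenizer is meant to fill.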

### Tokenization with spaCy

In [4]:
# Sample sentence.
text = "Jack didn't want to pay Annie 100$ for this book."
doc = nlp(text)

In [5]:
print([token.text for token in doc])

['Jack', 'did', "n't", 'want', 'to', 'pay', 'Annie', '100', '$', 'for', 'this', 'book', '.']


Note how
- "didn't" is separated into "did"  and "n't".
- the currency symbol and amount are separated.
- the period at the end of the sentence is its own token.

The `Doc` object can be indexed and sliced like a regular list. It contains `Token` and `Span` objects, which offer different views into the text.
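
To make the container idea concrete, here is a minimal sketch in plain Python; it is not spaCy's implementation. The names `ToyDoc` and `ToySpan` are hypothetical. Integer indexing returns a single item, while slicing returns a lightweight view that stores `start` and `end` offsets into its parent, much as a spaCy `Span` exposes `start`, `end`, and `doc`.

```python
from dataclasses import dataclass


class ToyDoc:
    """A container of words that hands out single items or span views."""

    def __init__(self, words):
        self.words = list(words)

    def __len__(self):
        return len(self.words)

    def __getitem__(self, key):
        if isinstance(key, slice):
            # Resolve None/negative bounds; the step is ignored in this sketch.
            start, end, _step = key.indices(len(self.words))
            return ToySpan(self, start, end)
        return self.words[key]


@dataclass
class ToySpan:
    """A view over part of a ToyDoc: it stores offsets, not copies."""
    doc: ToyDoc
    start: int
    end: int

    @property
    def words(self):
        return self.doc.words[self.start:self.end]


toy_doc = ToyDoc(["Jack", "did", "n't", "want"])
span = toy_doc[0:3]
print(toy_doc[0])            # a single item
print(type(span).__name__)   # a span view
print(span.start, span.end)  # offsets into the parent container
```

Because the view only keeps offsets, it stays tied to the parent container rather than duplicating its data.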

In [6]:
# We can view an individual token by indexing into the Doc object.
print(doc[0])

Jack


In [7]:
# A Doc object is a container of other objects, namely Token and Span objects.
print(type(doc[0]))

<class 'spacy.tokens.token.Token'>


In [8]:
# Slicing a Doc object returns a Span object.
print(doc[0:3])
print(type(doc[0:3]))

Jack didn't
<class 'spacy.tokens.span.Span'>


In [9]:
# Access a token's index within the Doc through its i attribute.
print([(t.text, t.i) for t in doc])

[('Jack', 0), ('did', 1), ("n't", 2), ('want', 3), ('to', 4), ('pay', 5), ('Annie', 6), ('100', 7), ('$', 8), ('for', 9), ('this', 10), ('book', 11), ('.', 12)]


spaCy's tokenization is _non-destructive_, which means the original input can be reconstructed from the tokens.

In [10]:
# You can view the original input like so:
print(doc.text)

Jack didn't want to pay Annie 100$ for this book.
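
The idea behind non-destructive tokenization can be sketched in plain Python: keep the whitespace that followed each token, and the input can be rebuilt exactly. The token list below is hand-written to mirror the tokens shown earlier and is not produced by spaCy. In spaCy itself, the corresponding information is available through `Token.whitespace_` and `Token.text_with_ws`.

```python
text = "Jack didn't want to pay Annie 100$ for this book."

# Each entry pairs a token with the whitespace that followed it in the input.
# Hand-written for illustration; not spaCy output.
tokens_with_ws = [
    ("Jack", " "), ("did", ""), ("n't", " "), ("want", " "), ("to", " "),
    ("pay", " "), ("Annie", " "), ("100", ""), ("$", " "), ("for", " "),
    ("this", " "), ("book", ""), (".", ""),
]

# Concatenating token + trailing whitespace restores the input exactly.
reconstructed = "".join(tok + ws for tok, ws in tokens_with_ws)
print(reconstructed == text)

# Joining the bare tokens with single spaces does not.
lossy = " ".join(tok for tok, _ in tokens_with_ws)
print(lossy)
```

A tokenizer that discards this whitespace information is destructive: "did n't" and "didn't" can no longer be told apart after the fact.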


### Exercise

EXERCISE 1:
1) Tokenize the following text
2) Iterate through the tokens to check whether there's a currency symbol.
3) If there is, and the currency symbol is followed by a number, print both the symbol and the number.

See https://spacy.io/api/token#attributes for how to check whether a token is a currency symbol or a number.

Expected output: "$20".

In [11]:
s = "He didn't want to pay $20 for this book."
doc1 = nlp(s)

In [12]:
for index, token in enumerate(doc1): 
    if token.is_currency: 
        # Check if the next token exists and is a number
        if index + 1 < len(doc1) and doc1[index + 1].like_num:
            # Print the currency symbol and the number
            print(token.text + doc1[index + 1].text)

$20
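
As an alternative to indexing with `index + 1`, consecutive tokens can be paired with `zip`, which removes the need for a bounds check. The sketch below works on a plain list of strings, and the helpers `looks_like_currency` and `looks_like_number` are hypothetical stand-ins for spaCy's `is_currency` and `like_num`. They are rough approximations, not spaCy's logic.

```python
import unicodedata


def looks_like_currency(token: str) -> bool:
    # Unicode category "Sc" means "Symbol, currency" ($, €, £, ...).
    return len(token) == 1 and unicodedata.category(token) == "Sc"


def looks_like_number(token: str) -> bool:
    # Rough stand-in: digits with optional commas and one decimal point.
    cleaned = token.replace(",", "").replace(".", "", 1)
    return cleaned.isdigit()


def currency_amounts(tokens):
    """Return symbol+number strings for each currency symbol followed by a number."""
    # zip pairs each token with its successor, so the last token is never overrun.
    return [cur + nxt for cur, nxt in zip(tokens, tokens[1:])
            if looks_like_currency(cur) and looks_like_number(nxt)]


tokens = ["He", "did", "n't", "want", "to", "pay", "$", "20",
          "for", "this", "book", "."]
print(currency_amounts(tokens))
```

With this token list the function returns `['$20']`. A symbol that comes after the amount, as in "100$", is not matched, because only the symbol-then-number order is checked.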


EXERCISE 2: Look up how to tokenize the sentence below using NLTK. The imports are done for you. Does the NLTK tokenizer handle "N.Y.C." correctly?

In [13]:
from nltk.tokenize import TreebankWordTokenizer

s = "Let's go to N.Y.C. for the weekend."

tokenizer = TreebankWordTokenizer()

tokens = tokenizer.tokenize(s)

print(tokens)

['Let', "'s", 'go', 'to', 'N.Y.C.', 'for', 'the', 'weekend', '.']


Remark: NLTK's `TreebankWordTokenizer` keeps "N.Y.C." as a single token and preserves its internal periods, which is the desired behavior for abbreviations.

## spaCy basics

### Case Folding
Lowercasing or uppercasing all tokens.

Pros: Shrinks the vocabulary, which saves space and computation.

Cons: Information loss (for example, "US" and "us" become the same token).
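
The trade-off can be checked on a small example. This is a plain-Python sketch with a made-up token list: folding shrinks the vocabulary, but it also merges tokens that meant different things.

```python
# Made-up tokens for illustration.
tokens = ["The", "US", "team", "told", "us", "the", "news", ".", "The", "End", "."]

vocab = set(tokens)
folded_vocab = {t.lower() for t in tokens}

# Vocabulary size before and after case folding.
print(len(vocab), len(folded_vocab))

# "US" (the country) and "us" (the pronoun) are now indistinguishable.
print("us" in folded_vocab, "US" in folded_vocab)
```

Here the vocabulary drops from 9 entries to 7, at the cost of no longer separating "US" from "us".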

In [14]:
# View your document with case-folding using the lower_ attribute.
print([t.lower_ for t in doc])

['jack', 'did', "n't", 'want', 'to', 'pay', 'annie', '100', '$', 'for', 'this', 'book', '.']


In [15]:
# You can also apply conditions when generating these views. 
# For example, we can skip case-folding if a token is the start of a sentence.

print([t.lower_ if not t.is_sent_start else t.text for t in doc])

['Jack', 'did', "n't", 'want', 'to', 'pay', 'annie', '100', '$', 'for', 'this', 'book', '.']


### Stop Word Removal 
Removing words that occur frequently but carry little information (the, a, of, ...).

In [16]:
# spaCy's default stop word list.
print(nlp.Defaults.stop_words)
print(len(nlp.Defaults.stop_words))

{'just', 'thence', 're', 'here', 'move', 'two', '‘d', 'and', 'make', 'thru', 'one', 'alone', 'other', 'on', 'rather', 'formerly', 'wherein', 'its', '’ve', 'how', 'with', "'re", 'once', 'back', 'regarding', 'our', 'of', 'whereupon', 'ca', 'i', 'twelve', 'almost', 'nobody', 'was', 'can', 'in', 'whither', 'their', 'various', 'forty', 'mine', 'yourself', "'m", 'my', 'that', 'wherever', 'there', 'put', 'done', 'itself', 'or', 'anywhere', 'front', 'will', 'me', 'a', 'afterwards', 'through', 'another', 'hereafter', 'please', 'seeming', 'own', "'ve", 'him', 'been', 'often', 'n’t', 'twenty', 'themselves', 'onto', 'somehow', 'using', 'less', 'therein', 'third', 'becoming', 'who', 'moreover', 'any', 'well', 'amount', 'always', 'get', 'so', 'really', 'too', 'around', 'although', 'again', 'neither', 'anyway', 'yet', 'last', 'serious', 'throughout', 'when', 'this', "n't", 'whence', 'beyond', 'fifteen', 'least', 'full', 'anyhow', 'why', '’re', 'never', 'call', 'across', 'above', 'per', 'others', 'out

In [17]:
# tokens without stop words
print([t for t in doc if not t.is_stop])

[Jack, want, pay, Annie, 100, $, book, .]
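
Note that the filtered output above no longer contains "did" or "n't", so the negation in the sentence is lost. One possible remedy is to exclude negation markers from the stop list before filtering. The sketch below uses plain Python and a hypothetical, hand-picked stop list that is far shorter than spaCy's default.

```python
# Hypothetical stop list for illustration only.
stop_words = {"did", "n't", "to", "for", "this"}

tokens = ["Jack", "did", "n't", "want", "to", "pay", "Annie", "100", "$",
          "for", "this", "book", "."]

# Plain filtering drops the negation along with the other stop words.
filtered = [t for t in tokens if t.lower() not in stop_words]
print(filtered)

# Keep negation markers by removing them from the stop list first.
negations = {"n't", "not", "no", "never"}
stop_words_keep_neg = stop_words - negations
filtered_keep_neg = [t for t in tokens if t.lower() not in stop_words_keep_neg]
print(filtered_keep_neg)
```

Whether negations should be kept depends on the task: for sentiment analysis they usually matter, while for topic modeling they often do not.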


### Lemmatization 
Reducing a word to its lemma, or dictionary form (did, does, do => do, ...).

In [18]:
print([(t.text, t.lemma_) for t in doc])

[('Jack', 'Jack'), ('did', 'do'), ("n't", 'not'), ('want', 'want'), ('to', 'to'), ('pay', 'pay'), ('Annie', 'Annie'), ('100', '100'), ('$', '$'), ('for', 'for'), ('this', 'this'), ('book', 'book'), ('.', '.')]
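
At its simplest, lemmatization is a dictionary lookup. The sketch below uses a tiny hypothetical table (`LEMMA_TABLE`) to show the idea. It does not reflect how spaCy's lemmatizer works, which draws on much larger resources and on information such as part of speech.

```python
# Toy lookup table (hypothetical); real lemmatizers cover far more forms.
LEMMA_TABLE = {"did": "do", "does": "do", "doing": "do",
               "n't": "not", "books": "book"}


def toy_lemma(token: str) -> str:
    # Fall back to the token itself when the table has no entry.
    return LEMMA_TABLE.get(token.lower(), token)


tokens = ["Jack", "did", "n't", "want", "books"]
print([(t, toy_lemma(t)) for t in tokens])
```

Unlike stemming, the result of a lookup is always a real word, but words missing from the table are simply returned unchanged.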


### Stemming 
Removing word suffixes (and sometimes prefixes), e.g. ing, s, y, ed, ...

Pros: Reduces vocabulary size and lets the model treat words with the same stem alike.

Cons: A stemmer can overstem ('university' and 'universe' both stem to 'univers') and understem ('alumnus' and 'alumni' stem to 'alumnu' and 'alumni'), which leads to poor results.
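
Both failure modes can be reproduced with a deliberately naive suffix stripper. This is a toy sketch in plain Python, not the Porter or Snowball algorithm; the suffix list and the `min_stem` threshold are assumptions chosen for illustration.

```python
# Suffixes are checked in this order; the first match wins.
SUFFIXES = ("ing", "ity", "ly", "ed", "e", "s")


def toy_stem(word: str, min_stem: int = 3) -> str:
    """Strip the first matching suffix, keeping at least min_stem characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


for w in ["university", "universe", "alumnus", "alumni"]:
    print(w, "->", toy_stem(w))
```

The first pair collapses to the same stem even though the words are unrelated in meaning (overstemming), while the second pair ends up with two different stems even though the words share a lemma (understemming).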

NLTK stands for Natural Language Toolkit.

Components of NLTK:
- **Corpora**: Collections of texts for training and testing NLP models (e.g., movie reviews, newspapers).
- **Modules** and Libraries: Tools for processing natural language data, such as tokenizers, stemmers, and parsers.
- **Tools** and Algorithms: Implementations of various NLP algorithms and techniques.

In [19]:
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import word_tokenize

# Sentence to tokenize and stem
s = 'He told Dr. Lovato that he was done with the tests and would post the results shortly.'

# Initialize the SnowballStemmer for English
stemmer = SnowballStemmer(language='english')

# Tokenize the sentence
tokens1 = word_tokenize(s)

# Stem each token and collect the stemmed tokens
stemmed_tokens = [stemmer.stem(token) for token in tokens1]

# Print the stemmed tokens
print("Original sentence tokens:", tokens1)
print("Stemmed sentence tokens:", stemmed_tokens)

Original sentence tokens: ['He', 'told', 'Dr.', 'Lovato', 'that', 'he', 'was', 'done', 'with', 'the', 'tests', 'and', 'would', 'post', 'the', 'results', 'shortly', '.']
Stemmed sentence tokens: ['he', 'told', 'dr.', 'lovato', 'that', 'he', 'was', 'done', 'with', 'the', 'test', 'and', 'would', 'post', 'the', 'result', 'short', '.']
