-----------------------
#### Tokenization with spaCy
--------------------

In [3]:
#pip install spacy

In [7]:
#!python -m spacy download en_core_web_sm

In [26]:
import spacy

nlp = spacy.load("en_core_web_sm")

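**Hyphenated Words**

spaCy's tokenizer splits hyphenated words at each hyphen, so a compound such as "state-of-the-art" becomes seven tokens.

NOT GOOD when the compound should be treated as a single unit.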

In [27]:
text = "Spacy is a state-of-the-art natural language processing library."

doc = nlp(text)

# Tokenize the text
tokens = [token.text for token in doc]
print(tokens)

['Spacy', 'is', 'a', 'state', '-', 'of', '-', 'the', '-', 'art', 'natural', 'language', 'processing', 'library', '.']


In [28]:
text = "severe run-down"

doc = nlp(text)

# Tokenize the text
tokens = [token.text for token in doc]
print(tokens)

['severe', 'run', '-', 'down']


In [29]:
text = "Self-driving cars are becoming more common in modern society."

doc = nlp(text)

# Tokenize the text
tokens = [token.text for token in doc]
print(tokens)

['Self', '-', 'driving', 'cars', 'are', 'becoming', 'more', 'common', 'in', 'modern', 'society', '.']
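If the split compounds are a problem, one option is to rejoin them after tokenization. The sketch below is a minimal post-processing step on the plain token strings; `merge_hyphenated` is a hypothetical helper written for this notebook, not part of spaCy. It assumes every standalone `-` token sitting between two tokens belongs to a hyphenated word, so it would also glue together a spaced dash like "run - down".

```python
def merge_hyphenated(tokens):
    """Rejoin tokens that were split around a standalone '-' token.

    Hypothetical helper (not a spaCy API). Assumes a '-' between two
    tokens always marks a hyphenated compound.
    """
    merged = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        # A hyphen with a token on each side: glue the hyphen and the
        # following token onto the previous one.
        if tok == "-" and merged and i + 1 < len(tokens):
            merged[-1] = merged[-1] + "-" + tokens[i + 1]
            i += 2
        else:
            merged.append(tok)
            i += 1
    return merged


# Token list copied from the output of the first cell above.
tokens = ['Spacy', 'is', 'a', 'state', '-', 'of', '-', 'the', '-', 'art',
          'natural', 'language', 'processing', 'library', '.']
print(merge_hyphenated(tokens))
# ['Spacy', 'is', 'a', 'state-of-the-art', 'natural', 'language', 'processing', 'library', '.']
```

When working with a `Doc` rather than plain strings, checking `token.whitespace_` on the neighbouring tokens would make it possible to tell "run-down" apart from "run - down".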


**Comma Separated Words:**

spaCy's tokenizer splits comma-separated words into separate tokens and emits each comma as its own token.

In [30]:
text = "I like apples, oranges, and bananas."

doc = nlp(text)

# Tokenize the text
tokens = [token.text for token in doc]
print(tokens)

['I', 'like', 'apples', ',', 'oranges', ',', 'and', 'bananas', '.']


**Abbreviations:**

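spaCy's tokenizer keeps common abbreviations such as "Dr.", "Ph.D." and "St." as single tokens, periods included. An acronym without internal periods, such as "MIT" at the end of a sentence, is separated from the sentence-final period.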

In [32]:
text = "Dr. Smith is a Ph.D. graduate from MIT. St. Xaviers"

doc = nlp(text)

# Tokenize the text
tokens = [token.text for token in doc]
print(tokens)

['Dr.', 'Smith', 'is', 'a', 'Ph.D.', 'graduate', 'from', 'MIT', '.', 'St.', 'Xaviers']


**Contractions**

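spaCy's tokenizer splits contractions into separate tokens: "can't" becomes "ca" and "n't", and "it's" becomes "it" and "'s".

NOT GOOD if downstream code expects whole words.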

In [33]:
text = "I can't believe it's already evening."

doc = nlp(text)

# Tokenize the text
tokens = [token.text for token in doc]
print(tokens)

['I', 'ca', "n't", 'believe', 'it', "'s", 'already', 'evening', '.']
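If whole-word contractions are needed, the pieces can be glued back together after tokenization. The following is a rough sketch on plain token strings; `merge_contractions` is a hypothetical helper, not a spaCy function. It assumes that any token equal to "n't", or starting with an apostrophe and longer than one character, is a clitic belonging to the previous token. That also reattaches possessives such as "'s" in "John's".

```python
# Straight and curly apostrophes, since either may appear in real text.
APOSTROPHES = ("'", "\u2019")


def merge_contractions(tokens):
    """Reattach clitics like "n't" and "'s" to the preceding token.

    Hypothetical helper (not a spaCy API), based on a simple heuristic.
    """
    merged = []
    for tok in tokens:
        is_clitic = tok.lower() in ("n't", "n\u2019t") or (
            len(tok) > 1 and tok.startswith(APOSTROPHES)
        )
        if merged and is_clitic:
            merged[-1] += tok
        else:
            merged.append(tok)
    return merged


# Token list copied from the output of the cell above.
tokens = ['I', 'ca', "n't", 'believe', 'it', "'s", 'already', 'evening', '.']
print(merge_contractions(tokens))
# ['I', "can't", 'believe', "it's", 'already', 'evening', '.']
```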


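**Punctuation**

spaCy's tokenizer splits most punctuation into separate tokens, such as the final "!" below. Some symbols stay attached to the word, as the leading "@" in "@over" shows.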

In [34]:
text = "The cat jumped @over the fence!"

doc = nlp(text)

# Tokenize the text
tokens = [token.text for token in doc]
print(tokens)

['The', 'cat', 'jumped', '@over', 'the', 'fence', '!']
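Because punctuation comes out as separate tokens, it is easy to filter out afterwards. The sketch below works on the plain token strings; `drop_punctuation` is a hypothetical helper, and it only recognizes the ASCII characters in Python's `string.punctuation`. On a `Doc`, the `token.is_punct` attribute would be the more natural check.

```python
import string


def drop_punctuation(tokens):
    """Remove tokens made up entirely of ASCII punctuation characters.

    Hypothetical helper (not a spaCy API). Tokens that mix punctuation
    with letters, such as '@over', are kept.
    """
    punct = set(string.punctuation)
    return [tok for tok in tokens if not all(ch in punct for ch in tok)]


# Token list copied from the output of the cell above.
tokens = ['The', 'cat', 'jumped', '@over', 'the', 'fence', '!']
print(drop_punctuation(tokens))
# ['The', 'cat', 'jumped', '@over', 'the', 'fence']
```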


#### Examples

In [35]:
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
    
# Tokenize the text
tokens = [token.text for token in doc]
print(tokens)

['Apple', 'is', 'looking', 'at', 'buying', 'U.K.', 'startup', 'for', '$', '1', 'billion']


**Pros of spaCy's Tokenizer:**
- Fast and efficient tokenization, especially for large datasets.
- Handles many edge cases through built-in exception rules, for example keeping "Dr.", "Ph.D." and "U.K." intact.
- Supports multiple languages and provides pre-trained models.
- Integrates with other NLP functionality such as part-of-speech tagging, named entity recognition, and dependency parsing.

**Cons of spaCy's Tokenizer:**
- Tokenization rules can be customized (special cases and prefix, suffix, and infix patterns), but doing so takes more effort than swapping in a regex tokenizer in NLTK.
- Less convenient for specialized tokenization tasks that need highly customized rules; the defaults split hyphenated words and contractions, as shown above.
- Trained pipelines must be downloaded separately for each language, including English (`en_core_web_sm` above).
- Heavier memory footprint than simpler tokenization libraries for basic tasks.