## A. Tokenization in Standard Python Library
In this section, we tokenize text with Python's built-in `str.split()` method.
Called without arguments, `split()` breaks a string only at whitespace, so punctuation stays attached to neighboring words (`working!`) and contractions such as `can't` are never separated into their parts. It is simple, but it has no notion of abbreviations or other complex linguistic structures.

In [None]:
# Tokenization using split()
sentence1 = "I can't believe it's working!"
sentence2 = "Dr. Smith lives on St. Patrick's St."
basic_split1 = sentence1.split()  # split on whitespace only
basic_split2 = sentence2.split()
print("Sentence 1 split result:", basic_split1)
print("Sentence 2 split result:", basic_split2)

In [None]:
# Loop through each sentence in challenging_examples and tokenize each word
# Identify any issues with basic split() tokenization

challenging_examples = [
    "U.S.A. vs. USA vs. U.S.",
    "It's 3:30 p.m. on Dec. 25th.",
    "Email me at john@company.co.uk",
    "She scored 95.5% on the test.",
]

# CODE HERE
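
One possible way to approach the exercise above is sketched below. It splits each sentence on whitespace and flags every token that still contains a punctuation character. The helper `flag_suspect_tokens` and the dictionary `split_results` are hypothetical names introduced for illustration, and the "contains punctuation" check is only a rough heuristic, not a definitive test of tokenization quality.

```python
import string

challenging_examples = [
    "U.S.A. vs. USA vs. U.S.",
    "It's 3:30 p.m. on Dec. 25th.",
    "Email me at john@company.co.uk",
    "She scored 95.5% on the test.",
]


def flag_suspect_tokens(tokens):
    """Return the tokens that contain at least one punctuation character.

    Hypothetical heuristic: such tokens are worth a closer look,
    but they are not necessarily wrong.
    """
    return [tok for tok in tokens if any(ch in string.punctuation for ch in tok)]


split_results = {}
for sentence in challenging_examples:
    tokens = sentence.split()  # whitespace-only splitting
    split_results[sentence] = tokens
    print("Sentence:", sentence)
    print("  Tokens: ", tokens)
    print("  Suspect:", flag_suspect_tokens(tokens))
```

Flagged tokens still need human judgment: `test.` carries a sentence-final period that should probably be its own token, whereas `john@company.co.uk` is flagged even though keeping it whole is arguably the desired result.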


## B. Tokenization in spaCy
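The `spaCy` library provides a rule-based, non-destructive tokenizer: it splits text on whitespace and then applies language-specific rules for punctuation, contractions, and known exceptions such as abbreviations. It is fast and robust, and it handles edge cases far better than plain whitespace splitting. Statistical models come into play later in the pipeline, in components such as the tagger, parser, and named entity recognizer.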

Figure: the spaCy processing pipeline (`spacy_pipeline.svg`).


In [None]:
# Install the spaCy library
%pip install spacy

# Install the small English language model for spaCy
!python -m spacy download en_core_web_sm


In [None]:

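import spacy

# Load spaCy's English model
nlp = spacy.load("en_core_web_sm")

# Apply the NLP pipeline to the text (assumption: reuse sentence2 from Section A)
text = sentence2
doc = nlp(text)

# Extract tokens
tokens_spacy = [token.text for token in doc]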


In [None]:
# Print each token in the spaCy doc
for token in doc:
    print(token.text)


## C. Comparing Tokenization Approaches
Let us compare the outputs of `split()`, `NLTK`, and `spaCy` tokenization. Observe how each handles punctuation, contractions, and special cases.

In [None]:

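import nltk
from nltk.tokenize import word_tokenize

# Ensure NLTK tokenizer is ready
nltk.download("punkt")  # newer NLTK releases may need "punkt_tab" instead

# NLTK Tokenization
tokens_nltk = word_tokenize(text)

# Print comparison
print("split():", text.split())
print("NLTK:   ", tokens_nltk)
print("spaCy:  ", [token.text for token in doc])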


## D. Custom Tokenizer with Regular Expressions
Some tasks require a custom tokenizer that handles text patterns the general-purpose tools miss. Python's built-in `re` module is enough to build one.

In [None]:
import re

# Custom tokenizer using regex
def custom_tokenizer(text):
    pattern = r"\b\w+(?:['-]\w+)?\b"
    return re.findall(pattern, text)

# Test the custom tokenizer
tokens_custom = custom_tokenizer(text)
print(tokens_custom)
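# Note: This pattern keeps contractions ("can't") and single-hyphen words ("well-known") intact,
# but it drops punctuation and splits on periods, so "95.5%" becomes "95" and "5".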
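
To see where the regex tokenizer above succeeds and where it falls short, the sketch below runs the same pattern on a few of this notebook's sentences and contrasts it with `extended_tokenizer`. That second function is a hypothetical variant, not part of the original notebook: it assumes we also want to keep punctuation marks as separate tokens, keep decimals and percentages together, and keep e-mail-like strings whole.

```python
import re


def custom_tokenizer(text):
    # Same pattern as the cell above: a word, optionally followed by
    # one apostrophe- or hyphen-joined part.
    pattern = r"\b\w+(?:['-]\w+)?\b"
    return re.findall(pattern, text)


def extended_tokenizer(text):
    # Hypothetical variant. Alternatives are tried left to right at each
    # position, so the more specific patterns come first.
    pattern = r"""
        \w+(?:[.-]\w+)*@\w+(?:[.-]\w+)+   # e-mail-like strings
      | \d+(?:\.\d+)?%?(?!\w)             # integers, decimals, percentages
      | \w+(?:['-]\w+)*                   # words, contractions, multi-hyphen words
      | [^\w\s]                           # any other single punctuation mark
    """
    return re.findall(pattern, text, flags=re.VERBOSE)


samples = [
    "I can't believe it's working!",
    "She scored 95.5% on the test.",
    "Email me at john@company.co.uk",
]

for sample in samples:
    print("Text:    ", sample)
    print("Custom:  ", custom_tokenizer(sample))
    print("Extended:", extended_tokenizer(sample))
```

Even the extended pattern is only a starting point: abbreviations such as `p.m.` or `Dr.` are still broken apart at their periods, which is one reason library tokenizers maintain explicit exception lists.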

## E. Part of Speech Tagging with spaCy
Part-of-speech (POS) tagging is the process of labeling each token in a sentence with its grammatical category. spaCy exposes the result through the `pos_` and `tag_` attributes of each token.

In [None]:
# Display the parts of speech and other linguistic tagging information from tokens
for token in doc:
    print(f"{token.text:<12} {token.pos_:<6} {token.tag_:<6}")

# Example explanation:
# - `pos_`: Coarse-grained Universal POS tag, e.g., NOUN or VERB.
# - `tag_`: Fine-grained tag that adds detail such as tense or number, e.g., VBD or NNS.

## F. Named Entity Recognition with spaCy
Named Entity Recognition (NER) identifies spans of text that refer to real-world entities, such as people, organizations, dates, and locations, and assigns each span a type. spaCy's pretrained pipelines include an NER component, and the entities it finds are available through `doc.ents`.

In [None]:
# Display named entities in the text
for ent in doc.ents:
    print(ent.text, ent.label_)

# Example explanation:
# - `text`: The entity as it appears in the text.
# - `label_`: The type of the entity, e.g., PERSON, GPE (Geo-Political Entity), DATE.

### Visualizing Entities
spaCy also allows us to visualize entities in text using the `displacy` module.

In [None]:
from spacy import displacy

# Render the named entities
displacy.render(doc, style='ent', jupyter=True)
# Note: Run this in a Jupyter Notebook to see the visualization.