In [ ]:
"""
Title:   Web Text Tokenizer using NLTK and Project Gutenberg
Author:  Praveen Kumar G (22MID0300)
Date:    July 23, 2025
Purpose: To demonstrate how to fetch online text from Project Gutenberg using urllib,
         and tokenize it using the NLTK (Natural Language Toolkit) package.
         The program prints the first 200 tokens from the raw text content.
"""

In [1]:
# Importing 'request' from urllib to fetch web content
from urllib import request

# Importing nltk and its word tokenizer
import nltk
from nltk.tokenize import word_tokenize

# Downloading the 'punkt' tokenizer model (only required once per environment).
# Note: NLTK 3.9 and later look for the 'punkt_tab' resource instead.
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\hp\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [2]:
# Step 1: Fetch the raw text from a Project Gutenberg eBook
url = "https://www.gutenberg.org/cache/epub/76503/pg76503.txt"  # Link to the eBook text
response = request.urlopen(url)                                # Sending HTTP request and storing response
raw = response.read().decode('utf8')                           # Decoding the UTF-8 bytes into a Python string
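
The fetch in Step 1 assumes the request always succeeds and never stalls. A minimal sketch of a more defensive version follows. The helper name `fetch_text` and the 30-second timeout are assumptions for illustration, not part of the original notebook.

```python
from urllib import request


def fetch_text(url, timeout=30, encoding="utf-8"):
    """Hypothetical helper: return the decoded body of `url`, or None on failure."""
    try:
        # The context manager closes the connection even if read() fails
        with request.urlopen(url, timeout=timeout) as response:
            return response.read().decode(encoding)
    except OSError as err:
        # URLError, HTTPError, and socket timeouts are all subclasses of OSError
        print(f"Could not fetch {url}: {err}")
        return None
```

With this helper, Step 1 could become `raw = fetch_text(url)`, followed by a check that `raw` is not `None` before tokenizing.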

In [3]:
# Step 2: Tokenize the fetched text into words and punctuation
tokens = word_tokenize(raw)  # Using NLTK's tokenizer to split raw text into individual tokens
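
At this point `raw` holds the whole file, so the tokens also include Project Gutenberg's license header and footer. A minimal sketch of trimming them before tokenizing follows. It assumes the file marks the book body with lines starting `*** START OF` and `*** END OF` (the usual Project Gutenberg convention, not verified for this particular file), and `strip_gutenberg_boilerplate` is a hypothetical helper name.

```python
def strip_gutenberg_boilerplate(text):
    """Hypothetical helper: keep only the text between the START and END marker lines."""
    lines = text.splitlines()
    start, end = 0, len(lines)
    for i, line in enumerate(lines):
        if line.startswith("*** START OF"):
            start = i + 1  # body begins on the line after the START marker
        elif line.startswith("*** END OF"):
            end = i        # body stops on the line before the END marker
            break
    # If no marker is found, start and end still span the whole text
    return "\n".join(lines[start:end]).strip()


# Made-up sample text, used only to show the behaviour
sample = (
    "License header\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "Chapter one.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "License footer\n"
)
print(strip_gutenberg_boilerplate(sample))  # Chapter one.
```

In the notebook this would be applied as `tokens = word_tokenize(strip_gutenberg_boilerplate(raw))`, if trimming the license text is wanted.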

In [4]:
# Step 3: Display the first 200 tokens to verify the output
print("First 200 tokens from the eBook:")
print(tokens[:200])


First 200 tokens from the eBook:
['\ufeffThe', 'Project', 'Gutenberg', 'eBook', 'of', 'The', 'man', 'who', 'mastered', 'time', 'This', 'ebook', 'is', 'for', 'the', 'use', 'of', 'anyone', 'anywhere', 'in', 'the', 'United', 'States', 'and', 'most', 'other', 'parts', 'of', 'the', 'world', 'at', 'no', 'cost', 'and', 'with', 'almost', 'no', 'restrictions', 'whatsoever', '.', 'You', 'may', 'copy', 'it', ',', 'give', 'it', 'away', 'or', 're-use', 'it', 'under', 'the', 'terms', 'of', 'the', 'Project', 'Gutenberg', 'License', 'included', 'with', 'this', 'ebook', 'or', 'online', 'at', 'www.gutenberg.org', '.', 'If', 'you', 'are', 'not', 'located', 'in', 'the', 'United', 'States', ',', 'you', 'will', 'have', 'to', 'check', 'the', 'laws', 'of', 'the', 'country', 'where', 'you', 'are', 'located', 'before', 'using', 'this', 'eBook', '.', 'Title', ':', 'The', 'man', 'who', 'mastered', 'time', 'Author', ':', 'Ray', 'Cummings', 'Illustrator', ':', 'Ed', 'Valigursky', 'Release', 'date', ':', 'July', '14
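
The first token in the output, `'\ufeffThe'`, carries a byte order mark (BOM) from the start of the file. Decoding with `'utf8'` keeps the BOM as a character, so it ends up attached to the first word. The sketch below shows the difference on a made-up byte string (not the real download): Python's `utf-8-sig` codec drops a leading BOM if one is present.

```python
# Illustrative bytes only: a UTF-8 BOM followed by the first words of the file
sample_bytes = b"\xef\xbb\xbfThe Project Gutenberg eBook"

with_bom = sample_bytes.decode("utf-8")         # BOM survives as '\ufeff'
without_bom = sample_bytes.decode("utf-8-sig")  # leading BOM is stripped

print(repr(with_bom[:4]))     # '\ufeffThe'
print(repr(without_bom[:3]))  # 'The'
```

In the notebook, the same effect could be had with `decode('utf-8-sig')` in Step 1, or with `raw.lstrip('\ufeff')` before tokenizing.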

### Summary:
- Used `urllib.request` to fetch the raw text of an eBook from Project Gutenberg.
- Used `nltk.tokenize.word_tokenize` to split the text into individual tokens (words and punctuation marks).
- Downloaded the `punkt` tokenizer model that `word_tokenize` relies on.
- Displayed the first 200 tokens to verify that fetching and tokenizing worked.
