# **Experiment-6**

### Objective: 
Write a program that reads text data from a file and performs sentence-level and word-level tokenization using different tokenizers from Python's NLTK library.

In [1]:
%pip install nltk pandas

Note: you may need to restart the kernel to use updated packages.

### **Importing Libraries**

In [2]:
import nltk
from nltk.tokenize import (
    sent_tokenize,
    word_tokenize,
    PunktSentenceTokenizer,
    TreebankWordTokenizer,
    WordPunctTokenizer,
    WhitespaceTokenizer,
    RegexpTokenizer
)

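# Download the Punkt models used by sent_tokenize and word_tokenize.
# Note: newer NLTK releases (3.9+) look for the "punkt_tab" resource instead of "punkt".
nltk.download("punkt", download_dir="C:/nltk_data")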

[nltk_data] Downloading package punkt to C:/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### **Reading Text from a Text File**

In [3]:
# Read text data from a file
def read_text_file(file_path):
    """Read text data from the specified file."""
    with open(file_path, 'r', encoding='utf-8') as file:
        text = file.read()
    return text

# Specify the file path
file_path = 'input.txt'

# Read the content of input.txt
text = read_text_file(file_path)
print("Content of input.txt:")
print(text)


Content of input.txt:
Microsoft announced a new product launch event in New York City on December 1st, 2024. 
Elon Musk, the CEO of Tesla, hinted at a partnership with NASA for a Mars mission. 
The GDP of India grew by 7% in the first quarter of 2023, according to a report by Reuters. 
Amazon is planning to open a new data center in Dublin, Ireland next year.
Barack Obama gave a keynote speech at Stanford University last Thursday.
Bitcoin reached an all-time high of $68,000 in November 2021.
The Louvre Museum in Paris saw record-breaking attendance last summer.



### **Sentence Tokenization**

In [4]:
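# Default sentence tokenizer (loads NLTK's pre-trained English Punkt model)
default_sentences = sent_tokenize(text)
print("\nDefault Sentence Tokenizer:")
print(default_sentences)

# PunktSentenceTokenizer created without training text, so it relies on
# default parameters and has no learned abbreviation list
punkt_tokenizer = PunktSentenceTokenizer()
punkt_sentences = punkt_tokenizer.tokenize(text)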
print("\nPunkt Sentence Tokenizer:")
print(punkt_sentences)


Default Sentence Tokenizer:
['Microsoft announced a new product launch event in New York City on December 1st, 2024.', 'Elon Musk, the CEO of Tesla, hinted at a partnership with NASA for a Mars mission.', 'The GDP of India grew by 7% in the first quarter of 2023, according to a report by Reuters.', 'Amazon is planning to open a new data center in Dublin, Ireland next year.', 'Barack Obama gave a keynote speech at Stanford University last Thursday.', 'Bitcoin reached an all-time high of $68,000 in November 2021.', 'The Louvre Museum in Paris saw record-breaking attendance last summer.']

Punkt Sentence Tokenizer:
['Microsoft announced a new product launch event in New York City on December 1st, 2024.', 'Elon Musk, the CEO of Tesla, hinted at a partnership with NASA for a Mars mission.', 'The GDP of India grew by 7% in the first quarter of 2023, according to a report by Reuters.', 'Amazon is planning to open a new data center in Dublin, Ireland next year.', 'Barack Obama gave a keynote 

### **Word Tokenization**

In [5]:
# Default word tokenizer
default_words = word_tokenize(text)
print("\nDefault Word Tokenizer:")
print(default_words)

# TreebankWordTokenizer
treebank_tokenizer = TreebankWordTokenizer()
treebank_words = treebank_tokenizer.tokenize(text)
print("\nTreebank Word Tokenizer:")
print(treebank_words)

# WordPunctTokenizer
word_punct_tokenizer = WordPunctTokenizer()
word_punct_words = word_punct_tokenizer.tokenize(text)
print("\nWordPunct Tokenizer:")
print(word_punct_words)

# WhitespaceTokenizer
whitespace_tokenizer = WhitespaceTokenizer()
whitespace_words = whitespace_tokenizer.tokenize(text)
print("\nWhitespace Tokenizer:")
print(whitespace_words)

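# RegexpTokenizer: words with an optional apostrophe suffix (e.g., "don't").
# The group is non-capturing, (?:...), because the tokenizer relies on re.findall,
# which returns only the group's contents when the pattern has a capturing group.
regexp_tokenizer = RegexpTokenizer(r"\w+(?:'\w+)?")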
regexp_words = regexp_tokenizer.tokenize(text)
print("\nRegexp Tokenizer:")
print(regexp_words)


Default Word Tokenizer:
['Microsoft', 'announced', 'a', 'new', 'product', 'launch', 'event', 'in', 'New', 'York', 'City', 'on', 'December', '1st', ',', '2024', '.', 'Elon', 'Musk', ',', 'the', 'CEO', 'of', 'Tesla', ',', 'hinted', 'at', 'a', 'partnership', 'with', 'NASA', 'for', 'a', 'Mars', 'mission', '.', 'The', 'GDP', 'of', 'India', 'grew', 'by', '7', '%', 'in', 'the', 'first', 'quarter', 'of', '2023', ',', 'according', 'to', 'a', 'report', 'by', 'Reuters', '.', 'Amazon', 'is', 'planning', 'to', 'open', 'a', 'new', 'data', 'center', 'in', 'Dublin', ',', 'Ireland', 'next', 'year', '.', 'Barack', 'Obama', 'gave', 'a', 'keynote', 'speech', 'at', 'Stanford', 'University', 'last', 'Thursday', '.', 'Bitcoin', 'reached', 'an', 'all-time', 'high', 'of', '$', '68,000', 'in', 'November', '2021', '.', 'The', 'Louvre', 'Museum', 'in', 'Paris', 'saw', 'record-breaking', 'attendance', 'last', 'summer', '.']

Treebank Word Tokenizer:
['Microsoft', 'announced', 'a', 'new', 'product', 'launch', 'eve
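
The simpler word tokenizers can be approximated with plain regular expressions, which helps explain how their outputs differ on tokens such as `$68,000` and `all-time`. The sketch below uses only the standard `re` module. The patterns are approximations of `WhitespaceTokenizer` and `WordPunctTokenizer` rather than NLTK's own code, and the sample string is made up for illustration. The last two calls also show why a pattern used with `re.findall` should contain a non-capturing group.

```python
import re

sample = "Bitcoin hit $68,000, an all-time high. Don't panic."

# Whitespace-style: split on runs of whitespace; punctuation stays attached.
whitespace_tokens = sample.split()

# WordPunct-style: runs of word characters OR runs of other non-space characters.
wordpunct_tokens = re.findall(r"\w+|[^\w\s]+", sample)

# Words with an optional apostrophe suffix, using a non-capturing group.
apostrophe_tokens = re.findall(r"\w+(?:'\w+)?", sample)

# Same pattern with a capturing group: findall returns only the group text,
# which is an empty string whenever the optional group did not match.
captured_only = re.findall(r"\w+('\w+)?", sample)

print(whitespace_tokens)
print(wordpunct_tokens)
print(apostrophe_tokens)
print(captured_only)
```

On this sample the whitespace split keeps `$68,000,` as a single token, while the WordPunct-style pattern breaks it into five pieces.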