In [179]:
%pip install nltk

Note: you may need to restart the kernel to use updated packages.





In [180]:
corpus="""He's brought many captives home to Rome! Their ransoms filled the general coffers. 
Did this seem ambitious in Caesar? When the poor cried, Caesar wept. Ambition should be made of sterner stuff! Yet Brutus says he was ambitious, and Brutus is an honorable man.
"""

In [181]:
print(corpus)

He's brought many captives home to Rome! Their ransoms filled the general coffers. 
Did this seem ambitious in Caesar? When the poor cried, Caesar wept. Ambition should be made of sterner stuff! Yet Brutus says he was ambitious, and Brutus is an honorable man.



In [182]:
from nltk.tokenize import sent_tokenize
documents=sent_tokenize(corpus, language='english')

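### Handling NLTK LookupError

If `sent_tokenize` raises a `LookupError`, the tokenizer data it depends on (here, `punkt`) is not installed. The error message names the missing resource. Follow these steps: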

1. **Find the location of the NLTK library and download the missing module:**

    ```python
    import sys
    import nltk
    import os

    # Find the location of the NLTK library
    nltk_location = os.path.dirname(sys.modules['nltk'].__file__)

    # Add the NLTK library location to the data path
    nltk.data.path.append(nltk_location)

    # Download the missing module (e.g., 'punkt') in the exact location of the NLTK library
    nltk.download('punkt', download_dir=nltk_location)
    ```

2. **Use NLTK functions after downloading the necessary data:**

    ```python
    from nltk.tokenize import sent_tokenize

    # Tokenize a corpus into sentences
    documents = sent_tokenize(corpus, language='english')
    ```
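As an alternative to the steps above, here is a minimal sketch that checks whether a resource is already installed and downloads it only when it is missing. The helper name `ensure_nltk_resource` is hypothetical; `nltk.data.find` and `nltk.download` are existing NLTK functions. Unlike step 1, this sketch downloads to NLTK's default data directory rather than into the library folder.

```python
def ensure_nltk_resource(resource_path: str, package: str) -> bool:
    """Return True if the resource is available, downloading it if needed.

    Hypothetical helper (not part of NLTK).
    resource_path: lookup path, e.g. 'tokenizers/punkt'
    package: download identifier, e.g. 'punkt'
    """
    # Imported lazily so the helper can be defined before nltk is installed.
    import nltk

    try:
        # nltk.data.find raises LookupError when the resource is missing.
        nltk.data.find(resource_path)
        return True
    except LookupError:
        # nltk.download returns True on success and False on failure.
        return bool(nltk.download(package, quiet=True))
```

A call such as `ensure_nltk_resource('tokenizers/punkt', 'punkt')` before `sent_tokenize` avoids repeated downloads. If the `LookupError` message names a different resource, pass that name instead.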


In [183]:
type(documents)

list

In [184]:
# The paragraph has been tokenized into sentences. Each sentence is an element of the list.
documents

["He's brought many captives home to Rome!",
 'Their ransoms filled the general coffers.',
 'Did this seem ambitious in Caesar?',
 'When the poor cried, Caesar wept.',
 'Ambition should be made of sterner stuff!',
 'Yet Brutus says he was ambitious, and Brutus is an honorable man.']

In [185]:
for sentence in documents:
    print(sentence)

He's brought many captives home to Rome!
Their ransoms filled the general coffers.
Did this seem ambitious in Caesar?
When the poor cried, Caesar wept.
Ambition should be made of sterner stuff!
Yet Brutus says he was ambitious, and Brutus is an honorable man.


In [186]:
# Word tokenization
# paragraph -> words
# sentence  -> words
from nltk.tokenize import word_tokenize

In [187]:
# Each word becomes an element of the list.
# Punctuation marks such as periods, commas, and exclamation marks become separate elements.
# Contractions are split, and the apostrophe stays with the letters that follow it: "He's" -> 'He', "'s".
word_tokenize(corpus)

['He',
 "'s",
 'brought',
 'many',
 'captives',
 'home',
 'to',
 'Rome',
 '!',
 'Their',
 'ransoms',
 'filled',
 'the',
 'general',
 'coffers',
 '.',
 'Did',
 'this',
 'seem',
 'ambitious',
 'in',
 'Caesar',
 '?',
 'When',
 'the',
 'poor',
 'cried',
 ',',
 'Caesar',
 'wept',
 '.',
 'Ambition',
 'should',
 'be',
 'made',
 'of',
 'sterner',
 'stuff',
 '!',
 'Yet',
 'Brutus',
 'says',
 'he',
 'was',
 'ambitious',
 ',',
 'and',
 'Brutus',
 'is',
 'an',
 'honorable',
 'man',
 '.']

In [188]:
for sentence in documents:
    print(word_tokenize(sentence))

['He', "'s", 'brought', 'many', 'captives', 'home', 'to', 'Rome', '!']
['Their', 'ransoms', 'filled', 'the', 'general', 'coffers', '.']
['Did', 'this', 'seem', 'ambitious', 'in', 'Caesar', '?']
['When', 'the', 'poor', 'cried', ',', 'Caesar', 'wept', '.']
['Ambition', 'should', 'be', 'made', 'of', 'sterner', 'stuff', '!']
['Yet', 'Brutus', 'says', 'he', 'was', 'ambitious', ',', 'and', 'Brutus', 'is', 'an', 'honorable', 'man', '.']


In [189]:
from nltk.tokenize import wordpunct_tokenize

In [190]:
# Here the apostrophe is a separate element of the list ("He's" -> 'He', "'", 's'); all other punctuation is also split off.
wordpunct_tokenize(corpus)

['He',
 "'",
 's',
 'brought',
 'many',
 'captives',
 'home',
 'to',
 'Rome',
 '!',
 'Their',
 'ransoms',
 'filled',
 'the',
 'general',
 'coffers',
 '.',
 'Did',
 'this',
 'seem',
 'ambitious',
 'in',
 'Caesar',
 '?',
 'When',
 'the',
 'poor',
 'cried',
 ',',
 'Caesar',
 'wept',
 '.',
 'Ambition',
 'should',
 'be',
 'made',
 'of',
 'sterner',
 'stuff',
 '!',
 'Yet',
 'Brutus',
 'says',
 'he',
 'was',
 'ambitious',
 ',',
 'and',
 'Brutus',
 'is',
 'an',
 'honorable',
 'man',
 '.']
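The split seen above follows from how `wordpunct_tokenize` works: it matches runs of word characters or runs of non-space punctuation with the pattern `\w+|[^\w\s]+`. The sketch below reproduces that behavior with the standard library only; the function name `regex_wordpunct` is an illustrative assumption, not an NLTK API.

```python
import re

# Same pattern that NLTK's WordPunctTokenizer is built on:
# a run of word characters, or a run of characters that are
# neither word characters nor whitespace.
WORDPUNCT_PATTERN = re.compile(r"\w+|[^\w\s]+")


def regex_wordpunct(text: str) -> list[str]:
    """Hypothetical stand-in for wordpunct_tokenize, for illustration only."""
    return WORDPUNCT_PATTERN.findall(text)


tokens = regex_wordpunct("He's brought many captives home to Rome!")
print(tokens)
# ['He', "'", 's', 'brought', 'many', 'captives', 'home', 'to', 'Rome', '!']
```

Because the apostrophe is not a word character, it always becomes its own token, which is why `"He's"` yields three elements. Adjacent punctuation marks stay together as a single token.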

In [191]:
from nltk.tokenize import TreebankWordTokenizer

In [192]:
tokenizer=TreebankWordTokenizer()

In [193]:
# TreebankWordTokenizer expects one sentence at a time, so only the final full stop becomes a separate element.
# Full stops inside the text stay attached to the preceding word ('coffers.', 'wept.').
tokenizer.tokenize(corpus)

['He',
 "'s",
 'brought',
 'many',
 'captives',
 'home',
 'to',
 'Rome',
 '!',
 'Their',
 'ransoms',
 'filled',
 'the',
 'general',
 'coffers.',
 'Did',
 'this',
 'seem',
 'ambitious',
 'in',
 'Caesar',
 '?',
 'When',
 'the',
 'poor',
 'cried',
 ',',
 'Caesar',
 'wept.',
 'Ambition',
 'should',
 'be',
 'made',
 'of',
 'sterner',
 'stuff',
 '!',
 'Yet',
 'Brutus',
 'says',
 'he',
 'was',
 'ambitious',
 ',',
 'and',
 'Brutus',
 'is',
 'an',
 'honorable',
 'man',
 '.']
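Since `TreebankWordTokenizer` only splits a full stop at the very end of its input, one way to get every full stop as its own token is to tokenize sentence by sentence. The sketch below assumes the sentences have already been separated; to keep it self-contained, the list is copied by hand from the `documents` output earlier in this notebook rather than recomputed with `sent_tokenize`.

```python
from nltk.tokenize import TreebankWordTokenizer

# Sentences copied from the `documents` list shown earlier (first four only).
sentences = [
    "He's brought many captives home to Rome!",
    "Their ransoms filled the general coffers.",
    "Did this seem ambitious in Caesar?",
    "When the poor cried, Caesar wept.",
]

tokenizer = TreebankWordTokenizer()

# Tokenize each sentence on its own so every sentence-final
# full stop sits at the end of the input and is split off.
tokens_per_sentence = [tokenizer.tokenize(s) for s in sentences]

for tokens in tokens_per_sentence:
    print(tokens)
```

With this approach `'coffers.'` and `'wept.'` become `'coffers', '.'` and `'wept', '.'`, matching what `word_tokenize` produced for the same sentences.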