In [1]:
import re
import string
import pandas as pd

from nltk.tokenize import sent_tokenize, word_tokenize, regexp_tokenize
from nltk.corpus import stopwords, wordnet, brown
from nltk.stem import WordNetLemmatizer
from nltk.tag import pos_tag
from nltk.probability import FreqDist



In [2]:
!cat data/text3.txt

It was a few weeks before my own marriage, during the days when I was
still sharing rooms with Holmes in Baker Street, that he came home from
an afternoon stroll to find a letter on the table waiting for him.
I had remained in-doors all day, for the weather had taken a sudden
turn to rain, with high autumnal winds, and the jezail bullet which I
had brought back in one of my limbs as a relic of my Afghan campaign,
throbbed with dull persistency. With my body in one easy-chair and my
legs upon another, I had surrounded myself with a cloud of newspapers,
until at last, saturated with the news of the day, I tossed them all
aside and lay listless, watching the huge crest and monogram upon the
envelope upon the table, and wondering lazily who my friend’s noble
correspondent could be.

“Here is a very fashionable epistle,” I remarked, as he entered. “Your
morning letters, if I remember right, were from a fish-monger and a
tide-waiter.”

## Reading content from a text file

In [3]:
# The text contains curly quotes, so state the encoding explicitly
with open('data/text3.txt', 'r', encoding='utf-8') as f:
    text = f.read()

## Counting punctuation marks in the text

In [4]:
tokens = word_tokenize(text, language='english')
punc = [t for t in tokens if t in string.punctuation]

print(f'The number of punctuation marks: {len(punc)}')

The number of punctuation marks: 21
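`string.punctuation` covers only ASCII characters, so the curly quotes in this text (“ ” ’) are not included in the count above. Below is a minimal sketch of one alternative, assuming that "punctuation" may be defined as any character whose Unicode general category starts with `P`. The names `is_punctuation` and `sample_tokens` are illustrative and not part of the notebook, and the token list is a short hand-written sample rather than the output of `word_tokenize`.

```python
import unicodedata


def is_punctuation(token: str) -> bool:
    """Return True if the token is non-empty and consists only of Unicode punctuation."""
    return bool(token) and all(
        unicodedata.category(ch).startswith('P') for ch in token
    )


# Hand-written sample tokens (an assumption, for illustration only)
sample_tokens = ['“', 'Here', 'is', 'a', 'letter', ',', '”', 'I', 'remarked', '.', 'friend', '’', 's']

unicode_punc = [t for t in sample_tokens if is_punctuation(t)]
print(f'The number of punctuation marks: {len(unicode_punc)}')
```

Note that this definition is not a superset of `string.punctuation`: ASCII characters such as `$` and `+` are classified by Unicode as symbols, not punctuation, so they would no longer be counted.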


## Removing stop words

In [5]:
stoplist = stopwords.words('english')
clean_words = [t for t in tokens if t.lower() not in stoplist]

# Double quotes outside: reusing the same quote type inside an f-string requires Python 3.12+
print(f"Cleaned text:\n\n{' '.join(clean_words)}")

Cleaned text:

weeks marriage , days still sharing rooms Holmes Baker Street , came home afternoon stroll find letter table waiting . remained in-doors day , weather taken sudden turn rain , high autumnal winds , jezail bullet brought back one limbs relic Afghan campaign , throbbed dull persistency . body one easy-chair legs upon another , surrounded cloud newspapers , last , saturated news day , tossed aside lay listless , watching huge crest monogram upon envelope upon table , wondering lazily friend ’ noble correspondent could . “ fashionable epistle , ” remarked , entered . “ morning letters , remember right , fish-monger tide-waiter . ”
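The cleaned text above still contains punctuation tokens, because the stop word list holds only words. The following sketch drops both in one pass by keeping only tokens that contain at least one alphanumeric character. `STOP_WORDS` is a tiny hand-written stand-in for `stopwords.words('english')` and `sample_tokens` is a hand-written sample; both are assumptions made so the example runs without NLTK data.

```python
# Hypothetical stand-in for set(stopwords.words('english'))
STOP_WORDS = {'it', 'was', 'a', 'few', 'before', 'my', 'own'}

# Hand-written sample tokens (an assumption, for illustration only)
sample_tokens = ['It', 'was', 'a', 'few', 'weeks', 'before', 'my', 'own',
                 'marriage', ',', '“', 'Here', '”', '.']

content_words = [
    t for t in sample_tokens
    if t.lower() not in STOP_WORDS        # drop stop words
    and any(ch.isalnum() for ch in t)     # drop tokens made only of punctuation
]
print(content_words)
```

Storing the stop words in a set also makes each membership test constant-time, which matters for longer texts.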


## PoS identification

In [6]:
pos_dict = {
    'J': wordnet.ADJ,
    'V': wordnet.VERB,
    'N': wordnet.NOUN,
    'R': wordnet.ADV,
}

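# Convert a Penn Treebank PoS tag into WordNet format using its first letter.
# Universal tags are only partly covered: 'ADJ' and 'ADV' start with 'A' and fall back to NOUN.
def to_wordnet_format(pos):
    return pos_dict.get(pos[:1].upper(), wordnet.NOUN)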
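A first-letter lookup works for Penn Treebank tags (`JJ`, `VBD`, `NN`, `RB`), but the Universal tags `ADJ` and `ADV` both begin with `A`, which is not a key in `pos_dict`, so they fall back to the noun default. Here is a sketch of a variant that checks Universal tags by their full name first. The function name `to_wordnet_pos` and both dictionaries are illustrative; the single-letter values are written as literals and are assumed to match `wordnet.ADJ`, `wordnet.VERB`, `wordnet.NOUN` and `wordnet.ADV`.

```python
# WordNet PoS constants as plain strings ('a', 'v', 'n', 'r')
PENN_PREFIX_TO_WORDNET = {'J': 'a', 'V': 'v', 'N': 'n', 'R': 'r'}
UNIVERSAL_TO_WORDNET = {'ADJ': 'a', 'VERB': 'v', 'NOUN': 'n', 'ADV': 'r'}


def to_wordnet_pos(tag: str, default: str = 'n') -> str:
    """Map a Penn Treebank or Universal tag to a WordNet PoS letter."""
    tag = tag.upper()
    # Universal tags are matched by their full name
    if tag in UNIVERSAL_TO_WORDNET:
        return UNIVERSAL_TO_WORDNET[tag]
    # Penn Treebank tags are matched by their first letter
    return PENN_PREFIX_TO_WORDNET.get(tag[:1], default)


for tag in ['JJ', 'VBD', 'NN', 'RB', 'ADJ', 'ADV', 'DT']:
    print(tag, '->', to_wordnet_pos(tag))
```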

## Lemmatization

In [7]:
sents = sent_tokenize(text)
words = regexp_tokenize(sents[2], pattern=r'\w+')

lemmatizer = WordNetLemmatizer()
lemmatized_words = [lemmatizer.lemmatize(w, pos=to_wordnet_format(p)) for w, p in pos_tag(words)]

print(lemmatized_words)

['With', 'my', 'body', 'in', 'one', 'easy', 'chair', 'and', 'my', 'legs', 'upon', 'another', 'I', 'have', 'surround', 'myself', 'with', 'a', 'cloud', 'of', 'newspaper', 'until', 'at', 'last', 'saturated', 'with', 'the', 'news', 'of', 'the', 'day', 'I', 'toss', 'them', 'all', 'aside', 'and', 'lay', 'listless', 'watch', 'the', 'huge', 'crest', 'and', 'monogram', 'upon', 'the', 'envelope', 'upon', 'the', 'table', 'and', 'wonder', 'lazily', 'who', 'my', 'friend', 's', 'noble', 'correspondent', 'could', 'be']
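The pattern `\w+` splits on every non-word character, which is why `easy-chair` becomes `'easy', 'chair'` and `friend’s` leaves a stray `'s'` in the output above. The sketch below shows a pattern that keeps hyphenated and apostrophized words together. It uses `re.findall` from the standard library so that it runs without NLTK; passing the same pattern to `regexp_tokenize` is expected to behave the same way, but that is an assumption. `WORD_PATTERN` is an illustrative name and the sentence is abridged from the text.

```python
import re

# A word, optionally followed by more word parts joined with '-', "'" or '’'
WORD_PATTERN = r"\w+(?:[-'’]\w+)*"

sentence = ("With my body in one easy-chair, I wondered who "
            "my friend’s noble correspondent could be.")

plain = re.findall(r"\w+", sentence)        # splits 'easy-chair' and 'friend’s'
joined = re.findall(WORD_PATTERN, sentence)  # keeps them as single tokens

print(plain)
print(joined)
```

Whether a token such as `friend’s` then lemmatizes well is a separate question, since the tagger and lemmatizer see the possessive as part of the word.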


## Extracting sentences from a corpus

In [8]:
hobbies_fileids = brown.fileids(categories='hobbies')

hobbies_sents = brown.sents(fileids=[hobbies_fileids[2]])[1:]
hobbies_sents = [' '.join(s) for s in hobbies_sents]

hobbies_sents

['The tremendous energy released by giant rocket engines perhaps can be felt much better than it can be heard .',
 'The pulsating vibration of energy clutches at the pit of your stomach .',
 'Never before has the introduction of a weapon caused so much apprehension and fear .',
 'Nuclear weapons are fearsome , but the long-range ballistic missile gives them a stealth and merciless swiftness which is much more terrifying .',
 'A great many writers are bewitched by the apparently overwhelming advantage an attacker would have if he were to strike with complete surprise using nuclear rockets .',
 "It is relatively easy to go a step further and reason that an attacker , in possession of such absolute power , would simultaneously destroy his opponent's cities and people .",
 "With a nation defenseless before it , why would the attacker spare the victim's people ? ?",
 "Wouldn't the wanton destruction of cities and people be the logical act of complete subjugation ? ?",
 'The nation would be 
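Joining tokens with `' '.join(s)` leaves a space before each punctuation mark, as in `stomach .` above. As a rough heuristic, the space before common closing punctuation can be removed with a regular expression. The helper name `join_tokens` is illustrative, and the rule does not handle quotes or brackets; NLTK's `TreebankWordDetokenizer` is a more thorough option.

```python
import re


def join_tokens(tokens):
    """Join tokens with spaces, then remove the space before closing punctuation."""
    text = ' '.join(tokens)
    return re.sub(r"\s+([.,;:?!])", r"\1", text)


# Tokens of one sentence from the output above
sample = ['The', 'pulsating', 'vibration', 'of', 'energy', 'clutches',
          'at', 'the', 'pit', 'of', 'your', 'stomach', '.']

print(join_tokens(sample))
```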

## Searching for the most common verbs in the text

In [9]:
tagged_words = brown.tagged_words(fileids=[hobbies_fileids[2]])
verbs_list = [w for w, t in tagged_words if t.startswith('V')]

verbs_freq = FreqDist(verbs_list)
df = pd.DataFrame(verbs_freq.most_common(10), columns=['word', 'count'])
df

        word  count
0    destroy     14
1       used      6
2    develop      6
3  destroyed      5
4     manned      5
5      given      4
6       take      4
7     moving      4
8      known      3
9      fixed      3
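Two details shape the table above. In the Brown tagset only lexical verbs carry tags beginning with `V` (`VB`, `VBD`, `VBG`, `VBN`, `VBZ`); forms of *be*, *have* and *do* and the modals are tagged `BE*`, `HV*`, `DO*` and `MD`, so they are excluded. Counting is also case-sensitive. The sketch below lowercases words before counting and shows how a tuple of prefixes would include the auxiliaries. It uses `collections.Counter` (the base class of `FreqDist`) and a small hand-written list of tagged words as a stand-in for `brown.tagged_words(...)`; the sample data and variable names are assumptions for illustration.

```python
from collections import Counter

# Hand-written (word, Brown tag) pairs, a stand-in for brown.tagged_words(...)
sample_tagged = [
    ('The', 'AT'), ('missile', 'NN'), ('can', 'MD'), ('destroy', 'VB'),
    ('cities', 'NNS'), ('.', '.'), ('It', 'PPS'), ('is', 'BEZ'),
    ('known', 'VBN'), ('.', '.'), ('It', 'PPS'), ('has', 'HVZ'),
    ('destroyed', 'VBN'), ('them', 'PPO'), ('.', '.'),
    ('Destroy', 'VB'), ('it', 'PPO'), ('.', '.'),
]

# Lexical verbs only, with case folded so 'Destroy' and 'destroy' are merged
lexical_verbs = Counter(w.lower() for w, t in sample_tagged if t.startswith('VB'))

# Lexical verbs plus be/have/do forms and modals
VERBAL_PREFIXES = ('VB', 'BE', 'HV', 'DO', 'MD')
all_verbal = Counter(w.lower() for w, t in sample_tagged if t.startswith(VERBAL_PREFIXES))

print(lexical_verbs.most_common(3))
print(all_verbal.most_common(3))
```

Inflected forms such as `destroy` and `destroyed` are still counted separately here; merging them would require lemmatizing each verb first.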
