# Task 1: Text preprocessing

Steps involved in preprocessing:

1. Noise removal/text cleaning:
   a. Converting to lowercase
   b. Removing punctuation

2. Tokenization: break the large string into smaller manageable strings (i.e., words)

3. Stopword removal: remove very common words that carry little meaning on their own (such as prepositions and articles) from the tokenized data

4. Text normalization:
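   a. Stemming: stripping suffixes with heuristic rules; the result need not be a real word (e.g., "wolves" becomes "wolv")
   b. Lemmatization: reducing words to their dictionary form, or lemma (e.g., "wolves" becomes "wolf")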
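
The four steps above can be sketched with the standard library alone. This is only an illustration: the stopword set `TINY_STOPWORDS` and the suffix rules in `naive_stem` are made-up stand-ins, far cruder than the NLTK stopword list and stemmers used later in this notebook.

```python
import string

# Hypothetical, deliberately tiny stopword list (NLTK's English list is much longer)
TINY_STOPWORDS = {"the", "a", "an", "at", "with", "his", "was", "as"}


def clean(text):
    """Step 1: lowercase the text and delete ASCII punctuation."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table)


def tokenize(text):
    """Step 2: split on whitespace (a crude stand-in for a real tokenizer)."""
    return text.split()


def remove_stopwords(tokens):
    """Step 3: drop tokens that appear in the stopword set."""
    return [t for t in tokens if t not in TINY_STOPWORDS]


def naive_stem(token):
    """Step 4: strip one common suffix; a toy rule, not the Porter algorithm."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters of stem remain
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Run steps 1 to 4 in order and return the normalized tokens."""
    tokens = remove_stopwords(tokenize(clean(text)))
    return [naive_stem(t) for t in tokens]


print(preprocess("Dusk was falling as the boy arrived with his herd."))
# ['dusk', 'fall', 'boy', 'arriv', 'herd']
```

As with a real stemmer, the result of step 4 need not be a dictionary word ("arriv").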


# Task 2: NLP Libraries

- **nltk (Natural Language Toolkit)**: A comprehensive, general-purpose NLP library. It provides modules for classification, tokenization, stemming, tagging, parsing, and semantic reasoning over text, and is well suited to teaching, research, and prototyping.

- **spaCy**: An industrial-grade NLP library for advanced NLP tasks. It is efficient on large-scale information extraction tasks and includes tokenization, part-of-speech tagging, named entity recognition, and dependency parsing.

- **wordcloud**: Generates a word cloud, in which the size of each word is proportional to its frequency in the text.

- **pyspellchecker**: Corrects misspelled words by looking for known words within a small Levenshtein (edit) distance of the input.

- **gensim**: Library for topic modelling, document indexing and similarity retrieval. It implements algorithms like Word2Vec, FastText, and Latent Dirichlet Allocation (LDA).
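
To make the Levenshtein distance mentioned for pyspellchecker concrete, here is a minimal standard-library sketch of the metric. It is not pyspellchecker's own code; the `closest` helper and the `vocabulary` list are hypothetical and only show how an edit distance can rank correction candidates.

```python
def levenshtein(a, b):
    """Return the minimum number of single-character insertions,
    deletions, and substitutions needed to turn a into b."""
    # previous[j] is the distance between a[:i-1] and b[:j]
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


def closest(word, vocabulary):
    """Pick the vocabulary entry with the smallest edit distance to word."""
    return min(vocabulary, key=lambda candidate: levenshtein(word, candidate))


vocabulary = ["sheep", "church", "schedule", "pillow"]  # hypothetical word list
print(levenshtein("shedule", "schedule"))  # 1 (one inserted "c")
print(closest("shedule", vocabulary))      # schedule
```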

In [None]:
import pandas as pd
import numpy as np
import nltk
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, LancasterStemmer
from nltk.stem import WordNetLemmatizer

In [None]:
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [None]:
text = """ THE BOY’S NAME WAS SANTIAGO. DUSK WAS FALLING AS the boy arrived
with his herd at an abandoned church. The roof had fallen in long
ago, and an enormous sycamore had grown on the spot where the
sacristy had once stood.
He decided to spend the night there. He saw to it that all the
sheep entered through the ruined gate, and then laid some planks
across it to prevent the flock from wandering away during the night.
There were no wolves in the region, but once an animal had strayed
during the night, and the boy had had to spend the entire next day
searching for it.
He swept the floor with his jacket and lay down, using the book
he had just finished reading as a pillow. He told himself that he
would have to start reading thicker books: they lasted longer, and
made more comfortable pillows.
It was still dark when he awoke, and, looking up, he could see
the stars through the half-destroyed roof.
I wanted to sleep a little longer, he thought. He had had the same
dream that night as a week ago, and once again he had awakened
before it ended.
He arose and, taking up his crook, began to awaken the sheep
that still slept. He had noticed that, as soon as he awoke, most of his
animals also began to stir. It was as if some mysterious energy
bound his life to that of the sheep, with whom he had spent the past
two years, leading them through the countryside in search of food
and water. “They are so used to me that they know my schedule,” he
muttered. Thinking about that for a moment, he realized that it
could be the other way around: that it was he who had become
accustomed to their schedule."""

Noise Removal, Stopword Removal, and Tokenization

In [None]:
# Convert to lowercase and strip leading/trailing whitespace
text = text.lower().strip()

# Remove ASCII punctuation (string.punctuation does not include curly quotes such as ’ “ ”)
punctuation = string.punctuation
mapping = str.maketrans("", "", punctuation)

text = text.translate(mapping)

# Keep the stopwords in a set for fast membership tests
stopwords_eng = set(stopwords.words('english'))

def remove_stopwords(in_str):
    """Return in_str with English stopwords removed (words are split on whitespace)."""
    kept = [word for word in in_str.split() if word not in stopwords_eng]
    return " ".join(kept)

text_res = remove_stopwords(text)
print(text_res)

# Tokenization
tokens = word_tokenize(text_res)



boy’s name santiago dusk falling boy arrived herd abandoned church roof fallen long ago enormous sycamore grown spot sacristy stood decided spend night saw sheep entered ruined gate laid planks across prevent flock wandering away night wolves region animal strayed night boy spend entire next day searching swept floor jacket lay using book finished reading pillow told would start reading thicker books lasted longer made comfortable pillows still dark awoke looking could see stars halfdestroyed roof wanted sleep little longer thought dream night week ago awakened ended arose taking crook began awaken sheep still slept noticed soon awoke animals also began stir mysterious energy bound life sheep spent past two years leading countryside search food water “they used know schedule” muttered thinking moment realized could way around become accustomed schedule 
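
The output above still contains `’`, `“`, and `”` because `string.punctuation` lists only ASCII punctuation. One possible fix, sketched below, is to add the typographic quotes to the deletion table. The `EXTRA_PUNCTUATION` constant is an assumption made for illustration and would need extending for other non-ASCII marks.

```python
import string

# Typographic quotes that string.punctuation does not cover (illustrative, not exhaustive)
EXTRA_PUNCTUATION = "\u2018\u2019\u201c\u201d"  # ‘ ’ “ ”

# Translation table that deletes both ASCII punctuation and the extra quotes
table = str.maketrans("", "", string.punctuation + EXTRA_PUNCTUATION)

sample = "“they know my schedule,” he muttered. the boy’s name"
cleaned = sample.translate(table)
print(cleaned)
# they know my schedule he muttered the boys name
```

Deleting the apostrophe turns "boy’s" into "boys"; whether that is acceptable depends on the downstream task.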


Removing most common words

In [None]:
from collections import Counter

# Count the frequency of every token
counter = Counter(tokens)

most_freq = counter.most_common(25)
print(counter)

# Words to drop: the 25 most frequent tokens. Ties are broken by order of first
# occurrence, so if fewer than 25 tokens repeat, some tokens that occur only once are dropped too.
most_freq_words = {word for word, freq in most_freq}

def remove_frequent(tokens):
    """Return the tokens that are not in most_freq_words."""
    return [word for word in tokens if word not in most_freq_words]

tokens_removed_cmn = remove_frequent(tokens)
print("Tokens most common removed: ", tokens_removed_cmn)

# Sketch for also removing the rarest words (left disabled).
# most_common(n) only returns the most frequent entries, so slice the full ranking from the end:
# most_rare = counter.most_common()[:-26:-1]
# print("rare: ", most_rare)
# most_rare_words = [word for word, freq in most_rare]
# print(most_rare_words)
#
# tokens_processed = [word for word in tokens_removed_cmn if word not in most_rare_words]
# print(tokens_processed)

Counter({'night': 4, 'boy': 3, 'sheep': 3, 'roof': 2, 'ago': 2, 'spend': 2, 'reading': 2, 'longer': 2, 'still': 2, 'awoke': 2, 'could': 2, 'began': 2, 'schedule': 2, '’': 1, 's': 1, 'name': 1, 'santiago': 1, 'dusk': 1, 'falling': 1, 'arrived': 1, 'herd': 1, 'abandoned': 1, 'church': 1, 'fallen': 1, 'long': 1, 'enormous': 1, 'sycamore': 1, 'grown': 1, 'spot': 1, 'sacristy': 1, 'stood': 1, 'decided': 1, 'saw': 1, 'entered': 1, 'ruined': 1, 'gate': 1, 'laid': 1, 'planks': 1, 'across': 1, 'prevent': 1, 'flock': 1, 'wandering': 1, 'away': 1, 'wolves': 1, 'region': 1, 'animal': 1, 'strayed': 1, 'entire': 1, 'next': 1, 'day': 1, 'searching': 1, 'swept': 1, 'floor': 1, 'jacket': 1, 'lay': 1, 'using': 1, 'book': 1, 'finished': 1, 'pillow': 1, 'told': 1, 'would': 1, 'start': 1, 'thicker': 1, 'books': 1, 'lasted': 1, 'made': 1, 'comfortable': 1, 'pillows': 1, 'dark': 1, 'looking': 1, 'see': 1, 'stars': 1, 'halfdestroyed': 1, 'wanted': 1, 'sleep': 1, 'little': 1, 'thought': 1, 'dream': 1, 'week': 
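
In the counts above only 13 distinct tokens occur more than once, so `most_common(25)` also removes 12 tokens that occur a single time, chosen purely by order of first appearance. A frequency threshold avoids that arbitrary cut. The sketch below uses a short made-up token list and a hypothetical `min_count` parameter; it also shows the slice idiom for the least common entries.

```python
from collections import Counter

# Made-up token list standing in for the tokens of the passage
tokens = ["night", "boy", "night", "sheep", "boy", "night", "roof", "gate"]
counter = Counter(tokens)


def remove_by_threshold(tokens, counter, min_count=2):
    """Drop every token that occurs at least min_count times."""
    return [t for t in tokens if counter[t] < min_count]


# The n least common entries: slice the full ranking from the end
n = 2
least_common = counter.most_common()[:-n - 1:-1]

print(remove_by_threshold(tokens, counter))  # ['sheep', 'roof', 'gate']
print(least_common)                          # [('gate', 1), ('roof', 1)]
```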

Text Normalization: Stemming and Lemmatization

In [None]:
stemmer = PorterStemmer()
# stemmer = LancasterStemmer()  # alternative: Lancaster stems more aggressively than Porter
stemmed_text = [stemmer.stem(word) for word in tokens_removed_cmn]

print("Stemmed text: ")
for w in stemmed_text:
    print(w)

Stemmed text: 
enorm
sycamor
grown
spot
sacristi
stood
decid
saw
enter
ruin
gate
laid
plank
across
prevent
flock
wander
away
wolv
region
anim
stray
entir
next
day
search
swept
floor
jacket
lay
use
book
finish
pillow
told
would
start
thicker
book
last
made
comfort
pillow
dark
look
see
star
halfdestroy
want
sleep
littl
thought
dream
week
awaken
end
aros
take
crook
awaken
slept
notic
soon
anim
also
stir
mysteri
energi
bound
life
spent
past
two
year
lead
countrysid
search
food
water
“
they
use
know
”
mutter
think
moment
realiz
way
around
becom
accustom
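
To see how the choice of stemmer matters, the two NLTK stemmers imported earlier can be run side by side on a few words from the passage. This is a small sketch that assumes NLTK is installed. The Lancaster results are printed rather than stated here, since they vary more from word to word.

```python
from nltk.stem import LancasterStemmer, PorterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()

# Both stemmers are purely rule-based, so no nltk.download() call is needed here
words = ["enormous", "sacristy", "wolves", "comfortable"]
for word in words:
    print(f"{word:12} porter={porter.stem(word):10} lancaster={lancaster.stem(word)}")
```

The Porter column reproduces the stems shown in the output above ("enorm", "sacristi", "wolv", "comfort").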


In [None]:
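lemmatizer = WordNetLemmatizer()
# lemmatize() treats every word as a noun unless a pos argument is given, so plural
# nouns are reduced ("wolves" -> "wolf") but verb forms such as "decided" are left unchanged
lemmatized_text = [lemmatizer.lemmatize(word) for word in tokens_removed_cmn]

print("Lemmatized text: ")
for w in lemmatized_text:
    print(w)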

Lemmatized text: 
enormous
sycamore
grown
spot
sacristy
stood
decided
saw
entered
ruined
gate
laid
plank
across
prevent
flock
wandering
away
wolf
region
animal
strayed
entire
next
day
searching
swept
floor
jacket
lay
using
book
finished
pillow
told
would
start
thicker
book
lasted
made
comfortable
pillow
dark
looking
see
star
halfdestroyed
wanted
sleep
little
thought
dream
week
awakened
ended
arose
taking
crook
awaken
slept
noticed
soon
animal
also
stir
mysterious
energy
bound
life
spent
past
two
year
leading
countryside
search
food
water
“
they
used
know
”
muttered
thinking
moment
realized
way
around
become
accustomed
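
Most verb forms in the output above ("decided", "entered", "wandering") are unchanged because `lemmatize()` assumes a noun when no part of speech is given. A common remedy is to pass a POS code derived from a tagger. The helper below is a hypothetical sketch that maps Penn Treebank tags (the tag set returned by `nltk.pos_tag`) to the one-letter codes WordNet uses; it needs no NLTK data itself.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to the one-letter POS code WordNet uses."""
    if tag.startswith("J"):
        return "a"  # adjective (same value as nltk.corpus.wordnet.ADJ)
    if tag.startswith("V"):
        return "v"  # verb (wordnet.VERB)
    if tag.startswith("RB"):
        return "r"  # adverb (wordnet.ADV)
    return "n"      # anything else falls back to noun (wordnet.NOUN)


# A few (word, tag) pairs in the format produced by nltk.pos_tag
tagged = [("decided", "VBD"), ("planks", "NNS"), ("enormous", "JJ"), ("away", "RB")]
for word, tag in tagged:
    print(word, penn_to_wordnet(tag))

# With NLTK and its WordNet data available, the code would be passed along like this:
# lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))
```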


In [None]:
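# Part-of-speech tagging with Penn Treebank tags, e.g. JJ: adjective, NN: noun, IN: preposition,
# VBG: verb (gerund or present participle), RB: adverb
# Note: the tagger relies on sentence context, which stopword removal has destroyed,
# so some tags in the output are wrong (for example, 'stood' is tagged NN)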
nltk.pos_tag(tokens_removed_cmn)

[('enormous', 'JJ'),
 ('sycamore', 'NN'),
 ('grown', 'JJ'),
 ('spot', 'NN'),
 ('sacristy', 'JJ'),
 ('stood', 'NN'),
 ('decided', 'VBD'),
 ('saw', 'NN'),
 ('entered', 'VBD'),
 ('ruined', 'JJ'),
 ('gate', 'NN'),
 ('laid', 'VBD'),
 ('planks', 'NNS'),
 ('across', 'IN'),
 ('prevent', 'NN'),
 ('flock', 'NN'),
 ('wandering', 'VBG'),
 ('away', 'RB'),
 ('wolves', 'VBZ'),
 ('region', 'NN'),
 ('animal', 'JJ'),
 ('strayed', 'VBN'),
 ('entire', 'JJ'),
 ('next', 'JJ'),
 ('day', 'NN'),
 ('searching', 'VBG'),
 ('swept', 'JJ'),
 ('floor', 'NN'),
 ('jacket', 'NN'),
 ('lay', 'VBD'),
 ('using', 'VBG'),
 ('book', 'NN'),
 ('finished', 'VBD'),
 ('pillow', 'JJ'),
 ('told', 'RB'),
 ('would', 'MD'),
 ('start', 'VB'),
 ('thicker', 'NN'),
 ('books', 'NNS'),
 ('lasted', 'VBD'),
 ('made', 'VBN'),
 ('comfortable', 'JJ'),
 ('pillows', 'NNS'),
 ('dark', 'VBP'),
 ('looking', 'VBG'),
 ('see', 'NN'),
 ('stars', 'NNS'),
 ('halfdestroyed', 'VBD'),
 ('wanted', 'JJ'),
 ('sleep', 'JJ'),
 ('little', 'JJ'),
 ('thought', 'JJ'),
