In [61]:
import spacy
from pymongo import MongoClient

In [62]:
# Load English tokenizer, tagger, parser, and NER
nlp = spacy.load("en_core_web_sm")

In [63]:
# Connect to the local MongoDB instance (default host and port)
client = MongoClient()
db = client['article_recommendation']
article_collection = db['article']

# Find the first document in the collection
first_article = article_collection.find_one()
abstract = first_article['abstract']

# Alternative example text for trying the steps without a database
# text = "This is an example sentence. John goes to school."
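The token listing further down starts with AbstractIn, which suggests the stored abstract carries a section label fused to its first word. A minimal sketch of one way to strip it before processing, assuming the label is always the literal word "Abstract" followed directly by a capitalized word (the helper name strip_abstract_label is hypothetical, not part of the notebook):

```python
import re

def strip_abstract_label(text: str) -> str:
    """Remove a leading 'Abstract' label that is fused to the first word."""
    # Match 'Abstract' only at the very start, and only when an uppercase
    # letter follows immediately (e.g. 'AbstractIn'), so ordinary openings
    # such as 'Abstract algebra ...' are left untouched.
    return re.sub(r"^\s*Abstract(?=[A-Z])", "", text)

raw = "AbstractIn this paper, we consider the problem of selection."
print(strip_abstract_label(raw))
```

Applied before nlp(abstract), this would give the pipeline a first token of "In" rather than the fused form.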

In [64]:
# Process the text
doc = nlp(abstract)

Tokenization:
Tokenization splits text into individual tokens: words, punctuation marks, and similar units. Iterating over the processed doc yields one token at a time.

In [65]:
# Iterate over tokens
for token in doc:
    print(token.text)

AbstractIn
this
paper
,
we
consider
the
problem
of
selection
on
coarse
-
grained
distributed
memory
parallel
computers
.
We
discuss
several
deterministic
and
randomized
algorithms
for
parallel
selection
.
We
also
consider
several
algorithms
for
load
balancing
needed
to
keep
a
balanced
distribution
of
data
across
processors
during
the
execution
of
the
selection
algorithms
.
We
have
carried
out
detailed
implementations
of
all
the
algorithms
discussed
on
the
CM-5
and
report
on
the
experimental
results
.
The
results
clearly
demonstrate
the
role
of
randomization
in
reducing
communication
overhead
.
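The output above shows punctuation split into separate tokens and coarse-grained broken at the hyphen, while CM-5 stays whole. The following is a rough standard-library sketch of that behaviour. It is not spaCy's tokenizer, which applies language-specific rules and exceptions; the pattern and the function name simple_tokenize are assumptions chosen only to mimic the splits seen here.

```python
import re

# Simplified pattern: a run of word characters (optionally joined to digits
# by a hyphen, as in "CM-5"), or any single non-space punctuation character.
TOKEN_PATTERN = re.compile(r"\w+(?:-\d+)?|[^\w\s]")

def simple_tokenize(text: str) -> list:
    """Split text into word and punctuation tokens with one regex."""
    return TOKEN_PATTERN.findall(text)

sample = "We consider selection on coarse-grained computers such as the CM-5."
print(simple_tokenize(sample))
```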


Part-of-speech (POS) Tagging:
POS tagging assigns each token a grammatical category such as noun, verb, or adjective. The token.pos_ attribute holds the coarse-grained Universal POS tag.

In [66]:
# Iterate over tokens with POS tags
for token in doc:
    print(token.text, token.pos_)


AbstractIn PROPN
this DET
paper NOUN
, PUNCT
we PRON
consider VERB
the DET
problem NOUN
of ADP
selection NOUN
on ADP
coarse ADV
- PUNCT
grained VERB
distributed VERB
memory NOUN
parallel ADJ
computers NOUN
. PUNCT
We PRON
discuss VERB
several ADJ
deterministic ADJ
and CCONJ
randomized ADJ
algorithms NOUN
for ADP
parallel ADJ
selection NOUN
. PUNCT
We PRON
also ADV
consider VERB
several ADJ
algorithms NOUN
for ADP
load NOUN
balancing NOUN
needed VERB
to PART
keep VERB
a DET
balanced ADJ
distribution NOUN
of ADP
data NOUN
across ADP
processors NOUN
during ADP
the DET
execution NOUN
of ADP
the DET
selection NOUN
algorithms VERB
. PUNCT
We PRON
have AUX
carried VERB
out ADP
detailed ADJ
implementations NOUN
of ADP
all DET
the DET
algorithms NOUN
discussed VERB
on ADP
the DET
CM-5 PROPN
and CCONJ
report VERB
on ADP
the DET
experimental ADJ
results NOUN
. PUNCT
The DET
results NOUN
clearly ADV
demonstrate VERB
the DET
role NOUN
of ADP
randomization NOUN
in ADP
reducing VERB
communication NOUN
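A long per-token listing is hard to scan, so a tally of tags can give a quicker overview. This sketch uses a handful of (token, tag) pairs copied from the output above; in the notebook the same list could presumably be built with [(t.text, t.pos_) for t in doc].

```python
from collections import Counter

# (token, tag) pairs copied from the first sentence of the output above.
tagged = [
    ("this", "DET"), ("paper", "NOUN"), (",", "PUNCT"), ("we", "PRON"),
    ("consider", "VERB"), ("the", "DET"), ("problem", "NOUN"),
    ("of", "ADP"), ("selection", "NOUN"), ("on", "ADP"),
]

# Count how often each tag occurs, ignoring the token text.
tag_counts = Counter(tag for _, tag in tagged)
for tag, count in tag_counts.most_common():
    print(tag, count)
```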

Named Entity Recognition (NER):
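NER identifies spans that name real-world entities such as persons, organizations, and locations. Both labels in the output below are questionable: AbstractIn is an artifact of the source text, and CM-5 is a computer model, whereas NORP denotes nationalities or religious or political groups.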

In [67]:
# Extract named entities
for ent in doc.ents:
    print(ent.text, ent.label_)

AbstractIn PERSON
CM-5 NORP


1-Removing Stopwords:
Stopwords are common words (e.g., "the", "is", "and") that are often removed during preprocessing.

In [68]:
from spacy.lang.en.stop_words import STOP_WORDS

In [69]:
# Remove stopwords
filtered_tokens = [token.text for token in doc if token.text.lower() not in STOP_WORDS]

# Join filtered tokens back into a sentence
filtered_text = ' '.join(filtered_tokens)

# Re-run the pipeline so that doc reflects the filtered text in later steps
doc = nlp(filtered_text)
print(filtered_text)

AbstractIn paper , consider problem selection coarse - grained distributed memory parallel computers . discuss deterministic randomized algorithms parallel selection . consider algorithms load balancing needed balanced distribution data processors execution selection algorithms . carried detailed implementations algorithms discussed CM-5 report experimental results . results clearly demonstrate role randomization reducing communication overhead .
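The filter above lowercases each token before the membership test. A small sketch of why that matters, assuming the stop list stores lowercase entries; MINI_STOP_WORDS is a hypothetical miniature list standing in for spaCy's much larger STOP_WORDS set.

```python
# Hypothetical miniature stop list; spaCy's STOP_WORDS is far larger.
MINI_STOP_WORDS = {"we", "the", "of", "in", "a", "to"}

tokens = ["We", "discuss", "the", "role", "of", "randomization"]

# Case-sensitive test: "We" is kept because only "we" is in the set.
case_sensitive = [t for t in tokens if t not in MINI_STOP_WORDS]

# Case-insensitive test: lowercase first, as the notebook cell does.
case_insensitive = [t for t in tokens if t.lower() not in MINI_STOP_WORDS]

print(case_sensitive)
print(case_insensitive)
```

This matches the notebook output, where sentence-initial We and The are gone from the filtered text.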


In [70]:
# Alternative: filter stopwords with the token's built-in is_stop flag
# filtered_tokens = [token.text for token in doc if not token.is_stop]
# filtered_tokens

2-Removing Punctuation:
Punctuation tokens are dropped next, using the is_punct flag on each token.

In [73]:
# Keep only the tokens that are not punctuation
filtered_tokens = [token.text for token in doc if not token.is_punct]

# Join the filtered tokens into a string and re-run the pipeline on it
clean_text = " ".join(filtered_tokens)
doc = nlp(clean_text)
print(clean_text)

AbstractIn paper consider problem selection coarse grained distributed memory parallel computers discuss deterministic randomized algorithms parallel selection consider algorithms load balancing needed balanced distribution data processors execution selection algorithms carried detailed implementations algorithms discussed CM-5 report experimental results results clearly demonstrate role randomization reducing communication overhead
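In the output above the standalone hyphen from coarse-grained is gone, yet CM-5 keeps its hyphen because it was a single token and not punctuation as a whole. A standard-library sketch of that token-level test follows; it only approximates token.is_punct using ASCII punctuation, and the helper name is_punct_only is hypothetical.

```python
import string

def is_punct_only(token: str) -> bool:
    """True when every character of the token is ASCII punctuation."""
    return bool(token) and all(ch in string.punctuation for ch in token)

tokens = ["coarse", "-", "grained", "CM-5", "results", "."]

# Drop tokens made up solely of punctuation; mixed tokens such as "CM-5" stay.
kept = [t for t in tokens if not is_punct_only(t)]
print(kept)
```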


3-Lemmatization:
Lemmatization reduces each word to its dictionary base form (lemma), for example computers to computer and carried to carry.

In [74]:
# Iterate over tokens with lemmatized forms
for token in doc:
    print(token.text, token.lemma_)


AbstractIn AbstractIn
paper paper
consider consider
problem problem
selection selection
coarse coarse
grained grain
distributed distribute
memory memory
parallel parallel
computers computer
discuss discuss
deterministic deterministic
randomized randomized
algorithms algorithm
parallel parallel
selection selection
consider consider
algorithms algorithm
load load
balancing balance
needed need
balanced balanced
distribution distribution
data datum
processors processor
execution execution
selection selection
algorithms algorithm
carried carry
detailed detailed
implementations implementation
algorithms algorithm
discussed discuss
CM-5 CM-5
report report
experimental experimental
results result
results result
clearly clearly
demonstrate demonstrate
role role
randomization randomization
reducing reduce
communication communication
overhead overhead


In [76]:
# Build a single string from the lemmatized tokens
lemmatized_abstract = " ".join(token.lemma_ for token in doc)
lemmatized_abstract

'AbstractIn paper consider problem selection coarse grain distribute memory parallel computer discuss deterministic randomized algorithm parallel selection consider algorithm load balance need balanced distribution datum processor execution selection algorithm carry detailed implementation algorithm discuss CM-5 report experimental result result clearly demonstrate role randomization reduce communication overhead'
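The three steps above each re-run the pipeline on an intermediate string. One possible alternative is a single pass over the tokens of one parse, sketched below. The function name preprocess and the stand-in tokens are hypothetical; the stand-ins exist only so the example runs without a loaded model. With spaCy available, preprocess(nlp(abstract)) should follow the same path, though its result may differ slightly from the notebook's, since tags and lemmas are then computed on the full sentences and not on the filtered text.

```python
from types import SimpleNamespace

def preprocess(tokens) -> str:
    """Drop stopwords and punctuation, then join the lemmas of what remains."""
    return " ".join(
        t.lemma_ for t in tokens if not t.is_stop and not t.is_punct
    )

def tok(text, lemma, is_stop=False, is_punct=False):
    """Hypothetical stand-in exposing the token attributes used above."""
    return SimpleNamespace(
        text=text, lemma_=lemma, is_stop=is_stop, is_punct=is_punct
    )

sample = [
    tok("We", "we", is_stop=True),
    tok("discuss", "discuss"),
    tok("several", "several", is_stop=True),
    tok("algorithms", "algorithm"),
    tok(".", ".", is_punct=True),
]
print(preprocess(sample))
```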