# **Natural Language Processing**

## spaCy

spaCy is a natural language processing (NLP) library for Python, designed to be fast, efficient and easy to use in practical applications. It is widely used for tasks such as:

* Tokenisation: Breaking text into words, phrases or smaller units.
* Lemmatisation: Reducing words to their base form (e.g. running → run).
* POS tagging: Identifying the type of each word (noun, verb, adjective, etc.).
* Named entity recognition (NER): Detecting proper names, dates, organisations, etc.
* Parsing: Determining the grammatical structure of a sentence.
* Word vectorisation: Representing words in numerical form for ML and deep learning models.

Unlike NLTK, which is oriented more towards research and experimentation, spaCy is optimised for production use and efficiency. It also includes pre-trained models for several languages and can integrate neural network models (such as transformers) to improve accuracy.



We choose the model we need for information processing and download it:

In [None]:
!pip install spacy # Install the library if not preinstalled



In [None]:
import spacy
spacy.cli.download("en_core_web_sm") # Small model (sm) in English. Also available md (medium) and lg (large).

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


We can now load the downloaded model in our program as follows:

In [None]:
nlp = spacy.load("en_core_web_sm")

## Morphological analysis

Morphological analysis consists of determining the grammatical form, class or category of each word in a sentence. This process is also known as part-of-speech (POS) tagging. The tags used here follow the Universal Dependencies scheme: [list of tags](https://universaldependencies.org/u/pos/).

Next, let's see how to get the grammatical category of the words in the sentence:

1.   First we create a **doc** from the text we want to parse. Doc is a ***spaCy*** object that stores the text and all its annotations.
2.   We then iterate through the document to see what ***spaCy*** has analysed.



In [None]:
text = "Rainfall in Spain falls mainly on the plain without mountains."
doc = nlp(text)

for token in doc:
    print(token.text, token.pos_)

Rainfall NOUN
in ADP
Spain PROPN
falls VERB
mainly ADV
on ADP
the DET
plain NOUN
without ADP
mountains NOUN
. PUNCT


As we can see, ***spaCy*** first tokenises the text using the rules for the specific language, and then tags and parses the sentence with the loaded model. Statistical models allow ***spaCy*** to predict which tag is most likely in a given context.

A trained model has seen enough examples to make predictions that generalise across the language; for example, a word following the determiner ‘the’ is likely to be a noun.
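The idea of predicting a tag from its context can be illustrated with a deliberately tiny counting model. This is only a sketch: spaCy's own taggers are far more sophisticated, and both the hand-labelled sentences and the `most_likely_next_tag` helper are hypothetical, made up for this illustration.

```python
from collections import Counter, defaultdict

# Hypothetical hand-labelled sentences (illustrative data only).
tagged_sentences = [
    [("the", "DET"), ("rain", "NOUN"), ("falls", "VERB")],
    [("the", "DET"), ("plain", "NOUN"), ("floods", "VERB")],
    [("a", "DET"), ("heavy", "ADJ"), ("storm", "NOUN"), ("arrives", "VERB")],
]

# For every tag, count which tag follows it.
following = defaultdict(Counter)
for sentence in tagged_sentences:
    for (_, prev_tag), (_, tag) in zip(sentence, sentence[1:]):
        following[prev_tag][tag] += 1


def most_likely_next_tag(prev_tag):
    """Return the tag seen most often after prev_tag, or None if unseen."""
    if prev_tag not in following:
        return None
    return following[prev_tag].most_common(1)[0][0]


print(most_likely_next_tag("DET"))
```

In this toy data a determiner is followed by a noun twice and by an adjective once, so the noun is the most likely prediction.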

Linguistic annotations are available as **Token** attributes, which provide information about each word in the sentence, such as:


* Text: The original text of the word.
* Lemma: The base form of the word.
* POS: The part-of-speech tag of the word.
* Morph: The morphological features that refine the POS tag (e.g. number, tense, person).
* Dep: Syntactic dependency, i.e. the relationship between the tokens.
* Shape: The shape of the word: capitalisation, punctuation, digits.
* Alpha: Does the token consist only of alphabetic characters?
* Stop-word: Is the token part of a stop list?


In [None]:
doc = nlp("Apple is considering buying a British company for $1 billion.")

for token in doc:
    print(token.text, token.lemma_, token.pos_, token.morph, token.dep_,
            token.shape_, token.is_alpha, token.is_stop)

Apple Apple PROPN Number=Sing nsubj Xxxxx True False
is be AUX Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin aux xx True True
considering consider VERB Aspect=Prog|Tense=Pres|VerbForm=Part ROOT xxxx True False
buying buy VERB Aspect=Prog|Tense=Pres|VerbForm=Part xcomp xxxx True False
a a DET Definite=Ind|PronType=Art det x True True
British british ADJ Degree=Pos amod Xxxxx True False
company company NOUN Number=Sing dobj xxxx True False
for for ADP  prep xxx True True
$ $ SYM  quantmod $ False False
1 1 NUM NumType=Card compound d False False
billion billion NUM NumType=Card pobj xxxx True False
. . PUNCT PunctType=Peri punct . False False
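The Morph column in the output above is printed as `Feature=Value` pairs separated by `|`. Inside spaCy the same information is available as a dictionary through `token.morph.to_dict()`. The sketch below only mimics that on plain strings copied from the output, and `parse_feats` is a hypothetical helper name, not a spaCy function.

```python
def parse_feats(feats):
    """Split a 'Feature=Value|Feature=Value' string into a dict."""
    if not feats:
        return {}  # Tokens such as 'for' print an empty Morph column
    return dict(pair.split("=", 1) for pair in feats.split("|"))


# String copied from the Morph column printed above for the token 'is'.
is_feats = parse_feats("Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin")
print(is_feats["Tense"], is_feats["Number"])
print(parse_feats(""))
```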


**Note:** The shape of a word is built with a simple rule: each upper-case letter becomes `X`, each lower-case letter becomes `x` and each digit becomes `d`, while other characters are kept as they are. Runs of the same symbol are truncated after four characters, which is why *Apple* appears as `Xxxxx` and *considering* as `xxxx`.
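The shape values in the output can be reproduced with a few lines of plain Python. This is an approximation written for illustration, assuming the behaviour visible in the output (runs of the same symbol cut after four characters); `approx_shape` is a hypothetical name and not spaCy's own implementation.

```python
def approx_shape(text):
    """Approximate a word shape: X/x for letters, d for digits."""
    shape = []
    last_symbol = ""
    run_length = 0
    for char in text:
        if char.isalpha():
            symbol = "X" if char.isupper() else "x"
        elif char.isdigit():
            symbol = "d"
        else:
            symbol = char  # Punctuation and symbols are kept unchanged
        # Track how long the current run of identical symbols is.
        run_length = run_length + 1 if symbol == last_symbol else 1
        last_symbol = symbol
        if run_length <= 4:  # Keep at most four symbols per run
            shape.append(symbol)
    return "".join(shape)


for word in ["Apple", "considering", "is", "$", "1"]:
    print(word, approx_shape(word))
```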

Many tags and labels are rather abstract and vary from language to language. The **spacy.explain()** function returns a brief description for many of them:

In [None]:
import spacy
spacy.explain("DET")

'determiner'

## Syntactic analysis

Syntactic analysis consists of determining the syntactic functions, agreement and hierarchical relationships that words have when they are grouped together.

The most commonly used types of parsing are constituency parsing and dependency parsing. Constituency parsing divides the sentence into its component parts (noun phrases, verb phrases, etc.), which are called constituents, and each of these parts is in turn divided into smaller parts until we arrive at the words. Dependency parsing, on the other hand, looks for the relationships between the individual words in the sentence.

In this session, we are going to carry out a dependency parse, which can be very useful, for example, to determine the subject and the direct object of a sentence, the words modified by a negator, etc. This matters because the interpretation and comprehension of texts often depend on a correct syntactic analysis.

Next, we are going to see how to perform dependency parsing of a sentence using ***spaCy*** again:

1.   First we install the package if necessary and download the appropriate model.
2.   We create a **doc** from the text we want to parse.
3.   We then iterate through the document to see what ***spaCy*** has parsed.

In [None]:
import spacy
spacy.cli.download("en_core_web_sm")
nlp = spacy.load("en_core_web_sm")

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [None]:
text = "The order has been delivered late."
doc = nlp(text)

for token in doc:
    print(token.text, token.pos_, token.dep_)

The DET det
order NOUN nsubjpass
has AUX aux
been AUX auxpass
delivered VERB ROOT
late ADV advmod
. PUNCT punct


Dependencies can be represented in a directed graph:

* The words are the nodes.
* The grammatical relations are the edges.

You can use **displacy** to visualise the dependency tree:

In [None]:
from spacy import displacy
displacy.render(doc, style='dep', jupyter=True)

In [None]:
spacy.explain("nsubjpass")

'nominal subject (passive)'

In [None]:
text = "The computer is not good."
doc = nlp(text)

for token in doc:
  print(token.text, token.pos_, token.morph, token.dep_, token.head.text)

displacy.render(doc, style='dep', jupyter=True)

The DET Definite=Def|PronType=Art det computer
computer NOUN Number=Sing nsubj is
is AUX Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin ROOT is
not PART Polarity=Neg neg is
good ADJ Degree=Pos acomp is
. PUNCT PunctType=Peri punct is
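The head column printed above is enough to rebuild the tree by hand, which makes the "words are nodes, relations are edges" view concrete. The sketch below uses only the triples copied from that output. It keys the tree by word text, which works here only because no word is repeated in the sentence; on a real `doc` the token index would be a safer key.

```python
from collections import defaultdict

# (token, dependency label, head) triples copied from the output above.
rows = [
    ("The", "det", "computer"),
    ("computer", "nsubj", "is"),
    ("is", "ROOT", "is"),
    ("not", "neg", "is"),
    ("good", "acomp", "is"),
    (".", "punct", "is"),
]

# Map each head to its dependents; the root is the token labelled ROOT.
children = defaultdict(list)
root = None
for text, dep, head in rows:
    if dep == "ROOT":
        root = text
    else:
        children[head].append((text, dep))


def print_tree(word, label="ROOT", depth=0):
    """Print the subtree below `word`, indenting one level per edge."""
    print("  " * depth + f"{word} ({label})")
    for child, dep in children.get(word, []):
        print_tree(child, dep, depth + 1)


print_tree(root)

# Which word does each negator modify?
negated = [head for text, dep, head in rows if dep == "neg"]
print(negated)
```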


**Navigating the dependency tree**

The dependency parsing scheme has the properties of a tree. This tree contains information about sentence structure and grammar and can be traversed in different ways to extract relationships.

***spaCy*** provides attributes and methods such as *children*, *lefts*, *rights* and *nbor()* to navigate the dependency tree and the neighbouring tokens.

In [None]:
text = 'bright red apples on the tree'
doc = nlp(text)
displacy.render(doc, style='dep', jupyter=True)

In [None]:
# Extract the children of `apples`.
print(f"Children of {doc[2]}: {list(doc[2].children)}")

# Extract the next neighbour node from `apples`. (default=1)
print(f"Neighbour of {doc[2]}: {doc[2].nbor()}")

# Extract the preceding neighbour node from `red`.
print(f"Preceding neighbour of {doc[1]}: {doc[1].nbor(-1)}")

# Extract the children to the left of `tree`.
print(f"Children to the left of {doc[-1]}: {list(doc[-1].lefts)}")

# Extract children to the right of `apples`.
print(f"Children to the right of {doc[2]}: {list(doc[2].rights)}")

Children of apples: [bright, red, on]
Neighbour of apples: on
Preceding neighbour of red: bright
Children to the left of tree: [the]
Children to the right of apples: [on]


For more information on dependencies, you can consult the following URLs:
* https://universaldependencies.org/u/dep/all.html

* https://universaldependencies.org/docs/u/dep/

* https://nlp.stanford.edu/software/dependencies_manual.pdf

## Semantic analysis

Semantic analysis consists of determining the meaning of the words in a text. In this session we are going to see how to determine the meaning of the different words in a text according to their grammatical category and how to obtain synonyms and antonyms, which can be very useful when it comes to facilitating the understanding of a text.

**WordNet** is a lexical database containing nouns, adjectives, verbs and adverbs grouped into synonym sets called synsets, providing short, general definitions and storing the semantic relationships between the synonym sets. It is the most widely used lexical database for word sense disambiguation (WSD), a task that aims to assign the most appropriate concept to terms according to the context in which they appear.

NLTK provides an interface to WordNet
(http://www.nltk.org/api/nltk.corpus.reader.html#module-nltk.corpus.reader.wordnet). Below, we look at some examples that show the potential of this lexical database.



### Meaning of a term

Given a term, WordNet allows you to find out the different meanings that word can have, as well as some examples of sentences in which it can be used.

The function `synsets(lemma, pos=None, lang='eng')` returns all synsets (sets of synonyms) of the given lemma that match the given POS tag. If no POS tag is specified, it returns all synsets associated with the lemma.

In [None]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [None]:
from nltk.corpus import wordnet as wn

In [None]:
for syn in wn.synsets("car"):
  print(syn, syn.definition())

Synset('car.n.01') a motor vehicle with four wheels; usually propelled by an internal combustion engine
Synset('car.n.02') a wheeled vehicle adapted to the rails of railroad
Synset('car.n.03') the compartment that is suspended from an airship and that carries personnel and the cargo and the power plant
Synset('car.n.04') where passengers ride up and down
Synset('cable_car.n.01') a conveyance for passengers or freight on a cable railway


In the example above we can see that the term ‘car’ always acts as a noun (‘n’) and has 5 different senses (‘car.n.01’, ‘car.n.02’, ‘car.n.03’, ‘car.n.04’ and ‘cable_car.n.01’).



List of POS tags used in WordNet:
```
ADJ, ADJ_SAT, ADV, NOUN, VERB = 'a', 's', 'r', 'n', 'v'
```

Adjectives are organised in groups containing head synsets (a) and satellite synsets (s):  

* ADJ: head adjectives express a core meaning, e.g. ‘dry’, ‘good’, etc.
* ADJ_SAT: satellite adjectives add further constraints to the meaning of a head adjective, e.g. ‘arid’ = ‘dry’ + a particular context (such as a place or climate).
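When WordNet lookups are combined with tags produced by a tagger, a small mapping between the two tag sets is convenient. The sketch below assumes Universal POS tags on one side (such as those in spaCy's `token.pos_`) and the WordNet letters listed above on the other; `UPOS_TO_WORDNET` and `to_wordnet_pos` are hypothetical names chosen for this example.

```python
# WordNet POS letters as listed above, keyed by Universal POS tag.
UPOS_TO_WORDNET = {"NOUN": "n", "VERB": "v", "ADJ": "a", "ADV": "r"}


def to_wordnet_pos(upos):
    """Return the WordNet POS letter for a Universal POS tag, or None."""
    return UPOS_TO_WORDNET.get(upos)


for tag in ("NOUN", "ADJ", "PROPN"):
    print(tag, to_wordnet_pos(tag))
```

Tags with no WordNet counterpart, such as `PROPN` or `DET`, map to `None`, so the caller can skip the lookup for them.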


To obtain an example of a sentence in which this term is used to refer to a four-wheeled motor vehicle (‘car.n.01’) we make a call to the `examples()` function of the synset:


In [None]:
wn.synset('car.n.01').examples()

['he needs a car to get to work']

### Synonyms

Although definitions help to understand the meaning of a word, sometimes it may be more useful to obtain other words with the same meaning (synonyms):

In [None]:
wn.synset('car.n.01').lemma_names()

['car', 'auto', 'automobile', 'machine', 'motorcar']

In [None]:
wn.synset('car.n.01').lemmas()

[Lemma('car.n.01.car'),
 Lemma('car.n.01.auto'),
 Lemma('car.n.01.automobile'),
 Lemma('car.n.01.machine'),
 Lemma('car.n.01.motorcar')]

In [None]:
wn.synonyms("car")

[['auto', 'automobile', 'machine', 'motorcar'],
 ['railcar', 'railroad_car', 'railway_car'],
 ['gondola'],
 ['elevator_car'],
 ['cable_car']]

### Antonyms

WordNet also allows you to obtain words with an opposite meaning to a given word (antonyms).

The following example shows how to get an antonym for the word ‘cheap’ when used with the meaning ‘low price’:


In [None]:
for ss in wn.synsets("cheap"):
  print(ss, ss.definition())

Synset('cheap.a.01') relatively low in price or charging low prices
Synset('brassy.s.02') tastelessly showy
Synset('bum.s.01') of very poor quality; flimsy
Synset('cheap.s.04') embarrassingly stingy


In [None]:
wn.synonyms("cheap")

[['inexpensive'],
 ['brassy',
  'flash',
  'flashy',
  'garish',
  'gaudy',
  'gimcrack',
  'loud',
  'meretricious',
  'tacky',
  'tatty',
  'tawdry',
  'trashy'],
 ['bum', 'cheesy', 'chintzy', 'crummy', 'punk', 'sleazy', 'tinny'],
 ['chinchy', 'chintzy']]
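`wn.synonyms()` returns one list per synset, so the same word can appear in more than one group (‘chintzy’ does in the output above). If a single list without repeats is needed, the groups can be flattened and de-duplicated. The sketch below works on the nested list copied from that output rather than calling WordNet again.

```python
# Nested list copied from the output of wn.synonyms("cheap") above.
synonym_groups = [
    ["inexpensive"],
    ["brassy", "flash", "flashy", "garish", "gaudy", "gimcrack", "loud",
     "meretricious", "tacky", "tatty", "tawdry", "trashy"],
    ["bum", "cheesy", "chintzy", "crummy", "punk", "sleazy", "tinny"],
    ["chinchy", "chintzy"],
]

# Flatten the groups into one list of words.
flat = [word for group in synonym_groups for word in group]

# dict.fromkeys keeps the first occurrence of each word and preserves order.
unique = list(dict.fromkeys(flat))

print(len(flat), len(unique))
```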

In [None]:
wn.synset("cheap.a.01").lemmas()[0].antonyms()

[Lemma('expensive.a.01.expensive')]

### Disambiguation

So far we have worked with individual terms, selecting the synset we want information about. But what happens if we have a sentence rather than a single term? In that case we need to carry out a disambiguation process that takes into account the context in which the words occur.

NLTK provides an implementation of the Lesk disambiguation algorithm (https://www.nltk.org/api/nltk.wsd.lesk.html?highlight=lesk), which, given an ambiguous word and the context in which it occurs, returns the synset whose dictionary definition shares the most words with the context. For example, the adjective ‘cheap’ is ambiguous and can have any of the following meanings: relatively low in price (‘cheap.a.01’), tastelessly showy (‘brassy.s.02’), of very poor quality; flimsy (‘bum.s.01’) or embarrassingly stingy (‘cheap.s.04’).

In [None]:
from nltk.wsd import lesk

In [None]:
sent = 'I am surprised with the price of this restaurant , it is very cheap .'
sent = sent.split()
synset = lesk(sent, 'cheap', 'a')
print(synset)

Synset('cheap.a.01')


If we show the definition of the synset provided by the Lesk algorithm we can see that, in that sentence, the word ‘cheap’ refers to ‘low in price’.

In [None]:
print(synset.definition())

relatively low in price or charging low prices
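To make the overlap counting concrete, here is a simplified version of the idea in plain Python. It is a sketch in the spirit of Lesk, not NLTK's implementation: the glosses are copied from the definitions of ‘cheap’ printed earlier, while the small stop list and the names `content_words` and `simplified_lesk` are assumptions introduced for this example.

```python
import string

# Definitions copied from the synsets of 'cheap' printed earlier.
GLOSSES = {
    "cheap.a.01": "relatively low in price or charging low prices",
    "brassy.s.02": "tastelessly showy",
    "bum.s.01": "of very poor quality; flimsy",
    "cheap.s.04": "embarrassingly stingy",
}

# Hypothetical, hand-picked stop list (an assumption for this sketch).
STOP_WORDS = {"i", "am", "with", "the", "of", "this", "it", "is", "very", "in", "or"}


def content_words(text):
    """Lower-case the text, strip punctuation and drop stop words."""
    words = (w.strip(string.punctuation).lower() for w in text.split())
    return {w for w in words if w and w not in STOP_WORDS}


def simplified_lesk(sentence, glosses):
    """Return the sense whose gloss shares most content words with the sentence."""
    context = content_words(sentence)
    scores = {sense: len(context & content_words(gloss))
              for sense, gloss in glosses.items()}
    best_sense = max(scores, key=scores.get)
    return best_sense, scores


sentence = "I am surprised with the price of this restaurant , it is very cheap ."
best_sense, scores = simplified_lesk(sentence, GLOSSES)
print(best_sense, scores)
```

Without the stop list, the function words ‘of’ and ‘very’ would give ‘bum.s.01’ two overlaps against one for ‘cheap.a.01’, which shows how sensitive plain overlap counting is to uninformative words.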


## Exercises

The exercises must be done in this *notebook*, and must be submitted through PLATEA before the deadline indicated.

Download the story ‘carroll-alice.txt’ available in PLATEA (Supplementary Material folder) and write the code needed to obtain the following information:

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
data_path = "/content/drive/MyDrive/NLP/regreso_al_paraiso.txt"

### Exercise 1 - Morphological Analysis

Obtain the total number of **nouns**, **adjectives**, **verbs** and **adverbs** in the text. For nouns, also calculate the number of:
* Masculine singular nouns.
* Masculine plural nouns.
* Feminine singular nouns.
* Feminine plural nouns.

Are there genderless nouns?


In [3]:
import spacy
spacy.cli.download("es_core_news_sm") # Small model (sm) in Spanish. Also available md (medium) and lg (large).
# nlp = spacy.load("en_core_web_sm")
nlp = spacy.load("es_core_news_sm")

✔ Download and installation successful
You can now load the package via spacy.load('es_core_news_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [4]:
with open(data_path, 'r', encoding='utf-8') as f:  # Assumes the file is UTF-8 encoded
    text = f.read()

doc = nlp(text)

In [5]:
noun_count = 0
adj_count = 0
verb_count = 0
adv_count = 0

masc_sing_nouns = 0
masc_plur_nouns = 0
fem_sing_nouns = 0
fem_plur_nouns = 0
genderless_nouns = 0

In [6]:
for token in doc:
    if token.pos_ == "NOUN":
        noun_count += 1
        morph = token.morph.to_dict()
        if morph.get("Number") == "Sing":
            if morph.get("Gender") == "Masc":
                masc_sing_nouns += 1
            elif morph.get("Gender") == "Fem":
                fem_sing_nouns += 1
            elif morph.get("Gender") is None:
                genderless_nouns += 1
        elif morph.get("Number") == "Plur":
            if morph.get("Gender") == "Masc":
                masc_plur_nouns += 1
            elif morph.get("Gender") == "Fem":
                fem_plur_nouns += 1
            elif morph.get("Gender") is None:
                genderless_nouns += 1
        elif morph.get("Gender") is None and morph.get("Number") is None:
            genderless_nouns += 1

    elif token.pos_ == "ADJ":
        adj_count += 1
    elif token.pos_ == "VERB":
        verb_count += 1
    elif token.pos_ == "ADV":
        adv_count += 1

In [7]:
print(f"Total Nouns: {noun_count}")
print(f"Total Adjectives: {adj_count}")
print(f"Total Verbs: {verb_count}")
print(f"Total Adverbs: {adv_count}")

print(f"Masculine Singular Nouns: {masc_sing_nouns}")
print(f"Masculine Plural Nouns: {masc_plur_nouns}")
print(f"Feminine Singular Nouns: {fem_sing_nouns}")
print(f"Feminine Plural Nouns: {fem_plur_nouns}")
print(f"Genderless nouns: {genderless_nouns}")

Total Nouns: 6531
Total Adjectives: 2104
Total Verbs: 5280
Total Adverbs: 3038
Masculine Singular Nouns: 1723
Masculine Plural Nouns: 913
Feminine Singular Nouns: 2465
Feminine Plural Nouns: 917
Genderless nouns: 505


Since English nouns have no grammatical gender, I decided to do this exercise with a Spanish text file.
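The gender and number breakdown can also be expressed as a single counter keyed by (gender, number) pairs, which avoids nested branches. The sketch below uses hypothetical dictionaries standing in for `token.morph.to_dict()`. Note that it treats every noun without a `Gender` feature as genderless, whatever its `Number` value, so on real data its totals may differ slightly from a branch-by-branch count.

```python
from collections import Counter

# Hypothetical morphological dictionaries, standing in for
# token.morph.to_dict() on a handful of Spanish nouns.
noun_morphs = [
    {"Gender": "Fem", "Number": "Sing"},   # e.g. "casa"
    {"Gender": "Masc", "Number": "Plur"},  # e.g. "libros"
    {"Gender": "Fem", "Number": "Sing"},   # e.g. "mesa"
    {"Number": "Sing"},                    # no Gender feature
]

# One counter keyed by (gender, number); missing features become None.
pair_counts = Counter(
    (morph.get("Gender"), morph.get("Number")) for morph in noun_morphs
)

# Sum every pair whose gender is missing.
genderless = sum(n for (gender, _), n in pair_counts.items() if gender is None)

print(pair_counts[("Fem", "Sing")], pair_counts[("Masc", "Plur")], genderless)
```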

### Exercise 2 - Syntactic Analysis

* What are the 5 most frequent nominal subjects?
* What are the 5 most frequent direct objects?
* What are the 5 most frequent negators?
* Using the dependency tree in ***spaCy***, find the 3 most frequent words connected to each of the negators obtained in the previous question.

In [16]:
spacy.cli.download("en_core_web_sm")
nlp = spacy.load("en_core_web_sm")

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [18]:
data_path = "/content/drive/MyDrive/NLP/carroll-alice.txt"
with open(data_path, 'r', encoding='utf-8') as f:  # Assumes the file is UTF-8 encoded
    text = f.read()

doc = nlp(text)

In [19]:
from collections import Counter

In [20]:
nominal_subjects = [token.text for token in doc if token.dep_ == "nsubj"]
subject_counts = Counter(nominal_subjects)
print("5 most frequent nominal subjects:")
for subject, count in subject_counts.most_common(5):
    print(f"{subject}: {count}")

5 most frequent nominal subjects:
I: 516
she: 502
Alice: 286
it: 281
you: 280


In [21]:
direct_complements = [token.text for token in doc if token.dep_ == "dobj"]
complement_counts = Counter(direct_complements)
print("\n5 most frequent direct complements:")
for complement, count in complement_counts.most_common(5):
    print(f"{complement}: {count}")


5 most frequent direct complements:
it: 125
what: 40
them: 36
Alice: 30
her: 28


In [22]:
negators = [token.text for token in doc if token.dep_ == "neg"]
negator_counts = Counter(negators)
print("\n5 most frequent negators:")
for negator, count in negator_counts.most_common(5):
    print(f"{negator}: {count}")


5 most frequent negators:
n't: 215
not: 125
never: 39
Not: 9
no: 6
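In the output above, ‘not’ and ‘Not’ are counted as different negators because the counter is keyed on the exact token text. If forms that differ only in capitalisation should be merged, the counts can be folded together. The sketch below works on the numbers copied from that output; in the notebook itself, one option would be to collect `token.lower_` instead of `token.text`.

```python
from collections import Counter

# Counts copied from the output above.
printed_counts = Counter({"n't": 215, "not": 125, "never": 39, "Not": 9, "no": 6})

# Merge forms that differ only in capitalisation.
merged = Counter()
for form, count in printed_counts.items():
    merged[form.lower()] += count

print(merged.most_common(3))
```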


In [23]:
print("\nWords connected to the top 3 negators:")
for negator, count in negator_counts.most_common(3):
    print(f"\nNegator: '{negator}' (occurs {count} times)")
    connected_words = []  # List to store words connected to the negator

    for token in doc:
        if token.dep_ == "neg" and token.text == negator:
            # Find the head word, which is the word the negator modifies
            head_word = token.head.text
            connected_words.append(head_word)  # Add the head word to the list

            # Find child words that depend on the head word (excluding the negator itself)
            for child in token.head.children:
                if child.dep_ != 'neg':
                    connected_words.append(child.text)

    # Count the frequency of connected words
    connected_counts = Counter(connected_words)

    # Print the 3 most frequent connected words
    for word, count in connected_counts.most_common(3):
        print(f"  Word '{word}' (occurs {count} times)")


Words connected to the top 3 negators:

Negator: 'n't' (occurs 215 times)
  Word ''' (occurs 105 times)
  Word 'I' (occurs 81 times)
  Word ',' (occurs 76 times)

Negator: 'not' (occurs 125 times)
  Word ',' (occurs 81 times)
  Word 'could' (occurs 40 times)
  Word ''' (occurs 28 times)

Negator: 'never' (occurs 39 times)
  Word ',' (occurs 20 times)
  Word 'I' (occurs 14 times)
  Word ''' (occurs 14 times)


In [24]:
print("\nWords connected to the top 3 negators:")
for negator, count in negator_counts.most_common(3):
    print(f"\nNegator: '{negator}' (occurs {count} times)")
    connected_words = []  # List to store words connected to the negator

    for token in doc:
        if token.dep_ == "neg" and token.text == negator:
            # Find the head word, which is the word the negator modifies
            head_word = token.head.text
            connected_words.append(head_word)  # Add the head word to the list

            # Find child words that depend on the head word (excluding the negator itself)
            for child in token.head.children:
                if child.dep_ != 'neg' and child.is_alpha:
                    connected_words.append(child.text)

    # Count the frequency of connected words
    connected_counts = Counter(connected_words)

    # Print the 3 most frequent connected words
    for word, count in connected_counts.most_common(3):
        print(f"  Word '{word}' (occurs {count} times)")


Words connected to the top 3 negators:

Negator: 'n't' (occurs 215 times)
  Word 'I' (occurs 81 times)
  Word 'do' (occurs 54 times)
  Word 'you' (occurs 46 times)

Negator: 'not' (occurs 125 times)
  Word 'could' (occurs 40 times)
  Word 'did' (occurs 24 times)
  Word 'she' (occurs 21 times)

Negator: 'never' (occurs 39 times)
  Word 'I' (occurs 14 times)
  Word 'had' (occurs 11 times)
  Word 'before' (occurs 6 times)


### Exercise 3 - Semantic analysis
For each of the semantic analysis exercises you will use the document ‘carroll-alice.txt’, available in PLATEA.

1. Find a synonym for each of the adjectives in the text and generate a .json file containing the frequency of each substitution made:


```
{
    (adj_original_1, synonym): pair_frequency,
    (adj_original_2, synonym): pair_frequency,
    ...
}
```

2. Obtain the possible **meanings**, **synonyms** and **antonyms** of the most frequent adjective in the text.
3. Generate a new file from the disambiguation of all the nouns in the document. To do this, the nouns must be replaced by the tuple (word, synset).
4. Obtain the most frequent sense without counting stop words. Note that stop words are still relevant for disambiguation.

In [25]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [27]:
from nltk.corpus import wordnet as wn
import json

**1st task**

Find a synonym for each of the adjectives in the text and generate a .json file containing the frequency of each substitution made

In [40]:
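substitution_counts = Counter()

for token in doc:
    if token.pos_ == "ADJ":
        adj_original = token.text
        # Collect lemma names from every adjective synset, skipping the
        # adjective itself so that a word is never "replaced" by itself
        synonyms = [
            lemma.name()
            for syn in wn.synsets(adj_original, pos=wn.ADJ)
            for lemma in syn.lemmas()
            if lemma.name().lower() != adj_original.lower()
        ]
        if synonyms:
            synonym = synonyms[0]  # First lemma that differs from the original
            substitution_counts[(adj_original, synonym)] += 1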

In [41]:
output_file = "adjective_synonym_substitutions.json"
substitution_counts_str = {f"{adj}, {syn}": count for (adj, syn), count in substitution_counts.items()}


In [42]:
with open(output_file, "w") as f:
    json.dump(substitution_counts_str, f, indent=4)
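JSON object keys must be strings, which is why the (adjective, synonym) pairs are joined into a single string before writing the file. The sketch below shows the failure with tuple keys and one possible round trip. The counts are hypothetical values for illustration, and splitting on `", "` assumes that neither word contains that separator.

```python
import json
from collections import Counter

# Hypothetical substitution counts (illustrative values only).
counts = Counter({("little", "small"): 3, ("great", "large"): 1})

try:
    json.dumps(counts)  # Tuple keys are not valid JSON object keys
except TypeError as error:
    print("Cannot serialise tuple keys:", error)

# Join each pair into one string key.
as_strings = {f"{adj}, {syn}": n for (adj, syn), n in counts.items()}
encoded = json.dumps(as_strings, indent=4)

# Reading the data back: split each key into its two parts again.
decoded = {tuple(key.split(", ", 1)): n for key, n in json.loads(encoded).items()}
print(decoded == dict(counts))
```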

**2nd task**

Obtain the possible meanings, synonyms and antonyms of the most frequent adjective in the text.

In [43]:
adjectives = [token.text for token in doc if token.pos_ == "ADJ"]
adj_counts = Counter(adjectives)

In [46]:
most_frequent_adj = adj_counts.most_common(1)[0][0]
print(f"Most frequent adjective: {most_frequent_adj}")

Most frequent adjective: little


In [49]:
synsets = wn.synsets(most_frequent_adj, pos=wn.ADJ)

In [50]:
if synsets:
    print("\nPossible meanings:")
    for synset in synsets:
        print(f"- {synset.definition()}")

    print("\nSynonyms:")
    for synset in synsets:
        for lemma in synset.lemmas():
            if lemma.name() != most_frequent_adj:  # Avoid duplicating the original word
                print(f"- {lemma.name()}")

    print("\nAntonyms:")
    for synset in synsets:
        for lemma in synset.lemmas():
            for antonym in lemma.antonyms():
                print(f"- {antonym.name()}")
else:
    print("Synonyms and antonyms not found.")


Possible meanings:
- limited or below average in number or quantity or magnitude or extent
- (quantifier used with mass nouns) small in quantity or degree; not much or almost none or (with `a') at least some
- (of children and animals) young, immature
- (informal) small and of little importance
- (of a voice) faint
- low in stature; not tall
- lowercase
- small in a way that arouses feelings (of tenderness or its opposite depending on the context)

Synonyms:
- small
- slight
- small
- fiddling
- footling
- lilliputian
- niggling
- piddling
- piffling
- petty
- picayune
- trivial
- small
- short
- minuscule
- small

Antonyms:
- large
- big
- much
- tall


**3rd task**

Generate a new file from the disambiguation of all the nouns in the document. To do this, the nouns must be replaced by the tuple (word, synset).

In [55]:
modified_tokens = []
for token in doc:
    if token.pos_ == "NOUN":
        # Look the noun up in WordNet and take the first listed sense
        # (a simple baseline: the sentence context is not used here)
        synsets = wn.synsets(token.text, pos=wn.NOUN)
        if synsets:
            synset = synsets[0]
            modified_tokens.append(f"({token.text}, {synset.name()})")
        else:
            # If no synset is found, keep the original noun
            modified_tokens.append(token.text)
    else:
        # Keep non-noun tokens as they are
        modified_tokens.append(token.text)

In [54]:
new_text = " ".join(modified_tokens)

output_file = "carroll-alice_nouns_disambiguated.txt"
with open(output_file, "w", encoding="utf-8") as f:
    f.write(new_text)

**4th task**

Obtain the most frequent sense without counting stop words. Note that stop words are still relevant for disambiguation.

In [56]:
from nltk.corpus import stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [57]:
stop_words = set(stopwords.words('english'))

In [59]:
synset_counts = Counter()
for token in doc:
    if token.pos_ == "NOUN":
        synsets = wn.synsets(token.text, pos=wn.NOUN)
        if synsets:
            synset = synsets[0]
            if token.text.lower() not in stop_words:
                synset_counts[synset] += 1

In [60]:
most_frequent_synset = synset_counts.most_common(1)

if most_frequent_synset:
    most_frequent_synset = most_frequent_synset[0][0]
    print(f"Most frequent synset (excluding stop words): {most_frequent_synset.name()}")
    print(f"Definition: {most_frequent_synset.definition()}")
else:
    print("No nouns found (excluding stop words) with synsets.")

Most frequent synset (excluding stop words): time.n.01
Definition: an instance or single occasion for some event
