# Read AraWordNet

Contributed by **Ali Ahmed**

A utility file that reads **AraWordNet** and provides a dictionary mapping each sense to its words.

AraWordNet [1][2] can be found at http://globalwordnet.org/resources/arabic-wordnet/.

## Prerequisite:
- Define `wordnet_path` variable
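
The setup cell restores `wordnet_path` with `%store -r`, so the variable has to be stored from an IPython session beforehand. A minimal sketch, assuming a hypothetical download location (the directory and file name below are placeholders, not the official ones):

```python
from pathlib import Path

# Hypothetical location of the downloaded AraWordNet XML file; adjust to your setup.
wordnet_path = str(Path("data") / "arabic-wordnet.xml")

# In a Jupyter/IPython session, persist the variable so that
# `%store -r wordnet_path` can restore it in this notebook:
#   %store wordnet_path
print(wordnet_path)
```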

[1] Black W., Elkateb S., Rodriguez H., Alkhalifa M., Vossen P., Pease A., Bertran M., Fellbaum C., (2006) The Arabic WordNet Project, Proceedings of LREC 2006

[2] Lahsen Abouenour, Karim Bouzoubaa, Paolo Rosso (2013) On the evaluation and improvement of Arabic WordNet coverage and usability, Language Resources and Evaluation 47(3) pp 891–91

## Import and Setup

In [1]:
from bs4 import BeautifulSoup
from collections import Counter, defaultdict

%store -r wordnet_path

## Read AraWordNet

**AraWordNet** is stored as XML with the following structure:
- LexicalEntry:
  - Lemma. Its properties are `partOfSpeech` and `writtenForm`. We are interested in `writtenForm`, which gives the written form of the word.
  - Sense. Its properties are `id` and `synset`. We are interested in `synset`, which links the word to its relations in the WordNet.
  - WordForm. Its properties are `formType` and `writtenForm`. We are not interested in either of them.
- Synset. Its properties are `baseConcept` and `id`. We are interested in `id`, which matches the `synset` property of the `Sense` node of every word:
  - SynsetRelations
    - SynsetRelation. Its properties are `relType` and `targets`. We are interested in both. `relType` gives the relation type (`hypernym`, `hyponym`, etc.). `targets` holds a synset id and matches the `synset` property of the `Sense` node of every word.
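
To make the structure concrete, here is a small sketch that parses a hand-written fragment shaped like the outline above. The fragment, its ids, its attribute values and the `<Sample>` wrapper element are all made up for illustration, and the standard library's `xml.etree.ElementTree` is used instead of BeautifulSoup only so that the sketch runs without extra dependencies.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment following the structure described above (all values are made up).
SAMPLE = """
<Sample>
  <LexicalEntry id="e1">
    <Lemma partOfSpeech="n" writtenForm="word_a"/>
    <Sense id="s1" synset="syn_1"/>
    <WordForm formType="root" writtenForm="root_a"/>
  </LexicalEntry>
  <Synset baseConcept="3" id="syn_1">
    <SynsetRelations>
      <SynsetRelation relType="hypernym" targets="syn_2"/>
    </SynsetRelations>
  </Synset>
</Sample>
"""

root = ET.fromstring(SAMPLE.strip())

# Lemma.writtenForm -> Sense.synset for every lexical entry
lemma_to_synset = {
    entry.find("Lemma").get("writtenForm"): entry.find("Sense").get("synset")
    for entry in root.iter("LexicalEntry")
}

# (Synset.id, relType, targets) for every synset relation
synset_relations = [
    (synset.get("id"), rel.get("relType"), rel.get("targets"))
    for synset in root.iter("Synset")
    for rel in synset.iter("SynsetRelation")
]

print(lemma_to_synset)
print(synset_relations)
```

The `synset` value of the `Sense` node and the `id` of the `Synset` node are the join key between words and relations.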

In [2]:
print("Loading AraWordNet")
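# Open in binary mode so the parser can use the encoding declared in the XML file,
# and close the file handle once parsing is done
with open(wordnet_path, "rb") as wordnet_file:
    wordnet = BeautifulSoup(wordnet_file, "xml")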

Loading AraWordNet


## Extracting relations from AraWordNet (AWN)

Relation types include:
- `hypernym`: represents a parent-to-child relationship.
- `hyponym`: represents a child-to-parent relationship.
- `has_instance`: represents a relationship from an object to one of its instances.
- `is_instance`: represents a relationship from an instance to its object.

There are also other relation types that we are not interested in.
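
The cells that follow repeat one pattern for each relation type: walk over every `Synset`, keep the `SynsetRelation` nodes of one `relType`, and build `(parent, child)` pairs. A hedged sketch of that pattern as a single helper is shown here on made-up data. The name `relation_pairs`, the `reverse` flag, the sample ids and the `<Sample>` wrapper are assumptions for illustration, and `xml.etree.ElementTree` stands in for BeautifulSoup.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; ids are made up.
SAMPLE = """
<Sample>
  <Synset id="syn_1">
    <SynsetRelations>
      <SynsetRelation relType="hypernym" targets="syn_2"/>
      <SynsetRelation relType="has_instance" targets="syn_3"/>
    </SynsetRelations>
  </Synset>
  <Synset id="syn_2">
    <SynsetRelations>
      <SynsetRelation relType="hyponym" targets="syn_1"/>
    </SynsetRelations>
  </Synset>
</Sample>
"""


def relation_pairs(root, rel_type, reverse=False):
    """Collect (synset id, target) pairs for one relation type.

    With reverse=True the pair is emitted as (target, synset id), which is
    how the notebook orients hyponym and is_instance pairs.
    """
    pairs = []
    for synset in root.iter("Synset"):
        for rel in synset.iter("SynsetRelation"):
            if rel.get("relType") == rel_type:
                pair = (synset.get("id"), rel.get("targets"))
                pairs.append(pair[::-1] if reverse else pair)
    return pairs


root = ET.fromstring(SAMPLE.strip())
hypernym_pairs = relation_pairs(root, "hypernym")
hyponym_pairs = relation_pairs(root, "hyponym", reverse=True)
print(hypernym_pairs == hyponym_pairs)
```

In this toy fragment the reversed `hyponym` pair coincides with the `hypernym` pair, which is the property the notebook checks on the real data.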

In [3]:
print("Reading hypernym relations")
relations = []
# For every synonym set
for synset in wordnet.findAll('Synset'):
    # Get its hypernym relations
    synset_hypernym_relations = list(filter((lambda relation: relation['relType'] == 'hypernym'), synset.findAll('SynsetRelation')))
    for relation in synset_hypernym_relations:
        # Construct a pair between each synonym set and its child
        relations.append((synset['id'], relation['targets']))

Reading hypernym relations


### Testing whether hypernym relations mirror hyponym relations

If this is true, we can safely ignore the `hyponym` relationship and work with the `hypernym` relationship only.

In [4]:
hyponym_relations = []
# For every synonym set
for synset in wordnet.findAll('Synset'):
    # Get its hyponym relations
    synset_hyponym_relations = list(filter((lambda relation: relation['relType'] == 'hyponym'), synset.findAll('SynsetRelation')))
    for relation in synset_hyponym_relations:
        # Construct a pair between each synonym set and its parent
        hyponym_relations.append((relation['targets'], synset['id']))

relations.sort()
hyponym_relations.sort()
print("Test: Are hypernym relations similar to hyponym relations? {}".format(relations == hyponym_relations))
if relations == hyponym_relations: print("Considering hypernym relations only")

Test: Are hypernym relations similar to hyponym relations? True
Considering hypernym relations only


### We might also consider the is_instance and has_instance relations

In [5]:
has_instance_relations = []
# For every synonym set
for synset in wordnet.findAll('Synset'):
    # Get its has_instance relations
    synset_has_instance_relations = list(filter((lambda relation: relation['relType'] == 'has_instance'), synset.findAll('SynsetRelation')))
    for relation in synset_has_instance_relations:
        # Construct a pair between each synonym set and its instance
        has_instance_relations.append((synset['id'], relation['targets']))

### Testing whether has_instance relations mirror is_instance relations

If this is true, we can again safely ignore the `is_instance` relationship and work with the `has_instance` relationship only.

In [6]:
is_instance_relations = []
# For every synonym set
for synset in wordnet.findAll('Synset'):
    # Get its is_instance relations
    synset_is_instance_relations = list(filter((lambda relation: relation['relType'] == 'is_instance'), synset.findAll('SynsetRelation')))
    for relation in synset_is_instance_relations:
        # Construct a pair between each synonym set and its object
        is_instance_relations.append((relation['targets'], synset['id']))

is_instance_relations.sort()
has_instance_relations.sort()
is_instance_relations == has_instance_relations

True

### Testing whether the hypernym relations contain repeats

We have to remove repeated relations if they exist.

In [7]:
print("Number of hypernym relations: {}".format(len(relations)))
print("Contains unique relations only? {}".format(len(relations) == len(set(relations))))

Number of hypernym relations: 19806
Contains unique relations only? False


### Therefore, we have to consider only the set of unique relations

In [8]:
print("Considering unique hypernym relations only")
relations = list(set(relations))
print("Number of unique hypernym relations: {}".format(len(relations)))

Considering unique hypernym relations only
Number of unique hypernym relations: 9305
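
Note that `list(set(...))` does not preserve the original order of the pairs. If a reproducible order matters downstream, one alternative (a sketch on made-up pairs, not what the notebook does) is to deduplicate with `dict.fromkeys`, which keeps the first occurrence of each pair in insertion order:

```python
# Toy pairs with one repeated entry (ids are made up).
toy_relations = [("syn_1", "syn_2"), ("syn_3", "syn_4"), ("syn_1", "syn_2")]

# dict keys are unique and keep insertion order, so the first occurrence wins.
unique_relations = list(dict.fromkeys(toy_relations))
print(unique_relations)
```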


### Testing if we have self- or bi-directional relations, and removing them
A self-directional relation links a synset to itself. A bi-directional relation links two synsets, each of which is a parent (`hypernym`) of the other. Both kinds form loops, which are problematic when constructing the tree used to generate the catcode and word-sense-children files.

In [9]:
# List for the relations in both directions
bi_directional_relations = []
# Synset is the sense id for the parent, target is the sense id for the child
for synset, target in relations:
    # Add a relation between the parent and the child
    bi_directional_relations.append((synset, target))
    # Add a relation in the other way around
    bi_directional_relations.append((target, synset))

# Count the number of occurrences for each pair. This should be 1 for every pair since we are
# considering the unique set of hypernym relations
counter = Counter(bi_directional_relations)
# If the counter of any pair is more than 1, it should be marked as invalid
invalid_relations = list(filter((lambda relation: counter[relation] > 1), bi_directional_relations))
print("Considering unique uni-directional hypernym relations only")
# Remove the invalid relations
relations = list(set(relations) - set(invalid_relations))
print("Number of unique uni-directional hypernym relations: {}".format(len(relations)))

Considering unique uni-directional hypernym relations only
Number of unique uni-directional hypernym relations: 9302
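
The same filter can be seen on a handful of made-up pairs. This sketch uses a set-membership test rather than the `Counter` above, as an equivalent formulation under the assumption that the input pairs are already unique: a pair is dropped if it is a self-relation or if its reverse is also present.

```python
# Toy (parent, child) pairs; ids are made up.
toy_relations = [
    ("a", "b"),  # regular pair ...
    ("b", "a"),  # ... whose reverse also exists: bi-directional
    ("c", "c"),  # self-directional
    ("a", "d"),  # regular pair, kept
]

pairs = set(toy_relations)
# A pair forms a loop if it points to itself or if its reverse is present too.
looping = {(parent, child) for parent, child in pairs
           if parent == child or (child, parent) in pairs}
kept = sorted(pairs - looping)
print(kept)
```

Both members of a bi-directional pair are removed, matching the behaviour of the cell above.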


### Testing if every child occurs only once as a child

A child should have only one parent and should therefore occur only once as a child in the unique uni-directional relations. If a child occurs multiple times as a child, the n-balls of its parents would have to intersect. As a result, we have to remove the relations containing repeated children.

In [10]:
children = list(map((lambda relation: relation[1]), relations))
counter = Counter(children)
invalid_children = list(filter((lambda child: counter[child] > 1), children))

print("Number of invalid children: {}".format(len(invalid_children)))

Number of invalid children: 1763


In [11]:
# Drop every relation in which a repeated child appears as the parent ...
relations = list(filter((lambda relation: relation[0] not in invalid_children), relations))
# ... or as the child
relations = list(filter((lambda relation: relation[1] not in invalid_children), relations))
print("Number of valid hypernym relations without repeated children: {}".format(len(relations)))

Number of valid hypernym relations without repeated children: 7177
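
A small sketch of the same filtering on made-up pairs. It keeps the repeated children in a set, which makes the membership tests cheaper than on a list. Because the set holds each child once, its size counts distinct children, whereas the list above counts every occurrence. The ids and variable names are hypothetical.

```python
from collections import Counter

# Toy (parent, child) pairs; "c1" has two parents.
toy_relations = [("p1", "c1"), ("p2", "c1"), ("p1", "c2"), ("c1", "c3")]

child_counts = Counter(child for _, child in toy_relations)
repeated_children = {child for child, count in child_counts.items() if count > 1}

# As in the notebook, drop a relation if a repeated child appears on either side.
kept = [(parent, child) for parent, child in toy_relations
        if parent not in repeated_children and child not in repeated_children]
print(kept)
```

Note that `("c1", "c3")` is dropped as well, because `c1` is a repeated child and appears there as the parent.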


### Extract vocabulary

Our vocabulary is limited to the words that appear in the valid hypernym relations. We extract them because they provide the written form used in the word embedding file.

In [12]:
# List for synonym set ids 
synset_ids = []
for relation in relations:
    # Extract parent synonym set id 
    synset_ids.append(relation[0])
    # Extract child synonym set id
    synset_ids.append(relation[1])

# Create unique set of synonym set ids extracted from the valid hypernym relations 
synset_ids = list(set(synset_ids))
print("Number of synonym set ids: {}".format(len(synset_ids)))

Number of synonym set ids: 7622


In [13]:
# Filter lexical entries which have synonym set id appearing in our list
lexical_entries = list(filter((lambda entry: entry.Sense['synset'] in synset_ids), wordnet.findAll('LexicalEntry')))
# Extract words that correspond to our synonym set id list. These words form our vocabulary list.
words = list(set(map((lambda entry: entry.Lemma['writtenForm']), lexical_entries)))
print("Number of unique words: {}".format(len(words)))

Number of unique words: 14391


## Dictionary from synset to words

Construct a dictionary for every synonym set id and its set of words. The key is the synonym set id and the value is a list of words. 

In [14]:
lexical_entries = list(filter((lambda entry: entry.Sense['synset'] in synset_ids), wordnet.findAll('LexicalEntry')))
synset_dict = defaultdict(list)
for entry in lexical_entries:
    written_form = entry.Lemma['writtenForm']
    synset_dict[entry.Sense['synset']].append(written_form)
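
If the opposite lookup is also needed (from a written form to the synsets it belongs to), the dictionary can be inverted. A minimal sketch on made-up data; `toy_synset_dict` stands in for `synset_dict`, and `word_to_synsets` is a hypothetical name:

```python
from collections import defaultdict

# Stand-in for synset_dict: synonym set id -> list of written forms (values are made up).
toy_synset_dict = {"syn_1": ["word_a", "word_b"], "syn_2": ["word_b"]}

# Invert the mapping: written form -> list of synonym set ids
word_to_synsets = defaultdict(list)
for synset_id, written_forms in toy_synset_dict.items():
    for written_form in written_forms:
        word_to_synsets[written_form].append(synset_id)

print(dict(word_to_synsets))
```

A word that appears in several synsets, such as `word_b` here, maps to all of them.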

<table style="width:100%">
  <tr>
      <td colspan="1" style="text-align:left;background-color:#0071BD;color:white">
        <a rel="license" href="http://creativecommons.org/licenses/by-nc/4.0/">
            <img alt="Creative Commons License" style="border-width:0;float:left;padding-right:10pt"
                 src="https://i.creativecommons.org/l/by-nc/4.0/88x31.png" />
        </a>
        &copy; T. Dong, C. Bauckhage<br/>
        Licensed under a 
        <a rel="license" href="http://creativecommons.org/licenses/by-nc/4.0/" style="color:white">
            CC BY-NC 4.0
        </a>.
      </td>
      <td colspan="2" style="text-align:left;background-color:#66A5D1">
          <b>Acknowledgments:</b>
          This material was prepared within the project
          <a href="http://www.b-it-center.de/b-it-programmes/teaching-material/p3ml/" style="color:black">
              P3ML
          </a> 
          which is funded by the Ministry of Education and Research of Germany (BMBF)
          under grant number 01/S17064. The authors gratefully acknowledge this support.
      </td>
  </tr>
</table>