<a href="https://colab.research.google.com/github/isegura/TextSimplification/blob/master/LexicalSimplification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lexical simplification

Lexical simplification is the task of replacing individual words in a text with words that are easier to understand, so that the text as a whole becomes easier to comprehend, for example for people with learning disabilities or for children who are learning to read.

The most basic approach is to use a dictionary of synonyms. For example, WordNet (https://wordnet.princeton.edu/) is a lexical database that links words through semantic relations such as synonymy, hyponymy, and meronymy. The original Princeton WordNet covers English, and wordnets have since been built for more than 200 languages. Synonyms are grouped into synsets, each with a short definition and usage examples.
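
The dictionary-based idea can be sketched in a few lines. This is only an illustration under stated assumptions: the synonym table `SYNONYMS` and the counts in `FREQUENCY` are tiny, hand-made, hypothetical resources (not taken from WordNet), and word frequency is assumed to be a rough proxy for how easy a word is.

```python
import re

# Hypothetical toy resources, invented for this illustration only.
SYNONYMS = {
    "purchase": ["buy", "acquire"],
    "automobile": ["car", "motorcar"],
}
FREQUENCY = {"buy": 900, "purchase": 120, "acquire": 80,
             "car": 1500, "automobile": 60, "motorcar": 5}


def simplify_word(word):
    """Return the most frequent candidate among the word and its synonyms."""
    candidates = [word] + SYNONYMS.get(word, [])
    return max(candidates, key=lambda w: FREQUENCY.get(w, 0))


def simplify_text(text):
    """Replace each word with its most frequent known synonym, if any."""
    def replace(match):
        token = match.group(0)
        simple = simplify_word(token.lower())
        # Keep the original token when no more frequent synonym is known.
        return token if simple == token.lower() else simple
    return re.sub(r"[A-Za-z]+", replace, text)


print(simplify_text("They want to purchase the automobile."))
# prints: They want to buy the car.
```

A real system would replace the toy tables with a resource such as WordNet and a frequency list computed from a large corpus.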







## Using WordNet to get synonyms

An important advantage of WordNet is that it is integrated into NLTK (http://www.nltk.org/howto/wordnet.html).

Because a word can have several meanings, WordNet returns a separate synset, ss, for each meaning. Each synset has a representative name (ss.name()), a definition (ss.definition()), and a list of synonyms for that meaning (ss.lemma_names()).

The following cell shows how to obtain the synonyms for the word **wood**. WordNet returns 8 different meanings for this word, each with its definition and synonyms.


In [0]:
import nltk
from nltk.corpus import wordnet as wn  # import WordNet from nltk.corpus under the alias wn

nltk.download('wordnet')  # download the WordNet corpus (only needed once)

def showSynsets(word):
    """Print the name, definition and synonyms of every synset of `word`."""
    print('Synsets for: ' + word)
    synsets = wn.synsets(word)
    for i, ss in enumerate(synsets):
        print("synset:", i, ss.name(), ss.definition(), ss.lemma_names())

    # An empty list means WordNet has no entry for the word.
    if not synsets:
        print('\t' + word + ' is not found in WordNet')

    print('')

word = 'wood'
showSynsets(word)



    

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
Synsets for: wood
('synset:', 0, u'wood.n.01', u'the hard fibrous lignified substance under the bark of trees', [u'wood'])
('synset:', 1, u'forest.n.01', u'the trees and other plants in a large densely wooded area', [u'forest', u'wood', u'woods'])
('synset:', 2, u'wood.n.03', u'United States film actress (1938-1981)', [u'Wood', u'Natalie_Wood'])
('synset:', 3, u'wood.n.04', u'English conductor (1869-1944)', [u'Wood', u'Sir_Henry_Wood', u'Sir_Henry_Joseph_Wood'])
('synset:', 4, u'wood.n.05', u'English writer of novels about murders and thefts and forgeries (1814-1887)', [u'Wood', u'Mrs._Henry_Wood', u'Ellen_Price_Wood'])
('synset:', 5, u'wood.n.06', u'United States painter noted for works based on life in the Midwest (1892-1942)', [u'Wood', u'Grant_Wood'])
('synset:', 6, u'woodwind.n.01', u'any wind instrument other than the brass instruments', [u'woodwind', u'woodwind_instrument',
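
For simplification, the per-synset lists usually need to be flattened into one list of candidate replacements. The sketch below shows one way to do that. The helper `candidate_synonyms` and the stand-in class `FakeSynset` are hypothetical names introduced here, and the demo data is copied from the first synsets in the output above so that the example runs without downloading WordNet.

```python
def candidate_synonyms(word, synsets):
    """Collect the distinct lemma names of all synsets, excluding the word itself.

    `synsets` is any iterable of objects with a lemma_names() method,
    such as the list returned by wn.synsets(word).
    """
    seen = set()
    candidates = []
    for ss in synsets:
        for lemma in ss.lemma_names():
            name = lemma.replace('_', ' ')  # WordNet joins multiword lemmas with '_'
            key = name.lower()
            if key != word.lower() and key not in seen:
                seen.add(key)
                candidates.append(name)
    return candidates


class FakeSynset:
    """Stand-in for an NLTK synset (hypothetical, for this demo only)."""
    def __init__(self, lemmas):
        self._lemmas = lemmas

    def lemma_names(self):
        return self._lemmas


demo = [
    FakeSynset(['wood']),
    FakeSynset(['forest', 'wood', 'woods']),
    FakeSynset(['Wood', 'Natalie_Wood']),
]
print(candidate_synonyms('wood', demo))
# prints: ['forest', 'woods', 'Natalie Wood']
```

With NLTK installed, the same function can be called as candidate_synonyms('wood', wn.synsets('wood')). Note that the result mixes all meanings of the word, so choosing a replacement that fits a given sentence still requires deciding which sense is intended.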

However, WordNet cannot provide any synonyms for 'abilify' (a medicine), because this word is not in WordNet.

In [0]:
word='abilify'   #is a Drug
showSynsets(word)

Synsets for: abilify
	abilify is not found in WordNet




## Using BabelNet to obtain synonyms


BabelNet is a multilingual dictionary that covers hundreds of languages and can also be used as a semantic network. Unfortunately, the free plan allows only 500 requests per day.

In order to use it, you should follow the instructions described at this link: https://babelnet.org/guide#HowcanIdownloadtheBabelNetindices?


BabelNet is also organized as a network of synsets (sets of synonyms). It contains more than 14 million synsets (concepts) and more than 700 million words. Its semantic network also encodes relations such as synonymy, hyponymy, and meronymy, with a total of 364.000 relations.

The following cell shows how to obtain the synonyms for 'abilify' (a medicine):


In [0]:
import gzip
import json
import urllib.parse
import urllib.request

service_url = 'https://babelnet.io/v4/getSenses'

word = 'abilify'
lang = 'EN'
key  = 'YOUR_API_KEY'  # replace with your own BabelNet API key

params = {
    'word' : word,
    'lang' : lang,
    'key'  : key
}

url = service_url + '?' + urllib.parse.urlencode(params)
request = urllib.request.Request(url)
request.add_header('Accept-encoding', 'gzip')
response = urllib.request.urlopen(request)

# Decompress the body only when the server actually gzipped it
body = response.read()
if response.info().get('Content-Encoding') == 'gzip':
    body = gzip.decompress(body)
data = json.loads(body.decode('utf-8'))

# retrieving BabelSense data
for result in data:
    lemma = result.get('lemma')
    language = result.get('language')
    source = result.get('source')
    print("\t".join(str(field) for field in (language, lemma, source)))

EN	ATCvet_code_QN05AX12	WIKIRED
EN	Abilify	WIKIRED
EN	Ariprazole	WIKIRED
EN	ATC_code_N05AX12	WIKIRED
EN	C23H27Cl2N3O2	WIKIRED
EN	Aripiprozole	WIKIRED
EN	Aripiprex	WIKIRED
EN	Aripiprazole	WIKI
EN	OPC-14597	WIKIRED
EN	Abilitat	WIKIRED
EN	Aripiprazole	WIKIDATA
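
The output above repeats the same lemma for several sources and includes spelling variants such as 'Ariprazole'. The following sketch post-processes the results into a list of distinct lemmas. The helper `unique_lemmas` and its `skip_sources` parameter are hypothetical names introduced here, and treating WIKIRED entries (assumed to be Wikipedia redirects) as less reliable is an assumption, not something BabelNet prescribes.

```python
def unique_lemmas(senses, skip_sources=()):
    """Return distinct lemmas in order of first appearance.

    `senses` is a list of dicts with 'lemma' and 'source' keys, the fields
    read from the response in the cell above. Entries whose source is
    listed in `skip_sources` (for example 'WIKIRED') are ignored.
    """
    seen = set()
    lemmas = []
    for sense in senses:
        if sense.get('source') in skip_sources:
            continue
        lemma = sense.get('lemma', '').replace('_', ' ')
        if lemma and lemma.lower() not in seen:
            seen.add(lemma.lower())
            lemmas.append(lemma)
    return lemmas


# A few rows copied from the output above.
sample = [
    {'language': 'EN', 'lemma': 'Abilify', 'source': 'WIKIRED'},
    {'language': 'EN', 'lemma': 'Aripiprex', 'source': 'WIKIRED'},
    {'language': 'EN', 'lemma': 'Aripiprazole', 'source': 'WIKI'},
    {'language': 'EN', 'lemma': 'Aripiprazole', 'source': 'WIKIDATA'},
]
print(unique_lemmas(sample))
# prints: ['Abilify', 'Aripiprex', 'Aripiprazole']
print(unique_lemmas(sample, skip_sources=('WIKIRED',)))
# prints: ['Aripiprazole']
```

Skipping WIKIRED leaves only 'Aripiprazole' in this sample, which suggests that filtering by source can trade coverage for cleaner candidates.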
