# About

This notebook builds an English synonym dictionary from [`WordNet`](https://wordnet.princeton.edu) via [`nltk`](https://www.nltk.org). Installation instructions for `nltk` are [here](https://www.nltk.org/install.html). Once it is installed, download the English `WordNet` corpus and the stop-word list used later in this notebook with the following code:

```python
>>> import nltk
>>> nltk.download('wordnet')
>>> nltk.download('stopwords')
```

To get started, see the [sample usage](https://www.nltk.org/howto/wordnet.html) for `WordNet` in `nltk`.

**If you do not want to download nltk, that is fine. The compiled synonym dictionary is already in this folder so you can reuse it directly.**

**Please also note that**, following [EDA](https://github.com/jasonwei20/eda_nlp), I did not take parts of speech into account when compiling the English synonym dictionary.

## Initialization

In [1]:
import nltk
from nltk.corpus import wordnet as wn
from nltk.corpus import stopwords

In [2]:
def findSyn(w):
    return [e.lemma_names() for e in wn.synsets(w)]

In [3]:
findSyn('like')

[['like', 'the_like', 'the_likes_of'],
 ['like', 'ilk'],
 ['wish', 'care', 'like'],
 ['like'],
 ['like'],
 ['like'],
 ['like'],
 ['like', 'similar'],
 ['like', 'same'],
 ['alike', 'similar', 'like'],
 ['comparable', 'corresponding', 'like']]
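
The nested output above holds one list per synset, and some lemmas (here `similar`) appear in more than one synset. As a minimal sketch of pooling those lists into a single flat list, the hypothetical helper `flatten_synonyms` below drops the headword and any repeats. It is an illustration only: dropping repeats is an assumption of this sketch, not something the notebook does. The input is copied from the output above so the block runs without `nltk`.

```python
from typing import Iterable, List


def flatten_synonyms(word: str, synset_lemmas: Iterable[List[str]]) -> List[str]:
    """Pool lemma names from all synsets, dropping `word` itself and repeats."""
    seen = set()
    pooled = []
    for lemmas in synset_lemmas:
        for lemma in lemmas:
            if lemma != word and lemma not in seen:
                seen.add(lemma)
                # WordNet joins multi-word lemmas with underscores
                pooled.append(lemma.replace("_", " "))
    return pooled


# The lists returned by findSyn('like'), copied from the output above
like_synsets = [
    ['like', 'the_like', 'the_likes_of'],
    ['like', 'ilk'],
    ['wish', 'care', 'like'],
    ['like'],
    ['like'],
    ['like'],
    ['like'],
    ['like', 'similar'],
    ['like', 'same'],
    ['alike', 'similar', 'like'],
    ['comparable', 'corresponding', 'like'],
]

flat = flatten_synonyms('like', like_synsets)
print(flat)
# ['the like', 'the likes of', 'ilk', 'wish', 'care', 'similar', 'same',
#  'alike', 'comparable', 'corresponding']
```

Because parts of speech are ignored, noun senses (`ilk`), verb senses (`wish`, `care`) and adjective senses (`similar`, `same`) all end up in the same list.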

## Words and phrases

In [4]:
stop_words = set(stopwords.words('english'))  # stop words are excluded; a set makes the membership test fast

words = [w for w in wn.all_lemma_names() if w not in stop_words]
len(words), words[:10]

(147229,
 ['.22-caliber',
  '.22-calibre',
  '.22_caliber',
  '.22_calibre',
  '.38-caliber',
  '.38-calibre',
  '.38_caliber',
  '.38_calibre',
  '.45-caliber',
  '.45-calibre'])

In [5]:
phrases = [w for w in words if '_' in w]
len(phrases), phrases[:10]

(64188,
 ['.22_caliber',
  '.22_calibre',
  '.38_caliber',
  '.38_calibre',
  '.45_caliber',
  '.45_calibre',
  'a_cappella',
  'a_couple_of',
  'a_few',
  'a_la_carte'])

## Building the synonym dictionary

In [6]:
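syn_wn = {}


def trans_fn(x):
    # WordNet joins multi-word lemmas with underscores; use spaces instead
    return x.replace("_", " ")


for w in words:
    added = []
    for lemmas in findSyn(w):
        # keep every lemma of the synset except the headword itself
        others = lemmas.copy()
        if w in others:
            others.remove(w)
        added.extend(trans_fn(lemma) for lemma in others)
    # words whose synsets contain no other lemma are left out
    if added:
        syn_wn[trans_fn(w)] = added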

print(len(syn_wn))

117131


In [7]:
keys = list(syn_wn.keys())
keys[:10]

['.22-caliber',
 '.22-calibre',
 '.22 caliber',
 '.22 calibre',
 '.38-caliber',
 '.38-calibre',
 '.38 caliber',
 '.38 calibre',
 '.45-caliber',
 '.45-calibre']

In [8]:
syn_wn['.22 caliber']

['.22-caliber', '.22 calibre', '.22-calibre']
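
A lookup like the one above is the basic step in EDA-style synonym replacement. The sketch below is a hypothetical illustration of how the dictionary could be used for that purpose, not code from EDA or from this notebook: the function name `replace_with_synonyms`, its parameters, and the toy dictionary are all assumptions.

```python
import random
from typing import Dict, List


def replace_with_synonyms(
    tokens: List[str],
    syn_dict: Dict[str, List[str]],
    n: int,
    rng: random.Random,
) -> List[str]:
    """Replace up to `n` tokens that have an entry in `syn_dict` with a random synonym."""
    out = list(tokens)
    # positions of tokens that have at least one synonym
    candidates = [i for i, tok in enumerate(out) if syn_dict.get(tok)]
    rng.shuffle(candidates)
    for i in candidates[:n]:
        out[i] = rng.choice(syn_dict[out[i]])
    return out


# Toy stand-in for syn_wn (hypothetical data, one synonym per key)
toy_syn = {'alike': ['similar'], 'ilk': ['like']}
rng = random.Random(0)  # seeded so the run is repeatable

augmented = replace_with_synonyms(['they', 'look', 'alike'], toy_syn, 1, rng)
print(augmented)
# ['they', 'look', 'similar']
```

Tokens without an entry are left untouched, so the function is safe to call on text that contains stop words or out-of-vocabulary words.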

## Saving the synonym dictionary

In [9]:
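import pickle

# write the dictionary to a file named 'synonyms' in the working directory;
# the with-block closes the file once the dump has finished
with open('synonyms', 'wb') as f:
    pickle.dump(syn_wn, f)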
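
To reuse the saved dictionary without `nltk`, the file can be read back with `pickle.load`. The following is a minimal round-trip sketch: it uses a one-entry toy dictionary and a temporary directory (both assumptions for illustration) so that it runs without the real `synonyms` file.

```python
import os
import pickle
import tempfile

# Toy stand-in for syn_wn, using the entry shown earlier in the notebook
toy_syn = {'.22 caliber': ['.22-caliber', '.22 calibre', '.22-calibre']}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'synonyms')

    # save, as in the cell above
    with open(path, 'wb') as f:
        pickle.dump(toy_syn, f)

    # load it back
    with open(path, 'rb') as f:
        loaded = pickle.load(f)

# .get with a default avoids a KeyError for words that have no entry
print(loaded.get('.22 caliber', []))
print(loaded.get('no such word', []))
```

With the real file, only the load step is needed: open `synonyms` in `'rb'` mode and pass the file object to `pickle.load`. Only unpickle files from a source you trust, since `pickle` can execute arbitrary code while loading.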