# Word Ambiguity

To get ambiguous and unambiguous words, I will probably consult WordNet again and use the number of synsets for each word.

First I need a list of suitable nouns, for example all the nouns from SUBTLEX joined with the words from the concreteness corpus (which at least seem to be sensible words).

Alternatively, this dataset on lexical ambiguity might be a good fit:
https://onlinelibrary.wiley.com/doi/10.1111/cogs.12943
It distinguishes monosemes, polysemes, and homonyms, so it would give me three word lists.

In [3]:
import pandas as pd
word_ambiguity_df = pd.read_csv('all_words_covar.csv')

In [5]:
# Keep only the word, the two sense counts, the category (M/P/H), and the noun sense count
word_ambiguity_df = word_ambiguity_df[['word', 'n.senses|f', 'nSenseWordsmyth|s', 'cat|s', 'n.noun.senses|f']]
word_ambiguity_df

Unnamed: 0,word,n.senses|f,nSenseWordsmyth|s,cat|s,n.noun.senses|f
0,lifelong,1,1,M,0
1,courier,1,1,M,1
2,keg,2,2,P,2
3,aristocrat,3,3,P,3
4,violate,4,4,P,0
...,...,...,...,...,...
5390,hypocrisy,1,1,M,1
5391,tame,7,7,P,0
5392,bulldozer,1,1,M,1
5393,batter,5,5,H,2


In [7]:
# Check whether the two sense-count columns are identical, so that one of them can be dropped
all(word_ambiguity_df['n.senses|f'] == word_ambiguity_df['nSenseWordsmyth|s'])

True

In [9]:
word_ambiguity_df.drop(columns = ['nSenseWordsmyth|s'], inplace = True)

In [11]:
word_ambiguity_df.columns = ['word', 'number of senses', 'category', 'number of noun senses']

In [12]:
word_ambiguity_df

Unnamed: 0,word,number of senses,category,number of noun senses
0,lifelong,1,M,0
1,courier,1,M,1
2,keg,2,P,2
3,aristocrat,3,P,3
4,violate,4,P,0
...,...,...,...,...
5390,hypocrisy,1,M,1
5391,tame,7,P,0
5392,bulldozer,1,M,1
5393,batter,5,H,2


In [21]:
# Keep only words that have at least one noun sense
word_ambiguity_df = word_ambiguity_df[word_ambiguity_df['number of noun senses'] > 0]
word_ambiguity_df

Unnamed: 0,word,number of senses,category,number of noun senses
1,courier,1,M,1
2,keg,2,P,2
3,aristocrat,3,P,3
5,novelty,3,P,3
6,pirate,7,P,3
...,...,...,...,...
5389,portrait,2,P,2
5390,hypocrisy,1,M,1
5392,bulldozer,1,M,1
5393,batter,5,H,2


In [22]:
monosemes = word_ambiguity_df[word_ambiguity_df['category'] == 'M']
monosemes

Unnamed: 0,word,number of senses,category,number of noun senses
1,courier,1,M,1
7,screenplay,1,M,1
8,gasoline,1,M,1
21,uranium,1,M,1
24,marijuana,1,M,1
...,...,...,...,...
5368,papaya,1,M,1
5372,dune,1,M,1
5387,bathrobe,1,M,1
5390,hypocrisy,1,M,1


In [23]:
polysemes = word_ambiguity_df[word_ambiguity_df['category'] == 'P']
polysemes

Unnamed: 0,word,number of senses,category,number of noun senses
2,keg,2,P,2
3,aristocrat,3,P,3
5,novelty,3,P,3
6,pirate,7,P,3
9,fringe,5,P,3
...,...,...,...,...
5385,leaf,7,P,5
5386,despite,3,P,2
5388,reputation,3,P,3
5389,portrait,2,P,2


In [26]:
polysemes[polysemes['number of noun senses'] > 1]

Unnamed: 0,word,number of senses,category,number of noun senses
2,keg,2,P,2
3,aristocrat,3,P,3
5,novelty,3,P,3
6,pirate,7,P,3
9,fringe,5,P,3
...,...,...,...,...
5385,leaf,7,P,5
5386,despite,3,P,2
5388,reputation,3,P,3
5389,portrait,2,P,2


In [28]:
polysemes['number of senses'].value_counts()

number of senses
2     816
3     646
4     423
5     315
6     240
7     193
8     120
9      69
11     56
10     56
12     38
14     25
13     25
16     10
15      8
18      6
19      6
21      4
17      4
24      3
20      3
22      2
29      1
25      1
Name: count, dtype: int64

In [24]:
homonyms = word_ambiguity_df[word_ambiguity_df['category'] == 'H']
homonyms

Unnamed: 0,word,number of senses,category,number of noun senses
17,pitch,25,H,11
25,pink,8,H,3
67,hind,3,H,2
96,toaster,2,H,2
136,angle,12,H,7
...,...,...,...,...
5319,fudge,4,H,1
5352,flaw,7,H,5
5374,liver,6,H,5
5381,ounce,4,H,4


In [25]:
homonyms[homonyms['number of noun senses'] > 1]

Unnamed: 0,word,number of senses,category,number of noun senses
17,pitch,25,H,11
25,pink,8,H,3
67,hind,3,H,2
96,toaster,2,H,2
136,angle,12,H,7
...,...,...,...,...
5313,bust,12,H,4
5352,flaw,7,H,5
5374,liver,6,H,5
5381,ounce,4,H,4


In [27]:
homonyms['number of senses'].value_counts()

number of senses
4     44
6     40
7     29
5     27
8     23
12    23
3     22
10    20
9     18
11    16
2     16
13    11
15     7
16     6
18     5
17     5
14     5
20     4
21     2
23     1
19     1
25     1
Name: count, dtype: int64

Let's take 300 monosemes and 300 homonyms and be done with it. Maybe only 200 of each, because then the homonym list is limited to words with more senses, and that will still be enough for 10 boards...
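
To make the 200-versus-300 trade-off concrete, here is a rough sketch that works out the smallest number of senses that would still make it into the homonym list for each size. The counts are copied from the `value_counts()` output for the homonyms above; the helper name `min_senses_in_top` is made up for this illustration.

```python
# Number of homonyms per sense count, copied from the value_counts() output above.
homonym_sense_counts = {
    4: 44, 6: 40, 7: 29, 5: 27, 8: 23, 12: 23, 3: 22, 10: 20, 9: 18,
    11: 16, 2: 16, 13: 11, 15: 7, 16: 6, 18: 5, 17: 5, 14: 5, 20: 4,
    21: 2, 23: 1, 19: 1, 25: 1,
}


def min_senses_in_top(n, sense_counts):
    """Return the smallest sense count among the n words with the most senses."""
    taken = 0
    # Walk from the highest sense count downwards until n words are covered.
    for senses in sorted(sense_counts, reverse=True):
        taken += sense_counts[senses]
        if taken >= n:
            return senses
    raise ValueError("fewer than n words available")


print(sum(homonym_sense_counts.values()))             # 326 homonyms in total
print(min_senses_in_top(200, homonym_sense_counts))   # 6
print(min_senses_in_top(300, homonym_sense_counts))   # 3
```

So a list of 200 homonyms bottoms out at 6 senses, while a list of 300 would have to reach down to words with only 3 senses.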

In [33]:
import json
with open('unambiguous_words.json', 'w') as file:
    json.dump({"words": list(monosemes.sample(n=200, random_state = 1)['word'])}, file)

In [36]:
# The 200 homonyms with the most senses (sorted ascending, so the largest come last)
homonyms.sort_values('number of senses').tail(200)

Unnamed: 0,word,number of senses,category,number of noun senses
4131,stripe,6,H,5
4529,bush,6,H,5
3804,pro,6,H,4
4629,spear,6,H,3
3161,rail,6,H,4
...,...,...,...,...
556,flush,20,H,6
4018,slip,21,H,9
1975,tip,21,H,10
1188,stick,23,H,7
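
One caveat: according to the value counts above, 177 homonyms have 7 or more senses and 40 have exactly 6, so the cut at 200 falls inside the group of 6-sense words. Which of those 40 end up in the list depends on how the sort happens to order ties. A minimal sketch of a deterministic alternative, breaking ties alphabetically, is shown below in plain Python on a few rows from the table above; `top_n_words` is a hypothetical helper, not part of the notebook. In pandas, sorting on both columns, e.g. `sort_values(['number of senses', 'word'], ascending=[False, True]).head(200)`, should have the same effect.

```python
# (word, number of senses) pairs; a small subset of the rows shown above.
rows = [
    ("stripe", 6), ("bush", 6), ("pro", 6), ("spear", 6), ("rail", 6),
    ("flush", 20), ("slip", 21), ("tip", 21), ("stick", 23),
]


def top_n_words(rows, n):
    """Pick the n words with the most senses, breaking ties alphabetically."""
    # Sort by descending sense count first, then by word, so ties are reproducible.
    ranked = sorted(rows, key=lambda row: (-row[1], row[0]))
    return [word for word, _ in ranked[:n]]


print(top_n_words(rows, 6))
# ['stick', 'slip', 'tip', 'flush', 'bush', 'pro']
```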


In [37]:
with open('ambiguous_words.json', 'w') as file:
    json.dump({"words": list(homonyms.sort_values('number of senses').tail(200)['word'])}, file)
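
Before using the two files, a quick sanity check might be worthwhile. The sketch below assumes the `{"words": [...]}` layout written above; `load_words` and `check_word_lists` are hypothetical helpers, and the checks (expected size, no duplicates, no overlap between the lists) are my assumptions about what the boards need.

```python
import json


def load_words(path):
    """Read a {"words": [...]} JSON file and return the list of words."""
    with open(path) as file:
        return json.load(file)["words"]


def check_word_lists(unambiguous, ambiguous, expected_size=200):
    """Return a list of problems found; an empty list means nothing was flagged."""
    problems = []
    for name, words in (("unambiguous", unambiguous), ("ambiguous", ambiguous)):
        if len(words) != expected_size:
            problems.append(f"{name}: expected {expected_size} words, got {len(words)}")
        if len(set(words)) != len(words):
            problems.append(f"{name}: contains duplicate words")
    # A word should not appear in both lists.
    overlap = set(unambiguous) & set(ambiguous)
    if overlap:
        problems.append(f"overlap between lists: {sorted(overlap)}")
    return problems
```

In the notebook this could be called as `check_word_lists(load_words('unambiguous_words.json'), load_words('ambiguous_words.json'))`.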