# Word Concreteness norms

Rate the words in the standard category word list by concreteness, as a basis for building new word lists.
Concreteness ratings were taken from https://link.springer.com/article/10.3758/s13428-013-0403-5 and converted to CSV.

In [69]:
# load default.json
import json, pandas as pd
with open('../category norms/default.json') as file:
    default_wordlist = json.load(file)["words"]

default_df = pd.DataFrame(default_wordlist, columns = ['Word']).set_index('Word')
len(default_df)

795

In [70]:
concreteness_df = pd.read_csv('word concreteness.csv', delimiter = ';')
# The CSV uses decimal commas, so these columns are read as strings.
# Plain column assignment (instead of .loc[:, col]) makes the column dtype actually become float.
for col in ['Conc.M', 'Conc.SD', 'Percent_known']:
    concreteness_df[col] = concreteness_df[col].str.replace(',', '.').astype(float)
concreteness_df

Unnamed: 0,Word,Bigram,Conc.M,Conc.SD,Unknown,Total,Percent_known,SUBTLEX
0,a,0,1.46,1.14,2,30,0.93,1041179
1,aardvark,0,4.68,0.86,0,28,1.0,21
2,aback,0,1.65,1.07,4,27,0.85,15
3,abacus,0,4.52,1.12,2,29,0.93,12
4,abandon,0,2.54,1.45,1,27,0.96,413
...,...,...,...,...,...,...,...,...
37053,zoologist,0,4.3,1.02,0,30,1.0,12
37054,zoology,0,3.37,1.55,0,27,1.0,17
37055,zoom,0,3.1,1.49,0,30,1.0,181
37056,zoophobia,0,2.04,1.02,2,25,0.92,0
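As a side note, pandas can parse decimal commas directly while reading, which would avoid the per-column string replacement. The following is only a minimal sketch on a toy in-memory CSV (the two rows are copied from the table above, the column subset is my own choice); I have not run it against the real 'word concreteness.csv'.

```python
import io
import pandas as pd

# Toy stand-in for 'word concreteness.csv' with the same ';' delimiter and decimal commas.
raw = io.StringIO(
    "Word;Conc.M;Conc.SD;Percent_known\n"
    "aardvark;4,68;0,86;1\n"
    "aback;1,65;1,07;0,85\n"
)

# decimal=',' tells the parser to treat the comma as the decimal separator.
toy_df = pd.read_csv(raw, delimiter=';', decimal=',')
print(toy_df.dtypes)
```

With this option the numeric columns should arrive as floats, so no later `.astype(float)` calls would be needed.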


In [71]:
default_with_concreteness = concreteness_df.join(default_df, on='Word', how= "right")
default_with_concreteness

Unnamed: 0,Word,Bigram,Conc.M,Conc.SD,Unknown,Total,Percent_known,SUBTLEX
31090.0,sugar,0.0,4.87,0.47,2.0,6055.0,1.0,1926.0
19369.0,manslaughter,0.0,3.37,1.31,0.0,27.0,1.0,140.0
,socks,,,,,,,
3173.0,bluegrass,0.0,3.52,1.6,0.0,29.0,1.0,37.0
31875.0,techno,0.0,2.96,1.34,2.0,27.0,0.93,38.0
...,...,...,...,...,...,...,...,...
32536.0,tomato,0.0,5.0,0.0,0.0,27.0,1.0,301.0
29596.0,soldier,0.0,4.72,0.59,0.0,29.0,1.0,1985.0
17662.0,jeans,0.0,5.0,0.0,0.0,28.0,1.0,337.0
36876.0,wrist,0.0,4.93,0.37,1.0,30.0,0.97,527.0


In [72]:
default_with_concreteness[default_with_concreteness['Conc.M'].isna()]['Word']

NaN         socks
NaN         india
NaN      clippers
NaN        sweden
NaN            rv
NaN        france
NaN      portugal
NaN       ireland
NaN        poland
NaN        canada
NaN       grouper
NaN           suv
NaN         spain
NaN        norway
NaN           atv
NaN          yurt
NaN       crappie
NaN     hydrangea
NaN         korea
NaN    rototiller
NaN        oldies
NaN        planer
NaN         taser
NaN       germany
NaN          iran
NaN      turmeric
NaN          peru
NaN        barbie
NaN        gloves
NaN       tilapia
NaN     bicycling
NaN           r&b
NaN           hiv
NaN        blocks
NaN       peppers
NaN         jacks
NaN        russia
NaN        africa
NaN        mexico
NaN    nor’easter
NaN         shoes
NaN         italy
NaN        chives
NaN          imam
NaN     australia
NaN           asp
NaN           koi
NaN         lungs
NaN      scotland
NaN        clouds
NaN        sander
NaN       england
NaN           ska
NaN       halibut
NaN         legos
Name: Word
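The missing words are mostly plurals (socks, gloves, shoes), country names, and abbreviations. An alternative way to list them, sketched here on toy data, is a left merge with `indicator=True`, which also keeps a clean integer index instead of the NaN labels above. The frames `words` and `ratings` are hypothetical stand-ins for the default word list and `concreteness_df`.

```python
import pandas as pd

# Toy stand-ins: four words from the default list, two of which have ratings.
words = pd.DataFrame({'Word': ['sugar', 'socks', 'india', 'wrist']})
ratings = pd.DataFrame({'Word': ['sugar', 'wrist'], 'Conc.M': [4.87, 4.93]})

# indicator=True adds a '_merge' column telling where each row was found.
merged = words.merge(ratings, on='Word', how='left', indicator=True)
missing = merged.loc[merged['_merge'] == 'left_only', 'Word'].tolist()
print(missing)
```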

In [73]:
default_with_concreteness = default_with_concreteness.dropna().drop(columns=['Bigram', 'Unknown', 'Total', 'Percent_known'])
default_with_concreteness

Unnamed: 0,Word,Conc.M,Conc.SD,SUBTLEX
31090.0,sugar,4.87,0.47,1926.0
19369.0,manslaughter,3.37,1.31,140.0
3173.0,bluegrass,3.52,1.6,37.0
31875.0,techno,2.96,1.34,38.0
35924.0,wall,4.86,0.45,3605.0
...,...,...,...,...
32536.0,tomato,5.0,0.0,301.0
29596.0,soldier,4.72,0.59,1985.0
17662.0,jeans,5.0,0.0,337.0
36876.0,wrist,4.93,0.37,527.0


In [74]:
default_with_concreteness['Conc.M'].astype(float).round().value_counts()

Conc.M
5.0    505
4.0    181
3.0     43
2.0     11
Name: count, dtype: int64

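This is not a distribution I want to work with: 505 of the 740 rated default words round to 5.0, and none round to 1.0, so the default list is almost entirely concrete.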

In [75]:
concreteness_df['Conc.M'].astype(float).round().value_counts()

Conc.M
2.0    14689
3.0    10043
4.0     7651
5.0     3968
1.0      707
Name: count, dtype: int64
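To make the skew easier to see, the two sets of counts can be put side by side as proportions. This is just a sketch that re-enters the counts printed above by hand; the variable names are my own.

```python
import pandas as pd

# Counts copied from the two value_counts() outputs above.
default_counts = pd.Series({2.0: 11, 3.0: 43, 4.0: 181, 5.0: 505})
all_counts = pd.Series({1.0: 707, 2.0: 14689, 3.0: 10043, 4.0: 7651, 5.0: 3968})

# Building a DataFrame from two Series aligns them on the rounded rating;
# ratings absent from one list become NaN and are filled with 0.
comparison = pd.DataFrame({
    'default': default_counts / default_counts.sum(),
    'all_rated': all_counts / all_counts.sum(),
}).fillna(0).round(3)
print(comparison)
```

Roughly two thirds of the default words round to 5.0, compared with about a tenth of all rated words.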

In [76]:
concreteness_df[concreteness_df['Conc.M'].astype(float) < 2]

Unnamed: 0,Word,Bigram,Conc.M,Conc.SD,Unknown,Total,Percent_known,SUBTLEX
0,a,0,1.46,1.14,2,30,0.93,1041179
2,aback,0,1.65,1.07,4,27,0.85,15
9,abasement,0,1.67,1.01,2,26,0.92,0
10,abatement,0,1.92,1.29,4,30,0.87,3
22,abhor,0,1.81,1.17,3,29,0.9,12
...,...,...,...,...,...,...,...,...
37008,zany,0,1.58,0.95,4,30,0.87,10
37013,zealously,0,1.45,0.57,0,29,1.0,3
37016,zen,0,1.55,1.12,0,29,1.0,130
37029,zillionth,0,1.92,1.35,4,28,0.86,3


Next, I should restrict the full rating list to nouns. SUBTLEX-US gives a dominant part of speech for each word, so can I use it again as the filter?

In [77]:
subtlex = pd.read_csv('../word frequency/SUBTLEX-US frequency list with PoS and Zipf information.csv', delimiter = ';')
subtlex

Unnamed: 0,Word,FREQcount,CDcount,FREQlow,Cdlow,SUBTLWF,Lg10WF,SUBTLCD,Lg10CD,Dom_PoS_SUBTLEX,Freq_dom_PoS_SUBTLEX,Percentage_dom_PoS,All_PoS_SUBTLEX,All_freqs_SUBTLEX,Zipf-value
0,a,1041179,8382,976941,8380,20415.27,6.0175,99.93,3.9234,Article,993445.0,0.96,Article.Adverb.Letter.To.Noun.Preposition.Adje...,993445.33186.6441.744.257.52.5,7.309360
1,aa,87,70,6,5,1.71,1.9445,0.83,1.8513,Name,79.0,0.92,Name.Noun,79.7,3.236317
2,aaa,25,23,5,3,0.49,1.4150,0.27,1.3802,Name,20.0,0.80,Name.Noun,20.5,2.706807
3,aah,2688,634,52,37,52.71,3.4296,7.56,2.8028,Interjection,2657.0,1.00,Interjection,2657,4.721425
4,aahed,1,1,1,1,0.02,0.3010,0.01,0.3010,Verb,1.0,1.00,Verb,1,1.592864
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
74281,zygoma,2,2,1,1,0.04,0.4771,0.02,0.4771,Noun,2.0,1.00,Noun,2,1.768955
74282,zygomatic,5,5,5,5,0.10,0.7782,0.06,0.7782,Adjective,5.0,1.00,Adjective,5,2.069985
74283,zygote,7,5,4,3,0.14,0.9031,0.06,0.7782,Noun,4.0,0.57,Noun.Verb,4.3,2.194924
74284,zygotes,1,1,1,1,0.02,0.3010,0.01,0.3010,Noun,1.0,1.00,Noun,1,1.592864


In [78]:
subtlex_nouns = subtlex[subtlex['Dom_PoS_SUBTLEX'] == 'Noun']
subtlex_nouns

Unnamed: 0,Word,FREQcount,CDcount,FREQlow,Cdlow,SUBTLWF,Lg10WF,SUBTLCD,Lg10CD,Dom_PoS_SUBTLEX,Freq_dom_PoS_SUBTLEX,Percentage_dom_PoS,All_PoS_SUBTLEX,All_freqs_SUBTLEX,Zipf-value
6,aahs,5,4,5,4,0.10,0.7782,0.05,0.6990,Noun,5.0,1.00,Noun,5,2.069985
8,aardvark,21,12,14,8,0.41,1.3424,0.14,1.1139,Noun,16.0,0.76,Noun.Verb.Name,16.3.2,2.634257
12,aarrghh,1,1,0,0,0.02,0.3010,0.01,0.3010,Noun,1.0,1.00,Noun,1,1.592864
13,aas,2,2,1,1,0.04,0.4771,0.02,0.4771,Noun,1.0,0.50,Noun.Name,1.1,1.768955
15,aba,4,4,1,1,0.08,0.6990,0.05,0.6990,Noun,2.0,1.00,Noun,2,1.990804
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
74280,zydeco,14,3,9,2,0.27,1.1761,0.04,0.6021,Noun,10.0,0.71,Noun.Name,10.4,2.467925
74281,zygoma,2,2,1,1,0.04,0.4771,0.02,0.4771,Noun,2.0,1.00,Noun,2,1.768955
74283,zygote,7,5,4,3,0.14,0.9031,0.06,0.7782,Noun,4.0,0.57,Noun.Verb,4.3,2.194924
74284,zygotes,1,1,1,1,0.02,0.3010,0.01,0.3010,Noun,1.0,1.00,Noun,1,1.592864


In [79]:
concreteness_df['Word']

0                a
1         aardvark
2            aback
3           abacus
4          abandon
           ...    
37053    zoologist
37054      zoology
37055         zoom
37056    zoophobia
37057     zucchini
Name: Word, Length: 37058, dtype: object

In [80]:
subtlex_nouns['Word']

6            aahs
8        aardvark
12        aarrghh
13            aas
15            aba
           ...   
74280      zydeco
74281      zygoma
74283      zygote
74284     zygotes
74285     zymurgy
Name: Word, Length: 37333, dtype: object

In [81]:
concreteness_df

Unnamed: 0,Word,Bigram,Conc.M,Conc.SD,Unknown,Total,Percent_known,SUBTLEX
0,a,0,1.46,1.14,2,30,0.93,1041179
1,aardvark,0,4.68,0.86,0,28,1.0,21
2,aback,0,1.65,1.07,4,27,0.85,15
3,abacus,0,4.52,1.12,2,29,0.93,12
4,abandon,0,2.54,1.45,1,27,0.96,413
...,...,...,...,...,...,...,...,...
37053,zoologist,0,4.3,1.02,0,30,1.0,12
37054,zoology,0,3.37,1.55,0,27,1.0,17
37055,zoom,0,3.1,1.49,0,30,1.0,181
37056,zoophobia,0,2.04,1.02,2,25,0.92,0


In [82]:
concreteness_nouns = concreteness_df[['Word', 'Conc.M', 'Conc.SD']].merge(subtlex_nouns[['Word']], on='Word', how= "inner")

In [91]:
concreteness_nouns = concreteness_nouns.astype({'Conc.M':float, 'Conc.SD':float})
concreteness_nouns

Unnamed: 0,Word,Conc.M,Conc.SD
0,aardvark,4.68,0.86
1,abacus,4.52,1.12
2,abandoner,2.50,1.50
3,abandonment,2.54,1.29
4,abatement,1.92,1.29
...,...,...,...
14692,zoo,4.81,0.40
14693,zookeeper,4.71,0.53
14694,zoologist,4.30,1.02
14695,zoology,3.37,1.55


In [84]:
concreteness_nouns.to_csv('concreteness_nouns.csv')

Interestingly, the nouns that also have a concreteness rating look far cleaner than the SUBTLEX nouns alone, probably because the rated list does not contain proper names and similar entries.

In [92]:
concreteness_nouns['Conc.M'].round().value_counts()

Conc.M
4.0    4595
3.0    3603
5.0    3382
2.0    3013
1.0     104
Name: count, dtype: int64

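Because there are so many words per rating, I'm going to use words with a mean rating of at most 2.5 (rounding to 1.0 or 2.0) as abstract, possibly those rounding to 3.0 as medium, and those above 4.5 (rounding to 5.0) as concrete. Within each tier I can sort by standard deviation and keep the 500 words with the lowest SD, i.e. the words the raters agreed on most, to get solid words for each tier.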
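The tier selection could be wrapped in one small helper. This is a sketch only: `select_tier` is a hypothetical name, the medium band of 2.5 to 3.5 is an assumption (no medium list is actually built here), and the toy rows are copied from tables in this notebook. A stable sort is used so that words with tied SDs keep their original order, since many concrete words share an SD of 0.00.

```python
import pandas as pd

def select_tier(df, lower, upper, n):
    """Return the n rows with lower < Conc.M <= upper and the smallest Conc.SD."""
    in_range = df[(df['Conc.M'] > lower) & (df['Conc.M'] <= upper)]
    # kind='stable' keeps the existing row order among rows with equal SD.
    return in_range.sort_values('Conc.SD', kind='stable').head(n)

# Toy rows copied from the concreteness_nouns table.
toy = pd.DataFrame({
    'Word': ['aardvark', 'abacus', 'abandoner', 'abatement', 'zoo', 'zoology'],
    'Conc.M': [4.68, 4.52, 2.50, 1.92, 4.81, 3.37],
    'Conc.SD': [0.86, 1.12, 1.50, 1.29, 0.40, 1.55],
})

abstract = select_tier(toy, 0.0, 2.5, n=2)
medium = select_tier(toy, 2.5, 3.5, n=2)   # assumed band for the optional medium tier
concrete = select_tier(toy, 4.5, 5.0, n=2)
print(list(abstract['Word']), list(medium['Word']), list(concrete['Word']))
```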

In [100]:
abstract_nouns = concreteness_nouns[concreteness_nouns['Conc.M'] <= 2.5].sort_values('Conc.SD').head(500)
abstract_nouns

Unnamed: 0,Word,Conc.M,Conc.SD
12335,spirituality,1.07,0.37
12183,someway,1.24,0.52
6644,inconsequence,1.28,0.54
7528,limitlessness,1.38,0.56
7432,lenience,1.39,0.57
...,...,...,...
2645,comprehensiveness,1.71,0.97
8098,merit,1.66,0.97
1407,bonkers,1.76,0.97
12738,subjugation,1.92,0.97


In [101]:
with open('abstract_words.json', 'w') as file:
    json.dump({"words": list(abstract_nouns['Word'])}, file)

In [106]:
concrete_nouns = concreteness_nouns[concreteness_nouns['Conc.M'] > 4.5].sort_values('Conc.SD').head(500)
concrete_nouns

Unnamed: 0,Word,Conc.M,Conc.SD
1075,bedsheet,5.00,0.00
5049,flute,5.00,0.00
1002,bat,5.00,0.00
5034,flower,5.00,0.00
13641,tree,5.00,0.00
...,...,...,...
2390,clipboard,4.93,0.27
2203,chest,4.93,0.27
13988,urine,4.93,0.27
8650,needle,4.93,0.27


In [107]:
with open('concrete_words.json', 'w') as file:
    json.dump({"words": list(concrete_nouns['Word'])}, file)
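A quick sanity check on the written files might look like the sketch below: reload a list through the same `{"words": [...]}` layout that default.json uses, and confirm the abstract and concrete lists share no words. It writes to a temporary directory and uses a few words from the tables above rather than the real files.

```python
import json
import os
import tempfile

# A few words taken from the abstract and concrete tables above.
abstract_words = ['spirituality', 'merit']
concrete_words = ['flute', 'tree']

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'abstract_words.json')
    with open(path, 'w') as file:
        json.dump({"words": abstract_words}, file)
    # Read the list back the same way default.json is loaded.
    with open(path) as file:
        reloaded = json.load(file)["words"]

# The two tiers use disjoint rating ranges, so they should share no words.
overlap = set(abstract_words) & set(concrete_words)
print(reloaded, overlap)
```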