# Word ease analysis
Several popular readability measures use as a feature the number of "difficult words," which in some cases are defined as any word not on a list of "easy" words. This seems likely to become less useful over time as diction changes. The purpose of this notebook is to investigate how much these word lists have been affected by the change in English word use, using Google n-gram data.

## The easy word lists
1. Dale and Chall made the first "easy" word list, which was generated by selecting words that 80% of surveyed 4th graders said they were familiar with. The original list was made in 1948. The one we are going to use was updated in 1995.
2. The Spache formula was proposed by George Spache in 1953. It was intended to help classify reading material grade level for the first few grades in particular, and therefore it uses a subset containing 769 of the **original** Dale-Chall easy words. This list does not appear to have been updated since then.

In [1]:
%pylab inline
import pandas as pd
from tqdm.notebook import tqdm

Populating the interactive namespace from numpy and matplotlib


The years we are concerned with are the year that each word list was constructed and the last year of the ngrams data, 2008.

In [2]:
years = [1953, 1995, 2008]

## Get the n-gram "A" data
Let's restrict ourselves to just the words on the list that begin with the letter "A." We'll also only consider words longer than 1 character, since it is probable that kids will continue to remember the one-character words like "a" and "I."

In [3]:
!wget http://storage.googleapis.com/books/ngrams/books/googlebooks-eng-all-1gram-20120701-a.gz

--2021-09-27 21:15:22--  http://storage.googleapis.com/books/ngrams/books/googlebooks-eng-all-1gram-20120701-a.gz
Resolving storage.googleapis.com (storage.googleapis.com)... 172.253.122.128, 142.250.65.80, 172.253.115.128, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|172.253.122.128|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 343487895 (328M) [binary/octet-stream]
Saving to: ‘googlebooks-eng-all-1gram-20120701-a.gz’


2021-09-27 21:15:39 (20.1 MB/s) - ‘googlebooks-eng-all-1gram-20120701-a.gz’ saved [343487895/343487895]



In [4]:
! gunzip googlebooks-eng-all-1gram-20120701-a.gz 


gzip: googlebooks-eng-all-1gram-20120701-a: No space left on device


In [5]:
! head ../Data/ngrams/googlebooks-eng-all-1gram-20120701-a

A'Aang_NOUN	1879	45	5
A'Aang_NOUN	1882	5	4
A'Aang_NOUN	1885	1	1
A'Aang_NOUN	1891	1	1
A'Aang_NOUN	1899	20	4
A'Aang_NOUN	1927	3	1
A'Aang_NOUN	1959	5	2
A'Aang_NOUN	1962	2	2
A'Aang_NOUN	1963	1	1
A'Aang_NOUN	1966	45	13


In [6]:
! mv googlebooks-eng-all-1gram-20120701-a* ../Data/ngrams/

In [7]:
df_a = pd.read_csv("../Data/ngrams/googlebooks-eng-all-1gram-20120701-a", delimiter="\t", names=["word", "year", "counts", "num_books"])#, dtype={"word":str, "year": int, "counts": int, "num_books":int})

In [8]:
df_a.head()

Unnamed: 0,word,year,counts,num_books
0,A'Aang_NOUN,1879,45,5
1,A'Aang_NOUN,1882,5,4
2,A'Aang_NOUN,1885,1,1
3,A'Aang_NOUN,1891,1,1
4,A'Aang_NOUN,1899,20,4


In [9]:
df_a.shape

(86618505, 4)

In [10]:
df_a[df_a["year"]==2008].shape

(1212585, 4)

In [11]:
df_a = df_a[(df_a["year"]==1953)|(df_a["year"]==1995)|(df_a["year"]==2008)]

In [12]:
df_a.shape

(2687817, 4)

In [13]:
sums = df_a.groupby(["year"]).sum()

In [14]:
sums

Unnamed: 0_level_0,counts,num_books
year,Unnamed: 1_level_1,Unnamed: 2_level_1
1953,299648748,17700512
1995,1542556650,92813649
2008,3679259804,226938787


In [15]:
sums.loc[1953]["counts"]

299648748

## Get the "A" words that are easy
This requires that we interact with the Spache and Dale-Chall lists.

In [16]:
!wget https://raw.githubusercontent.com/cdimascio/py-readability-metrics/3ffb97f6057ae2451599d083a69ece78a61a6fa4/readability/data/spache_easy.txt

--2021-09-27 21:16:25--  https://raw.githubusercontent.com/cdimascio/py-readability-metrics/3ffb97f6057ae2451599d083a69ece78a61a6fa4/readability/data/spache_easy.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.108.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5978 (5.8K) [text/plain]
Saving to: ‘spache_easy.txt’


2021-09-27 21:16:25 (83.4 MB/s) - ‘spache_easy.txt’ saved [5978/5978]



In [17]:
!wget https://raw.githubusercontent.com/cdimascio/py-readability-metrics/3ffb97f6057ae2451599d083a69ece78a61a6fa4/readability/data/dale_chall_easy.txt

--2021-09-27 21:16:25--  https://raw.githubusercontent.com/cdimascio/py-readability-metrics/3ffb97f6057ae2451599d083a69ece78a61a6fa4/readability/data/dale_chall_easy.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 18727 (18K) [text/plain]
Saving to: ‘dale_chall_easy.txt’


2021-09-27 21:16:26 (111 MB/s) - ‘dale_chall_easy.txt’ saved [18727/18727]



In [18]:
! mv *easy.txt ../Data

In [19]:
with open("../Data/dale_chall_easy.txt", "r") as file_obj:
    dale_chall_words = file_obj.readlines()

In [20]:
dale_chall_words = [word.strip().lower() for word in dale_chall_words]

In [21]:
with open("../Data/spache_easy.txt", "r") as file_obj:
    spache_words = file_obj.readlines()

In [22]:
spache_words = [word.strip().lower().replace("\\", "'") for word in spache_words]

In [23]:
spache_a_words = [word for word in spache_words if (word[0]=='a' and len(word)>=2)]
dale_chall_a_words = [word for word in dale_chall_words if (word[0]=='a' and len(word)>=2)]

In [24]:
len(spache_words)

1064

In [25]:
len(dale_chall_words)

2950

In [26]:
len(set(spache_words).intersection(set(dale_chall_words)))

1006

In [27]:
for word in spache_words:
    if word not in dale_chall_words:
        print(word)

b
baby
c
comfortable
contest
continue
d
disappear
disappoint
distance
dragon
e
excite
exclaim
fierce
final
g
gasp
grin
growl
h
imagine
j
jet
k
l
m
n
o
p
perfect
picket
practice
pretend
probably
problem
q
r
raccoon
reply
s
scold
signal
sniff
special
strike
t
traffic
trot
u
usual
v
w
x
y
z
zoo


In [28]:
len(dale_chall_a_words)

125

## Compare
We'll look at how word popularity has changed using the Google N-Grams data to show how common each word of interest was during the years the lists were published and a more recent year.

In [29]:
def just_word(text):
    return text.split("_")[0].lower()

df_a["just_word"] = df_a["word"].apply(just_word)

In [30]:
dale_chall_data = np.zeros(shape=(len(dale_chall_a_words), len(years)))
for j in tqdm(range(len(years))):
    for i in tqdm(range(len(dale_chall_a_words))):
        word = dale_chall_a_words[i]
        year = years[j]
        dale_chall_data[i][j] = df_a[(df_a["just_word"]==word) & (df_a["year"]==year)]["counts"].sum()

  0%|          | 0/3 [00:00<?, ?it/s]

  0%|          | 0/125 [00:00<?, ?it/s]

  0%|          | 0/125 [00:00<?, ?it/s]

  0%|          | 0/125 [00:00<?, ?it/s]

In [31]:
for i in range(len(years)):
    dale_chall_data[:,i] = dale_chall_data[:,i]/sums.loc[years[i]]["counts"]

In [32]:
relative_change = array([(dale_chall_data[i][2]-dale_chall_data[i][1])/dale_chall_data[i][1] for i in range(len(dale_chall_a_words))] )
relative_change = array([relative_change]).transpose()
dale_chall_data = np.append(array([array(dale_chall_a_words)]).transpose(), dale_chall_data, axis=1)
dale_chall_data = np.append(dale_chall_data, relative_change, axis=1)

In [33]:
dale_chall_data = dale_chall_data[dale_chall_data[:,4].argsort()][::-1]

In [34]:
print(f"{'WORD':<15}{'1995':<25}{'2008':<25}{'RELATIVE CHANGE':<25}")
for i in range(len(dale_chall_a_words)):
    print(f"{dale_chall_data[i][0]:<15}{dale_chall_data[i][2]:<25}{dale_chall_data[i][3]:<25}{dale_chall_data[i][4]:<25}")

WORD           1995                     2008                     RELATIVE CHANGE          
afterwards     0.0001442883799437771    0.0004942192443227638    2.425218610918908        
afar           1.745673327459319e-05    4.3087199177304955e-05   1.4682280756397166       
aren't         1.555858580623279e-07    3.677370101804314e-07    1.3635632104372588       
anyhow         2.6807443344139096e-05   6.29156439967456e-05     1.346946823278499        
airship        4.130804531554806e-06    9.08198979688035e-06     1.1986007150674722       
ah             0.00019594482964369574   0.0004228837002237421    1.1581773859137283       
awfully        1.8672896065113718e-05   4.023417966816675e-05    1.1546834260667067       
awhile         2.9913326035708315e-05   6.346765720271489e-05    1.1217184985364683       
awful          9.843204137754033e-05    0.00019544909528220966   0.9856247269378092       
axe            3.435789538102215e-05    6.654680915270316e-05    0.9368709408627166       

In [35]:
spache_data = np.zeros(shape=(len(spache_a_words), len(years)))
for j in tqdm(range(len(years))):
    for i in tqdm(range(len(spache_a_words))):
        word = spache_a_words[i]
        year = years[j]
        spache_data[i][j] = df_a[(df_a["just_word"]==word) & (df_a["year"]==year)]["counts"].sum()
        
for i in range(len(years)):
    spache_data[:,i] = spache_data[:,i]/sums.loc[years[i]]["counts"]     
    
relative_change = array([(spache_data[i][2]-spache_data[i][1])/spache_data[0][1] for i in range(len(spache_a_words))] )
relative_change = array([relative_change]).transpose()
spache_data = np.append(array([array(spache_a_words)]).transpose(), spache_data, axis=1)
spache_data = np.append(spache_data, relative_change, axis=1)    

spache_data = spache_data[spache_data[:,4].argsort()][::-1]

print(f"{'WORD':<15}{'1953':<25}{'2008':<25}{'RELATIVE CHANGE':<25}")
for i in range(len(spache_a_words)):
    print(f"{spache_data[i][0]:<15}{spache_data[i][1]:<25}{spache_data[i][3]:<25}{spache_data[i][4]:<25}")

  0%|          | 0/3 [00:00<?, ?it/s]

  0%|          | 0/47 [00:00<?, ?it/s]

  0%|          | 0/47 [00:00<?, ?it/s]

  0%|          | 0/47 [00:00<?, ?it/s]

WORD           1953                     2008                     RELATIVE CHANGE          
and            0.2513751000221099       0.2644520147618257       7.3831860004247805       
all            0.02464184832836345      0.025604244608544095     2.5681668800651596       
at             0.0403861557265709       0.03989014470803052      2.232184092076039        
as             0.06584835121687209      0.06648457027526616      2.011261588161028        
any            0.012590291216568007     0.010843458773046188     0.7792392028546662       
am             0.002472778561384144     0.0036679350518624043    0.6660645603418158       
again          0.0038527309314838184    0.004601867468449097     0.6370291454762873       
away           0.002600881883210805     0.0037903153739887404    0.5751971683035418       
after          0.0092745723736513       0.00930598485129429      0.4045532060263752       
always         0.0035258715647962595    0.003713137622178094     0.34051099960089315      

In [36]:
counter = 0
for i in range(len(dale_chall_a_words)):
    if float(dale_chall_data[i][4]) < 0:
        counter += 1
print(f"Out of the simple word list, {counter} out of {len(dale_chall_a_words)} words have become less popular ({counter/len(dale_chall_a_words)*100}%).")    

Out of the simple word list, 39 out of 125 words have become less popular (31.2%).


In [37]:
counter = 0
for i in range(len(spache_a_words)):
    if float(spache_data[i][4]) < 0:
        counter += 1
print(f"Out of the simple word list, {counter} out of {len(spache_data)} words have become less popular ({counter/len(spache_data)*100}%).")    

Out of the simple word list, 12 out of 47 words have become less popular (25.53191489361702%).


## Conclusion


While the Google N-Grams data is not necessarily a good way of predicting the words that a 4th grader might know, it does give insight into relative word popularity over time, and it is obvious from this data that the majority of words considered as "easy" in standard lists have become less popular over time, and in some cases this is an extreme change, e.g. "Airfield." The task of keeping such a list up-to-date is daunting. 