In March 2016, I did a minimalistic web-scrape of the _Oxford English Dictionary_ database for terms and years. The _OED_ has much more data than this, but I wanted a down-and-dirty alternative to the list of 10,5000 dictionary.com terms and years of origin Ted Underwood and Jordan Sellers used in "The Emergence of Literary Diction." I wanted to cross-validate their results and/or see if a larger term set would change anything. The _OED_ is generally a better source than dictionary.com, but the results will depend heavily on how I choose to normalize its more idiosyncratic (humanistic?) data. I would like to pare the set down to usable, ratio-friendly pairs of term and date of earliest known use. I am not aiming at perfection here, as I think that would be a mistake. Instead, I'm shooting for (1) transparency and (2) a method for excluding data that are likely to increase errors or "statistical noise."

In [88]:
import sqlite3
conn = sqlite3.connect('oed_data.db')
c = conn.cursor()
rows = c.execute('SELECT term, GROUP_CONCAT(year) FROM dictionary WHERE year !=" " GROUP BY term').fetchall()
conn.close()

In [89]:
rows[:10]

[("'Arry", ' 1874'),
 ("'Namgis", ' 1966'),
 ("'Sblood", ' 1598'),
 ("'Sbobs", ' 1694'),
 ("'Sbodikins", ' 1677'),
 ("'Sbores", ' 1640'),
 ("'Sbud(s", ' 1676'),
 ("'Sdeath", ' 1606'),
 ("'Sdeynes", ' 1616'),
 ("'Sflesh", ' 1705')]

Upon inspecting the first ten rows here, we see a few immediate issues. We will want to convert all terms to lowercase, account for the `(s` notation (which indicates a term's plural variant), and remove leading apostrophes (for consistency with our already tokenized dataset). Further, we can note that dates are stored as strings with a leading space. The reason for this is clear if we look a little further down the row list:

In [90]:
print(rows[50:60], rows[310:315], rows[-100:-80])

[("'low", ' a1382'), ("'magine", ' 1530'), ("'mong", ' ?c1200'), ("'mongst", ' 1567'), ("'n", ' 1678, 1828'), ("'n'", ' 1858'), ("'ndrangheta", ' 1978'), ("'neath", ' a1500'), ("'nother", ' a1635'), ("'nough", ' a1618')] [('Abaza', ' 1693'), ('Abba', ' OE'), ('Abbasid', ' 1664'), ('Abbe', ' 1876'), ('Abbevillian', ' 1783')] [('ˌultracytoˈchemistry', ' 1963'), ('ˌultrafilˈtration', ' 1908'), ('ˌultramicroˈscopic', ' 1870'), ('ˌultraˈcold', ' 1967'), ('ˌunder-coˈrrect', ' 1831'), ('ˌunder-differentiˈation', ' 1953'), ('ˌunder-diˈspersion', ' 1935'), ('ˌunder-exˈpose', ' 1890'), ('ˌunder-occuˈpation', ' 1961'), ('ˌunder-proˈficient', ' 1703'), ('ˌunder-proˈportion', ' 1813'), ('ˌunder-proˈportioned', ' 1689'), ('ˌunder-ˈargue', ' 1645'), ('ˌunder-ˈcapitalled', ' 1794'), ('ˌunder-ˈenter', ' 1692'), ('ˌunder-ˈestimate', ' 1812'), ('ˌunder-ˈfurnish', ' 1694'), ('ˌunder-ˈhorsing', ' 1839'), ('ˌunder-ˈmeasure', ' 1682'), ('ˌunder-ˈmeated', ' 1653')]


When I originally scraped the _OED_, I saw entries like a1382, ?c1200, and OE, so I stored date values as Unicode strings. The method I plan to apply to my data won't be able to differentiate between homonyms, whereas the _OED_ does, so I will need to account for repeats in this set. Finally, there's a fair amount of even messier data at the end of the list because the _OED_ search returns separate entries for prefixed forms and uses special characters (U+02CC and U+02C8) to offset the stem. For example, the entry _ˌultracytoˈchemistry_ means the prefix _ultracyto_ is first known to have been used with the ending _chemistry_ in 1963.

In [110]:
def term_normalize(mytuple):
    #lowercase all and drop leading apostrophe and other punctuation (keep hyphens)
    word = [ch for ch in mytuple[0].lower() if ch.isalpha() or ch == "-"]
    #U+02C8 and U+02CC pass isalpha(), so they have to be removed explicitly
    word = [ch for ch in word if ch not in ("\u02c8", "\u02cc")]
    return (''.join(word), mytuple[1])

new_rows = [term_normalize(r) for r in rows]
print(new_rows[:30], new_rows[-100:-80])

[('arry', ' 1874'), ('namgis', ' 1966'), ('sblood', ' 1598'), ('sbobs', ' 1694'), ('sbodikins', ' 1677'), ('sbores', ' 1640'), ('sbuds', ' 1676'), ('sdeath', ' 1606'), ('sdeynes', ' 1616'), ('sflesh', ' 1705'), ('sfoot', ' 1602'), ('sheart', ' c1596'), ('slid', ' 1606'), ('slife', ' a1634'), ('slight', ' 1600'), ('slud', ' 1606'), ('snails', ' 1599'), ('sneaks', ' 1602'), ('sniggers', ' 1633'), ('snigs', ' a1643'), ('snowns', ' 1594'), ('sprecious', ' 1631'), ('swill', ' 1602'), ('arf', ' 1854'), ('at', ' a1300'), ('burb', ' 1977'), ('cause', ' a1513'), ('cep', ' 1851'), ('cept', ' 1851'), ('chute', ' 1920')] [('ultracytochemistry', ' 1963'), ('ultrafiltration', ' 1908'), ('ultramicroscopic', ' 1870'), ('ultracold', ' 1967'), ('under-correct', ' 1831'), ('under-differentiation', ' 1953'), ('under-dispersion', ' 1935'), ('under-expose', ' 1890'), ('under-occupation', ' 1961'), ('under-proficient', ' 1703'), ('under-proportion', ' 1813'), ('under-proportioned', ' 1689'), ('under-argue', 

Now that we've done some basic normalization of the terms, we can begin to tackle the year list. We'll also need to purge homonyms along the way, as we'll want to preserve the earliest date for each type. First, let's inspect the range of year strings by converting every digit to a 1 and grouping the results.

In [10]:
from collections import Counter

years_lump = [''.join(["1" if s.isdigit() else s for s in i[1]]) for i in new_rows]
year_type_counts = Counter(years_lump)
print(len(year_type_counts.most_common()))

1717


What the code above suggests is that, even if we convert every digit in our data to a '1', we still have 1717 distinct forms to work with. Let's take a look at why this is the case.

In [11]:
print(year_type_counts.most_common()[:100])

[(' 1111', 190674), (' a1111', 13938), (' c1111', 12523), (' 1111, 1111', 8526), (' ?1111', 2109), (' ?a1111', 1616), (' c1111, 1111', 1532), (' 1111-1', 1327), (' a1111, 1111', 1327), (' OE', 1164), (' 1111-11', 1081), (' ?c1111', 1030), (' c111', 922), (' 1111, 1111, 1111', 862), (' 11..', 568), (' eOE', 560), (' 1111, a1111', 392), (' c1111, c1111', 342), (' c1111, 1111, 1111', 285), (' c1111, a1111', 280), (' ?a1111, 1111', 261), (' a1111, a1111', 225), (' 1111, 1111, 1111, 1111', 215), (' ?1111, 1111', 214), (' a1111, c1111', 204), (' a1111, 1111, 1111', 187), (' 1111, c1111', 169), (' a1111-11', 163), (' a111', 154), (' c111, 1111', 147), (' OE, 1111', 142), (' 111', 140), (' ?c1111, 1111', 139), (' eOE, 1111', 119), (' c1111, 1111, 1111, 1111', 90), (' c111, c1111', 89), (' 1111-1, 1111', 88), (' 1111-11, 1111', 75), (' 1111, 1111, 1111, 1111, 1111', 75), (' c1111, c1111, 1111', 74), (' 1111, ?1111', 71), (' lOE', 67), (' a1111, 1111, 1111, 1111', 59), (' 1111-1111', 58), (' 11.

In [12]:
print(len(new_rows), len(years_lump))
#total number of entries covered by the three most common forms
print(sum(count for _, count in year_type_counts.most_common(3)))

249088 249088
217135


Among the types _year_, a+_year_, and c+_year_, we cover 217135 out of 249088 entries (about 87%). _a_ stands for _ante_ ("before") and _c_ stands for _circa_, so, in both cases, for convenience, we can drop the letter qualifiers and assume the dates are reasonably close (we could also run our analysis on non-qualified data only and see if it produces a significantly different result). A few more generalizations:

1. "?" characters tend to precede _year_ or a+_year_/c+_year_
2. Most years in the set have four digits, suggesting that post-1100 terms will greatly outnumber pre-1100 terms in this set.
3. Every date field seems to begin with a leading space.
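
The generalizations above can be sketched as a small parser for a single date token. This is a minimal sketch, not the code used later in this notebook: the helper name `split_qualifiers` and the pattern are my own assumptions, and the token `' 15..'` is a made-up stand-in for the `' 11..'` forms seen in the lumped output, which the sketch deliberately leaves unparsed.

```python
import re

# Hypothetical pattern: leading space(s), optional "?", optional "a" or "c",
# then a 1-4 digit year that is not followed by another digit or a "."
DATE_TOKEN = re.compile(r"^\s*(\?)?([ac])?(\d{1,4})(?![\d.])")

def split_qualifiers(token):
    """Return (qualifiers, year) for a token such as ' ?c1200', else None."""
    match = DATE_TOKEN.match(token)
    if match is None:
        return None
    question, letter, digits = match.groups()
    quals = [q for q in (question, letter) if q]
    return quals, int(digits)

for token in [" 1874", " a1382", " ?c1200", " OE", " 15.."]:
    print(repr(token), split_qualifiers(token))
```

Returning `None` for anything that does not start with a recognizable year keeps the unusual forms visible, so they can be counted and inspected rather than silently coerced.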

In some cases, the _OED_ has entries with multiple variants of the same term sharing a single date of origin, and other entries with multiple variants of the same term and multiple dates for each variant. We can locate these entries by searching for a comma in the year field. 

In [13]:
serial_dates = [r for r in rows if "," in r[1]]
print(serial_dates[:20])

[("'long", ' 1488, 1663'), ("'n", ' 1678, 1828'), ('-ie', ' 1727, 1941'), ('-in', ' 1881, 1960'), ('-ion', ' 1856, 1930'), ('-onium', ' 1858, 1987'), ('-some', ' a1400, 1921'), ('-y', ' c1430, 1850, 1941'), ('ABC', ' c1325, 1611, 1868'), ('ATP', ' 1939, 1971'), ('Abelian', ' ?1609, 1846'), ('Abraham', ' c1300, ?1592'), ('Actaeon', ' 1567, 1582'), ('Adam', ' OE, 1846, 1983'), ('Adam and Eve', ' 1789, 1925'), ('Adamish', ' 1821, 1838'), ('Addisonian', ' 1789, 1885'), ('Ahmadiyya', ' 1836, 1902'), ('Albanian', ' c1400, 1565, ?1569, 1689'), ('Albert', ' 1740, 1840, 1874')]


These data are here because the _OED_ is interested in _senses_ of words, not just tokens. _Albanian_ meaning "of or relating to Scotland or its people" came into usage around 1565 (according to the data) and _Albanian_ meaning "A native or inhabitant of Albania, a country once located in the eastern Caucasus, in the regions that are now Azerbaijan and the southern part of the Russian Republic of Dagestan" can be dated circa 1400. For all of these, I want to use the earliest possible date (and/or remove them from consideration). Note that our lists seem to be ordered from earliest to latest date.
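
The ascending order of these lists is only an impression from the sample, so a cautious approach would take the minimum year in the field rather than trusting the first position. The following is a minimal sketch under that assumption; `earliest_year` is a hypothetical helper, and it ignores non-numeric entries such as OE, which would need separate handling.

```python
import re

def earliest_year(date_field):
    """Return the smallest year in a comma-separated date field, or None."""
    years = []
    for part in date_field.split(","):
        # Take the first run of digits in each part; for a range such as
        # '1700-1' this is the start of the range.
        match = re.search(r"\d{1,4}", part)
        if match:
            years.append(int(match.group(0)))
    return min(years) if years else None

print(earliest_year(" c1400, 1565, ?1569, 1689"))
print(earliest_year(" 1700-1, 1846"))
print(earliest_year(" OE"))
```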

In [14]:
print(len(serial_dates))

20813


Dealing with these as a group will help us normalize another 20,000 terms. Next up, we have date ranges. We can locate these by searching for a hyphen in the date field.

In [15]:
date_ranges = [r for r in rows if "-" in r[1]]
print(date_ranges[:20])

[('-gon', ' 1867-78'), ('Americanized', ' 1811-12'), ('Bartholomew', ' 1552-3'), ('Cercaria', ' 1836-9'), ('Clypeˈaster', ' 1836-9'), ('Conˈcordium', ' 1841-3'), ('Cottonian', ' 1700-1, 1846'), ('Crestmarine', ' 1565-73'), ('Cydippe', ' 1835-6'), ('Decapoda', ' 1835-6'), ('Docetae', ' 1818-21'), ('Donegal', ' 1903-4'), ('Easter duty', ' 1598-9'), ('Easterling', ' 1378-9'), ('Eledone', ' 1835-6'), ('Elsan', ' 1939-40'), ('Eucharistize', ' 1714-7'), ('Finnish', ' 1789-96'), ('Fourierism', ' 1841-4'), ('Fructidor', ' 1793-97')]


Note that many of our date ranges are mixed in with serialized lists of dates. We'll have to address this in our normalization code at the end. For now, we should also pay attention to the fact that date ranges tend to include only the digits that vary from start to end date: 1867-78 means 1867-1878, and 1836-9 means 1836-1839. These blocks have non-standard lengths, so we'll have to be careful when handling them. We could also use the size of the range to gauge a level of certainty. 1855-56 is quite precise, whereas 1000-1400 would be pretty vague and especially problematic if we're trying to separate terms that are most likely Germanic from terms that are most likely Latinate.
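
To make the abbreviated ranges concrete, here is a minimal sketch of how they could be expanded. The helper names `expand_range` and `range_width` are hypothetical, and the sketch assumes the end date always shares its missing leading digits with the start date.

```python
import re

RANGE = re.compile(r"(\d{1,4})-(\d{1,4})")

def expand_range(token):
    """Expand the first abbreviated range in a token into (start, end)."""
    match = RANGE.search(token)
    if match is None:
        return None
    start, end = match.groups()
    # The end date keeps only its trailing digits, so borrow the missing
    # leading digits from the start date.
    prefix_len = max(len(start) - len(end), 0)
    return int(start), int(start[:prefix_len] + end)

def range_width(token):
    """Return the number of years spanned by the range, or None if no range."""
    span = expand_range(token)
    return None if span is None else span[1] - span[0]

print(expand_range(" 1867-78"), range_width(" 1867-78"))
print(expand_range(" 1836-9"), range_width(" 1836-9"))
```

A width threshold could then be used to drop or flag ranges too wide to classify with any confidence.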

We also have some date fields with no numbers at all. We can use the `years_lump` list to inspect them:

In [16]:
years_all_letters = [i for i in years_lump if '1' not in i]
print(Counter(years_all_letters).most_common())

[(' OE', 1164), (' eOE', 560), (' lOE', 67), (' eOE, OE', 44), (' OE, OE', 31), (' eOE, eOE', 17), (' eOE, OE, OE', 4), (' eOE, eOE, eOE, eOE', 2), (' eOE, lOE', 2), (' eOE, eOE, eOE', 2), (' OE, lOE', 2), (' lOE, lOE', 1), (' OE, OE, OE', 1), (' eOE, eOE, eOE, OE', 1), (' eOE, OE, OE, OE', 1), (' eOE, eOE, OE, OE', 1), (' eOE, eOE, OE', 1)]


OE stands for Old English; eOE is "early Old English" and lOE is "late Old English." So, any time we find a purely textual entry, we should designate it Germanic. Further, if we have a serial date field with OE, eOE, or lOE in it, we should consider that term Germanic as well. Note that, in the output above, a list with OE or its variants almost always has the OE value in the first position.

Note: even accounting for OE values, Latinate terms are still going to outnumber Germanic terms significantly.
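
The rule just described (any OE, eOE, or lOE value marks the term as Germanic) can be sketched as follows. The function names are hypothetical, and the sketch assumes the Old English labels always appear with exactly this capitalization, as they do in the output above.

```python
import re

# Matches OE with an optional "e" (early) or "l" (late) prefix.
OE_LABEL = re.compile(r"[el]?OE")

def old_english_labels(date_field):
    """Return every Old English label found in a date field, in order."""
    return OE_LABEL.findall(date_field)

def is_germanic(date_field):
    """Treat a term as Germanic if any of its dates is an Old English label."""
    return bool(old_english_labels(date_field))

print(old_english_labels(" eOE, OE, OE"), is_germanic(" eOE, OE, OE"))
print(old_english_labels(" OE, 1846, 1983"), is_germanic(" OE, 1846, 1983"))
print(old_english_labels(" 1874"), is_germanic(" 1874"))
```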

In [18]:
years_with_oe = [i for i in years_lump if 'oe' in i.lower()]
print(len(years_with_oe))

3010


Let's try putting together what we've seen so far. First, let's try for a normalization script that reduces every date to "Latin" (`lat`), "Germanic" (`germ`), or "Neologism" (`neo`, for post-1700 data).

In [58]:
import re
terms_and_origins = []
exceptions = []
for h,i in new_rows:
    quals = []
    if 'oe' in i.lower():
        #the following code is designed to make sure any qualifiers (a, c, ?) are directly before or after oe, eoe, or loe
        pattern = "...oe.|...oe|..oe.|..oe|.oe.|oe.|.oe|oe"
        quals_test = re.search(pattern, i.lower())
        if quals_test:
            if "a" in quals_test.group(0):
                quals.append("a")
            if "c" in quals_test.group(0):
                quals.append("c")
            if "?" in quals_test.group(0):
                quals.append("?")
        #750 is just a placeholder that will always resolve to "germ" in the code below.
        row = [h, i, 750, quals]
    else:      
        if i[0] == " ":
            
            if i[1].isdigit():
                #means first char after space is a number
                #test for 4, 3,2, 1 digits
                for z in range(5, 1, -1):
                    if i[1:z].isdigit():
                        try:
                            if i[z+1] == "?":
                                quals.append("?")
                        except:
                            pass
                        year = i[1:z]
                        break
                row = [h,i,year,quals]
                
            else:
                #match letter or ?letter
                if i[2].isdigit():
                    quals.append(i[1])
                    #test for 4, 3, 2 digits (this range stops before a 1-digit slice)
                    for z in range(6, 3, -1):
                        if i[2:z].isdigit():
                            try:
                                if i[z+1] == "?" or i[1] == "?":
                                    quals.append("?")
                            except:
                                pass
                            year = i[2:z]
                            break
                    row = [h,i,year,quals]
                else:
                    #by def should be ?c or ?a
                    quals.append(i[1])
                    quals.append(i[2])
                    #test for 4, 3, 2 digits (this range stops before a 1-digit slice)
                    for z in range(7, 4, -1):
                        if i[3:z].isdigit():
                            try:
                                if i[z+1] == "?":
                                    quals.append("?")
                            except:
                                pass
                            year = i[3:z]
                            break
                    row = [h,i,year,quals]
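    #classify by year: before 1100 is Germanic, 1101-1699 is Latinate, the rest are neologisms
    #note: both bounds on the "lat" test are strict, so a year of exactly 1100 is labelled "neo"
    if int(row[2]) < 1100:
        row.append("germ")
    elif int(row[2]) > 1100 and int(row[2]) < 1700:
        row.append("lat")
    else:
        row.append("neo")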
    terms_and_origins.append(row) 
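
For comparison, the same three-way classification can be written more compactly with a regular expression. This is a sketch under stated assumptions rather than a drop-in replacement: `classify_date_field` is a hypothetical name, it uses the earliest year anywhere in the field instead of the first one, it assigns 1100 to `lat` and 1700 to `neo`, and it returns `None` when no usable year is found. Its counts may therefore differ slightly from the cell above.

```python
import re

# A year is 1-4 digits that do not continue another number, do not follow
# a hyphen (the abbreviated end of a range), and are not followed by "."
YEAR = re.compile(r"(?<![\d-])\d{1,4}(?![\d.])")

def classify_date_field(date_field):
    """Return 'germ', 'lat', 'neo', or None for a raw date field."""
    if "OE" in date_field:
        # Any Old English label marks the term as Germanic.
        return "germ"
    years = [int(y) for y in YEAR.findall(date_field)]
    if not years:
        return None
    earliest = min(years)
    if earliest < 1100:
        return "germ"
    if earliest < 1700:
        return "lat"
    return "neo"

for field in [" 1874", " c1596", " OE, 1846", " 1700-1, 1846", " ?c1200"]:
    print(repr(field), classify_date_field(field))
```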

In [75]:
exceptions, len(terms_and_origins), terms_and_origins[:25]

([],
 249088,
 [['arry', ' 1874', '1874', [], 'neo'],
  ['namgis', ' 1966', '1966', [], 'neo'],
  ['sblood', ' 1598', '1598', [], 'lat'],
  ['sbobs', ' 1694', '1694', [], 'lat'],
  ['sbodikins', ' 1677', '1677', [], 'lat'],
  ['sbores', ' 1640', '1640', [], 'lat'],
  ['sbuds', ' 1676', '1676', [], 'lat'],
  ['sdeath', ' 1606', '1606', [], 'lat'],
  ['sdeynes', ' 1616', '1616', [], 'lat'],
  ['sflesh', ' 1705', '1705', [], 'neo'],
  ['sfoot', ' 1602', '1602', [], 'lat'],
  ['sheart', ' c1596', '1596', ['c'], 'lat'],
  ['slid', ' 1606', '1606', [], 'lat'],
  ['slife', ' a1634', '1634', ['a'], 'lat'],
  ['slight', ' 1600', '1600', [], 'lat'],
  ['slud', ' 1606', '1606', [], 'lat'],
  ['snails', ' 1599', '1599', [], 'lat'],
  ['sneaks', ' 1602', '1602', [], 'lat'],
  ['sniggers', ' 1633', '1633', [], 'lat'],
  ['snigs', ' a1643', '1643', ['a'], 'lat'],
  ['snowns', ' 1594', '1594', [], 'lat'],
  ['sprecious', ' 1631', '1631', [], 'lat'],
  ['swill', ' 1602', '1602', [], 'lat'],
  ['arf', '

In [60]:
Counter([ i[4] for i in terms_and_origins])

Counter({'germ': 7740, 'lat': 115336, 'neo': 126012})

In [84]:
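#scrub repeated terms, keep the one with the earlier date
terms = {}
for i in terms_and_origins:
    kept = terms.get(i[0])
    #compare years as integers: they are stored as strings (or as the int placeholder 750),
    #and string comparison would sort '999' after '1874'
    if kept is None or int(i[2]) < int(kept[2]):
        terms[i[0]] = i
oed_normalized = list(terms.values())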

In [97]:
terms['ˌunder-ˈestimate']

['ˌunder-ˈestimate', ' 1812', '1812', [], 'neo']

In [94]:
len(oed_normalized)

247546

In [108]:
("ˌ").encode('unicode_escape')

b'\\u02cc'
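
The check above identifies one of the two marks by its escape code. As a small illustrative sketch (not part of the original scrape), the standard `unicodedata` module can show why a plain `isalpha()` filter does not remove these characters, and `str.translate` offers a compact way to strip them. The names `STRESS_MARKS` and `strip_marks` are my own.

```python
import unicodedata

STRESS_MARKS = "\u02c8\u02cc"

for ch in STRESS_MARKS:
    # Both characters belong to category Lm (modifier letter), which
    # str.isalpha() treats as alphabetic.
    print(ch.encode("unicode_escape"), unicodedata.name(ch),
          unicodedata.category(ch), ch.isalpha())

# Translation table that deletes both marks and leaves everything else alone.
strip_marks = str.maketrans("", "", STRESS_MARKS)
print("ˌunder-ˈestimate".translate(strip_marks))
```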