# Clare's Daffies

>"Thus, only when the making of the 'nation', an entirely abstract group based on law, creates new usages and functions does it become indispensable to forge a *standard* language, impersonal and anonymous like the official uses it has to serve, and by the same token to undertake the work of normalizing the products of the linguistic habitus." (Bourdieu, _Language and Symbolic Power_)

>"A contemporary review of _Poems Descriptive_ in the _New Monthly Magazine_ (March 1820) scornfully describes several of Clare's dialect words, such as _bangs_, _chaps_, _eggs on_, _fex_, _flops_, _sniftling_ and _snufting_, as 'mere vulgarisms, and may as well be excluded from the poetical lexicon, as they have long since been banished from the dictionary of polite conversation.'" (McKusick, _John Clare and the Tyranny of Grammar_)

In this notebook, I attempt to sift through John Clare's verse, capturing and collecting the "vulgarisms" that mark Clare, and through which Clare marks himself, as "A Northamptonshire Peasant."  As McKusick notes (citing John Barrell and Timothy Brownlow), Clare's use of these words was a way of resisting "enclosure":

>"...Clare's conception of language and his conception of landscape seem closely related; he regards both as ideally constituting an unrestricted communal zone, open to local browsing and free from the linearity, exclusivity, and standardization imposed by outside authorities."

Authorities, we might add (following Bourdieu), of _the nation_. 

My goal here is not really to produce any positive knowledge about Clare's diction.  Rather, it is to generate a closed vocabulary (a linguistic "palette") of Clare-isms that may then be put to use as the raw material of quasi-imitative writing. 

***

## Gathering Clare

First I must gather some of Clare's verse.  I turn to three posthumously published collections downloaded from Project Gutenberg: _Life and Remains of John Clare_ (1872), _Poems_ (1901), _Poems Chiefly from Manuscript_ (1920).  N.B.: According to McKusick, even Clare's earliest editors were quick to edit the poet's linguistically unruly (read _anti-tyrannical_) verse; no doubt the texts available on Gutenberg are already at least one step away from the Clare found in his manuscripts. 

In [1]:
#!/usr/bin/env python
# -*- coding: utf-8 -*-

In [2]:
clarefiles = ["clare52601.txt","clare8672.txt","clare9156.txt"]

In [3]:
from gutenberg.cleanup import strip_headers

In [4]:
claretext = ""

for c in clarefiles:
    with open(c,'r') as f:
        ct_temp = strip_headers(f.read().decode('utf-8'))
        claretext+=" "+ct_temp

The Project Gutenberg versions of Clare's texts contain plenty of editorial commentary.  I'll use a regular expression to try to only match Clare's verse. 

In [5]:
import re

In [6]:
stanzas = re.findall(r'(?:[A-Z].+[a-z].+\n *){3,}',claretext)

In [7]:
len(stanzas)

1628
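
To see what the stanza pattern accepts, here is a minimal sketch that runs the same regular expression over a small hand-made string. The `sample` text and the name `STANZA_PATTERN` are my own assumptions for illustration; only the pattern itself comes from the cell above. The pattern asks for three or more consecutive lines that each begin with a capital letter and contain a lowercase letter further along, which is why an all-caps heading or a stray page number does not match.

```python
import re

# Same pattern as in the notebook: 3+ consecutive lines, each starting with a
# capital letter and containing at least one lowercase letter later on.
STANZA_PATTERN = r'(?:[A-Z].+[a-z].+\n *){3,}'

# Hypothetical sample: an all-caps editorial heading, a blank line,
# four lines of verse, and a stray page number.
sample = (
    "EDITOR'S NOTE\n"
    "\n"
    "Time looks on pomp with vengeful mood\n"
    "  Or killing apathy's disdain;\n"
    "So where old marble cities stood\n"
    "  Poor persecuted weeds remain.\n"
    "\n"
    "12\n"
)

matches = re.findall(STANZA_PATTERN, sample)
```

Note that the trailing ` *` in each repetition swallows the indentation of the following line, so indented verse lines still count as starting with a capital. Prose whose wrapped lines all happen to begin with capitals would also match, so the filter is a heuristic rather than a guarantee.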

In [8]:
import random
random.seed("fex")

In [9]:
for i in random.sample(stanzas,2):
    print i
    print "***"

Time looks on pomp with vengeful mood
  Or killing apathy's disdain;
So where old marble cities stood
  Poor persecuted weeds remain.
She feels a love for little things
  That very few can feel beside,
And still the grass eternal springs
  Where castles stood and grandeur died.

***
The boy, that scareth from the spiry wheat
  The melancholy crow--in hurry weaves,
  Beneath an ivied tree, his sheltering seat,
  Of rushy flags and sedges tied in sheaves,
  Or from the field a shock of stubble thieves.
  There he doth dithering sit, and entertain
  His eyes with marking the storm-driven leaves;
  Oft spying nests where he spring eggs had ta'en,
And wishing in his heart twas summer-time again.

***


In [10]:
stanzas_joined = " ".join(stanzas)

Filter out repeated lines (the same poem can appear in more than one collection) and rejoin into a big string o' Clare.

In [11]:
lines = stanzas_joined.split("\n") 
lines = [l.strip() for l in lines]
print len(lines)
lines = list(set(lines))
print len(lines)

13150
11664
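
Deduplicating with `set` discards the original order of the lines. That is harmless here, since only the vocabulary and counts are used later, but if line order mattered, an order-preserving variant might look like the sketch below. The helper name `dedupe_keep_order` and the sample lines are hypothetical.

```python
def dedupe_keep_order(lines):
    """Return the lines with repeats removed, keeping first occurrences in order."""
    seen = set()
    unique = []
    for line in lines:
        if line not in seen:
            seen.add(line)
            unique.append(line)
    return unique

# Hypothetical sample with one repeated verse line and a repeated empty line.
sample_lines = [
    "And all thy pastures plough'd.",
    "",
    "Poor persecuted weeds remain.",
    "",
    "And all thy pastures plough'd.",
]
unique_lines = dedupe_keep_order(sample_lines)
```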


In [12]:
all_clare_lines = " ".join(lines)

## Custom Tokenization

I doubt I can use a standard tokenizer because I don't want to break up Clare's nonstandard contractions.  An example of the problem:

In [13]:
from nltk import tokenize

In [14]:
tokenize.word_tokenize("And all thy pastures plough'd.")

['And', 'all', 'thy', 'pastures', 'plough', "'d", '.']

Here `plough'd` would be better off kept as one word.  I'll just have to roll my own tokenizer---one that, however imperfect, will at least try to respect Clare's orthography. 

In [15]:
def custom_tokenize(abigstring):
    """
    Note: I don't fool with lowering strings.  Too hard to keep out proper nouns.
    As a result, any token containing a capital letter (proper nouns, but also
    line-initial words) is dropped by the a-z filter below.
    """
    tokens = abigstring.split()
    tokens = [w.replace(u'\u2019',"'") for w in tokens] ## replace unicode apostrophe with ascii apostrophe (better for regex searches)
    tokens = [re.sub(r"[^a-z']+$",'',w) for w in tokens] ## strip trailing characters other than a-z and apostrophe (e.g. punctuation)
    tokens = [re.sub(r"'s$","",w) for w in tokens] ## remove possessive: 's
    tokens = [re.sub(r"s'$","s",w) for w in tokens] ## remove possessive: s'
    tokens = [t for t in tokens if len(t)>0] ## remove 0-length strings
    tokens = [t for t in tokens if re.match(r"^[a-z']+-?[a-z']+$",t)] ## only keep tokens of two or more characters made of a-z and ', with at most one internal hyphen
    tokens = [t for t in tokens if re.match(r"[ivx]+$",t) is None] ## remove lowercase roman numerals built from i, v, x
    return tokens

In [16]:
clare_tokens = custom_tokenize(all_clare_lines)

In [17]:
clare_vocab = list(set(clare_tokens))

In [18]:
len(clare_vocab)

8308

## Check Against Dictionary

The most obvious way to isolate those words that characterize Clare's "vulgar" vocabulary is to filter out words that appear in the dictionary.

In [19]:
import enchant
dictionary = enchant.Dict("en_GB")
dictionary.check("salmon")

True

In [20]:
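## pass 1: drop words the dictionary recognises as written
clare_vocab_not_in_dictionary = [w for w in clare_vocab if not dictionary.check(w)]
## pass 2: drop hyphenated words that are recognised once the hyphen is removed
clare_vocab_not_in_dictionary = [w for w in clare_vocab_not_in_dictionary if not dictionary.check(w.replace("-",""))]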

In [21]:
len(clare_vocab_not_in_dictionary)

1180

Let's examine the words we've caught...

In [22]:
random.sample(clare_vocab_not_in_dictionary,10)

[u'shepherd-skill',
 u'wind-floating',
 u'veigled',
 u'dew-wet',
 u'summer-shorn',
 u'meek-eyed',
 u'unbought',
 u'summer-grass',
 u'tinty',
 u'flower-loving']
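
The two-pass filter can be illustrated without a spell-checker installed. In the sketch below, `KNOWN_WORDS` and `check` are hypothetical stand-ins for the enchant dictionary and its `check` method, and the four-word vocabulary is invented for the example.

```python
# Hypothetical stand-in for the enchant dictionary: a tiny set of "known" words.
KNOWN_WORDS = {"salmon", "summer", "tonight"}

def check(word):
    """Mimic a spell-checker lookup: True when the word is in the word list."""
    return word in KNOWN_WORDS

vocab = ["salmon", "to-night", "tinty", "dew-wet"]

# Pass 1: drop words the dictionary recognises as written.
unknown = [w for w in vocab if not check(w)]

# Pass 2: drop hyphenated words that are recognised once the hyphen is removed.
unknown = [w for w in unknown if not check(w.replace("-", ""))]
```

The second pass only catches compounds whose joined form is itself a dictionary word. A compound such as `dew-wet` stays in the list even though both halves are ordinary words, which may explain why so many hyphenated compounds appear in the sample above.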

## Check Against Other Texts

The dictionary I have used above is no doubt biased toward modern usage and spelling.  One remedy is to further filter out words used by other writers---Browning, Hopkins, Wordsworth, etc.---under the twin assumptions that 1) such words may be archaisms rather than vernacular words, and that 2) even if they are vernacular terms, it is still worth filtering them out to leave behind those words that are most specific to Clare and his Northamptonshire. 

In [23]:
texts_to_compare = [
        37452, ## browning vol 1 
        33363, ## browning vol 2
        22403, ## hopkins
        29091, ## coleridge vol 1
        29092, ## coleridge vol 2
        10219, ## wordsworth vol 1
        12145, ## v2
        12383, ## v3
        32459, ## v4
        56361, ## v5
        47651, ## v6
        47143, ## v7
        52836, ## v8
        14353, ## swift vol 1
        13621, ## swift vol 2
        4800,  ## shelley
        14100, ## barbauld
        100,   ## complete shakespeare 
        ]

In [24]:
from gutenberg.acquire import load_etext

In [25]:
comparison_tokens = []

for t in texts_to_compare:
    text = strip_headers(load_etext(t)).strip()
    comparison_tokens+=list(set(custom_tokenize(text)))

In [26]:
comparison_tokens = list(set(comparison_tokens))

In [27]:
len(comparison_tokens)

53582

In [28]:
clariest_clare_vocab = list(set(clare_vocab_not_in_dictionary).difference(set(comparison_tokens)))

In [29]:
len(clariest_clare_vocab)

671

This method filtered out a further 509 of the 1,180 words (about 43%) that made it through the dictionary-based filtration above:

In [30]:
random.sample(clariest_clare_vocab,20)

[u'worship-moving',
 u'beesom',
 u'whitethorns',
 u'bent-stalks',
 u'glibbed',
 u'heart-bred',
 u'wind-floating',
 u'stingo',
 u'matty',
 u'nipt',
 u'hour-telling',
 u'cag',
 u'sheep-boy',
 u'drabbled',
 u'tottergrass',
 u"cowslip-yellow'd",
 u'sprunts',
 u'brustling',
 u'impersonifying',
 u"o'er-top't"]

This is only a start. It would remain for further socio-linguistic analysis to determine the degree to which these words index class, geography, or both.  No doubt some of them (perhaps especially the compound words) are Clare's neologisms. 

## Most Frequently Used Non-Standard Terms

Out of these terms, which does Clare tend to use the most frequently?

In [31]:
from nltk import FreqDist

In [32]:
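clariest_clare_set = set(clariest_clare_vocab) ## set membership is much faster than scanning the list for every token
clariest_clare_tokens = [t for t in clare_tokens if t in clariest_clare_set]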

In [33]:
FreqDist(clariest_clare_tokens).most_common(40)

[(u'oer', 84),
 (u'agen', 20),
 (u'mither', 13),
 (u'neath', 7),
 (u"crop't", 6),
 (u'passer-bye', 5),
 (u'brither', 5),
 (u'morts', 4),
 (u'pilewort', 4),
 (u'sen', 4),
 (u'milk-pail', 4),
 (u'trotty', 4),
 (u'blebs', 3),
 (u'stoven', 3),
 (u'whisp', 3),
 (u'adry', 3),
 (u'flye', 3),
 (u'lambtoe', 3),
 (u'blea', 3),
 (u'wirl', 3),
 (u'closen', 3),
 (u'tyrant-like', 3),
 (u'stye', 3),
 (u'crimpled', 3),
 (u'chittering', 3),
 (u'yer', 3),
 (u'hing', 3),
 (u'leuks', 3),
 (u'gipsey', 3),
 (u'dog-rose', 3),
 (u"choak'd", 3),
 (u'chelping', 3),
 (u'bevering', 2),
 (u'yellowcups', 2),
 (u'thegither', 2),
 (u'abouten', 2),
 (u'pig-stye', 2),
 (u'gravemounds', 2),
 (u'firetail', 2),
 (u'rewardings', 2)]
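
The same tally can be sketched with the standard library alone. This is a small illustration using `collections.Counter`, which offers the same `most_common` method; the token list and vocabulary below are invented for the example.

```python
from collections import Counter

# Hypothetical token stream and filtered vocabulary.
sample_tokens = ["oer", "the", "agen", "oer", "mither", "oer", "agen"]
sample_vocab = {"oer", "agen", "mither"}  # a set, so membership tests are fast

# Keep only tokens that belong to the filtered vocabulary, then count them.
kept = [t for t in sample_tokens if t in sample_vocab]
top_two = Counter(kept).most_common(2)
```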

## Clustering Terms into Linguistic Palettes

Here I very slightly adapt some code [posted to StackExchange](https://stats.stackexchange.com/a/158090) by [Lyndon White](https://stats.stackexchange.com/users/36769/lyndon-white).  It uses Affinity Propagation with Levenshtein distance to cluster terms that are orthographically similar.  I've separated words that are compounds from those that are not. 

In my mind, the resulting clusters together form _word-palettes_.  Organized but messy.  Discretized but blended. Ready-to-hand.

In [34]:
import numpy as np
import sklearn.cluster
import editdistance

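words = [w for w in clariest_clare_vocab if "-" not in w]  ## non-compound words only
words = np.asarray(words) 
## affinity propagation expects similarities (higher = closer), so negate the edit distance
lev_similarity = -1*np.array([[int(editdistance.eval(w1,w2)) for w1 in words] for w2 in words])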

affprop = sklearn.cluster.AffinityPropagation(affinity="precomputed", damping=0.5)
affprop.fit(lev_similarity)
for cluster_id in np.unique(affprop.labels_):
    exemplar = words[affprop.cluster_centers_indices_[cluster_id]]
    cluster = np.unique(words[np.nonzero(affprop.labels_==cluster_id)])
    if len(cluster)>1:
        cluster_str = ", ".join(cluster)
        print " - *%s:* %s" % (exemplar, cluster_str) 

 - *causey:* blushy, causey, chuffy, closen, couzen, daiseys, farder, gipsey, husseys, rosey
 - *weals:* deafed, knaps, knarls, leuks, oceanly, weal'd, wealed, weals, wearp, wheatlands, wheats
 - *wates:* babels, daffies, extacies, lambtoe, larkheels, mavis, oaktree, pratensis, swaliest, tauks, turnest, wa'n't, wanness, watchet, waterlilies, wates, whateer, witherest
 - *crizzled:* crimpled, crinked, crizzle, crizzled, drabbled, mizzled
 - *hummings:* hemmings, hummings, huzzing, thumming
 - *uncradled:* enchantedly, suncrackt, uncradled, unmended, upbraideth
 - *soodly:* bonnily, drowny, knowly, pooty, shily, shool, sloomy, soodle, soodles, soodly, stoney, trotty
 - *nipt:* chirrupt, copt, dib, nighty, nip't, nipt, untill, wirl
 - *cowpond:* coppled, cowpond, horsepond, scolloped, snowspots
 - *slive:* ashtree, flaze, ladslove, slive, sliveth, sybilline, ungive, whisp, ye've
 - *pol'ant'us:* pol'ant'us, pol'ant'uses, polyanthus
 - *milkpail:* firetail, milkpail
 - *jolls:* convolvulus
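
To show what the clustering step is actually fed, here is a minimal sketch of the similarity matrix built from a handful of words taken from the clusters above. The `levenshtein` helper is my own pure-Python stand-in for the `editdistance` package, written only for illustration, and the matrix is a plain list of lists rather than a NumPy array.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute (or keep)
        previous = current
    return previous[-1]

sample_words = ["soodle", "soodles", "soodly", "nipt", "nip't"]

# Negated distances: 0 on the diagonal, more negative for less similar pairs.
similarity = [[-levenshtein(w1, w2) for w1 in sample_words] for w2 in sample_words]
```

Because the measure is purely orthographic, words end up together when they look alike, not when they mean similar things. That suits the idea of a palette, but it also explains some of the stranger groupings above.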

In [35]:
words = [c for c in clariest_clare_vocab if "-" in c]
words = np.asarray(words) 
lev_similarity = -1*np.array([[int(editdistance.eval(w1,w2)) for w1 in words] for w2 in words])

affprop = sklearn.cluster.AffinityPropagation(affinity="precomputed", damping=0.5)
affprop.fit(lev_similarity)
for cluster_id in np.unique(affprop.labels_):
    exemplar = words[affprop.cluster_centers_indices_[cluster_id]]
    cluster = np.unique(words[np.nonzero(affprop.labels_==cluster_id)])
    if len(cluster)>1:
        cluster_str = ", ".join(cluster)
        print " - *%s:* %s" % (exemplar, cluster_str) 

 - *water-pots:* hazel-roots, lavender-cotton, miller-thumbs, pale-tops, spire-points, valley-depths, wash-pools, water-porridge, water-pots, water-pudge
 - *haw-tree:* a-dry, cat-tail, haw-tree, hay-time, heart-bred, hive-bees, oak-bridge
 - *rise-not:* dust-spot, frost-nip, grass-hid, mouse-ear, rise-not, rush-beds, time-torn, white-nosed
 - *bird-boy:* barn-hole, bee-fly, bird-boy, hard-burnt, high-lows, horse-boy, king-cup, lily-bud, mid-wood, night-brown, night-sky, passing-by, pig-stye, sheep-boy
 - *dew-laden:* dew-besprent, dew-falling, dew-laden, dew-wet, dog-badger, new-laying, new-leaved
 - *sow-grass:* honey-dreams, ribbon-grass, sheep-trays, slop-frock, snow-cap't, sow-grass, sun-crack'd, wood-grass
 - *power-mocking:* copper-coloured, flower-loving, morrow-morning, power-mocking, wonder-working
 - *blossom-seeking:* blossom-haunting, blossom-seeking, bosom-stirring, woolly-fleecing
 - *pearl-like:* breast-high, buttercup-like, early-risers, heart-glow, honeycomb-like, mis

***