# Chapter 1 – Introduction: Must I exist?

In the introduction to *Contingent Selves*, I describe the concept of the 'contingent self', and indicate what makes this concept important for the study of Romantic literature.

As part of this introduction, I analyse a corpus of academic documents from JSTOR Data for Research, and consider how the word 'self' is used in them. This notebook shows the top-level code required to reproduce the analysis in the text. More experienced users can explore the rest of this repository to see the details of the implementation.

In [2]:
import pickle as p
import os

from matplotlib import pyplot as plt
import pandas as pd
import nltk

from utils import JSTORCorpus, TargetedTrigramAssocFinder, RobustBigramAssocMeasures, RobustTrigramAssocMeasures

%matplotlib inline

In [6]:
DATA_DIR = 'data/'
STUB = 'associations-2020-07-07-wn'

corpus = JSTORCorpus.load(DATA_DIR + 'last-15-years-corpus.p')

analysis = {}

for wn in [15,25,35]:
    analysis[wn] = {}
    for file_name in os.listdir(DATA_DIR + STUB + str(wn)):
        with open(DATA_DIR + STUB + str(wn) + '/' + file_name, mode='rb') as file:
            analysis[wn][file_name] = p.load(file)

Corpus loaded from data/last-15-years-corpus.p


# The shape of the data

In [60]:
print(f'Unigrams in the corpus: {analysis[25]["romantic-self-trigrams.p"].N:,}\n')

print(f'Trigrams with "self" and "romantic":')
print(f'    types: {analysis[25]["romantic-self-trigrams.p"].ngram_fd.B():,}')
print(f'    tokens: {analysis[25]["romantic-self-trigrams.p"].ngram_fd.N():,}\n')

print(f'Bigrams with "self" or "romantic":')
print(f'    types: {analysis[25]["romantic-self-trigrams.p"].bigram_fd.B():,}')
print(f'    tokens: {analysis[25]["romantic-self-trigrams.p"].bigram_fd.N():,}')
print(f'Bigrams with "self":')
print(f'    types: {analysis[25]["self-bigrams.p"].ngram_fd.B():,}')
print(f'    tokens: {analysis[25]["self-bigrams.p"].ngram_fd.N():,}')
print(f'Bigrams with "romantic":')
print(f'    types: {analysis[25]["romantic-bigrams.p"].ngram_fd.B():,}')
print(f'    tokens: {analysis[25]["romantic-bigrams.p"].ngram_fd.N():,}')

Unigrams in the corpus: 112,023,237

Trigrams with "self" and "romantic":
    types: 4,594
    tokens: 12,342

Bigrams with "self" or "romantic":
    types: 104,704
    tokens: 2,578,060
Bigrams with "self":
    types: 54,157
    tokens: 666,241
Bigrams with "romantic":
    types: 43,892
    tokens: 548,300
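
The difference between the 'types' and 'tokens' reported above can be seen on a toy example. This is a minimal sketch using `collections.Counter` from the standard library, not the repository's finder objects; the bigram list is invented for illustration. In NLTK's `FreqDist`, `B()` gives the number of distinct items and `N()` the total number of occurrences.

```python
from collections import Counter

# Toy data (invented for illustration): five occurrences of three distinct bigrams.
toy_bigrams = [
    ("romantic", "period"),
    ("romantic", "poetry"),
    ("romantic", "period"),
    ("self", "conscious"),
    ("romantic", "period"),
]

counts = Counter(toy_bigrams)

n_types = len(counts)            # distinct bigrams, analogous to FreqDist.B()
n_tokens = sum(counts.values())  # total occurrences, analogous to FreqDist.N()

print(f"types: {n_types:,}")
print(f"tokens: {n_tokens:,}")
```

A high token-to-type ratio, as in the bigram counts above, means that a relatively small number of distinct bigrams account for most of the occurrences.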


In [41]:
corpus_df = pd.DataFrame.from_dict(corpus.corpus_meta, orient='index')

In [45]:
corpus_df['type'].value_counts()

research-article    6422
book-review         3014
misc                1732
book-chapter         179
other                 59
review-article        42
books-received         3
news                   3
announcement           2
front-matter           2
abstract               1
obituary               1
discussion             1
editorial              1
Name: type, dtype: int64

In [49]:
corpus_df['journal'].value_counts()[:15]

Studies in Romanticism                      553
The Wordsworth Circle                       551
The Modern Language Review                  295
Keats-Shelley Journal                       263
PMLA                                        234
Nineteenth-Century Literature               220
Victorian Studies                           169
The Review of English Studies               166
ELH                                         153
The Slavic and East European Journal        151
The German Quarterly                        145
Studies in English Literature, 1500-1900    144
Modern Philology                            143
German Studies Review                       140
Modern Language Review                      135
Name: journal, dtype: int64

In [48]:
corpus_df['year'].value_counts()

2001    881
2009    850
2002    847
2008    827
2005    818
2013    817
2003    808
2011    806
2007    803
2004    780
2006    769
2012    764
2010    750
2014    487
2015    455
Name: year, dtype: int64

In [55]:
corpus_df.shape[0]

11462

In [59]:
corpus_df[corpus_df['type'] == 'misc'].sample(20)

Unnamed: 0,type,title,year,journal,part-of
data/ocr/journal-article-10.2307_1261338.txt,misc,Divisions and Discussion Groups,2003,PMLA,
data/ocr/journal-article-10.2307_41468101.txt,misc,Notes on Contributors,2010,The Eighteenth Century,
data/ocr/journal-article-10.2307_41319897.txt,misc,Back Matter,2011,Studies in the Novel,
data/ocr/journal-article-10.2307_25601599.txt,misc,Front Matter,2003,Studies in Romanticism,
data/ocr/journal-article-10.2307_3251788.txt,misc,Volume Information,2001,MLN,
data/ocr/journal-article-10.2307_41058287.txt,misc,Front Matter,2010,PMLA,
data/ocr/journal-article-10.2307_3593540.txt,misc,Back Matter,2004,Contemporary Literature,
data/ocr/journal-article-10.2307_25833909.txt,misc,Back Matter,2004,The Year's Work in Modern Language Studies,
data/ocr/journal-article-10.2307_23348595.txt,misc,Living in Poetry,2011,Indian Literature,
data/ocr/journal-article-10.2307_23474940.txt,misc,Back Matter,2012,Italica,


# Stopwords?

What happens if we filter for stopwords?

In [8]:
import re
from nltk.corpus import stopwords
english = stopwords.words('english')

In [9]:
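# Build the stopword set once, rather than on every call to the filter
STOPWORDS = set(stopwords.words('english'))

def stopword_filter(*ngram):
    """Return True if the ngram contains a stopword or a junk token.

    A token counts as junk if it starts with a non-word character or a digit.
    """
    if not STOPWORDS.isdisjoint(ngram):
        return True
    return any(re.match(r'[\W\d]+', wd) for wd in ngram)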

In [11]:
for wn, folder in analysis.items():
    for finder in folder.values():
        finder.apply_ngram_filter(stopword_filter)
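
The effect of this kind of filter can be sketched on toy data. The example below is an illustration only: it uses a plain `Counter` in place of the repository's finders, a hypothetical three-word stopword list in place of NLTK's English list, and invented counts. It relies on the behaviour of NLTK's `apply_ngram_filter`, which removes every ngram for which the filter function returns True.

```python
import re
from collections import Counter

# Hypothetical miniature stopword list; the notebook uses NLTK's English list.
TOY_STOPWORDS = {"the", "of", "and"}

def toy_filter(*ngram):
    """Return True if the ngram should be dropped."""
    if not TOY_STOPWORDS.isdisjoint(ngram):
        return True
    # Same pattern as stopword_filter: flags tokens that START with
    # a non-word character or a digit.
    return any(re.match(r"[\W\d]+", wd) for wd in ngram)

# Invented counts for illustration.
toy_counts = Counter({
    ("romantic", "self"): 4,
    ("the", "self"): 9,        # dropped: contains a stopword
    ("self", "1798"): 2,       # dropped: token starts with a digit
    ("self", "--"): 1,         # dropped: token is punctuation
    ("self", "conscious"): 5,
})

# Keep only the ngrams for which the filter returns False.
kept = Counter({ng: n for ng, n in toy_counts.items() if not toy_filter(*ng)})
print(kept)
```

Because `re.match` only anchors at the start of a token, a token such as 'mid-1790s' would pass the junk test; whether that is desirable depends on how the corpus was tokenised.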

In [47]:
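# NOTE: rom_self_trigrams, romantic_bigrams and self_bigrams are not defined in
# any cell shown in this notebook; they must already exist in the session
# before this cell (and the later cells that use them) can run.
rom_self_trigrams.apply_ngram_filter(stopword_filter)
romantic_bigrams.apply_ngram_filter(stopword_filter)
self_bigrams.apply_ngram_filter(stopword_filter)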

In [28]:
rom_scores = analysis[25]['romantic-bigrams.p'].score_ngrams(RobustBigramAssocMeasures.likelihood_ratio)
self_scores = analysis[25]['self-bigrams.p'].score_ngrams(RobustBigramAssocMeasures.likelihood_ratio)
rom_self_scores = analysis[25]['romantic-self-trigrams.p'].score_ngrams(RobustTrigramAssocMeasures.likelihood_ratio)
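
The scores above come from `RobustBigramAssocMeasures.likelihood_ratio`, which is defined in this repository's `utils` module and is not reproduced here. As a rough guide to what a likelihood-ratio association score measures, the following is a minimal sketch of the textbook log-likelihood ratio (G²) for a 2×2 contingency table. The function name, the argument order and the example counts are assumptions made for illustration, and the repository's 'robust' version may differ in detail.

```python
import math

def log_likelihood_ratio(n_ii, n_oi, n_io, n_oo):
    """Textbook G^2 statistic for a 2x2 table of co-occurrence counts.

    n_ii: both words present     n_oi: second word only
    n_io: first word only        n_oo: neither word
    """
    observed = [n_ii, n_oi, n_io, n_oo]
    total = sum(observed)

    # Marginal totals for the first word (rows) and second word (columns)
    row_with, row_without = n_ii + n_io, n_oi + n_oo
    col_with, col_without = n_ii + n_oi, n_io + n_oo

    # Expected counts if the two words were independent
    expected = [
        row_with * col_with / total,
        row_without * col_with / total,
        row_with * col_without / total,
        row_without * col_without / total,
    ]

    # Cells with an observed count of zero contribute nothing to the sum
    return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)

# Invented counts: an independent table scores zero, an associated one scores higher.
print(log_likelihood_ratio(10, 10, 10, 10))
print(log_likelihood_ratio(30, 5, 5, 60))
```

The score grows with both the strength of the association and the raw frequency of the words, which is one reason very frequent collocates dominate the top of a ranking.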

In [36]:
# Give each score column a distinct name so the CSV can be read back unambiguously
out = "rank,romantic_bg,romantic_score,self_bg,self_score,rom_self_tg,rom_self_score\n"
print("rnk: ? + romantic            : ? + self                  : ? + romantic & self")
top_scores = zip(rom_scores[:30], self_scores[:30], rom_self_scores[:30])
for rank, (rom, slf, rom_self) in enumerate(top_scores, start=1):

    # Each item is an (ngram, score) pair
    rom_bg, r_sc = rom
    self_bg, s_sc = slf
    trigram, t_sc = rom_self

    # The first word of each ngram is the collocate of interest
    r_wd = rom_bg[0]
    s_wd = self_bg[0]
    t_wd = trigram[0]

    out += f"{rank},{r_wd},{r_sc},{s_wd},{s_sc},{t_wd},{t_sc}\n"

    print(f"{rank:<2} : {r_wd:<11} {r_sc:>11,.0f} : {s_wd:<14} {s_sc:>10,.0f} : {t_wd:<13} {t_sc:>10,.0f}")

with open('data/association_analysis_dump.csv', 'wt') as file:
    file.write(out)

rnk: ? + romantic            : ? + self                  : ? + romantic & self
1  : romantic        829,835 : self            1,049,517 : self           1,601,037
2  : period           27,522 : conscious          19,889 : romantic       1,262,582
3  : poetry           22,248 : consciousness      18,869 : period            65,069
4  : era              18,052 : one                16,113 : poetry            61,877
5  : romanticism      15,009 : romantic           14,979 : one               57,504
6  : self             14,979 : identity           10,942 : conscious         52,548
7  : literature       14,803 : consciously        10,924 : consciousness     52,490
8  : poets            13,701 : sense              10,074 : era               50,331
9  : british          12,709 : also                9,845 : literature        48,807
10 : literary         12,026 : world               9,711 : romanticism       47,934
11 : writers          11,453 : reflexive           9,175 : literary          47,5

In [62]:
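def rank_word(scores, word):
    """Return (zero-based rank, score) for the first ngram containing word, or None if absent."""
    for rank, (ngram, score) in enumerate(scores):
        if word in ngram:
            return rank, score
    return None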

In [124]:
rank_word(rom_self_scores, 'necessity')

(1563, 23265.226372689973)

# What words are excluded in the context of 'romantic' and 'self'?

It appears that several words, including 'abnegation' and 'contingent', never occur with both 'romantic' and 'self' within 14 words on either side. Are there other such words?

In [138]:
filtered_rom_scores = []
for ngram, score in rom_scores:
    if (ngram[0],'self','romantic') not in rom_self_trigrams.ngram_fd:
        filtered_rom_scores.append((ngram, score))

filtered_self_scores = []
for ngram, score in self_scores:
    if (ngram[0],'self','romantic') not in rom_self_trigrams.ngram_fd:
        filtered_self_scores.append((ngram, score))
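
The logic of this filter can be checked on toy data. The sketch below uses invented scores and a plain dictionary standing in for `rom_self_trigrams.ngram_fd`. A bigram is kept only if its first word never appears as the first element of a `(word, 'self', 'romantic')` trigram.

```python
# Invented scored bigrams: ((context_word, target), score)
toy_scores = [
    (("period", "romantic"), 27.5),
    (("abrams", "romantic"), 3.1),
    (("poetry", "romantic"), 22.2),
]

# Invented trigram counts, keyed as (context_word, 'self', 'romantic')
toy_trigram_counts = {
    ("period", "self", "romantic"): 3,
    ("poetry", "self", "romantic"): 2,
}

# Keep bigrams whose context word has no matching trigram
excluded = [
    (ngram, score)
    for ngram, score in toy_scores
    if (ngram[0], "self", "romantic") not in toy_trigram_counts
]

print(excluded)
```

Note that the test looks for one exact key, with 'self' and 'romantic' in that order, so it relies on the finder storing its trigrams in that canonical form.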

In [146]:
for idx, (rom, slf) in enumerate(zip(filtered_rom_scores[:50], filtered_self_scores[:50])):
    # Each item is an (ngram, score) pair
    rom_bg, r_sc = rom
    self_bg, s_sc = slf

    # The first word of each bigram is the collocate of interest
    r_wd = rom_bg[0]
    s_wd = self_bg[0]

    print(f"{idx:<2} : {r_wd:<15} {r_sc:>13,.2f} : {s_wd:<15} {s_sc:>13,.2f}")

0  : abrams               3,086.69 : cannot               2,650.72
1  : circles              2,878.80 : preservation         2,531.96
2  : edited               1,718.78 : denial               2,085.75
3  : michael              1,657.13 : make                 2,004.17
4  : ecology              1,484.96 : interested           1,819.75
5  : scholars             1,469.87 : discipline           1,817.74
6  : melancholy           1,442.19 : effacing             1,784.51
7  : praxis               1,426.95 : esteem               1,690.77
8  : writings             1,354.74 : respect              1,621.89
9  : trumpener            1,244.66 : styled               1,531.31
10 : companion            1,234.64 : envelope             1,342.85
11 : err                  1,212.37 : things               1,297.58
12 : anthology            1,195.80 : boundaries           1,283.58
13 : scholarship          1,185.10 : addressed            1,233.29
14 : bate                 1,185.04 : abnegation           1,22

In [167]:
# Contingency table for the bigram ('romantic', 'self')
contingency = list(self_bigrams.score_ngram(w1='romantic', w2='self', score_fn=RobustBigramAssocMeasures._contingency))

In [168]:
contingency

[2378.0, 64569.0, 48458.0, 120826232.0]
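
One way to read these four numbers is to compare the observed co-occurrence count with the count expected if the two words were independent. The sketch below assumes the cells follow NLTK's usual ordering (both words, second word only, first word only, neither); `RobustBigramAssocMeasures._contingency` belongs to this repository and may order them differently, so the labels are an assumption.

```python
# Values copied from the output above. The cell ordering is an ASSUMPTION
# (NLTK's convention: both words, w2 only, w1 only, neither).
n_ii, n_oi, n_io, n_oo = 2378.0, 64569.0, 48458.0, 120826232.0

total = n_ii + n_oi + n_io + n_oo
w1_total = n_ii + n_io  # every count involving w1 ('romantic')
w2_total = n_ii + n_oi  # every count involving w2 ('self')

# Expected joint count if 'romantic' and 'self' were independent
expected_both = w1_total * w2_total / total
ratio = n_ii / expected_both

print(f"total:    {total:,.0f}")
print(f"observed: {n_ii:,.0f}")
print(f"expected: {expected_both:,.2f}")
print(f"observed / expected: {ratio:,.1f}")
```

Under that reading, the two words co-occur far more often than independence would predict, which is consistent with 'self' ranking among the top collocates of 'romantic' in the table earlier in the notebook.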