# Text Matching with Anthologies

These are nineteenth-century anthologies that "excerpt" _Middlemarch_; two of them are discussed in Price.

In [10]:
# The matcher module is provided by the text_matcher package. 
# This one uses the latest version from the submodule, 
# installed with `pip install --editable .`. 
from text_matcher.matcher import Text, Matcher
import json
from glob import glob
import pandas as pd
from IPython.display import clear_output
%matplotlib inline
import matplotlib.pyplot as plt
import spacy
plt.rcParams["figure.figsize"] = [16, 6]

In [2]:
nlp = spacy.load('en')

In [18]:
# Load the data. 

filenames = glob('../anthologies/*')

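# Read each anthology, closing every file handle once it has been read.
anthologies = []
for filename in filenames:
    with open(filename) as f:
        anthologies.append(f.read())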

# Load Middlemarch
with open('../middlemarch.txt') as f: 
    rawMM = f.read()

mm = Text(rawMM, 'Middlemarch', nlp)

In [19]:
labels = [filename.split('/')[2].split('.')[0] for filename in filenames]
labels

['birthday-book', 'witty-middlemarch', 'a-moment']
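
The `split('/')[2]` indexing above only works when each path has exactly two components before the file name. A minimal sketch of a less position-dependent alternative, using `pathlib` from the standard library, might look like this. The example paths and their `.txt` extension are assumptions made for illustration; the real file names are whatever `glob` returns.

```python
from pathlib import Path

# Hypothetical paths standing in for the result of glob('../anthologies/*').
# The .txt extension is an assumption, not taken from the notebook.
example_filenames = [
    '../anthologies/birthday-book.txt',
    '../anthologies/witty-middlemarch.txt',
    '../anthologies/a-moment.txt',
]

# Path.stem is the final path component without its last suffix,
# so it does not depend on how deep the directory is.
labels = [Path(filename).stem for filename in example_filenames]
print(labels)
```

One difference to keep in mind: `split('.')[0]` cuts at the first dot in the file name, while `stem` only drops the last suffix, so the two disagree for names that contain more than one dot.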

In [26]:
matches = {}
for i, article in enumerate(anthologies): 
    print('\r', 'Matching article %s of %s' % (i, len(anthologies)), end='')
    articleText = Text(article, labels[i], nlp)
    numMatches, locationsA, locationsB = Matcher(
        mm, articleText, threshold=2, cutoff=4).match()
    matches[labels[i]] = [numMatches, locationsA, locationsB]

 Matching article 0 of 316 total matches found.
Extending match backwards with words: disappointment appointment
Extending match backwards with words: flowers ﬂowers
Extending match backwards with words: discontent content
Extending match forwards with words: true true
Extending match backwards with words: garden garden
Extending match forwards with words: greater greater
Extending match forwards with words: circumstance circumstance
Extending match forwards with words: strong strong


match 1:
Middlemarch: (123620, 123850) attention than he had done before. We mortals, men and women, devour disappointment between breakfast and dinner-time; keep back the tears and look a little pale about the lips, and in answer to inquiries say, "Oh, nothing!" Pride helps us; and pride is not a bad thing when it only urges us to hide hurts--not to hurt others. CHAPTER VII. "Piacer e
birthday-book: (2735, 2964) mortals, men and women, devour many a dis appointment betwee

In [27]:
df = pd.DataFrame(matches, index=['numMatches', 'Locations in A', 'Locations in B']).T

In [29]:
df

| | numMatches | Locations in A | Locations in B |
|---|---|---|---|
| a-moment | 4 | [(449682, 449755), (602684, 602751), (834712, ... | [(14157, 14230), (22917, 22984), (25481, 25695... |
| birthday-book | 12 | [(123620, 123850), (474599, 474711), (515893, ... | [(2735, 2964), (11241, 11355), (26267, 26336),... |
| witty-middlemarch | 44 | [(56, 707), (720, 1179), (1187, 1643), (1662, ... | [(90, 756), (769, 1228), (1282, 1738), (1758, ... |
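
To get a sense of how long each excerpt is, the location pairs can be turned into lengths. The sketch below assumes each tuple is a `(start, end)` pair of character offsets, which is consistent with the match printed earlier but is not confirmed here. The helper name `span_lengths` is hypothetical and not part of `text_matcher`. Only the first two `birthday-book` spans, which are fully visible in the output above, are used.

```python
def span_lengths(locations):
    """Return end - start for each (start, end) pair.

    Hypothetical helper for illustration; assumes the pairs are offsets
    into the text, with end greater than or equal to start.
    """
    return [end - start for start, end in locations]


# First two birthday-book matches, copied from the output above.
locations_a = [(123620, 123850), (474599, 474711)]  # offsets in Middlemarch
locations_b = [(2735, 2964), (11241, 11355)]        # offsets in the anthology

print(span_lengths(locations_a))  # [230, 112]
print(span_lengths(locations_b))  # [229, 114]
```

The paired lengths are close but not identical, as expected when the anthology text differs slightly from the novel, for instance through hyphenation or OCR errors.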


In [30]:
# Save the match counts and locations as JSON for later use.
df.to_json('../txt/anthologies.json')
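
When the saved file is read back, note that JSON has no tuple type, so each location pair comes back as a two-element list. The sketch below uses a hand-written, abbreviated string in the column-oriented shape that `DataFrame.to_json` writes by default (`{column: {index: value}}`) in place of the real file. Only the first two `a-moment` spans are included, so the sample is a stand-in and not the actual contents of `anthologies.json`.

```python
import json

# Abbreviated stand-in for the saved file (an assumption for illustration):
# the real file holds all three anthologies and every location pair.
sample = """
{
  "numMatches": {"a-moment": 4},
  "Locations in A": {"a-moment": [[449682, 449755], [602684, 602751]]}
}
"""

data = json.loads(sample)

# Convert the two-element lists back into (start, end) tuples.
locations = [tuple(pair) for pair in data["Locations in A"]["a-moment"]]
print(data["numMatches"]["a-moment"], locations)
```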