# Checking and retrieving character indexes from quotations


What you will need to run this notebook:

+ The Project Gutenberg fulltext of your source text (text A). In this case, the Project Gutenberg version of *Middlemarch*: `middlemarch.txt`
+ The JSON file with the output of `text-matcher`. In this case, this is `default.json`

Both of these files must be in the same directory as this notebook for the filepaths below to run correctly.


In addition, you will need a list of the JSTOR article ids for the sample texts in the corpus.


### A preliminary note about character indexes

A match in `text-matcher` takes the form of a pair, or a list of pairs, of character indexes. These character indexes store the position of a match and can be used to retrieve the corresponding text.

Suppose the output is `[[173657, 173756], [292143, 292406]]`.

In each pair, the first number corresponds to the **starting character index**, and the second number corresponds to the **ending character index** of a quotation. 

So in this example, for the match `[173657, 173756]`:
+ the **starting character index** is 173657
+ the **ending character index** is 173756
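
As a minimal sketch of how such pairs map back to text, the snippet below slices a toy string with Python's half-open `[start:end]` slicing, which is how the cells later in this notebook use the pairs. The names `sample_text` and `sample_matches` are invented for illustration and are not part of `text-matcher`.

```python
# Toy stand-in for a source text; in the notebook this would be the full novel.
sample_text = "Who that cares much to know the history of man"

# Hypothetical match output: a list of [start, end] character-index pairs.
sample_matches = [[4, 14], [32, 39]]

# Python slices are half-open: text[start:end] includes `start` and excludes `end`.
snippets = [sample_text[start:end] for start, end in sample_matches]
for (start, end), snippet in zip(sample_matches, snippets):
    print(f"[{start}, {end}] -> {snippet!r}")
```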

### Import libraries
Run the cell below to import libraries

In [1]:
from text_matcher.matcher import Text, Matcher
import json
import pandas as pd
from IPython.display import clear_output
%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = [16, 6]
#pd.set_option('display.max_colwidth', None)

### Load in our data files:

In [125]:
# Load Middlemarch .txt file 
# (Note: must have 'middlemarch.txt' in this directory)
with open('middlemarch.txt') as f: 
    rawMM = f.read()

mm = Text(rawMM, 'Middlemarch')

# Load in the JSON file with our JSTOR articles and data from TextMatcher
# (Note: must have the file 'default.json' in the same directory as this notebook)
#df = pd.read_json('default.json')
df = pd.read_json('hyperparameter-data/t2-c3-n2-m3-no-stops.json')

In [126]:
# Let's peek inside our DataFrame
df

Unnamed: 0,creator,datePublished,docSubType,docType,id,identifier,isPartOf,issueNumber,language,outputFormat,...,title,url,volumeNumber,wordCount,numMatches,Locations in A,Locations in B,abstract,keyphrase,subTitle
0,[Rainer Emig],2006-01-01,book-review,article,http://www.jstor.org/stable/41158244,"[{'name': 'issn', 'value': '03402827'}, {'name...",Amerikastudien / American Studies,3,[eng],"[unigram, bigram, trigram]",...,Review Article,http://www.jstor.org/stable/41158244,51,1109,1,"[[130022, 130046]]","[[6851, 6875]]",,,
1,[Martin Green],1970-01-01,book-review,article,http://www.jstor.org/stable/3722819,"[{'name': 'issn', 'value': '00267937'}, {'name...",The Modern Language Review,1,[eng],"[unigram, bigram, trigram]",...,Review Article,http://www.jstor.org/stable/3722819,65,1342,0,[],[],,,
2,[Richard Exner],1982-01-01,book-review,article,http://www.jstor.org/stable/40137021,"[{'name': 'issn', 'value': '01963570'}, {'name...",World Literature Today,1,[eng],"[unigram, bigram, trigram]",...,Review Article,http://www.jstor.org/stable/40137021,56,493,0,[],[],,,
3,[Ruth Evelyn Henderson],1925-10-01,research-article,article,http://www.jstor.org/stable/802346,"[{'name': 'issn', 'value': '00138274'}, {'name...",The English Journal,8,[eng],"[unigram, bigram, trigram, fullText]",...,American Education Week--November 16-22; Some ...,http://www.jstor.org/stable/802346,14,2161,0,[],[],,,
4,[Alan Palmer],2011-12-01,research-article,article,http://www.jstor.org/stable/10.5325/style.45.4...,"[{'name': 'issn', 'value': '00394238'}, {'name...",Style,4,[eng],"[unigram, bigram, trigram]",...,Rejoinder to Response by Marie-Laure Ryan,http://www.jstor.org/stable/10.5325/style.45.4...,45,1127,0,[],[],,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5879,[Michaela Giesenkirchen],2005-10-01,research-article,article,http://www.jstor.org/stable/27747183,"[{'name': 'issn', 'value': '15403084'}, {'name...",American Literary Realism,1,[eng],"[unigram, bigram, trigram]",...,Ethnic Types and Problems of Characterization ...,http://www.jstor.org/stable/27747183,38,7349,1,"[[23799, 24121]]","[[41472, 41793]]",,,
5880,[Leon Botstein],2005-07-01,misc,article,http://www.jstor.org/stable/4123220,"[{'name': 'issn', 'value': '00274631'}, {'name...",The Musical Quarterly,2,[eng],"[unigram, bigram, trigram]",...,On the Power of Music,http://www.jstor.org/stable/4123220,88,1525,0,[],[],,,
5881,[Linda M. Shires],2013-01-01,research-article,article,http://www.jstor.org/stable/24575734,"[{'name': 'issn', 'value': '10601503'}, {'name...",Victorian Literature and Culture,4,[eng],"[unigram, bigram, trigram]",...,"HARDY'S MEMORIAL ART: IMAGE AND TEXT IN ""WESSE...",http://www.jstor.org/stable/24575734,41,10736,1,"[[173657, 173756]]","[[33963, 34061]]",,,
5882,[Edward H. Cohen],1990-07-01,misc,article,http://www.jstor.org/stable/3827815,"[{'name': 'issn', 'value': '00425222'}, {'name...",Victorian Studies,4,[eng],"[unigram, bigram, trigram]",...,Victorian Bibliography for 1989,http://www.jstor.org/stable/3827815,33,81819,0,[],[],,,


# Check quotation matches for particular articles


## Set the `article_id` ‼️

In the cell below, change the variable `article_id` to the id of the article you wish to examine.

**Where can I find the article id?**

+ This can be found in the `id` column of the DataFrame, which holds the JSTOR URL of each article.
+ For *Middlemarch*, please use the following article IDs: 
http://www.jstor.org/stable/41059781,
http://www.jstor.org/stable/2928567,
http://www.jstor.org/stable/25088885,
http://www.jstor.org/stable/462077,
http://www.jstor.org/stable/42827730,
http://www.jstor.org/stable/2933477,
http://www.jstor.org/stable/2873079,
http://www.jstor.org/stable/2932968,
http://www.jstor.org/stable/42827900,
http://www.jstor.org/stable/10.1525/ncl.2001.56.2.160,
http://www.jstor.org/stable/437748,
http://www.jstor.org/stable/27919123,
http://www.jstor.org/stable/2872038,
http://www.jstor.org/stable/3044620,
http://www.jstor.org/stable/591341,
http://www.jstor.org/stable/4334358,
http://www.jstor.org/stable/2933096,
http://www.jstor.org/stable/23539270,
http://www.jstor.org/stable/3751142,
http://www.jstor.org/stable/3825796,
http://www.jstor.org/stable/3826242,
http://www.jstor.org/stable/2932697,
http://www.jstor.org/stable/40754482,
http://www.jstor.org/stable/10.1525/ncl.2012.66.4.494,
http://www.jstor.org/stable/3828324,
http://www.jstor.org/stable/23099626,
http://www.jstor.org/stable/42965156,
http://www.jstor.org/stable/j.ctt155j8bf.9,
http://www.jstor.org/stable/3044863,
http://www.jstor.org/stable/2873139,
http://www.jstor.org/stable/3044571,
http://www.jstor.org/stable/29533514,
http://www.jstor.org/stable/42827934,
http://www.jstor.org/stable/43028240,
http://www.jstor.org/stable/30030019,
http://www.jstor.org/stable/40549795,
http://www.jstor.org/stable/25733489,
http://www.jstor.org/stable/1345484,
http://www.jstor.org/stable/27708593,
http://www.jstor.org/stable/27708062,
http://www.jstor.org/stable/3044589,
http://www.jstor.org/stable/42827827,
http://www.jstor.org/stable/25459494,
http://www.jstor.org/stable/439034


*Note: JSTOR outputs the full text of an article as a list of strings, so we have to concatenate them using text-matcher's `Text()` function.*
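
To illustrate what concatenation means here, the sketch below joins a toy list of page strings with plain Python. It is only an approximation under stated assumptions: the `pages` list and the `join_full_text` helper are invented, and the single-space separator is an assumption, since the separator that `Text()` actually uses is not shown in this notebook. Character indexes depend on the separator, so indexes computed this way may not agree with `text-matcher`'s output.

```python
def join_full_text(full_text, separator=" "):
    """Return one string from a list of strings; pass plain strings through unchanged.

    The separator is an assumption made for this sketch, not text-matcher's behavior.
    """
    if isinstance(full_text, list):
        return separator.join(full_text)
    return full_text

# Hypothetical two-page article, shaped like a list-valued `fullText` field.
pages = [
    "Page one of a hypothetical article.",
    "Page two continues the argument.",
]
joined = join_full_text(pages)
print(len(joined), repr(joined[:40]))
```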


In [140]:
# ‼️ 🛑 Make sure to change the variable below to the correct article id 🛑  ‼️
article_id  = 'http://www.jstor.org/stable/10.5325/georelioghlstud.70.2.0143' # CHANGE THIS to article id
default_df = pd.read_json('default.json')

# Use article_id to get the index of the article in each DataFrame
article_index1 = df[df['id'] == article_id].index[0]
article_index2 = default_df[default_df['id'] == article_id].index[0]
article_text = default_df['fullText'].loc[article_index2]
# Look up the title in `df` with the index found in `df` (not the one from `default_df`)
article_title = df['title'].loc[article_index1]

# Assign the full text of this article to a variable called `cleaned_article_text`, with text-matcher's Text function
cleaned_article_text = Text(article_text, article_title)

# Print out the title and ID of the article we selected as confirmation
print(f"""
Article selected:
ID: {article_id}
Title: {article_title}
""")



Article selected:
ID: http://www.jstor.org/stable/10.5325/georelioghlstud.70.2.0143
Title: Hidden Allusion in the Finale of <em>Middlemarch</em>: George Eliot and the Jewish Myth of the <em>Lamed Vov</em>



In [103]:
df[df['id'] ==  'http://www.jstor.org/stable/44371993']

Unnamed: 0,creator,datePublished,docSubType,docType,id,identifier,isPartOf,issueNumber,language,outputFormat,...,title,url,volumeNumber,wordCount,numMatches,Locations in A,Locations in B,abstract,keyphrase,subTitle
1219,[Carol-Ann Farkas],2000-01-01,research-article,article,http://www.jstor.org/stable/44371993,"[{'name': 'issn', 'value': '00849812'}, {'name...",Dickens Studies Annual,,[eng],"[unigram, bigram, trigram]",...,Beauty is as Beauty Does: Action and Appearanc...,http://www.jstor.org/stable/44371993,29,13529,8,"[[1303, 1639], [1840, 2312], [1161110, 1161366...","[[43890, 44226], [44234, 44706], [46016, 46270...","In Jane Eyre and Villette, and Middlemarch and...",,


## Part 1: Get quotes (& their character indexes) from `text-matcher` output


### What are the index positions of matches in our source text (Text "A")?
Retrieve the character indexes for the source text (Text A):

In [93]:
# Note: ids in the DataFrame use the 'http://' scheme, so an 'https://' id would not match any row
article_id  = 'http://www.jstor.org/stable/44371993'

In [136]:
# What are the locations in A?
print("Middlemarch character indexes:")
df.loc[df['id'] == article_id, 'Locations in A'].item()

Middlemarch character indexes:


[[1792915, 1793447]]

### What's the text of one of those matches?

Let's check the corresponding text in Middlemarch for one of the matches output above.  
Change the start and end character indexes to one of the index ranges in the cell above. 

In [132]:
#‼️ 🛑 IMPORTANT: Change the start and end character indexes to one of the outputs above

mm_start = 1793108  # 🛑 REPLACE the number with one of the starting character indexes
mm_end =  1793148 # 🛑 REPLACE the number with one of the ending character indexes

# Output the text in "A" for the start and end characters selected above
print("Middlemarch character indexes:", f"[{mm_start}, {mm_end}]")
mm.text[mm_start:mm_end]

Middlemarch character indexes: [1793108, 1793148]


'great name on\nthe earth.  But the effect'

In [137]:
#‼️ 🛑 IMPORTANT: Change the start and end character indexes to one of the outputs above

mm_start = 1792915  # 🛑 REPLACE the number with one of the starting character indexes
mm_end =  1793447 # 🛑 REPLACE the number with one of the ending character indexes

# Output the text in "A" for the start and end characters selected above
print("Middlemarch character indexes:", f"[{mm_start}, {mm_end}]")
mm.text[mm_start:mm_end]

Middlemarch character indexes: [1792915, 1793447]


'finely touched spirit had still its fine issues, though they were\nnot widely visible.  Her full nature, like that river of which Cyrus\nbroke the strength, spent itself in channels which had no great name on\nthe earth.  But the effect of her being on those around her was\nincalculably diffusive: for the growing good of the world is partly\ndependent on unhistoric acts; and that things are not so ill with you\nand me as they might have been, is half owing to the number who lived\nfaithfully a hidden life, and rest in unvisited tombs'
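
Rather than editing `mm_start` and `mm_end` by hand for each match, every pair can be printed in one pass. The sketch below assumes a plain string and a list of pairs; `show_matches`, `toy_text`, and `toy_locations` are hypothetical names. In the notebook, the analogous inputs would be `mm.text` and the list returned from the `Locations in A` column.

```python
def show_matches(text, locations, max_chars=80):
    """Return (start, end, preview) for each [start, end] pair in `locations`."""
    results = []
    for start, end in locations:
        snippet = text[start:end]
        # Collapse newlines and repeated spaces so each preview fits on one line.
        preview = " ".join(snippet.split())[:max_chars]
        results.append((start, end, preview))
    return results

toy_text = "Her full nature,\nlike that river of which Cyrus\nbroke the strength"
toy_locations = [[0, 16], [17, 47]]
for start, end, preview in show_matches(toy_text, toy_locations):
    print(f"[{start}, {end}]: {preview}")
```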

### What are the index positions of matches in our target text (Text "B")?
Retrieve the character indexes in the B text (that is, the article):

In [138]:
# What are the locations in B?
print(f"Character index locations for {article_id}:")
df.loc[df['id'] == article_id, 'Locations in B'].item()

Character index locations for http://www.jstor.org/stable/10.5325/georelioghlstud.70.2.0143:


[[350, 876]]

### What's the text of one of those matches in Text "B" (the article)?
Change the start and end character indexes to one of the index ranges in the cell above.

In [141]:
#‼️ 🛑 IMPORTANT: Change the start and end character indexes to one of the outputs above

textB_start = 350 # 🛑 REPLACE the number to the left with one of the starting character indexes
textB_end = 876 # 🛑 REPLACE the number to the left with one of the ending character indexes

# Output the text in "B" for the start and end characters selected above 
print(f"Character index locations for {article_id}:", f"[{textB_start}, {textB_end}]")
cleaned_article_text.text[textB_start:textB_end]

Character index locations for http://www.jstor.org/stable/10.5325/georelioghlstud.70.2.0143: [350, 876]


'finely-touched spirit had still its fine issues, though they were not widely visible. Her full nature, like that river of which Cyrus broke the strength, spent itself in channels which had no great name on earth. But the effect of her being on those around her was incalculably diffusive: for the growing good of the world is partly dependent on unhistoric acts; and that things are not so ill with you and me as they might have been, is half owing to the number who lived faithfully a hidden life, and rest in unvisited tombs'
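
The two passages above differ slightly (line breaks in the Project Gutenberg text, a hyphen in "finely-touched", "on the earth" versus "on earth"), so exact string equality is not a useful check. One possible way to compare them, sketched below with Python's standard `difflib`, is to normalize whitespace and compute a similarity ratio. This is an illustration only, not how `text-matcher` itself scores matches; `normalize` and `similarity` are hypothetical helpers.

```python
from difflib import SequenceMatcher

def normalize(text):
    """Lowercase and collapse all runs of whitespace (including newlines) to one space."""
    return " ".join(text.lower().split())

def similarity(a, b):
    """Return a ratio in [0, 1]; 1.0 means the normalized strings are identical."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

# Short excerpts of the two passages printed above.
text_a = "spent itself in channels which had no great name on\nthe earth.  But the effect"
text_b = "spent itself in channels which had no great name on earth. But the effect"
print(round(similarity(text_a, text_b), 3))
```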

---

## Find the index positions of a given quotation

To establish all of the "ground truth" quotations (and their character indexes), we'll want to get the character indexes not just for quotations that text-matcher successfully matched, but for *all* quotations in that article.

To retrieve the character indexes for every quotation in an article that a human reader can identify, follow the steps below.


### Step 1: Locate the quotation in the PDF of the article.

### Step 2: Locate the text of that quotation as it appears in the `fullText` field of the JSON file
(🛑 Make sure you've entered the `article_id` for the article in the section "Set the `article_id`", first!!)  
Run the cell below, and then use "CTRL+F" in your browser to find the quotation as it appears in the article text.

In [None]:
print(cleaned_article_text.text)

### Step 3: Copy the text of the quotation exactly as it appears in the article text above.

### Step 4: Paste the text of the quotation in the `quotation` field below
Make sure that you enclose the quotation in quotation marks.

If there are quotation marks in the text of the quote, either place an escape character `\` in front of them, or change the quotation marks that you use. (E.g., if there are single quotes (`'`) in the text, use double quotes (`"`) to surround the text.)
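
These options look like this in Python. The quoted sentence is an invented example, not a quotation from the corpus, and the triple-quoted form is an additional option not mentioned above.

```python
# Option 1: escape the inner double quotes with a backslash.
escaped = "She said, \"it is a narrow mind\" and left."

# Option 2: switch the outer quotation marks to single quotes.
swapped = 'She said, "it is a narrow mind" and left.'

# Option 3 (extra): triple quotes allow both kinds of quotation mark inside.
triple = """She said, "it isn't a narrow mind" and left."""

# Options 1 and 2 produce exactly the same string.
print(escaped == swapped)
```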

Run the cell below.

In [12]:
# PASTE the quotation below in the field, replacing the text below ‼️
# Make sure to include quotation marks around the string
quotation = "All of us, grave or light, get our thoughts entangled in metaphors and act fatally on the strength"

# rindex() returns the start of the LAST occurrence and raises ValueError if the quotation is not found
index = cleaned_article_text.text.rindex(quotation)
print(f"Article id: {article_id}")
print('Starting index:', index) 
print('Ending index:', index + len(quotation))
print(f'Character indexes for match: [{index}, {index + len(quotation)}]')
print("\n Corresponding text:")
cleaned_article_text.text[index:index + len(quotation)]



Article id: http://www.jstor.org/stable/30030019
Starting index: 14718
Ending index: 14816
Character indexes for match: [14718, 14816]

 Corresponding text:


'All of us, grave or light, get our thoughts entangled in metaphors and act fatally on the strength'
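
Because `rindex` returns only the last occurrence of the quotation, an article that quotes the same passage more than once would yield a single pair. As a sketch, a loop over `str.find` can collect every occurrence. The helper name `find_all` and the toy text are hypothetical.

```python
def find_all(text, quotation):
    """Return [start, end] pairs for every occurrence of `quotation` in `text`."""
    pairs = []
    if not quotation:
        return pairs  # an empty search string would match everywhere
    start = text.find(quotation)
    while start != -1:
        pairs.append([start, start + len(quotation)])
        # Resume searching one character later so overlapping hits are also found.
        start = text.find(quotation, start + 1)
    return pairs

toy_article = "a hidden life ... and again, a hidden life"
print(find_all(toy_article, "hidden life"))
```

An empty list signals that the quotation was not found, which avoids the `ValueError` that `rindex` raises in that case.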

### Step 5: Record the character indexes and article id in spreadsheet
Add the character indexes and article ID as a new row in a spreadsheet
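
If the spreadsheet is kept as a CSV file, one possible way to append a row uses Python's standard `csv` module. The file name `ground_truth.csv`, the column order, and the `record_match` helper are assumptions for illustration; adapt them to the spreadsheet actually in use.

```python
import csv
import os

def record_match(csv_path, article_id, start, end):
    """Append one row (article id, start index, end index) to `csv_path`."""
    is_new = not os.path.exists(csv_path)
    with open(csv_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        if is_new:
            # Write a header row the first time the file is created.
            writer.writerow(["article_id", "start_index", "end_index"])
        writer.writerow([article_id, start, end])

# Example call, using the values printed in Step 4 (hypothetical file name):
# record_match("ground_truth.csv", "http://www.jstor.org/stable/30030019", 14718, 14816)
```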

In [74]:
quotation = "CHAPTER LIX."

index = mm.text.rindex(quotation)
print('Starting index:', index) 
print('Ending index:', index + len(quotation))

Starting index: 1278826
Ending index: 1278838


In [51]:
len("helped to fill some dull blanks with love and knowledge, had not yet penetrated the times with its leaven and entered into everybody's food; it was fermenting still as a distinguishable vigorous enthusiasm in certain long-haired German artists at Rome,")

252

In [75]:
mm.text[1236993:1278826]

'CHAPTER LVIII.\n\n    "For there can live no hatred in thine eye,\n     Therefore in that I cannot know thy change:\n     In many\'s looks the false heart\'s history\n     Is writ in moods and frowns and wrinkles strange:\n     But Heaven in thy creation did decree\n     That in thy face sweet love should ever dwell:\n     Whate\'er thy thoughts or thy heart\'s workings be\n     Thy looks should nothing thence but sweetness tell."\n                                       --SHAKESPEARE: Sonnets.\n\n\nAt the time when Mr. Vincy uttered that presentiment about Rosamond,\nshe herself had never had the idea that she should be driven to make\nthe sort of appeal which he foresaw.  She had not yet had any anxiety\nabout ways and means, although her domestic life had been expensive as\nwell as eventful.  Her baby had been born prematurely, and all the\nembroidered robes and caps had to be laid by in darkness.  This\nmisfortune was attributed entirely to her having persisted in going out\non hor

In [52]:
398370+252

398622