# Notebook to drop junk phrases

Once results have been generated, we have the problem that some phrases detected as "quotations" by text-matcher are not really quotations. Two types of misclassification are common:

- common phrases/idioms in the language, e.g. "there are many ways", "the question to be answered is whether"

- multi-word proper nouns, e.g. book titles (*In Search of Lost Time*), people's names ("Kwame Anthony Appiah"), place names ("Place de la Madeleine")

This notebook allows the user to identify frequently recurring junk phrases and drop all instances of them from the results JSONL file before proceeding to phase 3 (results analysis).

## Initial setup

In [2]:
#Troubleshooting: !jupyter notebook --NotebookApp.iopub_data_rate_limit=1.0e10 if you 
# get a message about the data rate limit
from matcher import Text, Matcher
import json
import pandas as pd
import numpy as np
from IPython.display import clear_output

In [3]:
# ACTION: copy path to corpus data JSONL file here (filename should end "_fulltext.jsonl")

startData = "/Users/milan/Library/CloudStorage/GoogleDrive-mtt2126@columbia.edu/My Drive/iAnnotate/MIT/Quotable Content/Data/Proust/1922_SwannsMoncrieff/Results/Proust_1922_SwannsMoncrieff_results_t2-c3-n2-m3-nostops.jsonl"

In [4]:
# Infer naming variables from path

textTitle = startData.rsplit("_", 4)[-3]
publicationYear = startData.rsplit("_", 4)[-4]
authorSurname = startData.rsplit("_", 4)[-5]
authorSurname = authorSurname.rsplit("/", 1)[-1]
hyperparSuffix = startData.rsplit("_", 4)[-1]
hyperparSuffix = f"_{hyperparSuffix[:-6]}"
dataDir = startData.rsplit("/", 4)[0]

print(f"Author surname: {authorSurname}\nPublication year: {publicationYear}\nText title: {textTitle}\nHyperparameters suffix: {hyperparSuffix}\nData directory:{dataDir}")

projectName = f"{authorSurname}_{publicationYear}_{textTitle}"
sourceDir = f"{dataDir}/{authorSurname}/{publicationYear}_{textTitle}/SourceText"
corpusDir = f"{dataDir}/{authorSurname}/{publicationYear}_{textTitle}/TargetCorpus"
resultsDir = f"{dataDir}/{authorSurname}/{publicationYear}_{textTitle}/Results"

Author surname: Proust
Publication year: 1922
Text title: SwannsMoncrieff
Hyperparameters suffix: _t2-c3-n2-m3-nostops
Data directory:/Users/milan/Library/CloudStorage/GoogleDrive-mtt2126@columbia.edu/My Drive/iAnnotate/MIT/Quotable Content/Data


## Drop phrases

In [6]:
# Read full text and results

sourceText = f"{sourceDir}/{projectName}_plaintext.txt"
with open(str(sourceText), encoding='utf-8') as f:
    rawText = f.read()


df = pd.read_json(startData, lines=True)

In [7]:
# Tally matches

# Calculate length of source text

textALength = len(rawText)

# Make an empty array the size of the text

tally = np.zeros(textALength)
#tally = [0] * textALength

# Read the matched locations from the results dataset, and literally evaluate them into lists. 

locationsInA = df['Locations in A']

# Tally up every time a letter in the text is quoted. 
for article in locationsInA: 
    for locRange in article: 
         for i in range(locRange[0], min(locRange[1]+1, len(tally))):
                tally[i] += 1

In [8]:
# Specify the range of frequencies to examine (e.g. 0 to 10 for top 10 most frequent)

freqFrom = 0
freqTo = 15

In [9]:
# Identify highest frequencies in descending order
topFreqs = sorted(set(tally), reverse=True)[freqFrom:freqTo]

print(topFreqs)

[114.0, 110.0, 85.0, 52.0, 45.0, 43.0, 42.0, 38.0, 35.0, 34.0, 27.0, 25.0, 24.0, 23.0, 21.0]


In [None]:
for freq in topFreqs:
    quotedRange = np.where(tally == freq)[0].tolist()
    if quotedRange[-1] - quotedRange[0] +1 == len(quotedRange):
        print(f"Number of quotations: {freq}")
        print(f"Character indices: {quotedRange[0]}:{quotedRange[-1]}")
        print(f"Quoted passage: '{rawText[quotedRange[0]:quotedRange[-1]]}'\n")
    else:
        # this for loop splits a discontinuous range into two
        # 🚨 needs updating for cases where >2 breaks exist in the range
        for i in range(2, len(quotedRange)):
            if quotedRange[i] - quotedRange[i-1] != 1:
                splitPoint = i
                firstQuot = quotedRange[:splitPoint]
                secondQuot = quotedRange[splitPoint:]
                print(f"Number of quotations: {freq}")
                print(f"Multiple character indices: {quotedRange[0]}:{quotedRange[splitPoint-1]} and {quotedRange[splitPoint]}:{quotedRange[-1]}")
                print(f"Quoted passages:\n'{rawText[firstQuot[0]:firstQuot[-1]]}'\n'{rawText[secondQuot[0]:secondQuot[-1]]}'\n")


Number of quotations: 114.0
Character indices: 157027:157041
Quoted passage: 'there are many'

Number of quotations: 110.0
Character indices: 157042:157048
Quoted passage: 'others'

Number of quotations: 85.0
Character indices: 761642:761652
Quoted passage: 'du
Bois-de'

Number of quotations: 52.0
Character indices: 761653:761661
Quoted passage: 'Boulogne'

Number of quotations: 45.0
Character indices: 669031:669051
Quoted passage: 'Place de la Concorde'

Number of quotations: 43.0
Character indices: 86204:86221
Quoted passage: 'years have passed'

Number of quotations: 42.0
Character indices: 761635:761641
Quoted passage: 'Avenue'

Number of quotations: 38.0
Multiple character indices: 86222:86227 and 365944:365966
Quoted passages:
'since'
'Saint-André-des-Champs'

Number of quotations: 35.0
Character indices: 835306:835327
Quoted passage: 'Mme. de
Saint-Euverte'

Number of quotations: 34.0
Character indices: 94243:94270
Quoted passage: 'would have liked me to have'

Number of quotati

In [1]:
# Make list of all indices matching the specified frequencies

quotedRange = []

for f in topFreqs:
    quotedRange.append(np.where(tally == f)[0].tolist())

NameError: name 'topFreqs' is not defined

In [307]:
# Split sublists at non-consecutive indices (i.e. multiple quoted passages coincidentally with the same frequency)

res = []
tmp = []
prv = quotedRange[0][0]
for r in quotedRange:
    for l in r:
        if l-prv > 1:
            res.append(tmp)
            tmp = []
        tmp.append(l)
        prv = l
    res.append(tmp)

In [None]:
# Print frequently quoted passages with some context left and right

for n in range(len(res)):
    print(f"""
    {rawText[res[n][0]-100:res[n][0]]}
    \033[1m{rawText[res[n][0]:res[n][-1]]}\033[0m
    {rawText[res[n][-1]:res[n][-1]+100]}
    ---""")



    students of narrative.
JONATHAN CULLER
Ithaca, New York
5 See Derrida's Of Grammatology (Baltimore: 
    [1m[0m
     IN METHOD
Translated by Jane E. Lewin
Foreword by Jonathan Culler
CORNELL UNIVERSITY PRESS
ITHACA, 
    ---

    students of narrative.
JONATHAN CULLER
Ithaca, New York
5 See Derrida's Of Grammatology (Baltimore: 
    [1m[0m
     IN METHOD
Translated by Jane E. Lewin
Foreword by Jonathan Culler
CORNELL UNIVERSITY PRESS
ITHACA, 
    ---

    students of narrative.
JONATHAN CULLER
Ithaca, New York
5 See Derrida's Of Grammatology (Baltimore: 
    [1m[0m
     IN METHOD
Translated by Jane E. Lewin
Foreword by Jonathan Culler
CORNELL UNIVERSITY PRESS
ITHACA, 
    ---

    students of narrative.
JONATHAN CULLER
Ithaca, New York
5 See Derrida's Of Grammatology (Baltimore: 
    [1m[0m
     IN METHOD
Translated by Jane E. Lewin
Foreword by Jonathan Culler
CORNELL UNIVERSITY PRESS
ITHACA, 
    ---

    students of narrative.
JONATHAN CULLER
Ithaca, New York
5 See Derri

In [289]:
# ACTION: specify a phrase to drop here

dropPhrase = "Mrs. Dalloway said she would buy the flowers herself"

In [290]:
# Check location(s) of phrase

import re

phraseIndices = []

for match in re.finditer(dropPhrase, rawText, re.IGNORECASE):
    startIndex = match.start()
    endIndex = match.end()
    indexTuple = (startIndex, endIndex)
    phraseIndices.append(indexTuple)
    print(f"""Matched phrase at {startIndex}:{endIndex}\n
    {rawText[startIndex-100:startIndex]}\033[1m{rawText[startIndex:endIndex]}\033[0m{rawText[endIndex:endIndex+100]}""")

Matched phrase at 0:52

    [1mMrs. Dalloway said she would buy the flowers herself[0m.

For Lucy had her work cut out for her. The doors would be taken off their hinges; Rumpelmayer's m


In [291]:
print(phraseIndices)

[(0, 52)]


In [294]:
# NOT YET WORKING

# Find "Locations in A" that contain any tuple from phraseIndices

df2 = df.explode(['Locations in A', 'Locations in B'])

df2[df2["Locations in A"].isin(phraseIndices)]

Unnamed: 0,creator,datePublished,Year,Decade,docSubType,docType,id,identifier,isPartOf,issueNumber,...,volumeNumber,wordCount,numMatches,Locations in A,Locations in B,doi,keyphrase,abstract,placeOfPublication,subTitle


In [283]:
# Explode lists of matches to be new entry each

for row in df['Locations in A'][3]:
    for tuple in row:
        if tuple == [1831, 2251]:
            print("Match detected!")
        else:
            print("No matches detected!")

No matches detected!
No matches detected!
No matches detected!
No matches detected!
No matches detected!
No matches detected!
No matches detected!
No matches detected!


In [None]:
# Detect character ranges in Locations in A (+-1); report number of hits

In [2]:
# Delete character ranges from Locs in A and B
# Append string to existing list; save as csv; coded to work repeatedly as new phrases added

In [None]:
# Save results jsonl with _phrasesdropped; coded to work repeatedly