# Pre-processing procedures

This notebook will set you up to run quotation detection of a source text in a target corpus, processing both as needed and storing them in a convenient location. You should run every cell in this notebook except those marked "OPTIONAL". Cells that say "ACTION" require you to do something within the cell before running it.

# Initial setup

In [9]:
import pandas as pd
import json
import os
import pathlib as pl

In [10]:
# ACTION: Specify info on dataset here so all files and folders will be named consistently
authorSurname = "Price"
publicationYear = "2000"
textTitle = "AnthologyRise"
# NB use one or two unique keywords for text title

In [11]:
projectName = f"{authorSurname}_{publicationYear}_{textTitle}"

print(projectName)

Price_2000_AnthologyRise


In [12]:
# ACTION: Specify a directory for data to be stored

dataDir = "/Users/milan/Library/CloudStorage/GoogleDrive-mtt2126@columbia.edu/My Drive/iAnnotate/MIT/Quotable Content/Data"

# Note: because the source text and corpus contain copyrighted material, they should be stored locally.

In [7]:
# Create subfolders for source text, corpus and results

sourceDir = f"{dataDir}/{authorSurname}/{publicationYear}_{textTitle}/SourceText"
corpusDir = f"{dataDir}/{authorSurname}/{publicationYear}_{textTitle}/TargetCorpus"
resultsDir = f"{dataDir}/{authorSurname}/{publicationYear}_{textTitle}/Results"

os.makedirs(sourceDir, exist_ok=True)
os.makedirs(corpusDir, exist_ok=True)
os.makedirs(resultsDir, exist_ok=True)
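
The notebook imports `pathlib` as `pl` but builds its paths with f-strings. As a minimal sketch of the same folder layout using `pathlib`, the helper below (the name `make_project_dirs` is hypothetical and not part of the notebook) creates the three subfolders and returns their paths. It is demonstrated in a temporary directory rather than the real data folder.

```python
import pathlib as pl
import tempfile

def make_project_dirs(data_dir, author_surname, publication_year, text_title):
    """Create the SourceText, TargetCorpus and Results folders and return their paths."""
    base = pl.Path(data_dir) / author_surname / f"{publication_year}_{text_title}"
    dirs = {name: base / name for name in ("SourceText", "TargetCorpus", "Results")}
    for path in dirs.values():
        # parents=True creates the author and project folders as needed
        path.mkdir(parents=True, exist_ok=True)
    return dirs

# Demonstration in a throwaway directory (an assumption for this sketch)
with tempfile.TemporaryDirectory() as tmp:
    demo_dirs = make_project_dirs(tmp, "Price", "2000", "AnthologyRise")
    demo_names = sorted(p.name for p in demo_dirs.values())
    demo_exists = all(p.is_dir() for p in demo_dirs.values())

print(demo_names)
```

`pathlib` joins path components with the operating system's separator, which may matter if the notebook is run on Windows as well as macOS.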

# Source text

**We recommend running this notebook in a browser with Jupyter Notebook, since the user-input prompts work better there.**

This notebook:

- converts a PDF to a text file
- detects page numbers printed in each page's header or footer
- inserts both PDF page number and printed page number (where found) at each page break in the text file

If multiple potential page numbers are detected, the user will be asked to confirm which of the page numbers is correct. The printed page numbers and the indexed page numbers are added after each page's text in the text file in the following format: 

\~printed_page_number:[NUM HERE]\~ 

\~indexed_page_number:[NUM HERE]\~
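
Later steps may need to read these markers back out of the text file. The following is a minimal sketch of one way to do that, assuming the marker layout described above: each page's text, then an optional printed marker, then the indexed marker. The function name `split_pages` and the sample text are illustrative only.

```python
import re

# Matches both marker kinds: ~printed_page_number:N~ and ~indexed_page_number:N~
PAGE_MARKER = re.compile(r"~(printed|indexed)_page_number:([^~]+)~")

def split_pages(text):
    """Return one dict per page: indexed number, printed number (or None), and page text."""
    pages = []
    page_start = 0    # where the current page's text begins
    printed = None    # printed page number seen for the current page, if any
    body_end = None   # where the current page's text ends, if a printed marker was seen
    for m in PAGE_MARKER.finditer(text):
        kind, value = m.groups()
        if kind == "printed":
            printed, body_end = value, m.start()
        else:
            end = body_end if body_end is not None else m.start()
            pages.append({
                "indexed": int(value),
                "printed": printed,  # kept as a string because it may be a roman numeral
                "text": text[page_start:end].strip(),
            })
            page_start, printed, body_end = m.end(), None, None
    return pages

sample = (
    "Preface text\n~indexed_page_number:1~\n\n"
    "Chapter one\n\n~printed_page_number:3~\n~indexed_page_number:2~\n\n"
)
pages = split_pages(sample)
print(pages)
```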

PDFs are converted to text with the pdfplumber package: https://github.com/jsvine/pdfplumber#extracting-text. Follow its installation instructions before running this notebook. If you are running Jupyter Notebook in your browser, running the cell below installs the package for you.

In [8]:
# run this cell to install packages if you're running jupyter notebook in browser 
import sys
!{sys.executable} -m pip install pdfplumber
import re 
import pdfplumber
import shutil



In [13]:
# ACTION: input the path and filename of the PDF you want to convert here

PDFfile = "/Users/milan/Library/CloudStorage/GoogleDrive-mtt2126@columbia.edu/My Drive/iAnnotate/C19/Price_2000_The_anthology_and_the_rise_of_the_novel.pdf"

In [14]:
# Copy the PDF to the "Source" subfolder and rename

shutil.copyfile(f"{PDFfile}", f"{sourceDir}/{projectName}.pdf")

'/Users/milan/Library/CloudStorage/GoogleDrive-mtt2126@columbia.edu/My Drive/iAnnotate/MIT/Quotable Content/Data/Price/2000_AnthologyRise/SourceText/Price_2000_AnthologyRise.pdf'

In [15]:
full_pdf_text = ""

with pdfplumber.open(f"{sourceDir}/{projectName}.pdf") as pdf:
    for i, page in enumerate(pdf.pages):
        # extract_text() may return None for pages without a text layer in some pdfplumber versions
        full_page = page.extract_text() or ""

        lines = full_page.split("\n")
        first_line = lines[0] # where page numbers at the top of the page are likely to be found
        last_line = lines[-1] # where page numbers at the bottom of the page are likely to be found
        full_pdf_text += full_page

        if len(full_page) != 0:
            # raw strings keep \d and \s as regex escapes rather than (invalid) string escapes
            top_left_page_matches = re.findall(r"^([xXvViIlL]+|\d{1,3})\s", first_line)
            top_right_page_matches = re.findall(r"\s([xXvViIlL]+|\d{1,3})$", first_line)
            bottom_page_matches = re.findall(r"^([xXvViIlL]+|\d{1,3})$", last_line)
            
            all_matches = []
            if top_left_page_matches:
                all_matches.append(top_left_page_matches[0].strip())
            if top_right_page_matches:
                all_matches.append(top_right_page_matches[-1].strip())
            if bottom_page_matches:
                all_matches.append(bottom_page_matches[-1].strip())
                
            if len(all_matches) > 0:
                if len(all_matches) > 1: # more than one potential page number 
                    print(f'\n{len(all_matches)} potential page numbers were found for this page.')
                    print(all_matches)
                    print(f'Check PDF page {i+1} for the correct page number and enter it below. If the page has no printed page number, do not input anything.')
                    correct_page_number = input()
                    print(correct_page_number)
                    if len(correct_page_number.strip()) != 0:
                        printed_page_number = correct_page_number.strip()
                    else:
                        printed_page_number = None
                else:
                    printed_page_number = all_matches[0]

                if printed_page_number is not None: 
                    full_pdf_text += f'\n\n~printed_page_number:{printed_page_number}~'
                
        full_pdf_text += f'\n~indexed_page_number:{i+1}~\n\n'

# save output as a plain text file
rawText = f"{sourceDir}/{projectName}_plaintext.txt"
with open(rawText, "w", encoding="utf-8") as text_file:
    text_file.write(full_pdf_text)
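
To check how the three page-number patterns behave before running them on a whole PDF, they can be pulled into a small function and tried on sample page text. This is a sketch only: the name `find_page_number_candidates` and the sample strings are hypothetical, while the patterns are the ones used in the cell above.

```python
import re

# A roman numeral (letters x, v, i, l in either case) or a 1-3 digit number
PAGE_TOKEN = r"[xXvViIlL]+|\d{1,3}"

def find_page_number_candidates(page_text):
    """Return candidate page numbers from the first and last line of a page."""
    lines = page_text.split("\n")
    first_line, last_line = lines[0], lines[-1]
    top_left = re.findall(rf"^({PAGE_TOKEN})\s", first_line)
    top_right = re.findall(rf"\s({PAGE_TOKEN})$", first_line)
    bottom = re.findall(rf"^({PAGE_TOKEN})$", last_line)
    candidates = []
    if top_left:
        candidates.append(top_left[0])
    if top_right:
        candidates.append(top_right[-1])
    if bottom:
        candidates.append(bottom[-1])
    return candidates

header_left = find_page_number_candidates("12 THE ANTHOLOGY\nbody text\nlast line")
header_right = find_page_number_candidates("Introduction 7\nbody")
ambiguous = find_page_number_candidates("xii PREFACE 4\ntext\n4")
false_positive = find_page_number_candidates("I went home\nand slept")
print(header_left, header_right, ambiguous, false_positive)
```

The last example shows a limitation worth keeping in mind: a page whose first line begins with the word "I" yields a single candidate, and the cell above accepts a single candidate without asking for confirmation.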

In [16]:
rawText = f"{sourceDir}/{projectName}_plaintext.txt"

In [None]:
# Remove book titles (important for scholarly source texts, less important for other text types)

from ipysheet import from_dataframe, to_dataframe

# Read the plain-text file; the with-block closes the file handle afterwards
with open(rawText, "r", encoding="utf8") as booktext:
    test_str = booktext.read()

## Round 1: auto deletion of high accuracy title matching sequences

In [None]:
# Python's re module does not support POSIX classes such as [:lower:], so a-z is used instead.
# Inside a character class, [:lower:] would be read as the literal characters [ : l o w e r.

regex = r'“[^”a-z]+”[^“”]+Vol\.|“[^”a-z]+”[^“”]+No\.|eds\.[^V()]+Vol\.|eds\.[^()V]+\([^)]+\)|“[^”a-z]+”[^()]+\([^()]+\)|\.,[^()]+\([^)]+\)|:[^()]+\([^()]+\)|“[^.]+pp\.|, “[^.]+p\.|“[^.]+ p\. \d|“[^V\d]+Vol\. \d|“[^(\d]+\([^\d]+\d\d\d\d\)'
subst = ""

re.findall(regex, test_str)

[':1~\n\nArchaeology of Knowledge\n‘Michel Foucault is a very brilliant writer ... he has a remark-\nable angle of vision, a highly disciplined and coherent one,\nthat informs his work to such a high degree as to make the\nwork sui generis original.’\nEdward W. Said\n‘The Archaeology of Knowledge ... provides an unusually\nsharp outline of [Foucault’s] theoretical stance as well as a\nfocused critique of the history of ideas.’\nJean Claude Guédon\n‘A necessary guide to Foucault’s often difficult ideas ... and\nto his overall historical ambition, which is to define the “soil”\nout of which contemporary events in a given period grow.’\nThe Times Literary Supplement\n‘No other thinker in recent history had so dynamically influ-\nenced the fields of history, philosophy, literature and literary\ntheory, the social sciences, even medicine.’\nLawrence D. Kritzman\n‘Next to Sartre’s Search for a Method, and in direct opposition\nto it, Foucault’s work is the most noteworthy effort at a theory\
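
The cell above lists matches with `re.findall` and defines `subst`, but the deletion step itself is not shown. Assuming it is done with `re.sub`, a minimal sketch looks like the following. The pattern here is a deliberately simplified, hypothetical stand-in for the full title regex, and the sample sentence is invented for illustration.

```python
import re

# Hypothetical simplified pattern: a quoted title followed by a four-digit year in parentheses
title_pattern = r"“[^”]+”[^()]*\(\d{4}\)"
subst = ""

sample = (
    "See Smith, “On Reading” Journal of Books (1999) for details. "
    "She writes that “novels matter” to readers."
)

# Inspect the matches first, as the notebook does, before deleting anything
matches = re.findall(title_pattern, sample)

# Delete the matches, then collapse the doubled spaces left behind
cleaned = re.sub(title_pattern, subst, sample)
cleaned = re.sub(r"[ \t]{2,}", " ", cleaned)

print(matches)
print(cleaned)
```

Reviewing the `re.findall` output before substituting helps catch over-long matches, such as ones that swallow ordinary quoted prose along with a citation.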


# Target corpus

In [17]:
# ACTION: specify the location of the JSONL file downloaded from JSTOR.
# On a Mac, you can do this by locating the file in Finder, right-clicking, holding the "opt" key
# and selecting "Copy ... as Pathname" then pasting it between the quotation marks below.

pathToJsonLinesFile = '/Users/milan/Downloads/priceanthology.jsonl'

In [18]:
# Copy the downloaded JSONL file into the appropriate project directory for processing and rename.

import shutil

corpusFile = f"{corpusDir}/{projectName}_fulltext.jsonl"

shutil.copy(pathToJsonLinesFile, corpusFile)

'/Users/milan/Library/CloudStorage/GoogleDrive-mtt2126@columbia.edu/My Drive/iAnnotate/MIT/Quotable Content/Data/Price/2000_AnthologyRise/TargetCorpus/Price_2000_AnthologyRise_fulltext.jsonl'

In [19]:
df = pd.read_json(corpusFile, lines=True)
df

# NB running this cell can take 5+ mins with files >5GB, and sometimes crashes the kernel but works when restarted

Unnamed: 0,creator,datePublished,docSubType,docType,doi,fullText,id,identifier,isPartOf,keyphrase,...,sourceCategory,tdmCategory,title,url,volumeNumber,wordCount,issueNumber,placeOfPublication,abstract,subTitle
0,"[Henry Lowood, Stephen H. Cutcliffe, Katalin H...",1996-01-01,research-article,article,10.2307/3107088,[Current Bibliography in the History of Techno...,http://www.jstor.org/stable/3107088,"[{'name': 'doi', 'value': '10.2307/3107088'}, ...",Technology and Culture,"[technology, bibliography, history, illustrati...",...,"[Science & Mathematics, Science & Technology S...",[Biological sciences - Biology],Current Bibliography in the History of Technol...,http://www.jstor.org/stable/3107088,37.0,110617,,,,
1,[W. B. Worthen],2002-04-01,research-article,article,10.2307/1556121,"[SEL 42, 2 (Spring 2002): 399-458 399 ISSN 003...",http://www.jstor.org/stable/1556121,"[{'name': 'doi', 'value': '10.2307/1556121'}, ...","Studies in English Literature, 1500-1900","[hamlet, shakespeare, purgatory, worthen, thea...",...,"[Language & Literature, Humanities]",[Arts - Literature],Recent Studies in Tudor and Stuart Drama,http://www.jstor.org/stable/1556121,42.0,20598,2,,,
2,,1968-03-08,misc,article,10.2307/1723593,[SC tEI~i~J~~S~aE 8 March 1968 CEi J I E: hN J...,http://www.jstor.org/stable/1723593,"[{'name': 'doi', 'value': '10.2307/1723593'}, ...",Science,"[unitron, interscience book, boc amino acids, ...",...,"[Biological Sciences, General Science, Science...",[Applied sciences - Engineering],Front Matter,http://www.jstor.org/stable/1723593,159.0,30503,3819,,,
3,[Robin L. Cadwallader],1997-01-01,misc,article,10.2307/25679222,[LEGACY BOOKSHELF Below is a selected sampling...,http://www.jstor.org/stable/25679222,"[{'name': 'doi', 'value': '10.2307/25679222'},...",Legacy,"[american, zora neale, edith wharton, fiction,...",...,"[Language & Literature, Feminist & Women's Stu...","[Philosophy - Applied philosophy, Arts - Liter...",LEGACY BOOKSHELF,http://www.jstor.org/stable/25679222,14.0,3067,1,,,
4,,1995-04-01,misc,article,10.2307/467848,"[Chinese-American Literature, University of Ma...",http://www.jstor.org/stable/467848,"[{'name': 'doi', 'value': '10.2307/467848'}, {...",MELUS,"[university, amherst, postage fee, massachuset...",...,"[Language & Literature, Humanities]","[Arts - Literature, Biological sciences - Biol...",Front Matter,http://www.jstor.org/stable/467848,20.0,2224,1,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1682,"[W Maxwell Cowan, Steven E Hyman, Thomas M Jes...",2002-12-01,book-review,article,10.1086/374516,"[Volume 77, No. 4 THE QUARTERLY REVIEW OF BIOL...",http://www.jstor.org/stable/10.1086/374516,"[{'name': 'doi', 'value': '10.1086/374516'}, {...",The Quarterly Review of Biology,"[index isbn, fossil, ill index, ammonites, fos...",...,"[Ecology & Evolutionary Biology, Science & Mat...",[Biological sciences - Biology],Review Article,http://www.jstor.org/stable/10.1086/374516,77.0,47321,4,,,
1683,"[Nigel Rapport, Esther Hertzog, Orit Abuhav, H...",2013-04-01,book-review,article,10.2307/23486376,[This is the first of a series of book review ...,http://www.jstor.org/stable/23486376,"[{'name': 'doi', 'value': '10.2307/23486376'},...",Anthropology Today,"[israeli, anthropology, israeli anthropology, ...",...,"[Anthropology, Social Sciences]","[Philosophy - Applied philosophy, History - Hi...",ISRAEL/ANTHROPOLOGY: 'SCREAMING ASSEMBLY'. A r...,http://www.jstor.org/stable/23486376,29.0,5047,2,,,
1684,,1984-12-01,misc,article,10.2307/341966,[AATSP MEMBERSHIP LIST (Corrected to 1 October...,http://www.jstor.org/stable/341966,"[{'name': 'doi', 'value': '10.2307/341966'}, {...",Hispania,"[span port, langs lits, langs univ, rom langs,...",...,"[Language & Literature, Education, Latin Ameri...",[Arts - Performing arts],Back Matter,http://www.jstor.org/stable/341966,67.0,124681,4,,,
1685,,1995-11-01,misc,article,10.2307/462924,"[Program Wednesday, 27 December 3:30 p.m. 2. A...",http://www.jstor.org/stable/462924,"[{'name': 'doi', 'value': '10.2307/462924'}, {...",PMLA,"[hyatt regency, regency chicago, program arran...",...,"[Humanities, Language & Literature]",[Arts - Performing arts],Program,http://www.jstor.org/stable/462924,110.0,58391,6,,,
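
For files large enough to strain the kernel, one alternative to loading everything with `pd.read_json` is to stream the JSONL file one record at a time with the standard `json` module and keep only the fields needed. The sketch below assumes one JSON object per line. The generator name `iter_jsonl`, the choice of fields to keep, and the two demo records are illustrative, not part of the notebook.

```python
import json
import os
import tempfile

def iter_jsonl(path, keep=("id", "title", "fullText")):
    """Yield one slimmed-down dict per non-empty line of a JSON Lines file."""
    with open(path, "r", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            # Fields missing from a record come back as None
            yield {key: record.get(key) for key in keep}

# Demonstration on a tiny temporary file (invented records, for illustration only)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.jsonl")
    with open(demo_path, "w", encoding="utf-8") as fh:
        fh.write(json.dumps({"id": "a", "title": "One", "fullText": ["text"],
                             "unigramCount": {"x": 1}}) + "\n")
        fh.write(json.dumps({"id": "b", "title": "Two"}) + "\n")
    records = list(iter_jsonl(demo_path))

# Records without full text can be flagged in the same pass
missing = [r["id"] for r in records if r["fullText"] is None]
print(records)
print(missing)
```

Because only one line is parsed at a time, peak memory depends on the largest single record plus whatever is kept, rather than on the whole file.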


In [10]:
# OPTIONAL: Drop the columns containing ngrams (irrelevant to our research) and overwrite JSON file
# errors='ignore' skips any of these columns that are not present in the dataset

# df.drop(columns=['unigramCount', 'bigramCount', 'trigramCount'], errors='ignore', inplace=True)
# df.to_json(f'{projectName}.json')

In [20]:
# Identify items lacking full text

df.loc[pd.isnull(df['fullText'])]

# You should see "0 rows" below; if not, there is a problem with the dataset

Unnamed: 0,creator,datePublished,docSubType,docType,doi,fullText,id,identifier,isPartOf,keyphrase,...,sourceCategory,tdmCategory,title,url,volumeNumber,wordCount,issueNumber,placeOfPublication,abstract,subTitle
