# Pre-processing procedures

This notebook will set you up to run quotation detection of a source text in a target corpus, processing both as needed and storing them in a convenient location. You should run every cell in this notebook except those marked "OPTIONAL". Cells that say "ACTION" require you to do something within the cell before running it.


NOTE: before you open this notebook, run the following command on the command line. It starts Jupyter Notebook with a higher IOPub data rate limit, so that cells producing large outputs do not hit the default cap:

`jupyter notebook --NotebookApp.iopub_data_rate_limit=1.0e10`

If you've already opened this notebook, close it, run the command above, and open the notebook from within the browser that pops up.

# Initial setup

In [3]:
import pandas as pd
import json
import os

In [4]:
# ACTION: Specify info on dataset here so all files and folders will be named consistently
authorSurname = "Foucault"
publicationYear = "1969"
textTitle = "Archaeology"
# NB use one or two unique keywords for text title

In [5]:
projectName = f"{authorSurname}_{publicationYear}_{textTitle}"

print(projectName)

Foucault_1969_Archaeology


In [6]:
# ACTION: Specify a directory for data to be stored

dataDir = "/Users/milan/Library/CloudStorage/GoogleDrive-mtt2126@columbia.edu/My Drive/iAnnotate/MIT/Quotable Content/Data"

# Note: because the source text and corpus contain copyrighted material, they should be stored locally.

In [7]:
# Create subfolders for source text, corpus and results

sourceDir = f"{dataDir}/{authorSurname}/{publicationYear}_{textTitle}/Source"
corpusDir = f"{dataDir}/{authorSurname}/{publicationYear}_{textTitle}/Corpus"
resultsDir = f"{dataDir}/{authorSurname}/{publicationYear}_{textTitle}/Results"

os.makedirs(sourceDir, exist_ok=True)
os.makedirs(corpusDir, exist_ok=True)
os.makedirs(resultsDir, exist_ok=True)

# Source text

**We recommend running this notebook in the browser using Jupyter Notebook, since user input works better there.**

This notebook:

- converts a PDF to a text file
- detects page numbers printed in each page's header or footer
- inserts both PDF page number and printed page number (where found) at each page break in the text file
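The detection step can be sketched as a small standalone function. This is a minimal illustration that assumes the same patterns as the conversion cell further down (Roman-numeral characters or a one- to three-digit number at the start or end of the first line, or alone on the last line). The names `PAGE_TOKEN` and `candidate_page_numbers` are hypothetical and do not appear elsewhere in this notebook.

```python
import re

# Characters used for Roman numerals in the notebook's pattern, or a 1-3 digit number
PAGE_TOKEN = r"([xXvViIlL]+|\d{1,3})"


def candidate_page_numbers(page_text):
    """Return possible printed page numbers found in a page's first and last lines."""
    if not page_text:
        return []
    lines = page_text.split("\n")
    first_line, last_line = lines[0], lines[-1]

    top_left = re.findall(rf"^{PAGE_TOKEN}\s", first_line)   # number at start of header
    top_right = re.findall(rf"\s{PAGE_TOKEN}$", first_line)  # number at end of header
    bottom = re.findall(rf"^{PAGE_TOKEN}$", last_line)       # number alone in footer

    candidates = []
    if top_left:
        candidates.append(top_left[0].strip())
    if top_right:
        candidates.append(top_right[-1].strip())
    if bottom:
        candidates.append(bottom[-1].strip())
    return candidates


header_example = candidate_page_numbers("12 THE ARCHAEOLOGY OF KNOWLEDGE\nbody text\nlast line")
ambiguous_example = candidate_page_numbers("INTRODUCTION 5\nbody text\nxiv")
```

When the function returns more than one candidate, as in `ambiguous_example`, the notebook asks the user to check the PDF and enter the correct number.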

If multiple potential page numbers are detected, the user will be asked to confirm which of the page numbers is correct. The printed page numbers and the indexed page numbers are added after each page's text in the text file in the following format: 

\~printed_page_number:[NUM HERE]\~ 

\~indexed_page_number:[NUM HERE]\~
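Once the text file exists, these markers make it possible to recover page boundaries. The sketch below is one possible way to read them back, not part of the original pipeline. It assumes the markers appear in the file with plain `~` characters (the backslashes above are only Markdown escapes), and the names `MARKER` and `split_pages` are hypothetical.

```python
import re

# An optional printed-page marker followed by the indexed-page marker that ends each page
MARKER = re.compile(
    r"(?:~printed_page_number:(?P<printed>[^~]*)~\s*)?"
    r"~indexed_page_number:(?P<indexed>\d+)~"
)


def split_pages(text):
    """Split marked-up plain text into one record per PDF page."""
    pages = []
    start = 0
    for match in MARKER.finditer(text):
        pages.append(
            {
                "indexed": int(match.group("indexed")),
                "printed": match.group("printed"),  # None when no printed number was found
                "text": text[start:match.start()].strip(),
            }
        )
        start = match.end()
    return pages


sample = (
    "First page text\n\n~printed_page_number:ix~\n~indexed_page_number:1~\n\n"
    "Second page text\n~indexed_page_number:2~\n\n"
)
pages = split_pages(sample)
```

Each record keeps the page text together with both numbers, so a match found in the text can later be reported against the printed page where one exists.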

The PDFs are converted to text using the pdfplumber package: https://github.com/jsvine/pdfplumber#extracting-text. Follow its installation instructions before running this notebook. If you're running Jupyter Notebook in your browser, just run the cell below to install the package.

In [16]:
# run this cell to install packages if you're running jupyter notebook in browser 
import sys
!{sys.executable} -m pip install pdfplumber
import re 
import pdfplumber
import shutil



In [15]:
# ACTION: input the path and filename of the PDF you want to convert here

PDFfile = "/Users/milan/Library/CloudStorage/GoogleDrive-mtt2126@columbia.edu/My Drive/iAnnotate/MIT/Quotable Content/Data/Foucault/Archaeology of Knowledge/Foucault - Archaeology of Knowledge.pdf"

In [17]:
# Copy the PDF to the "Source" subfolder and rename

shutil.copyfile(PDFfile, f"{sourceDir}/{projectName}.pdf")

'/Users/milan/Library/CloudStorage/GoogleDrive-mtt2126@columbia.edu/My Drive/iAnnotate/MIT/Quotable Content/Data/Foucault/1969_Archaeology/Source/Foucault_1969_Archaeology.pdf'

In [20]:
full_pdf_text = ""

with pdfplumber.open(f"{sourceDir}/{projectName}.pdf") as pdf:
    for i, page in enumerate(pdf.pages):
        # Guard against extract_text() returning None (e.g. for a page with no extractable text)
        full_page = page.extract_text() or ""

        lines = full_page.split("\n")
        first_line = lines[0] # where page numbers at the top of the page are likely to be found
        last_line = lines[-1] # where page numbers at the bottom of the page are likely to be found
        full_pdf_text += full_page

        if len(full_page) != 0:
            # Raw strings keep \d and \s from being treated as invalid string escapes
            top_left_page_matches = re.findall(r"^([xXvViIlL]+|\d{1,3})\s", first_line)
            top_right_page_matches = re.findall(r"\s([xXvViIlL]+|\d{1,3})$", first_line)
            bottom_page_matches = re.findall(r"^([xXvViIlL]+|\d{1,3})$", last_line)
            
            all_matches = []
            if top_left_page_matches:
                all_matches.append(top_left_page_matches[0].strip())
            if top_right_page_matches:
                all_matches.append(top_right_page_matches[-1].strip())
            if bottom_page_matches:
                all_matches.append(bottom_page_matches[-1].strip())
                
            if len(all_matches) > 0:
                if len(all_matches) > 1: # more than one potential page number 
                    print(f'\n{len(all_matches)} potential page numbers were found for this page.')
                    print(all_matches)
                    print(f'Check PDF page {i+1} for the correct page number and enter it below. If the page has no printed page number, do not input anything.')
                    correct_page_number = input()
                    print(correct_page_number)
                    if len(correct_page_number.strip()) != 0:
                        printed_page_number = correct_page_number.strip()
                    else:
                        printed_page_number = None
                else:
                    printed_page_number = all_matches[0]

                if printed_page_number is not None: 
                    full_pdf_text += f'\n\n~printed_page_number:{printed_page_number}~'
                
        full_pdf_text += f'\n~indexed_page_number:{i+1}~\n\n'

# save output as a plain text file     
with open(f"{sourceDir}/{projectName}_plaintext.txt", "w", encoding="utf-8") as text_file:
    text_file.write(full_pdf_text)

# Target corpus

In [2]:
# ACTION: specify the location of the JSONL file downloaded from JSTOR.
# On a Mac, you can do this by locating the file in Finder, right-clicking, holding the "opt" key
# and selecting "Copy ... as Pathname" then pasting it between the quotation marks below.

pathToJsonLinesFile = '/Users/milan/Downloads/foucault-archaeology.jsonl'

In [13]:
# Copy the downloaded JSONL file into the appropriate project directory for processing and rename.

import shutil

corpusFile = f"{corpusDir}/{projectName}_fulltext.jsonl"

shutil.copy(pathToJsonLinesFile, corpusFile)

'/Users/milan/Library/CloudStorage/GoogleDrive-mtt2126@columbia.edu/My Drive/iAnnotate/MIT/Quotable Content/Data/Foucault/1969_Archaeology/Corpus/Foucault_1969_Archaeology_fulltext.jsonl'

In [14]:
df = pd.read_json(corpusFile, lines=True)
df

# NB running this cell can take 5+ mins with files >5GB, and sometimes crashes the kernel but works when restarted

Unnamed: 0,creator,datePublished,docSubType,docType,fullText,id,identifier,isPartOf,language,outputFormat,...,title,url,volumeNumber,wordCount,issueNumber,abstract,subTitle,keyphrase,collection,hasPartTitle
0,[Christophe Ippolito],2008-01-01,research-article,article,"[THE TWENTIETH CENTURY, 1900-1945 By Christoph...",http://www.jstor.org/stable/25834090,"[{'name': 'issn', 'value': '00844152'}, {'name...",The Year's Work in Modern Language Studies,[eng],"[unigram, bigram, trigram]",...,"THE TWENTIETH CENTURY, 1900–1945",http://www.jstor.org/stable/25834090,70,7727,,,,,,
1,,1990-11-01,misc,article,"[S OO -~~~ SeL~~~~s ai g; g; g, ; a a~ W ' S A...",http://www.jstor.org/stable/2073165,"[{'name': 'issn', 'value': '00943061'}, {'name...",Contemporary Sociology,[eng],"[unigram, bigram, trigram]",...,Front Matter,http://www.jstor.org/stable/2073165,19,2776,6,,,,,
2,[Ian Finseth],1999-04-01,research-article,article,[ESSAYS How Shall the Truth Be Told? Language ...,http://www.jstor.org/stable/27746772,"[{'name': 'issn', 'value': '00029823'}, {'name...","American Literary Realism, 1870-1910",[eng],"[unigram, bigram, trigram]",...,How Shall the Truth Be Told? Language and Race...,http://www.jstor.org/stable/27746772,31,10192,3,,,,,
3,[Vern L. Bullough],1989-07-01,research-article,article,[THE FIELDING H. GARRISON LECTURE* M THE PHYSI...,http://www.jstor.org/stable/44451381,"[{'name': 'issn', 'value': '00075140'}, {'name...",Bulletin of the History of Medicine,[eng],"[unigram, bigram, trigram]",...,THE PHYSICIAN AND RESEARCH INTO HUMAN SEXUAL B...,http://www.jstor.org/stable/44451381,63,10778,2,,,,,
4,[George Huppert],1974-10-01,research-article,article,[DIVINATlO ET ERUDJTlO: THOUGHTS ON FOUCAULT G...,http://www.jstor.org/stable/2504776,"[{'name': 'issn', 'value': '00182656'}, {'name...",History and Theory,[eng],"[unigram, bigram, trigram]",...,Divinatio et Eruditio: Thoughts on Foucault,http://www.jstor.org/stable/2504776,13,8105,3,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10415,[MONIKA KAUP],2013-12-01,research-article,article,[MONIKA KAUP The Neobaroque in W. G. Sebald's ...,http://www.jstor.org/stable/43297932,"[{'name': 'issn', 'value': '00107484'}, {'name...",Contemporary Literature,[eng],"[unigram, bigram, trigram]",...,"The Neobaroque in W. G. Sebald's ""The Rings of...",http://www.jstor.org/stable/43297932,54,13599,4,,,,,
10416,[Alexander Spencer],2014-08-01,research-article,article,"[International Studies Perspectives (2014) 15,...",http://www.jstor.org/stable/44218756,"[{'name': 'issn', 'value': '15283577'}, {'name...",International Studies Perspectives,[eng],"[unigram, bigram, trigram]",...,Romantic Stories of the Pirate in IARRRH: The ...,http://www.jstor.org/stable/44218756,15,9289,3,The article examines the attempt by some acade...,,,,
10417,"[פלג דור-חיים, Peleg Dor-haim]",2015-01-01,research-article,article,[הקבוצה הדינמית כמרחב להתמודדות עם ניכור וזרות...,http://www.jstor.org/stable/26240971,"[{'name': 'issn', 'value': '23102063'}, {'name...",Mikbatz: The Israel Journal of Group Psychothe...,[heb],"[unigram, bigram, trigram]",...,The Intimate Group as a Space of Coping with A...,http://www.jstor.org/stable/26240971,19,5248,2,"לאורך ההיסטוריה האנושית מילאה ""הקבוצה האינטימי...",,,,
10418,"[Michelle Bigenho, Henry Stobart]",2018-10-01,research-article,article,[SPECIAL COLLECTION WORLD HERITAGE AND THE ONT...,http://www.jstor.org/stable/26646268,"[{'name': 'issn', 'value': '00035491'}, {'name...",Anthropological Quarterly,[eng],"[unigram, bigram, trigram]",...,Grasping Cacophony in Bolivian Heritage Otherwise,http://www.jstor.org/stable/26646268,91,15207,4,"A ""fever"" of heritage registration (patrimonia...",,,,
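If loading the whole file at once is slow or crashes the kernel, one alternative is to read it in chunks using the `chunksize` parameter of `pd.read_json` (available when `lines=True`) and keep only the columns you need. This is a minimal sketch, not the notebook's own method: the helper name `read_jsonl_in_chunks`, the chunk size, and the demo records are all assumptions, and whether it helps depends on the file and the available memory.

```python
import json
import os
import tempfile

import pandas as pd


def read_jsonl_in_chunks(path, chunksize=1000, columns=None):
    """Read a JSON Lines file chunk by chunk, optionally keeping only some columns."""
    frames = []
    reader = pd.read_json(path, lines=True, chunksize=chunksize)
    for chunk in reader:
        if columns is not None:
            # Keep only requested columns that actually exist in this chunk
            chunk = chunk[[c for c in columns if c in chunk.columns]]
        frames.append(chunk)
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()


# Demo with a tiny made-up file; replace demo_path with corpusFile for real data
demo_records = [
    {"id": f"doc-{n}", "title": f"Title {n}", "fullText": [f"Body of document {n}"]}
    for n in range(5)
]
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.jsonl")
    with open(demo_path, "w", encoding="utf-8") as handle:
        for record in demo_records:
            handle.write(json.dumps(record) + "\n")
    demo_df = read_jsonl_in_chunks(demo_path, chunksize=2, columns=["id", "fullText"])
```

Dropping unneeded columns per chunk means the wide n-gram and metadata columns are never all held in memory at the same time as the full text.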


In [11]:
# OPTIONAL: Drop the columns containing n-gram counts (irrelevant to our research) and write the result to a JSON file
# errors='ignore' skips any of these columns that are not present in the DataFrame

# df.drop(columns=['unigramCount', 'bigramCount', 'trigramCount'], errors='ignore', inplace=True)
# df.to_json(f'{projectName}.json')

In [12]:
# Identify items lacking full text

df.loc[pd.isnull(df['fullText'])]

# You should see 0 rows below; if not, some items in the dataset are missing their full text

Unnamed: 0,creator,datePublished,docSubType,docType,fullText,id,identifier,isPartOf,language,outputFormat,...,title,url,volumeNumber,wordCount,issueNumber,abstract,subTitle,keyphrase,collection,hasPartTitle
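The check above only catches null values. A slightly stricter variant, sketched below on made-up data, also flags entries that are empty lists or contain only whitespace. It assumes `fullText` holds a list of strings per item, as the preview above suggests; the helper name `has_no_full_text` is hypothetical.

```python
import pandas as pd


def has_no_full_text(value):
    """Return True if a fullText entry is missing, empty, or only whitespace."""
    if value is None:
        return True
    if isinstance(value, float) and pd.isna(value):
        return True
    if isinstance(value, (list, tuple)):
        # An empty list, or a list whose parts are all blank, counts as missing
        return all(not str(part).strip() for part in value)
    return not str(value).strip()


# Made-up records for illustration only
demo = pd.DataFrame(
    {
        "id": ["a", "b", "c", "d"],
        "fullText": [["Some text"], None, [], ["  "]],
    }
)
missing_mask = demo["fullText"].apply(has_no_full_text)
missing_ids = demo.loc[missing_mask, "id"].tolist()
```

On the real corpus, `df.loc[df['fullText'].apply(has_no_full_text)]` would list the affected items so they can be inspected or dropped before quotation detection.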
