# Pre-processing procedures

This notebook will set you up to run quotation detection of a source text in a target corpus, processing both as needed and storing them in a convenient location. You should run every cell in this notebook except those marked "OPTIONAL". Cells that say "ACTION" require you to do something within the cell before running it.


NOTE: Before you open this notebook, run the following command on the command line. It raises Jupyter's IOPub data rate limit (the cap on how much output a cell may send to the browser), which the large outputs in this notebook can otherwise exceed:

`jupyter notebook --NotebookApp.iopub_data_rate_limit=1.0e10`

If you've already opened this notebook, close it, run the command above, and reopen the notebook from the browser window that pops up.

# Initial setup

In [18]:
import pandas as pd
import json
import os

In [19]:
# ACTION: Specify info on dataset here so all files and folders will be named consistently
authorSurname = "Foucault"
publicationYear = "1969"
textTitle = "Archaeology"
# NB use one or two unique keywords for text title

In [20]:
projectName = f"{authorSurname}_{publicationYear}_{textTitle}"

print(projectName)

Foucault_1969_Archaeology


In [21]:
# ACTION: Specify a directory for data to be stored

dataDir = "/Users/milan/Library/CloudStorage/GoogleDrive-mtt2126@columbia.edu/My Drive/iAnnotate/MIT/Quotable Content/Data"

In [22]:
# Create subfolders for source text and corpus

sourceDir = f"{dataDir}/{authorSurname}/{publicationYear}_{textTitle}/Source"
corpusDir = f"{dataDir}/{authorSurname}/{publicationYear}_{textTitle}/Corpus"

os.makedirs(sourceDir, exist_ok=True)
os.makedirs(corpusDir, exist_ok=True)
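
The folder layout created above can also be built with `pathlib`, which handles path separators for you. This is only a sketch: the helper name `make_project_dirs` is hypothetical, and the demonstration writes to a temporary directory rather than to your real data folder.

```python
from pathlib import Path
import tempfile


def make_project_dirs(data_dir, author_surname, publication_year, text_title):
    """Create the Source and Corpus subfolders and return their paths."""
    base = Path(data_dir) / author_surname / f"{publication_year}_{text_title}"
    source_dir = base / "Source"
    corpus_dir = base / "Corpus"
    for folder in (source_dir, corpus_dir):
        # parents=True creates intermediate folders; exist_ok=True makes re-runs harmless
        folder.mkdir(parents=True, exist_ok=True)
    return source_dir, corpus_dir


# Demonstration in a throwaway directory (an assumption made for this example only)
with tempfile.TemporaryDirectory() as tmp:
    demo_source, demo_corpus = make_project_dirs(tmp, "Foucault", "1969", "Archaeology")
    print(demo_source.relative_to(tmp))
    print(demo_corpus.relative_to(tmp))
```

Because both calls use `exist_ok=True`, re-running the cell on an existing project does not raise an error or touch files already stored there.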

# Target corpus

In [7]:
# ACTION: specify the location of the JSONL file downloaded from JSTOR.
# On a Mac, you can do this by locating the file in Finder, right-clicking, holding the Option key,
# and selecting "Copy ... as Pathname", then pasting the path between the quotation marks below.

path_to_jsonLines_file = '/Users/milan/Downloads/foucault-archaeology.jsonl'

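with open(path_to_jsonLines_file, encoding='utf-8') as f:
    rawCorpus = f.readlines()

# Parse the JSTOR data line by line (each line of the JSONL file is one JSON record);
# blank lines, such as a trailing newline at the end of the file, are skipped
data = [json.loads(line) for line in rawCorpus if line.strip()]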

# NB running this cell can take 5+ mins with files >5GB
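
If memory is tight, the same parsing can be done one line at a time instead of holding the raw lines and the parsed records side by side. The following is a minimal sketch, not the notebook's own method: `read_jsonl` is a hypothetical helper, and the sample records are invented for the demonstration.

```python
import json
import os
import tempfile


def read_jsonl(path, encoding="utf-8"):
    """Parse a JSON Lines file one line at a time, skipping blank lines."""
    records = []
    with open(path, encoding=encoding) as f:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines, e.g. at the end of the file
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as err:
                # Name the offending line so a corrupt download is easy to locate
                raise ValueError(f"Invalid JSON on line {line_number} of {path}") from err
    return records


# Demonstration on a tiny invented file
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False, encoding="utf-8") as tmp:
    tmp.write('{"id": 1, "title": "A"}\n\n{"id": 2, "title": "B"}\n')
sample = read_jsonl(tmp.name)
os.remove(tmp.name)
print(len(sample))  # 2
```

Reporting the line number is useful with multi-gigabyte downloads, where a single truncated record would otherwise abort the whole cell with little indication of where the problem lies.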

In [23]:
# Dump the parsed JSONL into a new JSON array stored at "{corpusDir}/{projectName}.json"
name_of_parsed_file = f"{corpusDir}/{projectName}.json"

with open(name_of_parsed_file, 'w') as outfile: 
    json.dump(data, outfile)

# NB running this cell can take 5+ mins with files >5GB

In [22]:
df = pd.read_json(name_of_parsed_file)
df

# NB running this cell can take 5+ mins with files >5GB, and sometimes crashes the kernel but works when restarted

Unnamed: 0,creator,datePublished,docSubType,docType,fullText,id,identifier,isPartOf,issueNumber,language,...,title,url,volumeNumber,wordCount,abstract,subTitle,keyphrase,collection,hasPartTitle,editor
0,[JOHN MAYNARD],1996-12-01,book-review,article,"[REVIEWS / 597 KOPELSON, KEVIN. Love 's Litany...",http://www.jstor.org/stable/29533170,"[{'name': 'issn', 'value': '00393827'}, {'name...",Studies in the Novel,4,[eng],...,Review Article,http://www.jstor.org/stable/29533170,28,1566,,,,,,
1,[Julia Riches],1996-01-01,book-review,article,[Review Essay REVIEW ESSAY: READING THE BODY b...,http://www.jstor.org/stable/30003241,"[{'name': 'issn', 'value': '13513818'}, {'name...",Irish Journal of American Studies,,[eng],...,Reading the Body,http://www.jstor.org/stable/30003241,5,6906,,,,,,
2,[Jérôme Thélot],2014-01-01,research-article,article,[Jerome Thélot Prosodie et histoire Qu'il s'ag...,http://www.jstor.org/stable/45073935,"[{'name': 'issn', 'value': '12684082'}, {'name...",L'Année Baudelaire,,[fre],...,Prosodie et histoire,http://www.jstor.org/stable/45073935,18/19,7350,,,,,,
3,,2013-04-01,other,article,[contributors linda martín alcoff is professo...,http://www.jstor.org/stable/10.5325/critphilra...,"[{'name': 'issn', 'value': '21658684'}, {'name...",Critical Philosophy of Race,1,[eng],...,Back Matter,http://www.jstor.org/stable/10.5325/critphilra...,1,877,,,,,,
4,[J. Stephen Lansing],2003-08-01,research-article,article,[J. STEPHEN LANSING University of Arizona and ...,http://www.jstor.org/stable/3805433,"[{'name': 'issn', 'value': '00940496'}, {'name...",American Ethnologist,3,[eng],...,The Cognitive Machinery of Power: Reflections ...,http://www.jstor.org/stable/3805433,30,8042,"As Judith Butler has emphasized, for Michel Fo...",,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
30210,[Brendan Dooley],1997-12-01,book-review,article,[Book Reviews Fantasy and Reality in History. ...,http://www.jstor.org/stable/10.1086/245598,"[{'name': 'issn', 'value': '00222801'}, {'name...",The Journal of Modern History,4,[eng],...,Book Reviews,http://www.jstor.org/stable/10.1086/245598,69,60891,,,,,,
30211,"[Louis Althusser, Richard Veasey]",1993-10-01,research-article,article,[?t:?::??:!? 1 ?i?lillsa;;? ?o?j?-;? i .. r* ....,http://www.jstor.org/stable/25007697,"[{'name': 'issn', 'value': '07345496'}, {'name...",Grand Street,47,[eng],...,Zones of Darkness,http://www.jstor.org/stable/25007697,,8060,,,,,,
30212,,1962-03-01,misc,article,[VIENNENT DE PARAITRE Emile Copfermann La géné...,http://www.jstor.org/stable/24255976,"[{'name': 'issn', 'value': '00140759'}, {'name...",Esprit (1940-),304 (3),[fre],...,Back Matter,http://www.jstor.org/stable/24255976,,3730,,,,,,
30213,[Anthony Guneratne],2016-10-01,research-article,article,[The Greatest Shakespeare Film Never Made: Tex...,http://www.jstor.org/stable/26355195,"[{'name': 'issn', 'value': '07482558'}, {'name...",Shakespeare Bulletin,3,[eng],...,The Greatest Shakespeare Film Never Made,http://www.jstor.org/stable/26355195,34,8956,,"Textualities, Authorship, and Archives",,,,
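
Since loading the full JSON array can crash the kernel, one alternative worth trying is to let pandas read the JSONL file directly in chunks via `pd.read_json(..., lines=True, chunksize=...)`, dropping unneeded columns from each chunk before combining them. This is a sketch under assumptions: `load_jsonl_in_chunks` is a hypothetical helper, the chunk size is arbitrary, and the sample records are invented.

```python
import os
import tempfile

import pandas as pd


def load_jsonl_in_chunks(path, drop_columns=(), chunksize=1000):
    """Read a JSON Lines file chunk by chunk, dropping unwanted columns early."""
    frames = []
    # With lines=True and a chunksize, read_json returns an iterator of DataFrames
    reader = pd.read_json(path, lines=True, chunksize=chunksize)
    for chunk in reader:
        # errors="ignore" skips any listed column that is absent from the data
        frames.append(chunk.drop(columns=list(drop_columns), errors="ignore"))
    return pd.concat(frames, ignore_index=True)


# Demonstration on a tiny invented file
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False, encoding="utf-8") as tmp:
    tmp.write('{"id": "a", "wordCount": 10, "unigramCount": {"the": 3}}\n')
    tmp.write('{"id": "b", "wordCount": 20, "unigramCount": {"of": 2}}\n')
    tmp.write('{"id": "c", "wordCount": 30, "unigramCount": {"and": 1}}\n')
demo = load_jsonl_in_chunks(tmp.name, drop_columns=["unigramCount"], chunksize=2)
os.remove(tmp.name)
print(demo)
```

Dropping the ngram columns per chunk keeps the largest fields from ever being held in memory all at once, which may help with the multi-gigabyte files mentioned above.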


In [10]:
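# Drop the columns containing ngrams (irrelevant to our research) and overwrite the JSON file

# errors='ignore' lets this cell be re-run after the columns have already been removed
df.drop(['unigramCount', 'bigramCount', 'trigramCount'], inplace=True, axis=1, errors='ignore')
# Write back to the file created above rather than to the current working directory
df.to_json(name_of_parsed_file)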

# Browsing the contents of the dataset

In [23]:
# Get general info on data (column names, non-null counts, dtypes)
# NB info is a method: without the parentheses, Jupyter prints the bound method and a dump of the DataFrame instead of the summary

df.info()

In [24]:
# Identify items lacking full text

df.loc[pd.isnull(df['fullText'])]

Unnamed: 0,creator,datePublished,docSubType,docType,fullText,id,identifier,isPartOf,issueNumber,language,...,title,url,volumeNumber,wordCount,abstract,subTitle,keyphrase,collection,hasPartTitle,editor


In [13]:
# Identify items that include full text

df.loc[pd.notnull(df['fullText'])]

Unnamed: 0,creator,datePublished,docSubType,docType,fullText,id,identifier,isPartOf,language,outputFormat,...,title,url,volumeNumber,wordCount,issueNumber,abstract,subTitle,keyphrase,collection,hasPartTitle
0,[Christophe Ippolito],2008-01-01,research-article,article,"[THE TWENTIETH CENTURY, 1900-1945 By Christoph...",http://www.jstor.org/stable/25834090,"[{'name': 'issn', 'value': '00844152'}, {'name...",The Year's Work in Modern Language Studies,[eng],"[unigram, bigram, trigram]",...,"THE TWENTIETH CENTURY, 1900–1945",http://www.jstor.org/stable/25834090,70,7727,,,,,,
1,,1990-11-01,misc,article,"[S OO -~~~ SeL~~~~s ai g; g; g, ; a a~ W ' S A...",http://www.jstor.org/stable/2073165,"[{'name': 'issn', 'value': '00943061'}, {'name...",Contemporary Sociology,[eng],"[unigram, bigram, trigram]",...,Front Matter,http://www.jstor.org/stable/2073165,19,2776,6,,,,,
2,[Ian Finseth],1999-04-01,research-article,article,[ESSAYS How Shall the Truth Be Told? Language ...,http://www.jstor.org/stable/27746772,"[{'name': 'issn', 'value': '00029823'}, {'name...","American Literary Realism, 1870-1910",[eng],"[unigram, bigram, trigram]",...,How Shall the Truth Be Told? Language and Race...,http://www.jstor.org/stable/27746772,31,10192,3,,,,,
3,[Vern L. Bullough],1989-07-01,research-article,article,[THE FIELDING H. GARRISON LECTURE* M THE PHYSI...,http://www.jstor.org/stable/44451381,"[{'name': 'issn', 'value': '00075140'}, {'name...",Bulletin of the History of Medicine,[eng],"[unigram, bigram, trigram]",...,THE PHYSICIAN AND RESEARCH INTO HUMAN SEXUAL B...,http://www.jstor.org/stable/44451381,63,10778,2,,,,,
4,[George Huppert],1974-10-01,research-article,article,[DIVINATlO ET ERUDJTlO: THOUGHTS ON FOUCAULT G...,http://www.jstor.org/stable/2504776,"[{'name': 'issn', 'value': '00182656'}, {'name...",History and Theory,[eng],"[unigram, bigram, trigram]",...,Divinatio et Eruditio: Thoughts on Foucault,http://www.jstor.org/stable/2504776,13,8105,3,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10415,[MONIKA KAUP],2013-12-01,research-article,article,[MONIKA KAUP The Neobaroque in W. G. Sebald's ...,http://www.jstor.org/stable/43297932,"[{'name': 'issn', 'value': '00107484'}, {'name...",Contemporary Literature,[eng],"[unigram, bigram, trigram]",...,"The Neobaroque in W. G. Sebald's ""The Rings of...",http://www.jstor.org/stable/43297932,54,13599,4,,,,,
10416,[Alexander Spencer],2014-08-01,research-article,article,"[International Studies Perspectives (2014) 15,...",http://www.jstor.org/stable/44218756,"[{'name': 'issn', 'value': '15283577'}, {'name...",International Studies Perspectives,[eng],"[unigram, bigram, trigram]",...,Romantic Stories of the Pirate in IARRRH: The ...,http://www.jstor.org/stable/44218756,15,9289,3,The article examines the attempt by some acade...,,,,
10417,"[פלג דור-חיים, Peleg Dor-haim]",2015-01-01,research-article,article,[הקבוצה הדינמית כמרחב להתמודדות עם ניכור וזרות...,http://www.jstor.org/stable/26240971,"[{'name': 'issn', 'value': '23102063'}, {'name...",Mikbatz: The Israel Journal of Group Psychothe...,[heb],"[unigram, bigram, trigram]",...,The Intimate Group as a Space of Coping with A...,http://www.jstor.org/stable/26240971,19,5248,2,"לאורך ההיסטוריה האנושית מילאה ""הקבוצה האינטימי...",,,,
10418,"[Michelle Bigenho, Henry Stobart]",2018-10-01,research-article,article,[SPECIAL COLLECTION WORLD HERITAGE AND THE ONT...,http://www.jstor.org/stable/26646268,"[{'name': 'issn', 'value': '00035491'}, {'name...",Anthropological Quarterly,[eng],"[unigram, bigram, trigram]",...,Grasping Cacophony in Bolivian Heritage Otherwise,http://www.jstor.org/stable/26646268,91,15207,4,"A ""fever"" of heritage registration (patrimonia...",,,,
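
The two cells above only catch missing values; an item whose `fullText` is an empty list would still count as having text. A minimal sketch of a stricter check follows, assuming (as the outputs above suggest) that `fullText` holds a list of strings. The helper name `full_text_summary` and the toy data are invented for illustration.

```python
import pandas as pd


def full_text_summary(frame, column="fullText"):
    """Count items with and without usable full text."""

    def has_text(value):
        # Assumed shape: a list of strings; a plain string is also accepted
        if isinstance(value, list):
            return any(isinstance(part, str) and part.strip() for part in value)
        return isinstance(value, str) and bool(value.strip())

    with_text = frame[column].apply(has_text)
    return {
        "total": len(frame),
        "with_text": int(with_text.sum()),
        "without_text": int((~with_text).sum()),
    }


# Toy data: one usable item, one missing value, one empty list, one whitespace-only page
toy = pd.DataFrame({
    "id": ["a", "b", "c", "d"],
    "fullText": [["Some text"], None, [], [" "]],
})
print(full_text_summary(toy))  # {'total': 4, 'with_text': 1, 'without_text': 3}
```

On the real data you would call `full_text_summary(df)`; if `without_text` is larger than the number of rows returned by the null check, some items carry empty rather than missing text.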


In [25]:
# Examine an item in detail

# ACTION: Choose an index at random from 0 to the highest number listed above

article_index = 41

# Print summary of metadata and text
print(df.loc[article_index])

creator                                             [Maggie McBride]
datePublished                                             1989-02-01
docSubType                                          research-article
docType                                                      article
fullText           [A Foucauldian Analysis of Mathematical Discou...
id                              http://www.jstor.org/stable/40247944
identifier         [{'name': 'issn', 'value': '02280671'}, {'name...
isPartOf                             For the Learning of Mathematics
issueNumber                                                        1
language                                                       [eng]
outputFormat                              [unigram, bigram, trigram]
pageCount                                                        7.0
pageEnd                                                           46
pageStart                                                         40
pagination                        

In [26]:
# Print full text for item examined above
print(df['fullText'].loc[article_index])

['A Foucauldian Analysis of Mathematical Discourse MAGGIE MCBR1DE The important thing, I believe, is that truth isn\'t outside power, or lacking in power: contrary to a myth . . . truth isn\'t the reward of free spirits, . . . nor the privilege of those who have succeeded in liberating themselves . . . Truth is a thing of this world . . . Each society has its regime of truth: that is, the types of discourse which it accepts and makes function as true . . .[1] As I read these words of Michel Foucault, I cannot help but become more aware of the politics of my own teaching of mathematics. If 1 take my study of Foucault seriously, I must think about changing my teaching practices. In the past, I saw improvement or change in my teaching as a personal endeavor; now, however, it has become more than that- it is a political struggle in which 1 have begun to question how the teaching of mathematics has been constructed to empower certain individuals who engage in certain practices. Foucault has