# Assemble hand-coded articles and prepare for modeling

@author: Jaren Haber, PhD<br>
@coauthors: Prof. Heather Haveman, UC Berkeley; Yoon Sung Hong, Wayfair<br>
@contact: Jaren.Haber@georgetown.edu<br>
@project: Computational Literature Review of Organizational Scholarship<br>
@date: November 2020<br>

@description: '''Loads and merges two datasets in preparation for classification model training. We're dealing with three theoretical perspectives in organizational science (cultural, demographic, and relational) and two subject areas (sociology and management/OB, not differentiated here). The first dataset consists of articles hand-coded by the author and Prof. Haveman, and it comes as a clean .csv file. Because it was sampled using the previous approach based on cosine measures, it contains many false positives and so consists mainly of negative cases. The second dataset consists of articles identified by Prof. Haveman as foundational/definitive for each perspective. It comes as lists of citations, one list per perspective, and requires heavy cleaning to match with articles in the main JSTOR articles dataset.'''

## Initialize

In [1]:
# import packages
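import importlib # For working with modules (the deprecated imp module is not needed here)
import re # For regular expressions (used below to clean titles)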
import pandas as pd # for working with dataframes
import numpy as np # for working with numbers
import pickle # For working with .pkl files
from tqdm import tqdm # Shows progress over iterations, including in pandas via "progress_apply"
tqdm.pandas(desc='')
import sys # For terminal tricks
import _pickle as cPickle # Optimized version of pickle
import gc # For managing garbage collector
import timeit # For counting time taken for a process
import datetime # For working with dates & times
import tables
import random
import os; from os import listdir; from os.path import isfile, join

In [35]:
# define filepaths
cwd = os.getcwd()
root = cwd.replace('classification/preprocess', '') # repo root = cwd minus this notebook's subfolder
#root = '/home/jovyan/work/' # set root directory

# dictionary counts (using core dictionaries) and matched subjects 
counts_fp = root + 'dictionary_methods/counts_and_subject.csv'

# per-article info on cosine scores using each dictionary (core or 100-term dictionaries??)
cosines_fp = root + 'models_storage/word_embeddings_data/text_with_cosine_scores_wdg_2020_oct27.csv'

# per-article metadata with URLs
meta_fp = root + 'dictionary_methods/code/metadata_combined.h5' 

# Filtered index of research articles
articles_list_fp = root + 'classification/data/filtered_length_index.csv'

# coded output directory: save files here
#output_articles_list_fp = root + 'classification/data/filtered_length_index.csv'
output_fp = root + 'classification/data/hand_coded/'
coded_11620 = output_fp + 'coded_sample_cleaned_111620.csv'
coded_cult_fp = output_fp + 'true_positives_cultural.csv'
coded_relt_fp = output_fp + 'true_positives_relational.csv'
coded_demog_fp = output_fp + 'true_positives_demographic.csv'

# for text files
ocr_fp = root + 'jstor_data/ocr/' 

In [None]:
# collect article file list
colnames = ['file_name']
articles = pd.read_csv(articles_list_fp, names=colnames, header=None)

# Use a set for the wanted paths: membership tests on a list are O(n) and make the filter below very slow
files_to_be_opened = {ocr_fp + file + '.txt' for file in tqdm(articles.file_name)}
all_files = [ocr_fp + f for f in tqdm(listdir(ocr_fp)) if isfile(join(ocr_fp, f))]

files = [file for file in tqdm(all_files) if file in files_to_be_opened]

100%|██████████| 69659/69659 [00:00<00:00, 1436647.60it/s]
100%|██████████| 399129/399129 [00:02<00:00, 170673.53it/s]
 56%|█████▋    | 225355/399128 [05:54<04:26, 650.93it/s]

## Read in & merge data

In [27]:
# Read in metadata file
df_meta = pd.read_hdf(meta_fp)
df_meta.reset_index(drop=False, inplace=True) # extract file name from index

# For merging purposes, get ID alone from file name, e.g. 'journal-article-10.2307_2065002' -> '10.2307_2065002'
df_meta['edited_filename'] = df_meta['file_name'].apply(lambda x: x[16:]) 
df_meta = df_meta[["edited_filename", "article_name", "jstor_url", "abstract", "journal_title", "given_names", "primary_subject", "year", "type"]] # keep only relevant columns

df_meta.head()

Unnamed: 0,edited_filename,article_name,jstor_url,abstract,journal_title,given_names,primary_subject,year,type
0,10.2307_4167860,Cross-Dialectal Variation in Arabic: Competing...,https://www.jstor.org/stable/4167860,Most researchers of Arabic sociolinguistics as...,Language in Society,,Other,1979,research-article
1,10.2307_2578336,,https://www.jstor.org/stable/2578336,,Social Forces,"[Sidney, Hyman P., Riv-Ellen, Stephen, Thomas,...",Sociology,1983,book-review
2,10.2307_2654760,,https://www.jstor.org/stable/2654760,,Contemporary Sociology,"[Sidney, Hyman P., Riv-Ellen, Stephen, Thomas,...",Sociology,1998,book-review
3,10.2307_43242281,editor's note: A KNIGHT'S TALE,https://www.jstor.org/stable/43242281,,Corporate Knights,"[Sidney, Hyman P., Riv-Ellen, Stephen, Thomas,...",Other,2005,misc
4,10.2307_42862018,,https://www.jstor.org/stable/42862018,,Social Science Quarterly,"[Sidney, Hyman P., Riv-Ellen, Stephen, Thomas,...",Sociology,1985,book-review
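
The fixed slice `x[16:]` works because the prefix `'journal-article-'` is exactly 16 characters long. As a minimal sketch (not the notebook's own method), the same step can be written so that it checks for the prefix explicitly and leaves any unexpected file name unchanged. The helper name `extract_article_id` and the second file name are hypothetical, introduced only for illustration.

```python
PREFIX = 'journal-article-'  # prefix seen in the example file names above

def extract_article_id(file_name: str) -> str:
    """Return the ID part of a file name; leave names without the prefix unchanged."""
    if file_name.startswith(PREFIX):
        return file_name[len(PREFIX):]
    return file_name

print(len(PREFIX))                                            # length of the prefix
print(extract_article_id('journal-article-10.2307_2065002'))  # ID only
print(extract_article_id('other-prefix-10.2307_1'))           # hypothetical name, returned as-is
```

With a fixed slice, a file name that lacks the expected prefix would silently lose its first 16 characters, so an explicit check may be safer if the metadata ever mixes naming schemes.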


In [28]:
# Read in filtered index, counts
df = pd.read_csv(articles_list_fp, low_memory=False, header=None, names=["file_name"])
df['edited_filename'] = df['file_name'].apply(lambda x: x[16:]) # New col with only article ID

df_counts = pd.read_csv(counts_fp, low_memory=False)
df_counts['edited_filename'] = df_counts['article_id'].apply(lambda x: x[16:]) # New col with only article ID
df_counts = df_counts[['edited_filename', 'word_count']]

# Merge meta data, counts into articles list DF
df = pd.merge(df, df_meta, how='left', on='edited_filename') # meta data
df = pd.merge(df, df_counts, how='left', on='edited_filename') # counts

# Filter to only full articles: >=1000 words (eliminates 69659 - 65372 = 4287 cases)
#df = df[df['word_count'] >= 1000]

# Show all columns in resulting DF
print("All columns:\n", list(df))
print()

print("Rows, cols in data:", df.shape)

df.head()

All columns:
 ['file_name', 'edited_filename', 'article_name', 'jstor_url', 'abstract', 'journal_title', 'given_names', 'primary_subject', 'year', 'type', 'word_count']

Rows, cols in data: (65372, 11)


Unnamed: 0,file_name,edited_filename,article_name,jstor_url,abstract,journal_title,given_names,primary_subject,year,type,word_count
0,journal-article-10.2307_2065002,10.2307_2065002,Toward More Cumulative Inquiry,https://www.jstor.org/stable/2065002,,Contemporary Sociology,"[Ariela, ARTHUR J., John A., Marilyn, Janemari...",Sociology,1978,research-article,3529
1,journal-article-10.2307_3380821,10.2307_3380821,An Analysis of an Incentive Sick Leave Policy ...,https://www.jstor.org/stable/3380821,Local health departments are under tremendous ...,Public Productivity & Management Review,"[Werner, Werner, Konrad, Rudi, Paul, Jean, Rob...",Management & Organizational Behavior,1986,research-article,5195
2,journal-article-10.2307_2095822,10.2307_2095822,Local Friendship Ties and Community Attachment...,https://www.jstor.org/stable/2095822,This study presents a multilevel empirical tes...,American Sociological Review,"[Alice O., Peter, W. Erwin, Bert, Robert W., C...",Sociology,1983,research-article,7100
3,journal-article-10.2307_40836133,10.2307_40836133,Knowledge Transfer within the Multinational Fi...,https://www.jstor.org/stable/40836133,This paper examines the process of knowledge t...,MIR: Management International Review,"[Ariela, ARTHUR J., John A., Marilyn, Janemari...",Management & Organizational Behavior,2005,research-article,7110
4,journal-article-10.2307_2579666,10.2307_2579666,Dynamics of Labor Market Segmentation in Polan...,https://www.jstor.org/stable/2579666,Research in the early 1980s showed that indust...,Social Forces,"[Ariela, ARTHUR J., John A., Marilyn, Janemari...",Sociology,1990,research-article,5313
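
A left merge keeps every row of the article index even when no metadata or count is found, so missing matches show up only as NaN values. The sketch below, on made-up toy data, shows one way to count such rows using the `indicator` and `validate` options of `pd.merge`. The frame names `toy_index` and `toy_meta` and their values are assumptions for illustration, not the real files.

```python
import pandas as pd

# Toy stand-ins for the article index and the metadata (values are made up)
toy_index = pd.DataFrame({'edited_filename': ['a1', 'a2', 'a3']})
toy_meta = pd.DataFrame({'edited_filename': ['a1', 'a3'],
                         'year': [1983, 1990]})

# validate raises an error if keys are duplicated on either side;
# indicator adds a '_merge' column recording where each row came from
merged = pd.merge(toy_index, toy_meta, how='left', on='edited_filename',
                  validate='one_to_one', indicator=True)

n_unmatched = (merged['_merge'] == 'left_only').sum()
print(merged)
print("Rows without metadata:", n_unmatched)
```

Running the same check on the real merges would show how many articles lack metadata or word counts before any filtering is applied.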


In [233]:
# Read in true positives from H2--in citation format
coded_cult = pd.read_csv(coded_cult_fp, low_memory=False, header=None, 
                         encoding="Latin-1").rename(columns = {0:'citation'})
coded_relt = pd.read_csv(coded_relt_fp, low_memory=False, header=None, 
                         encoding="Latin-1").rename(columns = {0:'citation'})
coded_demog = pd.read_csv(coded_demog_fp, low_memory=False, header=None, 
                          encoding="Latin-1").rename(columns = {0:'citation'})

coded_cult.head()

Unnamed: 0,citation
0,"Barley, Stephen R. 1983. Semiotics and the s..."
1,"Barney, Jay B. 1986. Organizational culture:..."
2,"Castilla, Emilio J., and Stephen Benard. 2010..."
3,"Dutton, Jane E., and Janet M. Dukerich. 1991...."
4,"Fine, Gary Alan. 1984. Negotiated orders and..."


In [234]:
# Read in hand-coded data
coded_df = pd.read_csv(coded_11620, low_memory=False, header=0)
coded_df.head()

Unnamed: 0,cultural_score,relational_score,demographic_score,article_name,abstract,jstor_url,year,journal_title,edited_filename,culture_word2vec_cosine,culture_ngram_count.1,cultural_author_count,relational_word2vec_cosine,relational_ngram_count.1,relational_author_count,demographic_word2vec_cosine,demographic_ngram_count.1,demographic_author_count
0,1.0,0.0,0.0,"Intersecting Three Muddy Roads: Stability, Leg...",Several decades of research by multiple academ...,https://www.jstor.org/stable/25822540,2011.0,Journal of Managerial Issues,10.2307_25822540,0.754487,227.0,7.0,0.61303,33.0,1.0,0.560983,119.0,0.0
1,1.0,0.0,0.0,Rational Decision Making as Performative Praxi...,Organizational theorists built their knowledge...,external-fulltext-any,2011.0,Organization Science,10.2307_20868880,0.721939,55.0,6.0,0.588276,16.0,1.0,0.534615,2.0,0.0
2,1.0,1.0,0.0,From Fiefs to Clans and Network Capitalism: Ex...,China's rapid economic development is being ac...,https://www.jstor.org/stable/2393869,1986.0,Administrative Science Quarterly,10.2307_2393869,0.715111,73.0,0.0,0.644378,66.0,0.0,0.530408,17.0,0.0
3,1.0,1.0,0.0,The Collective Strategy Framework: An Applicat...,This paper investigates empirically the compet...,https://www.jstor.org/stable/2392643,1984.0,Administrative Science Quarterly,10.2307_2392643,0.702606,114.0,9.0,0.671079,88.0,2.0,0.6746,124.0,9.0
4,1.0,0.0,0.0,"Political Institutional Change, Obsolescing Le...",This paper studies the practice of integration...,https://www.jstor.org/stable/41682289,2012.0,MIR: Management International Review,10.2307_41682289,0.68824,218.0,1.0,0.692662,111.0,0.0,0.586081,98.0,0.0


### Merge true positives into hand-coded data

In [235]:
# Extract title from citation format & preprocess: lower-case, remove punctuation, strip whitespace
title_pattern = r'(?<=\d{4}\.).*' # regex pattern for getting title

for coded in [coded_cult, coded_relt, coded_demog]:
    coded['article_name_edited'] = coded['citation'].apply(
        lambda cite: re.sub(
            r'\W+', ' ', # replace runs of non-word characters with a space
            re.findall(title_pattern, cite)[0] # everything after "YYYY."
            .split('.')[0] # keep text before the first period, dropping the journal title
            .strip().lower())) # strip whitespace, lower-case
    
# np.nan (lower-case) is used because the np.NaN alias was removed in NumPy 2.0
coded_cult['cultural_score'] = 1
coded_cult['relational_score'] = np.nan
coded_cult['demographic_score'] = np.nan

coded_relt['cultural_score'] = np.nan
coded_relt['relational_score'] = 1
coded_relt['demographic_score'] = np.nan

coded_demog['cultural_score'] = np.nan
coded_demog['relational_score'] = np.nan
coded_demog['demographic_score'] = 1

coded_cult.head()

Unnamed: 0,citation,article_name_edited,cultural_score,relational_score,demographic_score
0,"Barley, Stephen R. 1983. Semiotics and the s...",semiotics and the study of occupational and or...,1,0,0
1,"Barney, Jay B. 1986. Organizational culture:...",organizational culture can it be a source of s...,1,0,0
2,"Castilla, Emilio J., and Stephen Benard. 2010...",the paradox of meritocracy in organizations,1,0,0
3,"Dutton, Jane E., and Janet M. Dukerich. 1991....",keeping an eye on the mirror image and identit...,1,0,0
4,"Fine, Gary Alan. 1984. Negotiated orders and...",negotiated orders and organizational cultures,1,0,0
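
To make the title extraction concrete, here is a small worked sketch of the same steps on single citations. It assumes citations follow an "Author. Year. Title. Journal ..." layout. The wrapper `extract_title` is a hypothetical helper, the volume and page numbers in the first citation are illustrative, and the second citation is invented. Unlike the cell above, it returns `None` rather than raising an `IndexError` when no year is found.

```python
import re

TITLE_PATTERN = r'(?<=\d{4}\.).*'  # same pattern as above: everything after "YYYY."

def extract_title(citation):
    """Return the cleaned title from a citation, or None if no 'YYYY.' is found."""
    match = re.search(TITLE_PATTERN, citation)
    if match is None:
        return None
    title = match.group(0).split('.')[0]  # text up to the first period after the year
    return re.sub(r'\W+', ' ', title.strip().lower())

example = ('Barley, Stephen R. 1983. Semiotics and the study of occupational '
           'and organizational cultures. Administrative Science Quarterly 28: 393-413.')
question = 'Doe, Jane. 2000. Does it work? Journal of Examples 1: 1-2.'  # invented citation

print(extract_title(example))         # title only
print(extract_title(question))        # journal name and pages leak into the "title"
print(extract_title('No year here'))  # None
```

The second case shows a limitation of splitting on the first period: a title that ends in "?" (or contains an internal period) is cut at the wrong place, which may be one reason some citations fail to match on exact titles.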


In [236]:
# Preprocess article names: lower-case, remove punctuation, strip whitespace
df['article_name_edited'] = df['article_name'].apply(
    lambda title: re.sub(
        r'\W+', ' ', # replace runs of non-word characters with a space
        str(title).strip().lower())) # strip whitespace, lower-case

df['article_name_edited'].head()

0                       toward more cumulative inquiry
1    an analysis of an incentive sick leave policy ...
2    local friendship ties and community attachment...
3    knowledge transfer within the multinational fi...
4    dynamics of labor market segmentation in polan...
Name: article_name_edited, dtype: object

In [237]:
# Merge metadata into H2-coded articles on article_name_edited
coded_cult = pd.merge(coded_cult, df, how='left', on='article_name_edited')
coded_relt = pd.merge(coded_relt, df, how='left', on='article_name_edited')
coded_demog = pd.merge(coded_demog, df, how='left', on='article_name_edited')
coded_cult

Unnamed: 0,citation,article_name_edited,cultural_score,relational_score,demographic_score,file_name,edited_filename,article_name,jstor_url,abstract,journal_title,given_names,primary_subject,year,type,word_count
0,"Barley, Stephen R. 1983. Semiotics and the s...",semiotics and the study of occupational and or...,1,0,0,journal-article-10.2307_2392249,10.2307_2392249,Semiotics and the Study of Occupational and Or...,https://www.jstor.org/stable/2392249,Semiotics offers an approach for researching a...,Administrative Science Quarterly,"[Loren C., Jeffrey, Stephen, Gerrie ter, Mathi...",Management & Organizational Behavior,1979,research-article,8955.0
1,"Barney, Jay B. 1986. Organizational culture:...",organizational culture can it be a source of s...,1,0,0,,,,,,,,,,,
2,"Castilla, Emilio J., and Stephen Benard. 2010...",the paradox of meritocracy in organizations,1,0,0,journal-article-10.2307_41149515,10.2307_41149515,The Paradox of Meritocracy in Organizations,https://www.jstor.org/stable/41149515,"In this article, we develop and empirically te...",Administrative Science Quarterly,"[ROBERT, Riziki S., Idris S., Patrick, L. A., ...",Management & Organizational Behavior,2010,research-article,12889.0
3,"Dutton, Jane E., and Janet M. Dukerich. 1991....",keeping an eye on the mirror image and identit...,1,0,0,,,,,,,,,,,
4,"Fine, Gary Alan. 1984. Negotiated orders and...",negotiated orders and organizational cultures,1,0,0,journal-article-10.2307_2083175,10.2307_2083175,Negotiated Orders and Organizational Cultures,https://www.jstor.org/stable/2083175,Negotiated order and organizational culture re...,Annual Review of Sociology,"[Loren C., Jeffrey, Stephen, Gerrie ter, Mathi...",Sociology,1973,research-article,8474.0
5,"Fiol, C. Marlene. 2002. Capitalizing on para...",capitalizing on paradox the role of language i...,1,0,0,journal-article-10.2307_3086086,10.2307_3086086,Capitalizing on Paradox: The Role of Language ...,https://www.jstor.org/stable/3086086,A strongly identified workforce presents a par...,Organization Science,"[Sidney, Hyman P., Riv-Ellen, Stephen, Thomas,...",Management & Organizational Behavior,1993,research-article,7969.0
6,"Goldberg, Amir, Sameer B. Srivastava, V. Govin...",fitting in or standing out the tradeoffs of st...,1,0,0,,,,,,,,,,,
7,"Morrill, Calvin. 1991. Conflict management, ...",conflict management honor and organizational c...,1,0,0,journal-article-10.2307_2781778,10.2307_2781778,"Conflict Management, Honor, and Organizational...",https://www.jstor.org/stable/2781778,How do top managers of a large American corpor...,American Journal of Sociology,"[Klaus, Urs, Justus M., Brian, A. Gus, Celia, ...",Sociology,1984,research-article,11255.0
8,"Ouchi, William G., and Alan L. Wilkins. 1985....",organizational culture,1,0,0,journal-article-10.2307_2083303,10.2307_2083303,Organizational Culture,https://www.jstor.org/stable/2083303,The contemporary study of organizational cultu...,Annual Review of Sociology,"[Emile, Roger W., Robert N., Mathieu, M., Rich...",Sociology,1983,research-article,9425.0
9,"Pettigrew, Andrew M. 1979. On studying organ...",on studying organizational cultures,1,0,0,journal-article-10.2307_2392363,10.2307_2392363,On Studying Organizational Cultures,https://www.jstor.org/stable/2392363,,Administrative Science Quarterly,"[Claude, François, Karen, Caglar, Gavin, Sarah...",Management & Organizational Behavior,1973,research-article,4101.0


In [None]:
# TO DO: For those rows with no file_name matched by strict method, 
# implement fuzzy matching of df's article name edited onto coded_cult (etc)'s article name edited
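
One possible way to approach this TODO, sketched here on toy data, is the standard library's `difflib.get_close_matches`, which returns the candidates whose similarity ratio meets a cutoff. This is only a sketch under stated assumptions, not the project's chosen method: the helper `fuzzy_match_title`, the cutoff of 0.9, the toy frames, and the IDs are all hypothetical.

```python
import difflib
import pandas as pd

def fuzzy_match_title(title, candidates, cutoff=0.9):
    """Return the closest candidate with similarity >= cutoff, else None."""
    matches = difflib.get_close_matches(title, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# Toy stand-ins for df and coded_cult (IDs are made up)
toy_articles = pd.DataFrame({
    'article_name_edited': ['negotiated orders and organizational cultures',
                            'organizational culture'],
    'edited_filename': ['id_001', 'id_002']})
toy_coded = pd.DataFrame({
    'article_name_edited': ['negotiated order and organizational cultures',  # one letter off
                            'a completely different title']})

candidates = toy_articles['article_name_edited'].tolist()
toy_coded['matched_title'] = toy_coded['article_name_edited'].apply(
    lambda t: fuzzy_match_title(t, candidates))

# Attach the ID of the matched article; unmatched rows keep NaN
toy_coded = toy_coded.merge(
    toy_articles.rename(columns={'article_name_edited': 'matched_title'}),
    how='left', on='matched_title')
print(toy_coded[['article_name_edited', 'matched_title', 'edited_filename']])
```

Each call compares one title against the full candidate list, which should be manageable for a few dozen unmatched citations. The cutoff would need tuning by hand, since a low value risks pairing a citation with the wrong article.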


In [239]:
# Concatenate H2-coded data (all three perspectives) with hand-coded data, keeping only shared columns
coded_df = pd.concat([coded_df, coded_cult, coded_relt, coded_demog], axis=0, join='inner', ignore_index=True)
coded_df

Unnamed: 0,cultural_score,relational_score,demographic_score,article_name,abstract,jstor_url,year,journal_title,edited_filename
0,1.0,0.0,0.0,"Intersecting Three Muddy Roads: Stability, Leg...",Several decades of research by multiple academ...,https://www.jstor.org/stable/25822540,2011,Journal of Managerial Issues,10.2307_25822540
1,1.0,0.0,0.0,Rational Decision Making as Performative Praxi...,Organizational theorists built their knowledge...,external-fulltext-any,2011,Organization Science,10.2307_20868880
2,1.0,1.0,0.0,From Fiefs to Clans and Network Capitalism: Ex...,China's rapid economic development is being ac...,https://www.jstor.org/stable/2393869,1986,Administrative Science Quarterly,10.2307_2393869
3,1.0,1.0,0.0,The Collective Strategy Framework: An Applicat...,This paper investigates empirically the compet...,https://www.jstor.org/stable/2392643,1984,Administrative Science Quarterly,10.2307_2392643
4,1.0,0.0,0.0,"Political Institutional Change, Obsolescing Le...",This paper studies the practice of integration...,https://www.jstor.org/stable/41682289,2012,MIR: Management International Review,10.2307_41682289
5,1.0,0.0,0.0,Culture and Meaning: Making Sense of Conflicti...,,https://www.jstor.org/stable/40397128,1989,International Studies of Management & Organiza...,10.2307_40397128
6,1.0,0.0,0.0,Linking Organizational Values to Relationships...,This study explores the organizational values ...,https://www.jstor.org/stable/2640266,1995,Organization Science,10.2307_2640266
7,1.0,0.0,0.0,Beyond the red tape: How victims of terrorism ...,We use a storyteller perspective to examine ho...,external-fulltext-any,2011,Journal of Organizational Behavior,10.2307_41415713
8,1.0,0.0,0.0,Embedding Sustainability Across the Organizati...,This article is a response to Haugh and Talwar...,https://www-jstor-org.proxy.library.georgetown...,2011,Academy of Management Learning & Education,10.2307_23100442
9,1.0,1.0,0.0,When Experience Meets National Institutional E...,We develop an institutional change perspective...,https://www.jstor.org/stable/27735492,2009,Strategic Management Journal,10.2307_27735492
