# Filter and sample articles for hand-coding
## Evaluate articles as high/low in engagement with each perspective using word2vec engagement scores, randomize each slice, take random sample

""" 
This script filters academic articles to those high (above the 75th percentile) or low (below the 25th percentile) on engagement with each perspective (relational, demographic, or cultural) across two subject areas (Sociology or Management & Organizational Behavior). This yields 12 slices of articles: 2 levels (high/low engagement) by 2 subject areas by 3 perspectives. It uses Word2Vec cosine similarity scores as the measure of engagement with each perspective. It saves each resulting list of articles in .csv format (the .xlsx export is currently commented out). Finally, for each of the 6 subject-by-perspective combinations, it combines 6 low-engagement and 194 high-engagement articles into a sample of 200, resulting in 1,200 articles saved in 6 .csv files. <br>
Data source: JSTOR Data-for-Research <br>
Project repository: https://github.com/h2researchgroup/Computational-Analysis-For-Social-Science
"""

@title: Random Sample of Articles by Word2Vec Cosine Scores <br>
@author: Jaren Haber <br>
@collaborators: Heather Haveman, Yoon Sung Hong <br>
@date: October 30th, 2020

In [1]:
# Import packages (file locations are set in the next cell)
import pandas as pd
import os
import tables
import random

In [2]:
# Set file locations
root = '/home/jovyan/work/' # set root directory

# dictionary counts (using core dictionaries) and matched subjects 
counts_path = root + 'Computational-Analysis-For-Social-Science/Dictionary Mapping/counts_and_subject.csv'

# per-article info on cosine scores using each dictionary (core or 100-term dictionaries??)
cosines_path = root + 'models_storage/word_embeddings_data/text_with_cosine_scores_wdg_2020_oct27.csv'

# per-article metadata with URLs
meta_path = root + 'Computational-Analysis-For-Social-Science/Dictionary Mapping/Pipe/metadata_combined.h5' 

# Filtered index of research articles
filtered_path = root + 'Computational-Analysis-For-Social-Science/Dictionary Mapping/Pipe/filtered_index.csv'

# output directory: save files here
output_path = root + 'Computational-Analysis-For-Social-Science/WordEmbedding/sample_generation/sample_output_w2v/'

# for text files (not used here)
ocr_path = root + 'jstor_data/ocr/' 

# Read, filter, merge articles data

In [3]:
# Read in metadata file
df_meta = pd.read_hdf(meta_path)
df_meta.reset_index(drop=False, inplace=True) # extract file name from index

# For merging purposes, get ID alone from file name, e.g. 'journal-article-10.2307_2065002' -> '10.2307_2065002'
df_meta['edited_filename'] = df_meta['file_name'].apply(lambda x: x[16:]) 
df_meta = df_meta[["edited_filename", "article_name", "jstor_url", "abstract", "journal_title", "given_names", "primary_subject", "year", "type"]] # keep only relevant columns

df_meta.head()

Unnamed: 0,edited_filename,article_name,jstor_url,abstract,journal_title,given_names,primary_subject,year,type
0,10.2307_4167860,Cross-Dialectal Variation in Arabic: Competing...,https://www.jstor.org/stable/4167860,Most researchers of Arabic sociolinguistics as...,Language in Society,,Other,1979,research-article
1,10.2307_2578336,,https://www.jstor.org/stable/2578336,,Social Forces,"[Sidney, Hyman P., Riv-Ellen, Stephen, Thomas,...",Sociology,1983,book-review
2,10.2307_2654760,,https://www.jstor.org/stable/2654760,,Contemporary Sociology,"[Sidney, Hyman P., Riv-Ellen, Stephen, Thomas,...",Sociology,1998,book-review
3,10.2307_43242281,editor's note: A KNIGHT'S TALE,https://www.jstor.org/stable/43242281,,Corporate Knights,"[Sidney, Hyman P., Riv-Ellen, Stephen, Thomas,...",Other,2005,misc
4,10.2307_42862018,,https://www.jstor.org/stable/42862018,,Social Science Quarterly,"[Sidney, Hyman P., Riv-Ellen, Stephen, Thomas,...",Sociology,1985,book-review
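
The fixed slice `x[16:]` works because `journal-article-` is 16 characters long, but it would silently truncate any file name that lacks that prefix. A minimal sketch of a prefix-aware alternative follows; `strip_prefix` and `PREFIX` are hypothetical names introduced only for illustration, and the toy file names are made up.

```python
import pandas as pd

PREFIX = "journal-article-"  # assumed prefix, taken from the example in the cell above

def strip_prefix(file_name: str, prefix: str = PREFIX) -> str:
    """Return the article ID, removing the prefix only when it is actually present."""
    return file_name[len(prefix):] if file_name.startswith(prefix) else file_name

# Toy data: one name with the prefix, one without
demo = pd.DataFrame({"file_name": ["journal-article-10.2307_2065002", "10.2307_4167860"]})
demo["edited_filename"] = demo["file_name"].apply(strip_prefix)
```

With this approach a name that is already a bare ID passes through unchanged instead of losing its first 16 characters.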


In [4]:
# Read in filtered index, counts, and cosine scores data
df = pd.read_csv(filtered_path, low_memory=False, header=None, names=["file_name"]) # filtered index
df['edited_filename'] = df['file_name'].apply(lambda x: x[16:]) # New col with only article ID

df_counts = pd.read_csv(counts_path, low_memory=False)
df_counts['edited_filename'] = df_counts['article_id'].apply(lambda x: x[16:]) # New col with only article ID
df_counts = df_counts[['edited_filename', 'culture_ngram_count.1', 'relational_ngram_count.1', 
                       'demographic_ngram_count.1', 'word_count', 'cultural_author_count', 
                       'demographic_author_count', 'relational_author_count']]

df_scores = pd.read_csv(cosines_path, low_memory=False)
df_scores.drop(columns=['Unnamed: 0','Unnamed: 0.1'], inplace=True) # drop useless columns

In [5]:
# Merge cosine scores into filtered-index DF
df = pd.merge(df, df_scores, how='left', on='edited_filename')

# Merge counts into index+cosine DF
df = pd.merge(df, df_counts, how='left', on='edited_filename')

# Merge metadata (including URLs) into cosine+counts DF
df = pd.merge(df, df_meta, how='left', on='edited_filename')

# Show all columns in final DF
print("All columns:\n", list(df))
print()

# Show breakdown of subjects
print("Breakdown of subject areas:")
print(df["primary_subject"].value_counts())

df.head()

All columns:
 ['file_name', 'edited_filename', 'filename', 'text', 'relational_word2vec_cosine', 'demographic_word2vec_cosine', 'culture_word2vec_cosine', 'culture_ngram_count.1', 'relational_ngram_count.1', 'demographic_ngram_count.1', 'word_count', 'cultural_author_count', 'demographic_author_count', 'relational_author_count', 'article_name', 'jstor_url', 'abstract', 'journal_title', 'given_names', 'primary_subject', 'year', 'type']

Breakdown of subject areas:
Sociology                               51077
Management & Organizational Behavior    18582
Name: primary_subject, dtype: int64


Unnamed: 0,file_name,edited_filename,filename,text,relational_word2vec_cosine,demographic_word2vec_cosine,culture_word2vec_cosine,culture_ngram_count.1,relational_ngram_count.1,demographic_ngram_count.1,...,demographic_author_count,relational_author_count,article_name,jstor_url,abstract,journal_title,given_names,primary_subject,year,type
0,journal-article-10.2307_2065002,10.2307_2065002,../../../jstor_data/ocr/journal-article-10.230...,symposium toward more cumulative inquiry phili...,0.593038,0.582339,0.646683,3.0,5.0,7.0,...,0.0,0.0,Toward More Cumulative Inquiry,https://www.jstor.org/stable/2065002,,Contemporary Sociology,"[Ariela, ARTHUR J., John A., Marilyn, Janemari...",Sociology,1978,research-article
1,journal-article-10.2307_3380821,10.2307_3380821,../../../jstor_data/ocr/journal-article-10.230...,productivity review analysis incentive sick le...,0.62688,0.590552,0.63219,6.0,7.0,5.0,...,0.0,0.0,An Analysis of an Incentive Sick Leave Policy ...,https://www.jstor.org/stable/3380821,Local health departments are under tremendous ...,Public Productivity & Management Review,"[Werner, Werner, Konrad, Rudi, Paul, Jean, Rob...",Management & Organizational Behavior,1986,research-article
2,journal-article-10.2307_2095822,10.2307_2095822,../../../jstor_data/ocr/journal-article-10.230...,local friendship ties community attachment mas...,0.685393,0.643482,0.657989,6.0,24.0,61.0,...,0.0,0.0,Local Friendship Ties and Community Attachment...,https://www.jstor.org/stable/2095822,This study presents a multilevel empirical tes...,American Sociological Review,"[Alice O., Peter, W. Erwin, Bert, Robert W., C...",Sociology,1983,research-article
3,journal-article-10.2307_2631839,10.2307_2631839,../../../jstor_data/ocr/journal-article-10.230...,management science vol december printed notes ...,0.635738,0.616048,0.602856,0.0,0.0,0.0,...,0.0,0.0,The Managerial Economics of Civil Litigation: ...,https://www.jstor.org/stable/2631839,,Management Science,"[JOHN, Michael, Tom, Beth B., Gari, Lola, E., ...",Management & Organizational Behavior,1985,research-article
4,journal-article-10.2307_40836133,10.2307_40836133,../../../jstor_data/ocr/journal-article-10.230...,mir special issue pp mir management internatio...,0.774757,0.674117,0.697802,52.0,28.0,7.0,...,0.0,0.0,Knowledge Transfer within the Multinational Fi...,https://www.jstor.org/stable/40836133,This paper examines the process of knowledge t...,MIR: Management International Review,"[Ariela, ARTHUR J., John A., Marilyn, Janemari...",Management & Organizational Behavior,2005,research-article
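
Left merges keep every row of the filtered index, so articles missing from the scores or counts tables end up with `NaN` values rather than being dropped. One way to surface such rows, sketched here on made-up toy frames (`index_df` and `scores_df` are stand-ins, not the real data), is to use the `validate` and `indicator` options of `pd.merge`.

```python
import pandas as pd

# Toy stand-ins for the filtered index and the cosine-score table (values are made up)
index_df = pd.DataFrame({"edited_filename": ["a", "b", "c"]})
scores_df = pd.DataFrame({"edited_filename": ["a", "b"],
                          "relational_word2vec_cosine": [0.61, 0.58]})

# validate raises if the key is duplicated on either side;
# indicator adds a "_merge" column recording where each row came from
merged = pd.merge(index_df, scores_df, how="left", on="edited_filename",
                  validate="one_to_one", indicator=True)

# IDs present in the index but absent from the scores table
unmatched = merged.loc[merged["_merge"] == "left_only", "edited_filename"].tolist()
merged = merged.drop(columns="_merge")
```

Whether `one_to_one` is the right check for the real tables is an assumption; `many_to_one` may fit better if the index can repeat an ID.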


In [6]:
# Notes to detect # of Abstracts (not listed as "None"):
#mask = df.dropna(subset=["abstract"])["abstract"].apply(lambda x: len(x)!=4) 

# Filter, randomize, and save articles

## Filter articles to high/low engagement

In [7]:
# Separate articles by subject, to make computing quartiles easier: create masks for this purpose
socmask = df["primary_subject"]=="Sociology"
mgtmask = df["primary_subject"]=="Management & Organizational Behavior"

In [8]:
# Compute quartiles of each cosine score: per perspective, per subject area
soc_relt_q25, soc_relt_q50, soc_relt_q75 = [thresh for thresh in df[socmask]["relational_word2vec_cosine"].quantile(q=[.25,.5,.75])]
soc_demog_q25, soc_demog_q50, soc_demog_q75 = [thresh for thresh in df[socmask]["demographic_word2vec_cosine"].quantile(q=[.25,.5,.75])]
soc_cult_q25, soc_cult_q50, soc_cult_q75 = [thresh for thresh in df[socmask]["culture_word2vec_cosine"].quantile(q=[.25,.5,.75])]

mgt_relt_q25, mgt_relt_q50, mgt_relt_q75 = [thresh for thresh in df[mgtmask]["relational_word2vec_cosine"].quantile(q=[.25,.5,.75])]
mgt_demog_q25, mgt_demog_q50, mgt_demog_q75 = [thresh for thresh in df[mgtmask]["demographic_word2vec_cosine"].quantile(q=[.25,.5,.75])]
mgt_cult_q25, mgt_cult_q50, mgt_cult_q75 = [thresh for thresh in df[mgtmask]["culture_word2vec_cosine"].quantile(q=[.25,.5,.75])]
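
The same per-subject, per-perspective quartiles can also be computed in a single call. The sketch below uses `groupby(...).quantile(...)` on a made-up toy frame; the column names match those in the notebook, but the values and the `toy` and `quartiles` names are illustrative only.

```python
import pandas as pd

score_cols = ["relational_word2vec_cosine",
              "demographic_word2vec_cosine",
              "culture_word2vec_cosine"]

# Toy data: five articles per subject, identical made-up scores in every column
values = [0.1, 0.2, 0.3, 0.4, 0.5, 0.5, 0.6, 0.7, 0.8, 0.9]
toy = pd.DataFrame({col: values for col in score_cols})
toy["primary_subject"] = (["Sociology"] * 5
                          + ["Management & Organizational Behavior"] * 5)

# Result is indexed by (subject, quantile), with one column per perspective
quartiles = toy.groupby("primary_subject")[score_cols].quantile([0.25, 0.5, 0.75])

# Look up a single threshold, e.g. the Sociology 25th percentile for relational scores
soc_relt_q25 = quartiles.loc[("Sociology", 0.25), "relational_word2vec_cosine"]
```

Keeping all thresholds in one table avoids maintaining 18 separate variables, at the cost of a slightly longer lookup.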

In [9]:
# Get article slices for random sampling by filtering each subject's articles 
# to high or low on each perspective: those strictly above q75 vs. those strictly below q25
soc_relt_low, soc_relt_high = df[socmask][lambda x: x["relational_word2vec_cosine"] < soc_relt_q25], df[socmask][lambda x: x["relational_word2vec_cosine"] > soc_relt_q75]
soc_demog_low, soc_demog_high = df[socmask][lambda x: x["demographic_word2vec_cosine"] < soc_demog_q25], df[socmask][lambda x: x["demographic_word2vec_cosine"] > soc_demog_q75]
soc_cult_low, soc_cult_high = df[socmask][lambda x: x["culture_word2vec_cosine"] < soc_cult_q25], df[socmask][lambda x: x["culture_word2vec_cosine"] > soc_cult_q75]

mgt_relt_low, mgt_relt_high = df[mgtmask][lambda x: x["relational_word2vec_cosine"] < mgt_relt_q25], df[mgtmask][lambda x: x["relational_word2vec_cosine"] > mgt_relt_q75]
mgt_demog_low, mgt_demog_high = df[mgtmask][lambda x: x["demographic_word2vec_cosine"] < mgt_demog_q25], df[mgtmask][lambda x: x["demographic_word2vec_cosine"] > mgt_demog_q75]
mgt_cult_low, mgt_cult_high = df[mgtmask][lambda x: x["culture_word2vec_cosine"] < mgt_cult_q25], df[mgtmask][lambda x: x["culture_word2vec_cosine"] > mgt_cult_q75]
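
The twelve filters above all follow one pattern, so they could be expressed with a small helper. This is a sketch under the same strict-inequality rule (below q25, above q75); `split_low_high` is a hypothetical function name and the toy scores are made up.

```python
import pandas as pd

def split_low_high(frame: pd.DataFrame, score_col: str):
    """Return (low, high): rows strictly below the 25th / strictly above the 75th percentile."""
    q25, q75 = frame[score_col].quantile([0.25, 0.75])
    low = frame[frame[score_col] < q25]
    high = frame[frame[score_col] > q75]
    return low, high

# Toy data: nine articles with evenly spaced made-up scores
toy = pd.DataFrame({"culture_word2vec_cosine":
                    [0.50, 0.55, 0.60, 0.65, 0.70, 0.75, 0.80, 0.85, 0.90]})
low, high = split_low_high(toy, "culture_word2vec_cosine")
```

In the notebook's terms, a call such as `split_low_high(df[socmask], "relational_word2vec_cosine")` would play the role of one of the twelve lines above, with the quartiles computed inside the helper rather than in a separate cell.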

## Save all articles in random order

In [10]:
# Filter to only key columns for hand-coding. First define columns to keep:
keepcols = ['article_name', 'abstract', 'jstor_url', 'word_count', 'year', 'journal_title', 'edited_filename', 
            'culture_word2vec_cosine', 'culture_ngram_count.1', 'cultural_author_count', 
            'relational_word2vec_cosine', 'relational_ngram_count.1', 'relational_author_count', 
            'demographic_word2vec_cosine', 'demographic_ngram_count.1', 'demographic_author_count'] 

# Implement filter and reset index:
soc_relt_low, soc_relt_high = soc_relt_low.reset_index(drop=True)[keepcols], soc_relt_high.reset_index(drop=True)[keepcols]
soc_demog_low, soc_demog_high = soc_demog_low.reset_index(drop=True)[keepcols], soc_demog_high.reset_index(drop=True)[keepcols]
soc_cult_low, soc_cult_high = soc_cult_low.reset_index(drop=True)[keepcols], soc_cult_high.reset_index(drop=True)[keepcols]

mgt_relt_low, mgt_relt_high = mgt_relt_low.reset_index(drop=True)[keepcols], mgt_relt_high.reset_index(drop=True)[keepcols]
mgt_demog_low, mgt_demog_high = mgt_demog_low.reset_index(drop=True)[keepcols], mgt_demog_high.reset_index(drop=True)[keepcols]
mgt_cult_low, mgt_cult_high = mgt_cult_low.reset_index(drop=True)[keepcols], mgt_cult_high.reset_index(drop=True)[keepcols]

In [11]:
soc_relt_low.head()

Unnamed: 0,article_name,abstract,jstor_url,word_count,year,journal_title,edited_filename,culture_word2vec_cosine,culture_ngram_count.1,cultural_author_count,relational_word2vec_cosine,relational_ngram_count.1,relational_author_count,demographic_word2vec_cosine,demographic_ngram_count.1,demographic_author_count
0,XIth Polish Sociological Congress,,https://www.jstor.org/stable/41274741,1201.0,2000,Polish Sociological Review,10.2307_41274741,0.548939,2.0,0.0,0.482802,3.0,0.0,0.487472,2.0,0.0
1,Transitions Into and Out of Cohabitation in La...,Cohabitation among adults over age 50 is risin...,external-fulltext-any,10902.0,2012,Journal of Marriage and Family,10.2307_41678755,0.540821,0.0,0.0,0.542854,12.0,0.0,0.563501,26.0,0.0
2,Mary Lou Purcell Receives the 1979 Ernest G. O...,,https://www.jstor.org/stable/583727,301.0,1980,Family Relations,10.2307_583727,0.579847,0.0,0.0,0.558994,0.0,0.0,0.519012,1.0,0.0
3,Sociodemographic Differentials in Mate Selecti...,"Data from over 2,000 respondents in the Nation...",https://www.jstor.org/stable/352998,6514.0,1987,Journal of Marriage and Family,10.2307_352998,0.570249,15.0,0.0,0.562282,27.0,0.0,0.549956,57.0,0.0
4,National Integration and National Language,The language situation in the Philippines toda...,https://www.jstor.org/stable/23892183,741.0,1972,Philippine Sociological Review,10.2307_23892183,0.601977,0.0,0.0,0.539708,0.0,0.0,0.51921,2.0,0.0


In [12]:
# Save resulting slices in .csv format (skip .xlsx for now)
for var, name in list(zip([soc_relt_low, soc_relt_high, soc_demog_low, soc_demog_high, soc_cult_low, soc_cult_high,
                           mgt_relt_low, mgt_relt_high, mgt_demog_low, mgt_demog_high, mgt_cult_low, mgt_cult_high], 
                          ['soc_relt_low', 'soc_relt_high', 'soc_demog_low', 'soc_demog_high', 'soc_cult_low', 'soc_cult_high',
                           'mgt_relt_low', 'mgt_relt_high', 'mgt_demog_low', 'mgt_demog_high', 'mgt_cult_low', 'mgt_cult_high'])):
    
    path = output_path + name + '_w2v_articles_randomized'
    var.to_csv(path + '.csv')
    #var.to_excel(path + '.xlsx')
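
The loop above writes each slice in its existing row order. If an explicit, repeatable shuffle before saving is wanted, one option is sketched below. It keeps the slices in a dictionary so that names and frames cannot drift apart; the toy frames, the temporary output folder, and the seed value are assumptions for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy slices standing in for the twelve real ones (contents are made up)
slices = {
    "soc_relt_low": pd.DataFrame({"article_name": ["A", "B", "C"]}),
    "soc_relt_high": pd.DataFrame({"article_name": ["D", "E", "F"]}),
}

out_dir = Path(tempfile.mkdtemp())  # temporary folder; the notebook writes to output_path
saved = {}
for name, frame in slices.items():
    # random_state makes the shuffle repeatable across runs
    shuffled = frame.sample(frac=1, random_state=43).reset_index(drop=True)
    path = out_dir / f"{name}_w2v_articles_randomized.csv"
    shuffled.to_csv(path, index=False)
    saved[name] = path
```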

## Take random sample of articles
Keep a total of 200 (a few low, but most high, due to the high false-positive rate) for each subject x perspective: 6 low + 194 high, 1,200 in total.

In [13]:
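# Merge subject-perspective slices and save the sample as .csv
# Note: pandas' .sample() does not use Python's `random` module, so this seed alone
# does not make the shuffle below reproducible (pass random_state to .sample() for that)
random.seed(43)

# Get 200 articles per subject x perspective:
# rows 100-105 of the low slice (6) + rows 100-293 of the high slice (194).
# The .sample(frac=1) shuffle is then overridden by the final sort on article_name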
soc_relt = pd.concat([soc_relt_low.iloc[100:106], soc_relt_high.iloc[100:294]]).sample(frac=1).reset_index(drop=True).sort_values(by="article_name")
soc_demog = pd.concat([soc_demog_low.iloc[100:106], soc_demog_high.iloc[100:294]]).sample(frac=1).reset_index(drop=True).sort_values(by="article_name")
soc_cult = pd.concat([soc_cult_low.iloc[100:106], soc_cult_high.iloc[100:294]]).sample(frac=1).reset_index(drop=True).sort_values(by="article_name")

mgt_relt = pd.concat([mgt_relt_low.iloc[100:106], mgt_relt_high.iloc[100:294]]).sample(frac=1).reset_index(drop=True).sort_values(by="article_name")
mgt_demog = pd.concat([mgt_demog_low.iloc[100:106], mgt_demog_high.iloc[100:294]]).sample(frac=1).reset_index(drop=True).sort_values(by="article_name")
mgt_cult = pd.concat([mgt_cult_low.iloc[100:106], mgt_cult_high.iloc[100:294]]).sample(frac=1).reset_index(drop=True).sort_values(by="article_name")

# Save random samples in .csv format
for dframe, name in list(zip([soc_relt, soc_demog, soc_cult, mgt_relt, mgt_demog, mgt_cult], 
                             ['soc_relt', 'soc_demog', 'soc_cult', 'mgt_relt', 'mgt_demog', 'mgt_cult'])):
    
    # Add new, empty first columns for manual coding: engagement, notes, and whether the article uses some other perspective 
    col1 = "stance twd. " + str(name[4:]) + " persp. (1: positive; 0.5: maybe; 0: none)" # name of column to be first in DF
    col2 = "notes"
    col3 = "other persp.?"
    
    for col in [col3, col2, col1]: # To get final order right, loop over new cols in reverse order: 3rd col, then 2nd, then 1st
        dframe.insert(0, col, "") # create empty column and put it at the beginning
    
    # Save each resulting DF to file
    path = output_path + name + '_w2v'
    dframe.to_csv(path + '.csv', index=False)
    #dframe.to_excel(path + '.xlsx')
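
The cell above picks fixed row ranges with `.iloc`, which is only a random sample if the slices are already in random order. As an alternative, the rows could be drawn at random directly. The sketch below does that on made-up toy frames; `draw_sample`, its parameters, and the seed value are hypothetical and not part of the original script.

```python
import pandas as pd

def draw_sample(low: pd.DataFrame, high: pd.DataFrame,
                n_low: int, n_high: int, seed: int) -> pd.DataFrame:
    """Draw n_low + n_high rows at random, then shuffle the combined sample."""
    picked = pd.concat([low.sample(n=n_low, random_state=seed),
                        high.sample(n=n_high, random_state=seed)])
    return picked.sample(frac=1, random_state=seed).reset_index(drop=True)

# Toy slices with made-up article names
low = pd.DataFrame({"article_name": [f"low_{i}" for i in range(20)]})
high = pd.DataFrame({"article_name": [f"high_{i}" for i in range(20)]})

sample = draw_sample(low, high, n_low=2, n_high=8, seed=43)
sample.insert(0, "notes", "")  # empty column for hand-coding, placed first
```

Because every `.sample()` call receives `random_state`, rerunning the function with the same seed returns the same articles in the same order.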

In [14]:
soc_relt

Unnamed: 0,stance twd. relt persp. (1: positive; 0.5: maybe; 0: none),notes,other persp.?,article_name,abstract,jstor_url,word_count,year,journal_title,edited_filename,culture_word2vec_cosine,culture_ngram_count.1,cultural_author_count,relational_word2vec_cosine,relational_ngram_count.1,relational_author_count,demographic_word2vec_cosine,demographic_ngram_count.1,demographic_author_count
147,,,,The Creation of a Collective Identity in a So...,,https://www.jstor.org/stable/657729,7764.0,1983,Theory and Society,10.2307_657729,0.711726,25.0,0.0,0.651302,24.0,0.0,0.608884,9.0,0.0
186,,,,'System Destroys Trust?'—Regulatory Institutio...,This article aims to explore public perception...,https://www.jstor.org/stable/40649317,6717.0,2010,Social Indicators Research,10.2307_40649317,0.728911,12.0,0.0,0.686552,10.0,0.0,0.625499,8.0,0.0
73,,,,<bold>Congregations and Crime: Is the Spatial ...,Few studies have focused on how religious cong...,external-fulltext-any,8234.0,2010,Journal for the Scientific Study of Religion,10.2307_40664675,0.624268,13.0,0.0,0.638301,32.0,0.0,0.615536,28.0,0.0
172,,,,A Mixed-Methods Social Networks Study Design f...,This paper advocates the adoption of a mixed-m...,external-fulltext-any,6939.0,2011,Journal of Marriage and Family,10.2307_29789624,0.670382,15.0,0.0,0.707259,136.0,0.0,0.597385,14.0,0.0
23,,,,A U.S. Automobile Plant Analysis of Workers Co...,This study reviews the issues of union commitm...,https://www.jstor.org/stable/43496993,7121.0,2014,"Race, Gender & Class",10.2307_43496993,0.664317,4.0,0.0,0.680939,13.0,0.0,0.626652,1.0,0.0
159,,,,An Evaluation of Socially Responsive Planning ...,New resource towns on the Canadian frontier ha...,https://www.jstor.org/stable/27520872,6634.0,1991,Social Indicators Research,10.2307_27520872,0.691086,5.0,0.0,0.685234,34.0,0.0,0.641717,16.0,0.0
74,,,,An Odd and Inseparable Couple: Emotion and Rat...,The dichotomy between emotion and rationality ...,https://www.jstor.org/stable/40345661,8438.0,2009,Theory and Society,10.2307_40345661,0.732743,36.0,0.0,0.636306,20.0,0.0,0.562708,41.0,0.0
78,,,,Anonymity and the Rise of Universal Occasions ...,"In this research, Durkheim's theory of the uni...",https://www.jstor.org/stable/1387003,6311.0,1981,Journal for the Scientific Study of Religion,10.2307_1387003,0.718938,86.0,0.0,0.649925,58.0,0.0,0.596998,97.0,0.0
189,,,,Appalachian Appellations,This article provides a series of critical ref...,https://www.jstor.org/stable/10.1525/irqr.2014...,7214.0,2014,International Review of Qualitative Research,10.1525_irqr.2014.7.3.359,0.695307,48.0,0.0,0.635210,3.0,0.0,0.623269,4.0,0.0
196,,,,Are Informal Connections a Functional Alternat...,This article aims to ascertain whether organiz...,https://www.jstor.org/stable/24721456,7260.0,2014,Social Indicators Research,10.2307_24721456,0.706787,10.0,0.0,0.709255,49.0,0.0,0.603360,5.0,0.0
