# Compare classification methods for identifying org. science perspectives in JSTOR articles
## Using grid search and balanced samples from hand-labeled set of articles

@authors: Thomas Lu, Jaren Haber PhD<br>
@coauthors: Prof. Heather Haveman, UC Berkeley; Yoon Sung Hong, Wayfair<br>
@contact: Jaren.Haber@georgetown.edu<br>
@project: Computational Literature Review of Organizational Scholarship<br>
@date: September 2021

'''
Trains classifiers to predict whether an article engages a given perspective in organizational science. Training uses a preliminary set of hand-labeled articles, broken down as follows:
Cultural: 105 yes, 209 no
Relational: 92 yes, 230 no
Demographic: 77 yes, 249 no
Compares the f1_weighted scores of four classifiers under 10-fold cross-validation: logistic regression, SVM, Naive Bayes, and decision tree. The minority class in the training data is oversampled to a ratio of 0.7 (7:10 minority:majority).
'''
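
To make the 0.7 oversampling ratio concrete, here is a minimal sketch that resamples the minority class with replacement until it reaches 0.7 times the majority count. It uses only the standard library; the function name `oversample_minority`, the fixed seed, and the rounding down of the target count are assumptions for illustration, not the project's actual resampling code. The label counts are those listed above for the cultural perspective.

```python
import random

def oversample_minority(labels, ratio=0.7, seed=0):
    """Return row indices in which the minority class has been resampled
    (with replacement) up to ratio * size of the majority class.
    Hypothetical helper for illustration only."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    target = int(ratio * len(majority))           # desired minority count (rounded down)
    n_extra = max(0, target - len(minority))      # never drop existing rows
    extra = [rng.choice(minority) for _ in range(n_extra)]
    indices = minority + extra + majority
    rng.shuffle(indices)
    return indices

# Cultural perspective: 105 "yes" and 209 "no" labels
labels = [1] * 105 + [0] * 209
idx = oversample_minority(labels)
resampled = [labels[i] for i in idx]
n_yes = sum(resampled)
n_no = len(resampled) - n_yes
print(n_yes, n_no)
```

When oversampling is combined with cross-validation, it is usually applied to the training folds only, so that duplicated rows do not leak into the held-out fold.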

In [1]:
!pip install nltk



# Initialize

In [2]:
######################################################
# Import libraries
######################################################

import pandas as pd
import numpy as np
import re
from collections import Counter
from datetime import date
from tqdm import tqdm
import os

import nltk
from nltk import word_tokenize
nltk.download('punkt')


import matplotlib.pyplot as plt

import joblib
import csv

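import warnings
# Silence warnings: replacing warnings.warn with a no-op suppresses every warning
# raised through it, so the 'once' filter below only affects warnings that bypass the patch
def warn(*args, **kwargs):
    pass
warnings.warn = warn
warnings.filterwarnings(action='once')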

import sys; sys.path.insert(0, "../preprocess/") # For loading functions from files in other directory
from quickpickle import quickpickle_dump, quickpickle_load # custom scripts for quick saving & loading to pickle format

[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [3]:
######################################################
# Define filepaths
######################################################

data_folder = 'classification'
folder = 'tlu_test'

cwd = os.getcwd()
root = str.replace(cwd, f'{folder}/modeling', '')

thisday = date.today().strftime("%m%d%y")

# Directory for prepared data and trained models: save files here
data_fp = root + f'{data_folder}/data/'
model_fp = root + f'{data_folder}/models/'
logs = root + f'{folder}/modeling/logs/'

# Current article lists
article_list_fp = data_fp + 'filtered_length_index.csv' # Filtered index of research articles
article_paths_fp = data_fp + 'filtered_length_article_paths.csv' # List of article file paths

# Preprocessed training data
cult_labeled_fp = data_fp + 'training_cultural_preprocessed_022621.pkl'
relt_labeled_fp = data_fp + 'training_relational_preprocessed_022621.pkl'
demog_labeled_fp = data_fp + 'training_demographic_preprocessed_022621.pkl'
orgs_labeled_fp = data_fp + 'training_orgs_preprocessed_022621.pkl'

# Model filepaths
cult_model_fp = model_fp + f'classifier_cult_MLP_{thisday}.joblib'
relt_model_fp = model_fp + f'classifier_relt_MLP_{thisday}.joblib'
demog_model_fp = model_fp + f'classifier_demog_MLP_{thisday}.joblib'
orgs_model_fp = model_fp + f'classifier_orgs_MLP_{thisday}.joblib'

# Vectorizers trained on hand-coded data (use to limit vocab of input texts)
cult_vec_fp = model_fp + 'vectorizer_cult_022621.joblib'
relt_vec_fp = model_fp + 'vectorizer_relt_022621.joblib'
demog_vec_fp = model_fp + 'vectorizer_demog_022621.joblib'
orgs_vec_fp = model_fp + 'vectorizer_orgs_022621.joblib'

In [4]:
print(root)

/home/jovyan/work/tlu_test/preprocess
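
The printed root still contains `tlu_test`, which suggests the `str.replace` pattern (`tlu_test/modeling`) did not match the current working directory. As a hedged alternative, the sketch below walks up the directory tree until it finds the project folder and returns its parent, so the result does not depend on which subfolder the notebook runs from. The function name `find_root` is hypothetical, and the assumption that the root is the parent of `tlu_test` is inferred from the replace pattern above.

```python
from pathlib import Path

def find_root(cwd, folder):
    """Return the parent of the first directory named `folder` found at or
    above `cwd`. Hypothetical helper for illustration only."""
    cwd = Path(cwd)
    for candidate in [cwd, *cwd.parents]:
        if candidate.name == folder:
            return candidate.parent
    raise ValueError(f"'{folder}' not found in {cwd}")

# Works from either tlu_test/preprocess or tlu_test/modeling
root = find_root('/home/jovyan/work/tlu_test/preprocess', 'tlu_test')
data_fp = root / 'classification' / 'data'
print(root)
print(data_fp)
```

Building paths with the `/` operator also avoids missing-separator errors that can occur when concatenating strings.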


In [5]:
def process_frame(in_fp, out_fp=None):
    """Load a pickled DataFrame from in_fp; if out_fp is given, also save it there as CSV."""
    df = quickpickle_load(in_fp)
    if out_fp:
        df.to_csv(out_fp)
    return df

In [6]:
cult_df = process_frame('/home/jovyan/work/tlu_storage/training_cultural_preprocessed_100121.pkl')
cult_df

Unnamed: 0,text,cultural_score,primary_subject,edited_filename,article_name
0,"[[Where, Do, Interorganizational, Networks, Co...",0.0,Sociology,10.1086_210179,Where Do Interorganizational Networks Come From?
1,"[[Civil, Rights, Law, at, Work:, Sex, Discrimi...",1.0,Sociology,10.1086_210317,Civil Rights Law at Work: Sex Discrimination a...
2,"[[Between, Markets, and, Politics:, Organizati...",0.0,Sociology,10.1086_231084,Between Markets and Politics: Organizational R...
3,"[[World, Society, and, the, Nation-State, John...",1.0,Sociology,10.1086_231174,World Society and the Nation‐State
4,"[[<body, xmlns:xlink=""http://www..org//xlink""]...",1.0,Sociology,10.1086_382347,Kinship Networks and Entrepreneurs in China’s ...
...,...,...,...,...,...
914,"[[Institutionalized, Organizations:, Formal, S...",1.0,Sociology,10.2307_2778293,Institutionalized Organizations: Formal Struct...
915,"[[The, Social, Construction, of, Organizationa...",1.0,Management & Organizational Behavior,10.2307_2667051,The Social Construction of Organizational Know...
916,"[[Organizational, Structure, and, the, Institu...",1.0,Management & Organizational Behavior,10.2307_2392303,Organizational Structure and the Institutional...
917,"[[Institutional, Sources, of, Change, in, the,...",1.0,Management & Organizational Behavior,10.2307_2392383,Institutional Sources of Change in the Formal ...


In [9]:
def get_text(df, i):
    """Join the tokenized paragraphs of article i into one space-separated string."""
    return ' '.join(' '.join(para) for para in df.text[i])

In [10]:
get_text(cult_df, 0)

'Where Do Interorganizational Networks Come From? Ranjay Gulati Northwestern University Martin Gargiulo INSEAD Organizations enter alliances with each other to access critical resources, but they rely on information from the network of prior alliances to determine with whom to cooperate. These new alliances modify the existing network, prompting an endogenous dynamic between organizational action and network structure that drives the emergence of interorganizational networks. Testing these ideas on alliances formed in three industries over nine years, this research shows that the probability of new alliance between specific organizations increases with their interdependence and also with their prior mutual alliances, common third parties, and joint centrality in the alliance network. The differentiation of the emerging network structure, however, mitigates the effect of interdependence and enhances the effect of joint centrality on new alliance formation. INTRODUCTION Sociologists have
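
The `text` column stores each article as a list of paragraphs, each of which is a list of tokens. The following toy example, which assumes that structure and uses made-up tokens rather than real data, shows the same flattening on a plain nested list so it can be run without the pickled DataFrame. The name `flatten_paragraphs` is hypothetical.

```python
def flatten_paragraphs(paragraphs):
    """Join tokens within each paragraph, then join paragraphs, using single spaces."""
    return ' '.join(' '.join(para) for para in paragraphs)

# Toy article: two paragraphs of tokens (illustrative, not taken from the dataset)
doc = [
    ['Where', 'Do', 'Networks', 'Come', 'From?'],
    ['Organizations', 'enter', 'alliances'],
]
flat = flatten_paragraphs(doc)
print(flat)
```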

In [14]:
# Export each preprocessed training set to CSV and print its number of rows
for label in ['cultural', 'relational', 'demographic', 'orgs']:
    df = process_frame(f'/home/jovyan/work/tlu_storage/training_{label}_preprocessed_100121.pkl',
                       out_fp=f'data/training_{label}.csv')
    print(len(df))

733
736
740
824


## Load & inspect data