# 3. Disciplines of Journals using OJS <a name=languages></a>

### Notebook objectives:
1. Translate concatenated titles and abstracts from Bahasa Indonesia, Spanish, and Portuguese to English using [UKP's EasyNMT neural machine translator.](#nmt) This is necessary because the field of study classifier was trained on English text.<br><br>
2. Classify the journals known to be actively using OJS by applying [Weber et al.'s (2020) neural field of study classifier](#fosc) to titles and abstracts.<br><br>
3. Group the classified journals into 3 divisions: [STEM, Social Sciences, or Humanities.](#visuals)
  1. [English, Bahasa Indonesia, Spanish, Portuguese](#all)
  2. [English](#en)
  3. [Bahasa Indonesia](#id)
  4. [Spanish](#es)
  5. [Portuguese](#pt)

Import packages:

In [None]:
from collections import defaultdict
from collections import Counter
from tqdm import tqdm
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import ijson
import json
import time
import re
import os

### 1. Translating journal titles and abstracts with <a href='https://github.com/UKPLab/EasyNMT'>UKP's EasyNMT neural machine translator</a><a id='nmt'></a>

Import EasyNMT:

In [None]:
import numpy
import tensorflow
from easynmt import EasyNMT

First, create a function that:
<br>
1. Reads each of the Indonesian, Spanish, and Portuguese .json files mapping journal ISSNs to concatenated article titles and abstracts, or payloads;
<br><br>
2. Translates the payloads for each journal from the specified source language ('id', 'es', 'pt') to the target language, English ('en');
<br><br>
3. Saves a dictionary mapping journal ISSN to translated payload as a .json file:

In [None]:
def translate(infile, outfile, model, source_lang):
    """Translate each journal's payload to English and save the result as JSON."""
    issn2translation = {}
    with open(infile, 'r') as _infile:
        d = json.load(_infile)

    # Clip every payload to its first 100 whitespace-separated tokens to lessen memory needs
    for k, v in d.items():
        d[k] = ' '.join(v.split()[:100])

    print('Translating payloads...')
    for k, v in tqdm(list(d.items())):
        issn2translation[k] = model.translate(v, source_lang=source_lang, target_lang='en')

    with open(outfile, 'w') as _outfile:
        json.dump(issn2translation, _outfile)
    print('Translated payloads saved.')
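
The clipping step can be checked in isolation before running a long translation job. The sketch below is illustrative only: `clip_payload` and the sample text are hypothetical and are not part of the notebook's pipeline.

```python
def clip_payload(text, max_tokens=100):
    """Keep only the first `max_tokens` whitespace-separated tokens.

    Hypothetical helper mirroring the clipping done inside translate().
    """
    # str.split() with no argument splits on any run of whitespace,
    # so newlines and repeated spaces are normalized to single spaces.
    return ' '.join(text.split()[:max_tokens])


# Made-up payload: 5 leading tokens followed by 200 filler tokens
sample = 'Judul  artikel\npertama tentang pendidikan ' + 'kata ' * 200
clipped = clip_payload(sample)
print(len(sample.split()), '->', len(clipped.split()))
```

Because the split happens on whitespace, the 100-token limit counts words rather than the subword tokens the translation model uses internally, so the model's actual input may be longer than 100 tokens.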

Bahasa Indonesia:

In [None]:
infile_id = os.path.join('data', 'issn2id.json')
outfile_id = os.path.join('data', 'issn2id_trans.json')
opus = EasyNMT('opus-mt') #Helsinki NLP opus for id -> en
%time translate(infile=infile_id, outfile=outfile_id, model=opus, source_lang='id')

Spanish:

In [None]:
infile_es = os.path.join('data', 'issn2es.json')
outfile_es = os.path.join('data', 'issn2es_trans.json')
opus = EasyNMT('opus-mt') #Helsinki NLP opus for es -> en
%time translate(infile=infile_es, outfile=outfile_es, model=opus, source_lang='es')

In [None]:
del opus

Portuguese:

In [None]:
infile_pt = os.path.join('data', 'issn2pt.json')
outfile_pt = os.path.join('data', 'issn2pt_trans.json')
mbart = EasyNMT('mbart50_m2en') #Facebook (Meta) mbart50_m2en for pt -> en
%time translate(infile=infile_pt, outfile=outfile_pt, model=mbart, source_lang='pt')

### 2. OJS Field of Study Classification<a id='fosc'></a>
<br>
Instantiate <a href='https://direct.mit.edu/qss/article/1/2/525/96148/Using-supervised-learning-to-classify-metadata-of'>Weber et al.'s (2020)</a> feedforward neural net for classifying academic fields of study:

In [None]:
from fosc import load_model, vectorize
from fosc.config import config
model_id = 'mlp_l'
model = load_model(model_id)

Create a dict mapping the fosc classifier's integer labels (`int`) to the corresponding field of study names (`str`):

In [None]:
anzsrc = {
    0:'Mathematical Sciences',
    1:'Physical Sciences',
    2:'Chemical Sciences',
    3:'Earth and Environmental Sciences',
    4:'Biological Sciences',
    5:'Agricultural and Veterinary Sciences',
    6:'Information and Computing Sciences',
    7:'Engineering and Technology',
    8:'Medical and Health Sciences',
    9:'Built Environment and Design',
    10:'Education',
    11:'Economics',
    12:'Commerce, Management, Tourism and Services',
    13:'Studies in Human Society',
    14:'Psychology and Cognitive Sciences',
    15:'Law and Legal Studies',
    16:'Studies in Creative Arts and Writing',
    17:'Language, Communication and Culture',
    18:'History and Archaeology',
    19:'Philosophy and Religious Studies'
}

Group the <a href='https://www.abs.gov.au/statistics/classifications/australian-and-new-zealand-standard-research-classification-anzsrc/latest-release'>20 ANZSRC labels</a> into three broad divisions: STEM, Social Sciences, and Humanities:

In [None]:
STEM = ['Agricultural and Veterinary Sciences', 
        'Biological Sciences', 
        'Built Environment and Design', 
        'Chemical Sciences',
        'Earth and Environmental Sciences',
        'Engineering and Technology',
        'Information and Computing Sciences',
        'Mathematical Sciences',
        'Medical and Health Sciences',
        'Physical Sciences']

SOCSCI = ['Commerce, Management, Tourism and Services',
          'Economics',
          'Education',
          'Law and Legal Studies',
          'Psychology and Cognitive Sciences',
          'Studies in Human Society']

HUM = ['History and Archaeology',
       'Language, Communication and Culture',
       'Philosophy and Religious Studies',
       'Studies in Creative Arts and Writing']
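
For looking up the division of a single discipline, the three lists can be inverted into one dictionary. This is a minimal sketch under stated assumptions: `build_division_lookup` is a hypothetical helper, and the lists are abbreviated to two entries each so the example runs on its own.

```python
# Abbreviated stand-ins for the notebook's STEM, SOCSCI, and HUM lists
STEM_DEMO = ['Biological Sciences', 'Physical Sciences']
SOCSCI_DEMO = ['Economics', 'Education']
HUM_DEMO = ['History and Archaeology', 'Philosophy and Religious Studies']


def build_division_lookup(stem, socsci, hum):
    """Return a dict mapping each discipline name to its division name."""
    lookup = {}
    for division, members in (('STEM', stem),
                              ('Social Sciences', socsci),
                              ('Humanities', hum)):
        for discipline in members:
            lookup[discipline] = division
    return lookup


discipline2division = build_division_lookup(STEM_DEMO, SOCSCI_DEMO, HUM_DEMO)
print(discipline2division['Education'])
```

With the full lists, every one of the 20 labels should appear exactly once in the lookup, which gives a quick check that no discipline was left out of a division.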

Create a short helper function for converting integer labels to text labels:

In [None]:
def assign_discipline(row):
    return anzsrc[row['discipline']]

Create a function that:
<br>
1. Takes a dictionary mapping each journal's ISSN to a payload of concatenated article titles and abstracts;
<br><br>
2. Passes each payload to Weber et al.'s field of study classifier (fosc);
<br><br>
3. Selects the most likely field of study label for each journal and returns the number of journals per discipline:

In [None]:
def fosc(issn2payload, model_id, model):
    # Build a DataFrame with one row per journal
    to_df = {'issn': list(issn2payload.keys()), 'payload': list(issn2payload.values())}
    df = pd.DataFrame.from_dict(to_df)
    del to_df
    print('{} examples loaded.'.format(len(df)))

    # Vectorize the payloads and predict a score for each of the 20 labels
    vectorized = vectorize(df['payload'], model_id)
    preds = pd.DataFrame(model.predict(vectorized))
    df = df.join(preds)

    # Select the highest-scoring label as the primary field of study
    df['discipline'] = df[[i for i in range(0, 20)]].idxmax(axis=1)
    df['discipline'] = df.apply(assign_discipline, axis=1)
    print('Journals classified.')

    # Return a final DF of discipline counts
    countDF = df['discipline'].value_counts().rename_axis('Discipline').reset_index(name='Count')
    del df
    return countDF
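
The label selection step amounts to an argmax over each journal's prediction scores. The sketch below shows the idea on one made-up score vector in plain Python; `primary_label`, the three-label dictionary, and the scores are all hypothetical and stand in for the 20-column prediction table.

```python
# Abbreviated stand-in for the notebook's 20-entry anzsrc dictionary
labels_demo = {0: 'Mathematical Sciences',
               1: 'Physical Sciences',
               2: 'Chemical Sciences'}


def primary_label(scores, labels):
    """Return the name of the label with the highest score."""
    # Index of the largest score; ties resolve to the first maximum
    best = max(range(len(scores)), key=scores.__getitem__)
    return labels[best]


# Made-up scores for one journal, one value per label
print(primary_label([0.1, 0.7, 0.2], labels_demo))
```

Keeping only the top label discards the rest of the score distribution, so a multidisciplinary journal is counted under a single field.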

### Visualize <a id='visuals'></a>

#### English, Bahasa Indonesia, Spanish, and Portuguese: <a id='all'></a>
<br>
Create a helper function to load a dataset of English + translated payloads for disciplinary classification:

In [None]:
def create_dataset(lang_list):
    #English
    with open(os.path.join('data', 'issn2en.json'), 'r') as en:
        issn2payload = json.load(en)
        
    for code in lang_list:
        filename = 'issn2' + code + '_trans.json'
        with open(os.path.join('data', filename), 'r') as infile:
            d = json.load(infile)
            for k, v in d.items():
                if k not in issn2payload:
                    issn2payload[k] = v
            del d
            
    print('{} payloads ready for classification.'.format(len(issn2payload)))
    return issn2payload
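
The merge rule used here, where an ISSN already present keeps its existing payload, can be sketched with `dict.setdefault`. The helper name `merge_payloads` and the ISSNs below are made up for illustration.

```python
def merge_payloads(base, *others):
    """Merge payload dicts, keeping the first payload seen for each ISSN."""
    merged = dict(base)  # copy so the input dict is not modified
    for other in others:
        for issn, payload in other.items():
            # setdefault only inserts when the ISSN is not already present
            merged.setdefault(issn, payload)
    return merged


english = {'0000-0001': 'original English payload'}
translated = {'0000-0001': 'translated payload',
              '0000-0002': 'another translated payload'}
combined = merge_payloads(english, translated)
print(len(combined), combined['0000-0001'])
```

Under this rule the order of the language list matters: a journal that appears in more than one file is represented by whichever payload was loaded first.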

In [None]:
issn2payload = create_dataset(['id', 'es', 'pt'])

In [None]:
%%time 
OJS = fosc(
    issn2payload=issn2payload, 
    model_id=model_id, 
    model=model)

In [None]:
sns.set_theme(style="whitegrid")

fig, ax = plt.subplots(figsize=(4, 10))

sns.barplot(x="Count", y="Discipline", data=OJS,
            label="Total", color="grey")

sns.despine(bottom=True)

ax.set(xlim=(0, 3000),
       xlabel = 'Active journals using OJS',
       ylabel = 'Discipline',
       title = r'Disciplines of English-language journals ($\it{n}$ = 17,761)')

plt.xticks([0, 500, 1000, 1500, 2000, 2500],
           ['0', '500', '1,000', '1,500', '2,000', '2,500'])

for p in ax.patches:
    _x = p.get_x() + p.get_width()
    _y = p.get_y() + p.get_height() - 0.25
    percent = round(((p.get_width() / 17761) * 100), 1)
    if len(str(int(p.get_width()))) == 4:
        value = str(int(p.get_width()))[0] + ',' + str(int(p.get_width()))[1:] + ' ({})'.format(str(percent)+'%')
    else:
        value = str(int(p.get_width())) + ' ({})'.format(str(percent)+'%')
    ax.text(_x + 150, _y, value, ha='left', weight='bold')

plt.savefig('disc_en.png', bbox_inches='tight')
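
The bar annotations combine a count with a thousands separator and its share of the total. A minimal sketch of the same formatting using Python's `','` format specifier follows; `bar_label` is a hypothetical helper and the numbers are made up.

```python
def bar_label(count, total):
    """Format a bar annotation such as '1,234 (12.3%)'."""
    percent = round(count / total * 100, 1)
    # '{:,}' inserts thousands separators for integers of any length
    return '{:,} ({}%)'.format(int(count), percent)


print(bar_label(1234, 10000))
print(bar_label(50, 200))
```

Unlike slicing the digit string, the format specifier also handles counts of five or more digits.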

#### English: <a id='en'></a>

In [None]:
%%time 
issn2payload_en = create_dataset([])  # English payloads only; no translated files are added
EN = fosc(
    issn2payload=issn2payload_en, 
    model_id=model_id, 
    model=model)

In [None]:
sns.set_theme(style="whitegrid")

# Initialize the matplotlib figure
fig, ax = plt.subplots(figsize=(4, 10))

# Plot the discipline counts
sns.set_color_codes("pastel")
sns.barplot(x="Count", y="Discipline", data=EN,
            label="Total", color="grey")

sns.despine(bottom=True)

ax.set(xlim=(0, 3000),
       xlabel = 'Active journals using OJS',
       ylabel = 'Discipline',
       title = r'Disciplines of English-language journals ($\it{n}$ = 17,761)')

plt.xticks([0, 500, 1000, 1500, 2000, 2500],
           ['0', '500', '1,000', '1,500', '2,000', '2,500'])

for p in ax.patches:
    _x = p.get_x() + p.get_width()
    _y = p.get_y() + p.get_height() - 0.25
    percent = round(((p.get_width() / 17761) * 100), 1)
    if len(str(int(p.get_width()))) == 4:
        value = str(int(p.get_width()))[0] + ',' + str(int(p.get_width()))[1:] + ' ({})'.format(str(percent)+'%')
    else:
        value = str(int(p.get_width())) + ' ({})'.format(str(percent)+'%')
    ax.text(_x + 150, _y, value, ha='left', weight='bold')

plt.savefig('disc_en.png', bbox_inches='tight')

In [None]:
pivot_en = EN.pivot_table(columns='Discipline')

triDF_en = {'Division': ['Social Sciences', 'STEM', 'Humanities'],
            'Count': [pivot_en[SOCSCI].values.sum(axis=1)[0],
                      pivot_en[STEM].values.sum(axis=1)[0],
                      pivot_en[HUM].values.sum(axis=1)[0]]}
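
The division totals can also be computed directly from a discipline-to-count mapping, which avoids a `KeyError` when a discipline has no journals in a given subset. This is a sketch, not the notebook's method: `division_totals`, the abbreviated division lists, and the counts are made up for illustration.

```python
def division_totals(counts, divisions):
    """Sum per-discipline counts into per-division totals.

    Disciplines missing from `counts` contribute 0.
    """
    return {name: sum(counts.get(discipline, 0) for discipline in members)
            for name, members in divisions.items()}


# Abbreviated stand-ins for the notebook's SOCSCI, STEM, and HUM lists
divisions_demo = {
    'Social Sciences': ['Economics', 'Education'],
    'STEM': ['Biological Sciences', 'Physical Sciences'],
    'Humanities': ['History and Archaeology'],
}
# Made-up counts; 'Physical Sciences' is deliberately absent
counts_demo = {'Economics': 10, 'Education': 25,
               'Biological Sciences': 7, 'History and Archaeology': 3}

totals = division_totals(counts_demo, divisions_demo)
print(totals)
```

With the notebook's data, `counts` could be built from the `Discipline` and `Count` columns of the frame returned by `fosc`.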

In [None]:
sns.set_theme(style="whitegrid")

# Initialize the matplotlib figure
fig, ax = plt.subplots(figsize=(10,4))

sns.barplot(x="Count", y="Division", data=triDF_en, color="grey")

sns.despine(bottom=True)

ax.set(xlim=(0, 10000),
       xlabel = 'Active journals using OJS',
       ylabel = 'Division',
       title = r'English-language journals ($\it{n}$ = 17,761)')

plt.xticks([0, 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000],
           ['0', '1,000', '2,000', '3,000', '4,000', '5,000', '6,000', '7,000', '8,000', '9,000', ''])

for p in ax.patches:
    _x = p.get_x() + p.get_width()
    _y = p.get_y() + p.get_height() - 0.33
    percent = round(((p.get_width() / 17761) * 100), 1)
    if len(str(int(p.get_width()))) == 4:
        value = str(int(p.get_width()))[0] + ',' + str(int(p.get_width()))[1:] + ' ({})'.format(str(percent)+'%')
    else:
        value = str(int(p.get_width())) + ' ({})'.format(str(percent)+'%')
    ax.text(_x + 125, _y, value, ha='left', weight='bold')

plt.savefig('div_en.png', bbox_inches='tight')

#### Bahasa Indonesia: <a id='id'></a>

In [None]:
sns.set_theme(style="whitegrid")

# Initialize the matplotlib figure
fig, ax = plt.subplots(figsize=(10,4))

sns.barplot(x="Count", y="Division", data=triDF_en, color="grey")

sns.despine(bottom=True)

ax.set(xlim=(0, 10000),
       xlabel = 'Active journals using OJS',
       ylabel = 'Division',
       title = r'English-language journals ($\it{n}$ = 17,761)')

plt.xticks([0, 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000],
           ['0', '1,000', '2,000', '3,000', '4,000', '5,000', '6,000', '7,000', '8,000', '9,000', ''])

for p in ax.patches:
    _x = p.get_x() + p.get_width()
    _y = p.get_y() + p.get_height() - 0.33
    percent = round(((p.get_width() / 17761) * 100), 1)
    if len(str(int(p.get_width()))) == 4:
        value = str(int(p.get_width()))[0] + ',' + str(int(p.get_width()))[1:] + ' ({})'.format(str(percent)+'%')
    else:
        value = str(int(p.get_width())) + ' ({})'.format(str(percent)+'%')
    ax.text(_x + 125, _y, value, ha='left', weight='bold')

plt.savefig('div_en.png', bbox_inches='tight')

In [None]:
sns.set_theme(style="whitegrid")

# Initialize the matplotlib figure
fig, ax = plt.subplots()

# Plot
sns.set_color_codes("pastel")
sns.barplot(x="Count", y="Discipline", data=OJS, color="grey")

sns.despine(bottom=True)

ax.set(xlim=(0, 3000),
       xlabel = 'Active journals using OJS',
       ylabel = 'Discipline',
       title = r'Disciplines of active journals ($\it{n}$ = 20,181)')

plt.xticks([0, 500, 1000, 1500, 2000, 2500],
           ['0', '500', '1,000', '1,500', '2,000', '2,500'])

for p in ax.patches:
    _x = p.get_x() + p.get_width()
    _y = p.get_y() + p.get_height() - 0.25
    percent = round(((p.get_width() / 20181) * 100), 1)
    if len(str(int(p.get_width()))) == 4:
        value = str(int(p.get_width()))[0] + ',' + str(int(p.get_width()))[1:] + ' ({})'.format(str(percent)+'%')
    else:
        value = str(int(p.get_width())) + ' ({})'.format(str(percent)+'%')
    ax.text(_x + 40, _y, value, ha='left', weight='bold')

plt.savefig('disc_all.png', bbox_inches='tight')

pivot_all = OJS.pivot_table(columns='Discipline')

triDF_all = {'Division': ['Social Sciences', 'STEM', 'Humanities'],
            'Count': [pivot_all[SOCSCI].values.sum(axis=1)[0],
                      pivot_all[STEM].values.sum(axis=1)[0],
                      pivot_all[HUM].values.sum(axis=1)[0]]}
triDF_all

sns.set_theme(style="whitegrid")

# Initialize the matplotlib figure
fig, ax = plt.subplots(figsize=(10,4))

sns.barplot(x="Count", y="Division", data=triDF_all, color="grey")

sns.despine(bottom=True)

ax.set(xlim=(0, 10000),
       xlabel = 'Active journals using OJS',
       ylabel = 'Division',
       title = r'English, Indonesian, Spanish, and Portuguese-language journals ($\it{n}$ = 20,181)')

plt.xticks([0, 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000],
           ['0', '1,000', '2,000', '3,000', '4,000', '5,000', '6,000', '7,000', '8,000', '9,000', ''])

for p in ax.patches:
    _x = p.get_x() + p.get_width()
    _y = p.get_y() + p.get_height() - 0.33
    percent = round(((p.get_width() / 20181) * 100), 1)
    if len(str(int(p.get_width()))) == 4:
        value = str(int(p.get_width()))[0] + ',' + str(int(p.get_width()))[1:] + ' ({})'.format(str(percent)+'%')
    else:
        value = str(int(p.get_width())) + ' ({})'.format(str(percent)+'%')
    ax.text(_x + 150, _y, value, ha='left', weight='bold')

plt.savefig('div_all.png', bbox_inches='tight')