<h1> <center> Data Acquisition: Publications per Department through JSON </center> </h1>

For the 12 departments we are analysing, we need data on their publications. While this data could be web-scraped, it is already available in JSON format, which is semi-structured and therefore much easier to convert to a dataframe and manipulate. As we want to look at each department's publications over time, we load one JSON file per department.

<h2> <i> <center> Importing Data </center> </i> </h2> 

In [69]:
import json
import pandas as pd
from datetime import datetime

departments = ['Anthropology', 'Economic History', 'Finance', 'Geography and Environment', 'Government', 
               'International Relations', 'Management', 'Mathematics', 'Psychological and Behavioural Science',
               'Social Policy', 'Sociology', 'Statistics']
publications_data = {}

# loading files
for department in departments:
    with open(f"Data/{department}_publications.json", "r") as file:
        publications_data[department] = json.load(file)

After loading the data, we need to understand the structure of each record in order to extract the publication details we need.

In [70]:
# an example item
publications_data['Anthropology'][0]

{'date': '2024-03-11',
 'title': 'What relevance has division of labour in a world of precarious work?',
 'metadata_visibility': 'show',
 'issn': '0962-8436',
 'refereed': 'TRUE',
 'note': '© 2024 Royal Society',
 'eprint_status': 'archive',
 'subjects': ['HD', 'B1', 'GN', 'H'],
 'userid': 32689,
 'eprintid': 122334,
 'documents': [{'formatdesc': 'What relevance has division of labour in a world of precarious work',
   'uri': 'http://eprints.lse.ac.uk/id/document/371833',
   'eprintid': 122334,
   'pos': 1,
   'license': 'cc_by',
   'rev_number': 2,
   'files': [{'filename': 'What_relevance_has_division_of_labour_in_a_world_of_precarious_work.pdf',
     'mtime': '2024-03-12 09:54:09',
     'fileid': 2902930,
     'datasetid': 'document',
     'uri': 'http://eprints.lse.ac.uk/id/file/2902930',
     'filesize': 251666,
     'hash': 'e7a23cba36609a8176bd7e51fdb7e828',
     'objectid': 371833,
     'mime_type': 'application/pdf',
     'hash_type': 'MD5'}],
   'docid': 371833,
   'mime_type

In [71]:
# how to extract date
publications_data['Anthropology'][0]['date']

'2024-03-11'

We specifically chose an example with multiple authors to look at the format. While examining the information for different publications, we noticed that some store author data under 'creators' and others under 'editors'; in both cases we treat the people listed as the publication's authors.

In [72]:
# how to extract authors
publications_data['Anthropology'][10]['creators']

[{'instid': None,
  'id': None,
  'name': {'family': 'Olk',
   'honourific': None,
   'lineage': None,
   'given': 'Christopher'}},
 {'name': {'given': 'Colleen',
   'lineage': None,
   'honourific': None,
   'family': 'Schneider'},
  'instid': None,
  'id': None},
 {'id': 'j.e.hickel@lse.ac.uk',
  'instid': 627742,
  'name': {'given': 'Jason',
   'lineage': None,
   'honourific': None,
   'family': 'Hickel'}}]

In [73]:
publications_data['Anthropology'][17]['editors']

[{'name': {'honourific': None,
   'family': 'Copeman',
   'given': 'Jacob',
   'lineage': None},
  'orcid': None,
  'id': None,
  'instid': None},
 {'name': {'given': 'Nicholas J.',
   'lineage': None,
   'honourific': None,
   'family': 'Long'},
  'orcid': '0000-0002-4088-1661',
  'instid': 675764,
  'id': 'n.j.long@lse.ac.uk'},
 {'id': None,
  'instid': None,
  'orcid': None,
  'name': {'family': 'Chau',
   'honourific': None,
   'lineage': None,
   'given': 'Lam Minh'}},
 {'name': {'given': 'Joanna',
   'lineage': None,
   'honourific': None,
   'family': 'Cook'},
  'orcid': None,
  'id': None,
  'instid': None},
 {'name': {'given': 'Magnus',
   'lineage': None,
   'honourific': None,
   'family': 'Marsden'},
  'id': None,
  'instid': None,
  'orcid': None}]
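
To see how author information is split between the two keys, one option is to tally which key each record uses. This is only a minimal sketch: `count_author_keys` is a hypothetical helper that is not part of the notebook, and the three sample records are made up to mirror the structure of the outputs above.

```python
from collections import Counter


def count_author_keys(records):
    """Tally which author key each record uses (hypothetical helper)."""
    counts = Counter()
    for record in records:
        if 'creators' in record:
            counts['creators'] += 1
        elif 'editors' in record:
            counts['editors'] += 1
        else:
            # records with neither key have no usable author information
            counts['neither'] += 1
    return counts


# made-up sample records, for illustration only
sample_records = [
    {'title': 'A', 'creators': [{'name': {'given': 'Jason', 'family': 'Hickel'}}]},
    {'title': 'B', 'editors': [{'name': {'given': 'Joanna', 'family': 'Cook'}}]},
    {'title': 'C'},
]

key_counts = count_author_keys(sample_records)
print(dict(key_counts))
```

Running the same helper on `publications_data[department]` would show whether any records lack both keys, which matters when the author fields are extracted in a loop.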

From this we can see that some authors have a value for id, which is their LSE email, and for instid, which is their institutional ID. It will be interesting to examine not just the number of collaborators but also how many of them are LSE staff, as this affects productivity.
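
As a rough illustration of this idea, the sketch below counts the authors of a single publication that carry an instid and expresses them as a share of all authors. It assumes, as the text above does, that a non-empty instid marks an LSE staff member. `count_staff` and `staff_share` are hypothetical names, and the sample list is a trimmed copy of the 'creators' output shown earlier.

```python
def count_staff(authors):
    """Count authors with a non-empty 'instid' (hypothetical helper)."""
    return sum(1 for author in authors if author.get('instid'))


# trimmed copy of the 'creators' example above
sample_authors = [
    {'id': None, 'instid': None,
     'name': {'given': 'Christopher', 'family': 'Olk'}},
    {'id': None, 'instid': None,
     'name': {'given': 'Colleen', 'family': 'Schneider'}},
    {'id': 'j.e.hickel@lse.ac.uk', 'instid': 627742,
     'name': {'given': 'Jason', 'family': 'Hickel'}},
]

num_authors = len(sample_authors)
num_staff = count_staff(sample_authors)
# guard against publications with an empty author list
staff_share = num_staff / num_authors if num_authors else 0.0
print(num_authors, num_staff, round(staff_share, 2))
```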

By examining some of the dates, particularly the earlier ones, we see that some are incomplete. As dropping them would lose a lot of the earlier data, we instead assume that they occur on the first day of the first month (if only the year is available) or on the first day of the given month (if only the year and month are available). This skews the earlier data so that there is a spike on every 1 January; however, it still gives an accurate zoomed-out picture and should not affect the more recent data much, nor the yearly analysis, as the year remains unchanged.
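
The padding rule described above can be sketched as a small function. This is an illustrative sketch rather than the notebook's own code: `pad_partial_date` is a hypothetical helper, and returning the original precision alongside the date is an added assumption, included so that padded dates could later be told apart from publications genuinely dated 1 January.

```python
from datetime import datetime


def pad_partial_date(date_str):
    """Pad 'YYYY' or 'YYYY-MM' to a full date and report the original precision."""
    date_str = str(date_str)
    if len(date_str) == 4:
        # only the year is known: assume 1 January
        return datetime.strptime(date_str + '-01-01', "%Y-%m-%d"), 'year'
    if len(date_str) == 7:
        # year and month are known: assume the first day of that month
        return datetime.strptime(date_str + '-01', "%Y-%m-%d"), 'month'
    # otherwise expect a complete YYYY-MM-DD date
    return datetime.strptime(date_str, "%Y-%m-%d"), 'day'


for raw in ['1973', '1959-12', '2024-03-11']:
    parsed, precision = pad_partial_date(raw)
    print(raw, '->', parsed.date(), f'(precision: {precision})')
```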

In [74]:
# creating lists
title_list = []
department_list = []
date_list = []
author_list = []
author_count_list = []
staff_count_list = []

# iterating over each publication
for department, publications in publications_data.items():
    for publication in publications:

        # getting title
        title = publication['title'].strip('"')

        # getting date data
        date = str(publication['date'])
        # some dates are incomplete (YYYY or YYYY-MM), so for those we assume the first day
        if len(date) == 4:
            date += '-01-01'
        elif len(date) == 7:
            date += '-01'
        date_dt = datetime.strptime(date, "%Y-%m-%d")

        # getting author information, which is stored under different keys
        if 'creators' in publication:
            authors = publication['creators']
        elif 'editors' in publication:
            authors = publication['editors']
        else:
            # neither key present: avoid reusing the previous publication's authors
            authors = []
        # formatting the authors' names
        author_names = ', '.join([f"{author['name']['given']} {author['name']['family']}" for author in authors]).strip('"')
        num_authors = len(authors)
        # authors with an instid are counted as LSE staff
        num_authors_with_instid = sum(1 for author in authors if author.get('instid'))
        
     
        # adding information to lists
        title_list.append(title)
        department_list.append(department)
        date_list.append(date_dt)
        author_list.append(author_names)
        author_count_list.append(num_authors)
        staff_count_list.append(num_authors_with_instid)

        

# converting to dataframe
publication_data_df = pd.DataFrame({"Title": title_list, "Department": department_list, "Date": date_list, 
                                    "Authors": author_list, "Number of Authors": author_count_list, 
                                    "Number of Authors as Staff": staff_count_list})
publication_data_df

Unnamed: 0,Title,Department,Date,Authors,Number of Authors,Number of Authors as Staff
0,What relevance has division of labour in a wor...,Anthropology,2024-03-11,Deborah James,1,1
1,Suspicion and evidence: on the complexities of...,Anthropology,2024-03-08,Mathijs Pelkmans,1,1
2,Against book enclosures: moving towards more d...,Anthropology,2024-02-20,"Simon P.J. Batterbury, Andrea E. Pia, Gerda Wi...",4,1
3,Intimate extractions: demand dowry and neo lib...,Anthropology,2024-02-14,Katy Gardner,1,1
4,Beyond social policy? ‘patchwork’ livelihoods,Anthropology,2024-02-12,Deborah James,1,1
...,...,...,...,...,...,...
32343,Identification of the structure of multivariat...,Statistics,1973-01-01,"Mark Priestley, T Subba Rao, Howell Tong",3,1
32344,On the analysis of bivariate non-stationary pr...,Statistics,1973-01-01,"Mark Priestley, Howell Tong",2,1
32345,On time-dependent linear stochastic control sy...,Statistics,1973-01-01,"T Subba Rao, Howell Tong",2,1
32346,Some comments on spectral representations of n...,Statistics,1973-01-01,Howell Tong,1,1


In [75]:
# sorting it in chronological order
publication_data_df_sorted = publication_data_df.sort_values(by="Date").reset_index(drop=True)

# converting dates to DD-MM-YYYY strings (the column is no longer a datetime after this)
publication_data_df_sorted['Date'] = publication_data_df_sorted['Date'].dt.strftime("%d-%m-%Y")

publication_data_df_sorted

Unnamed: 0,Title,Department,Date,Authors,Number of Authors,Number of Authors as Staff
0,British incomes and property in the early nine...,Economic History,01-12-1959,Patrick O'Brien,1,1
1,National assistance: service or charity?,Social Policy,01-01-1962,Howard Glennerster,1,1
2,Twelve wasted years,Social Policy,01-01-1963,Howard Glennerster,1,1
3,Public schools,Social Policy,01-01-1964,Howard Glennerster,1,1
4,Man as tranducer for probabilities in Bayesian...,Management,01-01-1964,"W. Edwards, Lawrence D. Phillips",2,1
...,...,...,...,...,...,...
32343,Should I do this? Incongruence in the face of ...,Management,01-06-2024,"Jigyashu Shukla, Christopher Stein, John T. Bu...",4,1
32344,Landscape and management influences on smallho...,Social Policy,01-06-2024,"Alexandra C. Morel, Sheleme Demissie, Techane ...",13,2
32345,Pre-electoral coalitions and the distribution ...,Government,11-09-2024,"Rafael Hortala-Vallve, Jaakko Meriläinen, Jann...",3,2
32346,The power of protest in the media: examining p...,Psychological and Behavioural Science,01-12-2024,"Eric G. Scheuch, Mark Ortiz, Ganga Shreedhar, ...",4,1


In [76]:
# saving as csv file 
publication_data_df_sorted.to_csv('data/departmental_publications_data.csv', index=False)
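
Because the dates are written out as day-first strings, they need an explicit format when the CSV is read back; otherwise a value such as 11-09-2024 could be read as 9 November. The sketch below shows one way to do this. It uses an in-memory CSV with made-up rows instead of the saved file so that it is self-contained, and the variable names are illustrative assumptions.

```python
import io

import pandas as pd

# made-up rows in the same DD-MM-YYYY layout as the saved file
csv_text = (
    "Title,Department,Date\n"
    "Example B,Government,11-09-2024\n"
    "Example A,Social Policy,01-01-1963\n"
)

reloaded_df = pd.read_csv(io.StringIO(csv_text))
# parse with an explicit day-first format so 11-09-2024 becomes 11 September
reloaded_df['Date'] = pd.to_datetime(reloaded_df['Date'], format="%d-%m-%Y")
reloaded_df = reloaded_df.sort_values(by='Date').reset_index(drop=True)

print(reloaded_df)
```

With the real file, `io.StringIO(csv_text)` would be replaced by the path passed to `to_csv` above.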