<h1> <center> Data Acquisition: Publications per Department through JSON </center> </h1>

For the 12 departments we are analysing, we need data on their publications. While this data could be web-scraped, it is already available in JSON format, which is semi-structured and therefore much easier to convert to a dataframe and manipulate. As we want to look at publications per department over time, we load one JSON file per department rather than a single combined file.

<h2> <i> <center> Importing Data </center> </i> </h2> 

In [94]:
import json
import pandas as pd
from datetime import datetime

departments = ['Anthropology', 'Economic History', 'Finance', 'Geography and Environment', 'Government',
               'International Relations', 'Management', 'Mathematics', 'Psychological and Behavioural Science',
               'Social Policy', 'Sociology', 'Statistics']
publications_data = {}

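# loading one JSON file per department (each file holds a list of publication records)
for department in departments:
    # explicit encoding so that characters such as '©' load the same way on every platform
    with open(f"Data/{department}_publications.json", "r", encoding="utf-8") as file:
        publications_data[department] = json.load(file)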

After loading the data, we need to work out how each file is structured in order to extract the publication dates.

In [95]:
# an example item
publications_data['Anthropology'][0]

{'date': '2024-03-11',
 'title': 'What relevance has division of labour in a world of precarious work?',
 'metadata_visibility': 'show',
 'issn': '0962-8436',
 'refereed': 'TRUE',
 'note': '© 2024 Royal Society',
 'eprint_status': 'archive',
 'subjects': ['HD', 'B1', 'GN', 'H'],
 'userid': 32689,
 'eprintid': 122334,
 'documents': [{'formatdesc': 'What relevance has division of labour in a world of precarious work',
   'uri': 'http://eprints.lse.ac.uk/id/document/371833',
   'eprintid': 122334,
   'pos': 1,
   'license': 'cc_by',
   'rev_number': 2,
   'files': [{'filename': 'What_relevance_has_division_of_labour_in_a_world_of_precarious_work.pdf',
     'mtime': '2024-03-12 09:54:09',
     'fileid': 2902930,
     'datasetid': 'document',
     'uri': 'http://eprints.lse.ac.uk/id/file/2902930',
     'filesize': 251666,
     'hash': 'e7a23cba36609a8176bd7e51fdb7e828',
     'objectid': 371833,
     'mime_type': 'application/pdf',
     'hash_type': 'MD5'}],
   'docid': 371833,
   'mime_type

In [96]:
# how to extract date
publications_data['Anthropology'][0]['date']

'2024-03-11'

By examining some of the dates, particularly the earlier ones, we see that some are incomplete: they give only a year, or only a year and a month. As dropping them would lose much of the earlier data, we instead assume they occur on the first day of the first month (if only the year is available) or on the first day of the available month. This skews the earlier data, producing an artificial spike on every 1 January, but it preserves an accurate zoomed-out picture and should have little effect on the more recent data.
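
One way to see how many dates are incomplete is to tally the length of each date string. The sketch below is a minimal illustration on made-up records: the `sample_publications` list and the `precision_by_length` mapping are hypothetical and not part of the notebook, and it assumes dates only appear as `YYYY`, `YYYY-MM` or `YYYY-MM-DD`.

```python
from collections import Counter

# Hypothetical records mimicking the three date shapes described above
sample_publications = [
    {"date": "2024-03-11"},
    {"date": "1998-06"},
    {"date": 1973},    # assumption: a bare year might be stored as a number, hence str() below
    {"date": "1973"},
]

# Map the length of the date string to a readable precision label
precision_by_length = {4: "year only", 7: "year and month", 10: "full date"}

precision_counts = Counter(
    precision_by_length.get(len(str(publication["date"])), "unexpected")
    for publication in sample_publications
)
print(precision_counts)
```

Running the same tally over `publications_data` would show how much of the data depends on the first-day assumption, and an "unexpected" count above zero would flag date shapes the padding step does not handle.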

In [97]:
department_list = []
date_list = []

# iterating over each publication
for department, publications in publications_data.items():
    for publication in publications:
        date = str(publication['date'])

        # some dates are incomplete, so we assume the first day (and the first month if needed)
        if len(date) == 4:      # year only, e.g. '1973'
            date += '-01-01'
        elif len(date) == 7:    # year and month, e.g. '1959-12'
            date += '-01'
        date_dt = datetime.strptime(date, "%Y-%m-%d")
        
        # adding information to lists
        department_list.append(department)
        date_list.append(date_dt)

# converting to dataframe
publication_date_df = pd.DataFrame({"Department": department_list, "Date": date_list})
publication_date_df

Unnamed: 0,Department,Date
0,Anthropology,2024-03-11
1,Anthropology,2024-03-08
2,Anthropology,2024-02-20
3,Anthropology,2024-02-14
4,Anthropology,2024-02-12
...,...,...
32343,Statistics,1973-01-01
32344,Statistics,1973-01-01
32345,Statistics,1973-01-01
32346,Statistics,1973-01-01
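
Since the padded dates create artificial spikes, it may help to record which dates were imputed so they can be filtered or highlighted later. The following is a minimal sketch under the same assumption about date shapes; `normalise_date`, the `Imputed` column and the three example rows are hypothetical and not part of the notebook.

```python
from datetime import datetime

import pandas as pd


def normalise_date(raw):
    """Return (datetime, imputed); imputed is True if a day or month was assumed."""
    text = str(raw)
    if len(text) == 4:       # year only
        return datetime.strptime(text + "-01-01", "%Y-%m-%d"), True
    if len(text) == 7:       # year and month
        return datetime.strptime(text + "-01", "%Y-%m-%d"), True
    return datetime.strptime(text, "%Y-%m-%d"), False


# Hypothetical (department, raw date) pairs covering the three cases
rows = [("Statistics", "1973"), ("Anthropology", "2024-03-11"), ("Finance", "1998-06")]

records = []
for department, raw in rows:
    date_dt, imputed = normalise_date(raw)
    records.append({"Department": department, "Date": date_dt, "Imputed": imputed})

flagged_df = pd.DataFrame(records)
print(flagged_df)
```

With such a flag, a plot could exclude or shade the imputed rows instead of showing them as genuine 1 January publications.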


In [98]:
# sorting it in chronological order
publication_date_df_sorted = publication_date_df.sort_values(by="Date").reset_index(drop=True)

# converting dates to day-month-year strings; the column no longer holds datetimes after this
publication_date_df_sorted['Date'] = publication_date_df_sorted['Date'].dt.strftime("%d-%m-%Y")

publication_date_df_sorted

Unnamed: 0,Department,Date
0,Economic History,01-12-1959
1,Social Policy,01-01-1962
2,Social Policy,01-01-1963
3,Social Policy,01-01-1964
4,Management,01-01-1964
...,...,...
32343,Management,01-06-2024
32344,Social Policy,01-06-2024
32345,Government,11-09-2024
32346,Psychological and Behavioural Science,01-12-2024


In [99]:
# saving as csv file 
publication_date_df_sorted.to_csv('departmental_publications_date.csv', index=False)
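
Because the saved `Date` column holds day-month-year strings, later analysis would need to parse it back with an explicit format before sorting or grouping by time. The sketch below uses a small in-memory CSV as a stand-in for `departmental_publications_date.csv`; the names `csv_text`, `loaded_df` and `yearly_counts` are illustrative only.

```python
import io

import pandas as pd

# Stand-in for the saved CSV, using a few rows from the sorted output above
csv_text = """Department,Date
Economic History,01-12-1959
Social Policy,01-01-1962
Government,11-09-2024
"""

loaded_df = pd.read_csv(io.StringIO(csv_text))

# An explicit format avoids day/month ambiguity: 01-12-1959 is 1 December, not 12 January
loaded_df["Date"] = pd.to_datetime(loaded_df["Date"], format="%d-%m-%Y")

# Number of publications per department per year
yearly_counts = loaded_df.groupby(["Department", loaded_df["Date"].dt.year]).size()
print(yearly_counts)
```

Passing `format` explicitly is a deliberate choice here: letting pandas infer the format could swap day and month for dates such as `01-12-1959`.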