This project is about distinguishing interdisciplinary scholars from monodisciplinary scholars, which means that we need a measure of how interdisciplinary scholars are. The National Academies of Science, in their 2015 report *Enhancing the Effectiveness of Team Science*, define interdisciplinarity as follows:

>“Interdisciplinary research integrates the data, tools, perspectives, and theories of two or more disciplines to advance understanding or solve problems. Transdisciplinary research aims to deeply integrate and also transcend disciplinary approaches to generate fundamentally new conceptual frameworks, theories, models, and applications.”

In an ideal universe, we'd have highly trained experts who are able to closely examine the papers written by the person or group under consideration, interview the people involved, and even ethnographically follow their work. Let's say that they were really fast, and could do this at a rate of one hour per person (absurdly quick, if you know anything about qualitative social science).  My little sample of 40 people would still take all week to analyze.  

Let's use a data science approach.  In particular, pay attention to the ideas of *integration* and *diversity*. From reading Latour's *Science in Action*, we see that scientific claims are constructed and made firm by tying a new claim to an existing web of established facts, which are recorded in the peer-reviewed scientific literature.  Or in plain language, scientific papers have citations, and it's reasonable to expect that an interdisciplinary paper has a different pattern of citations than a monodisciplinary paper.

Over the past decade, this idea has been operationalized in the Rao-Stirling diversity index, or SDI, which varies between 0 for monodisciplinary and 1 for maximally interdisciplinary. The form is similar to the Gini-Simpson index.  In a 2007 paper, "A general framework for analysing diversity in science, technology and society", Andy Stirling translated ideas from ecology to note that mathematically, diversity is how a set of objects is apportioned over a set of categories. In particular, diversity is the sum, over all pairs of categories *i* and *j*, of the product of their proportions multiplied by the distance between *i* and *j*, as represented by the equation in the bottom line of this table.

![](https://raw.githubusercontent.com/mburnamfink/scientometrics_101_project/master/2_Rao-Stirling_diversity.png)
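To make the formula concrete, here is a minimal sketch of the sum of `p_i * p_j * d_ij` described above. The function name `rao_stirling` and the three-category distance matrix are made up for illustration, and the sum here runs over ordered pairs (so each pair of categories is counted twice), which is an assumption about the convention rather than something taken from the table.

```python
import numpy as np

def rao_stirling(p, d):
    """Sum of p_i * p_j * d_ij over all ordered pairs of categories."""
    p = np.asarray(p, dtype=float)
    d = np.asarray(d, dtype=float)
    # d has zeros on the diagonal, so the i == j terms drop out on their own
    return float(p @ d @ p)

# Hypothetical distances between three categories (0 on the diagonal)
d = np.array([[0.0, 0.2, 0.9],
              [0.2, 0.0, 0.8],
              [0.9, 0.8, 0.0]])

mono = rao_stirling([1.0, 0.0, 0.0], d)  # everything in one category
near = rao_stirling([0.5, 0.5, 0.0], d)  # split across two nearby categories
far = rao_stirling([0.5, 0.0, 0.5], d)   # split across two distant categories
print(mono, near, far)
```

The same even split scores higher when the two categories are far apart, which is the property that separates this index from measures that only look at proportions.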

For scientific papers, we can use citations as our objects, and library categories as our categories. We then need to figure out the distance measure.  Rafols, Porter, and Leydesdorff in 2010 ("Science Overlay Maps: A New Tool for Research Policy and Library Management") looked at the patterns of citations across every paper indexed in Web of Science in 2007 to see how often papers in Web of Science subject category *A* cited papers in Web of Science subject category *B*, and normalized that matrix to find the cosine similarity between all pairs of categories.  One minus the cosine similarity gives us a distance.  Rafols et al. point out that the large-scale epistemic structure of science is fairly stable, so this matrix should still give us a good idea of the distances between categories.
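As a rough sketch of that last step, the snippet below turns a small matrix of citation counts into cosine similarities and then into distances. The counts are invented, and treating each row as a category's citing profile is an assumption about how the normalization works, not a reproduction of Rafols et al.'s pipeline.

```python
import numpy as np

# Hypothetical counts: rows are citing categories, columns are cited categories
citations = np.array([[90.0, 10.0, 0.0],
                      [80.0, 20.0, 5.0],
                      [0.0, 5.0, 95.0]])

# Scale each citing profile to unit length, then take dot products
norms = np.linalg.norm(citations, axis=1, keepdims=True)
unit = citations / norms
similarity = unit @ unit.T

# One minus the cosine similarity gives a distance between categories
distance = 1.0 - similarity
print(np.round(distance, 3))
```

Categories 0 and 1 cite in nearly the same pattern, so their distance is close to 0, while category 2 ends up close to distance 1 from both.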

It gets a little more complicated, because while the Rao-Stirling diversity index makes sense for a single object in one category that cites many things in many categories, we're dealing with corpora of all the papers written by a scientist. Imagine that we have a person who's written an interdisciplinary paper mostly in chemistry (SDI=0.5).  If they write another paper with the same pattern of citations, their Rao-Stirling score shouldn't change. Yet if we have a physicist who has written a monodisciplinary paper in biology (SDI=0.1) and she suddenly writes a paper that cites a bunch of physics papers (SDI=0.1), her Rao-Stirling score should increase.  If we simply average the SDIs, that isn't the case.
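A toy version of the physicist example shows the problem. The two-category distance matrix and the helper `diversity` are hypothetical, and the two papers are taken to be perfectly monodisciplinary (SDI of 0 rather than 0.1) to keep the arithmetic obvious.

```python
import numpy as np

def diversity(p, d):
    """Sum of p_i * p_j * d_ij over ordered pairs of categories."""
    p = np.asarray(p, dtype=float)
    return float(p @ d @ p)

# Two categories at distance 1 from each other (hypothetical)
d = np.array([[0.0, 1.0],
              [1.0, 0.0]])

paper_a = np.array([1.0, 0.0])  # cites only category 0
paper_b = np.array([0.0, 1.0])  # cites only category 1

# Averaging the per-paper scores hides the spread between the papers
average_of_sdis = np.mean([diversity(paper_a, d), diversity(paper_b, d)])

# Pooling the citation shares first, then scoring, picks it up
pooled_shares = (paper_a + paper_b) / 2
sdi_of_pooled = diversity(pooled_shares, d)

print(average_of_sdis, sdi_of_pooled)
```

The average stays at 0 while the pooled score is 0.5, so any corpus-level measure has to account for the distance between papers as well as the diversity within each one.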

Cassi, Mescheba, and Turckheim point out in 2014's "How to evaluate the degree of interdisciplinarity of an institution" that Rao-Stirling diversity is mathematically isomorphic to the equations used to calculate the rotational moment of inertia of a cluster of points. If we're calculating the Rao-Stirling diversity index for many objects, we can find the moment of inertia around the common "center of gravity" for all the citations. Cassi and Turckheim expanded their work in 2017 in ["Analysing Institutions Interdisciplinarity by Extensive Use of Rao-Stirling Diversity Index"](http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0170296#sec015), and have some [R code available](https://github.com/turckheim/interdisciplinarity).
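One way to see why a pairwise sum and a moment of inertia can describe the same thing: for equally weighted points, the mean squared distance to the center of gravity equals half the mean squared distance over all ordered pairs. The check below uses random 2-D points and squared Euclidean distance. It illustrates the general identity and is a guess at the flavor of the argument, not a reproduction of Cassi et al.'s derivation.

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.random((50, 2))  # 50 equally weighted points in the plane

# Moment of inertia about the center of gravity, total weight 1
center = points.mean(axis=0)
inertia = np.mean(np.sum((points - center) ** 2, axis=1))

# Half the average squared distance over all ordered pairs of points
diffs = points[:, None, :] - points[None, :, :]
sq_dist = np.sum(diffs ** 2, axis=2)
half_mean_pairwise = sq_dist.sum() / (2 * len(points) ** 2)

print(inertia, half_mean_pairwise)
```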

Let's visualize it with an example.

In [1]:
from bokeh.plotting import figure, show, output_notebook
import numpy as np

#generate a cloud of 30 points in a square of side 0.25 centered around (-0.375, 0)
x1 = [np.random.rand()/4-0.5 for i in range(30)]
y1 = [np.random.rand()/4-0.125 for i in range(30)]

p1 = figure(x_range=(-1,1), y_range=(-1,1), title = "Monodisciplinary Paper")
p1.circle(x1,y1, color="red")
output_notebook()
show(p1)

def Inertia(xs, ys):
    #mean squared distance of the points from their center of gravity,
    #with every point given equal weight 1/len(xs)
    x_mean = np.mean(xs)
    y_mean = np.mean(ys)
    return sum([(x-x_mean)**2 for x in xs]+[(y-y_mean)**2 for y in ys])/len(xs)

print("Moment of Inertia: %.5f" % (Inertia(x1, y1)))


Moment of Inertia: 0.00900


In [2]:
#generate a cloud of 30 points in a square of side 0.5 centered around (+0.75, 0)
x2 = [np.random.rand()/2+0.5 for i in range(30)]
y2 = [np.random.rand()/2-0.25 for i in range(30)]

p2 = figure(x_range=(-1,1), y_range=(-1,1), title = "Interdisciplinary Paper")
p2.circle(x2,y2, color="blue")
output_notebook()
show(p2)

print("Moment of Inertia: %.5f" % (Inertia(x2, y2)))

Moment of Inertia: 0.04112


In [3]:
p3 = figure(x_range=(-1,1), y_range=(-1,1), title="Combined papers")
p3.circle(x1,y1, color="red",legend="Paper1")
p3.circle(x2,y2, color="blue",legend="Paper2")
show(p3)

x3=x1+x2
y3=y1+y2

print("Moment of Inertia: %.5f" % (Inertia(x3, y3)))

Moment of Inertia: 0.35396


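We normalize the moment of inertia by making the total weight of the points = 1, so each point gets weight 1/n.  Note how the moment of inertia for the combined papers is roughly 9 times higher than for the interdisciplinary paper and roughly 40 times higher than for the monodisciplinary paper, because the two clouds sit far apart.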
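The jump comes from the distance between the two clouds rather than from the spread inside either one. A hedged sketch of that split is below: the total inertia equals a weighted "within" term plus a "between" term (the parallel-axis theorem). The clouds are regenerated with a fixed seed, so the numbers will not match the printed outputs above, and the names `within` and `between` are introduced here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two clouds shaped like the ones above, 30 points each
a = np.column_stack([rng.random(30) / 4 - 0.5, rng.random(30) / 4 - 0.125])
b = np.column_stack([rng.random(30) / 2 + 0.5, rng.random(30) / 2 - 0.25])

def inertia(points):
    """Mean squared distance of the points from their own center of gravity."""
    center = points.mean(axis=0)
    return float(np.mean(np.sum((points - center) ** 2, axis=1)))

both = np.vstack([a, b])
total = inertia(both)

# Within: average of each cloud's own inertia, weighted by its share of points
weights = np.array([len(a), len(b)]) / len(both)
within = weights[0] * inertia(a) + weights[1] * inertia(b)

# Between: weighted squared distance of each cloud's center from the overall center
centers = np.vstack([a.mean(axis=0), b.mean(axis=0)])
grand = both.mean(axis=0)
between = float(np.sum(weights * np.sum((centers - grand) ** 2, axis=1)))

print(total, within, between)
```

Here nearly all of the total comes from the `between` term, which is the part that a simple average of per-paper scores throws away.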

Now to do it for real, with Web of Science records.  First we need to read in Rafols et al.'s co-citing cosine similarity matrix (`CosCiting.csv`) and a spreadsheet of Web of Science categories for every journal.

In [70]:
import os
import metaknowledge as mk
import networkx as nx
import pandas as pd
import csv
import math
import numpy as np

#import files to let us calculate Rao-Stirling diversity
Cosciting = pd.read_csv('CosCiting.csv', sep=',', header=0, index_col=0)
JournalCats = pd.read_excel('WoS History Nov 2017.xlsx', header=0)
JournalCats.set_index('20 Char', inplace=True)

#load up a record collection
RC = mk.RecordCollection("savedrecsKT.txt")

In [71]:
#PaperPrep converts a record into the format
#{Title:{WCs:{AAA:1,BBB:2...},Cites:{CCC:1, DDD:2...},Year:YYYY}}
#which is the input RaoStirling uses to calculate the Rao-Stirling index
#to access a part, use titles[Title]['WCs'], titles[Title]['Cites'], or titles[Title]['Year'] for publication year

def PaperPrep(record, JournalCats):
    title = record['TI']
    line=record['WC']
    year=record['PY']
    dic = {}
    for item in line:
        if item.upper() in dic:
            dic[item.upper()] = dic[item.upper()] + 1
        else:
            dic[item.upper()] = 1
    WCs = dic

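    citeslist = []
    try:
        for cite in record['CR']:
            cite = str(cite)
            #the journal abbreviation is taken to be the third comma-separated field
            try:
                journal = cite.split(', ')[2]
            except IndexError:
                #too few fields: don't silently reuse the previous citation's journal
                journal = None
            try:
                category = JournalCats.at[journal, 'WoS Category']
                if isinstance(category, str):
                    category = [category.upper()]
                else:
                    #journals listed under several categories come back as an array or Series
                    category = [item.upper() for item in category]
            except Exception:
                #journal not in the spreadsheet, or its category is missing
                category = ['Unknown']
            citeslist = citeslist + category
    except Exception:
        #record has no usable cited references
        citeslist = ['Unknown']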

    cites = {}
    for item in citeslist:
        if item in cites:
            cites[item]=cites[item]+1
        else:
            cites[item] = 1

    return({title:{'WCs':WCs,'Cites':cites, 'Year':year}})
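The journal lookup in `PaperPrep` relies on the cited-reference string having the journal abbreviation as its third comma-separated field. Here is a small standalone sketch of just that step. The helper name `journal_from_citation` and the citation strings are made up, and the "Author, Year, Journal, Volume, Page" layout is an assumption about the Web of Science export format.

```python
def journal_from_citation(cite):
    """Return the third comma-separated field of a cited reference, or None."""
    parts = str(cite).split(', ')
    return parts[2] if len(parts) > 2 else None

# Invented examples, not real references
full = journal_from_citation("Doe J, 2001, J HYPOTHETICAL RES, V12, P345")
short = journal_from_citation("Doe J, 2001")
print(full, short)
```

Citations to books, theses, or preprints often lack a journal field, and those are the ones that end up under 'Unknown'.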

In [72]:
#RaoStirling takes a list of paper titles, and the prepped record collection.

def RaoStirling(samplelist, preppedRC):
    if type(samplelist)==str:
        samplelist = [samplelist]
    n = len(samplelist)
    qi = {}
    for title in samplelist:
        line = preppedRC[title]['WCs']
        dic = {}
        qi_N = 0.0
        for item in line:
            if item in dic:
                dic[item] = dic[item] +line[item]
            else:
                dic[item] = line[item]
            qi_N = qi_N +line[item]
        for WC in dic:
            if WC in qi:
                qi[WC] = qi[WC] + dic[WC] / qi_N * (1 / n)
            else:
                qi[WC] = dic[WC] / qi_N * (1 / n)
    qj = {}
    for title in samplelist:
        line = preppedRC[title]['Cites']
        dic = {}
        qj_N = 0.0
        for item in line:
            if item in dic:
                dic[item] = dic[item] + line[item]
            else:
                dic[item] = line[item]
            qj_N = qj_N + line[item]
        for WC in dic:
            if WC in qj:
                qj[WC]= qj[WC]+dic[WC]/qj_N*(1/n)
            else:
                qj[WC] = dic[WC]/qj_N*(1/n)
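    #sum (1 - similarity) * qi * qj over every pairing of the papers' own
    #categories (qi) with the categories they cite (qj)
    SDI = 0
    for WCi in qi:
        for WCj in qj:
            try:
                #Cosciting holds cosine similarities, so 1 - value is the distance
                invdistance = float(Cosciting.loc[WCi.upper(), WCj.upper()])
            except Exception:
                #categories missing from the matrix (e.g. 'Unknown') get similarity 1,
                #so they contribute a distance of 0
                invdistance = 1
            SDI = SDI + (1-invdistance)*qi[WCi]*qj[WCj]
    return(SDI)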

Now let's prep the papers, and check one of Kip Thorne's papers at random to see what it's categorized as, what it cites, and its SDI.

In [73]:
titles = {}
for R in RC:
    titles.update(PaperPrep(R,JournalCats))

check = next(iter(titles.keys()))
    
print(check)
print(titles[check]['WCs'])
print(titles[check]['Cites'])
print("Rao-Stirling Diversity Index:",RaoStirling(check, titles))

Visualizing spacetime curvature via frame-drag vortexes and tidal tendexes: General theory and weak-gravity applications
{'ASTRONOMY & ASTROPHYSICS': 1, 'PHYSICS, PARTICLES & FIELDS': 1}
{'ASTRONOMY & ASTROPHYSICS': 21, 'PHYSICS, MULTIDISCIPLINARY': 25, 'PHYSICS, PARTICLES & FIELDS': 21, 'MATHEMATICS': 1, 'EDUCATION, SCIENTIFIC DISCIPLINES': 2, 'PHYSICS, MATHEMATICAL': 2, 'ENGINEERING, ELECTRICAL & ELECTRONIC': 1, 'Unknown': 6, 'GENETICS & HEREDITY': 1}
Rao-Stirling Diversity Index: 0.29989374999999996


And let's see what the SDI of the entire corpus is.

In [74]:
print("Rao-Stirling Diversity Index:",RaoStirling(titles.keys(), titles))

Rao-Stirling Diversity Index: 0.2804939269657604


And while we're at it, let's see the distribution of SDI for all of Kip's papers by year.

In [94]:
SDIs = []
years = []
t = []

for title in titles.keys():
    #print('SDI: %.3f :' % (RaoStirling(title, titles)),titles[title]['Year'],title)
    t.append(title)
    SDIs.append(RaoStirling(title, titles))
    years.append(titles[title]['Year'])
    
import pandas as pd

df = pd.DataFrame({'Title':t, 'PubYear':years,'SDI':SDIs})
df.sample(10)

#and yes, I realize this is a suboptimal way to do it. I only figured out what I wanted to do after writing a bunch of code.

| | Title | PubYear | SDI |
|---|---|---|---|
| 33 | MEMBRANE VIEWPOINT ON BLACK-HOLES - PROPERTIES... | 1986 | 0.310396 |
| 86 | Search for gravitational-wave bursts in LIGO d... | 2007 | 0.244279 |
| 103 | SEARCHES FOR GRAVITATIONAL WAVES FROM KNOWN PU... | 2010 | 0.1975 |
| 104 | ROTATION HALTS CYLINDRICAL, RELATIVISTIC GRAVI... | 1992 | 0.284367 |
| 84 | First all-sky upper limits from LIGO on the st... | 2005 | 0.246557 |
| 137 | SWIFT FOLLOW-UP OBSERVATIONS OF CANDIDATE GRAV... | 2012 | 0.192813 |
| 101 | Upper limits on a stochastic gravitational-wav... | 2012 | 0.230797 |
| 124 | First low-latency LIGO plus Virgo search for b... | 2012 | 0.221977 |
| 170 | DISK-ACCRETION ONTO A BLACK-HOLE .2. EVOLUTION... | 1974 | 0.21175 |
| 115 | Human gravity-gradient noise in interferometri... | 1999 | 0.402 |


In [105]:
#average only the numeric SDI column; the Title column is text and can't be averaged
mean_SDI = df.groupby("PubYear")[["SDI"]].mean()
mean_SDI.SDI

PubYear
1973    0.297202
1974    0.164139
1975    0.059681
1976    0.116068
1977    0.231358
1978    0.102364
1979    0.000000
1980    0.315240
1981    0.094633
1982    0.112386
1983    0.188176
1984    0.187647
1985    0.316370
1986    0.189931
1987    0.465250
1988    0.357800
1989    0.155423
1990    0.213509
1991    0.290310
1992    0.337255
1993    0.506464
1994    0.243770
1995    0.443122
1996    0.169970
1997    0.076513
1998    0.286422
1999    0.402000
2000    0.238235
2002    0.446506
2003    0.234500
2004    0.336284
2005    0.255189
2006    0.190194
2007    0.255412
2008    0.221034
2009    0.240845
2010    0.226629
2011    0.259594
2012    0.245005
2013    0.294795
2014    0.184242
2015    0.313946
2017    0.197820
2018    0.312962
Name: SDI, dtype: float64

In [107]:
from bokeh.models import HoverTool
from bokeh.models import ColumnDataSource

source = ColumnDataSource(data=df)

p=figure(x_range =(min(years)-2, max(years)+2),plot_width=950, title="Kip Thorne's Stirling Diversity Indexes by Publication Year")
p.circle(x="PubYear", y="SDI", size=10, hover_color="green", source=source)
p.line(x=mean_SDI.index, y=mean_SDI.SDI, color="red", legend="Mean Annual SDI")
hover = HoverTool(tooltips="@Title")
p.add_tools(hover)
show(p)

In [110]:
from bokeh.plotting import figure, output_file, show
output_file("2_KT_SDI.html")
show(p)