In [1]:
# Inline matplotlib output
%matplotlib inline

In [2]:
import json
import matplotlib.pyplot as plt
import networkx as nx
import psycopg2
import pandas as pd

from pandas import json_normalize

# SQL Setup

As we are accessing data on a remote PostgreSQL server, we need to create a Psycopg2 `connection` object using our database parameters. These are read in from a JSON file at the path below. We use these to construct a connection string which is passed as the only parameter to the connect method.

In [3]:
# Read in config file with DB params
with open('../scripts/config.json') as f:
    conf = json.load(f)
    
# Define a connection string
conn_string = 'host={} dbname={} user={} password={}'.format(conf.get('host'),
                                                             conf.get('database'),
                                                             conf.get('user'),
                                                             conf.get('passw'))

# Create a connection object
conn = psycopg2.connect(conn_string)
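
The connection-string step can also be isolated into a small helper, which makes it possible to check the string without connecting to the server. This is only a minimal sketch: `build_conn_string` and the sample values are hypothetical, and the key names are assumed to mirror those read from `config.json` above.

```python
def build_conn_string(conf):
    """Build a libpq-style connection string from a config dict.

    Hypothetical helper: expects the keys used in this notebook's
    config file ('host', 'database', 'user', 'passw').
    """
    required = ('host', 'database', 'user', 'passw')
    missing = [key for key in required if key not in conf]
    if missing:
        # Fail early rather than formatting None into the string
        raise KeyError('Missing config keys: {}'.format(', '.join(missing)))
    return 'host={host} dbname={database} user={user} password={passw}'.format(**conf)


# Placeholder values for illustration only
example_conf = {'host': 'localhost', 'database': 'gtr',
                'user': 'analyst', 'passw': 'secret'}
conn_string_example = build_conn_string(example_conf)
print(conn_string_example)
```

Checking for missing keys up front avoids the silent `None` that `dict.get` would otherwise format into the string.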

# Persons

This section of the notebook creates a DataFrame object holding data relating to persons in the Gateway to Research database. We first read in the data from a PostgreSQL database, expanding nested JSON elements to individual rows using Postgres' [jsonb_array_elements](http://www.postgresql.org/docs/9.4/static/functions-json.html) function.

As this isn't a small database, we use the `chunksize` parameter of `pandas.read_sql`, which makes it return an iterator. Instead of fetching every row returned by the SQL query at once, the iterator yields `chunk` rows at a time as a DataFrame, so the full raw result set never has to be held in memory alongside the Python objects built from it (which typically take up far less memory).

The difference between this approach and the usual flow of directly creating a DataFrame is that we define the iterator `results` and an empty DataFrame `df`, then loop over the iterator and append each batch of `chunk` rows to `df`.

In [4]:
# Number of rows to iterate at a time
chunk = 5000

# SQL string that unpacks nested JSON arrays
sql_str = """
SELECT 
  id, first_name, surname, 
  jsonb_array_elements(links->'link')->'rel' as relationship,
  jsonb_array_elements(links->'link')->'href' AS href,
  jsonb_array_elements(links->'link')->'otherAttributes' as other_attribs
FROM
  gtr.persons;
"""

# chunksize returns an iterator that reads chunk number of rows at a time
results = pd.read_sql(sql_str,
                      conn,
                      chunksize=chunk)

# New dataframe object that can be appended to
df = pd.DataFrame()

# Iterate through result
# This can take a while on large tables
for result in results:
    df = pd.concat([df, result])

As `df` started out empty and each chunk arrived with its own index starting at 0, the index now contains duplicate values in the range 0 to `chunk` - 1, so we reset it to get unique integers.

In [5]:
# Check that index values are recurring
df.index.value_counts().head()

2047    38
2098    38
819     38
2866    38
691     38
dtype: int64

In [6]:
# Remove non-unique integers in index
df.reset_index(drop=True, inplace=True)

In [7]:
df.index.value_counts().head()

2047      1
163173    1
150883    1
148834    1
154977    1
dtype: int64
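
The chunked read and the index reset can be sketched end to end on a toy table. The example below is an illustration under stated assumptions: an in-memory SQLite database stands in for the remote PostgreSQL server, and the `persons` table and its columns are invented for the demonstration.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database standing in for the remote server
demo_conn = sqlite3.connect(':memory:')
demo_conn.execute('CREATE TABLE persons (id INTEGER, surname TEXT)')
demo_conn.executemany('INSERT INTO persons VALUES (?, ?)',
                      [(i, 'name{}'.format(i)) for i in range(10)])
demo_conn.commit()

# chunksize=4 yields DataFrames of 4, 4 and 2 rows
chunks = list(pd.read_sql('SELECT * FROM persons', demo_conn, chunksize=4))
demo_conn.close()

# Without ignore_index, each chunk keeps its own 0-based index
dup = pd.concat(chunks)
# ignore_index=True builds a fresh 0..n-1 index in one step
demo_df = pd.concat(chunks, ignore_index=True)

print(dup.index.is_unique, demo_df.index.is_unique)
```

Passing `ignore_index=True` to `pd.concat` gives the same result as concatenating and then calling `reset_index(drop=True)`.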

In [8]:
# Check everything looks normal
df.head()

Unnamed: 0,id,first_name,surname,relationship,href,other_attribs
0,181FC03A-FB8E-4C3D-8952-1D0DA0902AB3,Candice Coker,Morey,COI_PER,http://gtr.rcuk.ac.uk:80/gtr/api/projects/3DFA...,{}
1,181FC03A-FB8E-4C3D-8952-1D0DA0902AB3,Candice Coker,Morey,EMPLOYED,http://gtr.rcuk.ac.uk:80/gtr/api/organisations...,{}
2,1FAD8549-7B84-4B12-9D6C-F6F2B9A85481,Jaroslaw,Nowak,PI_PER,http://gtr.rcuk.ac.uk:80/gtr/api/projects/06E1...,{}
3,1FAD8549-7B84-4B12-9D6C-F6F2B9A85481,Jaroslaw,Nowak,COI_PER,http://gtr.rcuk.ac.uk:80/gtr/api/projects/621E...,{}
4,1FAD8549-7B84-4B12-9D6C-F6F2B9A85481,Jaroslaw,Nowak,PI_PER,http://gtr.rcuk.ac.uk:80/gtr/api/projects/F33A...,{}


One of the aspects of these data we are interested in is the set of links between researchers via projects they have worked on. The SQL query pulled through the URIs for projects that people worked on and their organisations (`href` column) as well as their relationship to that project or organisation (`relationship`).

To make the URI easier to match to the project/organisation ID later on, we split it and keep just the string after the last forward slash. This corresponds to the project/organisation ID which is contained in the project database. This is easily achieved using the `map` method and a `lambda` expression that takes the last element of the split list.

In [9]:
# We only want the unique identifier from the href
df.href = df.href.map(lambda x: x.split('/')[-1])

In [10]:
df.head()

Unnamed: 0,id,first_name,surname,relationship,href,other_attribs
0,181FC03A-FB8E-4C3D-8952-1D0DA0902AB3,Candice Coker,Morey,COI_PER,3DFAF224-A9B4-486C-BF0D-C1B0A6F9D585,{}
1,181FC03A-FB8E-4C3D-8952-1D0DA0902AB3,Candice Coker,Morey,EMPLOYED,F0F2AC58-F3D5-4F25-8E51-6AF679C5EBFA,{}
2,1FAD8549-7B84-4B12-9D6C-F6F2B9A85481,Jaroslaw,Nowak,PI_PER,06E1A449-5F20-40F8-8ECF-E8FC9E585A71,{}
3,1FAD8549-7B84-4B12-9D6C-F6F2B9A85481,Jaroslaw,Nowak,COI_PER,621E5125-239C-410C-B18B-339FB0108A84,{}
4,1FAD8549-7B84-4B12-9D6C-F6F2B9A85481,Jaroslaw,Nowak,PI_PER,F33A7364-0224-4DA1-AF89-363F763FF57C,{}
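
The extraction step can be checked in isolation. The sketch below uses `rsplit` with a split limit instead of `split`, which is a small variation on the cell above rather than the notebook's own code; `last_segment` and the example URI are hypothetical.

```python
def last_segment(uri):
    """Return the text after the final '/' in a URI."""
    # rsplit with maxsplit=1 splits once from the right,
    # so the rest of the URI is not broken into pieces
    return uri.rsplit('/', 1)[-1]


# Hypothetical URI with a placeholder identifier
example_uri = 'http://example.org/api/projects/PLACEHOLDER-ID'
print(last_segment(example_uri))
```

On a whole column, the vectorised equivalent would be `df.href.str.rsplit('/', n=1).str[-1]`, which also leaves missing values as missing instead of raising an error.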


Let's look at some basic stats.

In [11]:
print("Number of researchers:\t\t{}".format(df.id.nunique()))
print("Number of projects:\t\t{}".format(df[df.relationship != "EMPLOYED"].href.nunique()))
print("Number of organisations:\t{}".format(df[df.relationship == "EMPLOYED"].href.nunique()))

Number of researchers:		51618
Number of projects:		64394
Number of organisations:	3501


To look at some stats across researchers, it's easier to group them by their id values, carry out the analyses, and then recombine the data. We do this by creating a GroupBy object using the `groupby` DataFrame method, passing it the name of the column we wish to group by as a string.

The `len` built-in function returns the number of groups in a GroupBy object. We can use this to check whether the number of groups is the same as the number of researchers. We will create two GroupBy objects, one with just projects and one with only organisations.

In [12]:
grp_projects = df[df.relationship != 'EMPLOYED'].groupby('id')
grp_orgs = df[df.relationship == 'EMPLOYED'].groupby('id')
print('Project groups: {}\nOrganisation groups: {}'.format(len(grp_projects), len(grp_orgs)))

Project groups: 50738
Organisation groups: 51618
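
The relationship between `len` and the number of groups is easy to confirm on a toy frame. This is a small illustration with invented data, using the same column names as `df`.

```python
import pandas as pd

# Toy data: three people, one of whom ('b') has no project link
toy = pd.DataFrame({
    'id': ['a', 'a', 'b', 'c', 'c', 'c'],
    'relationship': ['PI_PER', 'EMPLOYED', 'EMPLOYED',
                     'COI_PER', 'PI_PER', 'EMPLOYED'],
    'href': ['p1', 'o1', 'o1', 'p1', 'p2', 'o2'],
})

toy_projects = toy[toy.relationship != 'EMPLOYED'].groupby('id')
toy_orgs = toy[toy.relationship == 'EMPLOYED'].groupby('id')

# len() of a GroupBy object is the number of groups
print(len(toy_projects), len(toy_orgs))

# ngroups reports the same number
n_project_groups = toy_projects.ngroups
# Number of project links per person
links_per_person = toy_projects['href'].count()
```

Person `b` drops out of the project groups entirely because filtering happens before grouping, which is the same reason the two counts above differ.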


So all researchers have an organisation link, but not all have a project link. Let's look at the data on number of projects by researcher.

In [13]:
avg_projects = grp_projects['href'].count().mean()
max_projects = grp_projects['href'].count().max()
print("Mean number of projects per researcher: {}".format(avg_projects))
print("Max number of projects per researcher: {}".format(max_projects))

Mean number of projects per researcher: 2.69376010091056
Max number of projects per researcher: 5348


5,348 projects for a single person doesn't sound sensible. Let's look at the count of each person's project links in descending order to see whether there are other issues.

In [14]:
grp_projects['href'].count().sort_values(ascending=False)

id
6BAA32D3-0592-4662-8707-B8FF49287B99    5348
1EB3E0DF-CDB7-4690-8AE2-76659A55FADA    1796
4E408C5C-D574-4927-97AF-1CE2C4AADDB1     671
DB3BCEAD-7AF1-453F-B56E-8CEA2199DA0F      50
855ACD43-8240-48FB-AF40-C8F87B294B74      45
0C563960-68AC-41EB-8535-4F23806F09AA      45
A757DCA4-11A9-4E31-A909-4178F9A4326C      41
66F27E10-48BB-4714-9A85-9BCC4090FB65      40
C7BFB802-13FD-482C-9D1D-02A1356BEFEF      40
E7265890-DFAA-4CF4-8181-48515950E2FA      38
E0972EDE-C975-43D8-8381-DACF83343666      37
5C00A286-2044-4AD7-A91E-BE1DB9EF0C76      37
C85FD069-6D6C-43BE-9E33-7C9643E7AC90      37
CF80E2A9-03AA-405B-91C1-B018C1D25248      36
C9511462-1605-4095-BC4C-F24797BF2FD9      35
F19082DC-8F80-46C5-9B48-856FAE1FA203      35
A0B95F21-37DF-4B88-84CD-091248E3E24D      35
02864FC6-23B6-43C9-AFC8-29854658787B      35
08B1BE0A-068E-4904-846E-CC08E3B637A3      34
681275C6-A950-4FB2-B5A1-33F6D51A90C9      34
0CEC6E19-2999-49FF-95D0-6A682CEB301E      34
50011127-8841-4973-BE0A-1001E77667C8      33
5C7FF15

We can see from this that there are a small number of `persons` that have a very high number of projects. It makes sense to check their details to see whether we want to remove them from the analyses.

In [15]:
# Get the ids as a list
ids = [x for x in grp_projects['href'].count().sort_values(ascending=False)[0:3].index]

In [16]:
df[(df['id'].isin(ids)) & (df['relationship'] != 'EMPLOYED')].groupby('id').head()

Unnamed: 0,id,first_name,surname,relationship,href,other_attribs
32129,6BAA32D3-0592-4662-8707-B8FF49287B99,Unknown,Unknown,PM_PER,DE15D7ED-BA85-41FB-B063-14562C1104BE,{}
32130,6BAA32D3-0592-4662-8707-B8FF49287B99,Unknown,Unknown,PM_PER,BAAC1842-EA35-47FE-8582-163333BB1A0A,{}
32131,6BAA32D3-0592-4662-8707-B8FF49287B99,Unknown,Unknown,PM_PER,BCAE4818-A7CE-464E-86D4-1674A65565A1,{}
32132,6BAA32D3-0592-4662-8707-B8FF49287B99,Unknown,Unknown,PM_PER,EFBBC8FC-0E23-4640-87AB-1680EC6EAF86,{}
32133,6BAA32D3-0592-4662-8707-B8FF49287B99,Unknown,Unknown,PM_PER,8EBEAB04-66F6-4FF8-B0A8-119C278C5F00,{}
75737,1EB3E0DF-CDB7-4690-8AE2-76659A55FADA,Grants Team,Administered At Tsb,PM_PER,BE1DE1B8-2BE5-432C-8546-0DC0E884F47A,{}
75738,1EB3E0DF-CDB7-4690-8AE2-76659A55FADA,Grants Team,Administered At Tsb,PM_PER,CAD86664-958A-4B98-840B-0E452CDBC1A4,{}
75739,1EB3E0DF-CDB7-4690-8AE2-76659A55FADA,Grants Team,Administered At Tsb,PM_PER,51997933-F1BD-4E89-9911-1547032667E0,{}
75740,1EB3E0DF-CDB7-4690-8AE2-76659A55FADA,Grants Team,Administered At Tsb,PM_PER,EAACF862-7E1B-4547-BF0D-0D2277C80748,{}
75741,1EB3E0DF-CDB7-4690-8AE2-76659A55FADA,Grants Team,Administered At Tsb,PM_PER,CE88677C-E0B5-4BFD-8073-1488C1375425,{}


We can see from this that those persons with a high number of project links actually all have `PM_PER` links (which from the GtR [online dictionary](http://gtr.rcuk.ac.uk/resources/GtRDataDictionary.pdf) we can see are *project managers* and not researchers). Let's check what other `relationship` types there are, mapping them back to the data dictionary.

In [17]:
df.relationship.unique()

array(['COI_PER', 'EMPLOYED', 'PI_PER', 'RESEARCH_COI_PER', 'FELLOW_PER',
       'SUPER_PER', 'RESEARCH_PER', 'TGH_PER', 'PM_PER'], dtype=object)

From this we can see that there are a number of researcher person types:
- COI_PER (Co-investigator)
- PI_PER (Principal investigator)
- RESEARCH_COI_PER (Post-doc research assistant with COI status)
- FELLOW_PER (Research Fellow)
- RESEARCH_PER (Post-doc research assistant)

As well as non-researcher person types:
- EMPLOYED (Employing organisation)
- TGH_PER (Training Grant Holder)
- PM_PER (Project manager)

The 'SUPER_PER' maps to supervisors of doctoral training students.

So we need to be smarter in our `groupby` calls and filter out the unwanted relationship types first:

In [35]:
grp_projects = df[(df.relationship != 'EMPLOYED')
                  & (df.relationship != 'PM_PER')
                  & (df.relationship != 'SUPER_PER')
                  & (df.relationship != 'TGH_PER')].groupby('id')

grp_orgs = df[df.relationship == 'EMPLOYED'].groupby('id')
print('Project groups: {}\nOrganisation groups: {}'.format(len(grp_projects), len(grp_orgs)))

Project groups: 46736
Organisation groups: 51618
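
The same filter can be written as an allow-list with `isin`, which is shorter than chaining `!=` comparisons. The sketch below uses invented rows and assumes the five researcher types listed above are the only ones to keep; for the nine relationship values seen in this dataset the two forms select the same rows.

```python
import pandas as pd

# Researcher roles as listed above
researcher_roles = ['COI_PER', 'PI_PER', 'RESEARCH_COI_PER',
                    'FELLOW_PER', 'RESEARCH_PER']

# Toy rows for illustration only
toy_links = pd.DataFrame({
    'id': ['a', 'a', 'b', 'b', 'c'],
    'relationship': ['PI_PER', 'EMPLOYED', 'PM_PER', 'PM_PER', 'SUPER_PER'],
    'href': ['p1', 'o1', 'p2', 'p3', 'p4'],
})

# Keep only rows whose relationship is a researcher role
researcher_links = toy_links[toy_links.relationship.isin(researcher_roles)]
grp_researchers = researcher_links.groupby('id')
print(len(grp_researchers))
```

An allow-list also behaves differently if a new relationship type appears later: it is excluded by default, whereas the chained `!=` form would let it through.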


In [19]:
avg_projects = grp_projects['href'].count().mean()
max_projects = grp_projects['href'].count().max()
min_projects = grp_projects['href'].count().min()
print("Mean number of projects per researcher: {:.2}".format(avg_projects))
print("Max number of projects per researcher: {}".format(max_projects))

Mean number of projects per researcher: 2.5
Max number of projects per researcher: 49


That is all we want to look at for now in terms of the Persons data. We'll move on to the Projects data next, which will be similar in terms of method execution, so I'll be less verbose.

# Projects

We take a slightly different approach to reading projects data as these are stored as materialized views in PostgreSQL. To read them in, we load each view into a separate DataFrame and then `merge` them using the `id` column as a key.

In [20]:
def get_project_data(view):
    sql_str = "SELECT * FROM gtr.projects_{};".format(view)
    
    # chunksize returns an iterator that reads chunk number of rows at a time
    results = pd.read_sql(sql_str,
                      conn,
                      #parse_dates=['created'],
                      chunksize=100000)
    
    dfx = pd.DataFrame()
    
    # Iterate through result
    # This can take a while on large tables
    for result in results:
        dfx = pd.concat([dfx, result])

    # Appending creates non-unique integers in index
    dfx.reset_index(drop=True, inplace=True)
    
    return dfx.copy(deep=True)

In [21]:
df_projects_subject = get_project_data('subject')
df_projects_subject.head()

Unnamed: 0,id,research_subject
0,DC58EBBF-A402-4C83-A567-50AEEE8D63EC,Economics
1,D44A979C-2895-49B6-B411-4C54A11C1057,Marine environments
2,D44A979C-2895-49B6-B411-4C54A11C1057,Animal Science
3,D44A979C-2895-49B6-B411-4C54A11C1057,Economics
4,D44A979C-2895-49B6-B411-4C54A11C1057,"Ecol, biodivers. & systematics"


In [22]:
df_projects_topic = get_project_data('topic')
df_projects_topic.head()

Unnamed: 0,id,research_topic
0,DC58EBBF-A402-4C83-A567-50AEEE8D63EC,Environmental economics
1,DC58EBBF-A402-4C83-A567-50AEEE8D63EC,Public economics
2,DC58EBBF-A402-4C83-A567-50AEEE8D63EC,Behavioural & experimental eco
3,DC58EBBF-A402-4C83-A567-50AEEE8D63EC,Industrial Organisation (R&D)
4,D44A979C-2895-49B6-B411-4C54A11C1057,Conservation Ecology


In [23]:
df_projects_link = get_project_data('link')
df_projects_link.head()

Unnamed: 0,id,href,rel
0,0C22922F-7A57-450E-A0A1-4B8674BE6080,http://gtr.rcuk.ac.uk:80/gtr/api/persons/EBB4C...,COI_PER
1,0C22922F-7A57-450E-A0A1-4B8674BE6080,http://gtr.rcuk.ac.uk:80/gtr/api/persons/17D48...,COI_PER
2,0C22922F-7A57-450E-A0A1-4B8674BE6080,http://gtr.rcuk.ac.uk:80/gtr/api/persons/D6BC9...,COI_PER
3,0C22922F-7A57-450E-A0A1-4B8674BE6080,http://gtr.rcuk.ac.uk:80/gtr/api/persons/6C6E9...,COI_PER
4,0C22922F-7A57-450E-A0A1-4B8674BE6080,http://gtr.rcuk.ac.uk:80/gtr/api/persons/2237F...,COI_PER


In [24]:
sql_str = "SELECT id, grant_cats, lead_org_dpts, status, titles as title FROM gtr.projects;"
    
# chunksize returns an iterator that reads chunk number of rows at a time
results = pd.read_sql(sql_str,
                  conn,
                  #parse_dates=['created'],
                  chunksize=100000)

df_projects = pd.DataFrame()

# Iterate through result
# This can take a while on large tables
for result in results:
    df_projects = pd.concat([df_projects, result])

# Appending creates non-unique integers in index
df_projects.reset_index(drop=True, inplace=True)

In [25]:
df_projects.head()

Unnamed: 0,id,grant_cats,lead_org_dpts,status,title
0,DC58EBBF-A402-4C83-A567-50AEEE8D63EC,Research Grant,School of Social Sciences,Active,The Impact of Price and Information on Water C...
1,D44A979C-2895-49B6-B411-4C54A11C1057,Research Grant,Plymouth Marine Lab,Active,Quantifying benefits and impacts of fishing ex...
2,D54AB9E8-00A2-4099-A446-4F3C37B7ED81,Studentship,Edinburgh College of Art,Active,Drawings by Joseph Beuys in the Collection of ...
3,D583EFCF-EF3D-499C-AFC3-4E771DA1808C,Training Grant,Graduate School,Closed,Liverpool - ESRC Standard Research Transition ...
4,D5ECC507-14CE-4FBD-B95F-4ECA3E95049E,Intramural,,Active,Epigenetics and Development


In [26]:
# Merge all the dataframes using the ID as key
df_merged = df_projects.merge(
                df_projects_topic, on='id', sort=False, how='left').merge(
                    df_projects_link, on='id', sort=False, how='left').merge(
                        df_projects_subject, on='id', sort=False, how='left')
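
Chained left merges on a non-unique key repeat rows, which matters when reading row counts from the merged frame. The toy example below, with invented data, shows how one project's topics and links combine.

```python
import pandas as pd

# Toy frames for illustration only
projects = pd.DataFrame({'id': ['p1', 'p2'], 'title': ['First', 'Second']})
topics = pd.DataFrame({'id': ['p1', 'p1'], 'research_topic': ['t1', 't2']})
links = pd.DataFrame({'id': ['p1', 'p1', 'p1', 'p2'],
                      'href': ['h1', 'h2', 'h3', 'h4']})

toy_merged = (projects
              .merge(topics, on='id', how='left')
              .merge(links, on='id', how='left'))

# p1: 2 topics x 3 links = 6 rows; p2: no topic (NaN) x 1 link = 1 row
print(len(toy_merged))

# Distinct links per project are unaffected by the repetition
distinct_links = toy_merged.groupby('id')['href'].nunique()
```

Counting with `nunique` rather than `count` recovers the number of distinct links per project from the merged frame.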

In [27]:
# Sense check
df_merged.describe()

Unnamed: 0,id,grant_cats,lead_org_dpts,status,title,research_topic,href,rel,research_subject
count,3269398,3269398,3144574,3269398,3269398,2895273,3269398,3269398,2895273
unique,65053,27,2401,2,55636,607,663699,35,82
top,BFC66D10-D65B-4997-9E19-AE1DDC8ED405,Research Grant,Physics,Closed,Horizon: Digital Economy Hub at the University...,Beyond the Standard Model,http://gtr.rcuk.ac.uk:80/gtr/api/organisations...,PUBLICATION,Info. & commun. Technol.
freq,27450,2882279,189214,2246626,27450,96465,11359,1978332,146969


## Outcomes

We can look at the `rel` column to identify the number and types of outcomes.

In [28]:
# rel unique values
df_merged.rel.unique()

array(['PI_PER', 'COI_PER', 'LEAD_ORG', 'FUND', 'RESEARCH_PER',
       'PUBLICATION', 'SUPER_PER', 'STUDENT_PER', 'TGH_PER', 'PM_PER',
       'PARTICIPANT_ORG', 'PP_ORG', 'KEY_FINDING', 'IMPACT_SUMMARY',
       'DISSEMINATION', 'FURTHER_FUNDING', 'IP', 'TRANSFER', 'COFUND_ORG',
       'RESEARCH_COI_PER', 'COLLAB_ORG', 'COLLABORATION', 'PRODUCT',
       'STUDENTSHIP', 'FELLOW_PER', 'FELLOW_ORG', 'RESEARCH_MATERIAL',
       'TRANSFER_FROM', 'STUDENTSHIP_FROM', 'RESEARCH_DATABASE_AND_MODEL',
       'POLICY', 'ARTISTIC_AND_CREATIVE_PRODUCT', 'STUDENT_PP_ORG',
       'SOFTWARE_AND_TECHNICAL_PRODUCT', 'SPIN_OUT'], dtype=object)

We can create a list of the outcome types we consider relevant and filter on it.

In [29]:
outcomes = ['IP',
            'PRODUCT',
            'RESEARCH_DATABASE_AND_MODEL',
            'POLICY',
            'ARTISTIC_AND_CREATIVE_PRODUCT',
            'SOFTWARE_AND_TECHNICAL_PRODUCT',
            'SPIN_OUT']

df_merged[df_merged.rel.isin(outcomes)].describe()

Unnamed: 0,id,grant_cats,lead_org_dpts,status,title,research_topic,href,rel,research_subject
count,6083,6083,5966,6083,6083,5319,6083,6083,5319
unique,805,5,390,2,792,277,1933,7,70
top,B1EE98F8-4E51-40B7-8C3D-A882E08BC321,Research Grant,Social Sciences,Closed,IMPRINTS Identity Management: Public Responses...,Applied Arts HTP,http://gtr.rcuk.ac.uk:80/gtr/api/outcomes/rese...,ARTISTIC_AND_CREATIVE_PRODUCT,Visual arts
freq,275,5473,275,5993,275,162,25,1688,417


In [30]:
# Count the rows for each outcome type
for outcome in outcomes:
    print('{}: {}'.format(outcome, (df_merged.rel == outcome).sum()))

IP: 691
PRODUCT: 184
RESEARCH_DATABASE_AND_MODEL: 1229
POLICY: 1563
ARTISTIC_AND_CREATIVE_PRODUCT: 1688
SOFTWARE_AND_TECHNICAL_PRODUCT: 496
SPIN_OUT: 232


In [31]:
df_merged[df_merged.id == 'B1EE98F8-4E51-40B7-8C3D-A882E08BC321'].title.iloc[0]

'IMPRINTS Identity Management: Public Responses to IdeNtity Technologies and Services.'

In [32]:
df_merged[(df_merged.id == 'B1EE98F8-4E51-40B7-8C3D-A882E08BC321') &
          (df_merged.rel.isin(outcomes))].describe()

Unnamed: 0,id,grant_cats,lead_org_dpts,status,title,research_topic,href,rel,research_subject
count,275,275,275,275,275,275,275,275,275
unique,1,1,1,1,1,5,11,2,5
top,B1EE98F8-4E51-40B7-8C3D-A882E08BC321,Research Grant,Social Sciences,Closed,IMPRINTS Identity Management: Public Responses...,Cultural Geography,http://gtr.rcuk.ac.uk:80/gtr/api/outcomes/arti...,POLICY,Psychology
freq,275,275,275,275,275,55,25,150,55


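Out of 65,053 projects, only 805 have at least one *relevant* outcome recorded against them. The project with the largest number of relevant outcome rows is *IMPRINTS Identity Management: Public Responses to IdeNtity Technologies and Services*, at 275 rows. Because the merge repeats each link for every topic and subject of a project, these rows correspond to 11 distinct outcome links (the `unique` count for `href` above). Of the 275 rows, 150 are POLICY outcomes.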

We can also look at the total number of all relevant outcomes.

In [33]:
df_merged.rel[df_merged.rel.isin(outcomes)].value_counts().sort_values(ascending=False)

ARTISTIC_AND_CREATIVE_PRODUCT     1688
POLICY                            1563
RESEARCH_DATABASE_AND_MODEL       1229
IP                                 691
SOFTWARE_AND_TECHNICAL_PRODUCT     496
SPIN_OUT                           232
PRODUCT                            184
Name: rel, dtype: int64
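
These totals are row counts in the merged frame, so a link that is repeated across topics or subjects is counted more than once. A hedged sketch of counting distinct outcomes instead, using invented rows with the same column names as `df_merged`:

```python
import pandas as pd

# Toy stand-in for df_merged: link 'h1' is repeated once per topic
toy_outcomes = pd.DataFrame({
    'id': ['p1', 'p1', 'p1', 'p2'],
    'research_topic': ['t1', 't2', 't1', 't3'],
    'href': ['h1', 'h1', 'h2', 'h3'],
    'rel': ['POLICY', 'POLICY', 'IP', 'POLICY'],
})

# Row counts include the repeats
row_counts = toy_outcomes.rel.value_counts()

# Keep one row per (project, link, type) before counting
distinct_counts = (toy_outcomes
                   .drop_duplicates(subset=['id', 'href', 'rel'])
                   .rel.value_counts())
print(distinct_counts)
```

Whether row counts or distinct counts are the right measure depends on the question being asked; the distinct version is the one that corresponds to individual outcome records.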

# People -> Projects Network Analysis

It would be interesting to see what the network of researchers looks like in terms of project links. We can use the `href` and `id` columns of the `df_merged` DataFrame to construct a network. `networkx` can build a graph from a DataFrame in which each row is an edge, so here every row of `df_merged` becomes an edge between an `href` and a project `id`.

In [36]:
G = nx.from_pandas_edgelist(df_merged,
                            'href',
                            'id',
                            create_using=nx.Graph())

In [None]:
degree_sequence = sorted((d for _, d in G.degree()), reverse=True)  # degree sequence

In [None]:
dmax = max(degree_sequence)
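
The graph construction and degree sequence can be tried on a toy edge list first. This sketch assumes networkx 2.0 or later, where edge lists are read with `from_pandas_edgelist` and `G.degree()` yields `(node, degree)` pairs; the data are invented.

```python
import networkx as nx
import pandas as pd

# Toy edge list: each row links an href to a project id
toy_edges = pd.DataFrame({
    'href': ['h1', 'h2', 'h2', 'h3'],
    'id': ['p1', 'p1', 'p2', 'p2'],
})

toy_G = nx.from_pandas_edgelist(toy_edges, source='href', target='id')

# Sort the node degrees from highest to lowest
toy_degrees = sorted((d for _, d in toy_G.degree()), reverse=True)
print(toy_degrees)
```

Because `nx.Graph` stores at most one edge between any pair of nodes, repeated rows in the input do not inflate the degrees.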

In [None]:
df.head()