# Creative Nation GTR Analysis

This notebook analyses Gateway to Reseaerch data for the Creative Nation report.

We start with a list of projects classified by Discipline (based on the Arloesiadur analysis) and we classify those projects into different categories based on the nature of the organisations involved:

`````
                Creative 
                organisation
                Y     N
=============================
Creative    Y  CI    CE 
discipline  N  EK    OTHER
=============================

```

* ```CI```: A creative organisation engaging with a creative discipline. 
* ```CE```: A non-creative organisation engaging with a creative discipline. 
* ```EK```: A creative organisation engaging with a non-creative discipline. 
* ```OTHER```: A non-creative organisation engaging with a non-creative project. 

**Definitions**

We define **creative organisations** based on DCMS SIC codes

How do we deal with creative non-commercial organisations and creative stakeholders e.g. BBC, Arts Council, Design Council etc.? Use members of the Creative Industries Council or Federation

We define **creative disciplines** based on their creative intensity (the relative importance of creative businesses in the discipline). 

**Activities**

1. Load GtR data 
   * Are we doing anything with Innovate UK and other data sources?
2. Query org names vs Companies House
3. Identify creative non-commercial orgs
4. Define creative disciplines
5. Classify projects
6. Consider options for SNA. 




# 1. Preamble

In [1]:
#Imports

% matplotlib inline

#Use this script to format some of the code as markdown
from IPython.display import Markdown, display
def printmd(string):
    display(Markdown(string))

#Utilities
import os
import urllib3
import requests
import zipfile
import json
import pickle
import datetime
import sys
import http.client, urllib.request, urllib.parse, urllib.error, base64
import re


#Data 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from itertools import combinations

import http.client, urllib.request, urllib.parse, urllib.error, base64

#Analytical
import networkx as nx
import community

#Otheer
import ratelim
from bs4 import BeautifulSoup
import chwrapper


#Paths
top = os.path.dirname(os.getcwd())

#External data (to download the GRID database)
ext_data = os.path.join(top,'data/external')

#Interim data (to place seed etc)
int_data = os.path.join(top,'data/interim')

#Figures
fig_path = os.path.join(top,'reports/figures')


#Get date for saving files
today = datetime.datetime.today()

today_str = "_".join([str(x) for x in [today.day,today.month,today.year]])

#Increase number of recursions to save a pickle file later
sys.setrecursionlimit(10000)

#Token
#Load token
with open(top+'/keys.json','r') as infile:
    tokens = json.load(infile)
ch_token = tokens['ch']



In [2]:
#Write the class to do this

class GTR_org():
    '''
    We use this class to process a set of companies from Gateway to Research.
    This involves:
        Querying CH with the company name
        Disambiguate results with the GTR company postcode
        Matching the results with the CH datadump using CH number and obtaining the SIC code
        Classifying companies into CE sectors based on their SIC code
    
    '''
    
    def __init__(self,company_name,company_post_code,company_id,token=ch_token):
        '''
        Initialise with the access token and the company name 
        '''
        #Initialise the search client
        self.token = token
    
        #Get the company name (from GTR)
        self.name = company_name
        
        #Get the company postcode (from GTR)
        self.post_code = company_post_code
        
        #Get the company ID (from GTR)
        self.id = company_id
    
    @ratelim.patient(550,time_interval=300)
    def query_ch(self):
        '''
        Query CH using the ch wrapper
        '''
        
        #Initialise the search client
        search_client = chwrapper.Search(self.token)
        
        #Search company
        search_results = search_client.search_companies(self.name)
        
        #We either get results of a failure
        
        #If we get a result
        if search_results.status_code == 200:
            self.results = search_results
            self.json = search_results.json()
            self.success = True
        
        else:
            self.results= search_results
            self.json = {'failed_query: {error}'.format(error=search_results.status_code)}
            self.success = False    
    
    def get_best_result(self):
        '''
        Extract results and get postcode matches
        
        '''
        #Only does this if the query was successful
        #This is quite long. Probably shoud create some functions and take out of here
        
        if self.success == True:
            
            #Extract all results
            items = self.json['items']
            
            #If there are any results!
            if len(items)>0:
            
                
                all_results_pc = [[x['address']['postal_code'] 
                                if 'postal_code' in x['address'].keys() else 'no_postal_code',x] for x in items if
                                 x['address']!=None]
                
                #Only returns those results with a match on postcodes
                postcode_matches = [x[1] for x in all_results_pc if x[0]==self.post_code]
                
                #How many matches on postcode do we get?
                self.number_matches = len(postcode_matches)
                
                if len(postcode_matches)>0:
                    #For now, focus on the top result returned by CH <- check_this
                    self.best_match = postcode_matches[0]

                    self.best_ch_number = postcode_matches[0]['company_number']
                    
                else:
                    self.best_match = 'no_matches'
                    self.best_ch_number = 'no_matches'
        
            else:
                self.best_match = 'no_addresses'
                self.best_ch_number = 'no_addresses'
        
        else:
            self.best_match = 'no_data'
            self.best_ch_number = 'no_data'
            
                
    
    def get_ch_data(self,ch,creative_lookup):
        
        '''
        Extract metadata from ch dataset
        
        '''
        
        #If the system hasn't failed, match vs. ch
        
        if self.best_ch_number not in ['no_data','no_matches','no_addresses']:
            
            #Fortunately, the company numbers are strings so we can match without too much hassle
            company_match = ch.loc[ch.CompanyNumber==self.best_ch_number,:]
            self.company_match = company_match
            
            if len(company_match)==0:
                company_match = 'no_ch_metadata'
                self.sic_4 = 'no_ch_metadata'
                self.is_creative = 'no_ch_metadata'
                self.creative_sector = 'no_ch_metadata'
            
            else:
                #self.ch_database_match = company_match
                
                #Extract SIC code
                sic_code = list(company_match['SICCode.SicText_1'])[0]
                
                #Perhaps pre-process this in the ch data above
                
                #There are missing values in the SIC codes
                
                if sic_code != 'None Supplied':
                    
                    #If it is not missing, separate the sector code from the text description
                    sic_number = sic_code.split(" -")[0]
                    
                    #Some of them are 4 digits and some are 5
                    if len(sic_number)<5:
                        self.sic_4 = sic_number
                    else:
                        self.sic_5 = sic_number
                        self.sic_4 = sic_number[:-1]
                        
                    #Now match with CIs:
                    if self.sic_4 in [str(x) for x in ce_lookup.keys()]:
                        self.is_creative = True
                        self.creative_sector = creative_lookup[self.sic_4]
                    else:
                        self.is_creative = False
                        self.creative_sector = 'non_creative'
                
                else:
                    self.sic_4 = 'no_sic'
                    self.is_creative = 'no_sic'
                    self.creative_sector = 'no_sic'
        else:
            self.company_match = self.best_match
            self.sic_4 = self.best_match
            self.is_creative = self.best_match
            self.creative_sector = self.best_match
                    
    def process_all(self):
        '''
        Process all the data with the 3 methods above
        
        '''
        
        self.query_ch()
        self.get_best_result()
        self.get_ch_data()
            

## 1. Load project data

In [3]:
#Load Gateway to research projects and organisations
projects = pd.read_csv(ext_data+'/gtr_projects.csv')
orgs = pd.read_csv(ext_data+'/gtr_organisations.csv')

#Addresses
with open(ext_data+'/org_geo.p','rb') as infile:
    org_lat_lng_dict = pickle.load(infile)


In [4]:
#Load the CSV data

cols_to_keep = ['CompanyName', ' CompanyNumber','RegAddress.PostCode','IncorporationDate',
                'SICCode.SicText_1', 'SICCode.SicText_2', 'SICCode.SicText_3','SICCode.SicText_4',
               'CompanyStatus','CountryOfOrigin']

#Read the data
ch_df = pd.read_csv(ext_data+'/BasicCompanyDataAsOneFile-2017-07-01.csv',usecols=cols_to_keep)
ch_df.columns = [x.strip() for x in ch_df.columns]

In [5]:
#Read DCMS codes
ce_codes = pd.read_csv(ext_data+'/ce_codes.csv')

ce_lookup = {str(x):y for x,y,z in zip(ce_codes.code,ce_codes.label2,ce_codes.type) if z=='SIC'}

## 2. Load Companies House matched data


No postcode examples are outside the UK, no address are a mix of orgs outside the UK and departments and organisations with little data.


In [6]:
#Load the data from the queries we ran
with open(int_data+'/gtr_query_results.p','rb') as infile:
    gtr_processing_results = pickle.load(infile)

## 3. Load Grid matched data


In [7]:
#Load the extract_charity data
#We will use the fuzzywuzzy fuzzy matching package
from fuzzywuzzy import fuzz
from fuzzywuzzy import process


def extract_country_institutions(grid_file,country_list):
    '''
    Extracts contact details for the institutions in a country
    country is a list (we can extract information from more than one country)
    
    
    '''
    
    #List of institutes
    institute_list = grid_file['institutes']
    
    #Some institutes have no address - they are 'redirected'.
    #We don't need their information
    
    institute_w_address = [x for x in grid['institutes'] if 'addresses' in x.keys()]
    
    #Get institutes in country
    institutes_in_country = [x for x in institute_w_address if x['addresses'][0]['country_code'] in country_list]
    
    #Return institutes
    return(institutes_in_country)

def flatten(my_list):
    '''
    Goes through elements in a list and if they are not empty, it extracts the first value. 
    Otherwise it turns values into nans.
    
    '''
    
    flat = [x[0] if len(x)>0 else np.nan for x in my_list]
    return(flat)




In [145]:
with open(ext_data+'/grid.json', 'r') as infile:
    grid = json.load(infile)

#iso code for emirates = AE
uk_institutes = extract_country_institutions(grid,['GB'])

#Get information 
uk_res_df = pd.DataFrame([[x['name'],x['addresses'][0]['city'],x['addresses'][0]['lng'],
                           x['addresses'][0]['lat'],
                          x['links'],x['wikipedia_url'],x['types']] for x in uk_institutes],
                       columns=['name','city','lon','lat','url','wikipedia','type'])

#Flatten (ie. extract elements from lists)
uk_res_df['url'] = flatten(uk_res_df['url'])
uk_res_df['type'] = flatten(uk_res_df['type'])

#Here are the UK research institutions
uk_research_institutions = uk_res_df.loc[uk_res_df.type !='Company',:]

#Lower case and stripped name
uk_research_institutions.loc[:,'name'] = [x.lower().strip() for x in uk_research_institutions['name']]
                          

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s


In [9]:
with open(int_data+'/grid_matches.p','rb') as infile:
    grid_matches = pickle.load(infile)

In [10]:
np.sum([x[2]!='no_good_match' for x in grid_matches])
#We got 639 matches here. Let's see what they are.

grid_matches_df = pd.DataFrame([[x[0],x[1],x[2][0],x[2][1]] for x in grid_matches if x[2]!='no_good_match'],
                              columns=['org_id','gtr_name','grid_name','score'])


grid_matches_df.iloc[:,:50]
#Assumption - these ones will have a massive presence in GTR. Let's see.

Unnamed: 0,org_id,gtr_name,grid_name,score
0,AEBFFA9B-9B1E-4235-96E2-ACB761119425,South West College Omagh,south west college,95
1,9D2232B9-DC18-4507-96F6-63DAE75B0284,Royal Pharmaceutical Society of Great Br,royal pharmaceutical society,95
2,C9E3BB74-B3E7-4454-9C35-8EB1373856B7,Association for the Conservation of Energy,association for the conservation of energy,100
3,58451366-6877-4B7D-B2D4-8D0416BA6DFD,The Royal Institute of Navigation,royal institute of navigation,95
4,5BB4F8BF-B4E0-4EAF-9AF5-885E19D64850,University of Strathclyde,university of strathclyde,100
5,289B3363-B06A-4E83-9F57-80E81DCB0DA2,University Hospital Southampton NHS Foundation...,university hospital southampton nhs foundation...,100
6,9152582B-272B-4BE3-8223-A92E94DF0D75,Department of Agriculture & Rural Development ...,department of agriculture and rural development,91
7,440BEB68-030A-41A9-924E-02DCF0DD71D4,The Ideas Foundation,ideas foundation,95
8,09B4D85F-B21E-4FB5-A591-07DCB5E231B0,Scottish Universities Physics Alliance,scottish universities physics alliance,100
9,7C771F1D-4232-4D26-BB06-09A91AE633E5,The Arts Catalyst,arts catalyst,95


In [11]:
#35% of all projects
np.sum([x in set(grid_matches_df['org_id']) for x in projects['org_id_short']])/len(projects)

0.34882462498741568

In [146]:
#We can combine this with the grid_data to determine the type of organisation
#Later, we can remove these from the GTR data.
grid_matches_meta_df = pd.merge(grid_matches_df,uk_research_institutions,left_on='grid_name',right_on='name')

#These are the types of organisations in the data
grid_matches_meta_df['type'].value_counts()

Healthcare    157
Government    154
Nonprofit     106
Education      85
Other          64
Facility       52
Archive        26
Name: type, dtype: int64

## 3. Combine sources

Stages

* Create DF with SIC codes
* Load SIC - sector lookup and rearrange topics
* Relabel education as academic
* Combine with university data and relabel fields
* Combine with geo data
* Load projects (with topics)
* Think strategy for analysis


In [13]:
#Initial data load: 
#Segments
industry_segments = pd.read_csv(ext_data+'/industry_cluster_lookup_feb_2017.csv',dtype='str')

industry_segments['sic_4'] = [str(x) if len(str(x))==4 else '0'+str(x) for x in industry_segments['sic_4']]


#Projects (labelled)
projects_comms = pd.read_csv(ext_data+'/3_08_2017_projects_topic_communities.csv')

In [14]:
#Create dict lookup
sic_industry_lookup = {x:y for x,y in zip(industry_segments['sic_4'],industry_segments['cluster'])}

In [15]:
#Extract data from the gtr_processing results file
#These are the CH data 
ch_matched = pd.concat([pd.DataFrame(x,index=['id','name','ch_metadata','sic','type']).T for x in gtr_processing_results])


In [34]:
#The dates are quite gnarly because they are in different formats. I had to sort them with a loop.
ch_dates = []

for num,x in enumerate(gtr_processing_results):
    if type(x[2]!=str):
        try:
            ch_dates.append(x[2]['IncorporationDate'].iloc[0])
        except:
            ch_dates.append(np.nan)
    else:
        ch_dates.append(np.nan)

In [35]:
ch_matched['inc_date'] = ch_dates

In [36]:
#Data with matched SICs
ch_matched_sics = ch_matched.loc[[x not in ['no_matches','no_addresses','no_ch_metadata','no_sic'] for x in 
                                 ch_matched.sic],:].reset_index(drop=True)


ch_matched_sics.drop(['ch_metadata','type'],axis=1,inplace=True)

In [37]:
#Now we convert them to clusters
ch_matched_sics['industry'] = [sic_industry_lookup[x] if x in sic_industry_lookup.keys() else 'no_cluster' for x in ch_matched_sics['sic']]

ch_matched_sics['domain'] = ['industry' if 'education' not in x else 'education' for x in ch_matched_sics['industry']]

In [38]:
#Makes sense. We will turn the 'services education post primary' into 'education'
ch_matched_sics['industry'].value_counts()[:20]

services_computing                 1079
services_r_&_d                      898
services_creative                   598
no_cluster                          490
services_administrative             453
manufacture_machinery_tools         382
manufacture_electronics             378
services_utilities                  370
services_professional               337
services_financial_legal            329
manufacture_furniture               326
services_kibs                       261
manufacture_food                    243
services_wholesale                  216
services_education_post_primary     206
services_recreation                 188
manufacture_chemical                163
manufacture_electrical              159
construction_construction           149
services_entertainment              149
Name: industry, dtype: int64

In [39]:
#Now we want to combine grid here

grid_match = grid_matches_meta_df[['org_id','gtr_name','type']]
grid_match.rename(columns={'org_id':'id','gtr_name':'name','type':'domain'},inplace=True)

grid_match['domain'] = [x.lower() if type(x)==str else np.nan for x in grid_match['domain']]


org_sectors = pd.concat([ch_matched_sics,grid_match])

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  **kwargs)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


# Analysis

**Actions**

1. Org with org metadata and geography
2. Create inputs for the class: edge list with distances between nodes.
3. Class with relevant methods:
 * get_topic_participants(topic)
 * get_industry_topics(industry or domain)
 * get_distance_matrix (topic)
 * Get interdisciplinary activity
 * Get interdisciplinary distance
 




In [40]:
#First thing. lat lon

import geopy
import time

geolocator = geopy.Nominatim()


In [41]:
with open(ext_data+'/org_lat_lng_dict.p','rb') as infile:
    org_lat_long = pickle.load(infile)

#Create DF
org_lat_long_df = pd.DataFrame(org_lat_long).T
org_lat_long_df.index = [x.lower() for x in org_lat_long_df.index]

In [42]:
#Join. NB we have lost quite a few observations here. Unclear why. Come back to this
org_geo = org_sectors.set_index('name').join(org_lat_long_df)

In [43]:
#Org to geo need to be geocoded too.
org_to_geo = org_geo.loc[org_geo.lon.isnull(),:]

In [49]:
orgs_with_address = orgs.loc[[len(json.loads(x)['address'])>0 for x in orgs['addresses']],:]

orgs_with_address['addresses'] = [json.loads(x) for x in orgs_with_address['addresses']]
             

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  app.launch_new_instance()


In [50]:
#Therse are organisations that need to be geocoded

orgs_with_address_selected = orgs_with_address.loc[[x.lower() in org_to_geo.index for x in 
                                                    orgs_with_address.name],:].reset_index(drop=True)

In [51]:
#Clas

def osm_series_converter(osm_json,address_keys=['postcode', 'suburb', 'city', 'university', 'county', 'country_code', 'state', 'road', 'country']):
    '''
    Takes an osm json object and returns a pandas series.
    NB it also takes a list of address keys to create an empty series in case we have missing addresses
    '''
    
    #Extract address
    try:
        series_metadata = osm_json['address']
    except:
        series_metadata = {x:np.nan for x in address_keys}
    
    for x in ['importance','lon','lat','type']:
        try:
            series_metadata[x] = osm_json[x]
        except:
            series_metadata[x] = np.nan
            
    series_out = pd.Series(series_metadata)
    return(series_out)
    

class OsmQuerier():
    '''
    Class to query the Open Street Map Nomatim API
    
    '''
    
    def __init__(self,
                 base_url="http://nominatim.openstreetmap.org/search?q={place}&format=json&addressdetails=1"):
        '''
        Reads the base url.
        
        '''
    
        self.base_url = base_url
        
    @ratelim.patient(max_calls=60,time_interval=60)
    def query_osm(self,place):
        '''
        Takes a string, queries the OSM API and returns the top result with key fields as a Df 
        
        '''
        
        #Format the string (lowercase, replace spaces with +)
        query_string = "+".join(place.lower().split(" "))
        
        self.query_string = query_string
        #Format the base_url
        url = self.base_url.format(place=query_string)
        
        #Run query
        try:
            #Run query
            response = requests.get(url)
            json_object = response.json()
            
            self.json = json_object
            
        #If we obtain an exception like this, return an error
        except requests.exceptions.RequestException as e:
            return("Error:{}".format(e))
            self.json = []
    
    def best_match_data(self):
        '''
        Loops over the json object (if there is anything to loop over) and creates a datafrae with results.
        It then returns the best result
        
        '''
        
        if len(self.json)<1:
            output = osm_series_converter(self.json)
            output['n_results'] = len(self.json)
            output['query'] = self.query_string
            self.parsed_output = output
            self.best_output = output
        
        else:
            output = [osm_series_converter(x) for x in self.json]
            
            concat_output= pd.concat(output,axis=1).T.reset_index(drop=False)
            concat_output['n_results'] = len(self.json)
            concat_output['query'] = self.query_string
            
            self.parsed_output = concat_output
            self.best_output = concat_output.sort_values('importance',ascending=False).iloc[0,:]
        
        
        #out = [osm_series_converter(x) for x in ]
        
    def run_pipeline(self,place,num):
        '''
        Runs the things above
        '''
        
        #Counter
        if num %100 == 0:
            print(num)
        
        self.query_osm(place)
        self.best_match_data()
        
        return(self)
        
        

In [62]:
orgs_with_address_selected['address_to_geocode'] = [" ".join([
    val.lower() for key,val in x[
        'address'][0].items() if 'line' in key])+" "+ x[
    'address'][0]['postCode'].lower() if 'postCode' in x['address'][0].keys() else "" for x in 
                                                    orgs_with_address_selected['addresses']]


In [65]:
orgs_with_address_selected['address_to_geocode'] = [re.sub(
    '\n','',x) for x in orgs_with_address_selected['address_to_geocode']]


In [66]:
addresses_to_geocode = [re.sub('\n','',x) for x in orgs_with_address_selected.address_to_geocode]

addresses_geocoded = []

for num,x in enumerate(addresses_to_geocode):
    
    try:
        output= OsmQuerier().run_pipeline(x,num)
        addresses_geocoded.append(output)
        
    except:
        addresses_geocoded.append('error')
    

0
100
200
300
400
500
600
700
800
900
1000
1100
1200
1300
1400
1500
1600
1700
1800
1900
2000
2100
2200
2300
2400
2500
2600
2700
2800
2900
3000
3100
3200
3300
3400
3500
3600


In [78]:
#8 errors
error_indices = [num for num,x in enumerate(addresses_geocoded) if x=='error']
error_indices

[1782, 1786, 1787, 1788, 2324, 2400, 2477, 2484, 2489, 3547]

In [79]:
for x in error_indices:
    print(addresses_to_geocode[x])

bristol and bath science parkemersons green bs16 7fr
21 hawthornvale bill scott sculpture centre eh6 4jt
sruc edinburgh campus king's buildings , west mains road eh9 3jg
lansdowne house , berkeley square w1j 6er
23 king street , suite #280 cb1 1ah
net park , thomas wright way ts21 3fd
the cam centre , wilbury way sg4 0tw
oddfellows house , 19 newport road cf24 0aa
5 tenterfield closegreenfieldoldham ol3 7fp
future business centre , kings hedges road cb4 2hy


In [91]:
#Outputs
#Extract addresses
addresses_geocoded_df = pd.concat([x.best_output if x!='error' else pd.DataFrame({'error':True},index=[0]).T for x in addresses_geocoded],
                               axis=1).T[['country_code','country','lat','lon','n_results','query']]

#addresses_geocoded[0].best_output
addresses_geocoded_df.to_csv(int_data+'/{today}_gtr_org_addresses_geocoded.csv'.format(today=today_str))

In [81]:
#This df contains all the geocoded addresses
orgs_geocoded_df = pd.concat([orgs_with_address_selected.reset_index(drop=True),
                              addresses_geocoded_df.reset_index(drop=True)],axis=1)

In [82]:
#We have managed to geocode another 1345
np.sum(orgs_geocoded_df['lat'].isnull()==False)

1348

In [83]:
#Combine all orgs including those that we geocoded and those that we didn't.

#Processing the organisations that originally had geocoded data
org_geo_to_merge = org_geo.dropna(axis=0,subset=['lat']).reset_index(drop=False).set_index('id')
org_geo_to_merge.rename(columns={'index':'name'},inplace=True)

#Processing the organisations that *now* have geocoded data
org_new_geo_to_merge = pd.merge(org_to_geo.reset_index(drop=False)[
    ['id','domain','index','inc_date','industry','sic']],
                                orgs_geocoded_df,
                                    left_on='id',right_on='id').reset_index(
    drop=False)[['id','name', 'domain', 'inc_date', 'industry', 'sic', 'lat', 'lon']].set_index('id')

org_new_geo_to_merge.rename(columns={'index':'name'},inplace=True)


final_org_set = pd.concat([org_geo_to_merge,org_new_geo_to_merge])    
final_org_set.rename(columns={'index':'name'},inplace=True)

final_org_set = final_org_set.dropna(subset=['lat'],axis=0)

In [178]:
grid_matches_meta_df['domain'] = grid_matches_meta_df['type']

grid_matches_not_geocoded = grid_matches_meta_df.loc[[x not in final_org_set.index for x in
                                                     grid_matches_meta_df['org_id']],
                                                    ['org_id','gtr_name','lon','lat','domain']]

grid_matches_not_geocoded.rename(columns={'gtr_name':'name'})

final_org_set_2 = pd.concat([final_org_set,grid_matches_not_geocoded.set_index('org_id')])

final_org_set_2 = final_org_set_2.dropna(subset=['lat'],axis=0)

In [179]:
#Calculate distances
#Create a network where every edge is a pair of organisations (we keep the id for )

projects_by_id = projects.groupby('project_id')['org_id_short'].apply(lambda x: list(x)).reset_index(drop=False)

#Create edges
edge_list = pd.concat([pd.concat(
    [pd.DataFrame(list(combinations(set(x),2)),columns=['e1','e2']),
     pd.DataFrame({'pid':[y]*len(list(combinations(set(x),2)))})],axis=1) for x,y in
                       zip(projects_by_id['org_id_short'],
                          projects_by_id['project_id'])])

In [180]:
#Now we want to, for each pair of obs, obtain the distance.

final_org_coords = {x:(lat,lon) for x,lat,lon in zip(final_org_set_2.index,
                                                    final_org_set_2.lat,
                                                    final_org_set_2.lon)}

In [181]:
#Focus on 
edge_list_with_coords = edge_list.iloc[[x in final_org_coords.keys() for x in edge_list['e1']],:]
edge_list_with_coords = edge_list_with_coords.iloc[[x in final_org_coords.keys() for x in edge_list_with_coords['e2']],:]


In [182]:
from geopy.distance import vincenty

In [183]:
%%time
#Extract distances
edge_list_with_coords['distances'] = [vincenty(final_org_coords[x],final_org_coords[y]).miles 
                                      for x,y in zip(edge_list_with_coords['e1'],
                                                     edge_list_with_coords['e2'])]

CPU times: user 2.46 s, sys: 12.6 ms, total: 2.48 s
Wall time: 2.48 s


In [184]:
edge_list_with_coords.distances.describe()

count    55540.000000
mean       116.058306
std         96.774554
min          0.000000
25%         45.381371
50%         96.065011
75%        163.145014
max        612.517563
Name: distances, dtype: float64

In [185]:
#Project allocations to topics
project_allocs = pd.concat([projects['project_id'],
                            projects.loc[:,[x for x in projects.columns if 'tc_' in x]].apply(lambda x: x.idxmax(),
                                                                                            axis=1)],axis=1)

#Merge with the edge list

edge_list_with_coords = pd.merge(edge_list_with_coords,project_allocs,
                                left_on='pid',right_on='project_id')

edge_list_with_coords.rename(columns={0:'field'},inplace=True)


In [186]:
#Mean/median distances
edge_list_with_coords.groupby('field')['distances'].aggregate(['mean','median']).sort_values('median')

Unnamed: 0_level_0,mean,median
field,Unnamed: 1_level_1,Unnamed: 2_level_1
tc_architecture_spatial,45.968819,4.104008
tc_military_history,33.506631,4.752904
tc_health_social_issues,33.695155,4.885052
tc_philosophy,25.819364,6.028743
tc_classic_religious_history,48.148793,12.886244
tc_visual,66.007181,15.489025
tc_migration_colonial_history,46.789327,34.049160
tc_linguistics,52.411309,44.595181
tc_computer_vision,58.792582,48.203597
tc_gender_sexuality,96.169856,50.085592


In [188]:
#Subset the edge list by type of organisation
org_lookup = {x:{'domain':d,'industry':ind} for x,d,ind in zip(final_org_set_2.index,
                                                               final_org_set_2.domain,
                                                               final_org_set_2.industry)}

#How is this going to work?
edge_list_with_coords['n1_domain'] = [org_lookup[x]['domain'] for x in edge_list_with_coords['e1']]
edge_list_with_coords['n2_domain'] = [org_lookup[x]['domain'] for x in edge_list_with_coords['e2']]

edge_list_with_coords['n1_industry'] = [org_lookup[x]['industry'] for x in edge_list_with_coords['e1']]
edge_list_with_coords['n2_industry'] = [org_lookup[x]['industry'] for x in edge_list_with_coords['e2']]



edge_list_with_coords['combination'] = ["_".join(sorted([x,y])) for x,y in zip(edge_list_with_coords['n1_domain'],
                                                                      edge_list_with_coords['n2_domain'])]




In [189]:
distances_org_types = edge_list_with_coords.groupby(['field','combination'])['distances'].median().reset_index(drop=False)

In [190]:
distances_wide = pd.pivot_table(distances_org_types,index='field',columns='combination',values='distances')

In [196]:
#Ok there might be something here
distances_wide.fillna(value=0,inplace=True)
distances_wide.sort_values('industry_industry')

combination,Archive_Archive,Archive_Education,Archive_Facility,Archive_Government,Archive_Healthcare,Archive_Nonprofit,Archive_Other,Archive_education,Archive_industry,Education_Education,...,Nonprofit_Nonprofit,Nonprofit_Other,Nonprofit_education,Nonprofit_industry,Other_Other,Other_education,Other_industry,education_education,education_industry,industry_industry
field,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
tc_linguistics,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,97.850255,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,44.595181,0.000000
tc_political_history,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,116.144308,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,142.282204,91.522971,0.000000
tc_migration_colonial_history,0.000000,0.000000,0.000000,0.000000,0.000000,1.973122,0.000000,43.191862,2.759849,0.000000,...,0.000000,0.000000,43.600971,1.211095,0.000000,0.000000,0.000000,30.665236,42.371503,0.761819
tc_sociology,0.000000,44.234891,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,62.895610,131.600284,...,0.000000,171.896840,331.330797,40.239347,0.000000,0.000000,13.928117,288.932716,128.655775,1.455329
tc_military_history,0.000000,0.577514,0.000000,0.000000,0.000000,0.000000,0.000000,0.464574,0.000000,151.293732,...,117.504478,0.000000,41.218614,58.871721,0.000000,0.000000,0.000000,195.714331,1.186010,1.859295
tc_philosophy,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,35.563683,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,82.864390,4.332329,3.029536
tc_architecture_spatial,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,86.090404,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,2.245478,3.481790
tc_visual,0.000000,109.743719,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,162.995168,0.000000,89.456683,11.765046
tc_gender_sexuality,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,187.527885,84.605068,11.991397
tc_criminology_violence,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,82.069953,110.261158,104.796916,0.000000,0.000000,22.741454,0.000000,100.122925,16.170040


In [135]:
#distances_wide.sort_values('education_industry')

In [136]:
#Next steps
#Revise the geocoding - why do so many educational/policy orgs we identified through grid drop out?
#Consider differences in geo-distance between projects involving more than 1 discipline (taking into account
#the discipline distances distance). This would almost be like a two by two comparing projects with
#more than x disciplines where the average distance between disciplines is y with disciplines with more than x
#collabs where the average distance is w


In [138]:
#Could I create another network of collaboration between industries
#Could I visualise the geography of this network?
#Subsetting the network for local and wider collaborations?
