# Gateway to Research

This notebook loads and shows the Gateway to Research (GtR) data.

Check this [repo](https://github.com/nestauk/gtr_data_processing) for additional information about the GtR data.

# Preamble

In [None]:
%run notebook_preamble.ipy

In [None]:
# Helper functions

def flatten_list(a_list):
    """Flatten a list of lists into a single flat list."""
    return [x for el in a_list for x in el]

# Load data

In [None]:
my_path = '../../../ai_analysis/data/processed/19_7_2019_gtr_processed.csv'

# na_values='[]' treats empty lists as missing; .iloc[:, 1:] drops the first column
gtr = pd.read_csv(my_path, compression='zip', na_values='[]').iloc[:, 1:]

In [None]:
import re
from ast import literal_eval

# Hacky way to find columns with lists-as-strings that we need to parse into lists:
# check whether the first row of each column contains a '['
is_list = [col for n, col in enumerate(gtr.columns) if '[' in str(gtr.iloc[0, n])]

# Now we parse the lists. As part of this we need to replace nans in the list with
# some other value (literal_eval doesn't know how to parse missing values)
for c in is_list:
    gtr[c] = [literal_eval(re.sub(' nan ', 'missing', x)) if pd.notnull(x) else np.nan
              for x in gtr[c]]
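
The parsing step above can be sketched on a toy string using only the standard library. The helper name `parse_list_string`, the placeholder and the sample values are illustrative assumptions rather than part of the notebook. Unlike the cell above, this sketch quotes the placeholder so that `literal_eval` receives a valid string literal.

```python
import re
from ast import literal_eval


def parse_list_string(value, placeholder="missing"):
    """Parse a list stored as a string, e.g. "['a', nan, 'b']".

    Bare `nan` tokens are not valid Python literals, so they are replaced
    with a quoted placeholder before calling literal_eval.
    Non-string inputs (for example a float NaN from a missing cell) are
    returned unchanged.
    """
    if not isinstance(value, str):
        return value
    # Match `nan` only when it is not part of a word or directly next to a quote.
    # Limitation: a free-standing `nan` inside a longer quoted string would
    # still be replaced and break the parse.
    cleaned = re.sub(r"(?<![\w'\"])nan(?![\w'\"])", repr(placeholder), value)
    return literal_eval(cleaned)


# Hypothetical cell value with one missing entry
parsed = parse_list_string("['E06000001', nan, 'E06000002']")
print(parsed)
```

A function like this could be mapped over each column in `is_list`, although the exact form of the missing values in the real file would need checking first.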

In [None]:
gtr.shape

In [None]:
gtr.head()

## Some features of the data

#### a. The data only covers 2007 to 2018. We have more recent data but for now we are focusing on 'full years'

In [None]:
# np.arange excludes the stop value, so we need 2019 to include 2018
gtr.year.value_counts().loc[np.arange(2007, 2019)]

#### b. The data includes all research councils and Innovate UK

In [None]:
gtr.funder.value_counts()

#### c. There are several grant categories, but most projects are plain grants.

In [None]:
gtr.grant_category.value_counts()

#### d. The `out_` prefix refers to project outputs. These come from a merge of the GtR projects table with an outputs table

* out_prod: products (mostly clinical etc.)
* out_tech: technologies (mostly software)
* out_spin: spinouts
* pubs: papers (most popular)
* db: databases

Each of these categories has its own table with metadata, and there are others (e.g. cultural products).

In [None]:
out = [x for x in gtr.columns if 'out_' in x]

gtr[out].sum()

#### e. Research topics and activities are user-generated labels. We have used them to create a labelled dataset to classify projects into disciplines.

`disc_*` give the probabilities, and the names are self-explanatory (env is environmental)

`sel_disc` is the top discipline for a project

In [None]:
gtr['sel_disc'].value_counts()
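
The relationship between the `disc_*` probabilities and a top discipline can be illustrated on toy data. This is only a sketch: it assumes the top discipline is the column with the highest probability, which the notebook does not show, and the column names, values and the `top_disc` name are made up.

```python
import pandas as pd

# Toy probabilities; column names and values are hypothetical
toy = pd.DataFrame({
    "disc_env": [0.7, 0.1, 0.2],
    "disc_eng": [0.2, 0.6, 0.3],
    "disc_arts": [0.1, 0.3, 0.5],
})

disc_cols = [c for c in toy.columns if c.startswith("disc_")]

# idxmax(axis=1) returns, for each row, the label of the column with the largest value
toy["top_disc"] = toy[disc_cols].idxmax(axis=1)
print(toy["top_disc"].tolist())
```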

#### f. `ind_*` does as above but for industries based on an ML analysis with a labelled industry dataset

`sel_industry` has the top industry for each project

**WARNING** These predictions are experimental

`computing`, `creative`, `content`, `cultural`, `entertainment` and `publishing` capture the creative industries SIC codes

In [None]:
gtr['sel_industry'].value_counts()

#### g. `sdg_*` does as above but for SDGs

**WARNING** The predictions for SDGs are very noisy. This model needs to be significantly improved before being used for analysis


#### h. ORG contains ids for organisations participating in projects. Not that useful unless matched with the relevant GtR table

#### i. The `*_lad_code` and `*_lad_name` variables contain local authority district codes and names for organisations participating in projects

lead = lead organisation (there is only one per project, they are generally academic institutions)

all = all organisations

involved = all organisations involved except the lead one


#### j. The `scot` variables contain geo information which is relevant for the Scotland project

#### k. (finally!) AI and AI mod (`ai_mod`) are the AI booleans.

AI is based on an analysis that only considered research grants and therefore did not cover Innovate UK.

AI mod considers all projects, so I suggest using the latter.

In [None]:
# What is the distribution of AI over disciplines?
# normalize='columns' (same as normalize=1) makes each ai_mod column sum to 1

pd.crosstab(gtr['sel_disc'], gtr['ai_mod'], normalize='columns').plot.bar()

In [None]:
# What is the distribution of AI over industries?
# Sorted by the share of AI projects (the True column)

pd.crosstab(gtr['sel_industry'], gtr['ai_mod'], normalize='columns').sort_values(True).plot.barh(figsize=(5, 8))
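
Column normalisation (`normalize=1`, equivalently `normalize='columns'`) divides each column of the crosstab by its total. The bars therefore show how AI and non-AI projects are each distributed over categories, not the share of AI within a category. A small sketch with made-up values:

```python
import pandas as pd

# Toy data; the discipline labels and flags are hypothetical
toy = pd.DataFrame({
    "sel_disc": ["eng", "eng", "env", "env", "env", "arts"],
    "ai_mod": [True, False, False, False, True, False],
})

# Each column (False / True) is divided by its own total, so columns sum to 1
shares = pd.crosstab(toy["sel_disc"], toy["ai_mod"], normalize="columns")
print(shares)

# For comparison: normalising by row gives the AI share within each discipline
within = pd.crosstab(toy["sel_disc"], toy["ai_mod"], normalize="index")
print(within)
```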

#### l. `companyname` and `cluster` come from the matches between GtR organisations and Companies House.

The `clusters` are the same sector categories we used in the industry ML analysis that I mentioned above

In [None]:
pd.Series(flatten_list(gtr['companyname'].dropna())).value_counts()[:20]

Note that this is our first analysis using this matched dataset so we are very likely to find errors. In particular, the university-related matches above are probably not that useful as the names have been matched with the Companies House presence of universities, which as we see are Technology Transfer Offices and so forth. 

In [None]:
pd.Series(flatten_list(gtr['cluster'].dropna())).value_counts()[:35]

A look at the sectors confirms this: we see some presence of various creative SIC codes

### And to conclude, some statistics about creative activity related to AI

In [None]:
creative_sector_names = ['creative', 'content', '_cultural', 'computing', 'entertainment', 'publishing']

# Focusing on semantic analysis: flag projects whose top predicted industry is creative
gtr['creative_flag_semantic'] = [any(val in x for val in creative_sector_names) if
                                 pd.notnull(x) else np.nan for x in gtr['sel_industry']]

# Cross-tabulate the flag created above (the original cell referenced `creative_flag`)
pd.crosstab(gtr['creative_flag_semantic'], gtr['ai_mod'])

Half of the organisations involved in AI are creative - but then you knew that ;-)

In [None]:
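# Focusing on organisations

# Count the clusters of all organisations matched to AI projects
ai_org_counts = pd.Series(flatten_list(gtr.loc[gtr['ai_mod'] == True, 'cluster'].dropna())).value_counts()

# Keep only the clusters whose name contains one of the creative sector names
ai_org_counts.loc[[x for x in ai_org_counts.index if any(val in x for val in creative_sector_names)]]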
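
The flatten-then-count pattern used in the last few cells can be sketched on toy data. The cluster names, flags and the shortened `creative_sector_names` list below are made up for illustration; only `flatten_list` mirrors the helper defined in the preamble.

```python
import numpy as np
import pandas as pd


def flatten_list(a_list):
    """Flatten a list of lists into a single flat list."""
    return [x for el in a_list for x in el]


# Toy data: each project has a list of clusters, or NaN when nothing was matched
toy = pd.DataFrame({
    "ai_mod": [True, True, False, True],
    "cluster": [["computing", "publishing"], ["computing"], ["retail"], np.nan],
})
creative_sector_names = ["creative", "content", "computing", "publishing"]

# Keep AI projects, drop unmatched rows, then count every cluster occurrence
ai_clusters = toy.loc[toy["ai_mod"], "cluster"].dropna()
counts = pd.Series(flatten_list(ai_clusters)).value_counts()

# Substring match against the creative sector names
creative_counts = counts[[name for name in counts.index
                          if any(val in name for val in creative_sector_names)]]
print(creative_counts)
```

In recent pandas versions, `Series.explode()` followed by `value_counts()` is a built-in alternative to the `flatten_list` helper.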
