# Topic Modeling Analysis - CRISPR
This Python notebook shows how to use the [Dimensions Analytics API](https://www.dimensions.ai/dimensions-apis/) to extract [concepts](https://docs.dimensions.ai/dsl/functions.html?highlight=concept#function-extract-concepts) from grants and use them as the basis for more advanced topic analysis tasks.

Date: 2020-03-26

## Load libraries and log in

In [0]:
# @markdown Click the 'play' button on the left (or shift+enter) after entering your API credentials

username = "" #@param {type: "string"}
password = "" #@param {type: "string"}
endpoint = "https://app.dimensions.ai" #@param {type: "string"}


!pip install dimcli tqdm plotly -U --quiet
import dimcli
from dimcli.shortcuts import *
dimcli.login(username, password, endpoint)
dsl = dimcli.Dsl()

#
# load common libraries
import time
import sys
import json
import pandas as pd
from pandas import json_normalize
import tqdm.notebook as tqdm

#
# charts libs
import plotly.express as px
if 'google.colab' not in sys.modules:
  # make js dependencies local / needed by html exports
  from plotly.offline import init_notebook_mode
  init_notebook_mode(connected=True)

DimCli v0.6.6.3 - Succesfully connected to <https://app.dimensions.ai> (method: manual login)


## Grants analysis #1: 2015:2016

For the purpose of this exercise, we search for grants that mention "gene editing" in their title or abstract and were active in 2015 or 2016. Feel free to change the query below, e.g. by using a different search phrase or year range.


In [0]:
q = """
search grants in title_abstract_only for "gene editing"
    where active_year in [2015:2016]
    return grants [id+concepts]
"""
print("===\nRetrieving Grants data.. ")
data = dsl.query_iterative(q)

print("===\nExtracting concepts.. ")
concepts = data.as_dataframe_concepts()
concepts_unique = concepts.drop_duplicates("concept")
print("===\nFound (total):", len(concepts))
print("===\nFound (unique):", len(concepts_unique))

===
Retrieving Grants data.. 
1000 / 1814
1814 / 1814
===
Extracting concepts.. 
===
Found (total): 138267
===
Found (unique): 60450
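
The gap between the total and unique counts arises because each row of `concepts` is a (grant, concept) pair, so a concept shared by several grants appears several times. The toy sketch below illustrates this with made-up data. The column names `id` and `rank` are assumptions (only `concept`, `frequency` and `rank_avg` are used in this notebook), and the groupby aggregation only illustrates what the per-concept values mean; it is not necessarily how dimcli computes them.

```python
import pandas as pd

# Made-up (grant, concept) rows; "id" and "rank" are assumed column names.
rows = pd.DataFrame({
    "id": ["g1", "g1", "g2", "g2", "g3"],
    "concept": ["gene editing", "crispr", "gene editing", "zebrafish", "gene editing"],
    "rank": [1, 2, 3, 1, 2],
})

# Per-concept statistics: how many rows mention the concept, and its mean rank.
stats = rows.groupby("concept")["rank"].agg(frequency="count", rank_avg="mean")
rows = rows.join(stats, on="concept")

# One row per concept, mirroring `concepts_unique` above.
unique = rows.drop_duplicates("concept")

print("total:", len(rows))
print("unique:", len(unique))
```

Because `frequency` and `rank_avg` are repeated on every row of a given concept, dropping duplicates on `concept` loses no per-concept information.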


### Concepts with high `frequency`

In [0]:
temp = concepts_unique.sort_values("frequency", ascending=False)
px.bar(temp[:50], x="concept", y="frequency", color="rank_avg")

### Concepts with high `frequency` and high `rank_avg`

NOTE: rank goes from 1 to N, where 1 is the highest rank, so a low `rank_avg` value means the concept tends to be ranked highly.


In [0]:
temp = concepts_unique.query("rank_avg < 5").sort_values("frequency", ascending=False)
px.bar(temp[:50], x="concept", y="frequency", color="rank_avg", height=600)
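
The filter used in the cell above can be tried on a small made-up table to see its effect. This is a minimal sketch: the concepts and numbers are invented for illustration, and only the column names come from this notebook.

```python
import pandas as pd

# Made-up per-concept table using the column names from this notebook.
toy = pd.DataFrame({
    "concept": ["gene editing", "patients", "crispr", "study"],
    "frequency": [120, 300, 80, 250],
    "rank_avg": [2.5, 40.0, 4.0, 55.0],
})

# Keep concepts whose average rank is better (numerically lower) than 5,
# then list the most frequent ones first.
top = toy.query("rank_avg < 5").sort_values("frequency", ascending=False)
print(top["concept"].tolist())
```

Generic terms such as "patients" or "study" can be very frequent yet poorly ranked, which is why combining the two criteria tends to surface more specific concepts.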

## Grants analysis #2: 2017:2018

In [0]:
q = """
search grants in title_abstract_only for "gene editing"
    where active_year in [2017:2018]
    return grants [id+concepts]
"""
print("===\nRetrieving Grants data.. ")
data = dsl.query_iterative(q)

print("===\nExtracting concepts.. ")
concepts = data.as_dataframe_concepts()
concepts_unique = concepts.drop_duplicates("concept")
print("===\nFound (total):", len(concepts))
print("===\nFound (unique):", len(concepts_unique))

===
Retrieving Grants data.. 
1000 / 3361
2000 / 3361
3000 / 3361
3361 / 3361
===
Extracting concepts.. 
===
Found (total): 254591
===
Found (unique): 101244
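
The two queries in this notebook differ only in their year range, so the query string could be generated by a small helper. `build_query` below is a hypothetical function and not part of dimcli; it only formats the same DSL text used above.

```python
def build_query(phrase: str, start_year: int, end_year: int) -> str:
    """Return a DSL grants query for `phrase`, limited to an active-year range."""
    if start_year > end_year:
        raise ValueError("start_year must not be after end_year")
    # Note: `phrase` is inserted verbatim, so it must not contain double quotes.
    return f'''
search grants in title_abstract_only for "{phrase}"
    where active_year in [{start_year}:{end_year}]
    return grants [id+concepts]
'''

q = build_query("gene editing", 2017, 2018)
print(q)
```

The resulting string can be passed to `dsl.query_iterative(q)` exactly as in the cells above.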


### Concepts with high `frequency`

In [0]:
temp = concepts_unique.sort_values("frequency", ascending=False)
px.bar(temp[:50], x="concept", y="frequency", color="rank_avg")

### Concepts with high `frequency` and high `rank_avg`

NOTE: rank goes from 1 to N, where 1 is the highest rank, so a low `rank_avg` value means the concept tends to be ranked highly.


In [0]:
temp = concepts_unique.query("rank_avg < 5").sort_values("frequency", ascending=False)
px.bar(temp[:50], x="concept", y="frequency", color="rank_avg", height=600)
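
To compare the two periods directly, the per-concept tables can be merged on `concept`. The sketch below uses made-up data and assumes the two `concepts_unique` dataframes were kept under separate names (in the cells above the second analysis overwrites the first); `period_1` and `period_2` are hypothetical stand-ins.

```python
import pandas as pd

# Made-up per-concept tables standing in for the two `concepts_unique` frames.
period_1 = pd.DataFrame({
    "concept": ["crispr", "zebrafish", "talen"],
    "frequency": [80, 30, 25],
})
period_2 = pd.DataFrame({
    "concept": ["crispr", "zebrafish", "base editing"],
    "frequency": [200, 35, 40],
})

# An outer merge keeps concepts that occur in only one period;
# their missing counts are filled with 0.
merged = period_1.merge(
    period_2, on="concept", how="outer",
    suffixes=("_2015_2016", "_2017_2018"),
).fillna(0)

# Positive values mean the concept became more frequent in the later period.
merged["change"] = merged["frequency_2017_2018"] - merged["frequency_2015_2016"]
merged = merged.sort_values("change", ascending=False)
print(merged)
```

Since the second period contains almost twice as many grants (3361 versus 1814), dividing each frequency by the number of grants in its period before comparing may give a fairer picture than raw counts.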