# InterMineR Workshop Use Case



We are going to use the R API to re-create the workflow we previously ran in the web interface.

The basic steps are:  

1. Load the InterMineR library and choose an InterMine to query.
2. Query 1: Diabetes Genes: Fetch a list of genes that are associated with diabetes.
3. Query 2: PAX6 + Pancreas: Fetch a list of genes that have medium or high expression in the pancreas and are in our PAX6 targets list
4. Intersection: Find which genes are present in _both_ Query 1 and Query 2.
5. GWAS: Compare the intersection of the previous query with results from GWAS studies.

### Getting started - Load InterMineR and choose an InterMine

Load the InterMineR library. If it's not already installed, visit
https://bioconductor.org/packages/release/bioc/html/InterMineR.html and follow the installation instructions.
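As a rough sketch of the usual Bioconductor installation route (this assumes a recent version of R with access to CRAN and Bioconductor; the package page above remains the authoritative source):

```r
# Install BiocManager from CRAN if it is missing,
# then use it to install InterMineR if that is missing too.
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
if (!requireNamespace("InterMineR", quietly = TRUE)) {
  BiocManager::install("InterMineR")
}
```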


In [10]:
library(InterMineR)

We want to query human data - so let's look and see what InterMines are available: 

In [11]:
listMines()

Okay, let's select HumanMine from the list:

In [12]:
humanMine <- listMines()["HumanMine"] #select humanmine
humanMine #print out the value to see what's inside

Okay, now let's tell InterMineR that we want to use HumanMine for our queries.

**Important:** you'll need an API token for this part so you can access your HumanMine account. You can get your token by logging into [HumanMine](http://www.humanmine.org/) and going to the account details tab within MyMine. Copy and paste your token into the code below.

In [9]:
im <- initInterMine(mine = humanMine, token = "YOUR TOKEN HERE")
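
If you'd rather not paste the token directly into a notebook you might share, one option is to read it from an environment variable. This is only a sketch: the helper `read_token` and the variable name `HUMANMINE_TOKEN` are made up for this example and are not part of InterMineR.

```r
# Read an API token from an environment variable instead of hard-coding it.
# "HUMANMINE_TOKEN" is a hypothetical variable name chosen for this sketch.
read_token <- function(var = "HUMANMINE_TOKEN") {
  token <- Sys.getenv(var, unset = "")
  if (!nzchar(token)) {
    stop("Environment variable ", var, " is not set")
  }
  token
}

# Usage (requires the variable to be set, e.g. in ~/.Renviron):
# im <- initInterMine(mine = humanMine, token = read_token())
```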

### First Query: Diabetes Genes

Our first query will select all human genes that are associated with diabetes. This will require two constraints:

1. Ensure all genes returned are `Homo sapiens` genes (HumanMine contains some non-human genes for homology query purposes).
2. Restrict results to genes that are associated with `diabetes`.

In [17]:
query1Diabetes <- setQuery( 
  # here we're choosing which columns of data we'd like to see
  select = c("Gene.primaryIdentifier", "Gene.symbol"),
  # set the constraints. Each constraint is one path + operator + value, matched by position:
  # the first constraint is Gene.organism.name = Homo sapiens, and the second constraint
  # is Gene.diseases.name CONTAINS diabetes
  where = setConstraints(
    paths = c("Gene.organism.name", "Gene.diseases.name"),
    operators = c("=", "CONTAINS"),
    values = list("Homo sapiens","diabetes")
  )
)
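
To make the positional pairing of `paths`, `operators` and `values` easier to see, here is a small base-R sketch that lines the three vectors up side by side. It is purely illustrative: it doesn't call InterMineR, and the names `constraints` and `readable` are introduced only for this example.

```r
# The same three vectors we passed to setConstraints() above
paths     <- c("Gene.organism.name", "Gene.diseases.name")
operators <- c("=", "CONTAINS")
values    <- c("Homo sapiens", "diabetes")

# Each row is one constraint: the i-th path, operator and value belong together
constraints <- data.frame(path = paths, operator = operators, value = values,
                          stringsAsFactors = FALSE)
print(constraints)

# A human-readable form of each constraint
readable <- paste(constraints$path, constraints$operator, constraints$value)
print(readable)
```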

Okay, we've set the query up, so now let's actually run it:

In [22]:
query1DiabetesResults <- runQuery(im,query1Diabetes)

# and let's print out the first few results to make sure it looks like we'd expect:
head(query1DiabetesResults)

| Gene.primaryIdentifier | Gene.symbol |
|---|---|
| `<chr>` | `<chr>` |
| 100188782 | NIDDM4 |
| 1056 | CEL |
| 10644 | IGF2BP2 |
| 11132 | CAPN10 |
| 1234 | CCR5 |
| 1493 | CTLA4 |


### Query 2: PAX6 targets that have medium or high expression in the pancreas

This time we're creating another query, but with slightly more complex constraints. We're looking for genes that are in the public HumanMine list `PL_Pax6_Targets`, that are also expressed in the pancreas at a `High` or `Medium` level. 

We'll need a few more **constraints** than we did in Query 1:  
1. all `Gene`s should be in the list `PL_Pax6_Targets`
2. `Gene.proteinAtlasExpression.tissue.name` should be equal to `Pancreas`
3. `Gene.proteinAtlasExpression.level` should be set to `High` OR `Medium`. This will require two constraints, one for each of medium and high.

We'd also like to see a few more **columns** this time: 
1. The `Gene`'s `primaryIdentifier` and `symbol`
2. The following expression data from Protein Atlas: 
    - `Gene.proteinAtlasExpression.cellType`
    - `Gene.proteinAtlasExpression.level`
    - `Gene.proteinAtlasExpression.tissue.name`

In [25]:
# Create a new query
query2UpInPancreas = newQuery(
  #here we're choosing which columns of data we'd like to see
  view = c("Gene.primaryIdentifier",
             "Gene.symbol",
             "Gene.proteinAtlasExpression.cellType",
             "Gene.proteinAtlasExpression.level",
             "Gene.proteinAtlasExpression.tissue.name"
  ),
  # set the logic for combining constraints (the codes A-D are explained in the setConstraints call below)
  constraintLogic = "A and (B or C) and D"
)

# If we ran the query above, it'd show us *all* genes and their expression. 
# Let's narrow it down a little by constraining it to genes that are of interest
query2UpInPancreasConstraint = setConstraints(
  paths = c("Gene", 
            "Gene.proteinAtlasExpression.level", 
            "Gene.proteinAtlasExpression.level", 
            "Gene.proteinAtlasExpression.tissue.name"),
  operators = c("IN", rep("=", 2), "="),
  # each constraint is automatically given a code, allowing us to manipulate the 
  # logic for the constraint. 
  # Below, the constraints are set to codes A, B, C, D in order, 
  #  e.g. Code A: "Gene" should be "IN" the list named "PL_Pax6_Targets"
  #       Code B: "Gene.proteinAtlasExpression.level" should be equal to "Medium"
  #       Code C: "Gene.proteinAtlasExpression.level" should be equal to "High"
  #       Code D: "Gene.proteinAtlasExpression.tissue.name" should be equal to "Pancreas"
  # 
  # Now, you might be thinking "how can the expression level be equal to both Medium AND High?"
  # and the answer is - it can't, but take a quick look at the constraintLogic we set earlier - 
  # (B or C) makes it clear that we want one or the other (but not, for instance, Low) 
  values = list("PL_Pax6_Targets", "Medium", "High", "Pancreas")
)

# Add the constraint to our expressed pancreas query (previously we just _defined_ the constraint)
query2UpInPancreas$where <- query2UpInPancreasConstraint
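
To see what `A and (B or C) and D` does, here is a base-R sketch that applies the same logic to a tiny, made-up table. The genes, levels and tissues below are invented for illustration and are not HumanMine data; the real filtering happens on the InterMine server.

```r
# Toy rows standing in for gene/expression records; all values are made up
toy <- data.frame(
  symbol  = c("geneA", "geneB", "geneC", "geneD", "geneE"),
  in_list = c(TRUE, TRUE, TRUE, FALSE, TRUE),
  level   = c("High", "Medium", "Low", "High", "High"),
  tissue  = c("Pancreas", "Pancreas", "Pancreas", "Pancreas", "Liver"),
  stringsAsFactors = FALSE
)

A <- toy$in_list               # gene is in the list
B <- toy$level == "Medium"     # expression level is Medium
C <- toy$level == "High"       # expression level is High
D <- toy$tissue == "Pancreas"  # tissue is Pancreas

# A and (B or C) and D
keep <- A & (B | C) & D
kept_symbols <- toy$symbol[keep]
print(kept_symbols)

# For comparison: requiring B AND C at once can never match a row
print(sum(A & B & C & D))
```

Only rows that are in the list, in the pancreas, and at either Medium or High level survive, which is why the two level constraints are grouped with `or`.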

Remember, that was just setting up the query - we haven't run it yet.

In [29]:
# Now we have the query set up the way we want, let's actually *run* the query! 
query2UpInPancreasResults <-  runQuery(im = im, qry = query2UpInPancreas)

# Show me the first few results please! 
head(query2UpInPancreasResults) 

| Gene.primaryIdentifier | Gene.symbol | Gene.proteinAtlasExpression.cellType | Gene.proteinAtlasExpression.level | Gene.proteinAtlasExpression.tissue.name |
|---|---|---|---|---|
| `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` |
| 10097 | ACTR2 | exocrine glandular cells | Medium | Pancreas |
| 10097 | ACTR2 | islets of Langerhans | Medium | Pancreas |
| 10196 | PRMT3 | exocrine glandular cells | Medium | Pancreas |
| 10196 | PRMT3 | islets of Langerhans | Medium | Pancreas |
| 1121 | CHM | exocrine glandular cells | Medium | Pancreas |
| 1121 | CHM | islets of Langerhans | Medium | Pancreas |


### Intersection: Which genes overlap in Query1 and Query2?

Let's check which genes are in BOTH result sets that we've created. To do this, we'll strip down the columns we have to just primary identifiers, and then use R's `intersect` function.

In [34]:
primaryIdentifiers.diabetes <- query1DiabetesResults[["Gene.primaryIdentifier"]]
primaryIdentifiers.pancreas <- query2UpInPancreasResults[["Gene.primaryIdentifier"]]

primaryIdentifiers.diabetes
print ("---")
primaryIdentifiers.pancreas 

intersect(primaryIdentifiers.diabetes,primaryIdentifiers.pancreas)

[1] "---"
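
One thing worth knowing here: the Query 2 results contain one row per gene and cell type, so identifiers repeat, but `intersect` returns each shared value only once. The sketch below shows this with made-up identifiers (they are placeholders, not real query results) and then looks up the matching rows in a toy results table.

```r
# Made-up identifiers for illustration only (not real query results)
ids_query1 <- c("g1", "g2", "g3", "g4")
# One entry per gene and cell type, so identifiers repeat
ids_query2 <- c("g2", "g2", "g4", "g4", "g5")

# intersect() treats its inputs as sets, so duplicates are dropped
overlap <- intersect(ids_query1, ids_query2)
print(overlap)

# Look up the symbols for the overlapping identifiers in a toy results table
toy_results <- data.frame(
  Gene.primaryIdentifier = c("g1", "g2", "g3", "g4"),
  Gene.symbol            = c("SYM1", "SYM2", "SYM3", "SYM4"),
  stringsAsFactors = FALSE
)
overlap_genes <- toy_results[toy_results$Gene.primaryIdentifier %in% overlap, ]
print(overlap_genes)
```

The same `%in%` pattern should work on `query1DiabetesResults` if you want gene symbols alongside the overlapping identifiers.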
