Your Name:  

# Part A:  Using R to navigate the National Center for Biotechnology Information (ungraded)

In this section, we will use the R package rentrez [1] to navigate the vast information
found on the National Center for Biotechnology Information (NCBI) [2] online platform. To
exemplify the functionality of this package we will use the gene MC4R (melanocortin 4 receptor).
Defects in this gene lead to autosomal dominant obesity.
    
References:

[1] https://cran.r-project.org/web/packages/rentrez/index.html

[2] https://www.ncbi.nlm.nih.gov/

[3] https://www.ncbi.nlm.nih.gov/clinvar/intro/

1a) Install and load packages: we want to use the "rentrez" package in this case, so replace "yourpackagehere" with "rentrez" and run the following commands.

In [7]:
install.packages("yourpackagehere", repos="http://cran.us.r-project.org")

Updating HTML index of packages in '.Library'
Making 'packages.html' ... done


In [8]:
library(yourpackagehere)

1b) What is the gene ID for the MC4R gene? This page may be helpful: https://www.ncbi.nlm.nih.gov/gene/

In [51]:
geneid <- 4160

1c) Use the function entrez_link to obtain links to datasets related to MC4R. Check if ClinVar
[3] is part of your result. ClinVar is a freely accessible dataset linking human variations with phenotypes while providing supporting evidence.

There are three parameters that we need to pass to the entrez_link function.

- id = which gene are we interested in? We specified this in 1b
- dbfrom = what database does the ID that we pass belong to? NCBI contains many databases...PubMed, Nucleotide, Gene, SNP, etc. We should input the name of the relevant database as a string. 
- db = since we're interested in which datasets link to MC4R, we should set this to 'all'

In [52]:
link_list = entrez_link(dbfrom = "gene", id = geneid, db = "all")

Let's view the list of links. There should be 54 entries.
Just for your own reflection (no need to write down or submit your thoughts):

- Can you guess what information each of these links contains?  
- Do any seem redundant?
- Is clinvar present?

In [53]:
link_list$links

1d) We can use the entrez_db_summary and entrez_fetch functions to get more information about our gene of interest from each of these linked sources. 

In [54]:
# For example, we can ask for a summary about ClinVar that will tell us
# how many total variants are stored in this database, and when it was
# last updated

entrez_db_summary(db = "clinvar")

Here, we will show you how to extract the specific entities associated with our gene of interest MC4R based on the links we just gathered.  The commands below extract information for another database, OMIM (Online Mendelian Inheritance in Man), a compendium of human genes and genetic phenotypes.  Adapt the commands in this example to extract information from ClinVar.

In [55]:
# Retrieve a list of links related to MC4R from OMIM
omim <- link_list$links$gene_omim
omim

In [56]:
# We can get an overview of the first entry using this command
rank_in_list = 1
entries = data.frame(entrez_summary(db="omim", id=omim[rank_in_list]))
entries

uid,oid,title,alttitles,locus
618406,#618406,BODY MASS INDEX QUANTITATIVE TRAIT LOCUS 20; BMIQ20,"OBESITY, RESISTANCE TO, INCLUDED",18q21.32


In [57]:
# To make this information print out nicely in a human-readable format
# we can use the command "paste" and include whatever narrative text we like
paste("The entry with UID" , entries$uid , "has the title:" , entries$title)

In [None]:
# YOUR CODE HERE

1e) Write an iterative function (i.e. a loop) that goes through the first 10 ClinVar IDs associated with the gene MC4R and prints out a textual summary for each entry that includes

- ClinVar Variation ID
- ClinVar Accession ID
- Species
- Variation Name
- Variation Type
- Clinical Significance


In [61]:
# Here we need more details than we can find from just the summary function
# Instead, we want to fetch all information for each variant, which we
# can do with the entrez_fetch function.

# Here is an example drawn from the ID 744058.  
# First we get the accession number for our variant of interest.
my_accession <- entrez_summary(db = "clinvar", id = "744058")$accession
my_accession

In [62]:
# Then we want to get the full details for this variant
# Because this command is communicating with the NCBI website,
# it can take a few seconds to complete your request
variant_details = entrez_fetch(db="clinvar", id=my_accession, rettype='VCV')
variant_details

UGH! Maybe you can see some of the keywords you are looking for embedded in that text, but the string looks like a mess. This is because the text is in XML format, a common markup format for structured data on the web.

We can use a library called "XML" to help us parse this string into a more readable format.

In [63]:
library("XML")

In [None]:
# Let's convert our result from XML to a list
variant_details_list = xmlToList(variant_details)
variant_details_list

In [66]:
# We can access information in this list by the list item titles like
variant_details_list$VariationArchive$.attrs

In [67]:
# Note that lists can be embedded in lists, so we might access DateCreated 
# by extracting the fourth item in the .attrs element of the VariationArchive list
variant_details_list$VariationArchive$.attrs[[4]]
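
Positional indexing such as `[[4]]` depends on the order in which the attributes happen to come back. As a minimal sketch, the same lookup can be done by name. The nested list below is built by hand to stand in for the real fetch result: its values are placeholders rather than ClinVar data, and the attribute names other than DateCreated are assumptions.

```r
# Toy stand-in for the parsed XML; values are placeholders, not ClinVar data
toy_details <- list(
  VariationArchive = list(
    .attrs = c(VariationID = "0000",
               VariationName = "placeholder name",
               VariationType = "placeholder type",
               DateCreated = "1900-01-01")
  )
)

# Access by position (fragile if the order of the attributes changes)
by_position <- toy_details$VariationArchive$.attrs[[4]]

# Access by name (does not depend on the order)
by_name <- toy_details$VariationArchive$.attrs[["DateCreated"]]

# Both routes reach the same element in this toy list
identical(by_position, by_name)
```

Calling `names()` on any level of your own list shows which item titles are available at that level.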

In [68]:
# Here is an example loop that prints out the numbers 1 through 5
for (i in c(1:5)){
    print(i)
}

[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
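
The same loop pattern can step through a list of records and build one sentence per record with `paste`. This is only a sketch on made-up records (the `id` and `type` fields are hypothetical names, not ClinVar fields), intended to show the mechanics rather than the answer to the exercise.

```r
# Toy records standing in for per-variant results; all values are placeholders
toy_records <- list(
  list(id = "A1", type = "type one"),
  list(id = "B2", type = "type two")
)

# Pre-allocate a character vector to hold one summary per record
summaries <- character(length(toy_records))

# seq_along() yields 1, 2, ... up to the length of the list
for (i in seq_along(toy_records)) {
  record <- toy_records[[i]]
  summaries[i] <- paste("Record", record$id, "has type:", record$type)
  print(summaries[i])
}
```

Using `seq_along()` rather than a hard-coded range keeps the loop correct if the list is shorter than expected.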


In [75]:
# YOUR CODE HERE

# Part B:   Gene expression data 

a) Use the R function read.csv to read in the gene expression data set provided on Canvas. This data was produced by gathering RNA profiling (a “transcriptome”) from the blood to study Autism Spectrum Disorder [1]. Your imported data should have probes stored in the rows (make sure you have probe IDs set as the row names) and microarrays (samples) stored in the columns (make sure you have microarray IDs as the column names).
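
As a rough illustration of how `read.csv` can set row and column names, the sketch below reads a tiny made-up table from a string instead of a file. The probe IDs, array IDs and values are invented, and the layout of the Canvas file is an assumption (first column holds the probe IDs), so check your own file before relying on it.

```r
# Toy CSV text standing in for the Canvas file; IDs and values are made up
toy_csv <- "probe,array_1,array_2
p1,1.5,2.5
p2,3.0,4.0"

# row.names = 1 uses the first column as the row names;
# check.names = FALSE keeps the column names exactly as written
toy_expr <- read.csv(text = toy_csv, row.names = 1, check.names = FALSE)

rownames(toy_expr)
colnames(toy_expr)
```

For a real file, the path would be passed as the first argument in place of `text =`.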

In [71]:
# YOUR CODE HERE

b) Calculate the mean and standard deviation of expression values for each probe across all microarrays. You may find the functions mean(), sd() and apply() useful for this task. Plot a histogram of the mean expression values, and another histogram of the standard deviation values across all probes. You may find the function hist() useful. Please include informative axis labels and figure titles. Also generate histograms of raw expression counts (the values in your table) for four genes of your choice, labeled appropriately.
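
A small sketch of row-wise summaries with `apply()`, using an invented 2 x 3 matrix in place of the real expression table (the probe and array names are placeholders). The second argument, `1`, asks `apply()` to work over rows, so each probe gets one value.

```r
# Toy expression matrix: 2 probes (rows) by 3 arrays (columns), made-up values
toy_expr <- matrix(c(1, 2, 3,
                     4, 6, 8),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(c("p1", "p2"), c("a1", "a2", "a3")))

# MARGIN = 1 applies the function to each row (each probe)
probe_means <- apply(toy_expr, 1, mean)
probe_sds   <- apply(toy_expr, 1, sd)

# main, xlab and ylab set the figure title and the axis labels
hist(probe_means,
     main = "Mean expression per probe (toy data)",
     xlab = "Mean expression",
     ylab = "Number of probes")
```

With a data frame instead of a matrix, `apply()` converts it to a matrix first, which works as long as all columns are numeric.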

In [72]:
# YOUR CODE HERE

c) Annotate [3] the probe IDs (Affymetrix Gene 1.0 ST (GeneST) microarrays) with gene names.

In [None]:
# You can perform gene annotation with the reference pd.hugene.1.0.st.v1 
# through the oligo package

# First install the necessary packages
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("oligo")
BiocManager::install("pd.hugene.1.0.st.v1")
BiocManager::install("annotate")
BiocManager::install("hugene10sttranscriptcluster.db")

# for additional information see:
# http://www.bioconductor.org/packages//2.7/bioc/vignettes/oligo/inst/doc/V5ExonGene.pdf

In [None]:
library(oligo)

We recommend that you examine the function getNetAffx to convert your Affymetrix gene IDs to gene names.

Hint 1:  The getNetAffx function requires that your gene IDs be in the format of an ExpressionSet object; you can use the function ExpressionSet to make that conversion, with the annotation parameter set to "pd.hugene.1.0.st.v1".

Hint 2: Within the "@data" component of the resulting annotated data frame object, the gene name can be found under the column "geneassignment". The strings found under this heading for each gene will contain many pieces of information separated by the characters "//". You may find the function strsplit() helpful for separating those pieces of information. The final gene names you want will have names like

- ' olfactory receptor, family 4, subfamily F, member 29 '
- ' uncharacterized LOC100287497 '
- ' family with sequence similarity 87, member B '
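
To show how `strsplit()` behaves on a "//"-separated string, here is a sketch on a made-up example. The accession, symbol, locus and ID fields are placeholders, and the position of the descriptive name (third field) is an assumption, so inspect a few of your own strings to confirm where it sits.

```r
# Made-up string in the "//"-separated layout described above
toy_assignment <- "ACC0001 // SYM1 // olfactory receptor, family 4, subfamily F, member 29 // locus // 0000"

# strsplit() returns a list with one character vector per input string;
# fixed = TRUE treats "//" as literal text rather than a regular expression
fields <- strsplit(toy_assignment, "//", fixed = TRUE)[[1]]

# Assumption: the descriptive gene name is the third field
# trimws() removes the spaces that surrounded the separators
gene_name <- trimws(fields[3])
gene_name
```

Trimming is optional here, since the expected names listed above keep their surrounding spaces.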

In [17]:
# YOUR CODE HERE

Which gene has the greatest number of probes?

In [73]:
# YOUR CODE HERE

Which gene has the highest mean expression in the dataset?

In [74]:
# YOUR CODE HERE