# A Methodology for Machine Learning Analysis of Space-Exposed Murine Liver (Transcriptomics)

This notebook provides the code needed to perform our analysis and generate the figures for the publication *Spaced Out Data No More: Genomic Harmonization Meets Machine Learning in Murine Livers*. It can be executed to identify pseudogenes and to pre-filter the ENSEMBL gene IDs from GLDS, keeping the subset of genes with valid external IDs (NCBI standard).

- Notebook Author: Hari Ilangovan


|Version History | Date | 
|----------| ----- |
|v0| 12/10/2022 | 
|v1 | 11/2/2023 | 


Publication Authorship:
- Hari Ilangovan<sup>1</sup>
- Prachi Kothiyal<sup>2</sup>
- Katherine A. Hoadley<sup>3</sup>
- Robin Elgart<sup>4</sup>
- Greg Eley<sup>2</sup>
- Parastou Eslami<sup>5</sup>

<sup>1</sup>Science Applications International Corporation (SAIC), Reston, VA 20190, USA
<sup>2</sup>Scimentis LLC, Statham, GA 30666, USA
<sup>3</sup>Department of Genetics, Computational Medicine Program, Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
<sup>4</sup>University of Houston, Houston, TX 77204, USA
<sup>5</sup>Universal Artificial Intelligence Inc, Boston, MA 02130, USA

# Table of Contents

- [Library Loading](#Directory-Configuration-and-Package-Loading)
- [biomaRt Query - ENSEMBL to Gene Symbol Mapping](#Gene-Symbol-Conversion)
- [Filtering for Protein Coding Genes](#Filtering-BioMart-Query-for-Protein-Coding-Genes)

## Directory Configuration and Package Loading

[Back to Top](#Table-of-Contents)

We change the working directory to the directory that holds the helper functions and metadata references.

In [1]:
# set to TRUE to write the filtered gene list to disk at the end of the notebook
generate_flat_files = TRUE

In [2]:
setwd("./../scripts/")

In [3]:
# load packages - biomaRt is essential for query
require('data.table')
require('dplyr')
require('biomaRt')

Loading required package: data.table

Loading required package: dplyr


Attaching package: ‘dplyr’


The following objects are masked from ‘package:data.table’:

    between, first, last


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union


Loading required package: biomaRt

Possible Ensembl SSL connectivity problems detected.
Please see the 'Connection Troubleshooting' section of the biomaRt vignette
vignette('accessing_ensembl', package = 'biomaRt')
Error in curl::curl_fetch_memory(url, handle = handle) : 
  SSL peer certificate or SSH remote key was not OK: [uswest.ensembl.org] SSL certificate problem: certificate has expired




## Gene Symbol Conversion

[Back to Top](#Table-of-Contents)

In [4]:
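# workaround for the expired Ensembl certificate reported during package loading:
# disable SSL peer verification for this session (insecure; remove once the certificate is valid again)
httr::set_config(httr::config(ssl_verifypeer = FALSE))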

In [5]:
# biomaRt query to obtain the ENSEMBL Gene ID to Gene Symbol mapping
mart <- useMart('ENSEMBL_MART_ENSEMBL')
mart <- useDataset('mmusculus_gene_ensembl', mart)

annotLookup <- getBM(
    mart = mart,
    attributes = c(
        'ensembl_gene_id',
        'ensembl_transcript_id',
        'external_gene_name',
        'gene_biotype'),
    uniqueRows = TRUE)

In [6]:
# inspect the query head to confirm the query structure
head(annotLookup)

| | ensembl_gene_id | ensembl_transcript_id | external_gene_name | gene_biotype |
|----------| ----- | ----- | ----- | ----- |
| | `<chr>` | `<chr>` | `<chr>` | `<chr>` |
| 1 | ENSMUSG00000064336 | ENSMUST00000082387 | mt-Tf | Mt_tRNA |
| 2 | ENSMUSG00000064337 | ENSMUST00000082388 | mt-Rnr1 | Mt_rRNA |
| 3 | ENSMUSG00000064338 | ENSMUST00000082389 | mt-Tv | Mt_tRNA |
| 4 | ENSMUSG00000064339 | ENSMUST00000082390 | mt-Rnr2 | Mt_rRNA |
| 5 | ENSMUSG00000064340 | ENSMUST00000082391 | mt-Tl1 | Mt_tRNA |
| 6 | ENSMUSG00000064341 | ENSMUST00000082392 | mt-Nd1 | protein_coding |
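
Because the query also requests `ensembl_transcript_id`, a gene with several transcripts appears on several rows of the result. If a gene-level table is needed, one option is to drop the transcript column and de-duplicate. The sketch below uses a small toy data frame with hypothetical IDs and names (`toy_lookup`, `gene_level`, and the `ENSMUSG_TOY_*` values are illustrative and not part of the notebook):

```r
# Toy stand-in for annotLookup; all IDs and names here are hypothetical.
# The second gene has two transcripts, so it occupies two rows.
toy_lookup <- data.frame(
    ensembl_gene_id       = c("ENSMUSG_TOY_0001", "ENSMUSG_TOY_0002", "ENSMUSG_TOY_0002"),
    ensembl_transcript_id = c("ENSMUST_TOY_0001", "ENSMUST_TOY_0002", "ENSMUST_TOY_0003"),
    external_gene_name    = c("GeneA", "GeneB", "GeneB"),
    gene_biotype          = c("protein_coding", "processed_pseudogene", "processed_pseudogene"),
    stringsAsFactors = FALSE
)

# drop the transcript column, then keep one row per distinct gene record
gene_cols  <- c("ensembl_gene_id", "external_gene_name", "gene_biotype")
gene_level <- unique(toy_lookup[, gene_cols])

nrow(toy_lookup)  # transcript-level row count
nrow(gene_level)  # gene-level row count
```

The membership test with `%in%` used later in this notebook is unaffected by duplicate rows, so this step is optional there.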


## Filtering BioMart Query for Protein Coding Genes

[Back to Top](#Table-of-Contents)

In [7]:
# genes whose biotype contains "pseudogene"
pseudogene_list <- annotLookup[annotLookup$gene_biotype %like% "pseudogene",] %>% dplyr::select(c(ensembl_gene_id, external_gene_name))
# genes whose biotype does not contain "pseudogene"
non_pseudogene_list <- annotLookup[!(annotLookup$gene_biotype %like% "pseudogene"),] %>% dplyr::select(c(ensembl_gene_id, external_gene_name))
# non-pseudogenes that also have a non-empty external gene name;
# nzchar() yields a logical mask, so every row is kept when no name is empty
# (negative indexing with -which(...) would return zero rows in that case)
non_pseudogene_wExternal_list <- non_pseudogene_list[nzchar(non_pseudogene_list$external_gene_name),]
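
When dropping rows with an empty gene symbol, negative indexing via `-which(...)` and a logical mask behave differently if no row matches the condition. A minimal sketch on toy data (the `toy` data frame and its `TOY_G*` IDs are hypothetical) illustrates the difference:

```r
# Toy table in which no gene symbol is empty; IDs and names are hypothetical
toy <- data.frame(
    ensembl_gene_id    = c("TOY_G1", "TOY_G2", "TOY_G3"),
    external_gene_name = c("GeneA", "GeneB", "GeneC"),
    stringsAsFactors = FALSE
)

# nothing matches, so which() returns integer(0)
empty_idx <- which(toy$external_gene_name == "")

# negating integer(0) is still integer(0), which selects no rows at all
via_which <- toy[-empty_idx, ]

# a logical mask keeps every row in the same situation
via_mask <- toy[nzchar(toy$external_gene_name), ]

nrow(via_which)
nrow(via_mask)
```

With the full Ensembl mouse annotation some symbols are typically empty, so both forms would usually agree there; the mask is simply the safer default.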

In [8]:
# import the ENSEMBL IDs from GeneLab Data Sets
genelist_complete <- read.table('./data/raw_counts/379.csv', header=TRUE, sep=",", row.names = 1)
genelist_complete <- rownames(genelist_complete) %>% as.data.frame() 
colnames(genelist_complete) <- c("ensemble_gene_id")

In [9]:
# keep only the GLDS gene IDs that are non-pseudogenes with a non-empty external gene name
genelist_keep_non_pseudo <- genelist_complete %>% dplyr::filter(ensemble_gene_id %in% non_pseudogene_wExternal_list$ensembl_gene_id)
if (generate_flat_files) {
    write.table(genelist_keep_non_pseudo, './data/prefiltering/glds_no_pseudogenes_wExternal.csv', row.names=FALSE, sep=',', quote=FALSE)
}
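
After a pre-filter like this, it can be useful to check how many IDs were retained and how many were dropped. The following is a minimal sketch on toy vectors; `all_ids`, `kept_ids`, and the `TOY_G*` values are hypothetical stand-ins for the input and filtered gene lists, not values from the notebook:

```r
# Toy stand-ins (hypothetical IDs) for the input list and the filtered list
all_ids  <- c("TOY_G1", "TOY_G2", "TOY_G3", "TOY_G4")
kept_ids <- c("TOY_G1", "TOY_G3")

# IDs present in the input but absent after filtering
dropped_ids <- base::setdiff(all_ids, kept_ids)

summary_counts <- c(
    input   = length(all_ids),
    kept    = length(kept_ids),
    dropped = length(dropped_ids)
)

# every input ID should be either kept or dropped
stopifnot(summary_counts[["kept"]] + summary_counts[["dropped"]] == summary_counts[["input"]])
print(summary_counts)
```

`base::setdiff` is called with an explicit namespace because `dplyr` masks `setdiff` when it is attached.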