## Convert R Data

This notebook converts log-normalized thalamus scRNA-seq data (received from Marcus Hooper on 7/5/22) from R formats into an AnnData object for further analysis. The R data is originally stored as an S4 object, which pyreadr cannot load, so it is first exported to a Matrix Market (.mtx) file in RStudio and then loaded here. The original data can be found at: /allen/programs/celltypes/workgroups/mct-t200/marcus/Z_data_analyses/gene_panels/P11_P14_data

In [119]:
import numpy as np
import pandas as pd
import anndata as ad
import scanpy as sc
import scipy.io
import scipy.sparse
import pyreadr

# Read gene names and metadata
annoDict = pyreadr.read_r("../Data/R Formats/P11_P14_metadata")
metadata = annoDict[None]

thGenes = pd.read_feather("../Data/gene_names.feather")
thGenes.columns = ['gene'] # Alter dataframe column name

In [3]:
# Read sparse data matrix (converted in RStudio from .rda to .mtx) ~15 min
logNormCOO = scipy.io.mmread("../Data/dev_data.mtx")
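
As a small illustration of what the read step returns, here is a sketch that writes a toy sparse matrix to a Matrix Market file and reads it back. The toy values and the file name `toy.mtx` are assumptions made for the example and are not part of the real dataset. It shows that `scipy.io.mmread` hands back a coordinate (COO) matrix for sparse input.

```python
import os
import tempfile

import numpy as np
import scipy.io
import scipy.sparse

# Toy sparse matrix standing in for the exported expression data (assumed values)
toy = scipy.sparse.csr_matrix(np.array([[0.0, 1.5], [2.0, 0.0]]))

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "toy.mtx")  # hypothetical file name
    scipy.io.mmwrite(path, toy)             # write a Matrix Market file
    loaded = scipy.io.mmread(path)          # read it back into memory

# Sparse Matrix Market files come back in coordinate (COO) format
loaded_format = loaded.format
round_trip_ok = np.array_equal(loaded.toarray(), toy.toarray())
```

COO is convenient for construction but not for slicing, which is why a format conversion is needed before the matrix is wrapped in an AnnData object.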

In [6]:
# Transpose matrix to match the AnnData convention (cells x genes), and convert to an indexable sparse format (CSR)
logNormCOO = logNormCOO.transpose()
logNormCSR = logNormCOO.tocsr()
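
The transpose and conversion above can be sketched on a toy matrix. The 3 x 2 values below are made up for illustration; the assumption is only that the .mtx file stores genes as rows and cells as columns, as the transpose implies.

```python
import numpy as np
import scipy.sparse

# Toy genes x cells matrix (3 genes, 2 cells), standing in for the real data
genes_by_cells = scipy.sparse.coo_matrix(
    np.array([[1.0, 0.0],
              [0.0, 2.0],
              [3.0, 0.0]], dtype=np.float32)
)

# Transpose to cells x genes, then convert to CSR so rows (cells) can be sliced
cells_by_genes = genes_by_cells.transpose().tocsr()

# Row slicing works on CSR; this pulls out all gene values for the first cell
first_cell = cells_by_genes[0, :].toarray()
```

CSR stores each row contiguously, so selecting subsets of cells is efficient, which suits a cells x genes layout.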

In [10]:
# Construct AnnData object
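# Cast to float32 before construction; the dtype argument is deprecated in newer anndata releases
thData = ad.AnnData(logNormCSR.astype(np.float32))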

In [129]:
# Make AnnData object indexable by assigning proper names to rows and columns
thData.obs_names = metadata["sample_id"]
thData.var_names = thGenes['gene']

# Example indexing:
thData["AAACCCAAGCCAACCC-L8TX_210319_01_A07-1157582237"]
thData[0:5,["Xkr4"]]

View of AnnData object with n_obs × n_vars = 5 × 1
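
Before assigning names as in the cell above, it can help to confirm that the name lists actually fit the matrix. The helper below is a hypothetical sketch (`check_axis_names` is not part of anndata or of the original notebook) that checks lengths against the matrix shape and rejects duplicate cell IDs.

```python
import pandas as pd

def check_axis_names(shape, obs_names, var_names):
    """Raise ValueError if the names cannot label a cells x genes matrix of this shape."""
    n_obs, n_vars = shape
    obs_index = pd.Index(obs_names)
    var_index = pd.Index(var_names)
    if len(obs_index) != n_obs:
        raise ValueError(f"expected {n_obs} cell names, got {len(obs_index)}")
    if len(var_index) != n_vars:
        raise ValueError(f"expected {n_vars} gene names, got {len(var_index)}")
    if not obs_index.is_unique:
        raise ValueError("cell names contain duplicates")

# Toy usage: 2 cells x 3 genes with matching names passes silently
check_axis_names((2, 3), ["cell1", "cell2"], ["geneA", "geneB", "geneC"])
```

In this notebook the call would presumably look like `check_axis_names(thData.shape, metadata["sample_id"], thGenes["gene"])`.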

In [131]:
# Copy cell annotations from the metadata table into thData.obs
# Not exhaustive: only a subset of metadata columns is copied, and some dtypes may be suboptimal
thData.obs["sample_id"] = pd.Categorical(metadata["sample_id"])
thData.obs["umi_counts"] = np.float32(metadata["umi.counts"])
thData.obs["gene_counts"] = np.float32(metadata["gene.counts.0"])
thData.obs["sex"] = pd.Categorical(metadata["sex"])
thData.obs["age"] = pd.Categorical(metadata["age"])
thData.obs["donor"] = pd.Categorical(metadata["donor_name"])
thData.obs["roi"] = pd.Categorical(metadata["annotated_ROI"])
thData.obs["cluster_id"] = pd.Categorical(metadata["cluster_id"])
thData.obs["cluster_label"] = pd.Categorical(metadata["cluster_label"])
thData.obs["cluster_color"] = pd.Categorical(metadata["cluster_color"])
thData.obs["supertype_id"] = pd.Categorical(metadata["supertype_id"])
thData.obs["supertype_label"] = pd.Categorical(metadata["supertype_label"])
thData.obs["supertype_color"] = pd.Categorical(metadata["supertype_color"])
thData.obs["subclass_id"] = pd.Categorical(metadata["subclass_id"])
thData.obs["subclass_label"] = pd.Categorical(metadata["subclass_label"])
thData.obs["subclass_color"] = pd.Categorical(metadata["subclass_color"])
thData.obs["class_id"] = pd.Categorical(metadata["class_id"])
thData.obs["class_label"] = pd.Categorical(metadata["class_label"])
thData.obs["class_color"] = pd.Categorical(metadata["class_color"])
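
One reason to wrap each column in `pd.Categorical` or a NumPy array, as the cell above does, is that pandas aligns a plain Series on index labels when it is assigned to a DataFrame. The sketch below uses toy tables (the barcodes and values are assumptions) to show the difference; `thData.obs` is a DataFrame indexed by cell name, so the same behavior would be expected there.

```python
import numpy as np
import pandas as pd

# Toy obs table indexed by cell barcode, like thData.obs once obs_names is set
obs = pd.DataFrame(index=["cellA", "cellB", "cellC"])

# Toy metadata with a default integer index, like a freshly loaded table
metadata = pd.DataFrame({"sex": ["F", "M", "F"], "umi.counts": [100, 250, 75]})

# Assigning a Series aligns on index labels; none match, so every value is NaN
obs["sex_aligned"] = metadata["sex"]

# A Categorical or NumPy array carries no index, so values are assigned by position
obs["sex"] = pd.Categorical(metadata["sex"])
obs["umi_counts"] = metadata["umi.counts"].to_numpy(dtype=np.float32)
```

Positional assignment is only safe when the metadata rows are in the same order as the matrix rows, which this notebook assumes.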

In [132]:
# Save AnnData object to an .h5ad file (~15 min, rough estimate)
fnResults = "../Data/devData.h5ad"
thData.write(fnResults)