In [2]:
import warnings
import pandas as pd
import scanpy as sc
import numpy as np
import scrublet as scr
import matplotlib.pyplot as plt
import seaborn as sns
import anndata as ad



# Construct AnnData object from sparse matrix, cell barcode list, and gene list

AnnData, provided by the Python package `anndata`, is a widely used data structure in single-cell genomics. It is a container for high-dimensional single-cell data, such as gene expression from single-cell or single-nucleus RNA-seq (sc/snRNA-seq), and it stores the data matrix together with the annotations and metadata for the cells and the features (genes). Its main components are:

**Data Matrix**: The primary component of an AnnData object is a two-dimensional data matrix (`X`) holding the expression values of genes (features) across individual cells. By convention, cells are rows and genes are columns. The matrix is stored in dense or sparse form, depending on the size and sparsity of the data.

**Observations**: Each row of the data matrix corresponds to an individual cell (aka observation). The AnnData object stores per-cell annotations in `obs`, for example the cell's barcode, its cell type, or metadata about the sample the cell came from. These annotations are stored as a pandas DataFrame in which each row corresponds to a cell and each column to a specific annotation.

**Variables**: Each column of the data matrix corresponds to a gene (aka variable). The AnnData object stores per-gene annotations in `var`, for example gene names, gene IDs, and gene types. As with the cell annotations, these are stored as a pandas DataFrame in which each row corresponds to a gene and each column to a specific annotation.

See [anndata documentation](https://anndata.readthedocs.io/en/latest/) for more details.


In [3]:
# the directory containing the data
path = "/data/class/cosmos2023/PUBLIC/shai_hulud/scanpy/"

# Sample 1

In [4]:
sample = 'Gastroc'

In [5]:
# Load sparse matrix from file
mtx = sc.read_mtx(path + sample + '_matrix.mtx').T
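
The `.T` above transposes the loaded matrix. A likely reason (an assumption, since the file layout is not shown here) is that the Matrix Market file stores genes as rows and cells as columns, while AnnData expects cells as rows. The sketch below reproduces that situation with `scipy.io` on a made-up 3 × 2 matrix written to a temporary file.

```python
import os
import tempfile

import numpy as np
from scipy import io, sparse

# Made-up counts stored genes x cells: 3 genes (rows), 2 cells (columns)
genes_by_cells = sparse.csr_matrix(np.array([[1, 0], [0, 4], [2, 0]]))

with tempfile.TemporaryDirectory() as tmpdir:
    mtx_file = os.path.join(tmpdir, "toy_matrix.mtx")  # hypothetical file name
    io.mmwrite(mtx_file, genes_by_cells)
    loaded = io.mmread(mtx_file)

# Transpose so that cells are rows and genes are columns
cells_by_genes = loaded.T.tocsr()
print(loaded.shape, cells_by_genes.shape)
```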

In [6]:
# Load gene IDs or names from CSV file
var = pd.read_csv(path + sample + '_var.csv')

# Load observations from CSV file
obs = pd.read_csv(path + sample + '_obs.csv')

In [7]:
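# Attach the annotations to the loaded object (redundant with the next cell,
# which builds adata1 directly from mtx.X, obs, and var)
mtx.obs = obs
mtx.var = var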

In [8]:
adata1 = ad.AnnData(X=mtx.X, var=var, obs=obs)
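
Before combining a matrix with separately loaded annotation tables, it can help to confirm that their dimensions agree. The helper below is a hypothetical sketch (`check_alignment` is not part of anndata or scanpy) run on made-up inputs. It only compares lengths, so it would catch a forgotten transpose but not rows that are present in the wrong order.

```python
import pandas as pd
from scipy import sparse


def check_alignment(X, obs, var):
    """Raise ValueError if the annotation tables do not match the matrix shape."""
    n_obs, n_vars = X.shape
    if len(obs) != n_obs:
        raise ValueError(f"obs has {len(obs)} rows but X has {n_obs} rows")
    if len(var) != n_vars:
        raise ValueError(f"var has {len(var)} rows but X has {n_vars} columns")


# Toy inputs (made up): 2 cells x 3 genes
X = sparse.csr_matrix([[1, 0, 2], [0, 3, 0]])
obs = pd.DataFrame({"cellID": ["c1", "c2"]})
var = pd.DataFrame({"gene_name": ["g1", "g2", "g3"]})

check_alignment(X, obs, var)        # shapes agree, nothing happens

try:
    check_alignment(X.T, obs, var)  # simulates a matrix left as genes x cells
except ValueError as err:
    print(err)
```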



In [10]:
# Use the cell IDs as the row (observation) names
adata1.obs.index = adata1.obs['cellID']

In [13]:
# Use gene names as the variable names, then de-duplicate repeated names
adata1.var_names = adata1.var['gene_name']
adata1.var_names_make_unique()
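
Gene names can repeat, and `var_names_make_unique()` resolves this by appending a numeric suffix to the repeats. The pure-Python function below is an illustrative sketch of that idea, not the anndata implementation. The function name `make_unique` and the example gene names are chosen for illustration only.

```python
def make_unique(names, join="-"):
    """Append a numeric suffix to repeated names; the first occurrence is kept as-is."""
    seen = set(names)  # every name already in use
    counts = {}        # number of suffixes tried so far for each name
    result = []
    for name in names:
        if name not in counts:
            # First time this name appears: keep it unchanged
            counts[name] = 0
            result.append(name)
            continue
        # Repeated name: find the next suffix that is not already taken
        while True:
            counts[name] += 1
            candidate = f"{name}{join}{counts[name]}"
            if candidate not in seen:
                break
        seen.add(candidate)
        result.append(candidate)
    return result


print(make_unique(["Actb", "Gm1", "Actb", "Actb"]))
# ['Actb', 'Gm1', 'Actb-1', 'Actb-2']
```

Note that after de-duplication the variable names can differ from the values in the `gene_name` column, which keeps the original, possibly repeated, names.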

# Save

In [14]:
adata1

AnnData object with n_obs × n_vars = 7542 × 47707
    obs: 'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'X', 'barcode', 'depth1', 'sublibrary', 'sample', 'cellID', 'sublibrary_sample', 'doublet_scores', 'Sample', 'TSSEnrichment', 'ReadsInTSS', 'ReadsInPromoter', 'ReadsInBlacklist', 'PromoterRatio', 'PassQC', 'NucleosomeRatio', 'nMultiFrags', 'nMonoFrags', 'nFrags', 'nDiFrags', 'BlacklistRatio', 'original_atac_bc', 'atac_bc', 'rna_bc', 'DoubletScore', 'DoubletEnrichment', 'archr_cellID', 'batch', 'depth2', 'experiment_accession', 'file_accession', 'genotype', 'library_accession', 'lower_nCount_RNA', 'lower_nFeature_RNA', 'rep', 'run_number', 'sex', 'technology', 'timepoint', 'tissue', 'upper_doublet_scores', 'upper_nCount_RNA', 'upper_percent.mt', 'percent.mt', 'percent.ribo', 'nCount_SCT', 'nFeature_SCT', 'integrated_snn_res.0.8', 'seurat_clusters', 'S.Score', 'G2M.Score', 'Phase', 'predicted.id', 'prediction.score.Type.IIb.Myonuclei', 'prediction.score.Type.IIx.Myonuclei', 'prediction.

In [15]:
# Optional filter, left disabled: drop cells with fewer than 500 total counts
# sc.pp.filter_cells(adata1, min_counts=500)

In [16]:
# Write the AnnData object to the same data directory defined in `path`
adata1.write_h5ad(path + "gastroc_adata.h5ad")
