### Dataset cleaning and merging

#### 1) In this step, each of the 3 lung cancer datasets is stripped of the annotations that are not needed downstream, both those it shares with the other datasets and those unique to it.
#### 2) This saves space and avoids the NaN values that non-shared elements would otherwise introduce when the datasets are combined.
#### 3) Subsequently, the 3 datasets are concatenated into a single AnnData object and the tissue values are replaced (see the dictionaries below for details).

In [1]:
import scanpy as sc
import anndata as ad
import pandas as pd
import hdf5plugin

### Parameters

In [2]:
# nsclc_tissue_dic = {
#     "tumor": "cancer",
#     "blood": "normal",
#     "adjacent normal": "normal",
# }


# luad_gse131907_tissue_dic = {
#     "tLung": "cancer",
#     "nLung": "normal",
#     "nLN": "normal",
#     "mBrain": "cancer",
#     "mLN": "cancer",
#     "PE": "cancer",
#     "tL/B": "cancer",
# }


# luad_gse123902_tissue_dic = {
#     'PRIMARY': "cancer",
#     'NORMAL': "normal",
#     'METASTASIS': "cancer",
# }


luad_gse97168_tissue_dic = {
    'normal': "normal",
    'tumor': "cancer",
}
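
When such a dictionary is applied with `replace`, a label that is missing from it passes through unchanged. It can therefore help to check the mapping against the labels actually present before replacing them. The following is a minimal pure-Python sketch; `harmonize_tissue` is a hypothetical helper, not part of scanpy or anndata, and the input labels are toy values.

```python
from collections import Counter


def harmonize_tissue(labels, mapping):
    """Map raw tissue labels to harmonized ones, failing loudly on unknown labels."""
    unknown = sorted(set(labels) - set(mapping))
    if unknown:
        raise KeyError(f"No mapping for tissue labels: {unknown}")
    return [mapping[label] for label in labels]


tissue_dic = {"normal": "normal", "tumor": "cancer"}
raw_labels = ["tumor", "tumor", "normal"]  # toy labels, not the real dataset

harmonized = harmonize_tissue(raw_labels, tissue_dic)
counts = Counter(harmonized)
print(counts)
```

Raising on unknown labels is a design choice: it surfaces a misspelled or unexpected tissue value immediately instead of letting it slip into the merged object.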

### Dataset paths

In [3]:
# DATA_PATH = "./data/standalone_h5ads/NSCLC_T_SS2_GSE99254.h5ad"

# DATA_PATH = "./data/standalone_h5ads/LUAD_UNB_10X_GSE131907.h5ad"

# DATA_PATH = "./data/standalone_h5ads/LUAD_UNB_10X_GSE123902.h5ad"

DATA_PATH = "./data/standalone_h5ads/LUAD_MYE_MRS_GSE97168.h5ad"
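
Switching datasets currently means commenting and uncommenting both a path and a dictionary. One possible alternative, sketched here, keeps each path next to its tissue dictionary and selects the pair with a single key. The registry layout and key names are assumptions; the paths and mappings are the ones listed above, with only two datasets shown for brevity.

```python
# Hypothetical registry: one entry per dataset, pairing its path with its tissue mapping.
DATASETS = {
    "luad_gse123902": {
        "path": "./data/standalone_h5ads/LUAD_UNB_10X_GSE123902.h5ad",
        "tissue_dic": {
            "PRIMARY": "cancer",
            "NORMAL": "normal",
            "METASTASIS": "cancer",
        },
    },
    "luad_gse97168": {
        "path": "./data/standalone_h5ads/LUAD_MYE_MRS_GSE97168.h5ad",
        "tissue_dic": {
            "normal": "normal",
            "tumor": "cancer",
        },
    },
}

DATASET_KEY = "luad_gse97168"  # the only line to change when switching datasets

DATA_PATH = DATASETS[DATASET_KEY]["path"]
tissue_dic = DATASETS[DATASET_KEY]["tissue_dic"]
print(DATA_PATH)
```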

### Read & Sanitize adata object

In [4]:
adata = sc.read_h5ad(DATA_PATH)
adata

AnnData object with n_obs × n_vars = 1275 × 10523
    obs: 'nCount_RNA', 'nFeature_RNA', 'Amp_batch_ID', 'well_coordinates', 'plate_ID', 'Pool_barcode', 'Cell_barcode', 'tissue', 'seurat_clusters', 'annotation_CHETAH', 'cell_ontology', 'cell_ontology_id', 'annotation_major', 'annotation_immune', 'annotation_minor'
    var: 'vst.mean', 'vst.variance', 'vst.variance.expected', 'vst.variance.standardized', 'vst.variable'
    obsm: 'X_pca', 'X_tsne', 'X_umap'

In [5]:
# Step 1: Keep only the `.obs` columns needed downstream
# (.copy() avoids pandas chained-assignment warnings when `tissue` is modified later)
adata.obs = adata.obs[['tissue', 'annotation_immune']].copy()

# Step 2: Remove all columns from `.var`
# This retains the gene names/index but removes all associated annotations
adata.var = pd.DataFrame(index=adata.var.index)

# Step 3: Remove all items from `.obsm`
adata.obsm = {}

In [6]:
adata.obs["tissue"].value_counts()

tumor     804
normal    471
Name: tissue, dtype: int64

In [7]:
# Normalize the column name in case it is spelled 'Tissue' (a no-op for this dataset)
adata.obs.rename(columns={'Tissue': 'tissue'}, inplace=True)
adata.obs['tissue'] = adata.obs['tissue'].replace(luad_gse97168_tissue_dic)

adata.obs["tissue"].value_counts()

cancer    804
normal    471
Name: tissue, dtype: int64

### Write sanitized dataset to disk

In [8]:
adata.write_h5ad(
    "./data/sanitized_h5ads/luad_gse97168.h5ad",
    compression=hdf5plugin.FILTERS["zstd"],
    compression_opts=hdf5plugin.Zstd(clevel=5).filter_options,
)