**Aim:** analysis of CITE-seq data

**Tools**: 
1. Alignment and count: [`cellranger`](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/what-is-cell-ranger) or [`kallisto | bustools`](https://www.kallistobus.tools/)
2. Preprocessing: [`scanpy`](https://scanpy.readthedocs.io/en/stable/) or [`seurat`](https://satijalab.org/seurat/)

## 1. Alignment and count

In [6]:
# cat kb-count.sh

In [5]:
# cat kite.sh

Let's try `cellranger` instead! (see [this link](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/feature-bc-analysis))
> The pipeline outputs a unified feature-barcode matrix that contains gene expression counts alongside Feature Barcode counts for each cell barcode. 

> To enable Feature Barcode analysis, `cellranger count` needs two new inputs:
> 1. Libraries CSV is passed to `cellranger count` with the `--libraries` flag, and declares the FASTQ files and library type for each input dataset. In a typical Feature Barcode analysis there will be two input libraries: one for the normal single-cell gene expression reads, and one for the Feature Barcode reads. This argument replaces the `--fastqs` argument.
> 2. Feature Reference CSV is passed to `cellranger count` with the `--feature-ref` flag and declares the set of Feature Barcode reagents in use in the experiment. For each unique Feature Barcode used, this file declares a feature name and identifier, the unique Feature Barcode sequence associated with this reagent, and a pattern indicating how to extract the Feature Barcode sequence from the read sequence. See [Feature Barcode Reference](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/feature-bc-analysis#feature-ref) for details on how to construct the feature reference.
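
As a rough sketch of how the two inputs fit together, the snippet below assembles a `cellranger count` command line for one sample. `--libraries` and `--feature-ref` are the flags described in the quote above, and `--id` and `--transcriptome` are the usual `cellranger count` arguments. The helper name `cellranger_count_cmd`, the run ID, and the transcriptome path are assumptions made for illustration; the reference actually used is not stated in this notebook.

```python
import shlex


def cellranger_count_cmd(run_id, libraries_csv, feature_ref_csv, transcriptome):
    """Build the argument list for one `cellranger count` run (hypothetical helper)."""
    return [
        "cellranger", "count",
        f"--id={run_id}",
        f"--libraries={libraries_csv}",      # used instead of --fastqs
        f"--feature-ref={feature_ref_csv}",  # declares the Feature Barcode reagents
        f"--transcriptome={transcriptome}",
    ]


cmd = cellranger_count_cmd(
    run_id="101",                      # assumed: one run per sample ID
    libraries_csv="counts/101.csv",
    feature_ref_csv="feature_ref.csv",
    transcriptome="/path/to/refdata",  # placeholder, not taken from this analysis
)
print(shlex.join(cmd))
```

Building the command as a list keeps it easy to pass to `subprocess.run` or to print for a job script.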

<!-- # Surface protein counts
Here we aim to run `kite` and load results into python using `scanpy` library. 

Main kite workflow didn't work and most of barcodes didn't match (see [here](https://github.com/pachterlab/kite)). So, I'm trying https://www.kallistobus.tools/tutorials/kb_kite/python/kb_kite.html -->

From the [Feature Barcode Reference](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/feature-bc-analysis#feature-ref):

> __Feature Barcode Extraction Pattern__

> The `pattern` field of the feature reference defines how to locate the Feature Barcode within a read. The Feature Barcode may appear at a known offset with respect to the start or end of the read or may appear at a fixed position relative to a known anchor sequence. The pattern column can be made up of a combination of these elements:

> - **5P**: denotes the beginning of the read sequence. May appear 0 or 1 times, and must be at the beginning of the pattern. Only 5P or 3P may appear, not both. (^ may be used instead of 5P.)
> - **3P**: denotes the end of the read sequence. May appear 0 or 1 times, and must be at the end of the pattern. (\$ may be used instead of 3P.)
> - **N**: denotes an arbitrary base.
> - **A, C, G, T**: denotes a fixed base that must match the read sequence exactly.
> - **(BC)**: denotes the Feature Barcode sequence as specified in the sequence column of the feature reference. Must appear exactly once in the pattern.

> Any constant sequences made up of A, C, G and T in the pattern must match exactly in the read sequence. Any N in the pattern is allowed to match a single arbitrary base. A modest number of fixed bases should be used to minimize the chance of a sequencing error disrupting the match. The fixed sequence should also be long enough to uniquely identify the position of the Feature Barcode. For feature types that require an non-N anchor, we recommend 12bp-20bp of constant sequence. The extracted Feature Barcode sequences are corrected up to a Hamming distance of 1 using the 10x barcode correction algorithm that is used to correct cell barcodes.
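
To make the pattern rules above concrete, here is a minimal sketch that translates a `pattern` string into a regular expression and pulls the Feature Barcode out of a read. It is not Cell Ranger's implementation: the function names are hypothetical, and `correct_barcode` is a plain "unique match within Hamming distance 1" rule standing in for the 10x correction algorithm, which is not reproduced here.

```python
import re


def pattern_to_regex(pattern, barcode_length):
    """Translate a feature-reference pattern into a compiled regular expression."""
    body, prefix, suffix = pattern, "", ""
    # 5P (or ^) anchors the pattern at the start of the read
    if body.startswith("5P"):
        body, prefix = body[2:], "^"
    elif body.startswith("^"):
        body, prefix = body[1:], "^"
    # 3P (or $) anchors the pattern at the end of the read
    if body.endswith("3P"):
        body, suffix = body[:-2], "$"
    elif body.endswith("$"):
        body, suffix = body[:-1], "$"
    if body.count("(BC)") != 1:
        raise ValueError("(BC) must appear exactly once in the pattern")
    before, after = body.split("(BC)")
    if set(before + after) - set("ACGTN"):
        raise ValueError("pattern may only contain A, C, G, T, N and (BC)")
    # N matches any single base; A, C, G, T must match exactly
    before_re = before.replace("N", ".")
    after_re = after.replace("N", ".")
    barcode_re = "([ACGTN]{%d})" % barcode_length
    return re.compile(prefix + before_re + barcode_re + after_re + suffix)


def extract_barcode(read, pattern, barcode_length):
    """Return the barcode found in `read`, or None if the pattern does not match."""
    match = pattern_to_regex(pattern, barcode_length).search(read)
    return match.group(1) if match else None


def correct_barcode(observed, known_barcodes):
    """Return the single known barcode within Hamming distance 1, else None."""
    hits = [
        bc for bc in known_barcodes
        if len(bc) == len(observed)
        and sum(a != b for a, b in zip(bc, observed)) <= 1
    ]
    return hits[0] if len(hits) == 1 else None


# TotalSeq-B style pattern and two barcodes from feature_ref.csv (CD45, CD3)
pattern = "5PNNNNNNNNNN(BC)NNNNNNNNN"
known = ["TGGCTATGGAGCAGA", "GTATGTCCGCTCGAT"]

# Made-up read: 10 arbitrary bases, the CD45 barcode with one error, 9 more bases
read = "ACGTACGTAC" + "TGGCTATGGAGCAGT" + "TTTTTTTTT"
observed = extract_barcode(read, pattern, barcode_length=15)
print(observed, "->", correct_barcode(observed, known))
```

With this pattern the barcode is read from a fixed offset, so no anchor sequence is involved; a pattern containing fixed bases would simply add literal characters to the regular expression.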



In [3]:
cat feature_ref.csv

id,name,read,pattern,sequence,feature_type
CD45,CD45_TotalSeqB,R2,5PNNNNNNNNNN(BC)NNNNNNNNN,TGGCTATGGAGCAGA,Antibody Capture
CD3,CD3_TotalSeqB,R2,5PNNNNNNNNNN(BC)NNNNNNNNN,GTATGTCCGCTCGAT,Antibody Capture
CD8a,CD8a_TotalSeqB,R2,5PNNNNNNNNNN(BC)NNNNNNNNN,TACCCGTAATAGCGT,Antibody Capture
CD4,CD4_TotalSeqB,R2,5PNNNNNNNNNN(BC)NNNNNNNNN,AACAAGACCCTTGAG,Antibody Capture
CD14,CD14_TotalSeqB,R2,5PNNNNNNNNNN(BC)NNNNNNNNN,AACCAACAGTCACGT,Antibody Capture
CD19,CD19_TotalSeqB,R2,5PNNNNNNNNNN(BC)NNNNNNNNN,ATCAGCCATGTCAGT,Antibody Capture
CD11c,CD11c_TotalSeqB,R2,5PNNNNNNNNNN(BC)NNNNNNNNN,GTTATGGACGCTTGC,Antibody Capture
PDL1,PDL1_TotalSeqB,R2,5PNNNNNNNNNN(BC)NNNNNNNNN,TCGATTCCACCAACT,Antibody Capture
TIM3,TIM3_TotalSeqB,R2,5PNNNNNNNNNN(BC)NNNNNNNNN,ATTGGCACTCAGATG,Antibody Capture
CTLA4,CTLA4_TotalSeqB,R2,5PNNNNNNNNNN(BC)NNNNNNNNN,AGTGTTTGTCCTGGT,Antibody Capture

Creating the libraries CSV files:

In [1]:
# import pandas as pd 
# import os 
# import itertools
# from glob import glob

# # PWD='/rumi/shams/abe/People/Bahar'
# PWD='/data_gilbert/home/aarab/People/Bahar'

# fastqs = sorted([(PWD+'/fastq/',f.split('_')[0]) for f in map(os.path.basename, glob("fastq/*R1*"))])

# library_types = [j for i in list(itertools.repeat(['Gene Expression','Antibody Capture'], 8)) for j in i]

# library = pd.DataFrame({
#     'fastqs': [path for path,_ in fastqs],
#     'sample': [sample for _,sample in fastqs],
#     'library_type': library_types})

# library.to_csv('library.csv',index=None)
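
The commented-out cell above writes a single `library.csv`, whereas the listing that follows shows one CSV per sample. A minimal sketch of how such per-sample files could be produced with the standard library is given here; the function name `write_libraries_csv` is hypothetical, and the `-GE` / `-SP` suffixes and the FASTQ directory are copied from the listing below. The demo writes into a temporary directory rather than `counts/`.

```python
import csv
import tempfile
from pathlib import Path

FASTQ_DIR = "/data_gilbert/home/aarab/People/Bahar/fastq/"


def write_libraries_csv(sample_id, fastq_dir, out_dir):
    """Write <out_dir>/<sample_id>.csv declaring the GE and SP libraries of one sample."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    rows = [
        (fastq_dir, f"{sample_id}-GE", "Gene Expression"),
        (fastq_dir, f"{sample_id}-SP", "Antibody Capture"),
    ]
    path = out_dir / f"{sample_id}.csv"
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, lineterminator="\n")
        writer.writerow(["fastqs", "sample", "library_type"])
        writer.writerows(rows)
    return path


# Demo in a temporary directory; pass "counts" to mirror the layout used here
demo_dir = tempfile.mkdtemp()
path = write_libraries_csv("101", FASTQ_DIR, demo_dir)
print(path.read_text())
```

Looping the call over the eight sample IDs would give one libraries file per `cellranger count` run.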

In [13]:
!for f in counts/*csv; do echo $f; cat $f; echo "________________________________________________"; done 

counts/101.csv
fastqs,sample,library_type
/data_gilbert/home/aarab/People/Bahar/fastq/,101-GE,Gene Expression
/data_gilbert/home/aarab/People/Bahar/fastq/,101-SP,Antibody Capture
________________________________________________
counts/103.csv
fastqs,sample,library_type
/data_gilbert/home/aarab/People/Bahar/fastq/,103-GE,Gene Expression
/data_gilbert/home/aarab/People/Bahar/fastq/,103-SP,Antibody Capture
________________________________________________
counts/104.csv
fastqs,sample,library_type
/data_gilbert/home/aarab/People/Bahar/fastq/,104-GE,Gene Expression
/data_gilbert/home/aarab/People/Bahar/fastq/,104-SP,Antibody Capture
________________________________________________
counts/105.csv
fastqs,sample,library_type
/data_gilbert/home/aarab/People/Bahar/fastq/,105-GE,Gene Expression
/data_gilbert/home/aarab/People/Bahar/fastq/,105-SP,Antibody Capture
________________________________________________
counts/106.csv
fastqs,sample,library_type
/data_gilbert/home/aarab

## 2. Preprocessing
- https://scanpy-tutorials.readthedocs.io/en/latest/cite-seq/pbmc5k.html

Load cellranger results

In [2]:
import os 
import pickle
import numpy as np 
import pandas as pd
import scanpy as sc
import scanpy.external as sce
import anndata as ad
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
from glob import glob

sc.settings.verbosity = 1             # verbosity: errors (0), warnings (1), info (2), hints (3)
sc.settings.set_figure_params(dpi=80, frameon=False, figsize=(3, 3), facecolor='white')

sc.logging.print_header()

scanpy==1.7.2 anndata==0.7.6 umap==0.5.1 numpy==1.19.1 scipy==1.5.2 pandas==1.1.3 scikit-learn==0.23.2 statsmodels==0.12.0 python-igraph==0.9.6 louvain==0.7.0


In [230]:
# https://stackoverflow.com/questions/21884271/warning-about-too-many-open-figures
import matplotlib.pyplot as plt
plt.rcParams.update({'figure.max_open_warning': 0})

plt.close('all')

In [2]:
# https://www.techcoil.com/blog/how-to-save-and-load-objects-to-and-from-file-in-python-via-facilities-from-the-pickle-module/

def save_obj(obj, name):
    with open(name, 'wb') as f:
        pickle.dump(obj, f, pickle.HIGHEST_PROTOCOL)

def load_obj(name):
    with open(name, 'rb') as f:
        return pickle.load(f)

In [3]:
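def preprocessing(adata):
    """Remove predicted doublets, keep raw counts in a layer, drop empty features."""
    adata.var_names_make_unique()

    # Remove doublets: keep cells scoring below scrublet's automatic threshold.
    sce.pp.scrublet(adata)
    # .copy() turns the sliced view into a real AnnData before it is modified.
    adata = adata[adata.obs.doublet_score < adata.uns['scrublet']['threshold']].copy()

    # Basic filtering: store the raw counts, then drop features with zero counts.
    adata.layers["counts"] = adata.X.copy()
    sc.pp.filter_genes(adata, min_counts=1)

    return adata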

#     sc.pp.filter_genes(adata, min_cells=10)
#     sc.pp.filter_cells(adata, min_genes=150)
#     # Identify highly-variable genes.
#     sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
#     adata.raw = adata
#     # Actually do the filtering
#     adata = adata[:, adata.var.highly_variable]

### Label samples

<!-- https://www.kallistobus.tools/tutorials/kb_aggregate/python/kb_aggregating_count_matrices.html -->

**Experimental design**

> 2 biological replicates per condition
> - 101, 103 --> Nuj selected
> - 104, 105 --> Balb selected
> - 108, 106 --> Rag selected
> - 109, 112 --> NSG selected
<!-- > each biological replicate was extracted from mice and ran on 10x on different days
so:
> - 101, 104,108,109 were extracted, processed loaded on 10x chromium on 8/16/21
> - 103, 105,106,112 were extracted, processed loaded on 10x chromium on 8/17/21
> The library preparation for all 8 samples was done simultaneously. Two libraries were generated per sample:
>- GE: gene expression library (RNA fraction)
>- SP: surface protein library (citeseq fraction)

> therefore I had total of 16 libraries submitted for sequencing.

> The ratio of SP/GE libraries were 1:5 in the final sample pool.
-->

In [4]:
%%time 
samples = [
    'Nuj-rep1','Nuj-rep2',
    'Balb-rep1','Balb-rep2',
    'Rag-rep1','Rag-rep2',
    'NSG-rep1','NSG-rep2'
]

data = {}
for sam, file in zip(samples, sorted(glob('counts/*/outs/filtered_feature_bc_matrix.h5'))):
    data[sam] = {}
    print ('_'*50)
    print (sam, file)
    
    adata = sc.read_10x_h5(file, gex_only=False)
    adata = preprocessing(adata)
    
    data[sam]['main'] = adata
    del adata

__________________________________________________
Nuj-rep1 counts/101/outs/filtered_feature_bc_matrix.h5


Variable names are not unique. To make them unique, call `.var_names_make_unique`.
  view_to_actual(adata)


Automatically set threshold at doublet score = 0.34
Detected doublet rate = 1.1%
Estimated detectable doublet fraction = 28.4%
Overall doublet rate:
	Expected   = 5.0%
	Estimated  = 3.8%
__________________________________________________
Nuj-rep2 counts/103/outs/filtered_feature_bc_matrix.h5


Variable names are not unique. To make them unique, call `.var_names_make_unique`.
  view_to_actual(adata)


Automatically set threshold at doublet score = 0.36
Detected doublet rate = 0.7%
Estimated detectable doublet fraction = 18.6%
Overall doublet rate:
	Expected   = 5.0%
	Estimated  = 3.7%
__________________________________________________
Balb-rep1 counts/104/outs/filtered_feature_bc_matrix.h5


Variable names are not unique. To make them unique, call `.var_names_make_unique`.
  view_to_actual(adata)


Automatically set threshold at doublet score = 0.52
Detected doublet rate = 0.4%
Estimated detectable doublet fraction = 19.2%
Overall doublet rate:
	Expected   = 5.0%
	Estimated  = 2.2%
__________________________________________________
Balb-rep2 counts/105/outs/filtered_feature_bc_matrix.h5


Variable names are not unique. To make them unique, call `.var_names_make_unique`.
  view_to_actual(adata)


Automatically set threshold at doublet score = 0.42
Detected doublet rate = 0.6%
Estimated detectable doublet fraction = 19.8%
Overall doublet rate:
	Expected   = 5.0%
	Estimated  = 3.1%
__________________________________________________
Rag-rep1 counts/106/outs/filtered_feature_bc_matrix.h5


Variable names are not unique. To make them unique, call `.var_names_make_unique`.
  view_to_actual(adata)


Automatically set threshold at doublet score = 0.42
Detected doublet rate = 0.5%
Estimated detectable doublet fraction = 16.7%
Overall doublet rate:
	Expected   = 5.0%
	Estimated  = 2.9%
__________________________________________________
Rag-rep2 counts/108/outs/filtered_feature_bc_matrix.h5


Variable names are not unique. To make them unique, call `.var_names_make_unique`.
  view_to_actual(adata)


Automatically set threshold at doublet score = 0.51
Detected doublet rate = 0.5%
Estimated detectable doublet fraction = 19.4%
Overall doublet rate:
	Expected   = 5.0%
	Estimated  = 2.8%
__________________________________________________
NSG-rep1 counts/109/outs/filtered_feature_bc_matrix.h5


Variable names are not unique. To make them unique, call `.var_names_make_unique`.
  view_to_actual(adata)


Automatically set threshold at doublet score = 0.33
Detected doublet rate = 1.4%
Estimated detectable doublet fraction = 39.0%
Overall doublet rate:
	Expected   = 5.0%
	Estimated  = 3.7%
__________________________________________________
NSG-rep2 counts/112/outs/filtered_feature_bc_matrix.h5


Variable names are not unique. To make them unique, call `.var_names_make_unique`.
  view_to_actual(adata)


Automatically set threshold at doublet score = 0.26
Detected doublet rate = 2.0%
Estimated detectable doublet fraction = 38.9%
Overall doublet rate:
	Expected   = 5.0%
	Estimated  = 5.1%
CPU times: user 5min 21s, sys: 12.4 s, total: 5min 34s
Wall time: 52.8 s
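
The loading loop pairs labels with files by position, so it depends on `sorted(glob(...))` returning the runs in exactly the order of `samples`. A small sketch of a more explicit alternative is shown here: the run ID is read from the path and looked up in a dictionary. The pairing is copied from the output printed above; `SAMPLE_LABELS` and `label_for` are hypothetical names.

```python
from pathlib import Path

# Run ID -> sample label, as paired in the output printed above
SAMPLE_LABELS = {
    "101": "Nuj-rep1", "103": "Nuj-rep2",
    "104": "Balb-rep1", "105": "Balb-rep2",
    "106": "Rag-rep1", "108": "Rag-rep2",
    "109": "NSG-rep1", "112": "NSG-rep2",
}


def label_for(h5_path):
    """Look up the sample label from a counts/<run_id>/outs/<file> path."""
    run_id = Path(h5_path).parts[-3]
    # A KeyError here flags a run directory that is missing from the mapping
    return SAMPLE_LABELS[run_id]


print(label_for("counts/101/outs/filtered_feature_bc_matrix.h5"))
```

With this lookup, a missing or extra run directory raises an error instead of silently shifting every label by one position.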


In [5]:
!mkdir -pv preprocessing

In [6]:
name = 'preprocessing/rawcounts.pkl'

save_obj(data, name)

In [7]:
ls -s --block-size=M preprocessing/rawcounts.pkl

1292M preprocessing/rawcounts.pkl


___


In [74]:
name = 'preprocessing/rawcounts.pkl'
data = load_obj(name)

## 3. Basic clustering 
Annotate data based on mRNA expression and surface protein abundance 

In [280]:
def clustering(adata,lognorm=True,n_neighbor=30,res=None):
    if lognorm: 
        sc.pp.log1p(adata)
    sc.pp.pca(adata)
    sc.pp.neighbors(adata, n_neighbors=n_neighbor)   # why can't we just work with the default neighbors?
    sc.tl.umap(adata)
    sc.tl.leiden(adata,resolution=res)


# def join_graphs_max(g1: "sparse.spmatrix", g2: "sparse.spmatrix"):
#     """Take the maximum edge value from each graph."""
#     out = g1.copy()
#     mask = g1 < g2
#     out[mask] = g2[mask]
#     return out
# 
#     adata.obsp["connectivities"] = join_graphs_max(adata.obsp["rna_connectivities"], adata.obsp["protein_connectivities"])
#     sc.pp.neighbors(adata, n_neighbors=30)   # why can't we just work with the default neighbors?
#     sc.tl.leiden(adata, key_added="joint_leiden")
#     sc.tl.umap(adata)

def annotate_rna_protein(adata):
    
    protein = adata[:, adata.var["feature_types"] == "Antibody Capture"].copy()
    
    clustering(protein)
    
    protein.var.set_index('gene_ids',inplace=True)    
    adata.obs = pd.concat([adata.obs,protein.to_df()],axis=1) 
    
    print ('protein clustering is done!')
    
    rna = adata[:, adata.var["feature_types"] == "Gene Expression"].copy()
    clustering(rna)

    print ('RNA clustering is done!')

    odata = rna.copy()
    
    odata.obsm["protein_umap"] = protein.obsm["X_umap"]
    odata.obs["protein_leiden"] = protein.obs["leiden"]
    odata.obsp["protein_connectivities"] = protein.obsp["connectivities"].copy()

    odata.obsm["rna_umap"] = rna.obsm["X_umap"]
    del odata.obs["leiden"]
    odata.obs["rna_leiden"] = rna.obs["leiden"]
    odata.obsp["rna_connectivities"] = rna.obsp["connectivities"].copy()
    
    return odata

In [None]:
%%time 
for sample in data:
    print (sample)
    data[sample]['main'] = annotate_rna_protein(data[sample]['main'])
    # Save `adata`s to file
    data[sample]['main'].write(f'preprocessing/{sample}.h5ad.gz',compression='gzip')

### Draw UMAP plots

In [87]:
%%time 
from matplotlib import pyplot as plot
from matplotlib.backends.backend_pdf import PdfPages

os.makedirs('figures', exist_ok=True)

# The PDF document: one page per sample, written to disk when the block exits
with PdfPages('figures/umap.pdf') as pdf_pages:
    for sample in data:
        adata = data[sample]['main'].copy()

        print ('\033[1m' + sample)
        # sc.pl.embedding creates its own figure, so no separate plot.figure() call is needed
        fig = sc.pl.embedding(adata, basis='rna_umap',color=adata.obs.columns[2:], size=10,show=False,return_fig=True)
        fig.suptitle(sample +'\n'+str(adata.shape[0])+' cells, '+str(adata.shape[1])+' genes', fontsize=20)

        fig.set_size_inches(18.5, 10.5) # https://stackoverflow.com/questions/332289/how-do-you-change-the-size-of-figures-drawn-with-matplotlib

        # Done with the page
        pdf_pages.savefig(fig)
        plt.close(fig)

Nuj-rep1
Nuj-rep2
Balb-rep1
Balb-rep2
Rag-rep1
Rag-rep2
NSG-rep1
NSG-rep2
CPU times: user 11 s, sys: 460 ms, total: 11.4 s
Wall time: 11.4 s


## 4. Huge heatmap!

In [333]:
mergedadata_2 = ad.AnnData(
        X=mergedadata.X,
        obs=mergedadata.obs.loc[:,["CD45","CD3","CD8a","CD4","CD14","CD19","CD11c","PDL1","TIM3","CTLA4","dataset"]]
    )

mergedadata_2.var.index = mergedadata.var.index

Observation names are not unique. To make them unique, call `.obs_names_make_unique`.


In [334]:
clustering(mergedadata_2,lognorm=False,n_neighbor=10,res=1)

In [335]:
sc.pl.violin(mergedadata_2, mergedadata_2.obs.columns[:-2], groupby='leiden',save='_mRNA.pdf')



In [336]:
sc.pl.heatmap(mergedadata_2, mergedadata_2.obs.columns[:-2], groupby='leiden', dendrogram=True,save='.pdf',figsize=(5,30))



In [337]:
sc.pl.umap(mergedadata_2,color='dataset',save='_dataset.pdf')



In [338]:
# The PDF document: one page per CITE-seq marker, written to disk when the block exits
with PdfPages('figures/violin_cite.pdf') as pdf_pages:
    for cite in mergedadata_2.obs.columns[:-2]:
        # Create a figure instance (i.e. a new page)
        fig, ax = plt.subplots(figsize=(12, 6), dpi=100)

        # Plot the marker across leiden clusters
        print ('\033[1m' + cite)
        sc.pl.violin(mergedadata_2, cite, groupby='leiden',stripplot=False,ax=ax,rotation=90)
        # Done with the page; close it so figures do not accumulate
        pdf_pages.savefig(fig)
        plt.close(fig)

CD45
CD3
CD8a
CD4
CD14
CD19
CD11c
PDL1
TIM3
CTLA4


In [339]:
sc.pl.stacked_violin(
    mergedadata_2,
    mergedadata_2.obs.columns[:-2],
    groupby='leiden',
    title='mRNA Clusters',
    save='mRNA.pdf',
    dendrogram=True
)



In [340]:
# Find marker genes for each 'leiden' cluster.
sc.tl.rank_genes_groups(mergedadata_2, 'leiden', method='wilcoxon', n_genes=25)

In [350]:
sc.pl.rank_genes_groups_heatmap(mergedadata_2, n_genes=3, use_raw=False, vmin=-3, vmax=3, show_gene_labels=True,cmap='bwr',save='_rank_genes_groups_leiden.pdf',figsize=(30,60))



In [144]:
# import altair as alt
# from functools import partial

# alt.renderers.enable("png")
# alt.data_transformers.disable_max_rows()

# def embedding_chart(df: pd.DataFrame, coord_pat: str, *, size=5) -> alt.Chart:
#     """Make schema for coordinates, like sc.pl.embedding."""
#     x, y = df.columns[df.columns.str.contains(coord_pat)]
#     return (
#         alt.Chart(plotdf, height=300, width=300)
#         .mark_circle(size=size)
#         .encode(
#             x=alt.X(x, axis=None),
#             y=alt.Y(y, axis=None),
#         )
#     )

# def umap_chart(df: pd.DataFrame, **kwargs) -> alt.Chart:
#     """Like sc.pl.umap, but just the coordinates."""
#     return embedding_chart(df, "umap", **kwargs)

# def encode_color(c: alt.Chart, col: str, *, qdomain=(0, 1), scheme: str = "lightgreyred") -> alt.Chart:
#     """Add colors to an embedding plot schema."""
#     base = c.properties(title=col)
#     if pd.api.types.is_categorical(c.data[col]):
#         return base.encode(color=col)
#     else:
#         return base.encode(
#             color=alt.Color(
#                 col,
#                 scale=alt.Scale(
#                     scheme=scheme,
#                     clamp=True,
#                     domain=list(c.data[col].quantile(qdomain)),
#                     nice=True,
#                 )
#             )
#         )
#     plotdf = sc.get.obs_df(
#         rna,
#         obsm_keys=[("X_umap", i) for i in range(2)] + [("protein", i) for i in rna.obsm["protein"].columns]
#     )

#     fig = (
#         alt.concat(
#             *map(partial(encode_color, umap_chart(plotdf), qdomain=(0, .95)), plotdf.columns[3:]),
#             columns=2
#         )
#         .resolve_scale(color='independent')
#         .configure_axis(grid=True)
#     )

In [3]:
# def preprocessing(adata):
#     # Basic filtering:
#     sc.pp.filter_genes(adata, min_cells=10)
#     sc.pp.filter_cells(adata, min_genes=150)
#     # annotate the group of mitochondrial genes as 'mt'
#     adata.var['mt'] = adata.var.index.str.startswith('MT-')  
#     # Remove cells that have too many mitochondrial genes expressed or too many total counts:
#     sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True)
#     adata = adata[adata.obs.n_genes_by_counts < 2500, :]
#     adata = adata[adata.obs.pct_counts_mt < 5, :]
#     # Total-count normalize (library-size correct) the data matrix 
#     # 𝐗 to 10,000 reads per cell, so that counts become comparable among cells.
#     sc.pp.normalize_per_cell(adata)
#     # Logarithmize the data:
#     sc.pp.log1p(adata)
#     # Identify highly-variable genes.
#     sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
#     adata.raw = adata
#     # Actually do the filtering
#     adata = adata[:, adata.var.highly_variable]
#     # Regress out effects of total counts per cell and the percentage of mitochondrial genes expressed. 
#     # Scale the data to unit variance.
#     sc.pp.regress_out(adata, ['total_counts', 'pct_counts_mt'])
#     # Scale each gene to unit variance. Clip values exceeding standard deviation 10.
#     sc.pp.scale(adata, max_value=10)

Export a zip file of the three MatrixMarket files (.mtx, .mtx_cols, .mtx_rows):

In [131]:
# from zipfile import ZipFile
# # https://thispointer.com/python-how-to-create-a-zip-archive-from-multiple-files-or-directory/
# def write_adata(adata,name):
#     adata.obs.to_csv(f'{name}.mtx_rows.gz',compression='gzip')
#     adata.var.to_csv(f'{name}.mtx_cols.gz',compression='gzip')
#     adata.to_df().to_csv(f'{name}.mtx.gz', compression='gzip')

#     # create a ZipFile object
#     zipObj = ZipFile(f'{name}.zip', 'w')
#     # Add multiple files to the zip
#     zipObj.write(f'{name}.mtx_rows.gz')
#     zipObj.write(f'{name}.mtx_cols.gz')
#     zipObj.write(f'{name}.mtx.gz')
#     # close the Zip File
#     zipObj.close()

In [132]:
# write_adata(mix, 'preprocessing/mix')