### Doublet identification & filtering

In [1]:
import scanpy as sc
import pandas as pd
import matplotlib.pyplot as plt
import scvi
import hdf5plugin
import os

- Most technologies used for scRNA sequencing are droplet-based.

- They rely on micro-droplets that carry barcode molecules (DNA barcodes). Individual cells are isolated inside the droplets, and their contents then undergo next-generation sequencing (NGS).

- Occasionally, two cells are captured by mistake in the same micro-droplet. This is how doublets arise.

- Doublets can be homotypic (the expression profiles of the two cells are similar) or heterotypic (the expression profiles of the two cells are different).

- The result is a transcriptomic profile that appears to come from a single cell but in fact comes from the combination of two cells, which can lead to erroneous conclusions.

- Machine-learning-based methods have been proposed for identifying and removing doublets (DoubletFinder, SOLO, etc.). These methods share the same basic steps:

     - 1) Simulation of "artificial" doublets from the cells of the dataset (which are themselves a mixture of singlets and doublets).
     - 2) Training of a classifier to detect the "artificial" doublets.
     - 3) Application of the classifier to the real dataset to predict and remove doublets.

- Like DoubletFinder, SOLO creates "artificial" doublets from the data. Pairs of cells are sampled at random and their expression profiles are added together; DoubletFinder then divides the sum by 2 (averaging), whereas the scvi-tools implementation of SOLO, as far as we can tell, keeps the summed counts. The process is repeated to create N "artificial" doublets.

- The classifier is trained to distinguish the "artificial" doublets from the real gene expression data.

- Training takes place in a non-linear, reduced-dimensionality feature space (the latent space of a variational autoencoder from the scVI library).

- The classifier used to detect doublets is an artificial neural network from the scVI library.
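The simulation step described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the scVI or DoubletFinder implementation: `simulate_doublets`, its `average` flag and the toy count matrix are hypothetical names and values introduced here.

```python
import random

def simulate_doublets(counts, n_doublets, average=False, seed=0):
    """Build artificial doublets from random pairs of observed cells.

    counts: list of per-cell count vectors (lists of equal length).
    average: if True, divide the summed profile by 2; otherwise keep the sum.
    """
    rng = random.Random(seed)
    doublets = []
    for _ in range(n_doublets):
        # Sample two distinct cells and add their expression profiles
        i, j = rng.sample(range(len(counts)), 2)
        summed = [a + b for a, b in zip(counts[i], counts[j])]
        doublets.append([v / 2 for v in summed] if average else summed)
    return doublets

# Toy count matrix: 4 cells x 3 genes (made-up values)
cells = [[5, 0, 1], [0, 7, 2], [3, 3, 0], [1, 0, 9]]
artificial = simulate_doublets(cells, n_doublets=6)

# Labelled training set for a classifier: 0 = observed cell, 1 = artificial doublet
X = cells + artificial
y = [0] * len(cells) + [1] * len(artificial)
print(len(X), sum(y))
```

A classifier trained on `X` and `y` learns to separate simulated doublets from observed profiles. Applying it back to the observed cells flags those that look like a combination of two cells.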

In [2]:
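def filter_doublets(directory="./data/3.H5AD_filtered",
                    output_directory="./data/4.H5AD_filtered_without_doublets"):
    """Predict doublets with SOLO for every .h5ad file in `directory` and save the singlets."""
    # Create the output directory if it does not exist yet
    os.makedirs(output_directory, exist_ok=True)

    # List all files in the input directory and keep only .h5ad files
    for file in os.listdir(directory):
        if not file.endswith(".h5ad"):
            continue
        file_path = os.path.join(directory, file)
        print(f"Processing file: {file_path}")

        # Train on the 3000 most variable genes (flavor="seurat_v3" expects raw counts)
        adata = sc.read_h5ad(file_path)
        sc.pp.highly_variable_genes(adata, n_top_genes=3000, subset=True, flavor="seurat_v3", span=0.8)

        # Fit the scVI variational autoencoder, then the SOLO classifier on top of it
        scvi.model.SCVI.setup_anndata(adata)
        vae = scvi.model.SCVI(adata)
        vae.train()
        solo = scvi.external.SOLO.from_scvi_model(vae)
        solo.train()

        # Per-cell scores, plus hard labels ("singlet" / "doublet") in the "prediction" column
        df = solo.predict()
        df["prediction"] = solo.predict(soft=False)
        doublet_dic = dict(zip(df.index, df.prediction))

        # Reload the full gene set; cells without a prediction are labelled "filtered"
        adata = sc.read_h5ad(file_path)
        adata.obs["doublet"] = [doublet_dic.get(barcode, "filtered") for barcode in adata.obs_names]

        # Keep only the predicted singlets and write them with zstd compression
        adata = adata[adata.obs["doublet"] == "singlet"].copy()
        adata.write_h5ad(
            os.path.join(output_directory, f"fwd_{file}"),
            compression=hdf5plugin.FILTERS["zstd"],
        )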
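The labelling and filtering logic at the end of the function can be shown on toy data without training any model. This is a minimal sketch: the barcodes and scores are made up, `hard_labels` is a hypothetical helper, and it assumes that a hard label is simply the class with the higher score.

```python
from collections import Counter

def hard_labels(soft_scores):
    """Pick the higher-scoring class for each barcode."""
    return {barcode: max(scores, key=scores.get) for barcode, scores in soft_scores.items()}

# Hypothetical per-cell scores for three barcodes (made-up values)
soft = {
    "AAAC-1": {"doublet": 0.2, "singlet": 1.4},
    "AAAG-1": {"doublet": 2.1, "singlet": -0.3},
    "AACT-1": {"doublet": -1.0, "singlet": 0.8},
}
labels = hard_labels(soft)

# A barcode with no prediction falls back to the "filtered" label
all_barcodes = ["AAAC-1", "AAAG-1", "AACT-1", "AAGG-1"]
annotated = {b: labels.get(b, "filtered") for b in all_barcodes}

# Only the barcodes labelled "singlet" are kept
kept = [b for b, label in annotated.items() if label == "singlet"]
print(Counter(annotated.values()))
print(kept)
```

Printing the label counts per sample before filtering is a cheap sanity check: an unexpectedly high share of doublets usually deserves a closer look before the cells are discarded.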

In [3]:
filter_doublets()

Processing file: ./data/3.H5AD_filtered/filtered_GSM4432635_Control.h5ad


GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
/home/kostas/miniconda3/envs/scrna/lib/python3.9/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:441: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=7` in the `DataLoader` to improve performance.


Epoch 400/400: 100%|██████████████████| 400/400 [01:20<00:00,  5.12it/s, v_num=1, train_loss_step=1.3e+3, train_loss_epoch=1.2e+3]

`Trainer.fit` stopped: `max_epochs=400` reached.


Epoch 400/400: 100%|██████████████████| 400/400 [01:20<00:00,  4.96it/s, v_num=1, train_loss_step=1.3e+3, train_loss_epoch=1.2e+3]
INFO     Creating doublets, preparing SOLO model.


GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
/home/kostas/miniconda3/envs/scrna/lib/python3.9/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:441: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=7` in the `DataLoader` to improve performance.
/home/kostas/miniconda3/envs/scrna/lib/python3.9/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:441: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=7` in the `DataLoader` to improve performance.


Epoch 254/400:  64%|████████████▋       | 254/400 [00:39<00:22,  6.51it/s, v_num=1, train_loss_step=0.338, train_loss_epoch=0.257]
Monitored metric validation_loss did not improve in the last 30 records. Best score: 0.243. Signaling Trainer to stop.




Processing file: ./data/3.H5AD_filtered/filtered_GSM4432637_Control.h5ad


GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
/home/kostas/miniconda3/envs/scrna/lib/python3.9/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:441: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=7` in the `DataLoader` to improve performance.


Epoch 400/400: 100%|████████████████████████| 400/400 [00:37<00:00, 10.68it/s, v_num=1, train_loss_step=776, train_loss_epoch=775]

`Trainer.fit` stopped: `max_epochs=400` reached.


Epoch 400/400: 100%|████████████████████████| 400/400 [00:37<00:00, 10.68it/s, v_num=1, train_loss_step=776, train_loss_epoch=775]
INFO     Creating doublets, preparing SOLO model.


GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
/home/kostas/miniconda3/envs/scrna/lib/python3.9/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:441: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=7` in the `DataLoader` to improve performance.
/home/kostas/miniconda3/envs/scrna/lib/python3.9/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:441: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=7` in the `DataLoader` to improve performance.


Epoch 241/400:  60%|████████████        | 241/400 [00:18<00:11, 13.38it/s, v_num=1, train_loss_step=0.202, train_loss_epoch=0.181]
Monitored metric validation_loss did not improve in the last 30 records. Best score: 0.259. Signaling Trainer to stop.




Processing file: ./data/3.H5AD_filtered/filtered_GSM4432638_Alzheimer.h5ad


GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
/home/kostas/miniconda3/envs/scrna/lib/python3.9/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:441: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=7` in the `DataLoader` to improve performance.


Epoch 400/400: 100%|████████████████| 400/400 [01:09<00:00,  5.78it/s, v_num=1, train_loss_step=1.01e+3, train_loss_epoch=1.02e+3]

`Trainer.fit` stopped: `max_epochs=400` reached.


Epoch 400/400: 100%|████████████████| 400/400 [01:09<00:00,  5.75it/s, v_num=1, train_loss_step=1.01e+3, train_loss_epoch=1.02e+3]
INFO     Creating doublets, preparing SOLO model.


GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
/home/kostas/miniconda3/envs/scrna/lib/python3.9/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:441: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=7` in the `DataLoader` to improve performance.
/home/kostas/miniconda3/envs/scrna/lib/python3.9/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:441: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=7` in the `DataLoader` to improve performance.


Epoch 293/400:  73%|██████████████▋     | 293/400 [00:40<00:14,  7.16it/s, v_num=1, train_loss_step=0.413, train_loss_epoch=0.232]
Monitored metric validation_loss did not improve in the last 30 records. Best score: 0.237. Signaling Trainer to stop.




Processing file: ./data/3.H5AD_filtered/filtered_GSM4432636_Control.h5ad


GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
/home/kostas/miniconda3/envs/scrna/lib/python3.9/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:441: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=7` in the `DataLoader` to improve performance.
/home/kostas/miniconda3/envs/scrna/lib/python3.9/site-packages/lightning/pytorch/loops/fit_loop.py:293: The number of training batches (8) is smaller than the logging interval Trainer(log_every_n_steps=10). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.


Epoch 400/400: 100%|████████████████████████| 400/400 [00:28<00:00, 14.33it/s, v_num=1, train_loss_step=968, train_loss_epoch=966]

`Trainer.fit` stopped: `max_epochs=400` reached.


Epoch 400/400: 100%|████████████████████████| 400/400 [00:28<00:00, 14.21it/s, v_num=1, train_loss_step=968, train_loss_epoch=966]
INFO     Creating doublets, preparing SOLO model.


GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
/home/kostas/miniconda3/envs/scrna/lib/python3.9/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:441: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=7` in the `DataLoader` to improve performance.
/home/kostas/miniconda3/envs/scrna/lib/python3.9/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:441: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=7` in the `DataLoader` to improve performance.


Epoch 313/400:  78%|███████████████▋    | 313/400 [00:17<00:04, 18.18it/s, v_num=1, train_loss_step=0.391, train_loss_epoch=0.208]
Monitored metric validation_loss did not improve in the last 30 records. Best score: 0.236. Signaling Trainer to stop.




Processing file: ./data/3.H5AD_filtered/filtered_GSM4432639_Alzheimer.h5ad


GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
/home/kostas/miniconda3/envs/scrna/lib/python3.9/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:441: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=7` in the `DataLoader` to improve performance.


Epoch 400/400: 100%|████████████████████████| 400/400 [00:47<00:00,  8.38it/s, v_num=1, train_loss_step=793, train_loss_epoch=811]

`Trainer.fit` stopped: `max_epochs=400` reached.


Epoch 400/400: 100%|████████████████████████| 400/400 [00:47<00:00,  8.44it/s, v_num=1, train_loss_step=793, train_loss_epoch=811]
INFO     Creating doublets, preparing SOLO model.


GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
/home/kostas/miniconda3/envs/scrna/lib/python3.9/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:441: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=7` in the `DataLoader` to improve performance.
/home/kostas/miniconda3/envs/scrna/lib/python3.9/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:441: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=7` in the `DataLoader` to improve performance.


Epoch 347/400:  87%|█████████████████▎  | 347/400 [00:32<00:04, 10.61it/s, v_num=1, train_loss_step=0.0979, train_loss_epoch=0.16]
Monitored metric validation_loss did not improve in the last 30 records. Best score: 0.167. Signaling Trainer to stop.




Processing file: ./data/3.H5AD_filtered/filtered_GSM4432641_Alzheimer.h5ad


GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
/home/kostas/miniconda3/envs/scrna/lib/python3.9/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:441: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=7` in the `DataLoader` to improve performance.


Epoch 400/400: 100%|████████████████| 400/400 [01:28<00:00,  4.53it/s, v_num=1, train_loss_step=1.34e+3, train_loss_epoch=1.33e+3]

`Trainer.fit` stopped: `max_epochs=400` reached.


Epoch 400/400: 100%|████████████████| 400/400 [01:28<00:00,  4.54it/s, v_num=1, train_loss_step=1.34e+3, train_loss_epoch=1.33e+3]
INFO     Creating doublets, preparing SOLO model.


GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
/home/kostas/miniconda3/envs/scrna/lib/python3.9/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:441: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=7` in the `DataLoader` to improve performance.
/home/kostas/miniconda3/envs/scrna/lib/python3.9/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:441: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=7` in the `DataLoader` to improve performance.


Epoch 281/400:  70%|██████████████      | 281/400 [00:48<00:20,  5.78it/s, v_num=1, train_loss_step=0.193, train_loss_epoch=0.215]
Monitored metric validation_loss did not improve in the last 30 records. Best score: 0.215. Signaling Trainer to stop.




Processing file: ./data/3.H5AD_filtered/filtered_GSM4432640_Alzheimer.h5ad


GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
/home/kostas/miniconda3/envs/scrna/lib/python3.9/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:441: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=7` in the `DataLoader` to improve performance.
/home/kostas/miniconda3/envs/scrna/lib/python3.9/site-packages/lightning/pytorch/loops/fit_loop.py:293: The number of training batches (4) is smaller than the logging interval Trainer(log_every_n_steps=10). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.


Epoch 400/400: 100%|████████████████| 400/400 [00:15<00:00, 26.07it/s, v_num=1, train_loss_step=1.49e+3, train_loss_epoch=1.41e+3]

`Trainer.fit` stopped: `max_epochs=400` reached.


Epoch 400/400: 100%|████████████████| 400/400 [00:15<00:00, 26.01it/s, v_num=1, train_loss_step=1.49e+3, train_loss_epoch=1.41e+3]
INFO     Creating doublets, preparing SOLO model.


GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
/home/kostas/miniconda3/envs/scrna/lib/python3.9/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:441: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=7` in the `DataLoader` to improve performance.
/home/kostas/miniconda3/envs/scrna/lib/python3.9/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:441: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=7` in the `DataLoader` to improve performance.


Epoch 310/400:  78%|███████████████▌    | 310/400 [00:09<00:02, 31.88it/s, v_num=1, train_loss_step=0.238, train_loss_epoch=0.304]
Monitored metric validation_loss did not improve in the last 30 records. Best score: 0.376. Signaling Trainer to stop.


