# Generate Feature Matrix and Adjacency Matrix

Generate the **feature matrix** and **adjacency matrix** using the raw exon read counts and raw junction read counts obtained from the count files. For detailed instructions on aligning the RNA-seq data, refer to the following tutorials:

- **Full-length RNA-seq**: [step1_1_preprocess_full_length.md](./step1_1_preprocess_full_length.md)
- **10X RNA-seq**: [step1_2_preprocess_10X.md](step1_2_preprocess_10X.md)

Before running this section, make sure your count data and metadata file are ready, along with the `dolphin_adj_index.csv` and `dolphin.exon.pkl` files. Both files can be generated with the [step0_generate_exon_gtf_final.ipynb](./step0_generate_exon_gtf_final.ipynb) notebook.

> 💡 **Note:** These two files are paired: the `.pkl` file contains exon-level information, and `dolphin_adj_index.csv` stores adjacency location data.  
> The exon order in the adjacency index file matches the exon order in the pickle file. **They must be used together.**

For human GRCh38, you can directly download the necessary files from [here](https://mcgill-my.sharepoint.com/:f:/g/personal/kailu_song_mail_mcgill_ca/EvZtHeW7qjJJs_RHc2-327ABeLXafa-ruvfk9Vs134crig?e=TsGsk8).
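
Before starting a long run, it can help to confirm that the paired files are in place. The following is a minimal sketch, not part of DOLPHIN: `find_missing_inputs` is a hypothetical helper, and the `./dolphin_exon_gtf/` directory layout is assumed from the paths used later in this tutorial.

```python
from pathlib import Path


def find_missing_inputs(paths):
    """Return the entries of `paths` that do not exist on disk, as strings.

    Hypothetical helper for illustration; it is not a DOLPHIN function.
    """
    return [str(p) for p in map(Path, paths) if not p.exists()]


# File names come from this tutorial; the directory layout is an assumption.
required = [
    "./dolphin_exon_gtf/dolphin.exon.pkl",
    "./dolphin_exon_gtf/dolphin_adj_index.csv",
]

missing = find_missing_inputs(required)
if missing:
    print("Missing input files:", ", ".join(missing))
else:
    print("All required input files found.")
```

Because the two files must be used together, regenerate or download them as a pair if either one is reported missing.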

### Metadata File Format

The metadata file should be tab-separated, with the **"CB"** (cell barcode) column as the only mandatory column. You can add as many metadata columns as needed. These additional metadata columns will be retained in all output `adata` files as **obs** (observations).

| CB                      | cluster           | 
|--------------------------|-------------------|
| T8_AAAGCAAGTCGCGAAA      | Stellate cell     | 
| T8_AAATGCCAGCTGGAAC      | Macrophage cell   | 
| T8_AAATGCCGTAGCTGCC      | Macrophage cell   | 
| T8_AAATGCCTCCACTGGG      | Ductal cell type 2| 
| ...                      | ...               | 

**Notes:**
- The **"CB"** column is mandatory as it represents the cell barcode.
- Additional columns (e.g., **cluster**, **source**, etc.) can be added based on the specific metadata required for your analysis.
- These metadata columns will be included in the output `adata` files under **obs**.
- **"CB"** values are used to locate the **count table files** and are stored as the **obs index** in `adata`.
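
The format rules above can be checked with a few lines of standard-library Python. This is a sketch under stated assumptions: `read_cell_barcodes` is a hypothetical helper, and the uniqueness check is an assumption made here because the barcodes become the obs index, not a requirement stated by DOLPHIN.

```python
import csv
import io


def read_cell_barcodes(handle):
    """Return the CB column of a tab-separated metadata file.

    Hypothetical helper for illustration; it is not a DOLPHIN function.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    if reader.fieldnames is None or "CB" not in reader.fieldnames:
        raise ValueError('metadata must contain a "CB" column')
    barcodes = [row["CB"] for row in reader]
    # Assumption: barcodes should be unique because they become the obs index.
    if len(set(barcodes)) != len(barcodes):
        raise ValueError("cell barcodes in the CB column must be unique")
    return barcodes


# Small in-memory example that mirrors the table above.
example = io.StringIO(
    "CB\tcluster\n"
    "T8_AAAGCAAGTCGCGAAA\tStellate cell\n"
    "T8_AAATGCCAGCTGGAAC\tMacrophage cell\n"
)
print(read_cell_barcodes(example))
```

To check a real file, pass an open handle instead, for example `open("./fsla_meta.csv", newline="")`.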

> ⚠️ **For demonstration purposes**, the following example is run directly in this Jupyter Notebook.  
> For better performance and scalability, especially on large datasets, we recommend running the full pipeline via the [step2_1_graph_generation.py](https://github.com/mcgilldinglab/DOLPHIN/blob/main/docs/source/tutorials/step2_1_graph_generation.py) script.

In [1]:
from DOLPHIN.graph_generation.preprocess_raw_reads import run_parallel_gene_processing

In [None]:
### This function processes exon count and junction raw count data for each cell and converts them into
### flattened feature and adjacency vectors.

run_parallel_gene_processing(
    metadata_path="./fsla_meta.csv",
    gtf_path="./dolphin_exon_gtf/dolphin.exon.pkl",
    adj_index_path="./dolphin_exon_gtf/dolphin_adj_index.csv",
    main_folder="./",
    n_processes=8
)

Starting Raw Reads Processing...


INFO:DOLPHIN.graph_generation.preprocess_raw_reads:Running gene processing using 8 processes...


Sample =  SRR18379095 , Gene id =  ENSG00000000457 is running.
Sample =  SRR18379095 , Gene id =  ENSG00000000460 is running.
Sample =  SRR18379095 , Gene id =  ENSG00000000938 is running.
Sample =  SRR18379095 , Gene id =  ENSG00000001036 is running.
Sample =  SRR18379095 , Gene id =  ENSG00000001084 is running.
Sample =  SRR18379095 , Gene id =  ENSG00000001497 is running.
Sample =  SRR18379095 , Gene id =  ENSG00000001630 is running.
Sample =  SRR18379095 , Gene id =  ENSG00000001631 is running.
Sample =  SRR18379095 , Gene id =  ENSG00000002549 is running.
Sample =  SRR18379095 , Gene id =  ENSG00000002586 is running.
Sample =  SRR18379095 , Gene id =  ENSG00000002834 is running.
Sample =  SRR18379095 , Gene id =  ENSG00000003056 is running.
Sample =  SRR18379095 , Gene id =  ENSG00000003400 is running.
Sample =  SRR18379095 , Gene id =  ENSG00000003402 is running.
Sample =  SRR18379095 , Gene id =  ENSG00000003756 is running.
Sample =  SRR18379095 , Gene id =  ENSG00000004059 is r
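
To make "flattened feature and adjacency vectors" concrete, here is a toy sketch for a single gene. It is illustrative only and is not DOLPHIN's implementation: it assumes that a gene with `n` exons yields a feature vector of length `n` and an `n x n` junction matrix flattened row by row to length `n * n`. The function name and the example counts are hypothetical.

```python
def build_gene_vectors(exon_counts, junction_counts):
    """Build a feature vector and a row-major flattened adjacency vector.

    Illustrative only; the layout is an assumption, not DOLPHIN's code.
    `junction_counts` maps (source_exon, target_exon) index pairs to counts.
    """
    n = len(exon_counts)
    feature = list(exon_counts)
    adjacency = [0] * (n * n)
    for (src, dst), count in junction_counts.items():
        adjacency[src * n + dst] = count
    return feature, adjacency


# Hypothetical counts for a gene with three exons.
exon_counts = [12, 0, 7]
junction_counts = {(0, 2): 5}  # five reads spanning exon 0 -> exon 2

feature, adjacency = build_gene_vectors(exon_counts, junction_counts)
print(feature)
print(adjacency)
```

Under this assumed layout, the per-cell vectors for all genes can be concatenated in a fixed gene order, which is why the exon order in the `.pkl` file and the adjacency index file has to match.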

In [1]:
### This function combines feature vectors and constructs the final feature matrix.
from DOLPHIN.graph_generation.process_feature_matrix import run_feature_combination

In [None]:
run_feature_combination(
    metadata_path="./fsla_meta.csv",
    graph_directory="./06_graph_mtx",
    gene_annotation="./dolphin_exon_gtf/dolphin_gene_meta.csv",
    gtf_pkl_path="./dolphin_exon_gtf/dolphin.exon.pkl",
    out_directory="./",
    out_name="fsla",
    clean_temp=False
)

Start Combining Feature Matrix...


Combining Features:   0%|          | 0/450 [00:00<?, ?it/s]

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df_gtf_an.gene_name.fillna(df_gtf_an.Geneid, inplace=True)
... storing 'celltype1' as categorical
... storing 'celltype2' as categorical
... storing 'gene_id' as categorical
... storing 'gene_name' as categorical

In [1]:
### This function combines adjacency vectors and constructs the adjacency matrix.
from DOLPHIN.graph_generation.process_adjacency_matrix import run_adjacency_combination

In [None]:
run_adjacency_combination(
    metadata_path="./fsla_meta.csv",
    graph_directory="./06_graph_mtx",
    adj_meta_file="./dolphin_exon_gtf/dolphin_adj_metadata_table.csv",
    out_directory="./",
    out_name="fsla",
    clean_temp=False,
    adj_run_num=50,
    parallel=True
)

Start Combining Adjacency Matrix...
Running in parallel with batch size = 50 ...


Processing batch: 0 → 50
Processing batch: 50 → 100
Processing batch: 100 → 102



... storing 'gene_id' as categorical
... storing 'gene_name' as categorical


[1/3] Finished batch


... storing 'celltype1' as categorical
... storing 'celltype2' as categorical
... storing 'gene_id' as categorical
... storing 'celltype1' as categorical
... storing 'celltype2' as categorical
... storing 'gene_name' as categorical
... storing 'gene_id' as categorical
... storing 'gene_name' as categorical


[2/3] Finished batch
[3/3] Finished batch
Merging .h5ad batches...


  combined_adata = combined_adata.concatenate(ad, index_unique=None, batch_key=None)
  combined_adata = combined_adata.concatenate(ad, index_unique=None, batch_key=None)
... storing 'celltype1' as categorical
... storing 'celltype2' as categorical
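
The batch boundaries in the log above (0 → 50, 50 → 100, 100 → 102) are what you get when cells are split into consecutive chunks of at most `adj_run_num` cells. The sketch below reproduces that arithmetic; `batch_ranges` is a hypothetical helper, and treating `adj_run_num` as the batch size is an inference from the log line "batch size = 50", not a description of DOLPHIN's internals.

```python
def batch_ranges(n_cells, batch_size):
    """Split range(n_cells) into consecutive (start, stop) chunks.

    Hypothetical helper for illustration; it is not a DOLPHIN function.
    """
    return [
        (start, min(start + batch_size, n_cells))
        for start in range(0, n_cells, batch_size)
    ]


# 102 cells with a batch size of 50, as in the log above.
for start, stop in batch_ranges(102, 50):
    print(f"Processing batch: {start} -> {stop}")
```

A smaller batch size lowers the peak memory used per batch at the cost of more intermediate `.h5ad` files to merge.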


In [1]:
### This function converts the adjacency matrix into a compressed adjacency matrix.
from DOLPHIN.graph_generation.process_adjacency_matrix_compress import run_adjacency_compression

In [None]:
run_adjacency_compression(
    metadata_path="./fsla_meta.csv",
    out_name="fsla",
    out_directory="./",
    num_processes=1
)

Starting processing cell: SRR18388386


In [1]:
### This function combines the compressed adjacency matrices.
from DOLPHIN.graph_generation.process_adjacency_matrix_compress_combine import run_adjacency_compress_combination

In [None]:
run_adjacency_compress_combination(
    metadata_path="./fsla_meta.csv",
    out_name="fsla",
    out_directory="./",
    adj_run_num=50,
    clean_temp=False,
    parallel=True,
)

Start Combining Compressed Adjacency Matrix...
Running in parallel with batch size = 10 ...


100%|██████████| 3/3 [20:59<00:00, 419.91s/it]


Merging .h5ad batches...


In [None]:
### This function cleans the final adjacency matrix.
from DOLPHIN.graph_generation.process_adjacency_matrix_final import run_adjacency_matrix_final

run_adjacency_matrix_final(
    out_name="fsla",
    out_directory="./"
)


Start Generating Final Adjacency Matrix...


Processing gene batches: 100%|██████████| 62/62 [03:12<00:00,  3.10s/it]
... storing 'gene_id' as categorical
... storing 'gene_name' as categorical


In [2]:
## This step is primarily used to select highly variable genes (HVGs) based on the gene count table.
## The final feature matrix and adjacency matrix will retain only these HVGs as input for the DOLPHIN model.
## This step is optional: it provides one way to select HVGs, but you are free to use your own selection method.

from DOLPHIN.graph_generation.process_raw_gene import run_raw_gene

In [None]:
run_raw_gene(
    metadata_path="./fsla_meta.csv",
    featurecount_path="./04_exon_gene_cnt",
    gtf_path="./dolphin_exon_gtf/dolphin.exon.gtf",
    out_name="fsla",
    out_directory="./",
)

100%|██████████| 795/795 [09:52<00:00,  1.34it/s]
  utils.warn_names_duplicates("var")
... storing 'celltype1' as categorical
... storing 'celltype2' as categorical
... storing 'GeneName' as categorical
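
Since HVG selection is optional and you may substitute your own method, here is one simple alternative as a sketch: ranking genes by their variance-to-mean ratio across cells. This is not the method used by `run_raw_gene`; the function name and the toy counts are hypothetical.

```python
from statistics import mean, pvariance


def top_variable_genes(counts_by_gene, n_top):
    """Return the n_top genes with the highest variance-to-mean ratio.

    Illustrative alternative only; this is not DOLPHIN's HVG method.
    `counts_by_gene` maps a gene name to its counts across cells.
    """
    scores = {}
    for gene, counts in counts_by_gene.items():
        mu = mean(counts)
        # Genes with zero mean expression get a score of 0.
        scores[gene] = pvariance(counts) / mu if mu > 0 else 0.0
    return sorted(scores, key=scores.get, reverse=True)[:n_top]


# Hypothetical counts for three genes across four cells.
counts_by_gene = {
    "geneA": [0, 10, 0, 10],
    "geneB": [5, 5, 5, 5],
    "geneC": [1, 2, 1, 2],
}
print(top_variable_genes(counts_by_gene, n_top=2))
```

Whichever method you choose, the selected gene list determines which exons and junctions remain in the final feature and adjacency matrices.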


In [None]:
## This step generates the HVG-selected feature matrix.

from DOLPHIN.graph_generation.process_feature_hvg import run_feature_hvg

run_feature_hvg(
    out_name="fsla",
    out_directory="./")

  view_to_actual(adata)


Keep 2000 genes
The Final Feature Matrix Size is 795 Cells and 13979 exons
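
Keeping 2000 genes leaves 13979 exon columns because each retained gene contributes all of its exons. The sketch below shows that kind of column selection on a toy row; it assumes each exon column is labeled with its parent gene ID, and `hvg_exon_columns` is a hypothetical helper rather than DOLPHIN code.

```python
def hvg_exon_columns(exon_gene_ids, hvg_genes):
    """Return indices of exon columns whose parent gene is in hvg_genes.

    Hypothetical helper for illustration; it is not a DOLPHIN function.
    """
    keep = set(hvg_genes)
    return [i for i, gene in enumerate(exon_gene_ids) if gene in keep]


# Hypothetical layout: six exon columns belonging to three genes.
exon_gene_ids = ["g1", "g1", "g2", "g3", "g3", "g3"]
columns = hvg_exon_columns(exon_gene_ids, ["g1", "g3"])

# Apply the selection to one cell's exon counts.
row = [4, 0, 9, 1, 2, 3]
print([row[i] for i in columns])
```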


In [None]:
## This step generates the HVG-selected adjacency matrix.

from DOLPHIN.graph_generation.process_adjacency_hvg import run_adjacency_hvg

run_adjacency_hvg(
    out_name="fsla",
    out_directory="./")

Keep 2000 genes
The Final Adjacency Matrix Size is 795 Cells with dimension of 415174


100%|██████████| 795/795 [17:30<00:00,  1.32s/it]


In [1]:
## This final step generates the model input

from DOLPHIN.graph_generation.process_graph_final import run_model_input

run_model_input(
    metadata_path="./fsla_meta.csv",
    out_name="fsla",
    out_directory="./",
    celltypename="celltype1")

Start Construct Data Input for model input


100%|██████████| 200/200 [15:09<00:00,  4.55s/it]
