# Generate Feature Matrix and Adjacency Matrix

Generate the **feature matrix** and **adjacency matrix** using the exon read counts and junction read counts obtained from the count files. For detailed instructions on aligning the RNA-seq data, refer to the following tutorials:

- **Full-length RNA-seq**: [step1_1_preprocess_full_length.md](./step1_1_preprocess_full_length.md)
- **10X RNA-seq**: [step1_2_preprocess_10X.md](./step1_2_preprocess_10X.md)

Before running this section, make sure your count data and metadata file are ready, along with the `dolphin_adj_index.csv` and `dolphin.exon.pkl` files. These two files can be generated with [step0_generate_exon_gtf_final.ipynb](./step0_generate_exon_gtf_final.ipynb).

For human GRCh38, you can directly download the necessary files from [here](https://mcgill-my.sharepoint.com/:f:/g/personal/kailu_song_mail_mcgill_ca/EvZtHeW7qjJJs_RHc2-327ABeLXafa-ruvfk9Vs134crig?e=TsGsk8).

### Metadata File Format

The metadata file should be tab-separated, with the **"CB"** (cell barcode) column as the only mandatory column. You can add as many metadata columns as needed. These additional metadata columns will be retained in all output `adata` files as **obs** (observations).

| CB                  | cluster            |
|---------------------|--------------------|
| T8_AAAGCAAGTCGCGAAA | Stellate cell      |
| T8_AAATGCCAGCTGGAAC | Macrophage cell    |
| T8_AAATGCCGTAGCTGCC | Macrophage cell    |
| T8_AAATGCCTCCACTGGG | Ductal cell type 2 |
| ...                 | ...                |

**Notes:**
- The **"CB"** column is mandatory as it represents the cell barcode.
- Additional columns (e.g., **cluster**, **source**, etc.) can be added based on the specific metadata required for your analysis.
- These metadata columns will be included in the output `adata` files under **obs**.
- **"CB"** will be used to find the **count table file names** and will be stored as **obs index** in `adata`.
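As a quick sanity check of the format described above, the sketch below reads a tab-separated metadata table and confirms that the **"CB"** column is present. It uses only the Python standard library. The helper name `load_metadata` is hypothetical and not part of DOLPHIN, and the duplicate-barcode check is an assumption based on **"CB"** being used as the obs index.

```python
import csv
import io


def load_metadata(handle):
    """Read a tab-separated metadata table and validate its "CB" column.

    Hypothetical helper for illustration only; not part of DOLPHIN.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    if reader.fieldnames is None or "CB" not in reader.fieldnames:
        raise ValueError('metadata must contain a "CB" column')
    rows = list(reader)
    barcodes = [row["CB"] for row in rows]
    # Assumption: barcodes should be unique because "CB" becomes the obs index.
    if len(set(barcodes)) != len(barcodes):
        raise ValueError("duplicate cell barcodes found in the CB column")
    return rows


# Small in-memory example mirroring the table above.
example = io.StringIO(
    "CB\tcluster\n"
    "T8_AAAGCAAGTCGCGAAA\tStellate cell\n"
    "T8_AAATGCCAGCTGGAAC\tMacrophage cell\n"
)
metadata_rows = load_metadata(example)
cell_barcodes = [row["CB"] for row in metadata_rows]
```

For a file on disk, the same helper could be called as `load_metadata(open("your_metaData.csv", newline=""))`.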

In [1]:
from DOLPHIN import gene, get_gtf
import os



In [3]:
## Main folder holding all count tables; its subfolder "05_exon_junct_cnt" contains the exon count and junction count files
main_folder = "./00_data_generation/"
output_folder = "06_graph_mtx" ## output files are stored here
os.makedirs(os.path.join(main_folder, output_folder), exist_ok=True)

In [None]:
import pandas as pd

metadata = "your_metaData.csv"  ## tab-separated metadata file
pd_gt = pd.read_csv(metadata, sep="\t")
mr_cb_list = list(pd_gt["CB"])  ## cell barcodes from the mandatory "CB" column

In [4]:
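## Paths to the exon annotation pickle and the adjacency index prepared beforehand (see the prerequisites above)
path_gtf = "./gtf.pkl"
path_adj_index = "./adj_index.csv"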

In [6]:
gtf, adj_ind = get_gtf(path_gtf, path_adj_index)

### Note on Processing

The loop below processes one cell at a time. Multiprocessing is not implemented yet; running several cells concurrently would speed up this step.

In [None]:
for cb in mr_cb_list:
    ## build and save the feature and adjacency matrices for this cell
    g = gene(main_folder, gtf, adj_ind, cb)
    g.get_all()
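One possible way to run several cells concurrently, as the note above suggests, is a process pool from the standard library. This is a minimal sketch, not DOLPHIN's own implementation: `process_cell` and `run_all` are hypothetical names, and the worker only echoes the barcode so that the example runs without DOLPHIN installed. It also assumes that the per-cell work is independent between cells.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def process_cell(cell_barcode):
    """Hypothetical per-cell worker (assumption, not part of DOLPHIN).

    In this tutorial it would wrap
    gene(main_folder, gtf, adj_ind, cell_barcode).get_all();
    here it simply returns the barcode it was given.
    """
    return cell_barcode


def run_all(cell_barcodes, max_workers=4, executor_cls=ProcessPoolExecutor):
    """Apply process_cell to every barcode and return results in input order.

    executor_cls defaults to a process pool; ThreadPoolExecutor can be
    passed instead as a drop-in when debugging.
    """
    with executor_cls(max_workers=max_workers) as pool:
        return list(pool.map(process_cell, cell_barcodes))
```

In a script, a call such as `run_all(mr_cb_list)` would normally sit under an `if __name__ == "__main__":` guard so that worker processes can import the module safely. Whether `gtf` and `adj_ind` can be shared efficiently across processes has not been verified here.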