
Fast Higashi Usage

ruochiz edited this page Apr 22, 2022 · 8 revisions

Data processing

Higashi processing (input data to sparse contact maps)

Fast-Higashi shares the same input format as Higashi, and thus uses the same code to parse it.

Run the following commands to process the input data (only needs to be run once).

from higashi.Higashi_wrapper import Higashi
higashi_model = Higashi(config_path)
higashi_model.process_data()
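As a reference for what the parser consumes, here is a sketch of a Higashi-style `data.txt` contact table (tab-separated, one contact per row). The column names follow the Higashi input documentation as I understand it; verify them against your Higashi version.

```python
import csv, io

# Toy contact records: (cell_id, chrom1, pos1, chrom2, pos2, count).
# These values are illustrative, not from a real dataset.
rows = [
    (0, "chr1", 1_200_000, "chr1", 3_700_000, 2),
    (0, "chr2", 500_000,   "chr2", 2_500_000, 1),
    (1, "chr1", 1_200_000, "chr1", 1_300_000, 4),
]

buf = io.StringIO()
writer = csv.writer(buf, delimiter="\t")
writer.writerow(["cell_id", "chrom1", "pos1", "chrom2", "pos2", "count"])
writer.writerows(rows)
print(buf.getvalue())
```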

This function will finish the following tasks:

  • generate a dictionary that maps genomic bin loci to node ids.
  • extract data from data.txt and convert it into the hyperedge (triplet) format.
  • create contact maps from the sparse scHi-C data for visualization, the baseline model, and node attribute generation.
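The first two steps can be sketched as follows. This is a minimal illustration of bin-to-node mapping and triplet extraction, not Fast-Higashi's internal code; the resolution, chromosome sizes, and dictionary layout are assumptions.

```python
RESOLUTION = 1_000_000  # 1 Mb bins (assumed)

# Toy chromosome sizes -> one global node id per genomic bin
chrom_sizes = {"chr1": 5_000_000, "chr2": 3_000_000}
bin2node = {}
node_id = 0
for chrom, size in chrom_sizes.items():
    for start in range(0, size, RESOLUTION):
        bin2node[(chrom, start // RESOLUTION)] = node_id
        node_id += 1

def to_triplet(cell_id, chrom1, pos1, chrom2, pos2):
    """Turn one contact record into a (cell, bin_i, bin_j) hyperedge."""
    i = bin2node[(chrom1, pos1 // RESOLUTION)]
    j = bin2node[(chrom2, pos2 // RESOLUTION)]
    return (cell_id, min(i, j), max(i, j))

print(to_triplet(0, "chr1", 1_200_000, "chr1", 3_700_000))  # → (0, 1, 3)
```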

The above function is also equivalent to

higashi_model.generate_chrom_start_end()
higashi_model.extract_table()
higashi_model.create_matrix()

Before each step is executed, a message is printed indicating progress, which helps with debugging.

Fast-Higashi only processing (sparse contact maps to sparse tensors)

from fasthigashi.FastHigashi_Wrapper import *
# Initialize the model
model = FastHigashi(config_path=config_path,
	             path2input_cache="/work/magroup/ruochiz/fast_higashi_git/pfc_500k",
	             path2result_dir="/work/magroup/ruochiz/fast_higashi_git/pfc_500k",
	             off_diag=100,
	             filter=False,
	             do_conv=False,
	             do_rwr=False,
	             do_col=False,
	             no_col=False)
# Pack from sparse mtx to tensors
model.prep_dataset()
**required arguments:**

config_path           The path to the configuration JSON file that you created.
path2input_cache      The path to the directory where the cached tensor files will be stored.
path2result_dir       The path to the directory where the results will be stored.
off_diag              Maximum number of diagonals to consider. When set to 100, the 0th through 100th diagonals are considered.
filter                Whether to learn the meta-interactions using only cells that pass quality control, and then infer embeddings for the rest of the cells. (Works better for datasets where the contacts-per-cell metric varies drastically.)
do_conv               Whether to apply linear convolution.
do_rwr                Whether to apply partial random walk with restart.
do_col                Whether to apply sqrt_vc normalization (recommended to be False; the program enables it automatically when needed).
no_col                Whether to force the program not to use sqrt_vc normalization (recommended to be False; the program decides automatically when needed).
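For intuition about the do_col/no_col flags, here is a minimal sketch of square-root vanilla-coverage (sqrt_vc) normalization, where each entry is divided by the square root of the product of its row and column coverage. This is an illustration of the general technique, not Fast-Higashi's implementation.

```python
import numpy as np

def sqrt_vc(mat, eps=1e-8):
    """Divide mat[i, j] by sqrt(coverage_i * coverage_j)."""
    cov = mat.sum(axis=1)                 # per-bin coverage (symmetric matrix)
    scale = np.sqrt(np.maximum(cov, eps)) # guard against empty bins
    return mat / np.outer(scale, scale)

m = np.array([[4.0, 2.0],
              [2.0, 1.0]])
print(sqrt_vc(m))
```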

Run the model

model.run_model(dim1=.6,
                rank=256,
                n_iter_parafac=1,
                extra="")
**required arguments:**

dim1           The scaling factor for the chromosome-specific embedding size. For a chromosome with n bins, studied at a resolution of x bp, its chromosome-specific embedding size will be int(dim1 * n * x / 1000000).
rank           Size of the meta-embedding (the embedding shared across all chromosomes).
n_iter_parafac Number of PARAFAC iterations within each Fast-Higashi iteration.
extra          Extra annotation string appended when saving the results.
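The dim1 formula above can be checked with a short worked example (the bin count below is illustrative):

```python
def chrom_embed_size(dim1, n_bins, resolution_bp):
    """Chromosome-specific embedding size: int(dim1 * n * x / 1000000)."""
    return int(dim1 * n_bins * resolution_bp / 1_000_000)

# A chromosome with 249 bins at 1 Mb resolution, with dim1=0.6:
print(chrom_embed_size(0.6, 249, 1_000_000))  # → 149
```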

Get the results

embedding = model.fetch_cell_embedding(final_dim=256)
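A common next step is to compute cell-cell similarities from the returned embedding (e.g. for neighbor graphs or clustering). The sketch below uses a synthetic array so it is self-contained; the (cells × final_dim) shape is an assumption about fetch_cell_embedding's output.

```python
import numpy as np

rng = np.random.default_rng(0)
embedding = rng.normal(size=(100, 64))  # stand-in for the real cell embedding

# L2-normalize rows, then cosine similarity between all pairs of cells
norm = embedding / np.linalg.norm(embedding, axis=1, keepdims=True)
sim = norm @ norm.T
np.fill_diagonal(sim, -np.inf)  # exclude self-similarity

nearest = sim.argmax(axis=1)    # each cell's most similar cell
print(nearest.shape)            # → (100,)
```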