### Downsampling

UMAP, clustering and marker identification together give a good understanding of cellular diversity. However, many other analyses can still be run on the data, for example the identification of cell differentiation trajectories. One of the major challenges in running these analyses can be the size of the data. Scarf performs a topology-conserving downsampling of the data based on the cell neighbourhood graph. This downsampling aims to maximize heterogeneity while sampling cells from the data. Before the actual downsampling step, two key steps must be performed.

The first step is the micro-clustering of the cells. Micro-clustering is performed using the dendrogram generated by the Paris algorithm. Rather than cutting the dendrogram at a fixed distance value to obtain clusters, a balanced cut is performed such that the sizes of the resulting clusters are bounded within the given limits. Below we perform balanced micro-clustering and visualize the results.

In [None]:
ds.run_clustering(balanced_cut=True, min_size=20, max_size=100, label='b_cluster')
ds.plot_layout(layout_key='RNA_UMAP', color_by='RNA_b_cluster', legend_onside=False, legend_ondata=False)
ds.plot_cluster_tree(cluster_key='RNA_b_cluster', width=1, do_label=False)
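To make the idea of a balanced cut concrete, here is a minimal sketch on toy data. It is not Scarf's implementation: it assumes a SciPy Ward linkage as a stand-in for the Paris dendrogram, and the helper name `balanced_cut` and its splitting rule (split a node only while it exceeds `max_size` and both children keep at least `min_size` cells) are illustrative assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, to_tree


def balanced_cut(root, min_size, max_size):
    """Hypothetical helper: cut a dendrogram top-down with size bounds.

    A node is split only if it holds more than `max_size` leaves and both
    children would keep at least `min_size` leaves. Otherwise the node is
    emitted as one cluster (so a cluster can still exceed `max_size` when
    the tree is very unbalanced at that point).
    """
    clusters = []
    stack = [root]
    while stack:
        node = stack.pop()
        left, right = node.get_left(), node.get_right()
        too_big = node.get_count() > max_size
        can_split = (
            not node.is_leaf()
            and left.get_count() >= min_size
            and right.get_count() >= min_size
        )
        if too_big and can_split:
            stack.extend([left, right])
        else:
            # pre_order() returns the ids of all leaves under this node
            clusters.append(node.pre_order())
    return clusters


# Toy data: 300 random 2D points standing in for cells
rng = np.random.default_rng(0)
points = rng.normal(size=(300, 2))

# Ward linkage is an assumption here; Scarf builds its tree with Paris
tree_root = to_tree(linkage(points, method="ward"))
micro_clusters = balanced_cut(tree_root, min_size=5, max_size=50)
cluster_sizes = sorted(len(c) for c in micro_clusters)
print(len(micro_clusters), cluster_sizes)
```

Unlike a cut at a fixed height, this rule keeps splitting large branches and leaves small ones alone, which is why small populations end up as their own micro-clusters instead of being merged into larger ones.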

It is a good idea to make sure that small populations are divided into smaller clusters, to facilitate comprehensive downsampling of even the smaller populations. The next step is to calculate the neighbourhood density of nodes. The degree of a node (i.e. a cell in the graph) is the number of nodes it is connected to. The two-step degree (aka 1-neighbourhood degree) of a cell is the sum of the degrees of the cells it is connected to. We calculate the 2-neighbourhood degree of cells to obtain an estimate of how densely connected the cells are in each region of the graph. The more densely connected the cells are, the lower the heterogeneity across them. These values are saved in the cell metadata table, here as `RNA_node_density`. We can visualize these values using the `plot_layout` method.

In [None]:
ds.calc_node_density(neighbourhood_degree=2)
ds.plot_layout(layout_key='RNA_UMAP', color_by='RNA_node_density', clip_fraction=0.1, colormap='coolwarm')
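The neighbourhood degree described above can be illustrated on a tiny graph. This is a sketch under assumptions, not Scarf's code: it uses an unweighted, undirected toy graph, and it assumes that the 2-neighbourhood degree is obtained by summing the 1-neighbourhood degrees of a cell's neighbours. The function name `neighbourhood_degree` is hypothetical.

```python
import numpy as np
from scipy.sparse import csr_matrix


def neighbourhood_degree(adj, n_steps):
    """Hypothetical helper: repeatedly sum neighbour values over the graph.

    n_steps=0 gives the plain degree, n_steps=1 the sum of the neighbours'
    degrees, and so on.
    """
    values = np.asarray(adj.sum(axis=1)).ravel()  # plain degree
    for _ in range(n_steps):
        values = adj @ values  # sum the current values over each node's neighbours
    return values


# Toy graph with 5 cells: node 0 is a hub, node 4 hangs off node 3
edges = [(0, 1), (0, 2), (0, 3), (3, 4)]
rows, cols = zip(*edges)
upper = csr_matrix((np.ones(len(edges)), (rows, cols)), shape=(5, 5))
adjacency = upper + upper.T  # make the graph undirected

degree = neighbourhood_degree(adjacency, 0)
one_nbr_degree = neighbourhood_degree(adjacency, 1)
two_nbr_degree = neighbourhood_degree(adjacency, 2)
print(degree)          # [3. 1. 1. 2. 1.]
print(one_nbr_degree)  # [4. 3. 3. 4. 2.]
print(two_nbr_degree)  # [10.  4.  4.  6.  4.]
```

The hub (node 0) gets by far the highest 2-neighbourhood value, while the peripheral nodes get low values, which matches the intuition that this quantity tracks how densely connected a region of the graph is.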

Now we are ready to perform the downsampling of cells. The extent of downsampling is primarily governed by the number of micro-clusters, i.e. at least one cell from each micro-cluster (a *seed cell*) will be present in the downsampled data. However, this alone might not be sufficient to conserve the topology, i.e. to ensure that the sampled cells are connected to each other. Hence, the `run_subsampling` method runs a prize-collecting Steiner graph search to ensure that the *seed cells* are connected (to the extent that the full graph is connected). To do this, we need to set a reward for seed and non-seed cells. This is done using the parameter `rewards`, which takes a tuple with the values for seed and non-seed cells, respectively. A low reward on seed cells might lead to them being excluded from the subsample (something that we should try to avoid). A high reward on non-seed cells will inflate the number of cells in the sample. We also set a value for the parameter `seed_frac`, which is the fraction of cells that should be randomly sampled from each micro-cluster. This value is dynamically increased, up to a maximum of double the `seed_frac` value, based on the relative mean node density of that cluster. In effect, we increase the sampling rate for micro-clusters that have lower overall connectivity.

In [None]:
ds.run_subsampling(seed_frac=0.05, rewards=(2, 0), min_nodes=1, min_edge_weight=0.1,
                   cluster_key='RNA_b_cluster', density_key='RNA_node_density')
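The density-dependent adjustment of `seed_frac` can be sketched as follows. The exact scaling that Scarf uses is not given here, so the linear min-max rule below is an assumption chosen only to satisfy the stated bounds (between `seed_frac` and twice `seed_frac`, higher for sparser clusters). The function names `seed_fractions` and `sample_seeds` are hypothetical, and the prize-collecting Steiner step is not shown.

```python
import numpy as np


def seed_fractions(cluster_labels, node_density, seed_frac):
    """Hypothetical helper: per-cluster sampling rate in [seed_frac, 2 * seed_frac].

    Assumed rule: clusters with the lowest mean density get 2 * seed_frac,
    clusters with the highest mean density get seed_frac, linear in between.
    """
    clusters = np.unique(cluster_labels)
    mean_density = np.array(
        [node_density[cluster_labels == c].mean() for c in clusters]
    )
    span = mean_density.max() - mean_density.min()
    if span == 0:
        sparsity = np.zeros_like(mean_density)
    else:
        sparsity = (mean_density.max() - mean_density) / span
    rates = seed_frac * (1 + sparsity)
    return dict(zip(clusters.tolist(), rates.tolist()))


def sample_seeds(cluster_labels, fractions, rng):
    """Randomly pick seed cells, keeping at least one cell per micro-cluster."""
    seeds = []
    for cluster, frac in fractions.items():
        members = np.flatnonzero(cluster_labels == cluster)
        n_seeds = min(members.size, max(1, int(round(frac * members.size))))
        seeds.append(rng.choice(members, size=n_seeds, replace=False))
    return np.sort(np.concatenate(seeds))


# Toy input: three micro-clusters with decreasing node density
labels = np.repeat([0, 1, 2], [100, 100, 40])
density = np.concatenate([np.full(100, 10.0), np.full(100, 5.0), np.full(40, 1.0)])

fractions = seed_fractions(labels, density, seed_frac=0.05)
seed_cells = sample_seeds(labels, fractions, np.random.default_rng(0))
print(fractions)
print(len(seed_cells), "seed cells out of", labels.size)
```

In this toy example the densest cluster is sampled at the base rate and the sparsest at twice that rate, so sparse regions contribute proportionally more seed cells.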

As a result of subsampling, the subsampled cells are marked `True` under the cell metadata column `RNA_sketched`. We can visualize these cells using `plot_layout`.


In [None]:
ds.plot_layout(layout_key='RNA_UMAP', color_by='RNA_cluster', subselection_key='RNA_sketched')

As can be seen from the plot above, the subsample, which contains 16.49% of the cells, includes cells from all regions of the UMAP landscape.