# Prepare AnnData and Highly Variable Feature Selection

## Purpose
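- Calculate the methylation rate of each feature in each cell
- Normalize the rate per cell and clip extreme values
- Select highly variable features (HVF)
- Prepare the HVF adata files (mCH and mCG)
- Prepare the gene adata file
- In short, reduce the N-D MCDS to 2-D cell-by-feature matrices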
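
The first two steps in the list above can be pictured on a toy matrix. This is only a minimal sketch, not the ALLCools implementation: it assumes the rate is plain `mc / cov`, that per-cell normalization means dividing by the cell's overall rate, and that `clip_max=10.0` is a reasonable upper bound. The function name `normalized_rate` and all values are hypothetical.

```python
import numpy as np

def normalized_rate(mc, cov, clip_max=10.0):
    """Toy per-cell normalized methylation rate (illustrative only).

    mc, cov: 2-D arrays of shape (cell, feature) holding methylated and
    total base-call counts. clip_max is a hypothetical upper bound.
    """
    mc = np.asarray(mc, dtype=float)
    cov = np.asarray(cov, dtype=float)
    # Raw rate per cell and feature; features with zero coverage become NaN.
    with np.errstate(divide="ignore", invalid="ignore"):
        rate = np.where(cov > 0, mc / cov, np.nan)
    # Each cell's overall rate, used as its normalization factor (assumption).
    cell_rate = mc.sum(axis=1) / cov.sum(axis=1)
    norm = rate / cell_rate[:, None]
    # Clip extreme ratios so a few low-coverage features cannot dominate.
    return np.clip(norm, 0.0, clip_max)

mc = np.array([[2, 0, 8], [1, 1, 0]])
cov = np.array([[10, 10, 10], [4, 2, 0]])
norm = normalized_rate(mc, cov)  # 2-D (cell, feature) matrix
```

After normalization a value near 1 means the feature is methylated at about the cell's average level, which is why the HVF mean thresholds in the parameters below are expressed around 1.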

## Input
- Cell metadata table and a list of MCDS files

## Output
- mCH HVF adata
- mCG HVF adata
- Gene rate MCDS

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
import pathlib

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scanpy as sc
import seaborn as sns
import xarray as xr
from pybedtools import BedTool
from ALLCools.mcds.MCDS import MCDS
from cemba_data.plot import *

In [3]:
result_dir = pathlib.Path('Adata')
result_dir.mkdir(exist_ok=True)
fig_dir = pathlib.Path('fig/feature_selection')
fig_dir.mkdir(exist_ok=True, parents=True)

## Parameters

In [4]:
# parameters cell
in_memory = False
dask_distribute = True

# selected cell metadata path
cell_meta_path = ''

# blacklist
black_list_path = '/home/hanliu/project/mouse_rostral_brain/misc/mm10-blacklist.v2.bed.gz'

# mcds_path
mcds_path_list = list(
    pathlib.Path(
        '/home/hanliu/project/mouse_rostral_brain/dataset/gene_with_2kb_slop/'
    ).glob('*mcds'))

clustering_feature = 'gene'  # usually 100kb chromosome bins or genes

# remove bad features
black_list_region = None
exclude_chromosome = ['chrY', 'chrM']

# preprocess parameters
min_feature_cov, max_feature_cov = 500, 3000

ch_hvf_top = 3000
min_ch_hvf_mean = 0.5
max_ch_hvf_mean = 2.5

cg_hvf_top = 3000
min_cg_hvf_mean = 0.5
max_cg_hvf_mean = 1.2
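
How thresholds like these might combine is sketched below. This is a hedged illustration rather than the selection routine this notebook calls: it assumes the coverage bounds apply to each feature's mean coverage, that the mean bounds apply to the mean normalized rate, and that features are ranked by variance over mean. The function `select_hvf` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def select_hvf(norm_rate, mean_cov, min_cov, max_cov,
               min_mean, max_mean, top_n):
    """Toy highly variable feature selection (illustrative only).

    norm_rate: DataFrame (cell x feature) of normalized rates.
    mean_cov: Series (feature) of mean coverage per cell.
    """
    mean = norm_rate.mean(axis=0)
    # Dispersion as variance / mean, one common choice (an assumption).
    dispersion = norm_rate.var(axis=0) / mean
    # Keep features inside both the coverage and the mean-rate windows.
    keep = (mean_cov.between(min_cov, max_cov)
            & mean.between(min_mean, max_mean))
    return dispersion[keep].nlargest(top_n).index

norm_rate = pd.DataFrame({
    "f0": [1.0, 1.0, 1.0, 1.0],   # no variation
    "f1": [0.5, 1.5, 0.5, 1.5],   # moderate variation
    "f2": [0.0, 2.0, 0.0, 2.0],   # strong variation
    "f3": [0.0, 4.0, 0.0, 4.0],   # variable, but coverage too high
    "f4": [3.0, 3.0, 3.0, 3.0],   # mean outside the allowed window
})
mean_cov = pd.Series({"f0": 1000, "f1": 1000, "f2": 1000,
                      "f3": 5000, "f4": 1000})
hvf = select_hvf(norm_rate, mean_cov, min_cov=500, max_cov=3000,
                 min_mean=0.5, max_mean=2.5, top_n=2)
```

In this toy case `f3` is dropped by the coverage window and `f4` by the mean window before ranking, so only `f0` to `f2` compete for the top slots.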

In [5]:
if dask_distribute:
    from dask.distributed import Client
    client = Client(dashboard_address=':5555')

## Load data
### Cell meta

In [6]:
cell_meta = pd.read_msgpack('CellMetadata.AfterQC.msg')
cell_meta.shape[0]

109670

### MCDS

In [7]:
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    mcds = MCDS.open(mcds_path_list,
                     use_cells=cell_meta[cell_meta['PassFilter']].index,
                     chunks={'cell': 1000})
if in_memory:
    mcds.load()

In [8]:
mcds.rename(name_dict={
    'geneslop2k_da': 'gene_da',
    'geneslop2k': 'gene'
}, inplace=True)
mcds

<xarray.MCDS>
Dimensions:           (cell: 104340, count_type: 2, gene: 55487, mc_type: 2)
Coordinates:
  * mc_type           (mc_type) object 'CGN' 'CHN'
    strand_type       <U4 'both'
  * gene              (gene) object 'ENSMUSG00000102693.1' ... 'ENSMUSG00000064372.1'
  * count_type        (count_type) object 'mc' 'cov'
    geneslop2k_chrom  (gene) object dask.array<shape=(55487,), chunksize=(55487,)>
    geneslop2k_start  (gene) int64 dask.array<shape=(55487,), chunksize=(55487,)>
    geneslop2k_end    (gene) int64 dask.array<shape=(55487,), chunksize=(55487,)>
  * cell              (cell) object '1A_M_0' '1A_M_1' ... '10F_M_998'
Data variables:
    gene_da           (cell, gene, mc_type, count_type) uint32 dask.array<shape=(104340, 55487, 2, 2), chunksize=(1000, 55487, 2, 2)>
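
The four dimensions printed above can be easier to follow on a toy array. The sketch below uses plain NumPy with made-up counts, not the real MCDS or xarray API; the helper `counts_2d` is hypothetical and only shows how fixing `mc_type` and `count_type` leaves a 2-D (cell, gene) matrix.

```python
import numpy as np

# Toy stand-in for gene_da with dims (cell, gene, mc_type, count_type).
mc_types = ["CGN", "CHN"]
count_types = ["mc", "cov"]
rng = np.random.default_rng(0)
cov = rng.integers(1, 50, size=(4, 3, 2))   # (cell, gene, mc_type)
mc = rng.integers(0, cov + 1)               # mc never exceeds cov
gene_da = np.stack([mc, cov], axis=-1).astype(np.uint32)

def counts_2d(da, mc_type, count_type):
    """Return the (cell, gene) matrix for one mc_type and count_type."""
    return da[:, :, mc_types.index(mc_type), count_types.index(count_type)]

ch_mc = counts_2d(gene_da, "CHN", "mc")     # methylated CH counts
ch_cov = counts_2d(gene_da, "CHN", "cov")   # total CH coverage
```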

## Remove blacklist

In [11]:
feature_bed_df = pd.DataFrame(
    [
        mcds.coords['geneslop2k_chrom'].to_pandas(),
        mcds.coords['geneslop2k_start'].to_pandas(),
        mcds.coords['geneslop2k_end'].to_pandas(),
    ],
    index=['chrom', 'start', 'end'],
    columns=mcds.get_index('gene'),
).T
feature_bed = BedTool.from_dataframe(feature_bed_df)

In [14]:
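black_list_bed = BedTool(black_list_path)
# Report features with at least 50% of their length overlapping a blacklist
# region (f=0.5); wa=True writes the original feature record for each hit.
black_feature = feature_bed.intersect(black_list_bed, f=0.5, wa=True)
# Map the (chrom, start, end) of each hit back to its feature ID.
black_feature_index = black_feature.to_dataframe().set_index(
    ['chrom', 'start', 'end']).index
black_feature_id = pd.Index(
    feature_bed_df.reset_index()
    .set_index(['chrom', 'start', 'end'])
    .loc[black_feature_index][clustering_feature])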

In [18]:
mcds = mcds.sel(gene=~mcds.get_index('gene').isin(black_feature_id))
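
The overlap rule behind `f=0.5` can be sketched in pure Python. This is an illustration of the criterion only, assuming half-open BED-style coordinates and that the fraction is measured against the feature's own length for one blacklist region at a time. The helpers `overlap_fraction` and `blacklisted_ids`, and the gene names, are hypothetical.

```python
def overlap_fraction(feature, region):
    """Fraction of `feature` covered by `region`; both are (chrom, start, end)."""
    f_chrom, f_start, f_end = feature
    r_chrom, r_start, r_end = region
    if f_chrom != r_chrom or f_end <= f_start:
        return 0.0
    overlap = min(f_end, r_end) - max(f_start, r_start)
    return max(overlap, 0) / (f_end - f_start)

def blacklisted_ids(features, regions, min_fraction=0.5):
    """IDs of features with at least min_fraction covered by one region."""
    return {
        fid for fid, feat in features.items()
        if any(overlap_fraction(feat, reg) >= min_fraction for reg in regions)
    }

features = {
    "geneA": ("chr1", 100, 200),   # 80% covered by the region below
    "geneB": ("chr1", 150, 400),   # 28% covered
    "geneC": ("chr2", 100, 200),   # different chromosome
}
regions = [("chr1", 120, 220)]
bad = blacklisted_ids(features, regions)
kept = [fid for fid in features if fid not in bad]
```

Because the fraction is relative to the feature, a long gene that only brushes a blacklist region is kept, while a short gene mostly inside one is dropped.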

## Prepare Gene

In [9]:
# TODO: `method` has no value here; the call will not parse until one is set.
mcds.add_gene_rate(in_memory=False, method=,
                   output_prefix=str(result_dir / 'GeneWithSlop2kb'),
                   cell_chunks=10000)

