
# Prepare AnnData and Highly Variable Feature Selection

## Purpose
- Calculate the methylation rate for each feature
- Normalize rates per cell and clip extreme values
- Select highly variable features (HVFs), usually 100kb bins
- Prepare the HVF adata files
- Prepare the gene adata file
- In short, convert the N-D MCDS into 2-D anndata.AnnData objects
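
The first two steps above can be illustrated with a small NumPy sketch. This is only a simplified illustration and not necessarily how ALLCools computes rates: it assumes the rate is the plain ratio of methylated calls to coverage, that per-cell normalization divides by the cell's overall rate, and that clipping uses a hypothetical `clip_max` bound. The function name `normalized_rate` is made up for this example.

```python
import numpy as np


def normalized_rate(mc, cov, clip_max=10.0):
    """Toy per-cell normalization: raw rate divided by each cell's overall rate.

    mc, cov: arrays of shape (n_cells, n_features) with methylated and total counts.
    clip_max: hypothetical upper bound for the normalized value (an assumption).
    """
    mc = np.asarray(mc, dtype=float)
    cov = np.asarray(cov, dtype=float)
    # Raw methylation rate per (cell, feature); NaN where there is no coverage.
    with np.errstate(divide='ignore', invalid='ignore'):
        rate = np.where(cov > 0, mc / cov, np.nan)
    # Per-cell overall rate: total methylated calls over total coverage.
    cell_rate = mc.sum(axis=1) / cov.sum(axis=1)
    norm = rate / cell_rate[:, None]
    # Clip extreme ratios so a few outlier bins do not dominate.
    return np.clip(norm, 0.0, clip_max)


mc = [[2, 0, 9], [5, 5, 0]]
cov = [[10, 10, 10], [10, 10, 0]]
norm = normalized_rate(mc, cov, clip_max=2.0)
```

After normalization, a value near 1 means the feature is methylated at about the cell's average level, which makes cells with different global methylation levels comparable.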

## Notes:
- We usually cluster on 100kb bins and annotate clusters with gene body mCH (for neurons) or mCG (for non-neurons).
- "Feature" therefore usually refers to chrom100k bins, but you can also try other features (e.g., genes) for clustering.
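
As a rough sketch of how feature filtering and HVF selection could work on such bins, the example below keeps features whose mean coverage and mean normalized rate fall inside given windows, then takes the most dispersed ones. The thresholds mirror the `min_feature_cov`, `max_feature_cov`, `min_ch_hvf_mean`, `max_ch_hvf_mean`, and `ch_hvf_top` parameters defined in the Parameters section, but the function `select_hvf` and the variance-over-mean dispersion measure are assumptions of this sketch, not the pipeline's actual implementation.

```python
import pandas as pd


def select_hvf(norm, mean_cov, min_cov, max_cov, min_mean, max_mean, top_n):
    """Return labels of the top_n most dispersed features passing both filters.

    norm: DataFrame (cells x features) of normalized rates.
    mean_cov: Series of mean coverage per feature, indexed like norm.columns.
    """
    # Filter 1: drop features with too little or too much mean coverage.
    keep_cov = (mean_cov > min_cov) & (mean_cov < max_cov)
    # Filter 2: drop features whose mean normalized rate is out of range.
    mean = norm.mean(axis=0)
    keep_mean = (mean > min_mean) & (mean < max_mean)
    candidates = norm.loc[:, keep_cov & keep_mean]
    # Dispersion is taken as variance / mean (an assumption of this sketch).
    dispersion = candidates.var(axis=0) / candidates.mean(axis=0)
    return dispersion.nlargest(top_n).index


# Toy data: five bins (columns) measured in four cells (rows).
norm = pd.DataFrame({
    10: [1.0, 1.0, 1.0, 1.0],
    11: [0.5, 1.5, 0.5, 1.5],
    12: [0.9, 1.1, 0.9, 1.1],
    13: [0.0, 4.0, 0.0, 4.0],
    14: [0.2, 1.8, 0.2, 1.8],
})
mean_cov = pd.Series({10: 1000, 11: 1200, 12: 800, 13: 900, 14: 100})

hvf = select_hvf(norm, mean_cov, min_cov=500, max_cov=3000,
                 min_mean=0.5, max_mean=2.5, top_n=2)
```

In this toy input, bin 14 is removed by the coverage filter even though it varies strongly, which is the point of filtering before ranking.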

## Input
- Cell metadata table, MCDS list

## Output
- mCH HVF adata
- mCG HVF adata
- Gene rate MCDS

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
import pathlib

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scanpy as sc  # scanpy.api is deprecated; import the top-level package
import seaborn as sns
import xarray as xr
from pybedtools import BedTool
from ALLCools.mcds.MCDS import MCDS
from cemba_data.plot import *

In [3]:
result_dir = pathlib.Path('Adata')
result_dir.mkdir(exist_ok=True)
fig_dir = pathlib.Path('fig/feature_selection')
fig_dir.mkdir(exist_ok=True, parents=True)

## Parameters

In [4]:
# parameters cell
in_memory = True
dask_distribute = False

# selected cell metadata path
cell_meta_path = ''

# blacklist
black_list_path = '/home/hanliu/project/mouse_rostral_brain/misc/mm10-blacklist.v2.bed.gz'

# mcds_path
mcds_path_list = [
    'snm3C.for_clustering.mcds'
]

clustering_feature = 'chrom100k'  # usually 100kb chromosome bins or genes

# remove bad features
black_list_region = None
exclude_chromosome = ['chrY', 'chrM']

# preprocess parameters
min_feature_cov, max_feature_cov = 500, 3000

ch_hvf_top = 3000
min_ch_hvf_mean = 0.5
max_ch_hvf_mean = 2.5

cg_hvf_top = 3000
min_cg_hvf_mean = 0.5
max_cg_hvf_mean = 1.2

generate_gene_rate = True

In [5]:
if dask_distribute:
    from dask.distributed import Client
    client = Client(dashboard_address=':5555')

## Load data
### Cell meta

In [6]:
cell_meta = pd.read_csv('cell_meta.csv', index_col=0)
cell_meta.head()

| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | mCH_rate | mCG_rate |
|---|---|---|---|---|---|---|---|---|---|---|
| CEMBA191126-9J-1-CEMBA191126-9J-2-A10_ad001 | 1130605 | 772813 | 241251 | 72464 | 35029 | 0.69 | 0.21 | 0.1 | 0.009055 | 0.702498 |
| CEMBA191126-9J-1-CEMBA191126-9J-2-A10_ad002 | 2248942 | 1543407 | 456458 | 164457 | 77386 | 0.65 | 0.24 | 0.11 | 0.009589 | 0.703959 |
| CEMBA191126-9J-1-CEMBA191126-9J-2-A10_ad004 | 1467972 | 989251 | 292156 | 105706 | 48788 | 0.65 | 0.24 | 0.11 | 0.009886 | 0.710592 |
| CEMBA191126-9J-1-CEMBA191126-9J-2-A10_ad006 | 3129037 | 2142708 | 659870 | 204630 | 107233 | 0.68 | 0.21 | 0.11 | 0.012262 | 0.723521 |
| CEMBA191126-9J-1-CEMBA191126-9J-2-A10_ad007 | 1272931 | 864230 | 238409 | 107488 | 44730 | 0.61 | 0.28 | 0.11 | 0.019954 | 0.729968 |


### MCDS

In [9]:
mcds = xr.open_dataarray('Adata/chrom100k_da_rate.nc').load()

In [11]:
import anndata
snmc_ch_adata = anndata.read_h5ad('Adata/mch_adata.norm_per_cell.hvf.snmc.h5ad')
snmc_cg_adata = anndata.read_h5ad('Adata/mcg_adata.norm_per_cell.hvf.snmc.h5ad')

Int64Index([   30,    31,    32,    33,    34,    35,    36,    37,    38,
               39,
            ...
            26328, 26329, 26330, 26331, 26332, 26333, 26334, 26335, 26336,
            26337],
           dtype='int64', name='chrom100k', length=23952)

Int64Index([   32,    33,    34,    35,    50,    65,    67,    68,    77,
               98,
            ...
            26278, 26291, 26301, 26302, 26306, 26307, 26315, 26318, 26319,
            26321],
           dtype='int64', length=2989)

In [31]:
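# Keep only the snmC mCH HVF bins that also exist in this dataset.
# Index.intersection is explicit; `&` on pandas Index objects is no longer a set operation.
var_names = snmc_ch_adata.var_names.astype(int).intersection(
    mcds.get_index('chrom100k'))
X = mcds.sel(chrom100k=var_names, mc_type='CHN').values

# cells x HVF bins matrix for mCH
ch_adata = anndata.AnnData(X=X,
                           obs=pd.DataFrame([], index=mcds.get_index('cell')),
                           var=pd.DataFrame([], index=var_names))

# Same procedure for the snmC mCG HVF bins.
var_names = snmc_cg_adata.var_names.astype(int).intersection(
    mcds.get_index('chrom100k'))
X = mcds.sel(chrom100k=var_names, mc_type='CGN').values

# cells x HVF bins matrix for mCG
cg_adata = anndata.AnnData(X=X,
                           obs=pd.DataFrame([], index=mcds.get_index('cell')),
                           var=pd.DataFrame([], index=var_names))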

Transforming to str index.
Transforming to str index.
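
The cell above reuses HVF bins chosen in another dataset and keeps only those that are also present here. The toy sketch below shows that intersect-then-subset pattern with pandas and NumPy only; the bin ids and the `rate` matrix are invented for illustration, and `Index.intersection` is used as the explicit set operation.

```python
import numpy as np
import pandas as pd

# Hypothetical bin ids: HVFs from a reference dataset and bins present here.
hvf_bins = pd.Index([32, 33, 50, 99999], name='chrom100k')
mcds_bins = pd.Index([30, 31, 32, 33, 34, 50], name='chrom100k')

# Bins shared by both datasets, in the order of the HVF list.
shared = hvf_bins.intersection(mcds_bins)

# Column positions of the shared bins in the local matrix.
cols = mcds_bins.get_indexer(shared)

# Toy cells x bins rate matrix (2 cells, 6 bins), subset to the shared bins.
rate = np.arange(12, dtype=float).reshape(2, 6)
X = rate[:, cols]
```

Bins missing from the local dataset (here `99999`) are silently dropped, which is why the resulting AnnData objects can have slightly fewer variables than the reference HVF list. The "Transforming to str index." messages are emitted by anndata when it converts the integer bin ids to strings for `var_names`.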


### CH adata

In [32]:
ch_adata.write_h5ad(result_dir / 'mch_adata.norm_per_cell.hvf.m3c.h5ad')

In [33]:
ch_adata

AnnData object with n_obs × n_vars = 5398 × 2989 

### CG adata

In [34]:
cg_adata.write_h5ad(result_dir / 'mcg_adata.norm_per_cell.hvf.m3c.h5ad')

In [35]:
cg_adata

AnnData object with n_obs × n_vars = 5398 × 2984 

## Prepare Gene