# Data Preview
The competition page does not show a preview of the .h5 files, so this notebook starts by checking the first few rows of each file.

Data loading follows the notebook below.

[Getting Started - Data Loading created @ Peter Holderrieth](https://www.kaggle.com/code/peterholderrieth/getting-started-data-loading)

# Data files
Nine files are provided.

1. metadata.csv
>*     cell_id - A unique identifier for each observed cell.
>*     donor - An identifier for the four cell donors.
>*     day - The day of the experiment on which the observation was made.
>*     technology - Either citeseq or multiome.
>*     cell_type - One of the following cell types, or else hidden.
>>* MasP = Mast Cell Progenitor
>>* MkP = Megakaryocyte Progenitor
>>* NeuP = Neutrophil Progenitor
>>* MoP = Monocyte Progenitor
>>* EryP = Erythrocyte Progenitor
>>* HSC = Hematopoietic Stem Cell
>>* BP = B-Cell Progenitor
    
2. train_multi_inputs.h5
3. test_multi_inputs.h5
>* train/test_multi_inputs.h5 - ATAC-seq peak counts (chromatin accessibility) transformed with TF-IDF using the default log(TF) * log(IDF) output. Rows correspond to cells and columns to the genomic locations whose accessibility is measured, identified by genomic coordinates on the reference genome GRCh38 provided in the 10x References - 2020-A (July 7, 2020).

4. train_multi_targets.h5
>* train_multi_targets.h5 - RNA gene expression levels as library-size normalized and log1p transformed counts for the same cells.

5. train_cite_inputs.h5
6. test_cite_inputs.h5
>* train/test_cite_inputs.h5 - RNA library-size normalized and log1p transformed counts (gene expression levels), with rows corresponding to cells and columns corresponding to genes. The competition description gives the column format as {gene_name}_{gene_ensemble-ids}; in the preview below the Ensembl ID comes first, e.g. ENSG00000121410_A1BG.
7. train_cite_targets.h5
>* train_cite_targets.h5 - Surface protein levels for the same cells, dsb normalized.

8. evaluation_ids.csv
>* Identifies the labels from the test set to be evaluated. It provides a join key from the cell_id / gene_id identifiers of the label matrix to the row_id needed for the submission file.

9. sample_submission.csv
>* A sample submission file in the correct format. See the Evaluation page for more information.
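
Several of the files above are described as "library-size normalized and log1p transformed". The sketch below illustrates what that transformation generally looks like for one cell. The `scale` factor and the raw counts are assumptions made up for illustration; the exact normalization used to prepare the competition data is not stated here.

```python
import math

def normalize_log1p(counts, scale=10_000.0):
    """Library-size normalize one cell's raw counts, then apply log1p.

    `scale` is a hypothetical scaling factor, not a value confirmed
    for the competition data.
    """
    total = sum(counts)
    if total == 0:
        # An empty cell stays all zeros instead of dividing by zero.
        return [0.0 for _ in counts]
    return [math.log1p(c / total * scale) for c in counts]

cell_counts = [0, 3, 1, 0, 6]  # made-up raw counts for a single cell
normalized = normalize_log1p(cell_counts)
```

Zero counts stay exactly zero after this transformation, which is consistent with the many 0.0 entries in the previews.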


# Data Preview

Only the first few rows of each file are read, because loading the full files would exhaust the available memory.
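
If more than the first few rows are needed, one option is to read a file in consecutive row slices rather than all at once. The sketch below assumes the total row count `n_rows` is already known (for example from metadata.csv); `row_ranges`, `iter_hdf_chunks`, and `chunk_size` are hypothetical names introduced here. It relies on the `start` and `stop` arguments of `pd.read_hdf`, the same mechanism as the `stop=num_rows` calls in this notebook.

```python
import pandas as pd

def row_ranges(n_rows, chunk_size):
    """Yield (start, stop) pairs that cover rows 0..n_rows in order."""
    for start in range(0, n_rows, chunk_size):
        yield start, min(start + chunk_size, n_rows)

def iter_hdf_chunks(path, n_rows, chunk_size=1000):
    """Yield consecutive row slices of an .h5 file as DataFrames."""
    for start, stop in row_ranges(n_rows, chunk_size):
        # Only rows start..stop-1 are loaded into memory at a time.
        yield pd.read_hdf(path, start=start, stop=stop)
```

Each yielded DataFrame can be reduced (for example summed or converted to a sparse matrix) before the next slice is read.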

In [1]:
num_rows = 5

In [2]:
#If you see a urllib warning running this cell, go to "Settings" on the right hand side, 
#and turn on internet. Note, you need to be phone verified.
!pip install --quiet tables


In [3]:
import os
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

DATA_DIR = "/kaggle/input/open-problems-multimodal/"
FP_CELL_METADATA = os.path.join(DATA_DIR,"metadata.csv")  # 1

# Trailing numbers match the df_<n>_ prefixes used in the cells below.
FP_CITE_TRAIN_INPUTS = os.path.join(DATA_DIR,"train_cite_inputs.h5")  # 5
FP_CITE_TRAIN_TARGETS = os.path.join(DATA_DIR,"train_cite_targets.h5")  # 7
FP_CITE_TEST_INPUTS = os.path.join(DATA_DIR,"test_cite_inputs.h5")  # 6

FP_MULTIOME_TRAIN_INPUTS = os.path.join(DATA_DIR,"train_multi_inputs.h5")  # 2
FP_MULTIOME_TRAIN_TARGETS = os.path.join(DATA_DIR,"train_multi_targets.h5")  # 4
FP_MULTIOME_TEST_INPUTS = os.path.join(DATA_DIR,"test_multi_inputs.h5")  # 3

FP_SUBMISSION = os.path.join(DATA_DIR,"sample_submission.csv")  # 8
FP_EVALUATION_IDS = os.path.join(DATA_DIR,"evaluation_ids.csv")  # 9

In [4]:
df_1_metadata = pd.read_csv(FP_CELL_METADATA, nrows=num_rows)
df_1_metadata.head(num_rows)

Unnamed: 0,cell_id,day,donor,cell_type,technology
0,c2150f55becb,2,27678,HSC,citeseq
1,65b7edf8a4da,2,27678,HSC,citeseq
2,c1b26cb1057b,2,27678,EryP,citeseq
3,917168fa6f83,2,27678,NeuP,citeseq
4,2b29feeca86d,2,27678,EryP,citeseq
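
Because metadata.csv records the technology, donor, day, and cell type of every cell, it can be used to select the cells that belong to one technology and to summarize them. The sketch below rebuilds the five preview rows shown above as a small DataFrame (so it runs without the competition files) and counts cells per donor, day, and cell type. The variable names are chosen for illustration only.

```python
import pandas as pd

# The five rows shown in the metadata preview above.
metadata_head = pd.DataFrame({
    "cell_id": ["c2150f55becb", "65b7edf8a4da", "c1b26cb1057b",
                "917168fa6f83", "2b29feeca86d"],
    "day": [2, 2, 2, 2, 2],
    "donor": [27678, 27678, 27678, 27678, 27678],
    "cell_type": ["HSC", "HSC", "EryP", "NeuP", "EryP"],
    "technology": ["citeseq"] * 5,
})

# Keep only the cells measured with one technology.
cite_cells = metadata_head[metadata_head["technology"] == "citeseq"]

# Number of cells per (donor, day, cell_type) combination.
cell_counts = cite_cells.groupby(["donor", "day", "cell_type"]).size()
```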


In [5]:
df_2_train_multi_inputs = pd.read_hdf(FP_MULTIOME_TRAIN_INPUTS, stop=num_rows)
df_2_train_multi_inputs.head(num_rows)

cell_id,GL000194.1:114519-115365,GL000194.1:55758-56597,GL000194.1:58217-58957,GL000194.1:59535-60431,GL000195.1:119766-120427,GL000195.1:120736-121603,GL000195.1:137437-138345,GL000195.1:15901-16653,GL000195.1:22357-23209,GL000195.1:23751-24619,...,chrY:7722278-7723128,chrY:7723971-7724880,chrY:7729854-7730772,chrY:7731785-7732664,chrY:7810142-7811040,chrY:7814107-7815018,chrY:7818751-7819626,chrY:7836768-7837671,chrY:7869454-7870371,chrY:7873814-7874709
56390cf1b95e,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,4.428336,0.0,0.0,0.0,0.0
fc0c60183c33,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9b4a87e22ad0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
81cccad8cd81,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
15cb3d85c232,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [6]:
df_3_test_multi_inputs = pd.read_hdf(FP_MULTIOME_TEST_INPUTS, stop=num_rows)
df_3_test_multi_inputs.head(num_rows)

cell_id,GL000194.1:114519-115365,GL000194.1:55758-56597,GL000194.1:58217-58957,GL000194.1:59535-60431,GL000195.1:119766-120427,GL000195.1:120736-121603,GL000195.1:137437-138345,GL000195.1:15901-16653,GL000195.1:22357-23209,GL000195.1:23751-24619,...,chrY:7722278-7723128,chrY:7723971-7724880,chrY:7729854-7730772,chrY:7731785-7732664,chrY:7810142-7811040,chrY:7814107-7815018,chrY:7818751-7819626,chrY:7836768-7837671,chrY:7869454-7870371,chrY:7873814-7874709
458c2ae2c9b1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
01a0659b0710,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
028a8bc3f2ba,0.0,0.0,0.0,0.0,0.0,0.0,2.951019,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7ec0ca8bb863,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
caa0b0022cdc,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [7]:
df_4_train_multi_targets = pd.read_hdf(FP_MULTIOME_TRAIN_TARGETS, stop=num_rows)
df_4_train_multi_targets.head(num_rows)

cell_id,ENSG00000121410,ENSG00000268895,ENSG00000175899,ENSG00000245105,ENSG00000166535,ENSG00000256661,ENSG00000184389,ENSG00000128274,ENSG00000094914,ENSG00000081760,...,ENSG00000086827,ENSG00000174442,ENSG00000122952,ENSG00000198205,ENSG00000198455,ENSG00000070476,ENSG00000203995,ENSG00000162378,ENSG00000159840,ENSG00000074755
56390cf1b95e,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,4.893861,0.0,0.0,0.0,0.0,5.583255,0.0,4.893861
fc0c60183c33,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9b4a87e22ad0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,5.107832,0.0,0.0,0.0,0.0,0.0,0.0,5.107832
81cccad8cd81,0.0,4.507936,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,5.195558,4.507936,0.0,0.0,0.0,0.0,0.0,0.0,5.195558
15cb3d85c232,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,5.531572,0.0,0.0,4.842377,0.0


In [8]:
df_5_train_cite_inputs = pd.read_hdf(FP_CITE_TRAIN_INPUTS, stop=num_rows)
df_5_train_cite_inputs.head(num_rows)

cell_id,ENSG00000121410_A1BG,ENSG00000268895_A1BG-AS1,ENSG00000175899_A2M,ENSG00000245105_A2M-AS1,ENSG00000166535_A2ML1,ENSG00000128274_A4GALT,ENSG00000094914_AAAS,ENSG00000081760_AACS,ENSG00000109576_AADAT,ENSG00000103591_AAGAB,...,ENSG00000153975_ZUP1,ENSG00000086827_ZW10,ENSG00000174442_ZWILCH,ENSG00000122952_ZWINT,ENSG00000198205_ZXDA,ENSG00000198455_ZXDB,ENSG00000070476_ZXDC,ENSG00000162378_ZYG11B,ENSG00000159840_ZYX,ENSG00000074755_ZZEF1
45006fe3e4c8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.090185,0.0
d02759a80ba2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,4.039545,0.0,0.0,0.0,0.0,0.0,0.0
c016c6b0efa5,0.0,0.0,0.0,0.0,0.0,3.847321,0.0,3.847321,3.847321,0.0,...,0.0,0.0,3.847321,4.529743,0.0,0.0,0.0,3.847321,3.847321,0.0
ba7f733a4f75,0.0,0.0,0.0,0.0,0.0,0.0,3.436846,3.436846,0.0,0.0,...,3.436846,0.0,4.11378,5.020215,0.0,0.0,0.0,3.436846,4.11378,0.0
fbcf2443ffb2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.196826,0.0,0.0,...,0.0,4.196826,4.196826,4.196826,0.0,0.0,3.51861,4.196826,3.51861,0.0


In [9]:
df_6_test_cite_inputs = pd.read_hdf(FP_CITE_TEST_INPUTS, stop=num_rows)
df_6_test_cite_inputs.head(num_rows)

cell_id,ENSG00000121410_A1BG,ENSG00000268895_A1BG-AS1,ENSG00000175899_A2M,ENSG00000245105_A2M-AS1,ENSG00000166535_A2ML1,ENSG00000128274_A4GALT,ENSG00000094914_AAAS,ENSG00000081760_AACS,ENSG00000109576_AADAT,ENSG00000103591_AAGAB,...,ENSG00000153975_ZUP1,ENSG00000086827_ZW10,ENSG00000174442_ZWILCH,ENSG00000122952_ZWINT,ENSG00000198205_ZXDA,ENSG00000198455_ZXDB,ENSG00000070476_ZXDC,ENSG00000162378_ZYG11B,ENSG00000159840_ZYX,ENSG00000074755_ZZEF1
c2150f55becb,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.090185,0.0
65b7edf8a4da,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,4.039545,0.0,0.0,0.0,0.0,0.0,0.0
c1b26cb1057b,0.0,0.0,0.0,0.0,0.0,3.847321,0.0,3.847321,3.847321,0.0,...,0.0,0.0,3.847321,4.529743,0.0,0.0,0.0,3.847321,3.847321,0.0
917168fa6f83,0.0,0.0,0.0,0.0,0.0,0.0,3.436846,3.436846,0.0,0.0,...,3.436846,0.0,4.11378,5.020215,0.0,0.0,0.0,3.436846,4.11378,0.0
2b29feeca86d,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.196826,0.0,0.0,...,0.0,4.196826,4.196826,4.196826,0.0,0.0,3.51861,4.196826,3.51861,0.0


In [10]:
df_7_train_cite_targets = pd.read_hdf(FP_CITE_TRAIN_TARGETS, stop=num_rows)
df_7_train_cite_targets.head(num_rows)

cell_id,CD86,CD274,CD270,CD155,CD112,CD47,CD48,CD40,CD154,CD52,...,CD94,CD162,CD85j,CD23,CD328,HLA-E,CD82,CD101,CD88,CD224
45006fe3e4c8,1.167804,0.62253,0.106959,0.324989,3.331674,6.426002,1.480766,-0.728392,-0.468851,-0.073285,...,-0.44839,3.220174,-0.533004,0.674956,-0.006187,0.682148,1.398105,0.414292,1.780314,0.54807
d02759a80ba2,0.81897,0.506009,1.078682,6.848758,3.524885,5.279456,4.930438,2.069372,0.333652,-0.468088,...,0.323613,8.407108,0.131301,0.047607,-0.243628,0.547864,1.832587,0.982308,2.736507,2.184063
c016c6b0efa5,-0.356703,-0.422261,-0.824493,1.137495,0.518924,7.221962,-0.375034,1.738071,0.142919,-0.97146,...,1.348692,4.888579,-0.279483,-0.131097,-0.177604,-0.689188,9.013709,-1.182975,3.958148,2.8686
ba7f733a4f75,-1.201507,0.149115,2.022468,6.021595,7.25867,2.792436,21.708519,-0.137913,1.649969,-0.75468,...,1.504426,12.391979,0.511394,0.587863,-0.752638,1.714851,3.893782,1.799661,1.537249,4.407671
fbcf2443ffb2,-0.100404,0.697461,0.625836,-0.298404,1.369898,3.254521,-1.65938,0.643531,0.90271,1.291877,...,0.777023,6.496499,0.279898,-0.84195,-0.869419,0.675092,5.259685,-0.835379,9.631781,1.765445


In [11]:
df_8_sample_submission = pd.read_csv(FP_SUBMISSION, nrows=num_rows)
df_8_sample_submission.head(num_rows)

Unnamed: 0,row_id,target
0,0,0.0
1,1,0.0
2,2,0.0
3,3,0.0
4,4,0.0


In [12]:
df_9_evaluation_ids = pd.read_csv(FP_EVALUATION_IDS, nrows=num_rows)
df_9_evaluation_ids.head(num_rows)

Unnamed: 0,row_id,cell_id,gene_id
0,0,c2150f55becb,CD86
1,1,c2150f55becb,CD274
2,2,c2150f55becb,CD270
3,3,c2150f55becb,CD155
4,4,c2150f55becb,CD112
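
evaluation_ids.csv maps each (cell_id, gene_id) pair to the row_id expected in the submission file. A minimal sketch of that join follows, assuming predictions are held in a wide DataFrame with one row per cell and one column per target. The prediction values are made up, and `predictions`, `long_predictions`, and `submission` are hypothetical names.

```python
import pandas as pd

# First rows of the evaluation_ids.csv preview above.
evaluation_ids = pd.DataFrame({
    "row_id": [0, 1, 2],
    "cell_id": ["c2150f55becb"] * 3,
    "gene_id": ["CD86", "CD274", "CD270"],
})

# Hypothetical wide prediction matrix: rows are cells, columns are targets.
predictions = pd.DataFrame(
    [[0.5, 0.1, -0.2]],
    index=pd.Index(["c2150f55becb"], name="cell_id"),
    columns=["CD86", "CD274", "CD270"],
)

# Wide -> long: one row per (cell_id, gene_id) pair.
long_predictions = predictions.reset_index().melt(
    id_vars="cell_id", var_name="gene_id", value_name="target"
)

# Attach row_id via the join key and keep the two submission columns.
submission = (
    evaluation_ids.merge(long_predictions, on=["cell_id", "gene_id"], how="left")
    .sort_values("row_id")[["row_id", "target"]]
    .reset_index(drop=True)
)
```

The left join keeps every row_id from evaluation_ids.csv, so any pair without a prediction shows up as a missing target rather than silently disappearing.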


# Column Name Patterns
### Check for patterns in column names by replacing each digit with an "@" symbol

The Multiome files have a very large number of columns, but the column names follow only a few patterns.

In [13]:
print(len(df_2_train_multi_inputs.columns))

228942


In [14]:
import re
cols_replace_digit = []
for col in df_2_train_multi_inputs.columns:
    cols_replace_digit.append(re.sub(r'\d', "@", col))
set(cols_replace_digit)

{'GL@@@@@@.@:@@@@@-@@@@@',
 'GL@@@@@@.@:@@@@@-@@@@@@',
 'GL@@@@@@.@:@@@@@@-@@@@@@',
 'KI@@@@@@.@:@@@@-@@@@',
 'KI@@@@@@.@:@@@@@-@@@@@',
 'KI@@@@@@.@:@@@@@@-@@@@@@',
 'KI@@@@@@.@:@@@@@@@-@@@@@@@',
 'chr@:@@@@-@@@@@',
 'chr@:@@@@@-@@@@@',
 'chr@:@@@@@@-@@@@@@',
 'chr@:@@@@@@-@@@@@@@',
 'chr@:@@@@@@@-@@@@@@@',
 'chr@:@@@@@@@@-@@@@@@@@',
 'chr@:@@@@@@@@@-@@@@@@@@@',
 'chr@@:@@@@-@@@@@',
 'chr@@:@@@@@-@@@@@',
 'chr@@:@@@@@@-@@@@@@',
 'chr@@:@@@@@@-@@@@@@@',
 'chr@@:@@@@@@@-@@@@@@@',
 'chr@@:@@@@@@@-@@@@@@@@',
 'chr@@:@@@@@@@@-@@@@@@@@',
 'chr@@:@@@@@@@@@-@@@@@@@@@',
 'chrX:@@@@@-@@@@@',
 'chrX:@@@@@@-@@@@@@',
 'chrX:@@@@@@@-@@@@@@@',
 'chrX:@@@@@@@@-@@@@@@@@',
 'chrX:@@@@@@@@@-@@@@@@@@@',
 'chrY:@@@@@@@-@@@@@@@',
 'chrY:@@@@@@@@-@@@@@@@@'}

In [15]:
print(len(df_3_test_multi_inputs.columns))

228942


In [16]:
import re
cols_replace_digit = []
for col in df_3_test_multi_inputs.columns:
    cols_replace_digit.append(re.sub(r'\d', "@", col))
set(cols_replace_digit)

{'GL@@@@@@.@:@@@@@-@@@@@',
 'GL@@@@@@.@:@@@@@-@@@@@@',
 'GL@@@@@@.@:@@@@@@-@@@@@@',
 'KI@@@@@@.@:@@@@-@@@@',
 'KI@@@@@@.@:@@@@@-@@@@@',
 'KI@@@@@@.@:@@@@@@-@@@@@@',
 'KI@@@@@@.@:@@@@@@@-@@@@@@@',
 'chr@:@@@@-@@@@@',
 'chr@:@@@@@-@@@@@',
 'chr@:@@@@@@-@@@@@@',
 'chr@:@@@@@@-@@@@@@@',
 'chr@:@@@@@@@-@@@@@@@',
 'chr@:@@@@@@@@-@@@@@@@@',
 'chr@:@@@@@@@@@-@@@@@@@@@',
 'chr@@:@@@@-@@@@@',
 'chr@@:@@@@@-@@@@@',
 'chr@@:@@@@@@-@@@@@@',
 'chr@@:@@@@@@-@@@@@@@',
 'chr@@:@@@@@@@-@@@@@@@',
 'chr@@:@@@@@@@-@@@@@@@@',
 'chr@@:@@@@@@@@-@@@@@@@@',
 'chr@@:@@@@@@@@@-@@@@@@@@@',
 'chrX:@@@@@-@@@@@',
 'chrX:@@@@@@-@@@@@@',
 'chrX:@@@@@@@-@@@@@@@',
 'chrX:@@@@@@@@-@@@@@@@@',
 'chrX:@@@@@@@@@-@@@@@@@@@',
 'chrY:@@@@@@@-@@@@@@@',
 'chrY:@@@@@@@@-@@@@@@@@'}
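
All of the Multiome input patterns above have the shape `<contig>:<start>-<end>`, so each column name can be split into a contig name and two integer coordinates. The sketch below is one way to do that with a regular expression; `PEAK_RE` and `parse_peak` are hypothetical names, and the assumption that every column matches this shape is based only on the patterns listed above.

```python
import re

# Contig name, a colon, then start-end. Contig names may contain dots,
# as in "GL000194.1", so anything up to the colon is accepted.
PEAK_RE = re.compile(r"^(?P<contig>[^:]+):(?P<start>\d+)-(?P<end>\d+)$")

def parse_peak(column_name):
    """Split a peak column name into (contig, start, end)."""
    match = PEAK_RE.match(column_name)
    if match is None:
        raise ValueError(f"unexpected column name: {column_name!r}")
    return match["contig"], int(match["start"]), int(match["end"])

# A column name taken from the train_multi_inputs.h5 preview.
contig, start, end = parse_peak("GL000194.1:114519-115365")
peak_width = end - start
```

Parsed coordinates make it possible to group columns by chromosome or to compare peak widths, neither of which is visible from the masked patterns alone.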

In [17]:
print(len(df_4_train_multi_targets.columns))

23418


In [18]:
cols_replace_digit = []
for col in df_4_train_multi_targets.columns:
    cols_replace_digit.append(re.sub(r'\d', "@", col))
set(cols_replace_digit)

{'ENSG@@@@@@@@@@@'}
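
The masking loop used in the cells above can be wrapped in a small helper that also counts how many columns share each pattern, which a plain `set` does not show. This is a sketch with hypothetical helper names (`mask_digits`, `pattern_counts`); the sample column names are copied from the previews earlier in the notebook.

```python
import re
from collections import Counter

def mask_digits(name):
    """Replace every digit in a column name with '@'."""
    return re.sub(r"\d", "@", name)

def pattern_counts(columns):
    """Count how many column names fall under each masked pattern."""
    return Counter(mask_digits(col) for col in columns)

sample_columns = [
    "GL000194.1:114519-115365",
    "GL000194.1:55758-56597",
    "chrY:7722278-7723128",
    "chrY:7723971-7724880",
    "ENSG00000121410",
]
counts = pattern_counts(sample_columns)
```

Calling `counts.most_common()` lists the patterns from most to least frequent.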

The CITEseq input files have many more column-name patterns.

In [19]:
print(len(df_5_train_cite_inputs.columns))

22050


In [20]:
cols_replace_digit = []
for col in df_5_train_cite_inputs.columns:
    cols_replace_digit.append(re.sub(r'\d', "@", col))
set(cols_replace_digit)

{'ENSG@@@@@@@@@@@_CREG@',
 'ENSG@@@@@@@@@@@_KATNB@',
 'ENSG@@@@@@@@@@@_PMS@P@@',
 'ENSG@@@@@@@@@@@_OOEP',
 'ENSG@@@@@@@@@@@_TNPO@P@',
 'ENSG@@@@@@@@@@@_KLF@-AS@',
 'ENSG@@@@@@@@@@@_EMSY',
 'ENSG@@@@@@@@@@@_CIPC',
 'ENSG@@@@@@@@@@@_PTPRS',
 'ENSG@@@@@@@@@@@_TCAP',
 'ENSG@@@@@@@@@@@_SPINK@',
 'ENSG@@@@@@@@@@@_AQR',
 'ENSG@@@@@@@@@@@_EDC@',
 'ENSG@@@@@@@@@@@_ANTXR@',
 'ENSG@@@@@@@@@@@_LDHA',
 'ENSG@@@@@@@@@@@_ARC',
 'ENSG@@@@@@@@@@@_SCG@',
 'ENSG@@@@@@@@@@@_DNAJC@B',
 'ENSG@@@@@@@@@@@_OR@G@',
 'ENSG@@@@@@@@@@@_NRL',
 'ENSG@@@@@@@@@@@_AGL',
 'ENSG@@@@@@@@@@@_CCND@',
 'ENSG@@@@@@@@@@@_CHRAC@',
 'ENSG@@@@@@@@@@@_APLF',
 'ENSG@@@@@@@@@@@_PTN',
 'ENSG@@@@@@@@@@@_EXTL@',
 'ENSG@@@@@@@@@@@_DSTNP@',
 'ENSG@@@@@@@@@@@_NAPEPLD',
 'ENSG@@@@@@@@@@@_HGSNAT',
 'ENSG@@@@@@@@@@@_MYCBPAP',
 'ENSG@@@@@@@@@@@_CA@A',
 'ENSG@@@@@@@@@@@_CASD@',
 'ENSG@@@@@@@@@@@_CAPRIN@',
 'ENSG@@@@@@@@@@@_UBA@-AS@',
 'ENSG@@@@@@@@@@@_RFK',
 'ENSG@@@@@@@@@@@_PF@',
 'ENSG@@@@@@@@@@@_RTL@C',
 'ENSG@@@@@@@@@@@_YWHAEP@',
 'ENSG@@@

In [21]:
len(set(cols_replace_digit))

8734
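
The large number of patterns comes from the gene symbol appended to each Ensembl ID. Splitting each name at the first underscore separates the two parts, after which the ID part collapses to the same single pattern seen in the Multiome targets. The sketch below assumes every column has the form `<Ensembl ID>_<symbol>`, as in the previews above; `split_gene_column` is a hypothetical helper.

```python
import re

def split_gene_column(column_name):
    """Split 'ENSG..._SYMBOL' at the first underscore."""
    ensembl_id, _, symbol = column_name.partition("_")
    return ensembl_id, symbol

# Column names taken from the train_cite_inputs.h5 preview.
cite_columns = [
    "ENSG00000121410_A1BG",
    "ENSG00000268895_A1BG-AS1",
    "ENSG00000074755_ZZEF1",
]
ensembl_ids = [split_gene_column(col)[0] for col in cite_columns]

# Masking only the ID part leaves a single pattern.
id_patterns = {re.sub(r"\d", "@", ensembl_id) for ensembl_id in ensembl_ids}
```

The bare Ensembl IDs have the same format as the column names of train_multi_targets.h5, which may help when comparing genes across the two technologies.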

In [22]:
print(len(df_7_train_cite_targets.columns))

140


In [23]:
cols_replace_digit = []
for col in df_7_train_cite_targets.columns:
    cols_replace_digit.append(re.sub(r'\d', "@", col))
set(cols_replace_digit)

{'CD@',
 'CD@@',
 'CD@@@',
 'CD@@@a',
 'CD@@@b',
 'CD@@@e@',
 'CD@@L',
 'CD@@P',
 'CD@@RA',
 'CD@@RO',
 'CD@@a',
 'CD@@b',
 'CD@@c',
 'CD@@d',
 'CD@@f',
 'CD@@j',
 'CD@c',
 'CD@d',
 'CX@CR@',
 'FceRIa',
 'HLA-A-B-C',
 'HLA-DR',
 'HLA-E',
 'IgD',
 'IgM',
 'KLRG@',
 'LOX-@',
 'Mouse-IgG@',
 'Mouse-IgG@a',
 'Mouse-IgG@b',
 'Podoplanin',
 'Rat-IgG@',
 'Rat-IgG@a',
 'Rat-IgG@b',
 'TCR',
 'TCRVa@.@',
 'TCRVd@',
 'TIGIT',
 'integrinB@'}
