Requirements:
- scanpy

### Loading the dataset (gene expression matrix)

The dataset is an `AnnData` object. In essence it is an ordinary table, similar to a `pandas.DataFrame`, that also carries additional information about its columns (features) and rows (instances).
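To make the "table plus annotations" idea concrete, here is a minimal pandas-only sketch of the three aligned tables. It is a stand-in, not the AnnData API: the cell names, gene names, values, and the `is_hypothetical` column are all invented for illustration. In a real `AnnData` object the corresponding parts are exposed as `dataset.X`, `dataset.obs`, and `dataset.var`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the three aligned tables an AnnData object keeps together.
# All names and values here are invented for illustration.
cells = ["cell_a", "cell_b", "cell_c"]
genes = ["GENE1", "GENE2"]

# X: the expression matrix, one row per cell and one column per gene.
X = pd.DataFrame(
    np.array([[0.0, 1.5], [2.0, 0.0], [0.0, 0.0]], dtype=np.float32),
    index=cells,
    columns=genes,
)

# obs: per-cell (instance) annotations, indexed like the rows of X.
obs = pd.DataFrame({"condition": ["control", "stimulated", "control"]}, index=cells)

# var: per-gene (feature) annotations, indexed like the columns of X.
var = pd.DataFrame({"is_hypothetical": [True, True]}, index=genes)

# The annotations stay aligned with the matrix through the shared labels.
assert X.index.equals(obs.index)
assert X.columns.equals(var.index)
```

Keeping the annotations in separate tables that share labels with the matrix is what lets a single selection of cells or genes apply to the data and its metadata at once.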


In [20]:
import os
import scanpy

dataset_filename = os.path.join("data", "train_kang.h5ad")
dataset = scanpy.read_h5ad(dataset_filename)

### Gene expression matrix

The gene expression table itself looks like this:

In [23]:
df_data = dataset.to_df()  # keep the DataFrame; later cells refer to it as df_data
df_data

index,AL627309.1,RP11-206L10.9,LINC00115,NOC2L,KLHL17,HES4,ISG15,TNFRSF18,TNFRSF4,SDF4,...,C21orf67,FAM207A,ADARB1,POFUT2,COL18A1,SLC19A1,COL6A2,FTCD,DIP2A,S100B
AAACATACCAAGCT-1-stimulated,0.0,0.0,0.0,0.0,0.0,0.0,3.206646,0.947689,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
AAACATACCCCTAC-1-stimulated,0.0,0.0,0.0,0.0,0.0,0.0,3.314060,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
AAACATACCCGTAA-1-stimulated,0.0,0.0,0.0,0.0,0.0,0.0,2.344877,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
AAACATACCCTCGT-1-stimulated,0.0,0.0,0.0,0.0,0.0,0.0,2.292093,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
AAACATACGAGGTG-1-stimulated,0.0,0.0,0.0,0.0,0.0,0.0,2.430965,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
TTTGACTGGCGGAA-1-control,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
TTTGACTGTCGTAG-1-control,0.0,0.0,0.0,0.0,0.0,0.0,0.909953,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
TTTGACTGTTACCT-1-control,0.0,0.0,0.0,0.0,0.0,0.0,1.666875,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
TTTGCATGCTTCGC-1-control,0.0,0.0,0.0,0.0,0.0,0.0,0.948482,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


The vocabulary contains 6998 genes in total that can be used to describe a cell. One instance (row) is a vector of length 6998 that represents a cell. Each element of the vector is the expression of a particular gene in that cell. Gene expression is a quantity proportional to the number of occurrences of that gene in the cell's RNA.

This gene-based representation of cells closely resembles the Bag-of-Words method:
- A single cell is a text document.
- A single gene is a word from the shared vocabulary.
- The expression of a gene in a cell is the number of occurrences of the word in the document.
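The analogy can be sketched with the standard library alone. The toy "cells" below are invented: each one is a list of gene names, one entry per detected transcript, in the same way that a document is a list of words.

```python
from collections import Counter

# Hypothetical toy data: each "cell" is a list of gene names, one entry per
# detected transcript, just as a document is a list of words.
cells = {
    "cell_a": ["ISG15", "ISG15", "PARK7"],
    "cell_b": ["PARK7", "PAXBP1", "PAXBP1", "PAXBP1"],
}

# The shared vocabulary is the sorted set of all genes seen in any cell.
vocabulary = sorted({gene for reads in cells.values() for gene in reads})

# Each cell becomes a count vector over the vocabulary.
count_matrix = {}
for name, reads in cells.items():
    counts = Counter(reads)
    count_matrix[name] = [counts[gene] for gene in vocabulary]

print(vocabulary)
for name, vector in count_matrix.items():
    print(name, vector)
```

Every cell ends up as a vector of the same length, with zeros for the genes it does not contain. The real matrix has the same shape of data, only with 6998 columns.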

The gene expression matrix has gone through several stages of processing and transformation. That is why its elements are floating-point numbers rather than integers (occurrence counts).
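The notebook does not say which transformations were applied to this file. Purely as an assumption, the sketch below shows one common preprocessing recipe for count data: scaling each cell to a fixed total and then taking `log1p`. The raw counts and the `target_sum` value are invented. In scanpy, comparable steps are available as `scanpy.pp.normalize_total` and `scanpy.pp.log1p`.

```python
import numpy as np

# Hypothetical raw counts: 2 cells x 3 genes.
raw_counts = np.array([[2, 1, 0], [0, 10, 30]], dtype=np.float64)

# Scale every cell to the same total (10,000 is an arbitrary choice here),
# so that cells with different sequencing depth become comparable.
target_sum = 10_000
totals = raw_counts.sum(axis=1, keepdims=True)
normalized = raw_counts / totals * target_sum

# log1p compresses the dynamic range and keeps zeros at zero.
log_expression = np.log1p(normalized)

print(log_expression)
```

Whatever the exact recipe was, steps of this kind turn integer counts into non-integer values while leaving zero entries at zero, which matches what the table shows.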



For example, let us look at the vector that represents the cell with the code `TTTGCATGCTTCGC-1-control`:

In [27]:
df_data.loc['TTTGCATGCTTCGC-1-control']

index
AL627309.1       0.0
RP11-206L10.9    0.0
LINC00115        0.0
NOC2L            0.0
KLHL17           0.0
                ... 
SLC19A1          0.0
COL6A2           0.0
FTCD             0.0
DIP2A            0.0
S100B            0.0
Name: TTTGCATGCTTCGC-1-control, Length: 6998, dtype: float32

The representation vector is very sparse, since most genes are not expressed in a given cell. Some genes are expressed, however, for example `ISG15`, `PARK7`, and `PAXBP1`.


In [30]:
df_data.loc['TTTGCATGCTTCGC-1-control', ['ISG15', 'PARK7', 'PAXBP1']]

index
ISG15     0.948482
PARK7     1.215708
PAXBP1    0.582715
Name: TTTGCATGCTTCGC-1-control, dtype: float32
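Instead of naming the expressed genes by hand, they can be filtered out of the vector, and the sparsity can be measured as the fraction of zero entries. The sketch below uses an invented five-gene vector in place of the real one.

```python
import pandas as pd

# Hypothetical expression vector for one cell (values invented).
cell_vector = pd.Series(
    {"GENE1": 0.0, "GENE2": 1.2, "GENE3": 0.0, "GENE4": 0.6, "GENE5": 0.0},
    dtype="float32",
)

# Keep only the genes with non-zero expression.
expressed = cell_vector[cell_vector > 0]

# Fraction of zero entries, a simple measure of sparsity.
sparsity = (cell_vector == 0).mean()

print(expressed.index.tolist())
print(sparsity)
```

On the real data, `df_data.loc['TTTGCATGCTTCGC-1-control']` could take the place of `cell_vector`.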

### AnnData features

As mentioned above, cell data is usually stored in an `AnnData` object. It resembles an ordinary table but offers additional features that make it convenient to work with.

For example, the cells in our dataset are split into two groups: control (`control`) and stimulated (`stimulated`).
Using AnnData, we can easily extract these groups:

In [36]:
control_group = dataset[dataset.obs['condition'] == 'control']
stimulated_group = dataset[dataset.obs['condition'] == 'stimulated']

print(f'The control group contains {control_group.shape[0]} instances.')
print(f'The stimulated group contains {stimulated_group.shape[0]} instances.')

The control group contains 8007 instances.
The stimulated group contains 8886 instances.


Each cell also belongs to one of seven cell types. The table below shows how the cells are distributed across groups and types. For example, there are 2437 cells of type `CD4T` in the `control` group.

In [42]:
dataset.obs.groupby(['condition', 'cell_type']).size().rename('# of cells').to_frame()

condition,cell_type,# of cells
control,CD4T,2437
control,CD14+Mono,1946
control,B,818
control,CD8T,574
control,NK,517
control,FCGR3A+Mono,1100
control,Dendritic,615
stimulated,CD4T,3127
stimulated,CD14+Mono,615
stimulated,B,993
