---

# **2/3 малышей ("2/3 of the Kids")**

1) Nosova Irina Vadimovna (4th year, MIPT, FBMF)

2) Tereshchuk Vera Yuryevna (4th year, MIPT, FBMF)

3) Bolev Mikhail Alekseevich (4th year, MIPT, FBMF)

4) Makarov Vladislav Denisovich (4th year, MIPT, FBMF)

---

# Task description

**Goal:** Check pairs of immune repertoires for a response to a given pathogen. To do this, match the CDR3 sequences of the T-cell receptor against known sequences from the database. Keep in mind that the mere presence of a sequence in a repertoire does not guarantee a response.
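
As a minimal sketch of the matching step, the snippet below looks up a repertoire's CDR3 sequences in a small reference mapping. The function name `match_cdr3` is hypothetical, the reference entries are used only as toy data, and `CASSLGQAYEQYF` is a made-up sequence; exact string matching is just one possible matching rule.

```python
def match_cdr3(repertoire, known):
    """Return {cdr3: antigen} for repertoire sequences found in the reference.

    `repertoire` is an iterable of CDR3 amino acid strings;
    `known` maps CDR3 strings to an antigen label (toy data here).
    """
    return {seq: known[seq] for seq in set(repertoire) if seq in known}


# Toy reference: CDR3 -> antigen species
known = {"CASSQDRGPANEQFF": "EBV", "CADSWGKLQF": "InfluenzaA"}
# Toy repertoire; the second sequence is invented and absent from the reference
repertoire = ["CASSQDRGPANEQFF", "CASSLGQAYEQYF", "CASSQDRGPANEQFF"]

hits = match_cdr3(repertoire, known)
print(hits)
```

A hit found this way only says that the sequence is present, which is why the task also asks for a confidence estimate.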

## Input data
- **Database:** vdjdb.
- **Tables:** clonotype data.

## Expected results
Participants are required to:
1. Develop an approach that can:
   - Determine whether there is a response to the pathogen.
   - Estimate the degree of confidence that the detected response is not due to chance.
2. Demonstrate the method on the provided tables with unknown status.
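
One possible way to quantify "not due to chance" is a binomial tail probability. This is a sketch under stated assumptions, not the method used in this notebook: the helper name `binomial_tail`, the counts, and the background match rate `p0` are all hypothetical.

```python
from math import comb


def binomial_tail(k, n, p0):
    """P(X >= k) for X ~ Binomial(n, p0): chance of at least k matches by luck."""
    return sum(comb(n, i) * p0**i * (1 - p0)**(n - i) for i in range(k, n + 1))


# Hypothetical numbers: 1000 clonotypes, 12 of them match pathogen-associated
# CDR3s, and an assumed background match rate of 0.5% (expected count: 5).
p_value = binomial_tail(k=12, n=1000, p0=0.005)
print(f"p-value: {p_value:.4g}")
```

In practice the background rate would have to be estimated, for example from control repertoires, and very large `n` would call for a numerically safer implementation.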

**Evaluation criteria:**
- Correct determination of the table statuses.
- A detailed report on the methods.
- Sound interpretation of the results.

---

Import the required libraries:

In [1]:
# Core libraries
import numpy as np
import pandas as pd
from tqdm import tqdm

# PyTorch and the ESM model
import torch
from transformers import EsmTokenizer, AutoModel

# Models and metrics from scikit-learn
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# CatBoost
from catboost import CatBoostClassifier

### **Using embeddings and classification to analyze CDR3 sequences**

### **Step 1. Initializing the model and tokenizer**

We use ESM-2, a widely used protein language model pretrained on a large corpus of amino acid sequences with the masked language modeling (MLM) objective.

In [2]:
tokenizer = EsmTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D")
model = AutoModel.from_pretrained("facebook/esm2_t6_8M_UR50D")

Some weights of EsmModel were not initialized from the model checkpoint at facebook/esm2_t6_8M_UR50D and are newly initialized: ['esm.pooler.dense.bias', 'esm.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [3]:
model

EsmModel(
  (embeddings): EsmEmbeddings(
    (word_embeddings): Embedding(33, 320, padding_idx=1)
    (dropout): Dropout(p=0.0, inplace=False)
    (position_embeddings): Embedding(1026, 320, padding_idx=1)
  )
  (encoder): EsmEncoder(
    (layer): ModuleList(
      (0-5): 6 x EsmLayer(
        (attention): EsmAttention(
          (self): EsmSelfAttention(
            (query): Linear(in_features=320, out_features=320, bias=True)
            (key): Linear(in_features=320, out_features=320, bias=True)
            (value): Linear(in_features=320, out_features=320, bias=True)
            (dropout): Dropout(p=0.0, inplace=False)
            (rotary_embeddings): RotaryEmbedding()
          )
          (output): EsmSelfOutput(
            (dense): Linear(in_features=320, out_features=320, bias=True)
            (dropout): Dropout(p=0.0, inplace=False)
          )
          (LayerNorm): LayerNorm((320,), eps=1e-05, elementwise_affine=True)
        )
        (intermediate): EsmIntermediate(
    

---

**Comment:**

We use the pretrained ESM2 model from Facebook, which is designed for analyzing protein sequences. The tokenizer converts amino acid sequences into numeric token ids, and the model produces embeddings from them.
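
To illustrate what "converting sequences into numeric form" means, here is a toy tokenizer. It is only a sketch: the vocabulary below is an assumption made for illustration and does not reproduce the real ESM-2 vocabulary (which has 33 entries, as the embedding table in the model printout shows).

```python
# Toy vocabulary: special tokens first, then the 20 standard amino acids.
# The ordering and ids are illustrative, not those of the real tokenizer.
SPECIAL = ["<cls>", "<pad>", "<eos>"]
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
VOCAB = {tok: i for i, tok in enumerate(SPECIAL + list(AMINO_ACIDS))}


def encode(seq, max_len):
    """Map a sequence to ids: <cls> + residues + <eos>, right-padded with <pad>."""
    ids = [VOCAB["<cls>"]] + [VOCAB[aa] for aa in seq] + [VOCAB["<eos>"]]
    return ids + [VOCAB["<pad>"]] * (max_len - len(ids))


batch = ["CADSWGKLQF", "CASNTGTASKLTF"]
max_len = max(len(s) for s in batch) + 2  # +2 for <cls> and <eos>
encoded = [encode(s, max_len) for s in batch]
for ids in encoded:
    print(ids)
```

Both rows have the same length because the shorter sequence is padded, which matters later when token embeddings are averaged.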

---

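### **Step 2. Simplifying the model**

Remove the heads on top of the encoder (the contact-prediction head and the pooler): they serve downstream prediction tasks, and we only need the encoder's token embeddings.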

In [4]:
model.contact_head = None
model

In [5]:
model.pooler = None
model

Move the model to the GPU

In [6]:
model.cuda()

### **Step 3. Function for computing embeddings**

In [7]:
def get_embeddings(sequence):
    # Tokenize; padding=True pads every sequence in a batch to the longest one
    inputs = tokenizer(sequence, return_tensors="pt", add_special_tokens=True, padding=True)
    with torch.no_grad():
        inputs = inputs.to('cuda')  # move the input tensors to the GPU
        outputs = model(**inputs)  # forward pass through the model
        # Average over all token positions (special and padding tokens included)
        embeddings = outputs.last_hidden_state.mean(dim=1)
    return embeddings.squeeze().cpu().numpy()  # return the embeddings as a NumPy array
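
The mean in `get_embeddings` runs over every token position, so in a padded batch the padding tokens of shorter sequences also enter the average. A possible alternative is masked mean pooling, sketched below in NumPy on toy data. The helper name `masked_mean` is hypothetical; in the real pipeline the tokenizer's `attention_mask` would play the role of `mask`. This is not what the notebook does, and switching to it would change the reported metrics.

```python
import numpy as np


def masked_mean(hidden, mask):
    """Average token vectors over positions where mask == 1.

    hidden: array of shape (batch, tokens, dim)
    mask:   array of shape (batch, tokens), 1 for real tokens, 0 for padding
    """
    mask = mask[:, :, None].astype(hidden.dtype)   # (batch, tokens, 1)
    summed = (hidden * mask).sum(axis=1)           # padded positions add zero
    counts = np.clip(mask.sum(axis=1), 1.0, None)  # avoid division by zero
    return summed / counts


# Toy batch: 2 sequences, 3 token positions, 2-dimensional embeddings.
# The last position of the first sequence is padding with a large value.
hidden = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],
                   [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]])
mask = np.array([[1, 1, 0],
                 [1, 1, 1]])
pooled = masked_mean(hidden, mask)
print(pooled)
```

The padded position does not affect the first row of `pooled`, whereas a plain mean over `axis=1` would be pulled toward it.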

### **Step 4. Example of extracting embeddings**

In [8]:
# Example sequences
sequences = ["MKTFFVAGLFVMLAALSG", "VLGFLVLTLTGAAGQVLG"]
embeddings = np.array([get_embeddings(seq) for seq in sequences])
print("Embedding shape:", embeddings.shape)

Embedding shape: (2, 320)


### **Step 5. Loading data from the vdjdb database**

In [9]:
df = pd.read_csv("vdjdb.slim.txt", sep='\t')
df_homo = df[df['species'] == 'HomoSapiens'][['cdr3', 'antigen.species']]
df_homo.head()

| | cdr3 | antigen.species |
|---|---|---|
| 2 | CASSQDRGPANEQFF | EBV |
| 3 | CAGSVGSSNTGKLIF | InfluenzaA |
| 4 | CASNTGTASKLTF | InfluenzaA |
| 6 | CSASILGLAGYNEQFF | CMV |
| 7 | CADSWGKLQF | InfluenzaA |


### **Step 6. Extracting embeddings for all sequences**

In [10]:
labels = df_homo['antigen.species']
data = df_homo.drop('antigen.species', axis=1)

# Embed the sequences in batches of 1000 to keep GPU memory use bounded
cdr3_list = data['cdr3'].tolist()
batch_size = 1000
X_data = []
for start in tqdm(range(0, len(cdr3_list), batch_size)):
    X_data.append(get_embeddings(cdr3_list[start:start + batch_size]))

X_data = np.concatenate(X_data, axis=0)
X_data.shape

100%|██████████| 75/75 [00:06<00:00, 12.02it/s]


(74743, 320)
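
The batching arithmetic can be checked with a small generic helper. This is only a sketch; the name `batched` is hypothetical and the list of integers stands in for the CDR3 sequences.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# 74743 items in batches of 1000, mirroring the row count reported above
sizes = [len(chunk) for chunk in batched(list(range(74743)), 1000)]
print(len(sizes), sizes[-1])
```

The number of batches matches the 75 iterations in the progress bar, and the last batch is simply shorter than the others.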

### **Step 7. Preprocessing the data for training**

In [11]:
# Replace rare classes (fewer than 200 examples) with the "others" category
class_counts = labels.value_counts()
classes_to_replace = class_counts[class_counts < 200].index
labels_new = labels.where(~labels.isin(classes_to_replace), 'others')

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_data, labels_new, test_size=0.1, random_state=42)

In [56]:
le = LabelEncoder()

y_train = le.fit_transform(y_train)
y_test = le.transform(y_test)
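
The two preprocessing steps, collapsing rare classes and encoding labels as integers, can be sketched in plain Python on toy data. The helper names and the toy counts are assumptions for illustration; the notebook itself uses pandas and `LabelEncoder`.

```python
from collections import Counter


def collapse_rare(labels, min_count, other="others"):
    """Replace labels seen fewer than `min_count` times with `other`."""
    counts = Counter(labels)
    return [lab if counts[lab] >= min_count else other for lab in labels]


def encode_labels(labels):
    """Map labels to integers in sorted order of the class names."""
    classes = sorted(set(labels))
    index = {lab: i for i, lab in enumerate(classes)}
    return [index[lab] for lab in labels], classes


# Toy labels: two frequent classes and two rare ones
toy = ["CMV"] * 5 + ["EBV"] * 4 + ["YFV"] * 1 + ["DENV"] * 1
collapsed = collapse_rare(toy, min_count=2)
codes, classes = encode_labels(collapsed)
print(classes)
print(Counter(collapsed))
```

Keeping the `classes` list around makes it possible to map integer predictions back to pathogen names later.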

### **Step 8. Training the CatBoost model**

In [12]:
catboost = CatBoostClassifier(task_type="GPU", iterations=2000, loss_function="MultiClass")
catboost.fit(X_train, y_train)

Learning rate set to 0.08598
0:	learn: 2.4757704	total: 21.8ms	remaining: 43.6s
1:	learn: 2.3369015	total: 34.5ms	remaining: 34.5s
2:	learn: 2.2370708	total: 46.8ms	remaining: 31.2s
3:	learn: 2.1605648	total: 59.6ms	remaining: 29.7s
4:	learn: 2.0984277	total: 72.8ms	remaining: 29s
5:	learn: 2.0472571	total: 85.5ms	remaining: 28.4s
6:	learn: 2.0046049	total: 98.4ms	remaining: 28s
7:	learn: 1.9685749	total: 110ms	remaining: 27.5s
8:	learn: 1.9376453	total: 123ms	remaining: 27.3s
9:	learn: 1.9106436	total: 134ms	remaining: 26.7s
10:	learn: 1.8872641	total: 145ms	remaining: 26.2s
11:	learn: 1.8668915	total: 155ms	remaining: 25.7s
12:	learn: 1.8488919	total: 166ms	remaining: 25.3s
13:	learn: 1.8333832	total: 176ms	remaining: 25s
14:	learn: 1.8193556	total: 187ms	remaining: 24.8s
15:	learn: 1.8068515	total: 198ms	remaining: 24.6s
16:	learn: 1.7956143	total: 210ms	remaining: 24.5s
17:	learn: 1.7856744	total: 221ms	remaining: 24.3s
18:	learn: 1.7766972	total: 232ms	remaining: 24.1s
19:	learn: 

<catboost.core.CatBoostClassifier at 0x7ff4c8a21410>

### **Step 9. Evaluating model quality**

In [13]:
preds_proba = catboost.predict_proba(X_test)
preds = catboost.predict(X_test)

roc_auc = roc_auc_score(y_true=y_test, y_score=preds_proba, multi_class='ovr')
print(f"ROC-AUC: {roc_auc:.3f}")

ROC-AUC: 0.700
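
To make the metric concrete: with `multi_class='ovr'` and the default macro averaging, the score is the unweighted mean of per-class one-vs-rest AUCs. The sketch below computes this by hand on made-up probabilities; the function names and the toy data are illustrative, and the quadratic pairwise loop is only suitable for small examples.

```python
def binary_auc(y_true, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def ovr_macro_auc(y_true, proba):
    """Unweighted mean of per-class one-vs-rest AUCs; proba[i][c] is P(class c)."""
    n_classes = len(proba[0])
    aucs = []
    for c in range(n_classes):
        y_bin = [1 if y == c else 0 for y in y_true]
        aucs.append(binary_auc(y_bin, [row[c] for row in proba]))
    return sum(aucs) / n_classes


# Made-up labels and predicted probabilities for three classes
y_true = [0, 0, 1, 1, 2, 2]
proba = [
    [0.7, 0.2, 0.1],
    [0.4, 0.5, 0.1],
    [0.2, 0.6, 0.2],
    [0.5, 0.3, 0.2],
    [0.1, 0.2, 0.7],
    [0.3, 0.3, 0.4],
]
macro_auc = ovr_macro_auc(y_true, proba)
print(f"macro OvR AUC: {macro_auc:.3f}")
```

Because every class gets equal weight, rare classes influence this score as much as frequent ones.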


### **Step 10. Alternative models**

In [14]:
# Train Logistic Regression
logreg = LogisticRegression(multi_class="ovr", max_iter=2000, random_state=42)
logreg.fit(X_train, y_train)

# Train Random Forest
random_forest = RandomForestClassifier(n_estimators=100, random_state=42)
random_forest.fit(X_train, y_train)



In [15]:
# Predicted probabilities for Logistic Regression
preds_proba_logreg = logreg.predict_proba(X_test)
preds_logreg = logreg.predict(X_test)

# ROC-AUC for Logistic Regression
roc_auc_logreg = roc_auc_score(y_true=y_test, y_score=preds_proba_logreg, multi_class='ovr')
print(f"ROC-AUC (Logistic Regression): {roc_auc_logreg:.3f}")

# Predicted probabilities for Random Forest
preds_proba_rf = random_forest.predict_proba(X_test)
preds_rf = random_forest.predict(X_test)

# ROC-AUC for Random Forest
roc_auc_rf = roc_auc_score(y_true=y_test, y_score=preds_proba_rf, multi_class='ovr')
print(f"ROC-AUC (Random Forest): {roc_auc_rf:.3f}")

ROC-AUC (Logistic Regression): 0.660
ROC-AUC (Random Forest): 0.661


Both scores are lower than CatBoost's 0.700, so we continue with CatBoost.

In [28]:
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

### **Step 11. Hyperparameter tuning**

We use optuna to search for hyperparameters.

In [31]:
import optuna

def objective(trial):
    params = {
        "iterations": 2000,
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.1, log=True),
        "depth": trial.suggest_int("depth", 1, 10),
        "l2_leaf_reg": trial.suggest_float("l2_leaf_reg", 1, 500, log=True),
    }

    model = CatBoostClassifier(task_type="GPU", loss_function="MultiClass", silent=True, **params)
    model.fit(X_train, y_train)
    predictions = model.predict_proba(X_val)
    roc_auc = roc_auc_score(y_true=y_val, y_score=predictions, multi_class='ovr')
    return roc_auc

In [32]:
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=30)

[I 2024-11-24 11:08:33,565] A new study created in memory with name: no-name-a4b1a149-9ee7-4b31-b84c-f9ef65a5eee1
[I 2024-11-24 11:09:07,189] Trial 0 finished with value: 0.614885765840585 and parameters: {'learning_rate': 0.0066872164979316605, 'depth': 7, 'l2_leaf_reg': 59.34368163714414}. Best is trial 0 with value: 0.614885765840585.
[I 2024-11-24 11:09:23,129] Trial 1 finished with value: 0.6392894773712243 and parameters: {'learning_rate': 0.0017210911808615708, 'depth': 5, 'l2_leaf_reg': 3.43342872998643}. Best is trial 1 with value: 0.6392894773712243.
[I 2024-11-24 11:09:29,887] Trial 2 finished with value: 0.6095959448344327 and parameters: {'learning_rate': 0.004629291044751642, 'depth': 1, 'l2_leaf_reg': 5.387428987401444}. Best is trial 1 with value: 0.6392894773712243.
[I 2024-11-24 11:10:05,224] Trial 3 finished with value: 0.6830560167192489 and parameters: {'learning_rate': 0.012875867053624517, 'depth': 7, 'l2_leaf_reg': 1.3178689083980955}. Best is trial 3 with value

In [33]:
study.best_params

{'learning_rate': 0.03899128323378634,
 'depth': 10,
 'l2_leaf_reg': 2.4095622667503007}
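
The `log=True` flag in `suggest_float` makes the search sample the parameter on a logarithmic scale. The sketch below illustrates only that idea with plain random draws. It does not reproduce how optuna's sampler chooses trials, and the helper name is hypothetical.

```python
import math
import random


def sample_log_uniform(rng, low, high):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)
# Same range as the learning_rate search space above
samples = [sample_log_uniform(rng, 1e-3, 0.1) for _ in range(10_000)]

# The range spans two decades, so about half of the draws fall below 0.01
share_small = sum(s < 1e-2 for s in samples) / len(samples)
print(f"share of draws below 0.01: {share_small:.2f}")
```

With uniform sampling on the original scale, only about 9% of the draws would fall below 0.01, so small learning rates would rarely be tried.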

### **Step 12. Training the model with the best hyperparameters**

In [59]:
X_train, X_test, y_train, y_test = train_test_split(X_data, labels_new, test_size=0.1, random_state=42)

In [48]:
best_params = {'learning_rate': 0.03899128323378634,
 'depth': 10,
 'l2_leaf_reg': 2.4095622667503007}
catboost = CatBoostClassifier(task_type="GPU", iterations=2000, loss_function="MultiClass", **best_params)
catboost.fit(X_train, y_train)



0:	learn: 2.6128183	total: 128ms	remaining: 4m 15s
1:	learn: 2.5329803	total: 240ms	remaining: 3m 59s
2:	learn: 2.4640440	total: 338ms	remaining: 3m 44s
3:	learn: 2.4061099	total: 437ms	remaining: 3m 37s
4:	learn: 2.3542911	total: 533ms	remaining: 3m 32s
5:	learn: 2.3067687	total: 627ms	remaining: 3m 28s
6:	learn: 2.2659219	total: 727ms	remaining: 3m 26s
7:	learn: 2.2272552	total: 820ms	remaining: 3m 24s
8:	learn: 2.1930420	total: 919ms	remaining: 3m 23s
9:	learn: 2.1613116	total: 1.01s	remaining: 3m 21s
10:	learn: 2.1320083	total: 1.11s	remaining: 3m 20s
11:	learn: 2.1046776	total: 1.2s	remaining: 3m 18s
12:	learn: 2.0792718	total: 1.29s	remaining: 3m 17s
13:	learn: 2.0565715	total: 1.39s	remaining: 3m 17s
14:	learn: 2.0347897	total: 1.49s	remaining: 3m 16s
15:	learn: 2.0141493	total: 1.58s	remaining: 3m 16s
16:	learn: 1.9955119	total: 1.68s	remaining: 3m 16s
17:	learn: 1.9776254	total: 1.78s	remaining: 3m 15s
18:	learn: 1.9608574	total: 1.87s	remaining: 3m 15s
19:	learn: 1.9449936	to

<catboost.core.CatBoostClassifier at 0x7eff6e8a6550>

Metric

In [51]:
preds_proba = catboost.predict_proba(X_test)
preds = catboost.predict(X_test)
roc_auc_score(y_true=y_test, y_score=preds_proba, multi_class='ovr')

0.7144901011889384

### **Step 13. Predicting on the samples**

In [36]:
!unzip for_task.zip

Archive:  for_task.zip
   creating: for_task/
  inflating: for_task/Barracuda.tsv  
  inflating: for_task/Boss.tsv       
  inflating: for_task/Dance_Till_Dead.tsv  
  inflating: for_task/FEV_Reject.tsv  
  inflating: for_task/Johny.tsv      
  inflating: for_task/King_Charles.tsv  
  inflating: for_task/Lucky_Number_5.tsv  
  inflating: for_task/Lychelle.tsv   
  inflating: for_task/Matchstick_Man.tsv  
  inflating: for_task/Mengsk.tsv     
  inflating: for_task/Pavlina_Grey.tsv  
  inflating: for_task/Starry_Sky.tsv  
  inflating: for_task/The_Wide_Pirate.tsv  
  inflating: for_task/vdjdb.slim.txt  
  inflating: for_task/Wing_And_A_Prayer.tsv  


Read the dataframes

In [22]:
import os

def read_tsv_files_from_folder(folder_path):
    """Read every .tsv file in folder_path; return the dataframes and their file names."""
    dataframes = []
    file_names = []

    # Iterate over the files in a fixed (sorted) order
    for file_name in sorted(os.listdir(folder_path)):
        if file_name.endswith('.tsv'):  # only .tsv files are samples
            file_path = os.path.join(folder_path, file_name)
            try:
                df = pd.read_csv(file_path, sep='\t')
                dataframes.append(df)
                # Record the name only for files that were actually read,
                # so that file_names[i] always corresponds to dataframes[i]
                file_names.append(file_name)
            except Exception as e:
                print(f"Failed to read file {file_name}: {e}")

    return dataframes, file_names

In [23]:
folder_path = "for_task"
dataframes, file_names = read_tsv_files_from_folder(folder_path)

print(f"Number of dataframes: {len(dataframes)}")

Number of dataframes: 14


Keep only the column with the amino acid sequences

In [26]:
dataframes_new = []
for df in dataframes:
    list_columns = list(df.columns)
    X_columns=[]
    for column in list_columns:
        if 'cdr3aa' in column.lower() or 'cdr3.amino' in column.lower():
            X_columns.append(column)
    df_new = df[X_columns]
    dataframes_new.append(df_new)

Compute the embeddings for the model

In [32]:
X_all = []
for dataframe in tqdm(dataframes_new):
    # Each filtered dataframe is expected to hold a single CDR3 amino acid column
    sequences = dataframe.iloc[:, 0].tolist()
    # Use a separate name so the training matrix X_data is not overwritten
    X_sample = []
    for i in range(0, len(sequences), 20000):
        X_sample.append(get_embeddings(sequences[i:i+20000]))
    X_all.append(np.concatenate(X_sample, axis=0))

100%|██████████| 14/14 [12:34<00:00, 53.90s/it]


Get predictions for all samples

In [53]:
preds_all = []
for X in tqdm(X_all):
    preds = catboost.predict(X)
    preds_all.append(preds)

100%|██████████| 14/14 [01:00<00:00,  4.35s/it]


In [102]:
answers = []
for i, pred in enumerate(preds_all):
    answers.append(pred.flatten())

Build the final table

In [92]:
d = {}

for i, answer in enumerate(answers):

    unique_answer, num_unique = np.unique(answer, return_counts=True)
    
    d[file_names[i]] = {"answers": unique_answer,
                       "unique": num_unique}
    
    

In [97]:
data_for_df = []

for sample_name, values in d.items():
    # Separate names avoid overwriting the global `answers` list
    sample_answers = values["answers"]
    counts = values["unique"]

    # Order the predicted labels from most to least frequent
    sorted_indices = counts.argsort()[::-1]
    sorted_answers = sample_answers[sorted_indices]

    row = {"Sample": sample_name}
    row.update({f"Answer_{i+1}": answer for i, answer in enumerate(sorted_answers)})
    data_for_df.append(row)

df_final = pd.DataFrame(data_for_df)
df_final = df_final.fillna(value=pd.NA)

# Put the "Sample" column first
cols = ["Sample"] + [col for col in df_final.columns if col != "Sample"]
df_final = df_final[cols]

df_final

| | Sample | Answer_1 | Answer_2 | Answer_3 | Answer_4 | Answer_5 | Answer_6 | Answer_7 | Answer_8 | Answer_9 | Answer_10 | Answer_11 | Answer_12 | Answer_13 | Answer_14 | Answer_15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Barracuda.tsv | CMV | InfluenzaA | SARS-CoV-2 | HIV-1 | EBV | HomoSapiens | YFV | DENV | HCV | TriticumAestivum | others | Mtb | Influenza B | | |
| 1 | Boss.tsv | CMV | InfluenzaA | SARS-CoV-2 | EBV | HomoSapiens | HIV-1 | YFV | HCV | DENV | TriticumAestivum | Mtb | Influenza B | others | HTLV-1 | PlasmodiumFalciparum |
| 2 | Dance_Till_Dead.tsv | CMV | InfluenzaA | SARS-CoV-2 | HIV-1 | EBV | HomoSapiens | YFV | DENV | HCV | TriticumAestivum | Mtb | Influenza B | others | HTLV-1 | PlasmodiumFalciparum |
| 3 | FEV_Reject.tsv | CMV | InfluenzaA | SARS-CoV-2 | EBV | HIV-1 | HomoSapiens | YFV | HCV | DENV | TriticumAestivum | Mtb | Influenza B | others | HTLV-1 | PlasmodiumFalciparum |
| 4 | Johny.tsv | CMV | InfluenzaA | SARS-CoV-2 | EBV | HomoSapiens | HIV-1 | HCV | YFV | DENV | TriticumAestivum | Mtb | Influenza B | others | HTLV-1 | PlasmodiumFalciparum |
| 5 | King_Charles.tsv | CMV | InfluenzaA | SARS-CoV-2 | EBV | HomoSapiens | HIV-1 | YFV | HCV | DENV | TriticumAestivum | Mtb | Influenza B | others | HTLV-1 | PlasmodiumFalciparum |
| 6 | Lucky_Number_5.tsv | CMV | InfluenzaA | SARS-CoV-2 | EBV | HomoSapiens | HIV-1 | YFV | HCV | DENV | TriticumAestivum | Mtb | Influenza B | others | HTLV-1 | PlasmodiumFalciparum |
| 7 | Lychelle.tsv | CMV | InfluenzaA | SARS-CoV-2 | EBV | HIV-1 | HomoSapiens | YFV | HCV | DENV | TriticumAestivum | Influenza B | Mtb | HTLV-1 | PlasmodiumFalciparum | others |
| 8 | Matchstick_Man.tsv | CMV | InfluenzaA | SARS-CoV-2 | EBV | HIV-1 | HomoSapiens | YFV | HCV | DENV | TriticumAestivum | Mtb | Influenza B | others | HTLV-1 | PlasmodiumFalciparum |
| 9 | Mengsk.tsv | CMV | InfluenzaA | SARS-CoV-2 | EBV | HomoSapiens | HIV-1 | YFV | HCV | DENV | Mtb | HTLV-1 | TriticumAestivum | others | Influenza B | |


**Explanation:** The table lists every pathogen predicted in each sample, ordered by how often it is predicted: Answer_1 is the most frequent label in that sample. Almost every sample has T-cell receptor clonotypes predicted as specific to CMV, InfluenzaA, and SARS-CoV-2, so these labels do not distinguish the samples; we filter them out by dropping the three most frequent answers in each sample.
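
The ranking and the proposed filter can be sketched with `collections.Counter`. The function name `rank_labels` and the prediction counts are hypothetical; the notebook itself works with `np.unique` and `argsort`.

```python
from collections import Counter


def rank_labels(predictions, drop_top=0):
    """Return labels ordered from most to least frequent, skipping the first `drop_top`."""
    ranked = [label for label, _ in Counter(predictions).most_common()]
    return ranked[drop_top:]


# Hypothetical per-clonotype predictions for one sample
preds = (["CMV"] * 50 + ["InfluenzaA"] * 30 + ["SARS-CoV-2"] * 20
         + ["EBV"] * 7 + ["HIV-1"] * 3)
print(rank_labels(preds))
print(rank_labels(preds, drop_top=3))
```

Note that this drops whichever three labels are most frequent in a given sample, which need not be the same three labels in every sample.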

In [98]:
df_final.to_csv('answers.csv')

Apply the filter

In [124]:
answers = []
for i, pred in enumerate(preds_all):
    answers.append(pred.flatten())

In [120]:
d = {}

for i, answer in enumerate(answers):

    unique_answer, num_unique = np.unique(answer, return_counts=True)
    # argsort is ascending, so dropping the last three indices
    # removes the three most frequent labels
    indices_to_keep = np.argsort(num_unique)[:-3]

    d[file_names[i]] = {"answers": unique_answer[indices_to_keep],
                       "unique": num_unique[indices_to_keep]}

In [122]:
data_for_df = []

for sample_name, values in d.items():
    # Separate names avoid overwriting the global `answers` list
    sample_answers = values["answers"]
    counts = values["unique"]

    # Order the remaining labels from most to least frequent
    sorted_indices = counts.argsort()[::-1]
    sorted_answers = sample_answers[sorted_indices]

    row = {"Sample": sample_name}
    row.update({f"Answer_{i+1}": answer for i, answer in enumerate(sorted_answers)})
    data_for_df.append(row)

df_final = pd.DataFrame(data_for_df)
df_final = df_final.fillna(value=pd.NA)

# Put the "Sample" column first
cols = ["Sample"] + [col for col in df_final.columns if col != "Sample"]
df_final = df_final[cols]

df_final

| | Sample | Answer_1 | Answer_2 | Answer_3 | Answer_4 | Answer_5 | Answer_6 | Answer_7 | Answer_8 | Answer_9 | Answer_10 | Answer_11 | Answer_12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Barracuda.tsv | HIV-1 | EBV | HomoSapiens | YFV | DENV | HCV | TriticumAestivum | others | Mtb | Influenza B | | |
| 1 | Boss.tsv | EBV | HomoSapiens | HIV-1 | YFV | HCV | DENV | TriticumAestivum | Mtb | Influenza B | others | HTLV-1 | PlasmodiumFalciparum |
| 2 | Dance_Till_Dead.tsv | HIV-1 | EBV | HomoSapiens | YFV | DENV | HCV | TriticumAestivum | Mtb | Influenza B | others | HTLV-1 | PlasmodiumFalciparum |
| 3 | FEV_Reject.tsv | EBV | HIV-1 | HomoSapiens | YFV | HCV | DENV | TriticumAestivum | Mtb | Influenza B | others | HTLV-1 | PlasmodiumFalciparum |
| 4 | Johny.tsv | EBV | HomoSapiens | HIV-1 | HCV | YFV | DENV | TriticumAestivum | Mtb | Influenza B | others | HTLV-1 | PlasmodiumFalciparum |
| 5 | King_Charles.tsv | EBV | HomoSapiens | HIV-1 | YFV | HCV | DENV | TriticumAestivum | Mtb | Influenza B | others | HTLV-1 | PlasmodiumFalciparum |
| 6 | Lucky_Number_5.tsv | EBV | HomoSapiens | HIV-1 | YFV | HCV | DENV | TriticumAestivum | Mtb | Influenza B | others | HTLV-1 | PlasmodiumFalciparum |
| 7 | Lychelle.tsv | EBV | HIV-1 | HomoSapiens | YFV | HCV | DENV | TriticumAestivum | Influenza B | Mtb | HTLV-1 | PlasmodiumFalciparum | others |
| 8 | Matchstick_Man.tsv | EBV | HIV-1 | HomoSapiens | YFV | HCV | DENV | TriticumAestivum | Mtb | Influenza B | others | HTLV-1 | PlasmodiumFalciparum |
| 9 | Mengsk.tsv | EBV | HomoSapiens | HIV-1 | YFV | HCV | DENV | Mtb | HTLV-1 | TriticumAestivum | others | Influenza B | |


The table above is the filtered result.

In [123]:
df_final.to_csv('filtered_answers.csv')