# About this Notebook
This notebook gives you a brief introduction to the dataset and short guides on converting it to common formats (e.g. NumPy arrays, PyTorch tensors). We recommend using `pandas` to load the dataset from its original Parquet format.

Feel free to convert the dataset to any format you feel comfortable using. Code to convert it to CSV is included below.

## About the Dataset
The dataset `de_train_split.parquet` is the primary training dataset for this competition. We've included it in this example to show you what it looks like. You can find the other datasets for this challenge on the NeurIPS 2023 Kaggle site: https://www.kaggle.com/competitions/open-problems-single-cell-perturbations/

Our primary testing dataset is `de_test_split.parquet`. 

The train/test split is approximately 80/20: the train dataset contains 501 samples and the test dataset contains 113 samples.

However, **we ask that you only use the `de_train_split.parquet` file provided in this repository for training**, since our train/test split differs from the one used in the official competition.
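
As a quick check of the approximate 80/20 ratio quoted above, the arithmetic from the two sample counts can be sketched as follows. The variable names are illustrative only.

```python
# Sample counts quoted in the text above.
n_train = 501
n_test = 113

total = n_train + n_test
train_fraction = n_train / total
test_fraction = n_test / total

print(f"total samples: {total}")
print(f"train: {train_fraction:.1%}, test: {test_fraction:.1%}")
# total samples: 614
# train: 81.6%, test: 18.4%
```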

In [None]:
# pandas needs a Parquet engine (e.g. pyarrow) to read .parquet files
!pip install pandas pyarrow

In [6]:
import os

import pandas as pd

# Read the file:
dataset_path = os.path.join("dataset", "de_train_split.parquet")
dataset_df = pd.read_parquet(dataset_path)

# If you'd like to convert the data to a readable format (e.g. CSV):
save_path = os.path.join("dataset", "de_train_split.csv")
# dataset_df.to_csv(save_path)

# Display a preview of the DataFrame:
dataset_df


Unnamed: 0,cell_type,sm_name,sm_lincs_id,SMILES,control,A1BG,A1BG-AS1,A2M,A2M-AS1,A2MP1,...,ZUP1,ZW10,ZWILCH,ZWINT,ZXDA,ZXDB,ZXDC,ZYG11B,ZYX,ZZEF1
0,NK cells,Clotrimazole,LSM-5341,Clc1ccccc1C(c1ccccc1)(c1ccccc1)n1ccnc1,False,0.104720,-0.077524,-1.625596,-0.144545,0.143555,...,-0.227781,-0.010752,-0.023881,0.674536,-0.453068,0.005164,-0.094959,0.034127,0.221377,0.368755
1,T cells CD4+,Clotrimazole,LSM-5341,Clc1ccccc1C(c1ccccc1)(c1ccccc1)n1ccnc1,False,0.915953,-0.884380,0.371834,-0.081677,-0.498266,...,-0.494985,-0.303419,0.304955,-0.333905,-0.315516,-0.369626,-0.095079,0.704780,1.096702,-0.869887
2,T cells CD8+,Clotrimazole,LSM-5341,Clc1ccccc1C(c1ccccc1)(c1ccccc1)n1ccnc1,False,-0.387721,-0.305378,0.567777,0.303895,-0.022653,...,-0.119422,-0.033608,-0.153123,0.183597,-0.555678,-1.494789,-0.213550,0.415768,0.078439,-0.259365
3,T regulatory cells,Clotrimazole,LSM-5341,Clc1ccccc1C(c1ccccc1)(c1ccccc1)n1ccnc1,False,0.232893,0.129029,0.336897,0.486946,0.767661,...,0.451679,0.704643,0.015468,-0.103868,0.865027,0.189114,0.224700,-0.048233,0.216139,-0.085024
4,NK cells,Mometasone Furoate,LSM-3349,C[C@@H]1C[C@H]2[C@@H]3CCC4=CC(=O)C=C[C@]4(C)[C...,False,4.290652,-0.063864,-0.017443,-0.541154,0.570982,...,0.758474,0.510762,0.607401,-0.123059,0.214366,0.487838,-0.819775,0.112365,-0.122193,0.676629
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
496,T regulatory cells,GLPG0634,LSM-45924,O=C(Nc1nc2cccc(-c3ccc(CN4CCS(=O)(=O)CC4)cc3)n2...,False,1.036718,0.286910,1.406953,2.464325,2.292620,...,-0.122192,0.646534,0.085965,0.390109,1.143933,1.000768,1.531486,2.296799,0.593876,0.841613
497,NK cells,Sgc-cbp30,LSM-47437,COc1ccc(CCc2nc3cc(-c4c(C)noc4C)ccc3n2C[C@H](C)...,False,-0.040492,1.767328,0.398127,0.004789,-0.872308,...,-1.570270,-0.369802,-1.477064,-0.173299,0.122084,0.490689,-0.113555,0.142869,-0.232731,-0.193126
498,T cells CD4+,Sgc-cbp30,LSM-47437,COc1ccc(CCc2nc3cc(-c4c(C)noc4C)ccc3n2C[C@H](C)...,False,-1.163977,0.730631,-0.214973,-0.054305,-0.734482,...,-0.312169,0.800282,0.063365,0.593356,-1.265234,-0.171999,-0.145227,1.421112,-1.337878,1.349846
499,T cells CD8+,Sgc-cbp30,LSM-47437,COc1ccc(CCc2nc3cc(-c4c(C)noc4C)ccc3n2C[C@H](C)...,False,0.699837,-0.750765,0.014432,-1.243092,0.094122,...,-0.125901,-0.069657,0.098344,0.007872,0.234029,-0.077685,-0.056956,-0.068496,-0.283606,-0.045120
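
If you do export to CSV, keep in mind that `DataFrame.to_csv` writes the row index as an extra, unnamed first column by default, and `pd.read_csv` then loads it as a column called `Unnamed: 0`. The sketch below illustrates this with a tiny made-up frame (`toy_df`, built from two rows and two gene columns of the preview above) and in-memory buffers instead of real files; these names are assumptions for illustration, not part of the dataset code.

```python
import io

import pandas as pd

# Toy stand-in for dataset_df: two rows and two gene columns from the preview.
toy_df = pd.DataFrame(
    {
        "cell_type": ["NK cells", "T cells CD4+"],
        "sm_name": ["Clotrimazole", "Clotrimazole"],
        "A1BG": [0.104720, 0.915953],
        "ZZEF1": [0.368755, -0.869887],
    }
)

# Default behaviour: the index is written as an extra, unnamed first column.
with_index = io.StringIO()
toy_df.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False skips the index, so the column names round-trip unchanged.
without_index = io.StringIO()
toy_df.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_clean = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))
print(list(reloaded_clean.columns))
```

With the real data, the equivalent call would be `dataset_df.to_csv(save_path, index=False)`. This matters if you later select target columns by position, since an extra leading column shifts every index by one.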


## Input(s) to the Model
As you can see from the dataset above, there are two main input values:
1. `cell_type`
2. `sm_name` (small molecule(s) name)

The `sm_lincs_id` column gives a standardized identifier (LINCS ID) for the small molecule.

The `SMILES` (Simplified Molecular Input Line Entry System) column encodes the structure of the small molecule as an ASCII string.

**All four of the above descriptors count as part of the input.**

## Target(s) for the Model
All 18211 columns from `A1BG` to `ZZEF1` are target columns. Each column holds one floating-point value per row, so for a given set of inputs your model should predict a vector of 18211 floating-point numbers.
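
One way to separate inputs from targets is to slice by column label rather than by position. The sketch below uses a small made-up frame (`toy_df`) with the same metadata columns and only three gene columns, with values copied from the preview above. On the real `dataset_df`, the same slice should return the 18211 target columns, assuming the column order matches the preview.

```python
import pandas as pd

# Toy stand-in for dataset_df: same metadata columns, three gene columns.
toy_df = pd.DataFrame(
    {
        "cell_type": ["NK cells", "T cells CD4+"],
        "sm_name": ["Clotrimazole", "Clotrimazole"],
        "sm_lincs_id": ["LSM-5341", "LSM-5341"],
        "SMILES": ["Clc1ccccc1C(c1ccccc1)(c1ccccc1)n1ccnc1"] * 2,
        "control": [False, False],
        "A1BG": [0.104720, 0.915953],
        "A1BG-AS1": [-0.077524, -0.884380],
        "ZZEF1": [0.368755, -0.869887],
    }
)

# Label-based column slicing with .loc includes both endpoints.
targets = toy_df.loc[:, "A1BG":"ZZEF1"]

# The four input descriptors listed above.
inputs = toy_df[["cell_type", "sm_name", "sm_lincs_id", "SMILES"]]

print(targets.shape)
print(list(targets.columns))
# (2, 3)
# ['A1BG', 'A1BG-AS1', 'ZZEF1']
```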

# Getting Started
We've collated a document filled with information, links to the official NeurIPS 2023 winner presentations, and some initial ideas to get you started! You can find it here: https://docs.google.com/document/d/1i9fo4z8QdXA9L17yZ34uPA2vfjbW8P8T5cG5gRZQuAc/edit?usp=sharing. 

Feel free to introduce yourself on the Discord and find a group/team; hackathons are more fun with others. Otherwise, feel free to hack solo.

We love creativity, so if you have a weird, strange, goofy idea you think might work, build it!

In [None]:
!pip install torch
!pip install numpy

## Converting to NumPy Arrays and PyTorch Tensors
The code below converts the target values (a sequence of length 18211 per row) to common machine learning formats. Note that it doesn't convert the input columns, since these are provided as strings.

You'll need to figure out the best way to convert inputs for your model. Some popular options are to use one-hot encodings, or to use embeddings (look for existing chemistry/biology embedding models). 
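
As one possible starting point, here is a minimal sketch of the one-hot option using `pd.get_dummies`. It runs on a small made-up frame (`toy_df`) with values taken from the preview above; it is an illustration rather than a recommended encoding. Note that categories which only appear in the test split would need to be handled separately.

```python
import pandas as pd

# Toy stand-in for the two main input columns of dataset_df.
toy_df = pd.DataFrame(
    {
        "cell_type": ["NK cells", "T cells CD4+", "NK cells"],
        "sm_name": ["Clotrimazole", "Clotrimazole", "Mometasone Furoate"],
    }
)

# One indicator column per distinct value in each input column.
one_hot = pd.get_dummies(toy_df[["cell_type", "sm_name"]], dtype="float32")
features = one_hot.to_numpy()

print(list(one_hot.columns))
print(features.shape)
# (3, 4): two cell types plus two small molecules
```

Each row has exactly two ones: one for its cell type and one for its small molecule.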

In [8]:
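import numpy as np
import torch

# Convert targets to NumPy. The first five columns (cell_type, sm_name,
# sm_lincs_id, SMILES, control) are not targets, so we skip them:
np_data = dataset_df.iloc[:, 5:].to_numpy()
print(np_data.shape)
print(np_data[0])

# Convert targets to PyTorch. torch.from_numpy shares memory with np_data
# and keeps its dtype (float64 here):
torch_data = torch.from_numpy(np_data)
print(torch_data.shape)
print(torch_data[0])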

(501, 18211)
[ 0.10472047 -0.07752421 -1.62559604 ...  0.03412678  0.22137655
  0.36875538]
torch.Size([501, 18211])
tensor([ 0.1047, -0.0775, -1.6256,  ...,  0.0341,  0.2214,  0.3688],
       dtype=torch.float64)
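
The tensor above is `float64` because `torch.from_numpy` keeps the dtype of the source array. PyTorch layers use `float32` parameters by default, so you may want to cast the targets before converting them. The sketch below shows the cast with NumPy only; `toy_targets` is a made-up stand-in for `np_data`, holding the first three target values of the first two rows.

```python
import numpy as np

# Toy stand-in for np_data: first three target values of the first two rows.
toy_targets = np.array(
    [
        [0.104720, -0.077524, -1.625596],
        [0.915953, -0.884380, 0.371834],
    ]
)
print(toy_targets.dtype)

# Cast to float32; this makes a copy that uses half the memory.
toy_targets_f32 = toy_targets.astype(np.float32)
print(toy_targets_f32.dtype)
print(toy_targets_f32.nbytes / toy_targets.nbytes)
# float64
# float32
# 0.5
```

With the real data, this would be `torch.from_numpy(np_data.astype(np.float32))`, which yields a `torch.float32` tensor.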
