This notebook prepares smaller datasets so that some experiments can be run on desktop machines. Available smaller datasets:

- [labels_ten_frequent.csv](https://ligands.blob.core.windows.net/ligands/labels_ten_frequent.csv)
- [labels_ten_percent.csv](https://ligands.blob.core.windows.net/ligands/labels_ten_percent.csv)
- [labels_hundred.csv](https://ligands.blob.core.windows.net/ligands/labels_hundred.csv)
- [labels_two.csv](https://ligands.blob.core.windows.net/ligands/labels_two.csv)
- [labels_three.csv](https://ligands.blob.core.windows.net/ligands/labels_three.csv)

To use one of the datasets synthesized by this notebook, pass the path to the created CSV file as the second argument to the [`LigandDataset`](https://github.com/jkarolczak/ligands-classification/blob/b0d2daf2f4fef1b83233d130336ffea38cb6a74d/src/simple_reader.py#L9) constructor.


In [1]:
import numpy as np
import pandas as pd

from sklearn.utils import shuffle

np.random.seed(42)


In [None]:
df_labels = shuffle(
    pd.read_csv("../data/stats_by_label.csv", usecols=["label", "count"])
)
df_blobs = shuffle(pd.read_csv("../data/cmb_blob_labels.csv"))


# Ten most frequent classes

This dataset is composed of all blobs belonging to the ten most frequent classes.

The synthesized dataset consists of:

- 10 classes
- 558043 blobs


In [None]:
labels = set(
    df_labels.sort_values(by="count", ascending=False).iloc[:10]["label"].to_list()
)
df_blobs[df_blobs["ligand"].isin(labels)].to_csv("labels_ten_frequent.csv", index=False)


# 10% of the original dataset (maintained proportions between classes)

This dataset is composed of an (arbitrary) 10% of the blobs from each class that contains more than 1000 instances. The threshold avoids creating extremely small classes in the synthesized (smaller) dataset. The synthesized dataset maintains the proportions between classes.

The synthesized dataset consists of:

- 44 classes
- 64714 blobs
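
The proportional subsampling described above can be illustrated on a toy table. This is only a sketch: `toy_blobs`, `min_count`, and `fraction` are hypothetical names, and the threshold of 10 stands in for the 1000 used on the real data.

```python
import pandas as pd

# Toy stand-in for the blob table (hypothetical data, a single column only).
toy_blobs = pd.DataFrame(
    {"ligand": ["A"] * 60 + ["B"] * 30 + ["C"] * 5}
).sample(frac=1, random_state=42)  # shuffle the rows

counts = toy_blobs["ligand"].value_counts()
min_count = 10   # hypothetical threshold; the notebook uses 1000
fraction = 0.1   # keep 10% of each sufficiently large class

# Rows to keep per class, restricted to classes above the threshold.
keep = (counts[counts > min_count] * fraction).astype(int)

# Take the first n (already shuffled) rows of every retained class.
subset = pd.concat(
    [toy_blobs[toy_blobs["ligand"] == ligand].head(n) for ligand, n in keep.items()]
)

sampled_counts = subset["ligand"].value_counts().to_dict()
print(sampled_counts)
```

Class `C` falls below the threshold and is dropped, while `A` and `B` keep their 2:1 ratio in the subset.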


In [None]:
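# Classes with more than 1000 blobs, mapped to 10% of their blob count.
# Columns are accessed by name so the result does not depend on column order in the CSV.
frequent = df_labels[df_labels["count"] > 1000]
labels = {
    label: int(count * 0.1)
    for label, count in zip(frequent["label"], frequent["count"])
}

# df_blobs is already shuffled, so head(number) yields an arbitrary sample per class.
result = pd.concat(
    [df_blobs[df_blobs["ligand"] == ligand].head(number) for ligand, number in labels.items()]
)

result.to_csv("labels_ten_percent.csv", index=False)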


# 100 instances of each class

This dataset is composed of a hundred instances of each class; 100 is the number of instances in the least frequent class. The synthesized dataset is perfectly balanced.

The synthesized dataset consists of:

- 219 classes
- 21900 blobs
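
As a minimal sketch of why capping every class at the size of the rarest class yields a balanced dataset, consider a toy table (`toy_blobs` and `per_class` are hypothetical names; the real data uses 100 instances per class):

```python
import pandas as pd

# Toy stand-in for the blob table (hypothetical data, a single column only).
toy_blobs = pd.DataFrame(
    {"ligand": ["A"] * 7 + ["B"] * 4 + ["C"] * 3}
).sample(frac=1, random_state=0)  # shuffle the rows

# Size of the least frequent class (3 in this toy example).
per_class = int(toy_blobs["ligand"].value_counts().min())

# groupby(...).head(n) keeps at most n rows from every group.
balanced = toy_blobs.groupby("ligand").head(per_class)

class_sizes = balanced["ligand"].value_counts().to_dict()
print(class_sizes)
```

Because no class has fewer than `per_class` rows, every class ends up with exactly `per_class` rows.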


In [None]:
df_blobs.groupby("ligand").head(100).to_csv("labels_hundred.csv", index=False)


# Two classes

This dataset is very small. It is composed of two classes with 100 instances each.

The synthesized dataset consists of:

- 2 classes
- 200 blobs


In [None]:
df = pd.read_csv("../data/labels_hundred.csv")
df[df.ligand.isin({"CA-like", "N-like"})].to_csv("labels_two.csv", index=False)


# Three classes

This dataset is very small. It is composed of three classes with 100 instances each.

The synthesized dataset consists of:

- 3 classes
- 300 blobs


In [2]:
df = pd.read_csv("../data/labels_hundred.csv")


In [None]:
df[df.ligand.isin({"CA-like", "PEG-like", "SAH-like"})].to_csv(
    "labels_three.csv", index=False
)
