# `CellLine`: clo; 2022-03-21

The OWL files for CLO lack metadata such as definitions and synonyms, so we parse these fields manually from the CSV export.

Download `clo.csv.gz` from: https://data.bioontology.org/ontologies/CLO/download?apikey=8b5b7825-538d-40e0-9e9e-5ab9274a9aeb&download_format=csv
Ontology page on BioPortal: https://bioportal.bioontology.org/ontologies/CLO

In [1]:
import pandas as pd


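def df_from_csv(csv_filepath, prefix):
    """Parse a BioPortal CSV export into a table indexed by ontology ID."""
    # low_memory=False reads the file in one pass so dtypes are inferred consistently
    df = pd.read_csv(csv_filepath, low_memory=False)
    # df = df[~df["Obsolete"]]
    # turn IRIs such as http://purl.obolibrary.org/obo/CLO_0000001 into CLO:0000001
    # (regex=False: both patterns are literal strings, not regular expressions)
    df["ontology_id"] = (
        df["Class ID"]
        .str.replace("http://purl.obolibrary.org/obo/", "", regex=False)
        .str.replace("_", ":", regex=False)
    )
    # keep only the terms that belong to the requested ontology
    df = df[df["ontology_id"].str.startswith(prefix)]
    # drop the existing "definition" column so that "Definitions" can take its name
    df = df.drop(columns=["definition"])
    df = df.rename(
        columns={
            "Preferred Label": "name",
            "Synonyms": "synonyms",
            "Definitions": "definition",
            "Parents": "parents",
        }
    )
    # parents are "|"-separated IRIs; keep only those from the same ontology
    parents = []
    for p in df["parents"]:
        try:
            plist = [
                i
                for i in p.replace("http://purl.obolibrary.org/obo/", "")
                .replace("_", ":")
                .split("|")
                if i.startswith(prefix)
            ]
            parents.append(plist)
        except AttributeError:
            # a missing value is read as NaN (a float), which has no .replace
            parents.append([])
    df["parents"] = parents
    df = df[["ontology_id", "name", "definition", "synonyms", "parents"]]
    df = df.sort_values("ontology_id")

    # drop duplicated names, keep the last record
    df = df.drop_duplicates("name", keep="last")

    return df.set_index("ontology_id")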
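
To make the two string transformations in `df_from_csv` easier to follow, here is a minimal sketch on a two-row toy table. The rows, the `BFO_0000015` parent, and the helper names `to_curie` and `parse_parents` are assumptions made for illustration; they are not part of the CLO export or of the function above.

```python
import pandas as pd

OBO = "http://purl.obolibrary.org/obo/"

# Hypothetical rows that mimic the "Class ID" and "Parents" columns of the CSV
toy = pd.DataFrame(
    {
        "Class ID": [OBO + "CLO_0000002", OBO + "CLO_0000000"],
        "Parents": [OBO + "CLO_0000000|" + OBO + "BFO_0000015", None],
    }
)


def to_curie(iri):
    """Strip the OBO base IRI and replace the underscore with a colon."""
    return iri.replace(OBO, "").replace("_", ":")


def parse_parents(value, prefix):
    """Split a '|'-separated parent string, keeping IDs that start with prefix."""
    if not isinstance(value, str):  # missing parents are not strings
        return []
    return [c for c in map(to_curie, value.split("|")) if c.startswith(prefix)]


toy["ontology_id"] = toy["Class ID"].map(to_curie)
toy["parents"] = [parse_parents(p, "CLO") for p in toy["Parents"]]
print(toy[["ontology_id", "parents"]])
```

The parent from another ontology is filtered out, and the row without parents ends up with an empty list, which matches the `[]` entries in the table further down.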

In [2]:
df = df_from_csv("clo.csv.gz", "CLO")


In [3]:
df

ontology_id,name,definition,synonyms,parents
CLO:0000000,cell line cell culturing,a maintaining cell culture process that keeps ...,,[]
CLO:0000001,cell line cell,A cultured cell that is part of a cell line - ...,,[]
CLO:0000002,suspension cell line culturing,suspension cell line culturing is a cell line ...,,[CLO:0000000]
CLO:0000003,adherent cell line culturing,adherent cell line culturing is a cell line cu...,,[CLO:0000000]
CLO:0000004,cell line cell modification,a material processing that modifies an existin...,,[]
...,...,...,...,...
CLO:0051617,RCB0187 cell,A immortal medaka cell line cell that has the ...,RCB0187|OLHE-131,[CLO:0009822]
CLO:0051618,RCB2945 cell,A immortal medaka cell line cell that has the ...,RCB2945|DIT29,[CLO:0009822]
CLO:0051619,RCB0184 cell,A immortal medaka cell line cell that has the ...,OLF-136|RCB0184,[CLO:0009822]
CLO:0051620,RCB0188 cell,A immortal medaka cell line cell that has the ...,RCB0188|OLME-104,[CLO:0009822]


In [4]:
df.loc["CLO:0007050"]

name                                          K 562 cell
definition            disease: leukemia, chronic myeloid
synonyms      K-562|KO|GM05372E|K.562|K562|GM05372|K 562
parents                                    [CLO:0000511]
Name: CLO:0007050, dtype: object
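
Because synonyms are stored as a single `|`-separated string, looking up a term by one of its synonyms needs a split step first. The following is a rough sketch of one way to do that, using three records copied from the outputs in this notebook; the names `toy`, `exploded`, and `synonym_to_id` are hypothetical.

```python
import pandas as pd

# Three records copied from the outputs shown in this notebook
toy = pd.DataFrame(
    {
        "name": ["cell line cell", "K 562 cell", "RCB0187 cell"],
        "synonyms": [
            None,
            "K-562|KO|GM05372E|K.562|K562|GM05372|K 562",
            "RCB0187|OLHE-131",
        ],
    },
    index=pd.Index(["CLO:0000001", "CLO:0007050", "CLO:0051617"], name="ontology_id"),
)

# Drop terms without synonyms, split on "|", and give each synonym its own row
exploded = toy["synonyms"].dropna().str.split("|").explode()

# Map every synonym to the ontology ID it came from
synonym_to_id = dict(zip(exploded, exploded.index))

print(synonym_to_id["K562"])  # CLO:0007050
```

A plain dictionary keeps only one ID per synonym, so if two terms share a synonym the later one wins; a grouped structure would be needed to keep both.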

In [5]:
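# add RPE1, RPE-1 and RPE as synonyms because they are used quite often
# (a single .loc call with row and column labels writes into df itself,
# whereas chained indexing may modify a temporary copy)

df.loc["CLO:0004290", "synonyms"] += "|RPE1|RPE-1|RPE"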

In [6]:
df.loc["CLO:0004290", "synonyms"]

'hTERT RPE-1|RPE1|RPE-1|RPE'
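
Updating a single cell is most reliable when the row and column labels go into one `.loc` call; chained indexing may operate on a temporary copy, depending on the pandas version. Below is a minimal sketch on a one-row toy frame. The starting value `hTERT RPE-1` is inferred from the output above, and the name `toy` is hypothetical.

```python
import pandas as pd

# One-row stand-in for the CLO table, indexed by ontology ID
toy = pd.DataFrame(
    {"synonyms": ["hTERT RPE-1"]},
    index=pd.Index(["CLO:0004290"], name="ontology_id"),
)

# Row and column labels in the same .loc call: the update lands in `toy`
toy.loc["CLO:0004290", "synonyms"] += "|RPE1|RPE-1|RPE"

print(toy.loc["CLO:0004290", "synonyms"])
```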

In [7]:
df.to_parquet("df_all__clo__2022-03-21__CellLine.parquet")