**Relevant Information:**
- [kaggle dataset link](https://www.kaggle.com/datasets/samdemharter/brca-multiomics-tcga?resource=download)
- [data cleaning methods](https://rbabaei82.github.io/MultiOmics_TCGA-BRCA/Analysis)

The dataset contains 705 breast cancer samples and four omics data types (1936 features in total; n below is the number of features per type).
- cn: copy number variations (n=860)
- mu: mutations (n=249)
- rs: gene expression (n=604)
- pp: protein levels (n=223)
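
The per-type feature counts listed above can be checked from the column prefixes. The following is a minimal sketch, assuming every feature column starts with one of the four prefixes followed by an underscore (as in `rs_CLEC3A` or `pp_p70S6K`). The helper `count_features_by_prefix` and the toy frame are illustrative stand-ins, not part of the dataset.

```python
import pandas as pd

OMICS_PREFIXES = ("cn", "mu", "rs", "pp")

def count_features_by_prefix(df, prefixes=OMICS_PREFIXES):
    """Count columns per omics prefix (hypothetical helper)."""
    return pd.Series(
        {p: sum(col.startswith(p + "_") for col in df.columns) for p in prefixes},
        name="n_features",
    )

# Toy frame with made-up column names, standing in for brca_df.
toy = pd.DataFrame(
    columns=["cn_g1", "cn_g2", "mu_g1", "rs_g1", "rs_g2", "rs_g3", "pp_p1", "vital.status"]
)
counts = count_features_by_prefix(toy)
print(counts)
```

Applied to the full table, the four counts should match the numbers in the list and sum to 1936.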

**Pertinent Questions:**
- What types of multi-omics prediction models can we build?
- What are the strengths and weaknesses of the different methods?
- Can we show why it is meaningful to integrate different data types?

In [1]:
import numpy as np
import pandas as pd

In [2]:
brca_df = pd.read_csv("brca_data_w_subtypes.csv")
brca_df.head()

Unnamed: 0,rs_CLEC3A,rs_CPB1,rs_SCGB2A2,rs_SCGB1D2,rs_TFF1,rs_MUCL1,rs_GSTM1,rs_PIP,rs_ADIPOQ,rs_ADH1B,...,pp_p62.LCK.ligand,pp_p70S6K,pp_p70S6K.pT389,pp_p90RSK,pp_p90RSK.pT359.S363,vital.status,PR.Status,ER.Status,HER2.Final.Status,histological.type
0,0.892818,6.580103,14.123672,10.606501,13.189237,6.649466,10.520335,10.33849,10.248379,10.22997,...,-0.691766,-0.337863,-0.178503,0.011638,-0.207257,0,Positive,Positive,Negative,infiltrating ductal carcinoma
1,0.0,3.691311,17.11609,15.517231,9.867616,9.691667,8.179522,7.911723,1.289598,1.818891,...,0.279067,0.292925,-0.155242,-0.089365,0.26753,0,Positive,Negative,Negative,infiltrating ductal carcinoma
2,3.74815,4.375255,9.658123,5.326983,12.109539,11.644307,10.51733,5.114925,11.975349,11.911437,...,0.21991,0.30811,-0.190794,-0.22215,-0.198518,0,Positive,Positive,Negative,infiltrating ductal carcinoma
3,0.0,18.235519,18.53548,14.533584,14.078992,8.91376,10.557465,13.304434,8.205059,9.211476,...,-0.266554,-0.079871,-0.463237,0.522998,-0.046902,0,Positive,Positive,Negative,infiltrating ductal carcinoma
4,0.0,4.583724,15.711865,12.804521,8.881669,8.430028,12.964607,6.806517,4.294341,5.385714,...,-0.441542,-0.152317,0.511386,-0.096482,0.037473,0,Positive,Positive,Negative,infiltrating ductal carcinoma


In [28]:
brca_df.isna().sum()

rs_CLEC3A              0
rs_CPB1                0
rs_SCGB2A2             0
rs_SCGB1D2             0
rs_TFF1                0
                    ... 
vital.status           0
PR.Status            122
ER.Status            122
HER2.Final.Status    145
histological.type      0
Length: 1941, dtype: int64
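
Because the output above is truncated to ten of the 1941 columns, it is easier to list only the columns that actually contain missing values. The following is a minimal sketch; `missing_summary` is a hypothetical helper and the toy frame stands in for `brca_df`.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Count and fraction of missing values, restricted to columns that have any."""
    n_missing = df.isna().sum()
    n_missing = n_missing[n_missing > 0]
    summary = pd.DataFrame(
        {"n_missing": n_missing, "fraction": n_missing / len(df)}
    )
    return summary.sort_values("n_missing", ascending=False)

# Toy frame with made-up values, standing in for brca_df.
toy = pd.DataFrame({
    "rs_g1": [0.1, 0.2, 0.3, 0.4],
    "PR.Status": ["Positive", np.nan, "Negative", "Positive"],
    "HER2.Final.Status": [np.nan, np.nan, "Negative", "Positive"],
})
summary = missing_summary(toy)
print(summary)
```

In the truncated display only the three receptor-status columns show missing values. Whether any of the hidden columns do cannot be read from that output, which is what this filter would reveal.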

In [7]:
brca_df.groupby("histological.type").count()["vital.status"]

histological.type
infiltrating ductal carcinoma     574
infiltrating lobular carcinoma    131
Name: vital.status, dtype: int64

In [29]:
brca_df.groupby("vital.status").count()["histological.type"]

vital.status
0    611
1     94
Name: histological.type, dtype: int64

In [31]:
brca_df.groupby("PR.Status").count()["histological.type"]

PR.Status
Indeterminate                    4
Negative                       193
Not Performed                   28
Performed but Not Available      5
Positive                       353
Name: histological.type, dtype: int64

In [32]:
brca_df.groupby("ER.Status").count()["histological.type"]

ER.Status
Indeterminate                    2
Negative                       135
Not Performed                   27
Performed but Not Available      5
Positive                       414
Name: histological.type, dtype: int64

In [33]:
brca_df.groupby("HER2.Final.Status").count()["histological.type"]

HER2.Final.Status
Equivocal          9
Negative         457
Not Available      8
Positive          86
Name: histological.type, dtype: int64
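
Note that `groupby(...).count()` drops rows whose group key is missing, so the status counts above sum to 583 (PR and ER) and 560 (HER2) rather than 705. The differences are the 122 and 145 missing values reported by `isna()`. The following is a minimal sketch of one way to keep missing values visible, using `value_counts(dropna=False)` on a toy frame that stands in for `brca_df`.

```python
import numpy as np
import pandas as pd

STATUS_COLS = ["PR.Status", "ER.Status", "HER2.Final.Status"]

# Toy frame with made-up values, standing in for brca_df.
toy = pd.DataFrame({
    "PR.Status": ["Positive", "Negative", np.nan, "Positive", "Indeterminate", np.nan],
    "ER.Status": ["Positive", "Negative", np.nan, "Positive", "Positive", np.nan],
    "HER2.Final.Status": ["Negative", "Positive", np.nan, np.nan, "Equivocal", np.nan],
})

# dropna=False adds a NaN row, so each table sums to the number of samples.
status_counts = {col: toy[col].value_counts(dropna=False) for col in STATUS_COLS}
for col, counts in status_counts.items():
    print(counts, end="\n\n")
```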

In [54]:
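# Assign a receptor-status-based subtype label to each sample.
# The status order in `temp` is [PR, ER, HER2]. Any missing or
# non-definitive value (NaN, "Indeterminate", "Equivocal", ...) and any
# combination not listed below falls through to "unclassified".
status_cols = ["PR.Status", "ER.Status", "HER2.Final.Status"]
label = []
for i in range(brca_df.shape[0]):
    temp = brca_df.iloc[i][status_cols].str.lower().tolist()
    if temp == ["positive", "positive", "negative"]:
        label.append("luminal_a")
    elif temp == ["positive", "positive", "positive"]:
        label.append("luminal_b")
    elif temp == ["negative", "negative", "positive"]:
        label.append("her2pos")
    elif temp == ["negative", "negative", "negative"]:
        label.append("basal_like")
    else:
        label.append("unclassified")
brca_df["label"] = label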
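
The row-wise loop above can also be written with boolean masks and `np.select`, which avoids iterating over the 705 rows one at a time. The following is a minimal sketch, assuming the same three status columns and the same five labels. `assign_subtype` is a hypothetical helper and the toy frame stands in for `brca_df`.

```python
import numpy as np
import pandas as pd

def assign_subtype(df):
    """Vectorized version of the PR/ER/HER2 labeling rules (hypothetical helper)."""
    pr = df["PR.Status"].str.lower()
    er = df["ER.Status"].str.lower()
    her2 = df["HER2.Final.Status"].str.lower()

    # .eq() returns False for NaN, so missing values never match a rule.
    pr_pos, pr_neg = pr.eq("positive").to_numpy(), pr.eq("negative").to_numpy()
    er_pos, er_neg = er.eq("positive").to_numpy(), er.eq("negative").to_numpy()
    her2_pos, her2_neg = her2.eq("positive").to_numpy(), her2.eq("negative").to_numpy()

    conditions = [
        pr_pos & er_pos & her2_neg,
        pr_pos & er_pos & her2_pos,
        pr_neg & er_neg & her2_pos,
        pr_neg & er_neg & her2_neg,
    ]
    choices = ["luminal_a", "luminal_b", "her2pos", "basal_like"]
    labels = np.select(conditions, choices, default="unclassified")
    return pd.Series(labels, index=df.index, name="label")

# Toy frame with made-up values: one row per rule, plus two that fall through.
toy = pd.DataFrame({
    "PR.Status": ["Positive", "Positive", "Negative", "Negative", np.nan, "Positive"],
    "ER.Status": ["Positive", "Positive", "Negative", "Negative", "Positive", "Negative"],
    "HER2.Final.Status": ["Negative", "Positive", "Positive", "Negative", "Negative", "Negative"],
})
toy_labels = assign_subtype(toy)
print(toy_labels.tolist())
```

The masks are mutually exclusive, so the order of `conditions` does not affect the result here.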

In [55]:
brca_df["label"].value_counts()

luminal_a       279
unclassified    268
basal_like       90
luminal_b        43
her2pos          25
Name: label, dtype: int64
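
With 268 of 705 samples left unclassified, it may help to separate samples that lack a definitive status in at least one marker from samples whose three definitive statuses simply match none of the four rules (for example PR positive with ER negative). The following is a minimal sketch, assuming a `label` column built as above. `unclassified_breakdown` and its two category names are hypothetical, and the toy frame stands in for `brca_df`.

```python
import numpy as np
import pandas as pd

STATUS_COLS = ["PR.Status", "ER.Status", "HER2.Final.Status"]

def unclassified_breakdown(df):
    """Split unclassified rows by the reason they matched no rule (hypothetical helper)."""
    unclassified = df[df["label"] == "unclassified"]
    status = unclassified[STATUS_COLS].apply(lambda s: s.str.lower())
    # A row is "definitive" only if all three statuses are positive or negative.
    definitive = status.isin(["positive", "negative"]).all(axis=1)
    reason = np.where(definitive, "unmatched_combination", "missing_or_non_definitive")
    return pd.Series(reason, index=unclassified.index).value_counts()

# Toy frame with made-up values, standing in for brca_df.
toy = pd.DataFrame({
    "PR.Status": ["Positive", "Positive", np.nan, "Indeterminate"],
    "ER.Status": ["Positive", "Negative", "Positive", "Positive"],
    "HER2.Final.Status": ["Negative", "Negative", "Negative", "Negative"],
    "label": ["luminal_a", "unclassified", "unclassified", "unclassified"],
})
breakdown = unclassified_breakdown(toy)
print(breakdown)
```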

In [18]:
# Split the feature columns into one frame per omics type, based on the column prefix.
cn_df = brca_df[[col for col in brca_df.columns if col.startswith("cn_")]].copy()
mu_df = brca_df[[col for col in brca_df.columns if col.startswith("mu_")]].copy()
rs_df = brca_df[[col for col in brca_df.columns if col.startswith("rs_")]].copy()
pp_df = brca_df[[col for col in brca_df.columns if col.startswith("pp_")]].copy()

In [19]:
# Attach the two outcome columns to every per-omics frame.
for omics_df in [cn_df, mu_df, rs_df, pp_df]:
    omics_df["vital.status"] = brca_df["vital.status"]
    omics_df["histological.type"] = brca_df["histological.type"]
    print(len(omics_df.columns))

862
251
606
225
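
The split and the attachment of the outcome columns can also be done in one step, collecting the per-omics frames in a dictionary keyed by prefix. The following is a minimal sketch, assuming the `cn_`/`mu_`/`rs_`/`pp_` prefix convention. `split_by_omics` and its parameters are hypothetical, and the toy frame stands in for `brca_df`.

```python
import pandas as pd

def split_by_omics(df,
                   prefixes=("cn", "mu", "rs", "pp"),
                   keep=("vital.status", "histological.type")):
    """Return {prefix: frame with that prefix's features plus the `keep` columns}."""
    blocks = {}
    for prefix in prefixes:
        feature_cols = [col for col in df.columns if col.startswith(prefix + "_")]
        blocks[prefix] = df[feature_cols + list(keep)].copy()
    return blocks

# Toy frame with made-up columns and values, standing in for brca_df.
toy = pd.DataFrame({
    "cn_g1": [0, 1], "cn_g2": [1, -1],
    "mu_g1": [0, 1],
    "rs_g1": [5.2, 7.1],
    "pp_p1": [-0.3, 0.4],
    "vital.status": [0, 1],
    "histological.type": ["infiltrating ductal carcinoma",
                          "infiltrating lobular carcinoma"],
})
blocks = split_by_omics(toy)
for prefix, block in blocks.items():
    print(prefix, block.shape[1])
```

On the full table each frame would have two more columns than its feature count, matching the 862, 251, 606, and 225 printed above.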
