# Cleaning & Manipulation of Chr7 Genes Score Data

In [1]:
#Author: Shirley Zhou

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
ts_master_list = pd.read_csv("../data/Chr7_master_list.csv")
ts_master_list.head(3)

Unnamed: 0.1,Unnamed: 0,Gene,ENSEMBL_ID,Brummelkamp.hap1.GTS.ratio,Brummelkamp.hap1.GTS.threshold,Brummelkamp.kbm7.GTS.ratio,Brummelkamp.kbm7.GTS.threshold,Campbell2018.LAML.pvalues,Campbell2018.PANCANCER.pvalues,Campbell2018.Combined.pvalues,...,Sabbatini2017.THP.1,Sabbatini2017.THP.1.threshold,Wallace2016.DEseq2.Log2.Fold.Change,Wallace2016.DEseq2.P.Value,Wallace2016.threshold,Weissman2016a.Rep1_2.Mann.Whitney.p.value,Weissman2016a.average.phenotype.of.strongest.3.2,Weissman2016a.threshold,Weissman2016i.Rep1_2.Mann.Whitney.p.value,Weissman2016i.average.phenotype.of.strongest.3.2
0,776,VKORC1L1,ENSG00000196715,0.46,0,0.531993,0,,,,...,-0.031,0,1.440088,0.000159,1,0.000412,-0.034013,1,0.054796,0.01427
1,390,KCTD7,ENSG00000243335,0.55,0,0.542857,0,,,,...,-0.424,0,1.087654,0.011067,1,0.003739,-0.027808,1,0.704017,0.013199
2,420,LRRC17,ENSG00000128606,,0,,0,,,,...,-0.13,0,,,0,0.013048,-0.030304,1,0.181169,-0.021738


In [4]:
ts_score = pd.read_csv("../data/Rank_variables_master_list.csv")
ts_score.head(3)

Unnamed: 0,Gene,Entrez.GeneID,ENSEMBL_ID,Brummelkamp.hap1.GTS.ratio,Brummelkamp.hap1.GTS.threshold,Brummelkamp.kbm7.GTS.ratio,Brummelkamp.kbm7.GTS.threshold,Campbell2018.LAML.pvalues,Campbell2018.PANCANCER.pvalues,Campbell2018.Combined.pvalues,...,Sabbatini2017.THP.1,Sabbatini2017.THP.1.threshold,Wallace2016.DEseq2.Log2.Fold.Change,Wallace2016.DEseq2.P.Value,Wallace2016.threshold,Weissman2016a.mean.Replicate1_n_2.Mann.Whitney.p.value,Weissman2016a.CS.average.phenotype.of.strongest.3.2,Weissman2016a.threshold,Weissman2016i.Rep1_2.Mann.Whitney.p.value,Weissman2016i.average.phenotype.of.strongest.3 guides.comb(avg CS for replicate 1 and 2)
0,POLD2,5425.0,ENSG00000106628,0.246339,0,0.207792,0,,,,...,-3.567,0,1.130542,0.024891,1,0.137059,-0.012419,0,7e-06,-0.20583
1,EZH2,2146.0,ENSG00000106462,0.414478,0,0.211823,0,0.0498,1.0,0.065506,...,-1.605,0,,,0,0.798109,-0.000228,0,0.021934,0.033731
2,PTCD1,26024.0,ENSG00000106246,0.486842,0,0.488024,0,,,,...,-1.544,0,,,0,0.186927,-0.016081,0,0.000299,-0.147673


## Deletion of column data 

The column `Weissman2014.CRISPRi.Growth.phenotype..mean.of.top.3.gammas.` will be deleted because most of the hits in this CRISPRi screen are underpowered.

In [5]:
ts_score = ts_score.drop('Weissman2014.CRISPRi.Growth.phenotype..mean.of.top.3.gammas.', axis=1)

## Cleaning

### Remove NaN Columns

Define a function `percent_nan` that returns the **fraction of NaN values** in a column (a value between 0 and 1).

In [6]:
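def percent_nan(column):
    # Fraction of entries that are NaN (the argument name no longer shadows the built-in `list`).
    return np.isnan(column).sum() / len(column)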

Apply `percent_nan` to every score column of the `ts_score` table and sort the resulting NaN fractions from highest to lowest.

In [7]:
nan_check  = ts_score.drop(['Gene', 'Entrez.GeneID', 'ENSEMBL_ID'], 
                           axis=1).apply(percent_nan)
nan_check.sort_values(ascending = False).head(30)

Campbell2018.LAML.pvalues                                       0.937500
Campbell2018.PANCANCER.pvalues                                  0.937500
Campbell2018.Combined.pvalues                                   0.937500
Wallace2016.DEseq2.P.Value                                      0.760417
Wallace2016.DEseq2.Log2.Fold.Change                             0.760417
Elledge2019.HMEC.FDR.drop                                       0.166667
Elledge2019.HPNE.Average.Log2.Drop                              0.166667
Elledge2019.HPNE.FDR.drop                                       0.166667
Elledge2019.HMEC.Average.Log2FC.Drop(desc)                      0.166667
Elledge2019.HMEC.Combined.pvalue.drop                           0.166667
Elledge2019.HPNE.Combined.pvalue.drop                           0.166667
Brummelkamp.hap1.GTS.ratio                                      0.093750
Brummelkamp.kbm7.GTS.ratio                                      0.093750
Sabbatini2015.Raji.CS                              

From the result above, the following columns have more than 75% of their values missing (labeled as NaN):
- `Campbell2018.LAML.pvalues`
- `Campbell2018.PANCANCER.pvalues`
- `Campbell2018.Combined.pvalues`
- `Wallace2016.DEseq2.P.Value`
- `Wallace2016.DEseq2.Log2.Fold.Change`

Drop the 5 columns that have more than 75% NaN values and store the resulting table as `ts_score_cleaned_nan`.

In [8]:
ts_score_cleaned_nan = ts_score.drop(nan_check[nan_check > 0.75].index.tolist(), axis=1)
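
The NaN filter above can also be written with pandas' own missing-value helpers. The following is a minimal sketch on a made-up toy table; the helper `drop_sparse_columns` and the columns `score_a` and `score_b` are illustrative names that do not appear in the notebook, and the strict `> 0.75` cutoff is assumed to match the cell above.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df, id_cols, max_nan_fraction=0.75):
    """Drop score columns whose fraction of NaN exceeds max_nan_fraction."""
    scores = df.drop(columns=id_cols)
    nan_fraction = scores.isna().mean()   # mean of a boolean mask = fraction of NaN
    sparse = nan_fraction[nan_fraction > max_nan_fraction].index
    return df.drop(columns=sparse), nan_fraction

# Toy data: hypothetical values, only for illustration.
toy = pd.DataFrame({
    "Gene": ["A", "B", "C", "D", "E"],
    "score_a": [0.1, np.nan, 0.3, 0.4, 0.5],            # 1 of 5 missing: kept
    "score_b": [np.nan, np.nan, np.nan, np.nan, 1.0],   # 4 of 5 missing: dropped
})
cleaned, nan_fraction = drop_sparse_columns(toy, id_cols=["Gene"])
print(nan_fraction.sort_values(ascending=False))
print(cleaned.columns.tolist())
```

Returning the NaN fractions alongside the filtered table keeps the ranking available for inspection, as in the `nan_check` output above.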

### Remove Zero Variance Columns

Calculate the variance of each column.

In [9]:
col_var = ts_score_cleaned_nan.drop(['Gene', 'Entrez.GeneID', 'ENSEMBL_ID'], 
                           axis=1).apply(np.var)
col_var.head(30)

Brummelkamp.hap1.GTS.ratio                                      1.466810e-02
Brummelkamp.hap1.GTS.threshold                                  3.993056e-02
Brummelkamp.kbm7.GTS.ratio                                      1.336673e-02
Brummelkamp.kbm7.GTS.threshold                                  2.039931e-02
Chen2019.CS.pos.score                                           9.004844e-02
Chen2019.pos.p.value                                            7.820765e-02
Chen2019.pos.p.value.threshold                                  3.027344e-02
Chen2019.pos.fdr                                                7.668305e-04
Chen2019.pos.rank                                               2.781835e+07
Weissman2014.CRISPRa.Growth.phenotype..mean.of.top.3.gammas.    1.188182e-03
Weissman2014.CRISPRa.Growth.phenotype.threshold                 6.759983e-02
Doench2018.Average.LFC                                          1.392407e-01
Doench2018.Average.negative.log.p.values.                       3.748032e-01

Drop the columns that have variance = 0 and store the resulting table as `ts_score_cleaned_var`.

In [10]:
ts_score_cleaned_var = ts_score_cleaned_nan.drop(col_var[col_var == 0].index.tolist(), axis=1)
ts_score_cleaned_var.head()

Unnamed: 0,Gene,Entrez.GeneID,ENSEMBL_ID,Brummelkamp.hap1.GTS.ratio,Brummelkamp.hap1.GTS.threshold,Brummelkamp.kbm7.GTS.ratio,Brummelkamp.kbm7.GTS.threshold,Chen2019.CS.pos.score,Chen2019.pos.p.value,Chen2019.pos.p.value.threshold,...,Sabbatini2017.SKM.1,Sabbatini2017.SKM.threshold,Sabbatini2017.TF.1,Sabbatini2017.THP.1,Wallace2016.threshold,Weissman2016a.mean.Replicate1_n_2.Mann.Whitney.p.value,Weissman2016a.CS.average.phenotype.of.strongest.3.2,Weissman2016a.threshold,Weissman2016i.Rep1_2.Mann.Whitney.p.value,Weissman2016i.average.phenotype.of.strongest.3 guides.comb(avg CS for replicate 1 and 2)
0,POLD2,5425.0,ENSG00000106628,0.246339,0,0.207792,0,0.52414,0.52419,0,...,-2.306,0,-3.45,-3.567,1,0.137059,-0.012419,0,7.45e-06,-0.20583
1,EZH2,2146.0,ENSG00000106462,0.414478,0,0.211823,0,0.61422,0.614,0,...,-1.706,0,-2.087,-1.605,0,0.798109,-0.000228,0,0.02193387,0.033731
2,PTCD1,26024.0,ENSG00000106246,0.486842,0,0.488024,0,0.98863,0.9887,0,...,-0.975,0,-1.82,-1.544,0,0.186927,-0.016081,0,0.000299183,-0.147673
3,MCM7,4176.0,ENSG00000166508,0.077803,0,0.087719,0,0.72494,0.72481,0,...,-1.879,0,-2.898,-3.325,0,0.529921,-0.01497,0,0.3089643,-0.069984
4,NUP205,,ENSG00000155561,0.069801,0,0.048443,0,0.98416,0.98426,0,...,-2.369,0,-2.675,-2.244,0,0.050028,0.00524,0,2.22e-07,-0.339325
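
Comparing a floating-point variance with exactly 0 can, in principle, miss a constant column because of rounding. An alternative is to count distinct values per column. The sketch below assumes that a column with at most one distinct non-NaN value should count as "zero variance"; the helper `constant_columns` and the toy columns are hypothetical.

```python
import numpy as np
import pandas as pd

def constant_columns(df, id_cols):
    """Return names of score columns with at most one distinct non-NaN value."""
    scores = df.drop(columns=id_cols)
    n_distinct = scores.nunique(dropna=True)   # distinct non-missing values per column
    return n_distinct[n_distinct <= 1].index.tolist()

# Toy data: hypothetical values, only for illustration.
toy = pd.DataFrame({
    "Gene": ["A", "B", "C"],
    "varies": [0.1, 0.2, 0.3],
    "all_zero": [0, 0, 0],
    "one_or_nan": [1.0, np.nan, 1.0],   # constant once NaN is ignored
})
to_drop = constant_columns(toy, id_cols=["Gene"])
print(to_drop)
cleaned = toy.drop(columns=to_drop)
```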


### Remove Zero Columns

Define a function `percent_zero` that returns the __fraction of zeros__ in a given column (a value between 0 and 1).

In [11]:
def percent_zero(column):
    # Fraction of entries equal to 0; NaN entries count as non-zero here.
    return (len(column) - np.count_nonzero(column)) / len(column)

Apply `percent_zero` to every score column of the `ts_score_cleaned_var` table and sort the resulting zero fractions from highest to lowest.

In [12]:
zeros_check = ts_score_cleaned_var.drop(['Gene', 'Entrez.GeneID', 'ENSEMBL_ID'], axis=1).apply(percent_zero)
zeros_check.sort_values(ascending = False).head(20)

Sabbatini2015.Raji.threshold                          0.989583
Sabbatini2015.Jiyoye.threshold                        0.979167
Brummelkamp.kbm7.GTS.threshold                        0.979167
Sabbatini2017.MOLM.13.threshold                       0.968750
Sabbatini2017.SKM.threshold                           0.968750
Chen2019.pos.p.value.threshold                        0.968750
Brummelkamp.hap1.GTS.threshold                        0.958333
Sabbatini2015.K562.threshold                          0.947917
Weissman2014.CRISPRa.Growth.phenotype.threshold       0.927083
Elledge2019.HPNE.Average.threshold                    0.906250
Elledge2013.TUSON_p_value_TSG.threshold               0.854167
Doench2018.Average.negative.log.p.values.threshold    0.822917
Elledge2019.HMEC.Average.threshold                    0.812500
Wallace2016.threshold                                 0.781250
Weissman2016a.threshold                               0.604167
Elledge2019.HMEC.Average.Log2FC.Drop(desc)            0

From the result above, the following columns have more than 60% zeros (they are all threshold columns):

- `Sabbatini2015.Raji.threshold`
- `Sabbatini2015.Jiyoye.threshold`
- `Brummelkamp.kbm7.GTS.threshold`
- `Sabbatini2017.MOLM.13.threshold`
- `Sabbatini2017.SKM.threshold`
- `Chen2019.pos.p.value.threshold`
- `Brummelkamp.hap1.GTS.threshold`
- `Sabbatini2015.K562.threshold`
- `Weissman2014.CRISPRa.Growth.phenotype.threshold`
- `Elledge2019.HPNE.Average.threshold`
- `Elledge2013.TUSON_p_value_TSG.threshold`
- `Doench2018.Average.negative.log.p.values.threshold`
- `Elledge2019.HMEC.Average.threshold`
- `Wallace2016.threshold`
- `Weissman2016a.threshold`

Remove the columns above that have more than 60% zeros.

In [13]:
ts_score_cleaned_zero = ts_score_cleaned_var.drop(zeros_check[zeros_check > 0.6].index.tolist(), axis=1)
ts_score_cleaned_zero.columns.tolist()

['Gene',
 'Entrez.GeneID',
 'ENSEMBL_ID',
 'Brummelkamp.hap1.GTS.ratio',
 'Brummelkamp.kbm7.GTS.ratio',
 'Chen2019.CS.pos.score',
 'Chen2019.pos.p.value',
 'Chen2019.pos.fdr',
 'Chen2019.pos.rank',
 'Weissman2014.CRISPRa.Growth.phenotype..mean.of.top.3.gammas.',
 'Doench2018.Average.LFC',
 'Doench2018.Average.negative.log.p.values.',
 'Elledge2013.TUSON_p_value_TSG',
 'Elledge2013.TUSON_q_value_TSG',
 'Elledge2013.TSG_Probability_LASSO',
 'Elledge2019.HMEC.Average.Log2FC.Drop(desc)',
 'Elledge2019.HMEC.Combined.pvalue.drop',
 'Elledge2019.HMEC.FDR.drop',
 'Elledge2019.HPNE.Average.Log2.Drop',
 'Elledge2019.HPNE.Combined.pvalue.drop',
 'Elledge2019.HPNE.FDR.drop',
 'Sabbatini2015.KBM7.CS',
 'Sabbatini2015.KBM7.adjusted.p.value',
 'Sabbatini2015.K562.CS',
 'Sabbatini2015.K562.adjusted.p.value',
 'Sabbatini2015.Jiyoye.CS',
 'Sabbatini2015.Jiyoye.adjusted.p.value',
 'Sabbatini2015.Raji.CS',
 'Sabbatini2015.Raji.adjusted.p.value',
 'Sabbatini2017.EOL.1',
 'Sabbatini2017.HEL',
 'Sabbatini201
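
The zero filter can be expressed with a boolean mask as well. This is a small sketch on made-up data, assuming the same strict `> 0.6` cutoff; the toy column names are hypothetical. As with `np.count_nonzero`, a NaN entry is not counted as a zero but still contributes to the denominator.

```python
import numpy as np
import pandas as pd

# Toy data: hypothetical values, only for illustration.
toy = pd.DataFrame({
    "Gene": ["A", "B", "C", "D", "E"],
    "mostly_zero.threshold": [0, 0, 0, 0, 1],   # 4 of 5 zeros
    "score": [0.0, 0.7, np.nan, -1.2, 0.3],     # 1 of 5 zeros, one NaN
})
scores = toy.drop(columns=["Gene"])
zero_fraction = (scores == 0).mean()            # NaN == 0 is False
mostly_zero = zero_fraction[zero_fraction > 0.6].index.tolist()
cleaned = toy.drop(columns=mostly_zero)
print(zero_fraction)
print(cleaned.columns.tolist())
```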

Note that one more threshold column, `Sabbatini2017.MV411.threshold`, passed the zero filter (probably because most of its values are 1) and needs to be removed manually.

In [14]:
ts_score_cleaned_thrsh = ts_score_cleaned_zero.drop('Sabbatini2017.MV411.threshold', axis=1)
ts_score_cleaned_thrsh.head()

Unnamed: 0,Gene,Entrez.GeneID,ENSEMBL_ID,Brummelkamp.hap1.GTS.ratio,Brummelkamp.kbm7.GTS.ratio,Chen2019.CS.pos.score,Chen2019.pos.p.value,Chen2019.pos.fdr,Chen2019.pos.rank,Weissman2014.CRISPRa.Growth.phenotype..mean.of.top.3.gammas.,...,Sabbatini2017.OCI.AML5,Sabbatini2017.P31.FUJ,Sabbatini2017.PL.21,Sabbatini2017.SKM.1,Sabbatini2017.TF.1,Sabbatini2017.THP.1,Weissman2016a.mean.Replicate1_n_2.Mann.Whitney.p.value,Weissman2016a.CS.average.phenotype.of.strongest.3.2,Weissman2016i.Rep1_2.Mann.Whitney.p.value,Weissman2016i.average.phenotype.of.strongest.3 guides.comb(avg CS for replicate 1 and 2)
0,POLD2,5425.0,ENSG00000106628,0.246339,0.207792,0.52414,0.52419,0.999997,9935,-0.001149,...,-3.504,-3.275,-2.019,-2.306,-3.45,-3.567,0.137059,-0.012419,7.45e-06,-0.20583
1,EZH2,2146.0,ENSG00000106462,0.414478,0.211823,0.61422,0.614,0.999997,11628,0.034245,...,-1.001,0.178,0.919,-1.706,-2.087,-1.605,0.798109,-0.000228,0.02193387,0.033731
2,PTCD1,26024.0,ENSG00000106246,0.486842,0.488024,0.98863,0.9887,0.999997,18786,-0.049917,...,-1.967,-2.01,-1.25,-0.975,-1.82,-1.544,0.186927,-0.016081,0.000299183,-0.147673
3,MCM7,4176.0,ENSG00000166508,0.077803,0.087719,0.72494,0.72481,0.999997,13679,-0.099647,...,-3.472,-3.049,-1.725,-1.879,-2.898,-3.325,0.529921,-0.01497,0.3089643,-0.069984
4,NUP205,,ENSG00000155561,0.069801,0.048443,0.98416,0.98426,0.999997,18676,-0.013492,...,-2.165,-2.378,-1.174,-2.369,-2.675,-2.244,0.050028,0.00524,2.22e-07,-0.339325


### Save Cleaned Dataframe

Save this cleaned dataframe to a new `.csv` file.

In [15]:
ts_score_cleaned_thrsh.to_csv(r'../data/Rank_variables_master_list_cleaned.csv')

### Rename the Columns

The columns of the cleaned dataframe are renamed to make them both concise and clear. Each new column name keeps the following features:
- four-letter abbreviation of the corresponding author's last name, followed by a two-digit year when the same author has more than one study (e.g. `Sabb15`, `Sabb17`)
- cell line used (if applicable)
- statistical manipulation applied, e.g. avg, combined, logm (if applicable)
- score or measure reported, e.g. GTS (Gene Trap Score), cs (CRISPR score), pvalue, fdr (false discovery rate)

In [16]:
col_names = {'Brummelkamp.hap1.GTS.ratio': 'Brum.HAP1.GTS',
             'Brummelkamp.kbm7.GTS.ratio': 'Brum.KBM7.GTS',
             'Chen2019.CS.pos.score': 'Chen.cs',
             'Chen2019.pos.fdr': 'Chen.fdr', 
             'Chen2019.pos.p.value': 'Chen.pvalue',
             'Chen2019.pos.rank': 'Chen.ts.rank',
             'Doench2018.Average.LFC': 'Doen.avg.LFC',
             'Doench2018.Average.negative.log.p.values.': 'Doen.avg.neg.log.pvalue',
             'Elledge2013.TUSON_p_value_TSG': 'Elle13.pvalue',
             'Elledge2013.TUSON_q_value_TSG': 'Elle13.fdr',
             'Elledge2013.TSG_Probability_LASSO': 'Elle13.lasso.prob',
             'Elledge2019.HMEC.Average.Log2FC.Drop(desc)': 'Elle18.HMEC.avg.LFC',
             'Elledge2019.HMEC.Combined.pvalue.drop': 'Elle18.HMEC.combined.pvalue', 
             'Elledge2019.HMEC.FDR.drop': 'Elle19.HMEC.fdr',
             'Elledge2019.HPNE.Average.Log2.Drop': 'Elle18.HPNE.avg.LFC',
             'Elledge2019.HPNE.Combined.pvalue.drop': 'Elle18.HPNE.Combined.pvalue',
             'Elledge2019.HPNE.FDR.drop': 'Elle19.HPNE.fdr',
             'Sabbatini2015.KBM7.CS': 'Sabb15.KBM7.cs',
             'Sabbatini2015.KBM7.adjusted.p.value': 'Sabb15.KBM7.fdr',
             'Sabbatini2015.K562.CS': 'Sabb15.K562.cs',
             'Sabbatini2015.K562.adjusted.p.value': 'Sabb15.K562.fdr',
             'Sabbatini2015.Jiyoye.CS': 'Sabb15.Jiyoye.cs', 
             'Sabbatini2015.Jiyoye.adjusted.p.value': 'Sabb15.Jiyoye.fdr',
             'Sabbatini2015.Raji.CS': 'Sabb15.Raji.cs',
             'Sabbatini2015.Raji.adjusted.p.value': 'Sabb15.Raji.fdr',
             'Sabbatini2017.EOL.1': 'Sabb17.EOL1', 
             'Sabbatini2017.HEL': 'Sabb17.HEL',
             'Sabbatini2017.MOLM.13': 'Sabb17.MOLM13',
             'Sabbatini2017.MonoMac1': 'Sabb17.MonoMac1',
             'Sabbatini2017.NB4.rep1': 'Sabb17.NB4.rep1',
             'Sabbatini2017.NB4.rep2': 'Sabb17.NB4.rep2', 
             'Sabbatini2017.OCI.AML2': 'Sabb17.OCI.AML2',
             'Sabbatini2017.OCI.AML3': 'Sabb17.OCI.AML3', 
             'Sabbatini2017.OCI.AML5': 'Sabb17.OCI.AML5',
             'Sabbatini2017.P31.FUJ': 'Sabb17.P31.FUJ', 
             'Sabbatini2017.PL.21': 'Sabb17.PL21', 
             'Sabbatini2017.SKM.1': 'Sabb17.SKM1',
             'Sabbatini2017.TF.1': 'Sabb17.TF1', 
             'Sabbatini2017.THP.1': 'Sabb17.THP1',
             'Weissman2014.CRISPRa.Growth.phenotype..mean.of.top.3.gammas.': 'Weis14.csa.avg',
             'Weissman2016a.mean.Replicate1_n_2.Mann.Whitney.p.value': 'Weis16.csa.MW.pvalue',
             'Weissman2016a.CS.average.phenotype.of.strongest.3.2': 'Weis16.csa.avg',
             'Weissman2016i.Rep1_2.Mann.Whitney.p.value': 'Weis16.csi.MW.pvalue',
             'Weissman2016i.average.phenotype.of.strongest.3 guides.comb(avg CS for replicate 1 and 2)': 'Weis16.csi.avg'}

In [17]:
ts_score_cleaned = ts_score_cleaned_thrsh.rename(columns = col_names)

In [18]:
ts_score_cleaned.columns

Index(['Gene', 'Entrez.GeneID', 'ENSEMBL_ID', 'Brum.HAP1.GTS', 'Brum.KBM7.GTS',
       'Chen.cs', 'Chen.pvalue', 'Chen.fdr', 'Chen.ts.rank', 'Weis14.csa.avg',
       'Doen.avg.LFC', 'Doen.avg.neg.log.pvalue', 'Elle13.pvalue',
       'Elle13.fdr', 'Elle13.lasso.prob', 'Elle18.HMEC.avg.LFC',
       'Elle18.HMEC.combined.pvalue', 'Elle19.HMEC.fdr', 'Elle18.HPNE.avg.LFC',
       'Elle18.HPNE.Combined.pvalue', 'Elle19.HPNE.fdr', 'Sabb15.KBM7.cs',
       'Sabb15.KBM7.fdr', 'Sabb15.K562.cs', 'Sabb15.K562.fdr',
       'Sabb15.Jiyoye.cs', 'Sabb15.Jiyoye.fdr', 'Sabb15.Raji.cs',
       'Sabb15.Raji.fdr', 'Sabb17.EOL1', 'Sabb17.HEL', 'Sabb17.MOLM13',
       'Sabb17.MonoMac1', 'Sabb17.NB4.rep1', 'Sabb17.NB4.rep2',
       'Sabb17.OCI.AML2', 'Sabb17.OCI.AML3', 'Sabb17.OCI.AML5',
       'Sabb17.P31.FUJ', 'Sabb17.PL21', 'Sabb17.SKM1', 'Sabb17.TF1',
       'Sabb17.THP1', 'Weis16.csa.MW.pvalue', 'Weis16.csa.avg',
       'Weis16.csi.MW.pvalue', 'Weis16.csi.avg'],
      dtype='object')
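
`DataFrame.rename` silently ignores dictionary keys that do not match any column, so a typo in a long mapping can go unnoticed. The sketch below shows one possible safeguard; the helper `rename_checked` and the toy column names are hypothetical and not part of the notebook.

```python
import pandas as pd

def rename_checked(df, mapping, keep=()):
    """Rename columns and fail loudly on unmatched, unmapped, or colliding names."""
    missing = [c for c in mapping if c not in df.columns]
    unmapped = [c for c in df.columns if c not in mapping and c not in keep]
    if missing or unmapped:
        raise KeyError(f"not in table: {missing}; no new name: {unmapped}")
    renamed = df.rename(columns=mapping)
    if renamed.columns.duplicated().any():
        raise ValueError("new column names are not unique")
    return renamed

# Toy data: hypothetical column names, only for illustration.
toy = pd.DataFrame({"Gene": ["A"], "Study2019.CellX.FDR": [0.05]})
renamed = rename_checked(toy, {"Study2019.CellX.FDR": "Stud19.CellX.fdr"}, keep=("Gene",))
print(renamed.columns.tolist())
```

In the notebook's setting, `keep` would hold the identifier columns (`Gene`, `Entrez.GeneID`, `ENSEMBL_ID`) that are intentionally left unchanged.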

In [19]:
ts_score_cleaned.to_csv(r'../data/Rank_variables_master_list_clean_renamed.csv')

## Imputation (Replace NaNs)

### Median

In [20]:
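def nan_to_median(df):
    # Return a copy of df in which every NaN in a score column is replaced
    # by the median of that column's observed values.
    new = df.copy()
    cols = new.drop(['Gene','Entrez.GeneID', 'ENSEMBL_ID'], axis=1).columns.tolist()
    # Per-column median, ignoring NaN.
    medians = new.drop(['Gene','Entrez.GeneID', 'ENSEMBL_ID'], axis=1).apply(np.nanmedian)
    for i in cols:
        new[i] = new[i].replace(np.nan, medians[i])
    return new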
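
A more compact way to get the same kind of result is `DataFrame.fillna` with a Series of per-column medians. The following is a sketch under the assumption that the three identifier columns should stay untouched; `impute_median` and the toy values are illustrative only.

```python
import numpy as np
import pandas as pd

ID_COLS = ["Gene", "Entrez.GeneID", "ENSEMBL_ID"]

def impute_median(df, id_cols=ID_COLS):
    """Return a copy with NaN in each score column replaced by that column's median."""
    out = df.copy()
    score_cols = [c for c in out.columns if c not in id_cols]
    medians = out[score_cols].median()                   # skips NaN by default
    out[score_cols] = out[score_cols].fillna(medians)    # filled column by column
    return out

# Toy data: hypothetical values, only for illustration.
toy = pd.DataFrame({
    "Gene": ["A", "B", "C"],
    "Entrez.GeneID": [1.0, np.nan, 3.0],   # identifier: its NaN is left alone
    "ENSEMBL_ID": ["E1", "E2", None],
    "score": [1.0, np.nan, 3.0],
})
imputed = impute_median(toy)
print(imputed["score"].tolist())
```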

In [21]:
ts_score_median_imputation = nan_to_median(ts_score_cleaned)
ts_score_median_imputation.head()

Unnamed: 0,Gene,Entrez.GeneID,ENSEMBL_ID,Brum.HAP1.GTS,Brum.KBM7.GTS,Chen.cs,Chen.pvalue,Chen.fdr,Chen.ts.rank,Weis14.csa.avg,...,Sabb17.OCI.AML5,Sabb17.P31.FUJ,Sabb17.PL21,Sabb17.SKM1,Sabb17.TF1,Sabb17.THP1,Weis16.csa.MW.pvalue,Weis16.csa.avg,Weis16.csi.MW.pvalue,Weis16.csi.avg
0,POLD2,5425.0,ENSG00000106628,0.246339,0.207792,0.52414,0.52419,0.999997,9935,-0.001149,...,-3.504,-3.275,-2.019,-2.306,-3.45,-3.567,0.137059,-0.012419,7.45e-06,-0.20583
1,EZH2,2146.0,ENSG00000106462,0.414478,0.211823,0.61422,0.614,0.999997,11628,0.034245,...,-1.001,0.178,0.919,-1.706,-2.087,-1.605,0.798109,-0.000228,0.02193387,0.033731
2,PTCD1,26024.0,ENSG00000106246,0.486842,0.488024,0.98863,0.9887,0.999997,18786,-0.049917,...,-1.967,-2.01,-1.25,-0.975,-1.82,-1.544,0.186927,-0.016081,0.000299183,-0.147673
3,MCM7,4176.0,ENSG00000166508,0.077803,0.087719,0.72494,0.72481,0.999997,13679,-0.099647,...,-3.472,-3.049,-1.725,-1.879,-2.898,-3.325,0.529921,-0.01497,0.3089643,-0.069984
4,NUP205,,ENSG00000155561,0.069801,0.048443,0.98416,0.98426,0.999997,18676,-0.013492,...,-2.165,-2.378,-1.174,-2.369,-2.675,-2.244,0.050028,0.00524,2.22e-07,-0.339325


In [22]:
ts_score_median_imputation

Unnamed: 0,Gene,Entrez.GeneID,ENSEMBL_ID,Brum.HAP1.GTS,Brum.KBM7.GTS,Chen.cs,Chen.pvalue,Chen.fdr,Chen.ts.rank,Weis14.csa.avg,...,Sabb17.OCI.AML5,Sabb17.P31.FUJ,Sabb17.PL21,Sabb17.SKM1,Sabb17.TF1,Sabb17.THP1,Weis16.csa.MW.pvalue,Weis16.csa.avg,Weis16.csi.MW.pvalue,Weis16.csi.avg
0,POLD2,5425.0,ENSG00000106628,0.246339,0.207792,0.524140,0.524190,0.999997,9935,-0.001149,...,-3.504,-3.275,-2.019,-2.306,-3.450,-3.567,0.137059,-0.012419,7.450000e-06,-0.205830
1,EZH2,2146.0,ENSG00000106462,0.414478,0.211823,0.614220,0.614000,0.999997,11628,0.034245,...,-1.001,0.178,0.919,-1.706,-2.087,-1.605,0.798109,-0.000228,2.193387e-02,0.033731
2,PTCD1,26024.0,ENSG00000106246,0.486842,0.488024,0.988630,0.988700,0.999997,18786,-0.049917,...,-1.967,-2.010,-1.250,-0.975,-1.820,-1.544,0.186927,-0.016081,2.991830e-04,-0.147673
3,MCM7,4176.0,ENSG00000166508,0.077803,0.087719,0.724940,0.724810,0.999997,13679,-0.099647,...,-3.472,-3.049,-1.725,-1.879,-2.898,-3.325,0.529921,-0.014970,3.089643e-01,-0.069984
4,NUP205,,ENSG00000155561,0.069801,0.048443,0.984160,0.984260,0.999997,18676,-0.013492,...,-2.165,-2.378,-1.174,-2.369,-2.675,-2.244,0.050028,0.005240,2.220000e-07,-0.339325
5,VPS41,27072.0,ENSG00000006715,0.471486,0.465969,0.423200,0.433060,0.999997,8208,0.009153,...,-0.072,-0.091,0.061,-0.819,-0.503,0.086,0.757875,-0.002069,3.875170e-04,-0.087477
6,TMEM106B,54664.0,ENSG00000106460,0.520790,0.535714,0.003058,0.009031,0.898840,187,-0.017155,...,0.417,0.300,0.279,0.320,-0.013,0.198,0.259325,-0.007054,3.432340e-01,0.007584
7,ACTB,60.0,ENSG00000075624,0.278481,0.312500,0.940260,0.940080,0.999997,17806,-0.009182,...,-2.588,-3.174,-2.325,-0.924,-1.645,-3.157,0.006688,-0.032164,9.133826e-01,-0.001176
8,CASP2,835.0,ENSG00000106144,0.576959,0.526316,0.074559,0.130170,0.987051,2515,-0.026189,...,0.359,0.597,0.347,0.195,0.761,0.259,0.248227,-0.011515,7.665392e-01,0.012567
9,CEP41,95681.0,ENSG00000106477,0.419279,0.499456,0.744320,0.744070,0.999997,14037,-0.007685,...,0.166,-0.517,-0.377,-0.020,-0.461,-0.056,0.912805,0.004097,2.070000e-06,-0.046852


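Now check whether the imputation worked by inspecting the first score column, `Brum.HAP1.GTS`. The rows selected below are the genes for which this column was NaN before imputation.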

In [23]:
ts_score_cleaned[np.isnan(ts_score_cleaned["Brum.HAP1.GTS"].tolist())]

Unnamed: 0,Gene,Entrez.GeneID,ENSEMBL_ID,Brum.HAP1.GTS,Brum.KBM7.GTS,Chen.cs,Chen.pvalue,Chen.fdr,Chen.ts.rank,Weis14.csa.avg,...,Sabb17.OCI.AML5,Sabb17.P31.FUJ,Sabb17.PL21,Sabb17.SKM1,Sabb17.TF1,Sabb17.THP1,Weis16.csa.MW.pvalue,Weis16.csa.avg,Weis16.csi.MW.pvalue,Weis16.csi.avg
32,RABGEF1,27342.0,,,,0.54397,0.54361,0.999997,10300,0.013104,...,-0.9,-0.438,-1.109,-1.409,-0.544,-1.933,0.003739,-0.027808,0.704017,0.013199
34,DNAJC30,84277.0,ENSG00000176410,,,0.89391,0.89346,0.999997,16891,0.001629,...,0.191,-0.124,-0.326,-0.644,0.19,-0.288,0.406469,0.008781,0.616128,-0.03102
75,ZSCAN25,221785.0,,,,0.48859,0.49046,0.999997,9276,-0.007927,...,-0.037,-0.376,0.098,-0.237,-0.143,-0.164,0.611917,0.008821,0.986175,-0.004549
77,TNRC18,,,,,0.78966,0.78968,0.999997,14875,-0.005101,...,-0.699,-0.233,-1.256,-1.005,-1.393,-0.16,0.010632,-0.033607,0.176836,-0.010173
79,CALN1,83698.0,,,,0.63701,0.63683,0.999997,12038,,...,0.251,-0.229,0.256,-0.25,-0.079,-0.327,0.019361,-0.022015,0.286409,-0.007905
82,C1GALT1,,,,,0.76473,0.76453,0.999997,14422,-0.017888,...,0.505,0.464,0.289,0.105,0.536,0.615,0.291431,0.006242,0.40498,-0.001067
91,LUC7L2,51631.0,,,,0.36035,0.38215,0.999997,7236,-0.013031,...,-1.269,-0.991,-0.914,-0.946,-2.606,-0.557,0.023538,-0.015354,0.546132,0.018418
94,TARP,,,,,0.66453,0.66419,0.999997,12533,,...,0.113,0.455,0.045,0.048,0.575,0.65,0.539946,-0.012642,0.443288,-0.009949
95,TMEM120A,83862.0,,,,0.82131,0.8212,0.999997,15499,-0.014349,...,0.184,-0.161,-0.322,-0.337,-0.14,-0.157,0.155952,-0.004993,0.736046,0.005924


In [24]:
ts_score_median_imputation.loc[ts_score_cleaned[np.isnan(ts_score_cleaned["Brum.HAP1.GTS"].tolist())].index]

Unnamed: 0,Gene,Entrez.GeneID,ENSEMBL_ID,Brum.HAP1.GTS,Brum.KBM7.GTS,Chen.cs,Chen.pvalue,Chen.fdr,Chen.ts.rank,Weis14.csa.avg,...,Sabb17.OCI.AML5,Sabb17.P31.FUJ,Sabb17.PL21,Sabb17.SKM1,Sabb17.TF1,Sabb17.THP1,Weis16.csa.MW.pvalue,Weis16.csa.avg,Weis16.csi.MW.pvalue,Weis16.csi.avg
32,RABGEF1,27342.0,,0.504202,0.494515,0.54397,0.54361,0.999997,10300,0.013104,...,-0.9,-0.438,-1.109,-1.409,-0.544,-1.933,0.003739,-0.027808,0.704017,0.013199
34,DNAJC30,84277.0,ENSG00000176410,0.504202,0.494515,0.89391,0.89346,0.999997,16891,0.001629,...,0.191,-0.124,-0.326,-0.644,0.19,-0.288,0.406469,0.008781,0.616128,-0.03102
75,ZSCAN25,221785.0,,0.504202,0.494515,0.48859,0.49046,0.999997,9276,-0.007927,...,-0.037,-0.376,0.098,-0.237,-0.143,-0.164,0.611917,0.008821,0.986175,-0.004549
77,TNRC18,,,0.504202,0.494515,0.78966,0.78968,0.999997,14875,-0.005101,...,-0.699,-0.233,-1.256,-1.005,-1.393,-0.16,0.010632,-0.033607,0.176836,-0.010173
79,CALN1,83698.0,,0.504202,0.494515,0.63701,0.63683,0.999997,12038,-0.007076,...,0.251,-0.229,0.256,-0.25,-0.079,-0.327,0.019361,-0.022015,0.286409,-0.007905
82,C1GALT1,,,0.504202,0.494515,0.76473,0.76453,0.999997,14422,-0.017888,...,0.505,0.464,0.289,0.105,0.536,0.615,0.291431,0.006242,0.40498,-0.001067
91,LUC7L2,51631.0,,0.504202,0.494515,0.36035,0.38215,0.999997,7236,-0.013031,...,-1.269,-0.991,-0.914,-0.946,-2.606,-0.557,0.023538,-0.015354,0.546132,0.018418
94,TARP,,,0.504202,0.494515,0.66453,0.66419,0.999997,12533,-0.007076,...,0.113,0.455,0.045,0.048,0.575,0.65,0.539946,-0.012642,0.443288,-0.009949
95,TMEM120A,83862.0,,0.504202,0.494515,0.82131,0.8212,0.999997,15499,-0.014349,...,0.184,-0.161,-0.322,-0.337,-0.14,-0.157,0.155952,-0.004993,0.736046,0.005924


In [25]:
np.nanmedian(ts_score_cleaned['Brum.HAP1.GTS'])

0.5042016810000001
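
The manual comparison above (NaN rows before, the same rows after, and the column median) could be automated for every score column. This is a hedged sketch; `check_median_imputation` is a hypothetical helper and the data are made up.

```python
import numpy as np
import pandas as pd

def check_median_imputation(before, after, score_cols):
    """Check that only NaN cells changed and that they now hold the column median."""
    for col in score_cols:
        was_nan = before[col].isna()
        # Observed values must be untouched.
        assert before.loc[~was_nan, col].equals(after.loc[~was_nan, col])
        # Former NaN cells must equal the median of the observed values.
        assert np.allclose(after.loc[was_nan, col], before[col].median())
        # No NaN may remain.
        assert not after[col].isna().any()
    return True

# Toy data: hypothetical values, only for illustration.
before = pd.DataFrame({"score": [1.0, np.nan, 3.0, np.nan]})
after = before.fillna(before.median())
ok = check_median_imputation(before, after, ["score"])
print(ok)
```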

In [26]:
ts_score_median_imputation.to_csv(r'../data/work_data.csv')

### K-NN

First, count the NaN values for each gene to decide whether the k-NN method is suitable for the genes that are missing some scores.

In [27]:
def num_nan_row(df):
    copy = df.drop(['Entrez.GeneID', 'ENSEMBL_ID'], axis=1)
    copy.index = copy['Gene']
    copy = copy.drop('Gene', axis = 1)
    for i in copy.columns.tolist():
        copy[i] = np.array(np.isnan(df[i]))
    return copy.sum(axis = 1)

In [28]:
def num_nan_col(df):
    copy = df.drop(['Entrez.GeneID', 'ENSEMBL_ID'], axis=1)
    copy.index = copy['Gene']
    copy = copy.drop('Gene', axis = 1)
    for i in copy.columns.tolist():
        copy[i] = np.array(np.isnan(df[i]))
    return copy.sum(axis = 0)
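
The two helpers above differ only in the axis of the final sum, so both counts can come from one boolean mask. A minimal sketch, assuming the same identifier columns; `nan_counts` and the toy values are hypothetical. Unlike `np.isnan`, `DataFrame.isna` also accepts non-numeric columns.

```python
import numpy as np
import pandas as pd

def nan_counts(df, id_cols=("Entrez.GeneID", "ENSEMBL_ID")):
    """Return (NaN count per gene, NaN count per score column)."""
    scores = df.drop(columns=list(id_cols)).set_index("Gene")
    is_nan = scores.isna()
    return is_nan.sum(axis=1), is_nan.sum(axis=0)

# Toy data: hypothetical values, only for illustration.
toy = pd.DataFrame({
    "Gene": ["A", "B"],
    "Entrez.GeneID": [1.0, 2.0],
    "ENSEMBL_ID": ["E1", "E2"],
    "s1": [np.nan, 0.5],
    "s2": [np.nan, np.nan],
})
by_gene, by_col = nan_counts(toy)
print(by_gene.sort_values(ascending=False))
print(by_col.sort_values(ascending=False))
```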

In [29]:
nan_byrow = num_nan_row(ts_score_cleaned).sort_values(ascending = False)
nan_byrow.head(25)

Gene
TARP        17
ISPD        14
TMEM120A    10
TNRC18       8
C1GALT1      8
SND1         6
CCZ1B        6
ZNF800       6
CARD11       6
LMBR1        6
PHF14        6
AHR          6
ZNF12        6
ZNF862       6
KMT2C        6
NUP205       6
GTF2IRD1     6
CALN1        3
DNAJC30      2
ZSCAN25      2
LUC7L2       2
RABGEF1      2
ZC3HAV1      0
CCM2         0
KRIT1        0
dtype: int64

In [30]:
nan_byrow/len(nan_byrow)

Gene
TARP        0.177083
ISPD        0.145833
TMEM120A    0.104167
TNRC18      0.083333
C1GALT1     0.083333
SND1        0.062500
CCZ1B       0.062500
ZNF800      0.062500
CARD11      0.062500
LMBR1       0.062500
PHF14       0.062500
AHR         0.062500
ZNF12       0.062500
ZNF862      0.062500
KMT2C       0.062500
NUP205      0.062500
GTF2IRD1    0.062500
CALN1       0.031250
DNAJC30     0.020833
ZSCAN25     0.020833
LUC7L2      0.020833
RABGEF1     0.020833
ZC3HAV1     0.000000
CCM2        0.000000
KRIT1       0.000000
BCAP29      0.000000
TES         0.000000
TMEM248     0.000000
BPGM        0.000000
HIP1        0.000000
              ...   
ZNF789      0.000000
PTPN12      0.000000
ZDHHC4      0.000000
IRF5        0.000000
TRIP6       0.000000
SP4         0.000000
JAZF1       0.000000
AP5Z1       0.000000
PLOD3       0.000000
BRAF        0.000000
IGF2BP3     0.000000
PURB        0.000000
ATXN7L1     0.000000
PPIA        0.000000
SYPL1       0.000000
LIMK1       0.000000
SEMA3C  
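
Note that `nan_byrow/len(nan_byrow)` divides each gene's NaN count by the number of genes (96 here, since 17/96 = 0.177083). If the quantity of interest is the share of a gene's scores that are missing, the denominator would be the number of score columns instead. A small sketch of that variant on made-up data (column names are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy data: hypothetical values, only for illustration.
toy = pd.DataFrame({
    "Gene": ["A", "B", "C"],
    "s1": [np.nan, 0.5, 0.1],
    "s2": [np.nan, np.nan, 0.2],
    "s3": [0.3, 0.4, 0.5],
    "s4": [np.nan, 0.7, 0.9],
})
scores = toy.set_index("Gene")
# Mean of the boolean mask along axis=1 = NaN count / number of score columns.
frac_missing_per_gene = scores.isna().mean(axis=1)
print(frac_missing_per_gene.sort_values(ascending=False))
```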

In [31]:
nan_bycol = num_nan_col(ts_score_cleaned).sort_values(ascending = False)
nan_bycol.head(25)

Elle18.HMEC.avg.LFC            16
Elle18.HMEC.combined.pvalue    16
Elle19.HMEC.fdr                16
Elle18.HPNE.avg.LFC            16
Elle18.HPNE.Combined.pvalue    16
Elle19.HPNE.fdr                16
Brum.HAP1.GTS                   9
Brum.KBM7.GTS                   9
Sabb15.Raji.fdr                 3
Sabb15.KBM7.cs                  3
Sabb15.KBM7.fdr                 3
Sabb15.K562.cs                  3
Sabb15.Jiyoye.cs                3
Sabb15.Jiyoye.fdr               3
Sabb15.Raji.cs                  3
Sabb15.K562.fdr                 3
Weis14.csa.avg                  2
Sabb17.HEL                      0
Sabb17.EOL1                     0
Weis16.csa.avg                  0
Chen.cs                         0
Chen.pvalue                     0
Chen.fdr                        0
Chen.ts.rank                    0
Doen.avg.LFC                    0
dtype: int64

In [32]:
nan_bycol/96

Elle18.HMEC.avg.LFC            0.166667
Elle18.HMEC.combined.pvalue    0.166667
Elle19.HMEC.fdr                0.166667
Elle18.HPNE.avg.LFC            0.166667
Elle18.HPNE.Combined.pvalue    0.166667
Elle19.HPNE.fdr                0.166667
Brum.HAP1.GTS                  0.093750
Brum.KBM7.GTS                  0.093750
Sabb15.Raji.fdr                0.031250
Sabb15.KBM7.cs                 0.031250
Sabb15.KBM7.fdr                0.031250
Sabb15.K562.cs                 0.031250
Sabb15.Jiyoye.cs               0.031250
Sabb15.Jiyoye.fdr              0.031250
Sabb15.Raji.cs                 0.031250
Sabb15.K562.fdr                0.031250
Weis14.csa.avg                 0.020833
Sabb17.HEL                     0.000000
Sabb17.EOL1                    0.000000
Weis16.csa.avg                 0.000000
Chen.cs                        0.000000
Chen.pvalue                    0.000000
Chen.fdr                       0.000000
Chen.ts.rank                   0.000000
Doen.avg.LFC                   0.000000


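For genes with many missing scores, such as TARP (17 of 44 score columns missing) and ISPD (14 of 44), the k-NN method might not be very accurate, because fewer observed scores are left for finding similar genes.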
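
To make the k-NN idea concrete, here is a minimal NumPy sketch of one common variant: each missing value is filled with the mean of that column over the `k` nearest genes, where distance is computed only over the scores both genes have observed. The function `knn_impute`, the choice of distance, and the toy matrix are assumptions for illustration, not the notebook's method. In practice the columns would likely need to be put on comparable scales first, and a library implementation such as scikit-learn's `KNNImputer` may be preferable.

```python
import numpy as np

def knn_impute(X, k=3):
    """Fill NaN in X (rows = genes, columns = scores) from the k nearest rows.

    Distance between two rows is the root mean squared difference over the
    columns both rows have observed; rows sharing no observed column are skipped.
    """
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    observed = ~np.isnan(X)
    for i in np.flatnonzero(~observed.all(axis=1)):   # rows with at least one NaN
        dists = np.full(len(X), np.inf)
        for j in range(len(X)):
            shared = observed[i] & observed[j]
            if j != i and shared.any():
                diff = X[i, shared] - X[j, shared]
                dists[j] = np.sqrt(np.mean(diff ** 2))
        order = np.argsort(dists)
        for col in np.flatnonzero(~observed[i]):
            # Nearest rows (finite distance) that have this column observed.
            donors = [j for j in order if np.isfinite(dists[j]) and observed[j, col]][:k]
            if donors:
                filled[i, col] = X[donors, col].mean()
    return filled

# Toy matrix: hypothetical values, only for illustration.
X = np.array([
    [1.0, 2.0, np.nan],
    [1.0, 2.0, 10.0],
    [1.1, 2.1, 12.0],
    [9.0, 9.0, 50.0],
])
filled = knn_impute(X, k=2)
print(filled)
```

With few observed scores, the distance for a gene rests on very few shared columns, which is why genes with many NaN values are poor candidates for this method.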