## DepMap:

Here we explore the gene_effect_corrected.csv file from DepMap (https://ndownloader.figshare.com/files/14221385).
A gene is treated as essential for cell viability in a cell line when its effect size there is below the threshold of -0.5 (see https://depmap.org/portal/faq/#dep_thresholds).

In [None]:
import csv
import wget
import pandas as pd
import itertools
import matplotlib.pyplot as plt
import os



## Load the data
The “gene effect” file contains the corrected CERES scores, which measure the effect size of knocking out a gene, normalized against the distributions of non-essential and pan-essential genes. Columns are genes, rows are cell lines.
A more negative CERES score means that cells of that cell line are more strongly depleted when the gene is knocked out, which suggests that the gene is essential for cell viability in that cell line.

For simplicity, we call a gene "effective" in a cell line if its effect size in that cell line is below -0.5.
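
As a small illustration of this definition, the sketch below applies the cutoff to a made-up table of scores. The gene names, cell line labels, and values are invented for the example and are not taken from the DepMap file.

```python
import pandas as pd

THRESHOLD = -0.5  # same cutoff as in the text

# Hypothetical toy data: rows are cell lines, columns are genes.
toy = pd.DataFrame(
    {
        "GENE_A": [-1.2, -0.8, -0.6],
        "GENE_B": [0.1, -0.7, 0.0],
        "GENE_C": [0.2, 0.1, -0.1],
    },
    index=["line_1", "line_2", "line_3"],
)

# True where a gene is "effective" in a cell line (score strictly below the cutoff).
is_effective = toy < THRESHOLD
print(is_effective)
```

In this toy table, GENE_A is effective in every cell line, GENE_B in one, and GENE_C in none.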

In [None]:
url = 'https://ndownloader.figshare.com/files/14221385'
filename = '/Users/ravanv/PycharmProjects/IDGKG-gen/data/gene_effect_corrected.csv' # path to the file

if not os.path.exists(filename):
    filename = wget.download(url)

In [None]:
df = pd.read_csv(filename)
df.head()

In [None]:
columns = df.columns
print("number of genes: {}".format(df.shape[1] - 1))  # the first column holds the cell line identifier
print("number of cell lines: {}".format(df.shape[0]))


## Statistics
For each gene, we count the cell lines in which its effect size is below -0.5.
We then divide this count by the total number of cell lines to get the proportion of
cell lines in which the gene is effective.
We also count the genes that have at least one effect size below -0.5,
and the genes whose effect size is below the threshold in all cell lines.
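
The same statistics can also be computed without an explicit loop. The following is a minimal sketch on an invented table laid out like the file (identifier column first, then one column per gene); the column names and values are assumptions made for the example.

```python
import numpy as np
import pandas as pd

THRESHOLD = -0.5

# Hypothetical toy table: the first column is the cell line ID.
toy = pd.DataFrame({
    "cell_line": ["line_1", "line_2", "line_3", "line_4"],
    "GENE_A": [-1.0, -0.9, -0.6, -0.7],   # below the cutoff everywhere
    "GENE_B": [0.1, -0.8, np.nan, 0.0],   # below the cutoff in one cell line
    "GENE_C": [0.3, 0.2, 0.1, 0.0],       # never below the cutoff
})

scores = toy.iloc[:, 1:]                  # drop the ID column
below = scores < THRESHOLD                # a missing value compares as False
proportion = below.sum() / len(toy)       # per-gene share of all cell lines

n_any = int((proportion > 0).sum())       # genes effective in at least one cell line
n_all = int((proportion == 1.0).sum())    # genes effective in every cell line
print(proportion)
print(n_any, n_all)
```

Dividing by the total number of rows, rather than by the number of non-missing values, mirrors the loop-based calculation, where a missing score simply does not count as below the threshold.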

In [None]:
threshold = -0.5
effective_genes = []  # column indices of effective genes
proportion_effective_genes_in_cell_lines = []  # proportion of cell lines in which each effective gene is below threshold
number_effective_genes = 0  # number of genes with at least one effect size below threshold
number_effective_genes_in_all_cell_lines = 0  # number of genes whose effect sizes are below threshold in all cell lines
for i in range(1, df.shape[1]):  # column 0 holds the cell line identifier
    col = columns[i]
    df_col = df.loc[:, col]  # Series of effect sizes for one gene (column)
    gene_effect_df = df_col.loc[df_col < threshold]  # effect sizes of this gene that are below threshold
    proportion_effective_gene_in_cell_lines = gene_effect_df.shape[0] / df.shape[0]
    if proportion_effective_gene_in_cell_lines > 0:
        number_effective_genes += 1
        effective_genes.append(i)
        proportion_effective_genes_in_cell_lines.append(proportion_effective_gene_in_cell_lines)
    if proportion_effective_gene_in_cell_lines == 1.0:
        number_effective_genes_in_all_cell_lines += 1

print("total number of genes: {}".format(df.shape[1] - 1))  # exclude the cell line identifier column
print("number of effective genes (genes that have an effect size below -0.5 in at least one cell line): {}"
      .format(number_effective_genes))
print("number of genes that are effective in ALL cell lines (have effect size below -0.5 in all cell lines): {}".format(number_effective_genes_in_all_cell_lines))

## Data Visualization
For each gene, we plot the proportion of cell lines in which the gene is effective (has an effect size below the threshold). Here, we only plot the proportion for the first 3000 effective genes, 300 genes per figure.
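
Because Python slices are half-open, consecutive panels have to share their boundary index (`[0:300]`, `[300:600]`, and so on) or one gene per panel is silently skipped. The sketch below generates such boundaries; `chunk_bounds` is a hypothetical helper introduced only for this illustration.

```python
def chunk_bounds(n_items, chunk_size):
    """Yield (start, stop) pairs that cover range(n_items) with no gaps or overlaps."""
    for start in range(0, n_items, chunk_size):
        yield start, min(start + chunk_size, n_items)


# Boundaries for 3000 genes plotted 300 at a time.
bounds = list(chunk_bounds(3000, 300))
print(bounds[:3])
```

Each `(start, stop)` pair could then be used to slice both lists passed to `plt.bar`, for example `effective_genes[start:stop]`.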

In [None]:
f, ax = plt.subplots(figsize=(18,5)) # set the size that you'd like (width, height)
plt.bar(effective_genes[0:300],proportion_effective_genes_in_cell_lines[0:300],width=4)
plt.xlabel('gene index')
plt.ylabel('proportion')
plt.title('proportion of cell lines in which each gene has an effect size below the threshold (-0.5).')

In [None]:
f, ax = plt.subplots(figsize=(18,5)) # set the size that you'd like (width, height)
plt.bar(effective_genes[300:600], proportion_effective_genes_in_cell_lines[300:600], width=4)  # start at 300 so no gene is skipped
plt.xlabel('gene index')
plt.ylabel('proportion')
plt.title('proportion of cell lines in which each gene has an effect size below the threshold (-0.5).')

In [None]:
f, ax = plt.subplots(figsize=(18,5)) # set the size that you'd like (width, height)
plt.bar(effective_genes[600:900], proportion_effective_genes_in_cell_lines[600:900], width=4)  # start at 600 so no gene is skipped
plt.xlabel('gene index')
plt.ylabel('proportion')
plt.title('proportion of cell lines in which each gene has an effect size below the threshold (-0.5).')

In [None]:
f, ax = plt.subplots(figsize=(18,5)) # set the size that you'd like (width, height)
plt.bar(effective_genes[900:1200], proportion_effective_genes_in_cell_lines[900:1200], width=4)  # start at 900 so no gene is skipped
plt.xlabel('gene index')
plt.ylabel('proportion')
plt.title('proportion of cell lines in which each gene has an effect size below the threshold (-0.5).')

In [None]:
f, ax = plt.subplots(figsize=(18,5)) # set the size that you'd like (width, height)
plt.bar(effective_genes[1200:1500], proportion_effective_genes_in_cell_lines[1200:1500], width=4)  # start at 1200 so no gene is skipped
plt.xlabel('gene index')
plt.ylabel('proportion')
plt.title('proportion of cell lines in which each gene has an effect size below the threshold (-0.5).')

In [None]:
f, ax = plt.subplots(figsize=(18,5)) # set the size that you'd like (width, height)
plt.bar(effective_genes[1500:1800], proportion_effective_genes_in_cell_lines[1500:1800], width=4)  # start at 1500 so no gene is skipped
plt.xlabel('gene index')
plt.ylabel('proportion')
plt.title('proportion of cell lines in which each gene has an effect size below the threshold (-0.5).')

In [None]:
f, ax = plt.subplots(figsize=(18,5)) # set the size that you'd like (width, height)
plt.bar(effective_genes[1800:2100], proportion_effective_genes_in_cell_lines[1800:2100], width=4)  # start at 1800 so no gene is skipped
plt.xlabel('gene index')
plt.ylabel('proportion')
plt.title('proportion of cell lines in which each gene has an effect size below the threshold (-0.5).')

In [None]:
f, ax = plt.subplots(figsize=(18,5)) # set the size that you'd like (width, height)
plt.bar(effective_genes[2100:2400], proportion_effective_genes_in_cell_lines[2100:2400], width=4)  # start at 2100 so no gene is skipped
plt.xlabel('gene index')
plt.ylabel('proportion')
plt.title('proportion of cell lines in which each gene has an effect size below the threshold (-0.5).')

In [None]:
f, ax = plt.subplots(figsize=(18,5)) # set the size that you'd like (width, height)
plt.bar(effective_genes[2400:2700], proportion_effective_genes_in_cell_lines[2400:2700], width=4)  # start at 2400 so no gene is skipped
plt.xlabel('gene index')
plt.ylabel('proportion')
plt.title('proportion of cell lines in which each gene has an effect size below the threshold (-0.5).')

In [None]:
f, ax = plt.subplots(figsize=(18,5)) # set the size that you'd like (width, height)
plt.bar(effective_genes[2700:3000], proportion_effective_genes_in_cell_lines[2700:3000], width=4)  # start at 2700 so no gene is skipped
plt.xlabel('gene index')
plt.ylabel('proportion')
plt.title('proportion of cell lines in which each gene has an effect size below the threshold (-0.5).')

## More on Statistics
For each cell line, we find the number of genes that have an effect size below -0.5.

First, we transpose the data so that genes are rows and cell lines are columns.
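
One detail worth keeping in mind: transposing a table that still contains the identifier strings gives columns of `object` dtype, because each transposed column then mixes strings and numbers. A possible alternative, sketched here on invented data, is to move the identifiers into the index before transposing. The column names and values are assumptions made for the example.

```python
import pandas as pd

# Hypothetical toy table: the first column holds cell line IDs, the rest are gene scores.
toy = pd.DataFrame({
    "cell_line": ["line_1", "line_2"],
    "GENE_A": [-1.0, 0.2],
    "GENE_B": [-0.3, -0.9],
})

id_column = toy.columns[0]
# With the IDs in the index, every remaining value is numeric, so the transpose stays float.
toy_tr = toy.set_index(id_column).T
print(toy_tr)
print(toy_tr.dtypes)
```

The result has genes as rows and cell line IDs as column labels, with numeric columns that compare quickly against the threshold.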

In [None]:
df_tr = df.T
new_header = df_tr.iloc[0] #grab the first row for the header
df_tr = df_tr[1:] #take the data less the header row
df_tr.columns = new_header 
df_tr.head()

In [None]:
threshold = -0.5
cell_lines = []  # indices of cell lines in which at least one gene has an effect size below threshold
proportion = []  # for each such cell line, the number of effective genes divided by the total number of genes
number_effective_genes_in_cell_line = []  # for each such cell line, the number of effective genes
number_cell_lines = 0  # number of cell lines in which at least one gene has an effect size below threshold
number_cell_lines_all_genes_effective = 0  # number of cell lines in which all genes are effective (effect size below threshold)

for i in range(df_tr.shape[1]):
    col = df_tr.columns[i]
    df_tr_col = df_tr.loc[:,col]
    cell_line_df = df_tr_col.loc[df_tr.loc[:,col] < threshold]
    n_effective_genes_here = cell_line_df.shape[0]  # separate name so the earlier gene-level count is not overwritten
    proportion_effective_genes_in_a_cell_line = n_effective_genes_here / df_tr.shape[0]
    if proportion_effective_genes_in_a_cell_line > 0:
        number_cell_lines += 1
        cell_lines.append(i)
        proportion.append(proportion_effective_genes_in_a_cell_line)
        number_effective_genes_in_cell_line.append(n_effective_genes_here)
    if proportion_effective_genes_in_a_cell_line == 1.0:
        number_cell_lines_all_genes_effective += 1

print("total number of cell lines: {}".format(df_tr.shape[1]))        
print("number of cell lines with at least one gene whose effect size is below -0.5: {}"
      .format(number_cell_lines))
print("number of cell lines in which all genes are effective: {}".format(number_cell_lines_all_genes_effective))

## Data Visualization
We plot the number of genes that are effective in each cell line. We call a gene effective if its effect
size is below -0.5.
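
For reference, the per-cell-line counts can also be obtained directly from a row-wise sum, without transposing. This is a minimal sketch on invented data; the column names and values are assumptions made for the example.

```python
import pandas as pd

THRESHOLD = -0.5

# Hypothetical toy table: the first column is the cell line ID.
toy = pd.DataFrame({
    "cell_line": ["line_1", "line_2", "line_3"],
    "GENE_A": [-1.0, -0.9, 0.1],
    "GENE_B": [-0.7, 0.0, 0.2],
    "GENE_C": [-0.6, 0.3, 0.4],
})

scores = toy.set_index("cell_line")
# Sum the boolean mask across columns to count effective genes in each cell line.
counts = (scores < THRESHOLD).sum(axis=1)
print(counts)
```

Here `line_1` has three effective genes, `line_2` has one, and `line_3` has none.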

In [None]:
f, ax = plt.subplots(figsize=(18,5)) # set the size that you'd like (width, height)
plt.bar(cell_lines[0:150],number_effective_genes_in_cell_line[0:150],  width=0.2)
plt.xlabel('cell line index')
plt.ylabel('count')
plt.title('number of genes that are effective in each cell line.')

In [None]:
f, ax = plt.subplots(figsize=(18,5)) # set the size that you'd like (width, height)
plt.bar(cell_lines[150:300],number_effective_genes_in_cell_line[150:300],  width=0.3)
plt.xlabel('cell line index')
plt.ylabel('count')
plt.title('number of genes that are effective in each cell line.')

In [None]:
f, ax = plt.subplots(figsize=(18,5)) # set the size that you'd like (width, height)
plt.bar(cell_lines[300:450],number_effective_genes_in_cell_line[300:450],  width=0.3)
plt.xlabel('cell line index')
plt.ylabel('count')
plt.title('number of genes that are effective in each cell line.')

In [None]:
f, ax = plt.subplots(figsize=(18,5)) # set the size that you'd like (width, height)
plt.bar(cell_lines[450:559],number_effective_genes_in_cell_line[450:559],  width=0.3)
plt.xlabel('cell line index')
plt.ylabel('count')
plt.title('number of genes that are effective in each cell line.')