As you know, the goal of this project has been to start with a large group of genes and progressively whittle it down through a series of criteria, or screens. One of those steps applied a filter at the patient level, but we want to be more inclusive by adding an 'OR' condition that also accounts for single-cell expression.

Here is how this would be done:
1. Take the genes listed in 'COMPARTMENTS >=4  UPDATED GENE NAMES.csv', which can be found here: https://drive.google.com/drive/folders/1RxLl65T5mD7P9jL584zareH6gUBs3t_B . For each of these genes, ask whether it has non-zero expression in at least 10% (>=) of naive tumor cells OR naive atypical ductal cells OR treated tumor cells OR treated atypical ductal cells. Keep the genes that pass for at least one of these four groups.
2. Once you have the whittled-down list from #1, compare it to the genes found in the attached file, '(120520) Post >=4 Surface and TCGA >25 FPKM.csv'.
3. Take the SUPERSET (union) of the genes found in #1 and #2, so any gene that is in either #1 or #2, then remove the duplicates and send this gene list over to me.
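
The three steps above can be sketched on toy data as follows. This is only an illustration of the logic, not the analysis itself: the function name `genes_passing_threshold`, the gene names, and all expression values are hypothetical, and the real naive and treated datasets would each be passed through the function separately before combining the results.

```python
import pandas as pd

def genes_passing_threshold(expr, cell_types, groups, min_fraction=0.10):
    """Return genes with non-zero expression in >= min_fraction of the cells
    of at least one of the given cell-type groups (logical OR across groups)."""
    passing = set()
    for group in groups:
        mask = (cell_types == group).to_numpy()
        if not mask.any():
            continue  # no cells of this type in this dataset
        # Fraction of cells in this group where each gene is non-zero
        frac_nonzero = (expr.loc[mask] != 0).mean(axis=0)
        passing.update(frac_nonzero.index[frac_nonzero >= min_fraction])
    return passing

# Toy data: 10 cells x 3 genes (all values hypothetical).
expr = pd.DataFrame({
    "GENE_A": [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],  # non-zero in 1 of 5 tumor cells
    "GENE_B": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],  # never expressed
    "GENE_C": [0, 0, 0, 0, 0, 2, 3, 0, 0, 0],  # non-zero only in acinar cells
})
cell_types = pd.Series(["tumor"] * 5 + ["acinar"] * 5)

# Step 1: single-cell screen over the groups of interest
step1 = genes_passing_threshold(expr, cell_types, ["tumor", "atypical_ductal"])
# Step 2: stand-in for the patient-level gene list (hypothetical contents)
step2 = {"GENE_B", "GENE_D"}
# Step 3: union of both lists; a set cannot contain duplicates
final_genes = sorted(step1 | step2)
print(final_genes)
```

Using Python sets for the last step means the union and the de-duplication happen in a single operation.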

In [1]:
import pandas as pd
import numpy as np

## Read in columns and rows for single cell

In [2]:
columns_naive = pd.read_csv('../data/columns for X.csv',header = None)
columns_naive = list(columns_naive[0])

rows_naive = pd.read_csv('../data/rows for X.csv',header = None)
cell_type_naive = list(rows_naive[0])
cell_type_naive = [val.lower() for val in cell_type_naive]

In [3]:
rows_treated = pd.read_csv('../data/treated snRNA cell types list.csv',header = None)
cell_type_treated = list(rows_treated[0])
cell_type_treated = [val.lower() for val in cell_type_treated]

columns_treated = pd.read_csv('../data/treated snRNA gene list.csv', header = None)
columns_treated = list(columns_treated[0])

## Read in gene names

In [4]:
gene_list = pd.read_csv('../data/COMPARTMENTS _=4 UDPATED GENE NAMES.csv', header = None)
gene_list = list(gene_list[0])
gene_list[0:10]

['CFTR',
 'RALA',
 'CACNG3',
 'SKAP2',
 'CEACAM7',
 'ITGA3',
 'CD4',
 'TSPAN9',
 'GPRC5A',
 'PSD']

In [5]:
len(cell_type_naive)

88031

In [6]:
np.unique(cell_type_naive)

array(['2591_t', 'acinar', 'alpha', 'atypical_ductal', 'b', 'beta',
       'cd4pos_t', 'cd4pos_tregs', 'cd8pos_t', 'cdc1', 'cdc2',
       'dc_activated', 'delta', 'ductal', 'endothelial', 'fibroblast',
       'gamma', 'macrophages_monocytes', 'mast', 'nascentendothelial',
       'nk', 'pdc', 'plasma', 'schwann', 'smoothmuscle', 'tumor'],
      dtype='<U21')

In [8]:
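# Use a set so each membership test is O(1) instead of scanning the whole list
gene_set = set(gene_list)
naive_gene_col_list = [idx for idx, col in enumerate(columns_naive) if col in gene_set]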

In [10]:
len(naive_gene_col_list)

4425

## Naive Analysis

In [None]:
%%time 

# Row indices to skip: every cell that is NOT of the wanted type
skiprows_naive_tumor = [idx for (idx, row) in enumerate(cell_type_naive) if row != 'tumor']
skiprows_naive_atypical_ductal = [idx for (idx, row) in enumerate(cell_type_naive) if row != 'atypical_ductal']

# Load only the tumor cells (rows) and only the genes of interest (columns)
df_naive_tumor = pd.read_csv('../data/X.csv',
                             names = columns_naive,
                             usecols = naive_gene_col_list,
                             skiprows = skiprows_naive_tumor,
                             engine='c',
                             na_filter=False,
                             dtype=np.float64,
                             low_memory=False)


In [None]:
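# Same read as for the tumor cells, but keeping only atypical ductal cells.
# Reuse the precomputed column indices instead of testing every column name
# against the full gene list.
df_naive_atypical = pd.read_csv('../data/X.csv',
                                names = columns_naive,
                                usecols = naive_gene_col_list,
                                skiprows = skiprows_naive_atypical_ductal,
                                na_filter=False,
                                dtype=np.float64)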

In [None]:
df_naive_tumor.head()
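
The two `read_csv` calls above each scan the full matrix once. As a possible alternative, the sketch below streams the file in chunks and accumulates, per cell type, how many cells have non-zero expression for each gene, so both groups are handled in a single pass. It assumes a headerless cells-by-genes CSV laid out like `X.csv`; the helper name `nonzero_fraction_by_type`, the toy CSV, and the chunk size are hypothetical.

```python
import io
import numpy as np
import pandas as pd

def nonzero_fraction_by_type(csv_source, columns, cell_types, wanted_types,
                             genes, chunksize=10_000):
    """Stream a headerless cells-x-genes CSV and return a DataFrame
    (genes x cell types) with the fraction of cells that have non-zero
    expression for each gene."""
    genes = set(genes)
    labels = np.asarray(cell_types)
    nonzero = {t: 0 for t in wanted_types}  # running non-zero counts per type
    start = 0
    reader = pd.read_csv(csv_source, header=None, names=columns,
                         usecols=lambda c: c in genes,
                         dtype=np.float64, chunksize=chunksize)
    for chunk in reader:
        # Cell-type labels for the rows contained in this chunk
        chunk_labels = labels[start:start + len(chunk)]
        start += len(chunk)
        for t in wanted_types:
            nonzero[t] = nonzero[t] + (chunk[chunk_labels == t] != 0).sum()
    totals = {t: int((labels == t).sum()) for t in wanted_types}
    return pd.DataFrame({t: nonzero[t] / totals[t]
                         for t in wanted_types if totals[t] > 0})

# Toy stand-in for X.csv: 4 cells x 3 genes (values hypothetical).
toy_csv = io.StringIO("1,0,5\n0,0,0\n2,0,0\n0,3,0\n")
toy_columns = ["GENE_A", "GENE_B", "GENE_C"]
toy_cell_types = ["tumor", "tumor", "acinar", "atypical_ductal"]

fractions = nonzero_fraction_by_type(toy_csv, toy_columns, toy_cell_types,
                                     ["tumor", "atypical_ductal"],
                                     genes=["GENE_A", "GENE_B"], chunksize=2)
# Keep genes that reach the 10% threshold in at least one cell type
passing = fractions.index[(fractions >= 0.10).any(axis=1)]
print(fractions)
print(list(passing))
```

Only running counts are kept in memory, which may matter for a matrix with 88031 rows, though whether it is faster than the `skiprows` approach would need to be timed on the real file.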

In [None]:
2+2
