## Compiling CSVs from CRISPR screen data 
### Jacklyn Luu

The goal of this notebook is to find all the CRISPR screen results CSV files and add the author and virus names to each one as separate columns. The data come from a Google Drive folder that contains various experimental results, including results from CRISPR screens.

In [1]:
# Set up environment
import pandas as pd
import os
from pathlib import Path   # Lets us split file paths into their directory parts

In [12]:
home_dir = '/Users/jacklyn.luu/Desktop/DS-Infected Cell'
os.chdir(home_dir)
os.getcwd()

'/Users/jacklyn.luu/Desktop/DS-Infected Cell'

In [15]:
# Find all CRISPR screen csv files in the folder downloaded from GDrive
papers_meta = pd.read_csv('CRISPR screen datasets meta.csv')
data_dir = os.path.join(home_dir, 'Data from papers')

for subdir, dirs, files in os.walk(data_dir):
    for filename in files:        
        filepath = os.path.join(subdir, filename)
        if filepath.endswith(".csv") and (('CRISPR' in filepath) or ('screen' in filepath)):            
            data = pd.read_csv(filepath)
            
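            # Add the virus and author names
            # The virus name is embedded in the filepath: parts[6] is the folder directly
            # under data_dir for this home_dir, so the index depends on the absolute path depth
            virus_acr = Path(filepath).parts[6]
            data['Virus'] = virus_acr
            # .item() raises ValueError unless exactly one metadata row matches
            data['Author'] = papers_meta.loc[papers_meta['Virus Acronym'] == virus_acr, 'Author'].item()

            # Overwrite the file. to_csv also writes the index by default, which is read
            # back as an 'Unnamed: 0' column; passing index=False would avoid that
            data.to_csv(filepath, mode='w+')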
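
The cell above depends on the absolute path depth (`parts[6]`) and rewrites each file with its index included. As a minimal sketch of an alternative, not the method used in this notebook, the function below walks the folder with `Path.rglob`, reads the folder name relative to `data_dir`, and writes with `index=False`. The function name `annotate_screen_csvs` and the skip-and-report behavior are hypothetical. It assumes, as the cell above does, that the top-level folder name matches a value in the `Virus Acronym` column of the metadata.

```python
from pathlib import Path

import pandas as pd


def annotate_screen_csvs(data_dir, papers_meta):
    """Add 'Virus' and 'Author' columns to each CRISPR screen CSV under data_dir.

    Hypothetical helper. Returns the list of files that were rewritten.
    """
    data_dir = Path(data_dir)
    rewritten = []
    for path in sorted(data_dir.rglob("*.csv")):
        rel = path.relative_to(data_dir)
        # Same keyword filter as the notebook cell, applied to the path below data_dir
        if "CRISPR" not in str(rel) and "screen" not in str(rel):
            continue
        if len(rel.parts) < 2:
            continue  # file sits directly in data_dir, so there is no folder name to use
        # Top-level folder under data_dir, independent of where data_dir lives on disk
        virus_acr = rel.parts[0]
        authors = papers_meta.loc[papers_meta["Virus Acronym"] == virus_acr, "Author"]
        if len(authors) != 1:
            # Report and skip instead of raising when the metadata match is ambiguous
            print(f"Skipping {rel}: {len(authors)} metadata rows match {virus_acr!r}")
            continue
        data = pd.read_csv(path)
        data["Virus"] = virus_acr
        data["Author"] = authors.iloc[0]
        # index=False keeps an 'Unnamed: 0' column from appearing on the next read
        data.to_csv(path, index=False)
        rewritten.append(path)
    return rewritten
```

Under these assumptions it could be called as `annotate_screen_csvs(data_dir, papers_meta)`. Because the index is not written, running it a second time leaves the column set unchanged.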

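Printing a frame shows whether the new columns are there, but a programmatic check is easier to repeat across many files. The sketch below is one way such a check might look. The helper name `check_annotation` and its `expected_virus` parameter are hypothetical and do not appear elsewhere in this notebook.

```python
import pandas as pd


def check_annotation(df, expected_virus=None):
    """Return a list of problems found in an annotated screen table (empty if none)."""
    problems = []
    for col in ("Virus", "Author"):
        if col not in df.columns:
            problems.append(f"missing column: {col}")
        elif df[col].nunique(dropna=False) != 1:
            # Every row of one file should carry the same virus and author
            problems.append(f"column {col} does not hold a single value")
    if expected_virus is not None and "Virus" in df.columns:
        if not (df["Virus"] == expected_virus).all():
            problems.append(f"Virus is not {expected_virus!r} on every row")
    # Columns named 'Unnamed: N' are indexes that an earlier to_csv call wrote out
    leftover = [c for c in df.columns if str(c).startswith("Unnamed:")]
    if leftover:
        problems.append(f"index columns written by to_csv: {leftover}")
    return problems
```

A call such as `check_annotation(pd.read_csv(some_path))` returns an empty list when both columns are present with a single value each and no stray index column was saved.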
In [5]:
# Test if it worked
test_dir = os.path.join(data_dir, 'HCoV-229E - Schneider')
os.chdir(test_dir)
test_df = pd.read_csv('HCoV-229E - CRISPR screen.csv')
print(test_df)

       Unnamed: 0     Gene       p_value   z_score           fdr     Sig  \
0               0    ANPEP  1.570000e-53  9.056129  3.030000e-49     Sig   
1               1  TMEM41B  8.720000e-21  5.611329  1.690000e-16     Sig   
2               2     RTCB  3.010000e-15  4.794417  5.820000e-11     Sig   
3               3    NRIP1  3.030000e-14  4.629608  5.870000e-10     Sig   
4               4   GNPTAB  1.420000e-13  4.481975  2.760000e-09     Sig   
...           ...      ...           ...       ...           ...     ...   
19359       19359   ZYG11A  2.370285e-01 -0.641426  1.000000e+00  NotSig   
19360       19360   ZYG11B  5.901085e-01 -0.726182  1.000000e+00  NotSig   
19361       19361      ZYX  8.904381e-01 -0.072607  1.000000e+00  NotSig   
19362       19362    ZZEF1  3.116640e-01  0.284200  1.000000e+00  NotSig   
19363       19363     ZZZ3  7.512255e-01  0.311387  1.000000e+00  NotSig   

       PlottingIndex  AlphaIndex                   Sample  \
0               4636      