# Systems Immunogenetics Project

## WNV Flow Cytometry Data Cleaning Workflow

### McWeeney Lab, Oregon Health & Science University

#### Author: Michael Mooney (mooneymi@ohsu.edu)

## Introduction

This document will walk through...

Required Files:

- This notebook (`SIG_WNV_Flow_Data_Cleaning.ipynb`): [Download by clicking the button at the top right of the screen]
- The R script `flow_data_cleaning_functions.r`: [[Download R script here]](https://raw.githubusercontent.com/mooneymi/jupyter_notebooks/master/r/SIG/flow_data_cleaning_functions.r)
- The text file containing all flow variables (`final_flow.txt`): [[Download here]]()

**Note:** This notebook can also be downloaded as a plain R script (only the code blocks shown below are included): [[Download R script here]]()

If you are not familiar with Jupyter Notebooks, I've created a short tutorial to get you up and running quickly. There is also plenty of documentation online:

1. [Jupyter for R Tutorial]()
2. [Jupyter Documentation](http://jupyter.org/)
3. [Conda and R](https://www.continuum.io/conda-for-r)

#### Up Next: Plotting Flow Cytometry Data

After finishing this workflow you will have a cleaned dataset ready for exploration and analysis. A notebook with code examples for plotting the data (including interactive plots created with the Shiny library) is available: [Plotting Flow Cytometry Data]() 

## Step 1. Prepare the Input Files

1. Remove special characters

## Step 2. Load the Necessary R Libraries and Functions

The accompanying R script (`flow_data_cleaning_functions.r`) defines the functions needed to parse and process the flow cytometry data:

1. `read_flow_exp_file()`: Parses a flow spreadsheet and creates an R dataframe.
2. `fix_column_names()`: Standardizes the column names of the above dataframe.
3. Functions for calculating cell counts and ratios:
    - `calc_treg_counts()`
    - `calc_tcell_counts()`
    - `calc_ics_counts()`
    - `calc_ics_percent_ratios()`
    - `calc_ics_count_ratios()`
4. `clean_inf_nan()`: Sets any infinite or NaN values to NA.

More information on each of these functions is available by calling the `describe()` function. For example, the following command will print documentation for the `read_flow_exp_file()` function:

    describe(read_flow_exp_file)

In [2]:
## Load functions for parsing the flow cytometry spreadsheets
## The script also loads the gdata library, which is needed to read Excel spreadsheets.
source('flow_data_cleaning_functions.r')

In [3]:
## View help documentation
describe(read_flow_exp_file)


This function parses flow cytometry data from an Excel workbook.

Parameters
f: The Excel file name.
cn_expected: A character vector containing the expected column names.

Returns
A dataframe containing the processed flow cytometry data.



## Step 3. Read the Data into R

In [None]:
## First move to the directory holding the data
flow_dir = "~/Documents/MyDocuments/SystemsImmunogenetics/WNV/Lund_Flow_fixed_Nov_13"
setwd(flow_dir)

In [None]:
## Get a list of data files to read
flow_files = list.files('.', pattern="Expt.*\\.xls")

## Load all expected flow variables (expected column names)
## TRUE/FALSE are spelled out because T and F are ordinary variables that can be reassigned
flow_cn = read.delim('./data/final_flow.txt', sep='\t', as.is=TRUE, header=FALSE)
flow_cn = flow_cn[,1]
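
`list.files()` treats `pattern` as a regular expression that may match anywhere in a file name, so it can help to try the pattern on a few names before reading any data. The sketch below uses made-up file names (none are from the project) to show what `"Expt.*\\.xls"` matches. The anchored variant is only a possible tightening, not something the workflow requires.

```r
## Hypothetical file names, used only to illustrate the pattern
example_files = c("Expt 12 WNV.xls", "Expt_3.xlsx", "Old Expt 7.xls",
                  "notes.txt", "Expt5.csv")

## grepl() applies the same kind of regular-expression match to each name
matches = grepl("Expt.*\\.xls", example_files)
example_files[matches]

## Anchoring the pattern keeps only names that start with "Expt"
## and end in ".xls"
strict = grepl("^Expt.*\\.xls$", example_files)
example_files[strict]
```

The unanchored pattern also accepts `.xlsx` files and names in which "Expt" appears mid-string; whether that is desirable depends on how the spreadsheets are named.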

In [None]:
## Iterate through all the files, parse each, and merge all data into a single dataframe
i = 1
for (file in flow_files) {
    print(file)
    flow_dat = read_flow_exp_file(file, flow_cn)
    
    ## Check if there are any unexpected columns
    new_columns = setdiff(colnames(flow_dat), flow_cn)
    if (length(new_columns) > 0) {
        flow_cn = c(flow_cn, new_columns)
    }
    if (i > 1) {
        ## Fill extra columns with NAs
        for (col in new_columns) {
            flow_all[,col] = NA
        }
        ## Merge data
        flow_all = rbind(flow_all[,flow_cn], flow_dat[,flow_cn])
    } else {
        flow_all = flow_dat
    }
    i = i + 1
}
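
The merge strategy in the loop above (append any unexpected column to the list of expected names, add it to the accumulated data as `NA`, then `rbind()`) can be followed on two small data frames. This is a minimal sketch: the column names and values are invented, and the toy "batches" stand in for the output of `read_flow_exp_file()`.

```r
## Toy data: the second "file" has a column (B_cells) the first one lacks
batch1 = data.frame(ID = c("a1", "a2"), CD4 = c(10, 12), stringsAsFactors = FALSE)
batch2 = data.frame(ID = "b1", CD4 = 9, B_cells = 4, stringsAsFactors = FALSE)

expected = c("ID", "CD4")
combined = batch1

## Columns in the new batch that are not yet in the expected list
new_columns = setdiff(colnames(batch2), expected)
expected = c(expected, new_columns)

## Earlier rows have no values for the new columns, so fill them with NA
for (col in new_columns) {
    combined[, col] = NA
}

## rbind() needs matching column names, so select them in the same order
combined = rbind(combined[, expected], batch2[, expected])
combined
```

Rows from the first batch end up with `NA` in `B_cells`, while the row from the second batch keeps its value.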

In [None]:
## Check that all expected columns are present (both calls should return character(0))
setdiff(flow_cn, colnames(flow_all))
setdiff(colnames(flow_all), flow_cn)
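
If this check needs to run unattended, the two `setdiff()` calls can be wrapped so that a mismatch stops the script. The helper below is a sketch only: `check_columns()` is a hypothetical name, not a function from `flow_data_cleaning_functions.r`, and the toy column names are invented.

```r
## Hypothetical helper: stop with an informative message if the column
## names of a data frame differ from the expected names
check_columns = function(dat, expected) {
    missing_cols = setdiff(expected, colnames(dat))
    extra_cols = setdiff(colnames(dat), expected)
    if (length(missing_cols) > 0) {
        stop("Missing columns: ", paste(missing_cols, collapse = ", "))
    }
    if (length(extra_cols) > 0) {
        stop("Unexpected columns: ", paste(extra_cols, collapse = ", "))
    }
    invisible(TRUE)
}

## Toy example with invented column names
toy = data.frame(ID = 1:2, CD4 = c(10, 12))
check_columns(toy, c("ID", "CD4"))
```

With the real data, the equivalent call would be `check_columns(flow_all, flow_cn)`.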

## Step 4. Clean and Reformat the Data

In [None]:
## Order columns, add Lab column and fix formatting
flow_all = flow_all[, flow_cn]
flow_all$Lab = "Lund"

flow_all$ID = gsub(" ", "", flow_all$ID)
flow_all$ID = gsub("X", "x", flow_all$ID)
flow_all$Mating = gsub(" ", "", flow_all$Mating)
flow_all$Mating = gsub("X", "x", flow_all$Mating) 
flow_all$UW_Line = as.numeric(flow_all$UW_Line)
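
To see what these substitutions do, the sketch below applies them to a few invented ID strings (the real ID format may differ). Note that `gsub("X", "x", ...)` replaces every capital X in a value, and that `as.numeric()` silently produces `NA` (with a warning) for anything that is not a number.

```r
## Invented examples of inconsistently formatted IDs
raw_ids = c("16188 X 3154_12", "8042x16513_3", " 3032 X 16188_7")

## Remove all spaces, then lower-case the "X" separator
clean_ids = gsub(" ", "", raw_ids)
clean_ids = gsub("X", "x", clean_ids)
clean_ids

## as.numeric() returns NA for values that are not numbers, so counting
## NAs afterwards flags entries that need attention
raw_lines = c("12", "7", "n/a")
line_numbers = suppressWarnings(as.numeric(raw_lines))
sum(is.na(line_numbers))
```

Comparing the number of `NA` values in `UW_Line` before and after the conversion is one way to confirm that no line numbers were lost.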

In [None]:
## Check for duplicate IDs
new_flow_ids = paste(flow_all$ID, flow_all$Tissue, sep='_')
sum(duplicated(new_flow_ids))
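
A non-zero count only says that duplicates exist. One way to see which rows are involved is sketched below on invented ID and tissue values.

```r
## Invented ID/tissue combinations; the first and third rows collide
toy = data.frame(ID = c("m1", "m2", "m1", "m1"),
                 Tissue = c("spleen", "spleen", "spleen", "brain"),
                 stringsAsFactors = FALSE)
toy_ids = paste(toy$ID, toy$Tissue, sep = '_')

## duplicated() marks only the second and later occurrences, so look up
## every row that shares a duplicated key
dup_keys = unique(toy_ids[duplicated(toy_ids)])
toy[toy_ids %in% dup_keys, ]
```

The same two lines, applied to `new_flow_ids` and `flow_all`, would list every row that shares an ID and tissue with another row.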

## Step 5. Calculate Cell Counts and Ratios

In [None]:
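## Change all data columns to numeric
## (columns 11 through 277 are the data columns here; adjust the range if the column layout changes)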
for (i in 11:277) {
    flow_all[,i] = as.numeric(flow_all[,i])
}
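
Two things can go wrong in a conversion like this: a column that was read in as a factor is converted to its integer level codes rather than its values, and fixed column positions stop being correct if the layout changes. The sketch below, on an invented data frame, converts through `as.character()` and selects columns by name. It illustrates the general technique and is not a change the workflow requires.

```r
## Toy data frame: one metadata column and two measurement columns,
## one of which was read in as a factor
toy = data.frame(ID = c("m1", "m2", "m3"),
                 CD4_count = factor(c("150", "20", "3")),
                 CD8_count = c("7", "n/a", "12"),
                 stringsAsFactors = FALSE)

## Applying as.numeric() directly to a factor returns its level codes
as.numeric(toy$CD4_count)

## Converting through character recovers the original values
data_cols = setdiff(colnames(toy), "ID")
for (col in data_cols) {
    toy[, col] = suppressWarnings(as.numeric(as.character(toy[, col])))
}
toy
```

Entries that cannot be parsed as numbers, such as `"n/a"`, become `NA`.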

In [None]:
## Calculate cell counts and ratios
flow_full = flow_all
flow_full = calc_treg_counts(flow_full)
flow_full = calc_tcell_counts(flow_full)
flow_full = calc_ics_counts(flow_full)
flow_full = calc_ics_percent_ratios(flow_full)
flow_full = calc_ics_count_ratios(flow_full)
flow_full = clean_inf_nan(flow_full)
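
The final cleanup step matters because ratios with a zero denominator evaluate to `Inf` or `NaN` in R. The sketch below shows the kind of replacement that `clean_inf_nan()` is described as performing. `replace_inf_nan()` is a hypothetical stand-in written for illustration, and the actual function in `flow_data_cleaning_functions.r` may be implemented differently.

```r
## Hypothetical stand-in for the cleanup step; the real clean_inf_nan()
## may be implemented differently
replace_inf_nan = function(dat) {
    for (col in colnames(dat)) {
        if (is.numeric(dat[, col])) {
            ## is.finite() is FALSE for Inf, -Inf, NaN and NA
            dat[!is.finite(dat[, col]), col] = NA
        }
    }
    dat
}

## Invented counts: a zero denominator gives Inf (5/0) or NaN (0/0)
toy = data.frame(cd4 = c(5, 0, 8), cd8 = c(0, 0, 4))
toy$ratio = toy$cd4 / toy$cd8
toy$ratio

replace_inf_nan(toy)$ratio
```

Only the non-finite entries are replaced; finite values, including zeros, are left unchanged.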

## Step 6. Save the Data

In [None]:
## Save R data file
save(flow_full, file='lund_flow_full_11-jan-2016_final.rda')
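
A file written with `save()` is restored with `load()`, which recreates the object under its original name. The sketch below checks the round trip on a toy data frame written to a temporary file; the object name and values are invented.

```r
## Save a toy object to a temporary file
toy_flow = data.frame(ID = c("m1", "m2"), CD4 = c(10, 12))
rda_file = tempfile(fileext = ".rda")
save(toy_flow, file = rda_file)

## load() returns the names of the restored objects; loading into a new
## environment avoids overwriting anything in the workspace
check_env = new.env()
loaded = load(rda_file, envir = check_env)
loaded
identical(check_env$toy_flow, toy_flow)
```

Loading the saved flow data the same way would restore an object named `flow_full`.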

### Up Next: Plotting Flow Cytometry Data

Code for plotting this data is available here: [[Flow Data Plotting Workflow]]()