# The RUV Package 

"ruv" on CRAN

# RUV Methods

<table style="width:60%">
  <tr style="background-color:#FFFFFF">
    <th style="text-align:left;font-size:18px;border:0 none">Regression Methods</th>
    <th style="text-align:left;font-size:18px;border:0 none">Global Adjustments</th>
  </tr>    
  <tr style="background-color:#FFFFFF">
    <td style="text-align:left;font-size:18px;font-family:monospace;border:0 none">RUV2</td>
    <td style="text-align:left;font-size:18px;font-family:monospace;border:0 none">RUVI</td>
  </tr>
  <tr style="background-color:#FFFFFF">
    <td style="text-align:left;font-size:18px;font-family:monospace;border:0 none">RUV4</td>
    <td style="text-align:left;font-size:18px;font-family:monospace;border:0 none">RUVIII</td>
  </tr>
  <tr style="background-color:#FFFFFF">
    <td style="text-align:left;font-size:18px;font-family:monospace;border:0 none">RUVinv</td>
    <td style="text-align:left;font-size:18px;font-family:monospace;border:0 none">  </td>
  </tr>
  <tr style="background-color:#FFFFFF">
    <td style="text-align:left;font-size:18px;font-family:monospace;border:0 none">RUVrinv</td>
    <td style="text-align:left;font-size:18px;font-family:monospace;border:0 none">  </td>
  </tr>
</table> 



# Regression Methods

Common syntax:

```R
RUV2    (Y, X, ctl, k, Z = 1, eta = NULL             ) 
RUV4    (Y, X, ctl, k, Z = 1, eta = NULL             ) 
RUVinv  (Y, X, ctl,    Z = 1, eta = NULL             ) 
RUVrinv (Y, X, ctl,    Z = 1, eta = NULL, lambda=NULL) 
```

# Regression Methods

Function Arguments:

| Argument | Meaning                 | Example    | Data Type                             | Notes                       |
| ----     | ----------------------- | ---------  | ------------------------------------  | --------------------------- |
| Y        | Expression data         |            | matrix                                | row = sample, column = gene |
| X        | Factor of interest      | gender     | matrix, factor, vector, or data frame | no intercept                |
| ctl      | Negative controls       | spike-ins  | index (logical or integer vector)     |                             |
| Z        | Array-wise covariates   | batch      | matrix, factor, vector, or data frame | 1 for intercept             |
| eta      | Gene-wise covariates    | GC content | matrix, factor, vector, or data frame | 1 for intercept             |
| k        | # of unwanted factors   |            | integer                               | 0 for no adjustment         |
| lambda   | Ridge parameter         |            | numeric                               | NULL for sensible default   |
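
To make the argument shapes concrete, here is a minimal sketch on simulated data. The sizes, effect strengths, and the single hidden factor `W` are arbitrary assumptions made for illustration. The package call uses only the arguments listed above and runs only if `ruv` is installed.

```R
# Simulated data only; all sizes and effect strengths are illustrative assumptions.
set.seed(1)
m <- 20                                   # samples (rows of Y)
n <- 200                                  # genes (columns of Y)

X   <- matrix(rep(c(0, 1), each = m / 2), ncol = 1)   # factor of interest, no intercept
W   <- matrix(rnorm(m), ncol = 1)                     # one hidden unwanted factor
ctl <- rep(c(TRUE, FALSE), times = c(50, n - 50))     # first 50 genes are negative controls

beta  <- matrix(c(rep(0, 50), rnorm(n - 50)), nrow = 1)  # controls unaffected by X
alpha <- matrix(rnorm(n), nrow = 1)                      # every gene affected by W
Y <- X %*% beta + W %*% alpha + matrix(rnorm(m * n, sd = 0.5), m, n)

# Call the package only if it is available, with the arguments from the table.
if (requireNamespace("ruv", quietly = TRUE)) {
  fit <- ruv::RUV4(Y, X, ctl, k = 1)
  str(fit, max.level = 1)   # inspect what the fit contains
}
```

`str()` is used instead of naming specific result fields, so the sketch makes no assumptions about the structure of the returned object.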



# Y

* Expression Data

* $m \times n$ matrix, where  
  * $m$ is the number of arrays  
  * $n$ is the number of genes

* Should be log transformed

* Usually best **not** to preprocess (quantile normalize, etc.)
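
As a sketch of the orientation and log-transform points above, suppose the raw intensities arrive as a genes-by-arrays matrix, which is a common layout. The matrix `raw`, its dimension names, and the `+ 1` offset are all hypothetical choices made for illustration.

```R
# Hypothetical raw intensities stored genes x arrays.
set.seed(2)
raw <- matrix(rexp(6 * 4, rate = 1 / 500), nrow = 6,
              dimnames = list(paste0("gene", 1:6), paste0("array", 1:4)))

# Transpose so that rows are arrays and columns are genes, then log-transform.
# The +1 offset is an assumption to guard against log2(0); adjust to your data.
Y <- log2(t(raw) + 1)

dim(Y)   # 4 arrays by 6 genes
```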

# X

* Factor of interest (gender, brain region, etc.)

* Should **not** include the intercept

* Rule of thumb: "The fewer factors, the better"
  * More factors in $X$ $\implies$ fewer factors estimated in $\hat{W}$
  * Better to repeat analysis for each factor of interest separately
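
One way to build a matrix `X` without an intercept, sketched with base R only. The `gender` vector is made-up sample annotation. Since the argument table lists factors as an accepted type, passing the factor directly may work equally well; the matrix form just makes the missing intercept explicit.

```R
# Hypothetical sample annotation for six arrays.
gender <- factor(c("F", "M", "F", "M", "M", "F"))

# model.matrix() adds an intercept column by default; drop it for X.
X <- model.matrix(~ gender)[, -1, drop = FALSE]

colnames(X)   # "genderM"
```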

# Z

* Additional covariates (batch, etc.)

* Should include the intercept (if desired)

* Rule of thumb: "The fewer factors, the better"
  * More factors in $Z$ $\implies$ fewer factors estimated in $\hat{W}$
  * $\hat{W}$ often captures unwanted variation better than $Z$  
  * Exception: $Z$ is a factor that affects only a small number of genes, and likely the same genes as $X$.  
    Example: $X$ is a disease that affects a small number of genes; $Z$ is a drug that affects those same genes
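
A small sketch of the two choices for `Z` described above: the intercept alone, or the intercept plus a known covariate. The `batch` labels are hypothetical, and building the matrix with `model.matrix()` is one option among several.

```R
# Hypothetical batch labels for six arrays.
batch <- factor(c("b1", "b1", "b2", "b2", "b3", "b3"))

Z_intercept_only <- 1                      # the default in the common syntax
Z_with_batch     <- model.matrix(~ batch)  # intercept plus batch indicators

colnames(Z_with_batch)   # "(Intercept)" "batchb2" "batchb3"
```

Following the rule of thumb, the intercept-only form is the natural starting point; the second form is for cases like the exception above.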

# eta ($\eta$)

* Gene-wise covariates ***associated with unwanted factors*** (GC content, etc.)

* Included for convenience; equivalent to preprocessing by
```R
Y = RUVI(Y, eta, ctl)
```

* eta = 1 (for intercept) is typically recommended, but it is **not** the default

# ctl

* Crucial to success

* Ideally:
  * Unaffected by factor of interest
  * Affected by unwanted factors
  * "representative" of other genes  
    (similar range of expressions, not affected by their own unwanted factors, etc.)

* **Cannot be automatically "discovered" from the data**  
  (at least not naively)

* **Need not be perfect**  
  RUV methods are robust (to varying degrees, and in different ways)
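
Since `ctl` is an index into the columns of `Y`, it is usually derived from gene annotation, not from the data itself. A minimal sketch, assuming spike-in controls whose identifiers begin with `ERCC-`. The gene list is illustrative only.

```R
# Illustrative gene identifiers, in the same order as the columns of Y.
gene_ids <- c("ERCC-00002", "GAPDH", "ERCC-00003", "TP53", "ACTB", "ERCC-00004")

ctl     <- grepl("^ERCC-", gene_ids)   # logical index, one entry per gene
ctl_int <- which(ctl)                  # the same controls as an integer index

ctl_int   # 1 3 6
```

Either form matches the "logical or integer vector" type given in the argument table.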

# k

* Number of unwanted factors.  For RUV2 and RUV4 only.

* Useful when negative controls may contain biology.  
  Keeping $k$ small reduces risk of overadjusting.

* Best chosen "by hand".  
  ("getK" function not ideal)
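
One informal aid when choosing $k$ by hand is to look at how quickly the singular values of the control-gene columns fall off. This is a generic heuristic sketched on simulated data, not the method used by `getK` or anything prescribed by the package. The two strong hidden factors are built into the simulation as an assumption.

```R
# Simulated control genes driven by two hidden factors plus noise.
set.seed(3)
m <- 12; n_ctl <- 100
W  <- matrix(rnorm(m * 2), m, 2)
Yc <- W %*% matrix(rnorm(2 * n_ctl, sd = 3), 2, n_ctl) +
      matrix(rnorm(m * n_ctl), m, n_ctl)          # stands in for Y[, ctl]

# Centre each gene, then inspect the share of variation per component.
d <- svd(scale(Yc, center = TRUE, scale = FALSE))$d
round(d^2 / sum(d^2), 2)
```

A sharp drop after the first few components suggests a small $k$ may be enough; the final choice still benefits from comparing results across a few values.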

# Comparison of Regression Methods

<table style="width:100%">
  <tr>
    <th style="text-align:left;font-size:18px;width:15%;border:0 none">Method</th>
    <th style="text-align:left;font-size:18px;border:0 none">Strengths</th>
    <th style="text-align:left;font-size:18px;border:0 none">Weaknesses</th>
    <th style="text-align:left;font-size:18px;border:0 none">Notes</th>
  </tr>    
  <tr>
    <td style="text-align:left;vertical-align:top;font-size:14px;width:15%;border:0 none">RUV2</td>
    <td style="text-align:left;vertical-align:top;font-size:14px;border:0 none"> <ul> 
                                                              <li>Simple and interpretable </li>
                                                              <li>Not too sensitive to "nonrepresentative" NCs</li>
                                                              </ul></td>
    <td style="text-align:left;vertical-align:top;font-size:14px;border:0 none"> <ul> 
                                                              <li>Sensitive to misspecified NCs </li>
                                                              </ul></td>
    <td style="text-align:left;vertical-align:top;font-size:14px;border:0 none"> <ul> 
                                                              <li>Good for spike-in controls </li>
                                                              <li>Keep k small if NCs may be misspecified</li>
                                                              </ul></td>
  </tr>
  <tr>
    <td style="text-align:left;vertical-align:top;font-size:14px;width:15%;border:0 none">RUV4</td>
    <td style="text-align:left;vertical-align:top;font-size:14px;border:0 none"> <ul> 
                                                              <li>Robust to misspecified NCs</li>
                                                              </ul></td>
    <td style="text-align:left;vertical-align:top;font-size:14px;border:0 none"> <ul> 
                                                              <li>Sensitive to "nonrepresentative" NCs </li>
                                                              <li>Anti-conservative for large k </li>
                                                              </ul></td>
    <td style="text-align:left;vertical-align:top;font-size:14px;border:0 none"> <ul> 
                                                              <li>RUV(r)inv usually a better option </li>
                                                              <li>Good when NCs highly misspecified; keep k small</li>
                                                              </ul></td>
  </tr>
  <tr>
    <td style="text-align:left;vertical-align:top;font-size:14px;width:15%;border:0 none">RUVinv</td>
    <td style="text-align:left;vertical-align:top;font-size:14px;border:0 none"> <ul> 
                                                              <li>Robust to misspecified NCs</li>
                                                              <li>No tuning parameter </li>
                                                              <li>Well calibrated p-values</li>
                                                              </ul></td>
    <td style="text-align:left;vertical-align:top;font-size:14px;border:0 none"> <ul> 
                                                              <li>Requires large number of NCs </li>
                                                              <li>Somewhat sensitive to "nonrepresentative" NCs </li>
                                                              </ul></td>
    <td style="text-align:left;vertical-align:top;font-size:14px;border:0 none"> </td>
  </tr>
  <tr>
    <td style="text-align:left;vertical-align:top;font-size:14px;width:15%;border:0 none">RUVrinv</td>
    <td style="text-align:left;vertical-align:top;font-size:14px;border:0 none"> <ul> 
                                                              <li>Robust to misspecified NCs</li>
                                                              <li>Reasonable default for lambda </li>
                                                              </ul></td>
    <td style="text-align:left;vertical-align:top;font-size:14px;border:0 none"> <ul> 
                                                              <li>Somewhat sensitive to "nonrepresentative" NCs </li>
                                                              </ul></td>
    <td style="text-align:left;vertical-align:top;font-size:14px;border:0 none"> <ul> 
                                                              <li>Good compromise of features </li>
                                                              </ul></td>
  </tr>
</table> 



# Technical Note

* RUV2 requires 
  $$\beta_c = 0$$
* RUV4, RUVinv, and RUVrinv require
  $$ \beta_c \alpha_c' (\alpha_c \alpha_c')^{-1} \approx 0$$
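
The second condition is weaker than the first, which a small numeric sketch can show. It assumes, following the usual RUV notation, that $\beta_c$ holds the control genes' coefficients for the factor of interest and $\alpha_c$ their coefficients for the $k$ unwanted factors. All values are simulated.

```R
set.seed(4)
n_c <- 500; k <- 3
alpha_c <- matrix(rnorm(k * n_c), k, n_c)          # k x n_c
beta_c  <- matrix(rnorm(n_c, sd = 0.1), 1, n_c)    # 1 x n_c, small but nonzero

# The quantity from the second condition: beta_c alpha_c' (alpha_c alpha_c')^{-1}
bias <- beta_c %*% t(alpha_c) %*% solve(alpha_c %*% t(alpha_c))   # 1 x k

any(beta_c != 0)   # TRUE: the RUV2 condition fails
max(abs(bias))     # yet the second quantity is close to zero
```

Here the controls are mildly affected by the factor of interest, but because those effects are unrelated to $\alpha_c$, the product stays near zero.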


