# Introduction to R programming in tidyverse and CRISPR screening data

- See the CRISPR_biology_resources folder in our shared Google Drive for resources on how CRISPR technology works.
- Our lab uses CRISPR technology to perform high-throughput screens, which generate a lot of data for each experiment.
- Our job for this internship is to analyze some of this data and make sense of it!

## Jupyter notebooks, markdown, R programming

- In Jupyter notebooks, you can have code chunks or text chunks.
- Typically, you use the text chunks to annotate the code chunks.
- The text chunks in Jupyter notebooks use a language known as [markdown](https://www.markdownguide.org/getting-started/), which is commonly used in computational biology.

### Example markdown commands

- **bold**    
- _italicize_   
- ```code chunk```
- A ```#``` in a markdown chunk creates a header, while a ```#``` in a code chunk starts a comment. Comments are not run by the programming language; they are just another way to take notes on the code you're writing or running.
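
To make the last point concrete, here is a minimal sketch of an R code chunk that contains comments. The variable name `n_guides` is made up for illustration and is not part of the lab's data.

```r
# This whole line is a comment, so R skips it when the chunk runs
n_guides <- 4 # anything after a # on a line of code is also ignored
n_guides * 2  # only the code to the left of the # is evaluated
```

Running the chunk evaluates only the two pieces of code; the notes after each ```#``` are there for the reader.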

### Loading libraries

- ```tidyverse``` is a suite of packages commonly used in computational biology. Its strength is in wrangling large datasets (think Excel, but in code).

In [25]:
library(tidyverse) # loads the core tidyverse packages, which make wrangling large dataframes much easier

## Load the data that we want to explore (our lab's CRISPR screening data)



In [26]:
crispr_data <- read_tsv("./data/mageck.gene_summary_gq.tsv") # dataset collected by our lab using CRISPR screening methods

Rows: 2189 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr  (1): id
dbl (13): num, neg|score, neg|p-value, neg|fdr, neg|rank, neg|goodsgrna, neg...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

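The message printed by ```read_tsv``` suggests two ways to quiet it. The sketch below tries both on a tiny inline table instead of the lab's file, so it can be run anywhere. The names `toy_tsv`, `toy_quiet`, and `toy_typed` are hypothetical, the two rows are copied from the first genes in the dataset, and passing literal text with `I()` assumes readr 2.0 or newer.

```r
library(tidyverse)

# Two rows typed inline as tab-separated text (I() marks the string as literal data, not a file path)
toy_tsv <- I("id\tnum\tpos|lfc\nAQR\t4\t3.9439\nCWC22\t4\t2.9268\n")

# Option 1: let readr guess the column types, but hide the message
toy_quiet <- read_tsv(toy_tsv, show_col_types = FALSE)

# Option 2: state the column types up front (c = character, d = double)
toy_typed <- read_tsv(toy_tsv, col_types = "cdd")

spec(toy_typed) # prints the full column specification
```

Stating the column types explicitly also acts as a check: if a column that should be numeric contains unexpected text, readr reports parsing problems instead of silently reading it as characters.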

In [27]:
head(crispr_data) # head command displays the first 6 rows of the dataframe

id,num,neg|score,neg|p-value,neg|fdr,neg|rank,neg|goodsgrna,neg|lfc,pos|score,pos|p-value,pos|fdr,pos|rank,pos|goodsgrna,pos|lfc
<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
AQR,4,1,1,0.999998,2189,0,3.9439,6.5633e-11,2.2615e-06,0.00033,1,4,3.9439
CWC22,4,1,1,0.999998,2188,0,2.9268,9.1696e-10,2.2615e-06,0.00033,2,4,2.9268
DHX38,4,1,1,0.999998,2187,0,2.1742,5.0082e-09,2.2615e-06,0.00033,3,4,2.1742
CRNKL1,4,1,1,0.999998,2186,0,2.1624,7.4738e-09,2.2615e-06,0.00033,4,4,2.1624
DGCR14,4,1,1,0.999998,2185,0,2.4018,1.7951e-08,2.2615e-06,0.00033,5,4,2.4018
BUD13,4,1,1,0.999998,2184,0,1.8262,3.0539e-08,2.2615e-06,0.00033,6,4,1.8262


Note that the dataframe has a certain structure: every row is an observation (here, one gene) and every column is a variable. This layout is known as "tidy data", and it is what the widely used ```tidyverse``` packages are designed around. Read more [here](https://r4ds.had.co.nz/tidy-data.html).
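
As a small illustration of the row/column structure, the sketch below builds a three-gene dataframe by hand, using values copied from the ```head``` output above. The name `toy_screen` is hypothetical.

```r
library(tidyverse)

# Each row is one gene (an observation); each column is one variable
toy_screen <- tibble(
  id = c("AQR", "CWC22", "DHX38"),
  num = c(4, 4, 4),
  `pos|lfc` = c(3.9439, 2.9268, 2.1742)
)

nrow(toy_screen)    # number of observations (rows)
ncol(toy_screen)    # number of variables (columns)
glimpse(toy_screen) # one line per column: its name, type, and first few values
```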

In [None]:
tail(crispr_data) # tail shows the last 6 rows of the dataframe

id,num,neg|score,neg|p-value,neg|fdr,neg|rank,neg|goodsgrna,neg|lfc,pos|score,pos|p-value,pos|fdr,pos|rank,pos|goodsgrna,pos|lfc
<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
POLR2H,4,0.00021021,0.00084769,0.210171,8,4,-1.678,0.99979,0.9998,0.999989,2185,0,-1.678
HNRNPC,4,0.00020848,0.00084317,0.210171,7,4,-1.29,0.99979,0.99981,0.999989,2186,0,-1.29
METTL3,4,0.00014486,0.00056739,0.207096,6,4,-1.2784,0.99986,0.99986,0.999989,2187,0,-1.2784
SBDS,4,3.6961e-05,0.00013337,0.058416,5,4,-2.3797,0.99996,0.99996,0.999989,2188,0,-2.3797
PPP1R10,4,3.0989e-05,0.00011076,0.058416,4,4,-2.1098,0.99997,0.99997,0.999989,2189,0,-2.1098
KIAA1429,4,1.6681e-05,6.5555e-05,0.058416,2,4,-1.7302,0.99998,0.99999,0.999989,2190,0,-1.7302
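
Both ```head``` and ```tail``` accept an `n` argument if you want something other than 6 rows. The following is a minimal sketch on a hand-built dataframe; `toy_screen` is a hypothetical name, and the six genes and ranks are copied from the ```head``` output earlier in this notebook.

```r
library(tidyverse)

toy_screen <- tibble(
  id = c("AQR", "CWC22", "DHX38", "CRNKL1", "DGCR14", "BUD13"),
  `pos|rank` = 1:6
)

first_two <- head(toy_screen, n = 2) # first 2 rows instead of the default 6
last_two <- tail(toy_screen, n = 2)  # last 2 rows

first_two
last_two
```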


In [21]:
crispr_data %>% # %>% is the pipe operator in the tidyverse; it passes the output of one command to the next command
    arrange(`pos|lfc`) %>% # sort rows by pos|lfc, smallest value first; column names with special characters (e.g. |) must be wrapped in backticks (``)
    head()

id,num,neg|score,neg|p-value,neg|fdr,neg|rank,neg|goodsgrna,neg|lfc,pos|score,pos|p-value,pos|fdr,pos|rank,pos|goodsgrna,pos|lfc
<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
PPP1R8,3,0.0083934,0.023056,0.64276,84,3,-0.88706,0.99161,0.99187,0.999975,2146,0,-0.88706
ABCF1,4,3.9486e-05,0.00013343,0.058416,5,4,-0.80311,0.99994,0.99993,0.999975,2187,0,-0.80311
TRMT112,2,0.0016541,0.003277,0.335059,30,2,-0.78351,0.99835,0.99845,0.999975,2170,0,-0.78351
CLP1,4,0.0081985,0.027824,0.700068,83,3,-0.76976,0.23353,0.37504,0.999975,748,1,-0.76976
CPSF3,4,4.9718e-07,2.2615e-06,0.00495,1,4,-0.74837,0.99899,0.99902,0.999975,2176,0,-0.74837
SYMPK,3,0.00021085,0.00072143,0.143564,11,3,-0.73487,0.99979,0.99982,0.999975,2185,0,-0.73487
