## Introduction to using UNIX and Bedtools

### Create a Linux terminal

In [None]:
!pip install colab-xterm
%load_ext colabxterm
%xterm

## *** Set working directory

By default, the working directory will be `My Drive/PB_course` in your Google Drive.

In [None]:
# set the working directory to a folder in your own Google Drive (~ 1 min)
from google.colab import drive
drive.mount('/content/gdrive')                         # on first use, you will be asked to grant permission to link your Google Drive

import os
try:
  os.mkdir("/content/gdrive/My Drive/PB_course")         # change this path if necessary
except FileExistsError:
  print("Directory already exists. OK to continue")
os.chdir("/content/gdrive/My Drive/PB_course")

## Basic UNIX commands
The following section highlights some essential UNIX commands.
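Of note, the "!" sign at the beginning of a command tells the notebook to pass that line to a UNIX shell instead of running it as Python. Hence, **the real command line is the text after the "!" sign.** Lines that start with "%" (such as `%cd`) are notebook magic commands rather than shell commands.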
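
As a rough mental model, a "!" line behaves much like handing the text after the "!" to a shell from Python. The sketch below uses the standard `subprocess` module to illustrate this; it approximates what the notebook does rather than reproducing its implementation, and `run_shell` is a hypothetical helper name.

```python
import subprocess

def run_shell(command: str) -> str:
    """Run a command line in a shell and return what it printed (stdout)."""
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    return result.stdout

# Roughly what "!echo This is an evening course" does in a notebook cell.
print(run_shell("echo This is an evening course"))
```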

### Getting help in UNIX
**man**: to view the user manual for a given command of interest.
For example, for the command "pwd":

In [None]:
!man pwd

**help**: to show a brief summary of a command that is built into the shell (bash); when run without an argument, it prints the list of help topics.
For example, for the command "pwd":

In [None]:
!help pwd

### Navigating Linux

**echo**: to print its arguments, including the values of environment variables. Specific environment variables can be inspected this way, for example, `$SHELL`, `$PATH`, `$PS1`, `$HOME` and `$USER`.

**env**: to list all environment variables.
Some examples:

In [None]:
!echo This is an evening course

In [None]:
!echo $SHELL
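
The same variables can also be read from Python, which is sometimes convenient inside a notebook. This is a minimal sketch using the standard `os.environ` mapping; which variables are actually set depends on the machine, so the printed values will differ.

```python
import os

# os.environ behaves like a dictionary of environment variables.
# .get() returns None instead of raising an error if a variable is not set.
for name in ("SHELL", "HOME", "USER"):
    print(name, "=", os.environ.get(name))

# Number of variables currently defined in this process.
print(len(os.environ), "environment variables are set")
```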

**pwd**: to print the current working directory

In [None]:
!pwd

**ls**: to print a simplified list of the contents of the current working directory

In [None]:
!ls

**ls -l**: to print a detailed list of the contents of the current working directory

In [None]:
!ls -l

**mkdir**: to make a new directory  

For example, to make the directory **files**:

In [None]:
!mkdir files

**cd**: to change directory. Without a directory name, it takes the user back to the home directory.
In the notebook, use `%cd` rather than `!cd`: each "!" command runs in its own temporary shell, so a directory change made with `!cd` does not persist. For example, to change into the directory **files**:

In [None]:
%cd files

To move one level up in the directory tree, use `cd ..`:

In [None]:
%cd ..
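
Why `%cd` is used here instead of `!cd` can be illustrated from Python. The sketch below assumes that a "!" command runs in a child shell, which is modelled with `subprocess`; a directory change in the child does not affect the notebook process, whereas `os.chdir` does.

```python
import os
import subprocess
import tempfile

start = os.getcwd()
with tempfile.TemporaryDirectory() as tmp:
    # A "cd" run in a child shell only affects that child process.
    subprocess.run(f'cd "{tmp}"', shell=True, check=True)
    after_subshell = os.getcwd()               # still the starting directory

    # os.chdir changes the directory of the current (Python) process.
    os.chdir(tmp)
    moved = os.path.samefile(os.getcwd(), tmp)  # True: we are now inside tmp

    os.chdir(start)                             # go back before tmp is deleted

print(after_subshell == start, moved)
```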

## Inspecting and manipulating files

In [None]:
# make sure that we are in the right directory
import os
os.chdir("/content/gdrive/My Drive/PB_course/files")

# download necessary files for analysis from github
!wget -O Sox17.bed https://raw.githubusercontent.com/jasonwong-lab/HKU-Practical-Bioinformatics/main/files/Sox17.bed
!wget -O Sox17FNV.bed https://raw.githubusercontent.com/jasonwong-lab/HKU-Practical-Bioinformatics/main/files/Sox17FNV.bed
!wget -O mm10.txt https://raw.githubusercontent.com/jasonwong-lab/HKU-Practical-Bioinformatics/main/files/mm10.txt
!wget -O mm10Refgene.bed https://raw.githubusercontent.com/jasonwong-lab/HKU-Practical-Bioinformatics/main/files/mm10Refgene.bed

UNIX has several useful commands to look at files including **head**, **tail**, **less**, **more**, **cat**.

**head**: to print the first part of a file (the first 10 lines by default).
To take a quick look at the contents of one's files:

In [None]:
!ls -l

In [None]:
!head Sox17.bed

**\***: an example of a UNIX wildcard, which matches any sequence of characters in a file name. In the example below, `head -3 *` prints the first three lines of every file in the working directory.

In [None]:
!head -3 *
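
For comparison, wildcard matching and `head` can be imitated in Python with the standard `glob` module. This is only a sketch on two toy files created in a temporary directory; the file names and the `head` helper are made up for illustration.

```python
import glob
import os
import tempfile
from itertools import islice

def head(path, n=10):
    """Return the first n lines of a text file, like `head -n`."""
    with open(path) as handle:
        return [line.rstrip("\n") for line in islice(handle, n)]

with tempfile.TemporaryDirectory() as tmp:
    # Two toy files standing in for the course data.
    for name in ("a.bed", "b.txt"):
        with open(os.path.join(tmp, name), "w") as handle:
            handle.write("".join(f"{name} line {i}\n" for i in range(1, 6)))

    # "*" matches every (non-hidden) file name, "*.bed" only names ending in .bed.
    all_files = sorted(glob.glob(os.path.join(tmp, "*")))
    bed_files = sorted(glob.glob(os.path.join(tmp, "*.bed")))
    first_three = {os.path.basename(p): head(p, 3) for p in all_files}

print(first_three)
```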

**less**: to inspect a file page by page. Press 'q' to quit.

In [None]:
!less mm10Refgene.bed

**wc**: to count the lines ("-l"), words ("-w") and characters ("-m") of a file. Without options, it prints the line, word and byte counts:

In [None]:
!wc Sox17.bed

In [None]:
!wc -l Sox17.bed

In [None]:
!wc -l *
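
What these counts mean can be sketched in a few lines of Python. The `wc` function below is a hypothetical stand-in that works on a string rather than a file, and the two BED-like lines are made-up toy data.

```python
def wc(text):
    """Return (lines, words, characters) for a string, similar to `wc -l -w -m`."""
    lines = text.count("\n")      # wc counts newline characters
    words = len(text.split())     # whitespace-separated tokens
    chars = len(text)
    return lines, words, chars

toy = "chr1\t100\t200\tpeak_1\t50\nchr2\t300\t400\tpeak_2\t150\n"
print(wc(toy))
```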

One can redirect output into a new file using ">":

In [None]:
!ls -l > list.tmp

In [None]:
!ls -l

In [None]:
!head list.tmp

UNIX "|" pipelines can be used to chain processes together in a sequential manner.

If you want to know how many file names in the **files** directory start with S:

In [None]:
!ls -ls S* > S_count.txt

In [None]:
!wc -l S_count.txt

The above two-step process (with the creation of a redundant file) can be simplified with the UNIX "|" pipe as follows:

In [None]:
!ls -ls S* | wc -l

In [None]:
!ls -ls *bed | wc -l
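
Under the hood, a pipe connects the output of one process to the input of the next. The sketch below wires two processes together by hand with `subprocess.Popen`; it assumes the standard `printf` and `wc` programs are available, as they are on typical Linux systems.

```python
import subprocess

# First process prints three lines; second process counts lines from its stdin.
producer = subprocess.Popen(["printf", "a\\nb\\nc\\n"], stdout=subprocess.PIPE)
consumer = subprocess.Popen(["wc", "-l"], stdin=producer.stdout,
                            stdout=subprocess.PIPE, text=True)
producer.stdout.close()   # the consumer now holds the only read end of the pipe
output, _ = consumer.communicate()
producer.wait()

line_count = int(output.strip())
print(line_count)
```

In the shell, the single character "|" replaces all of this plumbing.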

Let's remove these files using the **rm** command:

In [None]:
!rm list.tmp S_count.txt

In [None]:
!ls -l

To extract the lines of a file that match a pattern, one may use the command **grep**. For example:

In [None]:
!grep peak_1000 Sox17.bed
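
One thing to keep in mind is that grep matches substrings, so a pattern such as peak_1000 would also match a name like peak_10000 if one were present; `grep -w` restricts matches to whole words. The Python sketch below shows the difference on made-up lines (the peak names and scores are invented, not taken from Sox17.bed).

```python
import re

lines = [
    "chr1\t100\t200\tpeak_1000\t50",
    "chr1\t300\t400\tpeak_10000\t75",
    "chr2\t500\t600\tpeak_2\t20",
]

# Plain substring search, like `grep peak_1000`.
substring_hits = [line for line in lines if "peak_1000" in line]

# Whole-word search, like `grep -w peak_1000`; \b marks a word boundary.
word_hits = [line for line in lines if re.search(r"\bpeak_1000\b", line)]

print(len(substring_hits), len(word_hits))
```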

### Useful bedtools commands

#### *** Package installation and downloads for workshop (~ 5 minutes)

1.   conda (for simple installation of packages)
2.   bedtools (for BED file manipulation)

In [None]:
# install conda (~ 1 min). A message will say that the session has crashed; this is expected,
# because the session restarts after conda is installed
!pip install -q condacolab
import condacolab
condacolab.install()

In [None]:
# install bedtools (~ 2 min)
!conda install -y -c bioconda bedtools

In [None]:
# make sure that we are in the right directory
import os
os.chdir("/content/gdrive/My Drive/PB_course/files")

#### windowBed
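windowBed searches for features in one file that lie within a given window (in base pairs) of features in another file. Running it without arguments prints its usage message: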

In [None]:
!windowBed

To count how many pairs of peaks from Sox17.bed and Sox17FNV.bed lie within 200 bp of each other (`-w 200`):

In [None]:
!windowBed -a Sox17.bed -b Sox17FNV.bed -w 200 | wc -l
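
The idea behind a window search can be sketched in Python: each interval from the first file is extended by the window size on both sides, and any interval from the second file that overlaps the extended region is reported. This is a simplified illustration on made-up coordinates, not bedtools' implementation; `within_window` is a hypothetical helper.

```python
def within_window(a, b, w):
    """True if BED interval b overlaps interval a extended by w bp on each side.

    Intervals are (chrom, start, end) with 0-based, half-open coordinates.
    """
    a_chrom, a_start, a_end = a
    b_chrom, b_start, b_end = b
    if a_chrom != b_chrom:
        return False
    window_start = max(0, a_start - w)
    window_end = a_end + w
    return b_start < window_end and window_start < b_end

# Toy intervals (invented for illustration).
peaks_a = [("chr1", 1000, 1100), ("chr1", 5000, 5100)]
peaks_b = [("chr1", 1250, 1350), ("chr1", 9000, 9100), ("chr2", 1000, 1100)]

# One entry per (a, b) pair, which is also what the piped `wc -l` counts.
pairs = [(a, b) for a in peaks_a for b in peaks_b if within_window(a, b, 200)]
print(len(pairs))
```

Because the result lists pairs, a peak with several neighbours within the window is counted more than once.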

**awk** is a text-processing language that can be used for data extraction, reordering and manipulation. Below are some usage examples.

Extract strong peaks (score greater than 100). Column 5 holds the peak score, and `print $0` prints the whole line.

In [None]:
!awk '$5>100 {print $0}' Sox17.bed

Extract peaks on chromosome 19.

In [None]:
!awk '$1 == "chr19" {print $0}' Sox17.bed | head

How many bases are there in total in the mm10 genome? Column 2 of mm10.txt holds the length of each chromosome, so summing it gives the total:

In [None]:
!awk '{s+=$2} END {print s}' mm10.txt

Extract and reorder columns 4 and 1, and create a new column that combines a string ('Dummy') and a counter ('NR', the current line number):

In [None]:
!awk '{print $4"\t"$1"\tDummy"NR}' mm10Refgene.bed |head
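
For readers more familiar with Python, the awk one-liners above correspond roughly to the list operations below. The rows are made-up toy data, and note that awk columns are numbered from 1 while Python indices start at 0.

```python
rows = [
    ["chr1", "100", "200", "peak_1", "50"],
    ["chr19", "300", "400", "peak_2", "150"],
    ["chr19", "500", "600", "peak_3", "120"],
]

# awk '$5>100 {print $0}'  -> keep rows whose 5th column exceeds 100
strong = [row for row in rows if float(row[4]) > 100]

# awk '$1 == "chr19" {print $0}'  -> keep rows whose 1st column equals "chr19"
chr19 = [row for row in rows if row[0] == "chr19"]

# awk '{s+=$2} END {print s}'  -> sum of the 2nd column over all rows
total = sum(int(row[1]) for row in rows)

# awk '{print $4"\t"$1"\tDummy"NR}'  -> reorder columns, append a label with the line number
relabelled = [f"{row[3]}\t{row[0]}\tDummy{nr}" for nr, row in enumerate(rows, start=1)]

print(len(strong), len(chr19), total)
print("\n".join(relabelled))
```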

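To edit the files on hand, one can extract, join and sort columns with "cut", "paste" and "sort" respectively. Note that cut always outputs fields in their original order, so `cut -f 4,1` prints column 1 before column 4; to swap them, cut each column into its own file and join the files with paste: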

In [None]:
!cut -f 4,1 Sox17.bed |head

In [None]:
!cut -f 1 Sox17.bed > 1.bed

In [None]:
!cut -f 4 Sox17.bed > 4.bed

In [None]:
!paste 4.bed 1.bed |head

In [None]:
!sort -k2 -n mm10.txt
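
Two details of these commands are easy to miss: cut ignores the order in which fields are requested, and `sort -n` compares values as numbers rather than as text. The Python sketch below imitates both on toy values; the `cut` function is a hypothetical stand-in for the command.

```python
def cut(row, fields):
    """Mimic `cut -f`: fields are 1-based and always emitted in file order."""
    return [row[i - 1] for i in sorted(set(fields))]

row = ["chr1", "100", "200", "peak_1", "50"]
print(cut(row, [4, 1]))        # same result as cut(row, [1, 4])

# Explicit reordering, which is what the cut + paste steps (or awk) achieve.
reordered = [row[3], row[0]]

# sort -n compares values as numbers; a plain sort compares them as text.
sizes = ["9", "100", "25"]
as_text = sorted(sizes)
as_numbers = sorted(sizes, key=int)
print(as_text, as_numbers)
```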

#### closestBed  

closestBed finds, for each feature in one file, the closest feature in another file. The first step is to sort the BED files by chromosome and then by start position, which can be done in either of two ways:
1. with "sort"

In [None]:
!sort -k1,1 -k2,2n Sox17.bed > Sox17_sorted.bed

In [None]:
!sort -k1,1 -k2,2n Sox17FNV.bed > Sox17FNV_sorted.bed

In [None]:
!sort -k1,1 -k2,2n mm10Refgene.bed > mm10Refgene_sorted.bed

2. with "sortBed"

In [None]:
!sortBed -i Sox17.bed > Sox17_sorted.bed

In [None]:
!sortBed -i Sox17FNV.bed > Sox17FNV_sorted.bed

In [None]:
!sortBed -i mm10Refgene.bed > mm10Refgene_sorted.bed
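
The sort order used above (chromosome name first, then numeric start position) can be expressed in Python as a two-part sort key. This is a sketch on made-up records; it compares chromosome names as plain text, which should match `sort -k1,1` in a byte-wise locale but is not guaranteed to match every locale setting.

```python
records = [
    ("chr2", 500, 600),
    ("chr10", 50, 80),
    ("chr1", 3000, 3100),
    ("chr1", 200, 300),
]

# Chromosome names are compared as text, start positions as integers.
sorted_records = sorted(records, key=lambda rec: (rec[0], rec[1]))
for rec in sorted_records:
    print(*rec, sep="\t")
```

Note that with text comparison chr10 sorts before chr2, which is the expected order for these tools.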

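After sorting, to determine whether there are Sox17 peaks right at the TSS, count the features in mm10Refgene_sorted.bed whose closest peak is less than 10 bp away. The `-d` option appends the distance as the last column (column 10 here) and reports -1 when no peak is found on the same chromosome, so negative distances are excluded:

In [None]:
!closestBed -a mm10Refgene_sorted.bed -b Sox17_sorted.bed -d | awk '$10>=0 && $10<10 {print $0}' | wc -l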
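
The logic of a closest-feature search with a distance cut-off can be sketched in Python as follows. This is an illustration on made-up coordinates, not bedtools' implementation: `closest_distance` is a hypothetical helper, and the way gaps are measured (touching intervals count as 1 bp apart) is an assumed convention. It also shows why a distance of -1, used when a chromosome has no peak at all, would pass a bare "less than 10" test.

```python
def closest_distance(feature, peaks):
    """Distance from one feature to its nearest peak on the same chromosome.

    Returns 0 for overlapping intervals and -1 if the chromosome has no peaks.
    Intervals are (chrom, start, end), 0-based and half-open.
    """
    chrom, start, end = feature
    gaps = []
    for p_chrom, p_start, p_end in peaks:
        if p_chrom != chrom:
            continue
        if p_start < end and start < p_end:   # intervals overlap
            gaps.append(0)
        elif p_start >= end:                  # peak lies after the feature
            gaps.append(p_start - end + 1)    # assumed convention: touching = 1
        else:                                 # peak lies before the feature
            gaps.append(start - p_end + 1)
    return min(gaps) if gaps else -1

# Toy features and peaks (invented for illustration).
features = [("chr1", 1000, 1001), ("chr1", 5000, 5001), ("chrX", 10, 11)]
peaks = [("chr1", 995, 1005), ("chr1", 5100, 5200)]

distances = [closest_distance(f, peaks) for f in features]
naive = [d for d in distances if d < 10]        # also keeps the -1 entry
strict = [d for d in distances if 0 <= d < 10]  # only real distances below 10
print(distances, len(naive), len(strict))
```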