# Introduction to Unix

Authors: Ken Youens-Clark and Bonnie Hurwitz

## Overview

Unix is great for bioinformatics because it handles large datasets efficiently and supports powerful command-line tools for data analysis. Its scripting capabilities allow for automation of repetitive tasks, and its stability ensures reliable processing of complex bioinformatics workflows. Plus, many bioinformatics tools and software are built for Unix systems, making it a go-to environment for researchers in the field.

In this homework, you are going to get an opportunity to try out several Unix commands and see how they can be used for bioinformatics research.

Here's an overview:
- Unix Commands Overview
- Working with Sequence Files Example

-----


### Getting Started

Before we get started you will need to set your netid and then go into the directory for this assignment under bh_class.

You will need to rerun this section each time you come back to this notebook to reset the variables.

Remember, our notebooks work in the current working directory, and when you log in to the HPC this is automatically your home directory. You will need to move to the project directory `/xdisk/bhurwitz/bh_class/your_netid/assignments/02_intro_unix`. The next two cells set your netid and the project directory (be sure to replace "MY_NETID" with your actual netid). Then we will change into that directory for our exercise.

In [None]:
# Make sure you have the most up to date code
%cd ~/be487-fall-2024
!git stash
!git pull

In [None]:
# Change "MY_NETID" to your netid below, and run this cell
netid = "MY_NETID"

In [None]:
# Set the working directory and change into this directory
work_dir = "/xdisk/bhurwitz/bh_class/" + netid + "/assignments/02_intro_unix"
%cd $work_dir

### Section 1: Unix Commands Overview

#### The basics

Let's go over the basic Unix commands that are included in this notebook. If you need a more thorough explanation, you can follow this introduction from [Software Carpentry](https://swcarpentry.github.io/shell-novice/01-intro.html).

#### Getting Unstuck

Most commands accept a "--help" flag (flags are additional arguments that change what a command does). It prints what the command does, how to use it, and which other flags it accepts:

    ls --help

There's also a more in-depth manual accessed by typing the command "man" followed by the command:

    man ls

#### Navigation

__pwd__:  Print Working Directory. It tells you which directory you're currently working in. \
__ls__:   List. List out all directories/files in the current directory. \
__cd__:   Change Directory. Moves you to the specified directory. 
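
A minimal sketch of these three commands, run in a throwaway directory so nothing in your project is touched. The scratch directory (made with `mktemp -d`) and the name `demo_project` are assumptions for illustration and are not part of the assignment:

```shell
# Create a scratch directory to practice in
scratch=$(mktemp -d)
mkdir "$scratch/demo_project"   # hypothetical directory name

cd "$scratch"                   # move into the scratch directory
pwd                             # print the directory we are in now
ls                              # list its contents (shows demo_project)

cd demo_project                 # move down one level
here=$(pwd)                     # capture the working directory in a variable
echo "Now in: $here"
cd ..                           # ".." always means the parent directory
```

Running `pwd` after every `cd` is a cheap way to confirm you are where you think you are before creating or deleting files.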

#### Working with Files & Directories

__mkdir__: Make Directory. This creates a new directory. \
__wget__:  A network download tool. This command downloads files from a server. \
__unzip__: Unzip. Unzips a compressed file. \
__rm__:    Remove. This deletes a file. To delete a directory you'll need the recursive -r flag. Be mindful: deletion is permanent. \
__mv__:    Move. Moves the specified file. \
__cat__:   Concatenate. This displays the contents of a file. \
__cp__:    Copy. Copies the contents of a file/directory. \
__grep__:  Search Command. This filters through a file to search for a pattern of specified characters. \
__wc__:     Word Count. Calculates a file's word, line, character or byte count. \
__diff__:   Difference. Displays the differences in files. \
__cksum__:  Checksum. Prints a file's CRC checksum, its size in bytes, and its name. \
__sed__:    Stream Editor. Can insert, delete, search and replace (substitute) text.
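
Several of these commands can be tried together on disposable files. This is only a sketch: the scratch directory and the file names (`notes.txt`, `demo_dir`, and so on) are made up for illustration.

```shell
# Work in a throwaway directory so no real files are affected
scratch=$(mktemp -d)
cd "$scratch"

mkdir demo_dir                                  # make a directory
printf 'first line\nsecond line\n' > notes.txt  # create a small two-line file
cp notes.txt demo_dir/notes_copy.txt            # copy it into demo_dir
mv notes.txt renamed_notes.txt                  # move (rename) the original
cat renamed_notes.txt                           # display the file contents
wc -l demo_dir/notes_copy.txt                   # count the lines in the copy
grep second renamed_notes.txt                   # print lines containing "second"
rm -r demo_dir                                  # delete the directory and its contents
```

Note that `rm -r` removed the copy along with the directory, and there is no undo.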

#### Piping & Filtering

Unix commands can chain output from one command into another. For example, we can use the concatenate "cat" command, "pipe" its output to a line count using the word count command (wc -l), and save the result into a new file:

    cat fileName.txt | wc -l > file_line_count.txt

This will be important in the upcoming example on working with sequence files.
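
To try the pipeline above on a small scale, here is a sketch that first creates a three-line `fileName.txt` in a scratch directory. The file contents and the scratch directory are assumptions for illustration.

```shell
# Set up a throwaway directory and a small input file
scratch=$(mktemp -d)
cd "$scratch"
printf 'line one\nline two\nline three\n' > fileName.txt

# cat sends the file to the pipe, wc -l counts the lines it receives,
# and ">" saves that count in a new file instead of printing it
cat fileName.txt | wc -l > file_line_count.txt

# Look at the saved result (some systems pad the number with spaces)
cat file_line_count.txt
```

Nothing is printed by the middle command because its output went into `file_line_count.txt`; the final `cat` shows the count of 3.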


### Section 2: Working with Sequence Files

In this exercise, we will use Unix to work with a sequence file that contains contigs. Contigs are stretches of genomic sequence from the same organism that have been "assembled together" from the shorter sequence reads that come directly off a sequencer.

In [5]:
'''
Write the command below, and run the cell
2.1: List the files and directories in our project directory.
'''


total 144
-rw-r--r--@ 1 bhurwitz  staff  27361 Aug 29 15:12 hw01_intro_unix.ipynb
-rw-r--r--@ 1 bhurwitz  staff  39841 Aug 29 14:45 intro_unix.ipynb
-rw-r--r--@ 1 bhurwitz  staff    785 Aug 29 14:52 start_here.md


Now we are going to make a directory to store our contigs, and go into that directory

In [None]:
'''
Write the commands below, and run the cell
2.2 Make a directory called 01_contigs
2.3 Change into the 01_contigs directory
'''


### Section 3: Downloading files

Next, we will go and get a contigs file from the iMicrobe FTP site. We can use the `wget` command to pull down the data from the FTP site to our current directory.

In [None]:
# Let's go get the contigs file, run this cell
!wget ftp://ftp.imicrobe.us/biosys-analytics/contigs/contigs.zip

In [None]:
'''
Write the commands below, and run the cell
3.1 Write a command to list the contigs.zip file
'''

#### Unpack the zipped file

Great news! You downloaded the file. It should have a file size of 1979343 bytes. Do you get that too?

Now back to the exercise. Let's unpack the contigs.zip file.

In [None]:
'''
Write the commands below, and run the cell
3.2 unzip the contigs.zip file
3.3 remove the contigs.zip file
3.4: List the files and directories in our project directory now.
'''

#### FASTA-formatted files

These files are in [FASTA format](https://en.wikipedia.org/wiki/FASTA_format), which basically looks like this:

```
>Contig_4027
AACCGGGCCAATCACCACGCGATGGACGGTACGCTCGATTTCAATGGCAACCTGTATTTCTCGGACGATCTGAACACCAACCCCTATCGGAGCATCGGGAAGATCGATGGACGGACCGGGGAGATCACCAACGTCCAGGTCGTTGATTTCTCCGAAGACAACATCGAATCCACCGTCGATGTAATGGGATTGGGTTGGATGGAAGTGGGAGTGTCTCTTTCCACTCACCTGGGGGATTTTGCTTGCGGTGTGAC
>Contig_33139
TGTGACGGACCGTGATCGTTCCCTGATCCAGGTCGACGTCACTCCACTGGAGAGCCAGCAGCTCGCCCAGGCGTAGTCCGCAGAAGATGGCAGTGAAGAATAGAGCCGCTTGTGGGTGCGAGTTGCCGGAGCTGTTCCAGTCCCTGAGACCATCGACCAACGTCCGCGCCTCGGTGGAGCTATAGGGATTGATTTGTTTTTTCTGAGACGACTGCGGTCCGAGGATCTTGCGAAGATCGATCCCGATCGCTGGG
```

Header lines start with ">", then the sequence follows. Sequences may be broken up over several lines of 50 or 80 characters, but it's just as common to see the sequences take only one (sometimes very long) line. Sequences may be nucleotides, proteins, very short DNA/RNA, longer contigs (shorter strands assembled into contiguous regions), or entire chromosomes or even genomes.


### Section 4: Getting data out of files (using grep)

So, you might want to ask how many sequences are in the "group12_contigs.fasta" file? To answer, we just need to count how many times we see ">" in the file. We can do that with the `grep` command. But, there are a few things we should learn first...

In [None]:
# what happens if we use 'grep >' to search for the fasta line header starting with '>'? 
!grep > group12_contigs.fasta

In [None]:
'''
Write the command below, and run the cell.
4.1: List the files and directories in our project directory now (using a long listing).
'''

#### Yikes, it looks like you accidentally overwrote your file. 

You should see that group12_contigs.fasta now has a file size of zero. 

So, basically, grep never got a pattern to search for, and your file got overwritten. Why?

Something quite insidious happened with that first "grep" statement: the shell treated the unquoted ">" as a redirection, so it emptied our original "group12_contigs.fasta" to receive grep's output before grep even started. With the ">" consumed by the shell, grep was left with no pattern and nothing to write, so the file stayed empty.

Remember that the ">" symbol tells Unix to redirect the output of a command into a file. To search for a literal greater-than sign, we need to place it in single or double quotes or put a backslash in front of it.

In [None]:
# This is known as escaping characters -- a common occurrence in programming.
!grep '>' group12_contigs.fasta
!grep \> group12_contigs.fasta
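
The difference between redirecting and matching a literal ">" can be reproduced safely on a disposable copy. This is a minimal sketch; `toy.fasta`, `victim.fasta`, and the scratch directory are made up for illustration.

```shell
# Build a tiny FASTA file and a copy we are willing to lose
scratch=$(mktemp -d)
cd "$scratch"
printf '>seq1\nACGT\n>seq2\nGGCC\n' > toy.fasta
cp toy.fasta victim.fasta

# Unquoted: the shell treats ">" as redirection and empties victim.fasta
# before grep starts. grep then fails because it was given no pattern.
# (2>/dev/null hides grep's usage message; || true ignores its error status.)
grep > victim.fasta 2>/dev/null || true
size_after=$(wc -c < victim.fasta | tr -d ' ')
echo "victim.fasta is now $size_after bytes"

# Quoted or escaped: ">" reaches grep as the pattern to search for
quoted=$(grep '>' toy.fasta)
escaped=$(grep \> toy.fasta)
echo "$quoted"
```

Both the quoted and the escaped form print the two header lines, while the unquoted form leaves `victim.fasta` empty.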

In [None]:
'''
Write the command below, and run the cell.
4.2: List the files and directories in our project directory now (using a long listing).
'''

#### Why do we still have a file with zero length?

So, we ran those commands correctly, but nothing was output. This is because the file doesn't have anything in it (given that we erased it in the last step)! Let's try those commands on one of the other contigs files.

Now you try: use the file 'group24_contigs.fasta'.

In [None]:
'''
Write the command below, and run the cell.
4.3: search for the ">" character in the group24_contigs.fasta file.
'''

#### Let's get the file back, and try again.

Ugh, OK, I have to go back and wget the "contigs.zip" file to restore it. That's OK. Things like this happen all the time. Let's do this again...

In [None]:
'''
Write the command below, and run the cell.
4.4: remove the *.fasta files
4.5: get the contigs.zip file using the wget command above.
4.6: unzip the contigs.zip file
4.7: List the files and directories in our project directory now (using a long listing).
'''

#### The file is back!

You should see something like this from the last command

```
-rw-rw----  1 bhurwitz  staff  3034371 Aug 10  2016 group12_contigs.fasta
-rw-rw----  1 bhurwitz  staff  1550608 Aug 10  2016 group20_contigs.fasta
-rw-rw----  1 bhurwitz  staff  1686023 Aug 10  2016 group24_contigs.fasta
```

#### Count the sequences in the contigs file

Now that I have restored my data, I want to count how many greater-than signs (or fasta headers) are in the file. These are the names of the sequences in the contigs file. You should get 132.

You can use the grep command like above, and "pipe" it to the word count command.


In [None]:
'''
Write the command below, and run the cell.
4.8: Use the grep command like above, and "pipe" it to the word count command (counting lines only)
'''

#### Searching for something...

Moving on, let's find how many contig IDs in "group12_contigs.fasta" contain the number "47":

In [None]:
'''
Write the command below, and run the cell.
4.9: Use the grep command to find all ids with the number 47, then save the results to a file called group12_ids_with_47
4.10: Use the cat command to show what is in your new file
'''

#### What did you get?

You should see something like this:

```
cat group12_ids_with_47
>Contig_247
>Contig_447
>Contig_476
>Contig_1947
>Contig_4764
>Contig_4767
>Contig_13471
```

Let's play around with the file, by putting it in some temp files. Here are two ways to make a copy of the file contents.

In [None]:
# Option 1: use cat; you shouldn't see any output because it is redirected into the file
!cat group12_ids_with_47 > temp1_ids

In [None]:
# Option 2: use the cp (copy) command
!cp group12_ids_with_47 temp2_ids

### Section 5: Checking if files are the same

How can we be sure the two temp files we just created are the same? Let's use "diff":


In [None]:
# Check if temp1_ids is the same as temp2_ids
!diff temp1_ids temp2_ids

#### What did you get?

You should see nothing, which is a case of "no news is good news." They don't differ in any way. 
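
For contrast, here is a sketch of what "diff" does when the files are not the same. The three small ID files are made up for illustration.

```shell
# Build three small ID lists in a throwaway directory
scratch=$(mktemp -d)
cd "$scratch"
printf '>Contig_1\n>Contig_2\n' > ids_a   # hypothetical IDs
cp ids_a ids_b                            # exact copy of ids_a
printf '>Contig_1\n>Contig_3\n' > ids_c   # second line differs

# diff succeeds (exit status 0) only when the files are identical
if diff ids_a ids_b > /dev/null; then ab="identical"; else ab="different"; fi
if diff ids_a ids_c > /dev/null; then ac="identical"; else ac="different"; fi
echo "ids_a vs ids_b: $ab"
echo "ids_a vs ids_c: $ac"

# Without the redirect, diff prints the lines that differ:
# "<" marks lines from the first file, ">" lines from the second
diff ids_a ids_c || true
```

The silent case and the exit status are what make "diff" convenient inside scripts: no output and a success status mean the files match.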

We can verify this with "cksum" (see below).

In [None]:
'''
Write the command below, and run the cell.
5.1: Use the checksum (cksum) command to compare the checksums and sizes of the temp files to see if the files are different
'''

#### What did you get?

You should see this:

```
2188208005 89 temp1_ids
2188208005 89 temp2_ids
```

They have the same checksum (the first column) and the same size in bytes (the second column). If even one character differed, the checksums would not match.
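
The sketch below illustrates that point with made-up ID files: one is an exact copy, and one has a single character changed while keeping the same length.

```shell
# Build an original, an exact copy, and a one-character variant
scratch=$(mktemp -d)
cd "$scratch"
printf '>Contig_247\n>Contig_447\n' > ids_original   # hypothetical IDs
cp ids_original ids_copy
printf '>Contig_247\n>Contig_448\n' > ids_changed    # last digit changed

# Print checksum, size in bytes, and name for each file
cksum ids_original ids_copy ids_changed

# Reading from standard input drops the file name, so the
# "checksum size" pairs can be compared directly
sum_original=$(cksum < ids_original)
sum_copy=$(cksum < ids_copy)
sum_changed=$(cksum < ids_changed)
```

The copy reports the same checksum and size as the original. The changed file reports the same size but a different checksum, which is why the checksum, not the size, is the value to compare.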


### Section 6: Checking for duplicates

Before we can check for duplicates, we need to make some! Let's create a file with duplicate IDs using the cat command:

In [None]:
# This command concatenates the contents of both temp files into a new file
!cat temp1_ids temp2_ids > duplicate_ids

### Want to see a cool trick?

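In the code below, I am doing the following:
1. copying the file duplicate_ids to multiple_ids
2. getting all of the contig IDs from "group20_contigs.fasta" that contain the number "51" and appending them to the file called "multiple_ids".

In an interactive bash terminal, the second command can be written as:

```
grep 51 group20_contigs.fasta >> !$
```

which is the same as

```
grep 51 group20_contigs.fasta >> multiple_ids
```

because bash replaces "!$" with the last argument of the previous command (here, "multiple_ids"). Cool shortcut, huh? It relies on the shell's command history, so it only works when both commands are typed into the same interactive shell. Each "!" line in a notebook cell runs in its own non-interactive shell, so the cell below spells out the file name.

Also notice the ">>" arrows, which indicate that we are appending to the existing "multiple_ids" file.

In [None]:
!cp duplicate_ids multiple_ids
# "!$" is not expanded in a notebook cell, so name the file explicitly
!grep 51 group20_contigs.fasta >> multiple_ids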
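
The difference between ">" and ">>" is easy to check on a disposable file. This is a small sketch; `log.txt` and the scratch directory are made up for illustration.

```shell
# Work in a throwaway directory
scratch=$(mktemp -d)
cd "$scratch"

printf 'one\n' > log.txt       # ">" creates the file (or replaces its contents)
printf 'two\n' >> log.txt      # ">>" adds to the end of the file
printf 'three\n' >> log.txt
appended=$(wc -l < log.txt | tr -d ' ')      # line count after appending

printf 'fresh\n' > log.txt     # ">" again: the earlier lines are gone
overwritten=$(wc -l < log.txt | tr -d ' ')   # line count after overwriting

echo "after appending: $appended lines"
echo "after overwriting: $overwritten line"
```

The file holds three lines after the two appends and only one after the final overwrite.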

In [None]:
'''
Write the command below, and run the cell.
6.1: remove the existing "temp" files using a "*" wildcard
'''

### Section 7: Using sort and uniq

Now let's explore more of what "sort" and "uniq" can do for us. We want to find which IDs are unique and which are duplicated. If we read the manpage ("man uniq"), we see that there are "-d" and "-u" flags for doing just that. 

The "-d" flag will only print duplicate lines, one for each group. 

And the "-u" will only print unique lines. 

Don't forget that input to "uniq" needs to be sorted for this all to work because the duplicates need to be next to each other in the list.
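
Here is a small sketch of why the sorting matters, using a made-up list in which two IDs appear twice but never on adjacent lines:

```shell
# Made-up IDs: idA and idB appear twice, idC once, none adjacent
scratch=$(mktemp -d)
cd "$scratch"
printf 'idB\nidA\nidB\nidC\nidA\n' > ids.txt

# Without sorting, uniq only compares neighboring lines and finds no duplicates
unsorted_dups=$(uniq -d ids.txt)

# After sorting, identical lines sit next to each other
dups=$(sort ids.txt | uniq -d)      # one line per duplicated ID
singles=$(sort ids.txt | uniq -u)   # IDs that appear exactly once

echo "unsorted duplicates: [$unsorted_dups]"
echo "duplicates: $dups"
echo "singles: $singles"
```

The unsorted run reports nothing, while the sorted runs report idA and idB as duplicates and idC as the only unique ID.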

In [None]:
# sort multiple_ids, pipe into uniq -d (duplicated lines only), and save to temp1_ids
!sort multiple_ids | uniq -d > temp1_ids
# sort multiple_ids, pipe into uniq -u (unique lines only), and save to temp2_ids
!sort multiple_ids | uniq -u > temp2_ids
# check the differences between the two files
!diff temp*

#### What did you get?

You should see something like this:

```
1,7c1,11
< >Contig_13471
< >Contig_1947
< >Contig_247
< >Contig_447
< >Contig_476
< >Contig_4764
< >Contig_4767
---
> >Contig_10051
> >Contig_1651
> >Contig_4851
> >Contig_5141
> >Contig_5143
> >Contig_5164
> >Contig_5170
> >Contig_5188
> >Contig_6351
> >Contig_9651
> >Contig_9851
```

In [None]:
# Let's remove our temp files again
!rm temp*

In [None]:
'''
Write the commands below, and run the cell.
7.1: sort and uniq the multiple_ids file and put the results in a file called clean_ids
7.2: count the lines in the multiple_ids and clean_ids files using the word count command
'''

#### What did you get?

You should see something like this:

    25 multiple_ids
    18 clean_ids
    43 total

### Section 8: Using the sed command to alter the ids

We can use "sed" to alter the IDs. 

The "s//" command says to "substitute" the first thing with the second thing, e.g., to replace the first occurrence of "foo" on each line with "bar", use ["s/foo/bar/"](http://stackoverflow.com/questions/4868904/what-is-the-origin-of-foo-and-bar). 

If you want to replace all instances of "foo" with "bar", use "s/foo/bar/g" to say you want to run the command "globally".
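
A quick way to see the difference between the two forms is to run them on a single line of text. The sample string here is made up:

```shell
line="foo foo foo"

# Without g: only the first match on the line is replaced
first_only=$(echo "$line" | sed 's/foo/bar/')

# With g: every match on the line is replaced
every_match=$(echo "$line" | sed 's/foo/bar/g')

echo "$first_only"    # bar foo foo
echo "$every_match"   # bar bar bar
```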

In [None]:
!sed 's/C/c/' clean_ids
!sed 's/_/./' clean_ids
!sed 's/>//' clean_ids > newclean_ids

After we run these sed commands, what do our ids look like? Can you write a few Unix commands below to see what is in the newclean_ids file?

What did you change with the first two commands? Did the changes "stick", that is, were they saved in the clean_ids file?

In [None]:
'''
Write the commands below, and run the cell.
8.1: display the contents of clean_ids using cat
8.2: display the contents of newclean_ids using cat
'''

#### So what happened?

As we see with the cat command above, only the last sed command stuck, because only its output was redirected into the newclean_ids file; the first two just printed their results to the screen. We have a few options to apply all of the substitutions. We could pipe the commands together, or we can put all of the substitutions in one quoted script, separated by ";". Usually, a single command is specified as the first argument to sed. But you can add multiple commands by using the -e option (once per command) or the -f option (commands read from a file). All commands are applied to the input in the order they are specified, regardless of their origin.
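
The options described above can be compared side by side. This is a sketch on a made-up two-line ID file; the file names `ids.txt` and `edits.sed` are assumptions for illustration.

```shell
# Made-up input in a throwaway directory
scratch=$(mktemp -d)
cd "$scratch"
printf '>Contig_247\n>Contig_447\n' > ids.txt

# 1. Pipe one sed command into the next
piped=$(sed 's/C/c/' ids.txt | sed 's/_/./' | sed 's/>//')

# 2. Give sed several commands, each with its own -e
multi_e=$(sed -e 's/C/c/' -e 's/_/./' -e 's/>//' ids.txt)

# 3. Put the commands in one script, separated by semicolons
semicolons=$(sed 's/C/c/;s/_/./;s/>//' ids.txt)

# 4. Store the commands in a file, one per line, and load them with -f
printf 's/C/c/\ns/_/./\ns/>//\n' > edits.sed
from_file=$(sed -f edits.sed ids.txt)

echo "$multi_e"
```

All four variants produce the same result, `contig.247` and `contig.447`. Which one to use is mostly a matter of readability: a script file is easier to maintain once the list of substitutions grows.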

In [None]:
# check this out!
!sed -e 's/C/c/;s/_/./;s/>//' clean_ids > final_clean_ids

In [None]:
# What do you get?
!cat final_clean_ids

#### What happened?

Now we see that all of the substitutions we wanted have been saved into this final_clean_ids file for further use.

### The End!

Last step, copy your completed Jupyter notebook into your assignments directory. Be sure to save your notebook first.

In [None]:
!cp ~/be487-fall-2024/assignments/02_intro_unix/hw02_intro_unix.ipynb  /xdisk/bhurwitz/bh_class/$netid/assignments/02_intro_unix/hw02_intro_unix.ipynb