Introduction to Unix and the Command Line
===================================

## Hardware, software, and Operating Systems
- hardware = laptop, desktop, iPhone
- software = word processing, spreadsheet, graphics, games, email
- OS (also software) = Windows, Mac, Android, Ubuntu, Debian, Unix, Linux
## Computer Structure
![computer_diagram.png](attachment:8612988c-1372-4b09-a533-061d30c587f2.png)
## Interfaces
How we interact with the computer (input and output)

### Graphical User Interface (GUI)
- anything you see and click (File Explorer, Finder)
![image.png](attachment:1292487f-1abb-475c-846c-1c607c57cf1c.png)

### Command Line Interface (CLI)
![Screen Shot 2020-08-18 at 10.36.24 AM.png](attachment:c82b53fc-6036-4d1a-840e-3d4d7fb007e0.png)

![image.png](attachment:06c44e5d-24bf-4295-92e6-6d86892cd56e.png)

## File System


Now Let's Code Along
===================================

In the remainder of this document we demonstrate how to run shell commands within a Python notebook by starting each cell with `%%bash`. This tells the notebook to run the cell's contents with the `bash` shell of your computer's Unix-like system (either macOS or, for Windows users, the Linux distribution installed under WSL2). If you do not wish to use a Python notebook, you can open a terminal directly in VSCode by going to `Terminal > New Terminal` and typing the commands at the command line *without* the preceding `%%bash`. See the command line interface screenshot above for an example.

### Option 1: Open a Terminal in VSCode

![image.png](attachment:image.png)

### Option 2: Open a Python Notebook
Launch VSCode and click to create a new file

![image-2.png](attachment:image-2.png)

Select `Jupyter Notebook` from the dropdown menu. This will create a new Python notebook file with the `.ipynb` extension.

### Navigating the File System
The first thing we need to figure out is 'where are we' in the file system.  To do this, we want to *P*rint the *W*orking *D*irectory using the command `pwd`

In [1]:
%%bash 
pwd

/Users/loyalgoff/Library/CloudStorage/GoogleDrive-loyalgoff@gmail.com/My Drive/Work/Goff Lab/Teaching/Quantitative Neurogenomics/2023/course_materials/quant_mol_neuro_2023/modules/module1/notebooks


We can list the contents of the directory using the built-in program `ls`

In [2]:
%%bash
ls

Day1-overview.md
Day1.1-Morning-CLI_intro_and_git_AG.ipynb
Day1.1-Review_of_prereq_assignments
Day1.2-Morning-Intro_to_git_and_GitHub.ipynb
Day1.3-Afternoon-Intro_to_python_I.ipynb
Day1.3-Python_flow_control_and_functional_programming.ipynb
Day1.4-Afternoon-Intro_to_Python_II.ipynb
Foxp1.gbk
test.fa


# Wildcards and tab completion
- The `*` matches zero or more occurrences of any character
- The `?` matches exactly one occurrence of any character
- Another shortcut is tab completion: type the beginning of a file or directory name, then hit Tab to automatically fill in the rest

In [None]:
%%bash
ls *.fa
ls te?t.fa
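
As a minimal sketch of how the two wildcards behave, the following creates a few empty files in a temporary directory (the file names and the `demo_dir` variable are made up for illustration) and lets the shell expand each pattern. Note that `*` also matches an empty run of characters, which is why `te*.fa` picks up `te.fa`.

```shell
# Work in a throwaway directory so no real files are touched
demo_dir=$(mktemp -d)
cd "$demo_dir"
touch te.fa test.fa text.fa notes.txt

# The shell expands each pattern before the command runs
star_matches=$(echo te*.fa)       # * matches any run of characters, even none
question_matches=$(echo te?t.fa)  # ? matches exactly one character

echo "te*.fa  -> $star_matches"
echo "te?t.fa -> $question_matches"
```

Because the expansion is done by the shell, the same patterns work with any command, not only `ls`.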

#### Options/Input arguments
Bash/shell commands can take input arguments or options. One convention is to use a dash (`-`) to specify arguments. For example, we can ask ls to show a more detailed list of information for each file/folder:

In [None]:
%%bash
ls -l

We can aggregate different options by directly appending options one after another. The following shows how to display file sizes in human readable formats (`-h`):

In [None]:
%%bash
ls -lh

Commands can also take other kinds of arguments. Again using `ls` as an example, it can take a path as an argument. Without a path, it lists the current working directory, as shown above. Given a path, it lists the items in that path:

In [None]:
%%bash
ls ../
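
Later cells in this notebook use `ls -la`. As a small sketch of what the extra `-a` adds, the following lists a temporary directory with and without it. The directory and file names are invented for the example.

```shell
demo_dir=$(mktemp -d)
touch "$demo_dir/visible.txt" "$demo_dir/.hidden.txt"

plain=$(ls "$demo_dir")        # names starting with a dot are skipped
with_all=$(ls -a "$demo_dir")  # -a also shows dot files, plus . and ..

echo "ls:"
echo "$plain"
echo "ls -a:"
echo "$with_all"
```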

### Manual Pages (man)
It is certainly not expected that you memorize all arguments for every command.  This is where the manual (`man`) comes in handy.  You can use `man command_name` to find information about how to use a specific command. For example:

In [None]:
%%bash
man ls

Here, `man` is a command that takes one input argument (the name of a command) and outputs the corresponding manual page.

## Creating and Navigating Folders
Now that we have a basic overview of how to interact with the computer in bash, it will be useful to understand how to create folders (directories) and navigate around our system. We've already used the `pwd` command to learn where we currently are. But what if we wanted to make a new directory to contain a project?

The `mkdir` command stands for “make directory”. It takes in a directory name as an _argument_, and then creates a new directory in the current working directory.

In [4]:
%%bash
mkdir myDirectory

Nothing seemed to happen? Let's check and see if our new directory was made:

In [5]:
%%bash
ls -l

total 548
-rw-r--r-- 1 loyalgoff staff   1130 Aug 23 12:44 Day1-overview.md
-rw-r--r-- 1 loyalgoff staff  95002 Aug 23 12:20 Day1.1-Morning-CLI_intro_and_git_AG.ipynb
drwxr-xr-x 2 loyalgoff staff     68 Aug 23 12:29 Day1.1-Review_of_prereq_assignments
-rw-r--r-- 1 loyalgoff staff 385296 Aug 23 12:20 Day1.2-Morning-Intro_to_git_and_GitHub.ipynb
-rw-r--r-- 1 loyalgoff staff  48002 Aug 23 12:43 Day1.3-Afternoon-Intro_to_python_I.ipynb
drwxr-xr-x 2 loyalgoff staff     68 Aug 23 12:32 Day1.3-Python_flow_control_and_functional_programming.ipynb
-rw-r--r-- 1 loyalgoff staff    903 Aug 23 12:45 Day1.4-Afternoon-Intro_to_Python_II.ipynb
-rw-r--r-- 1 loyalgoff staff  11725 Aug 23 12:20 Foxp1.gbk
drwxr-xr-x 2 loyalgoff staff     68 Aug 23 12:46 myDirectory
-rw-r--r-- 1 loyalgoff staff    528 Aug 23 12:20 test.fa


There it is. Let's try to move into the directory.

`cd` stands for “change directory”. Just as you would click on a folder in Windows Explorer or Finder on a Mac, `cd` switches you into the directory you specify. In other words, `cd` changes the working directory.

In [6]:
%%bash
cd myDirectory
pwd

/Users/loyalgoff/Google Drive/Work/Goff Lab/Teaching/BCMB/Bootcamp/Bootcamp2021/bcmb_bootcamp/day1/notebooks/myDirectory


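We can move back up into the parent directory by using the shortcut `..`. Note that each `%%bash` cell starts in the notebook's own directory, so the `cd` from the previous cell does not carry over; the output below is therefore the parent of the notebook directory.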

In [7]:
%%bash
cd ..
pwd

/Users/loyalgoff/Google Drive/Work/Goff Lab/Teaching/BCMB/Bootcamp/Bootcamp2021/bcmb_bootcamp/day1


And finally, we can remove an (empty) directory using `rmdir`.

In [2]:
%%bash
rmdir myDirectory
ls

Day1 - Afternoon - Unix_and_Bash.ipynb
Day1-overview.md
Day1.0-Morning-Review_of_prereq_assignments.ipynb
Day1.1-Morning-Unix_I.ipynb
Day1.2-Morning-Intro_to_git_and_GitHub.ipynb


rmdir: myDirectory: No such file or directory


### Review
* The command line is a text interface for the computer’s operating system. To access the command line, we use the terminal.
* A filesystem organizes a computer’s files and directories into a tree. It starts with the root directory. Each parent directory can contain more child directories and files.
* From the command line, you can navigate through files and folders on your computer
  + `pwd` outputs the name of the current working directory.
  + `ls` lists all files and directories in the working directory.
  + `cd` switches you into the directory you specify.
  + `mkdir` creates a new directory in the working directory.
  + `rmdir` removes an empty directory
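
The navigation commands above can be strung together in a single shell session. The sketch below does this inside a temporary directory (the `start_dir` variable is an illustrative name) so that it can be run safely. In a notebook, the lines need to share one `%%bash` cell for the `cd` commands to take effect.

```shell
start_dir=$(mktemp -d)   # scratch location for the demo
cd "$start_dir"

mkdir myDirectory        # create a new, empty directory
cd myDirectory           # step into it
inside=$(pwd)
cd ..                    # step back up to the parent
back=$(pwd)
rmdir myDirectory        # works because the directory is empty

echo "inside: $inside"
echo "back:   $back"
ls                       # prints nothing: the directory is gone
```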

## Viewing and changing files

### Creating files
There are several ways to create a file. One of the easiest is to create an empty file by touching it (`touch`).


In [None]:
%%bash
touch myBrandNewFile.txt

In [None]:
%%bash
ls -l

Since this file is empty, we should add something to it. We can write directly to a file by _redirecting_ some content into it. This is achieved with `>`; imagine it as an arrow pointing to where you want the output to go. Here we also introduce the `echo` command, which simply prints its arguments. We will _redirect_ the output of `echo` into our new file.

In [None]:
%%bash
echo 'Hello World' > myBrandNewFile.txt

We can now view the contents of the file using the command `cat` (or, for longer files, page through it with `more`):

In [None]:
%%bash
cat myBrandNewFile.txt

more myBrandNewFile.txt
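
One detail worth seeing in action: `>` replaces whatever the target file already contains. The sketch below shows this in a temporary directory, and also uses `>>`, an operator not covered above, which appends instead of overwriting. File and variable names are illustrative.

```shell
demo_dir=$(mktemp -d)
cd "$demo_dir"

echo 'Hello World' > greeting.txt    # creates the file
echo 'Goodbye' > greeting.txt        # > overwrites the previous content
after_overwrite=$(cat greeting.txt)

echo 'Hello again' >> greeting.txt   # >> adds a line to the end
after_append=$(cat greeting.txt)

echo "after overwrite:"
echo "$after_overwrite"
echo "after append:"
echo "$after_append"
```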

### Moving and removing files

The `mv` command moves or renames files and directories. Here we use it to rename our file:

In [None]:
%%bash
mv myBrandNewFile.txt myOlderFile.txt
ls -l

And finally, we can remove a file using the `rm` command. `rm` removes files (and, with the `-r` option, directories) <font color='red'>(removed files will be gone forever, proceed with caution)</font>:

In [None]:
%%bash
rm myOlderFile.txt
ls -la
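
`mv` does two jobs depending on its last argument: if it names an existing directory, the file is moved into it; otherwise the file is renamed. A minimal sketch, using made-up names in a temporary directory:

```shell
demo_dir=$(mktemp -d)
cd "$demo_dir"
touch draft.txt
mkdir archive

mv draft.txt final.txt   # last argument is not a directory: rename
mv final.txt archive/    # last argument is a directory: move into it
ls archive               # final.txt

rm archive/final.txt     # permanent: there is no trash folder to restore from
rmdir archive            # the directory is empty again, so rmdir succeeds
```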

### File properties

In [9]:
%%bash
du test.fa

du -h test.fa #outputs in 'human-readable' format (byte, kb, mb, etc)

4	test.fa
4.0K	test.fa


Let's see what's actually inside this file:

In [12]:
%%bash
more test.fa

>dna_sequence_1
TGGCGTTGTCTTTAATTCGATTAGCATTGCATATCGATTATCTAGCGATATGCTATGCTTAGC
>dna_sequence_2
AGCCATATGTATCGAGCGATATGCGATACGAGTATCGAGTATGCAGTATGCATTGCAGTATGC
>dna_sequence_3
TGGGTATGCGCGACGAGCATTATACGATGCATTATTACGGATCTACGGCGATATTACGTACGA
>dna_sequence_4
GGACGTATCGAGTCTAGCGAGCGACTTCGAGCGATATCGGACTCTCGTCCTCTTCAGTCAGCC
>dna_sequence_5
TTTACGGACTTACGGACTAGCTGAGCTAGCTACGATCGATCGATCGTAGCTACGATCGTAGCT
>dna_sequence_6
CATCGATCGTAGC
>dna_sequence_7
ACACGGACTAGCGGATCTATCTGTACTGAGCGTATCTGACGGTAGCTATCGGACGTATCGGACGGACACAGCGTATGCGAC


To count the number of lines, words, and characters in your file:

In [10]:
%%bash
wc test.fa # lines, words, and bytes
wc -l test.fa # lines only
wc -m test.fa # characters only

 14  14 528 test.fa
14 test.fa
528 test.fa
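
To make the three numbers that a bare `wc` prints easier to interpret, here is a small sketch on a hand-made FASTA-like file (the file name and contents are invented). Reading the file through `<` keeps the file name out of the output, and `tr -d ' '` strips the padding that some versions of `wc` add.

```shell
demo_dir=$(mktemp -d)
cd "$demo_dir"
printf '>seq1\nACGT\n>seq2\nGG\n' > tiny.fa

wc tiny.fa    # three counts (lines, words, bytes), then the file name

lines=$(wc -l < tiny.fa | tr -d ' ')
words=$(wc -w < tiny.fa | tr -d ' ')
chars=$(wc -m < tiny.fa | tr -d ' ')
echo "lines=$lines words=$words chars=$chars"
```

The character count includes the newline at the end of every line, which is why it is larger than the number of visible letters.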


### Downloading files from the internet

In [4]:
%%bash
wget http://sgd-archive.yeastgenome.org/sequence/S288C_reference/chromosomes/fasta/chr01.fsa

--2021-08-25 20:16:17--  http://sgd-archive.yeastgenome.org/sequence/S288C_reference/chromosomes/fasta/chr01.fsa
Resolving sgd-archive.yeastgenome.org... 52.218.236.250
Connecting to sgd-archive.yeastgenome.org|52.218.236.250|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 234177 (229K) [binary/octet-stream]
Saving to: 'chr01.fsa'

     0K .......... .......... .......... .......... .......... 21%  300K 1s
    50K .......... .......... .......... .......... .......... 43%  607K 0s
   100K .......... .......... .......... .......... .......... 65% 2.97M 0s
   150K .......... .......... .......... .......... .......... 87%  769K 0s
   200K .......... .......... ........                        100% 1.84M=0.3s

2021-08-25 20:16:17 (662 KB/s) - 'chr01.fsa' saved [234177/234177]



If we check the directory listing, we should now see a new file `chr01.fsa`

In [5]:
%%bash
ls -la

total 3728
drwxr-xr-x   9 loyalgoff  staff      288 Aug 25 20:16 .
drwxr-xr-x   5 loyalgoff  staff      160 Aug 23 12:21 ..
drwxr-xr-x  15 loyalgoff  staff      480 Aug 25 20:02 .ipynb_checkpoints
-rw-r--r--   1 loyalgoff  staff      550 Aug 25 19:58 Day1 - Afternoon - Unix_and_Bash.ipynb
-rw-------   1 loyalgoff  staff     1080 Aug 23 12:58 Day1-overview.md
-rw-r--r--   1 loyalgoff  staff      556 Aug 25 19:50 Day1.0-Morning-Review_of_prereq_assignments.ipynb
-rw-------   1 loyalgoff  staff  1266402 Aug 25 20:15 Day1.1-Morning-Unix_I.ipynb
-rw-------   1 loyalgoff  staff   385296 Aug 23 12:20 Day1.2-Morning-Intro_to_git_and_GitHub.ipynb
-rw-r--r--@  1 loyalgoff  staff   234177 Oct 25  2019 chr01.fsa


Let's take a quick peek inside.  This is a large file so maybe we only want to see the first few lines...

In [6]:
%%bash
head chr01.fsa

>tpg|BK006935.2| [organism=Saccharomyces cerevisiae S288c] [strain=S288c] [moltype=genomic] [chromosome=I] [note=R64-1-1]
CCACACCACACCCACACACCCACACACCACACCACACACCACACCACACCCACACACACA
CATCCTAACACTACCCTAACACAGCCCTAATCTAACCCTGGCCAACCTGTCTCTCAACTT
ACCCTCCATTACCCTGCCTCCACTCGTTACCCTGTCCCATTCAACCATACCACTCCGAAC
CACCATCCATCCCTCTACTTACTACCACTCACCCACCGTTACCCTCCAATTACCCATATC
CAACCCACTGCCACTTACCCTACCATTACCCTACCATCCACCATGACCTACTCACCATAC
TGTTCTTCTACCCACCATATTGAAACGCTAACAAATGATCGTAAATAACACACACGTGCT
TACCCTACCACTTTATACCACCACCACATGCCATACTCACCCTCACTTGTATACTGATTT
TACGTACGCACACGGATGCTACAGTATATACCATCTCAAACTTACCCTACTCTCAGATTC
CACTTCACTCCATGGCCCATCTCTCACTGAATCAGTACCAAATGCACTCACATCATTATG


### File compression/uncompression
Many files and file types can be compressed to save space (e.g. 'zipping' a file). You will often have to uncompress a file that you download from the internet before you can read or use it. Similarly, you may have to compress a file before uploading it or to save disk space. Compression and uncompression are done using programs such as `tar` and/or `gzip`/`gunzip`.

In [19]:
%%bash
gzip test.fa # to compress the file

gunzip test.fa.gz # to uncompress the file
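
By default `gzip` replaces the original file with the `.gz` version. As a sketch of how to compare sizes without losing the original, the example below uses the `-c` flag, which writes the result to standard output so it can be redirected to a new file. The file names are invented, and the repetitive content is chosen only because it compresses well.

```shell
demo_dir=$(mktemp -d)
cd "$demo_dir"

# Build a small, repetitive text file (200 identical lines)
i=0
while [ "$i" -lt 200 ]; do
  echo ACGTACGTACGTACGTACGT
  i=$((i + 1))
done > repeats.txt

gzip -c repeats.txt > repeats.txt.gz       # compress, keep the original
gunzip -c repeats.txt.gz > roundtrip.txt   # uncompress into a new file

original_size=$(wc -c < repeats.txt | tr -d ' ')
compressed_size=$(wc -c < repeats.txt.gz | tr -d ' ')
echo "original: $original_size bytes, compressed: $compressed_size bytes"
cmp repeats.txt roundtrip.txt && echo "round trip is identical"
```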

Try it out for yourself!

In [7]:
%%bash
wget http://hgdownload.cse.ucsc.edu/goldenpath/hg19/database/knownGene.txt.gz

--2021-08-25 20:16:37--  http://hgdownload.cse.ucsc.edu/goldenpath/hg19/database/knownGene.txt.gz
Resolving hgdownload.cse.ucsc.edu... 128.114.119.163
Connecting to hgdownload.cse.ucsc.edu|128.114.119.163|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4270683 (4.1M) [application/x-gzip]
Saving to: 'knownGene.txt.gz'

     0K .......... .......... .......... .......... ..........  1%  266K 15s
    50K .......... .......... .......... .......... ..........  2%  567K 11s
   100K .......... .......... .......... .......... ..........  3% 6.11M 8s
   150K .......... .......... .......... .......... ..........  4%  514K 8s
   200K .......... .......... .......... .......... ..........  5%  605K 7s
   250K .......... .......... .......... .......... ..........  7% 1.72M 6s
   300K .......... .......... .......... .......... ..........  8%  570K 6s
   350K .......... .......... .......... .......... ..........  9% 7.84M 6s
   400K .......... .......... .......... ....

# File sizes

In [None]:
%%bash 
du chr01.fsa
du -h chr01.fsa
wc chr01.fsa
wc -l chr01.fsa
wc -m chr01.fsa

### Practice Exercise

Download this file containing the genome sequence of E. coli K12 using `wget` command:
https://github.com/doxeylab/learn-genomics-in-unix/raw/master/task1/e-coli-k12-genome.fasta.gz 

1. What is the size of this compressed file in megabytes?
2. Uncompress the file. What is the size now in megabytes?
3. How many lines are in the uncompressed file?


# Viewing files
- use up and down arrows to scroll
- spacebar scrolls down
- hit q to quit

In [None]:
less chr01.fsa

In [9]:
%%bash
head chr01.fsa

>tpg|BK006935.2| [organism=Saccharomyces cerevisiae S288c] [strain=S288c] [moltype=genomic] [chromosome=I] [note=R64-1-1]
CCACACCACACCCACACACCCACACACCACACCACACACCACACCACACCCACACACACA
CATCCTAACACTACCCTAACACAGCCCTAATCTAACCCTGGCCAACCTGTCTCTCAACTT
ACCCTCCATTACCCTGCCTCCACTCGTTACCCTGTCCCATTCAACCATACCACTCCGAAC
CACCATCCATCCCTCTACTTACTACCACTCACCCACCGTTACCCTCCAATTACCCATATC
CAACCCACTGCCACTTACCCTACCATTACCCTACCATCCACCATGACCTACTCACCATAC
TGTTCTTCTACCCACCATATTGAAACGCTAACAAATGATCGTAAATAACACACACGTGCT
TACCCTACCACTTTATACCACCACCACATGCCATACTCACCCTCACTTGTATACTGATTT
TACGTACGCACACGGATGCTACAGTATATACCATCTCAAACTTACCCTACTCTCAGATTC
CACTTCACTCCATGGCCCATCTCTCACTGAATCAGTACCAAATGCACTCACATCATTATG


In [10]:
%%bash
# Print the first 20 lines of your file
head -n 20 chr01.fsa

>tpg|BK006935.2| [organism=Saccharomyces cerevisiae S288c] [strain=S288c] [moltype=genomic] [chromosome=I] [note=R64-1-1]
CCACACCACACCCACACACCCACACACCACACCACACACCACACCACACCCACACACACA
CATCCTAACACTACCCTAACACAGCCCTAATCTAACCCTGGCCAACCTGTCTCTCAACTT
ACCCTCCATTACCCTGCCTCCACTCGTTACCCTGTCCCATTCAACCATACCACTCCGAAC
CACCATCCATCCCTCTACTTACTACCACTCACCCACCGTTACCCTCCAATTACCCATATC
CAACCCACTGCCACTTACCCTACCATTACCCTACCATCCACCATGACCTACTCACCATAC
TGTTCTTCTACCCACCATATTGAAACGCTAACAAATGATCGTAAATAACACACACGTGCT
TACCCTACCACTTTATACCACCACCACATGCCATACTCACCCTCACTTGTATACTGATTT
TACGTACGCACACGGATGCTACAGTATATACCATCTCAAACTTACCCTACTCTCAGATTC
CACTTCACTCCATGGCCCATCTCTCACTGAATCAGTACCAAATGCACTCACATCATTATG
CACGGCACTTGCCTCAGCGGTCTATACCCTGTGCCATTTACCCATAACGCCCATCATTAT
CCACATTTTGATATCTATATCTCATTCGGCGGTCCCAAATATTGTATAACTGCCCTTAAT
ACATACGTTATACCACTTTTGCACCATATACTTACCACTCCATTTATATACACTTATGTC
AATATTACAGAAAAATCCCCACAAAAATCACCTAAACATAAAAATATTCTACTTTTCAAC
AATAATACATAAACATATTGGCTTGTGGTAGCAACACTATCATGGTATCACTAACGTAAA
AGTTCCTCAATATTGCAATTTGCT

In [None]:
%%bash
tail chr01.fsa

In [None]:
%%bash
# Print the last 10 lines of your file
tail -n 10 chr01.fsa
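
A quick way to convince yourself what `-n` does is to run `head` and `tail` on a file whose lines are numbered. This sketch writes the numbers 1 to 10 into a temporary file (the names are made up for the example).

```shell
demo_dir=$(mktemp -d)
cd "$demo_dir"
printf '%s\n' 1 2 3 4 5 6 7 8 9 10 > numbers.txt

first_three=$(head -n 3 numbers.txt)   # lines 1, 2, 3
last_two=$(tail -n 2 numbers.txt)      # lines 9, 10

echo "$first_three"
echo "$last_two"
```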

# Searching in files 
## pattern finding with grep
A common task in programming is to search for a string within a file, and `grep` is a powerful command for doing so. `grep` takes a search string or regular expression as its first argument and one or more target files as the remaining arguments. With only these arguments, `grep` searches the files and prints the lines containing the search string.

In [11]:
%%bash 
grep "TACCCTACC" chr01.fsa

CAACCCACTGCCACTTACCCTACCATTACCCTACCATCCACCATGACCTACTCACCATAC
TACCCTACCACTTTATACCACCACCACATGCCATACTCACCCTCACTTGTATACTGATTT
2


If you add the `-c` option, `grep` instead returns the number of lines containing the search string:

In [None]:
%%bash
grep -c "TACCCTACC" chr01.fsa
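
A common use of `grep -c` with FASTA files is counting sequences, since every record starts with a `>` header line. The sketch below runs on a tiny invented file. The pattern must be quoted; an unquoted `>` would be read by the shell as output redirection.

```shell
demo_dir=$(mktemp -d)
cd "$demo_dir"
printf '>seq1\nACGT\n>seq2\nGGCC\n>seq3\nTTAA\n' > tiny.fa

grep '>' tiny.fa                    # prints the three header lines
n_records=$(grep -c '>' tiny.fa)    # counts them instead
n_with_gg=$(grep -c 'GG' tiny.fa)   # number of lines containing GG

echo "records: $n_records, lines with GG: $n_with_gg"
```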

# Piping/redirection

### Output Redirection

The `>` redirects the output of the command on its left to the file specified on its right, replacing the file's contents if it already exists.

In [15]:
%%bash
tail -n 30 chr01.fsa > tail.txt

### Input Redirection

The `<` redirects input from the file on its right to the command on its left.

In [16]:
%%bash
sort < tail.txt

AAATGTACTAATGGAATGATATATTAATATATAGTGTGTTTATACCTTATTATTGATGAT
AACTGTTATGGGGTGCTTCTATTGGGACTATGGACGCTGATAAAGAAAGACTAAGATTAT
AAGACTGGTACCAAAGGTAATGCATCTACCTCCCGTTACTTTTCCGAATCAGACAGTGTT
AAGGTACAGCCGTCTACAACGTGTGTGAATTTGCTAACCAATTCGGTGTTCCATGTATGG
AATTGAACATGATCAAATGGATTAAAGAAACTTTCCCAGATTTGGAAATCATTGCTGGTA
ACGTTGTCACCAAGGAACAAGCTGCCAATTTGATTGCTGCCGGTGCGGACGGTTTGAGAA
ATATTGGGCAGGGGATAGATGGTTGTTGGGGTGTGGTGATGGATAGTGAGTGGATAGTGA
ATGTGGTATGGTATCGAGTACCGATGGAGTGAGAGATGGCCTTGGTGTAGAGTATTATGG
CAACTAGAAGGTGGTGTTAATAACTTACATTCCTACGAAAAACGTTTACATAACTGAATG
CCGTACTTGTACAATGGATTACAACATTCTTGTCAAGACATCGGCTGTAGGTCGTTAACT
CGGGTAAGTTAGATGATGTATTGTTTACGTTATATTTGTTTAAATTGGATTTGTTTACAT
CTGATGGTGGTGTTCAAAAACATTGGTCATATTATTACCAAAGCTTTGGCTCTTGGTTCT
GAACTGATTTAATGAAAAATCAGAAGTACCCATTAGCGTCCAAATCTGCCAACACCAAGC
GGATTGTGATGATGGAGAGGGAGGGTAGTTGACATGGAGTTAGAATTGGGTCAGTGTTAG
GGTGTGGGTGTGGGTGTGGTGTGGTGTGTGGGTGTGGTGTGGGTGTGGTGTGTGTGGG
GTATACATAAAATGGGTAGTGGATATTTGTATAGAAAGGGCATTACGCATGGAGTTAAGA
GTATTTACATGATAATTGGGGTTCCG
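
One visible difference between passing a file name and redirecting input: with `<`, the command only sees a stream of text and never learns the file name. A small sketch with an invented file:

```shell
demo_dir=$(mktemp -d)
cd "$demo_dir"
printf 'pear\napple\nfig\n' > fruit.txt

wc -l fruit.txt      # prints the count followed by the file name
wc -l < fruit.txt    # prints only the count

sorted=$(sort < fruit.txt)   # sort reads the file through standard input
echo "$sorted"
```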

## Chaining commands
You can use the pipe operator `|` to chain together many different commands. Each pipe takes the output of one command and uses it as the input for the next one in the chain.

In [17]:
%%bash
grep "TACCCTACC" chr01.fsa | wc -l  
# will count the number of lines containing the string "TACCCTACC"
# or alternatively
cat chr01.fsa | grep "TACCCTACC" | wc -l  # does the same thing as above

       2
       2
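
As a further sketch of chaining, the pipeline below pulls the header lines out of a small invented FASTA file, strips the leading `>` with `tr`, and sorts the names. `tr -d` is not introduced above; it simply deletes the listed characters from its input.

```shell
demo_dir=$(mktemp -d)
cd "$demo_dir"
printf '>seqB\nACGT\n>seqA\nGGCC\n' > tiny.fa

# Each | hands the output of the command on its left to the one on its right
names=$(grep '>' tiny.fa | tr -d '>' | sort)
echo "$names"

n_headers=$(grep '>' tiny.fa | wc -l | tr -d ' ')
echo "headers: $n_headers"
```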


This can be used to string together commands to streamline workflows. For example:
`cut | sort | uniq`

## cut | sort | uniq

### `cut` - remove sections from each line of files
`cut` can grab columns of data from a delimited (in this case tab-delimited) file.

```
$ man cut

$ cut -f1 data/hg38genes.txt

$ cut -f2,3 data/hg38genes.txt
```

### `sort` - sort or merge records (lines) of text and binary files
We can combine the power of `cut` with another command-line tool, `sort` by using the pipe (`|`) operator.

```
# sort all chromosomes
$ cut -f1 data/hg38genes.txt | sort 

# sort all start positions
$ cut -f2 data/hg38genes.txt | sort
```
^ What happened here? Let's `man sort` to see if we can figure out a solution.
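
A likely explanation for the odd ordering of the start positions is that `sort` compares lines as text by default, character by character, so `100` lands before `9`. The `-n` option asks for a numeric comparison instead. The sketch below uses invented numbers rather than the contents of `data/hg38genes.txt`.

```shell
demo_dir=$(mktemp -d)
cd "$demo_dir"
printf '9\n100\n10\n' > starts.txt

as_text=$(sort starts.txt)         # compares character by character
as_numbers=$(sort -n starts.txt)   # compares numeric values

echo "default order:"
echo "$as_text"
echo "numeric order:"
echo "$as_numbers"
```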

### `uniq` - report or filter out repeated lines in a file
As its name implies, `uniq` will take the input provided and collapse it to the unique set of rows.
When combined with `cut` and `sort` this can be a very handy set of tools for summarizing tabular data in files.

```
#Find unique chromosome names
$ cut -f1 data/hg38genes.txt | sort | uniq
```
What if we added the argument `-c` to `uniq`? (hint: `man uniq` for answer)
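
`uniq` only collapses lines that are next to each other, which is why `sort` comes first in the pipeline. The sketch below shows this, together with `uniq -c`, on a tiny tab-delimited table. The column layout (chromosome, start, strand) and the values are invented and are not meant to match `data/hg38genes.txt`.

```shell
demo_dir=$(mktemp -d)
cd "$demo_dir"
printf 'chr1\t100\t+\nchr2\t50\t-\nchr1\t900\t-\n' > toy_genes.txt

unsorted=$(cut -f1 toy_genes.txt | uniq)           # chr1 still appears twice
sorted=$(cut -f1 toy_genes.txt | sort | uniq)      # one line per chromosome
counts=$(cut -f1 toy_genes.txt | sort | uniq -c)   # each line prefixed by its count

echo "without sort:"
echo "$unsorted"
echo "with sort:"
echo "$sorted"
echo "with counts:"
echo "$counts"
```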

#### Exercises:
Using the combination of `cut | sort | uniq`, find:

1. How many genes are on each strand ('+', '-') in the file hg38genes.txt?

2. How many genes are on each strand, for each chromosome?


# Resources
* Cheat sheet containing many useful *nix commands:
    - https://files.fosswire.com/2007/08/fwunixref.pdf
