<a href="https://colab.research.google.com/github/hcpy/Local-GH-Data-for-Python-Project/blob/main/Copy_of_Hyun_unit_1_notebook_1_linux_%5BTeam_Python%5D.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Unit 1 - Instructional DEMO 1: Basic linux commands in a python environment
 - **Focus:** Basic linux commands, a bash script and some basic python syntax examples. 
 - **Author(s):** Sara B-C.
 - **Date Notebook Last Modified:** 08.09.2020
 - **Quick Description:** Use this notebook for your first adventure in programming. In the beginning, just hit play at each cell and watch things work. Once you are done, you can download the finished results. 

---
## Code outline
  0. Set up file stream.
  1. "Hello world, welcome to linux".
  2. Move, copy and delete files.   
  3. Check out the first few lines and the last few lines of a file. 
  4. "I'd like to see a specific column of the file".
  5. Can we smash two files together and count some lines?
  6. Filtering rows with `grep`.
  7. Sort it out.

---
## Additional notes
*   In notebooks, basic linux commands begin with a '! '.
*   The only thing that is really tricky so far will be starting google file stream to access other files on your drive.

## 0. Lets set up filestream access
Follow the directions on screen as you run the code cell below and then you can access the data stored on your 'My Drive'. For many of you, this is the first python code you will ever execute knowingly, as most google infrastructure is python based (another reason why the language is growing)!

In [5]:
import os
from google.colab import drive
drive.mount('/content/drive/')
os.chdir("/content/drive/MyDrive/Python course backup/example_data")

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


Above you should now see the output "Mounted at /content/drive/". This means your storage is now connected to your notebook and its runtime. We'll cover what the above code means once you have learned some more python.

In [6]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Exploring directories

Here we'll explore some directories you have just mounted.  

Remember, all linux shell code needs a `! ` prefix to run in notebooks.  Also remember that a `#` at the beginning of a line of code denotes a comment.

`pwd` tells you the directory you are currently working in.  

`ls` lets you know what files exist, **lets play around with that for a second**.

In [2]:
# What directory are you in?
! pwd

# What files are there?
! ls

# Check out some of the example data and see a list of how big it is, its permissions, who made it and when it was last modified. Also note the \ preceding the space.
! ls -lsth /content/MyDrive/FAES_BIOF309/example_data/discrete

/content
drive  sample_data
ls: cannot access '/content/MyDrive/FAES_BIOF309/example_data/discrete': No such file or directory


## 1. "Hello world!" the first time you and your computer chat.
We'll make a simple print to screen command using `echo`. Note, you'll see your first error below!

In [None]:
# If you just type `Hello world` the computer will think you are trying to run a command called `Hello` that has the `world!` option specificed.
Hello world!

SyntaxError: ignored

In [None]:
# Let's do this right and print it to the screen.
! echo "Hello world!"

Hello world!


## 2. Copy, move, rename and delete files.
Lets talk about some file management. We'll also play with directory structures.

Lets make directory to play with using `mkdir`.

In [3]:
! mkdir /content/drive/My\ Drive/FAES_BIOF309/temp/

mkdir: cannot create directory ‘/content/drive/My Drive/FAES_BIOF309/temp/’: No such file or directory


Lets go to the `temp` folder

In [None]:
! cd temp

Lets leave the directory

In [None]:
! cd ..

Make a copy in the new directory with `cp`.

In [None]:
! cp /content/drive/My\ Drive/FAES_BIOF309/example_data/discrete/example_GWAS.csv /content/drive/My\ Drive/FAES_BIOF309/temp/

Rename the file using `mv`, also note that you can move files using mv... just leave off the file name for the last half of the options. The second line of code conforms the name is now 'GWAS.csv'.

In [None]:
! mv /content/drive/My\ Drive/FAES_BIOF309/temp/example_GWAS.csv /content/drive/My\ Drive/FAES_BIOF309/temp/GWAS.csv 
! ls /content/drive/My\ Drive/FAES_BIOF309/temp/

GWAS.csv  TrimmedDataSet.csv.gsheet


Wait, I don't like that file that we just renamed, lets delete it with `rm`! Then check if it is there again with `ls`.

In [None]:
! rm /content/drive/My\ Drive/FAES_BIOF309/temp/GWAS.csv 
! ls /content/drive/My\ Drive/FAES_BIOF309/temp/

TrimmedDataSet.csv.gsheet


Okay, lets get rid of the entire directory using `rm -r`. THIS IS THE MOST DANGEROUS COMMAND!

In [None]:
  ! rm -r /content/drive/My\ Drive/FAES_BIOF309/

Is the directory still there? If everything worked, the `ls` below should be empty.

In [None]:
! ls /content/drive/My\ Drive/FAES_BIOF309/

shell-init: error retrieving current directory: getcwd: cannot access parent directories: No such file or directory
ls: cannot access '/content/drive/My Drive/FAES_BIOF309/': No such file or directory


What if we just want to list files with a certain extension like `*.bim`?

In [None]:
! ls /content/drive/My\ Drive/FAES_BIOF309/example_data/discrete/*.bim

shell-init: error retrieving current directory: getcwd: cannot access parent directories: No such file or directory
ls: cannot access '/content/drive/My Drive/FAES_BIOF309/example_data/discrete/*.bim': No such file or directory


## 3. Check out the first few lines and the last few lines of a file.
This is probably one of the most common things you will do before more detailed data explorations. Easy way to see what is in a huge file without opening it... or to see what is at the end!

Lets look at the top of the GWAS file from earlier using `head`. This command does the top few lines. It's partner `tail` does the last few lines.

In [None]:
! head /content/drive/My\ Drive/FAES_BIOF309/example_data/discrete/example_GWAS.csv

shell-init: error retrieving current directory: getcwd: cannot access parent directories: No such file or directory
head: cannot open '/content/drive/My Drive/FAES_BIOF309/example_data/discrete/example_GWAS.csv' for reading: No such file or directory


Cool, a tab delimited file with a header, the first column is called "SNP".

In [None]:
! tail /content/drive/My\ Drive/FAES_BIOF309/example_data/discrete/example_GWAS.csv

shell-init: error retrieving current directory: getcwd: cannot access parent directories: No such file or directory
tail: cannot open '/content/drive/My Drive/FAES_BIOF309/example_data/discrete/example_GWAS.csv' for reading: No such file or directory


## 4. "I'd like to see a specific column of the file".
Linux shell has some great tools for working with delimited text files, particularly tabular data in columns.

Lets make a file that is only the first column of the GWAS file. `-f` means column 1 in a file delimited `-d` by a comma. Then we check the top few lines of the file to confirm it worked. This can also be done with the more complicated `awk` command, which is almost a language itself for text and tabular data manipulation.

In [None]:
! cut -f 1 -d ',' /content/drive/My\ Drive/FAES_BIOF309/example_data/discrete/example_GWAS.csv > column1_cut.txt
! head column1_cut.txt

shell-init: error retrieving current directory: getcwd: cannot access parent directories: No such file or directory
/bin/bash: column1_cut.txt: No such file or directory
shell-init: error retrieving current directory: getcwd: cannot access parent directories: No such file or directory
head: cannot open 'column1_cut.txt' for reading: No such file or directory


Now why not use the mighty stream editor `sed` to change all occurances of the prefix "snp" to "snipped". Then lets use `head` to see how it worked.

In [None]:
! sed 's/snp/snipped/g' column1_cut.txt > column1_sed.txt
! head column1_sed.txt

shell-init: error retrieving current directory: getcwd: cannot access parent directories: No such file or directory
/bin/bash: column1_sed.txt: No such file or directory
shell-init: error retrieving current directory: getcwd: cannot access parent directories: No such file or directory
head: cannot open 'column1_sed.txt' for reading: No such file or directory


This can also be done with `awk`, a command which is almost a programming language in and of itself. Using the code below we can print the first and last columns of the GWAS file and convert the commas to spaces. 

In [None]:
! awk -F ',' '{print $1" "$7}' /content/drive/My\ Drive/FAES_BIOF309/example_data/discrete/example_GWAS.csv

shell-init: error retrieving current directory: getcwd: cannot access parent directories: No such file or directory
awk: cannot open /content/drive/My Drive/FAES_BIOF309/example_data/discrete/example_GWAS.csv (No such file or directory)


##   5. Can we smash two files together and count some lines?
Remember we can smash files together horizontally and vertically.

We can `cat` to merge vertically.

In [None]:
! cat column1_cut.txt column1_sed.txt > column1_cat.txt

shell-init: error retrieving current directory: getcwd: cannot access parent directories: No such file or directory
/bin/bash: column1_cat.txt: No such file or directory


Check the `head` and `tail` to confirm success.

In [None]:
! head column1_cat.txt
! tail column1_cat.txt

shell-init: error retrieving current directory: getcwd: cannot access parent directories: No such file or directory
head: cannot open 'column1_cat.txt' for reading: No such file or directory
shell-init: error retrieving current directory: getcwd: cannot access parent directories: No such file or directory
tail: cannot open 'column1_cat.txt' for reading: No such file or directory


To merge files easily along the vertical axis (y) we use `paste`.

In [None]:
! paste column1_cut.txt column1_sed.txt > column1_paste.txt

shell-init: error retrieving current directory: getcwd: cannot access parent directories: No such file or directory
/bin/bash: column1_paste.txt: No such file or directory


Check the head and tail to confirm success.



In [None]:
! head column1_paste.txt
! tail column1_paste.txt

shell-init: error retrieving current directory: getcwd: cannot access parent directories: No such file or directory
head: cannot open 'column1_paste.txt' for reading: No such file or directory
shell-init: error retrieving current directory: getcwd: cannot access parent directories: No such file or directory
tail: cannot open 'column1_paste.txt' for reading: No such file or directory


How many lines are in each of these files using `wc -l`?

In [None]:
! wc -l column1_paste.txt column1_cat.txt

shell-init: error retrieving current directory: getcwd: cannot access parent directories: No such file or directory
wc: column1_paste.txt: No such file or directory
wc: column1_cat.txt: No such file or directory
0 total


## 6. Filtering lines with `grep` (inclusion and exclusion).

In [None]:
# Remove any variant in the GWAS file that isn't in scientific notation by filtering out lines that do not contain "E-".
! grep "E-" /content/drive/My\ Drive/FAES_BIOF309/example_data/discrete/example_GWAS.csv 

grep: /content/drive/My Drive/FAES_BIOF309/example_data/discrete/example_GWAS.csv: No such file or directory


In [None]:
# Lets do the opposite and now exclude those lines by simply adding the `-v` option.
! grep -v "E-" /content/drive/My\ Drive/FAES_BIOF309/example_data/discrete/example_GWAS.csv 

grep: /content/drive/My Drive/FAES_BIOF309/example_data/discrete/example_GWAS.csv: No such file or directory


## 7. Sort the GWAS file by p-value and show the top few lines.
Second-to-last column is p-value and we are interested in smallest to largest.
`| head` is a pipe, you can add `|` to chain commands together. 

In [None]:
! sort --field-separator=',' -k 7,7 /content/drive/My\ Drive/FAES_BIOF309/example_data/discrete/example_GWAS.csv | head

sort: cannot read: '/content/drive/My Drive/FAES_BIOF309/example_data/discrete/example_GWAS.csv': No such file or directory


**Makes sense, now on to some exercises ...**

# Unit 1: Assignment # 1
***Come here to prove your knowledge.***

Text cells will indicate a task.  
Write your commands in the empty code cells below them.

## 1. Print a positive message to the screen, something nice.

In [None]:
! echo "Life is beautiful"

shell-init: error retrieving current directory: getcwd: cannot access parent directories: No such file or directory
Life is beautiful


## 2. What files are in `/content/drive/My\ Drive/FAES_BIOF309/example_data/discrete` and how large are these files?

In [9]:
! ls -1sth  /content/drive/MyDrive/Python course backup  

ls: cannot access '/content/drive/MyDrive/Python': No such file or directory
ls: cannot access 'course': No such file or directory
ls: cannot access 'backup': No such file or directory


## 3. Show the top and bottom lines of any file CSV in the directory you just looked at above.

In [10]:
! head /content/drive/MyDrive/Python course backup/example_data/discrete/training_addit.csv

head: cannot open '/content/drive/MyDrive/Python' for reading: No such file or directory
head: cannot open 'course' for reading: No such file or directory
head: cannot open 'backup/example_data/discrete/training_addit.csv' for reading: No such file or directory


## 4. In the file `/content/drive/My\Drive/FAES_BIOF309/example_data/discrete/training_pheno.csv', show me the top few lines of the last column only. 

In [None]:
! cut -f 1 -d ',' /content/drive/MyDrive/FAES_BIOF309/example_data (1)/discrete/training_pheno.csv
! head column1_cut.txt

shell-init: error retrieving current directory: getcwd: cannot access parent directories: No such file or directory
/bin/bash: -c: line 0: syntax error near unexpected token `('
/bin/bash: -c: line 0: ` cut -f 1 -d ',' /content/drive/MyDrive/FAES_BIOF309/example_data (1)/discrete/training_pheno.csv'
shell-init: error retrieving current directory: getcwd: cannot access parent directories: No such file or directory
head: cannot open 'column1_cut.txt' for reading: No such file or directory


## 5. There are 2 files called `*.bim` in the directory referenced in the cell above. Use `paste` to merge them and show the top and bottom of the files. 

## 6. Clean up any useless files we generated in the current directory during our exercises, these begin with "column1". Use `ls` to confirm.

## 7. Sort one of the `*.bim` files by the 4th column (numerically) and show the top few lines.

## 8. Filter some lines from a file using `grep`.

## 9. Pipe two commands together using `|`.

## 10. Write an equivalent `awk` and `cut` command. Two commands with near identical results.

In [4]:
! awk -F ',' '{print $1" "$7}' /content/drive/My\ Drive/FAES_BIOF309/example_data/discrete/example_GWAS.csv

awk: cannot open /content/drive/My Drive/FAES_BIOF309/example_data/discrete/example_GWAS.csv (No such file or directory)


# Thanks, see you in the next unit!