# FSH 510 Walkthrough: 
### Basic tools for improving reproducibility and efficiency of our bioinformatics

<br>
<br>

# 1. Data Integrity

Data we download is the starting point for all of our analyses and conclusions. We should be concerned with its integrity, for our own work, and for reproducibility.

## Check sums

Although it seems improbable, the risk of data corruption during transfers is real. Check sums can be used to verify successful transfers.

A **check sum** is related to a hash function: it is a function that takes a file as input and outputs a short, fixed-length character string. Two different files are very unlikely to produce the same string, so even a small change in your file will result in a noticeably different check sum.

#### Example

Here's a file with 1000 fastq reads. 

One section of the file looks like this:

![image](https://github.com/nclowell/RAD_Scallops/blob/master/Seminar/images_for_notebook/first_1.png?raw=true)

The check sum for this file is:

In [7]:
!md5sum first1000.txt

1b2f963dde9bebb703d0f24e0ffe8a84  first1000.txt


Then I went in and added an extra T. Now, the check sum for the changed file is:

![img](https://github.com/nclowell/RAD_Scallops/blob/master/Seminar/images_for_notebook/second_1.png?raw=true)

In [9]:
!md5sum first1000.txt

9191b9783d98fe7fb147f2efff1adbe4  first1000.txt
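
If you'd rather do this check from inside Python than call ``md5sum``, here is a minimal sketch using the standard-library ``hashlib`` module. The function names ``md5_of_file`` and ``transfer_ok`` are made up for this example, and the expected check sum is assumed to come from whoever provided the data.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, reading it in chunks
    so that large sequence files do not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b""
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def transfer_ok(path, expected_md5):
    """Compare a file's check sum with the one published by the data source.
    (Hypothetical helper: the expected value is an assumption of this sketch.)"""
    return md5_of_file(path) == expected_md5.strip().lower()
```

With the file above, ``transfer_ok("first1000.txt", "1b2f963dde9bebb703d0f24e0ffe8a84")`` should return ``True`` for the original file and ``False`` once the extra T has been added.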


## Using ``diff`` to find differences between files

If your check sum suggests that the files are different and you cannot retransfer the file, you can use ``diff`` to identify the regions that differ between the two files and, for example, fix them manually.

``diff`` prints blocks of text called "hunks" that show the differences. Here, the hunk is the fastq read that differed: the line starting with ``-`` comes from the first file and the line starting with ``+`` comes from the second.

In [13]:
!diff -u withT.txt withoutT.txt

--- withT.txt	2017-02-09 15:45:36.000000000 -0800
+++ withoutT.txt	2017-02-09 15:46:54.000000000 -0800
@@ -8,7 +8,7 @@
 
 @5_1101_31375_1103_1
 
-TGCAGGTCGGGGACTAGTAACAGAATAACTGATGTATCGTTCAGAATGTTGAAATAATCGTAGATGTGATTTAAATCATTTACTTTGCGGAACATATAAGGTCGTGATGTGTAGATTGTGTAAATCGCCACTTACGGATG
+TGCAGGTCGGGGACTAGTAACAGAATAACTGATGTATCGTTCAGAATGTTGAAAAATCGTAGATGTGATTTAAATCATTTACTTTGCGGAACATATAAGGTCGTGATGTGTAGATTGTGTAAATCGCCACTTACGGATG
 
 +
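
The same kind of comparison can be sketched in Python with the standard-library ``difflib`` module, which can produce output in the same unified format as ``diff -u``. The function name ``unified_diff_text`` and the toy reads in the usage note are made up for illustration.

```python
import difflib


def unified_diff_text(old_text, new_text, old_name="old", new_name="new"):
    """Return a unified diff between two strings as one string.
    An empty string means the two inputs are identical."""
    diff_lines = difflib.unified_diff(
        old_text.splitlines(keepends=True),
        new_text.splitlines(keepends=True),
        fromfile=old_name,
        tofile=new_name,
    )
    return "".join(diff_lines)
```

For example, ``print(unified_diff_text("@read1\nTGCAT\n", "@read1\nTGCAAT\n"))`` shows the changed sequence line once with a ``-`` prefix and once with a ``+`` prefix. This reads both files fully into memory, so for very large fastq files the command-line ``diff`` is probably the better tool.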
 


#### Questions for group

(1) How many of us verify the success of data downloads?

(2) How many of us document how and when we downloaded our data, and whether we verified that the download was successful?

<br>
<br>
<br>

# 2. Common Python modules 

In our lab, many of us switch between Python, R, Bash, and Excel for evaluating outputs of steps in our Stacks pipeline or formatting files for steps in the pipeline. It could be more efficient to automate multiple steps into one program.

At first, I did all dataframe work, plotting, and analyses in R because that's where I was comfortable and people told me Python wasn't great at these things. It turns out there are Python modules for many of the tasks we turn to R for, and they work just fine.

- ``numpy`` & ``scipy`` for the fundamentals of math and science computing
- ``matplotlib`` for basic plotting, based on MATLAB's plotting functions
- ``pandas`` for higher level computing with dataframes


And some other cool modules we could use to beef up our code:

- ``subprocess`` for running other programs from within a program
- ``argparse`` for managing user input
- ``time`` for time stamps and timing processes

#### ``numpy`` and ``scipy``

#### ``matplotlib``

Basic plotting in ``matplotlib`` isn't too different from basic plotting in R; it just isn't built in (you have to import it) and the syntax is different.
<br>
<br>

![image](https://github.com/nclowell/RAD_Scallops/blob/master/Seminar/images_for_notebook/pyplot.png?raw=true)


<br>
<br>
Example of multiple boxplots:

![image](https://github.com/nclowell/RAD_Scallops/blob/master/Seminar/images_for_notebook/pyplot_fig.png?raw=true)

#### ``argparse``

``argparse`` lets you define your command-line arguments instead of reading them straight from ``sys.argv`` (it uses ``sys.argv`` under the hood), and from those definitions it automatically writes a help message. You can specify flags, which allows your arguments to come in any order. You can also make some arguments required; if one is missing, the program exits and prints the argument you left out to the screen.

In the script, it looks like this:

<br>
![image](https://github.com/nclowell/RAD_Scallops/blob/master/Seminar/images_for_notebook/argparse.png?raw=true)
<br>

Notice that it doesn't actually take much more code than defining with ``sys.argv`` and commenting in your script.

And at the command line, the help file looks like this:

<br>
![image](https://github.com/nclowell/RAD_Scallops/blob/master/Seminar/images_for_notebook/helpfile.png?raw=true)
<br>
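
For anyone who can't see the screenshots, here is a minimal sketch of the same idea. It is not the script in the image: the description and the ``--input``, ``--output``, and ``--threads`` arguments are hypothetical, chosen only to show a required flag, a default value, and a typed value.

```python
import argparse


def build_parser():
    """Define the command-line arguments for a hypothetical script."""
    parser = argparse.ArgumentParser(
        description="Toy example script (all arguments are made up)."
    )
    # Required flag: argparse exits with an error message if it is missing
    parser.add_argument("-i", "--input", required=True,
                        help="path to the input file")
    # Optional flag with a default value
    parser.add_argument("-o", "--output", default="out.txt",
                        help="path to the output file (default: out.txt)")
    # Optional flag that is converted to an integer automatically
    parser.add_argument("-t", "--threads", type=int, default=1,
                        help="number of threads to use (default: 1)")
    return parser


def parse_args(argv=None):
    """Parse argv (a list of strings); None means use the real command line."""
    return build_parser().parse_args(argv)
```

Because the arguments are flagged, ``parse_args(["-o", "x.txt", "-i", "reads.fq"])`` and ``parse_args(["-i", "reads.fq", "-o", "x.txt"])`` give the same result, and running the script with ``-h`` prints the help text built from the ``help=`` strings.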


#### ``time``

Here's a simple timer function:

![image](https://github.com/nclowell/RAD_Scallops/blob/master/Seminar/images_for_notebook/timerfn.png?raw=true)

Then, by calling ``time.time()`` before and after a given process and feeding those start and end times into the function, you can print the elapsed time to the screen like this:

[NEED TO GET PIC]
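
In the meantime, here is a rough sketch of what a timer like this could look like. It is not necessarily the function in the image above: ``format_elapsed`` and ``timed`` are names made up for this example, and the output format is just one reasonable choice.

```python
import time


def format_elapsed(start, end):
    """Turn two time.time() readings into a string like '1h 2m 5s'."""
    total_seconds = int(round(end - start))
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}h {minutes}m {seconds}s"


def timed(func, *args, **kwargs):
    """Run func, returning its result together with the formatted run time."""
    start = time.time()
    result = func(*args, **kwargs)
    end = time.time()
    return result, format_elapsed(start, end)
```

For example, ``result, elapsed = timed(sum, [1, 2, 3])`` runs ``sum([1, 2, 3])`` and reports how long it took, and ``print("Run time:", elapsed)`` puts that on the screen.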


#### ``subprocess``

I use ``subprocess`` to run scripts from within my Python script, but there's a lot more it can do that goes over my head, like managing standard output and standard error.

<br>
![image](https://github.com/nclowell/RAD_Scallops/blob/master/Seminar/images_for_notebook/subprocess.png?raw=true)
<br>
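
As a small sketch of the standard output and error handling mentioned above (not the code in the image), ``subprocess.run`` can capture both streams along with the exit code. The helper name ``run_command`` is made up, and this assumes Python 3.7 or later for the ``capture_output`` argument.

```python
import subprocess
import sys


def run_command(command):
    """Run a command given as a list of strings.
    Returns (exit code, standard output, standard error)."""
    completed = subprocess.run(command, capture_output=True, text=True)
    return completed.returncode, completed.stdout, completed.stderr


def run_python_snippet(code):
    """Example use: run a bit of Python in a separate process,
    using the same interpreter that is running this script."""
    return run_command([sys.executable, "-c", code])
```

An exit code of 0 conventionally means the program finished without error, so checking it after each step is a cheap way to stop a pipeline before it carries on with bad output.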

#### Questions for group

(1) How many languages/platforms do folks code in when running their bioinformatics pipelines?

(2) What other modules do folks use in Python?

(3) What packages do folks use in other languages like R? (Eleni brought this up)

<br>
<br>
<br>


# 3. Automation

## Functions

Write functions for things you're going to do over and over again. For example, I often want to verify that my code is doing what I hope it's doing, so I'd like it to pause, print something to the screen, and ask me whether the program should continue.

#### Example: recursive function for yes-no-else user input

![image](https://github.com/nclowell/RAD_Scallops/blob/master/Seminar/images_for_notebook/getinput.png?raw=true)
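
In text form, a recursive yes/no prompt along these lines might look like the sketch below. It is an approximation rather than a copy of the function in the image; the name ``ask_yes_no`` and the ``input_fn`` parameter (added so the function can be tried without typing at a keyboard) are assumptions of this example.

```python
def ask_yes_no(prompt, input_fn=input):
    """Ask a yes/no question until the user gives a usable answer.
    Returns True for yes and False for no."""
    answer = input_fn(prompt + " [y/n]: ").strip().lower()
    if answer in ("y", "yes"):
        return True
    if answer in ("n", "no"):
        return False
    # Anything else: explain and ask again by calling the function recursively
    print("Please answer 'y' or 'n'.")
    return ask_yes_no(prompt, input_fn)
```

In a pipeline script this could be used as ``if not ask_yes_no("Does this look right? Continue?"): sys.exit()``, with ``sys`` imported at the top. A ``while`` loop would do the same job without recursion; the recursive version is just short and easy to read.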

## Programs 

#### Example: ``easy_ustacks.py``

To run ``ustacks``, some of our lab members write bash scripts by hand, while others generate their bash scripts with Python (so as not to have to learn bash syntax). Then, we run the bash script from the terminal. Afterwards, many of us want to know how many loci are retained per individual per population, to decide whether to include all individuals in further analyses. Most of our lab mates then run an additional bash command to count loci and format the output to fit into Excel or R for plotting. Some of us run ``ustacks`` overnight because we know it takes a long time, but knowing exactly how long it takes could help us make better use of our time.

Here, I've written a simple program that (1) writes your bash script, (2) runs your bash script, (3) plots loci per individual per population, and (4) times how long it takes to run ``ustacks``.

Here's the [script](https://github.com/nclowell/RAD_Scallops/blob/master/CRAGIG_run1/Scripts/easy_ustacks.py).

Here's what it looks like at the command line:

[images of user input/output]

Automating steps and having the program verify them with the user may reduce opportunities for human error.
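
To give a flavor of step (1), here is a rough sketch of writing a bash script with one command per sample. This is not the code in ``easy_ustacks.py``: the function name ``write_bash_script`` and the ``{sample}`` placeholder are assumptions of this example, and the command template in the usage note is a harmless ``echo`` rather than a real ``ustacks`` call.

```python
import os
import stat


def write_bash_script(sample_names, command_template, script_path):
    """Write a bash script containing one command per sample.
    command_template must contain the placeholder {sample}.
    Returns the list of lines written, so they can be shown to the user."""
    lines = ["#!/bin/bash", "set -e"]  # set -e: stop at the first failing command
    for sample in sample_names:
        lines.append(command_template.format(sample=sample))
    # newline="\n" keeps Unix line endings even if this runs on Windows
    with open(script_path, "w", newline="\n") as handle:
        handle.write("\n".join(lines) + "\n")
    # Add execute permission for the file's owner
    current_mode = os.stat(script_path).st_mode
    os.chmod(script_path, current_mode | stat.S_IXUSR)
    return lines
```

For example, ``write_bash_script(["A", "B"], "echo processing {sample}", "run_all.sh")`` writes a script with two ``echo`` lines. Printing the returned lines and asking the user to confirm before running gives the kind of verification step described above.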

# 4. GitHub

Git is a distributed version control system written by Linus Torvalds, the maker of Linux, and GitHub is a website that hosts Git repositories. It has a steep learning curve, but once over the hump, it is great for sharing, collaborating, and version control.

You can use git at the command line, which looks like this:

<br>

![image](https://github.com/nclowell/RAD_Scallops/blob/master/Seminar/images_for_notebook/gitbash.png?raw=true)

<br>


Or through the GitHub Desktop GUI if the command line freaks you out. It looks like this:


![image](https://github.com/nclowell/RAD_Scallops/blob/master/Seminar/images_for_notebook/gitdesktop.png?raw=true)

<br>


Some cool things about GitHub:

- makes it easy to collaborate with others, including infrastructure for controlling versions between users and communication channels like "issues"
- supports markdown for pretty and clear documentation 
- great documentation for support (when trying to get over the learning curve hump)
- because repositories can be public, good for science outreach, communication, and self-marketing

You can make releases to package your work at a given time. For example, you could make a release when you finish one project, so that you have a record of what scripts you used.

Here's what a release looks like:

![image](https://github.com/nclowell/RAD_Scallops/blob/master/Seminar/images_for_notebook/release.png?raw=true)

I go to Mary's GitHub and steal her scripts all the time...

If we had a lab GitHub repository with shared scripts, there would be less reinventing the wheel and a more efficient entry into bioinformatics for new students.

#### Questions for the group

(1) GitHub versus Evernote?

- version control
- releases
- better communication (not just chat)

(2) What do other labs use?