# Lab 1: Computational Frameworks for Evolutionary Genomics

## Learning objectives

* Become aware of the field of computational evolutionary genetics
* Learn to install and run Python programs on your computer

## Overview

In recent years, the field of evolutionary genetics has shifted towards requiring some knowledge of R, Python, Perl, or C, along with the use of high performance computers at national computing centers (which often requires some fundamental Unix skills) for working with the large data sets typical of evolutionary genetics research. While there are many great software packages available for particular computational problems in evolutionary biology, many of them do not have a graphical user interface (e.g. drop-down menus) and are run from the command line. The lab sessions in this course have been designed to give students an introduction to working with Python code in addition to learning some standalone software packages. A complementary course, Advanced Genetics, introduces the R programming language.

### What is Bioinformatics?

Bioinformatics is the field of science in which biology, computer science, statistics and information technology merge into a single discipline. There are three important sub-disciplines within bioinformatics:

* The development of new algorithms and statistics with which to assess relationships among members of large data sets. 
* The development and implementation of tools that enable efficient access and management of different types of information. 
* The analysis and interpretation of various types of data including nucleotide and amino acid sequences, protein domains, and protein structures.

(Paraphrased from the old NCBI web site.)

### Bioinformatics...
* is a term coined in response to the high demand for techniques and resources for handling the explosion of molecular data.
* is a buzzword to describe a growing field.
* benefits from physicists, chemists and mathematicians crossing over into biology.
* is a collection of tools.
* is a way of thinking about a problem.

### The Development and Implementation of Tools
In order to make new algorithms and data sources available to biologists, someone needs to write applications that include these algorithms and create new databases. Often this is first done by academic research groups and later redone by private companies once the market is large and profitable enough. There is a large gap between what is done by research groups and what is done by companies. Sometimes this gap is filled by large government-funded projects, but usually not in time for most researchers. This is why bioinformatics and programming skills have become very valuable.

###  Data Science

* <a href="https://datascience.berkeley.edu/about/what-is-data-science/" target="_blank">What is Data Science?</a>
* <a href="https://blog.udacity.com/2014/11/data-science-job-skills.html" target="_blank">What is Data Science? 8 Skills That Will Get You Hired</a>
* <a href="https://www.nceas.ucsb.edu/news/open-science-kinder-science" target="_blank">Open Science is Kinder Science</a>
* <a href="https://datacarpentry.org/" target="_blank">Data Carpentry</a>
* <a href="https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.2005561" target="_blank">Building a local community of practice in scientific programming for life scientists</a>


### Computer Operating Systems

Nearly everything we do in this course can be done in Windows, Apple's OS X, Linux, or Unix, whether on most desktop or laptop computers sold in the past few years, in the cloud, or on a high performance computer. One of the goals of this course is for you to be able to set up an environment to program and run bioinformatics tools on your own computer or the computers in your research laboratory. The course projects will involve using high performance computing facilities.

### Open Source Software

The Open Source movement treats program source code in a similar manner to the way scientists publish their results: publicly and open to unfettered examination and discussion. Examples include:

* Linux operating system
* Apache web server
* Firefox web browser
* Python and IPython
* R
* BioPerl, BioJava, BioPython
* EMBOSS, Bioconductor, Cytoscape, and many of the programs we will use in bioinformatics.

Because the source is open, we can also look at the code to see how the developers solved a problem and what algorithms they used, and we can even use the code in our own programs as long as we properly acknowledge the source.


## Python

Python is open source and multi-platform (e.g. GNU/Linux, Microsoft Windows, Mac OS X). Python is a popular programming language in bioinformatics and is also popular in other areas of biology and in engineering disciplines. Python is an interpreted language and comes with its own interpreter. Python can be used interactively inside the Python shell, and this is considered one of Python's strengths since it encourages "exploratory computing" that lets the programmer try out simple steps and algorithms before attempting to write functions and modules. Python has a handful of mature third-party open source libraries, namely NumPy/SciPy for numerical operations, Cython for low-level optimization in C, IPython for interactive work, and Matplotlib for plotting. Here are a few tutorials and courses for learning Python:

* <a href="http://www.greenteapress.com/thinkpython/thinkpython.html" target="_blank">Think Python: How to Think Like a Computer Scientist</a>
* <a href="http://interactivepython.org/runestone/static/thinkcspy/index.html" target="_blank">Think Python: How to Think Like a Computer Scientist - Interactive Version</a>
* <a href="https://www.coursera.org/course/interactivepython" target="_blank">An Introduction to Interactive Programming in Python</a> A Coursera course
* <a href="http://www.learnpython.org/" target="_blank">LearnPython</a>
* <a href="https://software-carpentry.org/lessons/" target="_blank">Software Carpentry's Tutorials</a>
* <a href="http://intro-prog-bioinfo-2012.wikispaces.com/" target="_blank">QB3 Python Bioinformatics Course 2012</a>
* <a href="http://www.programmingforbiologists.org//" target="_blank">Ethan White's Programming for Biologists</a>
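
As a small illustration of the "exploratory computing" style described above, the sketch below works through a short DNA sequence one step at a time and only then wraps the logic in a function. The sequence is made up, and the function name `gc_content` is an assumption chosen for illustration; neither comes from the course materials.

```python
# Exploratory-style session: each step could be typed on its own at the
# Python prompt or in a notebook cell, inspecting the result as you go.

# A short, made-up DNA sequence (not real data).
seq = "ATGCGCGATTTAGCGC"

# Step 1: how long is the sequence?
length = len(seq)

# Step 2: count each nucleotide.
counts = {base: seq.count(base) for base in "ACGT"}

# Step 3: once the individual steps work, wrap them in a reusable function.
def gc_content(sequence):
    """Return the fraction of G and C bases in a DNA sequence."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    return (sequence.count("G") + sequence.count("C")) / len(sequence)

print(length)
print(counts)
print(gc_content(seq))
```

Trying each step separately before writing the function makes it easier to spot mistakes early, which is the main appeal of working interactively.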

### SciPy

<a href="http://www.scipy.org/" target="_blank">SciPy</a> (pronounced “Sigh Pie”) is a Python-based ecosystem of open-source software for mathematics, science, and engineering. In particular, these are some of the core packages:

* NumPy - Base N-dimensional array package
* SciPy library - Fundamental library for scientific computing
* Matplotlib - Comprehensive 2D Plotting
* IPython - Enhanced Interactive Console
* SymPy - Symbolic mathematics
* Pandas - Data structures &amp; analysis

### Jupyter

Problem sets in this class will be turned in as <a href="https://jupyter.org/" target="_blank">Jupyter</a> notebooks.  Jupyter provides a web-based application suitable for capturing the whole computation process: developing, documenting, and executing code, as well as communicating the results. In this way, notebook files can serve as a complete computational record of an analysis and/or workflow. Notebooks may be exported to a range of static formats, including HTML, PDF, and slide shows and shared using email, Dropbox and Moodle.  
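
A notebook file (`.ipynb`) is plain JSON text, which is part of why notebooks are easy to share and convert. The sketch below builds a two-cell notebook by hand using only the standard library. It assumes the version 4 notebook layout, and the cell contents are made up for illustration. In practice you would create notebooks through the Jupyter interface rather than by hand.

```python
import json

# A minimal notebook, assuming the version 4 notebook layout:
# one markdown cell for documentation and one code cell.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": ["# Lab 1\n", "My first notebook."],
        },
        {
            "cell_type": "code",
            "execution_count": None,  # the cell has not been run yet
            "metadata": {},
            "outputs": [],
            "source": ["print('Hello, EvoGeno!')"],
        },
    ],
}

# Serialize to the JSON text that would be stored in an .ipynb file.
notebook_text = json.dumps(notebook, indent=1)

# Read the text back and list the cell types in order.
loaded = json.loads(notebook_text)
cell_types = [cell["cell_type"] for cell in loaded["cells"]]
print(cell_types)
```

Because code, documentation, and outputs live together in one file, a notebook can act as the computational record described above.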

## R

R is one of the largest and most comprehensive open source statistical computing environments. The core R distribution is enhanced by a large collection of user-contributed add-on packages, including many for gene expression analysis, in the <a href="http://cran.r-project.org/">Comprehensive R Archive Network (CRAN)</a> and the Omegahat Project for Statistical Computing. <a href="http://www.bioconductor.org/">Bioconductor</a> is an open source and open development software project for the analysis and comprehension of genomic data and is based primarily on the R programming language. R and Bioconductor are free, open source, and available for Windows, macOS, and a wide variety of Unix platforms.
 
### R manuals, help and tutorials

Many introductory and advanced tutorials have been developed for R. Here are a few:

* <a href="http://cran.r-project.org/manuals.html#R-admin" target="_blank">The official R manuals</a>
* <a href="http://cran.r-project.org/doc/manuals/R-intro.html" target="_blank">CRAN's Introduction to R</a>
* <a href="https://r4ds.had.co.nz/" target="_blank">R for Data Science</a> by Garrett Grolemund and Hadley Wickham
* <a href="http://www.cookbook-r.com/Graphs/" target="_blank">R Graphics Cookbook</a> by Winston Chang
* <a href="https://github.com/datacarpentry/genomics-workshop/" target="_blank">Data Carpentry Genomics Workshop Sessions</a>
* <a href="https://datacarpentry.org/R-ecology-lesson/index.html" target="_blank">Data Analysis and Visualization in R for Ecologists</a> 

There are also many workshops and online R courses that you could take to follow up what you learn in this class.


## GitHub

GitHub (https://github.com/) has become a popular way to manage, share, and view code for open source projects. You can read more about version control at http://git-scm.com/book/en/Getting-Started-About-Version-Control. Once you sign up for an account you will be able to see the introductory guide, which covers (1) setting up Git, (2) creating repositories, (3) forking repositories, and (4) working together.

The tutorials created for this course will be posted on GitHub. The files can be downloaded to your computer, and you can modify them to create new examples, write better exercises, or simply correct my typos. Since these tutorials are written as Jupyter notebooks, the GitHub files can be viewed as web pages using the Jupyter notebook viewer http://nbviewer.ipython.org/ or on your computer using Jupyter Notebook.


## Working with Python

There are many ways in which to work with Python and complete the labs. 

### For students in 597 - RStudio Cloud

If you are an undergraduate in 597, go to [RStudio Cloud](https://rstudio.cloud/) and create an account. I will share the link for our Workspace in an email. The steps are:

* Click on the link
* Join EvoGeno Workspace
* Under your spaces select EvoGeno Workspace
* Click on Project tab
* Start assignment

### For students in 697 - Unity and Massachusetts Green HPC clusters

There are several options available for using high performance computers at minimal or no cost to graduate students at UMass. MGHPCC is the older HPC with more programs available. Unity is a newer HPC still in development, but it has a more modern JupyterLab interface with the Git extension.

#### Unity

If you are taking the graduate version (697) of this class, I recommend creating an account on Unity, our new research HPC, so that you can use the new JupyterLab interface - https://unity.rc.umass.edu/index.php

#### MGHPCC

You are also likely to need to use our main workhorse HPC. If you are taking the graduate version (697) of this class, please register for an MGHPCC account - https://www.umassrc.org/hpc/index.php. The MGHPCC is an intercollegiate high-performance computing facility located in Holyoke, Massachusetts. Because MGHPCC is for research computing, only Principal Investigators (PIs), as defined by the local school, and their staff or authorized collaborators may receive accounts. PI authorization is required for all new account requests. For more details see http://wiki.umassrc.org/wiki/index.php/Main_Page. Do not list me as the PI.

Open OnDemand provides a web interface to a number of cluster resources, including RStudio, Jupyter Notebook, and a GNOME X11 desktop. Current users of the cluster can log in to this web interface at https://www.umassrc.org:444 from the local campus network or a VPN using their cluster username and password. Note: you may need to set up a VPN for remote access - https://www.umass.edu/it/support/vpn/howinstallanduseglobalprotectvpnclient


### For everyone - Google Colab

Colaboratory, or "Colab" for short, allows you to write and execute Python in your browser.

* Go to the Google Colaboratory website - https://colab.research.google.com
* Log in with your UMass account
* Open a new notebook
* The notebook will be saved in your Google Drive under Colab Notebooks


### For everyone - Installing Python locally on your own computer

There is no need to do this, but it is an option. Download and install the <a href="https://www.anaconda.com/products/individual" target="_blank">Anaconda Individual Edition</a>, which includes Python, the SciPy packages, Spyder, and Jupyter.
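
Whichever option you choose, it can help to confirm that Python and the scientific packages are available before starting the labs. The following is a minimal sketch using only the standard library. The package list and the helper name `check_packages` are assumptions chosen for illustration, and the names are the ones used in `import` statements.

```python
import importlib.util
import sys

# Import names for packages mentioned in this lab (an illustrative list).
PACKAGES = ["numpy", "scipy", "matplotlib", "pandas", "IPython"]

def check_packages(names):
    """Return a dict mapping each top-level import name to True if it is installed."""
    return {name: importlib.util.find_spec(name) is not None for name in names}

# Report the Python version, then whether each package can be found.
print("Python", sys.version.split()[0])
for name, found in check_packages(PACKAGES).items():
    status = "found" if found else "MISSING"
    print(f"{name:12s} {status}")
```

The same code can be pasted into a notebook cell on Colab, on a cluster JupyterLab session, or in a local Anaconda installation. A package reported as missing simply means it is not installed in that particular environment.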


