Introduction to datawrangling with Pandas and trees with ggtree - for MDU

kristyhoran/BioinfClinic

Bioinformatics clinic 1st April 2019

Kristy Horan

I will be talking about two rather distinct tools that I use regularly to wrangle data and to visualise phylogenies: the Python dataframe package pandas and the R tree visualisation library ggtree.

Both talks will be introductory, but I assume that everyone has a basic knowledge of R and/or Python. If you would like to play along, having RStudio and Jupyter installed would be helpful.

RStudio

https://www.rstudio.com/products/rstudio/download/

Jupyter

If you would like to read the installation instructions: https://jupyter.readthedocs.io/en/latest/install.html

Or, in a terminal window:

pip3 install jupyter

Get this repo

I will update the repo after the clinic if we make any changes.

git clone https://github.com/kristyhoran/BioinfClinic.git

Introduction to pandas

What is a dataframe?

Simply put, a dataframe is a tabular representation of data. In our work, rows are often samples (isolates or experimental conditions) and columns are characteristics of those samples. This format makes it easy to see how the data is organised and allows for fairly straightforward visualisations (a topic for another day, perhaps).
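To make the rows-as-samples, columns-as-characteristics idea concrete, here is a minimal sketch. The isolate names, column names and values are all made up for illustration and do not come from any real dataset.

```python
import pandas as pd

# Hypothetical isolate metadata; every name and value here is invented.
records = [
    {"isolate": "iso_001", "species": "Salmonella enterica", "mlst": 19, "contigs": 45},
    {"isolate": "iso_002", "species": "Salmonella enterica", "mlst": 34, "contigs": 62},
    {"isolate": "iso_003", "species": "Listeria monocytogenes", "mlst": 1, "contigs": 18},
]

# Each dict becomes a row; the dict keys become the column names.
# Using the isolate name as the index lets us look samples up by label.
df = pd.DataFrame(records).set_index("isolate")

print(df.shape)           # (number of rows, number of columns)
print(df.loc["iso_002"])  # one sample (row), selected by its label
print(df["species"])      # one characteristic (column) across all samples
```

Setting the sample name as the index is a choice rather than a requirement; leaving the default integer index works just as well if you prefer to keep `isolate` as an ordinary column.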

Why Pandas??

R also has dataframes... in fact I would almost go so far as to say that R depends on dataframes. So why choose Pandas?

Well, Python is my go-to language. Whilst R has many 'shiny' bits and pieces, I find Python more flexible, so I will offer an alternative to the R dataframe. Apart from my fondness for Python, I find that one of the major advantages of pandas is that it is a single package and largely follows Python syntax, so it is pretty easy to learn. It also plays well with numpy, scipy, matplotlib and many other Python packages to provide a rich suite of tools for statistics and visualisation (again, a topic for another day).
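As a rough illustration of pandas working alongside numpy, the sketch below applies a numpy function to a column, converts a column to a plain array, and summarises by group. The table contents (isolate names, species labels and genome sizes) are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical assembly statistics; the values are invented for illustration.
df = pd.DataFrame(
    {
        "isolate": ["iso_001", "iso_002", "iso_003", "iso_004"],
        "species": ["A", "A", "B", "B"],
        "genome_size": [4_800_000, 4_900_000, 2_900_000, 3_100_000],
    }
)

# numpy functions apply element-wise to a whole column and return a Series.
df["log10_size"] = np.log10(df["genome_size"])

# A column converts to a plain numpy array when another library needs one.
sizes = df["genome_size"].to_numpy()

# Group rows by a column and summarise another column per group.
mean_by_species = df.groupby("species")["genome_size"].mean()
print(mean_by_species)
```

The same pattern carries over to scipy and matplotlib, which generally accept either a pandas column or the array returned by `to_numpy()`.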

Get Started

Installing pandas is very easy.

pip3 install pandas

I will be demonstrating using a Jupyter notebook, so if you want to play along, see above for install directions.

Introduction to ggtree

I deal with trees a lot! Often these trees are sent to people who just want a PDF or PNG for a publication or report. They do not want to play about with analysing the tree; they simply want to look at it.

Surprisingly, tools that are purpose-built to visualise trees and generate a static image are a bit thin on the ground.

Python has a couple of neat tools, ete3 and Bio.Phylo (part of Biopython), but I find ete3 a bit buggy and Bio.Phylo a bit ugly out of the box. They can be quite powerful and customisable, but for day-to-day visualisations they are a bit verbose. So I use ggtree.

ggtree is an R library that is relatively straightforward and easy to use, a good workhorse for day-to-day use. It does have some limitations: it is a little clunky if you want to add heaps of annotations, and the documentation, although pretty, is light on detail and explanation. Another good thing about ggtree is that it can read many different tree formats (I will be using a simple Newick file today) as well as dendrograms generated by clustering.
