# Getting labelled data for nanopore reads
This guide is predominantly about how to use the package `nanoraw`, specifically a forked version that I have made some modifications to.  
The program reads in a FAST5 file and maps it to a given reference genome, giving you a corrected version of your nanopore read.  


## Installation
We have already installed the modified version of `nanoraw` on NeCTAR, but you need to activate a virtual environment with the following command before you can use it.  

>```$ . /sw/python/venv/bin/activate```  

This should be the first command run in any script you set up that uses `nanoraw` on the NeCTAR cluster. **Notice the dot at the beginning of the command.**


### Installation on your personal computer

If you would like to install it on your personal machine, run the following code to install my modified version:

In [2]:
!pip install git+https://github.com/mbhall88/nanoraw.git

Collecting git+https://github.com/mbhall88/nanoraw.git
  Cloning https://github.com/mbhall88/nanoraw.git to /var/folders/gd/t7fyl03d5cbc0ckrw08lzw440000gn/T/pip-_DgfYL-build
Installing collected packages: nanoraw
  Running setup.py install for nanoraw ... done
Successfully installed nanoraw-0.1


To run this on your local machine you will also need `graphmap`, which can be installed from https://github.com/isovic/graphmap  

## Usage
We are only really concerned with one main function within the package - `correct_raw_data` - in the module `correct_raw`.  
Given the correct flags, this function returns a numpy array in which each row is an event, with the normalised mean, normalised stdev, start, length, and nucleotide for that event.  
```
def correct_raw_data(
        filename, genome_filename, graphmap_path, basecall_group,
        corrected_group, rmStayStates=True, outlier_threshold=5,
        timeout=None, min_event_obs=4, num_cpts_limit=None,
        overwrite=True, in_place=True):
```
The arguments this function takes will be explained by way of example.

In [None]:
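import numpy as np
from nanoraw import correct_raw as cr

# FAST5 file to correct and the reference genome to map it against
fName = "IMB14_011406_LT_20160928_FNFAB27163_MN17279_mux_scan_GN_003_R9_280916_65776_ch27_read13_strand.fast5"
ref = "pacbio_ref.fa"
# full path to the graphmap executable (not just its directory)
gmap_dir = "/Users/mbhall88/Dropbox/Documents/Bioinformatics/graphmap/bin/Mac/graphmap"
# FAST5 group holding the basecalls, and the group name for the corrected data
b_group = "Basecall_1D_000"
c_group = "RawGenomeCorrected_000"

# in_place=False returns the corrected events instead of writing to the FAST5 file
foo = cr.correct_raw_data(fName, ref, gmap_dir, b_group,
                          c_group, in_place=False)

# names of the columns in the returned numpy array
col_names = np.array(["norm_mean", "norm_stdev", "start",
                      "length", "base"])

# parentheses make this work under both Python 2 and Python 3
print(foo)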
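
Once you have the returned array, you will probably want to split it into numeric features and nucleotide labels. The sketch below shows one way to do that. It does not call `nanoraw`; instead it builds a small made-up array called `events`, and it **assumes** the returned array is a numpy structured array whose field names match `col_names`. Check the `dtype` of your own output before relying on this.

```python
import numpy as np

# Hypothetical stand-in for the array returned by correct_raw_data.
# ASSUMPTION: a structured array with fields named as in col_names.
# The values below are invented purely for illustration.
event_dtype = [("norm_mean", "f8"), ("norm_stdev", "f8"),
               ("start", "u4"), ("length", "u4"), ("base", "U1")]
events = np.array([(0.52, 0.11, 0, 7, "A"),
                   (-1.30, 0.20, 7, 5, "C"),
                   (0.87, 0.15, 12, 9, "G")], dtype=event_dtype)

# numeric features per event, one row per event
features = np.column_stack([events["norm_mean"],
                            events["norm_stdev"],
                            events["length"]])

# nucleotide label for each event
labels = events["base"]

print(features.shape)
print(labels)
```

If your array turns out to be a plain 2D array rather than a structured one, index the columns by position in the order given by `col_names` instead.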

In the `correct_raw_data` call above, the main thing to be aware of is that `in_place` is set to `False`. With this setting, the function returns a numpy array. The default is actually `True`, and if left as such the function will overwrite the FAST5 file. As such, **make a copy of some FAST5 files and play around with this function before running any scripts on the real files**. We do not want this set to `True`, as rewriting every FAST5 file would be very time consuming and is unnecessary for our purposes.
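
One way to make that safe copy is sketched below. The helper `copy_fast5_sample` and the directory names in the usage note are hypothetical, not part of `nanoraw`; it only uses the Python standard library.

```python
import os
import shutil

def copy_fast5_sample(src_dir, dest_dir, n=5):
    """Copy up to n FAST5 files from src_dir into dest_dir.

    Returns the paths of the copies. The originals are left untouched.
    """
    if not os.path.isdir(dest_dir):
        os.makedirs(dest_dir)
    # sort so the same files are picked on every run
    fast5 = sorted(f for f in os.listdir(src_dir) if f.endswith('.fast5'))
    copied = []
    for name in fast5[:n]:
        dest = os.path.join(dest_dir, name)
        shutil.copy2(os.path.join(src_dir, name), dest)
        copied.append(dest)
    return copied
```

For example, `copy_fast5_sample("reads", "reads_scratch", n=10)` would give you ten files in `reads_scratch` to experiment on (both directory names are placeholders).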

### Graphmap path on NeCTAR
`Graphmap` is installed on NeCTAR already and the path is  
>```/sw/graphmap/current/bin/graphmap```


**A couple of last things...**  
Here is some suggested code (you may have a better method) for running this function over a selection of FAST5 files.  
Make sure your Python file is in the directory containing the FAST5 files you want to process. The following code returns a list of all the FAST5 files in your current directory; you can then run the function over whatever subset you want.

In [None]:
import os
cwd = os.getcwd()
f5_files = [f for f in os.listdir(cwd) if f.endswith('.fast5')]
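
With the list of files in hand, a loop along the following lines could run the correction over a subset and collect the results. This is a rough sketch, not a tested pipeline: `correct_many` is a hypothetical helper, the correction function is passed in as `correct_fn` so the loop can be tried out without `nanoraw` installed, and it assumes a failed read raises an exception, which may not be how `nanoraw` reports every failure.

```python
def correct_many(f5_files, correct_fn, ref, graphmap_path,
                 b_group="Basecall_1D_000",
                 c_group="RawGenomeCorrected_000"):
    """Run correct_fn over each file, never writing back to the FAST5.

    Returns (results, failed): results maps filename -> returned array,
    failed is a list of (filename, error message) pairs.
    """
    results = {}
    failed = []
    for f5 in f5_files:
        try:
            # in_place=False so the original FAST5 files are not modified
            results[f5] = correct_fn(f5, ref, graphmap_path,
                                     b_group, c_group, in_place=False)
        except Exception as err:
            # ASSUMPTION: a read that cannot be corrected raises an exception
            failed.append((f5, str(err)))
    return results, failed
```

On NeCTAR you would call this as `correct_many(f5_files[:100], cr.correct_raw_data, "pacbio_ref.fa", "/sw/graphmap/current/bin/graphmap")`, where the slice is whatever subset you want to process.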

### Final remarks
This labelled data does not have any methylation information in it at this point, and there may be some issues with the way `nanoraw` deals with gaps in the alignment, but for now this should give you something to start running your NN on.  
This is a link to the original `nanoraw` github repo - https://github.com/marcus1487/nanoraw  
Let me know if you need anything else.