# Introduction
I wanted to make changes to ../../documentation/mvp_v1.md, whilst being able to see the changes rendered. https://github.com/jupyter/notebook/issues/2485 suggests copying the markdown to a new cell in a notebook, then copying back once complete, so this is what I do here.

# File tracking system minimal viable product scope
V1.1 by Magnus Manske and Richard Pearson, 2018-07-18

## Introduction
This document describes the scope of the minimal viable product (MVP) for a new file tracking system (preliminarily designated FITS). The problem that FITS is intended to solve is described in the next section. The scope of the MVP is described in the following section. Following this, some key use cases for FITS are described, including an indication of whether these use cases will be catered for by the MVP. This document concludes with pointers to other relevant documents, the relationship to the separate sample information management system (SIMS) project, and future plans.

## Problem
For our work, we rely on data files generated by machines (e.g. DNA sequencing, genotyping), supplied by third parties, or generated from such files through computation. We need to be able to know where all these files are, what they contain, how they relate to samples, and to each other.

Previously, the information described above has either been sourced from the Solaris database, from Sanger sequencing systems (Multi-LIMS warehouse (mlwh), iRODS baton) or from a combination of Solaris and Sanger sequencing systems. This approach is considered non-viable going forwards because:
* The Solaris database is being decommissioned and is no longer being populated with new data
* There are known errors in file-sample mappings in the Multi-LIMS warehouse
* We will need to store data that are not in Sanger systems, e.g. files of samples sequenced at locations other than the Sanger which we want to include in builds and releases

## Scope
The MVP _will_
* Store the location of files in various storage systems
* Store metadata of these files (e.g. file type, file size, MD5 sum)
* Store metadata about the data contained in these files (e.g. insert size of paired end sequencing data)
* Store the relations of files to each other (e.g. a BAM and a CRAM file containing the same read data, which file(s) another file is based on)
* Store sample identifiers and their relation to the files
* Store metadata about the sample IDs (e.g. Oxford/ROMA codes)
* Be updated via automated processes (e.g. a cronjob reading Sanger sequencing updates)
* Be updated manually (e.g. to correct false legacy data)
* Be queried to get information about samples
* Be queried to get information about files
* Store mappings from files to samples for all parasite and vector MalariaGEN samples that have been whole-genome sequenced using Illumina technology at the Sanger

The MVP will _not necessarily_ store mappings from files to samples for 
* amplicon sequencing data
* other genotyping technologies such as Sequenom/Agena
* human samples
* samples not sequenced at the Sanger
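
The items under "The MVP _will_" can be read as a small relational model: files in storage systems, samples, key/value tags on both, and links between them. The following is a minimal sketch of such a model, not the actual FITS schema. All table and column names are assumptions made for illustration (loosely following the names used in the appendix queries), and SQLite is used only so that the example is self-contained.

```python
import sqlite3

# Illustrative schema only: table and column names are assumptions made for
# this sketch, not the actual FITS database layout.
SCHEMA = """
CREATE TABLE storage (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE            -- name of a storage system
);
CREATE TABLE file (
    id        INTEGER PRIMARY KEY,
    storage   INTEGER NOT NULL REFERENCES storage(id),
    full_path TEXT NOT NULL,
    UNIQUE (storage, full_path)          -- a path is unique within one storage system
);
CREATE TABLE sample (
    id INTEGER PRIMARY KEY
);
CREATE TABLE tag (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE            -- e.g. 'file type', 'MD5 sum', 'Oxford code'
);
CREATE TABLE file_tag (                  -- metadata about files and their contents
    file_id INTEGER NOT NULL REFERENCES file(id),
    tag_id  INTEGER NOT NULL REFERENCES tag(id),
    value   TEXT NOT NULL
);
CREATE TABLE sample_tag (                -- metadata about samples
    sample_id INTEGER NOT NULL REFERENCES sample(id),
    tag_id    INTEGER NOT NULL REFERENCES tag(id),
    value     TEXT NOT NULL
);
CREATE TABLE sample_file (               -- which file holds data for which sample
    sample_id INTEGER NOT NULL REFERENCES sample(id),
    file_id   INTEGER NOT NULL REFERENCES file(id),
    PRIMARY KEY (sample_id, file_id)
);
CREATE TABLE file_relation (             -- typed relations between two files
    parent   INTEGER NOT NULL REFERENCES file(id),
    child    INTEGER NOT NULL REFERENCES file(id),
    relation INTEGER NOT NULL REFERENCES tag(id)
);
"""


def create_database(path=":memory:"):
    """Create the sketch schema and return an open connection."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
    conn.executescript(SCHEMA)
    return conn
```

Keeping metadata in generic tag tables, rather than one column per attribute, is one way to add new kinds of metadata without changing the schema.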

## Use cases
This section describes a set of use cases that must be satisfied for the MVP to be considered complete, together with a brief description of their status in the version of the file tracking system current at the time of writing. The appendix gives details of queries related to some of these use cases that can currently be performed in FITS.

In the following, a build manifest is considered to be a file with one line per file, and to contain at the minimum:
* iRODS path
* Sample ID (Oxford code or ROMA ID)
* Alfresco study code
* NCBI taxon ID
* Manual QC status
* Date manual QC complete
* ENA run accession
* ENA sample accession
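
As a rough illustration of the manifest layout described above, the following sketch writes one tab-separated line per file with the minimum set of columns. The field names, the choice of tab as delimiter, and the placeholder values are assumptions for this example only; they are not taken from an existing FITS tool.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class ManifestRow:
    """One manifest line; field names are hypothetical."""
    irods_path: str
    sample_id: str              # Oxford code or ROMA ID
    alfresco_study_code: str
    ncbi_taxon_id: int
    manual_qc_status: str
    manual_qc_date: str         # date manual QC was completed
    ena_run_accession: str
    ena_sample_accession: str


def write_manifest(rows, handle):
    """Write a header line followed by one tab-separated line per file."""
    writer = csv.DictWriter(
        handle,
        fieldnames=[f.name for f in fields(ManifestRow)],
        delimiter="\t",
        lineterminator="\n",
    )
    writer.writeheader()
    for row in rows:
        writer.writerow(asdict(row))


# Usage with obviously fake placeholder values.
example = ManifestRow(
    irods_path="/placeholder/path/file.cram",
    sample_id="PLACEHOLDER_SAMPLE",
    alfresco_study_code="PLACEHOLDER_STUDY",
    ncbi_taxon_id=5833,
    manual_qc_status="pass",
    manual_qc_date="1970-01-01",
    ena_run_accession="PLACEHOLDER_RUN",
    ena_sample_accession="PLACEHOLDER_ENA_SAMPLE",
)
buffer = io.StringIO()
write_manifest([example], buffer)
print(buffer.getvalue())
```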

Note that in the future, it is expected that information on Sample IDs, Alfresco study codes and taxon IDs will come from SIMS, but until SIMS is operational, this information will have to come from FITS, otherwise it will not be possible to use FITS for creating build manifests. The definitive source for file-to-sample, sample-to-study and sample-to-taxon mappings should be considered to be Solaris for samples contained in Solaris, or mlwh/iRODS for samples that are not. Note that some older files are included in Solaris but not in mlwh.

For an example of a recently created build manifest, see https://github.com/wtsi-team112/Pv4/blob/master/notebooks/rp7/20180525_Pv4_manifest.ipynb

### Create a build manifest containing all samples from a species
See [how to build a manifest](https://github.com/wtsi-team112/fits/blob/master/documentation//How_to_build_a_manifest.md).

### Create a build manifest given a set of Sequencescape IDs, Oxford codes/ROMA IDs or Alfresco study codes
See [how to build a manifest](https://github.com/wtsi-team112/fits/blob/master/documentation//How_to_build_a_manifest.md).

### Determine which samples from a given study have been sequenced
This is currently considered satisfied for the case of Sequencescape study IDs. For more recent samples there should be a 1-to-1 mapping from Sequencescape to Alfresco study codes, and therefore this use case can be considered satisfied going forwards.

### Determine how many samples have been sequenced, broken down by species
This is currently considered unsatisfied because taxon mappings from Solaris are not included.


### Populate FITS with mappings from Solaris
Done.

### Manually alter FITS mappings
The key mappings that need to be captured are:
- File-to-sample
- Sample-to-taxon
- Sample-to-alfresco_study_code

The changes will need to be done in such a way that data can be overwritten or removed by later updates from mlwh/iRODS or elsewhere.

### Update the file tracking database with the latest from Multi-LIMS warehouse/baton (iRODS)
Could the way this is currently done with `./fits update_sanger` overwrite data from Solaris or manual edits? No.

## Other documents
* Database description. Note that at the time of writing this requires the changes suggested by Richard in an email sent 28/06/2018 18:13
* Command line utility description. This still needs to be written.
* Process description for automatic and manual updating. This still needs to be written.

## Relationship to SIMS
For the moment, FITS and SIMS are developed separately. FITS will eventually pull information from SIMS via API, but there is no perceived need for information to flow in the other direction. FITS can match samples and files to SIMS UUIDs via high-level (e.g. ROMA/Oxford) or low-level (e.g. Sequencescape) IDs.

## Future plans
The following is a list of things that are considered to be out of scope for the MVP.

### Sequencing not done at WSI
FITS can store any file, given a unique path and a storage “engine”. Currently, the only engine supported is Sanger Sequencing iRODS, but more can be added without changing the database schema. Processes to import and update such data will have to be developed.

The first use case here is likely to be storing ENA run accessions.

### Expanding the scope to include amplicon sequencing data
FITS is agnostic to file types. Storage engines such as “Team 112 iRODS” or “Sanger NFS” can be added. Processes to import and update such data will have to be developed.

### Expanding the scope to include human data
FITS is agnostic to species; however, storing data about human samples might influence the storage location (US cloud?).

### Expanding the scope to include pipeline outputs, e.g. release files
FITS is agnostic to file types. Processes to import and update such data will have to be developed.

### Command line tool
A command line tool exists. It can be used to update FITS from the Multi-LIMS warehouse database. A query engine to ease generation of manifest files without compromising generic queries is under development.

### Web front end
A web front end is not planned at this time; however, one could be developed rather easily.

### Moving the system to a public cloud
The FITS database currently resides at Sanger. For better interoperability with Oxford and/or cloud locations, a move to a cloud-based database system is planned once the MVP has stabilized. A database-as-a-service, rather than a generic server running a MySQL server, would be preferred for ease of maintenance, backups, availability, etc.
Storage of human sample data might complicate finding an adequate cloud location.

## Appendix - current status of use cases
### Create a build manifest given a set of Sequencescape IDs
Possible with current version:
```
SELECT vw_sample_tag.value,full_path FROM vw_sample_tag,vw_sample_file WHERE tag_id=3585 AND vw_sample_tag.sample_id=vw_sample_file.sample_id AND `value` IN (list_of_sequenscape_IDs);
```
See [how to build a manifest](https://github.com/wtsi-team112/fits/blob/master/documentation//How_to_build_a_manifest.md).
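
When the query above is run from a script or notebook, the `list_of_sequenscape_IDs` placeholder can be filled with bound parameters rather than by pasting IDs into the SQL string. The following is a minimal sketch of that approach. It uses an in-memory SQLite database with two toy tables standing in for the `vw_sample_tag` and `vw_sample_file` views; their column types, the demo rows, and the function names are assumptions. SQLite uses `?` as its parameter marker, whereas Python MySQL drivers commonly use `%s`.

```python
import sqlite3


def demo_connection():
    """Build toy stand-ins for the two views used by the query (assumed layout)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE vw_sample_tag (sample_id INTEGER, tag_id INTEGER, value TEXT);
        CREATE TABLE vw_sample_file (sample_id INTEGER, file_id INTEGER, full_path TEXT);
        INSERT INTO vw_sample_tag VALUES
            (1, 3585, '101'), (2, 3585, '102'), (3, 3585, '103'), (1, 3600, '5833');
        INSERT INTO vw_sample_file VALUES
            (1, 10, '/demo/a.cram'), (2, 11, '/demo/b.cram'), (3, 12, '/demo/c.cram');
    """)
    return conn


def files_for_sequencescape_ids(conn, sequencescape_ids, tag_id=3585):
    """Return (sequencescape_id, full_path) pairs for the given IDs."""
    ids = [str(i) for i in sequencescape_ids]
    if not ids:
        return []  # "IN ()" is not valid SQL, so short-circuit on empty input
    placeholders = ",".join("?" for _ in ids)  # one marker per ID
    sql = (
        "SELECT vw_sample_tag.value, full_path "
        "FROM vw_sample_tag, vw_sample_file "
        "WHERE tag_id = ? "
        "AND vw_sample_tag.sample_id = vw_sample_file.sample_id "
        f"AND vw_sample_tag.value IN ({placeholders})"
    )
    return conn.execute(sql, [tag_id, *ids]).fetchall()


conn = demo_connection()
for sequencescape_id, path in files_for_sequencescape_ids(conn, [101, 103]):
    print(sequencescape_id, path)
```

Only the placeholder markers are interpolated into the SQL text; the ID values themselves are passed separately to the driver.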

### Create a build manifest given a set of Oxford codes and/or ROMA IDs
Possible with current version, similar to above.

### Create a build manifest given a set of Alfresco study codes
The current version does not track Alfresco study codes. These can be imported, though a sample tracking system might be a more appropriate place for this information.

### Create a build manifest containing all samples from a species
Possible with current version; for “sample” meaning “Multi-LIMS warehouse sample ID”, and species defined as Sequencescape taxon ID for “Plasmodium falciparum”:
```
SELECT DISTINCT sample_id,(SELECT group_concat(DISTINCT `value`) FROM vw_sample_tag st2 WHERE tag_id IN (3589,3586) AND st1.sample_id=st2.sample_id) AS sample_name FROM vw_sample_tag st1 WHERE tag_id=3600 AND `value`='5833';
```

### Determine which samples from a given study have been sequenced
Possible with current version; for “sample” meaning “Multi-LIMS warehouse sample ID”:
```
SELECT `value` AS sequenscape_study_id,count(*) AS cnt FROM vw_sample_tag WHERE tag_id=3593 GROUP BY `value` ORDER BY cnt DESC;
```

### Determine how many samples have been sequenced, broken down by species
Possible with current version; for “sample” meaning “Multi-LIMS warehouse sample ID”:
```
SELECT `value`,count(*) AS cnt,(SELECT taxon_name FROM sequenscape_taxa WHERE taxon_id=`value`) AS name FROM vw_sample_tag WHERE tag_id=3600 GROUP BY `value` ORDER BY cnt DESC;
```

### Create a detailed BAM/CRAM manifest file with metadata, based on species common names
Possible with current version; example for Plasmodium vivax V4 (this included a lot of filtering of unwanted files/samples):
```
SELECT file.id,file.full_path,
(SELECT group_concat(value) FROM vw_sample_tag,vw_sample_file WHERE file.id=file_id AND vw_sample_file.sample_id=vw_sample_tag.sample_id AND tag_id=3561) AS ox_code,
(SELECT group_concat(value) FROM vw_file_tag WHERE file.id=file_id AND tag_id=3591) AS common_name,
(SELECT group_concat(value) FROM vw_sample_tag,vw_sample_file WHERE file.id=file_id AND vw_sample_file.sample_id=vw_sample_tag.sample_id AND tag_id=3600) AS taxon_code
FROM file
WHERE storage=1
AND file.id IN (SELECT file_id FROM vw_file_tag WHERE tag_id=3591 AND value in ('Plasmodium vivax','vivax','P.vivax','Plasmodium Vvax','P. Vivax')) /*SPECIES NAMES*/
AND full_path NOT LIKE "%_human%" AND full_path NOT LIKE "%_phix%" /*NOT HUMAN OR PHIX*/
AND file.id NOT IN (SELECT parent FROM file_relation WHERE relation=3595) /*NOT IDENTICAL TO OTHER FILE*/
AND EXISTS (SELECT * FROM vw_file_tag WHERE file.id=vw_file_tag.file_id AND tag_id=3576 AND value IN ('bam','cram')) /*FILE TYPE*/
AND EXISTS (SELECT * FROM vw_file_tag WHERE file.id=vw_file_tag.file_id AND tag_id=3581 AND value=1) /*MANUAL QC*/
AND NOT EXISTS (SELECT * FROM vw_file_tag WHERE file.id=vw_file_tag.file_id AND tag_id=3582 AND value=1) /*NO R&D*/
```
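
One detail of the path filters in the query above is worth keeping in mind: in SQL `LIKE` patterns, `_` matches any single character, so `%_human%` matches any path containing "human" preceded by at least one character, not only paths containing the literal text `_human`. The sketch below illustrates the difference using SQLite as a stand-in. The paths are made up, and the `like` helper is hypothetical. SQLite needs an explicit `ESCAPE` clause, whereas MySQL treats backslash as the escape character in `LIKE` patterns by default.

```python
import sqlite3


def like(value, pattern, escape=None):
    """Evaluate `value LIKE pattern` (optionally with an ESCAPE character) in SQLite."""
    conn = sqlite3.connect(":memory:")
    try:
        if escape is None:
            row = conn.execute("SELECT ? LIKE ?", (value, pattern)).fetchone()
        else:
            row = conn.execute(
                "SELECT ? LIKE ? ESCAPE ?", (value, pattern, escape)
            ).fetchone()
        return bool(row[0])
    finally:
        conn.close()


# Made-up paths for illustration only.
with_suffix = "/demo/run_1_human.cram"
without_underscore = "/demo/nonhuman.cram"

# Unescaped: "_" is a single-character wildcard, so both paths match.
print(like(with_suffix, "%_human%"), like(without_underscore, "%_human%"))

# Escaped: only the path containing the literal "_human" matches.
print(
    like(with_suffix, "%\\_human%", escape="\\"),
    like(without_underscore, "%\\_human%", escape="\\"),
)
```

Whether the broader match matters depends on the file naming in use; for the filtering described here it can only exclude more files than intended, never fewer.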

### Update the file tracking database with the latest from Multi-LIMS warehouse/baton (iRODS)
A command-line tool exists. The command to perform the above operation on Sanger systems is `./fits update_sanger`.
