# Performing QC on your data
The results you can get from any analysis will only ever be as good as the data you put into it. To avoid spending countless hours performing analysis without receiving any satisfactory results, or worse yet erroneous or misleading results, it is important to QC your data before starting. There are a number of checks you can make to ensure you are dealing with good quality data, and we will walk you through some of them here. 

## Contamination
In order to get meaningful results from `Roary`, the samples should be closely related. If you have lots of contamination in your data, for instance if one of your samples is from a different species, you will get very few genes in your core genome, if any at all.  

It is always a good idea to check that your samples are the species you expect them to be. You can use tools such as `Bactinspector` or [`Kraken`](https://www.ebi.ac.uk/research/enright/software/kraken) for this. `Roary` has a qc option that will run `Kraken` for you and generate a report listing the top species of all the samples.

We will not run `Kraken` now but an example report may look something like this:

![QC report](img/qc_report.png)

As we expected, these three samples are all of the same species. Now suppose we had an additional fourth sample. Running `Roary` with the qc option would then give the following output:

![QC report with contamination](img/qc_contamination.png)

This tells us that the most prevalent species in sample 4 is in fact *Escherichia coli*, so we will exclude this sample from our analysis before we carry on.

## Coverage
To get reasonable quality assemblies, you need a genome coverage of at least 30x. Remember that, to get a quick estimate of your coverage, you can divide the number of bases in your sequence data by the number of bases in the reference genome of the species. The genome of _S. pneumoniae_ is approximately 2,200,000 bases. The coverage of the samples used in this tutorial is listed below.

|Sample |No. of Bases|Coverage|
|------ |------------|--------|
|sample1|262705400   |120x    |
|sample2|218026200   |99x     |
|sample3|173524000   |79x     |
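
The estimate described above is a single division, so it is easy to check by hand. The following is a minimal sketch of that arithmetic, assuming the approximate genome size of 2,200,000 bases; `estimate_coverage` is a hypothetical helper name used only for illustration. Because the genome size is approximate, a result can differ slightly from a tabulated value.

```shell
# Approximate S. pneumoniae genome size, taken from the text (an approximation).
genome_size=2200000

# estimate_coverage is a hypothetical helper, not part of any tool.
# Usage: estimate_coverage <number_of_bases> <genome_size>
# Rounds to the nearest whole number using integer arithmetic only.
estimate_coverage() {
    echo $(( ($1 + $2 / 2) / $2 ))
}

estimate_coverage 262705400 "$genome_size"   # sample1, prints 119
estimate_coverage 218026200 "$genome_size"   # sample2, prints 99
estimate_coverage 173524000 "$genome_size"   # sample3, prints 79
```

All three samples are well above the 30x threshold mentioned above.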

## Assembly size
The size of the assemblies can also provide a useful hint. If one of the assemblies is much smaller or bigger than the others, there is a chance that it is not from the same species as the rest.

## Fragmented assemblies
If the assemblies are very fragmented (thousands of contigs), the genes themselves may be too fragmented to infer the pangenome reliably.

These are just some of the most basic things that you can do to make sure your data looks ok. There is much more that can be done but we won't go into any further detail in this tutorial.

To generate some basic metrics about genome assemblies, you can use the `assembly-stats` tool.

```shell
assembly-stats assemblies/sample1.fasta
```
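
If you want to sanity-check the two numbers that matter most for the checks above, total assembly size and number of contigs, they can also be approximated with standard Unix tools. This is only a sketch: the toy FASTA file and the helper names `count_contigs` and `assembly_size` are made up for illustration and are not part of `assembly-stats`.

```shell
# Hypothetical toy assembly written to a temporary file, for illustration only.
tmp_fasta=$(mktemp)
printf '>contig1\nACGTACGT\nACGT\n>contig2\nGGCC\n' > "$tmp_fasta"

# Number of contigs: every FASTA record starts with a '>' header line.
count_contigs() {
    grep -c '^>' "$1"
}

# Total assembly size: count sequence characters, ignoring headers and line breaks.
assembly_size() {
    grep -v '^>' "$1" | tr -d '\n\r' | wc -c | tr -d ' '
}

count_contigs "$tmp_fasta"   # prints 2
assembly_size "$tmp_fasta"   # prints 16
```

To apply the same helpers to a real assembly, pass its path instead of the temporary file, for example `count_contigs assemblies/sample1.fasta`.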

## Check your understanding
**Q5: Why is it important to QC your data?**  
  
**Q6: You're not getting any core genes when you run Roary. What could be the reason?**  

**Q7: What is the size of the assembly for sample1?**  

**Q8: How many contigs are in the assembly of sample1?**  

Now you should be ready to run Roary to construct a pangenome, so go to the next section, [Constructing a Pangenome using Roary](run_roary.ipynb).  