# Using TEA (Taxa Evaluation and Assessment) 


## Quick Start

    Create a Misc object, then call its main() function.
    
    Arguments:
        - Input: the name of the ground truth profile file
        - Output: the name of the Excel file, a .xlsx with six sheets: True Positives, False Negatives, False Positives,
            True Negatives, Precision, and Recall for each tool
        - (optional) the input directory containing all profiles, including the ground truth; <Default: Directory of Package Manual>
        - (optional) the output directory; <Default: Directory of Package Manual>
        - (optional) "yes" if you want an individual .csv file of each tool's confusion matrix; <Default: "no">

In [1]:
from TEA.precall import Misc

In [5]:
Quick = Misc()

Quick.main("truth.profile", "TaxaPerformanceMetrics_byTool", "C:\\Users\\**\\**\\TEA\\Test Files", "")

[]


FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\..\\..\\TEA\\Test Files\\truth.profile'
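
The FileNotFoundError above is what you get when the ground truth file is not in the input directory. As a minimal sketch, you could check the directory before calling main(). It assumes only what the argument list states: all profiles, including the ground truth, sit in one directory. It also assumes the '.profile' extension used in the example. `find_profiles` is a hypothetical helper and is not part of TEA.

```python
from pathlib import Path


def find_profiles(directory, truth_name):
    """Return the names of the tool profiles in `directory`, excluding the ground truth.

    Hypothetical helper, not part of TEA. Raises FileNotFoundError with the
    full path when the ground truth file is missing.
    """
    folder = Path(directory)
    truth_path = folder / truth_name
    if not truth_path.is_file():
        raise FileNotFoundError(f"Ground truth profile not found: {truth_path}")
    # Every other '.profile' file in the folder is treated as a tool's prediction.
    return sorted(p.name for p in folder.glob("*.profile") if p.name != truth_name)
```

Running it first lets you confirm which profiles will be evaluated before you call Misc.main().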

## Set up:

    To set up a Python file for TEA, first import the modules in the package.
    Then create an object for each module you want to use.

In [2]:
from TEA.profile_parser import Parser
from TEA.comparator import Comparator
from TEA.confusion_matrix import Confusion
from TEA.precall import Misc

myParser = Parser()
myComparer = Comparator()
myConfusion = Confusion("truth", "pred")
myMisc = Misc()

## How they work

    The main purpose of this package is to calculate a confusion matrix for each Tax ID and save it, along with related information, to a .csv file.

    Each module handles a specific step in that process.

### Parser

    Parser separates the relevant information for each Tax ID (e.g. rank and abundance) into dictionaries. Both parse_data() and main() return a dictionary of dictionaries, but main() also runs the supporting steps listed in its description below.

    The returned variable looks like this:

    {Sample Number : dict}
                   - {Rank : dict}
                           - {Tax ID : Abundance}

    An alternative parsing format captures other information about each Tax ID (its rank and name). It returns a dictionary of dictionaries that looks like this:

    {Sample Number : dict}
                  - {Tax ID : list}
                            - [rank, name]
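
To make the two shapes concrete, here is a small sketch written as plain Python literals. The values are copied from the print_samples() output for "pred" further down. The key types (integer sample numbers and Tax IDs, string ranks) are an assumption based on how they are printed.

```python
# Default format: {sample number: {rank: {Tax ID: abundance}}}
# Key types are assumed from the printed output, not confirmed by the package.
default_samples = {
    0: {"rank1": {1: 100.0}, "rank2": {3: 97.0, 4: 2.0, 7: 1.0}},
    1: {"rank1": {1: 100.0}, "rank2": {3: 15.0, 5: 85.0}},
}

# Alternative format: {sample number: {Tax ID: [rank, name]}}
alternative_samples = {
    0: {
        1: ["rank1", "bac"],
        3: ["rank2", "bac|bac1"],
        4: ["rank2", "bac|bac2"],
        7: ["rank2", "bac|bac4"],
    },
    1: {
        1: ["rank1", "bac"],
        3: ["rank2", "bac|bac1"],
        5: ["rank2", "bac|bac3"],
    },
}

# Look up Tax ID 3 in sample 0 in each format.
abundance = default_samples[0]["rank2"][3]
rank, name = alternative_samples[0][3]
print(abundance, rank, name)  # 97.0 rank2 bac|bac1
```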

    The functions that contribute to main() and parse_data() can also be used individually, as shown below.


#### main(self, f, t=0)

    This function calls parse_data(), divide_content(), and get_file() to create and return a dictionary where sample number is the key and a dictionary of {Rank : {Tax ID : Abundance}} is the value.
    
    The variable t is optional and the default is zero. Passing one for t instead of zero tells the program to use the alternative parsing format.

In [2]:
sample_1 = myParser.main("pred")
sample_2 = myParser.main("pred2")

sample_1_alt = myParser.main("pred", 1)

#### print_samples(self, samples, t=0)

    This function prints the contents of the samples dictionary in a readable layout.

    The variable t defaults to zero, which prints the default sample format. If t=1, it prints the alternative format.

In [3]:
print("~DEFAULT~")
myParser.print_samples(sample_1)
print("~ALTERNATIVE~")
myParser.print_samples(sample_1_alt, 1)

~DEFAULT~
Sample Number: 0
	Rank: rank1
		1 - 100.0
	Rank: rank2
		3 - 97.0
		4 - 2.0
		7 - 1.0
Sample Number: 1
	Rank: rank1
		1 - 100.0
	Rank: rank2
		3 - 15.0
		5 - 85.0
~ALTERNATIVE~
Sample Number: 0
	Tax ID: 1
		Rank - rank1, Name - bac
	Tax ID: 3
		Rank - rank2, Name - bac|bac1
	Tax ID: 4
		Rank - rank2, Name - bac|bac2
	Tax ID: 7
		Rank - rank2, Name - bac|bac4
Sample Number: 1
	Tax ID: 1
		Rank - rank1, Name - bac
	Tax ID: 3
		Rank - rank2, Name - bac|bac1
	Tax ID: 5
		Rank - rank2, Name - bac|bac3


### Comparator

    Comparator compares two sample dictionaries to find the common Tax IDs in each sample and to combine the Tax IDs in each sample into a new dictionary.

#### main(self, files, t=0)

    The main() function takes a string (files) containing two profile names (without the '.profile' extension) separated by a space. It also accepts an optional argument t, which is passed along when the function calls Parser.main().
    
    This function returns two dictionaries where sample number is the key and a set is the value. In the first dictionary, each set contains the Tax IDs both samples have in common. In the second, each set contains all the Tax IDs from both samples, excluding repeats.

In [4]:
common_1_2, combined_1_2 = myComparer.main("truth pred")
print(common_1_2)
print(combined_1_2)

{0: {1, 3, 4}, 1: {1, 3, 5}}
{0: {1, 2, 3, 4, 6, 7}, 1: {1, 3, 5}}


#### save_tax_ID(self, samples)

    This function iterates over a samples dictionary and uses get_tax_ID() to create and return a dictionary where sample number is the key and a set of Tax IDs from that sample is the value.

In [12]:
sample_tax_ID_1 = myComparer.save_tax_ID(sample_1)
sample_tax_ID_2 = myComparer.save_tax_ID(sample_2)
print(sample_tax_ID_1)
print(sample_tax_ID_2)

{0: {1, 3, 4, 7}, 1: {1, 3, 5}}
{0: {1, 3, 4}, 1: {1, 2, 5}}
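
The flattening step can be sketched in plain Python as below. This is an illustration of the idea, not the package's code: `collect_tax_ids` is a hypothetical name, and the input is assumed to use the default {sample : {rank : {Tax ID : abundance}}} format.

```python
def collect_tax_ids(samples):
    """Map each sample number to the set of Tax IDs found at any rank.

    Hypothetical stand-in for the idea behind save_tax_ID(); assumes the
    default parsing format {sample: {rank: {tax_id: abundance}}}.
    """
    return {
        sample: {tax_id for taxa in ranks.values() for tax_id in taxa}
        for sample, ranks in samples.items()
    }


# Values copied from the print_samples() output for "pred".
pred = {
    0: {"rank1": {1: 100.0}, "rank2": {3: 97.0, 4: 2.0, 7: 1.0}},
    1: {"rank1": {1: 100.0}, "rank2": {3: 15.0, 5: 85.0}},
}
pred_ids = collect_tax_ids(pred)
```

The ranks and abundances are dropped on purpose: the comparison steps only need to know which Tax IDs are present in each sample.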


#### common_tax_ID(self, tax_id1, tax_id_2)

    This function creates and returns a dictionary where sample number is the key and a set containing the common Tax IDs between two sample files is the value. It iterates over a samples dictionary and calls _common_tax_ID() to save each set under a sample number.

In [13]:
common_dict_1_2 = myComparer.common_tax_ID(sample_tax_ID_2, sample_tax_ID_1)
print(common_dict_1_2)

{0: {1, 3, 4}, 1: {1, 5}}


#### combine_tax_ID(self, tax_id_1, tax_id_2)

    This function creates and returns a dictionary where sample number is the key and a set containing the Tax IDs from both sample dictionaries is the value.

In [14]:
combined_dict_1_2 = myComparer.combine_tax_ID(sample_tax_ID_2, sample_tax_ID_1)
print(combined_dict_1_2)

{0: {1, 3, 4, 7}, 1: {1, 2, 3, 5}}
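
Both comparisons come down to per-sample set operations: intersection for the common Tax IDs and union for the combined ones. The sketch below uses hypothetical function names. How it treats a sample number that appears in only one input is an assumption, since this document does not say what TEA does in that case.

```python
def common_ids(ids_a, ids_b):
    """Per-sample intersection. Assumption: samples missing from either input are skipped."""
    return {s: ids_a[s] & ids_b[s] for s in sorted(ids_a.keys() & ids_b.keys())}


def combined_ids(ids_a, ids_b):
    """Per-sample union. Assumption: a sample missing from one input counts as an empty set."""
    return {
        s: ids_a.get(s, set()) | ids_b.get(s, set())
        for s in sorted(ids_a.keys() | ids_b.keys())
    }


# The two Tax ID dictionaries printed by save_tax_ID() above.
ids_pred = {0: {1, 3, 4, 7}, 1: {1, 3, 5}}
ids_pred2 = {0: {1, 3, 4}, 1: {1, 2, 5}}

common = common_ids(ids_pred2, ids_pred)
combined = combined_ids(ids_pred2, ids_pred)
```

Because intersection and union are symmetric, the order of the two arguments does not change either result.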


### Confusion

    Confusion uses both Comparator and Parser to create a confusion matrix for every Tax ID in each sample.

#### \__init__(self, tru, fn)

    The constructor for Confusion objects is a little different since it takes two arguments, one for the name of the truth file and one for the name of the predicted file (excluding the '.profile' part).

In [15]:
Confu = Confusion("truth", "pred")

#### get_file_name(self) and get_truth(self)

    These functions return a string containing the name of the predicted file and the truth file, respectively.

In [16]:
print("Truth:", Confu.get_truth(), '\nPredicted:', Confu.get_file_name())

Truth: truth 
Predicted: pred


#### set_file_name(self, fn) and set_truth(self, tru)

    These two functions let you change the predicted file or the ground truth file, respectively, whenever you need to. They also mean you only need one Confusion object for multiple predicted files or even multiple truth files.

In [17]:
Confu.set_file_name("pred2")
Confu.set_truth("pred")
print("Truth:", Confu.get_truth(), '\nPredicted:', Confu.get_file_name())

Truth: pred 
Predicted: pred2


#### main(self, csv="yes", t=0)

    This function uses Comparator, Parser, and internal functions to create a confusion matrix for the predicted file, then saves that data as a .csv file. The automatic name for that .csv file is the truth and predicted file names joined by two hyphens (e.g. Truth='truth', Predicted='pred', .csv name='truth--pred.csv').
    
    It also returns the confusion matrix it created.

In [18]:
t_p_matrix = Confu.main()
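
One way to picture a per-Tax-ID confusion matrix is to count, across the samples in the ground truth, whether each Tax ID is present in the truth, the prediction, both, or neither. The sketch below follows that reading. The function name and the {Tax ID : {"TP", "FN", "FP", "TN"}} layout are assumptions, not the structure Confusion.main() actually returns. The truth sets are inferred from the Comparator.main() output above rather than read from a file.

```python
def confusion_by_tax_id(truth_ids, pred_ids):
    """Count TP, FN, FP and TN per Tax ID over the samples of the ground truth.

    Hypothetical sketch; inputs are {sample number: set of Tax IDs}.
    """
    all_ids = set().union(*truth_ids.values(), *pred_ids.values())
    matrix = {t: {"TP": 0, "FN": 0, "FP": 0, "TN": 0} for t in sorted(all_ids)}
    for sample, truth in truth_ids.items():
        pred = pred_ids.get(sample, set())
        for tax_id, counts in matrix.items():
            if tax_id in truth:
                counts["TP" if tax_id in pred else "FN"] += 1
            else:
                counts["FP" if tax_id in pred else "TN"] += 1
    return matrix


truth_ids = {0: {1, 2, 3, 4, 6}, 1: {1, 3, 5}}  # inferred from the outputs above
pred_ids = {0: {1, 3, 4, 7}, 1: {1, 3, 5}}

matrix = confusion_by_tax_id(truth_ids, pred_ids)
print(matrix[7])  # {'TP': 0, 'FN': 0, 'FP': 1, 'TN': 1}
```

Each sample adds exactly one count to every Tax ID, so each row sums to the number of samples in the ground truth.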

#### check_matrix_error(self, matrix)

    This is a supplementary function that checks that, for each Tax ID, the numbers in the confusion matrix sum to the total number of samples in the ground truth file. It returns two lists: the Tax IDs whose confusion matrices sum to more than that number, and those that sum to less.

In [19]:
Confu.check_matrix_error(t_p_matrix)

([], [])
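
The check itself is a simple sum per Tax ID. The sketch below assumes a {Tax ID : {"TP", "FN", "FP", "TN"}} layout and takes the sample count as an explicit argument. Both are assumptions for illustration; the real function takes only the matrix.

```python
def find_count_errors(matrix, n_samples):
    """Return (over, under): Tax IDs whose counts sum above or below n_samples.

    Hypothetical sketch of the idea behind check_matrix_error().
    """
    over = [t for t, counts in matrix.items() if sum(counts.values()) > n_samples]
    under = [t for t, counts in matrix.items() if sum(counts.values()) < n_samples]
    return over, under


example = {
    1: {"TP": 2, "FN": 0, "FP": 0, "TN": 0},
    7: {"TP": 0, "FN": 0, "FP": 1, "TN": 1},
}
print(find_count_errors(example, 2))  # ([], [])
print(find_count_errors(example, 3))  # ([], [1, 7])
```

Two empty lists, as in the output above, mean every Tax ID was counted exactly once per sample.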

### Misc

    Misc is mainly for finding the confusion matrices of multiple profiles at once and creating an .xlsx file to display the confusion matrix values for each predicted file and the ground truth on separate sheets. It also uses the Precall class to calculate and add a sheet for precision and recall.
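
For reference, precision and recall follow from the confusion matrix counts by their standard definitions: precision = TP / (TP + FP) and recall = TP / (TP + FN). The sketch below applies them to a single set of counts. Returning 0.0 when a denominator is zero is a convention chosen here, and how TEA aggregates counts per tool is not covered by this document.

```python
def precision_recall(tp, fp, fn):
    """Standard precision and recall from true positive, false positive and false negative counts.

    Assumption: a zero denominator yields 0.0 rather than raising an error.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Illustrative counts only: 6 true positives, 1 false positive, 2 false negatives.
precision, recall = precision_recall(tp=6, fp=1, fn=2)
print(round(precision, 3), recall)  # 0.857 0.75
```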

#### main(self, names, file_path="", excel_name="Default_excel_name", csv="no")

    This function takes the name of the ground truth file, the name of the output .xlsx file, the directory of the profile files, the directory for the output file, and a string that determines whether to also create .csv files for each tool's confusion matrix.
        The name of the ground truth file is a string.
        The name of the output .xlsx file can be anything you choose.
        The directory of the profile files needs to include all the profiles being evaluated plus the ground truth file. They should all be in the CAMI format and have the extension '.profile'. The default is the package manual directory.
        The directory for the .xlsx output file can be anything you choose. The default is the package manual directory.
        The .csv files contain the confusion matrix for each Tax ID. One file is made for each tool. If you want the program to output them, pass in "yes"; otherwise leave it blank, since the default is "no". These files are saved in the same directory as the .xlsx output file.

In [14]:
myMisc.main("truth.profile", "TaxaPerformanceMetrics_byTool2", "C:\\*\\*\\*\\MoC\\tests\\", "")


Added as matrix truth

Added as matrix pred

Added as matrix pred2

Added as matrix pred3

Saved as 'C:\Users\milkg\Documents\Trail_Run_2.xlsx'
