Running this notebook lets you reproduce the results of the simulated data analysis. Let me walk you through the parameter setup and execution.

First we import tree_eval_methods, which contains the Pipeline class. As mentioned in the dissertation report, Pipeline is a wrapper class. It generates genotypes using different clustering algorithms, reconstructs trees using different methods, and finally estimates the difference score between the true and predicted data trees.

In [12]:
import tree_eval_methods
import numpy as np

Next, we need to define the paths to the files where the simulated data is stored. This data can be simulated using ... however, if you trust the results and want to proceed, please use the provided files. These files contain populated clusters, whose members are generated by introducing random error into the genotype.
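To make "introducing random error" concrete, here is a minimal sketch of how one cluster could be populated. It assumes binary genotypes and independent per-position flips; the function name `populate_cluster` and its parameters are hypothetical and are not taken from the simulation code used for the provided files.

```python
import random

def populate_cluster(genotype, n_members, error_rate, rng):
    """Return n_members noisy copies of a binary genotype (list of 0/1).

    Assumption: each position is flipped independently with
    probability error_rate. The real simulator may differ.
    """
    members = []
    for _ in range(n_members):
        member = [1 - bit if rng.random() < error_rate else bit
                  for bit in genotype]
        members.append(member)
    return members

rng = random.Random(0)  # fixed seed so the example is repeatable
true_genotype = [1, 0, 0, 1, 1]
cluster = populate_cluster(true_genotype, n_members=20, error_rate=0.05, rng=rng)
print(len(cluster))
```

With a small error rate most members stay identical to the true genotype, which is what makes the clusters recoverable in the first place.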

In [13]:
raw_data_dirs=["./populated_data/populated_true_genotypes_5_5_0.01_20.txt",
               "./populated_data/populated_true_genotypes_5_5_0.05_20.txt",
               "./populated_data/populated_true_genotypes_5_5_0.1_20.txt",
               "./populated_data/populated_true_genotypes_10_10_0.01_100.txt",
               "./populated_data/populated_true_genotypes_10_10_0.05_100.txt",
               "./populated_data/populated_true_genotypes_10_10_0.1_100.txt",
               "./populated_data/populated_true_genotypes_20_20_0.01_1000.txt",
               "./populated_data/populated_true_genotypes_20_20_0.05_1000.txt",
               "./populated_data/populated_true_genotypes_20_20_0.1_1000.txt"]

These are the paths to the true genotypes, which were used to populate the clusters above.

In [24]:
true_genotype_dirs=["./true_genotypes/true_genotypes_5_5_0.01_20.txt",
               "./true_genotypes/true_genotypes_5_5_0.05_20.txt",
               "./true_genotypes/true_genotypes_5_5_0.1_20.txt",
               "./true_genotypes/true_genotypes_10_10_0.01_100.txt",
               "./true_genotypes/true_genotypes_10_10_0.05_100.txt",
               "./true_genotypes/true_genotypes_10_10_0.1_100.txt",
               "./true_genotypes/true_genotypes_20_20_0.01_1000.txt",
               "./true_genotypes/true_genotypes_20_20_0.05_1000.txt",
               "./true_genotypes/true_genotypes_20_20_0.1_1000.txt"]

Here you specify the clustering techniques you want to test. The available options are "slc", "k_means" and "bmm".

In [15]:
clustering_methods = ["slc","k_means","bmm"]
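As an illustration only, the sketch below groups binary vectors by a Hamming-distance threshold. It assumes that "slc" stands for single-linkage clustering, which this notebook does not state, and it is not the implementation inside tree_eval_methods. The names `hamming` and `single_linkage_clusters` are hypothetical. Cutting a single-linkage tree at a threshold is the same as taking connected components of the graph that links any two vectors within that threshold, so that is what the code does.

```python
def hamming(a, b):
    """Number of positions at which two equal-length vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def single_linkage_clusters(vectors, max_hamming_distance):
    """Return clusters as sorted lists of indices into vectors."""
    parent = list(range(len(vectors)))

    def find(i):
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Join every pair of vectors that lies within the threshold.
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if hamming(vectors[i], vectors[j]) <= max_hamming_distance:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(vectors)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())

vectors = [[0, 0, 0, 0, 0], [0, 0, 0, 0, 1],
           [1, 1, 1, 1, 1], [1, 1, 1, 1, 0]]
groups = single_linkage_clusters(vectors, max_hamming_distance=1)
print(groups)
```

In this toy input the first two vectors differ in one position and so do the last two, while every cross pair differs in at least four, so two groups come out.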


Here you specify which tree reconstruction methods to use with the clustering methods specified above.

In [16]:
tree_methods = ["pars","nj","upgma"]


For every dataset, you have to specify how many clusters there should be, according to the parameters of the simulated data.

In [17]:
numbers_of_clusters=[5,5,5,10,10,10,20,20,20]


These are the genotype vector sizes for each raw data file we use.

In [18]:
vector_sizes = [5,5,5,10,10,10,20,20,20]
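The file lists in this notebook differ only in directory and file-name prefix, and they all end in the same nine parameter suffixes. If you prefer not to maintain each list by hand, a small helper can build them from one suffix list. The helper name `build_paths` is hypothetical; the suffixes and directory names are copied from the lists in this notebook.

```python
# The nine parameter suffixes shared by every file list in this notebook.
suffixes = ["5_5_0.01_20", "5_5_0.05_20", "5_5_0.1_20",
            "10_10_0.01_100", "10_10_0.05_100", "10_10_0.1_100",
            "20_20_0.01_1000", "20_20_0.05_1000", "20_20_0.1_1000"]

def build_paths(directory, prefix):
    """Hypothetical helper: one path per suffix, e.g. ./dir/prefix_5_5_0.01_20.txt."""
    return ["./{0}/{1}_{2}.txt".format(directory, prefix, s) for s in suffixes]

built_raw = build_paths("populated_data", "populated_true_genotypes")
built_true = build_paths("true_genotypes", "true_genotypes")
print(built_raw[0])
```

Building every list from the same suffixes also guarantees that the lists stay aligned, which matters because the datasets are matched by position.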

We will be writing predicted genotypes (averaged vectors from clustering results) to the following directories.
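The "averaged vectors" can be pictured as follows. This is a rough sketch, assuming binary genotypes and a 0.5 rounding threshold, neither of which is confirmed by the Pipeline code; the function name `predicted_genotype` is hypothetical.

```python
def predicted_genotype(members, threshold=0.5):
    """Average the members of one cluster position by position.

    Returns (means, rounded), where rounded maps each mean to 0 or 1.
    The threshold is an assumption made for this illustration.
    """
    n = float(len(members))
    means = [sum(column) / n for column in zip(*members)]
    rounded = [1 if m >= threshold else 0 for m in means]
    return means, rounded

members = [[1, 0, 1],
           [1, 0, 0],
           [1, 1, 1]]
means, rounded = predicted_genotype(members)
print(rounded)
```

Here the second position is set in only one of three members, so it rounds to 0, while the third is set in two of three and rounds to 1.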

In [19]:
predicted_genotype_dirs=["./predicted_genotypes/predicted_genotypes_5_5_0.01_20.txt",
               "./predicted_genotypes/predicted_genotypes_5_5_0.05_20.txt",
               "./predicted_genotypes/predicted_genotypes_5_5_0.1_20.txt",
               "./predicted_genotypes/predicted_genotypes_10_10_0.01_100.txt",
               "./predicted_genotypes/predicted_genotypes_10_10_0.05_100.txt",
               "./predicted_genotypes/predicted_genotypes_10_10_0.1_100.txt",
               "./predicted_genotypes/predicted_genotypes_20_20_0.01_1000.txt",
               "./predicted_genotypes/predicted_genotypes_20_20_0.05_1000.txt",
               "./predicted_genotypes/predicted_genotypes_20_20_0.1_1000.txt"]

We also have to convert the true genotypes to PHYLIP format, as they are currently just raw strings. The formatted files will be written to the following paths.
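For readers unfamiliar with PHYLIP, here is a minimal sketch of the conversion. It assumes each raw genotype is a string of equal length and invents taxon names of the form g0, g1, ...; the function name `to_phylip` and the naming scheme are hypothetical, and the Pipeline may format its files differently. The layout follows the strict sequential PHYLIP convention: a header with the number of sequences and characters, then one line per sequence with the name padded to 10 characters.

```python
def to_phylip(genotypes, name_prefix="g"):
    """Format equal-length genotype strings as sequential PHYLIP text."""
    n_chars = len(genotypes[0])
    if any(len(g) != n_chars for g in genotypes):
        raise ValueError("all genotypes must have the same length")
    lines = ["{0} {1}".format(len(genotypes), n_chars)]
    for i, genotype in enumerate(genotypes):
        name = "{0}{1}".format(name_prefix, i)
        # Strict PHYLIP reserves the first 10 columns for the name.
        lines.append(name.ljust(10) + genotype)
    return "\n".join(lines) + "\n"

phylip_text = to_phylip(["10011", "00110"])
print(phylip_text)
```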

In [20]:
adjusted_true_genotype_dirs=["./adjusted/adjusted_true_genotypes_5_5_0.01_20.txt",
               "./adjusted/adjusted_true_genotypes_5_5_0.05_20.txt",
               "./adjusted/adjusted_true_genotypes_5_5_0.1_20.txt",
               "./adjusted/adjusted_true_genotypes_10_10_0.01_100.txt",
               "./adjusted/adjusted_true_genotypes_10_10_0.05_100.txt",
               "./adjusted/adjusted_true_genotypes_10_10_0.1_100.txt",
               "./adjusted/adjusted_true_genotypes_20_20_0.01_1000.txt",
               "./adjusted/adjusted_true_genotypes_20_20_0.05_1000.txt",
               "./adjusted/adjusted_true_genotypes_20_20_0.1_1000.txt"]

You can specify how many times each Pipeline is run. The results are averaged over these runs.

In [21]:
iterations = 1
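All the per-dataset lists defined so far are matched by position, so they must have the same length. If the lists are combined with `zip`, a shorter list silently drops the extra datasets instead of raising an error. A quick check before a long run can catch that. The helper name `check_equal_lengths` is hypothetical, and the stand-in lists below only exist so that the example runs on its own.

```python
def check_equal_lengths(**named_lists):
    """Raise ValueError unless every list has the same length; return that length."""
    lengths = dict((name, len(values)) for name, values in named_lists.items())
    if len(set(lengths.values())) != 1:
        raise ValueError("Configuration lists differ in length: {0}".format(lengths))
    return list(lengths.values())[0]

# Stand-in lists; in the notebook you would pass raw_data_dirs,
# true_genotype_dirs, numbers_of_clusters, vector_sizes, and so on.
example_paths = ["a.txt", "b.txt", "c.txt"]
example_clusters = [5, 10, 20]
n_datasets = check_equal_lengths(paths=example_paths, clusters=example_clusters)
print(n_datasets)
```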

This is the main loop, where we simply run through all combinations of clustering method, tree reconstruction method, and available dataset (the datasets differ in size and error).

In [25]:
for clustering_method in clustering_methods:
    print "----New Clustering method ", clustering_method, " ----------------"
    for tree_method in tree_methods:
        print "$$$$$$ Clustering ", clustering_method, "  Tree method ", tree_method," $$$$$$$$"
        # Pair up the per-dataset settings by position; zip stops at the shortest list.
        datasets = zip(raw_data_dirs, true_genotype_dirs, adjusted_true_genotype_dirs,
                       predicted_genotype_dirs, numbers_of_clusters, vector_sizes)
        for (raw_data_dir, true_genotype_dir, write_adjusted_true_gen_dir,
             write_predicted_gen_dir, no_clusters, vector_size) in datasets:
            print "==============================================="
            print "Raw_data_dir: ", str(raw_data_dir)
            # One distance score per run; they are averaged after the inner loop.
            results = []
            for i in xrange(iterations):
                pipe = tree_eval_methods.Pipeline(
                    raw_data_dir, true_genotype_dir, clustering_method, tree_method,
                    no_clusters, vector_size,
                    write_adjusted_true_gen_dir, write_predicted_gen_dir,
                    max_hamming_distance=2)
                results.append(pipe.run_pipe())

            print "Averaged Distance: \n", np.mean(results)
            print "STD : \n", np.std(results)
            print "==============================================="

----New Clustering method  bmm  ----------------
$$$$$$ Clustering  bmm   Tree method  pars  $$$$$$$$
Raw_data_dir:  ./populated_data/populated_true_genotypes_5_5_0.01_20.txt
2.0
Averaged Distance: 
2.0
STD : 
0.0
Raw_data_dir:  ./populated_data/populated_true_genotypes_5_5_0.05_20.txt
1.73205080757
Averaged Distance: 
1.73205080757
STD : 
0.0
Raw_data_dir:  ./populated_data/populated_true_genotypes_5_5_0.1_20.txt
4.0
Averaged Distance: 
4.0
STD : 
0.0
Raw_data_dir:  ./populated_data/populated_true_genotypes_10_10_0.01_100.txt
6.7082039325
Averaged Distance: 
6.7082039325
STD : 
0.0
Raw_data_dir:  ./populated_data/populated_true_genotypes_10_10_0.05_100.txt
7.74596669241
Averaged Distance: 
7.74596669241
STD : 
0.0
Raw_data_dir:  ./populated_data/populated_true_genotypes_10_10_0.1_100.txt


KeyboardInterrupt: 