# Week 3
## Monday June 3, 2019:
To start out today, I made a smaller .csv file to test the Python canopy clustering code. I used Multigenome_CSV_Generator.py to combine the previously generated .csv files from 10 genomes into one large .csv file of counts of the top 10 kmers. I then ran my Python code for canopy clustering. While it ran, the data frame holding the .csv file used about 16 GB of RAM, and the CPU load on my Mac rose sharply while the distance matrix was being calculated. With the Euclidean metric I got lots of small clusters, because the cluster cut-off values were lower than the distances that metric produced. I then switched the distance metric to cosine distance (1 minus cosine similarity, so it ranges from 0 to 2 in general and from 0 to 1 for non-negative counts) and tried to find a good cut-off value for canopy inclusion. Once I had good clustering, I generated a list of groupings to import into FindMyFriends using the manualGrouping() method. In this list, each index corresponds to a protein, and the value at that index is the grouping of that protein. When I examined the grouping list, I realized that the code had an error: points were being clustered multiple times. I fixed that issue and finalized the code below.
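As a small illustration of why the cut-off values behaved so differently under the two metrics, here is a minimal sketch with made-up count vectors (the three rows are hypothetical and not taken from my data). Euclidean distance grows with the size of the counts, while cosine distance only looks at the shape of the profile and stays between 0 and 1 for non-negative counts, which makes a single cut-off easier to choose.

```python
import numpy as np
from sklearn.metrics.pairwise import pairwise_distances

# Made-up k-mer count rows for three hypothetical proteins.
counts = np.array([
    [10, 0, 2],   # protein A
    [20, 0, 4],   # same profile as A, but with doubled counts
    [0, 5, 1],    # a different profile
], dtype=float)

euclid = pairwise_distances(counts, metric="euclidean")
cosine = pairwise_distances(counts, metric="cosine")

# Euclidean distance separates A from its scaled copy; cosine distance does not.
print(euclid.round(2))
print(cosine.round(3))
```

With real count tables the Euclidean values depend on how large the counts are, so a cut-off that works for one table may not carry over to another.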

In [None]:
import numpy as np
import pandas as pd
import csv
from sklearn.metrics.pairwise import pairwise_distances

# X should be a NumPy matrix, very likely a sparse matrix: http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.sparse.csr_matrix.html#scipy.sparse.csr_matrix
# T1 = points closer than T1 to a canopy center are removed from the pool of candidate centers
# T2 = points closer than T2 to a canopy center are added to that canopy
# Each point is assigned to at most one canopy (the first one that claims it)
# T1 == T2 puts every point in exactly one canopy
# T1 > T2 can leave points between T2 and T1 of a center in no canopy
# T1 < T2 lets already-assigned points be picked as centers, which can produce empty canopies
# Distance metric can be any from here: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.pairwise_distances.html
# filemap may be a list of point names in their order in X. If included, row numbers from X will be replaced with names from filemap. 
 
def canopy(X1_dist, T1, T2, filemap=None):
    canopies = dict()
    canopy_points = set(range(X1_dist.shape[0]))
    grouping_list = [None] * X1_dist.shape[0]
    iteration = 0
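    clustered_points = set()  # points already assigned to a canopy
    while canopy_points:
        iteration = iteration + 1
        print("iteration: " + str(iteration))
        # pick an arbitrary remaining point as the next canopy center
        point = canopy_points.pop()
        i = len(canopies)
        points_in_T2 = np.where(X1_dist[point] < T2)[0]
        # skip points that an earlier canopy already claimed, so no point is clustered twice
        points_to_cluster = set(points_in_T2).difference(clustered_points)
        canopies[i] = {"c":point, "points": points_to_cluster}
        clustered_points.update(points_to_cluster)
        # points within T1 of this center can no longer become centers themselves
        canopy_points = canopy_points.difference(set(np.where(X1_dist[point] < T1)[0]))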
        
        for entry in canopies[i]["points"]:
            grouping_list[entry] = point
        
    if filemap is not None:
        # replace row numbers with names from filemap; iterate over a copy
        # of the items because the dict is updated inside the loop
        for canopy_id, members in list(canopies.items()):
            canopies[canopy_id] = {
                "c": filemap[members["c"]],
                "points": [filemap[p] for p in members["points"]],
            }
    return grouping_list

def main():
    df = pd.read_csv('/Users/matthewthompson/Documents/UAMS_SURF/K-mer_testing/FAA_files/10_genome_k4_kmer_top_10_table.csv')
    #gets protein_ids
    kmers = list(df['Unnamed: 0'])
    df['kmers'] = kmers
    #move from data column to index column
    df.set_index('kmers',inplace = True)
    #delete protein header list
    del df['Unnamed: 0']
    df = df.T
    df.head(3)
    print("canopy clustering:")
    X1_dist = pairwise_distances(df.to_numpy(), metric='cosine')
    
    grouping_list = canopy(X1_dist, 0.99, 0.99)
    
    '''
    with open('cosine_canopy_clusters_0.99.csv', 'w') as csv_file:
        writer = csv.writer(csv_file)
        for key, value in clusters.items():
            writer.writerow([key, value])
    '''
    # newline='' lets the csv module control line endings itself
    with open('grouping_list.csv', 'w', newline='') as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(grouping_list) 
    
if __name__ == "__main__":
    main()
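
To check the shape of the grouping list on something small, here is a sketch that runs a condensed restatement of the canopy logic on a hand-written distance matrix. The function name `canopy_groups` and the 5 x 5 matrix are made up for this check and are not part of the code above. Points 0 to 2 are close to each other, points 3 and 4 are close to each other, and the two blocks are far apart, so with T1 == T2 every point should land in exactly one of two groups.

```python
import numpy as np

def canopy_groups(dist, t1, t2):
    """Condensed, hypothetical restatement of the canopy() logic for a quick check."""
    candidates = set(range(dist.shape[0]))
    clustered = set()
    groups = [None] * dist.shape[0]
    while candidates:
        center = candidates.pop()
        # Points within t2 of the center that no earlier canopy has claimed.
        members = set(np.where(dist[center] < t2)[0].tolist()) - clustered
        clustered |= members
        # Points within t1 of the center can no longer become centers.
        candidates -= set(np.where(dist[center] < t1)[0].tolist())
        for m in members:
            groups[m] = center
    return groups

# Hand-written distances: two tight blocks, {0, 1, 2} and {3, 4}.
dist = np.array([
    [0.0, 0.1, 0.2, 0.9, 0.8],
    [0.1, 0.0, 0.1, 0.9, 0.9],
    [0.2, 0.1, 0.0, 0.8, 0.9],
    [0.9, 0.9, 0.8, 0.0, 0.1],
    [0.8, 0.9, 0.9, 0.1, 0.0],
])

groups = canopy_groups(dist, 0.5, 0.5)
print(groups)
```

Index i of the printed list is point i and the value is the center of the canopy it was assigned to. Which point ends up as the center depends on the order in which `set.pop()` returns candidates, so only the partition into groups should be relied on, not the label values.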