# Day 22 notebook
The objectives of this notebook are to practice:

* Generating data from a Gaussian mixture model
* Top-down hierarchical clustering
* Clustering gene expression data

## Modules for this activity

In [1]:
# standard library modules
import random                         # for sample

# third-party modules
from matplotlib import pyplot as plt  # for plotting
import toytree                        # for working with trees
from toytree.TreeNode import TreeNode # make TreeNode directly available

# course modules
import kmeans
import clusterplot

## PROBLEM 1: Sampling from a multivariate Gaussian distribution (1 POINT)
Implement the `sample_profile` function below, which samples from a multivariate Gaussian distribution given the mean and standard deviation of each dimension (that is, we assume a diagonal covariance matrix, so the dimensions are sampled independently).  Use the [`random.gauss`](https://docs.python.org/3/library/random.html#random.gauss) function to sample from a one-dimensional Gaussian distribution for each dimension.  Consider using the built-in `isinstance` function to handle the two forms of the `sd` argument.

In [2]:
def sample_profile(mean, sd=1):
    """Randomly samples a profile from a multivariate Gaussian distribution.
    
    Args:
        mean: a tuple giving the mean of each dimension
        sd: either a tuple giving the standard deviation of each dimension
            or a single number specifying the standard deviation for all dimensions
    Returns:
        The sampled profile as a tuple
    """
    ### BEGIN SOLUTION
    sds = sd if isinstance(sd, tuple) else [sd] * len(mean)
    return tuple(map(random.gauss, mean, sds))
    ### END SOLUTION

In [3]:
# tests for sample_profile
def round_profile(t, digits=2): return tuple(round(elt, digits) for elt in t)
random.seed(43)
assert round_profile(sample_profile((3, 2, 1))) == (4.5, 2.37, 1.69)
random.seed(43)
assert round_profile(sample_profile((3, 2, 1), sd=10)) == (17.99, 5.7, 7.89)
random.seed(43)
assert round_profile(sample_profile((3, 2, 1), sd=(1, 10, 5))) == (4.5, 5.7, 4.44)
random.seed(43)
assert round_profile(sample_profile((3,))) == (4.5,)
print("SUCCESS: sample_profile passed all tests!")

SUCCESS: sample_profile passed all tests!


## PROBLEM 2: Sampling from a Gaussian mixture model (1 POINT)
Implement the `sample_gmm` function below which samples $n$ profiles from a Gaussian mixture model, given the prior probabilities, means, and standard deviations for each cluster.  To pass the tests, you will need to simulate each profile one by one.  You should use the `sample_categorical` function below to sample the cluster assignment for each profile (even in the case of uniform cluster probabilities!).  We will have this function return both the profiles and the indices of the clusters from which the profiles were generated, so that we can see how the profiles originated.

In [4]:
def sample_gmm(n, means, sds=1, probs=None):
    """Randomly samples profiles from a Gaussian mixture model.
    
    Args:
        n: the number of profiles to sample
        means: a list of tuples giving the mean profile of each cluster
        sds: either a list of numbers (or tuples) giving the standard deviation of each cluster
            or a single number (or tuple) giving the standard deviation for all clusters
        probs: a list of the prior probabilities of a profile coming from each cluster
            If None, a uniform distribution will be used.
    Returns:
        A tuple of the form (profiles, cluster_assignments) where 
        profiles is a list of the sampled profiles (each profile represented as a tuple) and
        cluster_assignments is a list of indices of the clusters from which the profiles originated.
    """
    ### BEGIN SOLUTION
    cluster_sds = sds if isinstance(sds, list) else [sds] * len(means)
    cluster_probs = probs if probs is not None else [1 / len(means)] * len(means)
    profiles = []
    assignments = []
    for i in range(n):
        index = sample_categorical(cluster_probs)
        assignments.append(index)
        profiles.append(sample_profile(means[index], cluster_sds[index]))
    return profiles, assignments
    ### END SOLUTION
    
def sample_categorical(distribution):
    """Randomly sample from a categorical distribution (a discrete distribution over K categories).
    
    Args:
        distribution: a list of probabilities representing a discrete distribution over K categories.
    Returns:
        The index of the category sampled.
    """
    r = random.random()
    for i, prob in enumerate(distribution):
        if r < prob:
            return i
        else:
            r -= prob
    # in case we encounter floating point issues return the last index
    return len(distribution) - 1

In [5]:
# tests for sample_gmm
def round_profiles(profiles, digits=2): return [round_profile(p, digits) for p in profiles]

random.seed(42)
profiles, cluster_indices = sample_gmm(4, [(1, 2), (0, 0), (3, 1)])
assert round_profiles(profiles) == [(0.79, 0.13), (0.87, 0.5), (3.89, 1.54), (1.23, 3.16)]
assert cluster_indices == [1, 0, 2, 0]

random.seed(42)
profiles, cluster_indices = sample_gmm(4, [(3, 1), (0, 0)], sds=[0.1, 10])
assert round_profiles(profiles) == [(7.92, 1.26), (2.99, 0.85), (8.95, 5.44), (3.02, 1.12)]
assert cluster_indices == [1, 0, 1, 0]

random.seed(42)
profiles, cluster_indices = sample_gmm(4, [(1, 2), (0, 0), (3, 1)], probs=[0.3, 0.2, 0.5])
assert round_profiles(profiles) == [(3.79, 1.13), (0.87, 0.5), (3.89, 1.54), (1.23, 3.16)]
assert cluster_indices == [2, 0, 2, 0]
print("SUCCESS: sample_gmm passed all tests!")

SUCCESS: sample_gmm passed all tests!
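
The `sample_categorical` function above walks through the probabilities, subtracting each one from a uniform random number until the number falls inside a category's interval. As a rough illustration, the same idea can be expressed as a search over cumulative sums. The sketch below is not part of the course code: the name `sample_categorical_cdf` is hypothetical, and because of floating-point rounding it is not guaranteed to return the same index as `sample_categorical` in every edge case. It also checks that the empirical frequencies land near the requested probabilities.

```python
import bisect
import itertools
import random

def sample_categorical_cdf(distribution):
    """Hypothetical variant: pick a category by searching the cumulative sums."""
    cumulative = list(itertools.accumulate(distribution))
    r = random.random()
    # bisect_right counts the cumulative sums that are <= r,
    # which is the index of the first interval containing r
    index = bisect.bisect_right(cumulative, r)
    # guard against round-off making the last cumulative sum slightly below 1
    return min(index, len(distribution) - 1)

random.seed(0)
probs = [0.3, 0.2, 0.5]
n = 20000
counts = [0] * len(probs)
for _ in range(n):
    counts[sample_categorical_cdf(probs)] += 1
frequencies = [count / n for count in counts]
print(frequencies)
```

With this many draws, each printed frequency should be close to the corresponding entry of `probs`.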


## Plotting GMM simulated data

Let's now generate some data from the Gaussian mixture model and plot it.  Several plotting functions are made available in the `clusterplot` module included with this activity.  In particular, we will use the `plot_profiles_interact_hidden` function to show the profiles with and without the (hidden) cluster information.  Note the checkbox at the top which allows you to toggle between showing and hiding the hidden cluster information.

### Equal variances and uniform probabilities

In [6]:
n = 400
means = [(1, 3), (0, 0), (3, 1)]
profiles, cluster_indices = sample_gmm(n, means)
clusterplot.plot_profiles_interact_hidden(profiles, cluster_indices, means)

### Other simulations

Simulate some other data sets with different parameters and visualize them.  For example, simulate data sets with:
* Different variances for each cluster
* Non-uniform cluster probabilities

In [7]:
### BEGIN SOLUTION TEMPLATE=your simulations and plots here
### END SOLUTION

## Hierarchical clustering

### Constructing ETE trees

In the second half of this notebook we will build tree structures that represent hierarchical clusterings.  The data structure that we will use for this is implemented by the `TreeNode` class contained within the [toytree](https://toytree.readthedocs.io/) module.  A reference for all functionality of this class can be found in the [documentation for the ETE Toolkit Master Tree class](http://etetoolkit.org/docs/latest/reference/reference_tree.html).

Here is an example of constructing a tree using this class:

In [8]:
leaf_a = TreeNode(name="a")
leaf_b = TreeNode(name="b")
leaf_c = TreeNode(name="c")

ancestor1 = TreeNode()
ancestor1.add_child(leaf_a, dist=1.5)
ancestor1.add_child(leaf_b, dist=0.5)

ancestor2 = TreeNode()
ancestor2.add_child(ancestor1, dist=0.5)
ancestor2.add_child(leaf_c, dist=2.0)

print(ancestor2)


      /-a
   /-|
--|   \-b
  |
   \-c


To convert a `TreeNode` object into a toytree tree, which we may want to do for visualization, we can write the `TreeNode` object out as a Newick-formatted string (use `format=1`) and construct a toytree tree from that string:

In [9]:
ancestor2_toytree = toytree.tree(ancestor2.write(format=1))
ancestor2_toytree.draw()

(<toyplot.canvas.Canvas at 0x7f0164146978>,
 <toyplot.coordinates.Cartesian at 0x7f0164146940>)
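
For readers unfamiliar with the Newick format used in the conversion above, the following sketch builds a Newick string by hand for the same three-leaf example tree. It does not use toytree at all: the nested `(name, dist, children)` tuples and the `to_newick` helper are assumptions made purely for illustration, and the number formatting may differ from what `TreeNode.write` produces.

```python
def to_newick(node):
    """Serialize a (name, dist, children) tuple as a Newick subtree (illustrative helper)."""
    name, dist, children = node
    label = name
    if children:
        # internal node: parenthesized, comma-separated children followed by the node name
        label = "(" + ",".join(to_newick(child) for child in children) + ")" + name
    # the root has no parent branch, so its dist is None and no length is written
    return label if dist is None else f"{label}:{dist:g}"

leaf_a = ("a", 1.5, [])
leaf_b = ("b", 0.5, [])
leaf_c = ("c", 2.0, [])
ancestor1 = ("", 0.5, [leaf_a, leaf_b])
ancestor2 = ("", None, [ancestor1, leaf_c])

newick = to_newick(ancestor2) + ";"
print(newick)  # ((a:1.5,b:0.5):0.5,c:2);
```

Each pair of parentheses groups the children of one internal node, and the number after each colon is the length of the branch leading to that node.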

## PROBLEM 3: Top-down hierarchical clustering (1 POINT)

Implement a function `cluster_top_down` that *recursively* computes a top-down hierarchical clustering of a set of profiles.  This function calls the `cluster_kmeans` function that we developed in the previous activities to split a set of profiles into two subsets.  The distance from a parent node to each of its two child nodes is defined as half of the Euclidean distance between the centers of the two K-means clusters that correspond to the children.

To pass the tests, you will need to follow these conventions:
* you should not modify the call to the `cluster_kmeans` function provided in the template code below
* the first recursive call to `cluster_top_down` should be on the first cluster (cluster index = 0) from k-means
* when constructing lists of subsets of profiles (for recursive calls) you should keep the profiles in the same order as they were given in the input.

You will likely find the following functions from the `kmeans` module of use:
* `group_by_cluster_assignment`
* `euclidean_distance`
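
As a small worked example of the branch-length rule described above, the sketch below computes the parent-to-child distance for two cluster centers. The helper name `branch_length` is hypothetical and it uses only the standard library rather than the `kmeans` module.

```python
import math

def branch_length(center_a, center_b):
    """Half of the Euclidean distance between two cluster centers (illustrative helper)."""
    squared = sum((a - b) ** 2 for a, b in zip(center_a, center_b))
    return math.sqrt(squared) / 2

# centers (0, 0) and (4, 3) are 5 apart, so each child hangs 2.5 below the parent
length = branch_length((0, 0), (4, 3))
print(length)  # 2.5
```

When only two profiles are clustered with k = 2, the two centers are the profiles themselves, so two profiles at `(0, 0)` and `(4, 3)` would each receive a branch of length 2.5.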

### IMPORTANT NOTEBOOK NOTE
To make the `kmeans` module functional for this problem, you will need to paste your solutions to the `closest_center` and `mean_profile` functions below.

In [10]:
# we will import the squared_euclidean_distance
# and euclidean_distance functions here in case your functions reference them
from kmeans import squared_euclidean_distance, euclidean_distance

### BEGIN SOLUTION TEMPLATE=your closest_center and mean_profile functions

def closest_center(profile, centers):
    """Returns the index of the cluster center that is closest to profile.
    
    If multiple centers are equally close, the smallest index is returned.
    Args:
        profile: a tuple representing the query profile
        centers: a list of tuples representing the centers of each cluster.
    Returns:
        The index of the center that is closest (in Euclidean distance) to the query profile."""
    distances = [squared_euclidean_distance(profile, center) for center in centers]
    return distances.index(min(distances))

def mean_profile(profiles):
    """Computes the center (mean) of a cluster of the given profiles.
    
    Args:
        profiles: a list of profiles (tuples)
    Returns:
        a tuple representing the mean of the profiles."""
    n = len(profiles)
    return tuple(s / n for s in map(sum, zip(*profiles)))
### END SOLUTION

# we will plug these functions into the kmeans module via the assignments below
kmeans.closest_center = closest_center
kmeans.mean_profile = mean_profile

In [11]:
def cluster_top_down(profiles, profile_names):
    """Performs a top-down hierarchical clustering of a list of profiles, returning
    a tree that has the given profile names labeling the leaves.
    
    Args:
        profiles: a list of profiles/points (each of which is represented as a tuple)
        profile_names: a list of the same length as profiles giving the names of the profiles
    Returns:
        A TreeNode instance representing the root of the hierarchical clustering tree.
    """
    if len(profiles) == 1:
        return TreeNode(name=profile_names[0])
    else:
        cluster_assignments, centers = kmeans.cluster_kmeans(profiles, k=2, num_runs=10)
        ### BEGIN SOLUTION
        dist = kmeans.euclidean_distance(*centers)
        clusters = kmeans.group_by_cluster_assignment(profiles, cluster_assignments, 2)
        cluster_names = kmeans.group_by_cluster_assignment(profile_names, cluster_assignments, 2)
        node = TreeNode()
        node.add_child(cluster_top_down(clusters[0], cluster_names[0]), dist=dist/2)
        node.add_child(cluster_top_down(clusters[1], cluster_names[1]), dist=dist/2)
        return node
        ### END SOLUTION

In [12]:
# tests for cluster_top_down
test1_profiles = [(0, 0), (4, 3)]
test1_names = ["A", "B"]
random.seed(1)
test1_tree = cluster_top_down(test1_profiles, test1_names)
assert test1_tree.write(format=1) == "(A:2.5,B:2.5);"

test2_profiles = [(0, 0), (2, 2), (4, 5)]
test2_names = ["A", "B", "C"]
random.seed(1)
test2_tree = cluster_top_down(test2_profiles, test2_names)
assert test2_tree.write(format=1) == "((A:1.41421,B:1.41421):2.5,C:2.5);"

test3_profiles = [(0, 0), (0, 1), (2, 2), (4, 5), (5, 5)]
test3_names = ["A", "B", "C", "D", "E"]
random.seed(1)
test3_tree = cluster_top_down(test3_profiles, test3_names)
assert test3_tree.write(format=1) == "((E:0.5,D:0.5):2.77013,(C:1.25,(B:0.5,A:0.5):1.25):2.77013);"

test4_profiles = [(5, 5), (0, 1), (0, 0), (4, 5), (2, 2)]
test4_names = ["A", "B", "C", "D", "E"]
random.seed(1)
test4_tree = cluster_top_down(test4_profiles, test4_names)
assert test4_tree.write(format=1) == "(((C:0.5,B:0.5):1.25,E:1.25):2.77013,(A:0.5,D:0.5):2.77013);"

test5_profiles = [(0, 0, 0, 0, 0, 0), (0, 1, 2, 3, 4, 5), (5, 4, 3, 2, 1, 0)]
test5_names = ["A", "B", "C"]
random.seed(1)
test5_tree = cluster_top_down(test5_profiles, test5_names)
assert test5_tree.write(format=1) == "((A:3.7081,B:3.7081):3.49106,C:3.49106);"
print("SUCCESS: cluster_top_down passed all tests")

SUCCESS: cluster_top_down passed all tests


## PROBLEM 4: Clustering of gene expression data from various human cell types (1 POINT)

In this problem, you are to use your `cluster_top_down` function from above to cluster a set of real gene expression data from human samples.  The provided data set is a set of expression measurements taken from 95 different human cell types using RNA-seq technology.  For space and time considerations, expression values are given for only the 1000 most variable genes.  The expression value for gene $i$ in sample $j$ is given as $\log_{10}(cpm_{ij} + 1)$, where $cpm_{ij}$ is the RNA-seq measurement in units of "counts per million" (CPM).
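
To make the transformation concrete, here is a minimal sketch of the log-scaling applied to a single CPM value. The function name `log_cpm` is hypothetical; the provided data file already contains transformed values, so this step is not needed for the problem itself.

```python
import math

def log_cpm(cpm):
    """Apply the log10(cpm + 1) transformation to one CPM measurement (illustrative helper)."""
    return math.log10(cpm + 1)

zero_value = log_cpm(0)    # an unexpressed gene maps to 0
high_value = log_cpm(99)   # 99 CPM maps to log10(100)
print(zero_value, high_value)  # 0.0 2.0
```

Adding 1 before taking the logarithm keeps genes with zero counts defined, since $\log_{10}(0)$ is undefined.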

In [13]:
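def read_gene_expression_profiles(filename):
    """Reads a tab-delimited expression matrix, returning (profiles, sample_names)."""
    # use a context manager so that the file is closed after reading
    with open(filename) as f:
        rows = [line.rstrip().split("\t") for line in f]
    sample_names = rows[0]
    # transpose the remaining rows so that each profile holds one sample's values
    columns = zip(*rows[1:])
    profiles = [tuple(map(float, column)) for column in columns]
    return profiles, sample_names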

expression_profiles, sample_names = read_gene_expression_profiles("cell_type_expression.txt")

Cluster these expression profiles with `cluster_top_down` and then visualize the resulting tree.  One of the samples is labeled "***UNKNOWN***".  Based on how that sample clusters with the others, what is the most likely cell type for this sample?  Submit your answer by assigning a string to the variable `unknown_cell_type_prediction`.  Your answer should be one of "B cell", "T cell", "epithelial cell", or "macrophage".

In [14]:
### BEGIN SOLUTION TEMPLATE=unknown_cell_type_prediction=?
gene_expression_tree = cluster_top_down(expression_profiles, sample_names)
toy_gene_expression_tree = toytree.tree(gene_expression_tree.write(format=1))
toy_gene_expression_tree.draw()
unknown_cell_type_prediction = "T cell"
# the cell type was actually "effector memory CD8-positive, alpha-beta T cell, terminally differentiated"
### END SOLUTION

In [15]:
# tests for unknown_cell_type_prediction
assert isinstance(unknown_cell_type_prediction, str)
assert unknown_cell_type_prediction in ("B cell", "T cell", "epithelial cell", "macrophage")
### BEGIN HIDDEN TESTS
assert unknown_cell_type_prediction == "T cell"
### END HIDDEN TESTS