1. How would you define clustering? Can you name a few clustering algorithms?

Clustering is an unsupervised learning task in which the goal is to group instances so that those in the same group (cluster) are more similar to each other than to instances in other groups.

Some clustering algorithms: k-means, DBSCAN, Gaussian mixture models, BIRCH, mean-shift, affinity propagation, and spectral clustering.

2. What are some of the main applications of clustering algorithms?

It is useful for:
- segmentation 
 - images
 - customers
- data analysis
 - run data through a clustering algorithm and try to understand why the clusters were chosen
- anomaly detection
 - data points that are not near cluster centers are likely to be outliers
  - can use distance measures or density estimation
- dimensionality reduction
- preprocessing
 - can make data more linearly separable
- semi-supervised learning
 - propagate labels from representative data points to points within the same cluster
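
As a rough illustration of the anomaly-detection and dimensionality-reduction bullets above, the sketch below uses scikit-learn's `KMeans` on synthetic data (the two blobs and the planted outlier are assumptions made for the example). `transform` re-expresses each point by its distances to the k centroids, and the point farthest from its nearest centroid is flagged as the most likely outlier.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data (an assumption for illustration): two tight blobs
# plus one planted outlier halfway between them.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.5, size=(100, 2)),
    rng.normal(loc=[10.0, 10.0], scale=0.5, size=(100, 2)),
    [[5.0, 5.0]],
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# transform() maps each instance to its distances from the k centroids,
# giving a k-dimensional representation of the data.
X_reduced = kmeans.transform(X)

# Distance to the nearest centroid works as a simple outlier score.
dist_to_nearest = X_reduced.min(axis=1)
outlier_idx = int(dist_to_nearest.argmax())
print("most anomalous instance:", outlier_idx, X[outlier_idx])
```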

3. Describe two techniques to select the right number of clusters when using K-Means.

### find the inflection point in a graph of inertia

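inertia: sum of squared distances from each point to its closest cluster centroid (this is what scikit-learn reports as `inertia_`; dividing by the number of points gives the mean squared distance)

This is simple but imprecise: plot inertia against k and pick the k at the "elbow", the point beyond which adding more clusters no longer reduces inertia substantially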
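
A minimal sketch of the elbow approach, assuming scikit-learn's `KMeans` and a synthetic dataset of three well-separated blobs (both are choices made for this example). It records `inertia_`, which scikit-learn defines as the sum of squared distances to the closest centroid, for several values of k.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data (an assumption for illustration): three 2-D blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(100, 2)) for c in centers])

# Fit k-means for a range of k and record the inertia of each fit.
ks = list(range(1, 8))
inertias = [
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    for k in ks
]

for k, inertia in zip(ks, inertias):
    print(f"k={k}: inertia={inertia:.1f}")
```

On data like this the inertia drops sharply up to k=3 and only slowly afterwards, so the elbow sits at k=3. On real data the elbow is often much less obvious.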


### silhouette diagram
Use the diagram to choose a k for which the clusters are of similar size and most instances in every cluster have a silhouette coefficient greater than the overall silhouette score

silhouette coefficient:

$$
\frac{b - a}{\max(a, b)}
$$

where

a: mean distance to the other instances in the same cluster

b: mean distance to the instances of the nearest other cluster (the cluster, excluding the instance's own, for which this mean distance is smallest)

silhouette score: mean of the silhouette coefficient across all instances
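
The definitions above can be checked on a tiny example. The following is a sketch in plain NumPy, not an optimized implementation; the function name `silhouette_coefficients` and the four-point dataset are made up for illustration, and it assumes Euclidean distance and at least two clusters.

```python
import numpy as np

def silhouette_coefficients(X, labels):
    """Per-instance silhouette coefficient (b - a) / max(a, b)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all instances.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    coeffs = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False  # exclude the instance itself
        if not same.any():
            continue  # singleton cluster: leave the coefficient at 0
        # a: mean distance to the other instances in the same cluster.
        a = dists[i, same].mean()
        # b: smallest mean distance to the instances of any other cluster.
        b = min(
            dists[i, labels == other].mean()
            for other in np.unique(labels)
            if other != labels[i]
        )
        coeffs[i] = (b - a) / max(a, b)
    return coeffs

# Two tight, well-separated pairs of points.
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
labels = np.array([0, 0, 1, 1])

coeffs = silhouette_coefficients(X, labels)
score = coeffs.mean()  # silhouette score: mean over all instances
print(coeffs, score)
```

For each point here, a = 1 and b is roughly 10, so every coefficient is close to 0.9, which indicates compact, well-separated clusters.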

4. What is label propagation? Why would you implement it, and how?

Label propagation is useful when you have a few labeled instances and many unlabeled ones. Labeling data is time-consuming and costly, and supervised models generally perform and generalize better with more labeled data, so an automated way to extend existing labels to unlabeled instances can improve supervised learning performance cheaply.

If we assume that we start with a completely unlabeled dataset, label propagation can be performed as follows:
- run the data through a clustering algorithm
- find "representative examples": the point nearest the centroid of each cluster (or the few points nearest each centroid)
- manually label the representative examples
- propagate the label of each representative example to the other instances in its cluster
 - a distance threshold can improve the quality of the propagated labels, by propagating only to instances close to the centroid. I imagine that probability density could also be a useful measure here
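
The steps above can be sketched as follows, assuming scikit-learn's `KMeans` and synthetic blobs. The array `y_true` stands in for a human annotator and is consulted only for the k representative instances; the 50th-percentile cutoff is an arbitrary choice for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data (an assumption for illustration): three 2-D blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(100, 2)) for c in centers])
y_true = np.repeat([0, 1, 2], 100)  # hidden labels, simulating the annotator

k = 3
kmeans = KMeans(n_clusters=k, n_init=10, random_state=0)
dist_to_centroids = kmeans.fit_transform(X)  # shape (n_samples, k)

# Representative instance of each cluster: the point closest to its centroid.
representative_idx = dist_to_centroids.argmin(axis=0)

# "Manually" label only the k representatives.
representative_labels = y_true[representative_idx]

# Propagate each representative's label to every instance in its cluster.
y_propagated = representative_labels[kmeans.labels_]

# Optional distance threshold: keep only the half of each cluster
# that lies closest to its centroid.
dist_to_own = dist_to_centroids[np.arange(len(X)), kmeans.labels_]
keep = np.zeros(len(X), dtype=bool)
for c in range(k):
    in_cluster = kmeans.labels_ == c
    cutoff = np.percentile(dist_to_own[in_cluster], 50)
    keep |= in_cluster & (dist_to_own <= cutoff)

X_partial, y_partial = X[keep], y_propagated[keep]
print("propagated label accuracy:", (y_propagated == y_true).mean())
```

`X_partial` and `y_partial` could then be used to train a supervised model, at the labeling cost of only k instances.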

5. Can you name two clustering algorithms that can scale to large datasets? And two that look for regions of high density?

Algorithms that can scale to large datasets:
- k-means
- agglomerative clustering (only when given a connectivity matrix; without one it does not scale well)
- BIRCH if n_features < 20

Algorithms that look for regions of high density:
- DBSCAN
- mean-shift
- Gaussian mixture models

6. Can you think of a use case where active learning would be useful? How would you implement it?

Active learning is useful whenever unlabeled data is much more plentiful than labels, and especially when labeling is very costly.

To implement active learning:
- start with a small proportion of labeled instances
- train a model that outputs some form of probability or confidence score per instance
- predict the score for each unlabeled instance
- ask a human to label the instances the model is least certain about
- retrain the model with the new labels, and repeat until the performance gain no longer justifies the labeling cost
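
One possible shape for this loop is sketched below using uncertainty sampling with scikit-learn's `LogisticRegression`. The synthetic data, the `y_oracle` array standing in for the human annotator, and the fixed number of rounds (instead of a diminishing-returns check) are all simplifying assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic two-class data (an assumption for illustration).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=-1.5, scale=1.0, size=(200, 2)),
    rng.normal(loc=1.5, scale=1.0, size=(200, 2)),
])
y_oracle = np.repeat([0, 1], 200)  # stands in for the human annotator

# Start with a small labeled pool containing both classes.
labeled = np.zeros(len(X), dtype=bool)
labeled[[0, 1, 2, 3, 4, 200, 201, 202, 203, 204]] = True

n_rounds, batch_size = 5, 10
for _ in range(n_rounds):
    model = LogisticRegression().fit(X[labeled], y_oracle[labeled])
    unlabeled_idx = np.flatnonzero(~labeled)
    proba = model.predict_proba(X[unlabeled_idx])
    # The lower the top class probability, the less certain the model is.
    uncertainty = 1.0 - proba.max(axis=1)
    # Query the annotator for the least certain instances.
    query = unlabeled_idx[np.argsort(uncertainty)[-batch_size:]]
    labeled[query] = True  # their labels are read from y_oracle

# Final model trained on everything labeled so far.
model = LogisticRegression().fit(X[labeled], y_oracle[labeled])
accuracy = model.score(X, y_oracle)
print(f"labeled instances: {labeled.sum()}, accuracy: {accuracy:.3f}")
```

In practice the loop would track a validation metric after each round and stop once the improvement flattens out.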

8. What is a Gaussian mixture? What tasks can you use it for?

A Gaussian mixture model (GMM) is a probabilistic model, fit without labels, that assumes the data was generated from a mixture of k Gaussian distributions. Each distribution has its own mean and covariance, plus a weight $\phi_i$ giving the probability that a point was generated by that distribution; the weights sum to 1.

In a GMM, only the data $X$ is observed; $z$ is an unobserved latent variable that selects which of the k distributions generated each point. Fitting the model means estimating the parameters without observing $z$, which is done with an iterative procedure called Expectation-Maximization (EM). EM alternates two steps until convergence. The expectation step computes each cluster's responsibility for each point, that is, the posterior probability of the latent "selector" variable under the current parameters. The maximization step updates the weights, means, and covariances to maximize the expected log-likelihood given those responsibilities.
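
To make the two EM steps concrete, here is a small NumPy sketch for a one-dimensional mixture. The function name `em_gmm_1d`, the quantile-based initialization, and the fixed iteration count are assumptions made for this example; library implementations such as scikit-learn's `GaussianMixture` add convergence checks, regularization, and better initialization.

```python
import numpy as np

def em_gmm_1d(x, k, n_iter=100):
    """Fit a 1-D Gaussian mixture with plain EM (illustrative sketch)."""
    n = len(x)
    weights = np.full(k, 1.0 / k)
    # Simple deterministic initialization: spread the means over quantiles.
    means = np.quantile(x, (np.arange(k) + 0.5) / k)
    variances = np.full(k, x.var())
    for _ in range(n_iter):
        # E-step: responsibility resp[i, j] = p(z_i = j | x_i, parameters).
        dens = np.exp(-0.5 * (x[:, None] - means) ** 2 / variances)
        dens /= np.sqrt(2.0 * np.pi * variances)
        weighted = weights * dens
        resp = weighted / weighted.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from the responsibilities.
        nk = resp.sum(axis=0)
        weights = nk / n
        means = (resp * x[:, None]).sum(axis=0) / nk
        variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / nk
    return weights, means, variances

# Synthetic data: 30% from N(-5, 1) and 70% from N(5, 1).
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-5.0, 1.0, 300), rng.normal(5.0, 1.0, 700)])

weights, means, variances = em_gmm_1d(x, k=2)
print("weights:", weights, "means:", means, "variances:", variances)
```

With components this well separated, the recovered weights and means should land close to the generating values.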

Bayesian GMMs place priors on the parameters. Notationally, the parameters then become random variables and are treated as part of the set of latent variables $Z$, and the model is fit with a method called Variational Inference. Variational Inference is closely related to EM, but can be used when the posterior distribution $p(Z | X)$ is intractable. It makes the additional assumption that the posterior can be approximated by a variational distribution $q(Z)$ that factorizes over a partition of the latent variables: $q(Z) = \prod_{i=1}^{M} q_i(Z_i)$. Each factor $q_i$ is optimized in turn while the other factors $q_j$, $j \neq i$, are held fixed. This is the mean-field approach to Variational Inference.
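
For reference, two standard results for the mean-field setup described above, stated here without derivation. The log evidence decomposes into a lower bound $\mathcal{L}(q)$ plus a KL divergence, and the optimal form of each factor, with the others held fixed, is an expectation of the log joint distribution:

$$
\ln p(X) = \mathcal{L}(q) + \mathrm{KL}\left(q(Z) \,\|\, p(Z | X)\right), \qquad \mathcal{L}(q) = \mathbb{E}_{q}\left[\ln \frac{p(X, Z)}{q(Z)}\right]
$$

$$
\ln q_j^{*}(Z_j) = \mathbb{E}_{i \neq j}\left[\ln p(X, Z)\right] + \text{const}
$$

Here $\mathbb{E}_{i \neq j}$ denotes the expectation with respect to all factors $q_i$ with $i \neq j$. Since the KL term is non-negative, maximizing $\mathcal{L}(q)$ brings $q(Z)$ closer to the true posterior, and cycling through the second update for each $j$ never decreases $\mathcal{L}(q)$.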

GMMs can be used for clustering, anomaly detection, and density estimation.
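
A brief sketch of all three uses with scikit-learn's `GaussianMixture`, on synthetic data. The 2nd-percentile density threshold for anomalies is an arbitrary choice for the example and would normally be tuned to the expected anomaly rate.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic data (an assumption for illustration): two 2-D blobs.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=1.0, size=(200, 2)),
    rng.normal(loc=[8.0, 8.0], scale=1.0, size=(200, 2)),
])

gm = GaussianMixture(n_components=2, n_init=5, random_state=0).fit(X)

# Clustering: hard assignments and soft assignments (responsibilities).
cluster_labels = gm.predict(X)
responsibilities = gm.predict_proba(X)

# Density estimation: log of the estimated density at each instance.
log_density = gm.score_samples(X)

# Anomaly detection: flag instances in the lowest-density regions.
threshold = np.percentile(log_density, 2)
anomalies = X[log_density < threshold]
print("flagged anomalies:", len(anomalies))
```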

9. Can you name two techniques to find the right number of clusters when using a Gaussian mixture model?

- Choosing the k that minimizes a theoretical information criterion, such as the Bayesian Information Criterion (BIC) or the Akaike Information Criterion (AIC)
- Using a Bayesian GMM, which drives the weights of unnecessary components toward zero during fitting, so the effective k is found as part of the optimization process, assuming the initial `n_components` is larger than the optimal k.
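
Both techniques can be sketched with scikit-learn's `GaussianMixture` and `BayesianGaussianMixture`. The synthetic three-blob data and the 0.01 weight cutoff used to count "active" components are assumptions made for this example.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture, GaussianMixture

# Synthetic data (an assumption for illustration): three 2-D blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(100, 2)) for c in centers])

# Technique 1: fit one GMM per candidate k and keep the k with the lowest BIC
# (use .aic(X) instead of .bic(X) for the Akaike criterion).
bics = {
    k: GaussianMixture(n_components=k, n_init=5, random_state=0).fit(X).bic(X)
    for k in range(1, 7)
}
best_k = min(bics, key=bics.get)
print("k with lowest BIC:", best_k)

# Technique 2: fit a Bayesian GMM with more components than needed and
# count the components that keep a non-negligible weight.
bgm = BayesianGaussianMixture(
    n_components=10, n_init=5, max_iter=500, random_state=0
).fit(X)
active = int((bgm.weights_ > 0.01).sum())
print("component weights:", np.round(bgm.weights_, 3))
print("components with weight > 0.01:", active)
```

How aggressively the Bayesian model prunes components depends on its priors (for example `weight_concentration_prior`), so the count of active components is best read as a guide rather than an exact answer.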