## Unsupervised Learning & Clustering Using K-Means

- *unsupervised learning*: ML where models are trained on an unlabeled dataset and left to find patterns in the data without supervision
    - goal -> find underlying structure of dataset -> group data according to similarities and represent dataset in compressed format
    - *clustering*: grouping objects into clusters such that objects w/ the most similarities stay in the same group and have few/no similarities w/ objects in other groups; finds commonalities btwn data objects and categorizes them
    - *association*: finding relationships btwn variables in a large database; determines sets of items that occur together in the dataset; ideal for marketing (e.g., people who buy item X also tend to buy item Y), as in Market Basket Analysis

### Clustering
- discover underlying structure of data; *unsupervised task = not predicting anything specific*
![mlclustering.png](attachment:mlclustering.png)

**Centroid-based clustering**: organizes data into non-hierarchical clusters
- *k-means*: most common type
- sensitive to initial conditions and outliers
![centroidclustering.png](attachment:centroidclustering.png)

**Density-based clustering**: connects areas of high example density into clusters
- allows for arbitrary-shaped distributions as long as dense areas can be connected
- has difficulty w/ data of varying densities and high dimensions
- does not assign outliers to clusters
![DBSCAN-density-data.svg.png](attachment:DBSCAN-density-data.svg.png)

**Distribution-based clustering**: assumes data is composed of distributions (e.g., Gaussian); as distance from the distribution's center *increases*, probability that a point belongs to the distribution *decreases*
- bands show decrease in probability
- avoid this approach if the type of distribution in the data is unknown
![distributionbased.jpg](attachment:distributionbased.jpg)
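A minimal sketch of the "probability decreases w/ distance" idea, assuming a 1-D Gaussian and using its density as a stand-in for "how likely a point belongs to the distribution". The function name `gaussian_density` and the chosen distances are illustrative, not from the notes.

```python
import math

def gaussian_density(x, mean=0.0, std=1.0):
    """Density of a 1-D Gaussian at x (hypothetical helper for illustration)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

# Evaluate the density at increasing distances from the center (in std units).
distances = [0.0, 1.0, 2.0, 3.0]
densities = [gaussian_density(d) for d in distances]
for d, p in zip(distances, densities):
    print(f"distance {d:.0f} std -> density {p:.4f}")
```

Each step away from the center gives a smaller density, which is what the bands in the figure represent.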

**Hierarchical clustering**: works best on data that is hierarchical; builds clusters in a tree-like structure
![hierarchical-clustering-in-machine-learning12.png](attachment:hierarchical-clustering-in-machine-learning12.png)

### Clustering Using K-Means
**1.** select num of clusters you want to identify (k-value)
![image.png](attachment:image.png)

**2.** *randomly* select k distinct centroids (k = 3 in this example)
![image-2.png](attachment:image-2.png)

**3.** measure distance btwn the first point and each of the three centroids
![image-3.png](attachment:image-3.png)

**4.** assign first point to nearest cluster
![image-4.png](attachment:image-4.png)

**5.** repeat steps 3-4 for every remaining point so that all points are assigned to their nearest cluster
![image-5.png](attachment:image-5.png)

**6.** calculate mean of each cluster
![image-6.png](attachment:image-6.png)

**7.** repeat the measuring and clustering using the cluster means as the new centroids; keep repeating until the cluster assignments no longer change
![image-7.png](attachment:image-7.png)
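Steps 2-7 can be sketched w/ NumPy as below. This is a minimal illustration, not a reference implementation: the function name `kmeans`, the toy data, and the choice to use randomly picked data points as the initial centroids are all assumptions.

```python
import numpy as np

def kmeans(points, k, n_iters=100, seed=0):
    """One k-means run (steps 2-7). Returns (centroids, labels)."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 2: use k distinct data points as the initial centroids (an assumption).
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    labels = None
    for _ in range(n_iters):
        # Steps 3-5: distance from every point to every centroid, then pick the nearest.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 7: stop once no point changes cluster.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 6: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # an empty cluster keeps its old centroid
                centroids[j] = members.mean(axis=0)
    return centroids, labels

# Toy data (made up for illustration): three blobs of 30 points each.
data_rng = np.random.default_rng(42)
blob_centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
points = np.vstack([c + data_rng.normal(scale=0.5, size=(30, 2)) for c in blob_centers])

centroids, labels = kmeans(points, k=3)
print(centroids.round(2))
print(np.bincount(labels, minlength=3))  # number of points per cluster
```

A single run can settle on a poor grouping if the initial centroids are unlucky, which is why the result depends on the starting points.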

**8.** assess quality of clustering by adding up the variation within each cluster (lower total = tighter clusters)
![image-8.png](attachment:image-8.png)
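One common way to compute this total is the sum of squared distances from each point to its cluster's centroid (scikit-learn exposes the same quantity as `inertia_`). The sketch below assumes that definition; dividing each cluster's sum by its size would give a per-cluster variance instead. The function name and the four sample points are made up for illustration.

```python
import numpy as np

def total_within_cluster_variation(points, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    diffs = points - centroids[labels]  # centroids[labels] pairs each point w/ its centroid
    return float((diffs ** 2).sum())

# Two small clusters w/ known centroids.
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 4.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[0.0, 1.0], [10.0, 2.0]])

total = total_within_cluster_variation(points, labels, centroids)
print(total)  # cluster 0 contributes 1 + 1, cluster 1 contributes 4 + 4 -> 10.0
```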

**9.** repeat steps 2-8 w/ a new random set of 3 initial centroids each time; keep the attempt w/ the smallest total variation
![image-9.png](attachment:image-9.png)
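A sketch of the restart idea, assuming NumPy and the sum-of-squared-distances definition of total variation. The helper names (`nearest`, `kmeans_once`, `best_of_restarts`), the number of restarts, and the toy data are illustrative choices, not part of the notes.

```python
import numpy as np

def nearest(points, centroids):
    """Index of the nearest centroid for every point."""
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def kmeans_once(points, k, rng, n_iters=100):
    """One k-means attempt from random initial centroids (steps 2-8)."""
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        labels = nearest(points, centroids)
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):  # centroids stopped moving
            break
        centroids = new_centroids
    labels = nearest(points, centroids)
    variation = float(((points - centroids[labels]) ** 2).sum())  # step 8
    return centroids, labels, variation

def best_of_restarts(points, k, n_restarts=10, seed=0):
    """Step 9: run several attempts and keep the one w/ the smallest total variation."""
    rng = np.random.default_rng(seed)
    attempts = [kmeans_once(points, k, rng) for _ in range(n_restarts)]
    return min(attempts, key=lambda attempt: attempt[2])

# Toy data (made up for illustration): three blobs of 30 points each.
data_rng = np.random.default_rng(42)
blob_centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
points = np.vstack([c + data_rng.normal(scale=0.5, size=(30, 2)) for c in blob_centers])

centroids, labels, variation = best_of_restarts(points, k=3)
print(round(variation, 2))
```

More restarts cost more compute but lower the chance that every attempt starts from an unlucky set of centroids.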

**10.** how can we find the best k-value? start w/ k=1 and increase k -> *plot the reduction in variation as k increases (elbow plot)*
- best k is at the elbow of the plot -> the point after which adding more clusters reduces the variation only slightly
![image-10.png](attachment:image-10.png)
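The values behind an elbow plot can be computed as sketched below, assuming NumPy, the sum-of-squared-distances definition of total variation, and a best-of-several-restarts run for each k. The function names, the range of k, and the toy data are illustrative assumptions.

```python
import numpy as np

def kmeans_variation(points, k, rng, n_iters=100):
    """Total variation (sum of squared distances to nearest centroid) of one k-means run."""
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):  # centroids stopped moving
            break
        centroids = new_centroids
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return float((dists.min(axis=1) ** 2).sum())

def elbow_curve(points, k_values, n_restarts=10, seed=0):
    """Smallest total variation found for each k over several random restarts."""
    rng = np.random.default_rng(seed)
    return [min(kmeans_variation(points, k, rng) for _ in range(n_restarts))
            for k in k_values]

# Toy data (made up for illustration): three blobs of 30 points each.
data_rng = np.random.default_rng(42)
blob_centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
points = np.vstack([c + data_rng.normal(scale=0.5, size=(30, 2)) for c in blob_centers])

k_values = list(range(1, 7))
variations = elbow_curve(points, k_values)
for k, v in zip(k_values, variations):
    print(f"k={k}: total variation = {v:.1f}")
```

Plotting `variations` against `k_values` (e.g., w/ matplotlib's `plt.plot(k_values, variations, marker="o")`) gives the elbow plot. For data like these three well-separated blobs, the drop should flatten after k = 3.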