## Unsupervised learning

- Finds patterns in data
- Compresses data using those patterns (dimension reduction)

## Supervised learning vs unsupervised learning

- *Supervised* learning finds patterns for a prediction task
	- E.g.: classify tumors as benign or cancerous (labels)
- *Unsupervised* learning finds patterns in data <u>but without a specific prediction task in mind</u>

## K-means clustering

~~~
from sklearn.cluster import KMeans

model = KMeans(n_clusters=3)

model.fit(samples)

labels = model.predict(samples)
~~~

## Cluster labels for new samples

- New samples can be assigned to existing clusters
- k-means remembers the mean of each cluster (the "centroids")
- Finds the nearest centroid to each new sample

~~~
new_labels = model.predict(new_samples)
~~~
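
A minimal sketch of what `.predict()` does for new samples, assuming toy data (the arrays `samples` and `new_samples` below are made up for illustration): each new sample gets the label of the nearest centroid stored in `cluster_centers_`.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data (an assumption for illustration): two well-separated groups
samples = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
new_samples = np.array([[0.3, 0.3], [4.8, 5.1]])

model = KMeans(n_clusters=2, n_init=10, random_state=0)
model.fit(samples)

# Distance from every new sample to every remembered centroid
centroids = model.cluster_centers_
distances = np.linalg.norm(new_samples[:, None, :] - centroids[None, :, :], axis=2)

# Nearest centroid by hand, compared with what predict() returns
manual_labels = distances.argmin(axis=1)
new_labels = model.predict(new_samples)
print(manual_labels, new_labels)
```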

## Cross tabulation with pandas

- Clusters vs species is a "cross-tabulation"
- Use the *pandas* library
- Given the species of each sample as a list `species`

~~~
import pandas as pd

df = pd.DataFrame({'labels': labels, 'species': species})

ct = pd.crosstab(df['labels'], df['species'])
~~~

## Measuring clustering quality

- Using only samples and their cluster labels
- A good clustering has tight clusters
- ... and samples in each cluster bunched together

## Inertia measures clustering quality

- Measures how spread out the clusters are
- Distance from each sample to centroid of its cluster
- After `.fit()`, available as attribute `inertia_`
- k-means attempts to minimize the inertia when choosing clusters
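
As a rough check of the definition above, inertia can be recomputed by hand as the sum of squared distances from each sample to the centroid of its cluster. The data here is synthetic and only an assumption for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data (an assumption): three blobs around fixed centers
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [6.0, 0.0], [0.0, 6.0]])
samples = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

model = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = model.fit_predict(samples)

# Centroid of the cluster each sample was assigned to
own_centroids = model.cluster_centers_[labels]

# Sum of squared sample-to-centroid distances
manual_inertia = ((samples - own_centroids) ** 2).sum()
print(manual_inertia, model.inertia_)
```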

## The number of clusters

- More clusters means lower inertia
- What is the best number of clusters?

- A good clustering has tight clusters (low inertia)
- ... but **not too many clusters**!
- Choose an "elbow" in the inertia plot:
	- Where inertia begins to decrease more slowly
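
One possible way to look for the elbow, sketched on synthetic data with three blobs (an assumption): fit k-means for a range of cluster counts, collect the inertias, and see where the drop per extra cluster becomes small. Plotting `ks` against `inertias` with matplotlib would give the usual inertia plot.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data (an assumption): three blobs, so the elbow should be near k=3
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [6.0, 0.0], [0.0, 6.0]])
samples = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

ks = list(range(1, 7))
inertias = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    model.fit(samples)
    inertias.append(model.inertia_)

# Inertia gained by adding one more cluster: drops[i] is the gain from ks[i] to ks[i + 1]
drops = [inertias[i] - inertias[i + 1] for i in range(len(inertias) - 1)]
for k, drop in zip(ks[1:], drops):
    print(k, round(drop, 2))
```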

## `StandardScaler`

- In k-means: feature variance = feature influence
- `StandardScaler` transforms each feature to have mean $0$ and variance $1$
- Features are said to be "standardized"

~~~
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

scaler.fit(samples)

samples_scaled = scaler.transform(samples)
~~~
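
A small check of the claim above, using a made-up array whose two features have very different scales (an assumption for illustration): after scaling, each column has mean 0 and variance 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up samples: the second feature has a much larger variance than the first
samples = np.array([[1.0, 100.0],
                    [2.0, 300.0],
                    [3.0, 500.0],
                    [4.0, 700.0]])

scaler = StandardScaler()
samples_scaled = scaler.fit_transform(samples)

# Per-feature mean and variance after standardization
means = samples_scaled.mean(axis=0)
variances = samples_scaled.var(axis=0)
print(means, variances)
```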

## Pipelines combine multiple steps

~~~
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

scaler = StandardScaler()
kmeans = KMeans(n_clusters=3)

from sklearn.pipeline import make_pipeline

pipeline = make_pipeline(scaler, kmeans)

pipeline.fit(samples)
labels = pipeline.predict(samples)
~~~


## Hierarchical clustering

- Clusters are contained in one another

## Agglomerative hierarchical clustering

- Every element begins in a separate cluster
- At each step, the two closest clusters are merged
- Continue until all elements are in a single cluster

## The dendrogram of a hierarchical clustering

- Read from the bottom up
- Vertical lines represent clusters

<img src='./IMAGES/dendrogram.PNG'>

## With SciPy

~~~
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

mergings = linkage(samples, method='complete')

dendrogram(mergings, labels=country_names, leaf_rotation=90, leaf_font_size=6)

plt.show()
~~~

## Intermediate clusterings & height on dendrogram

- An intermediate stage in the hierarchical clustering is specified by choosing a height on the dendrogram
- The y-axis of the dendrogram encodes the distance between merging clusters

## Distance between clusters

- Defined by a "linkage method"
- Specified via the `method` parameter, e.g. `linkage(samples, method='complete')`
	- In 'complete' linkage: distance between clusters is maximum distance between their samples

## Extracting cluster labels

- Use the `fcluster` function
- Returns a NumPy array of cluster labels

~~~
from scipy.cluster.hierarchy import linkage

mergings = linkage(samples, method='complete')

from scipy.cluster.hierarchy import fcluster

labels = fcluster(mergings, 15, criterion='distance')

print(labels)
~~~

## Aligning cluster labels with country names

- Given a list of strings `country_names`:

~~~
import pandas as pd

pairs = pd.DataFrame({'labels': labels, 'countries': country_names})

print(pairs.sort_values('labels'))
~~~

## t-SNE for 2-dimensional maps

- t-SNE: "t-distributed stochastic neighbor embedding"
- Maps samples to 2D space (or 3D)
- Map approximately preserves nearness of samples

~~~
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

model = TSNE(learning_rate=100)

transformed = model.fit_transform(samples)


xs = transformed[:,0]
ys = transformed[:,1]

plt.scatter(xs, ys, c=species)
plt.show()
~~~

- **t-SNE has only `.fit_transform()`** (no separate `.fit()` and `.transform()`)
	- Can't extend the map to include new data samples
- Learning rate
	- Choose learning rate for the dataset
	- Wrong choice: points bunch together
	- Try values between 50 and 200
- The axes do not have any interpretable meaning
	- They are different every time


## Dimension reduction

- More efficient storage and computation
- Remove less-informative 'noise' features which cause problems for prediction tasks

## Principal Component Analysis (PCA)

- First step: decorrelation
- Second step: reduces dimension

## PCA aligns data with axes

- Rotates data samples to be aligned with axes
- Shifts data samples so they have mean 0
- No information is lost

## PCA in scikit-learn

- `.fit()` learns the transformation from given data
- `.transform()` applies the learned transformation
- `.transform()` can also be applied to new data

~~~
from sklearn.decomposition import PCA

model = PCA()

model.fit(samples)

transformed = model.transform(samples)
~~~

## PCA features

- Rows of `transformed` correspond to samples
- Columns of `transformed` are the 'PCA features'
- Row gives PCA feature values of corresponding sample

- *PCA features are not correlated*

## Pearson correlation

- Measures linear correlation of features
- Value between $-1$ and $1$
- Value of $0$ means no linear correlation
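
A quick sketch of the decorrelation step, assuming two synthetic, strongly correlated features: the Pearson correlation is high before PCA and essentially zero between the PCA features, which also have mean 0.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.decomposition import PCA

# Synthetic correlated features (an assumption for illustration)
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2 * x + rng.normal(scale=0.5, size=200)
samples = np.column_stack([x, y])

# Correlation of the original features
corr_before, _ = pearsonr(samples[:, 0], samples[:, 1])

# Correlation of the PCA features
transformed = PCA().fit_transform(samples)
corr_after, _ = pearsonr(transformed[:, 0], transformed[:, 1])

print(corr_before, corr_after)
```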

## Principal components

- Directions of variance
- PCA aligns principal components with the axes
- Available as `.components_` attribute of PCA object

## Intrinsic dimension of a flight

- 2 features: long & lat
- Dataset **appears** to be 2-dimensional
- But can approximate using one feature: displacement along flight path
- Is intrinsically 1-dimensional

<img src='./IMAGES/flight-path.PNG'>

## Intrinsic dimension

- number of features needed to approximate the dataset
- essential idea behind dimension reduction
	- What is the most compact representation of the samples?
- can be detected with PCA
	- **intrinsic dimension = number of PCA features with significant variance**

- Plotting the variances of the PCA features

~~~
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

pca = PCA()
pca.fit(samples)

features = range(pca.n_components_)

plt.bar(features, pca.explained_variance_)

plt.xticks(features)
plt.xlabel('PCA feature')
plt.ylabel('variance')

plt.show()
~~~
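
Instead of reading the bar chart by eye, the intrinsic dimension can be estimated by counting the PCA features whose variance exceeds a cutoff. This is only a sketch: the noisy straight "flight path" is synthetic and the 1% cutoff is an arbitrary assumption, not a rule from the source.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic flight path (an assumption): 2 features lying close to one line
rng = np.random.default_rng(0)
t = np.linspace(0, 10, 100)
samples = np.column_stack([t, 0.5 * t]) + rng.normal(scale=0.01, size=(100, 2))

pca = PCA()
pca.fit(samples)

# Arbitrary cutoff (an assumption): 1% of the largest PCA feature variance
threshold = 0.01 * pca.explained_variance_.max()

# Number of PCA features with "significant" variance
intrinsic_dim = int((pca.explained_variance_ > threshold).sum())
print(pca.explained_variance_, intrinsic_dim)
```
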
***
~~~
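# Imports needed for this example (grains is assumed to be a 2-column NumPy array)
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Make a scatter plot of the untransformed points
plt.scatter(grains[:,0], grains[:,1])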

# Create a PCA instance: model
model = PCA()

# Fit model to points
model.fit(grains)

# Get the mean of the grain samples: mean
mean = model.mean_

# Get the first principal component: first_pc
first_pc = model.components_[0,:]

# Plot first_pc as an arrow, starting at mean
plt.arrow(mean[0], mean[1], first_pc[0], first_pc[1], color='red', width=0.01)

# Keep axes on same scale
plt.axis('equal')
plt.show()
~~~

## Dimension reduction with PCA

- PCA features are in decreasing order of variance
- Assumes the low variance features are 'noise'
- ... and high variance features are informative

- Specify how many features to keep
	- E.g.: `PCA(n_components=2)`
	- Keeps the first 2 PCA features
- Intrinsic dimension is a good choice
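
A minimal sketch of reducing dimension with `n_components`, on random data with 4 features (an assumption for illustration): only the first 2 PCA features are kept, and they come in decreasing order of variance.

```python
import numpy as np
from sklearn.decomposition import PCA

# Random data with 4 features (an assumption for illustration)
rng = np.random.default_rng(0)
samples = rng.normal(size=(50, 4))

# Keep only the first 2 PCA features
pca = PCA(n_components=2)
pca_features = pca.fit_transform(samples)

print(pca_features.shape)        # 50 samples, 2 features
print(pca.explained_variance_)   # variances of the kept features
```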

## Word frequency arrays

- Rows represent documents, columns represent words
- Entries measure presence of each word in each document
- ... measure using 'tf-idf'
- Usually give rise to sparse matrices
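
One way to build such an array is scikit-learn's `TfidfVectorizer`; the three tiny documents below are made up for illustration. The result is a sparse matrix with one row per document and one column per word.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents (an assumption for illustration)
documents = ['cats say meow', 'dogs say woof', 'dogs chase cats']

tfidf = TfidfVectorizer()
csr_mat = tfidf.fit_transform(documents)

# Words in column order, recovered from the fitted vocabulary
words = sorted(tfidf.vocabulary_, key=tfidf.vocabulary_.get)

print(words)
print(csr_mat.toarray().round(2))
```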

## Sparse arrays and `csr_matrix`

- Array is 'sparse': most entries are zero
- Can use `scipy.sparse.csr_matrix` instead of NumPy array
- `csr_matrix` remembers only the non-zero entries (saves space!)
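
A small illustration with a made-up array: converting to `csr_matrix` stores only the non-zero entries, and `.toarray()` recovers the dense form.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Made-up array that is mostly zeros (an assumption for illustration)
dense = np.array([[0.0, 0.0, 0.3, 0.0],
                  [0.0, 0.5, 0.0, 0.0],
                  [0.0, 0.0, 0.0, 0.0]])

sparse = csr_matrix(dense)

# Number of stored (non-zero) entries versus total entries
print(sparse.nnz, dense.size)

# Converting back gives the original array
restored = sparse.toarray()
```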

## `TruncatedSVD` and `csr_matrix`

- scikit-learn PCA doesn't support `csr_matrix`
- Use scikit-learn `TruncatedSVD` instead
- Performs a similar transformation (but does not center the data first)

~~~
from sklearn.decomposition import TruncatedSVD

model = TruncatedSVD(n_components=3)

model.fit(documents)

transformed = model.transform(documents)
~~~


## Non-negative matrix factorization (NMF)

- Dimension reduction technique
- NMF models are *interpretable* (unlike PCA)
	- Easy to interpret means easy to explain!
- However, all sample features must be non-negative ($\geq 0$)

## Interpretable parts

- NMF expresses documents as combinations of topics (or 'themes') and images as combinations of patterns

## NMF in scikit-learn

- Follows `.fit()` / `.transform()` pattern
- <u>Must</u> specify the number of components
	- `NMF(n_components=2)`
- Works with NumPy arrays and with `csr_matrix`

- Example: word-frequency array
	- 4 words, many documents
	- Measure presence of words in each document using 'tf-idf'
		- 'tf': frequency of the word
		- 'idf': reduces influence of frequent words

~~~
from sklearn.decomposition import NMF

model = NMF(n_components=2)

model.fit(samples)

nmf_features = model.transform(samples)
~~~

## NMF components

- Dimension of components = dim. of samples
- Entries are non-negative

## NMF features

- NMF features values are non-negative
- Can be used to reconstruct the samples
- ... combine feature values with components

## Reconstruction of a sample

- Multiply each NMF feature by the corresponding NMF component and add up
- Can also be expressed as a product of matrices
	- This is the "Matrix Factorization" in NMF
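
A sketch of the reconstruction, assuming random non-negative data and arbitrary choices for `init`, `max_iter` and `random_state`: the weighted sum of components for one sample agrees with the corresponding row of the matrix product. The product only approximates the original samples.

```python
import numpy as np
from sklearn.decomposition import NMF

# Random non-negative samples (an assumption for illustration)
rng = np.random.default_rng(0)
samples = rng.random((20, 6))

model = NMF(n_components=2, init='nndsvda', max_iter=500, random_state=0)
nmf_features = model.fit_transform(samples)
components = model.components_

# Reconstruct the first sample: each feature value times its component, added up
manual = sum(nmf_features[0, k] * components[k] for k in range(2))

# The same thing for all samples at once, as a product of matrices
reconstruction = nmf_features @ components
print(manual)
print(reconstruction[0])
```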

## NMF learns interpretable parts

- Word-frequency array `articles` (tf-idf)
	- 20,000 scientific articles (rows)
	- 800 words (columns)

~~~
nmf = NMF(n_components=10)

nmf.fit(articles)
~~~

- NMF components are topics
- For documents:
	- NMF components (rows) represent topics
	- NMF features combine topics into documents
- For images, NMF components are parts of images

## Grayscale image

- no colors, only shades of gray
- measure pixel brightness
- represent with value between $0$ and $1$ ($0$ is black)
- convert to 2D array

- as a flat array: row-by-row, from left to right

## Encoding a collection of images

- Collection of images of the same size
- Encode as 2D array
- Each row corresponds to an image
- Each column corresponds to a pixel
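
The encoding can be sketched with two made-up 2x3 grayscale images: each image is flattened row by row, the flat arrays are stacked into a 2D array (one row per image, one column per pixel), and a row can be reshaped back into an image.

```python
import numpy as np

# Two made-up 2x3 grayscale images, values between 0 (black) and 1 (white)
image_a = np.array([[0.0, 1.0, 0.5],
                    [1.0, 0.0, 1.0]])
image_b = np.array([[1.0, 1.0, 0.0],
                    [0.0, 0.5, 0.0]])

# Flatten row by row, from left to right
flat_a = image_a.flatten()

# Collection of images as a 2D array: rows are images, columns are pixels
images = np.array([image_a.flatten(), image_b.flatten()])

# Recover the first image from its row
restored = images[0].reshape((2, 3))
print(flat_a, images.shape)
```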

## Visualizing samples

~~~
bitmap = sample.reshape((2,3))

plt.imshow(bitmap, cmap='gray', interpolation='nearest')
plt.show()
~~~

## Finding similar articles

- Similar articles should have similar topics
- Strategy:
	- Apply NMF to the word-frequency array
	- NMF feature values describe the topics
	- ... so similar documents have similar NMF feature values

- `articles` is a word frequency array

~~~
from sklearn.decomposition import NMF

nmf = NMF(n_components=6)

nmf_features = nmf.fit_transform(articles)
~~~

- Different versions of the same document have the same topic *proportions*
	- ... exact feature values may be different!
- **But all versions lie on the same line through the origin**!

## Cosine similarity

- Uses the angle between the lines
- Max: $1$, when angle is $0^\circ$
- Higher values mean more similar

~~~
from sklearn.preprocessing import normalize

norm_features = normalize(nmf_features)

current_article = norm_features[23,:]

similarities = norm_features.dot(current_article)

print(similarities)
~~~
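
To see why the angle is what matters, here is a tiny sketch with made-up NMF feature values: a document and a scaled-up version of it lie on the same line through the origin, so their cosine similarity is 1, while a document using a different topic scores 0.

```python
import numpy as np
from sklearn.preprocessing import normalize

# Made-up NMF features (an assumption for illustration)
nmf_features = np.array([[1.0, 2.0, 0.0],   # document A
                         [2.0, 4.0, 0.0],   # longer version of A, same proportions
                         [0.0, 0.0, 3.0]])  # document on a different topic

# Scale each row to unit length, so dot products are cosine similarities
norm_features = normalize(nmf_features)

# Cosine similarity of every document to document A
similarities = norm_features.dot(norm_features[0])
print(similarities)
```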

## DataFrames and labels

- Titles given as a list `titles`

~~~
norm_features = normalize(nmf_features)

df = pd.DataFrame(norm_features, index=titles)

current_article = df.loc['Dog bites man']

similarities = df.dot(current_article)

print(similarities.nlargest())
~~~


***

~~~
# Perform the necessary imports
from sklearn.decomposition import NMF
from sklearn.preprocessing import Normalizer, MaxAbsScaler
from sklearn.pipeline import make_pipeline

# Create a MaxAbsScaler: scaler
scaler = MaxAbsScaler()

# Create an NMF model: nmf
nmf = NMF(n_components=20)

# Create a Normalizer: normalizer
normalizer = Normalizer()

# Create a pipeline: pipeline
pipeline = make_pipeline(scaler, nmf, normalizer)

# Apply fit_transform to artists: norm_features
norm_features = pipeline.fit_transform(artists)


# Import pandas
import pandas as pd

# Create a DataFrame: df
df = pd.DataFrame(norm_features, index=artist_names)

# Select row of 'Bruce Springsteen': artist
artist = df.loc['Bruce Springsteen']

# Compute cosine similarities: similarities
similarities = df.dot(artist)

# Display those with highest cosine similarity
print(similarities.nlargest())
~~~