
**Dimensionality Reduction** means reducing the number of features (columns) while retaining as much information as possible.  
Advantages
- Lower computation and storage cost
- Removes redundant features
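
To make "redundant features" concrete, here is a minimal sketch on made-up data (the column names and values are hypothetical, not taken from the datasets used below). A height column stored in both centimetres and inches adds a third column but no new information, which shows up as a rank of 2 rather than 3.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy features: the third column is just the first in other units.
height_cm = rng.normal(170, 10, size=100)
weight_kg = rng.normal(70, 8, size=100)
height_in = height_cm / 2.54

X = np.column_stack([height_cm, weight_kg, height_in])

# Rank of the centered data = number of independent directions of variation.
rank = np.linalg.matrix_rank(X - X.mean(axis=0))
corr = np.corrcoef(height_cm, height_in)[0, 1]

print("columns:", X.shape[1], "rank:", rank, "correlation:", round(corr, 4))
```

Real data is rarely this clean: features tend to be strongly correlated rather than exact copies, which is the case the techniques below are designed for.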

In [1]:
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE,Isomap
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_lfw_people
from sklearn.metrics import accuracy_score,homogeneity_score, completeness_score
import seaborn as sns
import numpy as np

%matplotlib inline

In [None]:
people = fetch_lfw_people(min_faces_per_person = 20,resize = 0.7)
print(people.DESCR) 

Downloading LFW metadata: https://ndownloader.figshare.com/files/5976012
Downloading LFW metadata: https://ndownloader.figshare.com/files/5976009


In [None]:
img_shape = people.images[0].shape 

In [None]:
people.images.shape

In [None]:
people.target_names[people.target[0]]

In [None]:
fig, axes = plt.subplots(3,5,figsize = (15, 8),subplot_kw=dict(xticks=[],yticks=[]))
for target, image, ax in zip(people.target, people.images, axes.ravel()):
  ax.imshow(image, cmap="gray")
  ax.set_title(people.target_names[target])

In [None]:
counts = np.bincount(people.target)
for i, (name, count) in enumerate(zip(people.target_names,counts)):
  print("{0:25} {1:3}".format(name,count),end = ' ')
  if (i+1)%3 ==0:
    print()

In [None]:
pca = PCA(n_components = 150).fit(people.data)
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel("N components")
plt.ylabel("Cumulative variance")

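Principal Component Analysis
- This technique uses the eigenvalues and eigenvectors of the data's covariance matrix to find the directions (principal components) that account for the most variance, and keeps only the top ones
- Equivalently, the components can be computed with SVD (singular value decomposition) of the centered data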
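
The two routes above can be compared directly. The following is a minimal NumPy sketch on synthetic data (the array `X` and its sizes are assumptions for illustration, not the face images), showing that the covariance eigenvalues match the squared singular values of the centered data divided by `n - 1`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in: 200 samples, 5 features driven by 2 latent factors.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 5))

X_centered = X - X.mean(axis=0)

# Route 1: eigen-decomposition of the covariance matrix.
cov = X_centered.T @ X_centered / (len(X) - 1)
eigvals, eigvecs = np.linalg.eigh(cov)      # eigh returns ascending order
order = np.argsort(eigvals)[::-1]           # sort by variance, largest first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Route 2: SVD of the centered data.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
svd_variances = S ** 2 / (len(X) - 1)

# Fraction of the total variance carried by each component.
explained_ratio = eigvals / eigvals.sum()

# Project onto the top 2 components to get the reduced representation.
X_reduced = X_centered @ eigvecs[:, :2]

print(np.round(explained_ratio, 4))
print(X_reduced.shape)
```

The component directions from the two routes agree only up to sign, so compare them with an absolute value if you check them against each other.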

In [None]:
pca_n100 = PCA(n_components = 100, whiten = True, random_state = 42)
data_scaled = people.data/255
data_pca = pca_n100.fit_transform(data_scaled)

In [None]:
images_retransformed= pca_n100.inverse_transform(data_pca)
images_rescaled = images_retransformed*255
images_recovered = [image.reshape(img_shape) for image in images_rescaled]

The PCA transformation can be inverted, though only approximately once components have been discarded  
We attempt to recreate the images from their reduced version

In [None]:
fig, axes = plt.subplots(3,5,figsize = (15, 8),subplot_kw=dict(xticks=[],yticks=[]))
for target, image, ax in zip(people.target, images_recovered, axes.ravel()):
  ax.imshow(image, cmap="gray")
  ax.set_title(people.target_names[target])

We observe that the recreated images are blurrier and show less detail than the originals.  
These images are recreated from only 100 components, which account for nearly 90% of the variance in the data.  
The remaining components are discarded by PCA, so the information they carry is lost in this dimensionality reduction technique.
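
The link between discarded components and lost information can be checked numerically. This is a minimal sketch on synthetic data (not the face images, and without the whitening used above): the fraction of squared reconstruction error equals one minus the cumulative explained variance of the components that were kept. The helper names `reconstruct` and `lost_fraction` are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data with correlated features (300 samples, 20 features).
X = rng.normal(size=(300, 20)) @ rng.normal(size=(20, 20))

mean = X.mean(axis=0)
U, S, Vt = np.linalg.svd(X - mean, full_matrices=False)
cumulative = np.cumsum(S ** 2) / np.sum(S ** 2)

def reconstruct(k):
    """Project onto the top-k components, then map back to feature space."""
    components = Vt[:k]
    reduced = (X - mean) @ components.T
    return reduced @ components + mean

def lost_fraction(k):
    """Share of the total variance missing from the k-component reconstruction."""
    residual = X - reconstruct(k)
    return np.sum(residual ** 2) / np.sum((X - mean) ** 2)

# Smallest number of components whose cumulative variance reaches 90%.
k90 = int(np.searchsorted(cumulative, 0.90) + 1)

print("components needed for 90% variance:", k90)
print("variance kept:", round(float(cumulative[k90 - 1]), 4))
print("variance lost:", round(float(lost_fraction(k90)), 4))
```

Keeping every component reproduces the data exactly, so the loss comes entirely from the components that are dropped.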

In [None]:
tsne = TSNE(random_state = 42)
img_tsne = tsne.fit_transform(people.data)

**t-Distributed Stochastic Neighbor Embedding (t-SNE)**  
- Here the data is reduced to 2 dimensions (the scikit-learn default for `n_components`)
- Attempts to find a 2-D representation in which points that are close neighbors in the original space stay close together; distances between far-apart points are not reliably preserved
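
As a rough illustration of what "preserving neighbors" means, the sketch below computes the two sets of pairwise affinities that t-SNE compares and the KL divergence between them. It is a simplification under stated assumptions: a single fixed `sigma` replaces the per-point bandwidths that real t-SNE tunes from the perplexity, no optimization is performed, and all function names are made up for this example.

```python
import numpy as np

def pairwise_sq_dists(X):
    """Squared Euclidean distance between every pair of rows."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.maximum(d2, 0.0)  # clip tiny negatives from rounding

def high_dim_affinities(X, sigma=1.0):
    """Gaussian neighbor probabilities in the original space, symmetrized."""
    logits = -pairwise_sq_dists(X) / (2.0 * sigma ** 2)
    np.fill_diagonal(logits, -np.inf)            # a point is not its own neighbor
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    p_cond = np.exp(logits)
    p_cond /= p_cond.sum(axis=1, keepdims=True)  # each row sums to 1
    return (p_cond + p_cond.T) / (2.0 * len(X))

def low_dim_affinities(Y):
    """Heavy-tailed Student-t affinities in the embedding space."""
    w = 1.0 / (1.0 + pairwise_sq_dists(Y))
    np.fill_diagonal(w, 0.0)
    return w / w.sum()

def kl_divergence(P, Q, eps=1e-12):
    """Mismatch between the two affinity matrices; t-SNE minimizes this."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / (Q[mask] + eps))))

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 10))   # 30 points in 10 dimensions
Y_proj = X[:, :2]               # candidate embedding: first two coordinates
Y_rand = rng.normal(size=(30, 2))  # candidate embedding: unrelated to X

P = high_dim_affinities(X)
Q_proj = low_dim_affinities(Y_proj)
Q_rand = low_dim_affinities(Y_rand)

print("KL for projected embedding:", kl_divergence(P, Q_proj))
print("KL for random embedding:   ", kl_divergence(P, Q_rand))
```

t-SNE starts from an initial layout and moves the 2-D points by gradient descent to lower this divergence, which is why its output depends on `random_state`.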

In [None]:
img_tsne.shape

In [None]:
def display_2d_component_names(model, selected,dataobj):
  colors = ["#2F4F4F","#8B008B","#7FFF00","#00FFFF","#00FF7F","#FF00FF","#FF0000","#FF1493"
  ,"#8A2BE2","#7FFFD4","#D2691E"]
  plt.figure(figsize=(14,14))
  plt.xlim(model[:,0].min(),model[:,0].max()+1)
  plt.ylim(model[:,1].min(),model[:,1].max()+1)
  for i in range(len(dataobj.data)):
    cindex = dataobj.target[i]%len(selected)
    if dataobj.target[i] not in selected:
      continue
    plt.text(model[i,0],model[i,1],str(dataobj.target_names[dataobj.target[i]]),color = colors[cindex],fontdict = {'weight':'bold','size':9})

In [None]:
display_2d_component_names(img_tsne,(4,1),people)

In [None]:
iso =Isomap(n_components = 2)
img_iso = iso.fit_transform(people.data)

**ISOMAP**  
- This is a graph-based technique
- Connects each instance to its k nearest neighbors
- Uses a shortest-path algorithm such as Dijkstra's to approximate the geodesic distance between data points along this graph, then finds a low-dimensional embedding that preserves those distances
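
The three steps above can be sketched from scratch on a toy curve. This is an illustrative NumPy version, not scikit-learn's implementation: the function names are made up, the neighbor graph is symmetrized, and classical MDS is used for the final embedding. Points on a half circle are unrolled into a line, and the graph distance between the two ends approaches the arc length (π) rather than the straight-line distance (2).

```python
import heapq
import numpy as np

def knn_graph(X, k):
    """Adjacency list linking each point to its k nearest neighbors (symmetrized)."""
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))
    graph = [dict() for _ in range(len(X))]
    for i in range(len(X)):
        for j in np.argsort(D[i])[1:k + 1]:   # index 0 is the point itself
            graph[i][int(j)] = float(D[i, j])
            graph[int(j)][i] = float(D[i, j])
    return graph

def dijkstra(graph, source):
    """Shortest path lengths from source to every node in the graph."""
    dist = np.full(len(graph), np.inf)
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue                          # stale heap entry
        for v, w in graph[u].items():
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist

def classical_mds(G, n_components):
    """Embed points so Euclidean distances approximate the distance matrix G."""
    n = len(G)
    J = np.eye(n) - np.ones((n, n)) / n       # centering matrix
    B = -0.5 * J @ (G ** 2) @ J
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, order] * np.sqrt(np.maximum(eigvals[order], 0.0))

def isomap_sketch(X, k=2, n_components=1):
    graph = knn_graph(X, k)
    G = np.vstack([dijkstra(graph, i) for i in range(len(X))])
    return G, classical_mds(G, n_components)

# Toy manifold: 40 points along a half circle of radius 1.
theta = np.linspace(0.0, np.pi, 40)
X_arc = np.column_stack([np.cos(theta), np.sin(theta)])

G, embedding = isomap_sketch(X_arc, k=2, n_components=1)
print("straight-line distance between ends:", np.linalg.norm(X_arc[0] - X_arc[-1]))
print("graph (geodesic) distance between ends:", G[0, -1])
```

The sketch assumes the neighbor graph is connected; if `k` is too small for the data, some shortest-path distances stay infinite and the embedding step breaks down.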

In [None]:
display_2d_component_names(img_iso,(4,1),people)

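DIGITS DATASET (scikit-learn `load_digits`: 8x8 handwritten digit images, a small MNIST-like set rather than MNIST itself)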

In [None]:
from sklearn.datasets import load_digits

digits_data = load_digits()
print(digits_data.DESCR)

In [None]:
fig,axes = plt.subplots(2,5,figsize = (10,10),subplot_kw=dict(xticks=[],yticks=[]))
for img,ax in zip(digits_data.images,axes.ravel()):
  ax.imshow(img,cmap="gray")

In [None]:
pca_mnist = PCA(n_components = 2)
pca_mnist.fit(digits_data.data)
digits_pca = pca_mnist.transform(digits_data.data)
display_2d_component_names(digits_pca, (0,1,2,3,4,5,6,7,8,9),digits_data)

In [None]:
from sklearn.cluster import KMeans
kmeans1 = KMeans(n_clusters = 10)
pca_pred = kmeans1.fit_predict(digits_pca)

In [None]:
from scipy.stats import mode
def cluster_accuracy(target, clusters, numClasses):
  labels = np.zeros_like(clusters)
  for i in range(numClasses):
    mask = (clusters == i)
    labels[mask] = mode(target[mask])[0]
  print("Accuracy Score : {} \nHomogeneity Score : {}\nCompleteness Score : {}".format(accuracy_score(target, labels),homogeneity_score(target, labels),completeness_score(target, labels)))
  # return accuracy_score(target, labels),homogeneity_score(target, labels),completeness_score(target, labels)

In [None]:
cluster_accuracy(digits_data.target,pca_pred,10)

In [None]:
tsne_mnist = TSNE(random_state=42)
digits_tsne = tsne_mnist.fit_transform(digits_data.data)
display_2d_component_names(digits_tsne,(0,1,2,3,4,5,6,7,8,9),digits_data)

In [None]:
kmeans2 = KMeans(n_clusters = 10)
tsne_pred = kmeans2.fit_predict(digits_tsne)

In [None]:
cluster_accuracy(digits_data.target,tsne_pred,10)

In [None]:
iso_mnist = Isomap(n_components = 2)
digits_iso = iso_mnist.fit_transform(digits_data.data)
display_2d_component_names(digits_iso,(0,1,2,3,4,5,6,7,8,9),digits_data)

In [None]:
kmeans3 = KMeans(n_clusters = 10)
iso_pred = kmeans3.fit_predict(digits_iso)

In [None]:
cluster_accuracy(digits_data.target,iso_pred,10)