
# Principal Component Analysis (PCA)

**Author**: Roberto Muñoz Soria

**GitHub**: https://github.com/rpmunoz


The MNIST database of handwritten digits has a training set of 60,000 examples and a test set of 10,000 examples. It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image.

It is a good database for people who want to try learning techniques and pattern recognition methods on real-world data while spending minimal effort on preprocessing and formatting.

Parameter | Value
--- | ---
Classes | 10
Samples per class | ~7,000
Samples total | 70,000
Dimensionality | 784 (28 × 28 pixels)
Features | integer values from 0 to 255

In [None]:
import os
import requests
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

## Download the data

In [None]:
dataDir = 'data'
# exist_ok=True makes this a no-op when the directory already exists
os.makedirs(dataDir, exist_ok=True)

In [None]:
imgSize = (28, 28)
nSample = 10000

trainURL = "https://rmunoz-public.s3.amazonaws.com/ml/Kaggle_mnist_train.csv"
testURL = "https://rmunoz-public.s3.amazonaws.com/ml/Kaggle_mnist_test.csv"

trainFile = os.path.join(dataDir, "Kaggle_mnist_train.csv")
testFile = os.path.join(dataDir, "Kaggle_mnist_test.csv")

In [None]:
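# download each CSV; raise_for_status() stops on an HTTP error
# instead of silently saving an error page to disk
r = requests.get(trainURL)
r.raise_for_status()
with open(trainFile, "wb") as f:
    f.write(r.content)

r = requests.get(testURL)
r.raise_for_status()
with open(testFile, "wb") as f:
    f.write(r.content)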

In [None]:
trainDF = pd.read_csv(trainFile)
nTrain = len(trainDF)

trainDF.head()

In [None]:
trainYDF = trainDF['label']
trainXDF = trainDF.drop(['label'], axis=1)

In [None]:
print(trainXDF.shape)
print(trainYDF.shape)

In [None]:
# pick a random training row to display as a 28x28 grayscale image
idx = np.random.choice(range(nTrain))

plt.figure(figsize=(7,7))
dataRaw = trainXDF.iloc[idx].values.reshape(imgSize[0], imgSize[1])
plt.imshow(dataRaw, interpolation = "none", cmap = "gray")
plt.show()

## Preprocess the data

In [None]:
# draw the same random rows for features and labels
# (the identical random_state keeps them aligned)
trainX = trainXDF.sample(nSample, random_state=42).to_numpy()
trainY = trainYDF.sample(nSample, random_state=42).to_numpy(dtype=int)

print("Sample size: ", trainX.shape)
print("Minimum value: ", np.min(trainX))
print("Maximum value: ", np.max(trainX))

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
trainProcX = scaler.fit_transform(trainX)

print("Shape of standardized trainX: ", trainProcX.shape)
print("Minimum value: ", np.min(trainProcX))
print("Maximum value: ", np.max(trainProcX))
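
As a minimal sketch of what the scaling step does, the following compares `StandardScaler` with a manual column-wise standardization on a small toy array. The array `X` and the names `manual`, `std_safe`, and `sk` are illustrative assumptions and not part of the notebook; the constant third column stands in for MNIST pixels that have the same value in every image.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy stand-in for pixel data: 5 samples, 3 features.
# The last column is constant, like a pixel that never changes.
X = np.array([[0.0, 10.0, 7.0],
              [50.0, 20.0, 7.0],
              [100.0, 30.0, 7.0],
              [150.0, 40.0, 7.0],
              [200.0, 50.0, 7.0]])

mean = X.mean(axis=0)
std = X.std(axis=0)                       # population standard deviation (ddof=0)
std_safe = np.where(std == 0, 1.0, std)   # avoid dividing by zero for constant columns
manual = (X - mean) / std_safe

sk = StandardScaler().fit_transform(X)

print(np.allclose(manual, sk))            # compares the manual result with scikit-learn
print(manual[:, 2])                       # the constant column becomes all zeros
```

Each non-constant column ends up with zero mean and unit variance, which is why the minimum and maximum printed above are no longer 0 and 255.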

# Method 1: Computing PCA with basic Python functions

In [None]:
# A^T * A equals the covariance matrix of the standardized data up to a
# constant factor (1/n), which does not change the eigenvectors

covar_matrix = np.matmul(trainProcX.T, trainProcX)

print("The shape of the covariance matrix = ", covar_matrix.shape)
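
The relationship between `A^T * A` and the covariance can be checked on a small example. This is a sketch on synthetic data, assuming no constant columns; the names `raw`, `Z`, `gram`, and `corr` are illustrative. Because the columns are standardized with the population standard deviation, dividing the product by the number of samples gives the correlation matrix of the raw data.

```python
import numpy as np

rng = np.random.default_rng(0)
# 200 synthetic samples with 4 features on very different scales
raw = rng.normal(size=(200, 4)) * np.array([1.0, 5.0, 0.1, 20.0])

# same standardization that StandardScaler applies (ddof=0)
Z = (raw - raw.mean(axis=0)) / raw.std(axis=0)
n = Z.shape[0]

gram = Z.T @ Z      # what np.matmul(A.T, A) computes
corr = gram / n     # dividing by n gives the correlation matrix

print(np.allclose(corr, np.corrcoef(raw, rowvar=False)))
```

Skipping the division only rescales the eigenvalues by `n`; the eigenvectors, and therefore the principal directions, stay the same.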

In [None]:
from scipy.linalg import eigh

# eigh returns the eigenvalues in ascending order, so the two largest of the
# 784 are at indices 782 and 783; subset_by_index requests only those two
# (it replaces the deprecated 'eigvals' keyword)
values, vectors = eigh(covar_matrix, subset_by_index=[782, 783])

print("Shape of eigenvectors = ", vectors.shape)
# transpose to shape (2, d) and reverse the rows so the largest eigenvalue comes first
vectors = vectors.T[::-1]

print("Updated shape of eigenvectors = ", vectors.shape)
# vectors[0] is the eigenvector of the 1st principal component
# vectors[1] is the eigenvector of the 2nd principal component
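
A quick way to sanity-check the eigen-decomposition approach is to compare it with scikit-learn on a small synthetic data set. The sketch below assumes SciPy 1.5 or newer (for `subset_by_index`); the data and the names `Xc`, `components`, `manual`, and `sk` are illustrative and not part of the notebook.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 300 synthetic samples with 5 correlated features
X = rng.normal(size=(300, 5)) @ rng.normal(size=(5, 5))
Xc = X - X.mean(axis=0)            # PCA requires centered data

d = Xc.shape[1]
# two largest eigenvalues and their eigenvectors, returned in ascending order
values, vectors = eigh(Xc.T @ Xc, subset_by_index=[d - 2, d - 1])
components = vectors.T[::-1]       # row 0 = 1st principal direction
manual = Xc @ components.T         # shape (300, 2)

sk = PCA(n_components=2).fit_transform(X)

# eigenvectors are only defined up to sign, so compare absolute values
print(np.allclose(np.abs(manual), np.abs(sk), atol=1e-6))
```

The sign ambiguity explains why a manual projection can look mirrored relative to the scikit-learn plot while still describing the same structure.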

## Visualize the result of the PCA analysis

In [None]:
# project the samples onto the plane spanned by the two principal eigenvectors
# (matrix multiplication); the eigenvectors were computed from trainProcX,
# so the same standardized data is projected

trainPCAX = np.matmul(vectors, trainProcX.T)

print("Shape of the projected data: ", vectors.shape, "x", trainProcX.T.shape, " = ", trainPCAX.shape)

In [None]:
# append the labels to the 2-D projected data (vertical stack)
trainPCA = np.vstack((trainPCAX, trainY)).T

# create a new data frame for plotting the labeled points
trainPCADF = pd.DataFrame(data=trainPCA, columns=("1st_principal", "2nd_principal", "label"))
print(trainPCADF.head())

In [None]:
# 'height' replaces the 'size' argument, which seaborn renamed in version 0.9
sns.FacetGrid(trainPCADF, hue="label", height=6).map(plt.scatter, '1st_principal', '2nd_principal').add_legend()
plt.show()

# Method 2: Computing PCA with the Scikit-Learn library

In [None]:
from sklearn import decomposition

In [None]:
# configure PCA to keep the first 2 components
pca = decomposition.PCA(n_components=2)

trainPCAX = pca.fit_transform(trainX)

# trainPCAX contains the 2-D projection of the sampled data
print("Shape of trainPCAX = ", trainPCAX.shape)
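
To judge how much information a 2-D projection keeps, the fitted estimator exposes `explained_variance_ratio_`. The following is a small sketch on synthetic data; the 90% threshold and the names `pca_full`, `cumulative`, and `n_keep` are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 500 synthetic samples, 10 features with decreasing spread
X = rng.normal(size=(500, 10)) * np.linspace(10, 1, 10)

pca_full = PCA().fit(X)                       # keep all components
ratio = pca_full.explained_variance_ratio_    # fraction of variance per component
cumulative = np.cumsum(ratio)

# smallest number of components whose cumulative ratio reaches 90%
n_keep = int(np.searchsorted(cumulative, 0.90) + 1)

print("Variance explained by the first 2 components:", cumulative[1])
print("Components needed for 90% of the variance:", n_keep)
```

Applied to the notebook's `pca` object, `pca.explained_variance_ratio_` would report the share of variance captured by each of the two plotted axes.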

In [None]:
# attach the label to each 2-D data point
trainPCA = np.vstack((trainPCAX.T, trainY)).T

# create a new data frame for plotting the result
trainPCADF = pd.DataFrame(data=trainPCA, columns=("1st_principal", "2nd_principal", "label"))
# 'height' replaces the 'size' argument, which seaborn renamed in version 0.9
sns.FacetGrid(trainPCADF, hue="label", height=6).map(plt.scatter, '1st_principal', '2nd_principal').add_legend()
plt.show()

# Method 3: t-SNE with Scikit-Learn

In [None]:
from sklearn.manifold import TSNE

In [None]:
nSample = 1000

trainX = trainXDF.sample(nSample, random_state=42).to_numpy()
trainY = trainYDF.sample(nSample, random_state=42).to_numpy(dtype=int)

scaler = StandardScaler()
trainProcX = scaler.fit_transform(trainX)

print("Shape of standardized trainX: ", trainProcX.shape)
print("Minimum value: ", np.min(trainProcX))
print("Maximum value: ", np.max(trainProcX))

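Parameters of the t-SNE method

- n_components: Number of dimensions of the embedding
- perplexity: Perplexity, roughly the effective number of neighbors considered for each point. Default is 30
- learning_rate: Learning rate. Default was 200 in older scikit-learn releases and is 'auto' since version 1.2
- n_iter: Maximum number of iterations. Default is 1000. The parameter was renamed max_iter in scikit-learn 1.5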
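
The effect of these parameters is easiest to see on a tiny data set that runs in a few seconds. The sketch below uses three synthetic clusters rather than MNIST; the data and the names `centers`, `embeddings`, and `emb` are illustrative assumptions. It passes only `n_components`, `perplexity`, and `random_state`, and leaves the other settings at the defaults of the installed scikit-learn version.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# three well-separated synthetic clusters in 10 dimensions, 30 points each
centers = rng.normal(scale=10.0, size=(3, 10))
X = np.vstack([c + rng.normal(size=(30, 10)) for c in centers])

embeddings = {}
for perplexity in (5, 30):
    # perplexity must be smaller than the number of samples (90 here)
    model = TSNE(n_components=2, perplexity=perplexity, random_state=0)
    embeddings[perplexity] = model.fit_transform(X)

for perplexity, emb in embeddings.items():
    print(perplexity, emb.shape)
```

Plotting each entry of `embeddings` side by side shows how a low perplexity emphasizes very local neighborhoods while a higher one takes more of the global layout into account.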

## Default parameters

In [None]:
model = TSNE(n_components=2, random_state=0)

trainTSNEX = model.fit_transform(trainX)

# create a new data frame for plotting the result
trainTSNE = np.vstack((trainTSNEX.T, trainY)).T
trainTSNEDF = pd.DataFrame(data=trainTSNE, columns=("Dim_1", "Dim_2", "label"))

# plot the t-SNE result ('height' replaces the 'size' argument renamed in seaborn 0.9)
sns.FacetGrid(trainTSNEDF, hue="label", height=6).map(plt.scatter, 'Dim_1', 'Dim_2').add_legend()
plt.show()

## Using perplexity=50

In [None]:
model = TSNE(n_components=2, random_state=0, perplexity=50)

trainTSNEX = model.fit_transform(trainX)

# create a new data frame for plotting the result
trainTSNE = np.vstack((trainTSNEX.T, trainY)).T
trainTSNEDF = pd.DataFrame(data=trainTSNE, columns=("Dim_1", "Dim_2", "label"))

# plot the t-SNE result ('height' replaces the 'size' argument renamed in seaborn 0.9)
sns.FacetGrid(trainTSNEDF, hue="label", height=6).map(plt.scatter, 'Dim_1', 'Dim_2').add_legend()
plt.show()

## Using perplexity=50 and n_iter=5000

In [None]:
# n_iter was renamed max_iter in scikit-learn 1.5; use max_iter=5000 on recent versions
model = TSNE(n_components=2, random_state=0, perplexity=50, n_iter=5000)

trainTSNEX = model.fit_transform(trainX)

# create a new data frame for plotting the result
trainTSNE = np.vstack((trainTSNEX.T, trainY)).T
trainTSNEDF = pd.DataFrame(data=trainTSNE, columns=("Dim_1", "Dim_2", "label"))

# plot the t-SNE result ('height' replaces the 'size' argument renamed in seaborn 0.9)
sns.FacetGrid(trainTSNEDF, hue="label", height=6).map(plt.scatter, 'Dim_1', 'Dim_2').add_legend()
plt.show()

## Using perplexity=2 and n_iter=500

In [None]:
model = TSNE(n_components=2, random_state=0, perplexity=2, n_iter=500)

trainTSNEX = model.fit_transform(trainX)

# create a new data frame for plotting the result
trainTSNE = np.vstack((trainTSNEX.T, trainY)).T
trainTSNEDF = pd.DataFrame(data=trainTSNE, columns=("Dim_1", "Dim_2", "label"))

# plot the t-SNE result ('height' replaces the 'size' argument renamed in seaborn 0.9)
sns.FacetGrid(trainTSNEDF, hue="label", height=6).map(plt.scatter, 'Dim_1', 'Dim_2').add_legend()
plt.show()

# Method 4: t-SNE with a larger data sample and optimized parameters

In [None]:
nSample = 10000

trainX = trainXDF.sample(nSample, random_state=42).to_numpy()
trainY = trainYDF.sample(nSample, random_state=42).to_numpy(dtype=int)

scaler = StandardScaler()
trainProcX = scaler.fit_transform(trainX)

print("Shape of standardized trainX: ", trainProcX.shape)
print("Minimum value: ", np.min(trainProcX))
print("Maximum value: ", np.max(trainProcX))

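Parameters of the t-SNE method

- n_components: Number of dimensions of the embedding
- perplexity: Perplexity, roughly the effective number of neighbors considered for each point. Default is 30
- learning_rate: Learning rate. Default was 200 in older scikit-learn releases and is 'auto' since version 1.2
- n_iter: Maximum number of iterations. Default is 1000. The parameter was renamed max_iter in scikit-learn 1.5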

## Optimized parameters

In [None]:
model = TSNE(n_components=2, random_state=0, perplexity=30, n_iter=500)

trainTSNEX = model.fit_transform(trainX)

# create a new data frame for plotting the result
trainTSNE = np.vstack((trainTSNEX.T, trainY)).T
trainTSNEDF = pd.DataFrame(data=trainTSNE, columns=("Dim_1", "Dim_2", "label"))

# plot the t-SNE result ('height' replaces the 'size' argument renamed in seaborn 0.9)
sns.FacetGrid(trainTSNEDF, hue="label", height=6).map(plt.scatter, 'Dim_1', 'Dim_2').add_legend()
plt.show()