These are unsupervised learning algorithms: the class labels are unknown, so we draw inferences from datasets that consist of input data without a known answer.

Dimensionality reduction compresses the data by finding a smaller, different set of variables that captures what matters most in the original features while minimizing the loss of information. It helps mitigate problems associated with high dimensionality and permits the visualization of salient aspects of higher-dimensional data that are otherwise difficult to explore.

The three most frequently used techniques for dimensionality reduction are:

    1. principal component analysis (PCA)
    2. kernel principal component analysis (KPCA)
    3. t-distributed stochastic neighbor embedding (t-SNE)
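
As a rough illustration of how the three techniques are invoked, the sketch below applies each of them to the same small synthetic dataset. The random data, the RBF kernel, and the perplexity value are assumptions chosen only to keep the example fast; they are not recommendations.

```python
import numpy as np
from sklearn.decomposition import PCA, KernelPCA
from sklearn.manifold import TSNE

# Synthetic data (an assumption for illustration): 100 samples, 10 features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))

# Each reducer maps the 10 original features to 2 new variables
reducers = {
    "PCA": PCA(n_components=2),
    "KPCA": KernelPCA(n_components=2, kernel="rbf"),
    "t-SNE": TSNE(n_components=2, perplexity=20, random_state=0),
}
embeddings = {name: model.fit_transform(X) for name, model in reducers.items()}

for name, Z in embeddings.items():
    print(name, Z.shape)
```

All three expose the same `fit_transform` interface, but only PCA and KPCA learn a mapping that can later be applied to new data through `transform`; scikit-learn's `TSNE` embeds only the data it was fitted on.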
    
    
**PCA** aims to reduce the dimensionality of a dataset with a large number of variables while retaining as much of the variance in the data as possible. It finds a new set of variables, each of which is a linear combination of the original features. The new variables are called *principal components (PCs)*. The principal components are mutually orthogonal, so they are uncorrelated with one another (though not necessarily statistically independent), and together they can represent the original data. The number of components is a hyperparameter of the PCA algorithm that sets the target dimensionality.
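
The properties described above can be checked numerically. The following is a minimal sketch on synthetic data (the two-factor data-generating process is an assumption made for the example): it verifies that the PC scores are linear combinations of the centered features, that the component directions are orthonormal, and that the scores are uncorrelated.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
# Synthetic correlated data: 200 samples, 5 features driven by
# 2 latent factors plus a small amount of noise
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 5)) + 0.1 * rng.normal(size=(200, 5))

pca = PCA(n_components=2)
Z = pca.fit_transform(X)

# Each PC score is a linear combination of the centered original features
Z_manual = (X - pca.mean_) @ pca.components_.T
scores_match = np.allclose(Z, Z_manual)

# The rows of components_ (the PC directions) are orthonormal
gram = pca.components_ @ pca.components_.T
directions_orthonormal = np.allclose(gram, np.eye(2))

# The correlation between the two PC scores is numerically zero
score_corr = np.corrcoef(Z, rowvar=False)[0, 1]

print(scores_match, directions_orthonormal, abs(score_corr) < 1e-8)
```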

**How does the PCA algorithm work?** The algorithm identifies a sequence of principal components, each of which aligns with the direction of maximum variance in the data after accounting for the variation captured by previously computed components. The sequential optimization also ensures that each new component is uncorrelated with the existing ones, so the resulting set constitutes an orthogonal basis for a vector space. The original data is then projected onto this principal component space.  
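
One common way to obtain these directions is the eigendecomposition of the covariance matrix, which yields the same components as the sequential variance maximization described above. The sketch below uses NumPy only; the synthetic data and the mixing matrix are assumptions for illustration, and this is not necessarily how scikit-learn computes PCA internally.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: 500 samples, 3 features with unequal variances
mixing = np.array([[2.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.2]])
X = rng.normal(size=(500, 3)) @ mixing

# 1. Center the data
Xc = X - X.mean(axis=0)

# 2. Covariance matrix of the features
cov = np.cov(Xc, rowvar=False)

# 3. Eigendecomposition (eigh is intended for symmetric matrices)
eigvals, eigvecs = np.linalg.eigh(cov)

# 4. Sort directions by descending variance (eigh returns ascending order)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 5. Project the centered data onto the first two directions
scores = Xc @ eigvecs[:, :2]

print("eigenvalues:", eigvals)
print("variance of projected data:", scores.var(axis=0, ddof=1))
```

The variance of the data along each direction equals the corresponding eigenvalue, which is why sorting the eigenvalues in descending order reproduces the "direction of maximum remaining variance" ordering.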

The decline in the amount of variance of the original data explained by each principal component reflects the extent of correlation among the original features. The number of components that capture, for example, 95% of the original variation relative to the total number of features provides an insight into the linearly independent information of the original data.
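
To make the 95% example concrete, the sketch below counts how many components are needed to reach that threshold. It assumes synthetic data in which 10 features are driven by 3 latent factors, so only a few components should be required; the variable names are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# Synthetic data: 10 observed features generated from 3 latent factors
latent = rng.normal(size=(300, 3))
X = latent @ rng.normal(size=(3, 10)) + 0.05 * rng.normal(size=(300, 10))

# Fit PCA with all components and accumulate the explained variance ratios
pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative share reaches 95%
n_components_95 = int(np.searchsorted(cumulative, 0.95) + 1)

print(cumulative.round(3))
print(n_components_95, "of", X.shape[1], "features")
```

scikit-learn can also make this selection itself: passing a float between 0 and 1, for example `PCA(n_components=0.95)`, keeps the smallest number of components whose cumulative explained variance exceeds that fraction.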



## Coding 

#### Import PCA Algorithm
from sklearn.decomposition import PCA
#### Initialize the algorithm and set the number of PCs
pca = PCA(n_components=2)
#### Fit the model to data
pca.fit(data)
#### Get list of PCs
pca.components_
#### Project the data onto the PCs
pca.transform(data)
#### Get the eigenvalues (variance explained by each PC)
pca.explained_variance_
#### Get the share of total variance explained by each PC
pca.explained_variance_ratio_

In [9]:
from sklearn.decomposition import PCA
from sklearn.decomposition import TruncatedSVD
from numpy.linalg import inv, eig, svd
from sklearn.manifold import TSNE
from sklearn.decomposition import KernelPCA

#data processing and visualization
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas import read_csv, set_option
from pandas.plotting import scatter_matrix
from sklearn.preprocessing import StandardScaler

In [12]:
#loading the dataset
dataset = read_csv('Dow_adjcloses.csv',index_col=0)

## EDA

In [13]:
dataset.shape

(4804, 30)

In [14]:
dataset.head()

Date,MMM,AXP,AAPL,BA,CAT,CVX,CSCO,KO,DIS,DWDP,...,NKE,PFE,PG,TRV,UTX,UNH,VZ,V,WMT,WBA
2000-01-03,29.847043,35.476634,3.530576,26.650218,14.560887,21.582046,43.003876,16.983583,23.52222,,...,4.70118,16.746856,32.227726,20.158885,21.31903,5.841355,22.564221,,47.337599,21.713237
2000-01-04,28.661131,34.134275,3.232839,26.610431,14.372251,21.582046,40.5772,17.04095,24.89986,,...,4.445214,16.121738,31.596399,19.890099,20.445803,5.766368,21.833915,,45.566248,20.907354
2000-01-05,30.122175,33.95943,3.280149,28.473758,14.914205,22.049145,40.895453,17.228147,25.78155,,...,4.702157,16.415912,31.325831,20.085579,20.254784,5.753327,22.564221,,44.503437,21.097421
2000-01-06,31.877325,33.95943,2.99629,28.553331,15.459153,22.903343,39.781569,17.210031,24.89986,,...,4.677733,16.972739,32.438168,20.122232,20.998392,5.964159,22.449405,,45.126952,20.52722
2000-01-07,32.509812,34.433913,3.138219,29.382213,15.962182,23.305926,42.128682,18.34227,24.506249,,...,4.677733,18.123166,35.023602,20.922479,21.830687,6.662948,22.282692,,48.535033,21.051805


In [15]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4804 entries, 2000-01-03 to 2019-02-06
Data columns (total 30 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   MMM     4804 non-null   float64
 1   AXP     4804 non-null   float64
 2   AAPL    4804 non-null   float64
 3   BA      4804 non-null   float64
 4   CAT     4804 non-null   float64
 5   CVX     4804 non-null   float64
 6   CSCO    4804 non-null   float64
 7   KO      4804 non-null   float64
 8   DIS     4804 non-null   float64
 9   DWDP    363 non-null    float64
 10  XOM     4804 non-null   float64
 11  GS      4804 non-null   float64
 12  HD      4804 non-null   float64
 13  IBM     4804 non-null   float64
 14  INTC    4804 non-null   float64
 15  JNJ     4804 non-null   float64
 16  JPM     4804 non-null   float64
 17  MCD     4804 non-null   float64
 18  MRK     4804 non-null   float64
 19  MSFT    4804 non-null   float64
 20  NKE     4804 non-null   float64
 21  PFE     4804 non-null   flo

In [18]:
#data visualization
import seaborn as sns

correlation = dataset.corr()
plt.figure(figsize=(15, 15))
plt.title('Correlation Matrix')
sns.heatmap(correlation, vmax=1, square=True,annot=True, cmap='cubehelix')

ModuleNotFoundError: No module named 'seaborn'
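
The heatmap cell above fails when seaborn is not installed. One option is to install it (for example with `pip install seaborn`); another is to draw a comparable heatmap with matplotlib alone. The sketch below takes the second route. The helper `plot_correlation_heatmap` and the random-walk `demo` frame are hypothetical names introduced for illustration; in the notebook the function would be called on `dataset`.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


def plot_correlation_heatmap(df):
    """Draw an annotated correlation heatmap using matplotlib only."""
    corr = df.corr()
    n = len(corr.columns)
    fig, ax = plt.subplots(figsize=(6, 6))
    im = ax.imshow(corr.values, vmax=1, cmap="cubehelix")
    ax.set_xticks(range(n))
    ax.set_yticks(range(n))
    ax.set_xticklabels(corr.columns, rotation=90)
    ax.set_yticklabels(corr.columns)
    # Write each correlation coefficient into its cell
    for i in range(n):
        for j in range(n):
            ax.text(j, i, f"{corr.values[i, j]:.2f}",
                    ha="center", va="center", color="white", fontsize=8)
    fig.colorbar(im, ax=ax)
    ax.set_title("Correlation Matrix")
    return fig, ax, corr


# Hypothetical stand-in for the price data: four random-walk series
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(50, 4)).cumsum(axis=0),
                    columns=["A", "B", "C", "D"])
fig, ax, corr = plot_correlation_heatmap(demo)
```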

In [19]:
#Checking for any null values before removing them
print('Null Values =',dataset.isnull().values.any())

Null Values = True


In [20]:
missing_fractions = dataset.isnull().mean().sort_values(ascending=False)
missing_fractions.head(10)
drop_list = sorted(list(missing_fractions[missing_fractions > 0.3].index))
dataset.drop(labels=drop_list, axis=1, inplace=True)
dataset.shape

(4804, 28)

In [None]:
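# Forward-fill: replace each missing value with the last available value in the same column.
# fillna(method='ffill') is deprecated in recent pandas versions; ffill() is the equivalent call.
dataset = dataset.ffill()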
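
The two cleaning steps above (dropping columns with more than 30% missing values, then forward-filling the rest) are easier to follow on a tiny frame. The sketch below uses a hypothetical `toy` DataFrame with made-up column names and calls `DataFrame.ffill()`, which performs the same forward fill.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one complete column and three with gaps
toy = pd.DataFrame({
    "full":   [1.0, 2.0, 3.0, 4.0],
    "sparse": [np.nan, np.nan, np.nan, 4.0],  # 75% missing
    "gappy":  [1.0, np.nan, 3.0, 4.0],        # 25% missing, gap in the middle
    "late":   [np.nan, 2.0, 3.0, 4.0],        # 25% missing, gap at the start
})

# Fraction of missing values per column, largest first
missing_fractions = toy.isnull().mean().sort_values(ascending=False)

# Drop every column with more than 30% missing values
drop_list = sorted(missing_fractions[missing_fractions > 0.3].index)
cleaned = toy.drop(columns=drop_list)

# Forward-fill the remaining gaps with the previous observation
filled = cleaned.ffill()

print(drop_list)
print(filled)
```

Forward-filling cannot repair a gap at the very start of a column because there is no earlier value to carry forward, so the first row of `late` remains missing. It is worth rechecking `isnull()` after this step and dropping or back-filling such rows if they occur.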