# Dimensionality Reduction

> In statistics, machine learning, and information theory, dimensionality reduction or dimension reduction is the process of reducing the number of random variables under consideration by obtaining a set of principal variables. It can be divided into feature selection and feature extraction. [Link to wiki](https://en.wikipedia.org/wiki/Dimensionality_reduction)


### Feature Selection:
    Keeps a subset of the variables from the original dataset and discards the rest; the retained variables are left unchanged.
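As a rough illustration of feature selection, the sketch below keeps the five wine features that score highest on a univariate ANOVA F-test against the class label. The choice of `SelectKBest` with `f_classif` and the value `k=5` are assumptions made for this example; they are one of many possible selection strategies, not a recommendation.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.feature_selection import SelectKBest, f_classif

wine = load_wine()
X, y = wine.data, wine.target

# Keep the 5 features with the highest ANOVA F-score (k=5 is arbitrary here).
selector = SelectKBest(score_func=f_classif, k=5)
X_selected = selector.fit_transform(X, y)

# get_support() returns a boolean mask over the original columns.
mask = selector.get_support()
selected_names = [name for name, keep in zip(wine.feature_names, mask) if keep]

print(X.shape, "->", X_selected.shape)
print(selected_names)
```

Note that the selected columns are copied as-is from the original data, which is what distinguishes selection from extraction.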


### Feature Extraction:
    Transforms the data from a high-dimensional space to a lower-dimensional one by building new variables from combinations of the original ones.
    
    
### Reasons to use dimensionality reduction:
    1) Low dimension -> simple data
    2) Less computation time
    3) Less space
    4) Low number of features -> easy to analyse the data
    5) Redundant/correlated features are removed or merged

### Dimensionality reduction algorithms:
    1) PCA
    2) LDA/QDA
    3) ICA
    4) Factor Analysis
    5) Autoencoders

## PCA

    Transforms a set of variables, which are in most cases correlated, into a set of linearly uncorrelated variables called principal components. Each principal component is a linear combination of the original variables. There are at most as many principal components as there are variables in the dataset. The first principal component has the largest possible variance, and each succeeding component has the largest variance possible while being orthogonal to the preceding ones, so the variance never increases from one component to the next.
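The description above can be sketched with plain NumPy: center the data, compute the covariance matrix, and project onto its eigenvectors sorted by decreasing eigenvalue. The synthetic data and all variable names here are made up for illustration; this is a minimal sketch, not how scikit-learn implements `PCA` internally.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 samples, 3 features, the first two strongly correlated.
x = rng.normal(size=200)
X = np.column_stack([x,
                     2 * x + rng.normal(scale=0.1, size=200),
                     rng.normal(size=200)])

# 1) Center each feature.
Xc = X - X.mean(axis=0)

# 2) Covariance matrix of the features.
cov = np.cov(Xc, rowvar=False)

# 3) Eigen-decomposition; eigh returns eigenvalues in ascending order,
#    so reorder to get the largest variance first.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4) Project the centered data onto the eigenvectors.
scores = Xc @ eigvecs

explained_variance_ratio = eigvals / eigvals.sum()
print(explained_variance_ratio)
# The covariance of the scores is diagonal: the components are uncorrelated.
print(np.round(np.cov(scores, rowvar=False), 6))
```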
    
## Assumptions

    1) Variables are continuous or ordinal
    2) Linear relationship between the variables
    3) No outliers
    4) Zero mean and unit variance
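To see why the last assumption matters, the sketch below compares PCA on the raw wine features with PCA on standardized features. scikit-learn's `PCA` centers the data but does not rescale it, so on raw data a feature measured in large units (such as `proline`) can dominate the first component. The comparison itself is an illustration added here, not part of the original analysis.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_wine().data

# Explained variance ratios without and with standardization.
raw_ratio = PCA().fit(X).explained_variance_ratio_
scaled_ratio = PCA().fit(StandardScaler().fit_transform(X)).explained_variance_ratio_

print("first component, raw data:   ", round(raw_ratio[0], 3))
print("first component, scaled data:", round(scaled_ratio[0], 3))
```

On the raw data almost all of the variance lands in a single component, which mostly reflects differences in units rather than structure in the data.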
    

## Disadvantages
    
    1) Finds only linear relationships between the variables
    2) Does not produce satisfying results if the mean and covariance are not sufficient to describe the data
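The first limitation can be illustrated on a toy dataset of two concentric circles, which are only separable by a non-linear boundary. PCA with all components kept is just a rotation of the centered data, so distances from the center are unchanged and no single component separates the two rings. The data and variable names below are invented for this sketch.

```python
import numpy as np

# Two concentric circles with radii 1 and 3 (100 points each).
theta = np.linspace(0, 2 * np.pi, 100, endpoint=False)
inner = np.column_stack([np.cos(theta), np.sin(theta)])
outer = 3 * inner
X = np.vstack([inner, outer])

# PCA via SVD of the centered data: rows of Vt are the principal directions.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt.T

# A rotation preserves every point's distance from the center.
radius_before = np.linalg.norm(Xc, axis=1)
radius_after = np.linalg.norm(scores, axis=1)
print(np.allclose(radius_before, radius_after))

# Along the first component the inner ring lies inside the range of the
# outer ring, so a threshold on PC1 cannot separate the two classes.
pc1_inner, pc1_outer = scores[:100, 0], scores[100:, 0]
print(pc1_inner.min(), pc1_inner.max())
print(pc1_outer.min(), pc1_outer.max())
```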

In [1]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import sklearn.datasets

from plotly.offline import iplot
import plotly.graph_objs as go
from plotly.offline import init_notebook_mode
init_notebook_mode(connected=True)

# Wine

In [2]:
wine_data = sklearn.datasets.load_wine()

In [3]:
Y = wine_data["target"]

In [4]:
wine_data = pd.DataFrame(wine_data["data"], columns=wine_data["feature_names"])

In [5]:
wine_data["target"] = Y

In [6]:
wine_data.head()

|  | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280/od315_of_diluted_wines | proline | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14.23 | 1.71 | 2.43 | 15.6 | 127.0 | 2.8 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065.0 | 0 |
| 1 | 13.2 | 1.78 | 2.14 | 11.2 | 100.0 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.4 | 1050.0 | 0 |
| 2 | 13.16 | 2.36 | 2.67 | 18.6 | 101.0 | 2.8 | 3.24 | 0.3 | 2.81 | 5.68 | 1.03 | 3.17 | 1185.0 | 0 |
| 3 | 14.37 | 1.95 | 2.5 | 16.8 | 113.0 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8 | 0.86 | 3.45 | 1480.0 | 0 |
| 4 | 13.24 | 2.59 | 2.87 | 21.0 | 118.0 | 2.8 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735.0 | 0 |


In [7]:
wine_data.drop(["target"], axis = 1).corr()

|  | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280/od315_of_diluted_wines | proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| alcohol | 1.0 | 0.094397 | 0.211545 | -0.310235 | 0.270798 | 0.289101 | 0.236815 | -0.155929 | 0.136698 | 0.546364 | -0.071747 | 0.072343 | 0.64372 |
| malic_acid | 0.094397 | 1.0 | 0.164045 | 0.2885 | -0.054575 | -0.335167 | -0.411007 | 0.292977 | -0.220746 | 0.248985 | -0.561296 | -0.36871 | -0.192011 |
| ash | 0.211545 | 0.164045 | 1.0 | 0.443367 | 0.286587 | 0.12898 | 0.115077 | 0.18623 | 0.009652 | 0.258887 | -0.074667 | 0.003911 | 0.223626 |
| alcalinity_of_ash | -0.310235 | 0.2885 | 0.443367 | 1.0 | -0.083333 | -0.321113 | -0.35137 | 0.361922 | -0.197327 | 0.018732 | -0.273955 | -0.276769 | -0.440597 |
| magnesium | 0.270798 | -0.054575 | 0.286587 | -0.083333 | 1.0 | 0.214401 | 0.195784 | -0.256294 | 0.236441 | 0.19995 | 0.055398 | 0.066004 | 0.393351 |
| total_phenols | 0.289101 | -0.335167 | 0.12898 | -0.321113 | 0.214401 | 1.0 | 0.864564 | -0.449935 | 0.612413 | -0.055136 | 0.433681 | 0.699949 | 0.498115 |
| flavanoids | 0.236815 | -0.411007 | 0.115077 | -0.35137 | 0.195784 | 0.864564 | 1.0 | -0.5379 | 0.652692 | -0.172379 | 0.543479 | 0.787194 | 0.494193 |
| nonflavanoid_phenols | -0.155929 | 0.292977 | 0.18623 | 0.361922 | -0.256294 | -0.449935 | -0.5379 | 1.0 | -0.365845 | 0.139057 | -0.26264 | -0.50327 | -0.311385 |
| proanthocyanins | 0.136698 | -0.220746 | 0.009652 | -0.197327 | 0.236441 | 0.612413 | 0.652692 | -0.365845 | 1.0 | -0.02525 | 0.295544 | 0.519067 | 0.330417 |
| color_intensity | 0.546364 | 0.248985 | 0.258887 | 0.018732 | 0.19995 | -0.055136 | -0.172379 | 0.139057 | -0.02525 | 1.0 | -0.521813 | -0.428815 | 0.3161 |


In [8]:
data = []
for i in [0, 1, 2]:
    trace = go.Scatter3d(x = wine_data[wine_data["target"] == i]["hue"],
                         y = wine_data[wine_data["target"] == i]["alcohol"],
                         z = wine_data[wine_data["target"] == i]["magnesium"],
                         mode = 'markers',
                         name = "class_{}".format(i),
                         marker = dict(size = 3))
    data.append(trace)

figure = dict(data=data)
iplot(figure)

In [9]:
scaler = StandardScaler()

In [10]:
wine_scaled = scaler.fit_transform(wine_data.drop(["target"], axis = 1))

In [11]:
wine_scaled = pd.DataFrame(wine_scaled, columns=wine_data.drop(["target"], axis = 1).columns)

In [12]:
pca = PCA(random_state=42)
principalComponents = pca.fit_transform(wine_scaled)

In [13]:
principalComponents

array([[ 3.31675081e+00, -1.44346263e+00, -1.65739045e-01, ...,
        -4.51563395e-01,  5.40810414e-01, -6.62386309e-02],
       [ 2.20946492e+00,  3.33392887e-01, -2.02645737e+00, ...,
        -1.42657306e-01,  3.88237741e-01,  3.63650247e-03],
       [ 2.51674015e+00, -1.03115130e+00,  9.82818670e-01, ...,
        -2.86672847e-01,  5.83573183e-04,  2.17165104e-02],
       ...,
       [-2.67783946e+00, -2.76089913e+00, -9.40941877e-01, ...,
         5.12492025e-01,  6.98766451e-01,  7.20776948e-02],
       [-2.38701709e+00, -2.29734668e+00, -5.50696197e-01, ...,
         2.99821968e-01,  3.39820654e-01, -2.18657605e-02],
       [-3.20875816e+00, -2.76891957e+00,  1.01391366e+00, ...,
        -2.29964331e-01, -1.88787963e-01, -3.23964720e-01]])

In [14]:
exp_var = pca.explained_variance_ratio_

In [15]:
sum(exp_var)

1.0

In [16]:
# go.Line is deprecated; a Scatter trace in line mode draws the same plot.
trace = go.Scatter(y = exp_var,
                   x = np.arange(1, len(exp_var) + 1),
                   mode = 'lines')
data = [trace]
figure = dict(data = data)
iplot(figure)




In [17]:
# Cumulative explained variance; go.Scatter replaces the deprecated go.Line.
trace = go.Scatter(y = exp_var.cumsum(),
                   x = np.arange(1, len(exp_var) + 1),
                   mode = 'lines')
data = [trace]
figure = dict(data = data)
iplot(figure)
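The cumulative curve above is typically used to decide how many components to keep. A minimal sketch, assuming a target of 95% explained variance (the threshold is an arbitrary choice for illustration): find the first position where the cumulative ratio reaches the threshold, or let scikit-learn do the same by passing a float to `n_components`.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_wine().data)

# Fit with all components and accumulate the explained variance ratios.
pca_full = PCA().fit(X)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

threshold = 0.95  # illustrative target, not a universal rule
# Index of the first cumulative value >= threshold, converted to a count.
n_components = int(np.searchsorted(cumulative, threshold) + 1)

# Equivalent shortcut: a float in (0, 1) asks PCA to keep enough components
# to explain at least that fraction of the variance (needs the full solver).
pca_auto = PCA(n_components=threshold, svd_solver="full").fit(X)

print(n_components, pca_auto.n_components_)
```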

In [18]:
comp_df = pd.DataFrame(principalComponents)
comp_df["target"] = wine_data["target"]

In [19]:
comp_df.drop(["target"], axis = 1).corr()

|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 2.416363e-16 | -1.970169e-16 | 1.175725e-16 | 1.257535e-16 | -3.984214e-17 | -5.967345e-18 | 7.938876e-17 | -1.711837e-17 | 4.104164e-17 | 3.993595e-17 | 5.249096e-17 | 2.146192e-17 |
| 1 | 2.416363e-16 | 1.0 | -6.958636e-17 | 5.270376e-17 | -5.12781e-17 | 8.869605e-18 | -7.8132e-17 | -1.16341e-16 | 3.201924e-16 | 1.9188e-16 | -6.479281e-17 | -8.647246e-17 | 1.215359e-16 |
| 2 | -1.970169e-16 | -6.958636e-17 | 1.0 | 4.328463e-18 | 3.049036e-16 | 4.759169e-17 | -3.655222e-17 | -1.194908e-16 | 8.974691e-17 | 1.113144e-16 | -6.412865e-17 | -3.756076e-17 | 6.93665e-17 |
| 3 | 1.175725e-16 | 5.270376e-17 | 4.328463e-18 | 1.0 | -1.493282e-16 | 9.097146e-17 | 3.77785e-16 | -2.165716e-16 | 2.421084e-18 | -9.157465e-17 | -2.940497e-16 | 1.102696e-16 | -2.195606e-16 |
| 4 | 1.257535e-16 | -5.12781e-17 | 3.049036e-16 | -1.493282e-16 | 1.0 | -1.22229e-17 | -3.276489e-16 | -4.289327e-17 | -8.919845e-17 | -9.419481e-17 | -1.456567e-17 | 5.547323e-17 | -8.925502e-17 |
| 5 | -3.984214e-17 | 8.869605e-18 | 4.759169e-17 | 9.097146e-17 | -1.22229e-17 | 1.0 | 1.155599e-16 | 1.655323e-16 | -1.282104e-16 | 1.673985e-16 | -2.687395e-16 | -1.945109e-16 | -4.84345e-18 |
| 6 | -5.967345e-18 | -7.8132e-17 | -3.655222e-17 | 3.77785e-16 | -3.276489e-16 | 1.155599e-16 | 1.0 | -2.155e-17 | -2.761315e-16 | -1.459467e-16 | 3.450573e-17 | -2.290527e-16 | 9.825908e-17 |
| 7 | 7.938876e-17 | -1.16341e-16 | -1.194908e-16 | -2.165716e-16 | -4.289327e-17 | 1.655323e-16 | -2.155e-17 | 1.0 | 1.321978e-16 | 6.934318e-17 | -1.553679e-16 | -1.318064e-16 | 3.902206e-17 |
| 8 | -1.711837e-17 | 3.201924e-16 | 8.974691e-17 | 2.421084e-18 | -8.919845e-17 | -1.282104e-16 | -2.761315e-16 | 1.321978e-16 | 1.0 | 1.181542e-16 | -3.296964e-17 | -1.341767e-17 | -6.406428e-17 |
| 9 | 4.104164e-17 | 1.9188e-16 | 1.113144e-16 | -9.157465e-17 | -9.419481e-17 | 1.673985e-16 | -1.459467e-16 | 6.934318e-17 | 1.181542e-16 | 1.0 | 9.237317e-17 | 7.435488e-17 | -2.512472e-16 |


In [20]:
data = []
for i in [0, 1, 2]:
    trace = go.Scatter3d(x = comp_df[comp_df["target"] == i][0],
                         y = comp_df[comp_df["target"] == i][1],
                         z = comp_df[comp_df["target"] == i][2],
                         mode = 'markers',
                         name = "class_{}".format(i),
                         marker = dict(size = 3))
    data.append(trace)

figure = dict(data=data)
iplot(figure)

In [21]:
data = []
for i in [0, 1, 2]:
    trace = go.Scatter(x = comp_df[comp_df["target"] == i][0],
                       y = comp_df[comp_df["target"] == i][1],
                       mode = 'markers',
                       name = "class_{}".format(i),
                       marker = dict(size = 7))
    data.append(trace)

figure = dict(data=data)
iplot(figure)
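The axes of the plots above are linear combinations of the original wine features. One way to interpret them is to look at the component weights stored in `pca.components_` (often loosely called loadings). The sketch below refits a two-component PCA so that it is self-contained; the `PC1`/`PC2` column labels are names chosen here for readability.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X = StandardScaler().fit_transform(wine.data)
pca = PCA(n_components=2).fit(X)

# components_ has shape (n_components, n_features); transpose it so that
# each row is an original feature and each column a principal component.
weights = pd.DataFrame(pca.components_.T,
                       index=wine.feature_names,
                       columns=["PC1", "PC2"])

# Features sorted by the magnitude of their contribution to PC1.
by_pc1 = weights["PC1"].abs().sort_values(ascending=False).index
print(weights.reindex(by_pc1))
```

Features with large absolute weights on a component contribute most to it; the sign of a weight is only meaningful relative to the other weights on the same component.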

# Cancer

In [22]:
cancer_data = sklearn.datasets.load_breast_cancer()

In [23]:
Y = cancer_data["target"]

In [24]:
cancer_data = pd.DataFrame(cancer_data["data"], columns=cancer_data["feature_names"])

In [25]:
scaler = StandardScaler()
cancer_scaled = scaler.fit_transform(cancer_data)

In [26]:
cancer_scaled = pd.DataFrame(cancer_scaled, columns=cancer_data.columns)

In [27]:
cancer_scaled.head()

|  | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.097064 | -2.073335 | 1.269934 | 0.984375 | 1.568466 | 3.283515 | 2.652874 | 2.532475 | 2.217515 | 2.255747 | ... | 1.88669 | -1.359293 | 2.303601 | 2.001237 | 1.307686 | 2.616665 | 2.109526 | 2.296076 | 2.750622 | 1.937015 |
| 1 | 1.829821 | -0.353632 | 1.685955 | 1.908708 | -0.826962 | -0.487072 | -0.023846 | 0.548144 | 0.001392 | -0.868652 | ... | 1.805927 | -0.369203 | 1.535126 | 1.890489 | -0.375612 | -0.430444 | -0.146749 | 1.087084 | -0.24389 | 0.28119 |
| 2 | 1.579888 | 0.456187 | 1.566503 | 1.558884 | 0.94221 | 1.052926 | 1.363478 | 2.037231 | 0.939685 | -0.398008 | ... | 1.51187 | -0.023974 | 1.347475 | 1.456285 | 0.527407 | 1.082932 | 0.854974 | 1.955 | 1.152255 | 0.201391 |
| 3 | -0.768909 | 0.253732 | -0.592687 | -0.764464 | 3.283553 | 3.402909 | 1.915897 | 1.451707 | 2.867383 | 4.910919 | ... | -0.281464 | 0.133984 | -0.249939 | -0.550021 | 3.394275 | 3.893397 | 1.989588 | 2.175786 | 6.046041 | 4.93501 |
| 4 | 1.750297 | -1.151816 | 1.776573 | 1.826229 | 0.280372 | 0.53934 | 1.371011 | 1.428493 | -0.00956 | -0.56245 | ... | 1.298575 | -1.46677 | 1.338539 | 1.220724 | 0.220556 | -0.313395 | 0.613179 | 0.729259 | -0.868353 | -0.3971 |


In [28]:
pca = PCA(random_state=42)
principalComponents = pca.fit_transform(cancer_scaled)

In [29]:
exp_var = pca.explained_variance_ratio_

In [30]:
# go.Line is deprecated; a Scatter trace in line mode draws the same plot.
trace = go.Scatter(y = exp_var,
                   x = np.arange(1, len(exp_var) + 1),
                   mode = 'lines')
data = [trace]
figure = dict(data = data)
iplot(figure)




In [31]:
# Cumulative explained variance; go.Scatter replaces the deprecated go.Line.
trace = go.Scatter(y = exp_var.cumsum(),
                   x = np.arange(1, len(exp_var) + 1),
                   mode = 'lines')
data = [trace]
figure = dict(data = data)
iplot(figure)

In [32]:
comp_df = pd.DataFrame(principalComponents)
comp_df["target"] = Y

In [33]:
comp_df.drop(["target"], axis = 1).corr()

|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 1.640074e-16 | 8.011719e-17 | -2.002561e-16 | -6.271128e-17 | -5.847041e-17 | -6.046415e-17 | 1.575838e-16 | 2.06305e-16 | 9.506444e-17 | ... | 6.556079e-17 | -4.395661e-17 | 6.55448e-17 | 6.494735e-17 | -4.195406e-18 | -3.907532e-17 | 2.255805e-17 | 4.650014e-17 | -1.08588e-16 | -6.904474e-17 |
| 1 | 1.640074e-16 | 1.0 | 6.431281e-17 | 1.499364e-17 | -1.489621e-16 | 3.557953e-17 | 3.94151e-17 | 8.766718e-17 | 2.491617e-16 | 6.553337e-17 | ... | 4.995813e-17 | 7.739455e-17 | -3.371458e-17 | 6.813447e-17 | 4.847846e-17 | 9.089546e-17 | -3.076812e-17 | -5.41577e-17 | -3.325113e-17 | -6.725117e-17 |
| 2 | 8.011719e-17 | 6.431281e-17 | 1.0 | -1.321443e-16 | -1.018378e-16 | 3.935107e-17 | -1.199512e-16 | -4.276419e-17 | -2.496862e-16 | -7.203335e-17 | ... | 3.021201e-18 | 7.613315e-17 | 2.328164e-17 | 1.057503e-16 | -1.791572e-16 | -5.784022e-17 | -4.967303e-17 | 2.040898e-17 | 1.072529e-16 | -1.76348e-17 |
| 3 | -2.002561e-16 | 1.499364e-17 | -1.321443e-16 | 1.0 | 5.065613e-16 | 3.018134e-16 | -1.316034e-16 | -4.819717e-17 | -4.230076e-16 | 5.100805e-17 | ... | -9.849993e-17 | 7.700087e-17 | 1.110804e-17 | 2.269964e-17 | 9.004723e-17 | 1.533138e-17 | 6.759445e-17 | -2.999481e-17 | 2.786596e-17 | 1.322175e-16 |
| 4 | -6.271128e-17 | -1.489621e-16 | -1.018378e-16 | 5.065613e-16 | 1.0 | -3.191149e-16 | 1.949131e-16 | 2.113049e-17 | -3.246621e-16 | 1.441299e-16 | ... | -1.437276e-16 | 3.698073e-17 | 4.683532e-17 | -1.212007e-16 | 3.591841e-17 | -2.39455e-17 | 3.046921e-17 | -2.057705e-17 | 5.214754e-17 | 9.288838e-17 |
| 5 | -5.847041e-17 | 3.557953e-17 | 3.935107e-17 | 3.018134e-16 | -3.191149e-16 | 1.0 | -1.676947e-16 | 1.769635e-16 | 1.485117e-17 | -7.057919e-17 | ... | 9.385057e-17 | -9.326359e-17 | 1.536548e-16 | -3.138665e-17 | -1.62341e-17 | 2.101116e-16 | 1.063492e-16 | -1.51722e-17 | -5.84035e-17 | -9.621908e-18 |
| 6 | -6.046415e-17 | 3.94151e-17 | -1.199512e-16 | -1.316034e-16 | 1.949131e-16 | -1.676947e-16 | 1.0 | -9.355335e-17 | -1.401158e-16 | 1.383345e-17 | ... | 6.857739e-17 | -3.368649e-17 | -5.992777e-17 | -1.988058e-17 | -4.305853e-17 | 9.452872e-17 | -7.146216e-19 | -5.360554e-17 | -4.121783e-17 | -9.263792e-17 |
| 7 | 1.575838e-16 | 8.766718e-17 | -4.276419e-17 | -4.819717e-17 | 2.113049e-17 | 1.769635e-16 | -9.355335e-17 | 1.0 | 1.085554e-16 | -9.604731e-18 | ... | -8.162417e-19 | -4.947933e-17 | -1.611129e-16 | -1.85096e-16 | 3.180079e-17 | -1.562677e-17 | -1.956329e-17 | 3.533587e-17 | -5.809675e-17 | -3.828549e-18 |
| 8 | 2.06305e-16 | 2.491617e-16 | -2.496862e-16 | -4.230076e-16 | -3.246621e-16 | 1.485117e-17 | -1.401158e-16 | 1.085554e-16 | 1.0 | 1.098568e-16 | ... | -1.309127e-18 | 2.006735e-17 | -2.321306e-17 | 3.113434e-17 | 2.43064e-17 | -2.172121e-16 | -1.78255e-16 | 2.052853e-16 | 1.363164e-16 | -4.502964e-17 |
| 9 | 9.506444e-17 | 6.553337e-17 | -7.203335e-17 | 5.100805e-17 | 1.441299e-16 | -7.057919e-17 | 1.383345e-17 | -9.604731e-18 | 1.098568e-16 | 1.0 | ... | -1.567113e-17 | -1.811905e-17 | 6.498113e-17 | 3.877357e-17 | 1.293007e-19 | 1.616808e-17 | 1.048923e-16 | 2.310277e-16 | 7.558337e-17 | -2.752735e-16 |


In [34]:
data = []
for i in [0, 1]:
    trace = go.Scatter3d(x = comp_df[comp_df["target"] == i][0],
                         y = comp_df[comp_df["target"] == i][1],
                         z = comp_df[comp_df["target"] == i][2],
                         mode = 'markers',
                         name = "class_{}".format(i),
                         marker = dict(size = 3))
    data.append(trace)

figure = dict(data=data)
iplot(figure)

In [35]:
data = []
for i in [0, 1]:
    trace = go.Scatter(x = comp_df[comp_df["target"] == i][0],
                       y = comp_df[comp_df["target"] == i][1],
                       mode = 'markers',
                       name = "class_{}".format(i),
                       marker = dict(size = 7))
    data.append(trace)

figure = dict(data=data)
iplot(figure)
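One way to quantify what a low-dimensional projection like the one above discards is the reconstruction error: project the data onto `k` components, map it back with `inverse_transform`, and compare with the original. The helper name `relative_reconstruction_error` and the values of `k` are assumptions made for this sketch. On centered data the relative squared error should equal one minus the explained variance that was kept.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_breast_cancer().data)

def relative_reconstruction_error(X, n_components):
    """Return (relative squared error, kept variance ratio) for a PCA with n_components."""
    pca = PCA(n_components=n_components).fit(X)
    # Project to the reduced space and map back to the original feature space.
    X_hat = pca.inverse_transform(pca.transform(X))
    error = np.sum((X - X_hat) ** 2) / np.sum((X - X.mean(axis=0)) ** 2)
    return error, pca.explained_variance_ratio_.sum()

results = {}
for k in (2, 10, 30):
    error, kept = relative_reconstruction_error(X, k)
    results[k] = (error, kept)
    print(f"k={k:2d}  kept variance={kept:.3f}  relative error={error:.3f}")
```

With all 30 components the reconstruction is exact up to floating-point error, since nothing has been discarded.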