# Dimensionality Reduction

The performance of machine learning algorithms can degrade with too many features. A model with too many features is likely to overfit the training dataset and therefore may not perform well on new data. Dimensionality reduction is a data preparation technique performed on data prior to modeling. There are many techniques to reduce the number of dimensions in the training dataset.

### Import Dependencies

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
import time
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.manifold import Isomap

In [2]:
df = pd.read_csv("train.csv")

In [3]:
df.head()

Unnamed: 0,battery_power,blue,clock_speed,dual_sim,fc,four_g,int_memory,m_dep,mobile_wt,n_cores,...,px_height,px_width,ram,sc_h,sc_w,talk_time,three_g,touch_screen,wifi,price_range
0,842,0,2.2,0,1,0,7,0.6,188,2,...,20,756,2549,9,7,19,0,0,1,1
1,1021,1,0.5,1,0,1,53,0.7,136,3,...,905,1988,2631,17,3,7,1,1,0,2
2,563,1,0.5,1,2,1,41,0.9,145,5,...,1263,1716,2603,11,2,9,1,1,0,2
3,615,1,2.5,0,0,0,10,0.8,131,6,...,1216,1786,2769,16,8,11,1,0,0,2
4,1821,1,1.2,0,13,1,44,0.6,141,2,...,1208,1212,1411,8,2,15,1,1,0,1


It seems that this dataset is about cellphone features and prices, some of this features are related to each other, and some of them are not realy decisive features, so for sake of ease, we need to reduce data dimensions by elimination or combanation.

## PCA

PCA is a linear dimensionality reduction technique (algorithm) that transforms a set of correlated variables (p) into a smaller k (k<p) number of uncorrelated variables called principal components while retaining as much of the variation in the original dataset as possible. PCA detects corelated features by computing covariance matrix:

* if covariance of two feature is positive, these two are highly correlated means that if one goes high, the other goes high
* if covariance of two feature is negative, these two are highly uncorrelated means that if one goes high, the other goes down

<p align="center">
<img src="PCA.png" width="800" height="600"/>
</p>

For addressing the data points in the figure above, we need to know 2 features. But as we see, there is a relation between these two features because most of the data times that one feature increases, the other feature increase too. So we can combine these two dimensions together and make one dimension like this line by projecting data points on it.
<p align="center">
<img src="PCA2.png" width="800" height="600" alt="PCA">
</p>

### Preprocessing

In [4]:
y = df.pop('price_range')
x_train, x_test, y_train, y_test = train_test_split(df,y, test_size = 0.2)

0       1
1       2
2       2
3       2
4       1
       ..
1995    0
1996    2
1997    3
1998    0
1999    3
Name: price_range, Length: 2000, dtype: int64

### normalization

In [6]:
train_scaler = StandardScaler().fit(x_train)
test_scaler = StandardScaler().fit(x_test)

In [7]:
x_train_scaled = train_scaler.transform(x_train)
x_test_scaled = test_scaler.transform(x_test)

### Logistic Regression without PCA

In [8]:
classifier1 = LogisticRegression().fit(x_train_scaled, y_train)
print("Training score is:" + str(classifier1.score(x_train_scaled, y_train)))
print("Test score is:" + str(classifier1.score(x_test_scaled, y_test)))

Training score is:0.9775
Test score is:0.9625


### Logistic Regression with PCA

In [9]:
table = np.zeros([20,4])
for i in range(1,21):

    pca = PCA(n_components= i)
    pca.fit(x_train_scaled)
    x_train_scaled_tr = pca.transform(x_train_scaled)
    x_test_scaled_tr = pca.transform(x_test_scaled)

    t0 = time.time()
    classifier2 = LogisticRegression().fit(x_train_scaled_tr, y_train)
    t1 = time.time()

    table[i-1, 0] = t1 - t0
    table[i-1, 1] = classifier2.score(x_train_scaled_tr, y_train)
    table[i-1, 2] = classifier2.score(x_test_scaled_tr, y_test)
    table[i-1, 3] = table[i-1, 1] - table[i-1, 2]


pd.DataFrame(table, columns=["Training Time", "Training score", "Test Score", "Degree of overfitting"])



Unnamed: 0,Training Time,Training score,Test Score,Degree of overfitting
0,0.011003,0.26875,0.2475,0.02125
1,0.012,0.284375,0.25,0.034375
2,0.013999,0.336875,0.2725,0.064375
3,0.011971,0.335625,0.29,0.045625
4,0.012,0.3975,0.3875,0.01
5,0.012004,0.385625,0.38,0.005625
6,0.019005,0.4275,0.41,0.0175
7,0.018001,0.4275,0.4325,-0.005
8,0.017999,0.43,0.4525,-0.0225
9,0.013995,0.4675,0.47,-0.0025


## LDA

LDA best separates or discriminates training instances by their classes. LDA is typically used for multi-class classification. It can also be used as a dimensionality reduction technique. The major difference between LDA and PCA is that LDA finds a linear combination of input features that optimizes class separability while PCA attempts to find a set of uncorrelated components of maximum variance in a dataset. Another key difference between the two is that PCA is an unsupervised algorithm whereas LDA is a supervised algorithm where it takes class labels into account.
There are some limitations of LDA. To apply LDA, the data should be normally distributed. The maximum number of components that LDA can find is the number of classes minus 1. Unlike the PCA, it is not needed to perform feature scaling to apply LDA.

<p align="center">
<img src="LDA1.png" width="400" height="300">
</p>

<p align="center">
<img src="LDA2.png" width="400" height="300">
<img src="LDA3.png" width="400" height="200">
</p>

<p align="center">
<img src="LDA4.png" width="400" height="300">
<img src="LDA5.png" width="400" height="300">
<img src="LDA.png" width="400" height="200">
</p>

### Logistic Regression with LDA

In [10]:
table2 = np.zeros([3,4])
for i in range(1,4):

    lda = LinearDiscriminantAnalysis(n_components=i)
    lda.fit(x_train, y_train)
    x_train_tr = lda.transform(x_train)
    x_test_tr = lda.transform(x_test)

    t0 = time.time()
    classifier3 = LogisticRegression().fit(x_train_tr, y_train)
    t1 = time.time()

    table2[i-1, 0] = t1 - t0
    table2[i-1, 1] = classifier3.score(x_train_tr, y_train)
    table2[i-1, 2] = classifier3.score(x_test_tr, y_test)
    table2[i-1, 3] = table2[i-1, 1] - table2[i-1, 2]


pd.DataFrame(table2, columns=["Training Time", "Training score", "Test Score", "Degree of overfitting"])


Unnamed: 0,Training Time,Training score,Test Score,Degree of overfitting
0,0.040998,0.960625,0.9525,0.008125
1,0.051999,0.968125,0.96,0.008125
2,0.047999,0.970625,0.96,0.010625


## Isomap

It is a manifold learning algorithm which tries to preserve the geodesic distance between samples while reducing the dimension. OK, it has so many new concepts that it is not clear, what is the manifold, elucidean and geodesic distance.
* the manifold: Suppose there is a small ant walking along a shape in three dimensions. This shape could be curvy, twisty, or even have holes in it. Now here’s the rule: From the point of view of the ant, if everywhere it walks look like a flat plane, it is manifold then.
<p align="center">
<img src='Isomap3.jpeg' height='300' width='500'>
</p>

* Euclidean distance represents the shortest distance between two points.
<p align="center">
<img src='Isomap4.png' height='300' width='500'>
</p>

* Geodesic distance is a simple measure of the distance between two vertices in a graph is the shortest path between the vertices.
<p align="center">
<img src='Isomap5.png' height='300' width='500'>
</p>


Let's suppose we have data like a manifold like this below image
<p align="center">
<img src="Isomap2.png" alt="Isomap" height="600" width="400">
</p>

For detecting 2D curve of this manifold we need to follow these steps:
* Determines which points are neighbors on Manifold based on distance (Euclidean distance) For each point, we connect all points within a fixed radius (where we have to choose radius) or like KNN (K nearest neighboring algorithm) we have to choose K number of neighbors.
<p align="center">
<img src="Isomap6.png" alt="Isomap" height="300" width="400">
</p>

* Find the shortest path for each node to it's nearest neighbor by Euclidean distance and then connect all of points on the curve to create a path.
<p align="center">
<img src="Isomap7.png" alt="Isomap" height="300" width="400">
</p>

* After detecting the surface of this manifold, the 3D surface convert to the 2D surface and that makes reduce the dimension.

## Logistic Regression with Isomap

In [16]:
table3 = np.zeros([20,4])
for i in range(11,31):

    ism = Isomap(n_neighbors=i-6)
    ism.fit(x_train, y_train)
    x_train_tr = ism.transform(x_train)
    x_test_tr = ism.transform(x_test)

    t0 = time.time()
    classifier4 = LogisticRegression().fit(x_train_tr, y_train)
    t1 = time.time()

    table3[i-11, 0] = t1 - t0
    table3[i-11, 1] = classifier4.score(x_train_tr, y_train)
    table3[i-11, 2] = classifier4.score(x_test_tr, y_test)
    table3[i-11, 3] = table3[i-11, 1] - table3[i-11, 2]


pd.DataFrame(table3, columns=["Training Time", "Training score", "Test Score", "Degree of overfitting"])

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Unnamed: 0,Training Time,Training score,Test Score,Degree of overfitting
0,0.072001,0.773125,0.795,-0.021875
1,0.074784,0.798125,0.815,-0.016875
2,0.053997,0.79375,0.8025,-0.00875
3,0.070999,0.786875,0.81,-0.023125
4,0.072998,0.7825,0.8125,-0.03
5,0.050998,0.788125,0.81,-0.021875
6,0.072998,0.7875,0.8075,-0.02
7,0.056989,0.780625,0.8075,-0.026875
8,0.073998,0.78125,0.805,-0.02375
9,0.056001,0.780625,0.8075,-0.026875
