
# Dimensionality Reduction

**Background:**
+ When the number of features is high, the number of observations needed grows disproportionately.
+ This is due to the "curse of dimensionality".
+ It particularly affects algorithms that rely on a similarity (distance) measure, such as KNN. Models such as linear regression are also affected, because they overfit easily when there are many features relative to observations.
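
To make the effect on distance-based methods concrete, here is a minimal sketch on synthetic uniform data (not the bank dataset). The helper `relative_contrast` is a hypothetical name introduced for illustration: it compares the farthest and nearest neighbour of a random query point. As the number of dimensions grows, the two distances become nearly equal, so "nearest" carries less and less meaning.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(n_points, n_dims):
    """Return (farthest - nearest) / nearest distance from a random query point."""
    points = rng.random((n_points, n_dims))   # uniform sample in the unit hypercube
    query = rng.random(n_dims)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

# Same number of observations, increasing number of features.
contrast = {d: relative_contrast(500, d) for d in (2, 10, 100, 1000)}
for d, c in contrast.items():
    print(f"{d:>5} dimensions: relative contrast = {c:.2f}")
```

With a fixed 500 observations, the contrast shrinks as dimensions are added, which is one way to see why more observations are needed as the feature count grows.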

**Dimensionality Reduction:**
+ In practice, we often cannot collect enough observations, so our learning algorithms do not have enough data to operate correctly.
+ Not all features are created equal: some carry far more information than others.
+ The goal of feature extraction for dimensionality reduction is to transform the original set of p(original) features into a new set of p(new) features, where p(original) > p(new).
+ We do this while still keeping much of the underlying information.
+ In other words, we reduce the number of features with only a small loss in our data's ability to generate high-quality predictions.

**Disadvantages:**
+ One downside of feature extraction techniques is that the new features we generate will not be interpretable by humans.
+ They will retain as much, or nearly as much, ability to train our models, but will appear to the human eye as a collection of random numbers.
+ If you want to maintain the ability to interpret models, dimensionality reduction through feature selection is a better option.

# PCA

+ Principal Component Analysis is a popular dimensionality reduction technique.
+ PCA projects observations onto the principal components of the feature matrix that retain the most variance.
+ PCA is an unsupervised technique, meaning that it does not use the information from the target vector and instead only considers the feature matrix.
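
The projection described above can be sketched directly with NumPy. This is an illustrative example on a synthetic matrix (`X_toy` is a made-up name, not part of the notebook), using the singular value decomposition of the centred data. It is one standard way to compute principal components, not necessarily the exact routine scikit-learn runs internally.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy feature matrix: 200 observations, 3 features, two of them strongly related.
x1 = rng.normal(size=200)
X_toy = np.column_stack([
    x1,
    2 * x1 + rng.normal(scale=0.1, size=200),
    rng.normal(size=200),
])

# 1. Centre each feature. No target vector is involved at any point.
X_centered = X_toy - X_toy.mean(axis=0)

# 2. The rows of Vt are the principal components, ordered by variance captured.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# 3. Share of the total variance captured by each component.
explained_ratio = S**2 / np.sum(S**2)

# 4. Project the observations onto the first two components.
X_projected = X_centered @ Vt[:2].T

print(explained_ratio.round(3))
print(X_projected.shape)
```

Because two of the three toy features are nearly redundant, the last component captures almost no variance and can be dropped at little cost.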

**PCA - sklearn:**
+ PCA is implemented in scikit-learn as the PCA class in sklearn.decomposition. Its n_components parameter behaves in two ways, depending on the argument provided.
+ If the argument is an integer of 1 or more, PCA returns that many features (components). This leads to the question of how to select the optimal number of features.
+ If the argument is a float between 0 and 1, PCA returns the minimum number of components needed to retain that fraction of the variance. It is common to use values of 0.95 and 0.99, meaning that 95% and 99% of the variance of the original features is retained, respectively.

+ *Whiten:*
> whiten=True transforms the values of each principal component so that they have zero mean and unit variance. 
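
The two modes of n_components and the effect of whiten can be checked on a small synthetic matrix. The data and the variable names below (`X_demo`, `pca_int`, `pca_frac`) are assumptions made for this sketch and are not part of the bank example.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data: 10 features driven by 3 latent factors plus a little noise.
latent = rng.normal(size=(300, 3))
X_demo = latent @ rng.normal(size=(3, 10)) + 0.05 * rng.normal(size=(300, 10))

# Integer argument: keep exactly that many components.
pca_int = PCA(n_components=2)
reduced_int = pca_int.fit_transform(X_demo)

# Float argument: keep the fewest components that retain 95% of the variance.
pca_frac = PCA(n_components=0.95, whiten=True)
reduced_frac = pca_frac.fit_transform(X_demo)

print(reduced_int.shape)
print(pca_frac.n_components_, pca_frac.explained_variance_ratio_.sum())

# With whiten=True each retained component has zero mean and unit variance.
print(reduced_frac.mean(axis=0).round(6), reduced_frac.std(axis=0, ddof=1).round(6))
```

After fitting, the n_components_ attribute reports how many components were actually kept, which is a convenient way to answer the "how many features" question when a variance fraction is supplied.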

In [1]:
import pandas as pd

In [3]:
bank_df = pd.read_csv('bank.csv')
bank_df.head()

Unnamed: 0,age,job,marital,education,default,balance,housing-loan,personal-loan,current-campaign,previous-campaign,subscribed
0,30,unemployed,married,primary,no,1787,no,no,1,0,no
1,33,services,married,secondary,no,4789,yes,yes,1,4,no
2,35,management,single,tertiary,no,1350,yes,no,1,1,no
3,30,management,married,tertiary,no,1476,yes,yes,4,0,no
4,59,blue-collar,married,secondary,no,0,yes,no,1,0,no


In [4]:
from sklearn.utils import resample

In [5]:
bank_subscribed_no = bank_df[bank_df['subscribed']=='no']
bank_subscribed_yes = bank_df[bank_df['subscribed']=='yes']

In [6]:
df_minority_upsampled = resample(bank_subscribed_yes, replace=True, n_samples=2000, random_state=42)

In [7]:
new_bank_df = pd.concat([bank_subscribed_no, df_minority_upsampled])

In [8]:
new_bank_df['subscribed'].value_counts()

no     4000
yes    2000
Name: subscribed, dtype: int64

In [9]:
X_features = list(new_bank_df.columns)
X_features.remove('subscribed')
X_features

['age',
 'job',
 'marital',
 'education',
 'default',
 'balance',
 'housing-loan',
 'personal-loan',
 'current-campaign',
 'previous-campaign']

In [10]:
encoded_bank_df = pd.get_dummies(new_bank_df[X_features], drop_first=True)
X=encoded_bank_df

In [11]:
y = new_bank_df.subscribed.map(lambda x:int(x=='yes'))

In [12]:
from sklearn.decomposition import PCA

In [21]:
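# Keep a single principal component; whiten=True rescales it to unit variance.
# Note: X is not standardized here, so large-scale features such as balance dominate the component.
pca = PCA(n_components=1, whiten=True)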

In [22]:
features_pca = pca.fit_transform(X)

In [23]:
print('Original number of features: ', X.shape[1])
print('Reduced number of features: ', features_pca.shape[1])

Original number of features:  23
Reduced number of features:  1


In [24]:
from sklearn.model_selection import train_test_split

In [29]:
train_X, test_X, train_y, test_y = train_test_split(features_pca, y, test_size=0.3, random_state=42)

In [30]:
from sklearn.linear_model import LogisticRegression

In [31]:
model = LogisticRegression(max_iter=100000)
model.fit(train_X, train_y)

LogisticRegression(max_iter=100000)

In [32]:
pred_y = model.predict(test_X)

In [33]:
from sklearn import metrics

In [34]:
print(metrics.confusion_matrix(test_y, pred_y))

[[1224    1]
 [ 575    0]]


In [36]:
print(metrics.classification_report(test_y, pred_y))

              precision    recall  f1-score   support

           0       0.68      1.00      0.81      1225
           1       0.00      0.00      0.00       575

    accuracy                           0.68      1800
   macro avg       0.34      0.50      0.40      1800
weighted avg       0.46      0.68      0.55      1800



In [37]:
metrics.roc_auc_score(test_y, model.predict_proba(test_X)[:,1])

0.5892912156166815
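
The run above applies PCA to unscaled features and fits it on the full data before splitting. One possible variation, sketched here on synthetic data from make_classification rather than on bank.csv, is to standardize first and to wrap every step in a pipeline so that the scaler and PCA are fitted on the training split only. The names `X_syn`, `y_syn` and `pipe` and the choice of 0.95 are assumptions for illustration, and no claim is made about how the bank results would change.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the bank dataset).
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=20, n_informative=5, random_state=42
)
X_syn[:, 0] *= 1000  # hypothetical feature on a much larger scale, like a balance

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=42
)

# Scaling and PCA are learned from the training split only, then applied to the test split.
pipe = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95),
    LogisticRegression(max_iter=1000),
)
pipe.fit(X_tr, y_tr)

auc = roc_auc_score(y_te, pipe.predict_proba(X_te)[:, 1])
print(pipe.named_steps["pca"].n_components_, round(auc, 3))
```

Standardizing matters because PCA ranks directions by variance, so a feature measured in thousands would otherwise swamp binary dummy columns.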

# Feature Selection

+ Feature extraction reduces the dimensionality of our feature matrix by creating new features with a similar ability to train quality models but with significantly fewer dimensions.
+ Feature selection instead keeps the high-quality, informative features and drops the less useful ones.

**Using correlation:**
+ You have a feature matrix and suspect some features are highly correlated.
+ Use a correlation matrix to check for highly correlated features.
+ If highly correlated features exist, consider dropping one of the correlated features.

In [38]:
# Reuse the bank data prepared above: repeat every step up to and including the definition of y

In [42]:
import numpy as np

In [39]:
corr_matrix = X.corr().abs()
corr_matrix

Unnamed: 0,age,balance,current-campaign,previous-campaign,job_blue-collar,job_entrepreneur,job_housemaid,job_management,job_retired,job_self-employed,...,job_unemployed,job_unknown,marital_married,marital_single,education_secondary,education_tertiary,education_unknown,default_yes,housing-loan_yes,personal-loan_yes
age,1.0,0.116661,0.017923,0.001506,0.066583,0.00431,0.082258,0.06789,0.525063,0.006086,...,0.004726,0.078146,0.273028,0.423815,0.098799,0.113228,0.094899,0.008845,0.194625,0.02684
balance,0.116661,1.0,0.012778,0.030651,0.066231,0.000353,0.05834,0.040634,0.097046,0.005872,...,0.017251,0.001199,0.026961,0.008208,0.082181,0.073698,0.029869,0.072134,0.045138,0.072443
current-campaign,0.017923,0.012778,1.0,0.088047,0.018551,0.015274,0.012967,0.036582,0.046458,0.042065,...,0.015985,0.016831,0.017385,0.0104,0.022921,0.02374,0.011159,0.014455,0.012254,0.041865
previous-campaign,0.001506,0.030651,0.088047,1.0,0.030009,0.027656,0.031959,0.004984,0.011355,0.020565,...,0.019691,0.003907,0.0149,0.041845,0.03031,0.014123,0.018997,0.036346,0.002664,0.045438
job_blue-collar,0.066583,0.066231,0.018551,0.030009,1.0,0.092666,0.077608,0.257391,0.129608,0.09838,...,0.081728,0.050548,0.092239,0.073857,0.062857,0.306804,0.006547,0.001397,0.202778,0.032115
job_entrepreneur,0.00431,0.000353,0.015274,0.027656,0.092666,1.0,0.030826,0.102235,0.05148,0.039076,...,0.032462,0.020077,0.058772,0.057149,0.04503,0.033271,0.024552,0.049362,0.023786,0.043399
job_housemaid,0.082258,0.05834,0.012967,0.031959,0.077608,0.030826,1.0,0.085622,0.043114,0.032726,...,0.027187,0.016815,0.028146,0.050178,0.074172,0.050027,0.006021,0.023249,0.083101,0.010175
job_management,0.06789,0.040634,0.036582,0.004984,0.257391,0.102235,0.085622,1.0,0.142992,0.108539,...,0.090167,0.055768,0.024646,0.030925,0.409287,0.580526,0.038183,0.020744,0.035294,0.035393
job_retired,0.525063,0.097046,0.046458,0.011355,0.129608,0.05148,0.043114,0.142992,1.0,0.054654,...,0.045403,0.028082,0.056648,0.138788,0.028552,0.074505,0.012425,0.017228,0.179368,0.025532
job_self-employed,0.006086,0.005872,0.042065,0.020565,0.09838,0.039076,0.032726,0.108539,0.054654,1.0,...,0.034464,0.021316,0.011318,0.00053,0.071686,0.110664,0.001386,0.002326,0.041987,0.000165


In [45]:
# Keep only the upper triangle (above the diagonal) so each feature pair appears once.
# The built-in bool replaces np.bool, which was deprecated in NumPy 1.20.
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
upper


Unnamed: 0,age,balance,current-campaign,previous-campaign,job_blue-collar,job_entrepreneur,job_housemaid,job_management,job_retired,job_self-employed,...,job_unemployed,job_unknown,marital_married,marital_single,education_secondary,education_tertiary,education_unknown,default_yes,housing-loan_yes,personal-loan_yes
age,,0.116661,0.017923,0.001506,0.066583,0.00431,0.082258,0.06789,0.525063,0.006086,...,0.004726,0.078146,0.273028,0.423815,0.098799,0.113228,0.094899,0.008845,0.194625,0.02684
balance,,,0.012778,0.030651,0.066231,0.000353,0.05834,0.040634,0.097046,0.005872,...,0.017251,0.001199,0.026961,0.008208,0.082181,0.073698,0.029869,0.072134,0.045138,0.072443
current-campaign,,,,0.088047,0.018551,0.015274,0.012967,0.036582,0.046458,0.042065,...,0.015985,0.016831,0.017385,0.0104,0.022921,0.02374,0.011159,0.014455,0.012254,0.041865
previous-campaign,,,,,0.030009,0.027656,0.031959,0.004984,0.011355,0.020565,...,0.019691,0.003907,0.0149,0.041845,0.03031,0.014123,0.018997,0.036346,0.002664,0.045438
job_blue-collar,,,,,,0.092666,0.077608,0.257391,0.129608,0.09838,...,0.081728,0.050548,0.092239,0.073857,0.062857,0.306804,0.006547,0.001397,0.202778,0.032115
job_entrepreneur,,,,,,,0.030826,0.102235,0.05148,0.039076,...,0.032462,0.020077,0.058772,0.057149,0.04503,0.033271,0.024552,0.049362,0.023786,0.043399
job_housemaid,,,,,,,,0.085622,0.043114,0.032726,...,0.027187,0.016815,0.028146,0.050178,0.074172,0.050027,0.006021,0.023249,0.083101,0.010175
job_management,,,,,,,,,0.142992,0.108539,...,0.090167,0.055768,0.024646,0.030925,0.409287,0.580526,0.038183,0.020744,0.035294,0.035393
job_retired,,,,,,,,,,0.054654,...,0.045403,0.028082,0.056648,0.138788,0.028552,0.074505,0.012425,0.017228,0.179368,0.025532
job_self-employed,,,,,,,,,,,...,0.034464,0.021316,0.011318,0.00053,0.071686,0.110664,0.001386,0.002326,0.041987,0.000165


In [48]:
# Flag every column whose absolute correlation with an earlier column exceeds 0.5
to_drop = [column for column in upper.columns if any(upper[column]>0.5)]
to_drop

['job_retired', 'marital_single', 'education_tertiary']
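
The three steps above (absolute correlation, upper triangle, threshold) can be wrapped in a small helper. The function name `correlated_columns` and the toy DataFrame are hypothetical and exist only to show the logic on data where the answer is easy to check by eye.

```python
import numpy as np
import pandas as pd

def correlated_columns(frame, threshold=0.5):
    """List columns whose absolute correlation with an earlier column exceeds threshold."""
    corr = frame.corr().abs()
    # Boolean mask that is True strictly above the diagonal.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    return [col for col in upper.columns if (upper[col] > threshold).any()]

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "a_copy": a + 0.01 * rng.normal(size=200),  # near-duplicate of "a"
    "b": rng.normal(size=200),                  # independent of "a"
})

print(correlated_columns(toy))
```

Because only the upper triangle is inspected, the later column of each correlated pair is flagged and the earlier one is kept, which matches the behaviour of the notebook cells above.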

In [49]:
X.drop(to_drop, axis=1, inplace=True)
X

Unnamed: 0,age,balance,current-campaign,previous-campaign,job_blue-collar,job_entrepreneur,job_housemaid,job_management,job_self-employed,job_services,job_student,job_technician,job_unemployed,job_unknown,marital_married,education_secondary,education_unknown,default_yes,housing-loan_yes,personal-loan_yes
0,30,1787,1,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0
1,33,4789,1,4,0,0,0,0,0,1,0,0,0,0,1,1,0,0,1,1
2,35,1350,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0
3,30,1476,4,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,1,1
4,59,0,1,0,1,0,0,0,0,0,0,0,0,0,1,1,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
619,35,7050,3,4,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0
1177,28,4579,2,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
3498,58,462,1,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0
4366,59,0,1,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0
