# Dimensionality reduction

Dimensionality reduction involves decreasing the number of features within a dataset.
Dealing with an excessive number of variables, also known as features, is a common challenge in machine learning tasks such as regression or classification.
The greater the number of features, the more challenging it becomes to model them; this phenomenon is referred to as the curse of dimensionality.

Moreover, some features may be redundant, introducing unnecessary noise to the dataset.
Including these in the training data does not contribute meaningfully.
This is where the reduction of the feature space becomes crucial.

Dimensionality reduction maps data from a high-dimensional feature space to a lower-dimensional one.
At the same time, the transformation should retain the meaningful properties of the data.

## Curse of Dimensionality

In the world of machine learning and deep learning, algorithms need a lot of data to learn patterns and representations effectively.
However, when this data has a ton of features, it can lead to something called the curse of dimensionality.

For example, we can look at a dataset of pKa values and numerous corresponding molecular descriptors.

```python
import pandas as pd

CSV_PATH_ALL = "https://gitlab.com/oasci/courses/pitt/biosc1540-2024s/-/raw/main/biosc1540/files/csv/pka/pka_with_desc.csv"

df = pd.read_csv(CSV_PATH_ALL)
```

Let's see how many columns we have.

```python
print(df.shape)
```

```text
(1706, 212)
```


This means we have 1706 rows and 212 columns.
If you take a look at the columns (shown below), you will see "SMILES" and "pka_value", which are not features.

```python
print(df.columns)
```

```text
Index(['SMILES', 'pka_value', 'MaxAbsEStateIndex', 'MaxEStateIndex',
       'MinAbsEStateIndex', 'MinEStateIndex', 'qed', 'SPS', 'MolWt',
       'HeavyAtomMolWt',
       ...
       'fr_sulfide', 'fr_sulfonamd', 'fr_sulfone', 'fr_term_acetylene',
       'fr_tetrazole', 'fr_thiazole', 'fr_thiocyan', 'fr_thiophene',
       'fr_unbrch_alkane', 'fr_urea'],
      dtype='object', length=212)
```


We have a total of 210 features. That's a lot of features.
(This is a [flex tape reference](https://www.youtube.com/watch?v=JZLAHGfznlY).)
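
As a minimal sketch of how the two non-feature columns could be separated from the rest, the example below uses a tiny stand-in DataFrame instead of the real dataset. The values and the names `toy`, `X`, and `y` are made up for illustration.

```python
import pandas as pd

# Tiny stand-in for the real table; the values are only illustrative.
toy = pd.DataFrame(
    {
        "SMILES": ["CCO", "CC(=O)O", "c1ccccc1O"],
        "pka_value": [15.9, 4.76, 9.99],
        "MolWt": [46.07, 60.05, 94.11],
        "fr_Al_OH": [1, 0, 0],
    }
)

# "SMILES" identifies the molecule and "pka_value" is the quantity to
# predict, so neither belongs in the feature matrix.
non_feature_columns = ["SMILES", "pka_value"]
X = toy.drop(columns=non_feature_columns)
y = toy["pka_value"]

print(X.shape)  # two fewer columns than `toy`
```

Applied to `df`, the same `drop` call would leave 212 - 2 = 210 feature columns.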

## Sparsity

The curse of dimensionality says that the amount of data needed to estimate a function accurately grows exponentially with the number of features, or dimensions.
This becomes especially tricky with big data, which tends to be more sparse.
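
One rough way to see the exponential blow-up is to count grid cells. The sketch below assumes, purely for illustration, that each feature's range is split into 10 equal bins; the helper `grid_cells` is hypothetical and not part of any library.

```python
def grid_cells(n_features, bins_per_feature=10):
    """Number of grid cells when every feature is split into equal bins."""
    return bins_per_feature**n_features


n_samples = 1706  # number of rows in the pKa dataset

for n_features in (1, 2, 3, 10):
    cells = grid_cells(n_features)
    # Average number of samples that would land in each cell.
    print(n_features, cells, n_samples / cells)
```

With a fixed number of samples, the average number of samples per cell collapses quickly as features are added, which is why the data looks increasingly empty in high dimensions.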

Now, sparsity in data means that many features have a value of zero (not that the value is missing).
Having lots of sparse features increases space and computational complexity.
When data is sparse, it's hard to cluster observations or samples in the training dataset.

Let's see which columns of ours are sparse by computing the sparsity.

```python
# Calculate sparsity for each column
sparsity_per_column = (df == 0).mean()
```

The expression `(df == 0).mean()` in Python, when applied to a Pandas DataFrame, performs the following steps:

-   `df == 0`: This creates a boolean DataFrame where each element is `True` if the corresponding element in `df` is equal to `0`, and `False` otherwise.
-   `(df == 0).mean()`: After obtaining the boolean DataFrame, `.mean()` is applied.
    This calculates the mean (average) along each column.
    Since `True` is treated as `1` and `False` as `0` when calculating the mean, this effectively gives the fraction of elements in each column that are equal to `0`.

In simpler terms, `(df == 0).mean()` computes the fraction of zeros in each column of the DataFrame (a value between 0 and 1, so 0.90 means 90% zeros).
This can be useful for identifying columns where a large proportion of the values are zero, which is a characteristic of sparse columns.
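
To make the two steps concrete, here is a small worked example on a made-up DataFrame. The column names and values are invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "mostly_zero": [0, 0, 0, 5],
        "half_zero": [0, 1, 0, 2],
        "no_zero": [3, 1, 4, 1],
    }
)

is_zero = toy == 0         # boolean DataFrame, True where the value is 0
sparsity = is_zero.mean()  # column-wise mean of the True/False values

print(is_zero)
print(sparsity)
```

The resulting sparsity is 0.75 for `mostly_zero` (three of four values are zero), 0.5 for `half_zero`, and 0.0 for `no_zero`.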

```python
# Keep the columns whose fraction of zeros exceeds the threshold
threshold = 0.90
sparse_columns = sparsity_per_column[sparsity_per_column > threshold].index

print("Sparse Columns:")
print(sparse_columns.shape)
print(sparse_columns)
```

```text
Sparse Columns:
(82,)
Index(['NumRadicalElectrons', 'PEOE_VSA11', 'PEOE_VSA12', 'PEOE_VSA4',
       'PEOE_VSA5', 'SMR_VSA2', 'SMR_VSA8', 'SlogP_VSA7', 'SlogP_VSA9',
       'EState_VSA11', 'NumAliphaticCarbocycles', 'NumSaturatedCarbocycles',
       'NumSaturatedHeterocycles', 'NumSaturatedRings', 'fr_Al_COO',
       'fr_Al_OH', 'fr_Al_OH_noTert', 'fr_Ar_COO', 'fr_Ar_NH', 'fr_Ar_OH',
       'fr_C_S', 'fr_HOCCN', 'fr_Imine', 'fr_N_O', 'fr_Ndealkylation1',
       'fr_Ndealkylation2', 'fr_Nhpyrrole', 'fr_SH', 'fr_aldehyde',
       'fr_alkyl_carbamate', 'fr_alkyl_halide', 'fr_allylic_oxid', 'fr_amide',
       'fr_amidine', 'fr_azide', 'fr_azo', 'fr_barbitur', 'fr_benzodiazepine',
       'fr_diazo', 'fr_dihydropyridine', 'fr_epoxide', 'fr_ester', 'fr_furan',
       'fr_guanido', 'fr_hdrzine', 'fr_hdrzone', 'fr_imidazole', 'fr_imide',
       'fr_isocyan', 'fr_isothiocyan', 'fr_ketone', 'fr_ketone_Topliss',
       'fr_lactam', 'fr_lactone', 'fr_methoxy', 'fr_morpholine', 'fr_nitrile',
       '
```

Eighty-two columns in our DataFrame are more than 90% zeros and would probably contribute little to any model.
Note that this check only counts zeros. A column that is almost always equal to some other value, such as 1, would be just as uninformative, but detecting that requires a more involved analysis.
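
One possible way to generalize the check, sketched here on made-up data, is to measure how often each column's most common value occurs, whatever that value is. The helper `dominant_value_fraction` and the 0.75 cutoff are assumptions for illustration, not part of the original analysis.

```python
import pandas as pd


def dominant_value_fraction(frame):
    """Fraction of rows taken up by each column's most common value."""
    # value_counts sorts by frequency, so the first entry is the most common.
    return frame.apply(lambda col: col.value_counts(normalize=True).iloc[0])


toy = pd.DataFrame(
    {
        "mostly_zero": [0, 0, 0, 0, 2],
        "mostly_one": [1, 1, 1, 1, 0],
        "varied": [1, 2, 3, 4, 5],
    }
)

fractions = dominant_value_fraction(toy)
near_constant = fractions[fractions > 0.75].index.tolist()

print(near_constant)
```

Unlike the zero-only check, this flags `mostly_one` as well as `mostly_zero`, while leaving `varied` alone.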

## Reducing the number of dimensions

In high-dimensional data, key relationships can be easy to overlook.
Meaningful, non-redundant data, however, allows similar data points to come together and cluster in statistically significant regions.

Problems with high-dimensional data include:

- risk of overfitting the machine learning model;
- difficulty in clustering similar samples;
- increased space and computational time complexity.

On the flip side, non-sparse or dense data has non-zero features that contain meaningful and non-redundant information.

To combat the curse of dimensionality, techniques like dimensionality reduction come into play.
These methods are handy for transforming sparse features into dense ones, cleaning up the data, and extracting relevant features.

## Decomposition algorithms

Decomposing signals into components is a technique used to extract meaningful information from data.
It is a process of breaking down a signal into its constituent parts, which can be used to understand the underlying structure of the signal.
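
As a minimal sketch of what such a decomposition can look like, the example below applies NumPy's singular value decomposition to synthetic data that has five features but only two underlying factors. The data, the noise level, and all variable names are assumptions for illustration; this is one common way to compute a PCA-style projection, not the only one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 samples, 5 features, driven by 2 hidden factors.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 5))

# Center each feature, then decompose: X_centered = U @ diag(S) @ Vt
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Share of the total variance carried by each component.
explained = S**2 / np.sum(S**2)

# Keep the first two components as the reduced representation.
n_components = 2
X_reduced = X_centered @ Vt[:n_components].T

print(X_reduced.shape)
print(explained.round(3))
```

Because the data was built from two factors plus a little noise, the first two components should carry nearly all of the variance, so the five columns can be replaced by two with little loss.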

### Principal Component Analysis (PCA)



### Kernel PCA (KPCA)

### Singular Value Decomposition (SVD)

## Manifold learning algorithms

Manifold learning is a set of techniques used to simplify complex data by reducing its dimensionality.
The main goal is to make it easier to understand and visualize the information.
If you have data with four or more dimensions, it is hard to picture or make sense of it all at once.
Manifold learning helps by projecting this data onto simpler, lower-dimensional structures called manifolds.

The idea behind manifold learning is that, in many cases, the data might seem more complex than it actually is.
High-dimensional data can be challenging to visualize and understand.
Manifold learning steps in to simplify this by creating a lower-dimensional representation of the data.
It's like trying to capture the essential features of the data in a more manageable form.

Think of manifold learning as a way to improve upon traditional methods like PCA (Principal Component Analysis), which work well for linear structures but might struggle with more complex, non-linear patterns in the data.
Manifold learning is like extending these methods to better handle the intricacies and relationships in the data, making it more accessible and interpretable.
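
A small illustration of why linear methods can struggle: points on a circle have two coordinates, but a single number (the angle) locates each one. In the sketch below, the hand-written `arctan2` step stands in for the non-linear coordinate that a manifold learning algorithm would have to discover on its own; it is not itself a manifold learning method, and the data is synthetic.

```python
import numpy as np

# 100 evenly spaced points on a unit circle.
angles = np.linspace(0.0, 2.0 * np.pi, 100, endpoint=False)
X = np.column_stack([np.cos(angles), np.sin(angles)])

# Linear view (as in PCA): variance share of each principal direction.
X_centered = X - X.mean(axis=0)
singular_values = np.linalg.svd(X_centered, compute_uv=False)
explained = singular_values**2 / np.sum(singular_values**2)

# Non-linear view: one coordinate, the angle, describes every point.
recovered_angles = np.arctan2(X[:, 1], X[:, 0]) % (2.0 * np.pi)

print(explained)
```

Both linear directions carry about half of the variance, so dropping either one loses a lot of information, even though the data is intrinsically one-dimensional.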

### t-Distributed Stochastic Neighbor Embedding (t-SNE)

### Uniform Manifold Approximation and Projection (UMAP)


## Discriminant Analysis


### Linear Discriminant Analysis (LDA)
