<a id="top"></a>
<center><h1>Data Science Tutorial:</h1></center>
====
<center><h1>Introduction to Selecting Dimensionality Reduction Methods for Exploratory Data Analysis</h1></center>
-----

\begin{align}
 Author:&\quad\text{ Zhihan Xu}\\
 Email: &\quad\text{ zx243@cam.ac.uk}\\
 Supervisor: &\quad\text{ Laura Acqualagna}\\
 Company: &\quad \text{GlaxoSmithKline}\\
 Time: &\quad\text{ July 2017 - September 2017}\\
 \end{align}
 * * *

# Preface 

Scientists working with large volumes of high-dimensional data, such as drug discovery data, global climate patterns, stellar spectra, or human gene distributions, regularly confront the problem of dimensionality reduction: finding meaningful low-dimensional structures hidden in their high-dimensional observations. This tutorial gives a general overview of dimensionality reduction methods and advises on which methods are suitable for drug discovery data.

Drug discovery data typically comprise (i) omics data, (ii) physicochemical properties of compounds, (iii) in vitro/in vivo experimental results and (iv) parameters extracted from medical images.

A well-known dimensionality reduction method, principal component analysis (PCA), is widely used, but data scientists often overlook its limitations and assumptions. Imagine that the useful information hidden in a dataset is locked in a safe: no single method is currently a key that opens every safe. Before trying a key such as PCA, it is therefore more efficient to check what type of safe (dataset) you have; otherwise the results are likely to be disappointing or meaningless. For the same reason, check which step of the analysis procedure you are at and which types of covariates your dataset contains before reading this tutorial closely.

 


[Decision Tree for Data Analysis](https://www.lucidchart.com/invitations/accept/5f9dd676-f032-4916-940a-e0aff10a8ab1): this tutorial focuses only on the step where dimensionality reduction methods are applied, which is labelled by the black hexagon.

  Data type | Covariate examples | Suitability of using dimensionality reduction methods directly
  ------------- | ------------- | -------------
  **Continuous/Discrete and unit consistent** | coordinates/pixel grey scale | YES
  **Continuous/Discrete but unit inconsistent** | weight/height/salary | YES
  **Discrete** | age/number of objects | NO
  **Categorical** | marital status/level of education/country | NO
  **Binary** | sex/smoker or not | NO
  

 <img src="decision.jpg" alt="Drawing" style="width: 600px;"/>
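
For covariates that are continuous but measured in inconsistent units (the weight, height and salary row of the data-type table), PCA is sensitive to the scale of each covariate, so a common precaution is to standardise before fitting. The sketch below is only an illustration on assumed synthetic data: the three covariates, their means and their spreads are made up, and `StandardScaler` is one of several reasonable scaling choices rather than a step this tutorial prescribes.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200

# Hypothetical covariates in inconsistent units (values are made up).
weight = rng.normal(70, 10, n)        # kg
height = rng.normal(170, 8, n)        # cm
salary = rng.normal(30000, 5000, n)   # currency units per year
X = np.column_stack([weight, height, salary])

# PCA on the raw covariates: the largest-variance unit dominates.
raw_pca = PCA(n_components=2).fit(X)
raw_ratio = raw_pca.explained_variance_ratio_

# PCA after standardising each covariate to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)
scaled_pca = PCA(n_components=2).fit(X_scaled)
scaled_ratio = scaled_pca.explained_variance_ratio_

print("raw explained variance ratio:   ", np.round(raw_ratio, 4))
print("scaled explained variance ratio:", np.round(scaled_ratio, 4))
print("raw first component loadings:   ", np.round(raw_pca.components_[0], 4))
```

On the raw data the first component is essentially the salary axis, because salary has by far the largest variance in its own units; after standardisation the variance is spread more evenly across the components.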

## Summary Table

  Definition | Data Type* | Characteristic | Parameter Tuning | Linearity Related to Input Space | Data Retrievable
  ------------- | ------------- | ------------- | ------------- | ------------- | -------------
  [**Linear PCA**](definition/definition.ipynb#pca) | numerical | variance contribution | N/A | [Linear]() | exact
  [**Multidimensional Scaling**](definition/definition.ipynb#mds) | numerical | pairwise distance | distance function $d$ | [Linear]() | exact
  [**Locally Linear Embedding**](definition/definition.ipynb#LLE) | numerical | neighborhood selection | $K$ neighbors | [non-Linear]() | exact
  [**Isomap**](definition/definition.ipynb#isomap) | numerical | neighborhood selection | $K$ neighbors or radius $\epsilon$ | [non-Linear]() | approximate
  [**Kernel PCA**](definition/definition.ipynb#kernel pca) | numerical | variance contribution | kernel function | [non-Linear]() | approximate
  [**t-SNE**](definition/definition.ipynb#tsne) | numerical | pairwise distance | perplexity $perp$ | [non-Linear]() | approximate
  
  
  *The type of input covariates
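
As a rough illustration of the "Data Retrievable" column, the sketch below maps data to the reduced space and back for linear PCA and kernel PCA. It assumes scikit-learn and a synthetic concentric-circle dataset; the kernel width `gamma=10` and ridge penalty `alpha=0.1` are arbitrary illustrative values. Note that the PCA round trip is exact here only because all components are kept; with fewer components it returns the projection onto the retained subspace.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

# Synthetic two-dimensional concentric circles (illustrative data only).
X, _ = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Linear PCA keeping every component: the inverse map recovers the input.
pca = PCA(n_components=2)
X_pca_back = pca.inverse_transform(pca.fit_transform(X))
pca_error = np.max(np.abs(X - X_pca_back))

# Kernel PCA: scikit-learn learns an approximate pre-image map
# when fit_inverse_transform=True, so recovery is not exact.
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10,
                 fit_inverse_transform=True, alpha=0.1)
X_kpca_back = kpca.inverse_transform(kpca.fit_transform(X))
kpca_error = np.max(np.abs(X - X_kpca_back))

print(f"max reconstruction error, linear PCA: {pca_error:.2e}")
print(f"max reconstruction error, kernel PCA: {kpca_error:.2e}")
```

In scikit-learn, only some estimators expose an `inverse_transform` method, so this check is shown for PCA and kernel PCA alone.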

## Section 1: [Definition of Dimensionality Reduction Methods](definition/definition.ipynb)
The methods introduced below include classical techniques such as principal component analysis (PCA) and multidimensional scaling (MDS), and nonlinear dimensionality reduction methods such as kernel PCA, which are capable of discovering the nonlinear degrees of freedom that underlie complex natural observations. Another algorithm, Isomap, developed by [Tenenbaum, de Silva and Langford (2000)](http://science.sciencemag.org/content/290/5500/2319), computes a globally optimal solution for an important class of data manifolds.


1. [Linear Principal Component Analysis](definition/definition.ipynb#pca)
- [Multidimensional Scaling](definition/definition.ipynb#mds)
- [Locally Linear Embedding](definition/definition.ipynb#LLE)
- [Isomap](definition/definition.ipynb#isomap)
- [Kernel Principal Component Analysis](definition/definition.ipynb#kernel pca)
- [t-distributed Stochastic Neighbor Embedding](definition/definition.ipynb#tsne)

The methods covered above can be implemented with the built-in tools provided by [Scikit-Learn](http://scikit-learn.org/stable/unsupervised_learning.html) in Python, while other useful methods such as the self-organizing map (SOM) are not introduced here. For readers looking for an evaluation of the methods, a review paper is summarised below.
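
As a minimal sketch of how the six methods listed above map onto scikit-learn estimators, the block below fits each one to a small synthetic S-curve and collects the two-dimensional embeddings. The dataset size and every parameter value (`n_neighbors=10`, `gamma=0.5`, `perplexity=30`) are illustrative assumptions, not tuned recommendations.

```python
import numpy as np
from sklearn.datasets import make_s_curve
from sklearn.decomposition import PCA, KernelPCA
from sklearn.manifold import MDS, Isomap, LocallyLinearEmbedding, TSNE

# Small synthetic S-shaped manifold in three dimensions.
X, _ = make_s_curve(n_samples=150, random_state=0)

# One estimator per method; the tuning parameter of each is set explicitly.
methods = {
    "Linear PCA": PCA(n_components=2),
    "MDS": MDS(n_components=2, random_state=0),
    "LLE": LocallyLinearEmbedding(n_components=2, n_neighbors=10,
                                  random_state=0),
    "Isomap": Isomap(n_components=2, n_neighbors=10),
    "Kernel PCA": KernelPCA(n_components=2, kernel="rbf", gamma=0.5),
    "t-SNE": TSNE(n_components=2, perplexity=30, random_state=0),
}

# Every estimator shares the same fit_transform interface.
embeddings = {name: est.fit_transform(X) for name, est in methods.items()}

for name, Y in embeddings.items():
    print(f"{name:12s} shape={Y.shape} finite={np.isfinite(Y).all()}")
```

Because all six estimators share `fit_transform`, swapping one method for another only requires changing the entry in the dictionary.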

[Venna and Kaski (2006)]() compared nonlinear dimensionality reduction methods, evaluating them by the *trustworthiness* of the visualisation and the *discontinuity* of the mapping. They concluded that Isomap and LLE are designed to extract manifolds, while stochastic neighbour embedding (SNE) and the self-organizing map (SOM) are targeted more generally at dimensionality reduction. For data visualisation, SOM and curvilinear component analysis (CCA) were recommended; for preserving the original neighbourhoods, linear PCA and SNE were recommended.
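
Of the evaluation measures mentioned above, trustworthiness is available in scikit-learn as `sklearn.manifold.trustworthiness`. The sketch below computes it for two embeddings of a synthetic S-curve; the dataset, the choice of the two methods and the neighbourhood size `n_neighbors=5` are assumptions made for illustration and do not reproduce the experiments of the cited paper.

```python
from sklearn.datasets import make_s_curve
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap, trustworthiness

# Synthetic S-shaped manifold (illustrative data only).
X, _ = make_s_curve(n_samples=300, random_state=0)

estimators = {
    "Linear PCA": PCA(n_components=2),
    "Isomap": Isomap(n_components=2, n_neighbors=10),
}

# Trustworthiness lies in [0, 1]; it is reduced when points that are
# neighbours in the embedding were not neighbours in the input space.
scores = {}
for name, est in estimators.items():
    Y = est.fit_transform(X)
    scores[name] = trustworthiness(X, Y, n_neighbors=5)
    print(f"{name}: trustworthiness = {scores[name]:.3f}")
```

A score close to 1 suggests that the neighbourhoods seen in the two-dimensional plot can largely be trusted; the value depends on the chosen `n_neighbors`, so it is best compared across methods at the same setting.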

## Section 2: [Case Studies on Various Datasets](casestudy/casestudy.ipynb)
#### [function script](casestudy/functions.ipynb)
1. [Handwritten digits](casestudy/casestudy.ipynb#digits) 
- [Concentric circle](casestudy/casestudy.ipynb#concentric circle) 
- [Mice dataset](casestudy/casestudy.ipynb#mice) 
- [Eigenfaces](casestudy/casestudy.ipynb#eigenfaces) 
- [S-shaped Manifolds](casestudy/casestudy.ipynb#manifold)

## Section 3: [Demonstration of Methods](demonstration/demonstration.ipynb)
#### [function script](demonstration/functions.ipynb)
1. [t-SNE versus Linear PCA on the handwritten digits dataset](demonstration/demonstration.ipynb#tsne pca)
- [Gaussian Kernel PCA on the concentric circle and mice datasets](demonstration/demonstration.ipynb#kernel pca)
- [Factor Analysis versus randomised SVD Linear PCA on the eigenfaces dataset](demonstration/demonstration.ipynb#fa pca)
- [Difference between the geodesic and Euclidean distances used in the methods' algorithms](demonstration/demonstration.ipynb#distance)
- [Implementation of pairwise distance methods on the mice data](demonstration/demonstration.ipynb#mice)


### Bibliography


<!--bibtex

@article{venna2006local,
  title={Local multidimensional scaling},
  author={Venna, Jarkko and Kaski, Samuel},
  journal={Neural Networks},
  volume={19},
  number={6},
  pages={889--899},
  year={2006},
  publisher={Elsevier}
}


@article{roweis2000nonlinear,
  title={Nonlinear dimensionality reduction by locally linear embedding},
  author={Roweis, Sam T and Saul, Lawrence K},
  journal={Science},
  volume={290},
  number={5500},
  pages={2323--2326},
  year={2000},
  publisher={American Association for the Advancement of Science}
}

@article{maaten2008visualizing,
  title={Visualizing data using t-SNE},
  author={Maaten, Laurens van der and Hinton, Geoffrey},
  journal={Journal of Machine Learning Research},
  volume={9},
  number={Nov},
  pages={2579--2605},
  year={2008}
}

@inproceedings{scholkopf1997kernel,
  title={Kernel principal component analysis},
  author={Sch{\"o}lkopf, Bernhard and Smola, Alexander and M{\"u}ller, Klaus-Robert},
  booktitle={International Conference on Artificial Neural Networks},
  pages={583--588},
  year={1997},
  organization={Springer}
}

@article{higuera2015self,
  title={Self-organizing feature maps identify proteins critical to learning in a mouse model of down syndrome},
  author={Higuera, Clara and Gardiner, Katheleen J and Cios, Krzysztof J},
  journal={PLoS ONE},
  volume={10},
  number={6},
  pages={e0129126},
  year={2015},
  publisher={Public Library of Science}
}

@article{tenenbaum2000global,
  title={A global geometric framework for nonlinear dimensionality reduction},
  author={Tenenbaum, Joshua B and De Silva, Vin and Langford, John C},
  journal={Science},
  volume={290},
  number={5500},
  pages={2319--2323},
  year={2000},
  publisher={American Association for the Advancement of Science}
}


@article{scikit-learn,
 title={Scikit-learn: Machine Learning in {P}ython},
 author={Pedregosa, F. and Varoquaux, G. and Gramfort, A. and Michel, V.
         and Thirion, B. and Grisel, O. and Blondel, M. and Prettenhofer, P.
         and Weiss, R. and Dubourg, V. and Vanderplas, J. and Passos, A. and
         Cournapeau, D. and Brucher, M. and Perrot, M. and Duchesnay, E.},
 journal={Journal of Machine Learning Research},
 volume={12},
 pages={2825--2830},
 year={2011}
}

-->