
ENH: Gamma Scaling Experiments #71

@nray

Description


Context
In Kosta’s papers, single- and two-layer NNs are represented as random objects and their bias and variance are studied. In particular, a scaled normalization of each layer’s output is introduced before it is fed to the next layer, as in Equation 1 of the reference *Normalization effects on deep neural networks*.

We are interested in whether such scaling has any effect on the performance of RVFL-type networks. Do numerical studies give any guidance on whether gamma scaling affects the RVFL's generalization properties?

In the rest of this issue, I use the term gamma to mean gamma-scaled normalization.
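As a concrete anchor, here is a minimal sketch of what gamma-scaled normalization means for one hidden layer. I am assuming the factor is $N^{-\gamma}$ applied to the post-activation output, where $N$ is the layer width, following Equation 1 of the reference; the function name and exact placement of the factor are my own illustration:

```python
import numpy as np

def gamma_scaled_layer(X, W, b, gamma, activation=np.tanh):
    """Hidden-layer output scaled by N**(-gamma), where N is the layer
    width. gamma = 1/2 recovers the usual 1/sqrt(N) scaling; gamma = 1
    gives mean-field scaling. (Sketch only; placement of the factor is
    an assumption based on Eq. 1 of the cited reference.)"""
    N = W.shape[1]                # number of hidden units in this layer
    H = activation(X @ W + b)     # activation applied first
    return H / N**gamma           # then the gamma scaling

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))       # 5 samples, 3 features
W = rng.normal(size=(3, 100))     # random (fixed) RVFL hidden weights
b = rng.normal(size=100)
H = gamma_scaled_layer(X, W, b, gamma=0.5)
```

Note that for a fixed width, changing $\gamma$ only rescales the design matrix by a constant factor; the interesting effects show up as $N$ grows.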

General questions:

  1. Is there a universal gamma, or a range of gammas, that leads to consistent performance across multiple datasets?
    1. This is unlikely and conflicts with the no-free-lunch theorem.
    2. Kosta’s response seemed to agree that there is no “free lunch” here.
  2. Are there specific properties of the dataset that correlate with gamma, so that we can tell a priori whether such scaling would improve RVFL accuracy?
    1. Kosta provided an expression for the variance, $\sigma(NN) = C N^{2/\gamma - 1} e^{-At}$ as $N \to \infty$, where $t$ is the running time, $C(X, w, a)$ is a constant depending on the input $X$, the initialization $w$, and the activation $a$, and $A$ is a positive-definite matrix whose eigenvalues also depend on $X, w, a$. However, the functional forms of $C$ and $A$ in terms of $X, w, a$ are not clear.
    2. Also, how the variance $\sigma(NN)$ relates to the accuracy of the function approximation is something I do not fully understand.

Experiment assumptions
Activation function: non-polynomial and slowly increasing, e.g., $\tanh$, sigmoid
Gamma $\gamma$: $[1/2, 1]$
Initialization weight distribution: normal

Experiment 1: Study the effect of $\gamma$ on solution accuracy for RVFL with the direct solve. Is there one specific value that gives the best approximation?

Requirement: We need to scale the output of each layer before feeding it to the next layer, as described in the reference.

How to code this up in the GFDL library:

  1. Introduce an extra parameter to our GFDL base class, as in here.
  2. Scale the design matrix for each layer after the activation function has been applied, as in here.

Note: This design does not scale the input to the first layer or the direct links.
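The two steps above could look roughly like the following. The class and attribute names here are hypothetical, not the actual GFDL API; the point is only where the scaling factor is applied relative to the activation and the direct links:

```python
import numpy as np

class GammaScaledRVFL:
    """Illustrative sketch (hypothetical names, not the GFDL API):
    gamma is stored as an extra constructor parameter, and each layer's
    design matrix is scaled after the activation is applied."""

    def __init__(self, in_dim, widths, gamma=0.5, activation=np.tanh, seed=0):
        self.gamma = gamma
        self.activation = activation
        rng = np.random.default_rng(seed)
        self.layers = []
        dim = in_dim
        for n in widths:
            self.layers.append((rng.normal(size=(dim, n)), rng.normal(size=n)))
            dim = n

    def transform(self, X):
        blocks = [X]  # direct links: left unscaled, per the note above
        H = X
        for W, b in self.layers:
            H = self.activation(H @ W + b)     # step 2a: activation first...
            H = H / W.shape[1] ** self.gamma   # ...then scale the design matrix
            blocks.append(H)
        return np.hstack(blocks)

model = GammaScaledRVFL(in_dim=3, widths=[10, 20], gamma=0.5)
Z = model.transform(np.ones((4, 3)))   # stacked design matrix for the solve
```

The raw input block in `blocks[0]` is intentionally never scaled, matching the note that neither the first layer's input nor the direct links are touched.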

Experiment 2: Confirm experiment 1 with an iterative solve.

Kosta’s response to why an iterative solver is needed:

The direct solution is great, but it is also hard to analyze analytically. If the solvers converge, it makes sense to analyze the limit for large enough N. If they do not converge, it could mean different things: the algorithm may have hit a local minimum, or the failure may be an artifact of the numerical algorithm. The hope is that it should converge.

Constraint: We need to specify the learning rate explicitly, so the learning rate has to be a hyper-parameter, or at the very least exposed through the solver API even if it is constant.
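To illustrate why the learning rate must be explicit, here is a minimal full-batch gradient-descent solver for the least-squares problem; this is an illustration, not the library's solver, and convergence hinges entirely on the step size `lr` being chosen appropriately:

```python
import numpy as np

def lstsq_gd(H, y, lr, n_iter=5000):
    """Full-batch gradient descent for min ||H @ beta - y||^2 / n.
    lr must satisfy lr < 2 / lambda_max((2/n) H.T @ H) for convergence,
    which is why it cannot be hidden inside the solver."""
    beta = np.zeros(H.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * H.T @ (H @ beta - y) / len(y)
        beta -= lr * grad
    return beta

rng = np.random.default_rng(1)
H = rng.normal(size=(50, 4))             # stand-in for an RVFL design matrix
beta_true = np.array([1.0, -2.0, 0.5, 3.0])
y = H @ beta_true                        # noiseless synthetic target
beta = lstsq_gd(H, y, lr=0.1)
```

Because the least-squares objective is convex, a non-converging run here can only come from a bad step size or numerical issues, which is exactly the ambiguity mentioned in Kosta's response above.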

Why can the Ridge solver in the current library not be re-purposed?

  1. Currently, `reg_alpha=None` always leads to the direct solver path.
  2. Even if we manage to call the Ridge solver when `reg_alpha=None`, the learning rate for the SGD-style solver used by Ridge is a heuristic and not exposed through its API.

How to code this up ?

  1. Implement standalone solvers (the current approach, but I am considering the second option to avoid maintaining such solvers ourselves).
  2. Use other SGD implementations such as `torch.optim` or scikit-learn's `SGDRegressor`.

The current design for exposing iterative solvers for the ordinary least-squares formulation is to add a solver type to the `fit` method's arguments, as in here, along with kwargs to pass solver-specific arguments through.
