# Feature analysts

- Feature analysts transform raw data into informative signals.
- These signals have some predictive power over financial variables.
- Feature analysts are experts in information theory, signal extraction and processing, visualization, labeling, weighting, classifiers, and feature importance techniques.

# Feature Engineering

Alpha factors:
- Alpha factors are transformations of market, fundamental, and alternative data that contain
predictive signals. Some factors describe fundamental, economy-wide variables such as
growth, inflation, volatility, productivity, and demographic risk. Other factors represent
investment styles, such as value, growth, and momentum investing, that can be traded
and are thus priced by the market. There are also factors that explain price movements
based on the economics or institutional setting of financial markets, or investor behavior,
including known biases of this behavior.
- The economic theory behind factors can be rational, in which case factors earn high returns
over the long run to compensate for their low returns during bad times. It can also be
behavioral, in which case factor risk premiums result from the possibly biased, or not entirely
rational, behavior of agents that is not arbitraged away.
- To avoid false discoveries and ensure a factor delivers consistent results, it should have
a meaningful economic intuition based on the various established factor categories like
momentum, value, volatility, or quality and their rationales.

## Market features

Momentum investing

- Momentum investing is among the most well-established factor strategies, underpinned by quantitative evidence since Jegadeesh and Titman (1993) for the US equity market. 
- Momentum factors are designed to go long on assets that have performed well, while going short on assets with poor performance over a certain period. 
- Such price momentum defies the hypothesis of efficient markets, which states that past price returns alone cannot predict future performance. 
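
The long-short construction described above can be sketched as follows. This is a minimal illustration rather than a reference implementation: the function name `momentum_signal`, the 252-day lookback, the 21-day skip, the choice of `k` assets per leg, and the synthetic prices are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def momentum_signal(prices, lookback=252, skip=21, k=1):
    """Return +1 for the k best past performers, -1 for the k worst, 0 otherwise.

    Past performance is the return from `lookback` days ago to `skip` days ago
    (hypothetical defaults chosen for illustration).
    """
    past_return = prices.shift(skip) / prices.shift(lookback) - 1.0
    ranks = past_return.rank(axis=1)             # 1 = worst performer on that date
    n_valid = past_return.notna().sum(axis=1)    # assets with enough history
    signal = pd.DataFrame(0.0, index=prices.index, columns=prices.columns)
    signal[ranks.gt(n_valid - k, axis=0)] = 1.0  # long the winners
    signal[ranks <= k] = -1.0                    # short the losers
    return signal


# Synthetic prices: five assets with different constant drifts.
dates = pd.bdate_range("2020-01-01", periods=300)
drifts = np.array([-0.002, -0.001, 0.0, 0.001, 0.002])
t = np.arange(len(dates))[:, None]
prices = pd.DataFrame(100.0 * np.exp(drifts * t), index=dates, columns=list("ABCDE"))

signal = momentum_signal(prices)
print(signal.iloc[-1])
```

Dates without enough history produce missing past returns, so the signal stays at zero there.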

Behavioral rationale
- The behavioral rationale reflects the biases of under-reaction (Hong, Lim, and Stein, 2000) and over-reaction (Barberis, Shleifer, and Vishny, 1998) to market news as investors process new information at different speeds.
- After an initial under-reaction to news, investors often extrapolate past behavior and create price momentum. 
- A fear and greed psychology also motivates investors to increase exposure to winning assets and continue selling losing assets (Jegadeesh and Titman, 2011).

Fundamental drivers
- Momentum can also have fundamental drivers such as a positive feedback loop between risk assets and the economy. 
- Economic growth boosts equities, and the resulting wealth effect feeds back into the economy through higher spending, again fueling growth.

Market microstructure effects
- Over shorter, intraday horizons, market microstructure effects can also create price
momentum as investors implement strategies that mimic their biases. 
- These strategies create momentum because they imply an advance commitment to sell when an asset underperforms and buy when it
outperforms.

## Dimension Reduction

### Linear Projection Models

PCA
- The objective of PCA is to find linear combinations of the original predictors such that the
combinations summarize the maximal amount of variation in the original predictor
space. From a statistical perspective, variation is synonymous with information. So,
by finding combinations of the original predictors that capture variation, we find the subspace of the data that contains the information relevant to the predictors.
- PCA is a particularly useful tool when the available data are composed of one or
more clusters of predictors that contain redundant information (e.g., predictors that
are highly correlated with one another).
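
As a rough sketch of the idea above, PCA can be computed from the singular value decomposition of the centered data. The synthetic data (two latent drivers behind six correlated predictors) and all variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Two latent drivers generate six observed, highly correlated predictors.
latent = rng.normal(size=(n, 2))
loadings = rng.normal(size=(2, 6))
X = latent @ loadings + 0.05 * rng.normal(size=(n, 6))

# Center the data; the rows of Vt are the principal directions.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

explained_ratio = s**2 / np.sum(s**2)   # share of total variance per component
scores = Xc @ Vt[:2].T                  # projection onto the first two components

print(np.round(explained_ratio, 4))
```

Because the six predictors are driven by only two latent variables, the first two components should carry nearly all of the variation, and the resulting scores are uncorrelated by construction.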

Kernel PCA
- Principal component analysis is an effective dimension reduction technique when
predictors are linearly correlated and when the resulting scores are associated with the
response. However, the orthogonal partitioning of the predictor space may not provide
a good predictive relationship with the response, especially if the true underlying
relationship between the predictors and the response is *non-linear*.
- The kernel PCA approach combines
a specific mathematical view of PCA with kernel functions and the kernel 'trick'
to enable PCA to expand the dimension of the predictor space in which dimension
reduction is performed.
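
A small sketch of the contrast between linear and kernel PCA, assuming scikit-learn is available. The concentric-circles data set, the RBF kernel, and `gamma=10.0` are illustrative choices, not recommendations.

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

# Two concentric rings: the class structure is non-linear in the raw predictors.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

# Linear PCA only rotates the two original axes.
linear_scores = PCA(n_components=2).fit_transform(X)

# Kernel PCA performs the decomposition in an implicit, expanded feature space.
kernel_scores = KernelPCA(n_components=2, kernel="rbf", gamma=10.0).fit_transform(X)

print(linear_scores.shape, kernel_scores.shape)
```

On data like these, a linear projection cannot separate the two rings, whereas the leading kernel components typically can. The result is sensitive to the kernel and its parameters, which would normally be tuned.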

ICA
- Independent component analysis (ICA) is similar to PCA in a number of ways. It creates new components that are linear combinations of the original
variables but does so in a way that the components are as statistically independent
from one another as possible. This enables ICA to model a broader
set of trends than PCA, which focuses on orthogonality and linear relationships.
There are a number of ways for ICA to meet the constraint of statistical independence
and often the goal is to maximize the “non-Gaussianity” of the resulting
components.
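
The following sketch, assuming scikit-learn's `FastICA`, mixes two independent non-Gaussian sources and tries to recover them. The sources, the mixing matrix, and the sample size are made up for the example.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
n = 5000

# Two independent, non-Gaussian sources (one uniform, one Laplace).
sources = np.column_stack([rng.uniform(-1, 1, n), rng.laplace(size=n)])

# Observed predictors are linear mixtures of the sources.
mixing = np.array([[1.0, 0.5], [0.4, 1.0]])
X = sources @ mixing.T

ica = FastICA(n_components=2, random_state=0, max_iter=1000)
components = ica.fit_transform(X)

# Absolute correlation between each true source and each recovered component.
corr = np.abs(np.corrcoef(sources.T, components.T)[:2, 2:])
print(np.round(corr, 3))
```

ICA recovers sources only up to sign, scale, and ordering, which is why the check uses absolute correlations rather than a direct comparison.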

NMF
- Non-negative matrix factorization (NMF) is another linear projection method that is specific to features that are greater than or equal to zero. In this case, the algorithm finds the projection coefficients (collected in a matrix A) such that their values are also non-negative (thus ensuring that the new features have the same property). The method for determining the coefficients is conceptually simple: find the set
of coefficients that makes the scores as “close” as possible to the original data, subject to
the constraint of non-negativity.
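
A minimal sketch with scikit-learn's `NMF`, on synthetic non-negative data of rank two. Note that scikit-learn parameterizes the problem as X ≈ W H, so `W` plays the role of the new features (scores) here. The data and settings are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)

# Non-negative data built from two non-negative factors.
W_true = rng.uniform(size=(200, 2))
H_true = rng.uniform(size=(2, 6))
X = W_true @ H_true

nmf = NMF(n_components=2, init="nndsvda", max_iter=1000, random_state=0)
W = nmf.fit_transform(X)   # new non-negative features, one row per sample
H = nmf.components_        # non-negative coefficients linking features to X

rel_err = np.linalg.norm(X - W @ H) / np.linalg.norm(X)
print(W.min(), H.min(), rel_err)
```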

### Autoencoders

- Autoencoders are computationally complex multivariate methods for finding representations
of the predictor data and are commonly used in deep learning models.
- The idea is to create a nonlinear mapping between the
original predictor data and a set of artificial features (that is usually the same size).
These new features, which may not have any sensible interpretation, are then used as
the model predictors. While this does sound very similar to the previous projection
methods, autoencoders are very different in terms of how the new features are derived
and also in their potential benefit.
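
To make the mapping concrete, here is a small sketch that fits a network to reproduce its own input and then reads the hidden-layer activations as the artificial features. It uses scikit-learn's `MLPRegressor` as a stand-in for a deep learning framework, and a two-unit hidden layer so that the example also compresses the data. Both choices, along with the synthetic data, are assumptions made for illustration.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Three predictors that are non-linear functions of one hidden variable.
t = rng.uniform(-1, 1, size=(500, 1))
X = np.hstack([t, t**2, np.sin(3 * t)]) + 0.01 * rng.normal(size=(500, 3))
X = StandardScaler().fit_transform(X)

# Autoencoder: the regression target is the input itself.
ae = MLPRegressor(hidden_layer_sizes=(2,), activation="tanh",
                  max_iter=2000, random_state=0)
ae.fit(X, X)

# Encoder half: hidden activations are the new (artificial) features.
codes = np.tanh(X @ ae.coefs_[0] + ae.intercepts_[0])

# Decoder half: a linear map from the codes back to the predictors.
reconstruction = codes @ ae.coefs_[1] + ae.intercepts_[1]
print(codes.shape, float(np.mean((X - reconstruction) ** 2)))
```

The `codes` array is what would be passed to a downstream model in place of the original predictors.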

## Deep Learning

CNN
- CNNs filter the data and then exploit data locality, either spatial, temporal, or even spatio-temporal, to efficiently represent the input data. When applied to time series, CNNs are non-linear autoregressive models which can be designed to capture multiple scales in the data using dilated convolution. We might use a CNN autoencoder to compress spatial data.
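
The dilated convolution mentioned above can be sketched in plain NumPy. The helper `dilated_causal_conv` is a hypothetical name, and the layers below are kept linear so that the receptive field is easy to verify. Applying a non-linear activation between layers would give the non-linear autoregressive form.

```python
import numpy as np


def dilated_causal_conv(x, weights, dilation=1):
    """y[t] = sum_k weights[k] * x[t - k * dilation], treating x[t] as 0 for t < 0."""
    x = np.asarray(x, dtype=float)
    y = np.zeros_like(x)
    for k, w in enumerate(weights):
        lag = k * dilation
        if lag >= len(x):
            break
        y[lag:] += w * x[: len(x) - lag]
    return y


# A single dilated filter: the difference between x[t] and x[t - 2].
x = np.arange(1.0, 9.0)
diff2 = dilated_causal_conv(x, [1.0, -1.0], dilation=2)

# Stacking kernel-size-2 layers with dilations 1, 2, 4 widens the receptive field.
impulse = np.zeros(16)
impulse[0] = 1.0
h = impulse
for d in (1, 2, 4):
    h = dilated_causal_conv(h, [1.0, 1.0], dilation=d)
receptive_field = int(np.count_nonzero(h))

print(diff2, receptive_field)
```

Three layers with only two weights each cover eight time steps, which is how dilation lets a compact network capture multiple time scales.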

PCA and Deep Autoencoder
- Principal component analysis is one of the most powerful techniques for dimension reduction and uses orthogonal projection to decorrelate the features. The first m singular vectors of the weight matrix in a linear autoencoder correspond to the m loading vectors used as an orthogonal basis for projection.
- With non-linear activation, the autoencoder can no longer resolve the loading vectors. However, the more expressive non-linear model can reduce the reconstruction error for a given compression dimension m.
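
The relationship between a linear autoencoder and PCA described above can be checked numerically. In this sketch the autoencoder is fitted by alternating least squares instead of backpropagation, purely to keep the example short and dependency-free. The data and all names are illustrative assumptions. Without further constraints, the singular vectors of the decoder weights span the same subspace as the first m loading vectors, although individual vectors may be rotated within that subspace.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, m = 1000, 5, 2

# Synthetic predictors: two strong directions plus three weak ones.
basis, _ = np.linalg.qr(rng.normal(size=(p, p)))
scales = np.array([2.0, 1.5, 0.3, 0.3, 0.3])
X = (rng.normal(size=(n, p)) * scales) @ basis.T
X -= X.mean(axis=0)

# PCA: the first m right singular vectors of X are the loading vectors.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
V_m = Vt[:m].T

# Linear autoencoder X -> X @ W_enc -> X @ W_enc @ W_dec, minimizing squared error.
W_dec = rng.normal(size=(m, p))
for _ in range(50):
    W_enc = np.linalg.pinv(W_dec)                 # best encoder given the decoder
    Z = X @ W_enc                                 # compressed representation
    W_dec = np.linalg.lstsq(Z, X, rcond=None)[0]  # best decoder given the codes
W_enc = np.linalg.pinv(W_dec)

# Compare subspaces through their orthogonal projection matrices.
_, _, Vt_dec = np.linalg.svd(W_dec, full_matrices=False)
P_ae = Vt_dec.T @ Vt_dec
P_pca = V_m @ V_m.T

err_ae = np.linalg.norm(X - X @ W_enc @ W_dec) ** 2
err_pca = np.linalg.norm(X - X @ P_pca) ** 2
print(np.abs(P_ae - P_pca).max(), err_ae, err_pca)
```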


We can combine these different architectures to build powerful regression and compression methods. For example, we might use a GRU-autoencoder to compress non-stationary time series, whereas we might use a CNN autoencoder to compress spatial data.

# Feature Importance

"Backtesting is not a research tool. Feature importance is."

— Marcos Lopez de Prado

Advances in Financial Machine Learning (2018)

Once we have found what features are important, we can learn more by conducting
a number of experiments. 

- Are these features important all the time, or only in some specific environments? 
- What triggers a change in importance over time? 
- Can those regime switches be predicted? 
- Are those important features also relevant to other related financial instruments? 
- Are they relevant to other asset classes? What are the most relevant features across all financial instruments? 
- What is the subset of features with the highest rank correlation across the entire investment universe?

## Measure of feature importance

### With substitution effects

Substitution effect
- A substitution effect takes place when the estimated importance of one feature is reduced by the presence of other related features. 
- Substitution effects are the ML analogue of what the statistics and econometrics literature calls “multi-collinearity.” 
- One way to address linear substitution effects is to apply PCA on the raw features, and then perform the feature importance analysis on the orthogonal features.

#### Mean decrease impurity (MDI)
 - At each node of each decision tree, the selected feature splits the subset it received in such a way that impurity is decreased. Therefore, we can derive for each decision tree how much of the overall impurity decrease can be assigned to each feature. And given that we have a forest of trees, we can average those values across all estimators and rank the features accordingly.

 - Masking effects take place when some features are systematically ignored by tree-based classifiers in favor of others. 
     - To avoid them, set `max_features=int(1)` when using sklearn’s RF class (the integer 1, not the float 1.0, which would mean all features). In this way, only one random feature is considered at each split.
 - Every feature will have some importance, even if it has no predictive power whatsoever.
 - MDI cannot be generalized to non-tree-based classifiers.
 - The method does not address substitution effects in the presence of correlated features. 
     - MDI dilutes the importance of substitute features, because of their interchangeability: The importance of two identical features will be halved, as they are randomly chosen with equal probability.
 - Strobl et al. [2007] show experimentally that MDI is biased towards some predictor variables. White and Liu [1994] argue that, in case of single decision trees, this bias is due to an unfair advantage given by popular impurity functions toward predictors with a large number of categories.
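
A minimal sketch of MDI with scikit-learn's `RandomForestClassifier`, using `max_features=1` as suggested above. The synthetic data (two informative features, two pure-noise features) and the feature names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 4))
# Only the first two columns drive the label; the last two are noise.
y = (X[:, 0] + 0.5 * X[:, 1] + 0.5 * rng.normal(size=n) > 0).astype(int)
names = ["informative_0", "informative_1", "noise_0", "noise_1"]

forest = RandomForestClassifier(n_estimators=200, max_features=1, random_state=0)
forest.fit(X, y)

# Impurity decrease attributed to each feature, tree by tree, then averaged.
per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])
mdi = pd.DataFrame(
    {
        "mean": per_tree.mean(axis=0),
        "stderr": per_tree.std(axis=0) / np.sqrt(per_tree.shape[0]),
    },
    index=names,
)
print(mdi)
```

The noise features still receive a positive share of the importance, which illustrates why MDI alone cannot conclude that a feature is useless.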

#### Mean decrease accuracy (MDA)

- MDA first fits a classifier; second, it derives its out-of-sample (OOS) performance according to some performance score (accuracy, negative log-loss, etc.); third, it permutes each column of the features matrix (X), one column at a time, deriving the OOS performance after each column’s permutation. The importance of a feature is the resulting loss in performance.

- This method can be applied to any classifier, not only tree-based classifiers.

- MDA is not limited to accuracy as the sole performance score.
- Like MDI, the procedure is also susceptible to substitution effects in the presence of correlated features. 
- Unlike MDI, it is possible that MDA concludes that all features are unimportant. That is because MDA is based on OOS performance.
- The CV must be purged and embargoed, for the reasons explained in Chapter 7 of Advances in Financial Machine Learning.
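
The three MDA steps can be sketched as below. The function `mean_decrease_accuracy` is a hypothetical helper, not the book's code. It uses a plain `KFold` as a placeholder where a purged and embargoed CV would be required for financial labels, and it scores with negative log-loss on synthetic data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import KFold


def mean_decrease_accuracy(clf, X, y, n_splits=5, seed=0):
    """Average OOS drop in negative log-loss after permuting each column."""
    rng = np.random.default_rng(seed)
    cv = KFold(n_splits=n_splits)  # placeholder: swap in a purged, embargoed CV
    decrease = np.zeros((n_splits, X.shape[1]))
    for fold, (train, test) in enumerate(cv.split(X)):
        clf.fit(X[train], y[train])
        base = -log_loss(y[test], clf.predict_proba(X[test]), labels=clf.classes_)
        for j in range(X.shape[1]):
            X_perm = X[test].copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])  # break the link to y
            score = -log_loss(y[test], clf.predict_proba(X_perm), labels=clf.classes_)
            decrease[fold, j] = base - score
    return decrease.mean(axis=0)


rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 4))
# Columns 0 and 1 are informative; columns 2 and 3 are noise.
y = (X[:, 0] + 0.5 * X[:, 1] + 0.5 * rng.normal(size=n) > 0).astype(int)

importance = mean_decrease_accuracy(LogisticRegression(), X, y)
print(np.round(importance, 4))
```

The classifier here is a logistic regression, to underline that the procedure does not depend on tree-based models. Importances near zero or negative indicate features whose permutation does not hurt OOS performance.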

### Without substitution effects

Substitution effects can lead us to discard important features that happen to be redundant. This is not generally a problem in the context of prediction, but it could lead us to wrong conclusions when we are trying to understand, improve, or simplify a
model. For this reason, the following single feature importance method can be a good complement to MDI and MDA.

#### Single Feature Importance

Single feature importance (SFI) is a cross-section predictive-importance (out-of-sample)
method. It computes the OOS performance score of each feature in isolation.

- This method can be applied to any classifier, not only tree-based classifiers.
- SFI is not limited to accuracy as the sole performance score.
- Unlike MDI and MDA, no substitution effects take place, since only one feature is taken into consideration at a time.
- Like MDA, it can conclude that all features are unimportant, because performance is evaluated via OOS CV.
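
A short sketch of SFI using scikit-learn's `cross_val_score`. The helper name `single_feature_importance`, the plain `KFold` (a stand-in for purged CV), and the synthetic data with one deliberate substitute feature are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score


def single_feature_importance(clf, X, y, cv, scoring="neg_log_loss"):
    """Mean OOS score of a classifier trained on each feature alone."""
    scores = [
        cross_val_score(clf, X[:, [j]], y, cv=cv, scoring=scoring).mean()
        for j in range(X.shape[1])
    ]
    return np.array(scores)


rng = np.random.default_rng(0)
n = 1000
signal = rng.normal(size=n)
substitute = signal + 0.1 * rng.normal(size=n)  # near-duplicate of the signal
noise = rng.normal(size=n)
X = np.column_stack([signal, substitute, noise])
y = (signal + 0.5 * rng.normal(size=n) > 0).astype(int)

sfi = single_feature_importance(LogisticRegression(), X, y, cv=KFold(n_splits=5))
print(np.round(sfi, 4))
```

Because each feature is scored on its own, the signal and its near-duplicate should receive similar scores rather than splitting the importance between them.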

#### Orthogonal Features

A partial solution to substitution effects is to orthogonalize the features before applying MDI and
MDA. An orthogonalization procedure such as principal components analysis (PCA)
does not prevent all substitution effects, but at least it should alleviate the impact of
linear substitution effects.

Besides addressing substitution effects, working with orthogonal features provides
two additional benefits: 
- (1) orthogonalization can also be used to reduce the dimensionality of the features matrix X, by dropping features associated with small eigenvalues. This usually speeds up the convergence of ML algorithms; 
- (2) the analysis is conducted on features designed to explain the structure of the data.
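
One way to sketch the orthogonalization step is an eigendecomposition of the standardized features, keeping enough components to explain a chosen share of the variance. The function name `orthogonal_features`, the 95% threshold, and the synthetic data are assumptions for illustration.

```python
import numpy as np


def orthogonal_features(X, var_threshold=0.95):
    """Project standardized features onto the leading eigenvectors."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    eig_val, eig_vec = np.linalg.eigh(Z.T @ Z)
    order = np.argsort(eig_val)[::-1]                 # largest eigenvalue first
    eig_val, eig_vec = eig_val[order], eig_vec[:, order]
    cum_var = np.cumsum(eig_val) / eig_val.sum()
    # Smallest number of components whose cumulative share reaches the threshold.
    k = min(int(np.searchsorted(cum_var, var_threshold)) + 1, X.shape[1])
    return Z @ eig_vec[:, :k], eig_val[:k]


rng = np.random.default_rng(0)
base = rng.normal(size=(500, 1))
# Three near-duplicate features plus two independent ones.
X = np.hstack(
    [base + 0.01 * rng.normal(size=(500, 1)) for _ in range(3)]
    + [rng.normal(size=(500, 2))]
)

P, kept_eigenvalues = orthogonal_features(X)
print(P.shape, np.round(np.corrcoef(P.T), 6))
```

The three redundant columns collapse into a single component, and the retained features are mutually uncorrelated, so MDI or MDA can then be run on `P` instead of `X`.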

### Parallelized vs. stacked feature importance

There are at least two research approaches to feature importance. 

1) Parallelized: for each security $i$ in an investment universe $i = 1, \ldots, I$, we form a dataset $(X_i, y_i)$, and derive the feature importance in parallel. Features that are important across a wide variety of instruments are more likely to be associated with an underlying phenomenon, particularly when these feature importances exhibit high rank correlation across the criteria. The main advantage of this approach is that it is computationally fast, as it can be parallelized. A disadvantage is that, due to substitution effects, important features may swap their ranks across instruments, increasing the variance of the estimated importance.

2) Stacked: this approach consists of stacking all datasets $\{(\tilde{X}_i, y_i)\}_{i=1,\ldots,I}$ into a single combined dataset $(X, y)$, where $\tilde{X}_i$ is a transformed instance of $X_i$ (e.g., standardized on a rolling trailing window). The purpose of this transformation is to ensure some distributional homogeneity.

- Feature stacking presents some advantages:
    - (1) the classifier will be fit on a much larger dataset than the one used with the parallelized (first) approach;
    - (2) the importance is derived directly, and no weighting scheme is required for combining the results;
    - (3) conclusions are more general and less biased by outliers or overfitting; and
    - (4) because importance scores are not averaged across instruments, substitution effects do not cause the dampening of those scores.
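
The stacking step can be sketched with pandas as follows. The helper names `rolling_standardize` and `stack_datasets`, the 50-observation window, the instrument labels, and the synthetic data are all hypothetical choices made for illustration.

```python
import numpy as np
import pandas as pd


def rolling_standardize(X, window):
    """Z-score each feature against its own trailing window."""
    roll = X.rolling(window)
    return (X - roll.mean()) / roll.std()


def stack_datasets(datasets, window=50):
    """Transform each instrument's features, then stack everything into one (X, y)."""
    X_parts, y_parts = {}, {}
    for name, (X_i, y_i) in datasets.items():
        X_tilde = rolling_standardize(X_i, window).dropna()
        X_parts[name] = X_tilde
        y_parts[name] = y_i.loc[X_tilde.index]  # keep labels aligned with features
    X = pd.concat(X_parts, names=["instrument", "date"])
    y = pd.concat(y_parts, names=["instrument", "date"])
    return X, y


# Two synthetic instruments whose raw features live on very different scales.
rng = np.random.default_rng(0)
dates = pd.bdate_range("2021-01-01", periods=200)
datasets = {}
for name, scale in [("AAA", 1.0), ("BBB", 100.0)]:
    X_i = pd.DataFrame(scale * rng.normal(size=(200, 2)), index=dates, columns=["f0", "f1"])
    y_i = pd.Series(rng.integers(0, 2, size=200), index=dates)
    datasets[name] = (X_i, y_i)

X, y = stack_datasets(datasets, window=50)
print(X.groupby(level="instrument").std())
```

After the rolling standardization, both instruments contribute features on a comparable scale, so a single classifier and a single feature importance analysis can be run on the combined `(X, y)`.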

# References

Advances in Financial Machine Learning (Lopez de Prado)
- Chapter 8: Feature Importance

Machine Learning in Finance: From Theory to Practice (Dixon et al.)
- Chapter 5: Interpretability
- Chapter 8-6: Autoencoders

Machine Learning for Algorithmic Trading (S. Jansen)
- Chapter 4: Financial Feature Engineering

