# Feature selection
Feature selection identifies the most relevant features for use in model construction. It reduces the size of the vector space and, in turn, the complexity of any subsequent vector operations. The number of features to select can be tuned using a held-out validation set.
## ChiSqSelector
ChiSqSelector stands for Chi-Squared feature selection. It operates on labeled data with categorical features. ChiSqSelector orders features based on a Chi-Squared test of independence from the class, and then filters (selects) the top features which the class label depends on the most. This is akin to yielding the features with the most predictive power.
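
To make the ranking criterion concrete, the sketch below computes the Pearson Chi-Squared statistic for a single categorical feature from a table of observed counts. It is a plain-Scala illustration and is not Spark's implementation; the object ChiSqSketch and the method chiSquared are hypothetical names introduced here.

```scala
// Minimal sketch: Pearson Chi-Squared statistic for one categorical feature
// against a class label. ChiSqSketch and chiSquared are hypothetical names.
object ChiSqSketch {
  /** observed(i)(j) = number of examples with feature category i and label j. */
  def chiSquared(observed: Array[Array[Double]]): Double = {
    val rowTotals = observed.map(_.sum)            // examples per feature category
    val colTotals = observed.transpose.map(_.sum)  // examples per class label
    val total = rowTotals.sum
    val terms = for {
      i <- observed.indices
      j <- observed(i).indices
      // Count expected in cell (i, j) if feature and label were independent
      expected = rowTotals(i) * colTotals(j) / total
      if expected > 0
    } yield math.pow(observed(i)(j) - expected, 2) / expected
    terms.sum
  }
}
```

A feature whose categories are distributed identically across the classes, such as Array(Array(10.0, 10.0), Array(20.0, 20.0)), scores 0.0, while a feature that separates the classes perfectly, such as Array(Array(20.0, 0.0), Array(0.0, 20.0)), scores 40.0. A larger statistic indicates a stronger dependence between the feature and the label.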
### Model Fitting
ChiSqSelector has the following parameters in the constructor:

* numTopFeatures: number of top features that the selector will select (filter).

ChiSqSelector provides a fit method that takes an RDD[LabeledPoint] with categorical features, learns the summary statistics, and returns a ChiSqSelectorModel that can transform an input dataset into the reduced feature space.

The model implements VectorTransformer, which can apply Chi-Squared feature selection to a Vector to produce a reduced Vector, or to an RDD[Vector] to produce a reduced RDD[Vector].

Note that the user can also construct a ChiSqSelectorModel by hand by providing an array of selected feature indices (which must be sorted in ascending order).
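
As a minimal sketch of constructing a model by hand, the snippet below builds a ChiSqSelectorModel from an array of indices and applies it to one vector. It assumes the Spark MLlib jars are on the classpath (for example, inside spark-shell); the names manualModel and reduced and the sample values are made up for illustration.

```scala
import org.apache.spark.mllib.feature.ChiSqSelectorModel
import org.apache.spark.mllib.linalg.Vectors

// Keep features 0 and 2; the indices are listed in ascending order.
val manualModel = new ChiSqSelectorModel(Array(0, 2))

// Apply the model to a single dense vector with three features.
// The result keeps only the values at positions 0 and 2 of the input.
val reduced = manualModel.transform(Vectors.dense(1.0, 2.0, 3.0))
```

No fitting step or SparkContext is involved here, which can be convenient when the selected indices were computed elsewhere.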
### Example
The following example shows the basic use of ChiSqSelector. The data set used has a feature matrix consisting of greyscale values that vary from 0 to 255 for each feature.

```scala
import org.apache.spark.SparkContext._
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.mllib.feature.ChiSqSelector

val PATH = "file:///Users/lzz/work/SparkML/"

// Load some data in libsvm format
val data = MLUtils.loadLibSVMFile(sc, PATH + "data/mllib/sample_libsvm_data.txt")
// Discretize data in 16 equal bins since ChiSqSelector requires categorical features
// Even though features are doubles, the ChiSqSelector treats each unique value as a category
val discretizedData = data.map { lp =>
  LabeledPoint(lp.label, Vectors.dense(lp.features.toArray.map { x => (x / 16).floor }))
}
// Create ChiSqSelector that will select top 50 of 692 features
val selector = new ChiSqSelector(50)
// Create ChiSqSelector model (selecting features)
val transformer = selector.fit(discretizedData)
// Filter the top 50 features from each feature vector
val filteredData = discretizedData.map { lp =>
  LabeledPoint(lp.label, transformer.transform(lp.features))
}
```
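
As a possible follow-up, the sketch below inspects which features the fitted model kept and applies the model to an RDD[Vector] in a single call. It continues the example above and assumes that transformer and discretizedData are still in scope; the names kept, featuresOnly, and reducedFeatures are introduced here for illustration.

```scala
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Indices of the features the fitted model kept.
val kept: Array[Int] = transformer.selectedFeatures
println(s"Selected ${kept.length} features: ${kept.mkString(", ")}")

// VectorTransformer also accepts an RDD[Vector], so the labels can be dropped
// and the whole feature set reduced in one call.
val featuresOnly: RDD[Vector] = discretizedData.map(_.features)
val reducedFeatures: RDD[Vector] = transformer.transform(featuresOnly)
```

Each vector in reducedFeatures has one entry per index in kept. This form drops the labels, so the per-record map in the example remains the way to keep each label paired with its reduced features.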