An example showing univariate feature selection.

Noisy (non-informative) features are added to the iris data and univariate feature selection is applied. For each feature, we plot the p-value from the univariate test and the corresponding weight of an SVM. Univariate feature selection picks out the informative features, and these also receive larger SVM weights.

Of the 24 features in total, only the first 4 are informative. They receive the highest scores under univariate feature selection. The SVM assigns a large weight to one of these features, but also puts weight on many of the non-informative ones. Applying univariate feature selection before the SVM increases the weight attributed to the informative features, which should in turn improve classification.


### Version

In [1]:
import sklearn
sklearn.__version__

'0.18.1'

### Imports

This tutorial imports [SelectPercentile](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectPercentile.html#sklearn.feature_selection.SelectPercentile) and [f_classif](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_classif.html#sklearn.feature_selection.f_classif).

In [2]:
import plotly.plotly as py
import plotly.graph_objs as go

import numpy as np
from sklearn import datasets, svm
from sklearn.feature_selection import SelectPercentile, f_classif

### Calculations

Load the iris data and append 20 uninformative noise features.

In [3]:
# The iris dataset
iris = datasets.load_iris()

# 20 noise features, uncorrelated with the target
E = np.random.uniform(0, 0.1, size=(len(iris.data), 20))

# Add the noisy data to the informative features
X = np.hstack((iris.data, E))
y = iris.target

X_indices = np.arange(X.shape[-1])
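
Because the noise columns are drawn at random, the scores and weights computed later change slightly from run to run. A minimal sketch of a reproducible variant, assuming a fixed seed of 0 passed to `np.random.RandomState` (the seed and the name `rng` are illustrative choices, not part of the original example):

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

# Assumption: seed 0 is arbitrary; any fixed seed makes the run repeatable
rng = np.random.RandomState(0)
E = rng.uniform(0, 0.1, size=(len(iris.data), 20))

# 4 informative iris features followed by 20 noise features
X = np.hstack((iris.data, E))
y = iris.target

print(X.shape)  # (150, 24)
```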

### Plot Results

Univariate feature selection with an F-test for feature scoring. We keep the 10% most significant features, which is the default for `SelectPercentile`.

In [4]:
selector = SelectPercentile(f_classif, percentile=10)
selector.fit(X, y)
scores = -np.log10(selector.pvalues_)
scores /= scores.max()
trace = go.Bar(x=X_indices - .45, 
               y=scores, width=.2,
               name=r'Univariate score (<i>-Log(p_{value})</i>)', 
               marker=dict(color='darkorange', 
                           line=dict(color='black', width=1))
              )

py.iplot([trace])
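
To see which columns the selector actually kept, the fitted object can be queried directly. The sketch below rebuilds the data with a fixed seed (an assumption made here for repeatability) and prints the selected indices along with the F-statistic and p-value of each; `selected` is an illustrative name.

```python
import numpy as np
from sklearn import datasets
from sklearn.feature_selection import SelectPercentile, f_classif

iris = datasets.load_iris()
rng = np.random.RandomState(0)  # assumption: fixed seed for repeatability
X = np.hstack((iris.data, rng.uniform(0, 0.1, size=(len(iris.data), 20))))
y = iris.target

selector = SelectPercentile(f_classif, percentile=10)
selector.fit(X, y)

# Boolean support mask -> indices of the columns that survive selection
selected = np.flatnonzero(selector.get_support())
print("Selected feature indices:", selected)

# f_classif returns one F-statistic and one p-value per column
F, pvalues = f_classif(X, y)
for i in selected:
    print("feature %d: F = %.1f, p = %.2e" % (i, F[i], pvalues[i]))
```

Columns 0 to 3 are the original iris measurements, so any index in that range corresponds to an informative feature.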

Compare to the weights of an SVM

In [5]:
clf = svm.SVC(kernel='linear')
clf.fit(X, y)

svm_weights = (clf.coef_ ** 2).sum(axis=0)
svm_weights /= svm_weights.max()

trace1 = go.Bar(x=X_indices - .25, 
                y=svm_weights,
                name='SVM weight',
                marker=dict(color='navy', 
                           line=dict(color='black', width=1))
               )

clf_selected = svm.SVC(kernel='linear')
clf_selected.fit(selector.transform(X), y)

svm_weights_selected = (clf_selected.coef_ ** 2).sum(axis=0)
svm_weights_selected /= svm_weights_selected.max()

trace2 = go.Bar(x=X_indices[selector.get_support()] - .05, 
                y=svm_weights_selected,
                name='SVM weights after selection', 
                marker=dict(color='cyan', 
                           line=dict(color='black', width=1)) 
               )

data = [trace1, trace2]

layout = go.Layout(title="Comparing feature selection",
                   xaxis=dict(title='Feature number'),
                   barmode='group'
                  )
fig = go.Figure(data=data, layout=layout)
py.iplot(fig)
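
The claim that selection helps classification can be checked rather than assumed. A rough sketch, assuming 5-fold cross-validation and a fixed seed for the noise (both choices made here, not taken from the example above): it wraps the selector and the SVM in a pipeline so that the selector is refit inside each fold, then compares mean accuracy with and without the selection step. On a dataset this small the difference may be slight or even reversed.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

iris = datasets.load_iris()
rng = np.random.RandomState(0)  # assumption: fixed seed for repeatability
X = np.hstack((iris.data, rng.uniform(0, 0.1, size=(len(iris.data), 20))))
y = iris.target

# Linear SVM on all 24 features
plain_svm = svm.SVC(kernel='linear')

# Same SVM, preceded by univariate selection of the top 10% of features
selected_svm = make_pipeline(SelectPercentile(f_classif, percentile=10),
                             svm.SVC(kernel='linear'))

# Assumption: cv=5 is an illustrative choice of fold count
score_plain = cross_val_score(plain_svm, X, y, cv=5).mean()
score_selected = cross_val_score(selected_svm, X, y, cv=5).mean()

print("Mean accuracy without selection: %.3f" % score_plain)
print("Mean accuracy with selection:    %.3f" % score_selected)
```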

