# Exercise 6

For this exercise you can use either Python with sklearn or Weka.

* Using the UCI mushroom dataset from the last exercise, perform feature selection using a classifier evaluator. Which features are most discriminative?
* Use principal component analysis to construct a reduced space. Which combination of features explains the most variance in the dataset?
* Do you see any overlap between the PCA features and those obtained from feature selection?

In [38]:
from sklearn import decomposition
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
import pandas as pd
 

df = pd.read_csv('./agaricus-lepiota.csv')
# One-hot encode every column; note that x therefore also contains the edibility_* target columns
x, y = pd.get_dummies(df), pd.get_dummies(df['edibility'])

# SelectKBest with chi2 and k = 5
skb = SelectKBest(chi2, k=5)
# Fit the selector to the dataset
skb.fit(x, y)
# Transform the dataset to include only the selected top 5 features
x_new = skb.transform(x)

print("Original shape", x.shape)
print("Skb shape:", x_new.shape)

# Map the indices of the selected columns back to their names
selected = [x.columns[i] for i in skb.get_support(indices=True)]
print("Selected features:", ", ".join(selected))

Original shape (8124, 119)
Skb shape: (8124, 5)
Selected features: edibility_e, edibility_p, odor_f, odor_n, stalk-surface-above-ring_k
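
The selected list contains `edibility_e` and `edibility_p` because `x` was encoded from the whole frame, target included, so the label ends up selecting itself. A minimal sketch of a variant that drops the target before encoding is shown here. The helper name `select_top_features` and the toy frame are hypothetical and only illustrate the idea; results on the mushroom data are not shown.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2


def select_top_features(df, target, k):
    """Return the names of the k dummy columns with the highest chi2 score."""
    # Encode the predictors only, so the target cannot select itself
    x = pd.get_dummies(df.drop(columns=[target]))
    # Pass the labels as a 1-D vector
    y = df[target]
    skb = SelectKBest(chi2, k=k).fit(x, y)
    return list(x.columns[skb.get_support(indices=True)])


# Toy data (made up for illustration, not the mushroom dataset)
toy = pd.DataFrame({
    "edibility": ["e", "e", "e", "p", "p", "p"],
    "odor": ["n", "n", "n", "f", "f", "f"],
    "habitat": ["d", "g", "d", "g", "d", "g"],
})
top = select_top_features(toy, "edibility", k=2)
print(top)
```

With the real data, the call would presumably be `select_top_features(df, 'edibility', k=5)`.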


In [41]:
print("Original space:", x.shape)
# Perform PCA, reducing the dimensionality to 5 principal components
pca = decomposition.PCA(n_components=5)
# Fit the PCA model and apply the dimensionality reduction
x_pca = pca.fit_transform(x)

print("PCA space:", x_pca.shape)
# For each principal component, find the feature with the largest loading.
# Note: argmax picks the largest signed loading, so strongly negative loadings are ignored.
best_features = [component.argmax() for component in pca.components_]
# Map the feature indices back to column names
feature_names = [x.columns[i] for i in best_features]
print("Features in which gives max variance:", ", ".join(feature_names))

Original space: (8124, 119)
PCA space: (8124, 5)
Features in which gives max variance: edibility_e, stalk-root_b, habitat_d, stalk-shape_e, odor_n
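
Two caveats apply to the PCA cell. The sign of a principal component is arbitrary, so the largest absolute loading is usually a better indicator of the dominant feature than the largest signed one. Also, the share of variance captured by each component is available from `explained_variance_ratio_`. The following is a sketch under those assumptions; the helper `dominant_features` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA


def dominant_features(x, n_components):
    """Fit PCA and return (dominant feature per component, variance ratios)."""
    pca = PCA(n_components=n_components).fit(x)
    # Use absolute loadings: the sign of a component is arbitrary
    idx = np.abs(pca.components_).argmax(axis=1)
    return list(x.columns[idx]), pca.explained_variance_ratio_


# Toy data (made up for illustration): "a" varies far more than "b" and "c"
toy = pd.DataFrame({
    "a": [0.0, 10.0, 0.0, 10.0, 0.0, 10.0],
    "b": [0.0, 0.0, 1.0, 1.0, 0.0, 1.0],
    "c": [0.5, 0.5, 0.5, 0.6, 0.5, 0.5],
})
names, ratios = dominant_features(toy, n_components=2)
print(names, ratios.round(3))
```

A single top feature per component is only a summary. Each component is a weighted combination of all columns, so inspecting the few largest absolute loadings per component gives a fuller answer to the exercise question.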


In [40]:
set(selected).intersection(set(feature_names))

{'edibility_e', 'odor_n'}
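
The intersection above compares individual dummy columns. It can also be useful to compare at the level of the original attributes, since `get_dummies` names columns as `<attribute>_<value>`. A small sketch, using the two lists printed above; the helper `base_attribute` is hypothetical.

```python
def base_attribute(dummy_name):
    """Strip the trailing "_<value>" that get_dummies appends to a column name."""
    return dummy_name.rsplit("_", 1)[0]


# Lists copied from the outputs printed above
selected = ["edibility_e", "edibility_p", "odor_f", "odor_n",
            "stalk-surface-above-ring_k"]
feature_names = ["edibility_e", "stalk-root_b", "habitat_d",
                 "stalk-shape_e", "odor_n"]

# Attributes that appear in both the SelectKBest and the PCA results
shared = ({base_attribute(n) for n in selected}
          & {base_attribute(n) for n in feature_names})
print(sorted(shared))
```

Since `edibility` is the target itself, `odor` is the only genuine predictor shared by both methods in this run.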