# Selecting features for modeling
> This chapter goes over a few different techniques for selecting the most important features from your dataset. You'll learn how to drop redundant features, work with text vectors, and reduce the number of features in your dataset using principal component analysis (PCA).
> 
- toc: true 
- badges: true
- comments: true
- author: Lucas Nunes
- categories: [Python, Datacamp, Machine Learning]
- image: images/datacamp/1_supervised_learning_with_scikit_learn/2_regression.png

> Note: This is a summary of the chapter 4 exercises from the course "Preprocessing for Machine Learning in Python" on DataCamp. <br>[Github repo](https://github.com/lnunesAI/Datacamp/) / [Course link](https://www.datacamp.com/tracks/machine-learning-scientist-with-python)

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Feature selection

### When to use feature selection
<p>Let's say you had finished standardizing your data and creating new features. Which of the following scenarios is NOT a good candidate for feature selection?</p>

<pre>
Possible Answers

Several columns of running times that have been averaged into a new column.

<b>A text field that hasn't been turned into a tf/idf vector yet.</b>

A column of text that has already had a float extracted out of it.

A categorical field that has been one-hot encoded.

Your dataset contains columns related to whether something is a fruit or vegetable, the name of the fruit or vegetable, and the scientific name of the plant.
</pre>

**The text field needs to be vectorized before we can decide whether to eliminate it; otherwise, we might miss out on important data.**

### Identifying areas for feature selection


<p>Take an exploratory look at the post-feature engineering <code>hiking</code> dataset. Which of the following columns is a good candidate for feature selection?</p>

In [None]:
hiking = pd.read_csv('https://raw.githubusercontent.com/lnunesAI/Datacamp/main/2-machine-learning-scientist-with-python/8-preprocessing-for-machine-learning-in-python/datasets/hiking_29x20.csv')

<pre>
Possible Answers

Length

Difficulty

Accessible

<b>All of the above</b>

None of the above

</pre>

In [None]:
hiking[['Length', 'Difficulty', 'Accessible']].head(15)

Unnamed: 0,Length,Difficulty,Accessible
0,0.8 miles,,Y
1,1.0 mile,Easy,N
2,0.75 miles,Easy,N
3,0.5 miles,Easy,N
4,0.5 miles,Easy,N
5,Various,Various,N
6,1.7 miles,,N
7,2.4 miles,,N
8,1.0 mile,,N
9,3.0 miles,,N


**All three of these columns are good candidates for feature selection.**
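A quick per-column summary can make candidates like these easier to spot. The sketch below rebuilds the ten rows printed above as a small DataFrame (the name `hiking_sample` is made up for illustration, and the empty `Difficulty` cells are assumed to be missing values), then counts missing and distinct values per column.

```python
import numpy as np
import pandas as pd

# Rows copied from the head() output above; empty Difficulty cells are treated as missing.
hiking_sample = pd.DataFrame({
    "Length": ["0.8 miles", "1.0 mile", "0.75 miles", "0.5 miles", "0.5 miles",
               "Various", "1.7 miles", "2.4 miles", "1.0 mile", "3.0 miles"],
    "Difficulty": [np.nan, "Easy", "Easy", "Easy", "Easy",
                   "Various", np.nan, np.nan, np.nan, np.nan],
    "Accessible": ["Y", "N", "N", "N", "N", "N", "N", "N", "N", "N"],
})

# Summarize each column: number of missing values and number of distinct values.
summary = pd.DataFrame({
    "n_missing": hiking_sample.isna().sum(),
    "n_unique": hiking_sample.nunique(),
})
print(summary)
```

Columns with many missing values, free-text units, or only a couple of distinct values are the ones worth a closer look before modeling.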

## Removing redundant features

### Selecting relevant features


<div class=""><p>Now let's identify the redundant columns in the <code>volunteer</code> dataset and  perform feature selection on the dataset to return a DataFrame of the relevant features.</p>
<p>For example, if you explore the <code>volunteer</code> dataset in the console, you'll see three features which are related to location: <code>locality</code>, <code>region</code>, and <code>postalcode</code>. They contain repeated information, so it would make sense to keep only one of the features. </p>
<p>There are also features that have gone through the feature engineering process: columns like <code>Education</code> and <code>Emergency Preparedness</code> are a product of encoding the categorical variable <code>category_desc</code>, so <code>category_desc</code> itself is redundant now.</p>
<p>Take a moment to examine the features of <code>volunteer</code> in the console, and try to identify the redundant features.</p></div>

In [None]:
volunteer = pd.read_csv('https://raw.githubusercontent.com/lnunesAI/Datacamp/main/2-machine-learning-scientist-with-python/8-preprocessing-for-machine-learning-in-python/datasets/volunteer-617x16.csv')

Instructions
<ul>
<li>Create a list of redundant column names and store it in the <code>to_drop</code> variable: <ul>
<li>Out of all the location-related features, keep only <code>postalcode</code>.</li>
<li>Features that have gone through the feature engineering process are redundant as well.</li></ul></li>
<li>Drop the columns from the dataset using <code>.drop()</code>. </li>
<li>Print out the <code>.head()</code> of the DataFrame to see the selected columns.</li>
</ul>

In [None]:
volunteer.head()

Unnamed: 0,vol_requests,title,hits,category_desc,locality,region,postalcode,created_date,vol_requests_lognorm,created_month,Education,Emergency Preparedness,Environment,Health,Helping Neighbors in Need,Strengthening Communities
0,2,Web designer,22,Strengthening Communities,"5 22nd St\nNew York, NY 10010\n(40.74053152272...",NY,10010.0,2011-01-14,0.693147,1,0,0,0,0,0,1
1,20,Urban Adventures - Ice Skating at Lasker Rink,62,Strengthening Communities,,NY,10026.0,2011-01-19,2.995732,1,0,0,0,0,0,1
2,500,Fight global hunger and support women farmers ...,14,Strengthening Communities,,NY,2114.0,2011-01-21,6.214608,1,0,0,0,0,0,1
3,15,Stop 'N' Swap,31,Environment,,NY,10455.0,2011-01-28,2.70805,1,0,0,1,0,0,0
4,15,Queens Stop 'N' Swap,135,Environment,,NY,11372.0,2011-01-28,2.70805,1,0,0,1,0,0,0


In [None]:
# Create a list of redundant column names to drop
to_drop = ["locality", "region", "category_desc", "created_date", "vol_requests"]

# Drop those columns from the dataset
volunteer_subset = volunteer.drop(to_drop, axis=1)

# Print out the head of the new dataset
volunteer_subset.head()

Unnamed: 0,title,hits,postalcode,vol_requests_lognorm,created_month,Education,Emergency Preparedness,Environment,Health,Helping Neighbors in Need,Strengthening Communities
0,Web designer,22,10010.0,0.693147,1,0,0,0,0,0,1
1,Urban Adventures - Ice Skating at Lasker Rink,62,10026.0,2.995732,1,0,0,0,0,0,1
2,Fight global hunger and support women farmers ...,14,2114.0,6.214608,1,0,0,0,0,0,1
3,Stop 'N' Swap,31,10455.0,2.70805,1,0,0,1,0,0,0
4,Queens Stop 'N' Swap,135,11372.0,2.70805,1,0,0,1,0,0,0


**It's often easier to collect a list of columns to drop, rather than dropping them individually.**
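As a minimal sketch of the list-based approach, the example below uses a tiny hypothetical DataFrame `df` (only the column names mirror the `volunteer` data) and checks the list for typos before dropping. The typo check is an addition for illustration, not part of the exercise.

```python
import pandas as pd

# Hypothetical miniature of the volunteer data; only the column names mirror the original.
df = pd.DataFrame({
    "title": ["Web designer", "Stop 'N' Swap"],
    "locality": [None, None],
    "region": ["NY", "NY"],
    "postalcode": [10010.0, 10455.0],
    "category_desc": ["Strengthening Communities", "Environment"],
    "Environment": [0, 1],
})

to_drop = ["locality", "region", "category_desc"]

# Guard against typos: report any requested column that is not in the DataFrame.
missing = [col for col in to_drop if col not in df.columns]
if missing:
    raise KeyError(f"Columns not found: {missing}")

# `columns=` is equivalent to passing the labels with axis=1.
subset = df.drop(columns=to_drop)
print(subset.columns.tolist())
```

Because `drop` returns a new DataFrame by default, `df` itself keeps all of its columns, which makes it easy to try a different list later.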

### Checking for correlated features

<p>Let's take a look at the <code>wine</code> dataset again, which is made up of continuous, numerical features. Run Pearson's correlation coefficient on the dataset to determine which columns are good candidates for eliminating. Then, remove those columns from the DataFrame.</p>

In [None]:
wine = pd.read_csv('https://raw.githubusercontent.com/lnunesAI/Datacamp/main/2-machine-learning-scientist-with-python/8-preprocessing-for-machine-learning-in-python/datasets/wine-178x5.csv')

Instructions
<ul>
<li>Print out the column correlations of the <code>wine</code> dataset using <code>corr()</code>.</li>
<li>Take a minute to look at the correlations. Identify a column where the correlation value is greater than 0.75 at least twice and store it in the <code>to_drop</code> variable.</li>
<li>Drop that column from the DataFrame using <code>drop()</code>.</li>
</ul>

In [None]:
# Print out the column correlations of the wine dataset
wine.corr()

Unnamed: 0,Flavanoids,Total phenols,Malic acid,OD280/OD315 of diluted wines,Hue
Flavanoids,1.0,0.864564,-0.411007,0.787194,0.543479
Total phenols,0.864564,1.0,-0.335167,0.699949,0.433681
Malic acid,-0.411007,-0.335167,1.0,-0.36871,-0.561296
OD280/OD315 of diluted wines,0.787194,0.699949,-0.36871,1.0,0.565468
Hue,0.543479,0.433681,-0.561296,0.565468,1.0


In [None]:
# Take a minute to find the column where the correlation value is greater than 0.75 at least twice
to_drop = "Flavanoids"

# Drop that column from the DataFrame
wine = wine.drop(to_drop, axis=1)

**Dropping correlated features is often an iterative process, so you may need to try different combinations in your model.**
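Instead of scanning the matrix by eye, the same check can be sketched in code. The example below copies the correlation values printed above into a DataFrame and counts, for each column, how many other columns it is strongly correlated with. Taking the absolute value (so strong negative correlations also count) and the names `threshold` and `candidates` are choices made for this illustration.

```python
import pandas as pd

cols = ["Flavanoids", "Total phenols", "Malic acid",
        "OD280/OD315 of diluted wines", "Hue"]

# Correlation matrix copied from the wine.corr() output above.
corr = pd.DataFrame(
    [[1.0, 0.864564, -0.411007, 0.787194, 0.543479],
     [0.864564, 1.0, -0.335167, 0.699949, 0.433681],
     [-0.411007, -0.335167, 1.0, -0.36871, -0.561296],
     [0.787194, 0.699949, -0.36871, 1.0, 0.565468],
     [0.543479, 0.433681, -0.561296, 0.565468, 1.0]],
    index=cols, columns=cols,
)

threshold = 0.75

# Count strong correlations per column; subtract 1 to discount the
# diagonal, where every column correlates perfectly with itself.
strong_counts = (corr.abs() > threshold).sum(axis=1) - 1

# Columns strongly correlated with at least two other columns.
candidates = strong_counts[strong_counts >= 2].index.tolist()
print(candidates)
```

With these values, only `Flavanoids` crosses the threshold twice, which matches the column dropped in the exercise.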

## Selecting features using text vectors


### Exploring text vectors, part 1


<p>Let's expand on the text vector exploration method we just learned about, using the <code>volunteer</code> dataset's <code>title</code> tf/idf vectors. In this first part of text vector exploration, we're going to add to that function we learned about in the slides. We'll return a list of numbers with the function. In the next exercise, we'll write another function to collect the top words across all documents, extract them, and then use that list to filter down our <code>text_tfidf</code> vector.</p>

In [99]:
volunteer = pd.read_csv('https://raw.githubusercontent.com/lnunesAI/Datacamp/main/2-machine-learning-scientist-with-python/8-preprocessing-for-machine-learning-in-python/datasets/volunteer-617x16.csv')
volunteer = volunteer[['category_desc', 'title']]
volunteer = volunteer.dropna(subset=['category_desc'], axis=0)

In [100]:
vocab = pd.read_csv('https://raw.githubusercontent.com/lnunesAI/Datacamp/main/2-machine-learning-scientist-with-python/8-preprocessing-for-machine-learning-in-python/datasets/vocab.csv', index_col=0).to_dict()
vocab = vocab['0']
from sklearn.feature_extraction.text import TfidfVectorizer
title_text = volunteer['title']
tfidf_vec = TfidfVectorizer()
text_tfidf = tfidf_vec.fit_transform(title_text)

Instructions
<ul>
<li>Add parameters called <code>original_vocab</code>, for the <code>tfidf_vec.vocabulary_</code>, and <code>top_n</code>.</li>
<li>Call <code>pd.Series</code> on the zipped dictionary. This will make it easier to operate on.</li>
<li>Use the <code>sort_values</code> function to sort the series and slice the index up to <code>top_n</code> words.</li>
<li>Call the function, setting <code>original_vocab=tfidf_vec.vocabulary_</code>, setting <code>vector_index=8</code> to grab the 9th row, and setting <code>top_n=3</code>, to grab the top 3 weighted words.</li>
</ul>

In [101]:
# Add in the rest of the parameters
def return_weights(vocab, original_vocab, vector, vector_index, top_n):
    zipped = dict(zip(vector[vector_index].indices, vector[vector_index].data))
    
    # Let's transform that zipped dict into a series
    zipped_series = pd.Series({vocab[i]:zipped[i] for i in vector[vector_index].indices})
    
    # Let's sort the series to pull out the top n weighted words
    zipped_index = zipped_series.sort_values(ascending=False)[:top_n].index
    return [original_vocab[i] for i in zipped_index]

# Print out the weighted words
print(return_weights(vocab, tfidf_vec.vocabulary_, text_tfidf, vector_index=8, top_n=3))

[189, 942, 466]


**This is a little complicated, but you'll see how it comes together in the next exercise.**
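To make the moving parts easier to see, here is a small sketch on a hypothetical two-document corpus. It shows what a sparse row's `indices` and `data` hold and how an index-to-word mapping turns them into readable weights. The mapping `index_to_word` is built here by inverting `vocabulary_`, and it is assumed to play the same role as the `vocab` dictionary loaded from `vocab.csv`.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical two-document corpus, used only for illustration.
docs = ["apple banana", "banana cherry cherry"]

vec = TfidfVectorizer()
tfidf = vec.fit_transform(docs)  # sparse matrix, one row per document

# vocabulary_ maps word -> column index; invert it to get column index -> word.
index_to_word = {idx: word for word, idx in vec.vocabulary_.items()}

row = tfidf[1]  # second document
# row.indices holds the column positions of the nonzero entries,
# row.data holds the matching tf-idf weights.
weights = pd.Series(row.data, index=[index_to_word[i] for i in row.indices])
top_words = weights.sort_values(ascending=False).index.tolist()
print(top_words)
```

In the second document, "cherry" appears twice and only in that document, so it outranks "banana", which appears in both.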

### Exploring text vectors, part 2


<p>Using the function we wrote in the previous exercise, we're going to extract the top words from each document in the text vector, return a list of the word indices, and use that list to filter the text vector down to those top words.</p>

Instructions
<ul>
<li>Call <code>return_weights</code> to return the top weighted words for that document.</li>
<li>Call <code>set</code> on the returned <code>filter_list</code> so we don't get duplicated numbers.</li>
<li>Call <code>words_to_filter</code>, passing in the following parameters: <code>vocab</code> for the <code>vocab</code> parameter, <code>tfidf_vec.vocabulary_</code> for the <code>original_vocab</code> parameter, <code>text_tfidf</code> for the <code>vector</code> parameter, and <code>3</code> to grab the <code>top_n</code> 3 weighted words from each document.</li>
<li>Finally, pass that <code>filtered_words</code> set into a list to use as a filter for the text vector.</li>
</ul>

In [102]:
def words_to_filter(vocab, original_vocab, vector, top_n):
    filter_list = []
    for i in range(0, vector.shape[0]):
    
        # Here we'll call the function from the previous exercise, and extend the list we're creating
        filtered = return_weights(vocab, original_vocab, vector, i, top_n)
        filter_list.extend(filtered)
    # Return the list in a set, so we don't get duplicate word indices
    return set(filter_list)

# Call the function to get the list of word indices
filtered_words = words_to_filter(vocab, tfidf_vec.vocabulary_, text_tfidf, 3)

# By converting filtered_words back to a list, we can use it to filter the columns in the text vector
filtered_text = text_tfidf[:, list(filtered_words)]

**In the next section, you'll train a model using the filtered vector.**
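The column filtering step can be sketched on its own with a small, made-up sparse matrix. The `keep` set below stands in for the indices returned by `words_to_filter`. Sorting it before indexing is a choice made here so that the column order is the same on every run, since a set has no guaranteed order.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical 3-document x 5-term weight matrix.
matrix = csr_matrix(np.array([
    [0.0, 0.7, 0.0, 0.3, 0.0],
    [0.5, 0.0, 0.0, 0.0, 0.5],
    [0.0, 0.0, 0.9, 0.1, 0.0],
]))

# Pretend these column indices came back from words_to_filter (duplicates removed).
keep = {3, 0, 1}

# Sorting gives a deterministic column order; a bare set has no guaranteed order.
keep_cols = sorted(keep)
filtered = matrix[:, keep_cols]

print(filtered.shape)
print(filtered.toarray())
```

The result keeps every document (row) but only the selected term columns, which is the shape a classifier expects for a reduced feature set.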

### Training Naive Bayes with feature selection


<p>Let's re-run the Naive Bayes text classification model we ran at the end of chapter 3, with our selection choices from the previous exercise, on the <code>volunteer</code> dataset's <code>title</code> and <code>category_desc</code> columns.</p>

In [104]:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

nb = GaussianNB()
y = volunteer['category_desc']

Instructions
<ul>
<li>Use <code>train_test_split</code> on the <code>filtered_text</code> text vector and the <code>y</code> labels (the <code>category_desc</code> labels), passing <code>y</code> to the <code>stratify</code> parameter, since we have an uneven class distribution.</li>
<li>Fit the <code>nb</code> Naive Bayes model to <code>train_X</code> and <code>train_y</code>.</li>
<li>Score the <code>nb</code> model on the <code>test_X</code> and <code>test_y</code> test sets.</li>
</ul>

In [106]:
# Split the dataset according to the class distribution of category_desc, using the filtered_text vector
train_X, test_X, train_y, test_y = train_test_split(filtered_text.toarray(), y, stratify=y)

# Fit the model to the training data
nb.fit(train_X, train_y)

# Print out the model's accuracy
print(nb.score(test_X, test_y))

0.567741935483871


**You can see that our accuracy score wasn't that different from the score at the end of chapter 3. That's okay; the title field is a very small text field, appropriate for demonstrating how filtering vectors works.**
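Because `train_test_split` shuffles randomly by default, scores like the one above change from run to run, which makes small differences hard to interpret. The sketch below fixes the split with `random_state` so that comparisons are repeatable. The titles and labels are invented for illustration and are not taken from the `volunteer` data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Hypothetical titles and labels, only to make the sketch runnable.
titles = [
    "tree planting in the park", "park cleanup day", "river cleanup volunteers",
    "community garden planting", "math tutoring for students", "reading tutor needed",
    "homework help for students", "after school reading program",
]
labels = ["Environment"] * 4 + ["Education"] * 4

# GaussianNB needs dense input, so convert the sparse tf-idf matrix to an array.
X = TfidfVectorizer().fit_transform(titles).toarray()

# random_state fixes the shuffle, so repeated runs give the same split and score.
train_X, test_X, train_y, test_y = train_test_split(
    X, labels, stratify=labels, random_state=42
)

nb = GaussianNB()
nb.fit(train_X, train_y)
score = nb.score(test_X, test_y)
print(score)
```

With a fixed split, a change in accuracy between the filtered and unfiltered vectors can be attributed to the features rather than to a different shuffle.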

## Dimensionality reduction


### Using PCA


<p>Let's apply PCA to the <code>wine</code> dataset, to see if we can get an increase in our model's accuracy.</p>

In [107]:
wine = pd.read_csv('https://raw.githubusercontent.com/lnunesAI/Datacamp/main/2-machine-learning-scientist-with-python/8-preprocessing-for-machine-learning-in-python/datasets/wine.csv')

In [109]:
wine_X = wine.drop('Type', axis=1)

Instructions
<ul>
<li>Set up the <code>PCA</code> object. You'll use PCA on the wine dataset minus its label for <code>Type</code>, stored in the variable <code>wine_X</code>.</li>
<li>Apply PCA to <code>wine_X</code> using <code>pca</code>'s <code>fit_transform</code> method and store the transformed vector in <code>transformed_X</code>.</li>
<li>Print out the <code>explained_variance_ratio_</code> attribute of <code>pca</code> to check how much variance is explained by each component.</li>
</ul>

In [111]:
from sklearn.decomposition import PCA

# Set up PCA and the X vector for dimensionality reduction
pca = PCA()
wine_X = wine.drop("Type", axis=1)

# Apply PCA to the wine dataset X vector
transformed_X = pca.fit_transform(wine_X)

# Look at the percentage of variance explained by the different components
print(pca.explained_variance_ratio_)

[9.98091230e-01 1.73591562e-03 9.49589576e-05 5.02173562e-05
 1.23636847e-05 8.46213034e-06 2.80681456e-06 1.52308053e-06
 1.12783044e-06 7.21415811e-07 3.78060267e-07 2.12013755e-07
 8.25392788e-08]


**In the next section you'll train a model using the PCA-transformed vector.**
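In the output above, the first component accounts for roughly 99.8% of the variance. PCA is sensitive to feature scale, so one plausible reason is a single feature measured on a much larger scale than the rest. The sketch below compares PCA on raw and standardized features. It assumes scikit-learn's bundled wine data (`load_wine`) is an acceptable stand-in for the `wine.csv` file used here; standardizing first is an optional extra step, not part of the exercise.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled wine data stands in for the wine.csv used above.
X, _ = load_wine(return_X_y=True)

# PCA on the raw features: the column with the largest scale dominates.
raw_ratio = PCA().fit(X).explained_variance_ratio_

# PCA after standardizing each feature to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)
scaled_ratio = PCA().fit(X_scaled).explained_variance_ratio_

# Cumulative share of variance explained by the first three components.
print(np.cumsum(raw_ratio)[:3])
print(np.cumsum(scaled_ratio)[:3])
```

The cumulative sums show how many components are needed to reach a given share of variance, which is a common way to pick `n_components`.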

### Training a model with PCA


<p>Now that we have run PCA on the <code>wine</code> dataset, let's try training a model with it.</p>

In [113]:
from sklearn.neighbors import KNeighborsClassifier
y = wine['Type']
knn = KNeighborsClassifier()

Instructions
<ul>
<li>Split the <code>transformed_X</code> vector and the <code>y</code> labels set into training and test sets using <code>train_test_split</code>.</li>
<li>Fit the <code>knn</code> model using the <code>fit()</code> function on the <code>X_wine_train</code> and <code>y_wine_train</code> sets.</li>
<li>Print out the score using <code>knn</code>'s <code>score()</code> function on <code>X_wine_test</code> and <code>y_wine_test</code>.</li>
</ul>

In [128]:
# Split the transformed X and the y labels into training and test sets
X_wine_train, X_wine_test, y_wine_train, y_wine_test = train_test_split(transformed_X, y)

# Fit knn to the training data
knn.fit(X_wine_train, y_wine_train)

# Score knn on the test data and print it out
knn.score(X_wine_test, y_wine_test)

0.7555555555555555

**PCA is a decent choice for the wine dataset.**
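The score above depends on a random split, and PCA was fit on every row before splitting. A variation worth trying, sketched below, wraps PCA and KNN in a pipeline so that the projection is learned from the training rows only, and fixes the split with `random_state`. As before, `load_wine` is assumed to be a stand-in for `wine.csv`, and the pipeline is an alternative arrangement rather than the exercise's own method.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Assumption: scikit-learn's bundled wine data stands in for the wine.csv used above.
X, y = load_wine(return_X_y=True)

# random_state fixes the shuffle so the score is repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# The pipeline fits PCA on the training rows only, then applies the same
# projection to the test rows before KNN predicts.
model = make_pipeline(PCA(), KNeighborsClassifier())
model.fit(X_train, y_train)
score = model.score(X_test, y_test)
print(score)
```

Keeping the test rows out of the PCA fit gives a more honest estimate of how the model would behave on unseen data.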