# Feature Engineering

> In this section you'll learn about feature engineering. You'll explore different ways to create new, more useful features from the ones already in your dataset. You'll see how to encode, aggregate, and extract information from both numerical and textual features.
> 
- toc: true 
- badges: true
- comments: true
- author: Lucas Nunes
- categories: [Python, Datacamp, Machine Learning]
- image: images/datacamp/1_supervised_learning_with_scikit_learn/2_regression.png

> Note: This is a summary of the chapter 3 exercises from the DataCamp course "Preprocessing for Machine Learning in Python". <br>[Github repo](https://github.com/lnunesAI/Datacamp/) / [Course link](https://www.datacamp.com/tracks/machine-learning-scientist-with-python)

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Feature engineering


### Feature engineering knowledge test


<p>Now that you've learned about feature engineering, which of the following examples are good candidates for creating new features?</p>

<pre>
Possible Answers

1. A column of timestamps

2. A column of newspaper headlines

3. A column of weight measurements

4. <b>1 and 2</b>

5. None of the above

</pre>

**Timestamps can be broken into days or months, and headlines can be used for natural language processing.**

### Identifying areas for feature engineering


<p>Take an exploratory look at the <code>volunteer</code> dataset, using the variable of that name. Which of the following columns would you want to perform a feature engineering task on?</p>

In [5]:
volunteer = pd.read_csv('https://raw.githubusercontent.com/lnunesAI/Datacamp/main/2-machine-learning-scientist-with-python/8-preprocessing-for-machine-learning-in-python/datasets/volunteer.csv')

<pre>
Possible Answers

1. vol_requests

2. title

3. created_date

4. category_desc

5. <b>2, 3, and 4</b>
</pre>

In [6]:
volunteer[['vol_requests', 'title', 'created_date', 'category_desc']]

| | vol_requests | title | created_date | category_desc |
|---|---|---|---|---|
| 0 | 50 | Volunteers Needed For Rise Up & Stay Put! Home... | January 13 2011 | NaN |
| 1 | 2 | Web designer | January 14 2011 | Strengthening Communities |
| 2 | 20 | Urban Adventures - Ice Skating at Lasker Rink | January 19 2011 | Strengthening Communities |
| 3 | 500 | Fight global hunger and support women farmers ... | January 21 2011 | Strengthening Communities |
| 4 | 15 | Stop 'N' Swap | January 28 2011 | Environment |
| ... | ... | ... | ... | ... |
| 660 | 3 | Volunteer for NYLAG's Food Stamps Project | August 16 2011 | Helping Neighbors in Need |
| 661 | 10 | Iridescent Science Studio Open House Volunteers | March 21 2011 | Strengthening Communities |
| 662 | 1 | French Translator | July 20 2011 | Helping Neighbors in Need |
| 663 | 2 | Marketing & Advertising Volunteer | June 01 2011 | Strengthening Communities |


**All three of these columns will require some feature engineering before modeling.**

## Encoding categorical variables


### Encoding categorical variables - binary


<p>Take a look at the <code>hiking</code> dataset. Several of its columns need encoding before they can be used in a model. One of them is <code>Accessible</code>, a binary feature whose two values, <code>Y</code> and <code>N</code>, need to be encoded as 1s and 0s. Use scikit-learn's <code>LabelEncoder</code> class to do that transformation.</p>

In [19]:
from sklearn.preprocessing import LabelEncoder
hiking = pd.read_csv('https://raw.githubusercontent.com/lnunesAI/Datacamp/main/2-machine-learning-scientist-with-python/8-preprocessing-for-machine-learning-in-python/datasets/hiking.csv')

Instructions
<ul>
<li>Store <code>LabelEncoder()</code> in a variable named <code>enc</code></li>
<li>Using the encoder's <code>fit_transform()</code> method, encode the <code>hiking</code> dataset's <code>"Accessible"</code> column. Call the new column <code>Accessible_enc</code>.</li>
<li>Compare the two columns side-by-side to see the encoding.</li>
</ul>

In [15]:
# Set up the LabelEncoder object
enc = LabelEncoder()

# Apply the encoding to the "Accessible" column
hiking['Accessible_enc'] = enc.fit_transform(hiking['Accessible'])

# Compare the two columns
print(hiking[['Accessible', 'Accessible_enc']].head())

  Accessible  Accessible_enc
0          Y               1
1          N               0
2          N               0
3          N               0
4          N               0


**.fit_transform() is a good way to both fit an encoding and transform the data in a single step.**
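To see which label received which integer, the fitted encoder can be inspected. The following is a minimal sketch on a made-up four-row frame (the `toy` DataFrame is an assumption for illustration, not the real `hiking` data). It relies on `LabelEncoder` sorting the labels, so `N` becomes 0 and `Y` becomes 1, which matches the output above.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the hiking data (values are made up for illustration)
toy = pd.DataFrame({"Accessible": ["Y", "N", "N", "Y"]})

enc = LabelEncoder()
toy["Accessible_enc"] = enc.fit_transform(toy["Accessible"])

# classes_ lists the original labels; a label's position is its integer code
print(enc.classes_)                    # ['N' 'Y']

# inverse_transform maps integer codes back to the original labels
print(enc.inverse_transform([0, 1]))   # ['N' 'Y']
```

Keeping the fitted `enc` object around is what makes it possible to decode predictions back into `Y`/`N` later.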

### Encoding categorical variables - one-hot


<p>One of the columns in the <code>volunteer</code> dataset, <code>category_desc</code>, gives category descriptions for the volunteer opportunities listed. Because it is a categorical variable with more than two categories, we need to use one-hot encoding to convert this column to numeric form. Use pandas' <code>get_dummies()</code> function to do so.</p>

Instructions
<ul>
<li>Call <code>get_dummies()</code> on the <code>volunteer["category_desc"]</code> column to create the encoded columns and assign it to <code>category_enc</code>.</li>
<li>Print out the <code>head()</code> of the <code>category_enc</code> variable to take a look at the encoded columns.</li>
</ul>

In [16]:
# Transform the category_desc column
category_enc = pd.get_dummies(volunteer['category_desc'])

# Take a look at the encoded columns
category_enc.head()

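| | Education | Emergency Preparedness | Environment | Health | Helping Neighbors in Need | Strengthening Communities |
|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3 | 0 | 0 | 0 | 0 | 0 | 1 |
| 4 | 0 | 0 | 1 | 0 | 0 | 0 |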


**get_dummies() is a simple and quick way to encode categorical variables.**
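Row 0 in the output above is all zeros, which is what `get_dummies()` produces by default when the category is missing. The sketch below, on a made-up series (the `categories` values are assumptions for illustration), shows that default next to the `dummy_na=True` option, which adds an explicit indicator column for missing values.

```python
import numpy as np
import pandas as pd

# Toy stand-in for volunteer["category_desc"] (values chosen for illustration)
categories = pd.Series(["Environment", np.nan, "Health", "Environment"])

# By default a missing value becomes a row with no category set
default_enc = pd.get_dummies(categories)

# dummy_na=True adds an explicit indicator column for missing values
with_nan_enc = pd.get_dummies(categories, dummy_na=True)

# astype(int) gives 0/1 output regardless of the pandas version's default dtype
print(default_enc.astype(int))
print(with_nan_enc.astype(int))
```

Whether to keep a missing-value column or drop those rows depends on the modeling task; later in this chapter the rows with a missing `category_desc` are dropped.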

## Engineering numerical features

### Engineering numerical features - taking an average

<p>A good use case for taking an aggregate statistic to create a new feature is to take the mean of columns. Here, you have a DataFrame of running times named <code>running_times_5k</code>. For each <code>name</code> in the dataset, take the mean of their 5 run times.</p>

In [2]:
running_times_5k = pd.read_csv('https://raw.githubusercontent.com/lnunesAI/Datacamp/main/2-machine-learning-scientist-with-python/8-preprocessing-for-machine-learning-in-python/datasets/running_times_5k.csv')

Instructions
<ul>
<li>Create a list of the columns you want to take the average of and store it in a variable named <code>run_columns</code>.</li>
<li>Use <code>apply</code> to take the <code>mean()</code> of the list of columns and remember to set <code>axis=1</code>. Use <code>lambda row:</code> in the apply.</li>
<li>Print out the DataFrame to see the <code>mean</code> column.</li>
</ul>

In [4]:
# Create a list of the columns to average (named explicitly so that
# re-running the cell does not pick up the new "mean" column)
run_columns = ["run1", "run2", "run3", "run4", "run5"]

# Use apply to create a mean column
running_times_5k["mean"] = running_times_5k.apply(lambda row: row[run_columns].mean(), axis=1)

# Take a look at the results
print(running_times_5k)

      name  run1  run2  run3  run4  run5   mean
0      Sue  20.1  18.5  19.6  20.3  18.3  19.36
1     Mark  16.5  17.1  16.9  17.6  17.3  17.08
2     Sean  23.5  25.1  25.2  24.6  23.9  24.46
3     Erin  21.7  21.1  20.9  22.1  22.2  21.60
4    Jenny  25.8  27.1  26.1  26.7  26.9  26.52
5  Russell  30.9  29.6  31.4  30.4  29.9  30.44


**Lambdas are especially helpful for operating across columns.**
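For a plain row-wise mean, the same result can also be obtained without `apply`, by selecting the columns and calling `mean(axis=1)` directly. This is a sketch of that alternative, not the exercise's required solution; the `times` frame holds two rows copied from the output above.

```python
import pandas as pd

# Two rows copied from the running_times_5k output above
times = pd.DataFrame({
    "name": ["Sue", "Mark"],
    "run1": [20.1, 16.5],
    "run2": [18.5, 17.1],
    "run3": [19.6, 16.9],
    "run4": [20.3, 17.6],
    "run5": [18.3, 17.3],
})
run_columns = ["run1", "run2", "run3", "run4", "run5"]

# Row-wise mean without apply: select the columns, then average across axis=1
times["mean"] = times[run_columns].mean(axis=1)
print(times[["name", "mean"]])
```

The vectorized form is usually faster on large frames, while the lambda form generalizes to row operations that have no built-in equivalent.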

### Engineering numerical features - datetime


<p>Several columns in the <code>volunteer</code> dataset contain datetimes. Let's take a look at the <code>start_date_date</code> column and extract just the month to use as a feature for modeling.</p>

Instructions
<ul>
<li>Use Pandas <code>to_datetime()</code> function on the <code>volunteer["start_date_date"]</code> column and store it in a new column called <code>start_date_converted</code>.</li>
<li>To retrieve just the month, apply a lambda function to <code>volunteer["start_date_converted"]</code> that grabs the <code>.month</code> attribute from the <code>row</code>. Store this in a new column called <code>start_date_month</code>.</li>
<li>Print the <code>head()</code> of just the <code>start_date_converted</code> and <code>start_date_month</code> columns.</li>
</ul>

In [7]:
# First, convert string column to date column
volunteer["start_date_converted"] = pd.to_datetime(volunteer["start_date_date"])

# Extract just the month from the converted column
volunteer["start_date_month"] = volunteer["start_date_converted"].apply(lambda row: row.month)

# Take a look at the converted and new month columns
print(volunteer[['start_date_converted', 'start_date_month']].head())

  start_date_converted  start_date_month
0           2011-07-30                 7
1           2011-02-01                 2
2           2011-01-29                 1
3           2011-02-14                 2
4           2011-02-05                 2


**You can also use attributes like .day to get the day and .year to get the year from datetime columns.**
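As a sketch of that idea, the `.dt` accessor exposes the same components for a whole column at once, without a lambda. The two date strings and the explicit `format` string below are assumptions chosen to resemble the dataset's date style; they are not taken from the real `start_date_date` column.

```python
import pandas as pd

# Toy dates (made up) in a "Month DD YYYY" style
dates = pd.DataFrame({"start_date_date": ["July 30 2011", "February 01 2011"]})

# An explicit format avoids relying on automatic format inference
dates["start_date_converted"] = pd.to_datetime(
    dates["start_date_date"], format="%B %d %Y"
)

# The .dt accessor returns one component for every row of the column
dates["month"] = dates["start_date_converted"].dt.month
dates["day"] = dates["start_date_converted"].dt.day
dates["year"] = dates["start_date_converted"].dt.year
print(dates[["month", "day", "year"]])
```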

## Text classification


### Engineering features from strings - extraction


<p>The <code>Length</code> column in the <code>hiking</code> dataset is a column of strings, but contained in the column is the mileage for the hike. We're going to extract this mileage using regular expressions, and then use a lambda in Pandas to apply the extraction to the DataFrame.</p>

In [22]:
import re
hiking = pd.read_csv('https://raw.githubusercontent.com/lnunesAI/Datacamp/main/2-machine-learning-scientist-with-python/8-preprocessing-for-machine-learning-in-python/datasets/hiking_29x11.csv')

Instructions
<ul>
<li>Create a pattern that will extract numbers and decimals from text, using <code>\d+</code> to get numbers and <code>\.</code> to get decimals, and pass it into <code>re</code>'s <code>compile</code> function.</li>
<li>Use <code>re</code>'s <code>match</code> function to search the text, passing in the <code>pattern</code> and the <code>length</code> text.</li>
<li>Use the matched <code>mile</code>'s <code>group()</code> method to extract the matched pattern, making sure to match group <code>0</code>, and pass it into <code>float</code>.</li>
<li>Apply the <code>return_mileage()</code> function to the <code>hiking["Length"]</code> column.</li>
</ul>

In [23]:
def return_mileage(length):
    # Write a pattern to extract numbers and decimals
    pattern = re.compile(r"\d+\.\d+")

    # Search the start of the text for a match
    mile = re.match(pattern, length)

    # If a match is found, use group(0) to return the matched value as a float
    if mile is not None:
        return float(mile.group(0))

    # No match: return None explicitly so the value shows up as missing
    return None

# Apply the function to the Length column and take a look at both columns
hiking["Length_num"] = hiking["Length"].apply(return_mileage)
print(hiking[["Length", "Length_num"]].head())

       Length  Length_num
0   0.8 miles        0.80
1    1.0 mile        1.00
2  0.75 miles        0.75
3   0.5 miles        0.50
4   0.5 miles        0.50


**Regular expressions are a useful way to perform text extraction.**
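The same extraction can be sketched with pandas' own string methods. `Series.str.extract` applies a pattern with a capture group to every row and leaves non-matching rows as missing. The `lengths` series is made up for illustration, and note one difference from the function above: `str.extract` finds the first match anywhere in the string, whereas `re.match` only matches at the start.

```python
import pandas as pd

# Toy stand-in for hiking["Length"]; the last entry has no decimal number
lengths = pd.Series(["0.8 miles", "1.0 mile", "0.75 miles", "unknown"])

# The parentheses define the capture group whose text is returned per row;
# expand=False returns a Series instead of a one-column DataFrame
length_num = lengths.str.extract(r"(\d+\.\d+)", expand=False).astype(float)
print(length_num)
```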

### Engineering features from strings - tf/idf


<p>Let's transform the <code>volunteer</code> dataset's <code>title</code> column into a text vector, to use in a prediction task in the next exercise.</p>

In [31]:
from sklearn.feature_extraction.text import TfidfVectorizer
volunteer = volunteer.dropna(subset=['category_desc'], axis=0)

Instructions
<ul>
<li>Store the <code>volunteer["title"]</code> column in a variable named <code>title_text</code>.</li>
<li>Use the <code>tfidf_vec</code> vectorizer's <code>fit_transform()</code> method on <code>title_text</code> to transform the text into a tf-idf vector.</li>
</ul>

In [32]:
# Take the title text
title_text = volunteer['title']

# Create the vectorizer object
tfidf_vec = TfidfVectorizer()

# Transform the text into tf-idf vectors
text_tfidf = tfidf_vec.fit_transform(title_text)

**Scikit-learn provides several methods for text vectorization.**
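To get a feel for what `fit_transform()` returns, here is a minimal sketch on three made-up titles (the `titles` list is an assumption, not the real `title` column). The result is a sparse matrix with one row per document and one column per vocabulary term.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy titles standing in for volunteer["title"]
titles = ["Web designer", "French translator", "Web translator"]

tfidf_vec = TfidfVectorizer()
text_tfidf = tfidf_vec.fit_transform(titles)

# One row per document, one column per vocabulary term
print(text_tfidf.shape)                # (3, 4)

# vocabulary_ maps each lowercased term to its column index
print(sorted(tfidf_vec.vocabulary_))   # ['designer', 'french', 'translator', 'web']
```

Because most titles contain only a few of the vocabulary's terms, the matrix is stored in sparse form; calling `toarray()` converts it to a dense NumPy array when a model requires one.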

### Text classification using tf/idf vectors


<p>Now that we've encoded the <code>volunteer</code> dataset's <code>title</code> column into tf/idf vectors, let's use those vectors to try to predict the <code>category_desc</code> column.</p>

In [33]:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
nb = GaussianNB()

Instructions
<ul>
<li>Using <code>train_test_split</code>, split the <code>text_tfidf</code> vector, along with your <code>y</code> variable, into training and test sets. Set the <code>stratify</code> parameter equal to <code>y</code>, since the class distribution is uneven. Notice that we have to call the <code>toarray()</code> method on the tf/idf vector to get it into the dense format that <code>GaussianNB</code> expects.</li>
<li>Use Naive Bayes' <code>fit()</code> method on the <code>X_train</code> and <code>y_train</code> variables.</li>
<li>Print out the <code>score()</code> of the <code>X_test</code> and <code>y_test</code> variables.</li>
</ul>

In [36]:
# Split the dataset according to the class distribution of category_desc
y = volunteer["category_desc"]
X_train, X_test, y_train, y_test = train_test_split(text_tfidf.toarray(), y, stratify=y)

# Fit the model to the training data
nb.fit(X_train, y_train)

# Print out the model's accuracy
print(nb.score(X_test, y_test))

0.5548387096774193


**Notice that the model doesn't score very well. We'll work on selecting the best features for modeling in the next chapter.**
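The effect of the `stratify` argument used in the split can be sketched on made-up labels. In the example below, the arrays, `test_size=0.25`, and `random_state=42` are all assumptions for illustration and are not part of the original exercise. Fixing `random_state` makes the split repeatable, which the exercise's split is not, so its reported score can vary from run to run.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels: 8 samples of class "a", 4 of class "b"
y = np.array(["a"] * 8 + ["b"] * 4)
X = np.arange(len(y)).reshape(-1, 1)

# stratify=y keeps the 2:1 class ratio in both the training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.25, random_state=42
)

# Count how many samples of each class landed in each set
print(np.unique(y_train, return_counts=True))
print(np.unique(y_test, return_counts=True))
```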