# Chapter #3: Feature Engineering

In [1]:
import numpy as np
import pandas as pd
import re
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

## 1. Feature engineering

1. Feature engineering
> In this chapter, we're going to talk about a very important part of the preprocessing workflow: feature engineering.

2. What is feature engineering?
> Feature engineering is the creation of new features based on existing features, and it adds information to your dataset that is useful in some way: it adds features useful for your prediction or clustering task, or it sheds insight into relationships between features. Real world data is often not neat and tidy, and in addition to preprocessing steps like standardization, you'll likely have to extract and expand information that exists in the columns in your dataset. Feature engineering is a subject that could definitely be given its own entire course, so we're just going to go over some basics in this chapter. There are automated ways to create new features, but for now we're going to cover manual methods of feature engineering. These methods require you to have an in-depth knowledge of the dataset that you're working with. Feature engineering is also something that is very dependent on the particular dataset you're analyzing. The goal for this chapter is to demonstrate some scenarios in which feature engineering is useful, but it is by no means comprehensive of all feature engineering methods. It really depends on the dataset you're working with and the model you're building.

3. Feature engineering scenarios
> There are a variety of scenarios in which you might want to engineer features from existing data. An extremely common one is with text data. For example, if you're building some kind of natural language processing model, you'll have to create a vector of the words in your dataset. Another scenario might also be related to string data: maybe you have a column which records people's favorite colors. In order to feed this information into a model in scikit-learn, you'll have to encode this information numerically.

4. Feature engineering scenarios
> Another common example is with timestamps. You might see a full timestamp that includes the time down to the second or millisecond, which might be much too granular for a prediction task, so you'll want to create a new column with the day or the month. Perhaps a column contains a list of some kind: test scores, or running times, and maybe it's more useful to use an average. These are all examples of situations in which you want to generate new features from existing columns.

5. Let's practice!
> Let's take a look at a dataset to determine where feature engineering might be useful.

### 1.1. Feature engineering knowledge test

Now that you've learned about feature engineering, which of the following examples are good candidates for creating new features?

Possible Answers:
- A column of timestamps.
- A column of newspaper headlines.
- A column of weight measurements.
- 1 and 2.
- None of the above.

> 1 and 2.

### 1.2. Identifying areas for feature engineering

Take an exploratory look at the `volunteer` dataset, using the variable of that name. Which of the following columns would you want to perform a feature engineering task on?

- Getting everything ready.

In [2]:
# Reading the data, making sure the hits column is imported as a str:
volunteer = pd.read_csv("./data/volunteering.csv", dtype={'hits': str})

In [3]:
# Exploring the shape:
volunteer.shape

(665, 35)

In [4]:
# Exploring the first 5 rows:
volunteer.head()

Unnamed: 0,opportunity_id,content_id,vol_requests,event_time,title,hits,summary,is_priority,category_id,category_desc,...,end_date_date,status,Latitude,Longitude,Community Board,Community Council,Census Tract,BIN,BBL,NTA
0,4996,37004,50,0,Volunteers Needed For Rise Up & Stay Put! Home...,737,Building on successful events last summer and ...,,,,...,July 30 2011,approved,,,,,,,,
1,5008,37036,2,0,Web designer,22,Build a website for an Afghan business,,1.0,Strengthening Communities,...,February 01 2011,approved,,,,,,,,
2,5016,37143,20,0,Urban Adventures - Ice Skating at Lasker Rink,62,Please join us and the students from Mott Hall...,,1.0,Strengthening Communities,...,January 29 2011,approved,,,,,,,,
3,5022,37237,500,0,Fight global hunger and support women farmers ...,14,The Oxfam Action Corps is a group of dedicated...,,1.0,Strengthening Communities,...,March 31 2012,approved,,,,,,,,
4,5055,37425,15,0,Stop 'N' Swap,31,Stop 'N' Swap reduces NYC's waste by finding n...,,4.0,Environment,...,February 05 2011,approved,,,,,,,,


Possible Answers:
- vol_requests.
- title.
- created_date.
- category_desc.
- 2, 3, and 4.

> 2, 3, and 4.

## 2. Encoding categorical variables

1. Encoding categorical variables
> Because models in scikit-learn require numerical input, if your dataset contains categorical variables, you'll have to encode them. Let's take a look at how to do that.

2. Categorical variables
> Often, if you're collecting real world data, you'll have data that assigns a category to a particular row. For example, here's a set of some user data with categorical values. We have a subscribed column, with binary yes and no values, as well as a column with users' favorite colors, which has multiple categorical values.

3. Encoding binary variables - Pandas
> The first encoding we'll cover is encoding binary values. This is actually quite simple, and can be done in both pandas and scikit-learn. In pandas, we can use the apply function to encode 1s and 0s in a DataFrame column. Let's take the subscribed column here. Using apply, we can write a simple conditional that returns a 1 if the value in subscribed is y, and a 0 if the value is n. Looking at a side-by-side comparison of the columns, you can see that the column is now numerically encoded. You might want to do this in pandas if you're not finished preprocessing, or if you're interested in further exploratory work once you've encoded.
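A minimal sketch of the pandas approach described above. The `users` DataFrame and its values are hypothetical, made up to mirror the `subscribed` column mentioned in the transcript.

```python
import pandas as pd

# Hypothetical user data (not from the course dataset).
users = pd.DataFrame({
    "subscribed": ["y", "n", "n", "y", "y"],
    "fav_color": ["blue", "green", "orange", "green", "blue"],
})

# Return 1 when the value is "y" and 0 otherwise.
users["sub_enc"] = users["subscribed"].apply(lambda val: 1 if val == "y" else 0)

# Compare the original and encoded columns side by side.
print(users[["subscribed", "sub_enc"]])
```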

4. Encoding binary variables - scikit-learn
> You can also do this in scikit-learn using LabelEncoder. It's useful to know both methods if, for example, you're implementing encoding as part of scikit-learn's pipeline functionality, which allows you to string different parts of the machine learning process together. Creating a LabelEncoder object also allows you to reuse this encoding on other data, such as on new data or a test set. To encode values in scikit-learn, you'll need the LabelEncoder transformer. You can use the fit_transform method to both fit the encoder to the data and transform the column. Printing out both the subscribed column and the new column, we can see that the ys and ns have been encoded to 1s and 0s.
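The scikit-learn version can be sketched as follows, again on hypothetical data. The second half illustrates the reuse mentioned above: once fitted, the same encoder can transform new values with `transform()`. The names `users`, `le`, and `new_values` are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical data (not from the course dataset).
users = pd.DataFrame({"subscribed": ["y", "n", "n", "y", "y"]})

# Fit the encoder and transform the column in one step.
le = LabelEncoder()
users["sub_enc_le"] = le.fit_transform(users["subscribed"])

# The learned classes are stored in sorted order, so "n" maps to 0 and "y" to 1.
print(le.classes_)

# Reuse the fitted encoder on new data without refitting.
new_values = pd.Series(["n", "y", "n"])
new_encoded = le.transform(new_values)
print(new_encoded)
```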

5. One-hot encoding
> One-hot encoding also encodes categorical variables into 1s and 0s, and is used when a variable has more than two values to encode. It works by looking at the entire list of unique values in a column, transforming each value into an array, and designating a 1 in the appropriate position to encode that a particular value occurs. For example, in the fav_color column, we have three values: blue, green, and orange. If we were to encode these colors with 0s and 1s based on this list, we would get something like this: blue would have a 1 in the first position followed by two zeros, green would have a one in the second position, and orange would have a one in the last position. So an encoded column would look something like this.

6. One-hot encoding
> You can use the get_dummies function in pandas to directly encode categorical values. Simply pass the column you want to encode to the get_dummies function and it returns the values encoded by position.
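A short sketch of one-hot encoding with `get_dummies`, using a hypothetical `fav_color` column. Passing `dtype=int` is a choice made here so the result contains 1s and 0s; depending on the pandas version, the default output may be booleans instead.

```python
import pandas as pd

# Hypothetical data (not from the course dataset).
users = pd.DataFrame({"fav_color": ["blue", "green", "orange", "green"]})

# One column per unique value; a 1 marks the value that occurs in each row.
color_enc = pd.get_dummies(users["fav_color"], dtype=int)
print(color_enc)
```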

7. Let's practice!
> Now it's your turn to encode some values.

### 2.1. Encoding categorical variables - binary

Take a look at the `hiking` dataset. Several of its columns need encoding before they can be modeled, one of which is the `Accessible` column. `Accessible` is a binary feature with two values, `Y` or `N`, so it needs to be encoded into 1s and 0s. Use scikit-learn's `LabelEncoder` class to do that transformation.

- Getting everything ready.

In [5]:
# Reading the data:
hiking = pd.read_json("./data/hiking.txt")

In [6]:
# Exploring the shape:
hiking.shape

(33, 11)

In [7]:
# Exploring the first 5 rows:
hiking.head()

Unnamed: 0,Prop_ID,Name,Location,Park_Name,Length,Difficulty,Other_Details,Accessible,Limited_Access,lat,lon
0,B057,Salt Marsh Nature Trail,"Enter behind the Salt Marsh Nature Center, loc...",Marine Park,0.8 miles,,<p>The first half of this mile-long trail foll...,Y,N,,
1,B073,Lullwater,Enter Park at Lincoln Road and Ocean Avenue en...,Prospect Park,1.0 mile,Easy,Explore the Lullwater to see how nature thrive...,N,N,,
2,B073,Midwood,Enter Park at Lincoln Road and Ocean Avenue en...,Prospect Park,0.75 miles,Easy,Step back in time with a walk through Brooklyn...,N,N,,
3,B073,Peninsula,Enter Park at Lincoln Road and Ocean Avenue en...,Prospect Park,0.5 miles,Easy,Discover how the Peninsula has changed over th...,N,N,,
4,B073,Waterfall,Enter Park at Lincoln Road and Ocean Avenue en...,Prospect Park,0.5 miles,Easy,Trace the source of the Lake on the Waterfall ...,N,N,,


- Store `LabelEncoder()` in a variable named `enc`.

In [8]:
# Initiating the encoder:
enc = LabelEncoder()

- Using the encoder's `.fit_transform()` function, encode the `hiking` dataset's `Accessible` column. Call the new column `Accessible_enc`.

In [9]:
# Encoding the Accessible column:
hiking['Accessible_enc'] = enc.fit_transform(hiking['Accessible'])

- Compare the two columns side-by-side to see the encoding.

In [10]:
# Comparing the 2 columns:
hiking[['Accessible', 'Accessible_enc']].head()

Unnamed: 0,Accessible,Accessible_enc
0,Y,1
1,N,0
2,N,0
3,N,0
4,N,0


### 2.2. Encoding categorical variables - one-hot

One of the columns in the `volunteer` dataset, `category_desc`, gives category descriptions for the volunteer opportunities listed. Because it is a categorical variable with more than two categories, we need to use one-hot encoding to transform this column numerically. Use Pandas' `get_dummies()` function to do so.

- Getting everything ready.

In [11]:
# Reading the data:
volunteer = pd.read_csv("./data/volunteering.csv")

In [12]:
# Exploring the shape:
volunteer.shape

(665, 35)

In [13]:
# Exploring the first 5 rows:
volunteer.head()

Unnamed: 0,opportunity_id,content_id,vol_requests,event_time,title,hits,summary,is_priority,category_id,category_desc,...,end_date_date,status,Latitude,Longitude,Community Board,Community Council,Census Tract,BIN,BBL,NTA
0,4996,37004,50,0,Volunteers Needed For Rise Up & Stay Put! Home...,737,Building on successful events last summer and ...,,,,...,July 30 2011,approved,,,,,,,,
1,5008,37036,2,0,Web designer,22,Build a website for an Afghan business,,1.0,Strengthening Communities,...,February 01 2011,approved,,,,,,,,
2,5016,37143,20,0,Urban Adventures - Ice Skating at Lasker Rink,62,Please join us and the students from Mott Hall...,,1.0,Strengthening Communities,...,January 29 2011,approved,,,,,,,,
3,5022,37237,500,0,Fight global hunger and support women farmers ...,14,The Oxfam Action Corps is a group of dedicated...,,1.0,Strengthening Communities,...,March 31 2012,approved,,,,,,,,
4,5055,37425,15,0,Stop 'N' Swap,31,Stop 'N' Swap reduces NYC's waste by finding n...,,4.0,Environment,...,February 05 2011,approved,,,,,,,,


- Call `get_dummies()` on the `volunteer["category_desc"]` column to create the encoded columns and assign it to `category_enc`.

In [14]:
# Encoding the category_desc column:
category_enc = pd.get_dummies(data=volunteer['category_desc'])

- Print out the `head()` of the `category_enc` variable to take a look at the encoded columns.

In [15]:
# Exploring the first 5 rows from the newly encoded variable:
category_enc.head()

Unnamed: 0,Education,Emergency Preparedness,Environment,Health,Helping Neighbors in Need,Strengthening Communities
0,0,0,0,0,0,0
1,0,0,0,0,0,1
2,0,0,0,0,0,1
3,0,0,0,0,0,1
4,0,0,1,0,0,0


## 3. Engineering numerical features

1. Engineering numerical features
> Though you may have a dataset filled with numerical features, they may need a little bit of feature engineering to properly prepare for modeling. In this section, we'll talk about aggregate statistics as well as dates and how engineering numerical features can add value to your dataset.

2. Aggregate statistics
> If you had, say, a collection of features related to a single feature, like temperature or running time, you might want to take an average or median to use as a feature for modeling instead. A common method of feature engineering is to take an aggregate of a set of numbers to use in place of those features. This can be helpful in reducing the dimensionality of your feature space, or perhaps you simply don't need multiple similar values that are close in distance to each other. Here we have a dataset of temperature data over the course of three days in four different cities. Rather than use all three days, let's take an average. We can simply take the columns we want to run an aggregate statistic over - here I've thrown them into a list to make it easier to read - and apply the appropriate function, using a lambda. We set axis=1 in order to operate across a row. You can see that this returns a single mean value column.
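The aggregation described above can be sketched like this. The city names and temperature values are made up for illustration. The last line shows pandas' built-in `mean(axis=1)`, which should give the same result without a lambda.

```python
import pandas as pd

# Hypothetical temperatures for four cities over three days.
temps = pd.DataFrame({
    "city": ["NYC", "SF", "LA", "Boston"],
    "day1": [68.3, 75.1, 80.3, 63.0],
    "day2": [67.9, 75.5, 84.0, 61.0],
    "day3": [67.8, 74.9, 81.3, 61.2],
})

# Columns to aggregate, kept in a list for readability.
columns = ["day1", "day2", "day3"]

# axis=1 applies the function across each row.
temps["mean"] = temps[columns].apply(lambda row: row.mean(), axis=1)

# Equivalent built-in form.
vectorized = temps[columns].mean(axis=1)

print(temps)
```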

3. Dates
> Dates and timestamps are another area where you might want to reduce granularity in your dataset. If you're doing time series analysis, that's likely a different story, but if you're running a prediction task, you may need higher-level information like the month or the year, or both. Here's a collection of dates. The full date is too granular for the prediction task we want to do, so let's extract the month from each date.

4. Dates
> The first thing to do is to convert this column to a datetime column in pandas. This makes the extraction task much easier. Once it's converted, we can once again use the apply method. Because this column is now a datetime, we can simply use the .month attribute to extract out the month. You can also use attributes like "day" to get the day, and "year" to get the year. And you can see that we now have a column of the month values.
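A minimal sketch of the date conversion and extraction, assuming a hypothetical `purchases` DataFrame with dates written as month name, day, and year. The `.dt` accessor on the last line is an alternative to `apply` that reads the same attributes directly from a datetime column.

```python
import pandas as pd

# Hypothetical date strings (not from the course dataset).
purchases = pd.DataFrame({"date": ["July 30 2011", "February 01 2011", "January 29 2011"]})

# Convert the strings to datetimes; the format string matches "Month DD YYYY".
purchases["date_converted"] = pd.to_datetime(purchases["date"], format="%B %d %Y")

# Extract the month with apply, as described above.
purchases["month"] = purchases["date_converted"].apply(lambda row: row.month)

# The .dt accessor exposes the same attributes for the whole column.
purchases["year"] = purchases["date_converted"].dt.year

print(purchases)
```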

5. Let's practice!
> Time to put engineering numerical features into practice!

### 3.1. Engineering numerical features - taking an average

A good use case for taking an aggregate statistic to create a new feature is to take the mean of columns. Here, you have a DataFrame of running times named `running_times_5k`. For each `name` in the dataset, take the mean of their 5 run times.

- Getting everything ready.

In [16]:
# Grabbing the data from the workspace:
running_times_5k = {"name":{"0":"Sue","1":"Mark","2":"Sean","3":"Erin","4":"Jenny","5":"Russell"},
                    "run1":{"0":20.1,"1":16.5,"2":23.5,"3":21.7,"4":25.8,"5":30.9},
                    "run2":{"0":18.5,"1":17.1,"2":25.1,"3":21.1,"4":27.1,"5":29.6},
                    "run3":{"0":19.6,"1":16.9,"2":25.2,"3":20.9,"4":26.1,"5":31.4},
                    "run4":{"0":20.3,"1":17.6,"2":24.6,"3":22.1,"4":26.7,"5":30.4},
                    "run5":{"0":18.3,"1":17.3,"2":23.9,"3":22.2,"4":26.9,"5":29.9}}

running_times_5k = pd.DataFrame(running_times_5k)

# Exploring the first 5 rows:
running_times_5k.head()

Unnamed: 0,name,run1,run2,run3,run4,run5
0,Sue,20.1,18.5,19.6,20.3,18.3
1,Mark,16.5,17.1,16.9,17.6,17.3
2,Sean,23.5,25.1,25.2,24.6,23.9
3,Erin,21.7,21.1,20.9,22.1,22.2
4,Jenny,25.8,27.1,26.1,26.7,26.9


- Create a list of the columns you want to take the average of and store it in a variable named `run_columns`.

In [17]:
# Creating a list of the columns to take the average:
run_columns = running_times_5k.drop(columns=['name']).columns

- Use `apply` to take the `mean()` of the list of columns and remember to set `axis=1`. Use `lambda row:` in the `apply`.

In [18]:
# Aggregating the columns:
running_times_5k['mean'] = running_times_5k[run_columns].apply(lambda row : row.mean(), axis=1)

- Print out the DataFrame to see the `mean` column.

In [19]:
# Exploring the data:
print(running_times_5k)

      name  run1  run2  run3  run4  run5   mean
0      Sue  20.1  18.5  19.6  20.3  18.3  19.36
1     Mark  16.5  17.1  16.9  17.6  17.3  17.08
2     Sean  23.5  25.1  25.2  24.6  23.9  24.46
3     Erin  21.7  21.1  20.9  22.1  22.2  21.60
4    Jenny  25.8  27.1  26.1  26.7  26.9  26.52
5  Russell  30.9  29.6  31.4  30.4  29.9  30.44


### 3.2. Engineering numerical features - datetime

There are several columns in the `volunteer` dataset comprised of datetimes. Let's take a look at the `start_date_date` column and extract just the month to use as a feature for modeling.

- Getting everything ready.

In [20]:
# Reading the data:
volunteer = pd.read_csv("./data/volunteering.csv")

In [21]:
# Exploring the shape:
volunteer.shape

(665, 35)

In [22]:
# Exploring the first 5 rows:
volunteer.head()

Unnamed: 0,opportunity_id,content_id,vol_requests,event_time,title,hits,summary,is_priority,category_id,category_desc,...,end_date_date,status,Latitude,Longitude,Community Board,Community Council,Census Tract,BIN,BBL,NTA
0,4996,37004,50,0,Volunteers Needed For Rise Up & Stay Put! Home...,737,Building on successful events last summer and ...,,,,...,July 30 2011,approved,,,,,,,,
1,5008,37036,2,0,Web designer,22,Build a website for an Afghan business,,1.0,Strengthening Communities,...,February 01 2011,approved,,,,,,,,
2,5016,37143,20,0,Urban Adventures - Ice Skating at Lasker Rink,62,Please join us and the students from Mott Hall...,,1.0,Strengthening Communities,...,January 29 2011,approved,,,,,,,,
3,5022,37237,500,0,Fight global hunger and support women farmers ...,14,The Oxfam Action Corps is a group of dedicated...,,1.0,Strengthening Communities,...,March 31 2012,approved,,,,,,,,
4,5055,37425,15,0,Stop 'N' Swap,31,Stop 'N' Swap reduces NYC's waste by finding n...,,4.0,Environment,...,February 05 2011,approved,,,,,,,,


- Use Pandas `to_datetime()` function on the `volunteer["start_date_date"]` column and store it in a new column called `start_date_converted`.

In [23]:
# Exploring the start_date_date column:
volunteer['start_date_date'].head()

0        July 30 2011
1    February 01 2011
2     January 29 2011
3    February 14 2011
4    February 05 2011
Name: start_date_date, dtype: object

In [24]:
# Parsing the start_date_date column:
volunteer['start_date_converted'] = pd.to_datetime(volunteer['start_date_date'], format="%B %d %Y")

- To retrieve just the month, apply a `lambda` function to `volunteer["start_date_converted"]` that grabs the `.month` attribute from the row. Store this in a new column called `start_date_month`.

In [25]:
# Creating a new column for the month:
volunteer['start_date_month'] = volunteer['start_date_converted'].apply(lambda row : row.month)

# Exploring the start_date_month column:
volunteer['start_date_month'].head()

0    7
1    2
2    1
3    2
4    2
Name: start_date_month, dtype: int64

In [26]:
# Another way to extract the month (note: strftime returns zero-padded strings, not integers):
volunteer['start_date_month'] = volunteer['start_date_converted'].dt.strftime('%m')

# Exploring the start_date_month column:
volunteer['start_date_month'].head()

0    07
1    02
2    01
3    02
4    02
Name: start_date_month, dtype: object

- Print the `head()` of just the `start_date_converted` and `start_date_month` columns.

In [27]:
# Exploring the start_date_converted and start_date_month columns:
print(volunteer[['start_date_converted', 'start_date_month']].head())

  start_date_converted start_date_month
0           2011-07-30               07
1           2011-02-01               02
2           2011-01-29               01
3           2011-02-14               02
4           2011-02-05               02


## 4. Text classification

1. Engineering features from text
> Though text data is a little more complicated to work with, there's a lot of useful feature engineering you can do with it. One method is to extract the pieces of information that you need: maybe part of a string, or extracting a number, and transforming it into a feature. You can also transform the text itself into features, for use with natural language processing methods or prediction tasks. Let's learn how to extract data from text fields.

2. Extraction
> We're going to extract information from strings using regular expressions, which are patterns that describe the text you want to match. You should already be familiar with regular expressions from the Cleaning Data in Python course, so this should be review. We're going to focus only on extracting numbers from strings. Here we have a string, and we want to extract the temperature value from it. Notice that this number is a float. We'll need a pattern to extract this float, so let's break down the pattern in re.compile. "backslash d" means that we want to grab digits, and the "plus" means we want to grab as many as possible. So if there are two next to each other, we want both (like the 75). "backslash period" means we want to grab the decimal point, and then there's another "backslash d plus" at the end to grab the digits on the right-hand side of the decimal. We then look for the pattern in the string using re.match, and we can extract the matched text using group(). Keep in mind that re.match only matches at the beginning of the string; re.search finds the pattern anywhere in it.
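A minimal sketch of the extraction described above, assuming a hypothetical string `my_string`. Because the number is not at the start of this particular string, the sketch uses `re.search`; `re.match` only matches at the beginning of a string and would return `None` here.

```python
import re

# Hypothetical input string; the number sits in the middle of the text.
my_string = "temperature: 75.6 F"

# One or more digits, a literal decimal point, then one or more digits.
pattern = re.compile(r"\d+\.\d+")

# match() only succeeds if the pattern starts at position 0.
at_start = pattern.match(my_string)

# search() scans the whole string for the first occurrence.
anywhere = pattern.search(my_string)

# group(0) returns the full matched text, which can be converted to a float.
temp = float(anywhere.group(0))
print(temp)
```

In the hiking exercise that follows, the mileage comes first in each `Length` string, which is why `re.match` works there.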

3. Vectorizing text
> If you're working with text, you might want to model it in some way. Maybe you want to use document text for classification. In order to do that, we'll need to vectorize the text and transform it into a numerical input that scikit-learn can use. We're going to create a tf/idf vector. tf/idf is a way of vectorizing text that reflects how important a word is in a document beyond how frequently it occurs. It stands for term frequency inverse document frequency and places the weight on words that are ultimately more significant in the entire corpus of words.

4. Vectorizing text
> Creating tf/idf vectors is relatively straightforward in scikit-learn, and we can use tf/idf vectorizer to do it. Here we have a collection of text. In order to vectorize it, we can simply pass the column of text we want to vectorize into tfidf vectorizer's fit transform method.
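As a rough sketch of that step, here is `TfidfVectorizer` applied to a small made-up collection of titles. The `documents` list is hypothetical; the vectorizer is used with its default settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical collection of short texts.
documents = [
    "volunteers needed for park cleanup",
    "web designer needed",
    "park ice skating with students",
]

# Learn the vocabulary and idf weights, then transform the texts.
tfidf_vec = TfidfVectorizer()
text_tfidf = tfidf_vec.fit_transform(documents)

# The result is a sparse matrix: one row per document, one column per word.
print(text_tfidf.shape)
print(sorted(tfidf_vec.vocabulary_))
```

The output is a sparse matrix, which is why the later exercise calls `.toarray()` before passing it to a model that expects dense input.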

5. Text classification
> Now that we have a vectorized version of text, we can use it for classification. We'll use a Naive Bayes classifier, which is based on Bayes' theorem of conditional probability, which you can see here, and performs well on text classification tasks. Naive Bayes treats each feature as independent from the others, which can be a naive assumption, but this works out well on text data. Because each feature is treated independently, this classifier works well on high-dimensional data and is very efficient.
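For reference, Bayes' theorem applied to classification can be written as follows. The notation ($y$ for the class, $x_1, \dots, x_n$ for the features) is an assumption chosen for illustration and is not taken from the original slide.

$$
P(y \mid x_1, \dots, x_n) = \frac{P(y)\, P(x_1, \dots, x_n \mid y)}{P(x_1, \dots, x_n)}
$$

Under the naive independence assumption, the likelihood factorizes into per-feature terms, so the classifier picks the class $y$ that maximizes

$$
P(y) \prod_{i=1}^{n} P(x_i \mid y).
$$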

6. Let's practice!
> Now it's your turn to extract features from text.

### 4.1. Engineering features from strings - extraction

The `Length` column in the `hiking` dataset is a column of strings, but contained in the column is the mileage for the hike. We're going to extract this mileage using regular expressions, and then use a lambda in Pandas to apply the extraction to the DataFrame.

- Getting everything ready.

In [28]:
# Reading the data:
hiking = pd.read_json("./data/hiking.txt", dtype={'Length': str})

In [29]:
# Exploring the shape
hiking.shape

(33, 11)

In [30]:
# Exploring the first 5 rows:
hiking.head()

Unnamed: 0,Prop_ID,Name,Location,Park_Name,Length,Difficulty,Other_Details,Accessible,Limited_Access,lat,lon
0,B057,Salt Marsh Nature Trail,"Enter behind the Salt Marsh Nature Center, loc...",Marine Park,0.8 miles,,<p>The first half of this mile-long trail foll...,Y,N,,
1,B073,Lullwater,Enter Park at Lincoln Road and Ocean Avenue en...,Prospect Park,1.0 mile,Easy,Explore the Lullwater to see how nature thrive...,N,N,,
2,B073,Midwood,Enter Park at Lincoln Road and Ocean Avenue en...,Prospect Park,0.75 miles,Easy,Step back in time with a walk through Brooklyn...,N,N,,
3,B073,Peninsula,Enter Park at Lincoln Road and Ocean Avenue en...,Prospect Park,0.5 miles,Easy,Discover how the Peninsula has changed over th...,N,N,,
4,B073,Waterfall,Enter Park at Lincoln Road and Ocean Avenue en...,Prospect Park,0.5 miles,Easy,Trace the source of the Lake on the Waterfall ...,N,N,,


- Create a pattern that will extract numbers and decimals from text, using `\d+` to get numbers and `\.` to get decimals, and pass it into `re`'s `compile` function.

- Use the `group()` method of the match object `mile` to extract the matched pattern, making sure to match group `0`, and pass it into `float`.

In [31]:
# Creating a function to extract decimals:
def return_mileage(length):
    
    # Creating the pattern:
    pattern = re.compile(r"\d+\.\d+")
    
    # Matching the text:
    mile = re.match(pattern, length)
    
    # Returning the mileage:
    if mile:
        return float(mile.group(0))

- Apply the `return_mileage()` function to the `hiking["Length"]` column.

In [32]:
# Extracting the mileage:
hiking["Length_num"] = hiking['Length'].apply(lambda row : return_mileage(row))

In [33]:
# Exploring the data after extraction:
print(hiking[["Length", "Length_num"]].head())

       Length  Length_num
0   0.8 miles        0.80
1    1.0 mile        1.00
2  0.75 miles        0.75
3   0.5 miles        0.50
4   0.5 miles        0.50


### 4.2. Engineering features from strings - tf/idf

Let's transform the `volunteer` dataset's `title` column into a text vector, to use in a prediction task in the next exercise.

- Getting everything ready.

In [34]:
# Reading the data:
volunteer = pd.read_csv("./data/volunteering.csv")

In [35]:
# Exploring the shape:
volunteer.shape

(665, 35)

In [36]:
# Exploring the first 5 rows:
volunteer.head()

Unnamed: 0,opportunity_id,content_id,vol_requests,event_time,title,hits,summary,is_priority,category_id,category_desc,...,end_date_date,status,Latitude,Longitude,Community Board,Community Council,Census Tract,BIN,BBL,NTA
0,4996,37004,50,0,Volunteers Needed For Rise Up & Stay Put! Home...,737,Building on successful events last summer and ...,,,,...,July 30 2011,approved,,,,,,,,
1,5008,37036,2,0,Web designer,22,Build a website for an Afghan business,,1.0,Strengthening Communities,...,February 01 2011,approved,,,,,,,,
2,5016,37143,20,0,Urban Adventures - Ice Skating at Lasker Rink,62,Please join us and the students from Mott Hall...,,1.0,Strengthening Communities,...,January 29 2011,approved,,,,,,,,
3,5022,37237,500,0,Fight global hunger and support women farmers ...,14,The Oxfam Action Corps is a group of dedicated...,,1.0,Strengthening Communities,...,March 31 2012,approved,,,,,,,,
4,5055,37425,15,0,Stop 'N' Swap,31,Stop 'N' Swap reduces NYC's waste by finding n...,,4.0,Environment,...,February 05 2011,approved,,,,,,,,


- Store the `volunteer["title"]` column in a variable named `title_text`.

In [37]:
# Creating a view from the title column:
title_text = volunteer["title"]

- Use the `tfidf_vec` vectorizer's `.fit_transform()` function on `title_text` to transform the text into a tf-idf vector.

In [38]:
# Initializing the vectorizer:
tfidf_vec = TfidfVectorizer()

In [39]:
# Vectorizing the title column:
text_tfidf = tfidf_vec.fit_transform(title_text)

### 4.3. Text classification using tf/idf vectors

Now that we've encoded the `volunteer` dataset's `title` column into tf/idf vectors, let's use those vectors to try to predict the `category_desc` column.

In [40]:
# Creating a mask to filter out rows with no category (missing category_id):
not_null = volunteer['category_id'].notna()

In [41]:
# Initializing the model:
nb = GaussianNB()

- Using `train_test_split`, split the `text_tfidf` vector, along with your `y` variable, into training and test sets. Set the `stratify` parameter equal to `y`, since the class distribution is uneven. Notice that we have to run the `.toarray()` method on the tf/idf vector in order to get it into the proper format for scikit-learn.

In [42]:
# Creating the target column (y):
y = volunteer['category_desc'][not_null].copy()

In [43]:
# Creating the feature matrix (X):
X = text_tfidf.toarray()[not_null].copy()

In [44]:
# Splitting the data into training & hold-out sets:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)

- Use Naive Bayes' `.fit()` method on the `X_train` and `y_train` variables.

In [45]:
# Fitting the model:
nb.fit(X_train, y_train)

GaussianNB()

- Print out the `.score()` of the `X_test` and `y_test` variables.

In [46]:
# Evaluating the model performance:
print(nb.score(X_test, y_test))

0.5419354838709678
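The `stratify` argument used in the split above is meant to keep class proportions the same in the training and test sets. The following sketch illustrates that on made-up, imbalanced labels; the data, `test_size`, and `random_state` values are assumptions chosen only to make the example reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80% class "a", 20% class "b".
y = np.array(["a"] * 80 + ["b"] * 20)
X = np.arange(100).reshape(-1, 1)

# stratify=y asks the splitter to preserve the class ratio in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Share of the minority class in each split.
train_share = (y_train == "b").mean()
test_share = (y_test == "b").mean()
print(train_share, test_share)
```

Without a fixed `random_state`, as in the exercise above, the split and therefore the reported score can vary from run to run.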
