## TOC:
* [1. Introduction to Data Preprocessing](#first-bullet)
* [What is data preprocessing? - Video](#second-bullet)

In [None]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
os.listdir()

In [123]:
#Read data files
volunteer = pd.read_csv('datasets/volunteer_opportunities.csv')
hiking = pd.read_json("datasets/hiking.json")
wine = pd.read_csv("datasets/wine_types.csv")
ufo = pd.read_csv("datasets/ufo_sightings_large.csv")

### Course Description
This course covers the basics of how and when to perform data preprocessing. This essential step in any machine learning project is when you get your data ready for modeling. Between importing and cleaning your data and fitting your machine learning model is when preprocessing comes into play. You'll learn how to standardize your data so that it's in the right form for your model, create new features to best leverage the information in your dataset, and select the best features to improve your model fit. Finally, you'll have some practice preprocessing by getting a dataset on UFO sightings ready for modeling.

## 1. Introduction to Data Preprocessing <a class="anchor" id="first-bullet"></a>

In this chapter you'll learn exactly what it means to preprocess data. You'll take the first steps in any preprocessing journey, including exploring data types and dealing with missing data.

### What is data preprocessing? - Video  <a class="anchor" id="second-bullet"></a>

### Missing data - columns
We have a dataset comprised of volunteer information from New York City. The dataset has a number of features, but we want to get rid of features that have at least 3 missing values.

How many features are in the original dataset, and how many features are in the set after columns with at least 3 missing values are removed?

 - The dataset volunteer has been provided.
 - Use the dropna() function to remove columns.
 - You'll have to set both the `axis=` and `thresh=` parameters.


Possible Answers
35, 24 (Correct)
35, 35
35, 19

In [None]:
volunteer.shape

In [None]:
volunteer.dropna(axis=1, thresh=3).shape

Correct! A lot of operations are done on a column basis, so it's useful to remember axis=1 when working with Pandas.

### Missing data - rows
Taking a look at the volunteer dataset again, we want to drop rows where the category_desc column values are missing. We're going to do this using boolean indexing, by checking to see if we have any null values, and then filtering the dataset so that we only have rows with those values.

Instructions
100 XP
Check how many values are missing in the category_desc column using isnull() and sum().
Subset the volunteer dataset by indexing by where category_desc is notnull(), and store in a new variable called volunteer_subset.
Take a look at the .shape attribute of the new dataset, to verify it worked correctly.

In [None]:
# Check how many values are missing in the category_desc column
print(volunteer["category_desc"].isna().sum())

# Subset the volunteer dataset
volunteer_subset = volunteer[volunteer["category_desc"].notnull()]

# Print out the shape of the subset
print(volunteer_subset.shape)

Nice work! Remember that you can use boolean indexing to effectively subset DataFrames.

## Working with data types - Video

### Exploring data types
Taking another look at the dataset comprised of volunteer information from New York City, we want to know what types we'll be working with as we start to do more preprocessing.

Which data types are present in the volunteer dataset?

The dataset volunteer has been provided.
Use the .dtypes attribute to check the datatypes.

Possible Answers
 - Float and int only
 - Int only
 - Float, int, and object
 - Float only

In [None]:
volunteer.dtypes

Correct! All three of these types are present in the DataFrame.

### Converting a column type
If you take a look at the volunteer dataset types, you'll see that the column hits is type object. But, if you actually look at the column, you'll see that it consists of integers. Let's convert that column to type int.

Instructions

 - Take a look at the .head() of the hits column.
 - Use the `.astype` function to convert the column to type int.
 - Take a look at the dtypes of the dataset again, and notice that the column type has changed.

In [None]:
# Print the head of the hits column
print(volunteer["hits"].head())

# Convert the hits column to type int
volunteer["hits"] = volunteer["hits"].astype("int")

# Look at the dtypes of the dataset
print(volunteer.dtypes)

Nice work! You can use astype to convert between a variety of types.

## Class distribution - Video

### Class imbalance
In the volunteer dataset, we're thinking about trying to predict the category_desc variable using the other features in the dataset. First, though, we need to know what the class distribution (and imbalance) is for that label.

Which descriptions occur less than 50 times in the volunteer dataset?

 - The dataset volunteer has been provided.
 - The colum you want to check is category_desc.
 - Use the value_counts() method to check variable counts.

Instructions

Possible Answers
 - Emergency Preparedness
 - Health
 - Environment
 - 1 and 3 (Correct)
 - All of the above

In [None]:
volunteer.category_desc.value_counts()

Correct! Both Emergency Prepardness and Environment occur less than 50 times.

### Stratified sampling
We know that the distribution of variables in the `category_desc` column in the volunteer dataset is uneven. If we wanted to train a model to try to predict category_desc, we would want to train the model on a sample of data that is representative of the entire dataset. Stratified sampling is a way to achieve this.

Instructions
100 XP
 - Create a `volunteer_X` dataset with all of the columns except category_desc.
 - Create a `volunteer_y` training labels dataset.
 - Split up the `volunteer_X` dataset using scikit-learn's `train_test_split` function and passing `volunteer_y` into the `stratify=` parameter.
 - Take a look at the `category_desc` value counts on the training labels.

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# Create a data with all columns except category_desc
volunteer_X = volunteer.drop("category_desc", axis=1)

# Create a category_desc labels dataset
volunteer_y = volunteer[["category_desc"]]

# Use stratified sampling to split up the dataset according to the volunteer_y dataset
X_train, X_test, y_train, y_test = train_test_split(volunteer_X, volunteer_y, stratify=volunteer_y)

# Print out the category_desc counts on the training y labels
print(y_train["category_desc"].value_counts())

Great job. You'll use train_test_split frequently while building models, so it's useful to be familiar with the function.

# Summary of Things Learned
 - volunteer.dropna(axis=1, thresh=3).shape
 - volunteer["category_desc"].isna().sum()
 - volunteer_subset = volunteer[volunteer["category_desc"].notnull()]
 - volunteer["hits"] = volunteer["hits"].astype("int")
 - volunteer.category_desc.value_counts()
 - X_train, X_test, y_train, y_test = train_test_split(volunteer_X, volunteer_y, stratify=volunteer_y)

## 2. Standardizing Data

## Standardizing Data - Video

### When to standardize
Now that you've learned when it is appropriate to standardize your data, which of these scenarios would you NOT want to standardize?

Answer the question
50 XP
Possible Answers
 - A column you want to use for modeling has extremely high variance.
 - You have a dataset with several continuous columns on different scales and you'd like to use a linear model to train the data.
 - The models you're working with use some sort of distance metric in a linear space, like the Euclidean metric.
 - Your dataset is comprised of categorical data. (Correct)


Correct! Standardization is a preprocessing task performed on numerical, continuous data.

### Modeling without normalizing
Let's take a look at what might happen to your model's accuracy if you try to model data without doing some sort of standardization first. Here we have a subset of the wine dataset. One of the columns, Proline, has an extremely high variance compared to the other columns. This is an example of where a technique like log normalization would come in handy, which you'll learn about in the next section.

The scikit-learn model training process should be familiar to you at this point, so we won't go too in-depth with it. You already have a k-nearest neighbors model available (knn) as well as the X and y sets you need to fit and score on.

Instructions
100 XP
 - Split up the X and y sets into training and test sets using train_test_split().
 - Use the knn model's fit() method on the X_train data and y_train labels, to fit the model to the data.
 - Print out the knn model's score() on the X_test data and y_test labels to evaluate the model.

In [None]:
X = wine.drop("Type", axis=1)
y = wine.Type

In [None]:
from sklearn.neighbors import KNeighborsClassifier
knn=KNeighborsClassifier()

In [None]:
# Split the dataset and labels into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Fit the k-nearest neighbors model to the training data
knn.fit(X_train, y_train)

# Score the model on the test data
print(knn.score(X_test, y_test))

Great work. You can see that the accuracy score is pretty low. Let's explore methods to improve this score.

## Log normalization - Video

### Checking the variance
Check the variance of the columns in the wine dataset. Out of the four columns listed in the multiple choice section, which column is a candidate for normalization?

Instructions
50 XP
 - Possible Answers
 - Alcohol
 - Proline (Correct)
 - Proanthocyanins
 - Ash

In [None]:
wine.var()

### Log normalization in Python
Now that we know that the `Proline` column in our wine dataset has a large amount of variance, let's log normalize it.

`Numpy` has been imported as `np` in your workspace.

Instructions
100 XP
 - Print out the variance of the `Proline` column for reference.
 - Use the `np.log()` function on the `Proline` column to create a new, log-normalized column named `Proline_log`.
 - Print out the variance of the `Proline_log` column to see the difference.

In [None]:
# Print out the variance of the Proline column
print(wine.Proline.var())

# Apply the log normalization function to the Proline column
wine["Proline_log"] = np.log(wine.Proline)

# Check the variance of the Proline column again
print(wine["Proline_log"].var())

Nice work! The np.log() function is an easy way to log normalize a column.

In [None]:
wine["Proline"].hist()

In [None]:
wine["Proline_log"].hist()

## Scaling data for feature comparison - Video

### Scaling data - investigating columns
We want to use the Ash, Alcalinity of ash, and Magnesium columns in the wine dataset to train a linear model, but it's possible that these columns are all measured in different ways, which would bias a linear model. Using describe() to return descriptive statistics about this dataset, which of the following statements are true about the scale of data in these columns?

Instructions
50 XP
Possible Answers
 - The max of Ash is 3.23, the max of Alcalinity of ash is 30, and the max of Magnesium is 162.
 - The means of Ash and Alcalinity of ash are less than 20, while the mean of Magnesium is greater than 90.
 - The standard deviations of Ash and Alcalinity of ash are equal.
 - 1 and 2 are true. (Correct)

In [None]:
wine[["Ash", "Alcalinity of ash","Magnesium"]].describe()

Correct! Both of these statements are true according to the statistics returned by describe()

### Scaling data - standardizing columns
Since we know that the Ash, Alcalinity of ash, and Magnesium columns in the wine dataset are all on different scales, let's standardize them in a way that allows for use in a linear model.

Instructions
100 XP
 - Import StandardScaler from sklearn.preprocessing.
 - Create the StandardScaler() method and store in a variable named ss.
 - Create a subset of the wine DataFrame of the Ash, Alcalinity of ash, and Magnesium columns, store in a variable named wine_subset.
 - Apply the ss.fit_transform method to the wine_subset DataFrame.

In [None]:
# Import StandardScaler from scikit-learn
from sklearn.preprocessing import StandardScaler

# Create the scaler
ss = StandardScaler()

# Take a subset of the DataFrame you want to scale 
wine_subset = wine[["Ash", "Alcalinity of ash","Magnesium"]]

# Apply the scaler to the DataFrame subset
wine_subset_scaled = ss.fit_transform(wine_subset)

Good job! In scikit-learn, running fit_transform during preprocessing will both fit the method to the data as well as transform the data in a single step.

## Standardized data and modeling - Video

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
knn=KNeighborsClassifier()

### KNN on non-scaled data
Let's first take a look at the accuracy of a K-nearest neighbors model on the wine dataset without standardizing the data. The knn model as well as the X and y data and labels sets have been created already. Most of this process of creating models in scikit-learn should look familiar to you.

Instructions
100 XP
 - Split the dataset into training and test sets using train_test_split().
 - Use the knn model's fit() method on the X_train data and y_train labels, to fit the model to the data.
 - Print out the knn model's score() on the X_test data and y_test labels to evaluate the model.

In [None]:
X = wine.drop(["Type",'Proline_log'],axis=1)
y = wine.Type

In [None]:
print(X.columns)
print(X.shape)

In [None]:
# Split the dataset and labels into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X,y)

# Fit the k-nearest neighbors model to the training data
knn.fit(X_train, y_train)

# Score the model on the test data
print(knn.score(X_test, y_test))

<script.py> output:
    0.7111111111111111

Well done. This scikit-learn workflow should be very familiar to you at this point.

### KNN on scaled data
The accuracy score on the unscaled wine dataset was decent, but we can likely do better if we scale the dataset. The process is mostly the same as the previous exercise, with the added step of scaling the data. Once again, the knn model as well as the X and y data and labels set have already been created for you.

Instructions
100 XP
 - Create the StandardScaler() method, stored in a variable named ss.
 - Apply the ss.fit_transform method to the X dataset.
 - Use the knn model's fit() method on the X_train data and y_train labels, to fit the model to the data.
 - Print out the knn model's score() on the X_test data and y_test labels to evaluate the model.


In [None]:
# Create the scaling method.
ss = StandardScaler()

# Apply the scaling method to the dataset used for modeling.
X_scaled = ss.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y)

# Fit the k-nearest neighbors model to the training data.
knn.fit(X_train, y_train)

# Score the model on the test data.
print(knn.score(X_test, y_test))

In [None]:
Excellent! The increase in accuracy is worth the extra step of scaling the dataset.

## 3. Feature Engineering

In this section you'll learn about feature engineering. You'll explore different ways to create new, more useful, features from the ones already in your dataset. You'll see how to encode, aggregate, and extract information from both numerical and textual features.

## Feature engineering  - Video

### Feature engineering knowledge test
Now that you've learned about feature engineering, which of the following examples are good candidates for creating new features?

Answer the question
50 XP
Possible Answers
1. A column of timestamps
2. A column of newspaper headlines
3. A column of weight measurements
4. 1 and 2 (Correct)
5. None of the above

Correct! Timestamps can be broken into days or months, and headlines can be used for natural language processing.

Identifying areas for feature engineering
Take an exploratory look at the volunteer dataset, using the variable of that name. Which of the following columns would you want to perform a feature engineering task on?

Instructions
50 XP
Possible Answers
1. vol_requests
2. title
3. created_date
4. category_desc
5. 2, 3, and 4 (Correct)

Correct! All three of these columns will require some feature engineering before modeling.

## Encoding categorical variables - Video

### Encoding categorical variables - binary

Take a look at the hiking dataset. There are several columns here that need encoding, one of which is the Accessible column, which needs to be encoded in order to be modeled. Accessible is a binary feature, so it has two values - either Y or N - so it needs to be encoded into 1s and 0s. Use scikit-learn's LabelEncoder method to do that transformation.

Instructions
100 XP
 - Store LabelEncoder() in a variable named enc
 - Using the encoder's fit_transform() function, encode the hiking dataset's "Accessible" column. Call the new column Accessible_enc.
 - Compare the two columns side-by-side to see the encoding.

In [None]:
hiking.head()

In [None]:
from sklearn.preprocessing import LabelEncoder

In [None]:
# Set up the LabelEncoder object
enc = LabelEncoder()

# Apply the encoding to the "Accessible" column
hiking["Accessible_enc"] = enc.fit_transform(hiking.Accessible)

# Compare the two columns
print(hiking[["Accessible", "Accessible_enc"]].head())

Nice work! .fit_transform() is a good way to both fit an encoding and transform the data in a single step.

### Encoding categorical variables - one-hot
One of the columns in the volunteer dataset, category_desc, gives category descriptions for the volunteer opportunities listed. Because it is a categorical variable with more than two categories, we need to use one-hot encoding to transform this column numerically. Use Pandas' get_dummies() function to do so.

Instructions

 - Call get_dummies() on the volunteer["category_desc"] column to create the encoded columns and assign it to category_enc.
 - Print out the head() of the category_enc variable to take a look at the encoded columns.

In [None]:
# Transform the category_desc column
category_enc = pd.get_dummies(volunteer["category_desc"])

# Take a look at the encoded columns
print(category_enc.head())

Good job! get_dummies() is a simple and quick way to encode categorical variables.

## Engineering numerical features - Video

## Engineering numerical features - taking an average
A good use case for taking an aggregate statistic to create a new feature is to take the mean of columns. Here, you have a DataFrame of running times named running_times_5k. For each name in the dataset, take the mean of their 5 run times.

Instructions

 - Create a list of the columns you want to take the average of and store it in a variable named run_columns.
 - Use apply to take the mean() of the list of columns and remember to set axis=1. Use lambda row: in the apply.
 - Print out the DataFrame to see the mean column.

In [None]:
# Create a list of the columns to average
run_columns = ['run1', 'run2', 'run3', 'run4', 'run5']

# Use apply to create a mean column
running_times_5k["mean"] = running_times_5k[run_columns].apply(lambda row: row.mean(), axis=1)

# Take a look at the results
print(running_times_5k)

Nice work! Lambdas are especially helpful for operating across columns.

### Engineering numerical features - datetime
There are several columns in the volunteer dataset comprised of datetimes. Let's take a look at the start_date_date column and extract just the month to use as a feature for modeling.

Instructions
 - Use Pandas to_datetime() function on the volunteer["start_date_date"] column and store it in a new column called start_date_converted.
 - To retrieve just the month, apply a lambda function to volunteer["start_date_converted"] that grabs the .month attribute from the row. Store this in a new column called start_date_month.
 - Print the head() of just the start_date_converted and start_date_month columns.


In [None]:
volunteer.head()

In [None]:
# First, convert string column to date column
volunteer["start_date_converted"] = pd.to_datetime(volunteer["start_date_date"])

# Extract just the month from the converted column
volunteer["start_date_month"] = volunteer["start_date_converted"].apply(lambda row: row.month)

# Take a look at the original and new columns
print(volunteer[["start_date_converted", "start_date_month"]].head())

In [None]:
#also .dt method was used
volunteer.start_date_converted.dt.month.head()

Awesome! You can also use attributes like .day to get the day and .year to get the year from datetime columns.

## Text classification - Video

### Engineering features from strings - extraction
The Length column in the hiking dataset is a column of strings, but contained in the column is the mileage for the hike. We're going to extract this mileage using regular expressions, and then use a lambda in Pandas to apply the extraction to the DataFrame.

Instructions
100 XP
 - Create a pattern that will extract numbers and decimals from text, using \d+ to get numbers and \. to get decimals, and pass it into re's compile function.
 - Use re's match function to search the text, passing in the pattern and the length text.
 - Use the matched mile's group() attribute to extract the matched pattern, making sure to match group 0, and pass it into float.
 - Apply the return_mileage() function to the hiking["Length"] column.

In [None]:
# hiking = hiking.dropna(axis=0,subset="Length")
hiking = hiking[hiking.Length.notnull()]
# hiking.Length.notnull()

In [None]:
hiking.shape

In [None]:
import re

In [None]:
# Write a pattern to extract numbers and decimals
def return_mileage(length):
    pattern = re.compile(r"\d+\.\d+")
    
    # Search the text for matches
    mile = re.match(pattern, length)
    
    # If a value is returned, use group(0) to return the found value
    if mile is not None:
        return float(mile.group(0))
        
# Apply the function to the Length column and take a look at both columns
hiking["Length_num"] = hiking["Length"].apply(lambda row: return_mileage(row))
print(hiking[["Length", "Length_num"]].head())

Great job! Regular expressions are a useful way to perform text extraction.

### Engineering features from strings - tf/idf
Let's transform the volunteer dataset's title column into a text vector, to use in a prediction task in the next exercise.

Instructions

 - Store the `volunteer["title"]` column in a variable named `title_text`.
 - Use the `tfidf_vec` vectorizer's `fit_transform()` function on `title_text` to transform the text into a tf-idf vector.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
# Take the title text
title_text = volunteer["title"]

# Create the vectorizer method
tfidf_vec = TfidfVectorizer()

# Transform the text into tf-idf vectors
text_tfidf = tfidf_vec.fit_transform(title_text)

Nice job. Scikit-learn provides several methods for text vectorization.

### Text classification using tf/idf vectors
Now that we've encoded the volunteer dataset's title column into tf/idf vectors, let's use those vectors to try to predict the category_desc column.

Instructions

 - Using `train_test_split`, split the `text_tfidf` vector, along with your y variable, into training and test sets. Set the `stratify` parameter equal to `y`, since the class distribution is uneven. Notice that we have to run the `toarray()` method on the tf/idf vector, in order to get in it the proper format for scikit-learn.
 - Use Naive Bayes' `fit()` method on the `train_X` and `train_y` variables.
 - Print out the `score()` of the `test_X` and `test_y` variables.

In [None]:
from sklearn.naive_bayes import GaussianNB
nb = GaussianNB()

In [None]:
# Split the dataset according to the class distribution of category_desc
y = volunteer["category_desc"]
train_X, test_X, train_y, test_y = train_test_split(text_tfidf.toarray(), y, stratify=y)

# Fit the model to the training data
nb.fit(train_X, train_y)

# Print out the model's accuracy
print(nb.score(test_X, test_y))

Nice work! Notice that the model doesn't score very well. We'll work on selecting the best features for modeling in the next chapter.

# Selecting features for modeling

This chapter goes over a few different techniques for selecting the most important features from your dataset. You'll learn how to drop redundant features, work with text vectors, and reduce the number of features in your dataset using principal component analysis (PCA).

In [108]:
from sklearn.model_selection import train_test_split

## Feature Selection - Video

When to use feature selection
Let's say you had finished standardizing your data and creating new features. Which of the following scenarios is NOT a good candidate for feature selection?

Answer the question
50 XP
Possible Answers
1. Several columns of running times that have been averaged into a new column.
2. A text field that hasn't been turned into a tf/idf vector yet. (Correct)
3. A column of text that has already had a float extracted out of it.
4. A categorial field that has been one-hot encoded.
5. Your dataset contains columns related to whether something is a fruit or vegetable, the name of the fruit or vegetable, and the scientific name of the plant.
press


Correct! The text field needs to be vectorized before we can eliminate it, otherwise we might miss out on important data.

Identifying areas for feature selection
Take an exploratory look at the post-feature engineering hiking dataset. Which of the following columns is a good candidate for feature selection?

Instructions
50 XP
Possible Answers
 - Length
 - Difficulty
 - Accessible
 - All of the above (Correct)
 - None of the above

## Removing redundant features - Video

### Selecting relevant features
Now that you've identified redundant columns in the volunteer dataset, let's perform feature selection on the dataset to return a DataFrame of the relevant features.

Instructions
100 XP
 - Create a list of redundant column names and store it in the to_drop variable, in alphabetical order. You'll see three related features: locality, region, and postalcode. For now, let's only keep postalcode.
 - Drop the columns from the dataset using drop().
 - Print out the head() of the DataFrame to see the selected columns.

In [None]:
# Create a list of redundant column names to drop
to_drop = ["category_desc", "created_date", "locality", "region", "vol_requests"]

# Drop those columns from the dataset
volunteer_subset = volunteer.drop(to_drop, axis=1)

# Print out the head of the new dataset
print(volunteer_subset.head())

Nice job! It's often easier to collect a list of columns to drop, rather than dropping them individually.

### Checking for correlated features
Let's take a look at the wine dataset again, which is made up of continuous, numerical features. Run Pearson's correlation coefficient on the dataset to determine which columns are good candidates for eliminating. Then, remove those columns from the DataFrame.

Instructions
100 XP
 - Print out the column correlations of the wine dataset using corr().
 - Take a minute to look at the correlations. Identify a column where the correlation value is greater than 0.75 at least twice and store it in the to_drop variable.
 - Drop that column from the DataFrame using drop().


In [None]:
wine.corr() > 0.75

In [None]:
# Print out the column correlations of the wine dataset
print(wine.corr())

# Take a minute to find the column where the correlation value is greater than 0.75 at least twice
to_drop = "Flavanoids"

# Drop that column from the DataFrame
wine = wine.drop(to_drop, axis=1)

Good work. Dropping correlated features is often an iterative process, so you may need to try different combinations in your model.

## Selecting features using text vectors - Video

### Exploring text vectors, part 1
Let's expand on the text vector exploration method we just learned about, using the volunteer dataset's title tf/idf vectors. In this first part of text vector exploration, we're going to add to that function we learned about in the slides. We'll return a list of numbers with the function. In the next exercise, we'll write another function to collect the top words across all documents, extract them, and then use that list to filter down our text_tfidf vector.

Instructions
100 XP
 - Add parameters called `original_vocab`, for the `tfidf_vec.vocabulary_`, and `top_n`.
 - Call `pd.Series` on the zipped dictionary. This will make it easier to operate on.
 - Use the `sort_values` function to sort the series and slice the index up to top_n words.
 - Call the function, setting `original_vocab=tfidf_vec.vocabulary_`, setting `vector_index=8` to grab the 9th row, and setting `top_n=3`, to grab the top 3 weighted words.

In [97]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Take the title text
title_text = volunteer["title"]

# Create the vectorizer method
tfidf_vec = TfidfVectorizer()

# Transform the text into tf-idf vectors
text_tfidf = tfidf_vec.fit_transform(title_text)

In [None]:
#from slides
print(tfidf_vec.vocabulary_)

In [96]:
#from slides
vocab = {v:k for k,v in tfidf_vec.vocabulary_.items()}

In [None]:
#from slides
text_tfidf.shape

In [None]:
#from slides
print(text_tfidf[3].data)

In [None]:
#from slides
len(text_tfidf[3].indices)

In [None]:
#from slides
def return_weights(vocab, vector, vector_index):
    zipped = dict(zip(vector[vector_index].indices,
        vector[vector_index].data))
    return {vocab[i]:zipped[i] for i in vector[vector_index].indices}

In [None]:
#from slides
vector = text_tfidf[3].data
vector_index = text_tfidf[3].indices
top_n = 3

In [98]:
# Add in the rest of the parameters
def return_weights(vocab, original_vocab, vector, vector_index, top_n):
    zipped = dict(zip(vector[vector_index].indices, vector[vector_index].data))
    
    # Let's transform that zipped dict into a series
    zipped_series = pd.Series({vocab[i]:zipped[i] for i in vector[vector_index].indices})
    
    # Let's sort the series to pull out the top n weighted words
    zipped_index = zipped_series.sort_values(ascending=False)[:top_n].index
    return [original_vocab[i] for i in zipped_index]

# Print out the weighted words
print(return_weights(vocab, tfidf_vec.vocabulary_, text_tfidf, 8, 3))

[23, 188, 562]


Nice job. This is a little complicated, but you'll see how it comes together in the next exercise.

In [102]:
text_tfidf[8].indices

array([562, 188,  23])

In [103]:
title_text[8]

'Join Cents Ability!'

### Exploring text vectors, part 2
Using the function we wrote in the previous exercise, we're going to extract the top words from each document in the text vector, return a list of the word indices, and use that list to filter the text vector down to those top words.

Instructions
100 XP
 - Call `return_weights` to return the top weighted words for that document.
 - Call `set` on the returned `filter_list` so we don't get duplicated numbers.
 - Call `words_to_filter`, passing in the following parameters: `vocab` for the `vocab` parameter, `tfidf_vec.vocabulary_` for the `original_vocab` parameter, `text_tfidf` for the `vector` parameter, and `3` to grab the `top_n` 3 weighted words from each document.
 - Finally, pass that `filtered_words` set into a list to use as a filter for the text vector.

In [104]:
def words_to_filter(vocab, original_vocab, vector, top_n):
    filter_list = []
    for i in range(0, vector.shape[0]):
    
        # Here we'll call the function from the previous exercise, and extend the list we're creating
        filtered = return_weights(vocab, original_vocab, vector, i, top_n)
        filter_list.extend(filtered)
    # Return the list in a set, so we don't get duplicate word indices
    return set(filter_list)

# Call the function to get the list of word indices
filtered_words = words_to_filter(vocab, tfidf_vec.vocabulary_, text_tfidf, 3)

# By converting filtered_words back to a list, we can use it to filter the columns in the text vector
filtered_text = text_tfidf[:, list(filtered_words)]

Excellent! In the next section, you'll train a model using the filtered vector.

### Training Naive Bayes with feature selection
Let's re-run the Naive Bayes text classification model we ran at the end of chapter 3, with our selection choices from the previous exercise, on the volunteer dataset's title and category_desc columns.

Instructions
100 XP
 - Use `train_test_split` on the `filtered_text` text vector, the `y` labels (which is the `category_desc` labels), and pass the y set to the `stratify` parameter, since we have an uneven class distribution.
 - Fit the `nb` Naive Bayes model to `train_X` and `train_y`.
 - Score the `nb` model on the `test_X` and `text_y` test sets.

In [None]:
# Split the dataset according to the class distribution of category_desc, using the filtered_text vector
train_X, test_X, train_y, test_y = train_test_split(filtered_text.toarray(), volunteer.category_desc, stratify=volunteer.category_desc)

# Fit the model to the training data
nb.fit(train_X, train_y)

# Print out the model's accuracy
print(nb.score(test_X, test_y))

In [111]:
import sklearn as sk

In [112]:
sk.__version__

'0.19.2'

In [114]:
filtered_text.toarray().shape

(665, 1061)

Awesome! You can see that our accuracy score wasn't that different from the score at the end of chapter 3. That's okay; the title field is a very small text field, appropriate for demonstrating how filtering vectors works.

## Dimensionality reduction - Video

### Using PCA
Let's apply PCA to the wine dataset, to see if we can get an increase in our model's accuracy.

Instructions

 - Set up the `PCA` object. You'll use PCA on the wine dataset minus its label for `Type`, stored in the variable wine_X.
 - Apply PCA to `wine_X` using `pca`'s `fit_transform` method and store the transformed vector in transformed_X.
 - Print out the `explained_variance_ratio_` attribute of pca to check how much variance is explained by each component.

In [115]:
from sklearn.decomposition import PCA

# Set up PCA and the X vector for diminsionality reduction
pca = PCA()
wine_X = wine.drop("Type", axis=1)

# Apply PCA to the wine dataset X vector
transformed_X = pca.fit_transform(wine_X)

# Look at the percentage of variance explained by the different components
print(pca.explained_variance_ratio_)

[9.98098798e-01 1.73593305e-03 9.43282757e-05 4.89438533e-05
 1.04695097e-05 5.60981698e-06 2.79968212e-06 1.44536313e-06
 9.75418873e-07 3.94184513e-07 2.13661389e-07 8.91974959e-08]


Excellent! In the next section you'll train a model using the PCA-transformed vector.

### Training a model with PCA
Now that we have run PCA on the wine dataset, let's try training a model with it.

Instructions

 - Split the `transformed_X` vector and the `y` labels set into training and test sets using `train_test_split`.
 - Fit the `knn` model using the `fit()` function on the `X_wine_train` and `y_wine_train` sets.
 - Print out the score using `knn`'s `score()` function on `X_wine_test` and `y_wine_test`.

In [120]:
y = wine["Type"]
y.shape

(178,)

In [121]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()

In [122]:
# Split the transformed X and the y labels into training and test sets
X_wine_train, X_wine_test, y_wine_train, y_wine_test = train_test_split(transformed_X, y)

# Fit knn to the training data
knn.fit(X_wine_train, y_wine_train)

# Score knn on the test data and print it out
knn.score(X_wine_test, y_wine_test)

0.7777777777777778

In [117]:
transformed_X.shape

(178, 12)

Excellent! In the next section you'll train a model using the PCA-transformed vector.

# 5. Putting it all together

### Checking column types
Take a look at the UFO dataset's column types using the dtypes attribute. Two columns jump out for transformation: the seconds column, which is a numeric column but is being read in as object, and the date column, which can be transformed into the datetime type. That will make our feature engineering efforts easier later on.

Instructions
100 XP
 - Print out the `dtypes` of the `ufo` dataset.
 - Change the type of the `seconds` column by passing the `float` type into the `astype()` method.
 - Change the type of the `date` column by passing `ufo["date"]` into the `pd.to_datetime()` function.
 - Print out the `dtypes` of the `seconds` and `date` columns, to make sure it worked.

In [124]:
ufo.head()

Unnamed: 0,date,city,state,country,type,seconds,length_of_time,desc,recorded,lat,long
0,11/3/2011 19:21,woodville,wi,us,unknown,1209600.0,2 weeks,Red blinking objects similar to airplanes or s...,12/12/2011,44.9530556,-92.291111
1,10/3/2004 19:05,cleveland,oh,us,circle,30.0,30sec.,Many fighter jets flying towards UFO,10/27/2004,41.4994444,-81.695556
2,9/25/2009 21:00,coon rapids,mn,us,cigar,0.0,,Green&#44 red&#44 and blue pulses of light tha...,12/12/2009,45.12,-93.2875
3,11/21/2002 05:45,clemmons,nc,us,triangle,300.0,about 5 minutes,It was a large&#44 triangular shaped flying ob...,12/23/2002,36.0213889,-80.382222
4,8/19/2010 12:55,calgary (canada),ab,ca,oval,0.0,2,A white spinning disc in the shape of an oval.,8/24/2010,51.083333,-114.083333


In [125]:
ufo.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4935 entries, 0 to 4934
Data columns (total 11 columns):
date              4935 non-null object
city              4926 non-null object
state             4516 non-null object
country           4255 non-null object
type              4776 non-null object
seconds           4935 non-null float64
length_of_time    4792 non-null object
desc              4932 non-null object
recorded          4935 non-null object
lat               4935 non-null object
long              4935 non-null float64
dtypes: float64(2), object(9)
memory usage: 424.2+ KB


In [127]:
# Check the column types
print(ufo.dtypes)

# Change the type of seconds to float
ufo["seconds"] = ufo["seconds"].astype(float)

# Change the date column to type datetime
ufo["date"] = pd.to_datetime(ufo["date"])

# Check the column types
print(ufo[["seconds", "date"]].dtypes)

date              datetime64[ns]
city                      object
state                     object
country                   object
type                      object
seconds                  float64
length_of_time            object
desc                      object
recorded                  object
lat                       object
long                     float64
dtype: object
seconds           float64
date       datetime64[ns]
dtype: object


Nice job on transforming the column types! This will make feature engineering and standardization easier.

### Dropping missing data
Let's remove some of the rows where certain columns have missing values. We're going to look at the length_of_time column, the state column, and the type column. If any of the values in these columns are missing, we're going to drop the rows.

Instructions
100 XP
 - Check how many values are missing in the length_of_time, state, and type columns, using isnull() to check for nulls and sum() to calculate how many exist.
 - Use boolean indexing to filter out the rows with those missing values, using notnull() to check the column. Here, we can chain together each column we want to check.
 - Print out the shape of the new ufo_no_missing dataset.

In [128]:
# Check how many values are missing in the length_of_time, state, and type columns
print(ufo[["length_of_time", "state", "type"]].isnull().sum())

# Keep only rows where length_of_time, state, and type are not null
ufo_no_missing = ufo[ufo["length_of_time"].notnull() & 
          ufo["state"].notnull() & 
          ufo["type"].notnull()]

# Print out the shape of the new dataset
print(ufo_no_missing.shape)

length_of_time    143
state             419
type              159
dtype: int64
(4283, 11)


Awesome! We'll work with this set going forward.

## Categorical variables and standardization - Video

In [131]:
ufo.shape

(4935, 11)

### Extracting numbers from strings
The length_of_time field in the UFO dataset is a text field that has the number of minutes within the string. Here, you'll extract that number from that text field using regular expressions.

Instructions
100 XP
 - Pass \d+ into re.compile() in the pattern variable to designate that we want to grab as many digits as possible from the string.
 - Into re.match(), pass the pattern we just created, as well as the time_string we want to extract from.
 - Use lambda within the apply() method to perform the extraction.
 - Print out the head() of both the length_of_time and minutes columns to compare.

In [145]:
def return_minutes(time_string):

    # Use \d+ to grab digits
    pattern = re.compile(r"\d+")
    
    # Use match on the pattern and column
    num = re.match(pattern, time_string)
    if num is not None:
        return int(num.group(0))
        
# Apply the extraction to the length_of_time column
ufo["minutes"] = ufo["length_of_time"].apply(lambda row: return_minutes(row))

# Take a look at the head of both of the columns
print(ufo[["length_of_time","minutes"]].head())

TypeError: expected string or bytes-like object

In [133]:
ufo.shape

(4935, 11)

Nice job. As you can see, we end up with some NaNs in the DataFrame. That's okay for now; we'll take care of those before modeling.

### Identifying features for standardization
In this section, you'll investigate the variance of columns in the UFO dataset to determine which features should be standardized. After taking a look at the variances of the seconds and minutes column, you'll see that the variance of the seconds column is extremely high. Because seconds and minutes are related to each other (an issue we'll deal with when we select features for modeling), let's log normlize the seconds column.

Instructions
100 XP
 - Use the var() method on the seconds and minutes columns to check the variance. Notice how high the variance is on the seconds column.
 - Using np.log() perform log normalization on the seconds column, transforming it into a new column named seconds_log.
 - Print out the variance of the seconds_log column.

In [144]:
# Check the variance of the seconds and minutes columns
print(ufo[["seconds", "minutes"]].var())

# Log normalize the seconds column
ufo["seconds_log"] = np.log(ufo["seconds"])

# Print out the variance of just the seconds_log column
print(ufo["seconds_log"].var())

KeyError: "['minutes'] not in index"

Good work. In the next section, we'll focus on engineering new features.

### Encoding categorical variables
There are couple of columns in the UFO dataset that need to be encoded before they can be modeled through scikit-learn. You'll do that transformation here, using both binary and one-hot encoding methods.

Instructions
100 XP
 - Using apply(), write a lambda that returns a 1 if the value is us, else return 0. This is something we learned in Chapter 3 if you need a refresher.
 - Next, print out the number of unique() values of the type column.
 - Using pd.get_dummies(), create a one-hot encoded set of the type column.
 - Finally, use pd.concat() to concatenate the ufo dataset to the type_set encoded variables.

In [135]:
# Use Pandas to encode us values as 1 and others as 0
ufo["country_enc"] = ufo["country"].apply(lambda value: 1 if value=="us" else 0)

# Print the number of unique type values
print(len(ufo["type"].unique()))

# Create a one-hot encoded set of the type values
type_set = pd.get_dummies(ufo["type"])

# Concatenate this set back to the ufo DataFrame
ufo = pd.concat([ufo, type_set], axis=1)

22


Awesome work! Let's continue on by extracting some date parts.

### Features from dates
Another feature engineering task to perform is month and year extraction. Perform this task on the date column of the ufo dataset.

Instructions
100 XP
 - Print out the head() of the date column.
 - Using apply(), lambda, and the .month attribute, extract the month from the date column.
 - Using apply(), lambda, and the .year attribute, extract the year from the date column.
 - Take a look at the head() of the date, month, and year

In [136]:
# Look at the first 5 rows of the date column
print(ufo["date"].head())

# Extract the month from the date column
ufo["month"] = ufo["date"].apply(lambda row: row.month)

# Extract the year from the date column
ufo["year"] = ufo["date"].apply(lambda row: row.year)

# Take a look at the head of all three columns
print(ufo[["date", "month", "year"]].head())

0   2011-11-03 19:21:00
1   2004-10-03 19:05:00
2   2009-09-25 21:00:00
3   2002-11-21 05:45:00
4   2010-08-19 12:55:00
Name: date, dtype: datetime64[ns]
                 date  month  year
0 2011-11-03 19:21:00     11  2011
1 2004-10-03 19:05:00     10  2004
2 2009-09-25 21:00:00      9  2009
3 2002-11-21 05:45:00     11  2002
4 2010-08-19 12:55:00      8  2010


Nice job on extracting dates! 'apply' and 'lambda' are extremely useful for extraction tasks.

### Text vectorization
Let's transform the desc column in the UFO dataset into tf/idf vectors, since there's likely something we can learn from this field.

Instructions
100 XP
 - Print out the head() of the ufo["desc"] column.
 - Set vec equal to the TfidfVectorizer() object.
 - Use vec's fit_transform() method on the ufo["desc"] column.
 - Print out the shape of the desc_tfidf vector, to take a look at the number of columns this created. The output is in the shape (rows, columns).

In [138]:
ufo = ufo[ufo.desc.notnull()]

In [139]:
# Take a look at the head of the desc field
print(ufo["desc"].head())

# Create the tfidf vectorizer object
vec = TfidfVectorizer()

# Use vec's fit_transform method on the desc field
desc_tfidf = vec.fit_transform(ufo["desc"])

# Look at the number of columns this creates
print(desc_tfidf.shape)

0    Red blinking objects similar to airplanes or s...
1                 Many fighter jets flying towards UFO
2    Green&#44 red&#44 and blue pulses of light tha...
3    It was a large&#44 triangular shaped flying ob...
4       A white spinning disc in the shape of an oval.
Name: desc, dtype: object
(4932, 6433)


Great! You'll notice that the text vector has a large number of columns. We'll work on selecting the features we want to use for modeling in the next section.

### Feature selection and modeling - Video

### Selecting the ideal dataset
Let's get rid of some of the unnecessary features. Because we have an encoded country column, country_enc, keep it and drop other columns related to location: city, country, lat, long, state.

We have columns related to month and year, so we don't need the date or recorded columns.

We vectorized desc, so we don't need it anymore. For now we'll keep type.

We'll keep seconds_log and drop seconds and minutes.

Let's also get rid of the length_of_time column, which is unnecessary after extracting minutes.

Instructions

 - Use .corr() to run the correlation on seconds, seconds_log, and minutes in the ufo DataFrame.
 - Make a list of columns to drop, in alphabetical order.
 - Use to_drop() to drop the columns.
 - Use the words_to_filter() function we created previously. Pass in vocab, vec.vocabulary_, desc_tfidf, and let's keep the top 4 words as the last parameter.

In [None]:
# Check the correlation between the seconds, seconds_log, and minutes columns
print(ufo[["seconds", "seconds_log", "minutes"]].corr())

# Make a list of features to drop
to_drop = ["city", "country","date", "desc", "lat", "length_of_time","long","minutes","recorded", "seconds","state"]

# Drop those features
ufo_dropped = ufo.drop(to_drop, axis=1)

# Let's also filter some words out of the text vector we created
filtered_words = words_to_filter(vocab, vec.vocabulary_, desc_tfidf, 4)

Great job! You're almost done. In the next exercises, we'll try modeling the UFO data in a couple of different ways.

### Modeling the UFO dataset, part 1
In this exercise, we're going to build a k-nearest neighbor model to predict which country the UFO sighting took place in. Our X dataset has the log-normalized seconds column, the one-hot encoded type columns, as well as the month and year when the sighting took place. The y labels are the encoded country column, where 1 is us and 0 is ca.

Instructions
100 XP
 - Print out the .columns of the X set.
 - Split up the X and y sets using train_test_split(). Pass the y set to the stratify= parameter, since we have imbalanced classes here.
 - Use fit() to fit train_X and train_y.
 - Print out the .score() of the knn model on the test_X and test_y sets.


In [143]:
ufo.columns

Index(['date', 'city', 'state', 'country', 'type', 'seconds', 'length_of_time',
       'desc', 'recorded', 'lat', 'long', 'country_enc', 'changing', 'chevron',
       'cigar', 'circle', 'cone', 'cross', 'cylinder', 'diamond', 'disk',
       'egg', 'fireball', 'flash', 'formation', 'light', 'other', 'oval',
       'rectangle', 'sphere', 'teardrop', 'triangle', 'unknown', 'month',
       'year'],
      dtype='object')

In [142]:
X_columns = ['seconds_log', 'changing', 'chevron', 'cigar', 'circle', 'cone',
       'cross', 'cylinder', 'diamond', 'disk', 'egg', 'fireball', 'flash',
       'formation', 'light', 'other', 'oval', 'rectangle', 'sphere',
       'teardrop', 'triangle', 'unknown', 'month', 'year']
y_columns = 'country_enc'

In [141]:
# Take a look at the features in the X set of data
print(X.columns)

# Split the X and y sets using train_test_split, setting stratify=y
train_X, test_X, train_y, test_y = train_test_split(X, y, stratify=y)

# Fit knn to the training sets
knn.fit(train_X, train_y)

# Print the score of knn on the test sets
print(knn.score(test_X, test_y))

NameError: name 'X' is not defined

Awesome work! This model performs pretty well. It seems like we've made pretty good feature selection choices here.

### Modeling the UFO dataset, part 2
Finally, let's build a model using the text vector we created, desc_tfidf, using the filtered_words list to create a filtered text vector. Let's see if we can predict the type of the sighting based on the text. We'll use a Naive Bayes model for this.

Instructions
100 XP

 - On the desc_tfidf vector, filter by passing a list of filtered_words into the index.
 - Split up the X and y sets using train_test_split(). Remember to convert filtered_text using toarray(). Pass the y set to the stratify= parameter, since we have imbalanced classes here.
 - Use the nb model's fit() to fit train_X and train_y.
 - Print out the .score() of the nb model on the test_X and test_y sets.

In [None]:
# Use the list of filtered words we created to filter the text vector
filtered_text = desc_tfidf[:, list(filtered_words)]

# Split the X and y sets using train_test_split, setting stratify=y 
train_X, test_X, train_y, test_y = train_test_split(filtered_text.toarray(), y, stratify = y)

# Fit nb to the training sets
nb.fit(train_X, train_y)

# Print the score of nb on the test sets
nb.score(test_X, test_y)

Congrats, you've completed the course! As you can see, this model performs very poorly on this text data. This is a clear case where iteration would be necessary to figure out what subset of text improves the model, and if perhaps any of the other features are useful in predicting type.