![ine-divider](https://user-images.githubusercontent.com/7065401/92672068-398e8080-f2ee-11ea-82d6-ad53f7feb5c0.png)
<hr>

# Feature Engineering

## Categorical features and dirty cats

In this project, you will work with dirty cats again to practice the techniques you learned in previous lessons.

**Remember**: always learn your transformations from the training data, then apply them to the test data. Let's practice applying feature engineering to train and test sets.

In [None]:
# Import necessary packages
import numpy as np
import pandas as pd

In [None]:
# Read in dirty_cats.csv


![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

## How do we get dummies from our train set and apply them to our test set?

In [None]:
# Let's work with our nom_0 column again
# Get dummies for nom_0


In [None]:
# Check out the dummy columns


---
### nom_0 has 3 categories: Green, Blue and Red
But the dummies have only 2 of the 3, because we dropped a column.

> How do we make sure our test dummies have the same 2 out of 3 columns?
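
To see why only 2 of the 3 categories survive, here is a minimal sketch on a toy Series. It is not the real `dirty_cats.csv` data, and it assumes the column was dropped with `drop_first=True`, which removes the first category in sorted order.

```python
import pandas as pd

# Toy stand-in for the nom_0 column (assumption: not the real dataset)
nom_0 = pd.Series(["Green", "Blue", "Red", "Green"], name="nom_0")

# Full one-hot encoding: one column per category, in sorted order
full = pd.get_dummies(nom_0)

# drop_first=True removes the first category in sorted order ("Blue")
reduced = pd.get_dummies(nom_0, drop_first=True)

print(list(full.columns))     # ['Blue', 'Green', 'Red']
print(list(reduced.columns))  # ['Green', 'Red']
```
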

In [None]:
# Split the data into train and test
# There's no target in this toy dataset.  We can just split the data randomly
# Select 80% of the data randomly for train
msk = np.random.rand(len(df_dirty)) < 0.8
# Pull the train data out
df_train = df_dirty[msk].copy()
# Pull the test data out
df_test = df_dirty[~msk].copy()

In [None]:
# Learn our dummy columns from our train data
# Use pd.get_dummies to get dummy columns for train data

# Save the columns from the dummy data frame as a list
train_dummies = #fill in your code here

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

## Checking dummy columns

There are 2 things you need to check when applying the dummies you learned from train to the test data:
1. What dummy columns (categories) are missing from test that were in train?
    - We'll add these and fill the values with 0
2. What dummy columns (categories) are in test that were not in train?
    - We'll drop these
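
The two checks can be sketched on a small example. The toy values below are hypothetical (including the unseen `"Yellow"` category), and the train dummies are assumed to come from `pd.get_dummies` with `drop_first=True` applied to a Series, so the column names are the category names themselves.

```python
import pandas as pd

# Hypothetical toy split: train never sees "Yellow", test never sees "Red"
train = pd.Series(["Green", "Blue", "Red", "Green"])
test = pd.Series(["Blue", "Green", "Yellow"])

# Columns learned from train (the first category, "Blue", is dropped)
train_cols = list(pd.get_dummies(train, drop_first=True).columns)

train_set = set(train_cols)
test_set = set(test.unique())

# 1. In train but missing from test: add to test, filled with 0
to_add = train_set - test_set
# 2. In test but not in train: drop from test
to_remove = test_set - train_set

print(sorted(to_add))     # ['Red']
print(sorted(to_remove))  # ['Blue', 'Yellow']
```

Note that `"Blue"` lands in the set to remove even though train contains it: it was the dropped baseline category, so it must not appear as a column in test either.
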

In [None]:
# Let's use set operations to get the sets we need to answer the questions above
# Make the train_dummies column list into a set
train_set = set(train_dummies)
# Get the unique categories from test and create a set
# (NaN is skipped: get_dummies makes no column for it by default)
test_set = set(df_test.nom_0.dropna().unique().tolist())
# cols to add exist in train, but not in test
cols_to_add = train_set.difference(test_set)
# cols to remove exist in test but not in train
cols_to_remove = test_set.difference(train_set)

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

## Now that we know what columns to look for, let's apply what we know to our test set

In [None]:
# One-hot encode the test set (we want to start with all of the columns)
df_test_onehot = #fill in your code here
# Add any cols that are missing -> fill values with 0
for col in cols_to_add:
    df_test_onehot[col] = 0
# Remove any cols that weren't in train
df_test_dummies = df_test_onehot.drop(columns=list(cols_to_remove))

# Check that the width (number of columns) of train dummies and test dummies match


![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

## One final thing

When the numbers of columns match, you have one more thing to check: that the columns are in the same order.
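
As an alternative to adding, dropping, and reordering columns by hand, pandas' `DataFrame.reindex` can do all three in one call. This is a sketch on hypothetical toy data, not the approach this notebook asks you to implement; `dtype=int` is passed only so that every column holds 0/1 integers.

```python
import pandas as pd

# Hypothetical toy data standing in for the train and test splits
train = pd.Series(["Green", "Blue", "Red", "Green"])
test = pd.Series(["Yellow", "Green", "Blue"])

# Learn the dummy columns from train only
train_dummies = pd.get_dummies(train, drop_first=True, dtype=int)
train_cols = list(train_dummies.columns)  # ['Green', 'Red']

# Full one-hot encoding of test, then conform it to the train columns:
# missing columns are created and filled with 0, extra columns are dropped,
# and the result follows the order of train_cols
test_dummies = pd.get_dummies(test, dtype=int).reindex(
    columns=train_cols, fill_value=0
)

print(list(test_dummies.columns))  # ['Green', 'Red']
```
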

In [None]:
# We already have an ordered list of columns from train in train_dummies
# Let's apply that column order to test dummies
df_test_dummies = df_test_dummies[train_dummies]

# Check that the columns in train and test match
for train_col, test_col in zip(df_train_dummies.columns, df_test_dummies.columns):
    print(train_col, '<===>', test_col)

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

## Incongruent data labeling

It looks like the labels in this data are inconsistent.

> Go back and fix the labeling, then redo the dummy columns.
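
The cleanup can be sketched as two steps: make the case uniform, then map known misspellings to their correct labels. The toy values below are hypothetical (including `"purple"`); the mapping reuses the misspellings this notebook corrects. One thing worth knowing is that `.map` with a dict returns `NaN` for any value that has no entry, which is why correct spellings are mapped to themselves.

```python
import pandas as pd

# Hypothetical toy values showing mixed case and misspellings
train = pd.Series(["Green", "bule", "RED", "geren", "Blue"])
test = pd.Series(["rde", "GREEN", "purple"])

# Mapping of misspellings to labels; correct spellings map to themselves
nom_0_map = {"bule": "blue", "rde": "red", "geren": "green",
             "blue": "blue", "red": "red", "green": "green"}

# Lowercase first, then map; the same steps are applied to both splits
train_clean = train.str.lower().map(nom_0_map)
test_clean = test.str.lower().map(nom_0_map)

print(train_clean.tolist())     # ['green', 'blue', 'red', 'green', 'blue']
print(test_clean.isna().sum())  # 1, because "purple" has no entry in the map
```
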

In [None]:
# Check unique categories in nom_0 using value_counts()


In [None]:
# Looks like making case uniform will solve part of the problem
# Apply .lower() to train and test


In [None]:
# Have another look at the train categories


In [None]:
# Create a mapping using train data only to correct spelling
nom_0_map = {'bule':'blue','rde':'red','geren':'green','blue':'blue','red':'red','green':'green'}
# Use .map to apply the mapping to train data

# Use .map to apply the mapping to test data


In [None]:
# Check categories in train


In [None]:
# Now that we've cleaned up our labeling, let's make our dummy columns again
# Learn our dummy columns from our train data

# keep the dummy cols for use on test data


In [None]:
# Assuming we don't know that our test categories match our train categories 
# (this should always be your assumption)
# Get the cols to check
# Let's use set operations to get the sets we need to answer the questions above
# Make the train_dummies column list into a set

# Get the unique categories from test and create a set

# cols to add exist in train, but not in test

# cols to remove exist in test but not in train


In [None]:
# One-hot encode the test set (we want to start with all of the columns)

# Add any cols that are missing


# Remove any cols that weren't in train


# Check that the width of train dummies and test dummies match


In [None]:
# Make sure the order of the test columns matches the order of the train columns


In [None]:
# Have a look at the train data


In [None]:
# Have a look at the test data


![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

## They match!

You've successfully cleaned up labeling and created matching dummy columns for train and test!

<div style="position: relative;">
<img src="https://user-images.githubusercontent.com/7065401/98729912-57be3e80-237a-11eb-80e4-233ac344b391.png"></img>
</div>