![ine-divider](https://user-images.githubusercontent.com/7065401/92672068-398e8080-f2ee-11ea-82d6-ad53f7feb5c0.png)
<hr>

# Feature Engineering

## Feature Engineering on Census Data

In this project, you will work with Census data from 1994 to put into practice the techniques you learned in previous lessons.


![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

## What we know about the data

The 1994 Census Income dataset has **48,842 entries** (split into train and test). Each entry contains the following information about an individual:
- **age**: the age of an individual
- **workclass**: a general term to represent the employment status of an individual
- **fnlwgt**: final weight. In other words, this is the number of people the census believes the entry represents.
- **education**: the highest level of education achieved by an individual.
- **education_num**: the highest level of education achieved in numerical form.
- **marital_status**: marital status of an individual. Married-civ-spouse corresponds to a civilian spouse while Married-AF-spouse is a spouse in the Armed Forces.
- **occupation**: the general type of occupation of an individual
- **relationship**: represents what this individual is relative to others. For example an individual could be a Husband. Each entry only has one relationship attribute.
- **race**: Descriptions of an individual’s race
- **sex**: the biological sex of the individual
- **capital_gain**: capital gains for an individual
- **capital_loss**: capital loss for an individual
- **hours_per_week**: the hours an individual has reported to work per week
- **native_country**: country of origin for an individual
- **income_bracket**: whether or not an individual makes more than $50,000 annually.

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

## Initial EDA

In [None]:
# Import pandas and alias as pd
import pandas as pd

# Use this to view all of your data
pd.set_option("display.max_rows", None, "display.max_columns", None)

In [None]:
# Pull in the train set datasets/wage_train.csv

# Look at the first few rows


In [None]:
# Pull in the test set, apply the same processes as train, then predict

# Look at the first few rows


---
### Remember that:

- We **develop from train**.  
- We **apply to test**.

In [None]:
# First step in exploration
# Use .describe to look at the data


![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

## What we get from the data description

- It seems like fnlwgt isn't really a characteristic of the individuals
- Also, since the Census Bureau assigns that value, we won't likely have it when making predictions on unseen data.

> Let's remove that feature

In [None]:
# Remove the fnlwgt feature from train and test



---
### List all binary columns and indicate if the classes are balanced

In [None]:
# your answer goes here


---
### List all nominal columns and indicate the majority class


In [None]:
# your answer goes here


---
### List all ordinal columns and indicate the majority class

In [None]:
# your answer goes here


> **Note**: occupation could have a somewhat ordinal quality to it, but it's not clear how to order some of the occupations.  It also seems as if the ordinal quality may be more related to work/life balance, income, or some combination of the two.  In the absence of a good way to rank occupations, we will treat them as nominal.

---
### List any cyclical or date columns


In [None]:
# your answer goes here


---
### List all continuous columns with basic stats (mean, std, min and max)


In [None]:
# your answer goes here


![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

## What else can we get from the description?

- That 'education_num' is the numeric representation of 'education'
- We can drop 'education' since it is ordinal and already numerically represented by 'education_num'
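
Before dropping `education`, it can help to confirm that each label really maps to a single `education_num` code. The following is a minimal sketch on a made-up frame; the `toy` values are illustrative assumptions, not rows from the census files.

```python
import pandas as pd

# Toy data standing in for the census frame (values are illustrative only)
toy = pd.DataFrame({
    "education": ["Bachelors", "HS-grad", "Bachelors", "Masters", "HS-grad"],
    "education_num": [13, 9, 13, 14, 9],
})

# Count how many distinct numeric codes each education label maps to
codes_per_label = toy.groupby("education")["education_num"].nunique()

# True when every label maps to exactly one numeric code
is_one_to_one = bool((codes_per_label == 1).all())
print(is_one_to_one)
```

If the check comes back `False` on the real data, the two columns are not interchangeable and dropping `education` would lose information.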

In [None]:
# Drop education from train and test


---
### Final piece of information we get from the description

There are a few serious majority categories:

   - United States has very nearly ALL of the native_country examples.  This is a serious majority category. (native_country feature) 
   - White is the same. (race feature)
   - Private is the same. (workclass feature)
   - Married-civ-spouse isn't as big, but we'll remove it too.  It can help with collinearity.
   
When we get dummies for these variables, instead of using drop_first we'll want to drop the majority categories (such as United States and White) specifically.
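
One way to drop a chosen category instead of relying on `drop_first` is to create every dummy column and then drop the majority ones by name. This is a sketch on a made-up frame: the category spellings (for example `United-States`) and the default pandas column names are assumptions, so check them against your own `.columns` output.

```python
import pandas as pd

# Toy data standing in for the census frame (values are illustrative only)
toy = pd.DataFrame({
    "race": ["White", "Black", "White", "Asian-Pac-Islander"],
    "native_country": ["United-States", "United-States", "Mexico", "United-States"],
})

# Create one dummy column per category, keeping all of them for now
dummies = pd.get_dummies(toy, columns=["race", "native_country"], dtype=int)

# Drop the majority categories by name instead of using drop_first=True
majority_cols = ["race_White", "native_country_United-States"]
dummies = dummies.drop(columns=majority_cols)

print(list(dummies.columns))
```

Dropping by name makes the reference category explicit, whereas `drop_first=True` drops whichever category happens to sort first.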


![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

## Missing data

We don't see any nulls using .describe().  Are there any **hidden nulls** in this data?
What kind of features would we check for hidden nulls?

This will require a little manual work.
> Use .value_counts() to check for hidden nulls.  I've done the first one for you.

In [None]:
# Check for hidden nulls in workclass


In [None]:
# We have a hidden null in workclass.  
# Let's replace it with a true null and figure out what to do with it later.
df_train.loc[df_train.workclass=='?','workclass'] = None
# Check results using value_counts()


---
### Hmmmm.  That didn't work.

Why isn't that working?  

> Let's have a different look at the values with .unique()

In [None]:
# Look at the values in workclass with the unique() method


![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

## There are some extra spaces in those strings.

> Let's clean that up -> it will make life easier later.
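
The effect of the extra spaces can be sketched on a made-up frame. The `toy` data and the variable names below are illustrative assumptions; the same steps would be applied to both train and test.

```python
import numpy as np
import pandas as pd

# Toy data with a leading space in every string, as in the raw file
toy = pd.DataFrame({
    "workclass": [" Private", " ?", " State-gov", " Private"],
    "age": [39, 50, 38, 53],
})

# Comparing against "?" finds nothing because the stored value is " ?"
matches_before = int((toy["workclass"] == "?").sum())

# Strip leading and trailing whitespace from every string column
str_cols = [c for c in toy.columns if pd.api.types.is_string_dtype(toy[c])]
for col in str_cols:
    toy[col] = toy[col].str.strip()

# Now the comparison matches the hidden null
matches_after = int((toy["workclass"] == "?").sum())

# Replace the placeholder with a true null
toy["workclass"] = toy["workclass"].replace("?", np.nan)
n_null = int(toy["workclass"].isna().sum())

print(matches_before, matches_after, n_null)
```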

In [None]:
# Clean all of the extra white space from object (string value) columns using the string method .strip()




# Check results


In [None]:
# Now let's do the same to test


# Check results

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

## Now that that's all clean:
> Let's try to fix the nulls again

In [None]:
# Let's replace the ? in workclass with a true null

# Now do the same to test

# Check results


![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

## What else do we get from our .describe() above that can help us here?

- We see that there are no negative values.  We don't need to check for -1 representation of nulls.

> Let's check the remaining nominal features for hidden nulls and fix them. (Create as many new cells as you need to do this.)

In [None]:
# Create a list of nominal features


In [None]:
# Check each of the nominal features for hidden nulls using .value_counts()
# If you find hidden nulls, convert them to true nulls


In [None]:
# your code goes here


#### Did you check every nominal feature?

#### Did you apply any necessary changes to both train and test data?


![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

## What do we do with the null values?

> Let's see how much data we lose if we remove all rows with null values

In [None]:
# Try removing all rows with nulls (you probably want to save the result to a new object)


# How much of the data was removed?


![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

## If removing all null rows removes less than 20% of the data:

Go ahead and remove all null rows for now. We can come back in a future iteration and spend more time here if model performance isn't as high as we would like for it to be.
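
A minimal sketch of measuring how much data `dropna()` removes, on a made-up frame (the `toy` values and the 20% threshold variable are illustrative assumptions based on the rule of thumb above):

```python
import numpy as np
import pandas as pd

# Toy data: 10 rows, one of which has nulls
toy = pd.DataFrame({
    "workclass": ["Private"] * 9 + [np.nan],
    "occupation": ["Sales"] * 9 + [np.nan],
    "age": list(range(30, 40)),
})

# Save the result to a new object so the original stays available
toy_no_nulls = toy.dropna()

# Fraction of rows lost by dropping every row that contains a null
frac_removed = 1 - len(toy_no_nulls) / len(toy)

# Rule of thumb from this notebook: dropping is acceptable below 20%
ok_to_drop = frac_removed < 0.20

print(round(frac_removed, 3), ok_to_drop)
```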

In [None]:
# Remove all rows with null values in train and test


# Check the new length of your train and test sets


![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

## Convert string values to numeric

**Remember**: all of our values have to be numeric for a machine learning algorithm to accept and understand them.
> Let's convert all of our string values to numeric now.

---
### Nominal to numeric

In [None]:
# Let's change all of our nominal features to numeric format
# Use the shortcut method get_dummies to get dummy columns for train data
# Remember we want to pick the category to drop with race and native_country




# Drop the White dummy column and United States dummy col

# We can get a full list of columns to match to test using .columns

In [None]:
# Now we apply this to our test set.  
# Do NOT drop_first here.  We'll use our train column list to match up the data



---
### Did you get an error?

I did. Looks like we have a feature in train that isn't in test.

We'll need to add it to test.
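
As an alternative to adding the missing column by hand, `DataFrame.reindex` can align test to the train column list in one step. This is a sketch on made-up frames, not the approach the exercise cells walk through; the column names are illustrative assumptions.

```python
import pandas as pd

# Toy dummy frames: train has a category that never appears in test
train_dummies = pd.DataFrame({
    "nom_dum_Mexico": [0, 1, 0],
    "nom_dum_Holand-Netherlands": [1, 0, 0],
    "nom_dum_Canada": [0, 0, 1],
})
test_dummies = pd.DataFrame({
    "nom_dum_Canada": [1, 0],
    "nom_dum_Mexico": [0, 1],
})

# Columns that train has but test lacks
missing_in_test = [c for c in train_dummies.columns if c not in test_dummies.columns]

# Reorder test to match train and fill any missing dummy column with 0s
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(missing_in_test)
print(list(test_aligned.columns))
```

Note that `reindex` also silently drops any column that exists only in test, so it is worth printing both column lists before relying on it.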

In [None]:
# the Holand-Netherlands feature exists in train but not in test.  We'll need to add that column to our test data
# It's a binary column.  We'll fill it with all 0s
# Make sure the name matches exactly what you see in the KeyError

# Now match up the train and test cols


In [None]:
# Compare the columns in train dummies to test dummies
for train_col,test_col in zip(df_train_dummies.columns,df_test_dummies.columns):
    print(train_col,'<---->',test_col)

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

## If your columns match up, move on!  If not, you'd better find the bug now.

---
### Binary to numeric

In [None]:
# Create a list of binary features

# Create mappings for binary cols
# Keep information needed to apply this to test data







In [None]:
# Convert train binary to numeric
# Copy your train data into a new data frame 'df_train_binary' (just trust me on this one)



In [None]:
# Check results with .head()


In [None]:
# Now let's apply the mappings to our test set
# Don't worry if these results look funny.  Try to apply the mappings and then we'll figure out
# what is going on here


![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

## Do your results on test look wrong?

That's ok, mine do too.

> Let's figure out why

In [None]:
# Why didn't our mapping work on income_bracket?
# Let's have a look at our test data before dealing with binary


In [None]:
# What did our train feature look like?


In [None]:
# The test values have '.' at the end.  
# Let's clean this up and try again



In [None]:
# Now apply the mappings to our test set





---
### Phew!  Glad we cleared that up!

You'll spend the majority of your time cleaning up your data like this.  It's not fun, but it must be done.  I've never worked with any company that has perfectly clean data.
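
The mapping problem above can be reproduced on made-up data. In this sketch the label strings and the `income_map` dictionary are assumptions for illustration; use the exact values you see in your own train and test columns.

```python
import pandas as pd

# Mapping developed from the train labels (assumed spellings)
income_map = {"<=50K": 0, ">50K": 1}

# Toy test labels carrying a trailing period
test_income = pd.Series(["<=50K.", ">50K.", ">50K."])

# Labels that are not keys in the mapping come back as NaN
mapped_raw = test_income.map(income_map)
n_unmapped = int(mapped_raw.isna().sum())

# Remove the trailing period, then apply the same mapping again
test_clean = test_income.str.rstrip(".")
mapped_clean = test_clean.map(income_map)

print(n_unmapped)
print(mapped_clean.tolist())
```

Counting NaN values right after a `.map()` call is a cheap way to catch labels that did not match.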

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

## Pulling all of the data together

Now that we have our dummies, binary mappings and original continuous data, we need to put it together.

> Let's use pd.concat() to pull all of the pieces of data together
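
A minimal sketch of joining the pieces column-wise, on made-up frames (all names and values are illustrative assumptions). `pd.concat(..., axis=1)` aligns rows by index, so pieces whose indexes differ produce extra rows filled with nulls.

```python
import pandas as pd

# Toy pieces that share the same index (row 1 was dropped earlier for nulls)
dummies = pd.DataFrame({"nom_dum_Mexico": [0, 1, 0]}, index=[0, 2, 3])
binary = pd.DataFrame({"sex": [1, 0, 1]}, index=[0, 2, 3])
continuous = pd.DataFrame({"age": [39, 38, 53]}, index=[0, 2, 3])

# Matching indexes: rows line up and no nulls are introduced
combined = pd.concat([dummies, binary, continuous], axis=1)

# Mismatched indexes: one piece was reset, so rows no longer line up
misaligned = pd.concat([dummies, continuous.reset_index(drop=True)], axis=1)

print(combined.shape, int(combined.isna().sum().sum()))
print(len(misaligned), bool(misaligned.isna().any().any()))
```

A row count that grows after the concat is usually a sign that the indexes of the pieces do not match.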

In [None]:
# If df_train isn't already in one dataframe, join the pieces now


In [None]:
# For test, you need to pull the dummies df and the binary df together


---
### Check the lengths of your train and test sets

#### If the length of your train data is not the same across all of your train sets, something is wrong.  Go back and look for a bug.

#### The length of your test sets should all match each other as well.

#### Test length should be a small percent of your train length.  These should not match.

In [None]:
# Combine all of the relevant columns
# List all columns we want to keep from the df




# for the train data . . .
# concat together the dummies, the binary columns and the relevant continuous columns



In [None]:
# Now do the same for test




In [None]:
# Do another check to make sure your train columns match your test columns
for train_col,test_col in zip(df_train_all.columns,df_test_all.columns):
    print(train_col,'<---->',test_col)

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

## Collinearity

**Remember**: we want our features to be correlated with the target -> but not with each other

Now that we've taken a peek at correlation, skew and outliers:
> Let's check for collinearity

In [None]:
# Use statsmodels variance_inflation_factor to check for collinearity
from statsmodels.stats.outliers_influence import variance_inflation_factor

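def calc_vif(X):
    """Return a DataFrame listing the VIF of each column in the feature frame X."""
    vif = pd.DataFrame()
    vif["variables"] = X.columns
    # VIF of column i comes from regressing column i on all of the other columns
    vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

    return vif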

In [None]:
# Check train features for collinearity
# Get a list of feature columns

# Separate features from the target into X_train_feat


In [None]:
# Use the VIF function to calculate VIF on X_train_feat


---
### Remember! VIF greater than 10 is an issue.

For now, remove most columns with VIF greater than 10.  In a future iteration, you can derive or engineer some features from the collinear columns and try to improve model performance.

You can leave a few and re-run VIF to see if it alleviates collinearity.
**Hint**: You could use a correlation score with the collinear features and the target to determine which are the most important.
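
The hint can be sketched with `DataFrame.corrwith`. The `toy` frame and the `collinear_cols` list below are made-up assumptions; on the real data the list would come from your VIF results.

```python
import pandas as pd

# Toy data standing in for the combined train frame (values are illustrative)
toy = pd.DataFrame({
    "education_num": [9, 13, 14, 9, 10, 16, 7, 13],
    "hours_per_week": [40, 45, 50, 38, 40, 60, 20, 40],
    "age": [25, 38, 45, 52, 23, 47, 19, 33],
    "income_bracket": [0, 1, 1, 0, 0, 1, 0, 1],
})

# Hypothetical list of columns flagged by VIF
collinear_cols = ["education_num", "hours_per_week", "age"]

# Correlation of each flagged column with the target
target_corr = toy[collinear_cols].corrwith(toy["income_bracket"])

# Rank by absolute value so strong negative correlations are not overlooked
ranked = target_corr.abs().sort_values(ascending=False)

print(ranked)
```

Columns near the bottom of the ranking are the first candidates to drop before re-running VIF.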

In [None]:
# Make a list of collinear columns







# Check collinearity of these features with the target


---
### How to think about the removal

- age and education probably have some relation.  Let's drop age since it's a little less correlated with the target.
- many of the marital-status categories are some version of not married - let's combine them into one group called not_married
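
Combining several dummy columns into one can be sketched as a row-wise maximum, which stays 0/1 even if more than one source column is set for a row. The `toy` values are made up; the column names follow the ones used in this notebook.

```python
import pandas as pd

# Toy dummy columns (values are illustrative only)
toy = pd.DataFrame({
    "nom_dum_Divorced": [1, 0, 0, 0],
    "nom_dum_Never-married": [0, 1, 0, 0],
    "nom_dum_Unmarried": [0, 1, 0, 1],
})

parts = ["nom_dum_Unmarried", "nom_dum_Divorced", "nom_dum_Never-married"]

# 1 if any of the source columns is 1, otherwise 0
toy["nom_dum_Not-married"] = toy[parts].max(axis=1)

# Share of rows marked 1 in the new feature
pct_not_married = toy["nom_dum_Not-married"].mean()

print(toy["nom_dum_Not-married"].tolist(), pct_not_married)
```

Summing the columns instead would produce a 2 for the second row, so the maximum is the safer choice when the source columns can overlap.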

In [None]:
# Create a new feature called nom_dum_Not-married that combines:
# nom_dum_Unmarried, nom_dum_Divorced and nom_dum_Never-married

# Figure out what percent of the new feature is marked 1 (indicating not married)


In [None]:
# Make a list of columns to remove for collinearity based on the above

# Drop those columns

# Run VIF again

# This is an iterative process so you might have to do this several times to refine your cols_to_remove list.

---
### education_num and hours_per_week are still collinear.  
education_num has the highest correlation with the target.  We'll drop hours_per_week.

In [None]:
# Drop hours_per_week from your features

# Run VIF again


In [None]:
# Update your list of columns to remove to include hours_per_week


In [None]:
# Create the derived column (nom_dum_Not-married) in train and test


# When you have a final list, drop columns from your train and test data



![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

## Outliers, Skew, Transform, Scale

Once that lines up, let's check the correlation of the features to the target.

The target in this data is income_bracket.

> Let's check the correlation of the features with the target

In [None]:
# Check the correlation of all of the columns with the target
# Because there are a lot of columns, it might help to order them using .sort_values


In [None]:
# Check skew in feature columns


---
### There are too many columns to really consider all at once

Many of these skewed columns are imbalanced binary columns (from dummy encoding)

> Let's just check correlation and skew on our continuous columns (you can iterate on this later)
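
One possible way to separate the non-binary columns is to count distinct values per column. This sketch uses a made-up frame, and the "more than two unique values" rule is an assumption that may need adjusting for your data.

```python
import pandas as pd

# Toy data: two continuous columns, one dummy column and the binary target
toy = pd.DataFrame({
    "capital_gain": [0, 0, 0, 0, 0, 2174, 5178, 99999],
    "education_num": [9, 13, 14, 9, 10, 16, 7, 13],
    "nom_dum_Mexico": [0, 0, 1, 0, 0, 0, 0, 0],
    "income_bracket": [0, 1, 1, 0, 0, 1, 0, 1],
})

# Treat any column with more than two distinct values as non-binary
non_binary = [c for c in toy.columns if toy[c].nunique() > 2]

# Skew of the non-binary columns only
skew = toy[non_binary].skew()

print(non_binary)
print(skew)
```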


In [None]:
# Make a list of non_binary columns


# Check skew on the non_binary columns


In [None]:
# Check the correlation of the non-binary columns and the target


![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

## Visualization

In [None]:
# Visualize each non_binary column using .hist() and sns.boxplot() 
# When you look at the results, consider whether the data is skewed or normally distributed.

# Visualize capital_gain with .hist()


In [None]:
import seaborn as sns
# Visualize the data in a boxplot to get a better understanding of outliers and skew


In [None]:
# visualize the rest of the non-binary columns with .hist() and sns.boxplot()
# Create as many new cells as you need.

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

## How to deal with skew
Should you remove outliers in this data?


In [None]:
# your answer goes here


---
### If the data is skewed, transform it here

In [None]:
# Import power transformer from sklearn


# Instantiate a PowerTransformer instance


# Split features from the target in both train and test







# Fit on train data and transform train data


# Transform test data
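
The fit-on-train, transform-test pattern can be sketched on synthetic data. The skewed `capital_gain` values below are randomly generated for illustration and are not the census values.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

# Synthetic right-skewed feature for train and test (illustrative only)
rng = np.random.default_rng(0)
X_train = pd.DataFrame({"capital_gain": rng.exponential(scale=1000, size=200)})
X_test = pd.DataFrame({"capital_gain": rng.exponential(scale=1000, size=50)})

# Default settings: Yeo-Johnson method with standardized output
pt = PowerTransformer()

# Learn the transform parameters from train only, then transform train
X_train_pt = pd.DataFrame(
    pt.fit_transform(X_train), columns=X_train.columns, index=X_train.index
)

# Apply the parameters learned from train to test (no refitting)
X_test_pt = pd.DataFrame(
    pt.transform(X_test), columns=X_test.columns, index=X_test.index
)

skew_before = X_train["capital_gain"].skew()
skew_after = X_train_pt["capital_gain"].skew()
print(skew_before, skew_after)
```

Calling `fit_transform` on test as well would leak information from the test set into the preprocessing, which is why test only ever gets `transform`.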


---
### We have a LOT of columns!  Let's look at the shape of our data


In [None]:
# Use the shape method to get the number of columns


![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

## Let's reduce the dimensionality using PCA

In [None]:
# Remember to scale data prior to PCA
# Use StandardScaler to scale the features


# Instantiate a StandardScaler instance

# Fit on train and then transform train

# Transform Test


In [None]:
# Perform PCA on the features
# Import PCA from sklearn


# Instantiate a PCA instance
# Typically, we want the explained variance to be between 95–99%

# Put the results in a dataframe format that's easier to read




In [None]:
# How many columns did we have prior to PCA?


In [None]:
# How many columns do we have after PCA?


In [None]:
# Apply PCA to the test set







In [None]:
# Check that test has the same number of columns as train
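
The scale-then-PCA steps above can be sketched on synthetic data. The arrays below are randomly generated with deliberately redundant columns, and the 95% threshold and the `PC` column names are illustrative choices.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: 3 independent features plus 3 near-copies of them
rng = np.random.default_rng(0)
base_train = rng.normal(size=(100, 3))
base_test = rng.normal(size=(30, 3))
X_train = np.hstack([base_train, base_train + 0.01 * rng.normal(size=(100, 3))])
X_test = np.hstack([base_test, base_test + 0.01 * rng.normal(size=(30, 3))])

# Scale: fit on train, then transform both train and test
scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)
X_test_sc = scaler.transform(X_test)

# Keep as many components as needed to explain 95% of the variance
pca = PCA(n_components=0.95, svd_solver="full")
X_train_pca = pca.fit_transform(X_train_sc)
X_test_pca = pca.transform(X_test_sc)

# Put the results in a DataFrame that is easier to read
pc_names = [f"PC{i + 1}" for i in range(pca.n_components_)]
train_pca_df = pd.DataFrame(X_train_pca, columns=pc_names)
test_pca_df = pd.DataFrame(X_test_pca, columns=pc_names)

print(X_train.shape[1], "->", train_pca_df.shape[1])
```

Because the PCA is fitted on train only, test automatically ends up with the same number of components.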


In [None]:
# Our target variable is binary.  Let's try running this through a simple logistic regression model
# Remember, we only pulled in our training data.  When we are ready to test, we will need to apply
# the same changes on our test set that we applied to our train.

# Import LogisticRegression from sklearn

# Instantiate a logistic regression instance

# Fit the model on train data

# Make predictions from the test data


In [None]:
# Evaluate model performance
# Import evaluation tools from sklearn

# Evaluate performance 


In [None]:
# Model evaluation will be covered in another class
# But we'll get a quick baseline (the accuracy if we predicted everything as the majority class)
baseline = 1 - (y_test.sum()/len(y_test))
baseline
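
The baseline formula above assumes the target is encoded as 0/1 and that 0 is the majority class. A sketch on made-up labels shows the same number computed in a way that does not depend on which class is the majority (the `y_test` values here are illustrative):

```python
import pandas as pd

# Toy 0/1 target where class 0 is the majority
y_test = pd.Series([0, 0, 0, 1, 0, 1, 0, 0])

# Formula from the cell above: share of 0s, valid when 0 is the majority
baseline = 1 - (y_test.sum() / len(y_test))

# Equivalent check that works whichever class is the majority
majority_share = y_test.value_counts(normalize=True).max()

print(baseline, majority_share)
```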

---
### [EXTRA] Can you improve on this score by iterating back through the feature engineering and continuing the process?

Potential areas for improvement:

- Impute missing values
- Think more carefully about what data you remove, e.g., which columns you drop during dummy encoding and the collinearity checks

In [None]:
# your code goes here


<div style="position: relative;">
<img src="https://user-images.githubusercontent.com/7065401/98729912-57be3e80-237a-11eb-80e4-233ac344b391.png"></img>
</div>