<h1 style="font-size:42px; text-align:center; margin-bottom:30px;"><span style="color:SteelBlue">Exercise:</span> Feature Engineering</h1>
<hr>
Welcome to the <span style="color:royalblue">Feature Engineering</span> Exercise! 

Remember, **better data beats better algorithms**.


<br><hr id="toc">

### In this module...

In this module, we'll cover the essential steps for building your analytical base table:
1. [Engineer features](#engineer-features)
2. [Save the ABT](#save-abt)

Finally, we'll save the ABT to a new file so we can use it in other modules.

<br><hr>

### First, let's import libraries and load the dataset.

In general, it's good practice to keep all of your library imports at the top of your notebook or program.

We've provided comments for guidance.

In [None]:
import numpy as np
import pandas as pd
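# Use the full option name; the short form 'max_columns' is ambiguous in newer pandas versions
pd.set_option('display.max_columns', 100)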

import matplotlib.pyplot as plt
import seaborn as sns

Next, let's import the dataset.
* The file path is <code style="color:crimson">'project_files/clean_employee_data.csv'</code>

In [None]:
# Load employee data from CSV
df = pd.read_csv('project_files/clean_employee_data.csv')

In [None]:
df.head()

<span id="engineer-features"></span>
# 1. Engineer features

For this project, we're going to have an abbreviated version of feature engineering, since we've already covered many tactics in the last lesson.

<br>
Do you remember the scatterplot of <code style="color:steelblue">'satisfaction'</code> and <code style="color:steelblue">'last_evaluation'</code> for employees who have <code style="color:crimson">'Left'</code>?

**Let's reproduce it here, just so we have it in front of us.**
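
As a reminder of how boolean indexing works, here is a minimal sketch on a small toy DataFrame. The values are made up for illustration and are not taken from the employee dataset.

```python
import pandas as pd

# Toy data (hypothetical values), not the real employee dataset
toy = pd.DataFrame({
    'status': ['Left', 'Employed', 'Left', 'Employed'],
    'satisfaction': [0.10, 0.80, 0.45, 0.60],
})

# Comparing a column to a value yields a boolean Series (the mask)
mask = toy['status'] == 'Left'

# Indexing the DataFrame with the mask keeps only the rows where it is True
subset = toy[mask]
print(subset)
```

The same two steps (build a mask, then index with it) apply to the exercise cell that follows.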

In [None]:
# Create boolean series: `has_left` ... has this employee left?
# has_left =

# index df by `has_left` to create `left_df`
# left_df = 

In [None]:
# Scatterplot of satisfaction vs. last_evaluation, only those who have left
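# Pass x and y as keyword arguments; recent seaborn versions no longer accept them positionally
sns.scatterplot(x="satisfaction", y="last_evaluation", data=left_df)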
plt.show()

The groups of employees visible in this plot roughly translate to 3 **indicator features** we can engineer:

* <code style="color:steelblue">'underperformer'</code> - last_evaluation < 0.6 and last_evaluation_missing == 0
* <code style="color:steelblue">'unhappy'</code> - satisfaction < 0.2
* <code style="color:steelblue">'overachiever'</code> - last_evaluation > 0.8 and satisfaction > 0.7

<br>

**Create those 3 indicator features.**
* Use boolean masks.
* **Important:** For <code style="color:steelblue">'underperformer'</code>, include <code style="color:steelblue">'last_evaluation_missing' == 0</code> so the feature excludes observations whose evaluation was originally missing and that we flagged and filled.

#### Create indicator features

Remember, we make indicator variables by running boolean checks on a single column or a combination of multiple columns.

1. **This check results in a boolean series like: `[True, False, True, False, False, ...]`.**


2. **We then use `.astype(int)` on that boolean series to convert it to an indicator variable, which looks like: `[1, 0, 1, 0, 0, ...]`.**
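
The two steps above can be sketched on a toy DataFrame. The column names `score` and `score_missing`, the feature name `low_score`, and the threshold are hypothetical, chosen only to illustrate combining two conditions.

```python
import pandas as pd

# Toy data with hypothetical columns (not the employee dataset)
toy = pd.DataFrame({
    'score': [0.9, 0.3, 0.0, 0.5],
    'score_missing': [0, 0, 1, 0],
})

# Step 1: combine two conditions into one boolean Series.
# Each comparison needs its own parentheses because & binds
# more tightly than < and ==.
mask = (toy['score'] < 0.6) & (toy['score_missing'] == 0)

# Step 2: convert True/False to 1/0
toy['low_score'] = mask.astype(int)

print(toy['low_score'].tolist())  # [0, 1, 0, 1]
```

Note that the third row has a low score but is excluded because its missing flag is 1, which is the same reason the `'last_evaluation_missing' == 0` condition matters for `'underperformer'`.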

In [None]:
# df['underperformer']

# df['unhappy']

# df['overachiever']

<br>

**Next, run code to check that you created the features correctly.**
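
One way to run this check, sketched on toy data: because an indicator column contains only 1s and 0s, its mean equals the proportion of rows where the indicator is 1. The column names `flag_a` and `flag_b` are hypothetical.

```python
import pandas as pd

# Toy indicator columns (hypothetical names and values)
toy = pd.DataFrame({
    'flag_a': [1, 0, 0, 0],
    'flag_b': [1, 1, 0, 0],
})

# The mean of a 0/1 column is the share of rows flagged with 1
proportions = toy[['flag_a', 'flag_b']].mean()
print(proportions)
```

A proportion of exactly 0 or 1 for a new feature is usually a sign that the mask was written incorrectly.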

In [None]:
# The proportion of observations belonging to each group


<p style="text-align:center; margin: 40px 0 40px 0; font-weight:bold">
<a href="#toc">Back to Contents</a>
</p>

<span id="save-abt"></span>
# 2. Save the ABT

Finally, let's save the **analytical base table**. 

<br>

**Convert <code style="color:steelblue">'status'</code> into an indicator variable.**
* <code style="color:crimson">'Left'</code> should be <code style="color:crimson">1</code>
* <code style="color:crimson">'Employed'</code> should be <code style="color:crimson">0</code>
* You can also do this easily with <code style="color:steelblue">pd.get_dummies()</code>.
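
A minimal sketch of both approaches on a toy Series (the values are made up). It assumes the only two labels are `'Left'` and `'Employed'`, as described in the bullets above.

```python
import pandas as pd

# Toy status column (hypothetical values)
status = pd.Series(['Left', 'Employed', 'Left', 'Employed'])

# Option 1: boolean comparison, then cast to int
via_mask = (status == 'Left').astype(int)

# Option 2: pd.get_dummies creates one column per label;
# keep the 'Left' column and cast it to int, since newer
# pandas versions return boolean dummy columns
via_dummies = pd.get_dummies(status)['Left'].astype(int)

print(via_mask.tolist())     # [1, 0, 1, 0]
print(via_dummies.tolist())  # [1, 0, 1, 0]
```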

In [None]:
# Convert status to an indicator variable


**To confirm we did that correctly, display the proportion of people in our dataset who left.**

In [None]:
# The proportion of observations who 'Left'
# if we have a series of 1s and 0s, what would the mean represent?


**Overwrite your dataframe with a version that has <span style="color:royalblue">dummy variables</span> for the categorical features.**
* Then, display the first 10 rows to confirm all of the changes we've made so far in this module.

#### Format for `pd.get_dummies()`
`pd.get_dummies(data = your_df, columns = ['all', 'categorical', 'columns', 'in', 'df'])`
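
To see what this call does, here is a small sketch on a toy DataFrame. The column names `department`, `salary`, and `tenure` are assumptions for illustration; use whichever categorical columns your own dataframe contains.

```python
import pandas as pd

# Toy data with two hypothetical categorical columns and one numeric column
toy = pd.DataFrame({
    'department': ['sales', 'IT', 'sales'],
    'salary': ['low', 'high', 'medium'],
    'tenure': [3, 5, 2],
})

# Each listed column is replaced by one dummy column per unique value,
# named '<column>_<value>'; unlisted columns are left unchanged
dummies = pd.get_dummies(data=toy, columns=['department', 'salary'])

print(dummies.columns.tolist())
```

The original `department` and `salary` columns no longer appear in the result, which is why the exercise asks you to overwrite the dataframe rather than append to it.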

In [None]:
# Create new dataframe with dummy features


In [None]:
# Display first 10 rows


**Save this dataframe as your <span style="color:royalblue">employee analytical base table</span> to use in later lessons.**
* Remember to set the argument <code style="color:steelblue">index=None</code> to save only the data.
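
The effect of dropping the index can be checked without touching the disk. In this sketch, `to_csv` is called without a path so it returns the CSV text directly, and it uses `index=False`, the boolean form given in the pandas documentation. The toy columns `a` and `b` are hypothetical.

```python
import io
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# With no path argument, to_csv returns the CSV as a string
with_index = toy.to_csv()
without_index = toy.to_csv(index=False)

# Header row of each version
print(with_index.splitlines()[0])     # ,a,b
print(without_index.splitlines()[0])  # a,b

# Reading the indexed version back produces an extra unnamed column
reloaded_with = pd.read_csv(io.StringIO(with_index))
reloaded_without = pd.read_csv(io.StringIO(without_index))
print(reloaded_with.columns.tolist())     # ['Unnamed: 0', 'a', 'b']
print(reloaded_without.columns.tolist())  # ['a', 'b']
```

Saving without the index keeps that stray `Unnamed: 0` column out of the ABT when it is loaded in later lessons.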

In [None]:
# Save analytical base table
# index=None leaves the row index out of the file, so only the data is saved
df.to_csv("project_files/anaytical_base_emp.csv", index=None)

<p style="text-align:center; margin: 40px 0 40px 0; font-weight:bold">
<a href="#toc">Back to Contents</a>
</p>

<br>

## Next Steps

Congratulations on making it through the Feature Engineering module!

As a reminder, here are a few things you did:
* You engineered features by leveraging your exploratory analysis.
* You created dummy variables before saving the ABT.

<p style="text-align:center; margin: 40px 0 40px 0; font-weight:bold">
<a href="#toc">Back to Contents</a>
</p>