# Basic Feature Engineering

Part 1

### Adding simple derived columns

Our company, **Boss Widgetry, Inc.**, has had a particularly profitable year, and we want to reward the leaders of our management team.

Here is our manager dataset.

In [4]:
import pandas as pd

data = {
    'Name': ['John Smith', 'Jane Doe', 'Emily Brown', 'Mike Johnson', 'David Lee'],
    'Age': [35, 28, 42, 31, 56],
    'Base Salary': [50000, 60000, 75000, 55000, 80000],
    'Years of Experience': [5, 3, 12, 4, 8],
    'Department': ['Sales', 'Marketing', 'Engineering', 'HR', 'Finance']
}
df = pd.DataFrame(data)
df

Unnamed: 0,Name,Age,Base Salary,Years of Experience,Department
0,John Smith,35,50000,5,Sales
1,Jane Doe,28,60000,3,Marketing
2,Emily Brown,42,75000,12,Engineering
3,Mike Johnson,31,55000,4,HR
4,David Lee,56,80000,8,Finance

#### Adding a Simple Calculated Column

Let's add a column for each employee's salary after a 5% raise:

In [6]:
df['Salary After Raise'] = df['Base Salary'] * 1.05
df

Unnamed: 0,Name,Age,Base Salary,Years of Experience,Department,Salary After Raise
0,John Smith,35,50000,5,Sales,52500.0
1,Jane Doe,28,60000,3,Marketing,63000.0
2,Emily Brown,42,75000,12,Engineering,78750.0
3,Mike Johnson,31,55000,4,HR,57750.0
4,David Lee,56,80000,8,Finance,84000.0
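
The same derived column can also be built without modifying the original frame. The sketch below uses `DataFrame.assign`, which returns a new DataFrame; the names `managers`, `with_raise`, and `RAISE_RATE` are illustrative and not part of the notebook, and only two rows of the dataset are reproduced to keep it short.

```python
import pandas as pd

# Two rows from the manager dataset, reproduced so the example is self-contained.
managers = pd.DataFrame({
    'Name': ['John Smith', 'Jane Doe'],
    'Base Salary': [50000, 60000],
})

RAISE_RATE = 0.05  # hypothetical constant name; the 5% figure comes from the text

# assign() returns a new DataFrame and leaves `managers` unchanged.
# Dict unpacking is used because the new column name contains spaces.
with_raise = managers.assign(
    **{'Salary After Raise': managers['Base Salary'] * (1 + RAISE_RATE)}
)

print(with_raise)
```

Direct assignment, as in the cell above, is the simpler choice in a notebook; `assign` may be preferable when the original frame should stay untouched.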


#### Adding a Column Based on Multiple Existing Columns

We can also derive a column from more than one existing column. Here the bonus is 1% of base salary for each year of experience.

In [7]:
df['Bonus'] = df['Base Salary'] * (df['Years of Experience']/100)
df

Unnamed: 0,Name,Age,Base Salary,Years of Experience,Department,Salary After Raise,Bonus
0,John Smith,35,50000,5,Sales,52500.0,2500.0
1,Jane Doe,28,60000,3,Marketing,63000.0,1800.0
2,Emily Brown,42,75000,12,Engineering,78750.0,9000.0
3,Mike Johnson,31,55000,4,HR,57750.0,2200.0
4,David Lee,56,80000,8,Finance,84000.0,6400.0
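
To make the arithmetic explicit, the following sketch splits the bonus calculation into two steps for a couple of rows. The intermediate `Bonus Rate` column is a hypothetical helper added only for illustration; it is not part of the notebook's dataset.

```python
import pandas as pd

# Two rows from the manager dataset, reproduced so the example is self-contained.
df = pd.DataFrame({
    'Name': ['Emily Brown', 'Jane Doe'],
    'Base Salary': [75000, 60000],
    'Years of Experience': [12, 3],
})

# Step 1: each year of experience is worth 1% of base salary.
df['Bonus Rate'] = df['Years of Experience'] / 100  # hypothetical helper column

# Step 2: multiply the two columns element-wise, row by row.
df['Bonus'] = df['Base Salary'] * df['Bonus Rate']

print(df[['Name', 'Bonus Rate', 'Bonus']])
```

Both operations are vectorized: pandas aligns the columns by row index and computes every row at once, so no explicit loop is needed.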


#### Adding a Column for the New Salary (Raise + Bonus)

In [8]:
df['New Salary'] = df['Salary After Raise'] + df['Bonus']
df

Unnamed: 0,Name,Age,Base Salary,Years of Experience,Department,Salary After Raise,Bonus,New Salary
0,John Smith,35,50000,5,Sales,52500.0,2500.0,55000.0
1,Jane Doe,28,60000,3,Marketing,63000.0,1800.0,64800.0
2,Emily Brown,42,75000,12,Engineering,78750.0,9000.0,87750.0
3,Mike Johnson,31,55000,4,HR,57750.0,2200.0,59950.0
4,David Lee,56,80000,8,Finance,84000.0,6400.0,90400.0


<hr style='height:2px'>

### Apply a function to a column

Let's add a column of age categories:
* Junior: age < 30
* Mid-level: 30 <= age < 45
* Senior: age >= 45

First, define a function to classify ages:

In [9]:
def age_cat(age):
    if age < 30:
        return 'Junior'
    elif 30 <= age < 45:
        return 'Mid-level'
    else:
        return 'Senior'
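
Before applying a classifier like this to a whole column, it can help to check its behavior at the boundary values. The sketch below restates the function above (with the second test shortened to `age < 45`, which is equivalent once `age < 30` has failed) and calls it on the ages around each threshold; the `boundary_results` name is illustrative.

```python
def age_cat(age):
    """Classify an age as 'Junior', 'Mid-level', or 'Senior'."""
    if age < 30:
        return 'Junior'
    elif age < 45:  # reaching this branch already implies age >= 30
        return 'Mid-level'
    else:
        return 'Senior'

# Check the values on either side of each threshold.
boundary_results = {age: age_cat(age) for age in (29, 30, 44, 45)}

for age, label in boundary_results.items():
    print(age, label)
```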

#### Use the `.apply()` method

`.apply()` calls a *function* on each value in a column and collects the results into a new Series.

In [10]:
df['Age Category'] = df['Age'].apply(age_cat)
df

Unnamed: 0,Name,Age,Base Salary,Years of Experience,Department,Salary After Raise,Bonus,New Salary,Age Category
0,John Smith,35,50000,5,Sales,52500.0,2500.0,55000.0,Mid-level
1,Jane Doe,28,60000,3,Marketing,63000.0,1800.0,64800.0,Junior
2,Emily Brown,42,75000,12,Engineering,78750.0,9000.0,87750.0,Mid-level
3,Mike Johnson,31,55000,4,HR,57750.0,2200.0,59950.0,Mid-level
4,David Lee,56,80000,8,Finance,84000.0,6400.0,90400.0,Senior
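
Because `.apply()` calls a Python function once per value, a vectorized alternative is sometimes preferred on large columns. The following is a minimal sketch using `numpy.select`, which is a different technique from the notebook's; the `conditions`, `labels`, and `age_category` names are illustrative.

```python
import numpy as np
import pandas as pd

ages = pd.Series([35, 28, 42, 31, 56], name='Age')

# Conditions are checked in order and the first match wins,
# so the second test only needs the upper bound.
conditions = [ages < 30, ages < 45]
labels = ['Junior', 'Mid-level']

# Any age matching neither condition falls through to the default.
age_category = pd.Series(
    np.select(conditions, labels, default='Senior'),
    index=ages.index,
    name='Age Category',
)

print(age_category)
```

For a five-row frame the difference is negligible, and the `.apply()` version is arguably easier to read.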


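Another (arguably simpler) way is `pd.cut()`, which bins values by a list of edges. Passing `right=False` makes each bin include its left edge and exclude its right edge, which matches the rules above.

In [11]:
pd.cut(df['Age'], [0, 30, 45, 100], labels=['Junior', 'Mid-level', 'Senior'], right=False)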

0    Mid-level
1       Junior
2    Mid-level
3    Mid-level
4       Senior
Name: Age, dtype: category
Categories (3, object): ['Junior' < 'Mid-level' < 'Senior']
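
One detail worth checking with `pd.cut()` is how values that sit exactly on a bin edge are handled. By default the bins are closed on the right, so an age of exactly 30 lands in the first bin; with `right=False` they are closed on the left instead. The sketch below compares the two on ages chosen to sit at the edges; the `edge_ages`, `right_closed`, and `left_closed` names are illustrative.

```python
import pandas as pd

edge_ages = pd.Series([29, 30, 44, 45])
bins = [0, 30, 45, 100]
labels = ['Junior', 'Mid-level', 'Senior']

# Default (right=True): bins are (0, 30], (30, 45], (45, 100]
right_closed = pd.cut(edge_ages, bins, labels=labels)

# right=False: bins are [0, 30), [30, 45), [45, 100)
left_closed = pd.cut(edge_ages, bins, labels=labels, right=False)

comparison = pd.DataFrame({
    'Age': edge_ages,
    'right=True': right_closed,
    'right=False': left_closed,
})
print(comparison)
```

None of the five managers has an age exactly on an edge, so both settings give the same result for this dataset; the difference only appears at 30 and 45.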

<hr style='height:2px'>

#### Conditional Logic with columns

The President of **Boss Widgetry, Inc.** wants to know who might be eligible for a Vice President position.

To be a VP, candidates must be at least 35 years old AND have at least 5 years of experience with the company.

In [12]:
df['VP Eligible'] = (df['Age'] >= 35) & (df['Years of Experience'] >= 5)

In [13]:
df

Unnamed: 0,Name,Age,Base Salary,Years of Experience,Department,Salary After Raise,Bonus,New Salary,Age Category,VP Eligible
0,John Smith,35,50000,5,Sales,52500.0,2500.0,55000.0,Mid-level,True
1,Jane Doe,28,60000,3,Marketing,63000.0,1800.0,64800.0,Junior,False
2,Emily Brown,42,75000,12,Engineering,78750.0,9000.0,87750.0,Mid-level,True
3,Mike Johnson,31,55000,4,HR,57750.0,2200.0,59950.0,Mid-level,False
4,David Lee,56,80000,8,Finance,84000.0,6400.0,90400.0,Senior,True
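
A boolean column like this can also be used as a mask to pull out just the matching rows. The sketch below rebuilds the relevant columns, then lists and counts the eligible managers; the `eligible` and `candidates` names are illustrative. Note that each comparison needs its own parentheses, because `&` binds more tightly than `>=` in Python.

```python
import pandas as pd

# Relevant columns from the manager dataset, reproduced so the example is self-contained.
df = pd.DataFrame({
    'Name': ['John Smith', 'Jane Doe', 'Emily Brown', 'Mike Johnson', 'David Lee'],
    'Age': [35, 28, 42, 31, 56],
    'Years of Experience': [5, 3, 12, 4, 8],
})

# Element-wise AND of two boolean Series; the parentheses are required.
eligible = (df['Age'] >= 35) & (df['Years of Experience'] >= 5)

# Use the mask to select rows, keeping only the Name column.
candidates = df.loc[eligible, 'Name'].tolist()

print(candidates)
print(int(eligible.sum()))  # True counts as 1, so sum() gives the number eligible
```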
