# Feature Engineering

Often in data-driven value creation, answering a specific question requires us to 'create' new data from what we already have. This can involve combining information from several columns, transforming them, or presenting them in a different way.

These processes are collectively called feature engineering. In contrast to the previous modules, feature engineering is less about the technical aspects (i.e. programming) and more about thinking about what _could be_ useful information in the data.

In this notebook we will show a few techniques and commonly used quantities for feature engineering, but really, this part is also about your creativity and thinking a bit out of the box.

First we start with the `renewable_electricity` data set.

In [None]:
import pandas as pd

df = pd.read_csv('data/renewable_electricity.csv').drop(columns=['Unnamed: 0']) #drop again the unused column
df

For feature engineering it is necessary to understand the topic of the data, in this case to have a basic understanding of renewable electricity.

Let's say we want to add a new column `Percentage of total` that contains what percentage of the total production the given electricity source produced, that is, its `Gross production` normalized by the total for that year.

For this we first need to look up the total production for that year (the row with `Source` equal to `Total`) and divide the production of each source by that value.
In technical terms, we need to:
- Store the total production per year in an easily accessible format. We will use a pandas Series with the year as index and `Gross production` as values.
- Make a function that takes in a row of our dataframe and, given the year in that row, calculates this percentage.
- Use the `apply()` method on our dataframe to create the new column.

In [None]:
df_total_productions = df[df['Source'] == 'Total']

production_per_year = pd.Series(df_total_productions['Gross production'].values, index=df_total_productions['Periods'].values)
#notice that we used .values to get only the values. If you know you only need the (array of) values from a series/dataframe, it's a useful feature

production_per_year[1995] # we can access the gross total production for every year like this
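
The same kind of lookup Series can also be built with `set_index()`. The following is a minimal sketch on a made-up toy dataframe (the values and the `toy_` names are illustrative assumptions, not the real `renewable_electricity` data):

```python
import pandas as pd

# Toy data standing in for the real data set (all values are made up)
toy = pd.DataFrame({
    'Periods': [1995, 1995, 1996, 1996],
    'Source': ['Total', 'Solar', 'Total', 'Solar'],
    'Gross production': [100.0, 20.0, 200.0, 50.0],
})

toy_totals = toy[toy['Source'] == 'Total']

# set_index moves 'Periods' into the index; selecting one column then gives a Series
toy_production_per_year = toy_totals.set_index('Periods')['Gross production']

print(toy_production_per_year[1995])  # total production for the year 1995
```

Both approaches give a Series indexed by year, so which one to use is mostly a matter of taste.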

In [None]:
#Next we define the function that we will then apply to every row
def percentage_of_total(row): #note: with axis=1, .apply() calls this function once per row and passes the row as its first argument
    prod_for_year = production_per_year[row['Periods']]
    percentage = row['Gross production'] / prod_for_year
    return percentage

df['Percentage of total'] = df.apply(percentage_of_total, axis=1)
df.tail(5) #and just like this, we have a new column with this information
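
For larger dataframes, a row-wise `.apply()` can be slow. As an alternative, the sketch below does the same calculation with `Series.map()`, which looks up each year in the totals Series in one go. It runs on a made-up toy dataframe; the values and the `toy`/`totals` names are assumptions for illustration only.

```python
import pandas as pd

# Toy data standing in for the real data set (all values are made up)
toy = pd.DataFrame({
    'Periods': [1995, 1995, 1996, 1996],
    'Source': ['Total', 'Solar', 'Total', 'Solar'],
    'Gross production': [100.0, 20.0, 200.0, 50.0],
})

# Series with the year as index and the total production as values
totals = toy[toy['Source'] == 'Total'].set_index('Periods')['Gross production']

# map() replaces every year by the matching total, giving a column we can divide by
toy['Percentage of total'] = toy['Gross production'] / toy['Periods'].map(totals)

print(toy)
```

The result should match the `.apply()` version; the difference is that the division happens on whole columns instead of row by row.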

Let's say we are also interested in what percentage of the `Gross production` the `Net production` represents. We will call this column `Net percentage`.

Remember that we can perform operations on entire columns of a dataframe.

In [None]:
df['Net percentage'] = df['Net production'] / df['Gross production'] 
# it's as simple as this: just divide the values in Net production by those in Gross production


Now we actually want to convert these proportions to percentages. In pandas, when we multiply or divide a column by a number, or add a number to it, the same operation is performed on every cell.

In [None]:
df['Net percentage'] = df['Net percentage'] * 100 # Multiply by 100 to get actual percentages for both calculated values
df['Percentage of total'] = df['Percentage of total'] * 100
df
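
To see this cell-by-cell behaviour in isolation, here is a small sketch on a made-up Series (the numbers and variable names are purely illustrative):

```python
import pandas as pd

# A made-up column of proportions
proportions = pd.Series([0.25, 0.5, 1.0])

# The scalar is applied to every cell of the Series
percentages = proportions * 100
shifted = proportions + 1

print(percentages.tolist())
print(shifted.tolist())
```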

As a final example for this data set, let's say we want to see for each year which source produced the highest amount of electricity, and add this as an extra column in our data called `Highest performing`

For this we need to do the following:
- Create a lookup with the highest-producing source for each year (we will use a dictionary). We can do this by filtering the data per year and taking the row with the max value of the `Gross production` column. Note that we have to remove the sources `Total, wind-total, biomass-total`, as they are already sums of other sources.
- Create a function that accepts a row and, based on its year, returns the highest output source for that year.
- Use `.apply()` to create the new column.

In [None]:
sources_to_delete = ['Total', 'wind-total', 'biomass-total']

df_filtered = df[~df['Source'].isin(sources_to_delete)] # the ~ operator in front of a condition negates it, so we keep the rows that do NOT match
years = df['Periods'].unique() #get all the years present in the dataframe

# we create an empty dictionary to store the values. A dictionary can be used to easily look up values based on a specified key
highest_per_year = {} # {} initializes an empty dictionary
#iterate over years and select the highest performing row, keep the source
for y in years:
    df_year = df_filtered[df_filtered['Periods'] == y] #all rows for this year
    idx = df_year['Gross production'].argmax() #argmax returns the integer POSITION of the highest value
    highest_row = df_year.iloc[idx] #with .iloc[] we can get a single row at the specified position
    highest_source = highest_row['Source']
    highest_per_year[y] = highest_source
# now we can use the dictionary to get the highest performing source for each year, simply by looking up:
highest_per_year[2014], highest_per_year[1997]


In [None]:
# we can just use a lambda function here, as all we need to do is look up the corresponding highest value for the year 
# in the dictionary created above

df['Highest performing'] = df.apply(lambda row: highest_per_year[row['Periods']], axis=1) 
df
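
The loop and dictionary above can also be replaced by a `groupby()` combined with `idxmax()`. The following is only a sketch on a made-up toy dataframe (values and the `toy`-prefixed names are assumptions), showing one possible way to get the same kind of column:

```python
import pandas as pd

# Toy data standing in for the real data set (all values are made up)
toy = pd.DataFrame({
    'Periods': [2013, 2013, 2013, 2014, 2014, 2014],
    'Source': ['Total', 'Solar', 'Wind', 'Total', 'Solar', 'Wind'],
    'Gross production': [30.0, 10.0, 20.0, 50.0, 35.0, 15.0],
})

# Remove the aggregate rows first, just like in the example above
toy_filtered = toy[~toy['Source'].isin(['Total'])]

# For each year, idxmax gives the index LABEL of the row with the highest value
best_rows = toy_filtered.groupby('Periods')['Gross production'].idxmax()

# Look up the source at those labels and use the year as index
highest = pd.Series(toy_filtered.loc[best_rows, 'Source'].values, index=best_rows.index)

# map() looks up every row's year in the Series we just built
toy['Highest performing'] = toy['Periods'].map(highest)

print(toy)
```

Note that `idxmax()` returns index labels (to be used with `.loc[]`), while `argmax()` in the loop above returns integer positions (to be used with `.iloc[]`).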

This example may have been a bit more challenging, but it captures the idea of feature engineering: we need to think of more meaningful data that we can add to our dataframes, using the data sets we already have.

# Exercises

To practice feature engineering, you are free to use whichever dataset you think could be expanded with new features. If in doubt about whether an extra column is useful, or how to make it, always feel free to reach out to a Data mentor.

We do want to give a few practice exercises; this time we'll use the `student_debt.csv` dataset. NOTE: This data set is used as an example in the data_cleaning notebook. You will need to copy the code that converts the `Period` column to numbers with a lambda function:

```df["Period"] = df.apply(lambda row:int(row['Period'].replace('*', '')), axis=1)```
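
If you prefer, the same conversion can be written with pandas string methods instead of a row-wise lambda. This is a sketch on a made-up column, assuming the `Period` values are strings such as `'2019*'`:

```python
import pandas as pd

# Made-up example values; the real column comes from student_debt.csv
toy = pd.DataFrame({'Period': ['2018', '2019*', '2020*']})

# regex=False makes '*' a literal character instead of a regex symbol
toy['Period'] = toy['Period'].str.replace('*', '', regex=False).astype(int)

print(toy['Period'].tolist())
```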

### Exercise 1
Create a new column `Percentage of Sum` that contains, for each row, what percentage of that year's student loans is held by that group. Essentially, divide the `Sum` in each row by the `Sum` in the `Total` row of the same year. After you have calculated it, scale it so that it is expressed as a percentage.

This exercise is highly similar to what we've done before in this notebook.


In [None]:
# ------------ Exercise code comes here ------------- #

### Exercise 2

Create a new column `highest in debt` which contains, for each year, the _age group_ that has the highest _average_ student debt. For this exercise you should:
- Filter the dataframe so it does not contain the `Total, Man, Woman` rows.
- Check for each year which _age group_ holds the highest Average (if needed, check how we did it in the example above)
- Store this data somehow (hint: you can use a dictionary again)
- Apply a (lambda) function to create the new column.

In [None]:
# ------------ Exercise code comes here ------------- #

### Further exercises

If you feel like it, go to the data dashboard and download a new data set that you can use for other feature engineering tasks.
This is really a field where practice makes perfect :) So don't be afraid to experiment and play around with the plethora of data sets we have to offer.

In [None]:
# ------------ Exercise code comes here ------------- #