# Feature Engineering
We need to clean up columns, add new ones, handle missing values, and so forth. Some of this can and should be done after sampling, so that information about the test data does not inadvertently leak into our models.
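
As a rough illustration of why the order matters, the sketch below imputes missing weights with a mean computed from the training rows only and then reuses that value on the test rows. The data frame, the 70/30 split, and the column name `weight` are made up for the example; they are not part of the heroes data.

```r
set.seed(42)

# Toy data (made up): two missing weights
df <- data.frame(weight = c(60, 72, NA, 95, 81, NA, 70, 88, 64, 77))

# Sample first ...
train_idx <- sample(nrow(df), size = 7)
train <- df[train_idx, , drop = FALSE]
test  <- df[-train_idx, , drop = FALSE]

# ... then derive the fill value from the training rows only
train_mean <- mean(train$weight, na.rm = TRUE)

# Apply the same training-derived value to both sets
train$weight[is.na(train$weight)] <- train_mean
test$weight[is.na(test$weight)]   <- train_mean
```

Had the mean been computed on `df` before splitting, the test rows would have influenced a feature the model is trained on.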

In [1]:
install.packages(c("tidyverse","naniar","ggthemes"))
library(tidyverse)
library(naniar)
heroes = read_csv("data/heroes_information.csv")
powers = read_csv("data/super_hero_powers.csv")

Installing packages into ‘/home/nbuser/R’
(as ‘lib’ is unspecified)
“installation of package ‘tidyverse’ had non-zero exit status”

ERROR: Error: package or namespace load failed for ‘tidyverse’ in library.dynam(lib, package, package.lib):
 shared object ‘readxl.so’ not found


## Cleaning data
We can use packages like `forcats`, `stringr`, and `lubridate` to fix values. Using `dplyr` we're able to programmatically fix issues across multiple columns.

In [None]:
replaceLT0 = function(x) ifelse(x < 0, NA, x) 
# mutate_if takes a dataset, applies a condition to each column, and for each column that returned TRUE applies a function to it
heroes = mutate_if(heroes, is.numeric, replaceLT0)
gg_miss_var(heroes) + ggthemes::theme_few() 

> Write a cleaning statement that converts missing values encoded as hyphens (`-`) to `NA`

For rebasing factors, handling typos etc. we can use the `forcats` package.

- `fct_explicit_na` converts missings into a distinct level. This can be very handy for modelling observations with missings, as there could be some systematic reason for the missings that has predictive value for your model. (In `forcats` 1.0.0 and later this function is deprecated in favour of `fct_na_value_to_level`.)
- `fct_infreq` reorders a factor so the most common level is first
- `fct_lump` consolidates low frequency levels into a single level
- `fct_relabel` applies a function to level labels to do things like remove special characters

![](http://perso.ens-lyon.fr/lise.vaudor/Rfigures/forcats/forcats.png)
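
A small sketch of three of these helpers on a toy factor may make the list more concrete. The vector `race` and its values are invented for illustration and only stand in for a column such as `heroes$Race`.

```r
library(forcats)

# Toy factor (made-up values)
race <- factor(c("Human", "Human", "Human", "Mutant", "Mutant",
                 "Alien", "God/Eternal"))

# Reorder levels so the most common comes first
by_freq <- fct_infreq(race)
fct_count(by_freq)

# Keep the single most common level, lump everything else together
lumped <- fct_lump(race, n = 1)
fct_count(lumped)

# Strip anything that is not a letter from the level labels
cleaned <- fct_relabel(race, ~ gsub("[^A-Za-z]", "", .x))
levels(cleaned)
```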

In [4]:
heroes = mutate_if(heroes, is.character, fct_explicit_na)
gg_miss_var(heroes) + ggthemes::theme_few() 

ERROR: Error in gg_miss_var(heroes): could not find function "gg_miss_var"


In [5]:
fct_count(heroes$Alignment)
fct_count(fct_infreq(heroes$Alignment))
fct_count(fct_lump(heroes$Race, 3))

f,n
-,7
bad,207
good,496
neutral,24


f,n
good,496
bad,207
neutral,24
-,7


f,n
-,304
Human,208
Mutant,63
Other,159


## Lags
Lagged values are past values brought forward for use as predictive variables. These might be things like the previous payment amount, time spent playing previously etc. 

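`lag()` from `dplyr` copies a previous row's value onto the current row; `lead()` does the reverse, though it tends to be less relevant whilst building models. Note that base R also has `stats::lag()`, which works on time series objects and behaves differently, so make sure `dplyr` is loaded.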

```{r}
lag(1:5)
```

Use lags to provide prior values, like position 6 months ago:
```{r}
lag(1:50, n = 6, default = 0)
```

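Use lags to flag whether a value changed from the previous row:
```{r}
x=rep(c(2,1,1,2),2)
# TRUE where the value differs from the row before; the first element is NA
x!=lag(x)
```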

Use lags to compute the relative change from the previous value:
```{r}
x=1:15
(x/lag(x))-1
```
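
In practice lags are usually computed per entity and in time order. The following is a minimal sketch of that pattern; the `payments` table, its column names, and its values are hypothetical.

```r
library(dplyr)

# Hypothetical payment history (made-up values)
payments <- tibble(
  customer = c("a", "a", "a", "b", "b"),
  month    = c(1, 2, 3, 1, 2),
  amount   = c(100, 120, 90, 50, 50)
)

payments_lagged <- payments %>%
  arrange(customer, month) %>%      # lags depend on row order
  group_by(customer) %>%            # do not carry values across customers
  mutate(
    prev_amount = dplyr::lag(amount),
    changed     = amount != prev_amount,
    pct_change  = amount / prev_amount - 1
  ) %>%
  ungroup()

payments_lagged
```

The first row of each customer has no prior value, so its lagged columns are `NA`.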
    

## Aggregates
Aggregate measures such as lifetime value, typical profitability, maximum months in arrears in the past year, or number of times seen are often useful features.

It is worth noting, though, that if you build aggregates from historic data and use them in a model, you will need the same depth of historic data going forward in order to make predictions.

Find the rolling mean / max / min / etc. over previous values:
```{r}
library(RcppRoll)
iris %>% 
  group_by(Species) %>% 
  # lag() first, so each window covers the 5 rows before the current one
  mutate(rollMean = roll_meanr(lag(Sepal.Width), n = 5)) %>% 
  head()
```

Find the min / max / sum / etc. over all observations so far (the cumulative functions include the current row):
```{r}
iris %>% 
  group_by(Species) %>% 
  mutate(smallest=cummin(Sepal.Width))%>% 
  head()
```
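
If the feature must only use strictly earlier rows, one option is to lag the cumulative result. A short sketch on a made-up vector:

```r
library(dplyr)

x <- c(5, 3, 4, 1, 2)

# Running minimum, current value included
running_min <- cummin(x)

# Running minimum over strictly earlier values (NA for the first element)
prior_min <- dplyr::lag(cummin(x))
```

Lagging the input instead, as in `cummin(lag(x))`, would not work here, because the leading `NA` propagates through the whole cumulative result.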


> Use the `mutate()` function to work out, for each hero, the difference between their weight and the mean weight for their race

## Extra data
We can also gain additional features by joining data. [Lise Vaudor](http://perso.ens-lyon.fr/lise.vaudor) has some nifty viz (as seen above re: forcats!) for joins.


In [7]:
left_join(heroes, powers, by=c("name"="hero_names"))

“Column `name`/`hero_names` joining factor and character vector, coercing into character vector”

X1,name,Gender,Eye color,Race,Hair color,Height,Publisher,Skin color,Alignment,...,Web Creation,Reality Warping,Odin Force,Symbiote Costume,Speed Force,Phoenix Force,Molecular Dissipation,Vision - Cryo,Omnipresent,Omniscient
0,A-Bomb,Male,yellow,Human,No Hair,203,Marvel Comics,-,good,...,False,False,False,False,False,False,False,False,False,False
1,Abe Sapien,Male,blue,Icthyo Sapien,No Hair,191,Dark Horse Comics,blue,good,...,False,False,False,False,False,False,False,False,False,False
2,Abin Sur,Male,blue,Ungaran,No Hair,185,DC Comics,red,good,...,False,False,False,False,False,False,False,False,False,False
3,Abomination,Male,green,Human / Radiation,No Hair,203,Marvel Comics,-,bad,...,False,False,False,False,False,False,False,False,False,False
4,Abraxas,Male,blue,Cosmic Entity,Black,-99,Marvel Comics,-,bad,...,False,False,False,False,False,False,False,False,False,False
5,Absorbing Man,Male,blue,Human,No Hair,193,Marvel Comics,-,bad,...,False,False,False,False,False,False,False,False,False,False
6,Adam Monroe,Male,blue,-,Blond,-99,NBC - Heroes,-,good,...,False,False,False,False,False,False,False,False,False,False
7,Adam Strange,Male,blue,Human,Blond,185,DC Comics,-,good,...,False,False,False,False,False,False,False,False,False,False
8,Agent 13,Female,blue,-,Blond,173,Marvel Comics,-,good,...,,,,,,,,,,
9,Agent Bob,Male,brown,Human,Brown,178,Marvel Comics,-,good,...,False,False,False,False,False,False,False,False,False,False
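
Rows with no match on the right-hand side keep their left-hand columns and get `NA` for the joined ones, which is what happens to Agent 13 above. The sketch below reproduces this on two tiny tables and uses `anti_join()` to list the unmatched rows; the tables `heroes_toy` and `powers_toy` and their values are invented for illustration.

```r
library(dplyr)

# Toy stand-ins for heroes and powers (made-up values)
heroes_toy <- tibble(
  name      = c("A-Bomb", "Agent 13"),
  Publisher = c("Marvel Comics", "Marvel Comics")
)
powers_toy <- tibble(
  hero_names = c("A-Bomb", "Abin Sur"),
  Agility    = c(FALSE, FALSE)
)

# Keep every hero; power columns are NA where no match exists
joined <- left_join(heroes_toy, powers_toy, by = c("name" = "hero_names"))

# Heroes that have no row in the powers table
unmatched <- anti_join(heroes_toy, powers_toy, by = c("name" = "hero_names"))
```

Checking `unmatched` before modelling helps decide whether those `NA`s should be imputed, dropped, or treated as informative.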


## Reshaping data
We might also need to reshape data. Because of my Excel background, I usually think of this as *unpivoting* or *pivoting* my data.

In R, the function for unpivoting data is `gather()`, like gathering all your data up, and the function for pivoting your data is `spread()`. Both come from `tidyr`; since `tidyr` 1.0.0 they are superseded by `pivot_longer()` and `pivot_wider()`, although they still work.

`gather()` needs to know the name for the column that will contain our old headers, the name for the column that will hold our old cell values, and which columns we do or don't want to unpivot. `spread()` needs to know which column will be used for headers and which one will become the cells.

In [8]:
powers_long = gather(powers, power, present, -hero_names)
str(powers_long)

Classes ‘tbl_df’, ‘tbl’ and 'data.frame':	111389 obs. of  3 variables:
 $ hero_names: chr  "3-D Man" "A-Bomb" "Abe Sapien" "Abin Sur" ...
 $ power     : chr  "Agility" "Agility" "Agility" "Agility" ...
 $ present   : chr  "True" "False" "True" "False" ...
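
The same unpivot can be written with `pivot_longer()`, the newer `tidyr` function. The sketch below does so on a toy table (`powers_toy`, with made-up values) and also converts the `"True"`/`"False"` strings to logicals, since `present` comes through as character.

```r
library(tidyr)

# Toy stand-in for the powers table (made-up values)
powers_toy <- data.frame(
  hero_names = c("3-D Man", "A-Bomb"),
  Agility    = c("True", "False"),
  Flight     = c("False", "False"),
  stringsAsFactors = FALSE
)

# Unpivot every column except hero_names
powers_long_toy <- pivot_longer(
  powers_toy,
  cols      = -hero_names,
  names_to  = "power",
  values_to = "present"
)

# Turn the text flags into proper logical values
powers_long_toy$present <- powers_long_toy$present == "True"
```

Note that `pivot_longer()` orders the result row by row, whereas `gather()` stacks one original column after another.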


> Have a think about what might be needed to re-pivot the data (feel free to peek at the docs too!) and give it a go