# Welcome to Paranoia

Paranoia is a lovely planet located 1240.12 million light-years away from Earth.
This is how Paranoia looks from space:

![Paranoia planet](assets/paranoia.jpg)

Beautiful, isn't it?

I've been there taking notes about the creatures that inhabit that planet: **paranoids**.
In this notebook I will show how I prepared the paranoid data so that it can later be processed by machine learning models that predict the winner of a hypothetical battle. To perform that preparation I will use, obviously, **Recipipe** :)

Through this tutorial, I will describe the most basic Recipipe concepts.

# Imports

In [1]:
import pandas as pd

import recipipe as r

# Read data

Paranoids are normally peaceful and very colorful creatures. I don't have any pictures of them because I forgot to bring a camera. Can you believe it?! Wormholes are fast but not pleasant enough to go back for a camera and come back...

I didn't forget paper and pen, so I was able to collect the data now contained in `paranoids.csv`.

In [2]:
df = pd.read_csv("data/paranoids.csv")
df

Unnamed: 0,name,type 1,type 2,type 3,width,height,length,weight,cute
0,Scucioid,Ground,Flying,,11.04,10.1147,14.2144,843.777,False
1,Trelvisoid,Ground,Flying,Water,13.5695,7.07998,13.2144,1175.05,False
2,Dovanoid,Flying,,,20.2368,13.9882,27.6025,6691.82,False
3,Pilinoid,Water,,,9.66392,5.85685,13.5306,775.836,True
4,Arkanoid,Fire,Flying,,6.35188,8.05839,7.03025,340.162,False
5,Pinkfloid,Water,,,7.55713,9.75967,10.9433,770.964,True
6,Redathoid,Fire,Ground,,5.68892,7.86451,7.22494,462.44,False
7,Nebidoid,Ground,,,10.5557,5.92844,15.1795,698.283,True
8,Golukoid,Flying,Fire,,9.2818,10.019,10.5873,1208.4,True
9,Oid,Water,,,13.4964,8.88992,7.03398,685.878,True


The column names are pretty descriptive.
All the length measurements are in meters and the weight is in kilograms.
Perhaps the column that needs the most explanation is `cute`.
It simply indicates whether or not I found the paranoid physically beautiful ^.^

As you can observe, there is **no column for the gender**. I'm not going to go into the disgusting details of paranoid reproduction, just to say that it's not an important feature.

# Introduction to transformers

The basic unit used in Recipipe is the transformer.

All Recipipe transformers inherit from `recipipe.core.RecipipeTransformer`.
In turn, that class inherits from `sklearn.base.TransformerMixin` and `sklearn.base.BaseEstimator`; that is, a Recipipe transformer is an SKLearn transformer.
The biggest difference is that Recipipe transformers only work with `pandas.DataFrame` objects.
No worries, SKLearn transformers are fully supported through wrappers.
We will talk about them later.
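
To get a feeling for what "being an SKLearn transformer" means, here is a minimal sketch of a DataFrame-only transformer built on the same two SKLearn base classes. It is not taken from Recipipe: the `CenterColumns` class and the `df_demo` data are made up for illustration.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class CenterColumns(TransformerMixin, BaseEstimator):
    """Toy DataFrame-in, DataFrame-out transformer (hypothetical, not from Recipipe)."""

    def __init__(self, cols=None):
        self.cols = cols

    def fit(self, X, y=None):
        # Learn the mean of each requested column from the fitting data.
        self.means_ = X[self.cols].mean()
        return self

    def transform(self, X):
        # Subtract the learned means; the other columns are left untouched.
        out = X.copy()
        out[self.cols] = X[self.cols] - self.means_
        return out


df_demo = pd.DataFrame({"name": ["Oid", "Arkanoid"], "width": [13.0, 7.0]})

# fit_transform is provided by TransformerMixin.
centered = CenterColumns(cols=["width"]).fit_transform(df_demo)

# get_params is provided by BaseEstimator.
params = CenterColumns(cols=["width"]).get_params()
```

Only `fit` and `transform` are written by hand; `fit_transform` and `get_params` come for free from the base classes.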

Let's start by creating the simplest transformer: one that selects columns from the input.

In [3]:
r.select()

SelectTransformer(col_format='{}', cols=None, cols_not_found_error=False,
                  dtype=None, keep_original=False, name=None)

To apply that transformation to our DataFrame:

In [4]:
r.select().fit_transform(df).head()

Unnamed: 0,name,type 1,type 2,type 3,width,height,length,weight,cute
0,Scucioid,Ground,Flying,,11.04,10.1147,14.2144,843.777,False
1,Trelvisoid,Ground,Flying,Water,13.5695,7.07998,13.2144,1175.05,False
2,Dovanoid,Flying,,,20.2368,13.9882,27.6025,6691.82,False
3,Pilinoid,Water,,,9.66392,5.85685,13.5306,775.836,True
4,Arkanoid,Fire,Flying,,6.35188,8.05839,7.03025,340.162,False


Almost all Recipipe transformers expect a list of columns to which the transformation is applied.
In our previous example, we didn't specify any columns, so the transformer was applied to all of them.
Selecting all the columns gives us back the input DataFrame, so nothing exciting.

Let's do something more interesting. Let's select a couple of columns.

In [5]:
r.select("name", "cute").fit_transform(df).head()

Unnamed: 0,name,cute
0,Scucioid,False
1,Trelvisoid,False
2,Dovanoid,False
3,Pilinoid,True
4,Arkanoid,False


Note that after this operation `df` is still intact since we are not assigning the result, just printing it.

If you prefer to keep your features in lists, you can pass those **lists as parameters** too.

In [6]:
features_str = ["name"]
features_float = ["width", "height"]
r.select(features_str, features_float).fit_transform(df).head()

Unnamed: 0,name,width,height
0,Scucioid,11.04,10.1147
1,Trelvisoid,13.5695,7.07998
2,Dovanoid,20.2368,13.9882
3,Pilinoid,9.66392,5.85685
4,Arkanoid,6.35188,8.05839


Let's go one step further and select the three type columns (`type 1`, `type 2` and `type 3`), this time using a wildcard instead of a list.

In [7]:
r.select("type *").fit_transform(df).head()

Unnamed: 0,type 1,type 2,type 3
0,Ground,Flying,
1,Ground,Flying,Water
2,Flying,,
3,Water,,
4,Fire,Flying,


At fit time, the Recipipe transformer matches the given list of columns against the columns present in the given DataFrame.
That feature is available in all the existing Recipipe transformers.
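
As a rough illustration of what that matching step involves, here is a minimal sketch based on the standard library's `fnmatch` module. I'm not claiming this is how Recipipe does it internally; the `resolve_columns` helper is hypothetical.

```python
from fnmatch import fnmatchcase


def resolve_columns(patterns, available):
    """Expand shell-style wildcard patterns against the available column names.

    Hypothetical helper for illustration only, not part of Recipipe.
    """
    resolved = []
    for pattern in patterns:
        for col in available:
            # Keep the first occurrence of every matching column.
            if fnmatchcase(col, pattern) and col not in resolved:
                resolved.append(col)
    return resolved


columns = ["name", "type 1", "type 2", "type 3", "width", "cute"]
matched = resolve_columns(["type *", "name"], columns)
print(matched)  # ['type 1', 'type 2', 'type 3', 'name']
```

Resolving the patterns against the DataFrame seen at fit time means the concrete column list is fixed before any later call to `transform`.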

Also, all Recipipe transformers accept a `dtype` parameter, which selects columns by data type.

In [8]:
r.select(dtype=[bool, float]).fit_transform(df).head()

Unnamed: 0,width,height,length,weight,cute
0,11.04,10.1147,14.2144,843.777,False
1,13.5695,7.07998,13.2144,1175.05,False
2,20.2368,13.9882,27.6025,6691.82,False
3,9.66392,5.85685,13.5306,775.836,True
4,6.35188,8.05839,7.03025,340.162,False
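
If you are used to plain pandas, the closest analogue I know of is `DataFrame.select_dtypes`. The sketch below uses a tiny made-up DataFrame (`df_demo`) and only shows the pandas side; it says nothing about how Recipipe implements `dtype`.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "name": ["Pilinoid", "Arkanoid"],
    "width": [9.66392, 6.35188],
    "cute": [True, False],
})

# Keep only boolean and floating-point columns, in their original order.
selected = df_demo.select_dtypes(include=[bool, float])
print(list(selected.columns))  # ['width', 'cute']
```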


Recipipe uses a lot of aliases to make the creation of transformers more convenient.
`recipipe.select` is defined as an alias for the `recipipe.transformers.SelectTransformer` class.
It is far easier to type `recipipe.select` than `recipipe.SelectTransformer`.
Using `r` as an alias of `recipipe`, as we do in this notebook, makes things even easier.
All the alias definitions are in `recipipe/__init__.py`.

# Create pipeline

If we want to chain one transformation after the other and apply them all at the same time, we need to create a `recipipe` object.
This is nothing but a super-vitaminized SKLearn pipeline.
For example, in contrast to SKLearn pipelines, `recipipe` objects can be created without any transformation at all.

Although we can also create a `recipipe` from a list of transformations, let's go step by step, adding one transformation at a time. So let's create an empty pipeline.

In [9]:
pipe = r.recipipe()

Recipipe pipelines take a `pandas.DataFrame` as input.
Every step in the pipeline receives a `pandas.DataFrame`, which is why the steps we add need to be Recipipe transformers.

---

Although reading the names I've given to the paranoids is not a waste of time, I doubt that our machine learning model will be able to enjoy my originality, and the sooner we delete unnecessary data, the better our pipeline will perform. Let's also assume that whether I find a paranoid `cute` or not is very uninformative for our future model.

We will use the `recipipe.drop` transformer to drop columns.

In [10]:
pipe += r.drop("name", "cute")

Another valid method for adding steps to the pipeline is `pipe.add(your_transformer_here)`.

---

Each paranoid has one or more types (at most 3).
Existing types are: Ground, Water, Flying, Electric and Fire.
Even though we know the types in advance, let's use a onehot encoder to find them all automatically.

Under the hood, the onehot encoder that Recipipe uses is `sklearn.preprocessing.OneHotEncoder`.
Recipipe is only a wrapper over it.
In fact, the definition of `recipipe.onehot` is:

    onehot = from_sklearn(OneHotEncoder(sparse=False, handle_unknown="ignore"))

`recipipe.onehot` is a creator object: when you call it, you get a Recipipe transformer that wraps the `sklearn.preprocessing.OneHotEncoder` passed as an argument to `from_sklearn`.

When you use `recipipe.onehot()`, a *columns wrapper* over the onehot encoder is created (equivalent to `recipipe.onehot(wrapper="columns")`).
A columns wrapper fits the SKLearn transformer on all the selected columns and transforms all of them at the same time. That's not going to work in this example, where we want to transform all the "type \*" columns using the same classes.

If we encode each column separately we will get different encodings, especially because almost no paranoid has 3 types. Look at the unique values of each type column:

In [11]:
for i in ["type 1", "type 2", "type 3"]:
    print(i, ":", df[i].unique())

type 1 : ['Ground' 'Flying' 'Water' 'Fire' 'Electric']
type 2 : ['Flying' nan 'Ground' 'Fire' 'Electric']
type 3 : [nan 'Water']


The first thing we need to do is get rid of the `nan` values. For that task we can use `recipipe.impute`, which is defined as:

    impute = from_sklearn(SimpleImputer(strategy="constant"))
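
To see what the wrapped SKLearn imputer does on its own, without any Recipipe wrapper around it, here is a small sketch on a made-up object array. The `"Unknown"` fill value is the one used later in this notebook.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Two "type" columns with one missing value (made-up data).
X = np.array([["Ground", "Flying"], ["Water", np.nan]], dtype=object)

# strategy="constant" replaces every missing value with fill_value.
imputer = SimpleImputer(strategy="constant", fill_value="Unknown")
filled = imputer.fit_transform(X)
print(filled.tolist())  # [['Ground', 'Flying'], ['Water', 'Unknown']]
```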

Next, we need to fit a onehot encoder on one long column created by concatenating all the "type \*" columns.
However, when the transformation is applied, it should be applied to one "type \*" column at a time.
For that we have the `fit_one_col` wrapper.

In [12]:
pipe += r.impute("type *", fill_value="Unknown")  # as strategy="constant" is used by default we only need
                                                  # to specify fill_value.
pipe += r.onehot("type *", wrapper="fit_one_col")

Notice that if you have another kind of column (unrelated to types) that you want to encode using onehot, you cannot reuse this transformer or all the categories will be mixed. You need to add a new `r.onehot` to the pipeline.
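
To make the idea behind `fit_one_col` more concrete, here is a plain pandas sketch of "fit on one long column, transform one column at a time". It imitates the behaviour described above on made-up data and is not Recipipe's actual code; the `column=category` naming simply copies the output format shown in this notebook.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "type 1": ["Ground", "Flying", "Water"],
    "type 2": ["Flying", "Unknown", "Unknown"],
})

# "Fit": learn one shared vocabulary from all the columns stacked together.
stacked = pd.concat([df_demo[c] for c in df_demo.columns], ignore_index=True)
categories = sorted(stacked.unique())

# "Transform": encode each column separately, reusing the shared vocabulary.
encoded = []
for col in df_demo.columns:
    as_cat = df_demo[col].astype(pd.CategoricalDtype(categories=categories))
    dummies = pd.get_dummies(as_cat, dtype=float)
    dummies.columns = [f"{col}={cat}" for cat in categories]
    encoded.append(dummies)

result = pd.concat(encoded, axis=1)
print(list(result.columns))
```

Every input column gets the same set of output categories, even those it never contains (here `type 2=Water` is a column full of zeros).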

---

Let's do more complicated operations.
Observe the results of the transformation pipeline we have so far.

In [13]:
pipe.fit_transform(df).head()

Unnamed: 0,type 1=Electric,type 1=Fire,type 1=Flying,type 1=Ground,type 1=Unknown,type 1=Water,type 2=Electric,type 2=Fire,type 2=Flying,type 2=Ground,...,type 3=Electric,type 3=Fire,type 3=Flying,type 3=Ground,type 3=Unknown,type 3=Water,width,height,length,weight
0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,11.04,10.1147,14.2144,843.777
1,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,13.5695,7.07998,13.2144,1175.05
2,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,20.2368,13.9882,27.6025,6691.82
3,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,9.66392,5.85685,13.5306,775.836
4,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,6.35188,8.05839,7.03025,340.162


First, I don't want to keep the "Unknown" columns.
They are not really unknown anyway; each one is just an indicator of whether a paranoid has a value in that "type \*" column or not.

In [14]:
pipe += r.drop("*=Unknown")

Next, I want to have only one "type=\*" column per type, with a 1 if the paranoid has that type in any of the three type columns and a 0 otherwise.
For example, if one row has "type 1=Fire"=0, "type 2=Fire"=1 and "type 3=Fire"=0, then I want to merge those three columns into one "type=Fire"=1.
I want to repeat the same process with all the types.

There is no transformer that performs that operation, but we can easily create one using existing Recipipe transformers.

In [15]:
class add_column_group(r.ColumnGroupsTransformer):
    def _transform_group(self, df, group_cols):
        return df[group_cols].sum(axis=1)

`recipipe.ColumnGroupsTransformer` expects to receive a list of lists of columns. It iterates over all the lists at the same time and reduces the columns of each iteration to a single column (N to 1 relationship between input and output columns).

Let's see how that applies here.
In this example, we want `_transform_group` to be called with `group_cols = ["type 1=Electric", "type 2=Electric", "type 3=Electric"]`, then with `group_cols = ["type 1=Fire", "type 2=Fire", "type 3=Fire"]`, and so on.
We want to store the results in `"type=Electric"`, `"type=Fire"`... respectively.
The `ColumnGroupsTransformer` automatically removes any number from the input names and uses the result as the output name, so this is enough:

In [16]:
pipe += add_column_group("type 1=*", "type 2=*", "type 3=*")  # It is important not to use a list here.
                                                              # If you use a list you will be summing all the cols.

Let's see how our pipeline is doing so far.

In [17]:
pipe.fit_transform(df).head()

Unnamed: 0,type=Electric,type=Fire,type=Flying,type=Ground,type=Water,width,height,length,weight
0,0.0,0.0,1.0,1.0,0.0,11.04,10.1147,14.2144,843.777
1,0.0,0.0,1.0,1.0,1.0,13.5695,7.07998,13.2144,1175.05
2,0.0,0.0,1.0,0.0,0.0,20.2368,13.9882,27.6025,6691.82
3,0.0,0.0,0.0,0.0,1.0,9.66392,5.85685,13.5306,775.836
4,0.0,1.0,1.0,0.0,0.0,6.35188,8.05839,7.03025,340.162
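
For reference, the same "group the columns and sum them" step can be imitated in plain pandas. This is only a sketch on made-up data: the regular expression that strips the number (and the space before it) is my own assumption about how an output name like `type=Fire` can be derived, not Recipipe's code.

```python
import re

import pandas as pd

onehot = pd.DataFrame({
    "type 1=Fire": [0.0, 1.0],
    "type 2=Fire": [1.0, 0.0],
    "type 1=Water": [0.0, 0.0],
    "type 2=Water": [0.0, 1.0],
})

# Group the columns by their name once the number is removed.
groups = {}
for col in onehot.columns:
    out_name = re.sub(r"\s*\d+", "", col)  # "type 1=Fire" -> "type=Fire"
    groups.setdefault(out_name, []).append(col)

# Reduce every group of N columns to a single column by summing row-wise.
merged = pd.DataFrame({name: onehot[cols].sum(axis=1) for name, cols in groups.items()})
print(merged)
```

Summing works as a logical "or" here as long as a type appears in at most one of the grouped columns for a given paranoid.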


I want to scale the width, height, length and weight using their mean and standard deviation.
But for some reason that I don't want to disclose here, I also want to scale the weight using a MinMax scaler.

No MinMax scaler transformer is included in Recipipe, so let's create one using SKLearn.

In [18]:
from sklearn.preprocessing import MinMaxScaler

minmax = r.from_sklearn(MinMaxScaler())

Now we can use `minmax` like any other Recipipe transformer.
We are going to use two new parameters of the Recipipe transformers: `keep_original` and `col_format`.

We want to keep the original, non-scaled column because we will apply a standard scaler to it later.
To keep the original column, the scaled output needs a different name.
Specify the output name with `col_format`.
Any '{}' in `col_format` will be replaced by the original column name.

In [19]:
pipe += minmax("weight", keep_original=True, col_format="{}_minmax")
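
The naming rule behaves like Python's own `str.format`. The following sketch only illustrates that rule; the `output_names` helper is hypothetical and not part of Recipipe.

```python
def output_names(cols, col_format="{}"):
    """Build output column names by filling '{}' with each input name.

    Hypothetical helper for illustration only.
    """
    return [col_format.format(col) for col in cols]


renamed = output_names(["weight"], col_format="{}_minmax")
unchanged = output_names(["weight"])  # the default '{}' keeps the name as is
print(renamed, unchanged)  # ['weight_minmax'] ['weight']
```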

`recipipe.scale` does the job of scaling using the mean and standard deviation.
You can use `col_format` without `keep_original`.

In [20]:
pipe += r.scale("width", "height", "length", "weight", col_format="{}_norm")

In [21]:
pipe.fit_transform(df).head()

Unnamed: 0,type=Electric,type=Fire,type=Flying,type=Ground,type=Water,width_norm,height_norm,length_norm,weight_norm,weight_minmax
0,0.0,0.0,1.0,1.0,0.0,-0.200894,-0.156701,0.405524,-0.411887,0.081269
1,0.0,0.0,1.0,1.0,1.0,0.312981,-0.720646,0.231961,-0.276334,0.11454
2,0.0,0.0,1.0,0.0,0.0,1.667463,0.563115,2.729205,1.981071,0.668619
3,0.0,0.0,0.0,0.0,1.0,-0.480449,-0.947942,0.286842,-0.439688,0.074445
4,0.0,1.0,1.0,0.0,0.0,-1.1533,-0.538827,-0.841379,-0.617961,0.030688


# Apply transformations

Let's simulate having a train and a test DataFrame.

In [22]:
half = len(df) // 2
df_train = df[:half]
df_test = df[half:]

Fit and transform the pipeline again, but this time on our train DataFrame.

In [23]:
pipe.fit_transform(df_train)

Unnamed: 0,type=Fire,type=Flying,type=Ground,type=Water,width_norm,height_norm,length_norm,weight_norm,weight_minmax
0,0.0,1.0,1.0,0.0,0.144645,0.569747,0.162215,-0.318262,0.079289
1,0.0,1.0,1.0,1.0,0.752826,-0.688989,-0.011537,-0.141651,0.131444
2,0.0,1.0,0.0,0.0,2.355881,2.17639,2.488416,2.799491,1.0
3,0.0,0.0,0.0,1.0,-0.186214,-1.196317,0.043403,-0.354483,0.068592
4,1.0,1.0,0.0,0.0,-0.982546,-0.283166,-1.086042,-0.586753,0.0
5,0.0,0.0,0.0,1.0,-0.692761,0.422488,-0.406144,-0.35708,0.067825
6,1.0,0.0,1.0,0.0,-1.141945,-0.363583,-1.052214,-0.521563,0.019251
7,0.0,0.0,1.0,0.0,0.028202,-1.166623,0.329902,-0.395829,0.056382
8,1.0,1.0,0.0,0.0,-0.278089,0.530052,-0.467999,-0.123871,0.136695


Transform the test dataset using the same transformations.

In [24]:
pipe.transform(df_test)

Unnamed: 0,type=Fire,type=Flying,type=Ground,type=Water,width_norm,height_norm,length_norm,weight_norm,weight_minmax
9,0.0,0.0,0.0,1.0,0.73525,0.061735,-1.085394,-0.402442,0.054429
10,0.0,1.0,0.0,0.0,-0.117896,1.46687,-0.010616,-0.021361,0.166967
11,0.0,1.0,0.0,0.0,1.307511,5.183283,0.150365,0.791354,0.406972
12,0.0,1.0,1.0,1.0,1.190322,0.999623,-0.781754,-0.341115,0.07254
13,0.0,1.0,1.0,0.0,0.837195,4.545396,-0.596028,0.195259,0.230938
14,0.0,0.0,0.0,0.0,0.499696,-1.023707,-1.105094,-0.59773,-0.003242
15,0.0,1.0,0.0,0.0,3.029605,6.243415,1.371507,4.558518,1.519463
16,1.0,0.0,0.0,0.0,-0.352157,1.880944,-1.501602,-0.435456,0.04468
17,0.0,1.0,0.0,0.0,-1.727554,-2.630516,-1.775385,-0.749648,-0.048105
18,0.0,0.0,0.0,1.0,1.863758,0.74358,0.702755,1.15185,0.513431


# Resources

* Planet generator: https://topps.diku.dk/torbenm/maps.msp.
Yes, I've used a random planet generator... Paranoia is not real :(
* Paranoia planet and paranoids: my insane brain, copyright ©