### Custom Encoder: Rule-Based

Lightwood uses "Encoders" to convert preprocessed (cleaned) data into **features**. Encoders represent the **feature engineering** step of the data science pipeline; they can either have a set of instructions ("rule-based") or a learned representation (trained on data).

In this notebook, we will build a custom encoder that performs **label encoding**.

For example, imagine we have the following set of categories:

```
MyColumnData = ["apple", "orange", "orange", "banana", "apple", "dragonfruit"]
```

There are 4 categories to consider: "apple", "banana", "orange", and "dragonfruit".

**Label encoding** allows you to refer to these categories as if they were numbers. For example, consider the mapping (arranged alphabetically):

1 - apple <br>
2 - banana <br>
3 - dragonfruit <br>
4 - orange <br>

Using this mapping, we can convert the above data as follows:

```
MyFeatureData = [1, 4, 4, 2, 1, 3]
```
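The mapping above can be reproduced in a few lines of plain Python. This is only a minimal sketch of the idea, not Lightwood code; the variable names are illustrative, and the labels start at 1 to match the example.

```python
# Raw categorical data from the example above
my_column_data = ["apple", "orange", "orange", "banana", "apple", "dragonfruit"]

# Sort the unique categories alphabetically so the mapping is reproducible
categories = sorted(set(my_column_data))

# Number the categories starting at 1, as in the mapping shown above
label_map = {category: label for label, category in enumerate(categories, start=1)}

# Replace every datapoint with its label
my_feature_data = [label_map[value] for value in my_column_data]

print(label_map)
print(my_feature_data)  # [1, 4, 4, 2, 1, 3]
```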

We will design a **LabelEncoder** for Lightwood to use on categorical data, working with the Kaggle "Used Car" [dataset](https://www.kaggle.com/adityadesai13/used-car-dataset-ford-and-mercedes). The CSV is loaded directly from a hosted link below. The dataset describes cars for sale; the goal is to predict each car's selling price.

Let's get started.

In [1]:
import pandas as pd

# Lightwood modules
import lightwood as lw
from lightwood import (
    ProblemDefinition,
    JsonAI,
    json_ai_from_problem,
    code_from_json_ai,
    predictor_from_code,
)

### 1) Load your data

Lightwood works with `pandas.DataFrame`s; load data via pandas as follows:

In [2]:
filename = 'https://raw.githubusercontent.com/mindsdb/benchmarks/main/benchmarks/datasets/used_car_price/data.csv'
df = pd.read_csv(filename)
df.head()

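|   | model | year | price | transmission | mileage | fuelType | tax | mpg | engineSize |
|---|-------|------|-------|--------------|---------|----------|-----|-----|------------|
| 0 | A1 | 2017 | 12500 | Manual | 15735 | Petrol | 150 | 55.4 | 1.4 |
| 1 | A6 | 2016 | 16500 | Automatic | 36203 | Diesel | 20 | 64.2 | 2.0 |
| 2 | A1 | 2016 | 11000 | Manual | 29946 | Petrol | 30 | 55.4 | 1.4 |
| 3 | A4 | 2017 | 16800 | Automatic | 25952 | Diesel | 145 | 67.3 | 2.0 |
| 4 | A3 | 2019 | 17300 | Manual | 1998 | Petrol | 145 | 49.6 | 1.0 |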


The dataset has nine columns: `model`, `year`, `price`, `transmission`, `mileage`, `fuelType`, `tax`, `mpg`, and `engineSize`. Some are numerical (for example `year` and `mileage`), whereas others are categorical (`model`, `transmission`, and `fuelType`).
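A quick way to see this split is to ask pandas which columns hold numbers. The sketch below rebuilds the five rows shown above in a small frame (named `sample` here purely for illustration) so it runs without downloading the CSV. A pandas dtype is only a rough hint; Lightwood performs its own type inference in the next step.

```python
import pandas as pd

# The five rows from df.head(), rebuilt by hand for illustration
sample = pd.DataFrame({
    "model": ["A1", "A6", "A1", "A4", "A3"],
    "year": [2017, 2016, 2016, 2017, 2019],
    "price": [12500, 16500, 11000, 16800, 17300],
    "transmission": ["Manual", "Automatic", "Manual", "Automatic", "Manual"],
    "mileage": [15735, 36203, 29946, 25952, 1998],
    "fuelType": ["Petrol", "Diesel", "Petrol", "Diesel", "Petrol"],
    "tax": [150, 20, 30, 145, 145],
    "mpg": [55.4, 64.2, 55.4, 67.3, 49.6],
    "engineSize": [1.4, 2.0, 1.4, 2.0, 1.0],
})

# Columns stored as integers or floats
numeric_cols = sample.select_dtypes(include="number").columns.tolist()

# Everything else (here: text columns, which are candidates for label encoding)
non_numeric_cols = sample.select_dtypes(exclude="number").columns.tolist()

print("numeric:", numeric_cols)
print("non-numeric:", non_numeric_cols)
```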


### 2) Generate JSON-AI Syntax

We will make a `LabelEncoder` as follows:

1. Find all unique categories within a column.
2. Order the categories in a consistent way.
3. Assign each category an integer label, starting at 0 (Python indexing).
4. Replace each datapoint with the label of its category.
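The four steps above can be sketched in plain Python before any Lightwood code is involved. The function names and the `unknown` fallback of `-1` for unseen categories are assumptions made for this illustration, not part of the tutorial's encoder. Note that the labels here start at 0, whereas the introductory example numbers from 1.

```python
def build_label_map(values):
    """Steps 1-3: collect unique categories, order them, number them from 0."""
    unique_categories = set(values)              # step 1: unique categories
    ordered = sorted(unique_categories)          # step 2: consistent (alphabetical) order
    return {category: index for index, category in enumerate(ordered)}  # step 3


def apply_label_map(values, label_map, unknown=-1):
    """Step 4: replace each datapoint with its label.

    `unknown` is a hypothetical fallback for categories never seen
    while building the map.
    """
    return [label_map.get(value, unknown) for value in values]


transmissions = ["Manual", "Automatic", "Manual", "Automatic", "Manual"]
label_map = build_label_map(transmissions)
encoded = apply_label_map(transmissions + ["Semi-Auto"], label_map)

print(label_map)  # {'Automatic': 0, 'Manual': 1}
print(encoded)    # [1, 0, 1, 0, 1, -1]
```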

First, let's generate the JSON-AI so that Lightwood automatically infers the type of each column.

In [3]:
# Create the Problem Definition
pdef = ProblemDefinition.from_dict({
    'target': 'price', # column you want to predict
})

# Generate a JSON-AI object
json_ai = json_ai_from_problem(df, problem_definition=pdef)

INFO:lightwood-18190:Dropping features: []
INFO:lightwood-18190:Analyzing a sample of 6920
INFO:lightwood-18190:from a total population of 10668, this is equivalent to 64.9% of your data.
INFO:lightwood-18190:Using 15 processes to deduct types.
INFO:lightwood-18190:Infering type for: model
INFO:lightwood-18190:Infering type for: year
INFO:lightwood-18190:Infering type for: price
INFO:lightwood-18190:Infering type for: mileage
INFO:lightwood-18190:Infering type for: transmission
INFO:lightwood-18190:Infering type for: fuelType
INFO:lightwood-18190:Infering type for: tax
INFO:lightwood-18190:Infering type for: mpg
INFO:lightwood-18190:Infering type for: engineSize
INFO:lightwood-18190:Column year has data type integer
INFO:lightwood-18190:Column tax has data type integer
INFO:lightwood-18190:Column mileage has data type integer
INFO:lightwoo

Let's take a look at our JSON-AI and write it to a file.

In [4]:
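# Serialize once, then reuse the string for printing and saving
json_ai_str = json_ai.to_json()
print(json_ai_str)

# write() is the right call for a single string; writelines() expects an iterable of lines
with open("default.json", "w") as f:
    f.write(json_ai_str)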

{
    "features": {
        "model": {
            "encoder": {
                "module": "Categorical.OneHotEncoder",
                "args": {}
            }
        },
        "year": {
            "encoder": {
                "module": "Integer.NumericEncoder",
                "args": {}
            }
        },
        "transmission": {
            "encoder": {
                "module": "Categorical.OneHotEncoder",
                "args": {}
            }
        },
        "mileage": {
            "encoder": {
                "module": "Integer.NumericEncoder",
                "args": {}
            }
        },
        "fuelType": {
            "encoder": {
                "module": "Categorical.OneHotEncoder",
                "args": {}
            }
        },
        "tax": {
            "encoder": {
                "module": "Integer.NumericEncoder",
                "args": {}
            }
        },
        "mpg": {
            "encoder": {
                "module": "Float

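To find which columns could use a label encoder, the encoder assigned to each feature can be read back with the standard `json` module. The sketch below assumes the `features` → `encoder` → `module` layout visible in the output above and uses a hand-copied excerpt (three features only) instead of the full `default.json`.

```python
import json

# Hand-copied excerpt of the JSON-AI output above (not the complete file)
json_ai_excerpt = """
{
    "features": {
        "model": {"encoder": {"module": "Categorical.OneHotEncoder", "args": {}}},
        "year": {"encoder": {"module": "Integer.NumericEncoder", "args": {}}},
        "transmission": {"encoder": {"module": "Categorical.OneHotEncoder", "args": {}}}
    }
}
"""

parsed = json.loads(json_ai_excerpt)

# Map each feature name to the encoder module assigned to it
encoders = {name: spec["encoder"]["module"] for name, spec in parsed["features"].items()}

# Features that currently use a categorical encoder
categorical_features = [name for name, module in encoders.items()
                        if module.startswith("Categorical.")]

print(encoders)
print(categorical_features)  # ['model', 'transmission']
```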
### 3) Create your custom encoder (`LabelEncoder`).

With the JSON-AI generated, let's make our `LabelEncoder`. All Lightwood encoders inherit from the `BaseEncoder` class, found [here](https://github.com/mindsdb/lightwood/blob/staging/lightwood/encoder/base.py).

![BaseEncoder](baseencoder.png)


The `BaseEncoder` has five expected calls:

- `__init__`: instantiate the encoder
- `prepare`: train the encoder or create its rules
- `encode`: given data, convert it to the featurized representation
- `decode`: given featurized representations, convert them back to data
- `to`: move the encoder to CPU or GPU (mostly important for learned representations)
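To make the five calls concrete, here is a plain-Python stand-in for a rule-based label encoder. It is only a sketch under stated assumptions: the class `SketchLabelEncoder` is hypothetical, it does not subclass Lightwood's `BaseEncoder`, it returns Python lists rather than tensors, and the method signatures and the `-1` label for unseen categories are illustrative choices rather than Lightwood's API.

```python
class SketchLabelEncoder:
    """Hypothetical stand-in mirroring the five calls; not Lightwood's BaseEncoder."""

    def __init__(self, unknown_label=-1):
        # Instantiate the encoder; nothing is known about the data yet
        self.is_prepared = False
        self.unknown_label = unknown_label  # assumed label for unseen categories
        self.label_map = {}
        self.inverse_map = {}
        self.device = "cpu"

    def prepare(self, values):
        # Create the rules: one integer (starting at 0) per sorted unique category
        if self.is_prepared:
            raise RuntimeError("Encoder has already been prepared")
        ordered = sorted({str(value) for value in values})
        self.label_map = {category: index for index, category in enumerate(ordered)}
        self.inverse_map = {index: category for category, index in self.label_map.items()}
        self.is_prepared = True

    def encode(self, values):
        # Convert data to labels; unseen categories get unknown_label
        return [self.label_map.get(str(value), self.unknown_label) for value in values]

    def decode(self, labels):
        # Convert labels back to categories; unknown labels become None
        return [self.inverse_map.get(label) for label in labels]

    def to(self, device):
        # A rule-based encoder has no parameters to move, so only record the device
        self.device = device
        return self


encoder = SketchLabelEncoder()
encoder.prepare(["Petrol", "Diesel", "Petrol", "Hybrid"])

encoded = encoder.encode(["Diesel", "Petrol", "Electric"])
decoded = encoder.decode(encoded)

print(encoded)  # [0, 2, -1]
print(decoded)  # ['Diesel', 'Petrol', None]
```

Because the mapping is fixed once `prepare` has run, encoding is a dictionary lookup, and decoding is the inverse lookup.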
