# MPG

We've looked at MPG before. It's a small dataset, but has some really nice categoricals.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


mpg = pd.read_csv('../files/mpg.csv', index_col=0)
mpg.head()

The key to finding categoricals is calling `nunique` on the columns. Columns with few unique values may very well be categoricals.

In [None]:
mpg.nunique()

* year only has 2 values, but is still a continuous field (although it does show that this isn't the greatest dataset around).
* cyl has 4 values, the number of cylinders. This is a good ordered categorical.
* trans has 10 different values, and something else intriguing is going on: a value like "manual(m5)" combines 2 pieces of data, manual/automatic and the number of gears.
* drv is an unordered categorical (unless you feel that front-wheel drive is better than a 4x4?)
* fl is fuel type, unordered categorical
* class is also an unordered categorical

(When doing this kind of analysis, always keep a small code block on hand where you can quickly check the different values in a column, like the one just below.)

In [None]:
mpg["fl"].unique()

## The metric system

Who knows what a mile/gallon is? A model won't care, it's just a scaled number, but we do.

Create two new columns: "clkm" (city litres per km), derived from "cty" (city miles per gallon), and "hwlkm" (highway litres per km), derived from "hwy" (highway miles per gallon).

In [None]:
# Up to you!
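
One possible solution, as a minimal sketch. It assumes the dataset uses US gallons (an assumption, the source does not say) and runs on a few made-up rows so it can be executed without the CSV file; on the real data you would apply the same two assignments to `mpg`.

```python
import pandas as pd

# Toy stand-in for the real mpg frame (values are illustrative only).
mpg = pd.DataFrame({"cty": [18, 21, 16], "hwy": [29, 30, 25]})

KM_PER_MILE = 1.609344
LITRES_PER_US_GALLON = 3.785411784  # assumption: US gallons, not imperial

# miles/gallon -> litres/km: the ratio is inverted (distance per fuel becomes
# fuel per distance), so we divide a constant factor by the mpg value.
L_PER_KM_FACTOR = LITRES_PER_US_GALLON / KM_PER_MILE

mpg["clkm"] = L_PER_KM_FACTOR / mpg["cty"]
mpg["hwlkm"] = L_PER_KM_FACTOR / mpg["hwy"]

print(mpg)
```

Note that the direction flips: a higher miles-per-gallon value is better, while a higher litres-per-km value is worse.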



## Ordered categoricals

Let's begin with the ordered categoricals. When doing graphs we'd do what we did in the previous Diamonds-exercise:

```Python
cut_type = CategoricalDtype(categories=[...
```

But this is a model-building course, not a graph-building course, so we want to translate these values into numerical values. Unfortunately for the exercise, for the number of cylinders this is already done. The values in the column are 4, 5, 6 and 8: integers that increase as the number of cylinders gets bigger. Any model will take that into account.

But let's, for argument's sake, say there is an order in fuel types. We'll stick to the following table:

| Code  | Fuel Type                    | Environmental Impact (CO₂, NOx, etc.)                                  |
| ----- | ---------------------------- | ---------------------------------------------------------------------- |
| `'r'` | Regular unleaded petrol      | High emissions                                                         |
| `'p'` | Petrol                       | High emissions                                                         |
| `'d'` | Diesel                       | Lower CO₂ than petrol, but higher NOx/particulates; still high overall |
| `'c'` | CNG (compressed natural gas) | Cleaner than petrol/diesel, but still fossil fuel                      |
| `'e'` | Electric                     | Lowest emissions (assuming clean grid)                                 |

How would we encode this into a model? First you have to choose which gets a lower number: bad for the environment or low emissions? It matters for us as humans, but for a model it doesn't. If there were a column about the taxes for every vehicle and these were to go up for cars as they get worse for the environment, then 0 = bad and 4 = good will yield a negative correlation. 0 = good and 4 = bad would yield a positive correlation. In any case the model will have the information needed to predict one based on the other.
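
The sign flip described above can be checked with a small sketch. The `tax` values are invented purely for illustration (the dataset has no such column), and the column names `fl_bad_low` and `fl_bad_high` are hypothetical.

```python
import pandas as pd

# Invented example: tax rises as the fuel gets worse for the environment.
df = pd.DataFrame({
    "fl_bad_low": [0, 1, 2, 3, 4],      # 0 = bad, 4 = good
    "tax": [500, 450, 400, 200, 50],    # hypothetical values
})

# The same information with the direction reversed: 0 = good, 4 = bad.
df["fl_bad_high"] = 4 - df["fl_bad_low"]

corr_bad_low = df["fl_bad_low"].corr(df["tax"])
corr_bad_high = df["fl_bad_high"].corr(df["tax"])

print(corr_bad_low, corr_bad_high)
```

The two correlations have the same magnitude and opposite signs, which is why the choice of direction does not change what a model can learn from the column.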

So encode this field so every fuel type gets a number in the order stated in the table above.

In [None]:
# Up to you!
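
One possible solution, sketched with a plain dictionary and `Series.map`. The toy rows and the column name `fl_encoded` are assumptions made for illustration; the order of the codes follows the table above.

```python
import pandas as pd

# Toy stand-in for the real mpg frame (values are illustrative only).
mpg = pd.DataFrame({"fl": ["r", "p", "d", "c", "e", "r"]})

# Order taken from the table: highest impact first, lowest impact last.
fuel_order = ["r", "p", "d", "c", "e"]
fuel_codes = {code: rank for rank, code in enumerate(fuel_order)}

# Any value missing from the mapping would become NaN, which makes
# unexpected fuel codes easy to spot.
mpg["fl_encoded"] = mpg["fl"].map(fuel_codes)

print(mpg)
```

Keeping `fuel_codes` around is useful later, when you need to translate a number back into a fuel type.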



## Unordered categoricals

Unordered categoricals (like 'red', 'blue', 'green' or 'brand A', 'brand B') should not be left as raw text unless you're using a model that explicitly supports it (like CatBoost).

The best way to encode them depends on the model you're using and the cardinality (number of unique categories).

Let's try one-hot encoding the 'drv'-column.


In [None]:
# Up to you!
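
One possible solution, sketched with `pd.get_dummies` on a few made-up rows. The name `mpg_encoded` is hypothetical; on the real data the same two lines would apply to `mpg`.

```python
import pandas as pd

# Toy stand-in for the real mpg frame (values are illustrative only).
mpg = pd.DataFrame({"drv": ["f", "4", "r", "f"], "cty": [18, 15, 14, 21]})

# One indicator column per distinct value of drv.
drv_dummies = pd.get_dummies(mpg["drv"], prefix="drv")

# Replace the text column with its indicator columns.
mpg_encoded = pd.concat([mpg.drop(columns="drv"), drv_dummies], axis=1)

print(mpg_encoded)
```

Every row has exactly one of the `drv_*` columns set, so no order between the drive types is implied.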



Pros:
* Interpretable
* No ordinal assumptions

Cons:
* Sparse/high-dimensional if lots of categories
* Can hurt tree model performance if cardinality is high

Try a label-encoder for class next.

In [None]:
# Up to you!
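
One possible solution, as a sketch. It uses the pandas category codes rather than scikit-learn's `LabelEncoder`; that is a swapped-in technique chosen to keep the example dependency-free, and both assign integers to the alphabetically sorted labels. The names `class_encoded` and `class_lookup` are hypothetical.

```python
import pandas as pd

# Toy stand-in for the real mpg frame (values are illustrative only).
mpg = pd.DataFrame({"class": ["compact", "suv", "midsize", "compact"]})

# Converting to a categorical sorts the distinct labels and numbers them.
class_cat = mpg["class"].astype("category")
mpg["class_encoded"] = class_cat.cat.codes

# Keep the code -> label mapping so predictions can be read back later.
class_lookup = dict(enumerate(class_cat.cat.categories))

print(mpg)
print(class_lookup)
```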



Pros:
* Simple, compact
* Tree models split on thresholds, so with enough splits they can isolate individual codes and are less sensitive to the arbitrary order

Cons:
* Not safe for linear models: they'll assume order where none exists

## Feature engineering

Finally, trans, the dual-valued column. First split it into two different columns.

In [None]:
# Up to you!
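
One possible solution, sketched with `Series.str.extract` and a regular expression with two named groups. It assumes every value has the shape `type(detail)`, as in "manual(m5)"; the toy rows are illustrative only.

```python
import pandas as pd

# Toy stand-in for the real mpg frame (values are illustrative only).
mpg = pd.DataFrame({"trans": ["auto(l5)", "manual(m5)", "auto(av)", "manual(m6)"]})

# Named groups become the column names of the extracted frame.
pattern = r"^(?P<trans_type>\w+)\((?P<trans_detail>\w+)\)$"
parts = mpg["trans"].str.extract(pattern)

# Add the two new columns next to the original one.
mpg = mpg.join(parts)

print(mpg)
```

Values that do not match the pattern would end up as NaN in both new columns, so a quick `isna().sum()` is a cheap sanity check on the real data.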



Which leaves us with two more categorical columns that we can apply all of the previous rules to.

And what is next?

- Encode trans_type and trans_detail with one-hot or a label-encoder
- Maybe encode manufacturer, but not with one-hot encoding (it has 15 different values)
- Drop all non-numeric columns (booleans are fine too)
- Train a model and start predicting!
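
The step of dropping everything that is not numeric or boolean could be sketched with `DataFrame.select_dtypes`. The toy frame and the name `features` are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for an mpg frame after encoding (values are illustrative only).
mpg = pd.DataFrame({
    "manufacturer": ["audi", "ford"],  # still text, so it will be dropped
    "displ": [1.8, 4.6],
    "cyl": [4, 8],
    "drv_f": [True, False],
})

# Keep only numeric and boolean columns.
features = mpg.select_dtypes(include=["number", "bool"])

print(features.columns.tolist())
```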

Your model should improve, but interpreting its inputs and outputs will become more difficult. (Is audi 1 or 2? And which number does 'CNG' have?)