----------------------------
#### Handling categorical data - encoding - class labels
------------------------

- Many machine learning libraries require that `class labels` are encoded as `integer` values. 

- We need to remember that `class labels are not ordinal`, 
    - it doesn't matter which integer number we assign to a particular string-label. 
    - Thus, we can simply enumerate the class labels, starting at 0.
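
The enumeration described above can be sketched in plain Python. The label values below are hypothetical and chosen only for illustration; sorting the unique labels first is an assumption made here so that the mapping is reproducible.

```python
# Hypothetical class labels; any list of strings would work the same way.
labels = ["sedan", "wagon", "sedan", "hatchback", "wagon"]

# Enumerate the sorted unique labels so each class gets an integer starting at 0.
class_mapping = {label: idx for idx, label in enumerate(sorted(set(labels)))}

# Encode the labels, and build the inverse mapping to decode them later.
encoded = [class_mapping[label] for label in labels]
inverse_mapping = {idx: label for label, idx in class_mapping.items()}
decoded = [inverse_mapping[idx] for idx in encoded]

print(class_mapping)
print(encoded)
```

Keeping the inverse mapping around makes it possible to translate predictions back into the original string labels.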

#### Data Set Information:

- This data set consists of three types of entities: 
    - (a) the specification of an `auto` in terms of various characteristics, 
    - (b) its assigned insurance risk rating, 
    - (c) its normalized losses in use as compared to other cars. 
    
The second rating corresponds to the degree to which the auto is more 
risky than its price indicates. Cars are initially assigned a risk 
factor symbol associated with their price. 
Then, if a car is more risky (or less), this symbol is adjusted by
moving it up (or down) the scale. Actuaries call this process 
"symboling". 
A value of +3 indicates that the auto is risky, 
-3 that it is probably pretty safe. 

The third factor is the relative average loss payment per insured 
vehicle year. This value is normalized for all autos within a particular 
size classification (two-door small, station wagons, sports/speciality, 
etc.), and represents the average loss per car per year. 

Note: Several of the attributes in the database could be used as a "class" 
    attribute.


Attribute Information:

Attribute: Attribute Range 

1. symboling: -3, -2, -1, 0, 1, 2, 3. 
2. normalized-losses: continuous from 65 to 256. 
3. `make`: 
alfa-romero, audi, bmw, chevrolet, dodge, honda, 
isuzu, jaguar, mazda, mercedes-benz, mercury, 
mitsubishi, nissan, peugot, plymouth, porsche, 
renault, saab, subaru, toyota, volkswagen, volvo 

4. `fuel-type`: diesel, gas. 
5. aspiration: std, turbo. 
6. `num-of-doors`: four, two. 
7. `body-style`: hardtop, wagon, sedan, hatchback, convertible. 
8. `drive-wheels`: 4wd, fwd, rwd. 
9. `engine-location`: front, rear. 
10. `wheel-base`: continuous from 86.6 to 120.9. 
11. length: continuous from 141.1 to 208.1. 
12. width: continuous from 60.3 to 72.3. 
13. height: continuous from 47.8 to 59.8. 
14. curb-weight: continuous from 1488 to 4066. 
15. `engine-type`: dohc, dohcv, l, ohc, ohcf, ohcv, rotor. 
16. `num-of-cylinders`: eight, five, four, six, three, twelve, two. 
17. engine-size: continuous from 61 to 326. 
18. `fuel-system`: 1bbl, 2bbl, 4bbl, idi, mfi, mpfi, spdi, spfi. 
19. bore: continuous from 2.54 to 3.94. 
20. stroke: continuous from 2.07 to 4.17. 
21. compression-ratio: continuous from 7 to 23. 
22. horsepower: continuous from 48 to 288. 
23. peak-rpm: continuous from 4150 to 6600. 
24. `city-mpg`: continuous from 13 to 49. 
25. `highway-mpg`: continuous from 16 to 54. 
26. `price`: continuous from 5118 to 45400.

#### domain knowledge is key to categorical to numeric conversion

In [2]:
import pandas as pd
import numpy as np

- When we are talking about `categorical` data, 
    - we have to further distinguish between `nominal` and `ordinal` features. 

- `Ordinal` features can be understood as categorical values that can be sorted or ordered. 
    - For example, T-shirt size would be an ordinal feature, because we can define an order XL > L > M. 
    
- In contrast, `nominal` features don't imply any order and, to continue with the previous example, we could think of T-shirt color as a nominal feature since it typically doesn't make sense to say that, for example, red is larger than blue.
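
As a minimal sketch of the distinction, the T-shirt example can be written out with a small, hypothetical dataframe (`df_shirts` and `size_mapping` are names invented here). The ordinal `size` column gets integers that preserve the order XL > L > M, while the nominal `color` column is left untouched because no integer order would be meaningful for it.

```python
import pandas as pd

# Hypothetical T-shirt data, used only for illustration.
df_shirts = pd.DataFrame({
    "color": ["green", "red", "blue"],
    "size": ["M", "L", "XL"],
})

# Ordinal feature: the integers preserve the order XL > L > M.
size_mapping = {"M": 1, "L": 2, "XL": 3}
df_shirts["size_encoded"] = df_shirts["size"].map(size_mapping)

# Nominal feature: color is deliberately not mapped to integers here,
# since red is neither larger nor smaller than blue.
print(df_shirts)
```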

In [4]:
location = "https://github.com/gridflowai/gridflowAI-datasets-icons/raw/master/AI-DATASETS/01-MISC/auto-specs.csv"

In [6]:
# Define the headers since the data does not have any
headers = ["symboling", "normalized_losses", "make", "fuel_type", "aspiration",
           "num_doors", "body_style", "drive_wheels", "engine_location",
           "wheel_base", "length", "width", "height", "curb_weight",
           "engine_type", "num_cylinders", "engine_size", "fuel_system",
           "bore", "stroke", "compression_ratio", "horsepower", "peak_rpm",
           "city_mpg", "highway_mpg", "price"]

In [8]:
# load the training data 
df_auto = pd.read_csv(location, header=None, names=headers, na_values="?" )
df_auto.head()

Unnamed: 0,symboling,normalized_losses,make,fuel_type,aspiration,num_doors,body_style,drive_wheels,engine_location,wheel_base,...,engine_size,fuel_system,bore,stroke,compression_ratio,horsepower,peak_rpm,city_mpg,highway_mpg,price
0,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111.0,5000.0,21,27,13495.0
1,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111.0,5000.0,21,27,16500.0
2,1,,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154.0,5000.0,19,26,16500.0
3,2,164.0,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102.0,5500.0,24,30,13950.0
4,2,164.0,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115.0,5500.0,18,22,17450.0
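
The effect of `header=None`, `names=` and `na_values="?"` can be checked on a tiny in-memory sample. The three rows below are a hypothetical excerpt written in the same style as the auto data, not the real file.

```python
import io
import pandas as pd

# Hypothetical three-row excerpt in the same style as the auto data:
# no header row, and "?" marking a missing value.
raw = io.StringIO("3,?,alfa-romero\n2,164,audi\n1,158,audi\n")

df_demo = pd.read_csv(raw, header=None,
                      names=["symboling", "normalized_losses", "make"],
                      na_values="?")

# The "?" entry is read as NaN, so the column is parsed as float.
print(df_demo)
print(df_demo.dtypes)
```

Without `na_values="?"`, the `normalized_losses` column would be read as strings, because "?" cannot be parsed as a number.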


In [12]:
df_auto.dtypes

symboling              int64
normalized_losses    float64
make                  object
fuel_type             object
aspiration            object
num_doors             object
body_style            object
drive_wheels          object
engine_location       object
wheel_base           float64
length               float64
width                float64
height               float64
curb_weight            int64
engine_type           object
num_cylinders         object
engine_size            int64
fuel_system           object
bore                 float64
stroke               float64
compression_ratio    float64
horsepower           float64
peak_rpm             float64
city_mpg               int64
highway_mpg            int64
price                float64
dtype: object

#### what do the value_counts() convey?

In [14]:
df_auto.make.value_counts()

make
toyota           32
nissan           18
mazda            17
mitsubishi       13
honda            13
volkswagen       12
subaru           12
peugot           11
volvo            11
dodge             9
mercedes-benz     8
bmw               8
audi              7
plymouth          7
saab              6
porsche           5
isuzu             4
jaguar            3
chevrolet         3
alfa-romero       3
renault           2
mercury           1
Name: count, dtype: int64

In [16]:
df_auto.fuel_type.value_counts()

fuel_type
gas       185
diesel     20
Name: count, dtype: int64

In [7]:
df_auto.engine_location.value_counts()

front    202
rear       3
Name: engine_location, dtype: int64
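
One thing the counts convey is how unevenly the categories are distributed. As a rough sketch, the `engine_location` column is rebuilt below from the counts shown above (202 front, 3 rear), and `normalize=True` turns the counts into relative frequencies.

```python
import pandas as pd

# Rebuild the engine_location column from the counts shown above (202 front, 3 rear).
engine_location = pd.Series(["front"] * 202 + ["rear"] * 3, name="engine_location")

# normalize=True returns relative frequencies instead of raw counts.
shares = engine_location.value_counts(normalize=True)
print(shares.round(3))
```

A category that covers only a tiny share of the rows, such as `rear` here, gives a model very few examples to learn from.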

In [18]:
# To focus on encoding the categorical variables,
# we keep only the object columns of the dataframe.
# First, check the dtypes again to see which columns are of type object.
df_auto.dtypes

symboling              int64
normalized_losses    float64
make                  object
fuel_type             object
aspiration            object
num_doors             object
body_style            object
drive_wheels          object
engine_location       object
wheel_base           float64
length               float64
width                float64
height               float64
curb_weight            int64
engine_type           object
num_cylinders         object
engine_size            int64
fuel_system           object
bore                 float64
stroke               float64
compression_ratio    float64
horsepower           float64
peak_rpm             float64
city_mpg               int64
highway_mpg            int64
price                float64
dtype: object

Pandas has a helpful __select_dtypes__ function which we can use to build a new dataframe containing only the object columns.

In [20]:
obj_df = df_auto.select_dtypes(include=['object']).copy()
obj_df.head()

Unnamed: 0,make,fuel_type,aspiration,num_doors,body_style,drive_wheels,engine_location,engine_type,num_cylinders,fuel_system
0,alfa-romero,gas,std,two,convertible,rwd,front,dohc,four,mpfi
1,alfa-romero,gas,std,two,convertible,rwd,front,dohc,four,mpfi
2,alfa-romero,gas,std,two,hatchback,rwd,front,ohcv,six,mpfi
3,audi,gas,std,four,sedan,fwd,front,ohc,four,mpfi
4,audi,gas,std,four,sedan,4wd,front,ohc,five,mpfi


In [22]:
obj_df.isnull().sum()

make               0
fuel_type          0
aspiration         0
num_doors          2
body_style         0
drive_wheels       0
engine_location    0
engine_type        0
num_cylinders      0
fuel_system        0
dtype: int64

In [24]:
# there are a couple of null values in the data that we need to clean up.
obj_df[obj_df.isnull().any(axis=1)]

Unnamed: 0,make,fuel_type,aspiration,num_doors,body_style,drive_wheels,engine_location,engine_type,num_cylinders,fuel_system
27,dodge,gas,turbo,,sedan,fwd,front,ohc,four,mpfi
63,mazda,diesel,std,,sedan,fwd,front,ohc,four,idi


In [26]:
obj_df["num_doors"].value_counts()

num_doors
four    114
two      89
Name: count, dtype: int64

In [13]:
# just fill in the missing values with "four"
# (since that is the most common value)

In [28]:
obj_df.describe(include='all')

Unnamed: 0,make,fuel_type,aspiration,num_doors,body_style,drive_wheels,engine_location,engine_type,num_cylinders,fuel_system
count,205,205,205,203,205,205,205,205,205,205
unique,22,2,2,2,5,3,2,7,7,8
top,toyota,gas,std,four,sedan,fwd,front,ohc,four,mpfi
freq,32,185,168,114,96,120,202,148,159,94


In [34]:
obj_df = obj_df.fillna({"num_doors": "four"})
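
Instead of hard-coding the fill value, the most common value can also be looked up with `mode()`. This is a sketch on a hypothetical column that mirrors `num_doors`; it is not the cell used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical column with missing entries, mirroring num_doors.
doors = pd.Series(["four", "two", np.nan, "four", np.nan, "two", "four"],
                  name="num_doors")

# mode() ignores missing values; take the first entry in case of ties.
most_common = doors.mode()[0]
doors_filled = doors.fillna(most_common)

print(most_common)
print(doors_filled.isnull().sum())
```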

In [36]:
obj_df.isnull().sum()

make               0
fuel_type          0
aspiration         0
num_doors          0
body_style         0
drive_wheels       0
engine_location    0
engine_type        0
num_cylinders      0
fuel_system        0
dtype: int64

In [38]:
obj_df.num_cylinders.value_counts()

num_cylinders
four      159
six        24
five       11
eight       5
two         4
three       1
twelve      1
Name: count, dtype: int64

#### Approach 1 - Find and Replace 
- encoding
     - num_doors
     - num_cylinders



In [16]:
obj_df["num_cylinders"].value_counts()

four      159
six        24
five       11
eight       5
two         4
three       1
twelve      1
Name: num_cylinders, dtype: int64

In [42]:
cleanup_nums = {"num_doors":     {"four": 4, "two": 2},
                "num_cylinders": {"four": 4, "six": 6, "five": 5, "eight": 8, "two": 2, "twelve": 12, "three":3 }}

In [44]:
obj_df.replace(cleanup_nums, inplace=True)
obj_df.head()


Unnamed: 0,make,fuel_type,aspiration,num_doors,body_style,drive_wheels,engine_location,engine_type,num_cylinders,fuel_system
0,alfa-romero,gas,std,2,convertible,rwd,front,dohc,4,mpfi
1,alfa-romero,gas,std,2,convertible,rwd,front,dohc,4,mpfi
2,alfa-romero,gas,std,2,hatchback,rwd,front,ohcv,6,mpfi
3,audi,gas,std,4,sedan,fwd,front,ohc,4,mpfi
4,audi,gas,std,4,sedan,4wd,front,ohc,5,mpfi
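
An alternative to `replace` is `Series.map`, sketched here on a hypothetical dataframe (`df_demo` and `word_to_int` are names introduced for this example). Unlike `replace`, `map` turns any value that is missing from the dictionary into `NaN`, which makes unexpected spellings easy to detect.

```python
import pandas as pd

# Hypothetical rows using the same spelled-out numbers as the auto data.
df_demo = pd.DataFrame({
    "num_doors": ["two", "four", "four"],
    "num_cylinders": ["four", "six", "twelve"],
})

word_to_int = {"two": 2, "three": 3, "four": 4, "five": 5,
               "six": 6, "eight": 8, "twelve": 12}

# Series.map looks every value up in the dictionary; values that are
# missing from the dictionary become NaN, which makes typos easy to spot.
for col in ["num_doors", "num_cylinders"]:
    df_demo[col] = df_demo[col].map(word_to_int)

print(df_demo)
print(df_demo.dtypes)
```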


#### Approach 2 - Label Encoding 

In [21]:
# body_style column contains 5 different values. 

# convertible -> 0
# hardtop -> 1
# hatchback -> 2
# sedan -> 3
# wagon -> 4

In [48]:
obj_df["body_style"].value_counts()

body_style
sedan          96
hatchback      70
wagon          25
hardtop         8
convertible     6
Name: count, dtype: int64

In [50]:
# check the data type for body_style (object)
obj_df.dtypes

make               object
fuel_type          object
aspiration         object
num_doors           int64
body_style         object
drive_wheels       object
engine_location    object
engine_type        object
num_cylinders       int64
fuel_system        object
dtype: object

- One trick you can use in pandas is to convert a column to a __category__, then use those category values for your label encoding:

In [52]:
obj_df["body_style"] = obj_df["body_style"].astype('category')
obj_df.dtypes

make                 object
fuel_type            object
aspiration           object
num_doors             int64
body_style         category
drive_wheels         object
engine_location      object
engine_type          object
num_cylinders         int64
fuel_system          object
dtype: object

- Then assign the encoded variable to a new column using the __cat.codes__ accessor:

In [54]:
obj_df["body_style_cat"] = obj_df["body_style"].cat.codes
obj_df.head()

Unnamed: 0,make,fuel_type,aspiration,num_doors,body_style,drive_wheels,engine_location,engine_type,num_cylinders,fuel_system,body_style_cat
0,alfa-romero,gas,std,2,convertible,rwd,front,dohc,4,mpfi,0
1,alfa-romero,gas,std,2,convertible,rwd,front,dohc,4,mpfi,0
2,alfa-romero,gas,std,2,hatchback,rwd,front,ohcv,6,mpfi,2
3,audi,gas,std,4,sedan,fwd,front,ohc,4,mpfi,3
4,audi,gas,std,4,sedan,4wd,front,ohc,5,mpfi,3


In [56]:
obj_df['body_style_cat'].unique()

array([0, 2, 3, 4, 1], dtype=int8)
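
To see which label each code stands for, the codes can be paired with `cat.categories`. The sketch below uses a short, hypothetical sample of body styles; for string data, pandas orders the categories lexically, which is where the convertible -> 0 ... wagon -> 4 mapping comes from.

```python
import pandas as pd

# Hypothetical sample containing all five body styles.
body_style = pd.Series(
    ["convertible", "hatchback", "sedan", "wagon", "hardtop", "sedan"],
    dtype="category",
)

# cat.categories holds the labels; each label's position is its integer code.
code_to_label = dict(enumerate(body_style.cat.categories))
print(code_to_label)

# Decoding: look the codes up in the mapping to recover the original labels.
codes = body_style.cat.codes
decoded = codes.map(code_to_label)
print(decoded.tolist())
```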

#### Approach 3 - One Hot Encoding 

In [27]:
# Label encoding has the advantage that it is straightforward, but it has the disadvantage that
# the numeric values can be “misinterpreted” by the algorithms as having an order or magnitude.

# One-hot encoding avoids this by converting each category value into its own indicator column.
# Pandas supports this using get_dummies.
# The function is named this way because it creates dummy/indicator variables (aka 1 or 0).

In [60]:
s = pd.Series(list('APNPNNAAPP'))
s

0    A
1    P
2    N
3    P
4    N
5    N
6    A
7    A
8    P
9    P
dtype: object

In [62]:
pd.get_dummies(s)

Unnamed: 0,A,N,P
0,True,False,False
1,False,False,True
2,False,True,False
3,False,False,True
4,False,True,False
5,False,True,False
6,True,False,False
7,True,False,False
8,False,False,True
9,False,False,True


In [64]:
pd.get_dummies(pd.Series(list('abcde')))

Unnamed: 0,a,b,c,d,e
0,True,False,False,False,False
1,False,True,False,False,False
2,False,False,True,False,False
3,False,False,False,True,False
4,False,False,False,False,True


In [31]:
# look at the column drive_wheels where we have values of 4wd , fwd or rwd . 
# By using get_dummies we can convert this to 3 columns with a 1 or 0 
# corresponding to the correct value

In [66]:
obj_df.drive_wheels.value_counts()

drive_wheels
fwd    120
rwd     76
4wd      9
Name: count, dtype: int64

In [68]:
pd.get_dummies(obj_df, columns=["drive_wheels"]).head()

Unnamed: 0,make,fuel_type,aspiration,num_doors,body_style,engine_location,engine_type,num_cylinders,fuel_system,body_style_cat,drive_wheels_4wd,drive_wheels_fwd,drive_wheels_rwd
0,alfa-romero,gas,std,2,convertible,front,dohc,4,mpfi,0,False,False,True
1,alfa-romero,gas,std,2,convertible,front,dohc,4,mpfi,0,False,False,True
2,alfa-romero,gas,std,2,hatchback,front,ohcv,6,mpfi,2,False,False,True
3,audi,gas,std,4,sedan,front,ohc,4,mpfi,3,False,True,False
4,audi,gas,std,4,sedan,front,ohc,5,mpfi,3,True,False,False


In [26]:
pd.get_dummies(obj_df, columns=["body_style", "drive_wheels"], prefix=["body", "drive"]).head()

Unnamed: 0,make,fuel_type,aspiration,num_doors,engine_location,engine_type,num_cylinders,fuel_system,body_style_cat,body_convertible,body_hardtop,body_hatchback,body_sedan,body_wagon,drive_4wd,drive_fwd,drive_rwd
0,alfa-romero,gas,std,2,front,dohc,4,mpfi,0,1,0,0,0,0,0,0,1
1,alfa-romero,gas,std,2,front,dohc,4,mpfi,0,1,0,0,0,0,0,0,1
2,alfa-romero,gas,std,2,front,ohcv,6,mpfi,2,0,0,1,0,0,0,0,1
3,audi,gas,std,4,front,ohc,4,mpfi,3,0,0,0,1,0,0,1,0
4,audi,gas,std,4,front,ohc,5,mpfi,3,0,0,0,1,0,1,0,0
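
Some of the outputs above show `True`/`False` and others show `1`/`0`, presumably because the cells were run with different pandas versions (recent versions return boolean columns by default). As a sketch on a hypothetical sample, `dtype=int` requests 0/1 columns explicitly, and `drop_first=True` removes one redundant column per feature.

```python
import pandas as pd

# Hypothetical sample of the three drive_wheels values.
df_demo = pd.DataFrame({"drive_wheels": ["rwd", "fwd", "4wd", "fwd"]})

# dtype=int requests 0/1 columns regardless of the pandas version's default.
dummies = pd.get_dummies(df_demo, columns=["drive_wheels"], prefix="drive", dtype=int)
print(dummies)

# drop_first=True drops the first level (here 4wd), which is implied
# whenever the remaining indicator columns are all 0.
reduced = pd.get_dummies(df_demo, columns=["drive_wheels"], prefix="drive",
                         dtype=int, drop_first=True)
print(reduced)
```

Dropping one level can matter for linear models, where a full set of indicator columns is perfectly collinear; for tree-based models it is usually unnecessary.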
