
# Exercise 5: Categorical variables

The goal of this exercise is to learn how to handle categorical variables with the Ordinal Encoder, the Label Encoder and the One Hot Encoder.

Preliminary:

- Load the `breast-cancer.csv` file
- Drop the `Class` column
- Drop the NaN values
- Split the data into a train set and a test set (test set size = 20% of the total size) with `random_state=43`.
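The preliminary steps, together with the unique-value count asked for in the first question, can be sketched on a small stand-in table. The values below are made up for illustration, and the names `toy`, `clean` and `unique_counts` are hypothetical; only the column names and the split parameters come from the exercise.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for breast-cancer.csv (made-up values, for illustration only).
toy = pd.DataFrame({
    "Class": ["a", "b", "a", "b", "a", "b", "a", "b", "a", "b", "a"],
    "age": ["lt40", "ge40", "premeno", "ge40", np.nan, "premeno",
            "lt40", "ge40", "premeno", "ge40", "lt40"],
    "inv-nodes": ["yes", "no", "no", "yes", "no", "no",
                  "yes", "no", "yes", "no", "no"],
})

# Drop the `Class` column, then every row that still holds a NaN.
clean = toy.drop(columns=["Class"]).dropna()

# 80% / 20% split with the random state required by the exercise.
train, test = train_test_split(clean, test_size=0.20, random_state=43)

# Number of unique values per feature, computed on the train set only.
unique_counts = train.nunique()
print(unique_counts)
```

Counting on the train set only keeps the test set untouched until the encoders have been fitted.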

1. Count the number of unique values per feature in the train set
2. Identify the ordinal variables, the nominal variables and the target. Create one One Hot Encoder for all the nominal features (not the ordinal ones). Here are the assumptions made on the variables:

```console
age: Ordinal
['ge40' > 'premeno' > 'lt40']

menopause: Ordinal
['50-54' > '45-49' > '40-44' > '35-39' > '30-34' > '25-29' > '20-24' > '15-19' > '10-14' > '5-9' > '0-4']

tumor-size: Ordinal
['15-17' > '12-14' > '9-11' > '6-8' > '3-5' > '0-2']

inv-nodes: One Hot 
['yes' 'no']

node-caps: Ordinal
[3 > 2 > 1]

deg-malig: One Hot 
['left' 'right']

breast: One Hot 
['right_low' 'left_low' 'left_up' 'central' 'right_up']

breast-quad: One Hot 
['yes' 'no']

irradiat: One Hot 
['recurrence-events' 'no-recurrence-events']
```

- Fit on the train set

- Transform the test set

Example of expected output:

```console
# One Hot encoder on: ['inv-nodes', 'deg-malig', 'breast', 'breast-quad', 'irradiat']

input: ohe.transform(df[ohe_cols]) 
output:
array([[0., 1., 0., ..., 0., 0., 1.],
    [1., 0., 0., ..., 0., 1., 0.],
    [1., 0., 1., ..., 0., 0., 1.],
    ...,
    [0., 1., 0., ..., 0., 1., 0.],
    [1., 0., 0., ..., 0., 1., 0.],
    [1., 0., 1., ..., 0., 1., 0.]])

input: ohe.get_feature_names(ohe_cols)
output: 
array(['inv-nodes_no', 'inv-nodes_yes', 'deg-malig_left',
       'deg-malig_right', 'breast_central', 'breast_left_low',
       'breast_left_up', 'breast_right_low', 'breast_right_up',
       'breast-quad_no', 'breast-quad_yes',
       'irradiat_no-recurrence-events', 'irradiat_recurrence-events'],
      dtype=object)

```

3. Create one Ordinal Encoder for all the ordinal features. The Scikit-learn documentation does not make it obvious how to do this on several columns at once. Here's a **hint**:

Consider this extract of the ordinal data set (only two of the columns are shown):

    |    | age     |   node-caps |
    |---:|:--------|------------:|
    |  0 | premeno |           3 |
    |  1 | ge40    |           1 |
    |  2 | ge40    |           2 |
    |  3 | premeno |           3 |
    |  4 | premeno |           2 |

The first step is to create a dictionary that maps each column position to its categories, listed from lowest to highest:

```console
dict_ = {0: ['lt40', 'premeno' , 'ge40'], 1:[1,2,3]}
```

Then to instantiate an `OrdinalEncoder`:

```console
oe = OrdinalEncoder(dict_)
```

Now that you have enough information:

- Fit on the train set
- Transform the test set
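The hint above relies on the encoder accepting a positional dictionary. Recent Scikit-learn releases, as far as I can tell, expect the orders to be passed through the `categories` keyword as a list of lists, one list per column and in column order. The sketch below uses that form on the five example rows; the `test` rows are made up.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

ordinal_cols = ["age", "node-caps"]

# The five example rows from the hint.
train = pd.DataFrame({
    "age": ["premeno", "ge40", "ge40", "premeno", "premeno"],
    "node-caps": [3, 1, 2, 3, 2],
})
# Made-up test rows, for illustration only.
test = pd.DataFrame({"age": ["lt40", "ge40"], "node-caps": [2, 3]})

# One list per column, ordered from lowest to highest.
categories = [["lt40", "premeno", "ge40"], [1, 2, 3]]

oe = OrdinalEncoder(categories=categories)
oe.fit(train[ordinal_cols])

encoded_train = oe.transform(train[ordinal_cols])
encoded_test = oe.transform(test[ordinal_cols])
print(encoded_test)
```

Each value is replaced by its position in the corresponding list, so `'lt40'` becomes `0.` and `'ge40'` becomes `2.`.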

4. Use a `make_column_transformer` to combine the two Encoders.

- Fit on the train set
- Transform the test set

*Hint: Check the first resource*
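A minimal sketch of combining both encoders with `make_column_transformer`, on made-up rows. `sparse_threshold=0` is an assumption added here to force a dense array, which is easier to inspect.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Made-up rows, for illustration only.
train = pd.DataFrame({
    "age": ["premeno", "ge40", "lt40", "premeno"],
    "node-caps": [3, 1, 2, 3],
    "breast": ["central", "left_up", "right_low", "left_up"],
})
test = pd.DataFrame({"age": ["ge40"], "node-caps": [2], "breast": ["central"]})

ohe_cols = ["breast"]
ordinal_cols = ["age", "node-caps"]

# Each (transformer, columns) pair is applied to its own columns;
# the outputs are concatenated in the order the pairs are listed.
ct = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ohe_cols),
    (OrdinalEncoder(categories=[["lt40", "premeno", "ge40"], [1, 2, 3]]), ordinal_cols),
    sparse_threshold=0,
)

ct.fit(train)                     # fit on the train set
encoded_test = ct.transform(test)  # transform the test set
print(encoded_test)
```

The first three output columns are the one-hot columns for `breast`, followed by the ordinal codes for `age` and `node-caps`.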

**Note: Version 0.22 of Scikit-learn can't handle `get_feature_names` on `OrdinalEncoder`. If the column transformer contains an `OrdinalEncoder`, the method raises this error**:

```console
AttributeError: Transformer ordinalencoder (type OrdinalEncoder) does not provide get_feature_names.
```

**This means that if you want to use the Ordinal Encoder, you will have to create a variable that contains the column names in the right order. This step is not required in this exercise.**

Resources:

- https://towardsdatascience.com/guide-to-encoding-categorical-features-using-scikit-learn-for-machine-learning-5048997a5c79

- https://machinelearningmastery.com/one-hot-encoding-for-categorical-data/


```python
import pandas as pd
from sklearn.model_selection import train_test_split

# 7. Attribute Information:
#    1. Class: no-recurrence-events, recurrence-events
#    2. age: 10-19, 20-29, 30-39, 40-49, 50-59, 60-69, 70-79, 80-89, 90-99.
#    3. menopause: lt40, ge40, premeno.
#    4. tumor-size: 0-4, 5-9, 10-14, 15-19, 20-24, 25-29, 30-34, 35-39, 40-44,
#                   45-49, 50-54, 55-59.
#    5. inv-nodes: 0-2, 3-5, 6-8, 9-11, 12-14, 15-17, 18-20, 21-23, 24-26,
#                  27-29, 30-32, 33-35, 36-39.
#    6. node-caps: yes, no.
#    7. deg-malig: 1, 2, 3.
#    8. breast: left, right.
#    9. breast-quad: left-up, left-low, right-up, right-low, central.
#   10. irradiat: yes, no.
#
# 8. Missing Attribute Values: (denoted by "?")
#    Attribute #:  Number of instances with missing values:
#    6.             8
#    9.             1.
columns = ['Class', 'age', 'menopause', 'tumor-size', 'inv-nodes',
           'node-caps', 'deg-malig', 'breast', 'breast-quad', 'irradiat']

# Read with pandas so the categorical values stay as strings; "?" marks a
# missing value. Assumption: the file has no header row, so the column names
# are supplied here.
df = pd.read_csv("breast-cancer.csv", header=None, names=columns, na_values="?")

# Drop the rows that contain missing values.
df = df.dropna()

print(df)

# Assumption: `Class` is removed from the features and kept aside as the target.
X = df.drop(columns=['Class'])
y = df['Class']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=43)

# Number of unique values per feature in the train set.
print(X_train.nunique())
```