# Convert Categorical Variables

In [19]:
'''
HOW TO encode categorical features as a one-hot numeric array.
This creates a binary column for each category (or for k-1 of the k categories).
pd.get_dummies returns a DataFrame; sklearn's OneHotEncoder returns a sparse matrix by default.
'''
import pandas as pd
df = pd.read_csv("data_example.csv", sep = ",")
df

Unnamed: 0,type_of_food,some_other_attribute
0,fruit,0.3
1,vegetable,0.4
2,fruit,0.2
3,meat,0.3
4,fruit,0.6
5,vegetable,0.7
6,fruit,0.3
7,meat,0.4
8,vegetable,0.2
9,vegetable,0.3


In [20]:
# Convert categorical variable into dummy/indicator variables
dummy = pd.get_dummies(df['type_of_food'], drop_first=True)
# drop_first=True --> get k-1 dummies out of k categorical levels by removing the first level.

In [21]:
dummy

Unnamed: 0,meat,vegetable
0,0,0
1,0,1
2,0,0
3,1,0
4,0,0
5,0,1
6,0,0
7,1,0
8,0,1
9,0,1


- `pd.get_dummies(df['type_of_food'])`:

  - This function takes the type_of_food column and converts it into dummy (or one-hot encoded) variables.
  - Each unique category in the type_of_food column will be represented by a new binary column where:
       - 1 indicates the presence of that category.
       - 0 indicates the absence of that category.

- `drop_first=True`:

  - This argument is used to avoid the dummy variable trap by dropping the first category.
  - With k categories, one-hot encoding generates k columns by default. In models with an intercept (linear regression, for example), this causes perfect multicollinearity, because any one category can be inferred from the others. For example, if you know that something is neither "fruit" nor "vegetable", you automatically know that it is "meat".
  - Setting `drop_first=True` reduces the number of dummy variables from k to k-1, which removes this redundancy. The dropped category (here `fruit`, the first in sorted order) becomes the baseline and is encoded as all zeros.
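As a rough illustration of the k versus k-1 point above, the sketch below compares both encodings on a toy column. The series values are made up for the example, and `dtype=int` is passed only so the output shows 0/1 (recent pandas versions return boolean columns by default).

```python
import pandas as pd

# Toy column mirroring the example data (values assumed for illustration).
food = pd.Series(["fruit", "vegetable", "fruit", "meat"], name="type_of_food")

full = pd.get_dummies(food, dtype=int)                      # k = 3 columns
reduced = pd.get_dummies(food, drop_first=True, dtype=int)  # k - 1 = 2 columns

print(list(full.columns))
print(list(reduced.columns))

# Every row of the full encoding sums to 1, so any one column is
# fully determined by the others (and by an intercept term).
print(full.sum(axis=1).tolist())

# The dropped category ("fruit") is the row where all remaining dummies are 0.
is_fruit = reduced.sum(axis=1) == 0
print(is_fruit.tolist())
```

Nothing is lost by dropping the first level: the `fruit` rows can still be recovered from the two remaining columns.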

In [22]:
new_df = pd.concat([df, dummy], axis=1)  # axis=1: concatenate along columns

In [23]:
new_df

Unnamed: 0,type_of_food,some_other_attribute,meat,vegetable
0,fruit,0.3,0,0
1,vegetable,0.4,0,1
2,fruit,0.2,0,0
3,meat,0.3,1,0
4,fruit,0.6,0,0
5,vegetable,0.7,0,1
6,fruit,0.3,0,0
7,meat,0.4,1,0
8,vegetable,0.2,0,1
9,vegetable,0.3,0,1


- `pd.concat()`: This function is used to concatenate two or more DataFrames along a particular axis (rows or columns). In this case, you're concatenating along the columns `(axis=1)`, meaning you're adding the dummy variables to the original DataFrame.

- `[df, dummy]`:

  - `df` is your original DataFrame, which contains the categorical `type_of_food` column and possibly other columns like `some_other_attribute`.
  - `dummy` is the DataFrame created by `pd.get_dummies()`, which contains the new binary columns representing the categories from `type_of_food`.

- `axis=1`:

  - This argument tells `pd.concat()` to concatenate along the columns.
  - So, the new DataFrame `new_df` will have all the original columns from `df` plus the one-hot encoded dummy columns from `dummy`.
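One detail worth keeping in mind: with `axis=1`, `pd.concat()` pairs rows by index label, not by position. In the notebook `dummy` is derived from `df`, so the indexes already match. The sketch below uses two made-up frames (`left` and `right` are hypothetical names) to show what can happen when they do not.

```python
import pandas as pd

left = pd.DataFrame({"value": [0.3, 0.4, 0.2]}, index=[0, 1, 2])
# Same number of rows, but a different index (e.g. after filtering or sorting).
right = pd.DataFrame({"meat": [0, 1, 0]}, index=[1, 2, 3])

# Rows are matched on index labels, so the result covers the union
# of both indexes and contains NaN where a label is missing.
misaligned = pd.concat([left, right], axis=1)
print(misaligned.shape)

# Resetting the index on both sides restores positional pairing.
aligned = pd.concat(
    [left.reset_index(drop=True), right.reset_index(drop=True)], axis=1
)
print(aligned.shape)
```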

In [24]:
new_df.drop("type_of_food", axis=1)  # returns a new DataFrame; new_df itself is unchanged

Unnamed: 0,some_other_attribute,meat,vegetable
0,0.3,0,0
1,0.4,0,1
2,0.2,0,0
3,0.3,1,0
4,0.6,0,0
5,0.7,0,1
6,0.3,0,0
7,0.4,1,0
8,0.2,0,1
9,0.3,0,1
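
The encode, concatenate, and drop steps above can also be collapsed into a single call by passing the whole DataFrame to `pd.get_dummies()` with the `columns` argument. The sketch below uses a small stand-in for `data_example.csv` (values assumed) and compares the two routes. Note that the one-step route prefixes the new columns with the original column name.

```python
import pandas as pd

# Small stand-in for data_example.csv (values assumed for illustration).
df = pd.DataFrame({
    "type_of_food": ["fruit", "vegetable", "meat"],
    "some_other_attribute": [0.3, 0.4, 0.3],
})

# Three-step route from the text: encode one column, concatenate, drop the original.
dummy = pd.get_dummies(df["type_of_food"], drop_first=True, dtype=int)
three_step = pd.concat([df, dummy], axis=1).drop("type_of_food", axis=1)

# One-step route: the original column is replaced by prefixed dummy
# columns such as "type_of_food_meat".
one_step = pd.get_dummies(df, columns=["type_of_food"], drop_first=True, dtype=int)

print(list(three_step.columns))
print(list(one_step.columns))
```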


## Scikit-learn OneHotEncoder

The `OneHotEncoder` from `sklearn.preprocessing` is another tool for converting categorical variables into a one-hot numeric array. It is better suited to machine learning pipelines than `pd.get_dummies()`: it integrates directly with Scikit-learn estimators, and because it learns the categories during `fit` and reuses them in `transform`, it encodes training and test datasets with the same columns.
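
To make the pipeline point concrete, here is a minimal sketch that places the encoder inside a `ColumnTransformer` and a `Pipeline`. The labels in `y_train`, the choice of `LogisticRegression`, and the test rows are all invented for illustration; they are not part of the original example.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

X_train = pd.DataFrame({
    "type_of_food": ["fruit", "vegetable", "meat", "fruit"],
    "some_other_attribute": [0.3, 0.4, 0.3, 0.6],
})
y_train = [0, 1, 1, 0]  # hypothetical labels, invented for this sketch

preprocess = ColumnTransformer(
    [("food", OneHotEncoder(), ["type_of_food"])],
    remainder="passthrough",  # keep the numeric column unchanged
)
model = Pipeline([("preprocess", preprocess), ("clf", LogisticRegression())])
model.fit(X_train, y_train)

# The categories learned during fit are reused here, so the test rows
# are encoded with the same columns as the training rows.
X_test = pd.DataFrame({
    "type_of_food": ["meat", "fruit"],
    "some_other_attribute": [0.5, 0.2],
})
predictions = model.predict(X_test)
print(predictions.shape)
```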

In [25]:
from sklearn.preprocessing import OneHotEncoder

In [26]:
encoder = OneHotEncoder(handle_unknown='ignore')


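`handle_unknown='ignore'`:
- This parameter tells the encoder not to raise an error when `transform` encounters a category that was not seen during `fit`. Such a category is encoded as a row of all zeros. The default, `handle_unknown='error'`, raises an error instead.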
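
A short sketch of this behaviour, assuming a hypothetical new category `"dairy"` that never appears in the training data:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"type_of_food": ["fruit", "vegetable", "meat"]})
# "dairy" is a hypothetical category that was not seen during fit.
new = pd.DataFrame({"type_of_food": ["meat", "dairy"]})

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# The known category gets its usual one-hot row; the unknown one is all zeros.
encoded = encoder.transform(new).toarray()
print(encoded)

# With the default handle_unknown="error", the same input is rejected.
strict = OneHotEncoder(handle_unknown="error").fit(train)
try:
    strict.transform(new)
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```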

In [27]:
df[["type_of_food"]]

Unnamed: 0,type_of_food
0,fruit
1,vegetable
2,fruit
3,meat
4,fruit
5,vegetable
6,fruit
7,meat
8,vegetable
9,vegetable


In [28]:
encoder.fit(df[["type_of_food"]]) # Fit the encoder on the 'type_of_food' column


OneHotEncoder(handle_unknown='ignore')

In [29]:
encoder.categories_

[array(['fruit', 'meat', 'vegetable'], dtype=object)]

After fitting the `OneHotEncoder`, the `encoder.categories_` attribute contains the list of categories for each feature that was one-hot encoded. This is useful to see which unique categories the encoder has identified and encoded.

In [30]:
dummy = encoder.transform(df[["type_of_food"]]).toarray() # Transform the column and convert the sparse result to a dense array

In [31]:
dummy

array([[1., 0., 0.],
       [0., 0., 1.],
       [1., 0., 0.],
       [0., 1., 0.],
       [1., 0., 0.],
       [0., 0., 1.],
       [1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 0., 1.]])

In [32]:
df_categorical = pd.DataFrame(dummy, columns=encoder.categories_[0])  # categories_ is a list with one array per feature

In [33]:
df_categorical

Unnamed: 0,fruit,meat,vegetable
0,1.0,0.0,0.0
1,0.0,0.0,1.0
2,1.0,0.0,0.0
3,0.0,1.0,0.0
4,1.0,0.0,0.0
5,0.0,0.0,1.0
6,1.0,0.0,0.0
7,0.0,1.0,0.0
8,0.0,0.0,1.0
9,0.0,0.0,1.0


In [34]:
encoded=[[0, 1, 0], [1, 0, 0]]

In [35]:
print(encoder.inverse_transform(encoded))


[['meat']
 ['fruit']]


`encoded` is a two-dimensional list (a matrix) representing one-hot encoded data for a categorical feature with three possible categories. Each inner list corresponds to one observation (row), and each element in the inner list corresponds to a category, in the order given by `encoder.categories_`: `[fruit, meat, vegetable]`.

Structure of the `encoded` list:
- First observation: `[0, 1, 0]`
     - This indicates that the first observation belongs to the second category, `meat`.
- Second observation: `[1, 0, 0]`
     - This indicates that the second observation belongs to the first category, `fruit`.
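
The mapping described above can be checked by hand. The sketch below decodes the same `encoded` rows twice: once by looking up the position of the 1 in `encoder.categories_[0]`, and once with `inverse_transform`. The toy DataFrame used for fitting is an assumption made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"type_of_food": ["fruit", "vegetable", "meat"]})
encoder = OneHotEncoder(handle_unknown="ignore").fit(df[["type_of_food"]])

encoded = np.array([[0, 1, 0], [1, 0, 0]])

# Manual decoding: the position of the 1 indexes into categories_[0].
categories = encoder.categories_[0]
manual = [categories[i] for i in encoded.argmax(axis=1)]
print(manual)

# The encoder's own decoding agrees with the manual lookup.
decoded = encoder.inverse_transform(encoded)
print(decoded.ravel().tolist())
```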

In [36]:
import pickle

with open("ohe.pkl", 'wb') as f:  # the with-block closes the file after writing
    pickle.dump(encoder, f)

In [37]:
with open("ohe.pkl", 'rb') as f:
    encoder = pickle.load(f)

In [38]:
print(encoder.inverse_transform(encoded))


[['meat']
 ['fruit']]
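
The save-and-reload round trip can be sketched end to end as follows. This version writes to a temporary directory rather than the working directory, which is a choice made for the example only. As a general caution, pickled encoders are best reloaded with the same scikit-learn version that created them, and pickle files should only be loaded from trusted sources.

```python
import pickle
import tempfile
from pathlib import Path

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"type_of_food": ["fruit", "vegetable", "meat"]})
encoder = OneHotEncoder(handle_unknown="ignore").fit(df[["type_of_food"]])

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "ohe.pkl"

    # Each with-block closes its file handle, even if dump or load raises.
    with open(path, "wb") as f:
        pickle.dump(encoder, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored encoder keeps the learned category order,
# so it produces the same encoding as the original.
same_categories = list(restored.categories_[0]) == list(encoder.categories_[0])
print(same_categories)
roundtrip = restored.transform(df[["type_of_food"]]).toarray()
print(roundtrip)
```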
