# Encoding Categorical Variables as Quantitative

We have already seen how to scale quantitative variables, but what if we have categorical variables?

To calculate distances between observations, we first need to convert categorical variables into quantitative ones.

There is a standard way to encode categorical variables as quantitative ones, known by two names that are used interchangeably here:
- Dummy Encoding
- One-hot Encoding

In [2]:
import pandas as pd

df = pd.read_table("https://datasci112.stanford.edu/data/housing.tsv")

features = ["Gr Liv Area", "House Style", "Bedroom AbvGr",
            "Full Bath", "Half Bath", "Neighborhood"]
df[features]

Unnamed: 0,Gr Liv Area,House Style,Bedroom AbvGr,Full Bath,Half Bath,Neighborhood
0,1656,1Story,3,1,0,NAmes
1,896,1Story,2,1,0,NAmes
2,1329,1Story,3,1,1,NAmes
3,2110,1Story,3,2,1,NAmes
4,1629,2Story,3,2,1,Gilbert
...,...,...,...,...,...,...
2925,1003,SLvl,3,1,0,Mitchel
2926,902,1Story,2,1,0,Mitchel
2927,970,SFoyer,3,1,0,Mitchel
2928,1389,1Story,2,1,0,Mitchel


## Dummy Encoding

In dummy encoding:
- Each class gets its own column.
- Each column consists of 0s and 1s. A 1 indicates that the observation was in that class.
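
As a minimal sketch of these two rules, the snippet below builds the dummy columns by hand for a made-up column of four house styles (the values and variable names are illustrative, not rows from the housing data). It then computes Euclidean distances between encoded rows, which is what the encoding makes possible.

```python
import numpy as np
import pandas as pd

# Made-up example values, not rows from the housing data.
styles = pd.Series(["1Story", "2Story", "1Story", "SLvl"], name="House Style")

# One column per class; a 1 marks the observations that belong to that class.
encoded = pd.DataFrame(
    {f"House Style_{c}": (styles == c).astype(int) for c in sorted(styles.unique())}
)
print(encoded)

# Distances are now well defined on the 0/1 columns.
rows = encoded.to_numpy()
same_style = np.linalg.norm(rows[0] - rows[2])  # both observations are 1Story
diff_style = np.linalg.norm(rows[0] - rows[1])  # 1Story vs. 2Story
print(same_style, diff_style)
```

Two rows with the same style are at distance 0, and two rows with different styles are at distance sqrt(2), whichever two styles they are.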

### Dummy encoding with pandas

Pandas provides a function that does dummy encoding automatically: `pd.get_dummies()`.

`get_dummies()` returns booleans by default. We can get integers instead by passing `dtype=int`.

In [6]:
pd.get_dummies(df[["House Style"]])

Unnamed: 0,House Style_1.5Fin,House Style_1.5Unf,House Style_1Story,House Style_2.5Fin,House Style_2.5Unf,House Style_2Story,House Style_SFoyer,House Style_SLvl
0,False,False,True,False,False,False,False,False
1,False,False,True,False,False,False,False,False
2,False,False,True,False,False,False,False,False
3,False,False,True,False,False,False,False,False
4,False,False,False,False,False,True,False,False
...,...,...,...,...,...,...,...,...
2925,False,False,False,False,False,False,False,True
2926,False,False,True,False,False,False,False,False
2927,False,False,False,False,False,False,True,False
2928,False,False,True,False,False,False,False,False
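
To see the effect of `dtype=int` next to the default, here is a small sketch on a made-up three-row frame (`toy` is an illustrative name, not part of the housing data):

```python
import pandas as pd

# Made-up example values for illustration.
toy = pd.DataFrame({"House Style": ["1Story", "2Story", "1Story"]})

default_dummies = pd.get_dummies(toy)         # default output dtype
int_dummies = pd.get_dummies(toy, dtype=int)  # 0/1 integers instead

print(default_dummies.dtypes)
print(int_dummies)
```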


`get_dummies()` can be used with multiple categorical variables at once.

In [10]:
pd.get_dummies(df[["House Style", "Neighborhood"]], dtype=int)

Unnamed: 0,House Style_1.5Fin,House Style_1.5Unf,House Style_1Story,House Style_2.5Fin,House Style_2.5Unf,House Style_2Story,House Style_SFoyer,House Style_SLvl,Neighborhood_Blmngtn,Neighborhood_Blueste,...,Neighborhood_NoRidge,Neighborhood_NridgHt,Neighborhood_OldTown,Neighborhood_SWISU,Neighborhood_Sawyer,Neighborhood_SawyerW,Neighborhood_Somerst,Neighborhood_StoneBr,Neighborhood_Timber,Neighborhood_Veenker
0,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2925,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
2926,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2927,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2928,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


If we pass quantitative variables to `pd.get_dummies()`, it leaves them unchanged.

In [11]:
pd.get_dummies(df[features])

Unnamed: 0,Gr Liv Area,Bedroom AbvGr,Full Bath,Half Bath,House Style_1.5Fin,House Style_1.5Unf,House Style_1Story,House Style_2.5Fin,House Style_2.5Unf,House Style_2Story,...,Neighborhood_NoRidge,Neighborhood_NridgHt,Neighborhood_OldTown,Neighborhood_SWISU,Neighborhood_Sawyer,Neighborhood_SawyerW,Neighborhood_Somerst,Neighborhood_StoneBr,Neighborhood_Timber,Neighborhood_Veenker
0,1656,3,1,0,False,False,True,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,896,2,1,0,False,False,True,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,1329,3,1,1,False,False,True,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,2110,3,2,1,False,False,True,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,1629,3,2,1,False,False,False,False,False,True,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2925,1003,3,1,0,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2926,902,2,1,0,False,False,True,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2927,970,3,1,0,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2928,1389,2,1,0,False,False,True,False,False,False,...,False,False,False,False,False,False,False,False,False,False


### Dummy encoding with Scikit-Learn

In Scikit-Learn, we can do dummy encoding using `OneHotEncoder`.

In [12]:
from sklearn.preprocessing import OneHotEncoder

# declare the encoder
enc = OneHotEncoder()

# fit the encoder to data
enc.fit(df[["House Style"]])

# transform the data
enc.transform(df[["House Style"]])

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 2930 stored elements and shape (2930, 8)>

**OneHotEncoder()** -> A scikit-learn class that converts categorical variables into numerical (0/1) columns so that ML models can use them.

**fit()** -> Learns the categories present in the data.

**transform()** -> Converts those categories into separate columns, producing a 0/1 matrix with one column per category.

When encoding categorical data (like with OneHotEncoder), the result often contains many zeros. To handle this efficiently, Scikit-Learn can store the output in two formats: sparse (memory-efficient) or dense (regular full array).

**Sparse Matrix:** Stores only non-zero values and their positions to save memory; efficient for data with many zeros.

**Dense Matrix:** Stores all values (including zeros) in full; easier to read but uses more memory.
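
To make the difference concrete, here is a small sketch that uses SciPy's `csr_matrix` (the Compressed Sparse Row format named in the output above) on a made-up one-hot matrix. The array values are illustrative.

```python
import numpy as np
from scipy import sparse

# A made-up one-hot matrix: 4 observations, 3 categories.
dense = np.array([
    [0.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
])

sparse_version = sparse.csr_matrix(dense)

print(dense.size)          # number of values the dense array stores
print(sparse_version.nnz)  # number of non-zero values the sparse matrix stores

# Converting back recovers the original array.
round_trip = sparse_version.toarray()
```

A one-hot matrix for a single variable has exactly one non-zero entry per row, so the sparse form stores 4 values here instead of 12.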

We can convert a sparse matrix into a regular (dense) matrix using `.todense()`. Alternatively, we can pass `sparse_output=False` when creating the encoder so that it returns a dense array directly.

In [13]:
enc = OneHotEncoder(sparse_output=False)

enc.fit(df[["House Style"]])

enc.transform(df[["House Style"]])

array([[0., 0., 1., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       [0., 0., 0., ..., 1., 0., 0.]], shape=(2930, 8))

### Mixed Variables in Scikit-Learn

If we have a mix of quantitative and categorical variables and only want to dummy encode the categorical ones, we can use a `ColumnTransformer`, created here with `make_column_transformer`.

In [16]:
from sklearn.compose import make_column_transformer

enc = make_column_transformer(
    (OneHotEncoder(), ["House Style", "Neighborhood"]),
    remainder="passthrough"
)

enc.fit(df[features])
enc.transform(df[features]).todense()

matrix([[0., 0., 1., ..., 3., 1., 0.],
        [0., 0., 1., ..., 2., 1., 0.],
        [0., 0., 1., ..., 3., 1., 1.],
        ...,
        [0., 0., 0., ..., 3., 1., 0.],
        [0., 0., 1., ..., 2., 1., 0.],
        [0., 0., 0., ..., 3., 2., 1.]], shape=(2930, 40))

Scikit-Learn provides a nice visualization of a `ColumnTransformer`:

In [17]:
enc

0,1,2
,transformers,"[('onehotencoder', ...)]"
,remainder,'passthrough'
,sparse_threshold,0.3
,n_jobs,
,transformer_weights,
,verbose,False
,verbose_feature_names_out,True
,force_int_remainder_cols,'deprecated'

0,1,2
,categories,'auto'
,drop,
,sparse_output,True
,dtype,<class 'numpy.float64'>
,handle_unknown,'error'
,min_frequency,
,max_categories,
,feature_name_combiner,'concat'


Scikit-Learn also allows us to mix scalers and encoders in a `ColumnTransformer`.

In [18]:
from sklearn.preprocessing import StandardScaler
from sklearn.compose import make_column_transformer

transformer = make_column_transformer(
    (OneHotEncoder(sparse_output=False), ["House Style", "Neighborhood"]),
    (StandardScaler(), ["Gr Liv Area"]),
    remainder="passthrough"
)

transformer.fit(df[features])
transformer.transform(df[features])

array([[0., 0., 1., ..., 3., 1., 0.],
       [0., 0., 1., ..., 2., 1., 0.],
       [0., 0., 1., ..., 3., 1., 1.],
       ...,
       [0., 0., 0., ..., 3., 1., 0.],
       [0., 0., 1., ..., 2., 1., 0.],
       [0., 0., 0., ..., 3., 2., 1.]], shape=(2930, 40))

In [19]:
transformer

0,1,2
,transformers,"[('onehotencoder', ...), ('standardscaler', ...)]"
,remainder,'passthrough'
,sparse_threshold,0.3
,n_jobs,
,transformer_weights,
,verbose,False
,verbose_feature_names_out,True
,force_int_remainder_cols,'deprecated'

0,1,2
,categories,'auto'
,drop,
,sparse_output,False
,dtype,<class 'numpy.float64'>
,handle_unknown,'error'
,min_frequency,
,max_categories,
,feature_name_combiner,'concat'

0,1,2
,copy,True
,with_mean,True
,with_std,True
