# Encoding Categorical Infrastructure Data

## Context
Most machine learning models require numerical input, but SRE and observability datasets are full of categorical text fields. For example:
- `Server_Role`: 'Web', 'Database', 'Cache'
- `Log_Level`: 'INFO', 'WARN', 'ERROR', 'FATAL'
- `Region`: 'us-east-1', 'eu-west-1', 'ap-south-1'

We must transform these strings into numbers so algorithms can process them. This transformation is called **Encoding**.

## Objectives
- Load a synthetic dataset containing categorical fields (Server Role, Log Level).
- Implement **Ordinal / Label Encoding** for data that has a strict order (e.g., Log Levels).
- Implement **One-Hot Encoding** for nominal data without intrinsic order (e.g., Server Roles).

In [None]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder


### 1. Example Categorical Dataset
We will synthesize a small table representing log events from different servers.

In [None]:
data = {
    'Server_ID': ['srv-01', 'srv-02', 'srv-03', 'srv-04', 'srv-05', 'srv-06'],
    'Server_Role': ['Web', 'Database', 'Cache', 'Web', 'Database', 'Cache'],
    'Log_Level': ['INFO', 'WARN', 'ERROR', 'FATAL', 'INFO', 'WARN']
}

df = pd.DataFrame(data)
df

### 2. Ordinal Encoding / Label Encoding

**Ordinal Encoding** assigns each category an integer that reflects its rank. Use it for data that has a **meaningful order or hierarchy**.

For example, in `Log_Level`:
`INFO < WARN < ERROR < FATAL`
We can encode these as `0, 1, 2, 3` so the model understands that `FATAL` is mathematically "higher" (worse) than `INFO`.

In [None]:
# Define the strict order we want to enforce
log_level_order = ['INFO', 'WARN', 'ERROR', 'FATAL']

# Initialize OrdinalEncoder with our specific order
ordinal_encoder = OrdinalEncoder(categories=[log_level_order])

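# Fit and transform. fit_transform returns a 2D float array of shape (n_rows, 1),
# so flatten it and cast to int before storing it as a single column.
df['Log_Level_Encoded'] = (
    ordinal_encoder.fit_transform(df[['Log_Level']]).ravel().astype(int)
)
df

# Note: Scikit-learn also has `LabelEncoder`, but it is intended for encoding the
# target variable (y), not the feature matrix (X). It assigns integers in sorted
# (alphabetical) order, which would not match the severity order above.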
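
The alphabetical ordering mentioned in the note above can be checked directly. The following is a minimal sketch that assumes the same four log levels; the `levels` frame and the `*_demo` and `*_mapping` names are illustrative and not part of the original notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Illustrative frame holding each log level once
levels = pd.DataFrame({'Log_Level': ['INFO', 'WARN', 'ERROR', 'FATAL']})

# LabelEncoder sorts the labels, so the codes follow alphabetical order
label_encoder_demo = LabelEncoder()
label_encoder_demo.fit(levels['Log_Level'])
label_mapping = {str(c): i for i, c in enumerate(label_encoder_demo.classes_)}
print(label_mapping)

# OrdinalEncoder with an explicit category list keeps the severity order
ordinal_encoder_demo = OrdinalEncoder(categories=[['INFO', 'WARN', 'ERROR', 'FATAL']])
ordinal_codes = ordinal_encoder_demo.fit_transform(levels[['Log_Level']]).ravel().astype(int)
ordinal_mapping = dict(zip(levels['Log_Level'], ordinal_codes.tolist()))
print(ordinal_mapping)
```

With `LabelEncoder`, `ERROR` receives the lowest code and `WARN` the highest, so the integers no longer track severity.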

### 3. One-Hot Encoding

If we label-encoded `Server_Role` as `Web=0, Database=1, Cache=2`, a model could treat `Cache` as numerically "greater than" `Web`. That ordering is an artifact of the encoding; the roles are simply distinct, unranked categories.

For unranked (nominal) categorical variables, we use **One-Hot Encoding**. This creates a new binary column (`1` or `0`) for *every single category*.

In [None]:
# Initialize OneHotEncoder
# sparse_output=False returns a dense NumPy array instead of a SciPy sparse matrix
# (the parameter was named `sparse` before scikit-learn 1.2)
onehot_encoder = OneHotEncoder(sparse_output=False)

# Fit and transform the Server_Role column
role_encoded = onehot_encoder.fit_transform(df[['Server_Role']])

# See the categories it found
print("Categories found by OneHotEncoder:", onehot_encoder.categories_[0])

# Create a DataFrame from the encoded matrix with proper column names.
# Reusing df.index keeps the rows aligned when the frames are concatenated.
role_encoded_df = pd.DataFrame(
    role_encoded,
    columns=[f"Role_{cat}" for cat in onehot_encoder.categories_[0]],
    index=df.index,
)

# Concatenate the new one-hot columns onto the original DataFrame,
# then drop the original string column
df_final = pd.concat([df, role_encoded_df], axis=1)
df_final = df_final.drop(columns='Server_Role')

df_final

### Pandas `get_dummies()` vs Scikit-Learn `OneHotEncoder`
Pandas has a built-in function, `pd.get_dummies()`, that produces the same kind of one-hot columns as `OneHotEncoder` with less code.

**Why use Scikit-Learn?**
If you are training models for production, a fitted `OneHotEncoder` can be saved (pickled) and re-applied to future data, which guarantees the same columns in the same order. `pd.get_dummies()` only sees the categories present in the frame it is given, so it generates fewer columns if the test set is missing a category that the training set had, which breaks your model's input shape.

In [None]:
# Example of Pandas get_dummies (Simpler but less robust for prod pipelines)
pd.get_dummies(df, columns=['Server_Role'], dtype=int)

### Summary

- **Ordinal Encoding:** Use when categories possess a ranked order (`INFO < WARN < ERROR < FATAL`).
- **One-Hot Encoding:** Use when categories are nominally distinct without order (`Web`, `Database`, `Cache`).