# 03 – Encoding Categorical Columns

## Step 0: **Importing libraries**

In [17]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.compose import ColumnTransformer

## Step 1: **Loading the data**

In [18]:
df = pd.read_csv("mini_shop_sales.csv")

## Step 2: **Data overview**

In [19]:
print(df.shape)

(1002, 10)


In [20]:
df.head()

| | order_id | customer_segment | city | membership_level | satisfaction | items_count | revenue | delivery_time_days | temperature_c | returned |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | B | Vilnius | Gold | High | 4 | 11.711634 | 2.064192 | 23.850472 | 0 |
| 1 | 2 | A | Kaunas | Bronze | Medium | 2 | 15.937593 | 0.442284 | 13.346728 | 0 |
| 2 | 3 | C | Vilnius | Gold | Low | 2 | 8.898782 | 5.059285 | 28.390551 | 1 |
| 3 | 4 | B | Vilnius | Silver | High | 2 | 18.671088 | 2.212487 | 12.715129 | 0 |
| 4 | 5 | A | Kaunas | Bronze | Low | 3 | 38.979435 | 2.208552 | 28.285627 | 0 |


## Step 3: **Data preparation**


## 1. One-Hot Encoding (OHE)

**When is it a good fit?**  
- The column has from a few to a few dozen categories (that is, not too many).  
- The values have no natural order (nominal variables).

**Pros:** simple and easy to interpret.  
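
Whether a column has "not too many" categories can be checked by counting its unique values. The sketch below does this on a small made-up frame; the `toy` data and the `MAX_OHE_CATEGORIES` cut-off are illustrative assumptions, not values taken from `mini_shop_sales.csv`.

```python
import pandas as pd

# Toy data standing in for the real file; the values are made up for illustration.
toy = pd.DataFrame({
    "city": ["Vilnius", "Kaunas", "Vilnius", "Klaipėda"],
    "customer_segment": ["B", "A", "C", "B"],
    "revenue": [11.7, 15.9, 8.9, 18.7],
})

# Count distinct values in every non-numeric column.
cardinality = toy.select_dtypes(exclude="number").nunique()
print(cardinality)

# Hypothetical cut-off: columns with at most this many categories
# are treated as candidates for one-hot encoding.
MAX_OHE_CATEGORIES = 30
ohe_candidates = cardinality[cardinality <= MAX_OHE_CATEGORIES].index.tolist()
print(ohe_candidates)
```

The right threshold depends on the model and the dataset size, so the number above should be read as a placeholder.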

In [21]:
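# Alternative: one-hot encode every text column at once
# df_band = pd.get_dummies(df, dtype="int8")
# df_band.head()

# Replace a single column with its one-hot (dummy) columns
stulpelis = "city"  # column to encode
dummies = pd.get_dummies(df[stulpelis], prefix=stulpelis, dtype="int8")
df = df.drop(columns=[stulpelis]).join(dummies)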


In [22]:
df.head()

| | order_id | customer_segment | membership_level | satisfaction | items_count | revenue | delivery_time_days | temperature_c | returned | city_Kaunas | city_Klaipėda | city_Panevėžys | city_Vilnius | city_Šiauliai |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | B | Gold | High | 4 | 11.711634 | 2.064192 | 23.850472 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 2 | A | Bronze | Medium | 2 | 15.937593 | 0.442284 | 13.346728 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 3 | C | Gold | Low | 2 | 8.898782 | 5.059285 | 28.390551 | 1 | 0 | 0 | 0 | 1 | 0 |
| 3 | 4 | B | Silver | High | 2 | 18.671088 | 2.212487 | 12.715129 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 5 | A | Bronze | Low | 3 | 38.979435 | 2.208552 | 28.285627 | 0 | 1 | 0 | 0 | 0 | 0 |


In [23]:
# Repeat the same steps for the customer segment column
stulpelis = "customer_segment"
dummies = pd.get_dummies(df[stulpelis], prefix=stulpelis, dtype="int8")
df = df.drop(columns=[stulpelis]).join(dummies)

In [24]:
df.head()

| | order_id | membership_level | satisfaction | items_count | revenue | delivery_time_days | temperature_c | returned | city_Kaunas | city_Klaipėda | city_Panevėžys | city_Vilnius | city_Šiauliai | customer_segment_A | customer_segment_B | customer_segment_C |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Gold | High | 4 | 11.711634 | 2.064192 | 23.850472 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 1 | 2 | Bronze | Medium | 2 | 15.937593 | 0.442284 | 13.346728 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 3 | Gold | Low | 2 | 8.898782 | 5.059285 | 28.390551 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 3 | 4 | Silver | High | 2 | 18.671088 | 2.212487 | 12.715129 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 4 | 5 | Bronze | Low | 3 | 38.979435 | 2.208552 | 28.285627 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |



## 2. Ordinal Encoding (OE)

**When is it a good fit?**  
- When the categories have a **natural order** (e.g., `Bronze < Silver < Gold < Platinum` or `Very Low < Low < Medium < High < Very High`).

**Pros:** preserves the order.  
**Cons:** not suitable for nominal variables, because the model would treat the integer codes as ordered and evenly spaced ("linear"), which is meaningless for unordered categories.


In [25]:
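# Categories listed from lowest to highest; the list position becomes the code
tvarka = [["Very Low", "Low", "Medium", "High", "Very High"]]

kodavimas = OrdinalEncoder(
    categories=tvarka,
    handle_unknown="use_encoded_value",  # values not listed in `tvarka` ...
    unknown_value=-1,                    # ... are encoded as -1
)
# Replace missing values with a placeholder ("Nežinoma" = "Unknown"), which is then encoded as -1
satisfaction_1 = df["satisfaction"].fillna("Nežinoma")
df["satisfaction_order"] = kodavimas.fit_transform(satisfaction_1.to_frame())

df[["satisfaction", "satisfaction_order"]].head()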

| | satisfaction | satisfaction_order |
|---|---|---|
| 0 | High | 3.0 |
| 1 | Medium | 2.0 |
| 2 | Low | 1.0 |
| 3 | High | 3.0 |
| 4 | Low | 1.0 |


## How to choose a method?

- **Nominal variables with a small to medium number of unique values** → *One-Hot Encoding*.
- **Ordered (ordinal) variables** → *Ordinal Encoder* (specify the order explicitly).