# 1. What is Data Transformation?

Definition:
Data transformation is the process of changing the format, structure, or values of data to make it suitable for analysis or modeling.

Goal:

Make data consistent and usable.

Prepare data for machine learning algorithms.

Improve accuracy and efficiency of analysis.

Think of it as reshaping or standardizing data so it’s easier for a model or analysis to understand.

# Key Types of Data Transformation

A. Encoding Categorical Data

B. Scaling / Normalization

C. Log Transformation


In [16]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

df = pd.read_csv('Toyota.csv')

# 1. Encoding Categorical Data

Purpose:

Most machine learning models need numerical inputs.

Convert categorical features like FuelType or Automatic into numbers.

Techniques:

Label Encoding – Assigns a unique number to each category

One-Hot Encoding – Creates binary columns for each category
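To make the difference between the two techniques concrete, here is a minimal sketch on a made-up `FuelType` column (the four values are illustrative, not taken from Toyota.csv). It uses pandas category codes as a stand-in for `LabelEncoder`; both number the categories in sorted order.

```python
import pandas as pd

# Hypothetical toy column, for illustration only
toy = pd.DataFrame({'FuelType': ['Diesel', 'Petrol', 'CNG', 'Petrol']})

# Label encoding: one integer per category (sorted order: CNG=0, Diesel=1, Petrol=2)
label_codes = toy['FuelType'].astype('category').cat.codes

# One-hot encoding: one binary column per category;
# drop_first=True drops the first category (CNG), which becomes the all-zeros baseline
onehot = pd.get_dummies(toy, columns=['FuelType'], drop_first=True)

print(list(label_codes))
print(list(onehot.columns))
```

Label encoding keeps a single column but implies an ordering between categories, while one-hot encoding avoids that at the cost of extra columns.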

In [20]:
# 1. Label Encoding

df_lencode = df.copy()

print(df_lencode.dtypes)

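# Columns stored as strings (object dtype) are treated as categorical here.
# Note: KM, HP and Doors also come out as object, which suggests they hold
# non-numeric entries; label-encoding them replaces the real values with codes.
cat_cols = df_lencode.select_dtypes(include=['object']).columns
print(cat_cols)

for col in cat_cols:
    le = LabelEncoder()  # fit a fresh encoder for each column
    df_lencode[col] = le.fit_transform(df_lencode[col])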

print(df_lencode.dtypes)
print(df_lencode.head())

# 2. One-Hot Encoding
df_onehot = df.copy()
print(df_onehot.dtypes)

cat_cols = df_onehot.select_dtypes(include=['object']).columns
print(cat_cols)

# drop_first=True drops one dummy column per feature, since it is implied by the others
df_onehot = pd.get_dummies(df_onehot, columns=cat_cols, drop_first=True)

print(df_onehot.dtypes)
print(df_onehot.head())


Unnamed: 0      int64
Price           int64
Age           float64
KM             object
FuelType       object
HP             object
MetColor      float64
Automatic       int64
CC              int64
Doors          object
Weight          int64
dtype: object
Index(['KM', 'FuelType', 'HP', 'Doors'], dtype='object')
Unnamed: 0      int64
Price           int64
Age           float64
KM              int64
FuelType        int64
HP              int64
MetColor      float64
Automatic       int64
CC              int64
Doors           int64
Weight          int64
dtype: object
   Unnamed: 0  Price   Age   KM  FuelType  HP  MetColor  Automatic    CC  \
0           0  13500  23.0  560         1   9       1.0          0  2000   
1           1  13750  23.0  958         1   9       1.0          0  2000   
2           2  13950  24.0  495         1   9       NaN          0  2000   
3           3  14950  26.0  579         1   9       0.0          0  2000   
4           4  13750  30.0  453         1   9      
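The dtypes above list KM, HP and Doors as `object`, so they were label-encoded along with FuelType. If the cause is stray placeholder strings in otherwise numeric columns (an assumption; the raw values are not shown here), one option is to convert them back to numbers before encoding. A minimal sketch, with hypothetical placeholder values `'??'` and `'three'`:

```python
import pandas as pd

# Hypothetical raw values; '??' and 'three' are assumed placeholders,
# not confirmed contents of Toyota.csv
raw = pd.DataFrame({
    'KM': ['46986', '??', '41711'],
    'Doors': ['three', '3', '5'],
})

clean = raw.copy()
# errors='coerce' turns anything that cannot be parsed into NaN
clean['KM'] = pd.to_numeric(clean['KM'], errors='coerce')
# Map spelled-out numbers to digits first, then convert
clean['Doors'] = pd.to_numeric(clean['Doors'].replace({'three': '3'}), errors='coerce')

print(clean.dtypes)
print(clean)
```

After this step only genuinely categorical columns such as FuelType would remain as `object`, and the resulting NaN values can be handled by whatever missing-value strategy the analysis uses.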

# 2. Scaling / Normalization

Purpose:

Bring all numerical features to a similar scale.

Common methods:

Min-Max Scaling – Rescales each feature to the range 0 to 1

Standardization (Z-score) – Rescales each feature to mean 0 and standard deviation 1
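Both methods are simple formulas: min-max scaling computes (x - min) / (max - min), and standardization computes (x - mean) / std. The sketch below applies them by hand with NumPy to a few made-up weight values (hypothetical, not read from the dataset), which can help when checking what the scikit-learn scalers return.

```python
import numpy as np

# Hypothetical feature values, for illustration only
weight = np.array([1165.0, 1165.0, 1170.0, 1180.0, 1200.0])

# Min-max scaling: smallest value maps to 0, largest to 1
minmax = (weight - weight.min()) / (weight.max() - weight.min())

# Z-score standardization: np.std defaults to the population standard
# deviation (ddof=0), which is also what StandardScaler uses
zscore = (weight - weight.mean()) / weight.std()

print(minmax)
print(zscore)
```

Note that pandas' `Series.std()` defaults to the sample standard deviation (ddof=1), so a hand calculation done with pandas will differ slightly from `StandardScaler` unless `ddof=0` is passed.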

In [21]:
# 1. Min-Max Normalization
print(df_lencode.head())
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
# fit_transform returns a NumPy array, so rebuild a DataFrame with the same columns
df_scaled = scaler.fit_transform(df_lencode)
df_scaled = pd.DataFrame(df_scaled, columns=df_lencode.columns)
print(df_scaled.head())

# 2. Z-Score Standardization
print(df_lencode.head())
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Note: this overwrites the min-max result stored in df_scaled above
df_scaled = scaler.fit_transform(df_lencode)
df_scaled = pd.DataFrame(df_scaled, columns=df_lencode.columns)
print(df_scaled.head())


   Unnamed: 0  Price   Age   KM  FuelType  HP  MetColor  Automatic    CC  \
0           0  13500  23.0  560         1   9       1.0          0  2000   
1           1  13750  23.0  958         1   9       1.0          0  2000   
2           2  13950  24.0  495         1   9       NaN          0  2000   
3           3  14950  26.0  579         1   9       0.0          0  2000   
4           4  13750  30.0  453         1   9       0.0          0  2000   

   Doors  Weight  
0      6    1165  
1      1    1165  
2      1    1165  
3      1    1165  
4      1    1170  
   Unnamed: 0     Price       Age        KM  FuelType    HP  MetColor  \
0    0.000000  0.325044  0.278481  0.446215  0.333333  0.75       1.0   
1    0.000697  0.333925  0.278481  0.763347  0.333333  0.75       1.0   
2    0.001394  0.341030  0.291139  0.394422  0.333333  0.75       NaN   
3    0.002091  0.376554  0.316456  0.461355  0.333333  0.75       0.0   
4    0.002787  0.333925  0.367089  0.360956  0.333333  0.75     

# 3. Log Transformation
Purpose:

Reduce right skew by compressing large values more than small ones.

Useful for features like Price or KM, which may have long-tailed distributions.
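The effect on skewness can be checked directly. The sketch below uses a made-up series in which each value is ten times the previous one (hypothetical data, chosen so the result is easy to reason about) and compares the skewness before and after taking the log.

```python
import numpy as np
import pandas as pd

# Hypothetical long-tailed values: each one is 10x the previous
km = pd.Series([1, 10, 100, 1000, 10000], dtype=float)
km_log = np.log(km)

raw_skew = km.skew()      # strongly positive: the largest value dominates
log_skew = km_log.skew()  # approximately 0: the logged values are evenly spaced

print(raw_skew, log_skew)
```

Real data will rarely become perfectly symmetric, but a clear drop in skewness is a reasonable sign that the transformation is helping.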

In [24]:
import numpy as np

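# Small illustrative DataFrame (note: this overwrites the Toyota df loaded earlier)
df = pd.DataFrame({'Price':[15000, 20000, 16000, 22000, 21000]})
# np.log is the natural logarithm; it requires strictly positive values
df['Price_log'] = np.log(df['Price'])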
print(df)


   Price  Price_log
0  15000   9.615805
1  20000   9.903488
2  16000   9.680344
3  22000   9.998798
4  21000   9.952278
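Two practical details are worth a quick sketch: the transformation can be undone with `np.exp`, and a column containing zeros needs `np.log1p` (which computes log(1 + x)) instead of `np.log`. The zero-containing array below is hypothetical and only there to show the behaviour.

```python
import numpy as np

prices = np.array([15000, 20000, 16000, 22000, 21000], dtype=float)
logged = np.log(prices)
restored = np.exp(logged)  # np.exp is the inverse of np.log

# Hypothetical values including a zero, where np.log would give -inf
with_zero = np.array([0.0, 10.0, 1000.0])
safe = np.log1p(with_zero)  # log(1 + x), defined at x = 0
back = np.expm1(safe)       # np.expm1 is the inverse of np.log1p

print(restored)
print(safe)
```

Being able to invert the transformation matters when a model is trained on a logged target and its predictions need to be reported in the original units.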
