<a href="https://colab.research.google.com/github/jazu1412/LOW_CODE_AUTOML_AUTOGLUON/blob/master/Tabular%20classification%20and%20Regression/autogluon_feature_engineering_tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# AutoGluon Tabular - Feature Engineering Tutorial

## Introduction

Feature engineering is like preparing ingredients for a gourmet meal. Just as a chef carefully selects, cuts, and seasons ingredients to enhance the final dish, we process raw data to make it more palatable for machine learning models. This tutorial will guide you through the process using AutoGluon, a powerful automated machine learning library.

In [1]:
!pip install autogluon.tabular[all]

Collecting autogluon.tabular[all]
  Downloading autogluon.tabular-1.1.1-py3-none-any.whl.metadata (13 kB)
Collecting scipy<1.13,>=1.5.4 (from autogluon.tabular[all])
  Downloading scipy-1.12.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (60 kB)
Collecting autogluon.core==1.1.1 (from autogluon.tabular[all])
  Downloading autogluon.core-1.1.1-py3-none-any.whl.metadata (11 kB)
Collecting autogluon.features==1.1.1 (from autogluon.tabular[all])
  Downloading autogluon.features-1.1.1-py3-none-any.whl.metadata (11 kB)
Collecting xgboost<2.1,>=1.6 (from autogluon.tabular[all])
  Downloading xgboost-2.0.3-py3-none-manylinux2014_x86_64.whl.metadata (2.0 kB)
Collecting torch<2.4,>=2.2 (from autogluon.tabular[all])
  Downloading torch-2.3.1-cp310-cp310-manylinux1_x86_64.whl.metadata (26 kB)
Collecting lightgbm<4.4,>=3.3 (from autogluon.tabular[all])
  Down

In [2]:
from autogluon.tabular import TabularDataset, TabularPredictor
import pandas as pd
import numpy as np
import random
from sklearn.datasets import make_regression
from datetime import datetime

# Set a random seed for reproducibility
np.random.seed(42)
random.seed(42)

## Creating a Sample Dataset

Imagine we're building a dataset about a magical forest. Each tree has different attributes:

- A: The amount of sunlight it receives (floating-point number)
- B: Its age in years (integer)
- C: The date it was last watered (datetime)
- D: The type of soil it grows in (categorical)
- E: A description of its leaves (text)

In [3]:
# Create base features
x, y = make_regression(n_samples=100, n_features=5, n_targets=1, random_state=42)
dfx = pd.DataFrame(x, columns=['A', 'B', 'C', 'D', 'E'])
dfy = pd.DataFrame(y, columns=['magic_power'])  # Our target variable

# Customize features
dfx['A'] = (dfx['A'] + 10) * 100  # Sunlight (lux)
dfx['B'] = (dfx['B'] + 5).astype(int)  # Age (years)
dfx['C'] = pd.to_datetime('2023-01-01') + pd.to_timedelta(dfx['C'].astype(int), unit='D')  # Last watered
dfx['D'] = pd.cut(dfx['D'], bins=4, labels=['sandy', 'clay', 'loamy', 'peaty'])  # Soil type
dfx['E'] = pd.Series([' '.join(random.choices(['green', 'yellow', 'red', 'broad', 'narrow', 'long', 'short'], k=3)) for _ in range(100)])  # Leaf description

# Combine features and target
df = pd.concat([dfx, dfy], axis=1)

print(df.head())
print("\nData types:")
print(df.dtypes)

             A  B          C      D                    E  magic_power
0   906.217496  5 2023-01-01  peaty  narrow green yellow   271.316121
1  1108.895060  4 2023-01-01  loamy   yellow long narrow     6.230541
2   939.829339  3 2023-01-02  loamy      short green red    11.861024
3  1082.190250  5 2023-01-01  sandy   green yellow broad   -63.940576
4  1154.993441  5 2023-01-01  sandy  green yellow narrow    49.630085

Data types:
A                     float64
B                       int64
C              datetime64[ns]
D                    category
E                      object
magic_power           float64
dtype: object


## Basic Feature Engineering with AutoGluon

Now that we have our magical forest dataset, let's use AutoGluon to process it. This is like having a sous chef who knows exactly how to prepare each ingredient:

In [4]:
from autogluon.features.generators import AutoMLPipelineFeatureGenerator

auto_ml_pipeline_feature_generator = AutoMLPipelineFeatureGenerator()
processed_features = auto_ml_pipeline_feature_generator.fit_transform(X=dfx)

print("Processed features:")
print(processed_features.head())
print("\nProcessed feature types:")
print(processed_features.dtypes)

Processed features:
             A  B  D    E                    C  C.year  C.month  C.day  \
0   906.217496  5  3  NaN  1672531200000000000    2023        1      1   
1  1108.895060  4  2  NaN  1672531200000000000    2023        1      1   
2   939.829339  3  2  NaN  1672617600000000000    2023        1      2   
3  1082.190250  5  0  NaN  1672531200000000000    2023        1      1   
4  1154.993441  5  0    0  1672531200000000000    2023        1      1   

   C.dayofweek  E.char_count  E.symbol_ratio.   __nlp__.broad  __nlp__.green  \
0            6             7                 1              0              1   
1            6             6                 2              0              0   
2            0             3                 5              0              1   
3            6             6                 2              1              1   
4            6             7                 1              0              1   

   __nlp__.narrow  __nlp__.red  __nlp__.short  __nlp__

Let's break down what AutoGluon did:
1. It left the numeric columns (A and B) unchanged.
2. It converted the datetime column (C) into multiple features: raw value, year, month, day, and day of the week.
3. It encoded the categorical column (D) as integers.
4. It kept the text column (E) as a category and derived extra features from it: summary statistics such as `E.char_count`, plus `__nlp__.*` columns that count how often each word appears.
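
To make the datetime and categorical steps above more concrete, here is a minimal pandas-only sketch that builds the same kind of columns by hand. It is not AutoGluon's implementation: the helper name `expand_datetime` and the toy values are assumptions chosen for illustration.

```python
import pandas as pd


def expand_datetime(dates: pd.Series, name: str) -> pd.DataFrame:
    """Hypothetical helper: split a datetime column into integer features."""
    epoch = pd.Timestamp("1970-01-01")
    return pd.DataFrame({
        # Raw value as nanoseconds since the Unix epoch
        name: (dates - epoch) // pd.Timedelta(1, unit="ns"),
        f"{name}.year": dates.dt.year,
        f"{name}.month": dates.dt.month,
        f"{name}.day": dates.dt.day,
        # Monday is 0 and Sunday is 6
        f"{name}.dayofweek": dates.dt.dayofweek,
    })


dates = pd.Series(pd.to_datetime(["2023-01-01", "2023-01-02"]))
expanded = expand_datetime(dates, "C")
print(expanded)

# Integer codes follow the order in which the category labels are declared
soil = pd.Series(pd.Categorical(
    ["peaty", "loamy", "sandy"],
    categories=["sandy", "clay", "loamy", "peaty"],
))
soil_codes = soil.cat.codes
print(soil_codes.tolist())
```

The raw values and day-of-week numbers line up with the processed output shown earlier (2023-01-01 is a Sunday, hence 6), and `peaty`, `loamy`, and `sandy` map to 3, 2, and 0.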

## Training a Model with Processed Features

Now that our ingredients are prepared, let's cook up a model:

In [5]:
predictor = TabularPredictor(label='magic_power')
predictor.fit(df, hyperparameters={'GBM': {}}, feature_generator=auto_ml_pipeline_feature_generator)

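# Note: importance is computed on the training data here, so it is an optimistic
# estimate; pass held-out data for a more reliable ranking.
print("Feature importance:")
print(predictor.feature_importance(df))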

No path specified. Models will be saved in: "AutogluonModels/ag-20240916_052149"
Verbosity: 2 (Standard Logging)
AutoGluon Version:  1.1.1
Python Version:     3.10.12
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP PREEMPT_DYNAMIC Thu Jun 27 21:05:47 UTC 2024
CPU Count:          2
Memory Avail:       11.38 GB / 12.67 GB (89.8%)
Disk Space Avail:   65.97 GB / 107.72 GB (61.2%)
No presets specified! To achieve strong results with AutoGluon, it is recommended to use the available presets.
	Recommended Presets (For more details refer to https://auto.gluon.ai/stable/tutorials/tabular/tabular-essentials.html#presets):
	presets='best_quality'   : Maximize accuracy. Default time_limit=3600.
	presets='high_quality'   : Strong accuracy with fast inference speed. Default time_limit=3600.
	presets='good_quality'   : Good accuracy with very fast inference speed. Default time_limit=3600.
	presets='medium_quality' : Fast training time, ideal for initial prototyping.
Be

Feature importance:


	0.28s	= Actual runtime (Completed 5 of 5 shuffle sets)


   importance    stddev   p_value  n   p99_high    p99_low
B   31.261425  6.803524  0.000253  5  45.269975  17.252874
D   29.652729  4.539631  0.000064  5  38.999892  20.305567
A   16.380274  2.605498  0.000074  5  21.745030  11.015517
E    4.652081  1.080328  0.000325  5   6.876490   2.427671
C    0.000000  0.000000  0.500000  5   0.000000   0.000000
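
One way to read this table is to keep only the features whose importance is statistically distinguishable from zero. The sketch below rebuilds two columns of the printed table by hand (values copied from the output above) and filters on `p_value`. The 0.05 threshold is an assumption for illustration, not an AutoGluon default.

```python
import pandas as pd

# Values copied from the feature importance output above
importance = pd.DataFrame(
    {
        "importance": [31.261425, 29.652729, 16.380274, 4.652081, 0.0],
        "p_value": [0.000253, 0.000064, 0.000074, 0.000325, 0.5],
    },
    index=["B", "D", "A", "E", "C"],
)

alpha = 0.05  # assumed significance threshold
significant = importance[importance["p_value"] < alpha]
print(significant.index.tolist())

# Share of the summed importance attributed to each feature
share = importance["importance"] / importance["importance"].sum()
print(share.round(3))
```

In a live session the same filter could be applied directly to the DataFrame returned by `predictor.feature_importance(df)`. Here the date column C contributes nothing, which fits a synthetic dataset where almost every tree was watered on the same day.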


## Handling Missing Data

Sometimes, our magical trees might have missing data. Let's see how AutoGluon handles this:

In [6]:
# Create missing data
df_missing = df.copy()
df_missing.iloc[0] = np.nan
df_missing.iloc[1, :2] = np.nan  # Set first two columns of second row to NaN

print("Data with missing values:")
print(df_missing.head())

# Process data with missing values
# (df_missing still contains the target column, so magic_power appears in the output)
auto_ml_pipeline_feature_generator_missing = AutoMLPipelineFeatureGenerator()
processed_features_missing = auto_ml_pipeline_feature_generator_missing.fit_transform(X=df_missing)

print("\nProcessed features with missing values:")
print(processed_features_missing.head())

Data with missing values:
             A    B          C      D                    E  magic_power
0          NaN  NaN        NaT    NaN                  NaN          NaN
1          NaN  NaN 2023-01-01  loamy   yellow long narrow     6.230541
2   939.829339  3.0 2023-01-02  loamy      short green red    11.861024
3  1082.190250  5.0 2023-01-01  sandy   green yellow broad   -63.940576
4  1154.993441  5.0 2023-01-01  sandy  green yellow narrow    49.630085

Fitting AutoMLPipelineFeatureGenerator...
	Available Memory:                    10014.00 MB
	Train Data (Original)  Memory Usage: 0.01 MB (0.0% of available memory)
	Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
	Stage 1 Generators:
		Fitting AsTypeFeatureGenerator...
	Stage 2 Generators:
		Fitting FillNaFeatureGenerator...
	Stage 3 Generators:
		Fitting IdentityFeatureGenerator...
		Fitting CategoryFeatureGenerator...
			Fitting CategoryMemoryMinimizeFeatureGenerator...
		Fitting DatetimeFeatureGenerator...
		Fitting TextSpecialFeatureGenerator...
			Fitting BinnedFeatureGenerator...
			Fitting DropDuplicatesFeatureGenerator...
		Fitting TextNgramFeatureGenerator...
			Fitting CountVectorizer for text features: ['E']
			CountVectorizer fit with vocabulary size = 5
	Stage 4 Generators:
		Fitting DropUniqueFeatureGenerator...
	Stage 5 Generators:
		Fitting DropDuplicatesFeatureGenerator...
	Types of features in original data (raw dtype, special dtypes):
		('category', [])     : 1 | ['D']
		('datetime', [])     : 1 | ['C']
		('float', [])        : 3 | ['A', 'B', 'magic_power']
		('object', ['text']) : 1 | ['E']
	Types of features in processed data (raw dtype, special dtypes):
		('category', [])                    : 1 | ['D']
		('category', ['text_as_category'])  : 1 | ['E']
		('float', [])                       : 3 | ['A', 'B', 'magic_power']
		('int', ['binned', 'text_special']) : 3 | ['E.char_count', 'E.word_count', 'E.symbol_ratio. ']
		('int', ['datetime_as_int'])        : 5 | ['C', 'C.year', 'C.month', 'C.day', 'C.dayofweek']
		('int', ['text_ngram'])             : 6 | ['__nlp__.broad', '__nlp__.green', '__


Processed features with missing values:
             A    B  magic_power    D    E                    C  C.year  \
0          NaN  NaN          NaN  NaN  NaN  1672532945454545408    2023   
1          NaN  NaN     6.230541    2  NaN  1672531200000000000    2023   
2   939.829339  3.0    11.861024    2  NaN  1672617600000000000    2023   
3  1082.190250  5.0   -63.940576    0  NaN  1672531200000000000    2023   
4  1154.993441  5.0    49.630085    0    0  1672531200000000000    2023   

   C.month  C.day  C.dayofweek  E.char_count  E.word_count  E.symbol_ratio.   \
0        1      1            6             0             0                 0   
1        1      1            6             7             1                 3   
2        1      2            0             4             1                 6   
3        1      1            6             7             1                 3   
4        1      1            6             8             1                 2   

   __nlp__.broad  __nlp__.g

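Notice how AutoGluon handles different types of missing data:
- For numeric and categorical columns, it keeps NaN values.
- For datetime columns, it replaces missing values (NaT) with the mean of the non-missing values, which is why row 0 shows 1672532945454545408 in column C.
- For the text column, the derived count features (such as `E.char_count` and `E.word_count`) are 0 for the missing row.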
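
The datetime behaviour can be imitated with plain pandas. The following is a minimal sketch of mean imputation on a toy series, assuming made-up dates. It illustrates the idea only and does not call AutoGluon's own code.

```python
import pandas as pd

# Toy "last watered" column with one missing entry (NaT)
watered = pd.Series(pd.to_datetime([None, "2023-01-01", "2023-01-03"]))

# Series.mean() skips missing values by default
mean_date = watered.mean()

# Replace the missing entry with the mean of the observed dates
filled = watered.fillna(mean_date)
print(filled)
```

The missing entry becomes 2023-01-02, the midpoint of the two observed dates.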

## Customizing Feature Engineering

Sometimes, we might want to adjust how our ingredients are prepared. Let's create a custom feature generation pipeline:

In [7]:
from autogluon.features.generators import PipelineFeatureGenerator, CategoryFeatureGenerator, IdentityFeatureGenerator
from autogluon.common.features.types import R_INT, R_FLOAT

custom_pipeline = PipelineFeatureGenerator(
    generators=[
        [
            CategoryFeatureGenerator(maximum_num_cat=3),  # Only keep the top 3 categories
            IdentityFeatureGenerator(infer_features_in_args=dict(valid_raw_types=[R_INT, R_FLOAT])),
        ]
    ]
)

custom_processed_features = custom_pipeline.fit_transform(X=dfx)

print("Custom processed features:")
print(custom_processed_features.head())
print("\nCustom processed feature types:")
print(custom_processed_features.dtypes)

Fitting PipelineFeatureGenerator...
	Available Memory:                    10533.53 MB
	Train Data (Original)  Memory Usage: 0.01 MB (0.0% of available memory)
	Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
	Stage 1 Generators:
		Fitting AsTypeFeatureGenerator...
	Stage 2 Generators:
		Fitting FillNaFeatureGenerator...
	Stage 3 Generators:
		Fitting CategoryFeatureGenerator...
			Fitting CategoryMemoryMinimizeFeatureGenerator...
		Fitting IdentityFeatureGenerator...
	Stage 4 Generators:
		Fitting DropUniqueFeatureGenerator...
	Stage 5 Generators:
		Fitting DropDuplicatesFeatureGenerator...
	Unused Original Features (Count: 1): ['C']
		These features were not used to generate any of the output features. Add a feature generator compatible with these features to utilize them.
		Features can also be unused if they carry very little information, such as being categorical but having almost entirely uniqu

Custom processed features:
     D    E            A  B
0  NaN  NaN   906.217496  5
1    2  NaN  1108.895060  4
2    2  NaN   939.829339  3
3    0  NaN  1082.190250  5
4    0  NaN  1154.993441  5

Custom processed feature types:
D    category
E    category
A     float64
B       int64
dtype: object


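In this custom pipeline:
1. We limit categorical features to only the top 3 categories, replacing others with NaN. Row 0, whose soil type was `peaty`, now shows NaN in column D. The text column (E) is treated as a category as well.
2. We keep numeric features (A and B) as they are.
3. The datetime column (C) is dropped, because no generator in this pipeline accepts it (see "Unused Original Features" in the log).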
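
The "top N categories" idea can be sketched in plain pandas. The helper `keep_top_categories` below is hypothetical and the soil counts are made up. AutoGluon's `CategoryFeatureGenerator` may differ in details such as how ties are broken.

```python
import pandas as pd


def keep_top_categories(values: pd.Series, n: int) -> pd.Series:
    """Hypothetical helper: keep the n most frequent values, set the rest to NaN."""
    # value_counts() sorts by frequency, most common first
    top = values.value_counts().index[:n]
    # where() keeps entries that pass the mask and replaces the others with NaN
    return values.where(values.isin(top)).astype("category")


soil = pd.Series(["sandy"] * 4 + ["loamy"] * 3 + ["clay"] * 2 + ["peaty"])
limited = keep_top_categories(soil, n=3)
print(limited.value_counts(dropna=False))
```

The rarest soil type (`peaty`) becomes NaN, while the three most frequent types survive as categories.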

## Conclusion

Feature engineering is a crucial step in preparing data for machine learning models. AutoGluon provides powerful tools to automate this process, but also allows for customization when needed. By understanding these concepts, you're well on your way to becoming a master chef in the kitchen of data science!

Remember to experiment with different feature engineering techniques and see how they affect your model's performance. Happy coding!