In keeping with the current interest in Large Language Models (LLMs), Upgini uses OpenAI's GPT models to automate the entire feature engineering process for our dataset.

In [1]:
import warnings

import numpy as np
import pandas as pd

warnings.simplefilter("ignore", UserWarning)

In [2]:
df_full = pd.read_csv("data/Reviews.csv")

In [3]:
df_full.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,2,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,5,1350777600,Great taffy,Great taffy at a great price. There was a wid...


In [4]:
# convert Time to datetime column
df_full["Time"] = pd.to_datetime(df_full["Time"], unit="s")
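
As a quick sanity check, the sketch below converts the first two raw `Time` values from the preview above. It assumes, as the `unit="s"` argument does, that the column stores Unix epoch seconds; the variable names are illustrative only.

```python
import pandas as pd

# Raw epoch-second values copied from the first two rows of the preview
raw = pd.Series([1303862400, 1346976000])

# Same conversion as in the cell above, applied to the toy series
converted = pd.to_datetime(raw, unit="s")
print(converted.dt.strftime("%Y-%m-%d").tolist())
```

The resulting dates should match the `Time` values shown for the same two rows after the conversion.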

In [5]:
# keep only the columns we need, in a new order
df_full = df_full[
    [
        "Time",
        "ProfileName",
        "Summary",
        "Text",
        "HelpfulnessNumerator",
        "HelpfulnessDenominator",
        "Score",
    ]
]

In [6]:
df_full.head()

Unnamed: 0,Time,ProfileName,Summary,Text,HelpfulnessNumerator,HelpfulnessDenominator,Score
0,2011-04-27,delmartian,Good Quality Dog Food,I have bought several of the Vitality canned d...,1,1,5
1,2012-09-07,dll pa,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...,0,0,1
2,2008-08-18,"Natalia Corres ""Natalia Corres""","""Delight"" says it all",This is a confection that has been around a fe...,1,1,4
3,2011-06-13,Karl,Cough Medicine,If you are looking for the secret ingredient i...,3,3,2
4,2012-10-21,"Michael D. Bigham ""M. Wassir""",Great taffy,Great taffy at a great price. There was a wid...,0,0,5


We also filter our dataset to include only reviews that received more than 10 helpfulness votes (`HelpfulnessDenominator > 10`) and were published on or after 2011-01-01.

In [7]:
df_full = df_full[
    (df_full["HelpfulnessDenominator"] > 10) & (df_full["Time"] >= "2011-01-01")
]
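
The filter combines two boolean masks with `&`. A minimal sketch on made-up rows (the toy values and the names `toy`, `mask`, and `kept` are illustrative, not from the dataset) shows which rows survive. The trailing `.copy()` is an optional addition, not part of the cell above; it makes the result an independent frame, so later column assignments do not trigger pandas' `SettingWithCopyWarning`.

```python
import pandas as pd

# Four made-up reviews: vote counts and publication dates
toy = pd.DataFrame(
    {
        "HelpfulnessDenominator": [3, 13, 11, 15],
        "Time": pd.to_datetime(
            ["2012-01-04", "2010-12-31", "2011-01-01", "2011-06-29"]
        ),
    }
)

# Keep rows with more than 10 votes that were published on or after 2011-01-01
mask = (toy["HelpfulnessDenominator"] > 10) & (toy["Time"] >= "2011-01-01")
kept = toy[mask].copy()  # .copy() is an assumption added for this sketch
print(kept)
```

Row 0 fails the vote threshold and row 1 fails the date threshold, so only rows 2 and 3 remain.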

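We also transform helpfulness into a binary variable, `Helpful`, which is 1 when more than 50% of the votes were helpful (`HelpfulnessNumerator / HelpfulnessDenominator > 0.50`) and 0 otherwise.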

In [8]:
df_full.loc[:, "Helpful"] = np.where(
    df_full.loc[:, "HelpfulnessNumerator"] / df_full.loc[:, "HelpfulnessDenominator"]
    > 0.50,
    1,
    0,
)
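
To make the threshold concrete, here is a small sketch that applies the same rule to the vote counts of the first four rows shown in the next preview. The frame name `votes` is illustrative only.

```python
import numpy as np
import pandas as pd

# Vote counts copied from four rows of the filtered dataset
votes = pd.DataFrame(
    {
        "HelpfulnessNumerator": [15, 7, 5, 8],
        "HelpfulnessDenominator": [15, 13, 11, 12],
    }
)

# Share of helpful votes, then the same 0.50 cut-off as above
ratio = votes["HelpfulnessNumerator"] / votes["HelpfulnessDenominator"]
votes["Helpful"] = np.where(ratio > 0.50, 1, 0)
print(votes)
```

A review with 7 of 13 helpful votes (about 0.54) is labelled 1, while one with 5 of 11 (about 0.45) is labelled 0.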

In [9]:
df_full.head()

Unnamed: 0,Time,ProfileName,Summary,Text,HelpfulnessNumerator,HelpfulnessDenominator,Score,Helpful
82,2012-01-04,"Johnnycakes ""Johnnycakes""",Forget Molecular Gastronomy - this stuff rocke...,I know the product title says Molecular Gastro...,15,15,5,1
381,2011-06-29,Allison Beegle,Waste of money,These condiments are overpriced and terrible. ...,7,13,1,1
1040,2011-10-16,"Mike ""stargaazor""",Buyer Beware!This product contains High Fructo...,It's no secret that high fructose corn syrup i...,5,11,1,0
1052,2012-01-26,Amy_C,Have not tried product.... however...,"They claim this product is ""unadulterated"", wh...",8,12,2,1
1069,2012-02-23,One Tree in the Forest,THIS IS NOT THE PRODUCT THAT I RECEIVED. THE A...,"I purchased this product because of its name, ...",5,11,1,0


Finally, we create a new column, `combined`, that concatenates the summary and text into a single string. We then drop the columns we no longer need and take this opportunity to remove any duplicates.

In [10]:
idx_list = list(df_full.index)  # Added by me

In [11]:
for idx in idx_list:  # Original codes were modified by me
    df_full.loc[
        idx, "combined"
    ] = f"Title: {df_full.loc[idx, 'Summary'].strip()} ; Content: {df_full.loc[idx, 'Text'].strip()}"
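
The row-by-row loop works, but the same string can be built for all rows at once with pandas' vectorized string methods. The sketch below is an alternative under the assumption that `Summary` and `Text` contain only strings; the toy frame `reviews` is made up for illustration.

```python
import pandas as pd

# Two made-up reviews with stray whitespace around the text
reviews = pd.DataFrame(
    {
        "Summary": ["  Great taffy ", "Waste of money"],
        "Text": ["Great taffy at a great price. ", " Overpriced and terrible."],
    }
)

# Build "Title: ... ; Content: ..." for every row in one expression
reviews["combined"] = (
    "Title: "
    + reviews["Summary"].str.strip()
    + " ; Content: "
    + reviews["Text"].str.strip()
)
print(reviews["combined"].tolist())
```

One difference in behaviour: if a value is missing, the vectorized version yields a missing `combined` value, whereas the loop raises an error when it calls `.strip()` on a non-string.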

In [12]:
df_full.drop(
    ["Summary", "Text", "HelpfulnessNumerator", "HelpfulnessDenominator"],
    axis=1,
    inplace=True,
)

In [13]:
df_full.drop_duplicates(subset=["combined"], inplace=True)
df_full.reset_index(drop=True, inplace=True)
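
For illustration, the following sketch shows the effect of these two calls on a made-up frame (`toy` and `deduped` are hypothetical names). It uses the non-in-place form, which returns a new frame instead of modifying the original.

```python
import pandas as pd

# Three made-up rows; the first two share the same combined text
toy = pd.DataFrame(
    {
        "Score": [5, 5, 1],
        "combined": [
            "Title: A ; Content: x",
            "Title: A ; Content: x",
            "Title: B ; Content: y",
        ],
    }
)

# Keep the first occurrence of each combined text, then renumber the rows
deduped = toy.drop_duplicates(subset=["combined"]).reset_index(drop=True)
print(deduped)
```

Resetting the index with `drop=True` discards the old labels, which is why the later previews start again at 0.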

In [14]:
df_full.shape

(4022, 5)

In [15]:
df_full.head()

Unnamed: 0,Time,ProfileName,Score,Helpful,combined
0,2012-01-04,"Johnnycakes ""Johnnycakes""",5,1,Title: Forget Molecular Gastronomy - this stuf...
1,2011-06-29,Allison Beegle,1,1,Title: Waste of money ; Content: These condime...
2,2011-10-16,"Mike ""stargaazor""",1,0,Title: Buyer Beware!This product contains High...
3,2012-01-26,Amy_C,2,1,Title: Have not tried product.... however... ;...
4,2012-02-23,One Tree in the Forest,1,0,Title: THIS IS NOT THE PRODUCT THAT I RECEIVED...


# Feature Search and Enrichment with Upgini

Following the Upgini documentation, we can start a feature search using the `FeaturesEnricher` object. Within that `FeaturesEnricher`, we specify a `SearchKey`, i.e., the column whose values Upgini uses to match our rows against external data sources.

The following column types can serve as search keys:

- email
- HEM (hashed email)
- IP address
- phone
- date
- datetime
- country
- postal code

In [16]:
from upgini import FeaturesEnricher, SearchKey

In [17]:
enricher = FeaturesEnricher(search_keys={"Time": SearchKey.DATE})

In [18]:
%%time
enricher.fit(df_full[["Time", "ProfileName", "Score", "combined"]], df_full["Helpful"])

The least populated class in Target contains less than 1000 rows.
Small numbers of observations may negatively affect the number of selected features and quality of your ML model.
Upgini recommends you increase the number of observations in the least populated class.
Try to add other keys like the COUNTRY, POSTAL_CODE, PHONE NUMBER, EMAIL/HEM, IPv4 to your training dataset
for search through all the available data sources.
See docs https://github.com/upgini/upgini#-total-239-countries-and-up-to-41-years-of-history
Detected task type: ModelTaskType.BINARY

Search key country_code `US` was used as default. 
See docs to turn off the automatic detection: https://github.com/upgini/upgini/blob/main/README.md#turn-off-autodetection-for-search-key-columns


Column name,Status,Errors
country_code,All valid,-
Time,All valid,-
target,All valid,-



Running search request, search_id=d030e1a4-b44c-46d8-9061-219a5986080b
We'll send email notification once it's completed, just use your personal api_key from profile.upgini.com
9 relevant feature(s) found with the search keys: ['Time', 'country_code']


Provider,Source,Feature name,SHAP value,Coverage %,Type,Feature type
,,Score,0.440234,100.0,numerical,
Upgini,Public data,f_financial_date_finance_pca_0_302a6c24,0.017746,100.0,numerical,Free
Upgini,Public data,f_financial_date_gold_3295eb6b,0.009264,100.0,numerical,Free
Upgini,Public data,f_financial_date_silver_7d_to_1y_b364a149,0.00699,100.0,numerical,Free
Upgini,Public data,f_weather_date_weather_pca_47_f6eca17e,0.006402,100.0,numerical,Free
Upgini,Public data,f_weather_date_weather_pca_13_c27688b8,0.005566,100.0,numerical,Free
Upgini,Public data,f_economic_date_cbpol_umap_6_aa0352de,0.004096,100.0,numerical,Free
Upgini,Public data,f_weather_country_date_tavg_avg_9f0e1bbc,0.004027,100.0,numerical,Free
Upgini,Public data,f_weather_date_weather_pca_21_49c534ee,0.004013,100.0,numerical,Free
Upgini,Public data,f_financial_date_dow_jones_7d_to_1y_5d4a1c5a,0.00349,100.0,numerical,Free


Calculating accuracy uplift after enrichment...
[92m[1m
Quality metrics[0m


Unnamed: 0,Rows,Baseline roc_auc,Enriched roc_auc,Uplift
Train,4022.0,0.760825,0.762075,0.00125


CPU times: total: 1min 15s
Wall time: 4min
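
The uplift column is simply the difference between the enriched and baseline scores. The sketch below recomputes it from the two ROC AUC values in the table and also expresses it relative to the baseline; the relative figure is an extra calculation, not something Upgini reports.

```python
# ROC AUC values copied from the quality metrics table above
baseline_auc = 0.760825
enriched_auc = 0.762075

# Absolute uplift, as reported in the table
uplift = enriched_auc - baseline_auc

# Uplift as a fraction of the baseline (extra calculation for context)
relative = uplift / baseline_auc

print(f"absolute uplift: {uplift:.5f}")
print(f"relative uplift: {relative:.2%}")
```

Seen this way, the date-based search adds very little on top of the `Score` column, which dominates the SHAP values in the feature table.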


# Feature Generation using GPT models

Digging deeper into the documentation, it seems that the `FeaturesEnricher` also accepts another parameter: `generate_features`.

`generate_features` allows us to search for and generate feature embeddings for text columns. This sounds really promising, since we do have text columns: `combined` and `ProfileName`.

> Upgini has two LLMs connected to a search engine - GPT-3.5 from OpenAI and GPT-J.
>
> (from the Upgini documentation)

In [19]:
enricher = FeaturesEnricher(
    search_keys={"Time": SearchKey.DATE}, generate_features=["combined", "ProfileName"]
)

In [20]:
%%time
enricher.fit(df_full[['Time','ProfileName','Score','combined']], df_full['Helpful'])

The least populated class in Target contains less than 1000 rows.
Small numbers of observations may negatively affect the number of selected features and quality of your ML model.
Upgini recommends you increase the number of observations in the least populated class.
Try to add other keys like the COUNTRY, POSTAL_CODE, PHONE NUMBER, EMAIL/HEM, IPv4 to your training dataset
for search through all the available data sources.
See docs https://github.com/upgini/upgini#-total-239-countries-and-up-to-41-years-of-history
Detected task type: ModelTaskType.BINARY

Search key country_code `US` was used as default. 
See docs to turn off the automatic detection: https://github.com/upgini/upgini/blob/main/README.md#turn-off-autodetection-for-search-key-columns


Column name,Status,Errors
country_code,All valid,-
Time,All valid,-
target,All valid,-



Running search request, search_id=838a7cc8-954a-4e69-a826-2f1b837dd779
We'll send email notification once it's completed, just use your personal api_key from profile.upgini.com


RuntimeError: All search tasks in the request have failed

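The walkthrough this notebook follows reports a massive boost in predictive performance from the newly generated features, an uplift of 0.1, all of it fully automated. In the run shown above, however, the search request failed with `RuntimeError: All search tasks in the request have failed`, so that uplift is not reproduced here.

Given a gain of that size, we would definitely want to keep these features. Once a search with `generate_features` completes successfully, we can do this as follows: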

In [None]:
df_full_enrich = enricher.transform(df_full)