# Using LLMs to Generate Descriptions from Metadata

This notebook takes an initial look at the selected dataset, which contains Amazon product descriptions. The goal is to use appropriate metadata to generate new descriptions with various LLMs and prompts.

## Data Loading

In [None]:
%%capture
!pip install -q transformers datasets pandas accelerate evaluate bert_score nltk rouge_score textstat
import pandas as pd
import kagglehub

# Download dataset
path = kagglehub.dataset_download("piyushjain16/amazon-product-data")

# Load the data
df = pd.read_csv(path + "/dataset/train.csv")
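For quick experiments it may be convenient to read only the text columns and a limited number of rows instead of the full file. The following is a minimal sketch, assuming the CSV layout used in this notebook; the `load_sample` helper and its default values are hypothetical and not part of the original notebook.

```python
from pathlib import Path

import pandas as pd


def load_sample(csv_path, n_rows=10_000,
                columns=("PRODUCT_ID", "TITLE", "BULLET_POINTS", "DESCRIPTION")):
    """Read only the first n_rows rows and the listed columns of a CSV file."""
    return pd.read_csv(Path(csv_path), usecols=list(columns), nrows=n_rows)


# Hypothetical usage with the `path` returned by kagglehub above:
# df_sample = load_sample(path + "/dataset/train.csv")
```

Both `usecols` and `nrows` are standard `pd.read_csv` parameters; the returned columns follow the order in the file, not the order passed in `columns`.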

In [None]:
# initial look at data
df.head()

| | PRODUCT_ID | TITLE | BULLET_POINTS | DESCRIPTION | PRODUCT_TYPE_ID | PRODUCT_LENGTH |
|---|---|---|---|---|---|---|
| 0 | 1925202 | ArtzFolio Tulip Flowers Blackout Curtain for D... | [LUXURIOUS & APPEALING: Beautiful custom-made ... | NaN | 1650 | 2125.98 |
| 1 | 2673191 | Marks & Spencer Girls' Pyjama Sets T86_2561C_N... | [Harry Potter Hedwig Pyjamas (6-16 Yrs),100% c... | NaN | 2755 | 393.7 |
| 2 | 2765088 | PRIKNIK Horn Red Electric Air Horn Compressor ... | [Loud Dual Tone Trumpet Horn, Compatible With ... | Specifications: Color: Red, Material: Aluminiu... | 7537 | 748.031495 |
| 3 | 1594019 | ALISHAH Women's Cotton Ankle Length Leggings C... | [Made By 95%cotton and 5% Lycra which gives yo... | AISHAH Women's Lycra Cotton Ankel Leggings. Br... | 2996 | 787.401574 |
| 4 | 283658 | The United Empire Loyalists: A Chronicle of th... | NaN | NaN | 6112 | 598.424 |


In [None]:
df.columns

Index(['PRODUCT_ID', 'TITLE', 'BULLET_POINTS', 'DESCRIPTION',
       'PRODUCT_TYPE_ID', 'PRODUCT_LENGTH'],
      dtype='object')

In [None]:
# Standardize column names for convenience
df.columns = df.columns.str.lower().str.replace(' ', '_')
df.columns

Index(['product_id', 'title', 'bullet_points', 'description',
       'product_type_id', 'product_length'],
      dtype='object')

In [None]:
# Exploring the dataset
df.info()
print('---'*20)
print(df.describe())
print('---'*20)
print('Dataset shape:',df.shape)
print('---'*20)
print('Missing Values:')
df.isnull().sum().sort_values(ascending=False)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2249698 entries, 0 to 2249697
Data columns (total 6 columns):
 #   Column           Dtype  
---  ------           -----  
 0   product_id       int64  
 1   title            object 
 2   bullet_points    object 
 3   description      object 
 4   product_type_id  int64  
 5   product_length   float64
dtypes: float64(1), int64(2), object(3)
memory usage: 103.0+ MB
------------------------------------------------------------
         product_id  product_type_id  product_length
count  2.249698e+06     2.249698e+06    2.249698e+06
mean   1.499795e+06     4.000456e+03    4.071839e+03
std    8.661944e+05     3.966146e+03    1.351685e+06
min    1.000000e+00     0.000000e+00    1.000000e+00
25%    7.494795e+05     2.300000e+02    5.118110e+02
50%    1.499558e+06     2.916000e+03    6.630000e+02
75%    2.250664e+06     6.403000e+03    1.062992e+03
max    2.999999e+06     1.342000e+04    1.885801e+09
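------------------------------------------------------------
Dataset shape: (2249698, 6)
------------------------------------------------------------
Missing Values:

description        1157382
bullet_points       837366
title                   13
product_id               0
product_type_id          0
product_length           0
dtype: int64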


In [None]:
# Checking for NaN values
# Number of NaN values per column
nan_per_column = df.isna().sum()
print("NaN values per column:")
print(nan_per_column)

# Total number of NaN values in the whole df
total_nan = df.isna().sum().sum()
print(f"\nTotal NaN values in DataFrame: {total_nan}")

NaN values per column:
product_id               0
title                   13
bullet_points       837366
description        1157382
product_type_id          0
product_length           0
dtype: int64

Total NaN values in DataFrame: 1994761
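
Raw counts are easier to interpret as a share of all rows. The following is a minimal sketch that uses a small toy DataFrame in place of the full dataset; the `missing_share` helper is hypothetical and not part of the original notebook.

```python
import pandas as pd


def missing_share(frame):
    """Return the percentage of missing values per column, largest first."""
    return (frame.isna().mean() * 100).sort_values(ascending=False)


# Toy data standing in for the real DataFrame
toy = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "bullet_points": ["x", None, "y", None],
    "description": [None, None, None, "z"],
})

shares = missing_share(toy)
print(shares.round(1))
```

Applied to the counts above, 1,157,382 of 2,249,698 rows lack a description (about 51%) and 837,366 lack bullet points (about 37%).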


No visualizations are included in this notebook because none of the columns would produce a meaningful plot. 'product_type_id' only assigns each product to a category (clothes, furniture, etc.), which is information already conveyed in 'bullet_points'. 'product_length' adds little for the same reason: it only records the size of the product, which can also be found in 'bullet_points'.
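
Following this reasoning, one plausible next step is to keep only the text columns and the rows where all of them are present, so that existing descriptions are available for comparison with generated ones. This is a sketch under that assumption, not the notebook's actual preprocessing; the `select_text_rows` helper is hypothetical, and a toy DataFrame stands in for `df`.

```python
import pandas as pd

TEXT_COLUMNS = ["title", "bullet_points", "description"]


def select_text_rows(frame):
    """Keep product_id plus the text columns, dropping rows with any missing text."""
    kept = frame[["product_id"] + TEXT_COLUMNS]
    return kept.dropna(subset=TEXT_COLUMNS).reset_index(drop=True)


# Toy data standing in for the real DataFrame
toy = pd.DataFrame({
    "product_id": [1, 2, 3],
    "title": ["Curtain", "Pyjama set", "Air horn"],
    "bullet_points": ["[Blackout fabric]", None, "[Loud dual tone]"],
    "description": [None, None, "Color: Red"],
    "product_type_id": [1650, 2755, 7537],
    "product_length": [2125.98, 393.7, 748.03],
})

usable = select_text_rows(toy)
print(usable)
```

Only the third toy row survives, because it is the only one with a title, bullet points, and a description.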