## Data Preprocessing/Preparing Data for LLM

Create a new DataFrame that keeps only the columns needed for the LLM and drops rows with missing or otherwise unusable values.

In [None]:
# Selecting and renaming relevant columns
df_llm_input = df[['title', 'product_type_id', 'bullet_points', 'description']].copy()
df_llm_input = df_llm_input.rename(columns={'description': 'target_description'})

# Columns that must be present and non-empty in every row
required_cols = ['title', 'product_type_id', 'bullet_points', 'target_description']

# Dropping rows with actual NaN values
df_llm_input = df_llm_input.dropna(subset=required_cols)

# Ensuring all values are strings
df_llm_input[required_cols] = df_llm_input[required_cols].astype(str)

# Removing rows with empty strings, only whitespace, or literal "nan"
for col in required_cols:
    df_llm_input = df_llm_input[
        ~df_llm_input[col].str.strip().str.lower().isin(['', 'nan'])
    ]

In [None]:
print('New shape of df after cleaning:', df_llm_input.shape)

New shape of df after cleaning: (1038458, 4)
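
As a quick sanity check, the cleaning steps above can be wrapped in a helper and run on a small toy frame. This is a minimal sketch, not the notebook's own code: the function name `clean_for_llm`, the constant `REQUIRED_COLS`, and the toy rows are all hypothetical and only illustrate which kinds of rows get dropped.

```python
import numpy as np
import pandas as pd

# Hypothetical constant mirroring the required columns used in this section
REQUIRED_COLS = ['title', 'product_type_id', 'bullet_points', 'target_description']


def clean_for_llm(frame: pd.DataFrame) -> pd.DataFrame:
    """Drop rows where any required column is NaN, blank, or the literal string 'nan'."""
    # Remove real NaN/None values first, then cast the required columns to str
    out = frame.dropna(subset=REQUIRED_COLS).astype({c: str for c in REQUIRED_COLS})
    # Normalize each column so '   ' and 'NaN' are caught as well
    normalized = out[REQUIRED_COLS].apply(lambda s: s.str.strip().str.lower())
    unusable = normalized.isin(['', 'nan']).any(axis=1)
    return out[~unusable]


# Toy data (made up): only the first row is fully usable
toy = pd.DataFrame({
    'title': ['Lamp', 'Mug', None, 'Chair', 'Desk'],
    'product_type_id': [1, 2, 3, 4, 5],
    'bullet_points': ['Bright LED', '   ', 'Soft', 'nan', 'Oak top'],
    'target_description': ['A lamp.', 'A mug.', 'A pillow.', 'A chair.', np.nan],
})

cleaned = clean_for_llm(toy)
print(cleaned['title'].tolist())
```

In this toy frame the rows with a whitespace-only value, a missing title, a literal `nan` string, and a NaN description are all removed, so only the `Lamp` row should remain.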


Format the data as input for the LLMs. Only a 25-row sample of the cleaned data is used, to keep execution fast.

In [None]:
# Construct LLM input prompts
df_llm_input['input_text'] = (
    'TITLE: ' + df_llm_input['title'].str.strip() +
    ' PRODUCT_TYPE_ID: ' + df_llm_input['product_type_id'].str.strip() +
    ' BULLET_POINTS: ' + df_llm_input['bullet_points'].str.strip()
)

# Keeping only columns needed for generation and evaluation
df_llm_input = df_llm_input[['input_text', 'target_description']]

# Sampling 25 clean rows
df_llm_input = df_llm_input.sample(n=25, random_state=42).reset_index(drop=True)
df_llm_input.head()

| | input_text | target_description |
|---|---|---|
| 0 | TITLE: Plane Light System, Plastic + Metal Tax... | Features:&nbsp;<br> Full set of bright LED lig... |
| 1 | TITLE: DECOR Kafe Home Decor Sunflower Wall St... | Welcome To The Foremost Place On The Web To Fi... |
| 2 | TITLE: Vbuyz Women's Rayon Foil Print Stitched... | Vbuyz women's green color rayon straight kurti... |
| 3 | TITLE: Mitsui Shop on Suruga Street in Edo by ... | <p></p><br><p>Lost Cabin Art & Decor wall deco... |
| 4 | TITLE: Brass Glass ( 1 pcs ), 250ml (Glass wit... | Specification:-Set for 1, Material: Brass, Vol... |


In [None]:
print('New shape for testing:', df_llm_input.shape)

New shape for testing: (25, 2)
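
The prompt construction and sampling steps in this section can be sketched on a toy frame to show the resulting string format and why a fixed `random_state` matters. This is an illustrative sketch under assumptions: the helper name `build_input_text` and the toy rows are hypothetical, and the columns are assumed to already be strings, as they are after the cleaning step.

```python
import pandas as pd


def build_input_text(frame: pd.DataFrame) -> pd.Series:
    """Concatenate the labelled fields into one prompt string per row."""
    return (
        'TITLE: ' + frame['title'].str.strip()
        + ' PRODUCT_TYPE_ID: ' + frame['product_type_id'].str.strip()
        + ' BULLET_POINTS: ' + frame['bullet_points'].str.strip()
    )


# Toy data (made up), already cast to strings
toy = pd.DataFrame({
    'title': [' Lamp ', 'Mug', 'Chair'],
    'product_type_id': ['1', '2', '3'],
    'bullet_points': ['Bright LED', 'Ceramic ', 'Oak legs'],
    'target_description': ['A lamp.', 'A mug.', 'A chair.'],
})

prompts = build_input_text(toy)
print(prompts.iloc[0])

# Sampling twice with the same seed selects the same rows
sample_a = toy.sample(n=2, random_state=42)
sample_b = toy.sample(n=2, random_state=42)
print(sample_a.equals(sample_b))
```

Fixing `random_state` keeps the 25-row test set identical across reruns, so generations from different models are compared on the same inputs. Note that `sample(n=...)` raises an error when `n` exceeds the number of rows unless `replace=True` is passed.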
