# Overview

This notebook prepares the demo dataset for this project. We downloaded the dataset from the [eCommerce Text Classification](https://www.kaggle.com/datasets/saurabhshahane/ecommerce-text-classification) problem on Kaggle.

The dataset has 4 classes. For demo purposes, we keep only 100 samples from each class.

In [58]:
import pandas as pd

# Load and rename columns to adhere to the format required by CPR image
df = pd.read_csv("ecommerceDataset.csv", header=None)
df.rename(columns={0: "label", 1: "text"}, inplace=True)

# Create ID column as both training and CPR require an ID column
df["id"] = df.index
df.head()

|   | 0 | 1 | id |
|---|---|---|---|
| 0 | Household | Paper Plane Design Framed Wall Hanging Motivat... | 0 |
| 1 | Household | SAF 'Floral' Framed Painting (Wood, 30 inch x ... | 1 |
| 2 | Household | SAF 'UV Textured Modern Art Print Framed' Pain... | 2 |
| 3 | Household | SAF Flower Print Framed Painting (Synthetic, 1... | 3 |
| 4 | Household | Incredible Gifts India Wooden Happy Birthday U... | 4 |


In [60]:
# Show distribution of labels
df.label.value_counts() / len(df)

label
Household                 0.383004
Books                     0.234408
Electronics               0.210630
Clothing & Accessories    0.171958
Name: count, dtype: float64
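
As a side note, the division by `len(df)` can also be expressed with the `normalize=True` argument of `value_counts`. The sketch below compares the two on a tiny hypothetical series (the values `a`, `b`, `c` are placeholders, not labels from the dataset).

```python
import pandas as pd

# Hypothetical toy labels, used only to compare the two ways of computing shares
s = pd.Series(["a", "a", "b", "c"], name="label")

manual = s.value_counts() / len(s)            # approach used in the cell above
normalized = s.value_counts(normalize=True)   # built-in equivalent

print(manual)
print(normalized)
```

Both give the same fractions; only the name attached to the resulting series may differ between pandas versions.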

In [61]:
# Sample 100 rows per class to get a small, balanced demo set (fixed seed for reproducibility)
df = df.groupby("label").sample(n=100, random_state=42)
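
To sanity-check the per-class sampling, here is a minimal sketch on a toy frame. The toy data, the sample size of 10, and the seed are assumptions for illustration. It uses `DataFrameGroupBy.sample`, which draws a fixed number of rows from every group.

```python
import pandas as pd

# Toy stand-in for the real dataset: imbalanced labels with placeholder texts
toy = pd.DataFrame({
    "label": ["Household"] * 50 + ["Books"] * 30 + ["Electronics"] * 20,
    "text": [f"item {i}" for i in range(100)],
})

# Draw the same number of rows from every class (10 here, 100 in the notebook)
balanced = toy.groupby("label").sample(n=10, random_state=0)

# Every class should now appear exactly 10 times
counts = balanced["label"].value_counts()
print(counts)
```

Because every class contributes the same number of rows, the result is balanced rather than proportional to the original label distribution.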

Here we prepare the data for `training_pipeline`: `train.csv`, `val.csv`, and `test.csv` are used for training, validation, and testing respectively, and `label_mapping.json` maps label indices to human-readable labels.

We will first save these files locally, create a pipeline root on GCS, and upload these files to the pipeline root.

In [62]:
# Split into train, val, and test sets (stratified by label) and build the index-to-label mapping for label_mapping.json
from sklearn.model_selection import train_test_split
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42, stratify=df.label)
train_df, val_df = train_test_split(train_df, test_size=0.2, random_state=42, stratify=train_df.label)
label_mapping = dict(enumerate(train_df.label.unique()))
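
Applying `test_size=0.2` twice gives roughly a 64/16/20 split. The sketch below reproduces the two-step split on a synthetic frame with the same shape as the sampled data (4 classes, 100 rows each). The frame itself is an assumption for illustration and contains no real texts.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 4 classes x 100 rows, mirroring the shape of the sampled dataset
labels = ["Books", "Clothing & Accessories", "Electronics", "Household"]
toy = pd.DataFrame({"label": [name for name in labels for _ in range(100)]})
toy["id"] = toy.index

# First hold out 20% for testing, then 20% of the remainder for validation
train_val, test = train_test_split(toy, test_size=0.2, random_state=42, stratify=toy.label)
train, val = train_test_split(train_val, test_size=0.2, random_state=42, stratify=train_val.label)

sizes = {"train": len(train), "val": len(val), "test": len(test)}
per_class_train = train.label.value_counts().to_dict()
print(sizes)
print(per_class_train)
```

With 400 rows this yields 256 training, 64 validation, and 80 test rows, and stratification keeps the four classes equally represented in each split.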


In [63]:
# Create the data directory and save the files locally
import json
import os

os.makedirs("data", exist_ok=True)  # to_csv fails if the directory does not exist
train_df.to_csv("data/train.csv", index=False)
val_df.to_csv("data/val.csv", index=False)
test_df.to_csv("data/test.csv", index=False)
with open("data/label_mapping.json", "w") as f:
    json.dump(label_mapping, f)
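
One detail worth keeping in mind when reading `label_mapping.json` back: JSON object keys are always strings, so the integer indices are stored as `"0"`, `"1"`, and so on. The sketch below shows the round trip with a hypothetical two-class mapping. How the downstream pipeline actually loads the file is not shown in this notebook.

```python
import json

# Hypothetical index-to-label mapping, shaped like the one built above
label_mapping = {0: "Books", 1: "Household"}

# Serializing and parsing turns the integer keys into strings
restored_raw = json.loads(json.dumps(label_mapping))

# Convert the keys back to integers before indexing with predicted class indices
restored = {int(k): v for k, v in restored_raw.items()}

print(restored_raw)
print(restored)
```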

In [65]:
# Build a timestamped pipeline root on GCS so each run gets its own folder
import datetime
timestamp = datetime.datetime.now().strftime("%Y%m%d%H%M%S")
pipeline_root = f"gs://public-projects/demo/ecommerce/{timestamp}"
pipeline_root

'gs://public-projects/demo/ecommerce/20230626050711'

In [67]:
import gcsfs

# Authenticate with the service account key file and upload the local data directory
fs = gcsfs.GCSFileSystem(project="independent-bay-388105", token="../service_acc_key.json")
fs.put("./data", f"{pipeline_root}/data", recursive=True)

[None, None, None, None]