# Creating the Retail Store JSONL File Dataset
This notebook uses the synthetically generated dataset from the Git repository below.

The generated dataset was saved as retail_stores_items.csv.
- [Original Retail Demo Store Git Repo](https://github.com/aws-samples/retail-demo-store)

# 0. Prerequisites
- This notebook uses the "Amazon Translate" service, so the Role that runs this notebook must have the "TranslateFullAccess" permission attached.
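
If full access is broader than you would like, a narrower policy may be enough. The sketch below builds a minimal IAM policy document, assuming the notebook only ever calls `translate_text` with an explicit source language. The variable names and the policy shape are illustrative and are not part of the original notebook.

```python
import json

# Assumption: the notebook only needs the TranslateText action.
# "TranslateFullAccess" (named in the prerequisite above) grants more than this.
minimal_translate_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["translate:TranslateText"],
            "Resource": "*",
        }
    ],
}

# Serialize the policy so it can be pasted into the IAM console or a template
policy_json = json.dumps(minimal_translate_policy, indent=2)
print(policy_json)
```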

# 1. Loading the Data

In [21]:
import boto3

translate_client = boto3.client(service_name='translate', region_name='us-east-1', use_ssl=True)



In [22]:
import pandas as pd

raw = pd.read_csv("retail_stores_items.csv")
print(raw.columns)
raw.head(3)


Index(['id', 'url', 'sk', 'name', 'category', 'style', 'description',
       'aliases', 'price', 'image', 'gender_affinity', 'current_stock',
       'featured'],
      dtype='object')


Unnamed: 0,id,url,sk,name,category,style,description,aliases,price,image,gender_affinity,current_stock,featured
0,e1669081-8ffc-4dec-97a6-e9176d7f6651,http://d32da96qlo1y4g.cloudfront.net/#/product...,,Sans Pareil Scarf,apparel,scarf,Sans pareil scarf for women,,124.99,http://d32da96qlo1y4g.cloudfront.net/images/ap...,F,12,
1,cfafd627-7d6b-43a5-be05-4c7937be417d,http://d32da96qlo1y4g.cloudfront.net/#/product...,,Chef Knife,housewares,kitchen,A must-have for your kitchen,,57.99,http://d32da96qlo1y4g.cloudfront.net/images/ho...,,9,
2,6e6ad102-7510-4a02-b8ce-5a0cd6f431d1,http://d32da96qlo1y4g.cloudfront.net/#/product...,,Gainsboro Jacket,apparel,jacket,This gainsboro jacket for women is perfect for...,,133.99,http://d32da96qlo1y4g.cloudfront.net/images/ap...,F,13,


# 2. Data Preprocessing

## Keep only the key columns and drop the rest

In [23]:
def preprocess_data(raw):
    def create_single_category(row):
        # Access the columns by label; positional access such as s[2]
        # is deprecated in recent pandas versions
        category = row['category']
        style = row['style']
        # Join category and style into a single label, e.g. "apparel|scarf"
        single_category = f"{category}|{style}"

        return single_category

    df = raw.copy()
    # Drop the columns that are not needed for the classification dataset
    df = df.drop(columns=['url','sk','aliases', 'price', 'image', 'gender_affinity', 'current_stock', 'featured'])

    df['single_category'] = df.apply(create_single_category, axis=1)
    df = df.drop(columns=['style','category'])
    # df = df.dropna(axis=0, how='any')
    # df = df.reset_index(drop=True)
    return df

df_en = preprocess_data(raw)
df_en.head(3)


Unnamed: 0,id,name,description,single_category
0,e1669081-8ffc-4dec-97a6-e9176d7f6651,Sans Pareil Scarf,Sans pareil scarf for women,apparel|scarf
1,cfafd627-7d6b-43a5-be05-4c7937be417d,Chef Knife,A must-have for your kitchen,housewares|kitchen
2,6e6ad102-7510-4a02-b8ce-5a0cd6f431d1,Gainsboro Jacket,This gainsboro jacket for women is perfect for...,apparel|jacket


## Translate description and single_category into Korean

In [24]:
%%time

# df = pd.read_csv("amazon_faq_en.csv")
# print(df.columns)

def translate_df(df, target_col, new_col, length):
    # Translate target_col from English to Korean and store the result,
    # truncated to `length` characters, in new_col

    def translate(row, length):
        text = row[target_col]
        result = translate_client.translate_text(Text=text, 
                SourceLanguageCode="en", TargetLanguageCode="ko")
        result = result['TranslatedText']
        result = result[0:length]
        return result
    df[new_col] = df.apply(translate, length=length, axis=1)

    return df

# token_length = 200 # 200 --> Error
token_length = 1200

# The translation calls are left commented out, so the rest of this notebook uses the English columns
# df_ko = translate_df(df_en, target_col='description', new_col='description_ko', length = token_length)
# df_ko = translate_df(df_en, target_col='single_category', new_col='single_category_ko', length = token_length)


CPU times: user 6 µs, sys: 1 µs, total: 7 µs
Wall time: 8.34 µs
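
Because `single_category` has only a few dozen distinct values, translating row by row would send the same string to the service many times. Below is a minimal sketch of translating each distinct string once. `translate_unique` and `fake_translate` are hypothetical helpers introduced for illustration; the stand-in translator lets the sketch run without AWS credentials.

```python
from typing import Callable, Dict, List

def translate_unique(texts: List[str],
                     translate_fn: Callable[[str], str],
                     length: int = 1200) -> List[str]:
    """Translate each distinct string once and reuse the cached result."""
    cache: Dict[str, str] = {}
    for text in texts:
        if text not in cache:
            # Truncate to `length` characters, as translate_df does
            cache[text] = translate_fn(text)[:length]
    return [cache[text] for text in texts]

# Stand-in translator (an assumption, not the real service) that records its calls
calls = []
def fake_translate(text: str) -> str:
    calls.append(text)
    return f"<ko>{text}"

categories = ["apparel|scarf", "housewares|kitchen", "apparel|scarf"]
translated = translate_unique(categories, fake_translate)
print(translated)
print(len(calls))  # number of translator calls actually made
```

With the real client, `translate_fn` could wrap the same `translate_client.translate_text(...)["TranslatedText"]` call that `translate_df` uses.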


In [25]:
df_en

Unnamed: 0,id,name,description,single_category
0,e1669081-8ffc-4dec-97a6-e9176d7f6651,Sans Pareil Scarf,Sans pareil scarf for women,apparel|scarf
1,cfafd627-7d6b-43a5-be05-4c7937be417d,Chef Knife,A must-have for your kitchen,housewares|kitchen
2,6e6ad102-7510-4a02-b8ce-5a0cd6f431d1,Gainsboro Jacket,This gainsboro jacket for women is perfect for...,apparel|jacket
3,49b89871-5fe7-4898-b99d-953e15fb42b2,High Definition Speakers,High definition speakers to fill the house wit...,electronics|speaker
4,5cb18925-3a3c-4867-8f1c-46efd7eba067,Spiffy Sandals,This spiffy pair of sandals for woman is perfe...,footwear|sandals
...,...,...,...,...
2460,36cfd856-dd30-46a9-8654-1f1de77e674a,Easter Wreath,Easter wreath grown sustainably on our organic...,floral|wreath
2461,1ea9439f-dff5-41cf-aac3-718a6b4e7af6,White Sneakers,An all-around voguish pair of white sneakers,footwear|sneaker
2462,ccdf737c-c4fd-4c78-abd2-d5ef0428ef20,Wine Glass,Ideal for every kitchen,housewares|kitchen
2463,12f93a36-e282-4445-92ae-356eb6a560fd,Roses Arrangement,Roses arrangement grown sustainably on our org...,floral|arrangement


## Keep only the single_category values that have more than size_limit (e.g., 30) rows

In [26]:
pd.set_option('display.max_rows', 100)

def filter_dataset(df, size_limit=10):
    '''
    Keep only the rows whose category has more than size_limit rows.
    '''
    print("Original shape: ", df.shape)    
    # Count the rows in each category
    stat = df.groupby('single_category').count()
    # print("stat columns: ", stat.columns)    
    # print("single_category: ", stat.index)    

    # Keep the categories whose row count exceeds size_limit
    stat = stat[stat.id > size_limit]

    new_df = df[df.single_category.isin(stat.index)]
    print("new_df shape: ", new_df.shape)    

    return new_df

size_limit = 30
filter_df = filter_dataset(df_en, size_limit = size_limit)
filter_df.head()

Original shape:  (2465, 4)
new_df shape:  (1871, 4)


Unnamed: 0,id,name,description,single_category
0,e1669081-8ffc-4dec-97a6-e9176d7f6651,Sans Pareil Scarf,Sans pareil scarf for women,apparel|scarf
1,cfafd627-7d6b-43a5-be05-4c7937be417d,Chef Knife,A must-have for your kitchen,housewares|kitchen
2,6e6ad102-7510-4a02-b8ce-5a0cd6f431d1,Gainsboro Jacket,This gainsboro jacket for women is perfect for...,apparel|jacket
4,5cb18925-3a3c-4867-8f1c-46efd7eba067,Spiffy Sandals,This spiffy pair of sandals for woman is perfe...,footwear|sandals
5,91cc9fa1-d8e9-46ae-9c8d-86264de2c6cc,Stylish Ceramic Bowl,This stylish ceramic bowl is a must-have,housewares|bowls
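
Note that `filter_dataset` compares with a strict `>`, so a category with exactly `size_limit` rows is dropped. The toy sketch below, using made-up labels `a`, `b`, and `c` rather than the real categories, shows how the strict and inclusive comparisons differ.

```python
from collections import Counter

# Hypothetical labels: "a" has 31 rows, "b" exactly 30, "c" only 5
labels = ["a"] * 31 + ["b"] * 30 + ["c"] * 5
size_limit = 30

counts = Counter(labels)

# Strict comparison, as in filter_dataset: "b" is excluded
kept_strict = sorted(c for c, n in counts.items() if n > size_limit)

# Inclusive comparison: "b" would be kept
kept_inclusive = sorted(c for c, n in counts.items() if n >= size_limit)

print(kept_strict)
print(kept_inclusive)
```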


# 3. Stratified Split into Train and Validation Datasets

In [27]:
from sklearn.model_selection import train_test_split

import numpy as np

def split_stratify_dataset(raw, test_size=0.2):
    X, y = raw.description, raw.single_category
    # Stratified split that preserves the class ratio of y
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, stratify=y, random_state=42)

    # Check the class distribution of each split
    unique_train, counts_train = np.unique(y_train, return_counts=True)
    unique_test, counts_test = np.unique(y_test, return_counts=True)

    print("Train set class distribution:")
    print(dict(zip(unique_train, counts_train)))

    print("\nTest set class distribution:")
    print(dict(zip(unique_test, counts_test)))
    
    train_df = pd.DataFrame({'prompt': X_train, 'completion': y_train}).reset_index(drop=True)
    test_df = pd.DataFrame({'prompt': X_test, 'completion': y_test}).reset_index(drop=True)    
    
    return train_df, test_df

train_df, val_df = split_stratify_dataset(filter_df, test_size=0.1)

Train set class distribution:
{'accessories|backpack': 43, 'accessories|belt': 34, 'accessories|glasses': 53, 'accessories|handbag': 34, 'apparel|jacket': 73, 'apparel|scarf': 48, 'apparel|shirt': 60, 'beauty|grooming': 37, 'electronics|camera': 32, 'floral|arrangement': 32, 'floral|plant': 38, 'footwear|formal': 71, 'footwear|sandals': 28, 'footwear|sneaker': 43, 'furniture|chairs': 65, 'furniture|dressers': 29, 'furniture|sofas': 44, 'furniture|tables': 83, 'groceries|bakery': 49, 'groceries|meat': 28, 'groceries|vegetables': 37, 'homedecor|cushion': 41, 'homedecor|decorative': 58, 'homedecor|lighting': 48, 'housewares|bowls': 49, 'housewares|kitchen': 144, 'instruments|keys': 30, 'instruments|percussion': 36, 'instruments|strings': 55, 'jewelry|bracelet': 33, 'outdoors|fishing': 33, 'seasonal|christmas': 83, 'seasonal|easter': 38, 'seasonal|halloween': 41, 'seasonal|valentine': 33}

Test set class distribution:
{'accessories|backpack': 5, 'accessories|belt': 4, 'accessories|glasses'
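
To make the idea of a stratified split concrete, here is a small standard-library sketch that samples the same fraction from every class. It is not scikit-learn's implementation, and `stratified_split` is a hypothetical helper; it only illustrates why the per-class ratios in the two splits stay close to each other.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.1, seed=42):
    """Return (train_idx, test_idx), taking about test_size of each class for test."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # At least one test sample per class (a simplification in this sketch)
        n_test = max(1, round(len(indices) * test_size))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Toy data: two categories with 40 and 20 rows
labels = ["furniture|chairs"] * 40 + ["seasonal|easter"] * 20
train_idx, test_idx = stratified_split(labels, test_size=0.1)

train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print(train_counts)
print(test_counts)
```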

In [28]:
train_df.head(5)

Unnamed: 0,prompt,completion
0,Relax in this nice sienna armchair,furniture|chairs
1,This video camera is perfect for capturing tho...,electronics|camera
2,You get a whole bunch of stuff and it is super...,beauty|grooming
3,A must-have for April,seasonal|easter
4,Unsurpassed for every kitchen,housewares|kitchen


In [29]:
val_df.head(50)

Unnamed: 0,prompt,completion
0,Outstanding white worktable for your office,furniture|tables
1,A favorite for early April,seasonal|easter
2,Ideal white chair for your office,furniture|chairs
3,This acoustic drum will delight the most deman...,instruments|percussion
4,This keyboard will delight the most demanding ...,instruments|keys
5,An all-around modish pair of powder blue sneakers,footwear|sneaker
6,This jar candle is a must-have for the lightin...,homedecor|lighting
7,This ceramic vase will delight everyone,homedecor|decorative
8,This unparalleled dresser has lots of drawers ...,furniture|dressers
9,Sassy belt for women,accessories|belt


# 4. Creating the JSONL Files

In [30]:

def create_json_file(df, jsonl_file_path):

    # Iterate over DataFrame rows and write each row as a JSON object to the file
    with open(jsonl_file_path, 'w') as jsonl_file:
        for index, row in df.iterrows():
            json_row = row.to_json()
            # print("json_row: ", json_row)
            jsonl_file.write(json_row)
            jsonl_file.write('\n')
    print(f"{jsonl_file_path} is created")




In [31]:
# Specify the path where you want to save the JSONL file
jsonl_file_path = 'train-retail-data-en.jsonl'
create_json_file(train_df, jsonl_file_path)


jsonl_file_path = 'val-retail-data-en.jsonl'
create_json_file(val_df, jsonl_file_path)


train-retail-data-en.jsonl is created
val-retail-data-en.jsonl is created
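
Before using the files for training, it may be worth reading them back to confirm that every line parses as JSON and carries both keys. The sketch below does that on a temporary demo file, using two rows from the `train_df` output above. `validate_jsonl` is a hypothetical helper, not part of the original notebook.

```python
import json
import os
import tempfile

def validate_jsonl(path, required_keys=("prompt", "completion")):
    """Parse every line of a JSONL file and return the number of valid records."""
    n_records = 0
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            record = json.loads(line)  # raises an error on malformed JSON
            missing = [k for k in required_keys if k not in record]
            if missing:
                raise ValueError(f"line {line_no} is missing keys: {missing}")
            n_records += 1
    return n_records

# Self-contained demo: write two records to a temporary file, then validate it
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.jsonl")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write(json.dumps({"prompt": "Relax in this nice sienna armchair",
                            "completion": "furniture|chairs"}) + "\n")
        f.write(json.dumps({"prompt": "A must-have for April",
                            "completion": "seasonal|easter"}) + "\n")
    n_valid = validate_jsonl(demo_path)

print(n_valid)
```

In the notebook itself, the same check could be pointed at `train-retail-data-en.jsonl` and `val-retail-data-en.jsonl`.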
