# [Module 4] Upload the Interactions, Items, and Users datasets to S3

This workshop uses conda_python3 as the default kernel.

## 0. Environment setup

#### Library Import 

Python has an extensive ecosystem of libraries. For this lab we import the core data science tools: boto3 (the AWS SDK for Python) and libraries such as pandas and NumPy.

In [1]:
import boto3
import json
import numpy as np
import pandas as pd
import time
from datetime import datetime

import matplotlib.pyplot as plt

Load the variables stored by the previous notebooks

In [2]:
%store -r
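
Because `%store -r` restores whatever happens to be stored, a DataFrame that was never saved only surfaces later as a `NameError`. The following is a minimal sketch of an early check; the helper name `missing_variables` and the `REQUIRED` list are assumptions based on the three DataFrames this notebook reads, not part of the workshop code.

```python
def missing_variables(namespace, required):
    """Return the names in `required` that are not defined in `namespace`."""
    return [name for name in required if name not in namespace]

# Assumed list: the three DataFrames this notebook uses after `%store -r`.
REQUIRED = ["interactions_df", "items_df", "users_df"]

# In the notebook, pass globals(); a plain dict is used here for illustration.
example_namespace = {"interactions_df": object(), "items_df": object()}
print(missing_variables(example_namespace, REQUIRED))
```

If the returned list is not empty, rerunning the earlier modules before continuing is likely the simplest fix.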

## 1. Inspect the datasets

In [3]:
interactions_df

Unnamed: 0,ITEM_ID,USER_ID,EVENT_TYPE,TIMESTAMP
0,26bb732f-9159-432f-91ef-bad14fedd298,3156,View,1591803788
1,26bb732f-9159-432f-91ef-bad14fedd298,3156,View,1591803788
2,dc073623-4b95-47d9-93cb-0171c20baa04,332,View,1591803812
3,dc073623-4b95-47d9-93cb-0171c20baa04,332,View,1591803812
4,31efcfea-47d6-43f3-97f7-2704a5397e22,3981,View,1591803830
...,...,...,...,...
674996,94a0ad41-8b19-4ecb-b0d7-33704e2d4421,4046,View,1598204625
674997,f9c470b0-152b-4776-893a-67ffc4064675,2627,View,1598204657
674998,1def0093-96b2-4cc4-a022-071941f75b92,3538,View,1598204664
674999,9bc87696-e9bd-4241-86b0-234e054a607b,5165,View,1598204678


In [4]:
items_df

Unnamed: 0,ITEM_ID,CVR,NAME,CATEGORY_L1,STYLE,PRODUCT_DESCRIPTION,PRICE
0,00096972-5f6b-44df-917b-f7d21ae5644c,0.00041,Pink Shirt,apparel,shirt,Swanky dress for women,225.99
1,0016fde3-0910-4cc1-8ef6-90e15f271073,0.00014,Farmed Salmon For Sushi,groceries,seafood,"Flavorful farmed salmon, always sourced sustai...",24.99
2,00225258-dbfb-4103-a573-007386571a49,0.00068,Easter Decorative Egg,seasonal,easter,A must-have for April,16.99
3,003e4953-d6cb-400c-90f6-9b0216b4603e,0.00036,Drought-Resistant Indoor Plant,floral,plant,Drought-resistant indoor plant grown sustainab...,76.99
4,004112e9-dca1-4402-ae6d-74e2b80b8c05,0.00035,Mango Coat,apparel,jacket,Mango coat for men about town,184.99
...,...,...,...,...,...,...,...
2444,ff973006-27da-45dd-899c-8441c5eaebe0,0.00059,Groovy Glasses,accessories,glasses,These groovy glasses for men are unparalleled,140.99
2445,ff9c5ec9-69d0-4338-b4b2-96d48b2e91aa,0.00031,Black Shirt,apparel,shirt,Black casual shirt for men,226.99
2446,ffbf120a-0b8e-41dd-bbe9-5b2a87b0c8c5,0.00035,Bread,groceries,bakery,Bread made fresh daily in our kitchens,7.99
2447,ffcc4cc8-a094-49ea-b9f2-8bf056261868,0.00017,Indoor Plant,floral,plant,Indoor plant delivered fresh and vibrant from ...,129.99


In [5]:
users_df

Unnamed: 0,USER_ID,USER_NAME,AGE,GENDER
0,1,user1,31,M
1,2,user2,58,F
2,3,user3,43,M
3,4,user4,38,M
4,5,user5,24,M
...,...,...,...,...
5245,5246,user5246,37,M
5246,5247,user5247,46,M
5247,5248,user5248,50,M
5248,5249,user5249,33,M


## 2. Split the data

The interactions data contains the USER_ID, ITEM_ID, EVENT_TYPE, and TIMESTAMP columns.

- After ordering the interactions by time, the earliest 90% are used as training data and the most recent 10% as validation data.

#### Split the dataset into train and validation (holdout) data

For every user, the last 10% of interactions (by TIMESTAMP) are set aside as validation (holdout) data.

In [6]:
pd.options.display.max_rows = 5
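def split_holdout(data, pct):
    """Split interactions into train and holdout, keeping each user's latest events as holdout."""
    df = data.copy()
    # Percentile rank of TIMESTAMP within each USER_ID; method='first' breaks ties by row order
    ranks = df.groupby('USER_ID').TIMESTAMP.rank(pct=True, method='first')
    # Rows ranked above pct (the most recent ones per user) are flagged as holdout
    df = df.join((ranks > pct).to_frame('holdout'))

    holdout = df[df['holdout']].drop('holdout', axis=1)
    train = df[~df['holdout']].drop('holdout', axis=1)

    return train, holdout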

df_warm_train, df_warm_holdout = split_holdout(interactions_df, pct=0.9)
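
To see how the per-user percentile rank drives the split, here is a small sketch on hypothetical toy data (the user IDs and timestamps are made up for illustration). It applies the same `rank(pct=True, method='first')` call and the same `> 0.9` threshold as `split_holdout`.

```python
import pandas as pd

# Toy interactions (hypothetical): user "a" has 15 events, "b" has 4, "c" has 1.
toy = pd.DataFrame({
    "USER_ID": ["a"] * 15 + ["b"] * 4 + ["c"],
    "TIMESTAMP": list(range(1, 16)) + list(range(1, 5)) + [1],
})

# Percentile rank of each event within its user; the oldest event gets the smallest rank.
toy["rank_pct"] = toy.groupby("USER_ID")["TIMESTAMP"].rank(pct=True, method="first")

# Events ranked above 0.9 go to holdout, the rest to train.
toy["holdout"] = toy["rank_pct"] > 0.9

# Number of holdout events per user.
holdout_counts = toy.groupby("USER_ID")["holdout"].sum()
print(holdout_counts)
```

User "a" contributes its two latest events, while "b" and "c" contribute one each. A user's latest event always has rank 1.0, so every user places at least one row in holdout, and a user with a single event ends up entirely in holdout. This is one reason the holdout share can come out slightly above 10%.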

The cells below summarize the train and holdout data after the split. The holdout set holds roughly 10% of all rows. Within each user, the holdout timestamps are later than the training timestamps (a larger number means a later time).
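
To read the timestamps as dates, they can be converted with pandas. This sketch assumes the TIMESTAMP values are Unix epoch seconds, which is consistent with their magnitude but is not stated in the notebook; the two values are taken from the interactions table shown earlier.

```python
import pandas as pd

# First and last TIMESTAMP values from the interactions table shown earlier.
timestamps = pd.Series([1591803788, 1598204678], name="TIMESTAMP")

# unit="s" interprets the integers as seconds since 1970-01-01 UTC (assumption).
as_datetime = pd.to_datetime(timestamps, unit="s")
print(as_datetime)
```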

In [7]:
df_warm_train.info()
df_warm_train.nunique()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 526581 entries, 0 to 664340
Data columns (total 4 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   ITEM_ID     526581 non-null  object
 1   USER_ID     526581 non-null  object
 2   EVENT_TYPE  526581 non-null  object
 3   TIMESTAMP   526581 non-null  int64 
dtypes: int64(1), object(3)
memory usage: 20.1+ MB


ITEM_ID         2449
USER_ID         5250
EVENT_TYPE         2
TIMESTAMP     282956
dtype: int64

In [8]:
df_warm_holdout.info()
df_warm_holdout.nunique()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 61138 entries, 432222 to 675003
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   ITEM_ID     61138 non-null  object
 1   USER_ID     61138 non-null  object
 2   EVENT_TYPE  61138 non-null  object
 3   TIMESTAMP   61138 non-null  int64 
dtypes: int64(1), object(3)
memory usage: 2.3+ MB


ITEM_ID        2446
USER_ID        5250
EVENT_TYPE        2
TIMESTAMP     34393
dtype: int64
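
The two summaries above can be cross-checked programmatically. The following is a minimal sketch, not part of the workshop code: the helper name `check_temporal_split` is hypothetical, and the toy frames stand in for `df_warm_train` and `df_warm_holdout`.

```python
import pandas as pd

def check_temporal_split(train, holdout):
    """Return (holdout_fraction, n_users_violating_order) for a per-user temporal split."""
    last_train = train.groupby("USER_ID")["TIMESTAMP"].max()
    first_holdout = holdout.groupby("USER_ID")["TIMESTAMP"].min()
    # Align on users present in both sets; users found in only one set are skipped.
    both = pd.concat(
        [last_train, first_holdout], axis=1, keys=["last_train", "first_holdout"]
    ).dropna()
    # Equal timestamps are allowed because ties can straddle the split boundary.
    violations = int((both["last_train"] > both["first_holdout"]).sum())
    fraction = len(holdout) / (len(train) + len(holdout))
    return fraction, violations

# Hypothetical toy data standing in for df_warm_train / df_warm_holdout.
toy_train = pd.DataFrame({"USER_ID": [1, 1, 1, 2, 2], "TIMESTAMP": [10, 20, 30, 5, 6]})
toy_holdout = pd.DataFrame({"USER_ID": [1, 2], "TIMESTAMP": [40, 7]})
print(check_temporal_split(toy_train, toy_holdout))
```

With the row counts reported above, the holdout fraction is 61138 / (526581 + 61138), which is about 0.104.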

In [9]:
df_warm_train.sort_values(['USER_ID','TIMESTAMP'])

Unnamed: 0,ITEM_ID,USER_ID,EVENT_TYPE,TIMESTAMP
21464,35efa417-357d-465e-99cb-b208bbc63f8b,1,View,1592007327
21465,35efa417-357d-465e-99cb-b208bbc63f8b,1,View,1592007327
...,...,...,...,...
585017,072ded32-2903-4f35-9f28-d6284c5f5605,5250,View,1597351374
585018,072ded32-2903-4f35-9f28-d6284c5f5605,5250,View,1597351374


In [10]:
df_warm_holdout.sort_values(['USER_ID','TIMESTAMP'])

Unnamed: 0,ITEM_ID,USER_ID,EVENT_TYPE,TIMESTAMP
612212,be14695b-f8cb-46b8-aecd-ef28f0218514,1,View,1597609245
621508,079ab14b-3435-4a95-ba1d-fc0b21e0cf4b,1,View,1597697421
...,...,...,...,...
609116,e66109bf-9ad5-430a-90e5-900c00119f39,5250,View,1597579890
662349,072ded32-2903-4f35-9f28-d6284c5f5605,5250,View,1598084705


## 3. Save the train, item, user, validation (holdout), and warm datasets locally as CSV

Once the split is done, save the data as new CSV files and then upload them to S3.

In [11]:
import os
os.makedirs('dataset', exist_ok=True)

Save the train, item, user, and validation (holdout) data as local CSV files.

In [12]:
# Save the train, item, user, and validation data locally
warm_train_interaction_filename="dataset/training_interaction.csv"
items_filename="dataset/training_item.csv"
users_filename="dataset/training_user.csv"
validation_interaction_filename="dataset/validation_interaction.csv"

df_warm_train.to_csv(warm_train_interaction_filename,index=False)
items_df.to_csv(items_filename,index=False)
users_df.to_csv(users_filename,index=False)
df_warm_holdout.to_csv(validation_interaction_filename,index=False)    
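
The cell above passes `index=False` to every `to_csv` call. Here is a small sketch of what that argument changes, using in-memory buffers instead of files; the toy DataFrame is illustrative only.

```python
import io
import pandas as pd

df = pd.DataFrame({"USER_ID": [1, 2], "AGE": [31, 58]})

# index=True (the pandas default) writes the row index as an unnamed first column.
with_index = io.StringIO()
df.to_csv(with_index, index=True)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False writes only the data columns.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```

Reading the first buffer back yields an extra `Unnamed: 0` column, whereas the second round-trips to exactly the original columns.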

In [13]:
# Save the full warm interactions dataset locally
warm_interation_filename="dataset/warm_interaction.csv"
interactions_df.to_csv(warm_interation_filename,index=False)


## 4. Upload the local CSV files to S3

In [14]:
import sagemaker
#bucket='<YOUR BUCKET NAME>' # replace with the name of your S3 bucket
bucket = sagemaker.Session().default_bucket() 

In [15]:
# Upload the training files, using each local path as the S3 object key.
# upload_file returns None, so there is no response to capture.
s3_bucket = boto3.Session().resource('s3').Bucket(bucket)
for filename in (warm_train_interaction_filename, users_filename, items_filename):
    s3_bucket.Object(filename).upload_file(filename)

s3_warm_train_interaction_filename = "s3://{}/{}".format(bucket, warm_train_interaction_filename)
s3_items_filename = "s3://{}/{}".format(bucket, items_filename)
s3_users_filename = "s3://{}/{}".format(bucket, users_filename)

print("s3_warm_train_interaction_filename: \n", s3_warm_train_interaction_filename)
print("s3_items_filename: \n", s3_items_filename)
print("s3_users_filename: \n", s3_users_filename)

s3_warm_train_interaction_filename: 
 s3://sagemaker-us-east-1-376278017302/dataset/training_interaction.csv
s3_items_filename: 
 s3://sagemaker-us-east-1-376278017302/dataset/training_item.csv
s3_users_filename: 
 s3://sagemaker-us-east-1-376278017302/dataset/training_user.csv
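
The upload and URI-building steps above can also be wrapped in one reusable helper. This is a sketch under assumptions: the function name `upload_files` is hypothetical, and the client is passed in as an argument so the helper can be exercised without AWS credentials. It relies on the boto3 S3 client method `upload_file(Filename, Bucket, Key)`.

```python
def upload_files(s3_client, bucket_name, filenames):
    """Upload each local file to `bucket_name`, using the local path as the S3 key.

    `s3_client` is expected to behave like boto3.client("s3").
    Returns the resulting s3:// URIs in the same order as `filenames`.
    """
    uris = []
    for filename in filenames:
        # The boto3 client's upload_file takes (Filename, Bucket, Key).
        s3_client.upload_file(filename, bucket_name, filename)
        uris.append("s3://{}/{}".format(bucket_name, filename))
    return uris

# Usage in the notebook (needs AWS credentials, so it is left commented out):
# import boto3
# uris = upload_files(
#     boto3.client("s3"),
#     bucket,
#     [warm_train_interaction_filename, items_filename, users_filename],
# )
```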


In [16]:
! aws s3 ls {s3_warm_train_interaction_filename} --recursive
! aws s3 ls {s3_items_filename} --recursive
! aws s3 ls {s3_users_filename} --recursive

2023-02-28 11:52:53   30451496 dataset/training_interaction.csv
2023-02-28 11:52:54     315329 dataset/training_item.csv
2023-02-28 11:52:54      97565 dataset/training_user.csv
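
The listing above shows each object's size in bytes, which can be compared with the local file size. Below is a minimal sketch of doing that comparison in Python; the helper name `upload_matches_local` is hypothetical. It uses the boto3 S3 client method `head_object`, whose response includes `ContentLength`.

```python
import os

def upload_matches_local(s3_client, bucket_name, filename):
    """Return True if the S3 object's size equals the local file's size.

    `s3_client` is expected to behave like boto3.client("s3"); the object key
    is assumed to be the same as the local path, as in the upload cell.
    """
    response = s3_client.head_object(Bucket=bucket_name, Key=filename)
    return response["ContentLength"] == os.path.getsize(filename)

# Usage in the notebook (needs AWS credentials, so it is left commented out):
# import boto3
# upload_matches_local(boto3.client("s3"), bucket, items_filename)
```

A size match is only a quick sanity check; it does not prove the contents are identical.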


## 5. Store variables

Store the variables that the next notebook will use.

In [17]:
%store bucket
%store s3_warm_train_interaction_filename
%store s3_users_filename
%store s3_items_filename
%store warm_train_interaction_filename
%store items_filename
%store users_filename
%store validation_interaction_filename

%store warm_interation_filename

Stored 'bucket' (str)
Stored 's3_warm_train_interaction_filename' (str)
Stored 's3_users_filename' (str)
Stored 's3_items_filename' (str)
Stored 'warm_train_interaction_filename' (str)
Stored 'items_filename' (str)
Stored 'users_filename' (str)
Stored 'validation_interaction_filename' (str)
Stored 'warm_interation_filename' (str)
