# Project Title: Delivery Delay Prediction Modeling with SynDelay Data

## 1. Dataset Overview
This project uses the **SynDelay** dataset to predict delivery delays within a supply chain.

* **Source:** Xu, L., Long, Y., & Brintrup, A. (2025). SynDelay: A Synthetic Dataset for Delivery Delay Prediction.
* **Nature of the data:** A synthetic dataset generated from real supply chain data.

In [233]:
import os

HOME = os.getcwd()
HOME


'c:\\Users\\ysj58\\github\\DataScience\\E-commerce\\JYS'

In [234]:
import pandas as pd
import numpy as np

# Load the raw training data from the data/ folder under HOME
Train = pd.read_csv(os.path.join(HOME, 'data', 'Train.csv'))

In [240]:
Train.head()

|   | payment_type | profit_per_order | sales_per_customer | category_id | category_name | customer_city | customer_country | customer_id | customer_segment | customer_state | ... | order_region | order_state | order_status | product_card_id | product_category_id | product_name | product_price | shipping_date | shipping_mode | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | PAYMENT | -32.924488 | 278.95 | 38 | Kids' Golf Clubs | Caguas | Puerto Rico | 12446.5625 | Corporate | PR | ... | Caribbean | Martinique | PENDING_PAYMENT | 858 | 38 | GolfBuddy VT3 GPS Watch | 129.99 | 42177.5 | Second Class | 2 |
| 1 | DEBIT | 107.8745 | 263.98 | 17 | Cleats | Caguas | Puerto Rico | 7782.017 | Corporate | PR | ... | East Africa | Copperbelt | COMPLETE | 365 | 17 | Perfect Fitness Perfect Rip Deck | 59.99 | 42502.39 | Same Day | 1 |
| 2 | PAYMENT | 35.770718 | 109.65013 | 17 | Cleats | Caguas | Puerto Rico | 7378.1113 | Consumer | PR | ... | West Asia | Ankara | PENDING_PAYMENT | 365 | 17 | Perfect Fitness Perfect Rip Deck | 59.99 | 42951.266 | Standard Class | 0 |
| 3 | PAYMENT | 43.58756 | 113.09 | 18 | Men's Footwear | Caguas | Puerto Rico | 1448.6765 | Consumer | PR | ... | Central America | Francisco Morazan | PENDING_PAYMENT | 403 | 18 | Nike Men's CJ Elite 2 TD Football Cleat | 129.99 | 42181.9 | Second Class | 2 |
| 4 | PAYMENT | 49.804802 | 191.9809 | 9 | Cardio Equipment | Madison | EE. UU. | 5123.5254 | Corporate | WI | ... | Central America | Leon | PENDING_PAYMENT | 191 | 9 | Nike Men's Free 5.0+ Running Shoe | 99.99 | 42632.82 | Standard Class | 1 |


### Splitting Off Validation Data

In [236]:
from sklearn.model_selection import train_test_split

# 1. Use the stratify option to preserve the class ratio of the target (`label`) in an 80/20 split.
# train_test_split shuffles the rows internally by default, so a separate sample() call is not needed.
part1, part2 = train_test_split(Train,
                                test_size=0.2,
                                random_state=42,
                                stratify=Train['label'])

# 2. Save each part to its own file
part1.to_csv('data/train_df.csv', index=False)
part2.to_csv('data/test_df.csv', index=False)

print("Data split and save complete!")
print(f"Train: {part1.shape}, Test: {part2.shape}")

Data split and save complete!
Train: (124390, 41), Test: (31098, 41)
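
One way to confirm that the stratified split kept the class ratios is to compare the `label` proportions in the full data and in each part. The sketch below is illustrative only: it uses a small made-up frame (`toy`, with a hypothetical `feature` column and invented class probabilities) instead of the real `Train` data, so the numbers it prints say nothing about SynDelay itself.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for Train: 1,000 rows with an imbalanced 3-class `label`.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "feature": rng.normal(size=1000),
    "label": rng.choice([0, 1, 2], size=1000, p=[0.2, 0.5, 0.3]),
})

# Same call pattern as the notebook cell: 80/20 split, stratified on `label`.
toy_train, toy_test = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy["label"]
)

# Class proportions in the full data and in each part, side by side.
ratios = pd.DataFrame({
    "full": toy["label"].value_counts(normalize=True),
    "train": toy_train["label"].value_counts(normalize=True),
    "test": toy_test["label"].value_counts(normalize=True),
}).sort_index()
print(ratios)

# Largest absolute difference between a part's class share and the full data's.
max_gap = ratios[["train", "test"]].sub(ratios["full"], axis=0).abs().to_numpy().max()
print(f"largest ratio gap: {max_gap:.4f}")
```

Running the same comparison on `Train`, `part1`, and `part2` should show near-identical proportions in all three columns. A visible gap would suggest the split was not stratified on the intended column.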


In [237]:
train_df = pd.read_csv('data/train_df.csv')
test_df = pd.read_csv('data/test_df.csv')
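
The save step passes `index=False`, which matters once the files are read back: after shuffling, each part carries the original row numbers as its index, and writing that index would add an extra unnamed column to the CSV. A minimal sketch of the difference, using a tiny made-up frame (`df`) and an in-memory buffer rather than the project files:

```python
import io

import pandas as pd

# Hypothetical frame with a non-sequential index, as a shuffled split would have.
df = pd.DataFrame({"label": [0, 1, 2]}, index=[10, 11, 12])

# Default to_csv() writes the index as a first column with an empty header,
# which read_csv then names "Unnamed: 0".
with_index = pd.read_csv(io.StringIO(df.to_csv()))

# index=False writes only the real columns, and read_csv assigns a fresh RangeIndex.
without_index = pd.read_csv(io.StringIO(df.to_csv(index=False)))

print(with_index.columns.tolist())     # ['Unnamed: 0', 'label']
print(without_index.columns.tolist())  # ['label']
```

This fits the `RangeIndex` starting at 0 and the 41-column count reported by `info()` for the reloaded frames.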

In [238]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 124390 entries, 0 to 124389
Data columns (total 41 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   payment_type              124390 non-null  object 
 1   profit_per_order          124390 non-null  float64
 2   sales_per_customer        124390 non-null  float64
 3   category_id               124390 non-null  int64  
 4   category_name             124390 non-null  object 
 5   customer_city             124390 non-null  object 
 6   customer_country          124390 non-null  object 
 7   customer_id               124390 non-null  float64
 8   customer_segment          124390 non-null  object 
 9   customer_state            124390 non-null  object 
 10  customer_zipcode          124390 non-null  float64
 11  department_id             124390 non-null  int64  
 12  department_name           124390 non-null  object 
 13  latitude                  124390 non-null  f

In [239]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31098 entries, 0 to 31097
Data columns (total 41 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   payment_type              31098 non-null  object 
 1   profit_per_order          31098 non-null  float64
 2   sales_per_customer        31098 non-null  float64
 3   category_id               31098 non-null  int64  
 4   category_name             31098 non-null  object 
 5   customer_city             31098 non-null  object 
 6   customer_country          31098 non-null  object 
 7   customer_id               31098 non-null  float64
 8   customer_segment          31098 non-null  object 
 9   customer_state            31098 non-null  object 
 10  customer_zipcode          31098 non-null  float64
 11  department_id             31098 non-null  int64  
 12  department_name           31098 non-null  object 
 13  latitude                  31098 non-null  float64
 14  longit