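### Predict sales totals from the mart sales data.
- Provided data: mart_train.csv (training data), mart_test.csv (evaluation data)
- Column to predict: total (total sales amount)
Using the training data (mart_train.csv), build a model that predicts the total sales amount, apply it to the evaluation data (mart_test.csv), and save the resulting predictions as a CSV file in the following format.
- The submission file must contain exactly one column:
- pred: predicted total sales amount
- Submission file name: 'result.csv'
- The submitted model is scored with the RMSE (Root Mean Squared Error) metric.
- Submission CSV file name and format: result.csv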

~~~
pred
10000
20000
30000
40000
...
~~~

### Submission notes
- Verify the submission file with pd.read_csv('result.csv')

# 1. Problem definition
- Metric: RMSE
- target: total
- Prediction file name: result.csv
- One column (pred)

# 2. Loading libraries and data

In [1]:
# Load the data
import pandas as pd
train = pd.read_csv("mart_train.csv")
test = pd.read_csv("mart_test.csv")

# 3. Exploratory data analysis (EDA)

In [2]:
# Check the data sizes
train.shape, test.shape

((700, 10), (300, 9))

In [3]:
# Inspect train samples
train.head()

Unnamed: 0,branch,city,customer_type,gender,product_line,total,payment_method,rating,time_of_day,day_name
0,A,Yangon,Member,Female,Health and beauty,823457.25,Ewallet,9.1,afternoon,Saturday
1,C,Naypyitaw,Normal,Female,Electronic accessories,120330.0,Cash,9.6,morning,Friday
2,A,Yangon,Normal,Male,Home and lifestyle,510788.25,Credit card,7.4,afternoon,Sunday
3,A,Yangon,Member,Male,Health and beauty,733572.0,Ewallet,8.4,evening,Sunday
4,A,Yangon,Normal,Male,Sports and travel,951567.75,Ewallet,5.3,morning,Friday


In [4]:
# Inspect test samples
test.head()

Unnamed: 0,branch,city,customer_type,gender,product_line,payment_method,rating,time_of_day,day_name
0,C,Naypyitaw,Normal,Female,Fashion accessories,Ewallet,9.6,afternoon,Thursday
1,B,Mandalay,Normal,Male,Food and beverages,Credit card,4.3,evening,Wednesday
2,B,Mandalay,Member,Female,Fashion accessories,Credit card,5.0,evening,Wednesday
3,B,Mandalay,Member,Male,Health and beauty,Cash,9.2,morning,Sunday
4,B,Mandalay,Member,Female,Home and lifestyle,Cash,6.3,afternoon,Saturday


In [5]:
# Check data types
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 700 entries, 0 to 699
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   branch          700 non-null    object 
 1   city            700 non-null    object 
 2   customer_type   700 non-null    object 
 3   gender          700 non-null    object 
 4   product_line    700 non-null    object 
 5   total           700 non-null    float64
 6   payment_method  700 non-null    object 
 7   rating          700 non-null    float64
 8   time_of_day     700 non-null    object 
 9   day_name        700 non-null    object 
dtypes: float64(2), object(8)
memory usage: 54.8+ KB


In [6]:
# Summary statistics for train
train.describe()

Unnamed: 0,total,rating
count,700.0,700.0
mean,485078.0,7.003429
std,364390.7,1.713078
min,19041.75,4.0
25%,200119.5,5.5
50%,381874.5,7.0
75%,706127.6,8.425
max,1563975.0,10.0


In [8]:
train.describe(include='O')

Unnamed: 0,branch,city,customer_type,gender,product_line,payment_method,time_of_day,day_name
count,700,700,700,700,700,700,700,700
unique,3,3,2,2,6,3,3,7
top,A,Yangon,Normal,Male,Sports and travel,Cash,evening,Saturday
freq,236,236,354,356,127,246,309,114


In [9]:
# Missing values in train
train.isnull().sum()

branch            0
city              0
customer_type     0
gender            0
product_line      0
total             0
payment_method    0
rating            0
time_of_day       0
day_name          0
dtype: int64

In [10]:
# Missing values in test
test.isnull().sum()

branch            0
city              0
customer_type     0
gender            0
product_line      0
payment_method    0
rating            0
time_of_day       0
day_name          0
dtype: int64

In [12]:
# Target value counts
train['total'].value_counts()

283641.75     2
263875.50     2
415422.00     2
326450.25     2
130851.00     2
             ..
293391.00     1
137103.75     1
348232.50     1
104107.50     1
1535625.00    1
Name: total, Length: 695, dtype: int64

# 4. Data preprocessing

In [14]:
target = train.pop('total')

In [16]:
train = pd.get_dummies(train)
test = pd.get_dummies(test)
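
Encoding train and test separately only works when both frames contain the same categories in every column; otherwise the dummy columns differ and `predict` fails. The prediction step later in this notebook runs, so the columns appear to match here. As a guard, a minimal sketch on toy data (the frames `train_toy` and `test_toy` are hypothetical, not taken from the dataset) could align the test columns to the training columns:

```python
import pandas as pd

# Hypothetical toy frames: category 'C' is missing from the test frame.
train_toy = pd.DataFrame({"branch": ["A", "B", "C"], "rating": [9.1, 7.4, 5.3]})
test_toy = pd.DataFrame({"branch": ["A", "B"], "rating": [6.0, 8.2]})

# One-hot encode each frame separately, as in the notebook.
train_enc = pd.get_dummies(train_toy)
test_enc = pd.get_dummies(test_toy)

# Reorder test columns to match train; columns absent from test are filled with 0,
# and columns that exist only in test are dropped.
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)
```

In the notebook this would correspond to `test = test.reindex(columns=train.columns, fill_value=0)` right after the two `get_dummies` calls.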

# 5. Validation split

In [23]:
from sklearn.model_selection import train_test_split
X_tr, X_val, y_tr, y_val = train_test_split(train, target, test_size=0.2, random_state=0)
X_tr.shape, X_val.shape, y_tr.shape, y_val.shape

((560, 30), (140, 30), (560,), (140,))

# 6. Model training and evaluation

In [25]:
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor()
model = rf.fit(X_tr, y_tr)
pred = model.predict(X_val)
pred

array([546066.045 , 578279.835 , 532296.45  , 548084.4075, 495159.21  ,
       419774.355 , 464927.5575, 550659.06  , 569962.7325, 442991.43  ,
       533436.435 , 313835.9175, 483317.1   , 772205.3325, 474225.885 ,
       390585.195 , 553624.6275, 464890.23  , 514984.68  , 648688.4775,
       491477.0175, 614603.2725, 434590.065 , 538724.4975, 581107.275 ,
       415857.4875, 547410.78  , 486769.3425, 291739.14  , 387871.9425,
       426157.0425, 581832.72  , 337196.6325, 550093.7925, 296767.8   ,
       470672.2125, 466180.3125, 466847.325 , 704970.7875, 475374.2175,
       426557.7225, 503441.505 , 304425.765 , 515050.0425, 618847.74  ,
       319851.7875, 341523.315 , 621907.8075, 664629.6825, 426466.845 ,
       697635.855 , 509414.3775, 378168.0525, 428345.82  , 391869.765 ,
       720886.4775, 507657.3075, 446126.7825, 630664.65  , 515000.2725,
       555140.25  , 506139.6375, 474818.715 , 491461.8975, 668152.9575,
       449131.095 , 782975.34  , 763689.78  , 409705.8525, 41871
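
The task is scored with RMSE, but the cell above only prints the validation predictions. A minimal sketch of the metric, written with NumPy and demonstrated on toy values (the helper name `rmse` and the toy arrays are assumptions for illustration), might look like this:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy check: errors are -2000, 2000 and 0.
toy_true = [10000.0, 20000.0, 30000.0]
toy_pred = [12000.0, 18000.0, 30000.0]
toy_score = rmse(toy_true, toy_pred)
```

In this notebook the call would be `rmse(y_val, pred)`, placed before `pred` is overwritten with the test predictions in the next cell.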

In [28]:
pred = model.predict(test)
pred

array([319681.53  , 668193.2775, 350330.085 , 537739.965 , 550546.92  ,
       462581.91  , 507944.115 , 654953.1975, 502513.0425, 456035.265 ,
       456678.9675, 577458.1575, 497665.5075, 489369.195 , 384054.7725,
       540164.835 , 597172.59  , 587449.485 , 663408.7425, 569170.8225,
       410536.8225, 750745.17  , 331440.165 , 642980.0475, 448425.3375,
       452563.8075, 411845.9625, 441466.9875, 390701.9025, 452056.6575,
       694986.705 , 541376.325 , 485499.8925, 315591.255 , 342288.9225,
       426016.395 , 607538.1375, 572149.1475, 503556.48  , 531280.1025,
       862446.8475, 681918.3   , 689965.29  , 422106.615 , 638270.955 ,
       690153.345 , 314131.3875, 606073.3875, 535833.585 , 431543.8575,
       476401.7475, 459960.165 , 569426.6025, 522620.4375, 694090.53  ,
       608115.8475, 518608.9125, 525425.985 , 354928.14  , 508889.43  ,
       503397.8775, 565008.0975, 487372.2525, 471706.2   , 463560.1425,
       440396.46  , 442938.6675, 465590.16  , 437449.1625, 60561

# 7. Prediction and result file creation

In [31]:
result = pd.DataFrame({
    'pred' : pred
})

result.to_csv('result.csv',index=False)

pd.read_csv('result.csv')

Unnamed: 0,pred
0,319681.5300
1,668193.2775
2,350330.0850
3,537739.9650
4,550546.9200
...,...
295,478998.1350
296,449116.6050
297,548358.3000
298,451477.8450
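
Since the submission must contain a single `pred` column, a quick structural check on the saved file can catch mistakes such as a leftover index column. The sketch below uses a hypothetical helper `check_submission` and a toy file in a temporary directory; it is one possible check, not a required part of the task:

```python
import os
import tempfile

import numpy as np
import pandas as pd

def check_submission(path, expected_rows):
    """Return True if the CSV has exactly one column 'pred', the expected
    number of rows, and no missing values."""
    df = pd.read_csv(path)
    return (
        list(df.columns) == ["pred"]
        and len(df) == expected_rows
        and not df["pred"].isnull().any()
    )

# Toy predictions written the same way as in the notebook (index=False).
toy_pred = np.array([10000.0, 20000.0, 30000.0])
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "result.csv")
    pd.DataFrame({"pred": toy_pred}).to_csv(toy_path, index=False)
    submission_ok = check_submission(toy_path, expected_rows=len(toy_pred))
```

For the actual file, the call would be `check_submission('result.csv', expected_rows=len(test))`.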
