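## Using SageMaker + XGBoost to Convert Time Series into Supervised Learning for Predictive Maintenance
https://github.com/liangyimingcom/Use-SageMaker_XGBoost-convert-Time-Series-into-Supervised-Learning-for-predictive-maintenance
**Keywords**: SageMaker; XGBoost; Python; sliding window; sliding window method; reframing time series forecasting as a supervised learning problem; converting multivariate time series data into a supervised learning problem; how to convert a time series problem into a supervised learning problem in Python; machine learning models for time series forecasting; Machine Learning; ML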

# Step 1: Data Preprocessing and Feature Engineering, with an Explanation of the Sliding Window

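**Predictive maintenance is a common AI scenario in traditional manufacturing. For years, manufacturers have worked to improve operational efficiency and avoid downtime caused by component failures. The usual approaches are:**
1. Connect data through physical sensors (tags), with heavy and often duplicated investment in storage and dashboards, in order to monitor equipment condition and receive real-time alerts.
2. Analyze the data mainly with univariate thresholds and physics-based models. These methods work well for detecting specific failure types and operating conditions, but they often miss important information that can only be found by deriving the multivariate relationships of each piece of equipment.
3. Use machine learning to build data-driven models that learn from the equipment's historical data. The main challenge is the project investment and engineer training that Machine Learning (ML) requires: implementing such a solution is both time-consuming and expensive.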

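**AWS SageMaker offers a simple and effective solution: use SageMaker + XGBoost to detect abnormal equipment behavior and meet the needs of the predictive maintenance scenario.** This article covers:
1. Restructuring the dataset with the **sliding window** method and pairing it with the XGBoost algorithm, **which converts a multivariate time series dataset into a supervised learning problem (turning a complex problem into a simple one);**
2. Using SageMaker Studio features (Autopilot for automated machine learning, hyperparameter tuning jobs for automated tuning, multi-model endpoints, and so on) to **speed up XGBoost hyperparameter optimization, improve model accuracy, and greatly reduce day-to-day inference cost**;
3. Using SageMaker Studio to **complete data preprocessing and feature engineering**:
   - [ ] 1) Explore correlations;
   - [ ] 2) Narrow the range of feature values;
   - [ ] 3) Preprocess the large dataset in batches to avoid running out of server memory;
   - [ ] 4) Clean the data, using the sliding window to remove invalid records;
   - [ ] 5) Filter the data to address the imbalance between positive and negative samples;
4. For the experimental data, training six prediction models with SageMaker + XGBoost, covering predictions 5, 10, 20, 30, 40, and 50 minutes ahead, and demonstrating the prediction results.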
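
To make the sliding window idea in item 1 concrete, here is a minimal sketch on a made-up univariate series. The column name `temp`, the values, and the window size are illustrative assumptions and are not taken from the dataset used in this article.

```python
import pandas as pd

# Toy univariate series sampled at fixed intervals (values are made up).
series = pd.DataFrame({"temp": [10, 11, 12, 13, 14, 15]})

n_window = 2  # number of past readings placed next to each current reading

# shift(i) moves the column down by i rows, so row t then holds the value from t-i.
frames = [series.shift(i).add_suffix(f"(t-{i})") for i in range(n_window, 0, -1)]
frames.append(series.add_suffix("(t)"))

# Each remaining row is one supervised sample: past readings next to the current reading.
supervised = pd.concat(frames, axis=1).dropna()
print(supervised)
```

The first `n_window` rows have no complete history and are removed by `dropna()`, so six readings yield four samples.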

---
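### Contents of This Chapter

1. [Data preprocessing and feature engineering]
2. [Code implementation of the sliding window]
3. [Data preprocessing]
4. [Handling imbalanced samples]
5. [Data labeling and feature engineering]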

---

In [1]:
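## Sliding window size: how many past records are merged into one row (the resulting row holds n+1 records)
n_slidingwindow = 10  #n_slidingwindow = 100
## Warn n cycles ahead; lead time in minutes = n/2 (one reading every 30 s). Should be smaller than the sliding window size
n_earlywarningcycle = 6

# Start position of the early-warning cycles
n_combinrows = n_earlywarningcycle
print('slidingwindow === ' + str(n_slidingwindow) +  ', earlywarningcycle === ' + str(n_earlywarningcycle) )

## 58 attribute columns in total
n_totalcolumns = 58 

## load csv path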
csv_path = './yiming-arraged/10000871-part_ac.csv'
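
The three settings above fix the layout of every windowed row. The short calculation below is a sketch of that arithmetic; `sample_interval_s` is a name introduced here for the 30-second interval mentioned in the comment, and the other values are copied from the cell.

```python
n_slidingwindow = 10      # lagged rows merged into each sample
n_earlywarningcycle = 6   # warning lead, in sampling cycles
n_totalcolumns = 58       # columns per time step
sample_interval_s = 30    # assumed name for the 30-second sampling interval

# Each merged row holds n lagged time steps plus the current one.
row_width = (n_slidingwindow + 1) * n_totalcolumns
# The label of the current time step is the first column of the last block.
label_column = n_slidingwindow * n_totalcolumns
# Lead time covered by the warning horizon.
lead_minutes = n_earlywarningcycle * sample_interval_s / 60

print(row_width, label_column, lead_minutes)  # 638 580 3.0
```

These numbers reappear later in the notebook output: 638 columns after windowing, and column 580 as the checkpoint column.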



In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from numpy import asarray
from pandas import read_csv
from pandas import DataFrame
from pandas import concat
#from xgboost import XGBRegressor
%matplotlib inline
from matplotlib import pyplot
from sklearn.metrics import mean_absolute_error

import os
from datetime import datetime
from time import gmtime, strftime, sleep

In [3]:
# Predefined environment variables
s3_prefix_mask = 'xgb-{}-TC{}-SW{}-WC{}-{}' # haierkong-xgb-{filename}-TC_{total columns 58}-SW_{n_slidingwindow 60}-WC_{warningcycle 30}-{created datetime}
#print(s3_prefix_mask) #for testing

csv_path_head, csv_path_tail = os.path.split(csv_path)
#created_date = strftime("%Y_%m_%d_%H_%M_%S", gmtime())
created_date = strftime("%Y%m%d%H%M%S", gmtime())

strfilled_n_slidingwindow=str(n_slidingwindow).rjust(3,'0')
strfilled_n_earlywarningcycle=str(n_earlywarningcycle).rjust(3,'0')
s3_prefix = s3_prefix_mask.format(csv_path_tail, n_totalcolumns, strfilled_n_slidingwindow, strfilled_n_earlywarningcycle, created_date)
print(s3_prefix) #for testing



xgb-10000871-part_ac.csv-TC58-SW010-WC006-20210405135629


In [4]:
df = pd.read_csv(csv_path,low_memory=False)

#df n_totalcolumns = 58 columns
df = df[[
'1S0Z7F92N86S68KI_1_AUXILIQUIDOPENING',
'1S0Z7F92N86S68KI_1_COMPCURRENT',
'1S0Z7F92N86S68KI_1_COMPEXHAUSTTEMP',
'1S0Z7F92N86S68KI_1_COMPLOAD',
'1S0Z7F92N86S68KI_1_COMPPOWER',
'1S0Z7F92N86S68KI_1_COMPRUNTIME',
'1S0Z7F92N86S68KI_1_COMPSPEED',
'1S0Z7F92N86S68KI_1_COMPSUCTIONTEMP',
'1S0Z7F92N86S68KI_1_COMPVOLTAGE',
'1S0Z7F92N86S68KI_1_CONDSIDEEXHAUSTPRESS',
'1S0Z7F92N86S68KI_1_DISCHARGESUPERHEAT',
'1S0Z7F92N86S68KI_1_ECONPRESS',
'1S0Z7F92N86S68KI_1_ECONREFRTEMP',
'1S0Z7F92N86S68KI_1_EVAPSIDESUCTIONPRESS',
'1S0Z7F92N86S68KI_1_INVERTERTEMP',
'1S0Z7F92N86S68KI_1_MAINFLOWVALVEOPENING',
'1S0Z7F92N86S68KI_1_MAINLOOPLEVEL',
'1S0Z7F92N86S68KI_2_AUXILIQUIDOPENING',
'1S0Z7F92N86S68KI_2_COMPCURRENT',
'1S0Z7F92N86S68KI_2_COMPEXHAUSTTEMP',
'1S0Z7F92N86S68KI_2_COMPLOAD',
'1S0Z7F92N86S68KI_2_COMPPOWER',
'1S0Z7F92N86S68KI_2_COMPRUNTIME',
'1S0Z7F92N86S68KI_2_COMPSPEED',
'1S0Z7F92N86S68KI_2_COMPSUCTIONTEMP',
'1S0Z7F92N86S68KI_2_COMPVOLTAGE',
'1S0Z7F92N86S68KI_2_CONDSIDEEXHAUSTPRESS',
'1S0Z7F92N86S68KI_2_DISCHARGESUPERHEAT',
'1S0Z7F92N86S68KI_2_ECONPRESS',
'1S0Z7F92N86S68KI_2_ECONREFRTEMP',
'1S0Z7F92N86S68KI_2_EVAPSIDESUCTIONPRESS',
'1S0Z7F92N86S68KI_2_INVERTERTEMP',
'1S0Z7F92N86S68KI_2_MAINFLOWVALVEOPENING',
'1S0Z7F92N86S68KI_2_MAINLOOPLEVEL',
'1S0Z7F92N86S68KI_3_AUXILIQUIDOPENING',
'1S0Z7F92N86S68KI_3_COMPCURRENT',
'1S0Z7F92N86S68KI_3_COMPEXHAUSTTEMP',
'1S0Z7F92N86S68KI_3_COMPLOAD',
'1S0Z7F92N86S68KI_3_COMPPOWER',
'1S0Z7F92N86S68KI_3_COMPRUNTIME',
'1S0Z7F92N86S68KI_3_COMPSPEED',
'1S0Z7F92N86S68KI_3_COMPSUCTIONTEMP',
'1S0Z7F92N86S68KI_3_COMPVOLTAGE',
'1S0Z7F92N86S68KI_3_CONDSIDEEXHAUSTPRESS',
'1S0Z7F92N86S68KI_3_DISCHARGESUPERHEAT',
'1S0Z7F92N86S68KI_3_ECONPRESS',
'1S0Z7F92N86S68KI_3_ECONREFRTEMP',
'1S0Z7F92N86S68KI_3_EVAPSIDESUCTIONPRESS',
'1S0Z7F92N86S68KI_3_INVERTERTEMP',
'1S0Z7F92N86S68KI_3_MAINFLOWVALVEOPENING',
'1S0Z7F92N86S68KI_3_MAINLOOPLEVEL',
'SYSTEM_CONDCAPACITY',
'SYSTEM_CONDSIDETEMPIN',
'SYSTEM_CONDSIDETEMPOUT',
'SYSTEM_EVAPCAPACITY',
'SYSTEM_EVAPSIDETEMPOUT',
'SYSTEM_UNITPOWER',
'code','time'
]]

#df #for testing

In [5]:
#Data cleaning 0: sorting
# Sort the rows by time, then drop the time column
df['time']=pd.to_datetime(df['time'])
df.sort_values('time', inplace=True)
# save for time sort checking...
#df.to_csv('modeldata_sortbytime.csv', header=True, index=True) #for testing
df = df.drop(['time'], axis=1)

In [6]:
#Data cleaning 0: convert code into hascode = True/False
# Add a hascode column that is True when the row carries a fault code and False otherwise
df['hascode'] = (np.where(df['code'].isnull().values, False, True)).astype(object)
df = df.drop(['code'], axis=1)
#df.info()  # for testing
#df.head(3)  # for testing

##Data cleaning 0: move the hascode_True column to the first position so it serves as the classification label
model_data = pd.get_dummies(df)
model_data = pd.concat([model_data['hascode_True'], model_data.drop(
    ['hascode_False', 'hascode_True'], axis=1)], axis=1)
#willdel#model_data_values = model_data_values[0:10, 13:15] ## for testing
#willdel#DataFrame(model_data_values).info()
#model_data #for testing
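
The cell above derives a binary label from the `code` column and moves it to the front, which matches the CSV layout written later (label first, no header). The sketch below shows an equivalent, shorter construction on a made-up frame; it is an alternative illustration, not the notebook's own code, and the sensor values and the fault code `E07` are invented.

```python
import numpy as np
import pandas as pd

# Toy frame: two sensor readings and an optional fault code (values are made up).
toy = pd.DataFrame({
    "COMPCURRENT": [1.0, 1.2, 5.9],
    "COMPLOAD": [40.0, 42.0, 97.0],
    "code": [np.nan, np.nan, "E07"],
})

# 1 when a fault code is present, 0 otherwise.
label = toy["code"].notna().astype(int).rename("hascode_True")

# Put the label first, followed by the features.
toy_model_data = pd.concat([label, toy.drop(columns="code")], axis=1)
print(toy_model_data)
```

Using `notna()` directly avoids the detour through `get_dummies` and produces a single integer label column.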

In [7]:
# Function definitions #

# Convert a time series dataset into a supervised learning problem: transform multi-column time series data into supervised learning samples
# Parameters: data = original dataset; n_in = sliding window size (how many past records are merged into each row); n_out = number of rows from time t onward to include; dropnan = whether to drop rows that the window leaves partly empty (NaN)
def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    n_vars = 1 if type(data) is list else data.shape[1]
    df = DataFrame(data)
    cols = list()

    # input sequence (t-n, ... t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))

    # forecast sequence (t, t+1, ... t+n)
    for i in range(0, n_out):
        cols.append(df.shift(-i))

    # put it all together
    agg = concat(cols, axis=1)

    # drop rows with NaN values
    if dropnan:
        agg.dropna(inplace=True)
    return agg.values


# Apply the sliding window chunk by chunk to avoid running out of memory;
# Parameters: data = original dataset; n_in = sliding window size (how many past records are merged into each row); splite_md = number of rows per chunk, smaller chunks use less memory
def splite_series_to_supervised(data, n_in=1, splite_md=500000):
    chunks = []  # windowed chunks, concatenated once at the end
    for i in range(0,len(data),splite_md):
        print('/ total= '+str(len(data))+', now splited_working_number= ' + str(i)) # for testing
        # use the data argument here rather than the global model_data
        chunks.append(DataFrame(
            series_to_supervised(data.iloc[ i : i+splite_md ], n_in=n_in, dropnan=False)))
        #print(data.iloc[ i : i+splite_md ].info()) # for testing

    # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
    return pd.concat(chunks) if chunks else pd.DataFrame()


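# Post-window processing: clean the windowed data by removing every window that already contains a fault before its last step
# Loop over the n_slidingwindow lagged steps, ignoring the hascode of the last (current) step: delete every row in which an earlier hascode = 1 (hascode is the first column of each block of 58 columns);
# In other words: keep only rows whose lagged hascode values are all 0 (no fault yet), whether the last hascode is 0 or 1 (the first fault)
# Note: rows are dropped by index label, so if the index contains repeated labels (as after chunked windowing), every row sharing a dropped label is removed as well
def clear_supervised(data, n_slidingwindow, n_totalcolumns):
    count = 0
    while count < n_slidingwindow :
        n_colnum = count * n_totalcolumns
        data.drop(index=data[data[n_colnum].isin([True])].index, inplace=True)
        print('/ total = '+str(n_slidingwindow) +', now working on = ' + str(count)) # for testing
        #print('column num= '+ str(n_colnum)) # for testing
        count = count + 1

    return data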

#-----------------------#-----------------------#-----------------------

# Post-window processing: a correct selection keeps one faulty and one non-faulty window per fault; also suitable for random sampling with pd.sample
# That is: pick the row where the fault is reported, skip the rows directly above it, and keep the normal row n rows earlier (fault row - n)
def pickup_supervised_4train(data, n_slidingwindow, n_totalcolumns, splite_md=500000):
    picked_chunks = []  # selected rows per chunk; working in chunks avoids running out of memory

    for i in range(0,len(data),splite_md):
        print('\nworking_number ==' + str(i)) # for testing

        splited_data = data.iloc[ i : i+splite_md ].reset_index(drop=True)
        print('/ splited_data ==== '+str(splited_data.shape[0]))

        n_checkpoint = n_totalcolumns * n_slidingwindow # checkpoint column: hascode of the current step
        index_hascode_truerows = splited_data[splited_data[n_checkpoint].isin([True])].index # index of the rows whose checkpoint column is true, used to pick them out below
        print('/ index_hascode_truerows ====== '+str(index_hascode_truerows))

        target_data = pd.DataFrame(data=splited_data,index=index_hascode_truerows) # pick the rows whose checkpoint column is true
        # pick the row n rows above each positive row (n = window size); stepping back a full window gives completely different values, which suits random sampling
        target_data = pd.concat([target_data, pd.DataFrame(data=splited_data,index=index_hascode_truerows-n_slidingwindow)])
        target_data[n_checkpoint] = target_data[n_checkpoint].fillna(0) # the checkpoint column of an earlier row may be empty; fill it with 0
        print('/ target_data ======== '+ str(target_data.shape[0]))

        picked_chunks.append(target_data)

    # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
    splited_series_to_supervised = pd.concat(picked_chunks) if picked_chunks else pd.DataFrame()
    print('\n========================================')
    print('splited_series_to_supervised  ============================== '+ str(splited_series_to_supervised.shape[0])) # for testing
    splited_series_to_supervised.drop_duplicates(inplace=True) # remove duplicate rows (duplicates would inflate the number of 1 labels)
    return splited_series_to_supervised

# Post-window processing: variant of pickup_supervised_4train for imbalanced samples; not suitable for random sampling with pd.sample
# That is: pick the row where the fault is reported, plus the n non-faulty rows directly above it
# Imbalance handling: starting from each row with status bit = 1, take the block of status bit = 0 rows above it, so that only part of the status bit = 0 windows are kept (selection of the non-fault dataset)
def pickup_supervised_4train_imbalance(data, n_slidingwindow, n_totalcolumns, splite_md=500000):
    picked_chunks = []  # selected rows per chunk; working in chunks avoids running out of memory

    for i in range(0,len(data),splite_md):
        print('\nworking_number ==' + str(i)) # for testing

        splited_data = data.iloc[ i : i+splite_md ].reset_index(drop=True)
        print('/ splited_data ==== '+str(splited_data.shape[0]))

        n_checkpoint = n_totalcolumns * n_slidingwindow # checkpoint column: hascode of the current step
        index_hascode_truerows = splited_data[splited_data[n_checkpoint].isin([True])].index # index of the rows whose checkpoint column is true, used to pick them out below
        print('/ index_hascode_truerows ====== '+str(index_hascode_truerows))

        picked = [pd.DataFrame(data=splited_data,index=index_hascode_truerows)] # pick the rows whose checkpoint column is true

        #--- starting from each row with status bit = 1, take the status bit = 0 rows above it ---#
        for k in range(1,n_slidingwindow+1):
            picked.append(pd.DataFrame(data=splited_data,index=index_hascode_truerows-k)) # pick the k-th row above each positive row, for k = 1..n (n = window size)
            #print(' DivRows=' + str(k)) #for testing

        target_data = pd.concat(picked)
        target_data[n_checkpoint] = target_data[n_checkpoint].fillna(0) # the checkpoint column of an earlier row may be empty; fill it with 0
        print('/ target_data ======== '+ str(target_data.shape[0]))

        picked_chunks.append(target_data)

    # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
    splited_series_to_supervised = pd.concat(picked_chunks) if picked_chunks else pd.DataFrame()
    print('\n========================================')
    print('splited_series_to_supervised  ============================== '+ str(splited_series_to_supervised.shape[0])) # for testing
    splited_series_to_supervised.drop_duplicates(inplace=True) # remove duplicate rows (duplicates would inflate the number of 1 labels)
    return splited_series_to_supervised

#-----------------------#-----------------------#-----------------------


# Post-window processing: prepare the labeled dataset for XGBoost:
# 1) move the last hascode label to the first column;
# 2) drop the columns of the current time step (58 attribute columns) and of the n_combinrows steps before it, so the features end n_combinrows cycles ahead of the label
def ready4inference_supervised(data, n_slidingwindow, n_totalcolumns, n_combinrows):
    # Two jobs: 1) move the last hascode (true/false) to the first position as the label; 2) drop the trailing attribute columns
    n_lasthascodepoint = n_slidingwindow*n_totalcolumns
    print('n_lasthascodepoint=' + str(n_lasthascodepoint))

    n_drop_end = (n_slidingwindow + 1) * n_totalcolumns # add another 58 columns to reach the end
    print('n_drop_end=' + str(n_drop_end))

    n_drop_start = (n_slidingwindow - n_combinrows) * n_totalcolumns
    print('n_drop_start=' + str(n_drop_start))

    data = pd.concat([data[n_lasthascodepoint],
                      data.drop(columns=data.columns[n_drop_start:n_drop_end])], axis=1)
    print('\nFINALL data  ============================== '+ str(data.shape[0])) # for testing

    return data

# Inference: pass in the model data and a SageMaker predictor handle, and get the predictions back
def sagemaker_predict(data, xgb_predictor, rows=1000):
    split_array = np.array_split(data, int(data.shape[0] / float(rows) + 1))
    predictions = ''
    for array in split_array:
        predictions = ','.join(
            [predictions, xgb_predictor.predict(array).decode('utf-8')])

    return np.fromstring(predictions[1:], sep=',')   
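
The batching logic of `sagemaker_predict` can be tried without an endpoint. The sketch below uses `StubPredictor`, a hypothetical stand-in that returns one made-up score per row as comma-separated bytes, which is the response shape the function above assumes. `predict_in_batches` is likewise an illustrative name, not part of the SageMaker SDK.

```python
import numpy as np

class StubPredictor:
    """Hypothetical stand-in for a SageMaker predictor: one score per row, as CSV bytes."""
    def predict(self, batch):
        scores = batch.sum(axis=1) % 2          # arbitrary fake scores
        return ",".join(str(float(s)) for s in scores).encode("utf-8")

def predict_in_batches(data, predictor, rows=1000):
    # Number of batches, computed the same way as in sagemaker_predict.
    n_batches = int(data.shape[0] / float(rows) + 1)
    parts = []
    for batch in np.array_split(data, n_batches):
        if len(batch) == 0:
            continue                            # skip empty batches instead of sending them
        parts.append(predictor.predict(batch).decode("utf-8"))
    return np.array(",".join(parts).split(","), dtype=float)

features = np.arange(25, dtype=float).reshape(5, 5)
scores = predict_in_batches(features, StubPredictor(), rows=2)
print(scores.shape)  # (5,)
```

Because the batch count is `int(n / rows + 1)`, it can exceed the number of rows when `rows` is very small, in which case `np.array_split` yields empty batches; the guard in the sketch skips those.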

In [8]:
#Sliding window 1: window the data in chunks to avoid running out of memory
## transform the timeseries data into supervised learning
#old method (archived)#model_data = DataFrame(series_to_supervised(model_data, n_in=n_slidingwindow, dropnan=False))  # merge n rows into one row
#old method (archived)#model_data.info()   
model_data = splite_series_to_supervised(model_data, n_slidingwindow, 50000)

#model_data.to_csv('spliteseriestosupervised_data.csv', header=False, index=False) # has NO header for xgboost bin:log
#model_data.info() #for testing
#model_data #for testing

/ total= 500000, now splited_working_number= 0
/ total= 500000, now splited_working_number= 50000
/ total= 500000, now splited_working_number= 100000
/ total= 500000, now splited_working_number= 150000
/ total= 500000, now splited_working_number= 200000
/ total= 500000, now splited_working_number= 250000
/ total= 500000, now splited_working_number= 300000
/ total= 500000, now splited_working_number= 350000
/ total= 500000, now splited_working_number= 400000
/ total= 500000, now splited_working_number= 450000
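
Windowing each chunk separately has two side effects worth knowing about: the first `n_in` rows of every chunk have no complete history, and the concatenated result repeats index labels. The sketch below shows both on made-up data; `window` is a simplified helper written for this illustration, following the same shift-and-concat idea as `series_to_supervised`.

```python
import numpy as np
import pandas as pd

def window(frame, n_in):
    # Same shift-and-concat idea as series_to_supervised, keeping NaN rows.
    cols = [frame.shift(i) for i in range(n_in, 0, -1)] + [frame]
    return pd.DataFrame(pd.concat(cols, axis=1).values)

toy = pd.DataFrame({"x": np.arange(10, dtype=float)})
n_in, chunk = 2, 5

chunks = [window(toy.iloc[i:i + chunk], n_in) for i in range(0, len(toy), chunk)]
windowed = pd.concat(chunks)

# Rows with NaN are incomplete windows: the first n_in rows of every chunk.
incomplete = windowed.isna().any(axis=1)
print(int(incomplete.sum()))        # 4 = n_in * number of chunks
print(windowed.index.is_unique)     # False: each chunk restarts its index at 0
```

With 50,000-row chunks and a window of 10, the incomplete rows are a small fraction of the data, but the repeated index labels matter for any later step that selects or drops rows by label.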


In [9]:
# Post-window processing 2: remove the windowed rows that do not meet the requirements
model_data = clear_supervised(model_data, n_slidingwindow, n_totalcolumns)
#model_data.to_csv('clearsupervised_data.csv', header=False, index=False) # has NO header for xgboost bin:log

# For verification
n_checkpoint = n_totalcolumns * n_slidingwindow # checkpoint column index
print('\nAfter clear_supervised model_data issue rows on point = ' + str(n_checkpoint))
#model_data[model_data[n_checkpoint].isin([True])].info()  ## for verification #for testing
print('\nAfter clear_supervised model_data total  ============================== ')
model_data.info()  ## for verification #for testing
#model_data # for testing

/ total = 10, now working on = 0
/ total = 10, now working on = 1
/ total = 10, now working on = 2
/ total = 10, now working on = 3
/ total = 10, now working on = 4
/ total = 10, now working on = 5
/ total = 10, now working on = 6
/ total = 10, now working on = 7
/ total = 10, now working on = 8
/ total = 10, now working on = 9

After clear_supervised model_data issue rows on point = 580

<class 'pandas.core.frame.DataFrame'>
Int64Index: 248150 entries, 0 to 47960
Columns: 638 entries, 0 to 637
dtypes: float64(638)
memory usage: 1.2 GB
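
The cleaning rule applied by `clear_supervised` can be checked on a toy table. The sketch below uses made-up sizes (2 lags, 2 columns per step) and a boolean mask instead of dropping by index label; the mask is an alternative formulation chosen for this illustration because it works row by row and is therefore unaffected by repeated index labels.

```python
import pandas as pd

n_window, n_cols = 2, 2   # toy sizes: 2 lags, 2 columns per step (label, feature)

# Columns 0 and 2 are the lagged labels, column 4 is the current label.
rows = pd.DataFrame([
    [0, 1.0, 0, 1.1, 0, 1.2],   # no fault anywhere: keep
    [0, 1.1, 0, 1.2, 1, 9.0],   # fault only at the current step: keep
    [0, 1.2, 1, 9.0, 1, 9.1],   # fault already present in a lag: drop
    [1, 9.0, 1, 9.1, 0, 1.0],   # fault already present in a lag: drop
], dtype=float)

lag_label_cols = [k * n_cols for k in range(n_window)]
# Keep a row only if none of its lagged labels is set.
keep = ~(rows[lag_label_cols] == 1).any(axis=1)
cleaned = rows[keep]
print(cleaned.index.tolist())   # [0, 1]
```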


In [10]:
#Post-window processing 3: pick the useful rows chunk by chunk to avoid running out of memory
model_data = pickup_supervised_4train_imbalance(model_data,n_slidingwindow, n_totalcolumns, splite_md=50000)
#model_data.to_csv('pickupsmallsupervisedfortrain_data.csv', header=False, index=False) # has NO header for xgboost bin:log

print('\nFINALL clear_supervised model_data issue rows on point = ' + str(n_checkpoint))
model_data[model_data[n_checkpoint].isin([True])].info()  ## for verification #for testing
print('\nFINALL model_data  ============================== '+ str(model_data.shape[0])) # for testing
#model_data.info()  ## for verification #for testing
#model_data #for testing


working_number ==0
/ splited_data ==== 50000
             2628,  7272,  8763,  8773, 11000, 12093, 12102, 12191, 12200,
            12213, 14575, 14586, 14587, 48678, 48717, 48723, 48735, 49200,
            49730, 49920, 49940],
           dtype='int64')

working_number ==50000
/ splited_data ==== 50000
              496,
            ...
            46972, 47025, 47061, 47068, 47138, 47806, 48389, 48427, 49657,
            49703],
           dtype='int64', length=187)

working_number ==100000
/ splited_data ==== 50000
            13934, 13936, 14001, 14049, 14378, 14404, 14409, 15371, 15557,
            15565, 15626, 18048, 18289, 18293, 18295, 20362, 22695, 33829,
            33833],
           dtype='int64')

working_number ==150000
/ splited_data ==== 50000

working_number ==200000
/ splited_data ==== 48150


FINALL clear_supervised model_data issue rows on point = 580
<class 'pandas.core.frame.DataFrame'>
Int64Index: 247 entries, 338 to 43165
Columns: 638 entries, 0 to 637
dtypes:
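
The imbalance handling keeps each fault row together with the `n` rows directly above it and discards the remaining normal rows. The sketch below reproduces that selection on a made-up label column with NumPy; the positions and window size are illustrative. Unlike the function above, positions that would fall before the start of the chunk are simply discarded here.

```python
import numpy as np
import pandas as pd

n_window = 3
# Toy label column after cleaning: faults at positions 5 and 11.
labels = pd.Series([0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0], dtype=float)

fault_pos = np.flatnonzero(labels.values == 1)
# For each fault, take the fault row plus the n_window rows directly above it.
offsets = np.arange(0, n_window + 1)
picked = np.unique((fault_pos[:, None] - offsets[None, :]).ravel())
picked = picked[picked >= 0]          # discard positions before the start of the chunk

subset = labels.iloc[picked]
print(picked.tolist())                # [2, 3, 4, 5, 8, 9, 10, 11]
print(subset.mean())                  # 0.25
```

In this toy case the share of positive rows rises from 2 in 14 to 2 in 8, which is the effect the selection is meant to achieve.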

In [11]:
#Post-window processing 4: prepare the label column for XGBoost training
model_data = ready4inference_supervised(model_data,n_slidingwindow, n_totalcolumns,n_combinrows)
#model_data.to_csv('ready4inference_data.csv', header=True, index=True)  #for testing
model_data.info()  ## for verification #for testing

# for testing
#model_data.columns
#model_data.columns.values
#model_data.head(3)
#model_data #for testing

n_lasthascodepoint=580
n_drop_end=638
n_drop_start=232

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2511 entries, 338 to 43155
Columns: 233 entries, 580 to 231
dtypes: float64(233)
memory usage: 4.5 MB
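
The column arithmetic in `ready4inference_supervised` is easier to follow at a small scale. The sketch below uses made-up sizes (4 lags, 2 columns per step, a lead of 1 step) and placeholder values; only the index calculations mirror the function.

```python
import numpy as np
import pandas as pd

n_window, n_cols, n_lead = 4, 2, 1   # toy sizes: 4 lags, 2 columns per step, 1-step lead

width = (n_window + 1) * n_cols
data = pd.DataFrame(np.arange(3 * width, dtype=float).reshape(3, width))

label_col = n_window * n_cols                 # label of the current step
drop_start = (n_window - n_lead) * n_cols     # first column inside the lead gap
drop_end = (n_window + 1) * n_cols            # one past the last column

features = data.drop(columns=data.columns[drop_start:drop_end])
ready = pd.concat([data[label_col], features], axis=1)
print(ready.columns.tolist())   # [8, 0, 1, 2, 3, 4, 5]
```

With the notebook's settings (window 10, 58 columns, lead 6) the same formulas give label column 580 and features 0 to 231, that is 233 columns in total, which agrees with the `info()` output above.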


In [12]:
## train_data, validation_data = np.split(model_data.sample(frac=1,random_state=1729), [int(0.7 * len(model_data))])  # random sample, train/validation = 70/30
train_data, validation_data = np.split(model_data, [int(0.8 * len(model_data))])  ## ordered split, train/validation = 80/20
#train_data.to_csv('train_data.csv', header=True, index=False) #has header #header for autoML on sagemaker
train_data.to_csv('train_data.csv', header=False, index=False) # has NO header for xgboost bin:log
validation_data.to_csv('validation_data.csv', header=False, index=False)

whole_data = model_data
whole_data.to_csv('whole_data.csv', header=False, index=False)
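
An ordered split keeps rows in their existing order, but it does not guarantee that both parts contain a similar share of fault rows. The sketch below performs the same 80/20 cut with positional slicing on a made-up frame and prints the share of positives on each side; the data is illustrative only.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"label": [0, 0, 1, 0, 1, 0, 0, 1, 0, 0],
                    "x": np.arange(10, dtype=float)})

cut = int(0.8 * len(toy))
# Positional slicing keeps the original order: earlier rows train, later rows validate.
train, validation = toy.iloc[:cut], toy.iloc[cut:]

print(len(train), len(validation))                         # 8 2
print(train["label"].mean(), validation["label"].mean())   # 0.375 0.0
```

In this toy case the validation part contains no fault at all, so checking the label share after splitting is a cheap safeguard before training.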

In [13]:
!ls -lht *.csv

-rw-r--r-- 1 root root 3.0M Apr  5 13:56 whole_data.csv
-rw-r--r-- 1 root root 605K Apr  5 13:56 validation_data.csv
-rw-r--r-- 1 root root 2.4M Apr  5 13:56 train_data.csv


In [14]:
import os
import boto3
import sagemaker

sess = sagemaker.Session()

bucket = sess.default_bucket()
# s3_prefix = 'haier-xgb-{}-SW_{}-WC_{}-{}' #for testing

boto3.Session().resource('s3').Bucket(bucket).Object(
    os.path.join(s3_prefix, 'train/train_data.csv')).upload_file('train_data.csv')
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(
    s3_prefix, 'validation/validation_data.csv')).upload_file('validation_data.csv')
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(
    s3_prefix, 'wholedata/whole_data.csv')).upload_file('whole_data.csv')
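
The upload cell builds S3 object keys with `os.path.join`, which uses the local path separator. S3 keys use `/`, so on a non-POSIX machine `posixpath.join` is the safer choice. The sketch below only builds and prints the keys and does not call AWS; the prefix is copied from the output shown earlier.

```python
import posixpath

s3_prefix = "xgb-10000871-part_ac.csv-TC58-SW010-WC006-20210405135629"

# S3 keys always use "/" as separator, whatever the local operating system is.
keys = {
    name: posixpath.join(s3_prefix, folder, name)
    for folder, name in [
        ("train", "train_data.csv"),
        ("validation", "validation_data.csv"),
        ("wholedata", "whole_data.csv"),
    ]
}
for local_file, key in keys.items():
    print(local_file, "->", key)
```

On the Linux-based SageMaker notebook used here both functions give the same result, so this only matters when the notebook is run elsewhere.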


In [15]:
print('upload completed in: ' + s3_prefix)
print('upload completed in bucket: ' +bucket)

upload completed in: xgb-10000871-part_ac.csv-TC58-SW010-WC006-20210405135629
upload completed in bucket: sagemaker-cn-northwest-1-337575217701
