# Loading data in sktime

## The .ts file format
A .ts file has two main parts: header information and data.  
The header contains the following:  
```
@problemName <problem name>
@timeStamps <true/false>
@univariate <true/false>
@classLabel <true/false> <space delimited list of possible class values>
@data
```
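
To make the header layout concrete, here is a minimal sketch that collects the `@` tags into a dictionary. `parse_header` is a hypothetical helper written for this illustration and is not part of sktime; the problem name `Example` and the class values `1 2` are made-up inputs.

```python
def parse_header(lines):
    """Collect '@key value ...' header lines into a dict, stopping at '@data'.

    Hypothetical helper for illustration only; not part of sktime.
    """
    header = {}
    for line in lines:
        line = line.strip()
        if not line.startswith("@"):
            continue                      # skip blanks and non-header lines
        parts = line[1:].split()
        if not parts:
            continue                      # a lone '@' carries no tag
        key, values = parts[0], parts[1:]
        if key.lower() == "data":
            break                         # everything after @data is series data
        header[key] = values
    return header


# Made-up header used only to exercise the sketch.
example = [
    "@problemName Example",
    "@timeStamps false",
    "@univariate true",
    "@classLabel true 1 2",
    "@data",
]
header = parse_header(example)
print(header["classLabel"])
```

Each tag maps to the list of whitespace-separated tokens that follow it, so the possible class values are everything after the first token of `classLabel`.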

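The data section is formatted as follows:  
1. When @timeStamps is false, the data carries no timestamps. Each line is one case (sample), dimensions (variables) are separated by a colon (:), values within a dimension are separated by commas, and missing values are written as "?".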
```
2,3,2,4:4,3,2,2
13,12,32,12:22,23,12,32
4,4,5,4:3,2,3,2
```
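
As a rough illustration of this layout, one data line can be split into its dimensions with plain Python. `parse_plain_line` is a hypothetical helper, not an sktime function, and mapping "?" to NaN is an assumption made for this sketch.

```python
import math


def parse_plain_line(line):
    """Split one .ts data line (no timestamps) into a list of dimensions.

    Hypothetical helper for illustration only; not part of sktime.
    """
    dimensions = []
    for chunk in line.strip().split(":"):          # ':' separates dimensions
        values = [
            math.nan if token.strip() == "?" else float(token)  # '?' marks a missing value
            for token in chunk.split(",")          # ',' separates readings
        ]
        dimensions.append(values)
    return dimensions


case = parse_plain_line("2,3,2,4:4,3,2,2")
print(case)
```

The result is one list per dimension, so the first example line yields two lists of four readings each.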

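2. When @timeStamps is true, the data carries timestamps. Each line is still one case, but every value is a (timestamp, value) tuple, with dimensions again separated by colons. The first case above becomes:
```
(0,2),(1,3),(2,2),(3,4):(0,4),(1,3),(2,2),(3,2)
```

This representation is convenient for sparse series. For example, a series with many missing values, written without timestamps as:
```
2,5,?,?,?,?,?,5,?,?,?,?,4
```
can be written with timestamps as: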
```
(0,2),(1,5),(7,5),(12,4)
```
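
The mapping between the two notations can be sketched in a few lines of Python. `to_timestamped` is a hypothetical helper, and it assumes the timestamp of each reading is simply its 0-based position in the series.

```python
def to_timestamped(series_text):
    """Turn a comma-separated series with '?' gaps into (timestamp, value) pairs.

    Hypothetical helper; timestamps are assumed to be 0-based positions,
    and values are kept as text so their formatting is preserved.
    """
    pairs = []
    for position, token in enumerate(series_text.split(",")):
        token = token.strip()
        if token != "?":                 # drop missing readings
            pairs.append((position, token))
    return pairs


pairs = to_timestamped("2,5,?,?,?,?,?,5,?,?,?,?,4")
print(",".join(f"({t},{v})" for t, v in pairs))   # (0,2),(1,5),(7,5),(12,4)
```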
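For classification problems, the class label of each case is given as the last colon-separated field, and @classLabel in the header should list the possible class values. For example, a univariate case with class label 1: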
```
1,4,23,34:1
```
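
Because the label is always the last colon-separated field, it can be peeled off before the series are parsed. The sketch below assumes a line without timestamps and without missing values; `split_label` is a hypothetical name, not an sktime function.

```python
def split_label(line):
    """Separate the series part of a .ts data line from its trailing class label.

    Hypothetical helper; assumes no timestamps and no missing values.
    """
    series_part, _, label = line.strip().rpartition(":")   # split at the last ':'
    dimensions = [
        [float(value) for value in dim.split(",")]
        for dim in series_part.split(":")
    ]
    return dimensions, label


dims, label = split_label("1,4,23,34:1")
print(dims, label)
```

The label is returned as a string, in line with the string labels that the loader returns later in this document.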

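## Storing data in a pandas DataFrame
The core data structure for storing datasets in sktime is the pandas DataFrame. Rows correspond to cases (samples), columns correspond to dimensions (variables), and each cell of the DataFrame holds a pandas Series.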
```
DataFrame:
 index |   dim_0  |   dim_1  |    ...   |  dim_c-1
   0  | pd.Series | pd.Series | pd.Series | pd.Series
   1  | pd.Series | pd.Series | pd.Series | pd.Series
  ... |    ...   |    ...  |    ...   |    ...
   n  | pd.Series | pd.Series | pd.Series | pd.Series
```
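
A nested DataFrame of this shape can also be built by hand, which may help when inspecting what the loaders return. This is a minimal sketch using only pandas; the values reuse the first two lines of the untimestamped .ts example, and the column names `dim_0` and `dim_1` follow the naming shown in the layout above.

```python
import pandas as pd

# Two cases (rows) and two dimensions (columns); every cell is a pd.Series.
X = pd.DataFrame({
    "dim_0": pd.Series([pd.Series([2, 3, 2, 4]), pd.Series([13, 12, 32, 12])]),
    "dim_1": pd.Series([pd.Series([4, 3, 2, 2]), pd.Series([22, 23, 12, 32])]),
})

print(X.shape)                  # (2, 2)
print(type(X.iloc[0, 0]))       # each cell is itself a Series
print(X.iloc[1, 1].tolist())    # [22, 23, 12, 32]
```

Indexing with `iloc[case, dimension]` returns the whole series for that case and dimension, which is the access pattern used in the cells further down.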

## Loading a .ts file into a DataFrame
Use the following function to load the data:
```
load_from_tsfile_to_dataframe(full_file_path_and_name, return_separate_X_and_y, replace_missing_vals_with='NaN')
```

In [2]:
from sktime.utils.load_data import load_from_tsfile_to_dataframe
train_x, train_y = load_from_tsfile_to_dataframe("GunPoint_TRAIN.ts")
test_x, test_y = load_from_tsfile_to_dataframe("GunPoint_TEST.ts")

In [3]:
train_x.tail()

| | dim_0 |
|---|---|
| 45 | 0 -0.56491 1 -0.56505 2 -0.56648 3... |
| 46 | 0 -0.61464 1 -0.61499 2 -0.61479 3... |
| 47 | 0 -0.77913 1 -0.77838 2 -0.77574 3... |
| 48 | 0 -0.70303 1 -0.70262 2 -0.70250 3... |
| 49 | 0 -1.435700 1 -1.432300 2 -1.43290... |


In [4]:
train_x.head()

| | dim_0 |
|---|---|
| 0 | 0 -0.64789 1 -0.64199 2 -0.63819 3... |
| 1 | 0 -0.64443 1 -0.64540 2 -0.64706 3... |
| 2 | 0 -0.77835 1 -0.77828 2 -0.77715 3... |
| 3 | 0 -0.75006 1 -0.74810 2 -0.74616 3... |
| 4 | 0 -0.59954 1 -0.59742 2 -0.59927 3... |


In [5]:
train_y[0:5]

array(['2', '2', '1', '1', '2'], dtype='<U1')

In [6]:
train_x.shape

(50, 1)

In [7]:
test_x.shape

(150, 1)

In [8]:
train_x.iloc[0,0]

0     -0.64789
1     -0.64199
2     -0.63819
3     -0.63826
4     -0.63835
5     -0.63870
6     -0.64305
7     -0.64377
8     -0.64505
9     -0.64712
10    -0.64915
11    -0.65125
12    -0.65729
13    -0.66220
14    -0.66123
15    -0.66099
16    -0.66156
17    -0.66226
18    -0.66191
19    -0.66274
20    -0.66093
21    -0.66345
22    -0.66219
23    -0.66234
24    -0.66171
25    -0.66139
26    -0.66141
27    -0.66145
28    -0.66037
29    -0.65911
        ...   
120   -0.66464
121   -0.66412
122   -0.66430
123   -0.66120
124   -0.65935
125   -0.65258
126   -0.64332
127   -0.63887
128   -0.63656
129   -0.63317
130   -0.63304
131   -0.63301
132   -0.63303
133   -0.63522
134   -0.63447
135   -0.63579
136   -0.63628
137   -0.63540
138   -0.63607
139   -0.63755
140   -0.63926
141   -0.63972
142   -0.63973
143   -0.64018
144   -0.63923
145   -0.63939
146   -0.64023
147   -0.64043
148   -0.63867
149   -0.63866
Length: 150, dtype: float64

In [9]:
train_x.iloc[49,0]

0     -1.435700
1     -1.432300
2     -1.432900
3     -1.431600
4     -1.432600
5     -1.432300
6     -1.433500
7     -1.432400
8     -1.433300
9     -1.430000
10    -1.429700
11    -1.429100
12    -1.429300
13    -1.426700
14    -1.426700
15    -1.423800
16    -1.425800
17    -1.426200
18    -1.426000
19    -1.410300
20    -1.334600
21    -1.208900
22    -1.000400
23    -0.753660
24    -0.464170
25    -0.193930
26     0.002154
27     0.251830
28     0.408640
29     0.540030
         ...   
120    0.170290
121    0.027367
122   -0.155540
123   -0.310180
124   -0.494360
125   -0.672460
126   -0.852330
127   -1.044900
128   -1.208500
129   -1.343200
130   -1.453200
131   -1.529200
132   -1.572200
133   -1.587400
134   -1.586400
135   -1.578100
136   -1.557200
137   -1.535700
138   -1.511500
139   -1.492400
140   -1.478000
141   -1.461900
142   -1.447400
143   -1.438900
144   -1.437800
145   -1.436900
146   -1.434500
147   -1.435500
148   -1.435300
149   -1.430900
Length: 150, dtype: float64

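Loading a multivariate time series dataset (this cell reads the TEST split of BasicMotions, although the variables keep the `train_` prefix):

In [10]:
train_x, train_y = load_from_tsfile_to_dataframe("BasicMotions_TEST.ts")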

In [11]:
train_x.head()

| | dim_0 | dim_1 | dim_2 | dim_3 | dim_4 | dim_5 |
|---|---|---|---|---|---|---|
| 0 | 0 -0.740653 1 -0.740653 2 10.20844... | 0 0.756509 1 0.756509 2 -9.216970 3... | 0 -0.275809 1 -0.275809 2 -12.37890... | 0 -0.423476 1 -0.423476 2 -14.69915... | 0 0.013317 1 0.013317 2 4.578337 3... | 0 0.013317 1 0.013317 2 -5.055081 3... |
| 1 | 0 -0.247409 1 -0.247409 2 -0.771290 3... | 0 -0.060459 1 -0.060459 2 -0.047618 3... | 0 -0.608565 1 -0.608565 2 -0.294411 3... | 0 -0.023970 1 -0.023970 2 -0.269001 3... | 0 0.101208 1 0.101208 2 0.111862 3... | 0 0.071911 1 0.071911 2 0.135832 3... |
| 2 | 0 -0.663284 1 -0.663284 2 5.393924 3... | 0 0.273010 1 0.273010 2 -3.079673 3... | 0 -0.160963 1 -0.160963 2 -3.175911 3... | 0 -0.245030 1 -0.245030 2 -6.408074 3... | 0 -0.077238 1 -0.077238 2 0.471417 3... | 0 -0.018644 1 -0.018644 2 -3.592890 3... |
| 3 | 0 -1.088052 1 -1.088052 2 -0.683620 3... | 0 0.183832 1 0.183832 2 -2.909047 3... | 0 -0.260871 1 -0.260871 2 1.507042 3... | 0 -0.284981 1 -0.284981 2 0.415486 3... | 0 0.487397 1 0.487397 2 0.013317 3... | 0 1.081329 1 1.081329 2 0.820319 3... |
| 4 | 0 0.354481 1 0.354481 2 0.449142 3... | 0 -0.567671 1 -0.567671 2 -1.899854 3... | 0 -0.084270 1 -0.084270 2 0.913056 3... | 0 -0.223723 1 -0.223723 2 0.692477 3... | 0 -0.247694 1 -0.247694 2 0.149149 3... | 0 0.050604 1 0.050604 2 0.849616 3... |


In [12]:
train_x.shape

(40, 6)

In [13]:
import pandas as pd
# Only the last expression in a notebook cell is displayed, so show dim_1 of the first case
pd.DataFrame(train_x.iloc[0,1])

| | 0 |
|---|---|
| 0 | 0.756509 |
| 1 | 0.756509 |
| 2 | -9.216970 |
| 3 | -5.977115 |
| 4 | -3.711996 |
| 5 | -3.711996 |
| 6 | 0.731686 |
| 7 | 1.070128 |
| 8 | 0.582956 |
| 9 | 0.656840 |


## Loading Weka ARFF files

### Loading univariate time series data

In [14]:
from sktime.utils.load_data import load_from_arff_to_dataframe
X, y = load_from_arff_to_dataframe("GunPoint_TRAIN.arff")
X.head()

| | dim_0 |
|---|---|
| 0 | 0 -0.64789 1 -0.64199 2 -0.63819 3... |
| 1 | 0 -0.64443 1 -0.64540 2 -0.64706 3... |
| 2 | 0 -0.77835 1 -0.77828 2 -0.77715 3... |
| 3 | 0 -0.75006 1 -0.74810 2 -0.74616 3... |
| 4 | 0 -0.59954 1 -0.59742 2 -0.59927 3... |


### Loading multivariate time series data

In [15]:
X, y = load_from_arff_to_dataframe("BasicMotions_TRAIN.arff")
X.head()

| | dim_0 | dim_1 | dim_2 | dim_3 | dim_4 | dim_5 |
|---|---|---|---|---|---|---|
| 0 | 0 0.079106 1 0.079106 2 -0.903497 3... | 0 0.394032 1 0.394032 2 -3.666397 3... | 0 0.551444 1 0.551444 2 -0.282844 3... | 0 0.351565 1 0.351565 2 -0.095881 3... | 0 0.023970 1 0.023970 2 -0.319605 3... | 0 0.633883 1 0.633883 2 0.972131 3... |
| 1 | 0 0.377751 1 0.377751 2 2.952965 3... | 0 -0.610850 1 -0.610850 2 0.970717 3... | 0 -0.147376 1 -0.147376 2 -5.962515 3... | 0 -0.103872 1 -0.103872 2 -7.593275 3... | 0 -0.109198 1 -0.109198 2 -0.697804 3... | 0 -0.037287 1 -0.037287 2 -2.865789 3... |
| 2 | 0 -0.813905 1 -0.813905 2 -0.424628 3... | 0 0.825666 1 0.825666 2 -1.305033 3... | 0 0.032712 1 0.032712 2 0.826170 3... | 0 0.021307 1 0.021307 2 -0.372872 3... | 0 0.122515 1 0.122515 2 -0.045277 3... | 0 0.775041 1 0.775041 2 0.383526 3... |
| 3 | 0 0.289855 1 0.289855 2 -0.669185 3... | 0 0.284130 1 0.284130 2 -0.210466 3... | 0 0.213680 1 0.213680 2 0.252267 3... | 0 -0.314278 1 -0.314278 2 0.018644 3... | 0 0.074574 1 0.074574 2 0.007990 3... | 0 -0.079901 1 -0.079901 2 0.237040 3... |
| 4 | 0 -0.123238 1 -0.123238 2 -0.249547 3... | 0 0.379341 1 0.379341 2 0.541501 3... | 0 -0.286006 1 -0.286006 2 0.208420 3... | 0 -0.098545 1 -0.098545 2 -0.023970 3... | 0 0.058594 1 0.058594 2 0.175783 3... | 0 -0.074574 1 -0.074574 2 0.114525 3... |


In [16]:
X.shape

(40, 6)

## Using long-format data in sktime

In [17]:
from sktime.utils.load_data import generate_example_long_table, from_long_to_nested

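Generate example data in long format with 10 cases, 4 dimensions, and a series length of 10, where:
- case_id: the case (sample) index,
- dim_id: the dimension (variable) index,
- reading_id: the position of the reading within the series,
- value: the value of the time series at that position.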

In [18]:
X = generate_example_long_table(num_cases=10, series_len=10, num_dims=4)
X.head()

| | case_id | dim_id | reading_id | value |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0.093836 |
| 1 | 0 | 0 | 1 | 0.216825 |
| 2 | 0 | 0 | 2 | 0.048565 |
| 3 | 0 | 0 | 3 | 0.568434 |
| 4 | 0 | 0 | 4 | 0.30628 |


In [19]:
X.tail()

| | case_id | dim_id | reading_id | value |
|---|---|---|---|---|
| 395 | 9 | 3 | 5 | 0.779468 |
| 396 | 9 | 3 | 6 | 0.790262 |
| 397 | 9 | 3 | 7 | 0.672735 |
| 398 | 9 | 3 | 8 | 0.004834 |
| 399 | 9 | 3 | 9 | 0.8242 |


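Convert the long-format data to the nested (wide) format:  
each column is a dimension (variable), each row is a case, and each cell of the DataFrame is a series.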

In [20]:
X_nested = from_long_to_nested(X)
X_nested.head()

| | dim_0 | dim_1 | dim_2 | dim_3 |
|---|---|---|---|---|
| 0 | 0 0.093836 1 0.216825 2 0.048565 3 ... | 0 0.152426 1 0.344559 2 0.384267 3 ... | 0 0.857559 1 0.740584 2 0.553719 3 ... | 0 0.780016 1 0.646227 2 0.439047 3 ... |
| 1 | 0 0.169883 1 0.120200 2 0.929857 3 ... | 0 0.174235 1 0.205274 2 0.685471 3 ... | 0 0.067208 1 0.274968 2 0.364168 3 ... | 0 0.128402 1 0.027613 2 0.808922 3 ... |
| 2 | 0 0.745527 1 0.654558 2 0.949356 3 ... | 0 0.307235 1 0.500242 2 0.172114 3 ... | 0 0.744760 1 0.004578 2 0.432622 3 ... | 0 0.164665 1 0.650588 2 0.395661 3 ... |
| 3 | 0 0.011159 1 0.460379 2 0.128935 3 ... | 0 0.974301 1 0.955067 2 0.217221 3 ... | 0 0.936015 1 0.314994 2 0.195095 3 ... | 0 0.541958 1 0.014991 2 0.925976 3 ... |
| 4 | 0 0.810218 1 0.970045 2 0.865790 3 ... | 0 0.301435 1 0.199742 2 0.649312 3 ... | 0 0.740288 1 0.804767 2 0.645713 3 ... | 0 0.658385 1 0.655386 2 0.554100 3 ... |


In [21]:
X_nested.iloc[0,0]

0    0.093836
1    0.216825
2    0.048565
3    0.568434
4    0.306280
5    0.401763
6    0.859552
7    0.460126
8    0.445671
9    0.122521
dtype: float64
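
To show what the long-to-nested conversion amounts to, the same reshaping can be approximated with plain pandas. This is only a sketch under stated assumptions: `long_to_nested` is a hypothetical function written for this example and is not sktime's implementation, the column names match the long table above, and the values are made up so that each one encodes its own case, dimension, and reading.

```python
import pandas as pd

# A tiny long table: 2 cases, 2 dimensions, 3 readings each (made-up values).
rows = [
    (case, dim, reading, float(case * 100 + dim * 10 + reading))
    for case in range(2) for dim in range(2) for reading in range(3)
]
long_df = pd.DataFrame(rows, columns=["case_id", "dim_id", "reading_id", "value"])


def long_to_nested(long_df):
    """Reshape a long table into one row per case and one Series per cell.

    Hypothetical sketch; not the sktime implementation.
    """
    cases = sorted(long_df["case_id"].unique())
    dims = sorted(long_df["dim_id"].unique())
    columns = {}
    for dim in dims:
        cells = []
        for case in cases:
            mask = (long_df["case_id"] == case) & (long_df["dim_id"] == dim)
            # Order readings by position and drop the long-table index.
            values = long_df.loc[mask].sort_values("reading_id")["value"]
            cells.append(values.reset_index(drop=True))
        columns[f"dim_{dim}"] = pd.Series(cells, index=cases)
    return pd.DataFrame(columns)


nested = long_to_nested(long_df)
print(nested.shape)                 # (2, 2)
print(nested.iloc[1, 0].tolist())   # [100.0, 101.0, 102.0]
```

The explicit loops keep the logic easy to follow; for large tables a grouped approach would likely be faster.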