### Application specific unit (ASU)
The ASU is a non-negative integer representing the application specific unit (see the SPC Global Parameter
specification document). This is a zero based, monotonically increasing number. The first record in the
trace file need not have the ASU equal to zero; however, unit zero must exist within the trace file. If there
are a total of n units described in the complete trace file, then the trace file must contain at least one
record for each of units 0 through n-1.
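
The coverage rule can be checked mechanically. The following is a minimal sketch and not part of the SPC specification: `missing_asus` is a hypothetical helper, and it assumes n is one more than the largest ASU number seen, since the trace records themselves do not state n.

```python
def missing_asus(asu_values):
    """Return the ASU numbers in 0..max(asu_values) that never appear.

    Assumption: n is taken to be max(asu_values) + 1.
    """
    seen = set(asu_values)
    if not seen:
        return []
    return [unit for unit in range(max(seen) + 1) if unit not in seen]


print(missing_asus([1, 0, 2, 0]))  # [] -> units 0..2 are all present
print(missing_asus([0, 3, 1]))     # [2] -> unit 2 has no record
```

An empty result means every unit from 0 through the largest one seen has at least one record.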

### Logical block address (LBA)
The LBA field is a non-negative integer that describes the ASU block offset of the data transfer for this record,
where the size of a block is contained in the description of the trace file. This offset is zero based, and
may range from 0 to n-1, where n is the capacity in blocks of the ASU. There is no other upper limit on this
field beyond the restriction that the sum of the address and size fields must be less than or equal to the
capacity of the ASU.

### Size
The size field is a non-negative integer that describes the number of **bytes** transferred for this record. A value
of zero is legal; its effect is I/O subsystem dependent. Although the majority of records are
anticipated to be multiples of 512 bytes, this is not required. There is no upper limit on this field, other
than the restriction that the sum of the address and size fields must be less than or equal to the capacity of
the ASU.
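
The address is expressed in blocks and the size in bytes, so the bound on their sum needs a common unit. The sketch below is one possible reading, not the specification's definition: it converts both to bytes. The helper name `fits_in_asu`, the 512-byte default block size, and the `capacity_blocks` argument are assumptions for illustration.

```python
def fits_in_asu(lba, size_bytes, capacity_blocks, block_size=512):
    """Check that a transfer ends at or before the end of the ASU.

    Assumption: the address/size bound is evaluated in bytes, with the
    block size taken from the trace file description (512 here).
    """
    end_byte = lba * block_size + size_bytes
    return end_byte <= capacity_blocks * block_size


print(fits_in_asu(7, 512, capacity_blocks=8))  # True: ends exactly at capacity
print(fits_in_asu(7, 513, capacity_blocks=8))  # False: one byte past the end
```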

### Opcode
The opcode field is a single, case insensitive character that defines the direction of the transfer. There are
two possible values for this field:
1. “R” (or “r”) indicates a read operation. This implies data transfer from the ASU to the host computer.
2. “W” (or “w”) indicates a write operation. This implies data transfer to the ASU from the host computer.
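
A reader of the trace might normalise the opcode before using it. This is a minimal sketch; `parse_opcode` and its return values `"read"` and `"write"` are hypothetical names chosen for illustration.

```python
def parse_opcode(field):
    """Map a single-character opcode to 'read' or 'write', ignoring case."""
    if len(field) != 1:
        raise ValueError(f"opcode must be a single character: {field!r}")
    op = field.lower()
    if op == "r":
        return "read"
    if op == "w":
        return "write"
    raise ValueError(f"unknown opcode: {field!r}")


print(parse_opcode("R"))  # read
print(parse_opcode("w"))  # write
```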

### Timestamp
The timestamp field is a non-negative real number representing the offset in **seconds** for this I/O from the
start of the trace. The format of this field is “s.d”, where “s” represents the integer portion, and “d”
represents the fractional portion of the timestamp. Both the integer and fractional parts of the field must
be present. The value of this field must be greater than or equal to that of every preceding record, and less than or
equal to that of every succeeding record. The first record need not have a value of “0.0”.
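
Both timestamp rules, the “s.d” format and the non-decreasing order, are easy to test. The following is a sketch under the assumption that “s” and “d” are plain decimal digit strings; the helper names are hypothetical.

```python
import re

# Assumption: "s" and "d" are each one or more decimal digits.
_TIMESTAMP = re.compile(r"\d+\.\d+")


def is_valid_timestamp(text):
    """True if the text has both an integer and a fractional part."""
    return _TIMESTAMP.fullmatch(text) is not None


def is_non_decreasing(values):
    """True if no timestamp is smaller than the one before it."""
    return all(a <= b for a, b in zip(values, values[1:]))


print(is_valid_timestamp("0.026214"))                # True
print(is_valid_timestamp("5"))                       # False: no fractional part
print(is_non_decreasing([0.0, 0.0, 0.026214]))       # True: equal values are allowed
print(is_non_decreasing([0.117964, 0.078643]))       # False
```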

[SPC-Traces](https://traces.cs.umass.edu/index.php/Storage/Storage)

---

### The Logical Block Address corresponds to the HDD's SectorId, so it has been replaced with SectorId.
### One sector is 512 bytes and each page is set to 4 KB, so one page holds 8 sectors.
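
With 512-byte sectors and 4 KB pages, integer division splits a sector number into a page number and a position inside that page. A minimal sketch, using the same constant values as the notebook cells that follow; the helper name `sector_to_page` is hypothetical.

```python
SECTOR_SIZE = 512                             # bytes per sector
PAGE_SIZE = 4 * 1024                          # bytes per page
SECTORS_PER_PAGE = PAGE_SIZE // SECTOR_SIZE   # 8


def sector_to_page(sector_id):
    """Return (page_id, sector_offset_within_page) for a sector number."""
    return divmod(sector_id, SECTORS_PER_PAGE)


print(sector_to_page(303567))  # (37945, 7)
print(sector_to_page(55590))   # (6948, 6)
```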

In [16]:
import pandas as pd
import numpy as np

In [17]:
# define Constants
byte = 1
KB = 1024 * byte
SECTOR_SIZE = 512 * byte
PAGE_SIZE = 4 * KB
SECTORS_PER_PAGE = PAGE_SIZE // SECTOR_SIZE

In [18]:
columns = ['ASU', 'SectorId', 'Size', 'Opcode', 'Timestamp']
df = pd.read_csv('../csv/raw/Financial1.csv', header=None, names=columns)
print(df.head())
print("Size of Financial1.csv : ", len(df))

   ASU  SectorId  Size Opcode  Timestamp
0    0    303567  3584      w   0.000000
1    1     55590  3072      w   0.000000
2    0    303574  3584      w   0.026214
3    1    240840  3072      w   0.026214
4    1     55596  3072      r   0.078643
Size of Financial1.csv :  5334987


In [19]:
df_w = df[df['Opcode'].str.lower() == 'w'].copy()  # keep only write operations (Opcode 'w' or 'W')
df_w.drop(['ASU', 'Opcode'], axis=1, inplace=True)  # drop the 'ASU' and 'Opcode' columns
df_w.drop(df_w[df_w['Size'] == 0].index, inplace=True)  # remove rows whose Size is 0
df_w['# of Sectors'] = df_w['Size'].div(SECTOR_SIZE).apply(np.ceil)  # convert byte size to a sector count, rounding up
df_w['# of Sectors'] = df_w['# of Sectors'].astype('int32')
df_w['SectorId'] = df_w['SectorId'].astype(str)
df_w['Size'] = df_w['Size'].astype('int32')
df_w.head(50)

Unnamed: 0,SectorId,Size,Timestamp,# of Sectors
0,303567,3584,0.0,7
1,55590,3072,0.0,6
2,303574,3584,0.026214,7
3,240840,3072,0.026214,6
5,303581,3584,0.117964,7
6,55596,3072,0.117964,6
7,303588,3584,0.530841,7
8,55596,3072,0.530841,6
9,303595,3584,0.550502,7
10,240840,3072,0.550502,6


In [20]:
print(len(df_w))

4099353


In [21]:
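datas = []
for sector_id, size, timestamp, numofsectors in df_w.values:
    # Emit one record per 4 KB page covered by this write.
    # The starting offset inside the first page is ignored.
    for i in range(numofsectors // SECTORS_PER_PAGE + 1):
        if size <= 0:  # nothing left to assign; also guards against a negative remainder
            break
        datas.append([int(sector_id) // SECTORS_PER_PAGE + i, min(PAGE_SIZE, size), timestamp])
        size -= PAGE_SIZE

# A flat list of names gives ordinary columns; a nested list would create a MultiIndex.
total_df = pd.DataFrame(datas, columns=['PageId', 'Size', 'Timestamp'])
total_df.head(50)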

Unnamed: 0,PageId,Size,Timestamp
0,37945,3584,0.0
1,6948,3072,0.0
2,37946,3584,0.026214
3,30105,3072,0.026214
4,37947,3584,0.117964
5,6949,3072,0.117964
6,37948,3584,0.530841
7,6949,3072,0.530841
8,37949,3584,0.550502
9,30105,3072,0.550502
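
The page-splitting step can be isolated as a small function. The sketch below uses plain Python and the hypothetical name `split_into_pages`. Like the loop above, it assigns pages starting from `sector_id // 8` and ignores where inside the first page the write begins.

```python
PAGE_SIZE = 4 * 1024
SECTORS_PER_PAGE = 8


def split_into_pages(sector_id, size):
    """Split a write of `size` bytes into (page_id, bytes_in_page) chunks."""
    first_page = sector_id // SECTORS_PER_PAGE
    chunks = []
    page_offset = 0
    while size > 0:
        chunk = min(PAGE_SIZE, size)          # at most one full page per record
        chunks.append((first_page + page_offset, chunk))
        size -= chunk
        page_offset += 1
    return chunks


print(split_into_pages(303567, 3584))  # [(37945, 3584)]
print(split_into_pages(16, 10000))     # [(2, 4096), (3, 4096), (4, 1808)]
```

The second call shows a write larger than one page: two full pages followed by the 1808-byte remainder.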


In [25]:
# Split the data with model : simulation = 3 : 7
total_rows = len(total_df)

# Use the first 30% of all rows for model training.
model_rows = int(total_rows * 0.3)

# Use the remaining 70% for the simulation.
model_df = total_df.iloc[:model_rows]
simulation_df = total_df.iloc[model_rows:]
print(len(model_df))
print(len(simulation_df))

1688474
3939773
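
The two printed lengths follow directly from the row count. The two lengths add up to 5,628,247 rows; truncating 30% of that gives the model share, and the remainder goes to the simulation. A small sketch of the arithmetic, with the hypothetical helper name `split_sizes`:

```python
def split_sizes(total_rows, model_fraction=0.3):
    """Return (model_rows, simulation_rows) for a split by row position."""
    model_rows = int(total_rows * model_fraction)  # int() truncates toward zero
    return model_rows, total_rows - model_rows


print(split_sizes(5628247))  # (1688474, 3939773)
```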


In [27]:
model_df.head()

Unnamed: 0,PageId,Size,Timestamp
0,37945,3584,0.0
1,6948,3072,0.0
2,37946,3584,0.026214
3,30105,3072,0.026214
4,37947,3584,0.117964


In [26]:
model_df.to_csv("../csv/preprocessed/iotrace_model.csv", header=None, index=False)
simulation_df.to_csv("../csv/preprocessed/iotrace_simulation.csv", header=None, index=False)