## EDA

Data available [here](https://kzn-swift.massopen.cloud/swift/v1/devicehealth/).

In [16]:
import os
import gc
import wget
import json
from tqdm import tqdm
from zipfile import ZipFile

import numpy as np
import scipy as sp
import pandas as pd

from matplotlib import pyplot as plt
import seaborn as sns

In [2]:
month = "device_health_metrics_2020-01.zip"

In [3]:
url = f'https://kzn-swift.massopen.cloud/swift/v1/devicehealth/{month}'

In [4]:
zipped_data = wget.download(url)

In [5]:
# extract the csv from the downloaded archive
# (wget.download returns the local filename of the downloaded zip)
with ZipFile(zipped_data, 'r') as myzip:
    myzip.extract('device_health_metrics_2020-01.csv')

In [7]:
health_data = pd.read_csv("device_health_metrics_2020-01.csv")
health_data.head()

Unnamed: 0,ts,device_uuid,invalid,report
0,2020-01-20 15:01:17,30c613da-3ee6-11ea-afb2-0025900057ea,t,"{""dev"": ""/dev/sdb"", ""error"": ""smartctl failed""..."
1,2020-01-20 15:01:18,30b5dc0e-3ee6-11ea-afb2-0025900057ea,t,"{""dev"": ""/dev/sdc"", ""error"": ""smartctl failed""..."
2,2020-01-20 15:01:19,30a59e48-3ee6-11ea-afb2-0025900057ea,t,"{""dev"": ""/dev/sdd"", ""error"": ""smartctl failed""..."
3,2020-01-20 15:01:20,30b34b24-3ee6-11ea-afb2-0025900057ea,t,"{""dev"": ""/dev/sde"", ""error"": ""smartctl failed""..."
4,2020-01-20 15:01:20,30bcc5c8-3ee6-11ea-afb2-0025900057ea,t,"{""dev"": ""/dev/sdf"", ""error"": ""smartctl failed""..."


## Some EDA + Unrolling

In [8]:
# how many devices do we have data from?
health_data['device_uuid'].nunique()

89

In [9]:
# how many data points do we have from each device?
health_data['device_uuid'].value_counts()

30a81844-3ee6-11ea-afb2-0025900057ea    17
30c613da-3ee6-11ea-afb2-0025900057ea    17
30cb3888-3ee6-11ea-afb2-0025900057ea    16
30aef7c2-3ee6-11ea-afb2-0025900057ea    16
30a08c82-3ee6-11ea-afb2-0025900057ea    16
                                        ..
5bfae076-4459-11ea-a135-0cc47ad2c770     4
5c0dbfb6-4459-11ea-a135-0cc47ad2c770     4
0e8d2da2-4390-11ea-8497-0cc47a635394     4
6942c3b3-3c97-11ea-aeb4-002590005994     4
a26045dd-40a4-11ea-aeb4-002590005994     1
Name: device_uuid, Length: 89, dtype: int64

In [10]:
# how many of these data points had valid data? (invalid == 'f' means valid)
health_data.invalid.value_counts()

f    792
t    310
Name: invalid, dtype: int64

**RESULT** Based on the above outputs, we have data from 89 unique devices. The number of data points per device ranges from 1 to 17. Furthermore, 792 of the 1,102 data points (roughly 72%) are marked as valid.
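
As a quick check of the 72% figure, the share can be recomputed from the counts printed above. This is only a minimal sketch: the `counts` dict hard-codes the two values from the output instead of reading them from `health_data`.

```python
# counts copied from the `invalid` value_counts output above
counts = {"f": 792, "t": 310}

total = sum(counts.values())
valid_share = counts["f"] / total

print(f"{counts['f']} of {total} data points are valid ({valid_share:.1%})")
```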

In [11]:
# drop invalid data (copy so that later column assignments do not act on a view)
health_data = health_data[health_data['invalid']=='f'].copy()
health_data.shape

(792, 4)

In [12]:
# convert json strings to python dicts
health_data['report'] = health_data['report'].apply(json.loads)

In [13]:
# unroll device data column to get a flat df of features
# (passing a list gives one row per report with a fresh 0..n-1 index)
unrolled_health_data = pd.json_normalize(health_data['report'].tolist())
unrolled_health_data.head()

Unnamed: 0,vendor,host_id,product,revision,model_name,nvme_vendor,scsi_version,rotation_rate,logical_block_size,json_format_version,...,nvme_smart_health_information_log.media_errors,nvme_smart_health_information_log.power_cycles,nvme_smart_health_information_log.power_on_hours,nvme_smart_health_information_log.data_units_read,nvme_smart_health_information_log.unsafe_shutdowns,nvme_smart_health_information_log.data_units_written,nvme_smart_health_information_log.num_err_log_entries,nvme_smart_health_information_log.controller_busy_time,nvme_total_capacity,nvme_unallocated_capacity
0,Hitachi,6942c3b2-3c97-11ea-aeb4-002590005994,HUA722010CLA330,R001,Hitachi HUA722010CLA330,hitachi,SPC-3,10000.0,512,"[1, 0]",...,,,,,,,,,,
1,Seagate,30930e18-3ee6-11ea-afb2-0025900057ea,ST31000528AS,R001,Seagate ST31000528AS,seagate,SPC-3,10000.0,512,"[1, 0]",...,,,,,,,,,,
2,Hitachi,30957ee6-3ee6-11ea-afb2-0025900057ea,HUA722010CLA330,R001,Hitachi HUA722010CLA330,hitachi,SPC-3,10000.0,512,"[1, 0]",...,,,,,,,,,,
3,Hitachi,3099dcb6-3ee6-11ea-afb2-0025900057ea,HUA722010CLA330,R001,Hitachi HUA722010CLA330,hitachi,SPC-3,10000.0,512,"[1, 0]",...,,,,,,,,,,
4,Hitachi,30957ee6-3ee6-11ea-afb2-0025900057ea,HUA722010CLA330,R001,Hitachi HUA722010CLA330,hitachi,SPC-3,10000.0,512,"[1, 0]",...,,,,,,,,,,
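
To make the "unrolling" step more concrete, here is a rough pure-Python sketch of the idea: nested dicts are flattened into a single dict with dotted keys such as `smartctl.exit_status`, while lists are kept as values. The `flatten` helper and the trimmed `example_report` are hypothetical and only for illustration; this is not how `pd.json_normalize` is implemented.

```python
import json


def flatten(nested, sep=".", prefix=""):
    """Flatten nested dicts into one dict with dotted keys; lists are kept as-is."""
    flat = {}
    for key, value in nested.items():
        name = f"{prefix}{sep}{key}" if prefix else key
        if isinstance(value, dict):
            # recurse into nested dicts, extending the dotted prefix
            flat.update(flatten(value, sep=sep, prefix=name))
        else:
            flat[name] = value
    return flat


# hypothetical, heavily trimmed report; real reports contain many more fields
example_report = json.loads(
    '{"vendor": "Hitachi", "smartctl": {"exit_status": 0},'
    ' "ata_smart_attributes": {"table": [{"id": 5, "value": 100}]}}'
)
flat_report = flatten(example_report)
print(flat_report)
```

Each flattened report then corresponds to one row of the table above, with the dotted keys as column names.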


In [14]:
# for how many data points did smartctl run successfully?
unrolled_health_data['smartctl.exit_status'].value_counts()

0    493
4    299
Name: smartctl.exit_status, dtype: int64

**RESULT** From the above cell, smartctl exited with status 0 (no errors at all) for 493 of the 792 data points (about 62%). For the remaining 299, the exit status was 4 (i.e. bit 2 was set), which means that some SMART or other ATA command to the disk failed, or there was a checksum error in a SMART data structure (as per the [smartctl man page](https://linux.die.net/man/8/smartctl)), so some SMART attributes may be missing for those data points. In all cases, we have at least some valid SMART attributes from each device.
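
Since the smartctl exit status is a bitmask, it can help to decode it into its individual bits. The sketch below does that; the bit descriptions are paraphrased from the man page linked above, and the names `SMARTCTL_EXIT_BITS` and `decode_exit_status` are hypothetical helpers, not part of smartctl or pandas.

```python
# bit meanings paraphrased from the "Return Values" section of the smartctl man page
SMARTCTL_EXIT_BITS = {
    0: "command line did not parse",
    1: "device open failed, no IDENTIFY DEVICE structure returned, or device in low-power mode",
    2: "some SMART or other ATA command failed, or checksum error in a SMART data structure",
    3: "SMART status check returned DISK FAILING",
    4: "prefail attributes found at or below threshold",
    5: "SMART status is DISK OK, but some attributes were at or below threshold in the past",
    6: "device error log contains records of errors",
    7: "device self-test log contains records of errors",
}


def decode_exit_status(status):
    """Return the descriptions of all bits that are set in a smartctl exit status."""
    return [msg for bit, msg in SMARTCTL_EXIT_BITS.items() if status & (1 << bit)]


for status in (0, 4):
    print(status, decode_exit_status(status))
```

Applied to the data, something like `unrolled_health_data['smartctl.exit_status'].apply(decode_exit_status)` would give the decoded flags per data point.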

In [17]:
# extract smart metrics
smart_metrics_df = unrolled_health_data['ata_smart_attributes.table'].to_frame()

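# iterate over rows by position; this assumes the default 0..n-1 index,
# so that .iloc[row_idx] and .at[row_idx, ...] refer to the same row
for row_idx in tqdm(range(len(smart_metrics_df))):
    # get the list of smart stats for the current data point (NaN if there is none)
    stats = smart_metrics_df.iloc[row_idx]['ata_smart_attributes.table']
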
    if isinstance(stats, list):
        for stat in stats:
            # extract normalized value, and int form of raw value
            smart_metrics_df.at[row_idx, 'smart_' + str(stat['id']) + '_normalized'] = stat['value']
            smart_metrics_df.at[row_idx, 'smart_' + str(stat['id']) + '_raw'] = stat['raw']['value']

smart_metrics_df.drop(columns=['ata_smart_attributes.table'], inplace=True)
smart_metrics_df.dropna(how='all').head()

100%|██████████| 792/792 [00:00<00:00, 1082.67it/s]


Unnamed: 0,smart_5_normalized,smart_5_raw,smart_9_normalized,smart_9_raw,smart_12_normalized,smart_12_raw,smart_177_normalized,smart_177_raw,smart_179_normalized,smart_179_raw,...,smart_206_normalized,smart_206_raw,smart_210_normalized,smart_210_raw,smart_246_normalized,smart_246_raw,smart_247_normalized,smart_247_raw,smart_248_normalized,smart_248_raw
223,100.0,0.0,99.0,1136.0,99.0,2.0,99.0,1.0,100.0,0.0,...,,,,,,,,,,
226,100.0,0.0,99.0,1148.0,99.0,2.0,100.0,0.0,100.0,0.0,...,,,,,,,,,,
227,100.0,0.0,62.0,33997.0,100.0,50.0,,,,,...,,,,,,,,,,
228,100.0,0.0,100.0,2070.0,100.0,2.0,,,,,...,,,,,,,,,,
229,100.0,0.0,62.0,33997.0,100.0,51.0,,,,,...,,,,,,,,,,
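
The per-row extraction above can also be expressed as a small function that maps one SMART attribute table to a flat dict. This is a sketch under the assumption that every table entry has the `id`, `value` and `raw.value` fields used in the loop; `flatten_smart_table` and `example_table` are hypothetical names, and the example values are borrowed from row 223 above.

```python
def flatten_smart_table(stats):
    """Turn one list of SMART attribute dicts into a flat {column: value} dict."""
    row = {}
    # data points without an ATA SMART table hold NaN instead of a list
    if isinstance(stats, list):
        for stat in stats:
            row[f"smart_{stat['id']}_normalized"] = stat["value"]
            row[f"smart_{stat['id']}_raw"] = stat["raw"]["value"]
    return row


# hypothetical table with two attributes, shaped like the entries used above
example_table = [
    {"id": 5, "value": 100, "raw": {"value": 0}},
    {"id": 9, "value": 99, "raw": {"value": 1136}},
]
example_row = flatten_smart_table(example_table)
print(example_row)
```

Building the frame in one pass, e.g. `pd.DataFrame([flatten_smart_table(s) for s in unrolled_health_data['ata_smart_attributes.table']])`, should give the same columns without growing the dataframe cell by cell, although dtypes may differ for columns that have no missing values.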


In [None]:
# TODO (iz): show variety of vendors

In [None]:
# TODO (iz): show different types of disks

In [None]:
# TODO (kc): failures vs. non-failures

In [None]:
# TODO (kc): how many disks smartctl ran on