# Seismic Bumps

## Description

The Seismic-Bumps dataset records, for each shift of mining activity, whether a high-energy seismic bump occurred during the following shift.

The column attributes are as follows:

| Column Name | Description |
| - | - |
| seismic | Seismic assessment result {'a': 'lack of hazard', 'b': 'low hazard', 'c': 'high hazard', 'd': 'danger state'} |
| seismoacoustic | Seismoacoustic assessment result {'a': 'lack of hazard', 'b': 'low hazard', 'c': 'high hazard', 'd': 'danger state'} |
| shift_type | Shift type {'W': 'coal-getting', 'N': 'preparation shift'} |
| energy | Seismic energy recorded within the previous shift by the most active geophone (GMax) |
| npulse | Number of pulses recorded within the previous shift by the most active geophone (GMax) |
| denergy | Deviation of energy recorded within the previous shift by the most active geophone (GMax) from the average energy recorded during the eight previous shifts |
| dpulse | Deviation of the number of pulses recorded within the previous shift by the most active geophone (GMax) from the average number of pulses recorded during the eight previous shifts |
| hazard | Result of shift seismic hazard assessment in the mine working obtained by the seismoacoustic method based on registration coming from the most active geophone (GMax) only |
| nbumps | Total number of seismic bumps recorded within the previous shift |
| nbumps2 | Number of seismic bumps registered within the previous shift in the [10^2, 10^3) energy range |
| nbumps3 | Number of seismic bumps registered within the previous shift in the [10^3, 10^4) energy range |
| nbumps4 | Number of seismic bumps registered within the previous shift in the [10^4, 10^5) energy range |
| nbumps5 | Number of seismic bumps registered within the previous shift in the [10^5, 10^6) energy range |
| nbumps6 | Number of seismic bumps registered within the previous shift in the [10^6, 10^7) energy range |
| nbumps7 | Number of seismic bumps registered within the previous shift in the [10^7, 10^8) energy range |
| nbumps8 | Number of seismic bumps registered within the previous shift in the [10^8, 10^10) energy range |
| t_energy | Total energy of the seismic bumps registered within the previous shift |
| max_energy | Maximum energy of the seismic bumps registered within the previous shift |
| is_hazardous | Class label: 0 ('non-hazardous state', no high-energy seismic bump occurred during the next shift) or 1 ('hazardous state', a high-energy seismic bump occurred during the next shift) |

[Source](https://archive.ics.uci.edu/ml/datasets/seismic-bumps)
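
The single-letter codes in the table can be translated into readable labels with `Series.map`. The snippet below is a minimal sketch: the rows in `sample` are made up for illustration and are not taken from the dataset, and the names `HAZARD_LABELS`, `SHIFT_LABELS`, `seismic_label` and `shift_label` are hypothetical.

```python
import pandas as pd

# Code-to-label mappings copied from the column table above.
HAZARD_LABELS = {
    "a": "lack of hazard",
    "b": "low hazard",
    "c": "high hazard",
    "d": "danger state",
}
SHIFT_LABELS = {"W": "coal-getting", "N": "preparation shift"}

# Hypothetical rows, for illustration only.
sample = pd.DataFrame({
    "seismic": ["a", "b", "a"],
    "shift_type": ["N", "W", "W"],
})

# Translate each code into its description; unknown codes would become NaN.
sample["seismic_label"] = sample["seismic"].map(HAZARD_LABELS)
sample["shift_label"] = sample["shift_type"].map(SHIFT_LABELS)
print(sample)
```

The same `HAZARD_LABELS` mapping applies to the `seismoacoustic` column, which uses the same four codes.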

## Importing the Dataset

In [None]:
import pandas as pd
from scipy.io import arff

column_names = ['seismic',
                'seismoacoustic',
                'shift_type',
                'energy',
                'npulse',
                'denergy',
                'dpulse',
                'hazard',
                'nbumps',
                'nbumps2',
                'nbumps3',
                'nbumps4',
                'nbumps5',
                'nbumps6',
                'nbumps7',
                'nbumps8',
                't_energy',
                'max_energy',
                'is_hazardous']

# Forward slashes avoid invalid escape sequences such as "\d" and "\s" and also work on Windows.
with open("../../datasets/classification/seismic_bumps.arff", "r") as dataset_file:
    raw_data, meta = arff.loadarff(dataset_file)
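
`arff.loadarff` returns two objects: a NumPy structured array holding the rows and a metadata object describing the attributes. The sketch below shows what each contains, using a tiny inline ARFF document instead of the real file. The attribute names and values in `sample_arff` are hypothetical and do not reproduce the actual file header.

```python
from io import StringIO

from scipy.io import arff

# A tiny, made-up ARFF document; the real file has many more attributes.
sample_arff = """@relation sample
@attribute seismic {a,b,c,d}
@attribute energy numeric
@attribute class {1,0}
@data
a,15180,0
b,14720,1
"""

# loadarff accepts a file-like object as well as an open file.
raw, meta = arff.loadarff(StringIO(sample_arff))

print(meta.names())  # attribute names declared in the header
print(meta.types())  # 'nominal' or 'numeric' for each attribute
print(raw.dtype)     # nominal attributes are stored as byte strings
```

Because nominal attributes come back as byte strings, they need to be decoded before they can be compared with ordinary Python strings.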

## Preparing the Dataset

In [None]:
# Convert the NumPy structured array to a pandas DataFrame, which allows a different datatype for each column.
processed_data = pd.DataFrame(raw_data.tolist(), columns=column_names)

# Decode string columns.
processed_data['seismic'] = processed_data['seismic'].str.decode('utf-8')
processed_data['seismoacoustic'] = processed_data['seismoacoustic'].str.decode('utf-8')
processed_data['shift_type'] = processed_data['shift_type'].str.decode('utf-8')
processed_data['hazard'] = processed_data['hazard'].str.decode('utf-8')

# Convert the target column to integers.
processed_data['is_hazardous'] = processed_data['is_hazardous'].astype(int)
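
Decoding each column by name works well for a handful of columns. As an alternative, the byte-string columns can be detected and decoded in a loop. The following is a minimal sketch on a small hypothetical frame (`df` and its values are made up for illustration), assuming that every byte-string column is UTF-8 encoded.

```python
import pandas as pd

# Hypothetical frame mimicking what loadarff produces for nominal attributes.
df = pd.DataFrame({
    "seismic": [b"a", b"b"],
    "energy": [15180.0, 14720.0],
    "is_hazardous": [b"0", b"1"],
})

# Byte strings are stored in object-dtype columns; check the first value to confirm.
byte_columns = [
    name for name in df.columns
    if df[name].dtype == object and isinstance(df[name].iloc[0], bytes)
]

# Decode every detected column to regular Python strings.
for name in byte_columns:
    df[name] = df[name].str.decode("utf-8")

# The target is numeric in meaning, so convert it after decoding.
df["is_hazardous"] = df["is_hazardous"].astype(int)
print(df.dtypes)
```

This approach avoids repeating the decode line per column, at the cost of inspecting only the first value of each column.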

The following block prints the shape and column datatypes of the processed dataset.

In [None]:
print(processed_data.shape)
print(processed_data.dtypes)