# Data Preprocessing

This activity uses the "SPECTF Heart" dataset (http://archive.ics.uci.edu/dataset/96/spectf+heart). The data covers 267 individuals, each described by 45 columns: the first column is the target variable and the remaining 44 are features.

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [4]:
# Read the train and test CSV files (the raw files have no header row)
train_data = pd.read_csv('../data/SPECTF.train', header=None, delimiter=',')
test_data = pd.read_csv('../data/SPECTF.test', header=None, delimiter=',')

In [5]:
print(f'Shape of the train dataset: {train_data.shape}')
train_data.head()

Shape of the train dataset: (80, 45)


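|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 35 | 36 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 59 | 52 | 70 | 67 | 73 | 66 | 72 | 61 | 58 | ... | 66 | 56 | 62 | 56 | 72 | 62 | 74 | 74 | 64 | 67 |
| 1 | 1 | 72 | 62 | 69 | 67 | 78 | 82 | 74 | 65 | 69 | ... | 65 | 71 | 63 | 60 | 69 | 73 | 67 | 71 | 56 | 58 |
| 2 | 1 | 71 | 62 | 70 | 64 | 67 | 64 | 79 | 65 | 70 | ... | 73 | 70 | 66 | 65 | 64 | 55 | 61 | 41 | 51 | 46 |
| 3 | 1 | 69 | 71 | 70 | 78 | 61 | 63 | 67 | 65 | 59 | ... | 61 | 61 | 66 | 65 | 72 | 73 | 68 | 68 | 59 | 63 |
| 4 | 1 | 70 | 66 | 61 | 66 | 61 | 58 | 69 | 69 | 72 | ... | 67 | 69 | 70 | 66 | 70 | 64 | 60 | 55 | 49 | 41 |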


In [6]:
print(f'Shape of the test dataset: {test_data.shape}')
test_data.head()

Shape of the test dataset: (187, 45)


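|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 35 | 36 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 67 | 68 | 73 | 78 | 65 | 63 | 67 | 60 | 63 | ... | 61 | 56 | 76 | 75 | 74 | 77 | 76 | 74 | 59 | 68 |
| 1 | 1 | 75 | 74 | 71 | 71 | 62 | 58 | 70 | 64 | 71 | ... | 66 | 62 | 68 | 69 | 69 | 66 | 64 | 58 | 57 | 52 |
| 2 | 1 | 83 | 64 | 66 | 67 | 67 | 74 | 74 | 72 | 64 | ... | 67 | 64 | 69 | 63 | 68 | 54 | 65 | 64 | 43 | 42 |
| 3 | 1 | 72 | 66 | 65 | 65 | 64 | 61 | 71 | 78 | 73 | ... | 69 | 68 | 68 | 63 | 71 | 72 | 65 | 63 | 58 | 60 |
| 4 | 1 | 62 | 60 | 69 | 61 | 63 | 63 | 70 | 68 | 70 | ... | 66 | 66 | 58 | 56 | 72 | 73 | 71 | 64 | 49 | 42 |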


The project instructions require the train and test sets to be concatenated into a single dataset:

In [8]:
dataset = pd.concat([train_data, test_data], axis=0, ignore_index=True)
print(f'Shape of the entire dataset: {dataset.shape}\n')
dataset.info()

Shape of the entire dataset: (267, 45)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 267 entries, 0 to 266
Data columns (total 45 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   0       267 non-null    int64
 1   1       267 non-null    int64
 2   2       267 non-null    int64
 3   3       267 non-null    int64
 4   4       267 non-null    int64
 5   5       267 non-null    int64
 6   6       267 non-null    int64
 7   7       267 non-null    int64
 8   8       267 non-null    int64
 9   9       267 non-null    int64
 10  10      267 non-null    int64
 11  11      267 non-null    int64
 12  12      267 non-null    int64
 13  13      267 non-null    int64
 14  14      267 non-null    int64
 15  15      267 non-null    int64
 16  16      267 non-null    int64
 17  17      267 non-null    int64
 18  18      267 non-null    int64
 19  19      267 non-null    int64
 20  20      267 non-null    int64
 21  21      267 non-null    int64
 22  22      26
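
To illustrate why the concatenation passes `ignore_index=True`, here is a minimal sketch on two toy frames. The names `toy_train` and `toy_test` and their values are made up for illustration and are not part of the SPECTF data.

```python
import pandas as pd

# Toy stand-ins for the train and test frames (values are made up).
toy_train = pd.DataFrame([[1, 59, 52], [0, 72, 62]])
toy_test = pd.DataFrame([[1, 67, 68], [0, 75, 74], [1, 83, 64]])

# Without ignore_index, each frame keeps its own row labels, so labels repeat.
stacked_keep = pd.concat([toy_train, toy_test], axis=0)

# With ignore_index=True, the rows are relabelled 0..n-1.
stacked_reset = pd.concat([toy_train, toy_test], axis=0, ignore_index=True)

print(list(stacked_keep.index))   # [0, 1, 0, 1, 2]
print(list(stacked_reset.index))  # [0, 1, 2, 3, 4]
```

A unique row index, as in the `RangeIndex` reported by `dataset.info()`, avoids ambiguous label-based lookups such as `dataset.loc[0]` returning two rows.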

In [9]:
dataset = dataset.add_prefix("feature_").rename(columns={'feature_0': 'target'})
dataset.head()

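|   | target | feature_1 | feature_2 | feature_3 | feature_4 | feature_5 | feature_6 | feature_7 | feature_8 | feature_9 | ... | feature_35 | feature_36 | feature_37 | feature_38 | feature_39 | feature_40 | feature_41 | feature_42 | feature_43 | feature_44 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 59 | 52 | 70 | 67 | 73 | 66 | 72 | 61 | 58 | ... | 66 | 56 | 62 | 56 | 72 | 62 | 74 | 74 | 64 | 67 |
| 1 | 1 | 72 | 62 | 69 | 67 | 78 | 82 | 74 | 65 | 69 | ... | 65 | 71 | 63 | 60 | 69 | 73 | 67 | 71 | 56 | 58 |
| 2 | 1 | 71 | 62 | 70 | 64 | 67 | 64 | 79 | 65 | 70 | ... | 73 | 70 | 66 | 65 | 64 | 55 | 61 | 41 | 51 | 46 |
| 3 | 1 | 69 | 71 | 70 | 78 | 61 | 63 | 67 | 65 | 59 | ... | 61 | 61 | 66 | 65 | 72 | 73 | 68 | 68 | 59 | 63 |
| 4 | 1 | 70 | 66 | 61 | 66 | 61 | 58 | 69 | 69 | 72 | ... | 67 | 69 | 70 | 66 | 70 | 64 | 60 | 55 | 49 | 41 |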
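
The renaming step can be sketched on a small toy frame to show what `add_prefix` followed by `rename` does to integer column labels. The frame `toy` and its values are made up, and the final split into `X` and `y` is a hypothetical next step, not something this notebook performs.

```python
import pandas as pd

# Toy frame with integer column labels, as produced by header=None
# (values are made up).
toy = pd.DataFrame([[1, 59, 52], [0, 72, 62]])

# add_prefix turns the integer labels 0, 1, 2 into the strings
# 'feature_0', 'feature_1', 'feature_2'; rename then relabels the first one.
toy = toy.add_prefix("feature_").rename(columns={"feature_0": "target"})

print(list(toy.columns))  # ['target', 'feature_1', 'feature_2']

# Hypothetical next step: separate the label from the predictors.
y = toy["target"]
X = toy.drop(columns="target")
```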


In [41]:
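# Save the preprocessed dataset; with the default settings the row index
# is written as the first, unnamed column of the CSV file
dataset.to_csv('../data/SPECTF_preprocessed.csv')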
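
Because `to_csv` writes the row index by default, a later `pd.read_csv` on the saved file will show an extra `Unnamed: 0` column unless the index column is declared. The following is a minimal round-trip sketch, assuming a toy frame and an in-memory buffer in place of the real file; `toy`, `buffer`, `naive`, and `restored` are illustrative names.

```python
import io

import pandas as pd

# Toy frame standing in for the preprocessed dataset (values are made up).
toy = pd.DataFrame({"target": [1, 0], "feature_1": [59, 72]})

# Default to_csv writes the row index as an extra first column
# with an empty header.
buffer = io.StringIO()
toy.to_csv(buffer)

# Reading back without further arguments exposes that column as 'Unnamed: 0'.
buffer.seek(0)
naive = pd.read_csv(buffer)

# Passing index_col=0 treats the first column as the index again.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)

print(list(naive.columns))     # ['Unnamed: 0', 'target', 'feature_1']
print(list(restored.columns))  # ['target', 'feature_1']
```

An alternative is to save with `index=False`, in which case the file can be read back without `index_col`; whichever option is used, the reading code should match it.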