# Data Preparation
In this section we prepare the data for machine learning in several steps. First, we load the data from its source. Then we select the predictor columns and the target column that the model will use, and validate the data to confirm that its types, classes, and value ranges are what we expect.

Next, we split the data into training, validation, and test sets so that the model can be trained and evaluated on separate data. Finally, we export the resulting sets as pickle files so that later stages can load them directly.

In [1]:
# add the parent directory to the module search path so that `src` can be imported
import sys
sys.path.append('..') 

In [2]:
## boilerplate
import src.util as utils

import pandas as pd
import os
from sklearn.model_selection import train_test_split

In [3]:
# move the working directory up one level, to the project root
os.chdir(os.path.dirname(os.getcwd()))
os.getcwd()

'd:\\ML\\PACMANN INTRO PROJECT - Gas Sensors Multi-Class Classification'

## Load config.yml configuration

In [4]:
config = utils.load_config()
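The contents of config.yml are not shown in this notebook. As a rough guide, the keys used in the cells of this section suggest a structure like the sketch below, written here as a Python dict. The column names and class names are taken from the outputs in this notebook; every path and the sensor range `[0, 1023]` are hypothetical placeholders, not the project's actual values.

```python
# Hypothetical shape of the loaded configuration (paths and range are placeholders).
sensor_columns = ["MQ2", "MQ3", "MQ5", "MQ6", "MQ7", "MQ8", "MQ135"]

example_config = {
    "dataset_path": "data/raw/gas_sensors.csv",                 # placeholder path
    "dataset_processed_path": "data/processed/dataset.pkl",     # placeholder path
    "predictors": sensor_columns,
    "label": "Gas",
    "int_columns": sensor_columns,   # inferred: all predictors are cast to int32
    "obj_columns": ["Gas"],          # inferred: the only object-typed column kept
    "target_classes": ["Mixture", "NoGas", "Perfume", "Smoke"],
    "range_sensor_val": [0, 1023],   # placeholder bounds
    # Each *_set_path holds two entries: [features file, labels file].
    "train_set_path": ["data/processed/X_train.pkl", "data/processed/y_train.pkl"],
    "val_set_path": ["data/processed/X_val.pkl", "data/processed/y_val.pkl"],
    "test_set_path": ["data/processed/X_test.pkl", "data/processed/y_test.pkl"],
}

# Quick sanity check that every key this notebook reads is present.
required_keys = [
    "dataset_path", "dataset_processed_path", "predictors", "label",
    "int_columns", "obj_columns", "target_classes", "range_sensor_val",
    "train_set_path", "val_set_path", "test_set_path",
]
missing = [key for key in required_keys if key not in example_config]
print(missing)
```

Checking for missing keys right after loading makes a misconfigured file fail early, rather than several cells later.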

### Load Data

In [5]:
df = pd.read_csv(config['dataset_path'])
df

Unnamed: 0,Serial Number,MQ2,MQ3,MQ5,MQ6,MQ7,MQ8,MQ135,Gas,Corresponding Image Name
0,0,555,515,377,338,666,451,416,NoGas,0_NoGas
1,1,555,516,377,339,666,451,416,NoGas,1_NoGas
2,2,556,517,376,337,666,451,416,NoGas,2_NoGas
3,3,556,516,376,336,665,451,416,NoGas,3_NoGas
4,4,556,516,376,337,665,451,416,NoGas,4_NoGas
...,...,...,...,...,...,...,...,...,...,...
6395,1595,658,445,455,414,491,321,436,Mixture,1595_Mixture
6396,1596,650,444,451,411,486,317,431,Mixture,1596_Mixture
6397,1597,630,443,446,407,474,312,429,Mixture,1597_Mixture
6398,1598,632,443,444,405,471,309,430,Mixture,1598_Mixture


### Feature Selection

In [6]:
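# for the moment, let's ignore the image photo column
# .copy() makes the selection an independent frame, so the dtype casts below do not trigger chained-assignment warnings
df = df.loc[:, config['predictors'] + [config['label']]].copy()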
df

Unnamed: 0,MQ2,MQ3,MQ5,MQ6,MQ7,MQ8,MQ135,Gas
0,555,515,377,338,666,451,416,NoGas
1,555,516,377,339,666,451,416,NoGas
2,556,517,376,337,666,451,416,NoGas
3,556,516,376,336,665,451,416,NoGas
4,556,516,376,337,665,451,416,NoGas
...,...,...,...,...,...,...,...,...
6395,658,445,455,414,491,321,436,Mixture
6396,650,444,451,411,486,317,431,Mixture
6397,630,443,446,407,474,312,429,Mixture
6398,632,443,444,405,471,309,430,Mixture


### Data Cleaning
The only cleaning needed here is to convert the data type of all predictors and to check the classes of the target column.

In [7]:
for column in config['int_columns']:
    df[column] = df[column].astype('int32')

In [8]:
df[config['obj_columns']].value_counts().index.tolist()

[('Mixture',), ('NoGas',), ('Perfume',), ('Smoke',)]

### Data Defense

These checks compare the data against the column data types, target classes, and sensor value range defined in config.yml.

In [9]:
def check_data(input_data, config):
    # Check data types: the integer and object columns must match the configured lists
    assert input_data.select_dtypes("int").columns.to_list() == config['int_columns'], "an error occurred in int columns"
    assert input_data.select_dtypes("object").columns.to_list() == config['obj_columns'], "an error occurred in object columns"

    # Check target classes (compared as sets, so the result does not depend on class order)
    assert set(input_data[config['label']].unique()) == set(config['target_classes']), "an error occurred in target classes check"

    # Check that every sensor reading lies within the configured range (both bounds inclusive)
    low, high = config['range_sensor_val'][0], config['range_sensor_val'][1]
    for column in input_data.filter(regex='MQ.*').columns.to_list():
        assert input_data[column].between(low, high).all(), "an error occurred in sensor values range check"

    print('Checking complete. Everything looks good.')

In [10]:
check_data(df, config)

Checking complete. Everything looks good.
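When the range check fails, the assertion message does not say which reading caused it. The sketch below, in plain Python, shows one way to locate offending values. The helper name `find_out_of_range`, the sample rows, and the bounds 0 and 1023 are all hypothetical; the real bounds come from `config['range_sensor_val']`.

```python
def find_out_of_range(rows, low, high, prefix="MQ"):
    """Return (row_index, column, value) for every sensor reading outside [low, high]."""
    problems = []
    for i, row in enumerate(rows):
        for column, value in row.items():
            # Only sensor columns are range-checked. Both bounds are inclusive,
            # which matches the default behaviour of pandas' Series.between.
            if column.startswith(prefix) and not (low <= value <= high):
                problems.append((i, column, value))
    return problems


# Hypothetical rows; the second one contains a deliberately invalid reading.
rows = [
    {"MQ2": 555, "MQ3": 515, "Gas": "NoGas"},
    {"MQ2": 1500, "MQ3": 516, "Gas": "NoGas"},
]

bad = find_out_of_range(rows, low=0, high=1023)
print(bad)  # [(1, 'MQ2', 1500)]
```

With a DataFrame, the same rows can be passed as `df.to_dict('records')`.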


### Data Splitting
In this stage the dataset is divided into training, validation, and test sets in a 60:20:20 ratio. The split is done in two passes: the first holds out 20% of the data as the test set, and the second takes 25% of the remaining 80% (0.25 × 0.8 = 0.2 of the full dataset) as the validation set. Setting the random_state parameter to 42 makes the split reproducible. The stratify parameter keeps the class proportions of the full dataset in each of the three sets, so no set ends up over- or under-representing a class.

In [11]:
X = df.drop(config['obj_columns'], axis=1)
y = df[config['obj_columns']]

X_traival, X_test, y_traival, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(X_traival, y_traival, test_size=0.25, random_state=42, stratify=y_traival)
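The two `test_size` values above can look inconsistent with a 60:20:20 split at first glance. The short check below works through the arithmetic with exact fractions. The row count of 1000 is a hypothetical round number chosen for illustration, not the size of this dataset.

```python
from fractions import Fraction

test_size = Fraction(1, 5)          # first call: test_size=0.2 of the full dataset
val_size_of_rest = Fraction(1, 4)   # second call: test_size=0.25 of what remains

# Express each set as a share of the full dataset.
test_share = test_size
val_share = val_size_of_rest * (1 - test_size)
train_share = 1 - test_share - val_share

print(train_share, val_share, test_share)  # 3/5 1/5 1/5

# Resulting set sizes for a hypothetical dataset of 1000 rows.
n_rows = 1000
counts = {
    "train": int(train_share * n_rows),
    "val": int(val_share * n_rows),
    "test": int(test_share * n_rows),
}
print(counts)  # {'train': 600, 'val': 200, 'test': 200}
```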

### Data Pickle Dump
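Save the processed dataset and the final train, validation, and test sets as pickle files. Each set is stored as two files, one for the features and one for the labels.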

In [12]:
utils.pickle_dump(df, config['dataset_processed_path'])

utils.pickle_dump(X_train, config['train_set_path'][0])
utils.pickle_dump(y_train, config['train_set_path'][1])

utils.pickle_dump(X_test, config['test_set_path'][0])
utils.pickle_dump(y_test, config['test_set_path'][1])

utils.pickle_dump(X_val, config['val_set_path'][0])
utils.pickle_dump(y_val, config['val_set_path'][1])
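The implementation of `utils.pickle_dump` is not shown in this notebook. A helper like it could plausibly be a thin wrapper around the standard `pickle` module, as in the sketch below. Both functions here are hypothetical stand-ins, and `pickle_load` is included only to demonstrate the round trip.

```python
import os
import pickle
import tempfile


def pickle_dump(obj, path):
    """Serialize obj to the file at path (hypothetical stand-in for utils.pickle_dump)."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)


def pickle_load(path):
    """Read back an object written by pickle_dump."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Round trip through a temporary directory so that no project file is touched.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.pkl")
    original = {"MQ2": [555, 556], "Gas": ["NoGas", "NoGas"]}
    pickle_dump(original, path)
    restored = pickle_load(path)

print(restored == original)  # True
```

Pickle files can execute arbitrary code when loaded, so they should only be read back from sources you trust.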