In [1]:
# Jupyter Notebook -- Dataset Preprocessing: Starting The Process of Cleaning and Organizing The Data
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import os

* pandas for reading, cleaning, and manipulating the data,
* NumPy for numerical calculations and for saving the final arrays,
* `train_test_split` from scikit-learn to divide the dataset into training, validation, and test sets,
* `StandardScaler` to standardize the numerical features: each feature is rescaled so that its mean is 0 and its standard deviation is 1, which puts all features on a comparable scale for model training,
* `os` for building the output file paths.
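To make the standardization concrete, here is a minimal sketch on a made-up feature matrix (the values in `X` are hypothetical and not taken from the dataset). It only checks that each column ends up with mean 0 and standard deviation 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy feature matrix: two columns on very different scales.
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0],
              [4.0, 400.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# After scaling, every column has (approximately) zero mean and unit standard deviation.
col_means = X_scaled.mean(axis=0)
col_stds = X_scaled.std(axis=0)
print(col_means, col_stds)
```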

1) First, the data set is loaded.

In [3]:
data = pd.read_csv('../data/data.csv')
data.dropna(inplace=True)

* `dropna()` removes every row that contains at least one NaN value; `inplace=True` applies the change to `data` directly instead of returning a new DataFrame.
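The difference between the two calling styles can be seen on a small, made-up frame (the `toy` DataFrame below is hypothetical; only the column names are borrowed from this notebook):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one complete row and two rows containing NaN.
toy = pd.DataFrame({"Word_Count": [120, np.nan, 300],
                    "Has_Ads": [1, 0, np.nan]})

rows_before = len(toy)

# Without inplace, dropna() returns a new frame and leaves `toy` untouched.
cleaned = toy.dropna()

# With inplace=True, `toy` itself is modified and the call returns None.
toy.dropna(inplace=True)
rows_after = len(toy)
```

Both styles give the same rows; the in-place form simply avoids a second variable.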

2) Features are defined.

In [6]:
features = data[['Word_Count', 'Link_Count', 'Image_Count', 'Video_Count', 'Has_Ads', 'Domain_Age', 'Payment_Present',
                 'Login_Present', 'User_Comments', 'Cookies_Present', 'H1_Counts', 'H2_Counts']]

In [7]:
target = data['Category']

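3) Checking the number of samples per class. This is a safeguard that I added because of the weak dataset I am working with: a stratified split fails when a class has fewer than two samples, so such classes are filtered out first. In a well-prepared dataset this step may not be necessary.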

In [8]:
min_samples_per_class = 2
class_counts = target.value_counts()
valid_classes = class_counts[class_counts >= min_samples_per_class].index

Keeping only the rows whose `Category` label is in `valid_classes`:

In [9]:
filtered_data = data[data['Category'].isin(valid_classes)]

Restricting the features to the remaining row indexes and reassigning the target:

In [10]:
features = features.loc[filtered_data.index]
target = filtered_data['Category']
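The filtering logic of this step can be traced on a small, made-up table (the rows and category names below are hypothetical). The class with a single sample is dropped, and the surviving rows keep their original index labels, which is why `features.loc[filtered_data.index]` lines up correctly.

```python
import pandas as pd

# Hypothetical data: "shop" appears only once and should be filtered out.
toy = pd.DataFrame({
    "Word_Count": [100, 250, 80, 400, 150],
    "Category": ["news", "news", "shop", "blog", "blog"],
})

min_samples_per_class = 2
class_counts = toy["Category"].value_counts()
valid_classes = class_counts[class_counts >= min_samples_per_class].index

# Keep only rows whose label belongs to a sufficiently populated class.
filtered = toy[toy["Category"].isin(valid_classes)]
kept_classes = sorted(filtered["Category"].unique())
print(kept_classes, list(filtered.index))
```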

4) Preparing training, test and validation data sets with `train_test_split()`:

In [11]:
X_train, X_temp, y_train, y_temp = train_test_split(features, target, test_size=0.3, random_state=42, stratify=target)

* We split the dataset into train (70%) and temp (30%); temp is later split into validation (15%) and test (15%) in the cells below.
* With `stratify=target`, the class distribution is preserved in each split. For example, if the classes are imbalanced (e.g. 70% class A, 30% class B), approximately the same ratios appear in both the training and temp sets.
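A small sketch of the stratification effect, using synthetic labels with the 70/30 imbalance mentioned above (the arrays `X` and `y` are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples, 70% class "A" and 30% class "B".
X = np.arange(100).reshape(-1, 1)
y = np.array(["A"] * 70 + ["B"] * 30)

X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Share of class "A" in each split; both should stay close to 0.7.
train_share_a = np.mean(y_train == "A")
temp_share_a = np.mean(y_temp == "A")
print(train_share_a, temp_share_a)
```

Without `stratify`, the shares depend on the random shuffle and can drift noticeably on small datasets.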

Removing classes that do not have enough examples in the temporary dataset (X_temp, y_temp) after the first split:

In [12]:
temp_class_counts = y_temp.value_counts()
valid_temp_classes = temp_class_counts[temp_class_counts >= min_samples_per_class].index

# Remove classes that do not have enough examples from the temporary dataset
X_temp = X_temp[y_temp.isin(valid_temp_classes)]
y_temp = y_temp[y_temp.isin(valid_temp_classes)]

The temp portion is then split in half, into validation and test sets:

In [13]:
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp)
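As a sanity check on the two-stage split, the following sketch runs the same two calls on synthetic data (200 made-up samples in two balanced classes) and prints the resulting sizes, which should come out as 70% / 15% / 15%.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 200 samples, two balanced classes.
X = np.arange(200).reshape(-1, 1)
y = np.array([0, 1] * 100)

# Stage 1: 70% train, 30% temp.
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
# Stage 2: split temp in half into validation and test.
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp
)

sizes = (len(X_train), len(X_val), len(X_test))
print(sizes)
```

In the notebook itself the validation and test shares can end up slightly below 15%, because sparse classes are removed from the temp set before the second split.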

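Scaling the feature values. The scaler is fitted on the training set only, and the same transformation is then applied to the validation and test sets: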

In [14]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)
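The sketch below, on made-up numbers, shows what `transform` does to data the scaler has not seen: it reuses the mean and standard deviation learned from the training set, so no information from the validation or test sets leaks into the scaling.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature training and validation data.
X_train = np.array([[10.0], [20.0], [30.0]])
X_val = np.array([[40.0]])

scaler = StandardScaler()
scaler.fit(X_train)                     # statistics come from the training set only
X_val_scaled = scaler.transform(X_val)  # validation data reuses those statistics

# Manual equivalent: subtract the training mean, divide by the training std.
manual = (X_val - X_train.mean(axis=0)) / X_train.std(axis=0)
print(X_val_scaled, manual)
```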

Saving the prepared datasets locally as NumPy arrays:

In [15]:
current_directory = os.getcwd()

In [16]:
np.save(os.path.join(current_directory, 'X_train_scaled.npy'), X_train_scaled)
np.save(os.path.join(current_directory, 'y_train.npy'), y_train)
np.save(os.path.join(current_directory, 'X_val_scaled.npy'), X_val_scaled)
np.save(os.path.join(current_directory, 'y_val.npy'), y_val)
np.save(os.path.join(current_directory, 'X_test_scaled.npy'), X_test_scaled)
np.save(os.path.join(current_directory, 'y_test.npy'), y_test)
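One point to watch when reloading the label files: the notebook does not show the dtype of `Category`, so the following is a sketch under the assumption that it holds text labels. In that case the saved labels are typically stored as an object array, which `np.load` only reads with `allow_pickle=True`. Converting the labels to a plain string array before saving avoids this; the temporary directory and the example labels below are hypothetical.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical text labels, standing in for y_train.
y = pd.Series(["news", "blog", "news"], name="Category")

# Convert to a fixed-width unicode array so no pickling is involved.
labels = np.array(y.tolist())

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "y_train.npy")
    np.save(path, labels)
    loaded = np.load(path)  # loads without allow_pickle

roundtrip_ok = np.array_equal(labels, loaded)
print(loaded.dtype, roundtrip_ok)
```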

In [17]:
print("Data processing completed and saved.")

Data processing completed and saved.
