# Step 1: Import Necessary Libraries
We begin by importing the libraries needed for preprocessing: pandas, numpy, and scikit-learn.

In [3]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import SimpleImputer



# Explanation:
1. pandas: A library used for data manipulation and analysis. It provides data structures like DataFrames that are suitable for handling tabular data.

2. numpy: A library for numerical operations in Python. It supports large, multi-dimensional arrays and matrices.

3. train_test_split: A function from sklearn.model_selection that splits a dataset into two subsets, typically a training set and a test set.

4. MinMaxScaler: A feature scaling method from sklearn.preprocessing that transforms each feature to a specified range, [0, 1] by default.

5. SimpleImputer: A class from sklearn.impute that handles missing values by replacing them with a per-column statistic (strategy 'mean', 'median', or 'most_frequent') or with a fixed value (strategy 'constant').

# Step 2: Load the Spambase Dataset
Next, we will load the Spambase dataset, which contains features extracted from emails to classify them as spam or not spam.

In [4]:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data"
column_names = [f'feature_{i}' for i in range(1, 58)] + ['spam_label']

df = pd.read_csv(url, header=None, names=column_names)


# Explanation:
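We specify the URL for the dataset and define the column names ourselves, because the file has no header row. The 57 feature columns are named feature_1 through feature_57, and the last column is labeled spam_label, indicating whether the email is spam (1) or not (0).

We use pd.read_csv() with header=None to read the dataset into a pandas DataFrame, so the first line of the file is treated as data rather than as column names.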
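The effect of header=None and names can be checked without a network connection. The sketch below is only an illustration: sample_csv is a hypothetical three-column stand-in for the 58-column Spambase file, and sample_names and sample_df are names introduced here, not part of the original notebook.

```python
import io

import pandas as pd

# Hypothetical three-column sample standing in for the real 58-column file.
sample_csv = "0.00,0.64,1\n0.21,0.28,1\n0.06,0.00,0\n"
sample_names = ["feature_1", "feature_2", "spam_label"]

# header=None: the first line is data, so all three lines become rows.
sample_df = pd.read_csv(io.StringIO(sample_csv), header=None, names=sample_names)

print(sample_df.shape)             # three rows, three columns
print(sample_df.columns.tolist())  # the names we supplied
```

Without header=None, pandas would consume the first data row as column names and the DataFrame would be one row short.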

# Step 3: Data Inspection
It's essential to inspect the dataset to understand its structure and content.

In [5]:
print(df.head())
df.info()  # info() prints its own report and returns None, so no print() is needed


   feature_1  feature_2  feature_3  feature_4  feature_5  feature_6  \
0       0.00       0.64       0.64        0.0       0.32       0.00   
1       0.21       0.28       0.50        0.0       0.14       0.28   
2       0.06       0.00       0.71        0.0       1.23       0.19   
3       0.00       0.00       0.00        0.0       0.63       0.00   
4       0.00       0.00       0.00        0.0       0.63       0.00   

   feature_7  feature_8  feature_9  feature_10  ...  feature_49  feature_50  \
0       0.00       0.00       0.00        0.00  ...        0.00       0.000   
1       0.21       0.07       0.00        0.94  ...        0.00       0.132   
2       0.19       0.12       0.64        0.25  ...        0.01       0.143   
3       0.31       0.63       0.31        0.63  ...        0.00       0.137   
4       0.31       0.63       0.31        0.63  ...        0.00       0.135   

   feature_51  feature_52  feature_53  feature_54  feature_55  feature_56  \
0         0.0       0

# Explanation:
df.head() displays the first five rows of the dataset, allowing us to get a quick overview of the data.

df.info() provides details about the DataFrame, such as the number of entries, column data types, and non-null counts, helping us understand the dataset's structure.
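For a classification dataset, it can also help to look at the shape and the class balance during inspection. The following is a minimal sketch on a hypothetical five-row DataFrame (toy_inspect is an assumed name, and the values are made up for illustration); the same calls would apply to df and its spam_label column.

```python
import pandas as pd

# Hypothetical toy data; the real dataset has 57 feature columns.
toy_inspect = pd.DataFrame(
    {
        "feature_1": [0.00, 0.21, 0.06, 0.00, 0.00],
        "spam_label": [1, 1, 1, 0, 0],
    }
)

# (rows, columns) of the DataFrame
print(toy_inspect.shape)

# Fraction of rows per class; normalize=True returns shares instead of counts.
class_share = toy_inspect["spam_label"].value_counts(normalize=True)
print(class_share)
```

A strongly unbalanced label column would be a reason to consider a stratified split later on.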

# Step 4: Check for Missing Values
Before preprocessing, we need to check for any missing values in the dataset.



In [6]:
print("Missing values in each column:")
print(df.isnull().sum())


Missing values in each column:
feature_1     0
feature_2     0
feature_3     0
feature_4     0
feature_5     0
feature_6     0
feature_7     0
feature_8     0
feature_9     0
feature_10    0
feature_11    0
feature_12    0
feature_13    0
feature_14    0
feature_15    0
feature_16    0
feature_17    0
feature_18    0
feature_19    0
feature_20    0
feature_21    0
feature_22    0
feature_23    0
feature_24    0
feature_25    0
feature_26    0
feature_27    0
feature_28    0
feature_29    0
feature_30    0
feature_31    0
feature_32    0
feature_33    0
feature_34    0
feature_35    0
feature_36    0
feature_37    0
feature_38    0
feature_39    0
feature_40    0
feature_41    0
feature_42    0
feature_43    0
feature_44    0
feature_45    0
feature_46    0
feature_47    0
feature_48    0
feature_49    0
feature_50    0
feature_51    0
feature_52    0
feature_53    0
feature_54    0
feature_55    0
feature_56    0
feature_57    0
spam_label    0
dtype: int64


# Explanation:
df.isnull().sum() counts the number of missing values in each column. Every count in the output above is 0, so the Spambase dataset contains no missing values and no rows or columns need special handling.
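Since every count is zero for Spambase, a small example with an actual gap may make the output easier to read. This is a sketch on a hypothetical DataFrame (toy_missing is an assumed name, not part of the original notebook).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing entry in feature_1.
toy_missing = pd.DataFrame(
    {
        "feature_1": [0.1, np.nan, 0.3],
        "feature_2": [1.0, 2.0, 3.0],
    }
)

# isnull() marks each cell True/False; sum() counts the True values per column.
missing_per_column = toy_missing.isnull().sum()
print(missing_per_column)

# A single number for the whole DataFrame is often enough for a quick check.
total_missing = int(missing_per_column.sum())
print(f"Total missing values: {total_missing}")
```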

# Step 5: Handle Missing Values
The check in Step 4 found no missing values, so imputation leaves this dataset unchanged. We still include the step to show how gaps would be handled: missing values in the continuous features are replaced with the mean of their column.

In [8]:
imputer = SimpleImputer(strategy='mean')
df.iloc[:, :-1] = imputer.fit_transform(df.iloc[:, :-1]) 


# Explanation:
We create an instance of SimpleImputer, specifying the strategy as mean. This will replace missing values in continuous features with the mean of the respective columns.

df.iloc[:, :-1] selects all columns except the last one (the target variable) for imputation.
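Because Spambase has nothing to impute, the effect of the mean strategy is easier to see on a small array with gaps. The sketch below uses hypothetical values; toy_features, toy_imputer, and filled are names assumed for this illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical 3x2 feature matrix with one gap per column.
toy_features = np.array(
    [
        [1.0, 10.0],
        [np.nan, 20.0],
        [3.0, np.nan],
    ]
)

toy_imputer = SimpleImputer(strategy="mean")
filled = toy_imputer.fit_transform(toy_features)

# statistics_ holds the per-column means computed from the observed values:
# column 0 -> (1 + 3) / 2 = 2.0, column 1 -> (10 + 20) / 2 = 15.0
print(toy_imputer.statistics_)
print(filled)
```

Each NaN is replaced by the mean of the observed values in its own column; entries that were already present are left untouched.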

# Step 6: Separate Features and Target Variable
Now, we'll separate the features (X) from the target variable (y).

In [9]:
X = df.drop('spam_label', axis=1)  # Features
y = df['spam_label']  # Target variable


# Explanation:
X contains all 57 feature columns, while y contains the target column (spam_label). The separation is needed because scikit-learn estimators take the inputs and the labels as separate arguments, as in fit(X, y).

# Step 7: Normalize the Feature Set
Next, we'll normalize the feature set to scale the features to a range of [0, 1] using MinMaxScaler.

In [10]:
scaler = MinMaxScaler()
X_normalized = scaler.fit_transform(X)


# Explanation:
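MinMaxScaler rescales each feature to lie within a specified range (default is [0, 1]) by subtracting the column minimum and dividing by the column range. Normalization matters for algorithms that are sensitive to feature scale, such as distance-based methods and models trained with gradient descent, because it keeps features with large numeric ranges from dominating the others.

Note that the scaler here is fitted on the full dataset before the split in Step 8, so the validation and test rows influence the minimum and maximum of each column. Fitting the scaler on the training set only, and then applying it to the other sets, avoids this information leakage.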
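With the default range, the transformation applied to each column is x_scaled = (x - min) / (max - min). The sketch below checks this on a hypothetical single-column array (toy_column, scaled, and manual are assumed names), comparing the scaler's output with the formula written out by hand.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single feature with minimum 2 and maximum 10.
toy_column = np.array([[2.0], [4.0], [10.0]])

# Library result
scaled = MinMaxScaler().fit_transform(toy_column)

# Same computation by hand: (x - min) / (max - min), per column
col_min = toy_column.min(axis=0)
col_max = toy_column.max(axis=0)
manual = (toy_column - col_min) / (col_max - col_min)

print(scaled.ravel())  # 2 -> 0.0, 4 -> 0.25, 10 -> 1.0
print(np.allclose(scaled, manual))
```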


# Step 8: Split the Data
Finally, we'll split the dataset into training, validation, and test sets. This will help us evaluate our model effectively.

In [11]:
X_train, X_temp, y_train, y_temp = train_test_split(X_normalized, y, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

print(f'Training set shape: {X_train.shape}')
print(f'Validation set shape: {X_val.shape}')
print(f'Test set shape: {X_test.shape}')


Training set shape: (3220, 57)
Validation set shape: (690, 57)
Test set shape: (691, 57)


# Explanation:
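We use train_test_split twice to create three subsets. The first call holds out 30% of the rows (test_size=0.3); the second call splits that held-out portion in half (test_size=0.5), giving training (70%), validation (15%), and test (15%) sets.

We display the shapes of the resulting datasets to confirm that the split was performed correctly: 3220, 690, and 691 rows add up to the 4601 rows of the dataset, each with 57 features.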
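One possible variation, shown here as a sketch rather than as the notebook's method, is to split first and fit the scaler on the training rows only, so that the validation and test sets play no part in choosing the scaling parameters. It also passes stratify so each subset keeps roughly the same class ratio. The data below is synthetic, and every name (X_demo, y_demo, demo_scaler, and the split variables) is an assumption made for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the feature matrix and labels (200 rows, 5 features).
rng = np.random.default_rng(42)
X_demo = rng.random((200, 5)) * 100
y_demo = rng.integers(0, 2, size=200)

# 70% train, 30% held out; stratify keeps the class ratio similar in both parts.
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)
# Split the held-out 30% in half: 15% validation, 15% test.
X_va, X_te, y_va, y_te = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=42, stratify=y_tmp
)

# Fit the scaler on the training rows only, then reuse it for the other sets.
demo_scaler = MinMaxScaler().fit(X_tr)
X_tr_s = demo_scaler.transform(X_tr)
X_va_s = demo_scaler.transform(X_va)
X_te_s = demo_scaler.transform(X_te)

print(X_tr_s.shape, X_va_s.shape, X_te_s.shape)
```

With this ordering, validation and test values can fall slightly outside [0, 1] when they exceed the range seen in training; that is expected and reflects what the model would face on new data.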

# Summary:
In this tutorial, we walked through the preprocessing steps for the Spambase dataset from the UCI Machine Learning Repository, preparing it for machine learning tasks. We began by importing pandas, numpy, and the scikit-learn utilities we needed, then loaded the dataset and assigned generic column names (feature_1 through feature_57, plus spam_label). An initial inspection helped us understand its structure, and a check for missing values found none; we still applied SimpleImputer with the mean strategy to show how gaps would be filled. We then separated the features from the target variable and normalized the feature set to a range of [0, 1] using MinMaxScaler, so that all features share a common scale. Finally, the dataset was split into training (70%), validation (15%), and test (15%) sets to support model evaluation. With these preprocessing steps completed, the dataset is ready for training machine learning models.