## Loading the Dataset

The Iris dataset is loaded from a CSV file using pandas. The first few rows are displayed to check the data.

In [2]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
iris_data = pd.read_csv(r'/home/shahzaib/Documents/iris.csv')
print(iris_data.head())

   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa
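Beyond `head()`, a few quick checks can confirm that the file was parsed as expected. The sketch below is illustrative only: it uses a hypothetical in-memory copy of the five rows printed above (`csv_text`) in place of the real file, and the name `sample` is an assumption, not part of the original notebook.

```python
import io

import pandas as pd

# Hypothetical stand-in for iris.csv: the five rows printed above.
csv_text = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
4.6,3.1,1.5,0.2,setosa
5.0,3.6,1.4,0.2,setosa
"""

sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)                      # (rows, columns)
print(sample.dtypes)                     # the four measurements should be float64
print(sample.isna().sum())               # missing values per column
print(sample["species"].value_counts())  # rows per class
```

The same four calls could be run on `iris_data` directly to check column types, missing values, and class balance before any cleaning.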


## 1-Removing Duplicates

Any duplicate rows in the dataset are removed to ensure the data is clean and free from redundancy.

In [3]:

iris_data = iris_data.drop_duplicates()
print("Duplicates removed. New shape:", iris_data.shape)


Duplicates removed. New shape: (147, 5)
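It can be useful to count duplicates before dropping them, so the change in shape is easy to account for. The following is a minimal sketch on a hypothetical toy frame (`toy` is made up for illustration and is not the Iris file); it relies only on `DataFrame.duplicated` and `DataFrame.drop_duplicates`.

```python
import pandas as pd

# Hypothetical toy frame: the rows at index 0 and 2 are identical.
toy = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 5.1],
    "sepal_width": [3.5, 3.0, 3.5],
    "species": ["setosa", "setosa", "setosa"],
})

# duplicated() marks every repeat after the first occurrence of a row.
n_duplicates = int(toy.duplicated().sum())

# Dropping rows leaves gaps in the index; reset_index is optional.
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print("Duplicate rows:", n_duplicates)
print("Shape before:", toy.shape, "after:", deduplicated.shape)
```

By default both methods compare all columns and keep the first occurrence of each row, which matches the call used in the cell above.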


## 2-Feature Scaling

The numeric features (sepal and petal dimensions) are scaled with StandardScaler so that each column has a mean of 0 and a standard deviation of 1. Putting the features on a common scale helps algorithms that are sensitive to feature magnitude, such as distance-based and gradient-based models.

In [4]:

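scaler = StandardScaler()
# Select every column except the last one (species), which is the label
features = iris_data.iloc[:, :-1]
# Standardize each feature column to zero mean and unit variance
scaled_features = scaler.fit_transform(features)
print("Features scaled")

Features scaled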
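To make the transformation concrete, the sketch below reproduces it by hand on a few hypothetical values (`values` is made up for illustration): each column has its mean subtracted and is divided by its population standard deviation, which should match what `StandardScaler` returns.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical values standing in for two feature columns.
values = np.array([
    [5.1, 1.4],
    [4.9, 1.4],
    [4.7, 1.3],
    [4.6, 1.5],
    [5.0, 1.4],
])

# Manual standardization: z = (x - column mean) / column std (ddof=0).
manual = (values - values.mean(axis=0)) / values.std(axis=0)

# The same operation through scikit-learn.
scaled = StandardScaler().fit_transform(values)

print("Manual and StandardScaler agree:", np.allclose(manual, scaled))
print("Column means:", scaled.mean(axis=0).round(6))
print("Column stds: ", scaled.std(axis=0).round(6))
```

Note that `fit_transform` returns a NumPy array, so the column names of the original DataFrame are not carried over to `scaled_features`.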


## 3-Data Splitting

The dataset is split into training and test sets (80% for training, 20% for testing). This prepares the data for training machine learning models and evaluating their performance.

In [5]:
# 3. Data Splitting
X = scaled_features  # Features
y = iris_data['species']  # Target variable
# Hold out 20% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print("Data split")
print("Training set shape:", X_train.shape, "Test set shape:", X_test.shape)
print("Preprocessing complete.")

Data split
Training set shape: (117, 4) Test set shape: (30, 4)
Preprocessing complete.
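One point worth keeping in mind: because scaling happens before the split here, the scaler's mean and standard deviation are computed from rows that later end up in the test set. A common alternative, sketched below, is to split first and fit the scaler on the training rows only. This is not the notebook's original method; it assumes scikit-learn's bundled Iris copy (`load_iris`) as a stand-in for the CSV file, adds the `stratify` option to keep class proportions equal in both sets, and uses variable names chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled Iris data (150 rows) stands in for the CSV file.
X_raw, y_all = load_iris(return_X_y=True)

# Split first; stratify keeps the class proportions the same in both sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_all, test_size=0.2, random_state=42, stratify=y_all
)

# Fit the scaler on the training rows only, then reuse it on the test rows.
split_scaler = StandardScaler()
X_tr_scaled = split_scaler.fit_transform(X_tr)
X_te_scaled = split_scaler.transform(X_te)

print("Training set shape:", X_tr_scaled.shape, "Test set shape:", X_te_scaled.shape)
```

With this ordering the test set plays no part in estimating the scaling parameters, so the evaluation is closer to how the model would see genuinely new data.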


## Summary

The code above demonstrates basic data preprocessing steps using the Iris dataset. It covers the following key steps:

- Loading the Dataset: The Iris dataset is loaded with pandas, and the first few rows are displayed to verify the data.

- Removing Duplicates: Duplicate rows are removed so that no observation is counted more than once.

- Feature Scaling: The feature columns (sepal and petal measurements) are standardized with StandardScaler to a mean of 0 and a standard deviation of 1, which helps scale-sensitive models.

- Data Splitting: The dataset is split into training (80%) and testing (20%) sets using train_test_split, so that a model can be trained on one part and evaluated on the other.

These preprocessing steps leave the data deduplicated, standardized, and ready for use in machine learning tasks.