In [1]:
# Task 2: Data Transformation

# This notebook demonstrates techniques for transforming the Iris dataset: encoding categorical data, engineering a new feature, and aggregating by group.

In [13]:
# Import necessary libraries
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.preprocessing import LabelEncoder

# Load Iris dataset
iris = load_iris()
data = pd.DataFrame(iris.data, columns=iris.feature_names)
data['species'] = iris.target

# Display the first few rows
data.head()


   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  species
0                5.1               3.5                1.4               0.2        0
1                4.9               3.0                1.4               0.2        0
2                4.7               3.2                1.3               0.2        0
3                4.6               3.1                1.5               0.2        0
4                5.0               3.6                1.4               0.2        0


In [14]:
# Step 1: One-Hot Encoding for Nominal Variables
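# We’ll apply one-hot encoding to the `species` column, creating one binary column per species. Because `species` holds the integer codes from `iris.target`, the new columns are named `species_0`, `species_1`, and `species_2`.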


In [16]:
# One-hot encode the 'species' column
data_one_hot_encoded = pd.get_dummies(data, columns=['species'], prefix='species')

# Display data before and after one-hot encoding
print("Data before one-hot encoding:\n", data.head())
print("\nData after one-hot encoding:\n", data_one_hot_encoded.head())


Data before one-hot encoding:
    sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0                5.1               3.5                1.4               0.2   
1                4.9               3.0                1.4               0.2   
2                4.7               3.2                1.3               0.2   
3                4.6               3.1                1.5               0.2   
4                5.0               3.6                1.4               0.2   

   species  
0        0  
1        0  
2        0  
3        0  
4        0  

Data after one-hot encoding:
    sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0                5.1               3.5                1.4               0.2   
1                4.9               3.0                1.4               0.2   
2                4.7               3.2                1.3               0.2   
3                4.6               3.1                1.5            
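
Because the encoded columns are named after integer codes, it can help to map the codes to species names before encoding. The following is a minimal sketch on a small illustrative series rather than the full dataset; `species_codes` and `species_names` are hypothetical names introduced here, and the name order is assumed to match `iris.target_names`.

```python
import pandas as pd

# Illustrative integer codes in the same form as `iris.target`.
species_codes = pd.Series([0, 0, 1, 2], name="species")

# Assumed to follow the order of `iris.target_names`.
species_names = {0: "setosa", 1: "versicolor", 2: "virginica"}

# Map codes to names first so the dummy columns are self-describing.
species_labels = species_codes.map(species_names)
dummies = pd.get_dummies(species_labels, prefix="species", dtype=int)

print(dummies)
```

Passing `dtype=int` keeps the output as 0/1 integers; without it, recent pandas versions return boolean columns.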

In [17]:
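# Step 2: Label Encoding for Ordinal Variables
# Label encoding maps each category to an integer. Suppose we add a hypothetical `size` column (small, medium, large) for demonstration. Note that `LabelEncoder` assigns codes in alphabetical order (large=0, medium=1, small=2), so the codes do not follow the natural small < medium < large ordering; an explicit mapping is needed when the order matters.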


In [19]:
# Add a hypothetical 'size' column: a 5-item pattern repeated 30 times to fill all 150 rows
data['size'] = ['small', 'medium', 'large', 'medium', 'small'] * 30

# Initialize label encoder and encode 'size'
label_encoder = LabelEncoder()
data['size_encoded'] = label_encoder.fit_transform(data['size'])

# Display data before and after label encoding
print("Data before label encoding:\n", data[['size']].head())
print("\nData after label encoding:\n", data[['size', 'size_encoded']].head())


Data before label encoding:
      size
0   small
1  medium
2   large
3  medium
4   small

Data after label encoding:
      size  size_encoded
0   small             2
1  medium             1
2   large             0
3  medium             1
4   small             2
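
The codes above follow alphabetical order, not size order. When the ordering matters, one option is to state it explicitly. This is a minimal sketch on a five-row series; `size_order` and `size_levels` are hypothetical names, and the small < medium < large ordering is the assumption being encoded.

```python
import pandas as pd

sizes = pd.Series(["small", "medium", "large", "medium", "small"], name="size")

# Option 1: an explicit dictionary mapping (hypothetical name `size_order`).
size_order = {"small": 0, "medium": 1, "large": 2}
encoded_by_map = sizes.map(size_order)

# Option 2: an ordered categorical, whose integer codes follow the declared order.
size_levels = ["small", "medium", "large"]
ordered_sizes = pd.Categorical(sizes, categories=size_levels, ordered=True)
encoded_by_codes = ordered_sizes.codes

print(encoded_by_map.tolist())
print(encoded_by_codes.tolist())
```

Both options assign small=0, medium=1, large=2, so comparisons between codes reflect the intended order.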


In [20]:
# Step 3: Feature Engineering
# We’ll create a new feature, `sepal_area`, by multiplying `sepal length` and `sepal width` to capture the interaction between these two dimensions.


In [21]:
# Create an interaction feature between sepal length and sepal width
data['sepal_area'] = data['sepal length (cm)'] * data['sepal width (cm)']

# Display data before and after adding interaction feature
print("Data before adding interaction feature:\n", data[['sepal length (cm)', 'sepal width (cm)']].head())
print("\nData after adding interaction feature:\n", data[['sepal length (cm)', 'sepal width (cm)', 'sepal_area']].head())


Data before adding interaction feature:
    sepal length (cm)  sepal width (cm)
0                5.1               3.5
1                4.9               3.0
2                4.7               3.2
3                4.6               3.1
4                5.0               3.6

Data after adding interaction feature:
    sepal length (cm)  sepal width (cm)  sepal_area
0                5.1               3.5       17.85
1                4.9               3.0       14.70
2                4.7               3.2       15.04
3                4.6               3.1       14.26
4                5.0               3.6       18.00
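
The same feature can be built without modifying the original frame by using `DataFrame.assign`, which returns a new frame. This is a sketch using the five rows printed above; `sepals` and `with_area` are hypothetical names.

```python
import pandas as pd

# The five rows shown in the output above.
sepals = pd.DataFrame({
    "sepal length (cm)": [5.1, 4.9, 4.7, 4.6, 5.0],
    "sepal width (cm)": [3.5, 3.0, 3.2, 3.1, 3.6],
})

# `assign` returns a copy with the extra column; `sepals` is left unchanged.
with_area = sepals.assign(
    sepal_area=lambda df: df["sepal length (cm)"] * df["sepal width (cm)"]
)

print(with_area.round(2))
```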


In [23]:
# Step 4: Aggregation Functions
# We’ll summarize the dataset by grouping by `species` and calculating the mean of each numeric feature to observe species-wise trends.


In [25]:
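# Group by species and calculate the mean of each numeric feature.
# numeric_only=True skips the string 'size' column, which would otherwise raise a TypeError in pandas 2.x.
data_grouped = data.groupby('species').mean(numeric_only=True)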

# Display the aggregated data
print("Species-wise mean of each feature:\n", data_grouped)


Species-wise mean of each feature:
          sepal length (cm)  sepal width (cm)  petal length (cm)  \
species                                                           
0                    5.006             3.428              1.462   
1                    5.936             2.770              4.260   
2                    6.588             2.974              5.552   

         petal width (cm)  size_encoded  sepal_area  
species                                              
0                   0.246           1.2     17.2578  
1                   1.326           1.2     16.5262  
2                   2.026           1.2     19.6846  
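
Grouping also supports several statistics at once through `agg`. The sketch below uses a small illustrative frame, not the full dataset; `toy` and its values are assumptions chosen only to show the calls.

```python
import pandas as pd

# Illustrative rows: two per species code, plus a non-numeric column.
toy = pd.DataFrame({
    "species": [0, 0, 1, 1],
    "sepal length (cm)": [5.1, 4.9, 7.0, 6.4],
    "size": ["small", "medium", "large", "medium"],
})

# numeric_only=True leaves the string column out of the mean.
means = toy.groupby("species").mean(numeric_only=True)

# Several statistics for a single feature.
summary = toy.groupby("species")["sepal length (cm)"].agg(["mean", "min", "max"])

print(means)
print(summary)
```

Selecting the column before calling `agg` keeps the result flat, with one column per statistic.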


In [None]:
# Summary

In this notebook, I have transformed the Iris dataset as follows:
1. **Encoded Categorical Data**: Used one-hot encoding for the nominal `species` column and label encoding for a hypothetical `size` column. `LabelEncoder` assigns codes alphabetically, so the resulting codes do not reflect the small < medium < large order.
2. **Feature Engineering**: Created a new feature (`sepal_area`) by multiplying `sepal length` and `sepal width`.
3. **Aggregated Data**: Grouped by `species` to calculate the mean of each feature, allowing us to see species-level trends.

These transformations produce numeric columns suitable for modeling and a per-species summary for further analysis.
