
Write a Python program to compute the mean, median, mode, variance, and standard deviation of each feature in the Iris dataset. Also demonstrate data pre-processing techniques on a random dataset: reshaping, filtering, merging, handling missing values, and min-max normalization.

## Load the iris dataset

Load the Iris dataset with scikit-learn's `load_iris` and wrap its feature matrix in a pandas DataFrame.


In [1]:
from sklearn.datasets import load_iris
import pandas as pd

iris = load_iris()

df_iris = pd.DataFrame(data=iris.data, columns=iris.feature_names)

display(df_iris.head())
df_iris.info()  # info() prints its summary itself and returns None, so it is not wrapped in display()

,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 4 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   sepal length (cm)  150 non-null    float64
 1   sepal width (cm)   150 non-null    float64
 2   petal length (cm)  150 non-null    float64
 3   petal width (cm)   150 non-null    float64
dtypes: float64(4)
memory usage: 4.8 KB
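
The DataFrame above holds only the four measurements. As a minimal sketch, assuming the species labels are also wanted, the integer class codes in `iris.target` can be mapped to the names in `iris.target_names`. The column name `species` and the variable `df_labeled` are illustrative choices, not part of the original notebook.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

# Feature matrix as a DataFrame, as in the cell above
df_labeled = pd.DataFrame(data=iris.data, columns=iris.feature_names)

# iris.target holds integer class codes (0, 1, 2);
# iris.target_names maps each code to a species name
df_labeled['species'] = pd.Categorical.from_codes(
    iris.target, categories=iris.target_names
)

# Number of samples per species
print(df_labeled['species'].value_counts())
```

Keeping the labels in a categorical column leaves the numeric features untouched, so statistics can still be computed on the first four columns only.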

## Compute descriptive statistics

Calculate the mean, median, mode, variance, and standard deviation of each feature in the Iris dataset.


In [2]:
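# Central tendency, computed per feature (column)
mean_values = df_iris.mean()
median_values = df_iris.median()

# mode() returns one row per tied mode, sorted ascending;
# iloc[0] keeps the smallest mode of each feature
mode_values = df_iris.mode().iloc[0]

# pandas computes the sample variance and standard deviation (ddof=1) by default
variance_values = df_iris.var()
std_dev_values = df_iris.std()

# Collect everything into one table: one row per feature, one column per statistic
statistics = {
    'Mean': mean_values,
    'Median': median_values,
    'Mode': mode_values,
    'Variance': variance_values,
    'Standard Deviation': std_dev_values
}

stats_df = pd.DataFrame(statistics)

display(stats_df)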

,Mean,Median,Mode,Variance,Standard Deviation
sepal length (cm),5.843333,5.8,5.0,0.685694,0.828066
sepal width (cm),3.057333,3.0,3.0,0.189979,0.435866
petal length (cm),3.758,4.35,1.4,3.116278,1.765298
petal width (cm),1.199333,1.3,0.2,0.581006,0.762238
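
Two defaults in the cell above are easy to overlook: pandas divides by n - 1 when computing variance, and `mode()` can return more than one value. The following sketch illustrates both on a tiny hand-made series; the data and variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

values = pd.Series([1, 2, 2, 3, 3])

# pandas defaults to the sample variance (ddof=1, divides by n - 1)
sample_var = values.var()

# NumPy defaults to the population variance (ddof=0, divides by n)
population_var = np.var(values.to_numpy())

# mode() returns every value tied for the highest count, sorted ascending
modes = values.mode().tolist()

print(sample_var, population_var, modes)
```

Passing `ddof=0` to `var()` or `std()` in pandas gives the population versions if those are what the analysis calls for.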


## Display results

Present the computed statistics in a clear format.


In [3]:
display(stats_df)

,Mean,Median,Mode,Variance,Standard Deviation
sepal length (cm),5.843333,5.8,5.0,0.685694,0.828066
sepal width (cm),3.057333,3.0,3.0,0.189979,0.435866
petal length (cm),3.758,4.35,1.4,3.116278,1.765298
petal width (cm),1.199333,1.3,0.2,0.581006,0.762238


## Create a random dataset

Generate a small random dataset for demonstration purposes: 15 rows and three columns of values drawn uniformly from [0, 1).


In [4]:
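import numpy as np
import pandas as pd

# Three columns of 15 values, each drawn uniformly from [0, 1).
# No seed is set here, so the values differ on every run.
data = {
    'Column A': np.random.rand(15),
    'Column B': np.random.rand(15),
    'Column C': np.random.rand(15)
}

df_random = pd.DataFrame(data)

display(df_random.head())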

,Column A,Column B,Column C
0,0.465538,0.445022,0.75971
1,0.13585,0.271254,0.252382
2,0.467719,0.046016,0.628122
3,0.174865,0.39459,0.240067
4,0.581747,0.375265,0.485304
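
Because the cell above sets no seed, rerunning it produces different numbers. A minimal sketch of a reproducible variant, assuming NumPy's `default_rng` generator is acceptable in place of `np.random.rand`; the helper `make_random_frame` and its parameters are hypothetical names introduced here.

```python
import numpy as np
import pandas as pd


def make_random_frame(seed, n_rows=15):
    """Build a DataFrame of uniform [0, 1) values from a seeded generator."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        'Column A': rng.random(n_rows),
        'Column B': rng.random(n_rows),
        'Column C': rng.random(n_rows),
    })


# The same seed yields the same data on every call
df_a = make_random_frame(seed=0)
df_b = make_random_frame(seed=0)
print(df_a.equals(df_b))
```

A local generator also avoids changing NumPy's global random state, which matters when several cells draw random numbers.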


## Reshape the data

Reshape the data from wide to long format with `pd.melt`, which stacks all columns into a `variable` column (the original column name) and a `value` column.


In [5]:

# No id_vars are given, so every column is unpivoted and the result gets a fresh 0..n-1 index
df_melted = pd.melt(df_random, var_name='variable', value_name='value')

display(df_melted.head())

,variable,value
0,Column A,0.465538
1,Column A,0.13585
2,Column A,0.467719
3,Column A,0.174865
4,Column A,0.581747
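
The melted frame above no longer records which row each value came from, so it cannot be pivoted back. As a sketch, assuming a row identifier is kept through `id_vars`, `pivot` can restore the wide layout; the column name `row` and the small hand-made frame are illustrative.

```python
import pandas as pd

wide = pd.DataFrame({
    'row': [0, 1, 2],
    'Column A': [0.1, 0.2, 0.3],
    'Column B': [0.4, 0.5, 0.6],
})

# Wide -> long: keep 'row' as an identifier, unpivot the other columns
long_df = wide.melt(id_vars='row', var_name='variable', value_name='value')

# Long -> wide: one row per identifier, one column per original variable
restored = long_df.pivot(index='row', columns='variable', values='value')
restored.columns.name = None  # drop the leftover 'variable' axis label

print(long_df)
print(restored)
```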


## Filter the data

Filter rows with a boolean mask. Here, only the rows where `Column A` exceeds 0.5 are kept.


In [6]:

# The comparison yields a boolean Series; indexing with it keeps the rows marked True
df_filtered = df_random[df_random['Column A'] > 0.5]

display(df_filtered.head())

,Column A,Column B,Column C
4,0.581747,0.375265,0.485304
9,0.731315,0.702102,0.119362
10,0.993226,0.782158,0.621174
11,0.959279,0.774722,0.030088
12,0.556527,0.852705,0.899067
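
Filters often combine several conditions. The sketch below shows two equivalent ways to do that on a small hand-made frame (the data is made up for illustration): element-wise operators on boolean masks, and `DataFrame.query`.

```python
import pandas as pd

df = pd.DataFrame({
    'Column A': [0.2, 0.6, 0.8, 0.9],
    'Column B': [0.7, 0.3, 0.9, 0.1],
})

# Each condition needs its own parentheses; & is element-wise AND, | is element-wise OR
both = df[(df['Column A'] > 0.5) & (df['Column B'] < 0.5)]

# query() expresses the same filter as a string;
# backticks quote column names that contain spaces
both_query = df.query('`Column A` > 0.5 and `Column B` < 0.5')

print(both)
print(both_query)
```

Note that filtering keeps the original index labels, which is why the filtered output above starts at row 4 rather than 0.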


## Merge the data

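Create a second random dataset with the same number of rows and combine it with the first. The code uses `pd.concat` with `axis=1`, which places the two frames side by side and aligns rows on the index rather than joining on a key column.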


In [7]:

data2 = {
    'Column D': np.random.rand(len(df_random))
}

df_random2 = pd.DataFrame(data2)

# axis=1 appends columns; rows are matched by index label
df_merged = pd.concat([df_random, df_random2], axis=1)

display(df_merged.head())

,Column A,Column B,Column C,Column D
0,0.465538,0.445022,0.75971,0.010975
1,0.13585,0.271254,0.252382,0.343383
2,0.467719,0.046016,0.628122,0.632905
3,0.174865,0.39459,0.240067,0.12532
4,0.581747,0.375265,0.485304,0.77949
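
When two tables share a key column instead of an aligned index, `pd.merge` is the usual tool. A minimal sketch with made-up data follows; the key column `id` and the frames `left` and `right` are hypothetical.

```python
import pandas as pd

left = pd.DataFrame({'id': [1, 2, 3], 'Column A': [0.1, 0.2, 0.3]})
right = pd.DataFrame({'id': [2, 3, 4], 'Column D': [0.5, 0.6, 0.7]})

# Inner join: keep only the ids present in both frames
inner = pd.merge(left, right, on='id', how='inner')

# Left join: keep every row of `left`; unmatched ids get NaN in 'Column D'
left_join = pd.merge(left, right, on='id', how='left')

print(inner)
print(left_join)
```

The `how` argument decides which keys survive, so a left or outer join is also one way missing values enter a dataset.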


## Handle missing values

Introduce missing values at random (each cell is masked with probability 0.1) and handle them in two ways: dropping incomplete rows and imputing column means.


In [8]:
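import numpy as np

np.random.seed(42)  # fix the seed so the same cells are masked on every run
# Boolean array with the DataFrame's shape; each cell is True with probability 0.1
mask = np.random.choice([True, False], size=df_merged.shape, p=[0.1, 0.9])
# DataFrame.mask replaces the cells where the condition is True with NaN
df_merged_with_missing = df_merged.mask(mask)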

display("DataFrame with missing values:")
display(df_merged_with_missing)

'DataFrame with missing values:'

,Column A,Column B,Column C,Column D
0,0.465538,0.445022,0.75971,0.010975
1,0.13585,0.271254,,0.343383
2,0.467719,0.046016,,0.632905
3,0.174865,0.39459,0.240067,0.12532
4,0.581747,0.375265,0.485304,0.77949
5,0.117306,0.77083,0.030528,0.020289
6,0.303839,0.100768,0.303682,0.067747
7,0.32492,,0.489434,0.435235
8,,0.980674,0.688351,0.462965
9,0.731315,,0.119362,0.583026


In [9]:

# Option 1: drop every row that contains at least one NaN
df_dropped = df_merged_with_missing.dropna()

display("DataFrame after dropping rows with missing values (head):")
display(df_dropped.head())

# Option 2: replace each NaN with the mean of its column (mean() skips NaN by default)
df_imputed = df_merged_with_missing.fillna(df_merged_with_missing.mean())

display("DataFrame after imputing missing values with mean (head):")
display(df_imputed.head())

'DataFrame after dropping rows with missing values (head):'

,Column A,Column B,Column C,Column D
0,0.465538,0.445022,0.75971,0.010975
3,0.174865,0.39459,0.240067,0.12532
4,0.581747,0.375265,0.485304,0.77949
5,0.117306,0.77083,0.030528,0.020289
6,0.303839,0.100768,0.303682,0.067747


'DataFrame after imputing missing values with mean (head):'

,Column A,Column B,Column C,Column D
0,0.465538,0.445022,0.75971,0.010975
1,0.13585,0.271254,0.40248,0.343383
2,0.467719,0.046016,0.40248,0.632905
3,0.174865,0.39459,0.240067,0.12532
4,0.581747,0.375265,0.485304,0.77949
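
Dropping all incomplete rows and mean imputation are only two of several options. The sketch below, on a small hand-made frame, counts missing values per column, imputes with the median instead of the mean, and drops only the rows that are mostly empty. The data and the threshold of 2 are illustrative assumptions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Column A': [1.0, np.nan, 3.0, 4.0],
    'Column B': [np.nan, np.nan, 6.0, 8.0],
    'Column C': [1.0, 2.0, 3.0, 4.0],
})

# How many values are missing in each column
missing_per_column = df.isna().sum()

# Median imputation is less sensitive to outliers than mean imputation
df_median = df.fillna(df.median())

# Keep only the rows that have at least 2 non-missing values
df_thresh = df.dropna(thresh=2)

print(missing_per_column)
print(df_median)
print(df_thresh)
```

Counting missing values first helps decide between the strategies: dropping is cheap when few rows are affected, while imputation preserves the sample size.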


## Perform min-max normalization

Apply min-max normalization to one numerical feature, `Column A`, rescaling it to the range [0, 1] with `(x - min) / (max - min)`.


In [10]:

column_to_normalize = 'Column A'

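# Column extremes define the source range
min_val = df_imputed[column_to_normalize].min()
max_val = df_imputed[column_to_normalize].max()

# Rescale to [0, 1]: the minimum maps to 0 and the maximum to 1
df_imputed[f'{column_to_normalize}_normalized'] = (df_imputed[column_to_normalize] - min_val) / (max_val - min_val)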

display(df_imputed.head())

,Column A,Column B,Column C,Column D,Column A_normalized
0,0.465538,0.445022,0.75971,0.010975,0.426141
1,0.13585,0.271254,0.40248,0.343383,0.067606
2,0.467719,0.046016,0.40248,0.632905,0.428513
3,0.174865,0.39459,0.240067,0.12532,0.110034
4,0.581747,0.375265,0.485304,0.77949,0.552517
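
The same formula extends to every column at once. The following is a sketch rather than a definitive implementation: the helper `min_max_normalize` is a hypothetical name, and mapping constant columns to 0 is one possible convention for avoiding division by zero.

```python
import pandas as pd


def min_max_normalize(df):
    """Rescale every column to [0, 1]; constant columns become 0.0."""
    col_min = df.min()
    col_range = df.max() - col_min
    # A constant column has range 0; use 1 instead so the division yields 0.0
    safe_range = col_range.replace(0, 1)
    return (df - col_min) / safe_range


df = pd.DataFrame({
    'Column A': [2.0, 4.0, 6.0],
    'Column B': [5.0, 5.0, 5.0],  # constant column
})

normalized = min_max_normalize(df)
print(normalized)
```

scikit-learn's `MinMaxScaler` offers the same transformation and additionally remembers the fitted minimum and maximum, which is useful when new data must be scaled consistently.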


## Display results

Show the results of each pre-processing step.


In [11]:
display("Original Random DataFrame:")
display(df_random)

display("Melted DataFrame:")
display(df_melted)

display("Filtered DataFrame:")
display(df_filtered)

display("Merged DataFrame:")
display(df_merged)

display("DataFrame with Missing Values:")
display(df_merged_with_missing)

display("DataFrame after Dropping Missing Values:")
display(df_dropped)

display("DataFrame after Imputing Missing Values and Normalization:")
display(df_imputed)

'Original Random DataFrame:'

,Column A,Column B,Column C
0,0.465538,0.445022,0.75971
1,0.13585,0.271254,0.252382
2,0.467719,0.046016,0.628122
3,0.174865,0.39459,0.240067
4,0.581747,0.375265,0.485304
5,0.117306,0.77083,0.030528
6,0.303839,0.100768,0.303682
7,0.32492,0.558833,0.489434
8,0.49795,0.980674,0.688351
9,0.731315,0.702102,0.119362


'Melted DataFrame:'

,variable,value
0,Column A,0.465538
1,Column A,0.13585
2,Column A,0.467719
3,Column A,0.174865
4,Column A,0.581747
5,Column A,0.117306
6,Column A,0.303839
7,Column A,0.32492
8,Column A,0.49795
9,Column A,0.731315


'Filtered DataFrame:'

,Column A,Column B,Column C
4,0.581747,0.375265,0.485304
9,0.731315,0.702102,0.119362
10,0.993226,0.782158,0.621174
11,0.959279,0.774722,0.030088
12,0.556527,0.852705,0.899067
14,0.661941,0.537056,0.756287


'Merged DataFrame:'

,Column A,Column B,Column C,Column D
0,0.465538,0.445022,0.75971,0.010975
1,0.13585,0.271254,0.252382,0.343383
2,0.467719,0.046016,0.628122,0.632905
3,0.174865,0.39459,0.240067,0.12532
4,0.581747,0.375265,0.485304,0.77949
5,0.117306,0.77083,0.030528,0.020289
6,0.303839,0.100768,0.303682,0.067747
7,0.32492,0.558833,0.489434,0.435235
8,0.49795,0.980674,0.688351,0.462965
9,0.731315,0.702102,0.119362,0.583026


'DataFrame with Missing Values:'

,Column A,Column B,Column C,Column D
0,0.465538,0.445022,0.75971,0.010975
1,0.13585,0.271254,,0.343383
2,0.467719,0.046016,,0.632905
3,0.174865,0.39459,0.240067,0.12532
4,0.581747,0.375265,0.485304,0.77949
5,0.117306,0.77083,0.030528,0.020289
6,0.303839,0.100768,0.303682,0.067747
7,0.32492,,0.489434,0.435235
8,,0.980674,0.688351,0.462965
9,0.731315,,0.119362,0.583026


'DataFrame after Dropping Missing Values:'

,Column A,Column B,Column C,Column D
0,0.465538,0.445022,0.75971,0.010975
3,0.174865,0.39459,0.240067,0.12532
4,0.581747,0.375265,0.485304,0.77949
5,0.117306,0.77083,0.030528,0.020289
6,0.303839,0.100768,0.303682,0.067747
11,0.959279,0.774722,0.030088,0.219863
12,0.556527,0.852705,0.899067,0.047146
13,0.073683,0.234956,0.381687,0.999222


'DataFrame after Imputing Missing Values and Normalization:'

,Column A,Column B,Column C,Column D,Column A_normalized
0,0.465538,0.445022,0.75971,0.010975,0.426141
1,0.13585,0.271254,0.40248,0.343383,0.067606
2,0.467719,0.046016,0.40248,0.632905,0.428513
3,0.174865,0.39459,0.240067,0.12532,0.110034
4,0.581747,0.375265,0.485304,0.77949,0.552517
5,0.117306,0.77083,0.030528,0.020289,0.047439
6,0.303839,0.100768,0.303682,0.067747,0.250293
7,0.32492,0.505078,0.489434,0.435235,0.273219
8,0.452755,0.980674,0.688351,0.462965,0.412239
9,0.731315,0.505078,0.119362,0.583026,0.715172


## Dimensionality Reduction using PCA

Apply Principal Component Analysis (PCA) to reduce the Iris features from four dimensions to two. The features are standardized first because PCA is sensitive to the scale of its inputs.

In [12]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

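# Standardize each feature to zero mean and unit variance before PCA
scaler = StandardScaler()
scaled_data = scaler.fit_transform(df_iris)

# Project the four standardized features onto the first two principal components
pca = PCA(n_components=2)
principal_components = pca.fit_transform(scaled_data)

df_pca = pd.DataFrame(data=principal_components, columns=['principal component 1', 'principal component 2'])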

display("Original DataFrame shape:", df_iris.shape)
display("DataFrame after PCA shape:", df_pca.shape)
display("Principal Components (first 5 rows):")
display(df_pca.head())

'Original DataFrame shape:'

(150, 4)

'DataFrame after PCA shape:'

(150, 2)

'Principal Components (first 5 rows):'

,principal component 1,principal component 2
0,-2.264703,0.480027
1,-2.080961,-0.674134
2,-2.364229,-0.341908
3,-2.299384,-0.597395
4,-2.389842,0.646835
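
The shapes above show that two columns remain, but not how much information they retain. As a sketch, the fitted PCA object exposes `explained_variance_ratio_` for that, and `components_` for the weight of each original feature in each component. The names `ratios` and `loadings` and the labels `PC1` and `PC2` are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = load_iris()
scaled = StandardScaler().fit_transform(iris.data)

pca = PCA(n_components=2).fit(scaled)

# Fraction of the total variance captured by each retained component
ratios = pca.explained_variance_ratio_
print(ratios, ratios.sum())

# Weight of each original feature in each principal component
loadings = pd.DataFrame(
    pca.components_.T,
    index=iris.feature_names,
    columns=['PC1', 'PC2'],
)
print(loadings)
```

If the summed ratio is too low for the task at hand, increasing `n_components` trades compactness for retained variance.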
