**Objective**:
The goal of this classwork is to provide hands-on experience in data preprocessing, specifically focusing on handling missing values and applying feature scaling techniques. Students will learn how to clean and prepare data for machine learning models by working with a provided dataset.

Required: After finishing each task, print your result clearly so that the grader can check it.

**Submission: One HTML file that clearly shows both your code and its output.**


**Handling Missing Values**:

Task 1.1: Detect missing values in the dataset and count the number of missing values in each column.

Task 1.2: Handle missing values in the dataset:
Fill missing values in the numerical columns (Age, Income, Experience, Score) with the column mean.
Drop rows with missing values in the non-numerical columns (Name, City).


**Feature Scaling**:

Task 2.1: Apply Min-Max Scaling to the columns Age, Income, Experience, and Score.

Task 2.2: Apply Standardization (Z-score) to the same columns.

Task 2.3: Apply Robust Scaling to the same columns.
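
For reference, the three techniques can be written per column as below. This is a sketch that assumes scikit-learn's default settings (`feature_range=(0, 1)` for `MinMaxScaler`, `quantile_range=(25.0, 75.0)` for `RobustScaler`); the symbols are illustrative, with $\mu$ and $\sigma$ the column mean and population standard deviation, and $Q_1$, $Q_3$ the first and third quartiles.

```latex
\begin{align*}
x_{\text{minmax}} &= \frac{x - x_{\min}}{x_{\max} - x_{\min}} \\
x_{\text{zscore}} &= \frac{x - \mu}{\sigma} \\
x_{\text{robust}} &= \frac{x - \operatorname{median}(x)}{Q_3 - Q_1}
\end{align*}
```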


**Dataset for Classwork**

Here’s the dataset for this classwork.

In [50]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

# Creating a sample dataset
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emma', 'Frank', 'Grace', 'Helen'],
    'Age': [25, 30, np.nan, 35, 28, 40, np.nan, 50],
    'Income': [50000, 60000, 75000, np.nan, 52000, 82000, 92000, np.nan],
    'Experience': [1, 3, 5, 7, np.nan, 12, 15, 20],
    'Score': [85, np.nan, 78, 95, 89, 100, 60, 70],
    'City': ['New York', 'San Francisco', 'Los Angeles', 'Chicago', 'Boston', 'Austin', 'Dallas', np.nan]
}

# Converting to DataFrame
df = pd.DataFrame(data)
print("Sample Data for Classwork:\n", df)



Sample Data for Classwork:
       Name   Age   Income  Experience  Score           City
0    Alice  25.0  50000.0         1.0   85.0       New York
1      Bob  30.0  60000.0         3.0    NaN  San Francisco
2  Charlie   NaN  75000.0         5.0   78.0    Los Angeles
3    David  35.0      NaN         7.0   95.0        Chicago
4     Emma  28.0  52000.0         NaN   89.0         Boston
5    Frank  40.0  82000.0        12.0  100.0         Austin
6    Grace   NaN  92000.0        15.0   60.0         Dallas
7    Helen  50.0      NaN        20.0   70.0            NaN


In [51]:
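# Task 1.1: Detect missing values in the dataset and count the number of missing values in each column.
# Boolean mask: True wherever a value is missing
missingVal = df.isnull()

# Summing the mask column-wise gives the number of missing values per column
missingCount = missingVal.sum()
print("Missing Value count:\n", missingCount)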

Missing Value count:
 Name          0
Age           2
Income        2
Experience    1
Score         1
City          1
dtype: int64
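
As an optional extension of the count above, the sketch below also reports the percentage of missing values per column and lists the rows that contain at least one missing value. The names `missing_summary` and `rows_with_missing` are illustrative and not part of the original notebook; the dataset is rebuilt so the block runs on its own.

```python
import numpy as np
import pandas as pd

# Same dataset as in the classwork, rebuilt so this block is self-contained.
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emma', 'Frank', 'Grace', 'Helen'],
    'Age': [25, 30, np.nan, 35, 28, 40, np.nan, 50],
    'Income': [50000, 60000, 75000, np.nan, 52000, 82000, 92000, np.nan],
    'Experience': [1, 3, 5, 7, np.nan, 12, 15, 20],
    'Score': [85, np.nan, 78, 95, 89, 100, 60, 70],
    'City': ['New York', 'San Francisco', 'Los Angeles', 'Chicago',
             'Boston', 'Austin', 'Dallas', np.nan],
})

# Count and percentage of missing values per column.
missing_summary = pd.DataFrame({
    'missing_count': df.isna().sum(),
    'missing_pct': (df.isna().mean() * 100).round(1),
})
print(missing_summary)

# Rows that contain at least one missing value.
rows_with_missing = df[df.isna().any(axis=1)]
print(rows_with_missing)
```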


In [52]:
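# Task 1.2: Handle missing values in the dataset:
# Fill missing values in the numerical columns (Age, Income, Experience, Score) with the column mean.
# Drop rows with missing values in the non-numerical columns (Name, City).

# Column means are computed over all rows before any row is dropped
meanFilled = df.fillna(df.mean(numeric_only=True))
# Only Name and City can still contain missing values at this point
meanFilled = meanFilled.dropna(subset=['Name', 'City'])
meanFilled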

      Name        Age   Income  Experience       Score           City
0    Alice       25.0  50000.0         1.0        85.0       New York
1      Bob       30.0  60000.0         3.0   82.428571  San Francisco
2  Charlie  34.666667  75000.0         5.0        78.0    Los Angeles
3    David       35.0  68500.0         7.0        95.0        Chicago
4     Emma       28.0  52000.0         9.0        89.0         Boston
5    Frank       40.0  82000.0        12.0       100.0         Austin
6    Grace  34.666667  92000.0        15.0        60.0         Dallas
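
The same cleaning step can also be written so that each rule targets its columns explicitly. This is a minimal sketch, not the required solution; `numeric_cols`, `text_cols`, and `cleaned` are illustrative names, and resetting the index is an optional choice that keeps row labels contiguous after rows are dropped.

```python
import numpy as np
import pandas as pd

# Same dataset as in the classwork, rebuilt so this block is self-contained.
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emma', 'Frank', 'Grace', 'Helen'],
    'Age': [25, 30, np.nan, 35, 28, 40, np.nan, 50],
    'Income': [50000, 60000, 75000, np.nan, 52000, 82000, 92000, np.nan],
    'Experience': [1, 3, 5, 7, np.nan, 12, 15, 20],
    'Score': [85, np.nan, 78, 95, 89, 100, 60, 70],
    'City': ['New York', 'San Francisco', 'Los Angeles', 'Chicago',
             'Boston', 'Austin', 'Dallas', np.nan],
})

numeric_cols = ['Age', 'Income', 'Experience', 'Score']
text_cols = ['Name', 'City']

cleaned = df.copy()
# Mean imputation, restricted to the numerical columns.
cleaned[numeric_cols] = cleaned[numeric_cols].fillna(cleaned[numeric_cols].mean())
# Drop rows with a missing Name or City, then renumber the remaining rows.
cleaned = cleaned.dropna(subset=text_cols).reset_index(drop=True)

print(cleaned)
print("Remaining missing values:", cleaned.isna().sum().sum())
```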


In [65]:
# Task 2.1: Apply Min-Max Scaling to the columns Age, Income, Experience, and Score.
# Initializing the scaler
minmax_scaler = MinMaxScaler()

# Applying Min-Max Scaling
cols = ['Age', 'Income', 'Experience', 'Score']
df_minmax_scaled = meanFilled.copy()
# Assign the scaled array directly: wrapping it in pd.DataFrame would create a
# fresh 0..n-1 index that could misalign with meanFilled's index after dropna
df_minmax_scaled[cols] = minmax_scaler.fit_transform(df_minmax_scaled[cols])
# Suffix every column name (Name and City included) to label this copy
df_minmax_scaled.columns = [f"{col}_MinMax" for col in df_minmax_scaled.columns]
print(df_minmax_scaled)

  Name_MinMax  Age_MinMax  Income_MinMax  Experience_MinMax  Score_MinMax  \
0       Alice    0.000000       0.000000           0.000000      0.625000   
1         Bob    0.333333       0.238095           0.142857      0.560714   
2     Charlie    0.644444       0.595238           0.285714      0.450000   
3       David    0.666667       0.440476           0.428571      0.875000   
4        Emma    0.200000       0.047619           0.571429      0.725000   
5       Frank    1.000000       0.761905           0.785714      1.000000   
6       Grace    0.644444       1.000000           1.000000      0.000000   

     City_MinMax  
0       New York  
1  San Francisco  
2    Los Angeles  
3        Chicago  
4         Boston  
5         Austin  
6         Dallas  
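
One way to sanity-check the Min-Max result is to recompute it by hand as `(x - min) / (max - min)` and compare with `MinMaxScaler`. The sketch below rebuilds the cleaned numerical columns so it runs on its own; `raw`, `X`, `manual`, and `sk` are illustrative names.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Numerical columns plus City (needed only to drop the row with a missing city).
raw = pd.DataFrame({
    'Age': [25, 30, np.nan, 35, 28, 40, np.nan, 50],
    'Income': [50000, 60000, 75000, np.nan, 52000, 82000, 92000, np.nan],
    'Experience': [1, 3, 5, 7, np.nan, 12, 15, 20],
    'Score': [85, np.nan, 78, 95, 89, 100, 60, 70],
    'City': ['New York', 'San Francisco', 'Los Angeles', 'Chicago',
             'Boston', 'Austin', 'Dallas', np.nan],
})
cols = ['Age', 'Income', 'Experience', 'Score']
raw[cols] = raw[cols].fillna(raw[cols].mean())
X = raw.dropna(subset=['City'])[cols]

# Manual Min-Max scaling, column by column.
manual = (X - X.min()) / (X.max() - X.min())

# scikit-learn result on the same data.
sk = MinMaxScaler().fit_transform(X)

print(manual.round(6))
print("Matches MinMaxScaler:", np.allclose(manual.to_numpy(), sk))
```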


In [68]:
# Task 2.2: Apply Standardization (Z-score) to the same columns.
# Initializing the scaler
standard_scaler = StandardScaler()

# Applying Standardization (Z-score)
cols = ['Age', 'Income', 'Experience', 'Score']
df_standard_scaled = meanFilled.copy()
# Assign the scaled array directly so the values cannot misalign with meanFilled's index
df_standard_scaled[cols] = standard_scaler.fit_transform(df_standard_scaled[cols])
# Suffix every column name (Name and City included) to label this copy
df_standard_scaled.columns = [f"{col}_Standardized" for col in df_standard_scaled.columns]

print(df_standard_scaled)

  Name_Standardized  Age_Standardized  Income_Standardized  \
0             Alice         -1.590654            -1.279453   
1               Bob         -0.526841            -0.587857   
2           Charlie          0.466051             0.449538   
3             David          0.536972             0.000000   
4              Emma         -0.952366            -1.141134   
5             Frank          1.600785             0.933655   
6             Grace          0.466051             1.625251   

   Experience_Standardized  Score_Standardized City_Standardized  
0                -1.399433            0.066027          New York  
1                -0.964054           -0.147290     San Francisco  
2                -0.528675           -0.514669       Los Angeles  
3                -0.093296            0.895592           Chicago  
4                 0.342084            0.397853            Boston  
5                 0.995153            1.310375            Austin  
6                 1.648222        
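
A detail worth checking by hand: `StandardScaler` divides by the population standard deviation (`ddof=0`), whereas pandas' `std()` defaults to the sample standard deviation (`ddof=1`). The sketch below recomputes the z-scores manually under that assumption and compares them with scikit-learn; `raw`, `X`, `manual`, and `sk` are illustrative names.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Numerical columns plus City (needed only to drop the row with a missing city).
raw = pd.DataFrame({
    'Age': [25, 30, np.nan, 35, 28, 40, np.nan, 50],
    'Income': [50000, 60000, 75000, np.nan, 52000, 82000, 92000, np.nan],
    'Experience': [1, 3, 5, 7, np.nan, 12, 15, 20],
    'Score': [85, np.nan, 78, 95, 89, 100, 60, 70],
    'City': ['New York', 'San Francisco', 'Los Angeles', 'Chicago',
             'Boston', 'Austin', 'Dallas', np.nan],
})
cols = ['Age', 'Income', 'Experience', 'Score']
raw[cols] = raw[cols].fillna(raw[cols].mean())
X = raw.dropna(subset=['City'])[cols]

# Manual z-score using the population standard deviation (ddof=0).
manual = (X - X.mean()) / X.std(ddof=0)

# scikit-learn result on the same data.
sk = StandardScaler().fit_transform(X)

print(manual.round(6))
print("Matches StandardScaler:", np.allclose(manual.to_numpy(), sk))
```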

In [69]:
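# Task 2.3: Apply Robust Scaling to the same columns.
# Initializing the scaler
robust_scaler = RobustScaler()

# Applying Robust Scaling
cols = ['Age', 'Income', 'Experience', 'Score']
df_robust_scaled = meanFilled.copy()
# Assign the scaled array directly so the values cannot misalign with meanFilled's index
df_robust_scaled[cols] = robust_scaler.fit_transform(df_robust_scaled[cols])
# Suffix every column name (Name and City included) to label this copy
df_robust_scaled.columns = [f"{col}_Robust" for col in df_robust_scaled.columns]

print(df_robust_scaled)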

  Name_Robust  Age_Robust  Income_Robust  Experience_Robust  Score_Robust  \
0       Alice   -1.657143      -0.822222          -0.923077      0.000000   
1         Bob   -0.800000      -0.377778          -0.615385     -0.218182   
2     Charlie    0.000000       0.288889          -0.307692     -0.593939   
3       David    0.057143       0.000000           0.000000      0.848485   
4        Emma   -1.142857      -0.733333           0.307692      0.339394   
5       Frank    0.914286       0.600000           0.769231      1.272727   
6       Grace    0.000000       1.044444           1.230769     -2.121212   

     City_Robust  
0       New York  
1  San Francisco  
2    Los Angeles  
3        Chicago  
4         Boston  
5         Austin  
6         Dallas  
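
With its default settings, `RobustScaler` subtracts the column median and divides by the interquartile range (75th minus 25th percentile). The sketch below reproduces that by hand and compares it with scikit-learn; `raw`, `X`, `iqr`, `manual`, and `sk` are illustrative names, and the dataset is rebuilt so the block runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler

# Numerical columns plus City (needed only to drop the row with a missing city).
raw = pd.DataFrame({
    'Age': [25, 30, np.nan, 35, 28, 40, np.nan, 50],
    'Income': [50000, 60000, 75000, np.nan, 52000, 82000, 92000, np.nan],
    'Experience': [1, 3, 5, 7, np.nan, 12, 15, 20],
    'Score': [85, np.nan, 78, 95, 89, 100, 60, 70],
    'City': ['New York', 'San Francisco', 'Los Angeles', 'Chicago',
             'Boston', 'Austin', 'Dallas', np.nan],
})
cols = ['Age', 'Income', 'Experience', 'Score']
raw[cols] = raw[cols].fillna(raw[cols].mean())
X = raw.dropna(subset=['City'])[cols]

# Interquartile range per column (linear interpolation between data points).
iqr = X.quantile(0.75) - X.quantile(0.25)

# Manual robust scaling: center on the median, scale by the IQR.
manual = (X - X.median()) / iqr

# scikit-learn result on the same data (default quantile_range=(25.0, 75.0)).
sk = RobustScaler().fit_transform(X)

print(manual.round(6))
print("Matches RobustScaler:", np.allclose(manual.to_numpy(), sk))
```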
