# What Is Data Reduction?

Purpose:

Data reduction shrinks a dataset while preserving the information that matters for analysis. It helps to:

Improve computational efficiency

Reduce storage requirements

Simplify models

Remove redundant or irrelevant data



# Key Techniques of Data Reduction

Dimensionality Reduction – reducing the number of features (columns)

Numerical Aggregation / Binning – grouping continuous values into bins

Sampling / Row Reduction – reducing the number of rows

Feature Selection – keeping only the most important variables

Removing Redundant Data – dropping duplicates

# 1. Remove duplicates

Purpose:

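Datasets sometimes contain rows or columns that are exact copies of one another.

Removing exact duplicates shrinks the dataset without losing information, provided the repetition is not meaningful in itself (for example, genuine repeat observations whose frequency matters).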

In [1]:
import pandas as pd

df = pd.DataFrame({
    'Car': ['A', 'B', 'A', 'C', 'B'],
    'Price': [15000, 20000, 15000, 22000, 20000]
})

print(df)
# Remove duplicate rows
df_no_duplicates = df.drop_duplicates()
print(df_no_duplicates)


  Car  Price
0   A  15000
1   B  20000
2   A  15000
3   C  22000
4   B  20000
  Car  Price
0   A  15000
1   B  20000
3   C  22000
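
The example above drops only fully identical rows. As a minimal sketch of two related cases, the snippet below drops rows that repeat a key column and drops a column that is an exact copy of another. The `PriceCopy` column and the variable names are hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical data: 'PriceCopy' is an exact copy of 'Price'
df = pd.DataFrame({
    'Car': ['A', 'B', 'A'],
    'Price': [15000, 20000, 15500],
    'PriceCopy': [15000, 20000, 15500],
})

# Keep the first row for each Car, even if other columns differ
df_first_per_car = df.drop_duplicates(subset=['Car'], keep='first')
print(df_first_per_car)

# Transpose so columns become rows, flag the repeated ones, then drop them
duplicate_columns = df.T.duplicated()
df_unique_cols = df.loc[:, ~duplicate_columns]
print(df_unique_cols)
```

Note that `subset=['Car']` discards rows whose other values differ (here the second price for car A), so it is only lossless when the key column fully identifies a record.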


# 2. Feature Selection (Reducing Columns)

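Feature selection keeps only the columns that help predict the target. Common approaches are filter methods (statistical measures such as correlation), wrapper methods (model-based evaluation such as recursive feature elimination), and embedded methods (importance scores built into the model).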

In [6]:
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

# Sample dataset
df = pd.DataFrame({
    'Price':[15000, 20000, 16000, 22000, 21000],
    'Age':[5, 3, 4, 2, 6],
    'HP':[100, 120, 110, 130, 115],
    'Doors':[3, 5, 3, 5, 3]
})

X = df[['Age','HP','Doors']]
y = df['Price']

# 1. Correlation filter
corr = df.corr()['Price']
print("Correlation with Price:\n", corr)

# 2. Wrapper method - RFE
model = LinearRegression()
rfe = RFE(model, n_features_to_select=2)
rfe.fit(X, y)
print("RFE Selected Features:", X.columns[rfe.support_])

# 3. Embedded method - RandomForest
# No random_state is set, so the importances vary between runs
rf = RandomForestRegressor()
rf.fit(X, y)
importances = pd.Series(rf.feature_importances_, index=X.columns)
print("Feature Importances:\n", importances.sort_values(ascending=False))



Correlation with Price:
 Price    1.000000
Age     -0.355371
HP       0.897448
Doors    0.644831
Name: Price, dtype: float64
RFE Selected Features: Index(['Age', 'Doors'], dtype='object')
Feature Importances:
 HP       0.727486
Doors    0.138459
Age      0.134056
dtype: float64
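
In the output above, RFE drops `HP` even though the correlation filter and the random forest both rank it highest. RFE with `LinearRegression` ranks features by coefficient magnitude, which depends on each feature's units, so this may partly explain the disagreement. A minimal sketch, assuming standardizing the features first is acceptable for the analysis:

```python
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Same sample dataset as above
df = pd.DataFrame({
    'Price': [15000, 20000, 16000, 22000, 21000],
    'Age': [5, 3, 4, 2, 6],
    'HP': [100, 120, 110, 130, 115],
    'Doors': [3, 5, 3, 5, 3]
})
X = df[['Age', 'HP', 'Doors']]
y = df['Price']

# Put every feature on the same scale (mean 0, unit variance)
# so coefficient magnitudes are comparable across features
X_scaled = pd.DataFrame(StandardScaler().fit_transform(X), columns=X.columns)

rfe = RFE(LinearRegression(), n_features_to_select=2)
rfe.fit(X_scaled, y)
selected = list(X.columns[rfe.support_])
print("RFE Selected Features (scaled):", selected)
```

With only five rows, any of these rankings is unstable, so the result is best read as an illustration of the API rather than a conclusion about the features.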


# 3. Dimensionality Reduction (Reducing Correlated Features)

Purpose:

When a dataset has many correlated numeric columns, PCA (principal component analysis) or a similar technique can combine them into fewer components.

The components retain most of the variance in the original columns while reducing the column count.

Example using PCA:

In [3]:
from sklearn.decomposition import PCA

df_features = pd.DataFrame({
    'HP': [100, 110, 105, 120, 115],
    'CC': [1400, 1600, 1500, 1800, 1600],
    'Weight': [1200, 1500, 1400, 1600, 1500]
})

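# PCA to reduce 3 features into 2 components
# The features are not standardized here, so the larger-scale
# columns (CC and Weight) dominate the components
pca = PCA(n_components=2)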
reduced_features = pca.fit_transform(df_features)
print(reduced_features)


[[-297.63488302   38.90152899]
 [  56.86271549  -27.63409389]
 [ -84.63691419  -29.31268408]
 [ 268.3636357    45.61588976]
 [  57.04544601  -27.57064078]]
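
To check how much information the components keep, one option is to inspect `explained_variance_ratio_`. The sketch below also standardizes the features first, which is an assumption added here and not part of the example above.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

df_features = pd.DataFrame({
    'HP': [100, 110, 105, 120, 115],
    'CC': [1400, 1600, 1500, 1800, 1600],
    'Weight': [1200, 1500, 1400, 1600, 1500]
})

# Standardize so each feature contributes on the same scale
scaled = StandardScaler().fit_transform(df_features)

pca = PCA(n_components=2)
reduced = pca.fit_transform(scaled)

# Share of the total variance captured by each component
ratios = pca.explained_variance_ratio_
print("Explained variance ratio:", ratios)
print("Total variance kept:", ratios.sum())
```

A total close to 1 suggests that two components summarize the three columns well; a much lower total suggests keeping more components.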


# 4. Numerical Aggregation / Binning

Purpose:

Continuous variables (like age, price, or mileage) can be grouped into bins.

Binning simplifies the data and reduces the number of unique values, at the cost of some precision.

In [4]:
df = pd.DataFrame({'Price':[15000, 20000, 16000, 22000, 21000]})

# Bin Price into categories
# Intervals are right-closed: (0, 16000], (16000, 20000], (20000, 25000]
df['PriceCategory'] = pd.cut(
    df['Price'],
    bins=[0, 16000, 20000, 25000],
    labels=['Low', 'Medium', 'High']
)
print(df)


   Price PriceCategory
0  15000           Low
1  20000        Medium
2  16000           Low
3  22000          High
4  21000          High
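
Binning becomes a reduction in row count once the bins are aggregated. A minimal sketch, assuming a count and a mean per bin are the summaries of interest:

```python
import pandas as pd

df = pd.DataFrame({'Price': [15000, 20000, 16000, 22000, 21000]})
df['PriceCategory'] = pd.cut(
    df['Price'],
    bins=[0, 16000, 20000, 25000],
    labels=['Low', 'Medium', 'High']
)

# Collapse the five rows into one summary row per bin
summary = df.groupby('PriceCategory', observed=True)['Price'].agg(['count', 'mean'])
print(summary)
```

Here five rows become three: `Low` holds 2 cars with a mean price of 15500, `Medium` holds 1 at 20000, and `High` holds 2 with a mean of 21500.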


# 5. Sampling / Row Reduction

Purpose:

Large datasets may have millions of rows.

Random sampling lets us work with a smaller subset that is approximately representative of the full data, reducing memory usage and speeding up analysis.

In [5]:
df = pd.DataFrame({'Car':['A','B','C','D','E'], 'Price':[15000,20000,16000,22000,21000]})

# Randomly sample 60% of rows; random_state makes the sample reproducible
df_sampled = df.sample(frac=0.6, random_state=42)
print(df_sampled)


  Car  Price
1   B  20000
4   E  21000
2   C  16000
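
A simple random sample can under-represent small groups. One way to preserve group proportions is to sample within each group. The sketch below assumes pandas 1.1 or later for `groupby(...).sample`, and the `Type` column and its values are hypothetical.

```python
import pandas as pd

# Hypothetical data: 6 sedans and 2 SUVs
df = pd.DataFrame({
    'Car': list('ABCDEFGH'),
    'Type': ['Sedan'] * 6 + ['SUV'] * 2,
    'Price': [15000, 20000, 16000, 22000, 21000, 18000, 30000, 32000]
})

# Sample 50% of the rows within each Type so both groups stay represented
df_stratified = df.groupby('Type').sample(frac=0.5, random_state=42)
print(df_stratified)

counts = df_stratified['Type'].value_counts()
print(counts)
```

The sample keeps 3 sedans and 1 SUV, matching the 3:1 ratio in the full data.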
