In [1]:
'''Clean and Preprocess a Real-World Dataset (e.g., Hospital Dataset) for Analysis.

Hospital Dataset – Outlier Detection & Removal

Objective:
• Detect and remove unrealistic values (outliers) from the hospital dataset.

Requirements:
• Create a dataset with patient ages.
• Identify invalid ages (e.g., > 120 or < 0).
• Remove such outliers.

Deliverables:
• Python code for outlier detection.
• Output showing the dataset before/after cleaning.
• Discussion on why outlier removal matters.
'''
import pandas as pd

# -----------------------------
# Step 1: Create Hospital Dataset
# -----------------------------
data = {
    "patient_id": range(1, 11),
    "age": [25, 34, 120, 45, -5, 60, 89, 150, 33, 72],  # includes invalid ages (-5 and 150)
    "disease": ["Flu", "Cold", "Cancer", "Asthma", "Diabetes",
                "Flu", "Cold", "Flu", "Asthma", "Cancer"]
}

hospital_df = pd.DataFrame(data)

print("📌 Original Dataset:")
print(hospital_df)

# -----------------------------
# Step 2: Detect Outliers (Invalid Ages)
# -----------------------------
invalid_ages = hospital_df[(hospital_df["age"] > 120) | (hospital_df["age"] < 0)]

print("\n⚠️ Invalid Age Records (Outliers):")
print(invalid_ages)

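# -----------------------------
# Step 3: Remove Outliers
# -----------------------------
# Keep only rows with 0 <= age <= 120 (the exact complement of the Step 2 filter).
# The original row index is preserved, so removed rows leave gaps in it.
cleaned_df = hospital_df[(hospital_df["age"] >= 0) & (hospital_df["age"] <= 120)]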

print("\n✅ Cleaned Dataset (Outliers Removed):")
print(cleaned_df)

# -----------------------------
# Step 4: Insights
# -----------------------------
print("\n📊 Insights:")
print(f"Original dataset size: {len(hospital_df)} records")
print(f"Cleaned dataset size: {len(cleaned_df)} records")
print(f"Outliers removed: {len(hospital_df) - len(cleaned_df)} records")

📌 Original Dataset:
   patient_id  age   disease
0           1   25       Flu
1           2   34      Cold
2           3  120    Cancer
3           4   45    Asthma
4           5   -5  Diabetes
5           6   60       Flu
6           7   89      Cold
7           8  150       Flu
8           9   33    Asthma
9          10   72    Cancer

⚠️ Invalid Age Records (Outliers):
   patient_id  age   disease
4           5   -5  Diabetes
7           8  150       Flu

✅ Cleaned Dataset (Outliers Removed):
   patient_id  age disease
0           1   25     Flu
1           2   34    Cold
2           3  120  Cancer
3           4   45  Asthma
5           6   60     Flu
6           7   89    Cold
8           9   33  Asthma
9          10   72  Cancer

📊 Insights:
Original dataset size: 10 records
Cleaned dataset size: 8 records
Outliers removed: 2 records
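
The detection and removal steps above apply the same age rule twice, once as-is and once negated. A minimal sketch of computing the rule once and splitting on it is shown below; the helper name `split_by_valid_age` and its default bounds are illustrative assumptions, not part of the original lab code. It uses `Series.between`, which is inclusive on both ends by default, so an age of exactly 120 is kept, as in the output above.

```python
import pandas as pd

# Same patient ages as in the cell above.
hospital_df = pd.DataFrame({
    "patient_id": range(1, 11),
    "age": [25, 34, 120, 45, -5, 60, 89, 150, 33, 72],
})


def split_by_valid_age(df, low=0, high=120):
    """Return (valid_rows, invalid_rows) for an inclusive [low, high] age range.

    Hypothetical helper: the name and default bounds are assumptions.
    """
    is_valid = df["age"].between(low, high)  # inclusive on both ends by default
    return df[is_valid].copy(), df[~is_valid].copy()


valid_df, invalid_df = split_by_valid_age(hospital_df)
print("Valid ages:  ", valid_df["age"].tolist())
print("Invalid ages:", invalid_df["age"].tolist())
```

Because the two groups come from one mask and its negation, every row lands in exactly one of them. A missing age would compare as not valid and so end up in the invalid group.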


In [2]:
'''Clean and Preprocess a Real-World Dataset (e.g., Banking Dataset) for Analysis.

Banking Dataset – Feature Scaling

Objective:
• Normalize customer income data to prepare for machine learning models.

Requirements:
• Create a dataset with customer incomes.
• Apply Min-Max normalization.
• Apply Standardization (Z-score).

Deliverables:
• Python code for scaling features.
• Output showing normalized & standardized values.
• Discussion on when to use scaling.

Expected Time to Complete: 2 hours
'''
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# -----------------------------
# Step 1: Create Banking Dataset
# -----------------------------
data = {
    "customer_id": [1, 2, 3, 4, 5, 6],
    "income": [25000, 48000, 32000, 58000, 150000, 72000]
}

banking_df = pd.DataFrame(data)

print("📌 Original Dataset:")
print(banking_df)

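# -----------------------------
# Step 2: Apply Min-Max Normalization
# x_scaled = (x - min) / (max - min), so values fall in [0, 1]
# -----------------------------
minmax_scaler = MinMaxScaler()
# fit_transform expects 2-D input, hence the double brackets around "income"
banking_df["income_minmax"] = minmax_scaler.fit_transform(banking_df[["income"]])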

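# -----------------------------
# Step 3: Apply Standardization (Z-score)
# z = (x - mean) / std, giving mean 0 and unit variance
# -----------------------------
standard_scaler = StandardScaler()
# StandardScaler divides by the population standard deviation (ddof=0)
banking_df["income_zscore"] = standard_scaler.fit_transform(banking_df[["income"]])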

print("\n✅ Dataset After Scaling:")
print(banking_df)

📌 Original Dataset:
   customer_id  income
0            1   25000
1            2   48000
2            3   32000
3            4   58000
4            5  150000
5            6   72000

✅ Dataset After Scaling:
   customer_id  income  income_minmax  income_zscore
0            1   25000          0.000      -0.945454
1            2   48000          0.184      -0.390251
2            3   32000          0.056      -0.776479
3            4   58000          0.264      -0.148859
4            5  150000          1.000       2.071952
5            6   72000          0.376       0.189091
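
As a cross-check on the scaled columns above, the same values can be reproduced directly from the two formulas with plain pandas. This is a sketch for verification only, not a replacement for the scikit-learn scalers; the variable names are illustrative. Note that pandas' `Series.std` defaults to the sample standard deviation (`ddof=1`), so `ddof=0` is passed explicitly to match `StandardScaler`.

```python
import pandas as pd

# Same incomes as in the cell above.
income = pd.Series([25000, 48000, 32000, 58000, 150000, 72000], dtype="float64")

# Min-Max normalization: (x - min) / (max - min)
income_minmax = (income - income.min()) / (income.max() - income.min())

# Z-score standardization: (x - mean) / std, using the population std (ddof=0)
income_zscore = (income - income.mean()) / income.std(ddof=0)

print(pd.DataFrame({
    "income": income,
    "income_minmax": income_minmax.round(3),
    "income_zscore": income_zscore.round(6),
}))
```

The single large income (150000) sets the top of the Min-Max range, which is why the other five customers all fall below 0.4 after normalization.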
