<a href="https://colab.research.google.com/github/krishang10/AIML/blob/main/Untitled0.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import pandas as pd
import numpy as np
from scipy import stats

# Sample dataset creation (you can replace this with your dataset)
data = {
    'age': [25, 30, 35, np.nan, 45, 50, 100],  # Outlier at 100, Missing value at index 3
    'salary': [50000, 60000, 55000, 62000, np.nan, 48000, 1000000],  # Outlier at 1000000, Missing value at index 4
    'experience': [2, 5, np.nan, 10, 15, 20, 25]  # Missing value at index 2
}

df = pd.DataFrame(data)

print("Original Dataset:\n", df)

# 1. Handling Missing Values
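# Impute missing values with each column's median, which is less sensitive to outliers than the mean.
# Assign the result back to the column: calling fillna(..., inplace=True) on a column selection
# raises a FutureWarning and will stop updating df in pandas 3.0.
df['age'] = df['age'].fillna(df['age'].median())
df['salary'] = df['salary'].fillna(df['salary'].median())
df['experience'] = df['experience'].fillna(df['experience'].median())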

print("\nDataset after handling missing values:\n", df)

# 2. Detecting Outliers using Z-score
z_scores = np.abs(stats.zscore(df))
print("\nZ-scores of the dataset:\n", z_scores)

# Flag rows where any column's |z| exceeds the threshold (commonly 3).
# With only 7 rows no |z| can reach 3, so this filter flags nothing on the sample data.
outliers = (z_scores > 3).any(axis=1)
print("\nRows with outliers:\n", df[outliers])

# Removing the outliers (Optional: Depending on the use case)
df_no_outliers = df[~outliers]

print("\nDataset after removing outliers:\n", df_no_outliers)

# 3. Handling Outliers using IQR (Interquartile Range)
Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = Q3 - Q1

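# Keep only rows where every column lies within [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
iqr_outliers = ((df < lower_bound) | (df > upper_bound)).any(axis=1)
df_IQR_filtered = df[~iqr_outliers]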

print("\nDataset after filtering outliers using IQR:\n", df_IQR_filtered)
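The IQR step above drops every row that falls outside the fences. As a hedged alternative, not part of the original notebook, the same fences can be used to cap extreme values so that no rows are lost. The sketch below is self-contained: it rebuilds the imputed dataset from the printed output, and the helper name `iqr_fences` and the variable `df_capped` are hypothetical names chosen for illustration.

```python
import pandas as pd

# Imputed dataset, copied from the "after handling missing values" output.
df = pd.DataFrame({
    'age': [25.0, 30.0, 35.0, 40.0, 45.0, 50.0, 100.0],
    'salary': [50000.0, 60000.0, 55000.0, 62000.0, 57500.0, 48000.0, 1000000.0],
    'experience': [2.0, 5.0, 12.5, 10.0, 15.0, 20.0, 25.0],
})


def iqr_fences(frame, k=1.5):
    """Return per-column (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


lower, upper = iqr_fences(df)

# Cap each column at its own fences instead of dropping rows.
# axis=1 aligns the fence Series with the DataFrame's columns.
df_capped = df.clip(lower=lower, upper=upper, axis=1)

print("Lower fences:\n", lower)
print("\nUpper fences:\n", upper)
print("\nDataset after capping at the IQR fences:\n", df_capped)
```

Whether to drop or cap depends on the use case: capping keeps the row's other columns available for modelling, while dropping avoids introducing artificial values at the fence.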


Original Dataset:
      age     salary  experience
0   25.0    50000.0         2.0
1   30.0    60000.0         5.0
2   35.0    55000.0         NaN
3    NaN    62000.0        10.0
4   45.0        NaN        15.0
5   50.0    48000.0        20.0
6  100.0  1000000.0        25.0

Dataset after handling missing values:
      age     salary  experience
0   25.0    50000.0         2.0
1   30.0    60000.0         5.0
2   35.0    55000.0        12.5
3   40.0    62000.0        10.0
4   45.0    57500.0        15.0
5   50.0    48000.0        20.0
6  100.0  1000000.0        25.0

Z-scores of the dataset:
         age    salary  experience
0  0.921443  0.424593    1.443275
1  0.706439  0.394342    1.041835
2  0.491436  0.409468    0.038232
3  0.276433  0.388292    0.372767
4  0.061430  0.401905    0.296302
5  0.153574  0.430644    0.965370
6  2.303607  2.449244    1.634438

Rows with outliers:
 Empty DataFrame
Columns: [age, salary, experience]
Index: []
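
The empty result above is expected for a sample this small. `stats.zscore` uses the population standard deviation (`ddof=0`) by default, and under that definition no value in a sample of size n can have |z| larger than sqrt(n - 1). For 7 rows the limit is about 2.45, so a threshold of 3 cannot be reached even by the salary of 1000000. The sketch below checks this with NumPy; the helper name `max_abs_zscore` is hypothetical.

```python
import numpy as np


def max_abs_zscore(n):
    """Largest possible |z| in a sample of size n when z uses the population std (ddof=0)."""
    return np.sqrt(n - 1)


# Salary column after imputation, copied from the printed output.
salary = np.array([50000, 60000, 55000, 62000, 57500, 48000, 1000000], dtype=float)

# np.std defaults to ddof=0, which matches the default of scipy.stats.zscore.
z = np.abs((salary - salary.mean()) / salary.std())

print("Largest observed |z|:", z.max())
print("Upper limit for n = 7:", max_abs_zscore(len(salary)))
```

On small datasets a lower threshold or a quantile-based rule such as the IQR filter in section 3 is more likely to flag extreme rows.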

Dataset after removing outliers:
      age   

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df['age'].fillna(df['age'].median(), inplace=True)