In [1]:
# Detect & Remove Outliers using IQR Method

# Objective: Learn to identify and remove outliers from a dataset using the Interquartile Range (IQR) method.
# Instructions:
# For each example, perform the following steps:
#     1. Load the Dataset: Load the dataset into your environment. You can use pandas to read the CSV file.
#     2. Calculate IQR: Calculate the first quartile (Q1), third quartile (Q3), and the IQR for the specified column.
#     3. Identify Outliers: Determine which data points are considered outliers.
#     4. Remove Outliers: Remove the outliers from the dataset.
#     5. Verify: Ensure the outliers are removed by checking the size or summary statistics of the dataset before and after the removal.

# Task:
#     Dataset: sales_data.csv (supply your own copy; it must include a Monthly_Sales column)
#     Column to analyze: Monthly_Sales
#     Steps:
#         1. Load sales_data.csv.
#         2. Calculate Q1, Q3, and IQR for Monthly_Sales.
#         3. Identify outliers.
#         4. Remove the outliers.
#         5. Check the number of rows removed.

import pandas as pd
import numpy as np

# sales_data.csv is not supplied here, so generate a synthetic dataset with a 'Monthly_Sales' column instead
np.random.seed(42)
monthly_sales = np.random.normal(loc=10000, scale=2000, size=1000)  # Normally distributed sales data

# Inject outliers: add a random offset in [5000, 15000) to every 50th value (20 rows in total)
monthly_sales[::50] += np.random.randint(5000, 15000, size=20)

# Create a DataFrame
df = pd.DataFrame({'Monthly_Sales': monthly_sales})

# Check the first few rows of the dataset
print("Original Data:")
print(df.head())

# Calculate Q1 (25th percentile), Q3 (75th percentile), and IQR (Interquartile Range)
Q1 = df['Monthly_Sales'].quantile(0.25)
Q3 = df['Monthly_Sales'].quantile(0.75)
IQR = Q3 - Q1

# Compute the outlier fences: values more than 1.5 * IQR below Q1 or above Q3 count as outliers
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

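# Identify the outliers: rows whose value falls outside [lower_bound, upper_bound]
is_inlier = df['Monthly_Sales'].between(lower_bound, upper_bound)  # inclusive on both ends
outliers = df[~is_inlier]

# Remove the outliers by keeping only the rows inside the fences
df_no_outliers = df[is_inlier]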

# Verify the number of rows removed
removed_rows = len(df) - len(df_no_outliers)
print(f"\nNumber of rows removed: {removed_rows}")

# Show summary statistics before and after removing outliers
print("\nSummary statistics before removing outliers:")
print(df['Monthly_Sales'].describe())

print("\nSummary statistics after removing outliers:")
print(df_no_outliers['Monthly_Sales'].describe())

Original Data:
   Monthly_Sales
0   25071.428306
1    9723.471398
2   11295.377076
3   13046.059713
4    9531.693251

Number of rows removed: 24

Summary statistics before removing outliers:
count     1000.000000
mean     10253.031112
std       2518.419336
min       3517.465320
25%       8749.288605
50%      10094.379265
75%      11383.319170
max      27604.912190
Name: Monthly_Sales, dtype: float64

Summary statistics after removing outliers:
count      976.000000
mean     10056.659599
std       1911.919669
min       5056.711000
25%       8725.049317
50%      10052.976889
75%      11300.997393
max      15264.764130
Name: Monthly_Sales, dtype: float64
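
The same steps can be wrapped in small reusable helpers and checked on data small enough to verify by hand. The following is a minimal sketch, not part of the original exercise: the function names iqr_bounds and split_outliers, the parameter k, and the toy values are all hypothetical choices made for illustration.

```python
import pandas as pd


def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def split_outliers(df: pd.DataFrame, column: str, k: float = 1.5):
    """Split df into (inliers, outliers) using the IQR fences of one column."""
    lower, upper = iqr_bounds(df[column], k)
    is_inlier = df[column].between(lower, upper)  # inclusive on both ends
    return df[is_inlier], df[~is_inlier]


# Hypothetical toy data: nine ordinary values and one obvious outlier (102)
toy = pd.DataFrame({"Monthly_Sales": [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]})

lower, upper = iqr_bounds(toy["Monthly_Sales"])
kept, outliers = split_outliers(toy, "Monthly_Sales")

print(f"Fences: [{lower}, {upper}]")
print(f"Kept {len(kept)} rows, flagged {len(outliers)} as outliers")
print(outliers)
```

For this toy data Q1 = 12 and Q3 = 13.75 (pandas interpolates linearly between sorted values by default), so IQR = 1.75, the fences are 9.375 and 16.375, and only the value 102 is flagged. Returning the outliers alongside the cleaned frame makes it possible to inspect what was dropped before discarding it.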
