
**Pre-processing, Transformation, Joins, Manipulation**

1. *Clean the Data by Handling Missing Values, Duplicates, and Errors.*
2. *Perform Necessary Data Transformations Such as Normalization and Scaling.*
3. *Join Data from Multiple Sources as Needed.*
4. *Apply Business Logic and Derive New Columns.*
5. *Document the Transformation Processes and Results.*


In [None]:
# Importing libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler


In [None]:
# Main dataset
data = {
    "ID": [1, 2, 2, 3, 4, np.nan, 6],
    "Name": ["Alice", "Bob", "Bob", "Charlie", "David", "Eve", None],
    "Age": [25, 30, 30, np.nan, 45, 22, 35],
    "Salary": [50000, 60000, 60000, 70000, 80000, np.nan, 90000],
    "Department": ["HR", "Finance", "Finance", "IT", None, "HR", "IT"],
}
df = pd.DataFrame(data)
print(df)

# Second dataset for merging
extra_data = {
    "ID": [1, 2, 3, 4, 6],
    "Performance_Score": [85, 90, 88, 92, 95],
}
extra_df = pd.DataFrame(extra_data)


    ID     Name   Age   Salary Department
0  1.0    Alice  25.0  50000.0         HR
1  2.0      Bob  30.0  60000.0    Finance
2  2.0      Bob  30.0  60000.0    Finance
3  3.0  Charlie   NaN  70000.0         IT
4  4.0    David  45.0  80000.0       None
5  NaN      Eve  22.0      NaN         HR
6  6.0     None  35.0  90000.0         IT


In [None]:

# Remove duplicate rows
df.drop_duplicates(inplace=True)

# Fill null values
# Assign each result back to its column: calling fillna(..., inplace=True) on df[col]
# acts on an intermediate copy and will stop working in pandas 3.0.
# Caveat: the median ID is 3.0, so Eve ends up sharing an ID with Charlie.
df['ID'] = df['ID'].fillna(df['ID'].median())
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Salary'] = df['Salary'].fillna(df['Salary'].median())
df['Name'] = df['Name'].fillna("Unknown")
df['Department'] = df['Department'].fillna("Unknown")
df

    ID     Name   Age   Salary Department
0  1.0    Alice  25.0  50000.0         HR
1  2.0      Bob  30.0  60000.0    Finance
3  3.0  Charlie  30.0  70000.0         IT
4  4.0    David  45.0  80000.0    Unknown
5  3.0      Eve  22.0  70000.0         HR
6  6.0  Unknown  35.0  90000.0         IT
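
As a quick check on the cleaning step, the sketch below rebuilds the sample data and counts missing values and duplicate rows before and after cleaning. The helper name `clean` is hypothetical and not part of the notebook; it only wraps the same drop-duplicates and fill steps used in the cell above.

```python
import numpy as np
import pandas as pd

# Same sample data as the main dataset above
raw = pd.DataFrame({
    "ID": [1, 2, 2, 3, 4, np.nan, 6],
    "Name": ["Alice", "Bob", "Bob", "Charlie", "David", "Eve", None],
    "Age": [25, 30, 30, np.nan, 45, 22, 35],
    "Salary": [50000, 60000, 60000, 70000, 80000, np.nan, 90000],
    "Department": ["HR", "Finance", "Finance", "IT", None, "HR", "IT"],
})


def clean(frame):
    """Hypothetical helper: drop duplicate rows, then fill gaps as in the cell above."""
    out = frame.drop_duplicates().copy()
    for col in ["ID", "Age", "Salary"]:
        out[col] = out[col].fillna(out[col].median())  # numeric: median
    for col in ["Name", "Department"]:
        out[col] = out[col].fillna("Unknown")  # text: placeholder label
    return out


cleaned = clean(raw)

# Count missing cells and fully duplicated rows on both sides of the cleaning step
missing_before = int(raw.isna().sum().sum())
missing_after = int(cleaned.isna().sum().sum())
duplicates_before = int(raw.duplicated().sum())
duplicates_after = int(cleaned.duplicated().sum())

print("missing cells:", missing_before, "->", missing_after)
print("duplicate rows:", duplicates_before, "->", duplicates_after)
```

Both counts should drop to zero after cleaning. A check like this can run after every cleaning change to catch regressions early.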


In [None]:


# Normalize numerical features using MinMaxScaler
numerical_features = ['Age', 'Salary']
scaler = MinMaxScaler()
df[numerical_features] = scaler.fit_transform(df[numerical_features])

df

    ID     Name       Age  Salary Department
0  1.0    Alice  0.130435    0.00         HR
1  2.0      Bob  0.347826    0.25    Finance
3  3.0  Charlie  0.347826    0.50         IT
4  4.0    David  1.000000    0.75    Unknown
5  3.0      Eve  0.000000    0.50         HR
6  6.0  Unknown  0.565217    1.00         IT
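
With its default `feature_range=(0, 1)`, `MinMaxScaler` maps each column through `(x - min) / (max - min)`. The sketch below applies that formula by hand to the cleaned Age values, using NumPy only, to show where numbers such as 0.130435 come from. The variable names are illustrative and not part of the notebook.

```python
import numpy as np

# Cleaned Age column from the table above (missing value already filled with the median, 30)
ages = np.array([25.0, 30.0, 30.0, 45.0, 22.0, 35.0])

# Min-max scaling: shift by the minimum, divide by the range
age_min, age_max = ages.min(), ages.max()
manual = (ages - age_min) / (age_max - age_min)
print(np.round(manual, 6))

# The transformation is reversible as long as min and max are kept
restored = manual * (age_max - age_min) + age_min
print(restored)
```

For example, Alice's age of 25 becomes (25 - 22) / (45 - 22) = 3/23, which rounds to 0.130435.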


In [None]:

# Merge the two dataframes
df = pd.merge(df, extra_df, on='ID', how='left')
df

    ID     Name       Age  Salary Department  Performance_Score
0  1.0    Alice  0.130435    0.00         HR                 85
1  2.0      Bob  0.347826    0.25    Finance                 90
2  3.0  Charlie  0.347826    0.50         IT                 88
3  4.0    David  1.000000    0.75    Unknown                 92
4  3.0      Eve  0.000000    0.50         HR                 88
5  6.0  Unknown  0.565217    1.00         IT                 95
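
One way to sanity-check a join like this is to ask pandas to validate the key relationship and to flag where each row came from. The sketch below uses the `validate` and `indicator` parameters of `pd.merge` on a trimmed copy of the data; the frame names `left` and `right` are illustrative, not from the notebook.

```python
import pandas as pd

# Trimmed copy of the cleaned data: Eve carries the imputed ID 3.0
left = pd.DataFrame({
    "ID": [1.0, 2.0, 3.0, 4.0, 3.0, 6.0],
    "Name": ["Alice", "Bob", "Charlie", "David", "Eve", "Unknown"],
})
right = pd.DataFrame({
    "ID": [1, 2, 3, 4, 6],
    "Performance_Score": [85, 90, 88, 92, 95],
})

# validate raises MergeError if IDs in `right` are not unique;
# indicator adds a `_merge` column naming the source of each row
merged = pd.merge(left, right, on="ID", how="left",
                  validate="many_to_one", indicator=True)

# Rows with no match in the second dataset (none here)
unmatched = merged[merged["_merge"] == "left_only"]

# Rows whose ID appears more than once on the left side
shared_ids = merged[merged["ID"].duplicated(keep=False)]

print(unmatched)
print(shared_ids[["ID", "Name", "Performance_Score"]])
```

Here every row finds a match, but Charlie and Eve share ID 3.0, so Eve receives Charlie's score of 88. That follows from filling the missing ID with the median rather than from the merge itself.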


In [None]:
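# Apply the business logic: bonus is 10% of salary.
# Salary was min-max scaled above, so Bonus is on that 0-1 scale rather than in currency units.
df['Bonus'] = df['Salary'] * 0.10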
df

    ID     Name       Age  Salary Department  Performance_Score  Bonus
0  1.0    Alice  0.130435    0.00         HR                 85  0.000
1  2.0      Bob  0.347826    0.25    Finance                 90  0.025
2  3.0  Charlie  0.347826    0.50         IT                 88  0.050
3  4.0    David  1.000000    0.75    Unknown                 92  0.075
4  3.0      Eve  0.000000    0.50         HR                 88  0.050
5  6.0  Unknown  0.565217    1.00         IT                 95  0.100
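
Because Salary was scaled before the bonus was derived, the Bonus values above are fractions of the scaled salary. If the intent is a bonus in currency units, one possible ordering is to derive it from the raw salary and store the scaled values in a separate column. The sketch below assumes that intent; the `Salary_scaled` column name is hypothetical.

```python
import pandas as pd

# Cleaned (unscaled) salaries from the cleaning step
salaries = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Eve", "Unknown"],
    "Salary": [50000.0, 60000.0, 70000.0, 80000.0, 70000.0, 90000.0],
})

# Derive the bonus from the raw salary first: 10% of salary
salaries["Bonus"] = salaries["Salary"] * 0.10

# Keep the scaled values in their own (hypothetical) column
# so the raw Salary stays available for business rules
s = salaries["Salary"]
salaries["Salary_scaled"] = (s - s.min()) / (s.max() - s.min())

print(salaries)
```

Keeping raw and scaled columns side by side avoids having to invert the scaling later when a rule needs the original units.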
