# Pandas Day 6


## Sorting Data

Sorting is used to arrange data in a specific order.

- Data can be sorted by column values or index
- Sorting helps in understanding patterns and rankings
- Supports ascending and descending order


In [186]:
import pandas as pd 

# Load the dataset
df = pd.read_csv('Datasets/employees_merged_data.csv')

df

Unnamed: 0,emp_id,emp_name,department,salary,experience,performance_score
0,101,Nikhil,IT,50000,2,78
1,102,Aman,HR,45000,1,65
2,103,Rohit,IT,55000,3,82
3,104,Sahil,Finance,60000,4,90
4,105,Kunal,HR,48000,2,70
5,106,Ravi,IT,52000,3,88
6,106,Ravi,IT,52000,3,88


## Using sort_values()

In [187]:
# Sort by salary in ascending order (the default)
df.sort_values('salary').head()   # head() limits the output to the first five rows

Unnamed: 0,emp_id,emp_name,department,salary,experience,performance_score
1,102,Aman,HR,45000,1,65
4,105,Kunal,HR,48000,2,70
0,101,Nikhil,IT,50000,2,78
5,106,Ravi,IT,52000,3,88
6,106,Ravi,IT,52000,3,88


In [188]:
# Sort by salary in descending order
df.sort_values('salary',ascending=False)

Unnamed: 0,emp_id,emp_name,department,salary,experience,performance_score
3,104,Sahil,Finance,60000,4,90
2,103,Rohit,IT,55000,3,82
5,106,Ravi,IT,52000,3,88
6,106,Ravi,IT,52000,3,88
0,101,Nikhil,IT,50000,2,78
4,105,Kunal,HR,48000,2,70
1,102,Aman,HR,45000,1,65
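
Since sorting is often used for rankings, here is a minimal sketch of two related tools: `nlargest()` for a top-N selection and `rank()` for an explicit rank column. The sample rows are copied from the table above; the `salary_rank` column name is an assumption made for illustration.

```python
import pandas as pd

# Sample rows copied from the dataset above (column subset for brevity)
df = pd.DataFrame({
    "emp_name": ["Nikhil", "Aman", "Rohit", "Sahil", "Kunal", "Ravi"],
    "salary": [50000, 45000, 55000, 60000, 48000, 52000],
})

# Top three salaries without sorting the whole DataFrame
top3 = df.nlargest(3, "salary")

# Dense rank: 1 = highest salary; equal salaries would share a rank
# ("salary_rank" is a hypothetical column name)
df["salary_rank"] = df["salary"].rank(ascending=False, method="dense").astype(int)

print(top3)
print(df.sort_values("salary_rank"))
```

`nlargest()` returns the same rows as sorting in descending order and taking `head(3)`, but states the intent more directly.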


In [189]:
# Sort by multiple columns: salary first, then experience to break ties
df.sort_values(['salary','experience'])

Unnamed: 0,emp_id,emp_name,department,salary,experience,performance_score
1,102,Aman,HR,45000,1,65
4,105,Kunal,HR,48000,2,70
0,101,Nikhil,IT,50000,2,78
5,106,Ravi,IT,52000,3,88
6,106,Ravi,IT,52000,3,88
2,103,Rohit,IT,55000,3,82
3,104,Sahil,Finance,60000,4,90
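
When sorting by several columns, each column can have its own direction by passing a list to `ascending`. The sketch below uses a subset of the rows shown above and sorts departments alphabetically with the highest salary first inside each department; the `by_dept` name is only illustrative.

```python
import pandas as pd

# Sample rows copied from the dataset above (column subset for brevity)
df = pd.DataFrame({
    "emp_name": ["Nikhil", "Aman", "Rohit", "Sahil", "Kunal", "Ravi"],
    "department": ["IT", "HR", "IT", "Finance", "HR", "IT"],
    "salary": [50000, 45000, 55000, 60000, 48000, 52000],
})

# One ascending flag per sort column:
# department A to Z, then salary from highest to lowest
by_dept = df.sort_values(["department", "salary"], ascending=[True, False])

print(by_dept)
```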


## Using sort_index()

In [190]:
# Sort by index in ascending order (the default)
df.sort_index(inplace=True)  # inplace=True modifies df itself instead of returning a new DataFrame
df

Unnamed: 0,emp_id,emp_name,department,salary,experience,performance_score
0,101,Nikhil,IT,50000,2,78
1,102,Aman,HR,45000,1,65
2,103,Rohit,IT,55000,3,82
3,104,Sahil,Finance,60000,4,90
4,105,Kunal,HR,48000,2,70
5,106,Ravi,IT,52000,3,88
6,106,Ravi,IT,52000,3,88


In [191]:
# Sort by index in descending order
df.sort_index(ascending=False)   # ascending=False reverses the order

Unnamed: 0,emp_id,emp_name,department,salary,experience,performance_score
6,106,Ravi,IT,52000,3,88
5,106,Ravi,IT,52000,3,88
4,105,Kunal,HR,48000,2,70
3,104,Sahil,Finance,60000,4,90
2,103,Rohit,IT,55000,3,82
1,102,Aman,HR,45000,1,65
0,101,Nikhil,IT,50000,2,78
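
Index labels travel with their rows when values are sorted, which is why `sort_index()` can restore the original order afterwards. The following sketch, on a three-row sample taken from the data above, shows that behaviour and the `ignore_index=True` option for a fresh 0..n-1 index; the variable names are illustrative.

```python
import pandas as pd

# Three sample rows copied from the dataset above
df = pd.DataFrame({
    "emp_name": ["Nikhil", "Aman", "Rohit"],
    "salary": [50000, 45000, 55000],
})

# Sorting by value keeps each row's original index label
by_salary = df.sort_values("salary")

# Sorting by index puts the rows back in their original order
restored = by_salary.sort_index()

# ignore_index=True discards the old labels and renumbers from 0
renumbered = df.sort_values("salary", ignore_index=True)

print(by_salary.index.tolist())
print(restored.index.tolist())
print(renumbered)
```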


## Handling Duplicate Data

Duplicate records distort counts, sums, and averages, so they should be identified and handled before analysis.

- Duplicates can be identified using built-in methods
- Unwanted duplicates can be removed to keep data clean
- Validation is done after removal to ensure correctness


In [192]:
# Flag duplicate rows (a row is a duplicate when every column matches an earlier row)
df.duplicated()

0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool

In [193]:
# Show the duplicate rows
df[df.duplicated()]

Unnamed: 0,emp_id,emp_name,department,salary,experience,performance_score
6,106,Ravi,IT,52000,3,88


In [194]:
# Check for duplicates based on a specific column
df.duplicated(subset='emp_id')

0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool
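
Which copy gets flagged is controlled by the `keep` parameter of `duplicated()` (and of `drop_duplicates()`). A small sketch on four sample rows, with illustrative variable names:

```python
import pandas as pd

# Four sample rows; emp_id 106 appears twice, as in the dataset above
df = pd.DataFrame({
    "emp_id": [101, 102, 106, 106],
    "emp_name": ["Nikhil", "Aman", "Ravi", "Ravi"],
})

# Default keep="first": the first occurrence is not flagged, later copies are
first = df.duplicated(subset="emp_id")

# keep="last": the last occurrence is not flagged, earlier copies are
last = df.duplicated(subset="emp_id", keep="last")

# keep=False: every row that has a duplicate is flagged
every = df.duplicated(subset="emp_id", keep=False)

print(first.tolist())
print(last.tolist())
print(every.tolist())
```

`keep=False` is handy for inspecting all copies side by side before deciding which one to drop.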

In [195]:
# Remove duplicate rows (the first occurrence is kept)
df.drop_duplicates(inplace=True)

In [196]:
# Remove duplicates based on a column
# Without inplace=True this returns a new DataFrame and leaves df unchanged
df.drop_duplicates(subset='emp_id')

Unnamed: 0,emp_id,emp_name,department,salary,experience,performance_score
0,101,Nikhil,IT,50000,2,78
1,102,Aman,HR,45000,1,65
2,103,Rohit,IT,55000,3,82
3,104,Sahil,Finance,60000,4,90
4,105,Kunal,HR,48000,2,70
5,106,Ravi,IT,52000,3,88


In [197]:
# Validate after removing duplicates: count the remaining duplicate rows (should be 0)
df.duplicated().sum()

np.int64(0)
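
Beyond counting remaining duplicates, validation can also check how many rows were removed and whether a key column is now unique. This is a sketch on a small sample, assuming `emp_id` is meant to identify one employee per row; the variable names are illustrative.

```python
import pandas as pd

# Four sample rows with one exact duplicate
df = pd.DataFrame({
    "emp_id": [101, 102, 106, 106],
    "salary": [50000, 45000, 52000, 52000],
})

rows_before = len(df)

# Drop exact duplicates and renumber the index
deduped = df.drop_duplicates().reset_index(drop=True)

removed = rows_before - len(deduped)
ids_unique = deduped["emp_id"].is_unique  # True when no emp_id repeats

print(f"rows removed: {removed}")
print(f"emp_id unique: {ids_unique}")
```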

## Renaming Columns

Renaming columns improves readability and consistency.

- Columns can be renamed to be more meaningful
- Helpful when column names are unclear or inconsistent
- Makes datasets easier to understand and work with


In [198]:
df

Unnamed: 0,emp_id,emp_name,department,salary,experience,performance_score
0,101,Nikhil,IT,50000,2,78
1,102,Aman,HR,45000,1,65
2,103,Rohit,IT,55000,3,82
3,104,Sahil,Finance,60000,4,90
4,105,Kunal,HR,48000,2,70
5,106,Ravi,IT,52000,3,88


In [199]:
# Rename a single column
df = df.rename(columns={'emp_id':'Employee_ID'})
df

Unnamed: 0,Employee_ID,emp_name,department,salary,experience,performance_score
0,101,Nikhil,IT,50000,2,78
1,102,Aman,HR,45000,1,65
2,103,Rohit,IT,55000,3,82
3,104,Sahil,Finance,60000,4,90
4,105,Kunal,HR,48000,2,70
5,106,Ravi,IT,52000,3,88


In [200]:
# Rename multiple columns
df = df.rename(columns={
    'emp_name':'Employee_name',
    'department':'Department',
    'salary':'Salary',
    'experience':'Experience',
    'performance_score':'Performance_score'})

df

Unnamed: 0,Employee_ID,Employee_name,Department,Salary,Experience,Performance_score
0,101,Nikhil,IT,50000,2,78
1,102,Aman,HR,45000,1,65
2,103,Rohit,IT,55000,3,82
3,104,Sahil,Finance,60000,4,90
4,105,Kunal,HR,48000,2,70
5,106,Ravi,IT,52000,3,88
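
`rename()` also accepts a function instead of a dictionary, and `errors="raise"` makes a misspelled key fail loudly rather than being silently ignored. A minimal sketch on a one-row sample; the misspelled key `emp_nmae` is deliberate.

```python
import pandas as pd

# One sample row with the original lowercase column names
df = pd.DataFrame({"emp_id": [101], "emp_name": ["Nikhil"], "salary": [50000]})

# A callable is applied to every column label
capitalized = df.rename(columns=str.capitalize)
print(list(capitalized.columns))

# By default, keys that match no column are ignored;
# errors="raise" turns that into a KeyError
message = ""
try:
    df.rename(columns={"emp_nmae": "Employee_name"}, errors="raise")
except KeyError as exc:
    message = str(exc)

print(message)
```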


## Creating New Columns

New columns derive additional information from existing data, for example through arithmetic on one or more columns.


In [201]:
# New column using arithmetic:
df['Annual_salary']=df['Salary']*12

df

Unnamed: 0,Employee_ID,Employee_name,Department,Salary,Experience,Performance_score,Annual_salary
0,101,Nikhil,IT,50000,2,78,600000
1,102,Aman,HR,45000,1,65,540000
2,103,Rohit,IT,55000,3,82,660000
3,104,Sahil,Finance,60000,4,90,720000
4,105,Kunal,HR,48000,2,70,576000
5,106,Ravi,IT,52000,3,88,624000
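
An alternative to direct assignment is `assign()`, which returns a new DataFrame and leaves the original untouched. The sketch below assumes, as the `* 12` above implies, that `Salary` is a monthly figure; `Annual_per_experience_year` is a hypothetical column added only to show that later arguments can use earlier ones.

```python
import pandas as pd

# Two sample rows copied from the dataset above
df = pd.DataFrame({"Salary": [50000, 45000], "Experience": [2, 1]})

# Each lambda receives the DataFrame built so far,
# so the second column can use the first
enriched = df.assign(
    Annual_salary=lambda d: d["Salary"] * 12,
    Annual_per_experience_year=lambda d: d["Annual_salary"] / d["Experience"],
)

print(enriched)
print(list(df.columns))  # the original DataFrame is unchanged
```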


## Using apply()

The `apply()` method is used to apply a custom function to data.

- Can be applied to a column or across rows
- Useful when built-in operations are not sufficient
- Helps perform custom transformations on data


In [202]:
# Column-wise: the function receives each value of one column
df['Experience_level']=df['Experience'].apply(lambda x: "Junior" if x<2 else "Mid" if x<4 else "Senior")
df

Unnamed: 0,Employee_ID,Employee_name,Department,Salary,Experience,Performance_score,Annual_salary,Experience_level
0,101,Nikhil,IT,50000,2,78,600000,Mid
1,102,Aman,HR,45000,1,65,540000,Junior
2,103,Rohit,IT,55000,3,82,660000,Mid
3,104,Sahil,Finance,60000,4,90,720000,Senior
4,105,Kunal,HR,48000,2,70,576000,Mid
5,106,Ravi,IT,52000,3,88,624000,Mid


In [203]:
# Row-wise: axis=1 passes each row to the function
df["performance_label"] = df.apply(
    lambda row: "Excellent" if row["Performance_score"] >= 80 and row["Experience"] >= 3 else "Average",
    axis=1
)
df

Unnamed: 0,Employee_ID,Employee_name,Department,Salary,Experience,Performance_score,Annual_salary,Experience_level,performance_label
0,101,Nikhil,IT,50000,2,78,600000,Mid,Average
1,102,Aman,HR,45000,1,65,540000,Junior,Average
2,103,Rohit,IT,55000,3,82,660000,Mid,Excellent
3,104,Sahil,Finance,60000,4,90,720000,Senior,Excellent
4,105,Kunal,HR,48000,2,70,576000,Mid,Average
5,106,Ravi,IT,52000,3,88,624000,Mid,Excellent
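
Row-wise `apply()` calls a Python function once per row, which can be slow on large data. When the logic is a simple condition, a vectorized version with boolean masks and `np.where` gives the same labels. This is a sketch of that alternative, not a replacement for the cell above; the `is_excellent` name is illustrative.

```python
import numpy as np
import pandas as pd

# Score and experience columns copied from the dataset above
df = pd.DataFrame({
    "Performance_score": [78, 65, 82, 90, 70, 88],
    "Experience": [2, 1, 3, 4, 2, 3],
})

# Combine column-level comparisons with & (element-wise "and")
is_excellent = (df["Performance_score"] >= 80) & (df["Experience"] >= 3)

# np.where picks "Excellent" where the mask is True, "Average" elsewhere
df["performance_label"] = np.where(is_excellent, "Excellent", "Average")

print(df)
```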


## Conditional Transformation

Conditional transformation is used to modify data based on specific conditions.

- Applies logic to categorize or label data
- Helps convert raw values into meaningful groups
- Commonly used for decision-based transformations


In [204]:
# Create a new column based on conditions
df["Performance_category"] = df["Performance_score"].apply(
    lambda x: "High" if x >= 80 else "Medium" if x >= 60 else "Low"
)

# Another example using multiple conditions
df["Salary_band"] = df["Salary"].apply(
    lambda x: "Low" if x < 50000 else "Mid" if x < 60000 else "High"
)

# Verify the transformation
df[["Performance_score", "Performance_category", "Salary", "Salary_band"]].head()


Unnamed: 0,Performance_score,Performance_category,Salary,Salary_band
0,78,Medium,50000,Mid
1,65,Medium,45000,Low
2,82,High,55000,Mid
3,90,High,60000,High
4,70,Medium,48000,Low
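
The same categories can be produced without `apply()`. The sketch below uses `np.select` for ordered conditions and `pd.cut` for numeric bins, with thresholds taken from the lambdas above; it is one possible alternative, shown on the first five rows.

```python
import numpy as np
import pandas as pd

# Score and salary columns copied from the first five rows above
df = pd.DataFrame({
    "Performance_score": [78, 65, 82, 90, 70],
    "Salary": [50000, 45000, 55000, 60000, 48000],
})

# np.select evaluates conditions in order, like a chained if/else
conditions = [df["Performance_score"] >= 80, df["Performance_score"] >= 60]
df["Performance_category"] = np.select(conditions, ["High", "Medium"], default="Low")

# right=False makes each bin include its left edge:
# (-inf, 50000), [50000, 60000), [60000, inf)
df["Salary_band"] = pd.cut(
    df["Salary"],
    bins=[-np.inf, 50000, 60000, np.inf],
    labels=["Low", "Mid", "High"],
    right=False,
)

print(df)
```

Note that `pd.cut` returns a categorical column, whereas the `apply()` version produces plain strings.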


In [205]:
# Verifying changes 
df.head()

Unnamed: 0,Employee_ID,Employee_name,Department,Salary,Experience,Performance_score,Annual_salary,Experience_level,performance_label,Performance_category,Salary_band
0,101,Nikhil,IT,50000,2,78,600000,Mid,Average,Medium,Mid
1,102,Aman,HR,45000,1,65,540000,Junior,Average,Medium,Low
2,103,Rohit,IT,55000,3,82,660000,Mid,Excellent,High,Mid
3,104,Sahil,Finance,60000,4,90,720000,Senior,Excellent,High,High
4,105,Kunal,HR,48000,2,70,576000,Mid,Average,Medium,Low


In [206]:
# Save the final dataset to a new CSV file
# The index is written as an extra first column; pass index=False to omit it
df.to_csv('Datasets/employees_final_data.csv')
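
By default `to_csv()` also writes the index, which shows up as an `Unnamed: 0` column when the file is read back. The following sketch demonstrates the difference using in-memory text instead of a file, so no path is assumed; the variable names are illustrative.

```python
import io

import pandas as pd

# Two sample rows copied from the dataset above
df = pd.DataFrame({"Employee_ID": [101, 102], "Salary": [50000, 45000]})

# to_csv() without a path returns the CSV text;
# the default output includes the index as an unnamed first column
reloaded_with_index = pd.read_csv(io.StringIO(df.to_csv()))

# index=False writes only the data columns
reloaded_clean = pd.read_csv(io.StringIO(df.to_csv(index=False)))

print(list(reloaded_with_index.columns))
print(list(reloaded_clean.columns))
```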

## Summary (Day 6)

- Sorted data using different criteria to understand order and ranking
- Identified and removed duplicate records
- Renamed columns for better readability and consistency
- Created new columns using existing data
- Applied custom logic using the `apply()` method
- Performed conditional transformations to categorize data
- Validated changes after transformation
- Saved the transformed dataset for further analysis
