### GROUP MEMBERS:
##### NAME: Wan Ammar Rusyaidi Bin Wan Ahmad Sulmizan
##### MATRIC NO: 2412333
##### NAME: Tuan Muhammad Faqih Zikri Bin Tuan Mohd Pauzi
##### MATRIC NO: 2414913
##### TITLE: Salary Data Analysis

### INTRODUCTION
##### This analysis explores how salary is influenced by key factors such as:
##### 1) Years of Experience – Does more experience lead to higher pay?
##### 2) Age – How does salary change over different age groups?
##### 3) Gender – Are there significant pay gaps between male and female employees?
##### 4) Job Title – Which roles earn the highest and lowest salaries?
##### 5) Education Level – Do higher qualifications result in better compensation?

### BACKGROUND PROBLEM
##### The need for this analysis stems from several critical gaps in understanding workplace compensation:

##### 1) Lack of Transparency in Pay Structures
##### Many employees and employers operate without clear benchmarks for how salaries should scale with experience, education, or seniority. This opacity can lead to arbitrary pay decisions, dissatisfaction, and talent attrition.

##### 2) Potential Gender Pay Disparities
##### Despite increased awareness of gender equality, pay gaps persist across industries. Quantifying these disparities and understanding whether they vary by role or experience level is essential for fostering equitable workplaces.

##### 3) Career Development Uncertainty
##### Professionals often lack empirical data on how investments in education or years of experience translate into earnings growth. This makes it difficult to plan career trajectories or evaluate the return on educational investments.

##### 4) Inconsistent Compensation Policies
##### HR departments and organizational leaders frequently design pay structures based on industry standards rather than internal data. This can result in misaligned incentives, pay compression, or unintended biases in compensation.

### OBJECTIVE OF THE ANALYSIS
##### 1) Establish Salary Correlations
##### - Determine the strength of relationships between salary and:
##### => Years of experience
##### => Education level (Bachelor’s, Master’s, PhD, etc.)
##### => Age demographics

##### 2) Evaluate Gender Pay Equity
##### - Compare earnings between male and female employees in comparable roles.
##### - Identify whether gaps persist after accounting for experience and education.

##### 3) Benchmark Compensation by Role
##### - Rank job titles by median and percentile salaries.
##### - Highlight which positions offer the highest growth potential.

##### 4) Generate Actionable Insights
##### - Provide employees with data to support salary negotiations.
##### - Equip HR teams with evidence to refine compensation frameworks.
##### - Help job seekers assess the financial implications of career choices.
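
As a rough illustration of how the first three objectives could be measured, the sketch below computes a Pearson correlation, median salary by gender within each role, and a ranking of roles by median salary. The `toy` frame and its values are invented for illustration and are not taken from the dataset; only the column names follow the data used later in this notebook.

```python
import pandas as pd

# Toy rows shaped like the salary dataset (values are illustrative only).
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female", "Male", "Female"],
    "Job Title": ["Analyst", "Analyst", "Engineer", "Engineer", "Director", "Director"],
    "Years of Experience": [2, 3, 5, 6, 20, 18],
    "Salary": [50000, 55000, 90000, 95000, 200000, 190000],
})

# Objective 1: strength of the linear relationship (Pearson correlation).
exp_salary_corr = toy["Years of Experience"].corr(toy["Salary"])

# Objective 2: median salary by gender within each job title.
median_by_role_gender = (
    toy.groupby(["Job Title", "Gender"])["Salary"].median().unstack("Gender")
)

# Objective 3: rank job titles by median salary, highest first.
role_ranking = (
    toy.groupby("Job Title")["Salary"].median().sort_values(ascending=False)
)

print(round(exp_salary_corr, 3))
print(median_by_role_gender)
print(role_ranking)
```

Comparing medians within a role is only a first pass at the pay equity question; controlling for experience and education would need a regression or a finer grouping.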

### LOADING THE RAW DATASET

In [1]:
#Data Loading & Initialization
import numpy as np
import pandas as pd

df = pd.read_csv('salary_data.csv')
df

| | Age | Gender | Education Level | Job Title | Years of Experience | Salary |
|---|---|---|---|---|---|---|
| 0 | 32.0 | Male | Bachelor's | Software Engineer | 5.0 | 90000.0 |
| 1 | 28.0 | Female | Master's | Data Analyst | 3.0 | 65000.0 |
| 2 | 45.0 | Male | PhD | Senior Manager | 15.0 | 150000.0 |
| 3 | 36.0 | Female | Bachelor's | Sales Associate | 7.0 | 60000.0 |
| 4 | 52.0 | Male | Master's | Director | 20.0 | 200000.0 |
| ... | ... | ... | ... | ... | ... | ... |
| 6699 | 49.0 | Female | PhD | Director of Marketing | 20.0 | 200000.0 |
| 6700 | 32.0 | Male | High School | Sales Associate | 3.0 | 50000.0 |
| 6701 | 30.0 | Female | Bachelor's Degree | Financial Manager | 4.0 | 55000.0 |
| 6702 | 46.0 | Male | Master's Degree | Marketing Manager | 14.0 | 140000.0 |


### DATA PREPARATION

In [2]:
#Checking Data Types
df.dtypes

Age                    float64
Gender                  object
Education Level         object
Job Title               object
Years of Experience    float64
Salary                 float64
dtype: object

In [3]:
#Check for missing values
print("Missing values before cleaning:")
print(df.isnull().sum())

Missing values before cleaning:
Age                    2
Gender                 2
Education Level        3
Job Title              2
Years of Experience    3
Salary                 5
dtype: int64


In [4]:
#Drop rows with missing Age, Salary, Gender, Education Level or Job Title
df = df.dropna(subset=['Age', 'Salary', 'Gender', 'Education Level', 'Job Title'])
df = df.reset_index(drop=True)
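
One detail worth keeping in mind for this step: `dropna` returns a new DataFrame and leaves the original untouched, so its result has to be assigned. The sketch below shows this on a small invented frame (`toy` and its values are assumptions for illustration only).

```python
import numpy as np
import pandas as pd

# Small illustrative frame with gaps in different columns.
toy = pd.DataFrame({
    "Age": [32.0, np.nan, 45.0],
    "Salary": [90000.0, 65000.0, np.nan],
    "Years of Experience": [5.0, 3.0, np.nan],
})

# The result of this call is discarded, so toy is unchanged.
toy.dropna()
rows_after_discarded_call = len(toy)

# Assign the result; subset limits the check to the listed columns.
cleaned = toy.dropna(subset=["Age", "Salary"]).reset_index(drop=True)

print(rows_after_discarded_call, len(cleaned))
```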

In [5]:
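#Convert float columns to integers
#Note: assigning through df.loc[:, col] writes into the existing float64 column,
#so the dtype stays float64 (see the check in the next cell)
df.loc[:, 'Age']=df['Age'].astype(int)
df.loc[:, 'Salary']=df['Salary'].astype(int)
df.loc[:, 'Years of Experience']=df['Years of Experience'].astype(int)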

In [6]:
# Verify data types after conversion
print("\nData types after conversion:")
print(df[['Age', 'Salary','Years of Experience']].dtypes)


Data types after conversion:
Age                    float64
Salary                 float64
Years of Experience    float64
dtype: object
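
The check above still reports `float64`. In recent pandas versions, an assignment through `df.loc[:, col]` tends to write the values into the existing column and keep its dtype, which would be consistent with this output. A minimal sketch of one way to change the dtype, assuming the columns contain no missing values, is to replace the columns via `astype` on the frame itself. The `toy` frame here is invented for illustration.

```python
import pandas as pd

# Illustrative frame with the same three numeric columns stored as floats.
toy = pd.DataFrame({
    "Age": [32.0, 28.0],
    "Years of Experience": [5.0, 3.0],
    "Salary": [90000.0, 65000.0],
})

int_cols = ["Age", "Years of Experience", "Salary"]

# astype returns a new frame; reassigning replaces the float64 columns.
toy = toy.astype({col: "int64" for col in int_cols})

print(toy.dtypes)
```

This only works once rows with missing values have been dropped, because `NaN` cannot be stored in a plain integer column.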


In [7]:
#Check for missing values
print("Missing values after cleaning:")
print(df.isnull().sum())

Missing values after cleaning:
Age                    0
Gender                 0
Education Level        0
Job Title              0
Years of Experience    0
Salary                 0
dtype: int64


In [8]:
#Displaying the updated Dataset
df

| | Age | Gender | Education Level | Job Title | Years of Experience | Salary |
|---|---|---|---|---|---|---|
| 0 | 32.0 | Male | Bachelor's | Software Engineer | 5.0 | 90000.0 |
| 1 | 28.0 | Female | Master's | Data Analyst | 3.0 | 65000.0 |
| 2 | 45.0 | Male | PhD | Senior Manager | 15.0 | 150000.0 |
| 3 | 36.0 | Female | Bachelor's | Sales Associate | 7.0 | 60000.0 |
| 4 | 52.0 | Male | Master's | Director | 20.0 | 200000.0 |
| ... | ... | ... | ... | ... | ... | ... |
| 6693 | 49.0 | Female | PhD | Director of Marketing | 20.0 | 200000.0 |
| 6694 | 32.0 | Male | High School | Sales Associate | 3.0 | 50000.0 |
| 6695 | 30.0 | Female | Bachelor's Degree | Financial Manager | 4.0 | 55000.0 |
| 6696 | 46.0 | Male | Master's Degree | Marketing Manager | 14.0 | 140000.0 |


In [9]:
#Checking for Duplicate Records
df.duplicated().sum()

np.int64(4911)

In [10]:
#Cleaning Duplicate Entries
df = df.drop_duplicates()
df = df.reset_index(drop=True)
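
Because so many rows are flagged, it can help to look at the duplicated groups before dropping them. The sketch below, on an invented `toy` frame, contrasts the default `duplicated()` count (repeats of an earlier row) with `keep=False`, which marks every member of a duplicated group.

```python
import pandas as pd

# Illustrative frame: two rows are exact repeats of earlier rows.
toy = pd.DataFrame({
    "Age": [32, 32, 28, 45, 45],
    "Job Title": ["Software Engineer", "Software Engineer", "Data Analyst",
                  "Senior Manager", "Senior Manager"],
    "Salary": [90000, 90000, 65000, 150000, 150000],
})

# Default keep='first': counts only the rows that repeat an earlier row.
n_repeats = int(toy.duplicated().sum())

# keep=False: selects every row that belongs to a duplicated group.
all_copies = toy[toy.duplicated(keep=False)]

# Keep the first occurrence of each row and renumber the index.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_repeats, len(all_copies), len(deduped))
```

Whether identical rows are true duplicates or distinct employees with the same attributes cannot be decided from the data alone, since the dataset has no employee identifier.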

In [11]:
#Displaying the updated Dataset
df

| | Age | Gender | Education Level | Job Title | Years of Experience | Salary |
|---|---|---|---|---|---|---|
| 0 | 32.0 | Male | Bachelor's | Software Engineer | 5.0 | 90000.0 |
| 1 | 28.0 | Female | Master's | Data Analyst | 3.0 | 65000.0 |
| 2 | 45.0 | Male | PhD | Senior Manager | 15.0 | 150000.0 |
| 3 | 36.0 | Female | Bachelor's | Sales Associate | 7.0 | 60000.0 |
| 4 | 52.0 | Male | Master's | Director | 20.0 | 200000.0 |
| ... | ... | ... | ... | ... | ... | ... |
| 1782 | 43.0 | Female | Master's Degree | Digital Marketing Manager | 15.0 | 150000.0 |
| 1783 | 27.0 | Male | High School | Sales Manager | 2.0 | 40000.0 |
| 1784 | 33.0 | Female | Bachelor's Degree | Director of Marketing | 8.0 | 80000.0 |
| 1785 | 37.0 | Male | Bachelor's Degree | Sales Director | 7.0 | 90000.0 |


In [12]:
#Identifying Unique Education Categories
df['Education Level'].unique()

array(["Bachelor's", "Master's", 'PhD', "Bachelor's Degree",
       "Master's Degree", 'High School', 'phD'], dtype=object)

In [13]:
#Standardizing Education Level Values
df['Education Level'] = df['Education Level'].replace({
    "Bachelor's": "Bachelor's Degree",
    "Master's": "Master's Degree", "phD":"PhD"}).str.strip()
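
A small variation on this step, sketched on an invented Series: stripping whitespace before the replacement lets values such as `" Master's "` match the mapping keys. The padded value below is hypothetical; the notebook's `unique()` output shows no such values in this dataset.

```python
import pandas as pd

# Illustrative values, including one hypothetical padded entry.
levels = pd.Series([
    "Bachelor's", " Master's ", "phD", "PhD", "High School", "Master's Degree",
])

# Map each short or mis-cased label to its canonical form.
canonical = {
    "Bachelor's": "Bachelor's Degree",
    "Master's": "Master's Degree",
    "phD": "PhD",
}

# Strip first so padded values match the keys; replace matches whole values.
cleaned_levels = levels.str.strip().replace(canonical)

print(sorted(cleaned_levels.unique()))
```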

In [25]:
#Identifying Unique Gender Categories
df['Gender'].unique()

array(['Male', 'Female'], dtype=object)

In [15]:
#Remove rows where Gender is 'Other'
df = df[df['Gender'] != 'Other']

In [16]:
# Sort the DataFrame by Salary in ascending order
sorted_df = df.sort_values('Salary')
# Display the sorted table
sorted_df

| | Age | Gender | Education Level | Job Title | Years of Experience | Salary |
|---|---|---|---|---|---|---|
| 248 | 29.0 | Male | Bachelor's Degree | Junior Business Operations Analyst | 1.0 | 350.0 |
| 1396 | 31.0 | Female | Bachelor's Degree | Junior HR Coordinator | 4.0 | 500.0 |
| 647 | 25.0 | Female | Bachelor's Degree | Front end Developer | 1.0 | 550.0 |
| 917 | 23.0 | Male | PhD | Software Engineer Manager | 1.0 | 579.0 |
| 1264 | 28.0 | Female | High School | Junior Sales Associate | 1.0 | 25000.0 |
| ... | ... | ... | ... | ... | ... | ... |
| 1321 | 49.0 | Male | Master's Degree | Marketing Manager | 23.0 | 228000.0 |
| 1303 | 51.0 | Male | PhD | Data Scientist | 24.0 | 240000.0 |
| 1434 | 45.0 | Male | Bachelor's Degree | Financial Manager | 21.0 | 250000.0 |
| 30 | 50.0 | Male | Bachelor's Degree | CEO | 25.0 | 250000.0 |


In [17]:
#Standardizing Salary Values: salaries below 1000 are treated as entered in hundreds and multiplied by 100
df.loc[:, 'Salary'] = np.where(df['Salary'] < 1000, df['Salary'] * 100, df['Salary'])
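
The rescaling can be made easier to audit by naming the cutoff and factor and counting the affected rows first. The sketch below uses an invented `toy` frame; `THRESHOLD` and `SCALE` are illustrative names that mirror the values 1000 and 100 used in this notebook, and treating low salaries as unit errors is itself an assumption about the data.

```python
import numpy as np
import pandas as pd

# Illustrative salaries: two implausibly low values and two ordinary ones.
toy = pd.DataFrame({"Salary": [350.0, 579.0, 25000.0, 250000.0]})

THRESHOLD = 1000  # assumed cutoff below which a salary is treated as mis-scaled
SCALE = 100       # assumed correction factor

# Flag the suspect rows and count them before changing anything.
suspect = toy["Salary"] < THRESHOLD
n_suspect = int(suspect.sum())

# Rescale only the flagged rows; all other salaries pass through unchanged.
toy["Salary"] = np.where(suspect, toy["Salary"] * SCALE, toy["Salary"])

print(n_suspect)
print(toy["Salary"].tolist())
```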

In [18]:
# Sort the DataFrame by Salary in ascending order
sorted_df = df.sort_values('Salary')
# Display the updated sorted table
sorted_df

| | Age | Gender | Education Level | Job Title | Years of Experience | Salary |
|---|---|---|---|---|---|---|
| 1264 | 28.0 | Female | High School | Junior Sales Associate | 1.0 | 25000.0 |
| 1473 | 25.0 | Female | High School | Sales Associate | 0.0 | 25000.0 |
| 1242 | 29.0 | Female | High School | Junior Sales Associate | 1.0 | 25000.0 |
| 1324 | 30.0 | Female | High School | Junior Sales Associate | 1.0 | 25000.0 |
| 1251 | 23.0 | Male | High School | Junior Sales Associate | 1.0 | 25000.0 |
| ... | ... | ... | ... | ... | ... | ... |
| 1321 | 49.0 | Male | Master's Degree | Marketing Manager | 23.0 | 228000.0 |
| 1303 | 51.0 | Male | PhD | Data Scientist | 24.0 | 240000.0 |
| 1434 | 45.0 | Male | Bachelor's Degree | Financial Manager | 21.0 | 250000.0 |
| 30 | 50.0 | Male | Bachelor's Degree | CEO | 25.0 | 250000.0 |


In [19]:
df.reset_index(drop=True, inplace=True)

In [20]:
#Summary of the cleaned dataset: structure, then descriptive statistics
df.info()
df.describe(include='all')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1780 entries, 0 to 1779
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Age                  1780 non-null   float64
 1   Gender               1780 non-null   object 
 2   Education Level      1780 non-null   object 
 3   Job Title            1780 non-null   object 
 4   Years of Experience  1780 non-null   float64
 5   Salary               1780 non-null   float64
dtypes: float64(3), object(3)
memory usage: 83.6+ KB


| | Age | Gender | Education Level | Job Title | Years of Experience | Salary |
|---|---|---|---|---|---|---|
| count | 1780.0 | 1780 | 1780 | 1780 | 1780.0 | 1780.0 |
| unique | | 2 | 4 | 191 | | |
| top | | Male | Bachelor's Degree | Software Engineer Manager | | |
| freq | | 966 | 768 | 127 | | |
| mean | 35.122472 | | | | 9.124157 | 113245.042135 |
| std | 8.184608 | | | | 6.804983 | 51435.936128 |
| min | 21.0 | | | | 0.0 | 25000.0 |
| 25% | 29.0 | | | | 3.0 | 70000.0 |
| 50% | 33.0 | | | | 8.0 | 110000.0 |
| 75% | 41.0 | | | | 13.0 | 160000.0 |


### SAVING THE PREPARED DATASET

In [21]:
#Exporting Cleaned Dataset
df.to_csv('clean_data.csv', index=False)

In [22]:
#Initializing Visualization Environment
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

data = pd.read_csv('clean_data.csv')

In [23]:
#Dataset Structure Overview
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1780 entries, 0 to 1779
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Age                  1780 non-null   float64
 1   Gender               1780 non-null   object 
 2   Education Level      1780 non-null   object 
 3   Job Title            1780 non-null   object 
 4   Years of Experience  1780 non-null   float64
 5   Salary               1780 non-null   float64
dtypes: float64(3), object(3)
memory usage: 83.6+ KB


### STREAMLIT PYTHON SOURCE CODE