# About Dataset
A salaries dataset generally provides information about an organization's employees and their compensation. It typically includes each employee's salary, job title, and department, and possibly additional information such as level of experience, education, and employment history within the organization.

# Features
- 'Id'
- 'EmployeeName'
- 'JobTitle'
- 'BasePay'
- 'OvertimePay'
- 'OtherPay'
- 'Benefits'
- 'TotalPay' -> salary
- 'TotalPayBenefits'
- 'Year'
- 'Notes'
- 'Agency'
- 'Status'

# Tasks

1. **Basic Data Exploration**: Identify the number of rows and columns in the dataset, determine the data types of each column, and check for missing values in each column.

2. **Descriptive Statistics**: Calculate basic statistics (mean, median, mode, minimum, and maximum salary), determine the range of salaries, and find the standard deviation.

3. **Data Cleaning**: Handle missing data with a suitable method and explain why you chose it.

4. **Basic Data Visualization**: Create histograms or bar charts to visualize the distribution of salaries, and use pie charts to represent the proportion of employees in different departments.

5. **Grouped Analysis**: Group the data by one or more columns and calculate summary statistics for each group, and compare the average salaries across different groups.

6. **Simple Correlation Analysis**: Identify any correlation between salary and another numerical column, and plot a scatter plot to visualize the relationship.

7. **Summary of Insights**: Write a brief report summarizing the findings and insights from the analysis.

# Very Important Note
There is no fixed or singular solution for this assignment, so if anything is not clear, please do what you understand and provide an explanation.

In [None]:
import pandas as pd
import numpy as np

from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
import matplotlib.pyplot as plt


df = pd.read_csv('Salaries.csv')
df.head()


# 1.   **Basic Data Exploration**
> **Summary** of what I will cover:


*   **Exploring** the column labels of the DataFrame
*   **Exploring** the dimensions of the DataFrame
*   **Exploring** the data types of each column in the DataFrame
*   **Calculating** the number of missing values (NaNs) in each column of the DataFrame and understanding the overall distribution of missing data


##### ***Note*** for the code snippet of the missing values:
```
# The output values for the first run of missing values code snippet.
Id                       0
EmployeeName             0
JobTitle                 0
BasePay                609
OvertimePay              4
OtherPay                 4
Benefits             36163
TotalPay                 0
TotalPayBenefits         0
Year                     0
Notes               148654
Agency                   0
Status              148654
dtype: int64

```














In [None]:
df.columns

In [None]:
df.shape

In [None]:
df.dtypes

In [None]:
missing_values = df.isnull().sum()
missing_values
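
Raw counts are easier to compare once they are expressed as a share of all rows. The sketch below shows one way to do that; it uses a small made-up frame (`toy`, with hypothetical values) rather than `Salaries.csv`, so it only illustrates the idea.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for Salaries.csv (values are made up).
toy = pd.DataFrame({
    "BasePay": [100.0, np.nan, 300.0, 400.0],
    "Benefits": [np.nan, np.nan, 30.0, 40.0],
    "Notes": [np.nan, np.nan, np.nan, np.nan],
    "TotalPay": [110.0, 220.0, 330.0, 440.0],
})

# isnull().mean() gives the fraction of NaNs per column; scale it to percent.
missing_pct = (toy.isnull().mean() * 100).sort_values(ascending=False)
print(missing_pct)

# Columns with no data at all are candidates for dropping.
fully_empty = missing_pct[missing_pct == 100].index.tolist()
print(fully_empty)
```

Applied to the real `df`, the same two lines would presumably flag columns such as 'Notes' and 'Status', which the counts above show as missing in 148654 rows each.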

# 2.   **Descriptive Statistics**
> **Summary** of what I will cover:

*   **Calculate** the ***mean*** salary in the dataset; it represents the "**average**" salary.
*   **Calculate** the ***median*** salary in the dataset; it represents the "**middle**" value when all salaries are arranged in ascending order.
*   **Calculate** the ***mode*** salary in the dataset; it represents the salary that appears most often.
*   **Finding** the ***minimum*** salary in the dataset; it represents the "**lowest**" salary value present.
*   **Finding** the ***maximum*** salary in the dataset; it represents the "**highest**" salary value present.
*   **Calculate** the range of salaries in the dataset; it represents the difference between the **maximum** and **minimum** salaries.
*   **Calculate** the standard deviation of salaries in the dataset; it measures how spread out the salary values are around the mean.











In [None]:
sal_mean = df.TotalPay.mean()
sal_mean

In [None]:
sal_median = df.TotalPay.median()
sal_median

In [None]:
sal_mode = df.TotalPay.mode()
sal_mode

In [None]:
sal_min = df.TotalPay.min()
sal_min

In [None]:
sal_max = df.TotalPay.max()
sal_max

In [None]:
sal_range = sal_max - sal_min
sal_range

In [None]:
sal_std = df.TotalPay.std()
sal_std
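
The same statistics can also be collected in one place with `Series.agg`. This is a minimal sketch on a made-up series (`pay`, hypothetical values), not on the real `TotalPay` column.

```python
import pandas as pd

# Hypothetical toy salaries (made-up values), standing in for df["TotalPay"].
pay = pd.Series([10.0, 20.0, 20.0, 30.0, 120.0], name="TotalPay")

# One call returns a Series indexed by the statistic names.
summary = pay.agg(["mean", "median", "min", "max", "std"])
summary["range"] = summary["max"] - summary["min"]

# mode() returns a Series because several values can tie for most frequent.
modes = pay.mode()

print(summary)
print(modes.tolist())
```

Note that `std()` in pandas is the sample standard deviation (`ddof=1`) by default, and that `mode()` can return more than one value when there is no single most frequent salary.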

# 3.   **Data Cleaning**
> **Summary** of what I will cover:

*   Missing Value Imputation
*   Replacing Original Columns
*   Dropping Unnecessary Columns

#### In summary, during data cleaning the 'Notes' and 'Status' columns were removed because they contained no data and were irrelevant for analysis. The partially missing columns were imputed with the median so that no rows are lost, which prepares the DataFrame for further analysis.




In [49]:
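# Numeric columns that have some missing values.
missing_columns = ['Benefits','BasePay','OvertimePay','OtherPay']

# Fill the gaps in each listed column with that column's median.
imputer = ColumnTransformer(
    transformers=[
        ('numeric', SimpleImputer(strategy='median'), missing_columns),
    ])

# The transformer returns only the listed columns, in the listed order;
# keep the original labels and index so the result aligns with df.
imputed_columns = pd.DataFrame(imputer.fit_transform(df), columns=missing_columns, index=df.index)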

In [50]:
df[['Benefits','BasePay','OvertimePay','OtherPay']] = imputed_columns

In [None]:
# These columns are empty and provide no relevant information, so they are dropped.
df.drop(columns=['Notes', 'Status'], inplace=True)
df
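
One general argument for median imputation on pay data is that the median is less sensitive to a few extreme values than the mean. The sketch below illustrates this on a made-up column (`toy`, hypothetical values) and uses `fillna` as a pandas-only alternative to `SimpleImputer`; it is an illustration, not a claim about the actual distribution in `Salaries.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column (made-up values) with one extreme value and one gap.
toy = pd.DataFrame({"Benefits": [10.0, 12.0, 11.0, np.nan, 500.0]})

mean_fill = toy["Benefits"].mean()      # pulled upward by the 500.0 outlier
median_fill = toy["Benefits"].median()  # stays near the typical values

# Pandas-only alternative to SimpleImputer(strategy="median").
toy["Benefits_imputed"] = toy["Benefits"].fillna(median_fill)

print(mean_fill, median_fill)
print(toy)
```

Filling with the mean here would insert a value far above every typical entry, while the median stays close to them.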

# 4.   **Basic Data Visualization**
> **Summary** of what I will cover:

*   Graphical Distribution of Total Pay

#### The histogram below shows us how many individuals fall into different ranges of total pay.

In [None]:
plt.hist(df['TotalPay'], bins=100)
plt.xlabel('Total Pay')
plt.ylabel('Frequency')
plt.title('Distribution of Total Pay')
plt.show()

# 5.   **Grouped Analysis**
> **Summary** of what I will cover:

#### Analyzing the relationship between ***Job titles*** and ***Total pay*** in the dataset

  > Conclusion:

-----> The highest average total pay goes to "GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY".

-----> The lowest average total pay goes to "Drug Court Coordinator", "Public Safety Comm Tech", "IS Technician Assistant", etc.

In [None]:
TitleVsTotalPay = df.groupby('JobTitle')['TotalPay'].mean().sort_values(ascending=False)
TitleVsTotalPay
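
A group mean can be misleading when a job title has only one or two employees, so it may help to report the group size alongside it. The following sketch uses made-up rows (`toy`, hypothetical titles and values), and the minimum group size of 2 is an arbitrary choice for illustration.

```python
import pandas as pd

# Hypothetical toy rows (made-up titles and values).
toy = pd.DataFrame({
    "JobTitle": ["Clerk", "Clerk", "Clerk", "Manager", "Manager", "Director"],
    "TotalPay": [30.0, 40.0, 50.0, 90.0, 110.0, 200.0],
})

# Several summary statistics per job title, sorted by the mean.
by_title = (
    toy.groupby("JobTitle")["TotalPay"]
    .agg(["count", "mean", "median"])
    .sort_values("mean", ascending=False)
)
print(by_title)

# Optionally keep only titles with at least 2 employees (arbitrary threshold).
reliable = by_title[by_title["count"] >= 2]
print(reliable)
```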

# 6. **Simple Correlation Analysis**
> **Summary** of what I will cover:
#### Identifying which numerical columns in the data have the strongest relationships (positive or negative) with the "TotalPay" column



In [None]:
# Drop the categorical columns
num_df = df.drop(["EmployeeName","JobTitle","Agency"], axis=1)
# Calculate the correlation of each remaining column with "TotalPay"
correlations = num_df.corr()["TotalPay"].drop("TotalPay").sort_values(ascending=False)
correlations
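
Instead of listing the categorical columns by hand, the numeric columns can be selected by dtype. This is a sketch on a made-up frame (`toy`, hypothetical values); it assumes the numeric columns are actually stored with a numeric dtype, which should be checked with `df.dtypes` first.

```python
import pandas as pd

# Hypothetical toy frame (made-up values) with one text column.
toy = pd.DataFrame({
    "EmployeeName": ["A", "B", "C", "D"],
    "BasePay": [10.0, 20.0, 30.0, 40.0],
    "Year": [2011, 2012, 2011, 2012],
    "TotalPay": [12.0, 22.0, 32.0, 42.0],
})

# Keep only numeric columns, whatever their names are.
num_toy = toy.select_dtypes(include="number")

# Pearson correlation of every numeric column with TotalPay.
corr = num_toy.corr()["TotalPay"].drop("TotalPay").sort_values(ascending=False)
print(corr)
```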

In [None]:
plt.scatter(df['TotalPay'], df['TotalPayBenefits'])
plt.xlabel('TotalPay')
plt.ylabel('TotalPayBenefits')
plt.title('TotalPay vs TotalPayBenefits')
plt.show()

In [None]:
plt.scatter(df['TotalPay'], df['Benefits'])
plt.xlabel('TotalPay')
plt.ylabel('Benefits')
plt.title('TotalPay vs Benefits')
plt.show()

# 7.   **Summary of Insights**
> ***The findings and insights from the analysis***

*   **The presence** of a **negative minimum** value likely indicates **errors or exceptional cases** that need further investigation (it is just one value, with **ID: 148654**).

*   **The wide range** of over **568,000 dollars** and the high standard deviation of over **50,000 dollars** confirm the significant **variability** in **salaries**, with some individuals earning considerably more than others.


*   **The lack of** a **clear mode value** makes it harder to pinpoint the most common salary.

*   "**Benefits**" (0.977) and "**BasePay**" (0.951) have the **strongest** positive **correlations** with "**TotalPay**".

*   "**Year**" (0.032) has a very **weak** **positive** **correlation** with "**TotalPay**".

*   The dataset has **no department column**, and two columns, 'Notes' and 'Status', were empty and therefore unusable.
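
The negative minimum mentioned in the first point can be inspected with a simple boolean filter. The sketch below uses a made-up frame (`toy`, hypothetical Ids and values); whether such rows should be dropped or corrected is a judgement call that depends on what the negative amount means.

```python
import pandas as pd

# Hypothetical toy frame (made-up values) with one negative total pay.
toy = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "TotalPay": [50000.0, 0.0, -600.0, 72000.0],
})

# Rows whose total pay is below zero.
negative_rows = toy[toy["TotalPay"] < 0]
print(negative_rows)

# One possible treatment: exclude them before computing statistics.
cleaned = toy[toy["TotalPay"] >= 0]
print(cleaned["TotalPay"].min())
```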

