# Exercise

The following exercises cover aggregation and grouping in pandas.

In this exercise, you will work with a fictional dataset of employee records. The dataset is built below from a Python dictionary and consists of the following columns:

1. Employee_ID: Unique identifier for each employee (Integer).
2. Department: Department where the employee works (Categorical - String).
3. Gender: Gender of the employee (Categorical - String).
4. Age: Age of the employee (Integer).
5. Years_of_Experience: Number of years of experience of the employee (Integer).
6. Performance_Rating: Performance rating of the employee (Float, scale of 1 to 5).

Your task is to use pandas to perform various data analysis tasks and derive insights from the dataset.

In [6]:
import pandas as pd

# Create a fictional dataset
data = {
    'Employee_ID': [101, 102, 103, 104, 105, 106, 107, 108, 109, 110],
    'Department': ['HR', 'IT', 'Marketing', 'Finance', 'HR', 'IT', 'Marketing', 'Finance', 'HR', 'IT'],
    'Gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Female', 'Male', 'Female', 'Female', 'Female'],
    'Age': [35, 28, 42, 39, 45, 32, 37, 41, 29, 36],
    'Years_of_Experience': [10, 5, 15, 12, 20, 8, 13, 18, 6, 11],
    'Performance_Rating': [4.5, 3.8, 4.9, 4.2, 4.7, 4.0, 4.8, 4.3, 4.9, 4.6]
}

# Convert the dictionary to a DataFrame
df = pd.DataFrame(data)
df


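| | Employee_ID | Department | Gender | Age | Years_of_Experience | Performance_Rating |
|---|---|---|---|---|---|---|
| 0 | 101 | HR | Male | 35 | 10 | 4.5 |
| 1 | 102 | IT | Female | 28 | 5 | 3.8 |
| 2 | 103 | Marketing | Male | 42 | 15 | 4.9 |
| 3 | 104 | Finance | Female | 39 | 12 | 4.2 |
| 4 | 105 | HR | Male | 45 | 20 | 4.7 |
| 5 | 106 | IT | Female | 32 | 8 | 4.0 |
| 6 | 107 | Marketing | Male | 37 | 13 | 4.8 |
| 7 | 108 | Finance | Female | 41 | 18 | 4.3 |
| 8 | 109 | HR | Female | 29 | 6 | 4.9 |
| 9 | 110 | IT | Female | 36 | 11 | 4.6 |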


### 1. Calculate Average Performance Rating by Department and Gender:
- Group the data by Department and Gender.
- Calculate the average Performance_Rating for each group.

In [7]:
# Calculate average performance rating by department and gender
avg_rating_dept_gender = df.groupby(['Department', 'Gender'])['Performance_Rating'].mean()

# Display the result
print("Average performance rating by department and gender:")
avg_rating_dept_gender


Average performance rating by department and gender:


Department  Gender
Finance     Female    4.250000
HR          Female    4.900000
            Male      4.600000
IT          Female    4.133333
Marketing   Male      4.850000
Name: Performance_Rating, dtype: float64
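
Grouping by two keys yields a Series with a two-level index, which can be hard to scan. As a minimal sketch (the variable names `avg_rating` and `avg_rating_wide` are illustrative, not part of the exercise), `unstack` can move the `Gender` level into columns so that each department occupies one row:

```python
import pandas as pd

# Same fictional data as above, reduced to the columns this example needs
df = pd.DataFrame({
    'Department': ['HR', 'IT', 'Marketing', 'Finance', 'HR', 'IT', 'Marketing', 'Finance', 'HR', 'IT'],
    'Gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Female', 'Male', 'Female', 'Female', 'Female'],
    'Performance_Rating': [4.5, 3.8, 4.9, 4.2, 4.7, 4.0, 4.8, 4.3, 4.9, 4.6],
})

# Grouping by two keys returns a Series indexed by (Department, Gender)
avg_rating = df.groupby(['Department', 'Gender'])['Performance_Rating'].mean()

# Move the Gender level into columns; combinations with no employees become NaN
avg_rating_wide = avg_rating.unstack('Gender')
print(avg_rating_wide.round(2))
```

Department and gender combinations that do not occur in the data, such as male employees in Finance, appear as `NaN` in the wide table instead of being silently absent.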

### 2. Identify Top Performer in Each Department:
- For each department, identify the employee with the highest Performance_Rating.
- Display the employee's Employee_ID, Department, and Performance_Rating.

In [9]:


# Identify top performer in each department
top_performer_dept = df.loc[df.groupby('Department')['Performance_Rating'].idxmax()]

# Display the result
print("\nTop performer in each department:")
top_performer_dept[['Employee_ID', 'Department', 'Performance_Rating']]




Top performer in each department:


| | Employee_ID | Department | Performance_Rating |
|---|---|---|---|
| 7 | 108 | Finance | 4.3 |
| 8 | 109 | HR | 4.9 |
| 9 | 110 | IT | 4.6 |
| 2 | 103 | Marketing | 4.9 |
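
Note that `idxmax` returns only the first row that reaches the maximum, so if two employees in a department shared the top rating, one of them would be left out. The sketch below shows one possible alternative that keeps ties, using `transform('max')`; the names `dept_max` and `top_performers_all` are illustrative. On this dataset there are no ties, so it selects the same four employees.

```python
import pandas as pd

# Same fictional data as above, reduced to the columns this example needs
df = pd.DataFrame({
    'Employee_ID': [101, 102, 103, 104, 105, 106, 107, 108, 109, 110],
    'Department': ['HR', 'IT', 'Marketing', 'Finance', 'HR', 'IT', 'Marketing', 'Finance', 'HR', 'IT'],
    'Performance_Rating': [4.5, 3.8, 4.9, 4.2, 4.7, 4.0, 4.8, 4.3, 4.9, 4.6],
})

# Highest rating within each department, broadcast back to every row
dept_max = df.groupby('Department')['Performance_Rating'].transform('max')

# Keep every employee whose rating equals the department maximum (ties included)
top_performers_all = df[df['Performance_Rating'] == dept_max].sort_values('Department')
print(top_performers_all[['Employee_ID', 'Department', 'Performance_Rating']])
```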


### 3. Calculate Age Range Statistics by Department:
- Group the data by Department.
- Calculate the minimum, maximum, and average Age for each department.

In [18]:
import numpy as np

# Calculate age range statistics by department
age_stats_dept = df.groupby('Department')['Age'].agg(['min', 'max', 'mean'])

# Cast the mean to an integer; astype truncates the fractional part (39.5 becomes 39)
age_stats_dept = age_stats_dept.astype({'mean': np.int32})

# Display the result
print("\nAge range statistics by department:")
age_stats_dept


Age range statistics by department:


| Department | min | max | mean |
|---|---|---|---|
| Finance | 39 | 41 | 40 |
| HR | 29 | 45 | 36 |
| IT | 28 | 36 | 32 |
| Marketing | 37 | 42 | 39 |
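
Casting the mean to an integer discards the fractional part, so the Marketing average of 39.5 is reported as 39. As a hedged alternative, the sketch below uses named aggregation to give the result columns descriptive names and rounds the mean to one decimal place instead of truncating it. The column names `min_age`, `max_age`, and `mean_age` are chosen here for illustration.

```python
import pandas as pd

# Same fictional data as above, reduced to the columns this example needs
df = pd.DataFrame({
    'Department': ['HR', 'IT', 'Marketing', 'Finance', 'HR', 'IT', 'Marketing', 'Finance', 'HR', 'IT'],
    'Age': [35, 28, 42, 39, 45, 32, 37, 41, 29, 36],
})

# Named aggregation: each keyword becomes an output column name
age_stats = df.groupby('Department')['Age'].agg(
    min_age='min',
    max_age='max',
    mean_age='mean',
)

# Round instead of truncating so that fractional averages stay visible
age_stats['mean_age'] = age_stats['mean_age'].round(1)
print(age_stats)
```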


### 4. Identify Employees with Below Average Performance Rating:
- Calculate the overall average Performance_Rating across all employees.
- Identify employees whose Performance_Rating is below the overall average.
- Display the Employee_ID, Performance_Rating, and Department of these employees.

In [20]:
# Calculate overall average performance rating
overall_avg_rating = df['Performance_Rating'].mean()

# Identify employees with below average performance rating
below_avg_rating_employees = df[df['Performance_Rating'] < overall_avg_rating]

# Display the result
print("\nEmployees with below average performance rating:")
below_avg_rating_employees[['Employee_ID', 'Performance_Rating', 'Department']]



Employees with below average performance rating:


| | Employee_ID | Performance_Rating | Department |
|---|---|---|---|
| 1 | 102 | 3.8 | IT |
| 3 | 104 | 4.2 | Finance |
| 5 | 106 | 4.0 | IT |
| 7 | 108 | 4.3 | Finance |


### 5. Calculate Age Group Distribution by Gender:
- Create age groups for employees (e.g., 20-29, 30-39, 40-49).
- Group the data by Gender and age groups.
- Calculate the count of employees in each gender-age group.

In [25]:
# Define age groups
bins = [20, 30, 40, 50, 60]
labels = ['20-29', '30-39', '40-49', '50-59']
df['Age_Group'] = pd.cut(df['Age'], bins=bins, labels=labels, right=False)

# Calculate age group distribution by gender
# observed=True keeps only the gender/age-group combinations that actually occur in the data
age_group_distribution = df.groupby(['Gender', 'Age_Group'], observed=True).size()

# Display the result
print("\nAge group distribution by gender:")
age_group_distribution


Age group distribution by gender:


Gender  Age_Group
Female  20-29        2
        30-39        3
        40-49        1
Male    30-39        2
        40-49        2
dtype: int64
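
The `right=False` argument matters for ages that sit exactly on a bin edge. The sketch below, using a small hypothetical Series of boundary ages, compares the left-closed bins used above with the default right-closed bins of `pd.cut`:

```python
import pandas as pd

bins = [20, 30, 40, 50, 60]
labels = ['20-29', '30-39', '40-49', '50-59']

# Hypothetical ages chosen to sit on or next to the bin edges
boundary_ages = pd.Series([29, 30, 39, 40])

# right=False gives intervals [20, 30), [30, 40), ... which match the labels
left_closed = pd.cut(boundary_ages, bins=bins, labels=labels, right=False)

# The default right=True gives intervals (20, 30], (30, 40], ...
right_closed = pd.cut(boundary_ages, bins=bins, labels=labels)

comparison = pd.DataFrame({
    'Age': boundary_ages,
    'right=False': left_closed.astype(str),
    'right=True': right_closed.astype(str),
})
print(comparison)
```

With the default setting, ages 30 and 40 would be assigned to the lower group and labelled "20-29" and "30-39", which contradicts the labels. Left-closed bins avoid that mismatch.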