<a href="https://www.kaggle.com/code/isissantoscosta/365ds-practice-exams-sql?scriptVersionId=241371024" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

<a id='top'></a>
Created on May 21, 2025 • Updated on May 23, 2025 • by [Ísis Santos Costa](https://www.linkedin.com/in/isis-santos-costa/)

<hr>

**How to expand and assess data querying skills?**

This notebook uses pandas to solve questions from the [**365 Data Science • Practice Exams: SQL**](https://learn.365datascience.com/exams/?tab=practice) curriculum, a **free resource** designed to help test and elevate data science skills. Here, you'll find a set of data querying and analysis exercises in a People Analytics context. The steps of a business analysis process are applied to get insights from the provided data, with detailed answers to questions from 365DS SQL Practice Exams 1, 2, and 3. The exams emphasize the usage and value of **SQL procedures and functions**; here they are tackled with a pythonic approach.

From the [365 Data Science Practice Exams](https://365datascience.com/resources-center/practice-exams/) webpage:

> Discover a plethora of online exams that will test your current knowledge and ability to solve data science problems.  
> Evaluate your skills online **at no cost** with SQL mock tests, Excel and NumPy exam questions, and more.

<br>

The data for the 365 Data Science SQL Practice Exams is available as a Kaggle dataset: [🎓 365DS Practice Exams • People Analytics Dataset](https://www.kaggle.com/datasets/isissantoscosta/365ds-practice-exams-people-analytics-dataset/).

<center>
    <img src='https://raw.githubusercontent.com/isis-santos-costa/isis-santos-costa/refs/heads/main/img/pandas.png' alt='databases' width='350'>
</center>

<a id='business_questions'></a>

# <div style="background-color:#03002e; padding:18px; border-radius:8px; color:white; text-align:center; font-weight:regular; overflow:hidden"><strong>Step 1 • Business questions</strong></div>

In business, data is abundant but insights are precious. What often separates the two is one crucial factor: **starting with the right business questions**. Doing so transforms data analysis from a technical exercise into a **strategic driver**, focusing efforts on what truly matters. It ensures that insights are actionable and directly aligned with **strategic goals**, such as increasing customer satisfaction, optimizing costs, and boosting revenue.

For that reason, given the context defined by the exercise dataset - **People Analytics** - a first step to make the analysis interesting is to compile a set of strategic questions on that context that have the potential to drive impactful positive change **for the company's advancement**.

With that in mind, here is a set of candidate business questions to drive the analysis, categorized by People Analytics theme:

**A. Workforce Demographics & Structure:**

1. What is the current distribution of employees by department, title, and gender?
2. How has the workforce size changed over time, by department and overall?
3. What is the average tenure of employees across the company and within specific departments or titles?
4. What is the gender diversity breakdown within each department and across different job titles?

**B. Compensation & Benefits:**

5.  What are the average, median, and range of salaries by department, title, and tenure?
6.  How have salary trends evolved over time for different roles or employee groups?
7.  Are there significant salary differences based on gender for similar roles/tenure? (Identifying potential pay equity issues)
8.  What is the average salary increase rate per year, and how does it vary by department or title?

**C. Talent Mobility & Turnover:**

9.  What is the overall employee turnover rate, and how does it vary by department, manager, or title over time?
10. Which departments or managers experience the highest/lowest employee retention rates?
11. What is the average duration employees stay in a particular title before promotion or departure?
12. What are the common career paths or transitions within the company (e.g., from 'Engineer' to 'Senior Engineer', or internal department transfers)?
13. How frequently do employees change departments or titles?

**D. Management & Leadership:**

14. Which managers have the longest average tenure with their teams?
15. How many employees report to each manager over time? (Understanding span of control).
16. What is the average salary of managers compared to non-managers?

<a id='data_collection'></a>

# <div style="background-color:#03002e; padding:18px; border-radius:8px; color:white; text-align:center; font-weight:regular; overflow:hidden"><strong>Step 2 • Data collection</strong></div>

This dataset has a rich lineage, originating from academic research and evolving through various formats to its current relational structure:

**Original authors**: 
The foundational dataset was authored by Prof. Dr. Fusheng Wang [🔗](https://www3.cs.stonybrook.edu/~fuswang/) (then a PhD student at the University of California, Los Angeles - UCLA) and his advisor, Prof. Dr. Carlo Zaniolo [🔗](https://web.cs.ucla.edu/~zaniolo/) (UCLA). This work is primarily described in their paper: **Wang, F., & Zaniolo, C. (2004). *Publishing and Querying the Histories of Archived Relational Databases in XML*. [DOI:10.1109/WISE.2003.1254473](http://dx.doi.org/10.1109/WISE.2003.1254473).**

**Relational conversion**: It was originally distributed as an `.xml` file. Giuseppe Maxia (known as @datacharmer on GitHub[🔗](https://github.com/datacharmer/) and LinkedIn[🔗](https://www.linkedin.com/in/datacharmer/), as well as here on Kaggle) converted it into its relational form and subsequently distributed it as a `.sql` file, making it accessible for relational database use.

**Kaggle upload**: This `.sql` version was then uploaded to Kaggle as the « [Employees Dataset](https://www.kaggle.com/datasets/huzaifamirza/employees-dataset) » by Mirza Huzaifa[🔗](https://www.linkedin.com/in/mirza-huzaifa-ali-baig-601743223/) on February 5th, 2023.


**Kaggle dataset**: On May 20th, 2025, for convenient access and ease of use in analytical tools, the `.sql` file was [converted](https://www.kaggle.com/code/isissantoscosta/create-database-from-sql-file-sqlite/) into a single `.db` (SQLite) database file as well as a set of individual `.csv` files by Ísis Santos Costa[🔗](https://www.linkedin.com/in/isis-santos-costa/), and uploaded to this Kaggle Dataset: [🎓 365DS Practice Exams • People Analytics Dataset](https://www.kaggle.com/datasets/isissantoscosta/365ds-practice-exams-people-analytics-dataset).
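
Since the dataset also ships as a single SQLite file, the tables could be loaded from it instead of from the `.csv` files. The following is a minimal sketch under stated assumptions: `load_tables` is a hypothetical helper (not part of this notebook), and an in-memory database with made-up rows stands in for the real file.

```python
import sqlite3
import pandas as pd

def load_tables(conn: sqlite3.Connection) -> dict:
    """Read every table listed in sqlite_master into a dict of DataFrames."""
    names = pd.read_sql_query(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name", conn
    )["name"]
    return {name: pd.read_sql_query(f'SELECT * FROM "{name}"', conn) for name in names}

# Demo: an in-memory database with toy rows stands in for the real `.db` file
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE departments (dept_no TEXT, dept_name TEXT)")
conn.executemany(
    "INSERT INTO departments VALUES (?, ?)",
    [("d001", "Marketing"), ("d003", "Human Resources")],
)
demo_tables = load_tables(conn)
conn.close()

print(demo_tables["departments"])
```

On Kaggle, the same helper would presumably be pointed at the dataset's `employees.db` file via `sqlite3.connect(...)`. Note that dates read this way arrive as plain strings and still need casting.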

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/365ds-practice-exams-people-analytics-dataset/dept_emp.csv
/kaggle/input/365ds-practice-exams-people-analytics-dataset/dept_manager.csv
/kaggle/input/365ds-practice-exams-people-analytics-dataset/employees.csv
/kaggle/input/365ds-practice-exams-people-analytics-dataset/titles.csv
/kaggle/input/365ds-practice-exams-people-analytics-dataset/salaries.csv
/kaggle/input/365ds-practice-exams-people-analytics-dataset/employees.db
/kaggle/input/365ds-practice-exams-people-analytics-dataset/departments.csv


<a id='data_prep'></a>

# <div style="background-color:#03002e; padding:18px; border-radius:8px; color:white; text-align:center; font-weight:regular; overflow:hidden"><strong>Step 3 • Data prep</strong></div>



## 🔎 Inspecting the dataset

In [2]:
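# Build a dfs dictionary with all tables in the dataset
# (explicit path, rather than reusing `dirname` / `filenames` left over from the os.walk loop above)
data_dir = '/kaggle/input/365ds-practice-exams-people-analytics-dataset'
csv_filenames = [filename for filename in os.listdir(data_dir) if filename.endswith('.csv')]

dfs = {}
for file_name in csv_filenames:
    table_name = file_name.replace('.csv', '')
    dfs[table_name] = pd.read_csv(os.path.join(data_dir, file_name))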

# Inspect the dataset
print('🔎 INSPECTING THE DATASET ##############################################################')

print('\n\n TABLES INFO ++++++++++++++++++++++++++++++++++++++')
for (table_name, df) in dfs.items():
    print('\n', table_name.upper())
    df.info()

print('\n\n COLUMN NAMES +++++++++++++++++++++++++++++++++++++')
for (table_name, df) in dfs.items():
    print('\n', table_name.upper())
    print(list(df.columns))

🔎 INSPECTING THE DATASET ##############################################################


 TABLES INFO ++++++++++++++++++++++++++++++++++++++

 DEPT_EMP
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 331603 entries, 0 to 331602
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype 
---  ------     --------------   ----- 
 0   emp_no     331603 non-null  int64 
 1   dept_no    331603 non-null  object
 2   from_date  331603 non-null  object
 3   to_date    331603 non-null  object
dtypes: int64(1), object(3)
memory usage: 10.1+ MB

 DEPT_MANAGER
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24 entries, 0 to 23
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   emp_no     24 non-null     int64 
 1   dept_no    24 non-null     object
 2   from_date  24 non-null     object
 3   to_date    24 non-null     object
dtypes: int64(1), object(3)
memory usage: 900.0+ bytes

 EMPLOYEES
<class 'pandas.core.frame.Da

## 🏷️ Applying appropriate data types

In [3]:
# The INFO seen above reveals that the following fields need to be cast to the appropriate type:
# Category: ['gender']
# Datetime: ['from_date', 'to_date', 'birth_date', 'hire_date']

# Classify types of columns to recast
category_cols = ['gender']
datetime_cols = ['from_date', 'to_date', 'birth_date', 'hire_date']

# Cast to appropriate types
print('\n\n\n 🏷️ APPLYING APPROPRIATE DATA TYPES ###################################################### ')
for (table_name, df) in dfs.items():
    
    column_names = df.columns
    for column_name in column_names:
        
        if column_name in category_cols:
            df[column_name] = df[column_name].astype('category')
            print(df[column_name].dtype, table_name, column_name)
            
        if column_name in datetime_cols:
            df[column_name] = df[column_name].replace('9999-01-01', pd.NaT)
            df[column_name] = df[column_name].astype('datetime64[ns]')
            print(df[column_name].dtype, table_name, column_name)

print('\n\n\n TABLES DESCRIBE ++++++++++++++++++++++++++++++++++')
for (table_name, df) in dfs.items():
    print('\n\n', table_name.upper(), '\n', df.describe())




 🏷️ APPLYING APPROPRIATE DATA TYPES ###################################################### 
datetime64[ns] dept_emp from_date
datetime64[ns] dept_emp to_date
datetime64[ns] dept_manager from_date
datetime64[ns] dept_manager to_date
datetime64[ns] employees birth_date
category employees gender
datetime64[ns] employees hire_date
datetime64[ns] titles from_date
datetime64[ns] titles to_date
datetime64[ns] salaries from_date
datetime64[ns] salaries to_date



 TABLES DESCRIBE ++++++++++++++++++++++++++++++++++


 DEPT_EMP 
               emp_no                      from_date  \
count  331603.000000                         331603   
mean   253332.605025  1993-01-01 23:42:24.762260864   
min     10001.000000            1985-01-01 00:00:00   
25%     85005.500000            1989-02-25 00:00:00   
50%    250001.000000            1993-01-27 00:00:00   
75%    424999.500000            1996-11-09 00:00:00   
max    499999.000000            2002-08-01 00:00:00   
std    161831.919445           
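
The cast above maps `'9999-01-01'` to `NaT` before converting, which is needed because `datetime64[ns]` cannot represent the year 9999. As a small sketch on a toy column (the values are made up), the same idea can be written with `pd.to_datetime`:

```python
import pandas as pd

# Toy column mimicking `to_date`, where '9999-01-01' marks an open-ended record
raw = pd.Series(['2000-07-31', '9999-01-01', '1995-12-01'])

# Upper bound of nanosecond-resolution timestamps (falls in the year 2262)
print(pd.Timestamp.max)

# Blank out the sentinel first, then parse the remaining strings
parsed = pd.to_datetime(raw.where(raw != '9999-01-01'), format='%Y-%m-%d')
print(parsed)
```

Keeping open-ended records as `NaT` makes them easy to spot later with `isna()`, at the cost of having to decide explicitly how to treat them in every duration calculation.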

<a id='data_analysis_365'></a>

# <div style="background-color:#03002e; padding:18px; border-radius:8px; color:white; text-align:center; font-weight:regular; overflow:hidden"><strong>Step 4 • Data Analysis | Exam Questions</strong></div>

## 🎓 Exam 2 • Question 1
Retrieve a list of all employees hired in the year 2000, sorted by first name in ascending order.
What is the last name of the third employee in the obtained output?

<br>

---  

Note: The questions are here paraphrased in a concise manner. For full context and original phrasing, please refer to the exam sponsor[🔗](https://learn.365datascience.com/exams/).

In [4]:
# Get the employees table
df = dfs['employees']

# Filter to `hire_date` year = 2000
df = df.loc[df['hire_date'].dt.year == 2000]

# Sort by first name in ascending order
df = df.sort_values(by='first_name', ascending=True)

# Get the last name of the third listed employee
ans = df.iloc[3-1]['last_name']
ans

'Delgrande'
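
When several employees share a first name, sorting on `first_name` alone leaves their relative order up to the sort algorithm. The sketch below, on a made-up toy table, adds `emp_no` as a tie-breaker so the "third row" is reproducible. The tie-breaker is an assumption for illustration; the exam's intended ordering among ties is not stated here.

```python
import pandas as pd

# Toy stand-in for the `employees` table (values are made up)
employees = pd.DataFrame({
    'emp_no':     [1, 2, 3, 4, 5],
    'first_name': ['Bo', 'Al', 'Al', 'Cy', 'Al'],
    'last_name':  ['West', 'Young', 'Adams', 'Stone', 'Moore'],
    'hire_date':  pd.to_datetime(['2000-01-10', '2000-03-02', '1999-12-31',
                                  '2000-06-15', '2000-02-20']),
})

# Keep only employees hired in 2000
hired_2000 = employees.loc[employees['hire_date'].dt.year == 2000]

# Sort by first name, breaking ties by emp_no so the row order is reproducible
ordered = hired_2000.sort_values(by=['first_name', 'emp_no'])

# Third row of the sorted result
third_last_name = ordered.iloc[2]['last_name']
print(third_last_name)
```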

## 🎓 Exam 2 • Question 2
Compare, by department: `female_avg_salary`, `male_avg_salary`.

Which is correct?
* Female employees earn more in Finance
* Male employees in HR earn more than female in Sales
* Female employees in CS earn more than male in HR
* Male employees earn less in Production

<br>

---

Note: The questions are here paraphrased in a concise manner. For full context and original phrasing, please refer to the exam sponsor[🔗](https://learn.365datascience.com/exams/).

In [5]:
# INFO needed:
# salaries: `salaries` table
# gender: `employees` table
# department: `dept_emp` table, `departments` table

# Get the salaries table (967330 rows × 4 columns)
df1 = dfs['salaries']

# Get the employees table (300024 rows × 6 columns)
df2 = dfs['employees']

# Join salaries ⋈ employees (w indices, for joining performance)
df = pd.merge(df1.set_index('emp_no'), df2.set_index('emp_no'), on='emp_no', how='inner')

# Get the dept_emp table (331603 rows × 4 columns)
df3 = dfs['dept_emp']

# Join df ⋈ dept_emp
df = pd.merge(df, df3.set_index('emp_no'), left_index=True, right_index=True, how='inner', suffixes=('_' + 'salary', '_' + 'dept'))

# Get the departments table (9 rows × 2 columns)
df4 = dfs['departments']

# Join df ⋈ departments
df = pd.merge(df.set_index('dept_no'), df4.set_index('dept_no'), on='dept_no', how='inner')

# Subset to fields of interest (⚠️ to be reviewed)
df = df[['dept_name', 'gender', 'salary']]

# # Prepare `dept_name`, `female_avg_salary`, `male_avg_salary`
# df = df.pivot_table(
#     index='dept_name',
#     columns='gender',
#     values='salary',
#     aggfunc='mean',
#     observed=False
# ).round(2)
# df

# Prepare `dept_name`, `female_avg_salary`, `male_avg_salary`
df = df.groupby(['dept_name', 'gender'], observed=True).mean().round(2)

# Compare
df = df.sort_values(by='salary', ascending=False)
df

# Which is correct:
# * Female employees earn more in Finance
# * Male employees in HR earn more than female in Sales
# * Female employees in CS earn more than male in HR
# * Male employees earn less in Production

dept_name,gender,salary
Sales,M,80879.76
Sales,F,80626.56
Marketing,M,72198.19
Marketing,F,71464.48
Finance,M,70327.03
Finance,F,69914.92
Research,M,59965.77
Research,F,59712.78
Production,M,59596.36
Development,M,59576.33


## 🎓 Exam 2 • Question 4
Prepare a routine to retrieve an employee's last department, given their `emp_no`. 
Who from the list works in the "Human Resources" department? [10014, 10100, 10200, 11345]

<br>

---

Note: The questions are here paraphrased in a concise manner. For full context and original phrasing, please refer to the exam sponsor[🔗](https://learn.365datascience.com/exams/).

In [6]:
# Get the dept_emp table (331603 rows × 4 columns)
df1 = dfs['dept_emp']

# Function:
def get_first_row_by_partition_desc(df, partition_col, order_col):
    """
    Returns the first row within each partition after ordering by a column in descending order.

    Args:
        df (pd.DataFrame): The input DataFrame.
        partition_col (str): The column to partition by.
        order_col (str): The column to order by.

    Returns:
        pd.DataFrame: A DataFrame containing the first row of each partition.
    """
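    df_sorted = df.sort_values(by=order_col, ascending=False)
    # Keep the whole first row of each partition. GroupBy.first() would instead return
    # the first non-null value of each column, which can mix rows when NaT is present.
    df_first_row = df_sorted.drop_duplicates(subset=partition_col, keep='first')
    return df_first_row.set_index(partition_col).sort_index()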

# Get first row by partition desc
partition_col = 'emp_no'
order_col = 'from_date'
df = get_first_row_by_partition_desc(df1, partition_col, order_col).reset_index()

# Get the departments table (9 rows × 2 columns)
df2 = dfs['departments']

# Join df ⋈ departments
df = pd.merge(df.set_index('dept_no'), df2.set_index('dept_no'), on='dept_no', how='inner')

# Filter by list of `emp_no` of interest
df = df.loc[df['emp_no'].isin([10014, 10100, 10200, 11345])]

# Filter to searched dept_name: 'Human Resources'
df = df.loc[df['dept_name'] == 'Human Resources']
df

dept_no,emp_no,from_date,to_date,dept_name
d003,10100,1987-09-21,NaT,Human Resources


## 🎓 Exam 3 • Question 1
From `dept_emp`, which are the MIN and MAX `dept_no`?

<br>

---

Note: The questions are here paraphrased in a concise manner. For full context and original phrasing, please refer to the exam sponsor[🔗](https://learn.365datascience.com/exams/).

In [7]:
# Get the dept_emp table
df = dfs['dept_emp']

# Select the min and max dept_no
ans = [df['dept_no'].min(), df['dept_no'].max()]
ans

['d001', 'd009']

## 🎓 Exam 3 • Question 2
For all employees with `emp_no` <= 10040 → get [`emp_no`, `min_dept_no`, `manager`].  

Set:
* `manager`= 110022 for `emp_no` <= 10020  
* `manager`= 110039 for `emp_no` between 10021 & 10040  

<br>

---

Note: The questions are here paraphrased in a concise manner. For full context and original phrasing, please refer to the exam sponsor[🔗](https://learn.365datascience.com/exams/).

In [8]:
# Get the dept_emp table
df = dfs['dept_emp']

# Filter to emp_no <= 10040
df = df.loc[df['emp_no'] <= 10040].copy()

# Add column `manager`
df['manager'] = -1

# For emp_no <= 10020, set 'manager' = 110022
# For emp_no  > 10020, set 'manager' = 110039
df.loc[df['emp_no'] <= 10020, 'manager'] = 110022
df.loc[df['emp_no'] >  10020, 'manager'] = 110039
df

,emp_no,dept_no,from_date,to_date,manager
0,10001,d005,1986-06-26,NaT,110022
1,10002,d007,1996-08-03,NaT,110022
2,10003,d004,1995-12-03,NaT,110022
3,10004,d004,1986-12-01,NaT,110022
4,10005,d003,1989-09-12,NaT,110022
5,10006,d005,1990-08-05,NaT,110022
6,10007,d008,1989-02-10,NaT,110022
7,10008,d005,1998-03-11,2000-07-31,110022
8,10009,d006,1985-02-18,NaT,110022
9,10010,d004,1996-11-24,2000-06-26,110022


## 🎓 Exam 3 • Question 3
How many salaries of at least $104,038 belong to contracts lasting more than 365 days?

<br>

---

Note: The questions are here paraphrased in a concise manner. For full context and original phrasing, please refer to the exam sponsor[🔗](https://learn.365datascience.com/exams/).

In [9]:
# Get salaries table
df = dfs['salaries']

# Filter contracts based on duration exceeding 365 days (83220 rows × 4 columns)
today = pd.Timestamp.today()
duration = df['to_date'].fillna(today) - df['from_date']
df = df.loc[duration > pd.Timedelta(days=365)]

# Filter to salaries >= $ 104,038
df = df.loc[df['salary'] >= 104038]

len(df)

4146
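
Two details of this filter are easy to miss: a contract of exactly 365 days does not pass a strict `> 365` test, and filling open-ended contracts with today's date makes the result depend on when the notebook runs. The sketch below illustrates both on toy rows, using a fixed `reference_date`. That date is an assumption chosen for reproducibility, not something taken from the exam.

```python
import pandas as pd

# Toy stand-in for `salaries` (made-up rows)
toy_salaries = pd.DataFrame({
    'emp_no':    [1, 1, 2, 3],
    'salary':    [105000, 110000, 120000, 90000],
    'from_date': pd.to_datetime(['1999-01-01', '2000-01-01', '2001-06-01', '1998-01-01']),
    'to_date':   pd.to_datetime(['2000-01-01', '2000-06-30', None, '2000-01-01']),
})

# Assumption: open-ended contracts (NaT) are measured up to a fixed reference date
reference_date = pd.Timestamp('2003-01-01')
duration_days = (
    toy_salaries['to_date'].fillna(reference_date) - toy_salaries['from_date']
).dt.days

# Combine both conditions in a single boolean mask and count the matches
mask = (duration_days > 365) & (toy_salaries['salary'] >= 104038)
n_salaries = int(mask.sum())
print(n_salaries)
```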

## 🎓 Exam 3 • Question 8
List all the company's Engineers

<br>

---

Note: The questions are here paraphrased in a concise manner. For full context and original phrasing, please refer to the exam sponsor[🔗](https://learn.365datascience.com/exams/).

In [10]:
# Get titles table
df = dfs['titles']

# Filter to rows having 'Engineer' in the title (227881 rows × 4 columns)
df = df.loc[df['title'].str.lower().str.contains('engineer')]
print(df['title'].unique())
df

['Senior Engineer' 'Engineer' 'Assistant Engineer']


,emp_no,title,from_date,to_date
0,10001,Senior Engineer,1986-06-26,NaT
2,10003,Senior Engineer,1995-12-03,NaT
3,10004,Engineer,1986-12-01,1995-12-01
4,10004,Senior Engineer,1995-12-01,NaT
7,10006,Senior Engineer,1990-08-05,NaT
...,...,...,...,...
443301,499996,Engineer,1996-05-13,2002-05-13
443302,499996,Senior Engineer,2002-05-13,NaT
443303,499997,Engineer,1987-08-30,1992-08-29
443304,499997,Senior Engineer,1992-08-29,NaT


## 🎓 Exam 3 • Question 9
Create a function to retrieve the highest salary of an employee, given their `emp_no`

<br>

---

Note: The questions are here paraphrased in a concise manner. For full context and original phrasing, please refer to the exam sponsor[🔗](https://learn.365datascience.com/exams/).

In [11]:
# Function
def get_highest_salary(emp_no: int):
    '''
    Retrieves an employee's highest salary, given their `emp_no`
    '''
    # Get salaries table
    df = dfs['salaries']

    # Filter to emp_no
    df = df.loc[df['emp_no'] == emp_no]
    
    # Return the highest salary for emp_no
    highest_salary = df['salary'].max()
    
    return highest_salary

get_highest_salary(10010)

80324

## 🎓 Exam 3 • Question 10
Which of the employees with `emp_no` in (11356, 11451) has a salary of $83,067?

<br>

---

Note: The questions are here paraphrased in a concise manner. For full context and original phrasing, please refer to the exam sponsor[🔗](https://learn.365datascience.com/exams/).

In [12]:
# Print the highest salary for both employees (11356, 11451)
pd.DataFrame({'emp_no': [11356, 
                         11451],
              'highest_salary': [get_highest_salary(11356), 
                                 get_highest_salary(11451)]}
            ).set_index('emp_no')

emp_no,highest_salary
11356,83067
11451,89688


## 🎓 Exam 3 • Question 11
Create a function to retrieve employee `salary_info`, f(`emp_no`, `info_type`: 'min'|'max'|'range')

<br>

---

Note: The questions are here paraphrased in a concise manner. For full context and original phrasing, please refer to the exam sponsor[🔗](https://learn.365datascience.com/exams/).

In [13]:
# Function to get salary_info as a function of emp_no, info_type ('min'|'max'|'rng')

def get_salary_info(emp_no: int, info_type: str):
    '''
    Returns `salary_info`, given `emp_no` and `info_type` ('min'|'max'|'rng')
    '''

    # Get salary table
    df = dfs['salaries']

    # Filter to emp_no
    df = df.loc[df['emp_no'] == emp_no]

    # Handle an emp_no with no rows in the salaries table
    if df.empty:
        return '`emp_no` not found in `salaries` table'

    # Build salary_info dictionary
    salary_info = {
        'min': df['salary'].min(),
        'max': df['salary'].max(),
        'rng': df['salary'].max() - df['salary'].min()
    }

    # Return salary_info, filtered to info_type
    try:
        salary_info = salary_info[info_type]
    except KeyError:
        salary_info = "`info_type` must be one of 'min', 'max', 'rng'"
    
    return salary_info

get_salary_info(10010, 'rng')

7836
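
When the same statistics are needed for many employees at once, calling a per-employee function in a loop rescans the table each time. A possible alternative, sketched here on made-up toy rows, is to compute all three values for every `emp_no` in one `groupby` and then look up individual cells:

```python
import pandas as pd

# Toy stand-in for `salaries` (made-up rows)
toy_salaries = pd.DataFrame({
    'emp_no': [1, 1, 1, 2, 2],
    'salary': [60000, 65000, 72000, 50000, 50500],
})

# Min and max per employee in a single pass, plus the range between them
salary_info = toy_salaries.groupby('emp_no')['salary'].agg(['min', 'max'])
salary_info['rng'] = salary_info['max'] - salary_info['min']

# Look up one statistic for one employee
print(salary_info.loc[1, 'rng'])
```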

WIP • to be continued (coming soon, in May 2025)

<a id='data_analysis_extra'></a>

# <div style="background-color:#03002e; padding:18px; border-radius:8px; color:white; text-align:center; font-weight:regular; overflow:hidden"><strong>Step 4 • Data Analysis | Extra</strong></div>

## 💼 Extra: Tenure across departments • Analysis
* How does tenure compare across departments?
* Which departments present a broad range in tenure?
* From which department are the employees with shortest / longest tenure?

In [14]:
## 🎓 Extra• Analyzing tenure across departments
# * How does tenure compare across departments?
# * Which departments present a broad range in tenure?
# * From which department are the employees with shortest / longest tenure?

# Build a table with columns: [ mdn_tenure_yrs_rank, dept_name, 
#                               is_overall_min_tenure_yrs, is_overall_max_tenure_yrs, is_min_tenure_rng_yrs, is_max_tenure_rng_yrs, 
#                               mdn_tenure_yrs, avg_tenure_yrs, min_tenure_yrs, max_tenure_yrs, rng_tenure_yrs ]

# Get dept_emp table (331603 rows × 4 columns)
df = dfs['dept_emp']

# Group by (dept_no, emp_no) to get min, max (101796 rows × 2 columns)
df = df.groupby(['dept_no', 'emp_no']).agg(
    min_from_date = ('from_date', 'min'),
    max_to_date   = (  'to_date', 'max')
)

# Calculate tenure
# Ref.: https://stackoverflow.com/a/77697512/7865030
df = (     df  ['max_to_date'].dt.to_period('D')
      .sub(df['min_from_date'].dt.to_period('D'))
      .apply(lambda x: x.n if pd.notna(x) else pd.NA)
     ).to_frame('tenure_yrs') / 365.25

df = df.loc[df['tenure_yrs'] > 0]

# Build tenure_info DataFrame
df = df.groupby('dept_no')['tenure_yrs'].agg(
    mdn_tenure_yrs='median',
    avg_tenure_yrs='mean',
    min_tenure_yrs='min',
    max_tenure_yrs='max'
)
df['rng_tenure_yrs'] = df['max_tenure_yrs'] - df['min_tenure_yrs']

# Get departments table
df1 = dfs['departments']

# Replace dept_no with dept_name
df = pd.merge(df, df1.set_index('dept_no'), left_index=True, right_index=True)
df = df.set_index('dept_name', drop=True)

# Include mdn_tenure_yrs_rank
df.insert(0, 'mdn_tenure_yrs_rank', df['mdn_tenure_yrs'].rank(method='min', ascending=False).astype(int))
df = df.sort_values(by='mdn_tenure_yrs_rank')

# Calculate overall min/max and range flags
overall_min_tenure = df['min_tenure_yrs'].min()
overall_max_tenure = df['max_tenure_yrs'].max()
overall_min_rng    = df['rng_tenure_yrs'].min()
overall_max_rng    = df['rng_tenure_yrs'].max()

# Create boolean flags by comparing each department's value to the overall value
df.insert(1, 'is_overall_min_tenure_yrs', (df['min_tenure_yrs'] == overall_min_tenure))
df.insert(2, 'is_overall_max_tenure_yrs', (df['max_tenure_yrs'] == overall_max_tenure))
df.insert(3,     'is_min_tenure_rng_yrs', (df['rng_tenure_yrs'] == overall_min_rng))
df.insert(4,     'is_max_tenure_rng_yrs', (df['rng_tenure_yrs'] == overall_max_rng))
df

dept_name,mdn_tenure_yrs_rank,is_overall_min_tenure_yrs,is_overall_max_tenure_yrs,is_min_tenure_rng_yrs,is_max_tenure_rng_yrs,mdn_tenure_yrs,avg_tenure_yrs,min_tenure_yrs,max_tenure_yrs,rng_tenure_yrs
Sales,1,True,True,False,True,3.619439,4.614379,0.002738,17.489391,17.486653
Development,2,True,False,False,False,3.611225,4.641721,0.002738,17.470226,17.467488
Human Resources,3,True,False,False,False,3.605749,4.600626,0.002738,17.075975,17.073238
Finance,4,True,False,False,False,3.561944,4.647984,0.002738,17.371663,17.368925
Production,5,True,False,False,False,3.518138,4.511399,0.002738,17.437372,17.434634
Research,6,True,False,False,False,3.315537,4.340737,0.002738,17.221081,17.218344
Quality Management,7,True,False,False,False,3.268994,4.359321,0.002738,17.114305,17.111567
Marketing,8,True,False,True,False,3.178645,4.272058,0.002738,16.903491,16.900753
Customer Service,9,True,False,False,False,3.151266,4.222113,0.002738,17.221081,17.218344
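
As a cross-check of the tenure figures, the same quantity can be sketched with a plain timestamp subtraction on toy rows (made up for illustration). It shows two properties worth keeping in mind when reading the table: an assignment whose `to_date` is still open (`NaT`) yields no tenure value and drops out, and a one-day assignment comes out as roughly 0.002738 years.

```python
import pandas as pd

# Toy stand-in for `dept_emp` (made-up rows)
toy_dept_emp = pd.DataFrame({
    'emp_no':    [1, 1, 2, 3],
    'dept_no':   ['d001', 'd001', 'd001', 'd003'],
    'from_date': pd.to_datetime(['1990-01-01', '1992-01-01', '1995-01-01', '1990-01-01']),
    'to_date':   pd.to_datetime(['1992-01-01', '1994-01-01', None, '1990-01-02']),
})

# First start and last end per (department, employee)
spans = toy_dept_emp.groupby(['dept_no', 'emp_no']).agg(
    min_from_date=('from_date', 'min'),
    max_to_date=('to_date', 'max'),
)

# Tenure in years; rows with an open-ended to_date become NaN and are dropped
tenure_yrs = (spans['max_to_date'] - spans['min_from_date']).dt.days / 365.25
tenure_yrs = tenure_yrs.dropna()
print(tenure_yrs.round(6))
```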


<a id='synthesis'></a>

# <div style="background-color:#03002e; padding:18px; border-radius:8px; color:white; text-align:center; font-weight:regular; overflow:hidden"><strong>Step 5 • Synthesis</strong></div>

## 💼 Extra: Tenure across departments • Synthesis

This analysis of employee tenure reveals a **generally healthy retention landscape**, though with notable departmental differences and a peculiar anomaly in minimum tenure.

Overall, the **median tenure** in the company remains at a healthy level, varying between **3.15 and 3.62 years** by department.

When analyzing tenure across departments, **Sales** stands out as the **top department for retaining people**, based on its median tenure. _Customer Service_ is on the _opposite_ end of this spectrum. The overall **longest tenure** in the dataset also belongs to **Sales**, at approximately **17.5 years**.

Conversely, the _minimum tenure_ is remarkably short—**just one day**—across all departments. This is highly unusual and suggests a potential anomaly that warrants closer scrutiny, once data integrity issues are ruled out.

In terms of variability, Sales shows the highest range in tenure, while **Marketing** demonstrates the **highest consistency in tenure duration**.

<a id='references'></a>

# <div style="background-color:#03002e; padding:18px; border-radius:8px; color:white; text-align:center; font-weight:regular; overflow:hidden"><strong>References</strong></div>

* How to calculate date difference in months with pandas: [🔗](https://stackoverflow.com/a/77697512/7865030)

<a id='takeaways'></a>
<div style="border: 2px solid #090088; border-radius: 10px;">
    <p style="font-size:26px; text-align:justify; padding-top: 10px; padding-left: 25px; padding-right: 25px; margin-bottom:-0.5em;">
        <strong>Takeaways</strong>
    </p>
  <p style="text-align:justify; padding: 25px; margin-bottom:-3em;">
      This notebook is driven by the question: <br>
      <font style="font-style:italic;">
          « <strong>How to expand and assess data querying skills?</strong> »<br>
      </font></p>    
      <ul style="text-align:justify; padding: 38px; margin-top:-1em;">
          <li>
          The <strong><a href='https://learn.365datascience.com/exams/?tab=practice'>365 Data Science • Practice Exams: SQL</a></strong> curriculum was chosen as a <strong>free</strong> source of challenging and interesting questions and data
          </li>
          <li>
          <strong>Pandas</strong> was applied as a natural pythonic alternative to SQL in querying data
          </li>
          <li>
          The <a href='https://www.kaggle.com/datasets/isissantoscosta/365ds-practice-exams-people-analytics-dataset/'><strong>🎓 365DS Practice Exams • People Analytics Dataset</strong></a> dataset was then explored
          </li>
          <li>
          Questions from the <strong>Practice Exams</strong> were solved, with commented approaches
          </li>
          <li>
          In addition, strategies used in solving the exam questions were applied to an <strong>analysis of tenure by department</strong>
          </li>
      </ul>
  </p>
  <p style="font-size:26px; text-align:right; padding-right:25px; 
      margin-top:-2.5em; margin-bottom:0em;">
      <a href='#top' style="text-decoration:none;"><strong>↑</strong></a></p>
</div>

<h6 style="background-color:#03002e; padding:12px; border-radius:8px; color:white; text-align:center; font-weight:bold; font-size:150%; font-style:normal;">
    <strong>See you on the next coding!</strong><br>
    <font style='font-size:55%; font-weight:thin; font-style:italic;'>(yours or mine) (or ours!)</font>
</h6>