# Dataset: Absenteeism at work

Source: UCI Machine Learning Repository 

URL: https://archive.ics.uci.edu/ml/datasets/Absenteeism+at+work

### Dataset description 

Depending on the purpose of the research, the data set allows attributes to be combined or excluded and attribute types (categorical, integer, or real) to be changed. The data set (Absenteeism at work - Part I) was used in academic research at the Universidade Nove de Julho, Postgraduate Program in Informatics and Knowledge Management.


### Categorical data information 

The reason for absence (attribute 2) includes the following categories that have no ICD code: patient follow-up (22), medical consultation (23), blood donation (24), laboratory examination (25), unjustified absence (26), physiotherapy (27), dental consultation (28).

1. Individual identification (ID)
2. Reason for absence (ICD)
3. Month of absence
4. Day of the week (Monday (2), Tuesday (3), Wednesday (4), Thursday (5), Friday (6))
5. Seasons (summer (1), autumn (2), winter (3), spring (4))
6. Transportation expense
7. Distance from Residence to Work (kilometers)
8. Service time
9. Age
10. Work load Average/day
11. Hit target
12. Disciplinary failure (yes=1; no=0)
13. Education (high school (1), graduate (2), postgraduate (3), master and doctor (4))
14. Son (number of children)
15. Social drinker (yes=1; no=0)
16. Social smoker (yes=1; no=0)
17. Pet (number of pets)
18. Weight
19. Height
20. Body mass index
21. Absenteeism time in hours (target)
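
Several of the attributes above are stored as integer codes. As a minimal sketch, the codes listed in this section can be held in plain dictionaries and used to decode a value. The dictionary names and the `decode` helper are illustrative and not part of the dataset.

```python
# Lookup tables built from the codes listed in this section.
# The names (DAY_OF_WEEK, SEASONS, ...) are illustrative, not part of the dataset.
DAY_OF_WEEK = {2: "Monday", 3: "Tuesday", 4: "Wednesday", 5: "Thursday", 6: "Friday"}
SEASONS = {1: "summer", 2: "autumn", 3: "winter", 4: "spring"}
EDUCATION = {1: "high school", 2: "graduate", 3: "postgraduate", 4: "master and doctor"}
NON_ICD_REASONS = {
    22: "patient follow-up",
    23: "medical consultation",
    24: "blood donation",
    25: "laboratory examination",
    26: "unjustified absence",
    27: "physiotherapy",
    28: "dental consultation",
}


def decode(code, table):
    """Return the label for a code, or a placeholder if the code is not listed."""
    return table.get(code, f"unknown ({code})")


print(decode(3, DAY_OF_WEEK))       # Tuesday
print(decode(26, NON_ICD_REASONS))  # unjustified absence
print(decode(0, SEASONS))           # unknown (0)
```

Returning a placeholder instead of raising keeps the helper usable on rows whose code is not covered by the lists above.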


#### Install - execute this once. Can be commented out afterwards if running from Syzygy or locally. 

In [None]:
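try:
    %pip install jupysql --quiet
    print("Success")
except Exception:
    # Catch only regular errors so that interrupting the cell still works
    print("Installation failed; re-run this cell")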

#### Load the data

In [None]:
import requests
import zipfile
import io
import pandas as pd
from sqlalchemy.engine import create_engine

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00445/Absenteeism_at_work_AAA.zip"

# download the ZIP file and stop early if the server returns an HTTP error
response = requests.get(url)
response.raise_for_status()

# extract the contents of the ZIP file
zf = zipfile.ZipFile(io.BytesIO(response.content))
df = pd.read_csv(zf.open("Absenteeism_at_work.csv"), sep=";", index_col=0)

# Replace spaces with underscores and "/" with "_per_" in the column names
df.columns = [c.replace(" ", "_").replace("/","_per_") for c in df.columns]
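
The renaming step can be checked in isolation. The sketch below applies the same expression to a few sample headers based on the attribute list above; the exact spelling of the headers in the CSV file may differ, so treat the sample names as assumptions.

```python
# Sample headers based on the attribute list; the CSV's exact spelling may differ.
sample_columns = ["Reason for absence", "Work load Average/day", "Body mass index"]

# Same expression as in the loading cell: " " becomes "_", "/" becomes "_per_".
renamed = [c.replace(" ", "_").replace("/", "_per_") for c in sample_columns]

print(renamed)
# ['Reason_for_absence', 'Work_load_Average_per_day', 'Body_mass_index']
```

Column names without spaces or slashes can be written in SQL queries without quoting.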

#### Store the data into a SQLite instance

In [None]:
engine = create_engine("sqlite://")

df.to_sql("absenteeism", engine)

#### Load Engine

In [None]:
%load_ext sql
%sql engine

#### Use JupySQL to perform the queries and answer the questions.

In [None]:
%%sql 
SELECT *
FROM absenteeism 
LIMIT 5

#### Question 1 (Easy):
What is the average distance from residence to work? 

In [None]:
%%sql

<details>

<summary>Answers</summary>

Use the `%%sql` magic and apply the `AVG()` aggregate function to the `Distance_from_Residence_to_Work` column to calculate the average distance from residence to work.

```python
%%sql
SELECT AVG(Distance_from_Residence_to_Work) 
FROM absenteeism;
```
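
To see what `AVG` does without the notebook setup, here is a small sketch using Python's built-in `sqlite3` module on a toy table. The three distances are made up for illustration and are not taken from the dataset.

```python
import sqlite3

# Toy table with made-up distances (km); not the real dataset.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE absenteeism (Distance_from_Residence_to_Work INTEGER)")
conn.executemany("INSERT INTO absenteeism VALUES (?)", [(10,), (20,), (36,)])

# AVG collapses all rows into a single value: (10 + 20 + 36) / 3 = 22.0
(avg_distance,) = conn.execute(
    "SELECT AVG(Distance_from_Residence_to_Work) FROM absenteeism"
).fetchone()
conn.close()

print(avg_distance)  # 22.0
```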
</details>

#### Question 2 (Medium):


What is the average absenteeism time for employees with a BMI higher than the average BMI?

In [None]:
%%sql


<details>

<summary>Answers</summary>

Use the `%%sql` magic with `AVG(Absenteeism_time_in_hours)`, aliased as `AVG_Absenteeism_time_in_hours`, to compute the average absenteeism time in hours.

`WHERE Body_mass_index > (`: This part begins a condition that the data must meet to be included in our average calculation. Here, we're only interested in rows where the `Body_mass_index` is greater than a certain value.

`SELECT AVG(Body_mass_index) FROM absenteeism)`: This is a subquery, a query within a query. It's calculating the average `Body_mass_index` for the entire absenteeism table.

```python
%%sql
SELECT AVG(Absenteeism_time_in_hours) as AVG_Absenteeism_time_in_hours
FROM absenteeism 
WHERE Body_mass_index > (
    SELECT AVG(Body_mass_index) 
    FROM absenteeism);

```
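
The effect of the subquery is easier to follow on a handful of rows. The sketch below uses Python's built-in `sqlite3` module and a toy table with made-up values (not the real dataset) to show the inner average first and then the filtered outer average.

```python
import sqlite3

# Toy rows: (Body_mass_index, Absenteeism_time_in_hours); values are made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE absenteeism (Body_mass_index INTEGER, Absenteeism_time_in_hours INTEGER)"
)
conn.executemany(
    "INSERT INTO absenteeism VALUES (?, ?)", [(20, 2), (25, 4), (30, 8), (35, 16)]
)

# Inner query on its own: (20 + 25 + 30 + 35) / 4 = 27.5
(avg_bmi,) = conn.execute("SELECT AVG(Body_mass_index) FROM absenteeism").fetchone()

# Full query: only the rows with BMI 30 and 35 pass the filter, so (8 + 16) / 2 = 12.0
(avg_hours,) = conn.execute(
    """
    SELECT AVG(Absenteeism_time_in_hours)
    FROM absenteeism
    WHERE Body_mass_index > (SELECT AVG(Body_mass_index) FROM absenteeism)
    """
).fetchone()
conn.close()

print(avg_bmi, avg_hours)  # 27.5 12.0
```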
</details>

#### Question 3 (Hard):

Which reasons for absence are more frequent for social drinkers than social non-drinkers?

In [None]:
%%sql

<details>

<summary>Answers</summary>

You can use the `%%sql` magic. We use `SELECT` to extract the `Reason_for_absence` from the `absenteeism` table. 

The column `Social_drinker` is encoded using binary notation, 0=is not a social drinker, 1=is a social drinker. 

We next group by their reason for absence. 

`HAVING COUNT(*) > (` begins the condition that each group must meet to be included in the results. Only reasons whose row count among social drinkers is greater than the value returned by the subquery are kept.

`SELECT COUNT(*) FROM absenteeism AS n WHERE n.Social_drinker = 0 AND n.Reason_for_absence = d.Reason_for_absence)` is a correlated subquery: for the reason currently being evaluated in the outer query, it counts the rows where `Social_drinker` is 0 (the employee is not a social drinker). The condition on `Reason_for_absence` ties both counts to the same reason. A subquery with its own `GROUP BY` would return one count per reason rather than the single count needed for the comparison.

```python
%%sql
SELECT Reason_for_absence
FROM absenteeism AS d
WHERE Social_drinker = 1
GROUP BY Reason_for_absence
HAVING COUNT(*) > (
    SELECT COUNT(*)
    FROM absenteeism AS n
    WHERE n.Social_drinker = 0
      AND n.Reason_for_absence = d.Reason_for_absence);

```
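
One way to make the per-reason comparison explicit is a correlated subquery that counts non-drinker rows for the same reason as the outer group. The sketch below runs this on a toy table with Python's built-in `sqlite3` module; the rows are made up for illustration and are not taken from the dataset.

```python
import sqlite3

# Toy rows: (Reason_for_absence, Social_drinker); values are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE absenteeism (Reason_for_absence INTEGER, Social_drinker INTEGER)")
rows = [
    (23, 1), (23, 1), (23, 1), (23, 0),  # reason 23: 3 drinkers vs 1 non-drinker
    (28, 1), (28, 0), (28, 0),           # reason 28: 1 drinker vs 2 non-drinkers
    (13, 1), (13, 1),                    # reason 13: 2 drinkers vs 0 non-drinkers
    (19, 0),                             # reason 19: non-drinkers only
]
conn.executemany("INSERT INTO absenteeism VALUES (?, ?)", rows)

query = """
    SELECT d.Reason_for_absence
    FROM absenteeism AS d
    WHERE d.Social_drinker = 1
    GROUP BY d.Reason_for_absence
    HAVING COUNT(*) > (
        SELECT COUNT(*)
        FROM absenteeism AS n
        WHERE n.Social_drinker = 0
          AND n.Reason_for_absence = d.Reason_for_absence)
    ORDER BY d.Reason_for_absence
"""
more_frequent = [reason for (reason,) in conn.execute(query)]
conn.close()

print(more_frequent)  # [13, 23]
```

This compares raw counts. If the two groups differ a lot in size, comparing each reason's share within its group would be an alternative reading of the question.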
</details>

### References   

Martiniano, A., Ferreira, R. P., Sassi, R. J., & Affonso, C. (2012). Application of a neuro fuzzy network in prediction of absenteeism at work. In Information Systems and Technologies (CISTI), 7th Iberian Conference on (pp. 1-4). IEEE.

### Acknowledgements

Professor Gary Johns for contributing to the selection of relevant research attributes.

Professor Emeritus of Management

Honorary Concordia University Research Chair in Management

John Molson School of Business

Concordia University

Montreal, Quebec, Canada

Adjunct Professor, OB/HR Division

Sauder School of Business,

University of British Columbia

Vancouver, British Columbia, Canada