# Dataset: Absenteeism at work

Source: UCI Machine Learning Repository 

URL: https://archive.ics.uci.edu/ml/datasets/Absenteeism+at+work

### Dataset description 

Depending on the purpose of the research, the attributes in this data set can be combined, excluded, or recast to a different type (categorical, integer, or real). The data set (Absenteeism at work - Part I) was used in academic research at the Universidade Nove de Julho, Postgraduate Program in Informatics and Knowledge Management.


### Categorical data information 

The `Reason for absence` attribute includes the following categories that have no ICD code (written as CID in the original source): patient follow-up (22), medical consultation (23), blood donation (24), laboratory examination (25), unjustified absence (26), physiotherapy (27), and dental consultation (28).

1. Individual identification (ID)
2. Reason for absence (ICD)
3. Month of absence
4. Day of the week (Monday (2), Tuesday (3), Wednesday (4), Thursday (5), Friday (6))
5. Seasons (summer (1), autumn (2), winter (3), spring (4))
6. Transportation expense
7. Distance from Residence to Work (kilometers)
8. Service time
9. Age
10. Work load Average/day
11. Hit target
12. Disciplinary failure (yes=1; no=0)
13. Education (high school (1), graduate (2), postgraduate (3), master and doctor (4))
14. Son (number of children)
15. Social drinker (yes=1; no=0)
16. Social smoker (yes=1; no=0)
17. Pet (number of pets)
18. Weight
19. Height
20. Body mass index
21. Absenteeism time in hours (target)
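
Several attributes above are stored as integer codes, so query results are easier to read once the codes are mapped back to labels. The following is a minimal sketch of that decoding in plain Python. The dictionaries are transcribed from the lists above; the `decode` helper and the sample `record` are hypothetical and exist only for illustration.

```python
# Label mappings transcribed from the attribute descriptions above.
DAY_OF_WEEK = {2: "Monday", 3: "Tuesday", 4: "Wednesday", 5: "Thursday", 6: "Friday"}
SEASONS = {1: "summer", 2: "autumn", 3: "winter", 4: "spring"}
EDUCATION = {1: "high school", 2: "graduate", 3: "postgraduate", 4: "master and doctor"}
NON_ICD_REASONS = {
    22: "patient follow-up",
    23: "medical consultation",
    24: "blood donation",
    25: "laboratory examination",
    26: "unjustified absence",
    27: "physiotherapy",
    28: "dental consultation",
}


def decode(code, mapping):
    """Return the label for a code, or a placeholder if the code is not mapped."""
    return mapping.get(code, f"unknown ({code})")


# Hypothetical record, not taken from the data set.
record = {"Day_of_the_week": 3, "Seasons": 1, "Education": 1, "Reason_for_absence": 26}

decoded = {
    "Day_of_the_week": decode(record["Day_of_the_week"], DAY_OF_WEEK),
    "Seasons": decode(record["Seasons"], SEASONS),
    "Education": decode(record["Education"], EDUCATION),
    "Reason_for_absence": decode(record["Reason_for_absence"], NON_ICD_REASONS),
}
print(decoded)
```

Codes that are not in a mapping (for example, the ICD-coded reasons, which are not listed here) fall through to the `unknown (...)` placeholder rather than raising an error.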


#### Install - execute this once. Can be commented out afterwards if running from Syzygy or locally. 

In [None]:
try:
    %pip install jupysql --quiet
    print("Success")
except Exception:
    print("retry installing")

#### Load the data

In [None]:
import requests
import zipfile
import io
import pandas as pd
from sqlalchemy.engine import create_engine

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00445/Absenteeism_at_work_AAA.zip"

# download the ZIP file and stop early if the request failed
response = requests.get(url)
response.raise_for_status()

# extract the contents of the ZIP file
zf = zipfile.ZipFile(io.BytesIO(response.content))
df = pd.read_csv(zf.open("Absenteeism_at_work.csv"), sep=";", index_col=0)

# Replace spaces with underscores and "/" with "_per_" in the column names
df.columns = [c.replace(" ", "_").replace("/","_per_") for c in df.columns]
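
To see what the renaming step produces, here is a small sketch that applies the same two replacements to a few sample names. The sample names are taken from the attribute list above and are an assumption; the actual CSV headers may differ slightly (for example, in capitalization or stray whitespace). The `normalize_column` helper is hypothetical.

```python
def normalize_column(name):
    """Apply the same two replacements used when loading the data."""
    return name.replace(" ", "_").replace("/", "_per_")


# Sample names based on the attribute list, not read from the CSV file.
sample_columns = [
    "Reason for absence",
    "Work load Average/day",
    "Absenteeism time in hours",
]

normalized = [normalize_column(c) for c in sample_columns]
print(normalized)
```

Names without spaces or slashes are easier to reference in SQL because they do not need to be quoted.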

#### Store the data into a SQLite instance

In [None]:
engine = create_engine("sqlite://")

df.to_sql("absenteeism", engine)
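
After writing a table to SQLite, it can help to confirm that it contains the expected rows and columns. The sketch below shows one way to run that kind of check. It uses the standard-library `sqlite3` module and a few made-up rows instead of the notebook's SQLAlchemy engine and the real data, so the table contents here are assumptions for illustration only.

```python
import sqlite3

# Separate in-memory database for the sketch; it does not touch the notebook's engine.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE absenteeism (ID INTEGER, Age INTEGER, Absenteeism_time_in_hours INTEGER)"
)

# Made-up rows, not taken from the data set.
toy_rows = [(11, 33, 4), (36, 50, 0), (11, 33, 8)]
conn.executemany("INSERT INTO absenteeism VALUES (?, ?, ?)", toy_rows)
conn.commit()

# Count the rows and list the column names of the stored table.
row_count = conn.execute("SELECT COUNT(*) FROM absenteeism").fetchone()[0]
column_names = [row[1] for row in conn.execute("PRAGMA table_info(absenteeism)")]
print(row_count, column_names)

conn.close()
```

The same two statements (`SELECT COUNT(*)` and `PRAGMA table_info`) can be run against the real table to check it before starting the exercises.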

#### Load Engine

In [None]:
%load_ext sql
%sql engine

#### Use JupySQL to perform the queries and answer the questions.

In [None]:
%%sql 
SELECT *
FROM absenteeism 
LIMIT 5

#### Question 1 (Easy):
How many unique employees are listed in the dataset?

In [None]:
%%sql

<details>

<summary>Answers</summary>

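You can use the `%%sql` magic with `COUNT(DISTINCT ID)` to count the unique values in the `ID` column, which identifies each employee. The `ID` column is available in the table because the DataFrame index (the first CSV column) is written along with the data by `to_sql`.

```python
%%sql
SELECT COUNT(DISTINCT ID) 
FROM absenteeism;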
```
</details>

#### Question 2 (Medium):
What is the average transportation expense for each season?


In [None]:
%%sql


<details>

<summary>Answers</summary>

You can use the `%%sql` magic with `AVG(Transportation_expense)`, aliased as `AVG_Transportation_Expense`, to compute the average transportation expense, then group the results by `Seasons`.

```python
%%sql
SELECT Seasons, AVG(Transportation_expense) AS AVG_Transportation_Expense
FROM absenteeism 
GROUP BY Seasons;

```
</details>

#### Question 3 (Hard):

Find the age of employees who have been absent for more than 5 hours with an unjustified absence.

Hint: investigate encoding on the data source.

In [None]:
%%sql

<details>

<summary>Answers</summary>

You can use the `%%sql` magic. 'Unjustified absence' is coded as 26 in the `Reason_for_absence` column. From there, select `Age` and use `WHERE` to combine the two conditions with `AND`.

```python
%%sql
SELECT Age 
FROM absenteeism 
WHERE Reason_for_absence = 26 AND Absenteeism_time_in_hours > 5;

```
</details>

### References   

Martiniano, A., Ferreira, R. P., Sassi, R. J., & Affonso, C. (2012). Application of a neuro fuzzy network in prediction of absenteeism at work. In Information Systems and Technologies (CISTI), 7th Iberian Conference on (pp. 1-4). IEEE.

### Acknowledgements

Professor Gary Johns for contributing to the selection of relevant research attributes.

Professor Emeritus of Management

Honorary Concordia University Research Chair in Management

John Molson School of Business

Concordia University

Montreal, Quebec, Canada

Adjunct Professor, OB/HR Division

Sauder School of Business,

University of British Columbia

Vancouver, British Columbia, Canada