# Dataset: Absenteeism at work

Source: UCI Machine Learning Repository 

URL: https://archive.ics.uci.edu/ml/datasets/Absenteeism+at+work

### Dataset description 

The data set allows for several new combinations of attributes and attribute exclusions, or the modification of the attribute type (categorical, integer, or real), depending on the purpose of the research. The data set (Absenteeism at work - Part I) was used in academic research at the Universidade Nove de Julho - Postgraduate Program in Informatics and Knowledge Management.


### Categorical data information 

In addition to ICD-coded reasons, the `Reason for absence` attribute includes the following categories without an ICD code (written "CID" in the original source): patient follow-up (22), medical consultation (23), blood donation (24), laboratory examination (25), unjustified absence (26), physiotherapy (27), dental consultation (28).

1. Individual identification (ID)
2. Reason for absence (ICD)
3. Month of absence
4. Day of the week (Monday (2), Tuesday (3), Wednesday (4), Thursday (5), Friday (6))
5. Seasons (summer (1), autumn (2), winter (3), spring (4))
6. Transportation expense
7. Distance from Residence to Work (kilometers)
8. Service time
9. Age
10. Work load Average/day
11. Hit target
12. Disciplinary failure (yes=1; no=0)
13. Education (high school (1), graduate (2), postgraduate (3), master and doctor (4))
14. Son (number of children)
15. Social drinker (yes=1; no=0)
16. Social smoker (yes=1; no=0)
17. Pet (number of pets)
18. Weight
19. Height
20. Body mass index
21. Absenteeism time in hours (target)
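
The coded attributes above can be turned into readable labels with small lookup tables. The following is a minimal sketch, not part of the original notebook: the dictionary names and the `decode_row` helper are hypothetical, and the sample values mirror the first row of the data preview shown in this notebook. Codes that are not in a table are left unchanged.

```python
# Lookup tables built from the attribute list above (names are hypothetical).
DAY_OF_WEEK = {2: "Monday", 3: "Tuesday", 4: "Wednesday", 5: "Thursday", 6: "Friday"}
SEASONS = {1: "summer", 2: "autumn", 3: "winter", 4: "spring"}
EDUCATION = {1: "high school", 2: "graduate", 3: "postgraduate", 4: "master and doctor"}
NON_ICD_REASONS = {
    22: "patient follow-up",
    23: "medical consultation",
    24: "blood donation",
    25: "laboratory examination",
    26: "unjustified absence",
    27: "physiotherapy",
    28: "dental consultation",
}

# Which lookup table applies to which column.
LOOKUPS = {
    "Reason_for_absence": NON_ICD_REASONS,
    "Day_of_the_week": DAY_OF_WEEK,
    "Seasons": SEASONS,
    "Education": EDUCATION,
}


def decode_row(row):
    """Return a copy of `row` with coded values replaced by labels.

    Values without an entry in the lookup table are kept as-is.
    """
    decoded = dict(row)
    for column, table in LOOKUPS.items():
        if column in decoded:
            decoded[column] = table.get(decoded[column], decoded[column])
    return decoded


sample = {"Reason_for_absence": 26, "Day_of_the_week": 3, "Seasons": 1, "Education": 1}
decoded = decode_row(sample)
print(decoded)
```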


## 5-minute crash course on JupySQL

Play the following video to get familiar with using JupySQL to run queries in Jupyter with DuckDB.

<b>If you get stuck, join our Slack community!</b> https://ploomber.io/community


[![JupySQL crash course video](https://img.youtube.com/vi/CsWEUYLaYU0/0.jpg)](https://www.youtube.com/watch?v=CsWEUYLaYU0)


#### Install - execute this once. It can be commented out afterwards if running from Syzygy or locally.

In [None]:
try:
    %pip install jupysql duckdb-engine pandas --quiet
    print("Success")
except Exception:
    # Catch Exception rather than using a bare except, so interrupting the cell still works
    print("retry installing")

#### Load the data

In [1]:
from urllib.request import urlretrieve
from zipfile import ZipFile
import pandas as pd

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00445/Absenteeism_at_work_AAA.zip"

# download the file
urlretrieve(url, "Absenteeism_at_work_AAA.zip")

# Extract the CSV file
with ZipFile("Absenteeism_at_work_AAA.zip", 'r') as zf:
    zf.extractall()

# Check the extracted CSV file name (in this case, it's "Absenteeism_at_work.csv")
csv_file_name = "Absenteeism_at_work.csv"

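# Data clean up: the source file is semicolon-delimited
df = pd.read_csv(csv_file_name, sep=";")
df.columns = df.columns.str.replace(' ', '_')

# Save the cleaned up CSV file, keeping the semicolon delimiter that the load step expects
df.to_csv("Absenteeism_at_work_cleaned.csv", index=False, sep=";")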
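
To see what the clean-up step does to the header, the sketch below applies the same space-to-underscore replacement to a few column names in plain Python. It is illustrative only: the names come from the attribute list and the data preview in this notebook, and the trailing space in `Work load Average/day ` is an assumption inferred from the trailing underscore in the previewed column name.

```python
# A few raw header names (the trailing space in the third one is an assumption).
raw_columns = [
    "Reason for absence",
    "Day of the week",
    "Work load Average/day ",
    "Absenteeism time in hours",
]

# Same transformation as df.columns.str.replace(' ', '_'), one name at a time.
cleaned = [name.replace(" ", "_") for name in raw_columns]

for before, after in zip(raw_columns, cleaned):
    print(f"{before!r} -> {after!r}")
```

Names without spaces are easier to reference in SQL because they do not need to be quoted.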

#### Load Engine

<b>Note:</b> Before running the cells below, restart any other notebook that uses the same database name as the one initialized here.

In [2]:
%reload_ext sql
%sql duckdb:///absenteeism.duck.db

In [3]:
%%sql
create or replace table absenteeism as
from read_csv_auto('Absenteeism_at_work_cleaned.csv', header=True, sep=';')

*  duckdb:///absenteeism.duck.db
Done.


Count
740


In [4]:
%%sql
SELECT count(*) FROM absenteeism

*  duckdb:///absenteeism.duck.db
Done.


count_star()
740


#### Use JupySQL to perform the queries and answer the questions.

Example: show the first 5 rows.

In [5]:
%%sql 
SELECT *
FROM absenteeism 
LIMIT 5

*  duckdb:///absenteeism.duck.db
Done.


ID,Reason_for_absence,Month_of_absence,Day_of_the_week,Seasons,Transportation_expense,Distance_from_Residence_to_Work,Service_time,Age,Work_load_Average/day_,Hit_target,Disciplinary_failure,Education,Son,Social_drinker,Social_smoker,Pet,Weight,Height,Body_mass_index,Absenteeism_time_in_hours
11,26,7,3,1,289,36,13,33,239.554,97,0,1,2,1,0,1,90,172,30,4
36,0,7,3,1,118,13,18,50,239.554,97,1,1,1,1,0,0,98,178,31,0
3,23,7,4,1,179,51,18,38,239.554,97,0,1,0,1,0,0,89,170,31,2
7,7,7,5,1,279,5,14,39,239.554,97,0,1,2,1,1,0,68,168,24,4
11,23,7,5,1,289,36,13,33,239.554,97,0,1,2,1,0,1,90,172,30,2


#### Question 1 (Easy):
How many records are there in the 'absenteeism' table? 

Enter your answer in the cell below.

In [None]:
%%sql

<details>

<summary>Answers</summary>

You can use the `%%sql` magic and the `COUNT(*)` function to count the total number of records. 

```python
%%sql
SELECT COUNT(*) 
FROM absenteeism
```
</details>

#### Question 2 (Medium):
On which days of the week does the average absenteeism time exceed 4 hours? 

Enter your answer in the cell below.

In [None]:
%%sql

<details>

<summary>Answers</summary>

You can use the `%%sql` magic and break down the query as follows:

1. Select the column with name `Day_of_the_week`
2. From the table called `absenteeism`
3. Group the rows by day of the week, then use `HAVING` with `AVG` to keep only the groups whose average absenteeism time is more than 4 hours.

```python
%%sql
SELECT Day_of_the_week 
FROM absenteeism 
GROUP BY Day_of_the_week 
HAVING AVG(Absenteeism_time_in_hours) > 4;
```
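
To see how `GROUP BY` and `HAVING` interact, here is a minimal sketch on made-up rows. It uses Python's built-in `sqlite3` module as a stand-in for DuckDB so it runs without extra packages; the `toy` table and its values are hypothetical and are not taken from the dataset.

```python
import sqlite3

# In-memory SQLite database standing in for DuckDB; the rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE toy (Day_of_the_week INTEGER, Absenteeism_time_in_hours INTEGER)"
)
conn.executemany(
    "INSERT INTO toy VALUES (?, ?)",
    [(2, 8), (2, 4), (3, 2), (3, 3), (4, 5), (4, 3)],
)

# Per-day averages: day 2 -> 6.0, day 3 -> 2.5, day 4 -> 4.0.
# HAVING filters whole groups after aggregation, so only day 2 passes "> 4".
rows = conn.execute(
    """
    SELECT Day_of_the_week, AVG(Absenteeism_time_in_hours) AS avg_hours
    FROM toy
    GROUP BY Day_of_the_week
    HAVING AVG(Absenteeism_time_in_hours) > 4
    ORDER BY Day_of_the_week
    """
).fetchall()
conn.close()

print(rows)  # [(2, 6.0)]
```

Day 4 averages exactly 4.0 and is excluded because the comparison is strict.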
</details>

#### Question 3 (Hard):
Find the top 3 ages with the highest total absenteeism hours, excluding disciplinary failures.

Enter your answer in the cell below.

In [None]:
%%sql



<details>

<summary>Answers</summary>

You can use the `%%sql` magic and break down the query as follows:

1. Select the column named `Age` and compute the sum of `Absenteeism_time_in_hours`. Give this sum the alias `Sum_Absenteeism`.
2. From the table called `absenteeism`
3. The keyword `WHERE` filters the rows that meet a specific condition, in this case `Disciplinary_failure` equal to zero.
4. Group values by the `Age` column.
5. Sort the values by the sum in descending order and show the first 3 values.

```python
%%sql
SELECT Age, SUM(Absenteeism_time_in_hours) AS Sum_Absenteeism
FROM absenteeism 
WHERE Disciplinary_failure = 0 
GROUP BY Age 
ORDER BY Sum_Absenteeism DESC
LIMIT 3;
```
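
The order of operations (filter, group, sort, limit) can be checked on a small made-up table. This is a sketch using Python's built-in `sqlite3` module in place of DuckDB; the `toy` table and its rows are hypothetical. One row has `Disciplinary_failure = 1` with a large hour count to show that `WHERE` removes it before the sums are computed.

```python
import sqlite3

# In-memory SQLite database standing in for DuckDB; the rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE toy ("
    "Age INTEGER, Disciplinary_failure INTEGER, Absenteeism_time_in_hours INTEGER)"
)
conn.executemany(
    "INSERT INTO toy VALUES (?, ?, ?)",
    [
        (33, 0, 4), (33, 0, 2),
        (50, 1, 40),  # excluded by the WHERE clause
        (50, 0, 3),
        (38, 0, 2),
        (39, 0, 4), (39, 0, 8),
        (28, 0, 1),
    ],
)

# WHERE filters rows first, GROUP BY sums per age, ORDER BY ... DESC ranks the
# sums from largest to smallest, and LIMIT keeps the top 3.
top_ages = conn.execute(
    """
    SELECT Age, SUM(Absenteeism_time_in_hours) AS Sum_Absenteeism
    FROM toy
    WHERE Disciplinary_failure = 0
    GROUP BY Age
    ORDER BY Sum_Absenteeism DESC
    LIMIT 3
    """
).fetchall()
conn.close()

print(top_ages)  # [(39, 12), (33, 6), (50, 3)]
```

Without the `WHERE` clause, age 50 would total 43 hours and rank first.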
</details>

### Bonus: Save the queries you wrote using the `--save` option, then use the saved queries to generate visualizations.

Here are a few tutorials to get you started:

Parameterizing SQL queries: https://jupysql.ploomber.io/en/latest/user-guide/template.html

SQL Plot: https://jupysql.ploomber.io/en/latest/api/magic-plot.html

Organizing large queries: https://jupysql.ploomber.io/en/latest/compose.html

Plotting with ggplot: https://jupysql.ploomber.io/en/latest/user-guide/ggplot.html

Turning your notebook into a Voila dashboard: https://ploomber.io/blog/voila-tutorial/


<h2 align='center'>Congratulations! You can share this notebook with your network, or add it as part of your portfolio.</h2>

### References   

Martiniano, A., Ferreira, R. P., Sassi, R. J., & Affonso, C. (2012). Application of a neuro fuzzy network in prediction of absenteeism at work. In Information Systems and Technologies (CISTI), 7th Iberian Conference on (pp. 1-4). IEEE.

### Acknowledgements

Thank you, Mark Needham, for producing the 5-minute crash course on using JupySQL.