# Clustering Project
## Absenteeism at work <br>

Author: Piotr Druzdzel <br>
E-mail: piotr.druzdzel@gmail.com <br><br>


Data source: UCI Machine Learning Repository <br>
https://archive.ics.uci.edu/ml/datasets/Absenteeism+at+work#

## Attribute Information:

1. Individual identification (ID) <br>
2. Reason for absence (ICD). Absences attested by the International Code of Diseases (ICD) stratified into 21 categories (I to XXI) as follows: <br>

 0 No absence <br>

I Certain infectious and parasitic diseases <br>
II Neoplasms <br>
III Diseases of the blood and blood-forming organs and certain disorders involving the immune mechanism <br>
IV Endocrine, nutritional and metabolic diseases <br>
V Mental and behavioural disorders <br>
VI Diseases of the nervous system <br>
VII Diseases of the eye and adnexa <br>
VIII Diseases of the ear and mastoid process <br>
IX Diseases of the circulatory system <br>
X Diseases of the respiratory system <br>
XI Diseases of the digestive system <br>
XII Diseases of the skin and subcutaneous tissue <br>
XIII Diseases of the musculoskeletal system and connective tissue <br>
XIV Diseases of the genitourinary system <br>
XV Pregnancy, childbirth and the puerperium <br>
XVI Certain conditions originating in the perinatal period <br>
XVII Congenital malformations, deformations and chromosomal abnormalities <br>
XVIII Symptoms, signs and abnormal clinical and laboratory findings, not elsewhere classified <br>
XIX Injury, poisoning and certain other consequences of external causes <br>
XX External causes of morbidity and mortality <br>
XXI Factors influencing health status and contact with health services. <br>

#### Categories without an ICD code:
XXII Patient follow-up <br>
XXIII Medical consultation <br> 
XXIV Blood donation <br>
XXV Laboratory examination <br> 
XXVI Unjustified absence <br> 
XXVII Physiotherapy <br>
XXVIII Dental consultation <br>

3. Month of absence <br>
4. Day of the week (Monday (2), Tuesday (3), Wednesday (4), Thursday (5), Friday (6)) <br>
5. Seasons (summer (1), autumn (2), winter (3), spring (4)) <br>
6. Transportation expense <br>
7. Distance from Residence to Work (kilometers) <br>
8. Service time <br>
9. Age <br>
10. Work load Average/day <br>
11. Hit target <br>
12. Disciplinary failure (yes=1; no=0) <br>
13. Education (high school (1), graduate (2), postgraduate (3), master and doctor (4)) <br>
14. Son (number of children) <br>
15. Social drinker (yes=1; no=0) <br>
16. Social smoker (yes=1; no=0) <br>
17. Pet (number of pets) <br>
18. Weight <br>
19. Height <br>
20. Body mass index <br> <br>
21. Absenteeism time in hours (target - for final evaluation only)
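
Several attributes above are stored as integer codes. A minimal sketch of turning those codes back into readable labels follows; the three dictionaries are copied from the attribute list, while the `sample` frame and the `decoded` name are hypothetical and exist only for illustration.

```python
import pandas as pd

# Code-to-label maps copied from the attribute list above.
LABELS = {
    "Day of the week": {2: "Monday", 3: "Tuesday", 4: "Wednesday",
                        5: "Thursday", 6: "Friday"},
    "Seasons": {1: "summer", 2: "autumn", 3: "winter", 4: "spring"},
    "Education": {1: "high school", 2: "graduate",
                  3: "postgraduate", 4: "master and doctor"},
}

# Hypothetical three-row sample, used only to illustrate the decoding.
sample = pd.DataFrame({
    "Day of the week": [3, 4, 5],
    "Seasons": [1, 1, 2],
    "Education": [1, 1, 3],
})

# Replace each coded column with its label; codes missing from a map become NaN.
decoded = sample.copy()
for column, mapping in LABELS.items():
    decoded[column] = decoded[column].map(mapping)

print(decoded)
```

Decoded labels are convenient for plots and tables; the integer codes should be kept for any numerical processing.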

#### Timing the script

In [1]:
from datetime import datetime
startTime = datetime.now()
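
The start time only becomes useful once it is compared with the time at the end of the run. A minimal sketch of that final step, assuming it is placed in the last cell of the notebook (the `elapsed` name is illustrative):

```python
from datetime import datetime

startTime = datetime.now()

# ... the rest of the notebook would run here ...

# Subtracting two datetimes gives a timedelta.
elapsed = datetime.now() - startTime
print(f"Script finished in {elapsed.total_seconds():.2f} s")
```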

#### Quick fix so that plot labels stay readable with a dark Jupyter theme:

In [2]:
from jupyterthemes import jtplot
jtplot.style(theme='monokai', context='notebook', ticks=True, grid=False)

#### Basic libraries:

In [3]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import sklearn as sk

#### Importing the data:

In [23]:
df = pd.read_csv('Absenteeism_at_work.csv')
df

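| | ID | Reason for absence | Month of absence | Day of the week | Seasons | Transportation expense | Distance from Residence to Work | Service time | Age | Work load Average/day | ... | Disciplinary failure | Education | Son | Social drinker | Social smoker | Pet | Weight | Height | Body mass index | Absenteeism time in hours |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 11 | 26 | 7 | 3 | 1 | 289 | 36 | 13 | 33 | 239.554 | ... | 0 | 1 | 2 | 1 | 0 | 1 | 90 | 172 | 30 | 4 |
| 1 | 36 | 0 | 7 | 3 | 1 | 118 | 13 | 18 | 50 | 239.554 | ... | 1 | 1 | 1 | 1 | 0 | 0 | 98 | 178 | 31 | 0 |
| 2 | 3 | 23 | 7 | 4 | 1 | 179 | 51 | 18 | 38 | 239.554 | ... | 0 | 1 | 0 | 1 | 0 | 0 | 89 | 170 | 31 | 2 |
| 3 | 7 | 7 | 7 | 5 | 1 | 279 | 5 | 14 | 39 | 239.554 | ... | 0 | 1 | 2 | 1 | 1 | 0 | 68 | 168 | 24 | 4 |
| 4 | 11 | 23 | 7 | 5 | 1 | 289 | 36 | 13 | 33 | 239.554 | ... | 0 | 1 | 2 | 1 | 0 | 1 | 90 | 172 | 30 | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 735 | 11 | 14 | 7 | 3 | 1 | 289 | 36 | 13 | 33 | 264.604 | ... | 0 | 1 | 2 | 1 | 0 | 1 | 90 | 172 | 30 | 8 |
| 736 | 1 | 11 | 7 | 3 | 1 | 235 | 11 | 14 | 37 | 264.604 | ... | 0 | 3 | 1 | 0 | 0 | 1 | 88 | 172 | 29 | 4 |
| 737 | 4 | 0 | 0 | 3 | 1 | 118 | 14 | 13 | 40 | 271.219 | ... | 0 | 1 | 1 | 1 | 0 | 8 | 98 | 170 | 34 | 0 |
| 738 | 8 | 0 | 0 | 4 | 2 | 231 | 35 | 14 | 39 | 271.219 | ... | 0 | 1 | 2 | 1 | 0 | 2 | 100 | 170 | 35 | 0 |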
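
The `read_csv` call above relies on the default comma separator. Depending on where the file was obtained, it may be semicolon-delimited instead; this is an assumption to verify against the local copy. The sketch below uses a hypothetical in-memory excerpt (`raw`, three columns only) to show how the two cases differ.

```python
import io
import pandas as pd

# Hypothetical two-row excerpt in semicolon-delimited form (three columns only).
raw = "ID;Reason for absence;Month of absence\n11;26;7\n36;0;7\n"

# With the default separator (comma) every line collapses into a single column.
as_comma = pd.read_csv(io.StringIO(raw))

# Passing sep=';' recovers the intended columns.
as_semicolon = pd.read_csv(io.StringIO(raw), sep=";")

print(as_comma.shape, as_semicolon.shape)
```

A frame with a single column right after loading is the usual sign that the separator does not match the file.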


## Exploratory Data Analysis (EDA):

#### Columns:

In [6]:
df.columns

Index(['ID', 'Reason for absence', 'Month of absence', 'Day of the week',
       'Seasons', 'Transportation expense', 'Distance from Residence to Work',
       'Service time', 'Age', 'Work load Average/day ', 'Hit target',
       'Disciplinary failure', 'Education', 'Son', 'Social drinker',
       'Social smoker', 'Pet', 'Weight', 'Height', 'Body mass index',
       'Absenteeism time in hours'],
      dtype='object')
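
The output above shows a trailing space in `'Work load Average/day '`, so `df['Work load Average/day']` would raise a `KeyError`. One possible fix is to strip whitespace from all column names, sketched here on a hypothetical one-row frame (`demo`) rather than on the real data:

```python
import pandas as pd

# Hypothetical one-row frame reproducing the trailing space seen in the output above.
demo = pd.DataFrame({
    "Age": [33],
    "Work load Average/day ": [239.554],
    "Hit target": [97],
})

# Remove leading and trailing whitespace from every column name.
demo.columns = demo.columns.str.strip()

print(list(demo.columns))
```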

#### Basic info (all columns are numerical):

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 740 entries, 0 to 739
Data columns (total 21 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   ID                               740 non-null    int64  
 1   Reason for absence               740 non-null    int64  
 2   Month of absence                 740 non-null    int64  
 3   Day of the week                  740 non-null    int64  
 4   Seasons                          740 non-null    int64  
 5   Transportation expense           740 non-null    int64  
 6   Distance from Residence to Work  740 non-null    int64  
 7   Service time                     740 non-null    int64  
 8   Age                              740 non-null    int64  
 9   Work load Average/day            740 non-null    float64
 10  Hit target                       740 non-null    int64  
 11  Disciplinary failure             740 non-null    int64  
 12  Education             

#### Checking for missing data:

In [8]:
df.isnull().sum()

ID                                 0
Reason for absence                 0
Month of absence                   0
Day of the week                    0
Seasons                            0
Transportation expense             0
Distance from Residence to Work    0
Service time                       0
Age                                0
Work load Average/day              0
Hit target                         0
Disciplinary failure               0
Education                          0
Son                                0
Social drinker                     0
Social smoker                      0
Pet                                0
Weight                             0
Height                             0
Body mass index                    0
Absenteeism time in hours          0
dtype: int64

#### Basic statistics:

In [9]:
df.describe()

| | ID | Reason for absence | Month of absence | Day of the week | Seasons | Transportation expense | Distance from Residence to Work | Service time | Age | Work load Average/day | ... | Disciplinary failure | Education | Son | Social drinker | Social smoker | Pet | Weight | Height | Body mass index | Absenteeism time in hours |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 740.0 | 740.0 | 740.0 | 740.0 | 740.0 | 740.0 | 740.0 | 740.0 | 740.0 | 740.0 | ... | 740.0 | 740.0 | 740.0 | 740.0 | 740.0 | 740.0 | 740.0 | 740.0 | 740.0 | 740.0 |
| mean | 18.017568 | 19.216216 | 6.324324 | 3.914865 | 2.544595 | 221.32973 | 29.631081 | 12.554054 | 36.45 | 271.490235 | ... | 0.054054 | 1.291892 | 1.018919 | 0.567568 | 0.072973 | 0.745946 | 79.035135 | 172.114865 | 26.677027 | 6.924324 |
| std | 11.021247 | 8.433406 | 3.436287 | 1.421675 | 1.111831 | 66.952223 | 14.836788 | 4.384873 | 6.478772 | 39.058116 | ... | 0.226277 | 0.673238 | 1.098489 | 0.495749 | 0.260268 | 1.318258 | 12.883211 | 6.034995 | 4.285452 | 13.330998 |
| min | 1.0 | 0.0 | 0.0 | 2.0 | 1.0 | 118.0 | 5.0 | 1.0 | 27.0 | 205.917 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 56.0 | 163.0 | 19.0 | 0.0 |
| 25% | 9.0 | 13.0 | 3.0 | 3.0 | 2.0 | 179.0 | 16.0 | 9.0 | 31.0 | 244.387 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 69.0 | 169.0 | 24.0 | 2.0 |
| 50% | 18.0 | 23.0 | 6.0 | 4.0 | 3.0 | 225.0 | 26.0 | 13.0 | 37.0 | 264.249 | ... | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 83.0 | 170.0 | 25.0 | 3.0 |
| 75% | 28.0 | 26.0 | 9.0 | 5.0 | 4.0 | 260.0 | 50.0 | 16.0 | 40.0 | 294.217 | ... | 0.0 | 1.0 | 2.0 | 1.0 | 0.0 | 1.0 | 89.0 | 172.0 | 31.0 | 8.0 |
| max | 36.0 | 28.0 | 12.0 | 6.0 | 4.0 | 388.0 | 52.0 | 29.0 | 58.0 | 378.884 | ... | 1.0 | 4.0 | 4.0 | 1.0 | 1.0 | 8.0 | 108.0 | 196.0 | 38.0 | 120.0 |
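
`isnull()` finds no missing values, yet the statistics show a minimum of 0 for `Month of absence`, which is not a valid month. A possible check for such sentinel values is sketched below. The `demo` frame holds values copied from displayed rows 0, 1, 2, 737 and 738; treating month 0 as "not recorded" is an assumption, not something the data source states.

```python
import pandas as pd

# Values copied from displayed rows 0, 1, 2, 737 and 738 (three columns only).
demo = pd.DataFrame({
    "Reason for absence": [26, 0, 23, 0, 0],
    "Month of absence": [7, 7, 7, 0, 0],
    "Absenteeism time in hours": [4, 0, 2, 0, 0],
})

# Months outside 1..12 cannot be real calendar months.
invalid_month = ~demo["Month of absence"].between(1, 12)
n_invalid = int(invalid_month.sum())

# Reason code 0 means "No absence" in the attribute list.
n_no_absence = int(demo["Reason for absence"].eq(0).sum())

print(n_invalid, n_no_absence)
```

Running the same two checks on the full `df` would show how many rows are affected before deciding whether to drop or keep them.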


#### Checking Reasons:

In [15]:
reasons = np.sort(df['Reason for absence'].unique())
reasons

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 21, 22, 23, 24, 25, 26, 27, 28], dtype=int64)
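
Comparing the observed codes with the full range 0 to 28 from the attribute list shows which categories never occur. A minimal sketch, with `observed` copied from the array above (the variable names are illustrative):

```python
import numpy as np

# Codes observed in the data, copied from the output above.
observed = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
                     17, 18, 19, 21, 22, 23, 24, 25, 26, 27, 28])

# The attribute list defines codes 0 (no absence) through 28 (XXVIII).
expected = np.arange(29)

# Codes that are defined but never appear in the data.
missing = np.setdiff1d(expected, observed)
print(missing)  # [20]
```

Code 20 corresponds to category XX (External causes of morbidity and mortality), which has no records in this dataset.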

#### ID only identifies the employee and adds no value for clustering, so it can be dropped:

In [18]:
df.drop('ID', axis=1, inplace=True)

In [20]:
df

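| | Reason for absence | Month of absence | Day of the week | Seasons | Transportation expense | Distance from Residence to Work | Service time | Age | Work load Average/day | Hit target | Disciplinary failure | Education | Son | Social drinker | Social smoker | Pet | Weight | Height | Body mass index | Absenteeism time in hours |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 26 | 7 | 3 | 1 | 289 | 36 | 13 | 33 | 239.554 | 97 | 0 | 1 | 2 | 1 | 0 | 1 | 90 | 172 | 30 | 4 |
| 1 | 0 | 7 | 3 | 1 | 118 | 13 | 18 | 50 | 239.554 | 97 | 1 | 1 | 1 | 1 | 0 | 0 | 98 | 178 | 31 | 0 |
| 2 | 23 | 7 | 4 | 1 | 179 | 51 | 18 | 38 | 239.554 | 97 | 0 | 1 | 0 | 1 | 0 | 0 | 89 | 170 | 31 | 2 |
| 3 | 7 | 7 | 5 | 1 | 279 | 5 | 14 | 39 | 239.554 | 97 | 0 | 1 | 2 | 1 | 1 | 0 | 68 | 168 | 24 | 4 |
| 4 | 23 | 7 | 5 | 1 | 289 | 36 | 13 | 33 | 239.554 | 97 | 0 | 1 | 2 | 1 | 0 | 1 | 90 | 172 | 30 | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 735 | 14 | 7 | 3 | 1 | 289 | 36 | 13 | 33 | 264.604 | 93 | 0 | 1 | 2 | 1 | 0 | 1 | 90 | 172 | 30 | 8 |
| 736 | 11 | 7 | 3 | 1 | 235 | 11 | 14 | 37 | 264.604 | 93 | 0 | 3 | 1 | 0 | 0 | 1 | 88 | 172 | 29 | 4 |
| 737 | 0 | 0 | 3 | 1 | 118 | 14 | 13 | 40 | 271.219 | 95 | 0 | 1 | 1 | 1 | 0 | 8 | 98 | 170 | 34 | 0 |
| 738 | 0 | 0 | 4 | 2 | 231 | 35 | 14 | 39 | 271.219 | 95 | 0 | 1 | 2 | 1 | 0 | 2 | 100 | 170 | 35 | 0 |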
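
Because the drop above uses `inplace=True`, running that cell a second time raises a `KeyError`: the `ID` column is already gone. A re-runnable alternative is sketched here on a hypothetical two-row frame (`demo`); `errors="ignore"` makes the call a no-op when the column is absent.

```python
import pandas as pd

# Hypothetical two-row frame standing in for the real data.
demo = pd.DataFrame({"ID": [11, 36], "Age": [33, 50]})

# First run removes the column.
demo = demo.drop(columns="ID", errors="ignore")

# Second run finds nothing to drop and does not raise.
demo = demo.drop(columns="ID", errors="ignore")

print(list(demo.columns))
```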


In [21]:
df['Reason for absence'].value_counts()

23    149
28    112
27     69
13     55
0      43
19     40
22     38
26     33
25     31
11     26
10     25
18     21
14     19
1      16
7      15
6       8
12      8
8       6
21      6
9       4
5       3
24      3
16      3
4       2
15      2
3       1
2       1
17      1
Name: Reason for absence, dtype: int64
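
The counts above can be summarised by asking how many absences fall into the categories without an ICD code (22 to 28). A minimal sketch, with `counts` rebuilt by hand from the output above instead of being recomputed from `df`:

```python
import pandas as pd

# Counts copied from the value_counts() output above (code -> number of rows).
counts = pd.Series({
    23: 149, 28: 112, 27: 69, 13: 55, 0: 43, 19: 40, 22: 38, 26: 33,
    25: 31, 11: 26, 10: 25, 18: 21, 14: 19, 1: 16, 7: 15, 6: 8, 12: 8,
    8: 6, 21: 6, 9: 4, 5: 3, 24: 3, 16: 3, 4: 2, 15: 2, 3: 1, 2: 1, 17: 1,
}, name="Reason for absence")

# Codes 22..28 are the categories without an ICD code in the attribute list.
non_icd = int(counts[counts.index >= 22].sum())
share = non_icd / counts.sum()

print(non_icd, f"{share:.1%}")
```

Such a summary may help when deciding whether to group rare reason codes before clustering.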