1. Which dataset you’re using
2. Specification of your dataset
3. Pre-processing
4. How classification and semi-supervised algorithms work
5. Result of implementation
6. Conclusion

### About Dataset
##### Context
The database was created with records of absenteeism at work from July 2007 to July 2010 at a courier company in Brazil.

Number of Instances: 740

Number of Attributes: 21

Missing Values: 0

Dataset: [https://archive.ics.uci.edu/dataset/445/absenteeism+at+work](https://archive.ics.uci.edu/dataset/445/absenteeism+at+work)

## Features in the Dataset (Columns)
1. Individual identification (ID)
2. Reason for absence (ICD), explained below
3. Month of absence
4. Day of the week (Monday (2), Tuesday (3), Wednesday (4), Thursday (5), Friday (6))
5. Seasons
6. Transportation expense
7. Distance from Residence to Work (kilometers)
8. Service time
9. Age
10. Work load Average/day 
11. Hit target
12. Disciplinary failure (yes=1; no=0)
13. Education (high school (1), graduate (2), postgraduate (3), master and doctor (4))
14. Son (number of children)
15. Social drinker (yes=1; no=0)
16. Social smoker (yes=1; no=0)
17. Pet (number of pets)
18. Weight
19. Height
20. Body mass index
21. Absenteeism time in hours (target)
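
Since the last column is marked as the target, a natural first step is to separate it from the other columns. The snippet below is a minimal sketch of that split, not the notebook's own code: the `sample` frame holds only three columns and three rows copied from the data preview shown later, and the names `TARGET`, `X` and `y` are assumptions made for illustration.

```python
import pandas as pd

# Tiny illustrative sample (three rows, three columns) copied from the data preview;
# the real frame has 21 columns.
sample = pd.DataFrame({
    "ID": [11, 36, 3],
    "Reason for absence": [26, 0, 23],
    "Absenteeism time in hours": [4, 0, 2],
})

TARGET = "Absenteeism time in hours"  # assumed name for the target column constant

X = sample.drop(columns=[TARGET])  # every column except the target
y = sample[TARGET]                 # the target column on its own

print(list(X.columns))
print(y.tolist())
```

The same two lines should work unchanged on the full dataframe once it is loaded, provided the column name matches exactly.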


##### Reason for absence
ICD-10 codes: [https://icd.who.int/browse10/2016/en](https://icd.who.int/browse10/2016/en)

Absences attested by the International Classification of Diseases (ICD) are stratified into 21 categories, as follows:

1. Certain infectious and parasitic diseases  
2. Neoplasms  
3. Diseases of the blood and blood-forming organs and certain disorders involving the immune mechanism  
4. Endocrine, nutritional and metabolic diseases  
5. Mental and behavioural disorders  
6. Diseases of the nervous system  
7. Diseases of the eye and adnexa  
8. Diseases of the ear and mastoid process  
9. Diseases of the circulatory system  
10. Diseases of the respiratory system  
11. Diseases of the digestive system  
12. Diseases of the skin and subcutaneous tissue  
13. Diseases of the musculoskeletal system and connective tissue  
14. Diseases of the genitourinary system  
15. Pregnancy, childbirth and the puerperium  
16. Certain conditions originating in the perinatal period  
17. Congenital malformations, deformations and chromosomal abnormalities  
18. Symptoms, signs and abnormal clinical and laboratory findings, not elsewhere classified  
19. Injury, poisoning and certain other consequences of external causes  
20. External causes of morbidity and mortality  
21. Factors influencing health status and contact with health services.

And 7 categories without an ICD code:

22. Patient follow-up
23. Medical consultation
24. Blood donation
25. Laboratory examination
26. Unjustified absence
27. Physiotherapy
28. Dental consultation
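
The split between ICD categories (1 to 21) and non-ICD categories (22 to 28) can be expressed as a small lookup. This is only a sketch: `NON_ICD_REASONS` and `reason_group` are hypothetical names introduced here, and the example codes are the `Reason for absence` values from the first five rows of the data. Code 0 does appear in those rows but is not in the list above, so the sketch labels it "unlisted" rather than guessing its meaning.

```python
import pandas as pd

# Categories 22-28 from the list above (no ICD code attached)
NON_ICD_REASONS = {
    22: "Patient follow-up",
    23: "Medical consultation",
    24: "Blood donation",
    25: "Laboratory examination",
    26: "Unjustified absence",
    27: "Physiotherapy",
    28: "Dental consultation",
}


def reason_group(code: int) -> str:
    """Hypothetical helper: classify a reason code as ICD, non-ICD, or unlisted."""
    if 1 <= code <= 21:
        return "ICD"
    if code in NON_ICD_REASONS:
        return "non-ICD"
    return "unlisted"  # e.g. code 0, which is not described in the list above


# 'Reason for absence' values from the first five rows of the data
codes = pd.Series([26, 0, 23, 7, 23], name="Reason for absence")
groups = codes.map(reason_group)
print(groups.tolist())
# ['non-ICD', 'unlisted', 'non-ICD', 'ICD', 'non-ICD']
```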



In [1]:
# importing the libraries 
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings("ignore")

In [2]:
# importing a dataset
df = pd.read_csv("Absenteeism_at_work.csv", delimiter=";")
# displaying the first 5 rows of the dataframe
df.head()

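| | ID | Reason for absence | Month of absence | Day of the week | Seasons | Transportation expense | Distance from Residence to Work | Service time | Age | Work load Average/day | ... | Disciplinary failure | Education | Son | Social drinker | Social smoker | Pet | Weight | Height | Body mass index | Absenteeism time in hours |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 11 | 26 | 7 | 3 | 1 | 289 | 36 | 13 | 33 | 239.554 | ... | 0 | 1 | 2 | 1 | 0 | 1 | 90 | 172 | 30 | 4 |
| 1 | 36 | 0 | 7 | 3 | 1 | 118 | 13 | 18 | 50 | 239.554 | ... | 1 | 1 | 1 | 1 | 0 | 0 | 98 | 178 | 31 | 0 |
| 2 | 3 | 23 | 7 | 4 | 1 | 179 | 51 | 18 | 38 | 239.554 | ... | 0 | 1 | 0 | 1 | 0 | 0 | 89 | 170 | 31 | 2 |
| 3 | 7 | 7 | 7 | 5 | 1 | 279 | 5 | 14 | 39 | 239.554 | ... | 0 | 1 | 2 | 1 | 1 | 0 | 68 | 168 | 24 | 4 |
| 4 | 11 | 23 | 7 | 5 | 1 | 289 | 36 | 13 | 33 | 239.554 | ... | 0 | 1 | 2 | 1 | 0 | 1 | 90 | 172 | 30 | 2 |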


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 740 entries, 0 to 739
Data columns (total 21 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   ID                               740 non-null    int64  
 1   Reason for absence               740 non-null    int64  
 2   Month of absence                 740 non-null    int64  
 3   Day of the week                  740 non-null    int64  
 4   Seasons                          740 non-null    int64  
 5   Transportation expense           740 non-null    int64  
 6   Distance from Residence to Work  740 non-null    int64  
 7   Service time                     740 non-null    int64  
 8   Age                              740 non-null    int64  
 9   Work load Average/day            740 non-null    float64
 10  Hit target                       740 non-null    int64  
 11  Disciplinary failure             740 non-null    int64  
 12  Education             
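
The instance count, attribute count and missing-value count quoted in the dataset description can also be checked programmatically instead of being read off the `info()` output. The helper below is a hedged sketch: `summarize` is a hypothetical name, and the three-row `sample` frame (values copied from the preview above) only stands in for the full `df`.

```python
import pandas as pd


def summarize(frame: pd.DataFrame) -> dict:
    """Hypothetical helper: report row count, column count and total missing cells."""
    return {
        "rows": frame.shape[0],
        "columns": frame.shape[1],
        "missing": int(frame.isna().sum().sum()),
    }


# Stand-in sample with three columns taken from the first three rows of the data
sample = pd.DataFrame({
    "ID": [11, 36, 3],
    "Age": [33, 50, 38],
    "Weight": [90, 98, 89],
})

summary = summarize(sample)
print(summary)
```

Calling `summarize(df)` on the full frame would be expected to report 740 rows, 21 columns and 0 missing cells if the file matches the description given earlier.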

Defining encoding dictionaries

In [4]:
# lookup tables that translate the integer codes into readable labels
months = {
    1: 'January', 2: 'February', 3: 'March', 4: 'April', 5: 'May', 6: 'June',
    7: 'July', 8: 'August', 9: 'September', 10: 'October', 11: 'November', 12: 'December',
    0: 'Unknown',
}

days = {2: 'Monday', 3: 'Tuesday', 4: 'Wednesday', 5: 'Thursday', 6: 'Friday'}
seasons = {1: 'Winter', 2: 'Spring', 3: 'Summer', 4: 'Fall'}
education = {1: 'High School', 2: 'Graduate', 3: 'Postgraduate', 4: 'Master and Doctor'}
yes_no = {0: 'No', 1: 'Yes'}

Preprocessing

In [5]:
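# work on a copy so the integer-coded df stays untouched
df2 = df.copy()
# replace the integer codes with readable labels
df2['Month of absence'] = df2['Month of absence'].map(months)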
df2['Day of the week'] = df2['Day of the week'].map(days)
df2['Seasons'] = df2['Seasons'].map(seasons)
df2['Disciplinary failure'] = df2['Disciplinary failure'].map(yes_no)
df2['Education'] = df2['Education'].map(education)
df2['Social drinker'] = df2['Social drinker'].map(yes_no)
df2['Social smoker'] = df2['Social smoker'].map(yes_no)
df2.head()

| | ID | Reason for absence | Month of absence | Day of the week | Seasons | Transportation expense | Distance from Residence to Work | Service time | Age | Work load Average/day | ... | Disciplinary failure | Education | Son | Social drinker | Social smoker | Pet | Weight | Height | Body mass index | Absenteeism time in hours |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 11 | 26 | July | Tuesday | Winter | 289 | 36 | 13 | 33 | 239.554 | ... | No | High School | 2 | Yes | No | 1 | 90 | 172 | 30 | 4 |
| 1 | 36 | 0 | July | Tuesday | Winter | 118 | 13 | 18 | 50 | 239.554 | ... | Yes | High School | 1 | Yes | No | 0 | 98 | 178 | 31 | 0 |
| 2 | 3 | 23 | July | Wednesday | Winter | 179 | 51 | 18 | 38 | 239.554 | ... | No | High School | 0 | Yes | No | 0 | 89 | 170 | 31 | 2 |
| 3 | 7 | 7 | July | Thursday | Winter | 279 | 5 | 14 | 39 | 239.554 | ... | No | High School | 2 | Yes | Yes | 0 | 68 | 168 | 24 | 4 |
| 4 | 11 | 23 | July | Thursday | Winter | 289 | 36 | 13 | 33 | 239.554 | ... | No | High School | 2 | Yes | No | 1 | 90 | 172 | 30 | 2 |
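
One thing to watch with dictionary-based `map`: any value that is not a key in the dictionary becomes `NaN`, so a code missing from an encoding dictionary would silently introduce missing values. The sketch below shows one way to list such codes. The `days` dictionary is the one defined above; the value 7 in `raw` is a made-up code added purely to trigger the case, and the variable names are assumptions.

```python
import pandas as pd

days = {2: 'Monday', 3: 'Tuesday', 4: 'Wednesday', 5: 'Thursday', 6: 'Friday'}

# First five values come from the data preview; the trailing 7 is a made-up
# code that is NOT in `days`, included only to demonstrate the check.
raw = pd.Series([3, 3, 4, 5, 5, 7], name="Day of the week")

mapped = raw.map(days)                        # codes missing from `days` become NaN
unmapped = raw[mapped.isna()].unique().tolist()

print(unmapped)
# [7]
```

Running the same check on each mapped column of `df` before overwriting it in `df2` would confirm whether every code in the data is covered by its dictionary.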
