# Creating a Logistic Regression to predict Absenteeism

Using Logistic Regression to predict absenteeism.

### Import the relevant libraries

In [1]:
import pandas as pd
import numpy as np

### Load the Preprocessed data

In [2]:
data_preprocessed = pd.read_csv('Absenteeism_preprocessed.csv')
data_preprocessed.head()

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Month Value,Day of the Week,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours
0,0,0,0,1,7,1,289,36,33,239.554,30,0,2,1,4
1,0,0,0,0,7,1,118,13,50,239.554,31,0,1,0,0
2,0,0,0,1,7,2,179,51,38,239.554,31,0,0,0,2
3,1,0,0,0,7,3,279,5,39,239.554,24,0,2,0,4
4,0,0,0,1,7,3,289,36,33,239.554,30,0,2,1,2


In [3]:
# We expect that about half of these predictors won't have merit.
# Intuition: Reason for Absence, Distance to Work, Daily Work Load Average, Children and Pets will have the highest impact.
# Since we'll be performing regression analysis,
# the model itself will give us a fair indication of which variables are important for the analysis.

### Create the Targets (Supervised ML)

Logistic regression is a type of classification. Therefore our model will be classifying employees into classes:

- Excessively absent (above median) --> Target: 1
- Moderately absent (below median) --> Target: 0

We will take the median value of `Absenteeism Time in Hours` and use it as a cut-off line. While this method might be a bit naive, it is numerically stable and rigid.

In [4]:
# Find the median for 'Absenteeism Time in Hours'
data_preprocessed['Absenteeism Time in Hours'].median()

3.0

In [5]:
# Since we're working with supervised learning, we create a variable 'targets'
# 'targets' tells us whether an employee has been absent from work for more than 3 hours (the median)
# NumPy's where() works like IF in Excel: np.where(condition, value_if_true, value_if_false)
targets = np.where(data_preprocessed['Absenteeism Time in Hours'] >
                   data_preprocessed['Absenteeism Time in Hours'].median(), 1, 0)
targets #array

array([1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0,
       1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1,
       0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1,
       0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1,
       0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0,
       1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1,
       0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1,
       1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1,
       0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0,
       0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0,
       0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0,

In [6]:
# Add the targets to the DataFrame data_preprocessed
data_preprocessed['Excessive Absenteeism'] = targets
data_preprocessed.head()

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Month Value,Day of the Week,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours,Excessive Absenteeism
0,0,0,0,1,7,1,289,36,33,239.554,30,0,2,1,4,1
1,0,0,0,0,7,1,118,13,50,239.554,31,0,1,0,0,0
2,0,0,0,1,7,2,179,51,38,239.554,31,0,0,0,2,0
3,1,0,0,0,7,3,279,5,39,239.554,24,0,2,0,4,1
4,0,0,0,1,7,3,289,36,33,239.554,30,0,2,1,2,0


### A comment on targets & Balancing the dataset

By using the median, we implicitly balance the dataset, as about half of our targets will be 1s and the other half 0s. This prevents our model from learning to output only 0s or only 1s.

In [7]:
# Around 46% of the targets are 1s
# A balance of 45-55 is almost always sufficient
targets.sum() / targets.shape[0]

0.45571428571428574
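
The share of 1s most likely falls below 50% because the comparison is strict (`>`): every observation exactly equal to the median lands in class 0. A minimal sketch on made-up hours (the values are illustrative, not taken from the dataset) shows the effect:

```python
import numpy as np
import pandas as pd

# Hypothetical absence hours; several observations sit exactly on the median
hours = pd.Series([0, 1, 2, 3, 3, 3, 4, 8, 8, 16])
cutoff = hours.median()

# Strict '>' sends every value equal to the median to class 0
toy_targets = np.where(hours > cutoff, 1, 0)
share_of_ones = toy_targets.mean()

print(cutoff, share_of_ones)
```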

### Checkpoint 👀

In [8]:
# Drop 'Absenteeism Time in Hours' (now encoded in the targets) and three weak features, and create a checkpoint
data_with_targets = data_preprocessed.drop(['Absenteeism Time in Hours','Daily Work Load Average', 
                                            'Day of the Week', 'Distance to Work'],axis=1)

In [9]:
# Checking that we have successfully created a checkpoint (a new object, not a reference to the original)
data_with_targets is data_preprocessed

False

In [10]:
data_with_targets.head(10)

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Month Value,Transportation Expense,Age,Body Mass Index,Education,Children,Pets,Excessive Absenteeism
0,0,0,0,1,7,289,33,30,0,2,1,1
1,0,0,0,0,7,118,50,31,0,1,0,0
2,0,0,0,1,7,179,38,31,0,0,0,0
3,1,0,0,0,7,279,39,24,0,2,0,1
4,0,0,0,1,7,289,33,30,0,2,1,0
5,0,0,0,1,7,179,38,31,0,0,0,0
6,0,0,0,1,7,361,28,27,0,1,4,1
7,0,0,0,1,7,260,36,23,0,4,0,1
8,0,0,1,0,7,155,34,25,0,2,0,1
9,0,0,0,1,7,235,37,29,1,1,1,1


### Selecting the inputs for the regression

In [11]:
data_with_targets.shape

(700, 12)

In [12]:
# iloc Pandas method selects (Slices) data by position when given rows and columns
# For our inputs, we need all rows from our DataFrame and,
# all columns except for 'Excessive Absenteeism' (targets)
unscaled_inputs = data_with_targets.iloc[:,:-1]

# For better DataFrame display options
pd.options.display.max_columns = None
pd.options.display.max_rows = None

unscaled_inputs

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Month Value,Transportation Expense,Age,Body Mass Index,Education,Children,Pets
0,0,0,0,1,7,289,33,30,0,2,1
1,0,0,0,0,7,118,50,31,0,1,0
2,0,0,0,1,7,179,38,31,0,0,0
3,1,0,0,0,7,279,39,24,0,2,0
4,0,0,0,1,7,289,33,30,0,2,1
5,0,0,0,1,7,179,38,31,0,0,0
6,0,0,0,1,7,361,28,27,0,1,4
7,0,0,0,1,7,260,36,23,0,4,0
8,0,0,1,0,7,155,34,25,0,2,0
9,0,0,0,1,7,235,37,29,1,1,1


### Standardize the data

In [13]:
# Import the relevant module
from sklearn.preprocessing import StandardScaler
from sklearn.base import BaseEstimator, TransformerMixin

In [14]:
# Start by creating an empty StandardScaler object
#absenteeism_scaler = StandardScaler()

# We will use a CustomScaler instead, so that the dummy variables are left unscaled.

class CustomScaler(BaseEstimator,TransformerMixin):
    
    def __init__(self,columns,copy=True,with_mean=True,with_std=True):
        # Recent scikit-learn versions accept these arguments by keyword only
        self.scaler = StandardScaler(copy=copy,with_mean=with_mean,with_std=with_std)
        self.columns = columns
        self.mean_ = None
        self.var_ = None
    
    def fit(self, X, y=None):
        self.scaler.fit(X[self.columns], y)
        # Column-wise mean and (population) variance of the columns to scale
        self.mean_ = X[self.columns].mean()
        self.var_ = X[self.columns].var(ddof=0)
        return self
    
    def transform(self, X, y=None, copy=None):
        init_col_order = X.columns
        # Keep X's index so that concat aligns rows even when the index is not 0..n-1
        X_scaled = pd.DataFrame(self.scaler.transform(X[self.columns]), columns=self.columns, index=X.index)
        X_not_scaled = X.loc[:,~X.columns.isin(self.columns)]
        return pd.concat([X_not_scaled, X_scaled], axis=1)[init_col_order]
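
As a point of comparison, scikit-learn's `ColumnTransformer` can also scale a subset of columns while passing the rest through. The sketch below is an alternative, not the approach used in this notebook; the toy DataFrame and the names `toy`, `numeric_cols` and `ct` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Toy frame: one dummy column and two numeric columns (illustrative values)
toy = pd.DataFrame({
    'Reason_1': [0, 1, 0, 1],
    'Age': [33, 50, 38, 39],
    'Pets': [1, 0, 0, 4],
})
numeric_cols = ['Age', 'Pets']

# Scale only the numeric columns; pass everything else through untouched
ct = ColumnTransformer(
    transformers=[('scale', StandardScaler(), numeric_cols)],
    remainder='passthrough',
)
toy_scaled = ct.fit_transform(toy)

# The result is a NumPy array with the scaled columns first,
# followed by the passthrough columns (here: Age, Pets, Reason_1)
print(toy_scaled)
```

Unlike `CustomScaler`, this does not preserve the original column order, which matters later when coefficients are matched to feature names.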

In [15]:
# First, we must decide which of the original features to scale
unscaled_inputs.columns.values

array(['Reason_1', 'Reason_2', 'Reason_3', 'Reason_4', 'Month Value',
       'Transportation Expense', 'Age', 'Body Mass Index', 'Education',
       'Children', 'Pets'], dtype=object)

In [16]:
# Then, create a new variable to store the features we want to scale
# Obviously, we'll omit the dummy variables from this list.

#columns_to_scale = ['Month Value','Day of the Week', 'Transportation Expense', 'Distance to Work',
       #'Age', 'Daily Work Load Average', 'Body Mass Index', 'Children', 'Pets']

# Seems more intuitive to discard the columns we don't want to be scaled
# In this case, these are all the dummies
columns_to_omit = ['Reason_1', 'Reason_2', 'Reason_3', 'Reason_4', 'Education']

In [17]:
# A list comprehension is a syntactic construct that lets us create a list from existing lists based on loops, conditions, etc.
columns_to_scale = [x 
                    for x in unscaled_inputs.columns.values 
                    if x not in columns_to_omit]

In [18]:
absenteeism_scaler = CustomScaler(columns_to_scale)

In [19]:
# Next, fit the input data
# This line of code will calculate and store the mean and standard deviation
absenteeism_scaler.fit(unscaled_inputs)

CustomScaler(columns=['Month Value', 'Transportation Expense', 'Age',
                      'Body Mass Index', 'Children', 'Pets'],
             copy=None, with_mean=None, with_std=None)

In [20]:
# So far, we have just prepared the scaling mechanism
# To apply it, we use the method .transform()

scaled_inputs = absenteeism_scaler.transform(unscaled_inputs)
scaled_inputs
# All our dummies have remained untouched

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Month Value,Transportation Expense,Age,Body Mass Index,Education,Children,Pets
0,0,0,0,1,0.182726,1.005844,-0.536062,0.767431,0,0.880469,0.268487
1,0,0,0,0,0.182726,-1.574681,2.130803,1.002633,0,-0.01928,-0.58969
2,0,0,0,1,0.182726,-0.654143,0.24831,1.002633,0,-0.91903,-0.58969
3,1,0,0,0,0.182726,0.854936,0.405184,-0.643782,0,0.880469,-0.58969
4,0,0,0,1,0.182726,1.005844,-0.536062,0.767431,0,0.880469,0.268487
5,0,0,0,1,0.182726,-0.654143,0.24831,1.002633,0,-0.91903,-0.58969
6,0,0,0,1,0.182726,2.092381,-1.320435,0.061825,0,-0.01928,2.843016
7,0,0,0,1,0.182726,0.568211,-0.065439,-0.878984,0,2.679969,-0.58969
8,0,0,1,0,0.182726,-1.016322,-0.379188,-0.40858,0,0.880469,-0.58969
9,0,0,0,1,0.182726,0.190942,0.091435,0.532229,1,-0.01928,0.268487


In [21]:
scaled_inputs.shape

(700, 11)

### Splitting the data for Training and Testing

To guard against overfitting, we train the model on most of our data, but not all of it. We hold out a smaller portion of the data to test the model's accuracy on observations it hasn't seen before.

In [22]:
#import the relevant module
from sklearn.model_selection import train_test_split

In [23]:
# Split the data
# We will obtain 4 arrays:
    # array 1: training dataset with inputs
    # array 2: test dataset with inputs
    # array 3: training dataset with targets
    # array 4: test dataset with targets
    
x_train, x_test, y_train, y_test = train_test_split(scaled_inputs, targets, train_size=0.8, shuffle=True, random_state=20)
# Usually we opt for a 90-10 or 80-20 split because we want to train on more data
# shuffle=True shuffles the observations before splitting them into train and test subsets
# Without a fixed seed, every run would produce a different split, which makes results hard to reproduce
# random_state makes the shuffle pseudorandom (the observations are always shuffled in the same 'random' way)
# The method has split the scaled inputs and targets into matching subsets that we can use in the ML process

In [24]:
print(x_train.shape, y_train.shape)

(560, 11) (560,)


In [25]:
print(x_test.shape, y_test.shape)

(140, 11) (140,)
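
The effect of `random_state` can be checked on a small example. This is a minimal sketch with made-up arrays (`X`, `y` and the `a_`/`b_` names are illustrative): calling `train_test_split` twice with the same seed should yield the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # illustrative inputs
y = np.array([0, 1] * 5)           # illustrative targets

# Same random_state -> identical split on every call
a_train, a_test, _, _ = train_test_split(X, y, train_size=0.8, random_state=20)
b_train, b_test, _, _ = train_test_split(X, y, train_size=0.8, random_state=20)
same_split = np.array_equal(a_test, b_test)

print(a_train.shape, a_test.shape, same_split)
```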


## Logistic regression with sklearn



*NOTE:*
*statsmodels was unable to provide an answer for this model. Whenever we train an ML model, many numerical issues can arise in the background, and libraries such as statsmodels are not always numerically stable for more complicated models.*

In [26]:
# import the relevant modules
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

### Training the model

In [27]:
reg = LogisticRegression()

In [28]:
reg.fit(x_train, y_train) # this method does basically all the machine learning

# As output, we get all the default parameters of the logistic regression model we have just specified



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [29]:
reg.score(x_train, y_train)
# We conclude that our model has an accuracy of about 77%
# The model score has fallen by a bit less than 1% compared to when we standardized ALL features
# This isn't unusual as we've modified 5 input features
# We've lost a practically insignificant amount of accuracy, but we've gained much more interpretability

0.775

### Manually checking the accuracy

1. It is always good to have a full understanding of what we're doing.
2. We will be using this idea later in this exercise.

In [30]:
# Accuracy means that x% of the model outputs match the targets
# Thus, we will proceed to finding the outputs and comparing them to the targets
model_outputs = reg.predict(x_train)
model_outputs

array([0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0,
       0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1,
       1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0,
       0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0,
       0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0,
       1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1,
       1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1,
       0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1,
       0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0,
       0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0,

In [31]:
# Let's compare the model's outputs to the targets
model_outputs == y_train

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True, False,  True, False, False,  True,  True,  True,  True,
       False,  True, False,  True, False, False,  True,  True,  True,
       False,  True,  True,  True,  True,  True,  True,  True,  True,
       False, False, False, False,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True, False,  True,  True,  True,
        True,  True,  True,  True,  True, False,  True,  True,  True,
        True,  True,  True,  True,  True,  True, False,  True,  True,
        True,  True,  True, False,  True,  True,  True,  True,  True,
       False,  True, False,  True,  True, False, False, False,  True,
        True,  True,  True,  True,  True,  True,  True, False,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True, False,  True,  True,  True,  True,
       False,  True,  True,  True,  True, False,  True,  True,  True,
        True,  True,

In [32]:
# if an element hasn't been guessed correctly, the result we obtain is 'False'
# But how many correct predictions do we have?
# Knowing that True=1 and False=0 in boolean:
np.sum(model_outputs==y_train)

434

In [33]:
# Accuracy = Correct predictions / # observations
np.sum(model_outputs==y_train) / model_outputs.shape[0]

# We've obtained the same outcome as we did with the method .score()

0.775

### Finding the intercept (or bias) and coefficients (or weights)

Python - SQL - Tableau integration

Our ultimate goal is to create a function that can easily and reliably predict values from within Tableau (user-friendly software).

Regression analysis is about determining the coefficients (or weights) that we apply to the inputs to obtain a final result.

In [34]:
# Finding the intercept
reg.intercept_

array([-1.46547112])

In [35]:
# Finding the coefficients
reg.coef_

array([[ 2.62749942,  0.86338637,  2.96050661,  0.66390745,  0.15493732,
         0.59979822, -0.17245127,  0.27568526, -0.23452541,  0.34249662,
        -0.2775137 ]])
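
To make the role of the intercept and the coefficients concrete, the sketch below rebuilds the class-1 probabilities by hand and compares them with `predict_proba`. It uses synthetic stand-in data from `make_classification` rather than the absenteeism dataset, and the names `model`, `log_odds` and `manual_proba` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (not the absenteeism dataset)
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
model = LogisticRegression().fit(X, y)

# Linear combination of inputs and weights, plus the intercept (the log-odds)
log_odds = X @ model.coef_.ravel() + model.intercept_[0]

# The sigmoid maps log-odds to the probability of class 1
manual_proba = 1 / (1 + np.exp(-log_odds))

matches = np.allclose(manual_proba, model.predict_proba(X)[:, 1])
print(matches)
```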

In [36]:
# To know which variable each coefficient refers to,
# we can match the coefficients to the column names of our inputs
# However, keep in mind that whenever we employ sklearn, the results are (usually) arrays, not DataFrames (i.e. no column names)
# We refer to the original pandas DataFrame with the unscaled inputs (since we only want the column names)

feature_name = unscaled_inputs.columns.values
feature_name

array(['Reason_1', 'Reason_2', 'Reason_3', 'Reason_4', 'Month Value',
       'Transportation Expense', 'Age', 'Body Mass Index', 'Education',
       'Children', 'Pets'], dtype=object)

In [37]:
# Create a neat DataFrame that will contain the intercept, the feature names and the corresponding coefficients
summary_table = pd.DataFrame(columns=['Feature Name'], data = feature_name)
summary_table['Coefficient'] = np.transpose(reg.coef_) # reg.coef_ is a row (shape 1 x n), so we transpose it into a column
summary_table

Unnamed: 0,Feature Name,Coefficient
0,Reason_1,2.627499
1,Reason_2,0.863386
2,Reason_3,2.960507
3,Reason_4,0.663907
4,Month Value,0.154937
5,Transportation Expense,0.599798
6,Age,-0.172451
7,Body Mass Index,0.275685
8,Education,-0.234525
9,Children,0.342497


In [38]:
# Adding the intercept right at the top of our new DataFrame:

summary_table.index = summary_table.index + 1 # shift all indices up by 1
summary_table.loc[0] = ['Intercept', reg.intercept_[0]] # fill in the element at index 0
summary_table = summary_table.sort_index()
summary_table

Unnamed: 0,Feature Name,Coefficient
0,Intercept,-1.465471
1,Reason_1,2.627499
2,Reason_2,0.863386
3,Reason_3,2.960507
4,Reason_4,0.663907
5,Month Value,0.154937
6,Transportation Expense,0.599798
7,Age,-0.172451
8,Body Mass Index,0.275685
9,Education,-0.234525


### Interpreting the coefficients

**Standardized coefficients** are the coefficients of a regression where all variables have been standardized.

- Weights: the further away from zero (no matter whether positive or negative), the bigger the weight of this feature. This intuition works for models where all variables are of the SAME SCALE.


Since we're using a logistic regression, all the coefficients that we found, by definition, refer to the **log(odds)**. To make them more interpretable, we'll find the exponentials of the coefficients.
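
Written out, the relationship is as follows. The notation is an assumption made here for illustration: $p$ is the probability of excessive absenteeism, $b_0$ the intercept, and $b_i$ the coefficient of input $x_i$.

```latex
\log\frac{p}{1-p} = b_0 + \sum_{i=1}^{n} b_i x_i
\quad\Longrightarrow\quad
\frac{p}{1-p} = e^{b_0}\prod_{i=1}^{n} e^{b_i x_i}
```

Increasing $x_i$ by one unit while holding the other inputs fixed therefore multiplies the odds by $e^{b_i}$, which is the odds ratio of that feature.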

In [39]:
# Create a new column in our DataFrame to hold the odds ratios (the exponentials of our coefficients)
summary_table['Odds_ratio'] = np.exp(summary_table.Coefficient)
summary_table

Unnamed: 0,Feature Name,Coefficient,Odds_ratio
0,Intercept,-1.465471,0.230969
1,Reason_1,2.627499,13.839121
2,Reason_2,0.863386,2.371177
3,Reason_3,2.960507,19.307751
4,Reason_4,0.663907,1.942367
5,Month Value,0.154937,1.167585
6,Transportation Expense,0.599798,1.821751
7,Age,-0.172451,0.841599
8,Body Mass Index,0.275685,1.317433
9,Education,-0.234525,0.790946


In [40]:
# By default, sort_values() sorts in ascending order,
# which would put the most important features (biggest weight/odds ratio) at the bottom
summary_table.sort_values('Odds_ratio', ascending=False)
# With ascending=False, the coefficients are sorted according to the logic of the problem at hand

Unnamed: 0,Feature Name,Coefficient,Odds_ratio
3,Reason_3,2.960507,19.307751
1,Reason_1,2.627499,13.839121
2,Reason_2,0.863386,2.371177
4,Reason_4,0.663907,1.942367
6,Transportation Expense,0.599798,1.821751
10,Children,0.342497,1.40846
8,Body Mass Index,0.275685,1.317433
5,Month Value,0.154937,1.167585
7,Age,-0.172451,0.841599
9,Education,-0.234525,0.790946


#### INTERPRETATION:

A feature is NOT particularly important:

- if its coefficient is around 0 (a weight of 0 implies that no matter the feature value, we will multiply it by 0 in the model)
- if its odds ratio is around 1 (for a unit change in the standardized feature, the odds are multiplied by the odds ratio, and multiplying by 1 means no change)

Given all the features, we can conclude that
`Daily Work Load Average`, `Distance to Work` and `Day of the Week`
seem to be the ones that make no difference (surprisingly, our intuition wasn't totally correct). **We will consider dropping these features later on.**

**The four Reasons for Absence are the most important predictors.** Remember that when creating the dummies we dropped `Reason_0`, which represented a situation where a person was absent but no particular reason was given. Therefore the base model is when there is no reason.

From the coefficients, it seems that whenever a person has stated a reason, we have a much higher chance of "excessive absence". **How much higher?**

#### PROBLEM: We've standardized all variables, including the dummies.

This is bad practice, because when we standardize a dummy we lose its whole interpretability (unit changes).

The predictive power of the model is still valid and it is a good classifier, but **we don't know how the different reasons compare.** This is a problem, since those are the most important features.

> Code correction: `CustomScaler`: It will not standardize all inputs but only the ones we choose.

As a result, the model score has fallen by a bit less than 1% compared to when we standardized ALL features. This isn't unusual, as we've modified 5 input features. We've lost a practically insignificant amount of accuracy, but we've gained much more interpretability.

#### INTERPRETATION CONTINUED:

Now that we have kept the dummies unscaled, we're able to interpret the coefficients for Reasons for Absence much better.

Considering `Reason_0` = No reason = baseline model (when no reason is given),

- `Reason_3` = Poisoning: the odds of being excessively absent are almost 20 times higher after a poisoning than when no reason was reported.
- `Reason_1` = Various diseases: "the normal absenteeism case". An individual gets sick and doesn't go to work. The odds of excessive absence are almost 14 times higher than for a person who didn't specify a reason.
- `Reason_2` = Pregnancy and giving birth: a prominent cause of absenteeism, but far less pronounced than reasons 1 and 3. There are appointments, checks and perhaps some emergencies from time to time, but the odds of excessive absence are only around 2 times those of the base model.
- `Reason_4` = Light diseases: similar interpretation as above.
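
An odds ratio multiplies the odds, not the probability. The sketch below turns the intercept and the `Reason_3` coefficient reported earlier into probabilities for one assumed profile: `Education` equal to 0 and every standardized feature at its mean (0). The helper `sigmoid` and the variable names are illustrative.

```python
import numpy as np

def sigmoid(z):
    """Map log-odds to a probability."""
    return 1 / (1 + np.exp(-z))

# Values copied from the reg.intercept_ and reg.coef_ outputs
intercept = -1.46547112
coef_reason_3 = 2.96050661

# Assumed profile: Education = 0 and every standardized feature at its mean (0)
p_no_reason = sigmoid(intercept)
p_reason_3 = sigmoid(intercept + coef_reason_3)

# The ratio of the two odds recovers exp(coef_reason_3)
odds_ratio = (p_reason_3 / (1 - p_reason_3)) / (p_no_reason / (1 - p_no_reason))

print(round(p_no_reason, 3), round(p_reason_3, 3), round(odds_ratio, 2))
```

For this profile the probability rises from roughly 0.19 to roughly 0.82: the odds are about 19 times higher, while the probability itself is a bit more than 4 times higher.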

Then, `Transportation Expense` is the most relevant non-dummy input feature for excessive absenteeism in the model.

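`Pets` has a negative coefficient (-0.2775), so its odds ratio is exp(-0.2775) ≈ 0.76. Because `Pets` is standardized, a one-standard-deviation increase in the number of pets lowers the odds of excessive absenteeism by about 24%.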

Finally, the intercept has no specific interpretation on its own. We can say that the intercept (or bias) 'calibrates' the model: without it, the log-odds of every prediction would be off by precisely that value.

### Backwards elimination

Backward elimination simplifies our model by removing all the features that contribute close to nothing to it.

- When we have the p-values, we get rid of all coefficients with p-values > 0.05.
- With sklearn we don't get p-values, but the intuition is that if the weight is small enough, the feature won't make a difference anyway.
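
The weight-based rule can be sketched as a simple filter on the absolute coefficient values. The cut-off of 0.1 is a hypothetical choice made for illustration, not a value used in this notebook, and the comparison of magnitudes is only fair among features on the same scale.

```python
import numpy as np
import pandas as pd

# Coefficients copied from the reg.coef_ output earlier in the notebook
coefs = pd.Series({
    'Reason_1': 2.62749942, 'Reason_2': 0.86338637, 'Reason_3': 2.96050661,
    'Reason_4': 0.66390745, 'Month Value': 0.15493732,
    'Transportation Expense': 0.59979822, 'Age': -0.17245127,
    'Body Mass Index': 0.27568526, 'Education': -0.23452541,
    'Children': 0.34249662, 'Pets': -0.2775137,
})

# Hypothetical cut-off: treat |weight| below 0.1 as "close to zero"
threshold = 0.1
candidates_to_drop = coefs[coefs.abs() < threshold].index.tolist()

print(candidates_to_drop)
```

With these coefficients nothing falls below the cut-off, which is consistent with the three weakest features having been dropped already at the checkpoint.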

If we remove `Daily Work Load Average`, `Distance to Work` and `Day of the Week`, the rest of our model should not really change in terms of coefficient values.

> The three variables we dropped were practically useless. With or without them, we obtain practically the same results (coefficients and accuracy). Either way, a simpler model is preferable.

### Testing the model

Once we test, we're conceptually not allowed to touch the model anymore.

In [41]:
# At this stage, our train accuracy is around 77%
# However, that doesn't mean much, as our model has seen this data many, many times

reg.score(x_test, y_test)

0.75

Based on data that the model has NEVER seen before, the model correctly predicts in 75% of the cases whether the person is going to be excessively absent.
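
Accuracy alone does not show which kind of mistakes the model makes. A confusion matrix from `sklearn.metrics` breaks the test predictions down by true and predicted class. The sketch below uses synthetic stand-in data rather than the absenteeism dataset, and the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the absenteeism dataset)
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.8, random_state=20)

model = LogisticRegression().fit(X_tr, y_tr)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, model.predict(X_te))

# Accuracy is the share of observations on the main diagonal
accuracy_from_cm = np.trace(cm) / cm.sum()
print(cm, accuracy_from_cm)
```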

In [43]:
# Probability estimates for all outputs (classes)
predicted_proba = reg.predict_proba(x_test)
predicted_proba

array([[0.71221976, 0.28778024],
       [0.58760009, 0.41239991],
       [0.44337438, 0.55662562],
       [0.77903962, 0.22096038],
       [0.08458343, 0.91541657],
       [0.33103371, 0.66896629],
       [0.29792496, 0.70207504],
       [0.12956221, 0.87043779],
       [0.78307821, 0.21692179],
       [0.74708659, 0.25291341],
       [0.49514969, 0.50485031],
       [0.22640297, 0.77359703],
       [0.07030984, 0.92969016],
       [0.73504052, 0.26495948],
       [0.30533085, 0.69466915],
       [0.55035881, 0.44964119],
       [0.55027426, 0.44972574],
       [0.53930442, 0.46069558],
       [0.40117774, 0.59882226],
       [0.05320682, 0.94679318],
       [0.69874615, 0.30125385],
       [0.77903962, 0.22096038],
       [0.41634563, 0.58365437],
       [0.41634563, 0.58365437],
       [0.2412915 , 0.7587085 ],
       [0.74317087, 0.25682913],
       [0.51065194, 0.48934806],
       [0.85703303, 0.14296697],
       [0.19934235, 0.80065765],
       [0.77903962, 0.22096038],
       [0.

In [44]:
predicted_proba.shape
# The first column shows the probability the model assigned to the observation being 0
# The second column is the probability the model assigned to the observation being 1
# Thus, summing the numbers horizontally will add up to 1.

(140, 2)

In [46]:
# We're interested in the probability of excessive absenteeism,
# so the probability of getting 1.
# Slice out the values of the second column

predicted_proba[:,1]

array([0.28778024, 0.41239991, 0.55662562, 0.22096038, 0.91541657,
       0.66896629, 0.70207504, 0.87043779, 0.21692179, 0.25291341,
       0.50485031, 0.77359703, 0.92969016, 0.26495948, 0.69466915,
       0.44964119, 0.44972574, 0.46069558, 0.59882226, 0.94679318,
       0.30125385, 0.22096038, 0.58365437, 0.58365437, 0.7587085 ,
       0.25682913, 0.48934806, 0.14296697, 0.80065765, 0.22096038,
       0.37028423, 0.68316787, 0.68825755, 0.52694241, 0.22096038,
       0.53492642, 0.22453007, 0.74389237, 0.40329273, 0.60301627,
       0.21343976, 0.45483346, 0.2403088 , 0.4388431 , 0.82622935,
       0.57857132, 0.69461059, 0.28778024, 0.22209028, 0.2061074 ,
       0.57577123, 0.36438663, 0.66896629, 0.27128561, 0.83334736,
       0.43399232, 0.88600663, 0.23396355, 0.37170685, 0.38209505,
       0.69796139, 0.65909803, 0.29392197, 0.79686146, 0.20956093,
       0.2699923 , 0.10399887, 0.22453007, 0.73944244, 0.30081832,
       0.22453007, 0.32688766, 0.90337554, 0.45745729, 0.59997

In reality, logistic regression models calculate these probabilities in the background.

- If the probability  is < 0.5, it places a 0
- If the probability is >0.5, it places a 1
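
The rule above can be verified by thresholding the class-1 probabilities manually and comparing the result with `predict`. This is a sketch on synthetic stand-in data; the stricter cut-off of 0.7 is a hypothetical value included only to show that the threshold can be moved.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (not the absenteeism dataset)
X, y = make_classification(n_samples=200, n_features=4, random_state=1)
model = LogisticRegression().fit(X, y)

proba_of_one = model.predict_proba(X)[:, 1]

# Default rule: predict 1 when the probability of class 1 exceeds 0.5
default_preds = (proba_of_one > 0.5).astype(int)
same_as_predict = np.array_equal(default_preds, model.predict(X))

# A stricter, hypothetical cut-off flags fewer observations as 1
strict_preds = (proba_of_one > 0.7).astype(int)
print(same_as_predict, default_preds.sum(), strict_preds.sum())
```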

### Save the model

Saving the model means saving the `reg` object.

In [47]:
# pickle is a Python module used to convert a Python object into a byte stream
import pickle

In [48]:
with open('model', 'wb') as file:
    pickle.dump(reg, file)

# The name of the file we create is 'model'
# 'wb' means write bytes;
# conversely, when we unpickle we'll use 'rb' (read bytes)
# When we pickle, we .dump() the information into a file (save);
# when we unpickle, we .load() it

In [49]:
# We must save the absenteeism_scaler too
# The information in the absenteeism_scaler is used to preprocess new data

with open('scaler', 'wb') as file:
    pickle.dump(absenteeism_scaler, file)
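
Loading works the same way in reverse. The sketch below round-trips a tiny stand-in model through a temporary file; the data, the temporary path and the names `model` and `loaded_model` are illustrative and separate from the files saved above.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny stand-in model (illustrative data, not the absenteeism model)
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
model = LogisticRegression().fit(X, y)

# Save to a temporary path, then load it back with 'rb' (read bytes)
path = os.path.join(tempfile.mkdtemp(), 'model')
with open(path, 'wb') as file:
    pickle.dump(model, file)
with open(path, 'rb') as file:
    loaded_model = pickle.load(file)

# The restored object predicts exactly like the original
same_predictions = np.array_equal(model.predict(X), loaded_model.predict(X))
print(same_predictions)
```

When unpickling the scaler, the `CustomScaler` class definition must be importable in the loading script, because pickle stores a reference to the class rather than its code.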