## Create a logistic regression to predict absenteeism

In [378]:
#Import libraries:
import pandas as pd
import numpy as np

## Load data

In [381]:
data_preprocessed = pd.read_csv("Absenteeism_preprocessed.csv")

In [383]:
data_preprocessed.head(5)

Unnamed: 0,Reason 1,Reason 2,Reason 3,Reason 4,Month Value,Day of the Week,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours
0,0,0,0,1,7,1,289,36,33,239.554,30,0,2,1,4
1,0,0,0,0,7,1,118,13,50,239.554,31,0,1,0,0
2,0,0,0,1,7,2,179,51,38,239.554,31,0,0,0,2
3,1,1,0,0,7,3,279,5,39,239.554,24,0,2,0,4
4,0,0,0,1,7,3,289,36,33,239.554,30,0,2,1,2


We take all of these columns and try to predict "Absenteeism Time in Hours".
We expect that about half of these predictors won't have merit.
The model itself will indicate which variables are important and which are not.

## Create the targets

We will first be classifying people into classes: "Moderately absent" and "Excessively absent".

In [388]:
#Classes: 
# Moderately absent / Excessively absent
# We will use a methodology which is a bit naive but "numerically stable": we take the median of absenteeism and use it as the cutoff line
data_preprocessed['Absenteeism Time in Hours'].median()
#The median is 3

3.0

In [390]:
#If an observation has been absent for more than 3h we'll assign a value of 1, otherwise (3h or less) a value of 0
#These are our TARGETS, and our goal is to predict them

#let's create a new variable 'targets':
#we use np.where, it's like Excel's IF: condition, value if true, value if false:

targets = np.where(data_preprocessed['Absenteeism Time in Hours']> data_preprocessed['Absenteeism Time in Hours'].median(), 1,0)
#result is an np array

In [392]:
targets

array([1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0,
       1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1,
       0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1,
       0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1,
       0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0,
       1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1,
       0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1,
       1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1,
       0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0,
       0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0,
       0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0,

In [394]:
data_preprocessed['Excessive Absenteeism'] = targets

In [396]:
data_preprocessed

Unnamed: 0,Reason 1,Reason 2,Reason 3,Reason 4,Month Value,Day of the Week,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours,Excessive Absenteeism
0,0,0,0,1,7,1,289,36,33,239.554,30,0,2,1,4,1
1,0,0,0,0,7,1,118,13,50,239.554,31,0,1,0,0,0
2,0,0,0,1,7,2,179,51,38,239.554,31,0,0,0,2,0
3,1,1,0,0,7,3,279,5,39,239.554,24,0,2,0,4,1
4,0,0,0,1,7,3,289,36,33,239.554,30,0,2,1,2,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
695,1,1,0,0,5,2,179,22,40,237.656,22,1,2,0,8,1
696,1,1,0,0,5,2,225,26,28,237.656,24,0,1,2,3,0
697,1,1,0,0,5,3,330,16,28,237.656,25,1,0,0,8,1
698,0,0,0,1,5,3,235,16,32,237.656,25,1,0,0,2,0


## A comment on targets
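By using the median as the cutoff, we have implicitly balanced the dataset: roughly half 0s and half 1s. This prevents our model from learning to output only 1s or only 0s. The split is not exactly 50/50 because observations equal to the median are assigned 0.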

In [399]:
#portion of "1" targets:
targets.sum()/targets.shape[0]

0.45571428571428574
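
The share of 1s comes out below 50% because the comparison with the median is strict. A minimal sketch with made-up hours (the `hours` list is illustrative, not taken from the dataset) shows how ties at the median push observations into class 0:

```python
from statistics import median

# Illustrative absence hours (made up for this sketch, not from the dataset)
hours = [0, 1, 2, 3, 3, 3, 4, 8, 8, 16]

cutoff = median(hours)  # middle of the sorted values

# Same rule as np.where(hours > median, 1, 0): ties with the median become 0
targets = [1 if h > cutoff else 0 for h in hours]

share_of_ones = sum(targets) / len(targets)
print(cutoff, targets, share_of_ones)
```

Three of the ten values equal the cutoff, so only four observations end up in class 1 even though the cutoff is the median.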

In [433]:
data_with_targets = data_preprocessed.drop(['Absenteeism Time in Hours'], axis=1)

# Initially that was the only drop. After running the first model, we came back and dropped three features
# that turned out to be useless:

data_with_targets = data_with_targets.drop(['Day of the Week','Daily Work Load Average','Distance to Work' ], axis=1)


In [435]:
data_with_targets

Unnamed: 0,Reason 1,Reason 2,Reason 3,Reason 4,Month Value,Transportation Expense,Age,Body Mass Index,Education,Children,Pets,Excessive Absenteeism
0,0,0,0,1,7,289,33,30,0,2,1,1
1,0,0,0,0,7,118,50,31,0,1,0,0
2,0,0,0,1,7,179,38,31,0,0,0,0
3,1,1,0,0,7,279,39,24,0,2,0,1
4,0,0,0,1,7,289,33,30,0,2,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...
695,1,1,0,0,5,179,40,22,1,2,0,1
696,1,1,0,0,5,225,28,24,0,1,2,0
697,1,1,0,0,5,330,28,25,1,0,0,1
698,0,0,0,1,5,235,32,25,1,0,0,0


## Select the inputs for the regression

In [438]:
# We'll use iloc:  DataFrame.iloc[row indices, column indices]
# It selects (slices) data by position, given the rows and columns wanted
data_with_targets.shape

(700, 12)

In [440]:
unscaled_inputs = data_with_targets.iloc[:, 0:-1] # all but the last column

## Standardize the data

In [443]:
from sklearn.preprocessing import StandardScaler

# absenteeism_scaler = StandardScaler() #This is an empty StandardScaler object

# this is the line of code we use at the beginning, but later, due to adding dummies,
# we need to go back and add a "custom scaler" that scales just numeric variables and leaves dummies unchanged


#absenteeism_scaler will be used to subtract the mean and divide by the standard deviation variablewise (featurewise)

### Custom Scaler

In [446]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing  import StandardScaler


# This is a custom scaler, based on StandardScaler from sklearn,
# however, when we declare the scaler object, it has an extra argument "columns" to scale
# the custom scaler won't standardize all inputs but only the ones we choose, 
# so we'll be able to preserve dummies untouched (we could also have standardized prior to creating the dummies)
# it's not different from the standard scaler in the way it works

# PS: I had to mod the code in the video (from a comment in the Q&A section)

class CustomScaler (BaseEstimator, TransformerMixin):
    
#    def __init__(self, columns, copy=True, with_mean=True, with_std=True):
    def __init__(self, columns, copy: bool=True, with_mean: bool=True, with_std: bool=True):
        #self.scaler = StandardScaler(copy, with_mean, with_std)
        self.scaler = StandardScaler(copy=copy, with_mean=with_mean, with_std=with_std)
        self.columns = columns
        self.mean_ = None
        self.var_ = None
        self.copy = copy
        self.with_mean = with_mean
        self.with_std = with_std
    
    def fit(self, X, y=None):
        self.scaler.fit(X[self.columns], y)
        # Column-wise statistics; ddof=0 is the population variance, as used by StandardScaler
        self.mean_ = X[self.columns].mean(axis=0)
        self.var_ = X[self.columns].var(axis=0, ddof=0)
        return self

    def transform(self, X, y=None, copy=None):
        init_col_order = X.columns
        # Keep X's index so the concat below aligns rows even if the index is not 0..n-1
        X_scaled = pd.DataFrame(self.scaler.transform(X[self.columns]), columns=self.columns, index=X.index)
        X_not_scaled = X.loc[:, ~X.columns.isin(self.columns)]
        return pd.concat([X_not_scaled, X_scaled], axis=1)[init_col_order]



In [448]:
#Let's check the unscaled columns:

unscaled_inputs.columns.values

array(['Reason 1', 'Reason 2', 'Reason 3', 'Reason 4', 'Month Value',
       'Transportation Expense', 'Age', 'Body Mass Index', 'Education',
       'Children', 'Pets'], dtype=object)

In [450]:
#First approach was to set "Columns to scale":
#columns_to_scale = ['Month Value','Day of the Week', 'Transportation Expense', 'Distance to Work',
#       'Age', 'Daily Work Load Average', 'Body Mass Index','Children', 'Pet']

#But later we go back to it and, instead, list the "Columns to omit", and then manage it from there:
columns_to_omit = ['Reason 1', 'Reason 2', 'Reason 3', 'Reason 4', 'Education']

# Then use a list comprehension (syntactic construct which allows us to create a list from an existing list
# based on loops, conditionals etc)

columns_to_scale = [x for x in unscaled_inputs.columns.values if x not in columns_to_omit ]

#Technically, written in this way, the list works like a loop, which looks into all column values 
# and takes those which are not part of the variable columns_to_omit.

#after dropping those variables, the result has pretty much the same accuracy. A simpler model is always preferable

In [452]:
# Finally we use the Custom Scaler
absenteeism_scaler = CustomScaler(columns_to_scale)

In [454]:
absenteeism_scaler.fit(unscaled_inputs)
# This line will calculate and store the mean and the standard deviation from unscaled inputs
# Store in the "absenteeism_scaler" object

# Whenever you get new data you will know that the standardization information is contained in "absenteeism_scaler"
# thus you'll be able to standardize the new data the same way.



In [456]:
scaled_inputs = absenteeism_scaler.transform(unscaled_inputs)

In [458]:
scaled_inputs
# we see that the dummies are still 0s and 1s

Unnamed: 0,Reason 1,Reason 2,Reason 3,Reason 4,Month Value,Transportation Expense,Age,Body Mass Index,Education,Children,Pets
0,0,0,0,1,0.182726,1.005844,-0.536062,0.767431,0,0.880469,0.268487
1,0,0,0,0,0.182726,-1.574681,2.130803,1.002633,0,-0.019280,-0.589690
2,0,0,0,1,0.182726,-0.654143,0.248310,1.002633,0,-0.919030,-0.589690
3,1,1,0,0,0.182726,0.854936,0.405184,-0.643782,0,0.880469,-0.589690
4,0,0,0,1,0.182726,1.005844,-0.536062,0.767431,0,0.880469,0.268487
...,...,...,...,...,...,...,...,...,...,...,...
695,1,1,0,0,-0.388293,-0.654143,0.562059,-1.114186,1,0.880469,-0.589690
696,1,1,0,0,-0.388293,0.040034,-1.320435,-0.643782,0,-0.019280,1.126663
697,1,1,0,0,-0.388293,1.624567,-1.320435,-0.408580,1,-0.919030,-0.589690
698,0,0,0,1,-0.388293,0.190942,-0.692937,-0.408580,1,-0.919030,-0.589690


In [460]:
# .fit only prepares the "scaling mechanism"; .transform is what applies it.

# this is a repetition from the previous first version
#scaled_inputs = absenteeism_scaler.transform(unscaled_inputs)

# .transform does the actual scaling, using the information contained in "absenteeism_scaler",
# i.e. it subtracts the mean and divides by the standard deviation

# PS: when we get new data we just:
# new_data_scaled =  absenteeism_scaler.transform( new_data_raw )
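
As an alternative to a hand-written scaler, scikit-learn's `ColumnTransformer` can standardize a chosen subset of columns and pass the rest through unchanged. This is a swapped-in technique, not the approach used in this notebook, and the small DataFrame below is made up for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Tiny illustrative frame (values made up): one dummy and two numeric columns
df = pd.DataFrame({
    "Reason 1": [0, 1, 0, 1],
    "Age": [33, 50, 38, 39],
    "Pets": [1, 0, 0, 2],
})

numeric_cols = ["Age", "Pets"]

# Scale only the numeric columns; pass everything else through unchanged
scaler = ColumnTransformer(
    transformers=[("num", StandardScaler(), numeric_cols)],
    remainder="passthrough",
)
scaler.fit(df)

# Output order: transformed columns first, then the passthrough columns
passthrough_cols = [c for c in df.columns if c not in numeric_cols]
scaled = pd.DataFrame(
    scaler.transform(df),
    columns=numeric_cols + passthrough_cols,
    index=df.index,
)[df.columns]  # restore the original column order

# New data is standardized with the statistics learned during fit
new_data = pd.DataFrame({"Reason 1": [1], "Age": [40], "Pets": [0]})
new_scaled = scaler.transform(new_data)
```

The fitted transformer plays the same role as `absenteeism_scaler`: it stores the training means and standard deviations and reuses them on any new data.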



In [462]:
scaled_inputs.shape

(700, 11)

## Split the data into train / test    + shuffle the data

In [465]:
from sklearn.model_selection import train_test_split

### Split

In [468]:
# This splits the data:
x_train, x_test, y_train, y_test = train_test_split(scaled_inputs,targets, train_size = 0.8, random_state = 20) 
# Default split is 75/25 but we don't want too much data for testing, as it means less for training
# train_test_split has "shuffle = True" by default
# It is good to set random_state, so the observations are always shuffled in the "same random way"

# This creates 4 arrays: a training dataset with inputs and one with targets,
# and a test dataset with inputs and one with targets
print("Train: " + str(x_train.shape) +str( y_train.shape))
print("Test: " + str(x_test.shape) + str(y_test.shape))

Train: (560, 11)(560,)
Test: (140, 11)(140,)
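
To illustrate what `random_state` buys us, here is a small sketch on arbitrary data (the arrays `X` and `y` are made up): two calls with the same seed give exactly the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative data: 10 observations with 2 features (values are arbitrary)
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# Two calls with the same random_state produce the same shuffle and split
split_a = train_test_split(X, y, train_size=0.8, random_state=20)
split_b = train_test_split(X, y, train_size=0.8, random_state=20)

x_train, x_test, y_train, y_test = split_a
print(x_train.shape, x_test.shape)  # 8 training rows, 2 test rows

same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)
```

Without a fixed seed, every run would train and test on different observations, so the reported accuracy would change from run to run.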


# Modeling
"Machine learning is 90% preprocessing and 10% modeling"

## Logistic Regression with sklearn

In [472]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

## Training the model

In [475]:
# Now we must declare a new variable which will be a logistic regression object:
reg = LogisticRegression()

In [477]:
# Next, we must fit the regression:
reg.fit(x_train, y_train)
# This method does basically all the machine learning.

In [479]:
# We evaluate the model accuracy:
reg.score(x_train, y_train)

# sklearn.linear_model.LogisticRegression.score(inputs, targets)
# returns the mean accuracy on the given data and labels (here, the training data)

# accuracy of about 0.77: on the data we used for training, our model classifies about 77% of the observations correctly



0.7696428571428572

## Manually checking accuracy

In order to truly understand the results, I want us to find this accuracy manually.
1) It is always good to have a full understanding of what we are doing.
2) We will be using this idea later on, so we might as well start now.

What does accuracy mean? 

The logistic regression model is trained on the train inputs.

Based on them, it finds outputs, which are trying to be as close to the targets as possible.
An accuracy of 77% means that 77% of the model outputs match the targets.
So, if we want to find the accuracy of a model manually, we should find the outputs and compare them with the targets.

In [483]:
# To find the outputs of the regression we use an sklearn method:
model_outputs = reg.predict(x_train)

In [485]:
model_outputs

array([0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0,
       0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1,
       1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0,
       0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0,
       0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0,
       1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1,
       1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1,
       0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1,
       0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0,
       0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0,

In [487]:
# we can compare with y_train
model_outputs == y_train
# and we obtain a boolean array of True/False values


array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True, False,  True, False, False,  True,  True,  True,  True,
       False,  True, False,  True, False, False,  True,  True,  True,
       False,  True,  True,  True,  True,  True,  True,  True,  True,
       False, False, False, False,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True, False,  True,  True,  True,
        True,  True,  True,  True,  True, False,  True,  True,  True,
        True,  True,  True,  True,  True,  True, False,  True,  True,
        True,  True,  True, False,  True,  True,  True,  True,  True,
       False,  True, False,  True,  True, False, False, False,  True,
        True,  True,  True,  True,  True,  True,  True, False,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True, False,  True,  True,  True,  True,
       False,  True,  True,  True,  True, False,  True,  True,  True,
        True,  True,

In [489]:
correct_targets = np.sum(model_outputs == y_train)
print("accuracy= " + str(correct_targets/ y_train.shape[0]))

accuracy= 0.7696428571428572
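
The same calculation on a tiny made-up example (the two arrays below are illustrative, not model output) shows that the accuracy is simply the share of `True` values in the comparison:

```python
import numpy as np

# Illustrative predictions and targets (made up for this sketch)
outputs = np.array([1, 0, 1, 1, 0, 0, 1, 0])
targets = np.array([1, 0, 0, 1, 0, 1, 1, 0])

matches = outputs == targets          # boolean array, True where they agree
accuracy_from_sum = matches.sum() / targets.shape[0]
accuracy_from_mean = matches.mean()   # mean of booleans = share of True values

print(accuracy_from_sum, accuracy_from_mean)
```

Six of the eight predictions match, so both expressions give 0.75.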


We've got a model and its train accuracy.
However, to be able to use this model outside of Python, a black box is not enough.
We want the nuts and bolts. In this Python, SQL, Tableau integration, the ultimate goal would be to create a function which can easily and reliably predict values from within Tableau.
Since Tableau is a nice-looking, manager-friendly tool, that's where the end users of our analysis will likely take advantage of our model.

So, to use this logistic regression model outside of Python, we must get our hands on the coefficients and the intercept.
Moreover, we need them in order to interpret this logistic model.

## Finding the intercept and coefficients

In [494]:
reg.intercept_

array([-1.7133823])

In [496]:
reg.coef_

array([[ 1.4590134 ,  1.4590134 ,  3.16557351,  0.90051223,  0.16096056,
         0.60943268, -0.16443603,  0.27799558, -0.18963966,  0.36298373,
        -0.27966701]])

In [498]:
# To understand which variables those coefficients refer to:
scaled_inputs.head() 

Unnamed: 0,Reason 1,Reason 2,Reason 3,Reason 4,Month Value,Transportation Expense,Age,Body Mass Index,Education,Children,Pets
0,0,0,0,1,0.182726,1.005844,-0.536062,0.767431,0,0.880469,0.268487
1,0,0,0,0,0.182726,-1.574681,2.130803,1.002633,0,-0.01928,-0.58969
2,0,0,0,1,0.182726,-0.654143,0.24831,1.002633,0,-0.91903,-0.58969
3,1,1,0,0,0.182726,0.854936,0.405184,-0.643782,0,0.880469,-0.58969
4,0,0,0,1,0.182726,1.005844,-0.536062,0.767431,0,0.880469,0.268487


We can represent them in a clearer way:

In [501]:
feature_name = unscaled_inputs.columns.values

In [503]:
summary_table = pd.DataFrame(columns = ['Feature Name'], data =feature_name)

In [505]:
summary_table['Coefficient']= np.transpose(reg.coef_)
#Note that we must transpose this array: reg.coef_ has shape (1, n_features), a single row, and we need a column.

In [506]:
summary_table

Unnamed: 0,Feature Name,Coefficient
0,Reason 1,1.459013
1,Reason 2,1.459013
2,Reason 3,3.165574
3,Reason 4,0.900512
4,Month Value,0.160961
5,Transportation Expense,0.609433
6,Age,-0.164436
7,Body Mass Index,0.277996
8,Education,-0.18964
9,Children,0.362984
10,Pets,-0.279667


In [509]:
# Now we want to add the intercept, but append would add it at the end, we want it at the beginning

# So, we first shift all indices up by 1:
summary_table.index = summary_table.index +1


In [511]:
summary_table.loc[0] = ["Intercept", reg.intercept_[0]] # it adds it to the bottom

#summary_table = summary_table.sort_index()
#different form using the option:
summary_table.sort_index(inplace=True)

summary_table

Unnamed: 0,Feature Name,Coefficient
0,Intercept,-1.713382
1,Reason 1,1.459013
2,Reason 2,1.459013
3,Reason 3,3.165574
4,Reason 4,0.900512
5,Month Value,0.160961
6,Transportation Expense,0.609433
7,Age,-0.164436
8,Body Mass Index,0.277996
9,Education,-0.18964
10,Children,0.362984
11,Pets,-0.279667


In [513]:
reg.intercept_

array([-1.7133823])

The closer a coefficient (weight) is to zero, the smaller the influence of that feature.
Conversely, the further it is from zero, whether positive or negative, the bigger the influence of that feature.

This holds only for models where all variables are on the same scale (standardized variables).

Standardizing allows for a simple and easy-to-understand comparison between the variables.
Standardized = all variables have a mean of zero and a variance of one, i.e. the same scale.
Here the dummies were left as 0s and 1s, so their coefficients are not on exactly the same scale as those of the standardized features.


## Log Odds

Whenever we are dealing with a logistic regression, the coefficients are expressed in terms of the so-called log odds.

This is a consequence of the choice of model.

Logistic regressions are nothing but a linear function predicting log odds.
These log odds are later transformed into probabilities, and then into zeroes and ones.
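
As a rough sketch of this relationship, the snippet below rebuilds a probability by hand from the `reg.intercept_` and `reg.coef_` values printed earlier in this notebook. The helper names `sigmoid` and `predict_probability` are hypothetical, introduced only for this illustration:

```python
import math

def sigmoid(log_odds):
    """Map log odds (any real number) to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-log_odds))

def predict_probability(intercept, coefficients, features):
    """A linear function of the inputs gives log odds; sigmoid gives P(class 1)."""
    log_odds = intercept + sum(c * x for c, x in zip(coefficients, features))
    return sigmoid(log_odds)

# Values copied from reg.intercept_ and reg.coef_ above
intercept = -1.7133823
coefficients = [1.4590134, 1.4590134, 3.16557351, 0.90051223, 0.16096056,
                0.60943268, -0.16443603, 0.27799558, -0.18963966, 0.36298373,
                -0.27966701]

# All inputs at zero: no reason given, every standardized feature at its mean
baseline_features = [0.0] * 11
baseline = predict_probability(intercept, coefficients, baseline_features)

# Same observation, but with the "Reason 3" dummy switched on
reason_3_features = [0.0, 0.0, 1.0] + [0.0] * 8
with_reason_3 = predict_probability(intercept, coefficients, reason_3_features)

print(baseline, with_reason_3)
```

Switching on a single dummy with a large positive coefficient moves the log odds from negative to positive, which is why the predicted class flips from 0 to 1.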

## Interpreting the coefficients

Therefore, all the coefficients that we have refer to the log odds.

So, to make them more interpretable, let's find the exponentials of these coefficients.
I'll create a new series in our data frame called odds ratio.
Odds ratio is the correct term for what we will get after we find the exponentials of the coefficients.

In [520]:
# For this, let's create a new series in our dataframe:
summary_table ['Odds Ratio']=np.exp(summary_table['Coefficient'])
summary_table.sort_values('Odds Ratio', ascending=False)

# This gives us all coefficients, sorted according to their relevance to the problem at hand

Unnamed: 0,Feature Name,Coefficient,Odds Ratio
3,Reason 3,3.165574,23.702334
1,Reason 1,1.459013,4.301713
2,Reason 2,1.459013,4.301713
4,Reason 4,0.900512,2.460863
6,Transportation Expense,0.609433,1.839388
10,Children,0.362984,1.437612
8,Body Mass Index,0.277996,1.32048
5,Month Value,0.160961,1.174639
7,Age,-0.164436,0.848372
9,Education,-0.18964,0.827257


If a coefficient is around zero, or equivalently its odds ratio is close to one, the corresponding feature is not particularly important.

The reasoning in terms of weights is that a weight of zero implies that no matter the feature value, we will multiply it by zero and its contribution will be zero.

The meaning in terms of odds ratios is the following.
For a one-unit change in the standardized feature, the odds are multiplied by the odds ratio.
So if the odds ratio is one, the odds don't change at all.

For example, if the odds are 5:1 and the odds ratio is 2, then for a one-unit change the odds go from 5:1 to 10:1, because we multiply them by the odds ratio.

Alternatively, if the odds ratio is 0.2, the odds would change from 5:1 to 1:1.

When the odds ratio is 1, nothing changes, as multiplying by one keeps things equal.
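
The arithmetic in this example can be checked with a few lines of plain Python. The helper names `apply_odds_ratio` and `odds_to_probability` are hypothetical, used only for this sketch:

```python
def apply_odds_ratio(odds, odds_ratio):
    """A one-unit increase in the feature multiplies the odds by the odds ratio."""
    return odds * odds_ratio

def odds_to_probability(odds):
    """Odds of a:1 correspond to a probability of a / (a + 1)."""
    return odds / (1.0 + odds)

starting_odds = 5.0  # 5:1

doubled = apply_odds_ratio(starting_odds, 2.0)    # 10:1
reduced = apply_odds_ratio(starting_odds, 0.2)    # 1:1
unchanged = apply_odds_ratio(starting_odds, 1.0)  # still 5:1

for odds in (starting_odds, doubled, reduced, unchanged):
    print(f"odds {odds:g}:1 -> probability {odds_to_probability(odds):.3f}")
```

Note that odds and probabilities are different scales: doubling the odds from 5:1 to 10:1 moves the probability only from about 0.83 to about 0.91.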

### Reasons:
We've got the four reasons for absence, which are the most important predictors.
When we were creating the dummies, the one we dropped was reason zero.
Reason zero represented a situation when a person was absent but no particular reason was given.

Therefore, the base model is the case where there is no reason, aka "reason zero".

From the coefficients, it seems that whenever a person has stated any reason, we have a much higher chance of getting excessive absence.

A good question would be how much bigger of a chance?

In [304]:
summary_table.sort_values('Odds Ratio', ascending=False)
# From the top down to "Body Mass Index" the features are important
# and at the bottom as well: Education and Pets are indeed important

# For Daily Work Load Average, Distance to Work, Day of the Week and Age, the coefficient is almost 0,
# so regardless of the particular values, they will barely affect the model



Unnamed: 0,Feature Name,Coefficient,Odds Ratio
3,Reason 3,3.165574,23.702334
1,Reason 1,1.459013,4.301713
2,Reason 2,1.459013,4.301713
4,Reason 4,0.900512,2.460863
6,Transportation Expense,0.609433,1.839388
10,Children,0.362984,1.437612
8,Body Mass Index,0.277996,1.32048
5,Month Value,0.160961,1.174639
7,Age,-0.164436,0.848372
9,Education,-0.18964,0.827257


### REASONS

The base model is "no reason".

Reason 0 = no reason, which is the baseline model,

Reason 1 = various diseases,

Reason 2 = relating to pregnancy and giving birth,

Reason 3 = regarding poisoning and peculiar reasons not categorized elsewhere,

Reason 4 = light diseases.

The most crucial reason for excessive absence is poisoning.
Its odds ratio of 23.7 means the odds of someone being excessively absent after being poisoned are about 24 times as high as when no reason was reported.

### Transportation Expense

After that, we've got Transportation Expense.
This is the most important non-dummy feature in the model, but here's the problem.

It is one of our standardized variables, so it is not directly interpretable. 
Its odds ratio implies that for one standardized unit, i.e. a one standard deviation increase in Transportation Expense, the odds of being excessively absent are close to twice as high (about 1.84 times).

This is the main drawback of standardization. Standardized models often yield higher accuracy because the optimization algorithms work better this way.

Machine learning engineers prefer models with higher accuracy, so they normally go for standardization.

Econometricians and statisticians however, prefer less accurate but more interpretable models, because they care about the underlying reasons behind different phenomena.

Data scientists may be in either position.

Sometimes they need higher accuracy; other times they must find the main drivers of a problem.

So it makes sense to create two different models.

One with standardized features and one without them, and then draw insights from both.

However, should we opt for predicting values, we definitely prefer higher accuracy.

So standardization is more often the norm.

### Pets

Pets is a numeric variable that we standardized. Its odds ratio is about 0.76 (the exponential of its coefficient, -0.28).

So for each additional standardized unit of Pets, the odds are multiplied by 0.76, i.e. they are 1 minus the odds ratio, or about 24%, lower.

One explanation may be, if you have several pets, you're probably not taking care of them on your own.

### Intercept

It is used to get more accurate predictions, but there's no specific meaning attached to it.

That's why in machine learning you can say that it calibrates the model; it is also called the bias.

Nevertheless, without an intercept, the log odds of each prediction would be off the mark by precisely that value.

## Backward Elimination

The idea is that we can simplify our model by removing all features which have close to no contribution to the model.

Usually, when we have the p-values of variables, we get rid of all coefficients with p-values above 0.05.

When learning with sklearn, we don't have p-values because we don't necessarily need them.

The reasoning of the engineers who created the package is that if the weight is small enough it won't make a difference anyway, and we trust their work.

We go back to the checkpoint where we created the targets (the last manipulation step before standardizing), and we remove the three features that are deemed useless.
The accuracy barely changes, and we have a simpler model.


## TESTING
Our train accuracy is around 77%, and that's great news.

Well, kind of. It doesn't really mean much.

Our algorithm has seen this train data many times during the training process.

So it may still fail miserably when provided with new data.

As we said earlier, we should test it on data it has never seen. It is time to use the test data.

Testing is done only once, at the very end of the machine learning process.

Why is that?

Well, some researchers look at the test accuracy and then tweak the model a bit to get a better test accuracy.

However, if you do this operation enough times, what will this be?

An iterative process in which you change some parameters based on a function, the accuracy in this case.

But that's basically the definition of the machine learning training process.

So instead of testing, you'll be using the test data to train a bit more, only this time manually. This makes the test dataset useless because you are not really testing.

The takeaway is that once we test, we are conceptually not allowed to touch the model anymore.

## Testing the model

In [550]:
reg.score(x_test, y_test)
# 75%

0.75

So based on data that the model has never seen before, we can say that in 75% of the cases the model correctly predicts 
whether a person is going to be excessively absent.

The test accuracy is usually somewhat lower than the train accuracy, but not by definition. 
A much lower test accuracy, by 10 or 20 percentage points, points to overfitting.

In [530]:
#Instead of 0 and 1, we can get the probability of an output being zero or one:
predicted_proba = reg.predict_proba(x_test)
predicted_proba

array([[0.71688667, 0.28311333],
       [0.58619314, 0.41380686],
       [0.4399751 , 0.5600249 ],
       [0.78528198, 0.21471802],
       [0.07695541, 0.92304459],
       [0.32721742, 0.67278258],
       [0.28772915, 0.71227085],
       [0.13064262, 0.86935738],
       [0.78896966, 0.21103034],
       [0.75266822, 0.24733178],
       [0.49604089, 0.50395911],
       [0.22016969, 0.77983031],
       [0.06951712, 0.93048288],
       [0.72903232, 0.27096768],
       [0.30582681, 0.69417319],
       [0.5463748 , 0.4536252 ],
       [0.55328253, 0.44671747],
       [0.54189817, 0.45810183],
       [0.38219447, 0.61780553],
       [0.05363755, 0.94636245],
       [0.70248671, 0.29751329],
       [0.78528198, 0.21471802],
       [0.41166702, 0.58833298],
       [0.41166702, 0.58833298],
       [0.25271149, 0.74728851],
       [0.74818334, 0.25181666],
       [0.50587842, 0.49412158],
       [0.85678959, 0.14321041],
       [0.20817015, 0.79182985],
       [0.78528198, 0.21471802],
       [0.

What we get is a 140 by two array.
There are 140 test observations and two columns.
The first column shows the probability our model assigned to the observation being zero, and the second the probability the model assigned to the observation being one.

That's why summing the two numbers in any row gives one.

In [533]:
predicted_proba[:,1]

array([0.28311333, 0.41380686, 0.5600249 , 0.21471802, 0.92304459,
       0.67278258, 0.71227085, 0.86935738, 0.21103034, 0.24733178,
       0.50395911, 0.77983031, 0.93048288, 0.27096768, 0.69417319,
       0.4536252 , 0.44671747, 0.45810183, 0.61780553, 0.94636245,
       0.29751329, 0.21471802, 0.58833298, 0.58833298, 0.74728851,
       0.25181666, 0.49412158, 0.14321041, 0.79182985, 0.21471802,
       0.37003345, 0.68798168, 0.69268649, 0.52691108, 0.21471802,
       0.53637476, 0.21878342, 0.73492776, 0.408596  , 0.61465849,
       0.20707082, 0.46530347, 0.23489788, 0.39327288, 0.83442289,
       0.55896885, 0.70352381, 0.28311333, 0.21738022, 0.19962674,
       0.59061411, 0.31759307, 0.67278258, 0.26790772, 0.84147775,
       0.43626405, 0.8813381 , 0.23159343, 0.32056843, 0.33065924,
       0.71106359, 0.66258641, 0.29366608, 0.78817873, 0.20736947,
       0.26483613, 0.08060369, 0.21878342, 0.72791352, 0.3087659 ,
       0.21878342, 0.27879876, 0.90926831, 0.46505872, 0.60384

This will give us the probabilities of absenteeism and this result is much cooler than simply zeros or ones.

In reality, logistic regression models calculate these probabilities in the background.

If the probability is below 0.5 it places a zero, otherwise a one.
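
A minimal sketch of this thresholding step, using the first five class-1 probabilities from the `predicted_proba[:,1]` output above. The stricter threshold of 0.8 is a hypothetical value, added only to show that the cutoff is a choice:

```python
# First five class-1 probabilities from the predicted_proba[:,1] output above
prob_excessive = [0.28311333, 0.41380686, 0.5600249, 0.21471802, 0.92304459]

threshold = 0.5  # the usual decision threshold for a binary classifier

# Label 1 when the probability of class 1 exceeds the threshold, else 0
labels = [1 if p > threshold else 0 for p in prob_excessive]
print(labels)  # [0, 0, 1, 0, 1]

# A stricter, hypothetical threshold flags fewer observations as excessive
strict_labels = [1 if p > 0.8 else 0 for p in prob_excessive]
print(strict_labels)  # [0, 0, 0, 0, 1]
```

Working with the probabilities directly lets the end user decide how cautious the classification should be, instead of being tied to 0.5.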

## Next steps:

We will now save our model, so we can use it later on. We don't need to train it every time. We just need to determine the weights once and then save them for later use.

We will create our own module so that our less technical colleagues can take advantage of this model too.


Roughly speaking, we want to create a file that will store the following information:

- This machine learning model is a logistic regression.

- It has these particular coefficients and this intercept.

- The random state that was chosen for the shuffling was 20 etc.



The object 'reg', which is an instance of the sklearn LogisticRegression class, contains the model itself: its type, coefficients and intercept.

In fact, this is the object we used to find the intercept, the coefficients and the accuracy.

Saving the model is equivalent to saving the 'reg' object.


## Save the model

In [539]:
#Pickling is the process of converting a Python object into a byte stream.

import pickle

In [541]:
with open ('model', "wb") as  file:
    pickle.dump (reg, file)

# model = file name
# wb = write bytes - when unpickling we use rb, read bytes
# when we pickle, we "dump" information in a file, when we unpickle, we load it
# in the dump method we specify the object to be dumped

# In simple words, pickling means: converting a Python object (almost any) into a stream of bytes. 

In [543]:
# We must save the absenteeism_scaler too!
# this information is needed to pre-process any new data, using the same rules as the ones applied to training data

with open ('scaler', "wb") as  file:
    pickle.dump (absenteeism_scaler, file)
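
Loading works the same way in reverse. The sketch below round-trips a stand-in object through a temporary file; the dictionary `model_like` and the temporary path are assumptions for illustration, whereas the notebook itself pickles `reg` and `absenteeism_scaler`:

```python
import os
import pickle
import tempfile

# Stand-in for a fitted object; any picklable Python object works the same way
model_like = {"intercept": -1.7133823, "n_features": 11}

path = os.path.join(tempfile.mkdtemp(), "model")

with open(path, "wb") as file:      # wb = write bytes
    pickle.dump(model_like, file)

with open(path, "rb") as file:      # rb = read bytes
    restored = pickle.load(file)

print(restored == model_like)
```

Two practical points: unpickling an instance of a custom class such as `CustomScaler` requires that the class definition is importable in the loading environment, and pickle files should only be loaded from sources you trust.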

## Deploy the model

Deploying a model means making it available and ready to use.
Generally it consists of two steps: saving the model and then applying it to new data.


Storing code in a module will allow us to reuse it without trouble.

In essence, we will treat the methods in this module in the same way we treat the Numpy, sklearn and pandas methods.

We export an optimised version of the code as a module, to use in a new notebook