# Creating a Logistic Regression to Predict Absenteeism

## Import the Relevant Libraries

In [1]:
import pandas as pd
import numpy as np

## Load the Data

In [2]:
data_preprocessed = pd.read_csv('Absenteeism_preprocessed.csv')

In [3]:
data_preprocessed.head()

Unnamed: 0,Reason 1,Reason 2,Reason 3,Reason 4,Month Value,Day of the Week,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours
0,0,0,0,1,7,1,289,36,33,239.554,30,0,2,1,4
1,0,0,0,0,7,1,118,13,50,239.554,31,0,1,0,0
2,0,0,0,1,7,2,179,51,38,239.554,31,0,0,0,2
3,1,0,0,0,7,3,279,5,39,239.554,24,0,2,0,4
4,0,0,0,1,7,3,289,36,33,239.554,30,0,2,1,2


The model itself will give us a fair indication of which variables are important for the analysis.

Logistic Regression is a type of classification.

## Create the Targets

Create two classes:
   1. Moderately Absent
   2. Excessively Absent
  
We take the median of Absenteeism Time in Hours and use it as the cut-off between the two classes.

In [4]:
data_preprocessed['Absenteeism Time in Hours'].median()

3.0

1. Less than or Equal to 3.0 --> Moderately Absent 
2. Greater than 3.0 --> Excessively Absent

In ML, the 0s and 1s we want the model to predict are called targets.

Using the median as the cut-off roughly balances the data: about half of the observations fall into each class. This prevents the model from learning to output only 0s or only 1s.

In [5]:
# np.where(condition, value if True, value if False) --> checks a condition element-wise and assigns a value accordingly
targets = np.where(data_preprocessed['Absenteeism Time in Hours'] > 
                   data_preprocessed['Absenteeism Time in Hours'].median(), 1, 0)
targets

array([1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0,
       1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1,
       0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1,
       0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1,
       0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0,
       1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1,
       0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1,
       1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1,
       0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0,
       0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0,
       0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0,

In [6]:
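# create new column in dataframe for Excessive Absenteeism
# NOTE: several comma-separated names inside [] are passed as one tuple key, so this creates ONE column
# labelled ('Excessive Absenteeism', 'Daily Work Load Average', 'Distance to Work'), as the header below shows;
# data_preprocessed['Excessive Absenteeism'] = targets would give the column a plain name
data_preprocessed['Excessive Absenteeism', 'Daily Work Load Average', 'Distance to Work'] = targets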
data_preprocessed.head()

Unnamed: 0,Reason 1,Reason 2,Reason 3,Reason 4,Month Value,Day of the Week,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours,"(Excessive Absenteeism, Daily Work Load Average, Distance to Work)"
0,0,0,0,1,7,1,289,36,33,239.554,30,0,2,1,4,1
1,0,0,0,0,7,1,118,13,50,239.554,31,0,1,0,0,0
2,0,0,0,1,7,2,179,51,38,239.554,31,0,0,0,2,0
3,1,0,0,0,7,3,279,5,39,239.554,24,0,2,0,4,1
4,0,0,0,1,7,3,289,36,33,239.554,30,0,2,1,2,0


## A Comment on the Targets

To check that the classes are balanced enough that the model will not learn to output only 0s or only 1s, look at the share of 1s among the targets:

In [7]:
targets.sum() / targets.shape[0]

0.45571428571428574

Around 46% of the targets are 1s. A 60-40 split will usually work for a logistic regression, although that is not true for other algorithms such as neural networks.

A 45-55 split is almost always sufficient.
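The share of 1s is below 50% because rows tied with the median fall into class 0. A minimal sketch on a toy array shows the effect; the `hours` values are made up for illustration and are not taken from the dataset.

```python
import numpy as np

# hypothetical absence hours; three rows are tied at the median
hours = np.array([0, 2, 3, 3, 3, 4, 8])

cutoff = np.median(hours)                      # middle value of the sorted array
toy_targets = np.where(hours > cutoff, 1, 0)   # only values strictly above the cut-off become 1

# share of 1s: the tied rows all land in class 0, so this is below 0.5
share_of_ones = toy_targets.mean()
print(cutoff, toy_targets, share_of_ones)
```

The more rows sit exactly on the median, the further the split drifts from 50-50.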

In [8]:
# Drop Absenteeism Time in Hours
data_with_targets = data_preprocessed.drop(['Absenteeism Time in Hours'], axis=1)

In [9]:
# check that the drop created a checkpoint (a new object) rather than another name for the same data
# Using 'is' --> True  = the 2 variables refer to the same object
#                False = the 2 variables refer to different objects
data_with_targets is data_preprocessed

False
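The same identity check can be tried on a small frame to see why `drop` acts as a checkpoint. This is only a sketch: the frame `df` and its columns are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})   # hypothetical toy frame

alias = df                          # a second name for the SAME object
dropped = df.drop(["b"], axis=1)    # drop returns a NEW DataFrame by default

print(alias is df)        # True
print(dropped is df)      # False
print(list(df.columns))   # the original still has both columns
```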

In [10]:
data_with_targets.head()

Unnamed: 0,Reason 1,Reason 2,Reason 3,Reason 4,Month Value,Day of the Week,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,"(Excessive Absenteeism, Daily Work Load Average, Distance to Work)"
0,0,0,0,1,7,1,289,36,33,239.554,30,0,2,1,1
1,0,0,0,0,7,1,118,13,50,239.554,31,0,1,0,0
2,0,0,0,1,7,2,179,51,38,239.554,31,0,0,0,0
3,1,0,0,0,7,3,279,5,39,239.554,24,0,2,0,1
4,0,0,0,1,7,3,289,36,33,239.554,30,0,2,1,0


# Select the Inputs for the Regression

In [11]:
data_with_targets.shape

(700, 15)

In [12]:
# 'DataFrame.iloc[row indices, column indices]' --> selects (slices) data by position
data_with_targets.iloc[:,0:14]  # all rows, columns 0 through 13; '.iloc' slices exclude the end position whereas '.loc' slices include the end label


Unnamed: 0,Reason 1,Reason 2,Reason 3,Reason 4,Month Value,Day of the Week,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets
0,0,0,0,1,7,1,289,36,33,239.554,30,0,2,1
1,0,0,0,0,7,1,118,13,50,239.554,31,0,1,0
2,0,0,0,1,7,2,179,51,38,239.554,31,0,0,0
3,1,0,0,0,7,3,279,5,39,239.554,24,0,2,0
4,0,0,0,1,7,3,289,36,33,239.554,30,0,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
695,1,0,0,0,5,2,179,22,40,237.656,22,1,2,0
696,1,0,0,0,5,2,225,26,28,237.656,24,0,1,2
697,1,0,0,0,5,3,330,16,28,237.656,25,1,0,0
698,0,0,0,1,5,3,235,16,32,237.656,25,1,0,0


In [13]:
data_with_targets.iloc[:,:-1]    # gives the same result without having to count the columns; selects all columns but the last

Unnamed: 0,Reason 1,Reason 2,Reason 3,Reason 4,Month Value,Day of the Week,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets
0,0,0,0,1,7,1,289,36,33,239.554,30,0,2,1
1,0,0,0,0,7,1,118,13,50,239.554,31,0,1,0
2,0,0,0,1,7,2,179,51,38,239.554,31,0,0,0
3,1,0,0,0,7,3,279,5,39,239.554,24,0,2,0
4,0,0,0,1,7,3,289,36,33,239.554,30,0,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
695,1,0,0,0,5,2,179,22,40,237.656,22,1,2,0
696,1,0,0,0,5,2,225,26,28,237.656,24,0,1,2
697,1,0,0,0,5,3,330,16,28,237.656,25,1,0,0
698,0,0,0,1,5,3,235,16,32,237.656,25,1,0,0


In [14]:
unscaled_inputs = data_with_targets.iloc[:,:-1] 
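The difference between position-based and label-based slicing is easy to verify on a tiny frame. The sketch below uses a hypothetical `toy` frame with made-up column names.

```python
import pandas as pd

toy = pd.DataFrame({"x1": [1, 2], "x2": [3, 4], "x3": [5, 6], "target": [0, 1]})

by_position = toy.iloc[:, 0:3]       # positions 0, 1, 2 (end position excluded)
by_label = toy.loc[:, "x1":"x3"]     # labels x1 through x3 (end label included)
all_but_last = toy.iloc[:, :-1]      # every column except the last one

print(list(by_position.columns))
print(by_position.equals(by_label), by_position.equals(all_but_last))
```

All three selections return the same three input columns, which is why `iloc[:, :-1]` is the convenient choice when the target is the last column.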

## Standardize the Data

In [15]:
# THE FOLLOWING CODE (WHICH HAS BEEN COMMENTED OUT) IS BAD PRACTICE BECAUSE IT ALSO STANDARDIZES THE DUMMY VARIABLES
#from sklearn.preprocessing import StandardScaler

# declare standard scaler object
#absenteeism_scaler = StandardScaler()       #EMPTY scaler object; no information in it yet
                                           # will be used to subtract the mean and divide by the standard deviation variable-wise (feature-wise)

In [16]:
# Use this instead
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler

class CustomScaler(BaseEstimator,TransformerMixin):
    
    def __init__(self,columns,copy=True,with_mean=True,with_std=True):
        # StandardScaler's parameters are keyword-only in recent scikit-learn versions
        self.scaler = StandardScaler(copy=copy,with_mean=with_mean,with_std=with_std)
        self.columns = columns
        self.mean_ = None
        self.var_ = None
    
    def fit(self, X, y=None):
        # learn the mean and standard deviation of the selected columns only
        self.scaler.fit(X[self.columns], y)
        self.mean_ = np.mean(X[self.columns])
        self.var_ = np.var(X[self.columns])
        return self
    
    def transform(self, X, y=None, copy=None):
        init_col_order = X.columns
        # keep X's index so the scaled and unscaled parts line up row by row in concat
        X_scaled = pd.DataFrame(self.scaler.transform(X[self.columns]), columns=self.columns, index=X.index)
        X_not_scaled = X.loc[:,~X.columns.isin(self.columns)]
        return pd.concat([X_not_scaled, X_scaled], axis=1)[init_col_order]
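What the scaler does to each selected column can be reproduced by hand with pandas alone. This is a sketch of the arithmetic, not the class above; the `toy` frame and the `cols` list are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({"dummy": [0, 1, 0, 1], "age": [20.0, 30.0, 40.0, 50.0]})
cols = ["age"]                      # columns chosen for standardization

# StandardScaler divides by the population standard deviation (ddof=0)
means = toy[cols].mean()
stds = toy[cols].std(ddof=0)

scaled = toy.copy()
scaled[cols] = (toy[cols] - means) / stds   # subtract the mean, divide by the std

print(scaled)
```

The `dummy` column keeps its 0/1 values, while `age` ends up with mean 0 and standard deviation 1.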

In [17]:
unscaled_inputs.columns.values

array(['Reason 1', 'Reason 2', 'Reason 3', 'Reason 4', 'Month Value',
       'Day of the Week', 'Transportation Expense', 'Distance to Work',
       'Age', 'Daily Work Load Average', 'Body Mass Index', 'Education',
       'Children', 'Pets'], dtype=object)

In [18]:
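# scale columns except for the dummy variables
#columns_to_scale = ['Month Value',
#       'Day of the Week', 'Transportation Expense', 'Distance to Work',
#       'Age', 'Daily Work Load Average', 'Body Mass Index', 'Education',
#       'Children', 'Pets']

# NOTE: the column labels contain spaces ('Reason 1'), so the underscore names below match nothing;
# only 'Education' is actually omitted, which is why the 'Reason' columns appear scaled in the outputs below
columns_to_omit = ['Reason_1', 'Reason_2', 'Reason_3', 'Reason_4', 'Education']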

In [19]:
# List Comprehension -- a syntactic construct which allows us to create a list from existing lists based on loops, conditionals, etc.
columns_to_scale = [x for x in unscaled_inputs.columns.values if x not in columns_to_omit]

In [20]:
absenteeism_scaler = CustomScaler(columns_to_scale)

In [21]:
absenteeism_scaler.fit(unscaled_inputs)

CustomScaler(columns=['Reason 1', 'Reason 2', 'Reason 3', 'Reason 4',
                      'Month Value', 'Day of the Week',
                      'Transportation Expense', 'Distance to Work', 'Age',
                      'Daily Work Load Average', 'Body Mass Index', 'Children',
                      'Pets'],
             copy=None, with_mean=None, with_std=None)

In [22]:
# scale the inputs
scaled_inputs = absenteeism_scaler.transform(unscaled_inputs)
scaled_inputs

# new_data_raw = pd.read_csv('new_data.csv')
# new_data_scaled = absenteeism_scaler.transform(new_data_raw)

Unnamed: 0,Reason 1,Reason 2,Reason 3,Reason 4,Month Value,Day of the Week,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets
0,-0.577350,-0.092981,-0.314485,0.821365,0.030796,-0.800950,1.005844,0.412816,-0.536062,-0.806331,0.767431,0,0.880469,0.268487
1,-0.577350,-0.092981,-0.314485,-1.217485,0.030796,-0.800950,-1.574681,-1.141882,2.130803,-0.806331,1.002633,0,-0.019280,-0.589690
2,-0.577350,-0.092981,-0.314485,0.821365,0.030796,-0.232900,-0.654143,1.426749,0.248310,-0.806331,1.002633,0,-0.919030,-0.589690
3,1.732051,-0.092981,-0.314485,-1.217485,0.030796,0.335149,0.854936,-1.682647,0.405184,-0.806331,-0.643782,0,0.880469,-0.589690
4,-0.577350,-0.092981,-0.314485,0.821365,0.030796,0.335149,1.005844,0.412816,-0.536062,-0.806331,0.767431,0,0.880469,0.268487
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
695,1.732051,-0.092981,-0.314485,-1.217485,-0.568019,-0.232900,-0.654143,-0.533522,0.562059,-0.853789,-1.114186,1,0.880469,-0.589690
696,1.732051,-0.092981,-0.314485,-1.217485,-0.568019,-0.232900,0.040034,-0.263140,-1.320435,-0.853789,-0.643782,0,-0.019280,1.126663
697,1.732051,-0.092981,-0.314485,-1.217485,-0.568019,0.335149,1.624567,-0.939096,-1.320435,-0.853789,-0.408580,1,-0.919030,-0.589690
698,-0.577350,-0.092981,-0.314485,0.821365,-0.568019,0.335149,0.190942,-0.939096,-0.692937,-0.853789,-0.408580,1,-0.919030,-0.589690


In [23]:
scaled_inputs.shape   # print out size

(700, 14)

## Split the Data into Train & Test, then Shuffle

### Import the Relevant Module

In [24]:
from sklearn.model_selection import train_test_split

#splits arrays or matrices into random train and test subsets

### Split

In [25]:
train_test_split(scaled_inputs, targets)

[     Reason 1  Reason 2  Reason 3  Reason 4  Month Value  Day of the Week  \
 523  1.732051 -0.092981 -0.314485 -1.217485     0.929019         0.335149   
 113 -0.577350 -0.092981 -0.314485  0.821365    -0.268611        -0.232900   
 457 -0.577350 -0.092981 -0.314485  0.821365    -0.268611         0.903199   
 619 -0.577350 -0.092981 -0.314485  0.821365    -0.568019         0.335149   
 461 -0.577350 -0.092981 -0.314485  0.821365    -0.268611        -0.232900   
 ..        ...       ...       ...       ...          ...              ...   
 252 -0.577350 -0.092981  3.179797 -1.217485    -0.867426         0.903199   
 544 -0.577350 -0.092981 -0.314485  0.821365     1.228426        -1.368999   
 104 -0.577350 -0.092981  3.179797 -1.217485     0.330204        -0.232900   
 82   1.732051 -0.092981 -0.314485 -1.217485    -0.568019        -1.368999   
 121 -0.577350 -0.092981 -0.314485  0.821365     1.228426        -0.800950   
 
      Transportation Expense  Distance to Work       Age  \
 5

In [26]:
# by default, train_test_split uses training size = 75% and test size = 25%
# add the parameter 'train_size = 0.8' to change the training size to 80%
# 'sklearn.model_selection.train_test_split(inputs, targets, train_size, shuffle=True, random_state)'
# splits arrays or matrices into random train and test subsets
# because shuffle=True by default, rerunning the code gives a different split every time unless random_state is fixed

x_train, x_test, y_train, y_test = train_test_split(scaled_inputs,targets, train_size = 0.8, random_state = 20)

In [27]:
print(x_train.shape, y_train.shape)

(560, 14) (560,)


In [28]:
print(x_test.shape, y_test.shape)

(140, 14) (140,)
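The effect of shuffling with a fixed seed can be sketched with NumPy alone. The helper `shuffled_split` below is hypothetical and is not how scikit-learn implements `train_test_split`; it only illustrates the idea of shuffling row positions and cutting them 80/20.

```python
import numpy as np

def shuffled_split(n_rows, train_size=0.8, seed=20):
    """Return shuffled train and test row positions (illustrative helper)."""
    rng = np.random.default_rng(seed)        # fixed seed -> the same split on every run
    order = rng.permutation(n_rows)          # shuffle the row positions
    n_train = int(round(n_rows * train_size))
    return order[:n_train], order[n_train:]

train_idx, test_idx = shuffled_split(700)
print(len(train_idx), len(test_idx))         # 560 140

# calling it again with the same seed reproduces the split exactly
again_train, again_test = shuffled_split(700)
print(np.array_equal(train_idx, again_train))
```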


# Logistic Regression with Sklearn

In [29]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

### Training the Model

In [30]:
# declare Logistic Regression object
reg = LogisticRegression(solver='liblinear')

In [31]:
reg.fit(x_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

In [32]:
reg.score(x_train, y_train)

0.7821428571428571

In [33]:
# can conclude the model has a training accuracy of about 78%
# in other words, about 78% of the model outputs match the targets

### Manually Check the Accuracy

In [34]:
model_outputs = reg.predict(x_train)
model_outputs

array([0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0,
       0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1,
       0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0,
       0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0,
       0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1,
       1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1,
       1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1,
       0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
       0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0,
       0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0,

In [35]:
y_train

array([0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0,
       1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1,
       1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1,
       1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0,
       0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1,
       0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1,
       1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0,
       1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0,
       0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0,
       1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0,

In [36]:
# hard to compare with the naked eye
# use code to compare the outputs to the targets
model_outputs == y_train

# True = matches
# False = does not match

array([ True,  True, False,  True,  True,  True,  True,  True,  True,
        True, False,  True, False, False,  True,  True,  True,  True,
       False,  True, False,  True, False, False,  True,  True,  True,
       False,  True,  True,  True,  True,  True,  True,  True,  True,
       False, False, False, False,  True, False,  True,  True, False,
        True,  True,  True,  True,  True, False,  True,  True,  True,
        True,  True,  True,  True,  True, False,  True,  True,  True,
        True,  True,  True,  True, False,  True, False,  True,  True,
        True,  True,  True, False,  True,  True,  True,  True,  True,
       False,  True, False,  True,  True, False, False, False,  True,
        True,  True,  True,  True,  True,  True,  True, False,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True, False,  True,  True,  True,  True,
        True,  True,  True,  True,  True, False,  True,  True,  True,
        True,  True,

In [37]:
# Boolean: True = 1 and False = 0

np.sum((model_outputs==y_train))

438

In [38]:
model_outputs.shape[0]

560

In [39]:
np.sum((model_outputs==y_train)) / model_outputs.shape[0]

0.7821428571428571

## Finding the Intercept and Coefficients of the Logistic Regression

In [40]:
reg.intercept_

array([-0.14352457])

In [41]:
reg.coef_

array([[ 2.06899633,  0.32883249,  1.56068887,  1.31090995,  0.02582743,
        -0.08621396,  0.72169608, -0.05835588, -0.20543182, -0.0273594 ,
         0.32990512, -0.39927087,  0.38392076, -0.32004597]])

In [42]:
# we want to know which variables these coefficients refer to

In [43]:
unscaled_inputs.columns.values
# with a plain StandardScaler the scaled inputs would be an ndarray without column names, so 'scaled_inputs.columns.values' would raise an error
# CustomScaler returns a DataFrame, so either object would work here; unscaled_inputs is used for the names

array(['Reason 1', 'Reason 2', 'Reason 3', 'Reason 4', 'Month Value',
       'Day of the Week', 'Transportation Expense', 'Distance to Work',
       'Age', 'Daily Work Load Average', 'Body Mass Index', 'Education',
       'Children', 'Pets'], dtype=object)

In [44]:
feature_name = unscaled_inputs.columns.values

In [45]:
summary_table = pd.DataFrame(columns=['Feature Name'], data = feature_name)

# must transpose the array: reg.coef_ has shape (1, n_features), i.e. a single row, and we want a column
summary_table['Coefficient'] = np.transpose(reg.coef_)

summary_table     # displays the summary table of the variables and their corresponding coefficients

Unnamed: 0,Feature Name,Coefficient
0,Reason 1,2.068996
1,Reason 2,0.328832
2,Reason 3,1.560689
3,Reason 4,1.31091
4,Month Value,0.025827
5,Day of the Week,-0.086214
6,Transportation Expense,0.721696
7,Distance to Work,-0.058356
8,Age,-0.205432
9,Daily Work Load Average,-0.027359


In [46]:
# add one to all indices of dataframe 'summary_table'
summary_table.index = summary_table.index + 1
summary_table.loc[0] = ['Intercept', reg.intercept_[0]]
summary_table = summary_table.sort_index()
summary_table

Unnamed: 0,Feature Name,Coefficient
0,Intercept,-0.143525
1,Reason 1,2.068996
2,Reason 2,0.328832
3,Reason 3,1.560689
4,Reason 4,1.31091
5,Month Value,0.025827
6,Day of the Week,-0.086214
7,Transportation Expense,0.721696
8,Distance to Work,-0.058356
9,Age,-0.205432


Standardized coefficients are the coefficients of a regression in which all variables have been standardized.

### Interpreting the Coefficients

log(odds) = intercept + (b_1 * x_1) + (b_2 * x_2) +...+ (b_n * x_n)
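The formula above can be turned back into a probability: odds = e^(log odds), and probability = odds / (1 + odds). A minimal sketch follows, using only the intercept reported earlier; the helper name `probability_from_log_odds` is made up for illustration.

```python
import math

def probability_from_log_odds(log_odds):
    """Invert log(odds) = z: odds = e^z, probability = odds / (1 + odds)."""
    odds = math.exp(log_odds)
    return odds / (1 + odds)

intercept = -0.14352457      # reg.intercept_ from the output above

# if every input in the model equals 0, log(odds) is just the intercept
p = probability_from_log_odds(intercept)
print(round(p, 2))           # 0.46
```

A log(odds) of 0 corresponds to a probability of exactly 0.5, which is the boundary between the two classes.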

In [47]:
summary_table['Odds_ratio'] = np.exp(summary_table.Coefficient)
summary_table.head()

Unnamed: 0,Feature Name,Coefficient,Odds_ratio
0,Intercept,-0.143525,0.8663
1,Reason 1,2.068996,7.916873
2,Reason 2,0.328832,1.389345
3,Reason 3,1.560689,4.762101
4,Reason 4,1.31091,3.709548


In [48]:
# 'DataFrame.sort_values(Series)' --> sorts the values in a data frame with respect to a given column (series)
summary_table.sort_values('Odds_ratio', ascending=False)

Unnamed: 0,Feature Name,Coefficient,Odds_ratio
1,Reason 1,2.068996,7.916873
3,Reason 3,1.560689,4.762101
4,Reason 4,1.31091,3.709548
7,Transportation Expense,0.721696,2.057921
13,Children,0.383921,1.468029
11,Body Mass Index,0.329905,1.390836
2,Reason 2,0.328832,1.389345
5,Month Value,0.025827,1.026164
10,Daily Work Load Average,-0.027359,0.973011
8,Distance to Work,-0.058356,0.943314


A feature is not particularly important if:
 - its coefficient is close to 0:
         - implies that no matter the feature value, we will multiply it by (almost) 0 in the model
         
 - its odds ratio is close to 1:
         - for a unit change in the standardized feature, the odds are multiplied by the odds ratio (1 = no change)
         - i.e.   odds * odds ratio = new odds
                  5:1  *      2     = 10:1
                  5:1  *     0.2    = 1:1
                  5:1  *      1     = 5:1 
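The three worked lines above can be reproduced in a few lines of code. The function name `updated_odds` is only illustrative.

```python
def updated_odds(odds, odds_ratio):
    """New odds after a one-unit change in a feature with the given odds ratio."""
    return odds * odds_ratio

for ratio in (2, 0.2, 1):
    print(f"5:1 with odds ratio {ratio} -> {updated_odds(5, ratio):g}:1")
```

An odds ratio above 1 raises the odds, one below 1 lowers them, and exactly 1 leaves them unchanged.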

Interpreting the Summary Table:
 - the variables 'Daily Work Load Average', 'Distance to Work', and 'Day of the Week' may be dropped because their coefficients are close to 0 (odds ratios close to 1)
 - all four Reason variables have positive coefficients, with Reasons 1, 3 and 4 the largest, so giving one of these reasons is associated with higher odds of excessive absenteeism

BACKWARD ELIMINATION
 - The idea is that we can simplify our model by removing all features which have close to no contribution to the model
 - When we have the p-values, we get rid of all coefficients with p-values > 0.05
 - if the weight is small enough, it won't make a difference; if these variables are removed, the coefficients of the rest of the model should not really change
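scikit-learn's `LogisticRegression` does not report p-values, so a rough first screen can use the size of the standardized coefficients instead. The sketch below copies the feature names and coefficients from the outputs above; the `threshold` of 0.1 is an arbitrary assumption for illustration, not a rule from the analysis.

```python
feature_names = ['Reason 1', 'Reason 2', 'Reason 3', 'Reason 4', 'Month Value',
                 'Day of the Week', 'Transportation Expense', 'Distance to Work',
                 'Age', 'Daily Work Load Average', 'Body Mass Index', 'Education',
                 'Children', 'Pets']
coefficients = [2.06899633, 0.32883249, 1.56068887, 1.31090995, 0.02582743,
                -0.08621396, 0.72169608, -0.05835588, -0.20543182, -0.0273594,
                0.32990512, -0.39927087, 0.38392076, -0.32004597]

threshold = 0.1   # hypothetical cut-off for "close to 0"

# candidates for removal: features whose coefficient is near zero in absolute value
weak = [name for name, c in zip(feature_names, coefficients) if abs(c) < threshold]
print(weak)
```

With this cut-off, 'Month Value' is flagged alongside the three variables named above; whether to drop it is a judgment call, and the model should be refit after any removal.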

# Testing the Model

In [49]:
reg.score(x_test, y_test)   # reg.score(inputs, targets)

0.7357142857142858

Based on data the model has NEVER seen before, the model correctly predicts whether a person is going to be excessively absent in 73.6% of the cases.

Often the test accuracy is 10-20% lower than the train accuracy (due to overfitting); here the gap is smaller, about 5 percentage points (78.2% vs. 73.6%).

In [51]:
# 'sklearn.linear_model.LogisticRegression.predict_proba(x)' -- returns the probability estimates for all possible outputs (classes)
predicted_proba = reg.predict_proba(x_test)
predicted_proba  # first column: probability of being 0, second column: prob of being 1

array([[0.76424392, 0.23575608],
       [0.61299367, 0.38700633],
       [0.4064611 , 0.5935389 ],
       [0.7852416 , 0.2147584 ],
       [0.06091598, 0.93908402],
       [0.26799665, 0.73200335],
       [0.28704915, 0.71295085],
       [0.06715365, 0.93284635],
       [0.75947822, 0.24052178],
       [0.76485305, 0.23514695],
       [0.47335262, 0.52664738],
       [0.13663258, 0.86336742],
       [0.04179783, 0.95820217],
       [0.68987638, 0.31012362],
       [0.20845446, 0.79154554],
       [0.47262008, 0.52737992],
       [0.49323601, 0.50676399],
       [0.53775049, 0.46224951],
       [0.38057849, 0.61942151],
       [0.0294033 , 0.9705967 ],
       [0.73485646, 0.26514354],
       [0.77686748, 0.22313252],
       [0.42793913, 0.57206087],
       [0.44441853, 0.55558147],
       [0.15455462, 0.84544538],
       [0.76056028, 0.23943972],
       [0.45911251, 0.54088749],
       [0.89668308, 0.10331692],
       [0.16377217, 0.83622783],
       [0.7682632 , 0.2317368 ],
       [0.

In [52]:
predicted_proba.shape

(140, 2)

In [53]:
predicted_proba[:,1]

# predict() outputs 0 if this probability is below 0.5, and 1 if it is above

array([0.23575608, 0.38700633, 0.5935389 , 0.2147584 , 0.93908402,
       0.73200335, 0.71295085, 0.93284635, 0.24052178, 0.23514695,
       0.52664738, 0.86336742, 0.95820217, 0.31012362, 0.79154554,
       0.52737992, 0.50676399, 0.46224951, 0.61942151, 0.9705967 ,
       0.26514354, 0.22313252, 0.57206087, 0.55558147, 0.84544538,
       0.23943972, 0.54088749, 0.10331692, 0.83622783, 0.2317368 ,
       0.41236329, 0.740362  , 0.70136431, 0.52934869, 0.22313252,
       0.61657817, 0.25677884, 0.83052297, 0.47542524, 0.60685731,
       0.24285615, 0.42122295, 0.23411185, 0.09848998, 0.82827749,
       0.72962697, 0.76015435, 0.24182808, 0.2697702 , 0.2019056 ,
       0.50288522, 0.07157697, 0.70537404, 0.25992939, 0.86731853,
       0.43989597, 0.95196806, 0.29102294, 0.07845762, 0.07693382,
       0.74766171, 0.68013057, 0.28608179, 0.85695747, 0.23887663,
       0.23770158, 0.01155912, 0.24411587, 0.84178061, 0.31749594,
       0.24316561, 0.07765299, 0.92238447, 0.46776185, 0.65184
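The step from probabilities to predicted classes can be sketched with NumPy, using the first five probabilities of class 1 shown above and a 0.5 cut-off.

```python
import numpy as np

# first five values of predicted_proba[:,1] from the output above
proba_of_one = np.array([0.23575608, 0.38700633, 0.5935389, 0.2147584, 0.93908402])

# class 1 when the probability of class 1 exceeds 0.5, otherwise class 0
predicted_class = (proba_of_one > 0.5).astype(int)
print(predicted_class)   # [0 0 1 0 1]
```

Choosing a cut-off other than 0.5 would trade one kind of error for another, which is why having the probabilities is more flexible than having only the 0/1 outputs.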

# Save the Model

'pickle' [module] -- a Python module used to convert a Python object into a byte stream

In [54]:
import pickle

- file name: model
- write bytes: wb
- dump = save
- reg = object to be dumped

In [55]:
with open('model', 'wb') as file:
    pickle.dump(reg, file)
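Loading works the same way in reverse: open the file in 'rb' (read bytes) mode and call `pickle.load`. The round trip below is a sketch that uses a plain dictionary as a stand-in for `reg` and a temporary directory, so it runs without a trained model.

```python
import os
import pickle
import tempfile

stand_in_model = {"intercept": -0.14352457, "n_features": 14}   # placeholder for `reg`

path = os.path.join(tempfile.mkdtemp(), "model")

# save: 'wb' = write bytes
with open(path, "wb") as file:
    pickle.dump(stand_in_model, file)

# load: 'rb' = read bytes
with open(path, "rb") as file:
    restored = pickle.load(file)

print(restored == stand_in_model)   # True
```

The restored object is an equal copy, not the original object itself.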

# A Note on Pickling
 

There are several popular ways to save (and finalize) a model. To name some, you can use Joblib (a part of the SciPy ecosystem), and JSON. Certainly, each of those choices has its pros and cons. Pickle is probably the most intuitive and definitely our preferred choice.

Once again, ‘pickle’ is the standard Python tool for serialization and deserialization. In simple words, pickling means converting a Python object into a stream of bytes. Logically, unpickling means converting a stream of bytes (that has been pickled) back into a Python object.



There are some potential issues you should be aware of, though!

Pickle and Python version.

Pickling is closely tied to the Python version. It is not recommended to (de)serialize objects across different Python versions. If you’re working on your own, this will rarely be an issue (unless you upgrade or downgrade your Python version).



Pickle is slow.

You will barely notice it for small objects, but complex structures may take a long time to pickle and unpickle.



Pickle is not secure.

This is evident from the documentation of pickle, quote: “Never unpickle data received from an untrusted or unauthenticated source.” The reason is that just about anything can be pickled, so you can easily unpickle malicious code.



Now, if you are unpickling your own objects, you are more or less safe.

If, however, you receive pickled objects from someone you don’t fully trust, you should be very cautious: unpickling can execute arbitrary code, which is one way malicious software gets onto a machine.

Finally, even your own file may be changed by an attacker. The next time you unpickle it, you would then load whatever that person put there.
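One way to notice that a saved file has been modified is to record a checksum when saving and compare it before loading. The sketch below is an assumption-laden illustration: the helpers `sha256_of` and `load_if_unchanged` are hypothetical, the pickled dictionary is a placeholder, and the check only helps if the recorded digest is kept somewhere an attacker cannot edit. It does not make unpickling untrusted data safe.

```python
import hashlib
import os
import pickle
import tempfile

def sha256_of(path):
    """Return the SHA-256 hex digest of a file's bytes."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def load_if_unchanged(path, expected_digest):
    """Unpickle only if the file still matches the digest recorded at save time."""
    if sha256_of(path) != expected_digest:
        raise ValueError("file changed since it was saved; refusing to unpickle")
    with open(path, "rb") as f:
        return pickle.load(f)

path = os.path.join(tempfile.mkdtemp(), "model")
with open(path, "wb") as f:
    pickle.dump({"demo": 1}, f)          # placeholder object instead of a real model

expected_digest = sha256_of(path)         # record this right after saving
print(load_if_unchanged(path, expected_digest))
```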