# Introduction: Predicting incident priority based on selected attributes


As part of this study, we are going to use a publicly available dataset of incident logs collected from an incident management tool called ServiceNow. The data is anonymized for the sake of privacy, but it is fit for our current goal.

https://archive.ics.uci.edu/ml/datasets/Incident+management+process+enriched+event+log

#### In the code below, I have tried to explain the steps as much as possible. I have imported most libraries at the point of use, rather than at the beginning, in order to improve code readability.


In [377]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [378]:
SNow = pd.read_csv('incident_event_log.csv')

In [379]:
SNow.head() #visualize the data

Unnamed: 0,number,incident_state,active,reassignment_count,reopen_count,sys_mod_count,made_sla,caller_id,opened_by,opened_at,...,u_priority_confirmation,notify,problem_id,rfc,vendor,caused_by,closed_code,resolved_by,resolved_at,closed_at
0,INC0000045,New,True,0,0,0,True,Caller 2403,Opened by 8,29/2/2016 01:16,...,False,Do Not Notify,?,?,?,?,code 5,Resolved by 149,29/2/2016 11:29,5/3/2016 12:00
1,INC0000045,Resolved,True,0,0,2,True,Caller 2403,Opened by 8,29/2/2016 01:16,...,False,Do Not Notify,?,?,?,?,code 5,Resolved by 149,29/2/2016 11:29,5/3/2016 12:00
2,INC0000045,Resolved,True,0,0,3,True,Caller 2403,Opened by 8,29/2/2016 01:16,...,False,Do Not Notify,?,?,?,?,code 5,Resolved by 149,29/2/2016 11:29,5/3/2016 12:00
3,INC0000045,Closed,False,0,0,4,True,Caller 2403,Opened by 8,29/2/2016 01:16,...,False,Do Not Notify,?,?,?,?,code 5,Resolved by 149,29/2/2016 11:29,5/3/2016 12:00
4,INC0000047,New,True,0,0,0,True,Caller 2403,Opened by 397,29/2/2016 04:40,...,False,Do Not Notify,?,?,?,?,code 5,Resolved by 81,1/3/2016 09:52,6/3/2016 10:00


# Prepare the data: Data Pre-Processing

From the preview, we can see that some columns have missing information, represented by '?', in almost all or in the majority of the records.
I decided to drop every column in which 50,000 or more of the 141,712 rows (roughly a third) are missing.

In [380]:
max_bad = 50000 #max number of missing values
# Keep a column only if it has fewer than max_bad '?' placeholders
SNow2 = SNow.loc[:, [(SNow[x] == '?').sum() < max_bad for x in SNow.columns]]
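
The column filter above can also be expressed as a share of missing values per column. The following is a minimal sketch on a made-up toy frame (the values and the one-third threshold are assumptions for illustration, not taken from the real log):

```python
import pandas as pd

# Toy frame standing in for the incident log (values are made up).
toy = pd.DataFrame({
    "number": ["INC1", "INC2", "INC3", "INC4"],
    "vendor": ["?", "?", "?", "Vendor 1"],
    "category": ["Category 40", "?", "Category 9", "Category 53"],
})

# Share of '?' placeholders in each column.
missing_share = (toy == "?").mean()

# Keep the columns where at most a third of the rows are missing.
kept = toy.loc[:, missing_share <= 1 / 3]
print(missing_share.to_dict())
print(list(kept.columns))
```

Working with a share instead of an absolute count keeps the threshold meaningful if the number of rows changes.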

In [385]:
SNow2.info(verbose=False)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 141712 entries, 0 to 141711
Columns: 29 entries, number to closed_at
dtypes: bool(4), int64(3), object(22)
memory usage: 27.6+ MB


From the above, it is apparent that this dataset is large (141,712 rows). To reduce its size, and to get rid of the remaining '?' placeholders, I decided to remove all rows that contain '?' in any of the columns.

In [391]:
SNow2 = SNow2.replace('?',np.nan)

In [392]:
SNow2 = SNow2.dropna(axis = 0)
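
As a possible shortcut, the '?' placeholder can be turned into NaN while the file is being read, via the `na_values` parameter of `pd.read_csv`. A small sketch on an in-memory CSV (the three rows are invented for illustration):

```python
import io
import pandas as pd

# A tiny in-memory CSV; the rows are invented.
csv_text = (
    "number,vendor,category\n"
    "INC1,?,Category 40\n"
    "INC2,Vendor 1,?\n"
    "INC3,Vendor 2,Category 9\n"
)

# na_values='?' converts the placeholder to NaN at load time.
toy = pd.read_csv(io.StringIO(csv_text), na_values="?")

# Drop every row that has at least one missing value.
complete_rows = toy.dropna(axis=0)
print(complete_rows)
```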

In [394]:
SNow2.info(verbose=False)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 75230 entries, 4 to 141711
Columns: 29 entries, number to closed_at
dtypes: bool(4), int64(3), object(22)
memory usage: 15.2+ MB


In [390]:
SNow2.columns

Index(['number', 'incident_state', 'active', 'reassignment_count',
       'reopen_count', 'sys_mod_count', 'made_sla', 'caller_id', 'opened_by',
       'opened_at', 'sys_updated_by', 'sys_updated_at', 'contact_type',
       'location', 'category', 'subcategory', 'u_symptom', 'impact', 'urgency',
       'priority', 'assignment_group', 'assigned_to', 'knowledge',
       'u_priority_confirmation', 'notify', 'closed_code', 'resolved_by',
       'resolved_at', 'closed_at'],
      dtype='object')

This dataset is now much more manageable in size and has no missing information.

Next, I drop the columns which do not have an impact on the outcome.

The impact and urgency columns together decide the priority. Since priority is going to be our outcome variable, keeping them would give the answer away, so we drop impact and urgency as well.
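
One way to check the claim that impact and urgency decide the priority is to count how many distinct priorities occur for each (impact, urgency) pair. The sketch below uses hypothetical rows whose labels only mimic the style of the dataset:

```python
import pandas as pd

# Hypothetical rows; the label values are invented for illustration.
toy = pd.DataFrame({
    "impact":   ["2 - Medium", "2 - Medium", "1 - High", "3 - Low"],
    "urgency":  ["2 - Medium", "2 - Medium", "1 - High", "3 - Low"],
    "priority": ["3 - Moderate", "3 - Moderate", "1 - Critical", "4 - Low"],
})

# Number of distinct priorities observed for each (impact, urgency) pair.
priorities_per_pair = toy.groupby(["impact", "urgency"])["priority"].nunique()

# True when every pair maps to exactly one priority,
# i.e. the two columns would fully determine the target.
is_deterministic = bool((priorities_per_pair == 1).all())
print(is_deterministic)
```

Running the same check on `SNow2` before the drop would show how strictly the rule holds in the real log.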

In [395]:
SNow3= SNow2.drop(['number','notify', 'resolved_by','sys_updated_at','impact', 'urgency'], axis=1)

In [224]:
print(SNow3['incident_state'].unique())

['New' 'Active' 'Awaiting User Info' 'Resolved' 'Closed'
 'Awaiting Problem' 'Awaiting Vendor' 'Awaiting Evidence']


We want to keep only the rows that describe an incident in the 'New' state, since we are trying to predict the priority of incidents, which is decided as soon as an incident is opened (in most cases).

In [397]:
snow_new = SNow3.drop(SNow3[SNow3['incident_state']!='New'].index)

In [398]:
snow_new.info(verbose=False)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 15268 entries, 4 to 141390
Columns: 23 entries, incident_state to closed_at
dtypes: bool(4), int64(3), object(16)
memory usage: 2.4+ MB


In [399]:
# snow_closed.describe
print(snow_new['priority'].unique())

['3 - Moderate' '4 - Low' '2 - High' '1 - Critical']


In [400]:
snow_new.columns

Index(['incident_state', 'active', 'reassignment_count', 'reopen_count',
       'sys_mod_count', 'made_sla', 'caller_id', 'opened_by', 'opened_at',
       'sys_updated_by', 'contact_type', 'location', 'category', 'subcategory',
       'u_symptom', 'priority', 'assignment_group', 'assigned_to', 'knowledge',
       'u_priority_confirmation', 'closed_code', 'resolved_at', 'closed_at'],
      dtype='object')

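Calculate the completion time of the incidents as a Timedelta (closed_at minus opened_at). The timestamps are written day-first (e.g. 29/2/2016), so dayfirst=True is needed; without it, pandas reads a date such as 6/3/2016 as June 3 and inflates the duration. The preview a few cells below was produced without this flag, which is why its first row shows 95 days instead of 6 days 05:20:00.

In [401]:
snow_new['completion_days'] = pd.to_datetime(snow_new['closed_at'], dayfirst=True) - pd.to_datetime(snow_new['opened_at'], dayfirst=True)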
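
Because the log writes its timestamps day-first, an explicit format string removes any ambiguity. A minimal sketch using the opened_at and closed_at values of INC0000047 from the preview at the top (the names `fmt`, `duration` and `days` are introduced here for illustration):

```python
import pandas as pd

# Timestamps of INC0000047 as they appear in the log preview.
opened = pd.Series(["29/2/2016 04:40"])
closed = pd.Series(["6/3/2016 10:00"])

# Explicit day-first format: day/month/year hour:minute.
fmt = "%d/%m/%Y %H:%M"
duration = pd.to_datetime(closed, format=fmt) - pd.to_datetime(opened, format=fmt)

# Convert the Timedelta to a plain number of days.
days = duration.dt.total_seconds() / 86400
print(duration.iloc[0], round(days.iloc[0], 2))
```

A float number of days is usually easier to feed into a model than a Timedelta column.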

The columns incident_state (now always 'New') and active can be dropped as well. And since we have already calculated the completion time, we can drop the date columns too.

In [402]:
snow_new.drop(['incident_state','active','resolved_at','closed_at','opened_at'], axis=1,inplace=True)

In [403]:
snow_new.head()

Unnamed: 0,reassignment_count,reopen_count,sys_mod_count,made_sla,caller_id,opened_by,sys_updated_by,contact_type,location,category,subcategory,u_symptom,priority,assignment_group,assigned_to,knowledge,u_priority_confirmation,closed_code,completion_days
4,0,0,0,True,Caller 2403,Opened by 397,Updated by 746,Phone,Location 165,Category 40,Subcategory 215,Symptom 471,3 - Moderate,Group 70,Resolver 89,True,False,code 5,95 days 05:20:00
20,0,0,0,True,Caller 4491,Opened by 180,Updated by 340,Phone,Location 204,Category 9,Subcategory 97,Symptom 450,3 - Moderate,Group 25,Resolver 125,True,False,code 3,125 days 06:22:00
40,0,0,0,True,Caller 2838,Opened by 131,Updated by 265,Phone,Location 143,Category 53,Subcategory 168,Symptom 580,3 - Moderate,Group 70,Resolver 78,True,False,code 6,156 days 09:50:00
41,0,0,1,True,Caller 2838,Opened by 131,Updated by 265,Phone,Location 143,Category 53,Subcategory 168,Symptom 580,3 - Moderate,Group 70,Resolver 78,True,False,code 6,156 days 09:50:00
49,0,0,0,True,Caller 5323,Opened by 131,Updated by 265,Phone,Location 108,Category 44,Subcategory 229,Symptom 580,3 - Moderate,Group 5,Resolver 216,True,False,code 1,125 days 08:22:00


##### Let's try to predict the priority of the incidents based on some features. Feature selection is based on my personal experience in the IT Services industry. There are more systematic, data-driven ways of choosing features (or of reducing dimensionality, e.g. PCA), but I do not yet feel confident applying them to so many categorical variables.

In [404]:
snow_new.columns

Index(['reassignment_count', 'reopen_count', 'sys_mod_count', 'made_sla',
       'caller_id', 'opened_by', 'sys_updated_by', 'contact_type', 'location',
       'category', 'subcategory', 'u_symptom', 'priority', 'assignment_group',
       'assigned_to', 'knowledge', 'u_priority_confirmation', 'closed_code',
       'completion_days'],
      dtype='object')

In [405]:
# .copy() gives independent frames, so the encoding below does not touch snow_new
X = snow_new[['caller_id', 'opened_by', 'sys_updated_by', 'contact_type', 'location', 'category', 'subcategory', 'u_symptom','u_priority_confirmation']].copy()
y = snow_new[['priority']].copy()

# Create the model

For testing our model, we will use train_test_split to split the data in a 70:30 ratio.

The accuracy score gives a first baseline; the macro F1 score and the classification report are then used to compare the models.

In [406]:
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix,f1_score

In [407]:
pd.options.mode.chained_assignment = None  # default='warn'

### Numerical encoding of the data

The numerical encoding below is a poor form of encoding, because it can mislead a machine learning library into treating the ids as ordered quantities. For example, opened_by = 397 may get more weight than opened_by = 8. For now, this is what I am going to use due to the time constraint of the project submission.

In [408]:
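#Numerical encoding: keep only the numeric id from labels such as 'Caller 2403'
for col in ['caller_id', 'opened_by', 'sys_updated_by', 'location', 'category', 'subcategory','u_symptom']:
    X[col] = X[col].str.extract(r'(\d+)', expand=False).astype(np.int64)

# Label Encoding: a fresh encoder per column keeps each mapping independent
for col in ['u_priority_confirmation', 'contact_type']:
    X[col] = LabelEncoder().fit_transform(X[col])
# Classes are sorted, so '1 - Critical' -> 0, '2 - High' -> 1, '3 - Moderate' -> 2, '4 - Low' -> 3
y['priority'] = LabelEncoder().fit_transform(y['priority']).astype(np.int64)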
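
An alternative that avoids implying an order between ids is one-hot encoding. The sketch below uses `pd.get_dummies` on a made-up frame; it is only an illustration and was not used for the results in this notebook. For columns with thousands of distinct values, such as caller_id, it would create a very wide matrix.

```python
import pandas as pd

# Made-up rows in the style of the incident log.
toy = pd.DataFrame({
    "contact_type": ["Phone", "Email", "Phone"],
    "category": ["Category 40", "Category 9", "Category 53"],
})

# One indicator column per distinct value, so no ordering is implied.
encoded = pd.get_dummies(toy, columns=["contact_type", "category"])
print(list(encoded.columns))
```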

In [409]:
X.head() #Preview the encoded features
# y.head()

Unnamed: 0,caller_id,opened_by,sys_updated_by,contact_type,location,category,subcategory,u_symptom,u_priority_confirmation
4,2403,397,746,1,165,40,215,471,0
20,4491,180,340,1,204,9,97,450,0
40,2838,131,265,1,143,53,168,580,0
41,2838,131,265,1,143,53,168,580,0
49,5323,131,265,1,108,44,229,580,0


In [412]:
#Split the data in train and test
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=20)
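
With classes as unevenly sized as the priorities here, it may help to pass `stratify` to `train_test_split` so that both parts keep the same class proportions. A minimal sketch on synthetic labels (the 90/10 split is an assumption for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: 90 rows of class 2 and 10 rows of class 0.
y_toy = np.array([2] * 90 + [0] * 10)
X_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy keeps the 90:10 ratio in both the train and the test part.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=20, stratify=y_toy
)
print(np.bincount(y_tr), np.bincount(y_te))
```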

In [413]:
import warnings
warnings.filterwarnings('always') 

In [414]:
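#Establish baseline accuracy: always predict the most frequent class.
#After label encoding, '3 - Moderate' is class 2 (class 3 is '4 - Low').
majority_class = y_train['priority'].mode()[0]
print(f'The baseline accuracy is {accuracy_score(y_test, np.full(y_test.shape, majority_class)):.4f}')
# print(classification_report(y_test,np.full(y_test.shape, majority_class)))

The baseline accuracy is 0.9229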
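
A majority-class baseline can also be obtained from scikit-learn's `DummyClassifier`, which learns the most frequent label from the training data. The sketch below runs on small synthetic arrays (all values are invented):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Synthetic labels: class 2 dominates the training data.
y_train_toy = np.array([2] * 8 + [1, 3])
y_test_toy = np.array([2, 2, 2, 1])

# The features are ignored by the dummy model, so zeros are enough.
X_train_toy = np.zeros((len(y_train_toy), 1))
X_test_toy = np.zeros((len(y_test_toy), 1))

baseline = DummyClassifier(strategy="most_frequent").fit(X_train_toy, y_train_toy)
baseline_accuracy = accuracy_score(y_test_toy, baseline.predict(X_test_toy))
print(baseline_accuracy)
```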


Let's check the distribution of incidents with respect to the different priority values

In [415]:
y_train.priority.value_counts()

2    9879
1     323
3     310
0     175
Name: priority, dtype: int64

As can be seen, class 2 (the encoded '3 - Moderate') is heavily over-represented: 9879 of the 10687 training rows. With such an imbalance, accuracy alone says little, so the models below are compared on the macro F1 score.

We will use a Gradient Boosting classifier to build our model and will base our prediction on that.

## Gradient Boosting Classifier

In [430]:
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score

In [417]:
m_xgb = XGBClassifier()
m_xgb.fit(X_train, y_train)
y_predict_xgb = m_xgb.predict(X_test)

In [420]:
y_predict_xgb = m_xgb.predict(X_test)

#The final report is:
f1 = f1_score(y_test,y_predict_xgb,average='macro')
print(f'The macro F1 score for initial XGB model:{f1}')
print(classification_report(y_test,y_predict_xgb))

The macro F1 score for initial XGB model:0.8136550631515472
              precision    recall  f1-score   support

           0       0.89      0.64      0.75        64
           1       0.88      0.60      0.72       149
           2       0.97      1.00      0.98      4228
           3       0.94      0.71      0.81       140

    accuracy                           0.97      4581
   macro avg       0.92      0.74      0.81      4581
weighted avg       0.97      0.97      0.97      4581
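
The macro F1 score is the unweighted mean of the per-class F1 scores, so every class counts equally regardless of its size. In the report above, (0.75 + 0.72 + 0.98 + 0.81) / 4 gives roughly 0.81. A small sketch on invented labels shows the same relationship:

```python
import numpy as np
from sklearn.metrics import f1_score

# Invented labels for three classes.
y_true = np.array([0, 0, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 2, 2, 2])

# average=None returns one F1 score per class.
per_class = f1_score(y_true, y_pred, average=None)

# average='macro' is the plain mean of those per-class scores.
macro = f1_score(y_true, y_pred, average="macro")
print(per_class, macro)
```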



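##### These seem to be quite good results already!

Note, though, that class 2 alone makes up 4228 of the 4581 test rows (about 92%), so the macro F1 of 0.81 says more than the 0.97 accuracy. Let's try to apply some weights to the classes in order to offset the class imbalance.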

In [421]:
from sklearn.utils import class_weight

In [422]:
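# One weight per class, in the order of np.unique, i.e. classes 0, 1, 2, 3
class_weights = list(class_weight.compute_class_weight(class_weight='balanced',classes=np.unique(y_train), y=y_train['priority']))

# Per-row weights: look up the weight by the encoded class label itself.
# Indexing with val-1 would shift every weight by one class and hand class 3
# the small weight of the majority class 2; the outputs recorded below were
# produced with that shifted lookup.
weights_array = np.ones(y_train.shape[0], dtype = 'float')
for i, val in enumerate(y_train['priority']):
    weights_array[i] = class_weights[val]

xgb_weighted_model = XGBClassifier()   
xgb_weighted_model.fit(X_train, y_train,sample_weight=weights_array)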

XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              gamma=None, gpu_id=None, grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=None, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=None, max_leaves=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              n_estimators=100, n_jobs=None, num_parallel_tree=None,
              objective='multi:softprob', predictor=None, ...)
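
Per-row weights like weights_array can also be produced in one call with scikit-learn's `compute_sample_weight`, which avoids manual indexing. The sketch below compares it with an explicit label-to-weight lookup on synthetic labels (the label values are invented):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight, compute_sample_weight

# Synthetic labels with a dominant class 2.
y_toy = np.array([2, 2, 2, 2, 2, 2, 1, 1, 0, 3])

# Explicit route: one weight per class, then a lookup keyed by the label.
classes = np.unique(y_toy)
class_w = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)
weight_of = dict(zip(classes, class_w))
manual = np.array([weight_of[label] for label in y_toy])

# One-call route: a weight per row, directly.
auto = compute_sample_weight(class_weight="balanced", y=y_toy)
print(weight_of)
```

With the 'balanced' setting each class weight is n_samples / (n_classes * class_count), so rare classes receive the larger weights.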

In [423]:
y_predict_weighted = xgb_weighted_model.predict(X_test) 
f1 = f1_score(y_test,y_predict_weighted,average='macro')

print(f'The F1 score for the Weighted model:{f1}')
print(classification_report(y_test,y_predict_weighted))

The F1 score for the Weighted model:0.738397348804021
              precision    recall  f1-score   support

           0       0.88      0.67      0.76        64
           1       0.81      0.64      0.71       149
           2       0.96      0.99      0.98      4228
           3       1.00      0.34      0.50       140

    accuracy                           0.96      4581
   macro avg       0.91      0.66      0.74      4581
weighted avg       0.96      0.96      0.95      4581




### Oops!! The macro F1 score went down, from 0.81 to 0.74, mostly because recall on class 3 fell from 0.71 to 0.34! Let's do some hyperparameter tuning to find some improvements.


##### Since I am going to use grid search, which is very resource intensive, I will focus on tuning only two parameters: learning_rate and n_estimators. Let's see how the grid search performs.

In [425]:
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV
import datetime

In [427]:
print('Start time: ',datetime.datetime.now())

para_search_grid ={'learning_rate':[0.2, 0.6, 0.8, 1.2], 'n_estimators':[200, 400, 800, 1200]}
kfold = StratifiedKFold(n_splits=6, shuffle=True, random_state=10)
grid_search = GridSearchCV(xgb_weighted_model, param_grid = para_search_grid, scoring='f1_macro',cv=kfold)
grid_search.fit(X_train, y_train,sample_weight=weights_array)

print('.\n End time: ',datetime.datetime.now())

Start time:  2023-02-13 20:39:45.129885
.
 End time:  2023-02-13 20:57:45.029703
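
The runtime follows from the number of model fits: every parameter combination is trained once per fold, plus one final refit on the full training set (GridSearchCV's default `refit=True`). A short sketch that counts the fits for the grid used above:

```python
from sklearn.model_selection import ParameterGrid

# The same grid as in the search above.
grid = {
    "learning_rate": [0.2, 0.6, 0.8, 1.2],
    "n_estimators": [200, 400, 800, 1200],
}

n_candidates = len(ParameterGrid(grid))  # 4 x 4 combinations
n_splits = 6                             # folds of the StratifiedKFold

# One fit per candidate and fold, plus the final refit of the best candidate.
n_fits = n_candidates * n_splits + 1
print(n_candidates, n_fits)
```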


In [429]:
print(grid_search.best_params_)
y_decision_fn = grid_search.predict(X_test) 
f1 = f1_score(y_test,y_decision_fn,average='macro')
print(f'The F1 score for XGB_weighted model after 1st tuning:{f1}')
print(classification_report(y_test,y_decision_fn))

{'learning_rate': 0.8, 'n_estimators': 400}
The F1 score for XGB_weighted model after 1st tuning:0.7933588218762218
              precision    recall  f1-score   support

           0       0.78      0.67      0.72        64
           1       0.73      0.63      0.68       149
           2       0.97      0.99      0.98      4228
           3       0.97      0.67      0.79       140

    accuracy                           0.96      4581
   macro avg       0.86      0.74      0.79      4581
weighted avg       0.96      0.96      0.96      4581
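
Since recall on the minority classes is noticeably lower than on class 2, a confusion matrix could show which class the missed incidents are assigned to. A minimal sketch on invented labels (confusion_matrix is already imported earlier in this notebook but was not used):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Invented labels: two minority-class rows are mistaken for class 2.
y_true = np.array([0, 0, 1, 2, 2, 3])
y_pred = np.array([0, 2, 1, 2, 2, 2])

# Rows are the true classes, columns are the predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2, 3])
print(cm)
```

Calling it as `confusion_matrix(y_test, y_decision_fn)` would give the corresponding table for the tuned model.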



It seems that about 18 minutes of grid search lifted the weighted model's macro F1 from 0.74 to 0.79, but could not beat the 0.81 of the unweighted model with default parameters.

So, if I had to use this model with the current dataset and the current approach, I would either use the base model or put more effort into tuning the hyperparameters.
Given the time constraint for this project, I am concluding this exercise here.

##### I still have some future goals with this dataset (these were the original purpose stated by the publishers of the data). I will continue to work on those and, if possible, publish the results as well.