# Data Description & Context:


In the support process, incoming incidents are analyzed and assessed by the organization’s support teams to fulfill the request. In many organizations, better allocation and more effective use of valuable support resources directly results in substantial cost savings.

Currently, incidents are created by various stakeholders (Business Users, IT Users and Monitoring Tools) within the IT Service Management Tool and are assigned to Service Desk teams (L1 / L2 teams). These teams review the incidents for correct ticket categorization and priority, then carry out an initial diagnosis to see whether they can resolve them. Around 54% of the incidents are resolved by L1 / L2 teams. If L1 / L2 is unable to resolve an incident, they escalate / assign the ticket to Functional teams from Applications and Infrastructure (L3 teams). Some incidents are assigned directly to L3 teams by either Monitoring Tools or Callers / Requestors. L3 teams carry out a detailed diagnosis and resolve the incidents. Around 56% of incidents are resolved by Functional / L3 teams. If vendor support is needed, they reach out to the vendor to close the incident. L1 / L2 needs to spend time reviewing Standard Operating Procedures (SOPs) before assigning incidents to Functional teams (at least ~25-30% of incidents need an SOP review before ticket assignment). About 15 minutes is spent on SOP review for each incident. At least ~1 FTE of effort is needed solely for incident assignment to L3 teams.

During the assignment of incidents by L1 / L2 teams to functional groups, there were multiple instances of incidents being assigned to the wrong functional group. Around 25% of incidents are wrongly assigned to functional teams, and the Functional teams then need additional effort to re-assign them to the right groups. During this process, some incidents sit in the queue and are not addressed in time, resulting in poor customer service. AI techniques that classify incidents to the right functional groups can help organizations reduce resolution time and focus on more productive tasks.


# Domain:

Information Technology

# Project Description

In this capstone project, the goal is to build a classifier that can classify the tickets by analyzing text.

# Project Objectives

The objectives of the project are:
- Learn how to use different classification models.
- Use transfer learning to apply pre-built models.
- Learn to set the optimizers, loss functions, epochs, learning rate, batch size, checkpointing, early stopping, etc.
- Read research papers from the given domain to learn about advanced models for the given problem.

In [1]:
%matplotlib inline

import numpy as np
import pandas as pd
import seaborn as sns
from nltk.tokenize import RegexpTokenizer
import re
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings("ignore")

sns.set_style(style='darkgrid')

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
cd '/content/drive/My Drive/AIML/13_Capstone Project AIML/NLP - Project 1/'

/content/drive/My Drive/AIML/13_Capstone Project AIML/NLP - Project 1


In [4]:
df = pd.read_excel('input_data.xlsx', sheet_name=0)
df.head()

Unnamed: 0,Short description,Description,Caller,Assignment group
0,login issue,-verified user details.(employee# & manager na...,spxjnwir pjlcoqds,GRP_0
1,outlook,\r\n\r\nreceived from: hmjdrvpb.komuaywn@gmail...,hmjdrvpb komuaywn,GRP_0
2,cant log in to vpn,\r\n\r\nreceived from: eylqgodm.ybqkwiam@gmail...,eylqgodm ybqkwiam,GRP_0
3,unable to access hr_tool page,unable to access hr_tool page,xbkucsvz gcpydteq,GRP_0
4,skype error,skype error,owlgqjme qhcozdfx,GRP_0


# Basic EDA

In [5]:
# Find the shape of the data, data type of individual columns
df.info()  #info about the data

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8500 entries, 0 to 8499
Data columns (total 4 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Short description  8492 non-null   object
 1   Description        8499 non-null   object
 2   Caller             8500 non-null   object
 3   Assignment group   8500 non-null   object
dtypes: object(4)
memory usage: 265.8+ KB


In [6]:
# Checking the presence of missing values
df.isnull().values.any()

True

In [7]:
df.isna().apply(pd.value_counts)   #null value check 

Unnamed: 0,Short description,Description,Caller,Assignment group
False,8492,8499,8500.0,8500.0
True,8,1,,


In [8]:
# Total number of missing values
df.isnull().sum().sum()

9

- There are 4 columns in total
    - One target column - 'Assignment group'
    - None of the columns have numeric values
    - 3 predictor variables
- There are no null values for the columns 'Caller' and 'Assignment group'

In [9]:
# Provides the total count, unique count, most frequent value, and that value's frequency of occurrence
df.describe().T

Unnamed: 0,count,unique,top,freq
Short description,8492,7481,password reset,38
Description,8499,7817,the,56
Caller,8500,2950,bpctwhsn kzqsbmtp,810
Assignment group,8500,74,GRP_0,3976
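
The summary above shows GRP_0 accounting for 3976 of the 8500 tickets (about 47%) across 74 groups, so the target is heavily skewed. A minimal sketch of how the per-class shares could be inspected is given below; the `toy` frame is made-up data standing in for the real `Assignment group` column.

```python
import pandas as pd

# Hypothetical stand-in for the 'Assignment group' column
toy = pd.DataFrame({'Assignment group': ['GRP_0'] * 6 + ['GRP_8'] * 3 + ['GRP_24']})

# Share of tickets per group, largest first
shares = toy['Assignment group'].value_counts(normalize=True)
print(shares)

# Share of the dominant class; on the real data this would be 3976 / 8500
majority_share = shares.iloc[0]
```

A majority share this large means plain accuracy can look good for a model that mostly predicts the dominant group, so per-class metrics are worth checking alongside it.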


In [10]:
df.drop(['Caller'], axis = 1, inplace=True)

In [11]:
df

Unnamed: 0,Short description,Description,Assignment group
0,login issue,-verified user details.(employee# & manager na...,GRP_0
1,outlook,\r\n\r\nreceived from: hmjdrvpb.komuaywn@gmail...,GRP_0
2,cant log in to vpn,\r\n\r\nreceived from: eylqgodm.ybqkwiam@gmail...,GRP_0
3,unable to access hr_tool page,unable to access hr_tool page,GRP_0
4,skype error,skype error,GRP_0
...,...,...,...
8495,emails not coming in from zz mail,\r\n\r\nreceived from: avglmrts.vhqmtiua@gmail...,GRP_29
8496,telephony_software issue,telephony_software issue,GRP_0
8497,vip2: windows password reset for tifpdchb pedx...,vip2: windows password reset for tifpdchb pedx...,GRP_0
8498,machine nÃ£o estÃ¡ funcionando,i am unable to access the machine utilities to...,GRP_62


### Find and remove exact duplicates

In [12]:
# Select duplicate rows except first occurrence based on all columns
duplicateRowsDF = df[df.duplicated()]
print("Duplicate Rows except first occurrence based on all columns are :")
duplicateRowsDF

Duplicate Rows except first occurrence based on all columns are :


Unnamed: 0,Short description,Description,Assignment group
51,call for ecwtrjnq jpecxuty,call for ecwtrjnq jpecxuty,GRP_0
81,erp SID_34 account locked,erp SID_34 account locked,GRP_0
123,unable to display expense report,unable to display expense report,GRP_0
157,ess password reset,ess password reset,GRP_0
229,call for ecwtrjnq jpecxuty,call for ecwtrjnq jpecxuty,GRP_0
...,...,...,...
8424,windows account lockout,windows account lockout,GRP_0
8450,unable to connect to wifi,unable to connect to wifi,GRP_0
8451,password reset erp SID_34,password reset erp SID_34,GRP_0
8458,windows account locked,windows account locked,GRP_0


In [13]:
duplicateRowsDF[duplicateRowsDF['Short description'] == 'call for ecwtrjnq jpecxuty']

Unnamed: 0,Short description,Description,Assignment group
51,call for ecwtrjnq jpecxuty,call for ecwtrjnq jpecxuty,GRP_0
229,call for ecwtrjnq jpecxuty,call for ecwtrjnq jpecxuty,GRP_0
2714,call for ecwtrjnq jpecxuty,call for ecwtrjnq jpecxuty,GRP_0
3085,call for ecwtrjnq jpecxuty,call for ecwtrjnq jpecxuty,GRP_0
3219,call for ecwtrjnq jpecxuty,call for ecwtrjnq jpecxuty,GRP_0
4303,call for ecwtrjnq jpecxuty,call for ecwtrjnq jpecxuty,GRP_0


In [14]:
duplicateRowsDF['Short description'].value_counts()

windows password reset                                                                  28
password reset                                                                          25
account locked in ad                                                                    22
windows account locked                                                                  22
erp SID_34 account unlock                                                               17
                                                                                        ..
reset passwords for ezrsdgfc hofgvwel using password_management_tool password reset.     1
job Job_549 failed in job_scheduler at: 10/07/2016 23:05:00                              1
unable to connect to outlook                                                             1
unable to log in to collaboration_platform                                               1
unable to log in to windows                                                              1

In [15]:
duplicateRowsDF[duplicateRowsDF['Short description'] == 'outlook not working']

Unnamed: 0,Short description,Description,Assignment group
1019,outlook not working,outlook not working,GRP_0
5941,outlook not working,outlook not working,GRP_0


In [16]:
# Drop the duplicate rows, keeping the first occurrence of each.
df.drop_duplicates(inplace=True)
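
As a small illustration of what `duplicated()` and `drop_duplicates()` do, here is a sketch on a made-up three-row frame (the `toy` data is hypothetical). It also shows that the original index labels survive the drop, which is why later outputs in this notebook show labels such as 3900 at position 3680.

```python
import pandas as pd

toy = pd.DataFrame({
    'Short description': ['skype error', 'skype error', 'outlook'],
    'Description': ['skype error', 'skype error', 'outlook not working'],
})

# duplicated() marks every repeat after the first occurrence
mask = toy.duplicated()

# drop_duplicates() keeps the first occurrence; index labels are preserved
deduped = toy.drop_duplicates()

# reset_index(drop=True) renumbers rows 0..n-1 so labels and positions agree
deduped_reset = deduped.reset_index(drop=True)
```

Whether to reset the index is a design choice: keeping the original labels makes it easy to trace a row back to the source spreadsheet, while resetting avoids confusion between `df[col][label]` and `df.iloc[position]`.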

In [17]:
# Check to confirm if duplicate rows except first occurrence based on all columns are removed
df[df.duplicated()]

Unnamed: 0,Short description,Description,Assignment group


In [18]:
df.info()  #info about the data

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7909 entries, 0 to 8499
Data columns (total 3 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Short description  7904 non-null   object
 1   Description        7908 non-null   object
 2   Assignment group   7909 non-null   object
dtypes: object(3)
memory usage: 247.2+ KB


### Handling Missing Values

In [19]:
df.isna().apply(pd.value_counts)   #null value check 

Unnamed: 0,Short description,Description,Assignment group
False,7904,7908,7909.0
True,5,1,


In [20]:
df[df['Short description'].isnull()]

Unnamed: 0,Short description,Description,Assignment group
2604,,\r\n\r\nreceived from: ohdrnswl.rezuibdt@gmail...,GRP_34
3383,,\r\n-connected to the user system using teamvi...,GRP_0
3906,,-user unable tologin to vpn.\r\n-connected to...,GRP_0
3924,,name:wvqgbdhm fwchqjor\nlanguage:\nbrowser:mic...,GRP_0
4341,,\r\n\r\nreceived from: eqmuniov.ehxkcbgj@gmail...,GRP_0


In [21]:
df[df['Description'].isnull()]

Unnamed: 0,Short description,Description,Assignment group
4395,i am locked out of skype,,GRP_0


In [22]:
print(list(df[df['Short description'].isna()].index))
print(list(df[df['Description'].isna()].index))

[2604, 3383, 3906, 3924, 4341]
[4395]


In [23]:
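# Replace missing text with a single space; assigning back avoids chained in-place modification
df['Short description'] = df['Short description'].fillna(' ')
df['Description'] = df['Description'].fillna(' ')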

In [24]:
print(list(df[df['Short description'].isna()].index))
print(list(df[df['Description'].isna()].index))

[]
[]


In [25]:
df[df['Short description'].str.contains('skype error')]

Unnamed: 0,Short description,Description,Assignment group
4,skype error,skype error,GRP_0
285,skype error while logging in,skype error while logging in,GRP_0
3392,skype error : getting skype certificate error,skype error : getting skype certificate error,GRP_0
4813,skype error,skype error,GRP_0


In [26]:
df.head()

Unnamed: 0,Short description,Description,Assignment group
0,login issue,-verified user details.(employee# & manager na...,GRP_0
1,outlook,\r\n\r\nreceived from: hmjdrvpb.komuaywn@gmail...,GRP_0
2,cant log in to vpn,\r\n\r\nreceived from: eylqgodm.ybqkwiam@gmail...,GRP_0
3,unable to access hr_tool page,unable to access hr_tool page,GRP_0
4,skype error,skype error,GRP_0


In [27]:
df.iloc[[1]]

Unnamed: 0,Short description,Description,Assignment group
1,outlook,\r\n\r\nreceived from: hmjdrvpb.komuaywn@gmail...,GRP_0


In [28]:
df['Description'][1]

'\r\n\r\nreceived from: hmjdrvpb.komuaywn@gmail.com\r\n\r\nhello team,\r\n\r\nmy meetings/skype meetings etc are not appearing in my outlook calendar, can somebody please advise how to correct this?\r\n\r\nkind '

In [29]:
df.replace({r'\r\n': ' '}, regex=True, inplace=True)

In [30]:
df['Description'][1]

'  received from: hmjdrvpb.komuaywn@gmail.com  hello team,  my meetings/skype meetings etc are not appearing in my outlook calendar, can somebody please advise how to correct this?  kind '
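
Replacing `\r\n` with a space leaves runs of double spaces and leading or trailing blanks, as the output above shows. One possible follow-up, sketched here with a hypothetical helper `normalize_whitespace` and a made-up sample string, is to collapse every run of whitespace into a single space.

```python
import re

def normalize_whitespace(text):
    # Replace any run of whitespace (including \r\n) with one space, then trim the ends
    return re.sub(r'\s+', ' ', text).strip()

raw = '\r\n\r\nreceived from: user@example.com\r\n\r\nhello team,\r\n\r\nkind '
clean = normalize_whitespace(raw)
print(clean)
```

This does not matter for the `\w+` tokenizer used later, which ignores whitespace anyway, but it keeps the saved spreadsheet easier to read.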

In [31]:
df.to_excel("input_data_clean_1.xlsx",
             sheet_name='Sheet1')  

In [32]:
df['Short description'][3903]

'ç”µè„‘æ—\xa0æ³•è¿žæŽ¥å…¬å…±ç›˜ï¼Œè¯·å¸®æˆ‘è½¬ç»™å°\x8fè´º'

In [33]:
df['Short description'] = df['Short description'].str.encode('ascii', 'ignore').str.decode('ascii')

In [34]:
df.shape

(7909, 3)

In [35]:
df.iloc[3680:3689,:]

Unnamed: 0,Short description,Description,Assignment group
3900,problems with ap vpn,received from: elixsfvu.pxwbjofl@gmail.com ...,GRP_0
3901,kis documents will not generate because the st...,kis documents will not generate because the st...,GRP_18
3902,wrong unit price on inwarehouse_tool,received from: ovhtgsxd.dcqhnrmy@gmail.com ...,GRP_13
3903,,ç”µè„‘æ— æ³•è¿žæŽ¥å…¬å…±ç›˜ï¼Œè¯·å¸®æˆ‘è½¬ç»™å...,GRP_30
3904,usa pc companyst-apc-01 in the pvd area cannot...,usa pc companyst-apc-01 in the pvd area needs ...,GRP_3
3906,,-user unable tologin to vpn. -connected to th...,GRP_0
3907,i am not able to log into my vpn. when i am tr...,name:mehrugshy\nlanguage:\nbrowser:microsoft i...,GRP_0
3911,vpn connectivity,received from: gmrxwqlf.vzacdmbj@gmail.com ...,GRP_0
3912,referencing ticket ticket_no1499477. customer ...,please revisit ticket ticket_no1499477 and rev...,GRP_20


In [36]:
df['Description'][3903]

'ç”µè„‘æ—\xa0æ³•è¿žæŽ¥å…¬å…±ç›˜ï¼Œè¯·å¸®æˆ‘è½¬ç»™å°\x8fè´º'

In [37]:
df['Description'] = df['Description'].str.encode('ascii', 'ignore').str.decode('ascii')

In [38]:
df['Description'][3903]

''
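
Stripping every non-ASCII character empties this ticket completely. Some of the garbled text in the dataset looks like UTF-8 that was decoded with a single-byte code page, and in simple cases that can be undone instead of discarded. The sketch below assumes a Latin-1 mis-decoding and uses a hypothetical helper `repair_mojibake`; text that went through a different code page (cp1252, for instance) would need a different round trip, and recovered non-English text would still need translation or multilingual embeddings to be useful.

```python
def repair_mojibake(text):
    """Try to undo UTF-8 text that was wrongly decoded as Latin-1."""
    try:
        return text.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not recoverable with this simple round trip; leave the text unchanged
        return text

# Escapes spell out the garbled 'nao' / 'esta' as they appear in the data
garbled = 'machine n\u00c3\u00a3o est\u00c3\u00a1 funcionando'
repaired = repair_mojibake(garbled)
print(repaired)
```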

In [39]:
df.iloc[3680:3689,:]

Unnamed: 0,Short description,Description,Assignment group
3900,problems with ap vpn,received from: elixsfvu.pxwbjofl@gmail.com ...,GRP_0
3901,kis documents will not generate because the st...,kis documents will not generate because the st...,GRP_18
3902,wrong unit price on inwarehouse_tool,received from: ovhtgsxd.dcqhnrmy@gmail.com ...,GRP_13
3903,,,GRP_30
3904,usa pc companyst-apc-01 in the pvd area cannot...,usa pc companyst-apc-01 in the pvd area needs ...,GRP_3
3906,,-user unable tologin to vpn. -connected to th...,GRP_0
3907,i am not able to log into my vpn. when i am tr...,name:mehrugshy\nlanguage:\nbrowser:microsoft i...,GRP_0
3911,vpn connectivity,received from: gmrxwqlf.vzacdmbj@gmail.com ...,GRP_0
3912,referencing ticket ticket_no1499477. customer ...,please revisit ticket ticket_no1499477 and rev...,GRP_20


In [40]:
df.isna().apply(pd.value_counts)   #null value check 

Unnamed: 0,Short description,Description,Assignment group
False,7909,7909,7909


In [41]:
def dup_col(row):
    # If both text fields are identical, keep a single copy
    if row['Short description'] == row['Description']:
        return row['Description']
    # Otherwise concatenate them with a space in between
    return row['Short description'] + ' ' + row['Description']

df['Clean_Desc'] = df.apply(dup_col, axis=1)

In [42]:
df.to_excel("input_data_clean_2.xlsx", sheet_name='Sheet1')  

In [43]:
df

Unnamed: 0,Short description,Description,Assignment group,Clean_Desc
0,login issue,-verified user details.(employee# & manager na...,GRP_0,login issue -verified user details.(employee# ...
1,outlook,received from: hmjdrvpb.komuaywn@gmail.com ...,GRP_0,outlook received from: hmjdrvpb.komuaywn@gma...
2,cant log in to vpn,received from: eylqgodm.ybqkwiam@gmail.com ...,GRP_0,cant log in to vpn received from: eylqgodm.y...
3,unable to access hr_tool page,unable to access hr_tool page,GRP_0,unable to access hr_tool page
4,skype error,skype error,GRP_0,skype error
...,...,...,...,...
8495,emails not coming in from zz mail,received from: avglmrts.vhqmtiua@gmail.com ...,GRP_29,emails not coming in from zz mail received f...
8496,telephony_software issue,telephony_software issue,GRP_0,telephony_software issue
8497,vip2: windows password reset for tifpdchb pedx...,vip2: windows password reset for tifpdchb pedx...,GRP_0,vip2: windows password reset for tifpdchb pedx...
8498,machine no est funcionando,i am unable to access the machine utilities to...,GRP_62,machine no est funcionando i am unable to acce...


### Tokenizer
A regular-expression-based tokenizer that keeps alphabetic sequences and strips numeric sequences.

In [44]:
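def Desc_to_words(desc):
    # Split on runs of word characters (letters, digits, underscore)
    words = RegexpTokenizer(r'\w+').tokenize(desc)
    # Strip digits and every 'x'/'X' character from each token, then lowercase.
    # Note: this also removes 'x' inside ordinary words (e.g. 'excel' becomes 'ecel').
    words = [re.sub(r'[xX\d]+', '', w).lower() for w in words]
    # Drop tokens that became empty
    words = [w for w in words if w != '']
    return words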
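
The pattern in `Desc_to_words` removes every `x` and digit, wherever it occurs in a token. If the intent was only to discard placeholder runs such as `xxxx` (an assumption; the source does not say), a variant along the following lines would leave ordinary words intact. The name `desc_to_words_sketch` is hypothetical, and `re.findall` stands in for `RegexpTokenizer` so the example has no NLTK dependency.

```python
import re

def desc_to_words_sketch(desc):
    # Lowercase first, then split on runs of word characters
    tokens = re.findall(r'\w+', desc.lower())
    cleaned = []
    for tok in tokens:
        # Remove digits inside the token
        tok = re.sub(r'\d+', '', tok)
        # Keep the token unless it is empty or consists only of 'x'
        if tok and not re.fullmatch(r'x+', tok):
            cleaned.append(tok)
    return cleaned

tokens = desc_to_words_sketch('Excel error 404 for xxxx')
print(tokens)
```

Keeping words such as `excel` whole matters for the GloVe lookup later on, since a token like `ecel` is unlikely to have a pre-trained vector.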

### Vocabulary
Extracting all the unique words from the dataset.

In [45]:
all_words = list()
for desc in df['Clean_Desc']:
    for w in Desc_to_words(desc):
        all_words.append(w)

In [46]:
print('Size of vocabulary: {}'.format(len(set(all_words))))
print('Description\n', df['Clean_Desc'][10], '\n')
print('Tokens\n', Desc_to_words(df['Clean_Desc'][10]))

Size of vocabulary: 15785
Description
 engineering tool says not connected and unable to submit reports 

Tokens
 ['engineering', 'tool', 'says', 'not', 'connected', 'and', 'unable', 'to', 'submit', 'reports']


### Indexing
Indexing each unique word in the dataset by assigning it a unique number.

In [47]:
index_dict = dict()
count = 1
index_dict['<unk>'] = 0
for word in set(all_words):
    index_dict[word] = count
    count += 1

### Dataset
Using the indexed words to replace each word with its index makes the dataset numerical and readable by Keras.
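
The replacement of words by indices described here can be sketched as follows. The helper names `build_index` and `encode` are hypothetical; the `<unk>` entry at index 0 follows the indexing cell above. Sorting the vocabulary is an addition of this sketch: iterating over a plain `set` of strings can give a different order from one Python session to the next, which would change the word-to-index mapping.

```python
def build_index(tokens):
    # Reserve 0 for unknown words; sort so the mapping is reproducible across runs
    index = {'<unk>': 0}
    for i, word in enumerate(sorted(set(tokens)), start=1):
        index[word] = i
    return index

def encode(tokens, index):
    # Words missing from the index fall back to the '<unk>' id
    return [index.get(w, index['<unk>']) for w in tokens]

vocab = build_index(['skype', 'error', 'login', 'error'])
ids = encode(['skype', 'login', 'vpn'], vocab)
print(ids)
```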

In [48]:
# Load pre-trained GloVe vectors into a {word: vector} dictionary
embeddings_index = {}
with open('/content/drive/My Drive/AIML/11_NaturalLanguageProcessing/Week2_WordEmbedding/Mentor Material - Week 2/glove.6B.300d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
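
Each line of a GloVe text file holds a word followed by its vector components, separated by spaces. The parsing step can be tried without the full file, as in this sketch that reads two made-up 3-dimensional entries from an in-memory buffer (the helper `load_vectors` and the sample values are hypothetical).

```python
import io
import numpy as np

def load_vectors(handle):
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(' ')
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        # First field is the word, the rest are the vector components
        vectors[parts[0]] = np.asarray(parts[1:], dtype='float32')
    return vectors

sample = io.StringIO('the 0.1 0.2 0.3\nerror -0.5 0.0 0.25\n')
vectors = load_vectors(sample)
print(vectors['error'])
```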

#### Taking average of all word embeddings in a sentence to generate the sentence representation.

In [49]:
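data_list = list()
for desc in df['Clean_Desc']:
    sentence = np.zeros(300)  # running sum of 300-d GloVe vectors
    count = 0                 # number of tokens found in GloVe
    for w in Desc_to_words(desc):
        try:
            sentence += embeddings_index[w]
            count += 1
        except KeyError:
            continue  # token has no GloVe vector; skip it
    # NOTE: if no token was found, count is 0 and this division produces a row of NaN
    data_list.append(sentence / count)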
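
When a ticket has no token in the embedding vocabulary, `count` stays at 0 and the division above yields NaN for that row. A minimal sketch of the same averaging with a guard for that case is shown below; `average_embedding` and the 2-dimensional `toy_vectors` are hypothetical, and returning a zero vector is only one possible convention (dropping such rows is another).

```python
import numpy as np

def average_embedding(tokens, vectors, dim):
    # Collect the vectors of tokens that exist in the embedding dictionary
    found = [vectors[t] for t in tokens if t in vectors]
    if not found:
        # No known token: return zeros rather than dividing by zero
        return np.zeros(dim)
    return np.mean(found, axis=0)

toy_vectors = {'skype': np.array([1.0, 0.0]), 'error': np.array([0.0, 1.0])}
known = average_embedding(['skype', 'error', 'qwzx'], toy_vectors, dim=2)
empty = average_embedding([], toy_vectors, dim=2)
print(known, empty)
```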

#### Converting categorical labels to numerical format, followed by one-hot encoding of the numerical labels.

In [50]:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(df['Assignment group'])
df['Target'] = le.transform(df['Assignment group'])
df.head()

Unnamed: 0,Short description,Description,Assignment group,Clean_Desc,Target
0,login issue,-verified user details.(employee# & manager na...,GRP_0,login issue -verified user details.(employee# ...,0
1,outlook,received from: hmjdrvpb.komuaywn@gmail.com ...,GRP_0,outlook received from: hmjdrvpb.komuaywn@gma...,0
2,cant log in to vpn,received from: eylqgodm.ybqkwiam@gmail.com ...,GRP_0,cant log in to vpn received from: eylqgodm.y...,0
3,unable to access hr_tool page,unable to access hr_tool page,GRP_0,unable to access hr_tool page,0
4,skype error,skype error,GRP_0,skype error,0


### One hot Encoding
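
For reference, one-hot encoding of the integer labels produced by `LabelEncoder` can be sketched with NumPy alone, as below. The helper `one_hot` and the three-class example are hypothetical. scikit-learn classifiers accept integer labels directly, so this form is mainly relevant for a neural network trained with a categorical cross-entropy loss.

```python
import numpy as np

def one_hot(labels, num_classes):
    labels = np.asarray(labels)
    # Row i of the identity matrix is the one-hot vector for class i
    return np.eye(num_classes, dtype='float32')[labels]

encoded = one_hot([0, 2, 1], num_classes=3)
print(encoded)
```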

In [51]:
np.array(data_list)

array([[-0.17510451,  0.1941614 , -0.019432  , ..., -0.05165571,
        -0.15891811,  0.01151319],
       [-0.20180574, -0.0501844 , -0.1169173 , ..., -0.13440585,
        -0.01636232,  0.00236614],
       [-0.219786  ,  0.03106118, -0.22590529, ..., -0.06619683,
         0.014843  ,  0.0499007 ],
       ...,
       [-0.250205  ,  0.12759781,  0.0298854 , ...,  0.16857821,
        -0.22893   ,  0.1205251 ],
       [-0.1092463 ,  0.11192604, -0.03287869, ..., -0.18469939,
        -0.0517648 ,  0.11906922],
       [ 0.02294691,  0.02800554, -0.01523123, ...,  0.0810663 ,
         0.16094308,  0.27886   ]])

In [52]:
df.Target.values

array([ 0,  0,  0, ...,  0, 59, 44])

In [53]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(np.array(data_list), df.Target.values, test_size=0.15, random_state=0)

In [58]:
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(6722, 300)
(6722,)
(1187, 300)
(1187,)


In [60]:
X_train = X_train[:800]
y_train = y_train[:800]
X_test = X_test[:200]
y_test = y_test[:200]

In [63]:
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(800, 300)
(800,)
(200, 300)
(200,)


In [64]:
X_train[0]

array([-0.10401875,  0.25111988,  0.04213745, -0.04899287, -0.059037  ,
       -0.1317755 ,  0.2192235 , -0.01171787,  0.020197  , -0.59295876,
        0.16813237,  0.08359187,  0.16733266, -0.161739  , -0.03220563,
        0.04745151,  0.170906  , -0.09737017, -0.03025575, -0.132216  ,
        0.14542083, -0.1015875 ,  0.05818917, -0.02354188, -0.05926375,
       -0.1234435 , -0.09205576,  0.23311113, -0.02491975, -0.07179288,
        0.0515555 , -0.11438906, -0.09244915,  0.04265425, -0.11239937,
       -0.1777459 , -0.33023075,  0.0285175 , -0.1842135 ,  0.05831262,
        0.14859825,  0.16322588, -0.057699  ,  0.01038558,  0.0240335 ,
        0.14714125,  0.11978382,  0.00746251, -0.38865212,  0.0631255 ,
        0.29362413, -0.10149663,  0.00183213, -0.07878375, -0.11179213,
        0.12001575, -0.22737875,  0.22137612,  0.11527717,  0.08686825,
       -0.07260675,  0.08073087, -0.00350475, -0.110088  , -0.0900415 ,
        0.05149313,  0.17096912,  0.1044555 ,  0.02451888,  0.16

In [55]:
np.array(data_list).shape

(7909, 300)

#### Training and testing the classifier

In [62]:
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import accuracy_score
clf = BernoulliNB()
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
print(accuracy_score(y_test, pred))

ValueError: ignored
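
The traceback is not shown here, so the cause cannot be confirmed from this output. One plausible explanation is that some averaged-embedding rows are NaN: a ticket with no token in the GloVe vocabulary (for example, index 3903, whose text became empty after the non-ASCII characters were stripped) gives `count == 0`, and most scikit-learn estimators reject non-finite input. A sketch of how such rows could be located and filtered, on made-up arrays `X` and `y`:

```python
import numpy as np

# Hypothetical feature matrix with one all-NaN row, plus matching labels
X = np.array([[0.1, -0.2], [np.nan, np.nan], [0.3, 0.4]])
y = np.array([0, 5, 2])

# True for rows where every feature is a finite number
finite_rows = np.isfinite(X).all(axis=1)
bad_rows = np.flatnonzero(~finite_rows)
print('rows with NaN or inf:', bad_rows)

# Keep features and labels aligned by applying the same mask to both
X_clean, y_clean = X[finite_rows], y[finite_rows]
```

Separately, `BernoulliNB` binarizes its input at a threshold of 0.0 by default, so dense real-valued embeddings are reduced to their signs; a model designed for continuous features, such as `GaussianNB`, may be a more natural baseline for this representation.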

In [57]:
X_train

array([[-0.10401875,  0.25111988,  0.04213745, ..., -0.03960598,
        -0.12222125,  0.01296369],
       [-0.22905499,  0.28033275,  0.23902727, ..., -0.02308249,
        -0.035561  , -0.25083675],
       [-0.07622183,  0.086603  , -0.33500833, ...,  0.12186667,
        -0.15733017, -0.02880683],
       ...,
       [-0.2738608 , -0.00059899, -0.1437912 , ..., -0.029395  ,
        -0.2170958 ,  0.0027705 ],
       [-0.1678796 ,  0.08483297,  0.11993981, ..., -0.2295921 ,
        -0.01980518, -0.01407855],
       [-0.366693  ,  0.17339091, -0.0419679 , ..., -0.0682719 ,
        -0.2869566 ,  0.064745  ]])