# <p style="text-align: center;">Technical Support Data Analysis</p>

Technical support data is often a rich source of information on opportunities to improve customer experience. The less trouble customers have with the product, the better, and better still when they can overcome a
technical challenge quickly and with minimal effort. Let us do some basic analysis of the tech support data: problem types, time to resolve a problem, and the support channel that is most suitable.

## Loading the Dataset

In [1]:
%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
#import sklearn.metrics

tech_supp_df = pd.read_csv("technical_support_data.csv")
tech_supp_df.dtypes

FileNotFoundError: File b'technical_support_data.csv' does not exist

The dataset contains one record for each unique problem type. For each type it has metrics such as case count, average calls to resolve, and average resolution time.

In [None]:
tech_supp_df.head()
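
Since the load step fails when the CSV is not in the working directory, a small guard can make the failure easier to read. The following is only a sketch: `load_support_data` and `EXPECTED_COLUMNS` are hypothetical names, and the column list is assumed from the cells in this notebook rather than from the actual file.

```python
from pathlib import Path

import pandas as pd

# Column names assumed from the cells in this notebook; the real file may contain more.
EXPECTED_COLUMNS = ["PROBLEM_TYPE", "no_of_cases", "Avg_pending_calls",
                    "Avg_resol_time", "recurrence_freq", "Replace_percent"]


def load_support_data(path="technical_support_data.csv"):
    """Hypothetical helper: load the CSV and fail early with a readable message."""
    csv_path = Path(path)
    if not csv_path.is_file():
        raise FileNotFoundError(
            f"{csv_path.resolve()} not found; place the CSV next to the notebook "
            "or pass its full path."
        )
    df = pd.read_csv(csv_path)
    # Check that the columns used later in the analysis are present
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"Missing expected columns: {missing}")
    return df
```

Calling `load_support_data()` with no arguments would look for the file in the current working directory.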

## Group Data into similar clusters

Now we will use K-Means clustering to group the problem types by their attributes. First, we need to determine the optimal number of groups. For that we run the knee (elbow) test: plot the average distortion against k and look for the point where the curve bends.

In [None]:
tech_supp_attributes = tech_supp_df.drop("PROBLEM_TYPE",axis=1)

# Find the optimal number of clusters
from scipy.spatial.distance import cdist
clusters = range(1, 10)
meanDistortions = []

for k in clusters:
    model = KMeans(n_clusters=k)
    model.fit(tech_supp_attributes)
    # Distance from each record to its nearest cluster centre
    nearest = np.min(cdist(tech_supp_attributes, model.cluster_centers_, 'euclidean'), axis=1)
    # Average distortion for this k
    meanDistortions.append(nearest.sum() / tech_supp_attributes.shape[0])


plt.plot(clusters, meanDistortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Average distortion')
plt.title('Selecting k with the Elbow Method')
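
K-Means relies on Euclidean distance, so attributes measured on larger scales (counts, for example) can dominate attributes on smaller scales (percentages). The notebook clusters the raw attributes. The sketch below shows one possible variation, standardising first, on synthetic data. The array `demo` and its two columns are made up for illustration and are not the support dataset.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in: two features on very different scales
demo = np.column_stack([rng.normal(500, 200, 30), rng.normal(0.2, 0.1, 30)])

# Rescale each column to zero mean and unit variance
scaled = StandardScaler().fit_transform(demo)

inertias = []
for k in range(1, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(scaled)
    # inertia_ is the within-cluster sum of squared distances
    inertias.append(km.inertia_)
```

Plotting `inertias` against k gives an elbow curve comparable to the average-distortion plot, except that it uses squared distances.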


The bend does not come out clearly because the curve has several bends, so let us look at both 2 clusters and 3 clusters.
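
When the elbow is ambiguous, the silhouette score is a commonly used complementary check. It is not part of the original analysis. The sketch below runs it on synthetic, well-separated blobs, so `demo` and `centres` are illustrative assumptions only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(42)
# Synthetic stand-in: three well-separated blobs in two dimensions
centres = np.array([[0, 0], [10, 10], [20, 0]])
demo = np.vstack([c + rng.normal(0, 1, size=(20, 2)) for c in centres])

scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(demo)
    # Mean silhouette over all points; higher means tighter, better separated clusters
    scores[k] = silhouette_score(demo, labels)

best_k = max(scores, key=scores.get)
```

On real data the scores for neighbouring values of k may be close, in which case the boxplot inspection that follows is still needed.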

In [None]:
# Let us first start with K = 2
final_model=KMeans(2)
final_model.fit(tech_supp_attributes)
prediction=final_model.predict(tech_supp_attributes)

#Append the prediction 
tech_supp_df["GROUP"] = prediction
print("Groups Assigned : \n")
tech_supp_df[["PROBLEM_TYPE", "GROUP"]]
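
Once a `GROUP` column exists, group sizes and per-group averages give a quick numeric summary before plotting. The rows below are hypothetical. Only the column names are taken from this notebook.

```python
import pandas as pd

# Hypothetical rows; only the column names come from this notebook
demo = pd.DataFrame({
    "PROBLEM_TYPE": ["A", "B", "C", "D", "E"],
    "no_of_cases": [170, 12, 5, 200, 3],
    "Avg_resol_time": [32, 150, 160, 10, 180],
    "GROUP": [0, 1, 1, 0, 1],
})

# Number of problem types assigned to each cluster
group_sizes = demo["GROUP"].value_counts().sort_index()

# Mean of each metric within each cluster
group_means = demo.groupby("GROUP")[["no_of_cases", "Avg_resol_time"]].mean()
```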

Analyze the distribution of the data among the two groups (K = 2). One of the most informative visual tools for this is the boxplot.


In [None]:
#plt.cla()

# plt.boxplot([[tech_supp_df["no_of_cases"][tech_supp_df.GROUP==0]],
#              [tech_supp_df["no_of_cases"][tech_supp_df.GROUP==1]] ],
#               labels=('GROUP 1','GROUP 2'))





tech_supp_df.boxplot(by='GROUP',layout=(2,4),figsize=(15,10))




In [None]:
# The K = 2 boxplot clearly shows outliers in group 1, indicating that group 1 is stretched
# and that there is probably another cluster. Let us try K = 3, the next elbow point
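
The outliers a boxplot shows follow, by default, the 1.5 × IQR rule. This can be checked numerically with a small sketch. `iqr_outliers` is a hypothetical helper and `sample` is made-up data.

```python
import pandas as pd


def iqr_outliers(values):
    """Hypothetical helper: return values outside 1.5 * IQR of the quartiles."""
    s = pd.Series(values, dtype=float)
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    # Keep only points below the lower fence or above the upper fence
    return s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]


sample = [10, 12, 11, 13, 12, 95]
outliers = iqr_outliers(sample)
```

Applied per group and per metric, this would list the points that stretch a cluster.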

In [None]:
# Now let us try K = 3
final_model=KMeans(3)
final_model.fit(tech_supp_attributes)
prediction=final_model.predict(tech_supp_attributes)

#Append the prediction 
tech_supp_df["GROUP"] = prediction
print("Groups Assigned : \n")
print(tech_supp_df[["PROBLEM_TYPE", "GROUP"]])
tech_supp_df.info()
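
K-Means numbers its clusters arbitrarily, so the group that is labelled 0 in one run may be labelled 2 in another. One possible way to keep references such as "group 1" stable is to renumber the labels by the mean of a chosen column. `relabel_by_mean` is a hypothetical helper and the rows are made up.

```python
import pandas as pd


def relabel_by_mean(df, label_col, sort_col):
    """Hypothetical helper: renumber labels 0..k-1 by ascending group mean of sort_col."""
    order = df.groupby(label_col)[sort_col].mean().sort_values().index
    mapping = {old: new for new, old in enumerate(order)}
    return df[label_col].map(mapping)


demo = pd.DataFrame({"GROUP": [2, 0, 1, 2, 0],
                     "no_of_cases": [5, 300, 40, 7, 280]})
demo["GROUP"] = relabel_by_mean(demo, "GROUP", "no_of_cases")
```

Passing `random_state` to `KMeans` is another option if reproducible runs are enough.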

In [None]:
#plt.cla()

# One Series per cluster, passed as a flat list of 1-D sequences
plt.boxplot([tech_supp_df["no_of_cases"][tech_supp_df.GROUP==0],
             tech_supp_df["no_of_cases"][tech_supp_df.GROUP==1],
             tech_supp_df["no_of_cases"][tech_supp_df.GROUP==2]],
            labels=('GROUP 1','GROUP 2','GROUP 3'))


In [None]:
# Analyzing with K = 3 seems to give a better segregation of the technical support tickets than K = 2.
# The boxes are tighter, indicating that the spread of data is much smaller with K = 3 than with K = 2, and there are no outliers!
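
The "tighter boxes" observation can also be expressed as a number, for example the largest within-group standard deviation for each choice of K. The sketch below uses synthetic one-feature data with three natural levels. `max_group_spread` is a hypothetical helper, not part of the original analysis.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Synthetic stand-in with three natural levels
values = np.concatenate([rng.normal(10, 1, 15),
                         rng.normal(50, 1, 15),
                         rng.normal(200, 1, 15)])
demo = pd.DataFrame({"no_of_cases": values})


def max_group_spread(k):
    """Hypothetical helper: largest within-cluster standard deviation for k clusters."""
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(demo[["no_of_cases"]])
    return demo["no_of_cases"].groupby(labels).std().max()


spread_k2 = max_group_spread(2)
spread_k3 = max_group_spread(3)
```

With K = 2 two of the levels are forced into one cluster, which inflates its spread. With K = 3 each level gets its own cluster.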

In [None]:
# Now that we have 3 clusters to work with, let us boxplot Avg_pending_calls
plt.cla()
plt.boxplot([tech_supp_df["Avg_pending_calls"][tech_supp_df.GROUP==0],
             tech_supp_df["Avg_pending_calls"][tech_supp_df.GROUP==1],
             tech_supp_df["Avg_pending_calls"][tech_supp_df.GROUP==2]],
            labels=('GROUP 1','GROUP 2','GROUP 3'))
 

From the box plot it is clear that technical issues belonging to groups 2 and 3 take much less time to resolve and hence have fewer pending calls, even though they are the most frequently occurring tech support issues (box plot 1).

Groups 2 and 3 may hold the most frequently reported issues and take less time to resolve, but do these issues recur, i.e. does the same person report them multiple times, which would explain the high count (box plot 1)?

In [None]:
plt.cla()
plt.boxplot([tech_supp_df["recurrence_freq"][tech_supp_df.GROUP==0],
             tech_supp_df["recurrence_freq"][tech_supp_df.GROUP==1],
             tech_supp_df["recurrence_freq"][tech_supp_df.GROUP==2]],
            labels=('GROUP 1','GROUP 2','GROUP 3'))

Group 2 technical issues are high in count, but most of them are recurring. These issues are simple to solve but recur frequently, indicating an opportunity for quality improvement. This report needs to be brought to the notice of the engineering department.

Group 3 issues, which do not occur as frequently as those in group 2, include a small percentage of recurring cases. They are easy to resolve, but the recurrence suggests a need to train the technical support staff to do a quality check before closing an issue.

In [None]:
# Analyse the groups by Replace_percent, i.e. the percentage of cases that need replacement

In [None]:
plt.cla()
plt.boxplot([tech_supp_df["Replace_percent"][tech_supp_df.GROUP==0],
             tech_supp_df["Replace_percent"][tech_supp_df.GROUP==1],
             tech_supp_df["Replace_percent"][tech_supp_df.GROUP==2]],
            labels=('GROUP 1','GROUP 2','GROUP 3'))

The replacement rate for groups 2 and 3 is almost non-existent, again indicating that these issues are easy to resolve, whereas group 1 is a cluster of issues that need more effort and perhaps replacement too.

In [None]:
# Now that we have 3 clusters to work with, let us boxplot Avg_resol_time in days
plt.cla()
plt.boxplot([tech_supp_df["Avg_resol_time"][tech_supp_df.GROUP==0],
             tech_supp_df["Avg_resol_time"][tech_supp_df.GROUP==1],
             tech_supp_df["Avg_resol_time"][tech_supp_df.GROUP==2]],
            labels=('GROUP 1','GROUP 2','GROUP 3'))
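
Whether two metrics carry much the same information can also be checked with a correlation coefficient. The values below are hypothetical and only the column names come from this notebook, so treat this as a sketch of the check rather than a result.

```python
import pandas as pd

# Hypothetical values; only the column names come from this notebook
demo = pd.DataFrame({
    "Avg_pending_calls": [1.0, 1.2, 3.5, 3.8, 1.1],
    "Avg_resol_time": [10, 14, 150, 160, 12],
})

# Pearson correlation; values near 1 suggest the two metrics are largely redundant
corr = demo["Avg_pending_calls"].corr(demo["Avg_resol_time"])
```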
 

In [None]:
# The distribution of average resolution time across the three clusters reflects the same information as average pending calls.

# In view of this analysis, one can think of providing self-help facilities to customers for group 2 and group 3 issues.
# Even a chat facility or helpline number may bring down these issues, and the customer is likely to feel good
# about immediate help and resolution.
# One may even consider automating the ticket resolutions.
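
The per-metric boxplot cells in this notebook differ only in the column name, so they could be folded into one function. This is a sketch under assumptions: `boxplots_by_group` is a hypothetical helper, the rows in `demo` are made up, and cluster labels are assumed to be integers starting at 0.

```python
import matplotlib.pyplot as plt
import pandas as pd


def boxplots_by_group(df, metrics, group_col="GROUP"):
    """Hypothetical helper: one panel per metric, one box per cluster."""
    groups = sorted(df[group_col].unique())
    fig, axes = plt.subplots(1, len(metrics),
                             figsize=(4 * len(metrics), 4), squeeze=False)
    for ax, metric in zip(axes[0], metrics):
        # Flat list with one 1-D array per cluster
        data = [df.loc[df[group_col] == g, metric].to_numpy() for g in groups]
        ax.boxplot(data)
        # Clusters are numbered from 0 but displayed from 1, as in the cells above
        ax.set_xticklabels([f"GROUP {g + 1}" for g in groups])
        ax.set_title(metric)
    return fig


# Hypothetical rows; only the column names come from this notebook
demo = pd.DataFrame({
    "no_of_cases": [170, 12, 5, 200, 3, 40],
    "Avg_resol_time": [32, 150, 160, 10, 180, 60],
    "GROUP": [0, 1, 1, 0, 1, 2],
})
fig = boxplots_by_group(demo, ["no_of_cases", "Avg_resol_time"])
```

With the real data, the call would take `tech_supp_df` and the list of metric columns to compare.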
