### Introduction
This project extracts relevant key performance indicators (KPIs) and metrics that PhoneNow can use to view long-term trends in customer and agent behaviour. PhoneNow is a telecommunications company interested in visualizing its data so that the important aspects stand out clearly.

In [1]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

%matplotlib inline

## Data Wrangling

### Data Gathering

In [2]:
#load the dataset
df = pd.read_excel('01-Call-Center-Dataset.xlsx')

### Assessing Data

### Visual Assessment

In [3]:
#show all rows and columns from dataset
df

Unnamed: 0,Call Id,Agent,Date,Time,Topic,Answered (Y/N),Resolved,Speed of answer in seconds,AvgTalkDuration,Satisfaction rating
0,ID0001,Diane,2021-01-01,09:12:58,Contract related,Y,Y,109.0,00:02:23,3.0
1,ID0002,Becky,2021-01-01,09:12:58,Technical Support,Y,N,70.0,00:04:02,3.0
2,ID0003,Stewart,2021-01-01,09:47:31,Contract related,Y,Y,10.0,00:02:11,3.0
3,ID0004,Greg,2021-01-01,09:47:31,Contract related,Y,Y,53.0,00:00:37,2.0
4,ID0005,Becky,2021-01-01,10:00:29,Payment related,Y,Y,95.0,00:01:00,3.0
...,...,...,...,...,...,...,...,...,...,...
4995,ID4996,Jim,2021-03-31,16:37:55,Payment related,Y,Y,22.0,00:05:40,1.0
4996,ID4997,Diane,2021-03-31,16:45:07,Payment related,Y,Y,100.0,00:03:16,3.0
4997,ID4998,Diane,2021-03-31,16:53:46,Payment related,Y,Y,84.0,00:01:49,4.0
4998,ID4999,Jim,2021-03-31,17:02:24,Streaming,Y,Y,98.0,00:00:58,5.0


### Programmatic Assessment

In [4]:
#list first 5 rows
df.head()

Unnamed: 0,Call Id,Agent,Date,Time,Topic,Answered (Y/N),Resolved,Speed of answer in seconds,AvgTalkDuration,Satisfaction rating
0,ID0001,Diane,2021-01-01,09:12:58,Contract related,Y,Y,109.0,00:02:23,3.0
1,ID0002,Becky,2021-01-01,09:12:58,Technical Support,Y,N,70.0,00:04:02,3.0
2,ID0003,Stewart,2021-01-01,09:47:31,Contract related,Y,Y,10.0,00:02:11,3.0
3,ID0004,Greg,2021-01-01,09:47:31,Contract related,Y,Y,53.0,00:00:37,2.0
4,ID0005,Becky,2021-01-01,10:00:29,Payment related,Y,Y,95.0,00:01:00,3.0


In [5]:
#check the number of rows, columns, datatypes and missing data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 10 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Call Id                     5000 non-null   object 
 1   Agent                       5000 non-null   object 
 2   Date                        5000 non-null   object 
 3   Time                        5000 non-null   object 
 4   Topic                       5000 non-null   object 
 5   Answered (Y/N)              5000 non-null   object 
 6   Resolved                    5000 non-null   object 
 7   Speed of answer in seconds  4054 non-null   float64
 8   AvgTalkDuration             4054 non-null   object 
 9   Satisfaction rating         4054 non-null   float64
dtypes: float64(2), object(8)
memory usage: 390.8+ KB


In [6]:
#check basic statistics
df.describe()

Unnamed: 0,Speed of answer in seconds,Satisfaction rating
count,4054.0,4054.0
mean,67.52072,3.403552
std,33.592872,1.21222
min,10.0,1.0
25%,39.0,3.0
50%,68.0,3.0
75%,97.0,4.0
max,125.0,5.0


In [7]:
#check basic statistics, including non-numeric columns
df.describe(include='all')

Unnamed: 0,Call Id,Agent,Date,Time,Topic,Answered (Y/N),Resolved,Speed of answer in seconds,AvgTalkDuration,Satisfaction rating
count,5000,5000,5000,5000,5000,5000,5000,4054.0,4054,4054.0
unique,5000,8,90,375,5,2,2,,391,
top,ID0001,Jim,2021-01-11,11:55:41,Streaming,Y,Y,,00:04:43,
freq,1,666,84,30,1022,4054,3646,,22,
mean,,,,,,,,67.52072,,3.403552
std,,,,,,,,33.592872,,1.21222
min,,,,,,,,10.0,,1.0
25%,,,,,,,,39.0,,3.0
50%,,,,,,,,68.0,,3.0
75%,,,,,,,,97.0,,4.0


In [8]:
#check for null values
df.isna().sum()

Call Id                         0
Agent                           0
Date                            0
Time                            0
Topic                           0
Answered (Y/N)                  0
Resolved                        0
Speed of answer in seconds    946
AvgTalkDuration               946
Satisfaction rating           946
dtype: int64
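
The three columns with missing values each have exactly 946 nulls, and `df.describe(include='all')` reports 4054 `Y` values in `Answered (Y/N)`, which leaves 946 unanswered calls. That suggests, but does not prove, that the nulls belong to unanswered calls. A minimal sketch of how this could be checked is shown below. It uses a small made-up frame named `sample` rather than the real dataset, so the values are illustrative only.

```python
import pandas as pd

# Toy frame mimicking the dataset's columns (values are made up for illustration)
sample = pd.DataFrame({
    "Answered (Y/N)": ["Y", "N", "Y", "N"],
    "Speed of answer in seconds": [109.0, None, 70.0, None],
    "AvgTalkDuration": ["00:02:23", None, "00:04:02", None],
    "Satisfaction rating": [3.0, None, 3.0, None],
})

metric_cols = ["Speed of answer in seconds", "AvgTalkDuration", "Satisfaction rating"]

# Count missing values per metric column, split by whether the call was answered
missing_by_answered = sample[metric_cols].isna().groupby(sample["Answered (Y/N)"]).sum()
print(missing_by_answered)
```

If the same check on `df` shows all missing values in the `N` row and none in the `Y` row, the nulls are structural (no call took place) rather than data-entry gaps.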

In [9]:
#check for duplicates
df.duplicated().sum()

0

In [10]:
#check the unique values in Agent
df.Agent.unique()

array(['Diane', 'Becky', 'Stewart', 'Greg', 'Jim', 'Joe', 'Martha', 'Dan'],
      dtype=object)

In [11]:
#check the unique values in Topic
df.Topic.unique()

array(['Contract related', 'Technical Support', 'Payment related',
       'Admin Support', 'Streaming'], dtype=object)

### Tidiness issues

* The date and time of each call are stored in separate columns (`Date` and `Time`)

### Quality issues

* `Date` column is string/object datatype
* `Time` column is string/object datatype
* `Answered (Y/N)` column name contains spaces and special characters, which makes it awkward to work with
* `Speed of answer in seconds` column name is too long
* `Satisfaction rating` column name contains a space
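
The column-name issues listed above could be handled with a single `rename` call. The sketch below is one possible approach, not the notebook's actual cleaning step. The new names (`answered`, `answer_speed_sec`, `satisfaction_rating`) are hypothetical choices made for illustration.

```python
import pandas as pd

# Empty frame with the same column names as the dataset (no real data needed here)
sample = pd.DataFrame(columns=[
    "Call Id", "Agent", "Date", "Time", "Topic", "Answered (Y/N)",
    "Resolved", "Speed of answer in seconds", "AvgTalkDuration",
    "Satisfaction rating",
])

# Hypothetical replacement names: short, no spaces, no special characters
rename_map = {
    "Answered (Y/N)": "answered",
    "Speed of answer in seconds": "answer_speed_sec",
    "Satisfaction rating": "satisfaction_rating",
}

# rename returns a new frame; columns not in the mapping are left unchanged
renamed = sample.rename(columns=rename_map)
print(list(renamed.columns))
```

Names without spaces can also be accessed with attribute syntax, for example `renamed.answered`, in the same way the notebook uses `df.Agent`.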

### Cleaning

### Tidiness Issues

In [12]:
#make a copy of the dataset before cleaning
df_copy = df.copy()

#### Define
I will create a new column, `DateTime`, that combines the date and the time of each call.

#### Code

In [14]:
#create a new DateTime column by joining Date and Time as strings and parsing the result
df['DateTime'] = pd.to_datetime(df['Date'].astype(str) + ' ' + df['Time'].astype(str))

#### Test

In [15]:
#list the column datatypes
df.dtypes

Call Id                               object
Agent                                 object
Date                                  object
Time                                  object
Topic                                 object
Answered (Y/N)                        object
Resolved                              object
Speed of answer in seconds           float64
AvgTalkDuration                       object
Satisfaction rating                  float64
DateTime                      datetime64[ns]
dtype: object
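
Scanning the full `df.dtypes` output works, but the test can also target the new column directly. The following is a minimal sketch on a two-row toy frame named `sample` (an assumption for illustration; the two rows reuse date and time values displayed earlier). It repeats the same string-concatenation approach and then checks the result with `pd.api.types.is_datetime64_any_dtype`.

```python
import pandas as pd

# Two rows using date/time values shown in the notebook's output
sample = pd.DataFrame({
    "Date": ["2021-01-01", "2021-03-31"],
    "Time": ["09:12:58", "17:02:24"],
})

# Same approach as the cleaning step: join as strings, then parse
sample["DateTime"] = pd.to_datetime(
    sample["Date"].astype(str) + " " + sample["Time"].astype(str)
)

# Targeted check that the new column really is a datetime column
is_datetime = pd.api.types.is_datetime64_any_dtype(sample["DateTime"])

# Once parsed, components such as the hour are available through .dt
call_hours = sample["DateTime"].dt.hour.tolist()
print(is_datetime, call_hours)
```

Having a single datetime column makes time-based grouping (by hour, weekday, or month) available through the `.dt` accessor.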

### Quality Issues

#### Define
I will change the datatype of the `Date` column to datetime

#### Code

In [13]:
#change Date datatype to datetime
df.Date = pd.to_datetime(df.Date)

#### Test

In [14]:
#list the column datatypes to confirm that Date is now datetime
df.dtypes

Call Id                               object
Agent                                 object
Date                          datetime64[ns]
Time                                  object
Topic                                 object
Answered (Y/N)                        object
Resolved                              object
Speed of answer in seconds           float64
AvgTalkDuration                       object
Satisfaction rating                  float64
dtype: object

#### Define
I will change the datatype of the `Time` column to datetime

#### Code

In [15]:
#change Time datatype to datetime
df.Time = pd.to_datetime(df.Time)

TypeError: <class 'datetime.time'> is not convertible to datetime
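
The error message indicates that the `Time` cells hold `datetime.time` objects, which `pd.to_datetime` does not accept directly. One possible workaround, sketched below on a toy frame, is to convert the values to strings first and parse them with `pd.to_timedelta`, which gives the time of day as an offset from midnight. The column name `TimeOfDay` is a hypothetical choice, and this is only one option; the `DateTime` column created earlier already carries the time of day.

```python
import datetime as dt
import pandas as pd

# Toy frame whose Time column holds datetime.time objects, as in the error above
sample = pd.DataFrame({"Time": [dt.time(9, 12, 58), dt.time(17, 2, 24)]})

# str(datetime.time) yields 'HH:MM:SS', which pd.to_timedelta can parse
sample["TimeOfDay"] = pd.to_timedelta(sample["Time"].astype(str))

# The result is a timedelta column, so .dt accessors and arithmetic work
is_timedelta = pd.api.types.is_timedelta64_dtype(sample["TimeOfDay"])
first_value = sample["TimeOfDay"].iloc[0]
print(is_timedelta, first_value)
```

A timedelta keeps the column free of a placeholder calendar date, which parsing the time alone as a datetime would otherwise introduce.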

#### Test

In [14]:
#list the column datatypes (Time is still object because the conversion above failed)
df.dtypes

Call Id                               object
Agent                                 object
Date                          datetime64[ns]
Time                                  object
Topic                                 object
Answered (Y/N)                        object
Resolved                              object
Speed of answer in seconds           float64
AvgTalkDuration                       object
Satisfaction rating                  float64
dtype: object