# PyHR Services Data Cleanup and Exploration

### Introduction 
This notebook contains the data cleanup and exploration for PyHR Services, a provider of human resources business process outsourcing (BPO) services and consulting. The notebook looks at the inconsistencies in the columns caused by typos, missing data, and other anomalies. The result is a reorganized CSV file with data ready for analysis.

### Data Extraction 
* The PyHR Services data comes from the HRPy ticket system.
* Data Provided: 
    * Case ID
    * Company
    * Region
    * Status
    * Source
    * Creator
    * Current Agent
    * Creation Date
    * Creation Year
    * Due Date
    * Closed Date
    * Service Group
    * Service
    * Pended Date
    * Pending Reason
    * Latest Communicated Date
    * Latest Communication to
    * User Group
    * Last Transfer Date
    * Type
    * Service Center
    * Case Age (in days)
    * Days From Latest Communication
    * Days From Last Transfer
    * Systems
    * Requestor
   

In [10]:
# Import Dependencies
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import csv
import requests

In [62]:
# File to load
ticket_data = "Resources/ticket_data.csv"

# Read the ticket file and store it in a pandas DataFrame
df_ticket = pd.read_csv(ticket_data, low_memory=False)

# Remove the last 3 rows, which hold junk rather than ticket data
df_ticket.drop(df_ticket.tail(3).index, inplace=True)

# Null regions are all US locations, so fill in those values
df_ticket["Region"] = df_ticket["Region"].fillna("USA")

# Clean up the Source values, which were entered as free text
df_ticket["Source"] = df_ticket["Source"].replace({
    "e-mail": "E-mail", "Email": "E-mail", "Chat": "E-mail",
    "MyHRW": "Ticket Management System", "Employee Portal": "Ticket Management System",
    "HRIS": "Interface", "Other 3rd Party System": "Interface",
    "Postal mail / Fax": "Fax & Mail", "Mail": "Fax & Mail", "Fax": "Fax & Mail",
})

# Drop the columns that are not needed for the analysis
df_ticket = df_ticket.drop(['Creator', 'Latest Communication to', 'Last Transfer Date', 'Days From Last Transfer'], axis=1)

# Show the last 25 rows
df_ticket.tail(25)

Unnamed: 0,Case ID,Company,Region,Status,Source,Current Agent,Creation Date,Due Date,Closed Date,Service Group,...,Latest Communicated Date,User Group,Type,Service Center,Case Age (in days),Days From Latest Communication,Days From Last Transfer,Systems,Requestor,Unnamed: 25
64661,9159872,US,USA,Pending,Call,MyHRW_EmilyP,03/11/19 14:38,03/13/19 14:38,,Payroll Data Management,...,03/12/19 18:04,,Case,Jacksonville,2.67,1.5,,,Employee,Data Transaction
64662,9160305,US,USA,Pending,Call,MyHRW_KyleD,03/11/19 15:01,03/13/19 15:01,,Organizational Data Management,...,,,Case,Jacksonville,2.63,,,,Employee,Data Transaction
64663,9161134,US,USA,Pending,Ticket Management System,MyHRW_KyleD,03/11/19 15:46,03/13/19 15:46,,Compensation administration,...,03/13/19 15:44,,Case,Jacksonville,2.63,0.63,,,Employee,Data Transaction
64664,9161192,US,USA,Pending,Ticket Management System,MyHRW_KyleD,03/11/19 15:49,03/13/19 15:49,,Compensation administration,...,,,Case,Jacksonville,2.63,,,,Employee,Data Transaction
64665,9161589,US,USA,Pending,E-mail,MyHRW_johnjarieln,03/11/19 16:12,03/18/19 16:13,,Workforce Administration,...,03/12/19 21:53,,Case,Jacksonville,2.58,1.38,,,Employee,Case & Issue
64666,9161280,US,USA,Pending,E-mail,MyHRW_johnjarieln,03/11/19 15:52,03/18/19 15:52,,Request Handling,...,03/12/19 19:58,Vendor - CRQ,Case,Jacksonville,2.63,1.46,,,Client Business Partner,Case & Issue
64667,9161898,US,USA,Pending,E-mail,myhrw_todds,03/11/19 16:40,03/18/19 16:40,,Request Handling,...,03/13/19 19:08,Vendor - CRQ,Case,Jacksonville,2.58,0.46,,,Employee,Case & Issue
64668,9162122,US,USA,Pending,Call,MyHRW_EmilyP,03/11/19 17:04,03/13/19 17:05,,Organizational Data Management,...,03/12/19 14:59,Vendor - DMA,Case,Jacksonville,2.54,1.67,,,Manager,Data Transaction
64669,9162334,US,USA,Pending,Ticket Management System,Myhrw_CaseyR,03/11/19 17:27,03/19/19 00:41,,Leave Administration,...,03/13/19 20:11,,Case,Jacksonville,2.54,0.42,,,Employee,Case & Issue
64670,9162371,US,USA,Pending,Ticket Management System,MyHRW_EmilyP,03/11/19 17:30,03/13/19 17:30,,Organizational Data Management,...,03/12/19 14:59,,Case,Jacksonville,2.54,1.67,,,Manager,Data Transaction
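
The cleaning steps in the cell above can be tried on a small frame before running them against the full export. The following is a minimal sketch, not the notebook's actual pipeline: the `toy` rows, the trimmed `source_map`, and the `cleaned` name are made up for illustration, and `errors="ignore"` is an added assumption that lets the drop step be re-run without failing on columns that are already gone.

```python
import pandas as pd

# Hypothetical toy rows standing in for the real ticket export
toy = pd.DataFrame({
    "Region": ["EMEA", None, "APAC"],
    "Source": ["Email", "MyHRW", "Fax"],
    "Creator": ["a", "b", "c"],
})

# Same mapping idea as the notebook cell, trimmed to three entries
source_map = {
    "Email": "E-mail",
    "MyHRW": "Ticket Management System",
    "Fax": "Fax & Mail",
}

cleaned = toy.copy()
# Fill missing regions with the default used in the notebook
cleaned["Region"] = cleaned["Region"].fillna("USA")
# Collapse free-text source labels into the canonical categories
cleaned["Source"] = cleaned["Source"].replace(source_map)
# errors="ignore" skips labels that are absent, so the step can be re-run
cleaned = cleaned.drop(columns=["Creator", "Last Transfer Date"], errors="ignore")

print(cleaned)
```

Working on a copy keeps the raw frame available for comparison if a mapping turns out to be wrong.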


In [53]:
print(df_ticket.shape)
print(type(df_ticket))

df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
print(df.shape)
df.head()


(64686, 26)
<class 'pandas.core.frame.DataFrame'>
(2, 2)


Unnamed: 0,col1,col2
0,1,3
1,2,4


## Initial Data Exploration 

On an initial exploration of the data, we see that some columns have missing values: Region, Closed Date, Pended Date, Pending Reason, Latest Communicated Date, Latest Communication to, User Group, Last Transfer Date, Days From Latest Communication, Days From Last Transfer, Systems, and Request Type.

A missing value does not necessarily mean that the data is absent: sometimes it just needs to be extracted and reformatted, and sometimes the field is not applicable. Currently, there are 64687 rows of information and 25 columns.
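
One way to quantify how much is missing per column is to count nulls and express them as a share of the rows. This is a minimal sketch on made-up data; the `missing_summary` helper and the `toy` frame are hypothetical and not part of the original notebook.

```python
import pandas as pd

def missing_summary(df):
    """Return per-column missing counts and percentages, most-missing first."""
    missing = df.isna().sum()
    summary = pd.DataFrame({
        "missing": missing,
        "pct_missing": (missing / len(df) * 100).round(1),
    })
    return summary.sort_values("missing", ascending=False)

# Hypothetical toy frame; the column names mirror the ticket data
toy = pd.DataFrame({
    "Region": ["USA", None, None, "EMEA"],
    "Status": ["Closed", "Pending", "Closed", "Closed"],
    "Last Transfer Date": [None, None, None, None],
})

summary = missing_summary(toy)
print(summary)
```

On the real data the same helper could be called as `missing_summary(df_ticket)` to separate columns that are entirely empty from those that are only sparsely filled.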

In [66]:
# List the column headers and the unique values in selected columns
header_list = list(df_ticket.columns.values)
print(header_list)

company_list = df_ticket['Company'].unique().tolist()
#print(company_list)
region_list = df_ticket['Region'].unique().tolist()
#print(region_list)
#status_list = df_ticket['Status'].unique().tolist()
#print(status_list)
source_list = df_ticket['Source'].unique().tolist()
#print(source_list)
#print(df_ticket["Source"].value_counts())
days_list = df_ticket['Days From Last Transfer'].unique().tolist()
print(days_list)
print(df_ticket['Days From Last Transfer'].value_counts())
#agent list is garbage, but maybe we do something with it later
#agent_list = df_ticket['Current Agent'].unique().tolist()
#print(agent_list)
service_group_list = df_ticket['Service Group'].unique().tolist()
print(service_group_list)
print(df_ticket['Service Group'].value_counts())

print(df_ticket.nunique())
#for header in header_list:
    #print(df_ticket[header].nunique())

['Case ID ', 'Company', 'Region', 'Status', 'Source', 'Current Agent', 'Creation Date', 'Due Date', 'Closed Date', 'Service Group', 'Service', 'Pended Date', 'Pending Reason', 'Latest Communicated Date', 'User Group', 'Type', 'Service Center', 'Case Age (in days)', 'Days From Latest Communication', 'Days From Last Transfer', 'Systems', 'Requestor', 'Unnamed: 25']
[0.0, nan]
0.0    61688
Name: Days From Last Transfer, dtype: int64
['09. Leave Management (MyHRW)', '11. Payroll Management (MyHRW)', '02. Information/Inquiry (Simple) (MyHRW)', '17. Workforce Administration (MyHRW)', '06. Compensation Administration (MyHRW)', '10. Organizational Management (MyHRW)', '12. Performance Management (MyHRW)', '04. Benefits (MyHRW)', '13. Recruiting (MyHRW)', '03. Application/System Support (MyHRW)', '14. Reporting (MyHRW)', 'Workforce Administration', 'Leave Administration', 'Time and Attendance Data Management', 'Payroll Cycles', 'Organizational Data Management', 'Inbound Interface Administration
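
The output above shows two things that may be worth normalizing: the `'Case ID '` header has a trailing space, and some Service Group labels carry a numeric prefix and a `(MyHRW)` suffix. The sketch below shows one possible cleanup on made-up rows. Whether these prefixed labels should be merged with the unprefixed ones is an assumption, and the `Service Group Clean` column name is hypothetical.

```python
import pandas as pd

# Hypothetical toy rows using label shapes seen in the output above
toy = pd.DataFrame({
    "Case ID ": [1, 2, 3],
    "Service Group": [
        "09. Leave Management (MyHRW)",
        "04. Benefits (MyHRW)",
        "Payroll Cycles",
    ],
})

# Strip stray whitespace from the column headers
toy.columns = toy.columns.str.strip()

# Remove a leading "NN. " prefix and a trailing " (MyHRW)" suffix
toy["Service Group Clean"] = (
    toy["Service Group"]
    .str.replace(r"^\d+\.\s*", "", regex=True)
    .str.replace(r"\s*\(MyHRW\)$", "", regex=True)
    .str.strip()
)

print(toy)
```

Writing the result to a new column keeps the original labels available for checking the mapping.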

In [18]:
# Check for missing values 
df_ticket.count()

Case ID                           64688
Company                           64688
Region                            29354
Status                            64686
Source                            64686
Creator                           64686
Current Agent                     64685
Creation Date                     64686
Due Date                          64686
Closed Date                       62336
Service Group                     64686
Service                           64686
Pended Date                         662
Pending Reason                      844
Latest Communicated Date          18034
Latest Communication to               0
User Group                        57883
Last Transfer Date                    0
Type                              64686
Service Center                    64686
Case Age (in days)                64686
Days From Latest Communication    62737
Days From Last Transfer           61688
Systems                            8125
Requestor                         64259
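
Columns with a count of 0, such as Latest Communication to and Last Transfer Date in the output above, hold no values at all. A minimal sketch of finding and dropping such columns follows; the `toy` frame and the `empty_cols` and `trimmed` names are made up for illustration.

```python
import pandas as pd

# Hypothetical toy frame with one completely empty column
toy = pd.DataFrame({
    "Case ID": [1, 2, 3],
    "Pended Date": [None, "03/12/19 10:00", None],
    "Last Transfer Date": [None, None, None],
})

# count() returns the number of non-null values per column
counts = toy.count()
empty_cols = counts[counts == 0].index.tolist()
print(empty_cols)

# Drop every column in which all values are missing
trimmed = toy.dropna(axis=1, how="all")
print(trimmed.columns.tolist())
```

Using `how="all"` keeps sparsely filled columns such as Pended Date, which may still carry useful information.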


## Initial Data Clean Up

In this section...

In [36]:
# Check the data frame size after dropping the rows
df_ticket.shape

## Reorganized and Updated Data Frame

After the initial exploration of the data and clean up...

In [46]:
# Export to csv
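
The export cell is empty in the source. A minimal sketch of what the export step could look like is shown here, assuming the cleaned frame is written with `DataFrame.to_csv`. The output file name `ticket_data_clean.csv`, the temporary directory, and the toy `cleaned` frame are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the cleaned ticket frame
cleaned = pd.DataFrame({
    "Case ID": [9159872, 9160305],
    "Region": ["USA", "USA"],
})

# Hypothetical output location; a temp directory keeps the sketch side-effect free
out_path = os.path.join(tempfile.mkdtemp(), "ticket_data_clean.csv")

# index=False keeps the row index out of the file
cleaned.to_csv(out_path, index=False)

# Read the file back to confirm the columns survived the round trip
round_trip = pd.read_csv(out_path)
print(round_trip.shape)
```

Leaving the index out of the file avoids an extra unnamed column appearing when the CSV is loaded again.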
