# Data Cleaning & Manipulation

### Steps:
1. Remove surrounding spaces in column names
2. Simplify column names
3. Separate the dataset into three dataframes (the 2nd and 3rd dataframes, *contact_reasons* and *effects_of_subquestions_on_ratings*, will be cleaned and plotted in another notebook)
4. Reshape the main dataframe
5. Rearrange bin sizes for age and number of contacts with each department
6. Convert location values to provide meaningful information
7. Missing values

In [1]:
import pandas as pd
import numpy as np
from plotly import graph_objects as go
import plotly.express as px

In [2]:
df = pd.read_csv("satisfaction_survey_admin.csv")
df.head()

Unnamed: 0,Location,Gender,Age,Age_bins,Seniority,Seniority_bins,satisfaction_with_administration_general,Times_approched_accounting,reasons_approched_accounting_1,reasons_approched_accounting_2,...,Security_department_evaluation_break down_Q6,Security_department_evaluation_break down_Q7,Security_department_evaluation_break down_Q8,Security_department_evaluation_break down_Q9,Security_department_evaluation_break down_Q10,Security_department_evaluation_break down_Q11,General_Evaluation_data&records_department,data&records_department_evaluation_break down_Q1,data&records_department_evaluation_break down_Q2,data&records_department_evaluation_break down_Q3
0,City_1(HQ),Female,3.0,C. 31 to 40,3.0,C. 6 to 10 years,8.0,C. Twice,3.0,4.0,...,,,,,,,,,,
1,City_2(main devlopment center),Male,4.0,D. 41 to 50,3.0,D. 11 to 20 years,10.0,A. Zero times,1.0,,...,9.0,8.0,10.0,9.0,8.0,9.0,8.0,8.0,8.0,8.0
2,City_3(lastest expantion),Female,2.0,B. 21 to 30,3.0,C. 6 to 10 years,8.0,C. Twice,3.0,4.0,...,,8.0,,8.0,8.0,6.0,,10.0,10.0,10.0
3,City_3(lastest expantion),Female,2.0,B. 21 to 30,1.0,A. upto 2 years,7.0,E. 5 times or more,5.0,1.0,...,,,,,,,,,,
4,City_3(lastest expantion),Female,4.0,D. 41 to 50,4.0,D. 11 to 20 years,5.0,D. 3 to 4 times,4.0,1.0,...,9.0,8.0,,9.0,9.0,9.0,,,,


#### 1. Remove surrounding spaces in column names

In [3]:
# trim leading/trailing whitespace from every column name
df.columns = df.columns.str.strip()
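
As a small illustration of what this step does and does not change, the sketch below strips a toy set of column names. The names are made up to mimic the survey's style; the point is that stripping trims only leading and trailing whitespace, so names such as `..._break down_Q1` keep their internal space.

```python
import pandas as pd

# Toy frame with hypothetical column names that mimic the survey's style.
toy = pd.DataFrame(columns=[" Location", "Gender ", "evaluation_break down_Q1"])

# str.strip() trims leading/trailing whitespace only; internal spaces stay.
toy.columns = toy.columns.str.strip()

cleaned = toy.columns.tolist()
print(cleaned)
```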

#### 2. Simplify column names

In [4]:
df= df.rename({'Age' : 'age_category',
               'Seniority' : 'seniority_category',
               'satisfaction_with_administration_general':'satisfaction_score',
               'Times_approched_accounting': '#contact_acct',
               'Last_aprproch_Evaluation_accounting': 'rating_acct',
               'Times_approched_HR' : '#contact_HR',
               'Last_aprproch_Evaluation_HR' : 'rating_HR',
               'Times approached Office Management' : '#contact_OM',
               'Last_aprproch_Evaluation_Office Management' : 'rating_OM',
               'General_Evaluation_security_department' : 'rating_security',
               'General_Evaluation_data&records_department' : 'rating_D&R',
               "Gender" : "gender",
               "Location": "location"},axis=1)

In [5]:
df.head()

Unnamed: 0,location,gender,age_category,Age_bins,seniority_category,Seniority_bins,satisfaction_score,#contact_acct,reasons_approched_accounting_1,reasons_approched_accounting_2,...,Security_department_evaluation_break down_Q6,Security_department_evaluation_break down_Q7,Security_department_evaluation_break down_Q8,Security_department_evaluation_break down_Q9,Security_department_evaluation_break down_Q10,Security_department_evaluation_break down_Q11,rating_D&R,data&records_department_evaluation_break down_Q1,data&records_department_evaluation_break down_Q2,data&records_department_evaluation_break down_Q3
0,City_1(HQ),Female,3.0,C. 31 to 40,3.0,C. 6 to 10 years,8.0,C. Twice,3.0,4.0,...,,,,,,,,,,
1,City_2(main devlopment center),Male,4.0,D. 41 to 50,3.0,D. 11 to 20 years,10.0,A. Zero times,1.0,,...,9.0,8.0,10.0,9.0,8.0,9.0,8.0,8.0,8.0,8.0
2,City_3(lastest expantion),Female,2.0,B. 21 to 30,3.0,C. 6 to 10 years,8.0,C. Twice,3.0,4.0,...,,8.0,,8.0,8.0,6.0,,10.0,10.0,10.0
3,City_3(lastest expantion),Female,2.0,B. 21 to 30,1.0,A. upto 2 years,7.0,E. 5 times or more,5.0,1.0,...,,,,,,,,,,
4,City_3(lastest expantion),Female,4.0,D. 41 to 50,4.0,D. 11 to 20 years,5.0,D. 3 to 4 times,4.0,1.0,...,9.0,8.0,,9.0,9.0,9.0,,,,


#### 3. Separate dataset into three dataframes:

*This simplifies the main dataframe, which holds the major findings for later plotting.*

1. **df_main**: the major survey results, including employee profile, overall satisfaction score for the company's internal services, number of contacts with the major departments, and the ratings given for those contacts. We will later export it to employee_satisfaction.csv for further plotting. <br/>
2. **contact_reasons**: counts of the reasons employees approached the major departments<br/>
3. **effects_of_subquestions_on_ratings**: the rating for each service from each department

In [6]:
# main dataframe: drop the sub-question ("break down") and contact-reason columns
df_main = df.drop(df.filter(like='down').columns, axis=1)
df_main = df_main.drop(df_main.filter(like='reasons').columns, axis=1)

# side dataframes: keep the score and rating columns next to the detail columns
contact_reasons = pd.concat([df.filter(like='score'), df.filter(like='rating'), df.filter(like='reasons')], axis=1)
effects_of_subquestions_on_ratings = pd.concat([df.filter(like='score'), df.filter(like='rating'), df.filter(like='down')], axis=1)
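
The split relies on `DataFrame.filter(like=...)`, which keeps every column whose name contains the given substring. A minimal sketch on a toy frame, with hypothetical column names chosen to resemble the survey's, shows which columns end up where:

```python
import pandas as pd

# Toy data; the column names are illustrative, not the full survey schema.
toy = pd.DataFrame({
    "location": ["A", "B"],
    "satisfaction_score": [8, 9],
    "rating_HR": [7, 6],
    "reasons_approched_HR_1": [1, 2],
    "HR_evaluation_break down_Q1": [9, 8],
})

# Main frame: everything except sub-question ("down") and reason columns.
main = toy.drop(columns=toy.filter(like="down").columns)
main = main.drop(columns=main.filter(like="reasons").columns)

# Side frame: score and rating columns plus the reason columns.
reasons = pd.concat(
    [toy.filter(like="score"), toy.filter(like="rating"), toy.filter(like="reasons")],
    axis=1,
)

print(main.columns.tolist())
print(reasons.columns.tolist())
```

Because the match is a plain substring test, any column whose name happens to contain `down`, `reasons`, `score`, or `rating` is selected, so it is worth checking the resulting column lists.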

#### 4. Reshape main dataframe

- Drop duplicated columns and an unwanted record
  - Duplicated columns:
    1. age_category & Age_bins
    2. seniority_category & Seniority_bins
  - Unwanted record:
    - location = Remote: only one employee works remotely, so that survey result is not representative
- Add an ID column to identify each employee

In [7]:
# drop the duplicated bin columns and the single remote record
df_main = df_main.drop(['Age_bins', 'Seniority_bins'], axis=1)
df_main = df_main.drop(df_main[df_main['location'] == 'Remote'].index)

In [8]:
# add ID column and move it to the front
df_main = df_main.assign(ID=(range(1, len(df_main) + 1)))
cols = df_main.columns.tolist()
cols = cols[-1:] + cols[:-1]
df_main = df_main[cols]
df_main.head(50)

Unnamed: 0,ID,location,gender,age_category,seniority_category,satisfaction_score,#contact_acct,rating_acct,#contact_HR,rating_HR,#contact_OM,rating_OM,rating_security,rating_D&R
0,1,City_1(HQ),Female,3.0,3.0,8.0,C. Twice,3.0,D. 3 to 4 times,6.0,B. Once,1.0,4.0,
1,2,City_2(main devlopment center),Male,4.0,3.0,10.0,A. Zero times,,C. Twice,7.0,E. 5 times or more,10.0,9.0,8.0
2,3,City_3(lastest expantion),Female,2.0,3.0,8.0,C. Twice,8.0,A. Zero times,,E. 5 times or more,10.0,8.0,
3,4,City_3(lastest expantion),Female,2.0,1.0,7.0,E. 5 times or more,9.0,C. Twice,7.0,B. Once,9.0,,
4,5,City_3(lastest expantion),Female,4.0,4.0,5.0,D. 3 to 4 times,9.0,D. 3 to 4 times,8.0,C. Twice,8.0,9.0,
5,6,City_2(main devlopment center),Male,4.0,3.0,8.0,E. 5 times or more,4.0,E. 5 times or more,9.0,E. 5 times or more,10.0,8.0,
7,7,City_3(lastest expantion),Female,2.0,3.0,10.0,E. 5 times or more,8.0,E. 5 times or more,9.0,E. 5 times or more,10.0,10.0,10.0
8,8,City_2(main devlopment center),Female,4.0,2.0,9.0,D. 3 to 4 times,9.0,D. 3 to 4 times,9.0,D. 3 to 4 times,9.0,9.0,
9,9,City_1(HQ),Male,3.0,1.0,8.0,D. 3 to 4 times,8.0,D. 3 to 4 times,10.0,B. Once,9.0,8.0,8.0
10,10,City_1(HQ),Female,5.0,5.0,9.0,D. 3 to 4 times,9.0,B. Once,9.0,E. 5 times or more,10.0,10.0,
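
The output above shows the original row index skipping a value while `ID` stays consecutive. The sketch below reproduces that behaviour on toy data. It uses a boolean mask and `DataFrame.insert` as an alternative to the drop-and-reorder approach in the cells above; it is an illustration, not the notebook's own code.

```python
import pandas as pd

# Toy data with made-up scores; only the location labels follow the survey.
toy = pd.DataFrame({
    "location": ["City_1(HQ)", "Remote", "City_2(main devlopment center)"],
    "satisfaction_score": [8.0, 6.0, 10.0],
})

# Keep every row whose location is not "Remote"; the original index is preserved.
toy = toy[toy["location"] != "Remote"].copy()

# Insert a 1-based ID directly as the first column.
toy.insert(0, "ID", list(range(1, len(toy) + 1)))

print(toy)
```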


#### 5. Rearrange bin sizes for age and number of contacts with each department
- Some of the bins in these columns contain only a few values. Merge bins to improve data visualization.

In [9]:
# merge category 2 into category 1 and shift the remaining categories down by one
df_main["age_category"] = df_main["age_category"].replace({2: 1,
                                                           3: 2,
                                                           4: 3,
                                                           5: 4})

In [10]:
contact_cols = ["#contact_acct", "#contact_HR", "#contact_OM"]
for col in contact_cols:
    # "Once" and "Twice" are merged into a single "1 to 2 times" bin
    df_main[col] = df_main[col].replace({"B. Once": "1 to 2 times",
                                         "C. Twice": "1 to 2 times",
                                         "D. 3 to 4 times": "3 to 4 times",
                                         "E. 5 times or more": "5 times or more",
                                         "A. Zero times": "0 times",
                                         "A. Zero Times": "0 times"})
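
To see the effect of merging bins, the sketch below applies the same kind of mapping to a short toy series and counts the result. The sample values are invented; only the labels come from the survey.

```python
import pandas as pd

# Toy answers using the survey's original labels.
contacts = pd.Series(
    ["A. Zero times", "B. Once", "C. Twice", "D. 3 to 4 times", "B. Once"]
)

# Several old labels may map to the same new label, which merges their bins.
merged_bins = {
    "A. Zero times": "0 times",
    "B. Once": "1 to 2 times",
    "C. Twice": "1 to 2 times",
    "D. 3 to 4 times": "3 to 4 times",
}

rebinned = contacts.replace(merged_bins)
counts = rebinned.value_counts()
print(counts)
```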

#### 6. Convert location values to provide meaningful information

In [11]:
df_main['location'] = df_main['location'].replace({"City_1(HQ)": "Boston",
                                                   "City_2(main devlopment center)": "Amsterdam",
                                                   "City_3(lastest expantion)": "New Delhi"
                                                   })
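
`replace` silently leaves any value that is missing from the mapping unchanged, so a misspelled key would go unnoticed. One possible check, sketched here on toy data, is to use `Series.map`, which returns `NaN` for unmapped values. The label `"City_4"` is hypothetical and exists only to trigger the check.

```python
import pandas as pd

city_names = {
    "City_1(HQ)": "Boston",
    "City_2(main devlopment center)": "Amsterdam",
    "City_3(lastest expantion)": "New Delhi",
}

# "City_4" is a hypothetical label that is not in the mapping.
locations = pd.Series(["City_1(HQ)", "City_3(lastest expantion)", "City_4"])

# map() yields NaN wherever a value has no entry in the dictionary.
mapped = locations.map(city_names)
unmapped = locations[mapped.isna()].unique().tolist()

print(unmapped)
```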

#### 7. Missing values
- Missing values come from employees who never used the corresponding service. pandas skips missing values by default when computing means, so they will not distort the mean values we use later for plotting. We keep them for now.

In [12]:
df_main.isnull().sum()/ len(df_main)

ID                    0.000000
location              0.000000
gender                0.000000
age_category          0.000000
seniority_category    0.000000
satisfaction_score    0.011050
#contact_acct         0.000000
rating_acct           0.055249
#contact_HR           0.000000
rating_HR             0.077348
#contact_OM           0.000000
rating_OM             0.071823
rating_security       0.049724
rating_D&R            0.646409
dtype: float64
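
The claim that missing ratings do not distort the means can be checked on a toy frame. The numbers below are invented; the sketch only shows that `mean()` ignores `NaN` by default and that `isna().mean()` gives the same missing share as the cell above.

```python
import numpy as np
import pandas as pd

# Toy ratings with one missing value (made-up numbers).
ratings = pd.DataFrame({
    "location": ["Boston", "Boston", "Amsterdam"],
    "rating_D&R": [8.0, np.nan, 10.0],
})

# Share of missing values, equivalent to isnull().sum() / len(df).
missing_share = ratings["rating_D&R"].isna().mean()

# The group mean is computed over non-missing values only.
mean_by_city = ratings.groupby("location")["rating_D&R"].mean()

print(missing_share)
print(mean_by_city)
```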

### Export dataframes

In [14]:
# df_main
df_main.to_csv("employee_satisfaction.csv", index=False)

# contact_reasons (will be cleaned and plotted in another notebook)
contact_reasons.to_csv("contact_reasons.csv", index=False)

# effects_of_subquestions_on_ratings (will be cleaned and plotted in another notebook)
effects_of_subquestions_on_ratings.to_csv("effects_of_subquestions_on_ratings.csv", index=False)
