# Module 1

### LoadTronic Employee Satisfaction Analysis

This notebook performs initial exploration and cleansing of the employee satisfaction survey data.

Refer to the main notebook sections below:
> [Questions](#Questions)

> [Responses](#Responses)

In [1]:
# Load modules for data analysis
import pandas as pd
import matplotlib.pyplot as plt

## Questions
Here we will load the questions data and answer the following questions:
- How many “measurement” categories are there?
- Do you notice anything strange about the category labels?
- Are the measurement categories set up appropriately?

In [5]:
# Read questions dataset from local csv
q_df_raw = pd.read_csv('questions.csv')
q_df_raw.head()

Unnamed: 0,id,measurement,question
0,1,engagement,I am proud to work for [Company]
1,2,engagement,I would recommend [Company] as a great place t...
2,3,engagement,I rarely think about looking for a job at anot...
3,4,engagement,I see myself still working at [Company] in two...
4,5,engagement,[Company] motivates me to go beyond what I wou...


In [21]:
q_df_raw['question'].unique()

array(['I am proud to work for [Company]',
       'I would recommend [Company] as a great place to work',
       'I rarely think about looking for a job at another company',
       "I see myself still working at [Company] in two years' time",
       '[Company] motivates me to go beyond what I would in a similar role elsewhere',
       'The leaders at [company] keep people informed about what is happening',
       'My manager is a great role model for employees',
       'The leaders at [Company] have communicated a vision that motivates me',
       'I have access to the resources I need to do my job well',
       'I have access to the learning and development I need to do my job well',
       'Most of the systems and processes here support us getting our work done effectively',
       'I know what I need to do to be successful in my role',
       'I receive appropriate recognition when I do good work',
       'Day-to-day decisions here demonstrate that quality and improvement are top pr

In [6]:
# Summary of number of questions by category
q_df_raw.groupby('measurement')['question'].count()

measurement
 alignment       2
 development     1
 enablement      2
 engagement      2
alignment        1
development      2
enablement       1
engagement       3
leadership       3
Name: question, dtype: int64

#### Measurements
There appear to be `5` categories in total, but the grouping shows `4 extra labels` because some values include *leading spaces*.
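One way to make the stray whitespace visible is to compare each label with its stripped form. The sketch below uses a small made-up Series rather than the real `questions.csv`, so the values are illustrative only.

```python
import pandas as pd

# Hypothetical sample mimicking the issue: some labels carry a leading space
labels = pd.Series([' alignment', 'alignment', ' engagement', 'engagement', 'leadership'])

# Flag labels that change when surrounding whitespace is stripped
has_padding = labels != labels.str.strip()

print(labels[has_padding].tolist())    # labels with stray whitespace
print(labels.nunique())                # distinct labels before cleaning
print(labels.str.strip().nunique())    # distinct labels after cleaning
```

Printing the flagged values as a list keeps the quotes around each string, which makes a leading space much easier to spot than in the default groupby output.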

In [7]:
# Copy of dataframe to begin cleaning
q_df = q_df_raw.copy()

# Remove surrounding whitespace characters from field
q_df['measurement'] = q_df['measurement'].str.strip()
q_df.groupby('measurement')['question'].count()

measurement
alignment      3
development    3
enablement     3
engagement     5
leadership     3
Name: question, dtype: int64

In [8]:
# To check column datatypes
q_df.dtypes

id              int64
measurement    object
question       object
dtype: object

#### Measurement Categories
We can see that the `measurement` field has the `object` dtype. Because it holds a small, fixed set of labels, it is really categorical data, and pandas provides a dedicated `category` dtype to represent this.
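As a rough illustration of what the conversion does, the sketch below converts a made-up column and inspects the result. The sample values are assumptions, not taken from the survey files.

```python
import pandas as pd

# Hypothetical measurement labels, repeated as they would be across many rows
s_obj = pd.Series(['engagement', 'leadership', 'engagement', 'alignment'] * 1000)
s_cat = s_obj.astype('category')

# A categorical stores each distinct label once, plus small integer codes per row
print(list(s_cat.cat.categories))
print(s_cat.cat.codes.head(4).tolist())

# Compare the memory footprint of the two representations
print(s_obj.memory_usage(deep=True), s_cat.memory_usage(deep=True))
```

With only 17 questions the memory saving is negligible here; the main benefit is making the intent of the column explicit.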

In [9]:
# Convert to categorical
q_df['measurement'] = q_df['measurement'].astype('category')
q_df.dtypes

id                int64
measurement    category
question         object
dtype: object

In [10]:
# Save cleansed data
q_df.to_csv('questions_clean.csv')
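
One thing to keep in mind when reloading this file later: CSV stores only text, so the `category` dtype is not preserved, and the DataFrame index is written as an extra unnamed column by default. A minimal sketch using in-memory buffers and made-up rows (not the real data):

```python
import io
import pandas as pd

# Hypothetical stand-in for the cleaned questions data
df = pd.DataFrame({'id': [1, 2], 'measurement': ['engagement', 'leadership']})
df['measurement'] = df['measurement'].astype('category')

# Default to_csv() writes the index as a leading column with an empty header
buf = io.StringIO()
df.to_csv(buf)
buf.seek(0)
reloaded = pd.read_csv(buf)
print(reloaded.columns.tolist())

# index=False skips the index; dtype= restores the categorical on load
buf2 = io.StringIO()
df.to_csv(buf2, index=False)
buf2.seek(0)
restored = pd.read_csv(buf2, dtype={'measurement': 'category'})
print(restored.dtypes)
```

Whether to pass `index=False` when saving depends on what downstream notebooks expect, so treat this as an option rather than a required change.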

## Responses
Here we will load the responses data and answer the following questions:
- Are any responses outside the expected range?
- Do you observe any partial survey completions?
- Do you notice anything unexpected with any of the responses?
- Are there any missing (null) answers?

In [11]:
r_df_raw = pd.read_csv('responses.csv')
r_df_raw.head()

Unnamed: 0,employee_id,question_id,answer
0,343,1,4
1,343,2,4
2,343,3,3
3,343,4,3
4,343,5,3


In [12]:
# Check min/max answer for 1-5 scale
r_df_raw.describe()

Unnamed: 0,employee_id,question_id,answer
count,6477.0,6477.0,6477.0
mean,391.624672,9.0,3.865833
std,219.857438,4.899358,0.847225
min,4.0,1.0,1.0
25%,206.0,5.0,3.0
50%,396.0,9.0,4.0
75%,578.0,13.0,5.0
max,769.0,17.0,5.0


In [13]:
# Check response completeness
r_df_raw.groupby('question_id').count()

question_id,employee_id,answer
1,381,381
2,381,381
3,381,381
4,381,381
5,381,381
6,381,381
7,381,381
8,381,381
9,381,381
10,381,381


In [17]:
# Check for partial surveys
r_df_raw.groupby('employee_id').count().describe()

Unnamed: 0,question_id,answer
count,381.0,381.0
mean,17.0,17.0
std,0.0,0.0
min,17.0,17.0
25%,17.0,17.0
50%,17.0,17.0
75%,17.0,17.0
max,17.0,17.0


In [15]:
# Confirm no null values
r_df_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6477 entries, 0 to 6476
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   employee_id  6477 non-null   int64
 1   question_id  6477 non-null   int64
 2   answer       6477 non-null   int64
dtypes: int64(3)
memory usage: 151.9 KB


#### Response Evaluation
Based on this initial evaluation, the response data appears to be in good shape. All answer values fall within the `1-5 Likert scale`, all surveys appear to be `complete` (each of the 381 employees answered all 17 questions), and no answers are `missing` (there are no null values).
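
These three checks can also be written as explicit assertions so that they fail loudly if a later data refresh breaks them. This is only a sketch: the tiny `responses` frame is made up, and `n_questions` is a parameter introduced here for illustration.

```python
import pandas as pd

# Hypothetical responses: 2 employees x 3 questions, answers on a 1-5 scale
responses = pd.DataFrame({
    'employee_id': [1, 1, 1, 2, 2, 2],
    'question_id': [1, 2, 3, 1, 2, 3],
    'answer':      [4, 5, 3, 2, 4, 4],
})
n_questions = 3  # assumed survey length for this toy example

# 1. Every answer falls within the 1-5 Likert range
in_range = responses['answer'].between(1, 5).all()

# 2. Every employee answered every question exactly once
per_employee = responses.groupby('employee_id')['question_id'].nunique()
complete = (per_employee == n_questions).all()
no_duplicates = not responses.duplicated(['employee_id', 'question_id']).any()

# 3. No missing values anywhere
no_nulls = not responses.isna().any().any()

print(in_range, complete, no_duplicates, no_nulls)
```

The duplicate check is an extra safeguard: a plain row count per employee would not distinguish 17 distinct answers from one question answered twice and another skipped.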

In [19]:
# Copy of dataframe for cleansing
r_df = r_df_raw.copy()

# Calculate score based on answer
r_df['score'] = r_df['answer'] / 5
r_df.head(1)

Unnamed: 0,employee_id,question_id,answer,score
0,343,1,4,0.8


In [20]:
r_df.describe()

Unnamed: 0,employee_id,question_id,answer,score
count,6477.0,6477.0,6477.0,6477.0
mean,391.624672,9.0,3.865833,0.773167
std,219.857438,4.899358,0.847225,0.169445
min,4.0,1.0,1.0,0.2
25%,206.0,5.0,3.0,0.6
50%,396.0,9.0,4.0,0.8
75%,578.0,13.0,5.0,1.0
max,769.0,17.0,5.0,1.0


In [24]:
r_df.groupby('employee_id')[['answer','score']].mean().describe()

Unnamed: 0,answer,score
count,381.0,381.0
mean,3.865833,0.773167
std,0.337335,0.067467
min,2.882353,0.576471
25%,3.705882,0.741176
50%,3.882353,0.776471
75%,4.058824,0.811765
max,4.705882,0.941176


In [32]:
# Merge questions measurement (categories)
q_r_df = r_df.merge(q_df, left_on='question_id', right_on='id')
q_r_df.groupby('measurement').agg({'score':['mean','min','max','std']})

measurement,score (mean),score (min),score (max),score (std)
alignment,0.780402,0.4,1.0,0.155351
development,0.671391,0.2,1.0,0.158845
enablement,0.781627,0.2,1.0,0.177288
engagement,0.794646,0.4,1.0,0.159413
leadership,0.823447,0.2,1.0,0.160809


#### Response Analysis
Based on the summary of responses above, we can see that the `mean score is 0.77`, which sits above the scale midpoint (an answer of 3, or a score of 0.6) and close to 'agree' (an answer of 4, or 0.8). Because the questions are framed positively (e.g. 'I am proud to work for [Company]'), a higher score suggests higher employee satisfaction.
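
For reference, dividing by 5 maps the 1-5 scale onto 0.2-1.0, so the neutral answer corresponds to 0.6 rather than 0.5. A quick sketch of the mapping follows; the label wording is the conventional five-point Likert wording and is our assumption, not taken from the survey.

```python
# Assumed label wording for a 5-point Likert scale
likert = {
    1: 'strongly disagree',
    2: 'disagree',
    3: 'neutral',
    4: 'agree',
    5: 'strongly agree',
}

# score = answer / 5, as computed in the notebook
scores = {answer: answer / 5 for answer in likert}

for answer, label in likert.items():
    print(f"{answer} ({label}): {scores[answer]:.1f}")

# The overall mean answer of 3.865833 reported above maps to roughly 0.77
mean_score = 3.865833 / 5
print(round(mean_score, 2))
```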

The minimum and maximum values at the individual response level do not tell us much, but when we aggregate by employee we can see the range of average scores, from about 0.58 to 0.94. Better yet, by bringing in the categorical measurement labels from the questions data we can compare averages across categories: the `leadership` questions scored highest on average (0.82), while `development` scored lowest (0.67).
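
To illustrate the category-level aggregation on something small enough to check by hand, here is a sketch with made-up questions and responses. The `validate='many_to_one'` and `observed=True` arguments are our additions, not part of the notebook's cell: the first raises an error if a question id appears more than once in the questions table, and the second limits the output to categories that actually occur.

```python
import pandas as pd

# Hypothetical questions and responses, not the survey data
questions = pd.DataFrame({
    'id': [1, 2, 3],
    'measurement': pd.Categorical(['engagement', 'engagement', 'leadership']),
})
responses = pd.DataFrame({
    'employee_id': [1, 1, 1, 2, 2, 2],
    'question_id': [1, 2, 3, 1, 2, 3],
    'answer':      [4, 5, 3, 2, 4, 5],
})
responses['score'] = responses['answer'] / 5

# Attach the category to each response; many responses map to one question
merged = responses.merge(
    questions, left_on='question_id', right_on='id', validate='many_to_one'
)

# Summarise scores per measurement category
by_category = merged.groupby('measurement', observed=True)['score'].agg(['mean', 'min', 'max'])
print(by_category)
```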

In [33]:
r_df.to_csv('responses_clean.csv')

In [34]:
!jupyter nbconvert --to html LoanTronic_Cleansing.ipynb

[NbConvertApp] Converting notebook LoanTronic_Cleansing.ipynb to html
[NbConvertApp] Writing 679196 bytes to LoanTronic_Cleansing.html
