In [1]:
import pandas as pd
import numpy as np

In this notebook, we'll cover two powerful operations in pandas: `df.groupby()` and `df.pivot_table()`. Both can be used to aggregate data for faster analysis.

In [2]:
df = pd.read_csv('employee_data.csv')

But first, let's analyze our dataset. This is a randomly generated dataset that contains employee data from a fictional company. 

In [3]:
df.head(5)

Unnamed: 0,Employee ID,Age,Salary,YOE,Department,Job Title,Performance Rating,Education Level
0,1,57,90788,9,Finance,Manager,Outstanding,Bachelors
1,2,32,58734,7,Customer Service,Clerk,Outstanding,Masters
2,3,28,65207,1,Sales,Clerk,Outstanding,PhD
3,4,29,83856,1,HR,Director,Meets Expectations,Masters
4,5,31,67558,2,HR,Clerk,Meets Expectations,Masters


Let's see how many people are employed in each department. We can call `.value_counts()` on the `Department` column of our dataframe to return the counts of all unique values. 

In [4]:
df['Department'].value_counts()

Department
Sales               55
IT                  47
HR                  45
Customer Service    43
R&D                 41
Finance             35
Marketing           34
Name: count, dtype: int64
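If you'd rather see proportions than raw counts, `value_counts()` also takes a `normalize=True` argument. Here's a minimal sketch; the `toy` dataframe is a made-up stand-in for the employee data, not the real file.

```python
import pandas as pd

# Hypothetical toy data; the notebook itself reads employee_data.csv instead.
toy = pd.DataFrame({"Department": ["Sales", "Sales", "IT", "HR", "Sales", "IT"]})

# Raw counts per department, sorted from most to least frequent.
counts = toy["Department"].value_counts()

# normalize=True returns each department's share of all rows instead.
shares = toy["Department"].value_counts(normalize=True)

print(counts)
print(shares)
```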

### Groupby

Let's start grouping! [`groupby()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html) is typically applied to a column in your dataset that represents a category. It returns a `DataFrameGroupBy` object (we'll call it a groupby object), which essentially holds your original dataframe split into one group per category value.


In [5]:
#Let's group our dataset by Department 
df.groupby(['Department'])

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fc1180dd160>

As expected, we got back a groupby object. Note that the groupby object by itself isn't very useful; you can consider it an intermediate result. We have to perform an __operation__ on the groupby object to do any analysis.
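Even though it's an intermediate result, you can still peek inside a groupby object. The sketch below uses a small made-up `toy` dataframe (an assumption, not the employee data) to count the groups, pull out one group, and loop over all of them.

```python
import pandas as pd

# Hypothetical toy data standing in for the employee dataset.
toy = pd.DataFrame({
    "Department": ["Sales", "IT", "Sales", "HR", "IT"],
    "Salary": [50000, 70000, 60000, 55000, 80000],
})

grouped = toy.groupby("Department")

# Number of distinct groups found in the Department column.
print(grouped.ngroups)

# Pull out the rows that belong to a single group.
sales_rows = grouped.get_group("Sales")
print(sales_rows)

# Iterating yields (group key, sub-dataframe) pairs.
sizes = {name: len(frame) for name, frame in grouped}
print(sizes)
```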

You can apply different kinds of operations to a groupby object. Some common operations include aggregations like `sum()`, `mean()`, `min()`, and `count()`. Let's group by Department again, but this time, let's specify an aggregation function and a column to aggregate on.

In [6]:
df.groupby('Department')['YOE'].mean()

Department
Customer Service    4.883721
Finance             4.714286
HR                  4.888889
IT                  5.617021
Marketing           3.735294
R&D                 4.317073
Sales               4.781818
Name: YOE, dtype: float64

Great, now we know the average years of experience per Department! 
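You aren't limited to one aggregation at a time. As a rough sketch on made-up data (the `toy` dataframe and the output names `avg_yoe` and `headcount` are illustrative assumptions), `.agg()` can take a list of function names, or named aggregations if you want to control the output column names.

```python
import pandas as pd

# Hypothetical toy data standing in for the employee dataset.
toy = pd.DataFrame({
    "Department": ["Sales", "IT", "Sales", "HR", "IT"],
    "YOE": [2, 5, 4, 7, 9],
})

# Pass a list of function names to compute several statistics at once.
summary = toy.groupby("Department")["YOE"].agg(["mean", "min", "max"])

# Named aggregation: output_name=(input_column, function).
named = toy.groupby("Department").agg(
    avg_yoe=("YOE", "mean"),
    headcount=("YOE", "count"),
)

print(summary)
print(named)
```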

Let's calculate the standard deviation of salaries within each department to see where pay is most spread out.

In [7]:
df.groupby('Department')['Salary'].std()


Department
Customer Service    13824.868945
Finance             13975.216976
HR                  14022.377081
IT                  14205.069555
Marketing           13574.278471
R&D                 13574.676433
Sales               12928.213720
Name: Salary, dtype: float64
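A quick note on `.std()` versus `.var()`: the standard deviation is the square root of the variance, and both groupby methods use the sample formula (`ddof=1`) by default. The sketch below checks that relationship on a made-up `toy` dataframe.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the employee dataset.
toy = pd.DataFrame({
    "Department": ["Sales", "Sales", "Sales", "IT", "IT", "IT"],
    "Salary": [50000, 60000, 70000, 40000, 60000, 80000],
})

by_dept = toy.groupby("Department")["Salary"]

salary_std = by_dept.std()  # sample standard deviation (ddof=1 by default)
salary_var = by_dept.var()  # sample variance (ddof=1 by default)

# The standard deviation is the square root of the variance.
matches = np.allclose(salary_std, np.sqrt(salary_var))
print(matches)
```

The standard deviation is in the same units as the salaries, which usually makes it the easier of the two to interpret.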

What's the maximum salary for each performance rating?

In [8]:
df.groupby('Performance Rating')['Salary'].max()


Performance Rating
Exceeds Expectations    98638
Meets Expectations      97887
Needs Improvement       97860
Outstanding             99966
Name: Salary, dtype: int64

We can also group by multiple columns at once, which gives the result a multi-level index (a `MultiIndex`). Let's see how Job Titles within each Department impact the average pay.

In [9]:
dep_and_title_groupby = df.groupby(['Department', 'Job Title'])['Salary'].mean()
dep_and_title_groupby

Department        Job Title  
Customer Service  Analyst        71302.500000
                  Associate      73627.142857
                  Clerk          77646.666667
                  Coordinator    70662.400000
                  Director       77518.000000
                  Intern         67571.833333
                  Manager        68184.800000
Finance           Analyst        65523.000000
                  Associate      72727.500000
                  Clerk          79419.333333
                  Coordinator    80596.100000
                  Director       57136.500000
                  Intern         73883.333333
                  Manager        80732.250000
HR                Analyst        74539.714286
                  Associate      82332.666667
                  Clerk          68220.200000
                  Coordinator    70651.750000
                  Director       68323.500000
                  Intern         73127.000000
                  Manager        79613.500000
IT  
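A result with two index levels can look intimidating, but selecting from it is fairly direct. Here's a minimal sketch on made-up data (the `toy` dataframe is an assumption) using `.loc` and `.xs()`.

```python
import pandas as pd

# Hypothetical toy data standing in for the employee dataset.
toy = pd.DataFrame({
    "Department": ["Sales", "Sales", "IT", "IT", "IT"],
    "Job Title": ["Clerk", "Manager", "Clerk", "Clerk", "Manager"],
    "Salary": [40000, 80000, 50000, 60000, 90000],
})

avg_pay = toy.groupby(["Department", "Job Title"])["Salary"].mean()

# One value: pass a tuple with a key for each index level.
it_clerk = avg_pay.loc[("IT", "Clerk")]

# All job titles within one department: pass just the outer key.
it_only = avg_pay.loc["IT"]

# One job title across all departments: cross-section on the inner level.
managers = avg_pay.xs("Manager", level="Job Title")

print(it_clerk)
print(it_only)
print(managers)
```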

### [Pivot Tables](https://pandas.pydata.org/docs/reference/api/pandas.pivot_table.html)

Now that we understand how groupby works, let's talk about `pivot_table()`. Pivot tables in pandas work similarly to pivot tables in Excel. They're used to reshape (pivot) your data, and they make it easy to apply aggregation functions on top of the reshaped output.

With pivot tables, we can specify: 

`index` --> The column whose values become the rows of the pivot table

`columns` --> The column whose values become the columns of the pivot table

`values` --> The column we want to aggregate

`aggfunc` --> Our aggregation function

Let's see how employees from different departments did on their Performance Ratings. 

In [10]:
pivot_performance = pd.pivot_table(df, values='Employee ID', index='Department', columns='Performance Rating', aggfunc='count')
pivot_performance

Department,Exceeds Expectations,Meets Expectations,Needs Improvement,Outstanding
Customer Service,13,10,8,12
Finance,6,11,10,8
HR,15,10,12,8
IT,7,12,16,12
Marketing,11,10,5,8
R&D,10,10,7,14
Sales,16,12,14,13
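The intro mentioned `df.pivot_table()`, while the cell above calls `pd.pivot_table(df, ...)`. Both spellings exist in pandas and take the same arguments. The sketch below compares them on a made-up `toy` dataframe (an assumption, not the real employee data).

```python
import pandas as pd

# Hypothetical toy data standing in for the employee dataset.
toy = pd.DataFrame({
    "Employee ID": [1, 2, 3, 4, 5],
    "Department": ["Sales", "Sales", "IT", "IT", "IT"],
    "Performance Rating": [
        "Outstanding", "Meets Expectations",
        "Outstanding", "Outstanding", "Meets Expectations",
    ],
})

# Function style: the dataframe is the first argument.
func_style = pd.pivot_table(
    toy, values="Employee ID", index="Department",
    columns="Performance Rating", aggfunc="count",
)

# Method style: called directly on the dataframe.
method_style = toy.pivot_table(
    values="Employee ID", index="Department",
    columns="Performance Rating", aggfunc="count",
)

same_result = func_style.equals(method_style)
print(same_result)
```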


Let's create another pivot table to explore the education level across different departments. This time, we'll add the `margins=True` parameter to get a total for each row (per department) and a total for each column (per education level).

You'll see an "All" column and an "All" row that hold those totals.

In [11]:
pivot_education_dept = pd.pivot_table(df, values='Employee ID', index='Department', columns='Education Level', aggfunc='count', margins=True)
pivot_education_dept

Department,Bachelors,Masters,PhD,All
Customer Service,16,12,15,43
Finance,14,10,11,35
HR,13,18,14,45
IT,17,9,21,47
Marketing,12,14,8,34
R&D,8,15,18,41
Sales,17,17,21,55
All,97,95,108,300
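Two related parameters are worth knowing about: `margins_name` renames the "All" label, and `fill_value` replaces empty cells when a combination has no rows. The sketch below uses a made-up `toy` dataframe in which some department and education pairs never occur; treat it as an illustration under those assumptions.

```python
import pandas as pd

# Hypothetical toy data; IT has no Masters and Sales has no PhD.
toy = pd.DataFrame({
    "Employee ID": [1, 2, 3, 4, 5],
    "Department": ["Sales", "Sales", "IT", "IT", "IT"],
    "Education Level": ["Bachelors", "Masters", "Bachelors", "Bachelors", "PhD"],
})

table = pd.pivot_table(
    toy,
    values="Employee ID",
    index="Department",
    columns="Education Level",
    aggfunc="count",
    fill_value=0,          # show 0 instead of NaN for missing combinations
    margins=True,          # add row and column totals
    margins_name="Total",  # label the totals "Total" instead of "All"
)

print(table)
```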


In the last section, we used `groupby` to see how Job Titles within each Department impact the average pay. Let's do this again, but using `pivot_table`.

In [12]:
dep_and_title_pivot = pd.pivot_table(df, values='Salary', index='Department', columns='Job Title', aggfunc='mean')
dep_and_title_pivot

Department,Analyst,Associate,Clerk,Coordinator,Director,Intern,Manager
Customer Service,71302.5,73627.142857,77646.666667,70662.4,77518.0,67571.833333,68184.8
Finance,65523.0,72727.5,79419.333333,80596.1,57136.5,73883.333333,80732.25
HR,74539.714286,82332.666667,68220.2,70651.75,68323.5,73127.0,79613.5
IT,70279.571429,80819.0,60791.25,67352.0,85688.2,77293.571429,76154.111111
Marketing,75464.75,72058.0,80517.2,71169.166667,79129.5,79637.875,72976.0
R&D,62759.0,94246.4,79555.2,79685.3,76879.0,80842.4,78633.5
Sales,65267.4,69187.833333,69373.5,83749.75,70984.4,72789.545455,69417.0


In [13]:
#compare to groupby results from earlier!
dep_and_title_groupby

Department        Job Title  
Customer Service  Analyst        71302.500000
                  Associate      73627.142857
                  Clerk          77646.666667
                  Coordinator    70662.400000
                  Director       77518.000000
                  Intern         67571.833333
                  Manager        68184.800000
Finance           Analyst        65523.000000
                  Associate      72727.500000
                  Clerk          79419.333333
                  Coordinator    80596.100000
                  Director       57136.500000
                  Intern         73883.333333
                  Manager        80732.250000
HR                Analyst        74539.714286
                  Associate      82332.666667
                  Clerk          68220.200000
                  Coordinator    70651.750000
                  Director       68323.500000
                  Intern         73127.000000
                  Manager        79613.500000
IT  

### Groupby vs. Pivot Tables

Okay, but which one should I use?  

* If speed matters, `groupby()` is typically slightly faster on larger datasets, since `pivot_table()` does extra reshaping on top of the aggregation

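* `groupby()` is more flexible: besides aggregations, it supports operations like `.apply()`, `.transform()`, and `.filter()`. `pivot_table()` only aggregates, although its `aggfunc` parameter accepts custom functions as well as the common built-in ones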

* If you're looking to share your results with a larger group, `pivot_table()` has better formatting and a tabular output
  
* `pivot_table()` has built-in flexibility with parameters like `margins=True` and `fill_value=0`. If you were using groupby, you would need to manage those things yourself
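On the custom aggregation point, here's a small sketch. As far as I know, both `.agg()` on a groupby and the `aggfunc` parameter of `pivot_table()` accept a plain Python function. The `toy` dataframe and the `salary_range` helper are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for the employee dataset.
toy = pd.DataFrame({
    "Department": ["Sales", "Sales", "IT", "IT"],
    "Salary": [40000, 80000, 50000, 90000],
})

def salary_range(values):
    """Hypothetical custom aggregation: gap between highest and lowest value."""
    return values.max() - values.min()

# Custom function passed to a groupby aggregation.
via_groupby = toy.groupby("Department")["Salary"].agg(salary_range)

# The same custom function passed to pivot_table through aggfunc.
via_pivot = pd.pivot_table(
    toy, values="Salary", index="Department", aggfunc=salary_range
)

print(via_groupby)
print(via_pivot)
```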

For most use cases, it doesn't matter which one you pick. If you don't use the `columns` parameter in `pivot_table()`, then `groupby()` and `pivot_table()` produce the same data. In fact, pivot tables are essentially defined using groupby!
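The "same data" claim is easy to check yourself. This sketch, on a made-up `toy` dataframe (an assumption), aggregates by department both ways without the `columns` parameter and compares the numbers.

```python
import pandas as pd

# Hypothetical toy data standing in for the employee dataset.
toy = pd.DataFrame({
    "Department": ["Sales", "Sales", "IT", "IT", "HR"],
    "Salary": [40000, 80000, 50000, 90000, 65000],
})

# groupby: double brackets keep the result as a one-column dataframe.
by_groupby = toy.groupby("Department")[["Salary"]].mean()

# pivot_table with only an index and no columns argument.
by_pivot = pd.pivot_table(toy, values="Salary", index="Department", aggfunc="mean")

print(by_groupby)
print(by_pivot)
```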


`pivot_table --> groupby + unstack`

and 

`groupby --> pivot_table + stack`

Let's prove this. We have our department and title `groupby` results from earlier. Let's recreate `dep_and_title_pivot` by unstacking `dep_and_title_groupby`.


In [14]:
#a reminder of what dep_and_title_groupby looks like
dep_and_title_groupby

Department        Job Title  
Customer Service  Analyst        71302.500000
                  Associate      73627.142857
                  Clerk          77646.666667
                  Coordinator    70662.400000
                  Director       77518.000000
                  Intern         67571.833333
                  Manager        68184.800000
Finance           Analyst        65523.000000
                  Associate      72727.500000
                  Clerk          79419.333333
                  Coordinator    80596.100000
                  Director       57136.500000
                  Intern         73883.333333
                  Manager        80732.250000
HR                Analyst        74539.714286
                  Associate      82332.666667
                  Clerk          68220.200000
                  Coordinator    70651.750000
                  Director       68323.500000
                  Intern         73127.000000
                  Manager        79613.500000
IT  

In [15]:
#a reminder of what dep_and_title_pivot looks like 
dep_and_title_pivot

Department,Analyst,Associate,Clerk,Coordinator,Director,Intern,Manager
Customer Service,71302.5,73627.142857,77646.666667,70662.4,77518.0,67571.833333,68184.8
Finance,65523.0,72727.5,79419.333333,80596.1,57136.5,73883.333333,80732.25
HR,74539.714286,82332.666667,68220.2,70651.75,68323.5,73127.0,79613.5
IT,70279.571429,80819.0,60791.25,67352.0,85688.2,77293.571429,76154.111111
Marketing,75464.75,72058.0,80517.2,71169.166667,79129.5,79637.875,72976.0
R&D,62759.0,94246.4,79555.2,79685.3,76879.0,80842.4,78633.5
Sales,65267.4,69187.833333,69373.5,83749.75,70984.4,72789.545455,69417.0


In [16]:
#recreate dep_and_title_pivot by unstacking dep_and_title_groupby
dep_and_title_groupby.unstack()

Department,Analyst,Associate,Clerk,Coordinator,Director,Intern,Manager
Customer Service,71302.5,73627.142857,77646.666667,70662.4,77518.0,67571.833333,68184.8
Finance,65523.0,72727.5,79419.333333,80596.1,57136.5,73883.333333,80732.25
HR,74539.714286,82332.666667,68220.2,70651.75,68323.5,73127.0,79613.5
IT,70279.571429,80819.0,60791.25,67352.0,85688.2,77293.571429,76154.111111
Marketing,75464.75,72058.0,80517.2,71169.166667,79129.5,79637.875,72976.0
R&D,62759.0,94246.4,79555.2,79685.3,76879.0,80842.4,78633.5
Sales,65267.4,69187.833333,69373.5,83749.75,70984.4,72789.545455,69417.0
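The other direction (`groupby --> pivot_table + stack`) works the same way. Here's a minimal sketch on a made-up `toy` dataframe in which every department has every job title, so no empty cells get in the way. Note that recent pandas versions may print a warning about a newer `stack` implementation; the result is the same for this data.

```python
import pandas as pd

# Hypothetical toy data; every Department/Job Title combination is present.
toy = pd.DataFrame({
    "Department": ["Sales", "Sales", "IT", "IT", "IT"],
    "Job Title": ["Clerk", "Manager", "Clerk", "Clerk", "Manager"],
    "Salary": [40000, 80000, 50000, 60000, 90000],
})

grouped = toy.groupby(["Department", "Job Title"])["Salary"].mean()
pivot = pd.pivot_table(
    toy, values="Salary", index="Department", columns="Job Title", aggfunc="mean"
)

# Stacking moves the Job Title columns back into a second index level.
restacked = pivot.stack()

print(restacked)
print(grouped)
```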


## Exercises 


#### Use `groupby()` to calculate the max salary within each department.


#### Use `pivot_table()` to calculate the max salary within each department.


#### Return the average age and years of experience for each department.



#### What are the maximum and minimum salaries within each job title?
Hint: You can specify two aggregation functions.



#### Count the number of employees in each job title, segmented by their performance rating. What was the total count of employees for each performance rating?