# Advanced Pandas Assignment 2

In this assignment, you will practice grouping and aggregating data in Pandas dataframes. You will also practice string operations on text columns.

### Note about assignments
You can add lines of code according to your preferences. As long as the code required by the assignment is found in this notebook under the corresponding question header (i.e., the answer to question 1 is underneath the title "Question 1"), you will receive credit for it.

## About the data
The data used in this assignment is a table built from the Human Resources schema of the Adventure Works 2019 database. Each row records one change in the Employee Pay History (that is, one pay rate change), along with information about the employee and the department they were working in when they received that pay rate.

The actual data is stored in a CSV file located inside the `data` folder. The file is called `pay_history.csv`.

## Instructions
### Set up
##### Import Pandas
Import the Pandas library into Jupyter Lab.

In [2]:
import pandas as pd

##### Disable column display limit
Use the following code to disable the default limit for displaying columns. If you don't use this code, a data set with more than 20 columns will be truncated when displayed to take up less space.

```
pd.options.display.max_columns = None
```

In [3]:
pd.options.display.max_columns = None

##### Create the dataframe
Use the `read_csv()` function from Pandas to read the data from the `pay_history.csv` file into a dataframe called `df`.

In [4]:
df = pd.read_csv("./data/pay_history.csv")
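
As a minimal sketch of what `read_csv()` returns, the example below reads a tiny CSV from an in-memory string instead of a file. The two-column text and the name `toy` are made up for illustration and are not the contents of `pay_history.csv`.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for a file on disk.
csv_text = "EmployeeID,Rate\n1,125.5\n2,63.4615\n"

# read_csv accepts a path or a file-like object and returns a DataFrame.
toy = pd.read_csv(io.StringIO(csv_text))

print(toy.shape)   # (rows, columns)
print(toy.dtypes)  # the data type inferred for each column
```

The first line of the text becomes the column names, and each later line becomes one row.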

##### Preview dataframe
Use the `.head()` method to print out the first 5 rows of the dataframe.

In [5]:
df.head()

Unnamed: 0,EmployeeID,DepartmentID,RateChangeDate,Rate,PayFrequency,LoginID,OrganizationLevel,JobTitle,BirthDate,MaritalStatus,Gender,HireDate,SalariedFlag,VacationHours,SickLeaveHours,CurrentFlag,ShiftID,StartDate,EndDate,ModifiedDate,DepartmentName,Sub-Department
0,1,16,00:00.0,125.5,2,adventure-works\ken0,,Chief Executive Officer,1/29/1969,S,M,1/14/2009,1,99,69,1,1,1/14/2009,,00:00.0,Executive,Executive General and Administration
1,2,1,00:00.0,63.4615,2,adventure-works\terri0,1.0,Vice President of Engineering,8/1/1971,S,F,1/31/2008,1,1,20,1,1,1/31/2008,,00:00.0,Engineering,Research and Development
2,3,1,00:00.0,43.2692,2,adventure-works\roberto0,2.0,Engineering Manager,11/12/1974,M,M,11/11/2007,1,2,21,1,1,11/11/2007,,00:00.0,Engineering,Research and Development
3,4,1,00:00.0,8.62,2,adventure-works\rob0,3.0,Senior Tool Designer,12/23/1974,S,M,12/5/2007,0,48,80,1,1,12/5/2007,5/30/2010,00:00.0,Engineering,Research and Development
4,4,2,00:00.0,8.62,2,adventure-works\rob0,3.0,Senior Tool Designer,12/23/1974,S,M,12/5/2007,0,48,80,1,1,5/31/2010,,00:00.0,Tool Design,Research and Development


### Questions
#### Top Paid Employee
##### Question 1: Get the `Rate` column
Print out the `Rate` column from the dataframe.

In [6]:
df['Rate']

0      125.5000
1       63.4615
2       43.2692
3        8.6200
4        8.6200
         ...   
299     23.0769
300     48.1010
301     23.0769
302     23.0769
303     23.0769
Name: Rate, Length: 304, dtype: float64

##### Question 2: Get the top `Rate` among all employees
Use the `.max()` method on the `Rate` column to find the top pay rate among all employees. Print it out.

In [7]:
df['Rate'].max()

125.5

##### Question 3: Save max rate as a variable
The previous code returns a single number. Copy the code you wrote above and save the result to a variable called `max_rate`.

In [8]:
max_rate = df['Rate'].max()

##### Question 4: Get record for employee(s) who have the top rate
Create a filter that finds records in the dataframe that have a rate equal to `max_rate`. Save the filter to a variable called `filt`.

In [9]:
filt = df['Rate'] == max_rate

##### Question 5: Get the top paid employees' records
Use the `filt` variable to get back rows in the dataframe with information about the top earning employee.

In [10]:
df[ filt ]

Unnamed: 0,EmployeeID,DepartmentID,RateChangeDate,Rate,PayFrequency,LoginID,OrganizationLevel,JobTitle,BirthDate,MaritalStatus,Gender,HireDate,SalariedFlag,VacationHours,SickLeaveHours,CurrentFlag,ShiftID,StartDate,EndDate,ModifiedDate,DepartmentName,Sub-Department
0,1,16,00:00.0,125.5,2,adventure-works\ken0,,Chief Executive Officer,1/29/1969,S,M,1/14/2009,1,99,69,1,1,1/14/2009,,00:00.0,Executive,Executive General and Administration
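
The steps in Questions 2 through 5 can be sketched on a small made-up dataframe. The names `toy`, `top_rate`, `mask`, and `top_rows`, and all of the values, are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy data; the names and rates are made up.
toy = pd.DataFrame({
    "Name": ["Ana", "Ben", "Cy", "Dee"],
    "Rate": [20.0, 35.5, 35.5, 12.0],
})

top_rate = toy["Rate"].max()     # a single number: the largest rate
mask = toy["Rate"] == top_rate   # boolean Series, True where the rate matches
top_rows = toy[mask]             # keeps only the rows where mask is True

print(top_rows)
```

Because the comparison is applied to every row, ties come back together, which is why the question speaks of "employee(s)".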


##### Question 6: What is the Job Title and Organization Level of the top paid employee?
```
Job Title: Chief Executive Officer
Organization Level: NaN
```

#### Top Paid Employee by Organization Level
##### Question 7: Group by `OrganizationLevel`
Use the `.groupby()` method to create a `GroupBy` object. Group by the column `OrganizationLevel`.

In [11]:
df.groupby('OrganizationLevel')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001EEDC3BFAF0>

##### Question 8: Use the `.agg()` method
Use the `.agg()` method on the code you typed above to find the maximum rate in each organization level. Print out the resulting dataframe.

Remember that the `.agg()` method takes a dictionary where the keys are the columns to aggregate and the values say *how* to aggregate them.

In [12]:
df.groupby('OrganizationLevel').agg({'Rate': 'max'})

| OrganizationLevel | Rate |
|---|---|
| 1.0 | 84.1346 |
| 2.0 | 48.101 |
| 3.0 | 50.4808 |
| 4.0 | 42.4808 |
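
The dictionary form of `.agg()` can be sketched on made-up data. The example also shows one behavior worth knowing: by default `.groupby()` leaves out rows whose group key is missing, which is consistent with the top paid employee (whose `OrganizationLevel` is `NaN`) not appearing in the result above. The column values here are hypothetical, and the `dropna=False` option assumes pandas 1.1 or newer.

```python
import pandas as pd

# Hypothetical toy data; the last row has no level, like a missing OrganizationLevel.
toy = pd.DataFrame({
    "Level": [1.0, 1.0, 2.0, 2.0, None],
    "Rate": [80.0, 60.0, 30.0, 45.0, 125.5],
})

# Key = column to aggregate, value = how to aggregate it.
by_level = toy.groupby("Level").agg({"Rate": "max"})

# Keep the missing key as its own group (assumes pandas 1.1 or newer).
with_missing = toy.groupby("Level", dropna=False).agg({"Rate": "max"})

print(by_level)
print(with_missing)
```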


##### Question 9: Rank the maximum rates
Looking at the aggregation performed above, rank the maximum pay rates across organization levels.

```
Highest paid: 1
2nd highest paid: 3
3rd highest paid: 2
Lowest paid: 4
```

##### Question 10: Aggregate by max and average `Rate`
Add to the code above and find both the maximum and average pay rates across each organization level. Print out the resulting dataframe.

In [13]:
df.groupby('OrganizationLevel').agg({'Rate': ['max', 'mean']})

| OrganizationLevel | Rate (max) | Rate (mean) |
|---|---|---|
| 1.0 | 84.1346 | 48.519655 |
| 2.0 | 48.101 | 27.623748 |
| 3.0 | 50.4808 | 22.246036 |
| 4.0 | 42.4808 | 12.861557 |
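
When a column maps to a list of aggregations, the result has two-level column names. A minimal sketch on made-up data, showing how one of those columns could be selected and sorted to produce a ranking (the names `toy`, `stats`, and `ranking` are hypothetical):

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    "Level": [1, 1, 2, 2],
    "Rate": [80.0, 60.0, 30.0, 50.0],
})

# A list of function names produces one output column per function.
stats = toy.groupby("Level").agg({"Rate": ["max", "mean"]})

# Columns are now (column, function) pairs, so select with a tuple.
max_by_level = stats[("Rate", "max")]
mean_by_level = stats[("Rate", "mean")]

# Sorting one of those columns ranks the groups from highest to lowest.
ranking = mean_by_level.sort_values(ascending=False).index.tolist()

print(stats)
print(ranking)
```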


##### Question 11: Rank the average pay rates
Looking at the aggregation performed above, rank the average pay rates across organization levels.

```
Highest paid: 1
2nd highest paid: 2
3rd highest paid: 3
Lowest paid: 4
```

##### Question 12: What do you notice?
Observe the rankings of maximum and average rates of pay. What do you notice about the differences between these rankings? Why might the rankings be different?

```
Organization level 3 has an employee who gets paid more than the highest paid employee in level 2. However, level 3 has a lower average pay. This suggests that pay rates in level 3 are more spread out than pay rates in level 2.
```
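
One way to check the idea that a group's pay rates are more spread out is to add the standard deviation to the aggregation. This is only a sketch on made-up numbers chosen so that one group has the higher maximum but the lower mean; it is not a result from the assignment data.

```python
import pandas as pd

# Hypothetical toy data: level 3 has one high rate and two low ones.
toy = pd.DataFrame({
    "Level": [2, 2, 2, 3, 3, 3],
    "Rate": [28.0, 30.0, 32.0, 10.0, 12.0, 50.0],
})

# 'std' is the sample standard deviation, a common measure of spread.
spread = toy.groupby("Level").agg({"Rate": ["max", "mean", "std"]})

print(spread)
```

In this toy frame level 3 has the larger maximum, the smaller mean, and the larger standard deviation, the same pattern described in the answer above.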

#### Hire Date
##### Question 13: Print out the dataframe
Use the `.head()` method to print out the first five rows of the dataframe.

In [15]:
df.head()

Unnamed: 0,EmployeeID,DepartmentID,RateChangeDate,Rate,PayFrequency,LoginID,OrganizationLevel,JobTitle,BirthDate,MaritalStatus,Gender,HireDate,SalariedFlag,VacationHours,SickLeaveHours,CurrentFlag,ShiftID,StartDate,EndDate,ModifiedDate,DepartmentName,Sub-Department
0,1,16,00:00.0,125.5,2,adventure-works\ken0,,Chief Executive Officer,1/29/1969,S,M,1/14/2009,1,99,69,1,1,1/14/2009,,00:00.0,Executive,Executive General and Administration
1,2,1,00:00.0,63.4615,2,adventure-works\terri0,1.0,Vice President of Engineering,8/1/1971,S,F,1/31/2008,1,1,20,1,1,1/31/2008,,00:00.0,Engineering,Research and Development
2,3,1,00:00.0,43.2692,2,adventure-works\roberto0,2.0,Engineering Manager,11/12/1974,M,M,11/11/2007,1,2,21,1,1,11/11/2007,,00:00.0,Engineering,Research and Development
3,4,1,00:00.0,8.62,2,adventure-works\rob0,3.0,Senior Tool Designer,12/23/1974,S,M,12/5/2007,0,48,80,1,1,12/5/2007,5/30/2010,00:00.0,Engineering,Research and Development
4,4,2,00:00.0,8.62,2,adventure-works\rob0,3.0,Senior Tool Designer,12/23/1974,S,M,12/5/2007,0,48,80,1,1,5/31/2010,,00:00.0,Tool Design,Research and Development


##### Question 14: Get data types
Use the `.info()` method to get the data types for each column in the dataframe.

In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 304 entries, 0 to 303
Data columns (total 22 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   EmployeeID         304 non-null    int64  
 1   DepartmentID       304 non-null    int64  
 2   RateChangeDate     304 non-null    object 
 3   Rate               304 non-null    float64
 4   PayFrequency       304 non-null    int64  
 5   LoginID            304 non-null    object 
 6   OrganizationLevel  303 non-null    float64
 7   JobTitle           304 non-null    object 
 8   BirthDate          304 non-null    object 
 9   MaritalStatus      304 non-null    object 
 10  Gender             304 non-null    object 
 11  HireDate           304 non-null    object 
 12  SalariedFlag       304 non-null    int64  
 13  VacationHours      304 non-null    int64  
 14  SickLeaveHours     304 non-null    int64  
 15  CurrentFlag        304 non-null    int64  
 16  ShiftID            304 non

##### Question 15: Data type of `HireDate`
Notice the data types outputted from the `.info()` method. What is the data type of `HireDate`? What does that mean?

```
The `HireDate` column has the data type `object`. In Pandas this usually means the column holds strings, so string operations can be performed on it through the `.str` accessor.
```

##### Question 16: Split the `HireDate` column
Using the `.str` accessor object and the `.split()` method, get a Series of values in `HireDate` that are split by the forward slash `/` symbol.

In [17]:
df['HireDate'].str.split('/')

0       [1, 14, 2009]
1       [1, 31, 2008]
2      [11, 11, 2007]
3       [12, 5, 2007]
4       [12, 5, 2007]
            ...      
299     [5, 30, 2013]
300     [4, 16, 2012]
301     [5, 30, 2013]
302     [5, 30, 2012]
303     [5, 30, 2012]
Name: HireDate, Length: 304, dtype: object

##### Question 17: Observe split results
Observe the results of splitting the `HireDate` column by the forward slash `/` symbol. How many items does each list have? What might they represent?

```
Each list has three items, stored as strings of digits. The first represents the month, the second the day of the month, and the third the year.
```

##### Question 18: Get the last value from each list
Use the `.str` accessor object on the code you wrote previously to get the last (third) value out of each split list. Print out the resulting Series.

In [18]:
df['HireDate'].str.split('/').str[-1]

0      2009
1      2008
2      2007
3      2007
4      2007
       ... 
299    2013
300    2012
301    2013
302    2012
303    2012
Name: HireDate, Length: 304, dtype: object
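
The split-then-index pattern can be sketched on a standalone Series. The three date strings below are sample values in the same month/day/year form as `HireDate`; the names `dates`, `parts`, `years`, and `months` are assumptions for illustration.

```python
import pandas as pd

# Sample date strings in M/D/YYYY form.
dates = pd.Series(["1/14/2009", "12/5/2007", "5/30/2013"])

parts = dates.str.split("/")   # each element becomes a list of strings
years = parts.str[-1]          # .str[...] indexes into each list; -1 is the last item
months = parts.str[0]          # 0 is the first item

print(parts.tolist())
print(years.tolist())
```

Note that the pieces are still strings, so `years` holds text such as `"2009"` rather than numbers.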

##### Question 19: Create column `HireYear`
Using the code you wrote above, create a new column in the dataframe called `HireYear`.

In [20]:
df['HireYear'] = df['HireDate'].str.split('/').str[-1]
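
As an aside, the assignment asks for the string approach, but the same year could also be obtained by parsing the text as dates. The sketch below compares the two on sample values; it assumes every value follows the month/day/year pattern, and the variable names are hypothetical.

```python
import pandas as pd

# Sample date strings in M/D/YYYY form.
dates = pd.Series(["1/14/2009", "12/5/2007", "5/30/2013"])

# String approach (as in the assignment): the result is text.
year_text = dates.str.split("/").str[-1]

# Alternative: parse to datetimes, then read the year as a number.
parsed = pd.to_datetime(dates, format="%m/%d/%Y")
year_number = parsed.dt.year

print(year_text.tolist())
print(year_number.tolist())
```

Either version works as a grouping key; the difference is that the first holds strings and the second holds integers.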

##### Question 20: Print out the resulting dataframe
Print out the resulting dataframe. Make sure it has a column called `HireYear` that contains four-digit years.

In [21]:
df.head()

Unnamed: 0,EmployeeID,DepartmentID,RateChangeDate,Rate,PayFrequency,LoginID,OrganizationLevel,JobTitle,BirthDate,MaritalStatus,Gender,HireDate,SalariedFlag,VacationHours,SickLeaveHours,CurrentFlag,ShiftID,StartDate,EndDate,ModifiedDate,DepartmentName,Sub-Department,HireYear
0,1,16,00:00.0,125.5,2,adventure-works\ken0,,Chief Executive Officer,1/29/1969,S,M,1/14/2009,1,99,69,1,1,1/14/2009,,00:00.0,Executive,Executive General and Administration,2009
1,2,1,00:00.0,63.4615,2,adventure-works\terri0,1.0,Vice President of Engineering,8/1/1971,S,F,1/31/2008,1,1,20,1,1,1/31/2008,,00:00.0,Engineering,Research and Development,2008
2,3,1,00:00.0,43.2692,2,adventure-works\roberto0,2.0,Engineering Manager,11/12/1974,M,M,11/11/2007,1,2,21,1,1,11/11/2007,,00:00.0,Engineering,Research and Development,2007
3,4,1,00:00.0,8.62,2,adventure-works\rob0,3.0,Senior Tool Designer,12/23/1974,S,M,12/5/2007,0,48,80,1,1,12/5/2007,5/30/2010,00:00.0,Engineering,Research and Development,2007
4,4,2,00:00.0,8.62,2,adventure-works\rob0,3.0,Senior Tool Designer,12/23/1974,S,M,12/5/2007,0,48,80,1,1,5/31/2010,,00:00.0,Tool Design,Research and Development,2007


#### Hiring rates
Using the `HireYear` column you just created, you will count how many employees were hired in each year.

##### Question 21: Group by `HireYear`
Create a `GroupBy` object where rows are grouped by the year they were hired. Print it out, but don't save it to a variable.

In [23]:
df.groupby("HireYear")

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001EEDC3BF760>

##### Question 22: Aggregate by counting
Use the `.agg()` method to figure out how many employees were hired in each year. Count each `EmployeeID` using `count`.

In [25]:
df.groupby("HireYear").agg({"EmployeeID": "count"})

| HireYear | EmployeeID |
|---|---|
| 2006 | 1 |
| 2007 | 10 |
| 2008 | 74 |
| 2009 | 153 |
| 2010 | 38 |
| 2011 | 21 |
| 2012 | 4 |
| 2013 | 3 |


##### Question 23: Aggregating by counting unique
Wait! You just realized that this dataset contains data for each pay rate change for all employees. Thus, employees can appear several times in the data set.

Change the group by code above to use `nunique` instead of `count`.

In [27]:
df.groupby("HireYear").agg({"EmployeeID": "nunique"})

| HireYear | EmployeeID |
|---|---|
| 2006 | 1 |
| 2007 | 6 |
| 2008 | 74 |
| 2009 | 148 |
| 2010 | 38 |
| 2011 | 16 |
| 2012 | 4 |
| 2013 | 3 |
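
The difference between `count` and `nunique` can be seen side by side on a made-up frame in which some employees appear more than once. The IDs and years below are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: employee 2 has two rows, employee 3 has three.
toy = pd.DataFrame({
    "EmployeeID": [1, 2, 2, 3, 3, 3],
    "HireYear": ["2008", "2008", "2008", "2009", "2009", "2009"],
})

# 'count' counts rows; 'nunique' counts distinct values.
counts = toy.groupby("HireYear").agg({"EmployeeID": ["count", "nunique"]})

print(counts)
```

Both years have three rows, but only two distinct employees were hired in 2008 and one in 2009.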


##### Question 24: Best growth year
Which year saw the most growth in terms of number of employees? What does this make you think about the organization?

```
The year with the highest number of new employees was 2009. This is concerning because the number of new employees per year has decreased substantially since then, which seems to indicate that the company has stagnated in its growth.
```

#### Assistant vacation days expectation
Anna knows you from your past life and is seeking employment at Adventure Works as an assistant. She doesn't know what or who she would be an assistant to, but she wants you to help her get the inside scoop on how many vacation days she might get if hired.

##### Question 25: Get rows with 'Assistant' in the `JobTitle`
Using the `.str` accessor object and the `.contains()` method, get a Series of boolean (True/False) values that indicate which rows contain the word "Assistant" in the column `JobTitle`. Note that `.contains()` is case sensitive. Because the printed Series is truncated, every row shown will likely be `False`.

In [40]:
df['JobTitle'].str.contains("Assistant")

0      False
1      False
2      False
3      False
4      False
       ...  
299    False
300    False
301    False
302    False
303    False
Name: JobTitle, Length: 304, dtype: bool
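
Since the truncated preview can show only `False`, a quick way to confirm that some rows do match is to sum the boolean Series. The sketch below also shows the `case=False` option for a case-insensitive match. The job titles are made up, and "assistant buyer" in particular is a hypothetical value used to show the effect of case.

```python
import pandas as pd

# Hypothetical job titles.
titles = pd.Series(["Marketing Assistant", "assistant buyer", "Engineering Manager"])

exact = titles.str.contains("Assistant")                # case sensitive
relaxed = titles.str.contains("assistant", case=False)  # ignores case

# True counts as 1 when summed, so the sum is the number of matching rows.
print(exact.sum())
print(relaxed.sum())
```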

##### Question 26: Save to variable
Save the code you wrote above to a variable called `assistant_filter`.

In [41]:
assistant_filter = df['JobTitle'].str.contains("Assistant")

##### Question 27: Get rows with assistants
Pass the variable `assistant_filter` into the dataframe in the same way that you would a normal filter. Make sure that rows are returned that have a `JobTitle` with the word "Assistant" in it.

In [42]:
df[ assistant_filter ]

Unnamed: 0,EmployeeID,DepartmentID,RateChangeDate,Rate,PayFrequency,LoginID,OrganizationLevel,JobTitle,BirthDate,MaritalStatus,Gender,HireDate,SalariedFlag,VacationHours,SickLeaveHours,CurrentFlag,ShiftID,StartDate,EndDate,ModifiedDate,DepartmentName,Sub-Department,HireYear
20,17,4,00:00.0,13.4615,2,adventure-works\kevin0,2.0,Marketing Assistant,5/3/1987,S,M,1/26/2007,0,42,41,1,1,1/26/2007,,00:00.0,Marketing,Sales and Marketing,2007
22,19,4,00:00.0,13.4615,2,adventure-works\mary2,2.0,Marketing Assistant,1/29/1978,S,F,2/14/2011,0,43,41,1,1,2/14/2011,,00:00.0,Marketing,Sales and Marketing,2011
23,20,4,00:00.0,13.4615,2,adventure-works\wanida0,2.0,Marketing Assistant,3/17/1975,M,F,1/7/2011,0,41,40,1,1,1/7/2011,,00:00.0,Marketing,Sales and Marketing,2011
222,219,12,00:00.0,10.25,2,adventure-works\sean1,4.0,Document Control Assistant,3/12/1987,S,M,1/22/2009,0,78,59,1,1,1/22/2009,,00:00.0,Document Control,Quality Assurance,2009
223,220,12,00:00.0,10.25,2,adventure-works\karen0,4.0,Document Control Assistant,12/25/1975,M,F,2/9/2009,0,79,59,1,1,2/9/2009,,00:00.0,Document Control,Quality Assurance,2009
226,223,8,00:00.0,16.0,2,adventure-works\sairaj0,3.0,Scheduling Assistant,12/22/1987,M,M,1/26/2009,0,46,43,1,1,1/26/2009,,00:00.0,Production Control,Manufacturing,2009
227,224,7,00:00.0,8.62,2,adventure-works\william0,3.0,Scheduling Assistant,11/6/1981,M,M,1/7/2009,0,45,42,1,1,1/7/2009,8/31/2011,00:00.0,Production,Manufacturing,2009
228,224,8,00:00.0,8.62,2,adventure-works\william0,3.0,Scheduling Assistant,11/6/1981,M,M,1/7/2009,0,45,42,1,1,9/1/2011,,00:00.0,Production Control,Manufacturing,2009
229,224,8,00:00.0,13.5,2,adventure-works\william0,3.0,Scheduling Assistant,11/6/1981,M,M,1/7/2009,0,45,42,1,1,9/1/2011,,00:00.0,Production Control,Manufacturing,2009
230,225,8,00:00.0,16.0,2,adventure-works\alan0,3.0,Scheduling Assistant,3/29/1984,M,M,2/13/2009,0,47,43,1,2,2/13/2009,,00:00.0,Production Control,Manufacturing,2009


##### Question 28: Save the filtered dataframe
Using the code above, save the filtered dataframe to a new variable called `assistant_df`.

In [43]:
assistant_df = df[ assistant_filter ]

##### Question 29: Find the average vacation hours
Use the `.mean()` method on the `VacationHours` column of the `assistant_df` dataframe to find the average number of vacation hours for assistants.

In [44]:
assistant_df['VacationHours'].mean()

53.411764705882355
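
Because each row is a pay rate change, an employee with several changes contributes several rows to this average. As an optional check that is not required by the assignment, the sketch below compares a row-level mean with a per-employee mean on made-up data, using `drop_duplicates()` to keep one row per `EmployeeID`.

```python
import pandas as pd

# Hypothetical toy data: employee 11 appears three times with the same hours.
toy = pd.DataFrame({
    "EmployeeID": [10, 11, 11, 11],
    "VacationHours": [80, 40, 40, 40],
})

# Mean over all rows: repeated employees are counted once per row.
row_mean = toy["VacationHours"].mean()

# Mean over employees: keep the first row for each EmployeeID, then average.
per_employee = toy.drop_duplicates(subset="EmployeeID")
employee_mean = per_employee["VacationHours"].mean()

print(row_mean, employee_mean)
```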

##### Question 30: Group by `Sub-Department`
Create a `GroupBy` object by grouping the rows in `assistant_df` by the column `Sub-Department`.

In [45]:
assistant_df.groupby('Sub-Department')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001EEDC3BF910>

##### Question 31: Aggregate across Sub-Departments
Use the `.agg()` method to aggregate across `Sub-Department` using `VacationHours`. Get the count, mean, and median.

In [46]:
assistant_df.groupby('Sub-Department').agg({'VacationHours': ['count', 'mean', 'median']})

| Sub-Department | VacationHours (count) | VacationHours (mean) | VacationHours (median) |
|---|---|---|---|
| Executive General and Administration | 4 | 62.0 | 54.5 |
| Inventory Management | 2 | 50.5 | 50.5 |
| Manufacturing | 6 | 46.0 | 45.5 |
| Quality Assurance | 2 | 78.5 | 78.5 |
| Sales and Marketing | 3 | 42.0 | 42.0 |


##### Question 32: Add aggregation for `SickLeaveHours`
Add to the code above to include aggregations across `Sub-Department` using `SickLeaveHours`. Get the mean and median of `SickLeaveHours`.

In [48]:
assistant_df.groupby('Sub-Department').agg({'VacationHours': ['count', 'mean', 'median'], 'SickLeaveHours': ['mean', 'median']})

| Sub-Department | VacationHours (count) | VacationHours (mean) | VacationHours (median) | SickLeaveHours (mean) | SickLeaveHours (median) |
|---|---|---|---|---|---|
| Executive General and Administration | 4 | 62.0 | 54.5 | 50.75 | 47.0 |
| Inventory Management | 2 | 50.5 | 50.5 | 45.0 | 45.0 |
| Manufacturing | 6 | 46.0 | 45.5 | 42.666667 | 42.5 |
| Quality Assurance | 2 | 78.5 | 78.5 | 59.0 | 59.0 |
| Sales and Marketing | 3 | 42.0 | 42.0 | 40.666667 | 41.0 |


##### Question 33: Add aggregation for `Rate`
Add to the code above to include aggregations across `Sub-Department` using `Rate`. Get the mean and median of `Rate`.

In [49]:
assistant_df.groupby('Sub-Department').agg({'VacationHours': ['count', 'mean', 'median'], 'SickLeaveHours': ['mean', 'median'], 'Rate': ['mean', 'median']})

| Sub-Department | VacationHours (count) | VacationHours (mean) | VacationHours (median) | SickLeaveHours (mean) | SickLeaveHours (median) | Rate (mean) | Rate (median) |
|---|---|---|---|---|---|---|---|
| Executive General and Administration | 4 | 62.0 | 54.5 | 50.75 | 47.0 | 12.774025 | 13.7019 |
| Inventory Management | 2 | 50.5 | 50.5 | 45.0 | 45.0 | 12.75 | 12.75 |
| Manufacturing | 6 | 46.0 | 45.5 | 42.666667 | 42.5 | 13.123333 | 14.75 |
| Quality Assurance | 2 | 78.5 | 78.5 | 59.0 | 59.0 | 10.25 | 10.25 |
| Sales and Marketing | 3 | 42.0 | 42.0 | 40.666667 | 41.0 | 13.4615 | 13.4615 |
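
Rather than reading the largest value off the table by eye, `.idxmax()` on one of the aggregated columns returns the group label with the highest value. This is a sketch on made-up data; the team names and numbers are hypothetical.

```python
import pandas as pd

# Hypothetical toy data with three groups.
toy = pd.DataFrame({
    "Team": ["A", "A", "B", "B", "C", "C"],
    "VacationHours": [40, 44, 78, 79, 50, 51],
    "Rate": [13.0, 14.0, 10.0, 10.5, 15.0, 16.0],
})

summary = toy.groupby("Team").agg({
    "VacationHours": ["mean", "median"],
    "Rate": ["mean", "median"],
})

# idxmax returns the index label (here, the team) of the largest value.
best_vacation = summary[("VacationHours", "mean")].idxmax()
best_pay = summary[("Rate", "median")].idxmax()

print(best_vacation, best_pay)
```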


##### Question 34: Best sub-department
Looking at the aggregated table above, which of the sub-departments gives assistants the most vacation hours? Which one gives the most sick leave hours? Which one pays the most (use the median)?

```
Best for vacation hours: Quality Assurance
Best for sick leave hours: Quality Assurance
Best for pay: Manufacturing (highest median rate, 14.75)
```