## Lesson 04 - The HR Report

#### Overview: 

In this lesson we're going to talk about the following:
* Filtering DataFrames based on conditions

#### Handy References:
* [Official Python Documentation](https://docs.python.org/3/)
* [Jupyter Notebook Documentation](https://jupyter-notebook.readthedocs.io/en/stable/notebook.html)
* [Pandas](https://pandas.pydata.org/)
* [XlsxWriter](https://xlsxwriter.readthedocs.io/)

### The Data

The file we're working with is /data/hr_report.xlsx.  Go ahead and take a quick look at it; it's a basic list of HR incidents with dates.  A quick glance will tell you that Jim plays a lot of pranks on Dwight and that Michael makes a monthly complaint that "Toby is horrible".

As always, let's import our tools and define the file we'll be working with.

In [10]:
# File Imports
import pandas as pd
import xlsxwriter
import os

In [11]:
# Define the path to the file
hr_file = os.path.join('..', 'data', 'hr_report.xlsx')

Now we can read the file in and take a look at the various tables:

In [12]:
frames = pd.read_excel(hr_file, sheet_name=None)

In [13]:
df = pd.DataFrame(frames['Sheet1'])

In [14]:
df.head()

Unnamed: 0,Date,Reporter,Employee,Complaint
0,2019-03-01,Michael,Toby,Toby is horrible
1,2019-03-02,Dwight,Jim,Jim encased my stapler in Jello
2,2019-03-03,Andy,Jim,Jim hid my phone in the ceiling
3,2019-03-05,Dwight,Jim,I hit myself in the head with my phone
4,2019-03-09,Dwight,Jim,Jim jammed all of my desk drawers so I could o...


### Summarizing our Data:
Even though the source is fairly basic, the output we need is a little more complicated.  We want a breakdown of employees showing the number of complaints they made and the number of complaints against them, and we want a separate sheet that only shows incidents where the same employee had three or more complaints against them in a single month.

Unfortunately, we won't be able to use .groupby and .agg for this table.  We need to calculate the number of times per month that each employee made a complaint and the number of times per month that each employee received one, then assemble that data into a DataFrame.  

* Step 1: Generate a unique list of all employees in the report
* * We'll be using Python [Sets](https://docs.python.org/3/tutorial/datastructures.html#sets) for this.
* * A Set is an unordered collection of unique values, so converting a list to a Set drops any duplicates.
* Step 2: Calculate the number of complaints each employee made for each month
* Step 3: Calculate the number of complaints each employee received for each month

In [15]:
# Generating a unique list of all employees
reporters = df['Reporter'].unique()

# Get the unique employees who had complaints filed against them
employees = df['Employee'].unique()

# Combine the lists
all_employees = list(reporters) + list(employees)

# Convert the list to a set to drop duplicates, then back to a list
all_employees = list(set(all_employees))

In [16]:
all_employees

['Stanley',
 'Toby',
 'Jim',
 'Angela',
 'Phyllis',
 'Andy',
 'Pam',
 'Daryl',
 'Michael',
 'Oscar',
 'Ryan',
 'Dwight']

In [17]:
df.head()

Unnamed: 0,Date,Reporter,Employee,Complaint
0,2019-03-01,Michael,Toby,Toby is horrible
1,2019-03-02,Dwight,Jim,Jim encased my stapler in Jello
2,2019-03-03,Andy,Jim,Jim hid my phone in the ceiling
3,2019-03-05,Dwight,Jim,I hit myself in the head with my phone
4,2019-03-09,Dwight,Jim,Jim jammed all of my desk drawers so I could o...


#### Creating the 'Month' Column
Just like the previous lesson, we need to create a 'Month' column based on the Date:

In [18]:
df['Month'] = df['Date'].dt.strftime('%B')
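
If the `.dt` accessor is new to you, here is a small sketch of what that line does. The `sample` DataFrame and its dates are made up for this example, and the month names assume an English locale:

```python
import pandas as pd

# Hypothetical dates for illustration only
sample = pd.DataFrame({'Date': pd.to_datetime(['2019-03-01', '2019-04-15', '2019-05-20'])})

# .dt applies datetime methods to every value in the column;
# '%B' is the strftime code for the full month name
sample['Month'] = sample['Date'].dt.strftime('%B')

print(sample['Month'].tolist())
```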

Now we'll calculate each employee's number of complaints for each month.  We can do that by replicating `COUNTIFS` from Excel using [df.loc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html) to find the data we want to count, and getting the count from that data with the [len()](https://docs.python.org/3/library/functions.html#len) function.

For example, if we wanted to know how many times Phyllis filed a complaint in May, we would do this:

In [19]:
# Get the Data
df.loc[(df['Reporter'] == 'Phyllis') & (df['Month'] == 'May')]

Unnamed: 0,Date,Reporter,Employee,Complaint,Month
25,2019-05-20,Phyllis,Andy,Andy's talks like a baby all day,May


In [20]:
# Get a Count from that data:
len(df.loc[(df['Reporter'] == 'Phyllis') & (df['Month'] == 'May')])

1
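
To see why this works like `COUNTIFS`, it can help to look at the condition on its own. The sketch below uses a made-up `sample` DataFrame (not the HR report) and also shows an alternative way to count, by summing the True/False values directly:

```python
import pandas as pd

# Hypothetical rows for illustration only
sample = pd.DataFrame({
    'Reporter': ['Phyllis', 'Dwight', 'Phyllis', 'Dwight'],
    'Month': ['May', 'May', 'April', 'May'],
})

# Each comparison produces a column of True/False values;
# & keeps only the rows where both conditions are True
mask = (sample['Reporter'] == 'Dwight') & (sample['Month'] == 'May')

# True counts as 1, so summing the mask counts the matching rows
count_from_mask = int(mask.sum())

# Same answer as filtering with .loc and taking the length
count_from_len = len(sample.loc[mask])

print(count_from_mask, count_from_len)
```

Either approach is fine here; the lesson sticks with `len()` because it reads the same way as the filter that produced the rows.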

Let's build a function that will iterate through all employees and get our counts for a given month.  

* It will take a DataFrame, a month, and an employee list as arguments.  
* Then it will get our counts for complaints made and complaints received based on the Employee and Month.
* We will then assign that employee's counts to a dictionary
* Then we can convert the dictionary to a DataFrame.

Before we build the function, let's run through a single month:

In [21]:
month = 'March'

# Create an empty dictionary to hold our data
summary_data = {}

# Loop through our employee list with our conditions
for employee in all_employees:
    complaints_made = len(df.loc[(df['Reporter'] == employee) & (df['Month'] == month)])
    complaints_received = len(df.loc[(df['Employee'] == employee) & (df['Month'] == month)])
    
    # Update the dictionary with the information for that employee
    summary_data[employee] = {'complaints_made': complaints_made, 'complaints_received': complaints_received}

In [22]:
# Convert that data to a DataFrame
pd.DataFrame(summary_data).head()

Unnamed: 0,Stanley,Toby,Jim,Angela,Phyllis,Andy,Pam,Daryl,Michael,Oscar,Ryan,Dwight
complaints_made,0,0,0,0,0,1,0,0,1,1,0,4
complaints_received,0,1,5,0,0,0,0,0,1,0,0,0


Ok, we're getting the counts we want but the month is missing and the data is transposed.  In the function below we transpose the DataFrame and add in the month:

In [23]:
def summarize_hr_data(df, month, employees):
    '''Summarizes HR data by Month'''
    # Create an empty dictionary to hold our data
    summary_data = {}
    # Loop through our employee list with our conditions
    for employee in employees:
        complaints_made = len(df.loc[(df['Reporter'] == employee) & (df['Month'] == month)])
        complaints_received = len(df.loc[(df['Employee'] == employee) & (df['Month'] == month)])
        summary_data[employee] = {'complaints_made': complaints_made, 'complaints_received': complaints_received}
    
    # Transpose the data
    tmp_df = pd.DataFrame(summary_data).T
    
    # Add the Month column
    tmp_df['Month'] = month
    
    # Move the employee names out of the index and into a column named 'index'
    tmp_df.reset_index(inplace=True)
    
    # return the DataFrame
    return tmp_df
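
The transpose and `reset_index` steps are easier to see in isolation. This is a minimal sketch with made-up counts; `wide` and `tall` are names chosen for this example only:

```python
import pandas as pd

# Hypothetical counts for illustration only
summary_data = {
    'Jim': {'complaints_made': 0, 'complaints_received': 5},
    'Dwight': {'complaints_made': 4, 'complaints_received': 0},
}

# The outer keys become columns and the inner keys become the index
wide = pd.DataFrame(summary_data)

# .T swaps rows and columns so each employee gets a row
tall = wide.T

# reset_index moves the employee names into a column called 'index'
tall = tall.reset_index()

print(list(wide.columns))
print(list(tall.columns))
```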

Let's test our function with a single month:

In [24]:
summarize_hr_data(df, 'March', all_employees).head()

Unnamed: 0,index,complaints_made,complaints_received,Month
0,Stanley,0,0,March
1,Toby,0,1,March
2,Jim,0,5,March
3,Angela,0,0,March
4,Phyllis,0,0,March


Great!  The function is returning the counts we expect.  We'll need to rename 'index' to 'Employee' when we assemble the whole thing.  Next we'll use a list comprehension, as in the previous lesson, to put everything together:

In [25]:
# Concat the DataFrames
# Pass all_employees so that people who only filed complaints are included too
hr_summary_df = pd.concat([summarize_hr_data(df, month, all_employees) for month in df['Month'].unique()], 
                          ignore_index=True)

In [26]:
hr_summary_df.head()

Unnamed: 0,index,complaints_made,complaints_received,Month
0,Stanley,0,0,March
1,Toby,0,1,March
2,Jim,0,5,March
3,Angela,0,0,March
4,Phyllis,0,0,March
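
As a refresher on the pattern, here is a small sketch of a list comprehension feeding `pd.concat`. The `make_frame` helper and the `months` list are hypothetical stand-ins for our function and data:

```python
import pandas as pd

def make_frame(month):
    '''Hypothetical helper that returns a one-row DataFrame for a month'''
    return pd.DataFrame({'Month': [month], 'Count': [1]})

months = ['March', 'April', 'May']

# The list comprehension builds one DataFrame per month
frames = [make_frame(month) for month in months]

# pd.concat stacks them; ignore_index=True renumbers the rows from 0
combined = pd.concat(frames, ignore_index=True)

print(combined)
```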


We know we need to rearrange the columns and rename 'index', but what if an employee had no activity that month?  Should they be listed at all?  Let's drop them.

* We'll build a condition that keeps only the rows where the employee made or received at least one complaint:

In [None]:
hr_summary_df = hr_summary_df.loc[(hr_summary_df['complaints_made'] > 0) | (hr_summary_df['complaints_received'] > 0)]

In [None]:
hr_summary_df
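
Here is the same "either/or" filter on a tiny made-up table, so you can see which rows survive. The `sample` data is hypothetical:

```python
import pandas as pd

# Hypothetical summary rows for illustration only
sample = pd.DataFrame({
    'index': ['Jim', 'Stanley', 'Dwight'],
    'complaints_made': [0, 0, 4],
    'complaints_received': [5, 0, 0],
})

# | means "or": keep a row if either count is above zero
active = sample.loc[(sample['complaints_made'] > 0) | (sample['complaints_received'] > 0)]

print(active['index'].tolist())
```

Stanley is dropped because both of his counts are zero. Note that each condition needs its own parentheses, just like with `&`.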

In [None]:
# Let's rename and reorder our columns:
hr_summary_df.columns = ['Employee', 'Complaints Made', 'Complaints Received', 'Month']

In [None]:
hr_summary_df = hr_summary_df[['Month', 'Employee', 'Complaints Made', 'Complaints Received']]

In [None]:
hr_summary_df
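
Assigning a list to `.columns` relies on the columns already being in a known order. A sketch of an alternative that renames by name instead, using a made-up one-row `sample` DataFrame:

```python
import pandas as pd

# Hypothetical row for illustration only
sample = pd.DataFrame({
    'index': ['Jim'],
    'complaints_made': [0],
    'complaints_received': [5],
    'Month': ['March'],
})

# Map each old column name to its new name
sample = sample.rename(columns={
    'index': 'Employee',
    'complaints_made': 'Complaints Made',
    'complaints_received': 'Complaints Received',
})

# Select the columns in the order we want them
sample = sample[['Month', 'Employee', 'Complaints Made', 'Complaints Received']]

print(list(sample.columns))
```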

We have our summary, but we need to write the original data to another sheet with one condition: We only want to keep the data where an employee had three or more complaints against them in the same month.  We can do that with df.loc:

In [None]:
df.loc[(df['Employee'] == 'Jim') & (df['Month'] == 'March')]

### Conditionally adding Data to a DataFrame
We're going to use a similar function to the one we used earlier to do the following:
* Iterate through all employees and months
* Using a Python [IF Statement](https://docs.python.org/3/tutorial/controlflow.html#if-statements), we'll check if an employee had three or more incidents in the same month and add that data to a list of DataFrames if they did.
* Then we'll concatenate those DataFrames.

In [None]:
def get_complaint_details(df, month, employees):
    '''Gets the complaint details if the employee had three or more complaints in a month'''
    # Create an empty list to hold our DataFrames
    frames = []
    # Loop through our employee list with our conditions
    for employee in employees:
        complaints_received = df.loc[(df['Employee'] == employee) & (df['Month'] == month)]
        # IF there were three or more complaints:
        if len(complaints_received) >= 3:
            frames.append(complaints_received)
    
    # pd.concat raises an error on an empty list, so return an empty
    # DataFrame with the same columns if nobody met the threshold
    if not frames:
        return df.iloc[0:0]
    
    return pd.concat(frames, ignore_index=True)

Now we will use our list comprehension to iterate through our months again and assemble the data into a final DataFrame:

In [None]:
hr_detail_df = pd.concat([get_complaint_details(df, month, all_employees) for month in df['Month'].unique()], 
                         ignore_index=True)

In [None]:
hr_detail_df.head()

In [None]:
hr_summary_df.head()
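
For comparison, the detail filter can also be written without a loop. This is only a sketch of an alternative, not the approach this lesson uses; the `sample` data is made up, and it assumes one row per complaint:

```python
import pandas as pd

# Hypothetical complaints for illustration only
sample = pd.DataFrame({
    'Employee': ['Jim', 'Jim', 'Jim', 'Toby', 'Jim'],
    'Month': ['March', 'March', 'March', 'March', 'April'],
    'Complaint': ['a', 'b', 'c', 'd', 'e'],
})

# For every row, look up how many rows share its Employee and Month
group_sizes = sample.groupby(['Employee', 'Month'])['Complaint'].transform('size')

# Keep only the rows that belong to a group of three or more
repeat_offenders = sample.loc[group_sizes >= 3]

print(repeat_offenders)
```

Only Jim's three March complaints are kept; his single April complaint and Toby's March complaint fall below the threshold.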

### Exercise: Writing more data to Excel
* Ok, one last 'writing to Excel' challenge before the final lesson.
* Hopefully you've been taking notes or have the previous lessons open in a different tab.
* You're going to write both tables to Excel on separate sheets: hr_summary_df and hr_detail_df

To save time, the Hex color codes are as follows:
```
Bad
- Background: #FFC7CE
- Font: #9C0006
```

#### The Output File
For the exercise, complete the code below to write the data to Excel:
* The output file name is `hr_summary.xlsx`
* I'm introducing a new argument to our `writer` definition called `datetime_format`.  You can read more about it on the [xlsxwriter page](https://xlsxwriter.readthedocs.io/example_pandas_datetime.html#ex-pandas-datetime)
* In essence, it tells Excel how we want our dates and times to look.

In [None]:
output_file = os.path.join('..', 'data', '_________.xlsx')
writer = pd.ExcelWriter(output_file, engine='xlsxwriter', datetime_format='mm/dd/yy')
workbook = writer.book

#### Formats

In [None]:
# Define the format for our header:
header_format = workbook.add_format({
    'bold': True, #Bold Font: This value must be either True or False
    'align': 'center', #Center Alignment
    'valign': 'top', #Top Alignment
    'fg_color': '#4472C1', #Cell Color
    'font_color': 'white', #Font Color
    'font_size': 12, #Font Size
})

# Define the format for our numbers:
number_format = workbook.add_format({'num_format': '#,##0'})

# Define the bad color format
bad_color = workbook.add_format(_____)

#### HR Summary:
* hr_summary_df should be the first sheet
* The sheet name should be 'HR Monthly Summary'
* The column order for it should be Month, Employee, Complaints Made, Complaints Received
* Complaints Made and Received should have a 'Bad' highlight for anything over 2

In [None]:
# Check the column order:
_____

In [None]:
# Reorder the columns if necessary:

In [None]:
# Define the HR summary sheet and write the data to Excel
sheet = ______
hr_summary_df.to_excel(______, ______=_____, index=False)

In [None]:
# Define the worksheet
worksheet = writer.sheets[sheet]

In [None]:
# Write the headers to the worksheet
for col_num, value in enumerate(hr_summary_df._____._____):
    worksheet.write(0, col_num, value, header_format)

In [None]:
# Set the numerical columns
worksheet.set_column('_:_', 14, None)
worksheet.set_column('_:_', 14, number_format)

In [None]:
# Define rows and columns
first_row = 1
last_row = _____
complaints_made_col = 2
complaints_received_col = 3

# Set the color conditions
complaints_over_two = {
    'type': 'cell', 
    'criteria': '>', 
    'value': 2,
    'format': bad_color,
}

In [None]:
# Apply the color formats
worksheet.conditional_format(first_row,
                            complaints_made_col,
                            last_row,
                            complaints_received_col,
                            complaints_over_two)

#### HR Details
* hr_detail_df should be the last sheet
* The sheet name should be 'HR Monthly Detail'
* The column order is Date, Employee, Reporter, Complaint

In [None]:
# Check the column order
______

In [None]:
# Reorder the columns if necessary:
_____

In [None]:
# Define the sheet and write the data to Excel
_____ = 'HR Monthly Detail'
_____.to_excel(writer, sheet_name=sheet, index=False)

In [None]:
# Define the worksheet
worksheet = writer.sheets[sheet]

In [None]:
# Write the headers to the worksheet
for col_num, value in enumerate(hr_detail_df._____._____):
    worksheet.write(0, col_num, value, header_format)

In [None]:
# Set the column width
# Do not change the date column
worksheet.set_column('_:_', 14, None)

#### Save the File:

In [None]:
# close() saves the file; writer.save() was removed in pandas 2.0
writer.close()