## Lesson 02 - The Accounting Report

#### Overview: 
Angela, the head of accounting, sends her report in on three separate sheets:
- Angela: Client Accounts
- Kevin: Employee Expenses
- Oscar: Company Expenses

<font color=red>**Caveat**</font>: I don't work in finance or accounting.  These are not reflective of actual accounting documents and are presented for instructional purposes only.

In this lesson we're going to talk about the following:
* Reading multiple sheets from the same file
* Basic Date Manipulation
* Conditional formatting
* Writing multiple sheets to the same file

#### Handy References:
* [Official Python Documentation](https://docs.python.org/3/)
* [Jupyter Notebook Documentation](https://jupyter-notebook.readthedocs.io/en/stable/notebook.html)
* [Pandas](https://pandas.pydata.org/)
* [XlsxWriter](https://xlsxwriter.readthedocs.io/)

### Reading Multiple Sheets

The file we're working with is /data/accounting.xlsx.  Everyone go ahead and download it and take a quick look.  It always helps to get an idea of how the source file is laid out before reading it into Python.

You have three separate sheets, labeled 'Angela', 'Kevin', and 'Oscar'.
* Angela's sheet is a listing of clients, bill dates, bill amounts, paid amounts, paid dates, and the balance due.
* Kevin's sheet is a listing of employee expense requests and the amount approved and paid.
* Oscar's sheet is simply a list of company expenses that were paid out, broken down by month.

We're going to read the whole file into pandas using the [read_excel](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html) function.  As before, the first thing we have to do is import the required packages and define the path to the file we want to read.

In [1]:
# File Imports
import pandas as pd
import xlsxwriter
import os

In [2]:
# Define the path to the file
accounting_file = os.path.join('..', 'data', 'accounting.xlsx')

Let's start by reading in the sheets individually and taking a look at them in pandas.

In [3]:
angela_df = pd.read_excel(accounting_file, sheet_name='Angela')

In [4]:
angela_df.head()

Unnamed: 0,Customer,Bill Date,Amount,Paid,Paid Date,Balance
0,Blue Cross,2019-03-01,9900,9900,2019-03-15 00:00:00,0
1,Dunmore High,2019-03-01,6600,0,2019-03-15 00:00:00,6600
2,Maguire Advertising,2019-03-01,5775,5000,2019-03-15 00:00:00,775
3,Decker Automotive,2019-03-01,3300,3000,2019-03-15 00:00:00,300
4,Lackawanna County,2019-03-01,6600,6000,2019-03-15 00:00:00,600


In [5]:
kevin_df = pd.read_excel(accounting_file, sheet_name='Kevin')

In [6]:
kevin_df.head()

Unnamed: 0,Employee,Date,Expense,Amount Requested,Justification,Amount Paid
0,Michael,2019-03-01,Gas to NY,75,Seeing Jan,75
1,Michael,2019-03-07,Gas to NY,75,Seeing Jan,75
2,Michael,2019-03-09,Gas to NY,75,Seeing Jan,75
3,Darryl,2019-03-01,Food,50,Warehouse Lunch,50
4,Michael,2019-03-15,Food,200,Dinner with Jan,150


In [7]:
oscar_df = pd.read_excel(accounting_file, sheet_name='Oscar')

In [8]:
oscar_df.head()

Unnamed: 0,Month,Item,Amount Paid
0,March,HammerMill Paper,1000
1,March,Fuel - Warehouse,200
2,March,Equipment - Warehouse,500
3,March,Chrysler Seabring Lease,350
4,March,Georgia-Pacific Paper,750


#### Reading all sheets at once
* We can read all the worksheets in a single Excel file with one command by changing `sheet_name` from the name of the sheet to `None`.
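* Instead of getting a single DataFrame, we'll get an [Ordered Dictionary](https://docs.python.org/2/library/collections.html#collections.OrderedDict) containing the name of each sheet as the key and the data as the value.  (Newer versions of pandas return a regular `dict` instead, which keeps the sheets in order too, so everything below works the same way.)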
* We talked about dictionaries in Python in the previous lesson.  Ordered Dictionaries are much the same, except the keys are in a specific order.
* Let's take a look at the data:

In [9]:
frames = pd.read_excel(accounting_file, sheet_name=None)

In [10]:
# Accessing the list of keys in a Dictionary
frames.keys()

odict_keys(['Angela', 'Kevin', 'Oscar'])

In [11]:
# Accessing a single value from a dictionary
frames['Angela'].head()

Unnamed: 0,Customer,Bill Date,Amount,Paid,Paid Date,Balance
0,Blue Cross,2019-03-01,9900,9900,2019-03-15 00:00:00,0
1,Dunmore High,2019-03-01,6600,0,2019-03-15 00:00:00,6600
2,Maguire Advertising,2019-03-01,5775,5000,2019-03-15 00:00:00,775
3,Decker Automotive,2019-03-01,3300,3000,2019-03-15 00:00:00,300
4,Lackawanna County,2019-03-01,6600,6000,2019-03-15 00:00:00,600
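
Because `frames` behaves like a dictionary, we can also loop over it to get a quick look at every sheet at once.  The sketch below uses a small hand-built dictionary of made-up DataFrames (`frames_demo`, a hypothetical stand-in for the result of `pd.read_excel(..., sheet_name=None)`), so the values are for illustration only.

```python
import pandas as pd

# Hypothetical stand-in for: frames = pd.read_excel(accounting_file, sheet_name=None)
frames_demo = {
    'Angela': pd.DataFrame({'Customer': ['Blue Cross', 'Dunmore High'],
                            'Amount': [9900, 6600]}),
    'Oscar': pd.DataFrame({'Month': ['March'],
                           'Item': ['HammerMill Paper'],
                           'Amount Paid': [1000]}),
}

# .items() yields (sheet name, DataFrame) pairs in sheet order
sheet_shapes = {}
for sheet_name, df in frames_demo.items():
    sheet_shapes[sheet_name] = df.shape  # (rows, columns)
    print(sheet_name, df.shape, list(df.columns))
```

The same loop should work unchanged on the real `frames` object, since it only relies on dictionary behavior.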


We've got access to our data, so let's do some quick summary calculations.  Looking at Angela's data, let's sum by Customer and Month.  Since Angela used "Bill Date" and we need the Month, let's create a new column for that.  

### Basic Date Manipulation
* Dates and Times are a fairly complex subject when it comes to Python and pandas.  For this lesson I'm going to cover a few basic techniques but I encourage you to spend some time reading about them on your own after the class.
* Pandas will normally read Excel date-time fields as a date-time type.  We can verify it did by using `.info()` on our DataFrame:

In [12]:
angela_df = frames['Angela']
angela_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60 entries, 0 to 59
Data columns (total 6 columns):
Customer     60 non-null object
Bill Date    60 non-null datetime64[ns]
Amount       60 non-null int64
Paid         60 non-null int64
Paid Date    60 non-null object
Balance      60 non-null int64
dtypes: datetime64[ns](1), int64(3), object(2)
memory usage: 2.9+ KB


*Note: This is a good time to address a vocabulary issue.  I've been referring to columns throughout the class so far, because that's what we call them in Excel and in most tables.  Pandas refers to a column as a [Series](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html).  When you're looking at the documentation, just remember a Series is a column.*
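
To make the Series vocabulary concrete, here is a tiny sketch using a made-up DataFrame (`demo_df` is not part of the lesson data).  Selecting one column with single brackets gives back a Series, while double brackets give back a one-column DataFrame.

```python
import pandas as pd

# Made-up data for illustration only
demo_df = pd.DataFrame({'Customer': ['Blue Cross', 'Dunmore High'],
                        'Amount': [9900, 6600]})

amount = demo_df['Amount']            # single brackets: one column as a Series
amount_frame = demo_df[['Amount']]    # double brackets: a one-column DataFrame

print(type(amount).__name__)
print(type(amount_frame).__name__)
```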

* Let's say we wanted to extract the month from our 'Bill Date' column.
* You can see that the 'Bill Date' column has the `datetime64[ns]` type, which is exactly what we need.  If it didn't, you could convert it with [pd.to_datetime](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html).
* From there we can use [Series.dt.month](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.dt.month.html) to extract the number of the month from the column.  Let's see what that gives us:

In [13]:
angela_df['Bill Date'].dt.month.head()

0    3
1    3
2    3
3    3
4    3
Name: Bill Date, dtype: int64
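
If a date column arrives as text instead, the conversion step mentioned above comes first.  This is a minimal sketch with made-up dates (`dates_as_text` is a hypothetical name), showing `pd.to_datetime` followed by `.dt.month`.

```python
import pandas as pd

# Made-up dates stored as text, as they might arrive from a CSV file
dates_as_text = pd.Series(['2019-03-01', '2019-04-15', '2019-05-25'])

dates = pd.to_datetime(dates_as_text)   # text -> datetime64
months = dates.dt.month                 # month numbers

print(dates.dtype)
print(months.tolist())
```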

#### Erroneous Data
* Let's try the same thing with the `Paid Date` column.
* If you look back at the `.info()` output from earlier, that column is an `object` data type and not a datetime.

In [14]:
angela_df['Paid Date'].dt.month.head()

AttributeError: Can only use .dt accessor with datetimelike values

##### Python Errors
For a lot of us, that probably looks like a lot of meaningless information.  If you look at the top of the output, you'll see `AttributeError` in red text, followed a couple of lines later by `----> 1 angela_df['Paid Date'].dt.month.head()`.  
* That line is telling us the problem is on Line 1 with the command listed.
* The next several lines point to the pandas library and show us what parts of the library the error occurred in.
* The last line `AttributeError: Can only use .dt accessor with datetimelike values` tells us what the actual problem is.
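
The last line of a traceback always follows the pattern `ErrorType: message`.  As a rough sketch, we can catch the error ourselves and print just that part.  The `paid_date` column here is made up to mimic a date column that contains a "-" placeholder.

```python
import pandas as pd

# Made-up column that mixes date text with a "-" placeholder
paid_date = pd.Series(['2019-03-15', '-'])

try:
    paid_date.dt.month          # .dt only works on datetime-like data
    error_name = None
except AttributeError as err:
    error_name = type(err).__name__
    print(f'{error_name}: {err}')
```

Catching the error like this is only for demonstration.  In day-to-day work, reading the full traceback is usually the faster way to find the problem.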

Reading and understanding error messages in Python, or any programming language, is something that comes with time and experience.  Whenever you get an error, always look for the line in your code that caused the error, and the reason the error happened.  From there it's usually possible through a combination of research and reading to determine what went wrong and how to fix it.  In our case, it's because we tried to use a method that only works on datetimes on some data that was not a datetime.  Let's fix that now:
* We will assign the column `angela_df['Paid Date']` to the value `pd.to_datetime(angela_df['Paid Date'])`:

In [15]:
angela_df['Paid Date'] = pd.to_datetime(angela_df['Paid Date'])

TypeError: invalid string coercion to datetime

Oops... we got another error.  This time we got a `TypeError` because the column contains some values that can't be converted to datetimes.  The line above that reads `ValueError: Error parsing datetime string "-" at position 1` tells us that is the problem.  

If you look at the documentation page for [to_datetime](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html), you'll see we can change how errors are handled.  In our case, we'll set errors to `coerce`.  That will cause the command to convert every row possible, and set the others to a value called `NaT`.  All that means is that the value is Not a Time, or not a datetime.

Let's try that now:

In [16]:
pd.to_datetime(angela_df['Paid Date'], errors='coerce')

0    2019-03-15
1    2019-03-15
2    2019-03-15
3    2019-03-15
4    2019-03-15
5    2019-03-15
6    2019-04-01
7    2019-04-01
8    2019-04-01
9    2019-04-01
10   2019-04-01
11   2019-04-01
12   2019-04-01
13   2019-04-01
14   2019-04-01
15   2019-04-01
16   2019-04-01
17   2019-04-01
18   2019-04-01
19   2019-04-15
20   2019-04-15
21   2019-04-15
22   2019-04-15
23   2019-04-16
24   2019-04-14
25   2019-04-25
26   2019-04-25
27   2019-04-25
28   2019-04-25
29   2019-04-25
30   2019-04-20
31   2019-04-20
32   2019-04-20
33   2019-04-25
34   2019-05-01
35   2019-04-25
36   2019-05-01
37   2019-05-03
38   2019-04-15
39   2019-04-25
40   2019-05-15
41   2019-05-15
42   2019-05-15
43   2019-05-16
44   2019-05-17
45   2019-05-19
46   2019-05-15
47   2019-05-15
48          NaT
49          NaT
50   2019-05-15
51   2019-05-14
52   2019-05-25
53   2019-05-25
54   2019-05-25
55   2019-05-25
56   2019-05-25
57   2019-05-25
58   2019-05-25
59   2019-05-25
Name: Paid Date, dtype: datetime64[ns]

If you look at rows 48 and 49 above, you'll see they are set to `NaT`.  We'll deal with those in just a second.  For now, let's assign that data back to the 'Paid Date' column.

In [17]:
angela_df['Paid Date'] = pd.to_datetime(angela_df['Paid Date'], errors='coerce')
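
Before overwriting a column like this, it can be worth checking which original values could not be converted.  The sketch below assumes a made-up column (`raw`) and compares it against the coerced result.  It is one possible approach, not the only one.

```python
import pandas as pd

# Made-up values: two real dates and two "-" placeholders
raw = pd.Series(['2019-05-15', '-', '2019-05-25', '-'])
converted = pd.to_datetime(raw, errors='coerce')

# True where conversion produced NaT even though the original cell had a value
failed = converted.isna() & raw.notna()

print(raw[failed])
```

Printing `raw[failed]` shows the offending values, which makes it easier to decide whether `NaT` is an acceptable replacement.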

Now that we have our columns converted to the proper type, we want to calculate the turnaround time between the bill date and the paid date.  In Excel, we'd write a formula that subtracted the Bill Date from the Paid Date.  In pandas, we're simply going to do the calculation.  I'm keeping this fairly basic, but you can do some phenomenal stuff with column-wise operations in pandas.  I encourage everyone to further explore [Python for Data Analysis](https://learning.oreilly.com/library/view/python-for-data/9781491957653/) to see just how much is possible.

In [18]:
angela_df['Turnaround Time'] = angela_df['Paid Date'] - angela_df['Bill Date']

In [19]:
angela_df.head()

Unnamed: 0,Customer,Bill Date,Amount,Paid,Paid Date,Balance,Turnaround Time
0,Blue Cross,2019-03-01,9900,9900,2019-03-15,0,14 days
1,Dunmore High,2019-03-01,6600,0,2019-03-15,6600,14 days
2,Maguire Advertising,2019-03-01,5775,5000,2019-03-15,775,14 days
3,Decker Automotive,2019-03-01,3300,3000,2019-03-15,300,14 days
4,Lackawanna County,2019-03-01,6600,6000,2019-03-15,600,14 days


We have one final step before 'Turnaround Time' is complete.  Right now the column holds time differences (timedeltas, displayed as '14 days') rather than plain numbers, which can get in the way of summary calculations like 'average'.  We have to tell pandas that it is an actual number.  We do this by dividing by a one-day timedelta created with the [to_timedelta()](https://pandas.pydata.org/pandas-docs/version/0.24.0rc1/api/generated/pandas.to_timedelta.html) function.  You can read more about that function on the documentation page.

Dividing by `pd.to_timedelta(1, unit='D')` tells pandas to express 'Turnaround Time' in units of 1 day, or 1 'D'.

In [20]:
angela_df['Turnaround Time'] = angela_df['Turnaround Time'] / pd.to_timedelta(1, unit='D')
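
Here is the same idea on a couple of made-up dates, so the numbers are easy to check by hand.  The sketch also shows `.dt.days`, an alternative that returns whole days.  Which one fits best depends on whether you need fractions of a day.

```python
import pandas as pd

# Made-up bill and paid dates
bill_date = pd.to_datetime(pd.Series(['2019-03-01', '2019-04-01']))
paid_date = pd.to_datetime(pd.Series(['2019-03-15', '2019-04-18']))

turnaround = paid_date - bill_date                             # timedelta values
days_by_division = turnaround / pd.to_timedelta(1, unit='D')   # floats
days_by_accessor = turnaround.dt.days                          # whole days

print(days_by_division.tolist())
print(days_by_accessor.tolist())
```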

The next thing we want to do is create a summary table of our customers broken down by month that shows the following:
* Customer
* Month
* Total billed
* Total paid
* Total balance
* Average turnaround time

We'll start by converting 'Bill Date' to just the month instead of the full date. We'll also create a new column to hold the actual month name of the 'Bill Date' column.  We talked about getting the month number earlier.

Python gives us an excellent and easy way to get the actual month name: [strftime](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.dt.strftime.html), which is short for 'string-formatted time'.  We will use the `%B` code to extract what we need.

*Note: You can find a complete list of strftime codes at [basic date and time types](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior)*

In [21]:
angela_df['Month'] = angela_df['Bill Date'].dt.strftime('%B')
angela_df['Bill Date'] = angela_df['Bill Date'].dt.month

We are leaving both columns for now so that we can sort by the month number.  Then, in our summary table, we will drop the month number altogether.
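
To see why the month number is worth keeping around, compare sorting by name with sorting by number.  This sketch uses a made-up DataFrame and a hypothetical 'Month Number' column.  Month names sort alphabetically, which puts April ahead of March.

```python
import pandas as pd

# Made-up data, deliberately out of calendar order
demo = pd.DataFrame({
    'Bill Date': pd.to_datetime(['2019-05-01', '2019-03-01', '2019-04-01']),
    'Amount': [300, 100, 200],
})

demo['Month'] = demo['Bill Date'].dt.strftime('%B')   # full month name
demo['Month Number'] = demo['Bill Date'].dt.month     # 1-12, used for sorting

by_name = demo.sort_values('Month')['Month'].tolist()
by_number = demo.sort_values('Month Number')['Month'].tolist()

print(by_name)     # alphabetical order
print(by_number)   # calendar order
```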

Now we can create our summary table: 

In [22]:
customer_summary = angela_df.groupby(['Bill Date', 'Month', 'Customer'], as_index=False).agg({
                        'Amount': [sum],
                        'Paid': [sum],
                        'Balance': [sum],
                        'Turnaround Time': ["mean"]
                    })

In [23]:
customer_summary.head()

Unnamed: 0_level_0,Bill Date,Month,Customer,Amount,Paid,Balance,Turnaround Time
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,sum,sum,sum,mean
0,3,March,Apex Technology,2475,2400,75,17.0
1,3,March,Barbara Allen,2475,2475,0,17.0
2,3,March,Blue Cross,9900,9900,0,14.0
3,3,March,Bob Vance,2475,2475,0,17.0
4,3,March,Decker Automotive,3300,3000,300,14.0


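You'll notice that I put "mean" in quotes above but not `sum`.  Written without quotes, `sum` is Python's built-in function being passed in directly; written in quotes, "mean" is the name of a pandas aggregation.  Pandas accepts aggregation names as strings, so "sum" in quotes would work as well.  You can find more information on the [documentation page](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.agg.html).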
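
As an aside, pandas also supports "named aggregation", which names the output columns up front and avoids the two-level header we got above.  This is a sketch on made-up data with hypothetical output names (`total_billed`, `avg_turnaround`).  It is an alternative to the approach in this lesson, not a replacement for it.

```python
import pandas as pd

# Made-up data for illustration only
demo = pd.DataFrame({
    'Customer': ['Blue Cross', 'Blue Cross', 'Bob Vance'],
    'Amount': [100, 200, 50],
    'Turnaround Time': [14.0, 16.0, 17.0],
})

# Each keyword is an output column: name=(source column, aggregation name)
summary = demo.groupby('Customer', as_index=False).agg(
    total_billed=('Amount', 'sum'),
    avg_turnaround=('Turnaround Time', 'mean'),
)

print(summary)
```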

Now that we have our summary table, let's rename the columns and drop 'Bill Date'.  You probably remember the column renaming from the previous lesson.

In [24]:
customer_summary.columns = ['Bill Date', 
                            'Month', 
                            'Customer', 
                            'Total Billed', 
                            'Total Paid', 
                            'Balance', 
                            'Turnaround Time']

To drop a column, we will use [df.drop()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html):

In [25]:
customer_summary.drop(columns=['Bill Date'], inplace=True)

In [26]:
customer_summary.head()

Unnamed: 0,Month,Customer,Total Billed,Total Paid,Balance,Turnaround Time
0,March,Apex Technology,2475,2400,75,17.0
1,March,Barbara Allen,2475,2475,0,17.0
2,March,Blue Cross,9900,9900,0,14.0
3,March,Bob Vance,2475,2475,0,17.0
4,March,Decker Automotive,3300,3000,300,14.0


This table is far from perfect.  We have some `NaN` values that we will talk about in the next lesson, and ideally this whole table would be split into separate tables by month or by customer.  We will also be covering that before the class is over.  For now, let's get our other two sheets taken care of, then write this data to Excel.

Let's tackle Kevin's worksheet first:

In [27]:
kevin_df = frames['Kevin']

In [28]:
kevin_df.head()

Unnamed: 0,Employee,Date,Expense,Amount Requested,Justification,Amount Paid
0,Michael,2019-03-01,Gas to NY,75,Seeing Jan,75
1,Michael,2019-03-07,Gas to NY,75,Seeing Jan,75
2,Michael,2019-03-09,Gas to NY,75,Seeing Jan,75
3,Darryl,2019-03-01,Food,50,Warehouse Lunch,50
4,Michael,2019-03-15,Food,200,Dinner with Jan,150


In [29]:
# Get the month column
kevin_df['Month'] = kevin_df['Date'].dt.strftime('%B')

In [30]:
# Convert the month column to the number of the month
kevin_df['Date'] = kevin_df['Date'].dt.month

In [31]:
expenses_summary = kevin_df.groupby(['Date', 'Month', 'Employee'], as_index=False).agg({
    'Amount Requested': [sum],
    'Amount Paid': [sum],
})

In [32]:
expenses_summary.head()

Unnamed: 0_level_0,Date,Month,Employee,Amount Requested,Amount Paid
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,sum,sum
0,3,March,Darryl,50,50
1,3,March,Michael,800,450
2,4,April,Andy,200,100
3,4,April,Dwight,100,100
4,4,April,Jim,100,100


In [33]:
expenses_summary.columns = ['Date', 'Month', 'Employee', 'Amount Requested', 'Amount Paid']
expenses_summary.drop(columns=['Date'], inplace=True)

In [34]:
expenses_summary.head()

Unnamed: 0,Month,Employee,Amount Requested,Amount Paid
0,March,Darryl,50,50
1,March,Michael,800,450
2,April,Andy,200,100
3,April,Dwight,100,100
4,April,Jim,100,100


And now Oscar's report:

In [35]:
company_expenses = frames['Oscar']

In [36]:
company_expenses.head()

Unnamed: 0,Month,Item,Amount Paid
0,March,HammerMill Paper,1000
1,March,Fuel - Warehouse,200
2,March,Equipment - Warehouse,500
3,March,Chrysler Seabring Lease,350
4,March,Georgia-Pacific Paper,750


### Writing the data to Excel

Fortunately, Oscar's sheet is pretty much in the format we want.  It's time to write everything to Excel.  We know from the previous lesson that we need to:
1. Define the workbook
2. Define the formats
3. Define the worksheet
4. Write the data
5. Save the file

Most of this will be familiar.  The key difference is, instead of writing one sheet, we're going to write three and we're going to be using some conditional formats.

#### Define the workbook:

In [37]:
output_file = os.path.join('..', 'data', 'accounting_summary.xlsx')
writer = pd.ExcelWriter(output_file, engine='xlsxwriter')
workbook = writer.book

#### Define the formats:
* Here we're going to use the same header format from the previous lesson.
* You can always look at the [xlsxwriter](https://xlsxwriter.readthedocs.io/format.html) documentation page for more references.
* Remember, for our header format we use:
1. Bold Font
2. Center Alignment
3. Top Vertical Alignment
4. A background and font color matching Excel's built-in Accent 1 style.

In [38]:
# Define the format for our header:
header_format = workbook.add_format({
    'bold': True, #Bold Font: This value must be either True or False
    'align': 'center', #Center Alignment
    'valign': 'top', #Top Alignment
    'fg_color': '#4472C1', #Cell Color
    'font_color': 'white', #Font Color
    'font_size': 12, #Font Size
})

In [39]:
# Define the format for our numbers:
money_format = workbook.add_format({'num_format': '$#,##0'})
number_format = workbook.add_format({'num_format': '#,##0'})

#### Conditional Format:
* Now let's define our conditional format.  We want any customer with a negative balance to show as Excel's "Light red fill with dark text".  
* To do that we'll define a background color (bg_color) of '#FFC7CE' which is the color Excel uses.
* We'll also define a font color (font_color) of '#9C0006'.

In [40]:
red_color = workbook.add_format({'bg_color': '#FFC7CE',
                            'font_color': '#9C0006'})

#### Write the Data to Excel:
* We have three DataFrames: 'customer_summary', 'expenses_summary', and 'company_expenses'
* We'll write each one to the workbook before saving the file.
* We'll also add the conditional formatting to customer_summary.

In [41]:
# Define the customer summary sheet and write the data to Excel
sheet = 'Customer Summary'
customer_summary.to_excel(writer, sheet_name=sheet, index=False)

In [42]:
# Define the worksheet
worksheet = writer.sheets[sheet]

In [43]:
# Write the headers to the worksheet
for col_num, value in enumerate(customer_summary.columns.values):
    worksheet.write(0, col_num, value, header_format)

In [44]:
# Set the numerical columns
worksheet.set_column('A:B', 14, None)
worksheet.set_column('C:E', 14, money_format)
worksheet.set_column('F:F', 14, number_format)

0

#### Assigning the conditional formats
* To assign a conditional format, we use the method outlined on the [conditional format](https://xlsxwriter.readthedocs.io/working_with_conditional_formats.html) page of Xlsxwriter.
* We need the first row, the first column, the last row, and last column where we want the formatting applied.
* Then we need to assign the criteria for the format.
* Lastly we state what format to apply to every cell in the defined range where the value is less than zero.

In [45]:
# Set the conditional formats
first_row = 1 # The first row where we have data
first_column = 4 # The balance column
last_row = len(customer_summary.index) # The length of the DataFrame
last_column = 4 # because we only want the formatting applied to the balance column
conditions = {
    'type': 'cell', # Because we want to apply the formatting to each individual cell
    'criteria': '<', # for less than
    'value': 0,
    'format': red_color,
}

worksheet.conditional_format(first_row,
                            first_column,
                            last_row,
                            last_column,
                            conditions,)
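
Row and column numbers in XlsxWriter start at zero, so with the header in row 0 the data occupies rows 1 through `len(data)`.  The sketch below shows that on its own, using made-up balances and an in-memory workbook (`buffer`) instead of our real file, so nothing is written to disk.

```python
import io
import xlsxwriter

# In-memory workbook so the example leaves no file behind
buffer = io.BytesIO()
workbook = xlsxwriter.Workbook(buffer, {'in_memory': True})
worksheet = workbook.add_worksheet('Demo')

red_color = workbook.add_format({'bg_color': '#FFC7CE',
                                 'font_color': '#9C0006'})

balances = [775, -50, 0, -300]            # made-up values
worksheet.write(0, 0, 'Balance')          # header in row 0, column 0
worksheet.write_column(1, 0, balances)    # data starts in row 1

first_row, last_row = 1, len(balances)    # rows 1..4 hold the data
worksheet.conditional_format(first_row, 0, last_row, 0, {
    'type': 'cell',
    'criteria': '<',
    'value': 0,
    'format': red_color,
})

workbook.close()
xlsx_bytes = buffer.getvalue()
```

Only the negative values should show the red format when the file is opened in Excel; zero and positive balances are left alone.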

### Writing the remaining sheets
* We have our first sheet completed.  Writing the remaining sheets should not be too difficult.
* Complete the exercises below to write the remaining sheets to the file and save the workbook.

#### Expenses Summary Exercise
* Sheet Name: `Expense Summary`
* DataFrame: `expenses_summary`
* Unformatted Columns (format `None`): `Month, Employee`
* Money Columns: `Amount Requested, Amount Paid`
* Complete the code below using the information provided.
* You can get the letters for the columns by looking at the .head() output earlier, with `Month` being Column A. 

In [48]:
# Define the sheet
sheet = 'Expense Summary'

# Write the Dataframe to Excel
expenses_summary.to_excel(writer, sheet_name=sheet, index=False)

# Define the worksheet
worksheet = writer.sheets[sheet]

# Write the headers to the worksheet
for col_num, value in enumerate(expenses_summary.columns.values):
    worksheet.write(0, col_num, value, header_format)

# Set the numerical columns
worksheet.set_column('A:B', 14, None)
worksheet.set_column('C:D', 14, money_format)

0

#### Company Expenses Exercise
* Sheet Name: `Company Expenses`
* DataFrame: `company_expenses`
* Unformatted Columns (format `None`): `Month, Item`
* Money Columns: `Amount Paid`
* Complete the code below using the information provided.

In [49]:
# Define the sheet
sheet = 'Company Expenses'
company_expenses.to_excel(writer, sheet_name=sheet, index=False)
# Define the worksheet
worksheet = writer.sheets[sheet]
# Write the headers to the worksheet
for col_num, value in enumerate(company_expenses.columns.values):
    worksheet.write(0, col_num, value, header_format)
# Set the numerical columns
worksheet.set_column('A:B', 14, None)
worksheet.set_column('C:C', 14, money_format)

0
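
You may have noticed that all three sheets repeat the same steps: write the DataFrame, grab the worksheet, restyle the headers, and set the column formats.  One way to cut down on the repetition is a small helper function.  The sketch below is only a suggestion: `write_formatted_sheet` is a hypothetical name, the data is made up, and it writes to an in-memory buffer rather than our real output file.

```python
import io
import pandas as pd

def write_formatted_sheet(writer, df, sheet_name, header_format, column_formats):
    """Hypothetical helper: write df to a sheet, restyle headers, format columns."""
    df.to_excel(writer, sheet_name=sheet_name, index=False)
    worksheet = writer.sheets[sheet_name]
    # Rewrite the header row using our header format
    for col_num, value in enumerate(df.columns.values):
        worksheet.write(0, col_num, value, header_format)
    # column_formats maps a column range such as 'A:B' to a format (or None)
    for col_range, cell_format in column_formats.items():
        worksheet.set_column(col_range, 14, cell_format)
    return worksheet

# Made-up data and an in-memory workbook for illustration
demo = pd.DataFrame({'Month': ['March'], 'Item': ['Paper'], 'Amount Paid': [1000]})
buffer = io.BytesIO()
writer = pd.ExcelWriter(buffer, engine='xlsxwriter')
workbook = writer.book
header_format = workbook.add_format({'bold': True, 'align': 'center'})
money_format = workbook.add_format({'num_format': '$#,##0'})

write_formatted_sheet(writer, demo, 'Company Expenses', header_format,
                      {'A:B': None, 'C:C': money_format})
writer.close()
```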

#### Save the File:

In [50]:
# close() saves the workbook; writer.save() was removed in pandas 2.0
writer.close()
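
Another way to make sure the workbook gets saved is to open the writer in a `with` block, which saves the file automatically when the block ends.  This is a minimal sketch with made-up DataFrames and an in-memory buffer standing in for a file path.  The check at the end relies on the fact that an .xlsx file is a zip archive.

```python
import io
import zipfile
import pandas as pd

# Made-up sheets for illustration only
sheets = {
    'Expense Summary': pd.DataFrame({'Employee': ['Darryl'], 'Amount Paid': [50]}),
    'Company Expenses': pd.DataFrame({'Item': ['Paper'], 'Amount Paid': [1000]}),
}

buffer = io.BytesIO()  # in-memory stand-in for a file path
with pd.ExcelWriter(buffer, engine='xlsxwriter') as writer:
    for sheet_name, df in sheets.items():
        df.to_excel(writer, sheet_name=sheet_name, index=False)
# Leaving the "with" block saves the workbook

# The sheet names are listed in xl/workbook.xml inside the archive
workbook_xml = zipfile.ZipFile(buffer).read('xl/workbook.xml').decode('utf-8')
print('Expense Summary' in workbook_xml, 'Company Expenses' in workbook_xml)
```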

In [51]:
expenses_summary.head()

Unnamed: 0,Month,Employee,Amount Requested,Amount Paid
0,March,Darryl,50,50
1,March,Michael,800,450
2,April,Andy,200,100
3,April,Dwight,100,100
4,April,Jim,100,100


In [52]:
company_expenses.head()

Unnamed: 0,Month,Item,Amount Paid
0,March,HammerMill Paper,1000
1,March,Fuel - Warehouse,200
2,March,Equipment - Warehouse,500
3,March,Chrysler Seabring Lease,350
4,March,Georgia-Pacific Paper,750
