# Intro to Data Analysis in Python: Analyzing Work Order Trends

#### This notebook explores basic data manipulation, visualization, and analysis methods in Jupyter Notebook using a small set of work orders.

This dataset contains work orders with the following characteristics:
- Are vendor work orders (ZZCRAFT equals VENDOR)
- Are located at St. Nicholas Houses (first three characters of LOCATION equal 038)
- Were created on or after Jan. 1, 2022 (ZZCREATEDATE > 2021-12-31, due to peculiarities of Oracle SQL date filtering)

The SQL query used to obtain these data is located in the "Data/" directory.

The following code imports Python packages used for data analysis and visualization; imports our work order data as a Pandas **DataFrame** object; ensures that key date columns are in the correct format; and creates three new columns. See comments below for details.

Throughout, abbreviations will be used to stand for generic objects, typically those within Pandas. They are:
- **df:** a generic DataFrame object
- **srs:** a generic Series (i.e., a column in a DataFrame) object
- **fig:** a generic figure in either Seaborn or PyPlot

In [None]:
#Necessary imports: NumPy, Pandas, and datetime for data manipulation; matplotlib.pyplot and Seaborn for visualization
#(Apart from datetime, these "packages" or "libraries" are outside of Python's so-called "standard library" of built-in
    #functionality and must be installed separately. All of them, datetime included, must be imported before use, as we do here.
    #For convenience, we give them short aliases such as "np", "pd", and "sns".)
import numpy as np
import pandas as pd
import datetime as dt
import matplotlib.pyplot as plt
import seaborn as sns

#We want to display all columns in the previews below, so we set the display.max_columns option of Pandas to None
pd.set_option('display.max_columns', None)

In [None]:
#Using Pandas' read_csv() function, read in work order data as a Pandas DataFrame
t = pd.read_csv('Data/St_Nick_Vendor_WOs.csv')

In [None]:
#Let's look at the first few rows of our data:
t.head(5)

In [None]:
#We can also see what data types Pandas has associated with each column:
t.dtypes

We see from the above that Pandas has imported most of our columns -- even our date columns -- as "objects," which, in this case, usually means text. Since we want to work with dates AS dates -- comparing them, ordering them, etc. -- we should convert the relevant columns to a more useful type.
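
As a small illustration of why the conversion matters, the sketch below uses a made-up three-row DataFrame (the `toy` name and its values are hypothetical, not drawn from the work order file, and the date format is an assumption). Dates stored as text sort character by character, while datetime values sort chronologically.

```python
import pandas as pd

# Hypothetical toy data: dates stored as text, in an assumed MM/DD/YYYY format
toy = pd.DataFrame({'ZZCREATEDATE': ['02/01/2022', '10/15/2022', '01/05/2023']})

# Sorting text compares character by character, so the 2023 date lands first
text_sorted = toy['ZZCREATEDATE'].sort_values().tolist()

# After conversion to datetime, the same values sort chronologically
toy['ZZCREATEDATE'] = pd.to_datetime(toy['ZZCREATEDATE'], format='%m/%d/%Y')
date_sorted = toy['ZZCREATEDATE'].sort_values().dt.strftime('%Y-%m-%d').tolist()

print(text_sorted)
print(date_sorted)
```

The first list puts January 2023 ahead of both 2022 dates; the second is in calendar order.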

In [None]:
#Convert our date columns to Pandas' datetime format, for later manipulation
for field in ['ZZCREATEDATE', 'STATUSDATE', 'PARENTSTATUSDATE']:
    t[field] = pd.to_datetime(t[field])

#Using the strftime() method available through Pandas' .dt accessor, create new fields in YYYYMM format
#representing the month in which work orders were created and in which they (and their parent work orders) last changed status.
#(Unlike calling x.strftime() inside apply(), .dt.strftime() leaves missing dates as missing rather than raising an error.)
t['CREATEMONTH'] = t['ZZCREATEDATE'].dt.strftime('%Y%m')
t['STATUSMONTH'] = t['STATUSDATE'].dt.strftime('%Y%m')
t['PARENTSTATUSMONTH'] = t['PARENTSTATUSDATE'].dt.strftime('%Y%m')

In [None]:
#List values of ZZSUBWORKTYPE associated with lead work for later use
lbp_swts = ['LBPXRF05','LBPDEFNOCU6','LBPASMAPT','LBPDEFCU6','LBPXRF','LBPASMCOM']

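#Create a new DataFrame of work orders where ZZSUBWORKTYPE is not in our list of lead-related values
#(isin() tests each value for membership in the list; ~ negates the result; copy() makes t_nolead independent of t)
t_nolead = t[~t['ZZSUBWORKTYPE'].isin(lbp_swts)].copy()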

### Viewing and manipulating data in a DataFrame

First, take a look at the first 10 entries in our "no lead" DataFrame using Pandas' `df.head(number)` function.

The last X rows of a DataFrame can be accessed similarly, using `df.tail(X)`.

In [None]:
#Take a look at the first 10 entries in our "no lead" DataFrame
t_nolead.head(10)

Next, let's take a look at how our work orders are distributed by FAILURECODE, using Pandas' `<series>.value_counts()` function. We access the **Series** object associated with a particular column name using the notation `df["Column Name"]`.

If you would like to see how many rows contain no value in a field, make sure to pass in `dropna = False`.
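
To make the effect of `dropna` concrete, here is a minimal sketch on a made-up Series (the values are hypothetical and are not taken from the work order data):

```python
import numpy as np
import pandas as pd

# Hypothetical failure codes, including two missing values
srs = pd.Series(['PAINT', 'LEAD', 'PAINT', np.nan, 'FLOOR', np.nan, 'PAINT'])

# Default behavior: rows with no value are left out of the tally
counts_default = srs.value_counts()

# With dropna=False, missing values get a row of their own
counts_with_missing = srs.value_counts(dropna=False)

print(counts_default)
print(counts_with_missing)
```

The two tallies differ only by the extra row for missing values, so comparing their totals is a quick way to see how many rows are blank.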

In [None]:
t['FAILURECODE'].value_counts(dropna=False)

In [None]:
t_nolead['FAILURECODE'].value_counts(dropna=False)

Let's now take a look at when these work orders were first created. We will do this using Pandas' *DataFrame*.groupby() method.

A couple of notes on groupby and its arguments:
- Specify what fields the table should be grouped on using the "by" argument (it is the first positional argument, so the label can be omitted). A DataFrame can be grouped on a single column with an expression such as `df.groupby('FAILURECODE')` or `df.groupby(by = ['FAILURECODE'])`. To group on multiple columns, add further column names to the list, e.g., `df.groupby(by = ['FAILURECODE','CREATEMONTH'])`. See the code below for examples.
- In most common uses, the **GroupBy** object that is created by *DataFrame*.groupby() should be *aggregated* using an *aggregator function*. A useful explainer can be found [here](https://sparkbyexamples.com/pandas/pandas-aggregate-functions-with-examples/).
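
The two notes above can be sketched on a tiny, made-up DataFrame. The column names follow this notebook, but the five rows are hypothetical:

```python
import pandas as pd

# Hypothetical work orders; only the column names come from the notebook
df = pd.DataFrame({
    'WONUM': [101, 102, 103, 104, 105],
    'FAILURECODE': ['PAINT', 'PAINT', 'LEAD', 'PAINT', 'LEAD'],
    'CREATEMONTH': ['202201', '202201', '202201', '202202', '202202'],
})

# Group on one column and aggregate with count(); both spellings are equivalent
by_code = df.groupby('FAILURECODE').count()
by_code_alt = df.groupby(by=['FAILURECODE']).count()

# Group on two columns; the result is indexed by (FAILURECODE, CREATEMONTH) pairs
by_code_month = df.groupby(by=['FAILURECODE', 'CREATEMONTH']).count()

print(by_code)
print(by_code_month)
```

Until an aggregator such as `count()` is called, `groupby()` only describes how the rows are split; the aggregator is what produces the summary table.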

In [None]:
#First, let's COUNT the number of work orders created in each month, and store it in a new DataFrame month_counts
month_counts = t.groupby('CREATEMONTH').count()

#Preview by simply calling the name of the object
month_counts

As we see, when we call an *aggregator function* such as `count()` on a groupby object, it is applied to ALL columns by default. We can address this in two ways: by grouping only a subset of our DataFrame, or by using the more flexible -- but slightly more complex -- `agg()` function, which lets us specify what aggregations will be performed on what columns.
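
For reference, the second approach might look like the sketch below, which uses Pandas' "named aggregation" form of `agg()`. The toy rows and the output column names `wo_count` and `first_created` are hypothetical, chosen only for illustration; the notebook itself proceeds with the simpler first approach.

```python
import pandas as pd

# Hypothetical work orders; only the column names come from the notebook
df = pd.DataFrame({
    'WONUM': [101, 102, 103, 104],
    'CREATEMONTH': ['202201', '202201', '202202', '202202'],
    'ZZCREATEDATE': pd.to_datetime(['2022-01-03', '2022-01-20', '2022-02-07', '2022-02-11']),
})

# Named aggregation: output_column=(input_column, aggregator)
summary = df.groupby('CREATEMONTH').agg(
    wo_count=('WONUM', 'count'),            # number of work orders per month
    first_created=('ZZCREATEDATE', 'min'),  # earliest creation date per month
)

print(summary)
```

Each keyword names one output column, so the result contains only the aggregations that were asked for.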

Let's try the first approach first:

In [None]:
#Create a new DataFrame from the subset of another DataFrame using bracket notation, with a list of fields at its core
t_prime = t[['CREATEMONTH','WONUM']]

#Now group and aggregate
month_counts = t_prime.groupby('CREATEMONTH').count()

month_counts

Better! Now, let's break this down further, by month *and* failure code:

In [None]:
month_fc_counts = t[['CREATEMONTH', 'FAILURECODE', 'WONUM']].groupby(by = ['CREATEMONTH','FAILURECODE']).count()
month_fc_counts

### Data preparation and visualization

That's a few too many failure codes to represent cleanly. Let's reduce the number of categories we're working with by preserving the most common failure codes and grouping the rest as "OTHER". We can put the new category that each work order falls under in a new column, FC_CATEGORY.

First, let's see what our most common failure codes are:

In [None]:
t['FAILURECODE'].value_counts(dropna=False)

It looks like PAINT, LEAD, and FLOOR work orders are most common. Let's write a function that will keep those categories and put all other work orders in the category of OTHER. We will apply this function directly to the series `t['FAILURECODE']`.

In [None]:
def recategorize_failurecodes(fc):
    #First, we test to see if the failure code fc is in our list of codes to keep:
    if fc in ['PAINT', 'LEAD', 'FLOOR']:
        #If it is, we return that value
        return fc
    
    #Otherwise, we know we can safely return OTHER
    return 'OTHER'

We now create a new column in t representing these categories. We do this using Pandas' `srs.apply(some_function)` method, on which helpful (if unofficial) documentation can be found [here](https://www.w3resource.com/pandas/dataframe/dataframe-apply.php).
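
As a rough sketch of how `srs.apply()` behaves, the example below runs the same recategorization logic on a made-up Series (the values are hypothetical). It also shows one possible vectorized alternative, `srs.where()` combined with `srs.isin()`, which is offered here only as an option and is not what the notebook uses.

```python
import pandas as pd

keep = ['PAINT', 'LEAD', 'FLOOR']

def recategorize_failurecodes(fc):
    # Same logic as the notebook's function: keep listed codes, relabel the rest
    if fc in keep:
        return fc
    return 'OTHER'

# Hypothetical failure codes, including one missing value
srs = pd.Series(['PAINT', 'PLUMB', 'LEAD', None, 'FLOOR'])

# apply() calls the function once per element and collects the results
via_apply = srs.apply(recategorize_failurecodes)

# Alternative: keep values where the condition holds, otherwise substitute 'OTHER'
via_where = srs.where(srs.isin(keep), 'OTHER')

print(via_apply.tolist())
print(via_where.tolist())
```

Both versions place missing values in OTHER, because a missing value is never found in the list of codes to keep.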

In [None]:
#apply() accepts the function itself, so no lambda wrapper is needed
t['FC_CATEGORY'] = t['FAILURECODE'].apply(recategorize_failurecodes)

Now, we can re-group our DataFrame by CREATEMONTH and FC_CATEGORY:

In [None]:
month_fc_category_counts = t[['CREATEMONTH', 'FC_CATEGORY', 'WONUM']].groupby(by = ['CREATEMONTH','FC_CATEGORY']).count()
month_fc_category_counts

Next, we will:
1. Reset the index on our DataFrame to put CREATEMONTH and FC_CATEGORY back into ordinary columns/series
2. Using our complete list of months and category codes, add all necessary empty rows to our DataFrame to ensure clean visualization
3. Sort our DataFrame on these two columns
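
Step 2 can be done in more than one way. The sketch below shows one option, building every month/category pair with `pd.MultiIndex.from_product()` and filling the gaps with `reindex()`. The toy counts are hypothetical, and this is an alternative to the explicit loop the notebook uses, not a description of it.

```python
import pandas as pd

# Hypothetical grouped counts; the ('202202', 'LEAD') combination is missing
counts = pd.DataFrame({
    'CREATEMONTH': ['202201', '202201', '202202'],
    'FC_CATEGORY': ['LEAD', 'PAINT', 'PAINT'],
    'WONUM': [4, 7, 3],
})

# Build every month/category pair, in sorted order
full_index = pd.MultiIndex.from_product(
    [sorted(counts['CREATEMONTH'].unique()), sorted(counts['FC_CATEGORY'].unique())],
    names=['CREATEMONTH', 'FC_CATEGORY'],
)

# Reindex against the full set of pairs; absent pairs receive a count of 0
filled = (
    counts.set_index(['CREATEMONTH', 'FC_CATEGORY'])
          .reindex(full_index, fill_value=0)
          .reset_index()
)

print(filled)
```

Because the index is built from sorted values, the result is already ordered by month and then category.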

In [None]:
month_fc_category_counts = month_fc_category_counts.reset_index()

#Safeguard against accidentally creating multiple 'index' columns:
if 'index' in month_fc_category_counts.columns:
    month_fc_category_counts.drop(columns = ['index'], inplace = True)
    
#Add 0-count rows for any month/category combination that is missing
month_list = t['CREATEMONTH'].unique()
category_list = t['FC_CATEGORY'].unique()

for month in month_list:
    for category in category_list:
        #Count the rows that already exist for this month-category combo
        num_items_found = month_fc_category_counts.query(f"(CREATEMONTH == '{month}') & (FC_CATEGORY == '{category}')").shape[0]
        
        #If the combo is missing, create a one-row DataFrame with a count of 0 and append it to month_fc_category_counts
        if num_items_found == 0:
            new_row = pd.DataFrame(data = {'CREATEMONTH': [month],
                                           'FC_CATEGORY': [category],
                                           'WONUM': [0]})
            month_fc_category_counts = pd.concat([month_fc_category_counts, new_row])
            
#Sort, then reset the index; drop = True discards the old index rather than adding it as an 'index' column
month_fc_category_counts.sort_values(['CREATEMONTH','FC_CATEGORY'], inplace = True)
month_fc_category_counts.reset_index(drop = True, inplace = True)

In [None]:
month_fc_category_counts

#### Visualization

Now, with our data reshaped and sorted, we're ready to visualize. Let's start by making a bar graph showing how many open work orders in our dataset were created in each month. We'll use **Seaborn** for the core work of graphing our data, and dip into **Matplotlib's PyPlot module**, on which Seaborn is built, only to tweak styling.

In [None]:
#Reset the index on month_counts, which we created above:
month_counts.reset_index(inplace = True)

#Create our bar chart using Seaborn's barplot method
month_bar_chart = sns.barplot(data = month_counts, x = 'CREATEMONTH', y = 'WONUM')

Our data display correctly, but our figure is unreadable, and the labels that do exist are unhelpful. Let's use matplotlib.pyplot (alias plt) functions to change this.

In [None]:
#Set our plot size
plt.figure(figsize = (12, 5))

month_bar_chart = sns.barplot(data = month_counts, x = 'CREATEMONTH', y = 'WONUM')

#Helper that turns a 'YYYYMM' label such as '202203' into 'MM/YYYY' ('03/2022')
def label_formatter(label):
    month = label[-2:]
    year = label[0:4]
    return f'{month}/{year}'

#Rotate the tick labels on our x axis by 90 degrees and apply the formatter
plt.xticks(rotation = 90, ticks = month_bar_chart.get_xticks(), 
           labels = [label_formatter(label.get_text()) for label in month_bar_chart.get_xticklabels()])

#Add axis labels and title
plt.xlabel('Month Created')
plt.ylabel('Total Work Orders')
plt.title('Open Work Orders by Month of Creation')

#Save to file
plt.savefig('_test.png')

Numerous other display options are available through both Seaborn and the underlying PyPlot interfaces. For details, please see:
- PyPlot's page of [examples](https://matplotlib.org/stable/gallery/index.html) and related documentation
- Seaborn's [example gallery](https://seaborn.pydata.org/examples/index.html) and related documentation
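
As one example of those other options, the same kind of styling can be applied through Matplotlib's `Axes` methods rather than the `plt` functions. The sketch below uses plain Matplotlib with made-up counts (the data and title are hypothetical); Seaborn's plotting functions return an `Axes` object that accepts the same calls.

```python
import matplotlib.pyplot as plt

# Hypothetical monthly counts, for illustration only
months = ['202201', '202202', '202203']
counts = [12, 7, 9]

# plt.subplots() returns the figure and its Axes, which we then style directly
fig, ax = plt.subplots(figsize=(6, 3))
ax.bar(months, counts)

ax.set_xlabel('Month Created')
ax.set_ylabel('Total Work Orders')
ax.set_title('Work Orders by Month (toy data)')
ax.tick_params(axis='x', labelrotation=90)  # rotate the x-tick labels

fig.tight_layout()
```

Working with the `Axes` object directly makes it explicit which plot is being styled, which becomes useful once a figure contains several subplots.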

#### Drawing Subplots with FacetGrid

Seaborn's `FacetGrid()` class makes it simple to draw multiple, similar plots for subsets of data -- for example, visualizing the aging of work orders in various categories of failure code. 

`FacetGrid()` organizes subplots into rows and columns, each of which can represent values of a different variable. (In our example here, we are using only rows, but an instance that uses both row and column functionality can be found in **INSERT REFERENCE TO NOTEBOOK HERE**.) Each of the constituent plots can be accessed through `<grid_object>.axes.flat`, which iterates over the grid's `axes` objects (i.e., subplots). These `axes` objects can be styled in the same manner as above.

In [None]:
#Create a FacetGrid() object, which provides the row/column skeleton and divides up the dataset based on the variables used
g = sns.FacetGrid(data=month_fc_category_counts, row='FC_CATEGORY', row_order = ['PAINT','LEAD','FLOOR','OTHER'], height=1.75, aspect=6)

#Call map() on the FacetGrid() to draw plots. Must provide a graphing function to map() -- here, we use sns.barplot
g.map(sns.barplot, 'CREATEMONTH', 'WONUM') #We don't need to specify the order of our x variable, because we've already sorted our data

#Format the x-tick labels for readability. The subplots share an x axis and only the bottom one displays
#tick labels, so we take the ticks and labels from that subplot alone (this keeps their counts equal)
bottom_ax = g.axes.flat[-1]
bottom_ax.set_xticks(bottom_ax.get_xticks())
bottom_ax.set_xticklabels([label_formatter(label.get_text()) for label in bottom_ax.get_xticklabels()],
                          rotation = 90)

#Set x- and y-axis labels and main title
g.set_ylabels(label='# WOs')
g.set_xlabels(label = 'Month Created')

g.fig.subplots_adjust(top=.9)
g.fig.suptitle("Open Work Orders by Date Created and Category")

#Set subplot titles by providing template
g.set_titles(template = '{row_name}')

#Show the plot
plt.show()

#Save to file -- note that we're using the FacetGrid's savefig() function, not plt's as above
g.savefig('_FacetGrid_test.png')