
# Exploring Datasets
### Correlations


In [0]:
# We need Pandas and NumPy for these exercises
import pandas as pd
import numpy as np

### ACT Government Open Data Portal
#### Notifiable Invoices Register

In [0]:
# Read the latest 1000 records from the Notifiable Invoices Register API
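# Parse the two date columns from strings into datetimes while reading
nir_act = pd.read_csv("https://www.data.act.gov.au/resource/kzmf-7uhp.csv", parse_dates=['date_invoice_received', 'payment_date'])
# Give the parsed date columns shorter names (the dict form of parse_dates is deprecated in recent pandas)
nir_act = nir_act.rename(columns={'date_invoice_received': 'invoice_received_dt', 'payment_date': 'payment_dt'})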

# And look at the columns
nir_act.head()

## Payment Delay
Let's calculate the number of days between the invoice being received and the payment being made. We will call this the payment_delay.

Is the delay correlated with the amount to be paid?

In [0]:
# How long does it take to pay an invoice - the difference between payment date and the date invoice received
nir_act['payment_delay'] = nir_act.payment_dt - nir_act.invoice_received_dt
# Turn that into a count of whole days (a number, not a timedelta)
nir_act['payment_delay_days'] = nir_act.payment_delay.dt.days
# and look at it
nir_act.payment_delay_days
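
As a small self-contained check of the day-count logic, the sketch below uses three made-up invoices (hypothetical data, not the register) that reuse the notebook's column names. It assumes the `.dt.days` accessor for whole days and `.dt.total_seconds()` for fractional days; the timestamps with a time of day are there only to show the difference between the two.

```python
import pandas as pd

# Hypothetical invoices; only the column names mirror the notebook
sample = pd.DataFrame({
    "invoice_received_dt": pd.to_datetime(
        ["2020-01-01 09:00", "2020-01-15 00:00", "2020-02-01 00:00"]),
    "payment_dt": pd.to_datetime(
        ["2020-01-03 21:00", "2020-01-20 00:00", "2020-03-02 00:00"]),
})

# Subtracting two datetime columns gives a timedelta column
sample["payment_delay"] = sample.payment_dt - sample.invoice_received_dt

# Whole days: the time-of-day remainder is dropped
delay_days = sample.payment_delay.dt.days.tolist()

# Fractional days: total seconds divided by the seconds in a day
delay_fractional = (sample.payment_delay.dt.total_seconds() / 86400).tolist()

print(delay_days)        # [2, 5, 30]
print(delay_fractional)  # [2.5, 5.0, 30.0]
```

If the register only records dates without times, the two results should agree.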


## Payment Amounts have a high variability.

Let's tame this by calculating the logarithm of the payment amount.

Perhaps we shall see a stronger correlation with the logarithm of the payment amount than with the raw payment amount.

In [0]:
# Create a column for the "natural logarithm" of the payment amount.
nir_act['payment_amount_ln'] = np.log(nir_act.payment_amount)
nir_act.payment_amount_ln
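
To see why the logarithm tames the spread, here is a minimal sketch on made-up amounts (hypothetical values, not taken from the register). It also assumes, without having checked the data, that zero or negative amounts could occur, and shows one way to set them aside before taking the log.

```python
import numpy as np
import pandas as pd

# Hypothetical payment amounts spanning several orders of magnitude,
# including a zero to show the edge case
amounts = pd.Series([0.0, 100.0, 1_000.0, 10_000.0, 1_000_000.0])

# np.log(0) is -inf and np.log of a negative number is NaN,
# so keep only strictly positive amounts before transforming
positive = amounts[amounts > 0]
log_amounts = np.log(positive)

raw_ratio = positive.max() / positive.min()        # largest / smallest
log_range = log_amounts.max() - log_amounts.min()  # equals ln(raw_ratio)

print(raw_ratio)            # 10000.0
print(round(log_range, 3))  # 9.21
```

A 10,000-fold spread in the raw amounts becomes a range of roughly 9 units on the log scale, which is much easier to read on a scatter plot.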

## Plot to see if there is a relationship
Plot payment_delay_days vs payment_amount_ln as a scatter plot.

Does it show any relationship?


In [0]:
# Let's plot these to see if it looks like there is a relationship
# we add a filter to exclude the longest payment delays: 'nir_act.payment_delay_days<200.0'
# is it valid to filter to see the relationship?
nir_act[nir_act.payment_delay_days<200.0].plot.scatter(x='payment_delay_days', y='payment_amount_ln', figsize=(8,8))

Let's adjust that filter....

In [0]:
# Plot again with a tighter filter: keep payment delays under 200 days
# and also keep only the larger payments with 'nir_act.payment_amount_ln > 9.5' (amounts above about 13,360)
# is it valid to filter to see the relationship?
nir_act[(nir_act.payment_delay_days<200.0) & (nir_act.payment_amount_ln> 9.5)].plot.scatter(x='payment_delay_days', y='payment_amount_ln', figsize=(8,8))
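
On the question of whether filtering is valid: restricting the range of a variable can change a correlation coefficient on its own. The sketch below illustrates this with synthetic data (randomly generated, with an assumed linear relationship); it says nothing about what happens in the register itself, where outliers could push the result either way.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Synthetic data with a genuine linear relationship plus noise
x = rng.normal(size=5000)
y = x + rng.normal(size=5000)
df = pd.DataFrame({"x": x, "y": y})

full_r = df["x"].corr(df["y"])  # Pearson by default

# Keep only a narrow band of x, analogous to filtering on delay or amount
narrow = df[df["x"].abs() < 0.5]
narrow_r = narrow["x"].corr(narrow["y"])

print(f"all rows:      r = {full_r:.2f}")
print(f"filtered rows: r = {narrow_r:.2f}")
```

In this synthetic case the filtered correlation comes out weaker than the unfiltered one, even though the underlying relationship is the same for every row.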

Save the filtered data and calculate the correlation using the Pearson method.

In [0]:
filtered_nir = nir_act[(nir_act.payment_delay_days<200.0) & (nir_act.payment_amount_ln> 9.5)]
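# Restrict to the numeric columns of interest; pandas 2.x no longer skips text and date columns silently
filtered_nir[['payment_delay_days', 'payment_amount', 'payment_amount_ln']].corr(method='pearson')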

And again using Spearman.

In [0]:
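# Same numeric columns, this time with the rank-based Spearman method
filtered_nir[['payment_delay_days', 'payment_amount', 'payment_amount_ln']].corr(method='spearman')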
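
Pearson measures how close the points are to a straight line, while Spearman works on ranks and so measures any consistently increasing or decreasing relationship. A minimal sketch on synthetic, noise-free data (an assumed exponential curve, not the register) makes the difference visible:

```python
import numpy as np
import pandas as pd

# Synthetic data: y always increases with x, but not along a straight line
x = np.arange(1, 21)
df = pd.DataFrame({"x": x, "y": np.exp(x / 2)})

pearson_r = df.corr(method="pearson").loc["x", "y"]
spearman_r = df.corr(method="spearman").loc["x", "y"]

print(f"Pearson:  {pearson_r:.2f}")   # below 1: the relationship is not linear
print(f"Spearman: {spearman_r:.2f}")  # 1.00: the ranks agree perfectly
```

A large gap between the two coefficients is a hint that a relationship exists but is curved or dominated by a few extreme values.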

## Interpretation

Is there any significant linear (Pearson) or monotonic (Spearman) correlation?

Does this make sense?


# Your Turn

What if we DID NOT FILTER?  How is the correlation affected by this?

Does it get stronger or weaker?


Hint: replace "filtered_nir" with "nir_act" in the correlation calculation.



In [0]:
# Your code here

Should we calculate a p-value for this correlation?

In [0]:
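# Is it worthwhile calculating a p-value?
# Note: a t-test (ttest_ind) compares the means of two samples; it does not test a correlation.
# pearsonr returns the correlation coefficient together with its p-value.
from scipy.stats import pearsonr

pearsonr(filtered_nir.payment_delay_days, filtered_nir.payment_amount_ln)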
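
A p-value for a correlation comes from a correlation test such as `scipy.stats.pearsonr`, whereas a two-sample t-test such as `ttest_ind` asks whether two means differ. The sketch below uses made-up arrays (deliberately extreme, not the register) to show that the two tests can give opposite answers on the same data.

```python
import numpy as np
from scipy.stats import pearsonr, ttest_ind

# Made-up arrays: z holds the same values as x in reverse order,
# so the two have identical means but move in opposite directions
x = np.arange(50.0)
z = x[::-1].copy()

# Two-sample t-test: are the means different?
t_stat, t_p = ttest_ind(x, z, equal_var=False)

# Correlation test: do the values move together?
r, r_p = pearsonr(x, z)

print(f"t-test:   p = {t_p:.3f}")               # means are identical
print(f"pearsonr: r = {r:.2f}, p = {r_p:.3g}")  # perfect negative correlation
```

Here the t-test finds no difference at all, while the correlation is as strong as it can be, so the t-test p-value tells us nothing about the relationship between the two columns.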