# Sport Vouchers Program Analysis

The goal of this Portfolio task is to explore data from the Federal Government Sport Vouchers program, which
provides up to two $100 vouchers for kids to participate in organised sport. Here's the [NSW Active Kids page](https://www.service.nsw.gov.au/transaction/apply-active-kids-voucher); there are similar schemes in other states, and this data is from South Australia.

This is an exercise in exploring data and communicating the insights you can gain from it. The source data comes
from the `data.gov.au` website and provides details of all Sport Vouchers that have been redeemed in SA since February 2015 as part of the Sport Voucher program: [Sports Vouchers Data](https://data.gov.au/dataset/ds-sa-14daba50-04ff-46c6-8468-9fa593b9f100/details). This download is provided for you as `sportsvouchersclaimed.csv`.

To augment this data you can also make use of [ABS SEIFA data by LGA](http://stat.data.abs.gov.au/Index.aspx?DataSetCode=ABS_SEIFA_LGA#), which shows a few measures of Socioeconomic Advantage and Disadvantage for every Local Government Area. This data is provided for you as `ABS_SEIFA_LGA.csv`. This could enable you to answer questions about whether the voucher program is used equally by parents in low, middle and high socioeconomic areas. You might be interested in this if you were concerned that this kind of program might just benefit parents who are already advantaged (they might already be paying for sport, so this program wouldn't be helping much).

Questions:
* Describe the distribution of vouchers by: LGA, Sport - which regions/sports stand out? 
* Are some sports more popular in different parts of the state?
* Are any electorates over/under represented in their use of vouchers?
* Is there a relationship between any of the SEIFA measures and voucher use in an LGA?

A challenge in this task is to display a useful summary of the data given that there are a large number of LGAs and sports involved.  Try to avoid long lists and large tables. Think about what plots and tables communicate the main points of your findings. 


In [1]:
from pykml.factory import KML_ElementMaker as KML #install
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.cm import viridis
from matplotlib.cm import inferno
from matplotlib.cm import magma
from matplotlib.colors import to_hex
import seaborn as sns
import gmaps #install
import gmaps.datasets 
import json #possible install
import ipywidgets as widgets

gmaps.configure(api_key='AIzaSyCZQuQJs0_G8amtKSvekD4jg-auBHa1yx8')
%matplotlib inline

In [2]:
with open('files/LGA_GDA2020.geojson') as f:
    lgaPos = json.load(f)

In [3]:
# read the sports vouchers data
sa_vouchers = pd.read_csv("files/sportsvouchersclaimed.csv")
#sa_vouchers.head()

The SEIFA data includes a row for each Local Government Area (LGA), but the names of the LGAs have a letter or letters in brackets after the name. To allow us to match this up with the voucher data, we remove the bracketed suffix and convert the name to uppercase.

For each LGA the data includes a number of measures all of which could be useful in your exploration.  

In [4]:
# read the SEIFA data, create an LGA column by removing the letters in brackets and converting to uppercase
seifa = pd.read_csv('files/ABS_SEIFA_LGA.csv')
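# raw string plus regex=True: recent pandas versions treat the pattern as literal text by default
lga = seifa["Local Government Areas - 2011"].str.replace(r' \([ACSRCDMT]+\)', '', regex=True).str.upper()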
seifa['LGA'] = lga
#seifa.head()
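
As a quick check of the name clean-up described above, the sketch below applies the same pattern to a few made-up LGA names (the names are illustrative and are not read from `ABS_SEIFA_LGA.csv`). It passes `regex=True` explicitly so the pattern is treated as a regular expression.

```python
import pandas as pd

# Illustrative names only; the real values come from the SEIFA file
names = pd.Series(["Adelaide (C)", "Barossa (DC)", "Roxby Downs (M)"])

# Remove the bracketed suffix, then upper-case to match the voucher data
cleaned = names.str.replace(r" \([ACSRCDMT]+\)", "", regex=True).str.upper()
print(cleaned.tolist())
```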



Since there are many rows per LGA we need to use `pivot_table` to create a new data frame with one row per LGA. Here
is an example of doing this to create a table with the different SCORE measures and the population (URP) field. 

In [5]:
LGA_scores = seifa[seifa.MEASURE == 'SCORE'].pivot_table(index="LGA", columns=["INDEX_TYPE"], values="Value")
LGA_scores.head()
LGA_pop = seifa[seifa.MEASURE == 'URP'].pivot_table(index="LGA", columns=["INDEX_TYPE"], values="Value")
LGA_scores['Population'] = LGA_pop.IEO
#LGA_scores.head()
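
To see what `pivot_table` does here, the following sketch reshapes a tiny made-up table with the same column names as the SEIFA data. The values and the LGA names `A` and `B` are invented for illustration.

```python
import pandas as pd

# Toy long-format table mimicking the SEIFA layout (values are made up)
long_df = pd.DataFrame({
    "LGA": ["A", "A", "B", "B"],
    "INDEX_TYPE": ["IEO", "IER", "IEO", "IER"],
    "MEASURE": ["SCORE"] * 4,
    "Value": [950, 980, 1020, 1010],
})

# One row per LGA, one column per index type
wide = long_df[long_df.MEASURE == "SCORE"].pivot_table(
    index="LGA", columns="INDEX_TYPE", values="Value"
)
print(wide)
```

Each LGA ends up as a single row, which is the shape needed for joining onto the voucher records.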

This data frame can then be joined with the vouchers data frame to create one master data frame containing both the voucher data and the SEIFA measures.

In [6]:
sa_vouchers_scores = sa_vouchers.join(LGA_scores, on='Participant_LGA')
#sa_vouchers_scores.head()
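
After a join like this it can be worth checking which LGA names failed to match, since unmatched rows silently receive `NaN` scores. A minimal sketch on made-up data (the frames `vouchers` and `scores` are stand-ins, not the real files):

```python
import pandas as pd

# Stand-in frames: three voucher rows, SEIFA scores for only two LGAs
vouchers = pd.DataFrame({"Participant_LGA": ["A", "B", "C"]})
scores = pd.DataFrame(
    {"IEO": [950.0, 1020.0]},
    index=pd.Index(["A", "B"], name="LGA"),
)

# Left join on the LGA name, as in the cell above
joined = vouchers.join(scores, on="Participant_LGA")

# LGAs present in the voucher data but missing from the scores table
unmatched = joined.loc[joined["IEO"].isna(), "Participant_LGA"].unique()
print(unmatched)
```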

* Describe the distribution of vouchers by: LGA, Sport - which regions/sports stand out?
    - LGA Map
    - Sports map
* Are some sports more popular in different parts of the state?
    - By sports map (as above)
* Are any electorates over/under represented in their use of vouchers?
    - LGA map + population map 
* Is there a relationship between any of the SEIFA measures and voucher use in an LGA?
    - SEIFA map + LGA map (maybe normalised)
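
For the last question, one simple numeric complement to the maps is a correlation between a per-LGA voucher rate and a SEIFA score. The sketch below uses invented numbers and a hypothetical summary column `vouchers_per_1000`; it only illustrates the calculation and says nothing about the real data.

```python
import pandas as pd

# Invented per-LGA summary; real values would come from sa_vouchers_scores
lga_summary = pd.DataFrame(
    {
        "vouchers_per_1000": [30.0, 45.0, 50.0, 65.0],
        "IEO": [900.0, 950.0, 1000.0, 1050.0],
    },
    index=["A", "B", "C", "D"],
)

# Pearson correlation between voucher rate and the IEO score
corr = lga_summary["vouchers_per_1000"].corr(lga_summary["IEO"])
print(round(corr, 3))
```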

# QUESTION 1 - Describe the distribution of vouchers

## LGA
There are a few things to consider when describing the distribution of vouchers by LGA:
1. There are a large number of LGAs, so displaying the data may be challenging 
2. The analysis of vouchers will have to take into account the differences in population

In order to solve these problems I will be producing a colour/heat map of the distribution of vouchers by LGA. 
I will first produce the base number of vouchers on a map along with a bar chart of the top/significant LGAs.
I will then produce a map of the voucher distribution taking population into account. 

Finally I will write an analysis of the distribution of vouchers by LGA. 
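
Taking population into account could be as simple as converting raw counts to vouchers per 1,000 residents. A minimal sketch, assuming one row per voucher with the LGA population repeated on each row (the numbers are made up):

```python
import pandas as pd

# Made-up voucher rows: LGA "A" has 3 vouchers, LGA "B" has 1
vouchers = pd.DataFrame({
    "Participant_LGA": ["A", "A", "A", "B"],
    "Population": [1000, 1000, 1000, 200],
})

counts = vouchers.groupby("Participant_LGA").size()
population = vouchers.groupby("Participant_LGA")["Population"].first()

# Vouchers per 1,000 residents
per_1000 = counts * 1000 / population
print(per_1000)
```

In this toy data the smaller LGA has fewer vouchers in total but a higher rate, which is the kind of difference the raw map would hide.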

### Cleaning the Data
In order to map the data into the gmaps plugin, I have brought in another file from https://data.sa.gov.au/data/dataset/local-government-areas that contains the positional data of each LGA.

This data names the LGAs differently so I first need to clean the positional data so that it can be mapped with the voucher data.

#### Normalising GPS Data to LGA Names

In [7]:
def changeString(inString):
    inString = inString.replace("CITY OF ", '')
    inString = inString.replace(" CITY COUNCIL", '')
    inString = inString.replace("THE CORPORATION OF THE ", '')
    inString = inString.replace("TOWN OF ", '')
    inString = inString.replace(" COUNCIL", '')
    inString = inString.replace("DC OF ", '')
    inString = inString.replace("THE DC OF ", '')
    inString = inString.replace("THE ", '')
    inString = inString.replace(" DISTRICT", '')
    inString = inString.replace("REGIONAL OF ", '')
    inString = inString.replace(" DC", '')
    inString = inString.replace("UIA ", '')
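    # NOTE: str.replace matches literal text, not a regular expression, so this line
    # only matches that exact text and leaves bracketed suffixes such as " (DC)" in place
    inString = inString.replace(' \([ACSRCDMT]+\)', '')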
    inString = inString.replace(" REGIONAL", '')
    inString = inString.replace("ORROROO CARRIETON", "ORROROO/CARRIETON")
    inString = inString.replace("COORONG", "THE COORONG")
    inString = inString.replace("TUMBY BAY ", "TUMBY BAY")
    inString = inString.replace("PASTORAL UNINCORPORATED AREA", "NO LGA")
    inString = inString.replace("MUNICIPAL OF ROXBY DOWNS", "ROXBY DOWNS")
    inString = inString.replace("TORRENS ISLAND", "WEST TORRENS")
    inString = inString.replace("RURAL MURRAY BRIDGE", "MURRAY BRIDGE")
    return inString
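
Two details of string replacement matter for a helper like this, and the sketch below illustrates both on made-up names. First, Python's `str.replace` treats its argument as literal text, so a regular expression needs `re.sub`. Second, calling `replace` on a whole pandas Series with plain strings matches complete values rather than substrings, so a per-element `map` is needed to clean each name.

```python
import re
import pandas as pd

name = "BAROSSA (DC)"
pattern = r" \([ACSRCDMT]+\)"

literal = name.replace(pattern, "")    # literal match: nothing is removed
with_regex = re.sub(pattern, "", name) # regex match: suffix is removed

series = pd.Series(["CITY OF ADELAIDE"])
whole_value = series.replace("CITY OF ", "")                # no value equals "CITY OF "
per_element = series.map(lambda v: v.replace("CITY OF ", ""))

print(literal, "|", with_regex)
print(whole_value.tolist(), per_element.tolist())
```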

#### Test for names that are in the positional data but not voucher data

In [8]:
# LGA names present in the voucher data, computed once rather than on every iteration
voucherLGAs = changeString(sa_vouchers_scores['Participant_LGA']).values
for feature in lgaPos['features']:
    lgaName = changeString(feature['properties']['lga'].upper())
    if lgaName not in voucherLGAs:
        print(lgaName)

ANANGU PITJANTJATJARA YANKUNYTJATJARA
RIVERLAND
MARALINGA TJARUTJA


### Displaying Raw Voucher Data
We will first create a map of the voucher data with the gmaps plugin.
We will then create a bar graph of the significant values to display next to the map

#### Creating a Map of the Data

There are two stages to mapping the LGA voucher data. 

The first is to create an array of colours that correspond to the relative number of vouchers for the LGA (i.e. low numbers of vouchers will get a darker colour and higher numbers will get a brighter colour)

The second is to map the location data onto the gmaps plugin along with the corresponding colour for the number of vouchers

##### Calculating Colours

In [9]:
def calculate_colour(toCount, category):    
    countMax = sa_vouchers_scores.groupby([category]).count().Participant_ID.max()
    countMin = sa_vouchers_scores.groupby([category]).count().Participant_ID.min()
    countRange = countMax - countMin
    normalisedCount = (toCount - countMin)/countRange
    mpl_color = magma(normalisedCount)
    gmaps_color = to_hex(mpl_color, keep_alpha = False)
    return gmaps_color
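
The idea behind `calculate_colour` is min-max scaling followed by a colormap lookup. The sketch below shows that idea in isolation; `count_to_hex` is a hypothetical helper that takes the minimum and maximum as arguments instead of recomputing them from the data frame on every call.

```python
from matplotlib.cm import magma
from matplotlib.colors import to_hex

def count_to_hex(count, count_min, count_max):
    """Min-max scale a count to [0, 1] and look it up in the magma colormap."""
    span = count_max - count_min
    # Guard against a zero range (all LGAs having the same count)
    scaled = 0.0 if span == 0 else (count - count_min) / span
    return to_hex(magma(scaled), keep_alpha=False)

low = count_to_hex(10, 10, 500)    # darkest end of the scale
high = count_to_hex(500, 10, 500)  # brightest end of the scale
print(low, high)
```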

The following loop:
1. Looks at each LGA in the positional data 
2. Finds the corresponding LGA in the voucher data (if it can't it will add a blank colour to the voucherColours list)
3. Passes the count of the LGA vouchers into the calculate_colour function 
4. Adds that colour to the voucherColours list

Additionally it:
1. Creates a new column in the sa_vouchers_scores dataframe for the colour of each LGA (for use later)

In [10]:
voucherColours = []
# LGA names present in the voucher data, computed once rather than on every iteration
voucherLGAs = changeString(sa_vouchers_scores['Participant_LGA']).values
for feature in lgaPos['features']:
    lgaName = changeString(feature['properties']['lga'].upper())
    if lgaName in voucherLGAs:
        isLGA = sa_vouchers_scores['Participant_LGA'] == lgaName
        lgaCount = isLGA.sum()  # number of vouchers redeemed in this LGA
        lgaColour = calculate_colour(lgaCount, 'Participant_LGA')
        voucherColours.append(lgaColour)
        # store the count (not the LGA name) and the colour for use later
        sa_vouchers_scores.loc[isLGA, "LGA_Count"] = lgaCount
        sa_vouchers_scores.loc[isLGA, "LGA_Colour"] = lgaColour
    else:
        # LGA has no voucher data: use a translucent black placeholder
        voucherColours.append((0,0,0,0.3))
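
The per-LGA counts used in the steps above can also be computed in one pass with `value_counts` and then looked up per map shape. This is a sketch on made-up data, offered as an alternative to filtering the data frame inside the loop; `shape_names` stands in for the names read from the GeoJSON file.

```python
import pandas as pd

# Made-up voucher rows
vouchers = pd.DataFrame({"Participant_LGA": ["A", "A", "B", "A", "C", "C"]})

# One count per LGA, computed once
counts = vouchers["Participant_LGA"].value_counts()

# Attach the count to every voucher row
vouchers["LGA_Count"] = vouchers["Participant_LGA"].map(counts)

# Look up each map shape; shapes without voucher data give None
shape_names = ["A", "B", "Z"]
shape_counts = [counts.get(name) for name in shape_names]
print(shape_counts)
```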

##### Creating the map

In [12]:
fig_layout = {
    'width': '400px',
    'height': '400px',
    'border': '1px solid black',
    'padding': '1px'
}
fig = gmaps.figure(center=(-34.892412, 136.715287), zoom_level = 6, layout=fig_layout)
geojson_layer = gmaps.geojson_layer(
    lgaPos,
   fill_color = voucherColours,  
   fill_opacity=0.8)

fig.add_layer(geojson_layer)


voucherMap = fig

#### Creating the bar chart of significant values
As the data displayed in the bar chart will need to be in order of voucher count, we first need to create a new colours array for the chart to use:

In [None]:
# count the vouchers in each LGA and broadcast that count back onto every row
sa_vouchers_scores["LGA_Count"] = sa_vouchers_scores.groupby("Participant_LGA")["Participant_LGA"].transform('count')

In [87]:
#lgaColours = sa_vouchers_scores.sort_values("LGA_Count").loc["Participant_LGA", "LGA_Colour"]
lgaColours = sa_vouchers_scores[["Participant_LGA", "LGA_Colour"]].sort_values("LGA_Colour")
lgaColours = lgaColours.drop_duplicates(subset=['Participant_LGA'])
lgaColours = lgaColours["LGA_Colour"].to_numpy()

In [68]:
sa_vouchersHigh = sa_vouchers_scores[sa_vouchers_scores.LGA_Count > sa_vouchers_scores.LGA_Count.mean() ].sort_values("LGA_Count")
firstColour = len(lgaColours) - len(sa_vouchersHigh.groupby("Participant_LGA"))

In [73]:
displayGraph = widgets.Output()
with displayGraph:
    # barplot returns the Axes; set the title on it, then render the figure
    ax = sns.barplot(x="LGA_Count", y="Participant_LGA", data=sa_vouchersHigh, palette=lgaColours[firstColour:])
    ax.set_title('LGAs above the average voucher count')
    plt.show()

In [86]:
box_layout = widgets.Layout(display='flex',
                flex_flow='column',
                align_items='center',
                width='100%')
title = widgets.HTML('<h3>Distribution Of Vouchers (raw)</h3>')
widgets.VBox([
    title,
    widgets.HBox([voucherMap, displayGraph], layout={'width': '100%'})
], layout = box_layout)


## SPORTS
Displaying the sport data is slightly different because sport is categorical rather than numerical.
We will follow the same steps as above, except that the map shows the most common sport in each LGA, with each sport given a colour that suits it. We will also need a key to show which colour represents which sport.
We will then display this map next to a graph of the most common sports.
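
Finding the most common sport in each LGA is a group-wise mode. A minimal sketch on made-up rows (the column names match the voucher data, the values do not):

```python
import pandas as pd

# Made-up voucher rows with no ties between sports
vouchers = pd.DataFrame({
    "Participant_LGA": ["A", "A", "A", "B", "B"],
    "Voucher_Sport": ["Netball", "Netball", "Swimming", "Basketball", "Basketball"],
})

# For each LGA, pick the sport with the highest voucher count
top_sport = vouchers.groupby("Participant_LGA")["Voucher_Sport"].agg(
    lambda sports: sports.value_counts().idxmax()
)
print(top_sport)
```

Using `value_counts().idxmax()` always returns a single sport per LGA, whereas `pd.Series.mode` can return several values when sports are tied.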

In [133]:
def calculate_sport_colour(sport):
    if sport[0] == "Australian Rules":
        return (218,80,80,0.8)
    if sport[0] == "Netball":
        return (240,249,47,0.8)
    if sport[0] == "Swimming":
        return (60,193,231,0.8)
    if sport[0] == "Basketball":
        return (231,168,60,0.8)
    if sport[0] == "Football (Soccer)":
        return (96,218,70,0.8)
    else:
        return (0,0,0,0.3)

In [134]:
i = 0
sportColours = []
while i < len(lgaPos['features']):
    if(changeString(lgaPos['features'][i]['properties']['lga'].upper()) in changeString(sa_vouchers_scores['Participant_LGA']).values):
         sportColours.append(calculate_sport_colour(sa_vouchers_scores[sa_vouchers_scores['Participant_LGA'] == 
         changeString(lgaPos['features'][i]['properties']['lga'].upper())].groupby(["Participant_LGA"])["Voucher_Sport"].agg(pd.Series.mode).values))
    else:
        sportColours.append((0,0,0,0.3))
    i = i + 1

In [135]:
fig_layout = {
    'width': '400px',
    'height': '400px',
    'border': '1px solid black',
    'padding': '1px'
}
fig = gmaps.figure(center=(-34.892412, 136.715287), zoom_level = 6, layout=fig_layout)
geojson_layer = gmaps.geojson_layer(
    lgaPos,
   fill_color = sportColours,  
   fill_opacity=0.8)

fig.add_layer(geojson_layer)


sportMap = fig


In [159]:
sa_vouchers_scores["Sport_Count"] = sa_vouchers_scores.groupby(["Participant_LGA", "Voucher_Sport"]).Voucher_Sport.transform('count')

In [184]:
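# sportsList is not defined earlier in the notebook; it is assumed here to be
# the five sports that are given their own colour on the sports map
sportsList = ["Australian Rules", "Netball", "Swimming", "Basketball", "Football (Soccer)"]
sa_sportHigh = sa_vouchers_scores[sa_vouchers_scores["Voucher_Sport"].isin(sportsList)]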

In [None]:
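# `order` should list each sport once; the full sorted column repeats a sport for every row
sportOrder = sa_sportHigh.groupby("Voucher_Sport")["Sport_Count"].mean().sort_values().index
sns.barplot(x="Sport_Count", y="Voucher_Sport", data=sa_sportHigh, order=sportOrder)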


In [15]:
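# Scale population by its own range; calculate_colour scales by row counts, which does not suit population values
popMin = sa_vouchers_scores['Population'].min()
popMax = sa_vouchers_scores['Population'].max()
voucherLGAs = changeString(sa_vouchers_scores['Participant_LGA']).values
popColours = []
for feature in lgaPos['features']:
    lgaName = changeString(feature['properties']['lga'].upper())
    if lgaName in voucherLGAs:
        lgaPop = sa_vouchers_scores.loc[sa_vouchers_scores['Participant_LGA'] == lgaName, 'Population'].mean()
        popColours.append(to_hex(magma((lgaPop - popMin) / (popMax - popMin)), keep_alpha=False))
    else:
        # LGA has no voucher data: use a translucent black placeholder
        popColours.append((0,0,0,0.3))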


In [None]:
fig_layout = {
    'width': '400px',
    'height': '400px',
    'border': '1px solid black',
    'padding': '1px'
}
fig = gmaps.figure(center=(-34.892412, 136.715287), zoom_level = 6, layout=fig_layout)
geojson_layer = gmaps.geojson_layer(
    lgaPos,
   fill_color = popColours,  
   fill_opacity=0.8)

fig.add_layer(geojson_layer)


popMap = fig

In [None]:
title = widgets.HTML('<h3>Voucher Count and Population by LGA</h3>')
widgets.VBox([
    title,
    widgets.HBox([voucherMap, popMap], layout={'width': '100%'})
])

In [None]:
sa_vouchersLow = sa_vouchers_scores[sa_vouchers_scores.LGA_Count < sa_vouchers_scores["LGA_Count"].mean()]
sns.set_palette(sns.color_palette(lgaColours[len(lgaColours)//2 : len(lgaColours)]))
sns.barplot(x= "LGA_Count", y="Participant_LGA", data=sa_vouchersLow.sort_values("LGA_Count"))

In [None]:
title = widgets.HTML('<h3>Voucher Concentration</h3>')
widgets.VBox([
    title,
    widgets.HBox([voucherMap, voucherChart], layout={'width': '100%'})
])

## South Australian IEO Map

In [None]:
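# calculate_colourVoucher is not defined in this notebook, so the IEO score is scaled by its own range here
ieoMin = sa_vouchers_scores['IEO'].min()
ieoMax = sa_vouchers_scores['IEO'].max()
voucherLGAs = changeString(sa_vouchers_scores['Participant_LGA']).values
colours = []
for feature in lgaPos['features']:
    lgaName = changeString(feature['properties']['lga'].upper())
    if lgaName in voucherLGAs:
        lgaIEO = sa_vouchers_scores.loc[sa_vouchers_scores['Participant_LGA'] == lgaName, 'IEO'].max()
        colours.append(to_hex(magma((lgaIEO - ieoMin) / (ieoMax - ieoMin)), keep_alpha=False))
    else:
        # LGA has no voucher data: use a translucent black placeholder
        colours.append((0,0,0,0.3))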

In [None]:
fig_layout = {
    'width': '400px',
    'height': '400px',
    'border': '1px solid black',
    'padding': '1px'
}
fig = gmaps.figure(center=(-34.892412, 136.715287), zoom_level = 6, layout=fig_layout)
geojson_layer = gmaps.geojson_layer(
    lgaPos,
   fill_color = colours,  
   fill_opacity=0.8)

fig.add_layer(geojson_layer)


fig

In [None]:
sa_vouchers_scores.groupby(["Participant_LGA"])["Voucher_Sport"]

## Challenge - Queensland

_Note: this is an extra task that you might take on to get a better grade for your portfolio.  You can get a good pass grade without doing this._ 

Queensland has a similar program called [Get Started](https://data.gov.au/dataset/ds-qld-3118838a-d425-48fa-bfc9-bc615ddae44e/details?q=get%20started%20vouchers) and we can retrieve data from their program in a similar format.  

The file [round1-redeemed_get_started_vouchers.csv](files/round1-redeemed_get_started_vouchers.csv) contains records of the vouchers issued in Queensland. The date of this data is not included but the program started in 2015 so it is probably from around then.  

The data includes the LGA of the individual but the name of the activity is slightly different.  To do a comparable analysis you would need to map the activity names onto those from South Australia. 
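
Mapping activity names onto the South Australian sport names could be done with a lookup dictionary. The sketch below is illustrative only: the column name `Activity`, the Queensland activity names and the mapping itself are all hypothetical and would need to be replaced after inspecting the real file.

```python
import pandas as pd

# Hypothetical Queensland rows; the real column and values may differ
qld = pd.DataFrame({
    "Activity": ["Football (Soccer)", "Australian Football", "Netball", "Quidditch"]
})

# Hypothetical mapping from Queensland activity names to SA Voucher_Sport names
activity_to_sa = {
    "Australian Football": "Australian Rules",
    "Football (Soccer)": "Football (Soccer)",
    "Netball": "Netball",
}

# Unmapped activities are grouped under "Other" so they are easy to review
qld["Voucher_Sport"] = qld["Activity"].map(activity_to_sa).fillna("Other")
print(qld)
```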

In [None]:
qld_vouchers = pd.read_csv('files/round1-redeemed_get_started_vouchers.csv')
qld_vouchers.head()

In [None]:
# Join the QLD data with the LGA data as before to get population and SEIFA data integrated
# raw string plus regex=True so the bracketed suffix is matched as a regular expression
qld_vouchers['LGA'] = qld_vouchers['Club Local Government Area Name'].str.replace(r' \([RC]+\)', '', regex=True).str.upper()
qld_vouchers_scores = qld_vouchers.join(LGA_scores, on='LGA')
qld_vouchers_scores.head()