# Analysis of NSW Food Authority's Name & Shame Register

The NSW Food Authority publishes lists of businesses that have breached, or are alleged to have breached, NSW food safety laws. Publishing the lists gives consumers more information when deciding where to eat or buy food. Individuals and businesses may either receive a penalty notice for their alleged offence or be prosecuted before a court.

In [None]:
#! pip install html-table-parser-python3

In [None]:
# Libraries
import utils

import pandas as pd
import numpy as np
import boto3
import os
import io
from dotenv import load_dotenv #for loading env variables

## 1. Get Existing Data from S3

The Food Authority's Name & Shame website only displays the last 12 months of data. Since starting this repository in June 2024, I have appended any new data to the bottom of a dataset stored in AWS S3. Step 1 of the overall process is therefore to retrieve that dataset.

#### Get access keys and create an AWS S3 client

In [None]:
# Load the environment variables from .env
load_dotenv()

#get aws keys from .env file
CLIENT_ID = os.environ.get("AWS_ACCESS_KEY_ID")
CLIENT_SECRET = os.environ.get("AWS_SECRET_ACCESS_KEY") 

#setup an s3 session for downloading and uploading file to s3
session = boto3.Session(aws_access_key_id=CLIENT_ID,
                        aws_secret_access_key=CLIENT_SECRET)
s3 = session.client('s3')

#### Declare AWS parameters and read into a dataframe

We will use `prev_df` (as in, *previous* dataframe) since this is the "old" data we are looking to update.

In [None]:
#declare the bucket and the object key the data will be read from
bucket_name='nsw-food-authority-name-and-shame'
object_key='dataset.csv'
version_id="" #keep blank unless specific version is required

if version_id == '':
    # Retrieve the current version of the object from S3
    obj = s3.get_object(Bucket=bucket_name, Key=object_key)
else:
    print("Downloading specific version: {}".format(version_id))
    obj = s3.get_object(Bucket=bucket_name, Key=object_key, VersionId = version_id)

# Read the object's content into a Pandas DataFrame
prev_df = pd.read_csv(io.BytesIO(obj['Body'].read()))
prev_df['notice_number'] = prev_df['notice_number'].astype(str) #convert to string for comparison
print("   Dataset has been downloaded. Shape: {}\n".format(prev_df.shape))
prev_df.head()
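
The cast to `str` matters because `pd.read_csv` infers an integer dtype for all-digit columns, and integers never compare equal to the text values scraped from the website. The sketch below shows the effect on made-up data; the notice numbers and the `trade_name` column are hypothetical and not taken from the real dataset.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for the object downloaded from S3.
csv_text = (
    "notice_number,trade_name\n"
    "3012345678,Example Cafe\n"
    "3098765432,Sample Bakery\n"
)

# Notice numbers scraped from a web page arrive as text.
scraped = {"3012345678"}

# Default parsing: the all-digit column is inferred as an integer dtype.
inferred = pd.read_csv(io.StringIO(csv_text))
matches_inferred = set(inferred["notice_number"]) & scraped

# Forcing the column to str at read time makes the comparison work.
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"notice_number": str})
matches_text = set(as_text["notice_number"]) & scraped

print(matches_inferred)
print(matches_text)
```

Passing `dtype={"notice_number": str}` to `pd.read_csv` is an alternative to calling `.astype(str)` afterwards; either approach should keep the later set comparison consistent.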

## 2. Get all notices that are currently on the Food Authority Website

The function `scrape_tables` takes a url (which we've defined as the Food Authority's Name and Shame Register) and iterates over each child page of the website.

The result is `notice_df`, a dataframe of all the notices across all pages of the parent url.

In [None]:
#the parent page we are going to scrape
url = "https://www.foodauthority.nsw.gov.au/offences/penalty-notices"

print("iterate over the pages of url:\n  {}\n".format(url))
#scrape each of the pages and get the table of notices
notice_df = utils.scrape_tables(url)
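
The internals of `utils.scrape_tables` are not shown in this notebook. As a rough illustration of the table-extraction step only, the sketch below parses an HTML table from a string using the standard library. The class `TableRowParser`, the function `parse_table`, and the sample HTML are all hypothetical; the real helper may work quite differently and also has to handle pagination.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def parse_table(html):
    """Return one dict per body row, keyed by the header row's cell text."""
    parser = TableRowParser()
    parser.feed(html)
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


# Made-up markup, not copied from the Food Authority website.
sample_html = """
<table>
  <tr><th>notice_number</th><th>trade_name</th></tr>
  <tr><td>3012345678</td><td>Example Cafe</td></tr>
  <tr><td>3098765432</td><td>Sample Bakery</td></tr>
</table>
"""

records = parse_table(sample_html)
print(records)
```

A list of dictionaries like `records` can be passed straight to `pd.DataFrame` to produce a frame shaped like `notice_df`.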

## 3. Compare the website to dataset

We will now compare the notices found in step 2 with the notices we already have from step 1.

An `old_notice_number` is one which, as of the last scrape, had not yet been removed from the website. We identify these by keeping only the rows that have a null `date_removed_from_website`.

A `current_notice_number` is any notice live on the website now. 

In [None]:
old_notice_numbers = prev_df[prev_df['date_removed_from_website'].isnull()]['notice_number'].tolist()

current_notice_numbers = notice_df['notice_number'].tolist()

#get the difference of the above to determine new and removed notices.
removed_notice_numbers = set(old_notice_numbers) - set(current_notice_numbers)
new_notice_numbers = set(current_notice_numbers) - set(old_notice_numbers)

print("{} notice_numbers removed".format(len(removed_notice_numbers)))
print("{} notice_numbers added".format(len(new_notice_numbers)))
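
To make the set arithmetic concrete, here is a small worked example with made-up notice numbers. The names `old_ids` and `current_ids` are placeholders for `old_notice_numbers` and `current_notice_numbers`.

```python
# Made-up notice numbers for illustration only.
old_ids = ["1001", "1002", "1003"]      # live on the website at the last scrape
current_ids = ["1002", "1003", "1004"]  # live on the website now

# In the old list but no longer on the website: removed since the last scrape.
removed_ids = set(old_ids) - set(current_ids)

# On the website now but not in the old list: newly published.
new_ids = set(current_ids) - set(old_ids)

# Present in both: nothing to do for these.
unchanged_ids = set(old_ids) & set(current_ids)

print(sorted(removed_ids), sorted(new_ids), sorted(unchanged_ids))
```

Because the comparison uses sets, duplicates in either list are ignored at this stage, which is why uniqueness is checked separately before new notices are processed.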

We now run a number of checks and act on the results.

For example:
 * if **a notice was removed**
     * update the `date_removed_from_website` field
 * if **a notice was added**
     * open up a particular page to get the finer details
     * then append to the dataset
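
The "notice was removed" branch can be sketched as follows. This is not the code behind `utils.handle_removed_notices`, which is not shown here; `mark_removed` is a hypothetical stand-in, and it assumes the removal date is stored as a plain date string.

```python
import pandas as pd


def mark_removed(df, removed_ids, removed_on):
    """Return a copy of df with the removal date filled in for removed notices.

    Only rows that are still marked as live (null removal date) are updated,
    so dates recorded by earlier runs are left untouched.
    """
    out = df.copy()
    is_removed = out["notice_number"].isin(removed_ids)
    still_live = out["date_removed_from_website"].isnull()
    out.loc[is_removed & still_live, "date_removed_from_website"] = removed_on
    return out


# Made-up data: both notices were live at the last scrape.
demo = pd.DataFrame(
    {
        "notice_number": ["1001", "1002"],
        "date_removed_from_website": [None, None],
    }
)

updated = mark_removed(demo, {"1001"}, "2024-07-01")
print(updated)
```

Returning a copy rather than modifying the input in place matches how the notebook reassigns `prev_df` to the helper's return value.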

In [None]:
#check if notice numbers were removed
if len(removed_notice_numbers)==0:
    print("   0 notice_numbers removed")
    
else:
    print("   {} notice_numbers removed".format(len(removed_notice_numbers)))
    prev_df = utils.handle_removed_notices(prev_df, removed_notice_numbers)

#check if any new notice numbers
if len(new_notice_numbers)==0:
    print("   0 new notice_numbers added")
    result = prev_df #since no new entries, the result is just the old dataframe. 
    
else:
    print("   {} new notice_numbers found".format(len(new_notice_numbers)))
    print(new_notice_numbers)
    
    #we'll only work with these
    notice_df = notice_df[notice_df['notice_number'].isin(new_notice_numbers)]
    
    #check that the new notice numbers are unique
    if not notice_df['notice_number'].is_unique:
        raise ValueError("Not all notice numbers are unique")
        
    else: #Get details per notice_number
        print("4. Get penalty info...")
        #empty list to collect each row as a dictionary
        penalties = []

        for notice_number in new_notice_numbers:
            print("   processing: {}".format(notice_number))

            # scrape the website
            record = utils.get_penalty_notice(notice_number)    
            penalties.append(record)
            
        print("Complete\n")

        penalties_df = pd.DataFrame(penalties)
        
        utils.cleanup_dataframe(penalties_df)
        notice_df = utils.join_dataframes(penalties_df, notice_df)
        notice_df = utils.add_timestamp(notice_df)

## 4. Finalise the dataset and push back to S3

In [None]:
# boolean flag which triggers an upload or not
need_to_upload=False

if len(new_notice_numbers)>0:
    need_to_upload=True
    # add the new notices to the previous dataframe
    result = pd.concat([prev_df, notice_df], ignore_index=True)

if len(removed_notice_numbers)>0:
    need_to_upload=True

#if it's been identified that the data needs to be uploaded
if need_to_upload:
    #overwrite dataset
    print("5. Begin Upload back to S3...")

    # Reuse the bucket name; uncomment the next line to write to a test object instead
    #object_key = 'test.csv'
    print("This will write {} to bucket:{}".format(object_key, bucket_name))
    
    result.sort_values(by='published_date', inplace=True)

    # Convert DataFrame to CSV string
    csv_buffer = io.StringIO()
    result.to_csv(csv_buffer, index=False)  # Set index=False if you don't want row numbers
    print("   Dataset Shape:{}".format(result.shape))

    # Upload to S3
    s3.put_object(
        Bucket=bucket_name, 
        Key=object_key, 
        Body=csv_buffer.getvalue()
    )

    print("object pushed to S3")

#There were no changes to data so no need to upload.
else:
    print("no changes to the website so dataset will not be updated at this time.")

result.head()
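
The in-memory CSV step can be checked locally without touching S3. The sketch below, on made-up data, sorts a frame, writes it to a `StringIO` buffer as the upload cell does, and reads the resulting text back. The `payload` string is what would be passed as `Body` to `s3.put_object`.

```python
import io

import pandas as pd

# Made-up rows, deliberately out of date order.
frame = pd.DataFrame(
    {
        "notice_number": ["1002", "1001"],
        "published_date": ["2024-06-02", "2024-06-01"],
    }
)
frame = frame.sort_values(by="published_date")

# Serialise to CSV text in memory; index=False leaves out the row labels.
buffer = io.StringIO()
frame.to_csv(buffer, index=False)
payload = buffer.getvalue()

# Read the text back, keeping notice numbers as strings.
restored = pd.read_csv(io.StringIO(payload), dtype={"notice_number": str})
print(restored)
```

Note that sorting a text column is lexicographic, so the ordering is only chronological if `published_date` is stored in a sortable format such as ISO `YYYY-MM-DD`; that format is an assumption in this example, not something confirmed by the dataset.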