# Data Cleansing Task

## 1. Importing and Inspecting

In [1032]:
# Import packages
import pandas as pd

In [1033]:
# Read in raw data
data_raw = pd.read_excel('Data/Input/messy_phol.xlsx')

data_raw.head()

Unnamed: 0,region,date,name,comment
0,ACT,2014-01-01,New Years Day,
1,ACT,2014-01-27,Australia Day Sub,
2,ACT,2014-03-10,Canberra Day -RD,
3,ACT,2014-04-18,Good Friday,
4,ACT,2014-04-19,Easter Saturday,


In [1034]:
data_raw.dtypes

region             object
date       datetime64[ns]
name               object
comment            object
dtype: object

We see that the `date` field is already in the required datetime type.

The list of regions below contains no entry for QLD, and it includes one missing value (`nan`).

In [1035]:
data_raw['region'].unique()

array(['ACT', 'NSW', 'NT', 'SA', 'TAS', nan, 'VIC', 'WA'], dtype=object)

The rows at index 470 and 662 contain missing values.

In [1036]:
print(data_raw[data_raw.isnull().any(axis=1)])

    region       date                    name comment
470    NaN 2019-04-25               ANZAC Day    None
662     WA        NaT  Queens Birthday WA -RD    None
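
A per-column count of missing values is a compact complement to the row listing above. The sketch below runs on a small hypothetical frame (`toy`, not the real spreadsheet), so the names and values are illustrative assumptions only.

```python
import pandas as pd

# Hypothetical stand-in for data_raw: one missing region, one missing date
toy = pd.DataFrame({
    'region': ['TAS', None, 'WA'],
    'date': pd.to_datetime(['2019-04-22', '2019-04-25', None]),
    'name': ['Easter Monday', 'ANZAC Day', 'Queens Birthday WA -RD'],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Index labels of the rows that contain at least one missing value
rows_with_missing = toy.index[toy.isna().any(axis=1)].tolist()
print(rows_with_missing)
```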


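The public holiday names are inconsistent: the same holiday appears with different capitalisation, abbreviations and suffixes (for example `Easter MON` and `Australia Day SUB`).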

In [1037]:
print(data_raw['name'].value_counts())

Christmas Day                                     63
Easter Monday                                     62
Good Friday                                       58
ANZAC Day                                         50
New Years Day                                     47
                                                  ..
Boxing Day / Proclamation Day (additional day)     1
Australia Day 26th                                 1
good Friday - type c                               1
Australia Day SUB                                  1
Easter MON                                         1
Name: name, Length: 72, dtype: int64


Next we look for fully duplicated rows.

In [1038]:
data_raw.loc[data_raw.duplicated(keep=False), :]

Unnamed: 0,region,date,name,comment
573,VIC,2019-12-25,Christmas Day,
774,VIC,2019-12-25,Christmas Day,


## 2. Data Cleansing

### 2.0 Duplicate Data

We drop the duplicated row, keeping its first occurrence.

In [1039]:
data_raw.drop_duplicates(inplace=True)
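
One detail worth knowing here: `drop_duplicates` keeps the index labels of the surviving rows, so later label-based lookups such as `data_raw.at[470, ...]` still point at the same rows. A minimal sketch on a hypothetical frame (`toy` and its values are made up for illustration):

```python
import pandas as pd

# Hypothetical frame with one exact duplicate (labels 573 and 774)
toy = pd.DataFrame(
    {'region': ['VIC', 'VIC', 'WA', 'VIC'],
     'name': ['Boxing Day', 'Christmas Day', 'Christmas Day', 'Christmas Day']},
    index=[572, 573, 650, 774],
)

# keep='first' is the default, so the later copy (label 774) is dropped
deduped = toy.drop_duplicates()

# The surviving rows keep their original labels; the index is not renumbered
print(deduped.index.tolist())
```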

### 2.1 Missing Data

There is data missing at index 470 and 662, and the data is roughly ordered by state and date. Therefore:
1. We inspect the data points around index 470 and 662
2. We can also inspect data with similar dates / names
3. We then fill in based on the context
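
Step 1 can be wrapped in a small helper. The function below, `show_context`, is a hypothetical name introduced for this sketch; it assumes the index labels are unique and returns the rows within a few positions of a given label.

```python
import pandas as pd

def show_context(df, label, window=2):
    """Return the rows within `window` positions of the row labelled `label`."""
    pos = df.index.get_loc(label)   # label -> integer position (unique index assumed)
    start = max(pos - window, 0)    # avoid a negative start near the top of the frame
    return df.iloc[start:pos + window + 1]

# Hypothetical excerpt resembling the rows around index 470
toy = pd.DataFrame(
    {'region': ['TAS', 'TAS', None, 'TAS', 'TAS'],
     'name': ['Good Friday', 'Easter Monday', 'ANZAC Day',
              'Queens Birthday exc WA -RD', 'Christmas Day']},
    index=[468, 469, 470, 471, 472],
)

context = show_context(toy, 470, window=1)
print(context)
```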

#### 2.1.1 Missing Region

Here we print out both the rows around the missing data, and other rows containing the same date.

In [1040]:
print(data_raw.iloc[465:475, :])

    region       date                           name comment
465    TAS 2019-01-01                  New Years Day    None
466    TAS 2019-01-28              Australia Day Sub    None
467    TAS 2019-03-11       Labour Day VIC - TAS -RD    None
468    TAS 2019-04-19                    Good Friday    None
469    TAS 2019-04-22                  Easter Monday    None
470    NaN 2019-04-25                      ANZAC Day    None
471    TAS 2019-06-10     Queens Birthday exc WA -RD    None
472    TAS 2019-12-25                  Christmas Day    None
473    TAS 2019-12-26  Boxing Day - Proclamation Day    None
474    TAS 2020-01-01                  New Years Day    None


In [1041]:
print(data_raw.loc[data_raw['date'] == '2019-04-25', :])

    region       date       name comment
74     ACT 2019-04-25  ANZAC Day    None
175    NSW 2019-04-25  ANZAC Day    None
273     NT 2019-04-25  ANZAC Day    None
382     SA 2019-04-25  ANZAC Day    None
470    NaN 2019-04-25  ANZAC Day    None
569    VIC 2019-04-25  ANZAC Day    None
649     WA 2019-04-25  ANZAC Day    None


We see that the missing data is most likely for TAS, so we replace it with 'TAS'.

In [1042]:
data_raw.at[470, 'region'] = 'TAS'
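
The manual fill relies on the rows being grouped by region. As a rough sketch of the same reasoning in code (an illustration on hypothetical data, not the method used in this notebook), a missing region can be filled only when the rows immediately before and after it agree:

```python
import pandas as pd

# Hypothetical excerpt: the middle row has no region
toy = pd.DataFrame(
    {'region': ['TAS', 'TAS', None, 'TAS', 'TAS']},
    index=[468, 469, 470, 471, 472],
)

# Region of the previous and of the next row, aligned to each row
prev_region = toy['region'].shift(1)
next_region = toy['region'].shift(-1)

# Fill only where the value is missing and both neighbours agree
fillable = toy['region'].isna() & prev_region.notna() & (prev_region == next_region)
filled = toy['region'].mask(fillable, prev_region)
print(filled.tolist())
```

If the neighbours disagree (for example at the boundary between two regions), the value is left missing and still needs a manual decision.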

Checking that the missing value has been replaced.

In [1043]:
print(data_raw.loc[470])

region                     TAS
date       2019-04-25 00:00:00
name                 ANZAC Day
comment                   None
Name: 470, dtype: object


#### 2.1.2 Missing Date

Here we print out both the rows around the missing data, and other rows containing the same holiday.

In [1044]:
print(data_raw.iloc[660:665, :])

print(data_raw[(data_raw['region']=='WA') & data_raw['name'].str.contains('Queens')])

    region       date                           name comment
660     WA 2020-04-27                  ANZAC Day Add    None
661     WA 2020-06-01          Western Australia Day    None
662     WA        NaT         Queens Birthday WA -RD    None
663     WA 2020-12-25                  Christmas Day    None
664     WA 2020-12-26  Boxing Day - Proclamation Day    None
    region       date                    name comment
597     WA 2014-09-29  Queens Birthday WA -RD    None
608     WA 2015-09-28  Queens Birthday WA -RD    None
619     WA 2016-09-26  Queens Birthday WA -RD    None
631     WA 2017-09-25  Queens Birthday WA -RD    None
641     WA 2018-09-24  Queens Birthday WA -RD    None
651     WA 2019-09-30  Queens Birthday WA -RD    None
662     WA        NaT  Queens Birthday WA -RD    None


We observe that the missing date is most likely 2020's Queen's Birthday for WA. A quick Google search shows the date as 28/09/2020.

In [1045]:
data_raw.at[662, 'date'] = '2020-09-28'

Checking that the missing value has been replaced.

In [1046]:
print(data_raw.loc[662])

region                         WA
date          2020-09-28 00:00:00
name       Queens Birthday WA -RD
comment                      None
Name: 662, dtype: object
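
A date found by hand can be sanity-checked against the pattern of earlier years. The sketch below is only a plausibility check under the assumption that the holiday keeps falling on the same weekday; it does not replace looking up the official date. The known dates are copied from the output above.

```python
import pandas as pd

# Dates of 'Queens Birthday WA -RD' for 2014 to 2019, taken from the output above
known_dates = pd.to_datetime([
    '2014-09-29', '2015-09-28', '2016-09-26',
    '2017-09-25', '2018-09-24', '2019-09-30',
])
candidate = pd.Timestamp('2020-09-28')

# Weekday names of all known occurrences
weekdays = set(known_dates.day_name())
print(weekdays)

# The candidate should share that weekday and sit between its neighbouring rows
same_weekday = candidate.day_name() in weekdays
in_order = pd.Timestamp('2020-06-01') < candidate < pd.Timestamp('2020-12-25')
print(same_weekday, in_order)
```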


### 2.2 Public Holiday Names Cleanup

Inspecting the full list of unique public holiday names, we can see that there are many suffixes that need to be removed.

In [1047]:
phol_list = data_raw['name'].tolist()

print(sorted(set(phol_list)))

['AFL Grand Final - Friday -RD', 'ANZ Day - additional day declared', 'ANZAC Day', 'ANZAC Day Add', 'Adelaide Cup -RD', 'Adelaide Cup Day', 'Anzac Day', 'Anzac Day (additional day)', 'Australia Day', 'Australia Day 26th', 'Australia Day ADD SA', 'Australia Day SUB', 'Australia Day Sub', 'Bank Holiday', 'Boxing Day', 'Boxing Day (additional day)', 'Boxing Day - Add', 'Boxing Day - Proclamation Day', 'Boxing Day - Proclamation Day Sub', 'Boxing Day / Proclamation Day', 'Boxing Day / Proclamation Day (additional day)', 'Boxing Day Add', 'Boxing Day Sub', 'Canberra Day', 'Canberra Day -RD', 'Christmas (additional day)', 'Christmas Day', 'Christmas Day Add', 'Christmas Eve', 'Christmas Eve -RD', 'Christmas Eve 7pm - midnight', 'Easter MON', 'Easter Monday', 'Easter Saturday', 'Easter Saturday - the Saturday following Good Friday', 'Easter Sunday', 'Easter Tuesday', 'Eight Hours Day', 'Family and Community Day', 'Friday before AFL Grand Final', 'Good Friday', 'Good Friday - type a', 'Good Fr

We attempt to clean up the public holiday names by removing the `(additional day)` and `-RD` markers and then dropping all text after the first occurrence of the substring 'day'.

In [1048]:
phol_list_clean = []

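for name in phol_list:
    # Drop the '(additional day)' marker and any whitespace it leaves behind
    name = name.replace('(additional day)', '').strip()
    # Remove the literal '-RD' suffix; str.rstrip(' -RD') would instead strip
    # any trailing run of the characters ' ', '-', 'R' and 'D'
    if name.endswith('-RD'):
        name = name[:-len('-RD')].rstrip()
    try:
        # Keep everything up to and including the first 'day'
        phol_list_clean.append(name[:name.lower().index('day') + 3])
    except ValueError:
        # No 'day' in the name, so keep it unchanged
        phol_list_clean.append(name)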

print(sorted(set(phol_list_clean)))

['AFL Grand Final - Friday', 'ANZ Day', 'ANZAC Day', 'Adelaide Cup', 'Adelaide Cup Day', 'Anzac Day', 'Australia Day', 'Bank Holiday', 'Boxing Day', 'Canberra Day', 'Christmas', 'Christmas Day', 'Christmas Eve', 'Christmas Eve 7pm - midnight', 'Easter MON', 'Easter Monday', 'Easter Saturday', 'Easter Sunday', 'Easter Tuesday', 'Eight Hours Day', 'Family and Community Day', 'Friday', 'Good Friday', 'Labour Day', 'May Day', 'Melbourne Cup', 'Melbourne Cup Day', 'NT Picnic Day', "New Year's Day", "New Year's Eve", "New Year's Eve 7pm - midnight", 'New Years Day', 'New Years Eve', 'Picnic Day', "Queen's Birthday", 'Queens Birthday', 'Reconciliation Day', 'Saturday', 'Western Australia Day', 'good Friday']
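
To see how the truncation rule behaves on individual names, it helps to put it in a function and try a few inputs. `truncate_after_day` is a hypothetical helper written for this sketch; it assumes the intent is to remove the literal `-RD` suffix, so it uses `endswith` rather than `rstrip`. The sample names come from the list above.

```python
def truncate_after_day(name):
    """Keep the text up to and including the first 'day' (case-insensitive)."""
    name = name.replace('(additional day)', '').strip()
    if name.endswith('-RD'):
        name = name[:-len('-RD')].rstrip()
    idx = name.lower().find('day')   # -1 when 'day' does not occur
    return name if idx == -1 else name[:idx + 3]

samples = ['Canberra Day -RD',
           'Boxing Day - Proclamation Day Sub',
           'Christmas (additional day)',
           'Adelaide Cup -RD',
           'Friday before AFL Grand Final']
cleaned = [truncate_after_day(s) for s in samples]
print(cleaned)
```

The last sample shows the limit of the rule: a name that starts with a weekday collapses to just that weekday, which is why some results need to be reviewed by hand.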


There are still several messy or inconsistent holiday names, so we pick them out to investigate separately.

In [1049]:
investigate_list = ['ANZ Day',
                    'Christmas Eve 7pm - midnight',
                    'Easter MON',
                    'Easter Tuesday',
                    'Eight Hours Day',
                    'Family and Community Day',
                    'Friday',
                    'Labour Day',
                    'May Day',
                    'Saturday']

for phol in investigate_list:
    print(phol, '-'*50)
    print(data_raw[data_raw['name'].str.startswith(phol)], '\n')


ANZ Day --------------------------------------------------
   region       date                               name comment
88    ACT 2020-04-27  ANZ Day - additional day declared    None 

Christmas Eve 7pm - midnight --------------------------------------------------
    region       date                          name comment
717     NT 2022-12-24  Christmas Eve 7pm - midnight    None
731     SA 2022-12-24  Christmas Eve 7pm - midnight    None 

Easter MON --------------------------------------------------
    region       date        name comment
773    NSW 2020-04-13  Easter MON    None 

Easter Tuesday --------------------------------------------------
    region       date            name comment
488    TAS 2021-04-06  Easter Tuesday    None
741    TAS 2022-04-19  Easter Tuesday    None 

Eight Hours Day --------------------------------------------------
    region       date             name comment
485    TAS 2021-03-08  Eight Hours Day    None
738    TAS 2022-03-14  Eight Hours

We then rename these individual cases to be in line with the general case.

In [1050]:
phol_replacement = {'ANZ Day': 'ANZAC Day',
                    'Adelaide Cup': 'Adelaide Cup Day',
                    'Anzac Day': 'ANZAC Day',
                    'Christmas Day': 'Christmas',
                    'Christmas Eve 7pm - midnight': 'Christmas Eve',
                    'Easter MON': 'Easter Monday',
                    'Friday': 'AFL Grand Final - Friday',
                    'Melbourne Cup Day': 'Melbourne Cup',
                    'NT Picnic Day': 'Picnic Day',
                    "New Year's Day": 'New Years Day',
                    "New Year's Eve": 'New Years Eve',
                    "New Year's Eve 7pm - midnight" : 'New Years Eve',
                    "Queen's Birthday": 'Queens Birthday',
                    'Saturday': 'Easter Saturday',
                    'good Friday': 'Good Friday'
                   }

Here we clean up and create the final list of public holiday names.

In [1051]:
# Apply the replacement table; names without an entry are kept unchanged
phol_list_final = [phol_replacement.get(phol, phol) for phol in phol_list_clean]

sorted(set(phol_list_final))

['AFL Grand Final - Friday',
 'ANZAC Day',
 'Adelaide Cup Day',
 'Australia Day',
 'Bank Holiday',
 'Boxing Day',
 'Canberra Day',
 'Christmas',
 'Christmas Eve',
 'Easter Monday',
 'Easter Saturday',
 'Easter Sunday',
 'Easter Tuesday',
 'Eight Hours Day',
 'Family and Community Day',
 'Good Friday',
 'Labour Day',
 'May Day',
 'Melbourne Cup',
 'New Years Day',
 'New Years Eve',
 'Picnic Day',
 'Queens Birthday',
 'Reconciliation Day',
 'Western Australia Day']
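
If the names are held in a pandas Series rather than a list, the same lookup can be written with `Series.map`. This is a minimal sketch with a shortened, illustrative copy of the replacement table; `replacement` and `names` are stand-in names, not variables from the notebook.

```python
import pandas as pd

# A few entries taken from the replacement table above
replacement = {'Anzac Day': 'ANZAC Day',
               'Easter MON': 'Easter Monday',
               'good Friday': 'Good Friday'}

names = pd.Series(['Anzac Day', 'Easter MON', 'Boxing Day', 'good Friday'])

# Look each name up in the table and fall back to the name itself when absent
final = names.map(lambda n: replacement.get(n, n))
print(final.tolist())
```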

Now that the holiday names are consistent, we attach them to the original data.

In [1052]:
data_raw['public_holiday_cleaned'] = phol_list_final

We select the relevant cleansed data columns and rename the columns accordingly.

In [1053]:
data_cleansed = data_raw.loc[:, ['region', 'date', 'public_holiday_cleaned']].copy(deep=True)

data_cleansed.columns = ['phol_location', 'public_holiday_date', 'public_holiday']

data_cleansed.head()

Unnamed: 0,phol_location,public_holiday_date,public_holiday
0,ACT,2014-01-01,New Years Day
1,ACT,2014-01-27,Australia Day
2,ACT,2014-03-10,Canberra Day
3,ACT,2014-04-18,Good Friday
4,ACT,2014-04-19,Easter Saturday


Here we take a subset of the data for the in-scope period, 1 January 2019 to 31 December 2021 inclusive.

In [1054]:
data_inscope = data_cleansed.loc[(data_cleansed['public_holiday_date'] >= '2019-01-01') & 
                                 (data_cleansed['public_holiday_date'] <= '2021-12-31'),
                                 :
                                 ].copy(deep=True)

data_inscope.head()

Unnamed: 0,phol_location,public_holiday_date,public_holiday
67,ACT,2019-01-01,New Years Day
68,ACT,2019-01-28,Australia Day
69,ACT,2019-03-11,Canberra Day
70,ACT,2019-04-19,Good Friday
71,ACT,2019-04-20,Easter Saturday
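
The two-sided comparison can also be expressed with `Series.between`, which includes both endpoints by default. A small sketch on hypothetical rows (`toy` is illustrative only):

```python
import pandas as pd

toy = pd.DataFrame({
    'public_holiday_date': pd.to_datetime(
        ['2018-12-26', '2019-01-01', '2021-12-31', '2022-01-01']),
    'public_holiday': ['Boxing Day', 'New Years Day', 'New Years Eve', 'New Years Day'],
})

# Keep rows dated from 2019-01-01 to 2021-12-31, both endpoints included
in_scope = toy[toy['public_holiday_date'].between('2019-01-01', '2021-12-31')]
print(in_scope)
```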


### 2.3 Public Holiday Start Time

We create the start and end time columns by copying over the `public_holiday_date` column first.

We then create a copy of the smaller subset of the in-scope data, for those holidays that have a 7PM start time instead of 12AM start time.

In [1055]:
data_inscope['public_holiday_start'] = data_inscope['public_holiday_date']
data_inscope['public_holiday_end'] = data_inscope['public_holiday_date']

# Removing the time components so the format when output to CSV is correct
data_inscope['public_holiday_date'] = data_inscope['public_holiday_date'].dt.date

In [1056]:
partial_holidays = data_inscope.loc[(data_inscope['phol_location'].isin(['NT', 'SA'])) & 
                                    (data_inscope['public_holiday'].isin(['Christmas Eve', 'New Years Eve'])),
                                    :
                                    ].copy(deep=True)

We set the start time of these holidays to 19:00 (7 PM).

In [1057]:
partial_holidays.loc[:, 'public_holiday_start'] = partial_holidays.loc[:, 'public_holiday_start'].apply(
                                                      lambda x: x.replace(hour=19, minute=0)
                                                      )

print(partial_holidays)

    phol_location public_holiday_date public_holiday public_holiday_start  \
277            NT          2019-12-24  Christmas Eve  2019-12-24 19:00:00   
280            NT          2019-12-31  New Years Eve  2019-12-31 19:00:00   
290            NT          2020-12-24  Christmas Eve  2020-12-24 19:00:00   
293            NT          2020-12-31  New Years Eve  2020-12-31 19:00:00   
303            NT          2021-12-24  Christmas Eve  2021-12-24 19:00:00   
308            NT          2021-12-31  New Years Eve  2021-12-31 19:00:00   
385            SA          2019-12-24  Christmas Eve  2019-12-24 19:00:00   
388            SA          2019-12-31  New Years Eve  2019-12-31 19:00:00   
399            SA          2020-12-24  Christmas Eve  2020-12-24 19:00:00   
402            SA          2020-12-31  New Years Eve  2020-12-31 19:00:00   
413            SA          2021-12-24  Christmas Eve  2021-12-24 19:00:00   
418            SA          2021-12-31  New Years Eve  2021-12-31 19:00:00   

We update the original `data_inscope` table and check that the values have been successfully updated.

In [1058]:
data_inscope.update(partial_holidays)

data_inscope.loc[(data_inscope['phol_location'].isin(['NT', 'SA'])) & 
                 (data_inscope['public_holiday'].isin(['Christmas Eve', 'New Years Eve'])),
                 :
                 ]

Unnamed: 0,phol_location,public_holiday_date,public_holiday,public_holiday_start,public_holiday_end
277,NT,2019-12-24,Christmas Eve,2019-12-24 19:00:00,2019-12-24
280,NT,2019-12-31,New Years Eve,2019-12-31 19:00:00,2019-12-31
290,NT,2020-12-24,Christmas Eve,2020-12-24 19:00:00,2020-12-24
293,NT,2020-12-31,New Years Eve,2020-12-31 19:00:00,2020-12-31
303,NT,2021-12-24,Christmas Eve,2021-12-24 19:00:00,2021-12-24
308,NT,2021-12-31,New Years Eve,2021-12-31 19:00:00,2021-12-31
385,SA,2019-12-24,Christmas Eve,2019-12-24 19:00:00,2019-12-24
388,SA,2019-12-31,New Years Eve,2019-12-31 19:00:00,2019-12-31
399,SA,2020-12-24,Christmas Eve,2020-12-24 19:00:00,2020-12-24
402,SA,2020-12-31,New Years Eve,2020-12-31 19:00:00,2020-12-31
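
The copy-then-`update` round trip can alternatively be written as a single masked assignment. The following is a sketch on hypothetical rows, assuming the start column still holds midnight timestamps; it is an alternative formulation, not the code that produced the output above.

```python
import pandas as pd

toy = pd.DataFrame({
    'phol_location': ['NT', 'SA', 'NSW'],
    'public_holiday': ['Christmas Eve', 'New Years Eve', 'Christmas Eve'],
    'public_holiday_start': [pd.Timestamp('2020-12-24'),
                             pd.Timestamp('2020-12-31'),
                             pd.Timestamp('2020-12-24')],
})

# Rows whose holiday only starts in the evening
is_partial = (toy['phol_location'].isin(['NT', 'SA'])
              & toy['public_holiday'].isin(['Christmas Eve', 'New Years Eve']))

# Shift midnight to 19:00 for those rows only; other rows are untouched
toy.loc[is_partial, 'public_holiday_start'] += pd.Timedelta(hours=19)
print(toy)
```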


### 2.4 Public Holiday End Time

Here we set the end time of every public holiday to 23:59:59.

In [1059]:
data_inscope.loc[:, 'public_holiday_end'] = data_inscope.loc[:, 'public_holiday_end'].apply(
                                                lambda x: x.replace(hour=23, minute=59, second=59)
                                                )

In [1060]:
print(data_inscope.head())

   phol_location public_holiday_date   public_holiday public_holiday_start  \
67           ACT          2019-01-01    New Years Day           2019-01-01   
68           ACT          2019-01-28    Australia Day           2019-01-28   
69           ACT          2019-03-11     Canberra Day           2019-03-11   
70           ACT          2019-04-19      Good Friday           2019-04-19   
71           ACT          2019-04-20  Easter Saturday           2019-04-20   

    public_holiday_end  
67 2019-01-01 23:59:59  
68 2019-01-28 23:59:59  
69 2019-03-11 23:59:59  
70 2019-04-19 23:59:59  
71 2019-04-20 23:59:59  
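
A vectorised way to reach the same end-of-day value is to normalise each timestamp to midnight and add a fixed offset. This is a sketch on two hypothetical timestamps, one of which already carries a time component:

```python
import pandas as pd

dates = pd.Series([pd.Timestamp('2019-01-01'),
                   pd.Timestamp('2019-12-24 19:00:00')])

# Reset any time component to midnight, then move to the last second of that day
end_of_day = dates.dt.normalize() + pd.Timedelta(hours=23, minutes=59, seconds=59)
print(end_of_day)
```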


### 2.5 Final Check

Final check for missing data:

In [1061]:
data_inscope[data_inscope.isnull().any(axis=1)]

Unnamed: 0,phol_location,public_holiday_date,public_holiday,public_holiday_start,public_holiday_end


Final check for duplicates:

In [1062]:
data_inscope.loc[data_inscope.duplicated(keep=False), :]

Unnamed: 0,phol_location,public_holiday_date,public_holiday,public_holiday_start,public_holiday_end
185,NSW,2020-04-13,Easter Monday,2020-04-13,2020-04-13 23:59:59
773,NSW,2020-04-13,Easter Monday,2020-04-13,2020-04-13 23:59:59


The duplicate arose because 'Easter MON' was renamed to 'Easter Monday', making row 773 identical to row 185. We therefore drop the duplicate.

In [1063]:
data_inscope.drop_duplicates(inplace=True)
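
A full-row comparison only catches duplicates whose names were normalised to the same string. A stricter check, sketched below on hypothetical rows, compares on a key instead. It assumes a region has at most one entry per date, which may not hold for every dataset, so any clashes it reports should be reviewed rather than dropped automatically.

```python
import pandas as pd

toy = pd.DataFrame({
    'phol_location': ['NSW', 'NSW', 'VIC'],
    'public_holiday_date': [pd.Timestamp('2020-04-13')] * 3,
    # Hypothetical case: the second name escaped normalisation
    'public_holiday': ['Easter Monday', 'Easter Mon', 'Easter Monday'],
})

key = ['phol_location', 'public_holiday_date']

# Flag every row that shares its (location, date) pair with another row
clashes = toy[toy.duplicated(subset=key, keep=False)]
print(clashes)

# Keep the first row for each (location, date) pair
deduped = toy.drop_duplicates(subset=key)
```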

### 2.6 Saving to CSV

In [1064]:
data_inscope.to_csv('Data/Output/cleaned_phol.csv', index=False)
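
CSV stores timestamps as plain text, so a quick round trip is a useful check that the start and end columns survive. The sketch below writes a hypothetical one-row frame to an in-memory buffer instead of `Data/Output/cleaned_phol.csv` and reads it back with `parse_dates`.

```python
import io
import pandas as pd

toy = pd.DataFrame({
    'phol_location': ['NT'],
    'public_holiday_start': [pd.Timestamp('2020-12-24 19:00:00')],
    'public_holiday_end': [pd.Timestamp('2020-12-24 23:59:59')],
})

# Write to an in-memory buffer rather than a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates asks pandas to convert the two text columns back to datetimes
restored = pd.read_csv(buffer, parse_dates=['public_holiday_start', 'public_holiday_end'])
print(restored.dtypes)
```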