<div style="background:#FFFFEE; color:#440404; padding:8px; border-radius: 4px; text-align: center; font-weight: 500;">IFN619 - Data Analytics for Strategic Decision Makers</div>

# IFN619 :: UA1 - Assignment 1 - Foundational techniques (35%)

**IMPORTANT:** Refer to the instructions in Canvas module [UA1 - Assignment 1]() *BEFORE* working on this assignment. Ensure that you are familiar with the marking rubric and understand how the grade for this assignment will be awarded according to the criteria.

1. Complete and run the code cell below to display your name and student number
2. Complete all questions in Part A (by week 6) - you can get assistance from your tutor with this part.
3. Ensure that your tutor has verified your understanding of your work in Part A (no later than week 7)
4. Complete a full analysis for Part B. Ensure that you use the techniques and libraries/packages that have been used in class.
5. Check that you have addressed all of the criteria in the assignment rubric
6. Clear all cells and re-run your entire notebook so that cells are sequentially numbered without any errors. **IMPORTANT: Not doing this step risks having your work marked as incomplete!**
7. Submit the clean notebook to Canvas **before** 11:55pm on the due date


In [1]:
# Complete the following cell with your details and run to produce your personalised header for this assignment

from IPython.display import HTML

first_name = "Vaishnav"
last_name = "Rai"
student_number = "N11484209"

personal_header = f"<h1>{first_name} {last_name} ({student_number})</h1>"
HTML(personal_header)

---

## Data for both Part A and Part B

This assignment uses data from the Queensland Government [Open Data Portal](https://www.data.qld.gov.au). Both parts will use data on [Queensland Wave Monitoring](https://www.qld.gov.au/environment/coasts-waterways/beach/monitoring). Part B will also use data on [Storm tide monitoring](https://www.qld.gov.au/environment/coasts-waterways/beach/storm). You should familiarise yourself with the information on these sites to understand the context for the data.

For this assignment, you will use the [Coastal Data System - Near real time wave data](https://www.data.qld.gov.au/dataset/coastal-data-system-near-real-time-wave-data) and for Part B you will add [Coastal Data System – Near real time storm tide data](https://www.data.qld.gov.au/dataset/coastal-data-system-near-real-time-storm-tide-data). Note that this data will change over the time period of the assignment.

---
## Part A

#### QUESTION: 
***What can we learn from the wave height data for South East Queensland, and how might this data be used strategically during a major weather event?***

> *IMPORTANT* For the following task, keep a record of the dates and times where you demonstrated your understanding with your tutor. These should be AFTER you have completed the questions, and BEFORE week 7. Record these below:

**Demo for tutor:** |Tuesday| |01/04/2025| |12:00| (Tutorial 8)

### [Q1] Read the data

The data for this analysis comes from the [Queensland Government's Coastal Data System](https://www.data.qld.gov.au/dataset/coastal-data-system-near-real-time-wave-data), which maintains wave monitoring buoys at various locations along the Queensland coast. I accessed this near real-time wave data on **March 28, 2025**, capturing a 7-day window of wave measurements. Understanding wave characteristics (height, period, direction) is crucial because they directly influence coastal processes and potential impacts.

When first examining the data, I found it contained 7,663 records with 15 columns, including:

* Site location and site number
* DateTime of measurement
* Geographic coordinates (Latitude/Longitude)
* Wave measurements (Hsig - significant wave height, Hmax - maximum wave height)
* Wave period metrics (Tp - peak period, Tz - zero crossing period)
* Direction data (wave direction, current direction)
* Water conditions (SST - sea surface temperature, current speed)

**Justification:** Loading necessary libraries: `pandas` for data manipulation and analysis, `plotly.express` and `plotly.graph_objects` for creating interactive visualizations, and `datetime` for handling date and time operations. Importing `make_subplots` allows for creating more complex figures, such as dual-axis plots used later in Part B.

In [2]:
import pandas as pd
import plotly.express as px
from datetime import datetime
# Import tools for dual-axis plots
from plotly.subplots import make_subplots
import plotly.graph_objects as go

**Justification:** Recording the date when the data was first accessed is crucial for reproducibility, especially when dealing with data sources that are updated over time. This timestamp will help readers understand the specific 7-day window that was analyzed in this notebook.

In [3]:
# [Q1] Read the data
# Make a note of the first access date
first_access_date = "2025-03-28"
print(f"First access date: {first_access_date}")

First access date: 2025-03-28


**Justification:** The data was initially loaded directly from the source URL. That code is now commented out in favour of loading the saved file `wave_data_2025-03-28.csv`, which keeps the analysis consistent and reproducible while the live data feed changes.

---

The wave data is read into a dataframe using `pd.read_csv`. The first row of the file held metadata rather than column names, which misaligned the columns and prevented them from being read correctly. Passing `skiprows=1`, as described in the [pandas `read_csv` documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html), skips that metadata row so the real header row is used for the column names.

In [4]:
# Open the CSV version of the file directly from the URL into a pandas dataframe
# url = "https://apps.des.qld.gov.au/data-sets/waves/wave-7dayopdata.csv"

In [5]:
# wave_df = pd.read_csv(url, skiprows=1)
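To illustrate the effect of `skiprows=1` without touching the live feed, here is a minimal sketch on an in-memory string. The metadata line and the trimmed set of columns are made up for the example and are not the feed's actual first row.

```python
import io

import pandas as pd

# Hypothetical miniature of the file: one metadata line, then the real header.
# The wording of the metadata line is an assumption for illustration only.
raw_text = (
    "Example metadata line describing the data set\n"
    "id,DateTime,Site,Hsig\n"
    "1,2025-03-21 00:00:00,Caloundra,1.226\n"
    "2,2025-03-21 00:30:00,Caloundra,1.145\n"
)

# skiprows=1 discards the first line, so the second line becomes the header.
demo_df = pd.read_csv(io.StringIO(raw_text), skiprows=1)

print(demo_df.columns.tolist())
print(demo_df.shape)
```

With the metadata line skipped, the four column names line up with the four values in each data row.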

**Justification:** Loading the wave data from a previously saved local CSV file (`wave_data_2025-03-28.csv`) rather than from the live feed. Because the live feed changes over time, this ensures the analysis uses a static dataset corresponding to the access date noted earlier (March 28, 2025), so the results are reproducible regardless of later changes in the source.

In [6]:
# Instead of loading from URL, load from our saved file
saved_wave_file = "wave_data_2025-03-28.csv"
wave_df = pd.read_csv(saved_wave_file)
print(f"Using saved wave data from: {saved_wave_file}")

Using saved wave data from: wave_data_2025-03-28.csv
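The "download once, then always reuse the saved file" idea can be sketched as a small helper. This is only an illustration: `load_snapshot` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for the real download so the example runs offline.

```python
import tempfile
from pathlib import Path

import pandas as pd

fetch_calls = []  # records how many times the stand-in download runs


def fake_fetch():
    """Stand-in for the live download (hypothetical); returns a tiny frame."""
    fetch_calls.append(1)
    return pd.DataFrame({"id": [1, 2], "Hsig": [1.226, 1.145]})


def load_snapshot(path, fetch):
    """Return the saved snapshot at `path`, calling `fetch()` only if it is missing."""
    path = Path(path)
    if not path.exists():
        fetch().to_csv(path, index=False)
    return pd.read_csv(path)


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "wave_snapshot.csv"
    first = load_snapshot(target, fake_fetch)   # file missing, so fetch runs
    second = load_snapshot(target, fake_fetch)  # file exists, so it is reused

print(f"fetch calls: {len(fetch_calls)}")
print(f"identical frames: {first.equals(second)}")
```

The second call never reaches the data source, which is the property that keeps later runs of the notebook tied to the same snapshot.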


**Justification:** Displaying the first few rows of the loaded wave dataframe using `.head()`. This allows for an initial visual inspection of the data's structure, column names (e.g., 'Hsig', 'Hmax', 'Tp'), and sample values, confirming that the data has been loaded correctly into the pandas DataFrame structure.

In [7]:
# Display the first few rows
print("First 5 rows of the data:")
wave_df.head()

First 5 rows of the data:


| | id | DateTime | Site | SiteNumber | Seconds | Latitude | Longitude | Hsig | Hmax | Tp | Tz | SST | Direction | Current Speed | Current Direction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2025-03-21 00:00:00 | Caloundra | 54 | 1742479200 | -26.84682 | 153.15536 | 1.226 | 1.76 | 9.09 | 5.195 | 25.65 | 92.8 | -99.9 | -99.9 |
| 1 | 2 | 2025-03-21 00:30:00 | Caloundra | 54 | 1742481000 | -26.84671 | 153.15536 | 1.145 | 2.09 | 8.33 | 5.263 | 25.6 | 95.6 | -99.9 | -99.9 |
| 2 | 3 | 2025-03-21 01:00:00 | Caloundra | 54 | 1742482800 | -26.84677 | 153.15536 | 1.191 | 2.0 | 9.09 | 5.063 | 25.55 | 95.6 | -99.9 | -99.9 |
| 3 | 4 | 2025-03-21 01:30:00 | Caloundra | 54 | 1742484600 | -26.84676 | 153.15534 | 1.103 | 1.97 | 9.09 | 5.333 | 25.55 | 85.8 | -99.9 | -99.9 |
| 4 | 5 | 2025-03-21 02:00:00 | Caloundra | 54 | 1742486400 | -26.84658 | 153.1554 | 1.144 | 1.86 | 7.69 | 5.333 | 25.55 | 94.2 | -99.9 | -99.9 |
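The `-99.9` values in 'Current Speed' and 'Current Direction' in the preview look like placeholders rather than real measurements. That is my assumption; the data source would need to confirm it. If it holds, a minimal sketch of converting them to missing values, using a few of the rows shown above, could look like this:

```python
import numpy as np
import pandas as pd

# A few values copied from the preview rows above.
sample = pd.DataFrame(
    {
        "Hsig": [1.226, 1.145, 1.191],
        "Current Speed": [-99.9, -99.9, -99.9],
        "Current Direction": [-99.9, -99.9, -99.9],
    }
)

# Assumption: -99.9 marks "no reading" rather than a measured value.
SENTINEL = -99.9

# Replace the placeholder with NaN so it is excluded from means, maxima, etc.
cleaned = sample.replace(SENTINEL, np.nan)

print(cleaned.isna().sum())
```

Leaving the placeholder in place would drag down any average or minimum computed on those columns, so this is worth checking before summarising them.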


**Justification:** Cleaning the column names of the wave dataframe by removing any leading or trailing whitespace using `.str.strip()`. This ensures consistency in column references and prevents potential errors that can arise from hidden whitespace when selecting or manipulating columns later in the analysis.

In [8]:
# Clean column names by stripping whitespace
wave_df.columns = wave_df.columns.str.strip()
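A small sketch of why the stripping matters, using a made-up frame whose headers carry stray spaces (the padded names are hypothetical, not taken from the wave file):

```python
import pandas as pd

# Hypothetical headers with leading/trailing spaces, as can happen in CSV files.
padded = pd.DataFrame({" Hsig": [1.226], "Tp ": [9.09], "Site": ["Caloundra"]})

# Before stripping, the clean name is not found because " Hsig" != "Hsig".
try:
    padded["Hsig"]
    lookup_failed = False
except KeyError:
    lookup_failed = True

# Strip whitespace from every column name, then the lookup works.
padded.columns = padded.columns.str.strip()

print(f"lookup failed before stripping: {lookup_failed}")
print(padded.columns.tolist())
```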

**Justification:** Checking the dimensions (number of rows and columns) using `.shape` and listing all column names using `.columns.tolist()`. This provides a quick overview of the dataset's size and verifies the variables available for analysis are as expected after loading and cleaning.

In [9]:
# Check the shape and columns
print(f"Data shape: {wave_df.shape}")
print(f"Columns: {wave_df.columns.tolist()}")

Data shape: (7663, 15)
Columns: ['id', 'DateTime', 'Site', 'SiteNumber', 'Seconds', 'Latitude', 'Longitude', 'Hsig', 'Hmax', 'Tp', 'Tz', 'SST', 'Direction', 'Current Speed', 'Current Direction']


**Justification:** Examining the data types (`dtypes`) of each column. This is essential to ensure that numerical columns (like 'Hsig', 'Tp') are recognized as numbers (float64/int64) and that text columns (like 'Site') are stored as 'object'. This step also shows that the 'DateTime' column is of type 'object' (string), indicating it needs conversion to a proper datetime format.

In [10]:
# Check data types to understand the structure
print("\nData types:")
wave_df.dtypes


Data types:


id                     int64
DateTime              object
Site                  object
SiteNumber            object
Seconds                int64
Latitude             float64
Longitude            float64
Hsig                 float64
Hmax                 float64
Tp                   float64
Tz                   float64
SST                  float64
Direction            float64
Current Speed        float64
Current Direction    float64
dtype: object

**Justification:** Converting the 'DateTime' column from its current 'object' type to a pandas datetime format using `pd.to_datetime`. Using `format='mixed'` tells pandas to infer the format of each value individually, which keeps the conversion robust if there are minor inconsistencies in the source data's date strings. Displaying `.head()` of the converted column confirms the successful transformation.

---

**Regarding Index Column:** In this analysis, I did not use 'DateTime' as my DataFrame index because its values are not unique: multiple sites record data at the same time. Certain pandas operations (like some forms of resampling or merging) behave unpredictably when the index is non-unique.

Instead, I kept 'DateTime' as a normal column and used the original 'id' to ensure each row remains uniquely identifiable. This approach simplifies comparisons across different sites and timestamps without forcing a strictly unique time index. If I need to do time-based queries or visualizations, I can still filter or group by the 'DateTime' column directly.
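A minimal sketch of this reasoning on a toy frame. The second site name and its wave heights are hypothetical values added only to show timestamps repeating across sites.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "id": [1, 2, 3, 4],
        "DateTime": pd.to_datetime(
            [
                "2025-03-21 00:00:00",
                "2025-03-21 00:30:00",
                "2025-03-21 00:00:00",
                "2025-03-21 00:30:00",
            ]
        ),
        # "Site B" and its Hsig values are hypothetical.
        "Site": ["Caloundra", "Caloundra", "Site B", "Site B"],
        "Hsig": [1.226, 1.145, 0.9, 1.0],
    }
)

# The same timestamp appears once per site, so DateTime alone is not unique.
print(f"DateTime unique: {toy['DateTime'].is_unique}")
print(f"id unique: {toy['id'].is_unique}")

# Time-based work still goes through the ordinary column, no index needed.
per_site_max = toy.groupby("Site")["Hsig"].max()
at_midnight = toy[toy["DateTime"] == "2025-03-21 00:00:00"]

print(per_site_max)
print(at_midnight)
```

Keeping 'DateTime' as a column means the filter and the group-by both work without first deciding how to handle the repeated timestamps.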

In [11]:
# Convert DateTime to proper datetime format
wave_df['DateTime'] = pd.to_datetime(wave_df['DateTime'], format='mixed')
print("\nDatetime conversion successful. Sample dates:")
wave_df['DateTime'].head()


Datetime conversion successful. Sample dates:


0   2025-03-21 00:00:00
1   2025-03-21 00:30:00
2   2025-03-21 01:00:00
3   2025-03-21 01:30:00
4   2025-03-21 02:00:00
Name: DateTime, dtype: datetime64[ns]
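As a point of comparison, if every date string were known to share the `YYYY-MM-DD HH:MM:SS` layout seen above, an explicit format could be passed instead of `format='mixed'`. This is a sketch of that alternative, not what the notebook uses; the `"not a date"` entry is a made-up bad value to show how `errors='coerce'` handles it.

```python
import pandas as pd

# Two strings in the layout shown above, plus one deliberately bad value.
raw_dates = pd.Series(["2025-03-21 00:00:00", "2025-03-21 00:30:00", "not a date"])

# An explicit format is strict; errors="coerce" turns failures into NaT
# instead of raising, so bad rows can be counted and inspected.
parsed = pd.to_datetime(raw_dates, format="%Y-%m-%d %H:%M:%S", errors="coerce")

print(parsed)
print(f"unparseable values: {parsed.isna().sum()}")
```

The trade-off is strictness: an explicit format flags anything unexpected, whereas `format='mixed'` is more forgiving of variations between rows.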

**Justification:** This commented-out code saves the current `wave_df` to a local CSV file, which is the file loaded in the cells that follow. This ensures reproducibility and consistency in the analysis.

In [12]:
# [Q2] Save the data
# Create a filename from the first access date
#filename = f"wave_data_{first_access_date}.csv"
# index=False avoids writing the row index as an extra unnamed column
#wave_df.to_csv(filename, index=False)
#print(f"\nData saved to {filename}")

### Read the data from a file

**Justification:** Reading the previously saved wave data CSV back into a new dataframe (`saved_df`). This serves as a check that the file saving process (as in the previous commented-out cell) worked correctly and that the data can be retrieved. For continuity, the main analysis proceeds using the primary dataframe (`wave_df`).

In [13]:
# Read the data back from file
saved_df = pd.read_csv("wave_data_2025-03-28.csv")
print("Data successfully read back from file.")

Data successfully read back from file.
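A read-back check can go one step further than confirming the file opens. The sketch below writes a tiny frame to a temporary file and compares it with what comes back; the frame and the temporary filename are illustrative only and do not touch `wave_data_2025-03-28.csv`.

```python
import tempfile
from pathlib import Path

import pandas as pd

original = pd.DataFrame(
    {
        "id": [1, 2],
        "DateTime": pd.to_datetime(["2025-03-21 00:00:00", "2025-03-21 00:30:00"]),
        "Site": ["Caloundra", "Caloundra"],
        "Hsig": [1.226, 1.145],
    }
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "wave_check.csv"  # illustrative temporary file
    original.to_csv(path, index=False)
    # CSV stores dates as text, so ask pandas to parse them again on load.
    restored = pd.read_csv(path, parse_dates=["DateTime"])

same_shape = restored.shape == original.shape
same_columns = restored.columns.tolist() == original.columns.tolist()
same_dates = bool((restored["DateTime"] == original["DateTime"]).all())
max_hsig_diff = (restored["Hsig"] - original["Hsig"]).abs().max()

print(same_shape, same_columns, same_dates, max_hsig_diff)
```

Comparing shape, column names and a couple of key columns catches the common problems, such as an extra index column or dates coming back as plain strings.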
