## Task 8: Notebook

Create a notebook called `weather.ipynb` at the root of your repository. In this notebook,
write a brief report explaining how you completed Tasks 1 to 7.
Provide short descriptions of the commands used in each task and explain their role in
completing the tasks.
***

This notebook contains the research documentation for the tasks developed throughout the
Computer Infrastructure lectures given by Mr. McLoughlin at ATU.

### Introduction

In this module, we’ll work through a series of tasks aimed at building foundational 
skills for handling data through the command line and automating processes. These tasks will cover 
creating a structured directory for data organization, understanding timestamps and their significance 
in tracking events, and formatting them effectively. We will also explore how to use APIs to download 
structured data and automate the process using scripts. By the end, you'll gain practical experience in 
data handling, automation, and organizing data for efficient management. 


![cloud_infrastructure](img/cloud_infrastructure_tasks.jpg)

### Task 1: Create Directory Structure 

***
 
All tasks in this project are carried out in [GitHub Codespaces](https://docs.github.com/en/codespaces/overview) within a [Linux Bash environment](https://www.javatpoint.com/linux-bash) as part of the Computer Infrastructure assessment. This setup provides a stable, practical platform for running and managing [commands](https://github.com/trinib/Linux-Bash-Commands) efficiently.  

A structured and organized directory layout is created using the command line. This involves setting up a `data` folder at the root of the repository, with two subdirectories: `timestamps` and `weather`. This foundational step ensures efficient organization of files and data for the project as it progresses.  

At this stage, essential Linux commands are utilized, such as `cd`, `.` and `..` for directory navigation, `ls` to list contents, and file viewing with [more](https://linuxhandbook.com/more-command/) and [cat](https://linuxhandbook.com/cat-command/). Building on these fundamentals, commands like [mkdir](https://linuxhandbook.com/mkdir-command/) for creating directories, [rmdir](https://linuxhandbook.com/rmdir-command/) for removing them, and file editing with tools like [nano](https://linuxhandbook.com/nano-editor-basics/) or [vi](https://www.javatpoint.com/vi-editor) are applied.  

This approach establishes a clean and systematic folder hierarchy, ensuring a structured workflow for subsequent tasks while leveraging the capabilities of GitHub Codespaces.
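
The directory layout described above can be created in one step. The following is a minimal sketch, assuming it is run from the root of the repository; the `-p` flag is an addition not mentioned in the text, used here so that parent directories are created as needed and no error is raised if they already exist.

```bash
# Create data/ with its two subdirectories in a single command.
# -p creates missing parent directories and is silent if they already exist.
mkdir -p data/timestamps data/weather

# List the resulting hierarchy recursively to confirm the layout.
ls -R data
```

Running the command a second time is harmless, which makes it safe to include in setup notes or scripts.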

### Task 2: Timestamps

***

##### Creating and Appending Files with Timestamps

This task focuses on creating and appending files, emphasizing the importance of timestamps for tracking events, especially when working across multiple time zones. The `date` command with the format `+"%Y%m%d_%H%M%S"` is used to generate precise timestamps. This introduces the fundamentals of handling [timestamps](https://www.lenovo.com/ie/en/glossary/timestamp/?orgRef=https%253A%252F%252Fwww.google.ie%252F&srsltid=AfmBOopf3SlJnpeNNvJiDISLPR7b58DYONILTyZ7IF1M6I3zOFVye3ND#:~:text=Learn%20More-,What%20is%20a%20timestamp%3F,deduplication%20systems%20can%20identify%20identical%20data%20chunks%20and%20store%20them%20only,-once%2C%20reducing%20storage), an essential yet often underappreciated skill.


Navigate to the `data/timestamps` directory and use the `date` command to log the current date and time into a file named `now.txt`. Use the append operator (`>>`) to add entries without overwriting existing data. Repeat this process ten times and verify the contents of `now.txt` using the `more` command.


It's crucial to differentiate between the append operator [>>](https://www.cyberciti.biz/faq/linux-append-text-to-end-of-file/) and the overwrite operator [>](https://unix.stackexchange.com/questions/171025/what-does-do-vs). While `>>` adds new data to the file, the single right angle bracket `>` overwrites the file, erasing all existing content. For example:


##### Appending a timestamp:

- `date +"%Y%m%d_%H%M%S" >> timestamps.txt`

This adds a new timestamp to `timestamps.txt` each time the command is run.

##### Overwriting a file:

- `date +"%Y%m%d_%H%M%S" > timestamps.txt`

Using the `>` operator will replace all existing entries in the file, leaving only the most recent timestamp.

To avoid accidental overwrites and potential loss of data, always use `>>` when appending timestamps. For example, running the append command several times and then viewing the file with `cat timestamps.txt` shows a growing list of timestamps, whereas `>` leaves a single timestamp and erases all previous data with no way to recover it.
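
The difference between the two operators can be seen side by side. This is a small sketch that works in a temporary directory so that no real data is touched; the names `demo_dir`, `appended.txt`, and `overwritten.txt` are hypothetical and chosen only for illustration.

```bash
# Work in a throwaway directory so existing files are not affected.
demo_dir="$(mktemp -d)"

# Run the same command three times with each operator.
for i in 1 2 3; do
    date +"%Y%m%d_%H%M%S" >> "$demo_dir/appended.txt"     # adds a line each time
    date +"%Y%m%d_%H%M%S" >  "$demo_dir/overwritten.txt"  # replaces the file each time
done

# Count the lines in each file: three after appending, one after overwriting.
echo "appended.txt lines:    $(wc -l < "$demo_dir/appended.txt")"
echo "overwritten.txt lines: $(wc -l < "$demo_dir/overwritten.txt")"
```

The same loop with ten iterations and `>>` reproduces the `now.txt` exercise described earlier.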




### Task 3: Formatting Timestamps

***


##### Capture System Memory Usage with Timestamped Files


```bash
# Capture system memory usage and write it to a new file with a timestamped name
free -h > `date +"%Y%m%d_%H%M%S.txt"`
```

##### Explanation

- The `free -h` command provides a summary of system memory usage in a human-readable format.

- The `date +"%Y%m%d_%H%M%S"` command generates a timestamp in the format `YYYYmmdd_HHMMSS`. 

  For example, `20241110_153045` represents **November 10, 2024, at 3:30:45 PM**.


- Backticks are used to execute the `date` command and embed its output into the file name dynamically.

- The `>` operator redirects the output of `free -h` into the timestamped file, ensuring a unique file for each execution.


##### Why Use Timestamps?

- [Formatted timestamps](https://www.gnu.org/software/coreutils/manual/html_node/Formatting-file-timestamps.html) in file names prevent files from being overwritten, especially in fast-paced processes where multiple files may be created within short timeframes.

- While milliseconds could be used for higher precision, second-level accuracy is sufficient for this task.

##### Additional Notes

- To explore more formatting options for the `date` command, use the `man date` manual. Exit the manual by pressing `q`.

- Example file name:

  - Running the command on **November 10, 2024, at 3:30:45 PM** will create a file named:

    ```
    20241110_153045.txt
    ```
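
As an alternative to backticks, Bash also supports the `$(...)` form of command substitution, which behaves the same way here and is easier to nest and quote. The sketch below is one possible variant, not the form used in the task: it stores the file name in a variable (`outfile`, a hypothetical name) and falls back to a short note if `free` is not available on the system.

```bash
# Build the timestamped file name first, using $(...) instead of backticks.
outfile="$(date +"%Y%m%d_%H%M%S").txt"

# Write the memory summary into the file; if `free` is missing, record that instead.
if command -v free >/dev/null 2>&1; then
    free -h > "$outfile"
else
    echo "free is not available on this system" > "$outfile"
fi

echo "Wrote $outfile"
```

Keeping the name in a variable means the same file can be referred to again later in a script without calling `date` a second time.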


### Task 4: Create Timestamped Files

***


##### Creating Files with Unique, Chronologically Organized Names

Timestamped files are a practical way to manage files in a structured and orderly manner. By generating filenames based on the current timestamp, you can ensure unique and sortable file names. This method is especially useful in environments where multiple files are created rapidly, as it prevents overwriting and maintains a logical sequence.

To create an empty file with a timestamped name, use the `touch` command combined with the `date` command. The `date` command dynamically generates the timestamp, which is embedded in the file name using backticks. This approach eliminates the need for redirection (`>>`) and ensures a clean, organized file creation process.

##### Command Example

```bash
# Create an empty file with a timestamped name
touch `date +"%Y%m%d_%H%M%S.txt"`
```

##### Explanation

- The `touch` command creates an empty file.

- The `date +"%Y%m%d_%H%M%S"` command formats the current date and time into a string, such as `20241219_151530` 
for **December 19, 2024, at 3:15:30 PM**.

- Backticks execute the `date` command and use its output as the file name, ensuring each file has a unique name based on the precise time of creation.

##### Benefits of Timestamped Files

1. **Chronological Organization**: Files are automatically sorted by time, making them easier to locate and manage.

2. **Prevents Overwriting**: Each file has a unique name, eliminating the risk of overwriting existing files.

3. **Streamlined File Management**: Automates the naming process, reducing manual intervention.

##### Additional Notes

- Avoid using a single redirection operator (`>`) in scenarios where preserving existing data is critical, as it will overwrite the file contents.

- For more date formatting options, refer to the `man date` manual (exit with `q`).

- Example file name:
  - Running the command on **December 19, 2024, at 3:15:30 PM** will create a file named:

    ```
    20241219_151530.txt
    ```
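
The chronological ordering benefit can be checked with a short experiment. This sketch creates three empty files in a temporary directory, deliberately out of order, using hypothetical example names in the `YYYYmmdd_HHMMSS` format; `demo_dir` is likewise an illustrative name.

```bash
# Work in a throwaway directory.
demo_dir="$(mktemp -d)"

# Create the files out of order (hypothetical example names).
touch "$demo_dir/20241219_151530.txt"
touch "$demo_dir/20241110_153045.txt"
touch "$demo_dir/20241201_090000.txt"

# ls sorts names alphabetically, which for this format is also oldest-first.
ls -1 "$demo_dir"
```

Because the year comes first and every field is zero-padded, alphabetical order and chronological order coincide.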


### Task 5: Download Today’s Weather Data

***


To programmatically retrieve the latest weather data for the Athenry station from Met Éireann, navigate to the `data/weather` directory and use the `wget` command. This command fetches structured weather data directly from Met Éireann’s API, located at:  

[https://prodapi.metweb.ie/observations/athenry/today](https://prodapi.metweb.ie/observations/athenry/today).


#### Command Example

```bash
wget -O weather.json https://prodapi.metweb.ie/observations/athenry/today
```

#### Key Details

- **Saving Output**: The `-O weather.json` option specifies the output file name, ensuring the data is saved directly as `weather.json` in the `data/weather` directory.

- **Programmatic Access**: This method provides structured data via the Metweb API, eliminating the need for manual web scraping, simplifying integration, and enhancing project workflows.


##### HTTP and Its Importance


The `wget` command uses the [HTTP](https://www.jmarshall.com/easy/http/) protocol to request and retrieve data over the internet. HTTP is the foundation of online data exchange, enabling efficient communication between clients (your command line) and servers (Met Éireann's API). Understanding HTTP is essential for managing API requests and responses effectively.


##### Installing `wget`

If `wget` is not already installed on your system, it can be added with the following command:
```bash
sudo apt update && sudo apt install wget -y
```

This ensures `wget` is available for retrieving data and other related tasks, further enhancing automation and efficiency in your workflows.
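
Before installing anything, it can be worth checking whether `wget` is already present. The following is a small sketch using the shell built-in `command -v`; the variable name `wget_status` is hypothetical.

```bash
# command -v prints the path of a command if it exists and fails otherwise.
if command -v wget >/dev/null 2>&1; then
    wget_status="installed"
    echo "wget found at: $(command -v wget)"
else
    wget_status="missing"
    echo "wget not found; install it with: sudo apt update && sudo apt install wget -y"
fi
```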



### Task 6: Timestamp the Weather Data

***


The [wget](https://www.gnu.org/software/wget/manual/wget.html) command, known as the non-interactive network downloader, is a powerful tool for retrieving files from the web. To enhance data organization, the `wget` command can be modified to save weather data with a timestamped filename. This practice ensures efficient tracking and retrieval of weather data based on when it was collected.


##### Using Met Éireann's API

Met Éireann provides structured weather information for the Athenry station via their API:  

[https://prodapi.metweb.ie/observations/athenry/today](https://prodapi.metweb.ie/observations/athenry/today).


##### Adding a Timestamp to the File

To save the downloaded data with a timestamped filename, the `wget` command can incorporate the `date` command. A working example of the enhanced command is:


```bash
wget -O `date +"%Y%m%d_%H%M%S_athenry.json"` https://prodapi.metweb.ie/observations/athenry/today
```

This command performs the following:


- **`-O` Option**: Specifies the output file name. The timestamp, generated dynamically by the `date` command, forms the start of the filename (e.g., `20241219_153045_athenry.json`). The option is documented in the GNU manual for `wget`: the downloaded content is written to the named file, replacing any existing file of that name.

- **Timestamped Files**: Including the timestamp in the filename helps maintain a chronological record of collected data, aiding in better organization and retrieval.


##### About [GNU](https://www.gnu.org/home.en.html) and the `-O` Option

The `wget` tool is part of the GNU Project, renowned for its robust set of free software utilities. The `-O` option is extensively covered in the GNU `wget` manual, emphasizing its ability to redirect the downloaded content to a specified file. Using this option, you can replace the default behavior of saving files with their original names.


##### Why Timestamping is Essential

Timestamped filenames ensure that multiple downloads of weather data do not overwrite one another. They provide a clear and systematic way to track when data was collected, which is especially useful for long-term projects involving time-sensitive information.

By using the above command, the project will seamlessly integrate structured and timestamped weather data, simplifying both organization and data analysis.
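
Once several timestamped downloads exist, the most recent one can be found from the file names alone. The sketch below illustrates this with empty placeholder files in a temporary directory standing in for `data/weather`; the file names and the variables `demo_dir` and `latest` are hypothetical.

```bash
# Temporary stand-in for data/weather, filled with placeholder files.
demo_dir="$(mktemp -d)"
touch "$demo_dir/20241218_090000_athenry.json" \
      "$demo_dir/20241219_153045_athenry.json" \
      "$demo_dir/20241219_080000_athenry.json"

# Sorting the names puts the newest timestamp last.
latest="$(ls "$demo_dir"/*_athenry.json | sort | tail -n 1)"

echo "Most recent download: $(basename "$latest")"
```

With real data, replacing `$demo_dir` with `data/weather` would select the latest file for analysis.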




### Task 7: Automate the Weather Data Retrieval with a Script

***

In this task, we take advantage of automation to streamline the weather data retrieval process. While we use `wget` in our script, it's worth mentioning its close alternative, `curl`, a widely used tool for transferring data with URLs. Developed by Daniel Stenberg and released under the name `curl` in 1998 (its predecessor, httpget, dates from 1996), `curl` has become a powerful and flexible tool for HTTP requests, capable of handling numerous protocols and use cases. Both `curl` and `wget` are widely used on Linux-based systems, offering similar functionality with slight differences in syntax and features.

In this instance, we focus on `wget`, which is particularly suited for non-interactive file downloads. This choice aligns well with our goal of automating weather data collection.



##### Automating with `weather.sh`

Manually running commands can be time-consuming and error-prone, especially for repetitive tasks. To simplify this process, we’ll create a bash script named `weather.sh` in the root of your repository. This script will automate the weather data download process from **Task 6**, saving the data directly into the `data/weather` directory with a timestamped filename.



##### Script Details

Below is the `weather.sh` script:

```bash
#! /bin/bash
# Run this script from the root of the repository.
date  # Prints the current date and time at the start of the script
echo  # Adds a blank line for clarity
# Save the download into data/weather with a timestamped file name
wget -O data/weather/`date +"%Y%m%d_%H%M%S_athenry.json"` https://prodapi.metweb.ie/observations/athenry/today
echo  # Adds another blank line
date  # Prints the date and time again to show when the script ends



##### Explanation of the Script

1. **`#! /bin/bash`**:
   - This shebang line specifies that the script should be executed using the Bash shell.

2. **`date`**:
   - Prints the current date and time. This serves as a log for when the script starts and ends, aiding in debugging and tracking.

3. **`echo`**:
   - Adds blank lines to improve the readability of the script's output.

4. **`wget`**:
   - Downloads the latest weather data from Met Éireann's API using the URL:  
     [https://prodapi.metweb.ie/observations/athenry/today](https://prodapi.metweb.ie/observations/athenry/today).
   - The `-O` option specifies the output file name, which includes a timestamp generated by the `date` command (`date +"%Y%m%d_%H%M%S_athenry.json"`). This ensures each file has a unique name, making it easy to organize and identify.

5. **Second `date`**:
   - Prints the date and time again after the download, providing a clear indication of the script's duration.



##### Making the Script Executable

To make the script runnable, grant it execution permissions using the following command:

```bash
chmod u+x ./weather.sh
```
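
The effect of `chmod u+x` can be observed with the `-x` file test, which is true only when the file is executable by the current user. This is a small sketch on a throwaway script; `demo_dir`, `script`, and `hello.sh` are hypothetical names.

```bash
# Create a throwaway script in a temporary directory.
demo_dir="$(mktemp -d)"
script="$demo_dir/hello.sh"
printf '#!/bin/bash\necho "hello"\n' > "$script"

# A newly created file normally has no execute permission.
if [ -x "$script" ]; then echo "before: executable"; else echo "before: not executable"; fi

# u+x adds execute permission for the file's owner only.
chmod u+x "$script"

if [ -x "$script" ]; then echo "after: executable"; else echo "after: not executable"; fi
ls -l "$script"
```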



##### Running the Script

Navigate to the root of your repository and execute the script with:

```bash
./weather.sh
```

The script will:
- Log the start time.
- Download weather data and save it to the `data/weather` directory with a timestamped filename.
- Log the end time.
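
A slightly more defensive variant of the download step is sketched below. It is an illustration under stated assumptions rather than the script used in this project: the function name `download_weather` and its optional directory argument are hypothetical, and the `-q` (quiet) flag is an addition. The function creates the target directory if needed, checks the exit status of `wget`, and removes the empty file that `-O` leaves behind when a download fails.

```bash
#!/bin/bash

# Hypothetical helper: download today's Athenry data into a directory
# (default: data/weather) using a timestamped file name.
download_weather() {
    local out_dir="${1:-data/weather}"
    local url="https://prodapi.metweb.ie/observations/athenry/today"
    local out_file
    out_file="${out_dir}/$(date +"%Y%m%d_%H%M%S_athenry.json")"

    # Make sure the target directory exists.
    mkdir -p "$out_dir"

    # wget returns a non-zero exit status if the download fails.
    if wget -q -O "$out_file" "$url"; then
        echo "Saved $out_file"
    else
        echo "Download failed" >&2
        rm -f "$out_file"   # -O creates the file even when the request fails
        return 1
    fi
}
```

Calling `download_weather` from the root of the repository saves into `data/weather`; passing another directory as the first argument saves there instead.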



##### Why `wget`?

While [curl](https://en.wikipedia.org/wiki/CURL#:~:text=curl%20was%20first%20released%20in,exchange%20rates%20for%20IRC%20users.) is often used for similar tasks and offers more customization options, `wget` is tailored for file downloads, making it particularly effective in batch operations and automated workflows. Both tools are reliable, but `wget` simplifies tasks like saving files directly and managing interruptions during downloads.



##### Benefits of Automation

Using the `weather.sh` script:
- Streamlines the data retrieval process.
- Ensures consistent and well-organized data storage with timestamped filenames.
- Reduces the need for manual intervention, making the workflow efficient and reliable.

By automating the process, this task not only saves time but also introduces essential scripting and automation skills, valuable for any developer working in a Linux-based environment.



## Task 9: Analysing Weather Data with Pandas

***

In the `weather.ipynb` notebook, utilize the Pandas `read_json()` function to load one of the weather data files previously downloaded using your script. Explore the structure of the dataset by examining and summarizing its contents. Based on the metadata provided by **data.gov.ie**, write a brief description of the dataset, detailing what it includes and its significance. This step will help you understand the data's context and ensure accurate interpretation for future analyses.


### Weather Analysis

##### Collecting the Data

The data is collected using the `wget` command, which saves the weather data with a filename based on the current timestamp. For example, the command:

```bash

date +"%Y%m%d_%H%M%S_athenry.json"

```

generates a filename where:

- `%Y` is replaced by the four-digit year (e.g., 2024),
- `%m` represents the two-digit month (e.g., 11 for November),
- `%d` is the two-digit day (e.g., 02), and
- `%H%M%S` corresponds to the current hour, minute, and second.

This ensures that each file is uniquely timestamped, allowing for efficient tracking and organization of data over time.
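
Because every field has a fixed width, the timestamp can also be read back out of a file name. The sketch below uses Bash substring expansion, `${variable:offset:length}`, on the file name that is loaded later in this notebook; the variable names are hypothetical.

```bash
# File name following the pattern described above.
filename="20241102_084302_athenry.json"

# Bash substring expansion: ${variable:offset:length}, with offsets starting at 0.
year="${filename:0:4}"
month="${filename:4:2}"
day="${filename:6:2}"
hour="${filename:9:2}"     # offset 8 is the underscore
minute="${filename:11:2}"
second="${filename:13:2}"

echo "Collected on ${year}-${month}-${day} at ${hour}:${minute}:${second}"
# prints "Collected on 2024-11-02 at 08:43:02"
```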

##### Analyzing the Data

Using the pandas library in Python, the weather data is loaded and analyzed. Below are the steps used to examine the data:

**1. Read the Data**

```python
import pandas as pd

# Load the data file into a pandas DataFrame.
df = pd.read_json('data/weather/20241102_084302_athenry.json')
```


**2. Preview the Data**

To get an overview of the dataset's structure, the first few rows are displayed using:

```python
# Display the first few rows of the dataset.
df.head()
```

|   | name | temperature | symbol | weatherDescription | text | windSpeed | windGust | cardinalWindDirection | windDirection | humidity | rainfall | pressure | dayName | date | reportTime |
|---|------|-------------|--------|--------------------|------|-----------|----------|-----------------------|---------------|----------|----------|----------|---------|------|------------|
| 0 | Athenry | 11 | 04n | Cloudy | "Cloudy" | 6 | - | S | 180 | 85 | 0 | 1029 | Saturday | 2024-02-11 | 00:00 |
| 1 | Athenry | 11 | 04n | Cloudy | "Cloudy" | 4 | - | S | 180 | 85 | 0 | 1029 | Saturday | 2024-02-11 | 01:00 |
| 2 | Athenry | 11 | 04n | Cloudy | "Cloudy" | 4 | - | SE | 135 | 87 | 0 | 1029 | Saturday | 2024-02-11 | 02:00 |
| 3 | Athenry | 11 | 04n | Cloudy | "Cloudy" | 6 | - | S | 180 | 87 | 0 | 1029 | Saturday | 2024-02-11 | 03:00 |
| 4 | Athenry | 11 | 04n | Cloudy | "Cloudy" | 9 | - | S | 180 | 86 | 0 | 1029 | Saturday | 2024-02-11 | 04:00 |


**3. Summarize the Data**

A statistical summary of the dataset, including count, mean, standard deviation, min, max, and quartiles for numeric columns, is generated using:

```python
# Summarize the dataset.
df.describe()
```

|       | temperature | windSpeed | windDirection | humidity | rainfall | pressure | date |
|-------|-------------|-----------|---------------|----------|----------|----------|------|
| count | 9.0 | 9.0 | 9.0 | 9.0 | 9.0 | 9.0 | 9 |
| mean  | 10.666667 | 5.777778 | 145.0 | 87.111111 | 0.0 | 1029.0 | 2024-02-11 00:00:00 |
| min   | 10.0 | 4.0 | 90.0 | 85.0 | 0.0 | 1029.0 | 2024-02-11 00:00:00 |
| 25%   | 10.0 | 4.0 | 135.0 | 85.0 | 0.0 | 1029.0 | 2024-02-11 00:00:00 |
| 50%   | 11.0 | 6.0 | 135.0 | 87.0 | 0.0 | 1029.0 | 2024-02-11 00:00:00 |
| 75%   | 11.0 | 6.0 | 180.0 | 87.0 | 0.0 | 1029.0 | 2024-02-11 00:00:00 |
| max   | 11.0 | 9.0 | 180.0 | 91.0 | 0.0 | 1029.0 | 2024-02-11 00:00:00 |
| std   | 0.5 | 1.641476 | 37.5 | 2.368778 | 0.0 | 0.0 | |


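Note that the `date` column reads 2024-02-11, while the file name (`20241102_084302_athenry.json`) and the `dayName` value (Saturday) both point to 2 November 2024. This suggests the day-first date in the source data was parsed as month-first, so the `date` column should be treated with caution.

The process ensures that the weather data is collected, stored, and analyzed systematically, providing valuable insights for further exploration or visualization.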


### End