# Assignment Week 7 - Time Series Analysis and Data Preprocessing

When working with time series data, an accurate and complete dataset is essential, because the quality of machine learning models trained on the data depends directly on it.

Because the dataset plays such an important role in training time series machine learning models, e.g. for time series forecasting, this assignment focuses on analyzing and preparing a time series dataset.

A common way to work with time series data is the Python library `Pandas`, which offers many helpful methods for analyzing and preprocessing datasets and is therefore recommended for this assignment. Additionally, common Large Language Models are very good at assisting with writing `Pandas` code.

Alternatives such as `Polars` exist. They are faster, but in some cases less intuitive, and fewer resources about them are available on the internet.

## Task 1 - Downloading the Dataset

We will be using a subset of the [Jena Climate dataset](https://www.bgc-jena.mpg.de/wetter/), which contains measurements from 2009 to 2016 for weather factors such as temperature, humidity, and wind speed at a 10-minute resolution.

Download and unzip the dataset using the provided commands (or manually if your local machine can't execute these shell commands).

In [None]:
!wget https://storage.googleapis.com/tensorflow/tf-keras-datasets/jena_climate_2009_2016.csv.zip

In [None]:
!unzip jena_climate_2009_2016.csv.zip
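
If the shell commands are not available, the same two steps can be done from Python with the standard library. The following is a minimal sketch, not part of the original assignment; the helper name `download_and_extract` and its parameters are hypothetical.

```python
import urllib.request
import zipfile
from pathlib import Path


def download_and_extract(url, target_dir="."):
    """Download a zip archive from `url` and extract it into `target_dir`.

    Hypothetical helper for illustration; returns the extracted file paths.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)

    # Name the local archive after the last path component of the URL
    archive_path = target / url.rsplit("/", 1)[-1]
    urllib.request.urlretrieve(url, str(archive_path))

    # Extract all members of the archive next to it
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target)
        return [target / name for name in archive.namelist()]
```

Calling `download_and_extract(...)` with the URL from the `wget` cell above should leave the CSV file in the current directory, assuming the archive layout has not changed.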

## Task 2 - Data Analysis and Cleaning

1. Load the CSV file to a Pandas DataFrame.
2. Analyze aspects of the time series dataset using the `df.info()` and `df.describe()` methods. Can you identify missing values? Are there any outliers or incorrect values? What other interesting aspects can you identify?
3. If you identify incorrect values, find ways to correct them as well as possible.
4. Complete the given Python function to identify missing timestamps in the dataset.
5. Add potentially missing timestamps to the DataFrame (don't forget to sort by datetime!) and apply a suitable interpolation technique to fill the missing values for the added timestamps.
6. Choose 3 weather factor measurements and plot them. Analyze them for aspects like trend, seasonality and noise. (Hint: It might be useful to look at the years individually for every weather factor.)
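
To illustrate the idea behind steps 4 and 5, here is a minimal sketch on a toy series rather than the Jena data. The column name `value` and the timestamps are made up for illustration, and time-based interpolation is only one of several techniques that could be suitable.

```python
import pandas as pd

# Toy series on a 10-minute grid where 00:20 and 00:30 are missing
toy = pd.DataFrame(
    {
        "Date Time": pd.to_datetime(
            ["2020-01-01 00:00", "2020-01-01 00:10", "2020-01-01 00:40"]
        ),
        "value": [1.0, 2.0, 5.0],
    }
)

# Index by timestamp and sort, then reindex onto the complete 10-minute grid
toy = toy.set_index("Date Time").sort_index()
full_range = pd.date_range(toy.index.min(), toy.index.max(), freq="10min")
filled = toy.reindex(full_range)

# Rows added by reindex contain NaN; time-based interpolation fills them
filled["value"] = filled["value"].interpolate(method="time")
print(filled)
```

Note that `reindex` raises an error if the index contains duplicate timestamps, so duplicates would have to be handled first if the real dataset contains any.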


In [None]:
import pandas as pd

# 1. Load the CSV file to a Pandas DataFrame.
df = None # TODO: Read the CSV file to a DataFrame
df.head()

In [None]:
# 2. Analyze aspects of the time series dataset using the `df.info()` and `df.describe()` methods. Can you identify missing values? Are there any outliers or incorrect values? What other interesting aspects can you identify?

### YOUR CODE GOES HERE ###

In [None]:
# 3. If you identify incorrect values, find ways to correct them as well as possible.

### YOUR CODE GOES HERE ###

In [None]:
# Convert the timestamps to datetime (format="mixed" is required because not all values in "Date Time" follow the correct format)
# This might take some time because format="mixed" parses every value individually
df["Date Time"] = pd.to_datetime(df["Date Time"], format="mixed")
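
As a small illustration of what `format="mixed"` does, the sketch below parses two toy strings written in different layouts. The values are hypothetical and not taken from the dataset, and `format="mixed"` assumes pandas 2.0 or newer. Because each element is parsed on its own, ambiguous dates such as `02.01.2009` can be read as month-first; if the strings are day-first, passing `dayfirst=True` is one way to guard against that.

```python
import pandas as pd

# Toy strings in two different layouts (hypothetical values, not from the dataset)
raw = pd.Series(["13.01.2009 00:10:00", "2009-01-13 00:20:00"])

# format="mixed" infers the format for each element individually;
# dayfirst=True tells the parser to read ambiguous dates as day-first
parsed = pd.to_datetime(raw, format="mixed", dayfirst=True)
print(parsed)
```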

In [None]:
# 4. Complete the given Python function to identify missing timestamps in the dataset.

def find_missing_timestamps(df):
  ### YOUR CODE STARTS HERE ###

  # Define the first and last timestamp of the dataset
  start_time = None # TODO: Add the first timestamp of the time series
  end_time = None # TODO: Add the last timestamp of the time series

  # Generate the complete range of timestamps with a 10-minute frequency
  full_timestamp_range = pd.date_range(start=start_time, end=end_time, freq="REPLACE WITH A SUITABLE FREQUENCY VALUE") # TODO: Add frequency value

  # Find missing timestamps
  existing_timestamps = None # TODO: Get the existing timestamps from the dataset
  missing_timestamps = full_timestamp_range.difference(existing_timestamps)

  ### YOUR CODE ENDS HERE ###

  return missing_timestamps

missing_timestamps = find_missing_timestamps(df)
missing_timestamps

In [None]:
# 5. Add potentially missing timestamps to the DataFrame (don't forget to sort by datetime!) and apply a suitable interpolation technique to fill the missing values for the added timestamps.

### YOUR CODE GOES HERE ###

In [None]:
# 6. Choose 3 weather factor measurements and plot them. Analyze them for aspects like trend, seasonality and noise. (Hint: It might be useful to look at the years individually for every weather factor.)

### YOUR CODE GOES HERE ###

## Task 3 - Basic Feature Engineering

Depending on the use case, time series data often contains valuable information that can be derived from the existing dataset as new features, such as the day of the week, a weekend indicator, or the week of the year. To practice these feature engineering steps, perform the following tasks:

1. Using the `Date Time` column, create features for the day, month, year, hour and minute.
2. Based on these new features, create additional features such as a binary indicator of whether it is a weekend or not (alternatively, whether it is a weekday or not), the day of the week, and the week of the year.
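
As a rough illustration of the mechanism, the `.dt` accessor of a datetime column exposes these calendar components. The sketch below uses a toy frame with made-up timestamps and shows only a few of the features; treating Saturday and Sunday as the weekend is an assumption.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Date Time": pd.to_datetime(
            ["2016-12-30 23:50", "2016-12-31 00:00", "2017-01-01 00:10"]
        )
    }
)

# The .dt accessor exposes calendar components of a datetime column
toy["Hour"] = toy["Date Time"].dt.hour
toy["Day of the Week"] = toy["Date Time"].dt.dayofweek  # Monday=0 ... Sunday=6
# Assumption: Saturday (5) and Sunday (6) count as the weekend
toy["Weekend"] = (toy["Day of the Week"] >= 5).astype(int)
# ISO week number; days around New Year can belong to the neighboring year's week
toy["Week of the Year"] = toy["Date Time"].dt.isocalendar().week
print(toy)
```

The ISO week number of 1 January 2017 is 52, because that Sunday still belongs to the last ISO week of 2016. This is worth keeping in mind when combining the week of the year with the year feature.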

Some features can benefit from additional feature engineering. For cyclical features such as the day of the week, the month, or the week of the year, it helps to use a representation in which the value for "Monday" is as close to the value for "Sunday" as it is to the value for "Tuesday". A simple integer encoding like the following does not have this property:

- Monday: 0
- Tuesday: 1
- Wednesday: 2
- Thursday: 3
- Friday: 4
- Saturday: 5
- Sunday: 6

To address the cyclicity of such features, cyclical encoding with a sine/cosine transformation is one possible approach.

3. Research how cyclical encoding of cyclical features can be done and implement it for all features in the dataset where you find it suitable. Give a reason for choosing exactly the features you chose.

(One explanation can be found in this blog post: https://developer.nvidia.com/blog/three-approaches-to-encoding-time-information-as-features-for-ml-models/)
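
The idea can be sketched on the day-of-week example from above: each value is mapped to an angle on the unit circle and stored as its sine and cosine. This is a minimal illustration under assumptions, not a prescribed solution; the column names `dow_sin` and `dow_cos` and the helper `distance` are hypothetical.

```python
import numpy as np
import pandas as pd

days = pd.DataFrame({"Day of the Week": list(range(7))})  # Monday=0 ... Sunday=6
period = 7  # number of distinct values in one full cycle

# Map each value to an angle on the unit circle and store its sine and cosine
angle = 2 * np.pi * days["Day of the Week"] / period
days["dow_sin"] = np.sin(angle)
days["dow_cos"] = np.cos(angle)

coords = days[["dow_sin", "dow_cos"]].to_numpy()


def distance(a, b):
    """Euclidean distance between days a and b in the (sin, cos) plane."""
    return float(np.linalg.norm(coords[a] - coords[b]))


# Monday-Sunday, Monday-Tuesday, Monday-Thursday
print(distance(0, 6), distance(0, 1), distance(0, 3))
```

In this representation Monday is as close to Sunday as it is to Tuesday, while Thursday is farther away from Monday, which is the property the plain integer encoding lacks. Both the sine and the cosine column are needed, because a single one of them maps different days to the same value.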


In [None]:
# 1. Using the `Date Time` column create features for the day, month, year, hour and minute.

### YOUR CODE STARTS HERE ###

df["Day"] = None # TODO: Get the day of the timestamp
df["Month"] = None # TODO: Get the month of the timestamp
df["Year"] = None # TODO: Get the year of the timestamp
df["Hour"] = None # TODO: Get the hour of the timestamp
df["Minute"] = None # TODO: Get the minute of the timestamp

### YOUR CODE ENDS HERE ###

In [None]:
# 2. Based on these new features, create additional features such as a binary indicator of whether it is a weekend or not (alternatively, whether it is a weekday or not), the day of the week, and the week of the year.

### YOUR CODE STARTS HERE ###

df["Day of the Week"] = None # TODO: Get the day of the week of the timestamp
df["Weekend"] = None # TODO: Create a binary feature indicating if it is weekend or not (1 if yes, 0 if no)
df["Week of the Year"] = None # TODO: Get the week of the year of the timestamp
# Additional features if something comes to your mind

### YOUR CODE ENDS HERE ###

In [None]:
# 3. Research how cyclical encoding of cyclical features can be done and implement it for all features in the dataset where you find it suitable. Give a reason for choosing exactly the features you chose.

### YOUR CODE GOES HERE ###