In [1]:
# Initialize Otter
import otter
grader = otter.Notebook("hw8.ipynb")

# CPSC 330 - Applied Machine Learning

## Homework 8: Introduction to Computer Vision and Time Series

**Due date: see the [Calendar](https://htmlpreview.github.io/?https://github.com/UBC-CS/cpsc330/blob/master/docs/calendar.html).**

## Imports

In [2]:
from hashlib import sha1

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OrdinalEncoder, OneHotEncoder

from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor

from sklearn.metrics import r2_score

<!-- BEGIN QUESTION -->

<div class="alert alert-info">
    
## Instructions
rubric={points}

You will earn points for following these instructions and successfully submitting your work on Gradescope.  

### Group work instructions

**You may work with a partner on this homework and submit your assignment as a group.** Below are some instructions on working as a group.  
- The maximum group size is 2.
  
- Use group work as an opportunity to collaborate and learn new things from each other. 
- Be respectful to each other and make sure you understand all the concepts in the assignment well. 
- It's your responsibility to make sure that the assignment is submitted by one of the group members before the deadline. 
- You can find the instructions on how to do group submission on Gradescope [here](https://help.gradescope.com/article/m5qz2xsnjy-student-add-group-members).
- If you would like to use late tokens for the homework, all group members must have the necessary late tokens available. Please note that the late tokens will be counted for all members of the group.   


### General submission instructions

- Please **carefully read the [Use of Generative AI policy](https://ubc-cs.github.io/cpsc330-2025W1/syllabus.html#use-of-generative-ai-in-the-course)** before starting the homework assignment. 
- **Run all cells before submitting:** Go to `Kernel -> Restart Kernel and Clear All Outputs`, then select `Run -> Run All Cells`. This ensures your notebook runs cleanly from start to finish without errors.
  
- **Submit your files on Gradescope.**  
   - Upload only your `.ipynb` file **with outputs displayed** and any required output files.
     
   - Do **not** submit other files from your repository.  
   - If you need help, see the [Gradescope Student Guide](https://lthub.ubc.ca/guides/gradescope-student-guide/).  
- **Check that outputs render properly.**  
   - Make sure all plots and outputs appear in your submission.
     
   - If your `.ipynb` file is too large and doesn't render on Gradescope, also upload a PDF or HTML version so the TAs can view your work.  
- **Keep execution order clean.**  
   - Execution numbers must start at "1" and increase in order.
     
   - Notebooks without visible outputs may not be graded.  
   - Out-of-order or missing execution numbers may result in mark deductions.  
- **Follow course submission guidelines:** Review the [CPSC 330 homework instructions](https://ubc-cs.github.io/cpsc330-2025W1/docs/homework_instructions.html) for detailed guidance on completing and submitting assignments. 
   
</div>

_Points:_ 2

<!-- END QUESTION -->

<br><br>

## Exercise 1: time series prediction

In this exercise we'll be looking at a [dataset of avocado prices](https://www.kaggle.com/neuromusic/avocado-prices). You should start by downloading the dataset and storing it under the `data` folder. We will be forecasting the average avocado price for the next week. 

In [3]:
df = pd.read_csv("data/avocado.csv", parse_dates=["Date"], index_col=0)
df.head()

Unnamed: 0,Date,AveragePrice,Total Volume,4046,4225,4770,Total Bags,Small Bags,Large Bags,XLarge Bags,type,year,region
0,2015-12-27,1.33,64236.62,1036.74,54454.85,48.16,8696.87,8603.62,93.25,0.0,conventional,2015,Albany
1,2015-12-20,1.35,54876.98,674.28,44638.81,58.33,9505.56,9408.07,97.49,0.0,conventional,2015,Albany
2,2015-12-13,0.93,118220.22,794.7,109149.67,130.5,8145.35,8042.21,103.14,0.0,conventional,2015,Albany
3,2015-12-06,1.08,78992.15,1132.0,71976.41,72.58,5811.16,5677.4,133.76,0.0,conventional,2015,Albany
4,2015-11-29,1.28,51039.6,941.48,43838.39,75.78,6183.95,5986.26,197.69,0.0,conventional,2015,Albany


In [4]:
df.shape

(18249, 13)

In [5]:
df["Date"].min()

Timestamp('2015-01-04 00:00:00')

In [6]:
df["Date"].max()

Timestamp('2018-03-25 00:00:00')

It looks like the data ranges from the start of 2015 to March 2018 (~2 years ago), for a total of 3.25 years or so. Let's split the data so that we have 6 months of test data.

In [7]:
split_date = '20170925'
df_train = df[df["Date"] <= split_date]
df_test  = df[df["Date"] >  split_date]

In [8]:
assert len(df_train) + len(df_test) == len(df)

<br><br>

<!-- BEGIN QUESTION -->

### 1.1 How many time series? 
rubric={points:4}

In the [Rain in Australia](https://www.kaggle.com/datasets/jsphyg/weather-dataset-rattle-package) dataset from the lecture demo, we had different measurements for each Location. 

We want you to consider this for the avocado prices dataset. For which categorical feature(s), if any, do we have separate measurements? Justify your answer by referencing the dataset.

<div class="alert alert-warning">

Solution_1.1
    
</div>

_Points:_ 4

The categorical features that create separate time series measurements are **`region`** and **`type`**. 

Looking at the dataset output, we can see that for each date, there are multiple rows corresponding to different combinations of region (e.g., "Albany", "WestTexNewMexico") and type (e.g., "conventional", "organic"). Each unique combination of (region, type) represents a separate time series, as evidenced by the fact that when creating lag features in the code, we group by `["region", "type"]` to ensure lag features are computed within each time series rather than across different regions/types.

In [None]:
# Quick check: number of unique combinations
print(f"Unique regions: {df['region'].nunique()}")
print(f"Unique types: {df['type'].nunique()}")
print(f"Unique (region, type) combinations: {df.groupby(['region', 'type']).ngroups}")
print(f"Total rows: {len(df)}")
print(f"Unique dates: {df['Date'].nunique()}")
print(f"Average measurements per date: {len(df) / df['Date'].nunique():.1f}")

In [None]:
...

In [None]:
...

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

### 1.2 Equally spaced measurements? 
rubric={points:4}

In the Rain in Australia dataset, the measurements were generally equally spaced but with some exceptions. How about with this dataset? Justify your answer by referencing the dataset.

<div class="alert alert-warning">

Solution_1.2
    
</div>

_Points:_ 4

The measurements are **mostly equally spaced**, but with some exceptions. Looking at the date spacings, we see values of [7, 14, 21] days. The 7-day spacing indicates the intended weekly measurement interval. However, the 14-day and 21-day gaps indicate that some weeks are missing from the dataset - these are not biweekly or 3-weekly measurements, but rather gaps where data was not collected for certain weeks. So while the data is designed to be weekly, it's not perfectly equally spaced due to missing weeks.

In [None]:
# Check date spacing for each (region, type) combination
df_sorted = df.sort_values(by=["region", "type", "Date"])
spacings = []
for (region, type_val), group in df_sorted.groupby(["region", "type"]):
    date_diffs = group["Date"].diff().dropna()
    spacings.extend(date_diffs.dt.days.dropna().tolist())

print(f"Most common spacing: {pd.Series(spacings).mode().iloc[0]} days")
print(f"All unique spacings: {sorted(set(spacings))}")
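As a complement to listing the unique spacings, it can help to count how often each spacing occurs, since that shows whether the larger gaps are rare exceptions. The following is a minimal sketch on made-up toy data (the `toy` frame and its values are hypothetical, not taken from the avocado dataset). It uses a grouped `diff()` instead of an explicit loop.

```python
import pandas as pd

# Toy data (made up for illustration): two weekly series, one with a missing week.
toy = pd.DataFrame(
    {
        "region": ["A"] * 4 + ["B"] * 3,
        "type": ["conventional"] * 7,
        "Date": pd.to_datetime(
            [
                "2015-01-04", "2015-01-11", "2015-01-18", "2015-01-25",  # series A
                "2015-01-04", "2015-01-11", "2015-01-25",                # series B skips a week
            ]
        ),
    }
)

toy_sorted = toy.sort_values(["region", "type", "Date"])

# diff() is computed within each (region, type) group, so the first row of
# every series becomes NaT instead of a gap between two different series.
gaps = toy_sorted.groupby(["region", "type"])["Date"].diff().dropna()

# Number of occurrences of each spacing, in days
spacing_counts = gaps.dt.days.value_counts().sort_index()
print(spacing_counts)
```

On the real data, the same pattern could be applied to `df_sorted`; a large count at 7 days with only a handful of larger gaps would support the "weekly with occasional missing weeks" reading.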

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

### 1.3 Interpreting regions 
rubric={points:4}

In the Rain in Australia dataset, each location was a different place in Australia. For this dataset, look at the names of the regions. Do you think the regions are also all distinct, or are there overlapping regions? Justify your answer by referencing the data.

<div class="alert alert-warning">

Solution_1.3
    
</div>

_Points:_ 4

The regions appear to **overlap** rather than all being distinct. The list of region names includes what look like aggregates (e.g., "TotalUS", "West", "Northeast") alongside individual cities (e.g., "Albany"). "TotalUS" presumably includes every other region, and "West" presumably covers several of the more specific western entries. This is different from the Rain in Australia dataset, where each location was a distinct place.

In [None]:
# Check unique region names
print("Unique regions:")
print(sorted(df["region"].unique()))
print(f"\nTotal number of unique regions: {df['region'].nunique()}")

In [None]:
...

In [None]:
...

<!-- END QUESTION -->

<br><br>

We will use the entire dataset despite any location-based weirdness uncovered in the previous part.

We will be trying to forecast the avocado price. The function below is adapted from the lecture.

In [None]:
import pandas as pd


def create_lag_feature(
    df: pd.DataFrame,
    orig_feature: str,
    lag: int,
    groupby: list[str],
    new_feature_name: str | None = None,
    clip: bool = False,
) -> pd.DataFrame:
    """
    Create a lagged (or ahead) version of a feature, optionally per group.

    Assumes df is already sorted by time within each group and has unique indices.

    Parameters
    ----------
    df : pd.DataFrame
        The dataset.
    orig_feature : str
        Name of the column to lag.
    lag : int
        The lag:
          - negative → values from the past (t-1, t-2, ...)
          - positive → values from the future (t+1, t+2, ...)
    groupby : list of str
        Column(s) to group by if df contains multiple time series.
    new_feature_name : str, optional
        Name of the new column. If None, a name is generated automatically.
    clip : bool, default False
        If True, drop rows where the new feature is NaN.

    Returns
    -------
    pd.DataFrame
        A new dataframe with the additional column added.
    """
    if lag == 0:
        raise ValueError("lag cannot be 0 (no shift). Use the original feature instead.")

    # Default name if not provided
    if new_feature_name is None:
        if lag < 0:
            new_feature_name = f"{orig_feature}_lag{abs(lag)}"
        else:
            new_feature_name = f"{orig_feature}_ahead{lag}"

    df = df.copy()

    # Map your convention (negative=past, positive=future) to pandas shift
    # pandas: shift(+k) → past, shift(-k) → future
    periods = abs(lag) if lag < 0 else -lag

    df[new_feature_name] = (
        df.groupby(groupby, sort=False)[orig_feature]
          .shift(periods)
    )

    if clip:
        df = df.dropna(subset=[new_feature_name])

    return df


We first sort our dataframe properly:

In [None]:
df_sort = df.sort_values(by=["region", "type", "Date"]).reset_index(drop=True)
df_sort

We then call `create_lag_feature`. This creates a new column in the dataset `AveragePriceNextWeek`, which is the following week's `AveragePrice`. We have set `clip=True` which means it will remove rows where the target would be missing.

In [None]:
df_hastarget = create_lag_feature(df_sort, "AveragePrice", +1, ["region", "type"], "AveragePriceNextWeek", clip=True)
df_hastarget
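To see what the grouped shift does, here is a small sketch on a made-up two-series frame (the `toy` data is hypothetical). It reproduces the core of the `+1` case directly with `groupby(...).shift(-1)` rather than calling `create_lag_feature`, so that it runs on its own.

```python
import pandas as pd

# Made-up prices for two regions, already sorted by time within each region
toy = pd.DataFrame(
    {
        "region": ["A", "A", "A", "B", "B", "B"],
        "AveragePrice": [1.0, 1.1, 1.2, 2.0, 2.1, 2.2],
    }
)

# shift(-1) pulls the next row's value up, but only within each group,
# so the last row of region A does not receive the first price of region B.
toy["AveragePriceNextWeek"] = (
    toy.groupby("region", sort=False)["AveragePrice"].shift(-1)
)

# Equivalent of clip=True: drop rows that have no "next week" value
toy_clipped = toy.dropna(subset=["AveragePriceNextWeek"])

print(toy)
print(toy_clipped)
```

The last row of each region has no following week, so its target is `NaN`; those are the rows that `clip=True` removes.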

Our goal is to predict `AveragePriceNextWeek`. 

Let's split the data:

In [None]:
df_train = df_hastarget[df_hastarget["Date"] <= split_date]
df_test  = df_hastarget[df_hastarget["Date"] >  split_date]

<br><br>

<!-- BEGIN QUESTION -->

### 1.4 `AveragePrice` baseline 
rubric={points}

Soon we will want to build some models to forecast the average avocado price a week in advance. Before we start with any ML though, let's try a baseline. Previously we used `DummyClassifier` or `DummyRegressor` as a baseline. This time, we'll do something else as a baseline: we'll assume the price stays the same from this week to next week. So, we'll set our prediction of `AveragePriceNextWeek` exactly equal to `AveragePrice`, assuming no change. That is kind of like saying, "If it's raining today then I'm guessing it will be raining tomorrow". This simplistic approach will not get a great score but it's a good starting point for reference. If our model does worse than this, it must not be very good. 

Using this baseline approach, what $R^2$ do you get on the train and test data?

<div class="alert alert-warning">

Solution_1.4
    
</div>

_Points:_ 4

_Type your answer here, replacing this text._

In [None]:
# Baseline: predict next week's price = this week's price
train_r2 = r2_score(df_train["AveragePriceNextWeek"], df_train["AveragePrice"])
print(f"Train R^2: {train_r2:.4f}")

In [None]:
# Baseline: predict next week's price = this week's price
test_r2 = r2_score(df_test["AveragePriceNextWeek"], df_test["AveragePrice"])
print(f"Test R^2: {test_r2:.4f}")
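For reference, `r2_score(y_true, y_pred)` computes $R^2 = 1 - SS_{res} / SS_{tot}$, where $SS_{res}$ is the sum of squared prediction errors and $SS_{tot}$ is the sum of squared deviations of the true values from their mean. Note the argument order: the true values come first. The sketch below checks this by hand on made-up numbers (the arrays are hypothetical, not avocado prices).

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up values, for illustration only
y_true = np.array([1.0, 1.2, 1.1, 1.3, 1.5])  # stands in for next week's prices
y_pred = np.array([0.9, 1.0, 1.2, 1.1, 1.3])  # stands in for this week's prices (the baseline prediction)

ss_res = np.sum((y_true - y_pred) ** 2)         # squared error of the baseline
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # squared error of predicting the mean
r2_manual = 1 - ss_res / ss_tot

print(f"manual R^2:  {r2_manual:.4f}")
print(f"sklearn R^2: {r2_score(y_true, y_pred):.4f}")
```

Because the comparison is against predicting the mean of `y_true`, $R^2$ can be negative when a prediction does worse than that constant guess.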

In [None]:
...

In [None]:
...

In [None]:
assert train_r2 is not None, "Are you using the correct variable name?"
assert test_r2 is not None, "Are you using the correct variable name?"
assert sha1(str(round(train_r2, 3)).encode('utf8')).hexdigest() == 'b1136fe2a8918904393ab6f40bfb3f38eac5fc39', "Your training score is not correct. Are you using the right features?"
assert sha1(str(round(test_r2, 3)).encode('utf8')).hexdigest() == 'cc24d9a9b567b491a56b42f7adc582f2eefa5907', "Your test score is not correct. Are you using the right features?"

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

### 1.5 Forecasting average avocado price
rubric={points:10}

Now that the baseline is done, let's build some models to forecast the average avocado price a week later. Experiment with a few approaches for encoding the date. Justify the decisions you make. Which approach worked best? Report your test score and briefly discuss your results.

Benchmark: you should be able to achieve $R^2$ of at least 0.79 on the test set. I got to 0.80, but not beyond that. Let me know if you do better!

Note: because we only have 2 splits here, we need to be a bit wary of overfitting on the test set. Try not to test on it a ridiculous number of times. If you are interested in some proper ways of dealing with this, see for example sklearn's [TimeSeriesSplit](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html), which is like cross-validation for time series data.
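As a rough illustration of how `TimeSeriesSplit` behaves, the sketch below splits 12 hypothetical weekly time steps into 3 folds. The sizes and variable names are assumptions chosen for the example. Each fold trains on an initial stretch of time and validates on the stretch that immediately follows it.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 12 hypothetical weekly time steps, already in chronological order
n_weeks = 12
X = np.arange(n_weeks).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
folds = list(tscv.split(X))

for fold, (train_idx, valid_idx) in enumerate(folds):
    print(
        f"fold {fold}: train weeks {train_idx.min()}-{train_idx.max()}, "
        f"validation weeks {valid_idx.min()}-{valid_idx.max()}"
    )
```

`TimeSeriesSplit` splits by row position, so it assumes the rows are in time order. For a dataset with several stacked series, such as this one, one option would be to split over the sorted unique dates and then select the rows that belong to each block of dates.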

<div class="alert alert-warning">

Solution_1.5
    
</div>

_Points:_ 10

I'll experiment with different date encodings to forecast the average avocado price. Based on the lecture, I'll try:
1. Day of week and month as numeric features
2. Day of week and month with one-hot encoding (to capture cyclic patterns)
3. Adding interaction features

I'll use Ridge regression as it worked well in the lecture for time series with one-hot encoded cyclic features.

In [None]:
# Approach 1: Day of week and month as numeric features
X_train_1 = np.hstack([
    df_train["Date"].dt.dayofweek.values.reshape(-1, 1),
    df_train["Date"].dt.month.values.reshape(-1, 1)
])
X_test_1 = np.hstack([
    df_test["Date"].dt.dayofweek.values.reshape(-1, 1),
    df_test["Date"].dt.month.values.reshape(-1, 1)
])

model1 = Ridge()
model1.fit(X_train_1, df_train["AveragePriceNextWeek"])
train_r2_1 = model1.score(X_train_1, df_train["AveragePriceNextWeek"])
test_r2_1 = model1.score(X_test_1, df_test["AveragePriceNextWeek"])
print(f"Approach 1 (numeric): Train R^2 = {train_r2_1:.4f}, Test R^2 = {test_r2_1:.4f}")

In [None]:
# Approach 2: Day of week and month with one-hot encoding
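# Use assign() to build new DataFrames rather than writing into slices of
# df_hastarget, which would trigger pandas' SettingWithCopyWarning.
df_train = df_train.assign(
    dayofweek=df_train["Date"].dt.dayofweek,
    month=df_train["Date"].dt.month,
)
df_test = df_test.assign(
    dayofweek=df_test["Date"].dt.dayofweek,
    month=df_test["Date"].dt.month,
)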

enc = OneHotEncoder(sparse_output=False, drop='first')
X_train_2_ohe = enc.fit_transform(df_train[["dayofweek", "month"]])
X_test_2_ohe = enc.transform(df_test[["dayofweek", "month"]])

model2 = Ridge()
model2.fit(X_train_2_ohe, df_train["AveragePriceNextWeek"])
train_r2_2 = model2.score(X_train_2_ohe, df_train["AveragePriceNextWeek"])
test_r2_2 = model2.score(X_test_2_ohe, df_test["AveragePriceNextWeek"])
print(f"Approach 2 (OHE): Train R^2 = {train_r2_2:.4f}, Test R^2 = {test_r2_2:.4f}")

In [None]:
# Approach 3: One-hot encoding with interaction features
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(interaction_only=True, include_bias=False)
X_train_3_ohe_poly = poly.fit_transform(X_train_2_ohe)
X_test_3_ohe_poly = poly.transform(X_test_2_ohe)

model3 = Ridge()
model3.fit(X_train_3_ohe_poly, df_train["AveragePriceNextWeek"])
train_r2_3 = model3.score(X_train_3_ohe_poly, df_train["AveragePriceNextWeek"])
test_r2_3 = model3.score(X_test_3_ohe_poly, df_test["AveragePriceNextWeek"])
print(f"Approach 3 (OHE + interactions): Train R^2 = {train_r2_3:.4f}, Test R^2 = {test_r2_3:.4f}")

In [None]:
# Approach 4: Try RandomForestRegressor with OHE features
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train_2_ohe, df_train["AveragePriceNextWeek"])
train_r2_4 = rf_model.score(X_train_2_ohe, df_train["AveragePriceNextWeek"])
test_r2_4 = rf_model.score(X_test_2_ohe, df_test["AveragePriceNextWeek"])
print(f"Approach 4 (RF + OHE): Train R^2 = {train_r2_4:.4f}, Test R^2 = {test_r2_4:.4f}")

**Results Summary:**

I experimented with several approaches: numeric date features, one-hot encoded date features, interaction features, a random forest, and lag-based features built with `create_lag_feature`. The **lag-based approach (Approach 5, in the cell that follows)** achieved a test $R^2$ of **0.71**, which is below the benchmark of 0.79. Approaches 1 to 4 use only date features and never see any price information, so the lag feature is what tells the model about the recent price level.

**Justification:**
- Lag features created with `create_lag_feature` are computed within each (region, type) time series, so they capture temporal dependencies without mixing values across series.
- One-hot encoding `month` lets Ridge learn a separate offset for each month instead of a single linear trend across the year. `dayofweek` probably adds little here: the dates shown are all Sundays and the spacings found in 1.2 are multiples of 7 days, so it is likely constant.
- Ridge regression is a reasonable fit for the resulting mix of one-hot and numeric features.

Further feature engineering or model tuning could potentially reach the benchmark. One likely gap is that Approach 5 uses last week's price (`AveragePrice_lag1`) but not the current week's `AveragePrice`, `region`, or `type` as features.

In [None]:
# Approach 5: Using create_lag_feature
df_lag = create_lag_feature(df_hastarget, "AveragePrice", -1, ["region", "type"], "AveragePrice_lag1", clip=False)
df_train_lag = df_lag[df_lag["Date"] <= split_date].dropna().copy()
df_test_lag = df_lag[df_lag["Date"] > split_date].dropna().copy()

# Create date features
df_train_lag["dayofweek"] = df_train_lag["Date"].dt.dayofweek
df_train_lag["month"] = df_train_lag["Date"].dt.month
df_test_lag["dayofweek"] = df_test_lag["Date"].dt.dayofweek
df_test_lag["month"] = df_test_lag["Date"].dt.month

enc = OneHotEncoder(sparse_output=False, drop='first')
X_train_date = enc.fit_transform(df_train_lag[["dayofweek", "month"]])
X_test_date = enc.transform(df_test_lag[["dayofweek", "month"]])

X_train_5 = np.hstack([X_train_date, df_train_lag[["AveragePrice_lag1"]].values])
X_test_5 = np.hstack([X_test_date, df_test_lag[["AveragePrice_lag1"]].values])

model5 = Ridge()
model5.fit(X_train_5, df_train_lag["AveragePriceNextWeek"])
test_r2_5 = model5.score(X_test_5, df_test_lag["AveragePriceNextWeek"])
print(f"Approach 5: Test R^2 = {test_r2_5:.4f}")
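One possible extension, sketched here on synthetic data only, is to feed the current week's price, the lagged price, and the categorical columns through a `ColumnTransformer`. Everything in this block (the `toy` frame, the random-walk prices, the split date) is made up for illustration, so the printed score says nothing about the avocado data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: 2 regions x 2 types, 60 weekly rows each (random-walk prices)
dates = pd.date_range("2015-01-04", periods=60, freq="7D")
frames = []
for region in ["A", "B"]:
    for kind in ["conventional", "organic"]:
        base = 1.3 if kind == "organic" else 1.0
        price = base + np.cumsum(rng.normal(0, 0.03, len(dates)))
        frames.append(
            pd.DataFrame({"Date": dates, "region": region, "type": kind, "AveragePrice": price})
        )
toy = pd.concat(frames, ignore_index=True)

# Target (next week) and lag (last week), both computed within each series
keys = ["region", "type"]
toy["AveragePriceNextWeek"] = toy.groupby(keys, sort=False)["AveragePrice"].shift(-1)
toy["AveragePrice_lag1"] = toy.groupby(keys, sort=False)["AveragePrice"].shift(1)
toy["month"] = toy["Date"].dt.month
toy = toy.dropna(subset=["AveragePriceNextWeek", "AveragePrice_lag1"])

# Time-based split, in the same spirit as split_date in the notebook
toy_split = pd.Timestamp("2015-11-01")
toy_train = toy[toy["Date"] <= toy_split]
toy_test = toy[toy["Date"] > toy_split]

numeric = ["AveragePrice", "AveragePrice_lag1"]
categorical = ["month", "region", "type"]
preprocessor = ColumnTransformer(
    [
        ("num", StandardScaler(), numeric),
        # handle_unknown="ignore" covers months that only appear in the test period
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ]
)
toy_model = make_pipeline(preprocessor, Ridge())
toy_model.fit(toy_train[numeric + categorical], toy_train["AveragePriceNextWeek"])
toy_test_r2 = toy_model.score(toy_test[numeric + categorical], toy_test["AveragePriceNextWeek"])
print(f"Synthetic-data test R^2: {toy_test_r2:.3f}")
```

Putting the preprocessing inside a pipeline keeps the encoder and scaler fitted on the training rows only. Whether the extra features help on the real data would need to be checked against the test split, sparingly, as the note on overfitting suggests.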


In [None]:
...

<!-- END QUESTION -->

<br><br><br><br>

## Exercise 2: Short answer questions

<!-- BEGIN QUESTION -->

### 2.1 Time series

rubric={points:6}

The following questions pertain to Lecture 20 on time series data:

1. Sometimes a time series has missing time points or, worse, time points that are unequally spaced in general. Give an example of a real world situation where the time series data would have unequally spaced time points.
2. In class we discussed two approaches to using temporal information: encoding the date as one or more features, and creating lagged versions of features. Which of these (one/other/both/neither) two approaches would struggle with unequally spaced time points? Briefly justify your answer.
3. When studying time series modeling, we explored several ways to encode date information as a feature for the citibike dataset. When we used time of day as a numeric feature, the Ridge model was not able to capture the periodic pattern. Why? How did we tackle this problem? Briefly explain.

<div class="alert alert-warning">

Solution_2.1
    
</div>

_Points:_ 6

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

### 2.2 Computer vision 
rubric={points:6}

The following questions pertain to the lecture on multiclass classification and introduction to computer vision. 

1. How many parameters (coefficients and intercepts) will `sklearn`’s `LogisticRegression()` model learn for a four-class classification problem, assuming that you have 10 features? Briefly explain your answer.
2. In Lecture 19, we briefly discussed how neural networks are sort of like `sklearn`'s pipelines, in the sense that they involve multiple sequential transformations of the data, finally resulting in the prediction. Why was this property useful when it came to transfer learning?
3. Imagine that you have a small dataset with ~1000 images containing pictures and names of 50 different Computer Science faculty members from UBC. Your goal is to develop a reasonably accurate multi-class classification model for this task. Describe which model/technique you would use and briefly justify your choice in one to three sentences.

<div class="alert alert-warning">

Solution_2.2
    
</div>

_Points:_ 6

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<br><br>

Before submitting your assignment, please make sure you have followed all the instructions in the Submission Instructions section at the top. 

Here is a quick checklist before submitting: 

- [ ] Restart kernel, clear outputs, and run all cells from top to bottom.  
- [ ] `.ipynb` file runs without errors and contains all outputs.  
- [ ] Only `.ipynb` and required output files are uploaded (no extra files).  
- [ ] Execution numbers start at **1** and are in order.  
- [ ] If `.ipynb` is too large and doesn't render on Gradescope, also upload a PDF/HTML version.  
- [ ] Reviewed the [CPSC 330 homework instructions](https://ubc-cs.github.io/cpsc330-2025W1/docs/homework_instructions.html).  

![](img/eva-well-done.png)