In [1]:
# Initialize Otter
import otter
grader = otter.Notebook("hw8.ipynb")

# CPSC 330 - Applied Machine Learning

## Homework 8: Introduction to Computer Vision and Time Series (Lectures 19 and 20)

**Due date: see the [Calendar](https://htmlpreview.github.io/?https://github.com/UBC-CS/cpsc330/blob/master/docs/calendar.html).**

## Imports

In [2]:
from hashlib import sha1

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OrdinalEncoder, OneHotEncoder

from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor

from sklearn.metrics import r2_score

<div class="alert alert-info">
    
## Submission instructions
<hr>
rubric={points:2}

Follow the [homework submission instructions](https://github.com/UBC-CS/cpsc330-2023W1/blob/main/docs/homework_instructions.md). 

**You may work in a group on this homework and submit your assignment as a group.** Below are some instructions on working as a group.  
- The maximum group size is 2. 
- Use group work as an opportunity to collaborate and learn new things from each other. 
- Be respectful to each other and make sure you understand all the concepts in the assignment well. 
- It's your responsibility to make sure that the assignment is submitted by one of the group members before the deadline. 
- You can find the instructions on how to do group submission on Gradescope [here](https://help.gradescope.com/article/m5qz2xsnjy-student-add-group-members).


When you are ready to submit your assignment do the following:

1. Run all cells in your notebook to make sure there are no errors by doing `Kernel -> Restart Kernel and Clear All Outputs` and then `Run -> Run All Cells`. 
2. Notebooks with cell execution numbers out of order or not starting from “1” will have marks deducted. Notebooks without the output displayed may not be graded at all (because we need to see the output in order to grade your work).
3. Upload the assignment using Gradescope's drag and drop tool. Check out this [Gradescope Student Guide](https://lthub.ubc.ca/guides/gradescope-student-guide/) if you need help with Gradescope submission.
4. Make sure that the plots and output are rendered properly in your submitted file. 
5. If the .ipynb file is too big and doesn't render on Gradescope, also upload a pdf or html in addition to the .ipynb.

</div>

<br><br>

## Exercise 1: time series prediction

In this exercise we'll be looking at a [dataset of avocado prices](https://www.kaggle.com/neuromusic/avocado-prices). You should start by downloading the dataset and storing it under the `data` folder. We will be forecasting the average avocado price for the next week.

In [3]:
df = pd.read_csv("data/avocado.csv", parse_dates=["Date"], index_col=0)
df.head()

Unnamed: 0,Date,AveragePrice,Total Volume,4046,4225,4770,Total Bags,Small Bags,Large Bags,XLarge Bags,type,year,region
0,2015-12-27,1.33,64236.62,1036.74,54454.85,48.16,8696.87,8603.62,93.25,0.0,conventional,2015,Albany
1,2015-12-20,1.35,54876.98,674.28,44638.81,58.33,9505.56,9408.07,97.49,0.0,conventional,2015,Albany
2,2015-12-13,0.93,118220.22,794.7,109149.67,130.5,8145.35,8042.21,103.14,0.0,conventional,2015,Albany
3,2015-12-06,1.08,78992.15,1132.0,71976.41,72.58,5811.16,5677.4,133.76,0.0,conventional,2015,Albany
4,2015-11-29,1.28,51039.6,941.48,43838.39,75.78,6183.95,5986.26,197.69,0.0,conventional,2015,Albany


In [4]:
df.shape

(18249, 13)

In [5]:
df["Date"].min()

Timestamp('2015-01-04 00:00:00')

In [6]:
df["Date"].max()

Timestamp('2018-03-25 00:00:00')

It looks like the data ranges from the start of 2015 to March 2018 (~2 years ago), for a total of 3.25 years or so. Let's split the data so that we have 6 months of test data.

In [7]:
split_date = '20170925'
df_train = df[df["Date"] <= split_date]
df_test  = df[df["Date"] >  split_date]

In [8]:
assert len(df_train) + len(df_test) == len(df)

<br><br>

<!-- BEGIN QUESTION -->

### 1.1 How many time series? 
rubric={points:4}

In the [Rain in Australia](https://www.kaggle.com/datasets/jsphyg/weather-dataset-rattle-package) dataset from lecture demo, we had different measurements for each Location. 

We want you to consider this for the avocado prices dataset. For which categorical feature(s), if any, do we have separate measurements? Justify your answer by referencing the dataset.

<div class="alert alert-warning">

Solution_1.1
    
</div>

_Points:_ 4

We have separate measurements for each combination of `region` and `type`. For example, each region and Date combination has two rows: one for conventional avocados and one for organic avocados (see the two Albany rows for 2015-01-04 below).

In [9]:
df.sort_values(by = ["region", "Date"]).head()

Unnamed: 0,Date,AveragePrice,Total Volume,4046,4225,4770,Total Bags,Small Bags,Large Bags,XLarge Bags,type,year,region
51,2015-01-04,1.22,40873.28,2819.5,28287.42,49.9,9716.46,9186.93,529.53,0.0,conventional,2015,Albany
51,2015-01-04,1.79,1373.95,57.42,153.88,0.0,1162.65,1162.65,0.0,0.0,organic,2015,Albany
50,2015-01-11,1.24,41195.08,1002.85,31640.34,127.12,8424.77,8036.04,388.73,0.0,conventional,2015,Albany
50,2015-01-11,1.77,1182.56,39.0,305.12,0.0,838.44,838.44,0.0,0.0,organic,2015,Albany
49,2015-01-18,1.17,44511.28,914.14,31540.32,135.77,11921.05,11651.09,269.96,0.0,conventional,2015,Albany
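
As a complementary check, the number of separate time series can be counted by grouping on the candidate columns. The following is a minimal sketch on a small made-up frame (the `toy` data is an assumption for illustration, not the avocado file); the same calls could be applied to `df`.

```python
import pandas as pd

# Toy stand-in for the avocado data (values are made up for illustration).
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2015-01-04", "2015-01-04", "2015-01-11",
                            "2015-01-11", "2015-01-04", "2015-01-11"]),
    "region": ["Albany", "Albany", "Albany", "Albany", "Boston", "Boston"],
    "type": ["conventional", "organic", "conventional", "organic",
             "conventional", "conventional"],
    "AveragePrice": [1.22, 1.79, 1.24, 1.77, 1.10, 1.12],
})

# Each (region, type) pair is one time series; size() counts its rows.
series_sizes = toy.groupby(["region", "type"]).size()
n_series = len(series_sizes)

# If (region, type) fully identifies a series, then a
# (region, type, Date) triple should never appear twice.
has_duplicates = toy.duplicated(subset=["region", "type", "Date"]).any()

print(series_sizes)
print(f"series: {n_series}, duplicate timestamps: {has_duplicates}")
```

If `has_duplicates` were `True`, some other column would be needed to tell the series apart.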


<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

### 1.2 Equally spaced measurements? 
rubric={points:4}

In the Rain in Australia dataset, the measurements were generally equally spaced but with some exceptions. How about with this dataset? Justify your answer by referencing the dataset.

<div class="alert alert-warning">

Solution_1.2
    
</div>

_Points:_ 4

The measurements are generally equally spaced at 7-day intervals, with one exception: for organic avocados in WestTexNewMexico, one gap between consecutive measurements is 14 days and another is 21 days.
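
A compact way to summarise the spacing across all series at once is a grouped `diff`. This is a sketch on made-up dates (the `toy` frame is an assumption, built so that one series has a 14-day and a 21-day gap); it is an alternative to looping over every region and type.

```python
import pandas as pd

# Toy data: one regular weekly series and one with irregular gaps.
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2015-01-04", "2015-01-11", "2015-01-18",
                            "2015-01-04", "2015-01-18", "2015-02-08"]),
    "region": ["Albany"] * 3 + ["WestTexNewMexico"] * 3,
    "type": ["conventional"] * 3 + ["organic"] * 3,
})
toy = toy.sort_values(["region", "type", "Date"])

# Gap between consecutive dates, computed within each series.
gaps = toy.groupby(["region", "type"])["Date"].diff().dropna()

# How often each gap length occurs over the whole dataset.
gap_counts = gaps.value_counts().sort_index()

# Rows that follow a gap other than 7 days.
irregular = toy.loc[gaps[gaps != pd.Timedelta(days=7)].index]

print(gap_counts)
print(irregular)
```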

In [10]:
def print_time_spacing_for_all(df):
    """
    Prints the distribution of time spacing for each region and type.

    Parameters:
        df (pd.DataFrame): The input DataFrame with columns 'region', 'type', and 'Date'.
    """
    # Ensure 'Date' is in datetime format
    df['Date'] = pd.to_datetime(df['Date'])
    
    # Get unique combinations of region and type
    regions = df['region'].unique()
    types = df['type'].unique()
    
    for region in regions:
        for type_ in types:
            # Filter data for the given region and type
            region_type_data = df[(df['region'] == region) & (df['type'] == type_)]
            
            if region_type_data.empty:
                continue  # Skip if there's no data for this combination
            
            # Calculate time differences
            time_diffs = region_type_data['Date'].sort_values().diff().dropna()
            
            # Count the frequency of each time difference
            value_counts = time_diffs.value_counts().sort_index()
            
            # Print the results
            print(f"Time spacing counts for region: {region}, type: {type_}:\n{value_counts}\n")

In [11]:
print_time_spacing_for_all(df)

Time spacing counts for region: Albany, type: conventional:
Date
7 days    168
Name: count, dtype: int64

Time spacing counts for region: Albany, type: organic:
Date
7 days    168
Name: count, dtype: int64

Time spacing counts for region: Atlanta, type: conventional:
Date
7 days    168
Name: count, dtype: int64

Time spacing counts for region: Atlanta, type: organic:
Date
7 days    168
Name: count, dtype: int64

Time spacing counts for region: BaltimoreWashington, type: conventional:
Date
7 days    168
Name: count, dtype: int64

Time spacing counts for region: BaltimoreWashington, type: organic:
Date
7 days    168
Name: count, dtype: int64

Time spacing counts for region: Boise, type: conventional:
Date
7 days    168
Name: count, dtype: int64

Time spacing counts for region: Boise, type: organic:
Date
7 days    168
Name: count, dtype: int64

Time spacing counts for region: Boston, type: conventional:
Date
7 days    168
Name: count, dtype: int64

Time spacing counts for region: Boston, 

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

### 1.3 Interpreting regions 
rubric={points:4}

In the Rain in Australia dataset, each location was a different place in Australia. For this dataset, look at the names of the regions. Do you think the regions are also all distinct, or are there overlapping regions? Justify your answer by referencing the data.

<div class="alert alert-warning">

Solution_1.3
    
</div>

_Points:_ 4

Looking at the region names, some regions overlap. For example, Great Lakes covers cities such as Detroit and Chicago. Both cities and the states that contain them appear, such as SanDiego, SanFrancisco, and California. The regions also include TotalUS, West, Northeast, SouthCentral, and Southeast, which overlap with the states and cities.

In [12]:
unique_regions = df['region'].unique()
region_counts = df['region'].value_counts()

print(f"Total number of unique regions: {len(unique_regions)}")
print(f"Frequency of each region:\n{region_counts}")

Total number of unique regions: 54
Frequency of each region:
region
Albany                 338
Sacramento             338
Northeast              338
NorthernNewEngland     338
Orlando                338
Philadelphia           338
PhoenixTucson          338
Pittsburgh             338
Plains                 338
Portland               338
RaleighGreensboro      338
RichmondNorfolk        338
Roanoke                338
SanDiego               338
Atlanta                338
SanFrancisco           338
Seattle                338
SouthCarolina          338
SouthCentral           338
Southeast              338
Spokane                338
StLouis                338
Syracuse               338
Tampa                  338
TotalUS                338
West                   338
NewYork                338
NewOrleansMobile       338
Nashville              338
Midsouth               338
BaltimoreWashington    338
Boise                  338
Boston                 338
BuffaloRochester       338
California    

<!-- END QUESTION -->

<br><br>

We will use the entire dataset despite any location-based weirdness uncovered in the previous part.

We will be trying to forecast the avocado price. The function below is adapted from [Lecture 19](https://github.com/UBC-CS/cpsc330-2023W1/tree/main/lectures), with some improvements.

In [13]:
def create_lag_feature(df, orig_feature, lag, groupby, new_feature_name=None, clip=False):
    """
    Creates a new feature that's a lagged version of an existing one.
    
    NOTE: assumes df is already sorted by the time columns and has unique indices.
    
    Parameters
    ----------
    df : pandas.core.frame.DataFrame
        The dataset.
    orig_feature : str
        The column name of the feature we're copying
    lag : int
        The lag; negative lag means values from the past, positive lag means values from the future
    groupby : list
        Column(s) to group by in case df contains multiple time series
    new_feature_name : str
        Override the default name of the newly created column
    clip : bool
        If True, remove rows with NaN values for the new feature
    
    Returns
    -------
    pandas.core.frame.DataFrame
        A new dataframe with the additional column added.
        
    """
        
    if new_feature_name is None:
        if lag < 0:
            new_feature_name = "%s_lag%d" % (orig_feature, -lag)
        else:
            new_feature_name = "%s_ahead%d" % (orig_feature, lag)
    
    new_df = df.assign(**{new_feature_name : np.nan})
    for name, group in new_df.groupby(groupby):        
        if lag < 0: # take values from the past
            new_df.loc[group.index[-lag:],new_feature_name] = group.iloc[:lag][orig_feature].values
        else:       # take values from the future
            new_df.loc[group.index[:-lag], new_feature_name] = group.iloc[lag:][orig_feature].values
            
    if clip:
        new_df = new_df.dropna(subset=[new_feature_name])
        
    return new_df
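
For intuition, a one-step-ahead target like the one this function produces can also be sketched with pandas' grouped `shift`. This is an alternative illustration on made-up data, not the course's implementation; the `toy` frame and its `week` column are assumptions. A `shift(-1)` within each group plays the role of `lag=+1` (values from the future).

```python
import pandas as pd

# Two short, already-sorted series (made-up prices).
toy = pd.DataFrame({
    "region": ["A", "A", "A", "B", "B", "B"],
    "week": [1, 2, 3, 1, 2, 3],
    "AveragePrice": [1.0, 1.1, 1.2, 2.0, 2.1, 2.2],
})
toy = toy.sort_values(["region", "week"]).reset_index(drop=True)

# Shift within each group so that one series never leaks into another.
toy["AveragePriceNextWeek"] = toy.groupby("region")["AveragePrice"].shift(-1)

# The last row of each series has no "next week", so drop it
# (this mirrors what clip=True does).
with_target = toy.dropna(subset=["AveragePriceNextWeek"])
print(with_target)
```

Grouping matters here: without it, the last row of series A would receive the first price of series B as its target.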

We first sort our dataframe properly:

In [14]:
df_sort = df.sort_values(by=["region", "type", "Date"]).reset_index(drop=True)
df_sort

Unnamed: 0,Date,AveragePrice,Total Volume,4046,4225,4770,Total Bags,Small Bags,Large Bags,XLarge Bags,type,year,region
0,2015-01-04,1.22,40873.28,2819.50,28287.42,49.90,9716.46,9186.93,529.53,0.0,conventional,2015,Albany
1,2015-01-11,1.24,41195.08,1002.85,31640.34,127.12,8424.77,8036.04,388.73,0.0,conventional,2015,Albany
2,2015-01-18,1.17,44511.28,914.14,31540.32,135.77,11921.05,11651.09,269.96,0.0,conventional,2015,Albany
3,2015-01-25,1.06,45147.50,941.38,33196.16,164.14,10845.82,10103.35,742.47,0.0,conventional,2015,Albany
4,2015-02-01,0.99,70873.60,1353.90,60017.20,179.32,9323.18,9170.82,152.36,0.0,conventional,2015,Albany
...,...,...,...,...,...,...,...,...,...,...,...,...,...
18244,2018-02-25,1.57,18421.24,1974.26,2482.65,0.00,13964.33,13698.27,266.06,0.0,organic,2018,WestTexNewMexico
18245,2018-03-04,1.54,17393.30,1832.24,1905.57,0.00,13655.49,13401.93,253.56,0.0,organic,2018,WestTexNewMexico
18246,2018-03-11,1.56,22128.42,2162.67,3194.25,8.93,16762.57,16510.32,252.25,0.0,organic,2018,WestTexNewMexico
18247,2018-03-18,1.56,15896.38,2055.35,1499.55,0.00,12341.48,12114.81,226.67,0.0,organic,2018,WestTexNewMexico


We then call `create_lag_feature`. This creates a new column in the dataset `AveragePriceNextWeek`, which is the following week's `AveragePrice`. We have set `clip=True` which means it will remove rows where the target would be missing.

In [15]:
df_hastarget = create_lag_feature(df_sort, "AveragePrice", +1, ["region", "type"], "AveragePriceNextWeek", clip=True)
df_hastarget

Unnamed: 0,Date,AveragePrice,Total Volume,4046,4225,4770,Total Bags,Small Bags,Large Bags,XLarge Bags,type,year,region,AveragePriceNextWeek
0,2015-01-04,1.22,40873.28,2819.50,28287.42,49.90,9716.46,9186.93,529.53,0.0,conventional,2015,Albany,1.24
1,2015-01-11,1.24,41195.08,1002.85,31640.34,127.12,8424.77,8036.04,388.73,0.0,conventional,2015,Albany,1.17
2,2015-01-18,1.17,44511.28,914.14,31540.32,135.77,11921.05,11651.09,269.96,0.0,conventional,2015,Albany,1.06
3,2015-01-25,1.06,45147.50,941.38,33196.16,164.14,10845.82,10103.35,742.47,0.0,conventional,2015,Albany,0.99
4,2015-02-01,0.99,70873.60,1353.90,60017.20,179.32,9323.18,9170.82,152.36,0.0,conventional,2015,Albany,0.99
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18243,2018-02-18,1.56,17597.12,1892.05,1928.36,0.00,13776.71,13553.53,223.18,0.0,organic,2018,WestTexNewMexico,1.57
18244,2018-02-25,1.57,18421.24,1974.26,2482.65,0.00,13964.33,13698.27,266.06,0.0,organic,2018,WestTexNewMexico,1.54
18245,2018-03-04,1.54,17393.30,1832.24,1905.57,0.00,13655.49,13401.93,253.56,0.0,organic,2018,WestTexNewMexico,1.56
18246,2018-03-11,1.56,22128.42,2162.67,3194.25,8.93,16762.57,16510.32,252.25,0.0,organic,2018,WestTexNewMexico,1.56


Our goal is to predict `AveragePriceNextWeek`. 

Let's split the data:

In [16]:
df_train = df_hastarget[df_hastarget["Date"] <= split_date]
df_test  = df_hastarget[df_hastarget["Date"] >  split_date]

<br><br>

<!-- BEGIN QUESTION -->

### 1.4 `AveragePrice` baseline 
rubric={points:4}

Soon we will want to build some models to forecast the average avocado price a week in advance. Before we start with any ML though, let's try a baseline. Previously we used `DummyClassifier` or `DummyRegressor` as a baseline. This time, we'll do something else as a baseline: we'll assume the price stays the same from this week to next week. So, we'll set our prediction of "AveragePriceNextWeek" exactly equal to "AveragePrice", assuming no change. That is kind of like saying, "If it's raining today then I'm guessing it will be raining tomorrow". This simplistic approach will not get a great score but it's a good starting point for reference. If our model does worse than this, it must not be very good.

Using this baseline approach, what $R^2$ do you get on the train and test data?

<div class="alert alert-warning">

Solution_1.4
    
</div>

_Points:_ 4

We get R^2 = 0.8286 for train data, and 0.7632 for test data.
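
To make the baseline concrete, here is a minimal sketch of a "no change" forecast on a short made-up price series (the `prices` array is an assumption for illustration). It also recomputes $R^2 = 1 - SS_{res}/SS_{tot}$ by hand to show what `r2_score` measures.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up weekly prices for a single series.
prices = np.array([1.0, 1.2, 1.1, 1.3, 1.4, 1.2])

y_true = prices[1:]   # next week's price (the target)
y_pred = prices[:-1]  # this week's price, reused as the forecast

baseline_r2 = r2_score(y_true, y_pred)

# Same quantity computed from its definition.
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
manual_r2 = 1 - ss_res / ss_tot

print(baseline_r2, manual_r2)
```

Because $R^2$ compares against predicting the mean of the evaluated targets, a persistence baseline can score well whenever prices change slowly from week to week.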

In [17]:
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
# that arises when writing into a slice of df_hastarget.
df_train = df_train.assign(BaselinePrediction=df_train['AveragePrice'])
df_test = df_test.assign(BaselinePrediction=df_test['AveragePrice'])


In [18]:
train_r2 = r2_score(df_train['AveragePriceNextWeek'], df_train['BaselinePrediction'])

In [19]:
train_r2

0.8285800937261841

In [20]:
test_r2 = r2_score(df_test['AveragePriceNextWeek'], df_test['BaselinePrediction'])

In [21]:
test_r2

0.7631780188583048

In [22]:
assert not train_r2 is None, "Are you using the correct variable name?"
assert not test_r2 is None, "Are you using the correct variable name?"
assert sha1(str(round(train_r2, 3)).encode('utf8')).hexdigest() == 'b1136fe2a8918904393ab6f40bfb3f38eac5fc39', "Your training score is not correct. Are you using the right features?"
assert sha1(str(round(test_r2, 3)).encode('utf8')).hexdigest() == 'cc24d9a9b567b491a56b42f7adc582f2eefa5907', "Your test score is not correct. Are you using the right features?"

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

### 1.5 Forecasting average avocado price
rubric={points:10}

Now that the baseline is done, let's build some models to forecast the average avocado price a week later. Experiment with a few approaches for encoding the date. Justify the decisions you make. Which approach worked best? Report your test score and briefly discuss your results.

Benchmark: you should be able to achieve $R^2$ of at least 0.79 on the test set. I got to 0.80, but not beyond that. Let me know if you do better!

Note: because we only have 2 splits here, we need to be a bit wary of overfitting on the test set. Try not to test on it a ridiculous number of times. If you are interested in some proper ways of dealing with this, see for example sklearn's [TimeSeriesSplit](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html), which is like cross-validation for time series data.
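
As a rough illustration of how `TimeSeriesSplit` differs from ordinary cross-validation, the sketch below prints the folds for eight time-ordered observations (the array `X` is a made-up placeholder). Each fold trains only on observations that come before its validation block.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Eight time-ordered observations (placeholder values).
X = np.arange(8).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
folds = [(train_idx.tolist(), test_idx.tolist())
         for train_idx, test_idx in tscv.split(X)]

for train_idx, test_idx in folds:
    print("train:", train_idx, "validate:", test_idx)
```

Note that this assumes the rows are sorted by time; with several interleaved series (as in the avocado data) the rows would need to be ordered by date first, or the splits built from date cutoffs instead.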

<div class="alert alert-warning">

Solution_1.5
    
</div>

_Points:_ 10

- Method 1 (Basic Date Components): test $R^2$ = 0.7702
- Method 2 (Ordinal Encoding): test $R^2$ = 0.7098
- Method 3 (Cyclical Encoding): test $R^2$ = 0.7732
- Method 4 (Combined Encoding): test $R^2$ = 0.8350

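Method 4 (Combined Encoding) produced the highest test $R^2$, 0.8350. It uses AveragePrice, Total Volume, and all of the date encodings (Year, Month, Day, OrdinalDate, Month_sin, Month_cos). Combining the encodings likely helps because the model sees both the long-term trend (through the basic and ordinal components) and the seasonal pattern (through the cyclical encoding). However, Method 4 is evaluated on a random 80/20 `train_test_split` rather than the date-based split used for Methods 1 to 3, and it leaves out `type`, `region`, and the remaining volume features, so its score is not directly comparable: with a random split, the model trains on weeks that come after some of the test weeks.

All methods reach training scores of about 0.98, well above their test scores, which suggests some overfitting.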

In [23]:
df_hastarget.info()

<class 'pandas.core.frame.DataFrame'>
Index: 18141 entries, 0 to 18247
Data columns (total 14 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   Date                  18141 non-null  datetime64[ns]
 1   AveragePrice          18141 non-null  float64       
 2   Total Volume          18141 non-null  float64       
 3   4046                  18141 non-null  float64       
 4   4225                  18141 non-null  float64       
 5   4770                  18141 non-null  float64       
 6   Total Bags            18141 non-null  float64       
 7   Small Bags            18141 non-null  float64       
 8   Large Bags            18141 non-null  float64       
 9   XLarge Bags           18141 non-null  float64       
 10  type                  18141 non-null  object        
 11  year                  18141 non-null  int64         
 12  region                18141 non-null  object        
 13  AveragePriceNextWeek 

In [24]:
# Method 1: Basic date components
# Add date features
df_train_dated = df_train.copy()
df_test_dated = df_test.copy()

df_train_dated['Year'] = df_train_dated['Date'].dt.year
df_train_dated['Month'] = df_train_dated['Date'].dt.month
df_train_dated['Day'] = df_train_dated['Date'].dt.day

df_test_dated['Year'] = df_test_dated['Date'].dt.year
df_test_dated['Month'] = df_test_dated['Date'].dt.month
df_test_dated['Day'] = df_test_dated['Date'].dt.day

base_features = ['AveragePrice', 'Total Volume', '4046', '4225', '4770', 
                'Total Bags', 'Small Bags', 'Large Bags', 'XLarge Bags', 'type', 'region']

features = base_features + ['Year', 'Month', 'Day']
X_train = df_train_dated[features]
X_test = df_test_dated[features]

X_train = pd.get_dummies(X_train, columns=['type', 'region'])
X_test = pd.get_dummies(X_test, columns=['type', 'region'])

# Align columns
common_cols = set(X_train.columns) & set(X_test.columns)
X_train = X_train[list(common_cols)]
X_test = X_test[list(common_cols)]

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train_scaled, df_train_dated['AveragePriceNextWeek'])

train_r2 = r2_score(df_train_dated['AveragePriceNextWeek'], model.predict(X_train_scaled))
test_r2 = r2_score(df_test_dated['AveragePriceNextWeek'], model.predict(X_test_scaled))

print("Basic Date Components Results:")
print(f"Train R² Score: {train_r2:.4f}")
print(f"Test R² Score: {test_r2:.4f}")

Basic Date Components Results:
Train R² Score: 0.9793
Test R² Score: 0.7702
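
One caveat about the column alignment step: iterating over a Python `set` gives no guaranteed column order, which may make results vary slightly between runs. A possible alternative, sketched below on made-up frames (`train` and `test` here are assumptions, not the avocado splits), is to reindex the test columns to the training columns.

```python
import pandas as pd

# Made-up train/test frames with partly different categories.
train = pd.DataFrame({"price": [1.0, 1.2, 1.1],
                      "region": ["Albany", "Boston", "Albany"]})
test = pd.DataFrame({"price": [1.3, 1.4],
                     "region": ["Boston", "Chicago"]})

X_train = pd.get_dummies(train, columns=["region"])
X_test = pd.get_dummies(test, columns=["region"])

# Keep exactly the training columns, in the training order.
# Categories unseen in test become all-zero columns; categories
# unseen in train are dropped.
X_test_aligned = X_test.reindex(columns=X_train.columns, fill_value=0)

print(list(X_test_aligned.columns))
```

Another option would be `OneHotEncoder(handle_unknown="ignore")` inside a `ColumnTransformer`, which learns the categories from the training data only.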


In [25]:
# Method 2: Ordinal encoding
# Add date features
df_train_dated = df_train.copy()
df_test_dated = df_test.copy()

df_train_dated['OrdinalDate'] = (df_train_dated['Date'] - df_train_dated['Date'].min()).dt.days
df_test_dated['OrdinalDate'] = (df_test_dated['Date'] - df_train_dated['Date'].min()).dt.days

features = base_features + ['OrdinalDate']
X_train = df_train_dated[features]
X_test = df_test_dated[features]

X_train = pd.get_dummies(X_train, columns=['type', 'region'])
X_test = pd.get_dummies(X_test, columns=['type', 'region'])

# Align columns
common_cols = set(X_train.columns) & set(X_test.columns)
X_train = X_train[list(common_cols)]
X_test = X_test[list(common_cols)]

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train_scaled, df_train_dated['AveragePriceNextWeek'])

train_r2 = r2_score(df_train_dated['AveragePriceNextWeek'], model.predict(X_train_scaled))
test_r2 = r2_score(df_test_dated['AveragePriceNextWeek'], model.predict(X_test_scaled))

print("Ordinal Date Encoding Results:")
print(f"Train R² Score: {train_r2:.4f}")
print(f"Test R² Score: {test_r2:.4f}")

Ordinal Date Encoding Results:
Train R² Score: 0.9793
Test R² Score: 0.7098


In [26]:
# Method 3: Cyclical encoding with lagged dataset
df_train = df_hastarget[df_hastarget["Date"] <= split_date]
df_test = df_hastarget[df_hastarget["Date"] > split_date]

# Add date features
df_train_dated = df_train.copy()
df_test_dated = df_test.copy()

# Use a loop variable other than `df` so the original dataframe is not overwritten
for frame in [df_train_dated, df_test_dated]:
    frame['Year'] = frame['Date'].dt.year
    frame['Month'] = frame['Date'].dt.month
    frame['Month_sin'] = np.sin(2 * np.pi * frame['Month'] / 12)
    frame['Month_cos'] = np.cos(2 * np.pi * frame['Month'] / 12)
    
    frame['Week'] = frame['Date'].dt.isocalendar().week
    frame['Week_sin'] = np.sin(2 * np.pi * frame['Week'] / 52)
    frame['Week_cos'] = np.cos(2 * np.pi * frame['Week'] / 52)

features = ['AveragePrice', 'Total Volume', 
           'Total Bags', 'Small Bags', 'Large Bags', 'XLarge Bags',
           'Year', 'Month_sin', 'Month_cos', 'Week_sin', 'Week_cos',
           'type', 'region']

X_train = df_train_dated[features]
X_test = df_test_dated[features]

X_train = pd.get_dummies(X_train, columns=['type', 'region'])
X_test = pd.get_dummies(X_test, columns=['type', 'region'])

common_cols = set(X_train.columns) & set(X_test.columns)
X_train = X_train[list(common_cols)]
X_test = X_test[list(common_cols)]

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train_scaled, df_train_dated['AveragePriceNextWeek'])

train_r2 = r2_score(df_train_dated['AveragePriceNextWeek'], model.predict(X_train_scaled))
test_r2 = r2_score(df_test_dated['AveragePriceNextWeek'], model.predict(X_test_scaled))

print("Cyclical Encoding Results with Lagged Dataset:")
print(f"Train R² Score: {train_r2:.4f}")
print(f"Test R² Score: {test_r2:.4f}")

Cyclical Encoding Results with Lagged Dataset:
Train R² Score: 0.9795
Test R² Score: 0.7732


In [27]:
#Method 4: Combined Encoding

from sklearn.model_selection import train_test_split
# Add new features
df_hastarget['Year'] = df_hastarget['Date'].dt.year
df_hastarget['Month'] = df_hastarget['Date'].dt.month
df_hastarget['Day'] = df_hastarget['Date'].dt.day
df_hastarget['OrdinalDate'] = (df_hastarget['Date'] - df_hastarget['Date'].min()).dt.days

# Cyclic encoding
df_hastarget['Month_sin'] = np.sin(2 * np.pi * df_hastarget['Month'] / 12)
df_hastarget['Month_cos'] = np.cos(2 * np.pi * df_hastarget['Month'] / 12)

# Train-test split
X = df_hastarget[['AveragePrice', 'Total Volume', 'Year', 'Month', 'Day', 'OrdinalDate', 'Month_sin', 'Month_cos']]
y = df_hastarget['AveragePriceNextWeek']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestRegressor(random_state=42)
model.fit(X_train, y_train)

# Predictions
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

# Evaluation
train_r2 = r2_score(y_train, y_train_pred)
test_r2 = r2_score(y_test, y_test_pred)

print("Combined Encoding Results with Lagged Dataset:")
print(f"Train R^2 Score: {train_r2}")
print(f"Test R^2 Score: {test_r2}")

Combined Encoding Results with Lagged Dataset:
Train R^2 Score: 0.9778879870981543
Test R^2 Score: 0.8349730055479765
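
The cell above splits the rows at random, whereas the earlier methods split by date. A sketch of how a combined encoding could instead be evaluated on a date-based split is shown below. It runs on a synthetic single-series frame (the `toy` data, its seasonal shape, and the reduced feature list are all assumptions), so its score says nothing about the avocado data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# Synthetic weekly prices with a seasonal component (made up).
rng = np.random.default_rng(0)
dates = pd.date_range("2015-01-04", periods=160, freq="7D")
month = dates.month.to_numpy()
toy = pd.DataFrame({
    "Date": dates,
    "AveragePrice": 1.3 + 0.2 * np.sin(2 * np.pi * month / 12)
                    + rng.normal(0, 0.05, len(dates)),
})

# Combined date encoding: ordinal trend plus cyclical month.
toy["OrdinalDate"] = (toy["Date"] - toy["Date"].min()).dt.days
toy["Month_sin"] = np.sin(2 * np.pi * month / 12)
toy["Month_cos"] = np.cos(2 * np.pi * month / 12)

# Target: next week's price; the final row has no target.
toy["AveragePriceNextWeek"] = toy["AveragePrice"].shift(-1)
toy = toy.dropna(subset=["AveragePriceNextWeek"]).reset_index(drop=True)

# Date-based split: everything after split_date is held out.
split_date = "20170925"
train = toy[toy["Date"] <= split_date]
test = toy[toy["Date"] > split_date]

features = ["AveragePrice", "OrdinalDate", "Month_sin", "Month_cos"]
model = RandomForestRegressor(n_estimators=50, random_state=42)
model.fit(train[features], train["AveragePriceNextWeek"])

test_r2 = r2_score(test["AveragePriceNextWeek"], model.predict(test[features]))
print(f"Test R^2 on the date-based split: {test_r2:.4f}")
```

With this kind of split every training row precedes every test row, which matches how a forecast would be used in practice.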


<!-- END QUESTION -->

<br><br><br><br>

## Exercise 2: Short answer questions

<!-- BEGIN QUESTION -->

### 2.1 Time series

rubric={points:6}

The following questions pertain to Lecture 20 on time series data:

1. Sometimes a time series has missing time points or, worse, time points that are unequally spaced in general. Give an example of a real world situation where the time series data would have unequally spaced time points.
2. In class we discussed two approaches to using temporal information: encoding the date as one or more features, and creating lagged versions of features. Which of these (one/other/both/neither) two approaches would struggle with unequally spaced time points? Briefly justify your answer.
3. When studying time series modeling, we explored several ways to encode date information as a feature for the citibike dataset. When we used time of day as a numeric feature, the Ridge model was not able to capture the periodic pattern. Why? How did we tackle this problem? Briefly explain.

<div class="alert alert-warning">

Solution_2.1
    
</div>

_Points:_ 6

1. A real-world example of unequally spaced time points is earthquake data. Earthquakes do not occur at regular intervals; instead, they happen sporadically based on natural geophysical processes. Each recorded event has a specific timestamp, but the time gap between events can vary significantly, from seconds to days, months, or even years.
2. Lagged features would struggle more with unequally spaced time points because they assume a consistent interval between observations (e.g., daily, hourly). When time points are unevenly spaced, a lag value may no longer represent a fixed temporal distance (e.g., "1 lag" could correspond to 1 day in one case and 3 days in another), leading to inconsistencies that can confuse the model and result in incorrect assumptions about temporal relationships. In contrast, encoding the date as features (e.g., year, month, day, or cyclic encodings like sine/cosine for seasonal patterns) is less affected, as it captures information about the time point itself without relying on fixed temporal gaps, making it more robust to unequal spacing in the data.
3. When "time of day" was used as a numeric feature (e.g., representing 0 to 23 for hours), the Ridge regression model treated it as a linear numeric variable. However, time of day is a cyclical variable (e.g., 23:00 is closer to 00:00 than 12:00). Linear models like Ridge fail to capture this periodic relationship, resulting in poor performance when modeling patterns like daily bike usage, which is inherently cyclical. To address this, we encoded the time of day cyclically using sine and cosine transformations. Sine and cosine encode the cyclical nature of time (e.g., midnight and 23:59 are mathematically close). These features allow Ridge to capture periodic patterns effectively.
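
The cyclical encoding mentioned in answer 3 can be sketched in a few lines. This is an illustration only (the helper `distance` is a hypothetical name introduced here): it maps each hour onto the unit circle and compares distances between hours in the encoded space.

```python
import numpy as np

hours = np.arange(24)

# Map each hour to a point on the unit circle.
hour_sin = np.sin(2 * np.pi * hours / 24)
hour_cos = np.cos(2 * np.pi * hours / 24)
encoded = np.column_stack([hour_sin, hour_cos])


def distance(h1, h2):
    """Euclidean distance between two hours in the encoded space."""
    return float(np.linalg.norm(encoded[h1] - encoded[h2]))


# As raw numbers, 23 and 0 are 23 units apart; on the circle they are
# as close as any other pair of adjacent hours.
print(distance(23, 0), distance(0, 1), distance(12, 0))
```

Both the sine and the cosine column are needed: with only one of them, two different hours would share the same value.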

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

### 2.2 Computer vision 
rubric={points:6}

The following questions pertain to Lecture 19 on multiclass classification and introduction to computer vision. 

1. How many parameters (coefficients and intercepts) will `sklearn`’s `LogisticRegression()` model learn for a four-class classification problem, assuming that you have 10 features? Briefly explain your answer.
2. In Lecture 19, we briefly discussed how neural networks are sort of like `sklearn`'s pipelines, in the sense that they involve multiple sequential transformations of the data, finally resulting in the prediction. Why was this property useful when it came to transfer learning?
3. Imagine that you have a small dataset with ~1000 images containing pictures and names of 50 different Computer Science faculty members from UBC. Your goal is to develop a reasonably accurate multi-class classification model for this task. Describe which model/technique you would use and briefly justify your choice in one to three sentences.

<div class="alert alert-warning">

Solution_2.2
    
</div>

_Points:_ 6

1. 44 parameters because we have 10 coefficients (for 10 features) and 1 intercept for each class, so for a four-class classification, we have 11*4 = 44 parameters.
2. Neural networks process data through sequential layers, where early layers learn general features (e.g., edges, textures) and later layers learn task-specific patterns. This structure is useful for transfer learning because we can reuse the general-purpose early layers from a pre-trained model and only retrain the final layers for the new task. This saves computation and reduces the need for large datasets.
3. I would use a pre-trained CNN model like ResNet-50 with transfer learning and fine-tuning because: the dataset is too small for training from scratch, and pre-trained CNN models already have good feature extraction capabilities for images. Therefore, we only need to retrain the final layers to adapt it to our 50 faculty member classes.
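
The parameter count in answer 1 can be checked empirically. The sketch below fits `LogisticRegression` on random data with 10 features and 4 classes (the data is synthetic and carries no signal; only the shapes of the learned arrays matter here).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: 200 examples, 10 features, 4 classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = np.arange(200) % 4

clf = LogisticRegression(max_iter=1000).fit(X, y)

# One row of coefficients and one intercept per class.
n_params = clf.coef_.size + clf.intercept_.size
print(clf.coef_.shape, clf.intercept_.shape, n_params)
```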

<!-- END QUESTION -->

<br><br>

**Before submitting your assignment, please make sure you have followed all the instructions in the Submission instructions section at the top.** 

![](img/eva-well-done.png)