In [1]:
# Initialize Otter
import otter
grader = otter.Notebook("hw8.ipynb")

# CPSC 330 - Applied Machine Learning

## Homework 8: Introduction to Computer Vision and Time Series (Lectures 19 and 20)

**Due date: see the [Calendar](https://htmlpreview.github.io/?https://github.com/UBC-CS/cpsc330/blob/master/docs/calendar.html).**

## Imports

In [2]:
from hashlib import sha1

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.preprocessing import StandardScaler, OrdinalEncoder, OneHotEncoder

from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb

from sklearn.metrics import r2_score

<div class="alert alert-info">
    
## Submission instructions
<hr>
rubric={points:2}

Follow the [homework submission instructions](https://github.com/UBC-CS/cpsc330-2023W1/blob/main/docs/homework_instructions.md). 

**You may work in a group on this homework and submit your assignment as a group.** Below are some instructions on working as a group.  
- The maximum group size is 2. 
- Use group work as an opportunity to collaborate and learn new things from each other. 
- Be respectful to each other and make sure you understand all the concepts in the assignment well. 
- It's your responsibility to make sure that the assignment is submitted by one of the group members before the deadline. 
- You can find the instructions on how to do group submission on Gradescope [here](https://help.gradescope.com/article/m5qz2xsnjy-student-add-group-members).


When you are ready to submit your assignment do the following:

1. Run all cells in your notebook to make sure there are no errors by doing `Kernel -> Restart Kernel and Clear All Outputs` and then `Run -> Run All Cells`. 
2. Notebooks with cell execution numbers out of order or not starting from “1” will have marks deducted. Notebooks without the output displayed may not be graded at all (because we need to see the output in order to grade your work).
3. Upload the assignment using Gradescope's drag and drop tool. Check out this [Gradescope Student Guide](https://lthub.ubc.ca/guides/gradescope-student-guide/) if you need help with Gradescope submission.
4. Make sure that the plots and output are rendered properly in your submitted file. 
5. If the .ipynb file is too big and doesn't render on Gradescope, also upload a PDF or HTML version in addition to the .ipynb.

</div>

<br><br>

## Exercise 1: time series prediction

In this exercise we'll be looking at a [dataset of avocado prices](https://www.kaggle.com/neuromusic/avocado-prices). You should start by downloading the dataset and storing it under the `data` folder. We will be forecasting the average avocado price for the next week.

In [3]:
df = pd.read_csv("data/avocado.csv", parse_dates=["Date"], index_col=0)
df.head()

Unnamed: 0,Date,AveragePrice,Total Volume,4046,4225,4770,Total Bags,Small Bags,Large Bags,XLarge Bags,type,year,region
0,2015-12-27,1.33,64236.62,1036.74,54454.85,48.16,8696.87,8603.62,93.25,0.0,conventional,2015,Albany
1,2015-12-20,1.35,54876.98,674.28,44638.81,58.33,9505.56,9408.07,97.49,0.0,conventional,2015,Albany
2,2015-12-13,0.93,118220.22,794.7,109149.67,130.5,8145.35,8042.21,103.14,0.0,conventional,2015,Albany
3,2015-12-06,1.08,78992.15,1132.0,71976.41,72.58,5811.16,5677.4,133.76,0.0,conventional,2015,Albany
4,2015-11-29,1.28,51039.6,941.48,43838.39,75.78,6183.95,5986.26,197.69,0.0,conventional,2015,Albany


In [4]:
df.shape

(18249, 13)

In [5]:
df["Date"].min()

Timestamp('2015-01-04 00:00:00')

In [6]:
df["Date"].max()

Timestamp('2018-03-25 00:00:00')

It looks like the data ranges from the start of 2015 to March 2018 (~2 years ago), for a total of about 3.25 years. Let's split the data so that we have 6 months of test data.

In [7]:
split_date = '20170925'
df_train = df[df["Date"] <= split_date]
df_test  = df[df["Date"] >  split_date]

In [8]:
assert len(df_train) + len(df_test) == len(df)

<br><br>

<!-- BEGIN QUESTION -->

### 1.1 How many time series? 
rubric={points:4}

In the [Rain in Australia](https://www.kaggle.com/datasets/jsphyg/weather-dataset-rattle-package) dataset from lecture demo, we had different measurements for each Location. 

We want you to consider this for the avocado prices dataset. For which categorical feature(s), if any, do we have separate measurements? Justify your answer by referencing the dataset.

<div class="alert alert-warning">

Solution_1.1
    
</div>

_Points:_ 4

In [9]:
df.sort_values(by=["region", "Date"])

Unnamed: 0,Date,AveragePrice,Total Volume,4046,4225,4770,Total Bags,Small Bags,Large Bags,XLarge Bags,type,year,region
51,2015-01-04,1.22,40873.28,2819.50,28287.42,49.90,9716.46,9186.93,529.53,0.00,conventional,2015,Albany
51,2015-01-04,1.79,1373.95,57.42,153.88,0.00,1162.65,1162.65,0.00,0.00,organic,2015,Albany
50,2015-01-11,1.24,41195.08,1002.85,31640.34,127.12,8424.77,8036.04,388.73,0.00,conventional,2015,Albany
50,2015-01-11,1.77,1182.56,39.00,305.12,0.00,838.44,838.44,0.00,0.00,organic,2015,Albany
49,2015-01-18,1.17,44511.28,914.14,31540.32,135.77,11921.05,11651.09,269.96,0.00,conventional,2015,Albany
...,...,...,...,...,...,...,...,...,...,...,...,...,...
2,2018-03-11,1.56,22128.42,2162.67,3194.25,8.93,16762.57,16510.32,252.25,0.00,organic,2018,WestTexNewMexico
1,2018-03-18,0.88,855251.17,457635.79,137597.04,8422.08,251596.26,151191.85,98535.60,1868.81,conventional,2018,WestTexNewMexico
1,2018-03-18,1.56,15896.38,2055.35,1499.55,0.00,12341.48,12114.81,226.67,0.00,organic,2018,WestTexNewMexico
0,2018-03-25,0.84,965185.06,438526.12,199585.90,11017.42,316055.62,153009.89,160999.10,2046.63,conventional,2018,WestTexNewMexico


In the avocado prices dataset, separate measurements are made for:

1. **`region`**: Data like `AveragePrice` and `Total Volume` is recorded separately for each geographical region. For example:
   - On `2015-01-04`, in `Albany`, the `AveragePrice` is `1.22` for `conventional` avocados and `1.79` for `organic` avocados.
   - On `2018-03-18`, in `WestTexNewMexico`, the `AveragePrice` is `0.88` for `conventional` avocados and `1.56` for `organic` avocados.
     

2. **`type`**: Measurements are split into `conventional` and `organic` avocados. For the same `Date` and `region`, there are distinct rows for each `type`. For example:
   - On `2015-01-04` in `Albany`, the `Total Volume` is `40873.28` for `conventional` avocados and `1373.95` for `organic` avocados.

This separation is evident from rows in the dataset where the same `Date` in a `region` has different values for `type`, and distinct metrics across `region`.
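
One way to make this concrete is to count the distinct (`region`, `type`) combinations, since each combination forms its own weekly series. The sketch below uses a small made-up frame (`toy` is hypothetical, not the avocado data); applying the same `groupby` calls to `df` would count the series in the real dataset.

```python
import pandas as pd

# Hypothetical toy frame that mimics the layout of the avocado data.
toy = pd.DataFrame(
    {
        "Date": pd.to_datetime(["2015-01-04", "2015-01-11"] * 4),
        "region": ["Albany"] * 4 + ["Boise"] * 4,
        "type": ["conventional", "conventional", "organic", "organic"] * 2,
        "AveragePrice": [1.22, 1.24, 1.79, 1.77, 1.10, 1.12, 1.60, 1.65],
    }
)

# Each (region, type) pair is one time series.
n_series = toy.groupby(["region", "type"]).ngroups

# If the grouping is right, a date appears at most once within each series.
max_rows_per_date = toy.groupby(["region", "type", "Date"]).size().max()

print(n_series, max_rows_per_date)
```

If `max_rows_per_date` came out larger than 1, some other column would still be splitting the measurements.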


<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

### 1.2 Equally spaced measurements? 
rubric={points:4}

In the Rain in Australia dataset, the measurements were generally equally spaced but with some exceptions. How about with this dataset? Justify your answer by referencing the dataset.

<div class="alert alert-warning">

Solution_1.2
    
</div>

_Points:_ 4

In [11]:
count = 0 
for name, group in df.groupby(['region', 'type']):
    print("%-30s %s" % (name, group["Date"].sort_values().diff().value_counts()))
    print('\n\n\n')
    count+=1
    if count == 8: 
        break
# code adapted from lec notes

('Albany', 'conventional')     Date
7 days    168
Name: count, dtype: int64




('Albany', 'organic')          Date
7 days    168
Name: count, dtype: int64




('Atlanta', 'conventional')    Date
7 days    168
Name: count, dtype: int64




('Atlanta', 'organic')         Date
7 days    168
Name: count, dtype: int64




('BaltimoreWashington', 'conventional') Date
7 days    168
Name: count, dtype: int64




('BaltimoreWashington', 'organic') Date
7 days    168
Name: count, dtype: int64




('Boise', 'conventional')      Date
7 days    168
Name: count, dtype: int64




('Boise', 'organic')           Date
7 days    168
Name: count, dtype: int64






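Yes, within each time series the measurements are equally spaced. For every (`region`, `type`) group printed above, all 168 differences between consecutive dates are exactly 7 days, so the data is recorded weekly with no gaps. Note that the output only covers the first eight groups; the remaining groups would need the same check before concluding that the whole dataset is equally spaced.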
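
To check every group rather than only the first few, the gaps can be summarised per series. This is a minimal sketch on a made-up frame (`toy` and the helper `gap_summary` are hypothetical names, not part of the assignment); it assumes the same `region`, `type` and `Date` columns as the avocado data.

```python
import pandas as pd

# Hypothetical toy frame: two series, one of which skips a week.
toy = pd.DataFrame(
    {
        "Date": pd.to_datetime(
            ["2015-01-04", "2015-01-11", "2015-01-18",   # regular series
             "2015-01-04", "2015-01-11", "2015-01-25"]   # one week skipped
        ),
        "region": ["Albany"] * 3 + ["Boise"] * 3,
        "type": ["organic"] * 6,
    }
)

def gap_summary(frame, keys=("region", "type"), date_col="Date"):
    """Return the smallest and largest gap between consecutive dates per group."""
    ordered = frame.sort_values([*keys, date_col])
    # diff() is NaT for the first row of each group; min/max skip NaT.
    gaps = ordered.groupby(list(keys))[date_col].diff()
    return ordered.assign(gap=gaps).groupby(list(keys))["gap"].agg(["min", "max"])

summary = gap_summary(toy)
irregular = summary[summary["min"] != summary["max"]]
print(summary)
print("Irregular series:", list(irregular.index))
```

A series is equally spaced exactly when its smallest and largest gaps coincide, so an empty `irregular` frame would mean every series has uniform spacing.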

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

### 1.3 Interpreting regions 
rubric={points:4}

In the Rain in Australia dataset, each location was a different place in Australia. For this dataset, look at the names of the regions. Do you think the regions are also all distinct, or are there overlapping regions? Justify your answer by referencing the data.

<div class="alert alert-warning">

Solution_1.3
    
</div>

_Points:_ 4

In [17]:
region_column = "region"

unique_regions = df[region_column].unique()
num_unique_regions = len(unique_regions)
region_counts = df[region_column].value_counts()

print(f"Total number of unique regions: {num_unique_regions}")
print("\nRegion counts:")
print(region_counts)

Total number of unique regions: 54

Region counts:
region
Albany                 338
Sacramento             338
Northeast              338
NorthernNewEngland     338
Orlando                338
Philadelphia           338
PhoenixTucson          338
Pittsburgh             338
Plains                 338
Portland               338
RaleighGreensboro      338
RichmondNorfolk        338
Roanoke                338
SanDiego               338
Atlanta                338
SanFrancisco           338
Seattle                338
SouthCarolina          338
SouthCentral           338
Southeast              338
Spokane                338
StLouis                338
Syracuse               338
Tampa                  338
TotalUS                338
West                   338
NewYork                338
NewOrleansMobile       338
Nashville              338
Midsouth               338
BaltimoreWashington    338
Boise                  338
Boston                 338
BuffaloRochester       338
California             3

While the dataset has 54 unique regions, some of these names (like California and SanFrancisco) indicate overlapping geographical locations. The names of regions like Southeast, Northeast, and TotalUS further suggest that the regions could span multiple specific locations, pointing to potential overlaps. Therefore, while the regions are technically distinct in name, many of them likely represent overlapping geographical areas.

<!-- END QUESTION -->

<br><br>

We will use the entire dataset despite any location-based weirdness uncovered in the previous part.

We will be trying to forecast the avocado price. The function below is adapted from [Lecture 19](https://github.com/UBC-CS/cpsc330-2023W1/tree/main/lectures), with some improvements.

In [19]:
def create_lag_feature(df, orig_feature, lag, groupby, new_feature_name=None, clip=False):
    """
    Creates a new feature that's a lagged version of an existing one.
    
    NOTE: assumes df is already sorted by the time columns and has unique indices.
    
    Parameters
    ----------
    df : pandas.core.frame.DataFrame
        The dataset.
    orig_feature : str
        The column name of the feature we're copying
    lag : int
        The lag; negative lag means values from the past, positive lag means values from the future
    groupby : list
        Column(s) to group by in case df contains multiple time series
    new_feature_name : str
        Override the default name of the newly created column
    clip : bool
        If True, remove rows with NaN values for the new feature
    
    Returns
    -------
    pandas.core.frame.DataFrame
        A new dataframe with the additional column added.
        
    """
        
    if new_feature_name is None:
        if lag < 0:
            new_feature_name = "%s_lag%d" % (orig_feature, -lag)
        else:
            new_feature_name = "%s_ahead%d" % (orig_feature, lag)
    
    new_df = df.assign(**{new_feature_name : np.nan})
    for name, group in new_df.groupby(groupby):        
        if lag < 0: # take values from the past
            new_df.loc[group.index[-lag:],new_feature_name] = group.iloc[:lag][orig_feature].values
        else:       # take values from the future
            new_df.loc[group.index[:-lag], new_feature_name] = group.iloc[lag:][orig_feature].values
            
    if clip:
        new_df = new_df.dropna(subset=[new_feature_name])
        
    return new_df

We first sort our dataframe properly:

In [20]:
df_sort = df.sort_values(by=["region", "type", "Date"]).reset_index(drop=True)
df_sort

Unnamed: 0,Date,AveragePrice,Total Volume,4046,4225,4770,Total Bags,Small Bags,Large Bags,XLarge Bags,type,year,region
0,2015-01-04,1.22,40873.28,2819.50,28287.42,49.90,9716.46,9186.93,529.53,0.0,conventional,2015,Albany
1,2015-01-11,1.24,41195.08,1002.85,31640.34,127.12,8424.77,8036.04,388.73,0.0,conventional,2015,Albany
2,2015-01-18,1.17,44511.28,914.14,31540.32,135.77,11921.05,11651.09,269.96,0.0,conventional,2015,Albany
3,2015-01-25,1.06,45147.50,941.38,33196.16,164.14,10845.82,10103.35,742.47,0.0,conventional,2015,Albany
4,2015-02-01,0.99,70873.60,1353.90,60017.20,179.32,9323.18,9170.82,152.36,0.0,conventional,2015,Albany
...,...,...,...,...,...,...,...,...,...,...,...,...,...
18244,2018-02-25,1.57,18421.24,1974.26,2482.65,0.00,13964.33,13698.27,266.06,0.0,organic,2018,WestTexNewMexico
18245,2018-03-04,1.54,17393.30,1832.24,1905.57,0.00,13655.49,13401.93,253.56,0.0,organic,2018,WestTexNewMexico
18246,2018-03-11,1.56,22128.42,2162.67,3194.25,8.93,16762.57,16510.32,252.25,0.0,organic,2018,WestTexNewMexico
18247,2018-03-18,1.56,15896.38,2055.35,1499.55,0.00,12341.48,12114.81,226.67,0.0,organic,2018,WestTexNewMexico


We then call `create_lag_feature`. This creates a new column in the dataset `AveragePriceNextWeek`, which is the following week's `AveragePrice`. We have set `clip=True` which means it will remove rows where the target would be missing.
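
As a cross-check on what this call produces, the same kind of next-week column can be built with pandas' `groupby(...).shift(-1)`. The sketch below is not the course's function; it runs on a made-up frame (`toy` is hypothetical) that is already sorted by group and date.

```python
import pandas as pd

# Hypothetical toy frame, already sorted by group and date.
toy = pd.DataFrame(
    {
        "region": ["Albany"] * 3 + ["Boise"] * 3,
        "type": ["organic"] * 6,
        "Date": pd.to_datetime(["2015-01-04", "2015-01-11", "2015-01-18"] * 2),
        "AveragePrice": [1.79, 1.77, 1.75, 1.10, 1.12, 1.15],
    }
)

# shift(-1) within each group pulls the next row's value up by one step,
# so the last row of every series has no target (NaN).
toy["AveragePriceNextWeek"] = toy.groupby(["region", "type"])["AveragePrice"].shift(-1)

# Equivalent of clip=True: drop rows whose target is missing.
toy_hastarget = toy.dropna(subset=["AveragePriceNextWeek"])
print(toy_hastarget)
```

Grouping before shifting matters: without it, the last Albany row would receive the first Boise price as its target.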

In [21]:
df_hastarget = create_lag_feature(df_sort, "AveragePrice", +1, ["region", "type"], "AveragePriceNextWeek", clip=True)
df_hastarget

Unnamed: 0,Date,AveragePrice,Total Volume,4046,4225,4770,Total Bags,Small Bags,Large Bags,XLarge Bags,type,year,region,AveragePriceNextWeek
0,2015-01-04,1.22,40873.28,2819.50,28287.42,49.90,9716.46,9186.93,529.53,0.0,conventional,2015,Albany,1.24
1,2015-01-11,1.24,41195.08,1002.85,31640.34,127.12,8424.77,8036.04,388.73,0.0,conventional,2015,Albany,1.17
2,2015-01-18,1.17,44511.28,914.14,31540.32,135.77,11921.05,11651.09,269.96,0.0,conventional,2015,Albany,1.06
3,2015-01-25,1.06,45147.50,941.38,33196.16,164.14,10845.82,10103.35,742.47,0.0,conventional,2015,Albany,0.99
4,2015-02-01,0.99,70873.60,1353.90,60017.20,179.32,9323.18,9170.82,152.36,0.0,conventional,2015,Albany,0.99
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18243,2018-02-18,1.56,17597.12,1892.05,1928.36,0.00,13776.71,13553.53,223.18,0.0,organic,2018,WestTexNewMexico,1.57
18244,2018-02-25,1.57,18421.24,1974.26,2482.65,0.00,13964.33,13698.27,266.06,0.0,organic,2018,WestTexNewMexico,1.54
18245,2018-03-04,1.54,17393.30,1832.24,1905.57,0.00,13655.49,13401.93,253.56,0.0,organic,2018,WestTexNewMexico,1.56
18246,2018-03-11,1.56,22128.42,2162.67,3194.25,8.93,16762.57,16510.32,252.25,0.0,organic,2018,WestTexNewMexico,1.56


Our goal is to predict `AveragePriceNextWeek`. 

Let's split the data:

In [22]:
df_train = df_hastarget[df_hastarget["Date"] <= split_date]
df_test  = df_hastarget[df_hastarget["Date"] >  split_date]

<br><br>

<!-- BEGIN QUESTION -->

### 1.4 `AveragePrice` baseline 
rubric={points:4}

Soon we will want to build some models to forecast the average avocado price a week in advance. Before we start with any ML though, let's try a baseline. Previously we used `DummyClassifier` or `DummyRegressor` as a baseline. This time, we'll do something else as a baseline: we'll assume the price stays the same from this week to next week. So, we'll set our prediction of `AveragePriceNextWeek` exactly equal to `AveragePrice`, assuming no change. That is kind of like saying, "If it's raining today then I'm guessing it will be raining tomorrow". This simplistic approach will not get a great score but it's a good starting point for reference. If our model does worse than this, it must not be very good.

Using this baseline approach, what $R^2$ do you get on the train and test data?

<div class="alert alert-warning">

Solution_1.4
    
</div>

_Points:_ 4

In [23]:
# Baseline: predict that next week's price equals this week's price.
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning.
df_train = df_train.assign(predicted_AveragePriceNextWeek=df_train["AveragePrice"])
df_test = df_test.assign(predicted_AveragePriceNextWeek=df_test["AveragePrice"])

train_r2 = r2_score(df_train["AveragePriceNextWeek"], df_train["predicted_AveragePriceNextWeek"])

In [24]:
test_r2 = r2_score(df_test['AveragePriceNextWeek'], df_test['predicted_AveragePriceNextWeek'])
print("train r2=", train_r2, "test r2 =", test_r2)

train r2= 0.8285800937261841 test r2 = 0.7631780188583048

With this baseline, the train $R^2$ is about 0.829 and the test $R^2$ is about 0.763.
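
For reference, $R^2$ compares the squared error of the predictions with the squared error of always predicting the mean of the targets: $R^2 = 1 - SS_{res} / SS_{tot}$. The sketch below computes this by hand for a persistence baseline and compares it with `r2_score`; the `prices` array is a made-up single series, not the avocado data.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical weekly prices for a single series.
prices = np.array([1.22, 1.24, 1.17, 1.06, 0.99, 0.99, 1.10])

y_true = prices[1:]    # next week's price
y_pred = prices[:-1]   # persistence baseline: this week's price

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_pred))
```

Because consecutive weekly prices tend to be close, `ss_res` is small relative to `ss_tot`, which is why a persistence baseline can score well on slowly changing series.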

In [26]:
assert not train_r2 is None, "Are you using the correct variable name?"
assert not test_r2 is None, "Are you using the correct variable name?"
assert sha1(str(round(train_r2, 3)).encode('utf8')).hexdigest() == 'b1136fe2a8918904393ab6f40bfb3f38eac5fc39', "Your training score is not correct. Are you using the right features?"
assert sha1(str(round(test_r2, 3)).encode('utf8')).hexdigest() == 'cc24d9a9b567b491a56b42f7adc582f2eefa5907', "Your test score is not correct. Are you using the right features?"

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

### 1.5 Forecasting average avocado price
rubric={points:10}

Now that the baseline is done, let's build some models to forecast the average avocado price a week later. Experiment with a few approaches for encoding the date. Justify the decisions you make. Which approach worked best? Report your test score and briefly discuss your results.

Benchmark: you should be able to achieve $R^2$ of at least 0.79 on the test set. I got to 0.80, but not beyond that. Let me know if you do better!

Note: because we only have 2 splits here, we need to be a bit wary of overfitting on the test set. Try not to test on it a ridiculous number of times. If you are interested in some proper ways of dealing with this, see for example sklearn's [TimeSeriesSplit](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html), which is like cross-validation for time series data.
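
As a rough illustration of the `TimeSeriesSplit` idea mentioned above, the sketch below splits a made-up, time-ordered array (`X` is hypothetical) into expanding training windows, each followed by a later validation window.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical data: 12 time-ordered observations.
X = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
folds = list(tscv.split(X))
for train_idx, valid_idx in folds:
    # Every training index precedes every validation index.
    print("train:", train_idx, "valid:", valid_idx)
```

`TimeSeriesSplit` splits by row position, so for a dataset with several series (such as one per region and type) the rows would presumably need to be sorted by date first so that no fold trains on later dates than it validates on.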

<div class="alert alert-warning">

Solution_1.5
    
</div>

_Points:_ 10

I experimented with three encodings of the date: the time of day, the day of the week, and the month. Using the month worked best: it gave an $R^2$ of 0.80 on the test set and 0.89 on the train set, compared with 0.77 on the test set and 0.88 on the train set when using the time of day and day of the week. This makes sense because the avocado prices are recorded weekly, on the same day of the week and at the same time. The time-of-day and day-of-week features therefore take a single value across the whole dataset and cannot have any predictive power for the target, whereas the month varies and can capture seasonal changes in price.
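
The point about constant features can be checked directly. This is a small sketch on made-up weekly dates (`dates` is hypothetical, chosen to share the 7-day spacing of the avocado data); it counts the distinct values each extracted feature takes.

```python
import pandas as pd

# Hypothetical weekly dates with the same 7-day spacing as the avocado data.
dates = pd.Series(pd.date_range("2015-01-04", periods=10, freq="7D"))

features = pd.DataFrame(
    {
        "month": dates.dt.month,
        "dayofweek": dates.dt.dayofweek,  # Monday=0 ... Sunday=6
        "hour": dates.dt.hour,
    }
)

# A column with a single distinct value cannot help a model.
print(features.nunique())
```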

In [27]:
# Extract candidate date features.
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning.
# Note: "Week" holds the day of the week (Monday=0 ... Sunday=6), not the week number.
df_train = df_train.assign(
    Month=df_train["Date"].dt.month_name(),
    Week=df_train["Date"].dt.dayofweek,
    Hour=df_train["Date"].dt.hour,
)
df_test = df_test.assign(
    Month=df_test["Date"].dt.month_name(),
    Week=df_test["Date"].dt.dayofweek,
    Hour=df_test["Date"].dt.hour,
)

In [28]:
# Convert the dates to POSIX time in seconds. Assuming nanosecond resolution
# (datetime64[ns]), the int64 values are nanoseconds, so divide by 10**9.
df_train = df_train.assign(Date=df_train["Date"].astype("int64") // 10**9)
df_test = df_test.assign(Date=df_test["Date"].astype("int64") // 10**9)
X_train = df_train.drop(columns=["AveragePriceNextWeek"])
X_test = df_test.drop(columns=["AveragePriceNextWeek"])
y_train = df_train["AveragePriceNextWeek"]
y_test = df_test["AveragePriceNextWeek"]
print(df_test.head())

           Date  AveragePrice  Total Volume     4046      4225   4770  \
143  1506816000          1.69      71205.11  4411.02  57416.25  77.85   
144  1507420800          1.78      55368.61  3679.82  45843.75  42.63   
145  1508025600          1.65      73574.89  3383.35  63355.37  62.45   
146  1508630400          1.56      69704.09  3758.80  57340.30  35.48   
147  1509235200          1.67      69432.23  2959.76  57585.49  57.94   

     Total Bags  Small Bags  Large Bags  XLarge Bags          type  year  \
143     9299.99     5069.66     4230.33          0.0  conventional  2017   
144     5802.41     2148.20     3654.21          0.0  conventional  2017   
145     6773.72     3882.02     2891.70          0.0  conventional  2017   
146     8569.51     5101.64     3467.87          0.0  conventional  2017   
147     8829.04     5050.91     3778.13          0.0  conventional  2017   

     region  AveragePriceNextWeek  predicted_AveragePriceNextWeek    Month  \
143  Albany               


In [32]:
model = xgb.XGBRegressor(n_estimators=1000, max_depth=3, learning_rate=0.05)

numeric_feats = ["AveragePrice", "Total Volume", "Total Bags", "Small Bags",
                 "Large Bags", "XLarge Bags", "year", "4046", "4225", "4770"]
categorical_feats = ["type", "region", "Month"]
drop_feats = ["Date", "Week", "Hour"]  # date encodings not used by the final model

ct = make_column_transformer(
    (make_pipeline(SimpleImputer(), StandardScaler()), numeric_feats),  # impute, then scale numeric features
    (make_pipeline(SimpleImputer(strategy="most_frequent"), OneHotEncoder()), categorical_feats),  # impute, then one-hot encode
    ("drop", drop_feats),  # drop these columns
)

pipe = make_pipeline(ct, model)

# Fit the model on the training data
pipe.fit(X_train, y_train)

# Print R^2 scores for the training and test sets
print("Train-set R^2: {:.2f}".format(pipe.score(X_train, y_train)))
print("Test-set R^2: {:.2f}".format(pipe.score(X_test, y_test)))

# Predict the target for both the training and test sets
y_pred_train = pipe.predict(X_train)
y_pred = pipe.predict(X_test)

Train-set R^2: 0.89
Test-set R^2: 0.80


<!-- END QUESTION -->

<br><br><br><br>

## Exercise 2: Short answer questions

<!-- BEGIN QUESTION -->

### 2.1 Time series

rubric={points:6}

The following questions pertain to Lecture 20 on time series data:

1. Sometimes a time series has missing time points or, worse, time points that are unequally spaced in general. Give an example of a real world situation where the time series data would have unequally spaced time points.
2. In class we discussed two approaches to using temporal information: encoding the date as one or more features, and creating lagged versions of features. Which of these (one/other/both/neither) two approaches would struggle with unequally spaced time points? Briefly justify your answer.
3. When studying time series modeling, we explored several ways to encode date information as a feature for the citibike dataset. When we used time of day as a numeric feature, the Ridge model was not able to capture the periodic pattern. Why? How did we tackle this problem? Briefly explain.

<div class="alert alert-warning">

Solution_2.1
    
</div>

_Points:_ 6

1. A log of the times at which it rains in a particular city over a year. It doesn't rain every day, and on days when it does rain it doesn't rain at the same times, so the time stamps are unequally spaced.
2. Creating lagged versions of features would struggle with unequally spaced time points. A lagged feature takes the value from a fixed number of rows earlier and implicitly assumes that consecutive rows are one regular time step apart. For example, `n_rentals-1` would represent the number of rentals one time step ago; if the time stamps are irregular, "one step ago" corresponds to a different amount of elapsed time in each row, so the feature no longer has a consistent meaning. Encoding the date as features does not have this problem, since each row's features are computed from its own time stamp.
3. The Ridge model could not capture the periodic pattern because a linear model learns a single coefficient for a numeric feature, so its prediction can only steadily increase or decrease with the time of day, whereas the actual pattern is cyclical and therefore non-linear. We tackled this by one-hot encoding the time of day, which gives the model a separate coefficient for each time of day.
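
The third answer can be illustrated with a small experiment. This is a sketch on synthetic data, not the citibike dataset: the target follows a made-up daily cosine cycle, and the two models differ only in how the hour is encoded. The scores are computed on the training data purely to show what each encoding is able to represent.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic (hypothetical) data: a daily cycle observed over 20 days.
rng = np.random.default_rng(0)
hour = np.tile(np.arange(24), 20)
y = 10 + 5 * np.cos(2 * np.pi * hour / 24) + rng.normal(scale=0.5, size=hour.size)
X = hour.reshape(-1, 1)

# Hour as a single numeric feature: Ridge can only fit a straight line.
numeric_model = Ridge().fit(X, y)

# Hour one-hot encoded: Ridge learns a separate coefficient per hour.
onehot_model = make_pipeline(OneHotEncoder(), Ridge()).fit(X, y)

r2_numeric = numeric_model.score(X, y)
r2_onehot = onehot_model.score(X, y)
print(f"numeric hour R^2: {r2_numeric:.2f}, one-hot hour R^2: {r2_onehot:.2f}")
```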

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

### 2.2 Computer vision 
rubric={points:6}

The following questions pertain to Lecture 19 on multiclass classification and introduction to computer vision. 

1. How many parameters (coefficients and intercepts) will `sklearn`’s `LogisticRegression()` model learn for a four-class classification problem, assuming that you have 10 features? Briefly explain your answer.
2. In Lecture 19, we briefly discussed how neural networks are sort of like `sklearn`'s pipelines, in the sense that they involve multiple sequential transformations of the data, finally resulting in the prediction. Why was this property useful when it came to transfer learning?
3. Imagine that you have a small dataset with ~1000 images containing pictures and names of 50 different Computer Science faculty members from UBC. Your goal is to develop a reasonably accurate multi-class classification model for this task. Describe which model/technique you would use and briefly justify your choice in one to three sentences.

<div class="alert alert-warning">

Solution_2.2
    
</div>

_Points:_ 6

1. We would get 44 parameters. This is because with Logistic Regression, we get one coefficient per feature per class, so if we have 10 features that would be 10x4 = 40 coefficients. Then we also get one intercept per class, so if we have 4 classes that would mean 4 intercepts. 40 coefficients + 4 intercepts = 44 parameters.
2. This property was useful because transfer learning takes a pre-trained network and adapts it to a new task. Since the network is a sequence of transformations, its earlier layers can be reused as a fixed preprocessing and feature-extraction step, much like the early steps of an `sklearn` pipeline, and only the final layers that make the prediction need to be replaced or fine-tuned for the new task.
3. I would use transfer learning on a pre-trained convolutional neural network. This is because convolutional neural networks can take in images without flattening them, which will be very useful for our image dataset as flattening the images causes a lot of useful information to be lost, hence reducing the accuracy of the model. I would use transfer learning because it is unrealistic to train an entire convolutional neural network from scratch as it requires a large dataset, very powerful computers, and a huge amount of human effort to train the model. We only have a small dataset here, so it makes sense to download a pre-trained model and just fine tune it for the task at hand. 
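
The parameter count in the first answer can be confirmed empirically. The sketch below fits `LogisticRegression` on synthetic data from `make_classification` (the dataset and its settings are made up for illustration) with 10 features and 4 classes, then inspects the shapes of the learned arrays.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic (hypothetical) data: 10 features, 4 classes.
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=6, n_classes=4, random_state=0
)

lr = LogisticRegression(max_iter=1000).fit(X, y)

n_coefs = lr.coef_.size            # 4 classes x 10 features
n_intercepts = lr.intercept_.size  # one per class
print(lr.coef_.shape, lr.intercept_.shape, n_coefs + n_intercepts)
```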

<!-- END QUESTION -->

<br><br>

**Before submitting your assignment, please make sure you have followed all the instructions in the Submission instructions section at the top.** 

![](img/eva-well-done.png)