# CPSC 330 - Applied Machine Learning 

## Homework 8: Word embeddings, time series
### Associated lectures: Lectures 16, 18, 19

**Due date: Tuesday, June 21, 2022 at 18:00**

## Table of Contents

- [Submission instructions](#sg)
- [Exercise 1 - Exploring pre-trained word embeddings](#1)
- [Exercise 2 - Exploring time series data](#2)
- [Exercise 3 - Short answer questions](#3)
- (Optional) [Exercise 4 - Course take away](#4)

In [71]:
import os

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.cluster import DBSCAN, KMeans
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import (
    GridSearchCV,
    RandomizedSearchCV,
    cross_validate,
    train_test_split,
)
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.metrics import classification_report

pd.set_option("display.max_colwidth", 0)

<br><br><br><br>

## Instructions 
<hr>
rubric={points:1}

Follow the [homework submission instructions](https://github.com/UBC-CS/cpsc330-2022s/blob/master/docs/homework_instructions.md).

**You may work with a partner on this homework and submit your assignment as a group.** Below are some instructions on working as a group.
- The maximum group size is 2.
- Use group work as an opportunity to collaborate and learn new things from each other. 
- Be respectful to each other and make sure you understand all the concepts in the assignment well. 
- It's your responsibility to make sure that the assignment is submitted by one of the group members before the deadline. 
- You can find the instructions on how to do group submission on Gradescope [here](https://help.gradescope.com/article/m5qz2xsnjy-student-add-group-members).

<br><br><br><br>

## Exercise 1:  Exploring pre-trained word embeddings <a name="1"></a>
<hr>

In lecture 16, we talked about natural language processing (NLP). Using pre-trained word embeddings is very common in NLP. It has been shown that pre-trained word embeddings [work well on a variety of text classification tasks](http://www.lrec-conf.org/proceedings/lrec2018/pdf/721.pdf). These embeddings are created by training a model like Word2Vec on a huge corpus of text such as a dump of Wikipedia or a dump of the web crawl. 

A number of pre-trained word embeddings are available out there. Some popular ones are: 

- [GloVe](https://nlp.stanford.edu/projects/glove/)
    * trained using [the GloVe algorithm](https://nlp.stanford.edu/pubs/glove.pdf) 
    * published by Stanford University 
- [fastText pre-trained embeddings for 294 languages](https://fasttext.cc/docs/en/pretrained-vectors.html) 
    * trained using the fastText algorithm
    * published by Facebook
    
In this exercise, you will be exploring GloVe Wikipedia pre-trained embeddings. The code below loads pre-trained word vectors trained on Wikipedia. The vectors are created using an algorithm called GloVe. To run the code, you'll need the `gensim` package in your cpsc330 conda environment, which you can install as follows:

`conda install -n cpsc330 -c anaconda gensim`

In [9]:
import gensim
import gensim.downloader

print("Available models to download:")
print(*gensim.downloader.info()["models"].keys(), sep=", ")

Available models to download:
fasttext-wiki-news-subwords-300, conceptnet-numberbatch-17-06-300, word2vec-ruscorpora-300, word2vec-google-news-300, glove-wiki-gigaword-50, glove-wiki-gigaword-100, glove-wiki-gigaword-200, glove-wiki-gigaword-300, glove-twitter-25, glove-twitter-50, glove-twitter-100, glove-twitter-200, __testing_word2vec-matrix-synopsis


In [10]:
# This will take a while to run when you run it for the first time.
import gensim.downloader as api

glove_wiki_vectors = api.load("glove-wiki-gigaword-100")

In [11]:
len(glove_wiki_vectors)

400000

There are 400,000 word vectors in this pre-trained model.

<br><br>

### 1.1 Word similarity using pre-trained embeddings
rubric={points:4}

Now that we have GloVe Wiki vectors (`glove_wiki_vectors`) loaded, let's explore the word vectors. 

**Your tasks:**

1. Calculate cosine similarity for the following word pairs (`word_pairs`) using the [`similarity`](https://radimrehurek.com/gensim/models/keyedvectors.html?highlight=similarity#gensim.models.keyedvectors.KeyedVectors.similarity) method of the model.
2. Do the similarities make sense? 

In [12]:
word_pairs = [
    ("coast", "shore"),
    ("clothes", "closet"),
    ("old", "new"),
    ("smart", "intelligent"),
    ("dog", "cat"),
    ("tree", "lawyer"),
]

In [14]:
for (pair1, pair2) in word_pairs:
    print("Cosine similarity between %s and %s = %0.3f" % (pair1, pair2, glove_wiki_vectors.similarity(pair1, pair2)))

Cosine similarity between coast and shore = 0.700
Cosine similarity between clothes and closet = 0.546
Cosine similarity between old and new = 0.643
Cosine similarity between smart and intelligent = 0.755
Cosine similarity between dog and cat = 0.880
Cosine similarity between tree and lawyer = 0.077


The similarities here do make sense. The cosine similarity between *tree* and *lawyer* is quite low because these two words rarely appear in similar contexts. The high similarity between *dog* and *cat* arises because, although these words are not synonyms, they are used in very similar contexts. The same applies to *old* and *new*, which are antonyms yet still score 0.643.
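
To make the scores above easier to read, here is a minimal sketch of how cosine similarity can be computed with NumPy. The three vectors are made up for illustration and are not real GloVe values; `cosine_similarity` is a hypothetical helper, not part of `gensim`.

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


# Toy 3-dimensional vectors (illustrative only).
a = [1.0, 2.0, 0.0]
b = [2.0, 4.0, 0.0]  # points in the same direction as `a`
c = [0.0, 0.0, 3.0]  # orthogonal to `a`

print(cosine_similarity(a, b))  # same direction: similarity close to 1
print(cosine_similarity(a, c))  # orthogonal: similarity close to 0
```

Because the score depends only on direction, two words with vectors of very different lengths can still be highly similar.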

<br><br>

### 1.2 Bias in embeddings
rubric={points:10}

**Your tasks:**
1. In Lecture 16 we saw that our pre-trained word embedding model output an analogy that reinforced a gender stereotype. Give an example of how using such a model could cause harm in the real world.
2. Here we are using pre-trained embeddings which are built using Wikipedia data. Explore whether there are any worrisome biases present in these embeddings or not by trying out some examples. You can use the following two methods or other methods of your choice to explore what kind of stereotypes and biases are encoded in these embeddings. 
    - You can use the `analogy` function below, which gives word analogies.
    - You can also use [similarity](https://radimrehurek.com/gensim/models/keyedvectors.html?highlight=similarity#gensim.models.keyedvectors.KeyedVectors.similarity) or [distance](https://radimrehurek.com/gensim/models/keyedvectors.html?highlight=distance#gensim.models.keyedvectors.KeyedVectors.distances) methods. (An example is shown below.)   
3. Discuss your observations. Do you observe the gender stereotype we observed in class in these embeddings?

> Note that most of the recent embeddings are de-biased. But you might still observe some biases in them. Also, not all stereotypes present in pre-trained embeddings are necessarily bad. But you should be aware of them when you use them in your models. 

In [15]:
def analogy(word1, word2, word3, model=glove_wiki_vectors):
    """
    Returns analogy word using the given model.

    Parameters
    --------------
    word1 : (str)
        word1 in the analogy relation
    word2 : (str)
        word2 in the analogy relation
    word3 : (str)
        word3 in the analogy relation
    model :
        word embedding model

    Returns
    ---------------
        pd.DataFrame
    """
    print("%s : %s :: %s : ?" % (word1, word2, word3))
    sim_words = model.most_similar(positive=[word3, word2], negative=[word1])
    return pd.DataFrame(sim_words, columns=["Analogy word", "Score"])
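
The idea behind the `analogy` function can be sketched with plain vector arithmetic: add the vectors of `word2` and `word3`, subtract the vector of `word1`, and look for the closest remaining word. The sketch below uses a made-up 3-dimensional vocabulary (`toy_vectors`) and a hypothetical helper `toy_analogy`; gensim's `most_similar` differs in details such as vector normalization, so treat this as an illustration rather than a reimplementation.

```python
import numpy as np

# Made-up vectors: dimensions loosely stand for (royalty, gender, fruit).
toy_vectors = {
    "man": np.array([0.1, -1.0, 0.0]),
    "woman": np.array([0.1, 1.0, 0.0]),
    "king": np.array([1.0, -1.0, 0.0]),
    "queen": np.array([1.0, 1.0, 0.0]),
    "apple": np.array([0.0, 0.0, 1.0]),
}


def toy_analogy(word1, word2, word3, vectors=toy_vectors):
    """Answer 'word1 : word2 :: word3 : ?' using cosine similarity."""
    target = vectors[word2] - vectors[word1] + vectors[word3]
    scores = {}
    for word, vec in vectors.items():
        if word in (word1, word2, word3):
            continue  # skip the query words themselves
        scores[word] = float(
            np.dot(vec, target) / (np.linalg.norm(vec) * np.linalg.norm(target))
        )
    best = max(scores, key=scores.get)
    return best, scores


best_word, all_scores = toy_analogy("man", "king", "woman")
print(best_word, all_scores)
```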

An example of using similarity between words to explore biases and stereotypes.  

In [16]:
glove_wiki_vectors.similarity("white", "rich")

0.447236

In [17]:
glove_wiki_vectors.similarity("black", "rich")

0.51745194

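1. If such a model is used in a hiring algorithm (for example, to screen or rank résumés), the gender stereotypes encoded in the embeddings could lead it to favour candidates of one gender for certain roles, which causes real harm to the people affected.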

In [33]:
analogy("men", "independent", "women")

men : independent :: women : ?


Unnamed: 0,Analogy word,Score
0,nonprofit,0.623269
1,liberal,0.613263
2,society,0.602621
3,established,0.601128
4,establishment,0.5973
5,conservative,0.595048
6,community,0.594654
7,advocacy,0.591874
8,public,0.585784
9,educational,0.582153


In [43]:
glove_wiki_vectors.similarity("men", "influential"), glove_wiki_vectors.similarity("women", "influential")

(0.2875564, 0.3301347)

In [38]:
glove_wiki_vectors.similarity("man", "intelligent"), glove_wiki_vectors.similarity("woman", "intelligent")

(0.4394357, 0.37589207)

In [42]:
glove_wiki_vectors.similarity("man", "developer"), glove_wiki_vectors.similarity("woman", "developer")

(0.24923676, 0.15717775)

Compared with what we observed in class, these examples do not reveal very worrisome biases, although *man* is closer than *woman* to both *intelligent* (0.439 vs. 0.376) and *developer* (0.249 vs. 0.157). The relatively mild results could be partly because many recent embeddings have been de-biased.

Some of the apparent bias may also have another explanation: words that are used less often tend to have lower similarity scores, which can make certain comparisons look more biased than they are.

<br><br>

### 1.3 Representation of all words in English
rubric={reasoning:1}

**Your tasks:**
1. The vocabulary size of Wikipedia embeddings is quite large. Do you think it contains **all** words in the English language? What would happen if you try to get a word vector that's unlikely to be present in the vocabulary (e.g., the word "cpsc330")?

I don't think a vocabulary of 400,000 words can include every word in the English language; rare words, new words, names, and misspellings will be missing. Trying to retrieve the vector for a word that is not in the vocabulary (e.g., "cpsc330") raises a `KeyError`.
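
One common way to handle out-of-vocabulary words is to check membership before the lookup. The sketch below uses a plain dictionary (`toy_vocab`, made-up values) as a stand-in for the embedding model, and `get_vector_or_none` is a hypothetical helper. gensim's `KeyedVectors` also supports the `in` operator, so the same pattern should carry over to `glove_wiki_vectors`, but that is an assumption worth verifying in your environment.

```python
# Toy stand-in for an embedding lookup (made-up values, not the GloVe model).
toy_vocab = {"dog": [0.1, 0.3], "cat": [0.2, 0.25]}


def get_vector_or_none(word, vectors):
    """Return the vector for `word`, or None if it is out of vocabulary."""
    if word in vectors:
        return vectors[word]
    return None


print(get_vector_or_none("dog", toy_vocab))
print(get_vector_or_none("cpsc330", toy_vocab))
```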

<br><br>

### 1.4 Classification with pre-trained embeddings
rubric={points:8}

In lecture 16, we saw that you can conveniently get word vectors using `spaCy`'s `en_core_web_md` model. In this exercise, you'll use word embeddings in a multi-class text classification task. We will use the [HappyDB](https://www.kaggle.com/ritresearch/happydb) corpus, which contains about 100,000 happy moments classified into 7 categories: *affection, exercise, bonding, nature, leisure, achievement, enjoy_the_moment*. The data was crowd-sourced via [Amazon Mechanical Turk](https://www.mturk.com/). The ground truth label is not available for all examples, and in this homework, we'll only use the examples where ground truth is available (~15,000 examples).

- Download the data from [here](https://www.kaggle.com/ritresearch/happydb).
- Unzip the file and copy it into the homework directory.

The code below reads the data CSV (assuming that it's present in the current directory as *cleaned_hm.csv*),  cleans it up a bit, and splits it into train and test splits. 

**Your tasks:**

1. Train logistic regression with bag-of-words features and show classification report on the test set. 
2. Train logistic regression with average embedding representation extracted using spaCy and show classification report on the test set. (You can find an example of extracting average embedding features using spaCy in [lecture 16](https://github.com/UBC-CS/cpsc330-2022s/blob/master/lectures/16_natural-language-processing.ipynb#sentiment-classification-using-average-embeddings#) under *sentiment classification using average embeddings*.)
3. Discuss your results. Which model performs better? Which model would be more interpretable?
4. Are you observing any benefits of transfer learning here? Briefly discuss. 

In [58]:
df = pd.read_csv("cleaned_hm.csv", index_col=0)
sample_df = df.dropna()
sample_df.head()

hmid,wid,reflection_period,original_hm,cleaned_hm,modified,num_sentence,ground_truth_category,predicted_category
27676,206,24h,We had a serious talk with some friends of ours who have been flaky lately. They understood and we had a good evening hanging out.,We had a serious talk with some friends of ours who have been flaky lately. They understood and we had a good evening hanging out.,True,2,bonding,bonding
27678,45,24h,I meditated last night.,I meditated last night.,True,1,leisure,leisure
27697,498,24h,My grandmother start to walk from the bed after a long time.,My grandmother start to walk from the bed after a long time.,True,1,affection,affection
27705,5732,24h,I picked my daughter up from the airport and we have a fun and good conversation on the way home.,I picked my daughter up from the airport and we have a fun and good conversation on the way home.,True,1,bonding,affection
27715,2272,24h,when i received flowers from my best friend,when i received flowers from my best friend,True,1,bonding,bonding


In [62]:
sample_df = sample_df.rename(
    columns={"cleaned_hm": "moment", "ground_truth_category": "target"}
)

In [63]:
train_df, test_df = train_test_split(sample_df, test_size=0.3, random_state=123)
X_train, y_train = train_df["moment"], train_df["target"]
X_test, y_test = test_df["moment"], test_df["target"]

In [66]:
import sys
!{sys.executable} -m pip install spacy
!{sys.executable} -m spacy download en_core_web_md

Collecting spacy
  Downloading spacy-3.3.1-cp310-cp310-macosx_10_9_x86_64.whl (6.6 MB)
Collecting catalogue<2.1.0,>=2.0.6
  Downloading catalogue-2.0.7-py3-none-any.whl (17 kB)
Collecting cymem<2.1.0,>=2.0.2
  Downloading cymem-2.0.6-cp310-cp310-macosx_10_9_x86_64.whl (32 kB)
Collecting thinc<8.1.0,>=8.0.14
  Downloading thinc-8.0.17-cp310-cp310-macosx_10_9_x86_64.whl (648 kB)
Collecting pydantic!=1.8,!=1.8.1,<1.9.0,>=1.7.4
  Downloading pydantic-1.8.2-py3-none-any.whl (126 kB)
Collecting preshed<3.1.0,>=3.0.2
  Downloading preshed-3.0.6-cp310-cp310-macosx_10_9_x86_64.whl (107 kB)

Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-3.3.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')


In [67]:
import spacy

nlp = spacy.load("en_core_web_md")

1. Logistic regression with bag-of-words features

In [68]:
pipe = make_pipeline(
    CountVectorizer(stop_words="english"), LogisticRegression(max_iter=1000)
)
pipe.named_steps["countvectorizer"].fit(X_train)
X_train_transformed = pipe.named_steps["countvectorizer"].transform(X_train)
print("Data matrix shape:", X_train_transformed.shape)
pipe.fit(X_train, y_train);

Data matrix shape: (9887, 8060)


In [69]:
print("Train accuracy", pipe.score(X_train, y_train))
print("Test accuracy", pipe.score(X_test, y_test))

Train accuracy 0.9562051178314959
Test accuracy 0.8173666823973572


In [72]:
print(classification_report(y_test, pipe.predict(X_test)))

                  precision    recall  f1-score   support

     achievement       0.79      0.87      0.83      1302
       affection       0.90      0.91      0.91      1423
         bonding       0.91      0.85      0.88       492
enjoy_the_moment       0.60      0.54      0.57       469
        exercise       0.91      0.57      0.70        74
         leisure       0.73      0.70      0.71       407
          nature       0.73      0.46      0.57        71

        accuracy                           0.82      4238
       macro avg       0.80      0.70      0.74      4238
    weighted avg       0.82      0.82      0.81      4238



2. Logistic regression with average embedding representation extracted using spaCy

In [73]:
X_train_emb = pd.DataFrame([text.vector for text in nlp.pipe(X_train)])
X_test_emb = pd.DataFrame([text.vector for text in nlp.pipe(X_test)])

In [74]:
X_train_emb.shape

(9887, 300)
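
Each row above is a 300-dimensional document vector. By default, spaCy's `Doc.vector` is the average of the token vectors in the document. The sketch below illustrates that averaging step with a made-up 2-dimensional vocabulary (`toy_word_vectors`) and a hypothetical helper `average_embedding`; the whitespace tokenization is a simplification and not what spaCy does.

```python
import numpy as np

# Made-up 2-dimensional word vectors for illustration.
toy_word_vectors = {
    "happy": np.array([1.0, 0.0]),
    "dog": np.array([0.0, 1.0]),
    "walk": np.array([1.0, 1.0]),
}


def average_embedding(text, vectors, dim=2):
    """Average the vectors of all known tokens; zeros if none are known."""
    tokens = text.lower().split()
    known = [vectors[tok] for tok in tokens if tok in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)


doc_vector = average_embedding("Happy dog walk", toy_word_vectors)
print(doc_vector)
```

Averaging discards word order and spreads the signal of any single word over all dimensions, which is one reason the resulting features are harder to interpret than word counts.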

In [78]:
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train_emb, y_train)
print("Train accuracy: ", lr.score(X_train_emb, y_train))
print("Test accuracy: ", lr.score(X_test_emb, y_test))

Train accuracy:  0.7279255588146051
Test accuracy:  0.7095327984898537


In [79]:
print(classification_report(y_test, lr.predict(X_test_emb)))

                  precision    recall  f1-score   support

     achievement       0.73      0.83      0.77      1302
       affection       0.72      0.86      0.79      1423
         bonding       0.65      0.50      0.57       492
enjoy_the_moment       0.56      0.38      0.45       469
        exercise       0.78      0.34      0.47        74
         leisure       0.76      0.56      0.64       407
          nature       0.71      0.42      0.53        71

        accuracy                           0.71      4238
       macro avg       0.70      0.55      0.60      4238
    weighted avg       0.70      0.71      0.70      4238



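3. The logistic regression model with bag-of-words features performs better on the test set (accuracy 0.82 vs. 0.71, macro-average F1 0.74 vs. 0.60). It is also more interpretable: each coefficient is associated with a specific word, whereas the 300 dimensions of the average embedding representation have no clear meaning.

4. We do not observe a clear benefit of transfer learning here: the model that uses pre-trained spaCy embeddings scores lower than the bag-of-words model trained only on this dataset. The embedding model does overfit less, however (train accuracy 0.73 vs. test accuracy 0.71, compared with 0.96 vs. 0.82 for bag-of-words).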

## Exercise 2: Exploring time series data <a name="2"></a>
<hr>

In this exercise we'll be looking at a [dataset of avocado prices](https://www.kaggle.com/neuromusic/avocado-prices). You should start by downloading the dataset. 

In [84]:
df = pd.read_csv("avocado.csv", parse_dates=["Date"], index_col=0)
df.head()

Unnamed: 0,Date,AveragePrice,Total Volume,4046,4225,4770,Total Bags,Small Bags,Large Bags,XLarge Bags,type,year,region
0,2015-12-27,1.33,64236.62,1036.74,54454.85,48.16,8696.87,8603.62,93.25,0.0,conventional,2015,Albany
1,2015-12-20,1.35,54876.98,674.28,44638.81,58.33,9505.56,9408.07,97.49,0.0,conventional,2015,Albany
2,2015-12-13,0.93,118220.22,794.7,109149.67,130.5,8145.35,8042.21,103.14,0.0,conventional,2015,Albany
3,2015-12-06,1.08,78992.15,1132.0,71976.41,72.58,5811.16,5677.4,133.76,0.0,conventional,2015,Albany
4,2015-11-29,1.28,51039.6,941.48,43838.39,75.78,6183.95,5986.26,197.69,0.0,conventional,2015,Albany


In [85]:
df.shape

(18249, 13)

In [86]:
df["Date"].min()

Timestamp('2015-01-04 00:00:00')

In [87]:
df["Date"].max()

Timestamp('2018-03-25 00:00:00')

It looks like the data ranges from the start of 2015 to March 2018 (~3 years ago), for a total of 3.25 years or so. Let's split the data so that we have 6 months of test data.

In [88]:
split_date = "20170925"
train_df = df[df["Date"] <= split_date]
test_df = df[df["Date"] > split_date]

In [89]:
assert len(train_df) + len(test_df) == len(df)

### 2.1
rubric={points:4}

In the Rain in Australia dataset from lecture, we had different measurements for each Location. What about this dataset: for which categorical feature(s), if any, do we have separate measurements? Justify your answer by referencing the dataset.

In [90]:
df.sort_values(by="Date").head()

Unnamed: 0,Date,AveragePrice,Total Volume,4046,4225,4770,Total Bags,Small Bags,Large Bags,XLarge Bags,type,year,region
51,2015-01-04,1.75,27365.89,9307.34,3844.81,615.28,13598.46,13061.1,537.36,0.0,organic,2015,Southeast
51,2015-01-04,1.49,17723.17,1189.35,15628.27,0.0,905.55,905.55,0.0,0.0,organic,2015,Chicago
51,2015-01-04,1.68,2896.72,161.68,206.96,0.0,2528.08,2528.08,0.0,0.0,organic,2015,HarrisburgScranton
51,2015-01-04,1.52,54956.8,3013.04,35456.88,1561.7,14925.18,11264.8,3660.38,0.0,conventional,2015,Pittsburgh
51,2015-01-04,1.64,1505.12,1.27,1129.5,0.0,374.35,186.67,187.68,0.0,organic,2015,Boise


Above, we can see measurements made on the same day for different regions.

In [92]:
df.sort_values(by=["region", "Date"]).head()

Unnamed: 0,Date,AveragePrice,Total Volume,4046,4225,4770,Total Bags,Small Bags,Large Bags,XLarge Bags,type,year,region
51,2015-01-04,1.22,40873.28,2819.5,28287.42,49.9,9716.46,9186.93,529.53,0.0,conventional,2015,Albany
51,2015-01-04,1.79,1373.95,57.42,153.88,0.0,1162.65,1162.65,0.0,0.0,organic,2015,Albany
50,2015-01-11,1.24,41195.08,1002.85,31640.34,127.12,8424.77,8036.04,388.73,0.0,conventional,2015,Albany
50,2015-01-11,1.77,1182.56,39.0,305.12,0.0,838.44,838.44,0.0,0.0,organic,2015,Albany
49,2015-01-18,1.17,44511.28,914.14,31540.32,135.77,11921.05,11651.09,269.96,0.0,conventional,2015,Albany


Within the same region, we still see two measurements made on 2015-01-04, because there are two different types of avocados (conventional and organic).

In [94]:
df.sort_values(by=["region", "type", "Date"]).head()

Unnamed: 0,Date,AveragePrice,Total Volume,4046,4225,4770,Total Bags,Small Bags,Large Bags,XLarge Bags,type,year,region
51,2015-01-04,1.22,40873.28,2819.5,28287.42,49.9,9716.46,9186.93,529.53,0.0,conventional,2015,Albany
50,2015-01-11,1.24,41195.08,1002.85,31640.34,127.12,8424.77,8036.04,388.73,0.0,conventional,2015,Albany
49,2015-01-18,1.17,44511.28,914.14,31540.32,135.77,11921.05,11651.09,269.96,0.0,conventional,2015,Albany
48,2015-01-25,1.06,45147.5,941.38,33196.16,164.14,10845.82,10103.35,742.47,0.0,conventional,2015,Albany
47,2015-02-01,0.99,70873.6,1353.9,60017.2,179.32,9323.18,9170.82,152.36,0.0,conventional,2015,Albany


Now there is only one measurement for each date. Hence we have a separate time series for each combination of `region` and `type`.
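
Instead of inspecting sorted rows by eye, the same conclusion can be checked programmatically with `DataFrame.duplicated`. The sketch below uses a made-up four-row frame (`toy_dup`); on the real data the same calls would be applied to `df`.

```python
import pandas as pd

# Made-up rows: one region, two types, two dates.
toy_dup = pd.DataFrame(
    {
        "Date": pd.to_datetime(
            ["2015-01-04", "2015-01-04", "2015-01-11", "2015-01-11"]
        ),
        "region": ["Albany"] * 4,
        "type": ["conventional", "organic", "conventional", "organic"],
    }
)

# Does any (region, Date) pair repeat? What about (region, type, Date)?
dup_by_region = bool(toy_dup.duplicated(subset=["region", "Date"]).any())
dup_by_region_type = bool(
    toy_dup.duplicated(subset=["region", "type", "Date"]).any()
)

print(dup_by_region, dup_by_region_type)
```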

<br><br>

### 2.2
rubric={points:4}

In the Rain in Australia dataset, the measurements were generally equally spaced but with some exceptions. How about with this dataset? Justify your answer by referencing the dataset.

In [95]:
for name, group in df.groupby(["region", "type"]):
    print("%s %s" % (name, group["Date"].sort_values().diff().min()))

('Albany', 'conventional') 7 days 00:00:00
('Albany', 'organic') 7 days 00:00:00
('Atlanta', 'conventional') 7 days 00:00:00
('Atlanta', 'organic') 7 days 00:00:00
('BaltimoreWashington', 'conventional') 7 days 00:00:00
('BaltimoreWashington', 'organic') 7 days 00:00:00
('Boise', 'conventional') 7 days 00:00:00
('Boise', 'organic') 7 days 00:00:00
('Boston', 'conventional') 7 days 00:00:00
('Boston', 'organic') 7 days 00:00:00
('BuffaloRochester', 'conventional') 7 days 00:00:00
('BuffaloRochester', 'organic') 7 days 00:00:00
('California', 'conventional') 7 days 00:00:00
('California', 'organic') 7 days 00:00:00
('Charlotte', 'conventional') 7 days 00:00:00
('Charlotte', 'organic') 7 days 00:00:00
('Chicago', 'conventional') 7 days 00:00:00
('Chicago', 'organic') 7 days 00:00:00
('CincinnatiDayton', 'conventional') 7 days 00:00:00
('CincinnatiDayton', 'organic') 7 days 00:00:00
('Columbus', 'conventional') 7 days 00:00:00
('Columbus', 'organic') 7 days 00:00:00
('DallasFtWorth', 'conv

In [97]:
for name, group in df.groupby(["region", "type"]):
    print("%s %s" % (name, group["Date"].sort_values().diff().max()))

('Albany', 'conventional') 7 days 00:00:00
('Albany', 'organic') 7 days 00:00:00
('Atlanta', 'conventional') 7 days 00:00:00
('Atlanta', 'organic') 7 days 00:00:00
('BaltimoreWashington', 'conventional') 7 days 00:00:00
('BaltimoreWashington', 'organic') 7 days 00:00:00
('Boise', 'conventional') 7 days 00:00:00
('Boise', 'organic') 7 days 00:00:00
('Boston', 'conventional') 7 days 00:00:00
('Boston', 'organic') 7 days 00:00:00
('BuffaloRochester', 'conventional') 7 days 00:00:00
('BuffaloRochester', 'organic') 7 days 00:00:00
('California', 'conventional') 7 days 00:00:00
('California', 'organic') 7 days 00:00:00
('Charlotte', 'conventional') 7 days 00:00:00
('Charlotte', 'organic') 7 days 00:00:00
('Chicago', 'conventional') 7 days 00:00:00
('Chicago', 'organic') 7 days 00:00:00
('CincinnatiDayton', 'conventional') 7 days 00:00:00
('CincinnatiDayton', 'organic') 7 days 00:00:00
('Columbus', 'conventional') 7 days 00:00:00
('Columbus', 'organic') 7 days 00:00:00
('DallasFtWorth', 'conv

The measurements here are equally spaced as well, with the exception of organic avocados in WestTexNewMexico.

In [98]:
# `group` still holds the last group from the loop above: ('WestTexNewMexico', 'organic').
group["Date"].sort_values().reset_index(drop=True).diff().sort_values()

1     7 days 
106   7 days 
107   7 days 
108   7 days 
109   7 days 
       ...   
52    7 days 
165   7 days 
48    14 days
127   21 days
0     NaT    
Name: Date, Length: 166, dtype: timedelta64[ns]

Gaps of 14 days and 21 days can be observed at rows 48 and 127; these are the exceptions to the weekly spacing.
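
Rather than printing the minimum and maximum gap for every group, the irregular rows can be selected directly. A minimal sketch on made-up data (`toy_gaps`), assuming a weekly cadence is the expected spacing:

```python
import pandas as pd

# Made-up data: region "A" skips one week, region "B" is regular.
toy_gaps = pd.DataFrame(
    {
        "Date": pd.to_datetime(
            [
                "2015-01-04", "2015-01-11", "2015-01-25",
                "2015-01-04", "2015-01-11", "2015-01-18",
            ]
        ),
        "region": ["A", "A", "A", "B", "B", "B"],
    }
)

toy_gaps = toy_gaps.sort_values(["region", "Date"])
# Time since the previous observation within each region (NaT for the first).
toy_gaps["gap"] = toy_gaps.groupby("region")["Date"].diff()
irregular = toy_gaps[toy_gaps["gap"] > pd.Timedelta(days=7)]

print(irregular)
```

On the avocado data the grouping columns would be `["region", "type"]`.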

<br><br>

### 2.3
rubric={points:4}

In the Rain in Australia dataset, each location was a different place in Australia. For this dataset, look at the names of the regions. Do you think the regions are all distinct, or are there overlapping regions? Justify your answer by referencing the data.

In [99]:
df["region"].unique()

array(['Albany', 'Atlanta', 'BaltimoreWashington', 'Boise', 'Boston',
       'BuffaloRochester', 'California', 'Charlotte', 'Chicago',
       'CincinnatiDayton', 'Columbus', 'DallasFtWorth', 'Denver',
       'Detroit', 'GrandRapids', 'GreatLakes', 'HarrisburgScranton',
       'HartfordSpringfield', 'Houston', 'Indianapolis', 'Jacksonville',
       'LasVegas', 'LosAngeles', 'Louisville', 'MiamiFtLauderdale',
       'Midsouth', 'Nashville', 'NewOrleansMobile', 'NewYork',
       'Northeast', 'NorthernNewEngland', 'Orlando', 'Philadelphia',
       'PhoenixTucson', 'Pittsburgh', 'Plains', 'Portland',
       'RaleighGreensboro', 'RichmondNorfolk', 'Roanoke', 'Sacramento',
       'SanDiego', 'SanFrancisco', 'Seattle', 'SouthCarolina',
       'SouthCentral', 'Southeast', 'Spokane', 'StLouis', 'Syracuse',
       'Tampa', 'TotalUS', 'West', 'WestTexNewMexico'], dtype=object)

The regions do not all appear to be distinct. Some larger regions (e.g., `TotalUS`, `West`, `Northeast`) seem to contain smaller regions (e.g., `California`, `LosAngeles`, `NewYork`) that also appear on their own in this dataset. We can see this by comparing the `TotalUS` volume with the sum of the volumes of all other regions for a specific type and date: the sum is larger than `TotalUS`, so some volume must be counted more than once.

In [100]:
df.query("region == 'TotalUS' and type == 'conventional' and Date == '20150104'")["Total Volume"].values[0]

31324277.73

In [101]:
df.query("region != 'TotalUS' and type == 'conventional' and Date == '20150104'")["Total Volume"].sum()

51730521.73

<br><br>

We will use the entire dataset despite any location-based weirdness uncovered in the previous part.

We would like to forecast the avocado price, which is the `AveragePrice` column. The function below is adapted from Lecture 18, with some improvements.

In [102]:
def create_lag_feature(
    df, orig_feature, lag, groupby, new_feature_name=None, clip=False
):
    """
    Creates a new feature that's a lagged version of an existing one.

    NOTE: assumes df is already sorted by the time columns and has unique indices.

    Parameters
    ----------
    df : pandas.core.frame.DataFrame
        The dataset.
    orig_feature : str
        The column name of the feature we're copying
    lag : int
        The lag; negative lag means values from the past, positive lag means values from the future
    groupby : list
        Column(s) to group by in case df contains multiple time series
    new_feature_name : str
        Override the default name of the newly created column
    clip : bool
        If True, remove rows with NaN values for the new feature

    Returns
    -------
    pandas.core.frame.DataFrame
        A new dataframe with the additional column added.

    TODO: could/should simplify this function by using `df.shift()`
    """

    if new_feature_name is None:
        if lag < 0:
            new_feature_name = "%s_lag%d" % (orig_feature, -lag)
        else:
            new_feature_name = "%s_ahead%d" % (orig_feature, lag)

    new_df = df.assign(**{new_feature_name: np.nan})
    for name, group in new_df.groupby(groupby):
        if lag < 0:  # take values from the past
            new_df.loc[group.index[-lag:], new_feature_name] = group.iloc[:lag][
                orig_feature
            ].values
        else:  # take values from the future
            new_df.loc[group.index[:-lag], new_feature_name] = group.iloc[lag:][
                orig_feature
            ].values

    if clip:
        new_df = new_df.dropna(subset=[new_feature_name])

    return new_df
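
As the TODO in the docstring suggests, a similar result can be obtained with `shift`. The sketch below is not a drop-in replacement for `create_lag_feature`; it only illustrates the idea on made-up data (`toy_lag`), assuming the rows are already sorted by time within each group.

```python
import pandas as pd

# Made-up data: two regions, three weeks each, already sorted by time.
toy_lag = pd.DataFrame(
    {
        "region": ["A", "A", "A", "B", "B", "B"],
        "AveragePrice": [1.0, 1.1, 1.2, 2.0, 2.1, 2.2],
    }
)

# shift(-1) within each group pulls the next row's value into the current row,
# which corresponds to lag=+1 (values from the future) in create_lag_feature.
toy_lag["AveragePriceNextWeek"] = toy_lag.groupby("region")["AveragePrice"].shift(-1)

# Equivalent of clip=True: drop rows where the new column is missing.
toy_lag_clipped = toy_lag.dropna(subset=["AveragePriceNextWeek"])

print(toy_lag_clipped)
```

Grouping before shifting matters: without it, the last row of region "A" would receive the first price of region "B".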

We first sort our dataframe properly:

In [103]:
df_sort = df.sort_values(by=["region", "type", "Date"]).reset_index(drop=True)
df_sort

Unnamed: 0,Date,AveragePrice,Total Volume,4046,4225,4770,Total Bags,Small Bags,Large Bags,XLarge Bags,type,year,region
0,2015-01-04,1.22,40873.28,2819.50,28287.42,49.90,9716.46,9186.93,529.53,0.0,conventional,2015,Albany
1,2015-01-11,1.24,41195.08,1002.85,31640.34,127.12,8424.77,8036.04,388.73,0.0,conventional,2015,Albany
2,2015-01-18,1.17,44511.28,914.14,31540.32,135.77,11921.05,11651.09,269.96,0.0,conventional,2015,Albany
3,2015-01-25,1.06,45147.50,941.38,33196.16,164.14,10845.82,10103.35,742.47,0.0,conventional,2015,Albany
4,2015-02-01,0.99,70873.60,1353.90,60017.20,179.32,9323.18,9170.82,152.36,0.0,conventional,2015,Albany
...,...,...,...,...,...,...,...,...,...,...,...,...,...
18244,2018-02-25,1.57,18421.24,1974.26,2482.65,0.00,13964.33,13698.27,266.06,0.0,organic,2018,WestTexNewMexico
18245,2018-03-04,1.54,17393.30,1832.24,1905.57,0.00,13655.49,13401.93,253.56,0.0,organic,2018,WestTexNewMexico
18246,2018-03-11,1.56,22128.42,2162.67,3194.25,8.93,16762.57,16510.32,252.25,0.0,organic,2018,WestTexNewMexico
18247,2018-03-18,1.56,15896.38,2055.35,1499.55,0.00,12341.48,12114.81,226.67,0.0,organic,2018,WestTexNewMexico


We then call `create_lag_feature`. This creates a new column in the dataset `AveragePriceNextWeek`, which is the following week's `AveragePrice`. We have set `clip=True` which means it will remove rows where the target would be missing.

In [104]:
df_hastarget = create_lag_feature(
    df_sort, "AveragePrice", +1, ["region", "type"], "AveragePriceNextWeek", clip=True
)
df_hastarget

Unnamed: 0,Date,AveragePrice,Total Volume,4046,4225,4770,Total Bags,Small Bags,Large Bags,XLarge Bags,type,year,region,AveragePriceNextWeek
0,2015-01-04,1.22,40873.28,2819.50,28287.42,49.90,9716.46,9186.93,529.53,0.0,conventional,2015,Albany,1.24
1,2015-01-11,1.24,41195.08,1002.85,31640.34,127.12,8424.77,8036.04,388.73,0.0,conventional,2015,Albany,1.17
2,2015-01-18,1.17,44511.28,914.14,31540.32,135.77,11921.05,11651.09,269.96,0.0,conventional,2015,Albany,1.06
3,2015-01-25,1.06,45147.50,941.38,33196.16,164.14,10845.82,10103.35,742.47,0.0,conventional,2015,Albany,0.99
4,2015-02-01,0.99,70873.60,1353.90,60017.20,179.32,9323.18,9170.82,152.36,0.0,conventional,2015,Albany,0.99
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18243,2018-02-18,1.56,17597.12,1892.05,1928.36,0.00,13776.71,13553.53,223.18,0.0,organic,2018,WestTexNewMexico,1.57
18244,2018-02-25,1.57,18421.24,1974.26,2482.65,0.00,13964.33,13698.27,266.06,0.0,organic,2018,WestTexNewMexico,1.54
18245,2018-03-04,1.54,17393.30,1832.24,1905.57,0.00,13655.49,13401.93,253.56,0.0,organic,2018,WestTexNewMexico,1.56
18246,2018-03-11,1.56,22128.42,2162.67,3194.25,8.93,16762.57,16510.32,252.25,0.0,organic,2018,WestTexNewMexico,1.56


I will now split the data:

In [105]:
train_df = df_hastarget[df_hastarget["Date"] <= split_date]
test_df = df_hastarget[df_hastarget["Date"] > split_date]

<br><br>

### 2.4 Baseline
rubric={points:4}

Let's try a baseline. Previously we used `DummyClassifier` or `DummyRegressor` as a baseline. This time, we'll do something else as a baseline: we'll assume the price stays the same from this week to next week. So, we'll set our prediction of "AveragePriceNextWeek" exactly equal to "AveragePrice", assuming no change. That is kind of like saying, "If it's raining today then I'm guessing it will be raining tomorrow". This simplistic approach will not get a great score but it's a good starting point for reference. If our model does worse than this, it must not be very good.

Using this baseline approach, what $R^2$ do you get?

In [106]:
r2_score(train_df["AveragePriceNextWeek"], train_df["AveragePrice"])

0.8285800937261841

In [107]:
r2_score(test_df["AveragePriceNextWeek"], test_df["AveragePrice"])

0.7631780188583048

<br><br>

### (Optional) 2.5 Modeling
rubric={points:2}

Now that the baseline is done, let's build some models to forecast the average avocado price a week later. Experiment with a few approaches for encoding the date. Justify the decisions you make. Which approach worked best? Report your test score and briefly discuss your results.
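
As a starting point for the date-encoding experiments, here is a minimal sketch of two common encodings: the month as a categorical feature, and the number of days since the first observation as a numeric feature. The tiny `dates` frame and the column names `Month` and `Days_since` are made up for illustration; they are not part of the assignment data.

```python
import pandas as pd

# Hypothetical dates standing in for the "Date" column of the avocado data.
dates = pd.DataFrame(
    {"Date": pd.to_datetime(["2015-01-04", "2015-07-12", "2018-03-11"])}
)

# Encoding 1: month name as a categorical feature (captures seasonality).
dates["Month"] = dates["Date"].dt.month_name()

# Encoding 2: days elapsed since the earliest date (captures long-term trend).
dates["Days_since"] = (dates["Date"] - dates["Date"].min()).dt.days

# One-hot encode the month so a linear model can use it.
month_ohe = pd.get_dummies(dates["Month"], prefix="Month")

print(dates)
print(month_ohe)
```

Note that a days-since feature pushes tree-based models to extrapolate beyond the range seen in training, which they cannot do, so it is usually more useful with linear models.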

> Because we only have 2 splits here, we need to be a bit wary of overfitting on the test set. Try not to test on it a ridiculous number of times. If you are interested in some proper ways of dealing with this, see for example sklearn's [TimeSeriesSplit](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html), which is like cross-validation for time series data.
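
To make the note about `TimeSeriesSplit` concrete, here is a minimal sketch on ten made-up observations that are assumed to be sorted by date already. It only shows how the folds are formed: every validation fold comes strictly after the rows it is trained on.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Ten hypothetical weekly observations, already in chronological order.
X = np.arange(10).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)

# Collect the index arrays of each fold as plain lists for inspection.
folds = [
    (train_idx.tolist(), valid_idx.tolist())
    for train_idx, valid_idx in tscv.split(X)
]

for train_idx, valid_idx in folds:
    print("train:", train_idx, "validate:", valid_idx)
```

For the avocado data, the rows would first need to be sorted by `Date`. Because many regions share the same date, a row-based split can cut through a single week, so this is a sketch rather than a drop-in solution.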

<br><br><br><br>

## Exercise 3: Short answer questions <a name="3"></a>

Each question is worth 2 points.

### 3.1
rubric={points:4}

The following questions pertain to Lecture 18 on time series data:

1. Sometimes a time series has missing time points or, worse, time points that are unequally spaced in general. Give an example of a real world situation where the time series data would have unequally spaced time points.
2. In class we discussed two approaches to using temporal information: encoding the date as one or more features, and creating lagged versions of features. Which of these (one/other/both/neither) two approaches would struggle with unequally spaced time points? Briefly justify your answer.

1. Time series data for natural disasters such as earthquakes and floods is a real-world example in which the time points are unequally spaced.

2. Creating lagged versions of features would struggle more with unequally spaced time points, because the time elapsed between two consecutive records is not constant, so a lag of one row does not correspond to a fixed duration. Encoding the date as one or more features does not depend on the spacing between records.
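
The issue with lagged features can be seen in a small sketch. The four rows below are made up (one week is deliberately missing), and the column names `lag1_price` and `days_since_prev` are hypothetical.

```python
import pandas as pd

# Hypothetical weekly prices with the 2015-01-18 observation missing.
df = pd.DataFrame(
    {
        "Date": pd.to_datetime(
            ["2015-01-04", "2015-01-11", "2015-01-25", "2015-02-01"]
        ),
        "AveragePrice": [1.22, 1.24, 1.06, 0.99],
    }
)

# Row-based lag: takes the previous row, however long ago it was recorded.
df["lag1_price"] = df["AveragePrice"].shift(1)

# Days elapsed between each row and the one before it.
df["days_since_prev"] = df["Date"].diff().dt.days

print(df)
```

For the third row, the "one-step" lag is actually a two-week-old price, so the same lag column mixes different time gaps.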

<br><br>

### 3.2
rubric={points:6}

The following questions pertain to Lecture 19 on survival analysis. We'll consider the use case of customer churn analysis.

1. What is the problem with simply labeling customers as "churned" or "not churned" and using standard supervised learning techniques, as we did in hw5?
2. Consider customer A who just joined last week vs. customer B who has been with the service for a year. Who do you expect will leave the service first: probably customer A, probably customer B, or we don't have enough information to answer?
3. If a customer's survival function is almost flat during a certain period, how do we interpret that?

1. It would have a negative impact on the model, because a "churned" label alone gives us no idea of how long it takes for a customer to churn.

2. The information provided here is too vague to make a prediction between customers A and B.

3. If the survival function is almost flat, it could mean that the customer is not changing their activation status with the service, i.e., they are likely to remain either active or inactive during that period.
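
To illustrate what a flat stretch of a survival function looks like, here is a minimal hand-rolled Kaplan-Meier sketch on made-up data. The estimator choice, the helper name `kaplan_meier`, and the tenure values are assumptions for illustration, not the method used in the lecture.

```python
def kaplan_meier(durations, churned):
    """Return [(t, S(t))] at each time where at least one churn was observed.

    durations: tenure of each customer (e.g. in months).
    churned: 1 if the customer churned at that time, 0 if still active (censored).
    """
    records = sorted(zip(durations, churned))
    at_risk = len(records)
    survival = 1.0
    curve = []
    i = 0
    while i < len(records):
        t = records[i][0]
        events = 0
        leaving = 0
        # Group all customers that share the same tenure t.
        while i < len(records) and records[i][0] == t:
            events += records[i][1]
            leaving += 1
            i += 1
        if events:
            survival *= 1 - events / at_risk
            curve.append((t, survival))
        # Churned and censored customers both leave the at-risk set.
        at_risk -= leaving
    return curve


# Hypothetical tenures in months; 0 marks customers who have not churned yet.
tenure = [1, 2, 3, 10, 11, 12]
churned = [1, 1, 1, 0, 1, 0]

curve = kaplan_meier(tenure, churned)
for t, s in curve:
    print(f"month {t}: S(t) = {s:.3f}")
```

In this toy data the estimate does not change between months 3 and 11, because no churn is observed in that window; the curve only drops at times where a churn event occurs.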

<br><br><br><br>

### (Optional) Exercise 4 <a name="4"></a>
rubric={points:1}

**Your tasks:**

What is your biggest takeaway from this course? 

> I'm looking forward to reading your answers. 

<br><br><br><br>

## Submission instructions 

**PLEASE READ:** When you are ready to submit your assignment do the following:

1. Run all cells in your notebook to make sure there are no errors by doing `Kernel -> Restart Kernel and Clear All Outputs` and then `Run -> Run All Cells`. 
2. Notebooks with cell execution numbers out of order or not starting from “1” will have marks deducted. Notebooks without the output displayed may not be graded at all (because we need to see the output in order to grade your work).
3. Upload the assignment using Gradescope's drag and drop tool. Check out this [Gradescope Student Guide](https://lthub.ubc.ca/guides/gradescope-student-guide/) if you need help with Gradescope submission. 

### Congratulations on finishing all homework assignments!

In [None]:
from IPython.display import Image

Image("eva-congrats.png")