

In your project, you will pick a dataset (time-series) and an associated problem that can be solved via sequence models. You must describe why you need sequence models to solve this problem. Include a link to the dataset source. Next, you should pick an RNN framework that you would use to solve this problem (This framework can be in TensorFlow, PyTorch or any other Python Package).

For this problem, I will use the Walmart sales forecasting dataset. It contains weekly sales per store and department, a date field, the local temperature and fuel price, CPI and unemployment figures, and anonymized markdown features describing promotions that Walmart was running. The dataset is available at:

https://www.kaggle.com/datasets/aslanahmedov/walmart-sales-forecast


In [3]:
# Mount Google Drive and copy the Kaggle API token into place
from google.colab import drive
drive.mount('/content/drive')

# -f and -p keep these commands from failing when the directory is missing or already exists
!rm -rf ~/.kaggle
!mkdir -p ~/.kaggle
!cp /content/drive/MyDrive/.kaggle/kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
!pip install -q kaggle

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [5]:
# Download the dataset and unzip it into ./dataset

!rm -rf dataset
!kaggle datasets download -d aslanahmedov/walmart-sales-forecast
!mkdir -p dataset
!unzip walmart-sales-forecast.zip -d dataset

walmart-sales-forecast.zip: Skipping, found more recently modified local copy (use --force to force download)
Archive:  walmart-sales-forecast.zip
  inflating: dataset/features.csv    
  inflating: dataset/stores.csv      
  inflating: dataset/test.csv        
  inflating: dataset/train.csv       


In [9]:
import pandas as pd

pd.read_csv('dataset/features.csv').head()

Unnamed: 0,Store,Date,Temperature,Fuel_Price,MarkDown1,MarkDown2,MarkDown3,MarkDown4,MarkDown5,CPI,Unemployment,IsHoliday
0,1,2010-02-05,42.31,2.572,,,,,,211.096358,8.106,False
1,1,2010-02-12,38.51,2.548,,,,,,211.24217,8.106,True
2,1,2010-02-19,39.93,2.514,,,,,,211.289143,8.106,False
3,1,2010-02-26,46.63,2.561,,,,,,211.319643,8.106,False
4,1,2010-03-05,46.5,2.625,,,,,,211.350143,8.106,False


In [55]:
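dataset = pd.read_csv('dataset/train.csv')

stores = pd.read_csv('dataset/stores.csv')
# Rename the columns so their meaning is explicit
stores.columns = ['Store', 'store_type', 'store_size']

features = pd.read_csv('dataset/features.csv')

# With no `on=` argument, pd.merge joins on every column the two frames share.
# how='outer' keeps unmatched rows from both sides, so any row of features.csv
# without a matching sales record ends up with NaN in Weekly_Sales.
dataset = pd.merge(dataset, features, how='outer')
dataset = pd.merge(dataset, stores, how='outer')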

In [53]:
pd.read_csv('dataset/train.csv')

Unnamed: 0,Store,Dept,Date,Weekly_Sales,IsHoliday
0,1,1,2010-02-05,24924.50,False
1,1,1,2010-02-12,46039.49,True
2,1,1,2010-02-19,41595.55,False
3,1,1,2010-02-26,19403.54,False
4,1,1,2010-03-05,21827.90,False
...,...,...,...,...,...
421565,45,98,2012-09-28,508.37,False
421566,45,98,2012-10-05,628.10,False
421567,45,98,2012-10-12,1061.02,False
421568,45,98,2012-10-19,760.01,False


In [42]:
test_df = pd.DataFrame([
    {'A': 1, 'B': 1,'Other':'hihihih'},
    {'A': 1, 'B': 1,'Other':'bybyb'},
    {'A': 1, 'B': 2,'Other':'hahaha'},
    {'A': 1, 'B': 3,'Other':'uhohhhh'},
])



to_join = pd.DataFrame([
    {'A': 1, 'B': 1,'New':'ttttt'},
    {'A': 1, 'B': 2,'New':'kkk'},
])



pd.merge(test_df,to_join,how='outer')
# test_df.join(to_join,on=['A','B'],how='outer')

Unnamed: 0,A,B,Other,New
0,1,1,hihihih,ttttt
1,1,1,bybyb,ttttt
2,1,2,hahaha,kkk
3,1,3,uhohhhh,


## Task 1 (60 points):
### Part 1 (30 points): 
Implement your RNN either using an existing framework or with your own RNN cell structure. In either case, describe the structure of your RNN and the activation functions you are using for each time step and in the output layer. Define a metric you will use to measure the performance of your model.

NOTE: Performance should be measured both for the validation set and the test set.
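
As a point of reference for Part 1, the forward pass of a basic (Elman) RNN cell and one candidate metric can be sketched in NumPy. This is a minimal sketch under stated assumptions, not the model used in this project: the weights are random and untrained, the layer sizes are arbitrary, the helper names `rnn_forward` and `rmse` are hypothetical, and RMSE is only one possible metric. The hidden state uses `tanh` at each time step and the output layer is linear, which is a common choice for a real-valued target such as weekly sales.

```python
import numpy as np

def rnn_forward(x_seq, W_xh, W_hh, b_h, W_hy, b_y):
    """Run a single-layer Elman RNN over one sequence and return one prediction.

    x_seq has shape (T, n_features). The hidden state is updated with tanh at
    every time step; the output layer is linear (identity activation).
    """
    h = np.zeros(W_hh.shape[0])  # initial hidden state h_0 = 0
    for x_t in x_seq:
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
    return W_hy @ h + b_y  # prediction from the last hidden state

def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Illustrative sizes and random (untrained) weights: assumptions, not tuned values
rng = np.random.default_rng(0)
n_features, n_hidden, T = 3, 4, 5
W_xh = rng.normal(scale=0.1, size=(n_hidden, n_features))
W_hh = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
b_h = np.zeros(n_hidden)
W_hy = rng.normal(scale=0.1, size=(1, n_hidden))
b_y = np.zeros(1)

x_seq = rng.normal(size=(T, n_features))  # one made-up input sequence
y_hat = rnn_forward(x_seq, W_xh, W_hh, b_h, W_hy, b_y)
print(y_hat.shape)
print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```

The same `rmse` function can be applied to the validation predictions and the test predictions separately, which is what the note above asks for.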

### Part 2 (35 points): 
Update your network from part 1 with first an LSTM and then a GRU based cell structure (You can treat both as 2 separate implementations). Re-do the training and performance evaluation. What are the major differences you notice? Why do you think those differences exist between the 3 implementations (basic RNN, LSTM and GRU)?

Note: In part 1 and 2, you must perform sufficient data-visualization, pre-processing and/or feature-engineering if needed. The overall performance visualization of the loss function should also be provided.
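
To make the comparison in Part 2 concrete, the standard textbook update rules for the three cells are listed here. This is a reference sketch: frameworks differ in details such as bias handling and gate ordering, and some texts swap the roles of $z_t$ and $1 - z_t$ in the GRU. Here $x_t$ is the input, $h_t$ the hidden state, $c_t$ the LSTM cell state, $\sigma$ the logistic sigmoid, and $\odot$ element-wise multiplication. The weight names $W$, $U$, $b$ are notation chosen for this sketch, not taken from the assignment.

Basic RNN:

$$
h_t = \tanh(W_h x_t + U_h h_{t-1} + b_h)
$$

LSTM:

$$
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

GRU:

$$
\begin{aligned}
z_t &= \sigma(W_z x_t + U_z h_{t-1} + b_z) \\
r_t &= \sigma(W_r x_t + U_r h_{t-1} + b_r) \\
\tilde{h}_t &= \tanh(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}
$$

With input size $d$ and hidden size $n$, and one bias vector per gate as written, the basic RNN layer has $n(d + n + 1)$ parameters, the GRU three times that, and the LSTM four times that. The additive updates of $c_t$ in the LSTM and of $h_t$ in the GRU are what is usually credited with easing the vanishing-gradient problem of the basic RNN.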

### Part 3 (10 points): 
Can you use a traditional feed-forward network to solve the same problem? Why or why not?

Hint: Can time series data be converted to usual features that can be used as input to a feed-forward network?
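
One common way to do what the hint suggests is a sliding window: each training row holds the previous `n_lags` observations, and the target is the value that follows them. The sketch below is illustrative only. The helper name `make_lag_features`, the window length, and the sales numbers are all made up for the example.

```python
def make_lag_features(series, n_lags):
    """Turn a 1-D series into (X, y) pairs usable by a feed-forward model.

    Each row of X holds the n_lags values that immediately precede the
    corresponding target in y.
    """
    if n_lags < 1 or n_lags >= len(series):
        raise ValueError("n_lags must be between 1 and len(series) - 1")
    X, y = [], []
    for t in range(n_lags, len(series)):
        X.append(list(series[t - n_lags:t]))  # the window of past values
        y.append(series[t])                   # the value to predict
    return X, y

weekly_sales = [10.0, 12.0, 11.0, 15.0, 14.0, 18.0]  # made-up numbers
X, y = make_lag_features(weekly_sales, n_lags=3)
for row, target in zip(X, y):
    print(row, "->", target)
```

After this transformation every row is a fixed-length feature vector, so the temporal order is only visible to the model through the position of each lag column, and the window length becomes a hyperparameter.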


## Task 2 (25 points): 
In this task, use any of the pre-trained word embeddings. The Word2vec embedding link provided with the lecture notes can be useful to get started. Write your own code/function that uses these embeddings and outputs cosine similarity and a dissimilarity score for any pair of words (read as user input). The dissimilarity score should be defined by you. You can either have your own idea of a dissimilarity score or refer to the literature (cite the paper you used). In either case, clearly describe how this score helps determine the dissimilarity between 2 words.

Projects in Machine Learning and AI (RPI Fall 2022)

Note: Dissimilarity measures have been an important metric for recommender systems trying to introduce ‘Novelty and Diversity’ in assortments (as opposed to only accuracy). You might find different metrics of dissimilarity in the recommender-systems literature.
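
The two scores that Task 2 asks for can be sketched with the standard library alone. Cosine similarity is the usual definition. The dissimilarity score shown here, the angle between the two vectors divided by π, is only one possible choice and is an assumption of this sketch, not a score prescribed by the assignment. It ranges from 0 (same direction) to 1 (opposite direction) and, unlike `1 - cosine`, satisfies the triangle inequality. The three-dimensional `toy_embeddings` are invented stand-ins for real pre-trained vectors, and the function names are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

def angular_dissimilarity(u, v):
    """Angle between u and v scaled to [0, 1] (an assumed choice of score)."""
    # Clamp to [-1, 1] so floating-point rounding cannot break acos
    cos = max(-1.0, min(1.0, cosine_similarity(u, v)))
    return math.acos(cos) / math.pi

def score_pair(word_a, word_b, embeddings):
    """Look up two words and return (cosine similarity, dissimilarity)."""
    u, v = embeddings[word_a], embeddings[word_b]
    return cosine_similarity(u, v), angular_dissimilarity(u, v)

# Invented 3-d vectors standing in for real pre-trained embeddings
toy_embeddings = {
    "king": [0.9, 0.1, 0.4],
    "queen": [0.8, 0.2, 0.5],
    "banana": [-0.3, 0.9, -0.1],
}

for pair in [("king", "queen"), ("king", "banana")]:
    sim, dissim = score_pair(pair[0], pair[1], toy_embeddings)
    print(pair, round(sim, 3), round(dissim, 3))
```

In the notebook, the two words could be read with `input()` and passed to `score_pair` together with a dictionary-like object holding the real embeddings.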