# Homework 2: BERT on AWS

In this homework, we will apply the BERT algorithm on Amazon Web Services.  

The [DataRec repository](https://github.com/sisinflab/DataRec) provides access to recommendation-system datasets. Install it with:

```shell
pip install datarec-lib
```

In [1]:
%pip install datarec-lib


**General rules of thumb for homeworks:**
- Read the homework questions carefully.
- Explain your choices.
- Present your findings concisely.
- Use tables, plots, and summary statistics to aid your presentation of findings.
- If you have an idea but could not implement it in code, describe the idea thoroughly and explain how you would have implemented it.

### Tasks:

For all tasks below, create one or more functions for each step such that a sequence of functions may be run for a full analysis.  Specify the sequence of functions and their brief descriptions in the README.

1. Download the MovieLens 1M dataset and save a copy to an AWS S3 bucket.
    - First check whether your S3 bucket already contains the dataset. If it does, the script should skip the download.

In [2]:
%pip install boto3 requests


In [4]:
import boto3

source_bucket = "dinglin-winter26"
source_prefix = "lab4/ml-1m/"
dest_bucket = "kat-winter26"
dest_prefix = "datasets/movielens/"

s3 = boto3.client("s3")

# Check if destination already has data
resp = s3.list_objects_v2(Bucket=dest_bucket, Prefix=dest_prefix, MaxKeys=1)
if "Contents" in resp:
    print("Dataset already exists in your bucket. Skipping.")
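else:
    # Otherwise copy every object under the source prefix, one page of listings at a time.
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=source_bucket, Prefix=source_prefix):
        for obj in page.get("Contents", []):
            src_key = obj["Key"]
            # Swap the leading source prefix for the destination prefix.
            dst_key = src_key.replace(source_prefix, dest_prefix, 1)

            # Server-side copy: the object is not downloaded locally.
            copy_source = {"Bucket": source_bucket, "Key": src_key}
            s3.copy(copy_source, dest_bucket, dst_key)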
            print(f"Copied {src_key} → {dst_key}")


AccessDenied: An error occurred (AccessDenied) when calling the ListObjectsV2 operation: Access Denied
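
Since the tasks ask for one or more functions per step, the existence check from task 1 could be wrapped as below. This is only a sketch: `dataset_exists`, `ensure_dataset`, and the `download_and_upload` callback are hypothetical names, and the S3 client is passed in as an argument so the logic can be exercised without AWS credentials.

```python
def dataset_exists(s3_client, bucket, prefix):
    """Return True if at least one object exists under `prefix` in `bucket`."""
    resp = s3_client.list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=1)
    # list_objects_v2 omits the "Contents" key when nothing matches.
    return bool(resp.get("Contents"))


def ensure_dataset(s3_client, bucket, prefix, download_and_upload):
    """Call `download_and_upload()` only when the dataset is missing.

    `download_and_upload` is a hypothetical zero-argument callback that
    fetches MovieLens 1M and writes it under `prefix`.
    Returns "skipped" or "uploaded" so the caller can log what happened.
    """
    if dataset_exists(s3_client, bucket, prefix):
        return "skipped"
    download_and_upload()
    return "uploaded"
```

With real credentials, `s3_client` would be `boto3.client("s3")`. The role or user also needs permission to list the bucket, otherwise the `ListObjectsV2` call fails with `AccessDenied`.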

2. Create embeddings for the BERT algorithm. For a given set of items (movies) in the dataset, create the embeddings once and save a copy of the necessary intermediate results to the S3 bucket.
    - This is the *offline* step, where embeddings only need to be created once for the recommendation system.
    - Use a random subset (30%) of users in the available dataset.
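
One possible way to draw the 30% user subset reproducibly is sketched below. It assumes the standard MovieLens 1M `ratings.dat` layout (`UserID::MovieID::Rating::Timestamp`). The function names and the `seed` parameter are illustrative choices, not part of the assignment.

```python
import random


def parse_rating_line(line):
    """Parse one 'UserID::MovieID::Rating::Timestamp' line into a tuple of ints."""
    user_id, movie_id, rating, timestamp = line.strip().split("::")
    return int(user_id), int(movie_id), int(rating), int(timestamp)


def sample_users(user_ids, fraction=0.3, seed=0):
    """Return a reproducible random subset of the unique user IDs."""
    unique_ids = sorted(set(user_ids))  # sort so the seed fully determines the draw
    k = round(len(unique_ids) * fraction)
    return set(random.Random(seed).sample(unique_ids, k))


def filter_ratings(ratings, kept_users):
    """Keep only the rating rows that belong to the sampled users."""
    return [row for row in ratings if row[0] in kept_users]
```

Calling `sample_users(user_ids, fraction=1.0)` keeps every user, so the same functions can be reused when the analysis is repeated on the full dataset.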

3. Recommend five movies for each of the following users. Save the recommendations in a file on the S3 bucket containing `User_Type`, `Last_Interaction_Time`, other user summaries from the dataset, and a list of recommended movies:
    - *Cold user*: a user that the system has no data on.
    - *Top user*: a random user who has frequently rated movies (number of interactions among the top 5% of users).
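
The two user types could be identified along the following lines. This is a sketch over `(user_id, movie_id, rating, timestamp)` tuples with hypothetical function names. The popularity fallback for the cold user is an assumption made here for illustration; the assignment does not prescribe how to handle a user with no data.

```python
import random
from collections import Counter


def user_summaries(ratings):
    """Map each user_id to its interaction count and last interaction timestamp."""
    summaries = {}
    for user_id, _movie_id, _rating, ts in ratings:
        s = summaries.setdefault(
            user_id, {"n_interactions": 0, "last_interaction_time": ts}
        )
        s["n_interactions"] += 1
        s["last_interaction_time"] = max(s["last_interaction_time"], ts)
    return summaries


def pick_top_user(summaries, top_fraction=0.05, seed=0):
    """Pick a random user among the `top_fraction` most active users."""
    ranked = sorted(summaries, key=lambda u: (-summaries[u]["n_interactions"], u))
    n_top = max(1, int(len(ranked) * top_fraction))
    return random.Random(seed).choice(ranked[:n_top])


def most_popular(ratings, k=5):
    """Assumed cold-user fallback: the k most frequently rated movie IDs."""
    counts = Counter(movie_id for _u, movie_id, _r, _t in ratings)
    return [movie_id for movie_id, _n in counts.most_common(k)]
```

The dictionary returned by `user_summaries` already holds the values needed for the `Last_Interaction_Time` column of the output file.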

4. Repeat steps 2 and 3 but with the full set of data.  You should be able to reuse your work from earlier.

5. Choose and rate 10 movies and create a "user profile" for yourself.  Save your user profile on the S3 bucket.  Recommend 5 movies for yourself and save the results on the S3 bucket.
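
A personal profile could be stored as a small JSON document, as sketched below. The profile schema, the function names, and the 1 to 5 rating check (which follows the MovieLens 1M rating scale) are assumptions made for illustration. The S3 client is passed in as an argument, so any object with a `put_object(Bucket=..., Key=..., Body=...)` method works.

```python
import json


def build_user_profile(user_id, ratings_by_movie):
    """Build a profile dict from {movie_id: rating}; expects exactly 10 ratings."""
    if len(ratings_by_movie) != 10:
        raise ValueError("expected exactly 10 rated movies")
    if not all(1 <= r <= 5 for r in ratings_by_movie.values()):
        raise ValueError("ratings must be between 1 and 5")
    return {
        "user_id": user_id,
        "ratings": [
            {"movie_id": m, "rating": r}
            for m, r in sorted(ratings_by_movie.items())
        ],
    }


def save_json_to_s3(s3_client, bucket, key, payload):
    """Serialize `payload` as JSON, upload it, and return the byte count."""
    body = json.dumps(payload, indent=2).encode("utf-8")
    s3_client.put_object(Bucket=bucket, Key=key, Body=body)
    return len(body)
```

The same `save_json_to_s3` helper could also write the recommendation results, which keeps all S3 output in one place.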

# Submission guidelines
Your submission should be contained in a `homework_2` folder of your GitHub repository, and it should include
- a `readme.md` file explaining how to run the code and what outputs to expect when it is run,
- your source code, and
- a `.pdf` or `.html` file containing any necessary observations and details.
    - If you find your source code self-explanatory, you may opt to skip the `.pdf` or `.html` file in this homework.


# Generative AI disclosure

*Syllabus* policy: 

Required disclosure: each submission must include an AI Usage note stating: (1) tool(s) used, (2) the key prompt(s), and (3) what you changed and how you verified the results. If none, write: “AI Usage: None.”